Automated betting angles in Python

Betting strategies based on your existing insights: no modelling required


This tutorial was written by Tom Bishop and was originally published on GitHub. It is shared here with his permission.

This tutorial follows on logically from the JSON to CSV tutorial and the backtesting ratings in Python tutorial we shared previously. If you're still new to working with the JSON data sets, we suggest you take a look at those tutorials before diving into this one.

As always please reach out with feedback, suggestions or queries, or feel free to submit a pull request if you catch some bugs or have other improvements!

Cheat sheet

  • This is presented as a Jupyter notebook as this format is interactive and lets you run snippets of code from within the notebook. To use this functionality you'll need to download a copy of the ipynb file locally and open it in a text editor (e.g. VS Code).
  • If you're looking for the complete code head to the bottom of the page or download the script from GitHub.
  • To run the code, save it to your machine and open a command prompt, or a terminal in your text editor of choice (we're using VS Code). Make sure you've navigated in the terminal to the folder you've saved the script in, then type py main.py (or whatever you've called your script file) and hit enter. To stop the code running, use Ctrl+C.
  • Make sure you amend your data path to point to your data file. We'll be taking as input a historical tar file downloaded from the Betfair historic data site. We're using the PRO version, though the code should work on ADVANCED too. This approach won't work with the BASIC data tier.
  • We're using the betfairlightweight package to do the heavy lifting.
  • We've also posted the completed code logic on the betfair-downunder GitHub repo.

0.1 Setup

Once again I'll be presenting the analysis in a Jupyter notebook and will be using Python as the programming language.

Some of the data processing code takes a while to execute - that code will be in cells that are commented out - and will require a bit of adjustment to point to places on your computer where you want to locally store the intermediate data files.

You'll also need betfairlightweight which you can install with something like pip install betfairlightweight.

import requests
import pandas as pd
from datetime import date, timedelta
import numpy as np
import os
import re
import tarfile
import zipfile
import bz2
import glob
import logging
import yaml
from unittest.mock import patch
from typing import List, Set, Dict, Tuple, Optional
from itertools import zip_longest
import betfairlightweight
from betfairlightweight import StreamListener
from betfairlightweight.resources.bettingresources import (
    PriceSize,
    MarketBook
)
from scipy.stats import t
import plotly.express as px

0.2 Context

Formulating betting angles (or "strategies" as some call them) is quite a common pastime. These angles can range all the way from very simple to quite sophisticated, and could include things like:

  • Laying NBA teams playing on the second night of a back-to-back
  • Laying an AFL team coming off a bye when matched against a team that played last week
  • Backing a greyhound in boxes 1 or 2 in short sprint style races
  • Backing a horse pre-race who typically runs at the front of the field and placing an order to lay the same horse if it shortens to some lower price in-play, locking in a profit

Beyond the complexity of the actual concept, what really separates these angles is evidence. You might have heard TV personalities and betting ads suggest that a certain strategy (resembling one of the above) is a real-world predictive trend, but they rarely are: these angles are rarely derived from the right historical data, and rarely reached with the necessary statistical rigour. Most are simply formulated off intuition, or from observing a trend across a small sample of data.

There are many users on betting exchanges who profit off these angles. In fact, when most people talk about automated or sophisticated exchange betting they are often talking about automating these kinds of betting angles, as opposed to betting ratings produced from sophisticated bottom-up fundamental modelling. That's because profitable fundamental modelling (where your model arrives at some estimation of fair value from first principles) is very hard.

The reason this approach is so much easier is that you assume the market odds are right except for some factor x, and go from there, applying small top-down adjustments for factors that haven't historically been incorporated into the market opinion. The challenge lies in finding those factors and making sure you aren't tricking yourself into thinking you've found one that you can profit off in the future.

Once again this is another example of the uses of the Betfair historical stream data. To get cracking - as always - we need historical odds, and the best place to get those is to self-serve the historical stream files.


0.3 Examples

I'll go through an end-to-end example of 3 different betting angles on Australian thoroughbred racing, which will include:

  • Sourcing data
  • Assembling data
  • Formulating hypotheses
  • Testing hypotheses
  • Discussion about implementation

1.0 Data

1.1 Betfair Odds Data

We'll follow a very similar template to other tutorials, extracting key information from the Betfair stream data.

It's important to note that, given the volume of data you need to handle with these stream files, your workflow will probably involve choosing some method of aggregation / summary that you'll reconsider after your first cut of analysis: you parse and save a dataset, use it to test some hypotheses, and those tests likely raise more questions that need to be examined by reparsing the stream files in a slightly different way. Expect your workflow to loop through these steps a few times.

For the purposes of this article I'm interested in backtesting some betting angles at the BSP, using some indication of price momentum / market support in some angles, and testing some back-to-lay strategies, so we'll need to pull out some information about each runner's in-play trading.

So we'll extract the following for each runner:

  • BSP
  • Last traded price
  • Volume weighted average price (top 3 boxes) 5 mins before the scheduled jump time
  • Volume weighted average price (top 3 boxes) 30 seconds before the scheduled jump time
  • The volume traded on the selection
  • The minimum "best available to lay" price offered inplay (which is a measure of how low the selection traded during the race)

First we'll establish some utility functions needed to parse the data. Most of these were discussed in the previous backtest your ratings tutorial.

# Utility Functions For Stream Parsing
# _________________________________

def as_str(v) -> str:
    return '%.2f' % v if type(v) is float else v if type(v) is str else ''

def split_anz_horse_market_name(market_name: str) -> Tuple[str, str, str]:
    parts = market_name.split(' ')
    race_no = parts[0] # return example R6
    race_len = parts[1] # return example 1400m
    race_type = parts[2].lower() # return example grp1, trot, pace
    return (race_no, race_len, race_type)

def filter_market(market: MarketBook) -> bool: 
    d = market.market_definition
    return (d.country_code == 'AU' 
        and d.market_type == 'WIN' 
        and (c := split_anz_horse_market_name(d.name)[2]) != 'trot' and c != 'pace')

def load_markets(file_paths):
    for file_path in file_paths:
        print(file_path)
        if os.path.isdir(file_path):
            for path in glob.iglob(file_path + '**/**/*.bz2', recursive=True):
                f = bz2.BZ2File(path, 'rb')
                yield f
                f.close()
        elif os.path.isfile(file_path):
            ext = os.path.splitext(file_path)[1]
            # iterate through a tar archive
            if ext == '.tar':
                with tarfile.TarFile(file_path) as archive:
                    for file in archive:
                        yield bz2.open(archive.extractfile(file))
            # or a zip archive
            elif ext == '.zip':
                with zipfile.ZipFile(file_path) as archive:
                    for file in archive.namelist():
                        yield bz2.open(archive.open(file))

    return None

def slicePrice(l, n):
    try:
        x = l[n].price
    except:
        x = np.nan
    return(x)

def sliceSize(l, n):
    try:
        x = l[n].size
    except:
        x = np.nan
    return(x)

def wapPrice(l, n):
    # volume weighted average price over the top n rungs of the ladder
    try:
        x = round(sum( [rung.price * rung.size for rung in l[0:n] ] ) / sum( [rung.size for rung in l[0:n] ]),2)
    except:
        x = np.nan
    return(x)

def ladder_traded_volume(ladder):
    return(sum([rung.size for rung in ladder]))
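To make the WAP helper concrete, here's a hypothetical mini example with a mocked-up ladder (real ladders are lists of betfairlightweight PriceSize objects; the namedtuple here is just a stand-in for illustration):

from collections import namedtuple

# Stand-in for betfairlightweight's PriceSize ladder rungs (illustration only)
Rung = namedtuple('Rung', ['price', 'size'])

ladder = [Rung(3.0, 100), Rung(3.1, 50), Rung(3.2, 25)]

# (3.0*100 + 3.1*50 + 3.2*25) / (100 + 50 + 25) = 535 / 175 = 3.06
wapPrice(ladder, 3)  # 3.06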

Then we'll create our core execution functions that will scan over the historical stream files and use betfairlightweight to recreate the state of the exchange for each thoroughbred market, extracting key information for each selection.

# Core Execution Functions
# _________________________________

def extract_components_from_stream(s):

    with patch("builtins.open", lambda f, _: f):   

        evaluate_market = None
        prev_market = None
        postplay = None
        preplay = None
        t5m = None
        t30s = None
        inplay_min_lay = None

        gen = s.get_generator()

        for market_books in gen():

            for market_book in market_books:

                # If market doesn't meet filter, return Nones
                if evaluate_market is None and ((evaluate_market := filter_market(market_book)) == False):
                    return (None, None, None, None, None, None)

                # final market view before market goes in play
                if prev_market is not None and prev_market.inplay != market_book.inplay:
                    preplay = market_book

                # final market view before market is closed for settlement
                if prev_market is not None and prev_market.status == "OPEN" and market_book.status != prev_market.status:
                    postplay = market_book

                # Calculate Seconds Till Scheduled Market Start Time
                seconds_to_start = (market_book.market_definition.market_time - market_book.publish_time).total_seconds()

                # Market at 30 seconds before scheduled off
                if t30s is None and seconds_to_start < 30:
                    t30s = market_book

                # Market at 5 mins before scheduled off
                if t5m is None and seconds_to_start < 5*60:
                    t5m = market_book

                # Manage Inplay Vectors
                if market_book.inplay:

                    if inplay_min_lay is None:
                        inplay_min_lay = [ slicePrice(runner.ex.available_to_lay,0) for runner in market_book.runners]
                    else:
                        inplay_min_lay = np.fmin(inplay_min_lay, [ slicePrice(runner.ex.available_to_lay,0) for runner in market_book.runners])

                # update reference to previous market
                prev_market = market_book

        # If market didn't go inplay
        if postplay is not None and preplay is None:
            preplay = postplay
            inplay_min_lay = ["" for runner in market_book.runners]

        return (t5m, t30s, preplay, postplay, inplay_min_lay, prev_market) # Final market is last prev_market

def parse_stream(stream_files, output_file):

    with open(output_file, "w+") as output:

        output.write("market_id,selection_id,selection_name,wap_5m,wap_30s,bsp,ltp,traded_vol,inplay_min_lay\n")

        for file_obj in load_markets(stream_files):

            stream = trading.streaming.create_historical_generator_stream(
                file_path=file_obj,
                listener=listener,
            )

            (t5m, t30s, preplay, postplay, inplayMin, final) = extract_components_from_stream(stream)

            # If no price data for market don't write to file
            if postplay is None or final is None or t30s is None:
                continue

            # All runners removed
            if all(runner.status == "REMOVED" for runner in final.runners):
                continue

            runnerMeta = [
                {
                    'selection_id': r.selection_id,
                    'selection_name': next((rd.name for rd in final.market_definition.runners if rd.selection_id == r.selection_id), None),
                    'selection_status': r.status,
                    'sp': r.sp.actual_sp
                }
                for r in final.runners 
            ]

            ltp = [runner.last_price_traded for runner in preplay.runners]

            tradedVol = [ ladder_traded_volume(runner.ex.traded_volume) for runner in postplay.runners ]

            wapBack30s = [ wapPrice(runner.ex.available_to_back, 3) for runner in t30s.runners]

            wapBack5m = [ wapPrice(runner.ex.available_to_back, 3) for runner in t5m.runners]

            # Writing To CSV
            # ______________________

            for (runnerMeta, ltp, tradedVol, inplayMin, wapBack5m, wapBack30s) in zip(runnerMeta, ltp, tradedVol, inplayMin, wapBack5m, wapBack30s):

                if runnerMeta['selection_status'] != 'REMOVED':

                    output.write(
                        "{},{},{},{},{},{},{},{},{}\n".format(
                            str(final.market_id),
                            runnerMeta['selection_id'],
                            runnerMeta['selection_name'],
                            wapBack5m,
                            wapBack30s,
                            runnerMeta['sp'],
                            ltp,
                            round(tradedVol),
                            inplayMin
                        )
                    )

Finally, after sourcing and downloading 12 months of stream files (ask automation@betfair.com.au for more info if you don't know how to do this) we'll use the above code to parse each file and write the results to a single CSV file to be used for analysis.

# Description:
#   Will loop through a set of stream data archive files and extract a few key pricing measures for each selection
# Estimated Time:
#   ~6 hours

# Parameters
# _________________________________

# trading = betfairlightweight.APIClient("username", "password")

# listener = StreamListener(max_latency=None)

# stream_files = glob.glob("[PATH TO LOCAL FOLDER STORING ARCHIVE FILES]*.tar")
# output_file = "[SOME OUTPUT DIRECTORY]/thoroughbred-odds-2021.csv"

# Run
# _________________________________

# if __name__ == '__main__':
#     parse_stream(stream_files, output_file)

1.2 Race Data

If you're building a fundamental bottom-up model, finding and managing ETL from an appropriate data source is a large part of the exercise. If your needs are simpler (for these types of automated strategies, for example) there's plenty of good information available right inside the Betfair API itself.

The RUNNER_METADATA slot inside the listMarketCatalogue response, for example, will return a pretty good slice of metadata about the horses racing in upcoming races including, but not limited to: the trainer, the jockey, the horse's age, and a class rating. The documentation for this endpoint will give you the full extent of what's inside this response.
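At implementation time, fetching that metadata for upcoming races looks something like the minimal sketch below. It assumes an authenticated betfairlightweight APIClient (trading), and the filter values for AU thoroughbred win markets are my assumptions rather than anything prescribed by this tutorial:

from betfairlightweight import filters

# trading = betfairlightweight.APIClient("username", "password")
# trading.login_interactive()

market_filter = filters.market_filter(
    event_type_ids=['7'],      # 7 = horse racing
    market_countries=['AU'],
    market_type_codes=['WIN'],
)

catalogue = trading.betting.list_market_catalogue(
    filter=market_filter,
    market_projection=['RUNNER_METADATA', 'MARKET_START_TIME'],
    max_results=100,
)

# Each runner carries a metadata dict with fields like JOCKEY_NAME, TRAINER_NAME and AGE
# runner_meta = catalogue[0].runners[0].metadata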

Our problem for this exercise is that the historical stream files don't include this RUNNER_METADATA, so we weren't able to extract it in the previous step. However, a sneaky workaround is to use an unsupported back-end endpoint, one which Betfair use for the Hub racing results page.

These API endpoints are:

Extract Betfair Racing Markets for a Given Date

First we'll hit the https://apigateway.betfair.com.au/hub/racecard endpoint to get the racing markets available on Betfair for a given day in the past:

def getBfMarkets(dte):

    url = 'https://apigateway.betfair.com.au/hub/racecard?date={}'.format(dte)

    responseJson = requests.get(url).json()

    marketList = []

    for meeting in responseJson['MEETINGS']:
        for markets in meeting['MARKETS']:
            marketList.append(
                {
                    'date': dte,
                    'track': meeting['VENUE_NAME'],
                    'country': meeting['COUNTRY'],
                    'race_type': meeting['RACE_TYPE'],
                    'race_number': markets['RACE_NO'],
                    'market_id': str('1.' + markets['MARKET_ID']),
                    'start_time': markets['START_TIME']
                }
            )

    marketDf = pd.DataFrame(marketList)

    return(marketDf)

Extract Key Race Metadata

Then (for one of these market_ids) we'll hit the https://apigateway.betfair.com.au/hub/raceevent/ endpoint to get some key runner metadata for the runners in this race. It's important to note that this information is available through the Betfair API, so at the point of implementation we won't need to go to a secondary datasource to find it; that would add a large layer of complexity to the project, including things like string cleaning and matching.

def getBfRaceMeta(market_id):

    url = 'https://apigateway.betfair.com.au/hub/raceevent/{}'.format(market_id)

    responseJson = requests.get(url).json()

    if 'error' in responseJson:
        return(pd.DataFrame())

    raceList = []

    for runner in responseJson['runners']:

        if 'isScratched' in runner and runner['isScratched']:
            continue

        # Jockey not always populated
        try:
            jockey = runner['jockeyName']
        except:
            jockey = ""

        # Place not always populated
        try:
            placeResult = runner['placedResult']
        except:
            placeResult = ""

        # Trainer not always populated
        try:
            trainer = runner['trainerName']
        except:
            trainer = ""

        raceList.append(
            {
                'market_id': market_id,
                'weather': responseJson['weather'],
                'track_condition': responseJson['trackCondition'],
                'race_distance': responseJson['raceLength'],
                'selection_id': runner['selectionId'],
                'selection_name': runner['runnerName'],
                'barrier': runner['barrierNo'],
                'place': placeResult,
                'trainer': trainer,
                'jockey': jockey,
                'weight': runner['weight']
            }
        )

    raceDf = pd.DataFrame(raceList)

    return(raceDf)

Wrapper Function

Stitching these two functions together, we can create a wrapper function that hits both endpoints for all the thoroughbred races in a given day and extracts all the runner metadata and results.

def scrapeThoroughbredBfDate(dte):

    markets = getBfMarkets(dte)

    if markets.shape[0] == 0:
        return(pd.DataFrame())

    thoMarkets = markets.query('country == "AUS" and race_type == "R"')

    if thoMarkets.shape[0] == 0:
        return(pd.DataFrame())

    raceMetaList = []

    for market in thoMarkets.market_id:
        raceMetaList.append(getBfRaceMeta(market))

    raceMeta = pd.concat(raceMetaList)

    return(markets.merge(raceMeta, on = 'market_id'))

# Executing the wrapper for an example date
scrapeThoroughbredBfDate(date(2021,2,10))
date track country race_type race_number market_id start_time weather track_condition race_distance selection_id selection_name barrier place trainer jockey weight
0 2021-02-10 Ascot AUS R 1 1.179077389 2021-02-10 04:34:00 None None 1000 38448397 Triple Missile 3 1 Todd Harvey Paul Harvey 60.0
1 2021-02-10 Ascot AUS R 1 1.179077389 2021-02-10 04:34:00 None None 1000 28763768 Shock Result 5 4 P H Jordan Craig Staples 59.5
2 2021-02-10 Ascot AUS R 1 1.179077389 2021-02-10 04:34:00 None None 1000 8772321 Secret Plan 6 3 G & A Williams William Pike 59.0
3 2021-02-10 Ascot AUS R 1 1.179077389 2021-02-10 04:34:00 None None 1000 9021011 Command Force 2 0 Daniel & Ben Pearce J Azzopardi 58.0
4 2021-02-10 Ascot AUS R 1 1.179077389 2021-02-10 04:34:00 None None 1000 38448398 Fish Hook 7 2 M P Allan Madi Derrick 57.5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
458 2021-02-10 Warwick Farm AUS R 7 1.179081635 2021-02-10 06:50:00 None None 1200 133456 Sedition 12 2 Richard Litt Ms Rachel King 58.0
459 2021-02-10 Warwick Farm AUS R 7 1.179081635 2021-02-10 06:50:00 None None 1200 38447782 Amusez Moi 9 6 Richard Litt Josh Parr 57.0
460 2021-02-10 Warwick Farm AUS R 7 1.179081635 2021-02-10 06:50:00 None None 1200 25388274 Savoury 1 5 Bjorn Baker Jason Collett 57.0
461 2021-02-10 Warwick Farm AUS R 7 1.179081635 2021-02-10 06:50:00 None None 1200 38447783 Born A Warrior 7 3 Michael & Wayne & John Hawkes Tommy Berry 56.5
462 2021-02-10 Warwick Farm AUS R 7 1.179081635 2021-02-10 06:50:00 None None 1200 38447784 Newsreader 10 1 John O'shea James Mcdonald 55.5

463 rows × 17 columns

Then to produce a historical slice of all races between two dates we can just loop over a set of dates and append each result set.

# Description:
#   Will loop through a set of dates (starting July 2020 in this instance) and return race metadata from betfair 
# Estimated Time:
#   ~60 mins
# 
# dataList = []
# dateList = pd.date_range(date(2020,7,1),date.today()-timedelta(days=1),freq='d')
# for dte in dateList:
#     dte = dte.date()
#     print(dte)
#     races = scrapeThoroughbredBfDate(dte)
#     dataList.append(races)
# data = pd.concat(dataList)
# data.to_csv("[LOCAL PATH SOMEWHERE]", index=False)

2.0 Analysis

I'll be running through 3 simple betting angles - one easy, one medium, and one hard - to illustrate different types of angles you might want to try at home. The process I lay out is very similar (if not identical) in each case, but the implementation gets a bit trickier and might take a little more programming skill to get up and running.

We'll use a simple evaluation function, reporting profit on turnover (POT) and strike rate, to evaluate each of these strategies.

def bet_eval_metrics(d, side = False):

    metrics = pd.DataFrame(d
    .agg({"npl": "sum", "stake": "sum", "win": "mean"})
    ).transpose().assign(pot=lambda x: x['npl'] / x['stake'])

    return(metrics[metrics['stake'] != 0])
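As a quick illustration of its output, here's a hypothetical three-bet dataframe (figures invented for illustration): one winner at $3.00 after 5% commission, plus two losers.

# Hypothetical example: 3 flat-stake bets, 1 winner at $3.00
example = pd.DataFrame({
    'npl': [1.90, -1.0, -1.0],   # net profit/loss per bet
    'stake': [1, 1, 1],
    'win': [1, 0, 0],
})
bet_eval_metrics(example)  # npl = -0.10, stake = 3, win = 0.333, pot = -0.033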

2.1 Assemble Data

Now that we have our 2 core datasets (odds + race / runner metadata), we can join them together and do some analysis.

# Local Paths (will be different on your machine)
path_odds_local = "[PATH TO YOUR LOCAL FILES]/thoroughbred-odds-2021.csv"
path_race_local = "[PATH TO YOUR LOCAL FILES]/thoroughbred-race-data.csv"

odds = pd.read_csv(path_odds_local, dtype={'market_id': object, 'selection_id': object})
race = pd.read_csv(path_race_local, dtype={'market_id': object, 'selection_id': object})
odds.head(3)
market_id selection_id selection_name wap_5m wap_30s bsp ltp traded_vol inplay_min_lay
0 1.179845158 23493550 1. Larmour 6.27 5.84 6.20 6.2 8277 1.19
1 1.179845158 16374800 3. Careering Away 3.31 3.67 3.60 3.65 18592 1.08
2 1.179845158 19740699 4. Bells N Bows 6.87 6.36 6.62 6.4 7413 1.42
race.head(3)
date track country race_type race_number market_id start_time weather track_condition race_distance selection_id selection_name barrier place trainer jockey weight
0 2020-07-01 Balaklava AUS R 1 1.171091087 2020-07-01 02:40:00 FINE GOOD4 2200 19674744 Baldy 2 4.0 Peter Nolan Karl Zechner 59.0
1 2020-07-01 Balaklava AUS R 1 1.171091087 2020-07-01 02:40:00 FINE GOOD4 2200 401615 Nostrovia 4 7.0 Dennis O'leary Margaret Collett 59.0
2 2020-07-01 Balaklava AUS R 1 1.171091087 2020-07-01 02:40:00 FINE GOOD4 2200 26789410 Ammo Loco 5 1.0 John Hickmott Barend Vorster 58.5
# Joining two datasets
df = race.merge(odds.loc[:, odds.columns != 'selection_name'], how = "inner", on = ['market_id', 'selection_id'])
# I'll also add columns for the net profit from backing and laying each selection to be picked up in subsequent sections
df['back_npl'] = np.where(df['place'] == 1, 0.95 * (df['bsp']-1), -1)
df['lay_npl'] = np.where(df['place'] == 1, -1 * (df['bsp']-1), 0.95)
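To make these payoff columns concrete, here's a hypothetical worked example at a BSP of 5.0 (the 0.95 factor assumes the 5% commission rate used throughout this piece):

# Winner at BSP 5.0:
#   back_npl = 0.95 * (5.0 - 1) =  3.80  (back profit after 5% commission)
#   lay_npl  = -1 * (5.0 - 1)   = -4.00  (lay loses the liability)
# Loser at any BSP:
#   back_npl = -1.00  (back loses the stake)
#   lay_npl  =  0.95  (lay wins the stake, less commission)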

2.2 Methodology

Looping back around to the context discussion in part 0.2, we need to decide how to set up our analysis so that it helps us find angles, formulate strategies, and test them with enough rigour to give a good estimate of the forward-looking profitability of any that we choose to implement and automate.

The 3 key tricks I'll lay out in this piece are:

  • Using a statistical estimate to quantify the robustness of historical profitability
  • Using out-of-sample validation (much like you would in a model building exercise) to get an accurate view of forward-looking profitability
  • Using domain knowledge to chunk selections, giving a broader sample and a more stable estimate of profitability

2.3.1 Chunking

This is a technique you can use to group variables into conceptually similar groups. For example, thoroughbred races are run over many different exact distances (800m, 810m, 850m, 860m etc) which - using a domain overlay - are all very short, sprint style races. Similarly, barriers 1, 2 and 3, being on the very inside of the field and closest to the rail, all present similar early race challenges and advantages for horses jumping from them.

So when formulating your betting angles you may want to overlay semantically similar variable groups to test your betting hypothesis.

I'll add variable chunks for race distance and barrier for now, but you may want to test more (for example horse experience, trainer stable size etc).

def distance_group(distance):

    if distance is None:
        return("missing")
    elif distance < 1100:
        return("sprint")
    elif distance < 1400:
        return("mid_short")
    elif distance < 1800:
        return("mid_long")
    else:
        return("long")

def barrier_group(barrier):
    if barrier is None:
        return("missing")
    elif barrier < 4:
        return("inside")
    elif barrier < 9:
        return("mid_field")
    else:
        return("outside")


df['distance_group'] = pd.to_numeric(df.race_distance, errors = "coerce").apply(distance_group)
df['barrier_group'] = pd.to_numeric(df.barrier, errors = "coerce").apply(barrier_group)

2.3.2 In Sample vs Out of Sample

The first thing I'm going to do is split off a largish chunk of my data before even looking at it. I'll ultimately use it to paper trade some of my candidate angles, but I want it to be as separate from the idea generation process as possible.

I'll use the model building nomenclature "train" and "test" even though I'm not really doing any "training". My data contains all AUS thoroughbred races from July 2020 until the end of June 2021, so I'll cut off the period Apr-Jun 2021 as my "test" set.

dfTrain = df.query('date < "2021-04-01"')
dfTest = df.query('date >= "2021-04-01"')

'{} rows in the "training" set and {} rows in the "test" data'.format(dfTrain.shape[0], dfTest.shape[0])
'119244 rows in the "training" set and 40783 rows in the "test" data'

2.3.3 Statistically Measuring Profit

Betting outcomes, and the randomness associated with them, are at their core exactly the types of things the discipline of statistics was created to handle. Concepts like sample size, expected value, and variance are terms you might hear from sophisticated (and some novice) bettors, and they are all drawn from the field of statistics. You don't need a PhD in statistics, but every little extra technique or concept you can glean from the field will help your betting if you want it to.

To illustrate with an example, let's group net backing profit on turnover by horse to see which horses have the highest historical back POT:

(
    dfTrain
    .assign(stake=1)
    .groupby('selection_name', as_index = False)
    .agg({'back_npl': 'sum', 'stake': 'sum'})
    .assign(pot=lambda x: x['back_npl'] / x['stake'])
    .sort_values('pot', ascending=False)  
    .head(3) 
)
selection_name back_npl stake pot
12247 Little Vulpine 274.550 1 274.550000
15384 Not Tonight Dear 130.701 1 130.701000
9987 Im Cheeky 617.307 7 88.186714

So back Little Vulpine whenever it races? We all know intuitively what's wrong with that betting angle - it raced once in our sample and happened to win at a BSP of around 290. Sample size and variance are dominating this simple measure of historical POT.

Instead, we can treat the historical betting outcomes as a random variable and apply some statistical tests of significance to them. A more detailed discussion of this particular test can be found here, as can an Excel calculator you can input your stats into. I'll simply translate the test to Python to enable its use when formulating our betting angles.

The TLDR version of this test: based on your bet sample size, your profit, and the average odds across that sample of bets, the calculation produces a p-value which estimates the probability that your profit (or loss) happened by pure chance (where chance would be an expectation of break-even betting simply at fair odds).
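Written out, with $n$ the number of bets, $\text{POT}$ the profit on turnover, and $\bar{o}$ the average odds, the statistic the function below computes is

$$ t = \frac{\text{POT}\sqrt{n}}{\sqrt{(1 + \text{POT})(\bar{o} - 1 - \text{POT})}} $$

with the (two-sided) p-value read from a Student's t distribution with $n-1$ degrees of freedom.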

def pl_pValue(number_bets, npl, stake, average_odds):

    pot = npl / stake

    tStatistic = (pot * np.sqrt(number_bets)) / np.sqrt( (1 + pot) * (average_odds - 1 - pot) )

    pValue = 2 * t.cdf(-abs(tStatistic), number_bets-1)

    return(np.where(np.logical_or(np.isnan(pValue), pValue == 0), 1, pValue))
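To sanity check the function, here's a hypothetical worked example (all figures invented for illustration): the same 20% POT at average odds of 10.0 looks like pure chance over 100 bets, but becomes borderline significant over 1,000.

# Hypothetical figures for illustration only
print(pl_pValue(number_bets=100, npl=20, stake=100, average_odds=10.0))     # ~0.54
print(pl_pValue(number_bets=1000, npl=200, stake=1000, average_odds=10.0))  # ~0.05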

That doesn't mean we can formulate our angles and use this metric (and this metric alone) to validate their profitability - you'll find that it gives misleading results in some instances. As analysts we're also prone to finding infinitely many ways to unintentionally overfit our analysis, a concept you might have heard described elsewhere as p-hacking. But it does give us an extra filter to cast over our hypotheses before really validating them with out-of-sample testing.

2.4 Angle 1: Track | Distance | Barrier

The first thing I'll test is whether or not there are any combinations of track / distance / barrier where backing or laying could produce robust long term profit. These are probably the types of betting angles that others have already sucked all the value out of, long before you started reading this article. That's not to say you shouldn't test them though, as people have made livings on betting angles as simple as these.

# Calculate the profit (back and lay) and average odds across all track / distance / barrier group combos
trackDistanceBarrier = (
    dfTrain
    .assign(stake = 1)
    .assign(odds = lambda x: x['bsp'])
    .groupby(['track', 'race_distance', 'barrier_group'], as_index=False)
    .agg({'back_npl': 'sum', 'lay_npl': 'sum','stake': 'sum', 'odds': 'mean'})
)

trackDistanceBarrier
track race_distance barrier_group back_npl lay_npl stake odds
0 Albany 1000 inside 11.2550 -11.95 2 15.450000
1 Albany 1000 mid_field -5.0000 4.75 5 101.136000
2 Albany 1000 outside -5.0000 4.75 5 88.374000
3 Albany 1100 inside -3.0525 2.70 6 29.430000
4 Albany 1100 mid_field -6.4040 5.92 9 37.483333
... ... ... ... ... ... ... ...
6325 York 1500 inside 1.8995 -2.41 6 41.195000
6326 York 1500 mid_field -7.0000 6.65 7 32.472857
6327 York 1920 inside -3.0000 2.85 3 21.883333
6328 York 1920 mid_field -0.3520 -0.04 5 20.978000
6329 York 1920 outside -2.0000 1.90 2 21.450000

6330 rows × 7 columns

So it looks like, over 2 selections jumping from the inside 3 barriers at Albany 1000m, you would have made a healthy profit if you'd decided to back them historically.

Let's use our lens of statistical significance to view these profit figures:

trackDistanceBarrier = (
    trackDistanceBarrier
    .assign(backPL_pValue = lambda x: pl_pValue(number_bets = x['stake'], npl = x['back_npl'], stake = x['stake'], average_odds = x['odds']))
    .assign(layPL_pValue = lambda x: pl_pValue(number_bets = x['stake'], npl = x['lay_npl'], stake = x['stake'], average_odds = x['odds']))
)

trackDistanceBarrier
/home/tmbish/.local/lib/python3.9/site-packages/pandas/core/arraylike.py:364: RuntimeWarning: invalid value encountered in sqrt
  result = getattr(ufunc, method)(*inputs, **kwargs)

track race_distance barrier_group back_npl lay_npl stake odds backPL_pValue layPL_pValue
0 Albany 1000 inside 11.2550 -11.95 2 15.450000 0.487280 1.000000
1 Albany 1000 mid_field -5.0000 4.75 5 101.136000 1.000000 0.885995
2 Albany 1000 outside -5.0000 4.75 5 88.374000 1.000000 0.877954
3 Albany 1100 inside -3.0525 2.70 6 29.430000 0.754412 0.869397
4 Albany 1100 mid_field -6.4040 5.92 9 37.483333 0.532857 0.804366
... ... ... ... ... ... ... ... ... ...
6325 York 1500 inside 1.8995 -2.41 6 41.195000 0.918934 0.849635
6326 York 1500 mid_field -7.0000 6.65 7 32.472857 1.000000 0.755643
6327 York 1920 inside -3.0000 2.85 3 21.883333 1.000000 0.816546
6328 York 1920 mid_field -0.3520 -0.04 5 20.978000 0.972659 0.996987
6329 York 1920 outside -2.0000 1.90 2 21.450000 1.000000 0.863432

6330 rows × 9 columns

So as you can see, whilst this combination has a back POT of over 500%, because the results were generated over just 2 runners at quite high odds the p-value (~49%) suggests it's quite likely we could have seen these exact results due to randomness, which is very intuitive.

Let's have a look to see if there's any statistically significant edge to be gained on the lay side:

# Top 5 lay combos Track | Distance | Barrier (TDB)
TDB_bestLay = trackDistanceBarrier.query('lay_npl>0').sort_values('layPL_pValue').head(5)
TDB_bestLay
track race_distance barrier_group back_npl lay_npl stake odds backPL_pValue layPL_pValue
188 Ascot 1000 inside -83.6395 77.06 115 24.616870 0.003054 0.248157
3619 Moonee Valley 1200 inside -52.2195 48.81 64 17.725313 0.000565 0.254399
6299 Yeppoon 1400 inside -11.0000 10.45 11 6.022727 1.000000 0.289686
959 Caulfield 1400 mid_field -74.3150 67.45 114 24.828772 0.018780 0.301137
1366 Darwin 1200 mid_field -47.3980 43.94 64 19.354531 0.009844 0.318178

So despite high lay POT, none of these suggests an irrefutably profitable angle laying these combinations. However, that doesn't mean we shouldn't test them on our out-of-sample set of races. These are our statistically most promising examples; we'll just take the top 5 for now and see how we would have performed if we had started betting them on April 1st 2021.

Keep in mind this should give us a pretty good indication of what we could expect over the next 3 months if we started today, because we haven't contaminated / leaked any data from the post-April period into our angle formulation.

# First let's test laying on the train set (by definition we know these will be profitable)
train_TDB_bestLay = (
    dfTrain
    .merge(TDB_bestLay[['track', 'race_distance']])
    .assign(npl=lambda x: x['lay_npl'])
    .assign(stake=1)
    .assign(win=lambda x: np.where(x['lay_npl'] > 0, 1, 0))
)

# This is the key test (none of these races has been part of the analysis to this point)
test_TDB_bestLay = (
    dfTest
    .merge(TDB_bestLay[['track', 'race_distance']])
    .assign(npl=lambda x: x['lay_npl'])
    .assign(stake=1)
    .assign(win=lambda x: np.where(x['lay_npl'] > 0, 1, 0))
)

# Peeking at the bets in the test set
test_TDB_bestLay[['track', 'race_distance', 'barrier', 'barrier_group', 'bsp', 'lay_npl', 'win', 'stake']]
track race_distance barrier barrier_group bsp lay_npl win stake
0 Ascot 1000 4 mid_field 11.08 0.95 1 1
1 Ascot 1000 11 outside 5.41 0.95 1 1
2 Ascot 1000 1 inside 4.73 -3.73 0 1
3 Ascot 1000 5 mid_field 7.35 0.95 1 1
4 Ascot 1000 10 outside 4.97 0.95 1 1
... ... ... ... ... ... ... ... ...
219 Darwin 1200 10 outside 18.31 0.95 1 1
220 Darwin 1200 8 mid_field 42.00 0.95 1 1
221 Darwin 1200 6 mid_field 29.91 0.95 1 1
222 Darwin 1200 11 outside 6.60 -5.60 0 1
223 Darwin 1200 7 mid_field 5.74 0.95 1 1

224 rows × 8 columns

# Let's run our evaluation on the training set
bet_eval_metrics(train_TDB_bestLay)
npl stake win pot
0 98.1 1047.0 0.892073 0.093696
# And on the test set
bet_eval_metrics(test_TDB_bestLay)
npl stake win pot
0 18.44 224.0 0.870536 0.082321

Those are promising results: our test set shows similar betting performance to our training set and we're still seeing a profitable trend. These are lay strategies, so they aren't as robust as backing strategies (your profit distribution is lots of small wins and some large losses), but this is potentially a profitable betting angle!

2.5 Angle 2: Jockeys + Market Opinion

Moving up slightly in level of difficulty, our angles can include different kinds of reference points. Jockeys seem to be a divisive form factor in thoroughbred racing, and their quality can be hard to isolate relative to the quality of the horse, its preparation, and so on.

I'm going to look at isolating jockeys that are either favoured or unfavoured by the market, to see if I can formulate a betting angle that could generate expected profit.

The metric I'm going to use to determine market favour is the ratio between the back price 5 minutes before the scheduled jump and 30 seconds before the scheduled jump. Looking at this ratio for jockeys in our training set, we can see which jockeys tend to have high market support, indicated by a high ratio (the horses they ride tend to shorten before the off):

(
    dfTrain
    .assign(market_support=lambda x: x['wap_5m'] / x['wap_30s'])
    .assign(races=1)
    .groupby('jockey')
    .agg({'market_support': 'mean', 'races': 'count'})
    .query('races > 10')
    .sort_values('market_support', ascending = False)
    .head(10)
)
market_support races
jockey
Scott Sheargold 1.133095 192
Lorelle Crow 1.056582 106
Chris Mc Carthy 1.051022 26
Anthony Darmanin 1.048931 142
James Winks 1.048893 12
Bob El-Issa 1.046756 196
Elyce Smith 1.043593 164
Jessica Gray 1.043376 108
Paul Francis Hamblin 1.042248 61
Alana Livesey 1.042188 32

Next, let's split the sample of each jockey's races between two scenarios: a) the market firmed for their horse, or b) their horse drifted in the market in the last 5 minutes of trading.

We then calculate the same summary table of inputs (profit, average odds etc) for backing these jockeys at the BSP given some market move, and feed these metrics into our statistical significance test to get an idea of the profitability of each combination.

# Group By Jockey and Market Support
jockeys = (
    dfTrain
    .assign(stake = 1)
    .assign(odds = lambda x: x['bsp'])
    .assign(npl=lambda x: np.where(x['place'] == 1, 0.95 * (x['odds']-1), -1))
    .assign(market_support=lambda x: np.where(x['wap_5m'] > x['wap_30s'], "Y", "N"))
    .groupby(['jockey', 'market_support'], as_index=False)
    .agg({'odds': 'mean', 'stake': 'sum', 'npl': 'sum'})
    .assign(pValue = lambda x: pl_pValue(number_bets = x['stake'], npl = x['npl'], stake = x['stake'], average_odds = x['odds']))
)

jockeys.sort_values('pValue').query('npl > 0').head(10)
jockey market_support odds stake npl pValue
624 K Jennings Y 18.106118 85 178.6955 0.005643
496 Jade Darose Y 87.343333 39 579.4265 0.008942
263 Clayton Gallagher Y 24.225338 148 226.7145 0.012994
906 Ms T Harrison Y 27.944125 160 241.3065 0.018095
615 Justin P Stanley N 13.084502 231 155.7305 0.019913
802 Michael Dee N 36.338213 263 299.7255 0.031634
753 Madeleine Wishart Y 25.249872 78 156.5065 0.033329
433 Hannah Fitzgerald Y 32.171944 72 170.1830 0.045334
937 Nick Heywood N 17.172857 98 111.8905 0.049176
745 M Pateman N 22.690808 260 189.2050 0.052283

You can think of each of these scenarios as representing different cases. If profitable:

  • Under market support, this could indicate the jockey is being correctly favoured to maximise their horse's chances of winning the race, or perhaps even some kind of insider knowledge coming out of certain stables
  • Under market drift, this could indicate some incorrect skepticism about the jockey's ability, and thus their horse has been overlayed

Either way, we're interested to see how these combinations would perform paper trading in our out-of-sample set.

# First evaluate on our training set
train_jockeyMarket = (
    dfTrain
    .assign(market_support=lambda x: np.where(x['wap_5m'] > x['wap_30s'], "Y", "N"))
    .merge(jockeys.sort_values('pValue').query('npl > 0').head(10)[['jockey', 'market_support']])
    .assign(stake = 1)
    .assign(odds = lambda x: x['bsp'])
    .assign(npl=lambda x: np.where(x['place'] == 1, 0.95 * (x['odds']-1), -1))
    .assign(win=lambda x: np.where(x['npl'] > 0, 1, 0))
)

# And on the test set
test_jockeyMarket = (
    dfTest
    .assign(market_support=lambda x: np.where(x['wap_5m'] > x['wap_30s'], "Y", "N"))
    .merge(jockeys.sort_values('pValue').query('npl > 0').head(10)[['jockey', 'market_support']])
    .assign(stake = 1)
    .assign(odds = lambda x: x['bsp'])
    .assign(npl=lambda x: np.where(x['place'] == 1, 0.95 * (x['odds']-1), -1))
    .assign(win=lambda x: np.where(x['npl'] > 0, 1, 0))
)
bet_eval_metrics(train_jockeyMarket)
npl stake win pot
0 2309.384 1434.0 0.154812 1.610449
bet_eval_metrics(test_jockeyMarket)
npl stake win pot
0 36.329 375.0 0.109333 0.096877

You can see overfitting in full effect in the train set performance here. However, our out-of-sample performance is still decently profitable. We might have found another profitable betting angle!

It's worth noting that implementing this strategy would be slightly more complex than implementing our first strategy. Our code (or third party tool) would need to check whether the market had firmed between 2 distinct time points before the jump of the race, and cross-reference that with the jockey name. That's trivial for someone who is comfortable with bet placement and the Betfair API, but a little more involved for the uninitiated. It's important to formulate angles that you know how to implement and are capable of implementing.
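For a rough sense of what that automation involves, here's a minimal sketch using betfairlightweight. The market id and timing are placeholders, it reuses the slicePrice helper from section 1.1, and it assumes an authenticated trading client as before:

import time
from betfairlightweight import filters

def best_back_prices(market_id):
    # Snapshot the current best available-to-back price for each runner
    book = trading.betting.list_market_book(
        market_ids=[market_id],
        price_projection=filters.price_projection(price_data=['EX_BEST_OFFERS']),
    )[0]
    return {r.selection_id: slicePrice(r.ex.available_to_back, 0) for r in book.runners}

# Hypothetical flow: snapshot 5 mins out, again 30s out, then back firmers ridden by target jockeys
# prices_5m = best_back_prices('1.23456789')
# time.sleep(4.5 * 60)
# prices_30s = best_back_prices('1.23456789')
# firmers = [sid for sid in prices_30s if prices_5m[sid] > prices_30s[sid]]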

2.6 Angle 3: Backing To Lay

Now let's try to use some of the in-play price data we extracted from the stream files. I'm interested in testing some back-to-lay strategies, where a horse is backed pre-play with the intention of getting a trade-out lay order filled during the race. The scenarios where this could conceivably be profitable would be on certain kinds of horses or jockeys that show promise or strength early in the race but generally fade late, and so might not convert those early advantages very often.

Things we could look at here are:

  • Horses that typically trade lower than their preplay odds but don't win often
  • Jockeys that typically trade lower than their preplay odds but don't win often
  • Certain combinations of jockey / trainer / horse / race distance that meet these criteria

# First Investigate The Average Inplay Minimums And Loss Rates of Certain Jockeys
tradeOutIndex = (
    dfTrain
    .query('distance_group in ["long", "mid_long"]')
    .assign(inplay_odds_ratio=lambda x: x['inplay_min_lay'] / x['bsp'])
    .assign(win=lambda x: np.where(x['place']==1,1,0))
    .assign(races=lambda x: 1)
    .groupby(['jockey'], as_index=False)
    .agg({'inplay_odds_ratio': 'mean', 'win': 'mean', 'races': 'sum'})
    .sort_values('inplay_odds_ratio')
    .query('races >= 5')
)

tradeOutIndex
jockey inplay_odds_ratio win races
291 John Rudd 0.352796 0.000000 8
457 Natalie M Morton 0.357216 0.142857 7
451 Murray Henderson 0.455943 0.166667 6
92 Bridget Grylls 0.474635 0.000000 11
431 Ms Heather Poland 0.478529 0.000000 5
... ... ... ... ...
438 Ms K Stanley 0.898819 0.000000 21
619 Yasuhiro Nishitani 0.902459 0.043478 23
99 Cameron Quilty 0.907503 0.000000 20
87 Brett Fliedner 0.923814 0.000000 20
169 Desiree Stra 0.949329 0.000000 5

558 rows × 4 columns

Ok, so what we have here is a list of all jockeys with at least 5 rides in the long and mid-long race distance groups (1400m and beyond), ordered by their average ratio of in-play minimum traded price to jump price.

If this trend is predictive, we could assume these jockeys tend to have an aggressive race style and like to get out and lead the race. We'd like to capitalise on that race style by backing these jockeys pre-play and putting in a lay order which we'll leave in-play, hoping to get matched during the race.

For simplicity let's just assume we're flat staking on both sides, so that our payoff profile looks like this (sketched as a small helper function below):

  • The horse never trades at <50% of its BSP: our lay bet never gets matched and we lose 1 unit
  • The horse trades at <50% of its BSP but loses: our lay bet gets filled and we break even on the market
  • The horse wins: our lay bet gets filled, we profit on our back bet and lose our lay bet, so our profit is (BSP-1) - (0.5*BSP-1)
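As promised, here's a small helper mirroring that payoff for a single runner - a minimal sketch only, where the 0.95 again assumes 5% commission and tradeout_fraction is the 50% trigger:

def back_to_lay_npl(bsp, inplay_min_lay, won, tradeout_fraction=0.5):
    # Lay order never matched in-play: we lose the back stake
    if inplay_min_lay > tradeout_fraction * bsp:
        return -1
    # Lay matched but the horse lost: back loss offset by lay win -> breakeven
    if not won:
        return 0
    # Lay matched and the horse won: back profit less lay liability, less commission
    return 0.95 * ((bsp - 1) - (tradeout_fraction * bsp - 1))

# e.g. a winner at BSP 8.0 that traded down to 2.0 in-play:
# back_to_lay_npl(8.0, 2.0, True)  ->  0.95 * (7.0 - 3.0) = 3.8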

Let's run this backtest on the top 20 jockeys in our tradeOutIndex dataframe to see how we'd perform on the train and test set.

targetTradeoutFraction = 0.5

train_JockeyBackToLay = (
    dfTrain
    .query('distance_group in ["long", "mid_long"]')
    .merge(tradeOutIndex.head(20)['jockey'])
    .assign(npl=lambda x: np.where(x['inplay_min_lay'] <= targetTradeoutFraction * x['bsp'], np.where(x['place'] == 1, 0.95 * (x['bsp']-1-(0.5*x['bsp']-1)), 0), -1))
    .assign(stake=lambda x: np.where(x['npl'] != -1, 2, 1))
    .assign(win=lambda x: np.where(x['npl'] >= 0, 1, 0))
)

bet_eval_metrics(train_JockeyBackToLay)
npl stake win pot
0 23.797 671.0 0.5181 0.035465
test_JockeyBackToLay = (
    dfTest
    .query('distance_group in ["long", "mid_long"]')
    .merge(tradeOutIndex.head(20)['jockey'])
    .assign(npl=lambda x: np.where(x['inplay_min_lay'] <= targetTradeoutFraction * x['bsp'], np.where(x['place'] == 1, 0.95 * (x['bsp']-1-(0.5*x['bsp']-1)), 0), -1))
    .assign(stake=lambda x: np.where(x['npl'] != -1, 2, 1))
    .assign(win=lambda x: np.where(x['npl'] >= 0, 1, 0))

)

bet_eval_metrics(test_JockeyBackToLay)
npl stake win pot
0 45.62475 255.0 0.342105 0.178921

Not bad! It looks like we've found another possibly promising lead.

Again, it's worth noting that this is probably another step up in implementation complexity from the previous angles. It's not very hard when you're familiar with Betfair order types and placing them through the API, but it requires some additional API savviness. The documentation is quite good, though, and there are plenty of resources available online to help you understand how to automate something like this.
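To give a flavour of the API pieces involved, here's a minimal hedged sketch of placing the two orders with betfairlightweight. The selection id, prices, and sizes are placeholders; the key detail is persistence_type='PERSIST' on the lay side, which keeps that order live when the market turns in-play:

from betfairlightweight import filters

back = filters.place_instruction(
    order_type='LIMIT', selection_id=12345, side='BACK',
    limit_order=filters.limit_order(size=10.0, price=6.0, persistence_type='LAPSE'),
)
lay = filters.place_instruction(
    order_type='LIMIT', selection_id=12345, side='LAY',
    limit_order=filters.limit_order(size=10.0, price=3.0, persistence_type='PERSIST'),
)

# trading.betting.place_orders(market_id='1.23456789', instructions=[back, lay])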


3.0 Conclusion

This analysis is just a sketch. Hopefully it inspires you to think about what kinds of betting angles you could test for a sport or racing code you're interested in. It should give you a framework for thinking about this kind of automated betting and how it differs from fundamental modelling, as well as a few tricks for coming up with your own angles and testing them with the rigour needed to have any realistic expectation of profit. Most of the betting angles you're sold are faulty, or had their value bet out of the market long before you even knew the rules of the sport. You'll need to be creative and scientific to create your own profitable betting angles, but it's certainly worth trying.


Complete code

Run the code from your IDE by using py <filename>.py, making sure you amend the path to point to your input data.

Download from GitHub

import requests
import pandas as pd
from datetime import date, timedelta
import numpy as np
import os
import re
import tarfile
import zipfile
import bz2
import glob
import logging
import yaml
from unittest.mock import patch
from typing import List, Set, Dict, Tuple, Optional
from itertools import zip_longest
import betfairlightweight
from betfairlightweight import StreamListener
from betfairlightweight.resources.bettingresources import (
    PriceSize,
    MarketBook
)
from scipy.stats import t
import plotly.express as px


# Utility Functions
#   + Stream Parsing
#   + Betfair Race Data Scraping
#   + Various utilities
# _________________________________

def as_str(v) -> str:
    return '%.2f' % v if type(v) is float else v if type(v) is str else ''

def split_anz_horse_market_name(market_name: str) -> Tuple[str, str, str]:
    parts = market_name.split(' ')
    race_no = parts[0] # return example R6
    race_len = parts[1] # return example 1400m
    race_type = parts[2].lower() # return example grp1, trot, pace
    return (race_no, race_len, race_type)

def filter_market(market: MarketBook) -> bool:
    d = market.market_definition
    return (d.country_code == 'AU' 
        and d.market_type == 'WIN' 
        and (c := split_anz_horse_market_name(d.name)[2]) != 'trot' and c != 'pace')

def load_markets(file_paths):
    for file_path in file_paths:
        print(file_path)
        if os.path.isdir(file_path):
            for path in glob.iglob(file_path + '**/**/*.bz2', recursive=True):
                f = bz2.BZ2File(path, 'rb')
                yield f
                f.close()
        elif os.path.isfile(file_path):
            ext = os.path.splitext(file_path)[1]
            # iterate through a tar archive
            if ext == '.tar':
                with tarfile.TarFile(file_path) as archive:
                    for file in archive:
                        yield bz2.open(archive.extractfile(file))
            # or a zip archive
            elif ext == '.zip':
                with zipfile.ZipFile(file_path) as archive:
                    for file in archive.namelist():
                        yield bz2.open(archive.open(file))

    return None

def slicePrice(l, n):
    try:
        x = l[n].price
    except:
        x = np.nan
    return(x)

def sliceSize(l, n):
    try:
        x = l[n].size
    except:
        x = np.nan
    return(x)

def wapPrice(l, n):
    # volume weighted average price over the top n rungs of the ladder
    try:
        x = round(sum( [rung.price * rung.size for rung in l[0:n] ] ) / sum( [rung.size for rung in l[0:n] ]),2)
    except:
        x = np.nan
    return(x)

def ladder_traded_volume(ladder):
    return(sum([rung.size for rung in ladder]))

# Core Execution Functions
# _________________________________

def extract_components_from_stream(s):

    with patch("builtins.open", lambda f, _: f):   

        evaluate_market = None
        prev_market = None
        postplay = None
        preplay = None
        t5m = None
        t30s = None
        inplay_min_lay = None

        gen = s.get_generator()

        for market_books in gen():

            for market_book in market_books:

                # If market doesn't meet filter, return Nones
                if evaluate_market is None and ((evaluate_market := filter_market(market_book)) == False):
                    return (None, None, None, None, None, None)

                # final market view before market goes in play
                if prev_market is not None and prev_market.inplay != market_book.inplay:
                    preplay = market_book

                # final market view before market is closed for settlement
                if prev_market is not None and prev_market.status == "OPEN" and market_book.status != prev_market.status:
                    postplay = market_book

                # Calculate Seconds Till Scheduled Market Start Time
                seconds_to_start = (market_book.market_definition.market_time - market_book.publish_time).total_seconds()

                # Market at 30 seconds before scheduled off
                if t30s is None and seconds_to_start < 30:
                    t30s = market_book

                # Market at 5 mins before scheduled off
                if t5m is None and seconds_to_start < 5*60:
                    t5m = market_book

                # Manage Inplay Vectors
                if market_book.inplay:

                    if inplay_min_lay is None:
                        inplay_min_lay = [ slicePrice(runner.ex.available_to_lay,0) for runner in market_book.runners]
                    else:
                        inplay_min_lay = np.fmin(inplay_min_lay, [ slicePrice(runner.ex.available_to_lay,0) for runner in market_book.runners])

                # update reference to previous market
                prev_market = market_book

        # If market didn't go inplay
        if postplay is not None and preplay is None:
            preplay = postplay
            inplay_min_lay = ["" for runner in market_book.runners]

        return (t5m, t30s, preplay, postplay, inplay_min_lay, prev_market) # Final market is last prev_market

def parse_stream(stream_files, output_file):

    with open(output_file, "w+") as output:

        output.write("market_id,selection_id,selection_name,wap_5m,wap_30s,bsp,ltp,traded_vol,inplay_min_lay\n")

        for file_obj in load_markets(stream_files):

            stream = trading.streaming.create_historical_generator_stream(
                file_path=file_obj,
                listener=listener,
            )

            (t5m, t30s, preplay, postplay, inplayMin, final) = extract_components_from_stream(stream)

            # If no price data for market don't write to file
            if postplay is None or final is None or t30s is None:
                continue

            # All runners removed
            if all(runner.status == "REMOVED" for runner in final.runners):
                continue

            runnerMeta = [
                {
                    'selection_id': r.selection_id,
                    'selection_name': next((rd.name for rd in final.market_definition.runners if rd.selection_id == r.selection_id), None),
                    'selection_status': r.status,
                    'sp': r.sp.actual_sp
                }
                for r in final.runners 
            ]

            ltp = [runner.last_price_traded for runner in preplay.runners]

            tradedVol = [ ladder_traded_volume(runner.ex.traded_volume) for runner in postplay.runners ]

            wapBack30s = [ wapPrice(runner.ex.available_to_back, 3) for runner in t30s.runners]

            wapBack5m = [ wapPrice(runner.ex.available_to_back, 3) for runner in t5m.runners]

            # Writing To CSV
            # ______________________

            for (runnerMeta, ltp, tradedVol, inplayMin, wapBack5m, wapBack30s) in zip(runnerMeta, ltp, tradedVol, inplayMin, wapBack5m, wapBack30s):

                if runnerMeta['selection_status'] != 'REMOVED':

                    output.write(
                        "{},{},{},{},{},{},{},{},{}\n".format(
                            str(final.market_id),
                            runnerMeta['selection_id'],
                            runnerMeta['selection_name'],
                            wapBack5m,
                            wapBack30s,
                            runnerMeta['sp'],
                            ltp,
                            round(tradedVol),
                            inplayMin
                        )
                    )

def get_bf_markets(dte):

    url = 'https://apigateway.betfair.com.au/hub/racecard?date={}'.format(dte)

    responseJson = requests.get(url).json()

    marketList = []

    for meeting in responseJson['MEETINGS']:
        for markets in meeting['MARKETS']:
            marketList.append(
                {
                    'date': dte,
                    'track': meeting['VENUE_NAME'],
                    'country': meeting['COUNTRY'],
                    'race_type': meeting['RACE_TYPE'],
                    'race_number': markets['RACE_NO'],
                    'market_id': str('1.' + markets['MARKET_ID']),
                    'start_time': markets['START_TIME']
                }
            )

    marketDf = pd.DataFrame(marketList)

    return(marketDf)

def get_bf_race_meta(market_id):

    url = 'https://apigateway.betfair.com.au/hub/raceevent/{}'.format(market_id)

    responseJson = requests.get(url).json()

    if 'error' in responseJson:
        return(pd.DataFrame())

    raceList = []

    for runner in responseJson['runners']:

        if 'isScratched' in runner and runner['isScratched']:
            continue

        # Jockey not always populated
        try:
            jockey = runner['jockeyName']
        except:
            jockey = ""

        # Place not always populated
        try:
            placeResult = runner['placedResult']
        except:
            placeResult = ""

        # Trainer not always populated
        try:
            trainer = runner['trainerName']
        except:
            trainer = ""

        raceList.append(
            {
                'market_id': market_id,
                'weather': responseJson['weather'],
                'track_condition': responseJson['trackCondition'],
                'race_distance': responseJson['raceLength'],
                'selection_id': runner['selectionId'],
                'selection_name': runner['runnerName'],
                'barrier': runner['barrierNo'],
                'place': placeResult,
                'trainer': trainer,
                'jockey': jockey,
                'weight': runner['weight']
            }
        )

    raceDf = pd.DataFrame(raceList)

    return(raceDf)

def scrape_thoroughbred_bf_date(dte):

    markets = get_bf_markets(dte)

    if markets.shape[0] == 0:
        return(pd.DataFrame())

    thoMarkets = markets.query('country == "AUS" and race_type == "R"')

    if thoMarkets.shape[0] == 0:
        return(pd.DataFrame())

    raceMetaList = []

    for market in thoMarkets.market_id:
        raceMetaList.append(get_bf_race_meta(market))

    raceMeta = pd.concat(raceMetaList)

    return(markets.merge(raceMeta, on = 'market_id'))


# Execute Data Pipeline
# _________________________________

# Description:
#   Will loop through a set of dates (starting July 2020 in this instance) and return race metadata from betfair 
# Estimated Time:
#   ~60 mins
# 
# if __name__ == '__main__':
    # dataList = []
    # dateList = pd.date_range(date(2020,7,1),date.today()-timedelta(days=1),freq='d')
    # for dte in dateList:
    #     dte = dte.date()
    #     print(dte)
    #     races = scrape_thoroughbred_bf_date(dte)
    #     dataList.append(races)
    # data = pd.concat(dataList)
    # data.to_csv("[LOCAL PATH SOMEWHERE]", index=False)


# Description:
#   Will loop through a set of stream data archive files and extract a few key pricing measures for each selection
# Estimated Time:
#   ~6 hours
#
# trading = betfairlightweight.APIClient("username", "password")
# listener = StreamListener(max_latency=None)
# stream_files = glob.glob("[PATH TO LOCAL FOLDER STORING ARCHIVE FILES]*.tar")
# output_file = "[SOME OUTPUT DIRECTORY]/thoroughbred-odds-2021.csv"
# if __name__ == '__main__':
#     parse_stream(stream_files, output_file)


# Analysis
# _________________________________


# Functions ++++++++

def bet_eval_metrics(d):

    # Aggregate net profit, total stake and strike rate, then add profit on turnover (pot)
    metrics = (
        pd.DataFrame(d.agg({"npl": "sum", "stake": "sum", "win": "mean"}))
        .transpose()
        .assign(pot=lambda x: x['npl'] / x['stake'])
    )

    return(metrics[metrics['stake'] != 0])

def pl_pValue(number_bets, npl, stake, average_odds):

    # Two-sided p-value for the observed profit on turnover (pot), treating each
    # unit-stake bet as an independent draw at the average odds
    pot = npl / stake

    tStatistic = (pot * np.sqrt(number_bets)) / np.sqrt( (1 + pot) * (average_odds - 1 - pot) )

    pValue = 2 * t.cdf(-abs(tStatistic), number_bets-1)

    return(np.where(np.logical_or(np.isnan(pValue), pValue == 0), 1, pValue))
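
# The t-statistic above treats each unit-stake bet as a draw with mean pot and
# variance (1 + pot) * (average_odds - 1 - pot), which is the per-bet P/L
# variance implied by a win rate of (1 + pot) / average_odds
# Quick sanity check with illustrative numbers (not from our data): finishing
# 10 units up from 100 bets at average odds of $5 is nowhere near significant
pl_pValue(number_bets = 100, npl = 10, stake = 100, average_odds = 5)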

def distance_group(distance):

    # pd.to_numeric with errors="coerce" yields NaN (not None) for missing values,
    # so test with pd.isna which catches both
    if pd.isna(distance):
        return("missing")
    elif distance < 1100:
        return("sprint")
    elif distance < 1400:
        return("mid_short")
    elif distance < 1800:
        return("mid_long")
    else:
        return("long")

def barrier_group(barrier):
    if pd.isna(barrier):
        return("missing")
    elif barrier < 4:
        return("inside")
    elif barrier < 9:
        return("mid_field")
    else:
        return("outside")
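
# For example: a 1200m race maps to "mid_short" and barrier 2 to "inside"
# distance_group(1200), barrier_group(2)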

# Analysis ++++++++

# Local Paths (will be different on your machine)
path_odds_local = "[PATH TO YOUR LOCAL FILES]/thoroughbred-odds-2021.csv"
path_race_local = "[PATH TO YOUR LOCAL FILES]/thoroughbred-race-data.csv"

odds = pd.read_csv(path_odds_local, dtype={'market_id': object, 'selection_id': object})
race = pd.read_csv(path_race_local, dtype={'market_id': object, 'selection_id': object})

# Joining two datasets
df = race.merge(odds.loc[:, odds.columns != 'selection_name'], how = "inner", on = ['market_id', 'selection_id'])

# I'll also add columns for the net profit from backing and laying each selection at BSP
# (flat 1 unit stakes, 5% commission on winnings) to be picked up in subsequent sections
df['back_npl'] = np.where(df['place'] == 1, 0.95 * (df['bsp']-1), -1)
df['lay_npl'] = np.where(df['place'] == 1, -1 * (df['bsp']-1), 0.95)
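# e.g. a winner at a BSP of 6.0 gives back_npl = 0.95 * 5 = 4.75 and lay_npl = -5.0,
# while any loser gives back_npl = -1 and lay_npl = 0.95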

# Adding Variable Chunks
df['distance_group'] = pd.to_numeric(df.race_distance, errors = "coerce").apply(distance_group)
df['barrier_group'] = pd.to_numeric(df.barrier, errors = "coerce").apply(barrier_group)

# Data Partitioning
dfTrain = df.query('date < "2021-04-01"')
dfTest = df.query('date >= "2021-04-01"')

'{} rows in the "training" set and {} rows in the "test" set'.format(dfTrain.shape[0], dfTest.shape[0])

# Angle 1 ++++++++++++++++++++++++++++++++++++++++++++++
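# Naive first look: the individual horses that have been most profitable to
# back at BSP in the training set (tiny samples, so illustrative only)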

(
    dfTrain
    .assign(stake=1)
    .groupby('selection_name', as_index = False)
    .agg({'back_npl': 'sum', 'stake': 'sum'})
    .assign(pot=lambda x: x['back_npl'] / x['stake'])
    .sort_values('pot', ascending=False)  
    .head(3) 
)

# Calculate the profit (back and lay) and average odds across all track / distance / barrier group combos
trackDistanceBarrier = (
    dfTrain
    .assign(stake = 1)
    .assign(odds = lambda x: x['bsp'])
    .groupby(['track', 'race_distance', 'barrier_group'], as_index=False)
    .agg({'back_npl': 'sum', 'lay_npl': 'sum','stake': 'sum', 'odds': 'mean'})
)

trackDistanceBarrier
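
# With flat 1 unit stakes the summed stake doubles as the bet count, so it
# can be passed straight to pl_pValue as number_bets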

trackDistanceBarrier = (
    trackDistanceBarrier
    .assign(backPL_pValue = lambda x: pl_pValue(number_bets = x['stake'], npl = x['back_npl'], stake = x['stake'], average_odds = x['odds']))
    .assign(layPL_pValue = lambda x: pl_pValue(number_bets = x['stake'], npl = x['lay_npl'], stake = x['stake'], average_odds = x['odds']))
)

trackDistanceBarrier

# Top 5 lay combos Track | Distance | Barrier (TDB)
TDB_bestLay = trackDistanceBarrier.query('lay_npl > 0').sort_values('layPL_pValue').head(5)
TDB_bestLay

# First let's test laying on the train set (by definition we know these will be profitable)
# Note we match on the full track / distance / barrier group combo that defines the angle
train_TDB_bestLay = (
    dfTrain
    .merge(TDB_bestLay[['track', 'race_distance', 'barrier_group']])
    .assign(npl=lambda x: x['lay_npl'])
    .assign(stake=1)
    .assign(win=lambda x: np.where(x['lay_npl'] > 0, 1, 0))
)

# This is the key test (none of these races have been part of the analysis to this point)
test_TDB_bestLay = (
    dfTest
    .merge(TDB_bestLay[['track', 'race_distance', 'barrier_group']])
    .assign(npl=lambda x: x['lay_npl'])
    .assign(stake=1)
    .assign(win=lambda x: np.where(x['lay_npl'] > 0, 1, 0))
)

# Peeking at the bets in the test set
test_TDB_bestLay[['track', 'race_distance', 'barrier', 'barrier_group', 'bsp', 'lay_npl', 'win', 'stake']]

# Let's run our evaluation on the training set
bet_eval_metrics(train_TDB_bestLay)

# And on the test set
bet_eval_metrics(test_TDB_bestLay)
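
# If the edge is real, the test-set pot should hold up out of sample; a drop
# to around zero or below suggests the training profit was noise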

# Angle 2 ++++++++++++++++++++++++++++++++++++++++++++++
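# A wap_5m / wap_30s ratio above 1 means the price shortened over the last
# 5 minutes, i.e. the runner attracted late market support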

(
    dfTrain
    .assign(market_support=lambda x: x['wap_5m'] / x['wap_30s'])
    .assign(races=1)
    .groupby('jockey')
    .agg({'market_support': 'mean', 'races': 'count'})
    .query('races > 10')
    .sort_values('market_support', ascending = False)
    .head()
)

# Group By Jockey and Market Support
jockeys = (
    dfTrain
    .assign(stake = 1)
    .assign(odds = lambda x: x['bsp'])
    .assign(npl=lambda x: np.where(x['place'] == 1, 0.95 * (x['odds']-1), -1))
    .assign(market_support=lambda x: np.where(x['wap_5m'] > x['wap_30s'], "Y", "N"))
    .groupby(['jockey', 'market_support'], as_index=False)
    .agg({'odds': 'mean', 'stake': 'sum', 'npl': 'sum'})
    .assign(pValue = lambda x: pl_pValue(number_bets = x['stake'], npl = x['npl'], stake = x['stake'], average_odds = x['odds']))
)

jockeys.sort_values('pValue').query('npl > 0').head(10)
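
# Caveat: screening hundreds of jockey / market-support combos and keeping the
# ten smallest p-values invites false positives, which is exactly why the
# out-of-sample test below matters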

# First evaluate on our training set
train_jockeyMarket = (
    dfTrain
    .assign(market_support=lambda x: np.where(x['wap_5m'] > x['wap_30s'], "Y", "N"))
    .merge(jockeys.sort_values('pValue').query('npl > 0').head(10)[['jockey', 'market_support']])
    .assign(stake = 1)
    .assign(odds = lambda x: x['bsp'])
    .assign(npl=lambda x: np.where(x['place'] == 1, 0.95 * (x['odds']-1), -1))
    .assign(win=lambda x: np.where(x['npl'] > 0, 1, 0))
)

# And on the test set
test_jockeyMarket = (
    dfTest
    .assign(market_support=lambda x: np.where(x['wap_5m'] > x['wap_30s'], "Y", "N"))
    .merge(jockeys.sort_values('pValue').query('npl > 0').head(10)[['jockey', 'market_support']])
    .assign(stake = 1)
    .assign(odds = lambda x: x['bsp'])
    .assign(npl=lambda x: np.where(x['place'] == 1, 0.95 * (x['odds']-1), -1))
    .assign(win=lambda x: np.where(x['npl'] > 0, 1, 0))
)

bet_eval_metrics(train_jockeyMarket)

bet_eval_metrics(test_jockeyMarket)

# Angle 3 ++++++++++++++++++++++++++++++++++++++++++++++


# First Investigate The Average Inplay Minimums And Win Rates Of Certain Jockeys
tradeOutIndex = (
    dfTrain
    .query('distance_group in ["long", "mid_long"]')
    .assign(inplay_odds_ratio=lambda x: x['inplay_min_lay'] / x['bsp'])
    .assign(win=lambda x: np.where(x['place']==1,1,0))
    .assign(races=lambda x: 1)
    .groupby(['jockey'], as_index=False)
    .agg({'inplay_odds_ratio': 'mean', 'win': 'mean', 'races': 'sum'})
    .sort_values('inplay_odds_ratio')
    .query('races >= 5')
)

tradeOutIndex

targetTradeoutFraction = 0.5
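
# The trade-out payoff per runner: back 1 unit at BSP and ask to lay 1 unit at
# targetTradeoutFraction * BSP in-play
#  - lay matched and the horse wins: 0.95 * ((bsp - 1) - (targetTradeoutFraction * bsp - 1))
#  - lay matched and the horse loses: the two bets cancel for a net of 0
#  - lay never matched: treated as a 1 unit loss (winners almost always trade
#    well below their BSP in-play, so unmatched runners are assumed losers)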

train_JockeyBackToLay = (
    dfTrain
    .query('distance_group in ["long", "mid_long"]')
    .merge(tradeOutIndex.head(20)['jockey'])
    .assign(npl=lambda x: np.where(x['inplay_min_lay'] <= targetTradeoutFraction * x['bsp'], np.where(x['place'] == 1, 0.95 * (x['bsp']-1-(targetTradeoutFraction*x['bsp']-1)), 0), -1))
    .assign(stake=lambda x: np.where(x['npl'] != -1, 2, 1))
    .assign(win=lambda x: np.where(x['npl'] >= 0, 1, 0))
)

bet_eval_metrics(train_JockeyBackToLay)

test_JockeyBackToLay = (
    dfTest
    .query('distance_group in ["long", "mid_long"]')
    .merge(tradeOutIndex.head(20)['jockey'])
    .assign(npl=lambda x: np.where(x['inplay_min_lay'] <= targetTradeoutFraction * x['bsp'], np.where(x['place'] == 1, 0.95 * (x['bsp']-1-(targetTradeoutFraction*x['bsp']-1)), 0), -1))
    .assign(stake=lambda x: np.where(x['npl'] != -1, 2, 1))
    .assign(win=lambda x: np.where(x['npl'] >= 0, 1, 0))
)

bet_eval_metrics(test_JockeyBackToLay)

Disclaimer

Note that whilst models and automated strategies are fun and rewarding to create, we can't promise that your model or betting strategy will be profitable, and we make no representations in relation to the code shared or information on this page. If you're using this code or implementing your own strategies, you do so entirely at your own risk and you are responsible for any winnings/losses incurred. Under no circumstances will Betfair be liable for any loss or damage you suffer.