
Greyhound Modelling in Python using the Topaz API

Building a greyhound racing model using Python and Machine Learning

This tutorial is a refreshed version of our previous tutorials utilising the new version of the FastTrack API (now called Topaz). Topaz is a product provided to Betfair Australia & New Zealand customers by Greyhound Racing Victoria (GRV).

If you would like your own Topaz API key, please contact us here. Access can only be provided to Betfair Australia or New Zealand customers with active accounts.


Overview

This tutorial will walk you through the steps required to generate greyhound racing win probabilities:

  1. Download historic greyhound data from Topaz API
  2. Cleanse and normalise the data
  3. Generate features using raw data
  4. Build and train classification models
  5. Evaluate models' performances
  6. Evaluate feature importance

Requirements

  • Coding environment which supports Jupyter Notebooks (e.g. Visual Studio Code)
  • Betfair API Key. If you don't have one, please follow the steps outlined on The Automation Hub
  • Topaz API Key. If you would like to be considered for a Topaz key, please email data@betfair.com.au (Australian/New Zealand customers only).
  • Python Topaz API wrapper. To install this package using pip, type 'pip install topaz_api' into your terminal

Historic Data

To get started on building our own Topaz model, we first need to download the historic data from the Topaz API. The API has rate limits in place, so for a bulk download of historic data we need some way of handling those limits in order to download the data with a high degree of completeness. After all, the modelling maxim 'Garbage In, Garbage Out' holds true: if you don't feed your model good, complete data, you won't have much success.

In the code block below there are multiple retry loops and sleep calls to allow the rate limits to reset. We also request the races for each state in blocks of 10 days.

NOTE: For state / date-range combinations where there are genuinely no races, the function will error repeatedly until it reaches the maximum specified retries before continuing to the next block. This may occur during the pandemic shutdown period in 2020, the NSW greyhound racing ban in 2017, or for the 'NT' jurisdiction where greyhound meetings may be up to 14 days apart.

Downloading Historic Data
import pandas as pd
from tqdm import tqdm
import time
from datetime import datetime, timedelta
from topaz import TopazAPI
from sklearn.preprocessing import MinMaxScaler 
import numpy as np
import requests
import os
import itertools

# Insert your TOPAZ API key 
TOPAZ_API_KEY = ''

# Define the states you require
JURISDICTION_CODES = ['NSW','VIC','QLD','WA','SA','TAS','NT','NZ']

# Define the start and end date for the historic data download
start_date = datetime(2020,1,1)
end_date = (datetime.today() - timedelta(days=1))
# It is Python convention to define hard-coded constants (like credentials) in ALL CAPS.
# Variables whose values may change during use are defined in lowercase with underscores.

def generate_date_range(start_date, end_date):
    ''' 
    Here we generate our date range for passing into the Topaz API.
    This is necessary due to the rate limits implemented in the Topaz API. 
    Passing a start date and end date 12 months apart in a single API call will cause significant amounts of data to be omitted, so we generate a list of dates and pass much smaller date ranges into each Topaz API call
    '''
    # Initialise the date_range list
    date_range = []

    current_date = start_date

    while current_date <= end_date:
        date_range.append(current_date.strftime("%Y-%m-%d"))
        current_date += timedelta(days=1)

    return date_range

def define_topaz_api(api_key):
    '''
    This is where we define the Topaz API and pass our credentials
    '''
    topaz_api = TopazAPI(api_key)

    return topaz_api

def download_topaz_data(topaz_api,date_range,codes,datatype):
    '''
    The parameters passed here are:
        1. The TopazAPI instance with our credentials
        2. Our date range that we generated which is in the format of a list of strings
        3. Our list of JURISDICTION_CODES
        4. Our datatype (either 'HISTORICAL' or 'UPCOMING'). 
            It is important to separate these in our database to ensure that the upcoming races with lots of empty fields (like margin) do not contaminate our historical dataset.
    '''

    # Iterate over 10-day blocks
    for i in range(0, len(date_range), 10):
        start_block_date = date_range[i]
        print(start_block_date)
        end_block_date = date_range[min(i + 9, len(date_range) - 1)]  # Ensure the end date is within the range


        for code in codes:
            # initialise the race list
            all_races = []

            print(code)

            retries = 10  # Number of retries
            '''
            In this code block we are attempting to download a list of raceIds by passing our JURISDICTION_CODES and date range.
            The sleep functions are included to allow time for the rate limits to reset and any other errors to clear.
            The Topaz API does not return much detail in error messages and through extensive trial and error, we have found the best way to resolve errors is simply to pause for a short time and then try again.
            After 10 retries, the function will move to the next block. This will usually only occur if there actually were zero races for that JURISDICTION_CODE and date_range or there is a total Topaz API outage.
            '''
            while retries > 0:
                try:
                    races = topaz_api.get_races(from_date=start_block_date, to_date=end_block_date, owning_authority_code=code)
                    all_races.append(races)
                    break  # Break out of the loop if successful
                except requests.HTTPError as http_err:
                    if http_err.response.status_code == 429:
                        retries -= 1
                        if retries > 0:
                            print(f"Rate limited. Retrying in 2 minutes...")
                            time.sleep(121)
                        else:
                            print("Max retries reached. Moving to the next block.")
                    else:
                        print(f"Error fetching races for {code}: {http_err.response.status_code}")
                        retries -= 1
                        if retries > 0:
                            print(f"Retrying in 30 seconds...")
                            time.sleep(30)
                        else:
                            print("Max retries reached. Moving to the next block.")

            try:
                all_races_df = pd.concat(all_races, ignore_index=True)
            except ValueError:
                continue

            # Extract unique race IDs
            race_ids = list(all_races_df['raceId'].unique())

            # Define the file path
            file_path = code + '_DATA_'+datatype+'.csv'

            # Use tqdm to create a progress bar
            for race_id in tqdm(race_ids, desc="Processing races", unit="race"):
                result_retries = 10
                '''
                Here we utilise the retry logic again to maximise the chances of our data being complete.
                We spend some time here gathering the split position and split time. While these fields are not utilised in the completed model, the process to extract this data has been included for the sake of completeness, as it is not straightforward.

                '''
                while result_retries > 0:
                    # Get race run data
                    try:
                        race_run = topaz_api.get_race_runs(race_id=race_id)
                        race_result_json = topaz_api.get_race_result(race_id=race_id)

                        file_exists = os.path.isfile(file_path)
                        header_param = not file_exists

                        race_result = pd.DataFrame.from_dict([race_result_json])
                        split_times_df = pd.DataFrame(race_result['splitTimes'].tolist(),index=race_result.index)

                        splits_dict = split_times_df.T.stack().to_frame()
                        splits_dict.reset_index(drop=True, inplace= True)
                        splits_normalised = pd.json_normalize(splits_dict[0])

                        if len(splits_normalised) == 0:
                            race_run.to_csv(file_path, mode='a', header=header_param, index=False)
                            break

                        first_split = splits_normalised[splits_normalised['splitTimeMarker'] == 1]
                        first_split = first_split[['runId','position','time']]
                        first_split = first_split.rename(columns={'position':'firstSplitPosition','time':'firstSplitTime'})
                        second_split = splits_normalised[splits_normalised['splitTimeMarker'] == 2]
                        second_split = second_split[['runId','position','time']]
                        second_split = second_split.rename(columns={'position':'secondSplitPosition','time':'secondSplitTime'})

                        split_times = splits_normalised[['runId']]
                        split_times = pd.merge(split_times,first_split,how='left',on=['runId'])
                        split_times = pd.merge(split_times,second_split,how='left',on=['runId'])

                        race_run = pd.merge(race_run,split_times,how='left',on=['runId'])
                        race_run.drop_duplicates(inplace=True)
                        race_run.to_csv(file_path, mode='a', header=header_param, index=False)
                        break
                    except requests.HTTPError as http_err:
                        if http_err.response.status_code == 404:
                            # No results exist for this race - write the race run data without split times
                            file_exists = os.path.isfile(file_path)
                            header_param = not file_exists
                            race_run.to_csv(file_path, mode='a', header=header_param, index=False)
                            break
                        else:
                            # Any other HTTP error (e.g. a 429 rate limit) - pause briefly and retry
                            result_retries -= 1
                            time.sleep(15)
                    except Exception as e:
                        print(race_id)
                        result_retries -= 1
                        if result_retries > 0:
                            time.sleep(15)
                        else:
                            time.sleep(120)

The Topaz races endpoint requires a jurisdiction code to be passed as a parameter in order to pull results; simply passing a date range will return no data. This is why we have looped over all jurisdictions in 10-day blocks.

Bulk downloading the historical data in this way may take a while depending on how many years of data you are requesting. Bruno, the author of one of our previous greyhound modelling tutorials, uses six years' worth of data for his backtesting. This, however, is computationally expensive to process in a machine learning model, so we suggest 2-3 years for those just starting out.
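
For example, to request roughly three years of history rather than a fixed start date, you could set the dates relative to today (a minimal sketch; adjust the window to suit your machine and patience):

# Roughly three years of history, ending yesterday
start_date = datetime.today() - timedelta(days=3 * 365)
end_date = datetime.today() - timedelta(days=1)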

NOTE: In the above code we have exported each state separately to its own csv file. This keeps each file under a million rows, which means you can manually inspect the data by opening the file in Excel. This is not required (we will pull each file back into our model before we begin to process the data).
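
Once the download finishes, a quick sanity check of what was written can save debugging later. Here's a minimal sketch (assuming the file-naming pattern used above) that prints the row count of each state's file:

for code in JURISDICTION_CODES:
    file_path = code + '_DATA_HISTORICAL.csv'
    if os.path.isfile(file_path):
        print(code, len(pd.read_csv(file_path, low_memory=False)), 'rows')
    else:
        print(code, 'no file written')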

Cleaning the data

Let's pull in all of our state files and get to work cleaning this data!

Cleaning The Data
def collect_topaz_data(api_key,codes,start_date,end_date,datatype):
    '''
    This function here combines our three previously defined functions into one neat function in the correct order of execution.
    As the data is written to csv files, we do not need to return anything at the end of the function. This also means that it is not necessary to define a variable as the output of a function
    '''
    date_range = generate_date_range(start_date,end_date)

    topaz_api = define_topaz_api(api_key)

    download_topaz_data(topaz_api,date_range,codes,datatype)

collect_topaz_data(TOPAZ_API_KEY,JURISDICTION_CODES,start_date,end_date,'HISTORICAL')

def load_topaz_data(codes,datatype):
    '''
    This function loops over our previously written csv files and loads them into one dataframe.
    Due to the time taken to gather data from the Topaz API and the sheer size of the files, it makes sense to store the Topaz data locally as csv files rather than as a variable in the operating environment.
    If the python kernel needs to be reset for whatever reason, all stored variables are lost - storing the data as csv files means we can simply reload them rather than spending another 24 hours re-downloading everything
    '''
    # initialise the dataframe
    TopazDataAll = pd.DataFrame()

    # loop over all csv files
    for code in codes:
        # Read the csv, parsing two columns into datetime format
        StateData = pd.read_csv(code+'_DATA_'+datatype+'.csv',low_memory=False,parse_dates=['meetingDate','dateWhelped'])
        # Add the state as a column
        StateData['state']=code
        # Concatenate the dataframes
        TopazDataAll = pd.concat([TopazDataAll,StateData])

    return TopazDataAll

'''
Because this function returns a dataframe, we need to assign the function's output to a named variable. This name can differ from the name of the variable that appears after the 'return' statement inside the function.
Simply calling the function without assigning its output to a variable means the result is discarded, which would affect our downstream operations.
'''

TopazDataAll = load_topaz_data(JURISDICTION_CODES,'HISTORICAL')

def discard_scratched_runners_data(TopazDataAll):
    '''
    This function will discard all scratched runners and abandoned races by dropping all rows where place = None
    Note that this should only be done for past races, as upcoming races will also have place = None, because they have not been run yet, and so dropping these would remove all upcoming races
    '''
    # Discard scratched runners and abandoned races
    TopazDataAll.dropna(subset=['place'], how='all', inplace=True)

    return TopazDataAll

TOPAZ_COLUMNS_TO_KEEP = ['state',    
                        'track',
                        'distance',
                        'raceId',
                        'meetingDate',
                        'raceTypeCode',
                        'runId',
                        'dogId',
                        'dogName',
                        'weightInKg',
                        'gradedTo',
                        'rating',
                        'raceNumber',
                        'boxNumber',
                        'rugNumber',
                        'sex',
                        'trainerId',
                        'trainerState',
                        'damId',
                        'damName',
                        'sireId',
                        'sireName',
                        'dateWhelped',
                        'last5',
                        'pir',
                        'place',
                        'prizeMoney',
                        'resultTime',
                        'resultMargin']

def discard_unnecessary_columns(TopazDataAll,columns):
    '''
    This function serves to keep only the subset of columns that we will use for our model.
    Columns have been discarded because:
     - All entries are identical or null
     - Entries are duplicates of another column - simply being abbreviated in this column
     - Entries contain data leakage which is data that has been added after the race has been run and would not be available for a live model
     - Significant amounts of missing data

    Duplicate rows have also been discarded. These rows may exist if the collect_topaz_data function has been run for the same date more than once.
    Each race a dog runs will have a unique runId which is why this column is ideal for identifying duplicates
    '''
    # Keep the required columns
    TopazData = TopazDataAll[columns]

    # Drop duplicate runIds
    TopazData = TopazData.drop_duplicates(subset=['runId'])

    # Reset the index
    TopazData.reset_index(drop=True, inplace=True)

    return TopazData

TopazDataAll = discard_scratched_runners_data(TopazDataAll)
TopazData = discard_unnecessary_columns(TopazDataAll,TOPAZ_COLUMNS_TO_KEEP)

# Let's correct the track names to align with Betfair names (not all tracks in Topaz are on the Exchange, due to closure or not hosting TAB meetings)
# We're not using the NZ tracks in this tutorial, however they are included below for completeness
# The "Straight" tracks at Murray Bridge and Richmond are not differentiated on Betfair but we can treat these later.

TRACK_DICTIONARY= {
                    'Auckland (NZ)':'Manukau',
                    'Christchurch (NZ)':'Addington',
                    'Dport @ HOB':'Hobart',
                    'Dport @ LCN':'Launceston',
                    'Meadows (MEP)':'The Meadows',
                    'Otago (NZ)':'Forbury Park',
                    'Palmerston Nth (NZ)':'Manawatu',
                    'Sandown (SAP)':'Sandown Park',
                    'Southland (NZ)':'Ascot Park',
                    'Tokoroa (NZ)':'Tokoroa',
                    'Waikato (NZ)':'Cambridge',
                    'Wanganui (NZ)':'Hatrick',
                    'Taranaki (NZ)':'Taranaki',
                    'Ashburton (NZ)':'Ashburton',
                    'Richmond (RIS)':'Richmond',
                    'Murray Bridge (MBR)':'Murray Bridge',
                    'Murray Bridge (MBS)':'Murray Bridge'
                    }

def clean_track_data(TopazData,TrackDict):
    '''
    This function replaces the track names in Topaz with the track names as they are displayed on the Betfair Exchange.
    This is important for performing our historical backtesting later on
    '''
    TopazData['track'] = TopazData['track'].replace(TrackDict)

    return TopazData

TopazData = clean_track_data(TopazData,TRACK_DICTIONARY)

def correct_result_margin(TopazData):
    '''
    resultMargin has the same value for 1st and 2nd placed dogs, but should be 0 for the 1st placed dog.
    This function simply replaces the entered value for resultMargin with 0 if the place value = 1
    '''
    # Replace margin to make sense for our model
    TopazData.loc[TopazData['place'] == 1, ['resultMargin']] = 0

    return TopazData

TopazData = correct_result_margin(TopazData)

Here we've cleaned our dataset as much as we can. Next it's on to the feature creation!

Create the features

Creating Basic Features
def generate_dogAge(TopazData):
    '''
    This function creates a dogAge variable (in days) by subtracting dateWhelped (i.e. date of birth) from the meetingDate.
    '''
    # Subtract dateWhelped from meetingDate
    TopazData['dogAge'] = (TopazData['meetingDate'] - TopazData['dateWhelped']).dt.days

    return TopazData

TopazData = generate_dogAge(TopazData)

LAST_FIVE_RACE_COLUMNS = ['posL1R', 'posL2R', 'posL3R','posL4R', 'posL5R']

def extract_form(TopazData):

    ''' 
    This function splits the last5 variable, a dash-separated string containing the dog's finishing positions in its last 5 races.
    First we cast it to a string so we can extract all the characters, as some of them may be letters (e.g. F = Fell, D = Disqualified)

    We then create five separate columns, one for each race, and fill any blank cells with 10. Cells may be blank because the dog may have run fewer than 5 previous races.
    We have chosen 10 as the padding value because a greyhound can never finish in 10th place owing to the maximum field size of 8 in greyhound racing.
    '''
    # Ensure variable is a string
    TopazData['last5'] = TopazData['last5'].astype(str)
    # Function to extract numbers from the 'last5' column
    def extract_numbers(row):
        try:
            numbers = list(map(int, row.split('-')))
            # If there are fewer than 5 numbers, pad with tens
            numbers += [10] * (5 - len(numbers))
            return numbers
        except ValueError:
            # Handle the case where the string cannot be split into integers
            return [10, 10, 10, 10, 10]

    # Apply the function to create new columns for each position
    TopazData[LAST_FIVE_RACE_COLUMNS] = TopazData['last5'].apply(extract_numbers).apply(pd.Series)

    return TopazData

TopazData = extract_form(TopazData)

def generate_winPercentage(TopazData):
    '''
    This function generates the percentage of the dog's previous 5 races in which it has either won or placed in the top 3.
    For the win percentage, we count the number of 1s in the position columns generated previously and divide by the number of columns that are not equal to 10 (our padding value for races that did not exist).
    For the top 3 place percentage, we instead count positions that are less than or equal to 3 and divide by the same denominator.
    '''
    # Generate win percentage from the last 5 races
    TopazData['lastFiveWinPercentage'] = ((TopazData[LAST_FIVE_RACE_COLUMNS] == 1).sum(axis=1)) / ((TopazData[LAST_FIVE_RACE_COLUMNS] != 10).sum(axis=1))

    # Fill any empty values with 0
    TopazData['lastFiveWinPercentage'].fillna(0,inplace=True)

    # Generate top 3 place percentage from the last 5 races
    TopazData['lastFivePlacePercentage'] = ((TopazData[LAST_FIVE_RACE_COLUMNS] <= 3).sum(axis=1)) / ((TopazData[LAST_FIVE_RACE_COLUMNS] != 10).sum(axis=1))

    # Fill any empty values with 0
    TopazData['lastFivePlacePercentage'].fillna(0,inplace=True)

    return TopazData

TopazData = generate_winPercentage(TopazData)

def generate_finishingPlaceMovement(TopazData):
    '''
    This function attempts to determine whether the dog tends to fade through the field late in the race or finishes strongly, by finding the difference between its final place and its field position at the last split.
    We first fill empty rows and convert to a string, before extracting the second last character from the string.
    Some tracks have much more detailed position data, but these make up a minority of races, so using the fourth or fifth last in-running position would omit most of the data and is not overly useful for modelling purposes.

    If the pir variable has only 1 character, it corresponds to the final placing, so the fillna operation on the 2ndLastPIR variable simply fills in the dog's final place, giving a finishingPlaceMovement of 0
    '''
    # Fill any empty values for pir with 0
    TopazData['pir'] = TopazData['pir'].fillna(0)

    # Convert the 'pir' column to string
    TopazData['pir'] = TopazData['pir'].astype(int).astype(str)

    # Extract the second last letter and create a new column '2ndLastPIR'
    TopazData['2ndLastPIR'] = TopazData['pir'].apply(lambda x: x[-2] if len(x) >= 2 else None)

    # Fill any empty values for 2ndLastPIR with the dog's place
    TopazData['2ndLastPIR'].fillna(TopazData['place'],inplace=True)

    # Convert the column to integer format
    TopazData['2ndLastPIR'] = TopazData['2ndLastPIR'].astype(int)

    # Create a feature that calculates places gained/conceded in the home straight
    TopazData['finishingPlaceMovement'] = TopazData['2ndLastPIR'] - TopazData['place']

    return TopazData

TopazData = generate_finishingPlaceMovement(TopazData)

def scale_values(TopazData):

    '''
    This function scales the 'dogAge' and 'weightInKg' columns based on the 'raceId' groups using Min-Max scaling. 
    Additionally, it applies logarithmic transformations to 'prizeMoney','place', and 'resultMargin' columns. 
    These transformations are performed to ensure that the data follows a more Gaussian distribution and meets the assumptions of certain statistical analyses. 
    '''
    # Scale existing values using MinMaxScaler
    TopazData['dogAgeScaled'] = TopazData.groupby('raceId')['dogAge'].transform(lambda x: MinMaxScaler().fit_transform(x.values.reshape(-1, 1)).flatten())
    TopazData['weightInKgScaled'] = TopazData.groupby('raceId')['weightInKg'].transform(lambda x: MinMaxScaler().fit_transform(x.values.reshape(-1, 1)).flatten())

    # Transform existing values using log functions
    TopazData['prizemoneyLog'] = np.log10(TopazData['prizeMoney'] + 1)
    TopazData['placeLog'] = np.log10(TopazData['place'] + 1)
    TopazData['marginLog'] = np.log10(TopazData['resultMargin'] + 1)

    return TopazData

TopazData = scale_values(TopazData)

def generate_runTimeNorm(TopazData):
    '''
    This function finds the rolling 365 day median win time for each track/distance combination by creating a dataframe that contains only winning runs.
    The median time is then compared to the dog's actual result time to produce a normalised value.
    The purpose of the clip(0.9, 1.1) is to ensure that very high or very low values (which may be the result of dirty data) do not create erroneous measurements; they are clipped to between 0.9 and 1.1

    The speed index is used to account for differences in the speed of each track, and will assign different values based on whether a track is fast or slow using the MinMaxScaler
    '''

    # Calculate median winner time per track/distance
    win_results = TopazData[TopazData['place'] == 1]

    # Find the median for each unique combination of track, distance and meetingDate
    grouped_data = win_results.groupby(['track', 'distance', 'meetingDate'])['resultTime'].median().reset_index()

    # Find the rolling 365 median of the median values
    median_win_time = pd.DataFrame(grouped_data.groupby(['track', 'distance']).apply(lambda x: x.sort_values('meetingDate').set_index('meetingDate')['resultTime'].shift(1).rolling('365D', min_periods=1).median())).reset_index()

    # Rename the column
    median_win_time.rename(columns={"resultTime": "runTimeMedian"},inplace=True)

    # Calculate the speedIndex for the track
    median_win_time['speedIndex'] = (median_win_time['runTimeMedian'] / median_win_time['distance'])

    # Apply the MinMaxScaler
    median_win_time['speedIndex'] = MinMaxScaler().fit_transform(median_win_time[['speedIndex']])

    # Merge with median winner time
    TopazData = TopazData.merge(median_win_time, how='left', on=['track', 'distance','meetingDate'])

    # Normalise time comparison
    TopazData['runTimeNorm'] = (TopazData['runTimeMedian'] / TopazData['resultTime']).clip(0.9, 1.1)

    # Fill the empty values with 0
    TopazData['runTimeNorm'].fillna(0, inplace=True)

    # Apply the MinMaxScaler
    TopazData['runTimeNorm'] = TopazData.groupby('raceId')['runTimeNorm'].transform(lambda x: MinMaxScaler().fit_transform(x.values.reshape(-1, 1)).flatten())

    return TopazData

TopazData = generate_runTimeNorm(TopazData)

def generate_adjacentVacantBox_state(TopazData):
    '''
    This function determines whether the dog has at least one vacant box directly next to them at the beginning of the race. 
    In greyhound racing, especially for very short races, a dog with more room to move at the beginning of the race is less likely to be bumped (or 'checked') if it has a vacant box next to it. 
    This is also true for dogs in box 1 as there is a lot of room adjacent to box 1. 

    We find this by ordering by raceId and boxNumber and determining if the box number above or below is filled. 
    Box 8 is counted as always having the box above filled as Box 8 is hard-up on the fence and so the dog does not have room to move here.
    We then determine if the dog has adjacent space by generating the variable 'hasAtLeast1VacantBox'
    '''

    # Sort the DataFrame by 'RaceId' and 'Box'
    TopazData = TopazData.sort_values(by=['raceId', 'boxNumber'])

    # Check whether the boxes directly above and below (boxNumber plus or minus one) are occupied
    # Note: rows are sorted by boxNumber within each race, so shift(-1) gives the next-highest occupied box and shift(1) the next-lowest
    TopazData['hasEntryBoxNumberPlus1'] = (TopazData.groupby('raceId')['boxNumber'].shift(-1) == TopazData['boxNumber'] + 1) | (TopazData['boxNumber'] == 8)
    TopazData['hasEntryBoxNumberMinus1'] = (TopazData.groupby('raceId')['boxNumber'].shift(1) == TopazData['boxNumber'] - 1)

    # Convert boolean values to 1/0
    TopazData['hasEntryBoxNumberPlus1'] = TopazData['hasEntryBoxNumberPlus1'].astype(int)
    TopazData['hasEntryBoxNumberMinus1'] = TopazData['hasEntryBoxNumberMinus1'].astype(int)

    # Calculate 'hasAtLeast1VacantBox' - 1 unless both adjacent boxes are occupied
    TopazData['hasAtLeast1VacantBox'] = ((TopazData['hasEntryBoxNumberPlus1'] + TopazData['hasEntryBoxNumberMinus1']) < 2).astype(int)

    return TopazData

TopazData = generate_adjacentVacantBox_state(TopazData)

def generate_boxWinPercentage(TopazData):
    ''' 
    This function creates a rolling win percentage for each combination of 'track', 'distance', 'boxNumber', 'hasAtLeast1VacantBox', 'meetingDate'
    The purpose of this is to help the model determine if the box placement and presence of vacant adjacent boxes can give the dog an improved chance of winning the race
    The shift(1) is used to ensure that the results of that meetingDate are excluded from the calculation to prevent data leakage
    '''
    # Create a win column
    TopazData['win'] = TopazData['place'].apply(lambda x: 1 if x == 1 else 0)

    # Find the mean (win percentage) for each 'track', 'distance', 'boxNumber', 'hasAtLeast1VacantBox', 'meetingDate' combination
    grouped_data = TopazData.groupby(['track', 'distance', 'boxNumber', 'hasAtLeast1VacantBox', 'meetingDate'])['win'].mean().reset_index()

    # set the index to be the meeting date for the rolling window calculation
    grouped_data.set_index('meetingDate', inplace=True)

    # Apply rolling mean calculation to the aggregated data
    box_win_percent = grouped_data.groupby(['track', 'distance', 'boxNumber', 'hasAtLeast1VacantBox']).apply(lambda x: x.sort_values('meetingDate')['win'].shift(1).rolling('365D', min_periods=1).mean()).reset_index()

    # Reset index and rename columns
    box_win_percent.columns = ['track', 'distance', 'boxNumber', 'hasAtLeast1VacantBox', 'meetingDate', 'rolling_box_win_percentage']

    # Add to dog results dataframe
    TopazData = TopazData.merge(box_win_percent, on=['track', 'distance', 'meetingDate','boxNumber','hasAtLeast1VacantBox'], how='left')

    return TopazData

TopazData = generate_boxWinPercentage(TopazData)

Now that we've created some basic features, it's time to create our big group of features, where we iterate over different subsets, variables and time points to generate a large number of features.

Creating Bulk Features
# Calculate values for dog, trainer, dam and sire
SUBSETS = ['dog', 'trainer', 'dam', 'sire']

# Use rolling window of 28, 91 and 365 days
ROLLING_WINDOWS = ['28D','91D', '365D']

# Features to use for rolling windows calculation
FEATURES = ['runTimeNorm', 'placeLog', 'prizemoneyLog', 'marginLog','finishingPlaceMovement']

# Aggregation functions to apply
AGGREGATES = ['min', 'max', 'mean', 'median', 'std']

def generate_rollingFeatures(TopazData):

    '''
    This function generates rolling window features for subsets including dog, trainer, dam, and sire.
    The rolling windows used are defined as 28 days, 91 days, and 365 days. 
    Features such as 'runTimeNorm','placeLog', 'prizemoneyLog', 'marginLog', and 'finishingPlaceMovement' are considered for rolling window calculation.
    The aggregation functions applied include 'min', 'max', 'mean', 'median', and 'std'.

    The function iterates through each subset, calculates rolling window features for each specified rolling window, and adds them to the dataset. 
    The generated feature names indicate the subset, feature, aggregation function, and rolling window used for calculation.

    Finally, the function returns the dataset with generated features added and a list of column names representing the generated rolling window features.
    This enables a bulk set of features to be created using statistical aggregates; we then entrust the algorithm to sort out which of them are significant

    We create a copy of the TopazData dataframe rather than modifying it directly in case of issues with this function. It means that this cell can simply be rerun instead of having to reload and reprocess the entire database
    '''
    # Create a copy of the dataframe and ensure meetingDate is in datetime format
    dataset = TopazData.copy()
    dataset['meetingDate'] = pd.to_datetime(dataset['meetingDate'])

    # Keep track of generated feature names
    feature_cols = []

    for i in SUBSETS:

        # Add the Id string
        idnumber = i + 'Id'

        # Create a subset dataframe
        subset_dataframe = dataset[['meetingDate',idnumber] + FEATURES]

        # intitialise the dataframe
        average_df = pd.DataFrame()

        for feature in FEATURES:

            # Find the mean value for the meetingDate
            feature_average_df = subset_dataframe.groupby([idnumber, 'meetingDate'])[feature].mean().reset_index()

            # Rename the feature column to indicate it's the average of that feature
            feature_average_df.rename(columns={feature: f'{feature}{i}DayAverage'}, inplace=True)

            # If average_df is empty, assign the feature_average_df to it
            if average_df.empty:
                average_df = feature_average_df
            else:
                # Otherwise, merge feature_average_df with average_df
                average_df = pd.merge(average_df, feature_average_df, on=[idnumber, 'meetingDate'])

        # Create a list from the column names
        column_names = average_df.columns.tolist()

        # Columns to exclude
        columns_to_exclude = [idnumber,'meetingDate']

        # Exclude specified columns from the list
        column_names_filtered = [col for col in column_names if col not in columns_to_exclude]

        # Drop any duplicate rows (for when any dam/sire has more than one progeny race in one day or an owner/trainer has more than one dog race in one day)
        average_df.drop_duplicates(inplace=True)

        # convert meetingDate to datetime format
        average_df['meetingDate'] = pd.to_datetime(average_df['meetingDate'])

        # set the index before applying the rolling windows
        average_df = average_df.set_index([idnumber, 'meetingDate']).sort_index() 

        for rolling_window in ROLLING_WINDOWS:
            print(f'Processing {i} rolling window {rolling_window} days')

            rolling_result = (
                average_df
                .reset_index(level=0)
                .groupby(idnumber)[column_names_filtered]
                .rolling(rolling_window)
                .agg(AGGREGATES)
                .groupby(level=0)
                .shift(1)
            )

            # Generate list of rolling window feature names (eg: RunTime_norm_min_365D)
            agg_features_cols = [f'{i}_{f}_{a}_{rolling_window}' for f, a in itertools.product(FEATURES, AGGREGATES)]

            # Add features to dataset
            average_df[agg_features_cols] = rolling_result

            # Keep track of generated feature names
            feature_cols.extend(agg_features_cols)

            # fill empty values with 0
            average_df.fillna(0, inplace=True)

        average_df.reset_index(inplace=True)
        dataset = pd.merge(dataset,average_df,on=[idnumber, 'meetingDate'])

    return dataset, feature_cols

dataset, feature_cols = generate_rollingFeatures(TopazData)

def generate_feature_list(feature_cols):
    '''
    This function deduplicates the list of generated feature names and converts it back to a Python list for later use (np.unique returns a numpy array)
    '''
    feature_cols = np.unique(feature_cols).tolist()

    return feature_cols

feature_cols = generate_feature_list(feature_cols)

def discard_first_year(dataset,start_date):
    '''
    This function discards the first 12 months of the dataset. 
    This is done because the generate_rollingFeatures function utilises 365 days as one of the windows, and so to maintain data integrity, any data between the beginning of the dataset and 12 months afterwards is now 'unclean' and should be discarded
    '''
    dataset = dataset[dataset['meetingDate'] >= start_date + pd.DateOffset(years=1)]

    return dataset

DATASET_COLUMNS_TO_KEEP = [
                    'meetingDate',
                    'state',
                    'track',
                    'distance',
                    'raceId',
                    'raceTypeCode',
                    'raceNumber',
                    'boxNumber',
                    'rugNumber',
                    'runId',
                    'dogId',
                    'dogName',
                    'weightInKg',
                    'sex',
                    'trainerId',
                    'trainerState',
                    'damId',
                    'damName',
                    'sireId',
                    'sireName',
                    'win',
                    'speedIndex',
                    'dogAgeScaled',
                    'lastFiveWinPercentage',
                    'lastFivePlacePercentage',
                    'weightInKgScaled',
                    'rolling_box_win_percentage',
                    'hasAtLeast1VacantBox']

def generate_modelling_dataset(dataset, feature_cols):
    '''
    This function extends the list of feature columns from the generate_rollingFeatures function by adding the features we previously created, and then keeps only the columns specified to prepare the dataset for training by the algorithm
    '''
    # Extend the feature_cols list
    feature_cols.extend(['dogAgeScaled',
                'lastFiveWinPercentage',
                'lastFivePlacePercentage',
                'weightInKgScaled',
                'rolling_box_win_percentage',
                'hasAtLeast1VacantBox'])

    # Trim the dataset
    dataset = dataset[DATASET_COLUMNS_TO_KEEP + feature_cols]

    # Fill any missing values with 0 as a final step before training
    dataset.fillna(0,inplace=True)

    # Drop any duplicate runIds
    dataset.drop_duplicates(subset=['runId'],inplace=True)

    return dataset, feature_cols

dataset = discard_first_year(dataset,start_date)

dataset, feature_cols = generate_modelling_dataset(dataset, feature_cols)

Now that we've created our features, let's feed them into our machine learning algorithms to generate some probabilities!

Training

Training the model
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

import pickle
from sklearn.metrics import log_loss
from sklearn.metrics import brier_score_loss
import time

# Gradient Boosting Machines libraries
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Common models parameters
verbose       = 0
learning_rate = 0.1
n_estimators  = 500

def define_models():
    '''
    This function defines a number of machine learning algorithms with some suggested hyperparameters.
    The purpose of this is to demonstrate the differences in accuracy as well as training times for each different algorithm specifically in relation to this dataset
    '''
    # Train different types of models
    models = {
        'GradientBoostingClassifier': GradientBoostingClassifier(verbose=verbose, learning_rate=learning_rate, n_estimators=n_estimators, max_depth=8, max_features=8),
        'RandomForestClassifier':     RandomForestClassifier(verbose=verbose, n_estimators=n_estimators, max_depth=20, max_features=8, min_samples_leaf=5,min_samples_split=10),
        'LGBMClassifier':             LGBMClassifier(learning_rate=learning_rate, n_estimators=n_estimators, force_col_wise=True,num_leaves=80,verbose=-1),
        'XGBClassifier':              XGBClassifier(verbosity=verbose, learning_rate=learning_rate, n_estimators=n_estimators, max_depth = 8, subsample = 1.0, objective='binary:logistic'),
        'CatBoostClassifier':         CatBoostClassifier(verbose=verbose, learning_rate=learning_rate, n_estimators=n_estimators, depth=8, l2_leaf_reg=10),
        'KNeighborsClassifier':       KNeighborsClassifier(n_neighbors=15,weights='distance',p=2),
        'DecisionTreeClassifier':     DecisionTreeClassifier(max_depth=20,min_samples_split=10,min_samples_leaf=5,max_features=8),
        'MLPClassifier':              MLPClassifier(hidden_layer_sizes=100,activation='relu',solver='adam', alpha=0.0001),
        'AdaBoostClassifier':         AdaBoostClassifier(n_estimators=n_estimators,learning_rate=learning_rate,algorithm='SAMME'),
        'GuassianNB':                 GaussianNB(),
        'QuadDiscriminantAnalysis':   QuadraticDiscriminantAnalysis(),
        'LogisticsRegression':        LogisticRegression(verbose=0, solver='liblinear',max_iter=5000000, C=10,penalty='l2')
    }

    return models

models = define_models()

def remove_races(dataset):
    '''
    This function discards Qualifying Trials and any races which may only have 1 runner like Match Races which are outside the scope of this model
    '''
    #remove trial races
    final_dataset = dataset[dataset['raceTypeCode'] != 'QT']

    # Remove rows where raceId appears only once
    counts = final_dataset['raceId'].value_counts()
    repeated_raceIds = counts[counts > 1].index
    final_dataset = final_dataset[final_dataset['raceId'].isin(repeated_raceIds)]

    return final_dataset

final_dataset = remove_races(dataset)

def downcast_data(final_dataset):
    '''
    This function downcasts int64 and float64 to int32 and float32 respectively in order to reduce the memory usage of the model during the training process
    For systems with high RAM capacity this may not be required. Feel free to comment out this function, however it may result in memory allocation failures
    '''
    # Loop through each column and convert int64 to int32 and float64 to float32
    for col in final_dataset.columns:
        if final_dataset[col].dtype == 'int64':
            final_dataset.loc[:, col] = final_dataset[col].astype('int32')
        elif final_dataset[col].dtype == 'float64':
            final_dataset.loc[:, col] = final_dataset[col].astype('float32')

    return final_dataset

final_dataset = downcast_data(final_dataset)

def train_test_split(final_dataset,end_date):
    '''
    This function splits the dataset into a training set and a test set for the purposes of model training.
    This is to enable testing of the trained model on an unseen test set to establish statistical metrics regarding its accuracy
    There are other ways to split a dataset into test and training sets, however this method is useful for the purposes of backtesting profitability
    '''
    # Split the data into train and test data
    train_data = final_dataset[final_dataset['meetingDate'] < end_date - pd.DateOffset(years=1)].reset_index(drop=True)
    test_data = final_dataset[final_dataset['meetingDate'] >= end_date - pd.DateOffset(years=1)].reset_index(drop=True)

    return test_data, train_data

test_data, train_data = train_test_split(final_dataset,end_date)

def generate_xy(test_data, train_data):
    '''
    This function separates the target column 'win' from the actual features of the dataset, and also separates the training features from the race metadata which is not used for training (e.g. raceId)
    '''
    train_x, train_y = train_data[feature_cols], train_data['win']
    test_x, test_y = test_data[feature_cols], test_data['win']

    return train_x, train_y, test_x, test_y

train_x, train_y, test_x, test_y = generate_xy(test_data, train_data)

def train_models(models,train_x, train_y, test_x, test_y, test_data):
    '''
    This function loads each individual machine learning algorithm and trains each model on the training dataset.
    It begins by initialising an empty list, then loops over each model, training it and normalising the predicted probabilities to sum to 1 within each race field.
    This normalisation could not be performed if a different train/test split method had been used that resulted in incomplete race fields being included

    Once normalised, the log loss and Brier score for each model are calculated and printed. These scores are not stored, merely displayed as indicators (along with the training time for each algorithm)
    Finally each trained model is 'dumped' (i.e. saved) to disk as a pickle (_.pickle.dat) file

    '''
    probs_columns = []

    for key, model in models.items():

        prob_col_key = f'prob_{key}'
        print(f'Fitting model {key} for {model}')

        # time and train the model
        start_time = time.time()  # Record start time
        model.fit(train_x, train_y)
        end_time = time.time()  # Record end time
        print(f"Training time for {key}: {end_time - start_time} seconds")

        # Calculate runner win probability
        dog_win_probs = model.predict_proba(test_x)[:, 1]
        test_data[prob_col_key] = dog_win_probs

        # Normalise probabilities
        test_data[prob_col_key] = test_data.groupby('raceId', group_keys=False)[f'prob_{key}'].apply(lambda x: x / sum(x))

        # Assign the dog with the highest probability a value of 1 as the predicted race winner
        test_data['predicted_winner'] = test_data.groupby('raceId', group_keys=False)[f'prob_{key}'].apply(lambda x: x == max(x))

        # Calculate log loss
        test_loss = log_loss(test_y, test_data[prob_col_key])
        print(f"Log loss for {key} on test data: {test_loss}")

        # Calculate Brier Score
        brier_score = brier_score_loss(test_y, test_data[prob_col_key])
        print(f"Brier score for {key} on test data: {brier_score}")

        # Add the probability column key to the list probs_columns
        probs_columns.append(prob_col_key)

        # Dump the pickle file
        with open(key + '_.pickle.dat', 'wb') as f:
            pickle.dump(model, f)

    return test_data, probs_columns

all_models_test_data, probs_columns = train_models(models,train_x, train_y, test_x, test_y, test_data)

BACKTESTING_COLUMNS = ['meetingDate','state','track','raceNumber','boxNumber','rugNumber']

def trim_and_export_test_data(all_models_test_data):
    '''
    This function trims all the model features from the test dataset and outputs only the required race information, along with the win probability from each machine learning model, to a csv file for the purposes of backtesting
    '''
    trimmed_test_data=all_models_test_data[BACKTESTING_COLUMNS + probs_columns]

    trimmed_test_data.to_csv('multiple_algorithms_predictions.csv',index=False)

trim_and_export_test_data(all_models_test_data)
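
If you later want to generate predictions without retraining (for example after a kernel restart), you can re-load any of the saved pickle files. A minimal sketch, assuming the 'LGBMClassifier_.pickle.dat' file produced by train_models above and a feature dataframe built in the same way as test_x:

# Load a previously trained model from disk
with open('LGBMClassifier_.pickle.dat', 'rb') as f:
    loaded_model = pickle.load(f)

# Score new data - the columns must match feature_cols used at training time
new_probs = loaded_model.predict_proba(test_x)[:, 1]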

Alright, now that we've trained the models, let's have a look at the log loss for each model and the time taken to train it.

Model                         Training Time (s)   Log Loss
AdaBoostClassifier            6248                0.532
XGBClassifier                 2599                0.278
RandomForestClassifier        1930                0.269
MLPClassifier                 1513                0.376
GradientBoostingClassifier    1278                0.308
LogisticsRegression           492                 0.378
CatBoostClassifier            200                 0.355
LGBMClassifier                92                  0.305
QuadDiscriminantAnalysis      29                  3.641
DecisionTreeClassifier        6                   0.278
GuassianNB                    3                   7.015
KNeighborsClassifier          1                   0.00000935

As we can see, some models took a lot longer than others to train: the slowest was the AdaBoostClassifier at close to 2 hours, while the KNeighborsClassifier took about 1 second. We've only used one set of parameters for each model; you could try a grid search to find the best parameters for each.

Here's a great explanation of the differences between log loss and Brier score for machine learning models - more info
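
To make the difference concrete, here's a small illustrative snippet (with made-up probabilities, not the model outputs above) showing how both metrics are calculated with scikit-learn:

from sklearn.metrics import log_loss, brier_score_loss

# Hypothetical outcomes (1 = won, 0 = lost) and predicted win probabilities
outcomes = [1, 0, 0, 1, 0]
predicted_probs = [0.60, 0.20, 0.10, 0.35, 0.25]

# Log loss penalises confident wrong predictions very heavily
print(log_loss(outcomes, predicted_probs, labels=[0, 1]))

# Brier score is the mean squared difference between probability and outcome
print(brier_score_loss(outcomes, predicted_probs))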

Plotting feature importance
import matplotlib.pyplot as plt
import numpy as np

def plot_feature_importance(models):
    """
    This function iterates through each model in the provided dictionary and plots the feature importance if available.
    If the model has a 'feature_importances_' attribute, it plots the feature importance either using 'plot_importance' method (if available) or a horizontal bar chart. Each plot is saved as a JPG file with a filename indicating the model name.
    If a model does not have feature importance information, a message is printed indicating that.
    """
    for key, model in models.items():

        if hasattr(model, 'feature_importances_'):
            plt.figure(figsize=(10, 70))
            if hasattr(model, 'plot_importance'):
                model.plot_importance(ax=plt.gca(), importance_type='gain', title=f'Feature Importance - {key}')
            else:
                feature_importance = model.feature_importances_
                sorted_idx = np.argsort(feature_importance)
                plt.barh(train_x.columns[sorted_idx], feature_importance[sorted_idx])
                plt.xlabel("Feature Importance")
                plt.ylabel("Features")
                plt.title(f"Feature Importance - {key}")

            # Save plot to JPG file
            plt.savefig(f'feature_importance_{key}.jpg')

            plt.close()  # Close the plot to release memory
        else:
            print(f"No feature importance available for {key}")

plot_feature_importance(models)

[Feature importance plot]

For our first round of training, we used some suggested parameters for each different type of machine learning model. However, perhaps these parameters aren't well suited to our use case. In that case we can use a technique called 'Grid Search' to find the best hyperparameters for one of the machine learning models (in this case we'll use LGBM). A grid search methodically tests different hyperparameter combinations, checking each one's performance through cross-validation. This helps pinpoint the setup that maximises the model's performance metric, like accuracy or area under the curve (AUC).

With LightGBM, which relies heavily on hyperparameters like learning rate, maximum tree depth and regularisation parameters, a grid search is essential. By trying out various combinations of these hyperparameters, we can fine-tune the model for superior predictive performance. LightGBM's speed advantage makes grid searching even more appealing: because it's built for efficiency, the computational burden of a grid search is reduced compared to other algorithms, meaning we can explore a wide range of hyperparameters without waiting forever. As above, the AdaBoostClassifier took 1.75 hours to train just once, so it would not be an appropriate candidate for a grid search. LGBM took only around 90 seconds, so a grid search of 3 folds over 27 hyperparameter combinations (81 fits) would take around 120 minutes - much more acceptable.

Grid Search for LGBM
from sklearn.model_selection import GridSearchCV

# Define parameter grid for LGBMClassifier. (More parameters and values can be added if desired)
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 500],
    'num_leaves': [50, 80, 120]
}

def LGBM_GridSearch(train_x, train_y):

    '''
    This function demonstrates how to perform a grid search over different combinations of hyperparameter values to find an optimal combination for a particular machine learning model.
    We performed the wider model training first in order to find which model has a short enough training time to be a candidate for a grid search.
    Models with training times of 2 hours per fit won't be suitable for a grid search unless the user has a lot of patience.

    It will perform the grid search and return the parameters associated with the fit with the most favourable neg_log_loss score
    '''

    # Initialize LGBMClassifier
    lgbm = LGBMClassifier(force_col_wise=True, verbose=-1)

    # Initialize GridSearchCV
    grid_search = GridSearchCV(estimator=lgbm, param_grid=param_grid, scoring='neg_log_loss', cv=3, verbose=3)

    # Fit GridSearchCV
    grid_search.fit(train_x, train_y)

    # Print results
    print("GridSearchCV Results:")
    print("Best parameters found:", grid_search.best_params_)
    print("Best negative log loss found:", grid_search.best_score_)

    lgbm_best_params = grid_search.best_params_

    return lgbm_best_params

lgbm_best_params = LGBM_GridSearch(train_x, train_y)

def apply_best_GridSearch_params(lgbm_best_params,train_x, train_y):
    '''
    This function takes the previously defined best parameters, trains the model using those parameters and then outputs the pickle file.
    '''
    # Define the model with best parameters
    best_model = LGBMClassifier(force_col_wise=True, verbose=3, n_estimators=lgbm_best_params['n_estimators'],learning_rate=lgbm_best_params['learning_rate'],num_leaves=lgbm_best_params['num_leaves'])

    # Train the model on best parameters
    best_model.fit(train_x, train_y)

    # Dump the pickle for the best model
    with open('best_lgbm_model.pickle', 'wb') as f:
        pickle.dump(best_model, f)

    return best_model

best_model = apply_best_GridSearch_params(lgbm_best_params,train_x, train_y)

def best_model_predictions(best_model,test_data):
    '''
    This function generates a column of probabilities for each runner, before normalising the probabilities across each race.
    Normalising the probabilities is essential for use in a betting model as the sum of the probabilities should be equal to 1 given that each race has 1 winner.

    Unnecessary columns are dropped and then the dataframe is exported to a csv file
    '''

    # Predict probabilities using the best model
    test_data['win_probability'] = best_model.predict_proba(test_x)[:, 1]

    # Normalise probabilities across each race
    test_data['win_probability'] = test_data.groupby('raceId', group_keys=False)['win_probability'].apply(lambda x: x / sum(x))

    # Keep only required columns
    test_data = test_data[BACKTESTING_COLUMNS + ['win_probability']]

    # Export DataFrame to CSV
    test_data.to_csv('lgbm_gridSearch_predictions.csv', index=False)

    return test_data

test_data = best_model_predictions(best_model,test_data)
Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV 1/3] END learning_rate=0.01, n_estimators=100, num_leaves=50;, score=-0.387 total time=  36.8s
[CV 2/3] END learning_rate=0.01, n_estimators=100, num_leaves=50;, score=-0.387 total time=  39.6s
[CV 3/3] END learning_rate=0.01, n_estimators=100, num_leaves=50;, score=-0.386 total time=  47.6s
[CV 1/3] END learning_rate=0.01, n_estimators=100, num_leaves=80;, score=-0.387 total time= 1.0min
[CV 2/3] END learning_rate=0.01, n_estimators=100, num_leaves=80;, score=-0.386 total time=  53.8s
[CV 3/3] END learning_rate=0.01, n_estimators=100, num_leaves=80;, score=-0.385 total time= 1.0min

#
# Output trimmed
#

GridSearchCV Results:
Best parameters found: {'learning_rate': 0.01, 'n_estimators': 500, 'num_leaves': 120}
Best negative log loss found: -0.3791625785797655

Analysing profitability

Let's explore now how we can take our model's probability predictions and compare them to actual market data. We'll join them onto the Betfair BSP datasets and then apply some segmentation to see if we can choose a segment for a staking strategy. For this profitability analysis, we have used proportional staking, where the model's probability is multiplied by 100 to calculate a stake (e.g. 0.07 x 100 = $7), with bets placed at BSP.
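
As a rough sketch of that staking logic (assuming a merged dataframe like the 'backtesting' one built below, with 'win_probability', 'WIN_BSP' and 'WIN_RESULT' columns, and assuming WIN_RESULT is 1 for the winner and 0 otherwise):

def calculate_proportional_staking(backtesting):
    # Stake $1 for every percentage point of model probability
    backtesting['stake'] = backtesting['win_probability'] * 100

    # Profit at BSP: winners return stake * (BSP - 1), losers lose their stake
    backtesting['profit'] = np.where(backtesting['WIN_RESULT'] == 1,
                                     backtesting['stake'] * (backtesting['WIN_BSP'] - 1),
                                     -backtesting['stake'])

    print('Total staked:', backtesting['stake'].sum())
    print('Total profit:', backtesting['profit'].sum())

    return backtesting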

When segmenting the data, especially if more than one segment is applied, it's important to ensure the set of datapoints is not so small that the strategy is exposed to overfitting. For example, picking races in QLD between 320m and 330m on a Monday night would most likely result in overfitting and unreproducible results. This page here has a great tool to help assess whether a strategy is overfitted.
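
As one way of checking segment sizes before committing to a filter, here's a rough sketch (assuming the stake and profit columns from the staking sketch above) that summarises bet count, turnover and profit on turnover by state:

def summarise_by_segment(backtesting, segment_column):
    # Aggregate bet count, turnover and profit for each segment value
    summary = backtesting.groupby(segment_column).agg(
        bets=('stake', 'count'),
        turnover=('stake', 'sum'),
        profit=('profit', 'sum')).reset_index()

    # Profit on turnover (POT) as a percentage - treat small segments with caution
    summary['POT%'] = summary['profit'] / summary['turnover'] * 100

    return summary

print(summarise_by_segment(backtesting, 'STATE_CODE'))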

Joining model predictions with Betfair data
BETFAIR_RESULTS_COLUMNS = ['LOCAL_MEETING_DATE',
                        'SCHEDULED_RACE_TIME',
                        'ACTUAL_OFF_TIME',
                        'TRACK',
                        'STATE_CODE',
                        'RACE_NO',
                        'WIN_MARKET_ID',
                        'DISTANCE',
                        'RACE_TYPE',
                        'SELECTION_ID',
                        'TAB_NUMBER',
                        'SELECTION_NAME',
                        'WIN_RESULT',
                        'WIN_BSP',
                        'BEST_AVAIL_BACK_AT_SCHEDULED_OFF']

def retrieve_betfair_results(end_date):
    '''
    This function retrieves the Betfair Historical results csv files using a series of urls in a consistent format.
    Each csv file is retrieved and then concatenated into a single dataframe, before being trimmed by dropping unnecessary columns, rows with None in particular columns, and duplicates

    A runner may have None for BSP if it was a very late scratching and was not able to be removed before commencement of the race.
    Additionally, some NZ races are known not to have had BSP offered due to a lack of vision

    A runner may have None for 'BEST_AVAIL_BACK_AT_SCHEDULED_OFF' and a value for BSP if another runner was a very late scratching and all bets in the market at the time had to be voided, requiring manual BSP reconciliation.

    These datasets are also known to have duplicate values, so as an additional cleaning step, duplicate values were dropped
    '''

    # Define the start and end date for iteration
    start_results_date = end_date - pd.DateOffset(years=1)
    end_results_date = datetime.today() - pd.DateOffset(months=1)

    # List to store DataFrames
    result_dataframes = []

    # Define our date variable that will be changed
    current_date = start_results_date

    # Initialise the while loop to ensure we cover our whole date range
    while current_date <= end_results_date:
        # Generate the URL
        url = f'https://betfair-datascientists.github.io/data/assets/ANZ_Greyhounds_{current_date.year}_{current_date.month:02d}.csv'

        try:
            # Read CSV data into a DataFrame directly from the URL
            df = pd.read_csv(url)
            result_dataframes.append(df)
            print(f"Processed: {url}")
        except Exception as e:
            print(f"Failed to fetch data from: {url}, Error: {e}")

        # Move to the next month
        current_date = current_date.replace(day=1) + pd.DateOffset(months=1)

    # Concatenate all DataFrames into one
    betfair_results = pd.concat(result_dataframes, ignore_index=True)

    # Keep only required columns
    betfair_results= betfair_results[BETFAIR_RESULTS_COLUMNS]

    # Drop rows with None in WIN_BSP or BEST_AVAIL_BACK_AT_SCHEDULED_OFF
    betfair_results = betfair_results.dropna(subset=['WIN_BSP'])
    betfair_results = betfair_results.dropna(subset=['BEST_AVAIL_BACK_AT_SCHEDULED_OFF'])

    # Drop any duplicate rows
    betfair_results.drop_duplicates(inplace=True)

    # Ensure LOCAL_MEETING_DATE is in datetime format
    betfair_results['LOCAL_MEETING_DATE']=pd.to_datetime(betfair_results['LOCAL_MEETING_DATE']).dt.date

    return betfair_results

betfair_results = retrieve_betfair_results(end_date)

def create_backtesting_dataset(betfair_results,best_model_test_data):
    '''
    This function merges together our model predictions with the retrieved Betfair data
    '''

    # merge dataframes
    backtesting = pd.merge(betfair_results,
                        best_model_test_data,
                        how='left',
                        left_on=['LOCAL_MEETING_DATE','TRACK','RACE_NO','TAB_NUMBER'],
                        right_on=['meetingDate','track','raceNumber','rugNumber'])

    # Drop Betfair races where our model didn't have a prediction
    backtesting.dropna(subset=['win_probability'],inplace=True)

    return backtesting

backtesting = create_backtesting_dataset(betfair_results,test_data)

def generate_backtesting_columns(backtesting):
    '''
    This function generates additional columns for the purposes of profitability testing.

    Model Rank: This function ranks each runner within the race in descending order by win_probability
    Model Value: Implied value of the model's probability against the best available odds at the scheduled off - These odds are used as a signal only
    Proportional Back Stake: A Profit/Loss column for each runner before commission

    '''
    # Rank each runner by the model's probability
    backtesting['MODEL_RANK'] = backtesting.groupby('WIN_MARKET_ID')['win_probability'].rank(ascending=False)

    # Generate our model's implied value against the best market price at the scheduled jump time
    backtesting['MODEL_VALUE'] = backtesting['win_probability'] - 1/backtesting['BEST_AVAIL_BACK_AT_SCHEDULED_OFF']

    # Generate a profit / loss column on a per runner basis without commission
    backtesting['PROPORTIONAL_BACK_STAKE'] = np.where(backtesting['WIN_RESULT'] == 'WINNER',
                            backtesting['win_probability'] * 100 * (backtesting['WIN_BSP'] - 1),
                            backtesting['win_probability'] * -100)

    return backtesting

backtesting = generate_backtesting_columns(backtesting)

# This is a non-exhaustive list of race types from the Betfair Pricing Data which groups together different race types. 
# The greyhound race grading system is convoluted and varies by state
RACE_TYPE_DICTIONARY = {
            'B8':'Other',
            'FFA':'Ungraded',
            'Final':'Prestige',
            'G3/4/5':'Graded',
            'Gr':'Graded',
            'Gr1':'Graded',
            'Gr1/2':'Graded',
            'Gr1/2/3':'Graded',
            'Gr2/3':'Graded',
            'Gr2/3/4':'Graded',
            'Gr3':'Graded',
            'Gr3/4':'Graded',
            'Gr3/4/5':'Graded',
            'Gr4':'Graded',
            'Gr4/5':'Graded',
            'Gr4/5/6':'Graded',
            'Gr5':'Graded',
            'Gr5/6':'Graded',
            'Gr5/Mdn':'Graded',
            'Gr6':'Graded',
            'Gr6/7':'Graded',
            'Gr7':'Graded',
            'Grp1':'Prestige',
            'Grp2':'Prestige',
            'Grp3':'Prestige',
            'Heat':'Heat',
            'Inv':'Other',
            'Juv':'Novice',
            'Juv/Grad':'Novice',
            'Juv/Mdn':'Novice',
            'Listed':'Prestige',
            'M1':'Masters',
            'M1/2':'Masters',
            'M1/2/3':'Masters',
            'M1/M2':'Masters',
            'M1/M2/M3':'Masters',
            'M12':'Masters',
            'M2':'Masters',
            'M2/3':'Masters',
            'M2/M3':'Masters',
            'M3':'Masters',
            'M3/4':'Masters',
            'M4/5':'Masters',
            'M4/6':'Masters',
            'M4/M5':'Masters',
            'M5':'Masters',
            'M6':'Masters',
            'Mdn':'Maiden',
            'Mdn/Gr4/5':'Maiden',
            'Mdn/Gr5':'Maiden',
            'Mdn/Gr6':'Maiden',
            'Mdn/Juv':'Maiden',
            'N/G':'Ungraded',
            'Nov':'Novice',
            'Nov/Mdn':'Novice',
            'Nvce':'Novice',
            'Prov':'Other',
            'Rest':'Win Restricted',
            'S/E':'Other',
            'SE':'Other',
            'Semi':'Other',
            'Vets':'Other',
            }


def group_racetypes(backtesting):
    '''
    This function generates a race_type_group column using a dictionary of loose race categories
    '''
    # Create 'RACE_TYPE_GROUP' column by mapping values from 'race_type' column
    backtesting['RACE_TYPE_GROUP'] = backtesting['RACE_TYPE'].map(RACE_TYPE_DICTIONARY)

    # Export to csv
    backtesting.to_csv('backtesting_results.csv',index=False)

    return backtesting

backtesting = group_racetypes(backtesting)

'''
This concludes the model generation tutorial.
The following code is intended for use in a live model
'''

Once we've exported this to a csv file, we can conduct some basic Excel analysis to segment the predictions. All figures quoted below are before commission and may not be reflective of actual profitability. These segments are intended as ideas only.

(Image: segmentation analysis of the backtesting results)
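
If you'd rather stay in Python than Excel, a minimal sketch of the same kind of segmentation is shown below, grouping the backtesting results by race type group and model rank. The bet count per segment is included as a simple guard against drawing conclusions from tiny samples:

Segmenting the backtesting results in pandas
# A minimal sketch only - all figures are before commission
segments = (backtesting
            .groupby(['RACE_TYPE_GROUP', 'MODEL_RANK'])
            .agg(bets=('PROPORTIONAL_BACK_STAKE', 'size'),
                 total_staked=('win_probability', lambda x: (x * 100).sum()),
                 profit=('PROPORTIONAL_BACK_STAKE', 'sum')))

# Profit on turnover for each segment - treat small-sample segments with caution
segments['pot'] = segments['profit'] / segments['total_staked']

print(segments.sort_values('pot', ascending=False))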

Creating new ratings

The next step will be to load up our database and update it with recent results

Load up the database and begin the preprocessing
def generate_historic_data_enddate(TopazDataAll):
    '''
    This function finds the date to which the Topaz Historical Data has been downloaded, and then generates a datetime variable which equals the following date

    '''
    # Find the maximum value in the 'meetingDate' column
    max_meeting_date = TopazDataAll['meetingDate'].max()

    # Remove time information from the max_meeting_date and format it to '%Y-%m-%d'
    max_meeting_date = pd.to_datetime(max_meeting_date, format='%Y-%m-%dT%H:%M:%S.%fZ').strftime('%Y-%m-%d')

    # Convert max_meeting_date to datetime object
    max_meeting_date = pd.to_datetime(max_meeting_date)

    # Add one day to max_meeting_date
    max_meeting_date_plus_one_day = max_meeting_date + timedelta(days=1)

    print("Max meeting date plus one day:", max_meeting_date_plus_one_day)

    # Define yesterday's date
    yesterday = datetime.today() - timedelta(days=1)

    return max_meeting_date_plus_one_day, yesterday

TopazDataAll = load_topaz_data(JURISDICTION_CODES,'HISTORICAL')

max_meeting_date_plus_one_day, yesterday = generate_historic_data_enddate(TopazDataAll)

# Collect upcoming race data - note the use of 'UPCOMING' here.
collect_topaz_data(TOPAZ_API_KEY,JURISDICTION_CODES,max_meeting_date_plus_one_day,yesterday,'UPCOMING')

def topaz_last_12_months(TopazDataHistorical):
    '''
    This function generates a timezone naive column and then discards all data except the previous 12 months.
    The Topaz API meetingDate is timezone aware, so filtering it directly against a naive datetime like datetime.today() isn't possible.
    Removing the timezone from the original column would create errors with the dogAge function, so creating a timezone naive copy (which is later discarded) is the cleanest way to accomplish this
    '''
    # Create the naive meetingDate
    TopazDataHistorical['meetingDateNaive'] = pd.to_datetime(TopazDataHistorical['meetingDate'], format='%Y-%m-%dT%H:%M:%S.%fZ', utc=True).dt.tz_localize(None)

    # Discard all but the last 12 months
    TopazDataHistorical = TopazDataHistorical[TopazDataHistorical['meetingDateNaive'] >= datetime.today() - pd.DateOffset(years=1)]

    return TopazDataHistorical

def load_topaz_preprocessing_data(codes,datatype):
    '''
    This function combines 4 of our previous functions to load our existing topaz historical data into a dataframe from our database and perform some basic cleaning and then keep only the last 12 months for preprocessing
    '''
    TopazDataHistorical = load_topaz_data(codes,datatype)

    TopazDataHistorical = discard_scratched_runners_data(TopazDataHistorical)

    TopazDataHistorical = topaz_last_12_months(TopazDataHistorical)

    TopazDataHistorical = discard_unnecessary_columns(TopazDataHistorical,TOPAZ_COLUMNS_TO_KEEP)

    return TopazDataHistorical

TopazDataHistorical = load_topaz_preprocessing_data(JURISDICTION_CODES,'HISTORICAL')

Following this step, we'll need to download today's upcoming races from the Topaz API and also gather market information from the Betfair API to generate today's ratings.

An important thing to note here is that the update frequency of box information on Topaz is not known and, for upcoming races, may not be reliable. We do, however, know that the Betfair fields only appear once fields are final, so the box information there will be accurate. Therefore, we need to query the market clarifications field and extract the box information using a rather ugly regex script. Once we've done this, we can replace the box information from Topaz with the box information taken from the Betfair API, before we process the raw Topaz data ready for the application of our lovingly trained GridSearched LGBM model
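
To make the regex a little less mysterious, here is a quick illustration of the two patterns defined further below, applied to the sample clarification string from the function docstring:

Illustrating the clarifications regex
from nltk.tokenize import regexp_tokenize

clarifications = "<br>Box changes:<br>Dog 9. Tralee Blaze starts from box no. 8<br><br>Dog 6. That Other One starts from box no. 2<br><br>"

# Dog name (with rug number) - everything between '<br>Dog ' and ' starts'
print(regexp_tokenize(clarifications, r'(?<=<br>Dog ).+?(?= starts)'))
# ['9. Tralee Blaze', '6. That Other One']

# Box number - the word following 'box no. '
print(regexp_tokenize(clarifications, r"(?<=\bbox no. )(\w+)"))
# ['8', '2']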

Gathering today's upcoming races from Topaz and the Betfair API
import betfairlightweight
from betfairlightweight import filters
import pandas as pd
from datetime import timedelta
from nltk.tokenize import regexp_tokenize
import warnings
import json

warnings.filterwarnings('ignore', message='The behavior of DataFrame concatenation with empty or all-NA entries is deprecated.*')

def bflw_trading():
    '''
    This function loads the credentials file, and passes the credentials into the betfairlightweight instance
    '''
    with open('credentials.json') as f:
        cred = json.load(f)
        username = cred['username']
        password = cred['password']
        app_key = cred['app_key']

    # Define the betfairlightweight client
    trading = betfairlightweight.APIClient(username, password, app_key=app_key)

    return trading

def login(trading):
    # login to the API
    trading.login_interactive()

def greyhound_market_filter():
    # Define the greyhound market filter
    market_filter = filters.market_filter(
        event_type_ids=[4339],  # For greyhound racing
        market_countries=['AU'],  # For Australia
        market_type_codes=['WIN']  # For win markets
    )

    return market_filter

def process_runner_books(runner_books):
    # Define the fields required from the runner book
    selection_ids = [runner_book.selection_id for runner_book in runner_books]

    df = pd.DataFrame({
        'selectionId': selection_ids,
    })
    return df

def generate_greyhound_catalogue(trading,market_filter):
    # Load the greyhound market catalogues from the Betfair API
    greyhound_market_catalogues = trading.betting.list_market_catalogue(
        filter=market_filter,
        market_projection=['RUNNER_DESCRIPTION', 'EVENT', 'MARKET_DESCRIPTION'],
        max_results='200')

    print(f"Found {len(greyhound_market_catalogues)} markets.")

    return greyhound_market_catalogues

RUNNER_DATA_COLUMNS = [
            'marketStart',
            'track',
            'raceNumber',
            'raceType',
            'winMarketId',
            'selectionId',
            'rugNumber',
            'boxNumber',
            'dogName'
            ]

def initialise_dataframe():
    # Create the empty dataframe
    data = pd.DataFrame(columns=RUNNER_DATA_COLUMNS)

    return data

PATTERN1 = r'(?<=<br>Dog ).+?(?= starts)'

PATTERN2 = r"(?<=\bbox no. )(\w+)"

def process_market_clarifications(runners_df,clarifications):
    '''
    This function accesses the market clarifications field which explains which box the reserve runner will be starting from (if any) and parses the information using regex
    We utilise this information rather than the Topaz API data because Betfair markets only use final field information

    A clarification will look like: "<br>Box changes:<br>Dog 9. Tralee Blaze starts from box no. 8<br><br>Dog 6. That Other One starts from box no. 2<br><br>"
    '''
    # Define the clarifications dataframe
    market_clarifications = pd.DataFrame(regexp_tokenize(clarifications, PATTERN1), columns = ['dogName'])

    # Remove dog name from runner_number
    market_clarifications['rugNumber'] = market_clarifications['dogName'].str.split(r'\. ').str[0]

    # Extract box number from clarifications
    market_clarifications['boxNumber'] = regexp_tokenize(clarifications, PATTERN2)

    # Keep only boxNumber and rugNumber
    market_clarifications=market_clarifications[['rugNumber','boxNumber']]

    # Merge the clarifications with the original dataframe
    runners_df = pd.merge(runners_df,market_clarifications,how='left',on=['rugNumber'])

    # Any runners with no clarifications will start in the box that matches the rugNumber
    runners_df['boxNumber'] = runners_df['boxNumber'].fillna(runners_df['rugNumber'])

    return runners_df


def collect_greyhound_market_data(trading,greyhound_market_catalogues,data):
    '''
    This function will process the greyhound market catalogue to access information about the market including:
     - Market ID
     - Market Name
     - Event Name
     - Start Time
     - Clarifications

    It will then process each individual market book to gather the runner information, followed by some operations to put market information into the dataframe columns, including adjusting the timezone from UTC to AEST
    Finally it will then perform some string splitting operations to generate more useful market/runner information:
     - Track
     - Race Number
     - Race Type
     - Rug Number
     - Dog Name

    These operations may be useful depending on whether the betting intention is for a specific subset of races. It is also possible to split out race distance from the market name
    '''
    # Initiate the for loop
    for market_catalogue in greyhound_market_catalogues:

        # Name variables for market parameters
        market_id = market_catalogue.market_id
        market_name = market_catalogue.market_name
        event_name = market_catalogue.event.name
        market_start_time = market_catalogue.description.market_time

        # Try to access clarifications and apply a known string replacement to prepare it for our regex function
        try:
            clarifications = market_catalogue.description.clarifications.replace("<br> Dog","<br>Dog")
        except AttributeError:
            clarifications = None

        # Generate our market_books list
        market_books = trading.betting.list_market_book(market_ids=[market_id])

        # Generate our runner_catalogues list
        runner_catalogues = market_catalogue.runners

        # Initiate the market_books for loop
        for market_book in market_books:

            # Call the process_runner_books function
            runners_df = process_runner_books(market_book.runners)

            # Get the runner catalogue
            for runner in market_book.runners:

                # define the runner catalogue
                runner_catalogue = next((rd for rd in runner_catalogues if rd.selection_id == runner.selection_id), None)

                # define the runner name for non-empty runner_catalogues
                if runner_catalogue is not None:
                    runner_name = runner_catalogue.runner_name
                    runners_df.loc[runners_df['selectionId'] == runner.selection_id, 'dogName'] = runner_name

            # Assign market variables to the dataframe
            runners_df['winMarketId'] = market_id
            runners_df['marketName'] = market_name
            runners_df['eventName'] = event_name
            runners_df['marketStart'] = market_start_time

            # Adjust the timezone from UTC to AEST
            runners_df['marketStart'] = runners_df['marketStart'] + timedelta(hours=10)

            # Perform string split operations
            runners_df['track'] = runners_df['eventName'].str.split(r' \(').str[0]
            runners_df['raceNumber'] = runners_df['marketName'].str.split(' ').str[0]
            runners_df['raceNumber'] = runners_df['raceNumber'].str.split('R').str[1]
            runners_df['raceType'] = runners_df['marketName'].str.split('m ').str[1]
            runners_df['rugNumber'] = runners_df['dogName'].str.split(r'\. ').str[0]
            runners_df['dogName'] = runners_df['dogName'].str.split(r'\. ').str[1]

            # Call the process_market_clarifications function. If there are no reserve runners then boxNumber = rugNumber
            try:
                runners_df = process_market_clarifications(runners_df,clarifications)
            except TypeError:
                runners_df['boxNumber'] = runners_df['rugNumber']

            # concatenate the dataframes together
            data=pd.concat([data,runners_df], sort=False)

    # Keep only required columns
    data = data[RUNNER_DATA_COLUMNS]
    data = pd.DataFrame(data)

    print(data.head())

    return data

def download_betfair_market_data():
    '''
    This function combines all our previously defined functions to generate our market csv from the Betfair API
    '''
    trading = bflw_trading()

    login(trading)

    market_filter = greyhound_market_filter()

    greyhound_market_catalogues = generate_greyhound_catalogue(trading,market_filter)

    data = initialise_dataframe()

    data = collect_greyhound_market_data(trading,greyhound_market_catalogues,data)

    return data

betfair_data = download_betfair_market_data()

def upcoming_topaz_data(codes,datatype,betfair_data):
    '''
    This function loads our upcoming races, discards the Topaz API boxNumber and adds the boxNumber information retrieved from the Betfair API
    '''
    # Load today's race information
    TodaysTopazData = load_topaz_data(codes,datatype)

    # Keep only required Betfair information
    betfair_fields = betfair_data[['track','raceNumber','rugNumber','boxNumber']]

    # Discard the Topaz API boxNumber information
    TodaysTopazData.drop(columns=['boxNumber'], inplace=True)

    # Merge the Betfair boxNumber information
    TodaysTopazData = pd.merge(TodaysTopazData,betfair_fields,how='left',on=['track','raceNumber','rugNumber'])

    return TodaysTopazData

TodaysTopazData = upcoming_topaz_data(JURISDICTION_CODES,'UPCOMING',betfair_data)

def concatenate_data(TopazDataHistorical,TodaysTopazData):

    # Concatenate the last 12 months of Topaz Data with today's races
    TopazDataPreProcessing = pd.concat([TopazDataHistorical,TodaysTopazData])

    return TopazDataPreProcessing

TopazDataPreProcessing = concatenate_data(TopazDataHistorical,TodaysTopazData)

The final step now will be to apply all of our previously defined functions to the last 12 months of Topaz data to create our rolling window features. Then, finally, we can apply our trained model (in the form of a pickle file) to today's fields and output our ratings, ready to put in the market

Generate today's ratings
def preprocess_data_for_today(TopazDataPreProcessing):
    '''
    This function will apply all of our previously defined functions to our dataset in order to generate all the features required for today's races
    '''

    TopazDataPreProcessing = clean_track_data(TopazDataPreProcessing)

    TopazDataPreProcessing = extract_form(TopazDataPreProcessing)

    TopazDataPreProcessing = correct_result_margin(TopazDataPreProcessing)

    TopazDataPreProcessing = generate_dogAge(TopazDataPreProcessing)

    TopazDataPreProcessing = generate_winPercentage(TopazDataPreProcessing)

    TopazDataPreProcessing = generate_finishingPlaceMovement(TopazDataPreProcessing)

    TopazDataPreProcessing = scale_values(TopazDataPreProcessing)

    TopazDataPreProcessing = generate_runTimeNorm(TopazDataPreProcessing)

    TopazDataPreProcessing = generate_adjacentVacantBox_state(TopazDataPreProcessing)

    TopazDataPreProcessing = generate_boxWinPercentage(TopazDataPreProcessing)

    TopazDataPreProcessing, feature_cols = generate_rollingFeatures(TopazDataPreProcessing)

    feature_cols = generate_feature_list(feature_cols)

    TopazDataPreProcessing, feature_cols = generate_modelling_dataset(TopazDataPreProcessing, feature_cols)

    TopazDataPreProcessing = remove_races(TopazDataPreProcessing)

    TopazDataPreProcessing = downcast_data(TopazDataPreProcessing)

    return TopazDataPreProcessing, feature_cols

TopazDataPreProcessing, feature_cols = preprocess_data_for_today(TopazDataPreProcessing)

def todays_fields(TopazDataPreProcessing):
    '''
    This function removes all race data except for today's races before we load our trained model
    '''
    # Create our naive meetingDate
    TopazDataPreProcessing['meetingDateNaive'] = pd.to_datetime(TopazDataPreProcessing['meetingDate'], format='%Y-%m-%dT%H:%M:%S.%fZ', utc=True).dt.tz_localize(None)

    # Keep only today's races
    todays_runners = TopazDataPreProcessing[TopazDataPreProcessing['meetingDateNaive'] > datetime.today() - timedelta(days=1)].reset_index(drop=True)

    # Discard the naive meetingDate
    todays_runners.drop(columns=['meetingDateNaive'], inplace=True)

    return todays_runners

todays_runners = todays_fields(TopazDataPreProcessing)

import joblib

def load_trained_lgbm_model():

    # Load the trained model
    model = joblib.load('best_lgbm_model.pickle')

    return model

model = load_trained_lgbm_model()

MODEL_RATINGS_COLUMNS = ['state',
                        'track',
                        'raceNumber', 
                        'boxNumber',
                        'rugNumber',
                        'win_probability']

def apply_trained_model(model, todays_runners,feature_cols):
    '''
    This function will apply our previously trained lgbm model to our dataset to generate today's rated prices
    '''

    # Generate the feature set
    todays_runners_features = todays_runners[feature_cols]

    # Apply the trained model
    todays_runners['win_probability'] = model.predict_proba(todays_runners_features)[:, 1]

    # Normalise all probabilities
    todays_runners['win_probability'] = todays_runners.groupby('raceId', group_keys=False)['win_probability'].apply(lambda x: x / sum(x))

    # Select desired columns
    todays_runners = todays_runners[MODEL_RATINGS_COLUMNS]

    return todays_runners

todays_greyhound_ratings = apply_trained_model(model, todays_runners,feature_cols)

def join_betfair_data(todays_greyhound_ratings,betfair_data):
    '''
    This function will join our Betfair market and runner information with today's model ratings
    '''
    # Merge the dataframes
    betfair_data = pd.merge(betfair_data,todays_greyhound_ratings,how='left',on=['track','raceNumber','rugNumber'])

    # Export the data to csv
    betfair_data.to_csv('todays_ratings.csv',index=False)

    return betfair_data

betfair_data = join_betfair_data(todays_greyhound_ratings,betfair_data)

To be continued

The final step here is to fire up flumine to place bets on our rated prices... Stay tuned!

Disclaimer

Note that whilst models and automated strategies are fun and rewarding to create, we can't promise that your model or betting strategy will be profitable, and we make no representations in relation to the code shared or information on this page. If you're using this code or implementing your own strategies, you do so entirely at your own risk and you are responsible for any winnings/losses incurred. Under no circumstances will Betfair be liable for any loss or damage you suffer.