Betfair’s 2025 Greyhound Racing Datathon

Registration

Register Here!

Registrations are open until May 16th, 2025. Only entrants who have registered through the link will be eligible to win a prize.


The Competition

Think you’ve got what it takes to price up a Greyhound market? Now’s your chance to showcase your data modeling skills in Betfair’s 2025 Greyhound Racing Datathon!

With $5,000 in prizes on offer, this is your opportunity to create a predictive model for Greyhound racing on the Betfair Exchange. Whether you're a seasoned pro or new to greyhound modeling, we encourage you to get creative—adapt your skills from other fields, improve an existing model, or start fresh!

This year’s Greyhound Racing Datathon runs across 2 weeks, and we challenge you to test your skills against others for both prizes and ultimate bragging rights.

  • Leaderboard updates will be posted here throughout the competition, so check back often.
  • Join the conversation in the Quants Discord server (#datathon channel), where you can discuss models with fellow participants.
  • Don't forget to complete the registration form to join the Discord Server.

For questions and submissions, contact datathon@betfair.com.au.


The Specifics

Review the full Terms and Conditions for the 2025 Greyhound Racing Datathon here.

Prizes

$5,000 in prizes are up for grabs! Here's the breakdown of the prize pool:

Place               Prize
1                   $2,500.00
2                   $1,000.00
3                   $500.00
4                   $250.00
5                   $250.00
6                   $100.00
7                   $100.00
8                   $100.00
9                   $100.00
10                  $100.00
Total Prize Pool    $5,000.00

Winners will be announced at the end of the competition, with prizes distributed shortly afterward.

Competition Rules

The aim of this competition is to generate a rated price for every runner in every race across a set of meetings during the competition period (WIN markets only).

The competition period is May 19th, 2025 to May 30th, 2025.

The specific meetings will meet the following criteria:

  • The meeting will occur in Australia
  • The meeting will NOT occur at a venue in Western Australia
  • The meeting will occur on a weekday (Monday to Friday)
  • The meeting will commence after 5pm AEST (Evening meeting)

Submissions are due by 4:59pm AEST each day.


Submission Process

Submission templates will be provided here by 12:00pm AEST each day

Example Submission Template

Entrants should not edit the template in any way except to add the rated price for each runner.

Generate Submission Template
import betfairlightweight
from betfairlightweight import filters
import pandas as pd
import dateutil.tz
from datetime import datetime, timedelta, time
import json
from tqdm import tqdm

local_tz = dateutil.tz.tzlocal()
now = datetime.now(dateutil.tz.tzlocal())

with open("credentials.json") as f:
    cred = json.load(f)
    my_username = cred["username"]
    my_password = cred["password"]
    my_app_key = cred["app_key"]

trading = betfairlightweight.APIClient(username=my_username,
                                    password=my_password,
                                    app_key=my_app_key
                                    )

trading.login_interactive()

# Define the market filter
market_filter = filters.market_filter(
    bsp_only=True,
    event_type_ids=[4339],  # For greyhound racing
    market_countries=['AU'],  # For Australia
    market_type_codes=['WIN']  # For win markets
)

def process_runner_books(runner_books):
    '''
    This function processes the runner books and returns a DataFrame with the selection id and status for each runner
    :param runner_books:
    :return: DataFrame with one row per runner
    '''

    selection_ids = [runner_book.selection_id for runner_book in runner_books]
    statuses = [runner_book.status for runner_book in runner_books]

    df = pd.DataFrame({
        'selection_id': selection_ids,
        'status': statuses,
    })
    return df

# Get a list of all markets that match the filter
market_catalogues = trading.betting.list_market_catalogue(
    filter=market_filter,
    market_projection=['RUNNER_DESCRIPTION', 'EVENT', 'MARKET_START_TIME'],
    max_results='1000')

print(f"Found {len(market_catalogues)} markets.")

# Collect a dataframe of runners for each market, then combine them all at the end
runner_dataframes = []

for market_catalogue in tqdm(market_catalogues, desc="Processing markets"):
    market_id = market_catalogue.market_id
    market_name = market_catalogue.market_name
    event_name = market_catalogue.event.name
    venue = market_catalogue.event.venue
    market_start_time = market_catalogue.market_start_time
    event_open_date = market_catalogue.event.open_date

    market_books = trading.betting.list_market_book(market_ids=[market_id])
    runner_catalogues = market_catalogue.runners

    for market_book in market_books:
        runners_df = process_runner_books(market_book.runners)
        #get the runner catalogue
        for runner in market_book.runners:
            runner_catalogue = next((rd for rd in runner_catalogues if rd.selection_id == runner.selection_id), None)

            if runner_catalogue is not None:
                runner_name = runner_catalogue.runner_name
                runners_df.loc[runners_df['selection_id'] == runner.selection_id, 'runner_name'] = runner_name
        #process some of the data to more helpful values
        runners_df['event_open_date'] = event_open_date
        # Timestamps from the API are in UTC - add 10 hours to convert to AEST
        runners_df['event_open_date'] = runners_df['event_open_date'] + timedelta(hours=10)
        runners_df['win_market_id'] = market_id
        runners_df['market_name'] = market_name
        runners_df['event_name'] = event_name
        runners_df['market_start'] = market_start_time
        runners_df['market_start'] = runners_df['market_start'] + timedelta(hours=10)
        runners_df['venue'] = venue
        # Market names follow a pattern like "R1 450m Grade 5" - pull out the race number and race type
        runners_df['race_no'] = runners_df['market_name'].str.split(r' ').str[0]
        runners_df['race_no'] = runners_df['race_no'].str.split('R').str[1]
        runners_df['race_type'] = runners_df['market_name'].str.split(r'm ').str[1]
        # Runner names follow a pattern like "1. Fast Dog" - split on the literal dot to separate tab number and name
        runners_df['tab_number'] = runners_df['runner_name'].str.split(r'\. ').str[0]
        runners_df['runner_name'] = runners_df['runner_name'].str.split(r'\. ').str[1]
        #reorder the columns
        runners_df=runners_df[[
        'event_open_date',
        'market_start',
        'venue',
        'race_no',
        'race_type',
        'win_market_id',
        'selection_id',
        'tab_number',
        'runner_name',
        'status',
        ]]
        #join the dataframes together
        runner_dataframes.append(runners_df)

df = pd.concat(runner_dataframes, ignore_index=True)

#remove scratched runners, WA Tracks and meetings starting before 6pm
df=df[~(df['status'] == 'REMOVED')]
df = df[~df['venue'].isin(['Cannington', 'Mandurah', 'Northam'])]
df = df[df['event_open_date'].dt.time >= time(18, 0)]
df = df.drop(columns=['event_open_date'])

#sort columns
df['race_no'] = df['race_no'].astype(int)
df['tab_number'] = df['tab_number'].astype(int)
df = df.sort_values(by=['venue', 'race_no', 'tab_number'])

#initiate rated price column
df['rated_price'] = None

today = datetime.today().strftime('%Y-%m-%d')
filename = f'submission_template_{today}.csv'
df.to_csv(filename, index=False)

# Logout from your Betfair account
trading.logout()
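
Once the template has been generated, the only change needed is to populate the rated_price column. The snippet below is an illustrative sketch only, not part of the official process: it assumes you have saved your model's win probabilities to a hypothetical my_model_probs.csv with selection_id and win_probability columns, converts those probabilities to rated prices, and checks each market's overround before saving.

from datetime import datetime
import pandas as pd

today = datetime.today().strftime('%Y-%m-%d')

# Load the generated template and your model's win probabilities
# (my_model_probs.csv is a hypothetical file with columns: selection_id, win_probability)
template = pd.read_csv(f'submission_template_{today}.csv')
model_probs = pd.read_csv('my_model_probs.csv')

# A rated price is the reciprocal of the predicted win probability
model_probs['rated_price'] = 1 / model_probs['win_probability']

# Fill the rated_price column without touching any other part of the template
template = template.drop(columns=['rated_price']).merge(
    model_probs[['selection_id', 'rated_price']], on='selection_id', how='left')

# Sanity check: the implied probabilities for each market should sum to 1
overrounds = template.groupby('win_market_id')['rated_price'].apply(lambda p: (1 / p).sum())
print(overrounds.describe())

template.to_csv(f'submission_template_{today}.csv', index=False)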

Missed submissions?

  • You’ll be assigned the median log loss of all other entrants for that race.

Judging

Submissions will be evaluated based on the Log Loss Method.

The log loss scores for each runner in a race will be added together, and entrants will be ranked on their average log loss per race.

The entrant with the lowest average log loss per race will be declared the winner.
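
As an illustration only (this is not the official scoring script), the sketch below computes a race score under the assumption that each runner's rated price is converted to an implied probability (1 divided by the price), each runner is scored with the standard binary log loss, and the per-runner scores are summed for the race.

import numpy as np

def race_log_loss(rated_prices, winner_index):
    '''Sum of per-runner binary log losses for a single race (illustrative only).'''
    probs = 1 / np.asarray(rated_prices, dtype=float)    # implied win probabilities
    outcomes = np.zeros(len(probs))
    outcomes[winner_index] = 1.0                         # 1 for the winner, 0 for the rest
    # Standard binary log loss for each runner, summed across the race
    return float(-np.sum(outcomes * np.log(probs) + (1 - outcomes) * np.log(1 - probs)))

# Example: a 6-dog race where the runner priced at $2.50 wins
print(race_log_loss([2.5, 4.0, 6.0, 8.0, 12.0, 20.0], winner_index=0))

An entrant's overall score would then be the average of this race score across all scored races.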


Historic Data

Registrants will be provided with a link for a historic dataset from the Topaz API. Updates will be posted here weekly leading up to the competition and daily during the competition.

Registrants with a Topaz API key can utilise the code below:

Data Download Code

import pandas as pd
from tqdm import tqdm
import time
from datetime import datetime, timedelta
from topaz import TopazAPI
import requests
import os

# Insert your TOPAZ API key 
TOPAZ_API_KEY = ''

# Define the states you require
JURISDICTION_CODES = ['VIC','NSW','QLD','SA','NZ','TAS','NT','WA']

'''
By Python convention, constants (hard-coded values like credentials that should not change at runtime) are named in ALL_CAPS, while ordinary variables whose values may change use lowercase_with_underscores.
'''

# Define the start and end date for the historic data download
start_date = datetime(2025,2,1)
end_date = datetime(2025,3,20)

def generate_date_range(start_date, end_date):
    '''
    Here we generate our date range for passing into the Topaz API.
    This is necessary because of the rate limits implemented in the Topaz API.
    Passing a start date and end date that are many months apart in a single API call will cause significant amounts of data to be omitted, so we generate a list of dates and pass much smaller date ranges in each Topaz API call.
    '''
    # Initialise the date_range list
    date_range = []

    current_date = start_date

    while current_date <= end_date:
        date_range.append(current_date.strftime("%Y-%m-%d"))
        current_date += timedelta(days=1)

    return date_range

def define_topaz_api(api_key):
    '''
    This is where we define the Topaz API and pass our credentials
    '''
    topaz_api = TopazAPI(api_key)

    return topaz_api

def download_topaz_data(topaz_api,date_range,codes,datatype,number_of_retries,sleep_time):
    '''
    The parameters passed here are:
        1. The TopazAPI instance with our credentials
        2. Our date range that we generated, which is in the format of a list of strings
        3. Our list of JURISDICTION_CODES
        4. Our datatype (either 'HISTORICAL' or 'UPCOMING').
            It is important to separate these in our database to ensure that the upcoming races with lots of empty fields (like margin) do not contaminate our historical dataset.
        5. number_of_retries - how many times a failed request is retried before moving on
        6. sleep_time - the base number of seconds to wait between retries (rate-limit errors wait four times as long)
    '''

    # Iterate over 10-day blocks
    for i in range(0, len(date_range), 10):
        start_block_date = date_range[i]
        print(start_block_date)
        end_block_date = date_range[min(i + 9, len(date_range) - 1)]  # Ensure the end date is within the range


        for code in codes:
            # initialise the race list
            all_races = []

            print(code)

            retries = number_of_retries  # Number of retries
            '''
            In this code block we attempt to download the races for a JURISDICTION_CODE and date range; the raceIds are extracted from the result further below.
            The sleep calls are included to allow time for rate limits to reset and any other errors to clear.
            The Topaz API does not return much detail in its error messages, and through extensive trial and error we have found the best way to resolve errors is simply to pause for a short time and then try again.
            Once the retries are exhausted, the function moves to the next block. This will usually only happen if there genuinely were zero races for that JURISDICTION_CODE and date range, or there is a total Topaz API outage.
            '''
            while retries > 0:
                try:
                    races = topaz_api.get_races(from_date=start_block_date, to_date=end_block_date, owning_authority_code=code)
                    all_races.append(races)
                    break  # Break out of the loop if successful
                except requests.HTTPError as http_err:
                    if http_err.response.status_code == 429:
                        retries -= 1
                        if retries > 0:
                            print(f"Rate limited. Retrying in {sleep_time * 4/60} minutes...")
                            time.sleep(sleep_time * 4)
                        else:
                            print("Max retries reached. Moving to the next block.")
                    else:
                        print(f"Error fetching races for {code}: {http_err.response.status_code}")
                        retries -= 1
                        if retries > 0:
                            print(f"Retrying in {sleep_time} seconds...")
                            time.sleep(sleep_time)
                        else:
                            print("Max retries reached. Moving to the next block.")

            try:
                all_races_df = pd.concat(all_races, ignore_index=True)
            except ValueError:
                continue

            # Extract unique race IDs
            race_ids = list(all_races_df['raceId'].unique())

            # Define the file path
            file_path = code + '_DATA_'+datatype+'.csv'

            # Use tqdm to create a progress bar
            for race_id in tqdm(race_ids, desc="Processing races", unit="race"):
                result_retries = number_of_retries
                '''
                Here we use the same retry approach again to maximise the chances of our data being complete.
                We spend some time here gathering the splitPosition and splitTime fields. While these fields are not used in the completed model, the process to extract them has been included for the sake of completeness, as it is not straightforward.
                '''
                while result_retries > 0:
                    # Check if we already have a file for this jurisdiction
                    file_exists = os.path.isfile(file_path)
                    # Set the header_param to the opposite Bool
                    header_param = not file_exists
                    # Reset race_run so a failed request cannot write stale data from a previous race
                    race_run = None

                    try:
                        # Get the race result data
                        race_result_json = topaz_api.get_race_result(race_id=race_id)

                        # Flatten the JSON response into a dataframe
                        race_result = pd.json_normalize(race_result_json)

                        race_run_df = pd.DataFrame(race_result['runs'].tolist(), index=race_result.index)
                        race_run_stacked = race_run_df.T.stack().to_frame()
                        race_run_stacked.reset_index(drop=True, inplace=True)
                        race_run_normalised = pd.json_normalize(race_run_stacked[0])

                        # Separate the split times and flatten them
                        split_times_df = pd.DataFrame(race_result['splitTimes'].tolist(), index=race_result.index)
                        splits_dict = split_times_df.T.stack().to_frame()
                        splits_dict.reset_index(drop=True, inplace=True)
                        splits_normalised = pd.json_normalize(splits_dict[0])

                        if len(splits_normalised) > 0:

                            # Create a dataframe from the first split
                            first_split = splits_normalised[splits_normalised['splitTimeMarker'] == 1]
                            first_split = first_split[['runId','position','time']]
                            first_split = first_split.rename(columns={'position':'firstSplitPosition','time':'firstSplitTime'})

                            # Create a dataframe from the second split
                            second_split = splits_normalised[splits_normalised['splitTimeMarker'] == 2]
                            second_split = second_split[['runId','position','time']]
                            second_split = second_split.rename(columns={'position':'secondSplitPosition','time':'secondSplitTime'})

                            # Create a dataframe from the runIds, then merge the first and second split dataframes
                            split_times = splits_normalised[['runId']]
                            split_times = pd.merge(split_times,first_split,how='left',on=['runId'])
                            split_times = pd.merge(split_times,second_split,how='left',on=['runId'])

                            # Attach the split times to the race_run dataframe
                            race_run = pd.merge(race_run_normalised,split_times,how='left',on=['runId'])
                        else:
                            # No split data was returned for this race
                            race_run = race_run_normalised

                    except requests.HTTPError as http_err:
                        if http_err.response.status_code == 404:
                            # No result exists for this race - skip it
                            break
                        result_retries -= 1
                        if result_retries > 0:
                            time.sleep(sleep_time)
                        else:
                            break

                    except Exception as e:
                        print(f"Error {e}")
                        result_retries -= 1
                        if result_retries > 0:
                            time.sleep(sleep_time)
                        else:
                            break

                    # Only write to file once a complete result has been assembled
                    if race_run is not None:
                        race_run.drop_duplicates(inplace=True)
                        race_run.to_csv(file_path, mode='a', header=header_param, index=False)
                        break

def collect_topaz_data(api_key,codes,start_date,end_date,datatype,number_of_retries,sleep_time):
    '''
    This function combines our three previously defined functions, in the correct order of execution, into one convenient wrapper.
    Because the data is written to csv files as it is downloaded, the function does not need to return anything, so there is no need to capture a return value when calling it.
    '''
    date_range = generate_date_range(start_date,end_date)

    topaz_api = define_topaz_api(api_key)

    download_topaz_data(topaz_api,date_range,codes,datatype,number_of_retries,sleep_time)

collect_topaz_data(TOPAZ_API_KEY,JURISDICTION_CODES,start_date,end_date,'HISTORICAL',10,30)
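
After the download completes, each jurisdiction's data sits in its own CSV file (using the *_DATA_HISTORICAL.csv naming from the code above). A minimal sketch for combining those files into a single dataframe to build features from:

import glob
import pandas as pd

# Collect every per-jurisdiction historical file written by collect_topaz_data
files = glob.glob('*_DATA_HISTORICAL.csv')

# Combine into one dataframe, recording which file each row came from
historic_data = pd.concat(
    (pd.read_csv(f).assign(source_file=f) for f in files),
    ignore_index=True
)

print(f"Loaded {len(historic_data)} runs from {len(files)} files.")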

Leaderboard

Leaderboards will be published here throughout the competition


FAQs

Can I resubmit my submission if I notice an error?

  • Yes, only the final entry received before the deadline will be used for scoring

What are the guidelines for the rated prices?

  • Prices submitted must be greater than 1 (exactly 1 is not valid)
  • The overround for the market must be 1 (i.e. the reciprocals of all the rated prices must sum to 1)

What if my overround isn't 1?

  • Rated prices will be normalised so that the overround sums to 1 (a minimal sketch of this adjustment is shown below)
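
As a minimal sketch of that adjustment (illustrative only, not the official normalisation code):

import numpy as np

def normalise_prices(rated_prices):
    '''Scale rated prices so that the sum of their reciprocals (the overround) equals 1.'''
    probs = 1 / np.asarray(rated_prices, dtype=float)
    probs = probs / probs.sum()          # rescale implied probabilities to sum to 1
    return (1 / probs).tolist()          # convert back to prices

# Example: a book with an overround of roughly 1.07 becomes a 100% book
print(normalise_prices([2.5, 4.0, 6.0, 8.0, 12.0, 20.0]))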

What if I want to use a Machine Learning model?

  • The algorithm choice doesn’t matter from a competition perspective. However, you should use a classification model, not a regression model (see the sketch below).
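
For illustration, the sketch below shows why a classifier fits naturally: it outputs a win probability per runner, which converts directly to a rated price. The data here is synthetic placeholder data, and the feature set and model choice are assumptions for the example, not a recommendation.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: in practice these features would be engineered
# from the Topaz historic dataset (one row per historical run)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))        # e.g. recent form, box draw, track stats
y_train = rng.integers(0, 2, size=1000)     # 1 if the runner won, 0 otherwise

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Score the runners in an upcoming race (again a synthetic placeholder)
X_today = rng.normal(size=(8, 5))
win_probs = model.predict_proba(X_today)[:, 1]   # P(win) for each runner

# Convert probabilities to rated prices for the submission template
rated_prices = 1 / win_probs
print(rated_prices)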

What happens if I miss a race?

  • You’ll be assigned the median log loss from other entrants for that race.

What happens if a race is not run (abandoned) or is declared a no-race?

  • That race will be excluded from scoring

What happens if there is a scratching?

  • The runner will be removed and the remaining prices will be normalised so that the overround sums to 1

What happens if there is a deadheat?

  • The result for each winner will be recorded as 1 divided by the number of winners (e.g. 0.5 each in a two-way dead heat)

Disclaimer

Note that whilst models and automated strategies are fun and rewarding to create, we can't promise that your model or betting strategy will be profitable, and we make no representations in relation to the code shared or information on this page. If you're using this code or implementing your own strategies, you do so entirely at your own risk and you are responsible for any winnings/losses incurred. Under no circumstances will Betfair be liable for any loss or damage you suffer.