How to Automate IV: Automate your own Model
This is an archived version of How to Automate IV; the latest version is available here.
For this tutorial we will be automating the model that Bruno taught us how to make in the Greyhound Modelling Tutorial. This tutorial follows on logically from How to Automate III. If you haven't already, make sure you take a look at the rest of the series before continuing here, as they cover some key concepts!
Saving and loading in our model
To generate our predictions we have two options: we can generate the predictions in the same notebook we used to train our model and then read those predictions into this notebook, or we can save the trained model and read it into this notebook.
For this tutorial we have chosen to save the model, as it is a bit less confusing and easier to manage, although there are some pieces of code we may have to write twice (copy and paste). So first we will need to run the code from that tutorial and then save the model. This is super simple as we can just copy and paste the complete code provided at the end of the tutorial or download it from Github. Then we can run one extra line of code (which I have copied from the documentation page) at the end of the notebook to save the model.
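If you want a reference, a minimal sketch of the save step using joblib is below; trained_model is a placeholder for whatever your fitted model variable is called in the modelling tutorial, and the filename matches the one we load further down:
from joblib import dump
# trained_model is a placeholder - swap in the name of your fitted model object
dump(trained_model, 'logistic_regression.joblib')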
Now that the file is saved, let's read it into this notebook:
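from joblib import load
# Load the model we saved from the Greyhound Modelling Tutorial (these are the same lines used in the complete code at the end of this article)
brunos_model = load('logistic_regression.joblib')
brunos_model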
Generating predictions for today
Now that we have the model loaded in, we need the data to generate our predictions for today's races!
# Import libraries required to download today's races
import os
import sys
# Allow imports from src folder
module_path = os.path.abspath(os.path.join('../src'))
if module_path not in sys.path:
sys.path.append(module_path)
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
from dateutil import tz
from pandas.tseries.offsets import MonthEnd
from sklearn.preprocessing import MinMaxScaler
import itertools
import numpy as np
import pandas as pd
from nltk.tokenize import regexp_tokenize
# settings to display all columns
pd.set_option("display.max_columns", None)
import fasttrack as ft
from dotenv import load_dotenv
load_dotenv()
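# Validate the FastTrack API connection (same setup as in the complete code at the end of this article)
# client and track_codes are used by the download loop below
api_key = os.getenv('FAST_TRACK_API_KEY')
client = ft.Fasttrack(api_key)
track_codes = client.listTracks()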
# Import race data excluding NZ races
au_tracks_filter = list(track_codes[track_codes['state'] != 'NZ']['track_code'])
# Time window to import data
# First day of the month 46 months back from now
date_from = (datetime.today() - relativedelta(months=46)).replace(day=1).strftime('%Y-%m-%d')
# First day of previous month
date_to = (datetime.today() - relativedelta(months=1)).replace(day=1).strftime('%Y-%m-%d')
# Dataframes to populate data with
race_details = pd.DataFrame()
dog_results = pd.DataFrame()
# For each month, either fetch data from API or use local CSV file if we already have downloaded it
for start in pd.date_range(date_from, date_to, freq='MS'):
start_date = start.strftime("%Y-%m-%d")
end_date = (start + MonthEnd(1)).strftime("%Y-%m-%d")
try:
filename_races = f'FT_AU_RACES_{start_date}.csv'
filename_dogs = f'FT_AU_DOGS_{start_date}.csv'
filepath_races = f'../data/{filename_races}'
filepath_dogs = f'../data/{filename_dogs}'
print(f'Loading data from {start_date} to {end_date}')
if os.path.isfile(filepath_races):
# Load local CSV file
month_race_details = pd.read_csv(filepath_races)
month_dog_results = pd.read_csv(filepath_dogs)
else:
# Fetch data from API
month_race_details, month_dog_results = client.getRaceResults(start_date, end_date, au_tracks_filter)
month_race_details.to_csv(filepath_races, index=False)
month_dog_results.to_csv(filepath_dogs, index=False)
# Combine monthly data
race_details = race_details.append(month_race_details, ignore_index=True)
dog_results = dog_results.append(month_dog_results, ignore_index=True)
except:
print(f'Could not load data from {start_date} to {end_date}')
This piece of code we copied and pasted from the Greyhound Modelling Tutorial is fantastic! It has downloaded/read in a ton of historic data! There is an issue though: we don't have the data for today's races, or for any races that have occurred this month. This is because the code above only downloads data up until the end of last month.
For example, if we are in the middle of June, then any races in the first two weeks of June won't be downloaded by the chunk of code above. An issue is that if we download it now, when tomorrow rolls around it won't include the extra races that have finished today.
So, the simple but inefficient solution is that every single day we redownload all the races that have already concluded this month. (Ideally you have some sort of database set up or you store and download your data in a daily format instead of the monthly format)
current_month_start_date = pd.Timestamp.now().replace(day=1).strftime("%Y-%m-%d")
current_month_end_date = (pd.Timestamp.now().replace(day=1)+ MonthEnd(1))
current_month_end_date = (current_month_end_date - pd.Timedelta('1 day')).strftime("%Y-%m-%d")
print(f'Start date: {current_month_start_date}')
print(f'End Date: {current_month_end_date}')
# Download data for races that have concluded this current month up until today
# Start and end dates for current month
current_month_start_date = pd.Timestamp.now().replace(day=1).strftime("%Y-%m-%d")
current_month_end_date = (pd.Timestamp.now().replace(day=1)+ MonthEnd(1))
current_month_end_date = (current_month_end_date - pd.Timedelta('1 day')).strftime("%Y-%m-%d")
# Files names
filename_races = f'FT_AU_RACES_{current_month_start_date}.csv'
filename_dogs = f'FT_AU_DOGS_{current_month_start_date}.csv'
# Where to store files locally
filepath_races = f'../data/{filename_races}'
filepath_dogs = f'../data/{filename_dogs}'
# Fetch data from API
month_race_details, month_dog_results = client.getRaceResults(current_month_start_date, current_month_end_date, au_tracks_filter)
# Save the files locally and replace any out of date fields
month_race_details.to_csv(filepath_races, index=False)
month_dog_results.to_csv(filepath_dogs, index=False)
What we are really interested in are races that are scheduled for today as we want to use our model to predict their ratings. So, let's write some code we can run in the morning that will download the data for the day:
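# Download the data for todays races (this is the same chunk used in the complete code at the end of this article)
todays_date = pd.Timestamp.now().strftime("%Y-%m-%d")
todays_races, todays_dogs = client.getFullFormat(dt=todays_date, tracks=au_tracks_filter)
# The todays_races dataframe doesn't come with a date column, so let's add that on
todays_races['date'] = pd.Timestamp.now().strftime('%d %b %y')
todays_races.head(1)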
# It also seems that in todays_dogs dataframe Box is labeled as RaceBox instead, so let's rename it
# We can also see that there are some specific dogs that have "Res." as a suffix of their name, i.e. they are reserve dogs,
# We will treat this later
todays_dogs = todays_dogs.rename(columns={"RaceBox":"Box"})
todays_dogs.tail(3)
# Appending todays data to this months data
month_dog_results = pd.concat([month_dog_results,todays_dogs],join='outer')[month_dog_results.columns]
month_race_details = pd.concat([month_race_details,todays_races],join='outer')[month_race_details.columns]
# Appending this months data to the rest of our historical data
race_details = race_details.append(month_race_details, ignore_index=True)
dog_results = dog_results.append(month_dog_results, ignore_index=True)
Cleaning our data and feature creation
Originally I thought that, now that we have all the data, we could easily copy and paste the code used in the Greyhound Modelling Tutorial to clean our data and create the features.
But after staring at weird predictions and spending hours trying to work out why some things weren't working, I realised that for the most part we can copy and paste code, but when working with live data we do need to make a few changes. I'll point them out when we get to them, but the main things that tripped me up are the data types the FastTrack API returns and the need for a system to work around reserve dogs.
The first thing that tripped me up was that FastTrack_DogId comes through as a string for the live data, and because everything looks like it works, it took ages to find this error. So, let's make sure we deal with it here using:
## Cleanse and normalise the data
# Clean up the race dataset
race_details = race_details.rename(columns = {'@id': 'FastTrack_RaceId'})
race_details['Distance'] = race_details['Distance'].apply(lambda x: int(x.replace("m", "")))
race_details['date_dt'] = pd.to_datetime(race_details['date'], format = '%d %b %y')
# Clean up the dogs results dataset
dog_results = dog_results.rename(columns = {'@id': 'FastTrack_DogId', 'RaceId': 'FastTrack_RaceId'})
# New line of code (rest of this code chunk is copied from bruno's code)
dog_results['FastTrack_DogId'] = pd.to_numeric(dog_results['FastTrack_DogId'])
# Combine dogs results with race attributes
dog_results = dog_results.merge(
race_details,
how = 'left',
on = 'FastTrack_RaceId'
)
# Convert StartPrice to probability
dog_results['StartPrice'] = dog_results['StartPrice'].apply(lambda x: None if x is None else float(x.replace('$', '').replace('F', '')) if isinstance(x, str) else x)
dog_results['StartPrice_probability'] = (1 / dog_results['StartPrice']).fillna(0)
dog_results['StartPrice_probability'] = dog_results.groupby('FastTrack_RaceId')['StartPrice_probability'].apply(lambda x: x / x.sum())
# Discard entries without results (scratched or did not finish)
dog_results = dog_results[~dog_results['Box'].isnull()]
dog_results['Box'] = dog_results['Box'].astype(int)
# Clean up other attributes
dog_results['RunTime'] = dog_results['RunTime'].astype(float)
dog_results['SplitMargin'] = dog_results['SplitMargin'].astype(float)
dog_results['Prizemoney'] = dog_results['Prizemoney'].astype(float).fillna(0)
dog_results['Place'] = pd.to_numeric(dog_results['Place'].apply(lambda x: x.replace("=", "") if isinstance(x, str) else 0), errors='coerce').fillna(0)
dog_results['win'] = dog_results['Place'].apply(lambda x: 1 if x == 1 else 0)
# Normalise some of the raw values
dog_results['Prizemoney_norm'] = np.log10(dog_results['Prizemoney'] + 1) / 12
dog_results['Place_inv'] = (1 / dog_results['Place']).fillna(0)
dog_results['Place_log'] = np.log10(dog_results['Place'] + 1).fillna(0)
dog_results['RunSpeed'] = (dog_results['RunTime'] / dog_results['Distance']).fillna(0)
## Generate features using raw data
# Calculate median winner time per track/distance
win_results = dog_results[dog_results['win'] == 1]
median_win_time = pd.DataFrame(data=win_results[win_results['RunTime'] > 0].groupby(['Track', 'Distance'])['RunTime'].median()).rename(columns={"RunTime": "RunTime_median"}).reset_index()
median_win_split_time = pd.DataFrame(data=win_results[win_results['SplitMargin'] > 0].groupby(['Track', 'Distance'])['SplitMargin'].median()).rename(columns={"SplitMargin": "SplitMargin_median"}).reset_index()
median_win_time.head()
# Calculate track speed index
median_win_time['speed_index'] = (median_win_time['RunTime_median'] / median_win_time['Distance'])
median_win_time['speed_index'] = MinMaxScaler().fit_transform(median_win_time[['speed_index']])
median_win_time.head()
# Compare dogs finish time with median winner time
dog_results = dog_results.merge(median_win_time, on=['Track', 'Distance'], how='left')
dog_results = dog_results.merge(median_win_split_time, on=['Track', 'Distance'], how='left')
# Normalise time comparison
dog_results['RunTime_norm'] = (dog_results['RunTime_median'] / dog_results['RunTime']).clip(0.9, 1.1)
dog_results['RunTime_norm'] = MinMaxScaler().fit_transform(dog_results[['RunTime_norm']])
dog_results['SplitMargin_norm'] = (dog_results['SplitMargin_median'] / dog_results['SplitMargin']).clip(0.9, 1.1)
dog_results['SplitMargin_norm'] = MinMaxScaler().fit_transform(dog_results[['SplitMargin_norm']])
dog_results.head()
# Calculate box winning percentage for each track/distance
box_win_percent = pd.DataFrame(data=dog_results.groupby(['Track', 'Distance', 'Box'])['win'].mean()).rename(columns={"win": "box_win_percent"}).reset_index()
# Add to dog results dataframe
dog_results = dog_results.merge(box_win_percent, on=['Track', 'Distance', 'Box'], how='left')
# Display example of barrier winning probabilities
print(box_win_percent.head(8))
The second thing that we need to add is related to reserve dogs. It took me ages to come up with this solution, so if you have a better one, please submit a pull request.
Basically, a single greyhound can be a reserve dog for multiple races on the same day, and each of those races appears as a new row in our dataframe. For example, 'MACI REID' is a reserve dog for three different races on 2022-09-02:
When we try to lag our data by using .shift(1) like in Bruno's original code, it will produce the wrong values for our features. In the above example only the first race, The Gardens Race 4 (the third row), will have correct data, but all the rows under it will have incorrectly calculated features. We need each of the following rows to be the same as the third row. The solution I have come up with is a little bit complicated, but it gets the job done:
# Please submit a pull request if you have a better solution
temp = rolling_result.reset_index()
temp = temp[temp['date_dt'] == pd.Timestamp.now().normalize()]
temp.groupby(['FastTrack_DogId','date_dt']).first()
rolling_result.loc[pd.IndexSlice[:, pd.Timestamp.now().normalize()], :] = temp.groupby(['FastTrack_DogId','date_dt']).first()
Basically, for each greyhound we just take the first row of today's data (which is correct) and set the rest of today's rows to the same values.
# Generate rolling window features
dataset = dog_results.copy()
dataset = dataset.set_index(['FastTrack_DogId', 'date_dt']).sort_index()
# Use rolling window of 28, 91 and 365 days
rolling_windows = ['28D', '91D', '365D']
# Features to use for rolling windows calculation
features = ['RunTime_norm', 'SplitMargin_norm', 'Place_inv', 'Place_log', 'Prizemoney_norm']
# Aggregation functions to apply
aggregates = ['min', 'max', 'mean', 'median', 'std']
# Keep track of generated feature names
feature_cols = ['speed_index', 'box_win_percent']
for rolling_window in rolling_windows:
print(f'Processing rolling window {rolling_window}')
rolling_result = (
dataset
.reset_index(level=0).sort_index()
.groupby('FastTrack_DogId')[features]
.rolling(rolling_window)
.agg(aggregates)
.groupby(level=0) # Thanks to Brett for finding this!
.shift(1)
)
# My own dodgy code to work with reserve dogs
temp = rolling_result.reset_index()
temp = temp[temp['date_dt'] == pd.Timestamp.now().normalize()]
temp.groupby(['FastTrack_DogId','date_dt']).first()
rolling_result.loc[pd.IndexSlice[:, pd.Timestamp.now().normalize()], :] = temp.groupby(['FastTrack_DogId','date_dt']).first()
# Generate list of rolling window feature names (eg: RunTime_norm_min_365D)
agg_features_cols = [f'{f}_{a}_{rolling_window}' for f, a in itertools.product(features, aggregates)]
# Add features to dataset
dataset[agg_features_cols] = rolling_result
# Keep track of generated feature names
feature_cols.extend(agg_features_cols)
# Replace missing values with 0
dataset.fillna(0, inplace=True)
display(dataset.head(8))
# Only keep data after 2018-12-01
model_df = dataset.reset_index()
feature_cols = np.unique(feature_cols).tolist()
model_df = model_df[model_df['date_dt'] >= '2018-12-01']
# This line was originally part of Bruno's tutorial, but we don't run it in this script
# model_df = model_df[['date_dt', 'FastTrack_RaceId', 'DogName', 'win', 'StartPrice_probability'] + feature_cols]
# Only train model off of races where each dog has a value for each feature
races_exclude = model_df[model_df.isnull().any(axis = 1)]['FastTrack_RaceId'].drop_duplicates()
model_df = model_df[~model_df['FastTrack_RaceId'].isin(races_exclude)]
Generate predictions
Now this is the part that gets a bit hairy, so I am going to split it up into two parts. The good thing is that the coding will remain relatively simple.
The two things that I want to do are place live bets and save our predictions so that we can use them in the simulator we will create in Part V.
Let's save our historical ratings for our simulator first, as it's quick and straightforward, and then move on to placing live bets:
Getting data ready for our simulator
Feeding our predictions through the simulator is entirely optional but, in my opinion, it is where the real sauce is made. The idea is that while we are testing our model live, we can also use the simulator to test what would happen with different staking methodologies, market timings and bet placement, to optimise our model. This way you can have one model but test out different strategies to optimise its performance. The thing is, I have had a play with the simulator already and we can't simulate market_catalogue unless you have recorded it yourself (which is what I'll be using to get market_id and selection_id to place live bets). The simulator we will use later on only takes your ratings, market_id and selection_id, so we need our data in a similar format to what we had in How to Automate III. In other words, since we don't have market_catalogue in the simulator, we need another way to get the market_id and selection_id.
My hacky workaround is to generate the probabilities like normal (since the data is historical, we don't need to deal with reserve dogs and scratchings), then get the market_id and selection_id from the Betfair datascience greyhound model by merging on DogName and date. We can take the code we wrote in How to Automate III that downloads the greyhound ratings and convert it into a function that downloads the ratings for a date range.
# Generate predictions like normal
# Range of dates that we want to simulate later '2022-03-01' to '2022-04-01'
todays_data = model_df[(model_df['date_dt'] >= pd.Timestamp('2022-03-01').strftime('%Y-%m-%d')) & (model_df['date_dt'] < pd.Timestamp('2022-04-01').strftime('%Y-%m-%d'))]
dog_win_probabilities = brunos_model.predict_proba(todays_data[feature_cols])[:,1]
todays_data['prob_LogisticRegression'] = dog_win_probabilities
todays_data['renormalise_prob'] = todays_data.groupby('FastTrack_RaceId')['prob_LogisticRegression'].apply(lambda x: x / x.sum())
todays_data['rating'] = 1/todays_data['renormalise_prob']
todays_data = todays_data.sort_values(by = 'date_dt')
todays_data
def download_iggy_ratings(date):
"""Downloads the Betfair Iggy model ratings for a given date and formats it into a nice DataFrame.
Args:
date (datetime): the date we want to download the ratings for
"""
iggy_url_1 = 'https://betfair-data-supplier-prod.herokuapp.com/api/widgets/iggy-joey/datasets?date='
iggy_url_2 = date.strftime("%Y-%m-%d")
iggy_url_3 = '&presenter=RatingsPresenter&csv=true'
iggy_url = iggy_url_1 + iggy_url_2 + iggy_url_3
# Download todays greyhounds ratings
iggy_df = pd.read_csv(iggy_url)
# Data cleaning
iggy_df = iggy_df.rename(
columns={
"meetings.races.bfExchangeMarketId":"market_id",
"meetings.races.runners.bfExchangeSelectionId":"selection_id",
"meetings.races.runners.ratedPrice":"rating",
"meetings.races.number":"RaceNum",
"meetings.name":"Track",
"meetings.races.runners.name":"DogName"
}
)
# iggy_df = iggy_df[['market_id','selection_id','rating']]
iggy_df['market_id'] = iggy_df['market_id'].astype(str)
iggy_df['date_dt'] = date
# Set market_id and selection_id as index for easy referencing
# iggy_df = iggy_df.set_index(['market_id','selection_id'])
return(iggy_df)
# Download historical ratings over a time period and convert into a big DataFrame.
back_test_period = pd.date_range(start='2022-03-01', end='2022-04-01')
frames = [download_iggy_ratings(day) for day in back_test_period]
iggy_df = pd.concat(frames)
iggy_df
# format DogNames to merge
todays_data['DogName'] = todays_data['DogName'].apply(lambda x: x.replace("'", "").replace(".", "").replace("Res", "").strip())
iggy_df['DogName'] = iggy_df['DogName'].str.upper()
# Merge
backtest = iggy_df[['market_id','selection_id','DogName','date_dt']].merge(todays_data[['rating','DogName','date_dt']], how = 'inner', on = ['DogName','date_dt'])
backtest
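# Save the merged ratings so we can feed them into the simulator in Part V (same lines as in the complete code below)
backtest.to_csv('backtest.csv', index=False)  # csv format
# backtest.to_pickle('backtest.pkl')  # pickle format (faster, but can't open in excel)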
Perfect, with our hacky solution we have managed to merge around a month's worth of data relatively quickly and save it in csv format. With all the merging it seems we have only lost around 1,000 - 2,000 rows of data out of 27,000, which seems a small price to pay.
Getting data ready for placing live bets
Placing live bets is pretty simple but we have one issue. FastTrack Data alone is unable to tell us how many greyhounds will run in the race. For example, this race later today (2022-07-04) has 8 runners + 2 reserves:
If we predict probabilities and renormalise now, we will calculate incorrect probabilities.
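To make that concrete, here's a toy sketch with made-up numbers showing how renormalising over all ten listed dogs (including the two reserves) dilutes the probabilities of the eight dogs that actually run:
import numpy as np
# Hypothetical raw model outputs for the 8 confirmed runners plus the 2 reserves (made-up numbers)
raw_probs = np.array([0.30, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05, 0.04, 0.18, 0.16])
wrong = raw_probs / raw_probs.sum()            # renormalised over all 10 listed dogs
right = raw_probs[:8] / raw_probs[:8].sum()    # renormalised over the 8 dogs that actually run
print(round(wrong[:8].sum(), 2))  # ~0.75 - the real runners no longer sum to 1
print(round(right.sum(), 2))      # 1.0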
I've spent a really long time thinking about this and testing different methods that didn't work or weren't optimal. The best (and least complicated) solution I have come up with is to predict probabilities on the FastTrack data first. Then, a few minutes before the jump when all the line-ups have been confirmed, we use market_catalogue from the Betfair API to merge on our predicted probabilities, merging on DogName, Track and RaceNum. If we merge on these three fields, it will bypass any issues with reserve dogs and scratchings. Then we can renormalise the probabilities live within Flumine.
# Select todays data
todays_data = model_df[model_df['date_dt'] == pd.Timestamp.now().strftime('%Y-%m-%d')]
# Generate runner win predictions
dog_win_probabilities = brunos_model.predict_proba(todays_data[feature_cols])[:,1]
todays_data['prob_LogisticRegression'] = dog_win_probabilities
# We no longer renormalise the probabilities in this chunk of code, we do it in Flumine instead
# todays_data['renormalise_prob'] = todays_data.groupby('FastTrack_RaceId')['prob_LogisticRegression'].apply(lambda x: x / x.sum())
# todays_data['rating'] = 1/todays_data['renormalise_prob']
# todays_data = todays_data.sort_values(by = 'date_dt')
todays_data
Before we merge, let's make some minor formatting changes to the FastTrack names so we can match them onto the Betfair names. Betfair excludes all apostrophes and full stops in their naming convention, so we'll create a Betfair-equivalent dog name on the dataset by removing these characters. We also need to do this for the tracks; sometimes FastTrack will name tracks differently to Betfair, e.g. Sandown Park on Betfair is known as Sandown (SAP) in the FastTrack database.
# Prepare data for easy reference in flumine
todays_data['DogName_bf'] = todays_data['DogName'].apply(lambda x: x.replace("'", "").replace(".", "").replace("Res", "").strip())
todays_data.replace({r'Sandown \(SAP\)': 'Sandown Park'}, regex=True, inplace=True)  # escape the brackets so the regex matches the literal track name
todays_data = todays_data.set_index(['DogName_bf','Track','RaceNum'])
todays_data.head()
If you look closely at the dataframe above you might notice that reserve dogs have a Box number of 9 or 10. There is only ever a maximum of 8 greyhounds per race, so we will need to adjust this somehow. I didn't notice this issue for quite a while, but the good thing is the website gives us the info we need to adjust:
We can see that Rhinestone Ash is a reserve dog and has the number 9; if you click on rules, you can see which box it is starting from:
The problem is, my web scraping is pretty poor and it would take significant time for me to learn. But after going through the documentation again, changes to boxes are actually available through the API under the clarifications attribute of marketDescription. You will be able to access this within Flumine as market.market_catalogue.description.clarifications, but it's a bit weird. It returns box changes as a string that looks like the my_string example in Brett's code below.
Originally I had planned to leave this article as it is, since I've never worked with anything like this before and it's already getting pretty long. However, a huge shoutout to the Betfair Quants community, and especially Brett, who provided his solution to working with box changes.
from nltk.tokenize import regexp_tokenize
# my_string is an example string, that you will need to get live from the api via: market.market_catalogue.description.clarifications.replace("<br/> Dog","<br/>Dog")
my_string = "<br/>Box changes:<br/>Dog 9. Tralee Blaze starts from box no. 8<br/><br/>Dog 6. That Other One starts from box no. 2<br/><br/>"
print(f'HTML Comment: {my_string}')
pattern1 = r'(?<=<br/>Dog ).+?(?= starts)'
pattern2 = r"(?<=\bbox no. )(\w+)"
runners_df = pd.DataFrame (regexp_tokenize(my_string, pattern1), columns = ['runner_name'])
runners_df['runner_name'] = runners_df['runner_name'].astype(str)
# Remove dog name from runner_number
runners_df['runner_number'] = runners_df['runner_name'].apply(lambda x: x[:(x.find(" ") - 1)].upper())
# Remove dog number from runner_name
runners_df['runner_name'] = runners_df['runner_name'].apply(lambda x: x[(x.find(" ") + 1):].upper())
runners_df['Box'] = regexp_tokenize(my_string, pattern2)
runners_df
Brett's solution is amazing; there is only one problem. Currently our code is structured so that we generate our predictions in the morning, well before the races start. To implement the above fix, we need to generate our predictions just before each race starts so we can incorporate the box information.
This means we need to write a little bit more code to make it happen, but we are almost there.
So now my plan is to update the old data and generate probabilities just before the race. Just before the jump, my code structure will look like this:
- pull any data on box changes from the Betfair API
- convert the box change data into a dataframe named runners_df using Brett's code
- in my original dataframe named todays_data, replace any Box data with the runners_df data, otherwise leave it untouched
- then merge the box_win_percent dataframe back onto the todays_data dataframe
- now we can predict probabilities again and then renormalise them
It may sound a little complicated, but as we already have Brett's code there are only a few extra lines we need to write. This is what we will add into our Flumine strategy along with Brett's code:
# Running Brett's code gives us a nice dataframe named runners_df that we can work with
# Replace any old Box info in our original dataframe with the data available in runners_df
runners_df = runners_df.set_index('runner_name')
todays_data.loc[(runners_df.index[runners_df.index.isin(dog_names)],track,race_number),'Box'] = runners_df.loc[runners_df.index.isin(dog_names),'Box'].to_list()
# Merge box_win_percent back on (drop the stale column first so we don't end up with duplicate columns)
todays_data = todays_data.drop(columns = 'box_win_percent', axis = 1)
todays_data = todays_data.reset_index().merge(box_win_percent, on = ['Track', 'Distance','Box'], how = 'left').set_index(['DogName_bf','Track','RaceNum'])
# Generate probabilities using Bruno's model
todays_data.loc[(dog_names,track,race_number),'prob_LogisticRegression'] = brunos_model.predict_proba(todays_data.loc[(dog_names,track,race_number)][feature_cols])[:,1]
# Renormalise probabilities
probabilities = todays_data.loc[dog_names,track,race_number]['prob_LogisticRegression']
todays_data.loc[(dog_names,track,race_number),'renormalised_prob'] = probabilities/probabilities.sum()
# Convert probabilities to ratings
todays_data.loc[(dog_names,track,race_number),'rating'] = 1/todays_data.loc[dog_names,track,race_number]['renormalised_prob']
Now everything is done, and we can finally move on to placing our bets.
Automating our predictions
Now that we have our data nicely set up, we can reference our probabilities by getting the DogName, Track and RaceNum from the Betfair polling API and then renormalise the probabilities to calculate ratings with only a few lines of code. The rest is the same as How to Automate III.
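For reference, the lookup inside the strategy below boils down to a single MultiIndex .loc call; dog_name, track and race_number are the values we pull out of the polling API:
model_price = todays_data.loc[dog_name, track, race_number]['rating']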
# Import libraries for logging in
import betfairlightweight
from flumine import Flumine, clients
# Credentials to login and logging in
trading = betfairlightweight.APIClient('username','password',app_key='appkey')
client = clients.BetfairClient(trading, interactive_login=True)
# Login
framework = Flumine(client=client)
# Code to login when using security certificates
# trading = betfairlightweight.APIClient('username','password',app_key='appkey', certs=r'C:\Users\zhoui\openssl_certs')
# client = clients.BetfairClient(trading)
# framework = Flumine(client=client)
# Import libraries and logging
from flumine import BaseStrategy
from flumine.order.trade import Trade
from flumine.order.order import LimitOrder
from flumine.markets.market import Market
from betfairlightweight.filters import streaming_market_filter
from betfairlightweight.resources import MarketBook
import re
import pandas as pd
import numpy as np
import datetime
import logging
logging.basicConfig(filename = 'how_to_automate_4.log', level=logging.INFO, format='%(asctime)s:%(levelname)s:%(message)s')
Let's create a new class for our strategy called FlatBetting that finds the best available back and lay prices 60 seconds before the jump. If either of those prices has value, we will place a flat $5 bet at that price. This code is almost the same as How to Automate III.
Since we are now editing our todays_data dataframe inside our Flumine strategy, we will also need to convert todays_data to a global variable, which is a simple one-liner:
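# This is the one-liner - it is declared inside process_market_book in the strategy code that follows
global todays_data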
I also wanted to call out one gotcha that Brett found, which is almost impossible to spot unless you are keeping a close eye on your logs. Sometimes the polling API and streaming API don't match up when there are scratchings, so we need to check that they do:
class FlatBetting(BaseStrategy):
def start(self) -> None:
print("starting strategy 'FlatBetting' using the model we created in the Greyhound modelling in Python Tutorial")
def check_market_book(self, market: Market, market_book: MarketBook) -> bool:
if market_book.status != "CLOSED":
return True
def process_market_book(self, market: Market, market_book: MarketBook) -> None:
# Convert dataframe to a global variable
global todays_data
# At the 60 second mark:
if market.seconds_to_start < 60 and market_book.inplay == False:
# get the list of dog_names, name of the track/venue and race_number/RaceNum from Betfair Polling API
dog_names = []
track = market.market_catalogue.event.venue
race_number = market.market_catalogue.market_name.split(' ',1)[0] # comes out as R1/R2/R3 .. etc
race_number = re.sub("[^0-9]", "", race_number) # only keep the numbers
for runner_cata in market.market_catalogue.runners:
dog_name = runner_cata.runner_name.split(' ',1)[1].upper()
dog_names.append(dog_name)
# Check if there are box changes, if there are then use Brett's code
if market.market_catalogue.description.clarifications != None:
# Brett's code to get Box changes:
my_string = market.market_catalogue.description.clarifications.replace("<br/> Dog","<br/>Dog")
pattern1 = r'(?<=<br/>Dog ).+?(?= starts)'
pattern2 = r"(?<=\bbox no. )(\w+)"
runners_df = pd.DataFrame (regexp_tokenize(my_string, pattern1), columns = ['runner_name'])
runners_df['runner_name'] = runners_df['runner_name'].astype(str)
# Remove dog name from runner_number
runners_df['runner_number'] = runners_df['runner_name'].apply(lambda x: x[:(x.find(" ") - 1)].upper())
# Remove dog number from runner_name
runners_df['runner_name'] = runners_df['runner_name'].apply(lambda x: x[(x.find(" ") + 1):].upper())
runners_df['Box'] = regexp_tokenize(my_string, pattern2)
# Replace any old Box info in our original dataframe with data available in runners_df
runners_df = runners_df.set_index('runner_name')
todays_data.loc[(runners_df.index[runners_df.index.isin(dog_names)],track,race_number),'Box'] = runners_df.loc[runners_df.index.isin(dog_names),'Box'].to_list()
# Merge box_win_percent back on (drop the stale column first so we don't end up with duplicate columns):
todays_data = todays_data.drop(columns = 'box_win_percent', axis = 1)
todays_data = todays_data.reset_index().merge(box_win_percent, on = ['Track', 'Distance','Box'], how = 'left').set_index(['DogName_bf','Track','RaceNum'])
# Generate probabilities using Bruno's model
todays_data.loc[(dog_names,track,race_number),'prob_LogisticRegression'] = brunos_model.predict_proba(todays_data.loc[(dog_names,track,race_number)][feature_cols])[:,1]
# renormalise probabilities
probabilities = todays_data.loc[dog_names,track,race_number]['prob_LogisticRegression']
todays_data.loc[(dog_names,track,race_number),'renormalised_prob'] = probabilities/probabilities.sum()
# convert probabilities to ratings
todays_data.loc[(dog_names,track,race_number),'rating'] = 1/todays_data.loc[dog_names,track,race_number]['renormalised_prob']
# Use both the polling api (market.catalogue) and the streaming api at once:
for runner_cata, runner in zip(market.market_catalogue.runners, market_book.runners):
# Check the polling api and streaming api matches up (sometimes it doesn't)
if runner_cata.selection_id == runner.selection_id:
# Get the dog_name from polling api then reference our data for our model rating
dog_name = runner_cata.runner_name.split(' ',1)[1].upper()
# Rest is the same as How to Automate III
model_price = todays_data.loc[dog_name,track,race_number]['rating']
### If you have an issue such as:
# Unknown error The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# Then do model_price = todays_data.loc[dog_name,track,race_number]['rating'].item()
# Log info before placing bets
logging.info(f'dog_name: {dog_name}')
logging.info(f'model_price: {model_price}')
logging.info(f'market_id: {market_book.market_id}')
logging.info(f'selection_id: {runner.selection_id}')
# If best available to back price is > rated price then flat $5 back
if runner.status == "ACTIVE" and runner.ex.available_to_back[0]['price'] > model_price:
trade = Trade(
market_id=market_book.market_id,
selection_id=runner.selection_id,
handicap=runner.handicap,
strategy=self,
)
order = trade.create_order(
side="BACK", order_type=LimitOrder(price=runner.ex.available_to_back[0]['price'], size=5.00)
)
market.place_order(order)
# If best available to lay price is < rated price then flat $5 lay
if runner.status == "ACTIVE" and runner.ex.available_to_lay[0]['price'] < model_price:
trade = Trade(
market_id=market_book.market_id,
selection_id=runner.selection_id,
handicap=runner.handicap,
strategy=self,
)
order = trade.create_order(
side="LAY", order_type=LimitOrder(price=runner.ex.available_to_lay[0]['price'], size=5.00)
)
market.place_order(order)
As the model we have built is a greyhound model for Australian racing, let's point our strategy at Australian greyhound win markets:
greyhounds_strategy = FlatBetting(
market_filter=streaming_market_filter(
event_type_ids=["4339"], # Greyhounds markets
country_codes=["AU"], # Australian markets
market_types=["WIN"], # Win markets
),
max_order_exposure= 50, # Max exposure per order = 50
max_trade_count=1, # Max 1 trade per selection
max_live_trade_count=1, # Max 1 unmatched trade per selection
)
framework.add_strategy(greyhounds_strategy)
And add our auto-terminate and bet logging from the previous tutorials:
# import logging
import datetime
from flumine.worker import BackgroundWorker
from flumine.events.events import TerminationEvent
# logger = logging.getLogger(__name__)
"""
Worker can be used as followed:
framework.add_worker(
BackgroundWorker(
framework,
terminate,
func_kwargs={"today_only": True, "seconds_closed": 1200},
interval=60,
start_delay=60,
)
)
This will run every 60s and will terminate
the framework if all markets starting 'today'
have been closed for at least 1200s
"""
# Function that stops automation running at the end of the day
def terminate(
context: dict, flumine, today_only: bool = True, seconds_closed: int = 600
) -> None:
"""terminate framework if no markets
live today.
"""
markets = list(flumine.markets.markets.values())
markets_today = [
m
for m in markets
if m.market_start_datetime.date() == datetime.datetime.utcnow().date()
and (
m.elapsed_seconds_closed is None
or (m.elapsed_seconds_closed and m.elapsed_seconds_closed < seconds_closed)
)
]
if today_only:
market_count = len(markets_today)
else:
market_count = len(markets)
if market_count == 0:
# logger.info("No more markets available, terminating framework")
flumine.handler_queue.put(TerminationEvent(flumine))
# Add the stopped to our framework
framework.add_worker(
BackgroundWorker(
framework,
terminate,
func_kwargs={"today_only": True, "seconds_closed": 1200},
interval=60,
start_delay=60,
)
)
import os
import csv
import logging
from flumine.controls.loggingcontrols import LoggingControl
from flumine.order.ordertype import OrderTypes
logger = logging.getLogger(__name__)
FIELDNAMES = [
"bet_id",
"strategy_name",
"market_id",
"selection_id",
"trade_id",
"date_time_placed",
"price",
"price_matched",
"size",
"size_matched",
"profit",
"side",
"elapsed_seconds_executable",
"order_status",
"market_note",
"trade_notes",
"order_notes",
]
class LiveLoggingControl(LoggingControl):
NAME = "BACKTEST_LOGGING_CONTROL"
def __init__(self, *args, **kwargs):
super(LiveLoggingControl, self).__init__(*args, **kwargs)
self._setup()
# Changed file path and checks if the file orders_hta_4.csv already exists; if it doesn't then create it
def _setup(self):
if os.path.exists("orders_hta_4.csv"):
logging.info("Results file exists")
else:
with open("orders_hta_4.csv", "w") as m:
csv_writer = csv.DictWriter(m, delimiter=",", fieldnames=FIELDNAMES)
csv_writer.writeheader()
def _process_cleared_orders_meta(self, event):
orders = event.event
with open("orders_hta_4.csv", "a") as m:
for order in orders:
if order.order_type.ORDER_TYPE == OrderTypes.LIMIT:
size = order.order_type.size
else:
size = order.order_type.liability
if order.order_type.ORDER_TYPE == OrderTypes.MARKET_ON_CLOSE:
price = None
else:
price = order.order_type.price
try:
order_data = {
"bet_id": order.bet_id,
"strategy_name": order.trade.strategy,
"market_id": order.market_id,
"selection_id": order.selection_id,
"trade_id": order.trade.id,
"date_time_placed": order.responses.date_time_placed,
"price": price,
"price_matched": order.average_price_matched,
"size": size,
"size_matched": order.size_matched,
"profit": 0 if not order.cleared_order else order.cleared_order.profit,
"side": order.side,
"elapsed_seconds_executable": order.elapsed_seconds_executable,
"order_status": order.status.value,
"market_note": order.trade.market_notes,
"trade_notes": order.trade.notes_str,
"order_notes": order.notes_str,
}
csv_writer = csv.DictWriter(m, delimiter=",", fieldnames=FIELDNAMES)
csv_writer.writerow(order_data)
except Exception as e:
logger.error(
"_process_cleared_orders_meta: %s" % e,
extra={"order": order, "error": e},
)
logger.info("Orders updated", extra={"order_count": len(orders)})
def _process_cleared_markets(self, event):
cleared_markets = event.event
for cleared_market in cleared_markets.orders:
logger.info(
"Cleared market",
extra={
"market_id": cleared_market.market_id,
"bet_count": cleared_market.bet_count,
"profit": cleared_market.profit,
"commission": cleared_market.commission,
},
)
framework.add_logging_control(
LiveLoggingControl()
)
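Finally, the only thing left to do is kick the framework off, just like in the previous tutorials in this series (assuming you want it to start as soon as you run the script):
framework.run()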
Conclusion and next steps
Boom! We now have an automated script that downloads all the data we need in the morning, generates a set of predictions, places flat stake bets, logs all bets and switches itself off at the end of the day. All we need to do is hit play in the morning!
We have now written automation code for three different strategies; however, we haven't actually backtested any of our strategies or models yet. So for the final part of the How to Automate series we will be writing code to simulate the Exchange so we can backtest and optimise our strategies. Make sure not to miss it, as this is where I believe the sauce is made (not that I have made significant sauce).
Complete code
Run the code from your IDE by using py <filename>.py, making sure you amend the path to point to your input data.
from joblib import load
import os
import sys
# Allow imports from src folder
module_path = os.path.abspath(os.path.join('../src'))
if module_path not in sys.path:
sys.path.append(module_path)
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
from dateutil import tz
from pandas.tseries.offsets import MonthEnd
from sklearn.preprocessing import MinMaxScaler
import itertools
import numpy as np
import pandas as pd
from nltk.tokenize import regexp_tokenize
# settings to display all columns
pd.set_option("display.max_columns", None)
import fasttrack as ft
from dotenv import load_dotenv
load_dotenv()
# Import libraries for logging in
import betfairlightweight
from flumine import Flumine, clients
# Import libraries and logging
from flumine import BaseStrategy
from flumine.order.trade import Trade
from flumine.order.order import LimitOrder
from flumine.markets.market import Market
from betfairlightweight.filters import streaming_market_filter
from betfairlightweight.resources import MarketBook
import re
import pandas as pd
import numpy as np
import datetime
import logging
logging.basicConfig(filename = 'how_to_automate_4.log', level=logging.INFO, format='%(asctime)s:%(levelname)s:%(message)s')
# import logging
from flumine.worker import BackgroundWorker
from flumine.events.events import TerminationEvent
import csv
from flumine.controls.loggingcontrols import LoggingControl
from flumine.order.ordertype import OrderTypes
logger = logging.getLogger(__name__)
brunos_model = load('logistic_regression.joblib')
brunos_model
# Validate FastTrack API connection
api_key = os.getenv('FAST_TRACK_API_KEY',)
client = ft.Fasttrack(api_key)
track_codes = client.listTracks()
# Import race data excluding NZ races
au_tracks_filter = list(track_codes[track_codes['state'] != 'NZ']['track_code'])
# Time window to import data
# First day of the month 46 months back from now
date_from = (datetime.today() - relativedelta(months=46)).replace(day=1).strftime('%Y-%m-%d')
# First day of previous month
date_to = (datetime.today() - relativedelta(months=1)).replace(day=1).strftime('%Y-%m-%d')
# Dataframes to populate data with
race_details = pd.DataFrame()
dog_results = pd.DataFrame()
# For each month, either fetch data from API or use local CSV file if we already have downloaded it
for start in pd.date_range(date_from, date_to, freq='MS'):
start_date = start.strftime("%Y-%m-%d")
end_date = (start + MonthEnd(1)).strftime("%Y-%m-%d")
try:
filename_races = f'FT_AU_RACES_{start_date}.csv'
filename_dogs = f'FT_AU_DOGS_{start_date}.csv'
filepath_races = f'../data/{filename_races}'
filepath_dogs = f'../data/{filename_dogs}'
print(f'Loading data from {start_date} to {end_date}')
if os.path.isfile(filepath_races):
# Load local CSV file
month_race_details = pd.read_csv(filepath_races)
month_dog_results = pd.read_csv(filepath_dogs)
else:
# Fetch data from API
month_race_details, month_dog_results = client.getRaceResults(start_date, end_date, au_tracks_filter)
month_race_details.to_csv(filepath_races, index=False)
month_dog_results.to_csv(filepath_dogs, index=False)
# Combine monthly data
race_details = race_details.append(month_race_details, ignore_index=True)
dog_results = dog_results.append(month_dog_results, ignore_index=True)
except:
print(f'Could not load data from {start_date} to {end_date}')
race_details.tail()
current_month_start_date = pd.Timestamp.now().replace(day=1).strftime("%Y-%m-%d")
current_month_end_date = (pd.Timestamp.now().replace(day=1)+ MonthEnd(1))
current_month_end_date = (current_month_end_date - pd.Timedelta('1 day')).strftime("%Y-%m-%d")
print(f'Start date: {current_month_start_date}')
print(f'End Date: {current_month_end_date}')
# Download data for races that have concluded this current month up until today
# Start and end dates for current month
current_month_start_date = pd.Timestamp.now().replace(day=1).strftime("%Y-%m-%d")
current_month_end_date = (pd.Timestamp.now().replace(day=1)+ MonthEnd(1))
current_month_end_date = (current_month_end_date - pd.Timedelta('1 day')).strftime("%Y-%m-%d")
# Files names
filename_races = f'FT_AU_RACES_{current_month_start_date}.csv'
filename_dogs = f'FT_AU_DOGS_{current_month_start_date}.csv'
# Where to store files locally
filepath_races = f'../data/{filename_races}'
filepath_dogs = f'../data/{filename_dogs}'
# Fetch data from API
month_race_details, month_dog_results = client.getRaceResults(current_month_start_date, current_month_end_date, au_tracks_filter)
# Save the files locally and replace any out of date fields
month_race_details.to_csv(filepath_races, index=False)
month_dog_results.to_csv(filepath_dogs, index=False)
dog_results
# This is super important: I spent literally hours before I found out this was causing errors
dog_results['@id'] = pd.to_numeric(dog_results['@id'])
# Append the extra data to our data frames
race_details = race_details.append(month_race_details, ignore_index=True)
dog_results = dog_results.append(month_dog_results, ignore_index=True)
# Download the data for todays races
todays_date = pd.Timestamp.now().strftime("%Y-%m-%d")
todays_races, todays_dogs = client.getFullFormat(dt= todays_date, tracks = au_tracks_filter)
# display is for ipython notebooks only
# display(todays_races.head(1), todays_dogs.head(1))
# It seems that the todays_races dataframe doesn't have the date column, so let's add that on
todays_races['date'] = pd.Timestamp.now().strftime('%d %b %y')
todays_races.head(1)
# It also seems that in todays_dogs dataframe Box is labeled as RaceBox instead, so let's rename it
# We can also see that there are some specific dogs that have "Res." as a suffix of their name, i.e. they are reserve dogs,
# We will treat this later
todays_dogs = todays_dogs.rename(columns={"RaceBox":"Box"})
todays_dogs.tail(3)
# Appending todays data to this months data
month_dog_results = pd.concat([month_dog_results,todays_dogs],join='outer')[month_dog_results.columns]
month_race_details = pd.concat([month_race_details,todays_races],join='outer')[month_race_details.columns]
# Appending this months data to the rest of our historical data
race_details = race_details.append(month_race_details, ignore_index=True)
dog_results = dog_results.append(month_dog_results, ignore_index=True)
race_details
## Cleanse and normalise the data
# Clean up the race dataset
race_details = race_details.rename(columns = {'@id': 'FastTrack_RaceId'})
race_details['Distance'] = race_details['Distance'].apply(lambda x: int(x.replace("m", "")))
race_details['date_dt'] = pd.to_datetime(race_details['date'], format = '%d %b %y')
# Clean up the dogs results dataset
dog_results = dog_results.rename(columns = {'@id': 'FastTrack_DogId', 'RaceId': 'FastTrack_RaceId'})
# New line of code (rest of this code chunk is copied from bruno's code)
dog_results['FastTrack_DogId'] = pd.to_numeric(dog_results['FastTrack_DogId'])
# Combine dogs results with race attributes
dog_results = dog_results.merge(
race_details,
how = 'left',
on = 'FastTrack_RaceId'
)
# Convert StartPrice to probability
dog_results['StartPrice'] = dog_results['StartPrice'].apply(lambda x: None if x is None else float(x.replace('$', '').replace('F', '')) if isinstance(x, str) else x)
dog_results['StartPrice_probability'] = (1 / dog_results['StartPrice']).fillna(0)
dog_results['StartPrice_probability'] = dog_results.groupby('FastTrack_RaceId')['StartPrice_probability'].apply(lambda x: x / x.sum())
# Discard entries without results (scratched or did not finish)
dog_results = dog_results[~dog_results['Box'].isnull()]
dog_results['Box'] = dog_results['Box'].astype(int)
# Clean up other attributes
dog_results['RunTime'] = dog_results['RunTime'].astype(float)
dog_results['SplitMargin'] = dog_results['SplitMargin'].astype(float)
dog_results['Prizemoney'] = dog_results['Prizemoney'].astype(float).fillna(0)
dog_results['Place'] = pd.to_numeric(dog_results['Place'].apply(lambda x: x.replace("=", "") if isinstance(x, str) else 0), errors='coerce').fillna(0)
dog_results['win'] = dog_results['Place'].apply(lambda x: 1 if x == 1 else 0)
# Normalise some of the raw values
dog_results['Prizemoney_norm'] = np.log10(dog_results['Prizemoney'] + 1) / 12
dog_results['Place_inv'] = (1 / dog_results['Place']).fillna(0)
dog_results['Place_log'] = np.log10(dog_results['Place'] + 1).fillna(0)
dog_results['RunSpeed'] = (dog_results['RunTime'] / dog_results['Distance']).fillna(0)
## Generate features using raw data
# Calculate median winner time per track/distance
win_results = dog_results[dog_results['win'] == 1]
median_win_time = pd.DataFrame(data=win_results[win_results['RunTime'] > 0].groupby(['Track', 'Distance'])['RunTime'].median()).rename(columns={"RunTime": "RunTime_median"}).reset_index()
median_win_split_time = pd.DataFrame(data=win_results[win_results['SplitMargin'] > 0].groupby(['Track', 'Distance'])['SplitMargin'].median()).rename(columns={"SplitMargin": "SplitMargin_median"}).reset_index()
median_win_time.head()
# Calculate track speed index
median_win_time['speed_index'] = (median_win_time['RunTime_median'] / median_win_time['Distance'])
median_win_time['speed_index'] = MinMaxScaler().fit_transform(median_win_time[['speed_index']])
median_win_time.head()
# Compare dogs finish time with median winner time
dog_results = dog_results.merge(median_win_time, on=['Track', 'Distance'], how='left')
dog_results = dog_results.merge(median_win_split_time, on=['Track', 'Distance'], how='left')
# Normalise time comparison
dog_results['RunTime_norm'] = (dog_results['RunTime_median'] / dog_results['RunTime']).clip(0.9, 1.1)
dog_results['RunTime_norm'] = MinMaxScaler().fit_transform(dog_results[['RunTime_norm']])
dog_results['SplitMargin_norm'] = (dog_results['SplitMargin_median'] / dog_results['SplitMargin']).clip(0.9, 1.1)
dog_results['SplitMargin_norm'] = MinMaxScaler().fit_transform(dog_results[['SplitMargin_norm']])
dog_results.head()
# Calculate box winning percentage for each track/distance
box_win_percent = pd.DataFrame(data=dog_results.groupby(['Track', 'Distance', 'Box'])['win'].mean()).rename(columns={"win": "box_win_percent"}).reset_index()
# Add to dog results dataframe
dog_results = dog_results.merge(box_win_percent, on=['Track', 'Distance', 'Box'], how='left')
# Display example of barrier winning probabilities
print(box_win_percent.head(8))
dog_results[dog_results['FastTrack_DogId'] == 592253143].tail()[['date_dt','Place','DogName','RaceNum','Track','Distance','win','Prizemoney_norm','Place_inv','Place_log']]
# Generate rolling window features
dataset = dog_results.copy()
dataset = dataset.set_index(['FastTrack_DogId', 'date_dt']).sort_index()
# Use rolling window of 28, 91 and 365 days
rolling_windows = ['28D', '91D', '365D']
# Features to use for rolling windows calculation
features = ['RunTime_norm', 'SplitMargin_norm', 'Place_inv', 'Place_log', 'Prizemoney_norm']
# Aggregation functions to apply
aggregates = ['min', 'max', 'mean', 'median', 'std']
# Keep track of generated feature names
feature_cols = ['speed_index', 'box_win_percent']
for rolling_window in rolling_windows:
print(f'Processing rolling window {rolling_window}')
rolling_result = (
dataset
.reset_index(level=0).sort_index()
.groupby('FastTrack_DogId')[features]
.rolling(rolling_window)
.agg(aggregates)
.groupby(level=0) # Thanks to Brett for finding this!
.shift(1)
)
# My own dodgy code to work with reserve dogs
temp = rolling_result.reset_index()
temp = temp[temp['date_dt'] == pd.Timestamp.now().normalize()]
temp.groupby(['FastTrack_DogId','date_dt']).first()
rolling_result.loc[pd.IndexSlice[:, pd.Timestamp.now().normalize()], :] = temp.groupby(['FastTrack_DogId','date_dt']).first()
# Generate list of rolling window feature names (eg: RunTime_norm_min_365D)
agg_features_cols = [f'{f}_{a}_{rolling_window}' for f, a in itertools.product(features, aggregates)]
# Add features to dataset
dataset[agg_features_cols] = rolling_result
# Keep track of generated feature names
feature_cols.extend(agg_features_cols)
# Replace missing values with 0
dataset.fillna(0, inplace=True)
# display(dataset.head(8)) # display is only for ipython notebooks
# Only keep data after 2018-12-01
model_df = dataset.reset_index()
feature_cols = np.unique(feature_cols).tolist()
model_df = model_df[model_df['date_dt'] >= '2018-12-01']
# This line was originally part of Bruno's tutorial, but we don't run it in this script
# model_df = model_df[['date_dt', 'FastTrack_RaceId', 'DogName', 'win', 'StartPrice_probability'] + feature_cols]
# Only train model off of races where each dog has a value for each feature
races_exclude = model_df[model_df.isnull().any(axis = 1)]['FastTrack_RaceId'].drop_duplicates()
model_df = model_df[~model_df['FastTrack_RaceId'].isin(races_exclude)]
# Generate predictions like normal
# Range of dates that we want to simulate later '2022-03-01' to '2022-04-01'
todays_data = model_df[(model_df['date_dt'] >= pd.Timestamp('2022-03-01').strftime('%Y-%m-%d')) & (model_df['date_dt'] < pd.Timestamp('2022-04-01').strftime('%Y-%m-%d'))]
dog_win_probabilities = brunos_model.predict_proba(todays_data[feature_cols])[:,1]
todays_data['prob_LogisticRegression'] = dog_win_probabilities
todays_data['renormalise_prob'] = todays_data.groupby('FastTrack_RaceId')['prob_LogisticRegression'].apply(lambda x: x / x.sum())
todays_data['rating'] = 1/todays_data['renormalise_prob']
todays_data = todays_data.sort_values(by = 'date_dt')
todays_data
def download_iggy_ratings(date):
"""Downloads the Betfair Iggy model ratings for a given date and formats it into a nice DataFrame.
Args:
date (datetime): the date we want to download the ratings for
"""
iggy_url_1 = 'https://betfair-data-supplier-prod.herokuapp.com/api/widgets/iggy-joey/datasets?date='
iggy_url_2 = date.strftime("%Y-%m-%d")
iggy_url_3 = '&presenter=RatingsPresenter&csv=true'
iggy_url = iggy_url_1 + iggy_url_2 + iggy_url_3
# Download todays greyhounds ratings
iggy_df = pd.read_csv(iggy_url)
# Data cleaning
iggy_df = iggy_df.rename(
columns={
"meetings.races.bfExchangeMarketId":"market_id",
"meetings.races.runners.bfExchangeSelectionId":"selection_id",
"meetings.races.runners.ratedPrice":"rating",
"meetings.races.number":"RaceNum",
"meetings.name":"Track",
"meetings.races.runners.name":"DogName"
}
)
# iggy_df = iggy_df[['market_id','selection_id','rating']]
iggy_df['market_id'] = iggy_df['market_id'].astype(str)
iggy_df['date_dt'] = date
# Set market_id and selection_id as index for easy referencing
# iggy_df = iggy_df.set_index(['market_id','selection_id'])
return(iggy_df)
# Download historical ratings over a time period and convert into a big DataFrame.
back_test_period = pd.date_range(start='2022-03-01', end='2022-04-01')
frames = [download_iggy_ratings(day) for day in back_test_period]
iggy_df = pd.concat(frames)
iggy_df
# format DogNames to merge
todays_data['DogName'] = todays_data['DogName'].apply(lambda x: x.replace("'", "").replace(".", "").replace("Res", "").strip())
iggy_df['DogName'] = iggy_df['DogName'].str.upper()
# Merge
backtest = iggy_df[['market_id','selection_id','DogName','date_dt']].merge(todays_data[['rating','DogName','date_dt']], how = 'inner', on = ['DogName','date_dt'])
backtest
# Save predictions for if we want to backtest/simulate it later
backtest.to_csv('backtest.csv', index=False) # Csv format
# backtest.to_pickle('backtest.pkl') # pickle format (faster, but can't open in excel)
todays_data[todays_data['FastTrack_RaceId'] == '798906744']
# Select todays data
todays_data = model_df[model_df['date_dt'] == pd.Timestamp.now().strftime('%Y-%m-%d')]
# Generate runner win predictions
dog_win_probabilities = brunos_model.predict_proba(todays_data[feature_cols])[:,1]
todays_data['prob_LogisticRegression'] = dog_win_probabilities
# We no longer renormalise the probabilities in this chunk of code, we do it in Flumine instead
# todays_data['renormalise_prob'] = todays_data.groupby('FastTrack_RaceId')['prob_LogisticRegression'].apply(lambda x: x / x.sum())
# todays_data['rating'] = 1/todays_data['renormalise_prob']
# todays_data = todays_data.sort_values(by = 'date_dt')
todays_data
# Prepare data for easy reference in flumine
todays_data['DogName_bf'] = todays_data['DogName'].apply(lambda x: x.replace("'", "").replace(".", "").replace("Res", "").strip())
todays_data.replace({r'Sandown \(SAP\)': 'Sandown Park'}, regex=True, inplace=True)  # escape the brackets so the regex matches the literal track name
todays_data = todays_data.set_index(['DogName_bf','Track','RaceNum'])
todays_data.head()
# Credentials to login and logging in
trading = betfairlightweight.APIClient('username','password',app_key='appkey')
client = clients.BetfairClient(trading, interactive_login=True)
# Login
framework = Flumine(client=client)
# Code to login when using security certificates
# trading = betfairlightweight.APIClient('username','password',app_key='appkey', certs=r'C:\Users\zhoui\openssl_certs')
# client = clients.BetfairClient(trading)
# framework = Flumine(client=client)
class FlatBetting(BaseStrategy):
def start(self) -> None:
print("starting strategy 'FlatBetting' using the model we created in the Greyhound modelling in Python Tutorial")
def check_market_book(self, market: Market, market_book: MarketBook) -> bool:
if market_book.status != "CLOSED":
return True
def process_market_book(self, market: Market, market_book: MarketBook) -> None:
# Convert dataframe to a global variable
global todays_data
# At the 60 second mark:
if market.seconds_to_start < 60 and market_book.inplay == False:
# get the list of dog_names, name of the track/venue and race_number/RaceNum from Betfair Polling API
dog_names = []
track = market.market_catalogue.event.venue
race_number = market.market_catalogue.market_name.split(' ',1)[0] # comes out as R1/R2/R3 .. etc
race_number = re.sub("[^0-9]", "", race_number) # only keep the numbers
for runner_cata in market.market_catalogue.runners:
dog_name = runner_cata.runner_name.split(' ',1)[1].upper()
dog_names.append(dog_name)
# Check if there are box changes; if there are, use Brett's code
if market.market_catalogue.description.clarifications is not None:
# Brett's code to get Box changes:
my_string = market.market_catalogue.description.clarifications.replace("<br/> Dog","<br/>Dog")
pattern1 = r'(?<=<br/>Dog ).+?(?= starts)'
pattern2 = r"(?<=\bbox no. )(\w+)"
runners_df = pd.DataFrame(regexp_tokenize(my_string, pattern1), columns=['runner_name'])
runners_df['runner_name'] = runners_df['runner_name'].astype(str)
# Keep only the runner number (dropping the dog name and the trailing '.')
runners_df['runner_number'] = runners_df['runner_name'].apply(lambda x: x[:(x.find(" ") - 1)].upper())
# Keep only the dog name (dropping the leading number)
runners_df['runner_name'] = runners_df['runner_name'].apply(lambda x: x[(x.find(" ") + 1):].upper())
runners_df['Box'] = regexp_tokenize(my_string, pattern2)
# Replace any old Box info in our original dataframe with data available in runners_df
runners_df = runners_df.set_index('runner_name')
todays_data.loc[(runners_df.index[runners_df.index.isin(dog_names)],track,race_number),'Box'] = runners_df.loc[runners_df.index.isin(dog_names),'Box'].to_list()
# Merge box_win_percentage back on:
todays_data = todays_data.drop(columns = 'box_win_percentage')
todays_data = todays_data.reset_index().merge(box_win_percent, on = ['Track', 'Distance','Box'], how = 'left').set_index(['DogName_bf','Track','RaceNum'])
# Generate probabilities using Bruno's model
todays_data.loc[(dog_names,track,race_number),'prob_LogisticRegression'] = brunos_model.predict_proba(todays_data.loc[(dog_names,track,race_number)][feature_cols])[:,1]
# Renormalise probabilities
probabilities = todays_data.loc[dog_names,track,race_number]['prob_LogisticRegression']
todays_data.loc[(dog_names,track,race_number),'renormalised_prob'] = probabilities/probabilities.sum()
# Convert probabilities to ratings
todays_data.loc[(dog_names,track,race_number),'rating'] = 1/todays_data.loc[dog_names,track,race_number]['renormalised_prob']
# Use both the polling API (market.market_catalogue) and the streaming API at once:
for runner_cata, runner in zip(market.market_catalogue.runners, market_book.runners):
# Check the polling api and streaming api matches up (sometimes it doesn't)
if runner_cata.selection_id == runner.selection_id:
# Get the dog_name from polling api then reference our data for our model rating
dog_name = runner_cata.runner_name.split(' ',1)[1].upper()
# Rest is the same as How to Automate III
model_price = todays_data.loc[dog_name,track,race_number]['rating']
### If you run into an error such as:
# Unknown error The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# then use: model_price = todays_data.loc[dog_name,track,race_number]['rating'].item()
# Log info before placing bets
logging.info(f'dog_name: {dog_name}')
logging.info(f'model_price: {model_price}')
logging.info(f'market_id: {market_book.market_id}')
logging.info(f'selection_id: {runner.selection_id}')
# If best available to back price is > rated price then flat $5 back
if runner.status == "ACTIVE" and runner.ex.available_to_back[0]['price'] > model_price:
trade = Trade(
market_id=market_book.market_id,
selection_id=runner.selection_id,
handicap=runner.handicap,
strategy=self,
)
order = trade.create_order(
side="BACK", order_type=LimitOrder(price=runner.ex.available_to_back[0]['price'], size=5.00)
)
market.place_order(order)
# If best available to lay price is < rated price then flat $5 lay
if runner.status == "ACTIVE" and runner.ex.available_to_lay[0]['price'] < model_price:
trade = Trade(
market_id=market_book.market_id,
selection_id=runner.selection_id,
handicap=runner.handicap,
strategy=self,
)
order = trade.create_order(
side="LAY", order_type=LimitOrder(price=runner.ex.available_to_lay[0]['price'], size=5.00)
)
market.place_order(order)
greyhounds_strategy = FlatBetting(
market_filter=streaming_market_filter(
event_type_ids=["4339"], # Greyhounds markets
country_codes=["AU"], # Australian markets
market_types=["WIN"], # Win markets
),
max_order_exposure=50, # Max exposure per order = 50
max_trade_count=1, # Max 1 trade per selection
max_live_trade_count=1, # Max 1 unmatched trade per selection
)
framework.add_strategy(greyhounds_strategy)
# logger = logging.getLogger(__name__)
"""
Worker can be used as follows:
framework.add_worker(
BackgroundWorker(
framework,
terminate,
func_kwargs={"today_only": True, "seconds_closed": 1200},
interval=60,
start_delay=60,
)
)
This will run every 60s and will terminate
the framework if all markets starting 'today'
have been closed for at least 1200s
"""
# Function that stops automation running at the end of the day
def terminate(
context: dict, flumine, today_only: bool = True, seconds_closed: int = 600
) -> None:
"""terminate framework if no markets
live today.
"""
markets = list(flumine.markets.markets.values())
markets_today = [
m
for m in markets
if m.market_start_datetime.date() == datetime.datetime.utcnow().date()
and (
m.elapsed_seconds_closed is None
or (m.elapsed_seconds_closed and m.elapsed_seconds_closed < seconds_closed)
)
]
if today_only:
market_count = len(markets_today)
else:
market_count = len(markets)
if market_count == 0:
# logger.info("No more markets available, terminating framework")
flumine.handler_queue.put(TerminationEvent(flumine))
# Add the terminate worker to our framework
framework.add_worker(
BackgroundWorker(
framework,
terminate,
func_kwargs={"today_only": True, "seconds_closed": 1200},
interval=60,
start_delay=60,
)
)
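The strategy above writes a few lines to the log before placing each bet, and the logging control below also logs cleared orders and markets. If you haven't already configured logging earlier in the notebook, a minimal sketch that sends those messages to a file (the file name here is just an example) looks like this:
# A minimal sketch: basic logging config so the logging.info/logger.info calls end up in a file
# 'how_to_automate_4.log' is an example file name - use whatever you prefer
import logging
logging.basicConfig(filename='how_to_automate_4.log', level=logging.INFO, format='%(asctime)s:%(levelname)s:%(message)s')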
logger = logging.getLogger(__name__)
FIELDNAMES = [
"bet_id",
"strategy_name",
"market_id",
"selection_id",
"trade_id",
"date_time_placed",
"price",
"price_matched",
"size",
"size_matched",
"profit",
"side",
"elapsed_seconds_executable",
"order_status",
"market_note",
"trade_notes",
"order_notes",
]
class LiveLoggingControl(LoggingControl):
NAME = "BACKTEST_LOGGING_CONTROL"
def __init__(self, *args, **kwargs):
super(LiveLoggingControl, self).__init__(*args, **kwargs)
self._setup()
# Changed the file path; checks if the file orders_hta_4.csv already exists and, if it doesn't, creates it
def _setup(self):
if os.path.exists("orders_hta_4.csv"):
logging.info("Results file exists")
else:
with open("orders_hta_4.csv", "w") as m:
csv_writer = csv.DictWriter(m, delimiter=",", fieldnames=FIELDNAMES)
csv_writer.writeheader()
def _process_cleared_orders_meta(self, event):
orders = event.event
with open("orders_hta_4.csv", "a") as m:
for order in orders:
if order.order_type.ORDER_TYPE == OrderTypes.LIMIT:
size = order.order_type.size
else:
size = order.order_type.liability
if order.order_type.ORDER_TYPE == OrderTypes.MARKET_ON_CLOSE:
price = None
else:
price = order.order_type.price
try:
order_data = {
"bet_id": order.bet_id,
"strategy_name": order.trade.strategy,
"market_id": order.market_id,
"selection_id": order.selection_id,
"trade_id": order.trade.id,
"date_time_placed": order.responses.date_time_placed,
"price": price,
"price_matched": order.average_price_matched,
"size": size,
"size_matched": order.size_matched,
"profit": 0 if not order.cleared_order else order.cleared_order.profit,
"side": order.side,
"elapsed_seconds_executable": order.elapsed_seconds_executable,
"order_status": order.status.value,
"market_note": order.trade.market_notes,
"trade_notes": order.trade.notes_str,
"order_notes": order.notes_str,
}
csv_writer = csv.DictWriter(m, delimiter=",", fieldnames=FIELDNAMES)
csv_writer.writerow(order_data)
except Exception as e:
logger.error(
"_process_cleared_orders_meta: %s" % e,
extra={"order": order, "error": e},
)
logger.info("Orders updated", extra={"order_count": len(orders)})
def _process_cleared_markets(self, event):
cleared_markets = event.event
for cleared_market in cleared_markets.orders:
logger.info(
"Cleared market",
extra={
"market_id": cleared_market.market_id,
"bet_count": cleared_market.bet_count,
"profit": cleared_market.profit,
"commission": cleared_market.commission,
},
)
framework.add_logging_control(
LiveLoggingControl()
)
framework.run()
Disclaimer
Note that whilst models and automated strategies are fun and rewarding to create, we can't promise that your model or betting strategy will be profitable, and we make no representations in relation to the code shared or information on this page. If you're using this code or implementing your own strategies, you do so entirely at your own risk and you are responsible for any winnings/losses incurred. Under no circumstances will Betfair be liable for any loss or damage you suffer.