AFL Modelling Walkthrough
01. Data Cleaning
These tutorials will walk you through how to construct your own basic AFL model, using publicly available data. The output will be odds for each team to win, which will be shown on The Hub.
In this notebook we will walk you through the basics of cleaning this dataset and how we have done it. If you want to get straight to feature creation or modelling, feel free to jump ahead!
# Import libraries
import pandas as pd
import numpy as np
import re
pd.set_option('display.max_columns', None)
We will first explore the DataFrames, and then create functions to wrangle them and clean them into more consistent sets of data.
# Read/clean each DataFrame
match_results = pd.read_csv("data/afl_match_results.csv")
odds = pd.read_csv("data/afl_odds.csv")
player_stats = pd.read_csv("data/afl_player_stats.csv")
 | trunc | event_name | path | selection_name | odds |
---|---|---|---|---|---|
4179 | 2018-09-01 | Match Odds | VFL/Richmond Reserves v Williamstown | Williamstown | 2.3878 |
4180 | 2018-09-01 | Match Odds | WAFL/South Fremantle v West Perth | South Fremantle | 1.5024 |
4181 | 2018-09-01 | Match Odds | WAFL/South Fremantle v West Perth | West Perth | 2.7382 |
 | Game | Date | Round | Home.Team | Home.Goals | Home.Behinds | Home.Points | Away.Team | Away.Goals | Away.Behinds | Away.Points | Venue | Margin | Season | Round.Type | Round.Number |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15395 | 15396 | 2018-08-26 | R23 | Brisbane Lions | 11 | 6 | 72 | West Coast | 14 | 14 | 98 | Gabba | -26 | 2018 | Regular | 23 |
15396 | 15397 | 2018-08-26 | R23 | Melbourne | 15 | 12 | 102 | GWS | 8 | 9 | 57 | M.C.G. | 45 | 2018 | Regular | 23 |
15397 | 15398 | 2018-08-26 | R23 | St Kilda | 14 | 10 | 94 | North Melbourne | 17 | 15 | 117 | Docklands | -23 | 2018 | Regular | 23 |
 | AF | B | BO | CCL | CG | CL | CM | CP | D | DE | Date | ED | FA | FF | G | GA | HB | HO | I50 | ITC | K | M | MG | MI5 | Match_id | One.Percenters | Opposition | Player | R50 | Round | SC | SCL | SI | Season | Status | T | T5 | TO | TOG | Team | UP | Venue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
89317 | 38 | 1 | 0 | 0.0 | 0 | 0 | 1 | 2 | 9 | 55.6 | 25/08/2018 | 5 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 2.0 | 6 | 3 | 132.0 | 2 | 9711 | 0 | Fremantle | Christopher Mayne | 1 | Round 23 | 35 | 0.0 | 2.0 | 2018 | Away | 1 | 0.0 | 1.0 | 57 | Collingwood | 7 | Optus Stadium |
89318 | 38 | 0 | 0 | 0.0 | 3 | 0 | 0 | 3 | 9 | 55.6 | 25/08/2018 | 5 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | 4.0 | 6 | 3 | 172.0 | 0 | 9711 | 2 | Fremantle | Nathan Murphy | 5 | Round 23 | 29 | 0.0 | 0.0 | 2018 | Away | 1 | 0.0 | 3.0 | 70 | Collingwood | 6 | Optus Stadium |
89319 | 56 | 1 | 0 | 0.0 | 1 | 0 | 0 | 3 | 8 | 62.5 | 25/08/2018 | 5 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 2.0 | 6 | 3 | 180.0 | 3 | 9711 | 2 | Fremantle | Jaidyn Stephenson | 0 | Round 23 | 56 | 0.0 | 4.0 | 2018 | Away | 3 | 1.0 | 2.0 | 87 | Collingwood | 5 | Optus Stadium |
Have a look at the structure of the DataFrames. Notice that in the odds DataFrame each game is split across two rows, whilst in match_results each game is on one row. We will get around this by splitting each match_results game onto two rows (one per team), as this will allow our feature transformation functions to be applied more easily later on. For the player_stats DataFrame, we will aggregate the individual player stats into team totals, again with each team's game on its own row.
First, we will write a function to make the odds data look a bit nicer, with only a team column, a date column and a 'home_game' column which takes the value 0 or 1 depending on whether it was a home game for that team. To do this we will use a regular expression to extract the team names from the path column, as well as the to_datetime function from pandas. We will also replace all the inconsistent team names with consistent ones.
def get_cleaned_odds(df=None):
    # If a df hasn't been specified as a parameter, read the odds df
    if df is None:
        df = pd.read_csv("data/afl_odds.csv")
    # Get a dictionary of team names we want to change and their new values
    team_name_mapping = {
        'Adelaide Crows': 'Adelaide',
        'Brisbane Lions': 'Brisbane',
        'Carlton Blues': 'Carlton',
        'Collingwood Magpies': 'Collingwood',
        'Essendon Bombers': 'Essendon',
        'Fremantle Dockers': 'Fremantle',
        'GWS Giants': 'GWS',
        'Geelong Cats': 'Geelong',
        'Gold Coast Suns': 'Gold Coast',
        'Greater Western Sydney': 'GWS',
        'Greater Western Sydney Giants': 'GWS',
        'Hawthorn Hawks': 'Hawthorn',
        'Melbourne Demons': 'Melbourne',
        'North Melbourne Kangaroos': 'North Melbourne',
        'Port Adelaide Magpies': 'Port Adelaide',
        'Port Adelaide Power': 'Port Adelaide',
        'P Adelaide': 'Port Adelaide',
        'Richmond Tigers': 'Richmond',
        'St Kilda Saints': 'St Kilda',
        'Sydney Swans': 'Sydney',
        'West Coast Eagles': 'West Coast',
        'Wetsern Bulldogs': 'Western Bulldogs',  # Misspellings that appear in the raw data
        'Western Bullbogs': 'Western Bulldogs'
    }
    # Add columns
    df = (df.assign(date=lambda df: pd.to_datetime(df.trunc),  # Create a datetime column
                    home_team=lambda df: df.path.str.extract(r'(([\w\s]+) v ([\w\s]+))', expand=True)[1].str.strip(),
                    away_team=lambda df: df.path.str.extract(r'(([\w\s]+) v ([\w\s]+))', expand=True)[2].str.strip())
          .drop(columns=['path', 'trunc', 'event_name'])  # Drop irrelevant columns
          .rename(columns={'selection_name': 'team'})  # Rename columns
          .replace(team_name_mapping)
          .sort_values(by='date')
          .reset_index(drop=True)
          .assign(home_game=lambda df: df.apply(lambda row: 1 if row.home_team == row.team else 0, axis='columns'))
          .drop(columns=['home_team', 'away_team']))
    return df
 | team | odds | date | home_game |
---|---|---|---|---|
4177 | South Fremantle | 1.5024 | 2018-09-01 | 1 |
4178 | Port Melbourne | 2.8000 | 2018-09-01 | 0 |
4179 | Box Hill Hawks | 1.4300 | 2018-09-01 | 1 |
4180 | Casey Demons | 1.9000 | 2018-09-01 | 1 |
4181 | West Perth | 2.7382 | 2018-09-01 | 0 |
We now have a DataFrame that looks nice and is easy to join with our other DataFrames. Now let's clean up the match_results DataFrame.
# Define a function which cleans the match results df, and separates each team's stats onto individual rows
def get_cleaned_match_results(df=None):
    # If a df hasn't been specified as a parameter, read the match_results df
    if df is None:
        df = pd.read_csv("data/afl_match_results.csv")
    # Create column lists to loop through - these are the columns we want in home and away dfs
    home_columns = ['Game', 'Date', 'Round.Number', 'Home.Team', 'Home.Goals', 'Home.Behinds', 'Home.Points', 'Margin', 'Venue', 'Away.Team', 'Away.Goals', 'Away.Behinds', 'Away.Points']
    away_columns = ['Game', 'Date', 'Round.Number', 'Away.Team', 'Away.Goals', 'Away.Behinds', 'Away.Points', 'Margin', 'Venue', 'Home.Team', 'Home.Goals', 'Home.Behinds', 'Home.Points']
    mapping = ['game', 'date', 'round', 'team', 'goals', 'behinds', 'points', 'margin', 'venue', 'opponent', 'opponent_goals', 'opponent_behinds', 'opponent_points']
    team_name_mapping = {
        'Brisbane Lions': 'Brisbane',
        'Footscray': 'Western Bulldogs'
    }
    # Create a df with only home games
    df_home = (df[home_columns]
               .rename(columns={old_col: new_col for old_col, new_col in zip(home_columns, mapping)})
               .assign(home_game=1))
    # Create a df with only away games
    df_away = (df[away_columns]
               .rename(columns={old_col: new_col for old_col, new_col in zip(away_columns, mapping)})
               .assign(home_game=0,
                       margin=lambda df: df.margin * -1))
    # Concatenate these dfs together (DataFrame.append was removed in pandas 2.0)
    new_df = (pd.concat([df_home, df_away])
              .sort_values(by='game')  # Sort by game ID
              .reset_index(drop=True)  # Reset index
              .assign(date=lambda df: pd.to_datetime(df.date))  # Create a datetime column
              .replace(team_name_mapping))  # Rename team names to be consistent with other dfs
    return new_df
 | game | date | round | team | goals | behinds | points | margin | venue | opponent | opponent_goals | opponent_behinds | opponent_points | home_game |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1897-05-08 | 1 | Fitzroy | 6 | 13 | 49 | 33 | Brunswick St | Carlton | 2 | 4 | 16 | 1 |
1 | 1 | 1897-05-08 | 1 | Carlton | 2 | 4 | 16 | -33 | Brunswick St | Fitzroy | 6 | 13 | 49 | 0 |
2 | 2 | 1897-05-08 | 1 | Collingwood | 5 | 11 | 41 | 25 | Victoria Park | St Kilda | 2 | 4 | 16 | 1 |
3 | 2 | 1897-05-08 | 1 | St Kilda | 2 | 4 | 16 | -25 | Victoria Park | Collingwood | 5 | 11 | 41 | 0 |
4 | 3 | 1897-05-08 | 1 | Geelong | 3 | 6 | 24 | -23 | Corio Oval | Essendon | 7 | 5 | 47 | 1 |
Now we have both the odds DataFrame and match_results DataFrame ready for feature creation! Finally, we will aggregate the player_stats DataFrame into team totals for each game rather than individual player stats. This DataFrame has regular stats, such as disposals and marks, as well as advanced stats, such as Tackles Inside 50 and Metres Gained. However, these advanced stats are only available from 2015 onwards, so we will not be using them in this tutorial, as there isn't enough data from 2015 to train our models.
Let's now aggregate the player_stats DataFrame.
def get_cleaned_aggregate_player_stats(df=None):
    # If a df hasn't been specified as a parameter, read the player_stats df
    if df is None:
        df = pd.read_csv("data/afl_player_stats.csv")
    agg_stats = (df.rename(columns={  # Rename columns to lowercase
                     'Season': 'season',
                     'Round': 'round',
                     'Team': 'team',
                     'Opposition': 'opponent',
                     'Date': 'date'
                 })
                 .groupby(by=['date', 'season', 'team', 'opponent'], as_index=False)  # Groupby to aggregate the stats for each game
                 .sum(numeric_only=True)  # Sum the numeric stats; non-numeric columns (e.g. Player) are dropped
                 .drop(columns=['DE', 'TOG', 'Match_id'])  # Drop columns we don't need
                 .assign(date=lambda df: pd.to_datetime(df.date, format="%d/%m/%Y"))  # Create a datetime object
                 .sort_values(by='date')
                 .reset_index(drop=True))
    return agg_stats
 | date | season | team | opponent | AF | B | BO | CCL | CG | CL | CM | CP | D | ED | FA | FF | G | GA | HB | HO | I50 | ITC | K | M | MG | MI5 | One.Percenters | R50 | SC | SCL | SI | T | T5 | TO | UP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3621 | 2018-08-26 | 2018 | Brisbane | West Coast | 1652 | 5 | 0 | 14.0 | 49 | 37 | 8 | 132 | 394 | 302 | 20 | 18 | 11 | 9 | 167 | 48 | 49 | 59.0 | 227 | 104 | 5571.0 | 6 | 48 | 39 | 1645 | 23.0 | 86.0 | 62 | 13.0 | 69.0 | 256 |
3622 | 2018-08-26 | 2018 | West Coast | Brisbane | 1548 | 11 | 5 | 13.0 | 49 | 42 | 9 | 141 | 360 | 262 | 18 | 20 | 14 | 8 | 137 | 39 | 56 | 70.0 | 223 | 95 | 5809.0 | 12 | 39 | 34 | 1655 | 29.0 | 94.0 | 55 | 6.0 | 59.0 | 217 |
3623 | 2018-08-26 | 2018 | St Kilda | North Melbourne | 1587 | 8 | 11 | 19.0 | 48 | 33 | 7 | 125 | 383 | 299 | 18 | 14 | 14 | 13 | 173 | 23 | 48 | 68.0 | 210 | 112 | 5522.0 | 14 | 46 | 35 | 1568 | 14.0 | 95.0 | 50 | 7.0 | 77.0 | 269 |
3624 | 2018-08-26 | 2018 | GWS | Melbourne | 1449 | 7 | 17 | 14.0 | 42 | 31 | 12 | 111 | 355 | 274 | 19 | 13 | 8 | 7 | 159 | 18 | 50 | 54.0 | 196 | 110 | 5416.0 | 10 | 62 | 34 | 1532 | 17.0 | 78.0 | 46 | 5.0 | 58.0 | 254 |
3625 | 2018-08-26 | 2018 | Melbourne | GWS | 1712 | 8 | 12 | 10.0 | 38 | 30 | 12 | 139 | 403 | 302 | 13 | 19 | 15 | 14 | 181 | 48 | 54 | 59.0 | 222 | 106 | 6198.0 | 16 | 34 | 39 | 1768 | 20.0 | 147.0 | 60 | 2.0 | 53.0 | 269 |
We now have three fully prepared DataFrames which are almost ready to be analysed and modelled on! Let's take a look at them and then merge them together into our final DataFrame.
 | team | odds | date | home_game |
---|---|---|---|---|
4179 | Box Hill Hawks | 1.4300 | 2018-09-01 | 1 |
4180 | Casey Demons | 1.9000 | 2018-09-01 | 1 |
4181 | West Perth | 2.7382 | 2018-09-01 | 0 |
 | game | date | round | team | goals | behinds | points | margin | venue | opponent | opponent_goals | opponent_behinds | opponent_points | home_game |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30793 | 15397 | 2018-08-26 | 23 | Melbourne | 15 | 12 | 102 | 45 | M.C.G. | GWS | 8 | 9 | 57 | 1 |
30794 | 15398 | 2018-08-26 | 23 | St Kilda | 14 | 10 | 94 | -23 | Docklands | North Melbourne | 17 | 15 | 117 | 1 |
30795 | 15398 | 2018-08-26 | 23 | North Melbourne | 17 | 15 | 117 | 23 | Docklands | St Kilda | 14 | 10 | 94 | 0 |
 | date | season | team | opponent | AF | B | BO | CCL | CG | CL | CM | CP | D | ED | FA | FF | G | GA | HB | HO | I50 | ITC | K | M | MG | MI5 | One.Percenters | R50 | SC | SCL | SI | T | T5 | TO | UP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3623 | 2018-08-26 | 2018 | St Kilda | North Melbourne | 1587 | 8 | 11 | 19.0 | 48 | 33 | 7 | 125 | 383 | 299 | 18 | 14 | 14 | 13 | 173 | 23 | 48 | 68.0 | 210 | 112 | 5522.0 | 14 | 46 | 35 | 1568 | 14.0 | 95.0 | 50 | 7.0 | 77.0 | 269 |
3624 | 2018-08-26 | 2018 | GWS | Melbourne | 1449 | 7 | 17 | 14.0 | 42 | 31 | 12 | 111 | 355 | 274 | 19 | 13 | 8 | 7 | 159 | 18 | 50 | 54.0 | 196 | 110 | 5416.0 | 10 | 62 | 34 | 1532 | 17.0 | 78.0 | 46 | 5.0 | 58.0 | 254 |
3625 | 2018-08-26 | 2018 | Melbourne | GWS | 1712 | 8 | 12 | 10.0 | 38 | 30 | 12 | 139 | 403 | 302 | 13 | 19 | 15 | 14 | 181 | 48 | 54 | 59.0 | 222 | 106 | 6198.0 | 16 | 34 | 39 | 1768 | 20.0 | 147.0 | 60 | 2.0 | 53.0 | 269 |
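The merge below assumes the three cleaned DataFrames exist in memory; a quick sketch of producing them with the functions defined above:
# Create each cleaned DataFrame using the functions defined above
odds = get_cleaned_odds()
match_results = get_cleaned_match_results()
agg_stats = get_cleaned_aggregate_player_stats()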
merged_df = (odds[odds.team.isin(agg_stats.team.unique())]
             .pipe(pd.merge, match_results, on=['date', 'team', 'home_game'])
             .pipe(pd.merge, agg_stats, on=['date', 'team', 'opponent'])
             .sort_values(by=['game']))
 | team | odds | date | home_game | game | round | goals | behinds | points | margin | venue | opponent | opponent_goals | opponent_behinds | opponent_points | season | AF | B | BO | CCL | CG | CL | CM | CP | D | ED | FA | FF | G | GA | HB | HO | I50 | ITC | K | M | MG | MI5 | One.Percenters | R50 | SC | SCL | SI | T | T5 | TO | UP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3199 | Melbourne | 1.5116 | 2018-08-26 | 1 | 15397 | 23 | 15 | 12 | 102 | 45 | M.C.G. | GWS | 8 | 9 | 57 | 2018 | 1712 | 8 | 12 | 10.0 | 38 | 30 | 12 | 139 | 403 | 302 | 13 | 19 | 15 | 14 | 181 | 48 | 54 | 59.0 | 222 | 106 | 6198.0 | 16 | 34 | 39 | 1768 | 20.0 | 147.0 | 60 | 2.0 | 53.0 | 269 |
3195 | North Melbourne | 1.3936 | 2018-08-26 | 0 | 15398 | 23 | 17 | 15 | 117 | 23 | Docklands | St Kilda | 14 | 10 | 94 | 2018 | 1707 | 13 | 7 | 15.0 | 50 | 24 | 6 | 131 | 425 | 322 | 14 | 18 | 17 | 13 | 201 | 32 | 59 | 77.0 | 224 | 106 | 5833.0 | 23 | 33 | 29 | 1735 | 9.0 | 154.0 | 48 | 9.0 | 66.0 | 300 |
3200 | St Kilda | 3.5178 | 2018-08-26 | 1 | 15398 | 23 | 14 | 10 | 94 | -23 | Docklands | North Melbourne | 17 | 15 | 117 | 2018 | 1587 | 8 | 11 | 19.0 | 48 | 33 | 7 | 125 | 383 | 299 | 18 | 14 | 14 | 13 | 173 | 23 | 48 | 68.0 | 210 | 112 | 5522.0 | 14 | 46 | 35 | 1568 | 14.0 | 95.0 | 50 | 7.0 | 77.0 | 269 |
Great! We now have a clean looking dataset with each row representing one team in a game. Let's now eliminate the outliers from our dataset. We know that Essendon had a doping scandal which resulted in a large part of their team being banned for the 2016 season, so let's remove all of their 2016 games. To do this we will filter based on the team and season, and then invert the filter with ~.
# Define a function which eliminates outliers
def outlier_eliminator(df):
    # Eliminate Essendon 2016 games
    essendon_filter_criteria = ~(((df['team'] == 'Essendon') & (df['season'] == 2016)) | ((df['opponent'] == 'Essendon') & (df['season'] == 2016)))
    df = df[essendon_filter_criteria].reset_index(drop=True)
    return df
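We can then apply this function to the merged DataFrame; a one-line sketch, assuming the merged DataFrame from above is still named merged_df:
# Apply the outlier eliminator to the merged DataFrame
afl_data = outlier_eliminator(merged_df)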
 | team | odds | date | home_game | game | round | goals | behinds | points | margin | venue | opponent | opponent_goals | opponent_behinds | opponent_points | season | AF | B | BO | CCL | CG | CL | CM | CP | D | ED | FA | FF | G | GA | HB | HO | I50 | ITC | K | M | MG | MI5 | One.Percenters | R50 | SC | SCL | SI | T | T5 | TO | UP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3154 | Melbourne | 1.5116 | 2018-08-26 | 1 | 15397 | 23 | 15 | 12 | 102 | 45 | M.C.G. | GWS | 8 | 9 | 57 | 2018 | 1712 | 8 | 12 | 10.0 | 38 | 30 | 12 | 139 | 403 | 302 | 13 | 19 | 15 | 14 | 181 | 48 | 54 | 59.0 | 222 | 106 | 6198.0 | 16 | 34 | 39 | 1768 | 20.0 | 147.0 | 60 | 2.0 | 53.0 | 269 |
3155 | North Melbourne | 1.3936 | 2018-08-26 | 0 | 15398 | 23 | 17 | 15 | 117 | 23 | Docklands | St Kilda | 14 | 10 | 94 | 2018 | 1707 | 13 | 7 | 15.0 | 50 | 24 | 6 | 131 | 425 | 322 | 14 | 18 | 17 | 13 | 201 | 32 | 59 | 77.0 | 224 | 106 | 5833.0 | 23 | 33 | 29 | 1735 | 9.0 | 154.0 | 48 | 9.0 | 66.0 | 300 |
3156 | St Kilda | 3.5178 | 2018-08-26 | 1 | 15398 | 23 | 14 | 10 | 94 | -23 | Docklands | North Melbourne | 17 | 15 | 117 | 2018 | 1587 | 8 | 11 | 19.0 | 48 | 33 | 7 | 125 | 383 | 299 | 18 | 14 | 14 | 13 | 173 | 23 | 48 | 68.0 | 210 | 112 | 5522.0 | 14 | 46 | 35 | 1568 | 14.0 | 95.0 | 50 | 7.0 | 77.0 | 269 |
Finally, let's prefix all of the columns that we are going to use in feature creation with 'f_', so that we can easily filter for these columns later.
non_feature_cols = ['team', 'date', 'home_game', 'game', 'round', 'venue', 'opponent', 'season']
afl_data = afl_data.rename(columns={col: 'f_' + col for col in afl_data if col not in non_feature_cols})
 | team | f_odds | date | home_game | game | round | f_goals | f_behinds | f_points | f_margin | venue | opponent | f_opponent_goals | f_opponent_behinds | f_opponent_points | season | f_AF | f_B | f_BO | f_CCL | f_CG | f_CL | f_CM | f_CP | f_D | f_ED | f_FA | f_FF | f_G | f_GA | f_HB | f_HO | f_I50 | f_ITC | f_K | f_M | f_MG | f_MI5 | f_One.Percenters | f_R50 | f_SC | f_SCL | f_SI | f_T | f_T5 | f_TO | f_UP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3154 | Melbourne | 1.5116 | 2018-08-26 | 1 | 15397 | 23 | 15 | 12 | 102 | 45 | M.C.G. | GWS | 8 | 9 | 57 | 2018 | 1712 | 8 | 12 | 10.0 | 38 | 30 | 12 | 139 | 403 | 302 | 13 | 19 | 15 | 14 | 181 | 48 | 54 | 59.0 | 222 | 106 | 6198.0 | 16 | 34 | 39 | 1768 | 20.0 | 147.0 | 60 | 2.0 | 53.0 | 269 |
3155 | North Melbourne | 1.3936 | 2018-08-26 | 0 | 15398 | 23 | 17 | 15 | 117 | 23 | Docklands | St Kilda | 14 | 10 | 94 | 2018 | 1707 | 13 | 7 | 15.0 | 50 | 24 | 6 | 131 | 425 | 322 | 14 | 18 | 17 | 13 | 201 | 32 | 59 | 77.0 | 224 | 106 | 5833.0 | 23 | 33 | 29 | 1735 | 9.0 | 154.0 | 48 | 9.0 | 66.0 | 300 |
3156 | St Kilda | 3.5178 | 2018-08-26 | 1 | 15398 | 23 | 14 | 10 | 94 | -23 | Docklands | North Melbourne | 17 | 15 | 117 | 2018 | 1587 | 8 | 11 | 19.0 | 48 | 33 | 7 | 125 | 383 | 299 | 18 | 14 | 14 | 13 | 173 | 23 | 48 | 68.0 | 210 | 112 | 5522.0 | 14 | 46 | 35 | 1568 | 14.0 | 95.0 | 50 | 7.0 | 77.0 | 269 |
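For example, we can now grab every feature column with a single list comprehension:
# Example: select only the feature columns using the 'f_' prefix
feature_cols = [col for col in afl_data if col.startswith('f_')]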
Our data is now fully ready to be explored and for features to be created.
02. Feature Creation
These tutorials will walk you through how to construct your own basic AFL model. The output will be odds for each team to win, which will be shown on The Hub.
In this notebook we will walk you through creating features from our dataset, which was cleaned in the first tutorial. Feature engineering is an integral part of the Data Science process. Creative and smart features can be the difference between an average performing model and a profitable model which beats the market odds.
Grabbing Our Dataset
First, we will import our required modules, as well as the prepare_afl_data function which we created in our afl_data_cleaning script. This essentially cleans all the data for us so that we're ready to explore the data and make some features.
# Import modules
from afl_data_cleaning_v2 import *
import afl_data_cleaning_v2
import pandas as pd
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')
import numpy as np
# Use the prepare_afl_data function to prepare the data for us; this function condenses what we walked through in the previous tutorial
afl_data = prepare_afl_data()
 | team | f_odds | date | home_game | game | round | f_goals | f_behinds | f_points | f_margin | venue | opponent | f_opponent_goals | f_opponent_behinds | f_opponent_points | season | f_AF | f_B | f_BO | f_CCL | f_CG | f_CL | f_CM | f_CP | f_D | f_ED | f_FA | f_FF | f_G | f_GA | f_HB | f_HO | f_I50 | f_ITC | f_K | f_M | f_MG | f_MI5 | f_One.Percenters | f_R50 | f_SC | f_SCL | f_SI | f_T | f_T5 | f_TO | f_UP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3154 | Melbourne | 1.5116 | 2018-08-26 | 1 | 15397 | 23 | 15 | 12 | 102 | 45 | M.C.G. | GWS | 8 | 9 | 57 | 2018 | 1712 | 8 | 12 | 10.0 | 38 | 30 | 12 | 139 | 403 | 302 | 13 | 19 | 15 | 14 | 181 | 48 | 54 | 59.0 | 222 | 106 | 6198.0 | 16 | 34 | 39 | 1768 | 20.0 | 147.0 | 60 | 2.0 | 53.0 | 269 |
3155 | North Melbourne | 1.3936 | 2018-08-26 | 0 | 15398 | 23 | 17 | 15 | 117 | 23 | Docklands | St Kilda | 14 | 10 | 94 | 2018 | 1707 | 13 | 7 | 15.0 | 50 | 24 | 6 | 131 | 425 | 322 | 14 | 18 | 17 | 13 | 201 | 32 | 59 | 77.0 | 224 | 106 | 5833.0 | 23 | 33 | 29 | 1735 | 9.0 | 154.0 | 48 | 9.0 | 66.0 | 300 |
3156 | St Kilda | 3.5178 | 2018-08-26 | 1 | 15398 | 23 | 14 | 10 | 94 | -23 | Docklands | North Melbourne | 17 | 15 | 117 | 2018 | 1587 | 8 | 11 | 19.0 | 48 | 33 | 7 | 125 | 383 | 299 | 18 | 14 | 14 | 13 | 173 | 23 | 48 | 68.0 | 210 | 112 | 5522.0 | 14 | 46 | 35 | 1568 | 14.0 | 95.0 | 50 | 7.0 | 77.0 | 269 |
Creating A Feature DataFrame
Let's create a feature DataFrame and merge all of our features into this DataFrame as we go.
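A minimal sketch of that starting DataFrame, assuming we seed it with the identifying (non-feature) columns from afl_data:
# Seed the features DataFrame with the identifying columns; features will be merged in as we create them
features = afl_data[['date', 'game', 'team', 'opponent', 'venue', 'home_game']].copy()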
What Each Column Refers To
Below is a DataFrame which outlines what each column refers to.
 | Feature Abbreviated | Feature |
---|---|---|
0 | GA | Goal Assists |
1 | CP | Contested Possessions |
2 | UP | Uncontested Possessions |
3 | ED | Effective Disposals |
4 | CM | Contested Marks |
5 | MI5 | Marks Inside 50 |
6 | One.Percenters | One Percenters |
7 | BO | Bounces |
8 | K | Kicks |
9 | HB | Handballs |
10 | D | Disposals |
11 | M | Marks |
12 | G | Goals |
13 | B | Behinds |
14 | T | Tackles |
15 | HO | Hitouts |
16 | I50 | Inside 50s |
17 | CL | Clearances |
18 | CG | Clangers |
19 | R50 | Rebound 50s |
20 | FF | Frees For |
21 | FA | Frees Against |
22 | AF | AFL Fantasy Points |
23 | SC | Supercoach Points |
24 | CCL | Centre Clearances |
25 | SCL | Stoppage Clearances |
26 | SI | Score Involvements |
27 | MG | Metres Gained |
28 | TO | Turnovers |
29 | ITC | Intercepts |
30 | T5 | Tackles Inside 50 |
Feature Creation
Now let's think about what features we can create. We have an enormous number of stats to sift through. To start, let's create some simple features based on our domain knowledge of Aussie Rules.
Creating Exponentially Weighted Rolling Averages as Features
Next, we will create rolling averages of statistics such as Tackles, which we will use as features.
It is fair to assume that a team's performance in a certain stat may have predictive power for the overall result, and that if a team consistently performs well in this stat, this may have predictive power for the results of their future games. We can't simply train a model on stats from the game which we are trying to predict (i.e. data that we don't have before the game begins), as this would leak the result. We need to train our model on past data. One way of doing this is to train our model on average stats over a certain number of games. If a team is averaging highly in a stat, this may give insight into whether they are a strong team. Similarly, if a team is averaging poorly in a stat relative to the team they are playing, this may have predictive power and point towards a predicted loss.
To do this we will create a function which calculates these rolling averages, called create_exp_weighted_avgs. It takes our cleaned DataFrame as an input, as well as a span parameter: the smaller the span, the more heavily recent performances are weighted relative to older ones. To read more about exponentially weighted moving averages, please read the pandas documentation here.
First, we will grab all the columns which we want to create EMAs for, and then use our function to create the average for that column. We will create a new DataFrame and add these columns to this new DataFrame.
# Define a function which returns a DataFrame with the exponential moving average for each numeric stat
def create_exp_weighted_avgs(df, span):
    # Create a copy of the df with only the game id and the team - we will add cols to this df
    ema_features = df[['game', 'team']].copy()
    feature_names = [col for col in df.columns if col.startswith('f_')]  # Get a list of columns we will iterate over
    for feature_name in feature_names:
        feature_ema = (df.groupby('team')[feature_name]
                       .transform(lambda row: (row.ewm(span=span)
                                               .mean()
                                               .shift(1))))
        ema_features[feature_name] = feature_ema
    return ema_features
 | game | team | f_odds | f_goals | f_behinds | f_points | f_margin | f_opponent_goals | f_opponent_behinds | f_opponent_points | f_AF | f_B | f_BO | f_CCL | f_CG | f_CL | f_CM | f_CP | f_D | f_ED | f_FA | f_FF | f_G | f_GA | f_HB | f_HO | f_I50 | f_ITC | f_K | f_M | f_MG | f_MI5 | f_One.Percenters | f_R50 | f_SC | f_SCL | f_SI | f_T | f_T5 | f_TO | f_UP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3152 | 15396 | West Coast | 2.094236 | 12.809630 | 10.047145 | 86.904928 | 8.888770 | 11.435452 | 9.403444 | 78.016158 | 3193.612782 | 16.472115 | 11.958482 | 23.379562 | 100.095244 | 68.252001 | 27.688669 | 284.463270 | 719.884644 | 525.878017 | 36.762440 | 44.867118 | 25.618202 | 17.522871 | 270.478779 | 88.139376 | 105.698031 | 148.005305 | 449.405865 | 201.198907 | 11581.929999 | 20.048124 | 95.018480 | 74.180967 | 3314.157893 | 44.872398 | 177.894442 | 126.985101 | 20.565549 | 138.876613 | 438.848376 |
3153 | 15397 | GWS | 1.805565 | 13.100372 | 13.179329 | 91.781563 | 18.527618 | 10.371198 | 11.026754 | 73.253945 | 3165.127358 | 19.875913 | 12.947209 | 25.114002 | 105.856671 | 80.609640 | 23.374884 | 303.160047 | 741.439198 | 534.520295 | 42.597317 | 38.160889 | 26.208715 | 18.688880 | 300.188301 | 81.540693 | 106.989070 | 143.032506 | 441.250897 | 173.050118 | 12091.630837 | 21.106142 | 103.077097 | 80.201059 | 3419.245919 | 55.495610 | 219.879895 | 138.202470 | 25.313148 | 135.966798 | 438.466439 |
3154 | 15397 | Melbourne | 1.706488 | 15.157271 | 13.815113 | 104.758740 | 25.170429 | 11.814319 | 8.702396 | 79.588311 | 3312.408470 | 22.077317 | 7.724955 | 28.364418 | 114.399147 | 78.406069 | 26.934677 | 324.352577 | 775.176933 | 547.385948 | 39.353251 | 36.025646 | 30.308918 | 22.461080 | 348.613592 | 99.787800 | 120.339062 | 154.417642 | 426.563341 | 178.102118 | 12395.717925 | 32.168752 | 96.390688 | 63.786515 | 3427.596843 | 50.041649 | 232.287556 | 144.875098 | 23.789233 | 149.042149 | 456.988552 |
3155 | 15398 | North Melbourne | 2.272313 | 12.721783 | 10.733785 | 87.064486 | -1.214246 | 12.915796 | 10.783958 | 88.278732 | 3066.272143 | 17.322710 | 9.815243 | 26.015421 | 106.465181 | 67.504286 | 26.064079 | 291.259574 | 736.279779 | 534.154748 | 34.301603 | 40.908551 | 25.386136 | 17.816570 | 341.210547 | 81.541130 | 102.589427 | 145.265493 | 395.069232 | 173.089408 | 10875.002463 | 21.802751 | 82.347511 | 70.416194 | 3171.120023 | 41.488865 | 197.620152 | 122.547684 | 22.286256 | 142.780474 | 450.374058 |
3156 | 15398 | St Kilda | 5.516150 | 10.464266 | 11.957047 | 74.742643 | -21.138101 | 14.105551 | 11.247440 | 95.880745 | 3094.163405 | 20.523847 | 14.569589 | 24.134276 | 102.540441 | 66.976211 | 18.018350 | 270.674857 | 773.086015 | 573.769838 | 41.319843 | 36.198820 | 20.850476 | 14.443658 | 364.405251 | 63.498760 | 103.803779 | 130.494307 | 408.680763 | 184.780054 | 10765.717942 | 21.572806 | 94.731555 | 65.790561 | 3228.278599 | 42.841935 | 196.086493 | 115.901425 | 18.796764 | 127.364334 | 508.844514 |
As you can see, our function worked perfectly! Now we have a full DataFrame of exponentially weighted moving averages. Note that as these rolling averages have been shifted by 1 to ensure no data leakage, each team's first game in the data will have all NA values. We can drop these rows later.
Let's add these averages to our features DataFrame.
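A sketch of that merge, assuming a span of 10 (the same span used for the efficiency features later in this tutorial):
# Create the EMA features and merge them onto our features DataFrame
features = pd.merge(features, create_exp_weighted_avgs(afl_data, span=10), on=['game', 'team'])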
Creating a 'Form Between the Teams' Feature
It is well known in Aussie Rules that some teams consistently perform better against certain opponents than others. If we restrict our features to stats from previous games that weren't between the two teams playing, or to Elo ratings, we won't account for any head-to-head relationships between particular teams. An example is the Kennett Curse, where Geelong won 11 consecutive games against Hawthorn, despite the teams being similarly matched. Let's create a feature which calculates how many games a team has won against their opposition over a given window of games.
To do this, we will need historical data that dates back to well before our current DataFrame starts. Otherwise we would be using a lot of our games just to calculate form, and would then have to drop those rows before feeding the data into an algorithm. So let's use our prepare_match_results function from the afl_data_cleaning script to grab a clean DataFrame of all match results since 1897. We can then calculate the form and join it to our current DataFrame.
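Assuming the cleaning script exposes this under the name defined earlier in this tutorial, grabbing the data looks like:
# Grab a clean DataFrame of all match results since 1897
match_results = get_cleaned_match_results()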
 | game | date | round | team | goals | behinds | points | margin | venue | opponent | opponent_goals | opponent_behinds | opponent_points | home_game |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1897-05-08 | 1 | Fitzroy | 6 | 13 | 49 | 33 | Brunswick St | Carlton | 2 | 4 | 16 | 1 |
1 | 1 | 1897-05-08 | 1 | Carlton | 2 | 4 | 16 | -33 | Brunswick St | Fitzroy | 6 | 13 | 49 | 0 |
2 | 2 | 1897-05-08 | 1 | Collingwood | 5 | 11 | 41 | 25 | Victoria Park | St Kilda | 2 | 4 | 16 | 1 |
form_btwn_teams = match_results[['game', 'team', 'opponent', 'margin']].copy()
form_btwn_teams['f_form_margin_btwn_teams'] = (match_results.groupby(['team', 'opponent'])['margin']
.transform(lambda row: row.rolling(5).mean().shift())
.fillna(0))
form_btwn_teams['f_form_past_5_btwn_teams'] = \
(match_results.assign(win=lambda df: df.apply(lambda row: 1 if row.margin > 0 else 0, axis='columns'))
.groupby(['team', 'opponent'])['win']
.transform(lambda row: row.rolling(5).mean().shift() * 5)
.fillna(0))
 | game | team | opponent | margin | f_form_margin_btwn_teams | f_form_past_5_btwn_teams |
---|---|---|---|---|---|---|
30793 | 15397 | Melbourne | GWS | 45 | -23.2 | 2.0 |
30794 | 15398 | St Kilda | North Melbourne | -23 | -3.2 | 2.0 |
30795 | 15398 | North Melbourne | St Kilda | 23 | 3.2 | 3.0 |
# Merge to our features df
features = pd.merge(features, form_btwn_teams.drop(columns=['margin']), on=['game', 'team', 'opponent'])
Creating Efficiency Features
Disposal Efficiency
Disposal efficiency is pivotal in Aussie Rules football. If you dispose of the ball effectively, you are much more likely to score, and much less likely to concede, than if you dispose of it ineffectively.
Let's create a disposal efficiency feature by dividing Effective Disposals by Disposals (see the sketch after the single-row DataFrame below).
Inside 50/Rebound 50 Efficiency
Similarly, one could hypothesise that teams who keep the footy in their Inside 50 regularly will be more likely to score, whilst teams who are effective at getting the ball out of their Defensive 50 will be less likely to concede. Let's use this logic to create Inside 50 Efficiency and Rebound 50 Efficiency features.
The formula used will be:
Inside 50 Efficiency = R50_Opponents / I50 (lower is better).
Rebound 50 Efficiency = R50 / I50_Opponents (higher is better).
Using these formulas, a team's I50 Efficiency is equal to their opponent's R50 Efficiency, so we only need to calculate I50 Efficiency for each side. To create these features we will need the opposition's Inside 50s and Rebound 50s, so we will split our data into home and away DataFrames, create a new DataFrame by joining these two on the game ID, calculate our efficiency features, and then join them back to our main features DataFrame.
# Get each match on single rows
single_row_df = (afl_data[['game', 'team', 'f_I50', 'f_R50', 'f_D', 'f_ED', 'home_game', ]]
.query('home_game == 1')
.rename(columns={'team': 'home_team', 'f_I50': 'f_I50_home', 'f_R50': 'f_R50_home', 'f_D': 'f_D_home', 'f_ED': 'f_ED_home'})
.drop(columns='home_game')
.pipe(pd.merge, afl_data[['game', 'team', 'f_I50', 'f_R50', 'f_D', 'f_ED', 'home_game']]
.query('home_game == 0')
.rename(columns={'team': 'away_team', 'f_I50': 'f_I50_away', 'f_R50': 'f_R50_away', 'f_D': 'f_D_away', 'f_ED': 'f_ED_away'})
.drop(columns='home_game'), on='game'))
 | game | home_team | f_I50_home | f_R50_home | f_D_home | f_ED_home | away_team | f_I50_away | f_R50_away | f_D_away | f_ED_away |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 13764 | Carlton | 69 | 21 | 373 | 268 | Richmond | 37 | 50 | 316 | 226 |
1 | 13765 | Geelong | 54 | 40 | 428 | 310 | St Kilda | 52 | 45 | 334 | 246 |
2 | 13766 | Collingwood | 70 | 38 | 398 | 289 | Port Adelaide | 50 | 44 | 331 | 232 |
3 | 13767 | Adelaide | 59 | 38 | 366 | 264 | Hawthorn | 54 | 38 | 372 | 264 |
4 | 13768 | Brisbane | 50 | 39 | 343 | 227 | Fremantle | 57 | 30 | 351 | 250 |
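First, the disposal efficiency itself; a minimal sketch, assuming we calculate it for each side from the f_ED and f_D columns carried into single_row_df above:
# Disposal efficiency = Effective Disposals / Disposals, for each side
single_row_df = single_row_df.assign(f_disposal_efficiency_home=lambda df: df.f_ED_home / df.f_D_home,
                                     f_disposal_efficiency_away=lambda df: df.f_ED_away / df.f_D_away)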
single_row_df = single_row_df.assign(f_I50_efficiency_home=lambda df: df.f_R50_away / df.f_I50_home,
f_I50_efficiency_away=lambda df: df.f_R50_home / df.f_I50_away)
feature_efficiency_cols = ['f_I50_efficiency_home', 'f_I50_efficiency_away']
# Now let's create an Exponentially Weighted Moving Average for these features - we will need to reshape our DataFrame to do this
# (DataFrame.append was removed in pandas 2.0, so we build the home and away halves and concatenate them)
home_half = (single_row_df[['game', 'home_team'] + feature_efficiency_cols]
             .rename(columns={
                 'home_team': 'team',
                 'f_I50_efficiency_home': 'f_I50_efficiency',
                 'f_I50_efficiency_away': 'f_I50_efficiency_opponent',
             }))
away_half = (single_row_df[['game', 'away_team'] + feature_efficiency_cols]
             .rename(columns={
                 'away_team': 'team',
                 'f_I50_efficiency_home': 'f_I50_efficiency_opponent',
                 'f_I50_efficiency_away': 'f_I50_efficiency',
             }))
efficiency_features_multi_row = (pd.concat([home_half, away_half], sort=True)
                                 .sort_values(by='game')
                                 .reset_index(drop=True))
efficiency_features = efficiency_features_multi_row[['game', 'team']].copy()
feature_efficiency_cols = ['f_I50_efficiency', 'f_I50_efficiency_opponent']
for feature in feature_efficiency_cols:
    efficiency_features[feature] = (efficiency_features_multi_row.groupby('team')[feature]
                                    .transform(lambda row: row.ewm(span=10).mean().shift(1)))
# Get feature efficiency df back onto single rows
efficiency_features = pd.merge(efficiency_features, afl_data[['game', 'team', 'home_game']], on=['game', 'team'])
efficiency_features_single_row = (efficiency_features.query('home_game == 1')
.rename(columns={
'team': 'home_team',
'f_I50_efficiency': 'f_I50_efficiency_home',
'f_I50_efficiency_opponent': 'f_R50_efficiency_home'})
.drop(columns='home_game')
.pipe(pd.merge, (efficiency_features.query('home_game == 0')
.rename(columns={
'team': 'away_team',
'f_I50_efficiency': 'f_I50_efficiency_away',
'f_I50_efficiency_opponent': 'f_R50_efficiency_away'})
.drop(columns='home_game')), on='game'))
 | game | home_team | f_I50_efficiency_home | f_R50_efficiency_home | away_team | f_I50_efficiency_away | f_R50_efficiency_away |
---|---|---|---|---|---|---|---|
1580 | 15394 | Carlton | 0.730668 | 0.675002 | Adelaide | 0.691614 | 0.677128 |
1581 | 15395 | Sydney | 0.699994 | 0.778280 | Hawthorn | 0.699158 | 0.673409 |
1582 | 15396 | Brisbane | 0.683604 | 0.691730 | West Coast | 0.696822 | 0.709605 |
1583 | 15397 | Melbourne | 0.667240 | 0.692632 | GWS | 0.684525 | 0.753783 |
1584 | 15398 | St Kilda | 0.730843 | 0.635819 | North Melbourne | 0.697018 | 0.654991 |
We will merge these features back into our features df later, when the features DataFrame is on single rows as well.
Creating an Elo Feature
Another feature which we could create is an Elo feature. If you don't know what Elo is, go ahead and read an article on it here. We have also written a guide on using Elo to model the 2021 Euro & Copa America tournaments here.
Essentially, Elo ratings increase if you win. The amount the rating increases is based on how strong the opponent is relative to the team who won. Weak teams get more points for beating stronger teams than they do for beating weaker teams, and vice versa for losses (teams lose points for losses).
Mathematically, Elo ratings can also assign a probability for winning or losing based on the two Elo Ratings of the teams playing.
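Concretely, given the two ratings, the home team's expected win probability and post-match rating update (as implemented in the function below) are:

$$P_{home} = \frac{1}{1 + 10^{(R_{away} - R_{home})/400}}, \qquad R_{home}' = R_{home} + k \times (S_{home} - P_{home})$$

where $S_{home}$ is 1 for a win, 0.5 for a draw and 0 for a loss, and the away team is updated symmetrically.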
So let's get into it. We will first define a function which calculates the elo for each team and applies these elos to our DataFrame.
# Define a function which finds the elo for each team in each game and returns a dictionary with the game ID as a key and the
# elos as the key's value, in a list. It also outputs the probabilities and a dictionary of the final elos for each team
def elo_applier(df, k_factor):
    # Initialise a dictionary with default elos for each team
    elo_dict = {team: 1500 for team in df['team'].unique()}
    elos, elo_probs = {}, {}
    # Get a home and away dataframe so that we can get the teams on the same row
    home_df = df.loc[df.home_game == 1, ['team', 'game', 'f_margin', 'home_game']].rename(columns={'team': 'home_team'})
    away_df = df.loc[df.home_game == 0, ['team', 'game']].rename(columns={'team': 'away_team'})
    df = (pd.merge(home_df, away_df, on='game')
          .sort_values(by='game')
          .drop_duplicates(subset='game', keep='first')
          .reset_index(drop=True))
    # Loop over the rows in the DataFrame
    for index, row in df.iterrows():
        # Get the Game ID
        game_id = row['game']
        # Get the margin
        margin = row['f_margin']
        # If the game already has the elos for the home and away team in the elos dictionary, go to the next game
        if game_id in elos.keys():
            continue
        # Get the team and opposition
        home_team = row['home_team']
        away_team = row['away_team']
        # Get the team and opposition elo score
        home_team_elo = elo_dict[home_team]
        away_team_elo = elo_dict[away_team]
        # Calculate the probability of winning for the team and opposition
        prob_win_home = 1 / (1 + 10**((away_team_elo - home_team_elo) / 400))
        prob_win_away = 1 - prob_win_home
        # Add the elos and probabilities to our elos and elo_probs dictionaries, keyed by the Game ID
        elos[game_id] = [home_team_elo, away_team_elo]
        elo_probs[game_id] = [prob_win_home, prob_win_away]
        # Calculate the new elos of each team
        if margin > 0:  # Home team wins; update both teams' elo
            new_home_team_elo = home_team_elo + k_factor*(1 - prob_win_home)
            new_away_team_elo = away_team_elo + k_factor*(0 - prob_win_away)
        elif margin < 0:  # Away team wins; update both teams' elo
            new_home_team_elo = home_team_elo + k_factor*(0 - prob_win_home)
            new_away_team_elo = away_team_elo + k_factor*(1 - prob_win_away)
        else:  # Drawn game; update both teams' elo
            new_home_team_elo = home_team_elo + k_factor*(0.5 - prob_win_home)
            new_away_team_elo = away_team_elo + k_factor*(0.5 - prob_win_away)
        # Update elos in the elo dictionary
        elo_dict[home_team] = new_home_team_elo
        elo_dict[away_team] = new_away_team_elo
    return elos, elo_probs, elo_dict
# Use the elo applier function to get the elos and elo probabilities for each game - we will map these later
elos, probs, elo_dict = elo_applier(afl_data, 30)
Great! Now we have both rolling averages of stats as features and an Elo rating for each team! Out of curiosity, let's have a quick look at the final Elo standings with a k-factor of 30.
for team in sorted(elo_dict, key=elo_dict.get)[::-1]:
    print(team, elo_dict[team])
Richmond 1695.2241513840117
Sydney 1645.548990879842
Hawthorn 1632.5266709780622
West Coast 1625.871701773721
Geelong 1625.423154644809
GWS 1597.4158602131877
Adelaide 1591.1704934545442
Collingwood 1560.370309216614
Melbourne 1558.5666572771509
Essendon 1529.0198398117086
Port Adelaide 1524.8882517820093
North Melbourne 1465.5637511922569
Western Bulldogs 1452.2110697845148
Fremantle 1393.142087030804
St Kilda 1360.9120149937303
Brisbane 1276.2923772139352
Gold Coast 1239.174528704772
Carlton 1226.6780896643265
This looks extremely similar to the current AFL ladder, which is a good sign that Elo is an effective predictor of winning.
Merging Our Features Into One Features DataFrame
Now we need to reshape our features df so that we have all of the statistics for both teams in a game on a single row. We can then merge our elo and efficiency features to this df.
 | date | game | team | opponent | venue | home_game | f_odds | f_goals | f_behinds | f_points | f_margin | f_opponent_goals | f_opponent_behinds | f_opponent_points | f_AF | f_B | f_BO | f_CCL | f_CG | f_CL | f_CM | f_CP | f_D | f_ED | f_FA | f_FF | f_G | f_GA | f_HB | f_HO | f_I50 | f_ITC | f_K | f_M | f_MG | f_MI5 | f_One.Percenters | f_R50 | f_SC | f_SCL | f_SI | f_T | f_T5 | f_TO | f_UP | f_form_margin_btwn_teams | f_form_past_5_btwn_teams |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3156 | 2018-08-26 | 15397 | Melbourne | GWS | M.C.G. | 1 | 1.706488 | 15.157271 | 13.815113 | 104.758740 | 25.170429 | 11.814319 | 8.702396 | 79.588311 | 3312.408470 | 22.077317 | 7.724955 | 28.364418 | 114.399147 | 78.406069 | 26.934677 | 324.352577 | 775.176933 | 547.385948 | 39.353251 | 36.025646 | 30.308918 | 22.461080 | 348.613592 | 99.78780 | 120.339062 | 154.417642 | 426.563341 | 178.102118 | 12395.717925 | 32.168752 | 96.390688 | 63.786515 | 3427.596843 | 50.041649 | 232.287556 | 144.875098 | 23.789233 | 149.042149 | 456.988552 | -23.2 | 2.0 |
3157 | 2018-08-26 | 15398 | North Melbourne | St Kilda | Docklands | 0 | 2.272313 | 12.721783 | 10.733785 | 87.064486 | -1.214246 | 12.915796 | 10.783958 | 88.278732 | 3066.272143 | 17.322710 | 9.815243 | 26.015421 | 106.465181 | 67.504286 | 26.064079 | 291.259574 | 736.279779 | 534.154748 | 34.301603 | 40.908551 | 25.386136 | 17.816570 | 341.210547 | 81.54113 | 102.589427 | 145.265493 | 395.069232 | 173.089408 | 10875.002463 | 21.802751 | 82.347511 | 70.416194 | 3171.120023 | 41.488865 | 197.620152 | 122.547684 | 22.286256 | 142.780474 | 450.374058 | 3.2 | 3.0 |
3158 | 2018-08-26 | 15398 | St Kilda | North Melbourne | Docklands | 1 | 5.516150 | 10.464266 | 11.957047 | 74.742643 | -21.138101 | 14.105551 | 11.247440 | 95.880745 | 3094.163405 | 20.523847 | 14.569589 | 24.134276 | 102.540441 | 66.976211 | 18.018350 | 270.674857 | 773.086015 | 573.769838 | 41.319843 | 36.198820 | 20.850476 | 14.443658 | 364.405251 | 63.49876 | 103.803779 | 130.494307 | 408.680763 | 184.780054 | 10765.717942 | 21.572806 | 94.731555 | 65.790561 | 3228.278599 | 42.841935 | 196.086493 | 115.901425 | 18.796764 | 127.364334 | 508.844514 | -3.2 | 2.0 |
one_line_cols = ['game', 'team', 'home_game'] + [col for col in features if col.startswith('f_')]
# Get all features onto individual rows for each match
features_one_line = (features.loc[features.home_game == 1, one_line_cols]
.rename(columns={'team': 'home_team'})
.drop(columns='home_game')
.pipe(pd.merge, (features.loc[features.home_game == 0, one_line_cols]
.drop(columns='home_game')
.rename(columns={'team': 'away_team'})
.rename(columns={col: col+'_away' for col in features.columns if col.startswith('f_')})), on='game')
.drop(columns=['f_form_margin_btwn_teams_away', 'f_form_past_5_btwn_teams_away']))
# Add our created features - elo, efficiency etc.
features_one_line = (features_one_line.assign(f_elo_home=lambda df: df.game.map(elos).apply(lambda x: x[0]),
f_elo_away=lambda df: df.game.map(elos).apply(lambda x: x[1]))
.pipe(pd.merge, efficiency_features_single_row, on=['game', 'home_team', 'away_team'])
.pipe(pd.merge, afl_data.loc[afl_data.home_game == 1, ['game', 'date', 'round', 'venue']], on=['game'])
.dropna()
.reset_index(drop=True)
.assign(season=lambda df: df.date.apply(lambda row: row.year)))
ordered_cols = [col for col in features_one_line if col[:2] != 'f_'] + [col for col in features_one_line if col.startswith('f_')]
feature_df = features_one_line[ordered_cols]
Finally, let's reduce the dimensionality of the features df by subtracting the away features from the home features. This will reduce the huge number of columns we have and make our data more manageable. To do this, we will need a list of columns which we are subtracting from each other. We will then loop over each of these columns to create our new differential columns.
We will then add in the implied probability from the current odds for the home and away teams, as our existing odds feature is simply an exponential moving average over the past n games.
# Create differential df - this df is the home features - the away features
diff_cols = [col for col in feature_df.columns if col + '_away' in feature_df.columns and col != 'f_odds' and col.startswith('f_')]
non_diff_cols = [col for col in feature_df.columns if col not in diff_cols and col[:-5] not in diff_cols]
diff_df = feature_df[non_diff_cols].copy()
for col in diff_cols:
    diff_df[col+'_diff'] = feature_df[col] - feature_df[col+'_away']
# Add current odds in to diff_df
odds = get_cleaned_odds()
home_odds = (odds[odds.home_game == 1]
.assign(f_current_odds_prob=lambda df: 1 / df.odds)
.rename(columns={'team': 'home_team'})
.drop(columns=['home_game', 'odds']))
away_odds = (odds[odds.home_game == 0]
.assign(f_current_odds_prob_away=lambda df: 1 / df.odds)
.rename(columns={'team': 'away_team'})
.drop(columns=['home_game', 'odds']))
diff_df = (diff_df.pipe(pd.merge, home_odds, on=['date', 'home_team'])
.pipe(pd.merge, away_odds, on=['date', 'away_team']))
 | game | home_team | away_team | date | round | venue | season | f_odds | f_form_margin_btwn_teams | f_form_past_5_btwn_teams | f_odds_away | f_elo_home | f_elo_away | f_I50_efficiency_home | f_R50_efficiency_home | f_I50_efficiency_away | f_R50_efficiency_away | f_goals_diff | f_behinds_diff | f_points_diff | f_margin_diff | f_opponent_goals_diff | f_opponent_behinds_diff | f_opponent_points_diff | f_AF_diff | f_B_diff | f_BO_diff | f_CCL_diff | f_CG_diff | f_CL_diff | f_CM_diff | f_CP_diff | f_D_diff | f_ED_diff | f_FA_diff | f_FF_diff | f_G_diff | f_GA_diff | f_HB_diff | f_HO_diff | f_I50_diff | f_ITC_diff | f_K_diff | f_M_diff | f_MG_diff | f_MI5_diff | f_One.Percenters_diff | f_R50_diff | f_SC_diff | f_SCL_diff | f_SI_diff | f_T_diff | f_T5_diff | f_TO_diff | f_UP_diff | f_current_odds_prob | f_current_odds_prob_away |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1626 | 15394 | Carlton | Adelaide | 2018-08-25 | 23 | Docklands | 2018 | 6.467328 | -26.2 | 1.0 | 2.066016 | 1230.072138 | 1587.776445 | 0.730668 | 0.675002 | 0.691614 | 0.677128 | -3.498547 | -5.527193 | -26.518474 | -34.473769 | 1.289715 | 0.217006 | 7.955295 | -341.342677 | -9.317269 | 3.088569 | -2.600593 | 15.192839 | -12.518345 | -4.136673 | -41.855717 | -72.258378 | -51.998775 | 9.499447 | 8.670917 | -6.973088 | -4.740623 | -26.964945 | -13.147675 | -23.928700 | -28.940883 | -45.293433 | -15.183406 | -1900.784014 | -0.362402 | -1.314627 | 4.116133 | -294.813511 | -9.917793 | -34.724925 | -5.462844 | -9.367141 | -19.623785 | -38.188082 | 0.187709 | 0.816860 |
1627 | 15395 | Sydney | Hawthorn | 2018-08-25 | 23 | S.C.G. | 2018 | 2.128611 | 1.0 | 2.0 | 1.777290 | 1662.568452 | 1615.507209 | 0.699994 | 0.778280 | 0.699158 | 0.673409 | -1.756730 | -0.874690 | -11.415069 | -15.575319 | 0.014390 | 4.073909 | 4.160250 | -174.005092 | -0.942357 | -4.078635 | -4.192916 | 7.814496 | -2.225780 | 6.215760 | 15.042979 | -34.894261 | -50.615255 | 4.214158 | 0.683548 | -3.535594 | -3.168608 | -12.068691 | -30.493980 | -9.867332 | 2.588103 | -22.825570 | -5.604199 | 253.086090 | -2.697132 | -22.612327 | 25.340623 | -90.812188 | 1.967104 | -31.047879 | 0.007606 | -6.880120 | 11.415593 | -49.957313 | 0.440180 | 0.561924 |
1628 | 15396 | Brisbane | West Coast | 2018-08-26 | 23 | Gabba | 2018 | 3.442757 | -49.2 | 0.0 | 2.094236 | 1279.963814 | 1622.200265 | 0.683604 | 0.691730 | 0.696822 | 0.709605 | -0.190413 | 1.182699 | 0.040221 | -13.621456 | 1.772577 | 3.026217 | 13.661677 | -22.709485 | 2.424261 | -4.848054 | 1.800473 | 5.051157 | 6.440524 | -5.549630 | -17.041838 | 27.543023 | 33.983159 | 4.459181 | -3.213885 | -0.428455 | 1.514474 | 42.646138 | -7.141638 | 1.457375 | -17.472537 | -15.103115 | 8.001966 | -383.083539 | 6.458915 | 7.275716 | 0.942863 | 44.461590 | 4.640136 | 13.180967 | -15.704694 | 2.366444 | -5.985843 | 38.195255 | 0.433501 | 0.569866 |
1629 | 15397 | Melbourne | GWS | 2018-08-26 | 23 | M.C.G. | 2018 | 1.706488 | -23.2 | 2.0 | 1.805565 | 1540.367850 | 1615.614668 | 0.667240 | 0.692632 | 0.684525 | 0.753783 | 2.056899 | 0.635785 | 12.977177 | 6.642811 | 1.443121 | -2.324358 | 6.334366 | 147.281112 | 2.201404 | -5.222254 | 3.250416 | 8.542475 | -2.203571 | 3.559792 | 21.192530 | 33.737734 | 12.865653 | -3.244066 | -2.135243 | 4.100203 | 3.772200 | 48.425291 | 18.247107 | 13.349992 | 11.385136 | -14.687556 | 5.052000 | 304.087088 | 11.062610 | -6.686409 | -16.414544 | 8.350924 | -5.453961 | 12.407662 | 6.672628 | -1.523915 | 13.075351 | 18.522113 | 0.661551 | 0.340379 |
1630 | 15398 | St Kilda | North Melbourne | 2018-08-26 | 23 | Docklands | 2018 | 5.516150 | -3.2 | 2.0 | 2.272313 | 1372.453734 | 1454.022032 | 0.730843 | 0.635819 | 0.697018 | 0.654991 | -2.257517 | 1.223261 | -12.321842 | -19.923855 | 1.189755 | 0.463481 | 7.602012 | 27.891262 | 3.201137 | 4.754346 | -1.881145 | -3.924740 | -0.528075 | -8.045729 | -20.584717 | 36.806235 | 39.615090 | 7.018240 | -4.709732 | -4.535660 | -3.372912 | 23.194704 | -18.042370 | 1.214353 | -14.771187 | 13.611531 | 11.690647 | -109.284521 | -0.229945 | 12.384044 | -4.625633 | 57.158576 | 1.353070 | -1.533659 | -6.646259 | -3.489492 | -15.416140 | 58.470456 | 0.284269 | 0.717566 |
Wrapping it Up
We now have a fairly decent number of features. Some other features which could be added include whether the game is in a major capital city outside of Melbourne (i.e. Sydney, Adelaide or Perth), how many 'elite' players are playing (which could be judged by average SuperCoach scores over 110, for example), as well as your own metrics for attacking and defending.
Note that all of our feature columns start with 'f_', so in the next section we will grab this feature DataFrame and use these features to start predicting the matches.
03. Modelling
These tutorials will walk you through how to construct your own basic AFL model, using publicly available data. The output will be odds for each team to win, which will be shown on The Hub.
In this notebook we will walk you through modelling our AFL data to create predictions. We will train a variety of quick and easy models to get a feel for what works and what doesn't. We will then tune our hyperparameters so that we are ready to make week by week predictions.
Grabbing Our Dataset
First, we will import our required modules, as well as the prepare_afl_features function which we created in our afl_feature_creation script. This essentially creates some basic features for us so that we can get started on the modelling component.
# Import libraries
from afl_data_cleaning_v2 import *
import datetime
import pandas as pd
import numpy as np
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
# from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.feature_selection import RFECV
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn import feature_selection
from sklearn import metrics
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
import warnings
warnings.filterwarnings('ignore')
import afl_feature_creation_v2
import afl_data_cleaning_v2
# Grab our feature DataFrame which we created in the previous tutorial
feature_df = afl_feature_creation_v2.prepare_afl_features()
afl_data = afl_data_cleaning_v2.prepare_afl_data()
 | game | home_team | away_team | date | round | venue | season | f_odds | f_form_margin_btwn_teams | f_form_past_5_btwn_teams | f_odds_away | f_elo_home | f_elo_away | f_I50_efficiency_home | f_R50_efficiency_home | f_I50_efficiency_away | f_R50_efficiency_away | f_goals_diff | f_behinds_diff | f_points_diff | f_margin_diff | f_opponent_goals_diff | f_opponent_behinds_diff | f_opponent_points_diff | f_AF_diff | f_B_diff | f_BO_diff | f_CCL_diff | f_CG_diff | f_CL_diff | f_CM_diff | f_CP_diff | f_D_diff | f_ED_diff | f_FA_diff | f_FF_diff | f_G_diff | f_GA_diff | f_HB_diff | f_HO_diff | f_I50_diff | f_ITC_diff | f_K_diff | f_M_diff | f_MG_diff | f_MI5_diff | f_One.Percenters_diff | f_R50_diff | f_SC_diff | f_SCL_diff | f_SI_diff | f_T_diff | f_T5_diff | f_TO_diff | f_UP_diff | f_current_odds_prob | f_current_odds_prob_away |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1628 | 15396 | Brisbane | West Coast | 2018-08-26 | 23 | Gabba | 2018 | 3.442757 | -49.2 | 0.0 | 2.094236 | 1279.963814 | 1622.200265 | 0.683604 | 0.691730 | 0.696822 | 0.709605 | -0.190413 | 1.182699 | 0.040221 | -13.621456 | 1.772577 | 3.026217 | 13.661677 | -22.709485 | 2.424261 | -4.848054 | 1.800473 | 5.051157 | 6.440524 | -5.549630 | -17.041838 | 27.543023 | 33.983159 | 4.459181 | -3.213885 | -0.428455 | 1.514474 | 42.646138 | -7.141638 | 1.457375 | -17.472537 | -15.103115 | 8.001966 | -383.083539 | 6.458915 | 7.275716 | 0.942863 | 44.461590 | 4.640136 | 13.180967 | -15.704694 | 2.366444 | -5.985843 | 38.195255 | 0.433501 | 0.569866 |
1629 | 15397 | Melbourne | GWS | 2018-08-26 | 23 | M.C.G. | 2018 | 1.706488 | -23.2 | 2.0 | 1.805565 | 1540.367850 | 1615.614668 | 0.667240 | 0.692632 | 0.684525 | 0.753783 | 2.056899 | 0.635785 | 12.977177 | 6.642811 | 1.443121 | -2.324358 | 6.334366 | 147.281112 | 2.201404 | -5.222254 | 3.250416 | 8.542475 | -2.203571 | 3.559792 | 21.192530 | 33.737734 | 12.865653 | -3.244066 | -2.135243 | 4.100203 | 3.772200 | 48.425291 | 18.247107 | 13.349992 | 11.385136 | -14.687556 | 5.052000 | 304.087088 | 11.062610 | -6.686409 | -16.414544 | 8.350924 | -5.453961 | 12.407662 | 6.672628 | -1.523915 | 13.075351 | 18.522113 | 0.661551 | 0.340379 |
1630 | 15398 | St Kilda | North Melbourne | 2018-08-26 | 23 | Docklands | 2018 | 5.516150 | -3.2 | 2.0 | 2.272313 | 1372.453734 | 1454.022032 | 0.730843 | 0.635819 | 0.697018 | 0.654991 | -2.257517 | 1.223261 | -12.321842 | -19.923855 | 1.189755 | 0.463481 | 7.602012 | 27.891262 | 3.201137 | 4.754346 | -1.881145 | -3.924740 | -0.528075 | -8.045729 | -20.584717 | 36.806235 | 39.615090 | 7.018240 | -4.709732 | -4.535660 | -3.372912 | 23.194704 | -18.042370 | 1.214353 | -14.771187 | 13.611531 | 11.690647 | -109.284521 | -0.229945 | 12.384044 | -4.625633 | 57.158576 | 1.353070 | -1.533659 | -6.646259 | -3.489492 | -15.416140 | 58.470456 | 0.284269 | 0.717566 |
# Get the result and merge to the feature_df
match_results = (pd.read_csv("data/afl_match_results.csv")
.rename(columns={'Game': 'game'})
.assign(result=lambda df: df.apply(lambda row: 1 if row['Home.Points'] > row['Away.Points'] else 0, axis=1)))
# Merge result column to feature_df
feature_df = pd.merge(feature_df, match_results[['game', 'result']], on='game')
Creating a Training and Testing Set
So that we don't train our model on the data that we will later test it on, we will create separate train and test sets. For this exercise we will use the 2018 season to test how our model performs, whilst the rest of the data is used to train the model.
# Create our train and test sets from our feature DataFrame; keeping only the game ID and
# the 'f_' feature columns drops the columns which would leak the result
feature_columns = [col for col in feature_df if col.startswith('f_')]
# Create our test set
test_x = feature_df.loc[feature_df.season == 2018, ['game'] + feature_columns]
test_y = feature_df.loc[feature_df.season == 2018, 'result']
# Create our train set
X = feature_df.loc[feature_df.season != 2018, ['game'] + feature_columns]
y = feature_df.loc[feature_df.season != 2018, 'result']
# Scale features
scaler = StandardScaler()
X[feature_columns] = scaler.fit_transform(X[feature_columns])
test_x[feature_columns] = scaler.transform(test_x[feature_columns])
Using Cross Validation to Find The Best Algorithms
Now that we have our training set, we can run through a list of popular classifiers to determine which is best suited to modelling our data. To do this we will create a function which uses stratified k-fold cross-validation to find the 'best' algorithms, ranked by the log loss of their predictions.
This function will take in a list of classifiers, which we will define below, as well as the training set and its outcome, and output a DataFrame with the mean and standard deviation of each algorithm's log loss. Let's jump into it!
# Create a list of standard classifiers
classifiers = [
#Ensemble Methods
ensemble.AdaBoostClassifier(),
ensemble.BaggingClassifier(),
ensemble.ExtraTreesClassifier(),
ensemble.GradientBoostingClassifier(),
ensemble.RandomForestClassifier(),
#Gaussian Processes
gaussian_process.GaussianProcessClassifier(),
#GLM
linear_model.LogisticRegressionCV(),
# Naive Bayes
naive_bayes.BernoulliNB(),
naive_bayes.GaussianNB(),
#SVM
svm.SVC(probability=True),
svm.NuSVC(probability=True),
#Discriminant Analysis
discriminant_analysis.LinearDiscriminantAnalysis(),
discriminant_analysis.QuadraticDiscriminantAnalysis(),
#xgboost: http://xgboost.readthedocs.io/en/latest/model.html
# XGBClassifier()
]
# Define a function which finds the best algorithms for our modelling task
def find_best_algorithms(classifier_list, X, y):
# This function is adapted from https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling
# Cross validate model with Kfold stratified cross validation
kfold = StratifiedKFold(n_splits=5)
# Grab the cross validation scores for each algorithm
    cv_results = [cross_val_score(classifier, X, y, scoring="neg_log_loss", cv=kfold) for classifier in classifier_list]
    cv_means = [cv_result.mean() * -1 for cv_result in cv_results]
    cv_std = [cv_result.std() for cv_result in cv_results]
    algorithm_names = [alg.__class__.__name__ for alg in classifier_list]
# Create a DataFrame of all the CV results
cv_results = pd.DataFrame({
"Mean Log Loss": cv_means,
"Log Loss Std": cv_std,
"Algorithm": algorithm_names
})
    return cv_results.sort_values(by='Mean Log Loss').reset_index(drop=True)
# Rank our classifiers by mean log loss (this may take a few minutes to run)
best_algos = find_best_algorithms(classifiers, X[feature_columns], y)
best_algos
Mean Log Loss | Log Loss Std | Algorithm | |
---|---|---|---|
0 | 0.539131 | 3.640578e-02 | LogisticRegressionCV |
1 | 0.551241 | 5.775685e-02 | LinearDiscriminantAnalysis |
2 | 0.630994 | 8.257481e-02 | GradientBoostingClassifier |
3 | 0.670041 | 9.205780e-03 | AdaBoostClassifier |
4 | 0.693147 | 2.360121e-08 | GaussianProcessClassifier |
5 | 0.712537 | 2.770864e-02 | SVC |
6 | 0.712896 | 2.440755e-02 | NuSVC |
7 | 0.836191 | 2.094224e-01 | ExtraTreesClassifier |
8 | 0.874307 | 1.558144e-01 | RandomForestClassifier |
9 | 1.288174 | 3.953037e-01 | BaggingClassifier |
10 | 1.884019 | 4.769589e-01 | QuadraticDiscriminantAnalysis |
11 | 2.652161 | 6.886897e-01 | BernoulliNB |
12 | 3.299651 | 6.427551e-01 | GaussianNB |
# Try a logistic regression model and see how it performs in terms of accuracy
kfold = StratifiedKFold(n_splits=5)
cv_scores = cross_val_score(linear_model.LogisticRegressionCV(), X[feature_columns], y, scoring='accuracy', cv=kfold)
cv_scores.mean()
0.7452268937025035
Choosing Our Algorithms
As we can see from above, some of these algorithms are pretty poor at predicting the winner. On the other hand, whilst an accuracy of 74.5% (at the time of writing) may seem like a decent result, we must first establish a baseline to judge our performance against. In this case we will use two baselines: the proportion of games won by the home team, and the accuracy of the favourite implied by the odds. If we can beat the odds, we have created a very powerful model.
Note that the log loss can also be baselined, both against the odds' log loss and against random guessing. Randomly guessing between two teams attains a log loss of ln(2) ≈ 0.693, so we have already beaten this result.
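As a quick sanity check, here is a minimal sketch showing that a constant 50/50 prediction scores ln(2) regardless of the actual results (this reuses numpy and the sklearn metrics module imported earlier):
# Sanity check: an uninformed 50/50 prediction scores a log loss of ln(2) ≈ 0.693
uninformed_probs = np.full((100, 2), 0.5)
random_results = np.random.randint(0, 2, size=100)
metrics.log_loss(random_results, uninformed_probs)  # ~0.6931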
Once we have established our baseline, we will choose the top algorithms from above and tune their hyperparameters, as well as automatically select the best features to use in our model (a sketch of this step follows below).
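As a preview of the feature selection step, here is a minimal sketch using scikit-learn's RFECV (recursive feature elimination with cross-validation) on our training set; treat it as illustrative rather than the exact procedure we will use:
# A sketch of automatic feature selection via recursive feature elimination
from sklearn.feature_selection import RFECV
selector = RFECV(LogisticRegression(), cv=5, scoring='neg_log_loss')
selector.fit(X[feature_columns], y)
selected_features = [col for col, keep in zip(feature_columns, selector.support_) if keep]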
Defining Our Baseline
As stated above, we must define our baseline so that we have a measure to beat. We will use the proportion of games won by the home team, as well as the proportion of favourites who won, based on the odds. To establish this baseline we will use our cleaned afl_data DataFrame, as this has no dropped rows.
# Find the percentage chance of winning at home in each season.
afl_data = afl_data_cleaning_v2.prepare_afl_data()
afl_data['home_win'] = afl_data.apply(lambda x: 1 if x['f_margin'] > 0 else 0, axis=1)
home_games = afl_data[afl_data['home_game'] == 1]
home_games[["home_win", 'season']].groupby(['season']).mean()
season | home_win |
---|---|
2011 | 0.561856 |
2012 | 0.563725 |
2013 | 0.561576 |
2014 | 0.574257 |
2015 | 0.539604 |
2016 | 0.606742 |
2017 | 0.604061 |
2018 | 0.540404 |
# Find the proportion of favourites who have won
# Define a function which finds if the odds correctly guessed the response
def find_odds_prediction(a_row):
if a_row['f_odds'] <= a_row['f_odds_away'] and a_row['home_win'] == 1:
return 1
elif a_row['f_odds_away'] < a_row['f_odds'] and a_row['home_win'] == 0:
return 1
else:
return 0
# Define a function which splits our DataFrame so each game is on one row instead of two
def get_df_on_one_line(df):
cols_to_drop = ['date', 'home_game', 'opponent',
'f_opponent_behinds', 'f_opponent_goals', 'f_opponent_points', 'f_points',
'round', 'venue', 'season']
home_df = df[df['home_game'] == 1].rename(columns={'team': 'home_team'})
away_df = df[df['home_game'] == 0].rename(columns={'team': 'away_team'})
away_df = away_df.drop(columns=cols_to_drop)
# Rename away_df columns
away_df_renamed = away_df.rename(columns={col: col + '_away' for col in away_df.columns if col != 'game'})
merged_df = pd.merge(home_df, away_df_renamed, on='game')
merged_df['home_win'] = merged_df.f_margin.apply(lambda x: 1 if x > 0 else 0)
return merged_df
afl_data_one_line = get_df_on_one_line(afl_data)
afl_data_one_line['odds_prediction'] = afl_data_one_line.apply(find_odds_prediction, axis=1)
print('The overall mean accuracy of choosing the favourite based on the odds is {}%'.format(
round(afl_data_one_line['odds_prediction'].mean() * 100, 2)))
afl_data_one_line[["odds_prediction", 'season']].groupby(['season']).mean()
The overall mean accuracy of choosing the favourite based on the odds is 73.15%
season | odds_prediction |
---|---|
2011 | 0.784615 |
2012 | 0.774510 |
2013 | 0.748768 |
2014 | 0.727723 |
2015 | 0.727723 |
2016 | 0.713483 |
2017 | 0.659898 |
2018 | 0.712121 |
# Get a baseline log loss score from the odds
afl_data_one_line['odds_home_prob'] = 1 / afl_data_one_line.f_odds
afl_data_one_line['odds_away_prob'] = 1 / afl_data_one_line.f_odds_away
metrics.log_loss(afl_data_one_line.home_win, afl_data_one_line[['odds_away_prob', 'odds_home_prob']])
0.5375306549682837
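One thing to note: the raw 1/odds probabilities include the bookmaker's margin, so each game's probabilities sum to slightly more than 1. scikit-learn renormalises rows which don't sum to 1 before scoring, but we can make that step explicit:
# Normalise the bookmaker's margin out of the odds-implied probabilities before scoring
odds_probs = afl_data_one_line[['odds_away_prob', 'odds_home_prob']].values
odds_probs = odds_probs / odds_probs.sum(axis=1, keepdims=True)
metrics.log_loss(afl_data_one_line.home_win, odds_probs)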
We can see that the odds are MUCH more accurate than simply choosing the home team to win. The mean accuracy of choosing the favourite is around 73%, so that is the accuracy we need to beat. Similarly, the log loss of the odds is around 0.538, whilst our model scores around 0.539 (at the time of writing), without hyperparameter optimisation. Let's choose only the algorithms with log losses below 0.67.
chosen_algorithms = best_algos.loc[best_algos['Mean Log Loss'] < 0.67, 'Algorithm'].tolist()
chosen_algorithms
['LogisticRegressionCV',
'LinearDiscriminantAnalysis',
'GradientBoostingClassifier']
Using Grid Search To Tune Hyperparameters
Now that we have our best models, we can use grid search to optimise our hyperparameters. Grid search involves searching through a range of hyperparameter combinations and choosing the combination which scores best on some metric, which in our case is log loss. Let's do this for the algorithms which have hyperparameters that can be tuned. Note that if you are running this on your own computer it may take up to 10 minutes.
# Define a function which optimises the hyperparameters of our chosen algorithms
def optimise_hyperparameters(train_x, train_y, algorithms, parameters):
kfold = StratifiedKFold(n_splits=5)
best_estimators = []
for alg, params in zip(algorithms, parameters):
gs = GridSearchCV(alg, param_grid=params, cv=kfold, scoring='neg_log_loss', verbose=1)
gs.fit(train_x, train_y)
best_estimators.append(gs.best_estimator_)
return best_estimators
# Define our parameters to run a grid search over
lr_grid = {
"C": [0.0001, 0.001, 0.01, 0.05, 0.2, 0.5],
"solver": ["newton-cg", "lbfgs", "liblinear"]
}
# Add our algorithms and parameters to lists to be used in our function
alg_list = [LogisticRegression()]
param_list = [lr_grid]
# Find the best estimators, then add our other estimators which don't need optimisation
best_estimators = optimise_hyperparameters(X[feature_columns], y, alg_list, param_list)
[Parallel(n_jobs=1)]: Done 90 out of 90 | elapsed: 5.2s finished
lr_best_params = best_estimators[0].get_params()
lr_best_params
{'C': 0.01,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'max_iter': 100,
'multi_class': 'ovr',
'n_jobs': 1,
'penalty': 'l2',
'random_state': None,
'solver': 'newton-cg',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
# Cross validate our tuned model; note that neg_log_loss is the negated log loss, so values closer to 0 are better
kfold = StratifiedKFold(n_splits=10)
cv_scores = cross_val_score(linear_model.LogisticRegression(**lr_best_params), X[feature_columns], y, scoring='neg_log_loss', cv=kfold)
cv_scores.mean()
-0.528741673153639
In the next iteration of this tutorial we will also optimise an XGB model and hopefully outperform our logistic regression model.
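For the curious, here is a minimal sketch of what that tuning step could look like, reusing our optimise_hyperparameters function. This assumes the xgboost package is installed, and the grid values are illustrative only:
# A sketch only: tune an XGBoost model with the same grid search helper
from xgboost import XGBClassifier
xgb_grid = {
    'n_estimators': [100, 300],
    'max_depth': [2, 3, 4],
    'learning_rate': [0.01, 0.05, 0.1]
}
best_xgb = optimise_hyperparameters(X[feature_columns], y, [XGBClassifier()], [xgb_grid])[0]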
Creating Predictions for the 2018 Season
Now that we have an optimised logistic regression model, let's see how it performs on predicting the 2018 season.
lr = LogisticRegression(**lr_best_params)
lr.fit(X[feature_columns], y)
final_predictions = lr.predict(test_x[feature_columns])
accuracy = (final_predictions == test_y).mean() * 100
print("Our accuracy in predicting the 2018 season is: {:.2f}%".format(accuracy))
Our accuracy in predicting the 2018 season is: 67.68%
Now let's have a look at all the games which we incorrectly predicted.
game_ids = test_x[(final_predictions != test_y)].game
afl_data_one_line.loc[afl_data_one_line.game.isin(game_ids), ['date', 'home_team', 'opponent', 'f_odds', 'f_odds_away', 'f_margin']]
date | home_team | opponent | f_odds | f_odds_away | f_margin | |
---|---|---|---|---|---|---|
1386 | 2018-03-24 | Gold Coast | North Melbourne | 2.0161 | 1.9784 | 16 |
1388 | 2018-03-25 | Melbourne | Geelong | 1.7737 | 2.2755 | -3 |
1391 | 2018-03-30 | North Melbourne | St Kilda | 3.5769 | 1.3867 | 52 |
1392 | 2018-03-31 | Carlton | Gold Coast | 1.5992 | 2.6620 | -34 |
1396 | 2018-04-01 | Western Bulldogs | West Coast | 1.8044 | 2.2445 | -51 |
1397 | 2018-04-01 | Sydney | Port Adelaide | 1.4949 | 3.0060 | -23 |
1398 | 2018-04-02 | Geelong | Hawthorn | 1.7597 | 2.3024 | -1 |
1406 | 2018-04-08 | Western Bulldogs | Essendon | 3.8560 | 1.3538 | 21 |
1408 | 2018-04-13 | Adelaide | Collingwood | 1.2048 | 5.9197 | -48 |
1412 | 2018-04-14 | North Melbourne | Carlton | 1.5799 | 2.7228 | 86 |
1415 | 2018-04-15 | Hawthorn | Melbourne | 2.2855 | 1.7772 | 67 |
1417 | 2018-04-20 | Sydney | Adelaide | 1.2640 | 4.6929 | -10 |
1420 | 2018-04-21 | Port Adelaide | Geelong | 1.5053 | 2.9515 | -34 |
1422 | 2018-04-22 | North Melbourne | Hawthorn | 2.6170 | 1.6132 | 28 |
1423 | 2018-04-22 | Brisbane | Gold Coast | 1.7464 | 2.3277 | -5 |
1425 | 2018-04-25 | Collingwood | Essendon | 1.8372 | 2.1754 | 49 |
1427 | 2018-04-28 | Geelong | Sydney | 1.5019 | 2.9833 | -17 |
1434 | 2018-04-29 | Fremantle | West Coast | 2.4926 | 1.6531 | -8 |
1437 | 2018-05-05 | Essendon | Hawthorn | 2.8430 | 1.5393 | -23 |
1439 | 2018-05-05 | Sydney | North Melbourne | 1.2777 | 4.5690 | -2 |
1444 | 2018-05-11 | Hawthorn | Sydney | 1.6283 | 2.5818 | -8 |
1445 | 2018-05-12 | GWS | West Coast | 1.5425 | 2.8292 | -25 |
1446 | 2018-05-12 | Carlton | Essendon | 3.1742 | 1.4570 | 13 |
1452 | 2018-05-13 | Collingwood | Geelong | 2.4127 | 1.7040 | -21 |
1455 | 2018-05-19 | North Melbourne | GWS | 1.5049 | 2.9752 | 43 |
1456 | 2018-05-19 | Essendon | Geelong | 5.6530 | 1.2104 | 34 |
1460 | 2018-05-20 | Brisbane | Hawthorn | 3.2891 | 1.4318 | 56 |
1461 | 2018-05-20 | West Coast | Richmond | 1.9755 | 2.0154 | 47 |
1466 | 2018-05-26 | GWS | Essendon | 1.4364 | 3.2652 | -35 |
1467 | 2018-05-27 | Hawthorn | West Coast | 2.2123 | 1.8133 | -15 |
... | ... | ... | ... | ... | ... | ... |
1483 | 2018-06-10 | Brisbane | Essendon | 2.3018 | 1.7543 | -22 |
1485 | 2018-06-11 | Melbourne | Collingwood | 1.6034 | 2.6450 | -42 |
1492 | 2018-06-21 | West Coast | Essendon | 1.3694 | 3.6843 | -28 |
1493 | 2018-06-22 | Port Adelaide | Melbourne | 1.7391 | 2.3426 | 10 |
1499 | 2018-06-29 | Western Bulldogs | Geelong | 6.2067 | 1.1889 | 2 |
1501 | 2018-06-30 | Adelaide | West Coast | 1.4989 | 2.9756 | 10 |
1504 | 2018-07-01 | Melbourne | St Kilda | 1.1405 | 7.7934 | -2 |
1505 | 2018-07-01 | Essendon | North Melbourne | 2.0993 | 1.9022 | 17 |
1506 | 2018-07-01 | Fremantle | Brisbane | 1.2914 | 4.3743 | -55 |
1507 | 2018-07-05 | Sydney | Geelong | 1.7807 | 2.2675 | -12 |
1514 | 2018-07-08 | Essendon | Collingwood | 2.5442 | 1.6473 | -16 |
1515 | 2018-07-08 | West Coast | GWS | 1.6790 | 2.4754 | 11 |
1516 | 2018-07-12 | Adelaide | Geelong | 2.0517 | 1.9444 | 15 |
1518 | 2018-07-14 | Hawthorn | Brisbane | 1.2281 | 5.4105 | -33 |
1521 | 2018-07-14 | GWS | Richmond | 2.7257 | 1.5765 | 2 |
1522 | 2018-07-15 | Collingwood | West Coast | 1.5600 | 2.7815 | -35 |
1523 | 2018-07-15 | North Melbourne | Sydney | 1.9263 | 2.0647 | -6 |
1524 | 2018-07-15 | Fremantle | Port Adelaide | 5.9110 | 1.2047 | 9 |
1527 | 2018-07-21 | Sydney | Gold Coast | 1.0342 | 27.8520 | -24 |
1529 | 2018-07-21 | Brisbane | Adelaide | 2.4614 | 1.6730 | -5 |
1533 | 2018-07-22 | Port Adelaide | GWS | 1.6480 | 2.5452 | -22 |
1538 | 2018-07-28 | Gold Coast | Carlton | 1.3933 | 3.5296 | -35 |
1546 | 2018-08-04 | Adelaide | Port Adelaide | 2.0950 | 1.9135 | 3 |
1548 | 2018-08-04 | St Kilda | Western Bulldogs | 1.6120 | 2.6368 | -35 |
1555 | 2018-08-11 | Port Adelaide | West Coast | 1.4187 | 3.3505 | -4 |
1558 | 2018-08-12 | North Melbourne | Western Bulldogs | 1.3175 | 4.1239 | -7 |
1559 | 2018-08-12 | Melbourne | Sydney | 1.3627 | 3.7445 | -9 |
1564 | 2018-08-18 | GWS | Sydney | 1.8478 | 2.1672 | -20 |
1576 | 2018-08-26 | Brisbane | West Coast | 2.3068 | 1.7548 | -26 |
1578 | 2018-08-26 | St Kilda | North Melbourne | 3.5178 | 1.3936 | -23 |
Very interesting! Most of the games we got wrong were upsets. Let's have a look at the games we incorrectly predicted that weren't upsets.
(afl_data_one_line.loc[afl_data_one_line.game.isin(game_ids), ['date', 'home_team', 'opponent', 'f_odds', 'f_odds_away', 'f_margin']]
.assign(home_favourite=lambda df: df.apply(lambda row: 1 if row.f_odds < row.f_odds_away else 0, axis=1))
.assign(upset=lambda df: df.apply(lambda row: 1 if row.home_favourite == 1 and row.f_margin < 0 else
(1 if row.home_favourite == 0 and row.f_margin > 0 else 0), axis=1))
.query('upset == 0'))
date | home_team | opponent | f_odds | f_odds_away | f_margin | home_favourite | upset | |
---|---|---|---|---|---|---|---|---|
1412 | 2018-04-14 | North Melbourne | Carlton | 1.5799 | 2.7228 | 86 | 1 | 0 |
1425 | 2018-04-25 | Collingwood | Essendon | 1.8372 | 2.1754 | 49 | 1 | 0 |
1434 | 2018-04-29 | Fremantle | West Coast | 2.4926 | 1.6531 | -8 | 0 | 0 |
1437 | 2018-05-05 | Essendon | Hawthorn | 2.8430 | 1.5393 | -23 | 0 | 0 |
1452 | 2018-05-13 | Collingwood | Geelong | 2.4127 | 1.7040 | -21 | 0 | 0 |
1455 | 2018-05-19 | North Melbourne | GWS | 1.5049 | 2.9752 | 43 | 1 | 0 |
1461 | 2018-05-20 | West Coast | Richmond | 1.9755 | 2.0154 | 47 | 1 | 0 |
1467 | 2018-05-27 | Hawthorn | West Coast | 2.2123 | 1.8133 | -15 | 0 | 0 |
1479 | 2018-06-08 | Port Adelaide | Richmond | 1.7422 | 2.3420 | 14 | 1 | 0 |
1483 | 2018-06-10 | Brisbane | Essendon | 2.3018 | 1.7543 | -22 | 0 | 0 |
1493 | 2018-06-22 | Port Adelaide | Melbourne | 1.7391 | 2.3426 | 10 | 1 | 0 |
1501 | 2018-06-30 | Adelaide | West Coast | 1.4989 | 2.9756 | 10 | 1 | 0 |
1514 | 2018-07-08 | Essendon | Collingwood | 2.5442 | 1.6473 | -16 | 0 | 0 |
1515 | 2018-07-08 | West Coast | GWS | 1.6790 | 2.4754 | 11 | 1 | 0 |
1529 | 2018-07-21 | Brisbane | Adelaide | 2.4614 | 1.6730 | -5 | 0 | 0 |
1576 | 2018-08-26 | Brisbane | West Coast | 2.3068 | 1.7548 | -26 | 0 | 0 |
1578 | 2018-08-26 | St Kilda | North Melbourne | 3.5178 | 1.3936 | -23 | 0 | 0 |
Let's now look at our model's log loss for the 2018 season compared to the odds.
test_x_unscaled = feature_df.loc[feature_df.season == 2018, ['game'] + feature_columns]
metrics.log_loss(test_y, test_x_unscaled[['f_current_odds_prob_away', 'f_current_odds_prob']])
0.5545776633924343
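For comparison, here is a minimal sketch which scores our own fitted model on the same games (using the lr model and the scaled test_x from above):
# Our model's log loss on the 2018 season, to compare against the odds
model_probs = lr.predict_proba(test_x[feature_columns])
metrics.log_loss(test_y, model_probs)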
So whilst our model performs reasonably well, it doesn't beat the odds in terms of log loss. That's okay; it's still a solid start. In future iterations we can implement other algorithms and create new features which may improve performance.
Next Steps
Now that we have a model up and running, the next steps are to implement the model on a week to week basis.
04. Weekly Predictions
Now that we have explored different algorithms for modelling, we can implement our chosen model and predict this week's AFL games! All you need to do is run the afl_modelling script each Thursday or Friday to predict the following week's games.
# Import Modules
from afl_feature_creation_v2 import prepare_afl_features
import afl_data_cleaning_v2
import afl_feature_creation_v2
import afl_modelling_v2
import datetime
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
Creating The Features For This Weekend's Games
To actually predict this weekend's games, we need to create the same features that we have created in the previous tutorials for the games that will be played this weekend. This includes all the rolling averages, efficiency features, elo features etc. So the majority of this tutorial will be using previously defined functions to create features for the following weekend's games.
Create Next Week's DataFrame
Let's first get our cleaned afl_data dataset, as well as the odds for next weekend and the 2018 fixture.
# Grab the cleaned AFL dataset and the column order
afl_data = afl_data_cleaning_v2.prepare_afl_data()
ordered_cols = afl_data.columns
# Define a function which grabs the odds for each game for the following weekend
def get_next_week_odds(path):
# Get next week's odds
next_week_odds = pd.read_csv(path)
next_week_odds = next_week_odds.rename(columns={"team_1": "home_team",
"team_2": "away_team",
"team_1_odds": "odds",
"team_2_odds": "odds_away"
})
return next_week_odds
# Import the fixture
# Define a function which gets the fixture and cleans it up
def get_fixture(path):
# Get the afl fixture
fixture = pd.read_csv(path)
# Replace team names and reformat
fixture = fixture.replace({'Brisbane Lions': 'Brisbane', 'Footscray': 'Western Bulldogs'})
fixture['Date'] = pd.to_datetime(fixture['Date']).dt.date.astype(str)
fixture = fixture.rename(columns={"Home.Team": "home_team", "Away.Team": "away_team"})
return fixture
next_week_odds = get_next_week_odds("data/weekly_odds.csv")
fixture = get_fixture("data/afl_fixture_2018.csv")
Date | Season | Season.Game | Round | home_team | away_team | Venue | |
---|---|---|---|---|---|---|---|
202 | 2018-09-14 | 2018 | 1 | 26 | Hawthorn | Melbourne | MCG |
203 | 2018-09-15 | 2018 | 1 | 26 | Collingwood | GWS | MCG |
204 | 2018-09-21 | 2018 | 1 | 27 | Richmond | Collingwood | MCG |
205 | 2018-09-22 | 2018 | 1 | 27 | West Coast | Melbourne | Optus Stadium |
206 | 2018-09-29 | 2018 | 1 | 28 | West Coast | Collingwood | MCG |
home_team | away_team | odds | odds_away | |
---|---|---|---|---|
0 | West Coast | Collingwood | 2.34 | 1.75 |
Now that we have these DataFrames, we will define a function which combines the fixture and next week's odds to create a single DataFrame for the games over the next 7 days. To use this function we will need Game IDs for next week, so we will create another function which generates them by taking the Game ID of the last game played and incrementing it for each upcoming game.
# Define a function which creates game IDs for this week's footy games
def create_next_weeks_game_ids(afl_data):
odds = get_next_week_odds("data/weekly_odds.csv")
    # Get the Game ID of the last game played
last_afl_data_game = afl_data['game'].iloc[-1]
# Create Game IDs for next week
game_ids = [(i+1) + last_afl_data_game for i in range(odds.shape[0])]
return game_ids
# Define a function which creates this week's footy game DataFrame
def get_next_week_df(afl_data):
# Get the fixture and the odds for next week's footy games
fixture = get_fixture("data/afl_fixture_2018.csv")
next_week_odds = get_next_week_odds("data/weekly_odds.csv")
next_week_odds['game'] = create_next_weeks_game_ids(afl_data)
# Get today's date and next week's date and create a DataFrame for next week's games
# todays_date = datetime.datetime.today().strftime('%Y-%m-%d')
# date_in_7_days = (datetime.datetime.today() + datetime.timedelta(days=7)).strftime('%Y-%m-%d')
    # For this tutorial we hardcode the dates below; uncomment the two lines above to run it live
    todays_date = '2018-09-27'
    date_in_7_days = '2018-10-04'
fixture = fixture[(fixture['Date'] >= todays_date) & (fixture['Date'] < date_in_7_days)].drop(columns=['Season.Game'])
next_week_df = pd.merge(fixture, next_week_odds, on=['home_team', 'away_team'])
# Split the DataFrame onto two rows for each game
h_df = (next_week_df[['Date', 'game', 'home_team', 'away_team', 'odds', 'Season', 'Round', 'Venue']]
.rename(columns={'home_team': 'team', 'away_team': 'opponent'})
.assign(home_game=1))
a_df = (next_week_df[['Date', 'game', 'home_team', 'away_team', 'odds_away', 'Season', 'Round', 'Venue']]
.rename(columns={'odds_away': 'odds', 'home_team': 'opponent', 'away_team': 'team'})
.assign(home_game=0))
next_week = a_df.append(h_df).sort_values(by='game').rename(columns={
'Date': 'date',
'Season': 'season',
'Round': 'round',
'Venue': 'venue'
})
next_week['date'] = pd.to_datetime(next_week.date)
next_week['round'] = afl_data['round'].iloc[-1] + 1
return next_week
next_week_df = get_next_week_df(afl_data)
game_ids_next_round = create_next_weeks_game_ids(afl_data)
next_week_df
date | round | season | venue | game | home_game | odds | opponent | team | |
---|---|---|---|---|---|---|---|---|---|
0 | 2018-09-29 | 27 | 2018 | MCG | 15407 | 0 | 1.75 | West Coast | Collingwood |
0 | 2018-09-29 | 27 | 2018 | MCG | 15407 | 1 | 2.34 | Collingwood | West Coast |
Create Each Feature
Now let's append next week's DataFrame to our afl_data, match_results and odds DataFrames and then create all the features we used in the AFL Feature Creation Tutorial. We need to append the games and then feed them into our function so that we can create features for upcoming games.
# Append next week's games to our afl_data DataFrame
afl_data = afl_data.append(next_week_df).reset_index(drop=True)
# Append next week's games to match results (we need to do this for our feature creation to run)
match_results = afl_data_cleaning_v2.get_cleaned_match_results().append(next_week_df)
# Append next week's games to odds
odds = (afl_data_cleaning_v2.get_cleaned_odds().pipe(lambda df: df.append(next_week_df[df.columns]))
.reset_index(drop=True))
features_df = afl_feature_creation_v2.prepare_afl_features(afl_data=afl_data, match_results=match_results, odds=odds)
game | home_team | away_team | date | round | venue | season | f_odds | f_form_margin_btwn_teams | f_form_past_5_btwn_teams | f_odds_away | f_elo_home | f_elo_away | f_I50_efficiency_home | f_R50_efficiency_home | f_I50_efficiency_away | f_R50_efficiency_away | f_AF_diff | f_B_diff | f_BO_diff | f_CCL_diff | f_CG_diff | f_CL_diff | f_CM_diff | f_CP_diff | f_D_diff | f_ED_diff | f_FA_diff | f_FF_diff | f_G_diff | f_GA_diff | f_GA1_diff | f_HB_diff | f_HO_diff | f_I50_diff | f_ITC_diff | f_K_diff | f_M_diff | f_MG_diff | f_MI5_diff | f_One.Percenters_diff | f_R50_diff | f_SC_diff | f_SCL_diff | f_SI_diff | f_T_diff | f_T5_diff | f_TO_diff | f_UP_diff | f_Unnamed: 0_diff | f_behinds_diff | f_goals_diff | f_margin_diff | f_opponent_behinds_diff | f_opponent_goals_diff | f_opponent_points_diff | f_points_diff | f_current_odds_prob | f_current_odds_prob_away | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1065 | 15397 | Melbourne | GWS | 2018-08-26 | 23 | M.C.G. | 2018 | 1.966936 | -23.2 | 2.0 | 1.813998 | 1523.456734 | 1609.444874 | 0.653525 | 0.680168 | 0.704767 | 0.749812 | 140.535514 | 0.605144 | -9.771981 | 5.892176 | 7.172376 | 6.614609 | -1.365211 | 30.766262 | 21.998618 | 0.067228 | -1.404730 | -3.166732 | 6.933998 | 6.675576 | 0.000000 | 38.708158 | 24.587333 | 12.008987 | 10.482382 | -16.709540 | -15.415060 | 289.188486 | 6.350287 | -2.263536 | -20.966818 | 50.388632 | 0.723637 | 15.537783 | 22.912269 | 2.065039 | 10.215523 | -6.689429 | 3259.163465 | -0.136383 | 3.553795 | 16.563721 | -2.353514 | 1.162696 | 4.622664 | 21.186385 | 0.661551 | 0.340379 |
1066 | 15398 | St Kilda | North Melbourne | 2018-08-26 | 23 | Docklands | 2018 | 5.089084 | -3.2 | 2.0 | 2.577161 | 1397.237139 | 1499.366007 | 0.725980 | 0.655749 | 0.723949 | 0.677174 | 51.799992 | 3.399035 | 6.067393 | -2.189489 | -10.475859 | 1.154766 | -8.883840 | -21.810962 | 33.058382 | 40.618410 | 2.286314 | -0.345734 | -3.778445 | -2.182673 | 0.000000 | 19.816372 | -21.562916 | 2.678384 | -14.777698 | 13.242010 | 12.065594 | -82.381996 | -2.176564 | 2.335825 | -4.952336 | 45.719406 | 3.344217 | -2.095613 | -3.929084 | -3.182381 | -12.832197 | 57.226776 | -20221.371526 | 1.968709 | -1.897958 | -15.177001 | 1.067099 | 0.781811 | 5.757963 | -9.419038 | 0.284269 | 0.717566 |
1067 | 15404 | Collingwood | GWS | 2018-09-15 | 25 | M.C.G. | 2018 | 1.882301 | 12.6 | 3.0 | 2.018344 | 1546.000498 | 1590.806454 | 0.693185 | 0.706222 | 0.718446 | 0.727961 | 205.916671 | -1.642954 | -2.980828 | -0.266023 | 8.547225 | -3.751909 | -0.664977 | 10.563513 | 48.175985 | 43.531908 | -5.836979 | 5.388668 | 4.395675 | 2.555152 | 0.000000 | 51.588962 | 11.558254 | 4.276481 | 11.284445 | -3.412977 | -2.206815 | -234.577304 | 2.637758 | -10.537765 | -11.127876 | 125.607377 | -3.485896 | 3.532031 | 15.102292 | -2.500685 | 8.187543 | 38.053445 | 12500.525732 | -1.006173 | 2.520135 | 18.634835 | -2.159882 | -0.393386 | -4.520198 | 14.114637 | 0.608495 | 0.393856 |
1068 | 15406 | West Coast | Melbourne | 2018-09-22 | 26 | Perth Stadium | 2018 | 2.013572 | 21.2 | 3.0 | 1.884148 | 1577.888606 | 1542.095154 | 0.688877 | 0.708941 | 0.649180 | 0.698319 | -118.135184 | -3.005709 | 2.453190 | -5.103869 | -14.368949 | -12.245458 | 2.771411 | -45.364271 | -60.210182 | -24.049523 | -2.791277 | 6.115918 | -5.041030 | -5.335746 | 0.000000 | -78.816902 | -18.784547 | -13.957754 | -5.527613 | 18.606721 | 25.366778 | -910.988860 | -5.515812 | -9.483590 | 8.914093 | -131.380758 | -7.142529 | -49.484957 | -13.718798 | -4.862994 | -9.834616 | -23.673638 | -3178.282073 | -1.785349 | -2.569957 | -20.008787 | 0.476202 | 0.387915 | 2.803694 | -17.205093 | 0.543774 | 0.457875 |
1069 | 15407 | West Coast | Collingwood | 2018-09-29 | 27 | MCG | 2018 | 1.981832 | 17.2 | 3.0 | 1.838864 | 1591.348723 | 1562.924273 | 0.679011 | 0.724125 | 0.711352 | 0.709346 | 159.522670 | 0.893421 | -0.475725 | 3.391070 | -5.088751 | 5.875388 | 5.352234 | 7.729063 | -7.358202 | -4.719968 | 6.113565 | 4.822252 | 2.871241 | 2.690270 | 3.636364 | -64.238180 | -0.631102 | 2.078832 | 6.005613 | 56.879978 | 34.373271 | 1016.491933 | 1.199751 | 2.454685 | 12.197047 | 219.666562 | 2.484363 | 0.379162 | 2.566991 | 0.639666 | 2.258377 | -23.841529 | -368920.360240 | -0.646160 | 0.892051 | 3.040850 | 1.589568 | 0.012622 | 1.665299 | 4.706148 | 0.427350 | 0.571429 |
Create Predictions For the Upcoming Round
Now that we have our features, we can use our model that we created in part 3 to predict the next round. First we need to filter our features_df into a training df and a df with next round's features/matches. Then we can use the model created in the last tutorial to create predictions. For simplicity, I have hardcoded the parameters we used in the last tutorial.
# Get the train df by only taking the games IDs which aren't in the next week df
train_df = features_df[~features_df.game.isin(next_week_df.game)]
# Get the result and merge to the feature_df
match_results = (pd.read_csv("data/afl_match_results.csv")
.rename(columns={'Game': 'game'})
.assign(result=lambda df: df.apply(lambda row: 1 if row['Home.Points'] > row['Away.Points'] else 0, axis=1)))
train_df = pd.merge(train_df, match_results[['game', 'result']], on='game')
train_x = train_df.drop(columns=['result'])
train_y = train_df.result
next_round_x = features_df[features_df.game.isin(next_week_df.game)]
# Fit our logistic regression model - note that our predictions come out in the order [away_team_prob, home_team_prob]
lr_best_params = {'C': 0.01,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'max_iter': 100,
'multi_class': 'ovr',
'n_jobs': 1,
'penalty': 'l2',
'random_state': None,
'solver': 'newton-cg',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
feature_cols = [col for col in train_df if col.startswith('f_')]
# Scale features
scaler = StandardScaler()
train_x[feature_cols] = scaler.fit_transform(train_x[feature_cols])
next_round_x[feature_cols] = scaler.transform(next_round_x[feature_cols])
lr = LogisticRegression(**lr_best_params)
lr.fit(train_x[feature_cols], train_y)
prediction_probs = lr.predict_proba(next_round_x[feature_cols])
modelled_home_odds = [1/i[1] for i in prediction_probs]
modelled_away_odds = [1/i[0] for i in prediction_probs]
# Create a predictions df
preds_df = (next_round_x[['date', 'home_team', 'away_team', 'venue', 'game']].copy()
.assign(modelled_home_odds=modelled_home_odds,
modelled_away_odds=modelled_away_odds)
.pipe(pd.merge, next_week_odds, on=['home_team', 'away_team'])
.pipe(pd.merge, features_df[['game', 'f_elo_home', 'f_elo_away']], on='game')
.drop(columns='game')
)
date | home_team | away_team | venue | modelled_home_odds | modelled_away_odds | odds | odds_away | f_elo_home | f_elo_away | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2018-09-29 | West Coast | Collingwood | MCG | 2.326826 | 1.753679 | 2.34 | 1.75 | 1591.348723 | 1562.924273 |
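As an illustrative extra (the home_value and away_value columns below are our own additions, not part of the tutorial's pipeline), we can flag at a glance where our model rates a team shorter than the market does:
# Illustrative only: flag teams our model rates shorter than the market does
preds_df['home_value'] = preds_df['modelled_home_odds'] < preds_df['odds']
preds_df['away_value'] = preds_df['modelled_away_odds'] < preds_df['odds_away']
preds_df[['home_team', 'away_team', 'home_value', 'away_value']]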
Alternatively, if you want to generate predictions using a script which uses all the above code, just run the following:
date home_team away_team venue modelled_home_odds \
0 2018-09-29 West Coast Collingwood MCG 2.326826
modelled_away_odds odds odds_away f_elo_home f_elo_away
0 1.753679 2.34 1.75 1591.348723 1562.924273
Conclusion
Congratulations! You have created AFL predictions for this week. If you are a beginner, don't be overwhelmed; the process gets easier each time you do it, and it is super rewarding. In future iterations we will update this tutorial to predict actual odds, and then integrate it with Betfair's API so that you can create an automated betting strategy which uses machine learning to generate its predictions!
Disclaimer
Note that whilst models and automated strategies are fun and rewarding to create, we can't promise that your model or betting strategy will be profitable, and we make no representations in relation to the code shared or information on this page. If you're using this code or implementing your own strategies, you do so entirely at your own risk and you are responsible for any winnings/losses incurred. Under no circumstances will Betfair be liable for any loss or damage you suffer.