AFL Modelling Walkthrough

01. Data Cleaning

These tutorials will walk you through how to construct your own basic AFL model, using publicly available data. The output will be odds for each team to win, which will be shown on The Hub.

In this notebook we will walk you through the basics of cleaning this dataset and how we have done it. If you want to get straight to feature creation or modelling, feel free to jump ahead!

# Import libraries
import pandas as pd
import numpy as np
import re
pd.set_option('display.max_columns', None)

We will first explore the DataFrames, and then create functions to wrangle them and clean them into more consistent sets of data.

# Read/clean each DataFrame
match_results = pd.read_csv("data/afl_match_results.csv")
odds = pd.read_csv("data/afl_odds.csv")
player_stats = pd.read_csv("data/afl_player_stats.csv")

odds.tail(3)

	trunc	event_name	path	selection_name	odds
4179	2018-09-01	Match Odds	VFL/Richmond Reserves v Williamstown	Williamstown	2.3878
4180	2018-09-01	Match Odds	WAFL/South Fremantle v West Perth	South Fremantle	1.5024
4181	2018-09-01	Match Odds	WAFL/South Fremantle v West Perth	West Perth	2.7382

match_results.tail(3)

	Game	Date	Round	Home.Team	Home.Goals	Home.Behinds	Home.Points	Away.Team	Away.Goals	Away.Behinds	Away.Points	Venue	Margin	Season	Round.Type	Round.Number
15395	15396	2018-08-26	R23	Brisbane Lions	11	6	72	West Coast	14	14	98	Gabba	-26	2018	Regular	23
15396	15397	2018-08-26	R23	Melbourne	15	12	102	GWS	8	9	57	M.C.G.	45	2018	Regular	23
15397	15398	2018-08-26	R23	St Kilda	14	10	94	North Melbourne	17	15	117	Docklands	-23	2018	Regular	23

player_stats.tail(3)

	AF	B	CG	CM	CP	D	DE	Date	ED	FF	G	HB	ITC	K	M	MG	MI5	Match_id	One.Percenters	Opposition	Player	R50	Round	SC	SI	Season	Status	T	T5	TO	TOG	Team	UP	Venue
89317	38	1	0	1	2	9	55.6	25/08/2018	5	0	0	3	2.0	6	3	132.0	2	9711	0	Fremantle	Christopher Mayne	1	Round 23	35	2.0	2018	Away	1	0.0	1.0	57	Collingwood	7	Optus Stadium
89318	38	0	3	0	3	9	55.6	25/08/2018	5	1	0	3	4.0	6	3	172.0	0	9711	2	Fremantle	Nathan Murphy	5	Round 23	29	0.0	2018	Away	1	0.0	3.0	70	Collingwood	6	Optus Stadium
89319	56	1	1	0	3	8	62.5	25/08/2018	5	0	2	2	2.0	6	3	180.0	3	9711	2	Fremantle	Jaidyn Stephenson	0	Round 23	56	4.0	2018	Away	3	1.0	2.0	87	Collingwood	5	Optus Stadium

Have a look at the structure of the DataFrames. Notice that for the odds DataFrame, each game is split between two rows, whilst for the match_results each game is on one row. We will have to get around this by splitting the games up onto two rows, as this will allow our feature transformation functions to be applied more easily later on. For the player_stats DataFrame we will aggregate these stats into each game on separate rows.

First, we will write functions to make the odds data look a bit nicer, with only a team column, a date column and a 'home_game' column which takes the values 0 or 1 depending on if it was a home game for that team. To do this we will use the regex module to extract the team names from the path column, as well as the to_datetime function from pandas. We will also replace all the inconsistent team names with consistent team names.

def get_cleaned_odds(df=None):
    # If a df hasn't been specified as a parameter, read the odds df
    if df is None:
        df = pd.read_csv("data/afl_odds.csv")

    # Get a dictionary of team names we want to change and their new values
    team_name_mapping = {
    'Adelaide Crows': 'Adelaide',
    'Brisbane Lions': 'Brisbane',
    'Carlton Blues': 'Carlton',
    'Collingwood Magpies': 'Collingwood',
    'Essendon Bombers': 'Essendon',
    'Fremantle Dockers': 'Fremantle',
    'GWS Giants': 'GWS',
    'Geelong Cats': 'Geelong',
    'Gold Coast Suns': 'Gold Coast',
    'Greater Western Sydney': 'GWS',
    'Greater Western Sydney Giants': 'GWS',
    'Hawthorn Hawks': 'Hawthorn',
    'Melbourne Demons': 'Melbourne', 
    'North Melbourne Kangaroos': 'North Melbourne',
    'Port Adelaide Magpies': 'Port Adelaide',
    'Port Adelaide Power': 'Port Adelaide', 
    'P Adelaide': 'Port Adelaide',
    'Richmond Tigers': 'Richmond',
    'St Kilda Saints': 'St Kilda', 
    'Sydney Swans': 'Sydney',
    'West Coast Eagles': 'West Coast',
    'Wetsern Bulldogs': 'Western Bulldogs',
    'Western Bullbogs': 'Western Bulldogs'
    }

    # Add columns
    df = (df.assign(date=lambda df: pd.to_datetime(df.trunc), # Create a datetime column
                    home_team=lambda df: df.path.str.extract('(([\w\s]+) v ([\w\s]+))', expand=True)[1].str.strip(),
                    away_team=lambda df: df.path.str.extract('(([\w\s]+) v ([\w\s]+))', expand=True)[2].str.strip())
            .drop(columns=['path', 'trunc', 'event_name']) # Drop irrelevant columns
            .rename(columns={'selection_name': 'team'}) # Rename columns
            .replace(team_name_mapping)
            .sort_values(by='date')
            .reset_index(drop=True)
            .assign(home_game=lambda df: df.apply(lambda row: 1 if row.home_team == row.team else 0, axis='columns'))
            .drop(columns=['home_team', 'away_team']))
    return df

# Apply the wrangling and cleaning function
odds = get_cleaned_odds(odds)
odds.tail()

	team	odds	date	home_game
4177	South Fremantle	1.5024	2018-09-01	1
4178	Port Melbourne	2.8000	2018-09-01	0
4179	Box Hill Hawks	1.4300	2018-09-01	1
4180	Casey Demons	1.9000	2018-09-01	1
4181	West Perth	2.7382	2018-09-01	0

We now have a DataFrame that looks nice and easy to join with our other DataFrames. Now let's lean up the match_details DataFrame.

# Define a function which cleans the match results df, and separates each teams' stats onto individual rows
def get_cleaned_match_results(df=None):
    # If a df hasn't been specified as a parameter, read the match_results df
    if df is None:
        df = pd.read_csv("data/afl_match_results.csv")

    # Create column lists to loop through - these are the columns we want in home and away dfs
    home_columns = ['Game', 'Date', 'Round.Number', 'Home.Team', 'Home.Goals', 'Home.Behinds', 'Home.Points', 'Margin', 'Venue', 'Away.Team', 'Away.Goals', 'Away.Behinds', 'Away.Points']
    away_columns = ['Game', 'Date', 'Round.Number', 'Away.Team', 'Away.Goals', 'Away.Behinds', 'Away.Points', 'Margin', 'Venue', 'Home.Team', 'Home.Goals', 'Home.Behinds', 'Home.Points']

    mapping = ['game', 'date', 'round', 'team', 'goals', 'behinds', 'points', 'margin', 'venue', 'opponent', 'opponent_goals', 'opponent_behinds', 'opponent_points']

    team_name_mapping = {
    'Brisbane Lions': 'Brisbane',
    'Footscray': 'Western Bulldogs'
    }

    # Create a df with only home games
    df_home = (df[home_columns]
                .rename(columns={old_col: new_col for old_col, new_col in zip(home_columns, mapping)})
                .assign(home_game=1))

    # Create a df with only away games
    df_away = (df[away_columns]
                .rename(columns={old_col: new_col for old_col, new_col in zip(away_columns, mapping)})
                .assign(home_game=0,
                        margin=lambda df: df.margin * -1))

    # Append these dfs together
    new_df = (df_home.append(df_away)
                     .sort_values(by='game') # Sort by game ID
                     .reset_index(drop=True) # Reset index
                     .assign(date=lambda df: pd.to_datetime(df.date)) # Create a datetime column
                     .replace(team_name_mapping)) # Rename team names to be consistent with other dfs
    return new_df

match_results = get_cleaned_match_results(match_results)
match_results.head()

	game	date	round	team	goals	behinds	points	margin	venue	opponent	opponent_goals	opponent_behinds	opponent_points	home_game
0	1	1897-05-08	1	Fitzroy	6	13	49	33	Brunswick St	Carlton	2	4	16	1
1	1	1897-05-08	1	Carlton	2	4	16	-33	Brunswick St	Fitzroy	6	13	49	0
2	2	1897-05-08	1	Collingwood	5	11	41	25	Victoria Park	St Kilda	2	4	16	1
3	2	1897-05-08	1	St Kilda	2	4	16	-25	Victoria Park	Collingwood	5	11	41	0
4	3	1897-05-08	1	Geelong	3	6	24	-23	Corio Oval	Essendon	7	5	47	1

Now we have both the odds DataFrame and match_results DataFrame ready for feature creation! Finally, we will aggregate the player_stats DataFrame stats for each game rather than individual player stats. For this DataFrame we have regular stats, such as disposals, marks etc. and Advanced Stats, such as Tackles Inside 50 and Metres Gained. However these advanced stats are only available from 2015, so we will not be using them in this tutorial - as there isn't enough data from 2015 to train our models.

Let's now aggregate the player_stats DataFrame.

def get_cleaned_aggregate_player_stats(df=None):
    # If a df hasn't been specified as a parameter, read the player_stats df
    if df is None:
        df = pd.read_csv("data/afl_player_stats.csv")

    agg_stats = (df.rename(columns={ # Rename columns to lowercase
                    'Season': 'season',
                    'Round': 'round',
                    'Team': 'team',
                    'Opposition': 'opponent',
                    'Date': 'date'
                    })
                   .groupby(by=['date', 'season', 'team', 'opponent'], as_index=False) # Groupby to aggregate the stats for each game
                   .sum()
                   .drop(columns=['DE', 'TOG', 'Match_id']) # Drop columns
                   .assign(date=lambda df: pd.to_datetime(df.date, format="%d/%m/%Y")) # Create a datetime object
                   .sort_values(by='date')
                   .reset_index(drop=True))
    return agg_stats

agg_stats = get_cleaned_aggregate_player_stats(player_stats)

agg_stats.tail()

	date	season	team	opponent	AF	B	BO	CCL	CG	CL	CM	CP	D	ED	FA	FF	G	GA	HB	HO	I50	ITC	K	M	MG	MI5	One.Percenters	R50	SC	SCL	SI	T	T5	TO	UP
3621	2018-08-26	2018	Brisbane	West Coast	1652	5	0	14.0	49	37	8	132	394	302	20	18	11	9	167	48	49	59.0	227	104	5571.0	6	48	39	1645	23.0	86.0	62	13.0	69.0	256
3622	2018-08-26	2018	West Coast	Brisbane	1548	11	5	13.0	49	42	9	141	360	262	18	20	14	8	137	39	56	70.0	223	95	5809.0	12	39	34	1655	29.0	94.0	55	6.0	59.0	217
3623	2018-08-26	2018	St Kilda	North Melbourne	1587	8	11	19.0	48	33	7	125	383	299	18	14	14	13	173	23	48	68.0	210	112	5522.0	14	46	35	1568	14.0	95.0	50	7.0	77.0	269
3624	2018-08-26	2018	GWS	Melbourne	1449	7	17	14.0	42	31	12	111	355	274	19	13	8	7	159	18	50	54.0	196	110	5416.0	10	62	34	1532	17.0	78.0	46	5.0	58.0	254
3625	2018-08-26	2018	Melbourne	GWS	1712	8	12	10.0	38	30	12	139	403	302	13	19	15	14	181	48	54	59.0	222	106	6198.0	16	34	39	1768	20.0	147.0	60	2.0	53.0	269

We now have a three fully prepared DataFrames which are almost ready to be analysed and for a model to be built on! Let's have a look at how they look and then merge them together into our final DataFrame.

odds.tail(3)

	team	odds	date	home_game
4179	Box Hill Hawks	1.4300	2018-09-01	1
4180	Casey Demons	1.9000	2018-09-01	1
4181	West Perth	2.7382	2018-09-01	0

match_results.tail(3)

	game	date	round	team	goals	behinds	points	margin	venue	opponent	opponent_goals	opponent_behinds	opponent_points	home_game
30793	15397	2018-08-26	23	Melbourne	15	12	102	45	M.C.G.	GWS	8	9	57	1
30794	15398	2018-08-26	23	St Kilda	14	10	94	-23	Docklands	North Melbourne	17	15	117	1
30795	15398	2018-08-26	23	North Melbourne	17	15	117	23	Docklands	St Kilda	14	10	94	0

agg_stats.tail(3)

	date	season	team	opponent	AF	B	BO	CCL	CG	CL	CM	CP	D	ED	FA	FF	G	GA	HB	HO	I50	ITC	K	M	MG	MI5	One.Percenters	R50	SC	SCL	SI	T	T5	TO	UP
3623	2018-08-26	2018	St Kilda	North Melbourne	1587	8	11	19.0	48	33	7	125	383	299	18	14	14	13	173	23	48	68.0	210	112	5522.0	14	46	35	1568	14.0	95.0	50	7.0	77.0	269
3624	2018-08-26	2018	GWS	Melbourne	1449	7	17	14.0	42	31	12	111	355	274	19	13	8	7	159	18	50	54.0	196	110	5416.0	10	62	34	1532	17.0	78.0	46	5.0	58.0	254
3625	2018-08-26	2018	Melbourne	GWS	1712	8	12	10.0	38	30	12	139	403	302	13	19	15	14	181	48	54	59.0	222	106	6198.0	16	34	39	1768	20.0	147.0	60	2.0	53.0	269

merged_df = (odds[odds.team.isin(agg_stats.team.unique())]
                .pipe(pd.merge, match_results, on=['date', 'team', 'home_game'])
                .pipe(pd.merge, agg_stats, on=['date', 'team', 'opponent'])
                .sort_values(by=['game']))

merged_df.tail(3)

	team	odds	date	home_game	game	round	goals	behinds	points	margin	venue	opponent	opponent_goals	opponent_behinds	opponent_points	season	AF	B	BO	CCL	CG	CL	CM	CP	D	ED	FA	FF	G	GA	HB	HO	I50	ITC	K	M	MG	MI5	One.Percenters	R50	SC	SCL	SI	T	T5	TO	UP
3199	Melbourne	1.5116	2018-08-26	1	15397	23	15	12	102	45	M.C.G.	GWS	8	9	57	2018	1712	8	12	10.0	38	30	12	139	403	302	13	19	15	14	181	48	54	59.0	222	106	6198.0	16	34	39	1768	20.0	147.0	60	2.0	53.0	269
3195	North Melbourne	1.3936	2018-08-26	0	15398	23	17	15	117	23	Docklands	St Kilda	14	10	94	2018	1707	13	7	15.0	50	24	6	131	425	322	14	18	17	13	201	32	59	77.0	224	106	5833.0	23	33	29	1735	9.0	154.0	48	9.0	66.0	300
3200	St Kilda	3.5178	2018-08-26	1	15398	23	14	10	94	-23	Docklands	North Melbourne	17	15	117	2018	1587	8	11	19.0	48	33	7	125	383	299	18	14	14	13	173	23	48	68.0	210	112	5522.0	14	46	35	1568	14.0	95.0	50	7.0	77.0	269

Great! We now have a clean looking datset with each row representing one team in a game. Let's now eliminate the outliers from a dataset. We know that Essendon had a doping scandal which resulted in their entire team being banned for a year in 2016, so let's remove all of their 2016 games. To do this we will filter based on the team and season, and then invert this with ~.

# Define a function which eliminates outliers
def outlier_eliminator(df):
    # Eliminate Essendon 2016 games
    essendon_filter_criteria = ~(((df['team'] == 'Essendon') & (df['season'] == 2016)) | ((df['opponent'] == 'Essendon') & (df['season'] == 2016)))
    df = df[essendon_filter_criteria].reset_index(drop=True)

    return df

afl_data = outlier_eliminator(merged_df)

afl_data.tail(3)

	team	odds	date	home_game	game	round	goals	behinds	points	margin	venue	opponent	opponent_goals	opponent_behinds	opponent_points	season	AF	B	BO	CCL	CG	CL	CM	CP	D	ED	FA	FF	G	GA	HB	HO	I50	ITC	K	M	MG	MI5	One.Percenters	R50	SC	SCL	SI	T	T5	TO	UP
3154	Melbourne	1.5116	2018-08-26	1	15397	23	15	12	102	45	M.C.G.	GWS	8	9	57	2018	1712	8	12	10.0	38	30	12	139	403	302	13	19	15	14	181	48	54	59.0	222	106	6198.0	16	34	39	1768	20.0	147.0	60	2.0	53.0	269
3155	North Melbourne	1.3936	2018-08-26	0	15398	23	17	15	117	23	Docklands	St Kilda	14	10	94	2018	1707	13	7	15.0	50	24	6	131	425	322	14	18	17	13	201	32	59	77.0	224	106	5833.0	23	33	29	1735	9.0	154.0	48	9.0	66.0	300
3156	St Kilda	3.5178	2018-08-26	1	15398	23	14	10	94	-23	Docklands	North Melbourne	17	15	117	2018	1587	8	11	19.0	48	33	7	125	383	299	18	14	14	13	173	23	48	68.0	210	112	5522.0	14	46	35	1568	14.0	95.0	50	7.0	77.0	269

Finally, let's mark all of the columns that we are going to use in feature creation with the string 'f_' at the start of their column name so that we can easily filter for these columns.

non_feature_cols = ['team', 'date', 'home_game', 'game', 'round', 'venue', 'opponent', 'season']
afl_data = afl_data.rename(columns={col: 'f_' + col for col in afl_data if col not in non_feature_cols})

afl_data.tail(3)

	team	f_odds	date	home_game	game	round	f_goals	f_behinds	f_points	f_margin	venue	opponent	f_opponent_goals	f_opponent_behinds	f_opponent_points	season	f_AF	f_B	f_BO	f_CCL	f_CG	f_CL	f_CM	f_CP	f_D	f_ED	f_FA	f_FF	f_G	f_GA	f_HB	f_HO	f_I50	f_ITC	f_K	f_M	f_MG	f_MI5	f_One.Percenters	f_R50	f_SC	f_SCL	f_SI	f_T	f_T5	f_TO	f_UP
3154	Melbourne	1.5116	2018-08-26	1	15397	23	15	12	102	45	M.C.G.	GWS	8	9	57	2018	1712	8	12	10.0	38	30	12	139	403	302	13	19	15	14	181	48	54	59.0	222	106	6198.0	16	34	39	1768	20.0	147.0	60	2.0	53.0	269
3155	North Melbourne	1.3936	2018-08-26	0	15398	23	17	15	117	23	Docklands	St Kilda	14	10	94	2018	1707	13	7	15.0	50	24	6	131	425	322	14	18	17	13	201	32	59	77.0	224	106	5833.0	23	33	29	1735	9.0	154.0	48	9.0	66.0	300
3156	St Kilda	3.5178	2018-08-26	1	15398	23	14	10	94	-23	Docklands	North Melbourne	17	15	117	2018	1587	8	11	19.0	48	33	7	125	383	299	18	14	14	13	173	23	48	68.0	210	112	5522.0	14	46	35	1568	14.0	95.0	50	7.0	77.0	269

Our data is now fully ready to be explored and for features to be created.

02. Feature Creation

These tutorials will walk you through how to construct your own basic AFL model. The output will be odds for each team to win, which will be shown on The Hub.

In this notebook we will walk you through creating features from our dataset, which was cleaned in the first tutorial. Feature engineering is an integral part of the Data Science process. Creative and smart features can be the difference between an average performing model and a model profitable which beats the market odds.

Grabbing Our Dataset

First, we will import our required modules, as well as the prepare_afl_data function which we created in our afl_data_cleaning script. This essentially cleans all the data for us so that we're ready to explore the data and make some features.

# Import modules
from afl_data_cleaning_v2 import *
import afl_data_cleaning_v2
import pandas as pd
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')
import numpy as np

# Use the prepare_afl_data function to prepare the data for us; this function condenses what we walked through in the previous tutorial
afl_data = prepare_afl_data()

afl_data.tail(3)

	team	f_odds	date	home_game	game	round	f_goals	f_behinds	f_points	f_margin	venue	opponent	f_opponent_goals	f_opponent_behinds	f_opponent_points	season	f_AF	f_B	f_BO	f_CCL	f_CG	f_CL	f_CM	f_CP	f_D	f_ED	f_FA	f_FF	f_G	f_GA	f_HB	f_HO	f_I50	f_ITC	f_K	f_M	f_MG	f_MI5	f_One.Percenters	f_R50	f_SC	f_SCL	f_SI	f_T	f_T5	f_TO	f_UP
3154	Melbourne	1.5116	2018-08-26	1	15397	23	15	12	102	45	M.C.G.	GWS	8	9	57	2018	1712	8	12	10.0	38	30	12	139	403	302	13	19	15	14	181	48	54	59.0	222	106	6198.0	16	34	39	1768	20.0	147.0	60	2.0	53.0	269
3155	North Melbourne	1.3936	2018-08-26	0	15398	23	17	15	117	23	Docklands	St Kilda	14	10	94	2018	1707	13	7	15.0	50	24	6	131	425	322	14	18	17	13	201	32	59	77.0	224	106	5833.0	23	33	29	1735	9.0	154.0	48	9.0	66.0	300
3156	St Kilda	3.5178	2018-08-26	1	15398	23	14	10	94	-23	Docklands	North Melbourne	17	15	117	2018	1587	8	11	19.0	48	33	7	125	383	299	18	14	14	13	173	23	48	68.0	210	112	5522.0	14	46	35	1568	14.0	95.0	50	7.0	77.0	269

Creating A Feature DataFrame

Let's create a feature DataFrame and merge all of our features into this DataFrame as we go.

features = afl_data[['date', 'game', 'team', 'opponent', 'venue', 'home_game']].copy()

What Each Column Refers To

Below is a DataFrame which outlines what each column refers to.

column_abbreviations = pd.read_csv("data/afl_data_columns_mapping.csv")
column_abbreviations

	Feature Abbreviated	Feature
0	GA	Goal Assists
1	CP	Contested Possessions
2	UP	Uncontested Possessions
3	ED	Effective Disposals
4	CM	Contested Marks
5	MI5	Marks Inside 50
6	One.Percenters	One Percenters
7	BO	Bounces
8	K	Kicks
9	HB	Handballs
10	D	Disposals
11	M	Marks
12	G	Goals
13	B	Behinds
14	T	Tackles
15	HO	Hitouts
16	I50	Inside 50s
17	CL	Clearances
18	CG	Clangers
19	R50	Rebound 50s
20	FF	Frees For
21	FA	Frees Against
22	AF	AFL Fantasy Points
23	SC	Supercoach Points
24	CCL	Centre Clearances
25	SCL	Stoppage Clearances
26	SI	Score Involvements
27	MG	Metres Gained
28	TO	Turnovers
29	ITC	Intercepts
30	T5	Tackles Inside 50

Feature Creation

Now let's think about what features we can create. We have a enormous amount of stats to sift through. To start, let's create some simple features based on our domain knowledge of Aussie Rules.

Creating Expontentially Weighted Rolling Averages as Features

Next, we will create rolling averages of statistics such as Tackles, which we will use as features.

It is fair to assume that a team's performance in a certain stat may have predictive power to the overall result. And in general, if a team consistently performs well in this stat, this may have predictive power to the result of their future games. We can't simply train a model on stats from the game which we are trying to predict (i.e. data that we don't have before the game begins), as this will leak the result. We need to train our model on past data. One way of doing this is to train our model on average stats over a certain amount of games. If a team is averaging high in this stat, this may give insight into if they are a strong team. Similarly, if the team is averaging poorly in this stat (relative to the team they are playing), this may have predictive power and give rise to a predicted loss.

To do this we will create a function which calculates the rolling averages, known as create_exp_weighted_avgs, which takes our cleaned DataFrame as an input, as well as the alpha which, when higher, weights recent performances more than old performances. To read more about expontentially weighted moving averages, please read the documentation here.

First, we will grab all the columns which we want to create EMAs for, and then use our function to create the average for that column. We will create a new DataFrame and add these columns to this new DataFrame.

# Define a function which returns a DataFrame with the expontential moving average for each numeric stat
def create_exp_weighted_avgs(df, span):
    # Create a copy of the df with only the game id and the team - we will add cols to this df
    ema_features = df[['game', 'team']].copy()

    feature_names = [col for col in df.columns if col.startswith('f_')] # Get a list of columns we will iterate over

    for feature_name in feature_names:
        feature_ema = (df.groupby('team')[feature_name]
                         .transform(lambda row: (row.ewm(span=span)
                                                    .mean()
                                                    .shift(1))))
        ema_features[feature_name] = feature_ema

    return ema_features

features_rolling_averages = create_exp_weighted_avgs(afl_data, span=10)

features_rolling_averages.tail()

	game	team	f_odds	f_goals	f_behinds	f_points	f_margin	f_opponent_goals	f_opponent_behinds	f_opponent_points	f_AF	f_B	f_BO	f_CCL	f_CG	f_CL	f_CM	f_CP	f_D	f_ED	f_FA	f_FF	f_G	f_GA	f_HB	f_HO	f_I50	f_ITC	f_K	f_M	f_MG	f_MI5	f_One.Percenters	f_R50	f_SC	f_SCL	f_SI	f_T	f_T5	f_TO	f_UP
3152	15396	West Coast	2.094236	12.809630	10.047145	86.904928	8.888770	11.435452	9.403444	78.016158	3193.612782	16.472115	11.958482	23.379562	100.095244	68.252001	27.688669	284.463270	719.884644	525.878017	36.762440	44.867118	25.618202	17.522871	270.478779	88.139376	105.698031	148.005305	449.405865	201.198907	11581.929999	20.048124	95.018480	74.180967	3314.157893	44.872398	177.894442	126.985101	20.565549	138.876613	438.848376
3153	15397	GWS	1.805565	13.100372	13.179329	91.781563	18.527618	10.371198	11.026754	73.253945	3165.127358	19.875913	12.947209	25.114002	105.856671	80.609640	23.374884	303.160047	741.439198	534.520295	42.597317	38.160889	26.208715	18.688880	300.188301	81.540693	106.989070	143.032506	441.250897	173.050118	12091.630837	21.106142	103.077097	80.201059	3419.245919	55.495610	219.879895	138.202470	25.313148	135.966798	438.466439
3154	15397	Melbourne	1.706488	15.157271	13.815113	104.758740	25.170429	11.814319	8.702396	79.588311	3312.408470	22.077317	7.724955	28.364418	114.399147	78.406069	26.934677	324.352577	775.176933	547.385948	39.353251	36.025646	30.308918	22.461080	348.613592	99.787800	120.339062	154.417642	426.563341	178.102118	12395.717925	32.168752	96.390688	63.786515	3427.596843	50.041649	232.287556	144.875098	23.789233	149.042149	456.988552
3155	15398	North Melbourne	2.272313	12.721783	10.733785	87.064486	-1.214246	12.915796	10.783958	88.278732	3066.272143	17.322710	9.815243	26.015421	106.465181	67.504286	26.064079	291.259574	736.279779	534.154748	34.301603	40.908551	25.386136	17.816570	341.210547	81.541130	102.589427	145.265493	395.069232	173.089408	10875.002463	21.802751	82.347511	70.416194	3171.120023	41.488865	197.620152	122.547684	22.286256	142.780474	450.374058
3156	15398	St Kilda	5.516150	10.464266	11.957047	74.742643	-21.138101	14.105551	11.247440	95.880745	3094.163405	20.523847	14.569589	24.134276	102.540441	66.976211	18.018350	270.674857	773.086015	573.769838	41.319843	36.198820	20.850476	14.443658	364.405251	63.498760	103.803779	130.494307	408.680763	184.780054	10765.717942	21.572806	94.731555	65.790561	3228.278599	42.841935	196.086493	115.901425	18.796764	127.364334	508.844514

As you can see our function worked perfectly! Now we have a full DataFrame of exponentially weighted moving averages. Note that as these rolling averages have been shifted by 1 to ensure no data leakage, the first round of the data will have all NA values. We can drop these later.

Let's add these averages to our features DataFrame

features = pd.merge(features, features_rolling_averages, on=['game', 'team'])

Creating a 'Form Between the Teams' Feature

It is well known in Aussie Rules that often some teams perform better against certain teams than others. If we isolate our features to pure stats based on previous games not between the teams playing, or elo ratings, we won't account for any relationships between certain teams. An example is the Kennett Curse, where Geelong won 11 consecutive games against Hawthorn, despite being similarly matched teams. Let's create a feature which calculates how many games a team has won against their opposition over a given window of games.

To do this, we will need to use historical data that dates back well before our current DataFrame starts at. Otherwise we will be using a lot of our games to calculate form, meaning we will have to drop these rows before feeding it into an algorithm. So let's use our prepare_match_results function which we defined in the afl_data_cleaning tutorial to grab a clean DataFrame of all match results since 1897. We can then calculate the form and join this to our current DataFrame.

match_results = afl_data_cleaning_v2.get_cleaned_match_results()

match_results.head(3)

	game	date	round	team	goals	behinds	points	margin	venue	opponent	opponent_goals	opponent_behinds	opponent_points	home_game
0	1	1897-05-08	1	Fitzroy	6	13	49	33	Brunswick St	Carlton	2	4	16	1
1	1	1897-05-08	1	Carlton	2	4	16	-33	Brunswick St	Fitzroy	6	13	49	0
2	2	1897-05-08	1	Collingwood	5	11	41	25	Victoria Park	St Kilda	2	4	16	1

form_btwn_teams = match_results[['game', 'team', 'opponent', 'margin']].copy()

form_btwn_teams['f_form_margin_btwn_teams'] = (match_results.groupby(['team', 'opponent'])['margin']
                                                          .transform(lambda row: row.rolling(5).mean().shift())
                                                          .fillna(0))

form_btwn_teams['f_form_past_5_btwn_teams'] = \
(match_results.assign(win=lambda df: df.apply(lambda row: 1 if row.margin > 0 else 0, axis='columns'))
              .groupby(['team', 'opponent'])['win']
              .transform(lambda row: row.rolling(5).mean().shift() * 5)
              .fillna(0))

form_btwn_teams.tail(3)

	game	team	opponent	margin	f_form_margin_btwn_teams	f_form_past_5_btwn_teams
30793	15397	Melbourne	GWS	45	-23.2	2.0
30794	15398	St Kilda	North Melbourne	-23	-3.2	2.0
30795	15398	North Melbourne	St Kilda	23	3.2	3.0

# Merge to our features df
features = pd.merge(features, form_btwn_teams.drop(columns=['margin']), on=['game', 'team', 'opponent'])

Creating Efficiency Features

Disposal Efficiency

Disposal efficiency is pivotal in Aussie Rules football. If you are dispose of the ball effectively you are much more likely to score and much less likely to concede goals than if you dispose of it ineffectively.

Let's create a disposal efficiency feature by dividing Effective Disposals by Disposals.

Inside 50/Rebound 50 Efficiency

Similarly, one could hypothesise that teams who keep the footy in their Inside 50 regularly will be more likely to score, whilst teams who are effective at getting the ball out of their Defensive 50 will be less likely to concede. Let's use this logic to create Inside 50 Efficiency and Rebound 50 Efficiency features.

The formula used will be:

Inside 50 Efficiency = R50_Opponents / I50 (lower is better).
Rebound 50 Efficiency = R50 / I50_Opponents (higher is better).

Using these formulas, I50 Efficiency = R50 Efficiency_Opponent. So we will just need to create the formulas for I50 efficiency. To create these features we will need the opposition's Inside 50s/Rebound 50s. So we will split out data into two DataFrames, create a new DataFrame by joining these two DataFrames on the Game, calculate our efficiency features, then join our features with our main features DataFrame.

# Get each match on single rows
single_row_df = (afl_data[['game', 'team', 'f_I50', 'f_R50', 'f_D', 'f_ED', 'home_game', ]]
                    .query('home_game == 1')
                    .rename(columns={'team': 'home_team', 'f_I50': 'f_I50_home', 'f_R50': 'f_R50_home', 'f_D': 'f_D_home', 'f_ED': 'f_ED_home'})
                    .drop(columns='home_game')
                    .pipe(pd.merge, afl_data[['game', 'team', 'f_I50', 'f_R50', 'f_D', 'f_ED', 'home_game']]
                                    .query('home_game == 0')
                                    .rename(columns={'team': 'away_team', 'f_I50': 'f_I50_away', 'f_R50': 'f_R50_away', 'f_D': 'f_D_away', 'f_ED': 'f_ED_away'})
                                    .drop(columns='home_game'), on='game'))

single_row_df.head()

	game	home_team	f_I50_home	f_R50_home	f_D_home	f_ED_home	away_team	f_I50_away	f_R50_away	f_D_away	f_ED_away
0	13764	Carlton	69	21	373	268	Richmond	37	50	316	226
1	13765	Geelong	54	40	428	310	St Kilda	52	45	334	246
2	13766	Collingwood	70	38	398	289	Port Adelaide	50	44	331	232
3	13767	Adelaide	59	38	366	264	Hawthorn	54	38	372	264
4	13768	Brisbane	50	39	343	227	Fremantle	57	30	351	250

single_row_df = single_row_df.assign(f_I50_efficiency_home=lambda df: df.f_R50_away / df.f_I50_home,
                                    f_I50_efficiency_away=lambda df: df.f_R50_home / df.f_I50_away)

feature_efficiency_cols = ['f_I50_efficiency_home', 'f_I50_efficiency_away']

# Now let's create an Expontentially Weighted Moving Average for these features - we will need to reshape our DataFrame to do this
efficiency_features_multi_row = (single_row_df[['game', 'home_team'] + feature_efficiency_cols]
                                    .rename(columns={
                                        'home_team': 'team',
                                        'f_I50_efficiency_home': 'f_I50_efficiency',
                                        'f_I50_efficiency_away': 'f_I50_efficiency_opponent',
                                    })
                                    .append((single_row_df[['game', 'away_team'] + feature_efficiency_cols]
                                                 .rename(columns={
                                                     'away_team': 'team',
                                                     'f_I50_efficiency_home': 'f_I50_efficiency_opponent',
                                                     'f_I50_efficiency_away': 'f_I50_efficiency',
                                                 })), sort=True)
                                    .sort_values(by='game')
                                    .reset_index(drop=True))

efficiency_features = efficiency_features_multi_row[['game', 'team']].copy()
feature_efficiency_cols = ['f_I50_efficiency', 'f_I50_efficiency_opponent']

for feature in feature_efficiency_cols:
    efficiency_features[feature] = (efficiency_features_multi_row.groupby('team')[feature]
                                        .transform(lambda row: row.ewm(span=10).mean().shift(1)))

# Get feature efficiency df back onto single rows
efficiency_features = pd.merge(efficiency_features, afl_data[['game', 'team', 'home_game']], on=['game', 'team'])
efficiency_features_single_row = (efficiency_features.query('home_game == 1')
                                    .rename(columns={
                                        'team': 'home_team', 
                                        'f_I50_efficiency': 'f_I50_efficiency_home',
                                        'f_I50_efficiency_opponent': 'f_R50_efficiency_home'})
                                    .drop(columns='home_game')
                                    .pipe(pd.merge, (efficiency_features.query('home_game == 0')
                                                        .rename(columns={
                                                            'team': 'away_team',
                                                            'f_I50_efficiency': 'f_I50_efficiency_away',
                                                            'f_I50_efficiency_opponent': 'f_R50_efficiency_away'})
                                                        .drop(columns='home_game')), on='game'))

efficiency_features_single_row.tail(5)

	game	home_team	f_I50_efficiency_home	f_R50_efficiency_home	away_team	f_I50_efficiency_away	f_R50_efficiency_away
1580	15394	Carlton	0.730668	0.675002	Adelaide	0.691614	0.677128
1581	15395	Sydney	0.699994	0.778280	Hawthorn	0.699158	0.673409
1582	15396	Brisbane	0.683604	0.691730	West Coast	0.696822	0.709605
1583	15397	Melbourne	0.667240	0.692632	GWS	0.684525	0.753783
1584	15398	St Kilda	0.730843	0.635819	North Melbourne	0.697018	0.654991

We will merge these features back to our features df later, when the features data frame is on a single row as well.

Creating an Elo Feature

Another feature which we could create is an Elo feature. If you don't know what Elo is, go ahead and read an article on it here. We have also written a guide on using elo to model the 2021 Euro & Copa America tournaments here.

Essentially, Elo ratings increase if you win. The amount the rating increases is based on how strong the opponent is relative to the team who won. Weak teams get more points for beating stronger teams than they do for beating weaker teams, and vice versa for losses (teams lose points for losses).

Mathematically, Elo ratings can also assign a probability for winning or losing based on the two Elo Ratings of the teams playing.

So let's get into it. We will first define a function which calculates the elo for each team and applies these elos to our DataFrame.

# Define a function which finds the elo for each team in each game and returns a dictionary with the game ID as a key and the
# elos as the key's value, in a list. It also outputs the probabilities and a dictionary of the final elos for each team
def elo_applier(df, k_factor):
    # Initialise a dictionary with default elos for each team
    elo_dict = {team: 1500 for team in df['team'].unique()}
    elos, elo_probs = {}, {}

    # Get a home and away dataframe so that we can get the teams on the same row
    home_df = df.loc[df.home_game == 1, ['team', 'game', 'f_margin', 'home_game']].rename(columns={'team': 'home_team'})
    away_df = df.loc[df.home_game == 0, ['team', 'game']].rename(columns={'team': 'away_team'})

    df = (pd.merge(home_df, away_df, on='game')
            .sort_values(by='game')
            .drop_duplicates(subset='game', keep='first')
            .reset_index(drop=True))

    # Loop over the rows in the DataFrame
    for index, row in df.iterrows():
        # Get the Game ID
        game_id = row['game']

        # Get the margin
        margin = row['f_margin']

        # If the game already has the elos for the home and away team in the elos dictionary, go to the next game
        if game_id in elos.keys():
            continue

        # Get the team and opposition
        home_team = row['home_team']
        away_team = row['away_team']

        # Get the team and opposition elo score
        home_team_elo = elo_dict[home_team]
        away_team_elo = elo_dict[away_team]

        # Calculated the probability of winning for the team and opposition
        prob_win_home = 1 / (1 + 10**((away_team_elo - home_team_elo) / 400))
        prob_win_away = 1 - prob_win_home

        # Add the elos and probabilities our elos dictionary and elo_probs dictionary based on the Game ID
        elos[game_id] = [home_team_elo, away_team_elo]
        elo_probs[game_id] = [prob_win_home, prob_win_away]

        # Calculate the new elos of each team
        if margin > 0: # Home team wins; update both teams' elo
            new_home_team_elo = home_team_elo + k_factor*(1 - prob_win_home)
            new_away_team_elo = away_team_elo + k_factor*(0 - prob_win_away)
        elif margin < 0: # Away team wins; update both teams' elo
            new_home_team_elo = home_team_elo + k_factor*(0 - prob_win_home)
            new_away_team_elo = away_team_elo + k_factor*(1 - prob_win_away)
        elif margin == 0: # Drawn game' update both teams' elo
            new_home_team_elo = home_team_elo + k_factor*(0.5 - prob_win_home)
            new_away_team_elo = away_team_elo + k_factor*(0.5 - prob_win_away)

        # Update elos in elo dictionary
        elo_dict[home_team] = new_home_team_elo
        elo_dict[away_team] = new_away_team_elo

    return elos, elo_probs, elo_dict

# Use the elo applier function to get the elos and elo probabilities for each game - we will map these later
elos, probs, elo_dict = elo_applier(afl_data, 30)

Great! now we have both rolling averages for stats as a feature, and the elo of the teams! Let's have a quick look at the current elo standings with a k-factor of 30, out of curiosity.

for team in sorted(elo_dict, key=elo_dict.get)[::-1]:
    print(team, elo_dict[team])

    Richmond 1695.2241513840117
    Sydney 1645.548990879842
    Hawthorn 1632.5266709780622
    West Coast 1625.871701773721
    Geelong 1625.423154644809
    GWS 1597.4158602131877
    Adelaide 1591.1704934545442
    Collingwood 1560.370309216614
    Melbourne 1558.5666572771509
    Essendon 1529.0198398117086
    Port Adelaide 1524.8882517820093
    North Melbourne 1465.5637511922569
    Western Bulldogs 1452.2110697845148
    Fremantle 1393.142087030804
    St Kilda 1360.9120149937303
    Brisbane 1276.2923772139352
    Gold Coast 1239.174528704772
    Carlton 1226.6780896643265

This looks extremely similar to the currently AFL ladder, so this is a good sign for elo being an effective predictor of winning.

Merging Our Features Into One Features DataFrame

Now we need to reshape our features df so that we have all of the statistics for both teams in a game on a single row. We can then merge our elo and efficiency features to this df.

# Look at our current features df
features.tail(3)

	date	game	team	opponent	venue	home_game	f_odds	f_goals	f_behinds	f_points	f_margin	f_opponent_goals	f_opponent_behinds	f_opponent_points	f_AF	f_B	f_BO	f_CCL	f_CG	f_CL	f_CM	f_CP	f_D	f_ED	f_FA	f_FF	f_G	f_GA	f_HB	f_HO	f_I50	f_ITC	f_K	f_M	f_MG	f_MI5	f_One.Percenters	f_R50	f_SC	f_SCL	f_SI	f_T	f_T5	f_TO	f_UP	f_form_margin_btwn_teams	f_form_past_5_btwn_teams
3156	2018-08-26	15397	Melbourne	GWS	M.C.G.	1	1.706488	15.157271	13.815113	104.758740	25.170429	11.814319	8.702396	79.588311	3312.408470	22.077317	7.724955	28.364418	114.399147	78.406069	26.934677	324.352577	775.176933	547.385948	39.353251	36.025646	30.308918	22.461080	348.613592	99.78780	120.339062	154.417642	426.563341	178.102118	12395.717925	32.168752	96.390688	63.786515	3427.596843	50.041649	232.287556	144.875098	23.789233	149.042149	456.988552	-23.2	2.0
3157	2018-08-26	15398	North Melbourne	St Kilda	Docklands	0	2.272313	12.721783	10.733785	87.064486	-1.214246	12.915796	10.783958	88.278732	3066.272143	17.322710	9.815243	26.015421	106.465181	67.504286	26.064079	291.259574	736.279779	534.154748	34.301603	40.908551	25.386136	17.816570	341.210547	81.54113	102.589427	145.265493	395.069232	173.089408	10875.002463	21.802751	82.347511	70.416194	3171.120023	41.488865	197.620152	122.547684	22.286256	142.780474	450.374058	3.2	3.0
3158	2018-08-26	15398	St Kilda	North Melbourne	Docklands	1	5.516150	10.464266	11.957047	74.742643	-21.138101	14.105551	11.247440	95.880745	3094.163405	20.523847	14.569589	24.134276	102.540441	66.976211	18.018350	270.674857	773.086015	573.769838	41.319843	36.198820	20.850476	14.443658	364.405251	63.49876	103.803779	130.494307	408.680763	184.780054	10765.717942	21.572806	94.731555	65.790561	3228.278599	42.841935	196.086493	115.901425	18.796764	127.364334	508.844514	-3.2	2.0

one_line_cols = ['game', 'team', 'home_game'] + [col for col in features if col.startswith('f_')]

# Get all features onto individual rows for each match
features_one_line = (features.loc[features.home_game == 1, one_line_cols]
                     .rename(columns={'team': 'home_team'})
                     .drop(columns='home_game')
                     .pipe(pd.merge, (features.loc[features.home_game == 0, one_line_cols]
                                              .drop(columns='home_game')
                                              .rename(columns={'team': 'away_team'})
                                              .rename(columns={col: col+'_away' for col in features.columns if col.startswith('f_')})), on='game')
                    .drop(columns=['f_form_margin_btwn_teams_away', 'f_form_past_5_btwn_teams_away']))

# Add our created features - elo, efficiency etc.
features_one_line = (features_one_line.assign(f_elo_home=lambda df: df.game.map(elos).apply(lambda x: x[0]),
                                            f_elo_away=lambda df: df.game.map(elos).apply(lambda x: x[1]))
                                      .pipe(pd.merge, efficiency_features_single_row, on=['game', 'home_team', 'away_team'])
                                      .pipe(pd.merge, afl_data.loc[afl_data.home_game == 1, ['game', 'date', 'round', 'venue']], on=['game'])
                                      .dropna()
                                      .reset_index(drop=True)
                                      .assign(season=lambda df: df.date.apply(lambda row: row.year)))

ordered_cols = [col for col in features_one_line if col[:2] != 'f_'] + [col for col in features_one_line if col.startswith('f_')]

feature_df = features_one_line[ordered_cols]

Finally, let's reduce the dimensionality of the features df by subtracting the home features from the away features. This will reduce the huge amount of columns we have and make our data more manageable. To do this, we will need a list of columns which we are subtracting from each other. We will then loop over each of these columns to create our new differential columns.

We will then add in the implied probability from the odds of the home and away team, as our current odds feature is simply an exponential moving average over the past n games.

# Create differential df - this df is the home features - the away features
diff_cols = [col for col in feature_df.columns if col + '_away' in feature_df.columns and col != 'f_odds' and col.startswith('f_')]
non_diff_cols = [col for col in feature_df.columns if col not in diff_cols and col[:-5] not in diff_cols]

diff_df = feature_df[non_diff_cols].copy()

for col in diff_cols:
    diff_df[col+'_diff'] = feature_df[col] - feature_df[col+'_away']

# Add current odds in to diff_df
odds = get_cleaned_odds()
home_odds = (odds[odds.home_game == 1]
             .assign(f_current_odds_prob=lambda df: 1 / df.odds)
             .rename(columns={'team': 'home_team'})
             .drop(columns=['home_game', 'odds']))

away_odds = (odds[odds.home_game == 0]
             .assign(f_current_odds_prob_away=lambda df: 1 / df.odds)
             .rename(columns={'team': 'away_team'})
             .drop(columns=['home_game', 'odds']))

diff_df = (diff_df.pipe(pd.merge, home_odds, on=['date', 'home_team'])
              .pipe(pd.merge, away_odds, on=['date', 'away_team']))

diff_df.tail()

	game	home_team	away_team	date	round	venue	season	f_odds	f_form_margin_btwn_teams	f_form_past_5_btwn_teams	f_odds_away	f_elo_home	f_elo_away	f_I50_efficiency_home	f_R50_efficiency_home	f_I50_efficiency_away	f_R50_efficiency_away	f_goals_diff	f_behinds_diff	f_points_diff	f_margin_diff	f_opponent_goals_diff	f_opponent_behinds_diff	f_opponent_points_diff	f_AF_diff	f_B_diff	f_BO_diff	f_CCL_diff	f_CG_diff	f_CL_diff	f_CM_diff	f_CP_diff	f_D_diff	f_ED_diff	f_FA_diff	f_FF_diff	f_G_diff	f_GA_diff	f_HB_diff	f_HO_diff	f_I50_diff	f_ITC_diff	f_K_diff	f_M_diff	f_MG_diff	f_MI5_diff	f_One.Percenters_diff	f_R50_diff	f_SC_diff	f_SCL_diff	f_SI_diff	f_T_diff	f_T5_diff	f_TO_diff	f_UP_diff	f_current_odds_prob	f_current_odds_prob_away
1626	15394	Carlton	Adelaide	2018-08-25	23	Docklands	2018	6.467328	-26.2	1.0	2.066016	1230.072138	1587.776445	0.730668	0.675002	0.691614	0.677128	-3.498547	-5.527193	-26.518474	-34.473769	1.289715	0.217006	7.955295	-341.342677	-9.317269	3.088569	-2.600593	15.192839	-12.518345	-4.136673	-41.855717	-72.258378	-51.998775	9.499447	8.670917	-6.973088	-4.740623	-26.964945	-13.147675	-23.928700	-28.940883	-45.293433	-15.183406	-1900.784014	-0.362402	-1.314627	4.116133	-294.813511	-9.917793	-34.724925	-5.462844	-9.367141	-19.623785	-38.188082	0.187709	0.816860
1627	15395	Sydney	Hawthorn	2018-08-25	23	S.C.G.	2018	2.128611	1.0	2.0	1.777290	1662.568452	1615.507209	0.699994	0.778280	0.699158	0.673409	-1.756730	-0.874690	-11.415069	-15.575319	0.014390	4.073909	4.160250	-174.005092	-0.942357	-4.078635	-4.192916	7.814496	-2.225780	6.215760	15.042979	-34.894261	-50.615255	4.214158	0.683548	-3.535594	-3.168608	-12.068691	-30.493980	-9.867332	2.588103	-22.825570	-5.604199	253.086090	-2.697132	-22.612327	25.340623	-90.812188	1.967104	-31.047879	0.007606	-6.880120	11.415593	-49.957313	0.440180	0.561924
1628	15396	Brisbane	West Coast	2018-08-26	23	Gabba	2018	3.442757	-49.2	0.0	2.094236	1279.963814	1622.200265	0.683604	0.691730	0.696822	0.709605	-0.190413	1.182699	0.040221	-13.621456	1.772577	3.026217	13.661677	-22.709485	2.424261	-4.848054	1.800473	5.051157	6.440524	-5.549630	-17.041838	27.543023	33.983159	4.459181	-3.213885	-0.428455	1.514474	42.646138	-7.141638	1.457375	-17.472537	-15.103115	8.001966	-383.083539	6.458915	7.275716	0.942863	44.461590	4.640136	13.180967	-15.704694	2.366444	-5.985843	38.195255	0.433501	0.569866
1629	15397	Melbourne	GWS	2018-08-26	23	M.C.G.	2018	1.706488	-23.2	2.0	1.805565	1540.367850	1615.614668	0.667240	0.692632	0.684525	0.753783	2.056899	0.635785	12.977177	6.642811	1.443121	-2.324358	6.334366	147.281112	2.201404	-5.222254	3.250416	8.542475	-2.203571	3.559792	21.192530	33.737734	12.865653	-3.244066	-2.135243	4.100203	3.772200	48.425291	18.247107	13.349992	11.385136	-14.687556	5.052000	304.087088	11.062610	-6.686409	-16.414544	8.350924	-5.453961	12.407662	6.672628	-1.523915	13.075351	18.522113	0.661551	0.340379
1630	15398	St Kilda	North Melbourne	2018-08-26	23	Docklands	2018	5.516150	-3.2	2.0	2.272313	1372.453734	1454.022032	0.730843	0.635819	0.697018	0.654991	-2.257517	1.223261	-12.321842	-19.923855	1.189755	0.463481	7.602012	27.891262	3.201137	4.754346	-1.881145	-3.924740	-0.528075	-8.045729	-20.584717	36.806235	39.615090	7.018240	-4.709732	-4.535660	-3.372912	23.194704	-18.042370	1.214353	-14.771187	13.611531	11.690647	-109.284521	-0.229945	12.384044	-4.625633	57.158576	1.353070	-1.533659	-6.646259	-3.489492	-15.416140	58.470456	0.284269	0.717566

Wrapping it Up

We now have a fairly decent amount of features. Some other features which could be added include whether the game is in a major Capital city outisde of Mebourne (i.e. Sydney, Adelaide or Peth), how many 'Elite' players are playing (which could be judged by average SuperCoach scores over 110, for example), as well as your own metrics for attacking and defending.

Note that all of our features have columns starting with 'f_' so in the section, we will grab this feature dataframe and use these features to sport predicting the matches.

03. Modelling

These tutorials will walk you through how to construct your own basic AFL model, using publically available data. The output will be odds for each team to win, which will be shown on The Hub.

In this notebook we will walk you through modelling our AFL data to create predictions. We will train a variety of quick and easy models to get a feel of what works and what doesn't. We will then tune our hyperparameters so that we are ready to make week by week predictions.

Grabbing Our Dataset

First, we will import our required modules, as well as the prepare_afl_features function which we created in our afl_feature_creation script. This essentially creates some basic features for us so that we can get started on the modelling component.

# Import libraries
from afl_data_cleaning_v2 import *
import datetime
import pandas as pd
import numpy as np
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
# from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.feature_selection import RFECV
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn import feature_selection
from sklearn import metrics
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
import warnings
warnings.filterwarnings('ignore')
import afl_feature_creation_v2
import afl_data_cleaning_v2

# Grab our feature DataFrame which we created in the previous tutorial
feature_df = afl_feature_creation_v2.prepare_afl_features()
afl_data = afl_data_cleaning_v2.prepare_afl_data()

feature_df.tail(3)

	game	home_team	away_team	date	round	venue	season	f_odds	f_form_margin_btwn_teams	f_form_past_5_btwn_teams	f_odds_away	f_elo_home	f_elo_away	f_I50_efficiency_home	f_R50_efficiency_home	f_I50_efficiency_away	f_R50_efficiency_away	f_goals_diff	f_behinds_diff	f_points_diff	f_margin_diff	f_opponent_goals_diff	f_opponent_behinds_diff	f_opponent_points_diff	f_AF_diff	f_B_diff	f_BO_diff	f_CCL_diff	f_CG_diff	f_CL_diff	f_CM_diff	f_CP_diff	f_D_diff	f_ED_diff	f_FA_diff	f_FF_diff	f_G_diff	f_GA_diff	f_HB_diff	f_HO_diff	f_I50_diff	f_ITC_diff	f_K_diff	f_M_diff	f_MG_diff	f_MI5_diff	f_One.Percenters_diff	f_R50_diff	f_SC_diff	f_SCL_diff	f_SI_diff	f_T_diff	f_T5_diff	f_TO_diff	f_UP_diff	f_current_odds_prob	f_current_odds_prob_away
1628	15396	Brisbane	West Coast	2018-08-26	23	Gabba	2018	3.442757	-49.2	0.0	2.094236	1279.963814	1622.200265	0.683604	0.691730	0.696822	0.709605	-0.190413	1.182699	0.040221	-13.621456	1.772577	3.026217	13.661677	-22.709485	2.424261	-4.848054	1.800473	5.051157	6.440524	-5.549630	-17.041838	27.543023	33.983159	4.459181	-3.213885	-0.428455	1.514474	42.646138	-7.141638	1.457375	-17.472537	-15.103115	8.001966	-383.083539	6.458915	7.275716	0.942863	44.461590	4.640136	13.180967	-15.704694	2.366444	-5.985843	38.195255	0.433501	0.569866
1629	15397	Melbourne	GWS	2018-08-26	23	M.C.G.	2018	1.706488	-23.2	2.0	1.805565	1540.367850	1615.614668	0.667240	0.692632	0.684525	0.753783	2.056899	0.635785	12.977177	6.642811	1.443121	-2.324358	6.334366	147.281112	2.201404	-5.222254	3.250416	8.542475	-2.203571	3.559792	21.192530	33.737734	12.865653	-3.244066	-2.135243	4.100203	3.772200	48.425291	18.247107	13.349992	11.385136	-14.687556	5.052000	304.087088	11.062610	-6.686409	-16.414544	8.350924	-5.453961	12.407662	6.672628	-1.523915	13.075351	18.522113	0.661551	0.340379
1630	15398	St Kilda	North Melbourne	2018-08-26	23	Docklands	2018	5.516150	-3.2	2.0	2.272313	1372.453734	1454.022032	0.730843	0.635819	0.697018	0.654991	-2.257517	1.223261	-12.321842	-19.923855	1.189755	0.463481	7.602012	27.891262	3.201137	4.754346	-1.881145	-3.924740	-0.528075	-8.045729	-20.584717	36.806235	39.615090	7.018240	-4.709732	-4.535660	-3.372912	23.194704	-18.042370	1.214353	-14.771187	13.611531	11.690647	-109.284521	-0.229945	12.384044	-4.625633	57.158576	1.353070	-1.533659	-6.646259	-3.489492	-15.416140	58.470456	0.284269	0.717566

# Get the result and merge to the feature_df

match_results = (pd.read_csv("data/afl_match_results.csv")
                    .rename(columns={'Game': 'game'})
                    .assign(result=lambda df: df.apply(lambda row: 1 if row['Home.Points'] > row['Away.Points'] else 0, axis=1)))

# Merge result column to feature_df
feature_df = pd.merge(feature_df, match_results[['game', 'result']], on='game')

Creating a Training and Testing Set

So that we don't train our data on the data that we will later test our model on, we will create separate train and test sets. For this exercise we will use the 2018 season to test how our model performs, whilst the rest of the data can be used to train the model.

# Create our test and train sets from our afl DataFrame; drop the columns which leak the result, duplicates, and the advanced
# stats which don't have data until 2015

feature_columns = [col for col in feature_df if col.startswith('f_')]

# Create our test set
test_x = feature_df.loc[feature_df.season == 2018, ['game'] + feature_columns]
test_y = feature_df.loc[feature_df.season == 2018, 'result']

# Create our train set
X = feature_df.loc[feature_df.season != 2018, ['game'] + feature_columns]
y = feature_df.loc[feature_df.season != 2018, 'result']

# Scale features
scaler = StandardScaler()
X[feature_columns] = scaler.fit_transform(X[feature_columns])
test_x[feature_columns] = scaler.transform(test_x[feature_columns])

Using Cross Validation to Find The Best Algorithms

Now that we have our training set, we can run through a list of popular classifiers to determine which classifier is best for modelling our data. To do this we will create a function which uses Kfold cross-validation to find the 'best' algorithms, based on how accurate the algorithms' predictions are.

This function will take in a list of classifiers, which we will define below, as well as the training set and it's outcome, and output a DataFrame with the mean and std of the accuracy of each algorithm. Let's jump into it!

# Create a list of standard classifiers
classifiers = [
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    #Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),

    #GLM
    linear_model.LogisticRegressionCV(),

    #Navies Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),

    #SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),

    #Discriminant Analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),

    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
#     XGBClassifier()    
]

# Define a functiom which finds the best algorithms for our modelling task
def find_best_algorithms(classifier_list, X, y):
    # This function is adapted from https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling
    # Cross validate model with Kfold stratified cross validation
    kfold = StratifiedKFold(n_splits=5)

    # Grab the cross validation scores for each algorithm
    cv_results = [cross_val_score(classifier, X, y, scoring = "neg_log_loss", cv = kfold) for classifier in classifier_list]
    cv_means = [cv_result.mean() * -1 for cv_result in cv_results]
    cv_std = [cv_result.std() for cv_result in cv_results]
    algorithm_names = [alg.__class__.__name__ for alg in classifiers]

    # Create a DataFrame of all the CV results
    cv_results = pd.DataFrame({
        "Mean Log Loss": cv_means,
        "Log Loss Std": cv_std,
        "Algorithm": algorithm_names
    })

    return cv_results.sort_values(by='Mean Log Loss').reset_index(drop=True)

best_algos = find_best_algorithms(classifiers, X, y)
best_algos

	Mean Log Loss	Log Loss Std	Algorithm
0	0.539131	3.640578e-02	LogisticRegressionCV
1	0.551241	5.775685e-02	LinearDiscriminantAnalysis
2	0.630994	8.257481e-02	GradientBoostingClassifier
3	0.670041	9.205780e-03	AdaBoostClassifier
4	0.693147	2.360121e-08	GaussianProcessClassifier
5	0.712537	2.770864e-02	SVC
6	0.712896	2.440755e-02	NuSVC
7	0.836191	2.094224e-01	ExtraTreesClassifier
8	0.874307	1.558144e-01	RandomForestClassifier
9	1.288174	3.953037e-01	BaggingClassifier
10	1.884019	4.769589e-01	QuadraticDiscriminantAnalysis
11	2.652161	6.886897e-01	BernoulliNB
12	3.299651	6.427551e-01	GaussianNB

# Try a logistic regression model and see how it performs in terms of accuracy
kfold = StratifiedKFold(n_splits=5)
cv_scores = cross_val_score(linear_model.LogisticRegressionCV(), X, y, scoring='accuracy', cv=kfold)
cv_scores.mean()
    0.7452268937025035

Choosing Our Algorithms

As we can see from above, there are some pretty poor algorithms for predicting the winner. On the other hand, whilst attaining an accuracy of 74.5% (at the time of writing) may seem like a decent result; we must first establish a baseline to judge our performance on. In this case, we will have two baselines; the proportion of games won by the home team and what the odds predict. If we can beat the odds we have created a very powerful model.

Note that a baseline for the log loss can also be both the odds log loss and randomly guessing. Randomly guessing between two teams attains a log loss of log(2) = 0.69, so we have beaten this result.

Once we establish our baseline, we will choose the top algorithms from above and tune their hyperparameters, as well as automatically selecting the best features to be used in our model.

Defining Our Baseline

As stated above, we must define our baseline so that we have a measure to beat. We will use the proportion of games won by the home team, as well as the proportion of favourites who won, based off the odds. To establish this baseline we will use our feature_df, as this has no dropped rows.

# Find the percentage chance of winning at home in each season.
afl_data = afl_data_cleaning_v2.prepare_afl_data()
afl_data['home_win'] = afl_data.apply(lambda x: 1 if x['f_margin'] > 0 else 0, axis=1)
home_games = afl_data[afl_data['home_game'] == 1]
home_games[["home_win", 'season']].groupby(['season']).mean()

season	home_win
2011	0.561856
2012	0.563725
2013	0.561576
2014	0.574257
2015	0.539604
2016	0.606742
2017	0.604061
2018	0.540404

# Find the proportion of favourites who have won

# Define a function which finds if the odds correctly guessed the response
def find_odds_prediction(a_row):
    if a_row['f_odds'] <= a_row['f_odds_away'] and a_row['home_win'] == 1:
        return 1
    elif a_row['f_odds_away'] < a_row['f_odds'] and a_row['home_win'] == 0:
        return 1
    else:
        return 0

# Define a function which splits our DataFrame so each game is on one row instead of two
def get_df_on_one_line(df):
    cols_to_drop = ['date', 'home_game', 'opponent', 
       'f_opponent_behinds', 'f_opponent_goals', 'f_opponent_points', 'f_points',
       'round', 'venue', 'season']

    home_df = df[df['home_game'] == 1].rename(columns={'team': 'home_team'})
    away_df = df[df['home_game'] == 0].rename(columns={'team': 'away_team'})
    away_df = away_df.drop(columns=cols_to_drop)

    # Rename away_df columns
    away_df_renamed = away_df.rename(columns={col: col + '_away' for col in away_df.columns if col != 'game'})
    merged_df = pd.merge(home_df, away_df_renamed, on='game')

    merged_df['home_win'] = merged_df.f_margin.apply(lambda x: 1 if x > 0 else 0)
    return merged_df

afl_data_one_line = get_df_on_one_line(afl_data)
afl_data_one_line['odds_prediction'] = afl_data_one_line.apply(find_odds_prediction, axis=1)
print('The overall mean accuracy of choosing the favourite based on the odds is {}%'.format(
    round(afl_data_one_line['odds_prediction'].mean() * 100, 2)))
afl_data_one_line[["odds_prediction", 'season']].groupby(['season']).mean()

The overall mean accuracy of choosing the favourite based on the odds is 73.15%

season	odds_prediction
2011	0.784615
2012	0.774510
2013	0.748768
2014	0.727723
2015	0.727723
2016	0.713483
2017	0.659898
2018	0.712121

## Get a baseline log loss score from the odds
afl_data_one_line['odds_home_prob'] = 1 / afl_data_one_line.f_odds
afl_data_one_line['odds_away_prob'] = 1 / afl_data_one_line.f_odds_away

metrics.log_loss(afl_data_one_line.home_win, afl_data_one_line[['odds_away_prob', 'odds_home_prob']])
    0.5375306549682837

We can see that the odds are MUCH more accurate than just choosing the home team to win. We can also see that the mean accuracy of choosing the favourite is around 73%. That means that this is the score we need to beat. Similarly, the log loss of the odds is around 0.5385, whilst our model scores around 0.539 (at the time of writing), without hyperparamter optimisation. Let's choose only the algorithms with log losses below 0.67

chosen_algorithms = best_algos.loc[best_algos['Mean Log Loss'] < 0.67, 'Algorithm'].tolist()
chosen_algorithms
    ['LogisticRegressionCV',
     'LinearDiscriminantAnalysis',
     'GradientBoostingClassifier']

Using Grid Search To Tune Hyperparameters

Now that we have our best models, we can use Grid Search to optimise our hyperparameters. Grid search basically involves searching through a range of different algorithm hyperparameters, and choosing those which result in the best score from some metrics, which in our case is accuracy. Let's do this for the algorithms which have hyperparameters which can be tuned. Note that if you are running this on your own computer it may take up to 10 minutes.

# Define a function which optimises the hyperparameters of our chosen algorithms
def optimise_hyperparameters(train_x, train_y, algorithms, parameters):
    kfold = StratifiedKFold(n_splits=5)
    best_estimators = []

    for alg, params in zip(algorithms, parameters):
        gs = GridSearchCV(alg, param_grid=params, cv=kfold, scoring='neg_log_loss', verbose=1)
        gs.fit(train_x, train_y)
        best_estimators.append(gs.best_estimator_)
    return best_estimators

# Define our parameters to run a grid search over
lr_grid = {
    "C": [0.0001, 0.001, 0.01, 0.05, 0.2, 0.5],
    "solver": ["newton-cg", "lbfgs", "liblinear"]
}

# Add our algorithms and parameters to lists to be used in our function
alg_list = [LogisticRegression()]
param_list = [lr_grid]

# Find the best estimators, then add our other estimators which don't need optimisation
best_estimators = optimise_hyperparameters(X, y, alg_list, param_list)

Fitting 5 folds for each of 18 candidates, totalling 90 fits

[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:    5.2s finished

lr_best_params = best_estimators[0].get_params()
lr_best_params
    {'C': 0.01,
     'class_weight': None,
     'dual': False,
     'fit_intercept': True,
     'intercept_scaling': 1,
     'max_iter': 100,
     'multi_class': 'ovr',
     'n_jobs': 1,
     'penalty': 'l2',
     'random_state': None,
     'solver': 'newton-cg',
     'tol': 0.0001,
     'verbose': 0,
     'warm_start': False}

kfold = StratifiedKFold(n_splits=10)
cv_scores = cross_val_score(linear_model.LogisticRegression(**lr_best_params), X, y, scoring='neg_log_loss', cv=kfold)
cv_scores.mean()
    -0.528741673153639

In the next iteration of this tutorial we will also optimise an XGB model and hopefully outperform our logistic regression model.

Creating Predictions for the 2018 Season

Now that we have an optimised logistic regression model, let's see how it performs on predicting the 2018 season.

lr = LogisticRegression(**lr_best_params)
lr.fit(X, y)
final_predictions = lr.predict(test_x)

accuracy = (final_predictions == test_y).mean() * 100

print("Our accuracy in predicting the 2018 season is: {:.2f}%".format(accuracy))

Our accuracy in predicting the 2018 season is: 67.68%

Now let's have a look at all the games which we incorrectly predicted.

game_ids = test_x[(final_predictions != test_y)].game
afl_data_one_line.loc[afl_data_one_line.game.isin(game_ids), ['date', 'home_team', 'opponent', 'f_odds', 'f_odds_away', 'f_margin']]

	date	home_team	opponent	f_odds	f_odds_away	f_margin
1386	2018-03-24	Gold Coast	North Melbourne	2.0161	1.9784	16
1388	2018-03-25	Melbourne	Geelong	1.7737	2.2755	-3
1391	2018-03-30	North Melbourne	St Kilda	3.5769	1.3867	52
1392	2018-03-31	Carlton	Gold Coast	1.5992	2.6620	-34
1396	2018-04-01	Western Bulldogs	West Coast	1.8044	2.2445	-51
1397	2018-04-01	Sydney	Port Adelaide	1.4949	3.0060	-23
1398	2018-04-02	Geelong	Hawthorn	1.7597	2.3024	-1
1406	2018-04-08	Western Bulldogs	Essendon	3.8560	1.3538	21
1408	2018-04-13	Adelaide	Collingwood	1.2048	5.9197	-48
1412	2018-04-14	North Melbourne	Carlton	1.5799	2.7228	86
1415	2018-04-15	Hawthorn	Melbourne	2.2855	1.7772	67
1417	2018-04-20	Sydney	Adelaide	1.2640	4.6929	-10
1420	2018-04-21	Port Adelaide	Geelong	1.5053	2.9515	-34
1422	2018-04-22	North Melbourne	Hawthorn	2.6170	1.6132	28
1423	2018-04-22	Brisbane	Gold Coast	1.7464	2.3277	-5
1425	2018-04-25	Collingwood	Essendon	1.8372	2.1754	49
1427	2018-04-28	Geelong	Sydney	1.5019	2.9833	-17
1434	2018-04-29	Fremantle	West Coast	2.4926	1.6531	-8
1437	2018-05-05	Essendon	Hawthorn	2.8430	1.5393	-23
1439	2018-05-05	Sydney	North Melbourne	1.2777	4.5690	-2
1444	2018-05-11	Hawthorn	Sydney	1.6283	2.5818	-8
1445	2018-05-12	GWS	West Coast	1.5425	2.8292	-25
1446	2018-05-12	Carlton	Essendon	3.1742	1.4570	13
1452	2018-05-13	Collingwood	Geelong	2.4127	1.7040	-21
1455	2018-05-19	North Melbourne	GWS	1.5049	2.9752	43
1456	2018-05-19	Essendon	Geelong	5.6530	1.2104	34
1460	2018-05-20	Brisbane	Hawthorn	3.2891	1.4318	56
1461	2018-05-20	West Coast	Richmond	1.9755	2.0154	47
1466	2018-05-26	GWS	Essendon	1.4364	3.2652	-35
1467	2018-05-27	Hawthorn	West Coast	2.2123	1.8133	-15
...	...	...	...	...	...	...
1483	2018-06-10	Brisbane	Essendon	2.3018	1.7543	-22
1485	2018-06-11	Melbourne	Collingwood	1.6034	2.6450	-42
1492	2018-06-21	West Coast	Essendon	1.3694	3.6843	-28
1493	2018-06-22	Port Adelaide	Melbourne	1.7391	2.3426	10
1499	2018-06-29	Western Bulldogs	Geelong	6.2067	1.1889	2
1501	2018-06-30	Adelaide	West Coast	1.4989	2.9756	10
1504	2018-07-01	Melbourne	St Kilda	1.1405	7.7934	-2
1505	2018-07-01	Essendon	North Melbourne	2.0993	1.9022	17
1506	2018-07-01	Fremantle	Brisbane	1.2914	4.3743	-55
1507	2018-07-05	Sydney	Geelong	1.7807	2.2675	-12
1514	2018-07-08	Essendon	Collingwood	2.5442	1.6473	-16
1515	2018-07-08	West Coast	GWS	1.6790	2.4754	11
1516	2018-07-12	Adelaide	Geelong	2.0517	1.9444	15
1518	2018-07-14	Hawthorn	Brisbane	1.2281	5.4105	-33
1521	2018-07-14	GWS	Richmond	2.7257	1.5765	2
1522	2018-07-15	Collingwood	West Coast	1.5600	2.7815	-35
1523	2018-07-15	North Melbourne	Sydney	1.9263	2.0647	-6
1524	2018-07-15	Fremantle	Port Adelaide	5.9110	1.2047	9
1527	2018-07-21	Sydney	Gold Coast	1.0342	27.8520	-24
1529	2018-07-21	Brisbane	Adelaide	2.4614	1.6730	-5
1533	2018-07-22	Port Adelaide	GWS	1.6480	2.5452	-22
1538	2018-07-28	Gold Coast	Carlton	1.3933	3.5296	-35
1546	2018-08-04	Adelaide	Port Adelaide	2.0950	1.9135	3
1548	2018-08-04	St Kilda	Western Bulldogs	1.6120	2.6368	-35
1555	2018-08-11	Port Adelaide	West Coast	1.4187	3.3505	-4
1558	2018-08-12	North Melbourne	Western Bulldogs	1.3175	4.1239	-7
1559	2018-08-12	Melbourne	Sydney	1.3627	3.7445	-9
1564	2018-08-18	GWS	Sydney	1.8478	2.1672	-20
1576	2018-08-26	Brisbane	West Coast	2.3068	1.7548	-26
1578	2018-08-26	St Kilda	North Melbourne	3.5178	1.3936	-23

Very interesting! Most of the games we got wrong were upsets. Let's have a look at the games we incorrectly predicted that weren't upsets.

(afl_data_one_line.loc[afl_data_one_line.game.isin(game_ids), ['date', 'home_team', 'opponent', 'f_odds', 'f_odds_away', 'f_margin']]
    .assign(home_favourite=lambda df: df.apply(lambda row: 1 if row.f_odds < row.f_odds_away else 0, axis=1))
    .assign(upset=lambda df: df.apply(lambda row: 1 if row.home_favourite == 1 and row.f_margin < 0 else 
                                      (1 if row.home_favourite == 0 and row.f_margin > 0 else 0), axis=1))
    .query('upset == 0'))

	date	home_team	opponent	f_odds	f_odds_away	f_margin	home_favourite
1412	2018-04-14	North Melbourne	Carlton	1.5799	2.7228	86	1
1425	2018-04-25	Collingwood	Essendon	1.8372	2.1754	49	1
1434	2018-04-29	Fremantle	West Coast	2.4926	1.6531	-8	0
1437	2018-05-05	Essendon	Hawthorn	2.8430	1.5393	-23	0
1452	2018-05-13	Collingwood	Geelong	2.4127	1.7040	-21	0
1455	2018-05-19	North Melbourne	GWS	1.5049	2.9752	43	1
1461	2018-05-20	West Coast	Richmond	1.9755	2.0154	47	1
1467	2018-05-27	Hawthorn	West Coast	2.2123	1.8133	-15	0
1479	2018-06-08	Port Adelaide	Richmond	1.7422	2.3420	14	1
1483	2018-06-10	Brisbane	Essendon	2.3018	1.7543	-22	0
1493	2018-06-22	Port Adelaide	Melbourne	1.7391	2.3426	10	1
1501	2018-06-30	Adelaide	West Coast	1.4989	2.9756	10	1
1514	2018-07-08	Essendon	Collingwood	2.5442	1.6473	-16	0
1515	2018-07-08	West Coast	GWS	1.6790	2.4754	11	1
1529	2018-07-21	Brisbane	Adelaide	2.4614	1.6730	-5	0
1576	2018-08-26	Brisbane	West Coast	2.3068	1.7548	-26	0
1578	2018-08-26	St Kilda	North Melbourne	3.5178	1.3936	-23	0

Let's now look at our model's log loss for the 2018 season compared to the odds.

predictions_probs = lr.predict_proba(test_x)

metrics.log_loss(test_y, predictions_probs)
    0.584824211055384

test_x_unscaled = feature_df.loc[feature_df.season == 2018, ['game'] + feature_columns]

metrics.log_loss(test_y, test_x_unscaled[['f_current_odds_prob_away', 'f_current_odds_prob']])
    0.5545776633924343

So whilst our model performs decently, it doesn't beat the odds in terms of log loss. That's okay, it's still a decent start. In future iterations we can implement other algorithms and create new features which may improve performance.

Next Steps

Now that we have a model up and running, the next steps are to implement the model on a week to week basis.

04. Weekly Predictions

Now that we have explored different algorithms for modelling, we can implement our chosen model and predict this week's AFL games! All you need to do is run the afl_modelling script each Thursday or Friday to predict the following week's games.

# Import Modules
from afl_feature_creation_v2 import prepare_afl_features
import afl_data_cleaning_v2
import afl_feature_creation_v2
import afl_modelling_v2
import datetime
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

Creating The Features For This Weekend's Games

To actually predict this weekend's games, we need to create the same features that we have created in the previous tutorials for the games that will be played this weekend. This includes all the rolling averages, efficiency features, elo features etc. So the majority of this tutorial will be using previously defined functions to create features for the following weekend's games.

Create Next Week's DataFrame

Let's first get our cleaned afl_data dataset, as well as the odds for next weekend and the 2018 fixture.

# Grab the cleaned AFL dataset and the column order
afl_data = afl_data_cleaning_v2.prepare_afl_data()
ordered_cols = afl_data.columns

# Define a function which grabs the odds for each game for the following weekend
def get_next_week_odds(path):
    # Get next week's odds
    next_week_odds = pd.read_csv(path)
    next_week_odds = next_week_odds.rename(columns={"team_1": "home_team", 
                                                "team_2": "away_team", 
                                                "team_1_odds": "odds", 
                                                "team_2_odds": "odds_away"
                                               })
    return next_week_odds

# Import the fixture
# Define a function which gets the fixture and cleans it up
def get_fixture(path):
    # Get the afl fixture
    fixture = pd.read_csv(path)

    # Replace team names and reformat
    fixture = fixture.replace({'Brisbane Lions': 'Brisbane', 'Footscray': 'Western Bulldogs'})
    fixture['Date'] = pd.to_datetime(fixture['Date']).dt.date.astype(str)
    fixture = fixture.rename(columns={"Home.Team": "home_team", "Away.Team": "away_team"})
    return fixture

next_week_odds = get_next_week_odds("data/weekly_odds.csv")
fixture = get_fixture("data/afl_fixture_2018.csv")

fixture.tail()

	Date	Season	Season.Game	Round	home_team	away_team	Venue
202	2018-09-14	2018	1	26	Hawthorn	Melbourne	MCG
203	2018-09-15	2018	1	26	Collingwood	GWS	MCG
204	2018-09-21	2018	1	27	Richmond	Collingwood	MCG
205	2018-09-22	2018	1	27	West Coast	Melbourne	Optus Stadium
206	2018-09-29	2018	1	28	West Coast	Collingwood	MCG

next_week_odds

	home_team	away_team	odds	odds_away
0	West Coast	Collingwood	2.34	1.75

Now that we have these DataFrames, we will define a function which combines the fixture and next week's odds to create a single DataFrame for the games over the next 7 days. To use this function we will need Game IDs for next week. So we will create another function which creates Game IDs by using the Game ID from the last game played and adding 1 to it.

# Define a function which creates game IDs for this week's footy games
def create_next_weeks_game_ids(afl_data):
    odds = get_next_week_odds("data/weekly_odds.csv")

    # Get last week's Game ID
    last_afl_data_game = afl_data['game'].iloc[-1]

    # Create Game IDs for next week
    game_ids = [(i+1) + last_afl_data_game for i in range(odds.shape[0])]
    return game_ids

# Define a function which creates this week's footy game DataFrame
def get_next_week_df(afl_data):
    # Get the fixture and the odds for next week's footy games
    fixture = get_fixture("data/afl_fixture_2018.csv")
    next_week_odds = get_next_week_odds("data/weekly_odds.csv")
    next_week_odds['game'] = create_next_weeks_game_ids(afl_data)

    # Get today's date and next week's date and create a DataFrame for next week's games
#     todays_date = datetime.datetime.today().strftime('%Y-%m-%d')

#     date_in_7_days = (datetime.datetime.today() + datetime.timedelta(days=7)).strftime('%Y-%m-%d')
    todays_date = '2018-09-27'
    date_in_7_days = '2018-10-04'
    fixture = fixture[(fixture['Date'] >= todays_date) & (fixture['Date'] < date_in_7_days)].drop(columns=['Season.Game'])
    next_week_df = pd.merge(fixture, next_week_odds, on=['home_team', 'away_team'])

    # Split the DataFrame onto two rows for each game
    h_df = (next_week_df[['Date', 'game', 'home_team', 'away_team', 'odds', 'Season', 'Round', 'Venue']]
               .rename(columns={'home_team': 'team', 'away_team': 'opponent'})
               .assign(home_game=1))

    a_df = (next_week_df[['Date', 'game', 'home_team', 'away_team', 'odds_away', 'Season', 'Round', 'Venue']]
                .rename(columns={'odds_away': 'odds', 'home_team': 'opponent', 'away_team': 'team'})
                .assign(home_game=0))

    next_week = a_df.append(h_df).sort_values(by='game').rename(columns={
        'Date': 'date',
        'Season': 'season',
        'Round': 'round',
        'Venue': 'venue'
    })
    next_week['date'] = pd.to_datetime(next_week.date)
    next_week['round'] = afl_data['round'].iloc[-1] + 1
    return next_week

next_week_df = get_next_week_df(afl_data)
game_ids_next_round = create_next_weeks_game_ids(afl_data)
next_week_df

	date	round	season	venue	game	home_game	odds	opponent	team
0	2018-09-29	27	2018	MCG	15407	0	1.75	West Coast	Collingwood
0	2018-09-29	27	2018	MCG	15407	1	2.34	Collingwood	West Coast

fixture.tail()

	Date	Season	Season.Game	Round	home_team	away_team	Venue
202	2018-09-14	2018	1	26	Hawthorn	Melbourne	MCG
203	2018-09-15	2018	1	26	Collingwood	GWS	MCG
204	2018-09-21	2018	1	27	Richmond	Collingwood	MCG
205	2018-09-22	2018	1	27	West Coast	Melbourne	Optus Stadium
206	2018-09-29	2018	1	28	West Coast	Collingwood	MCG

Create Each Feature

Now let's append next week's DataFrame to our afl_data, match_results and odds DataFrames and then create all the features we used in the AFL Feature Creation Tutorial. We need to append the games and then feed them into our function so that we can create features for upcoming games.

# Append next week's games to our afl_data DataFrame
afl_data = afl_data.append(next_week_df).reset_index(drop=True)

# Append next week's games to match results (we need to do this for our feature creation to run)
match_results = afl_data_cleaning_v2.get_cleaned_match_results().append(next_week_df)

# Append next week's games to odds
odds = (afl_data_cleaning_v2.get_cleaned_odds().pipe(lambda df: df.append(next_week_df[df.columns]))
       .reset_index(drop=True))

features_df = afl_feature_creation_v2.prepare_afl_features(afl_data=afl_data, match_results=match_results, odds=odds)

features_df.tail()

	game	home_team	away_team	date	round	venue	season	f_odds	f_form_margin_btwn_teams	f_form_past_5_btwn_teams	f_odds_away	f_elo_home	f_elo_away	f_I50_efficiency_home	f_R50_efficiency_home	f_I50_efficiency_away	f_R50_efficiency_away	f_AF_diff	f_B_diff	f_BO_diff	f_CCL_diff	f_CG_diff	f_CL_diff	f_CM_diff	f_CP_diff	f_D_diff	f_ED_diff	f_FA_diff	f_FF_diff	f_G_diff	f_GA_diff	f_GA1_diff	f_HB_diff	f_HO_diff	f_I50_diff	f_ITC_diff	f_K_diff	f_M_diff	f_MG_diff	f_MI5_diff	f_One.Percenters_diff	f_R50_diff	f_SC_diff	f_SCL_diff	f_SI_diff	f_T_diff	f_T5_diff	f_TO_diff	f_UP_diff	f_Unnamed: 0_diff	f_behinds_diff	f_goals_diff	f_margin_diff	f_opponent_behinds_diff	f_opponent_goals_diff	f_opponent_points_diff	f_points_diff	f_current_odds_prob	f_current_odds_prob_away
1065	15397	Melbourne	GWS	2018-08-26	23	M.C.G.	2018	1.966936	-23.2	2.0	1.813998	1523.456734	1609.444874	0.653525	0.680168	0.704767	0.749812	140.535514	0.605144	-9.771981	5.892176	7.172376	6.614609	-1.365211	30.766262	21.998618	0.067228	-1.404730	-3.166732	6.933998	6.675576	0.000000	38.708158	24.587333	12.008987	10.482382	-16.709540	-15.415060	289.188486	6.350287	-2.263536	-20.966818	50.388632	0.723637	15.537783	22.912269	2.065039	10.215523	-6.689429	3259.163465	-0.136383	3.553795	16.563721	-2.353514	1.162696	4.622664	21.186385	0.661551	0.340379
1066	15398	St Kilda	North Melbourne	2018-08-26	23	Docklands	2018	5.089084	-3.2	2.0	2.577161	1397.237139	1499.366007	0.725980	0.655749	0.723949	0.677174	51.799992	3.399035	6.067393	-2.189489	-10.475859	1.154766	-8.883840	-21.810962	33.058382	40.618410	2.286314	-0.345734	-3.778445	-2.182673	0.000000	19.816372	-21.562916	2.678384	-14.777698	13.242010	12.065594	-82.381996	-2.176564	2.335825	-4.952336	45.719406	3.344217	-2.095613	-3.929084	-3.182381	-12.832197	57.226776	-20221.371526	1.968709	-1.897958	-15.177001	1.067099	0.781811	5.757963	-9.419038	0.284269	0.717566
1067	15404	Collingwood	GWS	2018-09-15	25	M.C.G.	2018	1.882301	12.6	3.0	2.018344	1546.000498	1590.806454	0.693185	0.706222	0.718446	0.727961	205.916671	-1.642954	-2.980828	-0.266023	8.547225	-3.751909	-0.664977	10.563513	48.175985	43.531908	-5.836979	5.388668	4.395675	2.555152	0.000000	51.588962	11.558254	4.276481	11.284445	-3.412977	-2.206815	-234.577304	2.637758	-10.537765	-11.127876	125.607377	-3.485896	3.532031	15.102292	-2.500685	8.187543	38.053445	12500.525732	-1.006173	2.520135	18.634835	-2.159882	-0.393386	-4.520198	14.114637	0.608495	0.393856
1068	15406	West Coast	Melbourne	2018-09-22	26	Perth Stadium	2018	2.013572	21.2	3.0	1.884148	1577.888606	1542.095154	0.688877	0.708941	0.649180	0.698319	-118.135184	-3.005709	2.453190	-5.103869	-14.368949	-12.245458	2.771411	-45.364271	-60.210182	-24.049523	-2.791277	6.115918	-5.041030	-5.335746	0.000000	-78.816902	-18.784547	-13.957754	-5.527613	18.606721	25.366778	-910.988860	-5.515812	-9.483590	8.914093	-131.380758	-7.142529	-49.484957	-13.718798	-4.862994	-9.834616	-23.673638	-3178.282073	-1.785349	-2.569957	-20.008787	0.476202	0.387915	2.803694	-17.205093	0.543774	0.457875
1069	15407	West Coast	Collingwood	2018-09-29	27	MCG	2018	1.981832	17.2	3.0	1.838864	1591.348723	1562.924273	0.679011	0.724125	0.711352	0.709346	159.522670	0.893421	-0.475725	3.391070	-5.088751	5.875388	5.352234	7.729063	-7.358202	-4.719968	6.113565	4.822252	2.871241	2.690270	3.636364	-64.238180	-0.631102	2.078832	6.005613	56.879978	34.373271	1016.491933	1.199751	2.454685	12.197047	219.666562	2.484363	0.379162	2.566991	0.639666	2.258377	-23.841529	-368920.360240	-0.646160	0.892051	3.040850	1.589568	0.012622	1.665299	4.706148	0.427350	0.571429

Create Predictions For the Upcoming Round

Now that we have our features, we can use our model that we created in part 3 to predict the next round. First we need to filter our features_df into a training df and a df with next round's features/matches. Then we can use the model created in the last tutorial to create predictions. For simplicity, I have hardcoded the parameters we used in the last tutorial.

# Get the train df by only taking the games IDs which aren't in the next week df
train_df = features_df[~features_df.game.isin(next_week_df.game)]

# Get the result and merge to the feature_df
match_results = (pd.read_csv("data/afl_match_results.csv")
                    .rename(columns={'Game': 'game'})
                    .assign(result=lambda df: df.apply(lambda row: 1 if row['Home.Points'] > row['Away.Points'] else 0, axis=1)))

train_df = pd.merge(train_df,  match_results[['game', 'result']], on='game')

train_x = train_df.drop(columns=['result'])
train_y = train_df.result

next_round_x = features_df[features_df.game.isin(next_week_df.game)]

# Fit out logistic regression model - note that our predictions come out in the order of [away_team_prob, home_team_prob]

lr_best_params = {'C': 0.01,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'max_iter': 100,
 'multi_class': 'ovr',
 'n_jobs': 1,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'newton-cg',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

feature_cols = [col for col in train_df if col.startswith('f_')]

# Scale features
scaler = StandardScaler()
train_x[feature_cols] = scaler.fit_transform(train_x[feature_cols])
next_round_x[feature_cols] = scaler.transform(next_round_x[feature_cols])

lr = LogisticRegression(**lr_best_params)
lr.fit(train_x[feature_cols], train_y)
prediction_probs = lr.predict_proba(next_round_x[feature_cols])

modelled_home_odds = [1/i[1] for i in prediction_probs]
modelled_away_odds = [1/i[0] for i in prediction_probs]

# Create a predictions df
preds_df = (next_round_x[['date', 'home_team', 'away_team', 'venue', 'game']].copy()
               .assign(modelled_home_odds=modelled_home_odds,
                      modelled_away_odds=modelled_away_odds)
               .pipe(pd.merge, next_week_odds, on=['home_team', 'away_team'])
               .pipe(pd.merge, features_df[['game', 'f_elo_home', 'f_elo_away']], on='game')
               .drop(columns='game')
           )

preds_df

	date	home_team	away_team	venue	modelled_home_odds	modelled_away_odds	odds	odds_away	f_elo_home	f_elo_away
0	2018-09-29	West Coast	Collingwood	MCG	2.326826	1.753679	2.34	1.75	1591.348723	1562.924273

Alternatively, if you want to generate predictions using a script which uses all the above code, just run the following:

print(afl_modelling_v2.create_predictions())

        date   home_team    away_team venue  modelled_home_odds  \
0 2018-09-29  West Coast  Collingwood   MCG            2.326826

   modelled_away_odds  odds  odds_away   f_elo_home   f_elo_away  
0            1.753679  2.34       1.75  1591.348723  1562.924273

Conclusion

Congratulations! You have created AFL predictions for this week. If you are beginner to this, don't be overwhelmed. The process gets easier each time you do it. And it is super rewarding. In future iterations we will update this tutorial to predict actual odds, and then integrate this with Betfair's API so that you can create an automated betting strategy using Machine Learning to create your predictions!

Disclaimer

Note that whilst models and automated strategies are fun and rewarding to create, we can't promise that your model or betting strategy will be profitable, and we make no representations in relation to the code shared or information on this page. If you're using this code or implementing your own strategies, you do so entirely at your own risk and you are responsible for any winnings/losses incurred. Under no circumstances will Betfair be liable for any loss or damage you suffer.