AFL Modelling Walkthrough¶
02. Feature Creation¶
These tutorials will walk you through how to construct your own basic AFL model. The output will be odds for each team to win, which will be shown on The Hub.
In this notebook we will walk you through creating features from our dataset, which was cleaned in the first tutorial. Feature engineering is an integral part of the Data Science process. Creative and smart features can be the difference between an average performing model and a model profitable which beats the market odds.
Grabbing Our Dataset¶
First, we will import our required modules, as well as the prepare_afl_data function which we created in our afl_data_cleaning script. This essentially cleans all the data for us so that we're ready to explore the data and make some features.
# Import modules
from afl_data_cleaning_v2 import *
import afl_data_cleaning_v2
import pandas as pd
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')
import numpy as np
# Use the prepare_afl_data function to prepare the data for us; this function condenses what we walked through in the previous tutorial
afl_data = prepare_afl_data()
afl_data.tail(3)
team | f_odds | date | home_game | game | round | f_goals | f_behinds | f_points | f_margin | venue | opponent | f_opponent_goals | f_opponent_behinds | f_opponent_points | season | f_AF | f_B | f_BO | f_CCL | f_CG | f_CL | f_CM | f_CP | f_D | f_ED | f_FA | f_FF | f_G | f_GA | f_HB | f_HO | f_I50 | f_ITC | f_K | f_M | f_MG | f_MI5 | f_One.Percenters | f_R50 | f_SC | f_SCL | f_SI | f_T | f_T5 | f_TO | f_UP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3154 | Melbourne | 1.5116 | 2018-08-26 | 1 | 15397 | 23 | 15 | 12 | 102 | 45 | M.C.G. | GWS | 8 | 9 | 57 | 2018 | 1712 | 8 | 12 | 10.0 | 38 | 30 | 12 | 139 | 403 | 302 | 13 | 19 | 15 | 14 | 181 | 48 | 54 | 59.0 | 222 | 106 | 6198.0 | 16 | 34 | 39 | 1768 | 20.0 | 147.0 | 60 | 2.0 | 53.0 | 269 |
3155 | North Melbourne | 1.3936 | 2018-08-26 | 0 | 15398 | 23 | 17 | 15 | 117 | 23 | Docklands | St Kilda | 14 | 10 | 94 | 2018 | 1707 | 13 | 7 | 15.0 | 50 | 24 | 6 | 131 | 425 | 322 | 14 | 18 | 17 | 13 | 201 | 32 | 59 | 77.0 | 224 | 106 | 5833.0 | 23 | 33 | 29 | 1735 | 9.0 | 154.0 | 48 | 9.0 | 66.0 | 300 |
3156 | St Kilda | 3.5178 | 2018-08-26 | 1 | 15398 | 23 | 14 | 10 | 94 | -23 | Docklands | North Melbourne | 17 | 15 | 117 | 2018 | 1587 | 8 | 11 | 19.0 | 48 | 33 | 7 | 125 | 383 | 299 | 18 | 14 | 14 | 13 | 173 | 23 | 48 | 68.0 | 210 | 112 | 5522.0 | 14 | 46 | 35 | 1568 | 14.0 | 95.0 | 50 | 7.0 | 77.0 | 269 |
Creating A Feature DataFrame¶
Let's create a feature DataFrame and merge all of our features into this DataFrame as we go.
features = afl_data[['date', 'game', 'team', 'opponent', 'venue', 'home_game']].copy()
What Each Column Refers To¶
Below is a DataFrame which outlines what each column refers to.
column_abbreviations = pd.read_csv("data/afl_data_columns_mapping.csv")
column_abbreviations
Feature Abbreviated | Feature | |
---|---|---|
0 | GA | Goal Assists |
1 | CP | Contested Possessions |
2 | UP | Uncontested Possessions |
3 | ED | Effective Disposals |
4 | CM | Contested Marks |
5 | MI5 | Marks Inside 50 |
6 | One.Percenters | One Percenters |
7 | BO | Bounces |
8 | K | Kicks |
9 | HB | Handballs |
10 | D | Disposals |
11 | M | Marks |
12 | G | Goals |
13 | B | Behinds |
14 | T | Tackles |
15 | HO | Hitouts |
16 | I50 | Inside 50s |
17 | CL | Clearances |
18 | CG | Clangers |
19 | R50 | Rebound 50s |
20 | FF | Frees For |
21 | FA | Frees Against |
22 | AF | AFL Fantasy Points |
23 | SC | Supercoach Points |
24 | CCL | Centre Clearances |
25 | SCL | Stoppage Clearances |
26 | SI | Score Involvements |
27 | MG | Metres Gained |
28 | TO | Turnovers |
29 | ITC | Intercepts |
30 | T5 | Tackles Inside 50 |
Feature Creation¶
Now let's think about what features we can create. We have a enormous amount of stats to sift through. To start, let's create some simple features based on our domain knowledge of Aussie Rules.
Creating Expontentially Weighted Rolling Averages as Features¶
Next, we will create rolling averages of statistics such as Tackles, which we will use as features.
It is fair to assume that a team's performance in a certain stat may have predictive power to the overall result. And in general, if a team consistently performs well in this stat, this may have predictive power to the result of their future games. We can't simply train a model on stats from the game which we are trying to predict (i.e. data that we don't have before the game begins), as this will leak the result. We need to train our model on past data. One way of doing this is to train our model on average stats over a certain amount of games. If a team is averaging high in this stat, this may give insight into if they are a strong team. Similarly, if the team is averaging poorly in this stat (relative to the team they are playing), this may have predictive power and give rise to a predicted loss.
To do this we will create a function which calculates the rolling averages, known as create_exp_weighted_avgs, which takes our cleaned DataFrame as an input, as well as the alpha which, when higher, weights recent performances more than old performances. To read more about expontentially weighted moving averages, please read the documentation here.
First, we will grab all the columns which we want to create EMAs for, and then use our function to create the average for that column. We will create a new DataFrame and add these columns to this new DataFrame.
# Define a function which returns a DataFrame with the expontential moving average for each numeric stat
def create_exp_weighted_avgs(df, span):
# Create a copy of the df with only the game id and the team - we will add cols to this df
ema_features = df[['game', 'team']].copy()
feature_names = [col for col in df.columns if col.startswith('f_')] # Get a list of columns we will iterate over
for feature_name in feature_names:
feature_ema = (df.groupby('team')[feature_name]
.transform(lambda row: (row.ewm(span=span)
.mean()
.shift(1))))
ema_features[feature_name] = feature_ema
return ema_features
features_rolling_averages = create_exp_weighted_avgs(afl_data, span=10)
features_rolling_averages.tail()
game | team | f_odds | f_goals | f_behinds | f_points | f_margin | f_opponent_goals | f_opponent_behinds | f_opponent_points | f_AF | f_B | f_BO | f_CCL | f_CG | f_CL | f_CM | f_CP | f_D | f_ED | f_FA | f_FF | f_G | f_GA | f_HB | f_HO | f_I50 | f_ITC | f_K | f_M | f_MG | f_MI5 | f_One.Percenters | f_R50 | f_SC | f_SCL | f_SI | f_T | f_T5 | f_TO | f_UP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3152 | 15396 | West Coast | 2.094236 | 12.809630 | 10.047145 | 86.904928 | 8.888770 | 11.435452 | 9.403444 | 78.016158 | 3193.612782 | 16.472115 | 11.958482 | 23.379562 | 100.095244 | 68.252001 | 27.688669 | 284.463270 | 719.884644 | 525.878017 | 36.762440 | 44.867118 | 25.618202 | 17.522871 | 270.478779 | 88.139376 | 105.698031 | 148.005305 | 449.405865 | 201.198907 | 11581.929999 | 20.048124 | 95.018480 | 74.180967 | 3314.157893 | 44.872398 | 177.894442 | 126.985101 | 20.565549 | 138.876613 | 438.848376 |
3153 | 15397 | GWS | 1.805565 | 13.100372 | 13.179329 | 91.781563 | 18.527618 | 10.371198 | 11.026754 | 73.253945 | 3165.127358 | 19.875913 | 12.947209 | 25.114002 | 105.856671 | 80.609640 | 23.374884 | 303.160047 | 741.439198 | 534.520295 | 42.597317 | 38.160889 | 26.208715 | 18.688880 | 300.188301 | 81.540693 | 106.989070 | 143.032506 | 441.250897 | 173.050118 | 12091.630837 | 21.106142 | 103.077097 | 80.201059 | 3419.245919 | 55.495610 | 219.879895 | 138.202470 | 25.313148 | 135.966798 | 438.466439 |
3154 | 15397 | Melbourne | 1.706488 | 15.157271 | 13.815113 | 104.758740 | 25.170429 | 11.814319 | 8.702396 | 79.588311 | 3312.408470 | 22.077317 | 7.724955 | 28.364418 | 114.399147 | 78.406069 | 26.934677 | 324.352577 | 775.176933 | 547.385948 | 39.353251 | 36.025646 | 30.308918 | 22.461080 | 348.613592 | 99.787800 | 120.339062 | 154.417642 | 426.563341 | 178.102118 | 12395.717925 | 32.168752 | 96.390688 | 63.786515 | 3427.596843 | 50.041649 | 232.287556 | 144.875098 | 23.789233 | 149.042149 | 456.988552 |
3155 | 15398 | North Melbourne | 2.272313 | 12.721783 | 10.733785 | 87.064486 | -1.214246 | 12.915796 | 10.783958 | 88.278732 | 3066.272143 | 17.322710 | 9.815243 | 26.015421 | 106.465181 | 67.504286 | 26.064079 | 291.259574 | 736.279779 | 534.154748 | 34.301603 | 40.908551 | 25.386136 | 17.816570 | 341.210547 | 81.541130 | 102.589427 | 145.265493 | 395.069232 | 173.089408 | 10875.002463 | 21.802751 | 82.347511 | 70.416194 | 3171.120023 | 41.488865 | 197.620152 | 122.547684 | 22.286256 | 142.780474 | 450.374058 |
3156 | 15398 | St Kilda | 5.516150 | 10.464266 | 11.957047 | 74.742643 | -21.138101 | 14.105551 | 11.247440 | 95.880745 | 3094.163405 | 20.523847 | 14.569589 | 24.134276 | 102.540441 | 66.976211 | 18.018350 | 270.674857 | 773.086015 | 573.769838 | 41.319843 | 36.198820 | 20.850476 | 14.443658 | 364.405251 | 63.498760 | 103.803779 | 130.494307 | 408.680763 | 184.780054 | 10765.717942 | 21.572806 | 94.731555 | 65.790561 | 3228.278599 | 42.841935 | 196.086493 | 115.901425 | 18.796764 | 127.364334 | 508.844514 |
As you can see our function worked perfectly! Now we have a full DataFrame of exponentially weighted moving averages. Note that as these rolling averages have been shifted by 1 to ensure no data leakage, the first round of the data will have all NA values. We can drop these later.
Let's add these averages to our features DataFrame
features = pd.merge(features, features_rolling_averages, on=['game', 'team'])
Creating a 'Form Between the Teams' Feature¶
It is well known in Aussie Rules that often some teams perform better against certain teams than others. If we isolate our features to pure stats based on previous games not between the teams playing, or elo ratings, we won't account for any relationships between certain teams. An example is the Kennett Curse, where Geelong won 11 consecutive games against Hawthorn, despite being similarly matched teams. Let's create a feature which calculates how many games a team has won against their opposition over a given window of games.
To do this, we will need to use historical data that dates back well before our current DataFrame starts at. Otherwise we will be using a lot of our games to calculate form, meaning we will have to drop these rows before feeding it into an algorithm. So let's use our prepare_match_results function which we defined in the afl_data_cleaning tutorial to grab a clean DataFrame of all match results since 1897. We can then calculate the form and join this to our current DataFrame.
match_results = afl_data_cleaning_v2.get_cleaned_match_results()
match_results.head(3)
game | date | round | team | goals | behinds | points | margin | venue | opponent | opponent_goals | opponent_behinds | opponent_points | home_game | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1897-05-08 | 1 | Fitzroy | 6 | 13 | 49 | 33 | Brunswick St | Carlton | 2 | 4 | 16 | 1 |
1 | 1 | 1897-05-08 | 1 | Carlton | 2 | 4 | 16 | -33 | Brunswick St | Fitzroy | 6 | 13 | 49 | 0 |
2 | 2 | 1897-05-08 | 1 | Collingwood | 5 | 11 | 41 | 25 | Victoria Park | St Kilda | 2 | 4 | 16 | 1 |
form_btwn_teams = match_results[['game', 'team', 'opponent', 'margin']].copy()
form_btwn_teams['f_form_margin_btwn_teams'] = (match_results.groupby(['team', 'opponent'])['margin']
.transform(lambda row: row.rolling(5).mean().shift())
.fillna(0))
form_btwn_teams['f_form_past_5_btwn_teams'] = \
(match_results.assign(win=lambda df: df.apply(lambda row: 1 if row.margin > 0 else 0, axis='columns'))
.groupby(['team', 'opponent'])['win']
.transform(lambda row: row.rolling(5).mean().shift() * 5)
.fillna(0))
form_btwn_teams.tail(3)
game | team | opponent | margin | f_form_margin_btwn_teams | f_form_past_5_btwn_teams | |
---|---|---|---|---|---|---|
30793 | 15397 | Melbourne | GWS | 45 | -23.2 | 2.0 |
30794 | 15398 | St Kilda | North Melbourne | -23 | -3.2 | 2.0 |
30795 | 15398 | North Melbourne | St Kilda | 23 | 3.2 | 3.0 |
# Merge to our features df
features = pd.merge(features, form_btwn_teams.drop(columns=['margin']), on=['game', 'team', 'opponent'])
Creating Efficiency Features¶
Disposal Efficiency¶
Disposal efficiency is pivotal in Aussie Rules football. If you are dispose of the ball effectively you are much more likely to score and much less likely to concede goals than if you dispose of it ineffectively.
Let's create a disposal efficiency feature by dividing Effective Disposals by Disposals.
Inside 50/Rebound 50 Efficiency¶
Similarly, one could hypothesise that teams who keep the footy in their Inside 50 regularly will be more likely to score, whilst teams who are effective at getting the ball out of their Defensive 50 will be less likely to concede. Let's use this logic to create Inside 50 Efficiency and Rebound 50 Efficiency features.
The formula used will be:
Inside 50 Efficiency = R50_Opponents / I50 (lower is better).
Rebound 50 Efficiency = R50 / I50_Opponents (higher is better).
Using these formulas, I50 Efficiency = R50 Efficiency_Opponent. So we will just need to create the formulas for I50 efficiency. To create these features we will need the opposition's Inside 50s/Rebound 50s. So we will split out data into two DataFrames, create a new DataFrame by joining these two DataFrames on the Game, calculate our efficiency features, then join our features with our main features DataFrame.
# Get each match on single rows
single_row_df = (afl_data[['game', 'team', 'f_I50', 'f_R50', 'f_D', 'f_ED', 'home_game', ]]
.query('home_game == 1')
.rename(columns={'team': 'home_team', 'f_I50': 'f_I50_home', 'f_R50': 'f_R50_home', 'f_D': 'f_D_home', 'f_ED': 'f_ED_home'})
.drop(columns='home_game')
.pipe(pd.merge, afl_data[['game', 'team', 'f_I50', 'f_R50', 'f_D', 'f_ED', 'home_game']]
.query('home_game == 0')
.rename(columns={'team': 'away_team', 'f_I50': 'f_I50_away', 'f_R50': 'f_R50_away', 'f_D': 'f_D_away', 'f_ED': 'f_ED_away'})
.drop(columns='home_game'), on='game'))
single_row_df.head()
game | home_team | f_I50_home | f_R50_home | f_D_home | f_ED_home | away_team | f_I50_away | f_R50_away | f_D_away | f_ED_away | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 13764 | Carlton | 69 | 21 | 373 | 268 | Richmond | 37 | 50 | 316 | 226 |
1 | 13765 | Geelong | 54 | 40 | 428 | 310 | St Kilda | 52 | 45 | 334 | 246 |
2 | 13766 | Collingwood | 70 | 38 | 398 | 289 | Port Adelaide | 50 | 44 | 331 | 232 |
3 | 13767 | Adelaide | 59 | 38 | 366 | 264 | Hawthorn | 54 | 38 | 372 | 264 |
4 | 13768 | Brisbane | 50 | 39 | 343 | 227 | Fremantle | 57 | 30 | 351 | 250 |
single_row_df = single_row_df.assign(f_I50_efficiency_home=lambda df: df.f_R50_away / df.f_I50_home,
f_I50_efficiency_away=lambda df: df.f_R50_home / df.f_I50_away)
feature_efficiency_cols = ['f_I50_efficiency_home', 'f_I50_efficiency_away']
# Now let's create an Expontentially Weighted Moving Average for these features - we will need to reshape our DataFrame to do this
efficiency_features_multi_row = (single_row_df[['game', 'home_team'] + feature_efficiency_cols]
.rename(columns={
'home_team': 'team',
'f_I50_efficiency_home': 'f_I50_efficiency',
'f_I50_efficiency_away': 'f_I50_efficiency_opponent',
})
.append((single_row_df[['game', 'away_team'] + feature_efficiency_cols]
.rename(columns={
'away_team': 'team',
'f_I50_efficiency_home': 'f_I50_efficiency_opponent',
'f_I50_efficiency_away': 'f_I50_efficiency',
})), sort=True)
.sort_values(by='game')
.reset_index(drop=True))
efficiency_features = efficiency_features_multi_row[['game', 'team']].copy()
feature_efficiency_cols = ['f_I50_efficiency', 'f_I50_efficiency_opponent']
for feature in feature_efficiency_cols:
efficiency_features[feature] = (efficiency_features_multi_row.groupby('team')[feature]
.transform(lambda row: row.ewm(span=10).mean().shift(1)))
# Get feature efficiency df back onto single rows
efficiency_features = pd.merge(efficiency_features, afl_data[['game', 'team', 'home_game']], on=['game', 'team'])
efficiency_features_single_row = (efficiency_features.query('home_game == 1')
.rename(columns={
'team': 'home_team',
'f_I50_efficiency': 'f_I50_efficiency_home',
'f_I50_efficiency_opponent': 'f_R50_efficiency_home'})
.drop(columns='home_game')
.pipe(pd.merge, (efficiency_features.query('home_game == 0')
.rename(columns={
'team': 'away_team',
'f_I50_efficiency': 'f_I50_efficiency_away',
'f_I50_efficiency_opponent': 'f_R50_efficiency_away'})
.drop(columns='home_game')), on='game'))
efficiency_features_single_row.tail(5)
game | home_team | f_I50_efficiency_home | f_R50_efficiency_home | away_team | f_I50_efficiency_away | f_R50_efficiency_away | |
---|---|---|---|---|---|---|---|
1580 | 15394 | Carlton | 0.730668 | 0.675002 | Adelaide | 0.691614 | 0.677128 |
1581 | 15395 | Sydney | 0.699994 | 0.778280 | Hawthorn | 0.699158 | 0.673409 |
1582 | 15396 | Brisbane | 0.683604 | 0.691730 | West Coast | 0.696822 | 0.709605 |
1583 | 15397 | Melbourne | 0.667240 | 0.692632 | GWS | 0.684525 | 0.753783 |
1584 | 15398 | St Kilda | 0.730843 | 0.635819 | North Melbourne | 0.697018 | 0.654991 |
We will merge these features back to our features df later, when the features data frame is on a single row as well.
Creating an Elo Feature¶
Another feature which we could create is an Elo feature. If you don't know what Elo is, go ahead and read our article on it here. We have also written a guide on using elo to model the 2018 FIFA World Cup here.
Essentially, Elo ratings increase if you win. The amount the rating increases is based on how strong the opponent is relative to the team who won. Weak teams get more points for beating stronger teams than they do for beating weaker teams, and vice versa for losses (teams lose points for losses).
Mathematically, Elo ratings can also assign a probability for winning or losing based on the two Elo Ratings of the teams playing.
So let's get into it. We will first define a function which calculates the elo for each team and applies these elos to our DataFrame.
# Define a function which finds the elo for each team in each game and returns a dictionary with the game ID as a key and the
# elos as the key's value, in a list. It also outputs the probabilities and a dictionary of the final elos for each team
def elo_applier(df, k_factor):
# Initialise a dictionary with default elos for each team
elo_dict = {team: 1500 for team in df['team'].unique()}
elos, elo_probs = {}, {}
# Get a home and away dataframe so that we can get the teams on the same row
home_df = df.loc[df.home_game == 1, ['team', 'game', 'f_margin', 'home_game']].rename(columns={'team': 'home_team'})
away_df = df.loc[df.home_game == 0, ['team', 'game']].rename(columns={'team': 'away_team'})
df = (pd.merge(home_df, away_df, on='game')
.sort_values(by='game')
.drop_duplicates(subset='game', keep='first')
.reset_index(drop=True))
# Loop over the rows in the DataFrame
for index, row in df.iterrows():
# Get the Game ID
game_id = row['game']
# Get the margin
margin = row['f_margin']
# If the game already has the elos for the home and away team in the elos dictionary, go to the next game
if game_id in elos.keys():
continue
# Get the team and opposition
home_team = row['home_team']
away_team = row['away_team']
# Get the team and opposition elo score
home_team_elo = elo_dict[home_team]
away_team_elo = elo_dict[away_team]
# Calculated the probability of winning for the team and opposition
prob_win_home = 1 / (1 + 10**((away_team_elo - home_team_elo) / 400))
prob_win_away = 1 - prob_win_home
# Add the elos and probabilities our elos dictionary and elo_probs dictionary based on the Game ID
elos[game_id] = [home_team_elo, away_team_elo]
elo_probs[game_id] = [prob_win_home, prob_win_away]
# Calculate the new elos of each team
if margin > 0: # Home team wins; update both teams' elo
new_home_team_elo = home_team_elo + k_factor*(1 - prob_win_home)
new_away_team_elo = away_team_elo + k_factor*(0 - prob_win_away)
elif margin < 0: # Away team wins; update both teams' elo
new_home_team_elo = home_team_elo + k_factor*(0 - prob_win_home)
new_away_team_elo = away_team_elo + k_factor*(1 - prob_win_away)
elif margin == 0: # Drawn game' update both teams' elo
new_home_team_elo = home_team_elo + k_factor*(0.5 - prob_win_home)
new_away_team_elo = away_team_elo + k_factor*(0.5 - prob_win_away)
# Update elos in elo dictionary
elo_dict[home_team] = new_home_team_elo
elo_dict[away_team] = new_away_team_elo
return elos, elo_probs, elo_dict
# Use the elo applier function to get the elos and elo probabilities for each game - we will map these later
elos, probs, elo_dict = elo_applier(afl_data, 30)
Great! now we have both rolling averages for stats as a feature, and the elo of the teams! Let's have a quick look at the current elo standings with a k-factor of 30, out of curiosity.
for team in sorted(elo_dict, key=elo_dict.get)[::-1]:
print(team, elo_dict[team])
Richmond 1695.2241513840117
Sydney 1645.548990879842
Hawthorn 1632.5266709780622
West Coast 1625.871701773721
Geelong 1625.423154644809
GWS 1597.4158602131877
Adelaide 1591.1704934545442
Collingwood 1560.370309216614
Melbourne 1558.5666572771509
Essendon 1529.0198398117086
Port Adelaide 1524.8882517820093
North Melbourne 1465.5637511922569
Western Bulldogs 1452.2110697845148
Fremantle 1393.142087030804
St Kilda 1360.9120149937303
Brisbane 1276.2923772139352
Gold Coast 1239.174528704772
Carlton 1226.6780896643265
This looks extremely similar to the currently AFL ladder, so this is a good sign for elo being an effective predictor of winning.
Merging Our Features Into One Features DataFrame¶
Now we need to reshape our features df so that we have all of the statistics for both teams in a game on a single row. We can then merge our elo and efficiency features to this df.
# Look at our current features df
features.tail(3)
date | game | team | opponent | venue | home_game | f_odds | f_goals | f_behinds | f_points | f_margin | f_opponent_goals | f_opponent_behinds | f_opponent_points | f_AF | f_B | f_BO | f_CCL | f_CG | f_CL | f_CM | f_CP | f_D | f_ED | f_FA | f_FF | f_G | f_GA | f_HB | f_HO | f_I50 | f_ITC | f_K | f_M | f_MG | f_MI5 | f_One.Percenters | f_R50 | f_SC | f_SCL | f_SI | f_T | f_T5 | f_TO | f_UP | f_form_margin_btwn_teams | f_form_past_5_btwn_teams | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3156 | 2018-08-26 | 15397 | Melbourne | GWS | M.C.G. | 1 | 1.706488 | 15.157271 | 13.815113 | 104.758740 | 25.170429 | 11.814319 | 8.702396 | 79.588311 | 3312.408470 | 22.077317 | 7.724955 | 28.364418 | 114.399147 | 78.406069 | 26.934677 | 324.352577 | 775.176933 | 547.385948 | 39.353251 | 36.025646 | 30.308918 | 22.461080 | 348.613592 | 99.78780 | 120.339062 | 154.417642 | 426.563341 | 178.102118 | 12395.717925 | 32.168752 | 96.390688 | 63.786515 | 3427.596843 | 50.041649 | 232.287556 | 144.875098 | 23.789233 | 149.042149 | 456.988552 | -23.2 | 2.0 |
3157 | 2018-08-26 | 15398 | North Melbourne | St Kilda | Docklands | 0 | 2.272313 | 12.721783 | 10.733785 | 87.064486 | -1.214246 | 12.915796 | 10.783958 | 88.278732 | 3066.272143 | 17.322710 | 9.815243 | 26.015421 | 106.465181 | 67.504286 | 26.064079 | 291.259574 | 736.279779 | 534.154748 | 34.301603 | 40.908551 | 25.386136 | 17.816570 | 341.210547 | 81.54113 | 102.589427 | 145.265493 | 395.069232 | 173.089408 | 10875.002463 | 21.802751 | 82.347511 | 70.416194 | 3171.120023 | 41.488865 | 197.620152 | 122.547684 | 22.286256 | 142.780474 | 450.374058 | 3.2 | 3.0 |
3158 | 2018-08-26 | 15398 | St Kilda | North Melbourne | Docklands | 1 | 5.516150 | 10.464266 | 11.957047 | 74.742643 | -21.138101 | 14.105551 | 11.247440 | 95.880745 | 3094.163405 | 20.523847 | 14.569589 | 24.134276 | 102.540441 | 66.976211 | 18.018350 | 270.674857 | 773.086015 | 573.769838 | 41.319843 | 36.198820 | 20.850476 | 14.443658 | 364.405251 | 63.49876 | 103.803779 | 130.494307 | 408.680763 | 184.780054 | 10765.717942 | 21.572806 | 94.731555 | 65.790561 | 3228.278599 | 42.841935 | 196.086493 | 115.901425 | 18.796764 | 127.364334 | 508.844514 | -3.2 | 2.0 |
one_line_cols = ['game', 'team', 'home_game'] + [col for col in features if col.startswith('f_')]
# Get all features onto individual rows for each match
features_one_line = (features.loc[features.home_game == 1, one_line_cols]
.rename(columns={'team': 'home_team'})
.drop(columns='home_game')
.pipe(pd.merge, (features.loc[features.home_game == 0, one_line_cols]
.drop(columns='home_game')
.rename(columns={'team': 'away_team'})
.rename(columns={col: col+'_away' for col in features.columns if col.startswith('f_')})), on='game')
.drop(columns=['f_form_margin_btwn_teams_away', 'f_form_past_5_btwn_teams_away']))
# Add our created features - elo, efficiency etc.
features_one_line = (features_one_line.assign(f_elo_home=lambda df: df.game.map(elos).apply(lambda x: x[0]),
f_elo_away=lambda df: df.game.map(elos).apply(lambda x: x[1]))
.pipe(pd.merge, efficiency_features_single_row, on=['game', 'home_team', 'away_team'])
.pipe(pd.merge, afl_data.loc[afl_data.home_game == 1, ['game', 'date', 'round', 'venue']], on=['game'])
.dropna()
.reset_index(drop=True)
.assign(season=lambda df: df.date.apply(lambda row: row.year)))
ordered_cols = [col for col in features_one_line if col[:2] != 'f_'] + [col for col in features_one_line if col.startswith('f_')]
feature_df = features_one_line[ordered_cols]
Finally, let's reduce the dimensionality of the features df by subtracting the home features from the away features. This will reduce the huge amount of columns we have and make our data more manageable. To do this, we will need a list of columns which we are subtracting from each other. We will then loop over each of these columns to create our new differential columns.
We will then add in the implied probability from the odds of the home and away team, as our current odds feature is simply an exponential moving average over the past n games.
# Create differential df - this df is the home features - the away features
diff_cols = [col for col in feature_df.columns if col + '_away' in feature_df.columns and col != 'f_odds' and col.startswith('f_')]
non_diff_cols = [col for col in feature_df.columns if col not in diff_cols and col[:-5] not in diff_cols]
diff_df = feature_df[non_diff_cols].copy()
for col in diff_cols:
diff_df[col+'_diff'] = feature_df[col] - feature_df[col+'_away']
# Add current odds in to diff_df
odds = get_cleaned_odds()
home_odds = (odds[odds.home_game == 1]
.assign(f_current_odds_prob=lambda df: 1 / df.odds)
.rename(columns={'team': 'home_team'})
.drop(columns=['home_game', 'odds']))
away_odds = (odds[odds.home_game == 0]
.assign(f_current_odds_prob_away=lambda df: 1 / df.odds)
.rename(columns={'team': 'away_team'})
.drop(columns=['home_game', 'odds']))
diff_df = (diff_df.pipe(pd.merge, home_odds, on=['date', 'home_team'])
.pipe(pd.merge, away_odds, on=['date', 'away_team']))
diff_df.tail()
game | home_team | away_team | date | round | venue | season | f_odds | f_form_margin_btwn_teams | f_form_past_5_btwn_teams | f_odds_away | f_elo_home | f_elo_away | f_I50_efficiency_home | f_R50_efficiency_home | f_I50_efficiency_away | f_R50_efficiency_away | f_goals_diff | f_behinds_diff | f_points_diff | f_margin_diff | f_opponent_goals_diff | f_opponent_behinds_diff | f_opponent_points_diff | f_AF_diff | f_B_diff | f_BO_diff | f_CCL_diff | f_CG_diff | f_CL_diff | f_CM_diff | f_CP_diff | f_D_diff | f_ED_diff | f_FA_diff | f_FF_diff | f_G_diff | f_GA_diff | f_HB_diff | f_HO_diff | f_I50_diff | f_ITC_diff | f_K_diff | f_M_diff | f_MG_diff | f_MI5_diff | f_One.Percenters_diff | f_R50_diff | f_SC_diff | f_SCL_diff | f_SI_diff | f_T_diff | f_T5_diff | f_TO_diff | f_UP_diff | f_current_odds_prob | f_current_odds_prob_away | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1626 | 15394 | Carlton | Adelaide | 2018-08-25 | 23 | Docklands | 2018 | 6.467328 | -26.2 | 1.0 | 2.066016 | 1230.072138 | 1587.776445 | 0.730668 | 0.675002 | 0.691614 | 0.677128 | -3.498547 | -5.527193 | -26.518474 | -34.473769 | 1.289715 | 0.217006 | 7.955295 | -341.342677 | -9.317269 | 3.088569 | -2.600593 | 15.192839 | -12.518345 | -4.136673 | -41.855717 | -72.258378 | -51.998775 | 9.499447 | 8.670917 | -6.973088 | -4.740623 | -26.964945 | -13.147675 | -23.928700 | -28.940883 | -45.293433 | -15.183406 | -1900.784014 | -0.362402 | -1.314627 | 4.116133 | -294.813511 | -9.917793 | -34.724925 | -5.462844 | -9.367141 | -19.623785 | -38.188082 | 0.187709 | 0.816860 |
1627 | 15395 | Sydney | Hawthorn | 2018-08-25 | 23 | S.C.G. | 2018 | 2.128611 | 1.0 | 2.0 | 1.777290 | 1662.568452 | 1615.507209 | 0.699994 | 0.778280 | 0.699158 | 0.673409 | -1.756730 | -0.874690 | -11.415069 | -15.575319 | 0.014390 | 4.073909 | 4.160250 | -174.005092 | -0.942357 | -4.078635 | -4.192916 | 7.814496 | -2.225780 | 6.215760 | 15.042979 | -34.894261 | -50.615255 | 4.214158 | 0.683548 | -3.535594 | -3.168608 | -12.068691 | -30.493980 | -9.867332 | 2.588103 | -22.825570 | -5.604199 | 253.086090 | -2.697132 | -22.612327 | 25.340623 | -90.812188 | 1.967104 | -31.047879 | 0.007606 | -6.880120 | 11.415593 | -49.957313 | 0.440180 | 0.561924 |
1628 | 15396 | Brisbane | West Coast | 2018-08-26 | 23 | Gabba | 2018 | 3.442757 | -49.2 | 0.0 | 2.094236 | 1279.963814 | 1622.200265 | 0.683604 | 0.691730 | 0.696822 | 0.709605 | -0.190413 | 1.182699 | 0.040221 | -13.621456 | 1.772577 | 3.026217 | 13.661677 | -22.709485 | 2.424261 | -4.848054 | 1.800473 | 5.051157 | 6.440524 | -5.549630 | -17.041838 | 27.543023 | 33.983159 | 4.459181 | -3.213885 | -0.428455 | 1.514474 | 42.646138 | -7.141638 | 1.457375 | -17.472537 | -15.103115 | 8.001966 | -383.083539 | 6.458915 | 7.275716 | 0.942863 | 44.461590 | 4.640136 | 13.180967 | -15.704694 | 2.366444 | -5.985843 | 38.195255 | 0.433501 | 0.569866 |
1629 | 15397 | Melbourne | GWS | 2018-08-26 | 23 | M.C.G. | 2018 | 1.706488 | -23.2 | 2.0 | 1.805565 | 1540.367850 | 1615.614668 | 0.667240 | 0.692632 | 0.684525 | 0.753783 | 2.056899 | 0.635785 | 12.977177 | 6.642811 | 1.443121 | -2.324358 | 6.334366 | 147.281112 | 2.201404 | -5.222254 | 3.250416 | 8.542475 | -2.203571 | 3.559792 | 21.192530 | 33.737734 | 12.865653 | -3.244066 | -2.135243 | 4.100203 | 3.772200 | 48.425291 | 18.247107 | 13.349992 | 11.385136 | -14.687556 | 5.052000 | 304.087088 | 11.062610 | -6.686409 | -16.414544 | 8.350924 | -5.453961 | 12.407662 | 6.672628 | -1.523915 | 13.075351 | 18.522113 | 0.661551 | 0.340379 |
1630 | 15398 | St Kilda | North Melbourne | 2018-08-26 | 23 | Docklands | 2018 | 5.516150 | -3.2 | 2.0 | 2.272313 | 1372.453734 | 1454.022032 | 0.730843 | 0.635819 | 0.697018 | 0.654991 | -2.257517 | 1.223261 | -12.321842 | -19.923855 | 1.189755 | 0.463481 | 7.602012 | 27.891262 | 3.201137 | 4.754346 | -1.881145 | -3.924740 | -0.528075 | -8.045729 | -20.584717 | 36.806235 | 39.615090 | 7.018240 | -4.709732 | -4.535660 | -3.372912 | 23.194704 | -18.042370 | 1.214353 | -14.771187 | 13.611531 | 11.690647 | -109.284521 | -0.229945 | 12.384044 | -4.625633 | 57.158576 | 1.353070 | -1.533659 | -6.646259 | -3.489492 | -15.416140 | 58.470456 | 0.284269 | 0.717566 |
Wrapping it Up¶
We now have a fairly decent amount of features. Some other features which could be added include whether the game is in a major Capital city outisde of Mebourne (i.e. Sydney, Adelaide or Peth), how many 'Elite' players are playing (which could be judged by average SuperCoach scores over 110, for example), as well as your own metrics for attacking and defending.
Note that all of our features have columns starting with 'f_' so in the next tutorial, we will grab this feature dataframe and use these features to sport predicting the matches.