Modelling the Brownlow

This walkthrough will provide a brief, yet effective tutorial on how to model the Brownlow medal. We will use data from 2010 to 2018, which includes Supercoach points and other useful stats.

The output will be the number of votes predicted for each player in each match, and we will aggregate these to create aggregates for each team and for the whole competition. No doubt we'll have Mitchell right up the top, and if we don't, then we know we've done something wrong!

# Import modules libraries
import pandas as pd
import h2o
from h2o.automl import H2OAutoML
import numpy as np
from sklearn.preprocessing import StandardScaler
import os
import pickle

# Change notebook settings
pd.options.display.max_columns = None
pd.options.display.max_rows = 300
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

This walkthrough uses H2O's AutoML to automate machine learning. We have saved the models created by AutoML into the Github repo which can be loaded into the notebook to save you time training the model. However a drawback of H2O is that you can only load models if your currently installed verson of H2O is the same as the version used to create the model. You can check what version of H2O you have installed by running h2o.__version__:

# Checking the current version of H2O
print(f'Current version of H2O installed: {h2o.__version__}')

Current version of H2O installed: 3.36.0.3

If you have a different version of H2O installed you have two options. You can train the models yourself, all the code to do that is commented out in the notebook. Or you can pip uninstall H2O and then pip install H2O again, but specifically the 3.36.0.3 version. The code to do this is below and will only take 1 or 2 minutes:

# # Uncomment the code below if you need to uninstall the current version of H2O and reinstall version 3.36.0.3
# # The following command removes the H2O module for Python.
# pip uninstall h2o

# # Next, use pip to install this version of the H2O Python module.
# pip install http://h2o-release.s3.amazonaws.com/h2o/rel-zorn/3/Python/h2o-3.36.0.3-py2.py3-none-any.whl

EDA

Read in the data

I have collated this data using the fitzRoy R package and merging the afltables dataset with the footywire dataset, so that we can Supercoach and other advanced stats with Brownlow votes. Let's read in the data and have a sneak peak at what it looks like.

brownlow_data = pd.read_csv('data/afl_brownlow_data.csv')

brownlow_data.tail(3)

	date	season	round	venue	ID	match_id	player	jumper_no	team	opposition	status	team_score	opposition_score	margin	brownlow_votes	CP	UP	ED	DE	GA	MI5	one_perc	TOG	K	HB	D	M	G	B	T	I50	CL	CG	R50	FF	FA	AF	SC
76585	25/08/2018	2018	23	Optus Stadium	12587	9711	J Stephenson	35	Collingwood	Fremantle	Away	76	67	9	NaN	3	5	5	62.5	0	3	2	87	6	2	8	3	2	1	3	0	0	1	0	0	0	56	56
76586	25/08/2018	2018	23	Optus Stadium	12144	9711	J Thomas	24	Collingwood	Fremantle	Away	76	67	9	NaN	6	11	14	82.4	2	0	1	79	11	6	17	4	1	0	3	4	1	4	1	1	3	67	89
76587	25/08/2018	2018	23	Optus Stadium	11549	9711	T Varcoe	18	Collingwood	Fremantle	Away	76	67	9	NaN	7	4	7	53.8	1	0	2	73	6	7	13	1	0	0	3	2	0	4	0	1	1	45	53

It looks like we've got about 76,000 rows of data and have stats like hitouts, clangers, effective disposals etc. Let's explore some certain scenarios. Using my domain knowledge of footy, I can hypothesise that if a player kicks 5 goals, he is pretty likely to poll votes. Similarly, if a player gets 30 possessions and 2+ goals, he is also probably likely to poll votes. Let's have a look at the mean votes for players for both of these situations.

Exploring votes if a bag is kicked (5+ goals)

brownlow_data.query('G &gt;= 5').groupby('season').brownlow_votes.mean()
print("Mean votes if a player has kicked a bag:", brownlow_data.query('G &gt;= 5').brownlow_votes.mean())

season
2010    1.420455
2011    1.313433
2012    1.413333
2013    1.253731
2014    1.915254
2015    1.765625
2016    1.788732
2017    2.098361
2018    0.000000
Name: brownlow_votes, dtype: float64

Mean votes if a player has kicked a bag: 1.4708818635607321

Exploring votes if the player has 30+ possies & 2+ goals

brownlow_data.query('G &gt;= 2 and D &gt;= 30').groupby('season').brownlow_votes.mean()
print("Mean votes if a player has 30 possies and kicks 2+ goals:", brownlow_data.query('G &gt;= 2 and D &gt;= 30').brownlow_votes.mean())

season
2010    1.826923
2011    1.756410
2012    2.118421
2013    2.000000
2014    2.253731
2015    2.047619
2016    2.103448
2017    2.050000
2018    0.000000
Name: brownlow_votes, dtype: float64

Mean votes if a player has 30 possies and kicks 2+ goals: 1.8741379310344828

As suspected, the average votes for these two situations is 1.87! That's huge. Let's get an idea of the average votes for each player. It should be around 6/44, as there are always 6 votes per match and around 44 players per match.

brownlow_data.brownlow_votes.mean()

0.12347341475121326

So the average vote is 0.12. Let's see how this changes is the player is a captain. I have collected data on if players are captains from wikipedia and collated it into a csv. Let's load this in and create a "Is the player captain" feature, then check the average votes for captains.

Create Is Player Captain Feature

captains = pd.read_csv('data/captains.csv').set_index('player')

def is_captain_for_that_season(captains_df, player, year):

    if player in captains_df.index:
        # Get years they were captain
        seasons = captains_df.loc[player].season.split('-')
        if len(seasons) == 1:
            seasons_captain = list(map(int, seasons))
        elif len(seasons) == 2:
            if seasons[1] == '':
                seasons_captain = list(range(int(seasons[0]), 2019))
            else:
                seasons_captain = list(range(int(seasons[0]), int(seasons[1]) + 1))

        if year in seasons_captain:
            return 1
    return 0

brownlow_data['is_captain'] = brownlow_data.apply(lambda x: is_captain_for_that_season(captains, x.player, x.season), axis='columns')

brownlow_data.query('is_captain == 1').groupby('season').brownlow_votes.mean()
print("Mean votes if a player is captain:", brownlow_data.query('is_captain == 1').brownlow_votes.mean())

season
2010    0.408497
2011    0.429936
2012    0.274194
2013    0.438725
2014    0.519663
2015    0.447222
2016    0.347826
2017    0.425806
2018    0.000000
Name: brownlow_votes, dtype: float64

Mean votes if a player is captain: 0.36661698956780925

This is significantly higher than if they aren't captain. What would be interesting is to look at the average difference in votes between when they were captain and when they weren't, to try and find if there is a 'captain bias' in brownlow votes. Go ahead and try. For now, we're going to move onto feature creation

Feature Creation

Let's make a range of features, including: * Ratios of each statistic per game * If the player is a captain * If they kicked a bag (4/5+) * If they kicked 2 and had 30+ possies

First we will make features of ratios. What is important is not how many of a certain stat a player has, but how much of that stat a player has relative to everyone else in the same match. It doesn't matter if Dusty Martin has 31 possessions if Tom Mitchell has had 50 - Mitchell is probably more likely to poll (assuming all else is equal). So rather than using the actual number of possessions for example, we can divide these possessions by the total amount of possessions in the game. To do this we'll use pandas groupby and transform methods.

Create Ratios As Features

%%time

# Get a list of stats of which to create ratios for
ratio_cols = ['CP', 'UP', 'ED', 'DE', 'CM', 'GA', 'MI5', 'one_perc', 'BO', 'TOG', 
               'K', 'HB', 'D', 'M', 'G', 'B', 'T', 'HO', 'I50', 'CL', 'CG', 'R50', 
               'FF', 'FA', 'AF', 'SC']

# Create a ratios df
ratios = (brownlow_data.copy()
          .loc[:, ['match_id'] + ratio_cols]
          .groupby('match_id')
          .transform(lambda x: x / x.sum()))

feature_cols = ['date', 'season', 'round', 'venue', 'ID', 'match_id', 'player', 'jumper_no', 'team', 
                'opposition', 'status', 'team_score', 'opposition_score', 'margin', 'brownlow_votes']

# Create a features df - join the ratios to this df
features = (brownlow_data[feature_cols].copy()
               .join(ratios))

Wall time: 17.1 s

features.head()

	date	season	round	venue	ID	match_id	player	jumper_no	team	opposition	status	team_score	opposition_score	margin	brownlow_votes	CP	UP	ED	DE	CM	GA	MI5	one_perc	TOG	K	HB	D	M	G	B	T	I50	CL	CG	R50	FF	FA	AF	SC
0	25/03/2010	2010	1	MCG	11559	5089	J Anderson	26	Carlton	Richmond	Away	120	64	56	0.0	0.024194	0.021186	0.020599	0.021931	0.00	0.000000	0.000000	0.000000	0.024411	0.017032	0.028481	0.022008	0.019802	0.000000	0.0000	0.029197	0.000000	0.000000	0.010204	0.014286	0.00000	0.025	0.020075	0.019353
1	25/03/2010	2010	1	MCG	4060	5089	E Betts	19	Carlton	Richmond	Away	120	64	56	0.0	0.044355	0.014831	0.024345	0.021804	0.05	0.133333	0.130435	0.033333	0.025243	0.021898	0.031646	0.026135	0.029703	0.111111	0.1875	0.043796	0.000000	0.014925	0.010204	0.000000	0.00000	0.000	0.034504	0.041125
2	25/03/2010	2010	1	MCG	3281	5089	P Bower	18	Carlton	Richmond	Away	120	64	56	0.0	0.016129	0.038136	0.044944	0.030602	0.10	0.000000	0.000000	0.100000	0.026352	0.036496	0.031646	0.034388	0.054455	0.000000	0.0000	0.036496	0.009804	0.000000	0.000000	0.014286	0.00000	0.000	0.037014	0.039311
3	25/03/2010	2010	1	MCG	4056	5089	A Carrazzo	44	Carlton	Richmond	Away	120	64	56	2.0	0.032258	0.069915	0.059925	0.025501	0.00	0.000000	0.000000	0.000000	0.024133	0.051095	0.060127	0.055021	0.024752	0.000000	0.0000	0.029197	0.078431	0.074627	0.040816	0.057143	0.04878	0.025	0.041092	0.040822
4	25/03/2010	2010	1	MCG	11535	5089	B Gibbs	4	Carlton	Richmond	Away	120	64	56	1.0	0.032258	0.031780	0.029963	0.022186	0.10	0.000000	0.130435	0.000000	0.023024	0.036496	0.025316	0.031637	0.029703	0.074074	0.0625	0.029197	0.009804	0.044776	0.010204	0.000000	0.02439	0.025	0.033250	0.033868

Kicked A Bag Feature

features['kicked_a_bag'] = brownlow_data.G.apply(lambda x: 1 if x &gt;= 5 else 0)

Is Captain Feature

features['is_captain'] = features.apply(lambda x: is_captain_for_that_season(captains, x.player, x.season), axis='columns')

Won the Game Feature

features['team_won'] = np.where(features.margin &gt; 0, 1, 0)

30+ & 2+ Goals Feature

features['got_30_possies_2_goals'] = np.where((brownlow_data.G &gt;= 2) &amp; (brownlow_data.D &gt;= 30), 1, 0)

Previous Top 10 Finish Feature

I have a strong feeling that past performance may be a predictor of future performance in the brownlow. For example, last year Dusty Martin won the Brownlow. The umpires may have a bias towards Dusty this year because he is known to be on their radar as being a good player. Let's create a feature which is categorical and is 1 if the player has previously finished in the top 10. Let's create a function for this and then apply it to the afltables dataset, which has data back to 1897. We will then create a lookup table for the top 10 for each season and merge this table with our current features df.

afltables = pd.read_csv('data/afltables_stats.csv').query('Season &gt;= 2000')

def replace_special_characters(name):
    name = name.replace("'", "").replace("-", " ").lower()
    name_split = name.split()

    if len(name_split) &gt; 2:
        first_name = name_split[0]
        last_name = name_split[-1]
        name = first_name + ' ' + last_name

    name_split_2 = name.split()
    name = name_split_2[0][0] + ' ' + name_split_2[1]
    return name.title()

afltables = (afltables.assign(player=lambda df: df['First.name'] + ' ' + df.Surname)
                .assign(player=lambda df: df.player.apply(replace_special_characters))
                .rename(columns={'Brownlow.Votes': 'brownlow_votes', 'Season': 'season', 'Playing.for': 'team'}))

### Create Top 10 rank look up table

brownlow_votes_yearly = (afltables.groupby(['season', 'player', 'team'], as_index=False)
                                  .brownlow_votes
                                  .sum())

brownlow_votes_yearly['yearly_rank'] = (brownlow_votes_yearly.groupby('season')
                                                             .brownlow_votes
                                                             .rank(method='max', ascending=False))

# Filter to only get a dataframe since 2000 and only the top 10 players from each season
brownlow_votes_top_10 = brownlow_votes_yearly.query('yearly_rank &lt; 11 &amp; season &gt;= 2000')

brownlow_votes_top_10.head(3)

def how_many_times_top_10(top_10_df, player, year):
    times = len(top_10_df[(top_10_df.player == player) &amp; (top_10_df.season &lt; year)])
    return times

features['times_in_top_10'] = features.apply(lambda x: how_many_times_top_10(brownlow_votes_top_10, x.player, x.season), axis=1)

	season	player	team	brownlow_votes	yearly_rank
27	2000.0	A Koutoufides	Carlton	19.0	4.0
36	2000.0	A Mcleod	Adelaide	20.0	3.0
105	2000.0	B Ratten	Carlton	18.0	6.0

Average Brownlow Votes Per Game Last Season Feature

# Create a brownlow votes lookup table
brownlow_votes_lookup_table = (brownlow_data.groupby(['player', 'team', 'season'], as_index=False)
                                    .brownlow_votes
                                    .mean()
                                    .assign(next_season=lambda df: df.season + 1)
                                    .rename(columns={
                                        'brownlow_votes': 'ave_votes_last_season'
                                    }))

# Have a look at Cripps to check if it's working
brownlow_votes_lookup_table[brownlow_votes_lookup_table.player == 'P Cripps']

# Merge it to our features df
features_with_votes_last_season = (pd.merge(features, brownlow_votes_lookup_table.drop(columns='season'),

                                             left_on=['player', 'team', 'season'], 
                                             right_on=['player', 'team', 'next_season'],
                                             how='left')
                                  .drop(columns=['next_season'])
                                  .fillna(0))

	player	team	season	ave_votes_last_season	next_season
4377	P Cripps	Carlton	2014	0.000000	2015
4378	P Cripps	Carlton	2015	0.300000	2016
4379	P Cripps	Carlton	2016	0.857143	2017
4380	P Cripps	Carlton	2017	0.333333	2018
4381	P Cripps	Carlton	2018	0.000000	2019

Historic Performance Relative To Model Feature

It is well known that some players are good Brownlow performers. For whatever reason, they always poll much better than their stats may suggest. Lance Franklin and Bontempelli are probably in this category. Perhaps these players have an X-factor that Machine Learning models struggle to pick up on. To get around this, let's create a feature which looks at the player's performance relative to the model's prediction. To do this, we'll need to train and predict 7 different models - from 2011 to 2017.

To create a model for each season, we will use h2o's AutoML. If you're new to h2o, please read about it here. It can be used in both R and Python.

The metric we will use for loss in Mean Absolute Error (MAE).

As we are using regression, some values are negative. We will convert these negative values to 0 as it doesn't make sense to poll negative brownlow votes. Similarly, some matches won't predict exactly 6 votes, so we will scale these predictions so that we predict exactly 6 votes for each match.

So that you don't have to train these models yourself, I have saved the models and we will load them in. If you are keen to train the models yourself, simply uncomment out the code below and run the cell. To bulk uncomment, highlight the rows and press ctrl + '/'

h2o.init()

# Uncomment the code below if you want to train the models yourself - otherwise, we will load them in the load cell from disk


## Join to our features df

# aml_yearly_model_objects = {}
# yearly_predictions_dfs = {}

# feature_cols = ['margin', 'CP', 'UP', 'ED', 'DE',
#        'CM', 'GA', 'MI5', 'one_perc', 'BO', 'TOG', 'K', 'HB', 'D', 'M', 'G',
#        'B', 'T', 'HO', 'I50', 'CL', 'CG', 'R50', 'FF', 'FA', 'AF', 'SC']

# for year in range(2011, 2018):
#     # Filter the data to only include past data
#     train_historic = brownlow_data[brownlow_data.season &lt; year].copy()

#     # Convert to an h2o frame
#     train_h2o_historic = h2o.H2OFrame(train_historic)

#     # Create an AutoML object
#     aml = H2OAutoML(max_runtime_secs=30,
#                    balance_classes=True,
#                    seed=42)

#     # Train the model
#     aml.train(y='brownlow_votes', x=feature_cols, training_frame=train_h2o_historic)

#     # save the model
#     model_path = h2o.save_model(model=aml.leader, path="models", force=True)

#     # Get model id
#     model_name = aml.leaderboard[0, 'model_id']

#     # Rename the model on disk
#     os.rename(f'models/{model_name}', f'models/yearly_model_{year}')

#     # Append the best model to a list
#     aml_yearly_model_objects[year] = aml.leader

#     # Make predictions on test set for that year
#     test_historic = brownlow_data[brownlow_data.season == year].copy()
#     test_h2o_historic = h2o.H2OFrame(test_historic)

#     preds = aml.predict(test_h2o_historic).as_data_frame()
#     test_historic['predicted_votes'] = preds.values

#     # Convert negative predictions to 0
#     test_historic['predicted_votes_neg_to_0'] = test_historic.predicted_votes.apply(lambda x: 0 if x &lt; 0 else x)

#     # Create a total match votes column - which calculates the number of votes predicted in each game when the predictions 
#     # are unscaled
#     test_historic['unscaled_match_votes'] = test_historic.groupby('match_id').predicted_votes_neg_to_0.transform('sum')

#     # Scale predictions
#     test_historic['predicted_votes_scaled'] = test_historic.predicted_votes_neg_to_0 / test_historic.unscaled_match_votes * 6

#     # Aggregate the predictions
#     test_grouped = (test_historic.groupby(['player', 'team'], as_index=False)
#                                  .sum()
#                                  .sort_values(by='brownlow_votes', ascending=False)
#                                  .assign(mae=lambda df: abs(df.predicted_votes_scaled - df.brownlow_votes)))

#     test_grouped['error'] = test_grouped.predicted_votes_scaled - test_grouped.brownlow_votes
#     test_grouped['next_year'] = year + 1

#     # Add this years predictions df to a dictionary to use later
#     yearly_predictions_dfs[year] = test_grouped

# preds_errors = None

# for key, value in yearly_predictions_dfs.items():
#     if preds_errors is None:
#         preds_errors = value[['player', 'season', 'next_year', 'brownlow_votes', 'predicted_votes_scaled', 'error']]
#     else:
#         preds_errors = preds_errors.append(value[['player', 'season', 'next_year', 'brownlow_votes', 'predicted_votes_scaled', 'error']], sort=True)

# with open('data/prediction_errors_df.pickle', 'wb') as handle:
#     pickle.dump(preds_errors, handle)

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) Client VM (build 25.301-b09, mixed mode)
  Starting server from C:\Users\zhoui\greyhounds_bruno\greyhound-modelling\venv_greyhounds\Lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\zhoui\AppData\Local\Temp\tmp6umhk9gp
  JVM stdout: C:\Users\zhoui\AppData\Local\Temp\tmp6umhk9gp\h2o_ZhouI_started_from_python.out
  JVM stderr: C:\Users\zhoui\AppData\Local\Temp\tmp6umhk9gp\h2o_ZhouI_started_from_python.err

C:\Users\zhoui\greyhounds_bruno\greyhound-modelling\venv_greyhounds\lib\site-packages\h2o\backend\server.py:386: UserWarning:   You have a 32-bit version of Java. H2O works best with 64-bit Java.
  Please download the latest 64-bit Java SE JDK from Oracle.

  warn("  You have a 32-bit version of Java. H2O works best with 64-bit Java.\n"

  Server is running at http://127.0.0.1:54325
Connecting to H2O server at http://127.0.0.1:54325 ... successful.

H2O_cluster_uptime:	03 secs
H2O_cluster_timezone:	Australia/Sydney
H2O_data_parsing_timezone:	UTC
H2O_cluster_version:	3.36.0.3
H2O_cluster_version_age:	5 days
H2O_cluster_name:	H2O_from_python_ZhouI_yobrjv
H2O_cluster_total_nodes:	1
H2O_cluster_free_memory:	247.5 Mb
H2O_cluster_total_cores:	12
H2O_cluster_allowed_cores:	12
H2O_cluster_status:	locked, healthy
H2O_connection_url:	http://127.0.0.1:54325
H2O_connection_proxy:	{"http": null, "https": null}
H2O_internal_security:	False
Python_version:	3.9.6 final

# Load predictions error df    
with open('data/prediction_errors_df.pickle', 'rb') as handle:
    preds_errors = pickle.load(handle)

# Look at last years predictions
preds_errors.query('next_year == 2018').sort_values(by='brownlow_votes', ascending=False).head(20)

	brownlow_votes	error	next_year	player	predicted_votes_scaled	season
139	36.0	-3.431041	2018	D Martin	32.568959	44374
486	33.0	0.065131	2018	P Dangerfield	33.065131	42357
619	25.0	0.243510	2018	T Mitchell	25.243510	44374
279	23.0	-9.267806	2018	J Kennedy	13.732194	38323
376	22.0	-7.621163	2018	L Franklin	14.378837	44374
278	21.0	-4.494170	2018	J Kelly	16.505830	42357
519	20.0	-2.909334	2018	R Sloane	17.090666	44374
410	19.0	-7.124114	2018	M Bontempelli	11.875886	44374
483	18.0	-3.734412	2018	O Wines	14.265588	44374
121	17.0	-4.034390	2018	D Beams	12.965610	38323
390	16.0	-2.797669	2018	L Parker	13.202331	44374
561	15.0	-5.458874	2018	S Pendlebury	9.541126	32272
463	15.0	-4.892862	2018	N Fyfe	10.107138	42357
42	15.0	-4.379072	2018	B Ebert	10.620928	44374
651	15.0	1.011272	2018	Z Merrett	16.011272	42357
578	14.0	2.069385	2018	T Adams	16.069385	44374
34	14.0	-4.886668	2018	B Brown	9.113332	44374
172	14.0	0.211867	2018	D Zorko	14.211867	42357
184	14.0	-0.620086	2018	G Ablett	13.379914	28238
389	14.0	-2.811169	2018	L Neale	11.188831	42357

Look at that! A simple Machine Learning ensemble model, using AutoML predicted last year's winner! That's impressive. As we can see it also predicted Bontempelli would only score 11.26, when he actually scored 19 - a huge discrepency. Let's use this as a feature.

features_with_historic_perf_relative_to_model = \
    (features_with_votes_last_season.pipe(pd.merge, preds_errors[['player', 'next_year', 'error']],
                                          left_on=['player', 'season'],
                                          right_on=['player', 'next_year'],
                                          how='left')
                                     .fillna(0)
                                     .rename(columns={'error': 'error_last_season'})
                                     .drop_duplicates(subset=['player', 'round', 'SC']))

Filtering the data to only include the top 20 SC for each match

Logically, it is extremely unlikely that a player will poll votes if their Supercoach score is not in the top 20 players. By eliminating the other 20+ players, we can reduce the noise in the data, as we are almost certain the players won't poll from the bottom half. Let's explore how many players poll if they're not in the top 20, and then filter our df if this number is not significant.

# Find number of players who vote when in top 15 SC

brownlow_data['SC_rank_match'] = brownlow_data.groupby('match_id').SC.rank(method='max', ascending=False)

brownlow_data.query('SC_rank_match &gt; 20 and season &gt; 2014').brownlow_votes.value_counts()

0.0    18330
1.0       14
2.0        8
3.0        2
Name: brownlow_votes, dtype: int64

Since 2014, there have only been 24 players who have voted and not been in the top 20 SC.

features_with_sc_rank = features_with_historic_perf_relative_to_model.copy()

features_with_sc_rank['SC_rank_match'] = features_with_sc_rank.groupby('match_id').SC.rank(method='max', ascending=False)

# Filter out rows with a SC rank of below 20
features_with_sc_rank_filtered = features_with_sc_rank.query('SC_rank_match &lt;= 20')

# Filter out 2010 and 2011 as we used these seasons to create historic model performance features
features_last_before_train = features_with_sc_rank_filtered.query('season != 2010 and season != 2011').reset_index(drop=True)

features_last_before_train.head(3)

	date	season	round	venue	ID	match_id	player	jumper_no	team	opposition	status	team_score	opposition_score	margin	CP	UP	ED	DE	CM	GA	MI5	one_perc	TOG	K	HB	D	M	G	B	T	HO	I50	CL	CG	R50	FF	FA	AF	SC	is_captain	team_won	times_in_top_10	ave_votes_last_season	next_year	error_last_season	SC_rank_match
0	24/03/2012	2012	1	ANZ Stadium	11635	5343	C Bird	14	Sydney	GWS	Away	100	37	63	0.035842	0.041002	0.030769	0.019378	0.000000	0.083333	0.055556	0.019802	0.023064	0.034562	0.037383	0.035762	0.039409	0.105263	0.111111	0.049296	0.0000	0.060606	0.093333	0.063636	0.027778	0.025	0.075	0.039041	0.032662	0	1	0	0.052632	2012.0	0.136311	6.0
1	24/03/2012	2012	1	ANZ Stadium	1013	5343	J Bolton	24	Sydney	GWS	Away	100	37	63	0.039427	0.036446	0.028846	0.019607	0.000000	0.000000	0.000000	0.049505	0.024203	0.027650	0.040498	0.033113	0.009852	0.052632	0.000000	0.042254	0.0125	0.040404	0.066667	0.009091	0.000000	0.025	0.025	0.029819	0.038767	0	1	0	0.526316	2012.0	-0.833178	3.0
2	24/03/2012	2012	1	ANZ Stadium	1012	5343	A Goodes	37	Sydney	GWS	Away	100	37	63	0.032258	0.029613	0.028846	0.023332	0.037037	0.083333	0.055556	0.009901	0.026196	0.029954	0.024922	0.027815	0.029557	0.000000	0.055556	0.014085	0.0125	0.070707	0.040000	0.027273	0.041667	0.025	0.000	0.025822	0.027473	1	1	5	0.761905	2012.0	-5.718150	14.0

Modeling The 2017 Brownlow

Now that we have all of our features, we can simply create a training set (2012-2016), and a test set (2017), and make our predictions for last year! We will use AutoML for this process again. Again, rather than waiting for the model to train, I will save the model so you can simply load it in. We will also scale our features. We can then see how our model went in predicting last year's brownlow, creating a baseline for this years' predictions. We will then predict this year's vote count.

train_baseline = features_last_before_train.query("season &lt; 2017")
holdout = features_last_before_train.query("season == 2017")

scale_cols = ['team_score', 'opposition_score', 'margin', 'CP', 'UP', 'ED', 'DE',
       'CM', 'GA', 'MI5', 'one_perc', 'BO', 'K', 'HB', 'D', 'M', 'G',
       'B', 'T', 'HO', 'I50', 'CL', 'CG', 'R50', 'FF', 'FA', 'AF', 'SC']

other_feature_cols = ['is_captain', 'kicked_a_bag', 'team_won', 'got_30_possies_2_goals', 'times_in_top_10', 
                      'ave_votes_last_season', 'error_last_season', 'SC_rank_match']


all_feature_cols = scale_cols + other_feature_cols

# Scale features
scaler = StandardScaler()

train_baseline_scaled = train_baseline.copy()
train_baseline_scaled[scale_cols] = scaler.fit_transform(train_baseline[scale_cols])

holdout_scaled = holdout.copy()
holdout_scaled[scale_cols] = scaler.transform(holdout[scale_cols])

# Convert categorical columns to categoricals
train_baseline_h2o = h2o.H2OFrame(train_baseline_scaled)
holdout_h2o = h2o.H2OFrame(holdout_scaled)

for col in ['is_captain', 'kicked_a_bag', 'team_won', 'got_30_possies_2_goals']:
    train_baseline_h2o[col] = train_baseline_h2o[col].asfactor()
    holdout_h2o[col] = holdout_h2o[col].asfactor()

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%

Below I have commented out training and saving the 2017 model. Rather than training it again, we will just load it in. Uncomment this part out if you want to train it yourself.

# aml_2017_model = H2OAutoML(max_runtime_secs = 60*30,
#                    balance_classes=True,
#                    seed=42)

# aml_2017_model.train(y='brownlow_votes', x=all_feature_cols, training_frame=train_baseline_h2o)

# # save the model
# model_path = h2o.save_model(model=aml_2017_model.leader, path="models", force=True)

# # Get model id
# model_name = aml_2017_model.leaderboard[0, 'model_id']

# # Rename the model on disk
# os.rename(f'models/{model_name}', f'models/brownlow_2017_model_v1')

# Load model in
aml_2017_model = h2o.load_model('models/brownlow_2017_model_v1')

# Predict the 2017 brownlow count
preds_final_2017_model = aml_2017_model.predict(holdout_h2o)


# Scale these predictions - change negatives to 0s and scale so each game predicts 6 votes total
holdout = (holdout.assign(predicted_votes=preds_final_2017_model.as_data_frame().values)
                  .assign(predicted_votes_neg_to_0=lambda df: df.predicted_votes.apply(lambda x: 0 if x &lt;0 else x))
                  .assign(unscaled_match_votes=lambda df: df.groupby('match_id').predicted_votes_neg_to_0.transform('sum'))
                  .assign(predicted_votes_scaled=lambda df: df.predicted_votes_neg_to_0 / df.unscaled_match_votes * 6))

# Create an aggregate votes df and show the average SC points and goals scored
agg_predictions_2017 = (holdout.groupby(['player', 'team'], as_index=False)
                        .agg({
                            'brownlow_votes': sum,
                            'predicted_votes_scaled': sum,
                            'SC': 'mean',
                            'G': 'mean'})
                        .sort_values(by='brownlow_votes', ascending=False)
                        .assign(mae=lambda df: abs(df.brownlow_votes - df.predicted_votes_scaled))
                        .reset_index(drop=True))

glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%

agg_predictions_2017.head(15)

	player	team	brownlow_votes	predicted_votes_scaled	SC	G	mae
0	D Martin	Richmond	36.0	25.007991	0.037862	0.064869	10.992009
1	P Dangerfield	Geelong	33.0	25.568804	0.042441	0.070819	7.431196
2	T Mitchell	Hawthorn	25.0	19.078639	0.036040	0.016928	5.921361
3	L Franklin	Sydney	22.0	12.574692	0.034640	0.149203	9.425308
4	J Kelly	GWS	21.0	15.198580	0.034652	0.033772	5.801420
5	R Sloane	Adelaide	20.0	15.755535	0.037068	0.034821	4.244465
6	J Kennedy	Sydney	20.0	11.065483	0.032014	0.030508	8.934517
7	M Bontempelli	Western Bulldogs	19.0	13.836173	0.033233	0.040498	5.163827
8	D Beams	Brisbane	17.0	10.073366	0.034848	0.044998	6.926634
9	O Wines	Port Adelaide	16.0	10.511818	0.031601	0.021967	5.488182
10	N Fyfe	Fremantle	15.0	12.782723	0.033761	0.031680	2.217277
11	S Pendlebury	Collingwood	15.0	9.135615	0.033855	0.013660	5.864385
12	B Ebert	Port Adelaide	15.0	7.941416	0.032795	0.008431	7.058584
13	L Parker	Sydney	15.0	12.431929	0.031366	0.030311	2.568071
14	Z Merrett	Essendon	15.0	15.017767	0.033737	0.015362	0.017767

So whilst our model predicted Dangerfield to win, it was pretty damn accurate! Let's find the MAE for the top 100, 50, 25, and 10, and then compare it to 2018's MAE in week, when the Brownlow has been counted.

for top_x in [10, 25, 50, 100]:
    temp_mae = round(agg_predictions_2017.iloc[:top_x].mae.mean(), 3)
    print(f"The Average Mean Absolute Error for the top {top_x} is {temp_mae}")

The Average Mean Absolute Error for the top 10 is 7.033
The Average Mean Absolute Error for the top 25 is 4.962
The Average Mean Absolute Error for the top 50 is 3.744
The Average Mean Absolute Error for the top 100 is 2.696

Modelling This Year's Brownlow

Let's now predict this year's vote count. These predictions will be on the front page of the GitHub.

train = features_last_before_train.query("season &lt; 2018")
test = features_last_before_train.query("season == 2018")

# Scale features
scaler = StandardScaler()

train_scaled = train.copy()
train_scaled[scale_cols] = scaler.fit_transform(train[scale_cols])

test_scaled = test.copy()
test_scaled[scale_cols] = scaler.transform(test[scale_cols])

# Convert categorical columns to categoricals

train_h2o = h2o.H2OFrame(train_scaled)
test_h2o = h2o.H2OFrame(test_scaled)

for col in ['is_captain', 'kicked_a_bag', 'team_won', 'got_30_possies_2_goals']:
    train_h2o[col] = train_h2o[col].asfactor()
    test_h2o[col] = test_h2o[col].asfactor()

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%

# # Train the model - this part is commented out as we will just load our model from disk

# aml = H2OAutoML(max_runtime_secs = 60*30,
#                    balance_classes=True,
#                    seed=42)

# aml.train(y='brownlow_votes', x=all_feature_cols, training_frame=train_h2o)

# # save the model
# model_path = h2o.save_model(model=aml.leader, path="models", force=True)

# # Get model id
# model_name = aml.leaderboard[0, 'model_id']

# # Rename the model on disk
# os.rename(f'models/{model_name}', f'models/brownlow_2018_model_v1')

# Load model in
aml = h2o.load_model('models/brownlow_2018_model_v1')

# Predict the 2018 brownlow count
preds_final_2018_model = aml.predict(test_h2o)


# Scale these predictions - change negatives to 0s and scale so each game predicts 6 votes total
test = (test.assign(predicted_votes=preds_final_2018_model.as_data_frame().values)
                  .assign(predicted_votes_neg_to_0=lambda df: df.predicted_votes.apply(lambda x: 0 if x &lt;0 else x))
                  .assign(unscaled_match_votes=lambda df: df.groupby('match_id').predicted_votes_neg_to_0.transform('sum'))
                  .assign(predicted_votes_scaled=lambda df: df.predicted_votes_neg_to_0 / df.unscaled_match_votes * 6))

# Create an aggregate votes df and show the average SC points and goals scored
agg_predictions_2018 = (test.groupby(['player', 'team'], as_index=False)
                        .agg({
                            'predicted_votes_scaled': sum,
                            'match_id': 'count'}) # shows how many games they played
                        .sort_values(by='predicted_votes_scaled', ascending=False)
                        .reset_index(drop=True))

stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%

# Show the top 25 predictions

agg_predictions_2018.head(25)

	player	team	predicted_votes_scaled	match_id
0	T Mitchell	Hawthorn	36.577208	20
1	M Gawn	Melbourne	20.686854	22
2	C Oliver	Melbourne	19.705244	20
3	D Martin	Richmond	19.674984	19
4	B Grundy	Collingwood	19.136407	22
5	J Macrae	Western Bulldogs	18.396814	17
6	P Dangerfield	Geelong	18.207483	21
7	D Beams	Brisbane	17.280375	15
8	A Gaff	West Coast	16.217552	18
9	L Neale	Fremantle	16.038782	21
10	E Yeo	West Coast	15.866418	20
11	J Selwood	Geelong	15.749850	18
12	D Heppell	Essendon	15.073503	19
13	N Fyfe	Fremantle	14.864971	11
14	Z Merrett	Essendon	14.597278	18
15	S Sidebottom	Collingwood	14.403396	18
16	J Kennedy	Sydney	14.297521	16
17	M Crouch	Adelaide	13.464205	16
18	G Ablett	Geelong	13.403062	15
19	P Cripps	Carlton	13.140569	21
20	R Laird	Adelaide	13.016825	19
21	L Franklin	Sydney	12.562711	13
22	S Coniglio	GWS	12.484317	20
23	J Kelly	GWS	12.245194	14
24	J Lloyd	Sydney	12.019714	20

print(agg_predictions_2018.head(15))

           player              team  predicted_votes_scaled  match_id
0      T Mitchell          Hawthorn               36.577208        20
1          M Gawn         Melbourne               20.686854        22
2        C Oliver         Melbourne               19.705244        20
3        D Martin          Richmond               19.674984        19
4        B Grundy       Collingwood               19.136407        22
5        J Macrae  Western Bulldogs               18.396814        17
6   P Dangerfield           Geelong               18.207483        21
7         D Beams          Brisbane               17.280375        15
8          A Gaff        West Coast               16.217552        18
9         L Neale         Fremantle               16.038782        21
10          E Yeo        West Coast               15.866418        20
11      J Selwood           Geelong               15.749850        18
12      D Heppell          Essendon               15.073503        19
13         N Fyfe         Fremantle               14.864971        11
14      Z Merrett          Essendon               14.597278        18

Now that we have the top 25, let's also look at the top 3 from each team.

agg_predictions_2018.sort_values(by=['team', 'predicted_votes_scaled'], ascending=[True, False]).groupby('team').head(3)

	player	team	predicted_votes_scaled	match_id
17	M Crouch	Adelaide	13.464205	16
20	R Laird	Adelaide	13.016825	19
42	B Gibbs	Adelaide	8.976507	18
7	D Beams	Brisbane	17.280375	15
48	D Zorko	Brisbane	8.039436	14
58	S Martin	Brisbane	6.835975	19
19	P Cripps	Carlton	13.140569	21
51	K Simpson	Carlton	7.497976	18
91	E Curnow	Carlton	3.734706	19
4	B Grundy	Collingwood	19.136407	22
15	S Sidebottom	Collingwood	14.403396	18
26	A Treloar	Collingwood	11.173963	11
12	D Heppell	Essendon	15.073503	19
14	Z Merrett	Essendon	14.597278	18
61	D Smith	Essendon	6.367458	20
9	L Neale	Fremantle	16.038782	21
13	N Fyfe	Fremantle	14.864971	11
75	D Mundy	Fremantle	4.895438	19
22	S Coniglio	GWS	12.484317	20
23	J Kelly	GWS	12.245194	14
25	C Ward	GWS	11.251450	19
6	P Dangerfield	Geelong	18.207483	21
11	J Selwood	Geelong	15.749850	18
18	G Ablett	Geelong	13.403062	15
76	J Witts	Gold Coast	4.745802	13
79	J Lyons	Gold Coast	4.482626	14
114	D Swallow	Gold Coast	2.731594	15
0	T Mitchell	Hawthorn	36.577208	20
41	L Breust	Hawthorn	9.132324	16
50	J Gunston	Hawthorn	7.641945	19
1	M Gawn	Melbourne	20.686854	22
2	C Oliver	Melbourne	19.705244	20
30	J Hogan	Melbourne	10.510504	13
29	S Higgins	North Melbourne	10.682736	19
40	B Brown	North Melbourne	9.211568	13
43	B Cunnington	North Melbourne	8.949073	17
28	O Wines	Port Adelaide	10.948001	16
35	R Gray	Port Adelaide	10.017038	17
53	J Westhoff	Port Adelaide	7.425425	21
3	D Martin	Richmond	19.674984	19
36	J Riewoldt	Richmond	9.981867	15
57	K Lambert	Richmond	6.897522	12
38	S Ross	St Kilda	9.756485	17
46	J Steven	St Kilda	8.103860	17
90	J Steele	St Kilda	3.739189	16
16	J Kennedy	Sydney	14.297521	16
21	L Franklin	Sydney	12.562711	13
24	J Lloyd	Sydney	12.019714	20
8	A Gaff	West Coast	16.217552	18
10	E Yeo	West Coast	15.866418	20
34	J Redden	West Coast	10.200779	16
5	J Macrae	Western Bulldogs	18.396814	17
39	M Bontempelli	Western Bulldogs	9.317955	16
45	L Hunter	Western Bulldogs	8.758631	17

If you're looking for a round by round breakdown, just have a look at the test dataframe.

test[['date', 'round', 'player', 'team', 'opposition', 'margin', 'SC', 'predicted_votes_scaled']].tail(25)

	date	round	player	team	opposition	margin	SC	predicted_votes_scaled
27236	24/08/2018	23	J Laverde	Essendon	Port Adelaide	22	0.028190	0.098214
27237	24/08/2018	23	Z Merrett	Essendon	Port Adelaide	22	0.030918	0.316848
27238	24/08/2018	23	D Parish	Essendon	Port Adelaide	22	0.030615	0.031097
27239	24/08/2018	23	A Saad	Essendon	Port Adelaide	22	0.026069	0.000000
27240	24/08/2018	23	D Smith	Essendon	Port Adelaide	22	0.028190	0.714852
27241	24/08/2018	23	D Zaharakis	Essendon	Port Adelaide	22	0.034859	0.331842
27242	25/08/2018	23	H Ballantyne	Fremantle	Collingwood	-9	0.025773	0.594485
27243	25/08/2018	23	T Duman	Fremantle	Collingwood	-9	0.038205	0.068379
27244	25/08/2018	23	J Hamling	Fremantle	Collingwood	-9	0.024560	0.000000
27245	25/08/2018	23	B Hill	Fremantle	Collingwood	-9	0.031837	0.755081
27246	25/08/2018	23	E Langdon	Fremantle	Collingwood	-9	0.033354	0.420121
27247	25/08/2018	23	D Mundy	Fremantle	Collingwood	-9	0.023954	0.000000
27248	25/08/2018	23	L Neale	Fremantle	Collingwood	-9	0.042450	1.012576
27249	25/08/2018	23	S Switkowski	Fremantle	Collingwood	-9	0.027289	0.000000
27250	25/08/2018	23	T Adams	Collingwood	Fremantle	9	0.030018	0.207985
27251	25/08/2018	23	M Cox	Collingwood	Fremantle	9	0.024864	0.017404
27252	25/08/2018	23	J Crisp	Collingwood	Fremantle	9	0.031534	0.022959
27253	25/08/2018	23	B Grundy	Collingwood	Fremantle	9	0.045482	1.118902
27254	25/08/2018	23	B Maynard	Collingwood	Fremantle	9	0.026076	0.000000
27255	25/08/2018	23	B Mihocek	Collingwood	Fremantle	9	0.025167	0.000000
27256	25/08/2018	23	S Pendlebury	Collingwood	Fremantle	9	0.029715	0.888746
27257	25/08/2018	23	T Phillips	Collingwood	Fremantle	9	0.028199	0.009869
27258	25/08/2018	23	S Sidebottom	Collingwood	Fremantle	9	0.039115	0.880034
27259	25/08/2018	23	B Sier	Collingwood	Fremantle	9	0.023044	0.002407
27260	25/08/2018	23	J Thomas	Collingwood	Fremantle	9	0.026986	0.001053

And there we have it! In a single notebook we have made a fairly good Brownlow predictive model. Enjoy.

Disclaimer

Note that whilst models and automated strategies are fun and rewarding to create, we can't promise that your model or betting strategy will be profitable, and we make no representations in relation to the code shared or information on this page. If you're using this code or implementing your own strategies, you do so entirely at your own risk and you are responsible for any winnings/losses incurred. Under no circumstances will Betfair be liable for any loss or damage you suffer.