Modelling the Brownlow

This walkthrough will provide a brief, yet effective tutorial on how to model the Brownlow medal. We will use data from 2010 to 2018, which includes Supercoach points and other useful stats.

The output will be the number of votes predicted for each player in each match, and we will aggregate these into season totals for each player, each team, and the competition as a whole. No doubt we'll have Mitchell right up the top, and if we don't, then we know we've done something wrong!

# Import libraries
import pandas as pd
import h2o
from h2o.automl import H2OAutoML
import numpy as np
from sklearn.preprocessing import StandardScaler
import os
import pickle

# Change notebook settings
pd.options.display.max_columns = None
pd.options.display.max_rows = 300
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

This walkthrough uses H2O's AutoML to automate machine learning. We have saved the models created by AutoML into the Github repo, and they can be loaded into the notebook to save you time training the models. However, a drawback of H2O is that you can only load models if your currently installed version of H2O matches the version used to create the model. You can check which version of H2O you have installed by running h2o.__version__:

# Checking the current version of H2O
print(f'Current version of H2O installed: {h2o.__version__}')
Current version of H2O installed: 3.36.0.3

If you have a different version of H2O installed, you have two options. You can train the models yourself (all the code to do so is commented out in the notebook), or you can pip uninstall H2O and then pip install it again, pinned to version 3.36.0.3. The code to do this is below and will only take a minute or two:

# # Uncomment the code below if you need to uninstall the current version of H2O and reinstall version 3.36.0.3
# # The following command removes the H2O module for Python.
# pip uninstall h2o

# # Next, use pip to install this version of the H2O Python module.
# pip install http://h2o-release.s3.amazonaws.com/h2o/rel-zorn/3/Python/h2o-3.36.0.3-py2.py3-none-any.whl

EDA

Read in the data

I have collated this data using the fitzRoy R package, merging the afltables dataset with the footywire dataset so that we have Supercoach points and other advanced stats alongside Brownlow votes. Let's read in the data and have a sneak peek at what it looks like.

brownlow_data = pd.read_csv('data/afl_brownlow_data.csv')

brownlow_data.tail(3)
date season round venue ID match_id player jumper_no team opposition status team_score opposition_score margin brownlow_votes CP UP ED DE CM GA MI5 one_perc BO TOG K HB D M G B T HO I50 CL CG R50 FF FA AF SC
76585 25/08/2018 2018 23 Optus Stadium 12587 9711 J Stephenson 35 Collingwood Fremantle Away 76 67 9 NaN 3 5 5 62.5 0 0 3 2 0 87 6 2 8 3 2 1 3 0 0 0 1 0 0 0 56 56
76586 25/08/2018 2018 23 Optus Stadium 12144 9711 J Thomas 24 Collingwood Fremantle Away 76 67 9 NaN 6 11 14 82.4 0 2 0 1 0 79 11 6 17 4 1 0 3 0 4 1 4 1 1 3 67 89
76587 25/08/2018 2018 23 Optus Stadium 11549 9711 T Varcoe 18 Collingwood Fremantle Away 76 67 9 NaN 7 4 7 53.8 0 1 0 2 0 73 6 7 13 1 0 0 3 0 2 0 4 0 1 1 45 53

It looks like we've got about 76,000 rows of data, with stats like hitouts, clangers, effective disposals and so on. Let's explore a couple of scenarios. Using my domain knowledge of footy, I can hypothesise that if a player kicks 5 goals, he is pretty likely to poll votes. Similarly, if a player gets 30 possessions and 2+ goals, he is also likely to poll votes. Let's have a look at the mean votes for players in both of these situations.

Exploring votes if a bag is kicked (5+ goals)

brownlow_data.query('G >= 5').groupby('season').brownlow_votes.mean()
print("Mean votes if a player has kicked a bag:", brownlow_data.query('G >= 5').brownlow_votes.mean())
season
2010    1.420455
2011    1.313433
2012    1.413333
2013    1.253731
2014    1.915254
2015    1.765625
2016    1.788732
2017    2.098361
2018    0.000000
Name: brownlow_votes, dtype: float64
Mean votes if a player has kicked a bag: 1.4708818635607321

Exploring votes if the player has 30+ possies & 2+ goals

brownlow_data.query('G >= 2 and D >= 30').groupby('season').brownlow_votes.mean()
print("Mean votes if a player has 30 possies and kicks 2+ goals:", brownlow_data.query('G >= 2 and D >= 30').brownlow_votes.mean())
season
2010    1.826923
2011    1.756410
2012    2.118421
2013    2.000000
2014    2.253731
2015    2.047619
2016    2.103448
2017    2.050000
2018    0.000000
Name: brownlow_votes, dtype: float64
Mean votes if a player has 30 possies and kicks 2+ goals: 1.8741379310344828

As suspected, the average votes in these two situations are well above normal: 1.47 if a player kicks a bag, and 1.87 if he has 30+ possessions and 2+ goals. That's huge. Let's get an idea of the average votes for each player. It should be around 6/44, as there are always 6 votes per match and around 44 players per match.
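As a back-of-envelope check, the baseline follows from the numbers above (assuming exactly 6 votes shared among roughly 44 players per match):

```python
# 6 votes per match, split across ~44 players, gives the expected
# mean votes per player per match if votes were spread uniformly
expected_mean_votes = 6 / 44
print(round(expected_mean_votes, 3))  # 0.136
```

The observed mean below comes in slightly under this, partly because 2018 votes had not yet been counted at the time the data was collected.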

brownlow_data.brownlow_votes.mean()
0.12347341475121326

So the average vote is 0.12. Let's see how this changes if the player is a captain. I have collected data on which players were captains from Wikipedia and collated it into a CSV. Let's load it in, create an "is the player captain" feature, and then check the average votes for captains.

Create Is Player Captain Feature

captains = pd.read_csv('data/captains.csv').set_index('player')

def is_captain_for_that_season(captains_df, player, year):

    if player in captains_df.index:
        # Get years they were captain
        seasons = captains_df.loc[player].season.split('-')
        if len(seasons) == 1:
            seasons_captain = list(map(int, seasons))
        elif len(seasons) == 2:
            if seasons[1] == '':
                seasons_captain = list(range(int(seasons[0]), 2019))
            else:
                seasons_captain = list(range(int(seasons[0]), int(seasons[1]) + 1))

        if year in seasons_captain:
            return 1
    return 0

brownlow_data['is_captain'] = brownlow_data.apply(lambda x: is_captain_for_that_season(captains, x.player, x.season), axis='columns')
brownlow_data.query('is_captain == 1').groupby('season').brownlow_votes.mean()
print("Mean votes if a player is captain:", brownlow_data.query('is_captain == 1').brownlow_votes.mean())
season
2010    0.408497
2011    0.429936
2012    0.274194
2013    0.438725
2014    0.519663
2015    0.447222
2016    0.347826
2017    0.425806
2018    0.000000
Name: brownlow_votes, dtype: float64
Mean votes if a player is captain: 0.36661698956780925

This is significantly higher than for non-captains. What would be interesting is to look at the average difference in votes for the same player between seasons when he was captain and when he wasn't, to see whether there is a 'captain bias' in Brownlow voting. Go ahead and try it. For now, we're going to move on to feature creation.

Feature Creation

Let's make a range of features, including:

* Ratios of each statistic per game
* If the player is a captain
* If they kicked a bag (5+ goals)
* If the player's team won the game
* If they kicked 2+ goals and had 30+ possies

First, we will make the ratio features. What matters is not how much of a certain stat a player has, but how much of that stat he has relative to everyone else in the same match. It doesn't matter if Dusty Martin has 31 possessions if Tom Mitchell has had 50; Mitchell is probably more likely to poll (assuming all else is equal). So rather than using the raw number of possessions, for example, we divide each player's possessions by the total possessions in the game. To do this we'll use pandas' groupby and transform methods.
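If you haven't used transform before, a toy sketch (with made-up numbers) shows how it broadcasts a group-level computation back to every row, which is why the result can be joined straight onto the original frame:

```python
import pandas as pd

# Two toy "matches", two players each, with disposal counts
toy = pd.DataFrame({'match_id': [1, 1, 2, 2],
                    'D': [30, 10, 20, 20]})

# transform returns one value per original row (unlike agg, which
# returns one value per group), so the output aligns with the frame
toy['D_ratio'] = toy.groupby('match_id')['D'].transform(lambda x: x / x.sum())
print(toy['D_ratio'].tolist())  # [0.75, 0.25, 0.5, 0.5]
```

Each ratio is the player's share of that stat within his match, which is exactly what the cell below computes across all 26 stat columns at once.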

Create Ratios As Features

%%time

# Get a list of stats of which to create ratios for
ratio_cols = ['CP', 'UP', 'ED', 'DE', 'CM', 'GA', 'MI5', 'one_perc', 'BO', 'TOG', 
               'K', 'HB', 'D', 'M', 'G', 'B', 'T', 'HO', 'I50', 'CL', 'CG', 'R50', 
               'FF', 'FA', 'AF', 'SC']

# Create a ratios df
ratios = (brownlow_data.copy()
          .loc[:, ['match_id'] + ratio_cols]
          .groupby('match_id')
          .transform(lambda x: x / x.sum()))

feature_cols = ['date', 'season', 'round', 'venue', 'ID', 'match_id', 'player', 'jumper_no', 'team', 
                'opposition', 'status', 'team_score', 'opposition_score', 'margin', 'brownlow_votes']

# Create a features df - join the ratios to this df
features = (brownlow_data[feature_cols].copy()
               .join(ratios))
Wall time: 17.1 s

features.head()
date season round venue ID match_id player jumper_no team opposition status team_score opposition_score margin brownlow_votes CP UP ED DE CM GA MI5 one_perc BO TOG K HB D M G B T HO I50 CL CG R50 FF FA AF SC
0 25/03/2010 2010 1 MCG 11559 5089 J Anderson 26 Carlton Richmond Away 120 64 56 0.0 0.024194 0.021186 0.020599 0.021931 0.00 0.000000 0.000000 0.000000 0.0 0.024411 0.017032 0.028481 0.022008 0.019802 0.000000 0.0000 0.029197 0.0 0.000000 0.000000 0.010204 0.014286 0.00000 0.025 0.020075 0.019353
1 25/03/2010 2010 1 MCG 4060 5089 E Betts 19 Carlton Richmond Away 120 64 56 0.0 0.044355 0.014831 0.024345 0.021804 0.05 0.133333 0.130435 0.033333 0.0 0.025243 0.021898 0.031646 0.026135 0.029703 0.111111 0.1875 0.043796 0.0 0.000000 0.014925 0.010204 0.000000 0.00000 0.000 0.034504 0.041125
2 25/03/2010 2010 1 MCG 3281 5089 P Bower 18 Carlton Richmond Away 120 64 56 0.0 0.016129 0.038136 0.044944 0.030602 0.10 0.000000 0.000000 0.100000 0.0 0.026352 0.036496 0.031646 0.034388 0.054455 0.000000 0.0000 0.036496 0.0 0.009804 0.000000 0.000000 0.014286 0.00000 0.000 0.037014 0.039311
3 25/03/2010 2010 1 MCG 4056 5089 A Carrazzo 44 Carlton Richmond Away 120 64 56 2.0 0.032258 0.069915 0.059925 0.025501 0.00 0.000000 0.000000 0.000000 0.0 0.024133 0.051095 0.060127 0.055021 0.024752 0.000000 0.0000 0.029197 0.0 0.078431 0.074627 0.040816 0.057143 0.04878 0.025 0.041092 0.040822
4 25/03/2010 2010 1 MCG 11535 5089 B Gibbs 4 Carlton Richmond Away 120 64 56 1.0 0.032258 0.031780 0.029963 0.022186 0.10 0.000000 0.130435 0.000000 0.0 0.023024 0.036496 0.025316 0.031637 0.029703 0.074074 0.0625 0.029197 0.0 0.009804 0.044776 0.010204 0.000000 0.02439 0.025 0.033250 0.033868

Kicked A Bag Feature

features['kicked_a_bag'] = brownlow_data.G.apply(lambda x: 1 if x >= 5 else 0)

Is Captain Feature

features['is_captain'] = features.apply(lambda x: is_captain_for_that_season(captains, x.player, x.season), axis='columns')

Won the Game Feature

features['team_won'] = np.where(features.margin > 0, 1, 0)

30+ & 2+ Goals Feature

features['got_30_possies_2_goals'] = np.where((brownlow_data.G >= 2) & (brownlow_data.D >= 30), 1, 0)

Previous Top 10 Finish Feature

I have a strong feeling that past performance may be a predictor of future performance in the Brownlow. For example, last year Dusty Martin won the Brownlow, so the umpires may have a bias towards Dusty this year because he is already on their radar as a good player. Let's create a feature which counts how many times the player has previously finished in the top 10. We'll write a function for this and apply it to the afltables dataset, which has data back to 1897. We will then create a lookup table of the top 10 for each season and merge this table with our current features df.

afltables = pd.read_csv('data/afltables_stats.csv').query('Season >= 2000')

def replace_special_characters(name):
    name = name.replace("'", "").replace("-", " ").lower()
    name_split = name.split()

    if len(name_split) > 2:
        first_name = name_split[0]
        last_name = name_split[-1]
        name = first_name + ' ' + last_name

    name_split_2 = name.split()
    name = name_split_2[0][0] + ' ' + name_split_2[1]
    return name.title()

afltables = (afltables.assign(player=lambda df: df['First.name'] + ' ' + df.Surname)
                .assign(player=lambda df: df.player.apply(replace_special_characters))
                .rename(columns={'Brownlow.Votes': 'brownlow_votes', 'Season': 'season', 'Playing.for': 'team'}))

Create Top 10 Rank Lookup Table

brownlow_votes_yearly = (afltables.groupby(['season', 'player', 'team'], as_index=False)
                                  .brownlow_votes
                                  .sum())

brownlow_votes_yearly['yearly_rank'] = (brownlow_votes_yearly.groupby('season')
                                                             .brownlow_votes
                                                             .rank(method='max', ascending=False))

# Filter to only get a dataframe since 2000 and only the top 10 players from each season
brownlow_votes_top_10 = brownlow_votes_yearly.query('yearly_rank < 11 & season >= 2000')

brownlow_votes_top_10.head(3)

def how_many_times_top_10(top_10_df, player, year):
    times = len(top_10_df[(top_10_df.player == player) & (top_10_df.season < year)])
    return times

features['times_in_top_10'] = features.apply(lambda x: how_many_times_top_10(brownlow_votes_top_10, x.player, x.season), axis=1) 
season player team brownlow_votes yearly_rank
27 2000.0 A Koutoufides Carlton 19.0 4.0
36 2000.0 A Mcleod Adelaide 20.0 3.0
105 2000.0 B Ratten Carlton 18.0 6.0
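A quick aside on the rank call above: method='max' assigns tied values the largest (worst) rank in their tie group, so ties never push extra players inside the top-10 cutoff for free. A toy sketch with made-up vote totals:

```python
import pandas as pd

# Hypothetical season vote totals for four players; two are tied on 19
votes = pd.Series([20, 19, 19, 18])

# With ascending=False, the highest total ranks 1st; the tied pair
# both receive rank 3 (the max of positions 2 and 3), not rank 2
print(votes.rank(method='max', ascending=False).tolist())  # [1.0, 3.0, 3.0, 4.0]
```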

Average Brownlow Votes Per Game Last Season Feature

# Create a brownlow votes lookup table
brownlow_votes_lookup_table = (brownlow_data.groupby(['player', 'team', 'season'], as_index=False)
                                    .brownlow_votes
                                    .mean()
                                    .assign(next_season=lambda df: df.season + 1)
                                    .rename(columns={
                                        'brownlow_votes': 'ave_votes_last_season'
                                    }))

# Have a look at Cripps to check if it's working
brownlow_votes_lookup_table[brownlow_votes_lookup_table.player == 'P Cripps']

# Merge it to our features df
features_with_votes_last_season = (pd.merge(features, brownlow_votes_lookup_table.drop(columns='season'),

                                             left_on=['player', 'team', 'season'], 
                                             right_on=['player', 'team', 'next_season'],
                                             how='left')
                                  .drop(columns=['next_season'])
                                  .fillna(0))
player team season ave_votes_last_season next_season
4377 P Cripps Carlton 2014 0.000000 2015
4378 P Cripps Carlton 2015 0.300000 2016
4379 P Cripps Carlton 2016 0.857143 2017
4380 P Cripps Carlton 2017 0.333333 2018
4381 P Cripps Carlton 2018 0.000000 2019

Historic Performance Relative To Model Feature

It is well known that some players are good Brownlow performers. For whatever reason, they always poll much better than their stats may suggest. Lance Franklin and Bontempelli are probably in this category. Perhaps these players have an X-factor that Machine Learning models struggle to pick up on. To get around this, let's create a feature which looks at the player's performance relative to the model's prediction. To do this, we'll need to train 7 different models and generate predictions with each, one for each season from 2011 to 2017.
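The feature itself is just the gap between a player's predicted and actual season vote totals. A minimal sketch with hypothetical numbers (the player names and values here are illustrative only):

```python
import pandas as pd

# Hypothetical season totals: actual votes vs model-predicted votes
season_totals = pd.DataFrame({'player': ['L Franklin', 'T Mitchell'],
                              'brownlow_votes': [22.0, 25.0],
                              'predicted_votes_scaled': [14.4, 25.2]})

# A strongly negative error flags a player who polls better than
# his stats suggest, which is the signal we want to carry forward
season_totals['error'] = (season_totals.predicted_votes_scaled
                          - season_totals.brownlow_votes)
print(season_totals['error'].round(1).tolist())  # [-7.6, 0.2]
```

Merged onto the next season's rows, this error becomes a per-player "polls above expectation" signal.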

To create a model for each season, we will use h2o's AutoML. If you're new to h2o, please read about it here. It can be used in both R and Python.

The loss metric we will use is Mean Absolute Error (MAE).

As we are using regression, some predicted values will be negative. We will convert these negative values to 0, as it doesn't make sense to poll negative Brownlow votes. Similarly, the raw predictions for a match won't sum to exactly 6 votes, so we will scale them so that each match's predictions total exactly 6.
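That clip-and-rescale step is easy to see on its own. A minimal sketch with made-up raw predictions for two three-player "matches":

```python
import pandas as pd

# Hypothetical raw regression outputs for two matches of 3 players each
preds = pd.DataFrame({'match_id': [1, 1, 1, 2, 2, 2],
                      'predicted_votes': [3.0, 1.0, -0.5, 1.5, 0.5, 0.0]})

# Step 1: negative votes make no sense, so clip them at 0
preds['clipped'] = preds.predicted_votes.clip(lower=0)

# Step 2: rescale so every match's predictions sum to exactly 6 votes
match_totals = preds.groupby('match_id').clipped.transform('sum')
preds['scaled'] = preds.clipped / match_totals * 6

print(preds.groupby('match_id').scaled.sum().tolist())  # [6.0, 6.0]
```

The same two steps appear inside the (commented-out) training loop below and again when we predict the 2017 and 2018 counts.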

So that you don't have to train these models yourself, I have saved the models and we will load them in. If you are keen to train the models yourself, simply uncomment the code below and run the cell. To bulk uncomment, highlight the rows and press ctrl + '/'.

h2o.init()

# Uncomment the code below if you want to train the models yourself - otherwise, we will load them in the load cell from disk


# aml_yearly_model_objects = {}
# yearly_predictions_dfs = {}

# feature_cols = ['margin', 'CP', 'UP', 'ED', 'DE',
#        'CM', 'GA', 'MI5', 'one_perc', 'BO', 'TOG', 'K', 'HB', 'D', 'M', 'G',
#        'B', 'T', 'HO', 'I50', 'CL', 'CG', 'R50', 'FF', 'FA', 'AF', 'SC']

# for year in range(2011, 2018):
#     # Filter the data to only include past data
#     train_historic = brownlow_data[brownlow_data.season < year].copy()

#     # Convert to an h2o frame
#     train_h2o_historic = h2o.H2OFrame(train_historic)

#     # Create an AutoML object
#     aml = H2OAutoML(max_runtime_secs=30,
#                    balance_classes=True,
#                    seed=42)

#     # Train the model
#     aml.train(y='brownlow_votes', x=feature_cols, training_frame=train_h2o_historic)

#     # save the model
#     model_path = h2o.save_model(model=aml.leader, path="models", force=True)

#     # Get model id
#     model_name = aml.leaderboard[0, 'model_id']

#     # Rename the model on disk
#     os.rename(f'models/{model_name}', f'models/yearly_model_{year}')

#     # Append the best model to a list
#     aml_yearly_model_objects[year] = aml.leader

#     # Make predictions on test set for that year
#     test_historic = brownlow_data[brownlow_data.season == year].copy()
#     test_h2o_historic = h2o.H2OFrame(test_historic)

#     preds = aml.predict(test_h2o_historic).as_data_frame()
#     test_historic['predicted_votes'] = preds.values

#     # Convert negative predictions to 0
#     test_historic['predicted_votes_neg_to_0'] = test_historic.predicted_votes.apply(lambda x: 0 if x < 0 else x)

#     # Create a total match votes column - which calculates the number of votes predicted in each game when the predictions 
#     # are unscaled
#     test_historic['unscaled_match_votes'] = test_historic.groupby('match_id').predicted_votes_neg_to_0.transform('sum')

#     # Scale predictions
#     test_historic['predicted_votes_scaled'] = test_historic.predicted_votes_neg_to_0 / test_historic.unscaled_match_votes * 6

#     # Aggregate the predictions
#     test_grouped = (test_historic.groupby(['player', 'team'], as_index=False)
#                                  .sum()
#                                  .sort_values(by='brownlow_votes', ascending=False)
#                                  .assign(mae=lambda df: abs(df.predicted_votes_scaled - df.brownlow_votes)))

#     test_grouped['error'] = test_grouped.predicted_votes_scaled - test_grouped.brownlow_votes
#     test_grouped['next_year'] = year + 1

#     # Add this years predictions df to a dictionary to use later
#     yearly_predictions_dfs[year] = test_grouped

# preds_errors = None

# for key, value in yearly_predictions_dfs.items():
#     if preds_errors is None:
#         preds_errors = value[['player', 'season', 'next_year', 'brownlow_votes', 'predicted_votes_scaled', 'error']]
#     else:
#         preds_errors = preds_errors.append(value[['player', 'season', 'next_year', 'brownlow_votes', 'predicted_votes_scaled', 'error']], sort=True)

# with open('data/prediction_errors_df.pickle', 'wb') as handle:
#     pickle.dump(preds_errors, handle)
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) Client VM (build 25.301-b09, mixed mode)
  Starting server from C:\Users\zhoui\greyhounds_bruno\greyhound-modelling\venv_greyhounds\Lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\zhoui\AppData\Local\Temp\tmp6umhk9gp
  JVM stdout: C:\Users\zhoui\AppData\Local\Temp\tmp6umhk9gp\h2o_ZhouI_started_from_python.out
  JVM stderr: C:\Users\zhoui\AppData\Local\Temp\tmp6umhk9gp\h2o_ZhouI_started_from_python.err

C:\Users\zhoui\greyhounds_bruno\greyhound-modelling\venv_greyhounds\lib\site-packages\h2o\backend\server.py:386: UserWarning:   You have a 32-bit version of Java. H2O works best with 64-bit Java.
  Please download the latest 64-bit Java SE JDK from Oracle.

  Server is running at http://127.0.0.1:54325
Connecting to H2O server at http://127.0.0.1:54325 ... successful.

H2O_cluster_uptime: 03 secs
H2O_cluster_timezone: Australia/Sydney
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.36.0.3
H2O_cluster_version_age: 5 days
H2O_cluster_name: H2O_from_python_ZhouI_yobrjv
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 247.5 Mb
H2O_cluster_total_cores: 12
H2O_cluster_allowed_cores: 12
H2O_cluster_status: locked, healthy
H2O_connection_url: http://127.0.0.1:54325
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.9.6 final
# Load predictions error df    
with open('data/prediction_errors_df.pickle', 'rb') as handle:
    preds_errors = pickle.load(handle)

# Look at last years predictions
preds_errors.query('next_year == 2018').sort_values(by='brownlow_votes', ascending=False).head(20)
brownlow_votes error next_year player predicted_votes_scaled season
139 36.0 -3.431041 2018 D Martin 32.568959 44374
486 33.0 0.065131 2018 P Dangerfield 33.065131 42357
619 25.0 0.243510 2018 T Mitchell 25.243510 44374
279 23.0 -9.267806 2018 J Kennedy 13.732194 38323
376 22.0 -7.621163 2018 L Franklin 14.378837 44374
278 21.0 -4.494170 2018 J Kelly 16.505830 42357
519 20.0 -2.909334 2018 R Sloane 17.090666 44374
410 19.0 -7.124114 2018 M Bontempelli 11.875886 44374
483 18.0 -3.734412 2018 O Wines 14.265588 44374
121 17.0 -4.034390 2018 D Beams 12.965610 38323
390 16.0 -2.797669 2018 L Parker 13.202331 44374
561 15.0 -5.458874 2018 S Pendlebury 9.541126 32272
463 15.0 -4.892862 2018 N Fyfe 10.107138 42357
42 15.0 -4.379072 2018 B Ebert 10.620928 44374
651 15.0 1.011272 2018 Z Merrett 16.011272 42357
578 14.0 2.069385 2018 T Adams 16.069385 44374
34 14.0 -4.886668 2018 B Brown 9.113332 44374
172 14.0 0.211867 2018 D Zorko 14.211867 42357
184 14.0 -0.620086 2018 G Ablett 13.379914 28238
389 14.0 -2.811169 2018 L Neale 11.188831 42357

Look at that! A simple Machine Learning ensemble model, built using AutoML, predicted last year's winner! That's impressive. As we can also see, it predicted Bontempelli would poll only 11.88 votes, when he actually polled 19 - a huge discrepancy. Let's use this error as a feature.

features_with_historic_perf_relative_to_model = \
    (features_with_votes_last_season.pipe(pd.merge, preds_errors[['player', 'next_year', 'error']],
                                          left_on=['player', 'season'],
                                          right_on=['player', 'next_year'],
                                          how='left')
                                     .fillna(0)
                                     .rename(columns={'error': 'error_last_season'})
                                     .drop_duplicates(subset=['player', 'round', 'SC']))

Filtering the data to only include the top 20 SC for each match

Logically, it is extremely unlikely that a player will poll votes if his Supercoach score is outside the top 20 for the match. By eliminating the remaining 20+ players, we can reduce the noise in the data, as we are almost certain players in the bottom half won't poll. Let's explore how many players have polled when outside the top 20, and then filter our df if this number is insignificant.

# Find how often players poll votes when outside the top 20 SC for the match

brownlow_data['SC_rank_match'] = brownlow_data.groupby('match_id').SC.rank(method='max', ascending=False)

brownlow_data.query('SC_rank_match > 20 and season > 2014').brownlow_votes.value_counts()
0.0    18330
1.0       14
2.0        8
3.0        2
Name: brownlow_votes, dtype: int64

Since 2015, only 24 players have polled votes without being in the top 20 SC scorers for their match.

features_with_sc_rank = features_with_historic_perf_relative_to_model.copy()

features_with_sc_rank['SC_rank_match'] = features_with_sc_rank.groupby('match_id').SC.rank(method='max', ascending=False)

# Filter out rows with a SC rank of below 20
features_with_sc_rank_filtered = features_with_sc_rank.query('SC_rank_match <= 20')
# Filter out 2010 and 2011 as we used these seasons to create historic model performance features
features_last_before_train = features_with_sc_rank_filtered.query('season != 2010 and season != 2011').reset_index(drop=True)

features_last_before_train.head(3)
date season round venue ID match_id player jumper_no team opposition status team_score opposition_score margin brownlow_votes CP UP ED DE CM GA MI5 one_perc BO TOG K HB D M G B T HO I50 CL CG R50 FF FA AF SC kicked_a_bag is_captain team_won got_30_possies_2_goals times_in_top_10 ave_votes_last_season next_year error_last_season SC_rank_match
0 24/03/2012 2012 1 ANZ Stadium 11635 5343 C Bird 14 Sydney GWS Away 100 37 63 0.0 0.035842 0.041002 0.030769 0.019378 0.000000 0.083333 0.055556 0.019802 0.0 0.023064 0.034562 0.037383 0.035762 0.039409 0.105263 0.111111 0.049296 0.0000 0.060606 0.093333 0.063636 0.027778 0.025 0.075 0.039041 0.032662 0 0 1 0 0 0.052632 2012.0 0.136311 6.0
1 24/03/2012 2012 1 ANZ Stadium 1013 5343 J Bolton 24 Sydney GWS Away 100 37 63 0.0 0.039427 0.036446 0.028846 0.019607 0.000000 0.000000 0.000000 0.049505 0.0 0.024203 0.027650 0.040498 0.033113 0.009852 0.052632 0.000000 0.042254 0.0125 0.040404 0.066667 0.009091 0.000000 0.025 0.025 0.029819 0.038767 0 0 1 0 0 0.526316 2012.0 -0.833178 3.0
2 24/03/2012 2012 1 ANZ Stadium 1012 5343 A Goodes 37 Sydney GWS Away 100 37 63 0.0 0.032258 0.029613 0.028846 0.023332 0.037037 0.083333 0.055556 0.009901 0.0 0.026196 0.029954 0.024922 0.027815 0.029557 0.000000 0.055556 0.014085 0.0125 0.070707 0.040000 0.027273 0.041667 0.025 0.000 0.025822 0.027473 0 1 1 0 5 0.761905 2012.0 -5.718150 14.0

Modelling The 2017 Brownlow

Now that we have all of our features, we can simply create a training set (2012-2016) and a test set (2017), and make our predictions for last year! We will use AutoML for this process again. As before, rather than making you wait for the model to train, I have saved the model so you can simply load it in. We will also scale our features. We can then see how our model went at predicting last year's Brownlow, creating a baseline for this year's predictions. We will then predict this year's vote count.

train_baseline = features_last_before_train.query("season < 2017")
holdout = features_last_before_train.query("season == 2017")

scale_cols = ['team_score', 'opposition_score', 'margin', 'CP', 'UP', 'ED', 'DE',
       'CM', 'GA', 'MI5', 'one_perc', 'BO', 'K', 'HB', 'D', 'M', 'G',
       'B', 'T', 'HO', 'I50', 'CL', 'CG', 'R50', 'FF', 'FA', 'AF', 'SC']

other_feature_cols = ['is_captain', 'kicked_a_bag', 'team_won', 'got_30_possies_2_goals', 'times_in_top_10', 
                      'ave_votes_last_season', 'error_last_season', 'SC_rank_match']


all_feature_cols = scale_cols + other_feature_cols

# Scale features
scaler = StandardScaler()

train_baseline_scaled = train_baseline.copy()
train_baseline_scaled[scale_cols] = scaler.fit_transform(train_baseline[scale_cols])

holdout_scaled = holdout.copy()
holdout_scaled[scale_cols] = scaler.transform(holdout[scale_cols])

# Convert categorical columns to categoricals
train_baseline_h2o = h2o.H2OFrame(train_baseline_scaled)
holdout_h2o = h2o.H2OFrame(holdout_scaled)

for col in ['is_captain', 'kicked_a_bag', 'team_won', 'got_30_possies_2_goals']:
    train_baseline_h2o[col] = train_baseline_h2o[col].asfactor()
    holdout_h2o[col] = holdout_h2o[col].asfactor()
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%

Below, the code for training and saving the 2017 model is commented out. Rather than training it again, we will just load it in. Uncomment this section if you want to train it yourself.

# aml_2017_model = H2OAutoML(max_runtime_secs = 60*30,
#                    balance_classes=True,
#                    seed=42)

# aml_2017_model.train(y='brownlow_votes', x=all_feature_cols, training_frame=train_baseline_h2o)
# # save the model
# model_path = h2o.save_model(model=aml_2017_model.leader, path="models", force=True)

# # Get model id
# model_name = aml_2017_model.leaderboard[0, 'model_id']

# # Rename the model on disk
# os.rename(f'models/{model_name}', f'models/brownlow_2017_model_v1')

# Load model in
aml_2017_model = h2o.load_model('models/brownlow_2017_model_v1')
# Predict the 2017 brownlow count
preds_final_2017_model = aml_2017_model.predict(holdout_h2o)


# Scale these predictions - change negatives to 0s and scale so each game predicts 6 votes total
holdout = (holdout.assign(predicted_votes=preds_final_2017_model.as_data_frame().values)
                  .assign(predicted_votes_neg_to_0=lambda df: df.predicted_votes.apply(lambda x: 0 if x <0 else x))
                  .assign(unscaled_match_votes=lambda df: df.groupby('match_id').predicted_votes_neg_to_0.transform('sum'))
                  .assign(predicted_votes_scaled=lambda df: df.predicted_votes_neg_to_0 / df.unscaled_match_votes * 6))

# Create an aggregate votes df and show the average SC points and goals scored
agg_predictions_2017 = (holdout.groupby(['player', 'team'], as_index=False)
                        .agg({
                            'brownlow_votes': sum,
                            'predicted_votes_scaled': sum,
                            'SC': 'mean',
                            'G': 'mean'})
                        .sort_values(by='brownlow_votes', ascending=False)
                        .assign(mae=lambda df: abs(df.brownlow_votes - df.predicted_votes_scaled))
                        .reset_index(drop=True))
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%

agg_predictions_2017.head(15)
player team brownlow_votes predicted_votes_scaled SC G mae
0 D Martin Richmond 36.0 25.007991 0.037862 0.064869 10.992009
1 P Dangerfield Geelong 33.0 25.568804 0.042441 0.070819 7.431196
2 T Mitchell Hawthorn 25.0 19.078639 0.036040 0.016928 5.921361
3 L Franklin Sydney 22.0 12.574692 0.034640 0.149203 9.425308
4 J Kelly GWS 21.0 15.198580 0.034652 0.033772 5.801420
5 R Sloane Adelaide 20.0 15.755535 0.037068 0.034821 4.244465
6 J Kennedy Sydney 20.0 11.065483 0.032014 0.030508 8.934517
7 M Bontempelli Western Bulldogs 19.0 13.836173 0.033233 0.040498 5.163827
8 D Beams Brisbane 17.0 10.073366 0.034848 0.044998 6.926634
9 O Wines Port Adelaide 16.0 10.511818 0.031601 0.021967 5.488182
10 N Fyfe Fremantle 15.0 12.782723 0.033761 0.031680 2.217277
11 S Pendlebury Collingwood 15.0 9.135615 0.033855 0.013660 5.864385
12 B Ebert Port Adelaide 15.0 7.941416 0.032795 0.008431 7.058584
13 L Parker Sydney 15.0 12.431929 0.031366 0.030311 2.568071
14 Z Merrett Essendon 15.0 15.017767 0.033737 0.015362 0.017767

So whilst our model picked Dangerfield rather than Martin to win, it was still pretty damn accurate! Let's find the MAE for the top 100, 50, 25, and 10, and then compare it to 2018's MAE once this year's Brownlow has been counted.

for top_x in [10, 25, 50, 100]:
    temp_mae = round(agg_predictions_2017.iloc[:top_x].mae.mean(), 3)
    print(f"The average MAE for the top {top_x} is {temp_mae}")
The average MAE for the top 10 is 7.033
The average MAE for the top 25 is 4.962
The average MAE for the top 50 is 3.744
The average MAE for the top 100 is 2.696

Modelling This Year's Brownlow

Let's now predict this year's vote count. These predictions will be on the front page of the GitHub repo.

train = features_last_before_train.query("season < 2018")
test = features_last_before_train.query("season == 2018")

# Scale features
scaler = StandardScaler()

train_scaled = train.copy()
train_scaled[scale_cols] = scaler.fit_transform(train[scale_cols])

test_scaled = test.copy()
test_scaled[scale_cols] = scaler.transform(test[scale_cols])
# Convert categorical columns to categoricals

train_h2o = h2o.H2OFrame(train_scaled)
test_h2o = h2o.H2OFrame(test_scaled)

for col in ['is_captain', 'kicked_a_bag', 'team_won', 'got_30_possies_2_goals']:
    train_h2o[col] = train_h2o[col].asfactor()
    test_h2o[col] = test_h2o[col].asfactor()
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%

# # Train the model - this part is commented out as we will just load our model from disk

# aml = H2OAutoML(max_runtime_secs = 60*30,
#                    balance_classes=True,
#                    seed=42)

# aml.train(y='brownlow_votes', x=all_feature_cols, training_frame=train_h2o)
# # save the model
# model_path = h2o.save_model(model=aml.leader, path="models", force=True)

# # Get model id
# model_name = aml.leaderboard[0, 'model_id']

# # Rename the model on disk
# os.rename(f'models/{model_name}', f'models/brownlow_2018_model_v1')

# Load model in
aml = h2o.load_model('models/brownlow_2018_model_v1')
# Predict the 2018 brownlow count
preds_final_2018_model = aml.predict(test_h2o)


# Scale these predictions - change negatives to 0s and scale so each game predicts 6 votes total
test = (test.assign(predicted_votes=preds_final_2018_model.as_data_frame().values)
                  .assign(predicted_votes_neg_to_0=lambda df: df.predicted_votes.clip(lower=0))
                  .assign(unscaled_match_votes=lambda df: df.groupby('match_id').predicted_votes_neg_to_0.transform('sum'))
                  .assign(predicted_votes_scaled=lambda df: df.predicted_votes_neg_to_0 / df.unscaled_match_votes * 6))
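The chain above clips negative raw predictions to zero, then normalises within each match so that every game hands out exactly 6 votes (3 + 2 + 1). A minimal sketch of the same logic on a single toy match:

```python
import pandas as pd

# Three players in one match, with one negative raw prediction (illustrative values)
toy = pd.DataFrame({
    'match_id': [1, 1, 1],
    'predicted_votes': [2.0, 1.0, -0.5],
})

toy = (toy.assign(predicted_votes_neg_to_0=lambda df: df.predicted_votes.clip(lower=0))
          .assign(unscaled_match_votes=lambda df: df.groupby('match_id')
                                                    .predicted_votes_neg_to_0.transform('sum'))
          .assign(predicted_votes_scaled=lambda df: df.predicted_votes_neg_to_0
                                                    / df.unscaled_match_votes * 6))

# The clipped predictions (2, 1, 0) sum to 3, so each is doubled to make the match total 6
print(toy.predicted_votes_scaled.tolist())
```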

# Create an aggregated predicted votes df for each player
agg_predictions_2018 = (test.groupby(['player', 'team'], as_index=False)
                        .agg({
                            'predicted_votes_scaled': 'sum',
                            'match_id': 'count'}) # shows how many games they played
                        .sort_values(by='predicted_votes_scaled', ascending=False)
                        .reset_index(drop=True))
stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%

# Show the top 25 predictions

agg_predictions_2018.head(25)
player team predicted_votes_scaled match_id
0 T Mitchell Hawthorn 36.577208 20
1 M Gawn Melbourne 20.686854 22
2 C Oliver Melbourne 19.705244 20
3 D Martin Richmond 19.674984 19
4 B Grundy Collingwood 19.136407 22
5 J Macrae Western Bulldogs 18.396814 17
6 P Dangerfield Geelong 18.207483 21
7 D Beams Brisbane 17.280375 15
8 A Gaff West Coast 16.217552 18
9 L Neale Fremantle 16.038782 21
10 E Yeo West Coast 15.866418 20
11 J Selwood Geelong 15.749850 18
12 D Heppell Essendon 15.073503 19
13 N Fyfe Fremantle 14.864971 11
14 Z Merrett Essendon 14.597278 18
15 S Sidebottom Collingwood 14.403396 18
16 J Kennedy Sydney 14.297521 16
17 M Crouch Adelaide 13.464205 16
18 G Ablett Geelong 13.403062 15
19 P Cripps Carlton 13.140569 21
20 R Laird Adelaide 13.016825 19
21 L Franklin Sydney 12.562711 13
22 S Coniglio GWS 12.484317 20
23 J Kelly GWS 12.245194 14
24 J Lloyd Sydney 12.019714 20

Now that we have the top 25, let's also look at the top 3 from each team.

agg_predictions_2018.sort_values(by=['team', 'predicted_votes_scaled'], ascending=[True, False]).groupby('team').head(3)
player team predicted_votes_scaled match_id
17 M Crouch Adelaide 13.464205 16
20 R Laird Adelaide 13.016825 19
42 B Gibbs Adelaide 8.976507 18
7 D Beams Brisbane 17.280375 15
48 D Zorko Brisbane 8.039436 14
58 S Martin Brisbane 6.835975 19
19 P Cripps Carlton 13.140569 21
51 K Simpson Carlton 7.497976 18
91 E Curnow Carlton 3.734706 19
4 B Grundy Collingwood 19.136407 22
15 S Sidebottom Collingwood 14.403396 18
26 A Treloar Collingwood 11.173963 11
12 D Heppell Essendon 15.073503 19
14 Z Merrett Essendon 14.597278 18
61 D Smith Essendon 6.367458 20
9 L Neale Fremantle 16.038782 21
13 N Fyfe Fremantle 14.864971 11
75 D Mundy Fremantle 4.895438 19
22 S Coniglio GWS 12.484317 20
23 J Kelly GWS 12.245194 14
25 C Ward GWS 11.251450 19
6 P Dangerfield Geelong 18.207483 21
11 J Selwood Geelong 15.749850 18
18 G Ablett Geelong 13.403062 15
76 J Witts Gold Coast 4.745802 13
79 J Lyons Gold Coast 4.482626 14
114 D Swallow Gold Coast 2.731594 15
0 T Mitchell Hawthorn 36.577208 20
41 L Breust Hawthorn 9.132324 16
50 J Gunston Hawthorn 7.641945 19
1 M Gawn Melbourne 20.686854 22
2 C Oliver Melbourne 19.705244 20
30 J Hogan Melbourne 10.510504 13
29 S Higgins North Melbourne 10.682736 19
40 B Brown North Melbourne 9.211568 13
43 B Cunnington North Melbourne 8.949073 17
28 O Wines Port Adelaide 10.948001 16
35 R Gray Port Adelaide 10.017038 17
53 J Westhoff Port Adelaide 7.425425 21
3 D Martin Richmond 19.674984 19
36 J Riewoldt Richmond 9.981867 15
57 K Lambert Richmond 6.897522 12
38 S Ross St Kilda 9.756485 17
46 J Steven St Kilda 8.103860 17
90 J Steele St Kilda 3.739189 16
16 J Kennedy Sydney 14.297521 16
21 L Franklin Sydney 12.562711 13
24 J Lloyd Sydney 12.019714 20
8 A Gaff West Coast 16.217552 18
10 E Yeo West Coast 15.866418 20
34 J Redden West Coast 10.200779 16
5 J Macrae Western Bulldogs 18.396814 17
39 M Bontempelli Western Bulldogs 9.317955 16
45 L Hunter Western Bulldogs 8.758631 17
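The sort_values then groupby(...).head(3) pattern above works because head keeps the first rows of each group in their already-sorted order. A minimal sketch with a toy table, taking the top 2 per team:

```python
import pandas as pd

# Toy votes table with two teams (illustrative values only)
df = pd.DataFrame({
    'team': ['A', 'A', 'A', 'B', 'B'],
    'player': ['p1', 'p2', 'p3', 'p4', 'p5'],
    'votes': [5, 9, 7, 3, 8],
})

# Sort by team then votes descending, and keep the top 2 rows per team
top2 = (df.sort_values(by=['team', 'votes'], ascending=[True, False])
          .groupby('team').head(2))

print(top2.player.tolist())  # → ['p2', 'p3', 'p5', 'p4']
```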

If you're looking for a round-by-round breakdown, just have a look at the test dataframe.

test[['date', 'round', 'player', 'team', 'opposition', 'margin', 'SC', 'predicted_votes_scaled']].tail(25)
date round player team opposition margin SC predicted_votes_scaled
27236 24/08/2018 23 J Laverde Essendon Port Adelaide 22 0.028190 0.098214
27237 24/08/2018 23 Z Merrett Essendon Port Adelaide 22 0.030918 0.316848
27238 24/08/2018 23 D Parish Essendon Port Adelaide 22 0.030615 0.031097
27239 24/08/2018 23 A Saad Essendon Port Adelaide 22 0.026069 0.000000
27240 24/08/2018 23 D Smith Essendon Port Adelaide 22 0.028190 0.714852
27241 24/08/2018 23 D Zaharakis Essendon Port Adelaide 22 0.034859 0.331842
27242 25/08/2018 23 H Ballantyne Fremantle Collingwood -9 0.025773 0.594485
27243 25/08/2018 23 T Duman Fremantle Collingwood -9 0.038205 0.068379
27244 25/08/2018 23 J Hamling Fremantle Collingwood -9 0.024560 0.000000
27245 25/08/2018 23 B Hill Fremantle Collingwood -9 0.031837 0.755081
27246 25/08/2018 23 E Langdon Fremantle Collingwood -9 0.033354 0.420121
27247 25/08/2018 23 D Mundy Fremantle Collingwood -9 0.023954 0.000000
27248 25/08/2018 23 L Neale Fremantle Collingwood -9 0.042450 1.012576
27249 25/08/2018 23 S Switkowski Fremantle Collingwood -9 0.027289 0.000000
27250 25/08/2018 23 T Adams Collingwood Fremantle 9 0.030018 0.207985
27251 25/08/2018 23 M Cox Collingwood Fremantle 9 0.024864 0.017404
27252 25/08/2018 23 J Crisp Collingwood Fremantle 9 0.031534 0.022959
27253 25/08/2018 23 B Grundy Collingwood Fremantle 9 0.045482 1.118902
27254 25/08/2018 23 B Maynard Collingwood Fremantle 9 0.026076 0.000000
27255 25/08/2018 23 B Mihocek Collingwood Fremantle 9 0.025167 0.000000
27256 25/08/2018 23 S Pendlebury Collingwood Fremantle 9 0.029715 0.888746
27257 25/08/2018 23 T Phillips Collingwood Fremantle 9 0.028199 0.009869
27258 25/08/2018 23 S Sidebottom Collingwood Fremantle 9 0.039115 0.880034
27259 25/08/2018 23 B Sier Collingwood Fremantle 9 0.023044 0.002407
27260 25/08/2018 23 J Thomas Collingwood Fremantle 9 0.026986 0.001053

And there we have it! In a single notebook we have built a fairly accurate Brownlow predictive model. Enjoy.


Disclaimer

Note that whilst models and automated strategies are fun and rewarding to create, we can't promise that your model or betting strategy will be profitable, and we make no representations in relation to the code shared or information on this page. If you're using this code or implementing your own strategies, you do so entirely at your own risk and you are responsible for any winnings/losses incurred. Under no circumstances will Betfair be liable for any loss or damage you suffer.