Australian Open Datathon Python Tutorial
Overview
The Task
This notebook will outline how the Betfair Data Scientists went about modelling the Australian Open for Betfair's Australian Open Datathon. The task is simple: we ask you to predict the winner of every possible Australian Open matchup using data which we provide.
The metric used to determine the winner will be log loss, based on the actual matchups that happen in the Open. For more information on log loss, click here.
How an outline of our methodology and thought process, read this article.
Intention
This notebook will demonstrate how to:
- Process the raw data sets
- Produce simple features
- Run a predictive model on H2O
- Outputs the final predictions for the submissions
- Load the data and required packages
import numpy as np
import pandas as pd
import os
import gc
import sys
import warnings
warnings.filterwarnings('ignore')
import h2o
from h2o.automl import H2OAutoML
pd.options.display.max_columns = 999
# We are loading both the mens and womens match csvs
df_atp = pd.read_csv("data/ATP_matches.csv")
df_wta = pd.read_csv("data/WTA_matches.csv")
Data pre-processing
Filter the matches to hard and indoor hard only due to the fact that Australian Open is on hard surface and we want the models to train specifically for hard surfaces matches
Convert the columns in both datasets to the correct types. For example, we want to make sure the date columns are in the datetime format and numerical columns are either integer or floats. This will help reduce the memory in use and make the feature engineering process easier
### Include hard and indoor hard only
df_atp = df_atp.loc[df_atp.Court_Surface.isin(['Hard','Indoor Hard'])]
df_wta = df_wta.loc[df_wta.Court_Surface.isin(['Hard','Indoor Hard'])]
### Exclude qualifying rounds
df_atp = df_atp.loc[df_atp.Round_Description != 'Qualifying']
df_wta = df_wta.loc[df_wta.Round_Description != 'Qualifying']
# Store the shape of the data for reference check later
atp_shape = df_atp.shape
wta_shape = df_wta.shape
numeric_columns = ['Winner_Rank', 'Loser_Rank', 'Retirement_Ind',
'Winner_Sets_Won', 'Winner_Games_Won', 'Winner_Aces',
'Winner_DoubleFaults', 'Winner_FirstServes_Won',
'Winner_FirstServes_In', 'Winner_SecondServes_Won',
'Winner_SecondServes_In', 'Winner_BreakPoints_Won',
'Winner_BreakPoints', 'Winner_ReturnPoints_Won',
'Winner_ReturnPoints_Faced', 'Winner_TotalPoints_Won', 'Loser_Sets_Won',
'Loser_Games_Won', 'Loser_Aces', 'Loser_DoubleFaults',
'Loser_FirstServes_Won', 'Loser_FirstServes_In',
'Loser_SecondServes_Won', 'Loser_SecondServes_In',
'Loser_BreakPoints_Won', 'Loser_BreakPoints', 'Loser_ReturnPoints_Won',
'Loser_ReturnPoints_Faced', 'Loser_TotalPoints_Won']
text_columns = ['Winner', 'Loser', 'Tournament', 'Court_Surface','Round_Description']
date_columns = ['Tournament_Date']
# we set the **erros** to coerce so any non-numerical values (text,special characters) will return an NA
df_atp[numeric_columns] = df_atp[numeric_columns].apply(pd.to_numeric,errors = 'coerce')
df_wta[numeric_columns] = df_wta[numeric_columns].apply(pd.to_numeric,errors = 'coerce')
df_atp[date_columns] = df_atp[date_columns].apply(pd.to_datetime)
df_wta[date_columns] = df_wta[date_columns].apply(pd.to_datetime)
Feature Engineering
The raw datasets are constructed in a way that each row will have the seperate stats for both the winner and loser of that match. However, we want to reshape the data so that each row we will only have one player randomly selected from the winner/loser columns and the features are the difference between opponents statistics (Difference of Averages), such as the difference between average first serve % in a single column rather than Player 1’s first serve % and Player 2’s first serve % in two separate columns.
In addition, for the features, we will take the rolling average of the player's most recent 15 matches before the particular tournament starts. For example, if the match is the second round of the Australian Open 2018, the features will be the last 15 matches before the first round of Australian Open 2018. The reason of not including the stats in the first round is that we would not have known the player's stats in the first round for the final submissions
A typical row of the transformed data will look like this – For a match between Player A – Roger Federer and Player B – Rafael Nadal, we will have a bunch of features like the difference in first serve %, difference in ELO rating etc. The target variable will be whether or not Player A wins (1=Player A wins and 0=lose).
The steps we take are:
- Convert the raw data frames into long format:
- Create some new features
-
Take the rolling average for each player and each match Since we will be only training our models on US Open and Australian Open, we will only be creating features for those matches. However, the rolling average will take into account any hard surface matches before those tournaments
-
Calculate the difference of averages for each match in the data frames
Convert the raw data frames into long format:
# Before we split the data frame into winner and loser, we want to create a feature that captures the total number of games the match takes.
# We have to do it before the split or we will lose this information
df_atp['Total_Games'] = df_atp.Winner_Games_Won + df_atp.Loser_Games_Won
df_wta['Total_Games'] = df_wta.Winner_Games_Won + df_wta.Loser_Games_Won
# Get the column names for the winner and loser stats
winner_cols = [col for col in df_atp.columns if col.startswith('Winner')]
loser_cols = [col for col in df_atp.columns if col.startswith('Loser')]
# create a winner data frame to store the winner stats and a loser data frame for the losers
# In addition to the winner and loser columns, we are adding common columns as well (e.g. tournamnt dates)
common_cols = ['Total_Games','Tournament','Tournament_Date', 'Court_Surface','Round_Description']
df_winner_atp = df_atp[winner_cols + common_cols]
df_loser_atp = df_atp[loser_cols + common_cols]
df_winner_wta = df_wta[winner_cols + common_cols]
df_loser_wta = df_wta[loser_cols + common_cols]
# Create a new column to show whether the player has won or not.
df_winner_atp["won"] = 1
df_loser_atp["won"] = 0
df_winner_wta["won"] = 1
df_loser_wta["won"] = 0
# Rename the columns for the winner and loser data frames so we can append them later on.
# We will rename the Winner_ / Loser_ columns to Player_
new_column_names = [col.replace('Winner','Player') for col in winner_cols]
df_winner_atp.columns = new_column_names + common_cols + ['won']
# They all should be the same
df_loser_atp.columns = df_winner_atp.columns
df_winner_wta.columns = df_winner_atp.columns
df_loser_wta.columns = df_winner_atp.columns
# append the winner and loser data frames
df_long_atp= df_winner_atp.append(df_loser_atp)
df_long_wta= df_winner_wta.append(df_loser_wta)
So now our data frames are in long format and should looks like this
Player | Player_Rank | Player_Sets_Won | Player_Games_Won | Player_Aces | Player_DoubleFaults | Player_FirstServes_Won | Player_FirstServes_In | Player_SecondServes_Won | Player_SecondServes_In | Player_BreakPoints_Won | Player_BreakPoints | Player_ReturnPoints_Won | Player_ReturnPoints_Faced | Player_TotalPoints_Won | Total_Games | Tournament | Tournament_Date | Court_Surface | Round_Description | won |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Edouard Roger-Vasselin | 106.0 | 2.0 | 12 | 5.0 | 2.0 | 22 | 30 | 12 | 19 | 4.0 | 7.0 | 25.0 | 59.0 | 59 | 19 | Chennai | 2012-01-02 | Hard | First Round | 1 |
Dudi Sela | 83.0 | 2.0 | 12 | 2.0 | 0.0 | 14 | 17 | 11 | 16 | 6.0 | 14.0 | 36.0 | 58.0 | 61 | 13 | Chennai | 2012-01-02 | Hard | First Round | 1 |
Go Soeda | 120.0 | 2.0 | 19 | 6.0 | 1.0 | 48 | 64 | 19 | 39 | 5.0 | 11.0 | 42.0 | 105.0 | 109 | 33 | Chennai | 2012-01-02 | Hard | First Round | 1 |
Yuki Bhambri | 345.0 | 2.0 | 12 | 1.0 | 2.0 | 22 | 29 | 9 | 17 | 5.0 | 13.0 | 34.0 | 62.0 | 65 | 17 | Chennai | 2012-01-02 | Hard | First Round | 1 |
Yuichi Sugita | 235.0 | 2.0 | 12 | 3.0 | 1.0 | 37 | 51 | 11 | 27 | 3.0 | 7.0 | 22.0 | 54.0 | 70 | 19 | Chennai | 2012-01-02 | Hard | First Round | 1 |
Create some new features
Thinking about the dynamics of tennis, we know that players often will matches by “breaking” the opponent’s serve (i.e. winning a game when the opponent is serving). This is especially important in tennis. Let’s create a feature called Player_BreakPoints_Per_Game, which is the number of breakpoints a player gets per game that they play (even though they can only get breakpoints every second game, we will use total games). Let’s also create a feature called Player_Return_Win_Ratio which is the proportion of points won when returning.
Similarly, “holding” serve is important (i.e. winning a game when you are serving). Let’s create a feature called Player_Serve_Win_Ratio which is the proportion of points won when serving.
Finally, you only win a set of tennis by winning more sets than your opponent. To win a set, you need to win games. Let’s create a feature called Player_Game_Win_Percentage which is the propotion of games that a player wins.
So the four new features we will create are:
Player_Serve_Win_Ratio
Player_Return_Win_Ratio
Player_BreakPoints_Per_Return_Game
Player_Game_Win_Percentage
# Here, we will define a function so we can apply it to both atp and wta data frames
def get_new_features(df):
# Input:
# df: data frame to get the data from
# Return: the df with the new features
# Point Win ratio when serving
df['Player_Serve_Win_Ratio'] = (df.Player_FirstServes_Won + df.Player_SecondServes_Won - df.Player_DoubleFaults) \
/(df.Player_FirstServes_In + df.Player_SecondServes_In + df.Player_DoubleFaults)
# Point win ratio when returning
df['Player_Return_Win_Ratio'] = df.Player_ReturnPoints_Won / df.Player_ReturnPoints_Faced
# Breakpoints per receiving game
df['Player_BreakPoints_Per_Return_Game'] = df.Player_BreakPoints/df.Total_Games
df['Player_Game_Win_Percentage'] = df.Player_Games_Won/df.Total_Games
return df
# Apply the function we just created to the long data frames
df_long_atp = get_new_features(df_long_atp)
df_long_wta = get_new_features(df_long_wta)
# The long table should have exactly twice of the rows of the original data
assert df_long_atp.shape[0] == atp_shape[0]*2
assert df_long_wta.shape[0] == wta_shape[0]*2
Take the rolling average for each player and each match
To train our models, we cannot simply use the player stats for that current match. In fact, we wont be able to use any stats from the same tournament. The logic behind this is that when we try to predict the results in 2019, we would not know the stats of any of the matches in the Australian Open 2019 tournament. As a result, we will use the players' past performance. Here, we will do a rolling average of the most recent 15 matches before the tournament.
To do the above, we will follow the steps below:
- List all the tournament dates for US and Australian Opens
- Loop through the dates from point 1, for each date, we filter the data to only include matches before that date and take the most recent 15 games
- Take the average of those 15 games
# the two tournaments we will be using for training and thus the feature generation
tournaments = ['U.S. Open, New York','Australian Open, Melbourne']
# Store the dates for the loops
tournament_dates_atp = df_atp.loc[df_atp.Tournament.isin(tournaments)].groupby(['Tournament','Tournament_Date']) \
.size().reset_index()[['Tournament','Tournament_Date']]
tournament_dates_wta = df_wta.loc[df_wta.Tournament.isin(tournaments)].groupby(['Tournament','Tournament_Date']) \
.size().reset_index()[['Tournament','Tournament_Date']]
# We are adding one more date for the final prediction
tournament_dates_atp.loc[-1] = ['Australian Open, Melbourne',pd.to_datetime('2019-01-15')]
tournament_dates_wta.loc[-1] = ['Australian Open, Melbourne',pd.to_datetime('2019-01-15')]
Following are the dates for each tournament
Tournament | Tournament_Date |
---|---|
Australian Open, Melbourne | 2012-01-16 |
Australian Open, Melbourne | 2013-01-1 |
Australian Open, Melbourne | 2014-01-1 |
Australian Open, Melbourne | 2015-01-1 |
Australian Open, Melbourne | 2016-01-1 |
Australian Open, Melbourne | 2017-01-1 |
Australian Open, Melbourne | 2018-01-1 |
U.S. Open, New York | 2012-08-2 |
U.S. Open, New York | 2013-08-2 |
U.S. Open, New York | 2014-08-25 |
U.S. Open, New York | 2015-08-31 |
U.S. Open, New York | 2016-08-29 |
U.S. Open, New York | 2017-08-28 |
U.S. Open, New York | 2018-08-27 |
Australian Open, Melbourne | 2019-01-15 |
Tournament | Tournament_Date |
---|---|
Australian Open, Melbourne | 2014-01-13 |
Australian Open, Melbourne | 2015-01-19 |
Australian Open, Melbourne | 2016-01-18 |
Australian Open, Melbourne | 2017-01-16 |
Australian Open, Melbourne | 2018-01-15 |
U.S. Open, New York | 2014-08-25 |
U.S. Open, New York | 2015-08-31 |
U.S. Open, New York | 2016-08-29 |
U.S. Open, New York | 2017-08-28 |
U.S. Open, New York | 2018-08-27 |
Australian Open, Melbourne | 2019-01-15 |
They look fine but it is interesting that for men's, we have two more years of data from 2012 to 2013
# Let's define a function to calculate the rolling averages
def get_rolling_features (df, date_df=None,rolling_cols = None, last_cols= None):
# Input:
# df: data frame to get the data from
# date_df: data frame that has the start dates for each tournament
# rolling_cols: columns to get the rolling averages
# last_cols: columns to get the last value (most recent)
# Return: the df with the new features
# Sort the data by player and dates so the most recent matches are at the bottom
df = df.sort_values(['Player','Tournament_Date','Tournament'], ascending=True)
# For each tournament, get the rolling averages of that player's past matches before the tournament start date
for index, tournament_date in enumerate(date_df.Tournament_Date):
# create a temp df to store the interim results
df_temp = df.loc[df.Tournament_Date < tournament_date]
# for ranks, we only take the last one. (comment this out if want to take avg of rank)
df_temp_last = df_temp.groupby('Player')[last_cols].last().reset_index()
# take the most recent 15 matches for the rolling average
df_temp = df_temp.groupby('Player')[rolling_cols].rolling(15, min_periods=1).mean().reset_index()
df_temp = df_temp.groupby('Player').tail(1) # take the last row of the above
df_temp= df_temp.merge(df_temp_last, on='Player', how='left')
if index ==0:
df_result = df_temp
df_result['tournament_date_index'] = tournament_date # so we know which tournament this feature is for
else:
df_temp['tournament_date_index'] = tournament_date
df_result = df_result.append(df_temp)
df_result.drop('level_1', axis=1,inplace=True)
return df_result
# columns we are applying the rolling averages on
rolling_cols = ['Player_Serve_Win_Ratio', 'Player_Return_Win_Ratio',
'Player_BreakPoints_Per_Return_Game', 'Player_Game_Win_Percentage']
# columns we are taking the most recent values on
# For the player rank, we think we can just use the latest rank (before the tournament starts)
# as it should refect the most recent performance of the player
last_cols = ['Player_Rank']
# Apply the rolling average function to the long data frames (it will take a few mins to run)
df_rolling_atp = get_rolling_features (df_long_atp, tournament_dates_atp, rolling_cols, last_cols= last_cols )
df_rolling_wta = get_rolling_features (df_long_wta, tournament_dates_wta, rolling_cols, last_cols= last_cols)
Player | Player_Serve_Win_Ratio | Player_Return_Win_Ratio | Player_BreakPoints_Per_Return_Game | Player_Game_Win_Percentage | Player_Rank | tournament_date_index |
---|---|---|---|---|---|---|
Adrian Mannarino | 0.623408 | 0.353397 | 0.257859 | 0.447246 | 87.0 | 2012-01-16 |
Albert Montanes | 0.507246 | 0.195652 | 0.000000 | 0.294118 | 50.0 | 2012-01-16 |
Player | Player_Serve_Win_Ratio | Player_Return_Win_Ratio | Player_BreakPoints_Per_Return_Game | Player_Game_Win_Percentage | Player_Rank | tournament_date_index |
---|---|---|---|---|---|---|
Agnieszka Radwanska | 0.413333 | 0.475410 | 0.350000 | 0.350000 | 5.0 | 2014-01-13 |
Ajla Tomljanovic | 0.468457 | 0.407319 | 0.242253 | 0.462634 | 75.0 | 2014-01-13 |
Calculate the difference of averages for each match in the data frames
In the original data frames, the first column is always the winner and followed by the loser. Same for the player stats. Thus, we cannot simply calculate the difference between winner and loser and create a target variable indicating player 1 will win or not because it will always be the winner in this case (target always = 1). As a result, we need to pick a player randomly so the player might or might not be the winner
In addition, instead of using both the features for player 1 and 2, we will take the difference of averages between the randomised player 1 and 2. The main benefit is that it will reduce the number of features to half
Steps:
- We will create a random number for each player which only return 0 or 1
- If it is zero, we will assign the winner to player 1 and loser to player 2
- We will join the features to the player 1 and 2. The join will be on the player names and the tournament date (tournament_index in the feature data frames)
- For players who do not have any history, we will fill the stats by zeros and rank by 999
# Randomise the match_wide dataset so the first player is not always the winner
# set a seed so the random number is reproducable
np.random.seed(2)
# randomise a number 0/1 with 50% chance each
# if 0 then take the winner, 1 then take loser
df_atp['random_number'] = np.random.randint(2, size=len(df_atp))
df_atp['randomised_player_1'] = np.where(df_atp['random_number']==0,df_atp['Winner'],df_atp['Loser'])
df_atp['randomised_player_2'] = np.where(df_atp['random_number']==0,df_atp['Loser'],df_atp['Winner'])
df_wta['random_number'] = np.random.randint(2, size=len(df_wta))
df_wta['randomised_player_1'] = np.where(df_wta['random_number']==0,df_wta['Winner'],df_wta['Loser'])
df_wta['randomised_player_2'] = np.where(df_wta['random_number']==0,df_wta['Loser'],df_wta['Winner'])
# set the target (win/loss) based on the new randomise number
df_atp['player_1_win'] = np.where(df_atp['random_number']==0,1,0)
df_wta['player_1_win'] = np.where(df_wta['random_number']==0,1,0)
print ('After shuffling, the win rate for player 1 for the mens is {}%'.format(df_atp['player_1_win'].mean()*100))
print ('After shuffling, the win rate for player 1 for the womens is {}%'.format(df_wta['player_1_win'].mean()*100))
After shuffling, the win rate for player 1 for the mens is 49.64798919857267% After shuffling, the win rate for player 1 for the womens is 49.697671426733564% The win rates are close enough to 50%. So we are good to go
# To get our data frames ready for model training, we will exclude other tournaments from the data now because we have gotten the rolling averages from them and
# for training, we only need US and Australian Open matches
df_atp = df_atp.loc[df_atp.Tournament.isin(tournaments)]
df_wta = df_wta.loc[df_wta.Tournament.isin(tournaments)]
# now we can remove other stats columns because we will be using the differences
cols_to_keep = ['Winner','Loser','Tournament','Tournament_Date',
'player_1_win','randomised_player_1',
'randomised_player_2']
df_atp = df_atp[cols_to_keep]
df_wta = df_wta[cols_to_keep]
# Here, we are joining the rolling average data frames to the individual matches.
# We need to do it twice. One for player 1 and one for player 2
# Get the rolling features for player 1
df_atp = df_atp.merge(df_rolling_atp, how='left',
left_on = ['randomised_player_1','Tournament_Date'],
right_on = ['Player','tournament_date_index'],
validate ='m:1')
df_wta = df_wta.merge(df_rolling_wta, how='left',
left_on = ['randomised_player_1','Tournament_Date'],
right_on = ['Player','tournament_date_index'],
validate ='m:1')
# Get the rolling features for player 2
# we will use '_p1' to denote player 1 and '_p2' for player 2
df_atp = df_atp.merge(df_rolling_atp, how='left',
left_on = ['randomised_player_2','Tournament_Date'],
right_on = ['Player','tournament_date_index'],
validate ='m:1',
suffixes=('_p1','_p2'))
df_wta = df_wta.merge(df_rolling_wta, how='left',
left_on = ['randomised_player_2','Tournament_Date'],
right_on = ['Player','tournament_date_index'],
validate ='m:1',
suffixes=('_p1','_p2'))
# How many players do not have previous match history
print('{} player_1s do Not have previous match history before the tournament'.format(df_atp.loc[df_atp.Player_p1.isna(),'randomised_player_1'].nunique()))
print('{} player_2s do Not have previous match history before the tournament'.format(df_atp.loc[df_atp.Player_p2.isna(),'randomised_player_2'].nunique()))
59 player_1s do Not have previous match history before the tournament 56 player_2s do Not have previous match history before the tournament
# How many players do not have previous match history
print('{} player_1s do Not have previous match history before the tournament'.format(df_wta.loc[df_wta.Player_p1.isna(),'randomised_player_1'].nunique()))
print('{} player_2s do Not have previous match history before the tournament'.format(df_wta.loc[df_wta.Player_p2.isna(),'randomised_player_2'].nunique()))
41 player_1s do Not have previous match history before the tournament 37 player_2s do Not have previous match history before the tournament
# Most of the missing are for the early years which makes sense as we dont have enough history for them
df_wta.loc[df_wta.Player_p1.isna(),'Tournament_Date'].value_counts()
2014-01-13 29
2014-08-25 7
2015-08-31 5
2015-01-19 3
2017-08-28 3
2018-01-15 3
2018-08-27 3
Name: Tournament_Date, dtype: int64
2012-01-16 29
2012-08-27 9
2014-01-13 5
2013-08-26 5
2016-01-18 5
2013-01-14 4
2014-08-25 3
2018-01-15 3
2017-08-28 3
2018-08-27 2
2016-08-29 2
2015-01-19 1
Name: Tournament_Date, dtype: int64
Now we have gotten the rolling averages for both player 1 and 2. What we need to do next is to simply calculate their difference.
To calculate the difference, we need to:
- Split the data frames into two new data frames: Player 1 and Player 2
- Take the difference between the two data frames
def get_player_difference(df, diff_cols = None):
# Input:
# df: data frame to get the data from
# diff_cols: columns we take the difference on. For example is diff_cols = win rate. This function will calculate the
# difference of the win rates between player 1 and player 2
# Return: the df with the new features
p1_cols = [i + '_p1' for i in diff_cols] # column names for player 1 stats
p2_cols = [i + '_p2' for i in diff_cols] # column names for player 2 stats
# For any missing values, we will fill them by zeros except the ranking where we will use 999
df['Player_Rank_p1'] = df['Player_Rank_p1'].fillna(999)
df[p1_cols] = df[p1_cols].fillna(0)
df['Player_Rank_p2'] = df['Player_Rank_p2'].fillna(999)
df[p2_cols] = df[p2_cols].fillna(0)
new_column_name = [i + '_diff' for i in diff_cols]
# Take the difference
df_p1 = df[p1_cols]
df_p2 = df[p2_cols]
df_p1.columns=new_column_name
df_p2.columns=new_column_name
df_diff = df_p1 - df_p2
df_diff.columns = new_column_name
# drop the p1 and p2 columns because We have the differences now
df.drop(p1_cols + p2_cols, axis=1, inplace=True)
# Concat the df_diff and raw_df
df = pd.concat([df, df_diff], axis=1)
return df,new_column_name
diff_cols = ['Player_Serve_Win_Ratio',
'Player_Return_Win_Ratio',
'Player_BreakPoints_Per_Return_Game',
'Player_Game_Win_Percentage','Player_Rank']
# Apply the function and get the difference between player 1 and 2
df_atp,_ = get_player_difference(df_atp,diff_cols=diff_cols)
df_wta,_ = get_player_difference(df_wta,diff_cols=diff_cols)
# Make a copy of the data frames in case we need to come back to check the values
df_atp_final = df_atp.copy()
df_wta_final = df_wta.copy()
Winner | Loser | Tournament | Tournament_Date | player_1_win | randomised_player_1 | randomised_player_2 | Player_p1 | tournament_date_index_p1 | Player_p2 | tournament_date_index_p2 | Player_Serve_Win_Ratio_diff | Player_Return_Win_Ratio_diff | Player_BreakPoints_Per_Return_Game_diff | Player_Game_Win_Percentage_diff | Player_Rank_diff |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Juan Martin del Potro | Adrian Mannarino | Australian Open, Melbourne | 2012-01-16 | 1 | Juan Martin del Potro | Adrian Mannarino | Juan Martin del Potro | 2012-01-16 | Adrian Mannarino | 2012-01-16 | 0.035030 | -0.021271 | -0.025975 | 0.103479 | -76.0 |
Pere Riba | Albert Montanes | Australian Open, Melbourne | 2012-01-16 | 1 | Pere Riba | Albert Montanes | Pere Riba | 2012-01-16 | Albert Montanes | 2012-01-16 | -0.156369 | 0.008893 | 0.066667 | -0.094118 | 39.0 |
Tomas Berdych | Albert Ramos-Vinolas | Australian Open, Melbourne | 2012-01-16 | 0 | Albert Ramos-Vinolas | Tomas Berdych | Albert Ramos-Vinolas | 2012-01-16 | NaN | NaT | 0.498027 | 0.380092 | 0.414815 | 0.394444 | -934.0 |
Rafael Nadal | Alex Kuznetsov | Australian Open, Melbourne | 2012-01-16 | 0 | Alex Kuznetsov | Rafael Nadal | NaN | NaT | Rafael Nadal | 2012-01-16 | -0.670139 | -0.423057 | -0.445623 | -0.574767 | 997.0 |
Roger Federer | Alexander Kudryavtsev | Australian Open, Melbourne | 2012-01-16 | 0 | Alexander Kudryavtsev | Roger Federer | NaN | NaT | Roger Federer | 2012-01-16 | -0.721415 | -0.449516 | -0.360255 | -0.668090 | 996.0 |
Modelling
We will trian two models here, one for mens and one for womens.
For training, we will use all available data from the second year (too many missing values in the first year) up until 2017.
For validation, we will test the model on the 2018 Australian Open data
This setup allows us to 'mimic' the final prediction (using historical matches to predict 2019 results)
df_train_atp = df_atp_final.loc[(df_atp_final.Tournament_Date != '2018-01-15') # excluding Aus Open 2018, and
& (df_atp_final.Tournament_Date > '2012-01-16')] # excluding first year
df_valid_atp = df_atp_final.loc[df_atp_final.Tournament_Date == '2018-01-15'] # Australian Open 2018 only
df_train_wta = df_wta_final.loc[(df_wta_final.Tournament_Date != '2018-01-15') # excluding Aus Open 2018, and
& (df_wta_final.Tournament_Date > '2014-01-13')] # excluding first year
Winner | Loser | Tournament | Tournament_Date | player_1_win | randomised_player_1 | randomised_player_2 | Player_p1 | tournament_date_index_p1 | Player_p2 | tournament_date_index_p2 | Player_Serve_Win_Ratio_diff | Player_Return_Win_Ratio_diff | Player_BreakPoints_Per_Return_Game_diff | Player_Game_Win_Percentage_diff | Player_Rank_diff |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Daniel Brands | Adrian Ungur | U.S. Open, New York | 2012-08-27 | 0 | Adrian Ungur | Daniel Brands | NaN | NaT | Daniel Brands | 2012-08-27 | -0.535211 | -0.300000 | -0.043478 | -0.434783 | 870.0 |
Richard Gasquet | Albert Montanes | U.S. Open, New York | 2012-08-27 | 1 | Richard Gasquet | Albert Montanes | Richard Gasquet | 2012-08-27 | Albert Montanes | 2012-08-27 | 0.080003 | 0.077451 | 0.180847 | 0.131108 | -37.0 |
Martin Klizan | Alejandro Falla | U.S. Open, New York | 2012-08-27 | 1 | Martin Klizan | Alejandro Falla | Martin Klizan | 2012-08-27 | Alejandro Falla | 2012-08-27 | 0.077117 | -0.044716 | -0.087362 | 0.068180 | -2.0 |
Andy Murray | Alex Bogomolov Jr. | U.S. Open, New York | 2012-08-27 | 1 | Andy Murray | Alex Bogomolov Jr. | Andy Murray | 2012-08-27 | Alex Bogomolov Jr. | 2012-08-27 | 0.039641 | 0.031701 | 0.094722 | 0.059010 | -69.0 |
Tommy Robredo | Andreas Seppi | U.S. Open, New York | 2012-08-27 | 1 | Tommy Robredo | Andreas Seppi | Tommy Robredo | 2012-08-27 | Andreas Seppi | 2012-08-27 | -0.026814 | 0.006442 | -0.009930 | -0.067780 | 151.0 |
# target variable
target= 'player_1_win'
# features being fed into the models
feats = ['Player_Serve_Win_Ratio_diff',
'Player_Return_Win_Ratio_diff',
'Player_BreakPoints_Per_Return_Game_diff',
'Player_Game_Win_Percentage_diff',
'Player_Rank_diff']
print(feats)
H2O model for ATP
h2o.init()
# Convert to an h2o frame
df_train_atp_h2o = h2o.H2OFrame(df_train_atp)
df_valid_atp_h2o = h2o.H2OFrame(df_valid_atp)
# For binary classification, response should be a factor
df_train_atp_h2o[target] = df_train_atp_h2o[target].asfactor()
df_valid_atp_h2o[target] = df_valid_atp_h2o[target].asfactor()
# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml_atp = h2o.automl.H2OAutoML(max_runtime_secs=300,
max_models=100,
stopping_metric='logloss',
sort_metric='logloss',
balance_classes=True,
seed=183
)
aml_atp.train(x=feats, y=target, training_frame=df_train_atp_h2o,validation_frame=df_valid_atp_h2o)
# View the AutoML Leaderboard
lb = aml_atp.leaderboard
lb.head()
model_id | auc | logloss | mean_per_class_error | rmse | mse |
---|---|---|---|---|---|
GBM_5_AutoML_20181221_094949 | 0.790281 | 0.554852 | 0.281363 | 0.431379 | 0.186088 |
GBM_grid_1_AutoML_20181221_094949_model_15 | 0.789329 | 0.556804 | 0.29856 | 0.431931 | 0.186564 |
GBM_grid_1_AutoML_20181221_094949_model_7 | 0.788013 | 0.557808 | 0.295899 | 0.432968 | 0.187461 |
StackedEnsemble_BestOfFamily_AutoML_20181221_094949 | 0.788131 | 0.558028 | 0.285321 | 0.432849 | 0.187358 |
GBM_grid_1_AutoML_20181221_094949_model_20 | 0.785633 | 0.561094 | 0.283932 | 0.43479 | 0.189043 |
StackedEnsemble_AllModels_AutoML_20181221_094949 | 0.784411 | 0.561587 | 0.293244 | 0.434667 | 0.188935 |
GBM_grid_1_AutoML_20181221_094949_model_25 | 0.785311 | 0.561783 | 0.291912 | 0.434888 | 0.189127 |
GBM_grid_1_AutoML_20181221_094949_model_17 | 0.774832 | 0.570883 | 0.295836 | 0.439375 | 0.193051 |
DeepLearning_1_AutoML_20181221_094949 | 0.779388 | 0.572823 | 0.311737 | 0.438479 | 0.192264 |
GBM_grid_1_AutoML_20181221_094949_model_14 | 0.7718 | 0.578867 | 0.285835 | 0.441373 | 0.19481 |
# Convert to an h2o frame
df_train_wta_h2o = h2o.H2OFrame(df_train_wta)
df_valid_wta_h2o = h2o.H2OFrame(df_valid_wta)
# For binary classification, response should be a factor
df_train_wta_h2o[target] = df_train_wta_h2o[target].asfactor()
df_valid_wta_h2o[target] = df_valid_wta_h2o[target].asfactor()
# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml_wta = h2o.automl.H2OAutoML(max_runtime_secs=300,
max_models=100,
stopping_metric='logloss',
sort_metric='logloss',
balance_classes=True,
seed=183
)
aml_wta.train(x=feats, y=target, training_frame=df_train_wta_h2o,validation_frame=df_valid_wta_h2o)
# View the AutoML Leaderboard
lb = aml_wta.leaderboard
lb.head()
model_id | auc | logloss | mean_per_class_error | rmse | mse |
---|---|---|---|---|---|
StackedEnsemble_AllModels_AutoML_20181221_095400 | 0.726046 | 0.60827 | 0.321222 | 0.457117 | 0.208956 |
StackedEnsemble_BestOfFamily_AutoML_20181221_095400 | 0.724911 | 0.609329 | 0.337847 | 0.457659 | 0.209452 |
DeepLearning_grid_1_AutoML_20181221_095400_model_3 | 0.729152 | 0.612669 | 0.315971 | 0.45641 | 0.20831 |
GBM_grid_1_AutoML_20181221_095400_model_7 | 0.721204 | 0.615763 | 0.336848 | 0.460885 | 0.212415 |
GBM_5_AutoML_20181221_095400 | 0.719252 | 0.616535 | 0.319179 | 0.461055 | 0.212572 |
GBM_grid_1_AutoML_20181221_095400_model_15 | 0.715921 | 0.619263 | 0.318673 | 0.462215 | 0.213643 |
GLM_grid_1_AutoML_20181221_095400_model_1 | 0.726048 | 0.622989 | 0.366124 | 0.463099 | 0.214461 |
GBM_grid_1_AutoML_20181221_095400_model_17 | 0.709261 | 0.624902 | 0.34876 | 0.465628 | 0.216809 |
GBM_grid_1_AutoML_20181221_095400_model_18 | 0.70946 | 0.625704 | 0.393556 | 0.466147 | 0.217293 |
DeepLearning_grid_1_AutoML_20181221_095400_model_2 | 0.713419 | 0.628008 | 0.311334 | 0.463638 | 0.21496 |
Use the models to predict and make submissions
Now let's use the models we just created to make the submissions
df_predict_atp = pd.read_csv("data/men_dummy_submission_file.csv")
df_predict_wta = pd.read_csv("data/women_dummy_submission_file.csv", encoding='latin1') # for womens, there are some names need a different encoding
player_1 | player_2 | player_1_win_probability |
---|---|---|
Simona Halep | Angelique Kerber | 0.5 |
Simona Halep | Caroline Wozniacki | 0.5 |
Get the features for the predict df
We need to join the features to the 2019 players
# Before we join the features by the names and the dates, we need to convert any non-english characters to english first
translationTable = str.maketrans("éàèùâêîôûçñá", "eaeuaeioucna")
df_predict_atp['player_1'] = df_predict_atp.player_1.apply(lambda x: x.translate(translationTable))
df_predict_atp['player_2'] = df_predict_atp.player_2.apply(lambda x: x.translate(translationTable))
df_predict_wta['player_1'] = df_predict_wta.player_1.apply(lambda x: x.translate(translationTable))
df_predict_wta['player_2'] = df_predict_wta.player_2.apply(lambda x: x.translate(translationTable))
# Also we need to convert the names into lower cases
df_predict_atp['player_1'] =df_predict_atp['player_1'].str.lower()
df_predict_atp['player_2'] =df_predict_atp['player_2'].str.lower()
df_predict_wta['player_1'] =df_predict_wta['player_1'].str.lower()
df_predict_wta['player_2'] =df_predict_wta['player_2'].str.lower()
df_rolling_atp['Player'] = df_rolling_atp['Player'].str.lower()
df_rolling_wta['Player'] = df_rolling_wta['Player'].str.lower()
# Lastly, some players have slightly difference names in the submission data and the match data. So we are editing them here manually
df_predict_atp.loc[df_predict_atp.player_1=='jaume munar','player_1'] = 'jaume antoni munar clar'
df_predict_atp.loc[df_predict_atp.player_2=='jaume munar','player_2'] = 'jaume antoni munar clar'
df_predict_wta.loc[df_predict_wta.player_1=='daria kasatkina','player_1'] = 'darya kasatkina'
df_predict_wta.loc[df_predict_wta.player_2=='daria kasatkina','player_2'] = 'darya kasatkina'
df_predict_wta.loc[df_predict_wta.player_1=='lesia tsurenko','player_1'] = 'lesya tsurenko'
df_predict_wta.loc[df_predict_wta.player_2=='lesia tsurenko','player_2'] = 'lesya tsurenko'
df_predict_wta.loc[df_predict_wta.player_1=='danielle collins','player_1'] = 'danielle rose collins'
df_predict_wta.loc[df_predict_wta.player_2=='danielle collins','player_2'] = 'danielle rose collins'
df_predict_wta.loc[df_predict_wta.player_1=='anna karolina schmiedlova','player_1'] = 'anna schmiedlova'
df_predict_wta.loc[df_predict_wta.player_2=='anna karolina schmiedlova','player_2'] = 'anna schmiedlova'
df_predict_wta.loc[df_predict_wta.player_1=='georgina garcia perez','player_1'] = 'georgina garcia-perez'
df_predict_wta.loc[df_predict_wta.player_2=='georgina garcia perez','player_2'] = 'georgina garcia-perez'
# create and tournament date column and set it to 2019 so we can join the lastest features
df_predict_atp['Tournament_Date'] = pd.to_datetime('2019-01-15')
df_predict_wta['Tournament_Date'] = pd.to_datetime('2019-01-15')
# Get the rolling features for player 1
df_predict_atp = df_predict_atp.merge(df_rolling_atp, how='left',
left_on = ['player_1','Tournament_Date'],
right_on = ['Player','tournament_date_index'],validate ='m:1')
df_predict_wta = df_predict_wta.merge(df_rolling_wta, how='left',
left_on = ['player_1','Tournament_Date'],
right_on = ['Player','tournament_date_index'],validate ='m:1')
# Get the rolling features for player 2
# For duplicate columns, we will use '_p1' to denote player 1 and '_p2' for player 2
df_predict_atp = df_predict_atp.merge(df_rolling_atp, how='left',
left_on = ['player_2','Tournament_Date'],
right_on = ['Player','tournament_date_index'],validate ='m:1',suffixes=('_p1','_p2'))
df_predict_wta = df_predict_wta.merge(df_rolling_wta, how='left',
left_on = ['player_2','Tournament_Date'],
right_on = ['Player','tournament_date_index'],validate ='m:1',suffixes=('_p1','_p2'))
player_1 | player_2 | player_1_win_probability | Tournament_Date | Player_p1 | Player_Serve_Win_Ratio_p1 | Player_Return_Win_Ratio_p1 | Player_BreakPoints_Per_Return_Game_p1 | Player_Game_Win_Percentage_p1 | Player_Rank_p1 | tournament_date_index_p1 | Player_p2 | Player_Serve_Win_Ratio_p2 | Player_Return_Win_Ratio_p2 | Player_BreakPoints_Per_Return_Game_p2 | Player_Game_Win_Percentage_p2 | Player_Rank_p2 | tournament_date_index_p2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
novak djokovic | rafael nadal | 0.5 | 2019-01-15 | novak djokovic | 0.685903 | 0.426066 | 0.42378 | 0.640857 | 2.0 | 2019-01-15 | rafael nadal | 0.622425 | 0.401028 | 0.334270 | 0.570833 | 1.0 | 2019-01-15 |
novak djokovic | roger federer | 0.5 | 2019-01-15 | novak djokovic | 0.685903 | 0.426066 | 0.42378 | 0.640857 | 2.0 | 2019-01-15 | roger federer | 0.620070 | 0.389781 | 0.269224 | 0.564244 | 3.0 | 2019-01-15 |
# How many players do not have previous match history
print('{} player_1s do Not have previous match history before the tournament in the mens'.format(df_predict_atp.loc[df_predict_atp.Player_p1.isna(),'player_1'].nunique()))
print('{} player_2s do Not have previous match history before the tournament in the mens'.format(df_predict_atp.loc[df_predict_atp.Player_p2.isna(),'player_2'].nunique()))
3 player_1s do Not have previous match history before the tournament in the mens 3 player_2s do Not have previous match history before the tournament in the mens
# How many players do not have previous match history
print('{} player_1s do Not have previous match history before the tournament in the womens'.format(df_predict_wta.loc[df_predict_wta.Player_p1.isna(),'player_1'].nunique()))
print('{} player_2s do Not have previous match history before the tournament in the womens'.format(df_predict_wta.loc[df_predict_wta.Player_p2.isna(),'player_2'].nunique()))
0 player_1s do Not have previous match history before the tournament in the womens 0 player_2s do Not have previous match history before the tournament in the womens
print(df_predict_atp.loc[df_predict_atp.Player_p1.isna(),'player_1'].unique().tolist())
['christian garin', 'pedro sousa', 'hugo dellien']
We will do the differencing again for the prediction data frames exactly like what we did for training
# Apply the function and get the difference between player 1 and 2
df_predict_atp,_ = get_player_difference(df_predict_atp,diff_cols=diff_cols)
df_predict_wta,_ = get_player_difference(df_predict_wta,diff_cols=diff_cols)
Make the prediction
df_predict_atp_h2o = h2o.H2OFrame(df_predict_atp[feats])
df_predict_wta_h2o = h2o.H2OFrame(df_predict_wta[feats])
atp_preds = aml_atp.predict(df_predict_atp_h2o)['p1'].as_data_frame()
wta_preds = aml_wta.predict(df_predict_wta_h2o)['p1'].as_data_frame()
df_predict_atp['player_1_win_probability'] = atp_preds
df_predict_wta['player_1_win_probability'] = wta_preds
atp_submission = df_predict_atp[['player_1','player_2','player_1_win_probability']]
wta_submission = df_predict_wta[['player_1','player_2','player_1_win_probability']]
player_1 | player_2 | player_1_win_probability |
---|---|---|
novak djokovic | rafael nadal | 0.571588 |
novak djokovic | roger federer | 0.662511 |
novak djokovic | juan martin del potro | 0.544306 |
novak djokovic | alexander zverev | 0.709483 |
novak djokovic | kevin anderson | 0.687195 |
player_1 | player_2 | player_1_win_probability |
---|---|---|
simona halep | angelique kerber | 0.455224 |
simona halep | caroline wozniacki | 0.546276 |
simona halep | elina svitolina | 0.408014 |
simona halep | naomi osaka | 0.285125 |
simona halep | sloane stephens | 0.576643 |
Let's look at who has the highest win rate from our models
atp_submission.groupby('player_1')['player_1_win_probability'].mean() \
.reset_index().sort_values('player_1_win_probability',ascending=False).head()
player_1 | player_1_win_probability |
---|---|
novak djokovic | 0.846377 |
juan martin del potro | 0.787337 |
karen khachanov | 0.782963 |
rafael nadal | 0.778707 |
roger federer | 0.767337 |
wta_submission.groupby('player_1')['player_1_win_probability'].mean() \
.reset_index().sort_values('player_1_win_probability',ascending=False).head()
player_1 | player_1_win_probability |
---|---|
madison keys | 0.750580 |
naomi osaka | 0.749195 |
caroline wozniacki | 0.722409 |
kiki bertens | 0.713904 |
aryna sabalenka | 0.707368 |
Now we can output the predictions as csvs
atp_submission.to_csv('submission/atp_submission_python.csv',index=False)
wta_submission.to_csv('submission/wta_submission_pthon.csv',index=False)
(16256, 3)
Disclaimer
Note that whilst models and automated strategies are fun and rewarding to create, we can't promise that your model or betting strategy will be profitable, and we make no representations in relation to the code shared or information on this page. If you're using this code or implementing your own strategies, you do so entirely at your own risk and you are responsible for any winnings/losses incurred. Under no circumstances will Betfair be liable for any loss or damage you suffer.