Australian Open Datathon R Tutorial

Overview

The Task

This notebook will outline how the Betfair Data Scientists went about modelling the Australian Open for Betfair's Australian Open Datathon. The task is simple: we ask you to predict the winner of every possible Australian Open matchup using data which we provide.

The metric used to determine the winner will be log loss, based on the actual matchups that happen in the Open. For more information on log loss, click here.

How an outline of our methodology and thought process, read this article.

Exploring the Data

First we need to get an idea of what the data looks like. Let's read the men's data in and get an idea of what it looks like. Note that you will need to install all the packages listed below unless you already have them.

Note that for this tutorial I will be using dplyr, if you are not familiar with the syntax I encourage you to read up on the basics.

# Import libraries
library(dplyr)
library(readr)
library(tidyr)
library(RcppRoll)
library(tidyselect)
library(lubridate)
library(stringr)
library(zoo)
library(purrr)
library(h2o)
library(DT)
mens = readr::read_csv('data/ATP_matches.csv', na = ".") # NAs are indicated by .
mens %>%
  datatable(rownames = FALSE, extensions = 'Scroller', 
            options = list(dom = "t", 
            scrollY = 450,
            scroller = TRUE,
            scrollX = 600,
            fixedColumns = TRUE)) %>%
  formatRound(columns=pluck(., "x", "data") %>% colnames(), digits=3)

Winner	Loser	Tournament	Tournament_Date	Court_Surface	Round_Description	Winner_Rank	Loser_Rank	Retirement_Ind	Winner_Sets_Won	...	Loser_DoubleFaults	Loser_FirstServes_Won	Loser_FirstServes_In	Loser_SecondServes_Won	Loser_SecondServes_In	Loser_BreakPoints_Won	Loser_BreakPoints	Loser_ReturnPoints_Won	Loser_ReturnPoints_Faced	Loser_TotalPoints_Won
Edouard Roger-Vasselin	Eric Prodon	Chennai	2-Jan-12	Hard	First Round	106	97	0	2	...	3	21	33	13	26	1	3	15	49	49
Dudi Sela	Fabio Fognini	Chennai	2-Jan-12	Hard	First Round	83	48	0	2	...	4	17	32	5	26	0	1	8	33	30
Go Soeda	Frederico Gil	Chennai	2-Jan-12	Hard	First Round	120	102	0	2	...	2	45	70	18	35	2	4	36	103	99
Yuki Bhambri	Karol Beck	Chennai	2-Jan-12	Hard	First Round	345	101	0	2	...	1	15	33	13	29	2	3	15	46	43
Yuichi Sugita	Olivier Rochus	Chennai	2-Jan-12	Hard	First Round	235	67	0	2	...	0	19	32	13	22	1	7	30	78	62
Benoit Paire	Pere Riba	Chennai	2-Jan-12	Hard	First Round	95	89	0	2	...	5	13	20	12	32	0	1	9	44	34
Victor Hanescu	Sam Querrey	Chennai	2-Jan-12	Hard	First Round	90	93	0	2	...	8	29	41	7	24	1	3	17	57	53
Yen-Hsun Lu	Thiemo de Bakker	Chennai	2-Jan-12	Hard	First Round	82	223	0	2	...	0	20	32	10	24	1	1	19	57	49
Andreas Beck	Vasek Pospisil	Chennai	2-Jan-12	Hard	First Round	98	119	0	2	...	3	39	57	16	38	1	5	24	74	79
Ivan Dodig	Vishnu Vardhan	Chennai	2-Jan-12	Hard	First Round	36	313	0	2	...	5	41	59	13	27	2	8	34	101	88
David Goffin	Xavier Malisse	Chennai	2-Jan-12	Hard	First Round	174	49	0	2	...	1	31	43	19	34	1	4	27	85	77
David Goffin	Andreas Beck	Chennai	2-Jan-12	Hard	Second Round	174	98	0	2	...	0	43	71	14	27	2	8	27	82	84
Dudi Sela	Benoit Paire	Chennai	2-Jan-12	Hard	Second Round	83	95	0	2	...	5	40	58	21	46	1	7	26	87	87
Stan Wawrinka	Edouard Roger-Vasselin	Chennai	2-Jan-12	Hard	Second Round	17	106	0	2	...	0	43	70	16	34	4	6	28	82	87
Go Soeda	Ivan Dodig	Chennai	2-Jan-12	Hard	Second Round	120	36	0	2	...	2	31	41	11	28	1	4	23	73	65
Milos Raonic	Victor Hanescu	Chennai	2-Jan-12	Hard	Second Round	31	90	0	2	...	1	25	38	5	14	0	4	15	56	45
Yuichi Sugita	Yen-Hsun Lu	Chennai	2-Jan-12	Hard	Second Round	235	82	0	2	...	4	34	45	12	34	2	9	38	93	84
Janko Tipsarevic	Yuki Bhambri	Chennai	2-Jan-12	Hard	Second Round	9	345	0	2	...	2	12	22	9	17	0	1	8	41	29
Janko Tipsarevic	David Goffin	Chennai	2-Jan-12	Hard	Quarter-finals	9	174	0	2	...	5	34	51	19	40	1	2	18	67	71
Milos Raonic	Dudi Sela	Chennai	2-Jan-12	Hard	Quarter-finals	31	83	0	2	...	2	23	31	19	28	0	3	16	69	58
Go Soeda	Stan Wawrinka	Chennai	2-Jan-12	Hard	Quarter-finals	120	17	0	2	...	4	18	34	13	31	3	7	31	74	62
Nicolas Almagro	Yuichi Sugita	Chennai	2-Jan-12	Hard	Quarter-finals	10	235	0	2	...	1	36	65	30	40	3	12	45	123	111
Janko Tipsarevic	Go Soeda	Chennai	2-Jan-12	Hard	Semi-finals	9	120	0	2	...	1	21	33	10	28	1	1	10	44	41
Milos Raonic	Nicolas Almagro	Chennai	2-Jan-12	Hard	Semi-finals	31	10	0	2	...	0	31	45	8	15	0	3	12	54	51
Milos Raonic	Janko Tipsarevic	Chennai	2-Jan-12	Hard	Finals	31	9	0	2	...	2	59	83	34	55	0	4	25	113	118
Igor Andreev	Adrian Mannarino	Brisbane	2-Jan-12	Hard	First Round	115	87	0	2	...	3	24	35	13	25	1	3	21	70	58
Alexandr Dolgopolov	Alejandro Falla	Brisbane	2-Jan-12	Hard	First Round	15	74	0	2	...	3	16	33	12	25	3	7	33	75	61
Tatsuma Ito	Benjamin Mitchell	Brisbane	2-Jan-12	Hard	First Round	122	227	0	2	...	6	30	44	7	24	0	2	13	52	50
Kei Nishikori	Cedrik-Marcel Stebe	Brisbane	2-Jan-12	Hard	First Round	25	81	0	2	...	2	27	49	23	41	3	6	28	75	78
Denis Istomin	Florian Mayer	Brisbane	2-Jan-12	Hard	First Round	73	23	1	1	...	1	28	38	11	17	0	2	15	56	54
...	...	...	...	...	...	...	...	...	...		...	...	...	...	...	...	...	...	...	...
Malek Jaziri	Fernando Verdasco	Paris	29-Oct-18	Indoor Hard	Second Round	55	27	0	2	...	6	46	60	16	35	3	13	39	104	101
Alexander Zverev	Frances Tiafoe	Paris	29-Oct-18	Indoor Hard	Second Round	5	44	0	2	...	4	26	40	16	36	2	10	27	72	69
Dominic Thiem	Gilles Simon	Paris	29-Oct-18	Indoor Hard	Second Round	8	31	0	2	...	1	13	26	12	25	2	2	23	59	48
Novak Djokovic	Joao Sousa	Paris	29-Oct-18	Indoor Hard	Second Round	2	48	0	2	...	2	25	35	6	22	1	10	27	74	58
Karen Khachanov	Matthew Ebden	Paris	29-Oct-18	Indoor Hard	Second Round	18	39	1	2	...	6	8	18	5	20	1	2	10	30	23
John Isner	Mikhail Kukushkin	Paris	29-Oct-18	Indoor Hard	Second Round	9	54	0	2	...	1	54	80	24	39	0	1	13	90	91
Kevin Anderson	Nikoloz Basilashvili	Paris	29-Oct-18	Indoor Hard	Second Round	6	22	0	2	...	7	43	54	30	49	0	3	26	106	99
Marin Cilic	Philipp Kohlschreiber	Paris	29-Oct-18	Indoor Hard	Second Round	7	43	0	2	...	1	19	34	12	20	1	1	17	55	48
Jack Sock	Richard Gasquet	Paris	29-Oct-18	Indoor Hard	Second Round	23	28	0	2	...	4	18	33	16	29	0	4	19	59	53
Grigor Dimitrov	Roberto Bautista Agut	Paris	29-Oct-18	Indoor Hard	Second Round	10	25	0	2	...	0	34	48	11	20	2	4	27	76	72
Damir Dzumhur	Stefanos Tsitsipas	Paris	29-Oct-18	Indoor Hard	Second Round	52	16	0	2	...	3	14	26	15	30	2	2	17	52	46
Dominic Thiem	Borna Coric	Paris	29-Oct-18	Indoor Hard	Third Round	8	13	0	2	...	1	39	57	16	38	2	2	27	88	82
Novak Djokovic	Damir Dzumhur	Paris	29-Oct-18	Indoor Hard	Third Round	2	52	1	2	...	4	15	28	7	18	0	0	8	28	30
Alexander Zverev	Diego Schwartzman	Paris	29-Oct-18	Indoor Hard	Third Round	5	19	0	2	...	2	22	37	12	24	0	4	18	58	52
Roger Federer	Fabio Fognini	Paris	29-Oct-18	Indoor Hard	Third Round	3	14	0	2	...	6	22	32	15	37	1	5	16	54	53
Marin Cilic	Grigor Dimitrov	Paris	29-Oct-18	Indoor Hard	Third Round	7	10	0	2	...	1	37	55	14	32	1	5	22	71	73
Karen Khachanov	John Isner	Paris	29-Oct-18	Indoor Hard	Third Round	18	9	0	2	...	4	67	80	19	38	0	0	17	100	103
Kei Nishikori	Kevin Anderson	Paris	29-Oct-18	Indoor Hard	Third Round	11	6	0	2	...	1	26	33	11	19	0	0	11	51	48
Jack Sock	Malek Jaziri	Paris	29-Oct-18	Indoor Hard	Third Round	23	55	0	2	...	6	13	21	10	24	0	0	9	41	32
Karen Khachanov	Alexander Zverev	Paris	29-Oct-18	Indoor Hard	Quarter-finals	18	5	0	2	...	7	26	47	4	21	1	3	10	36	40
Dominic Thiem	Jack Sock	Paris	29-Oct-18	Indoor Hard	Quarter-finals	8	23	0	2	...	5	44	59	19	37	2	10	34	97	97
Roger Federer	Kei Nishikori	Paris	29-Oct-18	Indoor Hard	Quarter-finals	3	11	0	2	...	0	21	37	16	26	0	1	12	56	49
Novak Djokovic	Marin Cilic	Paris	29-Oct-18	Indoor Hard	Quarter-finals	2	7	0	2	...	0	38	55	11	28	2	5	29	85	78
Karen Khachanov	Dominic Thiem	Paris	29-Oct-18	Indoor Hard	Semi-finals	18	8	0	2	...	0	19	29	8	26	1	3	15	47	42
Novak Djokovic	Roger Federer	Paris	29-Oct-18	Indoor Hard	Semi-finals	2	3	0	2	...	2	69	93	25	46	1	2	29	113	123
Karen Khachanov	Novak Djokovic	Paris	29-Oct-18	Indoor Hard	Finals	18	2	0	2	...	1	30	43	14	28	1	5	20	66	64
Jaume Antoni Munar Clar	Frances Tiafoe	Milan	5-Nov-18	Indoor Hard	NA	76	40	0	3	...	3	21	29	6	17	0	2	5	46	32
Frances Tiafoe	Hubert Hurkacz	Milan	5-Nov-18	Indoor Hard	NA	40	85	0	3	...	4	35	48	10	19	1	7	22	78	67
Hubert Hurkacz	Jaume Antoni Munar Clar	Milan	5-Nov-18	Indoor Hard	NA	85	76	0	3	...	1	43	63	15	35	3	9	29	80	87
Andrey Rublev	Liam Caruana	Milan	5-Nov-18	Indoor Hard	NA	68	NA	0	3	...	1	28	39	4	14	1	3	18	57	50

As we can see, we have a Winner column, a Loser column, as well as other columns detailing the match details, and other columns which have the stats for that match. As we have a Winner column, if we use the current data structure to train a model we will leak the result. The model will simply learn that the actual winner comes from the Winner column, rather than learning from other features that we can create, such as First Serve %.

To avoid this problem, let's reshape the data from wide to long, then shuffle the data. For this, we will define a function, split_winner_loser_columns, which splits the raw data frame into two data frames, appends them together, and then shuffles the data.

Let's also remove all Grass and Clay matches from our data, as we will be modelling the Australian Open which is a hardcourt surface.

Additionally, we will add a few columns, such as Match_Id and Total_Games. These will be useful later.

split_winner_loser_columns <- function(df) {
  # This function splits the raw data into two data frames and appends them together then shuffles them
  # This output is a data frame with only one player's stats on each row (i.e. in long format)

  # Grab a df with only the Winner's stats
  winner = df %>% 
    select(-contains("Loser")) %>% # Select only the Winner columns + extra game info columns as a df
    rename_at( # Rename all columns containing "Winner" to "Player" 
      vars(contains("Winner")),
      ~str_replace(., "Winner", "Player")
    ) %>%
    mutate(Winner = 1) # Create a target column

  # Repeat the process with the loser's stats
  loser = df %>%
    select(-contains("Winner")) %>%
    rename_at(
      vars(contains("Loser")),
      ~str_replace(., "Loser", "Player")
    ) %>%
    mutate(Winner = 0)

  set.seed(183) # Set seed to replicate results - 183 is the most games played in a tennis match (Isner-Mahut)

  # Create a df that appends both the Winner and loser df together
  combined_df = winner %>% 
    rbind(loser) %>% # Append the loser df to the Winner df
    slice(sample(1:n())) %>% # Randomise row order
    arrange(Match_Id) %>% # Arrange by Match_Id
    return()
}

# Read in men and womens data; randomise the data to avoid result leakage
mens = readr::read_csv('data/ATP_matches.csv', na = ".") %>%
  filter(Court_Surface == "Hard" | Court_Surface == "Indoor Hard") %>% # Filter to only use hardcourt games
  mutate(Match_Id = row_number(), # Add a match ID column to be used as a key
         Tournament_Date = dmy(Tournament_Date), # Change Tournament to datetime
         Total_Games = Winner_Games_Won + Loser_Games_Won) %>% # Add a total games played column
  split_winner_loser_columns() # Change the data frame from wide to long
mens %>%
  datatable(rownames = FALSE, extensions = 'Scroller', 
            options = list(dom = "t", 
            scrollY = 450,
            scroller = TRUE,
            scrollX = 600,
            fixedColumns = TRUE)) %>%
  formatRound(columns=pluck(., "x", "data") %>% colnames(), digits=3)

Player	Tournament	Tournament_Date	Court_Surface	Round_Description	Player_Rank	Player_Sets_Won	Player_Games_Won	Player_Aces	...	Player_SecondServes_Won	Player_SecondServes_In	Player_BreakPoints_Won	Player_BreakPoints	Player_ReturnPoints_Won	Player_ReturnPoints_Faced	Player_TotalPoints_Won	Match_Id	Total_Games	Winner
Eric Prodon	Chennai	2012-01-02	Hard	First Round	97	0	7	2	...	13	26	1	3	15	49	49	1	19	0
Edouard Roger-Vasselin	Chennai	2012-01-02	Hard	First Round	106	2	12	5	...	12	19	4	7	25	59	59	1	19	1
Dudi Sela	Chennai	2012-01-02	Hard	First Round	83	2	12	2	...	11	16	6	14	36	58	61	2	13	1
Fabio Fognini	Chennai	2012-01-02	Hard	First Round	48	0	1	1	...	5	26	0	1	8	33	30	2	13	0
Frederico Gil	Chennai	2012-01-02	Hard	First Round	102	1	14	5	...	18	35	2	4	36	103	99	3	33	0
Go Soeda	Chennai	2012-01-02	Hard	First Round	120	2	19	6	...	19	39	5	11	42	105	109	3	33	1

Feature Creation

Now that we have a fairly good understanding of what the data looks like, let's add some features. To do this we will define a function. Ideally we want to add features which will provide predictive power to our model.

Thinking about the dynamics of tennis, we know that players often will matches by "breaking" the opponent's serve (i.e. winning a game when the opponent is serving). This is especially important in mens tennis. Let's create a feature called F_Player_BreakPoints_Per_Game, which is the number of breakpoints a player gets per game that they play (even though they can only get breakpoints every second game, we will use total games). Let's also create a feature called F_Player_Return_Win_Ratio which is the proportion of points won when returning.

Similarly, "holding" serve is important (i.e. winning a game when you are serving). Let's create a feature called F_Player_Serve_Win_Ratio which is the proportion of points won when serving.

Finally, you only win a set of tennis by winning more sets than your opponent. To win a set, you need to win games. Let's create a feature called F_Player_Game_Win_Percentage which is the propotion of games that a player wins.

add_ratio_features <- function(df) {
  # This function adds ratio features to a long df
  df %>%
    mutate(
      # Point Win ratio when serving
      F_Player_Serve_Win_Ratio = (Player_FirstServes_Won + Player_SecondServes_Won - Player_DoubleFaults) / 
        (Player_FirstServes_In + Player_SecondServes_In + Player_DoubleFaults), 
      # Point win ratio when returning
      F_Player_Return_Win_Ratio = Player_ReturnPoints_Won / Player_ReturnPoints_Faced, 
      # Breakpoints per receiving game
      F_Player_BreakPoints_Per_Game = Player_BreakPoints / Total_Games, 
      F_Player_Game_Win_Percentage = Player_Games_Won / Total_Games
    ) %>%
    mutate_at(
      vars(colnames(.), -contains("Rank"), -Tournament_Date), # Replace all NAs with0 apart from Rank, Date
      ~ifelse(is.na(.), 0, .)
    ) %>%
    return()
}
mens = mens %>%
  add_ratio_features() # Add features

Now that we have added our features, we need to create rolling averages for them. We cannot simply use current match statistics, as they will leak the result to the model. Instead, we need to use past match statistics to predict future matches. Here we will use a rolling mean with a window of 15. If the player hasn't played 15 games, we will instead use a cumulative mean. We will also lag the result so as to not leak the result.

This next chunk of code simply takes all the columns starting with F_ and calculates these means.

mens = mens %>% 
  group_by(Player) %>% # Group by player
  mutate_at( # Create a rolling mean with window 15 for each player. 
    vars(starts_with("F_")), # If the player hasn't played 15 games, use a cumulative mean
    ~coalesce(rollmean(., k = 15, align = "right", fill = NA_real_), cummean(.)) %>% lag()
  ) %>%
  ungroup()

Creating a Training Feature Matrix

In predictive modelling language - features are data metrics we use to predict an outcome or target variable. We have several choices to make before we get to the prediction phase. What are the features? How do we structure the outcome variable? What does each row mean? Do we use all data or just a subset? We narrowed it down to two options

We can train the model on every tennis match in the data set, or We can only train the model on Australian Open matches. Doing Option 1 would mean we have a lot more data to build a strong model, but it might be challenging to work around the constraints described in the tournament structure.

Doing Option 2 fits better from that angle but leaves us with very few matches to train our model on.

We have decided to go with an option that combines strengths from both approaches, by training the model on matches from the Aus Open and the US Open because both grand slams are played on the same surface - hard court.

However, we also need to train our model in the same way that will be used to predict the 2019 Australian Open. When predicting the 2nd round, we won't have data from the 1st round. So we will need to build our training feature matrix with this in mind. We should extract features for a player from past games at the start of the tournament and apply them to every matchup that that player plays.

To do this, we will create a function, extract_latest_features_for_tournament, which maps over our feature data frame for the dates in the first round of a tournament and grabs features.

First, we need the Australian Open and US Open results - let's grab these and then apply our function.

# Get Australian Open and US Open Results
aus_us_open_results = 
  mens %>%
  filter((Tournament == "Australian Open, Melbourne" | Tournament == "U.S. Open, New York")
         & Round_Description != "Qualifying" & Tournament_Date != "2012-01-16") %>% # Filter out qualifiers
  select(Match_Id, Player, Tournament, Tournament_Date, Round_Description, Winner)
# Create a function which extracts features for each tournament
extract_latest_features_for_tournament = function(df, dte) {

  df %>% # Filter for the 1st round
    filter(Tournament_Date == dte, Round_Description == "First Round", Tournament_Date != "2012-01-16") %>% 
    group_by(Player) %>% # Group by player
    select_at(
      vars(Match_Id, starts_with("F_"), Player_Rank) # Grab the players' features
    ) %>%
    rename(F_Player_Rank = Player_Rank) %>%
    ungroup() %>%
    mutate(Feature_Date = dte) %>%
    select(Player, Feature_Date, everything())
}
# Create a feature matrix in long format
feature_matrix_long = 
  aus_us_open_results %>%
  distinct(Tournament_Date) %>% # Pull all Tournament Dates
  pull() %>%
  map_dfr(
    ~extract_latest_features_for_tournament(mens, .) # Get the features
  ) %>%
  filter(Feature_Date != "2012-01-16") %>% # Filter out the first Aus Open
  mutate_at( # Replace NAs with the mean
    vars(starts_with("F_")),
    ~ifelse(is.na(.), mean(., na.rm = TRUE), .)
  )

Now that we have a feature matrix in long format, we need to convert it to wide format so that the features are on the same row. To do this we will define a function gather_df, which converts the data frame from long to wide. Let's also join the results to the matrix and convert the Winner column to a factor. Finally, we will take the difference of player1 and player2's features, so as to reduce the dimensionality of the model.

gather_df <- function(df) {
  # This function puts the df back into its original format of each row containing stats for both players
  df %>%
    arrange(Match_Id) %>%
    filter(row_number() %% 2 != 0) %>% # Filter for every 2nd row, starting at the 1st index. e.g. 1, 3, 5
    rename_at( # Rename columns to player_1
      vars(contains("Player")),
      ~str_replace(., "Player", "player_1")
    ) %>%
    inner_join(df %>%
                 filter(row_number() %% 2 == 0) %>%
                 rename_at(
                   vars(contains("Player")), # Rename columns to player_2
                   ~str_replace(., "Player", "player_2")
                 ) %>%
                 select(Match_Id, contains("Player")),
               by=c('Match_Id')
    ) %>%
    select(Match_Id, player_1, player_2, Winner, everything()) %>%
    return()
}
# Joining results to features
feature_matrix_wide = aus_us_open_results %>%
  inner_join(feature_matrix_long %>% 
               select(-Match_Id), 
             by = c("Player", "Tournament_Date" = "Feature_Date")) %>%
  gather_df() %>%
  mutate(
    F_Serve_Win_Ratio_Diff = F_player_1_Serve_Win_Ratio - F_player_2_Serve_Win_Ratio,
    F_Return_Win_Ratio_Diff = F_player_1_Return_Win_Ratio - F_player_2_Return_Win_Ratio,
    F_Game_Win_Percentage_Diff = F_player_1_Game_Win_Percentage - F_player_2_Game_Win_Percentage,
    F_BreakPoints_Per_Game_Diff = F_player_1_BreakPoints_Per_Game - F_player_2_BreakPoints_Per_Game,
    F_Rank_Diff = (F_player_1_Rank - F_player_2_Rank),
    Winner = as.factor(Winner)
  ) %>%
  select(Match_Id, player_1, player_2, Tournament, Tournament_Date, Round_Description, Winner, contains("Diff"))
train = feature_matrix_wide
train %>%
  datatable(rownames = FALSE, extensions = 'Scroller', 
            options = list(dom = "t", 
            scrollY = 450,
            scroller = TRUE,
            scrollX = 600,
            fixedColumns = TRUE)) %>%
  formatRound(columns=pluck(., "x", "data") %>% colnames(), digits=3)

Match_Id	player_1	player_2	Tournament	Tournament_Date	Round_Description	Winner	F_Serve_Win_Ratio_Diff	F_Return_Win_Ratio_Diff	F_Game_Win_Percentage_Diff	F_BreakPoints_Per_Game_Diff	F_Rank_Diff
1139	Adrian Ungur	Daniel Brands	U.S. Open, New York	2012-08-27	First Round	0	0.03279412	-0.014757229	0.002877458	0.073938088	-13
1140	Albert Montanes	Richard Gasquet	U.S. Open, New York	2012-08-27	First Round	0	-0.08000322	-0.077451342	-0.131108056	-0.180846832	97
1141	Martin Klizan	Alejandro Falla	U.S. Open, New York	2012-08-27	First Round	1	0.07711693	-0.044715517	0.068179841	-0.087361962	1
1142	Alex Bogomolov Jr.	Andy Murray	U.S. Open, New York	2012-08-27	First Round	0	-0.03964074	-0.031700826	-0.059010072	-0.094721700	69
1143	Tommy Robredo	Andreas Seppi	U.S. Open, New York	2012-08-27	First Round	1	-0.02681392	0.006442134	-0.067779660	-0.009930089	151
1144	Ryan Harrison	Benjamin Becker	U.S. Open, New York	2012-08-27	First Round	1	0.04251983	0.018604623	0.026486753	-0.003548973	-24

Creating the Feature Matrix for the 2019 Australian Open

Now that we have our training set, train, we need to create a feature matrix to create predictions on. To do this, we need to generate features again. We could simply append a player list to our raw data frame, create a mock date and then use the extract_latest_features_for_tournament function that we used before. Instead, we're going to create a lookup table for each unique player in the 2019 Australian Open. We will need to get their last 15 games and then find the mean for each feature so that our features are the same.

Let's first explore what the dummy submission file looks like, then use it to get the unique players.

read_csv('data/men_dummy_submission_file.csv') %>% glimpse()

As we can see, the dummy submission file contains every potential match up for the Open. This will be updated a few days before the Open starts with the actual players playing. Let's now create the lookup feature table.

# Get a vector of unique players in this years' open using the dummy submission file
unique_players = read_csv('data/men_dummy_submission_file.csv') %>% pull(player_1) %>% unique()
# Get the last 15 games played for each unique player and find their features
lookup_feature_table = read_csv('data/ATP_matches.csv', na = ".") %>%
  filter(Court_Surface == "Hard" | Court_Surface == "Indoor Hard") %>%
  mutate(Match_Id = row_number(), # Add a match ID column to be used as a key
         Tournament_Date = dmy(Tournament_Date), # Change Tournament to datetime
         Total_Games = Winner_Games_Won + Loser_Games_Won) %>% # Add a total games played column
  # clean_missing_data() %>% # Clean missing data
  split_winner_loser_columns() %>% # Change the data frame from wide to long
  add_ratio_features() %>%
  filter(Player %in% unique_players) %>%
  group_by(Player) %>%
  top_n(15, Match_Id) %>%
  summarise(
    F_Player_Serve_Win_Ratio = mean(F_Player_Serve_Win_Ratio),
    F_Player_Return_Win_Ratio = mean(F_Player_Return_Win_Ratio),
    F_Player_BreakPoints_Per_Game = mean(F_Player_BreakPoints_Per_Game),
    F_Player_Game_Win_Percentage = mean(F_Player_Game_Win_Percentage),
    F_Player_Rank = last(Player_Rank)
  )

Now let's create features for every single combination. To do this we'll join our lookup_feature_table to the player_1 and player_2 columns in the dummy_submission_file.

# Create feature matrix for the Australian Open for all player 1s
features_player_1 = read_csv('data/men_dummy_submission_file.csv') %>%
  select(player_1) %>%
  inner_join(lookup_feature_table, by=c("player_1" = "Player")) %>%
  rename(F_player_1_Serve_Win_Ratio = F_Player_Serve_Win_Ratio,
         F_player_1_Return_Win_Ratio = F_Player_Return_Win_Ratio,
         F_player_1_BreakPoints_Per_Game = F_Player_BreakPoints_Per_Game,
         F_player_1_Game_Win_Percentage = F_Player_Game_Win_Percentage,
         F_player_1_Rank = F_Player_Rank)
# Create feature matrix for the Australian Open for all player 2s
features_player_2 = read_csv('data/men_dummy_submission_file.csv') %>%
  select(player_2) %>%
  inner_join(lookup_feature_table, by=c("player_2" = "Player")) %>%
  rename(F_player_2_Serve_Win_Ratio = F_Player_Serve_Win_Ratio,
         F_player_2_Return_Win_Ratio = F_Player_Return_Win_Ratio,
         F_player_2_BreakPoints_Per_Game = F_Player_BreakPoints_Per_Game,
         F_player_2_Game_Win_Percentage = F_Player_Game_Win_Percentage,
         F_player_2_Rank = F_Player_Rank)
# Join the two dfs together and subtract features to create Difference features
aus_open_2019_features = features_player_1 %>% 
  bind_cols(features_player_2) %>%
  select(player_1, player_2, everything()) %>%
  mutate(
    F_Serve_Win_Ratio_Diff = F_player_1_Serve_Win_Ratio - F_player_2_Serve_Win_Ratio,
    F_Return_Win_Ratio_Diff = F_player_1_Return_Win_Ratio - F_player_2_Return_Win_Ratio,
    F_Game_Win_Percentage_Diff = F_player_1_Game_Win_Percentage - F_player_2_Game_Win_Percentage,
    F_BreakPoints_Per_Game_Diff = F_player_1_BreakPoints_Per_Game - F_player_2_BreakPoints_Per_Game,
    F_Rank_Diff = (F_player_1_Rank - F_player_2_Rank)
  ) %>%
  select(player_1, player_2, contains("Diff"))
aus_open_2019_features  %>%
  datatable(rownames = FALSE, extensions = 'Scroller', 
            options = list(dom = "t",
                          scrollY = 450,
                          scroller = TRUE,
                          scrollX = 600,
                          fixedColumns = TRUE)) %>%
  formatRound(columns=pluck(., "x", "data") %>% colnames(), digits=3)

player_1	player_2	F_Serve_Win_Ratio_Diff	F_Return_Win_Ratio_Diff	F_Game_Win_Percentage_Diff	F_BreakPoints_Per_Game_Diff	F_Rank_Diff
Novak Djokovic	Rafael Nadal	0.06347805	0.02503802	0.07002382	0.08951024	1
Novak Djokovic	Roger Federer	0.06583364	0.03628491	0.07661295	0.15455628	-1
Novak Djokovic	Juan Martin del Potro	0.01067079	0.03436023	0.06382353	0.11259979	-2
Novak Djokovic	Alexander Zverev	0.11117863	0.03125651	0.11055585	0.08661036	-3
Novak Djokovic	Kevin Anderson	0.02132375	0.10449337	0.11184503	0.23684083	-4
Novak Djokovic	Marin Cilic	0.08410746	0.02434916	0.07653035	0.08355134	-5

Generating 2019 Australian Open Predictions

Now that we have our features, we can finally train our model and generate predictions for the 2019 Australian Open. Due to its simplicity, we will use h2o's Auto Machine Learning function h2o.automl. This will train a heap of different models and optimise the hyperparameters, as well as creating stacked ensembles automatically for us. We will use optimising by log loss.

First, we must create h2o frames for our training and feature data frames. Then we will run h2o.automl. Note that we can set the max_runtime_secs parameter. As this is a notebook, I have set it for 30 seconds - but I suggest you give it 10 minutes to create the best model. We can then create our predictions and assign them back to our aus_open_2019_features data frame. Finally, we will group_by player and find the best player, on average.

## Setup H2O
h2o.init(ip = "localhost",
         port = 54321,
         enable_assertions = TRUE,
         nthreads = 2,
         max_mem_size = "24g"

)
## Sending file to h2o
train_h2o = feature_matrix_wide %>%
  select(contains("Diff"), Winner) %>%
  as.h2o(destination_frame = "train_h2o")
aus_open_2019_features_h2o = aus_open_2019_features %>%
  select(contains("Diff")) %>%
  as.h2o(destination_frame = "aus_open_2019_features_h2o")
## Running Auto ML 
mens_model = h2o.automl(y = "Winner",
                        training_frame = train_h2o,
                        max_runtime_secs = 30,
                        max_models = 100,
                        stopping_metric = "logloss",
                        sort_metric = "logloss",
                        balance_classes = TRUE,
                        seed = 183) # Set seed to replicate results - 183 is the most games played in a tennis match (Isner-Mahut)
## Predictions on test frame
predictions = h2o.predict(mens_model@leader, aus_open_2019_features_h2o) %>%
  as.data.frame()
aus_open_2019_features$prob_player_1 = predictions$p1
aus_open_2019_features$prob_player_2 = predictions$p0
h2o.shutdown(prompt = FALSE)

Now let's find the best player by taking the mean of the prediction probability by player.

aus_open_2019_features %>% 
  select(player_1, starts_with("F_"), prob_player_1) %>%
  group_by(player_1) %>%
  summarise_all(mean) %>%
  arrange(desc(prob_player_1)) %>%
  datatable(rownames = FALSE, extensions = 'Scroller', 
            options = list(dom = "t",
                          scrollY = 450,
                          scroller = TRUE,
                          scrollX = 600,
                          fixedColumns = TRUE)) %>%
  formatRound(columns=pluck(., "x", "data") %>% colnames(), digits=3)

player_1	F_Serve_Win_Ratio_Diff	F_Return_Win_Ratio_Diff	F_Game_Win_Percentage_Diff	F_BreakPoints_Per_Game_Diff	F_Rank_Diff	prob_player_1
Novak Djokovic	0.1109364627	0.076150615	0.1483970690	0.17144300	NA	0.8616486
Karen Khachanov	0.0960639298	0.061436164	0.1059967623	0.04544955	NA	0.8339594
Juan Martin del Potro	0.1003931993	0.042025222	0.0847985439	0.05943767	NA	0.8218308
Rafael Nadal	0.0480432305	0.051531252	0.0790179917	0.08181694	NA	0.8032543
Gilles Simon	0.0646937767	0.084843307	0.0901401318	0.08675350	NA	0.7985995
Roger Federer	0.0452014997	0.040992497	0.0725719954	0.01817046	NA	0.7962289
Kei Nishikori	0.0777155934	0.018720226	0.0800648870	0.02740276	NA	0.7843631
Marin Cilic	0.0285413602	0.053017465	0.0736687072	0.08883055	NA	0.7804876
Tomas Berdych	0.0471654691	0.047289449	0.0737401748	0.10584114	NA	0.7739211
Daniil Medvedev	0.0275430665	0.031121856	0.0721948279	0.01803757	NA	0.7543269
Stefanos Tsitsipas	0.0470382377	0.023825850	0.0577628626	0.02105227	NA	0.7511674
Dominic Thiem	0.0258904189	0.032481624	0.0483707080	0.05857158	NA	0.7451547
Alexander Zverev	0.0006199716	0.044811275	0.0380134371	0.08423392	NA	0.7374897
Kyle Edmund	0.0558006240	0.011963627	0.0478850676	0.05142186	NA	0.7304873
Pablo Carreno Busta	0.0321878318	0.029862068	0.0413674481	-0.00229784	NA	0.7302043
Borna Coric	0.0762084129	-0.010097922	0.0413621283	-0.01924267	NA	0.7268124
Kevin Anderson	0.0907358428	-0.027171681	0.0381421997	-0.06362578	NA	0.7260799
David Goffin	-0.0034821911	0.037247336	0.0162572061	0.05603565	NA	0.7155908
Fernando Verdasco	0.0229261365	0.032884054	0.0521212576	0.04668854	NA	0.7120831
Roberto Bautista Agut	0.0047641170	0.049939608	0.0218975349	0.07331023	NA	0.7009891
Milos Raonic	0.0849726089	-0.028732182	0.0385944327	-0.08009382	NA	0.6986865
Fabio Fognini	-0.0394792678	0.047935185	0.0226546894	0.06213496	NA	0.6982031
Hyeon Chung	0.0042489153	0.047722133	0.0158096386	0.04823304	NA	0.6958943
Jack Sock	-0.0099659903	0.026454984	0.0186547428	0.02307214	NA	0.6757770
Diego Schwartzman	-0.0317130675	0.032098381	0.0006215006	0.05621187	NA	0.6631067
John Millman	0.0016290285	0.042676556	0.0119857356	0.06228135	NA	0.6603912
Nikoloz Basilashvili	-0.0099968609	0.005561102	0.0473876170	0.03661962	NA	0.6602628
John Isner	0.1346946527	-0.070556940	0.0161348609	-0.11425009	NA	0.6598097
Gael Monfils	-0.0074254934	0.024286746	0.0295568649	0.04007519	NA	0.6449506
Richard Gasquet	0.0296009556	-0.011382437	0.0013138324	-0.03972967	NA	0.6442043
...	...	...	...	...	...	...
Laslo Djere	-0.042300822	-0.0150684095	-0.064667709	-0.0349151578	NA	0.3606923
David Ferrer	-0.036179509	0.0532782117	0.012751020	0.0914824480	NA	0.3488057
Bradley Klahn	-0.001248083	-0.0444982448	-0.025987040	-0.1181295700	NA	0.3487806
Marcel Granollers	-0.031011830	-0.0094056152	-0.049853664	0.0136841358	NA	0.3460035
Ricardas Berankis	-0.022557215	-0.0103782963	-0.047937290	-0.0468488990	NA	0.3454980
Radu Albot	-0.040829057	0.0076150564	-0.034891704	0.0443672533	NA	0.3420615
Jordan Thompson	-0.068554906	0.0261969117	-0.044349181	0.0206636045	NA	0.3358572
Thomas Fabbiano	-0.060583307	0.0275756029	-0.025883493	0.0709707306	NA	0.3319778
Roberto Carballes Baena	-0.054016396	-0.0091521177	-0.019093050	0.0347187874	NA	0.3312105
Paolo Lorenzi	-0.038613500	-0.0212206827	-0.052602703	0.0199474025	NA	0.3299791
Guido Andreozzi	-0.038614385	-0.0133763922	0.029549861	0.0636745661	NA	0.3288762
Peter Polansky	0.007461636	-0.0163389196	-0.024034159	-0.0442144260	NA	0.3216756
Ernests Gulbis	-0.062827089	-0.0134699552	-0.027633425	-0.0518663252	NA	0.3123511
Thiago Monteiro	0.001235931	-0.0288349103	-0.043831840	-0.0654744344	NA	0.3122069
Casper Ruud	0.016838968	-0.0178511679	0.015234507	0.0219131874	NA	0.3119321
Marco Trungelliti	-0.022148774	-0.0005658242	0.048542554	0.1243537739	NA	0.3092636
Jiri Vesely	-0.050204009	-0.0351868278	-0.042887646	-0.0160467165	NA	0.3089287
Guillermo Garcia-Lopez	-0.090076100	-0.0108663630	-0.048712763	-0.0124446402	NA	0.3080898
Michael Mmoh	-0.063802934	-0.0079053251	-0.011112236	-0.0332042032	NA	0.2822330
Jason Kubler	-0.124758873	-0.0202756806	-0.013998570	0.1020895301	NA	0.2814246
Ruben Bemelmans	-0.029036164	-0.0138846550	-0.032256254	-0.0363563402	NA	0.2772185
Bjorn Fratangelo	-0.014149222	0.0033574304	-0.019931504	-0.0360199607	NA	0.2652527
Pablo Andujar	-0.042869833	-0.0488261697	-0.070057834	-0.0164918910	NA	0.2647100
Christian Garin	-0.046150875	0.0235799476	-0.006209664	0.0736304057	NA	0.2631607
Ivo Karlovic	0.071597162	-0.1093833837	0.001410787	-0.1237762218	NA	0.2500242
Juan Ignacio Londero	-0.026454456	-0.0715665271	-0.016749898	-0.0363353678	NA	0.2351747
Ramkumar Ramanathan	-0.005371622	-0.0606138479	-0.041631884	-0.0005573405	NA	0.2272977
Reilly Opelka	0.025704824	-0.0607219257	-0.015474944	-0.0720809006	NA	0.2262993
Carlos Berlocq	-0.063580460	0.0074576369	-0.054277974	-0.0165235079	NA	0.2112275
Pedro Sousa	-0.197333352	-0.0734557562	-0.161962722	-0.1023311674	NA	0.1502313

Disclaimer

Note that whilst models and automated strategies are fun and rewarding to create, we can't promise that your model or betting strategy will be profitable, and we make no representations in relation to the code shared or information on this page. If you're using this code or implementing your own strategies, you do so entirely at your own risk and you are responsible for any winnings/losses incurred. Under no circumstances will Betfair be liable for any loss or damage you suffer.