Skip to content

Australian Open Datathon R Tutorial

Overview

The Task

This notebook will outline how the Betfair Data Scientists went about modelling the Australian Open for Betfair's Australian Open Datathon. The task is simple: we ask you to predict the winner of every possible Australian Open matchup using data which we provide.

The metric used to determine the winner will be log loss, based on the actual matchups that happen in the Open. For more information on log loss, click here.

For a detailed outline of the task, the prizes, and to sign up, click here.

How an outline of our methodoly and thought process, read this article.

Prizes

Place Prize Place Prize
1 $5000 9 $500
2 $3000 10 $500
3 $2000 11 $200
4 $1000 12 $200
5 $750 13 $200
6 $500 14 $200
7 $500 15 $200
8 $500 Total $15250

Submission

  • To submit your model, email your final submission to datathon@betfair.com.au. Note that you don't need to email your code, just your predictions in the format that we have specified
  • No submissions will be accepted prior to the Australian Open qualifying matches being completed and the final draw list being shared with registered participants (12 January 2019)
  • Submissions need to include all potential match ups during the Australian Open, i.e. all possible combinations for each men's and women's tournaments (this will be provided after the draw is announced and the Australian Open qualifying matches are completed (Jan 12th 2019))
  • Submissions must follow the format outlined above and shown in the 'Dummy Submission File'. Any submissions that are not in the correct format will not be accepted.
  • Submissions need to include the player names for the hypothetical match up and the probability of the first player winning i.e. player_1,player_2,probability_of_player_1_winning,
  • Submissions must be in a csv format
  • Only two models will be accepted per participant (one model for the men's draw, one model for the women's draw)

Exploring the Data

First we need to get an idea of what the data looks like. Let's read the men's data in and get an idea of what it looks like. Note that you will need to install all the packages listed below unless you already have them.

Note that for this tutorial I will be using dplyr, if you are not familiar with the syntax I encourage you to read up on the basics.

# Import libraries
library(dplyr)
library(readr)
library(tidyr)
library(RcppRoll)
library(tidyselect)
library(lubridate)
library(stringr)
library(zoo)
library(purrr)
library(h2o)
library(DT)
mens = readr::read_csv('data/ATP_matches.csv', na = ".") # NAs are indicated by .
mens %>%
  datatable(rownames = FALSE, extensions = 'Scroller', 
            options = list(dom = "t", 
            scrollY = 450,
            scroller = TRUE,
            scrollX = 600,
            fixedColumns = TRUE)) %>%
  formatRound(columns=pluck(., "x", "data") %>% colnames(), digits=3)
Winner Loser Tournament Tournament_Date Court_Surface Round_Description Winner_Rank Loser_Rank Retirement_Ind Winner_Sets_Won ... Loser_DoubleFaults Loser_FirstServes_Won Loser_FirstServes_In Loser_SecondServes_Won Loser_SecondServes_In Loser_BreakPoints_Won Loser_BreakPoints Loser_ReturnPoints_Won Loser_ReturnPoints_Faced Loser_TotalPoints_Won
Edouard Roger-Vasselin Eric Prodon Chennai 2-Jan-12 Hard First Round 106 97 0 2 ... 3 21 33 13 26 1 3 15 49 49
Dudi Sela Fabio Fognini Chennai 2-Jan-12 Hard First Round 83 48 0 2 ... 4 17 32 5 26 0 1 8 33 30
Go Soeda Frederico Gil Chennai 2-Jan-12 Hard First Round 120 102 0 2 ... 2 45 70 18 35 2 4 36 103 99
Yuki Bhambri Karol Beck Chennai 2-Jan-12 Hard First Round 345 101 0 2 ... 1 15 33 13 29 2 3 15 46 43
Yuichi Sugita Olivier Rochus Chennai 2-Jan-12 Hard First Round 235 67 0 2 ... 0 19 32 13 22 1 7 30 78 62
Benoit Paire Pere Riba Chennai 2-Jan-12 Hard First Round 95 89 0 2 ... 5 13 20 12 32 0 1 9 44 34
Victor Hanescu Sam Querrey Chennai 2-Jan-12 Hard First Round 90 93 0 2 ... 8 29 41 7 24 1 3 17 57 53
Yen-Hsun Lu Thiemo de Bakker Chennai 2-Jan-12 Hard First Round 82 223 0 2 ... 0 20 32 10 24 1 1 19 57 49
Andreas Beck Vasek Pospisil Chennai 2-Jan-12 Hard First Round 98 119 0 2 ... 3 39 57 16 38 1 5 24 74 79
Ivan Dodig Vishnu Vardhan Chennai 2-Jan-12 Hard First Round 36 313 0 2 ... 5 41 59 13 27 2 8 34 101 88
David Goffin Xavier Malisse Chennai 2-Jan-12 Hard First Round 174 49 0 2 ... 1 31 43 19 34 1 4 27 85 77
David Goffin Andreas Beck Chennai 2-Jan-12 Hard Second Round 174 98 0 2 ... 0 43 71 14 27 2 8 27 82 84
Dudi Sela Benoit Paire Chennai 2-Jan-12 Hard Second Round 83 95 0 2 ... 5 40 58 21 46 1 7 26 87 87
Stan Wawrinka Edouard Roger-Vasselin Chennai 2-Jan-12 Hard Second Round 17 106 0 2 ... 0 43 70 16 34 4 6 28 82 87
Go Soeda Ivan Dodig Chennai 2-Jan-12 Hard Second Round 120 36 0 2 ... 2 31 41 11 28 1 4 23 73 65
Milos Raonic Victor Hanescu Chennai 2-Jan-12 Hard Second Round 31 90 0 2 ... 1 25 38 5 14 0 4 15 56 45
Yuichi Sugita Yen-Hsun Lu Chennai 2-Jan-12 Hard Second Round 235 82 0 2 ... 4 34 45 12 34 2 9 38 93 84
Janko Tipsarevic Yuki Bhambri Chennai 2-Jan-12 Hard Second Round 9 345 0 2 ... 2 12 22 9 17 0 1 8 41 29
Janko Tipsarevic David Goffin Chennai 2-Jan-12 Hard Quarter-finals 9 174 0 2 ... 5 34 51 19 40 1 2 18 67 71
Milos Raonic Dudi Sela Chennai 2-Jan-12 Hard Quarter-finals 31 83 0 2 ... 2 23 31 19 28 0 3 16 69 58
Go Soeda Stan Wawrinka Chennai 2-Jan-12 Hard Quarter-finals 120 17 0 2 ... 4 18 34 13 31 3 7 31 74 62
Nicolas Almagro Yuichi Sugita Chennai 2-Jan-12 Hard Quarter-finals 10 235 0 2 ... 1 36 65 30 40 3 12 45 123 111
Janko Tipsarevic Go Soeda Chennai 2-Jan-12 Hard Semi-finals 9 120 0 2 ... 1 21 33 10 28 1 1 10 44 41
Milos Raonic Nicolas Almagro Chennai 2-Jan-12 Hard Semi-finals 31 10 0 2 ... 0 31 45 8 15 0 3 12 54 51
Milos Raonic Janko Tipsarevic Chennai 2-Jan-12 Hard Finals 31 9 0 2 ... 2 59 83 34 55 0 4 25 113 118
Igor Andreev Adrian Mannarino Brisbane 2-Jan-12 Hard First Round 115 87 0 2 ... 3 24 35 13 25 1 3 21 70 58
Alexandr Dolgopolov Alejandro Falla Brisbane 2-Jan-12 Hard First Round 15 74 0 2 ... 3 16 33 12 25 3 7 33 75 61
Tatsuma Ito Benjamin Mitchell Brisbane 2-Jan-12 Hard First Round 122 227 0 2 ... 6 30 44 7 24 0 2 13 52 50
Kei Nishikori Cedrik-Marcel Stebe Brisbane 2-Jan-12 Hard First Round 25 81 0 2 ... 2 27 49 23 41 3 6 28 75 78
Denis Istomin Florian Mayer Brisbane 2-Jan-12 Hard First Round 73 23 1 1 ... 1 28 38 11 17 0 2 15 56 54
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Malek Jaziri Fernando Verdasco Paris 29-Oct-18 Indoor Hard Second Round 55 27 0 2 ... 6 46 60 16 35 3 13 39 104 101
Alexander Zverev Frances Tiafoe Paris 29-Oct-18 Indoor Hard Second Round 5 44 0 2 ... 4 26 40 16 36 2 10 27 72 69
Dominic Thiem Gilles Simon Paris 29-Oct-18 Indoor Hard Second Round 8 31 0 2 ... 1 13 26 12 25 2 2 23 59 48
Novak Djokovic Joao Sousa Paris 29-Oct-18 Indoor Hard Second Round 2 48 0 2 ... 2 25 35 6 22 1 10 27 74 58
Karen Khachanov Matthew Ebden Paris 29-Oct-18 Indoor Hard Second Round 18 39 1 2 ... 6 8 18 5 20 1 2 10 30 23
John Isner Mikhail Kukushkin Paris 29-Oct-18 Indoor Hard Second Round 9 54 0 2 ... 1 54 80 24 39 0 1 13 90 91
Kevin Anderson Nikoloz Basilashvili Paris 29-Oct-18 Indoor Hard Second Round 6 22 0 2 ... 7 43 54 30 49 0 3 26 106 99
Marin Cilic Philipp Kohlschreiber Paris 29-Oct-18 Indoor Hard Second Round 7 43 0 2 ... 1 19 34 12 20 1 1 17 55 48
Jack Sock Richard Gasquet Paris 29-Oct-18 Indoor Hard Second Round 23 28 0 2 ... 4 18 33 16 29 0 4 19 59 53
Grigor Dimitrov Roberto Bautista Agut Paris 29-Oct-18 Indoor Hard Second Round 10 25 0 2 ... 0 34 48 11 20 2 4 27 76 72
Damir Dzumhur Stefanos Tsitsipas Paris 29-Oct-18 Indoor Hard Second Round 52 16 0 2 ... 3 14 26 15 30 2 2 17 52 46
Dominic Thiem Borna Coric Paris 29-Oct-18 Indoor Hard Third Round 8 13 0 2 ... 1 39 57 16 38 2 2 27 88 82
Novak Djokovic Damir Dzumhur Paris 29-Oct-18 Indoor Hard Third Round 2 52 1 2 ... 4 15 28 7 18 0 0 8 28 30
Alexander Zverev Diego Schwartzman Paris 29-Oct-18 Indoor Hard Third Round 5 19 0 2 ... 2 22 37 12 24 0 4 18 58 52
Roger Federer Fabio Fognini Paris 29-Oct-18 Indoor Hard Third Round 3 14 0 2 ... 6 22 32 15 37 1 5 16 54 53
Marin Cilic Grigor Dimitrov Paris 29-Oct-18 Indoor Hard Third Round 7 10 0 2 ... 1 37 55 14 32 1 5 22 71 73
Karen Khachanov John Isner Paris 29-Oct-18 Indoor Hard Third Round 18 9 0 2 ... 4 67 80 19 38 0 0 17 100 103
Kei Nishikori Kevin Anderson Paris 29-Oct-18 Indoor Hard Third Round 11 6 0 2 ... 1 26 33 11 19 0 0 11 51 48
Jack Sock Malek Jaziri Paris 29-Oct-18 Indoor Hard Third Round 23 55 0 2 ... 6 13 21 10 24 0 0 9 41 32
Karen Khachanov Alexander Zverev Paris 29-Oct-18 Indoor Hard Quarter-finals 18 5 0 2 ... 7 26 47 4 21 1 3 10 36 40
Dominic Thiem Jack Sock Paris 29-Oct-18 Indoor Hard Quarter-finals 8 23 0 2 ... 5 44 59 19 37 2 10 34 97 97
Roger Federer Kei Nishikori Paris 29-Oct-18 Indoor Hard Quarter-finals 3 11 0 2 ... 0 21 37 16 26 0 1 12 56 49
Novak Djokovic Marin Cilic Paris 29-Oct-18 Indoor Hard Quarter-finals 2 7 0 2 ... 0 38 55 11 28 2 5 29 85 78
Karen Khachanov Dominic Thiem Paris 29-Oct-18 Indoor Hard Semi-finals 18 8 0 2 ... 0 19 29 8 26 1 3 15 47 42
Novak Djokovic Roger Federer Paris 29-Oct-18 Indoor Hard Semi-finals 2 3 0 2 ... 2 69 93 25 46 1 2 29 113 123
Karen Khachanov Novak Djokovic Paris 29-Oct-18 Indoor Hard Finals 18 2 0 2 ... 1 30 43 14 28 1 5 20 66 64
Jaume Antoni Munar Clar Frances Tiafoe Milan 5-Nov-18 Indoor Hard NA 76 40 0 3 ... 3 21 29 6 17 0 2 5 46 32
Frances Tiafoe Hubert Hurkacz Milan 5-Nov-18 Indoor Hard NA 40 85 0 3 ... 4 35 48 10 19 1 7 22 78 67
Hubert Hurkacz Jaume Antoni Munar Clar Milan 5-Nov-18 Indoor Hard NA 85 76 0 3 ... 1 43 63 15 35 3 9 29 80 87
Andrey Rublev Liam Caruana Milan 5-Nov-18 Indoor Hard NA 68 NA 0 3 ... 1 28 39 4 14 1 3 18 57 50

As we can see, we have a Winner column, a Loser column, as well as other columns detailing the match details, and other columns which have the stats for that match. As we have a Winner column, if we use the current data structure to train a model we will leak the result. The model will simply learn that the actual winner comes from the Winner column, rather than learning from other features that we can create, such as First Serve %.

To avoid this problem, let's reshape the data from wide to long, then shuffle the data. For this, we will define a function, split_winner_loser_columns, which splits the raw dataframe into two dataframes, appends them together, and then shuffles the data.

Let's also remove all Grass and Clay matches from our data, as we will be modelling the Australian Open which is a hardcourt surface.

Additionally, we will add a few columns, such as Match_Id and Total_Games. These will be useful later.

split_winner_loser_columns <- function(df) {
  # This function splits the raw data into two dataframes and appends them together then shuffles them
  # This output is a dataframe with only one player's stats on each row (i.e. in long format)

  # Grab a df with only the Winner's stats
  winner = df %>% 
    select(-contains("Loser")) %>% # Select only the Winner columns + extra game info columns as a df
    rename_at( # Rename all columns containing "Winner" to "Player" 
      vars(contains("Winner")),
      ~str_replace(., "Winner", "Player")
    ) %>%
    mutate(Winner = 1) # Create a target column

  # Repeat the process with the loser's stats
  loser = df %>%
    select(-contains("Winner")) %>%
    rename_at(
      vars(contains("Loser")),
      ~str_replace(., "Loser", "Player")
    ) %>%
    mutate(Winner = 0)

  set.seed(183) # Set seed to replicate results - 183 is the most games played in a tennis match (Isner-Mahut)

  # Create a df that appends both the Winner and loser df together
  combined_df = winner %>% 
    rbind(loser) %>% # Append the loser df to the Winner df
    slice(sample(1:n())) %>% # Randomise row order
    arrange(Match_Id) %>% # Arrange by Match_Id
    return()
}
# Read in men and womens data; randomise the data to avoid result leakage
mens = readr::read_csv('data/ATP_matches.csv', na = ".") %>%
  filter(Court_Surface == "Hard" | Court_Surface == "Indoor Hard") %>% # Filter to only use hardcourt games
  mutate(Match_Id = row_number(), # Add a match ID column to be used as a key
         Tournament_Date = dmy(Tournament_Date), # Change Tournament to datetime
         Total_Games = Winner_Games_Won + Loser_Games_Won) %>% # Add a total games played column
  split_winner_loser_columns() # Change the dataframe from wide to long
mens %>%
  datatable(rownames = FALSE, extensions = 'Scroller', 
            options = list(dom = "t", 
            scrollY = 450,
            scroller = TRUE,
            scrollX = 600,
            fixedColumns = TRUE)) %>%
  formatRound(columns=pluck(., "x", "data") %>% colnames(), digits=3)
Player Tournament Tournament_Date Court_Surface Round_Description Player_Rank Retirement_Ind Player_Sets_Won Player_Games_Won Player_Aces ... Player_SecondServes_Won Player_SecondServes_In Player_BreakPoints_Won Player_BreakPoints Player_ReturnPoints_Won Player_ReturnPoints_Faced Player_TotalPoints_Won Match_Id Total_Games Winner
Eric Prodon Chennai 2012-01-02 Hard First Round 97 0 0 7 2 ... 13 26 1 3 15 49 49 1 19 0
Edouard Roger-Vasselin Chennai 2012-01-02 Hard First Round 106 0 2 12 5 ... 12 19 4 7 25 59 59 1 19 1
Dudi Sela Chennai 2012-01-02 Hard First Round 83 0 2 12 2 ... 11 16 6 14 36 58 61 2 13 1
Fabio Fognini Chennai 2012-01-02 Hard First Round 48 0 0 1 1 ... 5 26 0 1 8 33 30 2 13 0
Frederico Gil Chennai 2012-01-02 Hard First Round 102 0 1 14 5 ... 18 35 2 4 36 103 99 3 33 0
Go Soeda Chennai 2012-01-02 Hard First Round 120 0 2 19 6 ... 19 39 5 11 42 105 109 3 33 1

Feature Creation

Now that we have a fairly good understanding of what the data looks like, let's add some features. To do this we will define a function. Ideally we want to add features which will provide predictive power to our model.

Thinking about the dynamics of tennis, we know that players often will matches by "breaking" the opponent's serve (i.e. winning a game when the opponent is serving). This is especially important in mens tennis. Let's create a feature called F_Player_BreakPoints_Per_Game, which is the number of breakpoints a player gets per game that they play (even though they can only get breakpoints every second game, we will use total games). Let's also create a feature called F_Player_Return_Win_Ratio which is the proportion of points won when returning.

Similarly, "holding" serve is important (i.e. winning a game when you are serving). Let's create a feature called F_Player_Serve_Win_Ratio which is the proportion of points won when serving.

Finally, you only win a set of tennis by winning more sets than your opponent. To win a set, you need to win games. Let's create a feature called F_Player_Game_Win_Percentage which is the propotion of games that a player wins.

add_ratio_features <- function(df) {
  # This function adds ratio features to a long df
  df %>%
    mutate(
      # Point Win ratio when serving
      F_Player_Serve_Win_Ratio = (Player_FirstServes_Won + Player_SecondServes_Won - Player_DoubleFaults) / 
        (Player_FirstServes_In + Player_SecondServes_In + Player_DoubleFaults), 
      # Point win ratio when returning
      F_Player_Return_Win_Ratio = Player_ReturnPoints_Won / Player_ReturnPoints_Faced, 
      # Breakpoints per receiving game
      F_Player_BreakPoints_Per_Game = Player_BreakPoints / Total_Games, 
      F_Player_Game_Win_Percentage = Player_Games_Won / Total_Games
    ) %>%
    mutate_at(
      vars(colnames(.), -contains("Rank"), -Tournament_Date), # Replace all NAs with0 apart from Rank, Date
      ~ifelse(is.na(.), 0, .)
    ) %>%
    return()
}
mens = mens %>%
  add_ratio_features() # Add features

Now that we have added our features, we need to create rolling averages for them. We cannot simply use current match statistics, as they will leak the result to the model. Instead, we need to use past match statistics to predict future matches. Here we will use a rolling mean with a window of 15. If the player hasn't played 15 games, we will instead use a cumulative mean. We will also lag the result so as to not leak the result.

This next chunk of code simply takes all the columns starting with F_ and calculates these means.

mens = mens %>% 
  group_by(Player) %>% # Group by player
  mutate_at( # Create a rolling mean with window 15 for each player. 
    vars(starts_with("F_")), # If the player hasn't played 15 games, use a cumulative mean
    ~coalesce(rollmean(., k = 15, align = "right", fill = NA_real_), cummean(.)) %>% lag()
  ) %>%
  ungroup()

Creating a Training Feature Matrix

In predictive modelling language - features are data metrics we use to predict an outcome or target variable. We have several choices to make before we get to the prediction phase. What are the features? How do we structure the outcome variable? What does each row mean? Do we use all data or just a subset? We narrowed it down to two options

We can train the model on every tennis match in the data set, or We can only train the model on Australian Open matches. Doing Option 1 would mean we have a lot more data to build a strong model, but it might be challenging to work around the constraints described in the tournament structure.

Doing Option 2 fits better from that angle but leaves us with very few matches to train our model on.

We have decided to go with an option that combines strengths from both approaches, by training the model on matches from the Aus Open and the US Open because both grand slams are played on the same surface - hard court.

However, we also need to train our model in the same way that will be used to predict the 2019 Australian Open. When predicting the 2nd round, we won't have data from the 1st round. So we will need to build our training feature matrix with this in mind. We should extract features for a player from past games at the start of the tournament and apply them to every matchup that that player plays.

To do this, we will create a function, extract_latest_features_for_tournament, which maps over our feature dataframe for the dates in the first round of a tournament and grabs features.

First, we need the Australian Open and US Open results - let's grab these and then apply our function.

# Get Australian Open and US Open Results
aus_us_open_results = 
  mens %>%
  filter((Tournament == "Australian Open, Melbourne" | Tournament == "U.S. Open, New York")
         & Round_Description != "Qualifying" & Tournament_Date != "2012-01-16") %>% # Filter out qualifiers
  select(Match_Id, Player, Tournament, Tournament_Date, Round_Description, Winner)
# Create a function which extracts features for each tournament
extract_latest_features_for_tournament = function(df, dte) {

  df %>% # Filter for the 1st round
    filter(Tournament_Date == dte, Round_Description == "First Round", Tournament_Date != "2012-01-16") %>% 
    group_by(Player) %>% # Group by player
    select_at(
      vars(Match_Id, starts_with("F_"), Player_Rank) # Grab the players' features
    ) %>%
    rename(F_Player_Rank = Player_Rank) %>%
    ungroup() %>%
    mutate(Feature_Date = dte) %>%
    select(Player, Feature_Date, everything())
}
# Create a feature matrix in long format
feature_matrix_long = 
  aus_us_open_results %>%
  distinct(Tournament_Date) %>% # Pull all Tournament Dates
  pull() %>%
  map_dfr(
    ~extract_latest_features_for_tournament(mens, .) # Get the features
  ) %>%
  filter(Feature_Date != "2012-01-16") %>% # Filter out the first Aus Open
  mutate_at( # Replace NAs with the mean
    vars(starts_with("F_")),
    ~ifelse(is.na(.), mean(., na.rm = TRUE), .)
  )

Now that we have a feature matrix in long format, we need to convert it to wide format so that the features are on the same row. To do this we will define a function gather_df, which converts the dataframe from long to wide. Let's also join the results to the matrix and convert the Winner column to a factor. Finally, we will take the difference of player1 and player2's features, so as to reduce the dimensionality of the model.

gather_df <- function(df) {
  # This function puts the df back into its original format of each row containing stats for both players
  df %>%
    arrange(Match_Id) %>%
    filter(row_number() %% 2 != 0) %>% # Filter for every 2nd row, starting at the 1st index. e.g. 1, 3, 5
    rename_at( # Rename columns to player_1
      vars(contains("Player")),
      ~str_replace(., "Player", "player_1")
    ) %>%
    inner_join(df %>%
                 filter(row_number() %% 2 == 0) %>%
                 rename_at(
                   vars(contains("Player")), # Rename columns to player_2
                   ~str_replace(., "Player", "player_2")
                 ) %>%
                 select(Match_Id, contains("Player")),
               by=c('Match_Id')
    ) %>%
    select(Match_Id, player_1, player_2, Winner, everything()) %>%
    return()
}
# Joining results to features
feature_matrix_wide = aus_us_open_results %>%
  inner_join(feature_matrix_long %>% 
               select(-Match_Id), 
             by = c("Player", "Tournament_Date" = "Feature_Date")) %>%
  gather_df() %>%
  mutate(
    F_Serve_Win_Ratio_Diff = F_player_1_Serve_Win_Ratio - F_player_2_Serve_Win_Ratio,
    F_Return_Win_Ratio_Diff = F_player_1_Return_Win_Ratio - F_player_2_Return_Win_Ratio,
    F_Game_Win_Percentage_Diff = F_player_1_Game_Win_Percentage - F_player_2_Game_Win_Percentage,
    F_BreakPoints_Per_Game_Diff = F_player_1_BreakPoints_Per_Game - F_player_2_BreakPoints_Per_Game,
    F_Rank_Diff = (F_player_1_Rank - F_player_2_Rank),
    Winner = as.factor(Winner)
  ) %>%
  select(Match_Id, player_1, player_2, Tournament, Tournament_Date, Round_Description, Winner, contains("Diff"))
train = feature_matrix_wide
train %>%
  datatable(rownames = FALSE, extensions = 'Scroller', 
            options = list(dom = "t", 
            scrollY = 450,
            scroller = TRUE,
            scrollX = 600,
            fixedColumns = TRUE)) %>%
  formatRound(columns=pluck(., "x", "data") %>% colnames(), digits=3)
Match_Id player_1 player_2 Tournament Tournament_Date Round_Description Winner F_Serve_Win_Ratio_Diff F_Return_Win_Ratio_Diff F_Game_Win_Percentage_Diff F_BreakPoints_Per_Game_Diff F_Rank_Diff
1139 Adrian Ungur Daniel Brands U.S. Open, New York 2012-08-27 First Round 0 0.03279412 -0.014757229 0.002877458 0.073938088 -13
1140 Albert Montanes Richard Gasquet U.S. Open, New York 2012-08-27 First Round 0 -0.08000322 -0.077451342 -0.131108056 -0.180846832 97
1141 Martin Klizan Alejandro Falla U.S. Open, New York 2012-08-27 First Round 1 0.07711693 -0.044715517 0.068179841 -0.087361962 1
1142 Alex Bogomolov Jr. Andy Murray U.S. Open, New York 2012-08-27 First Round 0 -0.03964074 -0.031700826 -0.059010072 -0.094721700 69
1143 Tommy Robredo Andreas Seppi U.S. Open, New York 2012-08-27 First Round 1 -0.02681392 0.006442134 -0.067779660 -0.009930089 151
1144 Ryan Harrison Benjamin Becker U.S. Open, New York 2012-08-27 First Round 1 0.04251983 0.018604623 0.026486753 -0.003548973 -24

Creating the Feature Matrix for the 2019 Australian Open

Now that we have our training set, train, we need to create a feature matrix to create predictions on. To do this, we need to generate features again. We could simply append a player list to our raw dataframe, create a mock date and then use the extract_latest_features_for_tournament function that we used before. Instead, we're going to create a lookup table for each unique player in the 2019 Australian Open. We will need to get their last 15 games and then find the mean for each feature so that our features are the same.

Let's first explore what the dummy submission file looks like, then use it to get the unique players.

read_csv('data/men_dummy_submission_file.csv') %>% glimpse()

As we can see, the dummy submission file contains every potential match up for the Open. This will be updated a few days before the Open starts with the actual players playing. Let's now create the lookup feature table.

# Get a vector of unique players in this years' open using the dummy submission file
unique_players = read_csv('data/men_dummy_submission_file.csv') %>% pull(player_1) %>% unique()
# Get the last 15 games played for each unique player and find their features
lookup_feature_table = read_csv('data/ATP_matches.csv', na = ".") %>%
  filter(Court_Surface == "Hard" | Court_Surface == "Indoor Hard") %>%
  mutate(Match_Id = row_number(), # Add a match ID column to be used as a key
         Tournament_Date = dmy(Tournament_Date), # Change Tournament to datetime
         Total_Games = Winner_Games_Won + Loser_Games_Won) %>% # Add a total games played column
  # clean_missing_data() %>% # Clean missing data
  split_winner_loser_columns() %>% # Change the dataframe from wide to long
  add_ratio_features() %>%
  filter(Player %in% unique_players) %>%
  group_by(Player) %>%
  top_n(15, Match_Id) %>%
  summarise(
    F_Player_Serve_Win_Ratio = mean(F_Player_Serve_Win_Ratio),
    F_Player_Return_Win_Ratio = mean(F_Player_Return_Win_Ratio),
    F_Player_BreakPoints_Per_Game = mean(F_Player_BreakPoints_Per_Game),
    F_Player_Game_Win_Percentage = mean(F_Player_Game_Win_Percentage),
    F_Player_Rank = last(Player_Rank)
  )

Now let's create features for every single combination. To do this we'll join our lookup_feature_table to the player_1 and player_2 columns in the dummy_submission_file.

# Create feature matrix for the Australian Open for all player 1s
features_player_1 = read_csv('data/men_dummy_submission_file.csv') %>%
  select(player_1) %>%
  inner_join(lookup_feature_table, by=c("player_1" = "Player")) %>%
  rename(F_player_1_Serve_Win_Ratio = F_Player_Serve_Win_Ratio,
         F_player_1_Return_Win_Ratio = F_Player_Return_Win_Ratio,
         F_player_1_BreakPoints_Per_Game = F_Player_BreakPoints_Per_Game,
         F_player_1_Game_Win_Percentage = F_Player_Game_Win_Percentage,
         F_player_1_Rank = F_Player_Rank)
# Create feature matrix for the Australian Open for all player 2s
features_player_2 = read_csv('data/men_dummy_submission_file.csv') %>%
  select(player_2) %>%
  inner_join(lookup_feature_table, by=c("player_2" = "Player")) %>%
  rename(F_player_2_Serve_Win_Ratio = F_Player_Serve_Win_Ratio,
         F_player_2_Return_Win_Ratio = F_Player_Return_Win_Ratio,
         F_player_2_BreakPoints_Per_Game = F_Player_BreakPoints_Per_Game,
         F_player_2_Game_Win_Percentage = F_Player_Game_Win_Percentage,
         F_player_2_Rank = F_Player_Rank)
# Join the two dfs together and subtract features to create Difference features
aus_open_2019_features = features_player_1 %>% 
  bind_cols(features_player_2) %>%
  select(player_1, player_2, everything()) %>%
  mutate(
    F_Serve_Win_Ratio_Diff = F_player_1_Serve_Win_Ratio - F_player_2_Serve_Win_Ratio,
    F_Return_Win_Ratio_Diff = F_player_1_Return_Win_Ratio - F_player_2_Return_Win_Ratio,
    F_Game_Win_Percentage_Diff = F_player_1_Game_Win_Percentage - F_player_2_Game_Win_Percentage,
    F_BreakPoints_Per_Game_Diff = F_player_1_BreakPoints_Per_Game - F_player_2_BreakPoints_Per_Game,
    F_Rank_Diff = (F_player_1_Rank - F_player_2_Rank)
  ) %>%
  select(player_1, player_2, contains("Diff"))
aus_open_2019_features  %>%
  datatable(rownames = FALSE, extensions = 'Scroller', 
            options = list(dom = "t",
                          scrollY = 450,
                          scroller = TRUE,
                          scrollX = 600,
                          fixedColumns = TRUE)) %>%
  formatRound(columns=pluck(., "x", "data") %>% colnames(), digits=3)
player_1 player_2 F_Serve_Win_Ratio_Diff F_Return_Win_Ratio_Diff F_Game_Win_Percentage_Diff F_BreakPoints_Per_Game_Diff F_Rank_Diff
Novak Djokovic Rafael Nadal 0.06347805 0.02503802 0.07002382 0.08951024 1
Novak Djokovic Roger Federer 0.06583364 0.03628491 0.07661295 0.15455628 -1
Novak Djokovic Juan Martin del Potro 0.01067079 0.03436023 0.06382353 0.11259979 -2
Novak Djokovic Alexander Zverev 0.11117863 0.03125651 0.11055585 0.08661036 -3
Novak Djokovic Kevin Anderson 0.02132375 0.10449337 0.11184503 0.23684083 -4
Novak Djokovic Marin Cilic 0.08410746 0.02434916 0.07653035 0.08355134 -5

Generating 2019 Australian Open Predictions

Now that we have our features, we can finally train our model and generate predictions for the 2019 Australian Open. Due to its simplicity, we will use h2o's Auto Machine Learning function h2o.automl. This will train a heap of different models and optimise the hyperparameters, as well as creating stacked ensembles automatically for us. We will use optimising by log loss.

First, we must create h2o frames for our training and feature dataframes. Then we will run h2o.automl. Note that we can set the max_runtime_secs parameter. As this is a notebook, I have set it for 30 seconds - but I suggest you give it 10 minutes to create the best model. We can then create our predictions and assign them back to our aus_open_2019_features dataframe. Finally, we will group_by player and find the best player, on average.

## Setup H2O
h2o.init(ip = "localhost",
         port = 54321,
         enable_assertions = TRUE,
         nthreads = 2,
         max_mem_size = "24g"

)
## Sending file to h2o
train_h2o = feature_matrix_wide %>%
  select(contains("Diff"), Winner) %>%
  as.h2o(destination_frame = "train_h2o")
aus_open_2019_features_h2o = aus_open_2019_features %>%
  select(contains("Diff")) %>%
  as.h2o(destination_frame = "aus_open_2019_features_h2o")
## Running Auto ML 
mens_model = h2o.automl(y = "Winner",
                        training_frame = train_h2o,
                        max_runtime_secs = 30,
                        max_models = 100,
                        stopping_metric = "logloss",
                        sort_metric = "logloss",
                        balance_classes = TRUE,
                        seed = 183) # Set seed to replicate results - 183 is the most games played in a tennis match (Isner-Mahut)
## Predictions on test frame
predictions = h2o.predict(mens_model@leader, aus_open_2019_features_h2o) %>%
  as.data.frame()
aus_open_2019_features$prob_player_1 = predictions$p1
aus_open_2019_features$prob_player_2 = predictions$p0
h2o.shutdown(prompt = FALSE)
Now let's find the best player by taking the mean of the prediction probability by player.
aus_open_2019_features %>% 
  select(player_1, starts_with("F_"), prob_player_1) %>%
  group_by(player_1) %>%
  summarise_all(mean) %>%
  arrange(desc(prob_player_1)) %>%
  datatable(rownames = FALSE, extensions = 'Scroller', 
            options = list(dom = "t",
                          scrollY = 450,
                          scroller = TRUE,
                          scrollX = 600,
                          fixedColumns = TRUE)) %>%
  formatRound(columns=pluck(., "x", "data") %>% colnames(), digits=3)

player_1 F_Serve_Win_Ratio_Diff F_Return_Win_Ratio_Diff F_Game_Win_Percentage_Diff F_BreakPoints_Per_Game_Diff F_Rank_Diff prob_player_1
Novak Djokovic 0.1109364627 0.076150615 0.1483970690 0.17144300 NA 0.8616486
Karen Khachanov 0.0960639298 0.061436164 0.1059967623 0.04544955 NA 0.8339594
Juan Martin del Potro 0.1003931993 0.042025222 0.0847985439 0.05943767 NA 0.8218308
Rafael Nadal 0.0480432305 0.051531252 0.0790179917 0.08181694 NA 0.8032543
Gilles Simon 0.0646937767 0.084843307 0.0901401318 0.08675350 NA 0.7985995
Roger Federer 0.0452014997 0.040992497 0.0725719954 0.01817046 NA 0.7962289
Kei Nishikori 0.0777155934 0.018720226 0.0800648870 0.02740276 NA 0.7843631
Marin Cilic 0.0285413602 0.053017465 0.0736687072 0.08883055 NA 0.7804876
Tomas Berdych 0.0471654691 0.047289449 0.0737401748 0.10584114 NA 0.7739211
Daniil Medvedev 0.0275430665 0.031121856 0.0721948279 0.01803757 NA 0.7543269
Stefanos Tsitsipas 0.0470382377 0.023825850 0.0577628626 0.02105227 NA 0.7511674
Dominic Thiem 0.0258904189 0.032481624 0.0483707080 0.05857158 NA 0.7451547
Alexander Zverev 0.0006199716 0.044811275 0.0380134371 0.08423392 NA 0.7374897
Kyle Edmund 0.0558006240 0.011963627 0.0478850676 0.05142186 NA 0.7304873
Pablo Carreno Busta 0.0321878318 0.029862068 0.0413674481 -0.00229784 NA 0.7302043
Borna Coric 0.0762084129 -0.010097922 0.0413621283 -0.01924267 NA 0.7268124
Kevin Anderson 0.0907358428 -0.027171681 0.0381421997 -0.06362578 NA 0.7260799
David Goffin -0.0034821911 0.037247336 0.0162572061 0.05603565 NA 0.7155908
Fernando Verdasco 0.0229261365 0.032884054 0.0521212576 0.04668854 NA 0.7120831
Roberto Bautista Agut 0.0047641170 0.049939608 0.0218975349 0.07331023 NA 0.7009891
Milos Raonic 0.0849726089 -0.028732182 0.0385944327 -0.08009382 NA 0.6986865
Fabio Fognini -0.0394792678 0.047935185 0.0226546894 0.06213496 NA 0.6982031
Hyeon Chung 0.0042489153 0.047722133 0.0158096386 0.04823304 NA 0.6958943
Jack Sock -0.0099659903 0.026454984 0.0186547428 0.02307214 NA 0.6757770
Diego Schwartzman -0.0317130675 0.032098381 0.0006215006 0.05621187 NA 0.6631067
John Millman 0.0016290285 0.042676556 0.0119857356 0.06228135 NA 0.6603912
Nikoloz Basilashvili -0.0099968609 0.005561102 0.0473876170 0.03661962 NA 0.6602628
John Isner 0.1346946527 -0.070556940 0.0161348609 -0.11425009 NA 0.6598097
Gael Monfils -0.0074254934 0.024286746 0.0295568649 0.04007519 NA 0.6449506
Richard Gasquet 0.0296009556 -0.011382437 0.0013138324 -0.03972967 NA 0.6442043
... ... ... ... ... ... ...
Laslo Djere -0.042300822 -0.0150684095 -0.064667709 -0.0349151578 NA 0.3606923
David Ferrer -0.036179509 0.0532782117 0.012751020 0.0914824480 NA 0.3488057
Bradley Klahn -0.001248083 -0.0444982448 -0.025987040 -0.1181295700 NA 0.3487806
Marcel Granollers -0.031011830 -0.0094056152 -0.049853664 0.0136841358 NA 0.3460035
Ricardas Berankis -0.022557215 -0.0103782963 -0.047937290 -0.0468488990 NA 0.3454980
Radu Albot -0.040829057 0.0076150564 -0.034891704 0.0443672533 NA 0.3420615
Jordan Thompson -0.068554906 0.0261969117 -0.044349181 0.0206636045 NA 0.3358572
Thomas Fabbiano -0.060583307 0.0275756029 -0.025883493 0.0709707306 NA 0.3319778
Roberto Carballes Baena -0.054016396 -0.0091521177 -0.019093050 0.0347187874 NA 0.3312105
Paolo Lorenzi -0.038613500 -0.0212206827 -0.052602703 0.0199474025 NA 0.3299791
Guido Andreozzi -0.038614385 -0.0133763922 0.029549861 0.0636745661 NA 0.3288762
Peter Polansky 0.007461636 -0.0163389196 -0.024034159 -0.0442144260 NA 0.3216756
Ernests Gulbis -0.062827089 -0.0134699552 -0.027633425 -0.0518663252 NA 0.3123511
Thiago Monteiro 0.001235931 -0.0288349103 -0.043831840 -0.0654744344 NA 0.3122069
Casper Ruud 0.016838968 -0.0178511679 0.015234507 0.0219131874 NA 0.3119321
Marco Trungelliti -0.022148774 -0.0005658242 0.048542554 0.1243537739 NA 0.3092636
Jiri Vesely -0.050204009 -0.0351868278 -0.042887646 -0.0160467165 NA 0.3089287
Guillermo Garcia-Lopez -0.090076100 -0.0108663630 -0.048712763 -0.0124446402 NA 0.3080898
Michael Mmoh -0.063802934 -0.0079053251 -0.011112236 -0.0332042032 NA 0.2822330
Jason Kubler -0.124758873 -0.0202756806 -0.013998570 0.1020895301 NA 0.2814246
Ruben Bemelmans -0.029036164 -0.0138846550 -0.032256254 -0.0363563402 NA 0.2772185
Bjorn Fratangelo -0.014149222 0.0033574304 -0.019931504 -0.0360199607 NA 0.2652527
Pablo Andujar -0.042869833 -0.0488261697 -0.070057834 -0.0164918910 NA 0.2647100
Christian Garin -0.046150875 0.0235799476 -0.006209664 0.0736304057 NA 0.2631607
Ivo Karlovic 0.071597162 -0.1093833837 0.001410787 -0.1237762218 NA 0.2500242
Juan Ignacio Londero -0.026454456 -0.0715665271 -0.016749898 -0.0363353678 NA 0.2351747
Ramkumar Ramanathan -0.005371622 -0.0606138479 -0.041631884 -0.0005573405 NA 0.2272977
Reilly Opelka 0.025704824 -0.0607219257 -0.015474944 -0.0720809006 NA 0.2262993
Carlos Berlocq -0.063580460 0.0074576369 -0.054277974 -0.0165235079 NA 0.2112275
Pedro Sousa -0.197333352 -0.0734557562 -0.161962722 -0.1023311674 NA 0.1502313