EPL Machine Learning Walkthrough

01. Data Acquisition & Exploration

Welcome to the first part of this Machine Learning Walkthrough. The tutorial is made up of four parts: how we acquired our data programmatically, exploring the data to find potential features, building the model, and using the model to make predictions.

Data Acquisition

We will be grabbing our data from football-data.co.uk, which has an enormous amount of soccer data dating back to the 90s. They also generously allow us to use it for free! However, the data is in separate CSVs based on the season. That means we would need to manually download 20 different files if we wanted the past 20 seasons. Rather than do this laborious and boring task, let's create a function which downloads the files for us and combines them all into one big DataFrame, which we can then save as a single CSV.

To do this, we will use BeautifulSoup, a Python library which helps to pull data from HTML and XML files. We will then define a function which collates all the data for us into one DataFrame.

# Import Modules

import pandas as pd
import requests
from bs4 import BeautifulSoup
import datetime
pd.set_option('display.max_columns', 100)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from data_preparation_functions import *

def grab_epl_data():
    # Connect to football-data.co.uk
    res = requests.get("http://www.football-data.co.uk/englandm.php")

    # Create a BeautifulSoup object
    soup = BeautifulSoup(res.content, 'lxml')

    # Find the tables with the links to the data in them.
    table = soup.find_all('table', {'align': 'center', 'cellspacing': '0', 'width': '800'})[1]
    body = table.find_all('td', {'valign': 'top'})[1]

    # Grab the urls for the csv files
    links = [link.get('href') for link in body.find_all('a')]
    links_text = [link_text.text for link_text in body.find_all('a')]

    data_urls = []

    # Create a list of links
    prefix = 'http://www.football-data.co.uk/'
    for i, text in enumerate(links_text):
        if text == 'Premier League':
            data_urls.append(prefix + links[i])

    # Drop the last 12 urls, as these seasons don't include match stats and odds,
    # and we only want data from 2005 onwards
    data_urls = data_urls[:-12]

    df = pd.DataFrame()

    # Iterate over the urls
    for url in data_urls:
        # Get the season and make it a column
        season = url.split('/')[4]

        print(f"Getting data for season {season}")

        # Read the data from the url into a DataFrame
        temp_df = pd.read_csv(url)
        temp_df['season'] = season

        # Create helpful columns like Day, Month, Year, Date etc. so that our data is clean
        temp_df = (temp_df.dropna(axis='columns', thresh=temp_df.shape[0]-30)
                          .assign(Day=lambda df: df.Date.str.split('/').str[0],
                                  Month=lambda df: df.Date.str.split('/').str[1],
                                  Year=lambda df: df.Date.str.split('/').str[2])
                          .assign(Date=lambda df: df.Month + '/' + df.Day + '/' + df.Year)
                          .assign(Date=lambda df: pd.to_datetime(df.Date))
                          .dropna())

        # Append the temp_df to the main df (DataFrame.append was removed in pandas 2.0)
        df = pd.concat([df, temp_df], sort=True)

    # Drop all NAs
    df = df.dropna(axis=1).dropna().sort_values(by='Date')
    print("Finished grabbing data.")

    return df

df = grab_epl_data()
# df.to_csv("data/epl_data.csv", index=False)

    Getting data for season 1819
    Getting data for season 1718
    Getting data for season 1617
    Getting data for season 1516
    Getting data for season 1415
    Getting data for season 1314
    Getting data for season 1213
    Getting data for season 1112
    Getting data for season 1011
    Getting data for season 0910
    Getting data for season 0809
    Getting data for season 0708
    Getting data for season 0607
    Getting data for season 0506
    Finished grabbing data.

Whenever we want to update our data (for example, if we want the most recent Gameweek included), all we have to do is run that function and then save the data to a CSV with the commented-out line above.
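Two small steps inside grab_epl_data are easy to miss: the season label is taken from the URL, and the Date strings are day-first. The sketch below illustrates both in isolation using only the standard library. The example URL and the helper names season_from_url and parse_match_date are assumptions made for this illustration, as is the idea that dates may arrive with either two- or four-digit years.

```python
from datetime import datetime

# Hypothetical example URL, shaped like the links collected by grab_epl_data
EXAMPLE_URL = "http://www.football-data.co.uk/mmz4281/1819/E0.csv"


def season_from_url(url):
    """Return the season code, i.e. the fifth '/'-separated piece of the URL."""
    return url.split("/")[4]


def parse_match_date(date_string):
    """Parse a day/month/year string, accepting 2- or 4-digit years (assumption)."""
    for fmt in ("%d/%m/%y", "%d/%m/%Y"):
        try:
            return datetime.strptime(date_string, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {date_string!r}")


print(season_from_url(EXAMPLE_URL))
print(parse_match_date("26/08/18").date())
print(parse_match_date("13/08/2005").date())
```

Parsing with an explicit day-first format avoids the month/day swap that the function above handles by rebuilding the string as Month/Day/Year.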

Data Exploration

Now that we have our data, let's explore it. Let's first look at home team win rates since 2005 to see if there is a consistent trend. To get an idea of what our data looks like, we'll look at the tail of the dataset first.

df.tail(3)

	AC	AF	AS	AST	AY	AwayTeam	B365A	B365D	B365H	BWA	BWD	BWH	Bb1X2	BbAH	BbAHh	BbAv<2.5	BbAv>2.5	BbAvA	BbAvAHA	BbAvAHH	BbAvD	BbAvH	BbMx<2.5	BbMx>2.5	BbMxA	BbMxAHA	BbMxAHH	BbMxD	BbMxH	BbOU	Date	Day	Div	FTAG	FTHG	FTR	HC	HF	HS	HST	HTR	HY	HomeTeam	IWA	IWD	IWH	LBA	LBD	LBH	Month	Referee	VCA	VCD	VCH	Year	season
28	3.0	11.0	9.0	3.0	2.0	Crystal Palace	3.00	3.25	2.60	2.95	3.1	2.55	42.0	20.0	-0.25	1.71	2.13	2.92	1.73	2.16	3.22	2.55	1.79	2.21	3.04	1.77	2.23	3.36	2.66	39.0	2018-08-26	26	E0	1.0	2.0	H	6.0	14.0	13.0	5.0	D	4.0	Watford	2.95	3.20	2.5	2.90	3.1	2.50	08	A Taylor	2.90	3.3	2.6	18	1819
27	5.0	8.0	15.0	3.0	1.0	Chelsea	1.66	4.00	5.75	1.67	3.8	5.25	42.0	22.0	1.00	1.92	1.88	1.67	2.18	1.71	3.90	5.25	2.01	1.95	1.71	2.28	1.76	4.17	5.75	40.0	2018-08-26	26	E0	2.0	1.0	A	4.0	16.0	6.0	2.0	D	3.0	Newcastle	1.70	3.75	5.0	1.67	3.8	5.25	08	P Tierney	1.67	4.0	5.5	18	1819
29	2.0	16.0	9.0	5.0	4.0	Tottenham	2.90	3.30	2.62	2.90	3.2	2.55	42.0	20.0	-0.25	1.79	2.03	2.86	1.72	2.18	3.27	2.56	1.84	2.10	3.00	1.76	2.25	3.40	2.67	40.0	2018-08-27	27	E0	3.0	0.0	A	5.0	11.0	23.0	5.0	D	2.0	Man United	2.75	3.25	2.6	2.75	3.2	2.55	08	C Pawson	2.90	3.3	2.6	18	1819

# Create Home Win, Draw and Away Win columns
df = df.assign(homeWin=lambda df: df.apply(lambda row: 1 if row.FTHG > row.FTAG else 0, axis='columns'),
              draw=lambda df: df.apply(lambda row: 1 if row.FTHG == row.FTAG else 0, axis='columns'),
              awayWin=lambda df: df.apply(lambda row: 1 if row.FTHG < row.FTAG else 0, axis='columns'))
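The row-wise apply above works, but the same three indicator columns can also be built from whole-column comparisons, which tends to be faster on large tables. This is a minimal sketch on an invented three-row table (the name toy and its values are assumptions for illustration), not a change to the walkthrough's own code.

```python
import pandas as pd

# Toy results table using the same column names as the data above
toy = pd.DataFrame({"FTHG": [2, 1, 0], "FTAG": [1, 1, 3]})

# Boolean comparisons act on whole columns; astype(int) turns True/False into 1/0
toy = toy.assign(
    homeWin=(toy.FTHG > toy.FTAG).astype(int),
    draw=(toy.FTHG == toy.FTAG).astype(int),
    awayWin=(toy.FTHG < toy.FTAG).astype(int),
)
print(toy)
```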

Home Ground Advantage

win_rates = \
(df.groupby('season')[['homeWin', 'draw', 'awayWin']] # Select the numeric result columns first
    .mean())

win_rates

	homeWin	draw	awayWin
season
0506	0.505263	0.202632	0.292105
0607	0.477573	0.258575	0.263852
0708	0.463158	0.263158	0.273684
0809	0.453826	0.255937	0.290237
0910	0.507895	0.252632	0.239474
1011	0.471053	0.292105	0.236842
1112	0.450000	0.244737	0.305263
1213	0.433862	0.285714	0.280423
1314	0.472973	0.208108	0.318919
1415	0.453826	0.245383	0.300792
1516	0.414248	0.282322	0.303430
1617	0.492105	0.221053	0.286842
1718	0.455263	0.260526	0.284211
1819	0.466667	0.200000	0.333333

Findings

As we can see, the rates of home wins, draws and away wins are fairly consistent across seasons. The home team wins around 46-47% of the time, a draw happens about 25% of the time, and the away team wins about 28% of the time. Let's plot this DataFrame so that we can see the trend more easily.

# Set the style
plt.style.use('ggplot')

fig = plt.figure()
ax = fig.add_subplot(111)

home_line = ax.plot(win_rates.homeWin, label='Home Win Rate')
away_line = ax.plot(win_rates.awayWin, label='Away Win Rate')
draw_line = ax.plot(win_rates.draw, label='Draw Rate')
ax.set_xlabel("season")
ax.set_ylabel("Win Rate")
plt.title("Win Rates", fontsize=16)

# Add the legend locations
home_legend = plt.legend(handles=home_line, loc='upper right', bbox_to_anchor=(1, 1))
ax = plt.gca().add_artist(home_legend)
away_legend = plt.legend(handles=away_line, loc='center right', bbox_to_anchor=(0.95, 0.4))
ax = plt.gca().add_artist(away_legend)
draw_legend = plt.legend(handles=draw_line, loc='center right', bbox_to_anchor=(0.95, 0.06))

[Figure: home win, away win and draw rates by season]

As we can see, the win rates are relatively stable each season, except for 15/16, when the home win rate drops to about 41%.

Out of interest, let's also have a look at which team has the best home ground advantage. Let's define HGA as home win rate minus away win rate, and then plot the HGA of some of the big clubs against each other.

home_win_rates = \
(df.groupby(['HomeTeam'])
    .homeWin
    .mean())

away_win_rates = \
(df.groupby(['AwayTeam'])
    .awayWin
    .mean())

hga = (home_win_rates - away_win_rates).reset_index().rename(columns={0: 'HGA'}).sort_values(by='HGA', ascending=False)

hga.head(10)

	HomeTeam	HGA
15	Fulham	0.315573
7	Brighton	0.304762
20	Man City	0.244980
14	Everton	0.241935
30	Stoke	0.241131
10	Charlton	0.236842
0	Arsenal	0.236140
27	Reading	0.234962
33	Tottenham	0.220207
21	Man United	0.215620
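To make the HGA definition concrete, here is a minimal sketch on four invented fixtures between two hypothetical teams, A and B. The data and the names toy and toy_hga are assumptions for illustration only.

```python
import pandas as pd

# Toy fixtures (invented for illustration): each team hosts the other twice
toy = pd.DataFrame({
    "HomeTeam": ["A", "A", "B", "B"],
    "AwayTeam": ["B", "B", "A", "A"],
    "FTHG": [2, 1, 0, 3],
    "FTAG": [0, 1, 1, 1],
})
toy["homeWin"] = (toy.FTHG > toy.FTAG).astype(int)
toy["awayWin"] = (toy.FTHG < toy.FTAG).astype(int)

home_rate = toy.groupby("HomeTeam").homeWin.mean()
away_rate = toy.groupby("AwayTeam").awayWin.mean()

# Subtraction aligns the two Series on team name
toy_hga = (home_rate - away_rate).rename("HGA").rename_axis("team")
print(toy_hga)
```

Team A wins half of its games both at home and away, so its HGA is 0. Team B wins half of its home games and none away, so its HGA is 0.5.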

So the club with the best HGA is Fulham, which is interesting. This is most likely because Fulham have won 100% of their home games in 2018 so far, which is skewing the mean. Let's see how the HGA of some of the big clubs compares across seasons.

big_clubs = ['Liverpool', 'Man City', 'Man United', 'Chelsea', 'Arsenal']
home_win_rates_5 = df[df.HomeTeam.isin(big_clubs)].groupby(['HomeTeam', 'season']).homeWin.mean()
away_win_rates_5 = df[df.AwayTeam.isin(big_clubs)].groupby(['AwayTeam', 'season']).awayWin.mean()

hga_top_5 = home_win_rates_5 - away_win_rates_5

hga_top_5.unstack(level=0)

HomeTeam	Arsenal	Chelsea	Liverpool	Man City	Man United
season
0506	0.421053	0.368421	0.263158	0.263158	0.052632
0607	0.263158	0.000000	0.421053	-0.052632	0.105263
0708	0.210526	-0.052632	0.157895	0.368421	0.368421
0809	0.105263	-0.157895	-0.052632	0.578947	0.210526
0910	0.368421	0.368421	0.421053	0.315789	0.263158
1011	0.157895	0.368421	0.368421	0.263158	0.684211
1112	0.157895	0.315789	-0.105263	0.421053	0.105263
1213	0.052632	0.105263	0.105263	0.248538	0.201754
1314	0.143275	0.251462	0.307018	0.362573	-0.026316
1415	0.131579	0.210526	0.105263	0.210526	0.421053
1516	0.210526	-0.105263	0.000000	0.263158	0.263158
1617	0.263158	0.210526	0.105263	-0.052632	-0.105263
1718	0.578947	0.052632	0.157895	0.000000	0.263158
1819	0.500000	0.000000	0.000000	0.500000	0.500000

Now let's plot it.

sns.lineplot(x='season', y='HGA', hue='team', data=hga_top_5.reset_index().rename(columns={0: 'HGA', 'HomeTeam': 'team'}))
plt.legend(loc='lower center', ncol=6, bbox_to_anchor=(0.45, -0.2))
plt.title("HGA Among the top 5 clubs", fontsize=14)
plt.show()

[Figure: HGA by season for the five big clubs]

The results here are quite erratic, although Arsenal has an HGA above 0 in every season.

Let's now look at the distributions of each of our columns. The odds columns are likely to be highly skewed, so we may have to account for this later.

for col in df.select_dtypes('number').columns:
    sns.distplot(df[col])
    plt.title(f"Distribution for {col}")
    plt.show()

[Figures: distribution plot for each numeric column]

Exploring Referee Home Ground Bias

What may be of interest is whether certain referees are associated with the home team winning more often. Let's explore home ground bias for the 10 referees who have officiated the most games.

print('Overall Home Win Rate: {:.4}%'.format(df.homeWin.mean() * 100))

# Get the top 10 refs based on games
top_10_refs = df.Referee.value_counts().head(10).index

df[df.Referee.isin(top_10_refs)].groupby('Referee').homeWin.mean().sort_values(ascending=False)

Overall Home Win Rate: 46.55%

Referee
L Mason          0.510373
C Foy            0.500000
M Clattenburg    0.480000
M Jones          0.475248
P Dowd           0.469880
M Atkinson       0.469565
M Oliver         0.466019
H Webb           0.456604
A Marriner       0.455516
M Dean           0.442049
Name: homeWin, dtype: float64

It seems that L Mason may be the most influenced by the home crowd. Whilst the overall home win rate is 46.5%, the home win rate when he is the Referee is 51%. However it should be noted that this doesn't mean that he causes the win through bias. It could just be that he referees the best clubs, so naturally their home win rate is high.
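A rough way to judge whether a gap like this is bigger than sampling noise is to compare it with the standard error of a proportion. The sketch below does this under a loudly labelled assumption: the number of games (n_games = 240) is illustrative only, since the per-referee game counts are not shown above.

```python
import math

overall_rate = 0.4655   # overall home win rate reported above
ref_rate = 0.5104       # L Mason's home win rate reported above
n_games = 240           # ASSUMPTION: illustrative number of games, not taken from the data

# Standard error of a proportion if the referee's games followed the overall rate
standard_error = math.sqrt(overall_rate * (1 - overall_rate) / n_games)
z_score = (ref_rate - overall_rate) / standard_error

print(f"standard error: {standard_error:.4f}")
print(f"z-score: {z_score:.2f}")
```

Under this assumed sample size the gap is less than two standard errors, so it would be hard to distinguish from chance even before considering which clubs the referee tends to officiate.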

Variable Correlation With Margin

Let's now explore different variables' relationships with margin. First, we'll create a margin column, then we will pick a few different variables and look at how they correlate with margin and with each other, using a correlation heatmap.

df['margin'] = df['FTHG'] - df['FTAG']

# HTR (half-time result) is categorical, so it is left out of the correlation matrix
stat_cols = ['AC', 'AF', 'AR', 'AS', 'AST', 'AY', 'HC', 'HF', 'HR', 'HS', 'HST', 'HY', 'margin']

stat_correlations = df[stat_cols].corr()
stat_correlations['margin'].sort_values()

    AST      -0.345703
    AS       -0.298665
    HY       -0.153806
    HR       -0.129393
    AC       -0.073204
    HF       -0.067469
    AF        0.005474
    AY        0.013746
    HC        0.067433
    AR        0.103528
    HS        0.275847
    HST       0.367591
    margin    1.000000
    Name: margin, dtype: float64

Unsurprisingly, Home Shots on Target has the strongest positive correlation with margin (0.37), mirrored by Away Shots on Target on the negative side (-0.35). Away Reds is also positively correlated (0.10). What is surprising is that Home Yellows has a noticeable negative correlation with margin (-0.15). This may be because players play more aggressively when they are losing to try to get the lead back, and hence receive more yellow cards.
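For readers who want to see what a single entry of the correlation output represents, the sketch below computes a Pearson correlation coefficient by hand. The function name pearson and the five toy matches are assumptions for illustration; the values are invented and are not drawn from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented toy values: home shots on target and margin for five matches
home_shots_on_target = [2, 4, 5, 7, 9]
margin = [-1, 0, 1, 1, 3]

print(round(pearson(home_shots_on_target, margin), 3))
```

A value near 1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.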

Let's now look at the heatmap between variables.

sns.heatmap(stat_correlations, annot=True, annot_kws={'size': 10})

[Figure: correlation heatmap of the match statistics]

Analysing Features

What we are really interested in is how our features (created in the next tutorial) correlate with winning. We will skip ahead here and use a function to create our features for us, and then examine how the moving averages and other features correlate with winning.

# Create a cleaned df of all of our data
pre_features_df = create_df('data/epl_data.csv')

# Create our features
features = create_feature_df(pre_features_df)
    Creating all games feature DataFrame

    C:\Users\wardj\Documents\Betfair Public Github\predictive-models\epl\data_preparation_functions.py:419: RuntimeWarning: invalid value encountered in double_scalars
      .pipe(lambda df: (df.eloAgainst * df[goalsForOrAgainstCol]).sum() / df.eloAgainst.sum()))

    Creating stats feature DataFrame
    Creating odds feature DataFrame
    Creating market values feature DataFrame
    Filling NAs
    Merging stats, odds and market values into one features DataFrame
    Complete.

features = (pre_features_df.assign(margin=lambda df: df.FTHG - df.FTAG)
                           .loc[:, ['gameId', 'margin']]
                           .pipe(pd.merge, features, on=['gameId']))

features.corr().margin.sort_values(ascending=False)[:20]

    margin                     1.000000
    f_awayOdds                 0.413893
    f_totalMktH%               0.330420
    f_defMktH%                 0.325392
    f_eloAgainstAway           0.317853
    f_eloForHome               0.317853
    f_midMktH%                 0.316080
    f_attMktH%                 0.312262
    f_sizeOfHandicapAway       0.301667
    f_goalsForHome             0.298930
    f_wtEloGoalsForHome        0.297157
    f_shotsForHome             0.286239
    f_cornersForHome           0.279917
    f_gkMktH%                  0.274732
    f_homeWinPc38Away          0.271326
    f_homeWinPc38Home          0.271326
    f_wtEloGoalsAgainstAway    0.269663
    f_goalsAgainstAway         0.258418
    f_cornersAgainstAway       0.257148
    f_drawOdds                 0.256807
    Name: margin, dtype: float64

As we can see, away odds are most highly correlated with margin. This makes sense, as odds generally have most or all available information built into the price. What is interesting is that Elo also seems to be highly correlated, which is good news for the Elo model that we made. Similarly, weighted goals and the value of the defence relative to other teams ('defMktH%' etc.) are strongly correlated with margin.

02. Data Preparation & Feature Engineering

Welcome to the second part of this Machine Learning Walkthrough. This tutorial will focus on data preparation and feature creation, before we dive into modelling in the next tutorial.

Specifically, this tutorial will cover a few things:

Data wrangling specifically for sport
Feature creation - focussing on commonly used features in sports modelling, such as exponential moving averages
Using functions to modularise the data preparation process

Data Wrangling

We will begin by utilising functions we have defined in our data_preparation_functions script to wrangle our data into a format that can be consumed by Machine Learning algorithms.

A typical issue when modelling sport is that Machine Learning algorithms require all features for both teams in a game to be on the same row of a table, whereas when we actually calculate these features, we usually need the teams to be on separate rows, as this makes it much easier to calculate typical features such as exponentially weighted moving averages. We will explore this issue and show how we deal with it.

# Import libraries
from data_preparation_functions import *
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, cross_val_score
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 100)

We have created some functions which prepare the data for you. For a thoroughly commented explanation of how the functions work, read through the data_preparation_functions.py script alongside this walkthrough.

Essentially, each function wrangles the data through a similar process. It first reads in the data from a CSV file, then converts the columns to datatypes that we can work with, such as converting the Date column to a datetime data type. It then adds a Game ID column, so each game is easily identifiable and can be joined on. We then assign the DataFrame some other columns which may be useful, such as 'Year', 'Result' and 'homeWin'. Finally, we drop redundant columns and return the DataFrame.
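The process described above can be sketched roughly as follows. This is not the code from data_preparation_functions.py; sketch_create_df is a hypothetical stand-in, and the two-row CSV is invented so that the example runs on its own.

```python
import io
import pandas as pd

# Two invented rows in the same shape as the saved CSV (illustration only)
RAW_CSV = io.StringIO(
    "Date,HomeTeam,AwayTeam,FTHG,FTAG\n"
    "2005-08-13,West Ham,Blackburn,3,1\n"
    "2005-08-13,Aston Villa,Bolton,2,2\n"
)


def sketch_create_df(path_or_buffer):
    """Hypothetical stand-in for create_df: read, type, number and label games."""
    games = pd.read_csv(path_or_buffer, parse_dates=["Date"])
    # A stable sort keeps same-day games in their original order
    games = games.sort_values("Date", kind="mergesort").reset_index(drop=True)
    games["gameId"] = games.index + 1                      # 1-based game identifier
    games["Year"] = games.Date.dt.year
    games["homeWin"] = (games.FTHG > games.FTAG).astype(int)
    games["awayWin"] = (games.FTHG < games.FTAG).astype(int)
    # Label the result as 'home', 'away' or 'draw'
    games["result"] = "draw"
    games.loc[games.homeWin == 1, "result"] = "home"
    games.loc[games.awayWin == 1, "result"] = "away"
    return games


sketch = sketch_create_df(RAW_CSV)
print(sketch[["gameId", "Date", "HomeTeam", "result"]])
```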

Let us now create six different DataFrames, which we will use to create features. Later, we will join these features back into one main feature DataFrame.

Create 6 distinct DataFrames

# This table includes all of our data in one big DataFrame
df = create_df('data/epl_data.csv')
df.head(3)

	AC	AF	AR	AS	AST	AY	AwayTeam	B365A	B365D	B365H	BWA	BWD	BWH	Bb1X2	BbAH	BbAHh	BbAv<2.5	BbAv>2.5	BbAvA	BbAvAHA	BbAvAHH	BbAvD	BbAvH	BbMx<2.5	BbMx>2.5	BbMxA	BbMxAHA	BbMxAHH	BbMxD	BbMxH	BbOU	Date	Day	Div	FTAG	FTHG	FTR	HC	HF	HS	HST	HTAG	HTHG	HTR	HY	HomeTeam	IWA	IWD	IWH	LBA	LBD	LBH	Month	Referee	VCA	VCD	VCH	Year	season	gameId	homeWin	awayWin	result
0	6.0	14.0	1.0	11.0	5.0	1.0	Blackburn	2.75	3.20	2.5	2.90	3.30	2.20	55.0	20.0	0.00	1.71	2.02	2.74	2.04	1.82	3.16	2.40	1.80	2.25	2.9	2.08	1.86	3.35	2.60	35.0	2005-08-13	13	E0	1.0	3.0	H	2.0	11.0	13.0	5.0	1.0	0.0	A	0.0	West Ham	2.7	3.0	2.3	2.75	3.0	2.38	8	A Wiley	2.75	3.25	2.4	2005	0506	1	1	0	home
1	8.0	16.0	0.0	13.0	6.0	2.0	Bolton	3.00	3.25	2.3	3.15	3.25	2.10	56.0	22.0	-0.25	1.70	2.01	3.05	1.84	2.01	3.16	2.20	1.87	2.20	3.4	1.92	2.10	3.30	2.40	36.0	2005-08-13	13	E0	2.0	2.0	D	7.0	14.0	3.0	2.0	2.0	2.0	D	0.0	Aston Villa	3.1	3.0	2.1	3.20	3.0	2.10	8	M Riley	3.10	3.25	2.2	2005	0506	2	0	0	draw
2	6.0	14.0	0.0	12.0	5.0	1.0	Man United	1.72	3.40	5.0	1.75	3.35	4.35	56.0	23.0	0.75	1.79	1.93	1.69	1.86	2.00	3.36	4.69	1.87	2.10	1.8	1.93	2.05	3.70	5.65	36.0	2005-08-13	13	E0	2.0	0.0	A	8.0	15.0	10.0	5.0	1.0	0.0	A	3.0	Everton	1.8	3.1	3.8	1.83	3.2	3.75	8	G Poll	1.80	3.30	4.5	2005	0506	3	0	1	away

# This includes only the typical soccer stats, like home corners, home shots on target etc.
stats = create_stats_df('data/epl_data.csv')
stats.head(3)

	gameId	HomeTeam	AwayTeam	FTHG	FTAG	HTHG	HTAG	HS	AS	HST	AST	HF	AF	HC	AC	HY	AY	AR
0	1	West Ham	Blackburn	3.0	1.0	0.0	1.0	13.0	11.0	5.0	5.0	11.0	14.0	2.0	6.0	0.0	1.0	1.0
1	2	Aston Villa	Bolton	2.0	2.0	2.0	2.0	3.0	13.0	2.0	6.0	14.0	16.0	7.0	8.0	0.0	2.0	0.0
2	3	Everton	Man United	0.0	2.0	0.0	1.0	10.0	12.0	5.0	5.0	15.0	14.0	8.0	6.0	3.0	1.0	0.0

# This includes all of our betting related data, such as win/draw/lose odds, asian handicaps etc.
betting = create_betting_df('data/epl_data.csv')
betting.head(3)

	B365A	B365D	B365H	BWA	BWD	BWH	Bb1X2	BbAH	BbAHh	BbAv<2.5	BbAv>2.5	BbAvA	BbAvAHA	BbAvAHH	BbAvD	BbAvH	BbMx<2.5	BbMx>2.5	BbMxA	BbMxAHA	BbMxAHH	BbMxD	BbMxH	BbOU	Day	Div	IWA	IWD	IWH	LBA	LBD	LBH	Month	VCA	VCD	VCH	Year	homeWin	awayWin	result	HomeTeam	AwayTeam	gameId
0	2.75	3.20	2.5	2.90	3.30	2.20	55.0	20.0	0.00	1.71	2.02	2.74	2.04	1.82	3.16	2.40	1.80	2.25	2.9	2.08	1.86	3.35	2.60	35.0	13	E0	2.7	3.0	2.3	2.75	3.0	2.38	8	2.75	3.25	2.4	2005	1	0	home	West Ham	Blackburn	1
1	3.00	3.25	2.3	3.15	3.25	2.10	56.0	22.0	-0.25	1.70	2.01	3.05	1.84	2.01	3.16	2.20	1.87	2.20	3.4	1.92	2.10	3.30	2.40	36.0	13	E0	3.1	3.0	2.1	3.20	3.0	2.10	8	3.10	3.25	2.2	2005	0	0	draw	Aston Villa	Bolton	2
2	1.72	3.40	5.0	1.75	3.35	4.35	56.0	23.0	0.75	1.79	1.93	1.69	1.86	2.00	3.36	4.69	1.87	2.10	1.8	1.93	2.05	3.70	5.65	36.0	13	E0	1.8	3.1	3.8	1.83	3.2	3.75	8	1.80	3.30	4.5	2005	0	1	away	Everton	Man United	3

# This includes all of the team information for each game.
team_info = create_team_info_df('data/epl_data.csv')
team_info.head(3)

	gameId	Date	season	HomeTeam	AwayTeam	FTR	HTR	Referee
0	1	2005-08-13	0506	West Ham	Blackburn	H	A	A Wiley
1	2	2005-08-13	0506	Aston Villa	Bolton	D	D	M Riley
2	3	2005-08-13	0506	Everton	Man United	A	A	G Poll

# Whilst the other DataFrames date back to 2005, this DataFrame has data from 2001 to 2005.
historic_games = create_historic_games_df('data/historic_games_pre2005.csv')
historic_games.head(3)

	Date	HomeTeam	AwayTeam	FTHG	FTAG	gameId	season	homeWin
0	2001-08-18	Charlton	Everton	1	2	-1	20012002	0
1	2001-08-18	Derby	Blackburn	2	1	-1	20012002	1
2	2001-08-18	Leeds	Southampton	2	0	-1	20012002	1

# This is the historic_games DataFrame appended to the df DataFrame.
all_games = create_all_games_df('data/epl_data.csv', 'data/historic_games_pre2005.csv')
all_games.head(3)

	Date	HomeTeam	AwayTeam	FTHG	FTAG	gameId	season	homeWin	awayWin	homeWinPc5	homeWinPc38	awayWinPc5	awayWinPc38	gameIdHistoric
0	2001-08-18	Charlton	Everton	1.0	2.0	-1	20012002	0	1	NaN	NaN	NaN	NaN	1
1	2001-08-18	Derby	Blackburn	2.0	1.0	-1	20012002	1	0	NaN	NaN	NaN	NaN	2
2	2001-08-18	Leeds	Southampton	2.0	0.0	-1	20012002	1	0	NaN	NaN	NaN	NaN	3

Feature Creation

Now that we have all of our pre-prepared DataFrames, and we know that the data is clean, we can move on to feature creation. As is common practice with sports modelling, we are going to start by creating exponentially weighted moving averages (EMA) as features. To get a better understanding of how EMAs work, read here.

In short, an EMA is like a simple moving average, except that it weights recent instances more heavily than older ones, based on an alpha parameter. The documentation for the pandas [ewm method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html) we will be using states that we can specify alpha in a number of ways. We will specify it in terms of span, where $\alpha = 2 / (\text{span} + 1)$ for $\text{span} \geq 1$.
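To see what the span parameter does, here is a minimal pure-Python sketch of the weighted average, assuming the normalised form that pandas documents for its default adjust=True setting. The function name ema_from_span is an assumption for this illustration.

```python
def ema_from_span(values, span):
    """Exponentially weighted mean of `values`, one output per input position.

    Uses alpha = 2 / (span + 1) and normalises by the sum of the weights,
    which is the form pandas documents for `ewm(..., adjust=True)`.
    """
    alpha = 2 / (span + 1)
    averages = []
    for t in range(len(values)):
        # Weight (1 - alpha)**i for the observation i steps in the past
        weights = [(1 - alpha) ** i for i in range(t + 1)]
        recent_first = values[t::-1]
        weighted_sum = sum(w * x for w, x in zip(weights, recent_first))
        averages.append(weighted_sum / sum(weights))
    return averages


print([round(v, 4) for v in ema_from_span([1, 2, 3], span=3)])
```

With span=3, alpha is 0.5, so each older observation counts half as much as the one after it. A larger span gives a smaller alpha and therefore a longer memory.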

Let's first define a function which calculates the exponential moving average for each column in the stats DataFrame. We will then apply this function with other functions we have created, such as create_betting_features_ema, which creates moving averages of betting data.

However, we must first change the structure of our data. Notice that currently each row has both the Home Team's data and the Away Team's data on a single row. This makes it difficult to calculate rolling averages, so we will restructure our DataFrames to ensure each row contains only a single team's data. To do this, we will define a function, create_multiline_df_stats.

# Define a function which restructures our DataFrame
def create_multiline_df_stats(old_stats_df):
    # Create a list of columns we want and their mappings to more interpretable names
    home_stats_cols = ['HomeTeam', 'FTHG', 'FTAG', 'HTHG', 'HTAG', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY',
                       'HR', 'AR']

    away_stats_cols = ['AwayTeam', 'FTAG', 'FTHG', 'HTAG', 'HTHG', 'AS', 'HS', 'AST', 'HST', 'AF', 'HF', 'AC', 'HC', 'AY', 'HY',
                       'AR', 'HR']

    stats_cols_mapping = ['team', 'goalsFor', 'goalsAgainst', 'halfTimeGoalsFor', 'halfTimeGoalsAgainst', 'shotsFor',
                          'shotsAgainst', 'shotsOnTargetFor', 'shotsOnTargetAgainst', 'freesFor', 'freesAgainst', 
                          'cornersFor', 'cornersAgainst', 'yellowsFor', 'yellowsAgainst', 'redsFor', 'redsAgainst']

    # Create a dictionary of the old column names to new column names
    home_mapping = {old_col: new_col for old_col, new_col in zip(home_stats_cols, stats_cols_mapping)}
    away_mapping = {old_col: new_col for old_col, new_col in zip(away_stats_cols, stats_cols_mapping)}

    # Put each team onto an individual row
    home_rows = (old_stats_df[['gameId'] + home_stats_cols] # Filter for only the home team columns
                    .rename(columns=home_mapping) # Rename the columns
                    .assign(homeGame=1)) # Assign homeGame=1 so that we can use a general function later
    away_rows = (old_stats_df[['gameId'] + away_stats_cols] # Filter for only the away team columns
                    .rename(columns=away_mapping) # Rename the away team columns
                    .assign(homeGame=0))
    # Stack the two sets of rows (DataFrame.append was removed in pandas 2.0)
    multi_line_stats = (pd.concat([home_rows, away_rows], sort=True)
                    .sort_values(by='gameId') # Sort the values
                    .reset_index(drop=True))
    return multi_line_stats

# Define a function which creates an EMA DataFrame from the stats DataFrame
def create_stats_features_ema(stats, span):
    # Create a restructured DataFrame so that we can calculate the EMA
    multi_line_stats = create_multiline_df_stats(stats)

    # Create a copy of the DataFrame
    ema_features = multi_line_stats[['gameId', 'team', 'homeGame']].copy()

    # Get the columns that we want to create EMA for
    feature_names = multi_line_stats.drop(columns=['gameId', 'team', 'homeGame']).columns

    # Loop over the features
    for feature_name in feature_names:
        feature_ema = (multi_line_stats.groupby('team')[feature_name] # Calculate the EMA
                                                  .transform(lambda row: row.ewm(span=span, min_periods=2)
                                                             .mean()
                                                             .shift(1))) # Shift the data down 1 so we don't leak data
        ema_features[feature_name] = feature_ema # Add the new feature to the DataFrame
    return ema_features

# Apply the function
stats_features = create_stats_features_ema(stats, span=5)
stats_features.tail()

	gameId	team	homeGame	cornersAgainst	cornersFor	freesAgainst	freesFor	goalsAgainst	goalsFor	halfTimeGoalsAgainst	halfTimeGoalsFor	redsAgainst	redsFor	shotsAgainst	shotsFor	shotsOnTargetAgainst	shotsOnTargetFor	yellowsAgainst	yellowsFor
9903	4952	Newcastle	1	4.301743	4.217300	11.789345	12.245066	0.797647	0.833658	0.644214	0.420832	2.323450e-10	3.333631e-01	11.335147	13.265955	3.211345	4.067990	1.848860	1.627140
9904	4953	Burnley	0	4.880132	5.165915	13.326703	8.800033	1.945502	0.667042	0.609440	0.529409	3.874405e-03	3.356120e-10	13.129631	10.642381	4.825874	3.970285	0.963527	0.847939
9905	4953	Fulham	1	4.550255	4.403060	10.188263	8.555589	2.531046	1.003553	0.860573	0.076949	1.002518e-04	8.670776e-03	17.463779	12.278877	8.334019	4.058213	0.980097	1.102974
9906	4954	Man United	1	3.832573	4.759683	11.640608	10.307946	1.397234	1.495032	1.034251	0.809280	6.683080e-05	1.320468e-05	8.963022	10.198642	3.216957	3.776900	1.040077	1.595650
9907	4954	Tottenham	0	3.042034	5.160211	8.991460	9.955635	1.332704	2.514789	0.573728	1.010491	4.522878e-08	1.354409e-05	12.543406	17.761004	3.757437	7.279845	1.478976	1.026601
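The shift(1) in create_stats_features_ema is what stops a game's own result from leaking into its features. The sketch below contrasts the shifted and unshifted versions on an invented six-row table; the column names ema_leaky and ema_safe are assumptions for illustration.

```python
import pandas as pd

# Invented goals for two teams, already in chronological order
toy = pd.DataFrame({
    "team": ["A", "B", "A", "B", "A", "B"],
    "goalsFor": [1, 0, 3, 2, 2, 4],
})

# Without the shift, each row's average includes that row's own result
toy["ema_leaky"] = toy.groupby("team").goalsFor.transform(
    lambda s: s.ewm(span=3).mean())

# With shift(1), each row only sees games played before it
toy["ema_safe"] = toy.groupby("team").goalsFor.transform(
    lambda s: s.ewm(span=3).mean().shift(1))

print(toy)
```

Each team's first game has no history, so its shifted value is NaN. Those rows are later removed by the dropna call when the features are restructured.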

As we can see, we now have averages for each team. Let's create a quick table to see the top 10 teams' goalsFor average EMAs since 2005.

pd.DataFrame(stats_features.groupby('team')
                           .goalsFor
                           .mean()
                           .sort_values(ascending=False)[:10])

	goalsFor
team
Man United	1.895026
Chelsea	1.888892
Arsenal	1.876770
Man City	1.835863
Liverpool	1.771125
Tottenham	1.655063
Leicester	1.425309
Blackpool	1.390936
Everton	1.387110
Southampton	1.288349

Optimising Alpha

It looks like Man United and Chelsea have been two of the best teams since 2005, based on goalsFor. Now that we have our stats features, we may be tempted to move on. However, we have arbitrarily chosen a span of 5. How do we know that this is the best value? We don't. Let's try and optimise this value.

To do this, we will use a simple Logistic Regression model to create probabilistic predictions based on the stats features we created before. We will iterate over a range of span values, from 1 to 118 in steps of 3, and choose the value which produces the model with the lowest log loss, based on cross validation.

To do this, we need to restructure our DataFrame back to how it was before.

def restructure_stats_features(stats_features):
    non_features = ['homeGame', 'team', 'gameId']

    stats_features_restructured = (stats_features.query('homeGame == 1')
                                    .rename(columns={col: 'f_' + col + 'Home' for col in stats_features.columns if col not in non_features})
                                    .rename(columns={'team': 'HomeTeam'})
                                    .pipe(pd.merge, (stats_features.query('homeGame == 0')
                                                        .rename(columns={'team': 'AwayTeam'})
                                                        .rename(columns={col: 'f_' + col + 'Away' for col in stats_features.columns 
                                                                         if col not in non_features})), on=['gameId'])
                                    .pipe(pd.merge, df[['gameId', 'result']], on='gameId')
                                    .dropna())
    return stats_features_restructured

restructure_stats_features(stats_features).head()

	gameId	HomeTeam	homeGame_x	f_cornersAgainstHome	f_cornersForHome	f_freesAgainstHome	f_freesForHome	f_goalsAgainstHome	f_goalsForHome	f_halfTimeGoalsAgainstHome	f_halfTimeGoalsForHome	f_redsAgainstHome	f_redsForHome	f_shotsAgainstHome	f_shotsForHome	f_shotsOnTargetAgainstHome	f_shotsOnTargetForHome	f_yellowsAgainstHome	f_yellowsForHome	AwayTeam	f_cornersAgainstAway	f_cornersForAway	f_freesAgainstAway	f_freesForAway	f_goalsAgainstAway	f_goalsForAway	f_halfTimeGoalsAgainstAway	f_halfTimeGoalsForAway	f_redsForAway	f_shotsAgainstAway	f_shotsForAway	f_shotsOnTargetAgainstAway	f_shotsOnTargetForAway	f_yellowsAgainstAway	f_yellowsForAway	result
20	21	Birmingham	1	4.8	7.8	12.0	9.4	1.2	0.6	0.6	0.6	0.0	0.0	11.4	8.2	6.4	2.8	1.0	2.6	Middlesbrough	3.0	5.6	14.0	12.8	1.2	0.0	0.0	0.0	0.4	17.2	8.8	7.6	2.6	3.0	1.4	away
21	22	Portsmouth	1	2.6	4.6	21.8	16.6	2.0	0.6	1.0	0.0	0.0	0.0	8.0	10.4	3.6	4.0	3.2	1.8	Aston Villa	9.8	7.0	14.2	18.2	1.4	0.8	0.8	0.8	0.0	16.0	3.0	9.6	2.6	2.0	0.6	draw
22	23	Sunderland	1	5.0	5.0	11.6	18.0	1.8	0.4	1.0	0.4	0.4	0.6	14.6	6.0	5.2	3.2	1.2	2.6	Man City	7.8	3.6	8.6	12.4	0.6	1.2	0.6	0.6	0.0	10.6	11.4	2.4	6.8	3.0	1.4	away
23	24	Arsenal	1	3.0	7.4	17.0	18.6	0.6	0.8	0.0	0.0	0.4	0.0	6.2	11.4	4.0	6.6	1.6	1.8	Fulham	7.2	3.0	20.8	13.2	1.2	0.6	0.6	0.0	0.0	12.4	10.8	7.0	5.2	2.0	1.6	home
24	25	Blackburn	1	1.4	7.2	12.8	21.2	1.8	1.6	0.0	1.0	0.0	0.4	10.0	14.0	4.4	7.4	1.2	1.6	Tottenham	6.4	3.8	11.2	18.8	0.0	2.0	0.0	0.4	0.0	11.6	15.2	4.6	7.2	0.6	2.6	draw

Now let's write a function that optimises our span based on log loss of the output of a Logistic Regression model.

def optimise_alpha(features):
    le = LabelEncoder()
    y = le.fit_transform(features.result) # Encode the result from away, draw, home win to 0, 1, 2
    X = features[[col for col in features.columns if col.startswith('f_')]] # Only get the features - these all start with f_
    lr = LogisticRegression()

    kfold = StratifiedKFold(n_splits=5)
    ave_cv_score = cross_val_score(lr, X, y, scoring='neg_log_loss', cv=kfold).mean()
    return ave_cv_score

best_score = float('inf') # np.float was removed in NumPy 1.24; the built-in float is equivalent
best_span = 0
cv_scores = []

# Iterate over a range of spans
for span in range(1, 120, 3):
    stats_features = create_stats_features_ema(stats, span=span)
    restructured_stats_features = restructure_stats_features(stats_features)
    cv_score = optimise_alpha(restructured_stats_features)
    cv_scores.append(cv_score)

    if cv_score * -1 < best_score:
        best_score = cv_score * -1
        best_span = span

plt.style.use('ggplot')
plt.plot(list(range(1, 120, 3)), (pd.Series(cv_scores)*-1)) # Plot our results

plt.title("Optimising alpha")
plt.xlabel("Span")
plt.ylabel("Log Loss")
plt.show()

print("Our lowest log loss ({:.6f}) occurred at a span of {}".format(best_score, best_span))

[Figure: "Optimising alpha" line plot of log loss (y-axis) against span (x-axis)]

Our lowest log loss (0.980835) occurred at a span of 55

The above method is just an example of how you can optimise hyperparameters. This example has clear limitations, such as optimising every statistic with the same alpha. However, for the rest of this tutorial series we will use this span value.
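To make the link between span and alpha concrete: pandas defines the smoothing factor as alpha = 2 / (span + 1), so a span of 55 corresponds to an alpha of roughly 0.036. The sketch below uses made-up goal counts (not data from this tutorial) to show that the two parameterisations give the same averages. Whether create_stats_features_ema calls ewm in exactly this way is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical goals scored by one team over eight games (illustrative only)
goals = pd.Series([1, 0, 3, 2, 1, 4, 0, 2], dtype=float)

span = 5
alpha = 2 / (span + 1)  # pandas' definition of alpha for a given span

ema_from_span = goals.ewm(span=span).mean()
ema_from_alpha = goals.ewm(alpha=alpha).mean()

# Both parameterisations describe the same exponential weighting
same = np.allclose(ema_from_span, ema_from_alpha)
print(same)

# Shift by one game so each row only uses information from earlier games
pre_game_ema = ema_from_span.shift(1)
print(pre_game_ema.round(3).tolist())
```

The shift at the end is one way to keep a game's own statistics out of its features; the first game of the series then has no value at all.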

Now let's create the rest of our features. For thorough explanations and the actual code behind some of the functions used, please refer to the data_preparation_functions.py script.

Creating our Features DataFrame

We will utilise pre-made functions to create all of our features in just a few lines of code.

As part of this process we will create features which include margin weighted elo, an exponential average for asian handicap data, and odds as features.

Our Elo function is essentially the same as the one we created in the AFL tutorial; if you would like to know more about Elo models please read this article.
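For readers who have not seen Elo before, here is a minimal sketch of the standard (unweighted) update. The K factor of 20 and the function names are assumptions made for illustration; the margin weighting used in data_preparation_functions.py is not reproduced here.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for team A against team B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=20):
    """Return both teams' new ratings.

    score_a is 1 for a win by team A, 0.5 for a draw and 0 for a loss.
    k (hypothetical value) controls how quickly ratings move.
    """
    exp_a = expected_score(rating_a, rating_b)
    change = k * (score_a - exp_a)
    return rating_a + change, rating_b - change


# Two teams starting on 1500 each, with the first team winning
home_rating, away_rating = update_elo(1500.0, 1500.0, score_a=1)
print(home_rating, away_rating)
```

Because the update is zero-sum, whatever one team gains the other loses, and an upset win moves the ratings further than an expected one.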

Note that the cell below may take a few minutes to run.

# Create feature DataFrames
features_all_games = create_all_games_features(all_games)

C:\Users\wardj\Documents\Betfair Public Github\predictive-models\epl\data_preparation_functions.py:419: RuntimeWarning: invalid value encountered in double_scalars .pipe(lambda df: (df.eloAgainst * df[goalsForOrAgainstCol]).sum() / df.eloAgainst.sum()))

The features_all_games df includes elo for each team, as well as their win percentage at home and away over the past 5 and 38 games. For more information on how it was calculated, read through the data_preparation_functions script.
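As a rough idea of how a "win percentage over the past n games" feature can be built, the sketch below uses a shifted rolling mean on a toy results table. The column names and the window of 3 are hypothetical, and the real functions also split home and away games, which this example does not.

```python
import pandas as pd

# Toy results for a single team: 1 = win, 0 = no win (illustrative only)
results = pd.DataFrame({
    "team": ["A"] * 7,
    "win": [1, 1, 0, 0, 0, 1, 1],
})

window = 3  # hypothetical window; the tutorial uses 5 and 38

# shift(1) drops the current game, so each row only sees earlier results
results["winPc3"] = (
    results.groupby("team")["win"]
    .transform(lambda s: s.shift(1).rolling(window).mean())
)

print(results)
```

Rows without a full window of earlier games come out as NaN, which matches the NaN values at the start of the table shown for features_all_games.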

features_all_games.head(3)

	Date	awayWin	awayWinPc38	awayWinPc5	eloAgainst	eloFor	gameId	gameIdHistoric	goalsAgainst	goalsFor	homeGame	homeWin	homeWinPc38	homeWinPc5	season	team	wtEloGoalsFor	wtEloGoalsAgainst
0	2001-08-18	1	NaN	NaN	1500.0	1500.0	-1	1	2.0	1.0	1	0	NaN	NaN	20012002	Charlton	NaN	NaN
1	2001-08-18	1	NaN	NaN	1500.0	1500.0	-1	1	1.0	2.0	0	0	NaN	NaN	20012002	Everton	NaN	NaN
2	2001-08-18	0	NaN	NaN	1500.0	1500.0	-1	2	1.0	2.0	1	1	NaN	NaN	20012002	Derby	NaN	NaN

The features_stats df includes all the exponentially weighted averages for each stat in the stats df.

# Create feature stats df
features_stats = create_stats_features_ema(stats, span=best_span)
features_stats.tail(3)

	gameId	team	homeGame	cornersAgainst	cornersFor	freesAgainst	freesFor	goalsAgainst	goalsFor	halfTimeGoalsAgainst	halfTimeGoalsFor	redsAgainst	redsFor	shotsAgainst	shotsFor	shotsOnTargetAgainst	shotsOnTargetFor	yellowsAgainst	yellowsFor
9905	4953	Fulham	1	6.006967	5.045733	10.228997	9.965651	2.147069	1.093550	0.630485	0.364246	0.032937	0.043696	16.510067	11.718122	7.184386	4.645762	1.310424	1.389716
9906	4954	Man United	1	4.463018	5.461075	11.605712	10.870367	0.843222	1.586308	0.427065	0.730650	0.042588	0.027488	10.865754	13.003121	3.562675	4.626450	1.740735	1.712785
9907	4954	Tottenham	0	3.868619	6.362901	10.784145	10.140388	0.954928	2.100166	0.439129	0.799968	0.024351	0.026211	9.947515	16.460598	3.370010	6.136120	1.925005	1.364268

The features_odds df includes a moving average of some of the odds data.

# Create feature_odds df
features_odds = create_betting_features_ema(betting, span=10)
features_odds.tail(3)

	gameId	team	avAsianHandicapOddsAgainst	avAsianHandicapOddsFor	avgreaterthan2.5	avlessthan2.5	sizeOfHandicap
9905	4953	Fulham	1.884552	1.985978	1.756776	2.128261	0.502253
9906	4954	Man United	1.871586	2.031787	1.900655	1.963478	-0.942445
9907	4954	Tottenham	1.947833	1.919607	1.629089	2.383593	-1.235630

The features_market_values df has the market values and the % of the total market for each position. The market values are in millions.

# Create feature market values df
features_market_values = create_market_values_features(df) # This creates a df with one game per row
features_market_values.head(3)

	gameId	Year	HomeTeam	AwayTeam	defMktValH	attMktValH	gkMktValH	totalMktValH	midMktValH	defMktValA	attMktValA	gkMktValA	totalMktValA	midMktValA	attMktH%	attMktA%	midMktH%	midMktA%	defMktH%	defMktA%	gkMktH%	gkMktA%	totalMktH%	totalMktA%
0	1	2005	West Ham	Blackburn	16.90	18.50	6.40	46.40	4.60	27.25	13.00	3.25	70.70	27.20	2.252911	1.583126	0.588168	3.477861	2.486940	4.010007	4.524247	2.297469	1.913986	2.916354
1	2	2005	Aston Villa	Bolton	27.63	31.85	7.60	105.83	38.75	9.60	24.55	8.50	72.40	29.75	3.878659	2.989673	4.954673	3.803910	4.065926	1.412700	5.372543	6.008766	4.365456	2.986478
2	3	2005	Everton	Man United	44.35	31.38	8.55	109.78	25.50	82.63	114.60	9.25	288.48	82.00	3.821423	13.955867	3.260494	10.484727	6.526378	12.159517	6.044111	6.538951	4.528392	11.899714

all_games_cols = ['Date', 'gameId', 'team', 'season', 'homeGame', 'homeWinPc38', 'homeWinPc5', 'awayWinPc38', 'awayWinPc5', 'eloFor', 'eloAgainst', 'wtEloGoalsFor', 'wtEloGoalsAgainst']

# Join the features together
features_multi_line = (features_all_games[all_games_cols]
                                         .pipe(pd.merge, features_stats.drop(columns='homeGame'), on=['gameId', 'team'])
                                         .pipe(pd.merge, features_odds, on=['gameId', 'team']))

# Put each instance on an individual row
features_with_na = put_features_on_one_line(features_multi_line)

market_val_feature_names = ['attMktH%', 'attMktA%', 'midMktH%', 'midMktA%', 'defMktH%', 'defMktA%', 'gkMktH%', 'gkMktA%', 'totalMktH%', 'totalMktA%']

# Merge our team values dataframe to features and result from df
features_with_na = (features_with_na.pipe(pd.merge, (features_market_values[market_val_feature_names + ['gameId']])
                                                      .rename(columns={col: 'f_' + col for col in market_val_feature_names}), on='gameId')
                            .pipe(pd.merge, df[['HomeTeam', 'AwayTeam', 'gameId', 'result', 'B365A', 'B365D', 'B365H']], on=['HomeTeam', 'AwayTeam', 'gameId']))

# Drop NAs from calculating the rolling averages - don't drop Win Pc 38 and Win Pc 5 columns
features = features_with_na.dropna(subset=features_with_na.drop(columns=[col for col in features_with_na.columns if 'WinPc' in col]).columns)

# Fill NAs for the Win Pc columns
features = features.fillna(features.mean(numeric_only=True))

features.head(3)

	Date	gameId	HomeTeam	season	homeGame	f_homeWinPc38Home	f_homeWinPc5Home	f_awayWinPc38Home	f_awayWinPc5Home	f_eloForHome	f_eloAgainstHome	f_wtEloGoalsForHome	f_wtEloGoalsAgainstHome	f_cornersAgainstHome	f_cornersForHome	f_freesAgainstHome	f_freesForHome	f_goalsAgainstHome	f_goalsForHome	f_halfTimeGoalsAgainstHome	f_halfTimeGoalsForHome	f_redsAgainstHome	f_redsForHome	f_shotsAgainstHome	f_shotsForHome	f_shotsOnTargetAgainstHome	f_shotsOnTargetForHome	f_yellowsAgainstHome	f_yellowsForHome	f_avAsianHandicapOddsAgainstHome	f_avAsianHandicapOddsForHome	f_avgreaterthan2.5Home	f_avlessthan2.5Home	f_sizeOfHandicapHome	AwayTeam	f_homeWinPc38Away	f_homeWinPc5Away	f_awayWinPc38Away	f_awayWinPc5Away	f_eloForAway	f_eloAgainstAway	f_wtEloGoalsForAway	f_wtEloGoalsAgainstAway	f_cornersAgainstAway	f_cornersForAway	f_freesAgainstAway	f_freesForAway	f_goalsAgainstAway	f_goalsForAway	f_halfTimeGoalsAgainstAway	f_halfTimeGoalsForAway	f_redsForAway	f_shotsAgainstAway	f_shotsForAway	f_shotsOnTargetAgainstAway	f_shotsOnTargetForAway	f_yellowsAgainstAway	f_yellowsForAway	f_avAsianHandicapOddsAgainstAway	f_avAsianHandicapOddsForAway	f_avgreaterthan2.5Away	f_avlessthan2.5Away	f_sizeOfHandicapAway	attMktH%	attMktA%	midMktH%	midMktA%	defMktH%	defMktA%	gkMktH%	gkMktA%	totalMktH%	totalMktA%	result	B365A	B365D	B365H
20	2005-08-23	21	Birmingham	0506	1	0.394737	0.4	0.263158	0.2	1478.687038	1492.866048	1.061763	1.260223	4.981818	7.527273	12.000000	9.945455	1.018182	0.509091	0.509091	0.509091	0.000000	0.000000	11.945455	8.018182	6.490909	2.981818	1.000000	2.509091	1.9090	1.9455	2.0510	1.6735	-0.1375	Middlesbrough	0.394737	0.4	0.263158	0.2	1492.866048	1478.687038	1.12994	1.279873	2.545455	5.509091	13.545455	13.436364	1.018182	0.000000	0.000000	0.000000	0.490909	17.018182	8.072727	7.509091	2.509091	3.0	1.490909	1.9395	1.9095	2.0035	1.7155	0.3875	5.132983	5.260851	3.341048	4.289788	3.502318	4.168935	2.332815	3.216457	3.934396	4.522205	away	2.75	3.2	2.50
21	2005-08-23	22	Portsmouth	0506	1	0.447368	0.4	0.263158	0.4	1405.968416	1489.229314	1.147101	1.503051	2.509091	4.963636	21.981818	16.054545	2.000000	0.509091	1.000000	0.000000	0.000000	0.000000	8.454545	10.490909	3.963636	4.454545	3.018182	1.527273	1.8965	1.9690	2.0040	1.7005	0.2500	Aston Villa	0.447368	0.4	0.263158	0.4	1489.229314	1405.968416	1.17516	1.263229	9.527273	7.000000	14.472727	17.563636	1.490909	0.981818	0.981818	0.981818	0.000000	15.545455	3.000000	9.054545	2.509091	2.0	0.509091	1.8565	1.9770	1.8505	1.8485	0.7125	3.738614	3.878659	4.494368	4.954673	2.884262	4.065926	3.746642	5.372543	3.743410	4.365456	draw	2.75	3.2	2.50
22	2005-08-23	23	Sunderland	0506	1	0.236842	0.0	0.236842	0.4	1277.888970	1552.291880	0.650176	1.543716	5.000000	5.000000	12.418182	17.545455	1.981818	0.490909	1.000000	0.490909	0.490909	0.509091	14.509091	6.909091	5.018182	3.927273	1.018182	2.509091	1.8520	1.9915	1.8535	1.8500	0.7125	Man City	0.236842	0.0	0.236842	0.4	1552.291880	1277.888970	1.28875	1.287367	7.527273	3.509091	8.963636	12.490909	0.509091	1.018182	0.509091	0.509091	0.000000	10.963636	11.945455	2.490909	6.981818	3.0	1.490909	1.8150	2.0395	2.0060	1.7095	-0.2000	0.706318	3.750792	1.476812	1.070209	2.634096	4.455890	0.777605	4.913050	1.499427	3.151477	away	2.50	3.2	2.75

We now have a features DataFrame ready, with all the feature columns beginning with "f_". In the next section, we will walk through the modelling process to try and find the best type of model to use.

03. Model Building & Hyperparameter Tuning

Welcome to the third part of this Machine Learning Walkthrough. This tutorial will focus on the model building process, including how to tune hyperparameters. In the next tutorial, we will create weekly predictions based on the model we have created here.

Specifically, this tutorial will cover a few things:

Choosing which Machine Learning algorithm to use from a variety of choices
Hyperparameter Tuning
Overfitting/Underfitting

Choosing an Algorithm

The best way to decide which algorithm to use is to try them all! To do this, we will define a function which we first used in our AFL Predictions tutorial. It iterates over a number of algorithms and gives us a good indication of which are suited to this dataset and exercise.

Let's first grab the features we created in the last tutorial. This may take a minute or two to run.

## Import libraries
from data_preparation_functions import *
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn import linear_model, tree, discriminant_analysis, naive_bayes, ensemble, gaussian_process
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.metrics import log_loss, confusion_matrix
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100)

features = create_feature_df()
    Creating all games feature DataFrame
    Creating stats feature DataFrame
    Creating odds feature DataFrame
    Creating market values feature DataFrame
    Filling NAs
    Merging stats, odds and market values into one features DataFrame
    Complete.

To start our modelling process, we need to make a training set, a holdout set and a test set. As we are using cross validation, we will cross-validate on all of the seasons before 2017/18 and use the 2017/18 season as the test set. For the error analysis later on, we will also train on the seasons before 2016/17 and hold out the 2016/17 season.

feature_list = [col for col in features.columns if col.startswith("f_")]
betting_features = []

le = LabelEncoder() # Initiate a label encoder to transform the labels 'away', 'draw', 'home' to 0, 1, 2

# Grab all seasons except for 17/18 to use CV with
all_x = features.loc[features.season != '1718', ['gameId'] + feature_list]
all_y = features.loc[features.season != '1718', 'result']
all_y = le.fit_transform(all_y)

# Create our training vector as the seasons except 16/17 and 17/18
train_x = features.loc[~features.season.isin(['1617', '1718']), ['gameId'] + feature_list]
train_y = le.transform(features.loc[~features.season.isin(['1617', '1718']), 'result'])

# Create our holdout vectors as the 16/17 season
holdout_x = features.loc[features.season == '1617', ['gameId'] + feature_list]
holdout_y = le.transform(features.loc[features.season == '1617', 'result'])

# Create our test vectors as the 17/18 season
test_x = features.loc[features.season == '1718', ['gameId'] + feature_list]
test_y = le.transform(features.loc[features.season == '1718', 'result'])

# Create a list of standard classifiers
classifiers = [

    #GLM
    linear_model.LogisticRegressionCV(),

    #Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),

    #Discriminant Analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),

    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),

    #Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),

    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    # xgb.XGBClassifier()
]

def find_best_algorithms(classifier_list, X, y):
    # This function is adapted from https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling
    # Cross validate model with Kfold stratified cross validation
    kfold = StratifiedKFold(n_splits=5)

    # Grab the cross validation scores for each algorithm
    cv_results = [cross_val_score(classifier, X, y, scoring = "neg_log_loss", cv = kfold) for classifier in classifier_list]
    cv_means = [cv_result.mean() * -1 for cv_result in cv_results]
    cv_std = [cv_result.std() for cv_result in cv_results]
    algorithm_names = [alg.__class__.__name__ for alg in classifier_list]

    # Create a DataFrame of all the CV results
    cv_results = pd.DataFrame({
        "Mean Log Loss": cv_means,
        "Log Loss Std": cv_std,
        "Algorithm": algorithm_names
    }).sort_values(by='Mean Log Loss')
    return cv_results

algorithm_results = find_best_algorithms(classifiers, all_x, all_y)

algorithm_results

	Mean Log Loss	Log Loss Std	Algorithm
0	0.966540	0.020347	LogisticRegressionCV
3	0.986679	0.015601	LinearDiscriminantAnalysis
1	1.015197	0.017466	BernoulliNB
10	1.098612	0.000000	GaussianProcessClassifier
5	1.101281	0.044383	AdaBoostClassifier
8	1.137778	0.153391	GradientBoostingClassifier
7	2.093981	0.284831	ExtraTreesClassifier
9	2.095088	0.130367	RandomForestClassifier
6	2.120571	0.503132	BaggingClassifier
4	4.065796	1.370119	QuadraticDiscriminantAnalysis
2	5.284171	0.826991	GaussianNB

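We can see that LogisticRegressionCV performs the best out of all the algorithms, while some algorithms have a very high log loss. This is most likely due to overfitting. Note also that GaussianProcessClassifier scores exactly ln(3) ≈ 1.0986 with zero deviation, which is what predicting a 1/3 chance for every outcome would give. It would definitely be useful to condense our features down to reduce the dimensionality of the dataset.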

Hyperparameter Tuning

For now, however, we will use logistic regression. Let's first try and tune a logistic regression model with cross validation. To do this, we will use grid search. Grid search essentially tries out each combination of values and finds the model with the lowest error metric, which in our case is log loss. 'C' in logistic regression determines the amount of regularization. Lower values increase regularization.
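As a quick illustration of how C controls regularization, the sketch below fits the same model with three values of C on a synthetic dataset (generated with make_classification, so unrelated to the EPL features) and compares the size of the fitted coefficients. The dataset settings and max_iter are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class problem standing in for away/draw/home
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

coef_norms = {}
for C in [0.0001, 0.01, 1]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # A smaller norm means the coefficients were shrunk harder towards zero
    coef_norms[C] = np.linalg.norm(model.coef_)

for C, norm in coef_norms.items():
    print("C={}: coefficient norm {:.4f}".format(C, norm))
```

Smaller values of C pull the coefficients towards zero, which trades some fit on the training data for less sensitivity to noisy features.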

# Define our parameters to run a grid search over
lr_grid = {
    "C": [0.0001, 0.01, 0.05, 0.2, 1],
    "solver": ["newton-cg", "lbfgs", "liblinear"]
}

kfold = StratifiedKFold(n_splits=5)

gs = GridSearchCV(LogisticRegression(), param_grid=lr_grid, cv=kfold, scoring='neg_log_loss')
gs.fit(all_x, all_y)
print("Best log loss: {}".format(gs.best_score_ *-1))
best_lr_params = gs.best_params_

  Best log loss: 0.9669551970849734

Defining a Baseline

We should also define a baseline, as we don't really know if our log loss is good or bad. Assigning a 1/3 chance to each outcome yields a log loss of ln(3) ≈ 1.0986. However, what we are really interested in is how our model performs relative to the odds. So let's find the log loss of the odds.
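Before turning to the odds, the uniform baseline can be checked directly. This is a small sketch with made-up labels; the value does not depend on which labels are used, because every outcome is given the same probability.

```python
import numpy as np
from sklearn.metrics import log_loss

# Any set of encoded results will do (0 = away, 1 = draw, 2 = home)
y_true = np.array([0, 1, 2, 2, 0, 1])

# Predict a 1/3 chance for each of the three outcomes in every game
uniform_probs = np.full((len(y_true), 3), 1 / 3)

baseline = log_loss(y_true, uniform_probs, labels=[0, 1, 2])
print(baseline, np.log(3))
```

Any model scoring above this value is doing worse than not looking at the data at all.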

# Finding the log loss of the odds
log_loss(all_y, 1 / all_x[['f_awayOdds', 'f_drawOdds', 'f_homeOdds']])

  0.9590114943474463

This is good news: with a cross-validated log loss of 0.967 against 0.959 for the odds, our algorithm almost matches the bookies. It would be great if we could beat this result.
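One caveat on the odds calculation: the inverses of the three prices generally do not sum to exactly 1. Older scikit-learn releases renormalised such rows inside log_loss; as far as we are aware, newer releases warn about them instead, so normalising explicitly is the safer habit. The sketch below shows one simple way to do it (proportional scaling), using a few prices in the same column layout as our features.

```python
import numpy as np
import pandas as pd

# A few example prices, ordered away, draw, home to match the label encoding
odds = pd.DataFrame({
    "f_awayOdds": [1.48, 3.50, 8.20],
    "f_drawOdds": [5.10, 3.50, 4.40],
    "f_homeOdds": [7.80, 2.36, 1.54],
})

implied = 1 / odds                      # raw inverse prices
book_sum = implied.sum(axis=1)          # usually close to, but not exactly, 1
probs = implied.div(book_sum, axis=0)   # scale each row so it sums to 1

print(probs.round(4))
```

The resulting probs frame can be passed to log_loss in place of the raw inverse odds.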

Analysing the Errors Made

Now that we have a logistic regression model tuned, let's see what type of errors it made. To do this we will look at the confusion matrix produced when we predict our holdout set.

lr = LogisticRegression(**best_lr_params) # Instantiate the model
lr.fit(train_x, train_y) # Fit our model
lr_predict = lr.predict(holdout_x) # Predict the holdout values

# Create a confusion matrix
c_matrix = (pd.DataFrame(confusion_matrix(holdout_y, lr_predict), columns=le.classes_, index=le.classes_)
 .rename_axis('Actual')
 .rename_axis('Predicted', axis='columns'))

c_matrix

Predicted	away	draw	home
Actual
away	77	0	32
draw	26	3	55
home	33	7	147

As we can see, when the actual result was 'away', we correctly predicted 77 / 109 results, a hit rate of 70.6%. However, when we look at our draw hit rate, we only predicted 3 / 84 correctly, a hit rate of around 3.6%. For a more in depth analysis of our predictions, please skip to the Analysing Predictions & Staking Strategies section of the tutorial.
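The hit rates can be read straight off the confusion matrix. The sketch below rebuilds the matrix from the values printed above and computes the hit rate per actual result (recall), the hit rate per predicted result (precision) and the overall accuracy.

```python
import numpy as np
import pandas as pd

labels = ["away", "draw", "home"]

# Values copied from the confusion matrix above (rows = actual, columns = predicted)
c_matrix = pd.DataFrame(
    [[77, 0, 32],
     [26, 3, 55],
     [33, 7, 147]],
    index=pd.Index(labels, name="Actual"),
    columns=pd.Index(labels, name="Predicted"),
)

correct = pd.Series(np.diag(c_matrix.to_numpy()), index=labels)

recall = correct / c_matrix.sum(axis=1)     # share of each actual result we got right
precision = correct / c_matrix.sum(axis=0)  # share of each prediction that was right
accuracy = correct.sum() / c_matrix.to_numpy().sum()

print(pd.DataFrame({"recall": recall, "precision": precision}).round(3))
print("accuracy: {:.3f}".format(accuracy))
```

Splitting the hit rate this way makes it clear that the model rarely predicts a draw at all, rather than predicting draws often and getting them wrong.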

Before we move on, however, let's use our model to predict the 17/18 season and compare how we went with the odds.

# Get test predictions

test_lr = LogisticRegression(**best_lr_params)
test_lr.fit(all_x, all_y)
test_predictions_probs = test_lr.predict_proba(test_x)
test_predictions = test_lr.predict(test_x)

test_ll = log_loss(test_y, test_predictions_probs)
test_accuracy = (test_predictions == test_y).mean()

print("Our predictions for the 2017/18 season have a log loss of: {0:.5f} and an accuracy of: {1:.2f}".format(test_ll, test_accuracy))

Our predictions for the 2017/18 season have a log loss of: 0.95767 and an accuracy of: 0.56

# Get accuracy and log loss based on the odds
odds_ll = log_loss(test_y, 1 / test_x[['f_awayOdds', 'f_drawOdds', 'f_homeOdds']])

odds_predictions = test_x[['f_awayOdds', 'f_drawOdds', 'f_homeOdds']].apply(lambda row: row.idxmin()[2:6], axis=1).values
odds_accuracy = (odds_predictions == le.inverse_transform(test_y)).mean()

print("Odds predictions for the 2017/18 season have a log loss of: {0:.5f} and an accuracy of: {1:.3f}".format(odds_ll, odds_accuracy))

Odds predictions for the 2017/18 season have a log loss of: 0.94635 and an accuracy of: 0.545

Results

There we have it! The odds predicted 54.5% of EPL games correctly in the 2017/18 season, whilst our model predicted around 56% correctly, although the odds still had the lower log loss (0.946 against 0.958). This is a decent result for the first iteration of our model. In future iterations, we could wait a certain number of matches each season and calculate EMAs on only those first n games. This may help with the issue of players switching clubs and teams becoming relatively stronger/weaker compared to previous seasons.
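One possible way to implement that idea is to restart the EMA at the beginning of each season by grouping on both team and season. This is only a sketch on toy data; the column names and the span of 3 are hypothetical, and it has not been tested against the model above.

```python
import pandas as pd

# Toy data: one team, two seasons, three games each (illustrative only)
games = pd.DataFrame({
    "team": ["A"] * 6,
    "season": ["1617"] * 3 + ["1718"] * 3,
    "goalsFor": [2.0, 0.0, 1.0, 3.0, 1.0, 2.0],
})

# Grouping by season as well as team means no history carries over the summer
games["goalsForEma"] = (
    games.groupby(["team", "season"])["goalsFor"]
    .transform(lambda s: s.shift(1).ewm(span=3).mean())
)

print(games)
```

The first game of every season has no feature value under this scheme, so those games would need to be dropped or filled before modelling.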

04. Weekly Predictions

Welcome to the final part of this Machine Learning Walkthrough. This tutorial will walk through creating weekly EPL predictions from the basic logistic regression model we built in the previous tutorial. We will then analyse our predictions and create staking strategies in the next tutorial.

Specifically, this tutorial will cover a few things:

Obtaining Weekly Odds / Game Info Using Betfair's API
Data Wrangling This Week's Game Info Into Our Feature Set

Obtaining Weekly Odds / Game Info Using Betfair's API

The first thing we need to do to create weekly predictions is get both the games being played this week, as well as match odds from Betfair to be used as features.

To make this process easier, I have created a csv file with the fixture for the 2018/19 season. Let's load that now.

## Import libraries
import pandas as pd
from weekly_prediction_functions import *
from data_preparation_functions import *
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import log_loss, confusion_matrix
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100)

fixture = (pd.read_csv('data/fixture.csv')
              .assign(Date=lambda df: pd.to_datetime(df.Date)))

fixture.head()

	Date	Time (AEST)	HomeTeam	AwayTeam	Venue	TV	Year	round	season
0	2018-08-11	5:00 AM	Man United	Leicester	Old Trafford, Manchester	Optus, Fox Sports (delay)	2018	1	1819
1	2018-08-11	9:30 PM	Newcastle	Tottenham	St.James’ Park, Newcastle	Optus, SBS	2018	1	1819
2	2018-08-12	12:00 AM	Bournemouth	Cardiff	Vitality Stadium, Bournemouth	Optus	2018	1	1819
3	2018-08-12	12:00 AM	Fulham	Crystal Palace	Craven Cottage, London	Optus	2018	1	1819
4	2018-08-12	12:00 AM	Huddersfield	Chelsea	John Smith’s Stadium, Huddersfield	Optus, Fox Sports (delay)	2018	1	1819

Now we are going to connect to the API and retrieve game level information for the next week. To do this, we will use an R script. If you are not familiar with R, don't worry, it is relatively simple to read through. For this, we will run the script weekly_game_info_puller.R. Go ahead and run that script now.

Note that for this step, you will require a Betfair API App Key. If you don't have one, visit this page and follow the instructions.

I will upload an updated weekly file, so you can follow along whether or not you have an App Key. Let's load that file in now.

game_info = create_game_info_df("data/weekly_game_info.csv")

game_info.head(3)

	AwayTeam	HomeTeam	awaySelectionId	drawSelectionId	homeSelectionId	draw	marketId	marketStartTime	totalMatched	eventId	eventName	homeOdds	drawOdds	awayOdds	competitionId	Date	localMarketStartTime
0	Arsenal	Cardiff	1096	58805	79343	The Draw	1.146897152	2018-09-02 12:30:00	30123.595116	28852020	Cardiff v Arsenal	7.00	4.3	1.62	10932509	2018-09-02	Sun September 2, 10:30PM
1	Bournemouth	Chelsea	1141	58805	55190	The Draw	1.146875421	2018-09-01 14:00:00	30821.329656	28851426	Chelsea v Bournemouth	1.32	6.8	12.00	10932509	2018-09-01	Sun September 2, 12:00AM
2	Fulham	Brighton	56764	58805	18567	The Draw	1.146875746	2018-09-01 14:00:00	16594.833096	28851429	Brighton v Fulham	2.36	3.5	3.50	10932509	2018-09-01	Sun September 2, 12:00AM

Finally, we will use the API to grab the weekly odds. This R script is also provided, but I have also included the weekly odds csv for convenience.

odds = (pd.read_csv('data/weekly_epl_odds.csv')
           .replace({
                'Man Utd': 'Man United',
                'C Palace': 'Crystal Palace'}))

odds.head(3)

	HomeTeam	AwayTeam	f_homeOdds	f_drawOdds	f_awayOdds
0	Leicester	Liverpool	7.80	5.1	1.48
1	Brighton	Fulham	2.36	3.5	3.50
2	Everton	Huddersfield	1.54	4.4	8.20

Data Wrangling This Week's Game Info Into Our Feature Set

Now we have the arduous task of wrangling all of this info into a feature set that we can use to predict this week's games. Luckily, the functions we created earlier should work if we just append the non-features to our main DataFrame.

df = create_df('data/epl_data.csv')

df.head()

	AC	AF	AR	AS	AST	AY	AwayTeam	B365A	B365D	B365H	BWA	BWD	BWH	Bb1X2	BbAH	BbAHh	BbAv<2.5	BbAv>2.5	BbAvA	BbAvAHA	BbAvAHH	BbAvD	BbAvH	BbMx<2.5	BbMx>2.5	BbMxA	BbMxAHA	BbMxAHH	BbMxD	BbMxH	BbOU	Date	Day	Div	FTAG	FTHG	FTR	HC	HF	HS	HST	HTAG	HTHG	HTR	HY	HomeTeam	IWA	IWD	IWH	LBA	LBD	LBH	Month	Referee	VCA	VCD	VCH	Year	season	gameId	homeWin	awayWin	result
0	6.0	14.0	1.0	11.0	5.0	1.0	Blackburn	2.75	3.20	2.50	2.90	3.30	2.20	55.0	20.0	0.00	1.71	2.02	2.74	2.04	1.82	3.16	2.40	1.80	2.25	2.90	2.08	1.86	3.35	2.60	35.0	2005-08-13	13	E0	1.0	3.0	H	2.0	11.0	13.0	5.0	1.0	0.0	A	0.0	West Ham	2.7	3.0	2.3	2.75	3.00	2.38	8	A Wiley	2.75	3.25	2.40	2005	0506	1	1	0	home
1	8.0	16.0	0.0	13.0	6.0	2.0	Bolton	3.00	3.25	2.30	3.15	3.25	2.10	56.0	22.0	-0.25	1.70	2.01	3.05	1.84	2.01	3.16	2.20	1.87	2.20	3.40	1.92	2.10	3.30	2.40	36.0	2005-08-13	13	E0	2.0	2.0	D	7.0	14.0	3.0	2.0	2.0	2.0	D	0.0	Aston Villa	3.1	3.0	2.1	3.20	3.00	2.10	8	M Riley	3.10	3.25	2.20	2005	0506	2	0	0	draw
2	6.0	14.0	0.0	12.0	5.0	1.0	Man United	1.72	3.40	5.00	1.75	3.35	4.35	56.0	23.0	0.75	1.79	1.93	1.69	1.86	2.00	3.36	4.69	1.87	2.10	1.80	1.93	2.05	3.70	5.65	36.0	2005-08-13	13	E0	2.0	0.0	A	8.0	15.0	10.0	5.0	1.0	0.0	A	3.0	Everton	1.8	3.1	3.8	1.83	3.20	3.75	8	G Poll	1.80	3.30	4.50	2005	0506	3	0	1	away
3	6.0	13.0	0.0	7.0	4.0	2.0	Birmingham	2.87	3.25	2.37	2.80	3.20	2.30	56.0	21.0	0.00	1.69	2.04	2.87	2.05	1.81	3.16	2.31	1.77	2.24	3.05	2.11	1.85	3.30	2.60	36.0	2005-08-13	13	E0	0.0	0.0	D	6.0	12.0	15.0	7.0	0.0	0.0	D	1.0	Fulham	2.9	3.0	2.2	2.88	3.00	2.25	8	R Styles	2.80	3.25	2.35	2005	0506	4	0	0	draw
4	6.0	11.0	0.0	13.0	3.0	3.0	West Brom	5.00	3.40	1.72	4.80	3.45	1.65	55.0	23.0	-0.75	1.77	1.94	4.79	1.76	2.10	3.38	1.69	1.90	2.10	5.60	1.83	2.19	3.63	1.80	36.0	2005-08-13	13	E0	0.0	0.0	D	3.0	13.0	15.0	8.0	0.0	0.0	D	2.0	Man City	4.2	3.2	1.7	4.50	3.25	1.67	8	C Foy	5.00	3.25	1.75	2005	0506	5	0	0	draw

Now we need to specify which game week we would like to predict. We will then filter the fixture for this game week and append this info to the main DataFrame.

round_to_predict = int(input("Which game week would you like to predict? Please input next week's Game Week\n"))

Which game week would you like to predict? Please input next week's Game Week
4

future_predictions = (fixture.loc[fixture['round'] == round_to_predict, ['Date', 'HomeTeam', 'AwayTeam', 'season']]
                             .pipe(pd.merge, odds, on=['HomeTeam', 'AwayTeam'])
                             .rename(columns={
                                 'f_homeOdds': 'B365H',
                                 'f_awayOdds': 'B365A',
                                 'f_drawOdds': 'B365D'})
                             .assign(season=lambda df: df.season.astype(str)))

df_including_future_games = (pd.read_csv('data/epl_data.csv', dtype={'season': str})
                .assign(Date=lambda df: pd.to_datetime(df.Date))
                .pipe(lambda df: df.dropna(thresh=len(df) - 2, axis=1))  # Drop cols with NAs
                .dropna(axis=0)  # Drop rows with NAs
                .sort_values('Date')
                .pipe(lambda df: pd.concat([df, future_predictions], sort=True))
                .reset_index(drop=True)
                .assign(gameId=lambda df: list(df.index + 1),
                            Year=lambda df: df.Date.apply(lambda row: row.year),
                            homeWin=lambda df: df.apply(lambda row: 1 if row.FTHG > row.FTAG else 0, axis=1),
                            awayWin=lambda df: df.apply(lambda row: 1 if row.FTAG > row.FTHG else 0, axis=1),
                            result=lambda df: df.apply(lambda row: 'home' if row.FTHG > row.FTAG else ('draw' if row.FTHG == row.FTAG else 'away'), axis=1)))

df_including_future_games.tail(12)

	AC	AF	AR	AS	AST	AY	AwayTeam	B365A	B365D	B365H	BWA	BWD	BWH	Bb1X2	BbAH	BbAHh	BbAv<2.5	BbAv>2.5	BbAvA	BbAvAHA	BbAvAHH	BbAvD	BbAvH	BbMx<2.5	BbMx>2.5	BbMxA	BbMxAHA	BbMxAHH	BbMxD	BbMxH	BbOU	Date	Day	Div	FTAG	FTHG	FTR	HC	HF	HR	HS	HST	HTAG	HTHG	HTR	HY	HomeTeam	IWA	IWD	IWH	LBA	LBD	LBH	Month	Referee	VCA	VCD	VCH	Year	season	gameId	homeWin	awayWin	result
4952	4.0	8.0	0.0	12.0	2.0	1.0	Burnley	4.33	3.40	2.00	4.0	3.3	2.00	39.0	20.0	-0.25	1.65	2.22	4.14	2.22	1.69	3.36	1.98	1.72	2.31	4.5	2.32	1.74	3.57	2.04	36.0	2018-08-26	26.0	E0	2.0	4.0	H	6.0	11.0	0.0	25.0	12.0	2.0	3.0	H	2.0	Fulham	4.10	3.35	1.97	3.90	3.2	2.00	8.0	D Coote	4.33	3.4	2.0	2018	1819	4953	1	0	home
4953	2.0	16.0	0.0	9.0	5.0	4.0	Tottenham	2.90	3.30	2.62	2.9	3.2	2.55	42.0	20.0	-0.25	1.79	2.03	2.86	1.72	2.18	3.27	2.56	1.84	2.10	3.0	1.76	2.25	3.40	2.67	40.0	2018-08-27	27.0	E0	3.0	0.0	A	5.0	11.0	0.0	23.0	5.0	0.0	0.0	D	2.0	Man United	2.75	3.25	2.60	2.75	3.2	2.55	8.0	C Pawson	2.90	3.3	2.6	2018	1819	4954	0	1	away
4954	NaN	NaN	NaN	NaN	NaN	NaN	Liverpool	1.48	5.10	7.80	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018-09-01	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Leicester	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018	1819	4955	0	0	away
4955	NaN	NaN	NaN	NaN	NaN	NaN	Fulham	3.50	3.50	2.36	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018-09-02	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Brighton	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018	1819	4956	0	0	away
4956	NaN	NaN	NaN	NaN	NaN	NaN	Man United	1.70	3.90	6.60	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018-09-02	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Burnley	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018	1819	4957	0	0	away
4957	NaN	NaN	NaN	NaN	NaN	NaN	Bournemouth	12.00	6.80	1.32	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018-09-02	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Chelsea	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018	1819	4958	0	0	away
4958	NaN	NaN	NaN	NaN	NaN	NaN	Southampton	4.50	3.55	2.04	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018-09-02	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Crystal Palace	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018	1819	4959	0	0	away
4959	NaN	NaN	NaN	NaN	NaN	NaN	Huddersfield	8.20	4.40	1.54	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018-09-02	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Everton	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018	1819	4960	0	0	away
4960	NaN	NaN	NaN	NaN	NaN	NaN	Wolves	2.98	3.50	2.62	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018-09-02	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	West Ham	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018	1819	4961	0	0	away
4961	NaN	NaN	NaN	NaN	NaN	NaN	Newcastle	32.00	12.50	1.12	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018-09-02	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Man City	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018	1819	4962	0	0	away
4962	NaN	NaN	NaN	NaN	NaN	NaN	Arsenal	1.62	4.30	7.00	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018-09-02	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Cardiff	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018	1819	4963	0	0	away
4963	NaN	NaN	NaN	NaN	NaN	NaN	Tottenham	1.68	4.30	5.90	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018-09-03	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Watford	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2018	1819	4964	0	0	away

As we can see, we have appended the game information to our main DataFrame. The rest of the info is left as NAs, but this will be filled in when we create our rolling average features. This is a 'hacky' way to complete this task, but it works well as we can reuse the functions that we created in the previous tutorials on this DataFrame. We now need to add the odds from our odds DataFrame, then we can just run our create features functions as usual.
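One detail worth noticing in the table above is that the future games show homeWin = 0, awayWin = 0 and result = 'away'. These are placeholders rather than predictions: every comparison against a missing goal count is False, so the row falls through to the last branch. The sketch below reproduces this on a toy frame and shows one way to blank the placeholder out. The result_masked column is a hypothetical addition and is not used elsewhere in the tutorial.

```python
import numpy as np
import pandas as pd

games = pd.DataFrame({
    "FTHG": [4.0, 0.0, np.nan],  # last row is a future game with no score yet
    "FTAG": [2.0, 3.0, np.nan],
})

# Same logic as the result column above: comparisons with NaN are always False,
# so the future game ends up in the final 'away' branch
games["result"] = games.apply(
    lambda row: "home" if row.FTHG > row.FTAG
    else ("draw" if row.FTHG == row.FTAG else "away"),
    axis=1,
)

# Make the placeholder explicit by blanking results for unplayed games
games["result_masked"] = games["result"].where(games["FTHG"].notna())

print(games)
```

This does not affect the predictions below, because the future games are removed from the training set before the model is fitted.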

Predicting Next Gameweek's Results

Now that we have our feature DataFrame, all we need to do is split the feature DataFrame up into a training set and next week's games, then use the model we tuned in the last tutorial to create predictions!

features = create_feature_df(df=df_including_future_games)

    Creating all games feature DataFrame
    Creating stats feature DataFrame
    Creating odds feature DataFrame
    Creating market values feature DataFrame
    Filling NAs
    Merging stats, odds and market values into one features DataFrame
    Complete.

# Create a feature DataFrame for this week's games.
production_df = pd.merge(future_predictions, features, on=['Date', 'HomeTeam', 'AwayTeam', 'season'])

# Create a training DataFrame
training_df = features[~features.gameId.isin(production_df.gameId)]

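# Use every engineered feature column (those prefixed with 'f_')
feature_names = [col for col in training_df if col.startswith('f_')]

# Encode the result labels. LabelEncoder sorts classes alphabetically,
# so with the labels 'away', 'draw' and 'home' we get away=0, draw=1, home=2
le = LabelEncoder()
train_y = le.fit_transform(training_df.result)
train_x = training_df[feature_names]

# Fit the logistic regression we tuned in the last tutorial
lr = LogisticRegression(C=0.01, solver='liblinear')
lr.fit(train_x, train_y)

# Predict class probabilities for the upcoming games and convert them to decimal odds
predicted_probs = lr.predict_proba(production_df[feature_names])
predicted_odds = 1 / predicted_probs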

# Assign the modelled odds to our predictions df
predictions_df = (production_df.loc[:, ['Date', 'HomeTeam', 'AwayTeam', 'B365H', 'B365D', 'B365A']]
                               .assign(homeModelledOdds=[i[2] for i in predicted_odds],
                                       drawModelledOdds=[i[1] for i in predicted_odds],
                                       awayModelledOdds=[i[0] for i in predicted_odds])
                               .rename(columns={
                                   'B365H': 'BetfairHomeOdds',
                                   'B365D': 'BetfairDrawOdds',
                                   'B365A': 'BetfairAwayOdds'}))

predictions_df

	Date	HomeTeam	AwayTeam	BetfairHomeOdds	BetfairDrawOdds	BetfairAwayOdds	homeModelledOdds	drawModelledOdds	awayModelledOdds
0	2018-09-01	Leicester	Liverpool	7.80	5.10	1.48	5.747661	5.249857	1.573478
1	2018-09-02	Brighton	Fulham	2.36	3.50	3.50	2.183193	3.803120	3.584057
2	2018-09-02	Burnley	Man United	6.60	3.90	1.70	5.282620	4.497194	1.699700
3	2018-09-02	Chelsea	Bournemouth	1.32	6.80	12.00	1.308366	6.079068	14.047070
4	2018-09-02	Crystal Palace	Southampton	2.04	3.55	4.50	2.202871	4.213695	3.239122
5	2018-09-02	Everton	Huddersfield	1.54	4.40	8.20	1.641222	3.759249	8.020055
6	2018-09-02	West Ham	Wolves	2.62	3.50	2.98	1.999816	4.000456	4.000279
7	2018-09-02	Man City	Newcastle	1.12	12.50	32.00	1.043103	29.427939	136.231983
8	2018-09-02	Cardiff	Arsenal	7.00	4.30	1.62	6.256929	4.893445	1.572767
9	2018-09-03	Watford	Tottenham	5.90	4.30	1.68	5.643663	4.338926	1.688224

Above are the predictions for this Gameweek's matches.
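The column indexing used when assigning the modelled odds depends on how `LabelEncoder` orders the classes, and the odds themselves are simply the reciprocal of each predicted probability. The sketch below illustrates both points on made-up probabilities; the `probs` array is hypothetical and it assumes the result labels are 'home', 'draw' and 'away'.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# LabelEncoder stores its classes in sorted order, regardless of input order
le = LabelEncoder()
le.fit(['home', 'draw', 'away', 'home'])
class_order = list(le.classes_)
print(class_order)

# Hypothetical probabilities laid out like predict_proba output:
# one row per game, one column per class in the order of le.classes_
probs = np.array([[0.25, 0.25, 0.50],
                  [0.50, 0.30, 0.20]])

# Decimal odds are the reciprocal of the probability
odds = pd.DataFrame(1 / probs,
                    columns=[f'{c}ModelledOdds' for c in class_order])
print(odds)
```

Naming the columns from `le.classes_` rather than hard-coding positions 0, 1 and 2 is one way to guard against the labels changing; it is shown here as an option, not as the approach this walkthrough takes.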

Disclaimer

Note that whilst models and automated strategies are fun and rewarding to create, we can't promise that your model or betting strategy will be profitable, and we make no representations in relation to the code shared or information on this page. If you're using this code or implementing your own strategies, you do so entirely at your own risk and you are responsible for any winnings/losses incurred. Under no circumstances will Betfair be liable for any loss or damage you suffer.