# EPL Machine Learning Walkthrough

# 03. Model Building & Hyperparameter Tuning

Welcome to the third part of this Machine Learning Walkthrough. This tutorial will focus on the model building process, including how to tune hyperparameters. In the [next tutorial], we will create weekly predictions based on the model we have created here.

Specifically, this tutorial will cover a few things:

- Choosing which Machine Learning algorithm to use from a variety of choices
- Hyperparameter Tuning
- Overfitting/Underfitting

## Choosing an Algorithm

The best way to decide on a specific algorithm is to try them all! To do this, we will define a function which we first used in our AFL Predictions tutorial. It will iterate over a number of algorithms and give us a good indication of which are suited to this dataset and exercise.

Let's first grab the features we created in the last tutorial. This may take a minute or two to run.

```
## Import libraries
from data_preparation_functions import *
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn import linear_model, tree, discriminant_analysis, naive_bayes, ensemble, gaussian_process
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.metrics import log_loss, confusion_matrix
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100)
```

```
features = create_feature_df()
Creating all games feature DataFrame
Creating stats feature DataFrame
Creating odds feature DataFrame
Creating market values feature DataFrame
Filling NAs
Merging stats, odds and market values into one features DataFrame
Complete.
```

To start our modelling process, we need a training set, a holdout set and a test set. As we are using cross validation, we will use every season before 2017/18 as the cross validation set, keep the 2016/17 season aside as a holdout set for error analysis, and use the 2017/18 season as the test set.

```
feature_list = [col for col in features.columns if col.startswith("f_")]
betting_features = []
le = LabelEncoder() # Initiate a label encoder to transform the labels 'away', 'draw', 'home' to 0, 1, 2
# Grab all seasons except for 17/18 to use CV with
all_x = features.loc[features.season != '1718', ['gameId'] + feature_list]
all_y = features.loc[features.season != '1718', 'result']
all_y = le.fit_transform(all_y)
# Create our training vector as the seasons except 16/17 and 17/18
train_x = features.loc[~features.season.isin(['1617', '1718']), ['gameId'] + feature_list]
train_y = le.transform(features.loc[~features.season.isin(['1617', '1718']), 'result'])
# Create our holdout vectors as the 16/17 season
holdout_x = features.loc[features.season == '1617', ['gameId'] + feature_list]
holdout_y = le.transform(features.loc[features.season == '1617', 'result'])
# Create our test vectors as the 17/18 season
test_x = features.loc[features.season == '1718', ['gameId'] + feature_list]
test_y = le.transform(features.loc[features.season == '1718', 'result'])
```

```
# Create a list of standard classifiers
classifiers = [
    # GLM
    linear_model.LogisticRegressionCV(),
    # Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    # Discriminant Analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),
    # Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),
    # Gaussian Processes
    gaussian_process.GaussianProcessClassifier(),
    # xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    # xgb.XGBClassifier()
]
```

```
def find_best_algorithms(classifier_list, X, y):
    # This function is adapted from https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling
    # Cross validate each model with stratified k-fold cross validation
    kfold = StratifiedKFold(n_splits=5)
    # Grab the cross validation scores for each algorithm
    cv_results = [cross_val_score(classifier, X, y, scoring="neg_log_loss", cv=kfold) for classifier in classifier_list]
    cv_means = [cv_result.mean() * -1 for cv_result in cv_results]
    cv_std = [cv_result.std() for cv_result in cv_results]
    algorithm_names = [alg.__class__.__name__ for alg in classifier_list]
    # Create a DataFrame of all the CV results
    cv_results = pd.DataFrame({
        "Mean Log Loss": cv_means,
        "Log Loss Std": cv_std,
        "Algorithm": algorithm_names
    }).sort_values(by='Mean Log Loss')
    return cv_results
```

```
algorithm_results = find_best_algorithms(classifiers, all_x, all_y)
```

```
algorithm_results
```

| | Mean Log Loss | Log Loss Std | Algorithm |
|---|---|---|---|
| 0 | 0.966540 | 0.020347 | LogisticRegressionCV |
| 3 | 0.986679 | 0.015601 | LinearDiscriminantAnalysis |
| 1 | 1.015197 | 0.017466 | BernoulliNB |
| 10 | 1.098612 | 0.000000 | GaussianProcessClassifier |
| 5 | 1.101281 | 0.044383 | AdaBoostClassifier |
| 8 | 1.137778 | 0.153391 | GradientBoostingClassifier |
| 7 | 2.093981 | 0.284831 | ExtraTreesClassifier |
| 9 | 2.095088 | 0.130367 | RandomForestClassifier |
| 6 | 2.120571 | 0.503132 | BaggingClassifier |
| 4 | 4.065796 | 1.370119 | QuadraticDiscriminantAnalysis |
| 2 | 5.284171 | 0.826991 | GaussianNB |

We can see that LogisticRegressionCV performs the best of all the algorithms, while some algorithms have a very high log loss. This is most likely due to overfitting. It would definitely be useful to condense our features to reduce the dimensionality of the dataset.
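One option for condensing the features (not used in this tutorial, just a sketch) is univariate feature selection with scikit-learn's `SelectKBest`, which keeps only the features that score highest against the target. The data below is synthetic, standing in for our feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for our feature matrix: 20 features, only 5 informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=2, random_state=42)

# Keep the 8 features with the strongest ANOVA F-score against the target
selector = SelectKBest(score_func=f_classif, k=8)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)  # (500, 20) -> (500, 8)
```

In our case we would fit the selector on the training seasons only, then apply the same transform to the holdout and test seasons to avoid leakage.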

## Hyperparameter Tuning

For now, however, we will use logistic regression. Let's first try to tune a logistic regression model with cross validation. To do this, we will use grid search. Grid search tries every combination of the parameter values we specify and finds the model with the best error metric, which in our case is log loss. 'C' in logistic regression is the inverse of regularization strength: lower values mean stronger regularization.

```
# Define our parameters to run a grid search over
lr_grid = {
    "C": [0.0001, 0.01, 0.05, 0.2, 1],
    "solver": ["newton-cg", "lbfgs", "liblinear"]
}
kfold = StratifiedKFold(n_splits=5)
gs = GridSearchCV(LogisticRegression(), param_grid=lr_grid, cv=kfold, scoring='neg_log_loss')
gs.fit(all_x, all_y)
print("Best log loss: {}".format(gs.best_score_ * -1))
best_lr_params = gs.best_params_
Best log loss: 0.9669551970849734
```

## Defining a Baseline

We should also define a baseline, as we don't really know if our log loss is good or bad. Randomly assigning a ⅓ chance to each outcome yields a log loss of ln(3) ≈ 1.099. However, what we are really interested in is how our model performs relative to the odds. So let's find the log loss of the odds.

```
# Finding the log loss of the odds
log_loss(all_y, 1 / all_x[['f_awayOdds', 'f_drawOdds', 'f_homeOdds']])
0.9590114943474463
```

This is good news: our algorithm's log loss (0.967) is almost as low as the bookies' (0.959). It would be great if we could beat this result.
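As a sanity check, the ln(3) random-guess baseline mentioned above is easy to verify directly; no model is involved:

```python
import numpy as np

# Log loss of always predicting a uniform 1/3 chance for each of the three
# outcomes: -mean(log(1/3)) = ln(3), regardless of the actual results
n_games = 1000
uniform_probs = np.full((n_games, 3), 1 / 3)
labels = np.random.randint(0, 3, size=n_games)  # arbitrary results

log_losses = -np.log(uniform_probs[np.arange(n_games), labels])
print(round(log_losses.mean(), 4))  # 1.0986
```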

## Analysing the Errors Made

Now that we have a logistic regression model tuned, let's see what type of errors it made. To do this we will look at the confusion matrix produced when we predict our holdout set.

```
lr = LogisticRegression(**best_lr_params) # Instantiate the model
lr.fit(train_x, train_y) # Fit our model
lr_predict = lr.predict(holdout_x) # Predict the holdout values
```

```
# Create a confusion matrix
c_matrix = (pd.DataFrame(confusion_matrix(holdout_y, lr_predict), columns=le.classes_, index=le.classes_)
.rename_axis('Actual')
.rename_axis('Predicted', axis='columns'))
c_matrix
```

| Actual \ Predicted | away | draw | home |
|---|---|---|---|
| away | 77 | 0 | 32 |
| draw | 26 | 3 | 55 |
| home | 33 | 7 | 147 |

As we can see, of the 109 games that actually finished as away wins, we predicted 77 correctly, a hit rate of around 70.6%. However, of the 84 draws, we only predicted 3 correctly, a hit rate of around 3.6%. For a more in-depth analysis of our predictions, please skip to the Analysing Predictions & Staking Strategies section of the tutorial.
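These hit rates are just the diagonal of the confusion matrix divided by its row sums (per-class recall). A quick check, re-entering the matrix above:

```python
import numpy as np
import pandas as pd

# Confusion matrix from the holdout set (rows = actual, columns = predicted)
c_matrix = pd.DataFrame(
    [[77, 0, 32],
     [26, 3, 55],
     [33, 7, 147]],
    index=['away', 'draw', 'home'],
    columns=['away', 'draw', 'home'])

# Per-class hit rate: correct predictions over actual occurrences
hit_rates = np.diag(c_matrix) / c_matrix.sum(axis=1)
print(hit_rates.round(3))  # away 0.706, draw 0.036, home 0.786
```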

Before we move on, however, let's use our model to predict the 17/18 season and compare our performance against the odds.

```
# Get test predictions
test_lr = LogisticRegression(**best_lr_params)
test_lr.fit(all_x, all_y)  # Refit on all seasons before 17/18
test_predictions_probs = test_lr.predict_proba(test_x)
test_predictions = test_lr.predict(test_x)
test_ll = log_loss(test_y, test_predictions_probs)
test_accuracy = (test_predictions == test_y).mean()
print("Our predictions for the 2017/18 season have a log loss of: {0:.5f} and an accuracy of: {1:.2f}".format(test_ll, test_accuracy))
```

```
# Get accuracy and log loss based on the odds
odds_ll = log_loss(test_y, 1 / test_x[['f_awayOdds', 'f_drawOdds', 'f_homeOdds']])
# The shortest odds imply the favourite; slicing the column name [2:6] turns
# e.g. 'f_awayOdds' into 'away'
odds_predictions = test_x[['f_awayOdds', 'f_drawOdds', 'f_homeOdds']].apply(lambda row: row.idxmin()[2:6], axis=1).values
odds_accuracy = (odds_predictions == le.inverse_transform(test_y)).mean()
print("Odds predictions for the 2017/18 season have a log loss of: {0:.5f} and an accuracy of: {1:.3f}".format(odds_ll, odds_accuracy))
```

## Results

There we have it! The odds predicted 54.5% of EPL games correctly in the 2017/18 season, whilst our model predicted 54% correctly. This is a decent result for the first iteration of our model. In future iterations, we could wait a certain number of matches each season and calculate EMAs over those first n games. This may help with the issue of players switching clubs and teams becoming relatively stronger or weaker compared to previous seasons.
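That idea could be sketched with pandas' `ewm`; the column names below are hypothetical stand-ins for our real stats features:

```python
import pandas as pd

# Hypothetical example: an exponential moving average (EMA) of goals scored,
# recomputed within each season so early-season form reflects the current squad
results = pd.DataFrame({
    'season': ['1718'] * 4 + ['1819'] * 4,
    'goals':  [2, 0, 3, 1, 0, 2, 2, 4],
})

# Shift by one so each match's EMA only uses games played before it
results['goals_ema'] = (results.groupby('season')['goals']
                        .transform(lambda s: s.shift(1).ewm(span=3).mean()))
print(results)
```

The `shift(1)` is the important part: without it, the feature for each match would leak that match's own result into the model.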