Fantasy Football Forecaster

Spurs defy the <1% odds and make it to the Champions League final

What does it do?

How well does it do it?

How was it made?



What does it do?

If you asked anyone with an interest in football near the end of the 2018/19 season if they'd rather have Chelsea's superstar Eden Hazard or Newcastle's Salomón Rondón for the last five games of the season, it would have been a no brainer.

Hazard was the Premier League playmaker of the season and future £90-150m Real Madrid summer signing. He was playing for a team that would finish 3rd in the league. Salomón Rondón was bought for £15.3m, Newcastle finished 13th, and his summer move was to China.

Yet for those playing Fantasy Premier League (FPL) - a wildly popular game where people assemble virtual teams of real footballers to compete with their friends and colleagues, gaining points when those players perform well in real life - it would have been the wrong decision.

Despite costing your team almost double Rondón's price, Hazard would go on to deliver fewer points than Rondón in the last five games. Hazard delivered 23 points for £11m, while Rondón brought in 29 points for £5.9m (out of a total £100m budget that has to buy 15 players). Looking at recent point tallies this seems unexpected, given Hazard scored almost double Rondón's points in the previous five games (43 vs 25). This presents a great challenge for using predictive analytics to look deeper.
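As a quick back-of-the-envelope check in R, using the figures from the paragraph above, the value-for-money gap is stark:

```r
# Points per £1m over the final five games
hazard_value <- 23 / 11.0  # ~2.1 points per £1m
rondon_value <- 29 / 5.9   # ~4.9 points per £1m
```

Rondón delivered more than double Hazard's points per pound spent.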

So how can we build a model that can predict the ideal team?

This mini project first came about because I both wanted to learn how to implement machine learning models in R, and wanted to beat a couple of competitive colleagues at Fantasy Football. As I'm a fan of interesting, practical and relevant data sets this seemed like the perfect combination.

There were three key hurdles - collecting the data, training the models, and forecasting the ideal team.

Collecting the data required finding and plugging into the FPL API. Training the models required data wrangling and model tweaking. Finally "building the ideal team" could be achieved with linear programming to produce the most predicted points within the fantasy football constraints - budget, players per position and the maximum of three players from any one team.

If you're interested in the specifics, the section at the end, "How was it made?" goes into the detail.

What does the end result look like?

To give an idea of what the output looks like, I'm going to show it to you up front.

This is the machine learning model's ideal team for the final five weeks of the season. Selected players are shown in formation from the keeper at the bottom to the front three forwards at the top (thanks to a little data wrangling, as this is actually just a simple ggplot graph). Predicted points are in their circles (one player can be captained for double points) and their cost is below.

Over the five weeks it predicted a very decent score of 367 points, and came in below budget at £83m. It even identified that despite their price tags Rondón and Hazard would both score a similar number of points - though Rondón had the edge.



How well does it do it?

Was it right? Well, kind of. It would be very wrong to pick one example and judge the strength of the model on that - we'll measure its performance properly in a bit. The beauty of football for fans is exactly what makes it so challenging for modelling - it's really difficult to predict what will happen.

While sabermetrics, the empirical use of statistics to analyse baseball, changed baseball forever, it hasn't been the same for football. Baseball, like many American sports, has a huge number of quantifiable actions - as well as a massive sample size of games every year. For each pitch there is a score associated - an easily measurable outcome. Not true of every pass in football. Its use in football is advancing - just look at this NYT piece on Liverpool's use of analytics. Two weeks after it was published Liverpool won the Champions League final, the most prestigious club competition in the world.

However, that very final was played against Tottenham Hotspur - a team 538 gave a <1% chance of reaching it when only a few games into their Champions League campaign. It's hard to predict results.

Even more unlikely was arguably the biggest upset in the history of sport - Leicester City winning the Premier League at 5000/1 in 2016 - a stunning performance across all 38 games of the season.

If predicting whole competitions is hard, predicting individual games is even harder, and individual performances in individual games more difficult still. But for fantasy football, that's what you have to predict.

My first iteration did try to predict an individual's next game performance, but understandably performed very poorly. As I had to predict individuals (and fantasy football punishes lots of team adjustments), I shifted to a more reasonable performance over the next five games.

The model picked a great team for these last five weeks - the actual points tally was 466, a very strong score given the maximum possible was 476. However individual accuracy varied - although it was picking the strongest players, it was consistently predicting they'd score lower than they did. It had predicted a score of only 367. This degree of bias, which I saw fairly consistently, reflects the fact that by the nature of football games the model is oversimplified against the reality - and it's really hard to predict outlier scores.

If you could solve this, and consistently and accurately predict individual performances, you'd make millions.

Unpredictability of football aside, a key problem was that I was working with a pretty limited dataset. The FPL API gives high level detail like goals and assists, as well as more detailed variables such as big chances missed, but what you really want is metrics like expected goals and defensive coverage that you have to pay a lot of money for.

So actually how accurate was it?

Looking at the output of one week, though it allows you to understand better what the algorithm is doing, is no way to test its accuracy. We need to look at the results in aggregate.

Let's start by establishing a baseline. We'll do this by training really simple linear models (one for each position) that try to predict points in the next five games using only points in the last five games.
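A minimal sketch of that baseline (assuming a data frame `player_data` with columns `position`, `points_last_5` and `points_next_5` - the real column names may differ):

```r
library(tidyverse)

# One simple linear model per position:
# points over the next 5 games ~ points over the last 5 games
baseline_models <- player_data %>%
  split(.$position) %>%
  map(~ lm(points_next_5 ~ points_last_5, data = .x))

# R-squared of each position's baseline
map_dbl(baseline_models, ~ summary(.x)$r.squared)
```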

As players get points for doing well (not conceding goals for keepers/defenders/midfielders, scoring or assisting goals for forwards), points scored is effectively an aggregate of a number of individual metrics. It's the clearest indicator of performance - and our target variable.

There's a clear positive relationship between the two, but the linear model itself here is not a very good predictor. The models have an R2 (a measure of how much of the variance in the data is predicted by the model, where 100% is a perfect fit) varying from 30% to 56%, and show it seems to be easier to predict goalkeepers than defenders.

The challenge is for our model to increase that R2. So I trained a number of models to do exactly that.

First, introducing the other variables to the linear regression substantially increases the R2:

Of course a linear model assumes a linear relationship - not necessarily the case. Logging the points variable allows us to immediately increase the accuracy of the model:
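The log transform is a one-line change to the baseline above. This sketch uses `log1p` (i.e. log of points + 1) to cope with players on zero points - an assumption on my part about how blanks were handled; the column names are hypothetical as before:

```r
# Same baseline, but predicting log(points + 1) rather than raw points
logged_model <- lm(log1p(points_next_5) ~ points_last_5, data = player_data)

# Transform predictions back to the points scale
predicted_points <- expm1(predict(logged_model, newdata = player_data))
```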

But we can do more. We'll be borrowing two algorithms from the wonderful world of machine learning - random forest and xgboost.

Both of these have at their heart the fundamental concept behind machine learning - computers are better at pattern recognition than humans.

In the same way as with the previous linear regressions, we give the algorithm a training dataset - in this case how players performed over their last five games (points, goals, assists, creativity etc) and how many points they then got over the next five. It trains itself on this - learning the patterns.

We then give it a test dataset, where it only has how they performed over the last five games, and it then tries to predict how they will perform over the next five games. We then measure that against how they actually performed and we can see how accurate it was.
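Measuring that accuracy boils down to comparing the test-set predictions with the actual points. A sketch, assuming vectors `actual` and `predicted`:

```r
# R-squared: proportion of variance in the actuals explained by the predictions
r_squared <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)

# RMSE: typical size of the prediction error, in points
rmse <- sqrt(mean((actual - predicted)^2))
```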

Both of these are based on making decision trees. I've outlined a really simple decision tree below as an example - the ones created by these algorithms are more complex, but it's the same fundamental concept.

Both the random forest algorithm and the xgboost algorithm make lots of these decision trees. The random forest makes lots of trees (hence the name), using a random subset of features each time, and then takes the mean prediction of all of them. The "xg" in xgboost stands for "extreme gradient" boosting; to massively simplify, it runs a series of decision trees, where each new tree aims to predict the errors of the previous one. All of these trees are then combined for the final prediction.

To visualise how complex these get, I've extracted the simplest of the many trees generated by the random forest (note I've deliberately excluded a lot of the labels so it's even slightly readable):
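If you want to grow and plot a single tree like this yourself, the rpart package is the usual starting point in R. A sketch, again with hypothetical column names:

```r
library(rpart)
library(rpart.plot)

# A single regression tree predicting points over the next five games
tree <- rpart(points_next_5 ~ points_last_5 + goals_last_5 + minutes_last_5,
              data = player_data, method = "anova")

rpart.plot(tree)  # visualise the tree's splits
```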

These machine learning algorithms generally performed better than their linear regression counterparts for all positions except goalkeeper. Recently xgboost has come to the forefront of modelling popularity - it dominates most competitions on Kaggle where a neural network wouldn't be appropriate. It's also more complex to tune than a random forest, and I imagine that with more time spent tuning it could have been the best here too.

To illustrate what this difference in accuracy between models could look like in an example, let's compare the team predicted by the strongest algorithm (the random forest) with the weakest (single variable, unlogged linear regression).

First, the weakest. This is the optimal team as predicted by the weakest model:

It predicted 291 points but that team would have actually scored 340 points - aided by the massive over-performance of two defenders (Alexander-Arnold and van Dijk) who it predicted would be the best to include but massively underestimated just how well they'd do (perhaps unsurprisingly as defenders typically don't score that many points).

The strongest model predicted this team:

It also picked Alexander-Arnold - but was far closer in predicting his actual score. It predicted a total of 367 points for its team, but that team actually scored a massive 466 points. Overall we can see that the higher accuracy did, in this example, translate to more points.



How was it made?

This section is the behind-the-scenes coding and modelling detail, for those interested.

The first step - gathering the data

The first thing needed for this task was the data itself. I started this project part way through the season, so there was no complete dataset available. I also needed the data to come from the Fantasy Premier League, as it needed FPL points. Unfortunately there's no publicly documented API. However, that doesn't always mean there isn't one...

A bit of digging revealed that when you load up a player's page on the website, a call to an API is visible in the network traffic.

The preview shows this call includes the player's history, as well as their upcoming fixtures - with all the key stats we need.

A little bit of code in R can extract that for any particular player:

library(jsonlite)
library(tidyverse)

player_url <- "https://fantasy.premierleague.com/api/element-summary/"
player_number <- 342

# Checking what information is available in the API
names(fromJSON(paste0(player_url,player_number,"/")))
# [1] "fixtures"     "history"      "history_past"

# Pulling player match history
player_history <- as_tibble(fromJSON(paste0(player_url,player_number,"/"))$history)

The next stage was then to loop over all player IDs and join up the outputs into one large table, and pull in data on team details from a separate API call.
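That loop might look something like the sketch below, using purrr's `map_dfr` to row-bind each player's history into one long table (`player_ids` is a hypothetical vector of IDs, which would come from the separate API call mentioned above):

```r
library(jsonlite)
library(tidyverse)

player_url <- "https://fantasy.premierleague.com/api/element-summary/"

# Stack every player's match history into one table, tagged with their ID
all_history <- map_dfr(player_ids, function(id) {
  fromJSON(paste0(player_url, id, "/"))$history %>%
    as_tibble() %>%
    mutate(player_id = id)
})
```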

The result (when run at the end of the season) is a table with ~22,000 rows and 43 columns, one row per player per game, which we can then aggregate up to produce our "last five games" metrics.

The second step - training the models

Fantasy football rewards different numbers of points for different positions - for example for scoring a goal a defender will get 6 points, a midfielder 5, and an attacker 4. Defenders get 4 points for a clean sheet, while attackers get none. It therefore makes sense to have separate models for separate positions.
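Those scoring rules can be captured as simple lookup tables. The defender/midfielder/forward values are the ones quoted above; the goalkeeper values are the published FPL rules, added here for completeness:

```r
# FPL points awarded, by position
goal_points        <- c(Goalkeeper = 6, Defender = 6, Midfielder = 5, Forward = 4)
clean_sheet_points <- c(Goalkeeper = 4, Defender = 4, Midfielder = 1, Forward = 0)

goal_points[["Defender"]]  # 6 points for a defender's goal
```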

For training the models I split my data into a training set (70% of the data) and a testing set (30% of the data) so that I could test their accuracy on unseen data.
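With the caret package this split is a one-liner - a sketch, using a hypothetical data frame `model_data`:

```r
library(caret)

set.seed(42)  # reproducible split
in_train <- createDataPartition(model_data$points_next_5, p = 0.7, list = FALSE)
training <- model_data[in_train, ]
testing  <- model_data[-in_train, ]
```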

I used the caret package to train my models, using repeated k-fold cross-validation with 10 folds and 5 repeats. This is a process whereby the models repeatedly train themselves on different subsets of the data, rather than 'seeing' the data all at once, to prevent them 'overfitting'.

It's like if Asian footballer of the year and Spurs hero Son Heung-Min played between 60 and 90 minutes every game, and always scored - except for two games where he didn't. If these two games happened by chance to be when he had played exactly 67 minutes the week before, an overfitted model might conclude that if he plays more than 65 and less than 69 minutes the week before he won't score. But that's probably not the reason why, it's coincidence. So if you want your model to predict if he'll score next week that would be bad. Repeatedly training on different parts of the data helps prevent this.

Here you can see how caret allows the settings for each model to be set up in a similar way, making it much easier to standardise code for actually training each of them later on.

# Setting model parameters ------------------------------------------------

# Linear model
lm_ctrl <- trainControl(method="repeatedcv",
                        number=10,
                        repeats=5,
                        allowParallel=TRUE)

# Random forest model
rf_ctrl <- trainControl(method="repeatedcv",
                        number=10,
                        repeats=5,
                        classProbs=TRUE,
                        #summaryFunction = twoClassSummary,
                        allowParallel=TRUE)

# XGboost model
xg_ctrl <- trainControl(method="repeatedcv",
                        number=10,
                        repeats=5,
                        classProbs=TRUE,
                        verboseIter = FALSE) # no training log

I then made a few helper functions (f_train_model and f_examine_model) to make running lots of models at once, and examining their outputs, much easier to do. This meant I could then loop over my different model settings and generate all the models and outputs.

You can see here the final code for actually running these helper functions (if you want the code of the helper functions themselves, check out my github repository) - the process itself was one of trying different models, seeing what worked well and tweaking parameters accordingly.

# Running the models and examining the outputs ----------------------------
pmap(list(position = position_list, model = rep("lm", 4), both_or_just_nolog = rep("both", 4), ctrl = rep("lm_ctrl", 4)), f_train_model)
pmap(list(position = position_list, model = rep("rf", 4), both_or_just_nolog = rep("nolog", 4), ctrl = rep("rf_ctrl", 4)), f_train_model)
pmap(list(position = position_list, model = rep("xgbTree", 4), both_or_just_nolog = rep("nolog", 4), ctrl = rep("xg_ctrl", 4)), f_train_model)

pmap(list(position = position_list, model = rep("lm_5g", 4), both_or_just_nolog = rep("both",4)), f_examine_model)
pmap(list(position = position_list, model = rep("rf_5g", 4), both_or_just_nolog = rep("nolog",4)), f_examine_model)
pmap(list(position = position_list, model = rep("xgbTree_5g", 4), both_or_just_nolog = rep("nolog",4)), f_examine_model)

The final step - building the ideal team

Now all the models were trained, the final step was to build the ideal team for the next week.

I set this part up so that I could pick which week to predict for, and which of the models I'd trained to use:

# Predicting next week's scores ====

week <- 34
model_name <- "rf_5g_nolog"
model_longname <- "Random Forest"

Again, for pulling out the data and predicting the points with it, I made some helper functions (f_create_model_input and f_create_model_output) so it could be done in bulk cleanly and accurately. Running these generated predicted points for every player.

input_goalkeeper <-  f_create_model_input("Goalkeeper",model_name)
input_defender <-  f_create_model_input("Defender",model_name)
input_midfielder <-  f_create_model_input("Midfielder",model_name)
input_forward <-  f_create_model_input("Forward",model_name)

output_goalkeeper <- f_create_model_output("Goalkeeper",model_name, "input_goalkeeper", "")
output_defender <- f_create_model_output("Defender",model_name, "input_defender", "")
output_midfielder <- f_create_model_output("Midfielder",model_name, "input_midfielder", "")
output_forward <- f_create_model_output("Forward",model_name, "input_forward", "")

Then, with the predictions in hand, I used linear programming to maximise the points the team could score within the constraints of the game.

# ... Linear programming ====
# Constraints
#   Total value < 100m
#   Total number of players = 11
#   Maximum goalkeepers = 1
#   Maximum defenders = 5
#   Maximum midfielders = 5
#   Maximum forwards = 5
# 
# Objective
#   Maximise total predicted points

The hardest part of this was the constraint that you can't have more than 3 players from any one team. To do this I made a big matrix of every team and every player. Imagine a table with a column for every player, a row for every team, and a 1 in the cells where the player is in the team.

You then add a constraint that when you pick out your players (the columns), none of the rows (the teams, where there is a 1 for each player in that team) can sum to more than 3 across your selection.

In code, this all looks like the below - confusing if you've never used an LP solver interface, but it basically sets out the criteria. As a cut-down example:

matrix <- as.numeric(lp_data$element_type=="Goalkeeper") # Number of goalkeepers 
direction <- c( "==" ) # Must equal
rhs <- c(1)            # One

The full code:

# Creating matrix of all teams, and if player is in the team (used for maximum player/team criteria)
teamMatrix <- lapply(unique(lp_data$team), function(name) as.numeric(lp_data$team==name))
teamMatrix <- t(matrix(unlist(teamMatrix), ncol=n_distinct(lp_data$team)))

# Constraints
matrix <- rbind(
  as.numeric(lp_data$element_type=="Goalkeeper"),
  as.numeric(lp_data$element_type=="Defender"),
  as.numeric(lp_data$element_type=="Defender"),
  as.numeric(lp_data$element_type=="Midfielder"),
  as.numeric(lp_data$element_type=="Midfielder"),
  as.numeric(lp_data$element_type=="Forward"),
  as.numeric(lp_data$element_type=="Forward"),
  as.numeric(lp_data$element_type %in% c("Goalkeeper","Defender","Midfielder","Forward")),
  lp_data$now_cost,
  teamMatrix
)

direction <- c(
  "==",
  ">=",
  "<=",
  ">=",
  "<=",
  ">=",
  "<=",
  "==",
  "<=",
  rep("<=",nrow(teamMatrix))
)

rhs <- c(
  1,
  3,
  5,
  3,
  5,
  3,
  5,
  11,
  1000,
  rep(3,nrow(teamMatrix))
)

# Running the solver
sol <- Rglpk_solve_LP(obj = obj, mat = matrix, dir = direction, rhs = rhs,
                      types = var.types, max = TRUE)

# Tibble of the final team
final_team <- lp_data[sol$solution==1,]
sum(final_team$now_cost)
sum(final_team$predicted)

The output is a table of the players selected for this top-scoring team. With an easy bit of manipulation to calculate where each player should go on the pitch, that can then be turned into a graph of the team in formation using the wonderful ggplot package.

# Final graph
team_graph <- ggplot(graph_data2,aes(x,y,colour=team))+
  geom_point(size=10)+
  geom_text(aes(label=name_captaincy),vjust=-2)+
  geom_text(aes(label=round(predicted_captaincy,0)),colour="white",size=4,fontface="bold")+
  geom_text(aes(label=paste0("£",now_cost/10,"m")),vjust=2.5)+
  annotate("text",1,1.5,label=paste0("Total cost: £",sum(graph_data2$now_cost)/10,"m"),fontface="bold")+
  annotate("text",1,1.35,label=paste0("Total points: ",sum(round(graph_data2$predicted_captaincy,0))),fontface="bold")+
  annotate("text",filter(graph_data2,element_type == "Goalkeeper")$x,0.5,label="Number in circle is predicted points. Number below is cost", fontface = "italic", colour = "grey") +
  annotate("text",filter(graph_data2,element_type == "Goalkeeper")$x,0.3,label="The captain (c) has their points doubled", fontface = "italic", colour = "grey") +
  theme_void()+
  ylim(0,max(graph_data2$y)*1.1)+
  xlim(0,max(graph_data2$x)+min(graph_data2$x_step))+
  labs(colour="Team")+
  ggtitle(paste0("Optimal FPL Team (Week ",week,") - ",model_longname))+
  theme(plot.title = element_text(hjust=0.5,colour="grey47"))

png(paste0("outputs/","Chosen_team_week_",week," (",model_name,").png"), width=1280,height=800,res=144)
team_graph
dev.off()

All that's left is to compare it to the actual best team if it's a week that's already passed - the same process, but using actual points instead of predicted points.
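In practice that just means re-running the same solver with actual points as the objective - a sketch, where `total_points_actual` is a hypothetical column of the points each player really scored:

```r
# Same constraints as before, but maximising actual rather than predicted points
obj_actual <- lp_data$total_points_actual
sol_actual <- Rglpk_solve_LP(obj = obj_actual, mat = matrix, dir = direction,
                             rhs = rhs, types = var.types, max = TRUE)

actual_best_team <- lp_data[sol_actual$solution == 1, ]
```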

And voila, a model trained to predict the ideal fantasy football team for the next five weeks, and show you it in formation!