Expected Goals using machine learning

In my last blog I built two simple xG models using the distance of a shot from the goal line and the angle a shot makes with the crossbar.  The second model attempted to account for the fact that, for the same distance, shots from wide areas are less likely to be scored than shots from right in front of goal.  However, this was not perfect as it still undervalued shots from directly in front of goal closer to the penalty spot.  Ideally, our xG visualisation should look like this (courtesy of David Sumpter), with high xG shots directly in front of goal and low xG shots in the wide areas that are still close to the goal.

[Image: target xG distribution, courtesy of David Sumpter]

This can be achieved by combining both distance and angle to goal into one function.  But this is not a trivial problem, as described in detail by Michael Caley.  Through trial and error, he manages to capture the above distribution pretty well using a combination of distances, angles, inverse distances and inverse angles.

Another approach is to consider clusters of shots.  A set of shots that are geometrically close together will have similar distances and angles so should therefore have a similar xG.  This is the concept of binning as described in this model.  The number of goals divided by the number of shots from inside a particular bin gives us the xG for those shots.
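The binning idea can be sketched in a few lines of Python.  This is only an illustration: the (x, y, is_goal) tuple layout, the 2 metre square cells and the example shots are my assumptions, not details of the model referred to above.

```python
from collections import defaultdict

def binned_xg(shots, bin_size=2.0):
    """Map each square grid cell to goals / shots for the shots inside it.

    shots: iterable of (x, y, is_goal), with x and y in metres (assumed layout).
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [goals, shots]
    for x, y, is_goal in shots:
        cell = (int(x // bin_size), int(y // bin_size))
        counts[cell][0] += int(is_goal)
        counts[cell][1] += 1
    return {cell: goals / total for cell, (goals, total) in counts.items()}

# Made-up shots, for illustration only
example_shots = [(1.0, 0.5, True), (1.5, 1.0, False), (1.2, 1.9, True), (11.0, 3.0, False)]
xg_by_cell = binned_xg(example_shots)
```

Every shot in the same cell gets the same xG, so the choice of cell size trades resolution against the number of shots available to average over.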


My approach uses a k-nearest neighbour algorithm, which is a simple machine learning technique.  For a particular shot, the KNeighborsRegressor class from the scikit-learn Python package will search for the 500 closest shots and then count up the number of goals.  This number divided by 500 is the xG for that shot.  The image below is a visualisation of this model showing all 226,917 shots in the dataset.

[Image: xG4.png]

This seems much more like the xG distribution from above.  However, the xG of shots in the 6 yard box right in front of goal is about 0.8, whereas my previous models predicted values closer to 1.  This is a result of the high value for k.  Because the density of shots near the goal line is relatively low, the 500 nearest neighbours of a shot there will include shots from the edges of the 6 yard box.
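To make the neighbour counting concrete, here is a minimal brute-force sketch in plain Python.  It is not the scikit-learn implementation; the function name, the (x, y, is_goal) layout and the toy data are assumptions made for illustration, and k is kept tiny so the result can be checked by hand.

```python
import math

def knn_xg(shot, history, k=500):
    """Estimate xG as the fraction of goals among the k nearest past shots.

    shot: (x, y); history: list of (x, y, is_goal). Brute-force search.
    """
    if not history:
        raise ValueError("history must contain at least one shot")
    k = min(k, len(history))  # cannot use more neighbours than we have shots
    nearest = sorted(history, key=lambda h: math.dist(shot, h[:2]))[:k]
    return sum(int(is_goal) for _, _, is_goal in nearest) / k

# Made-up history: three shots close to goal, two far away
history = [(1, 0, True), (2, 0, True), (3, 0, False), (20, 0, False), (25, 5, False)]
close_xg = knn_xg((1.5, 0), history, k=3)  # neighbours: two goals, one miss
far_xg = knn_xg((22, 2), history, k=2)     # neighbours: two misses
```

As far as I know, fitting KNeighborsRegressor with n_neighbors=500 and its default uniform weights on 0/1 goal labels gives the same kind of neighbour average, just with a much faster search.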

There are 21,578 goals in the whole dataset.  This model predicts 21,409 goals, which is an error of just 0.8%.  We can investigate how teams and players perform under this metric.  The graph below shows the total xG for each team in the dataset against the total goals they scored.  The line is y=x and shows over-performing teams above the line and under-performing teams below the line.

[Image: teams_xg]

The really good teams over-perform, meaning there are aspects of their play that aren't captured by a shots-based xG model.  These teams, ordered by their performance (total goals divided by total xG), are shown below.
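The performance ratio and the overall error quoted above are simple arithmetic, sketched here in Python.  The (team, xg, is_goal) layout and the two example teams are made up; only the two goal totals come from the text.

```python
from collections import defaultdict

def team_performance(shots):
    """Return {team: (goals, total_xg, goals / total_xg)}.

    shots: iterable of (team, xg, is_goal) (assumed layout).
    """
    totals = defaultdict(lambda: [0, 0.0])  # team -> [goals, total xG]
    for team, xg, is_goal in shots:
        totals[team][0] += int(is_goal)
        totals[team][1] += xg
    return {t: (g, x, g / x) for t, (g, x) in totals.items() if x > 0}

# Made-up shots: team A scores twice from 1.0 xG, team B once from 1.0 xG
example = [("A", 0.5, True), ("A", 0.25, True), ("A", 0.25, False),
           ("B", 0.5, False), ("B", 0.5, True)]
performance = team_performance(example)

# Overall check using the totals quoted in the text
actual_goals, predicted_goals = 21_578, 21_409
relative_error = abs(actual_goals - predicted_goals) / actual_goals  # about 0.008
```

A ratio above 1 puts a team above the y=x line, and a ratio below 1 puts it below.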


Similarly, we can assess players using xG.  The graph below shows xG against total goals per player.

[Image: players_xg]

Again it seems the best players over-perform their xG numbers.  A breakdown of players with at least 60 goals is shown in the table below.


Luis Suárez is over-performing his xG by a huge 72%, closely followed by Messi and Griezmann at 70%.  It is interesting to note that Benzema has an xG per shot of 0.154, the highest of the players in the list.  This, coupled with his high over-performance figure, suggests he is finishing really well and getting into high value positions to shoot.  Meanwhile, his Real Madrid teammates, Ronaldo and Bale, have the lowest xG per shot.  It seems they have their share of speculative efforts before Benzema comes in to clean up if they fail.

It is reassuring that this xG model identifies the best teams and players, as validated by other models.  The machine learning algorithms can of course be extended to include many other factors.  In future blogs, I’ll look to build these into my cricket models.

A simple Expected Goals model

Whilst this blog will focus on cricket, a lot of concepts will be inspired by work from other sports.  For example, my Expected Runs model describes the average number of runs that would be scored from a delivery with particular attributes such as its line and length.  This is equivalent to the well-established Expected Goals (xG) metric from football, which measures the quality of a chance.  A simple search will turn up many models that take into account a number of factors such as the distance and angle from goal, which body part was used, build-up play etc.  Using some data that I had lying around, I wanted to see if I could replicate the results of some of these models.

Almost all xG models are fundamentally based on the location of the shot.  In fact, this blog suggests that you can construct an xG model that is 95% reliable just by considering the location of shots.

The image below shows the locations of 234,018 shots from 9,133 matches I have in my database from various leagues and competitions around the world (attacking left to right).

[Image: all-shots]

The cluster of shots in the defending penalty area is not the work of opportunistic goalkeepers but own goals.  This is illustrated by the next image, which shows all 24,347 goals in the dataset.

[Image: all-goals]

Just from these images we can immediately start forming some hypotheses, such as the closer and more central you are to the goal, the more likely you are to score.  Compare the density of the cluster of shots outside of the penalty area to the goals from this area.  Also, consider that the cluster inside the penalty area extends to the edges of the box, while there are significantly fewer goals from the wide areas.

For the purposes of my simple xG model, I excluded own goals and penalties and only considered shots in the boxed area shown below.

[Image: df-shots]

This leaves us with 21,424 goals from 225,372 shots.  I calculated the distance of each shot from the middle of the goal line, giving us shots that ranged from 0 to 40 metres.  I then placed all these shots into forty 1 metre bins.  Within each bin, the ratio of goals to shots is the expected goals figure or xG for each of those shots.  For example, a shot from 10.5 metres out has an xG of 0.129 because there have been a total of 9,267 shots between 10 and 11 metres out, resulting in 1,194 goals.
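The distance binning can be sketched as follows.  The coordinate convention (goal centre at the origin, positions in metres) and the function name are assumptions for illustration; the final line simply reproduces the 10 to 11 metre example from the figures above.

```python
import math
from collections import Counter

def distance_binned_xg(shots, goal_centre=(0.0, 0.0), bin_width=1.0):
    """Map each distance bin index to goals / shots within that bin.

    shots: iterable of (x, y, is_goal) in metres (assumed layout).
    Bin 10 covers distances from 10 up to (but not including) 11 metres.
    """
    shots_per_bin, goals_per_bin = Counter(), Counter()
    for x, y, is_goal in shots:
        b = int(math.dist((x, y), goal_centre) // bin_width)
        shots_per_bin[b] += 1
        goals_per_bin[b] += int(is_goal)
    return {b: goals_per_bin[b] / n for b, n in shots_per_bin.items()}

# Made-up shots: two in the 10 to 11 metre bin, one exactly 5 metres out
toy_xg = distance_binned_xg([(10.5, 0.0, True), (10.2, 1.0, False), (3.0, 4.0, False)])

# The worked example from the text: 1,194 goals from 9,267 shots
xg_10_to_11 = 1_194 / 9_267  # about 0.129
```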

If we do this for every shot, we get the following graph:

[Image: graph]

We can see that a shot from less than 1 metre out is more or less certain to be a goal.  Beyond this, there is a rapid drop in xG up to 10 metres before slowly converging to zero at large distances.  This sort of graph can typically be fitted by an exponential decay curve.

[Image: graph2]
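One simple way to fit such a curve is ordinary least squares on the logarithm of xG.  The sketch below assumes the two-parameter form xG = a * exp(-b * d); this form, the function name and the synthetic data are all assumptions for illustration, and the coefficients are not those of the fitted curve in the graph.

```python
import math

def fit_exponential_decay(distances, xg_values):
    """Fit xg ~ a * exp(-b * d) by least squares on ln(xg). Returns (a, b).

    Bins with xg == 0 are skipped because their logarithm is undefined.
    """
    pts = [(d, math.log(v)) for d, v in zip(distances, xg_values) if v > 0]
    n = len(pts)
    mean_d = sum(d for d, _ in pts) / n
    mean_l = sum(l for _, l in pts) / n
    sxx = sum((d - mean_d) ** 2 for d, _ in pts)
    sxy = sum((d - mean_d) * (l - mean_l) for d, l in pts)
    slope = sxy / sxx
    intercept = mean_l - slope * mean_d
    return math.exp(intercept), -slope

# Synthetic bin centres and xG values generated from known parameters
bin_centres = [i + 0.5 for i in range(40)]
synthetic_xg = [1.1 * math.exp(-0.2 * d) for d in bin_centres]
a, b = fit_exponential_decay(bin_centres, synthetic_xg)  # recovers roughly 1.1 and 0.2
```

Fitting on the log scale gives more weight to the low xG bins than a direct non-linear fit would, so the two approaches can give slightly different curves on real data.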

By eye the curve seems like a decent fit for our data.  We can use the equation to estimate the xG of a shot from just its distance from the goal.  For example, d = 10.5 metres gives us an xG of 0.157 goals.  This is slightly higher than the empirical result from earlier.

The equation, however, breaks down at very small distances.  Specifically, any distance below 0.7 metres will return a value greater than 1 for xG.  Nevertheless, the r-squared value is 0.987, similar to what is obtained here.

Here is a visualisation of the xG model.

[Image: xG2.png]

Another limitation of this simple distance model is that it rates two shots of the same distance with the same xG, even if one is from directly in front of the goal and the other from a very tight angle.  Our intuition tells us that xG should decrease as we get less and less of the goal to aim for.  However, an xG model based just on angle to goal will also suffer from a similar problem.  The xG of a shot from the penalty spot will be the same as that of a shot from the centre mark.  Both have the same angle from goal i.e. zero, but are vastly different distances from goal.

To counter this, we can consider the angle to the middle of the crossbar.  The figure below illustrates this angle.


Let’s say position A is the location of the shot and AB is the distance to goal as before.  BC is the height of the goal i.e. 2.44 metres.  From these two lengths we can calculate the angle to the crossbar, θ, using basic trigonometry.  This angle has the advantage that it decreases with distance assuming the angle to the goal line stays the same.
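With a right angle at B, the trigonometry is tan θ = BC / AB, so θ = arctan(2.44 / AB).  A small Python sketch of that calculation follows; the function and constant names are mine.

```python
import math

GOAL_HEIGHT_M = 2.44  # BC in the text: height of the crossbar above the goal line

def crossbar_angle_degrees(distance_m):
    """Angle at the shot location A between the ground line AB and the line AC.

    distance_m is AB, the distance from the shot to the middle of the goal line.
    """
    return math.degrees(math.atan2(GOAL_HEIGHT_M, distance_m))

angle_close = crossbar_angle_degrees(2.44)  # AB equals BC, so 45 degrees
angle_far = crossbar_angle_degrees(20.0)    # much smaller angle further out
```

Using atan2 rather than a plain division means a shot from exactly on the goal line (AB = 0) gives 90 degrees instead of a division error.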

As before, we can calculate this angle for every shot in the dataset and place these into bins to obtain the xG for a particular angle.

[Image: bar.png]

This time we get a linear relationship between the angle to the crossbar and xG.  As you might expect, the more of the goal you have to aim for and the closer you are, the higher the chance of scoring.  There is a bit more uncertainty in xG for shots above 30 degrees, as the vast majority of shots (99%) have an angle less than this.

This blog has described two simple implementations of an expected goals model based purely on the location of the shot.  In the next blog, I will attempt to build a model that incorporates both the distance and the angle to the goal as well as investigating how teams and players perform under this metric.