A simple Expected Goals model

Whilst this blog will focus on cricket, a lot of concepts will be inspired by work from other sports.  For example, my Expected Runs model describes the average number of runs that would be scored from a delivery with particular attributes such as its line and length.  This is equivalent to the well established Expected Goals (xG) metric from football which measures the quality of a chance.  A simple search will glean many models that take into account a number of factors such as the distance and angle from goal, which body part was used, build-up play etc.  Using some data that I had lying around, I wanted to see if I could replicate the results of some of these models.

Almost all xG models are fundamentally based on the location of the shot.  In fact, this blog suggests that you can construct an xG model that is 95% reliable just by considering the location of shots.

The image below shows the locations of 234,018 shots from 9,133 matches I have in my database from various leagues and competitions around the world (attacking left to right).

[Figure: locations of all shots]

The shots clustered in the defending penalty area are not from opportunistic goalkeepers but are own goals.  This is illustrated by the next image, which shows all 24,347 goals in the dataset.

[Figure: locations of all goals]

Just from these images we can immediately start forming some hypotheses, such as the closer and more central you are to the goal, the more likely you are to score.  Compare the density of the cluster of shots outside of the penalty area to the goals from this area.  Also, consider that the cluster inside the penalty area extends to the edges of the box, while the number of goals in the wide areas is significantly fewer.

For the purposes of my simple xG model, I excluded own goals and penalties and only considered shots in the boxed area shown below.

[Figure: shots inside the boxed area used for the model]

This leaves us with 21,424 goals from 225,372 shots.  I calculated the distance of each shot from the middle of the goal line, giving us shots that ranged from 0 to 40 metres.  I then placed all these shots into forty 1 metre bins.  Within each bin, the ratio of goals to shots is the expected goals figure or xG for each of those shots.  For example, a shot from 10.5 metres out has an xG of 0.129 because there have been a total of 9,267 shots between 10 and 11 metres out, resulting in 1,194 goals.
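As a rough sketch of the binning described above (not the code actually used for this post), the snippet below groups shots into 1 metre bins and takes goals divided by shots in each bin.  The function names, the `(distance, is_goal)` input format and the 105 m by 68 m pitch coordinates in `distance_to_goal` are assumptions made for illustration.

```python
import math
from collections import defaultdict


def distance_to_goal(x, y, goal_x=105.0, goal_y=34.0):
    """Distance in metres from a shot at (x, y) to the middle of the goal line.

    The goal position assumes a 105 m x 68 m pitch attacking left to right;
    this coordinate system is hypothetical, not taken from the dataset.
    """
    return math.hypot(goal_x - x, goal_y - y)


def xg_by_distance_bin(shots, bin_width=1.0):
    """Return {bin_index: (n_shots, n_goals, xg)} for (distance_m, is_goal) pairs."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [shots, goals]
    for distance, is_goal in shots:
        b = int(distance // bin_width)  # e.g. 10.5 m falls in bin 10 (10 to 11 m)
        counts[b][0] += 1
        counts[b][1] += int(is_goal)
    return {b: (n, g, g / n) for b, (n, g) in sorted(counts.items())}


# Rebuild the 10 to 11 metre bin quoted in the text: 9,267 shots, 1,194 goals.
shots = [(10.5, True)] * 1194 + [(10.5, False)] * (9267 - 1194)
n_shots, n_goals, xg = xg_by_distance_bin(shots)[10]
print(n_shots, n_goals, round(xg, 3))  # 9267 1194 0.129
```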

If we do this for every shot, we get the following graph:

[Figure: xG against distance from goal]

We can see that a shot from less than 1 metre out is more or less certain to be a goal.  Beyond this, there is a rapid drop in xG up to 10 metres before slowly converging to zero at large distances.  This sort of graph can typically be fitted by an exponential decay curve.

[Figure: xG against distance from goal with fitted exponential decay curve]

By eye the curve seems like a decent fit for our data.  We can use the equation to estimate the xG of a shot from just its distance from the goal.  For example, d = 10.5 metres gives us an xG of 0.157 goals.  This is slightly higher than the empirical result from earlier.

The equation, however, breaks down at very small distances.  Specifically, any distance below 0.7 metres will return a value greater than 1 for xG.  Nevertheless, the r-squared value is 0.987, similar to what is obtained here.
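The fitted equation itself is not reproduced in the text, so the following is only a sketch of how such a curve could be obtained.  It assumes the form xG = A·exp(−k·d) and fits it by ordinary least squares on ln(xG), which is a simplification and will not exactly match a direct non-linear fit.  The function names and the optional cap at 1 are my own additions for illustration.

```python
import math


def fit_exponential_decay(distances, xg_values):
    """Fit xG = A * exp(-k * d) via a straight-line fit to ln(xG); returns (A, k)."""
    # Bins with xG == 0 cannot be log-transformed, so they are skipped.
    pts = [(d, math.log(v)) for d, v in zip(distances, xg_values) if v > 0]
    n = len(pts)
    mean_d = sum(d for d, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((d - mean_d) ** 2 for d, _ in pts)
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_d
    return math.exp(intercept), -slope


def predict_xg(distance, a, k, cap=True):
    """Evaluate the fitted curve; optionally cap at 1 so xG stays a probability."""
    value = a * math.exp(-k * distance)
    return min(value, 1.0) if cap else value


# Synthetic bin centres and values from made-up parameters (not the blog's fit).
centres = [i + 0.5 for i in range(40)]
values = [1.1 * math.exp(-0.2 * d) for d in centres]
a, k = fit_exponential_decay(centres, values)
print(round(a, 3), round(k, 3))
```

Capping the prediction at 1 is one simple way around the breakdown at very small distances, at the cost of a kink in the curve.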

Here is a visualisation of the xG model.

[Figure: visualisation of the distance-based xG model]

Another limitation of this simple distance model is that it rates two shots of the same distance with the same xG, even if one is from directly in front of the goal and the other from a very tight angle.  Our intuition tells us that xG should decrease as we get less and less of the goal to aim for.  However, an xG model based just on angle to goal will also suffer from a similar problem.  The xG of a shot from the penalty spot will be the same as that of a shot from the centre mark.  Both have the same angle from goal i.e. zero, but are vastly different distances from goal.

To counter this, we can consider the angle to the middle of the crossbar.  The figure below illustrates this angle.


Let’s say position A is the location of the shot and AB is the distance to goal as before.  BC is the height of the goal i.e. 2.44 metres.  From these two lengths we can calculate the angle to the crossbar, θ, using basic trigonometry.  This angle has the advantage that it decreases with distance assuming the angle to the goal line stays the same.
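The trigonometry here is just the right-angled triangle ABC: AB lies along the ground, BC is the vertical height of the goal, so tan θ = BC / AB.  A minimal sketch, with the function name being my own choice:

```python
import math

GOAL_HEIGHT_M = 2.44  # BC in the text: height of the crossbar


def crossbar_angle_deg(distance_m):
    """Angle θ at A between the ground line AB and the line AC to the crossbar."""
    # tan(θ) = BC / AB, so θ = atan(BC / AB); atan2 also handles AB = 0 (θ = 90°).
    return math.degrees(math.atan2(GOAL_HEIGHT_M, distance_m))


# The angle shrinks as the shot moves further from goal.
for d in (2.44, 5.0, 10.5, 20.0):
    print(d, round(crossbar_angle_deg(d), 1))
```

A shot from 10.5 metres, for instance, gives an angle of roughly 13 degrees, while a shot from 2.44 metres gives 45 degrees.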

As before, we can calculate this angle for every shot in the dataset and place these into bins to obtain the xG for a particular angle.

[Figure: xG against angle to the crossbar]

This time we get a linear relationship between the angle to the crossbar and xG.  As you might expect, the more of the goal you have to aim for and the closer you are, the higher the chance of scoring.  There is a bit more uncertainty in xG in shots above 30 degrees as the vast majority of shots (99%) have an angle less than this.

This blog has described two simple implementations of an expected goals model based purely on the location of the shot.  In the next blog, I will attempt to build a model that incorporates both the distance and the angle to the goal as well as investigating how teams and players perform under this metric.

Rating players with xR

In my last post, I described a metric called Expected Runs or xR for short.  This gave us the average number of runs you would expect a batsman to score from a delivery that possesses particular attributes such as its line and length etc.  My first attempt at an xR model just considered the position of the ball as it reaches stump level using data from over 200 T20I matches.

In this post, I look at how batsmen perform under this metric over several T20I matches.  The plot below shows the total runs scored for 507 batsmen in my database against their total xR.  Batsmen above the line score more runs than xR suggests, so are over-performing according to this metric.

[Figure: total runs against total xR for each batsman]

We can calculate over-performance by dividing runs by xR.  The table below shows the top 20 batsmen with at least 300 runs (of which there are 55), ordered by runs per xR.


Glenn Maxwell comes out way on top, scoring over 200 runs more than the average batsman would if they faced the same deliveries as he had.  Further down the list we see some notable hitters such as Aaron Finch and Shahid Afridi.  Interestingly, Luke Wright comes in higher than the likes of Chris Gayle, AB de Villiers and David Warner.

At the other end of the scale we get the table below:


Mohammad Hafeez scores nearly 100 runs fewer than expected.  This suggests he is not putting away the bad balls enough of the time, which is not ideal for an opener batting in the Powerplay overs.  It’s a wonder why Pakistan persisted with him for so long considering he has a career average of just 22.73 and a strike rate of 115.
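The ranking behind these tables can be sketched as follows.  This is an illustration rather than the code used for the post: the `(batsman, runs, xr)` input format and the function name are assumptions, and the players in the example are placeholders with made-up numbers.

```python
from collections import defaultdict


def rate_batsmen(balls, min_runs=300):
    """Rank batsmen by runs per xR.

    balls: iterable of (batsman, runs_scored, xr_of_ball).
    Returns rows of (batsman, runs, xr, runs / xr, runs - xr), best first.
    """
    totals = defaultdict(lambda: [0, 0.0])  # batsman -> [runs, xR]
    for batsman, runs, xr in balls:
        totals[batsman][0] += runs
        totals[batsman][1] += xr
    rows = [
        (name, runs, xr, runs / xr, runs - xr)
        for name, (runs, xr) in totals.items()
        if runs >= min_runs and xr > 0
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Placeholder data: "A" hits every ball for four, "B" only takes singles.
balls = [("A", 4, 1.25)] * 100 + [("B", 1, 1.25)] * 400
for row in rate_batsmen(balls):
    print(row)
```

The last column, runs minus xR, is the "runs more (or fewer) than expected" figure quoted for individual batsmen.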

The xR beehive plots from my last post show that xR for a particular patch is rarely above 1.5.  Given that a boundary can produce a runs/xR multiple of up to 6 for that particular ball, I wanted to see if frequent boundary hitters generally had a higher runs/xR figure.  Taking only batsmen who have scored at least 20 boundaries, we can see whether there is a trend between the number of boundaries hit and the runs/xR multiple.

[Figure: number of boundaries hit against runs/xR multiple]

The plot above shows that the correlation is very weak, with an R² value of 0.068.  This is encouraging as it implies xR measures something more than just pure power hitting.  It can be used to identify the batsmen who have the ability to hit good balls into gaps for ones and twos as well as those batsmen who are not good enough to consistently put away bad balls to the boundary.
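For a straight-line fit of one variable against another, R² is simply the square of the Pearson correlation coefficient.  A minimal sketch of that calculation, assuming two equal-length lists (boundaries hit and runs/xR per batsman); the function name is hypothetical:

```python
def r_squared(xs, ys):
    """R² of a simple linear regression of ys on xs (squared Pearson correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# A perfectly linear relationship gives 1; no linear trend gives 0.
print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))    # 1.0
print(r_squared([1, 2, 3, 4], [1, -1, -1, 1]))  # 0.0
```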

xR can also be applied to bowlers.  Bowlers with a low xR/ball figure are bowling in areas that are on average low-scoring.  Note that this is independent of what the batsman eventually does.  The table below shows the bowlers to have bowled at least 300 balls (of which there are 34) ordered by their xR/ball multiple.


Perhaps unsurprisingly, Sunil Narine comes out on top with 1.202 xR/ball.  The average run rate per ball across the entire dataset is 1.245.  This is a difference of about 5 runs across a 20-over innings, so certainly not insignificant.  It is interesting to note that Narine’s xR conceded is significantly higher than his actual conceded runs.  This suggests that many batsmen are under-performing when facing him even when accounting for the fact that he bowls a lot of good balls.  Batsmen cannot seem to consistently hit him for ones and twos, never mind boundaries – a testament to his incredibly tight bowling.
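The per-innings figure is just the per-ball gap scaled up to the 120 balls of a full T20 innings.  As a small worked example (the function name is my own):

```python
BALLS_PER_T20_INNINGS = 20 * 6  # 20 overs of 6 legal balls


def runs_saved_per_innings(bowler_xr_per_ball, average_runs_per_ball,
                           balls=BALLS_PER_T20_INNINGS):
    """Expected runs saved over an innings relative to the dataset average."""
    return (average_runs_per_ball - bowler_xr_per_ball) * balls


# Narine's 1.202 xR/ball against the dataset average of 1.245 runs per ball.
print(round(runs_saved_per_innings(1.202, 1.245), 2))  # 5.16
```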

Darren Sammy and Sohail Tanvir are the only two bowlers in the top 10 to concede more runs than expected.  This may be due to a combination of mainly facing above-average batsmen and some bad luck.

At the other end of the scale we observe that every bowler in the bottom 10 is a seam bowler bar Imran Tahir – an indication of the need to have separate xR models for spinners and seam bowlers.  Kyle Abbott has the highest xR/ball corresponding to 9 runs more than the average T20I innings.  Two fast bowlers, Mitchell Starc and Lasith Malinga, concede significantly fewer than expected.  Although they bowl in relatively high-scoring areas, their pace may be a factor in keeping runs to a minimum.  Again, this is something that can be built into the xR model.


The full list for both batsmen and bowlers can be found here.

xR has certainly shown its potential in accurately rating players beyond traditional metrics.  xR can also be used to rate individual innings as well as determine who ‘deserved’ to win a match by calculating the total xR for each team.  In future posts, I will look to incorporate more factors to further refine the model, including what balls are most likely to get a wicket.

Introducing Expected Runs

How can you tell how good a batsman really is?  How can we measure their true skill?  Maybe they’ve been getting lucky or have been facing some pretty dross bowling?  Perhaps a batsman appears to be out of form because they’ve recently been on the end of a couple of unplayable deliveries early on in their innings.

Averages and strike rates are good summary statistics but reveal little about the current match situation or the quality of the opposition bowling.  In this blog, I describe my first attempt at a metric that aims to predict the number of runs a batsman should score based on the type of deliveries they have faced.

In football analytics, the metric expected goals, abbreviated to xG, measures the probability of a particular shot ending up as a goal based on a variety of factors such as the distance and angle from goal, the body part used to make the shot and the type of assist.  An open goal from 10 yards usually leads to a goal, while a shot from 40 yards out rarely does.  Adding up all these expected goals gives the number of goals that a team or player would score on average.

Similarly in cricket, the same type of delivery usually ends with similar outcomes: half-volleys tend to go for 4, top-of-off deliveries tend to be defended and ripping leg-breaks are often played and missed.  This can be quantified using data to calculate the number of runs an average batsman would be expected to score from a delivery of a particular line and length, speed and movement off the pitch, among other factors.  For example, exactly how many runs would you expect to be scored from a back-of-a-length delivery outside off, with no movement off the pitch at 85 mph?  If we collect all the deliveries that have these attributes and add up the total runs that have accrued, we can divide this by the number of balls to get an Expected Runs figure, or (predictably) xR for short.  The next time a similar ball is bowled we can say that it has an xR of however many.
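In code, the idea amounts to grouping deliveries by their attributes and dividing total runs by the number of balls in each group.  The sketch below assumes each delivery is an `(attributes, runs)` pair where `attributes` is a tuple of labels; both the format and the labels are hypothetical, and the runs are made up.

```python
from collections import defaultdict


def expected_runs_by_type(deliveries):
    """Return {attributes: xR}, where xR = total runs / number of balls."""
    totals = defaultdict(lambda: [0, 0])  # attributes -> [runs, balls]
    for attributes, runs in deliveries:
        totals[attributes][0] += runs
        totals[attributes][1] += 1
    return {attrs: runs / balls for attrs, (runs, balls) in totals.items()}


# Five made-up deliveries of one type, scoring 0, 1, 4, 0 and 1 runs.
ball_type = ("back of a length", "outside off", "no movement")
deliveries = [(ball_type, r) for r in (0, 1, 4, 0, 1)]
print(expected_runs_by_type(deliveries)[ball_type])  # 1.2
```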

This concept is also used by CricViz to measure current batting conditions in Test matches, as described here.

You may immediately see how Expected Runs can be used to measure the quality of a batsman.  If the xR of a particular ball is 1.5 and the batsman is able to consistently hit this ball to the fence, it gives an indication of how good this batsman is compared to the average batsman.

In my first version of an Expected Runs model, I only consider the line and bounce of the ball i.e. the position of the ball when it is level with the stumps.  My dataset consists of 51,775 balls from 226 T20I matches.  The data contain details of the over, batsman, bowler, runs, any extras, wickets, ball speed, coordinates of where the ball pitched and coordinates of where the ball ends up at stump level.  I stripped the data of any wides, null and erroneous coordinate values to end up with 43,541 deliveries producing 54,192 runs and 2,420 wickets in total.  I then split this data by right and left-handed batsmen to give 30,757 and 12,784 deliveries respectively.  Every ball in the dataset is shown below as a beehive plot:


The batting crease runs from -1.5 m to 1.5 m with middle stump at 0 m.  I split the coordinate space into square bins of 0.1 m side length, giving us 750 bins in total as shown in the figure below.  However, it is evident from the figure above that the sample size for each bin will vary wildly.

[Figure: axes and bins at stump level]

This procedure found bins to have Expected Runs ranging from 0 to 6.  These extreme figures were due to very small sample sizes.  The table below illustrates how runs compare against xR.  As expected, they both have virtually the same mean but xR has a significantly smaller standard deviation.  It may or may not be surprising that most balls in T20 matches are either dots or singles.
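A sketch of this binning is given below.  The 0.1 m bin size and the -1.5 m left edge come from the description above; the height origin of 0 m, the `(x, y, runs)` input format and the minimum sample size of 30 balls are assumptions chosen for illustration, since no threshold is stated here.

```python
import math
from collections import defaultdict

BIN_SIZE_M = 0.1


def bin_index(x_m, y_m, x_min=-1.5, y_min=0.0, size=BIN_SIZE_M):
    """Map a stump-level position (metres) to a (column, row) bin."""
    # Positions exactly on a bin edge may land either side due to float rounding.
    return (math.floor((x_m - x_min) / size), math.floor((y_m - y_min) / size))


def xr_grid(balls, min_balls=30):
    """Return {bin: xR} for bins holding at least min_balls deliveries.

    balls: iterable of (x_m, y_m, runs). Sparse bins are dropped because a
    handful of balls can give extreme xR values anywhere from 0 to 6.
    """
    totals = defaultdict(lambda: [0, 0])  # bin -> [runs, balls]
    for x, y, runs in balls:
        cell = totals[bin_index(x, y)]
        cell[0] += runs
        cell[1] += 1
    return {b: r / n for b, (r, n) in totals.items() if n >= min_balls}


# Made-up data: 50 balls in one bin, plus a lone six in a sparse bin.
balls = [(0.05, 0.75, 1)] * 40 + [(0.05, 0.75, 4)] * 10 + [(1.25, 2.05, 6)]
print(xr_grid(balls))  # {(15, 7): 1.6}
```

Dropping sparse bins is the bluntest way to deal with small samples; smoothing across neighbouring bins would be another option.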


The continuous nature of the xR metric means we can differentiate between good and bad balls more accurately.  We can determine whether a dot ball was a genuinely good ball or whether the bowler just got away with one.  We can say whether a bowler deserved to be hit for a boundary or just got plain unlucky.

The figures below show the results of the binning for both right and left-handers.  Remember, the average runs per ball is about 1.24.  It can be seen that the relatively high scoring areas are anything wide of off-stump and on the batsman’s legs.  Here the xR value is about 1.4 to 1.5, or up to 9 runs per over.  To restrict the batsman to about a run a ball, the data shows that you should bowl a good length to back of a length on off-stump or just outside.  The blank bins indicate extreme values for xR due to small sample sizes, so were left out.

[Figure: right-hand batsmen xR]
[Figure: left-hand batsmen xR]

Even with this simple model some cricketing truths are apparent, namely not to give batsmen room to free their arms or bowl on their legs in T20.  There is certainly a lot of scope to improve this model.  I am yet to incorporate the positional coordinates of where the ball pitched, any movement off the pitch and so on.  I could construct separate models for spinners and seam bowlers.  I could also consider game state i.e. where is the best place to bowl in the death overs or at particular batsmen early on in their innings.

In the next blog I investigate which batsmen fare the best under this metric and whether xR correlates with other measures.

If you have any questions/suggestions please feel free to tweet me: @cricketsavant