In my first blog I described my first attempt at a metric called Expected Runs or xR. This calculated the average number of runs a batsman would score off a delivery based on the ball’s line and length as it reaches the batsman. In this blog, I extend this model by including other factors such as the speed of the ball and where it pitched.

I recently used a machine learning algorithm called k-nearest neighbours to develop an Expected Goals model. This worked out pretty well, so I thought I’d apply it to my cricket data. Put simply, this algorithm finds the k (I used 50) most similar deliveries to a particular delivery and calculates the average number of runs accrued from those. This means that it is not necessary to split bowlers into spinners and fast bowlers, as balls bowled at a typical spinner’s or seamer’s pace are likely to be grouped together.
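To make the idea concrete, here is a minimal pure-Python sketch of the nearest-neighbour average. The feature tuple (line, length, speed), the toy values and the function name `knn_expected_runs` are assumptions for illustration only, not the actual model, and it assumes the features are on comparable scales so that Euclidean distance is a sensible measure of similarity.

```python
import math

def knn_expected_runs(train_features, train_runs, query, k=50):
    """Average runs over the k training deliveries closest to `query`."""
    # Euclidean distance from the query delivery to every training delivery.
    distances = [math.dist(features, query) for features in train_features]
    # Indices of the k smallest distances (fewer if the training set is small).
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    return sum(train_runs[i] for i in nearest) / len(nearest)

# Toy data: (line, length, speed) values are invented for illustration.
train_features = [
    (0.0, 6.0, 130.0),
    (0.1, 6.2, 132.0),
    (0.2, 5.8, 128.0),
    (1.0, 3.0, 85.0),
    (1.1, 3.2, 88.0),
]
train_runs = [0, 1, 4, 1, 2]

# With k=3 the three quick deliveries are the neighbours; the slow ones are ignored.
xr = knn_expected_runs(train_features, train_runs, (0.1, 6.0, 130.0), k=3)
```

In this toy example the two slower deliveries never make it into the neighbourhood of a 130 km/h ball, which is why no explicit spinner/seamer split is needed.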

The model is trained and tested by randomly splitting the dataset into a training set (from which the patterns are identified) and a test set, which is a fifth of the original dataset. This is done 10 times to cross-validate the model and reduce the effect of any bias in a particular training set. For anyone interested, the Python code is below:
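A minimal sketch of this procedure might look like the following. It is not the original listing: the function names, the synthetic deliveries and the choice to report total xR against total actual runs on each held-out fifth are all assumptions, and only the Python standard library is used.

```python
import math
import random

def knn_predict(train, query, k):
    """Mean runs of the k training deliveries nearest to `query`."""
    nearest = sorted(train, key=lambda d: math.dist(d[0], query))[:k]
    return sum(runs for _, runs in nearest) / len(nearest)

def repeated_holdout(deliveries, k=50, n_repeats=10, test_fraction=0.2, seed=0):
    """Return (total_xr, total_actual) on the held-out set for each repeat."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        shuffled = deliveries[:]
        rng.shuffle(shuffled)  # a fresh random split on every repeat
        n_test = int(len(shuffled) * test_fraction)
        test, train = shuffled[:n_test], shuffled[n_test:]
        total_xr = sum(knn_predict(train, features, k) for features, _ in test)
        total_actual = sum(runs for _, runs in test)
        results.append((total_xr, total_actual))
    return results

# Synthetic deliveries: ((line, length, speed), runs), invented for illustration.
data_rng = random.Random(1)
deliveries = [
    (
        (data_rng.uniform(-1, 1), data_rng.uniform(0, 10), data_rng.uniform(80, 145)),
        data_rng.choice([0, 0, 1, 1, 2, 4, 6]),
    )
    for _ in range(200)
]

results = repeated_holdout(deliveries, k=5)
```

Each entry in `results` pairs the summed predictions with the summed actual runs for one random split, so the spread across the 10 entries gives a feel for how much the outcome depends on the particular split.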

From the ball-by-ball T20I data I removed any wides and erroneous deliveries. In contrast to my first xR model, I decided to mirror any deliveries to left-handers so that they can be compared directly with deliveries to right-handers. This left 41,104 balls from 208 matches. After applying the algorithm as above, the total xR was 50,978 runs compared to 50,923 actual runs, a 0.1% error. This is not surprising at all, given that the algorithm learns from this same dataset. Almost by definition, the total actual runs and xR must more or less match.
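As a small illustration of the two steps above, the sketch below mirrors a delivery for a left-hander and reproduces the error figure. The coordinate convention is an assumption: `line` is taken to be a signed sideways offset from middle stump, so mirroring is just a sign flip. The function name is hypothetical; the two totals are the ones quoted above.

```python
def mirror_for_left_hander(line, is_left_handed):
    """Flip the sideways coordinate so off side and leg side match a right-hander."""
    # Assumes `line` is a signed offset from middle stump (an illustrative convention).
    return -line if is_left_handed else line

# A ball 0.3 units to one side of a left-hander becomes -0.3 after mirroring.
mirrored = mirror_for_left_hander(0.3, is_left_handed=True)

# Totals quoted in the text.
total_xr = 50_978
total_runs = 50_923
pct_error = 100 * (total_xr - total_runs) / total_runs  # roughly 0.1
```

Any other sideways coordinate, such as where the ball pitched, would need the same flip for the mirrored deliveries to be consistent.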

As before, we can calculate the xR of individual batsmen. The plot below shows each batsman’s career runs against their xR. The straight line is y=x and acts as the dividing line between over- and under-performing their expectation.

Most of the highest-scoring batsmen are over-performing, which perhaps means that there is an aspect of their game we are not capturing with this metric. One possible factor is boundary hitting. The runs/xR ratio for a particular ball can take values up to and above six. As such, batsmen who hit a lot of boundaries can inflate their performance with this metric, not that boundary hitting is a bad thing. The usefulness of xR comes from identifying batsmen who can score more runs from deliveries than average, and this includes rotating the strike and hitting into gaps for twos and threes. The graph below shows the total number of boundaries for each player against their performance figure, measured by runs/xR.

Although there is a slight positive correlation, the R² value is just 0.119. This means that the total number of boundaries explains only 11.9% of the variance in the performance measure. The R² drops to 0.093 if we only include batsmen who have hit at least 20 boundaries, indicating that boundary count matters even less for the better batsmen.

The table below shows the top 20 performing batsmen with at least 300 runs (of which there are 52).
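For readers unfamiliar with R², the sketch below shows one common way to compute it. It assumes a simple least-squares straight-line fit, in which case R² equals the squared Pearson correlation between the two variables; how the figures above were actually obtained is not stated. The toy numbers are invented.

```python
def r_squared(x, y):
    """R² of a simple linear fit of y on x (the squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Toy example: x stands in for boundary counts, y for runs/xR.
boundaries = [1, 2, 3, 4]
performance = [1, 3, 2, 4]
r2 = r_squared(boundaries, performance)
```

An R² of 0.119 therefore corresponds to a correlation coefficient of only about 0.34.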

As before, Glenn Maxwell comes way out on top, over-performing his xR by 42%, followed by his Australian teammate Aaron Finch. Finch and Sammy have swapped places compared with the list from my first xR model. Some other notable changes include Shahid Afridi dropping from 5th to 12th in the list, and Mahela Jayawardene breaking into the top 10.
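A ranking like the one discussed above can be built by summing runs and xR per batsman, applying the minimum-runs cut-off and sorting by runs/xR. The sketch below is one way to do that. The function name, the input layout of (batsman, runs, xR) per ball and the toy values are assumptions for illustration.

```python
from collections import defaultdict

def rank_batsmen(balls, min_runs=300):
    """Rank batsmen by runs/xR, keeping only those with at least `min_runs`."""
    totals = defaultdict(lambda: [0, 0.0])  # batsman -> [runs, xR]
    for batsman, runs, xr in balls:
        totals[batsman][0] += runs
        totals[batsman][1] += xr
    ranked = [
        (name, runs, xr, runs / xr)
        for name, (runs, xr) in totals.items()
        if runs >= min_runs and xr > 0
    ]
    # Highest runs/xR first.
    return sorted(ranked, key=lambda row: row[3], reverse=True)

# Toy ball-by-ball rows: (batsman, runs scored, xR of that delivery).
toy_balls = [
    ("A", 4, 1.0), ("A", 0, 1.0), ("A", 2, 1.0),
    ("B", 1, 1.5), ("B", 1, 1.0),
    ("C", 1, 1.0),
]
ranking = rank_batsmen(toy_balls, min_runs=2)
```

Here batsman A scores 6 runs from an xR of 3 (a ratio of 2.0), B scores 2 from 2.5 (0.8), and C is dropped by the cut-off. A ratio of 1.42 would correspond to over-performing xR by 42%.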

At the other end of the scale, the table is as below:

This time Martin Guptill comes bottom of the pile instead of Mohammad Hafeez, who is now second from bottom. It is reassuring that the runs/xR figures themselves have not changed drastically. In both this and the previous model, they are mostly between 0.9 and 1.2.

The tables below show the 10 best and worst performing bowlers who have bowled at least 200 balls (there are 60 such bowlers in the dataset).

In my previous blog about xR, I questioned Hafeez’s place in the Pakistan T20 side based on his batting stats. But it appears his bowling justifies his inclusion as he has the lowest xR/ball in the list. As explained in that post, bowlers who concede more runs than expected are most likely unlucky enough to bowl to quality batsmen quite often. Sunil Narine slips down to 8th, although he still concedes over 100 runs fewer than expected.

Spinners dominate the top of the list, with the best performing pace bowler, Bhuvneshwar Kumar, coming in at 18th place with an xR/ball of 1.205. This is not entirely unexpected, as spinners have an overall economy rate of 6.89 in T20Is compared to 7.70 for pace bowlers. The full lists for both batsmen and bowlers can be obtained here.

In future blogs I will extend this analysis to ODIs and Test matches and investigate xR at a match level. I’ll also be experimenting with other machine learning algorithms to predict both runs and wickets.

If you have any suggestions or ideas of your own, please feel free to tweet me.
