In my last blog I built two simple xG models using the distance of a shot from the goal line and the angle a shot makes with the crossbar. The second model attempted to account for the fact that, for the same distance, shots from wide areas are less likely to be scored than shots from right in front of goal. However, this was not perfect as it still undervalued shots from directly in front of goal closer to the penalty spot. Ideally, our xG visualisation should look like this (courtesy of David Sumpter), with high xG shots directly in front of goal and low xG shots in the wide areas but still close to the goal. This can be achieved by combining both distance and angle to goal into one function. But this is not a trivial problem, as described in detail by Michael Caley. Through trial and error, he manages to capture the above distribution pretty well using a combination of distances, angles, and their inverses.

Another approach is to consider clusters of shots. A set of shots that are geometrically close together will have similar distances and angles, and should therefore have a similar xG. This is the concept of binning as described in this model. The number of goals divided by the number of shots from inside a particular bin gives us the xG for those shots.
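A minimal sketch of the binning idea, using entirely made-up shot data (the coordinates, conversion rate, and bin size here are illustrative assumptions, not the model described above):

```python
import numpy as np

# Hypothetical shot data: (x, y) pitch coordinates in yards and a goal flag.
rng = np.random.default_rng(0)
shots = rng.uniform(low=[0, 0], high=[40, 80], size=(1000, 2))
goals = rng.random(1000) < 0.1  # ~10% conversion, purely illustrative

# Divide the area into 5x5-yard bins.
bins_x = np.arange(0, 45, 5)
bins_y = np.arange(0, 85, 5)
shot_counts, _, _ = np.histogram2d(shots[:, 0], shots[:, 1], bins=[bins_x, bins_y])
goal_counts, _, _ = np.histogram2d(shots[goals, 0], shots[goals, 1], bins=[bins_x, bins_y])

# xG for a bin = goals from that bin / shots from that bin.
with np.errstate(invalid="ignore"):
    xg_per_bin = goal_counts / shot_counts  # NaN where a bin has no shots
```

Every shot falling in a given bin is then assiged that bin's xG value.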

My approach uses a k-nearest neighbour algorithm, a simple machine learning technique. For a particular shot, scikit-learn's KNeighborsRegressor will search for the 500 closest shots and then count up the number of goals. This number divided by 500 is the xG for that shot. The image below is a visualisation of this model showing all 226,917 shots in the dataset. This looks much more like the xG distribution from above. However, the xG of shots in the 6 yard box right in front of goal is about 0.8, whereas my previous models predicted values closer to 1. This is a result of the high value for k: because the density of shots right on the goal line is relatively low, the 500 nearest neighbours will draw upon shots at the edges of the 6 yard box.
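The nearest-neighbour step can be sketched as follows. The features, the toy scoring probability, and the smaller k are assumptions for the purpose of a runnable example (the actual model uses k=500 on ~227,000 real shots); with uniform weights, KNeighborsRegressor's prediction is simply the mean goal flag of the k nearest shots, i.e. goals divided by k:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical data: (distance, angle) features and a 0/1 goal outcome.
rng = np.random.default_rng(1)
n = 5000
distance = rng.uniform(1, 40, n)   # yards from the goal line
angle = rng.uniform(0.05, 1.5, n)  # shot angle in radians
# Toy scoring probability: closer, more central shots score more often.
p_goal = np.clip(angle / (1 + 0.1 * distance), 0, 1)
goal = (rng.random(n) < p_goal).astype(float)

X = np.column_stack([distance, angle])

# Averaging the goal flag over the k nearest shots gives the xG estimate.
model = KNeighborsRegressor(n_neighbors=100).fit(X, goal)

xg = model.predict([[6.0, 1.2]])  # xG for a close, fairly central shot
```

Since each prediction is an average of 0/1 outcomes, the estimate is always between 0 and 1 without any extra calibration step.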

There are 21,578 goals in the whole dataset. This model predicts 21,409 goals, an error of just 0.8%. We can investigate how teams and players perform under this metric. The graph below shows the total xG for each team in the dataset against the total goals they scored. The line is y=x, so over-performing teams sit above the line and under-performing teams below it. The really good teams over-perform, meaning there are aspects of their play that aren't captured by a shots-based xG model. These teams, ordered by their performance as measured by total goals divided by total xG, are shown below.
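The performance metric itself is a one-liner. A small sketch with invented team totals (the numbers are placeholders, not values from the dataset):

```python
# Hypothetical per-team totals: goals scored and summed xG.
teams = {
    "Team A": {"goals": 90, "xg": 75.0},
    "Team B": {"goals": 55, "xg": 60.0},
}

# Performance = total goals / total xG; > 1 means over-performing the model.
performance = {name: t["goals"] / t["xg"] for name, t in teams.items()}

# Rank teams from biggest over-performer down.
ranked = sorted(performance, key=performance.get, reverse=True)
```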

Similarly, we can assess players using xG. The graph below shows xG against total goals per player. Again, it seems the best players over-perform their xG numbers. A breakdown of players with at least 60 goals is shown in the table below.

Luis Suárez is over-performing his xG by a huge 72%, closely followed by Messi and Griezmann at 70%. It is interesting to note that Benzema has an xG per shot of 0.154, the highest of the players in the list. This, coupled with his high over-performance figure, suggests he is both finishing really well and getting into high-value positions to shoot. Meanwhile, his Real Madrid teammates Ronaldo and Bale have the lowest xG/shot. It seems they have their share of speculative efforts before Benzema comes in to clean up if they fail.

It is reassuring that this xG model identifies the best teams and players, in agreement with other models. The machine learning algorithms can of course be extended to include many other factors. In future blogs, I'll look to build these into my cricket models.