A Simple Expected Goals Model
Here’s what you’ll find in this post:
I devote a few paragraphs to explaining why every hockey fan should be interested in the so-called “expected goals” metric; and
I describe a very simple expected goals model that I built using only a few lines of code.
Along the way I’ll point out some aspects of the expected goals metric that are perhaps not fully understood by some people. Let’s get started.
You Should Be Interested In “Expected Goals”
If you’re unfamiliar with the expected goals metric, or you still doubt its utility, then let me convince you that the metric is useful.
I’ll start by making what should be an uncontroversial statement: data about goals and shots are relevant to understanding what happened in a game of hockey. Assuming that you’re still with me at this point, the expected goals metric is the logical next step. Fundamentally, the expected goals metric provides information about the likelihood of a shot (which is an important data point) turning into a goal (which is also important, for obvious reasons).
Here’s another statement that should be uncontroversial, and really goes to the core of the expected goals metric: some shots are more likely to go in the net than other shots. A shot taken from close to the front of the net is generally more dangerous than a shot taken from the blue line. This should not be surprising, and is plainly evident when you look at a plot of the goals scored this season (excluding empty net goals).
The expected goals metric accounts for the simple fact that a higher proportion of shots from dangerous areas turn into goals. Is it perfect? No, but it’s much more informative than a simple count of how many shots were taken.
I’ll clarify one other thing: the expected goals metric does not tell us how many goals ought to have been scored in a particular situation. Rather, the metric tells us the typical result based on past outcomes and within the limitations of the available data. Of course, in real life not every shot will be “typical”. There are many unusual circumstances that might arise in any particular case, not to mention the skill of the skater shooting the puck or the skill of the goalie trying to stop it. That’s not to say there is no relationship between expected goals and actual goals, but the expected goals metric does not tell us who “really” won a hockey game.
To summarize, while the expected goals metric is imperfect it provides data about shot quality that’s more informative than a simple shot count.
Hopefully I’ve convinced you that the expected goals metric is useful. If not, then perhaps walking through the process of building a ridiculously simple expected goals model will do that job.
How To Build An xG Model Using Fewer Than 20 Lines Of Code
I’ll start this section of the post by stating that I have no background in mathematics and I only recently taught myself how to code using R. The people who produce the popular expected goals models certainly have the upper hand here and I am not implying that I’ve somehow outdone them with my simple model. With that said, let’s see where this goes.
As far as I know, all expected goals models attempt to compute the likelihood of each shot attempt becoming a goal based on what has happened in the past. Many different variables can be used in this analysis: the location on the ice from which the shot attempt was taken, the type of shot attempt (wrist, slap, etc.), whether the shot attempt was taken off a rebound or rush opportunity … the list goes on. So if, for example, 21% of similar shot attempts turned into goals in the past then a shot attempt has an expected goals value of 0.21. To arrive at total expected goals one simply adds up the expected goals value for each shot attempt (and this is the same for either an individual skater or for a team as a whole).
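As a toy illustration of that last step (the numbers below are made up, not from any real model), summing per-shot values in R looks like this:

```r
# Hypothetical expected goals values for four shot attempts by one skater:
# each number is the historical goal rate of "similar" past attempts
shot_xg <- c(0.21, 0.05, 0.33, 0.08)

# The skater's total expected goals is simply the sum of the per-shot values
total_xg <- sum(shot_xg)
total_xg  # 0.67
```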
Let’s walk through how I made my expected goals model to flesh this out.
Warning: Boring Details About Gathering Data (Skip Ahead If You Want)
For anyone interested, here are the details about gathering the data needed to train my model:
I collected data from 3 regular seasons: 2018-2019, 2019-2020, and 2021-2022. I excluded the short COVID season (2020-2021) because things were weird back then. I expect that most other model builders prefer to use a larger collection of data but this is what I used.
I used data from a 5v5 game state. My model hasn’t been optimized for other game states.
I included data for all unblocked shot attempts, which includes missed shots (this will come up again below).
I filtered out all shot attempts taken 70+ feet from the net. In effect, my model treats all such shot attempts as having an expected goals value of 0.0 (which is an oversimplification, obviously).
End Of Boring Details About The Data
That’s the data; now, what did I do with it? Well, I used only two variables to create my expected goals model: shot distance and shot angle. Together, these two variables pin down each shot attempt’s location on the ice (in effect, they’re polar coordinates with the net as the origin). Next, I created 250 small “clusters” of shot attempts based on their location. Lastly, I computed the proportion of shot attempts in each cluster that turned into goals. That’s it.
Voilà! An incredibly simple expected goals model. It takes fewer than 20 lines of code to build. Speaking of which, the code is displayed here for any readers with an interest in such things.
Code (R) For A Simple xG Model
# Load dplyr for the data wrangling verbs used below
library(dplyr)
# Select k-means cluster data (shot distance + shot angle)
training_cluster_data <- select(training_data, c("shot_distance", "shot_angle"))
# Generate shot location clusters (250)
clusters <- kmeans(training_cluster_data, 250, nstart = 750, iter.max = 15)
# Assign clusters to the training data
training_data$cluster <- clusters$cluster
# Compute goal proportions for each cluster
# Step 1: Shot attempts per cluster
cluster_shot_attempts <- training_data %>%
  group_by(cluster) %>%
  summarise(sa_count = n())
# Step 2: Goals per cluster
cluster_goals <- training_data %>%
  filter(event_type == "GOAL") %>%
  group_by(cluster) %>%
  summarise(goal_count = n())
# Step 3: Join shot attempts and goals to the training data, then compute goal proportions
training_data <- training_data %>%
  left_join(cluster_shot_attempts, by = "cluster") %>%
  left_join(cluster_goals, by = "cluster")
# Clusters with no goals come through the join as NA, so treat them as zero
training_data$goal_count[is.na(training_data$goal_count)] <- 0
training_data <- mutate(training_data, goal_prob = goal_count / sa_count)
# Isolate each cluster's goal proportion for use with current data
sum_cluster_goal_prob <- distinct(training_data, cluster, goal_prob)
Here’s a plot of the high-danger areas produced by the model. The nets are on the left and right in this plot, and the orange/purple areas represent the locations on the ice where the highest proportion of shot attempts turned into goals.
This looks reasonable. The high-danger zone fans out from the front of the net and there are some interesting areas down near the goal line.
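For readers who want to reproduce a plot like this, here’s a minimal sketch. It assumes the goal_prob column from the code above, and it assumes shot_angle is measured in degrees relative to the centre of the net (my assumption); the small fake data frame is a stand-in for the real training data.

```r
library(dplyr)
library(ggplot2)

# Stand-in for the real training_data: a few fake attempts with the columns
# assumed above (shot_distance in feet, shot_angle in degrees, goal_prob
# from the cluster step)
training_data <- tibble(
  shot_distance = c(8, 15, 30, 55, 62),
  shot_angle    = c(5, -20, 35, -40, 10),
  goal_prob     = c(0.24, 0.15, 0.07, 0.02, 0.03)
)

# Recover x/y ice coordinates from the polar distance/angle pair
plot_data <- training_data %>%
  mutate(x = shot_distance * cos(shot_angle * pi / 180),
         y = shot_distance * sin(shot_angle * pi / 180))

# Colour each shot attempt by its cluster's goal probability
p <- ggplot(plot_data, aes(x = x, y = y, colour = goal_prob)) +
  geom_point(size = 2) +
  scale_colour_viridis_c(option = "plasma") +
  labs(colour = "Goal probability")
```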
Applying The Model To Data From This Season
Based on the above plot, my model does not produce results that are obviously bad, so now I’ll apply the model to the data from this season and see what happens. The process for applying the model to current data is incredibly simple; the “magic” requires only a few more lines of code.
The first step is to produce current data that matches the data used to train the model. This means filtering for unblocked shot attempts that were taken during a 5v5 game state and that were taken within 70 feet of the net. With the data cleaned up, next comes the magic. The shot attempts are put into their “clusters” based on location, and then the goal probabilities computed by the model are assigned to each shot attempt based on cluster. The effect is that every shot attempt is assigned its expected goals value. The final step is to simply add up all of the expected goals values for each skater (or for each team).
Again, I’ll provide the code for anyone interested in seeing it.
Code (R) For Applying The Simple xG Model To New Data
# predict_KMeans() comes from the ClusterR package
library(ClusterR)
# Select cluster data from the "working" data (shot distance + shot angle)
working_cluster_data <- select(working_data_2023, c("shot_distance", "shot_angle"))
# Determine clusters for the working data (the "magic" begins)
clusters_working_data <- predict_KMeans(working_cluster_data, clusters$centers)
# Assign clusters to the working data
working_data_2023$cluster <- clusters_working_data
# Assign goal probabilities to clusters (the "magic" ends)
working_data_2023 <- working_data_2023 %>%
  left_join(sum_cluster_goal_prob, by = "cluster")
# Add up the expected goals for each skater
current_skater_results <- working_data_2023 %>%
  group_by(event_player_1_name) %>%
  summarise(sum_xg = sum(goal_prob)) %>%
  arrange(desc(sum_xg))
I’ll plot the results here to make sure things are still going in the right direction. The idea behind this plot is the same as the one produced above but uses the clustered shot attempts from this season.
That looks OK, so now I’ll do one more sanity check. This table shows which skaters have the most expected goals according to the model.
[18 Skaters Model] Top 10 Expected Goals (5v5)
Skater | xGoals |
---|---|
Zach Hyman | 17.32984 |
Auston Matthews | 16.54210 |
Timo Meier | 16.42773 |
Brady Tkachuk | 14.79962 |
John Tavares | 14.24182 |
Anders Lee | 14.02766 |
Kyle Connor | 13.97352 |
Alex Ovechkin | 13.70057 |
Brayden Point | 13.24595 |
Jack Hughes | 13.16818 |
This looks reasonable to me, but then again I do rather like Zach Hyman.
My model is not producing results that are undeniably terrible, so let’s see how it compares to one of the most popular “real” models.
Comparing The 18 Skaters Model To The NST Model
Natural Stat Trick (“NST”) is a fantastic resource. I think most hockey fans have at least heard of NST even if they don’t actively use it. Given its popularity and availability I’m going to talk about NST’s expected goals model here.
The first thing I want to do is see which skaters have the most expected goals according to the NST model and compare the results to my model. You could say this is the moment of truth … here are the top 20 skaters from NST alongside the corresponding expected goals data from my model. The final column notes which model’s estimate came closer to the skater’s actual goal total.
[NST VS 18 Skaters] Top 20 Expected Goals (5v5)
Skater | NST | 18 | +/- | Goals | Closer |
---|---|---|---|---|---|
Timo Meier | 17.33 | 16.43 | -0.90 | 16 | 18 |
Zach Hyman | 16.72 | 17.33 | 0.61 | 10 | NST |
Auston Matthews | 15.97 | 16.54 | 0.57 | 15 | NST |
Anders Lee | 15.31 | 14.03 | -1.28 | 14 | 18 |
Brady Tkachuk | 15.27 | 14.80 | -0.47 | 12 | 18 |
Jack Hughes | 13.92 | 13.17 | -0.75 | 22 | NST |
Alex Ovechkin | 13.90 | 13.70 | -0.20 | 14 | NST |
Carter Verhaeghe | 13.31 | 11.88 | -1.43 | 15 | NST |
John Tavares | 13.26 | 14.24 | 0.98 | 11 | NST |
Kyle Connor | 13.26 | 13.97 | 0.71 | 15 | 18 |
Tage Thompson | 13.21 | 12.86 | -0.35 | 18 | NST |
Connor McDavid | 13.13 | 13.05 | -0.08 | 21 | NST |
Michael Bunting | 12.92 | 12.45 | -0.47 | 9 | 18 |
Matthew Tkachuk | 12.53 | 12.09 | -0.44 | 13 | NST |
Rickard Rakell | 12.50 | 11.45 | -1.05 | 9 | 18 |
Brayden Point | 12.20 | 13.25 | 1.05 | 19 | 18 |
David Pastrnak | 12.07 | 11.86 | -0.21 | 20 | NST |
Jordan Martinook | 12.01 | 10.96 | -1.05 | 9 | 18 |
Jason Robertson | 11.75 | 12.03 | 0.28 | 21 | 18 |
William Nylander | 11.66 | 11.57 | -0.09 | 14 | NST |
My model’s expected goals values are not hugely different from the values computed by NST, and in a good number of cases the 18 Skaters values are closer to the actual goals scored.
This next plot shows how the two models compare based on the difference between expected goals and actual goals across the entire league. The lines in the plot track the density of actual goals above or below the expected goals computed by each model. You can see that both models peak near 0.0 which means that most skaters have scored roughly the number of goals estimated by the models.
The models do not differ significantly when it comes to the relationship between expected goals and actual goals. So the 18 Skaters model appears to perform OK relative to a “real” model.
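A density plot like the one described can be built with a few lines of ggplot2. This is a sketch only: the data frame, its column names, and the placeholder values are mine, standing in for one row per skater per model.

```r
library(ggplot2)

# One row per skater per model; goal_diff = actual goals minus expected goals.
# The values below are placeholders, not the real league-wide data.
model_diffs <- data.frame(
  model = rep(c("NST", "18 Skaters"), each = 5),
  goal_diff = c(-2.1, -0.4, 0.2, 1.1, 3.0,
                -1.8, -0.2, 0.1, 0.9, 2.7)
)

# One density line per model; both should peak near 0.0 if most skaters
# score roughly as many goals as the model expects
p <- ggplot(model_diffs, aes(x = goal_diff, colour = model)) +
  geom_density() +
  labs(x = "Actual goals minus expected goals", y = "Density")
```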
I’ll pause here to note small discrepancies in the data. A handful of skaters have different 5v5 goal totals in my data versus the data pulled from NST. I haven’t been able to identify the reason for this discrepancy. In any event, I don’t think it takes away from the general point being made here.
Now For Something Different
Earlier in this post I mentioned that expected goals models generally take into account all unblocked shot attempts, including shot attempts that missed the net. It’s not obvious to me that this is the best approach. I’m sure the people who produce the “real” expected goals models have good reasons for including missed shot attempts. No doubt they did some research and concluded that it was a good idea. I, however, am unburdened by the results of any such research and I’m simply curious about what happens if those missed shot attempts are excluded.
For one thing, I expect that some skaters are more likely than other skaters to attempt shots that miss the net. Now recall that the expected goals metric is based on the “typical” shot attempt, without regard to whether the shooter regularly hits or misses his shot attempts. I wonder if this systematically rewards the wild shooters and punishes the accurate shooters.
So let’s see what happens. First, I need to retrain my model by excluding missed shot attempts. Here’s an updated plot showing the proportion of shots that turned into goals, just to make sure nothing crazy happened when the model was retrained.
That looks OK.
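For anyone curious, retraining just means filtering the misses out of the training data before the clustering step. This sketch assumes hockeyR-style event_type labels, and “MISSED_SHOT” as the label for a miss is my assumption; the small fake data frame stands in for the real training data.

```r
library(dplyr)

# Stand-in rows using hockeyR-style event_type labels
# ("MISSED_SHOT" as the label for a miss is my assumption)
training_data <- tibble(
  event_type = c("GOAL", "SHOT", "MISSED_SHOT", "SHOT", "MISSED_SHOT")
)

# Drop attempts that missed the net, then re-run the clustering code as before
training_data_shots_only <- filter(training_data, event_type != "MISSED_SHOT")
nrow(training_data_shots_only)  # 3
```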
So how did the change affect the model’s expected goals values? This table shows the “Before” and “After” values for the top 20 skaters.
[All Shot Attempts VS Shots Only] Top 20 Expected Goals (5v5)
Skater | All_Attempts | Shots_Only | +/- |
---|---|---|---|
Zach Hyman | 17.33 | 17.21 | -0.12 |
Auston Matthews | 16.54 | 14.82 | -1.72 |
Timo Meier | 16.43 | 15.34 | -1.09 |
Brady Tkachuk | 14.80 | 13.38 | -1.42 |
John Tavares | 14.24 | 12.74 | -1.50 |
Anders Lee | 14.03 | 12.50 | -1.52 |
Kyle Connor | 13.97 | 13.70 | -0.27 |
Alex Ovechkin | 13.70 | 14.39 | 0.69 |
Brayden Point | 13.25 | 13.97 | 0.72 |
Jack Hughes | 13.17 | 14.86 | 1.69 |
Connor McDavid | 13.05 | 14.59 | 1.54 |
Tage Thompson | 12.86 | 13.07 | 0.21 |
Michael Bunting | 12.45 | 10.86 | -1.59 |
Sebastian Aho | 12.15 | 11.83 | -0.32 |
Matthew Tkachuk | 12.09 | 12.32 | 0.23 |
Jason Robertson | 12.03 | 12.15 | 0.12 |
Troy Terry | 11.99 | 11.02 | -0.96 |
Carter Verhaeghe | 11.88 | 11.75 | -0.13 |
David Pastrnak | 11.86 | 13.11 | 1.25 |
Andrei Svechnikov | 11.75 | 11.24 | -0.51 |
Interesting. This change affects some skaters quite a bit but for other skaters there’s practically no change.
Now here’s how the new model compares to NST for the top 20 skaters.
[NST VS 18 Skaters (Shots Only)] Top 20 Expected Goals (5v5)
Skater | NST | 18 | +/- | Goals | Closer |
---|---|---|---|---|---|
Timo Meier | 17.33 | 15.34 | -1.99 | 16 | 18 |
Zach Hyman | 16.72 | 17.21 | 0.49 | 10 | NST |
Auston Matthews | 15.97 | 14.82 | -1.15 | 15 | 18 |
Anders Lee | 15.31 | 12.50 | -2.81 | 14 | NST |
Brady Tkachuk | 15.27 | 13.38 | -1.89 | 12 | 18 |
Jack Hughes | 13.92 | 14.86 | 0.94 | 22 | 18 |
Alex Ovechkin | 13.90 | 14.39 | 0.49 | 14 | NST |
Carter Verhaeghe | 13.31 | 11.75 | -1.56 | 15 | NST |
John Tavares | 13.26 | 12.74 | -0.52 | 11 | 18 |
Kyle Connor | 13.26 | 13.70 | 0.44 | 15 | 18 |
Tage Thompson | 13.21 | 13.07 | -0.14 | 18 | NST |
Connor McDavid | 13.13 | 14.59 | 1.46 | 21 | 18 |
Michael Bunting | 12.92 | 10.86 | -2.06 | 9 | 18 |
Matthew Tkachuk | 12.53 | 12.32 | -0.21 | 13 | NST |
Rickard Rakell | 12.50 | 11.24 | -1.26 | 9 | 18 |
Brayden Point | 12.20 | 13.97 | 1.77 | 19 | 18 |
David Pastrnak | 12.07 | 13.11 | 1.04 | 20 | 18 |
Jordan Martinook | 12.01 | 9.73 | -2.28 | 9 | 18 |
Jason Robertson | 11.75 | 12.15 | 0.40 | 21 | 18 |
William Nylander | 11.66 | 11.87 | 0.21 | 14 | 18 |
Very interesting. The expected goals values computed by the 18 Skaters model are now closer to the actual goals scored in a majority of cases. Maybe this approach has some merit (or maybe it’s just a lucky result based on about half a season’s worth of data).
Now let’s check league-wide results.
No breakthrough here.
Just out of curiosity, what happens if I include only the skaters who have scored at least 5 goals?
When looking at the skaters who actually score goals the Shots Only model has the highest peak near 0.0. It got there by being closer on the “Fewer Goals Than Expected” side of the plot (in other words, the model produced fewer instances where the expected goals exceeded the actual goals). I’m not going to read too much into it, but I am intrigued by this outcome.
What’s Next?
I use the expected goals metric to evaluate skaters for fantasy hockey. The pool of skaters who are relevant in fantasy hockey is smaller than the league as a whole, and the “typical” fantasy hockey skater is likely to be a more talented shooter than the “typical” league-wide skater. My desire to account for this difference is what got me thinking about building my own expected goals model.
In the past I’ve dealt with this issue by computing a skater’s individual expected goal differential and using it to adjust his expected goals. Basically, if a skater has a history of scoring more goals than “expected” then I adjust all his expected goals values upwards to reflect his past performance (and vice versa). That approach doesn’t work at the level of specific shot attempts though, which is what the expected goals metric is all about. I started to ask myself some questions about alternative solutions.
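That old adjustment can be sketched as a simple scaling function. The function and argument names here are mine, purely for illustration:

```r
# Scale a skater's current expected goals by his historical finishing ratio
# (actual goals divided by expected goals over past seasons)
adjust_xg <- function(current_xg, past_goals, past_xg) {
  if (past_xg <= 0) return(current_xg)  # no shooting history: leave the value alone
  current_xg * (past_goals / past_xg)
}

# A skater who historically scored 90 goals on 75 expected goals gets a 20% bump
adjust_xg(13.2, past_goals = 90, past_xg = 75)  # 15.84
```

Note that this scales a skater’s season total, which is exactly the limitation described above: it says nothing about any specific shot attempt.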
Can I solve this problem by coming from the opposite direction? What would happen if I raised the talent level of the “typical” shooter in the expected goals model itself? Would this produce a more reliable relationship between expected goals and actual goals for the skaters who are most relevant to me?
I haven’t answered these questions yet. Watch for a future post on the topic.
Cheers,
Mark (18 Skaters)
The Data
Data used for the 18 Skaters models were pulled from NHL.com using hockeyR
Data indicated as being from NST came from, well, Natural Stat Trick
Data is current as of 2023-02-02