Simple PWHL xG Model
Introduction
This document describes an expected goals model for the PWHL.
Important features:
the model is a Multivariate Adaptive Regression Splines model (a “MARS model”); and
the variables used to predict goals are shot distance, shot angle, and shot rebound status.
That’s not many variables, obviously. There isn’t much PWHL data available (72 regular season games) so I used only the most important variables for an xG model.
Data for the model were pulled from the PWHL’s API (using functions that I posted on GitHub). There are anomalies and errors in the data. The steps I took to “fix” the data are set out in painful detail below. Most people will have no interest in those details and can skip them. However, if you plan to use data from the PWHL’s API then you must be aware of the issues with the “raw” data.
Basic Setup
Load the packages and the raw play-by-play data (this assumes the data are saved in the working directory).
#install.packages("tidymodels")
#install.packages("readr")
#install.packages("vip")
#install.packages("kableExtra")
library(tidymodels)
library(readr)
library(vip)
library(kableExtra)
<- read_rds("season_one_pbp.rds") raw_data
Functions
Set out below are four functions that adjust and augment the raw play-by-play data.
Adjust Event Location
The x | y location data from the PWHL’s API are on a scale of 600:300 (which does not match the dimensions of a regulation rink).
This function converts the location data to a scale of 200:85 and places center ice at 0,0. After the conversion the axes represent distance in feet. This adjustment might not be appropriate in every case; however, it seems to produce reasonable results.
<- function(pbp_data) {
adjust_event_location
<- pbp_data |>
pbp_data mutate(x_location = (x_location - 300) * 0.3333,
y_location = (y_location - 150) * 0.2833)
return(pbp_data)
}
Add Shot Distance
This function adds a shot distance variable to the play-by-play data. The distance is measured in feet.
<- function(pbp_data) {
add_shot_distance
<- pbp_data |>
find_sides filter(event == "shot") |>
group_by(game_id,
|>
event_team) summarize(mean_shot = mean(x_location,
na.rm = TRUE),
.groups = "drop")
<- pbp_data |>
pbp_data left_join(find_sides,
by = join_by(game_id,
event_team))
<- pbp_data |>
pbp_data mutate(distance = case_when(
> 0 & event == "shot" ~ round(abs(sqrt((x_location - 89)^2 + (y_location)^2)), 1),
mean_shot < 0 & event == "shot" ~ round(abs(sqrt((x_location - (-89))^2 + (y_location)^2)), 1)),
mean_shot .after = y_location)
<- pbp_data |>
pbp_data select(-mean_shot)
return(pbp_data)
}
Add Shot Angle
This function adds a shot angle variable to the play-by-play data. The angle is measured in degrees from the center of the net.
<- function(pbp_data) {
add_shot_angle
<- pbp_data |>
find_sides filter(event == "shot") |>
group_by(game_id,
|>
event_team) summarize(mean_shot = mean(x_location,
na.rm = TRUE),
.groups = "drop")
<- pbp_data |>
pbp_data left_join(find_sides,
by = join_by(game_id,
event_team))
<- pbp_data |>
pbp_data mutate(angle = case_when(
> 0 & event == "shot" ~ round(abs(atan((0-y_location) / (89-x_location)) * (180 / pi)), 1),
mean_shot < 0 & event == "shot" ~ round(abs(atan((0-y_location) / (-89-x_location)) * (180 / pi)), 1)),
mean_shot .after = distance) |>
mutate(angle = ifelse((mean_shot > 0 & x_location > 89) | (mean_shot < 0 & x_location < -89), 180 - angle, angle))
<- pbp_data |>
pbp_data select(-mean_shot)
return(pbp_data)
}
Add Rebound Shots
This function adds a shot rebound logical variable to the play-by-play data. A shot is considered a rebound opportunity if it is taken within 2 seconds of a prior shot.
<- function(pbp_data) {
add_rebound
<- pbp_data |>
pbp_data mutate(is_rebound = if_else((event == "shot" & lag(event) == "shot") & (game_seconds - lag(game_seconds)) < 3, TRUE, FALSE),
.after = is_goal)
return(pbp_data)
}
Data Cleaning | EDA
Adjust the x | y locations and add shot distance to the play-by-play data - this will be helpful for exploring and “fixing” the data.
<- raw_data |>
clean_data adjust_event_location() |>
add_shot_distance()
Mean x_location
Check for anomalies in the average x_location for shots.
There are some odd results here, especially in game_id 3, 9, 23, 35, 38, 40, 43, and 71.
Warning: this next section is a grind to get through.
I’ll plot the suspicious results, game-by-game. I’ll also plot my “fix” for any potential errors. Generally speaking, I decided how to fix the results by looking for suspicious patterns in the underlying data and by looking at the results posted on the PWHL website. I have not reviewed video of each game. That would be the best way to audit the location data but would also be hugely time consuming.
Note that the shot locations are suspiciously close to the middle of the ice. I’ll return to this issue below. For now, here’s my fix: flip the coordinates (both x and y) for periods 2 and 4.
That looks a little better. Now for game_id 9.
My fix: flip the coordinates (both x and y) for period 1.
That looks a little better (the long-distance goal is an empty net goal). Now for game_id 23.
This is probably the most suspicious case of the shots being too close to the middle of the ice - I’ll come back to this issue below.
My fix: flip the coordinates (both x and y) for periods 1 and 4.
That looks a little better except for the long-distance goal. I’ll return to that later. Now for game_id 35.
My fix: flip the coordinates (both x and y) for all long-distance shots (which appear in odd clumps in the data).
That looks a little better. Now for game_id 38.
My fix: flip the coordinates (both x and y) for all long-distance shots.
That looks a little better. Now for game_id 40.
My fix: flip the coordinates (both x and y) for periods 2 and 4.
That looks a little better. Now for game_id 43.
My fix: flip the coordinates (both x and y) for long-distance shots in period 1.
That looks a little better. Now for game_id 71.
My fix: flip the coordinates (both x and y) for period 2, plus a subset of shots by OTT in period 3.
That looks a little better.
That was painful. How does the mean x_location plot look now?
This looks much better but there could be more errors in the data. To find other potential errors, replace the distance data (using the new x | y locations) and then summarize long-distance shots that are marked as “quality” in the PWHL’s data.
game_id | Errors |
---|---|
3 | 1 |
9 | 2 |
10 | 1 |
11 | 2 |
12 | 2 |
13 | 1 |
33 | 1 |
37 | 2 |
40 | 1 |
44 | 1 |
50 | 1 |
53 | 2 |
There are some potential errors (and data from early in the season seems especially suspicious). I’ll flip the coordinates for these shots and then reset the shot distance variable in the clean data.
Repeat the potential errors summary used above.
game_id | location_errors |
---|---|
The potential errors no longer appear in the data.
Max | Min y_location
Check for anomalies in the y_location data by looking at the maximum and minimum y_location for the shots in each game.
There are some suspicious results here, especially the first few games of the season (but also look at game_id 23). As noted above, the y_location for some games seems too close to the middle of the ice. The trend line shows the max and min y_location values are narrower at the start of the season.
Any errors in the y_location data will obviously affect the xG model (decreasing the angle and distance of the shots). I’m not going to apply an arbitrary adjustment, and I’m not going to watch video for the most suspicious games. I’ll leave the y_location data “as-is” but I definitely have concerns about it.
Mean Shot Distance
Plot the average shot distance by each team for every game_id as a further check for location errors.
No massive outliers here. While I expect there are still some errors in the data I’ve probably caught a good chunk of them.
Location Of Goals
Plot the location of all goals (excluding empty net goals and penalty shots).
There are two suspicious goals here: one in the neutral zone and one in the defensive zone (the yellow point).
I watched both goals on YouTube. The defensive zone goal was not a long-distance shot. The x_location needs to be flipped for this goal. The neutral zone goal was from long-distance - the puck took a funny bounce off the boards and went in the net when the goalie went out to play it.
I’ll fix the defensive zone goal by flipping the x_location and recomputing the distance variable.
Update the plot to make sure that fix worked.
That looks about right.
Explore Shot Variables
The above plots showed that the distance variable seems to be working OK. Now add the angle and is_rebound variables to the play-by-play data and visualize them.
Repeat the above goals plot but show the angle of the shots using colour.
The angle variable seems to be working properly with purple in the middle of the ice and a transition to yellow as the angle rotates in either direction.
Now do the same with rebounds.
Most of the rebounds are close to the front of the net which seems right.
Rebounds
It’s well-known that rebound opportunities have a higher probability of turning into goals and are an important variable in an xG model. Just to confirm that, here’s a comparison of shooting percentages for all shots (excluding empty nets and penalty shots) versus shots after a rebound.
Rebound Shot % | Regular Shot % |
---|---|
20.8 | 7.8 |
Rebound opportunities are more likely to turn into goals.
In passing, look at that regular shooting percentage! Goalies in the PWHL have a much higher save % than goalies in the NHL. That’s interesting given that PWHL goalies are smaller than NHL goalies (on average). I have a couple of ideas about what might be happening here but I haven’t dug into it. It will be interesting to see if this changes going forward. Any change will be relevant to an xG model.
Distance
Plot a histogram showing the relationship between shots, goals, and shot distance.
Show the same data but as proportions.
The xG model will likely benefit from a spline for the distance variable.
Angle
Plot the relationship between shots, goals, and shot angle. Remember: the middle of the ice is “0” in these plots.
Show the same data but as proportions.
The xG model will likely benefit from a spline for the angle variable.
There’s an issue here when it comes to extreme angles (90+). I’m guessing that many shot attempts from these angles are not recorded as shots. This would increase the proportion of recorded shots that turn into goals. Having noted the issue, I’ll ignore it for now.
Completeness Of Data
There are anomalies in the game_ids - they are not sequential. Here are the number of unique game_ids in the data.
Total Games |
---|
72 |
There were 72 games in the PWHL regular season.
Check the top goal scorers in the data.
Skater | Goals |
---|---|
Natalie Spooner | 20 |
Grace Zumwinkle | 11 |
Sarah Nurse | 11 |
Marie-Philip Poulin | 10 |
Laura Stacey | 10 |
Daryl Watts | 10 |
Gabbie Hughes | 9 |
Brianne Jenner | 9 |
Alex Carpenter | 8 |
Jade Downie-Landry | 8 |
These totals match the totals on the PWHL website.
xG Model
That’s it for tidying up the data.
Prepare Model Data
This step does the following to the cleaned play-by-play data:
removes shots on empty nets;
filters for the “shot” event; and
converts is_goal and is_rebound to a factor.
The model will not make predictions for empty nets or for penalty shots.
Note: there’s one shot with NA in the is_rebound variable. I’ll assume it should be FALSE and fill the NA.
<- clean_data |>
model_data mutate(en = if_else(lead(goal_is_en) == TRUE, TRUE, FALSE)) |>
mutate(en = if_else(is.na(en), FALSE, en)) |>
filter(event == "shot" & en == FALSE) |>
mutate(is_rebound = if_else(is.na(is_rebound), FALSE, is_rebound))
$is_goal <- factor(model_data$is_goal, levels = c("TRUE", "FALSE"))
model_data
$is_rebound <- as.factor(model_data$is_rebound) model_data
Now split the data (strata = is_goal) for training and testing, then create folds.
set.seed(18)
<- initial_split(data = model_data,
split_data strata = is_goal)
<- training(split_data)
training_data <- testing(split_data)
testing_data
<- vfold_cv(training_data,
data_folds v = 10,
repeats = 3,
strata = is_goal)
Explore Models
Three types of models will be explored below: 1) logistic regression; 2) MARS; and 3) XGBoost.
Logistic Regression
Fit some logistic regression models using a few different recipes. The steps here are:
write various recipes using different variables and processing; and
fit models using the recipes.
<- recipe(is_goal ~ distance +
rec_logistic_base
angle, data = training_data)
<- recipe(is_goal ~ distance +
rec_logistic_rebound +
angle
is_rebound, data = training_data) |>
step_dummy(all_nominal_predictors())
<- recipe(is_goal ~ distance +
rec_logistic_splines +
angle
is_rebound, data = training_data) |>
step_dummy(all_nominal_predictors()) |>
step_ns(angle,
deg_free = 2) |>
step_ns(distance,
deg_free = 3)
<- recipe(is_goal ~ distance +
rec_logistic_interaction +
angle
is_rebound, data = training_data) |>
step_dummy(all_nominal_predictors()) |>
step_interact(terms = ~ distance:angle)
<- list(base = rec_logistic_base,
rec_logistic_list rebound = rec_logistic_rebound,
splines = rec_logistic_splines,
interaction = rec_logistic_interaction)
<- workflow_set(rec_logistic_list,
logistic_models list(logistic = logistic_reg())) |>
workflow_map("fit_resamples",
seed = 18,
#verbose = TRUE,
resamples = data_folds)
The metric used to evaluate all the potential models in this document is area under the ROC curve, or roc_auc. For this metric, a larger number (closer to 1) is good.
Collect the metrics.
wflow_id | model | .metric | mean | std_err |
---|---|---|---|---|
interaction_logistic | logistic_reg | roc_auc | 0.698 | 0.00936 |
splines_logistic | logistic_reg | roc_auc | 0.696 | 0.00944 |
rebound_logistic | logistic_reg | roc_auc | 0.696 | 0.00977 |
base_logistic | logistic_reg | roc_auc | 0.691 | 0.00945 |
The model with the interaction term (distance:angle) performed the best.
MARS
Repeat the same process with some MARS models (excluding the splines recipe).
<- recipe(is_goal ~ distance +
rec_mars_base
angle, data = training_data)
<- recipe(is_goal ~ distance +
rec_mars_rebound +
angle
is_rebound, data = training_data) |>
step_dummy(all_nominal_predictors())
<- recipe(is_goal ~ distance +
rec_mars_interaction +
angle
is_rebound, data = training_data) |>
step_dummy(all_nominal_predictors()) |>
step_interact(terms = ~ distance:angle)
<- list(base = rec_mars_base,
rec_mars_list rebound = rec_mars_rebound,
interaction = rec_mars_interaction)
<- workflow_set(rec_mars_list,
mars_models list(mars = mars(mode = "classification"))) |>
workflow_map("fit_resamples",
seed = 18,
#verbose = TRUE,
resamples = data_folds)
Collect the metrics.
wflow_id | model | .metric | mean | std_err |
---|---|---|---|---|
rebound_mars | mars | roc_auc | 0.692 | 0.00977 |
base_mars | mars | roc_auc | 0.688 | 0.00936 |
interaction_mars | mars | roc_auc | 0.682 | 0.00946 |
The MARS models did not perform as well as the logistic regression models. The best performing model had three variables (distance, angle, and is_rebound).
XGBoost
I’ll try XGBoost here because it has the potential to perform well for this task (my NHL xG model is an XGBoost model). I expect the small amount of data to be a problem here, though.
<- recipe(is_goal ~ distance +
rec_xgb_base
angle, data = training_data)
<- recipe(is_goal ~ distance +
rec_xgb_rebound +
angle
is_rebound, data = training_data) |>
step_dummy(all_nominal_predictors(),
one_hot = TRUE)
<- recipe(is_goal ~ distance +
rec_xgb_interaction +
angle
is_rebound, data = training_data) |>
step_dummy(all_nominal_predictors(),
one_hot = TRUE) |>
step_interact(terms = ~ distance:angle)
<- list(base = rec_xgb_base,
rec_xgb_list rebound = rec_xgb_rebound,
interaction = rec_xgb_interaction)
<- workflow_set(rec_xgb_list,
xgb_models list(xgb = boost_tree(mode = "classification"))) |>
workflow_map("fit_resamples",
seed = 18,
#verbose = TRUE,
resamples = data_folds)
Collect the metrics.
wflow_id | model | .metric | mean | std_err |
---|---|---|---|---|
interaction_xgb | boost_tree | roc_auc | 0.662 | 0.01004 |
base_xgb | boost_tree | roc_auc | 0.662 | 0.00918 |
rebound_xgb | boost_tree | roc_auc | 0.661 | 0.00939 |
The XGBoost model was the worst performer.
Metrics Summary
wflow_id | model | .metric | mean | std_err |
---|---|---|---|---|
interaction_logistic | logistic_reg | roc_auc | 0.698 | 0.00936 |
splines_logistic | logistic_reg | roc_auc | 0.696 | 0.00944 |
rebound_logistic | logistic_reg | roc_auc | 0.696 | 0.00977 |
rebound_mars | mars | roc_auc | 0.692 | 0.00977 |
base_logistic | logistic_reg | roc_auc | 0.691 | 0.00945 |
base_mars | mars | roc_auc | 0.688 | 0.00936 |
interaction_mars | mars | roc_auc | 0.682 | 0.00946 |
interaction_xgb | boost_tree | roc_auc | 0.662 | 0.01004 |
base_xgb | boost_tree | roc_auc | 0.662 | 0.00918 |
rebound_xgb | boost_tree | roc_auc | 0.661 | 0.00939 |
The logistic regression models top the list, with the “interaction” recipe performing the best. It should be acknowledged that none of these models has great performance.
Training | Testing Data
Select the best version of each type of model and fit it using all of the training data. Check results using the testing data.
Logistic Regression
model | .metric | .estimate |
---|---|---|
logistic | roc_auc | 0.6587111 |
There was a drop in performance here.
MARS
model | .metric | .estimate |
---|---|---|
mars | roc_auc | 0.6618625 |
Now the MARS model outperforms the logistic regression model.
XGBoost
model | .metric | .estimate |
---|---|---|
xgboost | roc_auc | 0.6477014 |
XGBoost is still the worst performer.
Variable Importance
The MARS model was the best performer when given more training data, and a model with splines is probably a decent choice here. I’ll go with it.
Plot the importance of each variable in the MARS model.
Final Fit
Fit And Save The Final xG Model
Fit the MARS model using all available data and save the fit model to the working directory.
<- workflow() |>
xg_model add_model(mars(mode = "classification")) |>
add_recipe(rec_mars_rebound) |>
fit(data = model_data)
write_rds(xg_model, "pwhl_xg_model.rds")
Check roc_auc
Check the roc_auc metric for the final model.
model | .metric | .estimate |
---|---|---|
mars | roc_auc | 0.6921291 |
The roc_auc metric improved which is not surprising given that the model was fit using all of the data.
Variable Importance
Check the variable importance for the final model.
The order of the variables is the same but the level of importance changed a little.
Explore Results
Remember: the predictions made by this xG model do not include empty nets and penalty shots.
I think that looks reasonable. Now look at the distribution of xG values.
Show the same data as proportions.
The overall trend is good but the results are a little uneven. It would be nice to have more data :)
Now explore some actual predictions. Here are the top 10 xG leaders from season one according to this model.
Skater | Team | xG | Goals |
---|---|---|---|
Natalie Spooner | TOR | 10.6 | 19 |
Kendall Coyne Schofield | MIN | 7.7 | 6 |
Grace Zumwinkle | MIN | 7.3 | 10 |
Laura Stacey | MTL | 7.3 | 10 |
Alex Carpenter | NY | 7.3 | 8 |
Hayley Scamurra | OTT | 7.1 | 5 |
Jessie Eldridge | NY | 6.9 | 7 |
Hilary Knight | BOS | 6.8 | 6 |
Gabbie Hughes | OTT | 6.0 | 8 |
Blayre Turnbull | TOR | 5.8 | 3 |
Natalie Spooner really separated from the pack (and she had by far the most actual goals as well).
See the cumulative xG for each team according to this model.
Team | xG |
---|---|
MIN | 60.2 |
OTT | 54.3 |
TOR | 54.3 |
MTL | 52.7 |
NY | 51.2 |
BOS | 48.2 |
That’s all,
Mark (18 Skaters)