Simple PWHL xG Model

Introduction

This document describes an expected goals model for the PWHL.

Important features:

the model is a Multivariate Adaptive Regression Splines model (a “MARS model”); and
the variables used to predict goals are shot distance, shot angle, and shot rebound status.

That’s not many variables, obviously. There isn’t much PWHL data available (72 regular season games) so I used only the most important variables for an xG model.

Data for the model were pulled from the PWHL’s API (using functions that I posted on GitHub). There are anomalies and errors in the data. The steps I took to “fix” the data are set out in painful detail below. Most people will have no interest in those details and can skip them. However, if you plan to use data from the PWHL’s API then you must be aware of the issues with the “raw” data.

Basic Setup

Load the packages and the raw play-by-play data (this assumes the data are saved in the working directory).

#install.packages("tidymodels")
#install.packages("readr")
#install.packages("vip")
#install.packages("kableExtra")

library(tidymodels)
library(readr)
library(vip)
library(kableExtra)

raw_data <- read_rds("season_one_pbp.rds")

Functions

Set out below are four functions that adjust and augment the raw play-by-play data.

Adjust Event Location

The x | y location data from the PWHL’s API are on a scale of 600:300 (which does not match the dimensions of a regulation rink).

This function converts the location data to a scale of 200:85 and places center ice at 0,0. After the conversion the axes represent distance in feet. This adjustment might not be appropriate in every case; however, it seems to produce reasonable results.

adjust_event_location <- function(pbp_data) {
        
        pbp_data <- pbp_data |>
                mutate(x_location = (x_location - 300) * 0.3333,
                       y_location = (y_location - 150) * 0.2833)
        
        return(pbp_data)
        
}

Add Shot Distance

This function adds a shot distance variable to the play-by-play data. The distance is measured in feet.

add_shot_distance <- function(pbp_data) {
        
        find_sides <- pbp_data |>
                filter(event == "shot") |>
                group_by(game_id, 
                         event_team) |>
                summarize(mean_shot = mean(x_location,
                                           na.rm = TRUE),
                          .groups = "drop") 
        
        pbp_data <- pbp_data |>
                left_join(find_sides, 
                          by = join_by(game_id, 
                                       event_team))
        
        pbp_data <- pbp_data |>
                mutate(distance = case_when(
                        mean_shot > 0 & event == "shot" ~ round(abs(sqrt((x_location - 89)^2 + (y_location)^2)), 1),
                        mean_shot < 0 & event == "shot" ~ round(abs(sqrt((x_location - (-89))^2 + (y_location)^2)), 1)),
                       .after = y_location)
        
        pbp_data <- pbp_data |>
                select(-mean_shot)
        
        return(pbp_data)
        
}

Add Shot Angle

This function adds a shot angle variable to the play-by-play data. The angle is measured in degrees from the center of the net.

add_shot_angle <- function(pbp_data) {
        
        find_sides <- pbp_data |>
                filter(event == "shot") |>
                group_by(game_id, 
                         event_team) |>
                summarize(mean_shot = mean(x_location,
                                           na.rm = TRUE),
                          .groups = "drop") 
        
        pbp_data <- pbp_data |>
                left_join(find_sides, 
                          by = join_by(game_id, 
                                       event_team))
        
        pbp_data <- pbp_data |>
                mutate(angle = case_when(
                        mean_shot > 0 & event == "shot" ~ round(abs(atan((0-y_location) / (89-x_location)) * (180 / pi)), 1),
                        mean_shot < 0 & event == "shot" ~ round(abs(atan((0-y_location) / (-89-x_location)) * (180 / pi)), 1)),
                       .after = distance) |>
                mutate(angle = ifelse((mean_shot > 0 & x_location > 89) | (mean_shot < 0 & x_location < -89), 180 - angle, angle))
        
        pbp_data <- pbp_data |>
                select(-mean_shot)
        
        return(pbp_data)
        
}

Add Rebound Shots

This function adds a shot rebound logical variable to the play-by-play data. A shot is considered a rebound opportunity if it is taken within 2 seconds of a prior shot.

add_rebound <- function(pbp_data) {
        
        pbp_data <- pbp_data |>
                mutate(is_rebound = if_else((event == "shot" & lag(event) == "shot") & (game_seconds - lag(game_seconds)) < 3, TRUE, FALSE),
                       .after = is_goal)
        
        return(pbp_data)
        
}

Data Cleaning | EDA

Adjust the x | y locations and add shot distance to the play-by-play data - this will be helpful for exploring and “fixing” the data.

clean_data <- raw_data |>
        adjust_event_location() |>
        add_shot_distance()

Mean x_location

Check for anomalies in the average x_location for shots.

There are some odd results here, especially in game_id 3, 9, 23, 35, 38, 40, 43, and 71.

Warning: this next section is a grind to get through.

I’ll plot the suspicious results, game-by-game. I’ll also plot my “fix” for any potential errors. Generally speaking, I decided how to fix the results by looking for suspicious patterns in the underlying data and by looking at the results posted on the PWHL website. I have not reviewed video of each game. That would be the best way to audit the location data but would also be hugely time consuming.

Note that the shot locations are suspiciously close to the middle of the ice. I’ll return to this issue below. For now, here’s my fix: flip the coordinates (both x and y) for periods 2 and 4.

That looks a little better. Now for game_id 9.

My fix: flip the coordinates (both x and y) for period 1.

That looks a little better (the long-distance goal is an empty net goal). Now for game_id 23.

This is probably the most suspicious case of the shots being too close to the middle of the ice - I’ll come back to this issue below.

My fix: flip the coordinates (both x and y) for periods 1 and 4.

That looks a little better except for the long-distance goal. I’ll return to that later. Now for game_id 35.

My fix: flip the coordinates (both x and y) for all long-distance shots (which appear in odd clumps in the data).

That looks a little better. Now for game_id 38.

My fix: flip the coordinates (both x and y) for all long-distance shots.

That looks a little better. Now for game_id 40.

My fix: flip the coordinates (both x and y) for periods 2 and 4.

That looks a little better. Now for game_id 43.

My fix: flip the coordinates (both x and y) for long-distance shots in period 1.

That looks a little better. Now for game_id 71.

My fix: flip the coordinates (both x and y) for period 2, plus a subset of shots by OTT in period 3.

That looks a little better.

That was painful. How does the mean x_location plot look now?

This looks much better but there could be more errors in the data. To find other potential errors, replace the distance data (using the new x | y locations) and then summarize long-distance shots that are marked as “quality” in the PWHL’s data.

game_id	Errors
3	1
9	2
10	1
11	2
12	2
13	1
33	1
37	2
40	1
44	1
50	1
53	2

There are some potential errors (and data from early in the season seems especially suspicious). I’ll flip the coordinates for these shots and then reset the shot distance variable in the clean data.

Repeat the potential errors summary used above.

game_id	location_errors

The potential errors no longer appear in the data.

Max | Min y_location

Check for anomalies in the y_location data by looking at the maximum and minimum y_location for the shots in each game.

There are some suspicious results here, especially the first few games of the season (but also look at game_id 23). As noted above, the y_location for some games seems too close to the middle of the ice. The trend line shows the max and min y_location values are narrower at the start of the season.

Any errors in the y_location data will obviously affect the xG model (decreasing the angle and distance of the shots). I’m not going to apply an arbitrary adjustment, and I’m not going to watch video for the most suspicious games. I’ll leave the y_location data “as-is” but I definitely have concerns about it.

Mean Shot Distance

Plot the average shot distance by each team for every game_id as a further check for location errors.

No massive outliers here. While I expect there are still some errors in the data I’ve probably caught a good chunk of them.

Location Of Goals

Plot the location of all goals (excluding empty net goals and penalty shots).

There are two suspicious goals here: one in the neutral zone and one in the defensive zone (the yellow point).

I watched both goals on YouTube. The defensive zone goal was not a long-distance shot. The x_location needs to be flipped for this goal. The neutral zone goal was from long-distance - the puck took a funny bounce off the boards and went in the net when the goalie went out to play it.

I’ll fix the defensive zone goal by flipping the x_location and recomputing the distance variable.

Update the plot to make sure that fix worked.

That looks about right.

Explore Shot Variables

The above plots showed that the distance variable seems to be working OK. Now add the angle and is_rebound variables to the play-by-play data and visualize them.

Repeat the above goals plot but show the angle of the shots using colour.

The angle variable seems to be working properly with purple in the middle of the ice and a transition to yellow as the angle rotates in either direction.

Now do the same with rebounds.

Most of the rebounds are close to the front of the net which seems right.

Rebounds

It’s well-known that rebound opportunities have a higher probability of turning into goals and are an important variable in an xG model. Just to confirm that, here’s a comparison of shooting percentages for all shots (excluding empty nets and penalty shots) versus shots after a rebound.

Rebound Shot %	Regular Shot %
20.8	7.8

Rebound opportunities are more likely to turn into goals.

In passing, look at that regular shooting percentage! Goalies in the PWHL have a much higher save % than goalies in the NHL. That’s interesting given that PWHL goalies are smaller than NHL goalies (on average). I have a couple of ideas about what might be happening here but I haven’t dug into it. It will be interesting to see if this changes going forward. Any change will be relevant to an xG model.

Distance

Plot a histogram showing the relationship between shots, goals, and shot distance.

Show the same data but as proportions.

The xG model will likely benefit from a spline for the distance variable.

Angle

Plot the relationship between shots, goals, and shot angle. Remember: the middle of the ice is “0” in these plots.

Show the same data but as proportions.

The xG model will likely benefit from a spline for the angle variable.

There’s an issue here when it comes to extreme angles (90+). I’m guessing that many shot attempts from these angles are not recorded as shots. This would increase the proportion of recorded shots that turn into goals. Having noted the issue, I’ll ignore it for now.

Completeness Of Data

There are anomalies in the game_ids - they are not sequential. Here are the number of unique game_ids in the data.

Total Games
72

There were 72 games in the PWHL regular season.

Check the top goal scorers in the data.

Skater	Goals
Natalie Spooner	20
Grace Zumwinkle	11
Sarah Nurse	11
Marie-Philip Poulin	10
Laura Stacey	10
Daryl Watts	10
Gabbie Hughes	9
Brianne Jenner	9
Alex Carpenter	8
Jade Downie-Landry	8

These totals match the totals on the PWHL website.

xG Model

That’s it for tidying up the data.

Prepare Model Data

This step does the following to the cleaned play-by-play data:

removes shots on empty nets;
filters for the “shot” event; and
converts is_goal and is_rebound to a factor.

The model will not make predictions for empty nets or for penalty shots.

Note: there’s one shot with NA in the is_rebound variable. I’ll assume it should be FALSE and fill the NA.

model_data <- clean_data |>
        mutate(en = if_else(lead(goal_is_en) == TRUE, TRUE, FALSE)) |>
        mutate(en = if_else(is.na(en), FALSE, en)) |>
        filter(event == "shot" & en == FALSE) |>
        mutate(is_rebound = if_else(is.na(is_rebound), FALSE, is_rebound))

model_data$is_goal <- factor(model_data$is_goal, levels = c("TRUE", "FALSE"))

model_data$is_rebound <- as.factor(model_data$is_rebound)

Now split the data (strata = is_goal) for training and testing, then create folds.

set.seed(18)

split_data <- initial_split(data = model_data,
                            strata = is_goal)

training_data <- training(split_data)
testing_data <- testing(split_data)

data_folds <- vfold_cv(training_data,
                       v = 10,
                       repeats = 3,
                       strata = is_goal)

Explore Models

Three types of models will be explored below: 1) logistic regression; 2) MARS; and 3) XGBoost.

Logistic Regression

Fit some logistic regression models using a few different recipes. The steps here are:

write various recipes using different variables and processing; and
fit models using the recipes.

rec_logistic_base <- recipe(is_goal ~ distance +
                                    angle, 
                            data = training_data)

rec_logistic_rebound <- recipe(is_goal ~ distance +
                                       angle +
                                       is_rebound, 
                               data = training_data) |>
        step_dummy(all_nominal_predictors())

rec_logistic_splines <- recipe(is_goal ~ distance +
                                       angle +
                                       is_rebound, 
                               data = training_data) |>
        step_dummy(all_nominal_predictors()) |>
        step_ns(angle,
                deg_free = 2) |>
        step_ns(distance,
                deg_free = 3)

rec_logistic_interaction <- recipe(is_goal ~ distance +
                                           angle +
                                           is_rebound, 
                                   data = training_data) |>
        step_dummy(all_nominal_predictors()) |>
        step_interact(terms = ~ distance:angle)

rec_logistic_list <- list(base = rec_logistic_base,
                          rebound = rec_logistic_rebound,
                          splines = rec_logistic_splines,
                          interaction = rec_logistic_interaction)

logistic_models <- workflow_set(rec_logistic_list,
                                list(logistic = logistic_reg())) |>
        workflow_map("fit_resamples",
                     seed = 18,
                     #verbose = TRUE,
                     resamples = data_folds)

The metric used to evaluate all the potential models in this document is area under the ROC curve, or roc_auc. For this metric, a larger number (closer to 1) is good.

Collect the metrics.

wflow_id	model	.metric	mean	std_err
interaction_logistic	logistic_reg	roc_auc	0.698	0.00936
splines_logistic	logistic_reg	roc_auc	0.696	0.00944
rebound_logistic	logistic_reg	roc_auc	0.696	0.00977
base_logistic	logistic_reg	roc_auc	0.691	0.00945

The model with the interaction term (distance:angle) performed the best.

MARS

Repeat the same process with some MARS models (excluding the splines recipe).

rec_mars_base <- recipe(is_goal ~ distance +
                                angle, 
                        data = training_data)

rec_mars_rebound <- recipe(is_goal ~ distance +
                                   angle +
                                   is_rebound, 
                           data = training_data) |>
        step_dummy(all_nominal_predictors())

rec_mars_interaction <- recipe(is_goal ~ distance +
                                       angle +
                                       is_rebound, 
                               data = training_data) |>
        step_dummy(all_nominal_predictors()) |>
        step_interact(terms = ~ distance:angle)

rec_mars_list <- list(base = rec_mars_base,
                      rebound = rec_mars_rebound,
                      interaction = rec_mars_interaction)

mars_models <- workflow_set(rec_mars_list,
                            list(mars = mars(mode = "classification"))) |>
        workflow_map("fit_resamples",
                     seed = 18,
                     #verbose = TRUE,
                     resamples = data_folds)

Collect the metrics.

wflow_id	model	.metric	mean	std_err
rebound_mars	mars	roc_auc	0.692	0.00977
base_mars	mars	roc_auc	0.688	0.00936
interaction_mars	mars	roc_auc	0.682	0.00946

The MARS models did not perform as well as the logistic regression models. The best performing model had three variables (distance, angle, and is_rebound).

XGBoost

I’ll try XGBoost here because it has the potential to perform well for this task (my NHL xG model is an XGBoost model). I expect the small amount of data to be a problem here, though.

rec_xgb_base <- recipe(is_goal ~ distance +
                               angle, 
                       data = training_data)

rec_xgb_rebound <- recipe(is_goal ~ distance +
                                  angle +
                                  is_rebound, 
                          data = training_data) |>
        step_dummy(all_nominal_predictors(),
                   one_hot = TRUE)

rec_xgb_interaction <- recipe(is_goal ~ distance +
                                      angle +
                                      is_rebound, 
                              data = training_data) |>
        step_dummy(all_nominal_predictors(),
                   one_hot = TRUE) |>
        step_interact(terms = ~ distance:angle)

rec_xgb_list <- list(base = rec_xgb_base,
                     rebound = rec_xgb_rebound,
                     interaction = rec_xgb_interaction)

xgb_models <- workflow_set(rec_xgb_list,
                           list(xgb = boost_tree(mode = "classification"))) |>
        workflow_map("fit_resamples",
                     seed = 18,
                     #verbose = TRUE,
                     resamples = data_folds)

Collect the metrics.

wflow_id	model	.metric	mean	std_err
interaction_xgb	boost_tree	roc_auc	0.662	0.01004
base_xgb	boost_tree	roc_auc	0.662	0.00918
rebound_xgb	boost_tree	roc_auc	0.661	0.00939

The XGBoost model was the worst performer.

Metrics Summary

wflow_id	model	.metric	mean	std_err
interaction_logistic	logistic_reg	roc_auc	0.698	0.00936
splines_logistic	logistic_reg	roc_auc	0.696	0.00944
rebound_logistic	logistic_reg	roc_auc	0.696	0.00977
rebound_mars	mars	roc_auc	0.692	0.00977
base_logistic	logistic_reg	roc_auc	0.691	0.00945
base_mars	mars	roc_auc	0.688	0.00936
interaction_mars	mars	roc_auc	0.682	0.00946
interaction_xgb	boost_tree	roc_auc	0.662	0.01004
base_xgb	boost_tree	roc_auc	0.662	0.00918
rebound_xgb	boost_tree	roc_auc	0.661	0.00939

The logistic regression models top the list, with the “interaction” recipe performing the best. It should be acknowledged that none of these models has great performance.

Training | Testing Data

Select the best version of each type of model and fit it using all of the training data. Check results using the testing data.

Logistic Regression

model	.metric	.estimate
logistic	roc_auc	0.6587111

There was a drop in performance here.

MARS

model	.metric	.estimate
mars	roc_auc	0.6618625

Now the MARS model outperforms the logistic regression model.

XGBoost

model	.metric	.estimate
xgboost	roc_auc	0.6477014

XGBoost is still the worst performer.

Variable Importance

The MARS model was the best performer when given more training data, and a model with splines is probably a decent choice here. I’ll go with it.

Plot the importance of each variable in the MARS model.

Final Fit

Fit And Save The Final xG Model

Fit the MARS model using all available data and save the fit model to the working directory.

xg_model <- workflow() |>
        add_model(mars(mode = "classification")) |>
        add_recipe(rec_mars_rebound) |>
        fit(data = model_data)

write_rds(xg_model, "pwhl_xg_model.rds")

Check roc_auc

Check the roc_auc metric for the final model.

model	.metric	.estimate
mars	roc_auc	0.6921291

The roc_auc metric improved which is not surprising given that the model was fit using all of the data.

Variable Importance

Check the variable importance for the final model.

The order of the variables is the same but the level of importance changed a little.

Explore Results

Remember: the predictions made by this xG model do not include empty nets and penalty shots.

I think that looks reasonable. Now look at the distribution of xG values.

Show the same data as proportions.

The overall trend is good but the results are a little uneven. It would be nice to have more data :)

Now explore some actual predictions. Here are the top 10 xG leaders from season one according to this model.

Skater	Team	xG	Goals
Natalie Spooner	TOR	10.6	19
Kendall Coyne Schofield	MIN	7.7	6
Grace Zumwinkle	MIN	7.3	10
Laura Stacey	MTL	7.3	10
Alex Carpenter	NY	7.3	8
Hayley Scamurra	OTT	7.1	5
Jessie Eldridge	NY	6.9	7
Hilary Knight	BOS	6.8	6
Gabbie Hughes	OTT	6.0	8
Blayre Turnbull	TOR	5.8	3

Natalie Spooner really separated from the pack (and she had by far the most actual goals as well).

See the cumulative xG for each team according to this model.

Team	xG
MIN	60.2
OTT	54.3
TOR	54.3
MTL	52.7
NY	51.2
BOS	48.2

That’s all,

Mark (18 Skaters)