WHKYHAC: D-Zone xContributions Viz
The Viz Launchpad Competition
WHKYHAC and Sportlogiq announced a visualization competition using data from the 2023 Season of the Professional Women’s Hockey Players Association. The entry deadline for the competition is July 2, 2023. Prior to that deadline I’ll post some “trial” data visualizations here.
Today’s Data Viz
OK, it’s time to start taking advantage of the detailed play-by-play data supplied by Sportlogiq. This viz introduces something I’m calling “expected contributions” (or xContributions) to team goals from events that occur in the D-Zone. It piggybacks off the familiar xGoal concept, as explained below.
Expected Contributions To Team Goals From The D-Zone
I believe most xGoal models compute the probability of a shot attempt turning into a goal based on how frequently similar shot attempts turned into goals in the past. The xContributions to team goals does something similar for events that occurred in the D-Zone.
For this viz I computed the proportion of successful D-Zone events that occurred within 20 seconds of the team scoring a goal at the other end of the ice. In effect, I assigned different values to successful D-Zone events that can send the puck to the other end of the ice and, potentially, into the opponent’s net. There were 21 different types of D-Zone events included in that calculation. It makes sense intuitively that a successful D-Zone block is less likely to result in a goal-for than, say, a successful controlled zone exit. The xContributions metric attempts to put a value on that difference.
Huge Caveat: The amount of data available from the 2023 Season (40 games) is not nearly enough to build this type of model. Think of this as more of a “concept piece” rather than something that’s reliable.
I’ve included all the code below so you can dig into the details if you’re interested.
Let’s look at some of the results and get to the viz. Here are the Top 25 single game scores from the 2023 Season.
Player | xContribution | Game |
---|---|---|
Micah Zandee-Hart | 0.8410 | SCO-SON 2023-03-04 |
Megan Keller | 0.8318 | ADI-SCO 2023-03-10 |
Laura Fortino | 0.7482 | HAR-SON 2023-03-05 |
Jincy Dunne | 0.7377 | ADI-SON 2022-11-06 |
Jincy Dunne | 0.7368 | ADI-SCO 2022-11-05 |
Micah Zandee-Hart | 0.7293 | ADI-SON 2023-02-26 |
Lee Stecklein | 0.7052 | HAR-SCO 2023-03-12 |
Megan Keller | 0.6978 | ADI-SCO 2022-12-09 |
Jincy Dunne | 0.6923 | ADI-SON 2022-12-10 |
Megan Keller | 0.6913 | HAR-SCO 2023-02-10 |
Renata Fast | 0.6909 | ADI-SON 2023-02-24 |
Megan Keller | 0.6812 | HAR-SCO 2023-03-12 |
Lee Stecklein | 0.6723 | ADI-HAR 2023-03-04 |
Laura Fortino | 0.6714 | HAR-SON 2022-11-05 |
Savannah Harmon | 0.6706 | HAR-SON 2022-10-16 |
Jocelyne Larocque | 0.6688 | ADI-SCO 2022-11-05 |
Claire Thompson | 0.6624 | ADI-SON 2023-02-26 |
Jincy Dunne | 0.6571 | ADI-SON 2023-02-26 |
Savannah Harmon | 0.6549 | HAR-SCO 2023-03-12 |
Laura Fortino | 0.6540 | ADI-HAR 2023-03-04 |
Ella Shelton | 0.6523 | ADI-SCO 2023-03-05 |
Emily Brown | 0.6481 | SCO-SON 2022-10-15 |
Emily Brown | 0.6294 | ADI-SON 2022-11-06 |
Megan Keller | 0.6266 | SCO-SON 2023-01-22 |
Jessica DiGirolamo | 0.6247 | ADI-SON 2023-02-26 |
You can see that some players appear on the list multiple times. Here’s a breakdown of the number of Top 25 appearances.
Player | Top 25 Finishes |
---|---|
Megan Keller | 5 |
Jincy Dunne | 4 |
Laura Fortino | 3 |
Emily Brown | 2 |
Lee Stecklein | 2 |
Micah Zandee-Hart | 2 |
Savannah Harmon | 2 |
Claire Thompson | 1 |
Ella Shelton | 1 |
Jessica DiGirolamo | 1 |
Jocelyne Larocque | 1 |
Renata Fast | 1 |
The model clearly likes what Megan Keller was doing in the D-Zone.
Now here’s the viz. It shows a player’s cumulative xContributions over the course of a single game. I selected the game in which Micah Zandee-Hart achieved the highest xContribiutions score in the 2023 Season.
The Code
Here’s the code for this data viz.
# SETUP ########################################################################
setwd("~/18_skaters/r_studio/whkyhac")
library(tidyverse)
library(stringr)
library(lubridate)
library(knitr)
library(kableExtra)
# LOAD DATA ####################################################################
raw_pbp_data <- read_csv("23_PBP_WHKYHAC_SPORTLOGIQ.csv",
locale = locale(encoding = "ISO-8859-1"))
# EXPLORE DATA #################################################################
#print(str(raw_pbp_data))
# Players
player_names <- unique(raw_pbp_data$player)
# Events
event_names <- unique(raw_pbp_data$eventname)
event_outcomes <- unique(raw_pbp_data$outcome)
event_types <- unique(raw_pbp_data$type)
# Strength states
strength_states <- unique(raw_pbp_data$strengthstate)
# Plot successful / failed passes and dumps
plot_data_scep <- filter(raw_pbp_data,
eventname == "controlledexit",
type == "pass",
outcome == "successful")
plot_scep <- ggplot(plot_data_scep) +
geom_point(aes(xadjcoord,
yadjcoord)) +
theme_minimal()
#plot_scep # SUCCESSFUL CONTROLLED EXIT PASSES ARE LOCATED OUTSIDE D-ZONE
plot_data_fcep <- filter(raw_pbp_data,
eventname == "controlledexit",
type == "pass",
outcome == "failed")
plot_fcep <- ggplot(plot_data_fcep) +
geom_point(aes(xadjcoord, yadjcoord)) +
theme_minimal()
#plot_fcep
plot_data_spo <- filter(raw_pbp_data,
eventname == "pass",
type == "outlet",
outcome == "successful")
plot_spo <- ggplot(plot_data_spo) +
geom_point(aes(xadjcoord, yadjcoord)) +
theme_minimal()
#plot_spo
plot_data_spoob <- filter(raw_pbp_data,
eventname == "pass",
type == "outletoffboards",
outcome == "successful")
plot_spoob <- ggplot(plot_data_spoob) +
geom_point(aes(xadjcoord, yadjcoord)) +
theme_minimal()
#plot_spoob
plot_data_sdi <- filter(raw_pbp_data,
eventname == "dumpout",
type == "ice",
outcome == "successful")
plot_sdi <- ggplot(plot_data_sdi) +
geom_point(aes(xadjcoord, yadjcoord)) +
theme_minimal()
#plot_sdi
# CLEAN AND MANIPULATE DATA (AREAS OF INTEREST ONLY) ###########################
clean_pbp_data <- raw_pbp_data
# Fix name for Kristin O’Neill
clean_pbp_data$player <- str_replace_all(clean_pbp_data$player, "\\031", "’")
# Add game_id
clean_pbp_data$game_id <- paste(clean_pbp_data$game, clean_pbp_data$date)
# Add event_id
clean_pbp_data$event_id <- seq(1:length(clean_pbp_data$seasonstage))
# Add d-zone logical variable
clean_pbp_data <- mutate(clean_pbp_data,
d_zone = ifelse(
xadjcoord < -25,
TRUE,
FALSE))
# Add empty_net to opposing_goalie variable
clean_pbp_data$opposing_goalie <- ifelse(is.na(clean_pbp_data$opposing_goalie), "empty_net", clean_pbp_data$opposing_goalie)
# Tidy up (based on needs for this viz)
clean_pbp_data <- select(clean_pbp_data, c(30:31,1:29,32))
clean_pbp_data <- select(clean_pbp_data, -seasonstage,
-date,
-game,
-ishomegame,
-scoredifferential,
-teamskatersonicecount,
-opposingteamskatersonicecount,
-goalie,
-createsrebound,
-onetimer,
-shotaim,
-shottype)
# FUNCTIONS ####################################################################
get_event_types <- function(event_name) {
working_data <- clean_pbp_data %>%
filter(d_zone == TRUE,
playerprimaryposition == "D",
strengthstate == "evenStrength",
period < 4,
opposing_goalie != "empty_net",
eventname == event_name) %>%
group_by(type) %>%
summarise (sum_event_type = n())
names(working_data)[1] <- "event_type"
working_data$event_name <- event_name
working_data <- select(working_data, c(3,1,2))
return(working_data)
}
tag_successful_target_events <- function(event_name, event_type) {
working_data <- clean_pbp_data %>%
filter(d_zone == ifelse(
event_name == "controlledexit" & event_type == "pass",
FALSE,
TRUE),
playerprimaryposition == "D",
strengthstate == "evenStrength",
period < 4,
opposing_goalie != "empty_net",
eventname == event_name,
type == event_type,
outcome == "successful")
working_data$successful_target_event_id <- paste(event_name, event_type)
return(working_data)
}
# IDENTIFY D-ZONE EVENTS OF INTEREST ###########################################
# Filtered for: even-strength; regulation time; not shooting on an empty net
# Only skaters who play at the defense position are included
# Only successful events are included
# No distinction between regular season and playoffs
d_zone_lpr_event_types <- get_event_types("lpr")
d_zone_pass_event_types <- get_event_types("pass")
d_zone_reception_event_types <- get_event_types("reception")
d_zone_puck_protection_types <- get_event_types("puckprotection")
d_zone_block_types <- get_event_types("block")
d_zone_check_event_types <- get_event_types("check")
d_zone_dumpout_event_types <- get_event_types("dumpout")
d_zone_controlled_exit_event_types <- get_event_types("controlledexit")
# Filter out rare events (< 200)
d_zone_lpr_event_types <- filter(d_zone_lpr_event_types,
sum_event_type >= 200)
d_zone_pass_event_types <- filter(d_zone_pass_event_types,
sum_event_type >= 200)
d_zone_reception_event_types <- filter(d_zone_reception_event_types,
sum_event_type >= 200)
d_zone_puck_protection_types <- filter(d_zone_puck_protection_types,
sum_event_type >= 200)
d_zone_block_types <- filter(d_zone_block_types,
sum_event_type >= 200)
d_zone_check_event_types <- filter(d_zone_check_event_types,
sum_event_type >= 200)
d_zone_dumpout_event_types <- filter(d_zone_dumpout_event_types,
sum_event_type >= 200)
d_zone_controlled_exit_event_types <- filter(d_zone_controlled_exit_event_types,
sum_event_type >= 200)
# Combine events and assign ids
events <- bind_rows(d_zone_lpr_event_types,
d_zone_pass_event_types,
d_zone_reception_event_types,
d_zone_puck_protection_types,
d_zone_block_types,
d_zone_check_event_types,
d_zone_dumpout_event_types,
d_zone_controlled_exit_event_types)
events$target_event_id <- paste(events$event_name, events$event_type)
events <- select(events, c(4,1:3))
# TAG SUCCESSFUL TARGET EVENTS #################################################
# Loop through the events object to tag the target events in the play-by-play data
loop_length <- length(events$target_event_id)
loop_list <- list()
for (i in 1:loop_length) {
loop_data <- tag_successful_target_events(as.character(events[i,2]), as.character(events[i,3]))
loop_list[[i]] <- loop_data
}
tagged_successful_target_events <- bind_rows(loop_list)
# Join target_event_id to play-by-play data
tagged_successful_target_events <- select(tagged_successful_target_events,
event_id,
successful_target_event_id)
clean_pbp_data <- left_join(clean_pbp_data,
tagged_successful_target_events,
by = "event_id")
# IDENTIFY ELIGIBLE GOALS AND PRIOR EVENTS #####################################
# Filter the play-by-play data for eligible goals
# Find the time windows based on the times for eligible goals
# The time window starts 20 seconds before an eligible goal
eligible_goals <- clean_pbp_data %>%
filter(strengthstate == "evenStrength",
period < 4,
opposing_goalie != "empty_net",
goal == 1)
eligible_goal_ids <- eligible_goals$event_id
eligible_goals <- eligible_goals %>%
group_by(game_id) %>%
mutate(eligible_event_time_start = compiledgametime - 20) %>%
select(game_id,
event_id,
eligible_team = teamname,
eligible_event_time_start,
eligible_event_time_end = compiledgametime) %>%
ungroup() %>%
select(-game_id)
# Join eligible goals to play-by-play data
clean_pbp_data$eligible_goal <- ifelse(
clean_pbp_data$event_id %in% eligible_goal_ids,
TRUE,
FALSE )
# Identify eligible prior events
clean_pbp_data <- clean_pbp_data %>%
left_join(eligible_goals,
by = "event_id")
clean_pbp_data <- clean_pbp_data %>%
group_by(game_id) %>%
fill(eligible_event_time_start,
eligible_event_time_end,
eligible_team,
.direction = "up") %>%
ungroup()
clean_pbp_data <- clean_pbp_data %>%
mutate(target_event_with_eligible_goal = ifelse(
successful_target_event_id > 0 &
eligible_team == teamname &
compiledgametime >= eligible_event_time_start &
compiledgametime <= eligible_event_time_end,
TRUE,
FALSE))
# SUMMARIZE DATA AND COMPUTE PROPORTIONS #######################################
# Get the total number of eligible target events
sum_eligible_target_events <- clean_pbp_data %>%
group_by(successful_target_event_id) %>%
summarise(sum = n())
sum_eligible_target_events <- sum_eligible_target_events %>%
filter(successful_target_event_id > 0)
# Get the total number of eligible goals for each target event
sum_eligible_target_events_with_goals <- clean_pbp_data %>%
filter(target_event_with_eligible_goal == TRUE) %>%
group_by(successful_target_event_id) %>%
summarise(goals = n())
# Compute the proportion of target events that lead to an eligible goal
target_event_proportion_data <- sum_eligible_target_events %>%
left_join(sum_eligible_target_events_with_goals,
by = "successful_target_event_id")
target_event_proportion_data <- target_event_proportion_data %>%
mutate(proportion = goals / sum)
# ADD THE PROPORTIONS TO THE MAIN PLAY-BY-PLAY DATA ############################
clean_pbp_data <- clean_pbp_data %>%
left_join(target_event_proportion_data,
by = "successful_target_event_id")
clean_pbp_data$proportion[is.na(clean_pbp_data$proportion)] <- 0
# TAKE A LOOK AT SOME RESULTS ##################################################
# Find highest single game scores
top_game_score <- clean_pbp_data %>%
filter(playerprimaryposition == "D") %>%
group_by(game_id,
player) %>%
summarise(expected_contribution_to_goals = sum(proportion)) %>%
arrange(desc(expected_contribution_to_goals)) %>%
ungroup()
top_game_score$expected_contribution_to_goals <- round(top_game_score$expected_contribution_to_goals, 4)
# Put the top 25 game scores in fancy tables
top_25_table <- top_game_score %>%
slice_head(n =25) %>%
rename("Player" = player,
"xContribution" = expected_contribution_to_goals,
"Game" = game_id)
top_25_summary <- top_25_table %>%
group_by(Player) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
rename("Top 25 Finishes" = count) %>%
ungroup()
top_25_table <- top_25_table %>%
select(c(2,3,1))
top_25_table <- kable(top_25_table, format = "html",
align = c("l", "c", "c")) %>%
kable_styling(position = "center",
full_width = FALSE)
#top_25_table
top_25_summary_table <- kable(top_25_summary, format = "html",
align = c("l", "c")) %>%
kable_styling(position = "center",
full_width = FALSE)
#top_25_summary_table
# PLOT DATA FOR A SINGLE GAME ##################################################
# Selected the game with the highest single game score
# Note: it would be easy to turn this into a function with game/team arguments
selected_game_id <- "SCO-SON 2023-03-04"
selected_team <- "Sonnet"
# Add cumulative expected contributions to team goals
plot_data_game <- clean_pbp_data %>%
filter(game_id == selected_game_id,
teamname == selected_team,
playerprimaryposition == "D",
strengthstate == "evenStrength",
period < 4,
opposing_goalie != "empty_net") %>%
group_by(player) %>%
mutate(cum_xcontribution = cumsum(proportion)) %>%
rename("Defense" = player)
plot_data_game_goals <- clean_pbp_data %>%
filter(game_id == selected_game_id,
teamname == selected_team,
goal == 1)
plot_data_game_goals <- plot_data_game_goals$compiledgametime
# Arrange players in descending order (used for the plot legend)
plot_limits <- plot_data_game %>%
arrange(desc(cum_xcontribution))
plot_limits <- unique(plot_limits$Defense)
# Create a vector of period start / end times
plot_period_ends <-c(0, 1200, 2400, 3600)
# Max xContribution
plot_y_end <- max(plot_data_game$cum_xcontribution)
# Plot the single game data
plot_game <- ggplot() +
geom_step(data = plot_data_game,
aes(x = compiledgametime,
y = cum_xcontribution,
group = Defense,
colour = Defense),
alpha = 0.7,
linewidth = 2.5) +
theme_minimal() +
theme(axis.text.x = element_blank(),
axis.title.x = element_blank(),
panel.grid = element_blank(),
plot.title.position = "plot",
plot.title = element_text(size = 20,
face = "bold"),
plot.subtitle = element_text(size = 16),
plot.caption = element_text(size = 12),
plot.caption.position = "plot",
axis.title.y = element_text(size = 15),
legend.title = element_text(size = 17,
face = "bold"),
legend.text = element_text(size = 16)) +
geom_segment(aes(x = plot_period_ends,
y = 0,
xend = plot_period_ends,
yend = plot_y_end),
alpha = 0.3,
linewidth = 0.6) +
geom_segment(aes(x = plot_data_game_goals,
y = -0.005,
xend = plot_data_game_goals,
yend = plot_y_end),
alpha = 0.2,
linetype = 5) +
geom_segment(aes(x = 0,
y = 0,
xend = 3600,
yend = 0),
alpha = 0.3,
linewidth = 0.6) +
geom_segment(aes(x = 0,
y = plot_y_end,
xend = 3600,
yend = plot_y_end),
alpha = 0.3,
linewidth = 0.6) +
geom_text(aes(x = plot_data_game_goals,
y = -0.01),
label = "G",
size = 4,
alpha = 0.5) +
geom_text(aes(x = 200,
y = -0.03),
label = "Period 1 >>>",
size = 4,
alpha = 0.5) +
geom_text(aes(x = 1400,
y = -0.03),
label = "Period 2 >>>",
size = 4,
alpha = 0.5) +
geom_text(aes(x = 2600,
y = -0.03),
label = "Period 3 >>>",
size = 4,
alpha = 0.5) +
labs(y = "Cumulative xContributions",
title = "Expected Contributions To Team Goals From Events In The D-Zone \n(Even Strength / Regulation Time)",
subtitle = paste("Game:", selected_game_id, " Team:", selected_team),
caption = "Data by Sportlogiq, viz by 18 Skaters, #WHKYHAC") +
scale_colour_viridis_d(limits = plot_limits)
#plot_game
# FURTHER WORK #################################################################
# Exclude goals where there is an intervening face-off?
# Add expected contributions to goals against?
# Build out contributions in other zones?
# GET WAY MORE DATA TO BUILD THE MODEL :)
The End
That’s it. I’ll post more data visualizations in the days leading up to the July 2 deadline for the Viz Launchpad Competition.
Mark (18 Skaters)