4  Tidy

Author

Trevor S

4.1 Setup

We install and load the necessary packages, along with functions from prior chapters.

Code
# renv::install("rvest")
# renv::install("tidyverse")
# renv::install("gt")
# renv::install("DBI")
# renv::install("duckdb")
# renv::install("aws.s3")
# renv::install("paws")
# renv::install("arrow")

library(rvest)
library(tidyverse)
library(stringr)
library(gt)
library(DBI)
library(duckdb)
library(aws.s3)
library(paws)
library(arrow)

source("functions.R") # load functions defined in prior chapters

4.2 Introduction

In the Tidy chapter, we will put the data into tidy form, clean it, and store it in an S3 bucket, which will make our analysis in the following chapters easier. Although the table from the previous chapter looks fairly clean, closer inspection reveals some issues, as shown in Table 4.2.1:

Code
start_year <- 1996
end_year <- 2020

import_draft(start_year)[23:30,] |> # rows 23-30: picks 23-28 plus the two separator rows
  gt() |> 
  opt_all_caps() |> 
  tab_source_note("Table 4.2.1: Some Issues in our Data")
| overall | team | player | nat | pos | age | to | amateur_team | gp | g | a | pts | x | pim | gp_2 | w | l | t_o | sv_percent | gaa | ps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | Pittsburgh Penguins | Craig Hillier | CA | G |  |  | Ottawa 67's (OHL) |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 24 | Phoenix Coyotes | Daniel Briere | CA | C | 18 | 2015 | Drummondville Voltigeurs (QMJHL) | 973 | 307 | 389 | 696 | -24 | 744 |  |  |  |  |  |  | 78.5 |
| 25 | Colorado Avalanche | Peter Ratchuk | US | D | 19 | 2001 | Shattuck-St. Mary's School (High-MN) | 32 | 1 | 1 | 2 | -2 | 10 |  |  |  |  |  |  | 0.6 |
| 26 | Detroit Red Wings | Jesse Wallin | CA | D | 18 | 2003 | Red Deer Rebels (WHL) | 49 | 0 | 2 | 2 | -5 | 34 |  |  |  |  |  |  | 0.2 |
| Round 2 | Round 2 |  |  |  |  |  |  | NHL Scoring Stats | NHL Scoring Stats | NHL Scoring Stats | NHL Scoring Stats | NHL Scoring Stats | NHL Scoring Stats | NHL Goalie Stats | NHL Goalie Stats | NHL Goalie Stats | NHL Goalie Stats | NHL Goalie Stats | NHL Goalie Stats |  |
| Overall | Team | Player | Nat. | Pos | Age | To | Amateur Team | GP | G | A | PTS | +/- | PIM | GP | W | L | T/O | SV% | GAA | PS |
| 27 | Buffalo Sabres | Cory Sarich | CA | D | 18 | 2014 | Saskatoon Blades (WHL) | 969 | 21 | 137 | 158 | -9 | 1089 |  |  |  |  |  |  | 36.0 |
| 28 | Pittsburgh Penguins | Pavel Skrbek | CZ | D | 18 | 2002 | HC Kladno (Czech) | 12 | 0 | 0 | 0 | 1 | 8 |  |  |  |  |  |  | 0.2 |

Table 4.2.1: Some Issues in our Data

Some problems that immediately come up are:

  • Two rows get inserted at the end of every round to indicate the round changed.

  • Though not visible because of the usage of gt, numbers are coded as strings (overall, age, to, etc.).

  • (At least) one player is missing everything except for their pick number, name, team, position, nationality, and amateur team.

  • The +/- column got renamed to x.

  • There were two gp columns; one was automatically renamed to gp_2 when we used janitor::clean_names() in import_draft.

By doing a little detective work with some of the other players with missing values elsewhere in the dataset, we notice that players who never played in the NHL have empty strings listed for everything except their pick number, name, team, position, nationality, and amateur team. Handling these players is one of several things we will do in this chapter to prepare our data for analysis.
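To see how these empty strings behave once parsed, consider a minimal sketch on a hypothetical two-row excerpt (the data frame below is made up for illustration, not the real scraped table): readr::type_convert() parses the empty strings as NA, and dplyr::coalesce() can then replace those NAs with zeros.

```r
library(readr)
library(dplyr)

# Hypothetical excerpt: one player who reached the NHL and one who never
# did (the latter has empty strings for to, gp, and ps)
raw <- tibble(
  player = c("Daniel Briere", "Craig Hillier"),
  to     = c("2015", ""),
  gp     = c("973", ""),
  ps     = c("78.5", "")
)

tidied <- raw |>
  type_convert() |>                       # "" is parsed as NA, "973" as 973
  mutate(gp = coalesce(gp, 0),            # never played => 0 games
         ps = pmax(coalesce(ps, 0), 0))   # and 0 point shares, floored at 0

tidied$gp # 973 0
```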

4.3 Cleaning

We build a function to tidy the data. In particular, we want it to:

  • Remove the rows added between rounds.

  • Correct the types of each column so we can use numeric columns in calculations.

  • Change gp and ps values to 0 for players who never played in the NHL or have a negative ps.

  • Standardize position to be F (forward: any combination of LW, RW, C), D (defenceman), or G (goalie). Note that Kaspars Astashenko has his position listed as “D/W”; a quick search of the NHL website reveals he was a defenceman.

  • If is.na(to), then the player never played in the NHL, so set it to the draft year.

  • Add a year column so we can combine all of the data into a single data frame.

  • Select the columns we care about (year, overall, pos, to, ps, and gp) in that order.

  • Remove any players selected after 224\(^{\text{th}}\) overall (the number of picks in the 2025 NHL Entry Draft).

  • We don’t care about +/-, so we can ignore the column being renamed since we won’t be using it anyway.

  • It also turns out that gp_2 is empty for skaters, and for goalies gp and gp_2 hold the same value, so this issue can be resolved by simply selecting gp.

  • The 69\(^{\text{th}}\) pick of 2011 was forfeited and the 123\(^{\text{rd}}\) pick of 2002 was invalid; both are listed as blank rows, so these should be removed.

Note that we cannot remove the round-separating rows by specified row number, since many of the drafts in our dataset have different numbers of picks per round, and even rounds within the same draft have had different numbers of picks.
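The position rule is worth checking in isolation before embedding it in the function. Here is a small sketch on a made-up vector of raw position strings (including Astashenko's "D/W"):

```r
library(stringr)

# Hypothetical raw position strings as they might appear in the scraped table
pos <- c("C", "LW", "RW/LW", "D", "D/W", "G")

# Goalies first, then anything containing exactly one "D" (which also
# catches "D/W"), and everything else is a forward
std <- ifelse(str_count(pos, "G") == 1, "G",
              ifelse(str_count(pos, "D") == 1, "D", "F"))
std # "F" "F" "F" "D" "D" "G"
```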

Code
tidy_draft <- function(year){
  draft_year_table <- import_draft(year) |> # get the untidied data
    filter(overall != "Overall" & overall != "" & # remove extra rows
             as.numeric(overall) < 225 & # remove players drafted after pick 224
             amateur_team != "()") |> # remove invalid/forfeited picks 
    type_convert() |> # fix types 
    mutate("year" = year, 
           "ps" = pmax(coalesce(ps, 0), 0), "gp" = coalesce(gp, 0), # players who never played in the NHL
           "to" = coalesce(to, year), # players who never played in the NHL
           "pos" = ifelse(str_count(pos, "G") == 1, "G", # fix positions
                          ifelse(str_count(pos, "D") == 1, "D", "F"))) |> 
    select(year, overall, to, pos, ps, gp) # columns we care about
  draft_year_table
}

tidy_draft(1996) |> 
  head(50) |> 
  gt() |> 
  opt_all_caps() |> 
  tab_source_note("Table 4.3.1: The Function Works Correctly")
year overall to pos ps gp
1996 1 2015 D 64.6 1179
1996 2 2008 D 25.8 496
1996 3 2011 F 56.6 822
1996 4 2000 F 0.0 3
1996 5 2007 D 8.8 231
1996 6 2009 F 12.5 627
1996 7 2007 F 9.2 545
1996 8 2004 D 0.0 44
1996 9 2011 D 46.9 917
1996 10 2004 D 2.7 209
1996 11 2004 D 0.0 82
1996 12 2004 F 0.6 60
1996 13 2014 D 74.6 1107
1996 14 2013 F 17.7 798
1996 15 2016 F 52.0 1293
1996 16 1999 D 0.0 5
1996 17 2000 F 3.5 113
1996 18 2001 F 0.0 57
1996 19 2001 D 0.3 5
1996 20 2008 F 12.8 521
1996 21 2012 F 54.0 938
1996 22 1996 D 0.0 0
1996 23 1996 G 0.0 0
1996 24 2015 F 78.5 973
1996 25 2001 D 0.6 32
1996 26 2003 D 0.2 49
1996 27 2014 D 36.0 969
1996 28 2002 D 0.2 12
1996 29 2009 F 0.0 337
1996 30 2012 F 5.1 341
1996 31 1999 D 0.0 18
1996 32 2004 D 0.0 6
1996 33 1996 F 0.0 0
1996 34 1996 F 0.0 0
1996 35 2019 F 66.0 1516
1996 36 2001 D 0.6 19
1996 37 2002 F 2.6 73
1996 38 1996 F 0.0 0
1996 39 2004 F 0.0 55
1996 40 2013 F 6.6 524
1996 41 1996 D 0.0 0
1996 42 2003 D 0.0 2
1996 43 2007 F 22.0 552
1996 44 2013 G 49.0 341
1996 45 1996 F 0.0 0
1996 46 1996 F 0.0 0
1996 47 2006 F 7.3 142
1996 48 2000 F 0.7 53
1996 49 2012 D 42.1 797
1996 50 1996 G 0.0 0
Table 4.3.1: The Function Works Correctly

Table 4.3.1 confirms the function completes the tasks listed at the start of this section. We can now import the data, cleaning it as we import it.

4.4 Getting the Data

We now load in all of the data and use rbind() to bind the tables together, giving us a single data frame to work with. This leaves our data in a mix of long and wide format: the year and overall columns are arranged as they would be in long format, while the to, pos, gp, and ps columns are arranged as they would be in wide format. Note that this step takes 2-3 minutes to run (because of the Sys.sleep(5) line, which is necessary to prevent us from being rate limited). We will soon store the data in an S3 bucket so we don't have to run this function any more than we have to.

Code
all_data <- do.call(rbind, lapply(seq(start_year, end_year, 1),
                   \(x) tidy_draft(x)))

4.5 Verification

We can check that we loaded the data correctly: there should be between 5250 and 5600 rows (the number of picks in a draft has changed over the years in our dataset, as we will see, but it is always between 210 and 224, so 25 drafts will have between 5250 and 5600 picks in total) and 6 columns:
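The bounds above come from simple arithmetic, which can be sketched as a quick check (reusing the start_year and end_year values set in Section 4.2):

```r
# Sanity bounds for the combined data frame: 25 drafts, each with between
# 210 and 224 picks
start_year <- 1996
end_year <- 2020

n_drafts <- end_year - start_year + 1  # 25
c(lower = n_drafts * 210, upper = n_drafts * 224)
# lower upper
#  5250  5600
```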

Code
dim(all_data) # confirm there are 5250-5600 rows and 6 columns
[1] 5425    6
Code
length(unique(all_data$year)) # confirm all 25 years have been included
[1] 25
Code
# should be the last 10 picks from 2020, note that there were 7*31 = 217 
#   picks in the draft that year
all_data |>
  tail(10) |>
  gt() |> 
  opt_all_caps() |> 
  tab_source_note("Table 4.5.1: The Last 10 Picks from our Dataset")
year overall to pos ps gp
2020 208 2020 D 0.0 0
2020 209 2020 F 0.0 0
2020 210 2020 F 0.0 0
2020 211 2020 F 0.0 0
2020 212 2025 G 5.7 39
2020 213 2020 F 0.0 0
2020 214 2020 G 0.0 0
2020 215 2020 F 0.0 0
2020 216 2020 F 0.0 0
2020 217 2020 F 0.0 0
Table 4.5.1: The Last 10 Picks from our Dataset
Code
all_data |> # Confirm pick 123 of 2002 was removed (it was an invalid pick)
  filter(year == 2002 & overall <= 125 & overall >= 121) |> 
  gt() |> 
  opt_all_caps() |> 
  tab_source_note("Table 4.5.2: Pick 123 in 2002 was Removed")
year overall to pos ps gp
2002 121 2002 G 0 0
2002 122 2002 D 0 0
2002 124 2002 D 0 0
2002 125 2002 D 0 0
Table 4.5.2: Pick 123 in 2002 was Removed

These checks all returned what they should.

4.6 Storing the Data

Finally, recall that the import_draft function (and thus the tidy_draft function) both take quite a while to run, especially when importing data for a large number of years. To resolve this issue, we will store the data in an AWS S3 bucket so future chapters can get the data from the S3 bucket instead of from Hockey Reference, making this report render significantly faster. We do this by following the instructions given by Agarwal (2020), which tell us to save all_data in a parquet file and then write it to our S3 bucket.

Code
# The two keys are already set as environment variables (re-setting them to
#   themselves is a no-op); the only value we actually need to set is the region
Sys.setenv("AWS_ACCESS_KEY_ID" = Sys.getenv("AWS_ACCESS_KEY_ID"),
           "AWS_SECRET_ACCESS_KEY" = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
           "AWS_DEFAULT_REGION" = "us-east-2")
bucket <- "trevor-stat468"

s3write_using(all_data, FUN = write_parquet, 
              bucket = bucket, object = "all_data.parquet")

In this and future chapters, we will start by loading all of the functions from functions.R, which includes the functions and global constants defined across chapters, and which also queries the data from the AWS S3 bucket using duckdb. functions.R is included in this project's GitHub repo, and the part of it that loads all_data is given below (here we load it into all_data_test to show that it actually works). I also followed some of the instructions from Michonneau (2023) to help write the code below.

Code
con <- dbConnect(duckdb())

dbExecute(con, "INSTALL httpfs;")
dbExecute(con, "LOAD httpfs;")

all_data_test <- dbGetQuery(con, "SELECT * 
                            FROM read_parquet('s3://trevor-stat468/all_data.parquet');")

DBI::dbDisconnect(con)
Code
dim(all_data_test)
[1] 5425    6
Code
all_data_test |> 
  head(10) |> 
  gt() |>
  opt_all_caps() |> 
  tab_source_note("Table 4.6.1: Data Loaded from S3 Bucket")
year overall to pos ps gp
1996 1 2015 D 64.6 1179
1996 2 2008 D 25.8 496
1996 3 2011 F 56.6 822
1996 4 2000 F 0.0 3
1996 5 2007 D 8.8 231
1996 6 2009 F 12.5 627
1996 7 2007 F 9.2 545
1996 8 2004 D 0.0 44
1996 9 2011 D 46.9 917
1996 10 2004 D 2.7 209
Table 4.6.1: Data Loaded from S3 Bucket

Table 4.6.1 confirms that the data has been loaded from the S3 bucket. In future chapters we will edit this data and put the new version in the same S3 bucket. We now proceed to the Visualize chapter.

Agarwal, Saurabh. 2020. “How to Read and Write Data from and to S3 Bucket Using R?” Medium. https://medium.com/@som028/how-to-read-and-write-data-from-and-to-s3-bucket-using-r-3fed7e686844.
Michonneau, François. 2023. “How to Work with Remote Parquet Files with the Duckdb R Package?” https://francoismichonneau.net/2023/06/duckdb-r-remote-data/.