We install and load the necessary packages, along with functions from prior chapters.
Code
# renv::install("rvest")# renv::install("tidyverse")# renv::install("gt")# renv::install("DBI")# renv::install("duckdb")# renv::install("aws.s3")# renv::install("paws")# renv::install("arrow")library(rvest)library(tidyverse)library(stringr)library(gt)library(DBI)library(duckdb)library(aws.s3)library(paws)library(arrow)source("functions.R") # load functions defined in prior chapters
4.2 Introduction
In the Tidy chapter, we will put the data into tidy form, clean it, and store it in an S3 bucket, which will make our analysis in the following chapters easier. Although the table from the previous chapter looks fairly clean, further inspection reveals some issues, as shown in Table 4.2.1:
Code
start_year <- 1996
end_year <- 2020

import_draft(start_year)[23:30, ] |> # picks 23-28, plus the two round-separator rows
  gt() |>
  opt_all_caps() |>
  tab_source_note("Table 4.2.1: Some Issues in our Data")
| overall | team | player | nat | pos | age | to | amateur_team | gp | g | a | pts | x | pim | gp_2 | w | l | t_o | sv_percent | gaa | ps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | Pittsburgh Penguins | Craig Hillier | CA | G |  |  | Ottawa 67's (OHL) |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 24 | Phoenix Coyotes | Daniel Briere | CA | C | 18 | 2015 | Drummondville Voltigeurs (QMJHL) | 973 | 307 | 389 | 696 | -24 | 744 |  |  |  |  |  |  | 78.5 |
| 25 | Colorado Avalanche | Peter Ratchuk | US | D | 19 | 2001 | Shattuck-St. Mary's School (High-MN) | 32 | 1 | 1 | 2 | -2 | 10 |  |  |  |  |  |  | 0.6 |
| 26 | Detroit Red Wings | Jesse Wallin | CA | D | 18 | 2003 | Red Deer Rebels (WHL) | 49 | 0 | 2 | 2 | -5 | 34 |  |  |  |  |  |  | 0.2 |
| Round 2 | Round 2 |  |  |  |  |  |  | NHL Scoring Stats | NHL Scoring Stats | NHL Scoring Stats | NHL Scoring Stats | NHL Scoring Stats | NHL Scoring Stats | NHL Goalie Stats | NHL Goalie Stats | NHL Goalie Stats | NHL Goalie Stats | NHL Goalie Stats | NHL Goalie Stats |  |
| Overall | Team | Player | Nat. | Pos | Age | To | Amateur Team | GP | G | A | PTS | +/- | PIM | GP | W | L | T/O | SV% | GAA | PS |
| 27 | Buffalo Sabres | Cory Sarich | CA | D | 18 | 2014 | Saskatoon Blades (WHL) | 969 | 21 | 137 | 158 | -9 | 1089 |  |  |  |  |  |  | 36.0 |
| 28 | Pittsburgh Penguins | Pavel Skrbek | CZ | D | 18 | 2002 | HC Kladno (Czech) | 12 | 0 | 0 | 0 | 1 | 8 |  |  |  |  |  |  | 0.2 |

Table 4.2.1: Some Issues in our Data
Some problems that immediately come up are:
Two rows get inserted at the end of every round to indicate the round changed.
Though not visible in the gt output, numbers are coded as strings (overall, age, to, etc.).
(At least) one player is missing everything except for their pick number, name, team, position, nationality, and amateur team.
The +/- column got renamed to x.
There were two gp columns; one got automatically renamed to gp_2 when we used janitor::clean_names() in import_draft.
A little detective work on other players with missing values elsewhere in the dataset shows that players who never played in the NHL have empty strings listed for everything except their pick number, name, team, position, nationality, and amateur team. Handling these empty values is one of several things we will do in this chapter to prepare our data for analysis.
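A spot-check along these lines is sketched below. It is a minimal sketch, not the chapter's own code: it assumes import_draft returns every column as a character vector (consistent with the typing issue noted above), and it filters on player as well as to so that the round-separator rows (which also have an empty to) are excluded.

Code
# Hypothetical spot-check: raw 1996 rows with no "to" year, i.e. players
# who never played in the NHL (assumes all columns are still strings)
import_draft(1996) |>
  filter(to == "" & player != "") |> # player != "" drops the separator rows
  select(overall, player, team, nat, pos, amateur_team) |>
  head()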
4.3 Cleaning
We build a function to tidy the data. In particular, we want it to:
Remove the rows added between rounds.
Correct the types of each column so we can use numeric columns in calculations.
Change gp and ps values to 0 for players who never played in the NHL or have a negative ps.
Standardize position to either F (forward: any combination of LW, RW, C), D (defenceman), or G (goalie). Note that Kaspars Astashenko has his position listed as “D/W”; a quick search of the NHL website reveals he was a defenceman.
If is.na(to), then the player never played in the NHL, so set it to the draft year.
Add a year column so we can combine all of the data into a single data frame.
Select the columns we care about (year, overall, to, pos, ps, and gp) in that order.
Remove any players selected after 224th overall (the number of picks in the 2025 NHL Entry Draft).
We don’t care about +/-, so we can ignore that column being renamed to x.
It also turns out that for skaters gp_2 is empty and for goalies gp and gp_2 have the same value, so this issue can be resolved by simply selecting gp.
The 69th pick of 2011 was forfeited and the 123rd pick of 2002 was invalid; both are listed as blank rows, so these should be removed.
Note that we cannot remove the round-separating rows by specifying row numbers, since many of the drafts in our dataset have different numbers of picks per round, and some rounds within the same draft have even had different numbers of picks.
Code
tidy_draft <- function(year) {
  draft_year_table <- import_draft(year) |> # get the untidied data
    filter(overall != "Overall" & overall != "" & # remove extra rows
           as.numeric(overall) < 225 & # remove players drafted after pick 224
           amateur_team != "()") |> # remove invalid/forfeited picks
    type_convert() |> # fix types
    mutate("year" = year,
           "ps" = pmax(coalesce(ps, 0), 0),
           "gp" = coalesce(gp, 0), # players who never played in the NHL
           "to" = coalesce(to, year), # players who never played in the NHL
           "pos" = ifelse(str_count(pos, "G") == 1, "G", # fix positions
                   ifelse(str_count(pos, "D") == 1, "D", "F"))) |>
    select(year, overall, to, pos, ps, gp) # columns we care about
  draft_year_table
}

tidy_draft(1996) |>
  head(50) |>
  gt() |>
  opt_all_caps() |>
  tab_source_note("Table 4.3.1: The Function Works Correctly")
| year | overall | to | pos | ps | gp |
|---|---|---|---|---|---|
| 1996 | 1 | 2015 | D | 64.6 | 1179 |
| 1996 | 2 | 2008 | D | 25.8 | 496 |
| 1996 | 3 | 2011 | F | 56.6 | 822 |
| 1996 | 4 | 2000 | F | 0.0 | 3 |
| 1996 | 5 | 2007 | D | 8.8 | 231 |
| 1996 | 6 | 2009 | F | 12.5 | 627 |
| 1996 | 7 | 2007 | F | 9.2 | 545 |
| 1996 | 8 | 2004 | D | 0.0 | 44 |
| 1996 | 9 | 2011 | D | 46.9 | 917 |
| 1996 | 10 | 2004 | D | 2.7 | 209 |
| 1996 | 11 | 2004 | D | 0.0 | 82 |
| 1996 | 12 | 2004 | F | 0.6 | 60 |
| 1996 | 13 | 2014 | D | 74.6 | 1107 |
| 1996 | 14 | 2013 | F | 17.7 | 798 |
| 1996 | 15 | 2016 | F | 52.0 | 1293 |
| 1996 | 16 | 1999 | D | 0.0 | 5 |
| 1996 | 17 | 2000 | F | 3.5 | 113 |
| 1996 | 18 | 2001 | F | 0.0 | 57 |
| 1996 | 19 | 2001 | D | 0.3 | 5 |
| 1996 | 20 | 2008 | F | 12.8 | 521 |
| 1996 | 21 | 2012 | F | 54.0 | 938 |
| 1996 | 22 | 1996 | D | 0.0 | 0 |
| 1996 | 23 | 1996 | G | 0.0 | 0 |
| 1996 | 24 | 2015 | F | 78.5 | 973 |
| 1996 | 25 | 2001 | D | 0.6 | 32 |
| 1996 | 26 | 2003 | D | 0.2 | 49 |
| 1996 | 27 | 2014 | D | 36.0 | 969 |
| 1996 | 28 | 2002 | D | 0.2 | 12 |
| 1996 | 29 | 2009 | F | 0.0 | 337 |
| 1996 | 30 | 2012 | F | 5.1 | 341 |
| 1996 | 31 | 1999 | D | 0.0 | 18 |
| 1996 | 32 | 2004 | D | 0.0 | 6 |
| 1996 | 33 | 1996 | F | 0.0 | 0 |
| 1996 | 34 | 1996 | F | 0.0 | 0 |
| 1996 | 35 | 2019 | F | 66.0 | 1516 |
| 1996 | 36 | 2001 | D | 0.6 | 19 |
| 1996 | 37 | 2002 | F | 2.6 | 73 |
| 1996 | 38 | 1996 | F | 0.0 | 0 |
| 1996 | 39 | 2004 | F | 0.0 | 55 |
| 1996 | 40 | 2013 | F | 6.6 | 524 |
| 1996 | 41 | 1996 | D | 0.0 | 0 |
| 1996 | 42 | 2003 | D | 0.0 | 2 |
| 1996 | 43 | 2007 | F | 22.0 | 552 |
| 1996 | 44 | 2013 | G | 49.0 | 341 |
| 1996 | 45 | 1996 | F | 0.0 | 0 |
| 1996 | 46 | 1996 | F | 0.0 | 0 |
| 1996 | 47 | 2006 | F | 7.3 | 142 |
| 1996 | 48 | 2000 | F | 0.7 | 53 |
| 1996 | 49 | 2012 | D | 42.1 | 797 |
| 1996 | 50 | 1996 | G | 0.0 | 0 |

Table 4.3.1: The Function Works Correctly
Table 4.3.1 confirms that the function completes the tasks listed at the start of this section. We can now proceed to import the data; we choose to clean it as we import it.
4.4 Getting the Data
We now load in all of the data and use rbind() to bind the yearly tables together, giving us a single data frame to work with. This means our data will be in a mix of long and wide formats: the year and overall columns act as identifiers, as they would in a long format, while the to, pos, gp, and ps columns sit side by side, as they would in a wide format. Note that this step takes 2-3 minutes to run (because of the Sys.sleep(5) line, which is necessary to prevent us from getting rate limited). We will soon store the data in an S3 bucket so we don't have to run it any more often than we have to. A sketch of the loading step is given below.
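The chapter's own loading code is not reproduced here; the following is a minimal sketch consistent with the description above. The name import_data is taken from Section 4.6, but the loop structure itself is an assumption.

Code
# Sketch: tidy each draft year, pausing between requests, then stack the
# per-year tables with rbind() (function body assumed, not the original)
import_data <- function(start_year, end_year) {
  tables <- lapply(start_year:end_year, function(y) {
    Sys.sleep(5) # pause so Hockey Reference does not rate limit us
    tidy_draft(y)
  })
  do.call(rbind, tables) # bind the 25 yearly tables into one data frame
}

all_data <- import_data(start_year, end_year)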
We can check that we loaded the data correctly: there should be between 5250 and 5600 rows (as we will see, the number of picks in a draft has varied over the years in our dataset, but it is always between 210 and 224, so 25 drafts will have between 25 × 210 = 5250 and 25 × 224 = 5600 picks) and 6 columns:
Code
dim(all_data) # confirm there are 5250-5600 rows and 6 columns
[1] 5425 6
Code
length(unique(all_data$year)) # confirm all 25 years have been included
[1] 25
Code
# should be the last 10 picks from 2020; note that there were 7*31 = 217
# picks in the draft that year
all_data |>
  tail(10) |>
  gt() |>
  opt_all_caps() |>
  tab_source_note("Table 4.4.1: The Last 10 Picks from our Dataset")
| year | overall | to | pos | ps | gp |
|---|---|---|---|---|---|
| 2020 | 208 | 2020 | D | 0.0 | 0 |
| 2020 | 209 | 2020 | F | 0.0 | 0 |
| 2020 | 210 | 2020 | F | 0.0 | 0 |
| 2020 | 211 | 2020 | F | 0.0 | 0 |
| 2020 | 212 | 2025 | G | 5.7 | 39 |
| 2020 | 213 | 2020 | F | 0.0 | 0 |
| 2020 | 214 | 2020 | G | 0.0 | 0 |
| 2020 | 215 | 2020 | F | 0.0 | 0 |
| 2020 | 216 | 2020 | F | 0.0 | 0 |
| 2020 | 217 | 2020 | F | 0.0 | 0 |

Table 4.4.1: The Last 10 Picks from our Dataset
Code
all_data |> # confirm pick 123 of 2002 was removed (it was an invalid pick)
  filter(year == 2002 & overall <= 125 & overall >= 121) |>
  gt() |>
  opt_all_caps() |>
  tab_source_note("Table 4.4.2: Pick 123 in 2002 was Removed")
| year | overall | to | pos | ps | gp |
|---|---|---|---|---|---|
| 2002 | 121 | 2002 | G | 0 | 0 |
| 2002 | 122 | 2002 | D | 0 | 0 |
| 2002 | 124 | 2002 | D | 0 | 0 |
| 2002 | 125 | 2002 | D | 0 | 0 |

Table 4.4.2: Pick 123 in 2002 was Removed
These checks all returned what they should.
4.6 Storing the Data
Finally, recall that the import_data function (and the tidy_draft function it relies on) takes quite a while to run, especially when importing data for a large number of years. To resolve this issue, we will store the data in an AWS S3 bucket so that future chapters can get the data from the S3 bucket instead of from Hockey Reference, making this report render significantly faster. We do this by following the instructions given by Agarwal (2020), which tell us to save all_data as a parquet file and then write that file to our S3 bucket. A sketch of this step follows.
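The upload code itself is not shown in this chapter; the following is a minimal sketch of the step Agarwal (2020) describes, assuming the bucket name trevor-stat468 (taken from the read query below) and AWS credentials already configured in the environment.

Code
# Sketch: serialize all_data to parquet and upload it to S3
# (bucket name inferred from the read query below; credentials assumed)
write_parquet(all_data, "all_data.parquet") # from the arrow package
put_object(file = "all_data.parquet",       # from the aws.s3 package
           object = "all_data.parquet",
           bucket = "trevor-stat468")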
In this and future chapters, we will start by loading functions.R, which includes all of the functions and global constants defined in prior chapters and also queries the data from the AWS S3 bucket using duckdb. functions.R is included in this project's GitHub repo, and the part of it that loads all_data is given below (here we load it into all_data_test to show that it actually works). I also followed some of the instructions from Michonneau (2023) to help me write the code below.
Code
con <- dbConnect(duckdb())
dbExecute(con, "INSTALL httpfs;")
dbExecute(con, "LOAD httpfs;")
all_data_test <- dbGetQuery(con,
  "SELECT * FROM read_parquet('s3://trevor-stat468/all_data.parquet');")
DBI::dbDisconnect(con)
Code
dim(all_data_test)
[1] 5425 6
Code
all_data_test |>head(10) |>gt() |>opt_all_caps() |>tab_source_note("Table 4.6.1: Data Loaded from S3 Bucket")
| year | overall | to | pos | ps | gp |
|---|---|---|---|---|---|
| 1996 | 1 | 2015 | D | 64.6 | 1179 |
| 1996 | 2 | 2008 | D | 25.8 | 496 |
| 1996 | 3 | 2011 | F | 56.6 | 822 |
| 1996 | 4 | 2000 | F | 0.0 | 3 |
| 1996 | 5 | 2007 | D | 8.8 | 231 |
| 1996 | 6 | 2009 | F | 12.5 | 627 |
| 1996 | 7 | 2007 | F | 9.2 | 545 |
| 1996 | 8 | 2004 | D | 0.0 | 44 |
| 1996 | 9 | 2011 | D | 46.9 | 917 |
| 1996 | 10 | 2004 | D | 2.7 | 209 |

Table 4.6.1: Data Loaded from S3 Bucket
Table 4.6.1 confirms that the data has been loaded from the S3 bucket. In future chapters we will edit this data and put the new version in the same S3 bucket. We now proceed to the Visualize chapter.