4  Tidy

Author

Trevor S

4.1 Setup

We install and load the necessary packages, along with functions from prior chapters.

Code
# renv::install("rvest")
# renv::install("tidyverse")
# renv::install("gt")
# renv::install("DBI")
# renv::install("duckdb")
# renv::install("aws.s3")
# renv::install("paws")
# renv::install("arrow")

library(rvest)
library(tidyverse)
library(stringr)
library(gt)
library(DBI)
library(duckdb)
library(aws.s3)
library(paws)
library(arrow)

source("functions.R") # load functions defined in prior chapters

4.2 Introduction

In the Tidy chapter, we will put the data into tidy form, clean it, and store it in an S3 bucket, which will make our analysis in the following chapters easier. Although the table from the previous chapter looks fairly clean, closer inspection reveals some issues, as shown in Table 4.2.1:

Code
start_year <- 1996
end_year <- 2020

import_draft(start_year)[23:30,] |> # rows 23-30: picks 23-28 plus the two separator rows
  gt() |> 
  opt_all_caps() |> 
  tab_source_note("Table 4.2.1: Some Issues in our Data")
| overall | team | player | nat | pos | age | to | amateur_team | gp | g | a | pts | x | pim | gp_2 | w | l | t_o | sv_percent | gaa | ps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | Pittsburgh Penguins | Craig Hillier | CA | G |  |  | Ottawa 67's (OHL) |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 24 | Phoenix Coyotes | Daniel Briere | CA | C | 18 | 2015 | Drummondville Voltigeurs (QMJHL) | 973 | 307 | 389 | 696 | -24 | 744 |  |  |  |  |  |  | 78.5 |
| 25 | Colorado Avalanche | Peter Ratchuk | US | D | 19 | 2001 | Shattuck-St. Mary's School (High-MN) | 32 | 1 | 1 | 2 | -2 | 10 |  |  |  |  |  |  | 0.6 |
| 26 | Detroit Red Wings | Jesse Wallin | CA | D | 18 | 2003 | Red Deer Rebels (WHL) | 49 | 0 | 2 | 2 | -5 | 34 |  |  |  |  |  |  | 0.2 |
| Round 2 | Round 2 |  |  |  |  |  |  | NHL Scoring Stats | NHL Scoring Stats | NHL Scoring Stats | NHL Scoring Stats | NHL Scoring Stats | NHL Scoring Stats | NHL Goalie Stats | NHL Goalie Stats | NHL Goalie Stats | NHL Goalie Stats | NHL Goalie Stats | NHL Goalie Stats |  |
| Overall | Team | Player | Nat. | Pos | Age | To | Amateur Team | GP | G | A | PTS | +/- | PIM | GP | W | L | T/O | SV% | GAA | PS |
| 27 | Buffalo Sabres | Cory Sarich | CA | D | 18 | 2014 | Saskatoon Blades (WHL) | 969 | 21 | 137 | 158 | -9 | 1089 |  |  |  |  |  |  | 36.0 |
| 28 | Pittsburgh Penguins | Pavel Skrbek | CZ | D | 18 | 2002 | HC Kladno (Czech) | 12 | 0 | 0 | 0 | 1 | 8 |  |  |  |  |  |  | 0.2 |

Table 4.2.1: Some Issues in our Data

Some problems that immediately come up are:

  • Two rows get inserted at the end of every round to indicate the round changed.

  • Though not visible because of the usage of gt, numbers are coded as strings (overall, age, to, etc.).

  • (At least) one player is missing everything except for their pick number, name, team, position, nationality, and amateur team.

  • The +/- column got renamed to x.

  • There were two gp columns; one was automatically renamed to gp_2 when we used janitor::clean_names() in import_draft.

By doing a little detective work with some of the other players with missing values elsewhere in the dataset, we notice that players who never played in the NHL have empty strings listed for everything except their pick number, name, team, position, nationality, and amateur team. Handling these players is one of several things we will do in this chapter to prepare our data for analysis.
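To see how these empty strings behave once parsed, consider a minimal sketch on a hypothetical two-row excerpt (the data frame below is made up for illustration, not the real scraped table): readr::type_convert() parses the empty strings as NA, and dplyr::coalesce() can then replace those NAs with zeros.

```r
library(readr)
library(dplyr)

# Hypothetical excerpt: one player who reached the NHL and one who never
# did (the latter has empty strings for to, gp, and ps)
raw <- tibble(
  player = c("Daniel Briere", "Craig Hillier"),
  to     = c("2015", ""),
  gp     = c("973", ""),
  ps     = c("78.5", "")
)

tidied <- raw |>
  type_convert() |>                       # "" is parsed as NA, "973" as 973
  mutate(gp = coalesce(gp, 0),            # never played => 0 games
         ps = pmax(coalesce(ps, 0), 0))   # and 0 point shares, floored at 0

tidied$gp # 973 0
```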

4.3 Cleaning

We build a function to tidy the data. In particular, we want it to:

  • Remove the rows added between rounds.

  • Correct the types of each column so we can use numeric columns in calculations.

  • Change gp and ps values to 0 for players who never played in the NHL or have a negative ps.

  • Standardize position to be F (forward: any combination of LW, RW, C), D (defenceman), or G (goalie). Note that Kaspars Astashenko has his position listed as “D/W”; a quick search of the NHL website reveals he was a defenceman.

  • If is.na(to), then the player never played in the NHL, so set it to the draft year.

  • Add a year column so we can combine all of the data into a single data frame.

  • Select the columns we care about (year, overall, pos, to, ps, and gp) in that order.

  • Remove any players selected after 224\(^{\text{th}}\) overall (the number of picks in the 2025 NHL Entry Draft).

  • We don’t care about +/-, so we can ignore the column being renamed since we won’t be using it anyway.

  • It also turns out that gp_2 is empty for skaters, and for goalies gp and gp_2 hold the same value, so this issue can be resolved by simply selecting gp.

  • The 69\(^{\text{th}}\) pick of 2011 was forfeited and the 123\(^{\text{rd}}\) pick of 2002 was invalid; both are listed as blank rows, so these should be removed.

Note that we cannot remove the round-separating rows by specified row number, since many of the drafts in our dataset have different numbers of picks per round, and even rounds within the same draft have had different numbers of picks.
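The position rule is worth checking in isolation before embedding it in the function. Here is a small sketch on a made-up vector of raw position strings (including Astashenko's "D/W"):

```r
library(stringr)

# Hypothetical raw position strings as they might appear in the scraped table
pos <- c("C", "LW", "RW/LW", "D", "D/W", "G")

# Goalies first, then anything containing exactly one "D" (which also
# catches "D/W"), and everything else is a forward
std <- ifelse(str_count(pos, "G") == 1, "G",
              ifelse(str_count(pos, "D") == 1, "D", "F"))
std # "F" "F" "F" "D" "D" "G"
```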

Code
tidy_draft <- function(year){
  draft_year_table <- import_draft(year) |> # get the untidied data
    filter(overall != "Overall" & overall != "" & # remove extra rows
             as.numeric(overall) < 225 & # remove players drafted after pick 224
             amateur_team != "()") |> # remove invalid/forfeited picks 
    type_convert() |> # fix types 
    mutate("year" = year, 
           "ps" = pmax(coalesce(ps, 0), 0), "gp" = coalesce(gp, 0), # players who never played in the NHL
           "to" = coalesce(to, year), # players who never played in the NHL
           "pos" = ifelse(str_count(pos, "G") == 1, "G", # fix positions
                          ifelse(str_count(pos, "D") == 1, "D", "F"))) |> 
    select(year, overall, to, pos, ps, gp) # columns we care about
  draft_year_table
}

tidy_draft(1996) |> 
  head(50) |> 
  gt() |> 
  opt_all_caps() |> 
  tab_source_note("Table 4.3.1: The Function Works Correctly")
year overall to pos ps gp
1996 1 2015 D 64.6 1179
1996 2 2008 D 25.8 496
1996 3 2011 F 56.6 822
1996 4 2000 F 0.0 3
1996 5 2007 D 8.8 231
1996 6 2009 F 12.5 627
1996 7 2007 F 9.2 545
1996 8 2004 D 0.0 44
1996 9 2011 D 46.9 917
1996 10 2004 D 2.7 209
1996 11 2004 D 0.0 82
1996 12 2004 F 0.6 60
1996 13 2014 D 74.6 1107
1996 14 2013 F 17.7 798
1996 15 2016 F 52.0 1293
1996 16 1999 D 0.0 5
1996 17 2000 F 3.5 113
1996 18 2001 F 0.0 57
1996 19 2001 D 0.3 5
1996 20 2008 F 12.8 521
1996 21 2012 F 54.0 938
1996 22 1996 D 0.0 0
1996 23 1996 G 0.0 0
1996 24 2015 F 78.5 973
1996 25 2001 D 0.6 32
1996 26 2003 D 0.2 49
1996 27 2014 D 36.0 969
1996 28 2002 D 0.2 12
1996 29 2009 F 0.0 337
1996 30 2012 F 5.1 341
1996 31 1999 D 0.0 18
1996 32 2004 D 0.0 6
1996 33 1996 F 0.0 0
1996 34 1996 F 0.0 0
1996 35 2019 F 66.0 1516
1996 36 2001 D 0.6 19
1996 37 2002 F 2.6 73
1996 38 1996 F 0.0 0
1996 39 2004 F 0.0 55
1996 40 2013 F 6.6 524
1996 41 1996 D 0.0 0
1996 42 2003 D 0.0 2
1996 43 2007 F 22.0 552
1996 44 2013 G 49.0 341
1996 45 1996 F 0.0 0
1996 46 1996 F 0.0 0
1996 47 2006 F 7.3 142
1996 48 2000 F 0.7 53
1996 49 2012 D 42.1 797
1996 50 1996 G 0.0 0
Table 4.3.1: The Function Works Correctly

Table 4.3.1 confirms the function completes the tasks listed at the start of this section. We can now import the data, cleaning it as we import it.

4.4 Getting the Data

We now load in all of the data and use rbind() to bind the tables together, giving us a single data frame to work with. This leaves our data in a mix of long and wide format: the year and overall columns are arranged as they would be in long format, while the to, pos, gp, and ps columns are arranged as they would be in wide format. Note that this step takes 2-3 minutes to run (because of the Sys.sleep(5) line, which is necessary to prevent us from being rate limited). We will soon store the data in an S3 bucket so we don't have to run this function any more than we have to.

Code
all_data <- do.call(rbind, lapply(seq(start_year, end_year, 1),
                   \(x) tidy_draft(x)))

4.5 Verification

We can check that we loaded the data correctly: there should be between 5250 and 5600 rows (the number of picks in a draft has changed over the years in our dataset, as we will see, but it is always between 210 and 224, so 25 drafts will have between 5250 and 5600 picks in total) and 6 columns:
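The bounds above come from simple arithmetic, which can be sketched as a quick check (reusing the start_year and end_year values set in Section 4.2):

```r
# Sanity bounds for the combined data frame: 25 drafts, each with between
# 210 and 224 picks
start_year <- 1996
end_year <- 2020

n_drafts <- end_year - start_year + 1  # 25
c(lower = n_drafts * 210, upper = n_drafts * 224)
# lower upper
#  5250  5600
```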

Code
dim(all_data) # confirm there are 5250-5600 rows and 6 columns
[1] 5425    6
Code
length(unique(all_data$year)) # confirm all 25 years have been included
[1] 25
Code
# should be the last 10 picks from 2020, note that there were 7*31 = 217 
#   picks in the draft that year
all_data |>
  tail(10) |>
  gt() |> 
  opt_all_caps() |> 
  tab_source_note("Table 4.5.1: The Last 10 Picks from our Dataset")
year overall to pos ps gp
2020 208 2020 D 0.0 0
2020 209 2020 F 0.0 0
2020 210 2020 F 0.0 0
2020 211 2020 F 0.0 0
2020 212 2025 G 5.7 39
2020 213 2020 F 0.0 0
2020 214 2020 G 0.0 0
2020 215 2020 F 0.0 0
2020 216 2020 F 0.0 0
2020 217 2020 F 0.0 0
Table 4.5.1: The Last 10 Picks from our Dataset
Code
all_data |> # Confirm pick 123 of 2002 was removed (it was an invalid pick)
  filter(year == 2002 & overall <= 125 & overall >= 121) |> 
  gt() |> 
  opt_all_caps() |> 
  tab_source_note("Table 4.5.2: Pick 123 in 2002 was Removed")
year overall to pos ps gp
2002 121 2002 G 0 0
2002 122 2002 D 0 0
2002 124 2002 D 0 0
2002 125 2002 D 0 0
Table 4.5.2: Pick 123 in 2002 was Removed

These checks all returned what they should.

4.6 Storing the Data

Finally, recall that the import_draft function (and thus the tidy_draft function) both take quite a while to run, especially when importing data for a large number of years. To resolve this issue, we will store the data in an AWS S3 bucket so future chapters can get the data from the S3 bucket instead of from Hockey Reference, making this report render significantly faster. We do this by following the instructions given by Agarwal (2020), which tell us to save all_data in a parquet file and then write it to our S3 bucket.

Code
# The two keys are already set as environment variables (re-setting them to
#   themselves is a no-op); the only value we actually need to set is the region
Sys.setenv("AWS_ACCESS_KEY_ID" = Sys.getenv("AWS_ACCESS_KEY_ID"),
           "AWS_SECRET_ACCESS_KEY" = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
           "AWS_DEFAULT_REGION" = "us-east-2")
bucket <- "trevor-stat468"

s3write_using(all_data, FUN = write_parquet, 
              bucket = bucket, object = "all_data.parquet")

In this and future chapters, we will start by loading all of the functions from functions.R, which includes the functions and global constants defined across chapters, and which also queries the data from the AWS S3 bucket using duckdb. functions.R is included in this project's GitHub repo, and the part of it that loads all_data is given below (here we load it into all_data_test to show that it actually works). I also followed some of the instructions from Michonneau (2023) to help write the code below.

Code
con <- dbConnect(duckdb())

dbExecute(con, "INSTALL httpfs;")
dbExecute(con, "LOAD httpfs;")

all_data_test <- dbGetQuery(con, "SELECT * 
                            FROM read_parquet('s3://trevor-stat468/all_data.parquet');")

DBI::dbDisconnect(con)
Code
dim(all_data_test)
[1] 5425    6
Code
all_data_test |> 
  head(10) |> 
  gt() |>
  opt_all_caps() |> 
  tab_source_note("Table 4.6.1: Data Loaded from S3 Bucket")
year overall to pos ps gp
1996 1 2015 D 64.6 1179
1996 2 2008 D 25.8 496
1996 3 2011 F 56.6 822
1996 4 2000 F 0.0 3
1996 5 2007 D 8.8 231
1996 6 2009 F 12.5 627
1996 7 2007 F 9.2 545
1996 8 2004 D 0.0 44
1996 9 2011 D 46.9 917
1996 10 2004 D 2.7 209
Table 4.6.1: Data Loaded from S3 Bucket

Table 4.6.1 confirms that the data has been loaded from the S3 bucket. In future chapters we will edit this data and put the new version in the same S3 bucket. We now proceed to the Visualize chapter.

Agarwal, Saurabh. 2020. “How to Read and Write Data from and to S3 Bucket Using R?” Medium. https://medium.com/@som028/how-to-read-and-write-data-from-and-to-s3-bucket-using-r-3fed7e686844.
Michonneau, François. 2023. “How to Work with Remote Parquet Files with the Duckdb R Package?” https://francoismichonneau.net/2023/06/duckdb-r-remote-data/.