1. Introduction
2. Tidy Data
3 Data Transformation
- 3.1 dplyr basics
- 3.2 Join tables
  - Exercise
References

1. Introduction

Raw tabular datasets can come in many ways. How do you re-organize them for easy manipulation in R? How do you further prepare the data (filter, select, summarize, etc.) to get them ready for modeling? We will cover those topics in this session.

2. Tidy Data

Tidy data is a way to organize tabular data. It provides a consistent data structure across packages. A table is tidy if:

each variable forms a column.
each observation, or case, forms a row.
each type of observational unit forms a table

The tidy data concept is introduced by Wickham (2014). If you have studied Database design, you will realize the root of the idea is Codd’s 3rd normal form (Codd EF, 1990)

2.1 “Messy” data and why tidy is better.

The tidy data definition might be too abstract. Let’s start with an example of “messy”.

	treatmenta	treatmentb
John Smith	-	2
Jane Doe	16	11
Mary Johnson	3	1

# load the tidyverse library
# this will load a set of packages
library(tidyverse)

## -- Attaching packages --------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.3.0
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts ------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# create a messy dataset
df_messy <- tibble(
  name = c("John Smith", "Jane Doe", "Mary Johnson"),
  treatmenta = c(NA, 16, 3),
  treatmentb = c(2, 11, 1)
)
df_messy

## # A tibble: 3 x 3
##   name         treatmenta treatmentb
##   <chr>             <dbl>      <dbl>
## 1 John Smith          NA          2.
## 2 Jane Doe            16.        11.
## 3 Mary Johnson         3.         1.

This dataset essentially has three variables: name, treatment, and result. The column headers in the table are values: treatment a or b, not variable names (i.e., it violates (1) and (2) in the tidy data definition). The way this dataset is organized makes it hard to retrieve the values and analyze them in a consistent way.

# list all the treatment method
names(df_messy)

## [1] "name"       "treatmenta" "treatmentb"

# find all the subjects
df_messy["name"]

## # A tibble: 3 x 1
##   name        
##   <chr>       
## 1 John Smith  
## 2 Jane Doe    
## 3 Mary Johnson

What if the data looks like this?

	John Smith	Jane Doe	Mary Johnson
treatmenta	-	16	3
treatmentb	2	11	1

It’s just as “messy”.

Imaging you want to build a set of tools/functions to analyze your data, but the raw data you collected are all messy in their own ways. Your tools/functions then need to deal with many different forms of data, which makes the work flow inefficient. That’s why we would want to first store/transform raw data in/into a consistent “tidy” way.

The tidy data way.

name	treatment	result
John Smith	a	-
Jane Doe	a	16
Mary Johnson	a	3
John Smith	b	2
Jane Doe	b	11
Mary Johnson	b	1

# create a tidy dataset
df_tidy <- tibble(
  name = c("John Smith", "Jane Doe", "Mary Johnson", "John Smith", "Jane Doe", "Mary Johnson"),
  treatment = c("a", "a", "a", "b", "b", "b"),
  result = c(NA, 16, 3, 2, 11, 1)
)
df_tidy

## # A tibble: 6 x 3
##   name         treatment result
##   <chr>        <chr>      <dbl>
## 1 John Smith   a            NA 
## 2 Jane Doe     a            16.
## 3 Mary Johnson a             3.
## 4 John Smith   b             2.
## 5 Jane Doe     b            11.
## 6 Mary Johnson b             1.

print(df_tidy["name"])

## # A tibble: 6 x 1
##   name        
##   <chr>       
## 1 John Smith  
## 2 Jane Doe    
## 3 Mary Johnson
## 4 John Smith  
## 5 Jane Doe    
## 6 Mary Johnson

print(df_tidy["treatment"])

## # A tibble: 6 x 1
##   treatment
##   <chr>    
## 1 a        
## 2 a        
## 3 a        
## 4 b        
## 5 b        
## 6 b

2.2 From messy to tidy

How do we turn a messy dataset into a tidy one in R? It depends on how messy the dataset is, and in what way. We’ll discuss a few examples for a starter.

Let’s first try to turn the messy data example seen above into a tidy one. There are many ways to do this transformation. Wickham (2014) used melt() function in the reshape2 package. The reshape2 package was replaced by tidyr package around 2015, so we will instead use the gather() function in the tidyr package.

gather() collects a set of column names and places them into a single “key” column. It also collects the cells of those columns and places them into a single value column.

# tidyr is part of the tidyverse package, which we have already loaded
df_tidy_new <- gather(df_messy, key = "treatment", value = "result", 2:3)
df_tidy_new

## # A tibble: 6 x 3
##   name         treatment  result
##   <chr>        <chr>       <dbl>
## 1 John Smith   treatmenta    NA 
## 2 Jane Doe     treatmenta    16.
## 3 Mary Johnson treatmenta     3.
## 4 John Smith   treatmentb     2.
## 5 Jane Doe     treatmentb    11.
## 6 Mary Johnson treatmentb     1.

Let’s load a real world dataset for an exercise. (It’s also an opportunity to learn how to import data.) The dataset examines the relationship between income and religion in the US.

# import pew_v2.csv file
df_pew <- read_csv("https://raw.githubusercontent.com/eijoac/datasets/master/pew_v2.csv")

## Parsed with column specification:
## cols(
##   religion = col_character(),
##   `less$10k` = col_integer(),
##   `$10-20k` = col_integer(),
##   `$20-30k` = col_integer(),
##   `$30-40k` = col_integer(),
##   `$40-50k` = col_integer(),
##   `$50-75k` = col_integer(),
##   `$75-100k` = col_integer(),
##   `$100-150k` = col_integer(),
##   above150k = col_integer(),
##   `Don't know/refused` = col_integer()
## )

df_pew

## # A tibble: 18 x 11
##    religion   `less$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k`
##    <chr>           <int>     <int>     <int>     <int>     <int>     <int>
##  1 Agnostic           27        34        60        81        76       137
##  2 Atheist            12        27        37        52        35        70
##  3 Buddhist           27        21        30        34        33        58
##  4 Catholic          418       617       732       670       638      1116
##  5 Don’t kno~         15        14        15        11        10        35
##  6 Evangelic~        575       869      1064       982       881      1486
##  7 Hindu               1         9         7         9        11        34
##  8 Historica~        228       244       236       238       197       223
##  9 Jehovah's~         20        27        24        24        21        30
## 10 Jewish             19        19        25        25        30        95
## 11 Mainline ~        289       495       619       655       651      1107
## 12 Mormon             29        40        48        51        56       112
## 13 Muslim              6         7         9        10         9        23
## 14 Orthodox           13        17        23        32        32        47
## 15 Other Chr~          9         7        11        13        13        14
## 16 Other Fai~         20        33        40        46        49        63
## 17 Other Wor~          5         2         3         4         2         7
## 18 Unaffilia~        217       299       374       365       341       528
## # ... with 4 more variables: `$75-100k` <int>, `$100-150k` <int>,
## #   above150k <int>, `Don't know/refused` <int>

I used read_csv() from the readr package to import the data into R tibble. The source of the data is a URL. If the data is local, you will just specify the path to the data (e.g. c:\mydata\mydata.csv)

Let’s take a quick look at the document for read_csv() and other related data import functions in readr package. Besides the readr, a few other packages for importing data from different sources are listed below.

readxl for Excel sheets.
haven for SPSS, Stata and SAS data.
jsonlite for JSON.
xml2 for XML.
httr for Web APIs.
rvest for web scraping
DBI for connecting to DataBase engine.

Exercise

Can you tidy up the pew dataset?

# your code here

df_pew_tidy <- gather(df_pew, key = "income", value = "n", -religion)
df_pew_tidy

## # A tibble: 180 x 3
##    religion                income       n
##    <chr>                   <chr>    <int>
##  1 Agnostic                less$10k    27
##  2 Atheist                 less$10k    12
##  3 Buddhist                less$10k    27
##  4 Catholic                less$10k   418
##  5 Don’t know/refused      less$10k    15
##  6 Evangelical Prot        less$10k   575
##  7 Hindu                   less$10k     1
##  8 Historically Black Prot less$10k   228
##  9 Jehovah's Witness       less$10k    20
## 10 Jewish                  less$10k    19
## # ... with 170 more rows

The reverse operation of gather() is spread(). spread() spread a key-value pair across multiple columns.

spread(df_pew_tidy, income, n)

## # A tibble: 18 x 11
##    religion  `$10-20k` `$100-150k` `$20-30k` `$30-40k` `$40-50k` `$50-75k`
##    <chr>         <int>       <int>     <int>     <int>     <int>     <int>
##  1 Agnostic         34         109        60        81        76       137
##  2 Atheist          27          59        37        52        35        70
##  3 Buddhist         21          39        30        34        33        58
##  4 Catholic        617         792       732       670       638      1116
##  5 Don’t kn~        14          17        15        11        10        35
##  6 Evangeli~       869         723      1064       982       881      1486
##  7 Hindu             9          48         7         9        11        34
##  8 Historic~       244          81       236       238       197       223
##  9 Jehovah'~        27          11        24        24        21        30
## 10 Jewish           19          87        25        25        30        95
## 11 Mainline~       495         753       619       655       651      1107
## 12 Mormon           40          49        48        51        56       112
## 13 Muslim            7           8         9        10         9        23
## 14 Orthodox         17          42        23        32        32        47
## 15 Other Ch~         7          14        11        13        13        14
## 16 Other Fa~        33          40        40        46        49        63
## 17 Other Wo~         2           4         3         4         2         7
## 18 Unaffili~       299         321       374       365       341       528
## # ... with 4 more variables: `$75-100k` <int>, above150k <int>, `Don't
## #   know/refused` <int>, `less$10k` <int>

spread() can sometimes help tidy up dateset as well. Let’s take a look at the table2 dataset provided by the tidyr package.

table2

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

table2 mixes up the values of population and cases (TB cases) in the same column count. We can use spread() to tidy it up. (Discussion: Is it not tidy already?)

Exercise

Use spread() to tidy up the table2.

# your code here

table2_tidy <- spread(table2, type, count)
table2_tidy

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

2.3 Other messy ways

According to Wickham (2014), messy datasets havet 5 common problems.

Column headers are values, not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table.
A single observational unit is stored in multiple tables.

We have seen (1) and (2). Let’s quickly discuss another example (see slides). I’ll leave you to read (Wickham (2014) to explore the rest. This site summarizes the paper.

We end this section by discussing how to save a dataset in csv format. Just use the write_csv() function.

# write_csv(df_pew_tidy, "path_to_data_folder/pew_tidy.csv")

3 Data Transformation

R has many tools/packages to manipulate data. We will mainly focus on the dplyr() package as it’s one of the best and it’s part of an eco-system of tidyverse that handles data in a consistant way.

Let’s first load a dataset.

employees <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/Employees.csv")

## Parsed with column specification:
## cols(
##   EmployeeID = col_integer(),
##   LastName = col_character(),
##   FirstName = col_character(),
##   Title = col_character(),
##   TitleOfCourtesy = col_character(),
##   BirthDate = col_datetime(format = ""),
##   HireDate = col_datetime(format = ""),
##   Address = col_character(),
##   City = col_character(),
##   Region = col_character(),
##   PostalCode = col_character(),
##   Country = col_character(),
##   HomePhone = col_character(),
##   Extension = col_integer(),
##   Notes = col_character(),
##   ReportsTo = col_character(),
##   PhotoPath = col_character()
## )

employees

## # A tibble: 9 x 17
##   EmployeeID LastName  FirstName Title TitleOfCourtesy BirthDate          
##        <int> <chr>     <chr>     <chr> <chr>           <dttm>             
## 1          1 Davolio   Nancy     Sale~ Ms.             1966-12-08 00:00:00
## 2          2 Fuller    Andrew    Vice~ Dr.             1970-02-19 00:00:00
## 3          3 Leverling Janet     Sale~ Ms.             1981-08-30 00:00:00
## 4          4 Peacock   Margaret  Sale~ Mrs.            1955-09-19 00:00:00
## 5          5 Buchanan  Steven    Sale~ Mr.             1973-03-04 00:00:00
## 6          6 Suyama    Michael   Sale~ Mr.             1981-07-02 00:00:00
## 7          7 King      Robert    Sale~ Mr.             1978-05-29 00:00:00
## 8          8 Callahan  Laura     Insi~ Ms.             1976-01-09 00:00:00
## 9          9 Dodsworth Anne      Sale~ Ms.             1984-01-27 00:00:00
## # ... with 11 more variables: HireDate <dttm>, Address <chr>, City <chr>,
## #   Region <chr>, PostalCode <chr>, Country <chr>, HomePhone <chr>,
## #   Extension <int>, Notes <chr>, ReportsTo <chr>, PhotoPath <chr>

print(employees, n = 3, width = Inf)

## # A tibble: 9 x 17
##   EmployeeID LastName  FirstName Title                 TitleOfCourtesy
##        <int> <chr>     <chr>     <chr>                 <chr>          
## 1          1 Davolio   Nancy     Sales Representative  Ms.            
## 2          2 Fuller    Andrew    Vice President, Sales Dr.            
## 3          3 Leverling Janet     Sales Representative  Ms.            
##   BirthDate           HireDate            Address                   
##   <dttm>              <dttm>              <chr>                     
## 1 1966-12-08 00:00:00 2010-05-01 00:00:00 507 - 20th Ave. E. Apt. 2A
## 2 1970-02-19 00:00:00 2010-08-14 00:00:00 908 W. Capital Way        
## 3 1981-08-30 00:00:00 2010-04-01 00:00:00 722 Moss Bay Blvd.        
##   City     Region PostalCode Country HomePhone      Extension
##   <chr>    <chr>  <chr>      <chr>   <chr>              <int>
## 1 Seattle  WA     98122      USA     (206) 555-9857      5467
## 2 Tacoma   WA     98401      USA     (206) 555-9482      3457
## 3 Kirkland WA     98033      USA     (206) 555-3412      3355
##   Notes                                                                   
##   <chr>                                                                   
## 1 "Education includes a BA in psychology from Colorado State University i~
## 2 Andrew received his BTS commercial in 1974 and a Ph.D. in international~
## 3 Janet has a BS degree in chemistry from Boston College (1984).  She has~
##   ReportsTo PhotoPath                             
##   <chr>     <chr>                                 
## 1 2         http://accweb/emmployees/davolio.bmp  
## 2 NULL      http://accweb/emmployees/fuller.bmp   
## 3 2         http://accweb/emmployees/leverling.bmp
## # ... with 6 more rows

3.1 `dplyr` basics

filter observations: filter()
select variables: select()
reorder rows: arrange()
create new variables: mutate()
collapse column values to a single summary: summarise()

The above functions can be used together with group_by(), which changes the scope of each function from operating on the whole dataset to operating on each group.

Let’s find all the sales representatives who were born after 1970-01-01. Display their names and their hiring date. Name should be formatted as “LastName, FirstName”. Order the output by hiring date (descending).

# find all the sales representatives whose were born after 1970-01-01
emp01 <- filter(employees, Title == "Sales Representative", BirthDate > 1970-01-01)

## Warning: `as_dictionary()` is soft-deprecated as of rlang 0.3.0.
## Please use `as_data_pronoun()` instead
## This warning is displayed once per session.

## Warning: `new_overscope()` is soft-deprecated as of rlang 0.2.0.
## Please use `new_data_mask()` instead
## This warning is displayed once per session.

## Warning: The `parent` argument of `new_data_mask()` is deprecated.
## The parent of the data mask is determined from either:
## 
##   * The `env` argument of `eval_tidy()`
##   * Quosure environments when applicable
## This warning is displayed once per session.

## Warning: `overscope_clean()` is soft-deprecated as of rlang 0.2.0.
## This warning is displayed once per session.

print(emp01, width = Inf)

## # A tibble: 4 x 17
##   EmployeeID LastName  FirstName Title                TitleOfCourtesy
##        <int> <chr>     <chr>     <chr>                <chr>          
## 1          3 Leverling Janet     Sales Representative Ms.            
## 2          6 Suyama    Michael   Sales Representative Mr.            
## 3          7 King      Robert    Sales Representative Mr.            
## 4          9 Dodsworth Anne      Sales Representative Ms.            
##   BirthDate           HireDate            Address                      
##   <dttm>              <dttm>              <chr>                        
## 1 1981-08-30 00:00:00 2010-04-01 00:00:00 722 Moss Bay Blvd.           
## 2 1981-07-02 00:00:00 2011-10-17 00:00:00 Coventry House Miner Rd.     
## 3 1978-05-29 00:00:00 2012-01-02 00:00:00 Edgeham Hollow Winchester Way
## 4 1984-01-27 00:00:00 2012-11-15 00:00:00 7 Houndstooth Rd.            
##   City     Region PostalCode Country HomePhone      Extension
##   <chr>    <chr>  <chr>      <chr>   <chr>              <int>
## 1 Kirkland WA     98033      USA     (206) 555-3412      3355
## 2 London   NULL   EC2 7JR    UK      (71) 555-7773        428
## 3 London   NULL   RG1 9SP    UK      (71) 555-5598        465
## 4 London   NULL   WG2 7LT    UK      (71) 555-4444        452
##   Notes                                                                   
##   <chr>                                                                   
## 1 Janet has a BS degree in chemistry from Boston College (1984).  She has~
## 2 "Michael is a graduate of Sussex University (MA, economics, 1983) and t~
## 3 "Robert King served in the Peace Corps and traveled extensively before ~
## 4 Anne has a BA degree in English from St. Lawrence College.  She is flue~
##   ReportsTo PhotoPath                             
##   <chr>     <chr>                                 
## 1 2         http://accweb/emmployees/leverling.bmp
## 2 5         http://accweb/emmployees/davolio.bmp  
## 3 5         http://accweb/emmployees/davolio.bmp  
## 4 5         http://accweb/emmployees/davolio.bmp

# create a new variable/column with formatted names as required
emp02 <- mutate(emp01, Name = paste(LastName, FirstName, sep = ", "))

# just print the Name variable/column to take a look at the format
print(emp02["Name"])

## # A tibble: 4 x 1
##   Name            
##   <chr>           
## 1 Leverling, Janet
## 2 Suyama, Michael 
## 3 King, Robert    
## 4 Dodsworth, Anne

In the above code, paste() is used to concatenate strings. glue() from the glue package might be a bit easier to use.

# select the variables/columns (to be displayed)
emp03 <- select(emp02, Name, HireDate)
emp03

## # A tibble: 4 x 2
##   Name             HireDate           
##   <chr>            <dttm>             
## 1 Leverling, Janet 2010-04-01 00:00:00
## 2 Suyama, Michael  2011-10-17 00:00:00
## 3 King, Robert     2012-01-02 00:00:00
## 4 Dodsworth, Anne  2012-11-15 00:00:00

# order by HireDate
emp04 <- arrange(emp03, desc(HireDate))
emp04

## # A tibble: 4 x 2
##   Name             HireDate           
##   <chr>            <dttm>             
## 1 Dodsworth, Anne  2012-11-15 00:00:00
## 2 King, Robert     2012-01-02 00:00:00
## 3 Suyama, Michael  2011-10-17 00:00:00
## 4 Leverling, Janet 2010-04-01 00:00:00

The above steps can be streamlined by using pipes.

# streamline using pipes %>%
emp05 <- employees %>%
  filter(Title == "Sales Representative", BirthDate > 1970-01-01) %>%
  mutate(Name = paste(LastName, FirstName, sep = ", ")) %>%
  select(Name, HireDate) %>%
  arrange(desc(HireDate))

emp05

## # A tibble: 4 x 2
##   Name             HireDate           
##   <chr>            <dttm>             
## 1 Dodsworth, Anne  2012-11-15 00:00:00
## 2 King, Robert     2012-01-02 00:00:00
## 3 Suyama, Michael  2011-10-17 00:00:00
## 4 Leverling, Janet 2010-04-01 00:00:00

Exercise

Find all employees who are based in USA. Display their names and their birth date. Name should be formatted as “FirstName LastName”. Order the output by birth date (descending).

# your code here

# your code here
employees %>%
  filter(Country == "USA") %>%
  mutate(Name = paste(FirstName, LastName, sep = " ")) %>%
  select(Name, BirthDate) %>%
  arrange(desc(BirthDate))

## # A tibble: 5 x 2
##   Name             BirthDate          
##   <chr>            <dttm>             
## 1 Janet Leverling  1981-08-30 00:00:00
## 2 Laura Callahan   1976-01-09 00:00:00
## 3 Andrew Fuller    1970-02-19 00:00:00
## 4 Nancy Davolio    1966-12-08 00:00:00
## 5 Margaret Peacock 1955-09-19 00:00:00

Let’s now count the number of empolyees by Country. We first need to group data by country using group_by().

emp_by_country <- group_by(employees, Country)
emp_by_country

## # A tibble: 9 x 17
## # Groups:   Country [2]
##   EmployeeID LastName  FirstName Title TitleOfCourtesy BirthDate          
##        <int> <chr>     <chr>     <chr> <chr>           <dttm>             
## 1          1 Davolio   Nancy     Sale~ Ms.             1966-12-08 00:00:00
## 2          2 Fuller    Andrew    Vice~ Dr.             1970-02-19 00:00:00
## 3          3 Leverling Janet     Sale~ Ms.             1981-08-30 00:00:00
## 4          4 Peacock   Margaret  Sale~ Mrs.            1955-09-19 00:00:00
## 5          5 Buchanan  Steven    Sale~ Mr.             1973-03-04 00:00:00
## 6          6 Suyama    Michael   Sale~ Mr.             1981-07-02 00:00:00
## 7          7 King      Robert    Sale~ Mr.             1978-05-29 00:00:00
## 8          8 Callahan  Laura     Insi~ Ms.             1976-01-09 00:00:00
## 9          9 Dodsworth Anne      Sale~ Ms.             1984-01-27 00:00:00
## # ... with 11 more variables: HireDate <dttm>, Address <chr>, City <chr>,
## #   Region <chr>, PostalCode <chr>, Country <chr>, HomePhone <chr>,
## #   Extension <int>, Notes <chr>, ReportsTo <chr>, PhotoPath <chr>

summarise(emp_by_country, count = n())

## # A tibble: 2 x 2
##   Country count
##   <chr>   <int>
## 1 UK          4
## 2 USA         5

In the above code, n() is a function for counting. Now, let’s use pipe to do the same.

emp_count_by_country <- employees %>%
  group_by(Country) %>%
  summarise(count = n())
emp_count_by_country

## # A tibble: 2 x 2
##   Country count
##   <chr>   <int>
## 1 UK          4
## 2 USA         5

Let’s load another dataset for your exercise.

customers <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/Customers.csv")

## Parsed with column specification:
## cols(
##   CustomerID = col_character(),
##   CompanyName = col_character(),
##   ContactName = col_character(),
##   ContactTitle = col_character(),
##   Address = col_character(),
##   City = col_character(),
##   Region = col_character(),
##   PostalCode = col_character(),
##   Country = col_character(),
##   Phone = col_character(),
##   Fax = col_character()
## )

customers

## # A tibble: 91 x 11
##    CustomerID CompanyName   ContactName ContactTitle Address  City  Region
##    <chr>      <chr>         <chr>       <chr>        <chr>    <chr> <chr> 
##  1 ALFKI      Alfreds Futt~ Maria Ande~ Sales Repre~ Obere S~ Berl~ NULL  
##  2 ANATR      Ana Trujillo~ Ana Trujil~ Owner        Avda. d~ Méxi~ NULL  
##  3 ANTON      Antonio More~ Antonio Mo~ Owner        Matader~ Méxi~ NULL  
##  4 AROUT      Around the H~ Thomas Har~ Sales Repre~ 120 Han~ Lond~ NULL  
##  5 BERGS      Berglunds sn~ Christina ~ Order Admin~ Berguvs~ Luleå NULL  
##  6 BLAUS      Blauer See D~ Hanna Moos  Sales Repre~ Forster~ Mann~ NULL  
##  7 BLONP      Blondesddsl ~ Frédérique~ Marketing M~ 24, pla~ Stra~ NULL  
##  8 BOLID      Bólido Comid~ Martín Som~ Owner        C/ Araq~ Madr~ NULL  
##  9 BONAP      Bon app'      Laurence L~ Owner        12, rue~ Mars~ NULL  
## 10 BOTTM      Bottom-Dolla~ Elizabeth ~ Accounting ~ 23 Tsaw~ Tsaw~ BC    
## # ... with 81 more rows, and 4 more variables: PostalCode <chr>,
## #   Country <chr>, Phone <chr>, Fax <chr>

Exercise

Find total number of customers per Country and City. Order them by count (descending order).

# your code here

# your code here
customers %>%
  group_by(Country, City) %>%
  summarise(CustomerCount = n()) %>%
  arrange(desc(CustomerCount))

## # A tibble: 69 x 3
## # Groups:   Country [21]
##    Country   City           CustomerCount
##    <chr>     <chr>                  <int>
##  1 UK        London                     6
##  2 Mexico    México D.F.                5
##  3 Brazil    Sao Paulo                  4
##  4 Argentina Buenos Aires               3
##  5 Brazil    Rio de Janeiro             3
##  6 Spain     Madrid                     3
##  7 France    Nantes                     2
##  8 France    Paris                      2
##  9 Portugal  Lisboa                     2
## 10 USA       Portland                   2
## # ... with 59 more rows

Load one more dataset for another exercise.

order_details <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/OrderDetails.csv")

## Parsed with column specification:
## cols(
##   OrderID = col_integer(),
##   ProductID = col_integer(),
##   UnitPrice = col_double(),
##   Quantity = col_integer(),
##   Discount = col_double()
## )

order_details

## # A tibble: 2,155 x 5
##    OrderID ProductID UnitPrice Quantity Discount
##      <int>     <int>     <dbl>    <int>    <dbl>
##  1   10248        11     14.0        12   0.    
##  2   10248        42      9.80       10   0.    
##  3   10248        72     34.8         5   0.    
##  4   10249        14     18.6         9   0.    
##  5   10249        51     42.4        40   0.    
##  6   10250        41      7.70       10   0.    
##  7   10250        51     42.4        35   0.150 
##  8   10250        65     16.8        15   0.150 
##  9   10251        22     16.8         6   0.0500
## 10   10251        57     15.6        15   0.0500
## # ... with 2,145 more rows

Exercise

Find all orders with values greater than $12000. Order them by value. Ignore Discount. Hint: since each order (identified by OrderID) has many items, you need to use group_by(). In the summarise() function, use sum() to sum up item subtotals.

# your code here

# method 1
order_details %>%
  group_by(OrderID) %>%
  summarise(TotalOrderValue = sum(UnitPrice * Quantity)) %>%
  filter(TotalOrderValue > 12000) %>%
  arrange(desc(TotalOrderValue))

## # A tibble: 4 x 2
##   OrderID TotalOrderValue
##     <int>           <dbl>
## 1   10865          17250.
## 2   11030          16322.
## 3   10981          15810.
## 4   10372          12281.

# method 2
order_details %>%
  mutate(SubTotal = UnitPrice * Quantity) %>%
  group_by(OrderID) %>%
  summarise(TotalOrderValue = sum(SubTotal)) %>%
  filter(TotalOrderValue > 12000) %>%
  arrange(desc(TotalOrderValue))

## # A tibble: 4 x 2
##   OrderID TotalOrderValue
##     <int>           <dbl>
## 1   10865          17250.
## 2   11030          16322.
## 3   10981          15810.
## 4   10372          12281.

3.2 Join tables

Most often, your dataset is organized as a collection of “tidy” tables. You need to join them for analysis. The join is based on the relationships between the tables.

This is very similar to the concept of a relational database. A database usually consists of many tables and relationships between tables. Let’s see an example.

The above graph represent 8 tables in a database and the relationships between the tables. The graph is called Entity Relationship (ER) diagram. This is a sample database (of a fake company called Northwind) from an old version of MS Access.

In the ER diagram, the tiny vertical key icon indicates a column is a primary key. A primary key is a column (or set of columns) whose values uniquely identify every row in a table. For example, OrderID is the primary key in the Orders table, and OrderID and ProductID (combined) is the primary key in the OrderDetails table.

The relationship icon (a line with a horizontal key at one end and an infinite symbol at the other end) indicates a foreign key constraint and a one-to-many relationship. A foreign key is a column (or set of columns) in one table whose values uniquely identify a row of another table or the same table. A foreign key mostly refers to a primary key in another table. A foreign key constraint requires that the constrained column contain only values from the primary key column of the other table. For example CustomerID in the Orders table is a foreign key that refers to the CustomerID primary key in the Customers table, and it can only contain values that exist in the CustomerID column of the Customers table.

In addition, it happens that every foreign key constraint in the Northwind DB establishes a one-to-many relationship, i.e. a row from one table can have multiple matching rows in another table. For example, one row from the Customers table can match multiple rows in the Orders table (via CustomerID). This makes sense as one customer can place more than one orders. (Another common relationship a foreign key constraint can establish is the one-to-one relationship.)

logo	meaning
	primary key
	one-to-many foreign key constraint

Why do we need foreign key constraints? (Discussion)

In a database, the relationships/constraints will be stored and enforced by the DB management system. Here we will just load the data as raw files to dataframes/tibbles so there is no mechanism to enforce the relationships.

Let’s learn how to join datasets.

inner join
left / right join, and left / right join with exclusion
full join, full outer join

Let’s do some experiments using two small datasets.

t1 <- tribble(
  ~pk, ~t1c1,
     1, "a",
     2, "b"
)
t2 <- tribble(
  ~fk, ~t2c1,
     1, "c",
     1, "d",
     3, "e"
)

t1

## # A tibble: 2 x 2
##      pk t1c1 
##   <dbl> <chr>
## 1    1. a    
## 2    2. b

t2

## # A tibble: 3 x 2
##      fk t2c1 
##   <dbl> <chr>
## 1    1. c    
## 2    1. d    
## 3    3. e

# playground for join exercises

# inner join
t1 %>% inner_join(t2, by = c("pk" = "fk"))

## # A tibble: 2 x 3
##      pk t1c1  t2c1 
##   <dbl> <chr> <chr>
## 1    1. a     c    
## 2    1. a     d

# left join
# t1 %>% left_join(t2, by = c("pk" = "fk"))

# left join with exclusion
# t1 %>% left_join(t2, by = c("pk" = "fk")) %>% filter(is.na(t2c1))

# right join
# t1 %>% right_join(t2, by = c("pk" = "fk"))

# right join with exclusion
# t1 %>% right_join(t2, by = c("pk" = "fk")) %>% filter(is.na(t1c1))

# full join
# t1 %>% full_join(t2, by = c("pk" = "fk"))

# full outer join
# t1 %>% full_join(t2, by = c("pk" = "fk")) %>% filter(is.na(t1c1) | is.na(t2c1))

Exercise

ex1. What is semi_join() and anti_join in dplyr()? Is semi_join() the same as inner_join? Is anti_join() the same as left join with exclusion we discussed above?

# test it out

# inner join t1 with t2
t1 %>%
  inner_join(t2, by = c("pk" = "fk"))

## # A tibble: 2 x 3
##      pk t1c1  t2c1 
##   <dbl> <chr> <chr>
## 1    1. a     c    
## 2    1. a     d

# semi-join t1 with t2
t1 %>%
  semi_join(t2, by = c("pk" = "fk"))

## # A tibble: 1 x 2
##      pk t1c1 
##   <dbl> <chr>
## 1    1. a

# left join t1 with t2 and with exclusion
t1 %>%
  left_join(t2, by = c("pk" = "fk")) %>%
  filter(is.na(t2c1))

## # A tibble: 1 x 3
##      pk t1c1  t2c1 
##   <dbl> <chr> <chr>
## 1    2. b     <NA>

# anti-join t1 with t2
t1 %>%
  anti_join(t2, by = c("pk" = "fk"))

## # A tibble: 1 x 2
##      pk t1c1 
##   <dbl> <chr>
## 1    2. b

# make a new table
t3 <- tribble(
  ~pk, ~t1c1,
     1, "a",
     2, "b",
     2, "b"
)

# left join t3 with t2 and with exclusion
t3 %>%
  left_join(t2, by = c("pk" = "fk")) %>%
  filter(is.na(t2c1))

## # A tibble: 2 x 3
##      pk t1c1  t2c1 
##   <dbl> <chr> <chr>
## 1    2. b     <NA> 
## 2    2. b     <NA>

# anti-join t3 with t2
t3 %>%
  anti_join(t2, by = c("pk" = "fk"))

## # A tibble: 2 x 2
##      pk t1c1 
##   <dbl> <chr>
## 1    2. b    
## 2    2. b

Let’s load the rest of Northwind dataset and do something a bit more interesting.

orders <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/Orders.csv")

## Parsed with column specification:
## cols(
##   OrderID = col_integer(),
##   CustomerID = col_character(),
##   EmployeeID = col_integer(),
##   OrderDate = col_datetime(format = ""),
##   RequiredDate = col_datetime(format = ""),
##   ShippedDate = col_character(),
##   ShipVia = col_integer(),
##   Freight = col_double(),
##   ShipName = col_character(),
##   ShipAddress = col_character(),
##   ShipCity = col_character(),
##   ShipRegion = col_character(),
##   ShipPostalCode = col_character(),
##   ShipCountry = col_character()
## )

shippers <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/Shippers.csv")

## Parsed with column specification:
## cols(
##   ShipperID = col_integer(),
##   CompanyName = col_character(),
##   Phone = col_character()
## )

suppliers <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/Suppliers.csv")

## Parsed with column specification:
## cols(
##   SupplierID = col_integer(),
##   CompanyName = col_character(),
##   ContactName = col_character(),
##   ContactTitle = col_character(),
##   Address = col_character(),
##   City = col_character(),
##   Region = col_character(),
##   PostalCode = col_character(),
##   Country = col_character(),
##   Phone = col_character(),
##   Fax = col_character(),
##   HomePage = col_character()
## )

products <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/Products.csv")

## Parsed with column specification:
## cols(
##   ProductID = col_integer(),
##   ProductName = col_character(),
##   SupplierID = col_integer(),
##   CategoryID = col_integer(),
##   QuantityPerUnit = col_character(),
##   UnitPrice = col_double(),
##   UnitsInStock = col_integer(),
##   UnitsOnOrder = col_integer(),
##   ReorderLevel = col_integer(),
##   Discontinued = col_integer()
## )

categories <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/Categories.csv")

## Parsed with column specification:
## cols(
##   CategoryID = col_integer(),
##   CategoryName = col_character(),
##   Description = col_character()
## )

ex2. Display all products and their associated suppliers.

# your code here

# your code here
products %>%
  inner_join(suppliers, by = 'SupplierID')

## # A tibble: 77 x 21
##    ProductID ProductName   SupplierID CategoryID QuantityPerUnit UnitPrice
##        <int> <chr>              <int>      <int> <chr>               <dbl>
##  1         1 Chai                   1          1 10 boxes x 20 ~      18.0
##  2         2 Chang                  1          1 24 - 12 oz bot~      19.0
##  3         3 Aniseed Syrup          1          2 12 - 550 ml bo~      10.0
##  4         4 Chef Anton's~          2          2 48 - 6 oz jars       22.0
##  5         5 Chef Anton's~          2          2 36 boxes             21.4
##  6         6 Grandma's Bo~          3          2 12 - 8 oz jars       25.0
##  7         7 Uncle Bob's ~          3          7 12 - 1 lb pkgs.      30.0
##  8         8 Northwoods C~          3          2 12 - 12 oz jars      40.0
##  9         9 Mishi Kobe N~          4          6 18 - 500 g pkg~      97.0
## 10        10 Ikura                  4          8 12 - 200 ml ja~      31.0
## # ... with 67 more rows, and 15 more variables: UnitsInStock <int>,
## #   UnitsOnOrder <int>, ReorderLevel <int>, Discontinued <int>,
## #   CompanyName <chr>, ContactName <chr>, ContactTitle <chr>,
## #   Address <chr>, City <chr>, Region <chr>, PostalCode <chr>,
## #   Country <chr>, Phone <chr>, Fax <chr>, HomePage <chr>

products %>%
  left_join(suppliers, by = 'SupplierID')

## # A tibble: 77 x 21
##    ProductID ProductName   SupplierID CategoryID QuantityPerUnit UnitPrice
##        <int> <chr>              <int>      <int> <chr>               <dbl>
##  1         1 Chai                   1          1 10 boxes x 20 ~      18.0
##  2         2 Chang                  1          1 24 - 12 oz bot~      19.0
##  3         3 Aniseed Syrup          1          2 12 - 550 ml bo~      10.0
##  4         4 Chef Anton's~          2          2 48 - 6 oz jars       22.0
##  5         5 Chef Anton's~          2          2 36 boxes             21.4
##  6         6 Grandma's Bo~          3          2 12 - 8 oz jars       25.0
##  7         7 Uncle Bob's ~          3          7 12 - 1 lb pkgs.      30.0
##  8         8 Northwoods C~          3          2 12 - 12 oz jars      40.0
##  9         9 Mishi Kobe N~          4          6 18 - 500 g pkg~      97.0
## 10        10 Ikura                  4          8 12 - 200 ml ja~      31.0
## # ... with 67 more rows, and 15 more variables: UnitsInStock <int>,
## #   UnitsOnOrder <int>, ReorderLevel <int>, Discontinued <int>,
## #   CompanyName <chr>, ContactName <chr>, ContactTitle <chr>,
## #   Address <chr>, City <chr>, Region <chr>, PostalCode <chr>,
## #   Country <chr>, Phone <chr>, Fax <chr>, HomePage <chr>

ex3. Find all orders with values great than $12000 and are placed in 2016.

# let's do it together
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

orders %>%
  filter(year(OrderDate) == 2016) %>%
  inner_join(order_details, by = "OrderID") %>%
  group_by(OrderID) %>%
  summarise(TotalOrderValue = sum(UnitPrice * Quantity)) %>%
  filter(TotalOrderValue > 12000) %>%
  arrange(desc(TotalOrderValue))

## # A tibble: 3 x 2
##   OrderID TotalOrderValue
##     <int>           <dbl>
## 1   10865          17250.
## 2   11030          16322.
## 3   10981          15810.

ex4. Find customers that never placed an order.

# let's do it together
# method 1
customers %>%
  left_join(orders, by = "CustomerID") %>%
  filter(is.na(OrderID))

## # A tibble: 2 x 24
##   CustomerID CompanyName   ContactName  ContactTitle Address  City  Region
##   <chr>      <chr>         <chr>        <chr>        <chr>    <chr> <chr> 
## 1 FISSA      FISSA Fabric~ Diego Roel   Accounting ~ C/ Mora~ Madr~ NULL  
## 2 PARIS      Paris spécia~ Marie Bertr~ Owner        265, bo~ Paris NULL  
## # ... with 17 more variables: PostalCode <chr>, Country <chr>,
## #   Phone <chr>, Fax <chr>, OrderID <int>, EmployeeID <int>,
## #   OrderDate <dttm>, RequiredDate <dttm>, ShippedDate <chr>,
## #   ShipVia <int>, Freight <dbl>, ShipName <chr>, ShipAddress <chr>,
## #   ShipCity <chr>, ShipRegion <chr>, ShipPostalCode <chr>,
## #   ShipCountry <chr>

# let's do it together
# method 2 - using anti_join
customers %>%
  anti_join(orders, by = "CustomerID")

## # A tibble: 2 x 11
##   CustomerID CompanyName   ContactName  ContactTitle Address  City  Region
##   <chr>      <chr>         <chr>        <chr>        <chr>    <chr> <chr> 
## 1 FISSA      FISSA Fabric~ Diego Roel   Accounting ~ C/ Mora~ Madr~ NULL  
## 2 PARIS      Paris spécia~ Marie Bertr~ Owner        265, bo~ Paris NULL  
## # ... with 4 more variables: PostalCode <chr>, Country <chr>, Phone <chr>,
## #   Fax <chr>

ex5. Find customers who never placed an order from Margaret Peacock (EmployeeID 4).

# your code here

# your code here

# method 1
orders_4 <- orders %>%
  filter(EmployeeID == 4)

customers %>%
  anti_join(orders_4, by = "CustomerID")

## # A tibble: 16 x 11
##    CustomerID CompanyName   ContactName  ContactTitle Address City  Region
##    <chr>      <chr>         <chr>        <chr>        <chr>   <chr> <chr> 
##  1 CONSH      Consolidated~ Elizabeth B~ Sales Repre~ Berkel~ Lond~ NULL  
##  2 DUMON      Du monde ent~ Janine Labr~ Owner        67, ru~ Nant~ NULL  
##  3 FISSA      FISSA Fabric~ Diego Roel   Accounting ~ C/ Mor~ Madr~ NULL  
##  4 FRANR      France resta~ Carine Schm~ Marketing M~ 54, ru~ Nant~ NULL  
##  5 GROSR      GROSELLA-Res~ Manuel Pere~ Owner        5ª Ave~ Cara~ DF    
##  6 LAUGB      Laughing Bac~ Yoshi Tanna~ Marketing A~ 1900 O~ Vanc~ BC    
##  7 LAZYK      Lazy K Kount~ John Steel   Marketing M~ 12 Orc~ Wall~ WA    
##  8 NORTS      North/South   Simon Crowt~ Sales Assoc~ South ~ Lond~ NULL  
##  9 PARIS      Paris spécia~ Marie Bertr~ Owner        265, b~ Paris NULL  
## 10 PERIC      Pericles Com~ Guillermo F~ Sales Repre~ Calle ~ Méxi~ NULL  
## 11 PRINI      Princesa Isa~ Isabel de C~ Sales Repre~ Estrad~ Lisb~ NULL  
## 12 SANTG      Santé Gourmet Jonas Bergu~ Owner        Erling~ Stav~ NULL  
## 13 SEVES      Seven Seas I~ Hari Kumar   Sales Manag~ 90 Wad~ Lond~ NULL  
## 14 SPECD      Spécialités ~ Dominique P~ Marketing M~ 25, ru~ Paris NULL  
## 15 THEBI      The Big Chee~ Liz Nixon    Marketing M~ 89 Jef~ Port~ OR    
## 16 VINET      Vins et alco~ Paul Henriot Accounting ~ 59 rue~ Reims NULL  
## # ... with 4 more variables: PostalCode <chr>, Country <chr>, Phone <chr>,
## #   Fax <chr>

customers %>%
  left_join(orders_4, by = "CustomerID") %>%
  filter(is.na(OrderID))

## # A tibble: 16 x 24
##    CustomerID CompanyName   ContactName  ContactTitle Address City  Region
##    <chr>      <chr>         <chr>        <chr>        <chr>   <chr> <chr> 
##  1 CONSH      Consolidated~ Elizabeth B~ Sales Repre~ Berkel~ Lond~ NULL  
##  2 DUMON      Du monde ent~ Janine Labr~ Owner        67, ru~ Nant~ NULL  
##  3 FISSA      FISSA Fabric~ Diego Roel   Accounting ~ C/ Mor~ Madr~ NULL  
##  4 FRANR      France resta~ Carine Schm~ Marketing M~ 54, ru~ Nant~ NULL  
##  5 GROSR      GROSELLA-Res~ Manuel Pere~ Owner        5ª Ave~ Cara~ DF    
##  6 LAUGB      Laughing Bac~ Yoshi Tanna~ Marketing A~ 1900 O~ Vanc~ BC    
##  7 LAZYK      Lazy K Kount~ John Steel   Marketing M~ 12 Orc~ Wall~ WA    
##  8 NORTS      North/South   Simon Crowt~ Sales Assoc~ South ~ Lond~ NULL  
##  9 PARIS      Paris spécia~ Marie Bertr~ Owner        265, b~ Paris NULL  
## 10 PERIC      Pericles Com~ Guillermo F~ Sales Repre~ Calle ~ Méxi~ NULL  
## 11 PRINI      Princesa Isa~ Isabel de C~ Sales Repre~ Estrad~ Lisb~ NULL  
## 12 SANTG      Santé Gourmet Jonas Bergu~ Owner        Erling~ Stav~ NULL  
## 13 SEVES      Seven Seas I~ Hari Kumar   Sales Manag~ 90 Wad~ Lond~ NULL  
## 14 SPECD      Spécialités ~ Dominique P~ Marketing M~ 25, ru~ Paris NULL  
## 15 THEBI      The Big Chee~ Liz Nixon    Marketing M~ 89 Jef~ Port~ OR    
## 16 VINET      Vins et alco~ Paul Henriot Accounting ~ 59 rue~ Reims NULL  
## # ... with 17 more variables: PostalCode <chr>, Country <chr>,
## #   Phone <chr>, Fax <chr>, OrderID <int>, EmployeeID <int>,
## #   OrderDate <dttm>, RequiredDate <dttm>, ShippedDate <chr>,
## #   ShipVia <int>, Freight <dbl>, ShipName <chr>, ShipAddress <chr>,
## #   ShipCity <chr>, ShipRegion <chr>, ShipPostalCode <chr>,
## #   ShipCountry <chr>

# method 2
orders %>%
  filter(EmployeeID == 4) %>%
  right_join(customers, by = "CustomerID") %>%
  filter(is.na(OrderID))

## # A tibble: 16 x 24
##    OrderID CustomerID EmployeeID OrderDate           RequiredDate       
##      <int> <chr>           <int> <dttm>              <dttm>             
##  1      NA CONSH              NA NA                  NA                 
##  2      NA DUMON              NA NA                  NA                 
##  3      NA FISSA              NA NA                  NA                 
##  4      NA FRANR              NA NA                  NA                 
##  5      NA GROSR              NA NA                  NA                 
##  6      NA LAUGB              NA NA                  NA                 
##  7      NA LAZYK              NA NA                  NA                 
##  8      NA NORTS              NA NA                  NA                 
##  9      NA PARIS              NA NA                  NA                 
## 10      NA PERIC              NA NA                  NA                 
## 11      NA PRINI              NA NA                  NA                 
## 12      NA SANTG              NA NA                  NA                 
## 13      NA SEVES              NA NA                  NA                 
## 14      NA SPECD              NA NA                  NA                 
## 15      NA THEBI              NA NA                  NA                 
## 16      NA VINET              NA NA                  NA                 
## # ... with 19 more variables: ShippedDate <chr>, ShipVia <int>,
## #   Freight <dbl>, ShipName <chr>, ShipAddress <chr>, ShipCity <chr>,
## #   ShipRegion <chr>, ShipPostalCode <chr>, ShipCountry <chr>,
## #   CompanyName <chr>, ContactName <chr>, ContactTitle <chr>,
## #   Address <chr>, City <chr>, Region <chr>, PostalCode <chr>,
## #   Country <chr>, Phone <chr>, Fax <chr>

ex6. Find all orders and their shippers with OrderID less than 10255. Display OrderID, OrderDate (date only), and Shipper CompanyName. Hint: use as.Date() to convert OrderDate (Datetime type) to Date.

# let's do it together
orders %>%
  inner_join(shippers, by = c("ShipVia" = "ShipperID")) %>%
  filter(OrderID < 10255) %>%
  mutate(OrderDate = as.Date(OrderDate)) %>%
  select(OrderID, OrderDate, CompanyName)

## # A tibble: 7 x 3
##   OrderID OrderDate  CompanyName     
##     <int> <date>     <chr>           
## 1   10248 2014-07-04 Federal Shipping
## 2   10249 2014-07-05 Speedy Express  
## 3   10250 2014-07-08 United Package  
## 4   10251 2014-07-08 Speedy Express  
## 5   10252 2014-07-09 United Package  
## 6   10253 2014-07-10 United Package  
## 7   10254 2014-07-11 United Package

References

Tidyverse package site.
Tidy data paper intro and code site.
Hadley Wickham’s Tidy data paper.
Garrett Grolemund’s Data Tidying tutorial.
Data Wrangle in R for Data Science.

part2_data_manipulation

1. Introduction

2. Tidy Data

2.1 “Messy” data and why tidy is better.

2.2 From messy to tidy

Exercise

Exercise

2.3 Other messy ways

3 Data Transformation

3.1 dplyr basics

Exercise

Exercise

Exercise

3.2 Join tables

Exercise

References

3.1 `dplyr` basics