Raw tabular datasets can come in many ways. How do you re-organize them for easy manipulation in R? How do you further prepare the data (filter, select, summarize, etc.) to get them ready for modeling? We will cover those topics in this session.
Tidy data is a way to organize tabular data. It provides a consistent data structure across packages. A table is tidy if:
The tidy data concept is introduced by Wickham (2014). If you have studied Database design, you will realize the root of the idea is Codd’s 3rd normal form (Codd EF, 1990)
The tidy data definition might be too abstract. Let’s start with an example of “messy”.
treatmenta | treatmentb | |
---|---|---|
John Smith | - | 2 |
Jane Doe | 16 | 11 |
Mary Johnson | 3 | 1 |
# load the tidyverse library
# this will load a set of packages
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.0
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# create a messy dataset
df_messy <- tibble(
name = c("John Smith", "Jane Doe", "Mary Johnson"),
treatmenta = c(NA, 16, 3),
treatmentb = c(2, 11, 1)
)
df_messy
## # A tibble: 3 x 3
## name treatmenta treatmentb
## <chr> <dbl> <dbl>
## 1 John Smith NA 2.
## 2 Jane Doe 16. 11.
## 3 Mary Johnson 3. 1.
This dataset essentially has three variables: name, treatment, and result. The column headers in the table are values: treatment a or b, not variable names (i.e., it violates (1) and (2) in the tidy data definition). The way this dataset is organized makes it hard to retrieve the values and analyze them in a consistent way.
# list all the treatment method
names(df_messy)
## [1] "name" "treatmenta" "treatmentb"
# find all the subjects
df_messy["name"]
## # A tibble: 3 x 1
## name
## <chr>
## 1 John Smith
## 2 Jane Doe
## 3 Mary Johnson
What if the data looks like this?
John Smith | Jane Doe | Mary Johnson | |
---|---|---|---|
treatmenta | - | 16 | 3 |
treatmentb | 2 | 11 | 1 |
It’s just as “messy”.
Imaging you want to build a set of tools/functions to analyze your data, but the raw data you collected are all messy in their own ways. Your tools/functions then need to deal with many different forms of data, which makes the work flow inefficient. That’s why we would want to first store/transform raw data in/into a consistent “tidy” way.
The tidy data way.
name | treatment | result |
---|---|---|
John Smith | a | - |
Jane Doe | a | 16 |
Mary Johnson | a | 3 |
John Smith | b | 2 |
Jane Doe | b | 11 |
Mary Johnson | b | 1 |
# create a tidy dataset
df_tidy <- tibble(
name = c("John Smith", "Jane Doe", "Mary Johnson", "John Smith", "Jane Doe", "Mary Johnson"),
treatment = c("a", "a", "a", "b", "b", "b"),
result = c(NA, 16, 3, 2, 11, 1)
)
df_tidy
## # A tibble: 6 x 3
## name treatment result
## <chr> <chr> <dbl>
## 1 John Smith a NA
## 2 Jane Doe a 16.
## 3 Mary Johnson a 3.
## 4 John Smith b 2.
## 5 Jane Doe b 11.
## 6 Mary Johnson b 1.
print(df_tidy["name"])
## # A tibble: 6 x 1
## name
## <chr>
## 1 John Smith
## 2 Jane Doe
## 3 Mary Johnson
## 4 John Smith
## 5 Jane Doe
## 6 Mary Johnson
print(df_tidy["treatment"])
## # A tibble: 6 x 1
## treatment
## <chr>
## 1 a
## 2 a
## 3 a
## 4 b
## 5 b
## 6 b
How do we turn a messy dataset into a tidy one in R? It depends on how messy the dataset is, and in what way. We’ll discuss a few examples for a starter.
Let’s first try to turn the messy data example seen above into a tidy one. There are many ways to do this transformation. Wickham (2014) used melt()
function in the reshape2
package. The reshape2
package was replaced by tidyr
package around 2015, so we will instead use the gather()
function in the tidyr
package.
gather()
collects a set of column names and places them into a single “key” column. It also collects the cells of those columns and places them into a single value column.
# tidyr is part of the tidyverse package, which we have already loaded
df_tidy_new <- gather(df_messy, key = "treatment", value = "result", 2:3)
df_tidy_new
## # A tibble: 6 x 3
## name treatment result
## <chr> <chr> <dbl>
## 1 John Smith treatmenta NA
## 2 Jane Doe treatmenta 16.
## 3 Mary Johnson treatmenta 3.
## 4 John Smith treatmentb 2.
## 5 Jane Doe treatmentb 11.
## 6 Mary Johnson treatmentb 1.
Let’s load a real world dataset for an exercise. (It’s also an opportunity to learn how to import data.) The dataset examines the relationship between income and religion in the US.
# import pew_v2.csv file
df_pew <- read_csv("https://raw.githubusercontent.com/eijoac/datasets/master/pew_v2.csv")
## Parsed with column specification:
## cols(
## religion = col_character(),
## `less$10k` = col_integer(),
## `$10-20k` = col_integer(),
## `$20-30k` = col_integer(),
## `$30-40k` = col_integer(),
## `$40-50k` = col_integer(),
## `$50-75k` = col_integer(),
## `$75-100k` = col_integer(),
## `$100-150k` = col_integer(),
## above150k = col_integer(),
## `Don't know/refused` = col_integer()
## )
df_pew
## # A tibble: 18 x 11
## religion `less$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k`
## <chr> <int> <int> <int> <int> <int> <int>
## 1 Agnostic 27 34 60 81 76 137
## 2 Atheist 12 27 37 52 35 70
## 3 Buddhist 27 21 30 34 33 58
## 4 Catholic 418 617 732 670 638 1116
## 5 Don’t kno~ 15 14 15 11 10 35
## 6 Evangelic~ 575 869 1064 982 881 1486
## 7 Hindu 1 9 7 9 11 34
## 8 Historica~ 228 244 236 238 197 223
## 9 Jehovah's~ 20 27 24 24 21 30
## 10 Jewish 19 19 25 25 30 95
## 11 Mainline ~ 289 495 619 655 651 1107
## 12 Mormon 29 40 48 51 56 112
## 13 Muslim 6 7 9 10 9 23
## 14 Orthodox 13 17 23 32 32 47
## 15 Other Chr~ 9 7 11 13 13 14
## 16 Other Fai~ 20 33 40 46 49 63
## 17 Other Wor~ 5 2 3 4 2 7
## 18 Unaffilia~ 217 299 374 365 341 528
## # ... with 4 more variables: `$75-100k` <int>, `$100-150k` <int>,
## # above150k <int>, `Don't know/refused` <int>
I used read_csv()
from the readr
package to import the data into R tibble. The source of the data is a URL. If the data is local, you will just specify the path to the data (e.g. c:\mydata\mydata.csv
)
Let’s take a quick look at the document for read_csv()
and other related data import functions in readr
package. Besides the readr
, a few other packages for importing data from different sources are listed below.
readxl
for Excel sheets.haven
for SPSS, Stata and SAS data.jsonlite
for JSON.xml2
for XML.httr
for Web APIs.rvest
for web scrapingDBI
for connecting to DataBase engine.Can you tidy up the pew dataset?
# your code here
df_pew_tidy <- gather(df_pew, key = "income", value = "n", -religion)
df_pew_tidy
## # A tibble: 180 x 3
## religion income n
## <chr> <chr> <int>
## 1 Agnostic less$10k 27
## 2 Atheist less$10k 12
## 3 Buddhist less$10k 27
## 4 Catholic less$10k 418
## 5 Don’t know/refused less$10k 15
## 6 Evangelical Prot less$10k 575
## 7 Hindu less$10k 1
## 8 Historically Black Prot less$10k 228
## 9 Jehovah's Witness less$10k 20
## 10 Jewish less$10k 19
## # ... with 170 more rows
The reverse operation of gather()
is spread()
. spread()
spread a key-value pair across multiple columns.
spread(df_pew_tidy, income, n)
## # A tibble: 18 x 11
## religion `$10-20k` `$100-150k` `$20-30k` `$30-40k` `$40-50k` `$50-75k`
## <chr> <int> <int> <int> <int> <int> <int>
## 1 Agnostic 34 109 60 81 76 137
## 2 Atheist 27 59 37 52 35 70
## 3 Buddhist 21 39 30 34 33 58
## 4 Catholic 617 792 732 670 638 1116
## 5 Don’t kn~ 14 17 15 11 10 35
## 6 Evangeli~ 869 723 1064 982 881 1486
## 7 Hindu 9 48 7 9 11 34
## 8 Historic~ 244 81 236 238 197 223
## 9 Jehovah'~ 27 11 24 24 21 30
## 10 Jewish 19 87 25 25 30 95
## 11 Mainline~ 495 753 619 655 651 1107
## 12 Mormon 40 49 48 51 56 112
## 13 Muslim 7 8 9 10 9 23
## 14 Orthodox 17 42 23 32 32 47
## 15 Other Ch~ 7 14 11 13 13 14
## 16 Other Fa~ 33 40 40 46 49 63
## 17 Other Wo~ 2 4 3 4 2 7
## 18 Unaffili~ 299 321 374 365 341 528
## # ... with 4 more variables: `$75-100k` <int>, above150k <int>, `Don't
## # know/refused` <int>, `less$10k` <int>
spread()
can sometimes help tidy up dateset as well. Let’s take a look at the table2
dataset provided by the tidyr
package.
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table2
mixes up the values of population
and cases
(TB cases) in the same column count
. We can use spread()
to tidy it up. (Discussion: Is it not tidy already?)
Use spread()
to tidy up the table2
.
# your code here
table2_tidy <- spread(table2, type, count)
table2_tidy
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
According to Wickham (2014), messy datasets havet 5 common problems.
We have seen (1) and (2). Let’s quickly discuss another example (see slides). I’ll leave you to read (Wickham (2014) to explore the rest. This site summarizes the paper.
We end this section by discussing how to save a dataset in csv format. Just use the write_csv()
function.
# write_csv(df_pew_tidy, "path_to_data_folder/pew_tidy.csv")
R has many tools/packages to manipulate data. We will mainly focus on the dplyr()
package as it’s one of the best and it’s part of an eco-system of tidyverse
that handles data in a consistant way.
Let’s first load a dataset.
employees <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/Employees.csv")
## Parsed with column specification:
## cols(
## EmployeeID = col_integer(),
## LastName = col_character(),
## FirstName = col_character(),
## Title = col_character(),
## TitleOfCourtesy = col_character(),
## BirthDate = col_datetime(format = ""),
## HireDate = col_datetime(format = ""),
## Address = col_character(),
## City = col_character(),
## Region = col_character(),
## PostalCode = col_character(),
## Country = col_character(),
## HomePhone = col_character(),
## Extension = col_integer(),
## Notes = col_character(),
## ReportsTo = col_character(),
## PhotoPath = col_character()
## )
employees
## # A tibble: 9 x 17
## EmployeeID LastName FirstName Title TitleOfCourtesy BirthDate
## <int> <chr> <chr> <chr> <chr> <dttm>
## 1 1 Davolio Nancy Sale~ Ms. 1966-12-08 00:00:00
## 2 2 Fuller Andrew Vice~ Dr. 1970-02-19 00:00:00
## 3 3 Leverling Janet Sale~ Ms. 1981-08-30 00:00:00
## 4 4 Peacock Margaret Sale~ Mrs. 1955-09-19 00:00:00
## 5 5 Buchanan Steven Sale~ Mr. 1973-03-04 00:00:00
## 6 6 Suyama Michael Sale~ Mr. 1981-07-02 00:00:00
## 7 7 King Robert Sale~ Mr. 1978-05-29 00:00:00
## 8 8 Callahan Laura Insi~ Ms. 1976-01-09 00:00:00
## 9 9 Dodsworth Anne Sale~ Ms. 1984-01-27 00:00:00
## # ... with 11 more variables: HireDate <dttm>, Address <chr>, City <chr>,
## # Region <chr>, PostalCode <chr>, Country <chr>, HomePhone <chr>,
## # Extension <int>, Notes <chr>, ReportsTo <chr>, PhotoPath <chr>
print(employees, n = 3, width = Inf)
## # A tibble: 9 x 17
## EmployeeID LastName FirstName Title TitleOfCourtesy
## <int> <chr> <chr> <chr> <chr>
## 1 1 Davolio Nancy Sales Representative Ms.
## 2 2 Fuller Andrew Vice President, Sales Dr.
## 3 3 Leverling Janet Sales Representative Ms.
## BirthDate HireDate Address
## <dttm> <dttm> <chr>
## 1 1966-12-08 00:00:00 2010-05-01 00:00:00 507 - 20th Ave. E. Apt. 2A
## 2 1970-02-19 00:00:00 2010-08-14 00:00:00 908 W. Capital Way
## 3 1981-08-30 00:00:00 2010-04-01 00:00:00 722 Moss Bay Blvd.
## City Region PostalCode Country HomePhone Extension
## <chr> <chr> <chr> <chr> <chr> <int>
## 1 Seattle WA 98122 USA (206) 555-9857 5467
## 2 Tacoma WA 98401 USA (206) 555-9482 3457
## 3 Kirkland WA 98033 USA (206) 555-3412 3355
## Notes
## <chr>
## 1 "Education includes a BA in psychology from Colorado State University i~
## 2 Andrew received his BTS commercial in 1974 and a Ph.D. in international~
## 3 Janet has a BS degree in chemistry from Boston College (1984). She has~
## ReportsTo PhotoPath
## <chr> <chr>
## 1 2 http://accweb/emmployees/davolio.bmp
## 2 NULL http://accweb/emmployees/fuller.bmp
## 3 2 http://accweb/emmployees/leverling.bmp
## # ... with 6 more rows
dplyr
basicsfilter()
select()
arrange()
mutate()
summarise()
The above functions can be used together with group_by()
, which changes the scope of each function from operating on the whole dataset to operating on each group.
Let’s find all the sales representatives who were born after 1970-01-01. Display their names and their hiring date. Name should be formatted as “LastName, FirstName”. Order the output by hiring date (descending).
# find all the sales representatives whose were born after 1970-01-01
emp01 <- filter(employees, Title == "Sales Representative", BirthDate > 1970-01-01)
## Warning: `as_dictionary()` is soft-deprecated as of rlang 0.3.0.
## Please use `as_data_pronoun()` instead
## This warning is displayed once per session.
## Warning: `new_overscope()` is soft-deprecated as of rlang 0.2.0.
## Please use `new_data_mask()` instead
## This warning is displayed once per session.
## Warning: The `parent` argument of `new_data_mask()` is deprecated.
## The parent of the data mask is determined from either:
##
## * The `env` argument of `eval_tidy()`
## * Quosure environments when applicable
## This warning is displayed once per session.
## Warning: `overscope_clean()` is soft-deprecated as of rlang 0.2.0.
## This warning is displayed once per session.
print(emp01, width = Inf)
## # A tibble: 4 x 17
## EmployeeID LastName FirstName Title TitleOfCourtesy
## <int> <chr> <chr> <chr> <chr>
## 1 3 Leverling Janet Sales Representative Ms.
## 2 6 Suyama Michael Sales Representative Mr.
## 3 7 King Robert Sales Representative Mr.
## 4 9 Dodsworth Anne Sales Representative Ms.
## BirthDate HireDate Address
## <dttm> <dttm> <chr>
## 1 1981-08-30 00:00:00 2010-04-01 00:00:00 722 Moss Bay Blvd.
## 2 1981-07-02 00:00:00 2011-10-17 00:00:00 Coventry House Miner Rd.
## 3 1978-05-29 00:00:00 2012-01-02 00:00:00 Edgeham Hollow Winchester Way
## 4 1984-01-27 00:00:00 2012-11-15 00:00:00 7 Houndstooth Rd.
## City Region PostalCode Country HomePhone Extension
## <chr> <chr> <chr> <chr> <chr> <int>
## 1 Kirkland WA 98033 USA (206) 555-3412 3355
## 2 London NULL EC2 7JR UK (71) 555-7773 428
## 3 London NULL RG1 9SP UK (71) 555-5598 465
## 4 London NULL WG2 7LT UK (71) 555-4444 452
## Notes
## <chr>
## 1 Janet has a BS degree in chemistry from Boston College (1984). She has~
## 2 "Michael is a graduate of Sussex University (MA, economics, 1983) and t~
## 3 "Robert King served in the Peace Corps and traveled extensively before ~
## 4 Anne has a BA degree in English from St. Lawrence College. She is flue~
## ReportsTo PhotoPath
## <chr> <chr>
## 1 2 http://accweb/emmployees/leverling.bmp
## 2 5 http://accweb/emmployees/davolio.bmp
## 3 5 http://accweb/emmployees/davolio.bmp
## 4 5 http://accweb/emmployees/davolio.bmp
# create a new variable/column with formatted names as required
emp02 <- mutate(emp01, Name = paste(LastName, FirstName, sep = ", "))
# just print the Name variable/column to take a look at the format
print(emp02["Name"])
## # A tibble: 4 x 1
## Name
## <chr>
## 1 Leverling, Janet
## 2 Suyama, Michael
## 3 King, Robert
## 4 Dodsworth, Anne
In the above code, paste()
is used to concatenate strings. glue()
from the glue package might be a bit easier to use.
# select the variables/columns (to be displayed)
emp03 <- select(emp02, Name, HireDate)
emp03
## # A tibble: 4 x 2
## Name HireDate
## <chr> <dttm>
## 1 Leverling, Janet 2010-04-01 00:00:00
## 2 Suyama, Michael 2011-10-17 00:00:00
## 3 King, Robert 2012-01-02 00:00:00
## 4 Dodsworth, Anne 2012-11-15 00:00:00
# order by HireDate
emp04 <- arrange(emp03, desc(HireDate))
emp04
## # A tibble: 4 x 2
## Name HireDate
## <chr> <dttm>
## 1 Dodsworth, Anne 2012-11-15 00:00:00
## 2 King, Robert 2012-01-02 00:00:00
## 3 Suyama, Michael 2011-10-17 00:00:00
## 4 Leverling, Janet 2010-04-01 00:00:00
The above steps can be streamlined by using pipes.
# streamline using pipes %>%
emp05 <- employees %>%
filter(Title == "Sales Representative", BirthDate > 1970-01-01) %>%
mutate(Name = paste(LastName, FirstName, sep = ", ")) %>%
select(Name, HireDate) %>%
arrange(desc(HireDate))
emp05
## # A tibble: 4 x 2
## Name HireDate
## <chr> <dttm>
## 1 Dodsworth, Anne 2012-11-15 00:00:00
## 2 King, Robert 2012-01-02 00:00:00
## 3 Suyama, Michael 2011-10-17 00:00:00
## 4 Leverling, Janet 2010-04-01 00:00:00
Find all employees who are based in USA. Display their names and their birth date. Name should be formatted as “FirstName LastName”. Order the output by birth date (descending).
# your code here
# your code here
employees %>%
filter(Country == "USA") %>%
mutate(Name = paste(FirstName, LastName, sep = " ")) %>%
select(Name, BirthDate) %>%
arrange(desc(BirthDate))
## # A tibble: 5 x 2
## Name BirthDate
## <chr> <dttm>
## 1 Janet Leverling 1981-08-30 00:00:00
## 2 Laura Callahan 1976-01-09 00:00:00
## 3 Andrew Fuller 1970-02-19 00:00:00
## 4 Nancy Davolio 1966-12-08 00:00:00
## 5 Margaret Peacock 1955-09-19 00:00:00
Let’s now count the number of empolyees by Country. We first need to group data by country using group_by()
.
emp_by_country <- group_by(employees, Country)
emp_by_country
## # A tibble: 9 x 17
## # Groups: Country [2]
## EmployeeID LastName FirstName Title TitleOfCourtesy BirthDate
## <int> <chr> <chr> <chr> <chr> <dttm>
## 1 1 Davolio Nancy Sale~ Ms. 1966-12-08 00:00:00
## 2 2 Fuller Andrew Vice~ Dr. 1970-02-19 00:00:00
## 3 3 Leverling Janet Sale~ Ms. 1981-08-30 00:00:00
## 4 4 Peacock Margaret Sale~ Mrs. 1955-09-19 00:00:00
## 5 5 Buchanan Steven Sale~ Mr. 1973-03-04 00:00:00
## 6 6 Suyama Michael Sale~ Mr. 1981-07-02 00:00:00
## 7 7 King Robert Sale~ Mr. 1978-05-29 00:00:00
## 8 8 Callahan Laura Insi~ Ms. 1976-01-09 00:00:00
## 9 9 Dodsworth Anne Sale~ Ms. 1984-01-27 00:00:00
## # ... with 11 more variables: HireDate <dttm>, Address <chr>, City <chr>,
## # Region <chr>, PostalCode <chr>, Country <chr>, HomePhone <chr>,
## # Extension <int>, Notes <chr>, ReportsTo <chr>, PhotoPath <chr>
summarise(emp_by_country, count = n())
## # A tibble: 2 x 2
## Country count
## <chr> <int>
## 1 UK 4
## 2 USA 5
In the above code, n()
is a function for counting. Now, let’s use pipe to do the same.
emp_count_by_country <- employees %>%
group_by(Country) %>%
summarise(count = n())
emp_count_by_country
## # A tibble: 2 x 2
## Country count
## <chr> <int>
## 1 UK 4
## 2 USA 5
Let’s load another dataset for your exercise.
customers <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/Customers.csv")
## Parsed with column specification:
## cols(
## CustomerID = col_character(),
## CompanyName = col_character(),
## ContactName = col_character(),
## ContactTitle = col_character(),
## Address = col_character(),
## City = col_character(),
## Region = col_character(),
## PostalCode = col_character(),
## Country = col_character(),
## Phone = col_character(),
## Fax = col_character()
## )
customers
## # A tibble: 91 x 11
## CustomerID CompanyName ContactName ContactTitle Address City Region
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ALFKI Alfreds Futt~ Maria Ande~ Sales Repre~ Obere S~ Berl~ NULL
## 2 ANATR Ana Trujillo~ Ana Trujil~ Owner Avda. d~ Méxi~ NULL
## 3 ANTON Antonio More~ Antonio Mo~ Owner Matader~ Méxi~ NULL
## 4 AROUT Around the H~ Thomas Har~ Sales Repre~ 120 Han~ Lond~ NULL
## 5 BERGS Berglunds sn~ Christina ~ Order Admin~ Berguvs~ Luleå NULL
## 6 BLAUS Blauer See D~ Hanna Moos Sales Repre~ Forster~ Mann~ NULL
## 7 BLONP Blondesddsl ~ Frédérique~ Marketing M~ 24, pla~ Stra~ NULL
## 8 BOLID Bólido Comid~ Martín Som~ Owner C/ Araq~ Madr~ NULL
## 9 BONAP Bon app' Laurence L~ Owner 12, rue~ Mars~ NULL
## 10 BOTTM Bottom-Dolla~ Elizabeth ~ Accounting ~ 23 Tsaw~ Tsaw~ BC
## # ... with 81 more rows, and 4 more variables: PostalCode <chr>,
## # Country <chr>, Phone <chr>, Fax <chr>
Find total number of customers per Country and City. Order them by count (descending order).
# your code here
# your code here
customers %>%
group_by(Country, City) %>%
summarise(CustomerCount = n()) %>%
arrange(desc(CustomerCount))
## # A tibble: 69 x 3
## # Groups: Country [21]
## Country City CustomerCount
## <chr> <chr> <int>
## 1 UK London 6
## 2 Mexico México D.F. 5
## 3 Brazil Sao Paulo 4
## 4 Argentina Buenos Aires 3
## 5 Brazil Rio de Janeiro 3
## 6 Spain Madrid 3
## 7 France Nantes 2
## 8 France Paris 2
## 9 Portugal Lisboa 2
## 10 USA Portland 2
## # ... with 59 more rows
Load one more dataset for another exercise.
order_details <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/OrderDetails.csv")
## Parsed with column specification:
## cols(
## OrderID = col_integer(),
## ProductID = col_integer(),
## UnitPrice = col_double(),
## Quantity = col_integer(),
## Discount = col_double()
## )
order_details
## # A tibble: 2,155 x 5
## OrderID ProductID UnitPrice Quantity Discount
## <int> <int> <dbl> <int> <dbl>
## 1 10248 11 14.0 12 0.
## 2 10248 42 9.80 10 0.
## 3 10248 72 34.8 5 0.
## 4 10249 14 18.6 9 0.
## 5 10249 51 42.4 40 0.
## 6 10250 41 7.70 10 0.
## 7 10250 51 42.4 35 0.150
## 8 10250 65 16.8 15 0.150
## 9 10251 22 16.8 6 0.0500
## 10 10251 57 15.6 15 0.0500
## # ... with 2,145 more rows
Find all orders with values greater than $12000. Order them by value. Ignore Discount
. Hint: since each order (identified by OrderID
) has many items, you need to use group_by()
. In the summarise()
function, use sum()
to sum up item subtotals.
# your code here
# method 1
order_details %>%
group_by(OrderID) %>%
summarise(TotalOrderValue = sum(UnitPrice * Quantity)) %>%
filter(TotalOrderValue > 12000) %>%
arrange(desc(TotalOrderValue))
## # A tibble: 4 x 2
## OrderID TotalOrderValue
## <int> <dbl>
## 1 10865 17250.
## 2 11030 16322.
## 3 10981 15810.
## 4 10372 12281.
# method 2
order_details %>%
mutate(SubTotal = UnitPrice * Quantity) %>%
group_by(OrderID) %>%
summarise(TotalOrderValue = sum(SubTotal)) %>%
filter(TotalOrderValue > 12000) %>%
arrange(desc(TotalOrderValue))
## # A tibble: 4 x 2
## OrderID TotalOrderValue
## <int> <dbl>
## 1 10865 17250.
## 2 11030 16322.
## 3 10981 15810.
## 4 10372 12281.
Most often, your dataset is organized as a collection of “tidy” tables. You need to join them for analysis. The join is based on the relationships between the tables.
This is very similar to the concept of a relational database. A database usually consists of many tables and relationships between tables. Let’s see an example.
The above graph represent 8 tables in a database and the relationships between the tables. The graph is called Entity Relationship (ER) diagram. This is a sample database (of a fake company called Northwind) from an old version of MS Access.
In the ER diagram, the tiny vertical key icon indicates a column is a primary key. A primary key is a column (or set of columns) whose values uniquely identify every row in a table. For example, OrderID
is the primary key in the Orders
table, and OrderID
and ProductID
(combined) is the primary key in the OrderDetails
table.
The relationship icon (a line with a horizontal key at one end and an infinite symbol at the other end) indicates a foreign key constraint and a one-to-many relationship. A foreign key is a column (or set of columns) in one table whose values uniquely identify a row of another table or the same table. A foreign key mostly refers to a primary key in another table. A foreign key constraint requires that the constrained column contain only values from the primary key column of the other table. For example CustomerID
in the Orders
table is a foreign key that refers to the CustomerID
primary key in the Customers
table, and it can only contain values that exist in the CustomerID
column of the Customers
table.
In addition, it happens that every foreign key constraint in the Northwind DB establishes a one-to-many relationship, i.e. a row from one table can have multiple matching rows in another table. For example, one row from the Customers
table can match multiple rows in the Orders
table (via CustomerID
). This makes sense as one customer can place more than one orders. (Another common relationship a foreign key constraint can establish is the one-to-one relationship.)
logo | meaning |
---|---|
primary key | |
one-to-many foreign key constraint |
Why do we need foreign key constraints? (Discussion)
In a database, the relationships/constraints will be stored and enforced by the DB management system. Here we will just load the data as raw files to dataframes/tibbles so there is no mechanism to enforce the relationships.
Let’s learn how to join datasets.
Let’s do some experiments using two small datasets.
t1 <- tribble(
~pk, ~t1c1,
1, "a",
2, "b"
)
t2 <- tribble(
~fk, ~t2c1,
1, "c",
1, "d",
3, "e"
)
t1
## # A tibble: 2 x 2
## pk t1c1
## <dbl> <chr>
## 1 1. a
## 2 2. b
t2
## # A tibble: 3 x 2
## fk t2c1
## <dbl> <chr>
## 1 1. c
## 2 1. d
## 3 3. e
# playground for join exercises
# inner join
t1 %>% inner_join(t2, by = c("pk" = "fk"))
## # A tibble: 2 x 3
## pk t1c1 t2c1
## <dbl> <chr> <chr>
## 1 1. a c
## 2 1. a d
# left join
# t1 %>% left_join(t2, by = c("pk" = "fk"))
# left join with exclusion
# t1 %>% left_join(t2, by = c("pk" = "fk")) %>% filter(is.na(t2c1))
# right join
# t1 %>% right_join(t2, by = c("pk" = "fk"))
# right join with exclusion
# t1 %>% right_join(t2, by = c("pk" = "fk")) %>% filter(is.na(t1c1))
# full join
# t1 %>% full_join(t2, by = c("pk" = "fk"))
# full outer join
# t1 %>% full_join(t2, by = c("pk" = "fk")) %>% filter(is.na(t1c1) | is.na(t2c1))
ex1. What is semi_join()
and anti_join
in dplyr()
? Is semi_join()
the same as inner_join
? Is anti_join()
the same as left join with exclusion we discussed above?
# test it out
# inner join t1 with t2
t1 %>%
inner_join(t2, by = c("pk" = "fk"))
## # A tibble: 2 x 3
## pk t1c1 t2c1
## <dbl> <chr> <chr>
## 1 1. a c
## 2 1. a d
# semi-join t1 with t2
t1 %>%
semi_join(t2, by = c("pk" = "fk"))
## # A tibble: 1 x 2
## pk t1c1
## <dbl> <chr>
## 1 1. a
# left join t1 with t2 and with exclusion
t1 %>%
left_join(t2, by = c("pk" = "fk")) %>%
filter(is.na(t2c1))
## # A tibble: 1 x 3
## pk t1c1 t2c1
## <dbl> <chr> <chr>
## 1 2. b <NA>
# anti-join t1 with t2
t1 %>%
anti_join(t2, by = c("pk" = "fk"))
## # A tibble: 1 x 2
## pk t1c1
## <dbl> <chr>
## 1 2. b
# make a new table
t3 <- tribble(
~pk, ~t1c1,
1, "a",
2, "b",
2, "b"
)
# left join t3 with t2 and with exclusion
t3 %>%
left_join(t2, by = c("pk" = "fk")) %>%
filter(is.na(t2c1))
## # A tibble: 2 x 3
## pk t1c1 t2c1
## <dbl> <chr> <chr>
## 1 2. b <NA>
## 2 2. b <NA>
# anti-join t3 with t2
t3 %>%
anti_join(t2, by = c("pk" = "fk"))
## # A tibble: 2 x 2
## pk t1c1
## <dbl> <chr>
## 1 2. b
## 2 2. b
Let’s load the rest of Northwind dataset and do something a bit more interesting.
orders <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/Orders.csv")
## Parsed with column specification:
## cols(
## OrderID = col_integer(),
## CustomerID = col_character(),
## EmployeeID = col_integer(),
## OrderDate = col_datetime(format = ""),
## RequiredDate = col_datetime(format = ""),
## ShippedDate = col_character(),
## ShipVia = col_integer(),
## Freight = col_double(),
## ShipName = col_character(),
## ShipAddress = col_character(),
## ShipCity = col_character(),
## ShipRegion = col_character(),
## ShipPostalCode = col_character(),
## ShipCountry = col_character()
## )
shippers <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/Shippers.csv")
## Parsed with column specification:
## cols(
## ShipperID = col_integer(),
## CompanyName = col_character(),
## Phone = col_character()
## )
suppliers <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/Suppliers.csv")
## Parsed with column specification:
## cols(
## SupplierID = col_integer(),
## CompanyName = col_character(),
## ContactName = col_character(),
## ContactTitle = col_character(),
## Address = col_character(),
## City = col_character(),
## Region = col_character(),
## PostalCode = col_character(),
## Country = col_character(),
## Phone = col_character(),
## Fax = col_character(),
## HomePage = col_character()
## )
products <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/Products.csv")
## Parsed with column specification:
## cols(
## ProductID = col_integer(),
## ProductName = col_character(),
## SupplierID = col_integer(),
## CategoryID = col_integer(),
## QuantityPerUnit = col_character(),
## UnitPrice = col_double(),
## UnitsInStock = col_integer(),
## UnitsOnOrder = col_integer(),
## ReorderLevel = col_integer(),
## Discontinued = col_integer()
## )
categories <- read_csv("https://raw.githubusercontent.com/eijoac/nwdb/master/data/Categories.csv")
## Parsed with column specification:
## cols(
## CategoryID = col_integer(),
## CategoryName = col_character(),
## Description = col_character()
## )
ex2. Display all products and their associated suppliers.
# your code here
# your code here
products %>%
inner_join(suppliers, by = 'SupplierID')
## # A tibble: 77 x 21
## ProductID ProductName SupplierID CategoryID QuantityPerUnit UnitPrice
## <int> <chr> <int> <int> <chr> <dbl>
## 1 1 Chai 1 1 10 boxes x 20 ~ 18.0
## 2 2 Chang 1 1 24 - 12 oz bot~ 19.0
## 3 3 Aniseed Syrup 1 2 12 - 550 ml bo~ 10.0
## 4 4 Chef Anton's~ 2 2 48 - 6 oz jars 22.0
## 5 5 Chef Anton's~ 2 2 36 boxes 21.4
## 6 6 Grandma's Bo~ 3 2 12 - 8 oz jars 25.0
## 7 7 Uncle Bob's ~ 3 7 12 - 1 lb pkgs. 30.0
## 8 8 Northwoods C~ 3 2 12 - 12 oz jars 40.0
## 9 9 Mishi Kobe N~ 4 6 18 - 500 g pkg~ 97.0
## 10 10 Ikura 4 8 12 - 200 ml ja~ 31.0
## # ... with 67 more rows, and 15 more variables: UnitsInStock <int>,
## # UnitsOnOrder <int>, ReorderLevel <int>, Discontinued <int>,
## # CompanyName <chr>, ContactName <chr>, ContactTitle <chr>,
## # Address <chr>, City <chr>, Region <chr>, PostalCode <chr>,
## # Country <chr>, Phone <chr>, Fax <chr>, HomePage <chr>
products %>%
left_join(suppliers, by = 'SupplierID')
## # A tibble: 77 x 21
## ProductID ProductName SupplierID CategoryID QuantityPerUnit UnitPrice
## <int> <chr> <int> <int> <chr> <dbl>
## 1 1 Chai 1 1 10 boxes x 20 ~ 18.0
## 2 2 Chang 1 1 24 - 12 oz bot~ 19.0
## 3 3 Aniseed Syrup 1 2 12 - 550 ml bo~ 10.0
## 4 4 Chef Anton's~ 2 2 48 - 6 oz jars 22.0
## 5 5 Chef Anton's~ 2 2 36 boxes 21.4
## 6 6 Grandma's Bo~ 3 2 12 - 8 oz jars 25.0
## 7 7 Uncle Bob's ~ 3 7 12 - 1 lb pkgs. 30.0
## 8 8 Northwoods C~ 3 2 12 - 12 oz jars 40.0
## 9 9 Mishi Kobe N~ 4 6 18 - 500 g pkg~ 97.0
## 10 10 Ikura 4 8 12 - 200 ml ja~ 31.0
## # ... with 67 more rows, and 15 more variables: UnitsInStock <int>,
## # UnitsOnOrder <int>, ReorderLevel <int>, Discontinued <int>,
## # CompanyName <chr>, ContactName <chr>, ContactTitle <chr>,
## # Address <chr>, City <chr>, Region <chr>, PostalCode <chr>,
## # Country <chr>, Phone <chr>, Fax <chr>, HomePage <chr>
ex3. Find all orders with values great than $12000 and are placed in 2016.
# let's do it together
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
orders %>%
filter(year(OrderDate) == 2016) %>%
inner_join(order_details, by = "OrderID") %>%
group_by(OrderID) %>%
summarise(TotalOrderValue = sum(UnitPrice * Quantity)) %>%
filter(TotalOrderValue > 12000) %>%
arrange(desc(TotalOrderValue))
## # A tibble: 3 x 2
## OrderID TotalOrderValue
## <int> <dbl>
## 1 10865 17250.
## 2 11030 16322.
## 3 10981 15810.
ex4. Find customers that never placed an order.
# let's do it together
# method 1
customers %>%
left_join(orders, by = "CustomerID") %>%
filter(is.na(OrderID))
## # A tibble: 2 x 24
## CustomerID CompanyName ContactName ContactTitle Address City Region
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 FISSA FISSA Fabric~ Diego Roel Accounting ~ C/ Mora~ Madr~ NULL
## 2 PARIS Paris spécia~ Marie Bertr~ Owner 265, bo~ Paris NULL
## # ... with 17 more variables: PostalCode <chr>, Country <chr>,
## # Phone <chr>, Fax <chr>, OrderID <int>, EmployeeID <int>,
## # OrderDate <dttm>, RequiredDate <dttm>, ShippedDate <chr>,
## # ShipVia <int>, Freight <dbl>, ShipName <chr>, ShipAddress <chr>,
## # ShipCity <chr>, ShipRegion <chr>, ShipPostalCode <chr>,
## # ShipCountry <chr>
# let's do it together
# method 2 - using anti_join
customers %>%
anti_join(orders, by = "CustomerID")
## # A tibble: 2 x 11
## CustomerID CompanyName ContactName ContactTitle Address City Region
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 FISSA FISSA Fabric~ Diego Roel Accounting ~ C/ Mora~ Madr~ NULL
## 2 PARIS Paris spécia~ Marie Bertr~ Owner 265, bo~ Paris NULL
## # ... with 4 more variables: PostalCode <chr>, Country <chr>, Phone <chr>,
## # Fax <chr>
ex5. Find customers who never placed an order from Margaret Peacock (EmployeeID 4).
# your code here
# your code here
# method 1
orders_4 <- orders %>%
filter(EmployeeID == 4)
customers %>%
anti_join(orders_4, by = "CustomerID")
## # A tibble: 16 x 11
## CustomerID CompanyName ContactName ContactTitle Address City Region
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 CONSH Consolidated~ Elizabeth B~ Sales Repre~ Berkel~ Lond~ NULL
## 2 DUMON Du monde ent~ Janine Labr~ Owner 67, ru~ Nant~ NULL
## 3 FISSA FISSA Fabric~ Diego Roel Accounting ~ C/ Mor~ Madr~ NULL
## 4 FRANR France resta~ Carine Schm~ Marketing M~ 54, ru~ Nant~ NULL
## 5 GROSR GROSELLA-Res~ Manuel Pere~ Owner 5ª Ave~ Cara~ DF
## 6 LAUGB Laughing Bac~ Yoshi Tanna~ Marketing A~ 1900 O~ Vanc~ BC
## 7 LAZYK Lazy K Kount~ John Steel Marketing M~ 12 Orc~ Wall~ WA
## 8 NORTS North/South Simon Crowt~ Sales Assoc~ South ~ Lond~ NULL
## 9 PARIS Paris spécia~ Marie Bertr~ Owner 265, b~ Paris NULL
## 10 PERIC Pericles Com~ Guillermo F~ Sales Repre~ Calle ~ Méxi~ NULL
## 11 PRINI Princesa Isa~ Isabel de C~ Sales Repre~ Estrad~ Lisb~ NULL
## 12 SANTG Santé Gourmet Jonas Bergu~ Owner Erling~ Stav~ NULL
## 13 SEVES Seven Seas I~ Hari Kumar Sales Manag~ 90 Wad~ Lond~ NULL
## 14 SPECD Spécialités ~ Dominique P~ Marketing M~ 25, ru~ Paris NULL
## 15 THEBI The Big Chee~ Liz Nixon Marketing M~ 89 Jef~ Port~ OR
## 16 VINET Vins et alco~ Paul Henriot Accounting ~ 59 rue~ Reims NULL
## # ... with 4 more variables: PostalCode <chr>, Country <chr>, Phone <chr>,
## # Fax <chr>
customers %>%
left_join(orders_4, by = "CustomerID") %>%
filter(is.na(OrderID))
## # A tibble: 16 x 24
## CustomerID CompanyName ContactName ContactTitle Address City Region
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 CONSH Consolidated~ Elizabeth B~ Sales Repre~ Berkel~ Lond~ NULL
## 2 DUMON Du monde ent~ Janine Labr~ Owner 67, ru~ Nant~ NULL
## 3 FISSA FISSA Fabric~ Diego Roel Accounting ~ C/ Mor~ Madr~ NULL
## 4 FRANR France resta~ Carine Schm~ Marketing M~ 54, ru~ Nant~ NULL
## 5 GROSR GROSELLA-Res~ Manuel Pere~ Owner 5ª Ave~ Cara~ DF
## 6 LAUGB Laughing Bac~ Yoshi Tanna~ Marketing A~ 1900 O~ Vanc~ BC
## 7 LAZYK Lazy K Kount~ John Steel Marketing M~ 12 Orc~ Wall~ WA
## 8 NORTS North/South Simon Crowt~ Sales Assoc~ South ~ Lond~ NULL
## 9 PARIS Paris spécia~ Marie Bertr~ Owner 265, b~ Paris NULL
## 10 PERIC Pericles Com~ Guillermo F~ Sales Repre~ Calle ~ Méxi~ NULL
## 11 PRINI Princesa Isa~ Isabel de C~ Sales Repre~ Estrad~ Lisb~ NULL
## 12 SANTG Santé Gourmet Jonas Bergu~ Owner Erling~ Stav~ NULL
## 13 SEVES Seven Seas I~ Hari Kumar Sales Manag~ 90 Wad~ Lond~ NULL
## 14 SPECD Spécialités ~ Dominique P~ Marketing M~ 25, ru~ Paris NULL
## 15 THEBI The Big Chee~ Liz Nixon Marketing M~ 89 Jef~ Port~ OR
## 16 VINET Vins et alco~ Paul Henriot Accounting ~ 59 rue~ Reims NULL
## # ... with 17 more variables: PostalCode <chr>, Country <chr>,
## # Phone <chr>, Fax <chr>, OrderID <int>, EmployeeID <int>,
## # OrderDate <dttm>, RequiredDate <dttm>, ShippedDate <chr>,
## # ShipVia <int>, Freight <dbl>, ShipName <chr>, ShipAddress <chr>,
## # ShipCity <chr>, ShipRegion <chr>, ShipPostalCode <chr>,
## # ShipCountry <chr>
# method 2
orders %>%
filter(EmployeeID == 4) %>%
right_join(customers, by = "CustomerID") %>%
filter(is.na(OrderID))
## # A tibble: 16 x 24
## OrderID CustomerID EmployeeID OrderDate RequiredDate
## <int> <chr> <int> <dttm> <dttm>
## 1 NA CONSH NA NA NA
## 2 NA DUMON NA NA NA
## 3 NA FISSA NA NA NA
## 4 NA FRANR NA NA NA
## 5 NA GROSR NA NA NA
## 6 NA LAUGB NA NA NA
## 7 NA LAZYK NA NA NA
## 8 NA NORTS NA NA NA
## 9 NA PARIS NA NA NA
## 10 NA PERIC NA NA NA
## 11 NA PRINI NA NA NA
## 12 NA SANTG NA NA NA
## 13 NA SEVES NA NA NA
## 14 NA SPECD NA NA NA
## 15 NA THEBI NA NA NA
## 16 NA VINET NA NA NA
## # ... with 19 more variables: ShippedDate <chr>, ShipVia <int>,
## # Freight <dbl>, ShipName <chr>, ShipAddress <chr>, ShipCity <chr>,
## # ShipRegion <chr>, ShipPostalCode <chr>, ShipCountry <chr>,
## # CompanyName <chr>, ContactName <chr>, ContactTitle <chr>,
## # Address <chr>, City <chr>, Region <chr>, PostalCode <chr>,
## # Country <chr>, Phone <chr>, Fax <chr>
ex6. Find all orders and their shippers with OrderID less than 10255. Display OrderID
, OrderDate
(date only), and Shipper CompanyName
. Hint: use as.Date()
to convert OrderDate
(Datetime type) to Date.
# let's do it together
orders %>%
inner_join(shippers, by = c("ShipVia" = "ShipperID")) %>%
filter(OrderID < 10255) %>%
mutate(OrderDate = as.Date(OrderDate)) %>%
select(OrderID, OrderDate, CompanyName)
## # A tibble: 7 x 3
## OrderID OrderDate CompanyName
## <int> <date> <chr>
## 1 10248 2014-07-04 Federal Shipping
## 2 10249 2014-07-05 Speedy Express
## 3 10250 2014-07-08 United Package
## 4 10251 2014-07-08 Speedy Express
## 5 10252 2014-07-09 United Package
## 6 10253 2014-07-10 United Package
## 7 10254 2014-07-11 United Package