Introduction

Note: Load this R Markdown file (part3_visualization.Rmd) in RStudio (instead of Databricks). Let’s take this opportunity to learn how to use RStudio and create R Markdown files.

There are many ways to plot graphs in R. Base R provides a number of basic plotting functions. Let’s try a few simple plots.

# cars is a built-in dataset (data frame)
head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

plot() is a so-called “generic” function. A generic function behaves differently depending on the class of the object (e.g. the data structure) it is given. In this case, plot() draws different types of graphs depending on its input, because the plot() method that actually runs is chosen by the input object’s class.

# plot() takes two vectors: the first goes on the x-axis, the second on the y-axis
plot(cars$dist, cars$speed)

# plot() takes in a data frame, which in this case has only 2 variables/columns
plot(cars)

# iris is another built-in dataset
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# plot() takes in a dataframe with many variables/columns
plot(iris)
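
To see what “generic” means in practice, here is a minimal sketch of S3 dispatch using a toy generic. The function name describe() and its methods are hypothetical and exist only for this illustration; plot() itself has many more methods, which methods(plot) will list.

```r
# A toy S3 generic (hypothetical name) that dispatches on the class of x
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists
describe.default <- function(x, ...) "an object of some other class"

# Method chosen when x is a data frame
describe.data.frame <- function(x, ...) {
  paste("a data frame with", ncol(x), "columns")
}

# Method chosen when x is a numeric vector
describe.numeric <- function(x, ...) {
  paste("a numeric vector of length", length(x))
}

class(cars)           # "data.frame"
describe(cars)        # uses describe.data.frame
describe(cars$speed)  # uses describe.numeric
describe("hello")     # falls back to describe.default
```

plot(cars) and plot(cars$dist, cars$speed) differ for the same reason: R looks at the class of the first argument and picks the matching method.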

hist() is another generic function; it plots a simple histogram.

hist(cars$speed)

Customize the histogram plot.

hist(cars$speed,
     main="Histogram for Car Speed", 
     xlab="Car Speed (mph)", 
     border="pink", 
     col="deeppink",
     breaks=8) # suggested number of cells/bins (R may adjust it)
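
A single number passed to breaks is only a suggestion; hist() may choose a different number of bins so that the break points fall on round values. If exact bins matter, one option is to pass a vector of break points instead. The sketch below assumes bins of width 5 mph are wanted; the variable name h is just for illustration.

```r
# Pass explicit break points instead of a suggested bin count.
# plot = FALSE returns the histogram object without drawing it.
h <- hist(cars$speed, breaks = seq(0, 25, by = 5), plot = FALSE)

h$breaks  # the break points actually used
h$counts  # number of cars in each bin

# Draw the histogram with the same explicit breaks
hist(cars$speed,
     breaks = seq(0, 25, by = 5),
     main = "Histogram for Car Speed",
     xlab = "Car Speed (mph)")
```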

I’ll leave you to explore base R plotting on your own. A good place to start is http://rpubs.com/SusanEJohnston/7953.

ggplot()

Today we will focus on learning ggplot() from the ggplot2 package, a powerful R plotting package based on the grammar of graphics. The idea is that “you can build every graph from the same components: a data set, a coordinate system, and geoms - visual marks that represent data points” (see the ggplot2 cheat sheet). The grammar of graphics enables us to concisely describe the components of a graphic.
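
As a rough sketch of that idea, the three components can be written out explicitly. The example uses the built-in cars data so that it runs before any other package is loaded; the object name p is an arbitrary choice, and coord_cartesian() is spelled out only to make the default coordinate system visible.

```r
library(ggplot2)

p <- ggplot(data = cars,                          # 1. the data set
            mapping = aes(x = speed, y = dist)) + #    and how its columns map to x and y
  coord_cartesian() +                             # 2. the coordinate system (the default)
  geom_point()                                    # 3. a geom: one point per row

p  # printing the object draws the plot
```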

Let’s learn ggplot() using an example. (This example is inspired by and built upon this notebook.)

First, load a few packages.

# for data manipulation
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# the plotting package
library(ggplot2)

# gapminder contains the data we will use for our plot
library(gapminder)

Let’s take a quick look at the gapminder dataset.

head(gapminder)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

We will only use data from the most recent year in gapminder.

# get data from the most recent year
g_data <- gapminder %>%
  filter(year == 2007)
head(g_data)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       2007    43.8 31889923      975.
## 2 Albania     Europe     2007    76.4  3600523     5937.
## 3 Algeria     Africa     2007    72.3 33333216     6223.
## 4 Angola      Africa     2007    42.7 12420476     4797.
## 5 Argentina   Americas   2007    75.3 40301927    12779.
## 6 Australia   Oceania    2007    81.2 20434176    34435.
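
Hard-coding 2007 works for this version of the data. A slightly more general sketch, assuming the year column is numeric as shown above, lets the data decide which year is the latest. The name g_latest is hypothetical.

```r
library(dplyr)
library(gapminder)

# Keep only the rows whose year equals the largest year in the data
g_latest <- gapminder %>%
  filter(year == max(year))

unique(g_latest$year)  # the year that was selected
nrow(g_latest)         # one row per country
```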

Let’s understand in general how ggplot() works: a layer-by-layer approach (see slides).

Now, let’s plot lifeExp against gdpPercap (scatter plot).

# scatterplot of life expectancy vs GDP per capita
ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
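
Because each layer is added with +, a plot can be stored in a variable and extended later. This is a minimal sketch of that workflow; the names p_base and p_points are made up for the example.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

g_data <- gapminder %>%
  filter(year == 2007)

# Step 1: data and aesthetic mapping only; printing this draws empty axes
p_base <- ggplot(g_data, aes(x = gdpPercap, y = lifeExp))

# Step 2: add a layer of points on top of the base
p_points <- p_base + geom_point()

length(p_base$layers)    # no layers yet
length(p_points$layers)  # one layer after adding geom_point()

p_points
```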

What would you do if you wanted to connect those dots (i.e. make it a line plot)? This is just an exercise; connecting these dots obviously doesn’t make much sense.

# your code here
ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_line()

What if you want to color the points in Rotman “deeppink”?

# your code here
ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "deeppink")

What if you want to label the dots by country name? Does the picture look nice? What did you find out?

# your code here
ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "deeppink") +
  geom_label(aes(label = country))

ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_label(aes(label = country)) +
  geom_point(color = "deeppink")
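
Labelling every country overcrowds the plot, whichever layer is drawn last. One possible remedy, shown here only as a sketch, is geom_text() with check_overlap = TRUE, which skips labels that would overlap ones already drawn. The size and vjust values are arbitrary choices, and p_labels is a hypothetical name.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

g_data <- gapminder %>%
  filter(year == 2007)

p_labels <- ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "deeppink") +
  geom_text(aes(label = country),
            check_overlap = TRUE,  # drop labels that would overlap earlier ones
            size = 3,
            vjust = -0.5)          # place the text slightly above the point

p_labels
```

Which labels survive depends on the row order of the data, so this is a quick way to reduce clutter, not a way to choose which countries get named.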

How would you make the size of the points proportional to country population (pop)?

# your code here
ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop))

In addition, can you color the points by continent (i.e. no more Rotman “deeppink”)?

# your code here
ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent))
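
The last few exercises rely on a distinction that is easy to miss: an argument inside aes() is mapped from a column of the data, while an argument outside aes() is set to a fixed value. A small sketch of the difference (the object names are hypothetical):

```r
library(dplyr)
library(ggplot2)
library(gapminder)

g_data <- gapminder %>%
  filter(year == 2007)

# Setting: every point gets the same fixed colour, no legend
p_set <- ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "deeppink")

# Mapping: colour is looked up from the continent column, legend added
p_map <- ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent))

# Common mistake: a constant inside aes() is treated as data,
# so the points are not drawn in deeppink and a legend appears
p_oops <- ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = "deeppink"))

p_set
p_map
p_oops
```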

Can you add a plot title and x- and y-axis titles? Hint: add labs(title = 'my title') and labs(x = "x title", y = "y title") layers. At the same time, make the dots a bit lighter (use the alpha argument of geom_point()).

# your code here
gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point(aes(size = pop, color = continent), alpha = 0.5) +
    labs(title = 'Health & Wealth of Nations for 2007') +
    labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)")

OK. Let’s do a few more things together.

  1. improve the legend titles
  2. express the population in millions (M)
  3. move the plot title to the middle
  4. format the x-axis tick labels with comma_format
library(scales)
gapminder %>%
  filter(year == 2007) %>%
  mutate(pop_m = pop / 1e6) %>% #population in million
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point(aes(size = pop_m, color = continent), alpha = 0.5) +
    scale_x_continuous(labels = comma_format()) + #x-axis label comma format
    labs(title = 'Health & Wealth of Nations for 2007') +
    labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)") +
    labs(color = "Continent", size = "Population (M)") + #legend title
    theme(plot.title = element_text(hjust = 0.5)) #center the plot title
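
comma_format() returns a function that turns numbers into strings with thousands separators, and scale_x_continuous() applies that function to the axis breaks. The sketch below calls it directly to show what it produces. Recent versions of the scales package also offer label_comma() as the newer name; the last line assumes a version of scales that includes it.

```r
library(scales)

# comma_format() builds a labelling function; calling it formats numbers
fmt <- comma_format()
fmt(c(1000, 25000, 1234567))

# label_comma() is the newer spelling and produces the same labels
label_comma()(c(1000, 25000, 1234567))
```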

Just as an exercise, let’s find Canada and label it.

# add Canada label
gapminder %>%
  filter(year == 2007) %>%
  mutate(pop_m = pop / 1e6) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point(aes(size = pop_m, color = continent), alpha = 0.5) +
    geom_text(data = filter(gapminder, country == "Canada" & year == 2007), aes(label = country)) +
    scale_x_continuous(labels = comma_format()) +
    labs(title = 'Health & Wealth of Nations for 2007') +
    labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)") +
    labs(color = "Continent", size = "Population (M)") +
    theme(plot.title = element_text(hjust = 0.5))

The default label doesn’t clearly identify the dot for Canada. Let’s use the ggrepel package to improve the labeling.

library(ggrepel)
## Warning: package 'ggrepel' was built under R version 3.5.2
gapminder %>%
  filter(year == 2007) %>%
  mutate(pop_m = pop / 1e6) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point(aes(size = pop_m, color = continent), alpha = 0.5) +
    geom_text_repel(
      data = filter(gapminder, country == "Canada" & year == 2007), 
      aes(label = country),
      nudge_x = 5000,
      nudge_y = -10) +
    scale_x_continuous(labels = comma_format()) +
    labs(title = 'Health & Wealth of Nations for 2007') +
    labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)") +
    labs(color = "Continent", size = "Population (M)") +
    theme(plot.title = element_text(hjust = 0.5))
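
The data argument of a layer replaces the plot’s data for that layer only, while x and y are still inherited from ggplot(). The same trick extends to several labels. The sketch below assumes we want to label three countries; the selection and the name to_label are purely illustrative.

```r
library(dplyr)
library(ggplot2)
library(gapminder)
library(ggrepel)

g_data <- gapminder %>%
  filter(year == 2007)

# Hypothetical selection of countries to label
to_label <- g_data %>%
  filter(country %in% c("Canada", "China", "India"))

ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent), alpha = 0.5) +  # all countries
  geom_text_repel(data = to_label,                   # only the selected rows
                  aes(label = country))
```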

Let’s put the x-axis on a log scale.

# log
gapminder %>%
  filter(year == 2007) %>%
  mutate(pop_m = pop / 1e6) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point(aes(size = pop_m, color = continent), alpha = 0.5) +
    geom_text_repel(
      data = filter(gapminder, country == "Canada" & year == 2007), 
      aes(label = country),
      nudge_x = 5000,
      nudge_y = -10) +
    scale_x_continuous(labels = comma_format(), trans = 'log10') +
    labs(title = 'Health & Wealth of Nations for 2007') +
    labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)") +
    labs(color = "Continent", size = "Population (M)") +
    theme(plot.title = element_text(hjust = 0.5))
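
On a log10 axis, equal distances correspond to equal ratios, which spreads out the many low-GDP countries. As a sketch, scale_x_log10() is a shorthand for the same transformation; p_log is a hypothetical name and the plot is pared down to keep the example short.

```r
library(dplyr)
library(ggplot2)
library(gapminder)
library(scales)

g_data <- gapminder %>%
  filter(year == 2007)

# Each tenfold increase is one unit apart after the transformation
log10(c(1000, 10000, 100000))

# Shorthand for scale_x_continuous(trans = 'log10')
p_log <- ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  scale_x_log10(labels = comma_format())

p_log
```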

Let’s add a linear regression line.

# smooth - lm
gapminder %>%
  filter(year == 2007) %>%
  mutate(pop_m = pop / 1e6) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point(aes(size = pop_m, color = continent), alpha = 0.5) +
    geom_smooth(method = "lm") +
    geom_text_repel(
      data = filter(gapminder, country == "Canada" & year == 2007), 
      aes(label = country),
      nudge_x = 5000,
      nudge_y = -10) +
    scale_x_continuous(labels = comma_format(), trans = 'log10') +
    labs(title = 'Health & Wealth of Nations for 2007') +
    labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)") +
    labs(color = "Continent", size = "Population (M)") +
    theme(plot.title = element_text(hjust = 0.5))
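
Because the x-axis is on a log10 scale, geom_smooth(method = "lm") fits the line to the transformed values, i.e. lifeExp against log10(gdpPercap). The same fit can be sketched directly with lm(); fit is a hypothetical name.

```r
library(dplyr)
library(gapminder)

g_data <- gapminder %>%
  filter(year == 2007)

# Linear model on the log10-transformed predictor
fit <- lm(lifeExp ~ log10(gdpPercap), data = g_data)

# Intercept and slope; the slope is the change in life expectancy
# associated with a tenfold increase in GDP per capita
coef(fit)
```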

What if we don’t use a log scale on the x-axis and let ggplot fit a smooth curve for us?

# smooth - auto
gapminder %>%
  filter(year == 2007) %>%
  mutate(pop_m = pop / 1e6) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point(aes(size = pop_m, color = continent), alpha = 0.5) +
    geom_smooth(method = "auto") +
    geom_text_repel(
      data = filter(gapminder, country == "Canada" & year == 2007), 
      aes(label = country),
      nudge_x = 5000,
      nudge_y = -10) +
    scale_x_continuous(labels = comma_format()) +
    labs(title = 'Health & Wealth of Nations for 2007') +
    labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)") +
    labs(color = "Continent", size = "Population (M)") +
    theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
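
With method = "auto", ggplot chose loess here, as the message reports. Below is a sketch of asking for loess explicitly and controlling how closely the curve follows the data; span = 0.5 and se = FALSE are arbitrary choices for illustration, and p_loess is a hypothetical name.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

g_data <- gapminder %>%
  filter(year == 2007)

p_loess <- ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess",
              formula = y ~ x,
              span = 0.5,   # a smaller span follows the data more closely
              se = FALSE)   # hide the confidence band

p_loess
```

Supplying method and formula explicitly also silences the informational message.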

Let’s compare plots between 2002 and 2007 using faceting (facet_grid).

# facet by year
gapminder %>%
  filter(year == 2007 | year == 2002) %>%
  mutate(pop_m = pop / 1e6) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point(aes(size = pop_m, color = continent), alpha = 0.5) +
    geom_smooth(method = "auto") +
    scale_x_continuous(labels = comma_format()) +
    labs(title = 'Health & Wealth of Nations for 2002 and 2007') +
    labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)") +
    labs(color = "Continent", size = "Population (M)") +
    theme(plot.title = element_text(hjust = 0.5)) +
    facet_grid(year~.)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
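
facet_grid(year~.) stacks one panel per year in rows. As an alternative sketch, facet_wrap(~ year, ncol = 2) lays the panels out side by side; which reads better is a matter of taste. The %in% filter is equivalent to the | condition used above, and p_facet is a hypothetical name.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

p_facet <- gapminder %>%
  filter(year %in% c(2002, 2007)) %>%          # same rows as year == 2007 | year == 2002
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_point(aes(color = continent), alpha = 0.5) +
    facet_wrap(~ year, ncol = 2)               # one panel per year, side by side

p_facet
```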

Exercise

Take a look at the diamonds dataset. Produce a scatter plot with price (y) against carat (x) and color the dots by clarity. Fine-tune the plot to make it as nice as you can.

# your code here
head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
# your code here
ggplot(data = diamonds, aes(carat, price)) +
  geom_point(aes(colour = clarity), 
    position = "jitter", 
    alpha=0.5, 
    size = 0.8) +
  scale_y_continuous(trans = "log10") +
  scale_color_brewer(palette = "Spectral") +
  theme_minimal()

By now, you should have a good understanding of how the layer-by-layer approach works in ggplot(). There is obviously a lot more to learn about ggplot(), but the approach is always the same, so you can learn other plots easily by referring to the ggplot2 documentation.

Others (only for me)

head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
diamonds %>%
  ggplot(aes(price)) +
    geom_histogram(bins = 50)

diamonds %>%
  ggplot(aes(price)) +
    geom_histogram(bins = 50, alpha = 0.5) +
    geom_freqpoly(bins = 50, color = "deeppink")

diamonds %>%
  ggplot(aes(price)) +
  geom_density(color = "deeppink", fill = "deeppink", alpha = 0.1, adjust = 0.5)

diamonds %>%
  ggplot(aes(price, fill = cut)) +
    geom_histogram(bins = 50)

diamonds %>%
  ggplot(aes(price)) +
    geom_freqpoly(aes(color = cut), bins = 50)

diamonds %>%
  ggplot(aes(price)) +
    geom_density(aes(color = cut))

diamonds %>%
  ggplot(aes(price)) +
    geom_density(aes(fill = cut), position = "stack", alpha = 0.5)

diamonds %>%
  ggplot(aes(price)) +
    geom_density(aes(fill = cut), position = "fill", alpha = 0.5)