Note: Load this R Markdown file (part3_visualization.Rmd) in RStudio (instead of Databricks). Let’s take this opportunity to learn how to use RStudio and create R Markdown files.
There are many ways to plot graphs in R. Base R provides a number of basic plotting functions. Let’s try a few simple plots.
# cars is a built-in dataset (data frame)
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
plot() is a so-called “generic” function. A generic function behaves differently depending on the class of the object (e.g. data structure) it takes in. In this case, plot() draws different types of graphs depending on its input, because the plot() method that actually runs is chosen by the input object’s class.
# plot() takes x and y values (here x = dist, y = speed)
plot(cars$dist, cars$speed)
# plot() takes a data frame, which in this case has only 2 variables/columns (first column on x, second on y)
plot(cars)
# iris is another built-in dataset
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# plot() takes a data frame with many variables/columns and draws a scatterplot matrix
plot(iris)
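The class-based dispatch behind plot() can be sketched with a toy generic. The names describe(), describe.numeric() and so on are hypothetical and exist only for this illustration; they are not part of base R.

```r
# Which class does plot() see for each input?
class(cars)        # a data frame
class(cars$speed)  # a numeric vector

# A toy S3 generic (hypothetical names, for illustration only)
describe <- function(x, ...) UseMethod("describe")
describe.default    <- function(x, ...) "an object with no specific method"
describe.numeric    <- function(x, ...) paste("a numeric vector of length", length(x))
describe.data.frame <- function(x, ...) paste("a data frame with", ncol(x), "columns")

describe(cars)        # dispatches to describe.data.frame()
describe(cars$speed)  # dispatches to describe.numeric()
describe("hello")     # falls back to describe.default()

# The plot() methods currently registered in this session
methods(plot)
```

plot() works the same way: the call is routed to whichever method matches the class of its first argument.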
hist() is another generic function, used to plot a simple histogram.
hist(cars$speed)
Customize the histogram plot.
hist(cars$speed,
     main = "Histogram for Car Speed",
     xlab = "Car Speed (mph)",
     border = "pink",
     col = "deeppink",
     breaks = 8) # suggested number of cells/bins
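Because a single number passed to breaks is only a suggestion, it can help to check which bins hist() actually used. A minimal sketch, using plot = FALSE so that nothing is drawn and the computed histogram object is returned instead (the name h is just for illustration):

```r
# Compute the histogram without drawing it
h <- hist(cars$speed, breaks = 8, plot = FALSE)

h$breaks          # bin boundaries actually chosen
length(h$counts)  # number of bins actually used
h$counts          # number of observations in each bin
```

If the number of bins differs from what you asked for, you can pass a vector of exact break points to breaks instead.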
I’ll leave you to explore base R plotting on your own. Here is a good starting point: http://rpubs.com/SusanEJohnston/7953.
ggplot()
Today we will focus on learning ggplot() from the ggplot2 package, a powerful R plotting package based on the grammar of graphics. The idea is that “you can build every graph from the same components: a data set, a coordinate system, and geoms - visual marks that represent data points” (see the ggplot2 cheat sheet). The grammar of graphics enables us to concisely describe the components of a graphic.
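As a rough illustration of those three components, here is a minimal sketch that uses the built-in cars data frame. The coordinate system is spelled out only to make it visible; coord_cartesian() is what ggplot2 uses by default.

```r
library(ggplot2)

p <- ggplot(data = cars,                        # 1. the data set
            mapping = aes(x = speed, y = dist)) +  # which columns go on x and y
  geom_point() +                                # 2. a geom: one point per row
  coord_cartesian()                             # 3. the coordinate system (the default)
p
```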
Let’s learn ggplot() using an example. (This example is inspired by and builds upon this notebook.)
First, load a few packages.
# for data manipulation
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# the plotting package
library(ggplot2)
# gapminder contains the data we will use for our plot
library(gapminder)
Let’s take a quick look at the gapminder dataset.
head(gapminder)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
summary(gapminder)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
We will only use data from the most recent year in gapminder.
# get data from the most recent year
g_data <- gapminder %>%
  filter(year == 2007)
head(g_data)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
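The year 2007 is hard-coded above. As a sketch of an alternative, you could let the data decide which year is the most recent; the name latest is only illustrative.

```r
library(dplyr)
library(gapminder)

latest <- gapminder %>%
  filter(year == max(year))  # max(year) is computed over the whole year column

unique(latest$year)  # the single year that was kept
nrow(latest)         # one row per country
```

This version keeps working if the dataset is later extended with newer years.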
Let’s understand in general how ggplot() works: a layer-by-layer approach (see slides).
Now, let’s plot lifeExp against gdpPercap (scatter plot).
# scatterplot of life expectancy vs GDP per capita
ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
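One way to see the layer-by-layer idea is to store the plot in a variable and add one piece at a time. This is only a sketch; the variable names base_plot, with_points and with_theme are made up for the illustration.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

g_data <- filter(gapminder, year == 2007)

# Data and aesthetic mapping only: this draws empty axes
base_plot <- ggplot(g_data, aes(x = gdpPercap, y = lifeExp))

# Add a layer of points on top of the base plot
with_points <- base_plot + geom_point()

# Add a theme on top of that
with_theme <- with_points + theme_minimal()

base_plot
with_points
with_theme
```

Each `+` returns a new plot object, so the earlier versions stay available for comparison.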
What would you do if you wanted to connect those dots (i.e. make it a line plot)? This is just an exercise; connecting these dots obviously doesn’t make much sense.
# your code here
ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_line()
What if you want to color the points in Rotman “deeppink”?
# your code here
ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "deeppink")
What if you want to label the dots by country name? Does the picture look nice? What did you find out?
# your code here
ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "deeppink") +
  geom_label(aes(label = country))
ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_label(aes(label = country)) +
  geom_point(color = "deeppink")
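Layers are drawn in the order they are added, so in the two plots above whichever layer comes last sits on top. One possible way to reduce the clutter, offered only as a sketch, is geom_text() with check_overlap = TRUE, which skips any label that would overlap a label already drawn in the same layer. The size and vjust values are arbitrary choices.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

g_data <- filter(gapminder, year == 2007)

p_labels <- ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "deeppink") +
  geom_text(aes(label = country),
            check_overlap = TRUE,  # drop labels that would overlap earlier ones
            size = 3,              # smaller text
            vjust = -0.8)          # shift labels slightly above the points
p_labels
```

The trade-off is that many countries end up unlabeled.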
How do you make the size of the points proportional to country population (pop)?
# your code here
ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop))
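With the default size scale, even the smallest population gets a clearly visible point. If you would rather have point area strictly proportional to the value (so that a value of 0 maps to an area of 0), a sketch using scale_size_area() could look like this; max_size = 12 and alpha = 0.5 are arbitrary choices.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

g_data <- filter(gapminder, year == 2007)

p_area <- ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop), alpha = 0.5) +
  scale_size_area(max_size = 12)  # area proportional to pop; largest point has size 12
p_area
```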
In addition, can you color the points by continent (i.e. no more Rotman “deeppink”)?
# your code here
ggplot(g_data, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent))
Can you add a plot title and x- and y-axis titles? Hint: add labs(title = 'my title') and labs(x = "x title", y = "y title") layers. At the same time, make the dots a bit lighter (the alpha argument of geom_point()).
# your code here
gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent), alpha = 0.5) +
  labs(title = 'Health & Wealth of Nations for 2007') +
  labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)")
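The separate labs() layers can also be merged into a single call, since labs() accepts several labels at once. A minimal sketch of the same plot written that way (p_labs is just an illustrative name):

```r
library(dplyr)
library(ggplot2)
library(gapminder)

p_labs <- gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent), alpha = 0.5) +
  labs(title = "Health & Wealth of Nations for 2007",  # plot title
       x = "GDP per capita ($/year)",                  # x-axis title
       y = "Life expectancy (years)")                  # y-axis title
p_labs
```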
OK. Let’s do a few more things together.
library(scales)
gapminder %>%
  filter(year == 2007) %>%
  mutate(pop_m = pop / 1e6) %>% # population in millions
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop_m, color = continent), alpha = 0.5) +
  scale_x_continuous(labels = comma_format()) + # comma format for x-axis labels
  labs(title = 'Health & Wealth of Nations for 2007') +
  labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)") +
  labs(color = "Continent", size = "Population (M)") + # legend titles
  theme(plot.title = element_text(hjust = 0.5)) # center the plot title
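To see what the comma formatter does to the axis labels, you can call the scales helpers directly on a few numbers. This is a small sketch; label_comma() is the newer name for the same kind of formatter in recent versions of scales, and fmt is just an illustrative variable name.

```r
library(scales)

# comma() formats numbers with a thousands separator
comma(c(1000, 25000, 113523))

# label_comma() returns a formatting function, which is what the labels argument expects
fmt <- label_comma()
fmt(c(1000, 25000))
```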
Just as an exercise, let’s find Canada and label it.
# add Canada label
gapminder %>%
  filter(year == 2007) %>%
  mutate(pop_m = pop / 1e6) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop_m, color = continent), alpha = 0.5) +
  geom_text(data = filter(gapminder, country == "Canada" & year == 2007), aes(label = country)) +
  scale_x_continuous(labels = comma_format()) +
  labs(title = 'Health & Wealth of Nations for 2007') +
  labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)") +
  labs(color = "Continent", size = "Population (M)") +
  theme(plot.title = element_text(hjust = 0.5))
The default label couldn’t clearly identify the dot for Canada. Let’s use the ggrepel package to improve the labeling.
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 3.5.2
gapminder %>%
  filter(year == 2007) %>%
  mutate(pop_m = pop / 1e6) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop_m, color = continent), alpha = 0.5) +
  geom_text_repel(
    data = filter(gapminder, country == "Canada" & year == 2007),
    aes(label = country),
    nudge_x = 5000,
    nudge_y = -10) +
  scale_x_continuous(labels = comma_format()) +
  labs(title = 'Health & Wealth of Nations for 2007') +
  labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)") +
  labs(color = "Continent", size = "Population (M)") +
  theme(plot.title = element_text(hjust = 0.5))
Let’s put the x-axis on a log scale.
# log
gapminder %>%
  filter(year == 2007) %>%
  mutate(pop_m = pop / 1e6) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop_m, color = continent), alpha = 0.5) +
  geom_text_repel(
    data = filter(gapminder, country == "Canada" & year == 2007),
    aes(label = country),
    nudge_x = 5000,
    nudge_y = -10) +
  scale_x_continuous(labels = comma_format(), trans = 'log10') +
  labs(title = 'Health & Wealth of Nations for 2007') +
  labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)") +
  labs(color = "Continent", size = "Population (M)") +
  theme(plot.title = element_text(hjust = 0.5))
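ggplot2 also provides scale_x_log10() as a shorthand for a continuous x scale with a log10 transformation. A stripped-down sketch of the same idea, without the labels and legend tweaks (p_log is an illustrative name):

```r
library(dplyr)
library(ggplot2)
library(scales)
library(gapminder)

p_log <- gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent), alpha = 0.5) +
  scale_x_log10(labels = comma_format())  # log10 x-axis with comma-formatted labels
p_log
```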
Let’s add a linear regression line.
# smooth - lm
gapminder %>%
  filter(year == 2007) %>%
  mutate(pop_m = pop / 1e6) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop_m, color = continent), alpha = 0.5) +
  geom_smooth(method = "lm") +
  geom_text_repel(
    data = filter(gapminder, country == "Canada" & year == 2007),
    aes(label = country),
    nudge_x = 5000,
    nudge_y = -10) +
  scale_x_continuous(labels = comma_format(), trans = 'log10') +
  labs(title = 'Health & Wealth of Nations for 2007') +
  labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)") +
  labs(color = "Continent", size = "Population (M)") +
  theme(plot.title = element_text(hjust = 0.5))
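Because the x-axis is log-transformed, the line drawn by geom_smooth(method = "lm") is fitted to the transformed values, so it is straight in log10(gdpPercap) rather than in gdpPercap. A rough equivalent fitted directly with lm(), as a sketch (the name fit is only illustrative):

```r
library(dplyr)
library(gapminder)

g_data <- filter(gapminder, year == 2007)

# Life expectancy as a linear function of log10(GDP per capita)
fit <- lm(lifeExp ~ log10(gdpPercap), data = g_data)

coef(fit)               # intercept and slope of the line
summary(fit)$r.squared  # share of variance explained
```

The slope can be read as the change in life expectancy associated with a tenfold increase in GDP per capita.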
What if we don’t use a log scale on the x-axis and let ggplot fit a smooth curve for us?
# smooth - auto
gapminder %>%
  filter(year == 2007) %>%
  mutate(pop_m = pop / 1e6) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop_m, color = continent), alpha = 0.5) +
  geom_smooth(method = "auto") +
  geom_text_repel(
    data = filter(gapminder, country == "Canada" & year == 2007),
    aes(label = country),
    nudge_x = 5000,
    nudge_y = -10) +
  scale_x_continuous(labels = comma_format()) +
  labs(title = 'Health & Wealth of Nations for 2007') +
  labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)") +
  labs(color = "Continent", size = "Population (M)") +
  theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
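The message above shows that the automatic choice was loess here; the ggplot2 documentation says loess is used for fewer than 1,000 observations and a GAM otherwise. As a sketch, you can request loess explicitly and control how wiggly the curve is. The span value of 0.5 is an arbitrary choice, and se = FALSE hides the confidence band.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

p_loess <- gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess",  # ask for loess explicitly
              formula = y ~ x,   # stating the formula avoids the message
              span = 0.5,        # smaller span gives a wigglier curve
              se = FALSE)        # no confidence band
p_loess
```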
Let’s compare plots between 2002 and 2007 using faceting (facet_grid).
# smooth
gapminder %>%
  filter(year == 2007 | year == 2002) %>%
  mutate(pop_m = pop / 1e6) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop_m, color = continent), alpha = 0.5) +
  geom_smooth(method = "auto") +
  scale_x_continuous(labels = comma_format()) +
  labs(title = 'Health & Wealth of Nations for 2002 and 2007') +
  labs(x = "GDP per capita ($/year)", y = "Life expectancy (years)") +
  labs(color = "Continent", size = "Population (M)") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_grid(year ~ .)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
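In facet_grid(), the variable on the left of `~` defines rows, so the two years are stacked vertically. A sketch of a side-by-side layout using facet_wrap() instead, with the year filter written using %in% (p_facets is an illustrative name):

```r
library(dplyr)
library(ggplot2)
library(gapminder)

p_facets <- gapminder %>%
  filter(year %in% c(2002, 2007)) %>%  # same rows as year == 2007 | year == 2002
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent), alpha = 0.5) +
  facet_wrap(~ year, ncol = 2)         # one panel per year, side by side
p_facets
```

Which layout reads better depends on whether you want to compare the panels along the x-axis or the y-axis.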
Take a look at the diamonds dataset. Produce a scatter plot of price (y) against carat (x) and color the dots by clarity. Fine-tune the plot to make it as nice as you can.
# your code here
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
# your code here
ggplot(data = diamonds, aes(carat, price)) +
  geom_point(aes(colour = clarity),
             position = "jitter",
             alpha = 0.5,
             size = 0.8) +
  scale_y_continuous(trans = "log10") +
  scale_color_brewer(palette = "Spectral") +
  theme_minimal()
By now, you should have a good understanding of how the layer-by-layer approach works in ggplot(). There is obviously a lot more to learn about ggplot(), but the approach is always the same, so you can learn other plots easily by referring to the ggplot() documentation.
Others (only for me)
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
diamonds %>%
  ggplot(aes(price)) +
  geom_histogram(bins = 50)
diamonds %>%
  ggplot(aes(price)) +
  geom_histogram(bins = 50, alpha = 0.5) +
  geom_freqpoly(bins = 50, color = "deeppink")
diamonds %>%
  ggplot(aes(price)) +
  geom_density(color = "deeppink", fill = "deeppink", alpha = 0.1, adjust = 0.5)
diamonds %>%
  ggplot(aes(price, fill = cut)) +
  geom_histogram(bins = 50)
diamonds %>%
  ggplot(aes(price)) +
  geom_freqpoly(aes(color = cut), bins = 50)
diamonds %>%
  ggplot(aes(price)) +
  geom_density(aes(color = cut))
diamonds %>%
  ggplot(aes(price)) +
  geom_density(aes(fill = cut), position = "stack", alpha = 0.5)
diamonds %>%
  ggplot(aes(price)) +
  geom_density(aes(fill = cut), position = "fill", alpha = 0.5)