This practical follows the previous basic introduction to ggplot2. It goes further with ggplot2: annotation, theme customization, color palettes, output formats, scales, and more.
The following libraries are needed throughout the practical. Install them with install.packages() if you do not have them already. Then load them with library().
Q1.1 The code below builds a basic histogram of Airbnb apartment prices on the French Riviera. It shows only values under 300 euros. Add code to:
- add a title with ggtitle()
- change the axis titles with xlab() and ylab()
- control the axis limits with xlim() and ylim()
# Libraries
library(ggplot2)
library(dplyr) # provides %>% and filter()
# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/1_OneNum.csv", header=TRUE)
# Make the histogram
data %>%
  filter( price<300 ) %>%
  ggplot( aes(x=price)) +
    geom_histogram() +
    ggtitle("Night price distribution of Airbnb apartments") +
    xlab("Night price") +
    ylab("Number of apartments") +
    xlim(0,400)
All ggplot2 chart components can be changed using the theme() function. You can see a complete list of components in the official documentation.
Note: each kind of component is changed with its own function: element_text() for text, element_line() for lines, and so on.
Q1.2 Reproduce the previous histogram and change:
- the plot title, with plot.title
- the X axis title, with axis.title.x
- the major grid lines, with panel.grid.major
# Make the histogram
data %>%
  filter( price<300 ) %>%
  ggplot( aes(x=price)) +
    geom_histogram() +
    ggtitle("Night price distribution of Airbnb apartments") +
    xlab("Night price") +
    ylab("Number of apartments") +
    xlim(0,400) +
    theme(
      plot.title = element_text(size=13, color="orange"),
      axis.title.x = element_text(size=13, color="purple"),
      panel.grid.major = element_line(color="red")
    )
Q1.3 ggplot2 offers a set of pre-built themes. Try the following ones to see which you like the most:
theme_bw()
theme_dark()
theme_minimal()
theme_classic()
See a complete list here.
# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/1_OneNum.csv", header=TRUE)
# Make the histogram
data %>%
  filter( price<300 ) %>%
  ggplot( aes(x=price)) +
    geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    ggtitle("Night price distribution of Airbnb apartments") +
    theme_classic()
Q1.4 The hrbrthemes package provides my favourite style. Install the package, load it, and apply theme_ipsum(). Documentation is here.
# Libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)
# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/1_OneNum.csv", header=TRUE)
# Make the histogram
data %>%
  filter( price<300 ) %>%
  ggplot( aes(x=price)) +
    stat_bin(breaks=seq(0,300,10), fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    ggtitle("Night price distribution of Airbnb apartments") +
    theme_ipsum()
Annotation is a crucial component of a good dataviz. It can turn a boring graphic into an interesting and insightful way to convey information. Dataviz is often divided into two main types: exploratory and explanatory analysis. Annotation is used for the second type.
The most common type of annotation is text. Let’s say you have a spike in a line plot. It makes sense to highlight it and explain in more detail what it is about.
Q1.1 Build a line plot showing the bitcoin price evolution between 2013 and 2018. The dataset is located here and can be read directly with read.table(). What part of the chart would you highlight?
# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/3_TwoNumOrdered.csv", header=TRUE)
data$date <- as.Date(data$date)
# plot
data %>%
  ggplot( aes(x=date, y=value)) +
    geom_line(color="#69b3a2")
Q1.2 Use the annotate() function to add text. annotate() requires several arguments:
- geom: type of annotation, use text
- x: position on the X axis
- y: position on the Y axis
- label: what you want to write
- Optional: color, size, angle and more.
Q1.3 Find the exact spike date and its value. Use this information to add a circle around the spike. This is done with the annotate() function once more:
- geom: use point
- x: position on the X axis
- y: position on the Y axis
- shape: use 21, to be able to change the fill and the color arguments (fill=inside, color=stroke)
- size
# Find spike date and value:
# data %>% arrange(desc(value)) %>% head(1)
# plot
data %>%
  ggplot( aes(x=date, y=value)) +
    geom_line(color="#69b3a2") +
    ylim(0,22000) +
    annotate(geom="text", x=as.Date("2017-01-01"), y=20089,
             label="Bitcoin price reached 20k $\nat the end of 2017") +
    annotate(geom="point", x=as.Date("2017-12-17"), y=20089, size=10, shape=21, fill="transparent")
Q1.4 Add a horizontal line to show which part of the curve is over 5000 $. This is possible thanks to the geom_hline() function, which requires a yintercept argument.
# Find spike date and value:
# data %>% arrange(desc(value)) %>% head(1)
# plot
data %>%
  ggplot( aes(x=date, y=value)) +
    geom_line(color="#69b3a2") +
    ylim(0,22000) +
    annotate(geom="text", x=as.Date("2017-01-01"), y=20089,
             label="Bitcoin price reached 20k $\nat the end of 2017") +
    annotate(geom="point", x=as.Date("2017-12-17"), y=20089, size=10, shape=21, fill="transparent") +
    # linewidth replaces size for lines in ggplot2 >= 3.4.0 (use size=.5 on older versions)
    geom_hline(yintercept=5000, color="orange", linewidth=.5)
Q1.5 Build a scatterplot based on the gapminder dataset. Use gdpPercap for the X axis, lifeExp for the Y axis, and pop for bubble size. Keep only the year 2007.
# Data are available in the gapminder package
library(gapminder)
# year is stored as a number, so compare it to 2007 rather than "2007"
data <- gapminder %>% filter(year==2007) %>% select(-year)
# Basic scatterplot
ggplot( data, aes(x=gdpPercap, y=lifeExp, size = pop, color = continent)) +
  geom_point(alpha=0.7)
Q1.6 Highlight South Africa in the chart: draw it in red, with all other circles in grey. Follow these steps:
- Create a new column with mutate: this new column has the value yes if country=="South Africa", no otherwise. This is possible thanks to the ifelse function.
- Map the color aesthetic to this new column.
- Use scale_color_manual() to control the color of both groups. Use a bright color for the country to highlight, and grey for the others.
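The steps of Q1.6 can be sketched as follows. This is one possible solution rather than the reference answer: the column name highlight and the object names are illustrative choices, and the gapminder package is assumed to be installed.

```r
# A possible solution sketch; `highlight`, `data_2007` and `p_highlight`
# are hypothetical names chosen for this example
library(ggplot2)
library(dplyr)
library(gapminder)

# Keep 2007 only and flag the country to highlight
data_2007 <- gapminder %>%
  filter(year == 2007) %>%
  mutate(highlight = ifelse(country == "South Africa", "yes", "no"))

# Map color to the flag, then set both colors manually
p_highlight <- ggplot(data_2007, aes(x = gdpPercap, y = lifeExp, size = pop, color = highlight)) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c("yes" = "red", "no" = "grey")) +
  theme(legend.position = "none")

p_highlight
```

Naming the values in scale_color_manual() ties each color to a level explicitly, so the result does not depend on the order of the levels.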
Q1.7 Highlight every country with gdpPercap > 5000 & lifeExp < 60 in red. Write their names using the geom_text_repel() function of the ggrepel package to avoid overlapping text.
# ggrepel
library(ggrepel)
# prepare data
tmp <- data %>%
  mutate( annotation = ifelse(gdpPercap > 5000 & lifeExp < 60, "yes", "no"))
# plot
tmp %>%
  ggplot( aes(x=gdpPercap, y=lifeExp, size = pop, color = continent)) +
    geom_point(alpha=0.7) +
    theme(legend.position="none") +
    geom_text_repel(data=tmp %>% filter(annotation=="yes"), aes(label=country), size=4 )
Faceting is a very powerful data visualization technique. It splits the figure into small subsets, usually one per level of a categorical variable. ggplot2 offers 2 functions to build small multiples: facet_wrap() and facet_grid().
Q3.1 Build a spaghetti chart showing the evolution of 9 baby names in the US. (See code here). What’s wrong with this chart?
# Libraries
library(babynames)
# Load dataset from the babynames package
data <- babynames %>%
  filter(name %in% c("Ashley", "Amanda", "Jessica", "Patricia", "Linda", "Deborah", "Dorothy", "Betty", "Helen")) %>%
  filter(sex=="F")
# line plot = spaghetti chart
data %>%
  ggplot( aes(x=year, y=n, group=name, color=name)) +
    geom_line() +
    ggtitle("Popularity of American names in the previous 30 years")
Q3.2 Use the facet_wrap() function to build one area chart for each name. Basically, you have to provide a categorical variable to the function. It will build a chart for each of its levels.
Have a look at the Y axis. What do you observe? Is it a good option?
You should get something like this:
data %>%
  ggplot( aes(x=year, y=n, group=name, fill=name)) +
    geom_area() +
    ggtitle("Popularity of American names in the previous 30 years") +
    theme(
      legend.position="none"
    ) +
    facet_wrap(~name)
Q3.3 Find out how to use the scales option of facet_wrap() to have different Y axis limits for each subset. Does it make sense? Under which conditions?
You should get something like this:
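A minimal sketch of free Y axes, assuming the babynames package is installed and reusing the nine names from Q3.1. The object names name_data and p_free are illustrative only.

```r
# Sketch only: `name_data` and `p_free` are hypothetical names
library(ggplot2)
library(dplyr)
library(babynames)

name_data <- babynames %>%
  filter(name %in% c("Ashley", "Amanda", "Jessica", "Patricia", "Linda",
                     "Deborah", "Dorothy", "Betty", "Helen")) %>%
  filter(sex == "F")

# scales = "free_y" lets each panel choose its own Y axis limits
p_free <- ggplot(name_data, aes(x = year, y = n, group = name, fill = name)) +
  geom_area() +
  theme(legend.position = "none") +
  facet_wrap(~name, scales = "free_y")

p_free
```

With free scales, the shape of each curve becomes easier to read, but heights can no longer be compared from one panel to the next.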
Bonus Find out what the facet_grid() function does. How is it different from facet_wrap()?
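As a hedged illustration of the difference, the sketch below uses the mpg dataset that ships with ggplot2 (not a dataset from this practical). The object names are illustrative.

```r
# Sketch on ggplot2's built-in mpg dataset; object names are hypothetical
library(ggplot2)

# facet_wrap(): one variable, panels wrapped into a rectangular layout
p_wrap <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~class)

# facet_grid(): rows defined by one variable, columns by another
p_grid <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

p_wrap
p_grid
```

In the grid, a combination of row and column values that has no data still gets an (empty) panel, whereas facet_wrap() only draws panels for values that exist.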
Bonus Load this dataset in R. Build a histogram for every combination of day and sex using facet_wrap()
You should get something like:
Q4.1 - Save the previous chart as a PNG file using the ggsave() function. Where is the file saved?
# save the plot in an object called p
# (data is the dataset loaded in the previous bonus question)
p <- ggplot(data, aes(x=total_bill)) +
  geom_histogram() +
  facet_grid(sex~day)
# Save the plot
ggsave(filename = "chartFromRPractical.png", plot = p)
Q4.2 - Specify the complete path before file name to save the chart at a specific location.
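A minimal sketch of saving to an explicit location. Here tempdir() stands in for a folder of your choice, and the plot is built on ggplot2's built-in mpg dataset so the example runs on its own; both are assumptions made for illustration.

```r
# Sketch only: the plot and the output folder are stand-ins
library(ggplot2)

p_save <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 30)

# With a bare file name, ggsave() writes into the current working directory
getwd()

# Build a complete path; replace tempdir() with the folder you want
out_file <- file.path(tempdir(), "chartFromRPractical.png")
ggsave(filename = out_file, plot = p_save, width = 6, height = 4, dpi = 300)

# Check that the file was written
file.exists(out_file)
```

file.path() builds the path with the right separator for the operating system. ggsave() also has a path argument if you prefer to pass the folder and the file name separately.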
Picking the right colors is a crucial step for a good dataviz. R offers awesome options and packages to make the right choices. Here is an overview of the main options.
Q5.1 Several options exist to pick one color. Change the histogram color using the fill argument on the chart below, using each of the following options:
- Color names. Run colors() to see all the options.
- rgb(). This function takes the quantity of red, green and blue to build the color, plus an argument for the opacity. For example, try rgb(.7, .6, .3, .2)
- HTML colors. Use this website to pick one you like.
# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/1_OneNum.csv", header=TRUE)
# Make the histogram
data %>%
  filter( price<300 ) %>%
  ggplot( aes(x=price)) +
    geom_histogram(fill="steelblue") +
    ggtitle("Night price distribution of Airbnb apartments") +
    theme_ipsum()
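The three options of Q5.1 can be compared side by side. The sketch below uses R's built-in faithful dataset as a stand-in for the Airbnb data, and the variable names are illustrative.

```r
# Sketch on the built-in faithful dataset; variable names are hypothetical
library(ggplot2)

# Three ways to describe a single color
by_name <- "steelblue"               # one of the names listed by colors()
by_rgb  <- rgb(0.7, 0.6, 0.3, 0.2)   # red, green, blue, opacity, each in [0, 1]
by_hex  <- "#69b3a2"                 # HTML / hexadecimal code

# rgb() returns a hexadecimal string, so it can be used like any other color
by_rgb

# Any of the three can be passed to fill
p_fill <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(fill = by_rgb, color = "white", bins = 30)

p_fill
```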
Q5.2 Build a scatterplot based on the iris dataset. Use Sepal.Length for the X axis, Petal.Length for the Y axis. Use color=Species to color groups.
Q5.3 It is possible to set the color scale manually using scale_color_manual(). Use the hint below to see how to use it and apply it to the previous scatterplot.
Note: it is a bad practice to pick colors randomly. Your palette will be ugly and will probably not be colorblind friendly.
ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, color=Species)) +
  geom_point() +
  scale_color_manual( values=c("red","green","blue"))
Q5.4 Fortunately, people have already tackled this issue for us and created packages offering nice color palettes. The most famous one is RColorBrewer. Its palettes are already available in ggplot2. See all of them here, and use one on your chart using scale_color_brewer().
Pick the one you like the most and apply it to the previous scatterplot. Use it to color the Species.
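A minimal sketch for Q5.4 on the iris scatterplot. The palette name "Set2" is only one example of a qualitative RColorBrewer palette, chosen here for illustration.

```r
# Sketch only: "Set2" is an arbitrary choice among the qualitative palettes
library(ggplot2)

p_brewer <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  scale_color_brewer(palette = "Set2")

p_brewer
```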
Q5.5 RColorBrewer also offers continuous color palettes. However, they must be called through the scale_color_distiller() function. Use the palette you like the most to color circles depending on Sepal.Length.
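A minimal sketch for Q5.5, where the color is mapped to a numeric variable. The palette name "Spectral" is an arbitrary example.

```r
# Sketch only: "Spectral" is an arbitrary palette choice
library(ggplot2)

# color is mapped to a continuous variable, so a continuous scale is needed
p_distiller <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Sepal.Length)) +
  geom_point() +
  scale_color_distiller(palette = "Spectral")

p_distiller
```

scale_color_distiller() interpolates between the colors of a Brewer palette, which is why it works with numeric values while scale_color_brewer() is meant for categories.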
An interactive chart is a chart on which you can zoom, hover over shapes to get tooltips, click to trigger actions and more. Building interactive charts requires javascript under the hood, but it is relatively easy to do using R packages that wrap the javascript for you. These packages are called HTML widgets.
Q6.1 Build the gapminder bubble plot you’ve already done in the annotation part of this practical. Store it in an object called p
# load data
library(gapminder)
# year is stored as a number, so compare it to 2007 rather than "2007"
data <- gapminder %>% filter(year==2007) %>% select(-year)
# Basic ggplot
p <- data %>%
  ggplot( aes(x=gdpPercap, y=lifeExp, size = pop, color = continent)) +
    geom_point(alpha=0.7)
p
Q6.2 Install and load the plotly package. Build an interactive chart using the ggplotly() function. What are the new functionalities of this chart? Is it useful? What could be better?
Q6.3 Let’s improve the tooltip of the chart:
- Add a new column to the dataset, called myText. Fill it with whatever you want to show in the tooltip.
- Add a new aesthetic to the chart: text=myText
- In the ggplotly() call, add tooltip="text"
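The steps of Q6.2 and Q6.3 can be sketched together as follows. This is one possible solution, assuming the plotly and gapminder packages are installed; the tooltip content and the object names data_tip, p_tip and widget are illustrative choices.

```r
# Sketch only: `data_tip`, `p_tip` and `widget` are hypothetical names
library(ggplot2)
library(dplyr)
library(gapminder)
library(plotly)

# Build the tooltip text in a new column called myText
data_tip <- gapminder %>%
  filter(year == 2007) %>%
  mutate(myText = paste0("Country: ", country,
                         "\nLife expectancy: ", round(lifeExp, 1),
                         "\nGDP per capita: ", round(gdpPercap)))

# Pass the column through the text aesthetic
p_tip <- ggplot(data_tip, aes(x = gdpPercap, y = lifeExp, size = pop,
                              color = continent, text = myText)) +
  geom_point(alpha = 0.7)

# Convert to an interactive chart and show only the custom text on hover
widget <- ggplotly(p_tip, tooltip = "text")
widget
```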
Q6.4 Use the HTML widget called leaflet to build an interactive map showing the earthquakes described in the dataset called quakes. Code is fully provided here, since cartography with R could deserve an entire practical. The idea is just to discover the potential offered in a few lines of code:
# Library
library(leaflet)
# load example data (Fiji Earthquakes) + keep only the first 100 rows
data(quakes)
quakes = head(quakes, 100)
# Create a color palette with handmade bins.
mybins = seq(4, 6.5, by=0.5)
mypalette = colorBin( palette="YlOrBr", domain=quakes$mag, na.color="transparent", bins=mybins)
# Final Map
leaflet(quakes) %>%
  addTiles() %>%
  setView( lat=-27, lng=170 , zoom=4) %>%
  addProviderTiles("Esri.WorldImagery") %>%
  addCircleMarkers(~long, ~lat,
    fillColor = ~mypalette(mag), fillOpacity = 0.7, color="white", radius=8, stroke=FALSE
  ) %>%
  addLegend( pal=mypalette, values=~mag, opacity=0.9, title = "Magnitude", position = "bottomright" )
The d3heatmap package allows you to build interactive heatmaps in a few lines of code. Let’s see how it works.
Q6.5 Load this dataset in R. Have a look at the first rows. Describe it. (source)
# Load data
data <- read.csv("http://datasets.flowingdata.com/ppg2008.csv", row.names = 1)
# head(data)
# summary(data)
Q6.6 R offers a heatmap() function to build… heatmaps! Apply it to the dataset. What do you observe? Are you happy with this heatmap? What’s wrong with it? How can we solve the issue?
Note: the input dataset must be in matrix format to be accepted by the function. Use as.matrix() to get this format.
Q6.7 Check the scale option of the heatmap() function. What is it for? Can it help us? How? Use it to improve the heatmap.
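A hedged sketch of what the scale option changes. It uses R's built-in mtcars dataset as a stand-in for the practical's dataset, so the example runs without a download; the object names are illustrative.

```r
# Sketch only: mtcars (built into R) stands in for the practical's dataset
mat <- as.matrix(mtcars)

# No scaling: columns with large values (disp, hp) dominate the color range
heatmap(mat, scale = "none")

# Scale each column so that variables in different units become comparable
# (the default of heatmap() is scale = "row")
hm <- heatmap(mat, scale = "column")

# heatmap() invisibly returns the row and column order used in the plot
names(hm)
```

Column scaling fits a table where each column is a different variable, which is the layout of mtcars. Whether rows or columns should be scaled depends on how the dataset is organized.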
Q6.8 d3heatmap() uses exactly the same syntax as heatmap(). Use the function to get an interactive version of the previous heatmap!
Q6.9 - Use the HTML widget called dygraphs to build an interactive line plot of the bitcoin price evolution. Try to reproduce the example below.
# Library
library(dygraphs)
library(xts) # To make the conversion data-frame / xts format
# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/3_TwoNumOrdered.csv", header=TRUE)
data$date <- as.Date(data$date)
# Then you can create the xts format, and thus use dygraph
don <- xts(x = data$value, order.by = data$date)
# Use the dygraph HTML widget
dygraph(don) %>%
  dyOptions(labelsUTC = TRUE, fillGraph=TRUE, fillAlpha=0.1, drawGrid = FALSE, colors="#D8AE5A") %>%
  dyRangeSelector() %>%
  dyCrosshair(direction = "vertical") %>%
  dyHighlight(highlightCircleSize = 5, highlightSeriesBackgroundAlpha = 0.2, hideOnMouseOut = FALSE) %>%
  dyRoller(rollPeriod = 1)
BONUS - The packages showcased above are just a sample of the possibilities offered by the HTML widgets. Visit this website to get an overview of the kinds of interactive charts you can build with R. Pick your favorite example and try to reproduce it.
Scales control the details of how data values are translated to visual properties. Many different scales are offered by ggplot2. The most widely used one is probably the log scale.
Q7.1 Build a histogram showing the night price distribution of the French Riviera apartments (data here). Keep all the data, including extreme values.
# Libraries
library(ggplot2)
library(dplyr) # provides %>%
# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/1_OneNum.csv", header=TRUE)
# Make the histogram
data %>%
  ggplot( aes(x=price)) +
    geom_histogram(color="white", fill="steelblue4") +
    ggtitle("Night price distribution of Airbnb apartments") +
    xlab("Night price") +
    ylab("Number of apartments")
Q7.2 A common practice to avoid the effect of extreme values is to filter data, or use xlim to zoom on a part of the axis. Another approach is to use scale_x_log10() to apply a log transformation. Apply this function to the histogram.
# Libraries
library(ggplot2)
library(dplyr) # provides %>%
# Load dataset from github
data <- read.table("https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/1_OneNum.csv", header=TRUE)
# Make the histogram
data %>%
  ggplot( aes(x=price)) +
    geom_histogram(color="white", fill="steelblue4") +
    ggtitle("Night price distribution of Airbnb apartments") +
    xlab("Night price") +
    ylab("Number of apartments") +
    scale_x_log10()
Q7.3 What’s the difference between scale_x_log10() and applying the log() function on the dataset before doing the chart? Why is it better?
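One way to explore the question is to build both versions and compare their X axes. The sketch below uses synthetic, made-up prices (a log-normal sample) rather than the Airbnb data, and log10() instead of the natural log so that both charts use the same base. All names are illustrative.

```r
# Sketch only: the prices below are simulated, not real data
library(ggplot2)

set.seed(1)
prices <- data.frame(price = rlnorm(1000, meanlog = 4.5, sdlog = 0.8))

# Option 1: transform the axis; tick labels stay in the original unit
p_scale <- ggplot(prices, aes(x = price)) +
  geom_histogram(bins = 30) +
  scale_x_log10()

# Option 2: transform the data; tick labels now show log10(price)
p_log <- ggplot(prices, aes(x = log10(price))) +
  geom_histogram(bins = 30)

p_scale
p_log
```

The two histograms have the same shape. Only the axis labels differ: with scale_x_log10() a reader sees prices, while with the transformed column a reader sees logarithms and has to convert them back mentally.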
A practical by Yan Holtz