Introduction to ggplot2 – Create publication-ready graphs with R

SAFE Research Data Center

Ina Krapp - April 2023

#install.packages("tidyverse")
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.4.0      v purrr   0.3.5 
## v tibble  3.1.8      v dplyr   1.0.10
## v tidyr   1.2.1      v stringr 1.5.0 
## v readr   2.1.3      v forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
#install.packages("palmerpenguins")
library(palmerpenguins)
## Warning: package 'palmerpenguins' was built under R version 4.1.3
# install.packages("ggthemes")
library(ggthemes)
penguins = penguins

Why Programming in R?

Two main ideas:

  1. Make the plot complete - no manual post-processing needed.

  2. Make the plot reproducible - for yourself and others.

What is ggplot2?

gg stands for ‘grammar of graphics’

Grammar of graphics: A coherent system for describing and building graphs.

Take a first look at the dataset

This is a tibble - a special kind of dataframe. We remove all rows which contain missing data (this happens with drop_na) and select only some columns so we get a small dataset.

penguins = penguins %>% drop_na()
penguins = penguins %>% select(year, species, island, body_mass_g, sex)
penguins
## # A tibble: 333 x 5
##     year species island    body_mass_g sex   
##    <int> <fct>   <fct>           <int> <fct> 
##  1  2007 Adelie  Torgersen        3750 male  
##  2  2007 Adelie  Torgersen        3800 female
##  3  2007 Adelie  Torgersen        3250 female
##  4  2007 Adelie  Torgersen        3450 female
##  5  2007 Adelie  Torgersen        3650 male  
##  6  2007 Adelie  Torgersen        3625 female
##  7  2007 Adelie  Torgersen        4675 male  
##  8  2007 Adelie  Torgersen        3200 female
##  9  2007 Adelie  Torgersen        3800 male  
## 10  2007 Adelie  Torgersen        4400 male  
## # ... with 323 more rows

The first graph: Which bird lives where?

Which birds live on which island? This is not the prettiest graph, but a good start: It shows what we want to know. Only Adelie penguins live on all three islands, Gentoo penguins are only found on Biscoe island and Chinstrap penguins only on Dream island.

Each ggplot graph follows the same basic structure: There is a ggplot object and a geom object. The ggplot object states that we want to create a graph and lets us specify which data will be used. The geom object states which type of graph we want. Here, we use points.

ggplot(data = penguins) + 
geom_point(mapping = aes(x = island, y = species))
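The pattern in the plot can be cross-checked with a quick count. This is only a minimal sketch: it assumes the palmerpenguins data with incomplete rows dropped as above, uses count() from dplyr, and the name island_counts is chosen here purely for illustration.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Count how many penguins of each species were observed on each island.
# Combinations that never occur do not appear in the result.
island_counts <- penguins %>%
  drop_na() %>%
  count(island, species)

island_counts
```

Each row of the result corresponds to one point in the graph above.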

It is worth noticing that a ggplot object without a geom object really looks like nothing. You can imagine it as the empty canvas to draw on.

ggplot(data = penguins)

ggplot often allows several ways to create the same graph. For example, you can leave the ggplot object empty and supply the x and y variable only to geom_point. But then, you need to be explicit about which data frame they are from.

ggplot() + 
geom_point(mapping = aes(x = penguins$island, y = penguins$species))

As in other programming languages, argument names stand on the left side of the equals sign. They do not need to be written out: if they are omitted, the values are matched to the arguments in the order in which the function defines them. This can get messy if you’re not familiar with that order, so we stick to naming them.

ggplot(data = penguins) + 
geom_point(mapping = aes(x = island,y =  species))
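For comparison, here is a sketch of the same plot with all argument names left out. It relies on the documented argument order (data first in ggplot(), mapping first in geom_point(), x then y in aes()); the object name p_positional is only illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Same plot without argument names: values are matched by position.
# ggplot() takes the data first, geom_point() takes the mapping first,
# and aes() takes x and then y.
p_positional <- ggplot(penguins) +
  geom_point(aes(island, species))

p_positional
```

This is shorter to type, but harder to read for anyone who does not know the argument order by heart.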

A ggplot object can be easily modified. For example, the x and y values can be switched. Other variables can be entered instead of island and species. But the aes object has to be kept where it is!

ggplot(data = penguins) + 
geom_point(mapping = aes(y = species, x = body_mass_g ))

Source of common mistakes 1: Forgetting or misplacing aes()

The aes() function (aes for ‘aesthetic mapping’) is a crucial part of any plot written with ggplot2. It describes how data is mapped to properties of a geom object. Like data, it can be geom-specific or apply to the whole graph. If it is geom-specific, it goes inside the geom object (as above). If it applies to the whole graph, it goes inside the ggplot object. Code like x = species or y = body_mass_g needs to be written inside the aes() function. Outside of it, such code creates an error, because R then looks for objects named species and body_mass_g in the workspace instead of columns in the data. You can look at the error by removing the # sign at the start of the two lines below and then running the code.

#ggplot(data = penguins) + 
#geom_point(y = species, x =body_mass_g)
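A working counterpart, with the mapping placed in the ggplot object so that it applies to the whole graph, could look like the sketch below. It assumes the palmerpenguins data with incomplete rows dropped; the names penguins_complete and p_global are only illustrative.

```r
library(ggplot2)
library(tidyr)
library(palmerpenguins)

# Drop incomplete rows so that no points are silently removed.
penguins_complete <- drop_na(penguins)

# The mapping is given once in ggplot() and is inherited by every geom
# that is added afterwards, so geom_point() needs no aes() of its own.
p_global <- ggplot(data = penguins_complete,
                   mapping = aes(x = body_mass_g, y = species)) +
  geom_point()

p_global
```

This placement is convenient when several geoms share the same x and y variables.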

So far, the graph does its job to visualize two variables. But what if we have more than that? We need further aesthetics. color and size are widely used, and can be used as in the graph below. shape can be used to assign different shapes (circle, triangle, square) to a variable. alpha assigns different degrees of transparency (from solid to transparent).

ggplot(data = penguins) + 
geom_point(mapping = aes(x = body_mass_g, y = species,  color = sex , size= year ))

Note that some variable types fit some aesthetics better than others. Because alpha is ordered (from more to less transparent), it is common to assign an ordered variable to it, likewise for the size aesthetic. If a ggplot argument violates these conventions, R gives a warning. But it still creates the plot - the decision which aesthetic to use is usually up to you. An exception is the shape aesthetic, for which you cannot use continuous values. You can also use more than one aesthetic for the same variable (below, island maps to both size and shape). But that is usually redundant. Also, using too many aesthetics can make a graph look overcrowded (as demonstrated by the graph below).

ggplot(data = penguins) + 
geom_point(mapping = aes(x = body_mass_g, y = species, size = island, alpha = sex, shape = island ))
## Warning: Using size for a discrete variable is not advised.
## Warning: Using alpha for a discrete variable is not advised.

Source of common mistakes 2: Trying to modify graphics with inputs into the aes()-function

Sometimes, people simply want their points to have a specific color. Say you want your points to be blue. Some people try to create a graph with blue points by writing code like the one below, and tend to be very confused by the outcome.

What happened here? For ggplot, ‘blue’ is not a color when entered like that. Remember ‘color = sex’ maps the input variable (here: sex) to the color aesthetic. It does not impact the aesthetic itself. The code below reads as “all data points have a value, which is ‘blue’, assign this to a color”. R then assigns this ‘blue’ to the first default color, which is not blue, but a reddish salmon tone. We’ll see later how to change the default color. For now, keep in mind that the specifications within the aes() function assign data to x, y, colors, shapes and so on. They do not modify these colors or shapes or the way in which they are assigned.

ggplot(data = penguins) + 
geom_point(mapping = aes(x = body_mass_g, y = species, color = 'blue' ))

This can be corrected by moving the color value out of the aesthetic (but it must stay within the geom object). Now, the dots are blue. Note that the legend is missing now - it is only created when a variable is assigned to an aesthetic like color or shape within the aes()-function.

ggplot(data = penguins) + 
geom_point(mapping = aes(y = species, x = body_mass_g), color = 'blue')

Making a pretty graph, part 1 - changing default values within a specific geom.

There are many ways to modify the appearance of a graph created in ggplot. Modifications which affect a specific geom object are put into the geom object, but after the aes() function. For example, here the points are made transparent, so that it is easier to see how strongly they overlap. They are also all set to the same size (size 6).

ggplot(data = penguins) + 
geom_point(mapping = aes(y = species, x = body_mass_g, color = sex ), size = 6, alpha = 0.2)

It is also possible to change the default color palette by which different values of a variable are depicted when the color aesthetic is used. It requires defining a manual color palette. This can be done by inserting color names like ‘blue’, or using hexadecimal code.

Then, simply add ‘+ scale_colour_manual(values=Palette)’ to the ggplot code. This additivity is one of the core strengths of ggplot2. One can build a plot piece by piece.

# Define a color palette
Palette = c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

ggplot(data = penguins) + 
geom_point(mapping = aes(y = species, x = body_mass_g, color = sex), size = 6, alpha = 0.2) +
scale_colour_manual(values = Palette)
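As a possible variation, the palette can be given as a named vector, which ties each color to one specific value of the variable. The sketch below assumes the two levels of sex in the palmerpenguins data (female and male); the names penguins_complete, sex_colors and p_named are only illustrative.

```r
library(ggplot2)
library(tidyr)
library(palmerpenguins)

# Drop incomplete rows so that sex has no missing values.
penguins_complete <- drop_na(penguins)

# Naming the entries ties each color to one level of the variable,
# so the assignment does not depend on the order of the levels.
sex_colors <- c(female = "#E69F00", male = "#56B4E9")

p_named <- ggplot(data = penguins_complete) +
  geom_point(mapping = aes(x = body_mass_g, y = species, color = sex),
             size = 6, alpha = 0.2) +
  scale_colour_manual(values = sex_colors)

p_named
```

With an unnamed vector, the colors are instead assigned in the order of the factor levels.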

Different geoms:

There are a lot of different geom objects available in ggplot and so far, we only looked at geom_point. While you can do a lot with it already, there are many others worth looking into. If you want to learn how to use a specific geom, open the help pane (on the lower right of RStudio) and type its name into the search field there.

A very simple geom is the histogram. It is a simple plot that shows the distribution of a variable. It therefore only requires a single input variable. This plot shows how the body weight of the penguins is distributed. Geoms have individual aesthetic parameters. With its default shape, geom_point uses only a color aesthetic, which is the color in which the points are shown. But many other geoms, like the histogram, have a fill aesthetic for the color in which their elements are filled and a color aesthetic for the outline of the bars.

ggplot(data = penguins) + geom_histogram(mapping = aes(x = body_mass_g), fill = "purple", color = "black")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
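The message above suggests choosing the bin size explicitly. A minimal sketch of how this could look is given below; the bin width of 250 grams is an arbitrary choice for illustration, not a recommendation from the data, and the names penguins_complete and p_hist are only illustrative.

```r
library(ggplot2)
library(tidyr)
library(palmerpenguins)

# Drop incomplete rows so that no observations are silently removed.
penguins_complete <- drop_na(penguins)

# binwidth sets the width of each bar in units of the x variable (grams).
# Setting it explicitly also silences the stat_bin() message.
p_hist <- ggplot(data = penguins_complete) +
  geom_histogram(mapping = aes(x = body_mass_g),
                 binwidth = 250, fill = "purple", color = "black")

p_hist
```

Trying a few different widths is usually worthwhile, because the shape of a histogram can change noticeably with the bin size.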

Another geom is geom_line. It is often used to plot how variables change over time. As the name suggests, it draws lines instead of points.

For example, I can plot how the average body mass of the penguins changed from 2007 to 2009. But first, I have to calculate it. ‘dat’ is a small dataset containing the mean weight for each possible combination of species, sex and year.

penguins$year = as.factor(penguins$year)
dat = penguins %>%  group_by(year, sex, species) %>%
  summarise(mean_weight = mean(body_mass_g))
## `summarise()` has grouped output by 'year', 'sex'. You can override using the
## `.groups` argument.
dat
## # A tibble: 18 x 4
## # Groups:   year, sex [6]
##    year  sex    species   mean_weight
##    <fct> <fct>  <fct>           <dbl>
##  1 2007  female Adelie          3390.
##  2 2007  female Chinstrap       3569.
##  3 2007  female Gentoo          4619.
##  4 2007  male   Adelie          4039.
##  5 2007  male   Chinstrap       3819.
##  6 2007  male   Gentoo          5553.
##  7 2008  female Adelie          3386 
##  8 2008  female Chinstrap       3472.
##  9 2008  female Gentoo          4627.
## 10 2008  male   Adelie          4098 
## 11 2008  male   Chinstrap       4128.
## 12 2008  male   Gentoo          5411.
## 13 2009  female Adelie          3335.
## 14 2009  female Chinstrap       3523.
## 15 2009  female Gentoo          4786.
## 16 2009  male   Adelie          3995.
## 17 2009  male   Chinstrap       3927.
## 18 2009  male   Gentoo          5511.
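The message printed by summarise() points to its .groups argument. As a sketch, the same table can be computed without leftover grouping by setting .groups = "drop"; the name dat_ungrouped is only illustrative, and the data preparation repeats the steps from above so that the example runs on its own.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Same summary as above, but the result is returned without any grouping,
# so summarise() prints no message and later steps are not affected by
# leftover groups.
dat_ungrouped <- penguins %>%
  drop_na() %>%
  mutate(year = as.factor(year)) %>%
  group_by(year, sex, species) %>%
  summarise(mean_weight = mean(body_mass_g), .groups = "drop")

dat_ungrouped
```

The values are identical to those in dat; only the grouping information attached to the tibble differs.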

Next, to get the plot, the values are mapped to the aesthetics like before. Additionally, a group variable is specified: It determines which points will be connected (here: All points belonging to the same sex and species). If it was not specified, ggplot would form the groups from all discrete variables in the plot, including year. Each group would then contain only a single point, and no lines would be drawn.

ggplot(data = dat) + 
geom_line(mapping = aes(x = year,y = mean_weight, color = sex, linetype = species, group = interaction(sex, species)), linewidth = 1.2)

The geom_line object is one which can often give unexpected results. It automatically splits the data into groups based on all non-numerical variables, and aims to connect all observations within a group through a single line. For example, if I wanted to see how the average weight of the penguins changed over the years, but did not calculate the average weight as above, I would receive the following plot:

ggplot(data = penguins) + 
geom_line(mapping = aes(x = year,y = body_mass_g, color = sex, linetype = species, group = interaction(sex, species)), linewidth = 1.2 )

It can therefore be necessary to modify the data before plotting it. There are many other geoms available, some of which you most likely have heard of, but also some rare ones. For anyone interested in seeing more of what graph types exist in ggplot, here’s a gallery: https://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html

Making a pretty graph, part 2 - changing the graph’s appearance

Having seen how the color, size and other attributes of the depicted data itself can be changed, we now look at the whole graph. The package ‘ggthemes’ offers a very simple way to modify the appearance of a graph. It provides styles that imitate graphs from, for example, Excel, Stata or the Economist, just by adding theme_excel(), theme_stata() or theme_economist() to the existing graph. Custom themes can also be designed; they let you manually define the position of the legend and of the axis labels, the background color, the grid and many other details. For example, the theme element added in the last row of this code sets the text size. Like before, it can simply be added to an existing plot.

ggplot(data = dat) + 
geom_line(mapping = aes(x = year,y = mean_weight, color = sex, linetype = species, group = interaction(sex, species)), linewidth=1.2) +theme_stata() + 
theme(text=element_text(size=10))

To make the graph publication-ready, a title and some other information need to be added to the graph. Also, the labels can be made to look nicer.

ggplot(data = dat) + 
geom_line(mapping = aes(x = year, y = mean_weight, color = sex, linetype = species, group = interaction(sex, species)), linewidth=1.2) + theme_stata() + 
theme(text=element_text(size=11)) +
labs(title="Plot of penguin weight by species",
x ="Year", y = "Average body weight (in grams)", color='Sex', linetype = "Species", 
caption = "Data Source: \n Horst, Allison Marie; Hill, Alison Presmanes; Gorman, Kristen B. (2020): palmerpenguins: \n Palmer Archipelago (Antarctica) penguin data. Available online at https://allisonhorst.github.io/palmerpenguins/.
 ") +
scale_colour_manual(values=Palette)

Finally, save and export the plot from R. For example, to receive the picture in png format:

# Width and height modify the size of the output, in pixels.
png(filename="penguin_plot.png", width=1200, height=800)

ggplot(data = dat) + 
geom_line(mapping = aes( x =year,y = mean_weight, color = sex, linetype = species, group = interaction(sex, species)), linewidth=1.2) +
theme_stata() + 
theme(text=element_text(size=20)) +
labs(title="Plot of penguin weight by species",
x ="Year", y = "Average body weight (in grams)", color='Sex', linetype = "Species", 
caption = "Data Source: \n Horst, Allison Marie; Hill, Alison Presmanes; Gorman, Kristen B. (2020): palmerpenguins: \n Palmer Archipelago (Antarctica) penguin data. Available online at https://allisonhorst.github.io/palmerpenguins/.
 ") +
scale_colour_manual(values=Palette)

dev.off()
## png 
##   2
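ggplot2 also ships its own export function, ggsave(), which can be used instead of the png() and dev.off() pair. The sketch below is one possible way to use it, not the method used in this tutorial: the plot is a simplified version without theme and labels, the file is written to a temporary directory, and 12 x 8 inches at 100 dpi is chosen only because it yields the same 1200 x 800 pixels as above.

```r
library(ggplot2)
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Recreate the summary table so that the example runs on its own.
dat <- penguins %>%
  drop_na() %>%
  mutate(year = as.factor(year)) %>%
  group_by(year, sex, species) %>%
  summarise(mean_weight = mean(body_mass_g), .groups = "drop")

# Store the plot in an object so that it can be handed to ggsave().
penguin_plot <- ggplot(data = dat) +
  geom_line(mapping = aes(x = year, y = mean_weight, color = sex,
                          linetype = species,
                          group = interaction(sex, species)),
            linewidth = 1.2)

# Width and height are given in inches here; together with dpi they
# determine the size of the image in pixels.
out_file <- file.path(tempdir(), "penguin_plot_ggsave.png")
ggsave(filename = out_file, plot = penguin_plot,
       width = 12, height = 8, units = "in", dpi = 100)
```

The file type is derived from the file extension, so changing the ending to .pdf would produce a PDF instead.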

One thing to be careful with when exporting is that the size of the text within the plot remains the same. So, if a plot is exported in a larger format, it may be necessary to increase the text size as well to be able to read it well. On the other hand, if a plot is made smaller, the text needs to be made smaller as well, or else titles may overlap.

There’s much more to ggplot, so if you wish to do something I didn’t cover, it is most likely also possible using ggplot. For example, several graphs side by side can be created using faceting: https://ggplot2-book.org/getting-started.html#qplot-faceting
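As a small illustration of faceting, the sketch below draws one histogram of body mass per species, side by side, using facet_wrap(). The bin width of 250 grams and the names penguins_complete and p_facets are assumptions made only for this example.

```r
library(ggplot2)
library(tidyr)
library(palmerpenguins)

# Drop incomplete rows so that no observations are silently removed.
penguins_complete <- drop_na(penguins)

# facet_wrap() splits the data by species and draws the same plot
# once for each of them, arranged in panels.
p_facets <- ggplot(data = penguins_complete) +
  geom_histogram(mapping = aes(x = body_mass_g), binwidth = 250) +
  facet_wrap(~ species)

p_facets
```

Like themes and scales, the facet specification is simply added to an existing plot.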

Even animations of graphics created with ggplot are possible, with the gganimate package: https://gganimate.com/

Also, feel encouraged to look at any of these sources I used to create the tutorial:

R for Data Science is generally a very good introductory book on R. https://r4ds.had.co.nz/data-visualisation.html

The Cookbook for R presents ggplot and other R tools for data analysis as step-by-step recipes: http://www.cookbook-r.com/Graphs/

Fundamentals of Data Visualization covers data visualization in general, not only ggplot. https://clauswilke.com/dataviz/

Sources:

Wickham, Hadley; Grolemund, Garrett (2017): R for data science. Import, tidy, transform, visualize, and model data. First edition. Beijing, Boston, Farnham, Sebastopol, Tokyo: O’Reilly. Available online at https://r4ds.had.co.nz.

Chang, Winston (2013): R graphics cookbook. Sebastopol, CA: O’Reilly Media. Available online at https://www.lehmanns.de/media/26316927.

Wilke, Claus (2019): Fundamentals of data visualization. A primer on making informative and compelling figures. First edition. Beijing: O’Reilly. Available online at https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=5734202.

R and the tidyverse: R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.

Data Source: Horst, Allison Marie; Hill, Alison Presmanes; Gorman, Kristen B. (2020): palmerpenguins: Palmer Archipelago (Antarctica) penguin data. Available online at https://allisonhorst.github.io/palmerpenguins/.