Beautiful and informative data visualisation



Tutorial Aims:

1. Get familiar with the ggplot2 syntax

2. Practice making different plots with ggplot2

3. Learn to arrange graphs in a panel and to save files

Note: all the files you need to complete this tutorial can be downloaded from this repository. Click on Clone or download, download the repo as a zip file, then unzip it.

1. Good data visualisation and ggplot2 syntax

We’ve learned how to import our data into RStudio, format and manipulate them, write scripts and Markdown reports… and now it’s time we talk about communicating the results of our analyses - data visualisation! When it comes to data visualisation, the package ggplot2 by Hadley Wickham has won over many scientists’ hearts. In this tutorial, we will learn how to make beautiful and informative graphs and how to arrange them in a panel. Before we take on the ggplot2 syntax, let’s briefly cover what good graphs have in common.


ggplot2 is a great package to guide you through those steps. The gg in ggplot2 stands for grammar of graphics. Writing the code for your graph is like constructing a sentence made up of different parts that logically follow from one another. In a data visualisation context, the different elements of the code represent layers - first you make an empty plot, then you add a layer with your data points, then your measure of uncertainty, the axis labels and so on.

When using ggplot2, you usually start your code with ggplot(your_data, aes(x = independent_variable, y = dependent_variable)), then you add the type of plot you want to make using + geom_boxplot(), + geom_histogram(), etc. aes stands for aesthetics, hinting at the fact that with ggplot2 you can make aesthetically pleasing graphs. There are many ggplot2 functions to help you clearly communicate your results, and we will now go through some of them.
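As a minimal sketch of that layered structure, the example below builds a plot one step at a time. It uses R's built-in mtcars dataset purely as a stand-in (it is not part of this tutorial's data), and the object names are made up for illustration.

```r
library(ggplot2)

# Step 1: an empty plot - data and aesthetics only, no layers yet
empty_plot <- ggplot(mtcars, aes(x = wt, y = mpg))

# Step 2: add a layer with the data points
points_plot <- empty_plot + geom_point()

# Step 3: add axis labels and a simpler theme
(final_plot <- points_plot +
    xlab("Weight (1000 lbs)") +
    ylab("Miles per gallon") +
    theme_bw())
```

Each + adds one more piece to the plot, so you can run the steps one by one and watch the graph change.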

2. Making different plots with ggplot2

Open RStudio, select File/New File/R script and start writing your script with the help of this tutorial.

# Purpose of the script
# Your name, date and email

# Libraries - if you haven't installed them before, run the code install.packages("package_name")
library(tidyr)
library(dplyr)
library(ggplot2)
library(readr)
library(gridExtra)

We will use data from the Living Planet Index, which you have already downloaded from the repository (Click on Clone or Download/Download ZIP and then unzip the files). When you run the read.csv(file.choose()) code, a window will pop up, from where you can navigate to the folder where you saved the LPI csv file.

# Import data from the Living Planet Index - population trends of vertebrate species from 1970 to 2014
LPI <- read.csv(file.choose())

The data are in wide format - the different years are column names, when really they should be rows. We will reshape the data using the gather() function from the tidyr package.

# Reshape data into long form
# By adding 9:53, we select columns 9 to 53, the ones for the different years of monitoring
LPI2 <- gather(LPI, "year", "abundance", 9:53)
View(LPI2)

There is an ‘X’ in front of all the years. When we imported the data, R put an ‘X’ in front of the years because column names are not allowed to start with a number. Now that the years are rows, not columns, we need them to be proper numbers, so we will transform them using parse_number() from the readr package.

LPI2$year <- parse_number(LPI2$year)

# When manipulating data it's always good to check if the variables have stayed how we want them
# Use the str() function
str(LPI2)

# Abundance is a character variable, when it should be numeric, let's fix that
LPI2$abundance <- as.numeric(LPI2$abundance)
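To see the whole reshaping step on something small, here is a sketch with a made-up two-year dataset (the values and object names are invented for illustration). It follows the same steps as above: import, gather the year columns into rows, then turn the years into numbers.

```r
library(tidyr)
library(readr)

# Made-up wide data: read.csv() turns the column names 1970 and 1971 into X1970 and X1971
toy_wide <- read.csv(text = "species,1970,1971\nGriffon vulture,10,12")
names(toy_wide)

# Gather the year columns (columns 2 and 3) into rows
toy_long <- gather(toy_wide, "year", "abundance", 2:3)

# Strip the "X" and convert the years to numbers
toy_long$year <- parse_number(toy_long$year)
str(toy_long)
```

The one row of wide data becomes two rows of long data, one per year.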

This is a very large dataset, so for the first few graphs we will focus on how the population of one species has changed. Pick a species of your choice, and make sure you spell its name exactly as it is entered in the dataframe.

vulture <- filter(LPI2, Common.Name == "Griffon vulture / Eurasian griffon")
head(vulture)

# There are a lot of NAs in this dataframe, so we will get rid of the empty rows using na.omit()
vulture <- na.omit(vulture)

Histogram to visualise data distribution

We will do a quick comparison between base R graphics and ggplot2 - of course both can make good graphs when used well, but here at Coding Club, we like working with ggplot2.

# With base R graphics
base_hist <- hist(vulture$abundance)

# With ggplot2
(vulture_hist <- ggplot(vulture, aes(x=abundance))  +
  geom_histogram()) # putting your entire ggplot code in () creates the graph and shows it in the plot viewer

The ggplot one is a bit prettier, but the default ggplot settings are not ideal: there is lots of unnecessary grey space behind the histogram, the axis labels are quite small, and the bars blend into each other. Let’s beautify the histogram a bit. This is where the true power of ggplot2 shines!

(vulture_hist <- ggplot(vulture, aes(x=abundance)) +
  geom_histogram(binwidth=250, colour="#8B5A00", fill="#CD8500") +    # Changing the binwidth and colours
  geom_vline(aes(xintercept=mean(abundance)),                         # Adding a line for mean abundance
             color="red", linetype="dashed", size=1) +                # Changing the look of the line
  theme_bw() +                                                        # Changing the theme to get rid of the grey background
  ylab("Count\n") +                                                   # Changing the text of the y axis label
  xlab("\nGriffon vulture abundance")  +                              # \n adds a blank line
  theme(axis.text.x=element_text(size=12),                            # Changing font size of axis labels
        axis.text.y=element_text(size=12),
        axis.title.x=element_text(size=14, face="plain"),             # Changing font size of axis titles
        axis.title.y=element_text(size=14, face="plain"),             # face="plain" changes font type, could also be italic, etc
        panel.grid.major.x=element_blank(),                           # Removing the grey grid lines
        panel.grid.minor.x=element_blank(),
        panel.grid.minor.y=element_blank(),
        panel.grid.major.y=element_blank(),
        plot.margin = unit(c(1,1,1,1), units = "cm")))                # Putting a 1 cm margin around the plot

# We can see from the histogram that the data are very skewed - a typical distribution of count abundance data


Figure 1. Histogram of Griffon vulture abundance in populations included in the LPI dataset. Red line shows mean abundance.

Pressing enter after each “layer” of your plot (and indenting the new line) prevents the code from being one gigantic line and makes it much easier to read.

In the code above you can see a colour code colour="#8B5A00" - each colour has a code, a combination of letters and numbers. You can get the codes for different colours online, from Paint, Photoshop or similar programs, or even from RStudio, which is very convenient! There is an RStudio Colourpicker addin - to install it, run the following code:

install.packages("colourpicker")

To find out the code for a colour you like, click on Addins/Colour picker.


When you click on All R colours you will see lots of different colours you can choose from - a good colour scheme makes your graph stand out, but of course, don’t go crazy with the colours. When you click on 1, and then on a certain colour, you fill up 1 with that colour; the same goes for 2 and 3. You can add more colours with the +, or delete them by clicking the bin. Once you’ve made your pick, click Done. You will see a line of code c("#8B5A00", "#CD8500") appear - in this case, we just need the colour code, so we can copy that and delete the rest. Try changing the colour of the histogram you made just now.
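If you are curious what a colour code means, the six characters after the # are three pairs of hexadecimal digits giving the amounts of red, green and blue. A small sketch using base R functions:

```r
# Decode a hex colour code into its red, green and blue components (0-255)
col2rgb("#8B5A00")

# Build the hex code back from those components
rgb(139, 90, 0, maxColorValue = 255)
```

You do not need to do this by hand to pick colours, but it can help when you want a slightly lighter or darker version of a colour you already have.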


Scatter plot to examine how Griffon vulture populations have changed between 1970 and 2014 in Croatia and Italy

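# Filtering the data to get records only from Croatia and Italy using the `filter()` function from the `dplyr` package
# %in% keeps every row whose country is in the list; == would compare the values pairwise and silently drop rows
vultureITCR <- filter(vulture, Country.list %in% c("Croatia", "Italy"))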

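# Using default base graphics
# as.factor() lets base R colour the points by country even when Country.list is stored as text
plot(vultureITCR$year, vultureITCR$abundance, col = as.factor(vultureITCR$Country.list))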

# Using default ggplot2 graphics
(vulture_scatter <- ggplot(vultureITCR, aes (x=year, y=abundance, colour=Country.list)) +
    geom_point())
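When filtering on more than one value, == and %in% behave differently, and the difference is easy to miss because == does not give an error. A small sketch with a made-up vector of country names:

```r
# Made-up vector, for illustration only
countries <- c("Croatia", "Italy", "Italy", "Croatia")

# == recycles c("Croatia", "Italy") and compares position by position,
# so only the first two elements happen to match
pairwise_match <- countries == c("Croatia", "Italy")
sum(pairwise_match)  # 2

# %in% asks, for each element, whether it appears anywhere in the list
set_match <- countries %in% c("Croatia", "Italy")
sum(set_match)       # 4
```

This is why %in% is the safer choice whenever you want to keep rows matching any of several values.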

Hopefully by now we’ve convinced you of the perks of ggplot2, but again like with the histogram, the graph needs a bit more work. You might have noticed that sometimes we have the colour= argument surrounded by aes() and sometimes we don’t. If you are designating colours based on a certain categorical variable in your data, like here colour = Country.list, then that goes in the aes() argument. If you just want to give the lines, dots or bars a certain colour, then you can use e.g. colour = "blue" and that does not need to be surrounded by aes().
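A minimal sketch of the two cases, again using the built-in mtcars dataset as a stand-in for your own data (the object names are made up):

```r
library(ggplot2)

# Colour depends on a variable in the data, so it goes inside aes()
(colour_by_group <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
    geom_point())

# One fixed colour for all points, so it goes outside aes()
(colour_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point(colour = "blue"))
```

The first plot gets a legend automatically, because ggplot2 needs to explain which colour belongs to which group. The second plot has no legend, since the colour carries no information.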

(vulture_scatter <- ggplot(vultureITCR, aes (x=year, y=abundance, colour=Country.list)) +
    geom_point(size=2) +                                                # Changing point size
    geom_smooth(method=lm, aes(fill=Country.list)) +                    # Adding a linear model fit and colour-coding by country
    theme_bw() +
    scale_fill_manual(values = c("#EE7600", "#00868B")) +               # Adding custom colours
    scale_colour_manual(values = c("#EE7600", "#00868B"),               # Adding custom colours
                        labels=c("Croatia", "Italy")) +                 # Adding labels for the legend
    ylab("Griffon vulture abundance\n") +                             
    xlab("\nYear")  +
    theme(axis.text.x=element_text(size=12, angle=45, vjust=1, hjust=1),       # making the years at a bit of an angle
          axis.text.y=element_text(size=12),
          axis.title.x=element_text(size=14, face="plain"),             
          axis.title.y=element_text(size=14, face="plain"),             
          panel.grid.major.x=element_blank(),                                  # Removing the background grid lines                
          panel.grid.minor.x=element_blank(),
          panel.grid.minor.y=element_blank(),
          panel.grid.major.y=element_blank(),  
          plot.margin = unit(c(1,1,1,1), units = "cm")) +                      # Adding a 1cm margin around the plot
    theme(legend.text = element_text(size=12, face="italic"),                  # Setting the font for the legend text
          legend.title = element_blank(),                                      # Removing the legend title
          legend.position=c(0.9, 0.9)))                  # Setting the position for the legend - 0 is left/bottom, 1 is top/right


Figure 2. Population trends of Griffon vulture in Croatia and Italy. Data points represent raw data with a linear model fit and 95% confidence intervals. Abundance is measured in number of breeding individuals.

If your axis labels need to contain fancy characters or superscript, you can get ggplot2 to plot that, too. It might require some googling regarding your specific case, but for example, this code ylab(expression(paste("Grain yield"," ","(ton.", ha^-1,")", sep=""))) will create a y axis with a Grain yield (ton. ha^-1) label, where the -1 is shown as superscript.
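Here is a sketch of such a label in a complete plot. The data frame is made up for illustration and has nothing to do with the LPI data.

```r
library(ggplot2)

# Made-up data, for illustration only
yield_data <- data.frame(fertiliser = c(0, 50, 100, 150),
                         yield = c(2.1, 2.9, 3.4, 3.6))

# Inside expression(), ha^-1 is drawn as "ha" with a superscript -1
yield_label <- expression(paste("Grain yield (ton.", ha^-1, ")"))

(yield_plot <- ggplot(yield_data, aes(x = fertiliser, y = yield)) +
    geom_point() +
    ylab(yield_label))
```

Storing the label in its own object keeps the plotting code short and lets you reuse the same label in several graphs.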

Boxplot to examine whether vulture abundance differs between Croatia and Italy

(vulture_boxplot <- ggplot (vultureITCR, aes(Country.list, abundance)) + geom_boxplot())

# Beautifying

(vulture_boxplot <- ggplot (vultureITCR, aes(Country.list, abundance)) + geom_boxplot(aes(fill=Country.list)) +
    theme_bw() +
    scale_fill_manual(values = c("#EE7600", "#00868B")) +               # Adding custom colours
    scale_colour_manual(values = c("#EE7600", "#00868B")) +             # Adding custom colours
    ylab("Griffon vulture abundance\n") +                             
    xlab("\nCountry")  +
    theme(axis.text.x=element_text(size=12),
          axis.text.y=element_text(size=12),
          axis.title.x=element_text(size=14, face="plain"),             
          axis.title.y=element_text(size=14, face="plain"),             
          panel.grid.major.x=element_blank(),                           # Removing the background grid lines                
          panel.grid.minor.x=element_blank(),
          panel.grid.minor.y=element_blank(),
          panel.grid.major.y=element_blank(),  
          plot.margin = unit(c(1,1,1,1), units = "cm"),                 # Adding a margin
          legend.position="none"))                                      # Removing the legend - not needed with only two factor levels


Figure 3. Griffon vulture abundance in Croatia and Italy.

Barplot to examine the species richness of a few European countries

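# Calculating species richness using pipes `%>%` from the `dplyr` package
# %in% keeps every row from the listed countries; == would compare the values pairwise and silently drop rows
richness <- LPI2 %>% filter(Country.list %in% c("United Kingdom", "Germany", "France", "Netherlands", "Italy")) %>%
            group_by(Country.list) %>%
            mutate(richness = length(unique(Common.Name)))   # number of unique species names per country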

(richness_barplot <- ggplot(richness, aes(x=Country.list, y=richness)) +
    geom_bar(position=position_dodge(), stat="identity", colour="black", fill="#00868B") +
    theme_bw() +
    ylab("Species richness\n") +                             
    xlab("Country")  +
    theme(axis.text.x=element_text(size=12, angle=45, vjust=1, hjust=1),  # x axis labels angled, so that text doesn't overlap
          axis.text.y=element_text(size=12),
          axis.title.x=element_text(size=14, face="plain"),             
          axis.title.y=element_text(size=14, face="plain"),             
          panel.grid.major.x=element_blank(),                                          
          panel.grid.minor.x=element_blank(),
          panel.grid.minor.y=element_blank(),
          panel.grid.major.y=element_blank(),  
          plot.margin = unit(c(1,1,1,1), units = "cm")))


Figure 4. Species richness in five European countries. Based on LPI data.
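If you only need one richness value per country, rather than the value repeated on every row, summarise() with n_distinct() is an alternative to mutate(). This is a sketch on made-up data, not the approach used for Figure 4; the column names match the LPI data but the rows are invented.

```r
library(dplyr)

# Made-up records, for illustration only
toy_records <- data.frame(
  Country.list = c("Italy", "Italy", "Italy", "France", "France"),
  Common.Name  = c("Griffon vulture", "Griffon vulture", "Red fox", "Red fox", "Red fox")
)

# One row per country, with the number of distinct species names
toy_richness <- toy_records %>%
  group_by(Country.list) %>%
  summarise(richness = n_distinct(Common.Name))

toy_richness
```

Italy has two distinct species in this toy data and France has one, so the result has two rows.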

You might be picking up on the fact that we repeat a lot of the same code - same font size, same margins, etc. Less repetition makes for tidier code and it’s important to have consistent formatting across graphs for the same project, so in our next tutorial we will learn how to write our own functions and loops to create multiple plots at once!
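As a small taste of what that can look like, here is a sketch of a function that bundles the theme settings repeated in the plots above. The name theme_tutorial() is made up for this example and is not part of ggplot2.

```r
library(ggplot2)

# A hypothetical helper that collects the settings we keep repeating
theme_tutorial <- function() {
  theme_bw() +
    theme(axis.text = element_text(size = 12),
          axis.title = element_text(size = 14, face = "plain"),
          panel.grid = element_blank(),                      # removes all grid lines at once
          plot.margin = unit(c(1, 1, 1, 1), units = "cm"))
}

# Using it on a stand-in plot made from the built-in mtcars dataset
(example_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    theme_tutorial())
```

Changing a font size inside the function then changes it in every plot that uses the function.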

Arranging plots in a panel using grid.arrange() from the package gridExtra

grid.arrange(vulture_hist, vulture_scatter, vulture_boxplot, ncol=1)

# This doesn't look right - the graphs are too stretched, the legend and text are all messed up, the white margins are too big
# Fixing the problems - adding ylab() again overrides the previous settings

panel <- grid.arrange(vulture_hist + ggtitle("(a)") + ylab("Count") + xlab("Abundance") +   # adding labels to the different plots
                 theme(plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), units = "cm")),
               vulture_boxplot + ggtitle("(b)") + ylab("Abundance") + xlab("Country") +
                 theme(plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), units = "cm")),
               vulture_scatter + ggtitle("(c)") + ylab("Abundance") + xlab("Year") +
                 theme(plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), units = "cm")) +
                 theme(legend.text = element_text(size=12, face="italic"),               
                       legend.title = element_blank(),                                   
                       legend.position=c(0.85, 0.85)), # changing the legend position so that it fits within the panel
               ncol=1) # ncol determines how many columns you have

To get around the too stretched/too squished panel problems, we will save the file and give it exact dimensions using ggsave().

ggsave("vulture_panel2.png", plot = panel, width=5, height=12) # the file is saved in your working directory, find it with getwd()


Figure 5. Examining Griffon vulture populations from the LPI dataset. (a) shows histogram of abundance data distribution, (b) shows a boxplot comparison of abundance in Croatia and Italy, and (c) shows population trends between 1970 and 2014 in Croatia and Italy.
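If you want to try arranging and saving a panel without the vulture plots, the sketch below does it with two stand-in plots from the built-in mtcars dataset. It assumes arrangeGrob() from gridExtra, which builds the panel without drawing it, and it saves to a temporary PDF so that nothing is written to your working directory. All object names are made up.

```r
library(ggplot2)
library(gridExtra)

# Two stand-in plots
plot_a <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5) +
  ggtitle("(a)")
plot_b <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  ggtitle("(b)")

# Build the panel as an object, one column, without drawing it yet
toy_panel <- arrangeGrob(plot_a, plot_b, ncol = 1)

# Save with exact dimensions (width and height are in inches by default)
out_file <- file.path(tempdir(), "toy_panel.pdf")
ggsave(out_file, plot = toy_panel, width = 5, height = 8)
file.exists(out_file)
```

Try changing width and height and opening the file again to see how the dimensions affect how stretched the panel looks.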

A team figure beautification challenge

To practice making graphs, open the Graph_challenge.R script file that you unzipped from the repository at the start of this tutorial and follow the instructions. Once you have made your figures, please upload them to this Google Drive folder.



If you have any questions about completing this tutorial, please contact us on ourcodingclub@gmail.com

We would love to hear your feedback on the tutorial, whether you did it in the classroom or online:

https://www.surveymonkey.co.uk/r/83WV8HV
