Created by Erica Zaja



Tutorial Aims:

  1. What is ANOVA and why is it important?
  2. Setting a research question
  3. Formulating a hypothesis
  4. Data manipulation
  5. Visualising distribution with a histogram
  6. Visualising means with a boxplot
  7. Running a simple one-way ANOVA
  8. BONUS

Many of the questions we ask in science are about differences. Are the observed differences between our experimental groups due to chance or not? For instance, “how does soil pH vary with depth?”, “how does egg hatching time vary with temperature?” Questions like these can be answered using an analysis of variance (ANOVA). Whether you have never used R before but want to learn how to run a simple ANOVA, or you are an R user that wants to understand ANOVA more in depth, this is the tutorial for you! We will go through how to run an ANOVA from start to finish, in a complete and comprehensive tutorial that will guide you step-by-step.

Please note that this tutorial is not a data visualisation tutorial, so don’t worry if you don’t understand all the code in detail: I will also share links to data visualisation tutorials if you want to get up to grips with that too!

All the files you need to complete this tutorial can be downloaded from this repository. Click on Code/Download ZIP and unzip the folder, or clone the repository to your own GitHub account.

What is ANOVA and why is it important?

If you have never used R before, do not despair! We’ve all been there. If you need to download R and RStudio on your personal devices, check out the Coding Club tutorial Getting Started with R and R studio. I also recommend the Troubleshooting and how to find help tutorial and the Coding Etiquette tutorial to familiarise yourself with the coding world! But do not worry, you don’t need to be coding geeks to learn how to run a simple ANOVA!

Let’s set things up.

Open RStudio, create a new script by clicking on File/New File/R Script.

A script is a text file which will contain all your code. This can be saved and your commands can be re-executed and/or modified later.

It is good practice to title your script appropriately. Essential information includes your name and contact details, date, content of the script and data source (and other sources such as image sources and licences). You can also specify your workflow: the main sections of your script, and what each will contain, to make navigation easier. See the script header below, which you can use as reference:

# Title: ANOVA for absolute beginners tutorial
# Script purpose: ANOVA to investigate how frogspawn hatching time varies with temperature.
# Author - contact details
# Date

# Sources and Licences:
# Data created by @ericazaja and licenced under the MIT agreement
# (https://github.com/EdDataScienceEES/tutorial-ericazaja/tree/master/LICENSE.md).
# Icons from website phylopic (http://phylopic.org/),
# distributed under the Creative Commons license (https://creativecommons.org/)
# Photos from Getty Images (https://www.gettyimages.co.uk/eula)
# made available for use (https://www.gettyimages.co.uk/eula)

### WORKFLOW: main sections ----
# 1. Setting up: working directory, loading libraries, importing data
# 2. Data formatting: tidying and exploring data
# 3. Data visualisation: histogram and boxplot
# 4. One-way ANOVA: model, table, assumptions
# 5. Communicating results: barplot
# 6. BONUS: adding phylopic and making panel

The # are used to add comments to your script. This tells R that what you’re writing is NOT a line of code to run, but an informative comment.

N.B. adding 4 or more “-“ after the headings creates an outline i.e. creates sections. To view your outline use Edit/Folding/Collapse all to collapse all sections and navigate to whichever section you need by clicking on the section header. To view all your code, Expand all sections.

Now, save your script using File/Save as. Remember to give your script an informative name, including your initials, the date and a little hint to the script’s purpose. The standard file format for scripts is .R e.g. “ANOVA_tutorial_EZ_2020.R”.

Save your script in a folder that will be your working directory: the folder in your computer where all your work (scripts, data, image outputs etc.) will be saved. N.B. multiple folders may be within your working directory. I recommend creating a folder named data in your working directory folder.

To set your working directory, you can click on Session in the top menu in RStudio and then on Set Working Directory/Choose directory and browse your folders.

### SETTING UP ----

# Set your working directory
setwd("/Users/ericazaja/Desktop/Data_Science/tutorial-ericazaja")  
# N.B. Enter your own filepath.

getwd()  # Run this to check where your working directory is

To run your code, highlight the line you want to run and press Command and Enter if you’re using a Mac or Ctrl and R on Windows PC.

N.B. Working directories can be a bit confusing. If you don’t know where your work is being saved, run getwd() to see your working directory filepath. Run setwd() and type in your working directory filepath to set your working directory (alternatively to what we did above with Session). Filepaths have a different format if on a Mac vs a Windows PC: for a Mac, your working directory filepath will look something like this: setwd("~/Desktop/ANOVA") vs for a Windows PC it will look like this setwd("C:/Users/Name/Desktop/ANOVA"). For an example of setting a working directory, see this tutorial.

Have a look at this tutorial to learn how to create Git repositories and make your life easier using relative filepaths so that you never have to set your working directory again!

Next, we need to load the libraries we will use for the tutorial. Libraries are a cluster of commands used for a certain purpose (e.g. formatting data, making maps, making tables etc.). Before loading libraries, you must install the packages (the cluster of commands) that libraries load. You will only need to install packages once, while you will need to load libraries every time you close and reopen your script or restart your Rstudio. Make sure to install packages only once, since doing so multiple times can create problems.

# Loading Libraries
library(tidyverse)  # For data wrangling and data visualisation

# If you don't have the packages installed already, do so by uncommenting the code below
# install.packages("tidyverse")

tidyverse includes many packages that we will use throughout the tutorial, including dplyr (for data wrangling) and ggplot2 (for data visualisation).

1. What is ANOVA and why is it important?

ANOVA is one of the most used statistical analyses in the domain of ecological and environmental sciences.

It is also widely used in many social science disciplines such as sociology, psychology, communication and media studies.

In order to understand ANOVA, let’s remind ourselves of a few important definitions…

The explanatory variable is the CAUSE and the response variable is the EFFECT.

Now, can you guess what ANOVA stands for? ….

ANOVA is used to assess variation of a continuous dependent variable (y) across levels of one or more categorical independent variables (x). As explained above, the latter are often referred to as “factors”. Each factor may have different categories within it, called levels.

You probably guessed it by now: ANOVA = Analysis of Variance!

In this tutorial, we will focus on a single factor design (i.e. the simplest) and we will learn how to run and interpret a ONE-WAY ANOVA. The basic logic of ANOVA is simple: it compares variation between groups to variation within groups to determine whether the observed differences are due to chance or not. A ONE-way ANOVA only considers ONE factor.

How do you know if ANOVA is the appropriate test for your data?

If your goal is to compare the means of 3 or more independent groups and you have one continuous response variable and ONE categorical explanatory variable with a number of levels, one-way ANOVA is the test for you!

See the path to your stats choice summarised in the diagram below:

N.B. ANOVA is a linear regression BUT the predictor variables are categorical rather than continuous. Moreover, instead of a slope, with ANOVA you obtain an estimate of the response variable for each explanatory variable category.

Note that you can run a linear regression with categorical variables (as we will do below).

If you are keen to learn how to build a simple linear model check out this tutorial.

2. Setting a research question

Always set your research question before you start thinking about which is the most appropriate statistical test to use on your data.

A research question is an answerable enquiry related to your topic of interest. It should be clear and concise and it should contain both your response and your explanatory variables.

In this tutorial, our research question is:

How does frogspawn hatching time vary with temperature?

Imagine we ran a manipulative experiment.

A manipulative study is one in which the experimenter changes something about the experimental study system and studies the effect of this change.

We collected newly-layed frogspawn from a pond in the Italian Alps and we brought them back to the lab, where we divided them into 60 water containers. 20 of the containers’ water temperature was kept to 13°C, 20 containers were kept to 18°C and the remaining 20 containers were kept to 25°C. Having a high number of replicates increases our confidence that the expected difference between groups is due to the factor we are interested in. Here, temperature.

We monitored each water container and we recorded hatching times (days until hatching of eggs) in a spreadsheet (here called frogs_messy_data.csv).

We want to compare the means of 3 independent groups (13°C, 18°C and 25°C temperature groups) and we have one continuous response variable (Hatching time) and one categorical explanatory variable (Temperature). One-way ANOVA is the appropriate analysis!

3. Formulating a hypothesis

Always make a hypothesis and prediction, before you delve into the data analysis.

A hypothesis is a tentative answer to a well-framed question, referring to a mechanistic explanation of the expected pattern. It can be verified via predictions, which can be tested by making additional observations and performing experiments.

This should be backed up by some level of knowledge about your study system.

In our case, knowing that frogspawn takes around 2-3 weeks to hatch under optimal temperatures (15-20°C), we can hypothesize that the lowest the temperature, the longer it will take for frogspawn to hatch. Our hypothesis can therefore be: mean frogspawn hatching time will vary with temperature level. We can predict that given our temperature range, at the highest temperature (25°C) hatching time will be reduced.

4. Data manipulation

Importing data

Now that we’ve set our research question, hypothesis and prediction, let’s load the datasheet (in R jargon, data frame) into R. I recommend saving your data in your working directory folder: it’s easier to import it, it’s more logical and you are less likely to wonder where the file went later on! Find the dataset in your working directory (data folder) on the bottom right panel of RStudio, then import it by using Import Dataset/Import.

The code below will appear in your console (the bottom left panel of RStudio). The content of the console cannot be saved, so make sure to copy the code below into your script (top right panel) so when you run the script again in the future, the data will load automatically.

# Loading data
frogs_messy_data <- read_csv("data/frogs_messy_data.csv")  # Load the messy dataset

head(frogs_messy_data)  # Check that your data imported correctly

The dataset is in .csv (comma separated values) format. This is usually a good format to read into R, being a highly transferrable format, available for use across multiple platforms.

The head() function allows you to view the first few rows and columns of your dataset, to make sure the dataset has been correctly imported. It is important to do this because sometimes R can get confused with row and column names. Make sure to doublecheck!

Notice how we have chosen the name of the data frame object that our frogs_messy_data.csv will be imported as into R. We assign frogs_messy_data.csv to the object name frogs_messy_data via a little arrow <-.

R is in fact an object-oriented statistical programming language.

Find all the basics about R in the Intro to R tutorial.

Tidying data

Let’s take a closer look at our dataset. As you can see from a first glance, this data frame has Temperature13, Temperature18 and Temperature25 (the 3 levels of our explanatory variable) as separate columns within which the hatching time has been recorded for each frogspawn sample. This is our dataset in wide format.

However, for analysing data, we need to re-order the datasheet into long format: this means tidying the data so that each variable is a column and each observation is a row. See below the basic data wrangling code and learn more about data wrangling here.

### DATA FORMATTING ----

# Tidying the dataset
frogs_tidy_data <- gather(frogs_messy_data, Temperature, Hatching_time, c(2:4)) %>%    
                   # Hatching times (value) to be gathered by Temperature (key)
                   mutate(Temperature = parse_number(Temperature)) %>%                 
                   # To get rid of the non-numerical part
                   select("Hatching_time", "Temperature") %>%                          
                   # Keeping only the columns we need for the analysis
                   na.omit()                                                           
                   # To get rid of missing values (NAs)

write.csv(frogs_tidy_data, file = "data/frogs_tidy_data.csv")                                
# Saving cleaned data frame (frogs_tidy.csv) file in the data folder in your working directory
# The write.csv() function will only work if you have created the tidy data frame before

The %>% used above is a pipe. This prevents the use of un-necessary intermediate objects, and makes the code shorter and more efficient. Learn more about pipes trying the Efficient Data Manipulation tutorial.

Tip:

If you wanted to rename your temperature levels to “Low”, “Medium”, “High”, you could create a new column called Temp_level using mutate() function:

# mutate(Temp_level = case_when(Temperature == 13 ~ 'Low', Temperature == 18 ~ 'Medium', Temperature == 25 ~ 'High'))
# When temperature is equal to ("==") "..." , return (~) "...".

Always explore the dataset you’re working with. To check out what kind of data we are dealing with we can use the str() function.

str(frogs_tidy_data)   # Exploring data                                              

You can see that Hatching_time and Temperature are both numerical variables. Is this right?

Now is a good time to think about what you want to achieve with your dataset. Remember the research question: How does frogspawn hatching time vary with temperature?

We want to model hatching time as a function of temperature.

Temperature is our explanatory variable and it is here coded as numerical variable, when it should be coded as factor (categorical variable) with 3 levels (“13”, “18”, “25”). The numbers represent the different categories of our explanatory variable, not actual count data. We therefore need to transform Temperature from numeric to factor variable.

frogs_tidy_data$Temperature <- as.factor(as.character(frogs_tidy_data$Temperature))   
# Makes temperature into factor variable  

The dollar sign $ isolates the column Temperature from the data frame frogs_tidy_data. Temperature is therefore a vector: an ordered sequence of values of the same type.

A data frame is 2 dimensional (rows and columns) whereas a vector is 1 dimensional.

5. Visualising distribution with a histogram

Always have a look at the distribution of your response variable before delving into the statistical analysis. This is because many parametric statistical tests (within which ANOVA) assume that continuous dependent variables are normally distributed, so we must check that assumptions are met to trust our model’s output.

Note that data can be log-transformed to meet normality assumptions. Alternatively, non-parametric tests are available for non-normally distributed data. Have a look at this tutorial for examples of non-parametric testing in R.

We can plot a histogram to have a look at the frequency distribution of our response variable.

First, we can make sure that all the figures we make will be consistent and beautiful using this theme. A theme is like a template: it will make all the figures you apply it to have similar features (e.g. font size, line width, colour palette…). For now, just copy the code below, since data visualization is not the focus of this tutorial. But if you want to learn more about customising your plots, check out this tutorial.

# Data visualisation ----

theme_frogs <- function(){  # Creating a function
  theme_classic() +  # Using pre-defined theme as base
  theme(axis.text.x = element_text(size = 12, face = "bold"),  # Customizing axes text      
        axis.text.y = element_text(size = 12, face = "bold"),
        axis.title = element_text(size = 14, face = "bold"),  # Customizing axis title
        panel.grid = element_blank(),  # Taking off the default grid
        plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), units = , "cm"),
        legend.text = element_text(size = 12, face = "italic"),  # Customizing legend text
        legend.title = element_text(size = 12, face = "bold"),  # Customizing legend title
        legend.position = "right",  # Customizing legend position
        plot.caption = element_text(size = 12))  # Customizing plot caption
}                                                                              

Next, let’s build the histogram with ggplot, a package with which you can make beautiful figures. Think of ggplot as an empty room, to which you gradually add more and more furniture, to make it look pretty, until it’s tidy, useful and beautiful! We add elements to our ggplot with +. Again, don’t worry if you don’t understand all the elements of the code below, copy the content and keep the goal in mind: ANOVA!

# Creating a histogram with ggplot
(frog_histogram <- ggplot(frogs_tidy_data, aes(x = Hatching_time, fill = Temperature)) +  
# Plotting from the tidy data frame and colouring bars by Temperature                      
                   geom_histogram(stat = "count") +
                   # Makes height of bars proportional to number of cases in each group
                   geom_vline(aes(xintercept = mean(Hatching_time)),            
                              colour = "red", linetype = "dashed", size = 1) +
                              # Adding a line for mean abundance
                   scale_fill_manual(values = c("#97F7C5", "#4ED973", "#08873D")) +          
                   # Adding custom colours    
                   labs(x = "\n Hatching time (days)", y = "Frequency \n",                   
                   # Adding x and y axis labels.
                   # "\n" adds space before x and after y axis text
                         caption = "\n Fig.1 Response variable (hatching time) is normally
                         distributed in each treatment group (temperature levels). Red dashed
                         line shows mean hatching time. n = 60.") +   
                         # Adding informative figure caption
                         # caption = "\n Fig.1") +  # Adding caption for figure in panel
                   theme_frogs() +  # Adding our personalised theme
                   guides(fill = guide_legend(title = "Temperature level (°C)")))
                   # Adding an informative legend title

N.B adding () around the whole plotting code allows you to visualise the plot on the bottom right panel of R (Plots window) simultaneously as running the code. Without (), you would need to run the frog_histogram object by itself (adding a line of code to your script that says only frog_histogram), after running the plotting code above, in order to actually see your plot.

Always include informative figure captions. These must include a sentence with the main take-home message from the figure, explanation of the graph’s elements (e.g. raw data points, dashed line for mean, S.E. bars…), the data source and your sample size (n = …).

You can save your graphs with ggsave, saving the figure in the appropriate folder (assets/img/), with the appropriate file name (frogs_histogram). Make sure you change the code below to your specific filepath. Make sure names are concise but informative.

ggsave(frog_histogram, file = "assets/img/frog_histogram.png", width = 9, height = 7)  

You can customise length and width that the picture is saved in with width = "" and height = "". Make sure nothing gets cropped or figures are too squished. You can save pictures in different formats but usually .pdf (figures don’t decrease in quality when you zoom in or out) or .png (easily inserted in text documents) are the best way.

From the histogram above we can assume that the data is normally distributed for all of our 3 Temperature groups (each histogram peaks in the middle and is roughly symmetrical about the mean). If you are unsure about data distributions, check out tutorial From distributions to linear models.

6. Visualising means with a boxplot

Let’s keep exploring our dataset, using a boxplot.

A boxplot allows you to look at the variation in a continuous variable across categories, at the spread of your data and it gives you an idea of what you might find with ANOVA in terms of differences between groups. If boxes do not overlap, you probably have significant differences between groups, but you must verify this via statistical analysis.

You can use ggplot (see below) or you can create a boxplot with: boxplot(Hatching_time ~ Temperature, data = frogs_tidy_data).

# Creating a boxplot with ggplot
(frog_boxplot <- ggplot(frogs_tidy_data, aes(x = Temperature, y = Hatching_time,
                                            fill = Temperature)) +
                 geom_boxplot() +
                 scale_fill_manual(values = c("#97F7C5", "#4ED973", "#08873D")) +  
                 labs(x = "\nTemperature level (°C)", y = "Hatching time (days)",  
                      caption = "\n Fig.2 Forgspawn exposed to lowest temperature (13°C) was
                       the slowest to hatch. n = 60.") +
                      # caption = "\nFig.2") +  # Caption for figure in panel
                 theme_frogs() +  
                 theme(legend.position = "none"))  # Over-writing our theme() to get rid of legend

# Saving boxplot
ggsave(frog_boxplot, file = "assets/img/frog_boxplot.png", width = 9, height = 7)            

Boxes give a measure of variability. Boxes encompass 50% of each group’s values: 25% of values are above the range and 25% below the range. This is therefore useful to know where most of the datapoints fall. The horizontal black lines at the center of each box represent the median. The whiskers (vertical black lines at the top and bottom of each box) are a measure of variability: the wider the whiskers, the more variable the data.

Having a look at our boxplot, you can see something is going on here: the frogspawn exposed to the lowest temperature take the longest to hatch. The eggs exposed to the highest temperature take the least time to hatch, as predicted! The boxes don’t overlap, meaning there is likely a statistically significant difference between groups. To check this (you guessed it) we need ANOVA!

7. Running a simple one-way ANOVA

We’re all set. We can now code the ANOVA!

Keep the goal in mind: analysing hatching time as a function of temperature level.

### ONE-WAY ANOVA -----

frogs_anova <- aov(Hatching_time ~ Temperature, data = frogs_tidy_data)
summary(frogs_anova)

You can read your modelling code as if it was a sentence: the code above runs the ANOVA test (aov), analysing hatching time (Hatching_time) as a function of (~) temperature level (Temperature), getting data (data = ...) from the data frame frogs_tidy_data.

Visualising model output table and interpreting it

The summary() function shows you the summary output of your ANOVA, also known as your ANOVA table, with degrees of freedom, F value and p value (all the info we need!).

See highlighted in the table above the most important information from the model output.

A p-value is the probability of seeing a test statistic as big or bigger than the one we actually observed if the null hypothesis is true. If p < 0.05 we reject the null hypothesis. However, the test should be repeated multiple times to be able to confidently accept or reject the null hypothesis.

Here, p is highly significant (p < 2e-16 ***). This means there is a significant difference between hatching times under different temperature levels. Our predictor variable has had a significant effect on your response variable.

N.B. p is an arbitrary value. So beware of it! It’s not a universal measure and it can be misleading, resulting in false positives. Read more about p values and their drawbacks in this blog post on the Methods in Ecology and Evolution: blog “There is Madness in our methods”.

If you want more details about your analysis, you can run the same code but using a lm linear model function. As mentioned above, ANOVA is in itself a linear model.

## LM

frogs_lm <- lm(Hatching_time ~ Temperature, data = frogs_tidy_data)
summary(frogs_lm)

If you don’t understand everything in detail right now, don’t worry! Take it slow. Statistics can be a difficult subject, especially if you’re new to them! You might want to pause the tutorial here, make a cup of tea and go back to it later.

But if you feel like going ahead…

You are almost there! We now need to check ANOVA assumptions and visualise our results.

Checking assumptions

ANOVA makes 3 fundamental assumptions:

a. Data are normally distributed.

b. Variances are homogeneous.

c. Observations are independent.

We need to check that model assumptions are met, in order to trust ANOVA outputs. Let’s check these one by one with specific plots:

a. Residuals histogram and Normal Q-Q plot: Normality can be checked via a frequency histogram of the residuals and a quantile plot where the residuals are plotted against the values expected from a normal distribution.

Residuals are the deviation of individually measured samples from the mean.

What to look for: The histogram of residuals should follow a normal (gaussian) distribution and the points in the Q-Q plot should lie mostly on the straight line.

# Checking normality
par(mfrow = c(1,2))  # This code put two plots in the same window
hist(frogs_anova$residuals)   # Makes histogram of residuals  
plot(frogs_anova, which = 2)   # Makes Q-Q plot

If the normality asumption is not met, you can log-transform your data into becoming normally distributed or run the non-parametric alternative to ANOVA: Kruskal-Wallis H Test.

b. Residuals VS Fitted plot: To check that the variation in the residuals is approximately equal across the range of the predictor variable (i.e. check for homoscedasticity) we can plot the residuals against the fitted values from the aov model object.

Fitted values are what the model predicts for the response variable.

What to look for: We want to see a straight red line centered around zero! This means residuals do NOT systematically differ across different groups.

# Checking homoscedasticity (Homogeneity of variances)
plot(frogs_anova, which = 1)  # Makes residuals VS fitted plot

If the homogeneity of variances assumption is violated, run a Welch F test and add into your results section that this assumption was violated.

c. ANOVA assumes that all replicate measures are independent of each other:

Two measures are independent if the measurement from one individual gives no indication as to which value the measurement from another individual will produce.

Replicate measures must be equally likely to be sampled from the population of possible values for each level. This issue needs to be considered at the experimental design stage. If data are grouped in any way, then more complex designs are needed to account for additional factors. A mixed model approach is advised for hierarchical data. Have a look at the Linear mixed models tutorial for more info.

Our data does not violate any of the ANOVA assumptions: we can therefore trust our model output! If assumptions are not 1000% met, no panic! Most of the time it is enough for assumptions to be roughly met.

Now we need to communicate our results.

Communicating model results with a barplot

We can comminucate our findings in a few ways:

After running an ANOVA, always report at least your F value, degrees of freedom and p value.

Firstly, let’s create a new data frame with the summarise() function, which allows you to calculate summary statistics including our sample size (n), mean hatching time per temperature level, standard deviation and standard error values.

summary_stats <- frogs_tidy_data %>%
                 group_by(Temperature) %>%
                 summarise(n = n(),  # Calculating sample size n
                           average_hatch = mean(Hatching_time),  
                           # Calculating mean hatching time
                           SD = sd(Hatching_time))%>%  # Calculating standard deviation
                 mutate(SE = SD / sqrt(n))  # Calculating standard error

Standard deviation is a measure of the spread of values around the mean. Standard error is a measure of the statistical accuracy of an estimate.

If you don’t fully grasp the code above, check out the Efficient Data Manipulation tutorial. Also, have a look here for how to calculate standard deviation and standard error.

Now, let’s plot our graph. Don’t worry if you are unclear about some of the elements below. If you’re keen, learn how to make your figures extra pretty with the Data Visualisation tutorial.

# Making a barplot
(frog_barplot_1 <- ggplot(data = summary_stats) +                                  
                   geom_bar(aes(x = Temperature, y = average_hatch,
                               fill = Temperature),
                               stat = "identity", colour = "black") +               
                   geom_errorbar(aes(x = Temperature, ymin = average_hatch - SE,
                                     ymax = average_hatch + SE), width = 0.2,
                                     colour="black", alpha=0.9,size=1) +  
                                     # Adding standard error bars        
                   scale_fill_manual(values = c("#97F7C5", "#4ED973", "#08873D")) +
                   labs(x = "\nTemperature level (°C)", y = " Average hatching time (days)",
                       caption = "\nFig.3 Forgspawn exposed to lowest temperature (13°C) was
                       the slowest to hatch. Non-overlapping S.E. bars indicate significant
                       differences among mean groups. n = 60.") +
                       # caption = "\nFig.3") +  # Adding caption for figure in panel
                   theme_frogs() +                                              
                   theme(legend.position = "none"))

There you go! Well done for making it this far! You have accomplished a lot, you should be proud of yourself! Now, you can stop here (by all means!) or, if you’re keen, scroll down for some extra data visualisation tips that will make your figures look super professional! You won’t regret it!

Conclusion

Well done for getting through the tutorial! Here’s a summary of what you have learned:

If you are keen to see a few cool data visualisation tricks, keep scrolling…

8. BONUS:

Adding icons with phylopic

Click here to view

Do you want your figure to look 100 times better, with just a few extra lines of code?

Add an icon to your plot with just a few clicks! Believe me, once you start, you’ll get addicted and you’ll put animal or plant icons wherever you can! See code below:

# Extra Libraries
library(RCurl)  # For loading icons
library(rphylopic)  # For using pictures from website phylopic
library(png)  # For loading pictures in png format

# If you don't have the libraries, uncomment code below:
# install.packages("RCurl")
# install.packages("rphylopic")
# install.packages("png")

# Animal icon: loading frog logo
frog <- "http://phylopic.org/assets/images/submissions/c07ce7b7-5fb5-484f-83a0-567bb0795e18.256.png"  
# Link of icon, from phylopic website
frog_logo <- readPNG(getURLContent(frog))  # Loading the logo into R

Now re-run the code for the barplot, adding the add_phylopic() function:

(frog_barplot <-  ggplot(data = summary_stats) +                                  
                  geom_bar(aes(x = Temperature, y = average_hatch,
                               fill = Temperature),
                               stat = "identity", colour = "black") +               
                  geom_errorbar(aes(x = Temperature, ymin = average_hatch - SE,
                                    ymax = average_hatch + SE), width = 0.2,
                                    colour="black", alpha=0.9, size=1) +                   
                  scale_fill_manual(values = c("#97F7C5", "#4ED973", "#08873D")) +
                  add_phylopic(frog_logo, alpha = 1, x = 3, y = 23, ysize = 4) +  
                  # Adding frog logo to the plot                                                
                  labs(x = "\nTemperature level (°C)", y = " Average hatching time (days)",
                       caption = "\nFig.4 Forgspawn exposed to lowest temperature (13°C) was
                       the slowest to hatch. Non-overlapping S.E. bars indicate
                       significant differences among mean groups. n = 60.") +
                       # caption = "\nFig.3") +  # Adding caption for figure in panel
                  theme_frogs() +                                              
                  theme(legend.position = "none"))

ggsave(frog_barplot, file = "assets/img/frog_barplot.png", width = 9, height = 7)

Oh, look! A frog just jumped on your screen! How much more professional does that look now? Just with a few clicks, your figure is now more communicative, more effective and much prettier!

If you can’t find the icon straight away, no panic! I promise you, it’s there! It’s just hidden. Maybe the x and y values (where the icon’s center will be) are a bit off. Maybe the icon is way too big or way too small. All you need to do is adjust the icon’s settings. Now, the way I see it is as a game: can you find the frog?

You can find many more animal and plant icons on the phylopic website!

Find out more about how to insert a phylopic here.

Making a panel with gridExtra

Click here to view

To visualise all your main output figures together, you can put them all into a panel. This may be useful in a paper, if you want to display all your outputs in one place so that they clearly convey the main message of your ANOVA.

Remember you will need to make changes to your figures, when putting them in a panel. You might want to simplify the individual captions of the figures, and make a caption for the whole panel. In this tutorial I have created shorter captions for individual figures to be put into the panel (see code above and uncomment the lines with short captions).

Make sure you can see all plots clearly, and that nothing gets squished! Add an informative title at the top of your plot, clearly communicating your message/main finding.

We will need a few extra libraries.

# Extra Libraries
library(gridExtra)  # For making panels
library(ggpubr)  # For data visualisation formatting

panel_title <- text_grob("Frogspawn exposed to lowest temperature slowest to hatch",
                          size = 21, face = "bold")  # Customising panel title

panel_caption <- text_grob("Fig.5 Response variable (hatching time) is normally distributed in
                            each treatment group (Fig.1); frogspawn exposed to lowest
                            temperature (13°C) was the slowest to hatch (Fig.2, Fig.3);
                            Source: Dataset by @ericazaja.", size = 16)   # Customising panel caption

# Making the panel
(frog_panel <- grid.arrange(arrangeGrob(frog_histogram, frog_boxplot,
                            frog_barplot, ncol = 3),  # Sets number of panel columns
                            top = panel_title,  # Adding panel title
                            bottom = panel_caption))  # Adding panel caption                  

ggsave(frog_panel, file = "assets/img/frog_panel.png", width = 18, height = 9)

Pre-registrations

Click here to view

A little tip for science good-practice..

When carrying out research, before you delve into data collection and analysis, it is important to write a pre-registeration.

Writing a pre-registration means specifying your research plan in advance of your study and submitting it to a registry, such as the Center for Open Science.

You should pre-register the following:

…so that they are “set in stone”.

This is done in order to prevent what is known as p-hacking i.e. conducting multiple forms of the analysis and reporting only the one with the lowest p value (i.e. the strongest relationship) hence “more surprising” and - most of all - “more publishable” result.

This is bad practice and should be discouraged! Pre-registrations are a good way to be more transparent and encourage data sharing and honest research. Plus, thinking before acting is always a good idea! In coding and in life!

There are many platforms where you can make a pre-registration, for example the Open Science Framework. Find out more about Transparency in Ecology and Evolution here.

Extras

If you enjoyed the tutorial and you’re keen to learn more about statistics, Coding Club has got you covered! Ecological data often has a hierarchical structure, best analysed via linear mixed models: check out this tutorial to learn more! Interested to explore how to use Bayesian modelling? Click here!

# Need a motivational boost?
library(praise)
praise()


I would love to hear your feedback on the tutorial! If you have any comments, questions or doubts about completing this tutorial, please contact me on ericazaja@gmail.com.

Stay up to date and learn about our newest resources by following us on Twitter!
Contact us with any questions on ourcodingclub@gmail.com

Related tutorials: