Tutorial Aims:

- Isolate and retrieve data from an html web page
- Automate the download of multiple web pages using R
- Understand how web scraping can speed up the harvesting of online data

Steps:

- Download the relevant packages
- Download a .html web page
- Import a .html file into R
- Locate and filter HTML using grep() and gsub()
- Import multiple web pages with mapply()
Why not just copy and paste?
Imagine you want to collect information on the area and percentage water area of African countries. It’s easy enough to head to Wikipedia, click through each page, then copy the relevant information and paste it into a spreadsheet. Now imagine you want to repeat this for every country in the world! This can quickly become VERY tedious as you click between lots of pages, repeating the same actions over and over. It also increases the chance of making mistakes when copying and pasting. By automating this process using R to perform “Web Scraping,” you can reduce the chance of making mistakes and speed up your data collection. Additionally, once you have written the script, it can be adapted for a lot of different projects, saving time in the long run.
Web scraping refers to the action of extracting data from a web page using a computer program; in this case, our computer program will be R. Other popular command-line tools that can perform similar actions are wget and curl.
Getting started
Open up a new R script where you will be adding the code for this tutorial. All the resources for this tutorial, including some helpful cheatsheets, can be downloaded from this Github repository. Clone and download the repo as a zip file, then unzip it and set the folder as your working directory by running the code below (subbing in the real path), or by clicking Session / Set Working Directory / Choose Directory in the RStudio menu.

Alternatively, you can fork the repository to your own Github account and then add it as a new RStudio project by copying the HTTPS/SSH link. For more details on how to register on Github, download git, sync RStudio and Github, and do version control, please check out our previous tutorial.
setwd("<PATH TO FOLDER>")
1. Download the relevant packages
install.packages("rvest") # To import a html file
install.packages("dplyr") # To use pipes
library(rvest)
library(dplyr)
2. Download a .html web page

The simplest way to download a web page is to save it as a .html file to your working directory. This can be accomplished in most browsers by clicking File -> Save as... and setting the file type to Webpage, HTML Only or something similar. Here are some examples for different browser and operating system combinations:

- Microsoft Windows - Internet Explorer
- Microsoft Windows - Google Chrome
- MacOS - Safari

Download the IUCN Red List information for Aptenodytes forsteri (Emperor Penguin) from http://www.iucnredlist.org/details/22697752/0 using the above method, saving the file to your working directory.
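If you would rather not leave R, you can often fetch the page directly with download.file(). Below is a minimal sketch, assuming the URL above still resolves and the page can be retrieved without a browser session; if the download fails or the saved file looks incomplete, fall back to saving the page manually as described above.

# Optional: download the page from within R instead of the browser
# (assumes the URL is still live and serves plain html)
url <- "http://www.iucnredlist.org/details/22697752/0"
download.file(url, destfile = "Aptenodytes forsteri (Emperor Penguin).html")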
3. Importing .html into R
The file can be imported into R as a vector using the following code:
Penguin_html <- readLines("Aptenodytes forsteri (Emperor Penguin).html")
Each string in the vector is one line of the original .html file.
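Before searching for anything, it is worth a quick sanity check that the import worked, for example:

length(Penguin_html)  # How many lines the .html file contains
head(Penguin_html)    # The first few lines, mostly html boilerplate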
4. Locating useful information using grep() and isolating it using gsub()

In this example we are going to build a data frame of different penguin species, gathering data on their IUCN Red List status and the date of their assessment, so that we end up with a data frame that looks something like this:
Scientific Name | Common Name | Red List Status | Assessment Date |
---|---|---|---|
Aptenodytes forsteri | Emperor Penguin | Near Threatened | 2016-10-01 |
Aptenodytes patagonicus | King Penguin | Least Concern | 2016-10-01 |
… | … | … | … |
Spheniscus mendiculus | Galapagos Penguin | Endangered | 2016-10-01 |
Open the IUCN web page for the Emperor Penguin in a web browser. You should see that the species name appears next to some other text, Scientific Name:. We can use this "anchor" to find the rough location of the species name:
grep("Scientific Name:", Penguin_html)
The code above tells us that Scientific Name: appears once in the vector, on string 132. We can search around string 132 to find the species name:
Penguin_html[131:135]
This displays the contents of our subsetted vector:
[1] "<tr>"
[2] " <td class=\"label\"><strong>Scientific Name:</strong></td>"
[3] " <td class=\"sciName\"><span class=\"notranslate\"><span class=\"sciname\">Aptenodytes forsteri</span></span></td>"
[4] "</tr>"
[5] ""
Aptenodytes forsteri is on line 133, wrapped in a whole load of html tags. Now we can check and isolate the exact piece of text we want, removing all the unwanted information:
# Double check the line containing the scientific name.
grep("Scientific Name:", Penguin_html)
Penguin_html[131:135]
Penguin_html[133]
# Isolate line in new vector
species_line <- Penguin_html[133]
## Use pipes to grab the text and get rid of unwanted information like html tags
species_name <- species_line %>%
gsub("<td class=\"sciName\"><span class=\"notranslate\"><span class=\"sciname\">", "", .) %>% # Remove leading html tag
gsub("</span></span></td>", "", .) %>% # Remove trailing html tag
gsub("^\\s+|\\s+$", "", .) # Remove whitespace and replace with nothing
species_name
For more information on using pipes, follow our data manipulation tutorial.
gsub() works in the following way:
gsub("Pattern to replace", "The replacement pattern", .)
This is fairly self-explanatory when we remove the html tags, but the pattern used to remove whitespace looks like a lot of random characters. In this gsub() command we have used "regular expressions", also known as "regex". These are sets of character combinations that R (and many other programming languages) can understand and use to search through character strings. Let's break down "^\\s+|\\s+$" to understand what it means:
- ^ = from the start of the string
- \\s = white space
- + = select one or more instances of the preceding character
- | = "or"
- $ = to the end of the string
So "^\\s+|\\s+$"
can be interpreted as “Select one or more white spaces that exist at the start of the string and select one or more white spaces that exist at the end of the string”. Look in the repository for this tutorial for a regex cheat sheet to help you master grep
.
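To see the pattern in action on something simpler than the IUCN page, here is a toy example with a made-up string:

# Trim leading and trailing whitespace from a made-up string
gsub("^\\s+|\\s+$", "", "   Aptenodytes forsteri   ")
# [1] "Aptenodytes forsteri"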
We can do the same for common name:
grep("Common Name", Penguin_html)
Penguin_html[130:160]
Penguin_html[150:160]
Penguin_html[151]
common_name <- Penguin_html[151] %>%
gsub("<td>", "", .) %>%
gsub("</td>", "", .) %>%
gsub("^\\s+|\\s+$", "", .)
common_name
IUCN category:
grep("Red List Category", Penguin_html)
Penguin_html[179:185]
Penguin_html[182]
red_cat <- gsub("^\\s+|\\s+$", "", Penguin_html[182])
red_cat
Date of Assessment:
grep("Date Assessed:", Penguin_html)
Penguin_html[192:197]
Penguin_html[196]
date_assess <- Penguin_html[196] %>%
gsub("<td>", "",.) %>%
gsub("</td>", "",.) %>%
gsub("\\s", "",.)
date_assess
We can create the start of our data frame by combining the vectors:
iucn <- data.frame(species_name, common_name, red_cat, date_assess)
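If you like, you can also give the columns more readable names to match the table above (the names below are just an illustrative choice):

# Optional: rename the columns (illustrative names, not required by the tutorial)
names(iucn) <- c("Scientific Name", "Common Name", "Red List Status", "Assessment Date")
iucn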
5. Importing multiple web pages
The above example only used one file, but the real power of web scraping comes from being able to repeat these actions over a number of web pages to build up a larger dataset.
We can import many web pages from a list of URLs generated by searching the IUCN Red List for the word "penguin". Go to http://www.iucnredlist.org, search for "penguin" and download the resulting web page as a .html file into your working directory.
Import Search Results.html:
search_html <- readLines("Search Results.html")
Now we can search for lines containing links to species pages:
line_list <- grep("<a href=\"/details", search_html) # Create a vector of line numbers in `search_html` that contain species links
link_list <- search_html[line_list] # Isolate those lines and place in a new vector
Clean up the lines so only the full URL is left:
species_list <- link_list %>%
gsub('<a href=\"', "http://www.iucnredlist.org", .) %>% # Replace the leading html tag with a URL prefix
gsub('\".*', "", .) %>% # Remove everything after the `"`
gsub('\\s', "",.) # Remove any white space
Clean up the lines so only the species name is left and transform it into a file name for each web page download:
file_list_grep <- link_list %>%
gsub('.*sciname\">', "", .) %>% # Remove everything before `sciname\">`
gsub('</span></a>.*', ".html", .) # Replace everything after `</span></a>` with `.html`
file_list_grep
Now we can use mapply() to download each web page in turn, using species_list and naming the files using file_list_grep:
mapply(function(x,y) download.file(x,y), species_list, file_list_grep)
mapply() loops through download.file() for each instance of species_list and file_list_grep, giving us 18 files saved to the working directory.
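Before importing them, you can check that every expected file was actually saved:

# Should return TRUE if all 18 downloads succeeded
all(file.exists(file_list_grep))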
Import each of the downloaded .html files based on its file name:
penguin_html_list <- lapply(file_list_grep, readLines)
Now we can perform similar actions to those in the first part of the tutorial to loop over each of the vectors in penguin_html_list and build our data frame:
Search for “Scientific Name:” and store the line number for each web page in a list:
sci_name_list_rough <- lapply(penguin_html_list, grep, pattern="Scientific Name:")
sci_name_list_rough
penguin_html_list[[2]][133]
Convert the list into a simple vector using unlist(). Adding 1 gives us the line containing the actual species name rather than the Scientific Name: label:
sci_name_unlist_rough <- unlist(sci_name_list_rough) + 1
Retrieve the lines containing scientific names from each .html file in turn and store them in a vector:
sci_name_line <- mapply('[', penguin_html_list, sci_name_unlist_rough)
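Here mapply() applies the subsetting function [ to each vector in penguin_html_list together with the matching line number in sci_name_unlist_rough. A toy example of the same idiom, with made-up vectors:

# Take the 2nd element of the first vector and the 3rd element of the second
mapply('[', list(c("a", "b", "c"), c("x", "y", "z")), c(2, 3))
# [1] "b" "z"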
Remove the html tags and whitespace around each entry:
## Clean html
sci_name <- sci_name_line %>%
gsub("<td class=\"sciName\"><span class=\"notranslate\"><span class=\"sciname\">", "", .) %>%
gsub("</span></span></td>", "", .) %>%
gsub("^\\s+|\\s+$", "", .)
As before, we can perform similar actions for Common Name:
# Find common name
## Isolate line
common_name_list_rough <- lapply(penguin_html_list, grep, pattern = "Common Name")
common_name_list_rough #146
penguin_html_list[[1]][151]
common_name_unlist_rough <- unlist(common_name_list_rough) + 5
common_name_line <- mapply('[', penguin_html_list, common_name_unlist_rough)
common_name <- common_name_line %>%
gsub("<td>", "", .) %>%
gsub("</td>", "", .) %>%
gsub("^\\s+|\\s+$", "", .)
Red List category:
red_cat_list_rough <- lapply(penguin_html_list, grep, pattern = "Red List Category")
red_cat_list_rough
penguin_html_list[[16]][186]
red_cat_unlist_rough <- unlist(red_cat_list_rough) + 2
red_cat_unlist_rough
penguin_html_list[[2]][187]
red_cat_line <- mapply(`[`, penguin_html_list, red_cat_unlist_rough)
red_cat <- gsub("^\\s+|\\s+$", "", red_cat_line)
Date assessed:
date_list_rough <- lapply(penguin_html_list, grep, pattern = "Date Assessed:")
date_list_rough # Different locations
penguin_html_list[[18]][200]
date_unlist_rough <- unlist(date_list_rough) + 1
date_line <- mapply('[', penguin_html_list, date_unlist_rough)
date <- date_line %>%
gsub("<td>", "",.) %>%
gsub("</td>", "",.) %>%
gsub("\\s", "",.)
Then we can combine the vectors into a data frame:
penguin_df <- data.frame(sci_name, common_name, red_cat, date)
penguin_df
Does your data frame look something like this?
sci_name | common_name | red_cat | date |
---|---|---|---|
Aptenodytes forsteri | Emperor Penguin | Near Threatened | 2016-10-01 |
Aptenodytes patagonicus | King Penguin | Least Concern | 2016-10-01 |
Eudyptes chrysocome | Southern Rockhopper Penguin, Rockhopper Penguin | Vulnerable | 2016-10-01 |
Eudyptes chrysolophus | Macaroni Penguin | Vulnerable | 2016-10-01 |
Eudyptes moseleyi | Northern Rockhopper Penguin | Endangered | 2016-10-01 |
Eudyptes pachyrhynchus | Fiordland Penguin, Fiordland Crested Penguin | Vulnerable | 2016-10-01 |
Eudyptes robustus | Snares Penguin, Snares Crested Penguin, Snares Islands Penguin | Vulnerable | 2016-10-01 |
Eudyptes schlegeli | Royal Penguin | Near Threatened | 2016-10-01 |
Eudyptes sclateri | Erect-crested Penguin, Big-crested Penguin | Endangered | 2016-10-01 |
Eudyptula minor | Little Penguin, Blue Penguin, Fairy Penguin | Least Concern | 2016-10-01 |
Megadyptes antipodes | Yellow-eyed Penguin | Endangered | 2016-10-01 |
Pygoscelis adeliae | Adélie Penguin | Least Concern | 2016-10-01 |
Pygoscelis antarcticus | Chinstrap Penguin | Least Concern | 2016-10-01 |
Pygoscelis papua | Gentoo Penguin | Least Concern | 2016-10-01 |
Spheniscus demersus | African Penguin, Jackass Penguin, Black-footed Penguin | Endangered | 2016-10-01 |
Spheniscus humboldti | Humboldt Penguin, Peruvian Penguin | Vulnerable | 2016-10-01 |
Spheniscus magellanicus | Magellanic Penguin | Near Threatened | 2016-10-01 |
Spheniscus mendiculus | Galapagos Penguin, Galápagos Penguin | Endangered | 2016-10-01 |
Now that you have your data frame, you can start analysing it. Try making a bar chart showing how many penguin species are in each Red List category; follow our data visualisation tutorial to learn how to do this with ggplot2.
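As a starting point, here is a minimal sketch of such a bar chart, assuming you have the ggplot2 package installed and the penguin_df data frame from above:

library(ggplot2)

# Count how many species fall into each Red List category
ggplot(penguin_df, aes(x = red_cat)) +
  geom_bar() +
  labs(x = "Red List category", y = "Number of species")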
A full .R script for this tutorial, along with some helpful cheatsheets and data, can be found in the repository for this tutorial.