This challenge will require the use of data manipulation, plotting and linear modelling skills, and is the culmination of the STATS FROM SCRATCH course stream. Scroll for more information on your tasks and how to complete the challenge.
Data overview
You will use the following datasets, available from the challenge repository on GitHub. To answer the quiz questions correctly, it is important that you use these datasets and not potentially updated versions available through the original providers.
The Scottish Squirrel Database
squirrels.csv: A dataset of grey and red squirrel observations compiled by the Scottish Wildlife Trust and hosted on the NBN Atlas. The most relevant variables in the dataset for this challenge are:
- Year: the year of the sighting
- Count: the number of squirrels sighted on the occasion (if blank, assume it is 1)
- OSGR: the Ordnance Survey grid reference for 10 x 10 km squares; will be useful to link the forest cover data
Forest cover
forestcoverOS.csv: This dataset contains the forest cover (in % and total area) in each OS grid cell. This dataset was created by us*, using:
- The National Forest Inventory for Scotland 2017, from the Forestry Commission
- OS grid cells at a 10 x 10 km resolution, from this Git repository
Fancy a more advanced challenge? Why don’t you try re-creating this dataset yourself? (Best suited to someone with some knowledge of spatial analysis: all you have to do is intersect the files and extract the area.)
Specific tasks
Here is a detailed list of the tasks you should achieve within this challenge. Remember that a challenge is meant to be, well, challenging, and therefore we are setting you goals but the choice of workflow and functions to achieve them is up to you! We also list the questions that will be asked in the quiz at the end to confirm your successful completion - we suggest you take note of your answers as you go.
1. Data manipulation
Clean the squirrel dataset for the last decade, so it’s ready to analyse. Specifically, you should:
- Keep only observations for the years 2008 to 2017 (using the Start.date.year column and renaming it to year)
- Remove the observations that are not at the species level (i.e. we don’t know whether they are grey or red squirrels)
- Create a species column that will have Red and Grey as factor levels
- We will assume that the observations that have NA as count are observations of one squirrel; replace them with the value 1.
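As a rough illustration of the renaming and species-column steps, here is a minimal sketch on toy data. Only Start.date.year comes from the task description; the Common.name column and its values are assumptions made up for this example, so check the real column names in squirrels.csv before adapting it.

```r
library(dplyr)

# Toy data: Common.name and its values are hypothetical stand-ins
# for whatever column identifies the species in the real dataset.
toy <- data.frame(
  Start.date.year = c(2007, 2010, 2015, 2017),
  Common.name = c("Red Squirrel", "Grey Squirrel", "Squirrel", "Red Squirrel")
)

toy_clean <- toy %>%
  # rename(new_name = old_name)
  rename(year = Start.date.year) %>%
  # keep only records identified to species level
  filter(Common.name %in% c("Red Squirrel", "Grey Squirrel")) %>%
  # build a factor with exactly two levels, Red and Grey
  mutate(species = factor(
    ifelse(Common.name == "Red Squirrel", "Red", "Grey"),
    levels = c("Red", "Grey")
  ))

levels(toy_clean$species)
```

This is only one of several possible workflows; recoding functions from {dplyr} or {forcats} would work equally well.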
Be prepared to answer the question:
To the nearest thousand, how large is your cleaned dataset?
2. Temporal trends
Determine if there is a temporal trend in the number of observations for red and grey squirrels (2008-2017). Specifically, you should:
- Summarise the number of observations per species and per year. (That means a total number of red vs grey squirrels for each year.) A more complex analysis would also account for spatial autocorrelation and other factors, but as a preliminary analysis you are only asked to consider the total numbers at the national scale.
- Plot the data and run one linear model to answer the question: have squirrel populations increased or decreased over time, and is the trend the same for red and grey squirrels?
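The two steps above could be sketched as follows, using made-up records rather than the real data. The column names (year, species, count) and the use of an interaction term are assumptions for illustration, not the only valid way to set up the model.

```r
library(dplyr)

# Made-up observation records: two records per species per year
records <- data.frame(
  year = rep(2008:2011, each = 4),
  species = factor(rep(c("Red", "Red", "Grey", "Grey"), times = 4),
                   levels = c("Red", "Grey")),
  count = c(3, 2, 6, 4,  4, 3, 5, 4,  5, 4, 4, 3,  6, 5, 3, 2)
)

# One row per species and year, holding the total number of squirrels
yearly <- records %>%
  group_by(species, year) %>%
  summarise(total = sum(count), .groups = "drop")

# year * species fits a separate slope for each species;
# the year:species term tests whether the two trends differ
trend_model <- lm(total ~ year * species, data = yearly)
summary(trend_model)
```

In this kind of model, the coefficient for year is the slope for the reference level of species, and the interaction coefficient is the difference in slope for the other level.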
Be prepared to answer the questions:
- Which species showed the strongest change over time?
- What were your predictor variable(s) and their data type in the model?
- What is the adjusted R-squared of the regression?
- Considering the nature of our response variable, what modelling approach would be the most appropriate? (Don’t worry if you only ran a linear regression! It’s a justifiable approach for a preliminary analysis, and for such large numbers the results will be similar.)
Think about the following: what could be the reasons for this trend? Is it ecologically meaningful? Are there any biases in the data to be aware of?
3. Do red and grey squirrels prefer different habitats?
We usually think of grey squirrels as city dwellers, while red squirrels require extensive forest cover. Determine whether recent squirrel counts in OS grid cells (10km) are linked to forest cover in that cell. Specifically, you should:
- Filter the data to the period covering 2015-2017. Summarise the squirrel count data at the species and grid cell level. (You can sum counts across years; this is not ideal but since we’re only dealing with a few years of data this will give us a population index that allows for inconsistent sampling across years, hopefully without double-counting too much.) Remove observations greater than 300, as they distort the plots later (but feel free to experiment with different subsets!).
- Merge the squirrel and forest datasets
- Visualise the scatterplot of abundance as a function of forest cover for each species. Run one linear model (bonus: try a glm with the appropriate distribution) to test the relationship.
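A minimal sketch of this workflow on toy data is shown below. The grid references and numbers are invented, and the forest column names (OSGR as the key, cover for percentage cover) are assumptions: check the header of forestcoverOS.csv for the real ones. Applying the 300 threshold to the summed totals, rather than to individual records, is also an assumption.

```r
library(dplyr)

# Invented squirrel records
records <- data.frame(
  year    = c(2015, 2016, 2015, 2017, 2016, 2015, 2014, 2016),
  OSGR    = c("NT27", "NT27", "NT27", "NO40", "NO40", "NH54", "NH54", "NH54"),
  species = c("Red", "Red", "Grey", "Red", "Grey", "Red", "Red", "Grey"),
  count   = c(5, 7, 40, 55, 8, 90, 30, 350)
)

# Invented forest cover per grid cell (column names are assumed)
forest <- data.frame(
  OSGR  = c("NT27", "NO40", "NH54"),
  cover = c(5, 22, 48)
)

cell_totals <- records %>%
  filter(year >= 2015, year <= 2017) %>%
  group_by(OSGR, species) %>%
  summarise(total = sum(count), .groups = "drop") %>%
  filter(total <= 300)   # drop very large totals

# Attach the forest cover of each grid cell to the squirrel totals
merged <- inner_join(cell_totals, forest, by = "OSGR")

# Abundance as a function of forest cover, with a slope per species
cover_model <- lm(total ~ cover * species, data = merged)
summary(cover_model)
```

With real data, it is worth checking how many rows are lost in the join, since grid cells missing from either dataset are silently dropped by inner_join().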
Be prepared to answer the questions:
- Are red squirrels significantly associated with forested areas?
- Does the model explain the variation in the data well?
4. Re-classify forest cover
Building on the previous point, try turning the forest cover data into a categorical variable, and use the visual representation of your choice to display the median abundance of grey and red squirrels in these classes, and the uncertainty around these measures. Specifically, you should:
- Transform the cover data into a cover.class variable with the following bins:
- 0-10%
- 10-20%
- 20-30%
- 30-40%
- 40-50%
- 50+%
- Create your visualisation
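One way to build such bins is the base R cut() function, sketched here on invented values. The column names cover and count are assumptions, and so is the treatment of boundary values: with right = FALSE, a cell with exactly 10% cover falls in the 10-20% class.

```r
# Invented cover percentages and squirrel counts
cells <- data.frame(
  cover = c(0, 5, 10, 25, 49.9, 50, 100),
  count = c(2, 4, 6, 8, 10, 12, 20)
)

cells$cover.class <- cut(
  cells$cover,
  breaks = c(0, 10, 20, 30, 40, 50, Inf),
  labels = c("0-10%", "10-20%", "20-30%", "30-40%", "40-50%", "50+%"),
  right = FALSE,          # intervals are closed on the left: [0, 10), [10, 20), ...
  include.lowest = TRUE
)

# Median count per cover class (classes with no cells are omitted)
class_medians <- aggregate(count ~ cover.class, data = cells, FUN = median)
class_medians
```

A boxplot of count against cover.class is one visual that shows both the median and the spread around it.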
Be prepared to answer the question:
- In what cover classes are red squirrels more abundant than the grey?
Finished? Take the quiz!
Once you have a fully working script and have completed the specific tasks, take the quiz.
Help & hints
Here is a list of tutorials that might help you complete this challenge:
Need a hint? The questions below may help.
How do I remove unwanted data points?
You can specify a variety of logical statements in the filter() function from {dplyr}.
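For instance, here is a small sketch with invented data showing a few kinds of logical statements that filter() accepts. The column names and values are made up for the example.

```r
library(dplyr)

obs <- data.frame(
  year = c(2005, 2009, 2012, 2018),
  site = c("A", "B", "B", "C"),
  type = c("adult", "juvenile", "unknown", "adult")
)

kept <- obs %>%
  filter(
    year >= 2008, year <= 2017,        # conditions separated by commas are combined with AND
    site != "C",                       # exclude a value
    type %in% c("adult", "juvenile")   # keep only values from a set
  )

kept
```

Rows are kept only when every condition evaluates to TRUE, so rows where a condition evaluates to NA are dropped as well.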
I can't figure out how to replace NA values with something else.
NA values are something special in R, and there are special functions to handle them. Take a look at the is.na() logical function, and see if you can use it within a mutate call to create a new column based on existing values.
You’ll want mutate to replace the value in a cell IF the original value was NA, and ELSE you’ll want to keep the original value. Oh, hey, do you know the ifelse() function?
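Put together, the pattern might look like the sketch below, shown on an invented column named count.

```r
library(dplyr)

obs <- data.frame(count = c(3, NA, 1, NA))

# ifelse(test, value_if_true, value_if_false) works element by element:
# where count is NA, use 1; otherwise keep the original count
obs <- obs %>%
  mutate(count = ifelse(is.na(count), 1, count))

obs$count
```

Note that comparing with == does not work here, because NA == NA is itself NA; that is why is.na() is needed.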
We love getting your feedback, and will add more hints to this section if you get in touch and tell us where you struggled in this challenge!
Acknowledgements
We thank all the organisations that provided open access data for this challenge. The dataset licences are as follows:
- Scottish Wildlife Trust (2018). The Scottish Squirrel Database. Occurrence dataset [https://doi.org/10.15468/fqg0h3] under license CC-BY-4.0
- Forestry Commission (2018). National Forest Inventory Woodland Scotland 2017. Available at the Forestry Commission Open Data portal under the Open Government Licence: Crown copyright and database right 2018 Ordnance Survey [100021242]
- Charles Roper (2015). OSGB Grids in shapefile format. Available on GitHub under a CC-0 (public domain) license.
Get in touch
Bee in your bonnet? Technical issues? Don't hesitate to get in touch with any questions or suggestions concerning the course. Please keep in mind that this is a brand new course and we are still testing and implementing some features, so if you notice errors or some areas of the site are not working as they should, please tell us!