back to overview    

Before starting

Install software

Create a new project in RStudio

Do this only once for the entire practical part of the class and once again for your project in the end but not for each session separately  

If you want to visualize this tutorial in the viewer inside RStudio (to save space on your screen) run the following chunk of code:

install.packages("rstudioapi") # install an R-package required for this step
dir <- tempfile()
dir.create(dir)
download.file("https://gift.uni-goettingen.de/mcmmb/index.html", destfile = file.path(dir, "index.html"))
download.file("https://gift.uni-goettingen.de/mcmmb/Day1.html", destfile = file.path(dir, "Day1.html"))
htmlFile <- file.path(dir, "Day1.html")
rstudioapi::viewer(htmlFile)

Now you can conveniently copy code from the viewer into your script.  

1. Getting started

1.1. Create a data.frame

  1. Imagine a field study with 5 plots (1,2,3,…) in each of 3 vegetation types (forest, shrubland, grassland).
  2. Create a data.frame with 1 row per plot and 2 columns for “vegetation_type” and “plot”.
  3. Every second day of a month a student visits one of the plots. Add a third column called “day” indicating when the student visits each plot: 1, 3, 5, …
  4. In each plot, the species richness is counted.
Table 1: Dataframe exercise
vegetation_type plot day rich
forest 1 1 5
forest 2 3 7
forest 3 5 11
forest 4 7 18
forest 5 9 4
shrubland 1 11 18
shrubland 2 13 19
shrubland 3 15 13
shrubland 4 17 13
shrubland 5 19 1
grassland 1 21 4
grassland 2 23 4
grassland 3 25 14
grassland 4 27 8
grassland 5 29 15

 

What would be the R code needed to create such a table?
Species richness can be made up of random numbers.

Solution
plots <- data.frame(vegetation_type = rep(c("forest", "shrubland", "grassland"),
                                          each = 5),
                    plot = rep(1:5, 3))
plots$day <- seq(1, 29, by = 2)
set.seed(1)
plots$rich <- round(runif(n = nrow(plots), min = 0, max = 20), 0)


To subset part of the data, we use the operator [], or $ to access a specific column. For example, to get the average species richness in forest plots, we can run:

mean(plots[which(plots$vegetation_type == "forest"), ]$rich)
## [1] 9

1.2. Plots

We can plot the species richness against the vegetation type, using for example boxplots.

Generally speaking, there are two major ways of plotting in R: Base R and ggplot2. Except for raster data, we will use ggplot2 throughout the tutorials.  

library("ggplot2")
ggplot(plots, aes(vegetation_type, rich)) +
  geom_boxplot()

# Fine-tuning the boxplot
ggplot(plots, aes(vegetation_type, rich)) +
  geom_boxplot(aes(fill = vegetation_type)) +
  labs(title = "Basic boxplot",
       x = "Vegetation type", y = "Species richness") +
  theme_bw() +
  theme(legend.position = "bottom")

With base R code, we could do:

boxplot(plots$rich ~ plots$vegetation_type, main = "Basic boxplot",
        xlab = "Vegetation type", ylab = "Species richness")

2. Spatial polygons or shapefiles

In biogeography and macroecology, we often deal with spatial data. There are different types of spatial data, one of them is shapefiles. Here we are working with the global shapefile of the countries of the world.

The r library we need to work with shapefiles is sf. Make sure to have it installed and to load it in your session.

# install.packages("sf")
library("sf")
## Warning: package 'sf' was built under R version 4.4.2
## Linking to GEOS 3.12.2, GDAL 3.9.3, PROJ 9.4.1; sf_use_s2() is TRUE

2.1. Data and first plot

To get the polygons for all countries, we load two packages containing the data.

library("rnaturalearth")
library("rnaturalearthdata")
## 
## Attaching package: 'rnaturalearthdata'
## The following object is masked from 'package:rnaturalearth':
## 
##     countries110

These packages come with datasets, one of the world country shapes and one with coastlines. We can retrieve them like this:

world <- ne_countries(returnclass = "sf")
coastline <- ne_coastline(returnclass = "sf")

As with any object, we should look at its class and structure:

class(world)
## [1] "sf"         "data.frame"
class(coastline)
## [1] "sf"         "data.frame"
str(world)
str(coastline)

The geometry component of the object makes it spatial. This allows us to make a map of its contents, directly using ggplot2 with the geometry geom_sf().

Different geometries are possible (see a list here), and we can inspect it with the function st_geometry:

st_geometry(world)
## Geometry set for 177 features 
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 83.64513
## Geodetic CRS:  WGS 84
## First 5 geometries:
## MULTIPOLYGON (((180 -16.06713, 180 -16.55522, 1...
## MULTIPOLYGON (((33.90371 -0.95, 34.07262 -1.059...
## MULTIPOLYGON (((-8.66559 27.65643, -8.665124 27...
## MULTIPOLYGON (((-122.84 49, -122.9742 49.00254,...
## MULTIPOLYGON (((-122.84 49, -120 49, -117.0312 ...
st_geometry(coastline)
## Geometry set for 134 features 
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -85.60904 xmax: 180 ymax: 83.64513
## Geodetic CRS:  WGS 84
## First 5 geometries:
## LINESTRING (-163.7129 -78.59567, -163.1058 -78....
## LINESTRING (-6.197885 53.86757, -6.032985 53.15...
## LINESTRING (141.0002 -2.600151, 142.7352 -3.289...
## LINESTRING (114.204 4.525874, 114.6 4.900011, 1...
## LINESTRING (-93.61276 74.98, -94.15691 74.59235...

To plot a sf object with ggplot, we simply need to call the geometry geom_sf:

ggplot(world) +
  geom_sf()

With base R plot, we can run:

plot(st_geometry(world))

To add another sf layer to a plot, we can simply add a new geometry with a new data. Let’s try plotting the shapes of the world’s countries with the coastlines on top:

ggplot(world) +
  geom_sf(color = "grey50", fill = "grey80") +
  geom_sf(data = coastline, color = "black", fill = "blue") # fill useless here

2.2. Subset

We can plot only a subset of the `world’ object by applying a classic subset to the object. Let’s plot the countries that fall into the continent of North America:

ggplot(world[which(world$continent == "North America"), ]) +
  geom_sf()

Based on the columns subregion, we can also color each of these polygons based on their subregions:

ggplot(world[which(world$continent == "North America"), ]) +
  geom_sf(aes(fill = subregion))

Note that we are using the aesthetic fill to color the inside of each polygon. We could also use color to have different outlines based on a different variable. For example:

ggplot(world[which(world$continent == "North America"), ]) +
  geom_sf(aes(fill = pop_est,
              color = subregion), linewidth = 1) # increasing the outline width

We can also label each polygon based on its country name:

ggplot(world[which(world$continent == "North America"), ]) +
  geom_sf(aes(color = subregion,
              fill = pop_est)) +
  geom_sf_label(aes(label = name), size = 2)
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
## give correct results for longitude/latitude data

2.3. Colors

Choosing colors is an important topic, and they are some classic mistakes that should be avoided. First, a fair proportion of people are color blind and do not distinguish between certain hues. Second, some color are perceived better by the human eye, artificially biasing our interpretation of figures. The following figure taken from this page, illustrates nicely some of these issues:

Ocean seafloor age map. (a) The rainbow colour scheme adds randomly (at least) two strong artificial boundaries to the data: One along the red-yellow transition (at 0.4 of the colour bar range) and one along the blue-cyan transition (at 0.7 of the colour bar range). At the same time, it hides the local data gradients in the greenish parts of the colour map making it impossible to spot, for example, the large area of same age just north of the Tonga subduction zone. Moreover, it is unreadable for people with most forms of colour blindness, and when printed in black and white. The perceptually-uniform colour maps (b) lajolla and (c) vik prevent all of these problems, where the latter is so-called zero-centred and can be used to specifically highlight the old (brownish) and young (blueish) parts of the ocean floor with a boundary (white) occurring naturally in the middle of the data range.

Ocean seafloor age map. (a) The rainbow colour scheme adds randomly (at least) two strong artificial boundaries to the data: One along the red-yellow transition (at 0.4 of the colour bar range) and one along the blue-cyan transition (at 0.7 of the colour bar range). At the same time, it hides the local data gradients in the greenish parts of the colour map making it impossible to spot, for example, the large area of same age just north of the Tonga subduction zone. Moreover, it is unreadable for people with most forms of colour blindness, and when printed in black and white. The perceptually-uniform colour maps (b) lajolla and (c) vik prevent all of these problems, where the latter is so-called zero-centred and can be used to specifically highlight the old (brownish) and young (blueish) parts of the ocean floor with a boundary (white) occurring naturally in the middle of the data range.

So what to do? Instead of picking your own colors, pick a color set designed by a professional. These color palettes are are perceptually uniform, work with black and white, accessible to colorblind. Two of these palettes are RColorBrewer and viridis, which come with their own R packages. You can visualize the Color Brewer palette on this very nice website: https://colorbrewer2.org, and viridis is also introduced here.

# install.packages("RColorBrewer")
# install.packages("viridis")
library("RColorBrewer")
library("viridis")
## Loading required package: viridisLite

Both of these palettes can be implemented in a ggplot framework very easily:

# RColorBrewer
# Palette names are visible here: RColorBrewer::display.brewer.all()
ggplot(world) +
  geom_sf(aes(color = continent, fill = continent)) +
  scale_color_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Set1 palette from ColorBrewer")

# viridis
ggplot(world) +
  geom_sf(aes(color = continent, fill = continent)) +
  scale_color_viridis_d(option = "magma") +
  scale_fill_viridis_d(option = "magma") +
  labs(title = "Magma palette from viridis")

To go further on this topic, here are some resources: here and here

2.4. Projections

One mandatory consideration when dealing with spatial data is the Coordinate Reference System (CRS) to be used.

A CRS is made of a datum and a coordinate system. A datum is a reference system which allows the location of latitudes and longitudes (and heights) to be identified onto the surface of the Earth. The coordinate system is a model of our world or projection. CRS can be geographic, in which case they rely on latitudes and longitures (e.g. unprojected WGS84), or projected (mathematical translation to a 2D surface).

What we need to ensure is that our spatial objects have a CRS defined. To check if a spatial object has a CRS defined, we can run st_crs.

st_crs(world)
## Coordinate Reference System:
##   User input: WGS 84 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4326]]

As we can see in the last field, the EPSG code for our spatial object is 4326. EPSG is an acronym for European Petroleum Survey Group. This group publishes, among other things, data on coordinate systems on map projections and datums.
EPSG 4326 corresponds to the WGS84 datum (more info here), and is a geographic CRS, therefore providing latitudes and longitudes, it is not a projected CRS.

Side-note
WGS84 is the most common CRS in the world so is a convenient baseline for transforming other data to to be consistent. Commonly used by organizations that provide GIS data for the entire globe or across many countries. It is the CRS used by Google Earth as the default.

EPSG identifier A CRS can be defined using a so-called proj4string or a unique EPSG identifier. EPSG identifiers can be searched here: epsg.io, or here: spatialreference.org.

With our map based on the unprojected WGS84, we can observe a high distorsion of country size towards high latitudes. This is also visible on this nice website: https://thetruesize.com/.

Other projections offer different results.
When using a projection, we are making a 2D representation of a 3D volume. Converting a sphere/ellipsoid to a flat surface results in distortions, and this implies some trade-offs. You can find many nice resources describing these trade-offs, for example here or here but, in a nutshell, projections can distort four properties: shape, area, distance and direction.

Projections can preserve the shape of countries, like the famous Mercator projection, the area (e.g. Eckert IV or Mollweide), the distance between two points (e.g. Plate Carrée) or be a compromise (e.g. Winkel Tripel).

We can map our shapefile according to another projection in two different ways. Either we use do it within the ggplot2 framework using the function coord_sf, or we directly transform the data before plotting using st_transform.
Below are two ways of doing this with a the Mollweide projection, which is an equal-area projection preserving the areas of countries.

mollweide <- "+proj=moll +lon_0=0 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs"
ggplot(world) +
  geom_sf(aes(fill = continent, color = continent),
          show.legend = FALSE) +
  scale_color_viridis_d(option = "magma") +
  scale_fill_viridis_d(option = "magma") +
  coord_sf(crs = mollweide) +
  labs(title = "Mollweide (equal-area) projection") +
  theme_bw()

Or, if we directly transform the data:

world_mollweide <- st_transform(world, crs = mollweide)

ggplot(world_mollweide) +
  geom_sf(aes(fill = continent, color = continent),
          show.legend = FALSE) +
  scale_color_viridis_d(option = "magma") +
  scale_fill_viridis_d(option = "magma") +
  labs(title = "Mollweide (equal-area) projection") +
  theme_bw()

Chosing a projection depends on the area we are interested in, and on the distortion properties. Here are two links that can help when chosing a projection: here and here.

Below are the identifiers/proj4strings of several widely used projections.

# Conformal
mercator <- 3857 # EPSG identifier
# alternatively, we could use this format:
mercator <- "+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs +type=crs"

# Equal-area
eckertIV <- "+proj=eck4 +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs" 
mollweide <- "+proj=moll +lon_0=0 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs"
# https://semba-blog.netlify.app/01/26/2020/world-map-and-map-projections/

# Equidistant
plate_carree <- "+proj=eqc +lat_ts=0 +lat_0=0 +lon_0=0 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +type=crs"

# Compromise
robinson <- "+proj=robin +lon_0=0 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +type=crs"
wink <- "+proj=wintri +lon_0=0 +lat_1=50.467 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +type=crs" # used by National Geographic

# One suited for North America
albers <- "+proj=aea +lat_0=40 +lon_0=-96 +lat_1=20 +lat_2=60 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +type=crs"

# Orange-peel
good <- "+proj=igh +lon_0=0 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +type=crs"

# Globe view
globe <- "+proj=laea +x_0=0 +y_0=0 +lon_0=-74 +lat_0=40"

And here is the code to produce a combined plot with 4 different projections:

library("cowplot") # R package used to combine plots

# Combined plot
combined_plot <- plot_grid(
  ggplot(world) +
    geom_sf(aes(fill = continent, color = continent),
            show.legend = FALSE) +
    scale_color_viridis_d(option = "magma") +
    scale_fill_viridis_d(option = "magma") +
    labs(title = "Plate Carrée (equidistant along meridians)") +
    theme_bw(),
  ggplot(world) +
    geom_sf(aes(fill = continent, color = continent),
            show.legend = FALSE) +
    scale_color_viridis_d(option = "magma") +
    scale_fill_viridis_d(option = "magma") +
    coord_sf(crs = mollweide) +
    labs(title = "Mollweide (equal-area projection)") +
    theme_bw(),
  ggplot(world) +
    geom_sf(aes(fill = continent, color = continent),
            show.legend = FALSE) +
    scale_color_viridis_d(option = "magma") +
    scale_fill_viridis_d(option = "magma") +
    coord_sf(crs = robinson) +
    labs(title = "Robinson (compromise)") +
    theme_bw(),
  ggplot(world) +
    geom_sf(aes(fill = continent, color = continent),
            show.legend = FALSE) +
    scale_color_viridis_d(option = "magma") +
    scale_fill_viridis_d(option = "magma") +
    coord_sf(crs = globe) +
    labs(title = "Globe") +
    theme_bw(),
  nrow = 2)

combined_plot

2.5. Saving a plot

To export a plot, it is always better to use functions such as ggsave rahter than to use the Export button in the plot window. Functions allow you to change important parameters, such as the dpi, which is usually a requirement when submitting manuscripts to a journal, in a reproducible way.

# Saving plot
ggsave(filename = "figures/combined_world_maps.png", plot = combined_plot, width = 20, height = 10, units = "cm", dpi = 300)

3. Rasters

Rasters constitute another type of spatial objects. They are spatial matrices made of cells at a certain resolution. They are widely used in biogeography and macroecology, mostly to store environmental data.

In R, we handle them with the terra library.

3.1. Load raster data

Loading and visualising one environmental raster layer
Chelsa - Climate data http://chelsa-climate.org/
Annual mean temperature (CHELSA_bio10_01.tif)
 

Extract and copy the rasters into your R-project folder ../data/

Load terra library

# install.packages("terra") # install the terra package in case not done yet
library("terra")

 

Import climate data from Chelsa

temperature <- rast("data/CHELSA_bio10_01.tif")

3.2. Properties and plot

plot(temperature, main = "Temperature")  # Temperature (*10 °C)

 

Have a look at characteristics of the raster layers
Like for shapefiles, raster should have a CRS defined.

crs(temperature)
## [1] "GEOGCRS[\"WGS 84\",\n    ENSEMBLE[\"World Geodetic System 1984 ensemble\",\n        MEMBER[\"World Geodetic System 1984 (Transit)\"],\n        MEMBER[\"World Geodetic System 1984 (G730)\"],\n        MEMBER[\"World Geodetic System 1984 (G873)\"],\n        MEMBER[\"World Geodetic System 1984 (G1150)\"],\n        MEMBER[\"World Geodetic System 1984 (G1674)\"],\n        MEMBER[\"World Geodetic System 1984 (G1762)\"],\n        MEMBER[\"World Geodetic System 1984 (G2139)\"],\n        MEMBER[\"World Geodetic System 1984 (G2296)\"],\n        ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n            LENGTHUNIT[\"metre\",1]],\n        ENSEMBLEACCURACY[2.0]],\n    PRIMEM[\"Greenwich\",0,\n        ANGLEUNIT[\"degree\",0.0174532925199433]],\n    CS[ellipsoidal,2],\n        AXIS[\"geodetic latitude (Lat)\",north,\n            ORDER[1],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n        AXIS[\"geodetic longitude (Lon)\",east,\n            ORDER[2],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n    USAGE[\n        SCOPE[\"Horizontal component of 3D system.\"],\n        AREA[\"World.\"],\n        BBOX[-90,-180,90,180]],\n    ID[\"EPSG\",4326]]"

Raster are spatial matrices, meaning they have a certain number of cells, rows and columns.

ncol(temperature)
## [1] 43200
nrow(temperature)
## [1] 20880
ncell(temperature)
## [1] 902016000
ncol(temperature) * nrow(temperature) == ncell(temperature)
## [1] TRUE

The spatial extent can be accessed with:

ext(temperature)
## SpatExtent : -180.00013888885, 179.99985967115, -90.00013888885, 83.99986041515 (xmin, xmax, ymin, ymax)

Each cell has a certain resolution, which we can access with the following function:

res(temperature)
## [1] 0.008333333 0.008333333

The unit is decimal degree. One arc-degree is equivalent to 3600 arc-seconds, so 0.008333333 decimal degrees do correspond to 30 arc-second, which is roughly 1 km at the Equator.
An online converter is accessible here.

 

The resolution of a raster layer can be coarsened. Here we coarsen it by a factor 60.

temperature_30min <- aggregate(temperature, fact = 60, fun = mean, na.rm = TRUE)
dim(temperature)
## [1] 20880 43200     1
dim(temperature_30min)
## [1] 348 720   1
res(temperature)
## [1] 0.008333333 0.008333333
res(temperature_30min)
## [1] 0.5 0.5

We now have a resolution of 0.5 degree, so 1800 arc-second, so 1800/60=30 arc minute, which is roughly 55km at the Equator.

You can do calculations with raster layers

# Careful! this is computer intense and might take a long time
temperature_30min <- temperature_30min/10      
plot(temperature_30min, main = "Temperature",
     col = viridis(99, option = "C")) # Temperature (°C)
points(9.916, 51.541, pch = 4) # Göttingen
text(9.916, 51.541, "You are here", pos = 4)

As for shapefiles, raster layers can be reprojected. We here reproject the temperature raster to the Mollweide projection.

temperature_mollweide <- project(x = temperature_30min, y = mollweide)

plot(temperature_mollweide, main = "Mollweide projection",
     col = viridis(99, option = "C"))