Before starting
If you would like to see this tutorial in your viewer within RStudio, please execute the following chunk of code:

install.packages("rstudioapi") # installs a required R-package
dir <- tempfile()
dir.create(dir)
download.file("https://gift.uni-goettingen.de/community/index.html",
              destfile = file.path(dir, "index.html"))
download.file("https://gift.uni-goettingen.de/community/lesson6.html",
              destfile = file.path(dir, "lesson6.html"))
htmlFile <- file.path(dir, "lesson6.html")
rstudioapi::viewer(htmlFile)

In this practical, we will perform an Ordination and a Cluster analysis on the BCI plots to see in a multivariate space which plots are closer with each other regarding their species composition.

How to classify these 6 plots?

We will use two datasets for this practical. The first one is the same than for the practical 2, i.e. the Barro Colorado Island (BCI) forest-plot inventory in Panama. Refer to the practical 2 for more details regarding this dataset. The second one is introduced in the first section.

Let’s first start as usual by importing the BCI dataset.

BCI <- read.table("data/BCI_Data.csv",
                  header = TRUE, # does the file have column names
                  sep = ",", # what separates the column
                  row.names = 1) # does the file have row names
BCI <- t(BCI)

# Presence/absence table
BCI_bin <- BCI
BCI_bin[BCI_bin > 0] <- 1

# Loading vegan package
library(vegan)

1. Non-Metric Multidimensional Scaling (aka NMDS, MDS, NMS)

The goal of NMDS is to represent the position of communities in multidimensional space as accurately as possible using a reduced number of dimensions that can be easily visualized.

NMDS does not use the absolute abundances of species in communities, but rather their rank orders and as a result is a flexible technique that accepts a variety of types of data. The “non-metric” part of the name comes from this characteristic.

1.1. Distance matrix

To do

Calculate the abundance-based versions of the Sorensen and Jaccard indexes using the vegdist() function. (Remember that vegdist() returns the “dissimilarity” of these indices):

Solution

# Abundance-based Sorensen dissimilarity: Bray Curtis index
sorBCI <- vegdist(BCI, method = "bray") 
as.matrix(sorBCI)[1:3, 1:3]

##           Plot1     Plot2     Plot3
## Plot1 0.0000000 0.2706682 0.3501647
## Plot2 0.2706682 0.0000000 0.2873051
## Plot3 0.3501647 0.2873051 0.0000000

# Abundance-based Jaccard dissimilarity
jacBCI <- vegdist(BCI, method = "jaccard") 
as.matrix(jacBCI)[1:3, 1:3]

##           Plot1     Plot2     Plot3
## Plot1 0.0000000 0.4260250 0.5186992
## Plot2 0.4260250 0.0000000 0.4463668
## Plot3 0.5186992 0.4463668 0.0000000

Calculate the Simpson index using the betadiver() function.

Solution

simBCI <- betadiver(BCI, method = "sim") # Simpson dissimilarity
as.matrix(simBCI)[1:3, 1:3]

##           Plot1     Plot2     Plot3
## Plot1 0.0000000 0.2380952 0.2888889
## Plot2 0.2380952 0.0000000 0.2619048
## Plot3 0.2888889 0.2619048 0.0000000

Plot the 3 beta diversity measures against each other.

Solution

par(mfrow = c(1, 3))
plot(jacBCI, sorBCI, xlab = "Jaccard", ylab = "Sorensen", 
     pch = 16, col = rgb(0, 0, 0, 0.1))
plot(jacBCI, simBCI, xlab = "Jaccard", ylab = "Simpson", 
     pch = 16, col = rgb(0, 0, 0, 0.1))
plot(sorBCI, simBCI, xlab = "Sorensen", ylab = "Simpson", 
     pch = 16, col = rgb(0, 0, 0, 0.1))

1.2. NMDS (Non-Metric multiDimensional Scaling)

To run the NMDS, we will use the function metaMDS() from vegan. The first argument of the functions is either a Community dataset or a dissimilarity dataset (i.e. a “distance” object, like the ones we created above). The function also needs a number of dimensions (k) as argument.

BCI_NMDS <- metaMDS(jacBCI, k = 2)

## Run 0 stress 0.1741506 
## Run 1 stress 0.185232 
## Run 2 stress 0.1838078 
## Run 3 stress 0.1880655 
## Run 4 stress 0.1741506 
## ... Procrustes: rmse 2.57383e-05  max resid 0.0001457721 
## ... Similar to previous best
## Run 5 stress 0.1792232 
## Run 6 stress 0.1872379 
## Run 7 stress 0.1741506 
## ... New best solution
## ... Procrustes: rmse 2.112085e-05  max resid 0.0001306282 
## ... Similar to previous best
## Run 8 stress 0.1880654 
## Run 9 stress 0.1840393 
## Run 10 stress 0.1920244 
## Run 11 stress 0.1840393 
## Run 12 stress 0.1787908 
## Run 13 stress 0.1791452 
## Run 14 stress 0.1862914 
## Run 15 stress 0.1963988 
## Run 16 stress 0.1838079 
## Run 17 stress 0.1929295 
## Run 18 stress 0.1792232 
## Run 19 stress 0.1852319 
## Run 20 stress 0.1838131 
## *** Best solution repeated 1 times

BCI_NMDS

## 
## Call:
## metaMDS(comm = jacBCI, k = 2) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     jacBCI 
## Distance: jaccard 
## 
## Dimensions: 2 
## Stress:     0.1741506 
## Stress type 1, weak ties
## Best solution was repeated 1 time in 20 tries
## The best solution was from try 7 (random start)
## Scaling: centring, PC rotation, halfchange scaling 
## Species: scores missing

# Visualizing the Stress Value
BCI_NMDS$stress

## [1] 0.1741506

The differences between the 2D configuration and actual distances are output as a stress value, indicating how well the reduction to 2 dimensions worked. Rule of thumb: stress > 0.05 provides excellent representation in the reduced space, > 0.1 very good, > 0.2 good, and stress > 0.3 provides poor representation. According to Clarke (1993), stress values as low as 0.2 should be treated with caution.

If the stress value is too high, more dimensions are needed. How many axes are needed to maximize the quality of the space can be determined with the help of a screeplot.

# Manual for-loop to create the screeplot
scree_plot <- c()
for(stress in 1:12){
  tmp_NMDS <- metaMDS(jacBCI, k = stress)
  scree_plot <- rbind(scree_plot,
                      data.frame(k = stress,
                                 stress = tmp_NMDS$stress))
}

plot(scree_plot$k, scree_plot$stress, pch = 16, type = "b",
     xlab = "Number of Dimensions", ylab = "Stress",
     main = "NMDS stress plot")
abline(h = 0.2, lty = 2) # acceptable threshold for NMDS

Let us compare the Shepard diagrams for an NMDS with \(k=2\) dimensions and \(k=12\) dimensions.
A Shepard diagram is a plot opposing the observed dissimilarity for each pair of plots to the dissimilarity within the k-dimensional space.

# Let's also perform the NMDS with 12 axes
BCI_NMDS12 <- metaMDS(jacBCI, k = 12)

## Run 0 stress 0.02748627 
## Run 1 stress 0.02840113 
## Run 2 stress 0.02766044 
## ... Procrustes: rmse 0.02364471  max resid 0.07398426 
## Run 3 stress 0.02774047 
## ... Procrustes: rmse 0.02827973  max resid 0.06443561 
## Run 4 stress 0.02762685 
## ... Procrustes: rmse 0.01914491  max resid 0.04668225 
## Run 5 stress 0.02839741 
## Run 6 stress 0.02870925 
## Run 7 stress 0.02775243 
## ... Procrustes: rmse 0.02708146  max resid 0.0803032 
## Run 8 stress 0.02797043 
## ... Procrustes: rmse 0.0236498  max resid 0.04871824 
## Run 9 stress 0.02746687 
## ... New best solution
## ... Procrustes: rmse 0.02002674  max resid 0.05867485 
## Run 10 stress 0.02771866 
## ... Procrustes: rmse 0.01980682  max resid 0.04564467 
## Run 11 stress 0.02807836 
## Run 12 stress 0.0279434 
## ... Procrustes: rmse 0.01927751  max resid 0.04076382 
## Run 13 stress 0.02754685 
## ... Procrustes: rmse 0.02054401  max resid 0.05706231 
## Run 14 stress 0.02741558 
## ... New best solution
## ... Procrustes: rmse 0.00980684  max resid 0.02463436 
## Run 15 stress 0.02833609 
## Run 16 stress 0.02802631 
## Run 17 stress 0.02741791 
## ... Procrustes: rmse 0.00782252  max resid 0.01867643 
## Run 18 stress 0.02828502 
## Run 19 stress 0.02794789 
## Run 20 stress 0.02788353 
## ... Procrustes: rmse 0.02983545  max resid 0.06984518 
## *** Best solution was not repeated -- monoMDS stopping criteria:
##     20: no. of iterations >= maxit

# Shepard diagrams confront (2 axes vs 12 axes)
par(mfrow = c(1, 2))
stressplot(BCI_NMDS, main = "2 axes")
stressplot(BCI_NMDS12, main = "12 axes")

Now we can plot the NMDS. To make it clear that the NMDS is based on a distance matrix, we have applied the metaMDS() function to a distance matrix so far. However, the function can also be applied directly to the species-by-sites matrix and will then calculate the distances internally for a desired distance measure. This has the advantage that the species are available for plotting.

BCI_NMDS <- metaMDS(BCI, k = 2, distance = "jaccard")

We can now plot the NMDS, with 2 axes based on Jaccard distance.

par(mfrow = c(1, 1))
plot(BCI_NMDS)#, type = "t") # type = 't' adds text labels.

# sites only
plot(BCI_NMDS, display = "sites", type = "t", main = "Goodness of fit")
# size reflecting goodness of fit per site (bigger = worse fit)
points(BCI_NMDS, display = "sites", cex = goodness(BCI_NMDS)*200) 
title(main = "Goodness of fit")

# Ordination plot function especially for congested plots
ordiplot(BCI_NMDS, type = "n")
# adding text or points to ordination plots
orditorp(BCI_NMDS, display = "species", col = "firebrick3", air = 0.01)
orditorp(BCI_NMDS, display = "sites", cex = 1.25, air = 0.01)

1.3. Add environmental information

Although NMDS is a non-metric ordination procedure, the result (i.e., the coordinates of species and “sites” along the axes) is metric. Environmental variables can therefore be correlated with the axes to investigate whether they influence species assemblages.

We want to look at soil data for the BCI subplots for this purpose. Please download the BCI soil data (BCI_soil.txt from Stud-IP ) and put it in your data directory in your project folder.

# Import data
BCIsoil <- read.table("data/environment/soil/BCI_soil.txt", row.names = 1)
head(BCIsoil)

##     x   y        Al       B       Ca      Cu       Fe         K       Mg
## 1  50  50  901.0908 0.79448 1680.021 6.20312 135.2870 141.88128 279.1291
## 2  50 150  954.2488 0.66968 1503.365 6.03148 141.8080 137.23932 280.4524
## 3  50 250 1114.1122 0.59516 1182.311 6.79768 157.0878  98.69056 230.3973
## 4  50 350 1023.5793 0.56780 1558.020 6.63400 153.1746  98.36412 228.9468
## 5  50 450 1001.8848 0.39876 1242.251 6.44428 149.2509  94.07208 202.6820
## 6 150  50 1091.4672 0.73120 1441.977 6.49552 173.8682 131.89280 276.5010
##         Mn       P      Zn        N   N.min.      pH
## 1 266.9997 1.95248 2.96948 18.46500 -3.88544 4.32432
## 2 320.4786 2.24740 2.53208 21.59896  5.64388 4.37548
## 3 445.0708 1.95484 2.24672 20.24516 -4.06408 4.34700
## 4 407.7580 2.63444 2.44284 20.84232  7.89012 4.46112
## 5 250.5403 1.86356 2.13748 16.94500  8.53716 4.40128
## 6 477.3249 1.61612 2.63148 20.29812  4.38948 4.57252

You can see that each plot is associated with a 50x50m quadrat containing various edaphic variables. As an example, we can visualize how are distributed pH values across BCI plot:

# Visualization
BCIsoil[1:10, c("x", "y", "pH")] # subsetting the coordinates and ph column

##      x   y      pH
## 1   50  50 4.32432
## 2   50 150 4.37548
## 3   50 250 4.34700
## 4   50 350 4.46112
## 5   50 450 4.40128
## 6  150  50 4.57252
## 7  150 150 4.55972
## 8  150 250 4.41168
## 9  150 350 4.53336
## 10 150 450 4.55500

matrix(BCIsoil$pH, ncol = 5, byrow = TRUE) # matrix with the pH values we want to plot

##          [,1]    [,2]    [,3]    [,4]    [,5]
##  [1,] 4.32432 4.37548 4.34700 4.46112 4.40128
##  [2,] 4.57252 4.55972 4.41168 4.53336 4.55500
##  [3,] 4.71792 4.29640 4.12084 4.12820 4.15760
##  [4,] 4.50452 4.46092 4.05204 4.24652 4.33736
##  [5,] 4.77492 4.67260 4.29220 4.69712 4.74796
##  [6,] 4.96144 4.97480 4.84068 4.75024 4.75976
##  [7,] 5.00428 5.15908 5.05812 5.05132 4.85808
##  [8,] 4.93364 4.93540 4.88152 4.93488 4.77104
##  [9,] 4.89252 4.82900 4.82804 4.94864 5.02364
## [10,] 4.68412 4.52992 4.87296 5.00388 5.02052

# Generate a Countour plot based on pH 
filled.contour(x = seq(50, 950, by = 100), y = seq(50, 450, by = 100), 
               matrix(BCIsoil$pH, ncol = 5, byrow = TRUE), 
               color.palette = terrain.colors, 
               main = "pH", xlab ="(m)", ylab ="(m)")

We now want to “fit” the soil variables to the NMDS axes.

ef <- envfit(BCI_NMDS,
             BCIsoil[, c("Al","B","Ca","Cu","Fe","K","Mg","Mn",
                         "P","Zn","N","N.min.","pH")],
             permu = 999)
ef

## 
## ***VECTORS
## 
##           NMDS1    NMDS2     r2 Pr(>r)    
## Al      0.92417 -0.38198 0.3715  0.001 ***
## B      -0.77420  0.63294 0.5818  0.001 ***
## Ca     -0.79155  0.61111 0.3943  0.001 ***
## Cu     -0.99143  0.13064 0.3902  0.001 ***
## Fe     -0.72750  0.68610 0.1838  0.007 ** 
## K      -0.80100  0.59867 0.4940  0.001 ***
## Mg     -0.69935  0.71478 0.2487  0.002 ** 
## Mn     -0.98730  0.15890 0.3459  0.001 ***
## P       0.71354  0.70061 0.2868  0.001 ***
## Zn     -0.53277  0.84626 0.4972  0.001 ***
## N      -0.56293  0.82650 0.2637  0.002 ** 
## N.min. -0.42872  0.90344 0.4123  0.001 ***
## pH     -0.68106  0.73223 0.7380  0.001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

Mean concentrations of soil nutrients in 50x50 m quadrats were fitted to the ordinations to test the relationships between species composition and soil nutrients. The orientation of arrows indicates the direction in ordination space in which the soil variables change most rapidly and in which they have maximum correlation with the ordination configuration, whereas the length of the arrows indicates the rate of change.

plot(BCI_NMDS, display = "sites")
plot(ef, p.max = 0.1)

2. Classification

In a Hierarchical Cluster Analysis, the plots are either gradually merged into groups (agglomerative; bottom-up) or divided into groups (divisive, top-down). Here, there are different procedures that use different distance and similarity measures to maximize differences between clusters or similarity within clusters. In contrast to non-hierarchical cluster analyses, this results in a nested structure (smaller clusters fall into larger clusters) that can be represented as a dendrogram. The number of clusters does not have to be defined beforehand, but can be chosen based on the dendrogram.

2.1. Hierarchical clustering

# Agglomerative: Maximizes the distance between clusters
BCIclust <- hclust(jacBCI, method = "complete")
plot(BCIclust, xlab = "", cex = 0.7, ann = FALSE)
abline(h = 0.78, lwd = 2, col = "black", lty = 2)

The dendrogram can be cut at a certain height to get an appropriate number of clusters (black dashed line). This can also be optimized to get the ideal number of clusters, but we won’t go into that here. Instead, we manually specify three clusters.

BCIclust_3 <- cutree(BCIclust, k = 3)
# matrix(names(BCIclust_3), ncol = 5, byrow = TRUE)
# How many plots per cluster?
table(BCIclust_3)

## BCIclust_3
##  1  2  3 
## 26 19  5

# Visualization
# See https://colorbrewer2.org/ for better color gradients
image(x = seq(50, 950, by = 100), y = seq(50, 450, by = 100), 
      z = matrix(BCIclust_3, ncol = 5, byrow = TRUE), 
      col = c("#fc8d59", "#ffffbf", "#91bfdb"), 
      xlab = "(m)", ylab = "(m)", main = "Clusters on BCI")
# add legend
legend(x = 900, y = 500,
       # grconvertX(1, "device"), grconvertY(1, "device"), 
       legend = c("Cl. 1", "Cl. 2", "Cl. 3"),
       fill = c("#fc8d59", "#ffffbf", "#91bfdb"), xpd = NA)

2.3. Clusters on NMDS

We can now mark the plots in the NMDS according to their clusters. Where are the identified clusters on the NMDS visualization?

plot(BCI_NMDS, display = "sites")
ordihull(BCI_NMDS, groups = BCIclust_3,
         draw = "polygon", col = c("#fc8d59", "#ffffbf", "#91bfdb"),
         label = FALSE)
points(BCI_NMDS, display = "sites", pch = 21, col = "black",
       bg = c("#fc8d59", "#ffffbf", "#91bfdb")[BCIclust_3])
legend(x = 0.4, y = -0.2, legend = c(paste0("Cl. ", seq(1:3))),
       col = c("#fc8d59", "#ffffbf", "#91bfdb"), pch = 16)