A tutorial in VEGAN

an R package for community analysis

The vegan package provides tools for descriptive community ecology. It includes most of the basic functions for:

  • diversity analysis
  • community ordination
  • dissimilarity analysis

In this tutorial, we will briefly explore the breadth of the package, dive into basic diversity analysis, and explore ordination of multivariate datasets.

If you haven’t done so already, please install vegan:

#install.packages("vegan")

Documentation

Consider visiting the vegan documentation to learn about this package.

Datasets

We will be using a few datasets native to vegan.

library(vegan)
## Warning: package 'vegan' was built under R version 3.3.3
## Loading required package: permute
## Warning: package 'permute' was built under R version 3.3.3
## Loading required package: lattice
## This is vegan 2.4-3
data(package = "vegan") ## names of data sets in the package
data(dune) # Vegetation and Environment in Dutch Dune Meadows
str(dune) #a data frame of observations of 30 species at 20 sites
## 'data.frame':    20 obs. of  30 variables:
##  $ Achimill: num  1 3 0 0 2 2 2 0 0 4 ...
##  $ Agrostol: num  0 0 4 8 0 0 0 4 3 0 ...
##  $ Airaprae: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Alopgeni: num  0 2 7 2 0 0 0 5 3 0 ...
##  $ Anthodor: num  0 0 0 0 4 3 2 0 0 4 ...
##  $ Bellpere: num  0 3 2 2 2 0 0 0 0 2 ...
##  $ Bromhord: num  0 4 0 3 2 0 2 0 0 4 ...
##  $ Chenalbu: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Cirsarve: num  0 0 0 2 0 0 0 0 0 0 ...
##  $ Comapalu: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Eleopalu: num  0 0 0 0 0 0 0 4 0 0 ...
##  $ Elymrepe: num  4 4 4 4 4 0 0 0 6 0 ...
##  $ Empenigr: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyporadi: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Juncarti: num  0 0 0 0 0 0 0 4 4 0 ...
##  $ Juncbufo: num  0 0 0 0 0 0 2 0 4 0 ...
##  $ Lolipere: num  7 5 6 5 2 6 6 4 2 6 ...
##  $ Planlanc: num  0 0 0 0 5 5 5 0 0 3 ...
##  $ Poaprat : num  4 4 5 4 2 3 4 4 4 4 ...
##  $ Poatriv : num  2 7 6 5 6 4 5 4 5 4 ...
##  $ Ranuflam: num  0 0 0 0 0 0 0 2 0 0 ...
##  $ Rumeacet: num  0 0 0 0 5 6 3 0 2 0 ...
##  $ Sagiproc: num  0 0 0 5 0 0 0 2 2 0 ...
##  $ Salirepe: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Scorautu: num  0 5 2 2 3 3 3 3 2 3 ...
##  $ Trifprat: num  0 0 0 0 2 5 2 0 0 0 ...
##  $ Trifrepe: num  0 5 2 1 2 5 2 2 3 6 ...
##  $ Vicilath: num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Bracruta: num  0 0 2 2 2 6 2 2 2 2 ...
##  $ Callcusp: num  0 0 0 0 0 0 0 0 0 0 ...

Diversity

Explore the utility of the diversity function

diversity(dune, index = "simpson") # calculate Simpson's 1-D index of diversity for each site; values closer to 1 = greater diversity
##         1         2         3         4         5         6         7 
## 0.7345679 0.8900227 0.8787500 0.9007407 0.9140076 0.9001736 0.9075000 
##         8         9        10        11        12        13        14 
## 0.9087500 0.9115646 0.9031909 0.8671875 0.8685714 0.8521579 0.8333333 
##        15        16        17        18        19        20 
## 0.8506616 0.8429752 0.8355556 0.8614540 0.8740895 0.8678460
simpson <- diversity(dune, "simpson") # or assign to var.
simpson 
##         1         2         3         4         5         6         7 
## 0.7345679 0.8900227 0.8787500 0.9007407 0.9140076 0.9001736 0.9075000 
##         8         9        10        11        12        13        14 
## 0.9087500 0.9115646 0.9031909 0.8671875 0.8685714 0.8521579 0.8333333 
##        15        16        17        18        19        20 
## 0.8506616 0.8429752 0.8355556 0.8614540 0.8740895 0.8678460
shannon <- diversity(dune) # note that Shannon's is default
shannon #Typically ranges from 1.5 - 3.4, higher = more diverse 
##        1        2        3        4        5        6        7        8 
## 1.440482 2.252516 2.193749 2.426779 2.544421 2.345946 2.471733 2.434898 
##        9       10       11       12       13       14       15       16 
## 2.493568 2.398613 2.106065 2.114495 2.099638 1.863680 1.979309 1.959795 
##       17       18       19       20 
## 1.876274 2.079387 2.134024 2.048270
# let's compare the two
par(mfrow = c(1, 2))  # use par to generate panels with 1 row of 2 graphs
hist(simpson)
hist(shannon)
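
As a quick side note, vegan's specnumber returns the species richness of each site, and dividing the Shannon values above by the log of richness gives Pielou's evenness. A minimal sketch:

S <- specnumber(dune) # species richness (number of species) per site
J <- shannon/log(S)   # Pielou's evenness: Shannon diversity relative to its maximum
J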

Next we can calculate all of the pairwise dissimilarity (distance) measures between sites, based on their species composition, using the function vegdist.

vegdist computes dissimilarity indices. We are using the Gower and Bray-Curtis indices, which are good at detecting underlying ecological gradients.

Both indices are used to quantify the compositional dissimilarity between two different sites. They are bounded between 0 and 1, where 0 = same composition, 1 = maximally dissimilar.

par(mfrow = c(1, 2))
bray = vegdist(dune, "bray") 
gower = vegdist(dune, "gower")
hist(bray, xlim = range(0.0,1.0))
hist(gower, xlim = range(0.0,1.0))

# R lets us compute several dissimilarity indices and compare their frequency distributions of between-site differences

Dissimilarity analysis is a good way to explore variability in community composition. The next step would be some sort of cluster analysis to see where community associations exist; however, we’re going to switch gears now.
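
If you do want to follow that thread, here is a minimal sketch of one such cluster analysis, using base R's hclust on the Bray-Curtis distances computed above (average linkage is just one of several reasonable choices):

bray_clust <- hclust(bray, method = "average") # hierarchical clustering of sites
plot(bray_clust, main = "Cluster dendrogram (Bray-Curtis)", xlab = "Site", sub = "")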

Rarefaction

Rarefaction is a technique to assess expected species richness.

Rarefaction allows the calculation of species richness for a given number of individual samples, based on the construction of rarefaction curves.

The issue that occurs when sampling various species in a community is that the larger the number of individuals sampled, the more species will be found. Rarefaction curves are created by randomly re-sampling the pool of N samples multiple times and then plotting the average number of species found in each sample (1, 2, …, N). “Thus rarefaction generates the expected number of species in a small collection of n individuals (or n samples) drawn at random from the large pool of N samples.” Rarefaction curves generally grow rapidly at first, as the most common species are found, but they plateau as only the rarest species remain to be sampled.
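
To make the resampling idea concrete, here is a minimal, purely illustrative sketch of rarefying a single site by hand (the rarefy function used below computes the same expectation analytically):

counts <- unlist(dune[1, ])                  # species abundances at site 1
pool <- rep(names(counts), times = counts)   # one entry per individual observed
richness <- replicate(1000, length(unique(sample(pool, 15)))) # draw 15 individuals, count species
mean(richness) # should be close to the rarefied richness rarefy() reports for site 1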

To try some rarefaction, we use the rarefy and rarecurve functions.

spAbund <- rowSums(dune)  #gives the number of individuals found in each plot
spAbund # view observations per plot 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
## 18 42 40 45 43 48 40 40 42 43 32 35 33 24 23 33 15 27 31 31
raremin <- min(rowSums(dune))  #rarefaction uses the smallest number of observations per sample to extrapolate the expected number if all other samples only had that number of observations
raremin # view smallest # of obs (site 17)
## [1] 15
sRare <- rarefy(dune, raremin) # now use function rarefy
sRare # gives an "expected" (rarefied) number of species (not observations) if only 15 individuals were present
##        1        2        3        4        5        6        7        8 
## 4.813725 8.294747 7.985897 9.105309 9.787891 8.688685 9.496061 9.317495 
##        9       10       11       12       13       14       15       16 
## 9.575586 9.027570 7.734350 7.719151 7.782660 6.572498 7.271063 6.961173 
##       17       18       19       20 
## 7.000000 7.752880 7.891114 7.334569 
## attr(,"Subsample")
## [1] 15
rarecurve(dune, col = "blue") # produces rarefaction curves; each curve is labelled with its site number at the site's observed abundance. To "rarefy" a larger site, follow its curve back until it corresponds with the smaller site's number of observations; the curve height there is the rarefied species richness

Non-Metric Multidimensional Scaling

(aka NMDS, MDS, NMS)

Now let’s explore some ordination techniques.

Many ordination techniques exist, including principal components analysis (PCA), non-metric multidimensional scaling (NMDS), and correspondence analysis (CA), among others.

vegan is especially good at NMDS, which this tutorial explores in the most detail.

Let’s lay some groundwork:

The goal of NMDS is to collapse information from multiple dimensions (e.g., from multiple communities, sites, etc.) into just a few, so that they can be visualized and interpreted. NMDS itself does not produce a statistical output, but a test can be layered on top of the ordination (more on this later).

The goal of NMDS is to represent the position of communities in multidimensional space as accurately as possible using a reduced number of dimensions that can be easily visualized (and to spare your brain box).

NMDS does not use the absolute abundances of species in communities, but rather their rank orders and as a result is a flexible technique that accepts a variety of types of data. (It’s also where the “non-metric” part of the name comes from.)

To run the NMDS, we will use the function metaMDS. The function requires only a community-by-species matrix (which we will create randomly).

set.seed(2) # set the random number generator seed so this example is reproducible
community_matrix=matrix(
   sample(1:100,300,replace=T),nrow=10, # 300 random counts between 1 and 100, filling 10 rows (communities) and 30 columns (species)
   dimnames=list(paste("community",1:10,sep=""),paste("sp",1:30,sep="")))
head(community_matrix)
##            sp1 sp2 sp3 sp4 sp5 sp6 sp7 sp8 sp9 sp10 sp11 sp12 sp13 sp14
## community1  19  56  67   2  99   1  78  28  36   98   21    3    4   21
## community2  71  24  39  17  30   2  89  32  68   40   43   74   25   92
## community3  58  77  84  82  12  69  63   5   3   38   99   38   98    3
## community4  17  19  16  87  17  93  27  19  41   57   83   58   89   97
## community5  95  41  35  52  95  28  86  19  20   47   29   83   25   32
## community6  95  86  49  63  80  82  44  76  86   20   60   82   76   67
##            sp15 sp16 sp17 sp18 sp19 sp20 sp21 sp22 sp23 sp24 sp25 sp26
## community1   18   19   85   97   95   11   86   98   16   68   90   71
## community2   29   37   95   11   58   41   65   53   94   34   31    9
## community3   63   91    5   26   96   98   61   83   41   90   56   26
## community4   31   40   76   90   79   17   98   68   49   53   76    6
## community5   45   78   30   39   12   51   38   98   69   88   96   74
## community6   74   29   66   80   84   21   82    6   86   26   96    2
##            sp27 sp28 sp29 sp30
## community1   76   89   28   26
## community2    6   92   98   32
## community3   63   85   66   75
## community4   36   30   68   50
## community5   81   79   21   71
## community6   68   50  100   42
example_NMDS=metaMDS(community_matrix, # Our community-by-species matrix
                     k=2) # The number of reduced dimensions. Increase if high stress is a problem. 
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1280709 
## Run 1 stress 0.1280709 
## ... Procrustes: rmse 3.168062e-05  max resid 5.925658e-05 
## ... Similar to previous best
## Run 2 stress 0.2630959 
## Run 3 stress 0.1424254 
## Run 4 stress 0.1778004 
## Run 5 stress 0.1280709 
## ... New best solution
## ... Procrustes: rmse 8.669277e-06  max resid 1.508905e-05 
## ... Similar to previous best
## Run 6 stress 0.1752059 
## Run 7 stress 0.1624619 
## Run 8 stress 0.1280709 
## ... Procrustes: rmse 3.499232e-05  max resid 6.594091e-05 
## ... Similar to previous best
## Run 9 stress 0.1624619 
## Run 10 stress 0.128071 
## ... Procrustes: rmse 7.130361e-05  max resid 0.0001196693 
## ... Similar to previous best
## Run 11 stress 0.206671 
## Run 12 stress 0.1424254 
## Run 13 stress 0.157268 
## Run 14 stress 0.2023676 
## Run 15 stress 0.1461475 
## Run 16 stress 0.1280709 
## ... Procrustes: rmse 4.588009e-06  max resid 8.034253e-06 
## ... Similar to previous best
## Run 17 stress 0.1280709 
## ... New best solution
## ... Procrustes: rmse 6.914407e-06  max resid 1.306718e-05 
## ... Similar to previous best
## Run 18 stress 0.2028469 
## Run 19 stress 0.1624619 
## Run 20 stress 0.1778003 
## *** Solution reached
#"The stress, or the disagreement between 2-D configuration and predicted values from the regression"

#A good rule of thumb: stress < 0.05 provides an excellent representation in reduced dimensions, < 0.1 is great, < 0.2 is good/ok, and stress > 0.3 provides a poor representation

You should see each iteration of the NMDS until a solution is reached (i.e., stress was minimized after some number of reconfigurations of the points in 2 dimensions).
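
To check the fit yourself, you can pull the final stress value out of the result and draw a Shepard diagram with vegan's stressplot, which compares the ordination distances against the observed dissimilarities:

example_NMDS$stress # final stress of the best solution
stressplot(example_NMDS) # Shepard diagram: ordination distance vs. observed dissimilarity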

Now we can plot the NMDS. The plot shows us both the communities (“sites”, open circles) and species (red crosses), but we don’t know which circle corresponds to which site, and which species corresponds to which cross.

plot(example_NMDS)

We can use the functions ordiplot and orditorp to add text to the plot in place of points, which helps make sense of this rather non-intuitive mess.

ordiplot(example_NMDS,type="n") #Ordination plot function especially for congested plots
orditorp(example_NMDS,display="species",col="red",air=0.01) #The function adds text or points to ordination plots
orditorp(example_NMDS,display="sites",cex=1.25,air=0.01)

Let’s suppose that communities 1-5 had some treatment applied, and communities 6-10 a different treatment. Using ordihull we can draw convex hulls connecting the vertices of the points made by these communities on the plot.

This is an intuitive way to understand how communities and species cluster based on treatments.

treat=c(rep("Treatment1",5),rep("Treatment2",5))
ordiplot(example_NMDS,type="n")
ordihull(example_NMDS,groups=treat,draw="polygon",col="grey90",label=F)
orditorp(example_NMDS,display="species",col="red",air=0.01)
orditorp(example_NMDS,display="sites",col=c(rep("green",5),rep("blue",5)),
   air=0.01,cex=1.25)

One can also plot “spider” diagrams using the function ordispider, ellipses using the function ordiellipse (an example follows the spider plot below), or overlay a hierarchical clustering that connects similar communities using ordicluster (useful to see whether treatments are effective in controlling community structure).

#spider plot
ordiplot(example_NMDS,type="n")
ordispider(example_NMDS,groups=treat)
orditorp(example_NMDS,display="species",col="red",air=0.01)
orditorp(example_NMDS,display="sites",col=c(rep("green",5),rep("blue",5)),
         air=0.01,cex=1.25)
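
ordiellipse works much the same way as ordihull and ordispider; a minimal sketch drawing a standard-deviation ellipse around each treatment group:

#ellipse plot
ordiplot(example_NMDS,type="n")
ordiellipse(example_NMDS,groups=treat,kind="sd",draw="polygon",col="grey90",label=F)
orditorp(example_NMDS,display="sites",col=c(rep("green",5),rep("blue",5)),
         air=0.01,cex=1.25)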

Great for visualization, but what about statistical output? One permutation-based approach (see the sketch after this list):

  • Find the centroid of each treatment’s polygon
  • Measure the distance between the group centroids
  • Randomize the community (treatment) labels and re-run 1000×
  • Compare the centroid distances from the randomized runs against the observed distance to judge whether the apparent separation is spurious
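
vegan offers a permutation test in exactly this spirit: adonis2 runs a permutational MANOVA (PERMANOVA) on a dissimilarity matrix. A minimal sketch on the example data (Bray-Curtis dissimilarity and 999 permutations are common, though not the only, choices; older vegan versions provide adonis with the same formula interface):

# does the treatment explain variation in community composition?
treat_df <- data.frame(treat = treat)
adonis2(community_matrix ~ treat, data = treat_df, method = "bray", permutations = 999)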

If the treatment is continuous, such as an environmental gradient, then it might be useful to plot contour lines rather than convex hulls. We can simply make up some, say, elevation data for our original community matrix and overlay them onto the NMDS plot using ordisurf:

# Define random elevations for previous example
elevation=runif(10,0.5,1.5)
# Use the function ordisurf to plot contour lines
ordisurf(example_NMDS,elevation,main="",col="forestgreen")
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)
## 
## Estimated degrees of freedom:
## 3.18  total = 4.18 
## 
## REML score: 4.430011
# Finally, display species on plot
orditorp(example_NMDS,display="species",col="grey30",air=0.1,
   cex=1)

Unused Code: Visualization of Ordination

Consider a single axis representing the abundance of a single species. Along this axis, we can plot the communities in which this species appears, based on its abundance within each.

plot(0:10,0:10,type="n",axes=F,xlab="Abundance of Species 1",ylab="")
axis(1)
points(5,0); text(5.5,0.5,labels="community A")
points(3,0); text(3.2,0.5,labels="community B")
points(0,0); text(0.8,0.5,labels="community C")

Now consider a second axis of abundance, representing another species. We can now plot each community along the two axes (Species 1 and Species 2).

plot(0:10,0:10,type="n",xlab="Abundance of Species 1",
     ylab="Abundance of Species 2")
points(5,5); text(5,4.5,labels="community A")
points(3,3); text(3,3.5,labels="community B")
points(0,5); text(0.8,5.5,labels="community C")

Now consider a third axis of abundance representing yet another species.

# install.packages("scatterplot3d")
library(scatterplot3d) # an R package for the visualization of multivariate data in a three dimensional space.
d<-scatterplot3d(0:10,0:10,0:10,type="n",xlab="Abundance of Species 1",
  ylab="Abundance of Species 2",zlab="Abundance of Species 3") # build the empty 3-D plot and keep the plotting object
d$points3d(5,5,0); text(d$xyz.convert(5,5,0.5),labels="community A")
d$points3d(3,3,3); text(d$xyz.convert(3,3,3.5),labels="community B")
d$points3d(0,5,5); text(d$xyz.convert(0,5,5.5),labels="community C")