R Tools for Paleontology

Distance/Similarity/Beta Diversity Indices

Measuring the ecological distance between sets of samples is often a necessary first step in multivariate analyses (Green 1980; Shi 1993). It is also often a contentious one: different researchers advocate different measures, and at times several of the competing arguments are valid. Although I do not wish to provide a full explanation here of every single measure, I will provide a brief overview of those included in the fossil package. Some of these measures are best described as indices of beta diversity, but they are grouped here with the other similarity measures for convenience, since they are typically used in the same way.

All of the similarity functions are used in the same way. Each takes two arguments representing the two samples. The species occurrences must be arranged in the same order for each site, and any absent species must be represented by a zero.
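
As a minimal sketch of that arrangement in base R, two species lists of different lengths can be brought into a common order with zeros for absences. The species names, counts, and the helper align() are invented for illustration and are not part of the fossil package.

```r
# Hypothetical species counts for two sites (names and values are invented)
site_a <- c(Triceratops = 3, Edmontosaurus = 5, Tyrannosaurus = 1)
site_b <- c(Edmontosaurus = 2, Ankylosaurus = 1)

# One shared species list, so that both vectors use the same order
species <- sort(union(names(site_a), names(site_b)))

# Illustrative helper (not a package function): reorder a named vector
# to the shared list and fill absent species with zero
align <- function(x, species) {
  out <- setNames(numeric(length(species)), species)
  out[names(x)] <- x
  out
}

a <- align(site_a, species)
b <- align(site_b, species)
rbind(a, b)
```

The two aligned vectors a and b are then in the form that a two-argument similarity function expects.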



The similarity measures included can be broadly grouped into two categories: those that use occurrence data and those that use abundance data. As abundance data is not always available, especially in palaeontology, the package includes more measures that use occurrence data. Occurrence-based measures can also be used with abundance data, in which case the function first converts the abundance matrix to an occurrence matrix.
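
The conversion from abundance to occurrence amounts to recording every non-zero count as a presence. A one-line base R sketch of the idea, using invented counts (this illustrates the principle and is not the package's internal code):

```r
# Hypothetical abundance counts for five species at one site
abund <- c(12, 0, 3, 1, 0)

# Any count above zero becomes a presence (1); zeros stay absences (0)
occ <- as.numeric(abund > 0)
occ
```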

One of the oldest and best known occurrence measures is the Jaccard measure, also known as the Coefficient of Community (Table 1; Jaccard 1901; Shi 1993). The measure has seen extensive use, largely due to its simplicity and intuitiveness (Shi 1993; Magurran 2004). A similar measure also in common use is the Sorenson measure (also known as the Dice, Czekanowski or Coincidence Index), which places more emphasis on the shared species than on the unshared, as can be seen in the difference in values for the example data set. Again, the calculation is relatively simple and intuitive, and both indices have been shown to provide useful results (Wolda 1981; Hubálek 1982). Two other similar indices that are occasionally used are the Ochiai and Kulczynski measures. Although Hubálek (1982) lists the Ochiai and Kulczynski indices as providing good results, the Jaccard or Sorenson measures are more often recommended, if only because they are more commonly used.
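
To make the relationship between these four indices concrete, the sketch below writes them in base R in terms of the usual counts: a (species shared by both samples), b (species only in the first) and c (species only in the second). These are the common textbook forms, assumed here rather than copied from Table 1, and the function names are chosen to avoid clashing with the package's own functions.

```r
# Count shared (a) and unshared (b, c) species from two aligned vectors
abc <- function(x, y) {
  x <- x > 0
  y <- y > 0
  list(a = sum(x & y), b = sum(x & !y), c = sum(!x & y))
}

jaccard_sim <- function(x, y) with(abc(x, y), a / (a + b + c))
sorenson_sim <- function(x, y) with(abc(x, y), 2 * a / (2 * a + b + c))
ochiai_sim <- function(x, y) with(abc(x, y), a / sqrt((a + b) * (a + c)))
kulczynski_sim <- function(x, y) with(abc(x, y), 0.5 * (a / (a + b) + a / (a + c)))

# Invented example: 2 shared species, 2 only in x, 1 only in y
x <- c(1, 1, 1, 0, 1, 0)
y <- c(1, 1, 0, 1, 0, 0)

jaccard_sim(x, y)     # 2 / 5 = 0.4
sorenson_sim(x, y)    # 4 / 7, about 0.571
ochiai_sim(x, y)
kulczynski_sim(x, y)
```

Because the Sorenson form counts the shared species twice, it returns a higher value than Jaccard for the same pair of samples whenever the samples partly overlap.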

One of the most common problems in palaeontology, and indeed in many ecological studies, is that of differing sample sizes. Comparing two sites with very unequal sampling intensities can give a biased view of the actual species overlap. For example, a subsample of a site could be considered identical to the original site, as every species in the subsample is also present in the original. However, all of the previous measures would show less than complete similarity because of their mathematical properties. With this in mind, Simpson (1960) developed a measure that can account for differences in sample size. His formula scales the value by the number of species from the least sampled site, so that the subsample in this case would have full similarity with the original. For this reason, the Simpson measure is often used with data that is highly variable in sampling intensity, such as fossil datasets.
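
The subsample case can be checked directly. The base R sketch below assumes the common form of the Simpson measure, the number of shared species divided by the species count of the poorer sample, and compares it with Jaccard on invented data; the function names are illustrative, not the package's.

```r
# Simpson similarity: shared species scaled by the smaller sample's richness
simpson_sim <- function(x, y) {
  x <- x > 0
  y <- y > 0
  sum(x & y) / min(sum(x), sum(y))
}

# Jaccard similarity for comparison: shared species over all species seen
jaccard_sim <- function(x, y) {
  x <- x > 0
  y <- y > 0
  sum(x & y) / sum(x | y)
}

# Invented data: 'sub' contains only species that also occur in 'full'
full <- c(1, 1, 1, 1, 1, 1, 0)
sub  <- c(1, 1, 0, 1, 0, 0, 0)

jaccard_sim(full, sub)   # 3 / 6 = 0.5
simpson_sim(full, sub)   # 3 / 3 = 1
```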

Although the fossil package contains a number of occurrence-based similarity indices, it by no means includes them all. For example, Shi (1993) lists 39 and Hubálek (1982) lists 43 different variations of the similarity index, many of which are rarely used outside their original papers.

Although less common in palaeontological data sets, abundance values can provide valuable information about a community that cannot be obtained from occurrence data. Analyses of community structure are very limited without abundance data, and abundance data can reveal more subtle distinctions between communities. In addition, species abundances can provide some measure of sampling intensity.

Possibly the most widely used abundance-based measure is the Bray-Curtis measure, owing to its strong relationship to ecological distance under varying conditions (Bray and Curtis 1957; Faith et al. 1987; Minchin 1987; Clarke 1993). The measure is equivalent to the Sorenson coefficient when used as a similarity measure with occurrence data. The Morisita-Horn index, while not as common as the Bray-Curtis, is also a highly recommended measure due to its relative independence from sample size and diversity (Wolda 1981; Magurran 2004). While there are several variations of the measure, I have used the version found within Magurran (2004).
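
The equivalence with the Sorenson coefficient is easy to verify numerically. The base R sketch below assumes the usual similarity form of Bray-Curtis (twice the summed minimum abundances over the total abundance) and the Morisita-Horn form commonly given in textbooks; the counts and function names are invented for illustration.

```r
# Bray-Curtis similarity: twice the shared abundance over total abundance
bray_curtis_sim <- function(x, y) 2 * sum(pmin(x, y)) / (sum(x) + sum(y))

# Sorenson similarity on presence/absence
sorenson_sim <- function(x, y) {
  x <- x > 0
  y <- y > 0
  2 * sum(x & y) / (sum(x) + sum(y))
}

# Morisita-Horn similarity (assumed textbook form)
morisita_horn_sim <- function(x, y) {
  nx <- sum(x)
  ny <- sum(y)
  dx <- sum(x^2) / nx^2
  dy <- sum(y^2) / ny^2
  2 * sum(x * y) / ((dx + dy) * nx * ny)
}

# Invented abundance counts for four species at two sites
x <- c(10, 5, 0, 2)
y <- c(4, 5, 3, 0)

bray_curtis_sim(x, y)                             # 18 / 29 on abundances
bray_curtis_sim(as.numeric(x > 0), as.numeric(y > 0))  # on occurrences
sorenson_sim(x, y)                                # same value as the line above
morisita_horn_sim(x, y)
```

Once the counts are reduced to ones and zeros, the summed minima equal the number of shared species, which is why the two indices coincide.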

Although the diversity of indices may seem somewhat overwhelming, the package provides an easy way to apply them to large data sets. An included function called dino.dist() takes a matrix of species occurrences versus locality (or any analogous groupings) and returns a full pairwise distance matrix. The function is written so that any other similarity index, including those defined by other packages or by the user, can be specified and used to calculate the matrix.
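
The general idea of a pairwise matrix built from an interchangeable index can be sketched in a few lines of base R. This is not the package's implementation: pairwise_dist() and jaccard_sim() are hypothetical names, the data are invented, and the layout (species in rows, localities in columns) and the conversion of similarity to distance as one minus similarity are assumptions made for the example.

```r
# Hypothetical occurrence matrix: rows are species, columns are localities
occ <- matrix(c(1, 1, 0,
                1, 1, 1,
                0, 1, 1,
                1, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("sp", 1:4), c("locA", "locB", "locC")))

# Any two-argument similarity function can be passed in; Jaccard is used here
jaccard_sim <- function(x, y) {
  x <- x > 0
  y <- y > 0
  sum(x & y) / sum(x | y)
}

# Illustrative helper: apply the index to every pair of columns and
# store 1 - similarity as the distance
pairwise_dist <- function(m, index = jaccard_sim) {
  n <- ncol(m)
  d <- matrix(0, n, n, dimnames = list(colnames(m), colnames(m)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- 1 - index(m[, i], m[, j])
    }
  }
  d
}

d <- pairwise_dist(occ)
d
```

Passing a different function as the index argument, for example a user-defined one, changes the measure without changing the loop.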

