PAST: PALEONTOLOGICAL STATISTICS SOFTWARE PACKAGE FOR EDUCATION AND DATA ANALYSIS

MULTIVARIATE ANALYSIS

Paleontological data sets, whether based on fossil occurrences or morphology, often have high dimensionality. PAST includes several methods for multivariate data analysis, including methods that are specific to paleontology and biology.

Principal components analysis (PCA) is a procedure for finding hypothetical variables (components) that account for as much of the variance in a multidimensional data set as possible (Davis 1986, Harper 1999). These new variables are linear combinations of the original variables. PCA is a standard method for reducing the dimensionality of morphometric and ecological data. The PCA routine finds the eigenvalues and eigenvectors of the variance-covariance matrix or the correlation matrix. The eigenvalues, giving a measure of the variance accounted for by the corresponding eigenvectors (components), are displayed together with the percentages of variance accounted for by each of these components. A scatter plot of these data projected onto the principal components is provided, along with the option of including the Minimal Spanning Tree, which is the shortest possible set of connected lines joining all points. This may be used as a visual aid in grouping close points (Harper 1999). The component loadings can also be plotted. Bruton and Owen (1988) describe a typical morphometrical application of PCA.
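The eigenanalysis described above can be sketched in a few lines. The following is a minimal illustration, assuming NumPy is available; it is not the PAST implementation, and the function name `pca`, the flag `use_correlation`, and the specimen measurements are hypothetical.

```python
import numpy as np

def pca(data, use_correlation=False):
    """Return eigenvalues, percent variance, eigenvectors, and scores."""
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)
    if use_correlation:
        # Standardizing each variable makes the covariance matrix of the
        # result equal to the correlation matrix of the original data.
        centered = centered / x.std(axis=0, ddof=1)
    matrix = np.cov(centered, rowvar=False)
    # eigh is intended for symmetric matrices; it returns ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    percent = 100.0 * eigenvalues / eigenvalues.sum()
    # Scores are the data projected onto the components.
    scores = centered @ eigenvectors
    return eigenvalues, percent, eigenvectors, scores

# Hypothetical data: 6 specimens, 3 variables (length, width, height).
specimens = [
    [10.0, 5.1, 2.0],
    [12.0, 6.0, 2.4],
    [14.1, 7.2, 2.9],
    [16.0, 7.9, 3.1],
    [18.2, 9.1, 3.8],
    [20.0, 9.8, 4.0],
]
eigenvalues, percent, eigenvectors, scores = pca(specimens)
```

The columns of `eigenvectors` hold the coefficients of each component on the original variables, and the first two columns of `scores` give the coordinates for a scatter plot.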

Principal coordinates analysis (PCO) is another ordination method, somewhat similar to PCA. The PCO routine finds the eigenvalues and eigenvectors of a matrix containing the distances between all data points, measured with the Gower distance or the Euclidean distance. The PCO algorithm used in PAST was taken from Davis (1986), which also includes a more detailed description of the method and example analysis.
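As an illustration, the sketch below uses the classical double-centering formulation of PCO with Euclidean distances, assuming NumPy is available. It may differ in detail from the Davis (1986) algorithm used in PAST, and the Gower distance is not implemented. The function names and the example points are hypothetical.

```python
import numpy as np

def euclidean_distances(data):
    """Matrix of Euclidean distances between all pairs of rows."""
    x = np.asarray(data, dtype=float)
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def pco(distances):
    """Return eigenvalues and point coordinates from a distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-center the matrix of -0.5 * squared distances.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = centering @ (-0.5 * d ** 2) @ centering
    eigenvalues, eigenvectors = np.linalg.eigh(b)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Keep only axes with clearly positive eigenvalues.
    positive = eigenvalues > 1e-9 * eigenvalues.max()
    coords = eigenvectors[:, positive] * np.sqrt(eigenvalues[positive])
    return eigenvalues, coords

# Hypothetical data: 5 points described by 2 variables.
points = [[0, 0], [3, 0], [0, 4], [3, 4], [1, 2]]
d = euclidean_distances(points)
eigenvalues, coords = pco(d)
```

With Euclidean distances the recovered coordinates reproduce the original inter-point distances, which is a convenient check on the computation.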

Correspondence analysis (CA) is a further ordination method, somewhat similar to PCA, but for counted or discrete data. Correspondence analysis can compare associations containing counts of taxa, or counted taxa across associations. CA is also more suitable when species are expected to have unimodal responses to the underlying parameters, that is, they favor a certain range of the parameter and become rare at lower and higher values (in contrast to PCA, which assumes a linear response). The CA algorithm employed in PAST is taken from Davis (1986), which also includes a more detailed description of the method and an example analysis. Ordination of both samples and taxa can be plotted in the same CA coordinate system, whose axes will normally be interpreted in terms of environmental parameters (e.g., water depth, type of substrate, temperature).
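The joint ordination of samples and taxa can be illustrated with a common formulation of CA based on the singular value decomposition of standardized residuals. This is a sketch assuming NumPy is available and no empty rows or columns; it is not necessarily the Davis (1986) algorithm, and the function name and count matrix are hypothetical.

```python
import numpy as np

def correspondence_analysis(counts):
    """Return row coordinates, column coordinates, and axis inertias."""
    n = np.asarray(counts, dtype=float)
    p = n / n.sum()
    row_mass = p.sum(axis=1)
    col_mass = p.sum(axis=0)
    # Standardized residuals from the independence model.
    expected = np.outer(row_mass, col_mass)
    residuals = (p - expected) / np.sqrt(expected)
    u, singular, vt = np.linalg.svd(residuals, full_matrices=False)
    # Principal coordinates for rows and columns on the same axes.
    row_coords = (u * singular) / np.sqrt(row_mass)[:, None]
    col_coords = (vt.T * singular) / np.sqrt(col_mass)[:, None]
    return row_coords, col_coords, singular ** 2

# Hypothetical counts: 5 taxa (rows) replacing one another along
# 5 localities (columns).
counts = [
    [5, 3, 1, 0, 0],
    [2, 5, 3, 1, 0],
    [0, 2, 5, 2, 0],
    [0, 1, 3, 5, 2],
    [0, 0, 1, 3, 5],
]
row_coords, col_coords, inertia = correspondence_analysis(counts)
```

For a matrix of this kind, the first axis orders both taxa and localities along the underlying gradient; the sign of each axis is arbitrary.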

The Detrended Correspondence Analysis (DCA) module uses the same 'reciprocal averaging' algorithm as the program Decorana (Hill and Gauch 1980). It is specialized for use on "ecological" data sets with abundance data (taxa in rows, localities in columns), and it has become a standard method for studying gradients in such data. Detrending is a type of normalization procedure in two steps. The first step attempts to "straighten out" points lying along an arch-like pattern (Kendall's horseshoe). The second step "spreads out" the points to avoid artificial clustering at the edges of the plot.
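The reciprocal averaging core can be sketched as follows for the first axis only. This is a simplified illustration, not the Decorana or PAST code: the two detrending steps described above are not included, and the function name, the 0 to 100 rescaling, and the abundance matrix are assumptions made for the example.

```python
def reciprocal_averaging(abundance, max_iterations=500, tolerance=1e-10):
    """First-axis scores for taxa (rows) and localities (columns)."""
    n_taxa, n_sites = len(abundance), len(abundance[0])
    taxon_totals = [sum(row) for row in abundance]
    site_totals = [sum(row[j] for row in abundance) for j in range(n_sites)]

    def taxa_from_sites(site_scores):
        # Each taxon score is the abundance-weighted mean of site scores.
        return [sum(a * s for a, s in zip(row, site_scores)) / total
                for row, total in zip(abundance, taxon_totals)]

    def sites_from_taxa(taxon_scores):
        # Each site score is the abundance-weighted mean of taxon scores.
        return [sum(abundance[i][j] * taxon_scores[i] for i in range(n_taxa))
                / site_totals[j] for j in range(n_sites)]

    site_scores = [float(j) for j in range(n_sites)]  # arbitrary start
    for _ in range(max_iterations):
        updated = sites_from_taxa(taxa_from_sites(site_scores))
        low, high = min(updated), max(updated)
        if high == low:
            break  # no gradient left to rescale
        # Rescale to the range 0..100 so the scores do not collapse.
        updated = [100.0 * (s - low) / (high - low) for s in updated]
        change = max(abs(u - s) for u, s in zip(updated, site_scores))
        site_scores = updated
        if change < tolerance:
            break
    return taxa_from_sites(site_scores), site_scores

# Hypothetical abundances: taxa in rows, localities in columns.
abundance = [
    [5, 3, 1, 0, 0],
    [2, 5, 3, 1, 0],
    [0, 2, 5, 2, 0],
    [0, 1, 3, 5, 2],
    [0, 0, 1, 3, 5],
]
taxon_scores, site_scores = reciprocal_averaging(abundance)
```

The alternating weighted averages converge to the first correspondence analysis axis; detrending would then be applied when extracting further axes.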

Hierarchical clustering routines produce a dendrogram showing how and where data points can be clustered (Davis 1986, Harper 1999). Clustering is one of the most commonly used methods of multivariate data analysis in paleontology. Both R-mode clustering (grouping of taxa) and Q-mode clustering (grouping of variables or associations) can be carried out within PAST by transposing the data matrix. Three different clustering algorithms are available: the unweighted pair-group average (UPGMA) algorithm, the single linkage (nearest neighbor) algorithm, and Ward's method. The similarity-association matrix upon which the clusters are based can be computed using nine different indices: Euclidean distance, correlation (using Pearson's r or Spearman's ρ), Bray-Curtis, chord, and Morisita indices for abundance data, and Dice, Jaccard, and Raup-Crick indices for presence-absence data.
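A small sketch may help to show how a similarity index feeds into average-linkage clustering. It uses only the Python standard library and is not the PAST implementation; the function names, the sample labels, and the presence-absence data are hypothetical, and only three of the indices are shown.

```python
def jaccard(a, b):
    """Jaccard similarity for presence-absence (0/1) vectors."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return shared / either if either else 0.0

def dice(a, b):
    """Dice similarity for presence-absence (0/1) vectors."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    return 2 * shared / total if total else 0.0

def bray_curtis(a, b):
    """Bray-Curtis similarity for abundance vectors."""
    total = sum(a) + sum(b)
    if not total:
        return 0.0
    return 1 - sum(abs(x - y) for x, y in zip(a, b)) / total

def upgma(names, distance):
    """Average-linkage clustering; returns (cluster, cluster, height)."""
    clusters = [(i,) for i in range(len(names))]  # tuples of item indices
    merges = []
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                # Unweighted mean distance over all between-cluster pairs.
                pairs = [distance[i][j]
                         for i in clusters[p] for j in clusters[q]]
                average = sum(pairs) / len(pairs)
                if best is None or average < best[0]:
                    best = (average, p, q)
        average, p, q = best
        merges.append((tuple(names[i] for i in clusters[p]),
                       tuple(names[i] for i in clusters[q]), average))
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return merges

# Hypothetical presence-absence data: 4 samples, 6 taxa.
samples = {
    "S1": [1, 1, 1, 0, 0, 0],
    "S2": [1, 1, 0, 0, 0, 0],
    "S3": [0, 0, 0, 1, 1, 1],
    "S4": [0, 0, 1, 1, 1, 1],
}
names = list(samples)
distance = [[1 - jaccard(samples[a], samples[b]) for b in names]
            for a in names]
merges = upgma(names, distance)
```

Each entry of `merges` records the two clusters joined and the height at which they join, which is the information a dendrogram displays. Transposing the data before computing the distances switches between clustering samples and clustering taxa.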

Seriation of a presence-absence matrix can be performed using the algorithm described by Brower and Kyle (1988). For constrained seriation, columns should be ordered according to some external criterion (normally stratigraphic level) or positioned along a presumed faunal gradient. Seriation routines attempt to reorganize the data matrix such that the presences are concentrated along the diagonal. In the constrained mode, the program also runs a 'Monte Carlo' simulation to determine whether the original matrix is more informative than a random matrix. In the unconstrained mode both rows and columns are free to move: the method then amounts to a simple form of ordination.
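The idea of concentrating presences along the diagonal can be illustrated with a simple reordering by mean position of presences. This sketch is written in the spirit of the method but is not claimed to reproduce the Brower and Kyle (1988) algorithm, and it omits the Monte Carlo test; the function names and the example matrix are hypothetical.

```python
def mean_position(values):
    """Mean index of the presences in a 0/1 sequence."""
    positions = [k for k, v in enumerate(values) if v]
    return sum(positions) / len(positions) if positions else 0.0

def seriate(matrix, constrained=False, max_passes=50):
    """Return row and column orders that pull presences to the diagonal."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    for _ in range(max_passes):
        # Sort rows by where their presences fall in the column order.
        new_rows = sorted(
            rows, key=lambda r: mean_position([matrix[r][c] for c in cols]))
        if constrained:
            new_cols = cols  # column order is fixed by an external criterion
        else:
            new_cols = sorted(
                cols,
                key=lambda c: mean_position([matrix[r][c] for r in new_rows]))
        if new_rows == rows and new_cols == cols:
            break  # ordering is stable
        rows, cols = new_rows, new_cols
    return rows, cols

# Hypothetical example: a diagonal matrix with its rows shuffled.
ordered = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
]
shuffled = [ordered[2], ordered[0], ordered[3], ordered[1]]
row_order, col_order = seriate(shuffled, constrained=True)
restored = [shuffled[r] for r in row_order]
```

In the constrained call only the rows move, so the fixed column order (for example stratigraphic level) determines the result; with `constrained=False` both orders are updated in turn.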

The degree of separation between two hypothesized groups (e.g., species or morphs) can be investigated using discriminant analysis (Davis 1986). Given two sets of multivariate data, an axis is constructed that maximizes the differences between the sets. The two sets are then plotted along this axis using a histogram. The null hypothesis of equality of the group means is tested using Hotelling's T² test.
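For two groups, the discriminant axis and the test statistic can be sketched using the standard pooled-covariance formulas, assuming NumPy is available. This is an illustration rather than the PAST code: the function name and the measurements are hypothetical, and the p value (which requires the F distribution) is left out.

```python
import numpy as np

def discriminant_and_hotelling(group_a, group_b):
    """Return discriminant axis, Hotelling's T2, F statistic, and df."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n1, p = a.shape
    n2 = b.shape[0]
    diff = a.mean(axis=0) - b.mean(axis=0)
    # Pooled within-group covariance matrix.
    pooled = np.atleast_2d(
        ((n1 - 1) * np.cov(a, rowvar=False)
         + (n2 - 1) * np.cov(b, rowvar=False)) / (n1 + n2 - 2))
    # Axis that maximizes between-group relative to within-group spread.
    axis = np.linalg.solve(pooled, diff)
    t2 = (n1 * n2) / (n1 + n2) * (diff @ axis)
    # Conversion of T2 to an F statistic with (p, n1 + n2 - p - 1) df.
    f = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * t2
    return axis, t2, f, (p, n1 + n2 - p - 1)

# Hypothetical measurements: two morphs, two variables each.
morph_a = [[10.1, 5.0], [9.8, 5.3], [10.4, 4.8], [10.0, 5.1], [9.7, 4.9]]
morph_b = [[14.2, 5.6], [13.9, 5.2], [14.1, 5.9], [13.8, 5.5], [14.3, 5.4]]
axis, t2, f, df = discriminant_and_hotelling(morph_a, morph_b)
scores_a = np.asarray(morph_a) @ axis
scores_b = np.asarray(morph_b) @ axis
```

The arrays `scores_a` and `scores_b` are the positions of the specimens along the discriminant axis, which are the values that would be shown in the histogram.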