Multivariate statistics


Principal components analysis

Typical application: Reduction and interpretation of large multivariate data sets with some underlying linear structure
Assumptions: Debated
Data needed: Two or more rows of measured data with three or more variables

Principal components analysis (PCA) is a procedure for finding hypothetical variables (components) which account for as much of the variance in your multidimensional data as possible (Davis 1986, Harper 1999). These new variables are linear combinations of the original variables. PCA has several applications, two of which are:

  • Simple reduction of the data set to only two variables (the two most important components), for plotting and clustering purposes.
  • More interestingly, you might try to hypothesize that the most important components are correlated with some other underlying variables. For morphometric data, this might be simply age, while for associations it might be a faunal gradient (e.g. latitude or position across the shelf).

    The PCA routine finds the eigenvalues and eigenvectors of the variance-covariance matrix or the correlation matrix. Choose var-covar if all your variables are measured in the same unit (e.g. centimetres). Choose correlation (normalized var-covar) if your variables are measured in different units; this normalizes all variables by dividing them by their standard deviations. The eigenvalues, giving a measure of the variance accounted for by the corresponding eigenvectors (components), are given for the first four most important components (or fewer if there are fewer than four variables). The percentages of variance accounted for by these components are also given. If most of the variance is accounted for by the first one or two components, you have scored a success, but if the variance is spread more or less evenly among the components, the PCA has in a sense not been very successful.
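
    As an illustration of what the routine computes, the following is a minimal sketch of PCA by eigendecomposition of the covariance (or correlation) matrix, written in Python with NumPy. The toy matrix X and the use_correlation switch are invented for the example and do not come from the program.

        import numpy as np

        # Toy data: rows are specimens, columns are variables in the same unit.
        X = np.array([[2.3, 4.1, 0.9],
                      [3.1, 5.0, 1.2],
                      [2.8, 4.6, 1.0],
                      [3.6, 5.9, 1.5],
                      [2.1, 3.8, 0.8]])

        use_correlation = False          # set True if the variables have different units

        Xc = X - X.mean(axis=0)          # centre the columns
        if use_correlation:
            Xc = Xc / X.std(axis=0, ddof=1)   # normalizing gives the correlation matrix

        C = np.cov(Xc, rowvar=False)     # variance-covariance (or correlation) matrix

        # Eigendecomposition; eigh is appropriate because C is symmetric.
        eigvals, eigvecs = np.linalg.eigh(C)
        order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]

        percent = 100.0 * eigvals / eigvals.sum()   # % variance per component
        scores = Xc @ eigvecs                       # row scores ('View scatter')
        loadings = eigvecs                          # column loadings ('View loadings')

        print("Eigenvalues:", np.round(eigvals, 4))
        print("% variance: ", np.round(percent, 2))
        print("Scores on the first two components:")
        print(np.round(scores[:, :2], 3))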

    The 'View scatter' option allows you to see all your data points (rows) plotted in the coordinate system given by the two most important components. If you have colored (grouped) rows, the different groups will be shown using different symbols and colours. You can also plot the Minimal Spanning Tree, which is the shortest possible set of lines connecting all points. This may be used as a visual aid in grouping close points. The MST is based on a Euclidean distance measure of the original data points, so it is most meaningful when all your variables use the same unit.
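
    If you want to inspect the MST outside the program, the same idea can be sketched with SciPy as below, using the same toy matrix X as in the previous example (again an invented data set).

        import numpy as np
        from scipy.spatial.distance import pdist, squareform
        from scipy.sparse.csgraph import minimum_spanning_tree

        # Euclidean distances between the original data points (rows of X).
        X = np.array([[2.3, 4.1, 0.9], [3.1, 5.0, 1.2], [2.8, 4.6, 1.0],
                      [3.6, 5.9, 1.5], [2.1, 3.8, 0.8]])   # same toy data as above
        D = squareform(pdist(X, metric='euclidean'))

        # Minimum spanning tree; the nonzero entries of 'mst' are the retained edges.
        mst = minimum_spanning_tree(D)
        edges = np.transpose(mst.nonzero())
        print("MST edges (pairs of row indices):", edges.tolist())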

    The 'View loadings' option shows to what degree your different original variables (given in the original order along the x axis) enter into the different components (as chosen in the radio button panel). These component loadings are important when you try to interpret the 'meaning' of the components.

    Bruton & Owen (1988) describe a typical morphometrical application of PCA.

    Principal coordinates

    Typical application: Reduction and interpretation of large multivariate data sets with some underlying linear structure
    Assumptions: Unknown
    Data needed: Two or more rows of measured data with three or more variables

    Principal coordinates analysis (PCO) is another ordination method, somewhat similar to PCA. The algorithm is taken from Davis (1986).

    The PCO routine finds the eigenvalues and eigenvectors of a matrix containing the distances between all data points. You can choose between the Gower distance measure and the Euclidean distance. The Gower measure will normally be used; Euclidean distance gives results similar to PCA. The eigenvalues, giving a measure of the variance accounted for by the corresponding eigenvectors (coordinates), are given for the first four most important coordinates (or fewer if there are fewer than four data points). The percentages of variance accounted for by these coordinates are also given.
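
    The following is a sketch of the classical principal coordinates computation (double-centring of the squared distances followed by eigendecomposition), shown here with Euclidean distances only; the Gower measure offered by the program is not reproduced, and the data matrix is invented.

        import numpy as np
        from scipy.spatial.distance import pdist, squareform

        # Toy data: rows are the data points to ordinate.
        X = np.array([[2.3, 4.1, 0.9], [3.1, 5.0, 1.2], [2.8, 4.6, 1.0],
                      [3.6, 5.9, 1.5], [2.1, 3.8, 0.8]])
        D = squareform(pdist(X, metric='euclidean'))   # distance matrix

        # Double-centre the squared distances.
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        B = -0.5 * J @ (D ** 2) @ J

        # Eigendecomposition; coordinates are eigenvectors scaled by sqrt(eigenvalue).
        eigvals, eigvecs = np.linalg.eigh(B)
        order = np.argsort(eigvals)[::-1]
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]

        keep = eigvals > 1e-12                         # discard null/negative eigenvalues
        coords = eigvecs[:, keep] * np.sqrt(eigvals[keep])
        percent = 100.0 * eigvals[keep] / eigvals[keep].sum()

        print("% variance per coordinate:", np.round(percent, 2))
        print("First two principal coordinates:")
        print(np.round(coords[:, :2], 3))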

    The 'View scatter' option allows you to see all your data points (rows) plotted in the coordinate system given by the PCO. If you have colored (grouped) rows, the different groups will be shown using different symbols and colours.

    Correspondence analysis

    Typical application: Reduction and interpretation of large multivariate ecological data sets with environmental or other gradients
    Assumptions: Unknown
    Data needed: Two or more rows of counted data in three or more compartments

    Correspondence analysis (CA) is yet another ordination method, somewhat similar to PCA but for counted data. For comparing associations (columns) containing counts of taxa, or counted taxa (rows) across associations, CA is the more appropriate algorithm. CA is also more suitable if you expect the species to have unimodal responses to the underlying parameters, that is, they favour a certain range of the parameter and become rare for lower and higher values (in contrast to PCA, which assumes a linear response). The algorithm is taken from Davis (1986).

    The CA routine finds the eigenvalues and eigenvectors of a matrix containing the Chi-squared distances between all data points. The eigenvalues, giving a measure of the similarity accounted for by the corresponding eigenvectors, are given for the first four most important eigenvectors (or fewer if there are fewer than four variables). The percentages of similarity accounted for by these components are also given.
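
    One common way of carrying out this computation is by singular value decomposition of the standardized residuals of the count table; the sketch below follows that route and may differ in detail from the routine in the program. The count table N is fabricated for illustration.

        import numpy as np

        # Toy count table: rows = samples/associations, columns = taxa.
        N = np.array([[10,  4,  0,  1],
                      [ 8,  6,  1,  0],
                      [ 2,  5,  9,  3],
                      [ 0,  2, 12,  7],
                      [ 1,  0,  6, 11]], dtype=float)

        P = N / N.sum()                  # relative frequencies
        r = P.sum(axis=1)                # row masses
        c = P.sum(axis=0)                # column masses

        # Standardized residuals; their SVD underlies the chi-squared distance geometry.
        S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
        U, sv, Vt = np.linalg.svd(S, full_matrices=False)

        inertia = sv ** 2                # the 'eigenvalues' of the CA
        percent = 100.0 * inertia / inertia.sum()

        # Row and column scores plotted in the same coordinate system.
        row_scores = (U * sv) / np.sqrt(r)[:, None]
        col_scores = (Vt.T * sv) / np.sqrt(c)[:, None]

        print("% inertia per axis:", np.round(percent, 2))
        print("Row scores, first two axes:")
        print(np.round(row_scores[:, :2], 3))
        print("Column scores, first two axes:")
        print(np.round(col_scores[:, :2], 3))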

    The 'View scatter' option allows you to see all your data points (rows) plotted in the coordinate system given by the CA. If you have colored (grouped) rows, the different groups will be shown using different symbols and colours.

    In addition, the variables (columns, associations) can be plotted in the same coordinate system (Q mode), optionally including the column labels. If your data are 'well behaved', taxa typical for an association should plot in the vicinity of that association.

    If you have more than two columns in your data set, you can choose to view a scatter plot on the second and third axes. This will give a result comparable to Reciprocal Averaging (see below).

    Detrended correspondence analysis

    Typical application: Reduction and interpretation of large multivariate ecological data sets with environmental or other gradients
    Assumptions: Unknown
    Data needed: Two or more rows of counted data in three or more compartments

    The Detrended Correspondence Analysis (DCA) module uses the same algorithm as Decorana (Hill & Gauch 1980). It is specialized for use on 'ecological' data sets with abundance data (taxa in rows, localities in columns). When the 'Detrending' option is switched off, a basic Reciprocal Averaging will be carried out. The result should be similar to Correspondence Analysis (see above) plotted on the second and third axes.
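
    For the case with detrending switched off, plain Reciprocal Averaging can be sketched as the iterative weighted-averaging scheme below; the detrending steps of Decorana are not included, and the abundance table is invented.

        import numpy as np

        # Toy abundance table: taxa in rows, localities in columns.
        A = np.array([[10,  8,  2,  0,  1],
                      [ 4,  6,  5,  2,  0],
                      [ 0,  1,  9, 12,  6],
                      [ 1,  0,  3,  7, 11]], dtype=float)

        row_tot = A.sum(axis=1)
        col_tot = A.sum(axis=0)

        # Start from arbitrary locality scores and average back and forth.
        col_scores = np.arange(A.shape[1], dtype=float)
        for _ in range(100):
            row_scores = (A @ col_scores) / row_tot     # taxa: weighted average of locality scores
            col_scores = (A.T @ row_scores) / col_tot   # localities: weighted average of taxon scores
            # Remove the trivial constant solution (weighted centring) and rescale.
            col_scores -= np.sum(col_tot * col_scores) / col_tot.sum()
            col_scores /= np.sqrt(np.sum(col_tot * col_scores ** 2) / col_tot.sum())

        row_scores = (A @ col_scores) / row_tot         # final taxon scores on the first axis

        print("Taxon scores:   ", np.round(row_scores, 3))
        print("Locality scores:", np.round(col_scores, 3))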

    Detrending is a sort of normalization procedure in two steps. The first step involves an attempt to 'straighten out' points lying in an arch, which is a common occurrence. The second step involves 'spreading out' the points to avoid clustering of the points at the edges of the plot. Detrending may seem an arbitrary procedure, but can be a useful aid in interpretation.

    Cluster analysis

    Typical application: Finding hierarchical groupings in multivariate data sets
    Assumptions: None
    Data needed: Two or more rows of counted, measured or presence/absence data in one or more variables or categories

    The hierarchical clustering routine produces a 'dendrogram' showing how data points (rows) can be clustered. For 'R' mode clustering, putting weight on groupings of taxa, taxa should go in rows. It is also possible to find groupings of variables or associations (Q mode), by entering taxa in columns. Switching between the two is done by transposing the matrix (in the Edit menu).

    Three different algorithms are available: unweighted pair-group average (UPGMA), single linkage (nearest neighbour) and Ward's method. None of them is necessarily better than the others, though single linkage is not recommended by some authors. It can be useful to compare the dendrograms given by the different algorithms in order to informally assess the robustness of the groupings. If a grouping changes when another algorithm is tried, that grouping should perhaps not be trusted.
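
    A quick way of comparing the three algorithms outside the program is sketched below with SciPy's hierarchical clustering; the data matrix is invented and a Euclidean distance is used throughout.

        import numpy as np
        from scipy.cluster.hierarchy import linkage, dendrogram
        from scipy.spatial.distance import pdist

        # Toy data: rows are the items to be clustered.
        X = np.array([[2.3, 4.1, 0.9], [3.1, 5.0, 1.2], [2.8, 4.6, 1.0],
                      [3.6, 5.9, 1.5], [2.1, 3.8, 0.8]])
        d = pdist(X, metric='euclidean')          # condensed distance matrix

        # UPGMA ('average'), single linkage ('single') and Ward's method ('ward').
        for method in ('average', 'single', 'ward'):
            Z = linkage(d, method=method)
            print(method, "merge heights:", np.round(Z[:, 2], 3))

        # dendrogram(Z) draws the corresponding tree if matplotlib is installed.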

    For Ward's method, a Euclidean distance measure is inherent to the algorithm. For UPGMA and single linkage, the distance matrix can be computed using nine different measures:

  • The Euclidean distance (between rows) is a robust and widely applicable measure.
  • Correlation (of the variables along rows) using Pearson's r. Not very meaningful if you have only two variables.
  • Correlation using Spearman's rho (basically the r value of the ranks). Will often give the same result as correlation using r.
  • Dice coefficient for absence-presence data (coded as 0 or positive numbers). Puts more weight on joint occurrences than on mismatches.
  • Jaccard coefficient for absence-presence data.
  • Bray-Curtis measure for abundance data.
  • Chord distance for abundance data. Recommended!
  • Morisita's index for abundance data. Recommended!
  • Raup-Crick index for absence-presence data. Recommended!

    See Harper (1999) or Davis (1986) for details.
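
    Several of these measures can be reproduced with SciPy, as sketched below; note that SciPy returns them as dissimilarities, and that Morisita's index and the Raup-Crick index are not covered by this sketch. The data are invented.

        import numpy as np
        from scipy.spatial.distance import pdist

        # Toy abundance data (rows = samples) and a 0/1 version for presence/absence measures.
        abund = np.array([[10,  4,  0,  1],
                          [ 8,  6,  1,  0],
                          [ 0,  2, 12,  7]], dtype=float)
        pa = abund > 0

        print("Euclidean:  ", np.round(pdist(abund, 'euclidean'), 3))
        print("Bray-Curtis:", np.round(pdist(abund, 'braycurtis'), 3))
        print("Jaccard:    ", np.round(pdist(pa, 'jaccard'), 3))
        print("Dice:       ", np.round(pdist(pa, 'dice'), 3))

        # Chord distance: Euclidean distance between rows scaled to unit length.
        unit = abund / np.linalg.norm(abund, axis=1, keepdims=True)
        print("Chord:      ", np.round(pdist(unit, 'euclidean'), 3))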

    Seriation

    Typical application: Stratigraphical or environmental ordering of taxa and localities
    Assumptions: None
    Data needed: Presence/absence (0/1) matrix with taxa in rows

    This module performs seriation of an absence-presence matrix using the algorithm described by Brower & Kyle (1988). The method is typically applied to an association matrix with taxa (species) in the rows and populations in the columns. For constrained seriation (see below), columns should be ordered according to some criterion, normally stratigraphic level or position along a presumed faunal gradient.

    The seriation routines attempt to reorganize the data matrix such that the presences are concentrated along the diagonal. There are two algorithms: constrained and unconstrained optimization. In constrained optimization, only the rows (taxa) are free to move. Given an ordering of the columns, this procedure finds the 'optimal' biozonation, that is, the ordering of taxa which gives the prettiest range plot. Also, in the constrained mode, the program runs a 'Monte Carlo' simulation, generating and seriating 30 random matrices with the same number of occurrences within each taxon, and compares these to the original matrix to see if it is more informative than a random one (this procedure is time-consuming for large data sets).

    In the unconstrained mode, both rows and columns are free to move.
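
    A very simple approximation to the constrained case is to leave the columns fixed and sort the taxa by the mean column position of their occurrences, as sketched below; the actual routine follows Brower & Kyle (1988) and optimizes the ordering more carefully. The matrix is invented.

        import numpy as np

        # Toy presence/absence matrix: taxa in rows, ordered samples in columns.
        M = np.array([[0, 1, 1, 1, 0, 0],
                      [1, 1, 1, 0, 0, 0],
                      [0, 0, 1, 1, 1, 0],
                      [0, 0, 0, 1, 1, 1],
                      [1, 1, 0, 0, 0, 0]])

        # Constrained seriation sketch: columns stay fixed; reorder the rows so that
        # presences concentrate along the diagonal, here by each taxon's mean position.
        positions = np.arange(M.shape[1])
        row_centres = (M * positions).sum(axis=1) / M.sum(axis=1)
        row_order = np.argsort(row_centres)

        print("New row order:", row_order)
        print(M[row_order])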

    Discriminant analysis

    Typical application: Testing for separation of multivariate data sets
    Assumptions: Multivariate normality
    Data needed: Two multivariate data sets of measured data, marked with different colors

    Given two sets of multivariate data, an axis is constructed which maximizes the difference between the sets. The two sets are then plotted along this axis using a histogram.

    This module expects the rows in the two data sets to be grouped into two sets by coloring the rows, e.g. with black (dots) and red (crosses). The histogram may not show the entire discriminant axis, so the start and end values for the histogram bins may have to be set manually.

    Equality of the two groups is tested by a multivariate analogue to the t test, called Hotelling's T-squared, and a p value for this test is given.
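
    The construction of the discriminant axis and the Hotelling's T-squared test can be sketched as below for the two-group case, using a pooled within-group covariance matrix; the two small data sets are made up for the example.

        import numpy as np
        from scipy.stats import f as f_dist

        # Two groups of multivariate measurements (rows = specimens, columns = variables).
        A = np.array([[2.3, 4.1], [3.1, 5.0], [2.8, 4.6], [3.6, 5.9], [2.1, 3.8]])
        B = np.array([[4.0, 3.2], [4.4, 3.9], [3.9, 3.0], [4.8, 3.6], [4.2, 3.4]])

        n1, n2, p = len(A), len(B), A.shape[1]
        mean_diff = A.mean(axis=0) - B.mean(axis=0)

        # Pooled within-group covariance matrix.
        S = ((n1 - 1) * np.cov(A, rowvar=False) +
             (n2 - 1) * np.cov(B, rowvar=False)) / (n1 + n2 - 2)

        # Discriminant axis: the direction maximizing the separation between the groups.
        w = np.linalg.solve(S, mean_diff)
        scores_A, scores_B = A @ w, B @ w     # the values shown in the two histograms

        # Hotelling's T-squared and its p value via the F distribution.
        T2 = (n1 * n2) / (n1 + n2) * mean_diff @ np.linalg.solve(S, mean_diff)
        F = T2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))
        p_value = f_dist.sf(F, p, n1 + n2 - p - 1)

        print("Scores, group A:", np.round(scores_A, 2))
        print("Scores, group B:", np.round(scores_B, 2))
        print("Hotelling's T2 = %.3f, p = %.4f" % (T2, p_value))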

    Discriminant analysis can be used for visually confirming or rejecting the hypothesis that two species are morphologically distinct.

    See Davis (1986) for details.
