Comparing data sets


There are many standard tests available for comparing two distributions. Here is the standard disclaimer: You can never prove that two distributions are the same. A high probability value is merely consistent with the distributions being the same, although it does give some indication of the similarity between the two sample distributions. A very low probability value, on the other hand, does show, to the given level of significance, that the distributions are different.

Chi square (one sample v. normal)

Typical application: Testing for normal distribution of a sample
Assumptions: Large sample (N>30)
Data needed: Single column of measured or counted data

Tests whether a single distribution (one selected column) is normal, by binning the numbers in four compartments. This test should only be used for relatively large samples (N>30). See Brown & Rothery (1993) or Davis (1986) for details.

Shapiro-Wilk (one sample v. normal)

Typical application: Testing for normal distribution of a sample
Assumptions: Small sample (N<50)
Data needed: Single column of measured or counted data

Tests whether a single distribution (one selected column) is normal. This test is designed for relatively small samples (N<50).

F and t tests (two samples)

Typical application: Testing for equality of the variances and means of two samples
Assumptions: Normal or almost normal distribution
Data needed: Two columns of measured or counted data

Two columns must be selected. The F test compares the variances of two distributions, while the t test compares their means. The F and t statistics, and the probabilities that the variances and means of the parent populations are the same, are given. The F and t tests should only be used if you have reason to believe that the parent populations are close to normally distributed. The Chi-square test for one distribution against a normal distribution can give you an idea about this.

Also, the t test is really only applicable when the variances are the same. So if the F test says otherwise, you should be cautious about the t test. An unequal variance t statistic (Welch test) is also given, which should be used in this case.

Sometimes publications give not the data, but values for sample size, mean and variance for two populations. These can be entered manually using the 'F and t from parameters' option in the menu.
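
As a rough illustration of working from published parameters, the sketch below computes the F statistic, the equal-variance t statistic and the Welch t statistic from sample sizes, means and variances. The function name and the convention of putting the larger variance in the numerator of F are assumptions made for this example; the probabilities reported by the program are not computed here.

```python
import math

def f_and_t_from_parameters(n1, mean1, var1, n2, mean2, var2):
    """Return (F, t_pooled, t_welch, welch_df) from summary statistics.

    var1 and var2 are assumed to be sample variances (n - 1 denominator).
    """
    # F statistic: larger variance over smaller, so that F >= 1 (assumed convention).
    f_stat = max(var1, var2) / min(var1, var2)

    # Equal-variance t statistic, using the pooled variance estimate.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    t_pooled = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

    # Welch (unequal variance) t statistic and its approximate degrees of freedom.
    se1, se2 = var1 / n1, var2 / n2
    t_welch = (mean1 - mean2) / math.sqrt(se1 + se2)
    welch_df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return f_stat, t_pooled, t_welch, welch_df
```

When the two variances and sample sizes are equal, the pooled and Welch statistics coincide; they diverge as the variances become more unequal, which is when the Welch value is the safer one to read.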

See Brown & Rothery (1993) or Davis (1986) for details.

How do I test lognormal distributions?

All of the above tests apply to lognormal distributions as well. All you need to do is to transform your data first, by taking the log transform in the Massage menu. You might want to 'backup' your data column first, using Copy, and then get your original column back using Paste.
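
Outside the program, the same idea can be sketched in a few lines. The natural logarithm is assumed here (the base used by the program's log transform is not stated above, and any base works equally well for a normality test), and the function name is made up for this example.

```python
import math

def log_transform(values):
    """Return the natural logs of strictly positive values as a new list."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform needs strictly positive values")
    return [math.log(v) for v in values]

original = [1.0, 10.0, 100.0]
# The original list is left untouched, so it doubles as the backup copy.
transformed = log_transform(original)
```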

Chi-square (two samples)

Typical application: Testing for equal distribution of compartmentalized, counted data
Assumptions: Each compartment containing at least five individuals
Data needed: Two columns of counted data in different compartments (rows)

The Chi-square test is the one to use if your data consist of the numbers of elements in different bins (compartments). For example, this test can be used to compare two associations (columns) with the number of individuals in each taxon organized in the rows. You should be a little cautious about such comparisons if any of the bins contain fewer than five individuals. Note: It is assumed that there are no constraints (the number of degrees of freedom equals the number of bins). If this is not true, the result will be inaccurate (this is not serious if the number of bins is large).
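
For readers who want to see the arithmetic, here is a minimal sketch of a two-sample Chi-square statistic for binned counts. It uses a common textbook form that allows the two columns to have different totals; whether the program computes the statistic in exactly this way is an assumption, and the function name is invented for the example.

```python
import math

def chi_square_two_samples(counts_a, counts_b):
    """Return (chi2, bins_used) for two columns of counts over the same bins."""
    if len(counts_a) != len(counts_b):
        raise ValueError("both columns need the same number of bins")
    total_a, total_b = sum(counts_a), sum(counts_b)
    # Scale factors compensate for unequal sample totals.
    scale_a = math.sqrt(total_b / total_a)
    scale_b = math.sqrt(total_a / total_b)
    chi2 = 0.0
    bins_used = 0
    for a, b in zip(counts_a, counts_b):
        if a == 0 and b == 0:
            continue  # a bin that is empty in both columns carries no information
        bins_used += 1
        chi2 += (scale_a * a - scale_b * b) ** 2 / (a + b)
    return chi2, bins_used
```

With no constraints, the number of bins used is taken as the degrees of freedom when looking up the probability.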

See Brown & Rothery (1993) or Davis (1986) for details.

Mann-Whitney (two samples)

Typical application: Comparing the medians of two samples
Assumptions: None
Data needed: Two columns of measured or counted data

Two columns must be selected. The (Wilcoxon) Mann-Whitney U test can be used to test whether the medians of two independent distributions are different. This test is non-parametric, which means that the distributions can be of any shape.
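
The U statistic itself is easy to compute by direct pair counting, as in the sketch below. This is an illustration only: the function name is made up, ties are assumed to count one half, and the conversion of U to a probability is left out.

```python
def mann_whitney_u(x, y):
    """Return (u_x, u_y): pairwise 'wins' for each sample, ties counting one half."""
    u_x = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u_x += 1.0
            elif xi == yj:
                u_x += 0.5
    # Every pair is counted exactly once, so the two U values sum to len(x) * len(y).
    u_y = len(x) * len(y) - u_x
    return u_x, u_y
```

A U value near zero for one sample means that almost all of its values lie below those of the other sample.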

See Brown & Rothery (1993) or Davis (1986) for details.

Kolmogorov-Smirnov (two samples)

Typical application: Comparing the distributions of two samples
Assumptions: None
Data needed: Two columns of measured data

Two columns must be selected. The K-S test can be used to test whether two independent distributions of continuous, unbinned numerical data are different. The K-S test is non-parametric, which means that the distributions can be of any shape. If you want to test just the locations of the distribution (medians), you should rather use the Mann-Whitney U test.
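
The quantity behind the K-S test is D, the largest vertical distance between the two empirical cumulative distribution functions. A minimal sketch of that statistic follows; the function name is an assumption, and the probability associated with D is not computed.

```python
def ks_two_sample_d(x, y):
    """Return D, the largest absolute gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        value = min(xs[i], ys[j])
        # Step past every observation equal to the current value in both samples.
        while i < n and xs[i] == value:
            i += 1
        while j < m and ys[j] == value:
            j += 1
        d = max(d, abs(i / n - j / m))
    # Once one sample is exhausted the gap can only shrink, so d is final.
    return d
```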

See Davis (1986) for details.

Spearman's rho and Kendall's tau (two samples)

Typical application: Testing whether two variables are correlated
Assumptions: None
Data needed: Two columns of measured or counted paired data (such as x/y pairs)

These non-parametric rank-order tests are used to test for correlation between two variables.
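
Both coefficients can be sketched directly from their definitions. In the example below, Spearman's rho is taken as the Pearson correlation of the ranks (with tied values sharing their mean rank), and Kendall's tau is the simple "tau-a" variant in which tied pairs count as neither concordant nor discordant. Which variants the program uses is not stated above, so treat these choices, and all function names, as assumptions.

```python
import math

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson_r(x, y):
    """Pearson's product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                score += 1
            elif product < 0:
                score -= 1
    return score / (n * (n - 1) / 2)
```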

Dice and Jaccard similarity indices

Typical application: Comparing two or more presence/absence samples
Assumptions: Equal sampling conditions
Data needed: Two or more columns of presence/absence (0/1) data with taxa down the rows

The Dice and Jaccard similarity indices are used to compare associations, limited to absence/presence data (any positive number is interpreted as presence). When comparing two columns (associations), a match is counted for all taxa with presences in both columns. Using 'M' for the number of matches and 'N' for the total number of taxa with presences in just one column, we have

Dice similarity = 2M / (2M+N)

Jaccard similarity = M / (M+N)

Both these indices range from 0 (no similarity) to 1 (identity). A matrix is presented with the comparisons between all pairs of associations. Dice indices are given in the upper triangle of the matrix (above and to the right of the diagonal), and Jaccard indices are given in the lower.
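
The two formulas above translate directly into code. The sketch below handles a single pair of columns; the function name is chosen for this example, and the error raised when no taxon is present in either column is an assumption about how to treat that undefined case.

```python
def dice_and_jaccard(col_a, col_b):
    """Return (dice, jaccard) for two columns; any positive number counts as presence."""
    # M: taxa present in both columns.
    matches = sum(1 for a, b in zip(col_a, col_b) if a > 0 and b > 0)
    # N: taxa present in exactly one of the two columns.
    single = sum(1 for a, b in zip(col_a, col_b) if (a > 0) != (b > 0))
    if matches + single == 0:
        raise ValueError("no taxon is present in either column")
    dice = 2 * matches / (2 * matches + single)
    jaccard = matches / (matches + single)
    return dice, jaccard
```

For example, with M = 2 and N = 2 the Dice index is 4/6 and the Jaccard index is 2/4, showing that Dice gives more weight to matches.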

See Harper (1999) for details.

Raup-Crick similarity index

Typical application: Comparing two or more presence/absence samples
Assumptions: Equal sampling conditions
Data needed: Two or more columns of presence/absence (0/1) data with taxa down the rows

The Raup-Crick similarity index is used to compare associations, limited to absence/presence data (any positive number is interpreted as presence). This index ranges from 0 (no similarity) to 1 (identity). A matrix is presented with the comparisons between all pairs of associations.

The Raup-Crick index (Raup & Crick 1979) uses a randomization ("Monte Carlo") procedure, comparing the observed number of species occurring in both associations with the distribution of co-occurrences from 200 random replicates.

Correlation matrix

Typical application: Quantifying correlation between two or more variables
Assumptions: Normal distribution
Data needed: Two or more columns of measured or counted variables

A matrix is presented with the correlations between all pairs of columns. Correlation values (Pearson's r) are given in the upper triangle of the matrix, and the probabilities that the columns are uncorrelated are given in the lower.
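
A minimal sketch of the correlation part of such a matrix is given below. It fills both triangles with Pearson's r and leaves out the probabilities, which need the t distribution; the function names are assumptions made for this example.

```python
import math

def pearson_r(x, y):
    """Pearson's product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return a symmetric matrix of Pearson's r for all pairs of columns."""
    k = len(columns)
    matrix = [[1.0] * k for _ in range(k)]  # each column correlates perfectly with itself
    for i in range(k):
        for j in range(i + 1, k):
            r = pearson_r(columns[i], columns[j])
            matrix[i][j] = matrix[j][i] = r
    return matrix
```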

Contingency table analysis

Typical application: Testing for dependence between two variables
Assumptions: None
Data needed: Matrix of counted data in compartments

A contingency table is input to this routine. Rows represent the different states of one nominal variable, columns represent the states of another nominal variable, and cells contain the counts of occurrences of that specific state (row, column) of the two variables. A measure and probability of association of the two variables (based on Chi-square) is then given.

For example, rows may represent taxa and columns samples as usual (with specimen counts in the cells). The contingency table analysis then gives information on whether the two variables of taxon and locality are associated. If not, the data matrix is not very informative. For details, see Press et al. (1992).
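
To make the Chi-square basis concrete, the sketch below computes the statistic and its degrees of freedom for a table of counts, together with Cramér's V as one common Chi-square-based measure of association. Which measure the program actually reports is not stated above, so V is an assumption, as are the function name and the requirement that every row and column contains at least one count.

```python
import math

def contingency_chi_square(table):
    """Return (chi2, dof, cramers_v) for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    n_rows, n_cols = len(row_totals), len(col_totals)
    dof = (n_rows - 1) * (n_cols - 1)
    cramers_v = math.sqrt(chi2 / (total * min(n_rows - 1, n_cols - 1)))
    return chi2, dof, cramers_v
```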

One-way ANOVA

Typical application: Testing for equality of the means of several univariate samples
Assumptions: Normal distribution and similar variances and sample sizes
Data needed: Two or more columns of measured or counted data

One-way ANOVA (analysis of variance) is a statistical procedure for testing the null hypothesis that several univariate data sets (in columns) have the same mean. The data sets are required to be close to normally distributed.
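
The F statistic behind the test is the ratio of the between-group to the within-group mean square, which can be sketched as follows. The function name is an assumption made for this example, and the probability for F is not computed.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples (columns)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        # Spread of the group means around the grand mean.
        ss_between += len(g) * (mean - grand_mean) ** 2
        # Spread of the values around their own group mean.
        ss_within += sum((v - mean) ** 2 for v in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

A large F means the group means differ by more than the scatter within the groups would suggest.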

See Brown & Rothery (1993) or Davis (1986) for details.
