# Statistical Significance

An observed positive or negative correlation may arise from purely random effects. Statistical significance testing methodology gives a way of determining whether an observed correlation is just because of random occurrences, or whether it is a real phenomenon, i.e., statistically significant.

The ingredients of statistical significance testing are given by the null hypothesis and the test statistic. The null hypothesis describes the case when there is no correlation. In our case an obvious choice for a statistical significance test is Fisher's Exact Test, mid-P variant (Berry and Armitage 1995).

In Fisher's Exact Test the null hypothesis is that the contingency table has been sampled uniformly at random from the set of contingency tables having fixed marginal species counts (a+b, a+c, b+d, and c+d). The count a is used as the test statistic. The p-value of the one-tailed Fisher's Exact Test is the probability, under the null hypothesis, that the test statistic is at least as extreme in the given direction as the observed value; the lower-tail p-value can be expressed using the hypergeometric distribution function as phyper(a,a+b,c+d,a+c), as discussed earlier. The mid-P variant addresses the fact that the one-tailed Fisher's Exact Test is slightly too conservative; the standard test is modified slightly so that the lower-tail and upper-tail p-values are guaranteed to sum to unity (see Berry and Armitage (1995) for details and discussion). As usual, we define a correlation to be significant if the p-value is at most some predefined significance limit; otherwise we declare the correlation not statistically significant. In the remainder of this paper we use the significance limit of 0.05 and therefore declare a correlation significant if and only if the respective p-value satisfies p ≤ 0.05. By Fisher's Exact Test, mid-P variant, we mean the lower-tail variant unless otherwise noted. In a proper statistical significance testing method, such as Fisher's Exact Test, mid-P variant, the probability of a false positive (an event where the null hypothesis is rejected even though it holds) is at most 0.05.
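As an illustration, the lower-tail mid-P value can be sketched in a few lines of Python. This is a minimal sketch, not the model implementation referred to in this paper: it assumes the usual definition of the mid-P value (the probability of a strictly smaller count plus half the probability of the observed count), and the function names `hypergeom_pmf` and `midp_lower` are hypothetical.

```python
from math import comb


def hypergeom_pmf(x, a, b, c, d):
    """Null probability that the top-left count equals x, with the
    margins of the observed table [[a, b], [c, d]] held fixed."""
    row1, row2, col1 = a + b, c + d, a + c
    # x must lie in the range allowed by the fixed margins
    if x < max(0, col1 - row2) or x > min(row1, col1):
        return 0.0
    return comb(row1, x) * comb(row2, col1 - x) / comb(row1 + row2, col1)


def midp_lower(a, b, c, d):
    """Lower-tail mid-P value (assumed definition): P(A < a) + 0.5 * P(A = a)."""
    smallest = max(0, (a + c) - (c + d))
    below = sum(hypergeom_pmf(x, a, b, c, d) for x in range(smallest, a))
    return below + 0.5 * hypergeom_pmf(a, a, b, c, d)
```

For the table with a=1, b=3, c=3, d=1, the feasible counts 0 to 4 have null probabilities 1/70, 16/70, 36/70, 16/70 and 1/70, so `midp_lower(1, 3, 3, 1)` evaluates to 1/70 + 8/70 = 9/70.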

Because the count a is used as the test statistic and lower counts are treated as more extreme, the lower-tail p-value measures the significance of negative correlation. The p-value is small when the count a is exceptionally small compared with what the marginal sums lead us to expect under the null hypothesis. Positive correlation can be measured with the upper-tail p-value, which for the mid-P variant can be computed as 1-p, where p is the p-value for negative correlation.

Hence we can test two hypotheses, one for positive and one for negative correlation. However, having two hypotheses per species pair must be taken into account in a multiple hypothesis correction step, which is described later. If we are not interested in the direction of the correlation, i.e., we are only looking for extreme correlations, we should use a two-tailed p-value. It can be easily computed as 2 min(p, 1-p), where p is the one-tailed p-value (see Dudoit et al. 2003). A less conservative two-tailed p-value can also be computed by taking all contingency tables with the given marginal sums, selecting those whose probability is equal to or less than that of the observed table, and summing up their probabilities; this approach does not, however, generalize in a straightforward way to a situation where we want to obtain one-tailed p-values.
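The relationship between the three directional p-values can be sketched as follows. The sketch assumes that `p_lower` is a lower-tail mid-P value, so that the lower and upper tails sum to one; the function name `directional_p_values` and the dictionary keys are hypothetical.

```python
def directional_p_values(p_lower):
    """Derive upper-tail and two-tailed p-values from a lower-tail mid-P value.

    Assumes the lower-tail and upper-tail p-values sum to one, as they do
    for the mid-P variant.
    """
    p_upper = 1.0 - p_lower                    # significance of positive correlation
    p_two = 2.0 * min(p_lower, p_upper)        # doubling rule: 2 min(p, 1-p)
    return {"negative": p_lower, "positive": p_upper, "two_tailed": p_two}
```

For example, a lower-tail value of 9/70 gives an upper-tail value of 61/70 and a two-tailed value of 18/70.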

This statistical test can be used with all association similarity indices if we want to test whether there is a deviation from the assumption that the species are independent. Significance testing can therefore be done independently of the selected association similarity index.

Besides simplicity, the strength of Fisher's Exact Test is that it produces valid results regardless of the sample size. Pearson's Chi-square Test might be used instead of Fisher's Exact Test, but it assumes a sufficiently large sample size. In many presence-absence data sets the sample size cannot be claimed to be sufficiently large with confidence. Besides analytical tests, the significance of similarity indices can also be tested using Monte Carlo methods. Monte Carlo methods rely on computational power to generate random samples from a null distribution and to calculate empirical p-values by comparing the statistic observed in the real data with the statistics of the random samples. Monte Carlo methods do not require cumbersome analytical treatment, and they make it possible to derive significance estimates when no analytical solution is known. For the null hypothesis of species independence we know the analytical solution, so Monte Carlo methods are not needed.
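For comparison only, a Monte Carlo estimate for a single species pair might look like the sketch below. Everything specific in it is an assumption made for illustration: the null samples are generated by shuffling one species' presence-absence vector across sites (which keeps both marginal counts fixed), the add-one correction is one common convention for empirical p-values, and the function name `monte_carlo_lower_p` is hypothetical. It estimates the ordinary lower-tail p-value, not the mid-P variant.

```python
import random


def monte_carlo_lower_p(x, y, n_sim=2000, seed=0):
    """Empirical lower-tail p-value for the co-occurrence count of two species.

    x and y are equal-length lists of 0/1 presence-absence values
    recorded over the same sites.
    """
    rng = random.Random(seed)
    observed = sum(xi * yi for xi, yi in zip(x, y))  # the count a
    shuffled = list(y)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # permuting sites keeps both species' totals fixed
        simulated = sum(xi * si for xi, si in zip(x, shuffled))
        if simulated <= observed:
            hits += 1
    # add-one correction (assumed convention): the estimate is never exactly zero
    return (hits + 1) / (n_sim + 1)
```

With enough random samples the estimate approaches the exact lower-tail probability that the analytical test gives directly, which is why the analytical route is preferable here.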

Statistical significance testing is further complicated by the fact that typically there is not only one pair of species, but several pairs of species whose correlations we want to study. Multiple tests may result in false positives. For example, assume that there are seven species. Then there are 21 pairs of species and an equal number of positive correlations to test for significance. We can test each of the 21 individual correlations for significance using Fisher's Exact Test, as described above. We call the p-values produced by these tests unadjusted p-values. It follows that even if the null hypothesis is true (i.e., there are really no correlations for any of the pairs of species) we would declare on average about one of the 21 correlations significant by random chance alone. This effect arises because we are rejecting null hypotheses at level 0.05 = 1/20. A statistical test controls the probability of falsely rejecting a single null hypothesis, but when the test is repeated, the probability of a mistake increases unless further control procedures are used. Also, if we are testing for both positive and negative correlation, there are two hypotheses for each species pair, effectively doubling the number of simultaneous hypotheses.
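The arithmetic behind this example can be checked with a short sketch. It assumes that every test has a false positive probability of exactly the significance level; the probability of at least one false positive additionally assumes independent tests, which pairwise tests sharing species do not strictly satisfy. The function name `multiple_testing_burden` is hypothetical.

```python
from math import comb


def multiple_testing_burden(n_species, alpha=0.05, directions=1):
    """Number of tests and false positives to expect when all nulls are true.

    directions=2 counts separate hypotheses for positive and negative
    correlation for every species pair.
    """
    n_tests = comb(n_species, 2) * directions
    expected_false_positives = n_tests * alpha
    # probability of at least one false positive, assuming independent tests
    p_at_least_one = 1.0 - (1.0 - alpha) ** n_tests
    return n_tests, expected_false_positives, p_at_least_one
```

For seven species this gives 21 tests and 1.05 expected false positives, and under the independence assumption a probability of about 0.66 of at least one false positive.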

There are several ways to construct a multiple hypothesis testing method that corrects for this effect (see Dudoit et al. 2003 for a review and references). Multiple hypothesis testing methods take as input the unadjusted p-values, in our case those produced by the mid-P variants of the one-tailed Fisher's Exact Tests. They output adjusted p-values, one for each of the correlations. A null hypothesis is then rejected if the respective adjusted p-value is at most the chosen level, in our case 0.05. The simplest and best known of the methods is the Bonferroni correction, where the adjusted p-values are obtained by multiplying the respective unadjusted p-values by the total number of hypotheses (in this case, correlations) to be tested. The Bonferroni correction, while proper, is excessively conservative and therefore lacks power.
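The Bonferroni correction is simple enough to sketch in one line. Capping the result at 1, so that the adjusted value remains a valid probability, is a common convention assumed here; the function name `bonferroni_adjust` is hypothetical.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: each unadjusted p-value times the
    number of hypotheses, capped at 1 (the cap is an assumed convention)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

With 21 hypotheses, an unadjusted p-value must be at most 0.05/21 (about 0.0024) to be declared significant, which illustrates how quickly the correction loses power as the number of species pairs grows.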

We use false discovery rate (FDR) adjustment and the Benjamini-Hochberg method (Benjamini and Hochberg 1995) to obtain the adjusted p-values, and we use the adjusted p-values to identify significant correlations. We declare a correlation significant if the respective adjusted p-value is at most 0.05. The derivation of the Benjamini-Hochberg method is somewhat involved, but computing the adjusted p-values from the unadjusted p-values takes only a few lines of program code. The Benjamini-Hochberg method guarantees that the expected fraction of false positives among all correlations declared significant is at most 0.05 (Benjamini and Hochberg 1995).
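A minimal sketch of the adjustment, written in the step-up form in which the Benjamini-Hochberg method is commonly implemented, is given below. It is an illustration rather than the model implementation referred to in this paper, and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    The p-value of rank k (1 = smallest) among m is scaled by m / k; a
    running minimum taken from the largest p-value downwards keeps the
    adjusted values monotone in the unadjusted ones.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For the unadjusted p-values 0.01, 0.04, 0.03 and 0.5, the adjusted values are 0.04, 0.0533, 0.0533 and 0.5 (rounded), so only the first correlation would be declared significant at the 0.05 level.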

Summarizing, we compute an unadjusted p-value for each of the correlations using the one-tailed Fisher's Exact Test, mid-P variant. We then apply the Benjamini-Hochberg method to these unadjusted p-values to obtain the adjusted p-values. We declare a correlation significant if the respective adjusted p-value is at most 0.05. A model software implementation of the method is presented in the Appendix.