Introduction

Presence-absence data record, for a collection of locations and taxa, which taxa are present at each location. The locations can be, for example, fossil sites of known age or map grid cells associated with observations of present-day mammals. Fundamental concepts in the analysis of presence-absence data are the co-occurrence and correlation of pairs of species: do two taxa occur together more or less often than would be expected by pure chance?

A correlation between two species does not, by itself, necessarily carry any ecological meaning (for discussion, see Schluter (1984)). A correlation can have a trivial explanation, such as the prevalence of the species (Manel et al. 2001). However, a statistically significant correlation can be due to some ecologically relevant process, such as the existence and evolution of species communities. Correlations can be used as an input or a starting point for more complicated analyses, such as cluster analysis ("find clusters of highly correlated species") or multidimensional scaling ("find a projection of species to a plane such that the correlated species are near each other and uncorrelated species are far apart"). In some studies the complete presence-absence matrix is analysed, but here we focus on co-occurrence and correlation of pairs of species. There exists a good body of work on the statistical questions related to the analysis of binary presence-absence matrices, and we refer the interested reader to Gilpin and Diamond (1984), Connor and Simberloff (1984) and Zaman and Simberloff (2002).

The first step in analysis is often to find out whether there exists any correlation between two species. The second step is to find a sound explanation for the correlations, or to apply a more advanced method. This paper focuses on this first step in a general setting. The ecological interpretation or detailed analysis of the causes of the correlation always depends on the context of the data set.

Co-occurrences are traditionally quantified by various similarity indices. There are many such indices (Shi 1993; Hubálek 1982; Archer and Maples 1987; Maples and Archer 1988). Most of the similarity indices can be computed using a contingency table (Table 1).

Typical similarity indices are those suggested by Jaccard (Jaccard 1912), Dice (Dice 1945; Sørensen 1948), Kulczynski (Kulczynski 1927) and Ochiai (Ochiai 1957); these four indices were recommended by Hubálek (1982) who evaluated 43 similarity indices for presence-absence data. Notice that the similarity indices are often used to measure taxon similarity between locations (samples), while we use the indices here to measure similarity between species based on their presence or absence at various locations. These indices can be computed using the counts in the contingency table:

• Jaccard: a/(a+b+c)

• Dice: 2a/(2a+b+c)

• Kulczynski: (a/(a+b)+a/(a+c))/2

• Ochiai: a/sqrt((a+b)(a+c))

It is worth noting that none of these indices depends on d, the number of locations where neither species occurs, or on the total number of locations n. All indices range from 0 to 1, and the value 1 is attained when the taxa always co-occur (b=c=0).
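The four formulas can be sketched in a few lines of Python. This is only an illustration: the function name similarity_indices and the example counts are assumptions made here, while a, b and c follow the notation of the contingency table.

```python
from math import sqrt

def similarity_indices(a, b, c):
    """Return the four indices for a pair of species.

    a: locations where both species occur
    b, c: locations where only one of the two species occurs
    The count d (neither species present) is deliberately not an argument,
    because none of the four indices uses it.
    """
    if a + b == 0 or a + c == 0:
        raise ValueError("each species must occur at least once")
    return {
        "jaccard": a / (a + b + c),
        "dice": 2 * a / (2 * a + b + c),
        "kulczynski": (a / (a + b) + a / (a + c)) / 2,
        "ochiai": a / sqrt((a + b) * (a + c)),
    }

# Hypothetical counts, chosen only to exercise the formulas.
indices = similarity_indices(a=20, b=10, c=5)
for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

Because d never enters the computation, adding or removing locations where both species are absent leaves all four values unchanged.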

As we are analysing real data, the data will always contain noise, i.e., the counts will have errors. Species may sometimes be incorrectly labeled present, but more often species that are present are incorrectly labeled absent, because they were not detected or were overlooked for some other reason (Rosen and Smith 1988; Upchurch and Hunn 2002). This pseudo-absence is one of the error sources that have to be accepted in data analysis, which emphasises the need for a rigorous statistical framework capable of dealing with uncertainty.

We define two species to be uncorrelated if they occur independently of each other. More formally, the two species are uncorrelated if the contingency table approximately obeys the product of the marginal distributions of the species, that is, a is approximately equal to (a+b)(a+c)/n, b is approximately equal to (a+b)(b+d)/n, c is approximately equal to (c+d)(a+c)/n, and d is approximately equal to (c+d)(b+d)/n. A similarity index that directly measures this correlation can be formally defined using the hypergeometric distribution function phyper(a,a+b,c+d,a+c),¹ that is, the probability of observing at most a co-occurrences when the marginal totals of the table are held fixed. The application of the hypergeometric distribution in constructing a similarity index that measures the faunal similarity between locations has been discussed in Raup and Crick (1979). The hypergeometric distribution function directly gives the p-value of the one-tailed Fisher's Exact test. A p-value close to zero corresponds to a strong negative correlation, a value close to unity corresponds to a strong positive correlation (i.e., the species tend to co-occur), and a value near ½ corresponds to lack of correlation (i.e., the species occur independently of each other). An essential property of Fisher's Exact test, and of the definition of correlation in general, is that if we add locations where neither of the species occurs (i.e., d grows large), we will obtain a strong positive correlation (the p-value tends to 1 as d grows large).
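The definition above can be made concrete with a small standard-library sketch. The function phyper below mirrors the argument order given in the footnote; the helper names expected_counts and cooccurrence_p, as well as the example counts, are assumptions introduced for illustration only.

```python
from math import comb

def phyper(q, m, n, k):
    """P(X <= q), where X is the number of white balls among k balls drawn
    at random without replacement from an urn with m white and n black balls."""
    lo = max(0, k - n)      # fewest white balls that can possibly be drawn
    hi = min(q, m, k)       # the cumulative sum stops at q
    favourable = sum(comb(m, x) * comb(n, k - x) for x in range(lo, hi + 1))
    return favourable / comb(m + n, k)

def expected_counts(a, b, c, d):
    """Cell counts implied by independence, given the observed margins."""
    n = a + b + c + d
    return (
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    )

def cooccurrence_p(a, b, c, d):
    """The index phyper(a, a+b, c+d, a+c) described in the text."""
    return phyper(a, a + b, c + d, a + c)

# Hypothetical tables (a, b, c, d), chosen only for illustration.
independent = (10, 10, 10, 10)   # observed counts equal the expected counts
never_together = (0, 10, 10, 0)  # the species never co-occur
always_together = (10, 0, 0, 10) # the species always co-occur

for table in (independent, never_together, always_together):
    print(table, expected_counts(*table), cooccurrence_p(*table))
```

Exact integer binomial coefficients are used, so the sketch involves no floating-point error apart from the final division; for large tables a dedicated statistics library would be the more practical choice.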

The four similarity indices described earlier are fundamentally different from the p-value of Fisher's Exact test. The value of the Jaccard, Dice, Kulczynski, or Ochiai index carries little information about correlation, that is, about whether the species occur independently or not (except when the index is exactly one); see Table 2 for an example. In the table, the high values of the Jaccard and similar indices are simply due to the fact that both species A and B are quite common (each occurs at 90% of the find sites), and, therefore, the co-occurrence count is high even though the species occur independently of each other. These indices are useful, for example, in comparing the relative co-occurrences of several pairs of species, but they say little about the correlation of two species.
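The effect can be checked by hand with hypothetical counts. The numbers below are chosen for this illustration and are not necessarily those of Table 2: assume n = 100 locations, each species present at 90 of them, and a co-occurrence count exactly equal to its expectation under independence.

```latex
\begin{align*}
n &= 100, \qquad a + b = a + c = 90 \\
a &= \frac{(a+b)(a+c)}{n} = \frac{90 \cdot 90}{100} = 81,
  \qquad b = c = 9, \qquad d = 1 \\
\text{Jaccard} &= \frac{81}{81 + 9 + 9} = \frac{81}{99} \approx 0.82 \\
\text{Dice} &= \frac{2 \cdot 81}{2 \cdot 81 + 9 + 9} = \frac{162}{180} = 0.90 \\
\text{Kulczynski} &= \frac{1}{2}\left(\frac{81}{90} + \frac{81}{90}\right) = 0.90 \\
\text{Ochiai} &= \frac{81}{\sqrt{90 \cdot 90}} = 0.90
\end{align*}
```

All four indices are close to one although the table matches the independence expectation exactly, so the high values reflect prevalence rather than association.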

To study the existence of correlation, as discussed above, it is essential to take into account the number of locations where neither of the species occurs. This number is determined by the choice of the base set, by which we mean the set of locations that we include in our study, that is, the locations that we use to compute the contingency table of Table 1. If we study the correlation of, say, two African species, we will typically obtain very different results depending on whether we take into account only the African locations or all locations across the globe. As discussed later in more detail, if we include all locations across the globe we will usually obtain a strong positive correlation (a p-value of Fisher's Exact test that is close to one), because both species exist in Africa only and there are therefore many locations where neither species occurs.
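The influence of the base set can be illustrated numerically. The sketch below starts from a hypothetical table in which the two species tend to avoid each other, then enlarges the base set with locations where neither occurs. The counts, the list of extra locations and the helper name p_values_for_growing_base_set are all assumptions made for this example.

```python
from math import comb

def phyper(q, m, n, k):
    """P(at most q white balls) when k balls are drawn without replacement
    from an urn with m white and n black balls."""
    lo = max(0, k - n)
    hi = min(q, m, k)
    favourable = sum(comb(m, x) * comb(n, k - x) for x in range(lo, hi + 1))
    return favourable / comb(m + n, k)

def p_values_for_growing_base_set(a, b, c, d, extra_locations):
    """Recompute phyper(a, a+b, c+d, a+c) after adding locations where
    neither species occurs, i.e. after increasing d only."""
    return [phyper(a, a + b, c + d + extra, a + c) for extra in extra_locations]

# Hypothetical counts within a restricted base set (for example, one continent).
a, b, c, d = 5, 20, 20, 5
extras = [0, 50, 500, 5000]   # added locations with neither species present

p_values = p_values_for_growing_base_set(a, b, c, d, extras)
for extra, p in zip(extras, p_values):
    print(f"d = {d + extra:5d}   p = {p:.6f}")
```

Only d changes between the rows, so any index that ignores d would stay constant, whereas the p-value moves from the negative-correlation end of the scale towards one.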

The first main argument in this paper is that, as an initial step in any analysis involving co-occurrences of species, it is often necessary to ascertain whether there is a statistically significant negative or positive correlation between two given species. For this purpose, as discussed above, most of the traditional association similarity indices are useless as such: a given value of an association similarity index such as Jaccard does not imply the existence of a negative or positive correlation. Our second main argument is that, before computing the correlations, it is essential to select the base set properly, depending on the effect we want to study. As discussed above we can, for example, almost always obtain a positive correlation by adding to our study locations in which neither of the species occurs. In this work we give principled guidelines on how to select the base set based on the effects we want to study.

¹ The hypergeometric distribution corresponds to a process where k balls are drawn at random, without replacement, from an urn with m white and n black balls. phyper(q,m,n,k) denotes the probability of drawing at most q white balls.