Example Analysis

This example analysis of correlations is based on the 2007 version of the Neogene of the Old World Database of Fossil Mammals, NOW (Fortelius 2007). The database contains information about Eurasian Miocene to Pleistocene land mammal taxa and localities. Such an extensive database, compiled through international collaboration, illustrates well why base set selection matters: it has been assembled from various data sources and from studies addressing completely different research questions. In other words, the data have not been collected to answer the questions we are about to analyze, and it is therefore essential to carefully select the subset of the data that is relevant and as unbiased as possible for answering our question.

The NOW data were preprocessed by including only large mammals that were present at 10 or more sites. Sites were then filtered by including only those with 10 or more genera. The preprocessing resulted in a base set of 217 sites and 169 genera. Without filtering, the data set would have contained more marginal species and locations, lowering the number of statistically significant results. The justification for the filtering is that we select into our base set only those species and locations that have been studied more widely. This prevents biases such as a set of findings from a very exotic and tightly focused research programme distorting the results of our general correlation studies.
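The two preprocessing thresholds can be illustrated with a minimal sketch. It assumes the occurrence data are held as a mapping from site to the set of genera found there, that the filter is applied once (genera first, then sites), and that the site threshold is counted over the genera that survived the first step. None of these details is stated in the text, and the function name `preprocess` is hypothetical. The restriction to large mammals is not modelled here.

```python
def preprocess(occurrences, min_sites=10, min_genera=10):
    """Filter a site -> set-of-genera mapping by the two thresholds.

    Assumption: a single pass, genera first and sites second. After the
    site filter some genera may again fall below min_sites; whether the
    original analysis iterated the filter is not known.
    """
    # Step 1: count the sites of each genus and keep the common ones.
    site_counts = {}
    for genera in occurrences.values():
        for genus in genera:
            site_counts[genus] = site_counts.get(genus, 0) + 1
    kept_genera = {g for g, n in site_counts.items() if n >= min_sites}

    # Step 2: keep sites that still hold enough of the kept genera.
    filtered = {}
    for site, genera in occurrences.items():
        remaining = genera & kept_genera
        if len(remaining) >= min_genera:
            filtered[site] = remaining
    return filtered
```

With the thresholds of the text, `preprocess(occurrences)` would return the filtered mapping from which the site and genus counts of the base set can be read.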

We use the data to show the effect of the previously described filtering criteria. For presenting the results, we use shorthand notation for the restrictions. The keyword GeoUnion is used for data that have been filtered using the geographic criterion in which all locations within the combined area (union of areas) of the two species are included in the base set. The keyword GeoIntersection is used when only locations in the shared area (intersection of areas) of the two species are included. When no geographic restriction is applied, we use the keyword NoGeo.
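The three geographic keywords reduce to set operations once the area of each species is known. The sketch below assumes that each area has already been resolved into the set of site identifiers lying inside it; how the area itself is delineated is not covered here, and the function name `geo_base_set` is hypothetical.

```python
def geo_base_set(all_sites, area_a, area_b, mode="GeoUnion"):
    """Select the base set of sites for one pair of species.

    all_sites: set of all site ids in the preprocessed data.
    area_a, area_b: sets of site ids inside each species' area
    (assumed to be precomputed by some earlier step).
    """
    if mode == "GeoUnion":
        # Sites within the combined area of the two species.
        return all_sites & (area_a | area_b)
    if mode == "GeoIntersection":
        # Sites within the area shared by both species.
        return all_sites & area_a & area_b
    if mode == "NoGeo":
        # No geographic restriction.
        return set(all_sites)
    raise ValueError(f"unknown geographic restriction: {mode}")
```

Because the intersection is always contained in the union, GeoIntersection can never yield more sites than GeoUnion, and GeoUnion never more than NoGeo.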

For temporal restrictions, the keyword Time is used when we apply the restriction that selects only sites from MN units in which both of the genera existed, and the keyword NoTime is used for data that have not been filtered with this restriction.
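A minimal sketch of the Time restriction follows. It assumes that every site is assigned exactly one MN unit and that a genus "existed" in a unit if it has been found at one or more sites of that unit; the text does not say whether ranges are filled in between first and last occurrences. The function name `time_base_set` is hypothetical.

```python
def time_base_set(site_mn, occurrences, genus_a, genus_b):
    """Return the sites whose MN unit is shared by both genera.

    site_mn: dict mapping site id -> MN unit label.
    occurrences: dict mapping site id -> set of genera found there.
    """
    def units_of(genus):
        # MN units with one or more recorded occurrences of the genus.
        return {site_mn[s] for s, genera in occurrences.items()
                if genus in genera}

    shared_units = units_of(genus_a) & units_of(genus_b)
    # Keep every site from a shared unit, whether or not it holds
    # either genus: absences within those units are informative.
    return {s for s, unit in site_mn.items() if unit in shared_units}
```

For NoTime, the base set is simply left unfiltered.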

For taxonomic criteria, the keyword Family is used for data that have been filtered by similarity at the family level. That is, for each pair of species, we take into the base set only the find sites at which there is at least one representative of each of the families of the two species. In this way we can rule out the distribution patterns of the families as explanatory factors. The keyword Order is used correspondingly for data that have been filtered by similarity at the order level, and the keyword NoTaxonomic for data that have not been filtered with this restriction.
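The taxonomic restriction can be sketched in the same style. The example assumes a lookup table from each taxon to its higher-level group; passing a family table gives the Family restriction and an order table gives the Order restriction. The names `taxonomic_base_set` and `group_of` are hypothetical.

```python
def taxonomic_base_set(occurrences, group_of, taxon_a, taxon_b):
    """Return sites holding a representative of both taxa's groups.

    occurrences: dict mapping site id -> set of taxa found there.
    group_of: dict mapping taxon -> family name (or order name).
    """
    group_a, group_b = group_of[taxon_a], group_of[taxon_b]
    base = set()
    for site, taxa in occurrences.items():
        groups_here = {group_of[t] for t in taxa}
        # Any member of the group counts, not only the focal taxon.
        if group_a in groups_here and group_b in groups_here:
            base.add(site)
    return base
```

A site therefore enters the base set even when neither focal taxon is present, as long as both families (or orders) are represented there.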

For all pairs of taxa we calculated the correlation and its p-value using Fisher's Exact Test, mid-P variant. Both positive and negative correlations were tested. Multiple testing correction was conducted using the Benjamini-Hochberg method, with the false discovery rate controlled at 0.05. The total numbers of significant correlations for the NOW data set are given in Table 7.
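The testing step can be illustrated with a short sketch. It assumes that "both positive and negative correlations" means two one-sided tests per pair, and it uses the standard mid-P definition, in which the probability of the observed table is weighted by one half. How the p-values of the two directions were pooled for the correction is not stated, so the correction is shown for a generic list of p-values. The function names are hypothetical.

```python
from math import comb

def midp_fisher(n_sites, n_a, n_b, n_both):
    """One-sided mid-P values (positive, negative) for one pair.

    n_sites: size of the base set; n_a, n_b: sites with each taxon;
    n_both: sites where the two taxa co-occur.
    """
    total = comb(n_sites, n_b)
    lo = max(0, n_a + n_b - n_sites)   # smallest possible overlap
    hi = min(n_a, n_b)                 # largest possible overlap

    def pmf(k):
        # Hypergeometric probability of exactly k co-occurrences.
        return comb(n_a, k) * comb(n_sites - n_a, n_b - k) / total

    p_eq = pmf(n_both)
    p_more = sum(pmf(k) for k in range(n_both + 1, hi + 1))
    p_less = sum(pmf(k) for k in range(lo, n_both))
    return p_more + 0.5 * p_eq, p_less + 0.5 * p_eq

def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the test is significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= k / m * fdr.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            cutoff = rank
    significant = [False] * m
    for i in order[:cutoff]:
        significant[i] = True
    return significant
```

The two mid-P values of a pair always sum to one, so at most one direction can be significant for any pair at the 0.05 level.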

In Table 7, the most obvious difference between the counts of significant correlations is observed when the temporal restriction is applied. Without the restriction, a vast number of correlations appears, as the database contains large numbers of species that lived at different times. As the counts for NoTime are about two orders of magnitude larger than those for Time, it is clear that the trivial temporal effect dwarfs other, ecologically more interesting effects and should therefore be removed by applying the temporal restriction.

When considering geographic restrictions, we see that the three cases GeoIntersection, GeoUnion, and NoGeo differ in strictness. Typically there are no grounds for considering geographic locations that for some reason have not been reachable by either species, suggesting that at least the GeoUnion restriction should be applied and that the correlations from NoGeo might stem from obvious or uninteresting geographic effects. If we are interested in interspecies dynamics, it is advisable to use the stricter GeoIntersection restriction to include only areas where we have evidence of both species existing.

Finally, the choice of taxonomic level yields different counts of significant correlations. When we are interested in interactions between orders, the criterion Order may offer the right level of inspection, as it filters out effects stemming from the two species possibly belonging to different orders. Similarly, the criterion Family can be used for studying interactions between families. It can also be used for studying interactions between orders, but it is not optimal for that purpose, as it filters out more locations. NoTaxonomic is a good choice if we know that the data do not contain taxonomic biases, because omitting the filter yields the largest number of locations and hence the best statistical power. The right level of filtering should be decided based on the question being studied.

It is important to bear in mind that the analysis reveals only correlations between species, not their interpretation. The existence of a correlation often implies an ecologically relevant process, assuming that the base set has been selected appropriately. However, a correlation does not directly carry any ecological meaning; it only states that the occurrence of the two species in the base set cannot be explained by random effects alone. Correlations are an invaluable way of generating hypotheses and finding interesting aspects of the data set for further examination. It is the task of the next analysis step to validate the correlations and to find sound ecological explanations for them.