As paleontological data become more quantitative, multivariate, complex, and voluminous, the choice of tools for data analysis acquires a greater influence over the biological and geological conclusions that are drawn from a given body of data. Either the data must be processed and summarized, with their dimensionality reduced and their details obscured, or we need new tools to handle the presentation and publication of larger, more complex data sets. In this paper, one such new tool, the "pairs" plot, is suggested as a way of improving the standard procedures for examining the relationship between leaf architecture and environment.
The Climate Leaf Analysis Multivariate Program (CLAMP) is a method of analyzing fossil leaf assemblages or floras (specifically those deriving from the woody dicotyledonous component of forest ecosystems) by quantifying a set of significant morphological or architectural leaf variables and relating these variables (averaged over the flora) to climate parameters. Such a procedure also allows estimation of ancient climate parameters by uniformitarian extrapolation of patterns found in the distribution of leaf attributes in modern vegetation (Wolfe 1993, 1995, Wolfe and Spicer 1999).
This general notion, which is sometimes referred to as "leaf physiognomy", has been accepted since the early twentieth century, when Bailey and Sinnott (1915) pointed out the strong correlation between the temperature in which modern forests grow and the proportion of their constituent species that have "entire" (i.e., untoothed) leaves. Given this observed correlation in the modern world, the percentage of species with entire margins in a fossil flora can be used to estimate the temperature in which that flora grew. With the introduction of computers that could handle algorithmic classification and ordination of multivariate data, Wolfe (1993) proposed a multivariate method of coding leaves (originally based on 29 variables, later updated to 31) that was intended both to improve the precision of temperature estimates over the univariate linear regressions that had preceded it and to allow the estimation of other climatic variables.
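The univariate approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the calibration used by any of the cited authors: the calibration values, the fossil flora, and the function names `fit_line` and `percent_entire` are all invented for the example, and the numbers were chosen to be exactly linear so that the arithmetic is easy to follow.

```python
# Illustrative sketch only: every number below is invented, not taken from real floras.

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def percent_entire(margins):
    """Percentage of species in a flora scored as entire-margined (True)."""
    return 100.0 * sum(margins) / len(margins)


# Hypothetical modern calibration set:
# percentage of entire-margined species and mean annual temperature (deg C) per site.
calib_entire = [10.0, 25.0, 40.0, 55.0, 70.0, 85.0]
calib_mat = [4.0, 8.5, 13.0, 17.5, 22.0, 26.5]

intercept, slope = fit_line(calib_entire, calib_mat)

# Hypothetical fossil flora: one entry per species, True = entire margin.
fossil_margins = [True, True, False, True, False, False, True, True, False, True]
fossil_pct = percent_entire(fossil_margins)

# Uniformitarian step: apply the modern regression to the fossil percentage.
estimated_mat = intercept + slope * fossil_pct
print(f"{fossil_pct:.0f}% entire -> estimated temperature {estimated_mat:.1f} deg C")
```

The multivariate coding proposed by Wolfe (1993) replaces the single predictor used here with a vector of leaf-character scores per flora; the extrapolation from modern calibration to fossil estimate follows the same logic.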
In addition to temperature, other variables that have been more or less successfully estimated using CLAMP data are precipitation (e.g., Wilf et al. 1998) and moist enthalpy, which can be used to calculate paleo-elevation (e.g., Forest et al. 1999). The linear relationships between leaf size and precipitation and between leaf physiognomy and moist enthalpy are less clear than the relationship between leaf margins and temperature. As Kennedy (1998) points out: "It appears that CLAMP provides a relatively accurate estimation of temperature, but only a general estimation of precipitation." Other variables (including many related to the timing of changes in temperature and precipitation, such as growing season precipitation or warm month mean temperature) have been studied less extensively.
Despite the relatively widespread application of CLAMP, its procedures have been criticized as overly complex and no more informative than simple regression models (Wilf 1997, Wilf et al. 1998). Nevertheless, it provides the only well-known procedure for collecting multivariate data on leaf morphology, and in certain contexts it has become a standard way of determining ancient terrestrial climatic parameters. The focus here is therefore almost exclusively on the CLAMP method, though some of the issues identified may also apply to other recent leaf physiognomic studies, such as the "digital" approach of Huff et al. (2003) and Royer et al. (2005). Much of the debate about the advantages of the CLAMP method over various regression models centers on statistical details: the goal has been to maximize the "explanatory power" of the method and minimize the standard errors of the temperature estimates that it provides. This may not, however, be the best way to choose an analytical methodology, because we have no satisfactory mechanistic explanation of the relationships between most leaf morphological characters and climatic variables. Thus we are by definition engaged in data analysis: that is, we are trying to determine what measured quantities signify and to design empirical models to predict them, not trying to test models based on theory against real data. Minimizing analytic error and maximizing explained variance produce a model that best explains a given set of data. Whether this model will ever explain any other data, be of practical predictive utility, or suggest fruitful lines of future inquiry is a very different question.
In the following consideration of the available data, several issues with the CLAMP method that ought to be addressed become apparent. The focus throughout is on the analytical choices made, not on the collection of raw data: for the purposes of this discussion, it is assumed that the matrix of CLAMP scores is a relatively good reflection of the woody dicot leaf forms present in a living flora. No coding scheme is perfect, but the CLAMP method is the only such coding method that has been widely applied. In contrast, the statistical methods that have been used to analyze CLAMP data represent only a small fraction of the available procedures for multivariate data analysis. Therefore, it seems necessary to explore to what extent the results of a CLAMP analysis are sensitive to the analytical methods chosen and to inherent biases in the data. Are the eigenvector and regression techniques that are generally applied to these data appropriate? What other techniques should be tried? In short: how can we improve the methods used to analyze CLAMP data? The alternative or supplementary analytical program proposed by this paper is based on graphical data analysis using pairs plots, and it seems to show substantial advantages over eigenvector approaches for exploring the relationships between dicot leaves and the environments in which they grow. Though this examination is aimed specifically at one type of paleobotanical data, the general issue of whether exploratory graphical analysis is more appropriate than data reduction is applicable to many other paleontological data sets.
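For readers unfamiliar with the term, a pairs plot is a square grid of scatterplots in which every variable is plotted against every other variable, so that no pairwise relationship is hidden by prior data reduction. The following Python sketch shows the structure of such a plot under stated assumptions: the site values and variable names are invented, the helper names `pairs_panels` and `draw_pairs` are hypothetical, and the drawing step assumes that the matplotlib library is installed. It is not the implementation used in this paper.

```python
# Illustrative sketch only: the site data below are invented for the example.
from itertools import permutations


def pairs_panels(table):
    """Return {(row_var, col_var): [(x, y), ...]} for every off-diagonal panel.

    In panel (row_var, col_var), col_var is on the x axis and row_var on the y axis.
    """
    names = list(table)
    if len({len(table[name]) for name in names}) != 1:
        raise ValueError("all variables must have one value per site")
    panels = {}
    for row_var, col_var in permutations(names, 2):
        panels[(row_var, col_var)] = list(zip(table[col_var], table[row_var]))
    return panels


def draw_pairs(table):
    """Draw the grid of scatterplots; assumes matplotlib is available."""
    import matplotlib.pyplot as plt  # imported here so pairs_panels has no dependencies

    names = list(table)
    k = len(names)
    panels = pairs_panels(table)
    fig, axes = plt.subplots(k, k, figsize=(2.5 * k, 2.5 * k), squeeze=False)
    for i, row_var in enumerate(names):
        for j, col_var in enumerate(names):
            ax = axes[i][j]
            if i == j:
                # Diagonal panels carry the variable name instead of data.
                ax.text(0.5, 0.5, row_var, ha="center", va="center",
                        transform=ax.transAxes)
                ax.set_xticks([])
                ax.set_yticks([])
            else:
                xs, ys = zip(*panels[(row_var, col_var)])
                ax.scatter(xs, ys, s=10)
    return fig


# Hypothetical table: one value per modern site for each variable.
sites = {
    "pct_entire": [12.0, 30.0, 45.0, 61.0, 78.0],
    "pct_large_leaves": [5.0, 9.0, 20.0, 18.0, 35.0],
    "mean_annual_temp_c": [5.0, 10.0, 14.0, 19.0, 25.0],
}

panels = pairs_panels(sites)
print(len(panels), "off-diagonal panels for", len(sites), "variables")

# To render the figure (requires matplotlib):
# fig = draw_pairs(sites)
# fig.savefig("pairs.png")
```

With k variables the grid holds k(k - 1) off-diagonal panels, each pair appearing twice with its axes exchanged. The display therefore becomes crowded as variables are added, which is the practical cost of presenting the data without reducing them first.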