Cetacean Diversity Estimates:

Data Sources and Data Processing

Data on the geographic and temporal distribution of cetacean genera were tabulated from the primary literature and from subsequent reviews of that literature. The bibliographic data, as well as the taxonomic and occurrence data in this literature, were entered into the Paleobiology Database (PBDB), which is publicly available. Great effort was made to include every publication containing taxonomic or distributional data on any fossil cetacean taxon through June 2007. Some small number of fossil cetacean publications certainly remain to be entered into the PBDB, but these should be only randomly associated with any particular taxon, time interval, or geographical region, so their absence should not bias our results in any predictable manner.

This data compilation resulted in 248 cetacean genera currently in use (218 of which have a known fossil record) and 188 additional generic names that have previously been applied to cetacean fossils. Of these 188, 74 are now junior synonyms of other cetacean genera; seven were improperly formed genera that were subsequently replaced; three are objective synonyms of other genera; 77 are nomina dubia (names of doubtful application), nomina nuda (names without proper justification), nomina oblita (unused names), or nomina vana (names for which the proper application cannot be determined); and 26 are published misspellings of other genera. One genus was moved to Pinnipedia based on a new understanding of the type specimen (Ontocetus emmonsi), leaving a subsequently named cetacean species without a genus ("Ontocetus" oxymycterus). These data were compiled from 761 separate scholarly sources.
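The tally above can be checked with a few lines of arithmetic. The sketch below is only an illustrative consistency check, not part of the original analysis; the dictionary keys and variable names are our own, and it assumes that the genus moved to Pinnipedia is counted among the 188 additional names.

```python
# Counts as reported in the text for the additional generic names.
# Assumption: the genus transferred to Pinnipedia is one of the 188.
additional_names = {
    "junior synonyms": 74,
    "improperly formed and replaced": 7,
    "objective synonyms": 3,
    "nomina dubia, nuda, oblita, or vana": 77,
    "published misspellings": 26,
    "moved to Pinnipedia": 1,
}

total_additional = sum(additional_names.values())

genera_in_use = 248
genera_with_fossil_record = 218
# Genera currently in use that have no known fossil record.
genera_without_fossil_record = genera_in_use - genera_with_fossil_record

print(total_additional, genera_without_fossil_record)
```

Under that assumption the categories account for all 188 names.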

A similar, but perhaps slightly less comprehensive, data collection approach was used for the mammalian order Sirenia, which serves as a taphonomic control taxon. A slightly larger proportion of sirenian references probably remain to be entered into the PBDB, but, as stated above, these should be only randomly associated with any particular taxon, time interval, or geographical region, so their absence should not bias our results in any predictable manner.

This data compilation resulted in 40 sirenian genera currently in use and 33 generic names that have been previously applied to sirenian fossils. Of these 33, 30 are now junior synonyms of other sirenian genera; three were improperly formed genera (a homonym and two nomina nuda) that were subsequently replaced. These data were compiled from 301 separate scholarly sources.

Data from five museum specimen databases were compiled to help answer questions relating to the numbers of specimens, collections, and fossil collectors, and the habits of these collectors. These factors can be important biases that affect studies of diversity through time, especially for vertebrate paleontologists (e.g., Davis and Pyenson 2007). All records relating to the Order Cetacea were gleaned from the databases at the Florida Museum of Natural History (FLMNH), Natural History Museum of Los Angeles County (LACM), San Diego Museum of Natural History (SDMNH), University of California Museum of Paleontology (UCMP), and the United States National Museum (USNM). These museum databases were chosen because they all have very large collections of fossil cetaceans, and together they include fossil collections from all of the major fossil cetacean-producing areas of North America. The 48 contiguous United States were considered to be roughly equivalent to the continent of North America for several reasons. First, no fossil cetaceans are known from Hawaii, and only poorly identified material is known from two collections in Alaska. Second, the most productive fossil cetacean locality in Canada lies in a very small area of southern Quebec and has produced only two taxa (Delphinapterus and Phocoena) of Pleistocene age. The only other fossil cetacean localities in Canada are the type locality of Chonecetus and two other collections producing indeterminate cetaceans from the Oligocene of British Columbia. Third, Mexico includes six collections that have produced cetacean fossils identified to the generic level, resulting in eight genera. The effort required to digitize the geologic maps of Canada and Mexico was therefore considered excessive compared to the potential payoff. Finally, restricting this part of the analysis to a single country helps to standardize the collection effort, which is more likely to be close to uniform within a single country.

Geologic maps of those states among the 48 contiguous United States that contain continental shelf strata of Eocene to Recent age were digitized in order to determine the map area of rocks that could potentially produce fossil cetaceans. For states with existing digital versions of geologic maps, the maps were used unmodified from their original format. These states include: Delaware (Delaware Geological Survey 1976), Florida (Scott et al. 2001), and Maryland (Maryland Geological Survey 1968). For the remaining states, paper maps were digitally photographed in small sections. These include: Alabama (Osborne et al. 1989), California (Jennings 1977), Georgia (Pickering and Murray 1976), Louisiana (Snead and McCulloh 1984), Mississippi (Bicker 1969), North Carolina (North Carolina Geological Survey 1985), Oregon (Earnest et al. 1991), South Carolina (US Geological Survey 1936), Texas (Hartmann and Scranton 1992), Virginia (Johnson 1993), and Washington (Caruthers et al. 2002; Walsh et al. 1987). These sections, including the map scale, were photographed at exactly the same scale with a digital camera mounted on a tripod. The map sections were then reassembled in Adobe Photoshop to form a complete map. See Figure 1 for a graphical outline of the measurement procedure described here. Map areas of all Eocene to Recent formations were digitally measured using ImageJ software (Figure 1). These area data were recorded at the finest stratigraphic scale possible, based on the stratigraphic resolution of the map, and then assigned geologic ages based on the most recent publications available.
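The conversion from a measured pixel count to map area can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually run in ImageJ: it assumes that each formation's outcrop is measured as a pixel count and that the photographed scale bar gives the ground distance per pixel. The function names, formation names, ages, and all numbers are hypothetical.

```python
def km_per_pixel(scale_bar_km: float, scale_bar_px: float) -> float:
    """Ground distance represented by one pixel, from the photographed scale bar."""
    if scale_bar_px <= 0:
        raise ValueError("scale bar length in pixels must be positive")
    return scale_bar_km / scale_bar_px


def outcrop_area_km2(pixel_count: int, scale_bar_km: float, scale_bar_px: float) -> float:
    """Convert a measured pixel count to map area in square kilometres."""
    return pixel_count * km_per_pixel(scale_bar_km, scale_bar_px) ** 2


# Hypothetical measurements: (formation, assigned age, pixel count).
measurements = [
    ("Formation A", "Miocene", 12000),
    ("Formation B", "Miocene", 8000),
    ("Formation C", "Eocene", 5000),
]

# Hypothetical scale bar: 50 km spans 100 pixels in the reassembled map.
SCALE_BAR_KM = 50.0
SCALE_BAR_PX = 100.0

# Sum outcrop area per assigned geologic age.
area_by_age = {}
for _name, age, pixels in measurements:
    area = outcrop_area_km2(pixels, SCALE_BAR_KM, SCALE_BAR_PX)
    area_by_age[age] = area_by_age.get(age, 0.0) + area

print(area_by_age)
```

Because area scales with the square of the linear scale, photographing every section at exactly the same scale, as described above, is what allows a single conversion factor to be applied to a whole reassembled map.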

Population data were also tabulated to determine if the number of people in a potentially fossil cetacean-bearing area had any effect on the number of fossil cetaceans found in that area. Data on the population of each county that included potentially fossil cetacean-bearing strata in the states mentioned above were tabulated from the U.S. Census Bureau. Data from the censuses of 1900 to 1990 were included in this analysis.

We assessed research effort by tabulating the number of papers published on fossil cetaceans (those including the keyword Cetacea) in the GeoRef database (American Geological Institute). Thus, we use publication output as a proxy for research effort. GeoRef may not be the ideal database for this type of tabulation because it does not include many museum publications, nor does it include many of the biologically oriented publications in which paleontologists regularly publish their work. The Bibliography of Fossil Vertebrates (BFV) may seem like a better choice, but it contains large gaps, covering only the years 1509 to 1968 and 1981 to 1993. Although GeoRef may miss some obscure publications, it is less problematic for this type of analysis than the BFV, and GeoRef has conceivably not changed the focus of its databasing effort over calendrical time.

To measure publication effort, we counted the number of papers on fossil cetaceans that referred to each time interval in our cetacean data set, using the name of the time interval as a keyword. For instance, we tabulated a paper as "late Eocene Cetacea" if both keywords "late Eocene" and "Cetacea" were in the keyword field. Table 1 lists the set of searches used to generate this data set. This procedure was complicated by the observation that GeoRef includes both time terms such as early, middle, and late, and time-stratigraphic terms such as lower, middle, and upper. We counted papers that included either the time terms or the time-stratigraphic terms. We avoided double counting of papers that included both types of terms by combining search criteria with the "OR" operator. The papers in GeoRef that include only a more general time term such as Eocene, without a modifier of any kind, presented another potentially confounding factor. These papers were tabulated by additional searches for references that included the more general time terms exclusive of the more particular terms. The results of these tabulations, made in December 2006, are shown in Table 1.
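The search logic described above can be illustrated with simple set operations. The sketch below is not GeoRef's query syntax, and the records are hypothetical; it only shows why combining the time and time-stratigraphic terms with OR counts each paper once, and how papers carrying only the general term can be tabulated separately. The function names are our own.

```python
def count_interval(papers, taxon, interval_terms):
    """Count papers with the taxon keyword and at least one interval term.

    A paper carrying both 'late eocene' and 'upper eocene' is counted once,
    which mirrors combining the two terms with OR in a single search.
    """
    return sum(1 for kw in papers if taxon in kw and kw & interval_terms)


def count_general_only(papers, taxon, general_term, specific_terms):
    """Count papers with the general term but none of the more specific terms."""
    return sum(
        1
        for kw in papers
        if taxon in kw and general_term in kw and not (kw & specific_terms)
    )


# Hypothetical keyword fields, one set of lowercase keywords per paper.
papers = [
    {"cetacea", "late eocene"},
    {"cetacea", "upper eocene", "late eocene"},
    {"cetacea", "eocene"},
    {"cetacea", "eocene", "middle eocene"},
    {"sirenia", "late eocene"},
]

late_eocene_terms = {"late eocene", "upper eocene"}
all_specific_terms = {
    "early eocene", "lower eocene", "middle eocene",
    "late eocene", "upper eocene",
}

late_eocene_count = count_interval(papers, "cetacea", late_eocene_terms)
general_only_count = count_general_only(
    papers, "cetacea", "eocene", all_specific_terms
)

print(late_eocene_count, general_only_count)
```

In this toy example two cetacean papers fall in the late Eocene bin and one carries only the unmodified term Eocene.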

We also tabulated the number of references that contribute to the total data set in the Paleobiology Database, as well as the number of references that contribute to the generic counts and occurrence counts in each time interval. The results of these tabulations are also shown in Table 1. The number of genera described per paper is, of course, variable, but as long as there is no temporal trend (that is, a trend over the geologic times from which these fossils are described) that variability should not affect our results.

Several previous authors have suggested excluding singletons (genera that are known from a single time interval) from this type of analysis to minimize Lagerstätten and monographic effects (Lu et al. 2006). This approach was not practical here, since many vertebrate genera, and fossil cetacean genera in particular, are known from single specimens or single collections. (See the Discussion below for a more in-depth exploration of this point.) Moreover, we suspect that excluding singletons would not matter much because: a) there are few cases of cetacean Lagerstätten (e.g., Sharktooth Hill bonebed); b) such cetacean Lagerstätten exhibit spectacular preservation and do not impact diversity (e.g., Pisco Formation); and c) few cetacean publications have produced many generic names that are still in use.
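For readers unfamiliar with the singleton filter, the sketch below shows what excluding singletons would involve, using the definition given above (a genus known from a single time interval). It is an illustration only; this filter was not applied in the present study, and the genus names, intervals, and function names are hypothetical.

```python
from collections import defaultdict


def find_singletons(occurrences):
    """Return the genera that occur in exactly one time interval."""
    intervals_by_genus = defaultdict(set)
    for genus, interval in occurrences:
        intervals_by_genus[genus].add(interval)
    return {g for g, ivs in intervals_by_genus.items() if len(ivs) == 1}


def diversity_by_interval(occurrences, exclude=frozenset()):
    """Count distinct genera per interval, optionally skipping some genera."""
    genera_by_interval = defaultdict(set)
    for genus, interval in occurrences:
        if genus not in exclude:
            genera_by_interval[interval].add(genus)
    return {iv: len(genera) for iv, genera in genera_by_interval.items()}


# Hypothetical (genus, time interval) occurrence records.
occurrences = [
    ("Genus A", "late Eocene"),
    ("Genus A", "early Oligocene"),
    ("Genus B", "late Eocene"),
    ("Genus B", "late Eocene"),  # two collections, but one interval
    ("Genus C", "early Oligocene"),
    ("Genus D", "late Eocene"),
    ("Genus D", "early Oligocene"),
]

singletons = find_singletons(occurrences)
with_singletons = diversity_by_interval(occurrences)
without_singletons = diversity_by_interval(occurrences, exclude=singletons)

print(singletons, with_singletons, without_singletons)
```

In this toy data set the filter removes a third of the genera in each interval, which illustrates why it is impractical when many genera are known from single specimens or collections.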

