PALEONTOLOGY COLLECTIONS ON THE WORLD WIDE WEB:
THE MISSING LINK

In a recent editorial in Palaeontologia Electronica, Warren Allmon reviewed his use of Google as a research tool (Allmon 2004). He did a series of web searches for his favorite snail (Turritella) and analyzed the results. In these analyses, he pointed out two interesting issues that we have been dealing with at the Department of Invertebrate Paleontology at the Natural History Museum of Los Angeles County (LACMIP): The need for collections catalogs to take advantage of standard web search engines such as Google, and the problem of data entry. How do we publish as much high-quality data as possible with limited resources and limited local specialist expertise? In this essay we discuss our philosophy regarding the important issues of publishing and maintaining electronic catalogs.

Collection Catalogs can be Indexed by Web Search Engines

On noticing a lack of Google search hits from museum catalogs, Allmon asked, “Where are the big databases?” We can’t speak for other big databases, but we are there. As of January 10, 2005, a search for “LACMIP” and “Turritella” using www.google.com returns a page from our collections catalog describing a locality from the famous Turritella beds of the Topanga Canyon Formation in the Santa Monica Mountains. This result is no coincidence. In the past few years we have rebuilt our catalog to allow searching using a modern web-based approach.

Specimens are the raw data of paleontology. They are the primary record of an occurrence of a particular taxon in a particular stratigraphic and geographical context. Subsequent researchers must be able to go back to the specimens to verify that the original taxonomic determinations and related interpretations are correct. Taxonomic and stratigraphic concepts may change, but the specimens remain constant. Because of the fundamental role of specimens, collections in natural history museums are an important part of the future of the science (Erwin 1997; Suarez and Tsutsui 2004; Allmon 2005). Collections are the reason for the existence of museums in the first place, because they act as lending libraries, providing access to material examined in previous studies. These collections are supposed to be held in trust for the public in perpetuity (ICZN 1999; Article 72.10), and it is this planning for the future that is important. Collections can be held at universities or other academic institutions, but in general there is no plan for the future of these collections. Museum collections are the best way to ensure that future scientists have access to important paleontological material. This fact is particularly true in the case of ephemeral exposures, where the specimens may be the only available record of past scientific assertions, because re-collecting will never be possible.

Traditionally, researchers visit collections, work through cabinets, and search printed or electronic registers in order to find the specimens that they would like to study. In some cases, material can be borrowed without visiting the collection if the researcher knows what he or she is looking for. However, much material is unlikely to be cataloged, and even if a researcher has some idea of what should be looked for (say, a particular taxon or material from a particular collecting locality), local staff might not be available to locate, pack, and ship the actual specimens. An alternative approach to traditional collections usage is to make as much specimen information as possible available to the public via electronic publication. The World Wide Web serves this approach well, and advances in information technology are making it relatively easy (Graham et al. 2004). For example, there is a strong initiative to deploy multi-institution database networks using web services that allow the simultaneous search of many museum catalogs (for example, Fishnet, Herpnet, Manis, or The Paleontology Portal). This approach will provide a network of comprehensive search tools for museum collections. However, as pointed out by Allmon, these systems are not really compatible with the rest of the World Wide Web, especially because Google or other search engines do not index them. In contrast to these systems, the LACMIP collections catalog is indexed by present-day web search engines and can be searched, browsed, and linked by using standard web hypertext links.

Web Search Engines and Lessons of Google

Web Architecture

Computer science is a rapidly developing field, and there has been great progress in the design and implementation of web-based information systems during the past decade. Much of this progress has focused on developing e-commerce systems for companies such as Amazon.com or eBay that process hundreds of millions of dollars worth of transactions per month. Obviously, paleontology can learn something from the study of the structure and interface design of such sites. For example, consider using Amazon.com as a way to find a book. Most users take advantage of the search tools presented on the Amazon.com home page if they don’t know what they are looking for. So to find a copy of On the Origin of Species, one would go to the URL http://amazon.com using a standard web browser and type “origin of species” or “darwin” into a search field, and a list of books is returned. This is the approach taken during the past decade by most paleontology catalogs that are on the Web (for example, The Academy of Natural Sciences in Philadelphia, the Kansas Natural History Museum, The University of California Museum of Paleontology, The Florida Museum of Natural History, and The San Diego Natural History Museum). A web browser is used to enter search terms into a web form. This information is submitted to a remote web server and results are sent back to the web browser. This system works well for several reasons, most of all because the only software required by the user is a standard web browser. All of the specialized software is running on the remote computer, and the average user doesn’t need to know about it.

But what if we already know what book we are looking for, and instead only wish to find out the price or date of publication? Why not go directly to that book? There is a way to do this using a special URL that specifies the ISBN of the book you are looking for. For example, try http://www.amazon.com/o/ASIN/1400041279 to see the book with ISBN number 1400041279. The URL is essentially an address, and it turns out that every book in the Amazon.com database has at least one address on the World Wide Web. This allows users to bookmark a page or share the link with friends who can get the most up-to-date information about that book without manually typing into the web forms. This method is as old as the World Wide Web and is known as REST (or representational state transfer; Fielding 2000; He 2004), or more generally as web architecture. Although a relatively simple design, over the years it has been shown to be very flexible, powerful, and extremely useful (Jacobs 2004). The main idea of web architecture is that each piece of information on the World Wide Web should have a unique address or URL. That address can be used like any other web link to build a hypertext database.

We have implemented such links for the electronic catalog at LACMIP. To see information from a particular locality, use the URL http://ip.nhm.org/ipdatabase/locality/xxx, where xxx is the locality code. The address for information regarding LACMIP Locality 1219 is http://ip.nhm.org/ipdatabase/locality/1219. Specimen lots, type specimens, and images are all directly accessible using similar link schemes. For example, the address of type specimen LACMIP type 10095 is http://ip.nhm.org/ipdatabase/type/10095, and the address of specimen lot LACMIP 20631-3 is http://ip.nhm.org/ipdatabase/lot/20631-3. These links can be used to direct users to particular bits of information in our catalog. The most obvious use would be in online publications, where type specimens and locality data are required. This would enable a link from the online version of the Journal of Paleontology to tap directly into the most up-to-date information regarding a particular specimen or locality. Another possibility is to link from another online database to our system. For example, we recently created a search tool that can be used to search simultaneously the Recent and fossil collections of mollusks at the Natural History Museum of Los Angeles County. More importantly, web search engines can index these links so this information can be discovered in a search using a tool such as Google. We have included links into the entire LACMIP catalog on the pages that are returned from these “quick-link” access points. This means that a computer program can, in principle, browse our entire catalog by passing from link to link, which makes all of the data accessible to web search engines or other metadatabases. We have tried to keep our links simple, so that they are easy to read, and logical, so that they are easy to remember. This rationale of simplicity and logical function makes them easy to use. In fact, with this kind of link, it should be possible to guess the correct URL, rather than navigating a complex search engine (Seebach 2001).
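
The link scheme described above is regular enough to be sketched in a few lines of Python. This is only an illustration of the pattern, not the software that runs the LACMIP catalog. The base address and the three path segments (locality, type, lot) are taken from the example addresses given in the text; the function names record_url and parse_record_url are hypothetical, and the address pattern for images is left out because the text does not state it.

```python
from urllib.parse import quote, unquote

# Base address and record kinds as they appear in the example URLs in the text.
BASE_URL = "http://ip.nhm.org/ipdatabase"
RECORD_KINDS = ("locality", "type", "lot")


def record_url(kind, identifier):
    """Build the address of one catalog record (hypothetical helper)."""
    if kind not in RECORD_KINDS:
        raise ValueError(f"unknown record kind: {kind!r}")
    # Percent-encode the identifier so it is always a single path segment.
    return f"{BASE_URL}/{kind}/{quote(str(identifier), safe='')}"


def parse_record_url(url):
    """Recover (kind, identifier) from an address built by record_url."""
    prefix = BASE_URL + "/"
    if not url.startswith(prefix):
        raise ValueError(f"not a catalog address: {url!r}")
    kind, separator, identifier = url[len(prefix):].partition("/")
    if kind not in RECORD_KINDS or not separator or not identifier:
        raise ValueError(f"not a catalog address: {url!r}")
    return kind, unquote(identifier)


if __name__ == "__main__":
    print(record_url("locality", 1219))
    print(record_url("type", 10095))
    print(record_url("lot", "20631-3"))
```

Because the address is a pure function of the record kind and its catalog number, a reader who knows a locality or lot number can write the link by hand, and any other database can generate links into the catalog without a lookup step.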

Web Services

The web architecture approach to providing links to information stored in online databases is different from the services approach to integrating databases currently in development as multi-institutional search engines such as Herpnet, Manis, or The Paleontology Portal. These services use a particular form of communication called SOAP (Simple Object Access Protocol; Box 2001). Rather than providing a unique web address for information, in this approach a query is expressed using a standard format and sent via the World Wide Web within a virtual envelope. Unlike the web architecture approach, very complex queries can be transmitted. However, because the query must be written in a special language, special client software is required and the data are not easily accessible to search engines. Furthermore, data format standards must be defined for this approach to work because different systems must all be able to process the same query (they must be “speaking the same language”). These standards are being actively developed (including the Darwin Core and ABCD schemas), and are sure to result in rapid progress in creating these integrated systems.
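
To make the idea of a query in a "virtual envelope" concrete, the following Python sketch builds a minimal SOAP 1.1 message using only the standard library. The envelope namespace is the standard SOAP 1.1 one. Everything inside the body is an assumption made for illustration: the query namespace, the element names SearchSpecimens, Genus, and Formation, and the function name build_soap_query are hypothetical and do not reproduce the message format of any of the networks named above.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical namespace for the query vocabulary (illustration only).
QUERY_NS = "http://example.org/catalog-query"


def build_soap_query(genus, formation):
    """Wrap a two-field specimen query in a SOAP envelope (hypothetical schema)."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # The query itself: both sides must agree on these element names,
    # which is why shared data standards are needed.
    query = ET.SubElement(body, f"{{{QUERY_NS}}}SearchSpecimens")
    ET.SubElement(query, f"{{{QUERY_NS}}}Genus").text = genus
    ET.SubElement(query, f"{{{QUERY_NS}}}Formation").text = formation
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    print(build_soap_query("Turritella", "Topanga Canyon Formation"))
```

A message like this has to be composed by client software and posted to a service, so the results have no address of their own that a search engine could follow. By contrast, a request under the web architecture approach is nothing more than a link.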

But we should not forget the lesson of Google. Web search engines work very well even though web pages do not adhere to complex standards that define the types of information that are included. This is the real triumph of the World Wide Web. Although some standards are certainly required (for example, all web pages must adhere to some standard encoding of characters, and HTTP headers must be passed as part of the web communication protocols), over-defined standards can also hinder progress by limiting the information available to whatever is allowed by the standard. Besides, there is no reason not to continue publishing information in a loosely formatted way while the standards are being established (and this might take a long time because committees are trying to shoe-horn non-standard rules and practices that have developed over centuries, such as taxonomic nomenclature, into some specifiable set of objects). After all, Google is very useful even though there are no standards for the web pages that we find using it. Do we need to develop a similar specialized search tool for collections catalogs that indexes loosely structured data?

Collections Catalogs are Collaborations between Museums and Researchers

In his essay on the use of Google in paleontological research, Allmon (2004) pointed out that although records of type and figured specimens held in major paleontological collections are available online, information about non-type material is much less likely to be accessible. This absence is the result of setting priorities for developing electronic catalogs. Entering data from specimen labels is a labor-intensive process and, with limited resources available, most collections managers opt to focus on type material. In his essay, Allmon wondered whether there might be a more efficient way to bring more collections into the World Wide Web than this “brute force” approach. We do not think that there can be any short cuts in this critical process. Instead we look to the community of paleontologists who are the primary users of electronic catalogs to help us produce an online catalog.

In the current economic climate, collections managers are under pressure to justify the expense of supporting collections that rarely generate any income for their institution. In general, collections-holding institutions provide a service to research scientists without financial compensation. This service is appropriate, but people who use collections will have to take on some responsibility for the continued accessibility of these important research tools. As a group, the people who use paleontology collections know much more about them than collections staff ever can, and they must share this expertise. The philosophy is similar to a “barn-raising” where community members join together to create structures that benefit everyone. In our case, the community is dispersed throughout the world. We would like to present the paleontological community with a challenge: If paleontology collections are so very critical for the future of the science, then the community must give back to collections support. There must be a two-way exchange of information between researchers and museums. Museums have the responsibility to make our collections available to the scientific community, but in turn the scientific community must share data with museums.

Traditionally, users would contribute by adding written labels with new data and interpretations for specimen lots. At LACMIP, our aim is to encourage digital contributions by providing the tools for users to conveniently add information using the World Wide Web. We allow and encourage researchers using our collections to add their data directly into the electronic catalog. The system works via a set of secure web forms similar to the mechanism used by online commerce sites. Users who have been authorized by the collections manager are provided with a password that can be used to gain access to these forms. No special software is required, and information can be contributed from any computer connected to the World Wide Web.

One effect of this approach is that, as collections managers, we are transferring the role of “authorities” from museum collections staff to the scientific community. Our collections are very large and we are attempting to track a diverse assortment of stratigraphic, geographic, and taxonomic facts and interpretations. We have neither the resources nor the expertise to assess the validity of the majority of information contributed by people who donate material or use the collections for research. Because of this constraint we accept whatever information is donated. Therefore, our catalog cannot contain a single authoritative entry for each piece of information. For example, several specialists might have differing taxonomic concepts and might have assigned different taxonomic determinations to the same specimen. Our job as museum collections staff is to record these interpretations and make them available to the public. In the catalog, the source and date of entry are maintained for each contribution, and all contributions are published so users can decide for themselves whether any particular entry is better than the others. Taxonomic determinations are an obvious case, but a similar procedure is used to track differing opinions regarding stratigraphy, age, geography, and most other information. At the most basic level, because the LACMIP catalog is not “read-only”, simple typographic errors can be fixed and researchers can add basic information as they browse the system. The goal is that the data are maintained and gradually improved over the years using our catalog as a structured online collaboration system. We think that this is a viable way to help the paleontology research community to help itself build collections-based research tools for the future.
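
The record-keeping idea described above, in which every contribution is kept with its source and date rather than overwriting a single authoritative value, can be sketched as a small data structure. This is a minimal illustration in Python, not the actual LACMIP schema: the class names Contribution and LotRecord, the field names, the lot identifier, the contributors, the dates, and the determinations are all invented for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass(frozen=True)
class Contribution:
    """One contributed statement about a lot, never overwritten once stored."""
    field_name: str   # e.g. "taxon", "age", "stratigraphy"
    value: str
    source: str       # who contributed the statement
    entered: date     # when it was entered


@dataclass
class LotRecord:
    """A specimen lot holding every contribution rather than one 'true' value."""
    lot_id: str
    contributions: List[Contribution] = field(default_factory=list)

    def add(self, field_name, value, source, entered):
        # Append only: earlier opinions remain visible beside later ones.
        self.contributions.append(Contribution(field_name, value, source, entered))

    def opinions(self, field_name):
        """Return all contributions for one field, oldest first."""
        matches = [c for c in self.contributions if c.field_name == field_name]
        return sorted(matches, key=lambda c: c.entered)


if __name__ == "__main__":
    lot = LotRecord("EXAMPLE-1")  # invented lot identifier
    lot.add("taxon", "Turritella sp. A", "Researcher A", date(2003, 5, 1))
    lot.add("taxon", "Turritella sp. B", "Researcher B", date(2005, 1, 10))
    for c in lot.opinions("taxon"):
        print(c.entered, c.source, c.value)
```

Keeping contributions append-only means the catalog never has to decide which specialist is right; it only has to show who said what, and when, so that users can judge for themselves.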

PE Editorial Number: 8.2.5E
Copyright: Coquina Press October 2005