[tdwg-lit] Biodiversity Heritage Library Name Services v.1.0 released
*Apologies for cross-posts*
The Biodiversity Heritage Library (BHL) has released its first suite of Name Services into production. Using these services, application providers can retrieve "discovered bibliographies" for a given taxonomic name, showing where that name occurs throughout the entire corpus of natural history literature available through BHL.
To view an example response, go to http://www.biodiversitylibrary.org/services/name/NameService.asmx?op=NameGetDetail and enter 6663950 in the nameBankID box, then hit Invoke.
Full documentation of the services, including examples, can be found at: http://docs.google.com/Doc?id=dgvjvvkz_1x5qbm3
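For application providers who would rather call the service from code than from the test form, here is a minimal Python sketch of the same NameGetDetail request. It assumes the .asmx endpoint accepts plain HTTP GET in the usual /Operation?param=value form, which the announcement does not confirm; the function names are illustrative only, and the linked documentation remains the authority on the actual request format.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address as given in the announcement.
SERVICE_URL = "http://www.biodiversitylibrary.org/services/name/NameService.asmx"


def build_name_detail_url(name_bank_id: int) -> str:
    """Build a GET URL for the NameGetDetail operation.

    Assumption: the service is reachable via HTTP GET as
    <service>.asmx/NameGetDetail?nameBankID=<id>.
    """
    query = urlencode({"nameBankID": name_bank_id})
    return f"{SERVICE_URL}/NameGetDetail?{query}"


def fetch_name_detail(name_bank_id: int, timeout: float = 30.0) -> str:
    """Fetch the raw response body as text; no parsing is attempted here."""
    url = build_name_detail_url(name_bank_id)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_name_detail(6663950) would mirror the Invoke example above. The response is returned unparsed because its layout is described in the documentation rather than in this message.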
These services complement the human-readable interface to "discovered bibliographies" currently deployed at http://www.biodiversitylibrary.org/NameSearch.aspx. See http://www.biodiversitylibrary.org/name/Tapirus_bairdi for an example bibliography for Baird's Tapir.
How it works
Each digitized page image in BHL has an accompanying OCR text file. As users navigate to a page, the uncorrected OCR file is sent to uBio's TaxonFinder, which identifies text strings that match the characteristics of Latin binomials. Those potential name strings are then compared to the 10.7 million+ names in uBio's NameBank, and the results, both matched and unmatched, are stored in the BHL database. Because NameBank is a growing repository, BHL also has automated processes to reindex pages at regular intervals.
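The flow described above can be illustrated with a toy Python sketch. To be clear, this is not TaxonFinder's algorithm or BHL's code: the regular expression is a crude stand-in for name finding, the dictionary is a stand-in for NameBank, and the identifier in it is made up.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# Crude stand-in for TaxonFinder (an assumption, not its real logic):
# a capitalised word followed by a lowercase word of 3+ letters.
BINOMIAL = re.compile(r"\b[A-Z][a-z]+ [a-z]{3,}\b")


@dataclass
class NameOccurrence:
    page_id: int
    name_string: str
    name_bank_id: Optional[int]  # None means "found but unmatched"


def index_page(page_id: int, ocr_text: str,
               name_bank: Dict[str, int]) -> List[NameOccurrence]:
    """Find candidate name strings on a page and look each one up.

    Both matched and unmatched candidates are kept, as in the
    description above.
    """
    occurrences = []
    for match in BINOMIAL.finditer(ocr_text):
        candidate = match.group(0)
        occurrences.append(
            NameOccurrence(page_id, candidate, name_bank.get(candidate))
        )
    return occurrences


# Hypothetical data: the ID below is invented for illustration.
toy_name_bank = {"Tapirus bairdi": 1}
page_text = "Specimens of Tapirus bairdi were collected. Also Quercus albaa here."
results = index_page(42, page_text, toy_name_bank)
```

In this toy run "Tapirus bairdi" is matched, while the misspelled "Quercus albaa" (standing in for an OCR error) is stored as unmatched. Running index_page again later against a larger dictionary is the analogue of periodic reindexing.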
What we've found
As of 20 Nov 2007, more than 6.8 million potential name strings have been identified throughout the BHL corpus, with more than 3.8 million matched to a corresponding NameBank identifier. There are more than 431,000 unique names within that 3.8 million set. Of those, more than 156,000 are known by a single occurrence. These results will be evaluated more thoroughly in the coming months to determine potential errors such as false positives and how to refine the TaxonFinder algorithm to reduce them.
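As a rough check on what those figures imply, the proportions can be worked out in a few lines of Python. The inputs are the rounded "more than" values quoted above, so the resulting percentages are approximate.

```python
# Rounded figures from the announcement (each is a "more than" value).
potential_strings = 6_800_000
matched_strings = 3_800_000
unique_names = 431_000
single_occurrence_names = 156_000

# Share of potential name strings that matched a NameBank identifier.
matched_share = matched_strings / potential_strings

# Share of unique matched names that occur only once in the corpus.
single_share = single_occurrence_names / unique_names

print(f"matched: {matched_share:.0%}, single occurrence: {single_share:.0%}")
```

So roughly 56% of the potential name strings matched NameBank, and roughly 36% of the unique matched names are known from a single occurrence.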
Caveat: These results are generated from uncorrected OCR, which ranges in quality from pretty good (contemporary publications, such as modern issues of Rhodora) to downright terrible (18th century Latin texts, such as Species Plantarum). Again, further evaluation is required to determine the full scope of this problem.
Metadata in response
The metadata currently returned from the services is not yet explicitly mapped to Dublin Core, MODS, or other schemata but will be in subsequent revisions. The current response is the most verbose offering possible, encompassing every piece of information we have to share about the digitized page and the book from which it was scanned. For this first release we wanted to expose all the metadata we could, then work to map it to other schemata according to community wants and needs.
Where we're headed
This service, and the metadata it serves, will be incorporated into the Encyclopedia of Life by its launch next year. Discussions are already underway with other data providers to demonstrate how these services can be used.
To see a simple example of how the human-readable interface can be used from external sites, check out the 'External Links' at the bottom of the Wikipedia article for Mimosa pudica L., the sensitive plant: http://en.wikipedia.org/wiki/Mimosa_pudica
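For sites that want to generate such links automatically, here is a small Python sketch. It assumes the /name/ URL pattern generalizes from the single Tapirus_bairdi example given earlier in this message (spaces replaced with underscores); that pattern is inferred, not documented here, and the function name is illustrative.

```python
from urllib.parse import quote

# Base path taken from the Tapirus_bairdi example in this announcement.
NAME_BASE_URL = "http://www.biodiversitylibrary.org/name/"


def bibliography_url(scientific_name: str) -> str:
    """Build a link to the human-readable bibliography for a name.

    Assumption: words are joined with underscores, as in the one
    example URL provided; other characters are percent-encoded.
    """
    slug = "_".join(scientific_name.split())
    return NAME_BASE_URL + quote(slug)


link = bibliography_url("Tapirus bairdi")
```

Here link reproduces the Baird's Tapir URL quoted earlier, which is the only case the sketch has been checked against.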
Up next: Development of an OpenURL parser that will allow service providers to submit requests for citations and retrieve page images or resolve to appropriate pages in the BHL portal.
Any and all comments welcome! E-mail directly to me, or leave them on the BHL blog at http://biodiversitylibrary.blogspot.com.
Chris Freeland
Technical Director, Biodiversity Heritage Library
Application Development Manager, Missouri Botanical Garden