[tdwg-guid] Biodiversity Heritage Library Name Services v.1.0 released

Wed Dec 5 22:02:19 CET 2007

*Apologies for cross-posts*

The Biodiversity Heritage Library (BHL) has released its first suite of
Name Services into production.  Using these services, application
providers can retrieve "discovered bibliographies" for a given taxonomic
name showing where that name occurs throughout the entire corpus of
natural history literature available through BHL.  

To view an example response, go to:
http://www.biodiversitylibrary.org/services/name/NameService.asmx?op=NameGetDetail
Enter 6663950 in the nameBankID box and click Invoke.
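
For those who would rather call the operation from a script than from the
test form, here is a minimal Python sketch. It is a guess, not documented
behavior: it assumes the service has the HTTP GET binding enabled that
ASP.NET .asmx services can optionally expose (MethodName appended to the
.asmx path), which this announcement does not confirm. The function names
are hypothetical; only the operation name NameGetDetail and the parameter
nameBankID come from the form above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Service endpoint taken from the announcement.
BASE = "http://www.biodiversitylibrary.org/services/name/NameService.asmx"


def name_get_detail_url(name_bank_id):
    """Build a GET URL for NameGetDetail.

    ASSUMPTION: the .asmx HTTP GET binding is enabled on the server.
    """
    query = urlencode({"nameBankID": name_bank_id})
    return f"{BASE}/NameGetDetail?{query}"


def fetch_name_detail(name_bank_id, timeout=30):
    """Fetch the raw XML response as text (performs a network request)."""
    with urlopen(name_get_detail_url(name_bank_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Building the URL does not touch the network.
url = name_get_detail_url(6663950)
print(url)
```

If the GET binding is not enabled, the same request would have to be sent
as a SOAP envelope; the full documentation linked below is the place to
check.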

Full documentation of the services, including examples, can be found at:
http://docs.google.com/Doc?id=dgvjvvkz_1x5qbm3

These services complement the human-readable interface to "discovered
bibliographies" currently deployed at
http://www.biodiversitylibrary.org/NameSearch.aspx.  See
http://www.biodiversitylibrary.org/name/Tapirus_bairdi for an example
bibliography for Baird's Tapir.

How it works
Each digitized page image in BHL has an accompanying OCR text file. As
users navigate to a page, the uncorrected OCR file is sent to uBio's
TaxonFinder, which identifies text strings that match the
characteristics of Latin binomials. Those potential name strings are
then compared to the 10.7 million+ names in uBio's NameBank, and the
results, both matched and unmatched, are stored in the BHL database. BHL
also has automated processes to reindex pages at regular intervals since
NameBank is a growing repository.
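
The flow described above (find candidate strings, look them up, keep both
hits and misses) can be illustrated with a toy Python sketch. This is not
TaxonFinder's algorithm and does not call NameBank: the regular expression
is a deliberately naive stand-in, and the lookup table and its identifiers
are made up for illustration.

```python
import re

# Naive stand-in for "looks like a Latin binomial": a capitalized word
# followed by a lowercase word of 3+ letters. NOT the TaxonFinder algorithm.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")

# Hypothetical miniature stand-in for NameBank (name string -> identifier).
# These identifiers are invented for the example.
TOY_NAMEBANK = {
    "Tapirus bairdi": 1,
    "Mimosa pudica": 2,
}


def find_candidates(ocr_text):
    """Return candidate binomial strings found in uncorrected OCR text."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(ocr_text)]


def index_page(ocr_text, namebank=TOY_NAMEBANK):
    """Split candidates into matched (name, id) pairs and unmatched strings."""
    matched, unmatched = [], []
    for name in find_candidates(ocr_text):
        if name in namebank:
            matched.append((name, namebank[name]))
        else:
            unmatched.append(name)
    return matched, unmatched


page = "The tapir Tapirus bairdi was seen near Mimosa pudica. Then we left."
matched, unmatched = index_page(page)
print(matched)    # candidates found in the toy lookup table
print(unmatched)  # candidates with no match, e.g. the false positive "The tapir"
```

Keeping the unmatched strings is what makes periodic reindexing useful: a
string with no match today can be checked again once the name repository
has grown.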

What we've found
As of 20 Nov 2007 more than 6.8 million potential name strings have been
identified throughout the BHL corpus, with more than 3.8 million matched
to a corresponding NameBank identifier. There are more than 431,000
unique names within that 3.8 million set. Of those, more than 156,000
are known by a single occurrence. These results will be evaluated more
thoroughly in the coming months to identify potential errors, such as
false positives, and to determine how to refine the TaxonFinder
algorithm to reduce them.
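
For a rough sense of proportion, the figures above can be turned into
ratios. Every figure was reported as "more than", so the values below are
approximations computed from the rounded lower bounds, not exact
statistics.

```python
# Figures quoted in this post, each reported as "more than" (rounded lower bounds).
potential_strings = 6_800_000   # potential name strings identified
matched_strings = 3_800_000     # strings matched to a NameBank identifier
unique_names = 431_000          # unique names within the matched set
single_occurrence = 156_000     # unique names known by a single occurrence

match_rate = matched_strings / potential_strings
singleton_share = single_occurrence / unique_names
mean_occurrences = matched_strings / unique_names

print(f"matched strings:          {match_rate:.0%}")
print(f"single-occurrence names:  {singleton_share:.0%}")
print(f"mean occurrences per name: {mean_occurrences:.1f}")
```

Roughly 56% of the candidate strings matched, and about 36% of the unique
names appear only once, which is the group most worth checking for false
positives.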

Caveat: These results are generated from uncorrected OCR, which ranges in
quality from pretty good (contemporary publications, such as modern
issues of Rhodora) to downright terrible (18th century Latin texts, such
as Species Plantarum). Again, further evaluation is required to
determine the full scope of this problem.

Metadata in response
The metadata currently returned from the services is not yet explicitly
mapped to Dublin Core, MODS, or other schemata but will be in subsequent
revisions.  This current response is the most verbose offering possible,
encompassing every piece of information we have to share about the
digitized page and the book from which it was scanned.  For this first
release we wanted to expose all available metadata, then work to map it
to other schemata according to community wants and needs.

Where we're headed
This service, and the metadata it serves, will be incorporated into the
Encyclopedia of Life by its launch next year.  Discussions are already
underway with other data providers to demonstrate how these services can
be used.

To see a simple example of how the human-readable interface can be used
from external sites, check out the 'External Links' at the bottom of the
Wikipedia article for Mimosa pudica L., the sensitive plant:
http://en.wikipedia.org/wiki/Mimosa_pudica

Up next: Development of an OpenURL parser that will allow service
providers to submit requests for citations and retrieve page images or
resolve to appropriate pages in the BHL portal.

Any and all comments welcome!  E-mail directly to me, or leave them on
the BHL blog at
http://biodiversitylibrary.blogspot.com.

Chris Freeland
Technical Director, Biodiversity Heritage Library
Application Development Manager, Missouri Botanical Garden