[tdwg-guid] An approach to Abstract LSIDs

Sun Jul 15 20:59:59 CEST 2007

In my previous post, I quoted the description of "Abstract" LSIDs from the
LSID Best Practices page
(http://www-128.ibm.com/developerworks/opensource/library/os-lsidbp/).
Here is the full section:

***************************************
Abstract LSIDs

The data behind the data bytes of a concept might exist in multiple data
formats or derivations. One approach using a single LSID would be to append
all different instances together, using some token to separate the different
formats. This solution is poor for many reasons, primarily because the
client must download all formats. The best approach is to create a different
LSID for each data format or for derivations and connect them with a single
abstract LSID.

The benefit of using an abstract scheme is that it allows for LSIDs that do
not name actual data bytes but instead provide only metadata documents.
These LSIDs can be used to represent abstract notions, such as a gene or
protein, which may have many concrete representations. The metadata
documents associated with these abstract LSIDs can contain multiple
relationships pointing to LSIDs that name data bytes.

In this way, researchers can use a series of LSIDs to create an
interconnected metadata graph to name objects that may have many different
representations. The abstract LSID provides the anchor point for software
and users to explore the metadata and obtain further pointers to all the
concrete LSID references that contain data, along with the data's exact
relationship to the abstract concept. This level of indirection is very
powerful.
***************************************

Previously, we've debated whether an LSID assigned to a non-digital
object should be assigned to the "Abstract" object, or to a specific
database record created for that object.  I'll stick with the Taxon Name
example, but the same principles apply to other non-digital objects like
specimens, observations, reference citations, etc.

Many, many databases in the world include a database record to represent the
butterflyfish genus described by Linnaeus in 1758 (which, for the sake of
simplicity, I'll henceforth refer to via the ASCII rendering "Chaetodon").

Database records (rows) are, inherently, digital objects, and therefore can
(with some level of established convention) be represented by binary "data"
-- retrievable via getData().  Thus, the many, many database records out
there can each receive a proper data-bearing LSID.  Obviously, there would
need to be mechanisms to make sure that the bytestream returned by getData()
for each of these inherently digital database records is always
bit-consistent.
This could be relatively easy if the only "data" returned for the LSID is a
specified encoding of the primary key value for the database record, and all
the other columns/fields were returned via getMetadata().  But the point is,
a database record *is* an inherently digital object, and therefore *can* be
legitimately represented by a data-bearing (non-Abstract) LSID.
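
As a rough illustration of that split, here is a minimal Python sketch.  It
is not any real LSID library: the class name RecordLSID, the example.org
authority, the column names, and the choice of ASCII digits as the "specified
encoding" are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class RecordLSID:
    """Hypothetical model of a data-bearing LSID for one database row."""
    lsid: str
    primary_key: int
    columns: dict = field(default_factory=dict)

    def get_data(self) -> bytes:
        # Only the primary key, in one fixed encoding (assumed: ASCII digits),
        # so the bytes stay identical no matter how the row is later edited.
        return str(self.primary_key).encode("ascii")

    def get_metadata(self) -> dict:
        # All other columns travel as metadata, which is allowed to change.
        return dict(self.columns)


record = RecordLSID(
    lsid="urn:lsid:example.org:taxon-names:12345",
    primary_key=12345,
    columns={"name": "Chaetodon", "author": "Linnaeus", "year": 1758},
)

data_before = record.get_data()
record.columns["author"] = "Linnaeus, C."  # a later correction to the row
data_after = record.get_data()
```

Because getData() here depends only on the primary key, editing any other
column leaves the returned bytes untouched, which is the bit-consistency
property described above.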

We could then assign an "Abstract" LSID for the "idea" or "notion" of the
scientific name "Chaetodon", and use that LSID in the spirit of the
above-quoted best practices description of Abstract LSIDs to track "further
pointers to all the concrete LSID references [for database records
established for the genus Chaetodon] that contain data".

That would effectively allow the Abstract LSID to serve the needs of those
of us who *want* a shared, reusable, persistent identifier for the
idea/notion/concept of the taxon name "Chaetodon", which itself serves as an
index of sorts to all manner of database records (digital objects) that
contain data (and metadata) associated with that taxon name.
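
To sketch what that index might look like, here is a small Python example.
Again, everything in it is hypothetical: the AbstractLSID class, the
"representedBy" key, the example authorities, and the choice to model "no
data bytes" as None are assumptions for illustration, not part of the LSID
specification or of any existing resolver.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AbstractLSID:
    """Hypothetical model of an Abstract LSID: metadata only, no data bytes."""
    lsid: str
    label: str
    concrete: list = field(default_factory=list)  # LSIDs of database records

    def get_data(self) -> Optional[bytes]:
        # An Abstract LSID names no data bytes (modeled here as None).
        return None

    def get_metadata(self) -> dict:
        # The metadata points at every concrete, data-bearing record LSID.
        return {"label": self.label, "representedBy": list(self.concrete)}


chaetodon = AbstractLSID(
    lsid="urn:lsid:example.org:abstract-names:chaetodon",
    label="Chaetodon",
)

# Two different databases each hold their own record for the same name.
chaetodon.concrete.append("urn:lsid:museum-a.example:names:12345")
chaetodon.concrete.append("urn:lsid:museum-b.example:taxa:987")

pointers = chaetodon.get_metadata()["representedBy"]
```

A client holding only the Abstract LSID could then walk the "representedBy"
list to reach each database's own record for the name.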

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
  and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html