[tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]

Richard Pyle deepreef at bishopmuseum.org
Fri Jul 13 22:11:05 CEST 2007


Thanks to Ricardo for starting this very timely discussion.  I've been
following LSIDs for a long time now, and have attended both GBIF GUID
workshops, and had some very detailed conversations with Ben Szekely about
this very issue, and I think I have a pretty good handle on it.  And, it's
really not that complicated.

The byte-stream (bit sequence) for the data of a given LSID cannot change,
according to the LSID spec.  The "meaning" of the data is irrelevant in this
context -- what matters is the actual sequence of 1's and 0's.  If you have
a TIFF image file that represents a 12-megapixel image, and you change one
bit of one pixel of that image file, you cannot use the same LSID to
represent it.  If you package it into a ZIP file, that ZIP file is a new
bytestream and could not be returned as the data for that LSID assigned to
the TIFF image data object.
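
As a rough illustration of the point above (not anything taken from the LSID
spec itself), the sketch below compares byte streams by hashing them. The
`fingerprint` helper, the stand-in "TIFF" bytes, and the file name `image.tif`
are all assumptions made up for the example; the SHA-256 digest is only a
convenient way to check whether two byte streams are identical.

```python
import hashlib
import io
import zipfile


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest, used here only to compare byte streams."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for the bytes of a TIFF file (hypothetical content, not a real image).
original = bytes(range(256)) * 16

# Flip a single bit of a single byte.
flipped = bytearray(original)
flipped[100] ^= 0b00000001
flipped = bytes(flipped)

# Package the untouched bytes into a ZIP archive held in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("image.tif", original)
zipped = buffer.getvalue()

# Unpacking the archive recovers the original byte stream exactly.
with zipfile.ZipFile(io.BytesIO(zipped)) as archive:
    unpacked = archive.read("image.tif")

print(fingerprint(original) == fingerprint(flipped))   # one changed bit
print(fingerprint(original) == fingerprint(zipped))    # the ZIP wrapper itself
print(fingerprint(original) == fingerprint(unpacked))  # the contents of the ZIP
```

Only the last comparison matches: the ZIP file is a different byte stream
from the TIFF it contains, even though the TIFF bytes can be recovered from it
unchanged.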

If we want to change this specification, then we are not using LSIDs anymore
-- we are using something like "TDWG identifiers that look an awful lot like
LSIDs, but really aren't LSIDs".  I think that's the last thing this
community should do.

The "data" for LSIDs should be an unambiguous digital object.  Species names
are not digital objects.  They are not even physical objects. In fact, they
aren't even text objects (the text string of a species "name", as defined by
any of the nomenclatural codes, is a property or attribute of the
name-object -- not the name-object itself).  Species names are "abstract" or
"conceptual" objects -- with no inherent digital manifestation, and not even
any inherent physical manifestation.  The LSID spec accommodates such objects
in the form of "data-less" LSIDs -- that is, LSIDs with zero "data" content
(only metadata).

Please, let's not get bogged down in alternate definitions of the word
"data" and "metadata".  I swear, the single greatest impediment to progress
in biodiversity informatics (by far) in my opinion has been human-language
semantics.  I had to qualify the word "semantics" in the previous sentence
with "human-language", because even the very word "semantics" has more than
one meaning in our conversations (I almost used the word "vocabulary"
instead of "semantics", but of course that word, too, has another meaning
within our various conversations).  We could fill a small dictionary with
words that have more than one meaning in different contexts ("concept",
"type", "class", "synonym", and worst of all, "name" -- among many others).

So, when we speak of "data" and "metadata" in the context of LSIDs, let us
please use those words specifically in the context of their well-defined
meaning as related to LSIDs.

And in this LSID sense of the word "data", many of our objects (taxon names,
taxon concepts, locality descriptions, specimens, agents, bibliographic
citations, etc.) simply have no "data", because none of these things have
any inherent digital manifestation.  We could concatenate what would
otherwise be LSID-metadata for one of these non-digital objects (e.g., a
database record) into a single byte-stream, and define this as "data" tied
to a particular LSID, but then a new LSID would need to be issued every time
someone wanted to change that bytestream (e.g., convert it from ASCII to
Unicode, or change the meaning, rendering, or content of one of the
concatenated metadata elements). For this, and other reasons, I think this
is a bad approach.
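
To make the re-encoding problem concrete, here is a minimal sketch. The record
string and its field names are hypothetical placeholders ("Aus bus" is not a
real taxon), and UTF-16 is assumed as the "Unicode" encoding purely for
illustration.

```python
# A hypothetical database record concatenated into a single string.
record = "genus=Aus;species=bus;author=Smith;year=1900"

# The same text serialized with two different encodings.
as_ascii = record.encode("ascii")
as_utf16 = record.encode("utf-16")

# The same record with the content of one element changed.
edited = record.replace("year=1900", "year=1901").encode("ascii")

# The text is identical after decoding, but the byte streams are not.
print(as_ascii.decode("ascii") == as_utf16.decode("utf-16"))
print(as_ascii == as_utf16)
print(as_ascii == edited)
```

Under the byte-identity rule, both the re-encoded record and the edited record
would each need a new LSID if this concatenation were treated as LSID "data".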

Instead, I think we should embrace LSIDs *WITH* data (sensu LSID spec) in
cases where it makes sense to do so (e.g., image files, PDFs, perhaps DNA
sequences represented as an ASCII character stream or some other specified
standard binary format), and embrace LSIDs *WITHOUT* data (only metadata) --
as accommodated in the LSID spec -- for most of the non-digital objects we want
to exchange information about (taxon names, taxon concepts, locality
descriptions, specimens, agents, bibliographic citations, etc.).
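
One way to picture the two kinds of LSID is sketched below. This is only a toy
model under stated assumptions: the `LsidRecord` class, its field names, the
`example.org` identifiers, and the metadata keys are all invented for the
example and do not come from the LSID spec or any TDWG schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class LsidRecord:
    """Toy model of what a resolver might hold for one LSID (hypothetical)."""
    lsid: str
    data: Optional[bytes] = None  # fixed byte stream, or nothing at all
    metadata: Dict[str, str] = field(default_factory=dict)  # free to change

    @property
    def is_dataless(self) -> bool:
        # True when there is no byte stream (None or zero bytes).
        return not self.data


# A digital object: the LSID is tied to one specific byte stream.
image = LsidRecord(
    lsid="urn:lsid:example.org:images:123",
    data=b"II*\x00",  # placeholder bytes standing in for a TIFF file
    metadata={"format": "image/tiff"},
)

# An abstract object (a taxon name): metadata only, no data.
name = LsidRecord(
    lsid="urn:lsid:example.org:names:456",
    metadata={"nameComplete": "Aus bus"},
)

# Metadata can be extended or corrected without touching the identifier.
name.metadata["authorship"] = "Smith, 1900"

print(image.is_dataless, name.is_dataless)
```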

Getting back to the intended topic of this discussion (metadata
persistence), I frankly am very happy that there is no requirement for
metadata persistence in the LSID spec (if there were a requirement for
persistence, then you might as well package it all up as data, then use the
embedded versioning component of LSIDs or some other mechanism for issuing
new LSIDs that are cross-linked to each other in an appropriate way).
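
For readers less familiar with the versioning component mentioned above, here
is a minimal parsing sketch. It assumes the usual
`urn:lsid:authority:namespace:object[:revision]` layout; the `Lsid` type, the
`parse_lsid` and `same_object` helpers, and the `example.org` identifiers are
hypothetical names made up for the example. Treating two LSIDs that differ only
in revision as "the same object" is a convention of this sketch, not something
the parser can establish on its own.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]  # None when no revision component is present


def parse_lsid(text: str) -> Lsid:
    """Split an LSID string into its components (minimal validation only)."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of components: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, object_id, revision)


def same_object(a: Lsid, b: Lsid) -> bool:
    """Compare two LSIDs while ignoring the revision component."""
    return a[:3] == b[:3]


v1 = parse_lsid("urn:lsid:example.org:names:456:1")
v2 = parse_lsid("urn:lsid:example.org:names:456:2")
plain = parse_lsid("urn:lsid:example.org:names:456")

print(v1.revision, v2.revision, plain.revision)
print(same_object(v1, v2))
```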

I believe Ricardo's example is better addressed in the next discussion,
concerning methods for data versioning.  I think this issue (persistence of
metadata) must necessarily be resolved via that discussion (versioning), so
maybe we should discuss the versioning issue first.

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
  and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef at bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html





