[tdwg-guid] Immutability of LSID data

P. Bryan Heidorn pheidorn at uiuc.edu
Wed Jul 25 20:48:40 CEST 2007

Sorry I did not reply earlier. My time was eaten up with proposal  
writing and the like.

I agree that we are talking about digital data for the LSID, and I did
not intend to insinuate otherwise in my prior message. My point is just
that people are putting Galileo's data online in digital format and
do need unique identifiers for it; it just is not biodiversity data.
These issues have been addressed many times, and it is important to
learn from past experience.

We can say that it is important to be able to ensure the properties
of the data service, so that the digesting process can make
assumptions about the data. In Java and many other languages there is
a bit-level equivalence operator such as "==". This is relevant to the
concept of homonyms.
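
As a rough illustration of what "making assumptions about the data" could look like on the consuming side, here is a minimal Python sketch. It assumes a hypothetical fetch callable that returns the bytes for an identifier; the class name, the callable, and the example identifier are all made up for illustration and are not part of the LSID spec.

```python
import hashlib
from typing import Callable, Dict


class ImmutabilityChecker:
    """Remembers a digest of the bytes first seen for each identifier."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        # fetch is a hypothetical stand-in for a real retrieval call.
        self._fetch = fetch
        self._digests: Dict[str, str] = {}

    def get_checked(self, lsid: str) -> bytes:
        data = self._fetch(lsid)
        digest = hashlib.sha256(data).hexdigest()
        # Store the digest on first retrieval; compare on later ones.
        first_seen = self._digests.setdefault(lsid, digest)
        if digest != first_seen:
            raise ValueError(f"bytes for {lsid} changed since first retrieval")
        return data


# Hypothetical in-memory "service" used only to exercise the checker.
store = {"urn:lsid:example.org:specimens:1": b"ACGT"}
checker = ImmutabilityChecker(store.__getitem__)
checker.get_checked("urn:lsid:example.org:specimens:1")
```

A client built this way can rely on byte-for-byte comparison only as long as the service keeps the bytes for an identifier unchanged.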

Hannu pointed out that it is nice to be able to make assumptions
about the nature of the data being delivered. You can, for example,
know you can use "==" in your program and assume it should return true
if the service is following the rules (of bit-level immutability).
When we say two things are equivalent in these languages, we mean
"equivalent" under the language's operators. The LSID getData
function is defined in these terms, which is very reasonable for many
forms of data, including molecular sequences (except that genetic
matching algorithms frequently treat a genetic sequence and its
complement as equivalent because they are both half of the same
double helix). So, I would guess that even the molecular community who
defined LSID might have people who are unhappy with the current
definition. In some languages we are allowed to overload operators
such as "==" with our own definition of equivalence. The language
designers did this because people often need different definitions of
equivalence, particularly for complex data types.
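
To make the contrast concrete, here is a small Python sketch of an overloaded equality operator. The Strand class is hypothetical, and it assumes that "complement" means the reverse complement of a DNA string over A, C, G and T; a real matching algorithm may define this differently.

```python
# Translation table mapping each base to its complementary base.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


class Strand:
    """Hypothetical DNA strand that equals its own reverse complement."""

    def __init__(self, bases: str) -> None:
        self.bases = bases.upper()

    def reverse_complement(self) -> str:
        return self.bases.translate(_COMPLEMENT)[::-1]

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Strand):
            return NotImplemented
        # Equal if the other strand is this one or its reverse complement.
        return other.bases in (self.bases, self.reverse_complement())

    def __hash__(self) -> int:
        # Equal objects must hash equally, so hash a canonical form.
        return hash(min(self.bases, self.reverse_complement()))


same_helix = Strand("AACG") == Strand("CGTT")   # overloaded equivalence
same_bytes = b"AACG" == b"CGTT"                 # byte-level comparison
```

Under these assumptions same_helix is true while same_bytes is false, which is the gap between the two notions of equivalence.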

In many programming tasks, bit-level equivalence is not needed and
is indeed problematic. So RDF and software such as DOM define
equivalence not as bit-level matching of 1's and 0's in a particular
order but as a higher-order construct. So, we can have a born-digital
object that describes a species of plant.
"<leaf><arrangement>alternate</arrangement><length unit="mm">10</length></leaf>"
is equivalent to "<leaf><length unit="mm">10</
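
A higher-order comparison of this kind can be sketched in Python with the standard xml.etree.ElementTree module. The second document in the example is an assumption: the original example is cut off, so a reordering of the same two child elements is used for illustration. The function names are hypothetical, and treating child order as insignificant is a modelling choice of this sketch, not something XML itself guarantees.

```python
import xml.etree.ElementTree as ET


def canonical(element):
    """Order-insensitive form: tag, sorted attributes, text, sorted children.

    Text that follows a child element (its tail) is ignored in this sketch.
    """
    children = sorted(canonical(child) for child in element)
    text = (element.text or "").strip()
    attributes = tuple(sorted(element.attrib.items()))
    return (element.tag, attributes, text, tuple(children))


def representationally_equivalent(xml_a: str, xml_b: str) -> bool:
    """Compare two XML strings by structure rather than by bytes."""
    return canonical(ET.fromstring(xml_a)) == canonical(ET.fromstring(xml_b))


first = ('<leaf><arrangement>alternate</arrangement>'
         '<length unit="mm">10</length></leaf>')
# Assumed reordering of the same content (the source example is truncated).
second = ('<leaf><length unit="mm">10</length>'
          '<arrangement>alternate</arrangement></leaf>')

bytes_match = first == second
structure_matches = representationally_equivalent(first, second)
```

Here the two strings differ byte for byte, yet the structural comparison treats them as the same description of a leaf.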

There are applications in biodiversity informatics where bit-level
equivalence is useful, so I support keeping getData's requirement of
bit-level equivalence. Other branches of biodiversity informatics,
however, would benefit from a different definition of equivalence.
This can be handled with an LSID extension as a new function. Who
pays for development of this new function is important. We can roll
out a more constrained standard with getData as is and later add the
new getDataRepresentationallyEquivalent.

So, let's move ahead, adopt LSID, and start using it for the cases
where bit-level equivalence is acceptable, and either expand it later
or develop a different standard to give unique identifiers for the
other applications.

   P. Bryan Heidorn
   Graduate School of Library and Information Science
   University of Illinois at Urbana-Champaign
   pheidorn at uiuc.edu
   (V)217/ 244-7792     (F)217/ 244-3302
   Online Calendar: http://www.uiuc.edu/goto/heidorncalendar

On Jul 16, 2007, at 8:17 PM, Richard Pyle wrote:

> Hi Bryan,
>> What is data and what is metadata has no relation to being
>> digital or not. There was data and metadata long before there
>> were computers.
> Again, we are coming back to this communication problem.  I agree with you
> in the context of the words "data" and "metadata" as most of us probably
> define them.  But we are talking about LSIDs, and so we should follow the
> definitions of these words in the context of the LSID spec.  It may be
> terribly unfortunate that the LSID spec defines "data" differently from how
> most of us would use that word -- just as it is terribly unfortunate that a
> "named concept" has essentially nothing to do with either a taxon "concept"
> or a taxon "name", or that a "Class" written in C++ has no relationship to
> the "Class" Mammalia, or that a data "type" has nothing to do with a "type"
> specimen, or the fact that all of these "homonyms" cause problems that are
> different from the sorts of problems created by taxonomic "homonyms" --
> among dozens of other frustrating language barriers we have.
> However, in the context of LSIDs, which is what we are now discussing, the
> word "data" does indeed unambiguously refer to a digital/binary bytestream,
> and *not* the kind of "data" that Galileo collected.
> Aloha,
> Rich
