[tdwg-guid] Immutability of LSID data
P. Bryan Heidorn
pheidorn at uiuc.edu
Wed Jul 25 20:48:40 CEST 2007
Sorry I did not reply earlier. My time was eaten up with proposal
writing and the like.
I agree that we are talking about digital data for the LSID and I did
not intend to insinuate otherwise in my prior message. My point was
only that people are putting Galileo's data online in digital format
and do need unique identifiers for it; it just is not biodiversity
data. These issues have been addressed many times, and it is important
to learn from past experience.
We can say that it is important to be able to ensure the properties
of the data service, so that the digesting process can make
assumptions about the data. In Java and many other languages there is
an equality operator such as "==" that, for primitive values, compares
bit for bit. This is relevant to the concept of homonyms.
Hannu pointed out that it is nice to be able to make assumptions
about the nature of the data being delivered. You can, for example,
use "==" in your program and assume it will return true
if the service is following the rules (of bit-level immutability).
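
As a minimal sketch of what that assumption buys a client, the check below
compares a digest of the bytes returned now against a digest recorded
earlier. The fetch function, the in-memory store and the LSID itself are
hypothetical stand-ins, not part of any real LSID client library.

```python
import hashlib
from typing import Callable, Dict

def sha256_hex(data: bytes) -> str:
    """Digest of the exact bytes returned by the service."""
    return hashlib.sha256(data).hexdigest()

def still_immutable(fetch: Callable[[str], bytes], lsid: str, known_digest: str) -> bool:
    """True if the bytes served for this LSID match the digest recorded earlier."""
    return sha256_hex(fetch(lsid)) == known_digest

# Hypothetical in-memory stand-in for a getData service.
store: Dict[str, bytes] = {"urn:lsid:example.org:specimens:1234": b"ATGC"}

def fake_get_data(lsid: str) -> bytes:
    return store[lsid]

lsid = "urn:lsid:example.org:specimens:1234"
first_digest = sha256_hex(fake_get_data(lsid))
unchanged = still_immutable(fake_get_data, lsid, first_digest)
```

If the service later returned different bytes for the same LSID, the same
check would fail, which is exactly the rule violation a client wants to
detect.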
When we say two things are equivalent in these languages we mean
"equivalent" under the language's operators. The LSID getData
service is defined in these terms, which is very reasonable for many
forms of data including molecular sequences (except that genetic
matching algorithms frequently treat a genetic sequence and its
complement as equivalent because they are both halves of the same
double helix). So I would guess that even the molecular community that
defined LSID might have people who are unhappy with the current
definition. In some languages we are allowed to overload operators
such as "==" with our own definition of equivalence. The language
designers did this because people often need different definitions of
equivalence, particularly for complex data types.
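
A small sketch of the two notions side by side, assuming "complement" is
read as the reverse complement of a DNA strand. The Strand class and the
helper are hypothetical illustrations, not anything defined by the LSID
specification.

```python
# Translation table mapping each base to its complementary base.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(bases: str) -> str:
    """Complement every base, then reverse, giving the opposite strand."""
    return bases.translate(_COMPLEMENT)[::-1]

class Strand:
    """Hypothetical sequence type whose == treats both strands as equal."""

    def __init__(self, bases: str) -> None:
        self.bases = bases.upper()

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Strand):
            return NotImplemented
        return other.bases in (self.bases, reverse_complement(self.bases))

    def __hash__(self) -> int:
        # Keep hashing consistent with __eq__: both strands hash alike.
        return hash(min(self.bases, reverse_complement(self.bases)))

forward = "ATGC"
reverse = reverse_complement(forward)

same_bytes = forward.encode("ascii") == reverse.encode("ascii")  # bit-level test
same_strand = Strand(forward) == Strand(reverse)                 # overloaded test
```

The bit-level comparison fails while the overloaded one succeeds, which is
the gap between getData's definition and what a sequence-matching
application may want.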
In many programming tasks bit-level equivalence is not needed and
is indeed problematic. So RDF and software such as DOM define
equivalence not as bit-level matching of 1's and 0's in a particular
order but as a higher-order construct. So, we can have a born-digital
object that describes a species of plant.
[...]length></leaf>" is equivalent to "<leaf><length unit="mm">10</length></leaf>"
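
A minimal sketch of such a higher-order comparison, assuming the two
documents differ only in the unit used for the leaf length. The
centimetre document, the conversion table and the function names are
hypothetical and chosen only to illustrate the idea.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical conversion factors to millimetres.
TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}

def leaf_length_mm(xml_text: str) -> float:
    """Parse a <leaf> document and return its length in millimetres."""
    root = ET.fromstring(xml_text)
    length = root.find("length")
    if length is None or length.text is None:
        raise ValueError("missing <length> element")
    return float(length.text) * TO_MM[length.get("unit", "mm")]

def representationally_equivalent(a: str, b: str) -> bool:
    """Compare the measurements the documents describe, not their bytes."""
    return math.isclose(leaf_length_mm(a), leaf_length_mm(b))

in_cm = '<leaf><length unit="cm">1</length></leaf>'   # assumed counterpart
in_mm = '<leaf><length unit="mm">10</length></leaf>'

bytes_equal = in_cm.encode("utf-8") == in_mm.encode("utf-8")
meaning_equal = representationally_equivalent(in_cm, in_mm)
```

Here the byte comparison is false and the unit-aware comparison is true;
a real definition would have to say which normalizations count.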
There are applications in biodiversity informatics where bit-level
equivalence is useful, so I support keeping getData's requirement of
bit-level equivalence. Other branches of biodiversity informatics,
however, would benefit from a different definition of equivalence.
This can be handled with an LSID extension as a new function. Who
pays for development of this new function is important. We can roll
out a more constrained standard with getData as is and later add the
new getDataRepresentationallyEquivalent.
So, let's move ahead, adopt LSID and start using it for the cases
where bit level equivalence is acceptable and either expand it later
or develop a different standard to give unique identifiers for the
P. Bryan Heidorn
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
pheidorn at uiuc.edu
(V)217/ 244-7792 (F)217/ 244-3302
Online Calendar: http://www.uiuc.edu/goto/heidorncalendar
On Jul 16, 2007, at 8:17 PM, Richard Pyle wrote:
> Hi Bryan,
>> What is data and what is metadata has no relation to being
>> digital or not. There was data and metadata long before there
>> were computers.
> Again, we are coming back to this communication problem. I agree with you
> in the context of the words "data" and "metadata" as most of us define
> them. But we are talking about LSIDs, and so we should follow the
> definitions of these words in the context of the LSID spec. It may be
> terribly unfortunate that the LSID spec defines "data" differently from how
> most of us would use that word -- just as it is terribly unfortunate that a
> "named concept" has essentially nothing to do with either a taxon or a
> taxon "name", or that a "Class" written in C++ has no relationship to
> the "Class" Mammalia, or that a data "type" has nothing to do with a "type"
> specimen, or the fact that all of these "homonyms" cause problems that are
> different from the sorts of problems created by taxonomic "homonyms" --
> among dozens of other frustrating language barriers we have.
> However, in the context of LSIDs, which is what we are now discussing, the
> word "data" does indeed unambiguously refer to a digital/binary
> and *not* the kind of "data" that Galileo collected.