[tdwg-guid] Immutability of LSID data

16 Jul 2007

      Folks,

    Let's pick one controversial issue at a time and discuss it. I 
suggest we pick the "easiest" ones first. Let's pick the immutability of 
LSID data next.

    Let us first review which methods are provided by the LSID data 
services:

bytes getData(LSID lsid)
bytes getDataByRange(LSID lsid, integer start, integer length)

    I wasn't the one who came up with the LSID spec, but I suppose that 
those methods were specifically designed to handle sequence data (DNA 
and protein data). The getDataByRange method in particular was designed 
to allow clients to refer to very specific subsets of those sequences.

    No doubt this is all very useful for the bioinformatics folks, 
but as we've seen in previous discussions, it is not as useful for us in 
the biodiversity (and ecological) informatics communities. The main 
reason is that some of our data is represented in XML, which is not 
guaranteed to serialize to the very same stream of bytes every time. But 
it may still be helpful to use the getData call to retrieve such data.

    The question under discussion in this thread is whether we should 
bend the LSID rules to accept XML data in getData calls. My proposal, 
which I think gathers the points presented in the previous discussion 
thread, is that whatever is served using getData is "semantically 
immutable". Semantic immutability would then depend on the content type 
of the data returned. For example:

1) If data is of content type text/plain, application/octet-stream, 
image/*, etc., then it must always be returned as the exact same byte 
stream (just like the LSID spec states now);
2) If data is in XML, i.e., it is of content type text/xml, text/html 
(God forbid), then it must always be returned as an equivalent XML DOM 
tree;
3) If data is application/rdf+xml or application/rdf+n3 (i.e. RDF data), 
the getData call must always return the same RDF graph;
4) And so on for every other MIME type out there.
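
To make the rules above a bit more concrete, here is a minimal Python sketch of how a client might check that two getData responses for the same LSID are "semantically" the same. Everything in it is an assumption made for illustration: the function names are hypothetical, "equivalent XML DOM tree" is approximated by comparing canonical XML (C14N 2.0) with whitespace-only text stripped, and RDF is left out because comparing graphs needs an RDF library.

```python
import xml.etree.ElementTree as ET

# Content types compared as XML trees in this sketch (an assumption;
# text/html would need an HTML parser and is not handled here).
XML_TYPES = {"text/xml", "application/xml"}
# Content types that would need an RDF graph comparison (not shown).
RDF_TYPES = {"application/rdf+xml", "application/rdf+n3"}


def canonical_form(content_type: str, data: bytes) -> bytes:
    """Reduce a getData response to a form that can be compared byte by byte."""
    base_type = content_type.split(";")[0].strip().lower()
    if base_type in XML_TYPES:
        # Canonical XML sorts attributes and expands empty elements, so two
        # serializations of the same tree end up identical.
        return ET.canonicalize(xml_data=data, strip_text=True).encode("utf-8")
    if base_type in RDF_TYPES:
        raise NotImplementedError("RDF needs a graph comparison, not shown here")
    # Rule 1: text/plain, application/octet-stream, image/*, ... are
    # compared as the exact byte stream.
    return data


def semantically_equal(content_type: str, first: bytes, second: bytes) -> bool:
    """True if two responses count as the same data under the rules above."""
    return canonical_form(content_type, first) == canonical_form(content_type, second)
```

With this sketch, two XML responses that differ only in attribute order or indentation compare as equal, while two text/plain responses that differ by a single trailing newline do not.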
    The implications of bending the LSID getData calls like that are:

a) One may not use getDataByRange call for data that is not byte stream 
equivalent (item #1 above). Authorities would have to return an error 
message when getDataByRange is called on a "semantically immutable" object.

b) Some may claim that caching of LSIDs and the associated data would be 
impossible. But since the data is always "semantically immutable", 
what's wrong with caching it?

c) Authorities wouldn't be able to return data in alternate MIME types 
(RDF in XML or N3 or Turtle) as there is no parameter that specifies 
that. Not a problem I suppose.
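
As an illustration of implication (a), here is a small Python sketch of an authority that serves getDataByRange only for byte-stream content types and returns an error otherwise. The in-memory store, the example LSIDs, the RangeNotSupported error and the function names are all hypothetical; the LSID spec does not define any of them.

```python
class RangeNotSupported(Exception):
    """Hypothetical error for range requests on non-byte-stream data."""


# Content types served as an exact byte stream (item 1 above).
BYTE_EXACT_TYPES = {"text/plain", "application/octet-stream"}


def is_byte_exact(content_type: str) -> bool:
    """True if the content type falls under the exact-byte-stream rule."""
    base_type = content_type.split(";")[0].strip().lower()
    return base_type in BYTE_EXACT_TYPES or base_type.startswith("image/")


def get_data_by_range(store: dict, lsid: str, start: int, length: int) -> bytes:
    """Return `length` bytes from offset `start`, or refuse for other types."""
    content_type, data = store[lsid]
    if not is_byte_exact(content_type):
        # A byte range is meaningless if the serialization may change.
        raise RangeNotSupported(
            f"{lsid} is served as {content_type}; byte ranges are not stable"
        )
    return data[start:start + length]


# Hypothetical store mapping each LSID to (content type, data).
store = {
    "urn:lsid:example.org:sequences:1": ("text/plain", b"ACGTACGT"),
    "urn:lsid:example.org:names:2": ("text/xml", b"<name/>"),
}
```

Calling get_data_by_range on the sequence LSID returns the requested slice, whereas the same call on the XML LSID raises the error, which is the behaviour item (a) asks of authorities.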

    I agree with Dave that we would gain much if we bent the LSID 
rule about immutability of data. We would have a more general solution 
that would fit the needs of a broader set of providers, without 
impacting the authorities that today don't use getData that much, such 
as the providers of names, concepts, observations, specimens, authors, 
and collections.

    So the questions I pose to our group in this thread are:

"Should we allow 'semantically immutable' data to be returned in the 
getData call? How exactly do we do it (i.e., what would be a 
specification for it)?"

    I don't really see a problem with bending the LSID rules a bit, as 
outlined above. What do you think?

    Cheers,

Ricardo

Ricardo Pereira