[tdwg-guid] Immutability of LSID data
Ricardo Pereira
ricardo at tdwg.org
Mon Jul 16 15:36:50 CEST 2007
Folks,
Let's take one controversial issue at a time and discuss it, starting
with the "easiest" ones. I suggest we discuss the immutability of
LSID data next.
Let us first review which methods are provided by the LSID data
services:
bytes getData(LSID lsid)
bytes getDataByRange(LSID lsid, integer start, integer length)
I wasn't the one who came up with the LSID spec, but I suppose that
those methods were specifically designed to handle sequence data (DNA
and protein data). The getDataByRange method in particular was designed
to allow clients to refer to very specific subsets of those sequences.
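To make the two calls concrete, here is a minimal in-memory sketch in Python. It is not the LSID specification's API or any existing LSID library; the class name, the Python method names, and the example LSID are all hypothetical, and I assume a zero-based start offset.

```python
class InMemoryLsidDataService:
    """Toy stand-in for an LSID data service (hypothetical, not a real API)."""

    def __init__(self, store):
        # store maps LSID strings to the byte strings they identify
        self._store = dict(store)

    def get_data(self, lsid):
        # Mirrors getData: return the full byte stream for the LSID
        return self._store[lsid]

    def get_data_by_range(self, lsid, start, length):
        # Mirrors getDataByRange: return `length` bytes beginning at `start`
        # (a zero-based offset is an assumption made for this sketch)
        data = self._store[lsid]
        return data[start:start + length]


# Hypothetical LSID naming a short DNA sequence
LSID = "urn:lsid:example.org:sequences:1"
service = InMemoryLsidDataService({LSID: b"ACGTACGTAC"})
subsequence = service.get_data_by_range(LSID, 2, 4)
```

A range like this is only meaningful if the bytes behind the LSID never change, which is why the spec ties getData to byte-level immutability.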
No doubt this is all very useful for the bioinformatics folks,
but as we've seen in previous discussions, it is not as useful for us in
the biodiversity (and ecological) informatics communities. The main
reason is that some of our data is represented in XML, which is not
guaranteed to be serialized as the very same stream of bytes every time
(attribute order, whitespace, and encoding may vary). But it may still
be helpful to use the getData call to retrieve such data.
The question under discussion in this thread is whether we should bend
the LSID rules to accept XML data in getData calls. My proposal, which I
think gathers the points presented in the previous discussion thread, is
that whatever is served using getData is "semantically immutable".
Semantic immutability would then depend on the content type of the
data returned. For example:
1) If data is of content type text/plain, application/octet-stream,
image/*, etc., then it must always be returned as the exact same byte
sequence (just like the LSID spec states now).
2) If data is in XML, e.g., it is of content type text/xml, text/html
(God forbid), then the call must always return an equivalent XML DOM tree;
3) If data is application/rdf+xml or application/rdf+n3 (i.e. RDF data),
the getData call must always return the same RDF graph;
4) And so on for every other MIME type out there.
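As a rough illustration of what "semantically immutable" could mean in code, here is a sketch of an equivalence check keyed on content type. The function name and the choice of Canonical XML (C14N) as the test for "equivalent XML DOM tree" are my assumptions, not something the LSID spec defines. Ignoring whitespace-only text is a further assumption. The RDF case is left out because comparing graphs needs an RDF library.

```python
from xml.etree.ElementTree import canonicalize  # Python 3.8+

XML_TYPES = ("text/xml", "application/xml")


def semantically_equal(content_type, first, second):
    """Hypothetical check: are two getData results 'the same' for this type?"""
    if content_type in XML_TYPES:
        # Assumption: compare canonical (C14N) forms, so attribute order,
        # quote style, and indentation do not count as changes
        canon_first = canonicalize(first.decode("utf-8"), strip_text=True)
        canon_second = canonicalize(second.decode("utf-8"), strip_text=True)
        return canon_first == canon_second
    # Item 1: everything else must match byte for byte
    return first == second


# Two serializations of the same made-up record
doc_a = b'<taxon rank="species" id="1"><name>Puma concolor</name></taxon>'
doc_b = b"<taxon id='1' rank='species'>\n  <name>Puma concolor</name>\n</taxon>"

same_as_xml = semantically_equal("text/xml", doc_a, doc_b)
same_as_bytes = semantically_equal("text/plain", doc_a, doc_b)
```

Here the two documents differ as byte streams but compare equal as XML, which is exactly the distinction the proposal relies on.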
The implications of bending the LSID getData calls like that are:
a) One may not use the getDataByRange call for data that is not byte-stream
equivalent (that is, anything outside item #1 above). Authorities would have
to return an error message when getDataByRange is called on an object that
is only "semantically immutable".
b) Some may claim that caching of LSIDs and the associated data would be
impossible. But since the data is always "semantically immutable",
what's wrong with caching it?
c) Authorities wouldn't be able to return data in alternate MIME types
(RDF in XML or N3 or Turtle) as there is no parameter that specifies
that. Not a problem I suppose.
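Implication (a) could be sketched as follows. This is only an illustration under my own assumptions: the exception name, the set of byte-exact content types, and the function signature are hypothetical, and a real authority would report the error through whatever mechanism its LSID binding provides.

```python
# Content types treated as byte-exact in this sketch (item 1 of the proposal)
BYTE_EXACT_TYPES = ("text/plain", "application/octet-stream")


class RangeNotSupported(Exception):
    """Hypothetical error for range requests on non-byte-exact data."""


def is_byte_exact(content_type):
    return content_type in BYTE_EXACT_TYPES or content_type.startswith("image/")


def get_data_by_range(data, content_type, start, length):
    # Refuse ranges when only semantic immutability is promised, because
    # byte offsets are not stable across equivalent serializations
    if not is_byte_exact(content_type):
        raise RangeNotSupported(
            "getDataByRange is not available for " + content_type
        )
    return data[start:start + length]


first_codon = get_data_by_range(b"ACGTACGTAC", "text/plain", 0, 3)
```

Calling get_data_by_range with content type text/xml raises RangeNotSupported instead of returning bytes whose offsets could shift between equivalent serializations.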
I agree with Dave that we would gain much if we bent the LSID
rule about immutability of data. We would have a more general solution
that would fit the needs of a broader set of providers, without
impacting the authorities that today don't use getData that much, such
as the providers of names, concepts, observations, specimens, authors,
and collections.
So the questions I pose to our group in this thread are:
"Should we allow 'semantically immutable' data to be returned in the
getData call? How exactly do we do it (i.e., what would be a
specification for it)?"
I don't really see a problem bending the LSID rules a bit, as
outlined above. What do you think?
Cheers,
Ricardo