Folks,
Let's pick one controversial issue at a time and discuss it. I suggest we pick the "easiest" ones first. Let's pick the immutability of LSID data next.
Let us first review which methods are provided by the LSID data services:
bytes getData(LSID lsid)
bytes getDataByRange(LSID lsid, integer start, integer length)
I wasn't the one who came up with the LSID spec, but I suppose that those methods were specifically designed to handle sequence data (DNA and protein data). The getDataByRange method in particular was designed to allow clients to refer to very specific subsets of those sequences.
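To make the two calls concrete, here is a minimal sketch of how they might behave over immutable byte data. Everything in it is an assumption for illustration: the class name, the in-memory store, the example LSID, and the reading of "start" as a zero-based byte offset are mine, not taken from the LSID spec.

```python
class InMemoryDataService:
    """Toy stand-in for an LSID data service; not the real LSID API."""

    def __init__(self, store):
        # store maps an LSID string to the bytes it names
        self._store = dict(store)

    def get_data(self, lsid):
        # Return the complete byte stream for the LSID.
        return self._store[lsid]

    def get_data_by_range(self, lsid, start, length):
        # Assumption: start is a zero-based byte offset.
        if start < 0 or length < 0:
            raise ValueError("start and length must be non-negative")
        return self._store[lsid][start:start + length]


# Hypothetical LSID naming a short DNA sequence.
SEQ_LSID = "urn:lsid:example.org:seq:1"
service = InMemoryDataService({SEQ_LSID: b"ACGTTTGACCAT"})
fragment = service.get_data_by_range(SEQ_LSID, 4, 4)
```

A range like this only keeps meaning the same thing over time if the underlying bytes never change, which is why byte-level immutability matters so much for sequence data.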
No doubt this is all very useful for the bioinformatics folks, but as we've seen in previous discussions, it is not as useful for us in the biodiversity (and ecological) informatics communities. The main reason is that some of our data is represented in XML, which is not guaranteed to be serialized as the very same stream of bytes every time. But it may still be helpful to use the getData call to retrieve such data.
The question under discussion in this thread is whether we should bend the LSID rules to accept XML data in getData calls. My proposal, which I think gathers the points presented in the previous discussion thread, is that whatever is served using getData is "semantically immutable". Semantic immutability would then depend on the content type of the data returned. For example:
1) If data is of content type text/plain, application/octet-stream, image/*, etc., then it must always be returned as the exact byte stream sequence (just like the LSID spec states now).
2) If data is in XML, i.e., it is of content type text/xml, text/html (God forbid), then it must always return an equivalent XML DOM tree;
3) If data is application/rdf+xml or application/rdf+n3 (i.e. RDF data), the getData call must always return the same RDF graph;
4) And so on for every other MIME type out there.
The implications of bending the LSID getData calls like that are:
a) One may not use the getDataByRange call for data that is not byte stream equivalent (i.e., anything other than item #1 above). Authorities would have to return an error message when getDataByRange is called on an object that is only "semantically immutable".
b) Some may claim that caching of LSIDs and the associated data would be impossible. But since the data is always "semantically immutable", what's wrong with caching it?
c) Authorities wouldn't be able to return data in alternate MIME types (RDF in XML or N3 or Turtle) as there is no parameter that specifies that. Not a problem I suppose.
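As a rough sketch of what a content-type-dependent check could look like, the code below compares two getData payloads byte for byte unless they are XML, in which case it compares their canonical (C14N) forms. Using canonical XML as the definition of "equivalent DOM tree" is my assumption, as are the function names and the sets of MIME types. RDF graph comparison is deliberately left out because it needs a graph library. It requires Python 3.8 or later and assumes UTF-8 payloads.

```python
from xml.etree.ElementTree import canonicalize

# Assumed groupings of content types, for illustration only.
XML_TYPES = {"text/xml", "application/xml"}
RDF_TYPES = {"application/rdf+xml", "application/rdf+n3"}


def semantically_equal(content_type, first, second):
    """Compare two payloads (bytes) under the rule for their content type."""
    if content_type in RDF_TYPES:
        # Same-graph comparison is not sketched here.
        raise NotImplementedError("RDF graph comparison is not sketched here")
    if content_type in XML_TYPES:
        # Canonical XML ignores attribute order and tag spacing;
        # strip_text also drops whitespace around text content.
        return (canonicalize(first.decode("utf-8"), strip_text=True)
                == canonicalize(second.decode("utf-8"), strip_text=True))
    # Everything else falls back to exact byte equality.
    return first == second


def range_allowed(content_type):
    """getDataByRange only makes sense for byte-exact content."""
    return content_type not in (XML_TYPES | RDF_TYPES)


doc_a = b'<taxon rank="species" id="1"><name>Puma concolor</name></taxon>'
doc_b = b'<taxon id="1"   rank="species">\n  <name>Puma concolor</name>\n</taxon>'
same_xml = semantically_equal("text/xml", doc_a, doc_b)
same_bytes = semantically_equal("text/plain", doc_a, doc_b)
```

Here the two documents differ as byte streams but compare equal as XML, and an authority following (a) would refuse getDataByRange whenever range_allowed returns False.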
I agree with Dave that we would gain much if we bent the LSID rule about immutability of data. We would have a more general solution that would fit the needs of a broader set of providers, without impacting the authorities that today don't use getData that much, such as the providers of names, concepts, observations, specimens, authors, and collections.
So the questions I pose to our group in this thread are:
"Should we allow 'semantically immutable' data to be returned in the getData call? How exactly do we do it (i.e., what would be a specification for it)?"
I don't really see a problem bending the LSID rules a bit, as outlined above. What do you think?
Cheers,
Ricardo