[tdwg-guid] Immutability of LSID data

P. Bryan Heidorn pheidorn at uiuc.edu
Mon Jul 16 19:50:23 CEST 2007


I do not know if I stated my position on the issue of getData()
immutability. There is an installed base of applications "expecting"
that data returned by getData() will always have the same bit
pattern. Because of that, and because of the existing definition of
getData() in the LSID spec, we should not mess with that contract.
That leaves two options for "semantically immutable" data. Either
call it metadata and return it in getMetadata() or, as I would
prefer, extend the LSID spec to allow a new method such as
getMimeData() or getflexData(). We can argue for a long time about
the name, but this method could validly return XML, RDF or other data
types that may have semantically equivalent representations with
different byte sequences. With this solution we would not need to
support illegal activity.
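
As a rough illustration of the difference between the two contracts,
here is a minimal Python sketch. The function names and the sample
records are hypothetical and are not part of the LSID spec, and XML
canonicalization (C14N 2.0) is only one candidate for a "semantic"
equivalence test; RDF or other types would need their own agreed-upon
algorithm.

```python
import hashlib
from xml.etree.ElementTree import canonicalize  # Python 3.8+


def byte_identical(a: bytes, b: bytes) -> bool:
    """Byte-level contract: both payloads must have the same digest."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


def xml_semantically_equal(a: bytes, b: bytes) -> bool:
    """Assumed semantic test for XML: compare canonical (C14N 2.0) forms.

    Canonicalization sorts attributes, and strip_text=True drops
    leading/trailing whitespace in text content, so layout differences
    between two serializations disappear.
    """
    canon_a = canonicalize(xml_data=a.decode("utf-8"), strip_text=True)
    canon_b = canonicalize(xml_data=b.decode("utf-8"), strip_text=True)
    return canon_a == canon_b


# Two hypothetical serializations of the same record: attribute order
# and whitespace differ, the content does not.
first = b'<record id="1" source="herbarium"><name>Quercus alba</name></record>'
second = (
    b'<record source="herbarium"  id="1">\n'
    b'  <name>Quercus alba</name>\n'
    b'</record>'
)

print(byte_identical(first, second))          # byte-level comparison
print(xml_semantically_equal(first, second))  # canonical-form comparison
```

A client written against the getData() contract would rely on the
first test only; a client of the proposed new method would have to use
something like the second, chosen according to the declared data type.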


On Jul 16, 2007, at 12:10 PM, Bob Morris wrote:

> My last escaped prematurely. I meant:
>
> There is no way that an application that passes an LSID to another
> application can know that the second program will abide by some
> non-standard TDWG-defined contract about something called an LSID. Any
> program that passes a URI beginning with urn:lsid with an implicit or
> explicit request for a getData() call cannot be assured of anything
> about the chain of custody except what is in the LSID spec.
>
> I wholly \agree/ with the need to have semantically persistent
> services, together with agreed upon, named, algorithms which establish
> the identity of two data streams for that purpose. What I don't agree
> with is calling the hook urn:lsid and the method getData().
>
> Since the infrastructure at TDWG and elsewhere is in place for LSID, I
> think I would address this issue not by defining a new standard that
> is a clone of LSID except with a different definition of getData(),
> but rather think about whether there can be stuff in the getMetadata()
> calls and returns that permit an assertion by the callee that some bit
> of stuff has been provided under the semantic persistence contract.
> Yes, this will lead to needing a call to getMetadata() for stuff that
> some people insist is data (and also insist there is a difference).
> This is the cost of doing robust business. Yes, some people will write
> non-compliant getData() services. Yes, applications that deal with
> those will sometimes break. As Bruce Stein said in a breakout group
> last week in the Observation Modeling workshop: "You can't legislate
> against illegal activity."
>
> On 7/16/07, Bob Morris <morris.bob at gmail.com> wrote:
>> There is no way to guarantee that a particular application which
>> passes an LSID to another application can expect anything other
>>
>> On 7/16/07, P. Bryan Heidorn <pheidorn at uiuc.edu> wrote:
>> > I am not sure if I follow completely, Bob, but I think you are
>> > pointing out an important issue for "semantic immutability" versus
>> > "byte/bit-level immutability". If a client retrieves data from two
>> > different sources under a byte-level immutability contract, a
>> > simple equivalence test should be able to verify the byte-level
>> > equivalence. Under the semantic immutability contract, a more
>> > complex test for equivalence would be required, fitted for example
>> > to the mime-type.
>> >
>> > In practice I do not think this is an issue. If clients act under
>> > blind faith under either contract they would not test the
>> > equivalence. In fact they would usually only retrieve a particular
>> > LSID from one service. The blind faith client would process the
>> > data as if the data provider is following the contract and no more.
>> > The client could not assume byte-level immutability when there is
>> > only semantic immutability because it may indeed break the client
>> > code. A cached byte-level representation of data from one call
>> > cannot be compared byte for byte with semantically immutable data.
>> > If XML is carried in the data, all operations must be consistent
>> > with XML operations. I do not see this as a problem.
>> >
>> > Since in the biodiversity community LSID data payloads would be
>> > about a large variety of objects, clients would always need to
>> > check the data types before most processing operations. The data
>> > type information would be encoded in the metadata but could also be
>> > segregated by service provider (but even there, for good form, the
>> > metadata should encode the data type). The metadata needs to encode
>> > both the physical layout of the bits and the "use" (there must be a
>> > better word). For example, the data could be a Darwin Core record,
>> > a Dublin Core record or SDD. All are XML but the legal operations
>> > over that XML are different depending on the "use". Some clients
>> > could just pass the data through without being concerned about
>> > this, but other clients would need to process accordingly, perhaps
>> > ignoring types they know nothing about.
>> >
>> > ------
>> >
>> > Unrelated to Bob's comment, I would like to add a point about
>> > digital-from-birth vs. made-digital data.
>> >
>> > What is data and what is metadata has no relation to being digital
>> > or not. There was data and metadata long before there were
>> > computers. Galileo, studying the time objects take to move down an
>> > inclined plane, collected data: the time, distance, angle and mass
>> > of the objects. At least the time and the distance recorded in his
>> > notebooks are data. If we re-represent his data from the notebook
>> > in digital format in 2007 so we can process it in an Excel
>> > spreadsheet, it is still the same data. If we just take a photo of
>> > the book we might have a different beast, but as long as we leave
>> > his numbers as numbers it is the same data. The metadata about the
>> > inclined plane experiment would include information about the
>> > apparatus used. For example, he might have bells that ring at
>> > different locations/distances along the inclined plane, and the
>> > plane might be made of a wooden frame with brass rails. All this
>> > metadata tells us about the data; it is data about the data.
>> > Similar arguments can be made about specimens. A digital
>> > representation of a specimen is still data. No one is arguing that
>> > the specimen is a species or a species concept. A specimen glued to
>> > paper or in a photo can be assigned to a species concept, meaning
>> > someone has said this is an X. As such we can treat it as an
>> > exemplar of X. If it is a type we can even say it is a very good
>> > example of X, but it does not cover the entire concept of X. The
>> > image of the specimen can be data. We need not treat it as metadata
>> > just because it is digital or because there is an object or event
>> > in the world that is the primary representation. Galileo's numbers
>> > also existing in the notebook do not make the numbers in the
>> > computer any less data. We will want to add metadata to the digital
>> > numbers to tell the user that they came from Galileo's notebook.
>> >
>> > Bryan
>> > --
>> > --------------------------------------------------------------------
>> >    P. Bryan Heidorn
>> >    Graduate School of Library and Information Science
>> >    University of Illinois at Urbana-Champaign
>> >    pheidorn at uiuc.edu
>> >    (V)217/ 244-7792     (F)217/ 244-3302
>> >    http://www.uiuc.edu/goto/heidorn
>> >    Online Calendar: http://www.uiuc.edu/goto/heidorncalendar
>> >
>> >
>> > On Jul 16, 2007, at 9:01 AM, Bob Morris wrote:
>> >
>> > > On 7/16/07, Ricardo Pereira <ricardo at tdwg.org> wrote:
>> > >>
>> > > One thing that is wrong with it is that if a conforming client
>> > > acquires the data with a getData call from two different sources,
>> > > and they return different byte strings, then the client is
>> > > permitted to signal an error and possibly break an application
>> > > that exercises a blind faith in the power of "semantic
>> > > immutability".
>> > >
>> > >
>> > >>  b) Some may claim that caching of LSIDs and the associated data
>> > >> would be impossible. But since the data is always "semantically
>> > >> immutable", what's wrong with caching it?
>> > >>
>> > >
>> > > --
>> > > Robert A. Morris
>> > > Professor of Computer Science
>> > > UMASS-Boston
>> > > ram at cs.umb.edu
>> > > http://bdei.cs.umb.edu/
>> > > http://www.cs.umb.edu/~ram
>> > > http://www.cs.umb.edu/~ram/calendar.html
>> > > phone (+1)617 287 6466
>> >
>> >
>>
>>
>>
>
>



