[tdwg-guid] LSID metadata persistence (or lack thereof)

P. Bryan Heidorn pheidorn at uiuc.edu
Fri Jul 13 21:08:50 CEST 2007


I agree with Chuck's very valid argument that we cannot get so tied up
in details that nothing happens, but Dave is also correct. In this
case I think the resolution is fairly simple. We need not argue for
long about details. Semantic equivalence, not absolute bit-level
equivalence, is used in almost all XML-based TDWG standards,
including, for example, Darwin Core, ABCD, and SDD. We would like a
definition that preserves the semantic equivalence but not the
absolute. This does lead to the kind of complication that Chuck
rightfully abhors: semantic equivalence requires a method of testing
the equivalence. In the TDWG standards this is easy, since the
schema + XML validator + data is all we need. We have already gambled,
in developing the XML-based standards, that the XML validation tools
will persist for a long time.
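
As a rough illustration of the difference between bit-level and
representation-independent equivalence, here is a minimal Python
sketch. It uses XML canonicalization (C14N 2.0, from the Python 3.8+
standard library) as the equivalence test. That is closer to Dave's
"canonical form" suggestion than to schema validation, and it only
covers syntactic variation, so treat it as one possible test and not
the definition of semantic equivalence. The helper names are made up
for the example.

```python
import hashlib
from xml.etree.ElementTree import canonicalize  # Python 3.8+

# Dave's three spellings of the same empty element.
VARIANTS = ["<a/>", "<a />", "<a></a>"]


def byte_digest(xml_text: str) -> str:
    """Hash the raw byte stream (Chuck's identical-byte-stream view)."""
    return hashlib.sha256(xml_text.encode("utf-8")).hexdigest()


def canonical_digest(xml_text: str) -> str:
    """Hash the canonical form (hypothetical helper, C14N 2.0)."""
    canonical = canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


raw = {byte_digest(v) for v in VARIANTS}
canon = {canonical_digest(v) for v in VARIANTS}

# Three distinct byte streams, but a single canonical form.
print(len(raw), len(canon))  # prints "3 1"
```

Under the byte-stream rule the three variants count as three different
things; under a canonical-form rule they count as one.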

Perhaps
"The data associated with an LSID is semantically persistent"

would meet both the simplicity Chuck is looking for and the
expressiveness Dave points out is necessary. I do not know how many
people understand semantic persistence, so it may require a
definition or footnote. Just referring to the XML standards should be
sufficient.

"Semantic persistence ensures that the framework for interpretation
of data will not change across representations, as is the case, for
example, with the expressive equivalence of multiple representations
of the same information under XML."

It is starting to sound like formal logic but that might be a good  
thing.

-- 
--------------------------------------------------------------------
   P. Bryan Heidorn
   Graduate School of Library and Information Science
   University of Illinois at Urbana-Champaign
   pheidorn at uiuc.edu
   (V)217/ 244-7792     (F)217/ 244-3302
   http://www.uiuc.edu/goto/heidorn
   Online Calendar: http://www.uiuc.edu/goto/heidorncalendar


On Jul 13, 2007, at 1:26 PM, Dave Vieglais wrote:

> Hi Chuck,
> I absolutely have to disagree.  Consider that the xml document:
>
> <a/>
>
> can also be represented:
>
> <a />
>
> <a></a>
>
> with identical content, yet the corresponding byte streams are  
> quite different.
>
> What happens say, if you are generating your xml output from a  
> database using some DOM library for example, and during an update  
> to your software (perhaps in a library over which you have no  
> control) there is a subtle change in the generation of XML that  
> remains consistent for the content but uses one of the alternate  
> representations above?  Not only do you violate the "unchanged byte  
> stream" rule when the corresponding LSID is resolved, but  
> downstream consumers that rely on that rule may be broken yet there  
> is no change in the information content.
>
> It seems more practical, manageable, and achievable to indicate  
> that the canonical form remains constant.
>
> Dave V.
>
>
> On Jul 14, 2007, at 06:03, Chuck Miller wrote:
>
>> Dave,
>> What you say is true.  But, I think we already have too many
>> variations, subtleties, and reinterpretations which are endlessly
>> debated.
>>
>> The LSID standard would be simple, clear and consistent if we used
>> the identical-byte-stream definition.  The LSID would uniquely tag a
>> persistent byte stream. A persistent byte stream is always the same
>> thing without any further explanation or clarification.
>>
>> The provider of an LSID byte-stream would need to commit to keeping
>> that byte-stream persistent and not represent it in multiple ways,
>> even though technically they could.  If they can't commit to that,
>> then it can't be an LSID byte-stream.
>>
>> And in the name of simplicity and clarity, if they had to provide
>> different byte-stream representations then they would have to
>> assign a different LSID to each and use "SameAs" metadata.
>>
>> Chuck
>>
>> -----Original Message-----
>> From: Dave Vieglais [mailto:vieglais at ku.edu]
>> Sent: Friday, July 13, 2007 12:42 PM
>> To: Chuck Miller
>> Cc: Ricardo Pereira; tdwg-guid at lists.tdwg.org
>> Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
>>
>> Hi Ricardo, Chuck,
>> Asserting that the byte stream returned as data associated with an
>> LSID should never change is perhaps a bit confusing from a
>> programmatic view.  There are for example many ways to represent data
>> in xml that are identical from an information content point of view,
>> but the byte streams could be very different.
>>
>> Perhaps it might be better to state something like "the canonical
>> representation of the data associated with an LSID must not change",
>> or something to that effect?
>>
>> Dave V.
>>
>> On Jul 14, 2007, at 05:29, Chuck Miller wrote:
>>
>>> Ricardo,
>>>
>>> Looking at this definition: "Persistence of LSID Data: The data
>>> associated with an LSID (i.e., the byte stream returned by the LSID
>>> getData call) must never change"
>>>
>>>
>>>
>>> Perhaps this is a more straightforward way to conceive LSIDs.  The
>>> LSID goes with a byte stream.  It's that byte stream that must stay
>>> the same.  So, if there is a byte stream associated with a
>>> collection that needs to stay the same, then whatever that byte
>>> stream happens to be is the data that gets an LSID assigned to it.
>>> That sure seems a clearer definition of what is data and what is
>>> metadata, rather than the issue of primary object and all that.
>>>
>>>
>>>
>>> So we can create a new definition in the context of LSIDs: Data is
>>> a byte stream that is persistent, never changes and can have an
>>> LSID.  Metadata is a byte stream that is non-persistent, might
>>> change and is only associated with an LSID.
>>>
>>>
>>>
>>> The institution that assigns an LSID can make its own decision
>>> about whether the byte stream being provided is persistent or
>>> non-persistent.  By assigning an LSID to any byte stream, whatever
>>> it is, the institution is declaring it to be data and persistent.
>>>
>>>
>>>
>>> So, in the example given of an observation record with a
>>> determination that needs to remain fixed and unchanged, by
>>> assigning an LSID to that observation+determination it would be
>>> "declared to be data" and unchangeable.  A different determination
>>> would then be different data with a different LSID.  That would
>>> provide a solution for those who want to employ it.  Others could
>>> choose not to use it.
>>>
>>>
>>>
>>> Chuck
>>>
>>>
>>>
>>> From: tdwg-guid-bounces at lists.tdwg.org [mailto:tdwg-guid-bounces at lists.tdwg.org] On Behalf Of Ricardo Pereira
>>> Sent: Friday, July 13, 2007 9:47 AM
>>> To: tdwg-guid at lists.tdwg.org
>>> Subject: [tdwg-guid] LSID metadata persistence (or lack thereof)
>>>
>>>
>>>
>>>     Hi there folks,
>>>
>>>     As Chuck mentioned a few weeks ago, we do have a few
>>> outstanding issues to address regarding LSIDs. I would like to
>>> discuss those one by one, in an orderly manner, and reach consensus
>>> as much as we can. Then we can sum them up in a TDWG standard,
>>> possibly by or shortly after the Bratislava conference.
>>>
>>>     The first issue I would like to discuss is LSID metadata
>>> persistence. First, let me remind you of a corollary established by
>>> the LSID specification:
>>>
>>>             Corollary 1: LSIDs are not guaranteed to be resolvable
>>> indefinitely.
>>>
>>>     In other words, there is no guarantee that one will always be
>>> able to retrieve the data associated with an LSID as the authority
>>> may choose (or be forced) not  to resolve an LSID anymore.
>>>
>>>     Second, let me distinguish the kind of persistence I'm talking
>>> about from two other related concepts (which we'll not discuss in
>>> this thread):
>>>
>>>         1) Persistence of Assignment: Once assigned to an object,
>>> an LSID is indefinitely associated with it. The same LSID cannot be
>>> assigned to another object. Ever! The LSID may not be resolvable
>>> anymore, but it cannot be assigned to another object. This is
>>> established by the LSID specification.
>>>
>>>         2) Persistence of LSID Data: The data associated with an
>>> LSID (i.e., the byte stream returned by the LSID getData call) must
>>> never change. Although the LSID may not be resolvable anymore
>>> (according to corollary 1), the data associated with an LSID must
>>> never ever change. That's defined by the LSID spec, too.
>>>
>>>     What I want to discuss here is the persistence of LSID metadata
>>> (what is returned by the getMetadata call) or the lack thereof.
>>>
>>>     A use case associated with metadata persistence is when someone
>>> collects observation records (and implicitly, their determinations)
>>> and runs an experiment (a model or simulation) with them. This
>>> person may want to record the identifiers of the points used so
>>> that someone using the results of that experiment may refer back to
>>> the primary data, to validate or repeat the experiment.
>>>
>>>     The bad news is that the LSID identification scheme (or any
>>> other GUID that I know of) was not designed to guarantee metadata
>>> persistence, and thus it cannot implement the use case above by
>>> itself. To implement that use case, the specification would have to
>>> guarantee that the metadata (which we are using here as data) is
>>> immutable. But it doesn't.
>>>
>>>     Most of us wish that metadata were persistent, but it isn't.
>>> Many things can change in the metadata: a new determination, a
>>> misspelling that is corrected, many things. We just cannot guarantee
>>> that the metadata will look as it did some time ago.
>>>
>>>     We then reach the following conclusion.
>>>
>>>             Corollary 2: LSID metadata is neither immutable nor
>>> persistent.
>>>
>>>     The consequence of this corollary is that, if you need to refer
>>> back to a piece of information (metadata) associated with an LSID,
>>> exactly as it was when you got it, you must make a copy of it, or
>>> arrange that someone else make that copy for you.
>>>
>>>     In other words, a client cannot assume that the metadata
>>> associated with an LSID today will be the same tomorrow. If the
>>> client does assume that, it may be relying on a false assumption
>>> and its output may be flawed.
>>>
>>>     If we are not happy with that conclusion, we may develop an
>>> additional component in our architecture, an archive of some sort,
>>> to handle (meta)data persistence. That is exactly what the STD-DOI
>>> project (http://www.std-doi.de/) and SEEK
>>> (http://seek.ecoinformatics.org) have done to some extent.
>>>
>>>     While we cannot guarantee that LSID metadata is persistent or
>>> immutable, we can definitely document how the metadata has changed
>>> through metadata versioning. That's the topic of the next thread.
>>> We will move on to discuss metadata versioning as soon as we are
>>> done with metadata persistence.
>>>
>>>     Cheers,
>>>
>>> Ricardo
>>>
>>> _______________________________________________
>>> tdwg-guid mailing list
>>> tdwg-guid at lists.tdwg.org
>>> http://lists.tdwg.org/mailman/listinfo/tdwg-guid
>>
>



