[tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]

Dave Vieglais vieglais at ku.edu
Sat Jul 14 00:18:10 CEST 2007


I think we are all in agreement that the data and metadata referenced
by an LSID remain unchanged (in the case of the metadata, semantic
equivalence is a requirement for reasons such as those outlined by Matt).
My question is to do purely with the data that an LSID references  
through the getData() operation.  The form of that data could be  
anything really - an encrypted byte stream, digital image, Open  
Office document, spreadsheet, xml document...

We all know that the same data can be represented many ways that are  
logically, semantically and functionally equivalent yet form a  
different set of bytes when serialized.  Data expressed in XML is one  
example (is <a/> = <a /> = <a></a> ?).  A palette-based image is
another - the order of colors in the palette may be changed, and the  
pixel values adjusted to match the new palette order, but the image  
is still the same. There are many more simple examples that can be  
constructed that violate the unchanged bytes rule but for all  
practical and functional purposes the data are unchanged.
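
To make the XML case concrete, here is a minimal sketch in Python. It is only an illustration: it assumes Python 3.8 or later, whose standard library provides xml.etree.ElementTree.canonicalize (C14N 2.0), and the variable names are made up for the example.

```python
from xml.etree.ElementTree import canonicalize  # standard library, Python 3.8+

# Three serializations of the same empty element.
variants = ["<a/>", "<a />", "<a></a>"]

# As raw bytes they are three distinct streams.
raw = {v.encode("utf-8") for v in variants}

# After canonicalization they collapse to a single form.
canonical = {canonicalize(v) for v in variants}

print(len(raw))    # 3
print(canonical)   # {'<a></a>'}
```

Under a strict byte-identity rule these would be three different objects; under a canonical-form rule they are one.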

As mentioned previously, enforcing and implementing the unchanged
bytes rule is not challenging. It is, however, quite different from
stating that the data are returned unchanged.  It is this that I, and
I'm sure a lot of other implementors, would appreciate consensus on.
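
The difference between the two statements can be sketched with the palette example. This is a toy illustration only: the serialize() format and all names below are hypothetical and do not correspond to any real image format or LSID library.

```python
import hashlib

def digest(blob: bytes) -> str:
    """SHA-256 of the serialized bytes: the 'unchanged bytes' check."""
    return hashlib.sha256(blob).hexdigest()

def serialize(palette, pixels) -> bytes:
    """Hypothetical toy format: palette size, RGB triples, pixel indices."""
    flat = [channel for rgb in palette for channel in rgb]
    return bytes([len(palette)] + flat + pixels)

def decode(palette, pixels):
    """Resolve palette indices to colours: the 'unchanged data' check."""
    return [palette[i] for i in pixels]

# The same 2x2 image stored with two different palette orders.
palette_a, pixels_a = [(255, 0, 0), (0, 0, 255)], [0, 1, 1, 0]
palette_b, pixels_b = [(0, 0, 255), (255, 0, 0)], [1, 0, 0, 1]

bytes_unchanged = (digest(serialize(palette_a, pixels_a))
                   == digest(serialize(palette_b, pixels_b)))
data_unchanged = decode(palette_a, pixels_a) == decode(palette_b, pixels_b)

print(bytes_unchanged)  # False
print(data_unchanged)   # True
```

Comparing digests is cheap to implement, but it rejects a re-serialization that leaves the decoded image untouched.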

Dave V.

On Jul 14, 2007, at 09:20, Matthew Jones wrote:

> In terms of the metadata returned from an LSID, or any other  
> digital identifier, there are definite cases where metadata must be  
> semantically persistent in order to preserve the utility of data  
> and accuracy of scientific results.
>
> As a trivial example, given a set of observations collected at time  
> t, one can represent the data for those observations in dataset D  
> and the metadata for the dataset, including the time value t, in a  
> metadata document M.  In a later event, it is discovered that t was  
> entered incorrectly, and needs to be adjusted, creating metadata  
> document M'. That M and M' are not congruent is critical knowledge  
> when analyzing data from D with data from another dataset D2.  In  
> other words, because there is no true distinction between data and  
> metadata (any given piece of information can be stored in either  
> location), a proper archive must be able to distinguish any changes  
> in the data and any changes in the metadata.
>
> That said, there are some metadata that could change with little or  
> no impact on data interpretation (e.g., the spelling of the street  
> on which Technician Tom gets his snailmail).  But at the current  
> time it's impossible to distinguish this kind of metadata from the
> important kind in the general case of the existing metadata  
> standards in use (e.g., FGDC BDP, ISO 19115, EML, GML, etc).
>
> Our process in the KNB/SEEK/NCEAS and other ecological data  
> archives is to give persistent identifiers to both data objects and  
> metadata objects, and provide new identifiers when either changes.
>
> Matt
>
>
> Dave Vieglais wrote:
>> Hi Bob,
>> Just because a standard is published does not mean that it is  
>> practical.  Requiring that a set of bytes referenced by an LSID  
>> are unchanged has a lot of implications with respect to the  
>> implementation of data services.  For example, if it is agreed to  
>> abide by the rule that the blob referenced by an LSID remains  
>> forever unchanged, then that implies that the data provider stores  
>> the data as a blob, rather than risking the process of  
>> reconstructing on the fly from some database, especially for the  
>> example of data expressed in XML where functionally identical  
>> objects (constructed using different DOM libraries for example)  
>> are not identical blobs.
>> Asserting that two instances of an object with the same LSID are
>> semantically equivalent is a vastly more complicated process
>> than asserting that the canonical representations of those
>> instances are identical.  Generally there can be defined a simple
>> set of guidelines for constructing the canonical form of an object
>> (e.g. for XML, http://www.w3.org/TR/xml-c14n ) whereas asserting
>> semantic equivalence is an ongoing topic of research.
>> Requiring identical blobs is certainly possible, but people need  
>> to be aware of the implications of such a requirement in the early  
>> stages of designing a system to support such a specification.  My  
>> preference for the canonical form relaxes the implementation  
>> requirements considerably whilst still maintaining the integrity  
>> of the data and the intent of the LSID.
>> regards,
>>   Dave V.
>> On Jul 14, 2007, at 08:08, Bob Morris wrote:
>>> This entire discussion confuses me. The LSID standard is published.
>>> Why is there a discussion of what an LSID should be? The standard
>>> requires the data, as defined by the return of getData, to be
>>> identical for all resolutions of the LSID. From page 9 of the LSID
>>> spec:
>>>
>>> " bytes getData (LSID lsid)
>>> bytes getDataByRange (LSID lsid, integer start, integer length)
>>> Metadata_response getMetadata (LSID lsid, string[] accepted_formats)
>>> Metadata_response getMetadataSubset (LSID lsid,
>>> string[] accepted_formats, string selector)
>>> The data retrieval services may implement all of the methods, or only
>>> methods for retrieving data, or only methods for retrieving associated
>>> metadata.
>>> The same LSID named data object must be resolved always to the same
>>> set of bytes. Therefore, all of the data retrieval services return the
>>> same results for the same LSID. The user has, however, the choice of
>>> which one of these to utilize depending on its location, known quality
>>> of service and other attributes. With metadata, the situation is
>>> different. Each data retrieval service can provide different metadata
>>> for the same LSID."
>>>
>>> This doesn't seem very ambiguous to me, and doesn't have anything to
>>> do with imperfect storage of data or anything else about the physical
>>> or electronic world. If two calls to getData() with the same argument
>>> on two occasions to possibly two different resolution services do not
>>> yield the same set of bytes, then one or the other or both of those is
>>> not executing a compliant service response. Unless this discussion is
>>> really "Shall we call something other than the return of getData by
>>> the term 'data associated with the LSID'?", there seems to be nothing
>>> to discuss.
>>>
>>> Bob
>>>
>>>
>>>
>>>
>>> On 7/13/07, Paul Kirk <p.kirk at cabi.org> wrote:
>>>>
>>>>
>>>>
>>>> In an imperfect world there is no such thing as an
>>>> 'identical-byte-stream' because the technology we use is imperfect ...
>>>> the disk controllers which manage our bytes and the disk we use to
>>>> store our bytes have recognized error rates. Perhaps I'm being a pedant
>>>> in the above analysis but I was almost persuaded that except for
>>>> digital objects (images, sounds) which can be data all other 'things'
>>>> (names, specimen accession numbers) had to be metadata. This to me
>>>> makes no sense in the real but imperfect world we live in. An LSID
>>>> assigned to a name (e.g. Homo sapiens) is assigned to the name as data,
>>>> not metadata. What is 'identical' here is that if the spelling has to
>>>> change for any reason the new spelling gets a new LSID and the now
>>>> incorrect spelling gets deprecated (but is still resolvable) with a
>>>> pointer to the correct spelling/LSID in the metadata.
>>>>
>>>> OK?
>>>>
>>>> Paul
>>>>
>>>>  ________________________________
>>>>  From: tdwg-guid-bounces at lists.tdwg.org on behalf of Chuck
>>>> Miller
>>>> Sent: Fri 13/07/2007 19:03
>>>> To: Dave Vieglais
>>>> Cc: tdwg-guid at lists.tdwg.org
>>>> Subject: RE: [tdwg-guid] LSID metadata persistence (or lack
>>>> thereof)[Scanned]
>>>>
>>>>
>>>>
>>>>
>>>> Dave,
>>>> What you say is true.  But, I think we already have too many
>>>> variations, subtleties, and reinterpretations which are endlessly
>>>> debated.
>>>>
>>>> The LSID standard would be simple, clear and consistent if we used the
>>>> identical-byte-stream definition.  The LSID would uniquely tag a
>>>> persistent byte stream. A persistent byte stream is always the same
>>>> thing without any further explanation or clarification.
>>>>
>>>> The provider of an LSID byte-stream would need to commit to keeping
>>>> that byte-stream persistent and not represent it in multiple ways, even
>>>> though technically they could.  If they can't commit to that, then it
>>>> can't be an LSID byte-stream.
>>>>
>>>> And in the name of simplicity and clarity, if they had to provide
>>>> different byte-stream representations then they would have to assign a
>>>> different LSID to each and use "SameAs" metadata.
>>>>
>>>> Chuck
>>>>
>>>> -----Original Message-----
>>>> From: Dave Vieglais [mailto:vieglais at ku.edu]
>>>> Sent: Friday, July 13, 2007 12:42 PM
>>>> To: Chuck Miller
>>>> Cc: Ricardo Pereira; tdwg-guid at lists.tdwg.org
>>>> Subject: Re: [tdwg-guid] LSID metadata persistence (or lack  
>>>> thereof)
>>>>
>>>> Hi Ricardo, Chuck,
>>>> Asserting that the byte stream returned as data associated with an
>>>> LSID should never change is perhaps a bit confusing from a
>>>> programmatic view.  There are for example many ways to represent data
>>>> in xml that are identical from an information content point of view,
>>>> but the byte streams could be very different.
>>>>
>>>> Perhaps it might be better to state something like "the canonical
>>>> representation of the data associated with an LSID must not change",
>>>> or something to that effect?
>>>>
>>>> Dave V.
>>>>
>>>> On Jul 14, 2007, at 05:29, Chuck Miller wrote:
>>>>
>>>> > Ricardo,
>>>> >
>>>> > Looking at this definition: "Persistence of LSID Data: The data
>>>> > associated with an LSID (i.e., the byte stream returned by the LSID
>>>> > getData call) must never change"
>>>> >
>>>> >
>>>> >
>>>> > Perhaps this is a more straightforward way to conceive LSIDs.  The
>>>> > LSID goes with a byte stream.  It's that byte stream that must stay
>>>> > the same.  So, if there is a byte stream associated with a
>>>> > collection that needs to stay the same, then whatever that byte
>>>> > stream happens to be is the data that gets an LSID assigned to it.
>>>> > That sure seems a clearer definition of what is data and what is
>>>> > metadata, rather than the issue of primary object and all that.
>>>> >
>>>> >
>>>> >
>>>> > So we can create a new definition in the context of LSIDs: Data is
>>>> > a byte stream that is persistent, never changes and can have an
>>>> > LSID.  Metadata is a byte stream that is non-persistent, might change
>>>> > and is only associated with an LSID.
>>>> >
>>>> >
>>>> >
>>>> > The institution that assigns an LSID can make their own decision
>>>> > about whether the byte stream being provided is persistent or
>>>> > non-persistent.  By assigning an LSID to any byte stream, whatever it
>>>> > is, the institution is declaring it to be data and persistent.
>>>> >
>>>> >
>>>> >
>>>> > So, in the example given of an observation record with a
>>>> > determination that needs to remain fixed and unchanged, by
>>>> > assigning an LSID to that observation+determination it would be
>>>> > "declared to be data" and unchangeable.  A different determination
>>>> > would then be different data with a different LSID.  That would
>>>> > provide a solution for those who want to employ it.  Others could
>>>> > choose not to use it.
>>>> >
>>>> >
>>>> >
>>>> > Chuck
>>>> >
>>>> >
>>>> >
>>>> > From: tdwg-guid-bounces at lists.tdwg.org
>>>> > [mailto:tdwg-guid-bounces at lists.tdwg.org] On Behalf Of Ricardo Pereira
>>>> > Sent: Friday, July 13, 2007 9:47 AM
>>>> > To: tdwg-guid at lists.tdwg.org
>>>> > Subject: [tdwg-guid] LSID metadata persistence (or lack thereof)
>>>> >
>>>> >
>>>> >
>>>> >     Hi there folks,
>>>> >
>>>> >     As Chuck mentioned a few weeks ago, we do have a few
>>>> > outstanding issues to address regarding LSIDs. I would like to
>>>> > discuss those one by one, in an orderly manner, and reach consensus
>>>> > as much as we can. Then we can sum them up in a TDWG standard,
>>>> > possibly by or shortly after the Bratislava conference.
>>>> >
>>>> >     The first issue I would like to discuss is LSID metadata
>>>> > persistence. First, let me remind you of a corollary established by
>>>> > the LSID specification:
>>>> >
>>>> >             Corollary 1: LSIDs are not guaranteed to be resolvable
>>>> > indefinitely.
>>>> >
>>>> >     In other words, there is no guarantee that one will always be
>>>> > able to retrieve the data associated with an LSID as the authority
>>>> > may choose (or be forced) not to resolve an LSID anymore.
>>>> >
>>>> >     Second, let me distinguish this kind of persistence I'm talking
>>>> > about from two other related concepts (which we'll not discuss in
>>>> > this thread):
>>>> >
>>>> >         1) Persistence of Assignment: Once assigned to an object,
>>>> > an LSID is indefinitely associated with it. The same LSID cannot be
>>>> > assigned to another object. Ever! The LSID may not be resolvable
>>>> > anymore, but it cannot be assigned to another object. This is
>>>> > established by the LSID specification.
>>>> >
>>>> >         2) Persistence of LSID Data: The data associated with an
>>>> > LSID (i.e., the byte stream returned by the LSID getData call) must
>>>> > never change. Although the LSID may not be resolvable anymore
>>>> > (according to corollary 1), the data associated with an LSID must
>>>> > never ever change. That's defined by the LSID spec, too.
>>>> >
>>>> >     What I want to discuss here is the persistence of LSID metadata
>>>> > (what is returned by the getMetadata call) or the lack thereof.
>>>> >
>>>> >     A use case associated with metadata persistence is when someone
>>>> > collects observation records (and implicitly, their determinations)
>>>> > and runs an experiment (a model or simulation) with them. This person
>>>> > may want to record the identifiers of the points used so that
>>>> > someone using the results of that experiment may refer back to the
>>>> > primary data, to validate or repeat the experiment.
>>>> >
>>>> >     The bad news is that the LSID identification scheme (or any other
>>>> > GUID that I know of) was not designed to guarantee metadata
>>>> > persistence, and thus it cannot implement the use case above by
>>>> > itself. To implement that use case, the specification would have to
>>>> > guarantee that the metadata (which we are using here as data) is
>>>> > immutable. But it doesn't.
>>>> >
>>>> >     Most of us wish that metadata was persistent, but it isn't.
>>>> > Many things can change in the metadata: a new determination, a
>>>> > mispeling that is corrected, many things. We just cannot guarantee
>>>> > that the metadata will look like it did some time ago.
>>>> >
>>>> >     We then reach the following conclusion.
>>>> >
>>>> >             Corollary 2: LSID metadata is neither immutable nor
>>>> > persistent.
>>>> >
>>>> >     The consequence of this corollary is that, if you need to refer
>>>> > back to a piece of information (metadata) associated with an LSID,
>>>> > exactly as it was when you got it, you must make a copy of it, or
>>>> > arrange that someone else make that copy for you.
>>>> >
>>>> >     In other words, a client cannot assume that the metadata
>>>> > associated with an LSID today will be the same tomorrow. If the
>>>> > client does assume that, it may be relying on a false assumption
>>>> > and its output may be flawed.
>>>> >
>>>> >     If we are not happy with that conclusion, we may develop an
>>>> > additional component in our architecture, an archive of some sort,
>>>> > to handle (meta)data persistence. That is exactly what the STD-DOI
>>>> > project (http://www.std-doi.de/) and SEEK
>>>> > (http://seek.ecoinformatics.org) have done to some extent.
>>>> >
>>>> >     While we cannot guarantee that LSID metadata is persistent or
>>>> > immutable, we can definitely document how the metadata have changed
>>>> > through metadata versioning. That's the topic of the next thread.
>>>> > We will move on to discuss metadata versioning as soon as we are
>>>> > done with metadata persistence.
>>>> >
>>>> >     Cheers,
>>>> >
>>>> > Ricardo
>>>> >
>>>> > _______________________________________________
>>>> > tdwg-guid mailing list
>>>> > tdwg-guid at lists.tdwg.org
>>>> > http://lists.tdwg.org/mailman/listinfo/tdwg-guid
>>>>
>>>>
>>>>
>>>
>>>
>>> --Robert A. Morris
>>> Professor of Computer Science
>>> UMASS-Boston
>>> ram at cs.umb.edu
>>> http://bdei.cs.umb.edu/
>>> http://www.cs.umb.edu/~ram
>>> http://www.cs.umb.edu/~ram/calendar.html
>>> phone (+1)617 287 6466



