Re: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]

13 Jul 2007

There seem to be two ways to resolve this problem.

One is to change the LSID definitions to allow semantic equivalence  
in the data and not require exact bit stream equivalence.

The other option is to change the data representation so that it is
"easily" reduced to a repeatable canonical form. For example, it is
almost as easy as saying that where XML does not specify the order of
elements, they will be ordered alphabetically. It seems stupid, but
it almost works, except where you have repeating elements with the
same element name, where it does not work.
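
A minimal sketch of the alphabetical-ordering idea, using Python's standard xml.etree.ElementTree. The helper names (sort_children, normalize) and the sample records are made up for illustration and are not part of any LSID or TDWG library:

```python
import xml.etree.ElementTree as ET

def sort_children(elem):
    """Recursively order child elements alphabetically by tag name."""
    for child in elem:
        sort_children(child)
    # sorted() is stable: children sharing a tag keep their original order.
    elem[:] = sorted(elem, key=lambda e: e.tag)

def normalize(xml_text):
    """Parse, sort children, and serialize back to a string."""
    root = ET.fromstring(xml_text)
    sort_children(root)
    return ET.tostring(root, encoding="unicode")

# Differently named elements in a different order: sorting reconciles them.
a = "<rec><name>Homo sapiens</name><author>L.</author></rec>"
b = "<rec><author>L.</author><name>Homo sapiens</name></rec>"
same_after_sort = normalize(a) == normalize(b)

# Repeated elements with the same name: the tag alone gives no ordering,
# so the two serializations still differ after sorting.
c = "<rec><name>x</name><name>y</name></rec>"
d = "<rec><name>y</name><name>x</name></rec>"
repeats_still_differ = normalize(c) != normalize(d)
```

Sorting by tag alone leaves the repeated-element case unresolved; a fuller scheme would need a secondary key (for instance the element content), and whether that is acceptable depends on whether order carries meaning in the standard in question.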

It seems a little odd to bend the standards for the data being  
delivered to fit the requirement of the LSID spec. In theory, the  
other standard developers who set the data being delivered did not  
fix order because it did not matter.

This is different from Chuck's observation that the semantics of the
elements within some of the standards are insufficiently specified.
So, what we mean is that a Darwin Core species name is just a string
and nothing more for now.

--Bryan

On Jul 13, 2007, at 5:18 PM, Dave Vieglais wrote:
...
I think we are all in agreement that the data and metadata  
referenced by an LSID remains unchanged (in the case of the  
metadata, semantic equivalence is a requirement for reasons such as  
outlined by Matt).  My question is to do purely with the data that  
an LSID references through the getData() operation.  The form of  
that data could be anything really - an encrypted byte stream,  
digital image, Open Office document, spreadsheet, xml document...
We all know that the same data can be represented many ways that  
are logically, semantically and functionally equivalent yet form a  
different set of bytes when serialized.  Data expressed in XML is  
one example (is <a/> = <a /> = <a></a> ?).  A palette-based image is
another - the order of colors in the palette may be changed, and  
the pixel values adjusted to match the new palette order, but the  
image is still the same. There are many more simple examples that  
can be constructed that violate the unchanged bytes rule but for  
all practical and functional purposes the data are unchanged.
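
The XML case can be illustrated with a short sketch. It assumes Python 3.8 or later, whose standard library provides xml.etree.ElementTree.canonicalize (an implementation of Canonical XML 2.0); the variable names are only for illustration:

```python
from xml.etree.ElementTree import canonicalize

# Three serializations of the same empty element.
forms = ["<a/>", "<a />", "<a></a>"]

# As raw byte streams they are three different things...
raw_bytes_differ = len({f.encode("utf-8") for f in forms}) == 3

# ...but canonicalization maps all of them to one representation.
canonical = {canonicalize(f) for f in forms}
```

Under the unchanged-bytes rule these are three different data objects; under a canonical-form rule they are one.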
As mentioned previously, enforcing and implementing the unchanged  
bytes rule is not challenging. It is however quite different from  
stating that the data are returned unchanged.  It is this that I,
and I'm sure a lot of other implementors, would appreciate consensus
on.
Dave V.
On Jul 14, 2007, at 09:20, Matthew Jones wrote:
...
In terms of the metadata returned from an LSID, or any other  
digital identifier, there are definite cases where metadata must  
be semantically persistent in order to preserve the utility of  
data and accuracy of scientific results.
As a trivial example, given a set of observations collected at  
time t, one can represent the data for those observations in  
dataset D and the metadata for the dataset, including the time  
value t, in a metadata document M.  In a later event, it is  
discovered that t was entered incorrectly, and needs to be  
adjusted, creating metadata document M'. That M and M' are not  
congruent is critical knowledge when analyzing data from D with  
data from another dataset D2.  In other words, because there is no  
true distinction between data and metadata (any given piece of  
information can be stored in either location), a proper archive  
must be able to distinguish any changes in the data and any  
changes in the metadata.
That said, there are some metadata that could change with little  
or no impact on data interpretation (e.g., the spelling of the  
street on which Technician Tom gets his snailmail).  But at the  
current time it's impossible to distinguish this kind of metadata
from the important kind in the general case of the existing  
metadata standards in use (e.g., FGDC BDP, ISO 19115, EML, GML, etc).
Our process in the KNB/SEEK/NCEAS and other ecological data  
archives is to give persistent identifiers to both data objects  
and metadata objects, and provide new identifiers when either  
changes.
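
The policy described above (a new identifier whenever an object changes, with old identifiers still resolving to the old content) might be sketched as follows. The Archive class, the "scope.revision" identifier pattern, and the sample values are hypothetical and are not the actual KNB/SEEK implementation:

```python
class Archive:
    """Toy archive: content stored under an identifier is never overwritten."""

    def __init__(self):
        self._objects = {}    # identifier -> bytes
        self._revisions = {}  # scope -> latest revision number

    def put(self, scope: str, content: bytes) -> str:
        """Store content; mint a new identifier only if the content changed."""
        rev = self._revisions.get(scope, 0)
        if rev and self._objects[f"{scope}.{rev}"] == content:
            return f"{scope}.{rev}"
        rev += 1
        self._revisions[scope] = rev
        identifier = f"{scope}.{rev}"
        self._objects[identifier] = content
        return identifier

    def get(self, identifier: str) -> bytes:
        return self._objects[identifier]

archive = Archive()
m1 = archive.put("metadata", b"time=t")        # metadata document M
m1_again = archive.put("metadata", b"time=t")  # unchanged: same identifier
m2 = archive.put("metadata", b"time=t'")       # corrected: M' gets a new one
```

Because M and M' carry different identifiers, an analysis that recorded the first identifier can always tell which version of the metadata it used.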
Matt
Dave Vieglais wrote:
...
Hi Bob,
Just because a standard is published does not mean that it is  
practical.  Requiring that the set of bytes referenced by an LSID
is unchanged has a lot of implications with respect to the
implementation of data services.  For example, if it is agreed to  
abide by the rule that the blob referenced by an LSID remains  
forever unchanged, then that implies that the data provider  
stores the data as a blob, rather than risking the process of  
reconstructing on the fly from some database, especially for the  
example of data expressed in XML where functionally identical  
objects (constructed using different DOM libraries for example)  
are not identical blobs.
Asserting that two instances of an object with the same LSID are
semantically equivalent is a vastly more complicated process
than asserting that the canonical representations of those
instances are identical.  Generally there can be defined a simple
set of guidelines for constructing the canonical form of an
object (e.g. for XML, http://www.w3.org/TR/xml-c14n ) whereas
asserting semantic equivalence is an ongoing topic of research.
Requiring identical blobs is certainly possible, but people need  
to be aware of the implications of such a requirement in the  
early stages of designing a system to support such a  
specification.  My preference for the canonical form relaxes the  
implementation requirements considerably whilst still maintaining  
the integrity of the data and the intent of the LSID.
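
One way to picture the canonical-form preference is to fingerprint the canonical form rather than the raw bytes. This is only a sketch: it assumes Python 3.8+ (xml.etree.ElementTree.canonicalize), the function names are hypothetical, and the sample elements and attribute names are invented:

```python
import hashlib
from xml.etree.ElementTree import canonicalize

def raw_digest(xml_text: str) -> str:
    """Fingerprint of the bytes exactly as serialized."""
    return hashlib.sha256(xml_text.encode("utf-8")).hexdigest()

def canonical_digest(xml_text: str) -> str:
    """Fingerprint of the canonical (C14N) form of the document."""
    return hashlib.sha256(canonicalize(xml_text).encode("utf-8")).hexdigest()

# The same record as two different XML libraries might serialize it:
# attribute order and quote style differ, the information does not.
from_library_a = '<name rank="species" code="ICZN">Homo sapiens</name>'
from_library_b = "<name code='ICZN' rank='species'>Homo sapiens</name>"

# A genuinely different record.
different = '<name rank="species" code="ICZN">Homo erectus</name>'
```

With this approach a provider could rebuild the XML from a database on each request and still verify that the canonical fingerprint has not changed, which is not possible when the raw bytes themselves must match.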
regards,
  Dave V.
On Jul 14, 2007, at 08:08, Bob Morris wrote:
...
This entire discussion confuses me. The LSID standard is published.
Why is there a discussion of what an LSID should be? The standard
requires the data, as defined by the return of getData, to be
identical for all resolutions of the LSID. From page 9 of the LSID
spec:
" bytes getData (LSID lsid)
bytes getDataByRange (LSID lsid, integer start, integer length)
Metadata_response getMetadata (LSID lsid, string[] accepted_formats)
Metadata_response getMetadataSubset (LSID lsid,
string[] accepted_formats, string selector)
The data retrieval services may implement all of the methods, or only
methods for retrieving data, or only methods for retrieving associated
metadata.
The same LSID named data object must be resolved always to the same
set of bytes. Therefore, all of the data retrieval services return the
same results for the same LSID. The user has, however, the choice of
which one of these to utilize depending on its location, known quality
of service and other attributes. With metadata, the situation is
different. Each data retrieval service can provide different metadata
for the same LSID."
This doesn't seem very ambiguous to me, and doesn't have anything to
do with imperfect storage of data or anything else about the physical
or electronic world. If two calls to getData() with the same argument
on two occasions to possibly two different resolution services do not
yield the same set of bytes, then one or the other or both of those is
not executing a compliant service response. Unless this discussion is
really "Shall we call something other than the return of getData by
the term 'data associated with the LSID'?" there seems to be nothing
to discuss.
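
The compliance condition described here (every resolution of getData for one LSID returns the same bytes) could be checked with something like the sketch below. The services are stand-in callables rather than real LSID resolvers, and the function name and the example LSID are hypothetical:

```python
import hashlib
from typing import Callable, Iterable

def resolves_identically(lsid: str,
                         services: Iterable[Callable[[str], bytes]]) -> bool:
    """True if every service returns exactly the same bytes for the LSID."""
    digests = {hashlib.sha256(get_data(lsid)).hexdigest()
               for get_data in services}
    return len(digests) <= 1

lsid = "urn:lsid:example.org:names:1"  # hypothetical identifier

# Stand-ins for the getData operation of three resolution services.
service_a = lambda _lsid: b"<a/>"
service_b = lambda _lsid: b"<a/>"
service_c = lambda _lsid: b"<a></a>"  # equivalent XML, different bytes

compliant = resolves_identically(lsid, [service_a, service_b])
not_compliant = resolves_identically(lsid, [service_a, service_c])
```

Under a strict reading of the quoted passage, the third service is non-compliant even though its response is equivalent XML.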
Bob
On 7/13/07, Paul Kirk <p.kirk@cabi.org> wrote:
...
In an imperfect world there is no such thing as an
'identical-byte-stream' because the technology we use is imperfect ...
the disk controllers which manage our bytes and the disk we use to
store our bytes have recognized error rates. Perhaps I'm being a pedant
in the above analysis but I was almost persuaded that except for
digital objects (images, sounds) which can be data all other 'things'
(names, specimen accession numbers) had to be metadata. This to me
makes no sense in the real but imperfect world we live in. An LSID
assigned to a name (e.g. Homo sapiens) is assigned to the name as
data, not metadata. What is 'identical' here is that if the spelling
has to change for any reason the new spelling gets a new LSID and the
now incorrect spelling gets deprecated (but is still resolvable) with
a pointer to the correct spelling/LSID in the metadata.
OK?
Paul
________________________________
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Chuck Miller
Sent: Fri 13/07/2007 19:03
To: Dave Vieglais
Cc: tdwg-guid@lists.tdwg.org
Subject: RE: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
Dave,
What you say is true.  But, I think we already have too many
variations, subtleties, and reinterpretations which are endlessly
debated.
The LSID standard would be simple, clear and consistent if we used the
identical-byte-stream definition.  The LSID would uniquely tag a
persistent byte stream. A persistent byte stream is always the same
thing without any further explanation or clarification.
The provider of an LSID byte-stream would need to commit to keeping
that byte-stream persistent and not represent it in multiple ways,
even though technically they could.  If they can't commit to that,
then it can't be an LSID byte-stream.
And in the name of simplicity and clarity, if they had to provide
different byte-stream representations then they would have to assign a
different LSID to each and use "SameAs" metadata.
Chuck
-----Original Message-----
From: Dave Vieglais [mailto:vieglais@ku.edu]
Sent: Friday, July 13, 2007 12:42 PM
To: Chuck Miller
Cc: Ricardo Pereira; tdwg-guid@lists.tdwg.org
Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
Hi Ricardo, Chuck,
Asserting that the byte stream returned as data associated with an
LSID should never change is perhaps a bit confusing from a
programmatic view.  There are for example many ways to represent data
in xml that are identical from an information content point of view,
but the byte streams could be very different.
Perhaps it might be better to state something like "the canonical
representation of the data associated with an LSID must not change",
or something to that effect?
Dave V.
On Jul 14, 2007, at 05:29, Chuck Miller wrote:
...
Ricardo,
Looking at this definition: "Persistence of LSID Data: The data
associated with an LSID (i.e., the byte stream returned by the LSID
getData call) must never change"
Perhaps this is a more straightforward way to conceive LSIDs.  The
LSID goes with a byte stream.  It's that byte stream that must stay
the same.  So, if there is a byte stream associated with a
collection that needs to stay the same, then whatever that byte
stream happens to be is the data that gets an LSID assigned to it.
That sure seems a clearer definition of what is data and what is
metadata, rather than the issue of primary object and all that.
So we can create a new definition in the context of LSIDs: Data is
a byte stream that is persistent, never changes and can have an
LSID.  Metadata is a byte stream that is non-persistent, might change
and is only associated with an LSID.
The institution that assigns an LSID can make its own decision
about whether the byte stream being provided is persistent or
non-persistent.  By assigning an LSID to any byte stream, whatever it
is, the institution is declaring it to be data and persistent.
So, in the example given of an observation record with a
determination that needs to remain fixed and unchanged, by
assigning an LSID to that observation+determination it would be
"declared to be data" and unchangeable.  A different determination
would then be different data with a different LSID.  That would
provide a solution for those who want to employ it.  Others could
choose not to use it.
Chuck
From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of Ricardo Pereira
Sent: Friday, July 13, 2007 9:47 AM
To: tdwg-guid@lists.tdwg.org
Subject: [tdwg-guid] LSID metadata persistence (or lack thereof)
Hi there folks,
As Chuck mentioned a few weeks ago, we do have a few outstanding
issues to address regarding LSIDs. I would like to discuss those one
by one, in an orderly manner, and reach consensus as much as we can.
Then we can sum them up in a TDWG standard, possibly by or shortly
after the Bratislava conference.
The first issue I would like to discuss is LSID metadata persistence.
First, let me remind you of a corollary established by the LSID
specification:
Corollary 1: LSIDs are not guaranteed to be resolvable indefinitely.
In other words, there is no guarantee that one will always be able to
retrieve the data associated with an LSID as the authority may choose
(or be forced) not to resolve an LSID anymore.
Second, let me distinguish the kind of persistence I'm talking about
from two other related concepts (which we'll not discuss in this
thread):
1) Persistence of Assignment: Once assigned to an object, an LSID is
indefinitely associated with it. The same LSID cannot be assigned to
another object. Ever! The LSID may not be resolvable anymore, but it
cannot be assigned to another object. This is established by the LSID
specification.
2) Persistence of LSID Data: The data associated with an LSID (i.e.,
the byte stream returned by the LSID getData call) must never change.
Although the LSID may not be resolvable anymore (according to
corollary 1), the data associated with an LSID must never ever change.
That's defined by the LSID spec, too.
What I want to discuss here is the persistence of LSID metadata (what
is returned by the getMetadata call) or the lack thereof.
A use case associated with metadata persistence is when someone
collects observation records (and implicitly, their determinations)
and runs an experiment (a model or simulation) with them. This person
may want to record the identifiers of the points used so that someone
using the results of that experiment may refer back to the primary
data, to validate or repeat the experiment.
The bad news is that the LSID identification scheme (or any other
GUID that I know of) was not designed to guarantee metadata
persistence, and thus it cannot implement the use case above by
itself. To implement that use case, the specification would have to
guarantee that the metadata (which we are using here as data) is
immutable. But it doesn't.
Most of us wish that metadata was persistent, but it isn't. Many
things can change in the metadata: a new determination, a misspelling
that is corrected, many things. We just cannot guarantee that the
metadata will look like it did some time ago.
We then reach the following conclusion.
Corollary 2: LSID metadata is neither immutable nor persistent.
The consequence of this corollary is that, if you need to refer back
to a piece of information (metadata) associated with an LSID, exactly
as it was when you got it, you must make a copy of it, or arrange
that someone else make that copy for you.
In other words, a client cannot assume that the metadata associated
with an LSID today will be the same tomorrow. If the client does
assume that, it may be relying on a false assumption and its output
may be flawed.
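
A client-side copy of the kind suggested above might look like the following sketch. The function names, the record layout, the example LSID, and the sample metadata are all assumptions made for illustration; nothing here is prescribed by the LSID specification:

```python
import hashlib
from datetime import datetime, timezone

def snapshot(lsid: str, metadata: bytes) -> dict:
    """Keep a private copy of the metadata exactly as it was retrieved."""
    return {
        "lsid": lsid,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(metadata).hexdigest(),
        "copy": metadata,
    }

def changed_since(snap: dict, current: bytes) -> bool:
    """Compare freshly retrieved metadata against the stored snapshot."""
    return hashlib.sha256(current).hexdigest() != snap["sha256"]

lsid = "urn:lsid:example.org:observations:42"  # hypothetical identifier
then = b"<determination>Puma concolor</determination>"
later = b"<determination>Felis concolor</determination>"

snap = snapshot(lsid, then)
```

An experiment that records such snapshots alongside the LSIDs it used can later be validated against the metadata as it was, whatever the authority returns in the meantime.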
If we are not happy with that conclusion, we may develop an
additional component in our architecture, an archive of some sort, to
handle (meta)data persistence. That is exactly what the STD-DOI
project (http://www.std-doi.de/) and SEEK
(http://seek.ecoinformatics.org) have done to some extent.
While we cannot guarantee that LSID metadata is persistent or
immutable, we can definitely document how the metadata have changed
through metadata versioning. That's the topic of the next thread. We
will move on to discuss metadata versioning as soon as we are done
with metadata persistence.
Cheers,
Ricardo
_______________________________________________
tdwg-guid mailing list
tdwg-guid@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-guid
--Robert A. Morris
Professor of Computer Science
UMASS-Boston
ram@cs.umb.edu
http://bdei.cs.umb.edu/
http://www.cs.umb.edu/~ram
http://www.cs.umb.edu/~ram/calendar.html
phone (+1)617 287 6466


P. Bryan Heidorn