RE: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]

13 Jul 2007

It seems to me that there is a third method of resolving the problem:

When we want to identify an object that is itself digital in nature (e.g., a
database record, or a binary data file such as a PDF, JPG, ASCII, Unicode,
or whatever), we resolve said binary object via getData().  If, for some
reason, we change the exact bit-sequence of that digital/binary object
(e.g., color-correct an image, change a text string from ASCII to Unicode, or
whatever...), we assign a new LSID to it (whether that "new" LSID differs
from the "old" LSID only via the optional "Revision" part of the LSID, or
via a new Object Identification part, is a topic for another debate).

When we want to identify an object that does not itself have a digital
manifestation -- like a physical object (e.g., specimen or a particular
printed copy of a publication) or an abstract/conceptual object (e.g., a
taxon name, a taxon concept, a geographic place, or a cited publication) --
then we return *nothing* in response to getData(), and we treat all the
attributes of said physical/abstract/conceptual object of interest to us as
metadata.
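
To illustrate the distinction above, here is a minimal sketch. It assumes a
hypothetical in-memory record type; the class name LsidRecord, its fields,
and the example.org LSIDs are illustrative assumptions and not part of the
LSID specification. Digital objects return fixed bytes from get_data(),
while abstract or physical objects return nothing and carry everything of
interest as metadata.

```python
from dataclasses import dataclass, field


@dataclass
class LsidRecord:
    """Hypothetical record held by an LSID authority (illustrative only)."""
    lsid: str
    data: bytes = b""  # fixed once assigned; empty for "data-less" LSIDs
    metadata: dict = field(default_factory=dict)  # may be revised over time

    def get_data(self) -> bytes:
        return self.data

    def get_metadata(self) -> dict:
        return dict(self.metadata)


# A digital object: the bytes themselves are what the LSID identifies.
image = LsidRecord(
    lsid="urn:lsid:example.org:images:1001",
    data=b"example image bytes",
    metadata={"format": "image/jpeg"},
)

# An abstract object: nothing to return as data, everything is metadata.
name = LsidRecord(
    lsid="urn:lsid:example.org:names:42",
    metadata={"nameComplete": "Homo sapiens"},
)

# Revising metadata leaves the (empty) data of the abstract object untouched.
name.metadata["nameComplete"] = "Homo sapiens Linnaeus, 1758"
```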

If there are cases where certain metadata elements of an object without an
inherent digital existence need to persist (and there are), yet we also
want to allow modifications to metadata elements without the need to
generate new identifiers for the underlying object (and we do) -- then we
deal with those within our own community via adopted standards and best
practices.

I would disagree strongly with bending the existing LSID standard, and would
just as strongly favor working within its existing framework (which, I
think, we can).  I would also disagree with the practice of embedding XML
documents as "data" for an LSID, unless the LSID is intended to represent
the XML document itself (in which case there might be a different LSID to
represent the database record that was used to generate the XML document;
and yet another LSID to represent the abstract concept that the database
record was created to represent -- like a taxon name, for example).

If we want to use LSIDs to pass around XML packages (that are not rendered
as RDF) about abstract objects (e.g., taxon names), why doesn't our
community define within our semantic vocabulary something along the lines of
"TCS_XML", which can be established as a standard metadata component for
LSIDs assigned to taxon concepts (i.e., abstract objects, identified by
"data-less" LSIDs)?  The exact bytestream of the content of that metadata
element can change, without changing its canonical rendering.

I'm beginning to suspect (strongly) that I am completely missing some
fundamental point here -- and perhaps it is the same point that underlies
the apparent antagonism towards LSIDs in general (which I do not yet share).
But I am fairly certain we are dealing with some level of miscommunication
here.

Aloha,
Rich
...
-----Original Message-----
From: tdwg-guid-bounces@lists.tdwg.org 
[mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of P. 
Bryan Heidorn
Sent: Friday, July 13, 2007 12:48 PM
To: Dave Vieglais
Cc: tdwg-guid@lists.tdwg.org
Subject: Re: [tdwg-guid] LSID metadata persistence (or lack 
thereof)[Scanned]
There seem to be two methods of resolving this problem.
One is to change the LSID definitions to allow semantic 
equivalence in the data and not require exact bit stream equivalence.
The other option is to change the data representation so that 
it is "easily" reduced to a repeatable canonical form. For 
example, it is almost as easy as saying where XML ordering 
does not specify order of elements, elements will be ordered 
alphabetically. Seems stupid but it almost works... except 
where you have repeating elements with the same element name, 
where it does not work.
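
The alphabetical-ordering idea and its failure case can be sketched as
follows. This is an illustration using Python's standard xml.etree module;
the helper names sort_children and normalize are hypothetical and do not
come from any standard.

```python
import xml.etree.ElementTree as ET


def sort_children(elem: ET.Element) -> None:
    """Recursively order child elements alphabetically by tag name."""
    for child in elem:
        sort_children(child)
    # sorted() is stable: siblings sharing a tag keep their original order.
    elem[:] = sorted(elem, key=lambda child: child.tag)


def normalize(xml_text: str) -> bytes:
    """Parse, reorder children by tag name, and serialize again."""
    root = ET.fromstring(xml_text)
    sort_children(root)
    return ET.tostring(root)


# Distinct element names: a difference in order is normalized away.
same = normalize("<r><b/><a/></r>") == normalize("<r><a/><b/></r>")

# Repeating element name: sorting by name cannot decide the order,
# so two logically equivalent documents still serialize differently.
differ = normalize("<r><n>x</n><n>y</n></r>") != normalize("<r><n>y</n><n>x</n></r>")
```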
It seems a little odd to bend the standards for the data 
being delivered to fit the requirement of the LSID spec. In 
theory, the other standard developers who set the data being 
delivered did not fix order because it did not matter.
This is different from Chuck's observation that the semantics 
of the element within some of the standards are 
insufficiently specified.  
So, what we mean is a darwin mode species name is just a 
string and nothing more now.
--Bryan
On Jul 13, 2007, at 5:18 PM, Dave Vieglais wrote:
...
I think we are all in agreement that the data and metadata referenced 
by an LSID remain unchanged (in the case of the metadata, semantic 
equivalence is a requirement for reasons such as outlined by Matt).  
My question is to do purely with the data that an LSID references 
through the getData() operation.  The form of that data could be 
anything really - an encrypted byte stream, digital image, Open Office 
document, spreadsheet, xml document...
We all know that the same data can be represented many ways that are 
logically, semantically and functionally equivalent yet form a 
different set of bytes when serialized.  Data expressed in XML is one 
example (is <a/> = <a /> = <a></a> ?).  A palette-based image is 
another - the order of colors in the palette may be changed, and the 
pixel values adjusted to match the new palette order, but the image is 
still the same. There are many more simple examples that can be 
constructed that violate the unchanged bytes rule but for all 
practical and functional purposes the data are unchanged.
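
The palette example can be made concrete with a small sketch. The palette
values and pixel indices below are invented for illustration and no real
image format is involved: the stored index bytes differ, yet the decoded
colors are identical.

```python
def decode(palette, pixels):
    """Expand palette indices into the RGB values they stand for."""
    return [palette[i] for i in pixels]


# Original image: palette order is red, green, blue.
palette_a = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
pixels_a = [0, 1, 2, 2, 1, 0]

# Same image with the palette reordered and pixel indices remapped to match.
palette_b = [(0, 0, 255), (255, 0, 0), (0, 255, 0)]
remap = {old: palette_b.index(colour) for old, colour in enumerate(palette_a)}
pixels_b = [remap[i] for i in pixels_a]

# What would be written to disk differs byte for byte...
stored_a = bytes(pixels_a)
stored_b = bytes(pixels_b)

# ...while the decoded images are the same.
same_image = decode(palette_a, pixels_a) == decode(palette_b, pixels_b)
```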
As mentioned previously, enforcing and implementing the unchanged 
bytes rule is not challenging. It is however quite different from 
stating that the data are returned unchanged.  It is this that I, and 
I'm sure a lot of other implementors, would appreciate consensus on.
Dave V.
On Jul 14, 2007, at 09:20, Matthew Jones wrote:
...
In terms of the metadata returned from an LSID, or any other digital 
identifier, there are definite cases where metadata must be 
semantically persistent in order to preserve the utility of data and 
accuracy of scientific results.
As a trivial example, given a set of observations collected at time 
t, one can represent the data for those observations in dataset D and 
the metadata for the dataset, including the time value t, in a 
metadata document M.  In a later event, it is discovered that t was 
entered incorrectly, and needs to be adjusted, creating metadata 
document M'. That M and M' are not congruent is critical knowledge 
when analyzing data from D with data from another dataset D2.  In 
other words, because there is no true distinction between data and 
metadata (any given piece of information can be stored in either 
location), a proper archive must be able to distinguish any changes 
in the data and any changes in the metadata.
That said, there are some metadata that could change with little or 
no impact on data interpretation (e.g., the spelling of the street on 
which Technician Tom gets his snailmail).  But at the current time 
it's impossible to distinguish this kind of metadata from the 
important kind in the general case of the existing metadata standards 
in use (e.g., FGDC BDP, ISO 19115, EML, GML, etc).
Our process in the KNB/SEEK/NCEAS and other ecological data archives 
is to give persistent identifiers to both data objects and metadata 
objects, and provide new identifiers when either changes.
Matt
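
The practice Matt describes, issuing a new identifier whenever a data or
metadata object changes, might look roughly like the sketch below. The
Archive class, its method names, and the "base.revision" identifier scheme
are assumptions made for illustration; they do not describe how KNB, SEEK,
or NCEAS actually implement it.

```python
class Archive:
    """Hypothetical archive that never reuses an identifier for changed content."""

    def __init__(self):
        self._objects = {}  # identifier -> stored bytes
        self._latest = {}   # base name -> latest revision number

    def store(self, base: str, content: bytes) -> str:
        """Store content under base, minting a new revision only if it changed."""
        rev = self._latest.get(base, 0)
        if rev and self._objects[f"{base}.{rev}"] == content:
            return f"{base}.{rev}"  # unchanged: keep the existing identifier
        rev += 1
        self._latest[base] = rev
        self._objects[f"{base}.{rev}"] = content
        return f"{base}.{rev}"

    def get(self, identifier: str) -> bytes:
        """Every identifier keeps resolving to exactly what was stored."""
        return self._objects[identifier]


archive = Archive()
m = archive.store("dataset7-metadata", b"collected at t")
m_again = archive.store("dataset7-metadata", b"collected at t")
m_prime = archive.store("dataset7-metadata", b"collected at t (corrected)")
```

Here m and m_again are the same identifier, while the corrected document
M' receives its own, so an analysis that recorded m can still retrieve the
metadata exactly as it was.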
Dave Vieglais wrote:
...
Hi Bob,
Just because a standard is published does not mean that it is 
practical.  Requiring that a set of bytes referenced by an LSID are 
unchanged has a lot of implications with respect to the 
implementation of data services.  For example, if it is agreed to 
abide by the rule that the blob referenced by an LSID remains 
forever unchanged, then that implies that the data provider stores 
the data as a blob, rather than risking the process of 
reconstructing on the fly from some database, especially for the 
example of data expressed in XML where functionally identical 
objects (constructed using different DOM libraries for example) are 
not identical blobs.
Asserting that two instances of an object with the same LSID are 
semantically equivalent is a vastly more complicated process than 
asserting that the canonical representation of those instances are 
identical.  Generally there can be defined a simple set of 
guidelines for constructing the canonical form of an object (eg. for 
xml http://www.w3.org/TR/xml-c14n ) whereas asserting semantic 
equivalence is an ongoing topic of research.
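
As a small illustration of comparing canonical forms instead of raw bytes,
the sketch below uses xml.etree.ElementTree.canonicalize from the Python
standard library (Python 3.8 or later). Note that this function implements
Canonical XML 2.0, a later revision of the specification linked above, so
it is a stand-in for the idea and not a claim about which version an LSID
service should use.

```python
import xml.etree.ElementTree as ET  # canonicalize() needs Python 3.8+

# Three serializations of the same empty element: different byte streams.
variants = ["<a/>", "<a />", "<a></a>"]
canonical = [ET.canonicalize(v) for v in variants]

# Attribute order is another difference that canonicalization removes.
attrs_equal = ET.canonicalize('<a y="2" x="1"/>') == ET.canonicalize('<a x="1" y="2"/>')

raw_forms = len(set(variants))          # distinct raw serializations
canonical_forms = len(set(canonical))   # distinct canonical serializations
```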
Requiring identical blobs is certainly possible, but people need to 
be aware of the implications of such a requirement in the early 
stages of designing a system to support such a specification.  My 
preference for the canonical form relaxes the implementation 
requirements considerably whilst still maintaining the integrity of 
the data and the intent of the LSID.
regards,
  Dave V.
On Jul 14, 2007, at 08:08, Bob Morris wrote:
...
This entire discussion confuses me. The LSID standard is published.
Why is there a discussion of what an LSID should be? The standard 
requires the data, as defined by the return of getData, to be 
identical for all resolutions of the LSID. From page 9 of the LSID 
spec:
" bytes getData (LSID lsid)
bytes getDataByRange (LSID lsid, integer start, integer length)
Metadata_response getMetadata (LSID lsid, string[] accepted_formats)
Metadata_response getMetadataSubset (LSID lsid, string[] accepted_formats, string selector)
The data retrieval services may implement all of the methods, or only 
methods for retrieving data, or only methods for retrieving 
associated metadata.
The same LSID named data object must be resolved always to the same 
set of bytes. Therefore, all of the data retrieval services return 
the same results for the same LSID. The user has, however, the 
choice of which one of these to utilize depending on its location, 
known quality of service and other attributes. With metadata, the 
situation is different. Each data retrieval service can provide 
different metadata for the same LSID."
This doesn't seem very ambiguous to me, and doesn't have anything 
to do with imperfect storage of data or anything else about the 
physical or electronic world. If two calls to getData() with the 
same argument on two occasions to possibly two different resolution 
services do not yield the same set of bytes, then one or the other 
or both of those is not executing a compliant service response. 
Unless this discussion is really "Shall we call something other 
than the return of getData by the term 'data associated with the 
LSID'?" there seems to be nothing to discuss.
Bob
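
The compliance condition Bob describes, that every resolution of an LSID
yields the same set of bytes, can be checked mechanically. The sketch below
is only an illustration: the resolvers are stand-in callables backed by
dictionaries, not a real LSID client, and the example.org LSID is invented.

```python
import hashlib
from typing import Callable, Iterable


def same_bytes(lsid: str, resolvers: Iterable[Callable[[str], bytes]]) -> bool:
    """Return True if every resolver returns identical bytes for the LSID."""
    digests = {hashlib.sha256(resolve(lsid)).hexdigest() for resolve in resolvers}
    return len(digests) <= 1


lsid = "urn:lsid:example.org:docs:1"

# Stand-ins for getData() on three resolution services.
primary = {lsid: b"<a/>"}
faithful_mirror = {lsid: b"<a/>"}
rebuilding_mirror = {lsid: b"<a></a>"}  # same content, re-serialized on the fly

compliant = same_bytes(lsid, [primary.__getitem__, faithful_mirror.__getitem__])
non_compliant = not same_bytes(lsid, [primary.__getitem__, rebuilding_mirror.__getitem__])
```

Under the strict reading of the spec the third service fails the check even
though its XML is equivalent, which is exactly the case the thread is
arguing about.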
On 7/13/07, Paul Kirk <p.kirk@cabi.org> wrote:
...
In an imperfect world there is no such thing as an 'identical-byte-stream' 
because the technology we use is imperfect ... the disk controllers 
which manage our bytes and the disk we use to store our bytes have 
recognized error rates. Perhaps I'm being a pedant
...
in the above analysis but I was almost persuaded that except for 
digital objects (images, sounds) which can be data all other 'things' 
(names, specimen accession numbers) had to be metadata. This to me 
makes no sense in the real but imperfect world we live in. An LSID 
assigned to a name (e.g. Homo sapiens) is assigned to the name as 
data, not metadata. What is 'identical' here is that if the spelling 
has to change for any reason the new spelling gets a new LSID and the 
now incorrect spelling gets deprecated (but is still resolvable) with 
a pointer to the correct spelling/LSID in the metadata.
OK?
Paul
________________________________
 From: tdwg-guid-bounces@lists.tdwg.org on behalf of 
Chuck Miller
Sent: Fri 13/07/2007 19:03
To: Dave Vieglais
Cc: tdwg-guid@lists.tdwg.org
Subject: RE: [tdwg-guid] LSID metadata persistence (or lack 
thereof)[Scanned]
Dave,
What you say is true.  But, I think we already have too many 
variations, subtleties, and reinterpretations which are endlessly 
debated.
The LSID standard would be simple, clear and consistent if we used 
the identical-byte-stream definition.  The LSID would uniquely tag 
a persistent byte stream. A persistent byte stream is always the 
same thing without any further explanation or clarification.
The provider of an LSID byte-stream would need to commit to 
keeping that byte-stream persistent and not represent it in 
multiple ways, even though technically they could.  If they can't 
commit to that, then it can't be an LSID byte-stream.
And in the name of simplicity and clarity, if they had to provide 
different byte-stream representations then they would have to 
assign a different LSID to each and use "SameAs" metadata.
Chuck
-----Original Message-----
From: Dave Vieglais [mailto:vieglais@ku.edu]
Sent: Friday, July 13, 2007 12:42 PM
To: Chuck Miller
Cc: Ricardo Pereira; tdwg-guid@lists.tdwg.org
Subject: Re: [tdwg-guid] LSID metadata persistence (or lack
thereof)
Hi Ricardo, Chuck,
Asserting that the byte stream returned as data associated with an 
LSID should never change is perhaps a bit confusing from a 
programmatic view.  There are for example many ways to represent 
data in xml that are identical from an information content point 
of view, but the byte streams could be very different.
Perhaps it might be better to state something like "the canonical 
representation of the data associated with an LSID must not 
change", or something to that effect?
Dave V.
On Jul 14, 2007, at 05:29, Chuck Miller wrote:
> Ricardo,
>
> Looking at this definition: "Persistence of LSID Data: The data 
> associated with an LSID (i.e., the byte stream returned by the LSID
> getData call) must never change"
>
>
>
> Perhaps this is a more straightforward way to conceive LSIDs.  The
> LSID goes with a byte stream.  It's that byte stream that must stay
> the same.  So, if there is a byte stream associated with a 
> collection that needs to stay the same, then whatever that byte 
> stream happens to be is the data that gets an LSID assigned to it.
> That sure seems a clearer definition of what is data and what is 
> metadata, rather than the issue of primary object and all that.
>
>
>
> So we can create a new definition in the context of LSIDs:  Data is
> a byte stream that is persistent, never changes and can have an 
> LSID.  Metadata is a byte stream that is non-persistent, might change 
> and is only associated with an LSID.
>
>
>
> The institution who assigns an LSID can make their own decision 
> about whether the byte stream being provided is persistent or
> non-persistent.  By assigning an LSID to any byte stream, whatever it
> is, the institution is declaring it to be data and persistent.
>
>
>
> So, in the example given of an observation record with a 
> determination that needs to remain fixed and unchanged, by 
> assigning an LSID to that observation+determination it would be 
> "declared to be data" and unchangeable.  A different determination
> would then be different data with a different LSID.  That would 
> provide a solution for those who want to employ it.  Others could
> choose not to use it.
>
>
>
> Chuck
>
>
>
> From: tdwg-guid-bounces@lists.tdwg.org
> [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of Ricardo Pereira
> Sent: Friday, July 13, 2007 9:47 AM
> To: tdwg-guid@lists.tdwg.org
> Subject: [tdwg-guid] LSID metadata persistence (or lack thereof)
>
>
>
>     Hi there folks,
>
>     As Chuck mentioned a few weeks ago, we do have a few 
> outstanding issues to address regarding LSIDs. I would like to 
> discuss those one by one, in an orderly manner, and reach consensus
> as much as we can. Then we can sum them up in a TDWG standard, 
> possibly by or shortly after the Bratislava conference.
>
>     The first issue I would like to discuss is LSID metadata 
> persistence. First, let me remind you of a corollary established by
> the LSID specification:
>
>             Corollary 1: LSIDs are not guaranteed to be resolvable
> indefinitely.
>
>     In other words, there is no guarantee that one will always be
> able to retrieve the data associated with an LSID as the authority
> may choose (or be forced) not to resolve an LSID anymore.
>
>     Second, let me distinguish this kind of persistence I'm talking
> about from two other related concepts (which we'll not discuss in
> this thread):
>
>         1) Persistence of Assignment: Once assigned to an object,
> an LSID is indefinitely associated with it. The same LSID cannot be
> assigned to another object. Ever! The LSID may not be resolvable 
> anymore, but it cannot be assigned to another object. This is 
> established by the LSID specification.
>
>         2) Persistence of LSID Data: The data associated with an 
> LSID (i.e., the byte stream returned by the LSID getData call) must
> never change. Although the LSID may not be resolvable anymore 
> (according to corollary 1), the data associated with an LSID must
> never ever change. That's defined by the LSID spec, too.
>
>     What I want to discuss here is the persistence of LSID metadata
> (what is returned by the getMetadata call) or the lack thereof.
>
>     A use case associated with metadata persistence is when someone
> collects observation records (and implicitly, their determinations)
> and runs an experiment (a model or simulation) with them. This person
> may want to record the identifiers of the points used so that 
> someone using the results of that experiment may refer back to the
> primary data, to validate or repeat the experiment.
>
>     The bad news is that the LSID identification scheme (or any other
> GUID that I know of) was not designed to guarantee metadata 
> persistence, and thus it cannot implement the use case above by 
> itself. To implement that use case, the specification would have to
> guarantee that the metadata (which we are using here as data) is 
> immutable. But it doesn't.
>
>     Most of us wish that metadata was persistent, but it isn't.
> Many things can change in the metadata: a new determination, a 
> mispeling that is corrected, many things. We just cannot guarantee
> that the metadata will look like it was sometime ago.
>
>     We then reach the following conclusion.
>
>             Corollary 2: LSID metadata is neither immutable nor 
> persistent.
>
>     The consequence of this corollary is that, if you need to refer
> back to a piece of information (metadata) associated with an LSID,
> exactly as it was when you got it, you must make a copy of it, or
> arrange that someone else make that copy for you.
>
>     In other words, a client cannot assume that the metadata 
> associated with an LSID today will be the same tomorrow. If the 
> client does assume that, it may be relying on a false assumption 
> and its output may be flawed.
>
>     If we are not happy with that conclusion, we may develop an 
> additional component in our architecture, an archive of some sort,
> to handle (meta)data persistence. That is exactly what the STD-DOI
> project (http://www.std-doi.de/) and SEEK
> (http://seek.ecoinformatics.org) have done to some extent.
>
>     While we cannot guarantee that LSID metadata is persistent or
> immutable, we can definitely document how the metadata have changed
> through metadata versioning. That's the topic of the next thread.
> We will move on to discuss metadata versioning as soon as we are 
> done with metadata persistence.
>
>     Cheers,
>
> Ricardo
>
> _______________________________________________
> tdwg-guid mailing list
> tdwg-guid@lists.tdwg.org
> http://lists.tdwg.org/mailman/listinfo/tdwg-guid
--Robert A. Morris
Professor of Computer Science
UMASS-Boston
ram@cs.umb.edu
http://bdei.cs.umb.edu/
http://www.cs.umb.edu/~ram
http://www.cs.umb.edu/~ram/calendar.html
phone (+1)617 287 6466


Richard Pyle