[tdwg-guid] LSID metadata persistence (or lack thereof)

Dave Vieglais vieglais at ku.edu
Mon Jul 16 12:00:25 CEST 2007


Hi Rich,
the question I posed about getData() has nothing to do with the
actual data being referenced - that is, and should be, opaque to the
LSID service itself (apart from the metadata describing the data, but  
that is not part of the service).  Heck, it could be data about the  
number of coconuts consumed last year for all that matters.  The  
question was about the functionality of the protocol and services  
that implement it.  If an LSID is assigned to some data, then right  
now it is required that the data retrieved by getData() is always  
exactly the same byte sequence.  That's fine.  No more discussion  
required.  Leave it be.

The issue that does concern me though is that requiring the exact  
same byte stream for data identified by an LSID can raise unexpected  
implementation issues that seem to be overly restrictive without  
improving functionality.  My impression of LSIDs and their utility  
has always been as pointers to data which must always be consistent  
regardless of how or when the data is retrieved.  This is not  
necessarily the same thing as saying the byte stream used to  
represent the data is always the same, and many examples of this can  
be provided.  There are, however, simple ways around this limitation
(such as creating a new method as I outlined elsewhere) and perhaps  
there should be a little further discussion on this specific aspect  
of the LSID specification.
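
To make the distinction concrete, here is a minimal Python sketch of the
difference between "same bytes" and "same canonical form".  It is not part
of the LSID spec or of any LSID library: the helper names byte_digest and
canonical_digest are made up for illustration, and it assumes Python 3.8+
for the standard library's xml.etree.ElementTree.canonicalize (W3C C14N 2.0).

```python
import hashlib
import xml.etree.ElementTree as ET  # canonicalize() requires Python 3.8+

# Three serializations of the same empty element (the example from this thread).
variants = ["<a/>", "<a />", "<a></a>"]


def byte_digest(xml_text: str) -> str:
    """Hypothetical helper: digest of the raw bytes, i.e. strict byte identity."""
    return hashlib.sha256(xml_text.encode("utf-8")).hexdigest()


def canonical_digest(xml_text: str) -> str:
    """Hypothetical helper: digest of the C14N 2.0 canonical form."""
    canonical = ET.canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


raw = {byte_digest(v) for v in variants}
canon = {canonical_digest(v) for v in variants}

print(len(raw), "distinct digests when comparing raw bytes")
print(len(canon), "distinct digest(s) when comparing canonical forms")
```

Under these assumptions the three variants differ as raw bytes but collapse
to a single canonical form, which is the kind of consistency I have in mind.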

Dave V.

On Jul 16, 2007, at 06:22, Richard Pyle wrote:

>
> I'm not sure I understand this fixation with the getData() call.  Why is it
> so important to use that call to retrieve bytestream information relating to
> objects that are not themselves inherently digital?  Much of what we are
> interested in within the biodiversity informatics community, in terms of
> what we want to establish identifiers for, are not inherently digital
> objects and therefore should NOT have any bytes returned for getData().
> Some of our objects *are* inherently digital (PDFs, image files of various
> formats, video clips, audio files, possibly Genbank sequences in a specified
> format and encoding, etc.)  To me, the distinction is very simple:  is the
> object that the LSID identifies a binary data file?  If yes, then the binary
> data become the data of the LSID.  If no, then the LSID has no binary "data"
> (sensu LSID Spec), and returns only metadata through getMetadata().  The
> LSID spec refers to such LSIDs as "Abstract" (or sometimes "Conceptual")
> LSIDs.
>
> It's really not that complicated -- unless, as I suggested previously, I am
> missing something fundamentally important.
>
> I don't understand the advantage we gain by "force-fitting" some digitized
> rendering of an otherwise non-digital object. Taxon Names (for example) have
> no inherent digital manifestation.  We create an artificial digital
> representation of them by stringing ASCII or Unicode characters together in
> a way that resembles (in principle) the characters otherwise represented by
> ink on paper.  But if we want to embed such a character string as "data" for
> an LSID, then the LSID is really an identifier for the *character string*
> itself, NOT the "notion" or "idea" or "concept" of the taxon name.  As a
> taxonomist and biodiversity informatics manager, I have very little use for
> LSIDs that identify specific character strings.  I want an LSID that
> identifies the shared understanding of a taxon name -- not an
> artificial/substitute rendering of the taxon name.  I see no advantage to
> creating one LSID for a text string that encodes a taxon name as UTF-8, and
> another LSID for the same name encoded as UTF-16, and so on, and so on.
> These variants are purely artificial from the perspective of what I want an
> LSID for (i.e., the idea/notion/concept of a taxon name).
>
> I do acknowledge that the idea of an "Abstract" LSID was really meant to
> serve as an "umbrella" of sorts to tie together multiple data-bearing LSIDs.
> The classic example is an image that can be represented as a RAW, a TIFF, or
> a JPEG file format.  Assuming all three image files derive from the same
> shutter-release event of a camera, then the intended function of an
> "Abstract" LSID is to serve to gather together the LSIDs established for
> each of the three file formats of the "same" image.  The images are the
> "same" only in the conceptual sense -- i.e., that they all derive from the
> shutter-release event.  But the point is, the purpose of the "Abstract" LSID
> is really intended to be a mechanism of organizing data-bearing LSIDs that
> refer to different digital renderings of the "same thing".  From the "LSID
> Best Practices" website
> (http://www-128.ibm.com/developerworks/opensource/library/os-lsidbp/), under
> the heading "Abstract LSIDs":
>
> "The abstract LSID provides the anchor point for software and users to
> explore the metadata and obtain further pointers to all the concrete LSID
> references that contain data, along with the data's exact relationship to
> the abstract concept."
>
> This implies that "Abstract" LSIDs should exist primarily to aggregate
> data-bearing LSIDs.
>
> For the most part, I don't think this is what we are really trying to do
> when we want to assign LSIDs to non-digital objects like taxon names,
> specimens, etc.  So, in a sense, what I am advocating deviates a bit from
> the intention of an "Abstract" LSID.  But at least I'm not outright
> violating the fundamental tenets of the LSID spec, like trying to apply a
> single LSID to more than one bytestream returnable via getData().
>
> So, again, I return to my original confusion:  why all the fixation with the
> getData() call?
>
> The only reasons I can think of are:
>
> 1) Semantics (of the human communication kind): We're uncomfortable
> referring to things like the text string C-e-n-t-r-o-p-y-g-e (minus the
> dashes) as being mere "metadata" for the angelfish genus described by Kaup
> in 1860 -- when it just feels like the "actual" name to us (and hence should
> be thought of as "data").
>
> 2) Persistence: We want to embed information as "data" for the LSID because
> we want to make sure the "same information" is always there, and the LSID
> spec emphasizes the permanent relationship between an LSID and its data.
> The only trouble is, we want to define the word "same" in this context in a
> way that is utterly incomprehensible (without all manner of comparison
> algorithms) to a computer.  *We* know that "Chaetodon" is the "same" as
> "Chætodon", so we want a single LSID to refer to the genus name for
> butterflyfishes described by Linnaeus in 1758. And we don't like being
> required to always choose one rendering or the other to embed as the
> bit-identical "data" for the LSID.
>
> 3) Performance(?):  This is where I may be missing something fundamental.
> Are there characteristics of the getData() call that are far superior to
> getMetadata()?
>
>
> As for number 1:  all I can say is "get over it".  Our unfortunate reality
> in biodiversity informatics is a preponderance of homonymy -- not just in
> taxon names, but in our human-mitigated communication lexicon as well.
>
> As for number 2: We can deal with persistence through layers of standards
> and convention within our community.  Almost everything we talk about
> involves an assumption of adherence to standards and conventions.  If we
> want persistent metadata, then we need to formalize a document detailing
> which metadata elements should be mandatory and/or persistent and/or have
> other properties that we as a community feel are important. This document
> would also outline when metadata may be modified for a given LSID, vs. when
> a new LSID should be generated, allowing certain metadata elements for each
> to remain unchanged (e.g., perhaps one LSID for "Chaetodon" and another for
> "Chætodon", for the object type "Digital Taxon Name Rendering").  The
> document would also outline how multiple LSIDs should be cross-referenced to
> each other (e.g., the two "DTNR" objects identified by two different LSIDs
> in the previous example would both refer to the same Abstract LSID
> established for the butterflyfish genus name described by Linnaeus in 1758).
>
> As for number 3: I just hope someone can explain to me where I missed the
> boat.
>
>
> One final note:  I do see a way that we can preserve the spirit of intent
> for the "Abstract LSID" in our domain for things like Taxon Names.  Rather
> than explain it here, I'll follow up with another email describing it.
>
> Aloha,
> Rich
>
> Richard L. Pyle, PhD
> Database Coordinator for Natural Sciences
>   and Associate Zoologist in Ichthyology
> Department of Natural Sciences, Bishop Museum
> 1525 Bernice St., Honolulu, HI 96817
> Ph: (808)848-4115, Fax: (808)847-8252
> email: deepreef at bishopmuseum.org
> http://hbs.bishopmuseum.org/staff/pylerichard.html
>
>
>
> ________________________________
>
> 	From: tdwg-guid-bounces at lists.tdwg.org
> [mailto:tdwg-guid-bounces at lists.tdwg.org] On Behalf Of Chuck Miller
> 	Sent: Saturday, July 14, 2007 2:29 PM
> 	To: Ricardo Pereira
> 	Cc: tdwg-guid at lists.tdwg.org
> 	Subject: RE: [tdwg-guid] LSID metadata persistence (or lack thereof)
> 	
> 	
> 	Ricardo,
> 	I disagree with your assertion of consensus on a couple of points.
> 	
> 	On 2) there is no consensus/decision on whether XML can be returned
> from a getData call.  I asked this question and it has not been  
> answered.
> We could disallow XML as an allowed format for getData and allow it  
> only for
> getMetadata.
> 		
> 	We do not have consensus and actually have disagreement on "We
> shouldn't for example return the bare scientific
> 	name of a species in the getData() call just because that can be
> immutable"  because "the name itself is in the metadata"   I for  
> one believe
> that we cannot avoid returning a scientific name byte stream in the  
> getData
> for an LSID for a scientific name.  That requirement is fundamental  
> to what
> we need for biodiversity data.  Pragmatically and empirically,  
> names and
> specimens/observations are THE most fundamental data objects  
> existing today
> in the databases published by GBIF.  So if we can't put LSIDs on  
> names, we
> have failed to enable one of the most fundamental needs of this  
> community.
> If the definition of LSIDs needs to be amended to enable that, then  
> so be
> it.
> 	
> 	Chuck
> 	
> ________________________________
>
> 	From: tdwg-guid-bounces at lists.tdwg.org on behalf of Ricardo Pereira
> 	Sent: Fri 7/13/2007 8:12 PM
> 	Cc: tdwg-guid at lists.tdwg.org
> 	Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
> 	
> 	
>
> 	    Folks,
> 	
> 	    Thanks much to all of you who replied to my post. All the posts
> were
> 	really relevant to our discussion.
> 	
> 	    Before we go ahead, however, let us stop for a minute to try and
> 	summarize the points we agree upon and the points in which there is
> 	still significant controversy.
> 	
> 	    I believe that we reached consensus in the following issues:
> 	
> 	1) We do agree that *LSID metadata is not required to be persistent*
> 	(i.e. clients cannot assume it is immutable). See note [1].
> 	
> 	2) We should not force XML representations of data to be byte
> identical
> 	just to return that in the LSID getData() call. We must find another
> way
> 	to fulfill this requirement.
> 	
> 	3) We should not try to return something in the LSID getData() call
> just
> 	for the sake of it. We shouldn't for example return the bare
> scientific
> 	name of a species in the getData() call just because that can be
> 	immutable and thus fulfill the requirement from the LSID spec. This
> is
> 	counterproductive because the name itself is in the metadata already
> and
> 	no client would gain anything from calling getData() in this case.
> 	
> 	
> 	    We have also raised new issues that may be worth discussing (in
> 	their own separate thread if possible):
> 	
> 	4) We "may" bend the immutability rule of LSID getData() to our
> benefit
> 	and accept data that is not byte stream identical, but only
> 	"semantically" identical (depending on content type maybe). If we do
> 	this, we may use the LSID getData() call more effectively to
> identify
> 	real datasets such as matrices, identification keys, etc.
> 	
> 	5) As Brian pointed out, we may need to revisit what we call data
> and
> 	metadata. We have been using the LSID getMetadata() call to return
> what
> 	some people may call data (taxon names, specimens, collections). And
> we
> 	forgot completely that there may be other kinds of data out there
> that
> 	may be returned in the getData() call and that those still need
> metadata
> 	to describe them. I think this may be worth discussing in a separate
> thread.
> 	
> 	    Did I leave anything out? If so, please let us know by replying
> to
> 	my post and adding a short entry to either list above.
> 	
> 	    Cheers,
> 	
> 	Ricardo
> 	
> 	
> 	
> 	Notes:
> 	-------
> 	
> 	[1] Matt may disagree with me here, but my point is that we can't
> force
> 	all authorities (i.e. data providers) to keep perfect archives of
> all
> 	versions of their databases given a heterogeneous and distributed
> 	environment we operate in. While some may want to provide this
> feature,
> 	other providers may not want or be able to.
> 	
> 	
> 	Richard Pyle wrote:
> 	> It seems to me that there is a third method to resolving the
> problem:
> 	>
> 	> When we want to identify an object that is itself digital in
> nature (e.g., a
> 	> database record, or a binary data file such as a PDF, JPG, ASCII,
> Unicode,
> 	> or whatever), we resolve said binary object via getData().  If,
> for some
> 	> reason, we change the exact bit-sequence of that digital/binary
> object
> 	> (e.g., color-correct an image, change a text string from ASCII to
> Unicode, or
> 	> whatever...), we assign a new LSID to it (whether that "new" LSID
> differs
> 	> from the "old" LSID only via the optional "Revision" part of the
> LSID, or
> 	> via a new Object Identification part, is a topic for another
> debate).
> 	>
> 	> When we want to identify an object that does not itself have a
> digital
> 	> manifestation -- like a physical object (e.g., specimen or a
> particular
> 	> printed copy of a publication) or an abstract/conceptual object
> (e.g., a
> 	> taxon name, a taxon concept, a geographic place, or a cited
> publication) --
> 	> then we return *nothing* in response to getData(), and we treat
> all the
> 	> attributes of said physical/abstract/conceptual object of interest
> to us as
> 	> metadata.
> 	>
> 	> If there are cases where certain metadata elements of an object
> without an
> 	> inherent digital existence need to persist (and there are), yet
> we also
> 	> want to allow modifications to metadata elements without the need
> to
> 	> generate new identifiers for the underlying object (and we do) --
> then we
> 	> deal with those within our own community via adopted standards and
> best
> 	> practices.
> 	>
> 	> I would disagree strongly with bending the existing LSID standard,
> and would
> 	> just as strongly favor working within its existing framework
> (which, I
> 	> think, we can).  I would also disagree with the practice of
> embedding XML
> 	> documents as "data" for an LSID, unless the LSID is intended to
> represent
> 	> the XML document itself (in which case there might be a different
> LSID to
> 	> represent the database record that was used to generate the XML
> document;
> 	> and yet another LSID to represent the abstract concept that the
> database
> 	> record was created to represent -- like a taxon name, for
> example).
> 	>
> 	> If we want to use LSIDs to pass around XML packages (that are not
> rendered
> 	> as RDF) about abstract objects (e.g., taxon names), why doesn't
> our
> 	> community define within our semantic vocabulary something along
> the lines of
> 	> "TCS_XML", which can be established as a standard metadata
> component for
> 	> LSIDs assigned to taxon concepts (i.e., abstract objects,
> identified by
> 	> "data-less" LSIDs).  The exact bytestream of the content of that
> metadata
> 	> element can change, without changing its canonical rendering.
> 	>
> 	> I'm beginning to suspect (strongly) that I am completely missing
> some
> 	> fundamental point here -- and perhaps it is the same point that
> underlies
> 	> the apparent antagonism towards LSIDs in general (which I do not
> yet share).
> 	> But I am fairly certain we are dealing with some level of
> miscommunication
> 	> here.
> 	>
> 	> Aloha,
> 	> Rich
> 	>
> 	>
> 	>> -----Original Message-----
> 	>> From: tdwg-guid-bounces at lists.tdwg.org
> 	>> [mailto:tdwg-guid-bounces at lists.tdwg.org] On Behalf Of P.
> 	>> Bryan Heidorn
> 	>> Sent: Friday, July 13, 2007 12:48 PM
> 	>> To: Dave Vieglais
> 	>> Cc: tdwg-guid at lists.tdwg.org
> 	>> Subject: Re: [tdwg-guid] LSID metadata persistence (or lack
> 	>> thereof)[Scanned]
> 	>>
> 	>> There seem to be two methods of resolving this problem.
> 	>>
> 	>> One is to change the LSID definitions to allow semantic
> 	>> equivalence in the data and not require exact bit stream
> equivalence.
> 	>>
> 	>> The other option is to change the data representation so that
> 	>> it is "easily" reduced to a repeatable canonical form. For
> 	>> example, it is almost as easy as saying where XML ordering
> 	>> does not specify order of elements, elements will be ordered
> 	>> alphabetically. Seems stupid but it almost works.. except
> 	>> where you have repeating elements with the same element name
> 	>> where it does not work.
> 	>>
> 	>> It seems a little odd to bend the standards for the data
> 	>> being delivered to fit the requirement of the LSID spec. In
> 	>> theory, the other standard developers who set the data being
> 	>> delivered did not fix order because it did not matter.
> 	>>
> 	>> This is different from Chuck's observation that the semantics
> 	>> of the element within some of the standards are
> 	>> insufficiently specified.
> 	>> So, what we mean is a darwin mode species name is just a
> 	>> string and nothing more now.
> 	>>
> 	>>
> 	>> --Bryan
> 	>>
> 	>> On Jul 13, 2007, at 5:18 PM, Dave Vieglais wrote:
> 	>>
> 	>>
> 	>>> I think we are all in agreement that the data and metadata
> 	>>>
> 	>> referenced
> 	>>
> 	>>> by an LSID remains unchanged (in the case of the metadata,
> semantic
> 	>>> equivalence is a requirement for reasons such as outlined
> 	>>>
> 	>> by Matt).
> 	>>
> 	>>> My question is to do purely with the data that an LSID
> references
> 	>>> through the getData() operation.  The form of that data could be
> 	>>> anything really - an encrypted byte stream, digital image,
> 	>>>
> 	>> Open Office
> 	>>
> 	>>> document, spreadsheet, xml document...
> 	>>>
> 	>>> We all know that the same data can be represented many ways
> 	>>>
> 	>> that are
> 	>>
> 	>>> logically, semantically and functionally equivalent yet form a
> 	>>> different set of bytes when serialized.  Data expressed in
> 	>>>
> 	>> XML is one
> 	>>
> 	>>> example (is <a/> = <a /> = <a></a> ?).  A palette-based image is
> 	>>> another - the order of colors in the palette may be
> 	>>>
> 	>> changed, and the
> 	>>
> 	>>> pixel values adjusted to match the new palette order, but
> 	>>>
> 	>> the image is
> 	>>
> 	>>> still the same. There are many more simple examples that can be
> 	>>> constructed that violate the unchanged bytes rule but for all
> 	>>> practical and functional purposes the data are unchanged.
> 	>>>
> 	>>> As mentioned previously, enforcing and implementing the
> unchanged
> 	>>> bytes rule is not challenging. It is however quite different
> from
> 	>>> stating that the data are returned unchanged.  It is this
> 	>>>
> 	>> that I, and
> 	>>
> 	>>> I'm sure a lot of other implementors would appreciate consensus
> on.
> 	>>>
> 	>>> Dave V.
> 	>>>
> 	>>> On Jul 14, 2007, at 09:20, Matthew Jones wrote:
> 	>>>
> 	>>>
> 	>>>> In terms of the metadata returned from an LSID, or any
> 	>>>>
> 	>> other digital
> 	>>
> 	>>>> identifier, there are definite cases where metadata must be
> 	>>>> semantically persistent in order to preserve the utility
> 	>>>>
> 	>> of data and
> 	>>
> 	>>>> accuracy of scientific results.
> 	>>>>
> 	>>>> As a trivial example, given a set of observations
> 	>>>>
> 	>> collected at time
> 	>>
> 	>>>> t, one can represent the data for those observations in
> 	>>>>
> 	>> dataset D and
> 	>>
> 	>>>> the metadata for the dataset, including the time value t, in a
> 	>>>> metadata document M.  In a later event, it is discovered
> 	>>>>
> 	>> that t was
> 	>>
> 	>>>> entered incorrectly, and needs to be adjusted, creating
> metadata
> 	>>>> document M'. That M and M' are not congruent is critical
> knowledge
> 	>>>> when analyzing data from D with data from another dataset D2.
> In
> 	>>>> other words, because there is no true distinction between data
> and
> 	>>>> metadata (any given piece of information can be stored in
> either
> 	>>>> location), a proper archive must be able to distinguish
> 	>>>>
> 	>> any changes
> 	>>
> 	>>>> in the data and any changes in the metadata.
> 	>>>>
> 	>>>> That said, there are some metadata that could change with
> 	>>>>
> 	>> little or
> 	>>
> 	>>>> no impact on data interpretation (e.g., the spelling of
> 	>>>>
> 	>> the street on
> 	>>
> 	>>>> which Technician Tom gets his snailmail).  But at the current
> time
> 	>>>> it's impossible to distinguish this kind of metadata from the
> 	>>>> important kind in the general case of the existing
> 	>>>>
> 	>> metadata standards
> 	>>
> 	>>>> in use (e.g., FGDC BDP, ISO 19115, EML, GML, etc).
> 	>>>>
> 	>>>> Our process in the KNB/SEEK/NCEAS and other ecological
> 	>>>>
> 	>> data archives
> 	>>
> 	>>>> is to give persistent identifiers to both data objects and
> 	>>>>
> 	>> metadata
> 	>>
> 	>>>> objects, and provide new identifiers when either changes.
> 	>>>>
> 	>>>> Matt
> 	>>>>
> 	>>>>
> 	>>>> Dave Vieglais wrote:
> 	>>>>
> 	>>>>> Hi Bob,
> 	>>>>> Just because a standard is published does not mean that it is
> 	>>>>> practical.  Requiring that a set of bytes referenced by
> 	>>>>>
> 	>> an LSID are
> 	>>
> 	>>>>> unchanged has a lot of implications with respect to the
> 	>>>>> implementation of data services.  For example, if it is agreed
> to
> 	>>>>> abide by the rule that the blob referenced by an LSID remains
> 	>>>>> forever unchanged, then that implies that the data
> 	>>>>>
> 	>> provider stores
> 	>>
> 	>>>>> the data as a blob, rather than risking the process of
> 	>>>>> reconstructing on the fly from some database, especially for
> the
> 	>>>>> example of data expressed in XML where functionally identical
> 	>>>>> objects (constructed using different DOM libraries for
> 	>>>>>
> 	>> example) are
> 	>>
> 	>>>>> not identical blobs.
> 	>>>>> Asserting that two instances of an object with the same LSID
> are
> 	>>>>> semantically equivalent is a vastly more complicated
> 	>>>>>
> 	>> process than
> 	>>
> 	>>>>> asserting that the canonical representation of those
> 	>>>>>
> 	>> instances are
> 	>>
> 	>>>>> identical.  Generally there can be defined a simple set of
> 	>>>>> guidelines for constructing the canonical form of an
> 	>>>>>
> 	>> object (eg. for
> 	>>
> 	>>>>> xml http://www.w3.org/TR/xml-c14n ) whereas asserting semantic
> 	>>>>> equivalence is an ongoing topic of research.
> 	>>>>> Requiring identical blobs is certainly possible, but
> 	>>>>>
> 	>> people need to
> 	>>
> 	>>>>> be aware of the implications of such a requirement in the
> early
> 	>>>>> stages of designing a system to support such a specification.
> My
> 	>>>>> preference for the canonical form relaxes the implementation
> 	>>>>> requirements considerably whilst still maintaining the
> 	>>>>>
> 	>> integrity of
> 	>>
> 	>>>>> the data and the intent of the LSID.
> 	>>>>> regards,
> 	>>>>>   Dave V.
> 	>>>>> On Jul 14, 2007, at 08:08, Bob Morris wrote:
> 	>>>>>
> 	>>>>>> This entire discussion confuses me. The LSID standard is
> 	>>>>>>
> 	>> published.
> 	>>
> 	>>>>>> Why is there a discussion of what an LSID should be? The
> 	>>>>>>
> 	>> standard
> 	>>
> 	>>>>>> requires that the data, as defined by the return of
> 	>>>>>>
> 	>> getData,  to be
> 	>>
> 	>>>>>> identical for all resolutions of the LSID. From page 9
> 	>>>>>>
> 	>> of the LSID
> 	>>
> 	>>>>>> spec:
> 	>>>>>>
> 	>>>>>> " bytes getData (LSID lsid)
> 	>>>>>> bytes getDataByRange (LSID lsid, integer start, integer
> length)
> 	>>>>>> Metadata_response getMetadata (LSID lsid, string[]
> 	>>>>>> accepted_formats)
> 	>>>>>> Metadata_response getMetadataSubset (LSID lsid, string[]
> 	>>>>>> accepted_formats, string selector) The data retrieval
> 	>>>>>>
> 	>> services may
> 	>>
> 	>>>>>> implement all of the methods, or only methods for
> 	>>>>>>
> 	>> retrieving data,
> 	>>
> 	>>>>>> or only methods for retrieving associated metadata.
> 	>>>>>> The same LSID named data object must be resolved always
> 	>>>>>>
> 	>> to the same
> 	>>
> 	>>>>>> set of bytes. Therefore, all of the data retrieval
> 	>>>>>>
> 	>> services return
> 	>>
> 	>>>>>> the same results for the same LSID. The user has, however,
> the
> 	>>>>>> choice of which one of these to utilize depending on its
> 	>>>>>>
> 	>> location,
> 	>>
> 	>>>>>> known quality of service and other attributes. With
> 	>>>>>>
> 	>> metadata, the
> 	>>
> 	>>>>>> situation is different. Each data retrieval service can
> provide
> 	>>>>>> different metadata for the same LSID."
> 	>>>>>>
> 	>>>>>> This doesn't seem very ambiguous to me, and doesn't have
> 	>>>>>>
> 	>> anything
> 	>>
> 	>>>>>> to do with imperfect storage of data or anything else about
> the
> 	>>>>>> physical or electronic world. If two calls to getData() with
> the
> 	>>>>>> same argument on two occasions to possibly two different
> 	>>>>>>
> 	>> resolution
> 	>>
> 	>>>>>> services do not yield the same set of bytes, then one or
> 	>>>>>>
> 	>> the other
> 	>>
> 	>>>>>> or both of those is not executing a compliant service
> response.
> 	>>>>>> Unless this discussion is really "Shall we call something
> other
> 	>>>>>> than the return of getData by the term 'data associated with
> the
> 	>>>>>> LSID?' there seems to be nothing to discuss.
> 	>>>>>>
> 	>>>>>> Bob
> 	>>>>>>
> 	>>>>>>
> 	>>>>>>
> 	>>>>>>
> 	>>>>>> On 7/13/07, Paul Kirk <p.kirk at cabi.org> wrote:
> 	>>>>>>
> 	>>>>>>>
> 	>>>>>>> In an imperfect world there is no such thing as an
> 	>>>>>>> 'identical-byte-stream'
> 	>>>>>>> because the technology we use is imperfect ... the disk
> 	>>>>>>> controllers which manage our bytes and the disk we use to
> store
> 	>>>>>>> our bytes have recognized error rates. Perhaps I'm
> 	>>>>>>>
> 	>> being a pedant
> 	>>
> 	>>>>>>> in the above analysis but I was almost persuaded that
> 	>>>>>>>
> 	>> except for
> 	>>
> 	>>>>>>> digital objects (images,
> 	>>>>>>> sounds) which can
> 	>>>>>>> be data all other 'things' (names, specimen accession
> 	>>>>>>>
> 	>> numbers) had
> 	>>
> 	>>>>>>> to be metadata. This to me makes no sense in the real but
> 	>>>>>>> imperfect world we live in. An LSID assigned to a name
> 	>>>>>>>
> 	>> (e.g. Homo
> 	>>
> 	>>>>>>> sapiens) is assigned to the name as data, not metadata. What
> is
> 	>>>>>>> 'identical' here is that if the spelling has to change for
> any
> 	>>>>>>> reason the new spelling gets a new LSID and the now
> incorrect
> 	>>>>>>> spelling gets deprecated (but is still resolvable) with
> 	>>>>>>>
> 	>> a pointer
> 	>>
> 	>>>>>>> to the correct spelling/LSID in the metadata.
> 	>>>>>>>
> 	>>>>>>> OK?
> 	>>>>>>>
> 	>>>>>>> Paul
> 	>>>>>>>
> 	>>>>>>>  ________________________________
> 	>>>>>>>  From: tdwg-guid-bounces at lists.tdwg.org on behalf of
> 	>>>>>>>
> 	>> Chuck Miller
> 	>>
> 	>>>>>>> Sent: Fri 13/07/2007 19:03
> 	>>>>>>> To: Dave Vieglais
> 	>>>>>>> Cc: tdwg-guid at lists.tdwg.org
> 	>>>>>>> Subject: RE: [tdwg-guid] LSID metadata persistence (or lack
> 	>>>>>>> thereof)[Scanned]
> 	>>>>>>>
> 	>>>>>>>
> 	>>>>>>>
> 	>>>>>>>
> 	>>>>>>> Dave,
> 	>>>>>>> What you say is true.  But, I think we already have too many
> 	>>>>>>> variations, subtleties, and reinterpretations which are
> 	>>>>>>>
> 	>> endlessly
> 	>>
> 	>>>>>>> debated.
> 	>>>>>>>
> 	>>>>>>> The LSID standard would be simple, clear and consistent
> 	>>>>>>>
> 	>> if we used
> 	>>
> 	>>>>>>> the identical-byte-stream definition.  The LSID would
> 	>>>>>>>
> 	>> uniquely tag
> 	>>
> 	>>>>>>> a persistent byte stream. A persistent byte stream is
> 	>>>>>>>
> 	>> always the
> 	>>
> 	>>>>>>> same thing without any further explanation or clarification.
> 	>>>>>>>
> 	>>>>>>> The provider of an LSID byte-stream would need to commit to
> 	>>>>>>> keeping that byte-stream persistent and not represent it in
> 	>>>>>>> multiple ways, even though technically they could.  If
> 	>>>>>>>
> 	>> they can't
> 	>>
> 	>>>>>>> commit to that, then it can't be an LSID byte-stream.
> 	>>>>>>>
> 	>>>>>>> And in the name of simplicity and clarity, if they had
> 	>>>>>>>
> 	>> to provide
> 	>>
> 	>>>>>>> different byte-stream representations then they would have
> to
> 	>>>>>>> assign a different LSID to each and use "SameAs" metadata.
> 	>>>>>>>
> 	>>>>>>> Chuck
> 	>>>>>>>
> 	>>>>>>> -----Original Message-----
> 	>>>>>>> From: Dave Vieglais [mailto:vieglais at ku.edu]
> 	>>>>>>> Sent: Friday, July 13, 2007 12:42 PM
> 	>>>>>>> To: Chuck Miller
> 	>>>>>>> Cc: Ricardo Pereira; tdwg-guid at lists.tdwg.org
> 	>>>>>>> Subject: Re: [tdwg-guid] LSID metadata persistence (or lack
> 	>>>>>>> thereof)
> 	>>>>>>>
> 	>>>>>>> Hi Ricardo, Chuck,
> 	>>>>>>> Asserting that the byte stream returned as data
> 	>>>>>>>
> 	>> associated with an
> 	>>
> 	>>>>>>> LSID should never change is perhaps a bit confusing from a
> 	>>>>>>> programmatic view.  There are for example many ways to
> 	>>>>>>>
> 	>> represent
> 	>>
> 	>>>>>>> data in xml that are identical from an information
> 	>>>>>>>
> 	>> content point
> 	>>
> 	>>>>>>> of view, but the byte streams could be very different.
> 	>>>>>>>
> 	>>>>>>> Perhaps it might be better to state something like "the
> 	>>>>>>>
> 	>> canonical
> 	>>
> 	>>>>>>> representation of the data associated with an LSID must not
> 	>>>>>>> change", or something to that effect?
> 	>>>>>>>
> 	>>>>>>> Dave V.
> 	>>>>>>>
> 	>>>>>>> On Jul 14, 2007, at 05:29, Chuck Miller wrote:
> 	>>>>>>>
> 	>>>>>>>
> 	>>>>>>>> Ricardo,
> 	>>>>>>>>
> 	>>>>>>>> Looking at this definition: "Persistence of LSID
> 	>>>>>>>>
> 	>> Data: The data
> 	>>
> 	>>>>>>>> associated with an LSID (i.e, the byte stream returned by
> the
> 	>>>>>>>>
> 	>>>>>>> LSID
> 	>>>>>>>
> 	>>>>>>>> getData call) must never change"
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>> Perhaps this is a more straightforward way to conceive
> 	>>>>>>>>
> 	>>>>>>> LSIDs.  The
> 	>>>>>>>
> 	>>>>>>>> LSID goes with a byte stream.  It's that byte stream that
> 	>>>>>>>>
> 	>>>>>>> must stay
> 	>>>>>>>
> 	>>>>>>>> the same.  So, if there is a byte stream associated with a
> 	>>>>>>>> collection that needs to stay the same, then whatever
> 	>>>>>>>>
> 	>> that byte
> 	>>
> 	>>>>>>>> stream happens to be is the data that gets an LSID assigned
> 	>>>>>>>>
> 	>>>>>>> to it.
> 	>>>>>>>
> 	>>>>>>>> That sure seems a clearer definition of what is data
> 	>>>>>>>>
> 	>> and what is
> 	>>
> 	>>>>>>>> metadata, rather than the issue of primary object and
> 	>>>>>>>>
> 	>> all that.
> 	>>
> 	>>>>>>>>
> 	>>>>>>>> So we can create a new definition in the context of LSIDs:
> 	>>>>>>>> Data is a byte stream that is persistent, never changes and
> 	>>>>>>>> can have an LSID.  Metadata is a byte stream that is
> 	>>>>>>>> non-persistent, might change and is only associated with an
> 	>>>>>>>> LSID.
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>> The institution that assigns an LSID can make its own
> 	>>>>>>>> decision about whether the byte stream being provided is
> 	>>>>>>>> persistent or non-persistent.  By assigning an LSID to any
> 	>>>>>>>> byte stream, whatever it is, the institution is declaring it
> 	>>>>>>>> to be data and persistent.
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>> So, in the example given of an observation record with a
> 	>>>>>>>> determination that needs to remain fixed and unchanged, by
> 	>>>>>>>> assigning an LSID to that observation+determination it would
> 	>>>>>>>> be "declared to be data" and unchangeable.  A different
> 	>>>>>>>> determination would then be different data with a different
> 	>>>>>>>> LSID.  That would provide a solution for those who want to
> 	>>>>>>>> employ it.  Others could choose not to use it.
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>> Chuck
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>>
> 	>>>>>>>> From: tdwg-guid-bounces at lists.tdwg.org
> 	>>>>>>>> [mailto:tdwg-guid-bounces at lists.tdwg.org] On Behalf Of
> 	>>>>>>>> Ricardo Pereira
> 	>>>>>>>> Sent: Friday, July 13, 2007 9:47 AM
> 	>>>>>>>> To: tdwg-guid at lists.tdwg.org
> 	>>>>>>>> Subject: [tdwg-guid] LSID metadata persistence (or lack thereof)
> 	>>>>>>>>
> 	>>>>>>>>     Hi there folks,
> 	>>>>>>>>
> 	>>>>>>>>     As Chuck mentioned a few weeks ago, we do have a few
> 	>>>>>>>> outstanding issues to address regarding LSIDs. I would like
> 	>>>>>>>> to discuss those one by one, in an orderly manner, and reach
> 	>>>>>>>> consensus as much as we can. Then we can sum them up in a
> 	>>>>>>>> TDWG standard, possibly by or shortly after the Bratislava
> 	>>>>>>>> conference.
> 	>>>>>>>>
> 	>>>>>>>>     The first issue I would like to discuss is LSID metadata
> 	>>>>>>>> persistence. First, let me remind you of a corollary
> 	>>>>>>>> established by the LSID specification:
> 	>>>>>>>>
> 	>>>>>>>>             Corollary 1: LSIDs are not guaranteed to be
> 	>>>>>>>>             resolvable indefinitely.
> 	>>>>>>>>
> 	>>>>>>>>     In other words, there is no guarantee that one will
> 	>>>>>>>> always be able to retrieve the data associated with an LSID,
> 	>>>>>>>> as the authority may choose (or be forced) not to resolve an
> 	>>>>>>>> LSID anymore.
> 	>>>>>>>>
> 	>>>>>>>>     Second, let me distinguish the kind of persistence I'm
> 	>>>>>>>> talking about from two other related concepts (which we'll
> 	>>>>>>>> not discuss in this thread):
> 	>>>>>>>>
> 	>>>>>>>>         1) Persistence of Assignment: Once assigned to an
> 	>>>>>>>> object, an LSID is indefinitely associated with it. The same
> 	>>>>>>>> LSID cannot be assigned to another object. Ever! The LSID may
> 	>>>>>>>> not be resolvable anymore, but it cannot be assigned to
> 	>>>>>>>> another object. This is established by the LSID
> 	>>>>>>>> specification.
> 	>>>>>>>>
> 	>>>>>>>>         2) Persistence of LSID Data: The data associated with
> 	>>>>>>>> an LSID (i.e., the byte stream returned by the LSID getData
> 	>>>>>>>> call) must never change. Although the LSID may not be
> 	>>>>>>>> resolvable anymore (according to corollary 1), the data
> 	>>>>>>>> associated with an LSID must never ever change. That's
> 	>>>>>>>> defined by the LSID spec, too.
> 	>>>>>>>>
> 	>>>>>>>>     What I want to discuss here is the persistence of LSID
> 	>>>>>>>> metadata (what is returned by the getMetadata call) or the
> 	>>>>>>>> lack thereof.
> 	>>>>>>>>
> 	>>>>>>>>     A use case associated with metadata persistence is when
> 	>>>>>>>> someone collects observation records (and implicitly, their
> 	>>>>>>>> determinations) and runs an experiment (a model or
> 	>>>>>>>> simulation) with them. This person may want to record the
> 	>>>>>>>> identifiers of the points used so that someone using the
> 	>>>>>>>> results of that experiment may refer back to the primary
> 	>>>>>>>> data, to validate or repeat the experiment.
> 	>>>>>>>>
> 	>>>>>>>>     The bad news is that the LSID identification scheme (or
> 	>>>>>>>> any other GUID that I know of) was not designed to guarantee
> 	>>>>>>>> metadata persistence, and thus it cannot implement the use
> 	>>>>>>>> case above by itself. To implement that use case, the
> 	>>>>>>>> specification would have to guarantee that the metadata
> 	>>>>>>>> (which we are using here as data) is immutable. But it
> 	>>>>>>>> doesn't.
> 	>>>>>>>>
> 	>>>>>>>>     Most of us wish that metadata was persistent, but it
> 	>>>>>>>> isn't. Many things can change in the metadata: a new
> 	>>>>>>>> determination, a misspelling that is corrected, many things.
> 	>>>>>>>> We just cannot guarantee that the metadata will look as it
> 	>>>>>>>> did some time ago.
> 	>>>>>>>>
> 	>>>>>>>>     We then reach the following conclusion.
> 	>>>>>>>>
> 	>>>>>>>>             Corollary 2: LSID metadata is neither immutable
> 	>>>>>>>>             nor persistent.
> 	>>>>>>>>
> 	>>>>>>>>     The consequence of this corollary is that, if you need to
> 	>>>>>>>> refer back to a piece of information (metadata) associated
> 	>>>>>>>> with an LSID, exactly as it was when you got it, you must
> 	>>>>>>>> make a copy of it, or arrange that someone else make that
> 	>>>>>>>> copy for you.
> 	>>>>>>>>
> 	>>>>>>>>     In other words, a client cannot assume that the metadata
> 	>>>>>>>> associated with an LSID today will be the same tomorrow. If
> 	>>>>>>>> the client does assume that, it may be relying on a false
> 	>>>>>>>> assumption and its output may be flawed.
> 	>>>>>>>>
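
As a rough illustration of the "make a copy" consequence above, here is a minimal Python sketch of a client-side snapshot. All names (MetadataSnapshot, take_snapshot, has_changed) and the example LSID are hypothetical; how the metadata bytes are actually obtained from a getMetadata call is left out, and the sketch only assumes the client already holds them as bytes.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MetadataSnapshot:
    """A client-side copy of LSID metadata as it looked at retrieval time."""
    lsid: str
    retrieved_at: str   # ISO 8601 timestamp in UTC
    sha256: str         # digest of the stored body
    body: bytes         # the metadata exactly as received


def take_snapshot(lsid: str, body: bytes) -> MetadataSnapshot:
    """Store the metadata bytes together with a timestamp and a digest."""
    return MetadataSnapshot(
        lsid=lsid,
        retrieved_at=datetime.now(timezone.utc).isoformat(),
        sha256=hashlib.sha256(body).hexdigest(),
        body=body,
    )


def has_changed(snapshot: MetadataSnapshot, current_body: bytes) -> bool:
    """Report whether freshly retrieved metadata differs from the snapshot."""
    return hashlib.sha256(current_body).hexdigest() != snapshot.sha256


# Hypothetical LSID and metadata bodies, for illustration only.
snap = take_snapshot("urn:lsid:example.org:specimens:1234",
                     b"<rdf>determination A</rdf>")
print(has_changed(snap, b"<rdf>determination A</rdf>"))
print(has_changed(snap, b"<rdf>determination B</rdf>"))
```

An experiment could record such snapshots alongside the LSIDs it used, so that later readers can see the metadata as it was even if the authority has since changed it.
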
> 	>>>>>>>>     If we are not happy with that conclusion, we may develop
> 	>>>>>>>> an additional component in our architecture, an archive of
> 	>>>>>>>> some sort, to handle (meta)data persistence. That is exactly
> 	>>>>>>>> what the STD-DOI project (http://www.std-doi.de/) and SEEK
> 	>>>>>>>> (http://seek.ecoinformatics.org) have done to some extent.
> 	>>>>>>>>
> 	>>>>>>>>     While we cannot guarantee that LSID metadata is
> 	>>>>>>>> persistent or immutable, we can definitely document how the
> 	>>>>>>>> metadata have changed through metadata versioning. That's the
> 	>>>>>>>> topic of the next thread. We will move on to discuss metadata
> 	>>>>>>>> versioning as soon as we are done with metadata persistence.
> 	>>>>>>>>
> 	>>>>>>>>     Cheers,
> 	>>>>>>>>
> 	>>>>>>>> Ricardo
> 	>>>>>>>>
> 	>>>>>>>> _______________________________________________
> 	>>>>>>>> tdwg-guid mailing list
> 	>>>>>>>> tdwg-guid at lists.tdwg.org
> 	>>>>>>>> http://lists.tdwg.org/mailman/listinfo/tdwg-guid
> 	>>>>>>>>
> 	>>>>>> --Robert A. Morris
> 	>>>>>> Professor of Computer Science
> 	>>>>>> UMASS-Boston
> 	>>>>>> ram at cs.umb.edu
> 	>>>>>> http://bdei.cs.umb.edu/
> 	>>>>>> http://www.cs.umb.edu/~ram
> 	>>>>>> http://www.cs.umb.edu/~ram/calendar.html
> 	>>>>>> phone (+1)617 287 6466
> 	>>>>>>




More information about the tdwg-tag mailing list