[tdwg-guid] LSID metadata persistence (or lack thereof)
Hi there folks,
As Chuck mentioned a few weeks ago, we do have a few outstanding issues to address regarding LSIDs. I would like to discuss those one by one, in an orderly manner, and reach consensus as much as we can. Then we can sum them up in a TDWG standard, possibly by or shortly after the Bratislava conference.
The first issue I would like to discuss is *LSID metadata persistence*. First, let me remind you of a corollary established by the LSID specification:
*Corollary 1:* _LSIDs are not guaranteed to be resolvable indefinitely._
In other words, there is no guarantee that one will always be able to retrieve the data associated with an LSID as the authority may choose (or be forced) not to resolve an LSID anymore.
Second, let me distinguish the kind of persistence I'm talking about from two other related concepts (which we'll not discuss in this thread):
1) *Persistence of Assignment:* Once assigned to an object, an LSID is indefinitely associated with it. The same LSID cannot be assigned to another object. Ever! The LSID may not be resolvable anymore, but it cannot be assigned to another object. This is established by the LSID specification.
2) *Persistence of LSID Data:* The data associated with an LSID (i.e., the byte stream returned by the LSID getData call) must never change. Although the LSID may not be resolvable anymore (according to corollary 1), the data associated with an LSID must never ever change. That's defined by the LSID spec, too.
What I want to discuss here is the *persistence of LSID metadata* (what is returned by the getMetadata call) or the lack thereof.
A use case associated with *metadata persistence* is when someone collects observation records (and implicitly, their determinations) and runs an experiment (a model or simulation) with them. This person may want to record the identifiers of the points used so that someone using the results of that experiment may refer back to the primary data, to validate or repeat the experiment.
The bad news is that the LSID identification scheme (or any other GUID that I know of) was not designed to guarantee *metadata persistence*, and thus it cannot implement the use case above by itself. To implement that use case, the specification would have to *guarantee* that the metadata (which we are using here as data) is immutable. But it doesn't.
Most of us wish that metadata were persistent, but it isn't. Many things can change in the metadata: a new determination, a misspelling that is corrected, many things. We just cannot guarantee that the metadata will look the way it did some time ago.
We then reach the following conclusion.
*Corollary 2:* LSID metadata is neither immutable nor persistent.
The consequence of this corollary is that, if you need to refer back to a piece of information (metadata) associated with an LSID, exactly as it was when you got it, you *must make a copy of it*, or arrange that someone else make that copy for you.
In other words, a client cannot *assume* that the metadata associated with an LSID today will be the same tomorrow. If the client does assume that, it may be relying on a false assumption and its output may be flawed.
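To make the "keep your own copy" advice concrete, here is a minimal sketch in Python. It assumes the client has already fetched the metadata bytes by whatever means it uses to call getMetadata; the names `MetadataSnapshot`, `snapshot_metadata` and `has_changed`, and the example LSID, are hypothetical and not part of the LSID specification.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MetadataSnapshot:
    """A frozen copy of the metadata as it was when retrieved."""
    lsid: str
    retrieved_at: str  # ISO 8601 timestamp in UTC
    sha256: str        # digest of the exact bytes received
    content: bytes


def snapshot_metadata(lsid: str, content: bytes) -> MetadataSnapshot:
    """Store the metadata bytes together with a timestamp and a digest."""
    return MetadataSnapshot(
        lsid=lsid,
        retrieved_at=datetime.now(timezone.utc).isoformat(),
        sha256=hashlib.sha256(content).hexdigest(),
        content=content,
    )


def has_changed(snapshot: MetadataSnapshot, current: bytes) -> bool:
    """Return True if metadata fetched now differs from the stored copy."""
    return hashlib.sha256(current).hexdigest() != snapshot.sha256


# Hypothetical LSID and payload, for illustration only.
snap = snapshot_metadata(
    "urn:lsid:example.org:observations:123",
    b"<rdf>determination: Aus bus</rdf>",
)
changed_later = has_changed(snap, b"<rdf>determination: Aus cus</rdf>")
```

The experiment would then cite the snapshot (LSID plus retrieval time and digest) rather than the LSID alone, so a later reader can tell whether the authority's current metadata still matches what was used.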
If we are not happy with that conclusion, we may develop an additional component in our architecture, an archive of some sort, to handle (meta)data persistence. That is exactly what the STD-DOI project (http://www.std-doi.de/) and SEEK (http://seek.ecoinformatics.org) have done to some extent.
While we cannot guarantee that LSID metadata is persistent or immutable, we can definitely document how the metadata has changed through *metadata versioning*. That's the topic of the next thread. We will move on to discuss *metadata versioning* as soon as we are done with *metadata persistence*.
Cheers,
Ricardo
I think the non-immutability of metadata is correct. Basically we would all like everyone else's metadata to stay the same, but we want to be able to change our own metadata at will (correcting errors, improving data quality, adding new information etc.)
Versioning has to be the answer for this, for those databases (e.g. IPNI) that can support it. However, all of the conversations I've had regarding LSIDs and versioning have said that:
1. Versioning in LSIDs is more or less deprecated
2. Versioning in LSIDs is for Data, not Metadata
Accordingly, for IPNI, given that we have versioning, we bodged in a 'versioned as' field in the metadata, so that we can both give you a version number AND supply you with a (hopefully unchanged) copy of the metadata for the version you had, if that's what you want.
But that's anticipating Ricardo's next thread so I'll leave it there.
Sally
*** Sally Hinchcliffe *** Computer section, Royal Botanic Gardens, Kew *** tel: +44 (0)20 8332 5708 *** S.Hinchcliffe@rbgkew.org.uk
Ricardo,
Looking at this definition: "Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change"
Perhaps this is a more straightforward way to conceive LSIDs. The LSID goes with a byte stream. It's that byte stream that must stay the same. So, if there is a byte stream associated with a collection that needs to stay the same, then whatever that byte stream happens to be is the data that gets an LSID assigned to it. That sure seems a clearer definition of what is data and what is metadata, rather than the issue of primary object and all that.
So we can create a new definition in the context of LSIDs: Data is a byte stream that is persistent, never changes and can have an LSID. Metadata is a byte stream that is non-persistent, might change and is only associated with an LSID.
The institution that assigns an LSID can make its own decision about whether the byte stream being provided is persistent or non-persistent. By assigning an LSID to any byte stream, whatever it is, the institution is declaring it to be data and persistent.
So, in the example given of an observation record with a determination that needs to remain fixed and unchanged, by assigning an LSID to that observation+determination it would be "declared to be data" and unchangeable. A different determination would then be different data with a different LSID. That would provide a solution for those who want to employ it. Others could choose not to use it.
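One way to picture this proposal is a toy registry in which each LSID is bound to exactly one byte stream, so a changed determination has to get a new LSID. This is only a sketch under stated assumptions: the class `LsidDataRegistry`, its methods and the example LSIDs are hypothetical, and real LSID authorities are not required to work this way.

```python
import hashlib


class LsidDataRegistry:
    """Toy registry: each LSID is bound to exactly one byte stream."""

    def __init__(self) -> None:
        self._digests: dict[str, str] = {}

    def assign(self, lsid: str, data: bytes) -> None:
        """Bind an LSID to a byte stream; refuse to rebind it to other bytes."""
        digest = hashlib.sha256(data).hexdigest()
        existing = self._digests.get(lsid)
        if existing is not None and existing != digest:
            raise ValueError(f"{lsid} is already bound to a different byte stream")
        self._digests[lsid] = digest

    def matches(self, lsid: str, data: bytes) -> bool:
        """Check whether the given bytes are the ones bound to the LSID."""
        return self._digests.get(lsid) == hashlib.sha256(data).hexdigest()


registry = LsidDataRegistry()

# Hypothetical identifiers: the observation plus its first determination.
registry.assign("urn:lsid:example.org:obsdet:1", b"obs 42; det: Aus bus")

# A different determination is different data, so it gets a different LSID.
registry.assign("urn:lsid:example.org:obsdet:2", b"obs 42; det: Aus cus")
```

Trying to call `assign` again on the first LSID with the second byte stream raises `ValueError`, which is the "declared to be data and unchangeable" rule expressed as a check.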
Chuck
________________________________
From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of Ricardo Pereira Sent: Friday, July 13, 2007 9:47 AM To: tdwg-guid@lists.tdwg.org Subject: [tdwg-guid] LSID metadata persistence (or lack thereof)
Hi Ricardo, Chuck,
Asserting that the byte stream returned as data associated with an LSID should never change is perhaps a bit confusing from a programmatic view. There are, for example, many ways to represent data in XML that are identical from an information content point of view, but the byte streams could be very different.
Perhaps it might be better to state something like "the canonical representation of the data associated with an LSID must not change", or something to that effect?
Dave V.
On Jul 14, 2007, at 05:29, Chuck Miller wrote:
tdwg-guid mailing list tdwg-guid@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-guid
Dave,
What you say is true. But I think we already have too many variations, subtleties, and reinterpretations which are endlessly debated.
The LSID standard would be simple, clear and consistent if we used the identical-byte-stream definition. The LSID would uniquely tag a persistent byte stream. A persistent byte stream is always the same thing without any further explanation or clarification.
The provider of an LSID byte-stream would need to commit to keeping that byte-stream persistent and not represent it in multiple ways, even though technically they could. If they can't commit to that, then it can't be an LSID byte-stream.
And in the name of simplicity and clarity, if they had to provide different byte-stream representations then they would have to assign a different LSID to each and use "SameAs" metadata.
Chuck
-----Original Message----- From: Dave Vieglais [mailto:vieglais@ku.edu] Sent: Friday, July 13, 2007 12:42 PM To: Chuck Miller Cc: Ricardo Pereira; tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
Hi Chuck,
I absolutely have to disagree. Consider that the XML document:
<a/>
can also be represented:
<a />
<a></a>
with identical content, yet the corresponding byte streams are quite different.
What happens if, say, you are generating your XML output from a database using some DOM library, and during an update to your software (perhaps in a library over which you have no control) there is a subtle change in the generation of XML that keeps the content the same but uses one of the alternate representations above? Not only do you violate the "unchanged byte stream" rule when the corresponding LSID is resolved, but downstream consumers that rely on that rule may break even though there is no change in the information content.
It seems more practical, manageable, and achievable to indicate that the canonical form remains constant.
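The difference between the two rules can be sketched with Python's standard library, which has offered an XML canonicalization (C14N 2.0) function, `xml.etree.ElementTree.canonicalize`, since Python 3.8. This is only an illustration of the idea; choosing C14N as "the" canonical form is an assumption here, not something the LSID specification says.

```python
import hashlib
from xml.etree.ElementTree import canonicalize

# The three serializations from the example above.
variants = ["<a/>", "<a />", "<a></a>"]

# Byte-stream view: hash the exact bytes of each serialization.
raw_digests = {hashlib.sha256(v.encode("utf-8")).hexdigest() for v in variants}

# Canonical view: normalize each document first, then compare.
canonical_forms = {canonicalize(v) for v in variants}

byte_streams_identical = len(raw_digests) == 1      # False: three different digests
canonical_forms_identical = len(canonical_forms) == 1  # True: one canonical form
```

Under the byte-stream rule these are three different pieces of data; under a canonical-form rule they are one, because canonicalization writes an empty element as a start tag followed by an end tag regardless of how it was originally serialized.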
Dave V.
On Jul 14, 2007, at 06:03, Chuck Miller wrote:
Dave, What you say is true. But, I think we already have too many variations, subtleties, and reinterpretations which are endlessly debated.
The LSID standard would be simple, clear and consistent if we used the identical-byte-stream definition. The LSID would uniquely tag a persistent byte stream. A persistent byte stream is always the same thing without any further explanation or clarification.
The provider of an LSID byte-stream would need to commit to keeping that byte-stream persistent and not represent it in multiple ways, even though technically they could. If they can't commit to that, then it can't be an LSID byte-stream.
And in the name of simplicity and clarity, if they had to provide different byte-stream representations then they would have to assign a different LSID to each and use "SameAs" metadata.
Chuck
-----Original Message----- From: Dave Vieglais [mailto:vieglais@ku.edu] Sent: Friday, July 13, 2007 12:42 PM To: Chuck Miller Cc: Ricardo Pereira; tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
Hi Ricardo, Chuck, Asserting that the byte stream returned as data associated with an LSID should never change is perhaps a bit confusing from a programmatic view. There are for example many ways to represent data in xml that are identical from an information content point of view, but the byte streams could be very different.
Perhaps it might be better to state something like "the canonical representation of the data associated with an LSID must not change", or something to that effect?
Dave V.
On Jul 14, 2007, at 05:29, Chuck Miller wrote:
Ricardo,
Looking at this definition: "Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change"
Perhaps this is a more straightforward way to conceive LSIDs. The LSID goes with a byte stream. It's that byte stream that must stay the same. So, if there is a byte stream associated with a collection that needs to stay the same, then whatever that byte stream happens to be is the data that gets an LSID assigned to it. That sure seems a clearer definition of what is data and what is metadata, rather than the issue of primary object and all that.
So we can create a new definition in the context of LSIDs: Data is a byte stream that is persistent, never changes and can have an LSID. Metadata is a byte stream is non-persistent, might change and is only associated with an LSID.
The institution who assigns an LSID can make their own decision about whether the byte stream being provided is persistent or non- persistent. By assigning an LSID to any byte stream, whatever it is, the institution is declaring it to be data and persistent.
So, in the example given of an observation record with a determination that needs to remain fixed and unchanged, by assigning an LSID to that observation+determination it would be "declared to be data" and unchangeable. A different determination would then be different data with a different LSID. That would provide a solution for those who want to employ it. Others could choose not to use it.
Chuck
I agree with Chuck's very valid argument that we cannot get so tied up in details that nothing happens, but Dave is also very much correct. In this case I think the resolution is fairly simple, and we need not argue for long about details. Semantic equivalence, not absolute bit-level equivalence, is used in almost all XML-based TDWG standards, including Darwin Core, ABCD, and SDD. We would like a definition that preserves semantic equivalence but does not require absolute equivalence. This does lead to the kind of complication that Chuck rightfully abhors: semantic equivalence requires a method of testing the equivalence. In the TDWG standards this is easy, since the schema, an XML validator, and the data are all we need. In developing the XML-based standards, we have already gambled that the XML validation tools will persist for a long time.
Perhaps "The data associated with an LSID is semantically persistent" would meet both the simplicity Chuck is looking for and the expressiveness Dave points out is necessary. I do not know how many people understand semantic persistence, so it may require a definition or footnote. Just referring to the XML standards should be sufficient.
"Semantic persistence ensures that the framework for interpretation of data will not change across representations, as is the case, for example, with the expressive equivalence of multiple representations of the same information under XML."
It is starting to sound like formal logic but that might be a good thing.
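As a rough illustration of the difference between byte-level and XML-level equivalence, the sketch below compares two serializations of the same element. It uses `xml.etree.ElementTree.canonicalize` from the Python standard library (Python 3.8 or later), which applies XML canonicalization; that only covers syntactic variation such as attribute order and quoting, and is offered here as one possible test, not as the definition of semantic persistence. The element and attribute names are made up for the example.

```python
from xml.etree.ElementTree import canonicalize

# Two serializations of the same information: attribute order, quote style,
# and whitespace inside the start tag differ.
xml_a = '<name rank="species" genus="Homo">Homo sapiens</name>'
xml_b = "<name  genus='Homo'   rank='species' >Homo sapiens</name>"

# Byte-level comparison treats them as different data.
same_bytes = xml_a.encode("utf-8") == xml_b.encode("utf-8")

# After canonicalization the two documents serialize identically.
same_canonical = canonicalize(xml_a) == canonicalize(xml_b)

print("identical byte streams:", same_bytes)
print("identical canonical form:", same_canonical)
```

A stricter notion of semantic equivalence (for example, schema-aware comparison of values) would need more than canonicalization; this only shows why the byte stream alone is a blunt test for XML data.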
In an imperfect world there is no such thing as an 'identical-byte-stream' because the technology we use is imperfect ... the disk controllers that manage our bytes and the disks we use to store them have recognized error rates. Perhaps I'm being a pedant in the above analysis, but I was almost persuaded that, except for digital objects (images, sounds), which can be data, all other 'things' (names, specimen accession numbers) had to be metadata. This makes no sense to me in the real but imperfect world we live in. An LSID assigned to a name (e.g. Homo sapiens) is assigned to the name as data, not metadata. What is 'identical' here is that if the spelling has to change for any reason, the new spelling gets a new LSID and the now incorrect spelling gets deprecated (but is still resolvable), with a pointer to the correct spelling/LSID in the metadata.
OK?
Paul
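The deprecation scheme Paul describes (the old spelling keeps its LSID and stays resolvable, while its metadata points at the corrected name) might look something like the sketch below. The LSIDs, the field names `deprecated` and `replaced_by`, and the `resolve_current` helper are all hypothetical; no TDWG vocabulary is implied.

```python
# Hypothetical metadata store; LSIDs and field names are illustrative only.
METADATA = {
    "urn:lsid:example.org:names:1": {
        "name": "Homo sapeins",  # deliberately misspelled, now deprecated
        "deprecated": True,
        "replaced_by": "urn:lsid:example.org:names:2",
    },
    "urn:lsid:example.org:names:2": {
        "name": "Homo sapiens",
        "deprecated": False,
        "replaced_by": None,
    },
}


def resolve_current(lsid: str, metadata: dict = METADATA) -> str:
    """Follow replacement pointers until a non-deprecated LSID is reached."""
    seen = set()
    while metadata[lsid]["deprecated"]:
        if lsid in seen:
            raise ValueError("cycle in replacement pointers")
        seen.add(lsid)
        lsid = metadata[lsid]["replaced_by"]
    return lsid


current = resolve_current("urn:lsid:example.org:names:1")
print(current, METADATA[current]["name"])
```

The deprecated LSID is never reassigned or deleted, so anything that recorded it can still find both the original spelling and its replacement.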
________________________________
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Chuck Miller Sent: Fri 13/07/2007 19:03 To: Dave Vieglais Cc: tdwg-guid@lists.tdwg.org Subject: RE: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
Dave, What you say is true. But, I think we already have too many variations, subtleties, and reinterpretations which are endlessly debated.
The LSID standard would be simple, clear and consistent if we used the identical-byte-stream definition. The LSID would uniquely tag a persistent byte stream. A persistent byte stream is always the same thing without any further explanation or clarification.
The provider of an LSID byte-stream would need to commit to keeping that byte-stream persistent and not represent it in multiple ways, even though technically they could. If they can't commit to that, then it can't be an LSID byte-stream.
And in the name of simplicity and clarity, if they had to provide different byte-stream representations then they would have to assign a different LSID to each and use "SameAs" metadata.
Chuck
-----Original Message----- From: Dave Vieglais [mailto:vieglais@ku.edu] Sent: Friday, July 13, 2007 12:42 PM To: Chuck Miller Cc: Ricardo Pereira; tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
Hi Ricardo, Chuck, Asserting that the byte stream returned as data associated with an LSID should never change is perhaps a bit confusing from a programmatic view. There are, for example, many ways to represent data in XML that are identical from an information-content point of view, but the byte streams could be very different.
Perhaps it might be better to state something like "the canonical representation of the data associated with an LSID must not change", or something to that effect?
Dave V.
Paul,
That is the exact issue we have yet to definitively and unambiguously resolve. The purist definition of metadata has been an impediment to getting what we need. It is an imperfect world and that world is real.
We need to be able to apply an LSID to that byte-stream "Homo sapiens", if that is what the actual data is, and then have that byte-stream never change. We have millions of items of that particular kind of data waiting to have LSIDs given to them, and a whole lot of other kinds, too.
Chuck
I think this simple string data example fits well under the semantic equivalence definition if we do not interpret it too broadly. If the data being returned is a species name, we can define semantic equivalence for it; the same goes for observation data or any other data type. The data being retrieved by the LSID defines its own semantic equivalence. This means that the data is inherently typed, or associated with its semantics.
Thanks to Ricardo for starting this very timely discussion. I've been following LSIDs for a long time now, and have attended both GBIF GUID workshops, and had some very detailed conversations with Ben Szekely about this very issue, and I think I have a pretty good handle on it. And, it's really not that complicated.
The byte-stream (bit sequence) for the data of a given LSID cannot change, according to the LSID spec. The "meaning" of the data is irrelevant in this context -- what matters is the actual sequence of 1's and 0's. If you have a TIFF image file that represents a 12-megapixel image, and you change one bit of one pixel of that image file, you cannot use the same LSID to represent it. If you package it into a ZIP file, that ZIP file is a new bytestream and could not be returned as the data for that LSID assigned to the TIFF image data object.
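To make the bit-level point concrete, here is a small sketch that compares digests of a byte stream, a copy with one bit flipped, and a compressed packaging of it. The 1 KB buffer merely stands in for a TIFF file, and `zlib` compression stands in for ZIP packaging; both substitutions are assumptions made to keep the example self-contained.

```python
import hashlib
import zlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte stream."""
    return hashlib.sha256(data).hexdigest()


original = bytes(range(256)) * 4   # stand-in for the bytes of an image file

flipped = bytearray(original)
flipped[10] ^= 0b00000001          # change exactly one bit
flipped = bytes(flipped)

packaged = zlib.compress(original)  # stand-in for putting the file in a ZIP

print("one bit changed, same stream?", digest(original) == digest(flipped))
print("compressed, same stream?     ", digest(original) == digest(packaged))
print("decompresses to original?    ", zlib.decompress(packaged) == original)
```

The packaged stream carries exactly the same information and can be unpacked losslessly, yet as a byte stream it is a different object, which is why it could not be returned as the data for the original LSID.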
If we want to change this specification, then we are not using LSIDs anymore -- we are using something like "TDWG identifiers that look an awful lot like LSIDs, but really aren't LSIDs". I think that's the last thing this community should do.
The "data" for LSIDs should be an unambiguous digital object. Species names are not digital objects. They are not even physical objects. In fact, they aren't even text objects (the text string of a species "name", as defined by any of the nomenclatural codes, is a property or attribute of the name-object -- not the name-object itself). Species names are "abstract" or "conceptual" objects -- with no inherent digital manifestation, and not even any inherent physical manifestation. The LSID spec accommodates such objects in the form of "data-less" LSIDs -- that is, LSIDs with zero "data" content (only metadata).
Please, let's not get bogged down in alternate definitions of the words "data" and "metadata". I swear, the single greatest impediment to progress in biodiversity informatics (by far) in my opinion has been human-language semantics. I had to qualify the word "semantics" in the previous sentence with "human-language", because even the very word "semantics" has more than one meaning in our conversations (I almost used the word "vocabulary" instead of "semantics", but of course that word, too, has another meaning within our various conversations). We could fill a small dictionary with words that have more than one meaning in different contexts ("concept", "type", "class", "synonym", and worst of all, "name" -- among many others).
So, when we speak of "data" and "metadata" in the context of LSIDs, let us please use those words specifically in the context of their well-defined meaning as related to LSIDs.
And in this LSID sense of the word "data", many of our objects (taxon names, taxon concepts, locality descriptions, specimens, agents, bibliographic citations, etc.) simply have no "data", because none of these things have any inherent digital manifestation. We could concatenate what would otherwise be LSID-metadata for one of these non-digital objects (e.g., a database record) into a single byte-stream, and define this as "data" tied to a particular LSID, but then a new LSID would need to be issued every time someone wanted to change that bytestream (e.g., convert it from ASCII to UNICODE, or change the meaning, rendering, or content of one of the concatenated metadata elements). For this, and other reasons, I think this is a bad approach.
Instead, I think we should embrace LSIDs *WITH* data (sensu LSID spec) in cases where it makes sense to do so (e.g., image files, PDFs, perhaps DNA sequences represented as an ASCII character stream or some other specified standard binary format), and embrace LSIDs *WITHOUT* data (only metadata) -- as accommodated in the LSID spec -- for most of the non-digital objects we want to exchange information about (taxon names, taxon concepts, locality descriptions, specimens, agents, bibliographic citations, etc.).
Getting back to the intended topic of this discussion (metadata persistence), I frankly am very happy that there is no requirement for metadata persistence in the LSID spec (if there was a requirement for persistence, then you might as well package it all up as data, then use the embedded versioning component of LSIDs or some other mechanism for issuing new LSIDs that are cross-linked to each other in an appropriate way).
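For readers less familiar with the "embedded versioning component", an LSID has the general form urn:lsid:authority:namespace:object with an optional trailing revision. A minimal parser, assuming that colon-separated layout and using a made-up `example.org` authority, could look like this; it is a sketch for discussion, not a validating implementation of the spec.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]  # None when the LSID carries no revision


def parse_lsid(text: str) -> Lsid:
    """Split an LSID into its components; the revision part is optional."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, object_id, revision)


# Hypothetical LSIDs for two revisions of the same record.
v1 = parse_lsid("urn:lsid:example.org:specimens:12345:1")
v2 = parse_lsid("urn:lsid:example.org:specimens:12345:2")
unversioned = parse_lsid("urn:lsid:example.org:specimens:12345")
```

Here `v1` and `v2` share authority, namespace, and object identifier and differ only in `revision`, which is the kind of cross-linking between versions that the paragraph above alludes to.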
I believe the answer to Ricardo's example is better addressed in the next discussion, concerning methods for data versioning. I think the answer to this issue (persistence of metadata) necessarily must be solved via that discussion (versioning), so maybe we should discuss the versioning issue first.
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences and Associate Zoologist in Ichthyology Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
In the spirit of self-clarification as demonstrated by Bob (with whom I agree on this issue of LSID spec/etc.), I should point out the following:
Rich said 'The LSID spec accommodates such objects in the form of "data-less" LSIDs -- that is, LSIDs with zero "data" content (only metadata).'
Rich should have said 'The LSID spec accommodates such objects in the form of "data-less" (my term) LSIDs -- that is, LSIDs with zero "data" content (only metadata).'
Rich did not mean to imply, through use of quotation marks, that the term "data-less" is included in the LSID spec. The words used in the spec (if I remember correctly) are "conceptual" and "abstract", which I believe are used in the LSID spec synonymously. I avoid the term "conceptual" like the plague, because inevitably it is interpreted by some as "LSIDs applied to taxon concepts". And though less of a problem, I'm worried that some might interpret "abstract" as "LSIDs applied to abstractions of data objects", or maybe "LSIDs applied to publication objects, for which only the Abstract is returned". The term "data-less" seems to cut straight to the chase, and is not (to my knowledge) homonymous with anything else in our usual lexicon.
Aloha, Rich
Please, let's not get bogged down in alternate definitions of the word "data" and "metadata". I swear, the single greatest impediment to progress in biodiversity informatics (by far) in my opinion has been human-language semantics. I had to qualify the word "semantics" in the previous sentence with "human-language", because even the very word "semantics" has more than one meaning in our conversations (I almost used the word "vocabulary" instead of "sematics", but of course that word, too, has another meaning within our various conversations). We could fill a small dictionary with words that have more than one meaning in different contexts ("concept", "type", "class", "synonym", and worst of all, "name" -- among many others).
So, when we speak of "data" and "metadata" in the context of LSIDs, let us please use those words specifically in the context of their well-defined meaning as related to LSIDs.
And in this LSID sense of the word "data", many of our objects (taxon names, taxon concepts, locality descriptions, specimens, agents, bibliographic citations, etc.) simply have no "data", because none of these things have any inherent digital manifestation. We could concatenate what would otherwise be LSID-metadata for one of these non-digital objects (e.g., a database record) into a single byte-stream, and define this as "data" tied to a particular LSID, but then a new LSID would need to be issued every time someone wanted to change that byte-stream (e.g., convert it from ASCII to UNICODE, or change the meaning, rendering, or content of one of the concatenated metadata elements). For this and other reasons, I think this is a bad approach.
Instead, I think we should embrace LSIDs *WITH* data (sensu LSID spec) in cases where it makes sense to do so (e.g., image files, PDFs, perhaps DNA sequences represented as an ASCII character stream or some other specified standard binary format), and embrace LSIDs *WITHOUT* data (only metadata) -- as accommodated in the LSID spec -- for most of the non-digital objects we want to exchange information about (taxon names, taxon concepts, locality descriptions, specimens, agents, bibliographic citations, etc.).
Getting back to the intended topic of this discussion (metadata persistence), I frankly am very happy that there is no requirement for metadata persistence in the LSID spec (if there were a requirement for persistence, then you might as well package it all up as data, then use the embedded versioning component of LSIDs or some other mechanism for issuing new LSIDs that are cross-linked to each other in an appropriate way).
I believe the answer to Ricardo's example is better addressed in the next discussion, concerning methods for data versioning. I think the answer to this issue (persistence of metadata) necessarily must be solved via that discussion (versioning), so maybe we should discuss the versioning issue first.
Aloha, Rich
tdwg-guid mailing list tdwg-guid@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-guid
This is stated clearly along with Bob's comments.
My complication about semantics is unnecessary for the LSID definition.
LSID data does not change, meaning it always has the same bit pattern for a given LSID. Since XML allows different bit-level expressions for equivalent records, there is a mismatch with the LSID mechanism. The community can live with this as long as additional constraints are put on the generation of XML-based records. For the sake of simplicity, keep the fixed bit-level expression. The existing metadata mechanism handles the semantics of interpreting the data (sorry to use the word "semantics", but it is nothing really special, just a definition of the "meaning" of the data). The semantics are relevant because they tell what can be done with the data. Some data is internally defined, but not all, so the metadata mechanism in LSID covers those cases.
Everything is fine and there is nothing to work out... except how to actually use the mechanism.
LSID data does not change, meaning it always has the same bit pattern for a given LSID. Since XML allows different bit-level expressions for equivalent records, there is a mismatch with the LSID mechanism.
There is only a mismatch if you try to return XML as LSID "data". I don't see any reason to do this, unless the XML file *is* the object to which the LSID is applied (as opposed to the object that the XML content attempts to describe, such as a specimen or a taxon name). If, for some reason, someone would want to encapsulate an XML file as the LSID-identified "data", then you would have to do it in a way that "locked in" the bytestream of the XML in a way that is bit-level persistent.
The community can live with this as long as additional constraints are put on the generation of XML-based records. For the sake of simplicity, keep the fixed bit-level expression. The existing metadata mechanism handles the semantics of interpreting the data (sorry to use the word "semantics", but it is nothing really special, just a definition of the "meaning" of the data).
No problem on the use of "semantics", because it's clear which meaning you intended from the context of how you used it (i.e., the semantics of the word semantics was not opaque... :-) )
Aloha, Rich
Hmmm, we are in trouble. It seems that placing the "record" in the LSID metadata is gaming the system. An observation record of a bird is, in some real sense, data. We can say it is metadata about the bird, but under that use it is not data about data; it is data about an object, that instance of a bird at that point in time. I would prefer not to treat the record as abstract. We cannot put the observation record into the metadata, because metadata has the nice property of allowing us to change the form. The LSID metadata should tell us the semantics of the data. If we use the metadata to save the record, we need meta-metadata to save the semantics.
From page 10 of the spec: "bytes getData (LSID lsid) This method is used to return data associated with the given lsid. If a copy of the data represented by an LSID cannot be returned for any reason, an exception should be raised. If the given lsid represents an abstract entity (a concept), this method returns an empty array of bytes. Note that the semantics of the returned bytes is not defined by this specification. It is either known from an external documentation, or (preferably) it is available by reading the metadata for this particular lsid. "<----
As Dave points out, the bit-identity constraint is a problem when XML is the payload. Current TDWG standards do not enforce a particular canonical form for XML documents. They could when the documents are carried as LSID data. That additional constraint or specification would need to be carried in the metadata.
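To make the getData passage quoted above concrete, here is a minimal Python sketch of a resolver in which a data-less (abstract) LSID returns an empty byte array while everything known about it lives in the metadata. The class name, the in-memory dictionaries, the metadata keys, and the example LSID are all assumptions made for illustration; none of them come from the LSID spec or from any TDWG software.

```python
class InMemoryResolver:
    """Hypothetical resolver that mirrors only the getData/getMetadata split."""

    def __init__(self):
        self._data = {}      # LSID -> bytes, fixed at registration
        self._metadata = {}  # LSID -> dict, allowed to change later

    def register(self, lsid, metadata, data=b""):
        # An LSID is assigned once and never reused for another object.
        if lsid in self._data:
            raise ValueError("LSID already assigned: " + lsid)
        self._data[lsid] = bytes(data)
        self._metadata[lsid] = dict(metadata)

    def get_data(self, lsid):
        # An unknown LSID raises KeyError, standing in for the spec's exception.
        return self._data[lsid]

    def get_metadata(self, lsid):
        return dict(self._metadata[lsid])

    def update_metadata(self, lsid, **changes):
        # Metadata may change; the registered bytes never do.
        self._metadata[lsid].update(changes)


resolver = InMemoryResolver()
NAME_LSID = "urn:lsid:example.org:names:12345"  # made-up identifier
resolver.register(NAME_LSID, {"type": "TaxonName", "nameComplete": "Homo sapiens"})
resolver.update_metadata(NAME_LSID, rank="species")
```

In this sketch the taxon name has no bytes of its own, so `get_data` returns `b""`, while the description can be revised through `update_metadata` without touching the data side.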
-- Bryan
PS: Someone should send a medic to Chuck's office. Chuck is likely under his desk pulling out his hair muttering something about never "doing" anything.
This entire discussion confuses me. The LSID standard is published. Why is there a discussion of what an LSID should be? The standard requires the data, as defined by the return of getData, to be identical for all resolutions of the LSID. From page 9 of the LSID spec:
"bytes getData (LSID lsid)
bytes getDataByRange (LSID lsid, integer start, integer length)
Metadata_response getMetadata (LSID lsid, string[] accepted_formats)
Metadata_response getMetadataSubset (LSID lsid, string[] accepted_formats, string selector)
The data retrieval services may implement all of the methods, or only methods for retrieving data, or only methods for retrieving associated metadata. The same LSID named data object must be resolved always to the same set of bytes. Therefore, all of the data retrieval services return the same results for the same LSID. The user has, however, the choice of which one of these to utilize depending on its location, known quality of service and other attributes. With metadata, the situation is different. Each data retrieval service can provide different metadata for the same LSID."
This doesn't seem very ambiguous to me, and doesn't have anything to do with imperfect storage of data or anything else about the physical or electronic world. If two calls to getData() with the same argument on two occasions to possibly two different resolution services do not yield the same set of bytes, then one or the other or both of those is not executing a compliant service response. Unless this discussion is really "Shall we call something other than the return of getData by the term 'data associated with the LSID'?", there seems to be nothing to discuss.
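The compliance rule stated here can be sketched as a byte-level comparison across resolution services. This is only an illustration: the two service functions below are hypothetical stand-ins for real getData endpoints, and comparing SHA-256 digests is one convenient way to compare byte streams, not something the spec prescribes.

```python
import hashlib


def same_bytes(lsid, services):
    """Return True if every service returns identical bytes for the LSID."""
    digests = {hashlib.sha256(get_data(lsid)).hexdigest() for get_data in services}
    return len(digests) == 1


# Hypothetical services: same information content, different byte streams.
def service_a(lsid):
    return b'<record id="42"/>'


def service_b(lsid):
    return b'<record id="42"></record>'


LSID = "urn:lsid:example.org:observations:42"  # made-up identifier
consistent = same_bytes(LSID, [service_a, service_a])
inconsistent = same_bytes(LSID, [service_a, service_b])
```

Under the rule as quoted, the second pair would count as non-compliant even though both responses describe the same record.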
Bob
On 7/13/07, Paul Kirk p.kirk@cabi.org wrote:
In an imperfect world there is no such thing as an 'identical-byte-stream' because the technology we use is imperfect ... the disk controllers which manage our bytes and the disks we use to store our bytes have recognized error rates. Perhaps I'm being a pedant in the above analysis, but I was almost persuaded that, except for digital objects (images, sounds) which can be data, all other 'things' (names, specimen accession numbers) had to be metadata. This to me makes no sense in the real but imperfect world we live in. An LSID assigned to a name (e.g. Homo sapiens) is assigned to the name as data, not metadata. What is 'identical' here is that if the spelling has to change for any reason, the new spelling gets a new LSID and the now incorrect spelling gets deprecated (but is still resolvable) with a pointer to the correct spelling/LSID in the metadata.
OK?
Paul
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Chuck Miller Sent: Fri 13/07/2007 19:03 To: Dave Vieglais Cc: tdwg-guid@lists.tdwg.org Subject: RE: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
Dave, What you say is true. But, I think we already have too many variations, subtleties, and reinterpretations which are endlessly debated.
The LSID standard would be simple, clear and consistent if we used the identical-byte-stream definition. The LSID would uniquely tag a persistent byte stream. A persistent byte stream is always the same thing without any further explanation or clarification.
The provider of an LSID byte-stream would need to commit to keeping that byte-stream persistent and not represent it in multiple ways, even though technically they could. If they can't commit to that, then it can't be an LSID byte-stream.
And in the name of simplicity and clarity, if they had to provide different byte-stream representations then they would have to assign a different LSID to each and use "SameAs" metadata.
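A rough Python sketch of this proposal follows: each byte-stream representation gets its own LSID, and the metadata of each one points at its siblings. The `sameAs` key, the way the representation label is appended to the object identifier, and the `example.org` authority are assumptions for illustration only; the thread does not fix any of these details.

```python
def assign_lsids(authority, namespace, object_id, representations):
    """Give each representation its own LSID and cross-link them in metadata.

    `representations` maps a label (e.g. "ascii") to the bytes served as data.
    """
    data = {}
    for label, payload in representations.items():
        lsid = f"urn:lsid:{authority}:{namespace}:{object_id}-{label}"
        data[lsid] = bytes(payload)
    # Hypothetical "sameAs" metadata: every LSID lists the other representations.
    metadata = {
        lsid: {"sameAs": sorted(other for other in data if other != lsid)}
        for lsid in data
    }
    return data, metadata


data, metadata = assign_lsids(
    "example.org",
    "observations",
    "42",
    {"ascii": "Puffinus".encode("ascii"), "utf16": "Puffinus".encode("utf-16")},
)
```

Each LSID then resolves to exactly one fixed byte stream, and a client that prefers another encoding follows the `sameAs` links instead of expecting the same LSID to return different bytes.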
Chuck
-----Original Message----- From: Dave Vieglais [mailto:vieglais@ku.edu] Sent: Friday, July 13, 2007 12:42 PM To: Chuck Miller Cc: Ricardo Pereira; tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
Hi Ricardo, Chuck, Asserting that the byte stream returned as data associated with an LSID should never change is perhaps a bit confusing from a programmatic view. There are for example many ways to represent data in xml that are identical from an information content point of view, but the byte streams could be very different.
Perhaps it might be better to state something like "the canonical representation of the data associated with an LSID must not change", or something to that effect?
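The difference between byte identity and canonical identity can be shown with Python's standard library, which has provided a C14N 2.0 canonicalizer since Python 3.8. The two small documents are made up for the example; the sketch only demonstrates the comparison and is not a proposal for which canonical form TDWG should adopt.

```python
from xml.etree.ElementTree import canonicalize  # available in Python 3.8+

# Two serializations of the same record: attribute order, quoting,
# and empty-element syntax all differ.
doc_a = '<record b="2" a="1"/>'
doc_b = "<record a='1'   b='2'></record>"

raw_equal = doc_a.encode("utf-8") == doc_b.encode("utf-8")
canonical_equal = canonicalize(doc_a) == canonicalize(doc_b)
```

Here `raw_equal` is false while `canonical_equal` is true, which is the gap between the byte-stream wording and the canonical-representation wording suggested above.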
Dave V.
On Jul 14, 2007, at 05:29, Chuck Miller wrote:
Ricardo,
Looking at this definition: "Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change"
Perhaps this is a more straightforward way to conceive LSIDs. The LSID goes with a byte stream. It's that byte stream that must stay the same. So, if there is a byte stream associated with a collection that needs to stay the same, then whatever that byte stream happens to be is the data that gets an LSID assigned to it. That sure seems a clearer definition of what is data and what is metadata, rather than the issue of primary object and all that.
So we can create a new definition in the context of LSIDs: Data is a byte stream that is persistent, never changes and can have an LSID. Metadata is a byte stream that is non-persistent, might change and is only associated with an LSID.
The institution that assigns an LSID can make its own decision about whether the byte stream being provided is persistent or non-persistent. By assigning an LSID to any byte stream, whatever it is, the institution is declaring it to be data and persistent.
So, in the example given of an observation record with a determination that needs to remain fixed and unchanged, by assigning an LSID to that observation+determination it would be "declared to be data" and unchangeable. A different determination would then be different data with a different LSID. That would provide a solution for those who want to employ it. Others could choose not to use it.
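One way to picture "a different determination gets a different LSID" is through the optional revision component of the LSID syntax (urn:lsid:authority:namespace:object[:revision]). The sketch below assumes integer revisions that start at 1, which is a convention chosen for the example; the function names are hypothetical, and issuing an unrelated new LSID would satisfy the proposal equally well.

```python
def parse_lsid(lsid):
    """Split an LSID into (authority, namespace, object_id, revision or None)."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError("not a well-formed LSID: " + lsid)
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    return authority, namespace, object_id, revision


def next_revision(lsid):
    """Return the LSID for the next fixed version of the same object.

    Assumes revisions are integers; an LSID without one is followed by 1.
    """
    authority, namespace, object_id, revision = parse_lsid(lsid)
    number = int(revision) + 1 if revision else 1
    return f"urn:lsid:{authority}:{namespace}:{object_id}:{number}"


first = "urn:lsid:example.org:observations:42"  # made-up identifier
second = next_revision(first)   # observation with a new determination
third = next_revision(second)   # and another one after that
```

Each of `first`, `second`, and `third` would name its own unchanging observation+determination byte stream.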
Chuck
From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of Ricardo Pereira Sent: Friday, July 13, 2007 9:47 AM To: tdwg-guid@lists.tdwg.org Subject: [tdwg-guid] LSID metadata persistence (or lack thereof)
Hi there folks,
As Chuck mentioned a few weeks ago, we do have a few outstanding issues to address regarding LSIDs. I would like to discuss those one by one, in an orderly manner, and reach consensus as much as we can. Then we can sum them up in a TDWG standard, possibly by or shortly after the Bratislava conference.
The first issue I would like to discuss is LSID metadata persistence. First, let me remind you of a corollary established by the LSID specification:
Corollary 1: LSIDs are not guaranteed to be resolvable indefinitely.
In other words, there is no guarantee that one will always be able to retrieve the data associated with an LSID, as the authority may choose (or be forced) not to resolve an LSID anymore.
Second, let me distinguish the kind of persistence I'm talking about from two other related concepts (which we'll not discuss in this thread):
1) Persistence of Assignment: Once assigned to an object, an LSID is indefinitely associated with it. The same LSID cannot be assigned to another object. Ever! The LSID may not be resolvable anymore, but it cannot be assigned to another object. This is established by the LSID specification.
2) Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change. Although the LSID may not be resolvable anymore (according to corollary 1), the data associated with an LSID must never ever change. That's defined by the LSID spec, too.
What I want to discuss here is the persistence of LSID metadata (what is returned by the getMetadata call) or the lack thereof.
A use case associated with metadata persistence is when someone collects observation records (and implicitly, their determinations) and runs an experiment (a model or simulation) with them. This person may want to record the identifiers of the points used so that someone using the results of that experiment may refer back to the primary data, to validate or repeat the experiment.
The bad news is that the LSID identification scheme (or any other GUID that I know of) was not designed to guarantee metadata persistence, and thus it cannot implement the use case above by itself. To implement that use case, the specification would have to guarantee that the metadata (which we are using here as data) is immutable. But it doesn't.
Most of us wish that metadata was persistent, but it isn't. Many things can change in the metadata: a new determination, a mispeling that is corrected, many things. We just cannot guarantee that the metadata will look as it did some time ago.
We then reach the following conclusion.
Corollary 2: LSID metadata is neither immutable nor persistent.
The consequence of this corollary is that, if you need to refer back to a piece of information (metadata) associated with an LSID, exactly as it was when you got it, you must make a copy of it, or arrange that someone else make that copy for you.
In other words, a client cannot assume that the metadata associated with an LSID today will be the same tomorrow. If the client does assume that, it may be relying on a false assumption and its output may be flawed.
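The "make a copy" consequence of Corollary 2 can be sketched in a few lines of Python. The snapshot layout, the use of JSON with sorted keys as the stored form, and the function names are assumptions made for this illustration; a real client would snapshot whatever its getMetadata call actually returned.

```python
import hashlib
import json
from datetime import datetime, timezone


def snapshot_metadata(lsid, metadata):
    """Freeze the metadata as retrieved, with a timestamp and a digest."""
    body = json.dumps(metadata, sort_keys=True)
    return {
        "lsid": lsid,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
        "metadata": body,
    }


def has_changed(snapshot, current_metadata):
    """Compare freshly retrieved metadata against the stored snapshot."""
    current = json.dumps(current_metadata, sort_keys=True)
    digest = hashlib.sha256(current.encode("utf-8")).hexdigest()
    return digest != snapshot["sha256"]


LSID = "urn:lsid:example.org:observations:42"  # made-up identifier
snap = snapshot_metadata(LSID, {"determination": "Puffinus griseus"})
unchanged = has_changed(snap, {"determination": "Puffinus griseus"})
redetermined = has_changed(snap, {"determination": "Puffinus tenuirostris"})
```

An experiment that stores `snap` alongside its results can later be re-run against the metadata exactly as it was, and `has_changed` tells a client when the authority's current answer has drifted from that copy.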
If we are not happy with that conclusion, we may develop an additional component in our architecture, an archive of some sort, to handle (meta)data persistence. That is exactly what the STD-DOI project (http://www.std-doi.de/) and SEEK (http://seek.ecoinformatics.org) have done to some extent.
While we cannot guarantee that LSID metadata is persistent nor immutable, we can definitely document how the metadata have changed through metadata versioning. That's the topic of the next thread. We will move on to discuss metadata versioning as soon as we are done with metadata persistence.
Cheers,
Ricardo
Ah,
Bob said "If two calls to getData() with the same argument on two occasions to possibly two different resolution services do not yield the same set of bytes, then one or the other or both of those is not executing a compliant service response. " but meant
"If two calls to getData() with the same argument on two occasions to possibly two different resolution services do not yield the same set of bytes, then one or the other or both of those responses is not compliant".
Bob did not mean to discuss whether the channel, the requestor, or the responder was the cause of the non-compliant response.
-- Robert A. Morris Professor of Computer Science UMASS-Boston ram@cs.umb.edu http://bdei.cs.umb.edu/ http://www.cs.umb.edu/~ram http://www.cs.umb.edu/~ram/calendar.html phone (+1)617 287 6466
Hi Bob, Just because a standard is published does not mean that it is practical. Requiring that the set of bytes referenced by an LSID remain unchanged has many implications for the implementation of data services. For example, if we agree to abide by the rule that the blob referenced by an LSID remains forever unchanged, then the data provider has to store the data as a blob rather than risk reconstructing it on the fly from some database. This matters especially for data expressed in XML, where functionally identical objects (constructed using different DOM libraries, for example) are not identical blobs.
Asserting that two instances of an object with the same LSID are semantically equivalent is a vastly more complicated process than asserting that the canonical representations of those instances are identical. Generally, a simple set of guidelines can be defined for constructing the canonical form of an object (e.g., for XML, http://www.w3.org/TR/xml-c14n), whereas asserting semantic equivalence is an ongoing topic of research.
Requiring identical blobs is certainly possible, but people need to be aware of the implications of such a requirement in the early stages of designing a system to support such a specification. My preference for the canonical form relaxes the implementation requirements considerably whilst still maintaining the integrity of the data and the intent of the LSID.
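The implementation trade-off described here can be illustrated with two standard-library XML serializers. In this sketch the same database row is serialized once with ElementTree and once with minidom, and the results are not byte-identical; a provider that must honour byte identity therefore stores the bytes once, at assignment time. The `BlobStore` class, the row contents, and the LSID are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.dom.minidom import Document

row = {"id": "42", "name": "Homo sapiens"}  # stand-in for a database record


def with_etree(row):
    element = ET.Element("record", id=row["id"])
    element.text = row["name"]
    return ET.tostring(element)


def with_minidom(row):
    doc = Document()
    element = doc.createElement("record")
    element.setAttribute("id", row["id"])
    element.appendChild(doc.createTextNode(row["name"]))
    doc.appendChild(element)
    return doc.toxml(encoding="utf-8")


class BlobStore:
    """Hypothetical store that fixes the bytes when the LSID is assigned."""

    def __init__(self):
        self._blobs = {}

    def assign(self, lsid, payload):
        if lsid in self._blobs:
            raise ValueError("LSID already assigned: " + lsid)
        self._blobs[lsid] = bytes(payload)

    def get_data(self, lsid):
        return self._blobs[lsid]


store = BlobStore()
LSID = "urn:lsid:example.org:names:42"  # made-up identifier
store.assign(LSID, with_etree(row))
rebuilt_differs = with_etree(row) != with_minidom(row)
stored_is_stable = store.get_data(LSID) == store.get_data(LSID)
```

Serving from the stored blob keeps getData stable even if the provider later switches XML libraries, at the cost of keeping a second copy of every record; comparing canonical forms instead would remove that storage requirement.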
regards, Dave V.
On Jul 14, 2007, at 08:08, Bob Morris wrote:
This entire discussion confuses me. The LSID standard is published. Why is there a discussion of what an LSID should be? The standard requires the data, as defined by the return of getData, to be identical for all resolutions of the LSID. From page 9 of the LSID spec:
"
  bytes getData (LSID lsid)
  bytes getDataByRange (LSID lsid, integer start, integer length)
  Metadata_response getMetadata (LSID lsid, string[] accepted_formats)
  Metadata_response getMetadataSubset (LSID lsid, string[] accepted_formats, string selector)

The data retrieval services may implement all of the methods, or only methods for retrieving data, or only methods for retrieving associated metadata. The same LSID named data object must be resolved always to the same set of bytes. Therefore, all of the data retrieval services return the same results for the same LSID. The user has, however, the choice of which one of these to utilize depending on its location, known quality of service and other attributes. With metadata, the situation is different. Each data retrieval service can provide different metadata for the same LSID."
This doesn't seem very ambiguous to me, and doesn't have anything to do with imperfect storage of data or anything else about the physical or electronic world. If two calls to getData() with the same argument, on two occasions, to possibly two different resolution services do not yield the same set of bytes, then one or the other or both of those is not executing a compliant service response. Unless this discussion is really "Shall we call something other than the return of getData by the term 'data associated with the LSID'?", there seems to be nothing to discuss.
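The compliance check described above can be sketched in a few lines. This is an illustration under stated assumptions, not a real LSID client: each resolution service is modelled as a plain Python callable that takes an LSID string and returns bytes, and the LSID, service names and payloads are made up.

```python
import hashlib
from typing import Callable, Dict

# Assumption: a resolver is any callable that maps an LSID to the bytes
# its getData call would return.
Resolver = Callable[[str], bytes]


def data_digests(lsid: str, resolvers: Dict[str, Resolver]) -> Dict[str, str]:
    """Fetch the data for one LSID from every resolver and hash each result."""
    return {
        name: hashlib.sha256(get_data(lsid)).hexdigest()
        for name, get_data in resolvers.items()
    }


def is_byte_identical(lsid: str, resolvers: Dict[str, Resolver]) -> bool:
    """True when every resolver returned exactly the same bytes."""
    return len(set(data_digests(lsid, resolvers).values())) <= 1


# Hypothetical stand-ins for two resolution services.
lsid = "urn:lsid:example.org:names:1"
agreeing = {
    "service_a": lambda _lsid: b"<name>Homo sapiens</name>",
    "service_b": lambda _lsid: b"<name>Homo sapiens</name>",
}
disagreeing = {
    "service_a": lambda _lsid: b"<name>Homo sapiens</name>",
    "service_b": lambda _lsid: b"<name>Homo sapiens</name>\n",
}

print(is_byte_identical(lsid, agreeing))
print(is_byte_identical(lsid, disagreeing))
```

Under the byte-identity reading, a single trailing newline is enough to make one of the two services non-compliant.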
Bob
On 7/13/07, Paul Kirk p.kirk@cabi.org wrote:
In an imperfect world there is no such thing as an 'identical-byte-stream' because the technology we use is imperfect ... the disk controllers which manage our bytes and the disks we use to store our bytes have recognized error rates. Perhaps I'm being a pedant in the above analysis but I was almost persuaded that except for digital objects (images, sounds) which can be data all other 'things' (names, specimen accession numbers) had to be metadata. This to me makes no sense in the real but imperfect world we live in. An LSID assigned to a name (e.g. Homo sapiens) is assigned to the name as data, not metadata. What is 'identical' here is that if the spelling has to change for any reason the new spelling gets a new LSID and the now incorrect spelling gets deprecated (but is still resolvable) with a pointer to the correct spelling/LSID in the metadata.
OK?
Paul
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Chuck Miller Sent: Fri 13/07/2007 19:03 To: Dave Vieglais Cc: tdwg-guid@lists.tdwg.org Subject: RE: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
Dave, What you say is true. But, I think we already have too many variations, subtleties, and reinterpretations which are endlessly debated.
The LSID standard would be simple, clear and consistent if we used the identical-byte-stream definition. The LSID would uniquely tag a persistent byte stream. A persistent byte stream is always the same thing without any further explanation or clarification.
The provider of an LSID byte-stream would need to commit to keeping that byte-stream persistent and not represent it in multiple ways, even though technically they could. If they can't commit to that, then it can't be an LSID byte-stream.
And in the name of simplicity and clarity, if they had to provide different byte-stream representations then they would have to assign a different LSID to each and use "SameAs" metadata.
Chuck
-----Original Message----- From: Dave Vieglais [mailto:vieglais@ku.edu] Sent: Friday, July 13, 2007 12:42 PM To: Chuck Miller Cc: Ricardo Pereira; tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
Hi Ricardo, Chuck, Asserting that the byte stream returned as data associated with an LSID should never change is perhaps a bit confusing from a programmatic view. There are for example many ways to represent data in xml that are identical from an information content point of view, but the byte streams could be very different.
Perhaps it might be better to state something like "the canonical representation of the data associated with an LSID must not change", or something to that effect?
Dave V.
On Jul 14, 2007, at 05:29, Chuck Miller wrote:
Ricardo,
Looking at this definition: "Persistence of LSID Data: The data associated with an LSID (i.e., the byte stream returned by the LSID getData call) must never change"
Perhaps this is a more straightforward way to conceive LSIDs. The LSID goes with a byte stream. It's that byte stream that must stay the same. So, if there is a byte stream associated with a collection that needs to stay the same, then whatever that byte stream happens to be is the data that gets an LSID assigned to it. That sure seems a clearer definition of what is data and what is metadata, rather than the issue of primary object and all that.
So we can create a new definition in the context of LSIDs: Data is a byte stream that is persistent, never changes and can have an LSID. Metadata is a byte stream that is non-persistent, might change and is only associated with an LSID.
The institution that assigns an LSID can make its own decision about whether the byte stream being provided is persistent or non-persistent. By assigning an LSID to any byte stream, whatever it is, the institution is declaring it to be data and persistent.
So, in the example given of an observation record with a determination that needs to remain fixed and unchanged, by assigning an LSID to that observation+determination it would be "declared to be data" and unchangeable. A different determination would then be different data with a different LSID. That would provide a solution for those who want to employ it. Others could choose not to use it.
Chuck
tdwg-guid mailing list tdwg-guid@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-guid
-- Robert A. Morris Professor of Computer Science UMASS-Boston ram@cs.umb.edu http://bdei.cs.umb.edu/ http://www.cs.umb.edu/~ram http://www.cs.umb.edu/~ram/calendar.html phone (+1)617 287 6466
In terms of the metadata returned from an LSID, or any other digital identifier, there are definite cases where metadata must be semantically persistent in order to preserve the utility of data and accuracy of scientific results.
As a trivial example, given a set of observations collected at time t, one can represent the data for those observations in dataset D and the metadata for the dataset, including the time value t, in a metadata document M. In a later event, it is discovered that t was entered incorrectly, and needs to be adjusted, creating metadata document M'. That M and M' are not congruent is critical knowledge when analyzing data from D with data from another dataset D2. In other words, because there is no true distinction between data and metadata (any given piece of information can be stored in either location), a proper archive must be able to distinguish any changes in the data and any changes in the metadata.
That said, there are some metadata that could change with little or no impact on data interpretation (e.g., the spelling of the street on which Technician Tom gets his snailmail). But at present it is impossible, in the general case, to distinguish this kind of metadata from the important kind under the existing metadata standards in use (e.g., FGDC BDP, ISO 19115, EML, GML, etc.).
Our process in the KNB/SEEK/NCEAS and other ecological data archives is to give persistent identifiers to both data objects and metadata objects, and provide new identifiers when either changes.
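A rough sketch of that policy follows: every revision of a data or metadata object gets its own identifier, and earlier revisions stay retrievable unchanged. The Archive class and the "prefix:name.revision" identifier scheme are assumptions made up for this illustration; they are not the actual KNB, SEEK or NCEAS implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Archive:
    """Toy archive: content stored under an identifier is never overwritten."""

    prefix: str
    _objects: Dict[str, bytes] = field(default_factory=dict)
    _revisions: Dict[str, List[str]] = field(default_factory=dict)

    def put(self, name: str, content: bytes) -> str:
        """Store content; mint a new identifier only if the content changed."""
        revisions = self._revisions.setdefault(name, [])
        if revisions and self._objects[revisions[-1]] == content:
            return revisions[-1]  # unchanged, so the identifier is reused
        identifier = f"{self.prefix}:{name}.{len(revisions) + 1}"
        self._objects[identifier] = content
        revisions.append(identifier)
        return identifier

    def get(self, identifier: str) -> bytes:
        """Return exactly the bytes originally stored under this identifier."""
        return self._objects[identifier]


archive = Archive(prefix="example")
m1 = archive.put("M", b"time=2007-07-13T09:00")   # metadata document M
same = archive.put("M", b"time=2007-07-13T09:00")  # no change, same identifier
m2 = archive.put("M", b"time=2007-07-14T09:00")   # corrected t gives M'

print(m1, same, m2)
```

A result computed against M can then cite the first identifier, and anyone repeating the analysis can see that M' carries a different one.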
Matt
I think we are all in agreement that the data and metadata referenced by an LSID remain unchanged (in the case of the metadata, semantic equivalence is a requirement for reasons such as those outlined by Matt). My question has to do purely with the data that an LSID references through the getData() operation. The form of that data could be anything really - an encrypted byte stream, digital image, Open Office document, spreadsheet, XML document...
We all know that the same data can be represented in many ways that are logically, semantically and functionally equivalent yet form a different set of bytes when serialized. Data expressed in XML is one example (is <a/> = <a /> = <a></a> ?). A palette-based image is another - the order of colors in the palette may be changed, and the pixel values adjusted to match the new palette order, but the image is still the same. Many more simple examples can be constructed that violate the unchanged bytes rule even though, for all practical and functional purposes, the data are unchanged.
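The palette case can be shown with a toy model. The sketch below assumes an indexed image is just a list of RGB palette entries plus a list of per-pixel palette indices, and uses a made-up serialization (palette bytes followed by index bytes) rather than any real image format.

```python
from typing import List, Tuple

Color = Tuple[int, int, int]


def decode(palette: List[Color], indices: List[int]) -> List[Color]:
    """Expand an indexed image into the actual color of each pixel."""
    return [palette[i] for i in indices]


def serialize(palette: List[Color], indices: List[int]) -> bytes:
    """Hypothetical on-disk form: palette entries, then pixel indices."""
    flat_palette = [channel for color in palette for channel in color]
    return bytes(flat_palette) + bytes(indices)


# The same 2x2 image stored twice, with the palette order swapped and the
# pixel indices adjusted to match.
palette_a, pixels_a = [(255, 0, 0), (0, 0, 255)], [0, 1, 1, 0]
palette_b, pixels_b = [(0, 0, 255), (255, 0, 0)], [1, 0, 0, 1]

same_image = decode(palette_a, pixels_a) == decode(palette_b, pixels_b)
same_bytes = serialize(palette_a, pixels_a) == serialize(palette_b, pixels_b)

print("same image:", same_image)
print("same bytes:", same_bytes)
```

The two encodings decode to the same pixels yet serialize differently, which is the gap between "the data are unchanged" and "the bytes are unchanged".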
As mentioned previously, enforcing and implementing the unchanged bytes rule is not challenging. It is, however, quite different from stating that the data are returned unchanged. It is this distinction on which I, and I'm sure many other implementors, would appreciate consensus.
Dave V.
On Jul 14, 2007, at 09:20, Matthew Jones wrote:
In terms of the metadata returned from an LSID, or any other digital identifier, there are definite cases where metadata must be semantically persistent in order to preserve the utility of data and accuracy of scientific results.
As a trivial example, given a set of observations collected at time t, one can represent the data for those observations in dataset D and the metadata for the dataset, including the time value t, in a metadata document M. In a later event, it is discovered that t was entered incorrectly, and needs to be adjusted, creating metadata document M'. That M and M' are not congruent is critical knowledge when analyzing data from D with data from another dataset D2. In other words, because there is no true distinction between data and metadata (any given piece of information can be stored in either location), a proper archive must be able to distinguish any changes in the data and any changes in the metadata.
That said, there are some metadata that could change with little or no impact on data interpretation (e.g., the spelling of the street on which Technician Tom gets his snailmail). But at the current time its impossible to distinguish this kind of metadata from the important kind in the general case of the existing metadata standards in use (e.g., FGDC BDP, ISO 19115, EML, GML, etc).
Our process in the KNB/SEEK/NCEAS and other ecological data archives is to give persistent identifiers to both data objects and metadata objects, and provide new identifiers when either changes.
Matt
Dave Vieglais wrote:
Hi Bob, Just because a standard is published does not mean that it is practical. Requiring that a set of bytes referenced by an LSID are unchanged has a lot of implications with respect to the implementation of data services. For example, if it is agreed to abide by the rule that the blob referenced by an LSID remains forever unchanged, then that implies that the data provider stores the data as a blob, rather than risking the process of reconstructing on the fly from some database, especially for the example of data expressed in XML where functionally identical objects (constructed using different DOM libraries for example) are not identical blobs. Asserting that two instances of an object with the same LSID are semantically equivalent is a vastly more complicated processes than asserting that the canonical representation of those instances are identical. Generally there can be defined a simple set of guidelines for constructing the canonical form of an object (eg. for xml http:www.w3.org/TR/xml-c14n ) whereas asserting semantic equivalence is an ongoing topic of research. Requiring identical blobs is certainly possible, but people need to be aware of the implications of such a requirement in the early stages of designing a system to support such a specification. My preference for the canonical form relaxes the implementation requirements considerably whilst still maintaining the integrity of the data and the intent of the LSID. regards, Dave V. On Jul 14, 2007, at 08:08, Bob Morris wrote:
This entire discussion confuses me. The LSID standard is published. Why is there a discussion of what an LSID should be? The standard requires that the data, as defined by the return of getData, to be identical for all resolutions of the LSID. From page 9 of the LSID spec:
" bytes getData (LSID lsid) bytes getDataByRange (LSID lsid, integer start, integer length) Metadata_response getMetadata (LSID lsid, string[] accepted_formats) Metadata_response getMetadataSubset (LSID lsid, string[] accepted_formats, string selector) The data retrieval services may implement all of the methods, or only methods for retrieving data, or only methods for retrieving associated metadata. The same LSID named data object must be resolved always to the same set of bytes. Therefore, all of the data retrieval services return the same results for the same LSID. The user has, however, the choice of which one of these to utilize depending on its location, known quality of service and other attributes. With metadata, the situation is different. Each data retrieval service can provide different metadata for the same LSID."
This doesn't seem very ambiguous to me, and doesn't have anything to do with imperfect storage of data or anything else about the physical or electronic world. If two calls to getData() with the same argument on two occasions to possibly two different resolution services do not yield the same set of bytes, then one or the other or both of those is not executing a compliant service response. Unless this discussion is really "Shall we call something other than the return of getData by the term 'data associated with the LSID?' there seems to be nothing to discuss.
Bob
On 7/13/07, Paul Kirk p.kirk@cabi.org wrote:
In an imperfect world there is no such thing as an 'identical- byte-stream' because the technology we use is imperfect ... the disk controllers which manage our bytes and the disk we use to store our bytes have recognized error rates. Perhaps I'm being a pedant in the above analysis but I was almost persuaded that except for digital objects (images, sounds) which can be data all other 'things' (names, specimen accession numbers) had to be metadata. This to me makes no sense in the real but imperfect world we live in. An LSID assigned to a name (e.g. Homo sapiens) is assigned to the name as data, not metadata. What is 'identical' here it that if the spelling has to change for any reason the new spelling gets a new LSID and the now incorrect spelling gets deprecated (but is still resolvable) with a pointer to the correct spelling/LSID in the metadata.
OK?
Paul
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Chuck Miller Sent: Fri 13/07/2007 19:03 To: Dave Vieglais Cc: tdwg-guid@lists.tdwg.org Subject: RE: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
Dave, What you say is true. But, I think we already have too many variations, subtleties, and reinterpretations which are endlessly debated.
The LSID standard would be simple, clear and consistent if we used the identical-byte-stream definition. The LSID would uniquely tag a persistent byte stream. A persistent byte stream is always the same thing without any further explanation or clarification.
The provider of an LSID byte-stream would need to commit to keeping that byte-stream persistent and not represent it in multiple ways, even though technically they could. If they can't commit to that, then it can't be an LSID byte-stream.
And in the name of simplicity and clarity, if they had to provide different byte-stream representations then they would have to assign a different LSID to each and use "SameAs" metadata.
Chuck
-----Original Message----- From: Dave Vieglais [mailto:vieglais@ku.edu] Sent: Friday, July 13, 2007 12:42 PM To: Chuck Miller Cc: Ricardo Pereira; tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
Hi Ricardo, Chuck, Asserting that the byte stream returned as data associated with an LSID should never change is perhaps a bit confusing from a programmatic view. There are for example many ways to represent data in xml that are identical from an information content point of view, but the byte streams could be very different.
Perhaps it might be better to state something like "the canonical representation of the data associated with an LSID must not change", or something to that effect?
Dave V.
On Jul 14, 2007, at 05:29, Chuck Miller wrote:
Ricardo,
Looking at this definition: "Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change"
Perhaps this is a more straightforward way to conceive LSIDs. The LSID goes with a byte stream. It's that byte stream that must stay the same. So, if there is a byte stream associated with a collection that needs to stay the same, then whatever that byte stream happens to be is the data that gets an LSID assigned to it.
That sure seems a clearer definition of what is data and what is metadata, rather than the issue of primary object and all that.
So we can create a new definition in the context of LSIDs: Data is a byte stream that is persistent, never changes and can have an LSID. Metadata is a byte stream that is non-persistent, might change and is only associated with an LSID.
The institution who assigns an LSID can make their own decision about whether the byte stream being provided is persistent or non-persistent. By assigning an LSID to any byte stream, whatever it is, the institution is declaring it to be data and persistent.
So, in the example given of an observation record with a determination that needs to remain fixed and unchanged, by assigning an LSID to that observation+determination it would be "declared to be data" and unchangeable. A different determination would then be different data with a different LSID. That would provide a solution for those who want to employ it. Others could choose not to use it.
Chuck
From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of Ricardo Pereira Sent: Friday, July 13, 2007 9:47 AM To: tdwg-guid@lists.tdwg.org Subject: [tdwg-guid] LSID metadata persistence (or lack thereof)
Hi there folks,
As Chuck mentioned a few weeks ago, we do have a few outstanding issues to address regarding LSIDs. I would like to discuss those one by one, in an orderly manner, and reach consensus as much as we can. Then we can sum them up in a TDWG standard, possibly by or shortly after the Bratislava conference.
The first issue I would like to discuss is LSID metadata persistence. First, let me remind you of a corollary established by the LSID specification:
Corollary 1: LSIDs are not guaranteed to be resolvable indefinitely.
In other words, there is no guarantee that one will always be able to retrieve the data associated with an LSID as the authority may choose (or be forced) not to resolve an LSID anymore.
Second, let me distinguish this kind of persistence I'm talking about from other two related concepts (which we'll not discuss in this thread):
1) Persistence of Assignment: Once assigned to an object, an LSID is indefinitely associated with it. The same LSID cannot be assigned to another object. Ever! The LSID may not be resolvable anymore, but it cannot be assigned to another object. This is established by the LSID specification.
2) Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change. Although the LSID may not be resolvable anymore (according to corollary 1), the data associated with an LSID must never ever change. That's defined by the LSID spec, too.
What I want to discuss here is the persistence of LSID metadata (what is returned by the getMetadata call) or the lack thereof.
A use case associated with metadata persistence is when someone collects observation records (and implicitly, their determinations) and runs an experiment (a model or simulation) with it. This person may want to record the identifiers of the points used so that someone using the results of that experiment may refer back to the primary data, to validate or repeat it the experiment.
The bad news is that LSID identification scheme (or any other GUID that I know of) was not designed to guarantee metadata persistence, and thus it cannot implement the use case above by itself. To implement that use case, the specification would have to guarantee that the metadata (which we are using here as data) is immutable. But it doesn't.
Most of us wish that metadata was persistent, but it isn't. Many things can change in the metadata: a new determination, a mispeling that is corrected, many things. We just cannot guarantee that the metadata will look like it was sometime ago.
We then reach the following conclusion.
Corollary 2: LSIDs metadata is not immutable nor persistent.
The consequence of this corollary is that, if you need to refer back to a piece of information (metadata) associated with an LSID, exactly as it was when you got it, you must make a copy of it, or arrange that someone else make that copy for you.
In other words, a client cannot assume that the metadata associated with an LSID today will be the same tomorrow. If the client does assume that, it may be relying on a false assumption and its output may be flawed.
If we are not happy with that conclusion, we may develop an additional component in our architecture, an archive of some sort, to handle (meta)data persistence. That is exactly what the STD-DOI project (http://www.std-doi.de/) and SEEK (http://seek.ecoinformatics.org) have done to some extent.
While we cannot guarantee that LSID metadata is persistent nor immutable, we can definitely document how the metadata have changed through metadata versioning. That's the topic of the next thread. We will move on to discuss metadata versioning as soon as we are done with metadata persistence.
Cheers,
Ricardo
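The consequence of Corollary 2 (keep your own copy of the metadata if you need it exactly as retrieved) could look roughly like the sketch below. It is only an illustration: the function names `snapshot` and `has_changed`, the record layout, and the example LSID under example.org are all hypothetical, and no real LSID resolver is contacted.

```python
import hashlib
import json
from datetime import datetime, timezone


def snapshot(lsid: str, metadata: bytes) -> dict:
    """Keep a dated copy of the metadata exactly as it was retrieved."""
    return {
        "lsid": lsid,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(metadata).hexdigest(),
        "metadata": metadata.decode("utf-8"),
    }


def has_changed(saved: dict, current: bytes) -> bool:
    """Compare a later getMetadata response against the saved fingerprint."""
    return hashlib.sha256(current).hexdigest() != saved["sha256"]


# Hypothetical LSID and metadata payloads, for illustration only.
saved = snapshot("urn:lsid:example.org:specimens:1234", b"<det>Homo sapiens</det>")
print(json.dumps(saved, indent=2))
print(has_changed(saved, b"<det>Homo sapiens</det>"))
print(has_changed(saved, b"<det>Homo neanderthalensis</det>"))
```

The saved record is what an experiment would cite alongside the LSID; the digest only detects that the metadata moved on, it does not say what changed.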
tdwg-guid mailing list tdwg-guid@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-guid
--Robert A. Morris Professor of Computer Science UMASS-Boston ram@cs.umb.edu http://bdei.cs.umb.edu/ http://www.cs.umb.edu/~ram http://www.cs.umb.edu/~ram/calendar.html phone (+1)617 287 6466
There seem to be two methods of resolving this problem.
One is to change the LSID definitions to allow semantic equivalence in the data and not require exact bit stream equivalence.
The other option is to change the data representation so that it is "easily" reduced to a repeatable canonical form. For example, it is almost as easy as saying that where XML does not specify the order of elements, elements will be ordered alphabetically. It seems stupid but it almost works, except where you have repeating elements with the same element name, where it does not work.
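A minimal sketch of that alphabetical-ordering idea, using Python's standard xml.etree.ElementTree, is given here purely as an illustration. The helper names `sort_children` and `normalize` and the sample records are made up for the example; the second comparison shows the repeating-element case where sorting by element name alone cannot settle the order.

```python
import xml.etree.ElementTree as ET


def sort_children(element: ET.Element) -> None:
    """Recursively reorder child elements alphabetically by tag name."""
    for child in element:
        sort_children(child)
    # sorted() is stable, so children sharing a tag keep their input order.
    element[:] = sorted(element, key=lambda child: child.tag)


def normalize(xml_text: str) -> bytes:
    """Parse, sort children by tag, and serialize again."""
    root = ET.fromstring(xml_text)
    sort_children(root)
    return ET.tostring(root)


# Different element order, same content: the normalized forms agree.
a = normalize("<rec><genus>Homo</genus><epithet>sapiens</epithet></rec>")
b = normalize("<rec><epithet>sapiens</epithet><genus>Homo</genus></rec>")
print(a == b)

# Repeated element names: sorting by tag cannot pick an order.
c = normalize("<rec><syn>A</syn><syn>B</syn></rec>")
d = normalize("<rec><syn>B</syn><syn>A</syn></rec>")
print(c == d)
```

Whether the order of the repeated elements carries meaning is exactly what the data standard would have to say; the sketch simply leaves them as found.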
It seems a little odd to bend the standards for the data being delivered to fit the requirement of the LSID spec. In theory, the other standard developers who set the data being delivered did not fix order because it did not matter.
This is different from Chuck's observation that the semantics of the elements within some of the standards are insufficiently specified. So, what we mean is that a Darwin Core species name is just a string and nothing more now.
--Bryan
On Jul 13, 2007, at 5:18 PM, Dave Vieglais wrote:
I think we are all in agreement that the data and metadata referenced by an LSID remains unchanged (in the case of the metadata, semantic equivalence is a requirement for reasons such as outlined by Matt). My question is to do purely with the data that an LSID references through the getData() operation. The form of that data could be anything really - an encrypted byte stream, digital image, Open Office document, spreadsheet, xml document...
We all know that the same data can be represented many ways that are logically, semantically and functionally equivalent yet form a different set of bytes when serialized. Data expressed in XML is one example (is <a/> = <a /> = <a></a> ?). A palette-based image is another - the order of colors in the palette may be changed, and the pixel values adjusted to match the new palette order, but the image is still the same. There are many more simple examples that can be constructed that violate the unchanged bytes rule but for all practical and functional purposes the data are unchanged.
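The XML example can be tried directly. The sketch below assumes Python 3.8 or later, whose standard library provides xml.etree.ElementTree.canonicalize (an implementation of Canonical XML 2.0, which is a later version than the xml-c14n document referenced elsewhere in this thread). It is only an illustration of the point being made: the three spellings are three different byte streams, yet they share one canonical form.

```python
from xml.etree.ElementTree import canonicalize

# Three serializations of the same empty element.
variants = ["<a/>", "<a />", "<a></a>"]

# As raw bytes they are all different ...
raw = {v.encode("utf-8") for v in variants}
print(len(raw))

# ... but canonicalization maps them onto a single form.
canonical = {canonicalize(v) for v in variants}
print(len(canonical))
print(canonical)
```

An "unchanged bytes" rule compares the first set; an "unchanged canonical representation" rule compares the second.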
As mentioned previously, enforcing and implementing the unchanged bytes rule is not challenging. It is however quite different from stating that the data are returned unchanged. It is this that I, and I'm sure a lot of other implementors would appreciate consensus on.
Dave V.
On Jul 14, 2007, at 09:20, Matthew Jones wrote:
In terms of the metadata returned from an LSID, or any other digital identifier, there are definite cases where metadata must be semantically persistent in order to preserve the utility of data and accuracy of scientific results.
As a trivial example, given a set of observations collected at time t, one can represent the data for those observations in dataset D and the metadata for the dataset, including the time value t, in a metadata document M. In a later event, it is discovered that t was entered incorrectly, and needs to be adjusted, creating metadata document M'. That M and M' are not congruent is critical knowledge when analyzing data from D with data from another dataset D2. In other words, because there is no true distinction between data and metadata (any given piece of information can be stored in either location), a proper archive must be able to distinguish any changes in the data and any changes in the metadata.
That said, there are some metadata that could change with little or no impact on data interpretation (e.g., the spelling of the street on which Technician Tom gets his snailmail). But at the current time it's impossible to distinguish this kind of metadata from the important kind in the general case of the existing metadata standards in use (e.g., FGDC BDP, ISO 19115, EML, GML, etc).
Our process in the KNB/SEEK/NCEAS and other ecological data archives is to give persistent identifiers to both data objects and metadata objects, and provide new identifiers when either changes.
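As a rough illustration of that policy (not a description of how the KNB/SEEK/NCEAS archives are actually implemented), the toy archive below gives every distinct byte stream its own identifier, for data and metadata alike. The class name `Archive`, the "kind.number" identifier scheme and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Archive:
    """Toy archive in which an identifier never points at changed bytes."""

    objects: Dict[str, bytes] = field(default_factory=dict)
    counter: int = 0

    def store(self, kind: str, content: bytes) -> str:
        """Return the identifier for `content`, minting one if it is new."""
        for identifier, existing in self.objects.items():
            if identifier.startswith(kind + ".") and existing == content:
                return identifier  # same bytes, same identifier
        self.counter += 1
        identifier = f"{kind}.{self.counter}"
        self.objects[identifier] = content
        return identifier


archive = Archive()
data_id = archive.store("data", b"obs-1,obs-2,obs-3")
meta_id = archive.store("meta", b"collected_at=2007-07-01")
# The time value is corrected later: M' gets its own identifier.
fixed_id = archive.store("meta", b"collected_at=2007-07-02")

print(meta_id, fixed_id)
print(archive.objects[meta_id])
```

Because the old identifier still resolves to the old bytes, an analysis that cited M can be told apart from one that cited M'.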
Matt
Dave Vieglais wrote:
Hi Bob,
Just because a standard is published does not mean that it is practical. Requiring that a set of bytes referenced by an LSID are unchanged has a lot of implications with respect to the implementation of data services. For example, if it is agreed to abide by the rule that the blob referenced by an LSID remains forever unchanged, then that implies that the data provider stores the data as a blob, rather than risking the process of reconstructing on the fly from some database, especially for the example of data expressed in XML where functionally identical objects (constructed using different DOM libraries for example) are not identical blobs.
Asserting that two instances of an object with the same LSID are semantically equivalent is a vastly more complicated process than asserting that the canonical representation of those instances are identical. Generally there can be defined a simple set of guidelines for constructing the canonical form of an object (eg. for xml http://www.w3.org/TR/xml-c14n ) whereas asserting semantic equivalence is an ongoing topic of research.
Requiring identical blobs is certainly possible, but people need to be aware of the implications of such a requirement in the early stages of designing a system to support such a specification. My preference for the canonical form relaxes the implementation requirements considerably whilst still maintaining the integrity of the data and the intent of the LSID.
regards, Dave V.
On Jul 14, 2007, at 08:08, Bob Morris wrote:
This entire discussion confuses me. The LSID standard is published. Why is there a discussion of what an LSID should be? The standard requires the data, as defined by the return of getData, to be identical for all resolutions of the LSID. From page 9 of the LSID spec:
" bytes getData (LSID lsid)
bytes getDataByRange (LSID lsid, integer start, integer length)
Metadata_response getMetadata (LSID lsid, string[] accepted_formats)
Metadata_response getMetadataSubset (LSID lsid, string[] accepted_formats, string selector)
The data retrieval services may implement all of the methods, or only methods for retrieving data, or only methods for retrieving associated metadata. The same LSID named data object must be resolved always to the same set of bytes. Therefore, all of the data retrieval services return the same results for the same LSID. The user has, however, the choice of which one of these to utilize depending on its location, known quality of service and other attributes. With metadata, the situation is different. Each data retrieval service can provide different metadata for the same LSID."
This doesn't seem very ambiguous to me, and doesn't have anything to do with imperfect storage of data or anything else about the physical or electronic world. If two calls to getData() with the same argument on two occasions to possibly two different resolution services do not yield the same set of bytes, then one or the other or both of those is not executing a compliant service response. Unless this discussion is really "Shall we call something other than the return of getData by the term 'data associated with the LSID'?" there seems to be nothing to discuss.
Bob
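The compliance check described in the message above amounts to a byte-for-byte comparison of two getData() responses. The following is a minimal sketch under stated assumptions: the helper names `digest` and `same_bytes` and the sample payloads are hypothetical, and the responses are plain byte strings rather than results from a real resolution service.

```python
import hashlib


def digest(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a byte stream."""
    return hashlib.sha256(payload).hexdigest()


def same_bytes(first: bytes, second: bytes) -> bool:
    """True only when both getData() responses are byte-for-byte identical."""
    return digest(first) == digest(second)


# Hypothetical responses from different resolution services for one LSID.
response_a = b"<name>Homo sapiens</name>"
response_b = b"<name>Homo sapiens</name>"
response_c = b"<name>Homo sapiens</name>\n"  # one extra trailing byte

print(same_bytes(response_a, response_b))
print(same_bytes(response_a, response_c))
```

Comparing digests rather than the payloads themselves means a client only has to keep the fingerprint from an earlier call to check a later one.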
It seems to me that there is a third method of resolving the problem:
When we want to identify an object that is itself digital in nature (e.g., a database record, or a binary data file such as a PDF, JPG, ASCII, Unicode, or whatever), we resolve said binary object via getData(). If, for some reason, we change the exact bit-sequence of that digital/binary object (e.g., color-correct an image, change a text string from ASCII to Unicode, or whatever...), we assign a new LSID to it (whether that "new" LSID differs from the "old" LSID only via the optional "Revision" part of the LSID, or via a new Object Identification part, is a topic for another debate).
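For the "Revision" variant mentioned in the paragraph above, a sketch might look like the following. It relies on the LSID syntax urn:lsid:authority:namespace:object with an optional trailing revision; everything else is an assumption made for illustration, in particular the helper names `parse_lsid` and `revise`, the example.org LSID, and the choice of a simple integer counter for revisions (the LSID syntax does not require revisions to be numeric).

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

    def __str__(self) -> str:
        parts = ["urn", "lsid", self.authority, self.namespace, self.object_id]
        if self.revision is not None:
            parts.append(self.revision)
        return ":".join(parts)


def parse_lsid(text: str) -> Lsid:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    return Lsid(*parts[2:])


def revise(lsid: Lsid) -> Lsid:
    """Return the LSID for a changed bit sequence (assumed integer revisions)."""
    current = int(lsid.revision) if lsid.revision is not None else 0
    return lsid._replace(revision=str(current + 1))


# Hypothetical image LSID; the color-corrected copy gets the next revision.
original = parse_lsid("urn:lsid:example.org:images:42")
corrected = revise(original)
print(original)
print(corrected)
```

The original LSID keeps pointing at the original bytes; only the corrected image carries the new revision.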
When we want to identify an object that does not itself have a digital manifestation -- like a physical object (e.g., specimen or a particular printed copy of a publication) or an abstract/conceptual object (e.g., a taxon name, a taxon concept, a geographical place, or a cited publication) -- then we return *nothing* in response to getData(), and we treat all the attributes of said physical/abstract/conceptual object of interest to us as metadata.
If there are cases where certain metadata elements of an object without an inherent digital existence need to persist (and there are), yet we also want to allow modifications to metadata elements without the need to generate new identifiers for the underlying object (and we do) -- then we deal with those within our own community via adopted standards and best practices.
I would disagree strongly with bending the existing LSID standard, and would just as strongly favor working within its existing framework (which, I think, we can). I would also disagree with the practice of embedding XML documents as "data" for an LSID, unless the LSID is intended to represent the XML document itself (in which case there might be a different LSID to represent the database record that was used to generate the XML document; and yet another LSID to represent the abstract concept that the database record was created to represent -- like a taxon name, for example).
If we want to use LSIDs to pass around XML packages (that are not rendered as RDF) about abstract objects (e.g., taxon names), why doesn't our community define within our semantic vocabulary something along the lines of "TCS_XML", which can be established as a standard metadata component for LSIDs assigned to taxon concepts (i.e., abstract objects, identified by "data-less" LSIDs)? The exact bytestream of the content of that metadata element can change, without changing its canonical rendering.
I'm beginning to suspect (strongly) that I am completely missing some fundamental point here -- and perhaps it is the same point that underlies the apparent antagonism towards LSIDs in general (which I do not yet share). But I am fairly certain we are dealing with some level of miscommunication here.
Aloha, Rich
-----Original Message----- From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of P. Bryan Heidorn Sent: Friday, July 13, 2007 12:48 PM To: Dave Vieglais Cc: tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
There seem to be two methods of resolving this problem.
One is to change the LSID definitions to allow semantic equivalence in the data and not require exact bit stream equivalence.
The other option is to change the data representation so that it is "easily" reduced to a repeatable canonical form. For example, it is almost as easy as saying that where XML does not specify the order of elements, elements will be ordered alphabetically. It seems stupid but it almost works, except where you have repeating elements with the same element name, where it does not work.
It seems a little odd to bend the standards for the data being delivered to fit the requirement of the LSID spec. In theory, the other standard developers who set the data being delivered did not fix order because it did not matter.
This is different from Chuck's observation that the semantics of the elements within some of the standards are insufficiently specified. So, what we mean is that a Darwin Core species name is just a string and nothing more now.
--Bryan
On Jul 13, 2007, at 5:18 PM, Dave Vieglais wrote:
I think we are all in agreement that the data and metadata
referenced
by an LSID remains unchanged (in the case of the metadata, semantic equivalence is a requirement for reasons such as outlined
by Matt).
My question is to do purely with the data that an LSID references through the getData() operation. The form of that data could be anything really - an encrypted byte stream, digital image,
Open Office
document, spreadsheet, xml document...
We all know that the same data can be represented many ways
that are
logically, semantically and functionally equivalent yet form a different set of bytes when serialized. Data expressed in
XML is one
example (is <a/> = <a /> = <a></a> ?). A pallet based image is another - the order of colors in the palette may be
changed, and the
pixel values adjusted to match the new palette order, but
the image is
still the same. There are many more simple examples that can be constructed that violate the unchanged bytes rule but for all practical and functional purposes the data are unchanged.
As mentioned previously, enforcing and implementing the unchanged bytes rule is not challenging. It is however quite different from stating that the data are returned unchanged. It is this
that I, and
I'm sure a lot of other implementors would appreciate consensus on.
Dave V.
On Jul 14, 2007, at 09:20, Matthew Jones wrote:
In terms of the metadata returned from an LSID, or any other digital identifier, there are definite cases where metadata must be semantically persistent in order to preserve the utility of data and accuracy of scientific results.
As a trivial example, given a set of observations collected at time t, one can represent the data for those observations in dataset D and the metadata for the dataset, including the time value t, in a metadata document M. In a later event, it is discovered that t was entered incorrectly and needs to be adjusted, creating metadata document M'. That M and M' are not congruent is critical knowledge when analyzing data from D with data from another dataset D2. In other words, because there is no true distinction between data and metadata (any given piece of information can be stored in either location), a proper archive must be able to distinguish any changes in the data and any changes in the metadata.
That said, there are some metadata that could change with little or no impact on data interpretation (e.g., the spelling of the street on which Technician Tom gets his snailmail). But at the current time it's impossible to distinguish this kind of metadata from the important kind in the general case of the existing metadata standards in use (e.g., FGDC BDP, ISO 19115, EML, GML, etc).
Our process in the KNB/SEEK/NCEAS and other ecological data archives is to give persistent identifiers to both data objects and metadata objects, and provide new identifiers when either changes.
Matt
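The policy Matt describes (a new identifier whenever a data or metadata object changes) can be sketched as below. This is not the KNB/SEEK implementation; the `Archive` class, the identifier layout and the `example` prefix are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    """Toy archive: each distinct version of an object gets its own identifier."""
    prefix: str                                   # hypothetical identifier prefix
    store: dict = field(default_factory=dict)     # identifier -> bytes
    history: dict = field(default_factory=dict)   # object name -> list of identifiers

    def deposit(self, name: str, content: bytes) -> str:
        versions = self.history.setdefault(name, [])
        # Unchanged content keeps its existing identifier.
        if versions and self.store[versions[-1]] == content:
            return versions[-1]
        # Any change, in data or in metadata, mints a new identifier.
        identifier = f"{self.prefix}:{name}:{len(versions) + 1}"
        self.store[identifier] = content
        versions.append(identifier)
        return identifier


archive = Archive(prefix="example")
m = archive.deposit("metadata", b"time=t")
m_again = archive.deposit("metadata", b"time=t")
m_prime = archive.deposit("metadata", b"time=t (corrected)")
print(m, m_again, m_prime)
```

Here M and M' receive different identifiers, so an analysis that cites one of them states exactly which version of the metadata it relied on.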
Dave Vieglais wrote:
Hi Bob,
Just because a standard is published does not mean that it is practical. Requiring that a set of bytes referenced by an LSID is unchanged has a lot of implications with respect to the implementation of data services. For example, if it is agreed to abide by the rule that the blob referenced by an LSID remains forever unchanged, then that implies that the data provider stores the data as a blob, rather than risking the process of reconstructing it on the fly from some database, especially for the example of data expressed in XML, where functionally identical objects (constructed using different DOM libraries, for example) are not identical blobs.
Asserting that two instances of an object with the same LSID are semantically equivalent is a vastly more complicated process than asserting that the canonical representations of those instances are identical. Generally there can be defined a simple set of guidelines for constructing the canonical form of an object (e.g. for XML, http://www.w3.org/TR/xml-c14n), whereas asserting semantic equivalence is an ongoing topic of research.
Requiring identical blobs is certainly possible, but people need to be aware of the implications of such a requirement in the early stages of designing a system to support such a specification. My preference for the canonical form relaxes the implementation requirements considerably whilst still maintaining the integrity of the data and the intent of the LSID.
regards,
Dave V.
On Jul 14, 2007, at 08:08, Bob Morris wrote:
This entire discussion confuses me. The LSID standard is published. Why is there a discussion of what an LSID should be? The standard requires the data, as defined by the return of getData, to be identical for all resolutions of the LSID. From page 9 of the LSID spec:
"
  bytes getData (LSID lsid)
  bytes getDataByRange (LSID lsid, integer start, integer length)
  Metadata_response getMetadata (LSID lsid, string[] accepted_formats)
  Metadata_response getMetadataSubset (LSID lsid, string[] accepted_formats, string selector)
The data retrieval services may implement all of the methods, or only methods for retrieving data, or only methods for retrieving associated metadata. The same LSID named data object must be resolved always to the same set of bytes. Therefore, all of the data retrieval services return the same results for the same LSID. The user has, however, the choice of which one of these to utilize depending on its location, known quality of service and other attributes. With metadata, the situation is different. Each data retrieval service can provide different metadata for the same LSID."
This doesn't seem very ambiguous to me, and doesn't have anything to do with imperfect storage of data or anything else about the physical or electronic world. If two calls to getData() with the same argument on two occasions to possibly two different resolution services do not yield the same set of bytes, then one or the other or both of those is not executing a compliant service response. Unless this discussion is really "Shall we call something other than the return of getData by the term 'data associated with the LSID'?", there seems to be nothing to discuss.
Bob
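Bob's reading of the spec amounts to a simple test that any client can run. The sketch below assumes each resolution service is wrapped in a plain Python callable that takes an LSID string and returns the getData() bytes; the two stub services and the LSID are hypothetical, and no real LSID client library is used.

```python
import hashlib
from typing import Callable, Iterable


def same_bytes(lsid: str, resolvers: Iterable[Callable[[str], bytes]]) -> bool:
    """Return True if every resolver yields the same byte stream for the LSID."""
    digests = {hashlib.sha256(resolve(lsid)).hexdigest() for resolve in resolvers}
    return len(digests) <= 1


# Hypothetical stand-ins for two resolution services serving the same LSID.
def service_a(lsid: str) -> bytes:
    return b"<a/>"


def service_b(lsid: str) -> bytes:
    return b"<a></a>"  # same information, different bytes


lsid = "urn:lsid:example.org:names:1"  # made-up identifier
print(same_bytes(lsid, [service_a, service_a]))
print(same_bytes(lsid, [service_a, service_b]))
```

By the quoted wording, the second pair would not both be compliant, even though the two payloads are equivalent XML.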
On 7/13/07, Paul Kirk p.kirk@cabi.org wrote:
In an imperfect world there is no such thing as an 'identical-byte-stream' because the technology we use is imperfect ... the disk controllers which manage our bytes and the disks we use to store our bytes have recognized error rates. Perhaps I'm being a pedant in the above analysis, but I was almost persuaded that, except for digital objects (images, sounds) which can be data, all other 'things' (names, specimen accession numbers) had to be metadata. This to me makes no sense in the real but imperfect world we live in. An LSID assigned to a name (e.g. Homo sapiens) is assigned to the name as data, not metadata. What is 'identical' here is that if the spelling has to change for any reason, the new spelling gets a new LSID and the now incorrect spelling gets deprecated (but is still resolvable) with a pointer to the correct spelling/LSID in the metadata.
OK?
Paul
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Chuck Miller
Sent: Fri 13/07/2007 19:03
To: Dave Vieglais
Cc: tdwg-guid@lists.tdwg.org
Subject: RE: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
Dave,
What you say is true. But, I think we already have too many variations, subtleties, and reinterpretations which are endlessly debated.
The LSID standard would be simple, clear and consistent if we used the identical-byte-stream definition. The LSID would uniquely tag a persistent byte stream. A persistent byte stream is always the same thing without any further explanation or clarification.
The provider of an LSID byte-stream would need to commit to keeping that byte-stream persistent and not represent it in multiple ways, even though technically they could. If they can't commit to that, then it can't be an LSID byte-stream.
And in the name of simplicity and clarity, if they had to provide different byte-stream representations then they would have to assign a different LSID to each and use "SameAs" metadata.
Chuck
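Chuck's proposal (one LSID per exact byte stream, with alternative representations linked through "SameAs" metadata) might look like the sketch below. The `Registry` class, its method names and the `example.org` authority are hypothetical; only the `urn:lsid:authority:namespace:object` layout comes from the LSID syntax.

```python
class Registry:
    """Toy registry: every exact byte stream gets its own LSID (illustrative only)."""

    def __init__(self, authority, namespace):
        self.authority = authority
        self.namespace = namespace
        self._data = {}      # lsid -> bytes, never overwritten
        self._same_as = {}   # lsid -> set of equivalent lsids

    def register(self, content, same_as=None):
        lsid = f"urn:lsid:{self.authority}:{self.namespace}:{len(self._data) + 1}"
        self._data[lsid] = content
        self._same_as[lsid] = set()
        if same_as is not None:
            # Record the equivalence in both directions.
            self._same_as[lsid].add(same_as)
            self._same_as[same_as].add(lsid)
        return lsid

    def get_data(self, lsid):
        return self._data[lsid]

    def get_metadata(self, lsid):
        return {"SameAs": sorted(self._same_as[lsid])}


registry = Registry("example.org", "docs")
first = registry.register(b"<a/>")
second = registry.register(b"<a></a>", same_as=first)
print(registry.get_metadata(first))
```

Each LSID keeps resolving to its own fixed bytes, and the judgement that the two are equivalent lives in the metadata, where it is allowed to change.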
-----Original Message-----
From: Dave Vieglais [mailto:vieglais@ku.edu]
Sent: Friday, July 13, 2007 12:42 PM
To: Chuck Miller
Cc: Ricardo Pereira; tdwg-guid@lists.tdwg.org
Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
Hi Ricardo, Chuck,
Asserting that the byte stream returned as data associated with an LSID should never change is perhaps a bit confusing from a programmatic view. There are for example many ways to represent data in XML that are identical from an information content point of view, but the byte streams could be very different.
Perhaps it might be better to state something like "the canonical representation of the data associated with an LSID must not change", or something to that effect?
Dave V.
On Jul 14, 2007, at 05:29, Chuck Miller wrote:
> Ricardo,
> Looking at this definition: "Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change"
> Perhaps this is a more straightforward way to conceive LSIDs. The LSID goes with a byte stream. It's that byte stream that must stay the same. So, if there is a byte stream associated with a collection that needs to stay the same, then whatever that byte stream happens to be is the data that gets an LSID assigned to it. That sure seems a clearer definition of what is data and what is metadata, rather than the issue of primary object and all that.
> So we can create a new definition in the context of LSIDs: Data is a byte stream that is persistent, never changes and can have an LSID. Metadata is a byte stream that is non-persistent, might change and is only associated with an LSID.
> The institution who assigns an LSID can make their own decision about whether the byte stream being provided is persistent or non-persistent. By assigning an LSID to any byte stream, whatever it is, the institution is declaring it to be data and persistent.
> So, in the example given of an observation record with a determination that needs to remain fixed and unchanged, by assigning an LSID to that observation+determination it would be "declared to be data" and unchangeable. A different determination would then be different data with a different LSID. That would provide a solution for those who want to employ it. Others could choose not to use it.
> Chuck
> From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of Ricardo Pereira
> Sent: Friday, July 13, 2007 9:47 AM
> To: tdwg-guid@lists.tdwg.org
> Subject: [tdwg-guid] LSID metadata persistence (or lack thereof)
> Hi there folks,
> As Chuck mentioned a few weeks ago, we do have a few outstanding issues to address regarding LSIDs. I would like to discuss those one by one, in an orderly manner, and reach consensus as much as we can. Then we can sum them up in a TDWG standard, possibly by or shortly after the Bratislava conference.
> The first issue I would like to discuss is LSID metadata persistence. First, let me remind you of a corollary established by the LSID specification:
> Corollary 1: LSIDs are not guaranteed to be resolvable indefinitely.
> In other words, there is no guarantee that one will always be able to retrieve the data associated with an LSID as the authority may choose (or be forced) not to resolve an LSID anymore.
> Second, let me distinguish this kind of persistence I'm talking about from other two related concepts (which we'll not discuss in this thread):
> 1) Persistence of Assignment: Once assigned to an object, an LSID is indefinitely associated with it. The same LSID cannot be assigned to another object. Ever! The LSID may not be resolvable anymore, but it cannot be assigned to another object. This is established by the LSID specification.
> 2) Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change. Although the LSID may not be resolvable anymore (according to corollary 1), the data associated with an LSID must never ever change. That's defined by the LSID spec, too.
> What I want to discuss here is the persistence of LSID metadata (what is returned by the getMetadata call) or the lack thereof.
> A use case associated with metadata persistence is when someone collects observation records (and implicitly, their determinations) and runs an experiment (a model or simulation) with it. This person may want to record the identifiers of the points used so that someone using the results of that experiment may refer back to the primary data, to validate or repeat it the experiment.
> The bad news is that LSID identification scheme (or any other GUID that I know of) was not designed to guarantee metadata persistence, and thus it cannot implement the use case above by itself. To implement that use case, the specification would have to guarantee that the metadata (which we are using here as data) is immutable. But it doesn't.
> Most of us wish that metadata was persistent, but it isn't. Many things can change in the metadata: a new determination, a mispeling that is corrected, many things. We just cannot guarantee that the metadata will look like it was sometime ago.
> We then reach the following conclusion.
> Corollary 2: LSIDs metadata is not immutable nor persistent.
> The consequence of this corollary is that, if you need to refer back to a piece of information (metadata) associated with an LSID, exactly as it was when you got it, you must make a copy of it, or arrange that someone else make that copy for you.
> In other words, a client cannot assume that the metadata associated with an LSID today will be the same tomorrow. If the client does assume that, it may be relying on a false assumption and its output may be flawed.
> If we are not happy with that conclusion, we may develop an additional component in our architecture, an archive of some sort, to handle (meta)data persistence. That is exactly what the STD-DOI project (http://www.std-doi.de/) and SEEK (http://seek.ecoinformatics.org) have done to some extent.
> While we cannot guarantee that LSID metadata is persistent nor immutable, we can definitely document how the metadata have changed through metadata versioning. That's the topic of the next thread. We will move on to discuss metadata versioning as soon as we are done with metadata persistence.
> Cheers,
> Ricardo
> _______________________________________________
> tdwg-guid mailing list
> tdwg-guid@lists.tdwg.org
> http://lists.tdwg.org/mailman/listinfo/tdwg-guid
tdwg-guid mailing list tdwg-guid@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-guid
--Robert A. Morris Professor of Computer Science UMASS-Boston ram@cs.umb.edu http://bdei.cs.umb.edu/ http://www.cs.umb.edu/~ram http://www.cs.umb.edu/~ram/calendar.html phone (+1)617 287 6466
Folks,
Thanks much to all of you who replied to my post. All the posts were really relevant to our discussion.
Before we go ahead, however, let us stop for a minute to try and summarize the points we agree upon and the points on which there is still significant controversy.
I believe that we reached consensus on the following issues:
1) We do agree that *LSID metadata is not required to be persistent* (i.e. clients cannot assume it is immutable). See note [1].
2) We should not force XML representations of data to be byte identical just to return that in the LSID getData() call. We must find another way to fulfill this requirement.
3) We should not try to return something in the LSID getData() call just for the sake of it. We shouldn't for example return the bare scientific name of a species in the getData() call just because that can be immutable and thus fulfill the requirement from the LSID spec. This is counterproductive because the name itself is in the metadata already and no client would gain anything from calling getData() in this case.
We have also raised new issues that may be worth discussing (in their own separate thread if possible):
4) We "may" bend the immutability rule of LSID getData() to our benefit and accept data that is not byte stream identical, but only "semantically" identical (depending on content type maybe). If we do this, we may use the LSID getData() call more effectively to identify real datasets such as matrices, identification keys, etc.
5) As Brian pointed out, we may need to revisit what we call data and metadata. We have been using the LSID getMetadata() call to return what some people may call data (taxon names, specimens, collections). And we forgot completely that there may be other kinds of data out there that may be returned in the getData() call and that those still need metadata to describe them. I think this may be worth discussing in a separate thread.
Did I leave anything out? If so, please let us know by replying to my post and adding a short entry to either list above.
Cheers,
Ricardo
Notes: -------
[1] Matt may disagree with me here, but my point is that we can't force all authorities (i.e. data providers) to keep perfect archives of all versions of their databases, given the heterogeneous and distributed environment we operate in. While some may want to provide this feature, other providers may not want to or may not be able to.
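Point 1 means that a client needing the metadata exactly as it was retrieved has to keep its own copy. A minimal sketch of such a snapshot follows; the record layout, the function names and the example LSID are assumptions, and the metadata bytes stand in for whatever a getMetadata() call returned.

```python
import hashlib
from datetime import datetime, timezone


def snapshot(lsid: str, metadata: bytes) -> dict:
    """Keep a dated copy of the metadata as it was when retrieved."""
    return {
        "lsid": lsid,
        "retrieved": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(metadata).hexdigest(),
        "metadata": metadata,
    }


def has_changed(saved: dict, current: bytes) -> bool:
    """Compare freshly retrieved metadata with the saved copy, byte for byte."""
    return hashlib.sha256(current).hexdigest() != saved["sha256"]


# Hypothetical metadata as returned on two different days.
saved = snapshot("urn:lsid:example.org:specimens:1", b"determination=Aus bus")
print(has_changed(saved, b"determination=Aus bus"))
print(has_changed(saved, b"determination=Aus cus"))
```

The digest comparison is deliberately byte-level; a semantic comparison would first need a canonical form for the metadata format in use.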
Richard Pyle wrote:
It seems to me that there is a third method to resolving the problem:
When we want to identify an object that is itself digital in nature (e.g., a database record, or a binary data file such as a PDF, JPG, ASCII, Unicode, or whatever), we resolve said binary object via getData(). If, for some reason, we change the exact bit-sequence of that digital/binary object (e.g., color-correct an image, change a text string from ASCII to Unicode, or whatever...), we assign a new LSID to it (whether that "new" LSID differs from the "old" LSID only via the optional "Revision" part of the LSID, or via a new Object Identification part, is a topic for another debate).
When we want to identify an object that does not itself have a digital manifestation -- like a physical object (e.g., specimen or a particular printed copy of a publication) or an abstract/conceptual object (e.g., a taxon name, a taxon concept, a geographical place, or a cited publication) -- then we return *nothing* in response to getData(), and we treat all the attributes of said physical/abstract/conceptual object of interest to us as metadata.
If there are cases where certain metadata elements of an object without an inherent digital existence need to persist (and there are), yet we also want to allow modifications to metadata elements without the need to generate new identifiers for the underlying object (and we do) -- then we deal with those within our own community via adopted standards and best practices.
I would disagree strongly with bending the existing LSID standard, and would just as strongly favor working within its existing framework (which, I think, we can). I would also disagree with the practice of embedding XML documents as "data" for an LSID, unless the LSID is intended to represent the XML document itself (in which case there might be a different LSID to represent the database record that was used to generate the XML document; and yet another LSID to represent the abstract concept that the database record was created to represent -- like a taxon name, for example).
If we want to use LSIDs to pass around XML packages (that are not rendered as RDF) about abstract objects (e.g., taxon names), why doesn't our community define within our semantic vocabulary something along the lines of "TCS_XML", which can be established as a standard metadata component for LSIDs assigned to taxon concepts (i.e., abstract objects, identified by "data-less" LSIDs)? The exact bytestream of the content of that metadata element can change, without changing its canonical rendering.
I'm beginning to suspect (strongly) that I am completely missing some fundamental point here -- and perhaps it is the same point that underlies the apparent antagonism towards LSIDs in general (which I do not yet share). But I am fairly certain we are dealing with some level of miscommunication here.
Aloha, Rich
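The "Revision" option Rich mentions relies on the LSID layout urn:lsid:authority:namespace:object[:revision]. The sketch below parses that layout and checks whether two LSIDs differ only in revision; the `LSID` tuple, the helper names and the example identifiers are illustrative, and the parser is intentionally minimal (it does not validate the characters allowed in each part).

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split an LSID into its parts; the revision is optional."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


def same_object(a: str, b: str) -> bool:
    """True when two LSIDs differ at most in their revision part."""
    return parse_lsid(a)[:3] == parse_lsid(b)[:3]


original = "urn:lsid:example.org:images:42:1"    # made-up identifiers
corrected = "urn:lsid:example.org:images:42:2"   # e.g. after colour correction
print(parse_lsid(corrected))
print(same_object(original, corrected))
```

Whether a changed bit sequence should get a new revision or a new object identifier is the debate Rich sets aside; the sketch only shows that the first option keeps the two versions visibly related.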
-----Original Message-----
From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of P. Bryan Heidorn
Sent: Friday, July 13, 2007 12:48 PM
To: Dave Vieglais
Cc: tdwg-guid@lists.tdwg.org
Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
There seem to be two methods of resolving this problem.
One is to change the LSID definitions to allow semantic equivalence in the data and not require exact bit stream equivalence.
The other option is to change the data representation so that it is "easily" reduced to a repeatable canonical form. For example, it is almost as easy as saying that where the XML schema does not specify the order of elements, elements will be ordered alphabetically. Seems stupid but it almost works... except where you have repeating elements with the same element name, where it does not work.
It seems a little odd to bend the standards for the data being delivered to fit the requirement of the LSID spec. In theory, the other standard developers who set the data being delivered did not fix order because it did not matter.
This is different from Chuck's observation that the semantics of the element within some of the standards are insufficiently specified. So, what we mean is a darwin mode species name is just a string and nothing more now.
--Bryan
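Bryan's alphabetical-ordering idea, and the repeating-element case where it breaks down, can be tried out with the standard library. This is a rough sketch: `normalize` is a hypothetical helper, it sorts children by tag name only, and it ignores attributes, namespaces and whitespace.

```python
import xml.etree.ElementTree as ET


def sort_children(element: ET.Element) -> None:
    """Recursively reorder child elements alphabetically by tag name."""
    for child in element:
        sort_children(child)
    element[:] = sorted(element, key=lambda child: child.tag)


def normalize(xml_text: str) -> str:
    root = ET.fromstring(xml_text)
    sort_children(root)
    return ET.tostring(root, encoding="unicode")


# Differently ordered but distinctly named elements normalize to the same text.
distinct_ok = normalize("<r><b/><a/></r>") == normalize("<r><a/><b/></r>")

# Repeating elements share a tag, so sorting by name cannot put them in a
# repeatable order: the two documents stay different after normalization.
repeats_ok = normalize("<r><n>x</n><n>y</n></r>") == normalize("<r><n>y</n><n>x</n></r>")

print(distinct_ok, repeats_ok)
```

A fuller canonical form would need a tie-break rule for same-named siblings, and such a rule is only safe when their order carries no meaning.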
On Jul 13, 2007, at 5:18 PM, Dave Vieglais wrote:
I think we are all in agreement that the data and metadata referenced by an LSID remain unchanged (in the case of the metadata, semantic equivalence is a requirement for reasons such as outlined by Matt).
My question is to do purely with the data that an LSID references through the getData() operation. The form of that data could be anything really: an encrypted byte stream, digital image, Open Office document, spreadsheet, XML document...
We all know that the same data can be represented in many ways that are logically, semantically and functionally equivalent yet form a different set of bytes when serialized. Data expressed in XML is one example (is <a/> = <a /> = <a></a> ?). A palette-based image is another: the order of colors in the palette may be changed, and the pixel values adjusted to match the new palette order, but the image is still the same. There are many more simple examples that can be constructed that violate the unchanged-bytes rule, yet for all practical and functional purposes the data are unchanged.
As mentioned previously, enforcing and implementing the unchanged-bytes rule is not challenging. It is, however, quite different from stating that the data are returned unchanged. It is this that I, and I'm sure a lot of other implementors, would appreciate consensus on.
Dave V.
On 7/13/07, Paul Kirk p.kirk@cabi.org wrote:
> > In an imperfect world there is no such thing as an 'identical- > byte-stream' > because the technology we use is imperfect ... the disk > controllers which manage our bytes and the disk we use to store > our bytes have recognized error rates. Perhaps I'm >
being a pedant
> in the above analysis but I was almost persuaded that >
except for
> digital objects (images, > sounds) which can > be data all other 'things' (names, specimen accession >
numbers) had
> to be metadata. This to me makes no sense in the real but > imperfect world we live in. An LSID assigned to a name >
(e.g. Homo
> sapiens) is assigned to the name as data, not metadata. What is > 'identical' here it that if the spelling has to change for any > reason the new spelling gets a new LSID and the now incorrect > spelling gets deprecated (but is still resolvable) with >
a pointer
> to the correct spelling/LSID in the metadata. > > OK? > > Paul > > ________________________________ > From: tdwg-guid-bounces@lists.tdwg.org on behalf of >
Chuck Miller
> Sent: Fri 13/07/2007 19:03 > To: Dave Vieglais > Cc: tdwg-guid@lists.tdwg.org > Subject: RE: [tdwg-guid] LSID metadata persistence (or lack > thereof)[Scanned] > > > > > Dave, > What you say is true. But, I think we already have too many > variations, subtleties, and reinterpretations which are >
endlessly
> debated. > > The LSID standard would be simple, clear and consistent >
if we used
> the identical-byte-stream definition. The LSID would >
uniquely tag
> a persistent byte stream. A persistent byte stream is >
always the
> same thing without any further explanation or clarification. > > The provider of an LSID byte-stream would need to commit to > keeping that byte-stream persistent and not represent it in > multiple ways, even though technically they could. If >
they can't
> commit to that, then it can't be an LSID byte-stream. > > And in the name of simplicity and clarity, if they had >
to provide
> different byte-stream representations then they would have to > assign a different LSID to each and use "SameAs" metadata. > > Chuck > > -----Original Message----- > From: Dave Vieglais [mailto:vieglais@ku.edu] > Sent: Friday, July 13, 2007 12:42 PM > To: Chuck Miller > Cc: Ricardo Pereira; tdwg-guid@lists.tdwg.org > Subject: Re: [tdwg-guid] LSID metadata persistence (or lack > thereof) > > Hi Ricardo, Chuck, > Asserting that the byte stream returned as data >
associated with an
> LSID should never change is perhaps a bit confusing from a > programmatic view. There are for example many ways to >
represent
> data in xml that are identical from an information >
content point
> of view, but the byte streams could be very different. > > Perhaps it might be better to state something like "the >
canonical
> representation of the data associated with an LSID must not > change", or something to that effect? > > Dave V. > > On Jul 14, 2007, at 05:29, Chuck Miller wrote: > > >> Ricardo, >> >> Looking at this definition: "Persistence of LSID >>
Data: The data
>> associated with an LSID (i.e, the byte stream returned by the >> > LSID > >> getData call) must never change" >> >> >> >> Perhaps this is a more straightforward way to conceive >> > LSIDs. The > >> LSID goes with a byte stream. It's that byte stream that >> > must stay > >> the same. So, if there is a byte stream associated with a >> collection that needs to stay the same, then whatever >>
that byte
>> stream happens to be is the data that gets an LSID assigned >> > to it. > >> That sure seems a clearer definition of what is data >>
and what is
>> metadata, rather than the issue of primary object and >>
all that.
>> >> So we can create a new definition in the context of LSIDs: >> > Data is > >> a byte stream that is persistent, never changes and >>
can have an
>> LSID. Metadata is a byte stream is non-persistent, >>
might change
>> and is only associated with an LSID. >> >> >> >> The institution who assigns an LSID can make their >>
own decision
>> about whether the byte stream being provided is persistent or >> > non- > >> persistent. By assigning an LSID to any byte stream, >> > whatever it > >> is, the institution is declaring it to be data and persistent. >> >> >> >> So, in the example given of an observation record with a >> determination that needs to remain fixed and unchanged, by >> assigning an LSID to that observation+determination >>
it would be
>> "declared to be data" and unchangeable. A different >> > determination > >> would then be different data with a different LSID. >>
That would
>> provide a solution for those who want to employ it. Others >> > could > >> choose not to use it. >> >> >> >> Chuck >> >> >> >> From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid- >> bounces@lists.tdwg.org] On Behalf Of Ricardo Pereira >> Sent: Friday, July 13, 2007 9:47 AM >> To: tdwg-guid@lists.tdwg.org >> Subject: [tdwg-guid] LSID metadata persistence (or >>
lack thereof)
>> >> Hi there folks, >> >> As Chuck mentioned a few weeks ago, we do have a few >> outstanding issues to address regarding LSIDs. I >>
would like to
>> discuss those one by one, in an orderly manner, and reach >> > consensus > >> as much as we can. Then we can sum them up in a TDWG >>
standard,
>> possibly by or shortly after the Bratislava conference. >> >> The first issue I would like to discuss is LSID metadata >> persistence. First, let me remind you of a corollary >> > established by > >> the LSID specification: >> >> Corollary 1: LSIDs are not guaranteed to be >> > resolvable > >> indefinitely. >> >> In other words, there is no guarantee that one will >> > always be > >> able to retrieve the data associated with an LSID as the >> > authority > >> may choose (or be forced) not to resolve an LSID anymore. >> >> Second, let me distinguish this kind of persistence I'm >> > talking > >> about from other two related concepts (which we'll not >> > discuss in > >> this thread): >> >> 1) Persistence of Assignment: Once assigned to an >> > object, > >> an LSID is indefinitely associated with it. The same LSID >> > cannot be > >> assigned to another object. Ever! The LSID may not be >>
resolvable
>> anymore, but it cannot be assigned to another object. This is >> established by the LSID specification. >> >> 2) Persistence of LSID Data: The data >>
associated with an
>> LSID (i.e, the byte stream returned by the LSID getData call) >> > must > >> never change. Although the LSID may not be resolvable anymore >> (according to corollary 1), the data associated with an LSID >> > must > >> never ever change. That's defined by the LSID spec, too. >> >> What I want to discuss here is the persistence of LSID >> > metadata > >> (what is returned by the getMetadata call) or the >>
lack thereof.
>> A use case associated with metadata persistence is when >> > someone > >> collects observation records (and implicitly, their >> > determinations) > >> and runs an experiment (a model or simulation) with it. This >> > person > >> may want to record the identifiers of the points used so that >> someone using the results of that experiment may refer back >> > to the > >> primary data, to validate or repeat it the experiment. >> >> The bad news is that LSID identification scheme (or any >> > other > >> GUID that I know of) was not designed to guarantee metadata >> persistence, and thus it cannot implement the use >>
case above by
>> itself. To implement that use case, the specification would >> > have to > >> guarantee that the metadata (which we are using here >>
as data) is
>> immutable. But it doesn't. >> >> Most of us wish that metadata was persistent, but >>
it isn't.
>> Many things can change in the metadata: a new >>
determination, a
>> mispeling that is corrected, many things. We just cannot >> > guarantee > >> that the metadata will look like it was sometime ago. >> >> We then reach the following conclusion. >> >> Corollary 2: LSIDs metadata is not immutable nor >> persistent. >> >> The consequence of this corollary is that, if you need to >> > refer > >> back to a piece of information (metadata) associated with an >> > LSID, > >> exactly as it was when you got it, you must make a copy of >> > it, or > >> arrange that someone else make that copy for you. >> >> In other words, a client cannot assume that the metadata >> associated with an LSID today will be the same >>
tomorrow. If the
>> client does assume that, it may be relying on a false >>
assumption
>> and its output may be flawed. >> >> If we are not happy with that conclusion, we may >>
develop an
>> additional component in our architecture, an archive of some >> > sort, > >> to handle (meta)data persistence. That is exactly what the >> > STD-DOI > >> project (http://www.std-doi.de/) and SEEK (http:// >> seek.ecoinformatics.org) have done to some extent. >> >> While we cannot guarantee that LSID metadata is >> > persistent nor > >> immutable, we can definitely document how the metadata have >> > changed > >> through metadata versioning. That's the topic of the next >> > thread. > >> We will move on to discuss metadata versioning as >>
soon as we are
>> done with metadata persistence. >> >> Cheers, >> >> Ricardo >> >> _______________________________________________ >> tdwg-guid mailing list >> tdwg-guid@lists.tdwg.org >> http://lists.tdwg.org/mailman/listinfo/tdwg-guid >> > _______________________________________________ > tdwg-guid mailing list > tdwg-guid@lists.tdwg.org > http://lists.tdwg.org/mailman/listinfo/tdwg-guid > > > P Think Green - don't print this email unless you really need to > > >
> ****** > The information contained in this e-mail and any files > transmitted with it is confidential and is for the >
exclusive use
> of the intended recipient. If you are not the intended >
recipient
> please note that any distribution, copying or use of this > communication or the information in it is prohibited. > > Whilst CAB International trading as CABI takes steps >
to prevent
> the transmission of viruses via e-mail, we cannot >
guarantee that
> any e-mail or attachment is free from computer viruses >
and you are
> strongly advised to undertake your own anti-virus precautions. > > If you have received this communication in error, >
please notify
> us by e-mail at cabi@cabi.org or by telephone on +44 (0)1491 > 829199 and then delete the e-mail and any copies of it. > > CABI is an International Organization recognised by the UK > Government under Statutory Instrument 1982 No. 1071. > > >
> ******** > _______________________________________________ > tdwg-guid mailing list > tdwg-guid@lists.tdwg.org > http://lists.tdwg.org/mailman/listinfo/tdwg-guid > > > --Robert A. Morris Professor of Computer Science UMASS-Boston ram@cs.umb.edu http://bdei.cs.umb.edu/ http://www.cs.umb.edu/~ram http://www.cs.umb.edu/~ram/calendar.html phone (+1)617 287 6466
Ricardo, I disagree with your assertion of consensus on a couple of points.
On 2), there is no consensus or decision on whether XML can be returned from a getData call. I asked this question and it has not been answered. We could disallow XML as a format for getData and allow it only for getMetadata.
We do not have consensus, and actually have disagreement, on "We shouldn't for example return the bare scientific name of a species in the getData() call just because that can be immutable" because "the name itself is in the metadata". I for one believe that we cannot avoid returning a scientific name byte stream in the getData for an LSID for a scientific name. That requirement is fundamental to what we need for biodiversity data. Pragmatically and empirically, names and specimens/observations are THE most fundamental data objects existing today in the databases published by GBIF. So if we can't put LSIDs on names, we have failed to enable one of the most fundamental needs of this community. If the definition of LSIDs needs to be amended to enable that, then so be it.
Chuck
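If a scientific name were returned from getData as Chuck proposes, the provider would need one fixed serialization of the name so that the bytes never vary. The following is only a minimal sketch of that idea. The function name `name_bytes`, the choice of UTF-8 with Unicode NFC normalization, and the made-up name "Hypothetica mülleri" are all assumptions for illustration; neither the LSID spec nor this thread prescribes them.

```python
import unicodedata


def name_bytes(name: str) -> bytes:
    """Serialize a scientific name to a single, repeatable byte sequence."""
    # Collapse runs of whitespace so spacing differences do not change the bytes.
    collapsed = " ".join(name.split())
    # NFC selects one Unicode form for accented characters.
    normalized = unicodedata.normalize("NFC", collapsed)
    return normalized.encode("utf-8")


# The same (hypothetical) name typed in two different Unicode forms.
composed = "Hypothetica m\u00fclleri"      # u-umlaut as one code point
decomposed = "Hypothetica mu\u0308lleri"   # u followed by a combining diaeresis

assert composed != decomposed
assert name_bytes(composed) == name_bytes(decomposed)
```

A correction to the spelling would still produce different bytes, which under the identical-bytes rule means a new LSID.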
________________________________
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Ricardo Pereira
Sent: Fri 7/13/2007 8:12 PM
Cc: tdwg-guid@lists.tdwg.org
Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
Folks,
Thanks much to all of you who replied to my post. All the posts were really relevant to our discussion.
Before we go ahead, however, let us stop for a minute to try and summarize the points we agree upon and the points in which there is still significant controversy.
I believe that we reached consensus on the following issues:
1) We do agree that *LSID metadata is not required to be persistent* (i.e. clients cannot assume it is immutable). See note [1].
2) We should not force XML representations of data to be byte identical just to return that in the LSID getData() call. We must find another way to fulfill this requirement.
3) We should not try to return something in the LSID getData() call just for the sake of it. We shouldn't for example return the bare scientific name of a species in the getData() call just because that can be immutable and thus fulfill the requirement from the LSID spec. This is counterproductive because the name itself is in the metadata already and no client would gain anything from calling getData() in this case.
We have also raised new issues that may be worth discussing (in their own separate thread if possible):
4) We "may" bend the immutability rule of LSID getData() to our benefit and accept data that is not byte stream identical, but only "semantically" identical (depending on content type maybe). If we do this, we may use the LSID getData() call more effectively to identify real datasets such as matrices, identification keys, etc.
5) As Brian pointed out, we may need to revisit what we call data and metadata. We have been using the LSID getMetadata() call to return what some people may call data (taxon names, specimens, collections). And we forgot completely that there may be other kinds of data out there that may be returned in the getData() call and that those still need metadata to describe them. I think this may be worth discussing in a separate thread.
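Point 4 above, equivalence that depends on content type, could be sketched as a table of comparison functions keyed by content type, with byte identity as the fallback. This is only an illustration under stated assumptions: the names `COMPARATORS` and `equivalent` are hypothetical, and JSON is used purely because the standard library can compare it structurally. Nothing in the thread proposes JSON.

```python
import json


def same_bytes(a: bytes, b: bytes) -> bool:
    """The rule in the LSID spec: the two byte streams must be identical."""
    return a == b


def same_json(a: bytes, b: bytes) -> bool:
    """A relaxed rule: compare parsed values, ignoring key order and spacing."""
    return json.loads(a) == json.loads(b)


# Hypothetical registry: content type -> equivalence test.
COMPARATORS = {"application/json": same_json}


def equivalent(content_type: str, a: bytes, b: bytes) -> bool:
    # Any content type without a relaxed rule falls back to byte identity.
    compare = COMPARATORS.get(content_type, same_bytes)
    return compare(a, b)


first = b'{"rows": 2, "cells": [1, 2]}'
second = b'{"cells":[1,2],"rows":2}'

assert equivalent("application/json", first, second)
assert not equivalent("application/octet-stream", first, second)
```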
Did I leave anything out? If so, please let us know by replying to my post and adding a short entry to either list above.
Cheers,
Ricardo
Notes:
[1] Matt may disagree with me here, but my point is that we can't force all authorities (i.e. data providers) to keep perfect archives of all versions of their databases, given the heterogeneous and distributed environment we operate in. While some may want to provide this feature, other providers may not want to or be able to.
Richard Pyle wrote:
It seems to me that there is a third method to resolving the problem:
When we want to identify an object that is itself digital in nature (e.g., a database record, or a binary data file such as a PDF, JPG, ASCII, Unicode, or whatever), we resolve said binary object via getData(). If, for some reason, we change the exact bit-sequence of that digital/binary object (e.g., color-correct an image, change a text string from ASCII to Unicode, or whatever...), we assign a new LSID to it (whether that "new" LSID differs from the "old" LSID only via the optional "Revision" part of the LSID, or via a new Object Identification part, is a topic for another debate).
When we want to identify an object that does not itself have a digital manifestation -- like a physical object (e.g., specimen or a particular printed copy of a publication) or an abstract/conceptual object (e.g., a taxon name, a taxon concept, a geographic place, or a cited publication) -- then we return *nothing* in response to getData(), and we treat all the attributes of said physical/abstract/conceptual object of interest to us as metadata.
If there are cases where certain metadata elements of an object without an inherent digital existence need to persist (and there are), yet we also want to allow modifications to metadata elements without the need to generate new identifiers for the underlying object (and we do) -- then we deal with those within our own community via adopted standards and best practices.
I would disagree strongly with bending the existing LSID standard, and would just as strongly favor working within its existing framework (which, I think, we can). I would also disagree with the practice of embedding XML documents as "data" for an LSID, unless the LSID is intended to represent the XML document itself (in which case there might be a different LSID to represent the database record that was used to generate the XML document; and yet another LSID to represent the abstract concept that the database record was created to represent -- like a taxon name, for example).
If we want to use LSIDs to pass around XML packages (that are not rendered as RDF) about abstract objects (e.g., taxon names), why doesn't our community define within our semantic vocabulary something along the lines of "TCS_XML", which can be established as a standard metadata component for LSIDs assigned to taxon concepts (i.e., abstract objects, identified by "data-less" LSIDs)? The exact bytestream of the content of that metadata element can change, without changing its canonical rendering.
I'm beginning to suspect (strongly) that I am completely missing some fundamental point here -- and perhaps it is the same point that underlies the apparent antagonism towards LSIDs in general (which I do not yet share). But I am fairly certain we are dealing with some level of miscommunication here.
Aloha, Rich
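Rich's scheme can be illustrated with a toy in-memory resolver: digital objects carry bytes and get a new LSID whenever those bytes change, while abstract objects are "data-less" and carry only mutable metadata. This is a sketch under assumptions, not an implementation of any real LSID library. The class `DataLessResolver`, its method names, the `example.org` LSIDs and the metadata keys are all hypothetical.

```python
class DataLessResolver:
    """Toy resolver: data is fixed at registration, metadata may change."""

    def __init__(self):
        self._data = {}
        self._metadata = {}

    def register(self, lsid, data=b"", metadata=None):
        # Persistence of assignment: an LSID is never reused for another object.
        if lsid in self._data:
            raise ValueError(f"{lsid} is already assigned")
        self._data[lsid] = bytes(data)
        self._metadata[lsid] = dict(metadata or {})

    def get_data(self, lsid):
        # Returns empty bytes for physical or abstract objects.
        return self._data[lsid]

    def get_metadata(self, lsid):
        return dict(self._metadata[lsid])

    def update_metadata(self, lsid, **changes):
        # Metadata can be corrected without minting a new identifier.
        self._metadata[lsid].update(changes)


resolver = DataLessResolver()

# An abstract object (a taxon name): no data, only metadata.
name_lsid = "urn:lsid:example.org:names:1"
resolver.register(name_lsid, metadata={"nameComplete": "Homo sapiens"})
resolver.update_metadata(name_lsid, rank="species")

# A digital object (an image): changed bytes get a new LSID, here a new revision.
image_v1 = "urn:lsid:example.org:images:42:1"
image_v2 = "urn:lsid:example.org:images:42:2"
resolver.register(image_v1, data=b"original pixels")
resolver.register(image_v2, data=b"color-corrected pixels")

assert resolver.get_data(name_lsid) == b""
assert resolver.get_data(image_v1) != resolver.get_data(image_v2)
```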
-----Original Message-----
From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of P. Bryan Heidorn
Sent: Friday, July 13, 2007 12:48 PM
To: Dave Vieglais
Cc: tdwg-guid@lists.tdwg.org
Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
There seem to be two methods for resolving this problem.
One is to change the LSID definitions to allow semantic equivalence in the data and not require exact bit stream equivalence.
The other option is to change the data representation so that it is "easily" reduced to a repeatable canonical form. For example, it is almost as easy as saying that where XML does not specify the order of elements, elements will be ordered alphabetically. Seems stupid but it almost works, except where you have repeating elements with the same element name, where it does not work.
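Bryan's alphabetical-ordering rule, and the case where it breaks down, can be sketched with the Python standard library. The helper names `sort_children` and `alphabetical_form` and the sample records are made up for illustration. The sketch sorts sibling elements by tag name only, which is one reading of the rule described above.

```python
import xml.etree.ElementTree as ET


def sort_children(element):
    """Recursively reorder child elements alphabetically by tag name."""
    for child in element:
        sort_children(child)
    # sorted() is stable, so siblings with the same tag keep their input order.
    element[:] = sorted(element, key=lambda child: child.tag)


def alphabetical_form(xml_text: str) -> bytes:
    root = ET.fromstring(xml_text)
    sort_children(root)
    return ET.tostring(root)


# Different element order, same content: the rule gives one form.
a = "<record><genus>Homo</genus><epithet>sapiens</epithet></record>"
b = "<record><epithet>sapiens</epithet><genus>Homo</genus></record>"
assert alphabetical_form(a) == alphabetical_form(b)

# Repeated elements with the same name: the rule cannot decide their order.
c = "<record><name>Homo sapiens</name><name>Homo erectus</name></record>"
d = "<record><name>Homo erectus</name><name>Homo sapiens</name></record>"
assert alphabetical_form(c) != alphabetical_form(d)
```

Sorting repeated elements by their content as a tie-breaker would be a further assumption about whether their order carries meaning, which is exactly the gap pointed out above.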
It seems a little odd to bend the standards for the data being delivered to fit the requirement of the LSID spec. In theory, the other standard developers who set the data being delivered did not fix order because it did not matter.
This is different from Chuck's observation that the semantics of the elements within some of the standards are insufficiently specified. So what we mean is that a Darwin Core species name is, for now, just a string and nothing more.
--Bryan
On Jul 13, 2007, at 5:18 PM, Dave Vieglais wrote:
I think we are all in agreement that the data and metadata referenced by an LSID remain unchanged (in the case of the metadata, semantic equivalence is a requirement for reasons such as outlined by Matt).
My question is to do purely with the data that an LSID references through the getData() operation. The form of that data could be anything really - an encrypted byte stream, digital image, Open Office document, spreadsheet, xml document...
We all know that the same data can be represented many ways that are logically, semantically and functionally equivalent yet form a different set of bytes when serialized. Data expressed in XML is one example (is <a/> = <a /> = <a></a> ?). A palette-based image is another - the order of colors in the palette may be changed, and the pixel values adjusted to match the new palette order, but the image is still the same. There are many more simple examples that can be constructed that violate the unchanged bytes rule but for all practical and functional purposes the data are unchanged.
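Both of Dave's examples can be checked in a few lines. The sketch below assumes Python 3.8 or later, whose standard library provides `xml.etree.ElementTree.canonicalize` (an implementation of W3C Canonical XML 2.0). The palette image is modelled as a plain list of colors plus a list of pixel indices, and the `decode` helper is hypothetical. No real image format is involved.

```python
from xml.etree.ElementTree import canonicalize  # Python 3.8+

# Three serializations of the same empty element: different bytes...
forms = ["<a/>", "<a />", "<a></a>"]
assert len(set(forms)) == 3

# ...but a single canonical form.
canonical_forms = {canonicalize(xml_data=form) for form in forms}
assert len(canonical_forms) == 1


def decode(palette, pixels):
    """Expand palette indices into the actual color of each pixel."""
    return [palette[index] for index in pixels]


# The same 2x2 image stored with two different palette orders.
palette_1, pixels_1 = [(255, 0, 0), (0, 0, 255)], [0, 1, 1, 0]
palette_2, pixels_2 = [(0, 0, 255), (255, 0, 0)], [1, 0, 0, 1]

assert (palette_1, pixels_1) != (palette_2, pixels_2)   # stored form differs
assert decode(palette_1, pixels_1) == decode(palette_2, pixels_2)  # image is the same
```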
As mentioned previously, enforcing and implementing the unchanged bytes rule is not challenging. It is however quite different from stating that the data are returned unchanged. It is this that I, and I'm sure a lot of other implementors, would appreciate consensus on.
Dave V.
On Jul 14, 2007, at 09:20, Matthew Jones wrote:
In terms of the metadata returned from an LSID, or any other digital identifier, there are definite cases where metadata must be semantically persistent in order to preserve the utility of data and accuracy of scientific results.
As a trivial example, given a set of observations collected at time t, one can represent the data for those observations in dataset D and the metadata for the dataset, including the time value t, in a metadata document M. In a later event, it is discovered that t was entered incorrectly, and needs to be adjusted, creating metadata document M'. That M and M' are not congruent is critical knowledge when analyzing data from D with data from another dataset D2. In other words, because there is no true distinction between data and metadata (any given piece of information can be stored in either location), a proper archive must be able to distinguish any changes in the data and any changes in the metadata.
That said, there are some metadata that could change with little or no impact on data interpretation (e.g., the spelling of the street on which Technician Tom gets his snailmail). But at the current time it's impossible to distinguish this kind of metadata from the important kind in the general case of the existing metadata standards in use (e.g., FGDC BDP, ISO 19115, EML, GML, etc).
Our process in the KNB/SEEK/NCEAS and other ecological data archives is to give persistent identifiers to both data objects and metadata objects, and provide new identifiers when either changes.
Matt
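The practice Matt describes, where data objects and metadata objects each get their own identifier and any change yields a new one, might look roughly like the sketch below. It is not the KNB/SEEK implementation. The `Archive` class, the `name.revision` identifier scheme and the sample contents are assumptions made up for illustration.

```python
class Archive:
    """Toy archive: every distinct version of an object keeps its own identifier."""

    def __init__(self):
        self._objects = {}    # identifier -> bytes
        self._revisions = {}  # object name -> latest revision number

    def store(self, name: str, content: bytes) -> str:
        latest = self._revisions.get(name, 0)
        # Unchanged content keeps its existing identifier.
        if latest and self._objects[f"{name}.{latest}"] == content:
            return f"{name}.{latest}"
        # Changed (or new) content is stored under a new identifier.
        revision = latest + 1
        identifier = f"{name}.{revision}"
        self._objects[identifier] = content
        self._revisions[name] = revision
        return identifier

    def fetch(self, identifier: str) -> bytes:
        return self._objects[identifier]


archive = Archive()
d = archive.store("dataset", b"obs1,obs2,obs3")   # dataset D
m = archive.store("metadata", b"t=09:00")         # metadata document M
m_prime = archive.store("metadata", b"t=10:00")   # corrected document M'

assert m != m_prime                               # the change is visible
assert archive.fetch(m) == b"t=09:00"             # M is still retrievable
assert archive.store("dataset", b"obs1,obs2,obs3") == d
```

An analysis can then cite both `d` and `m` (or `m_prime`), so a reader knows which version of the time value it relied on.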
Dave Vieglais wrote:
Hi Bob,
Just because a standard is published does not mean that it is practical. Requiring that a set of bytes referenced by an LSID are unchanged has a lot of implications with respect to the implementation of data services. For example, if it is agreed to abide by the rule that the blob referenced by an LSID remains forever unchanged, then that implies that the data provider stores the data as a blob, rather than risking the process of reconstructing on the fly from some database, especially for the example of data expressed in XML where functionally identical objects (constructed using different DOM libraries for example) are not identical blobs.
Asserting that two instances of an object with the same LSID are semantically equivalent is a vastly more complicated process than asserting that the canonical representations of those instances are identical. Generally there can be defined a simple set of guidelines for constructing the canonical form of an object (e.g. for xml http://www.w3.org/TR/xml-c14n) whereas asserting semantic equivalence is an ongoing topic of research.
Requiring identical blobs is certainly possible, but people need to be aware of the implications of such a requirement in the early stages of designing a system to support such a specification. My preference for the canonical form relaxes the implementation requirements considerably whilst still maintaining the integrity of the data and the intent of the LSID.
regards,
Dave V.
On Jul 14, 2007, at 08:08, Bob Morris wrote:
This entire discussion confuses me. The LSID standard is published. Why is there a discussion of what an LSID should be? The standard requires the data, as defined by the return of getData, to be identical for all resolutions of the LSID. From page 9 of the LSID spec:
"bytes getData (LSID lsid)
bytes getDataByRange (LSID lsid, integer start, integer length)
Metadata_response getMetadata (LSID lsid, string[] accepted_formats)
Metadata_response getMetadataSubset (LSID lsid, string[] accepted_formats, string selector)
The data retrieval services may implement all of the methods, or only methods for retrieving data, or only methods for retrieving associated metadata. The same LSID named data object must be resolved always to the same set of bytes. Therefore, all of the data retrieval services return the same results for the same LSID. The user has, however, the choice of which one of these to utilize depending on its location, known quality of service and other attributes. With metadata, the situation is different. Each data retrieval service can provide different metadata for the same LSID."
This doesn't seem very ambiguous to me, and doesn't have anything to do with imperfect storage of data or anything else about the physical or electronic world. If two calls to getData() with the same argument on two occasions to possibly two different resolution services do not yield the same set of bytes, then one or the other or both of those is not executing a compliant service response. Unless this discussion is really "Shall we call something other than the return of getData by the term 'data associated with the LSID'?", there seems to be nothing to discuss.
Bob
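The compliance rule Bob quotes, that every resolution service must return the same bytes for the same LSID, amounts to a simple check. The sketch below mirrors only the two data methods from the quoted signatures, with Python-style names. The `InMemoryService` class, the `consistent` helper and the `example.org` LSID are hypothetical stand-ins, not part of any real LSID toolkit.

```python
class InMemoryService:
    """Stand-in for a data retrieval service holding LSID -> bytes."""

    def __init__(self, store):
        self._store = dict(store)

    def get_data(self, lsid: str) -> bytes:
        return self._store[lsid]

    def get_data_by_range(self, lsid: str, start: int, length: int) -> bytes:
        return self._store[lsid][start:start + length]


def consistent(lsid: str, services) -> bool:
    """True if every service resolves the LSID to the same set of bytes."""
    results = {service.get_data(lsid) for service in services}
    return len(results) == 1


lsid = "urn:lsid:example.org:names:1"
primary = InMemoryService({lsid: b"<a/>"})
mirror = InMemoryService({lsid: b"<a/>"})
rebuilt = InMemoryService({lsid: b"<a></a>"})  # same information, different bytes

assert consistent(lsid, [primary, mirror])
assert not consistent(lsid, [primary, rebuilt])
assert primary.get_data_by_range(lsid, 0, 2) == b"<a"
```

Under this reading, the `rebuilt` service is non-compliant even though its XML carries the same information, which is the case Dave raises in his reply.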
On 7/13/07, Paul Kirk p.kirk@cabi.org wrote:
> In an imperfect world there is no such thing as an 'identical-byte-stream' because the technology we use is imperfect ... the disk controllers which manage our bytes and the disk we use to store our bytes have recognized error rates. Perhaps I'm being a pedant in the above analysis but I was almost persuaded that except for digital objects (images, sounds) which can be data all other 'things' (names, specimen accession numbers) had to be metadata. This to me makes no sense in the real but imperfect world we live in. An LSID assigned to a name (e.g. Homo sapiens) is assigned to the name as data, not metadata. What is 'identical' here is that if the spelling has to change for any reason the new spelling gets a new LSID and the now incorrect spelling gets deprecated (but is still resolvable) with a pointer to the correct spelling/LSID in the metadata.
>
> OK?
>
> Paul
>
> ________________________________
> From: tdwg-guid-bounces@lists.tdwg.org on behalf of Chuck Miller
> Sent: Fri 13/07/2007 19:03
> To: Dave Vieglais
> Cc: tdwg-guid@lists.tdwg.org
> Subject: RE: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
>
> Dave,
> What you say is true. But, I think we already have too many variations, subtleties, and reinterpretations which are endlessly debated.
>
> The LSID standard would be simple, clear and consistent if we used the identical-byte-stream definition. The LSID would uniquely tag a persistent byte stream. A persistent byte stream is always the same thing without any further explanation or clarification.
>
> The provider of an LSID byte-stream would need to commit to keeping that byte-stream persistent and not represent it in multiple ways, even though technically they could. If they can't commit to that, then it can't be an LSID byte-stream.
>
> And in the name of simplicity and clarity, if they had to provide different byte-stream representations then they would have to assign a different LSID to each and use "SameAs" metadata.
>
> Chuck
>
> -----Original Message-----
> From: Dave Vieglais [mailto:vieglais@ku.edu]
> Sent: Friday, July 13, 2007 12:42 PM
> To: Chuck Miller
> Cc: Ricardo Pereira; tdwg-guid@lists.tdwg.org
> Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
>
> Hi Ricardo, Chuck,
> Asserting that the byte stream returned as data associated with an LSID should never change is perhaps a bit confusing from a programmatic view. There are for example many ways to represent data in xml that are identical from an information content point of view, but the byte streams could be very different.
>
> Perhaps it might be better to state something like "the canonical representation of the data associated with an LSID must not change", or something to that effect?
>
> Dave V.
>
> On Jul 14, 2007, at 05:29, Chuck Miller wrote:
>
>> Ricardo,
>>
>> Looking at this definition: "Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change"
>>
>> Perhaps this is a more straightforward way to conceive LSIDs. The LSID goes with a byte stream. It's that byte stream that must stay the same. So, if there is a byte stream associated with a collection that needs to stay the same, then whatever that byte stream happens to be is the data that gets an LSID assigned to it. That sure seems a clearer definition of what is data and what is metadata, rather than the issue of primary object and all that.
>>
>> So we can create a new definition in the context of LSIDs: Data is a byte stream that is persistent, never changes and can have an LSID. Metadata is a byte stream that is non-persistent, might change and is only associated with an LSID.
>>
>> The institution who assigns an LSID can make their own decision about whether the byte stream being provided is persistent or non-persistent. By assigning an LSID to any byte stream, whatever it is, the institution is declaring it to be data and persistent.
>>
>> So, in the example given of an observation record with a determination that needs to remain fixed and unchanged, by assigning an LSID to that observation+determination it would be "declared to be data" and unchangeable. A different determination would then be different data with a different LSID. That would provide a solution for those who want to employ it. Others could choose not to use it.
>>
>> Chuck

--
Robert A. Morris
Professor of Computer Science
UMASS-Boston
ram@cs.umb.edu
http://bdei.cs.umb.edu/
http://www.cs.umb.edu/~ram
http://www.cs.umb.edu/~ram/calendar.html
phone (+1)617 287 6466
Hi Chuck, if XML is disallowed as a data format that can be returned using the getData() operation, then should we also disallow any data format where the same information can be represented in alternate byte streams (images, documents, source code, etc.)? Perhaps a better option is to create a new method getDataRepr() (get data representation) or perhaps getDataInstance() or something like that, where the data is guaranteed to be consistent between calls even if the byte stream cannot. This would allow sufficient flexibility for implementors to be less concerned with the very low level details of the implementation and not reduce the actual utility of the LSID system, since the referenced data can still be guaranteed to be consistent.
Just a reminder as well, thus far the discussions about LSIDs seem to have been very focussed on the specific requirements of GBIF, which is all good but there are other systems which are currently using LSIDs (as mentioned by Matt) or can benefit greatly from a consistent GUID scheme. It would be great if the LSID use guidelines that are derived during the TDWG process also remain relevant and useful to the broader community.
Dave V.
On Jul 15, 2007, at 12:28, Chuck Miller wrote:
Ricardo, I disagree on your assertion of consensus on a couple of points.
On 2) there is no consensus/decision on whether XML can be returned from a getData call. I asked this question and it has not been answered. We could disallow XML as an allowed format for getData and allow it only for getMetadata.
We do not have consensus, and actually have disagreement, on "We shouldn't for example return the bare scientific name of a species in the getData() call just because that can be immutable" because "the name itself is in the metadata". I for one believe that we cannot avoid returning a scientific name byte stream in the getData for an LSID for a scientific name. That requirement is fundamental to what we need for biodiversity data. Pragmatically and empirically, names and specimens/observations are THE most fundamental data objects existing today in the databases published by GBIF. So if we can't put LSIDs on names, we have failed to enable one of the most fundamental needs of this community. If the definition of LSIDs needs to be amended to enable that, then so be it.
Chuck
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Ricardo Pereira Sent: Fri 7/13/2007 8:12 PM Cc: tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
Folks, thanks much to all of you who replied to my post. All the posts were really relevant to our discussion.
Before we go ahead, however, let us stop for a minute to try and summarize the points we agree upon and the points on which there is still significant controversy.
I believe that we reached consensus on the following issues:
- We do agree that *LSID metadata is not required to be persistent* (i.e. clients cannot assume it is immutable). See note [1].
- We should not force XML representations of data to be byte identical just to return that in the LSID getData() call. We must find another way to fulfill this requirement.
- We should not try to return something in the LSID getData() call just for the sake of it. We shouldn't for example return the bare scientific name of a species in the getData() call just because that can be immutable and thus fulfill the requirement from the LSID spec. This is counterproductive because the name itself is in the metadata already and no client would gain anything from calling getData() in this case.
We have also raised new issues that may be worth discussing (in their own separate thread if possible):
- We "may" bend the immutability rule of LSID getData() to our benefit and accept data that is not byte stream identical, but only "semantically" identical (depending on content type maybe). If we do this, we may use the LSID getData() call more effectively to identify real datasets such as matrices, identification keys, etc.
- As Bryan pointed out, we may need to revisit what we call data and metadata. We have been using the LSID getMetadata() call to return what some people may call data (taxon names, specimens, collections). And we forgot completely that there may be other kinds of data out there that may be returned in the getData() call and that those still need metadata to describe them. I think this may be worth discussing in a separate thread.
Did I leave anything out? If so, please let us know by replying to my post and adding a short entry to either list above.
Cheers,
Ricardo
Notes:
[1] Matt may disagree with me here, but my point is that we can't force all authorities (i.e. data providers) to keep perfect archives of all versions of their databases given a heterogeneous and distributed environment we operate in. While some may want to provide this feature, other providers may not want or be able to.
Richard Pyle wrote:
It seems to me that there is a third method of resolving the problem:
When we want to identify an object that is itself digital in nature (e.g., a database record, or a binary data file such as a PDF, JPG, ASCII, Unicode, or whatever), we resolve said binary object via getData(). If, for some reason, we change the exact bit-sequence of that digital/binary object (e.g., color-correct an image, change a text string from ASCII to Unicode, or whatever...), we assign a new LSID to it (whether that "new" LSID differs from the "old" LSID only via the optional "Revision" part of the LSID, or via a new Object Identification part, is a topic for another debate).
When we want to identify an object that does not itself have a digital manifestation -- like a physical object (e.g., specimen or a particular printed copy of a publication) or an abstract/conceptual object (e.g., a taxon name, a taxon concept, a geographic place, or a cited publication) -- then we return *nothing* in response to getData(), and we treat all the attributes of said physical/abstract/conceptual object of interest to us as metadata.
If there are cases where certain metadata elements of an object without an inherent digital existence need to persist (and there are), yet we also want to allow modifications to metadata elements without the need to generate new identifiers for the underlying object (and we do) -- then we deal with those within our own community via adopted standards and best practices.
I would disagree strongly with bending the existing LSID standard, and would just as strongly favor working within its existing framework (which, I think, we can). I would also disagree with the practice of embedding XML documents as "data" for an LSID, unless the LSID is intended to represent the XML document itself (in which case there might be a different LSID to represent the database record that was used to generate the XML document; and yet another LSID to represent the abstract concept that the database record was created to represent -- like a taxon name, for example).
If we want to use LSIDs to pass around XML packages (that are not rendered as RDF) about abstract objects (e.g., taxon names), why doesn't our community define within our semantic vocabulary something along the lines of "TCS_XML", which can be established as a standard metadata component for LSIDs assigned to taxon concepts (i.e., abstract objects, identified by "data-less" LSIDs)? The exact bytestream of the content of that metadata element can change, without changing its canonical rendering.
I'm beginning to suspect (strongly) that I am completely missing some fundamental point here -- and perhaps it is the same point that underlies the apparent antagonism towards LSIDs in general (which I do not yet share). But I am fairly certain we are dealing with some level of miscommunication here.
Aloha, Rich
-----Original Message----- From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of P. Bryan Heidorn Sent: Friday, July 13, 2007 12:48 PM To: Dave Vieglais Cc: tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
There seem to be two methods of resolving this problem.
One is to change the LSID definitions to allow semantic equivalence in the data and not require exact bit stream equivalence.
The other option is to change the data representation so that it is "easily" reduced to a repeatable canonical form. For example, it is almost as easy as saying that where XML does not specify the order of elements, elements will be ordered alphabetically. Seems stupid, but it almost works... except where you have repeating elements with the same element name, where it does not work.
It seems a little odd to bend the standards for the data being delivered to fit the requirement of the LSID spec. In theory, the other standard developers who set the data being delivered did not fix order because it did not matter.
This is different from Chuck's observation that the semantics of the element within some of the standards are insufficiently specified. So, what we mean is a darwin mode species name is just a string and nothing more now.
--Bryan
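A minimal sketch of the "order elements alphabetically" idea from Bryan's message, written in Python with the standard library. The function name sorted_form and the sample documents are made up for illustration, and this is not a real canonicalization scheme. It shows both halves of the observation: reordering differently named siblings gives the same bytes, but repeated elements with the same name keep their input order, so two documents that differ only in the order of those repeats still serialize differently.

```python
import xml.etree.ElementTree as ET


def sorted_form(xml_text: str) -> bytes:
    """Serialize XML with each element's children ordered by tag name."""
    root = ET.fromstring(xml_text)
    # Materialize the element list first so we do not reorder while iterating.
    for elem in list(root.iter()):
        # sorted() is stable: children sharing a tag keep their input order.
        elem[:] = sorted(elem, key=lambda child: child.tag)
    return ET.tostring(root)


# Differently named siblings: order no longer matters.
same_1 = sorted_form("<r><b/><a/></r>")
same_2 = sorted_form("<r><a/><b/></r>")

# Repeated element name: sorting by name cannot decide an order.
repeat_1 = sorted_form("<r><n>1</n><n>2</n></r>")
repeat_2 = sorted_form("<r><n>2</n><n>1</n></r>")

print(same_1 == same_2)      # siblings with distinct names
print(repeat_1 == repeat_2)  # siblings with the same name
```

Whether the order of the repeated elements carries meaning depends on the schema in question, which is exactly why a name-based sort cannot settle it.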
On Jul 13, 2007, at 5:18 PM, Dave Vieglais wrote:
I think we are all in agreement that the data and metadata referenced by an LSID remain unchanged (in the case of the metadata, semantic equivalence is a requirement for reasons such as outlined by Matt).
My question is to do purely with the data that an LSID references through the getData() operation. The form of that data could be anything really - an encrypted byte stream, digital image, Open Office document, spreadsheet, xml document...
We all know that the same data can be represented many ways that are logically, semantically and functionally equivalent yet form a different set of bytes when serialized. Data expressed in XML is one example (is <a/> = <a /> = <a></a> ?). A palette-based image is another - the order of colors in the palette may be changed, and the pixel values adjusted to match the new palette order, but the image is still the same. There are many more simple examples that can be constructed that violate the unchanged bytes rule but for all practical and functional purposes the data are unchanged.
As mentioned previously, enforcing and implementing the unchanged bytes rule is not challenging. It is however quite different from stating that the data are returned unchanged. It is this that I, and I'm sure a lot of other implementors, would appreciate consensus on.
Dave V.
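To make the <a/> = <a /> = <a></a> question concrete, here is a small Python sketch. It assumes Python 3.8 or later, where the standard library offers xml.etree.ElementTree.canonicalize (an implementation of Canonical XML 2.0, which is newer than the canonicalization work under discussion in this thread). The use of SHA-256 digests to compare byte streams is an illustrative choice only, not something the LSID spec prescribes.

```python
import hashlib
from xml.etree.ElementTree import canonicalize  # available since Python 3.8

# Three serializations of the same empty element.
variants = ["<a/>", "<a />", "<a></a>"]


def digest(text: str) -> str:
    """SHA-256 of the UTF-8 bytes of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Digests of the bytes exactly as written.
raw_digests = {digest(v) for v in variants}

# Digests after reducing each variant to its canonical form.
canonical_digests = {digest(canonicalize(v)) for v in variants}

print(len(raw_digests), "distinct raw byte streams")
print(len(canonical_digests), "distinct canonical byte streams")
```

Under a strict identical-bytes rule the three variants are three different pieces of data; under a canonical-representation rule they collapse into one.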
On Jul 14, 2007, at 09:20, Matthew Jones wrote:
In terms of the metadata returned from an LSID, or any other digital identifier, there are definite cases where metadata must be semantically persistent in order to preserve the utility of data and accuracy of scientific results.
As a trivial example, given a set of observations collected at time t, one can represent the data for those observations in dataset D and the metadata for the dataset, including the time value t, in a metadata document M. In a later event, it is discovered that t was entered incorrectly, and needs to be adjusted, creating metadata document M'. That M and M' are not congruent is critical knowledge when analyzing data from D with data from another dataset D2. In other words, because there is no true distinction between data and metadata (any given piece of information can be stored in either location), a proper archive must be able to distinguish any changes in the data and any changes in the metadata.
That said, there are some metadata that could change with little or no impact on data interpretation (e.g., the spelling of the street on which Technician Tom gets his snailmail). But at the current time it's impossible to distinguish this kind of metadata from the important kind in the general case of the existing metadata standards in use (e.g., FGDC BDP, ISO 19115, EML, GML, etc).
Our process in the KNB/SEEK/NCEAS and other ecological data archives is to give persistent identifiers to both data objects and metadata objects, and provide new identifiers when either changes.
Matt
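The practice Matt describes (a new identifier whenever the data or the metadata changes) could look roughly like the Python sketch below. Everything here is an assumption made for illustration: the RevisionRegistry class, the example.org authority, and the choice to express each new identifier through the optional revision part of an LSID (urn:lsid:authority:namespace:object:revision). It is not a description of how KNB, SEEK or NCEAS actually implement this.

```python
import hashlib


class RevisionRegistry:
    """Hands out one identifier per distinct byte stream of an object."""

    def __init__(self, base: str):
        self.base = base    # hypothetical LSID without the revision part
        self.digests = []   # digest of each known revision, oldest first

    def identifier_for(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in self.digests:
            # Unseen bytes: register them as the next revision.
            self.digests.append(digest)
        revision = self.digests.index(digest) + 1
        return f"{self.base}:{revision}"


# Separate registries for the dataset D and its metadata document M.
data_ids = RevisionRegistry("urn:lsid:example.org:data:D")
meta_ids = RevisionRegistry("urn:lsid:example.org:metadata:M")

d_id = data_ids.identifier_for(b"obs1,obs2,obs3")
m_id = meta_ids.identifier_for(b"t=1999-07-13")
m_fixed_id = meta_ids.identifier_for(b"t=1999-07-14")  # corrected time value
m_again_id = meta_ids.identifier_for(b"t=1999-07-13")  # original bytes again

print(d_id, m_id, m_fixed_id, m_again_id)
```

A client that recorded m_id alongside its results can later tell that the corrected document M' is a different object, which is the property Matt argues an archive needs.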
Dave Vieglais wrote:
Hi Bob,
Just because a standard is published does not mean that it is practical. Requiring that a set of bytes referenced by an LSID is unchanged has a lot of implications with respect to the implementation of data services. For example, if it is agreed to abide by the rule that the blob referenced by an LSID remains forever unchanged, then that implies that the data provider stores the data as a blob, rather than risking the process of reconstructing on the fly from some database, especially for the example of data expressed in XML where functionally identical objects (constructed using different DOM libraries for example) are not identical blobs.
Asserting that two instances of an object with the same LSID are semantically equivalent is a vastly more complicated process than asserting that the canonical representations of those instances are identical. Generally there can be defined a simple set of guidelines for constructing the canonical form of an object (e.g. for XML, http://www.w3.org/TR/xml-c14n) whereas asserting semantic equivalence is an ongoing topic of research.
Requiring identical blobs is certainly possible, but people need to be aware of the implications of such a requirement in the early stages of designing a system to support such a specification. My preference for the canonical form relaxes the implementation requirements considerably whilst still maintaining the integrity of the data and the intent of the LSID.
regards, Dave V.
On Jul 14, 2007, at 08:08, Bob Morris wrote:
> This entire discussion confuses me. The LSID standard is published. Why is there a discussion of what an LSID should be? The standard requires the data, as defined by the return of getData, to be identical for all resolutions of the LSID. From page 9 of the LSID spec:
> " bytes getData (LSID lsid)
> bytes getDataByRange (LSID lsid, integer start, integer length)
> Metadata_response getMetadata (LSID lsid, string[] accepted_formats)
> Metadata_response getMetadataSubset (LSID lsid, string[] accepted_formats, string selector)
> The data retrieval services may implement all of the methods, or only methods for retrieving data, or only methods for retrieving associated metadata.
> The same LSID named data object must be resolved always to the same set of bytes. Therefore, all of the data retrieval services return the same results for the same LSID. The user has, however, the choice of which one of these to utilize depending on its location, known quality of service and other attributes. With metadata, the situation is different. Each data retrieval service can provide different metadata for the same LSID."
> This doesn't seem very ambiguous to me, and doesn't have anything to do with imperfect storage of data or anything else about the physical or electronic world. If two calls to getData() with the same argument on two occasions to possibly two different resolution services do not yield the same set of bytes, then one or the other or both of those is not executing a compliant service response.
> Unless this discussion is really "Shall we call something other than the return of getData by the term 'data associated with the LSID'?" there seems to be nothing to discuss.
> Bob
> On 7/13/07, Paul Kirk p.kirk@cabi.org wrote:
>> In an imperfect world there is no such thing as an 'identical-byte-stream' because the technology we use is imperfect ... the disk controllers which manage our bytes and the disk we use to store our bytes have recognized error rates. Perhaps I'm being a pedant in the above analysis but I was almost persuaded that except for digital objects (images, sounds) which can be data all other 'things' (names, specimen accession numbers) had to be metadata. This to me makes no sense in the real but imperfect world we live in. An LSID assigned to a name (e.g. Homo sapiens) is assigned to the name as data, not metadata. What is 'identical' here is that if the spelling has to change for any reason the new spelling gets a new LSID and the now incorrect spelling gets deprecated (but is still resolvable) with a pointer to the correct spelling/LSID in the metadata.
>> OK?
>> Paul
>> ________________________________
>> From: tdwg-guid-bounces@lists.tdwg.org on behalf of Chuck Miller
>> Sent: Fri 13/07/2007 19:03
>> To: Dave Vieglais
>> Cc: tdwg-guid@lists.tdwg.org
>> Subject: RE: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
>> Dave,
>> What you say is true. But, I think we already have too many variations, subtleties, and reinterpretations which are endlessly debated.
>> The LSID standard would be simple, clear and consistent if we used the identical-byte-stream definition. The LSID would uniquely tag a persistent byte stream. A persistent byte stream is always the same thing without any further explanation or clarification.
>> The provider of an LSID byte-stream would need to commit to keeping that byte-stream persistent and not represent it in multiple ways, even though technically they could. If they can't commit to that, then it can't be an LSID byte-stream.
>> And in the name of simplicity and clarity, if they had to provide different byte-stream representations then they would have to assign a different LSID to each and use "SameAs" metadata.
>> Chuck
>> -----Original Message-----
>> From: Dave Vieglais [mailto:vieglais@ku.edu]
>> Sent: Friday, July 13, 2007 12:42 PM
>> To: Chuck Miller
>> Cc: Ricardo Pereira; tdwg-guid@lists.tdwg.org
>> Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
>> Hi Ricardo, Chuck,
>> Asserting that the byte stream returned as data associated with an LSID should never change is perhaps a bit confusing from a programmatic view. There are for example many ways to represent data in xml that are identical from an information content point of view, but the byte streams could be very different.
>> Perhaps it might be better to state something like "the canonical representation of the data associated with an LSID must not change", or something to that effect?
>> Dave V.
>> On Jul 14, 2007, at 05:29, Chuck Miller wrote:
>>> Ricardo,
>>> Looking at this definition: "Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change"
>>> Perhaps this is a more straightforward way to conceive LSIDs. The LSID goes with a byte stream. It's that byte stream that must stay the same. So, if there is a byte stream associated with a collection that needs to stay the same, then whatever that byte stream happens to be is the data that gets an LSID assigned to it. That sure seems a clearer definition of what is data and what is metadata, rather than the issue of primary object and all that.
>>> So we can create a new definition in the context of LSIDs: Data is a byte stream that is persistent, never changes and can have an LSID. Metadata is a byte stream that is non-persistent, might change and is only associated with an LSID.
>>> The institution who assigns an LSID can make their own decision about whether the byte stream being provided is persistent or non-persistent. By assigning an LSID to any byte stream, whatever it is, the institution is declaring it to be data and persistent.
>>> So, in the example given of an observation record with a determination that needs to remain fixed and unchanged, by assigning an LSID to that observation+determination it would be "declared to be data" and unchangeable. A different determination would then be different data with a different LSID. That would provide a solution for those who want to employ it. Others could choose not to use it.
>>> Chuck
>>> From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of Ricardo Pereira
>>> Sent: Friday, July 13, 2007 9:47 AM
>>> To: tdwg-guid@lists.tdwg.org
>>> Subject: [tdwg-guid] LSID metadata persistence (or lack thereof)
>>> Hi there folks,
>>> As Chuck mentioned a few weeks ago, we do have a few outstanding issues to address regarding LSIDs. I would like to discuss those one by one, in an orderly manner, and reach consensus as much as we can. Then we can sum them up in a TDWG standard, possibly by or shortly after the Bratislava conference.
>>> The first issue I would like to discuss is LSID metadata persistence. First, let me remind you of a corollary established by the LSID specification:
>>> Corollary 1: LSIDs are not guaranteed to be resolvable indefinitely.
>>> In other words, there is no guarantee that one will always be able to retrieve the data associated with an LSID as the authority may choose (or be forced) not to resolve an LSID anymore.
>>> Second, let me distinguish this kind of persistence I'm talking about from other two related concepts (which we'll not discuss in this thread):
>>> 1) Persistence of Assignment: Once assigned to an object, an LSID is indefinitely associated with it. The same LSID cannot be assigned to another object. Ever! The LSID may not be resolvable anymore, but it cannot be assigned to another object. This is established by the LSID specification.
>>> 2) Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change. Although the LSID may not be resolvable anymore (according to corollary 1), the data associated with an LSID must never ever change. That's defined by the LSID spec, too.
>>> What I want to discuss here is the persistence of LSID metadata (what is returned by the getMetadata call) or the lack thereof.
>>> A use case associated with metadata persistence is when someone collects observation records (and implicitly, their determinations) and runs an experiment (a model or simulation) with it. This person may want to record the identifiers of the points used so that someone using the results of that experiment may refer back to the primary data, to validate or repeat the experiment.
>>> The bad news is that LSID identification scheme (or any other GUID that I know of) was not designed to guarantee metadata persistence, and thus it cannot implement the use case above by itself. To implement that use case, the specification would have to guarantee that the metadata (which we are using here as data) is immutable. But it doesn't.
>>> Most of us wish that metadata was persistent, but it isn't. Many things can change in the metadata: a new determination, a misspelling that is corrected, many things. We just cannot guarantee that the metadata will look like it was sometime ago.
>>> We then reach the following conclusion.
>>> Corollary 2: LSID metadata is not immutable nor persistent.
>>> The consequence of this corollary is that, if you need to refer back to a piece of information (metadata) associated with an LSID, exactly as it was when you got it, you must make a copy of it, or arrange that someone else make that copy for you.
>>> In other words, a client cannot assume that the metadata associated with an LSID today will be the same tomorrow. If the client does assume that, it may be relying on a false assumption and its output may be flawed.
>>> If we are not happy with that conclusion, we may develop an additional component in our architecture, an archive of some sort, to handle (meta)data persistence. That is exactly what the STD-DOI project (http://www.std-doi.de/) and SEEK (http://seek.ecoinformatics.org) have done to some extent.
>>> While we cannot guarantee that LSID metadata is persistent nor immutable, we can definitely document how the metadata have changed through metadata versioning. That's the topic of the next thread. We will move on to discuss metadata versioning as soon as we are done with metadata persistence.
>>> Cheers,
>>> Ricardo
_______________________________________________ tdwg-guid mailing list tdwg-guid@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-guid
I'm not sure I understand this fixation with the getData() call. Why is it so important to use that call to retrieve bytestream information relating to objects that are not themselves inherently digital? Much of what we are interested in within the biodiversity informatics community, in terms of what we want to establish identifiers for, are not inherently digital objects and therefore should NOT have any bytes returned for getData(). Some of our objects *are* inherently digital (PDFs, image files of various formats, video clips, audio files, possibly Genbank sequences in a specified format and encoding, etc.) To me, the distinction is very simple: is the object that the LSID identifies a binary data file? If yes, then the binary data become the data of the LSID. If no, then the LSID has no binary "data" (sensu LSID Spec), and returns only metadata through getMetadata(). The LSID spec refers to such LSIDs as "Abstract" (or sometimes "Conceptual") LSIDs.
It's really not that complicated -- unless, as I suggested previously, I am missing something fundamentally important.
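The rule of thumb above can be sketched in a few lines of Python. This is an illustration only: LsidRecord, get_data, get_metadata and the example.org identifiers are hypothetical names, and returning an empty byte string is just one way of modelling "no bytes returned for getData()".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LsidRecord:
    lsid: str
    data: Optional[bytes] = None            # None marks an abstract LSID
    metadata: dict = field(default_factory=dict)


def get_data(record: LsidRecord) -> bytes:
    """Return the fixed bytes of a digital object, or nothing if abstract."""
    return record.data if record.data is not None else b""


def get_metadata(record: LsidRecord) -> dict:
    """Return a copy of the (mutable, non-persistent) metadata."""
    return dict(record.metadata)


# An inherently digital object: the image bytes are the data.
image = LsidRecord(
    lsid="urn:lsid:example.org:images:1001",
    data=b"\xff\xd8\xff\xe0",  # first bytes of a JPEG file, for illustration
    metadata={"format": "image/jpeg"},
)

# A taxon name: no binary data, everything of interest is metadata.
name = LsidRecord(
    lsid="urn:lsid:example.org:names:2002",
    metadata={"genus": "Centropyge", "author": "Kaup", "year": "1860"},
)

print(len(get_data(image)), len(get_data(name)), get_metadata(name)["genus"])
```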
I don't understand the advantage we gain by "force-fitting" some digitized rendering of an otherwise non-digital object. Taxon Names (for example) have no inherent digital manifestation. We create an artificial digital representation of them by stringing ASCII or Unicode characters together in a way that resembles (in principle) the characters otherwise represented by ink on paper. But if we want to embed such a character string as "data" for an LSID, then the LSID is really an identifier for the *character string* itself, NOT the "notion" or "idea" or "concept" of the taxon name. As a taxonomist and biodiversity informatics manager, I have very little use for LSIDs that identify specific character strings. I want an LSID that identifies the shared understanding of a taxon name -- not an artificial/substitute rendering of the taxon name. I see no advantage to creating one LSID for a text string that encodes a taxon name as UTF-8, and another LSID for the same name encoded as UTF-16, and so on, and so on. These variants are purely artificial from the perspective of what I want an LSID for (i.e., the idea/notion/concept of a taxon name).
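A short Python illustration of the encoding point, using the genus name mentioned elsewhere in this message. The choice of the little-endian UTF-16 variant without a byte order mark is an assumption made to keep the example deterministic.

```python
name = "Centropyge"

utf8_bytes = name.encode("utf-8")
utf16_bytes = name.encode("utf-16-le")  # little-endian, no byte order mark

# The byte streams differ, so a strict identical-bytes rule would need two LSIDs.
print(len(utf8_bytes), len(utf16_bytes), utf8_bytes == utf16_bytes)

# Decoded, both streams are the same sequence of characters.
print(utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le"))
```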
I do acknowledge that the idea of an "Abstract" LSID was really meant to serve as an "umbrella" of sorts to tie together multiple data-bearing LSIDs. The classic example is an image that can be represented as a RAW, a TIFF, or a JPEG file format. Assuming all three image files derive from the same shutter-release event of a camera, then the intended function of an "Abstract" LSID is to serve to gather together the LSIDs established for each of the three file formats of the "same" image. The images are the "same" only in the conceptual -- i.e., that they all derive from the shutter-release event. But the point is, the purpose of the "Abstract" LSID is really intended to be a mechanism of organizing data-bearing LSIDs that refer to different digital renderings of the "same thing". From the "LSID Best Practices" website (http://www-128.ibm.com/developerworks/opensource/library/os-lsidbp/), under the heading "Abstract LSIDs":
"The abstract LSID provides the anchor point for software and users to explore the metadata and obtain further pointers to all the concrete LSID references that contain data, along with the data's exact relationship to the abstract concept."
This implies that "Abstract" LSIDs should exist primarily to aggregate data-bearing LSIDs.
For the most part, I don't think this is what we are really trying to do when we want to assign LSIDs to non-digital objects like taxon names, specimens, etc. So, in a sense, what I am advocating deviates a bit from the intention of an "Abstract" LSID. But at least I'm not outright violating the fundamental tenets of the LSID spec, like trying to apply a single LSID to more than one bytestream returnable via getData().
So, again, I return to my original confusion: why all the fixation with the getData() call?
The only reasons I can think of are:
1) Semantics (of the human communication kind): We're uncomfortable referring to the text string C-e-n-t-r-o-p-y-g-e (minus the dashes) as being mere "metadata" for the angelfish genus described by Kaup in 1860 -- when it just feels like the "actual" name to us (and hence should be thought of as "data").
2) Persistence: We want to embed information as "data" for the LSID because we want to make sure the "same information" is always there, and the LSID spec emphasizes the permanent relationship between an LSID and its data. The only trouble is, we want to define the word "same" in this context in a way that is utterly incomprehensible (without all manner of comparison algorithms) to a computer. *We* know that "Chaetodon" is the "same" as "Chætodon", so we want a single LSID to refer to the genus name for butterflyfishes described by Linnaeus in 1758. And we don't like being required to always choose one rendering or the other to embed as the bit-identical "data" for the LSID.
3) Performance(?): This is where I may be missing something fundamental. Are there characteristics of the getData() call that are far superior to getMetadata()?
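Point 2 can be made concrete with a small Python sketch. The folding table and the fold_name helper are hypothetical, a stand-in for the "comparison algorithms" mentioned above rather than any agreed community rule. The sketch also shows that standard Unicode normalization alone does not make the two spellings equal, because the ligature has no decomposition.

```python
import unicodedata

# Hypothetical, community-defined folding of ligatures used in older names.
LIGATURES = str.maketrans({"æ": "ae", "Æ": "Ae", "œ": "oe", "Œ": "Oe"})


def fold_name(name: str) -> str:
    """Normalize to NFC, then expand ligatures using the table above."""
    return unicodedata.normalize("NFC", name).translate(LIGATURES)


plain = "Chaetodon"
ligature = "Chætodon"

# Different characters, hence different bytes in any encoding.
print(plain == ligature)

# Compatibility normalization leaves the ligature untouched.
print(unicodedata.normalize("NFKD", ligature) == ligature)

# Only an explicit, agreed-upon rule makes the two "the same".
print(fold_name(plain) == fold_name(ligature))
```

The explicit table is the point: a computer has no notion of "same name" until someone writes the rule down, which is the kind of community convention discussed in the replies to number 2 below.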
As for number 1: all I can say is "get over it". Our unfortunate reality in biodiversity informatics is a preponderance of homonymy -- not just in taxon names, but in our human-mitigated communication lexicon as well.
As for number 2: We can deal with persistence through layers of standards and convention within our community. Almost everything we talk about involves an assumption of adherence to standards and conventions. If we want persistent metadata, then we need to formalize a document detailing which metadata elements should be mandatory and/or persistent and/or have other properties that we as a community feel are important. This document would also outline when metadata may be modified for a given LSID, vs. when a new LSID should be generated, allowing certain metadata elements for each to remain unchanged (e.g., perhaps one LSID for "Chaetodon" and another for "Chætodon", for the object type "Digital Taxon Name Rendering"). The document would also outline how multiple LSIDs should be cross-referenced to each other (e.g., the two "DTNR" objects identified by two different LSIDs in the previous example would both refer to the same Abstract LSID established for the butterflyfish genus name described by Linnaeus in 1758).
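As a rough sketch of what such a convention might look like in practice (Python, purely illustrative): the set of "persistent" metadata elements, the element names, and the `urn:lsid:example.org:...` identifiers below are all hypothetical, not taken from any TDWG document.

```python
# Hypothetical community policy: metadata elements that must never change
# for a given LSID. Changing any of them means minting a new LSID.
PERSISTENT_ELEMENTS = frozenset({"objectType", "rendering", "renderingOf"})

# Hypothetical abstract ("data-less") LSID for the genus name itself.
ABSTRACT_NAME_LSID = "urn:lsid:example.org:taxonname:1001"

# Two "Digital Taxon Name Rendering" (DTNR) objects, each with its own LSID,
# both cross-referenced to the same abstract LSID.
dtnr_records = {
    "urn:lsid:example.org:dtnr:1": {
        "objectType": "DTNR",
        "rendering": "Chaetodon",
        "renderingOf": ABSTRACT_NAME_LSID,
        "note": "ASCII rendering",
    },
    "urn:lsid:example.org:dtnr:2": {
        "objectType": "DTNR",
        "rendering": "Ch\u00e6todon",
        "renderingOf": ABSTRACT_NAME_LSID,
        "note": "ligature rendering",
    },
}


def needs_new_lsid(current: dict, proposed: dict) -> bool:
    """Return True if the proposed metadata alters any persistent element."""
    return any(current.get(key) != proposed.get(key) for key in PERSISTENT_ELEMENTS)


current = dtnr_records["urn:lsid:example.org:dtnr:1"]
note_edit = dict(current, note="ASCII rendering, reviewed")   # allowed in place
rendering_edit = dict(current, rendering="Ch\u00e6todon")     # requires a new LSID

print(needs_new_lsid(current, note_edit))       # False
print(needs_new_lsid(current, rendering_edit))  # True
```

The point of the sketch is only that the rule "modify in place vs. mint a new LSID" can be made mechanical once the community agrees on which elements are persistent.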
As for number 3: I just hope someone can explain to me where I missed the boat.
One final note: I do see a way that we can preserve the spirit of intent for the "Abstract LSID" in our domain for things like Taxon Names. Rather than explain it here, I will follow up with another email describing it.
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences and Associate Zoologist in Ichthyology Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
________________________________
From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of Chuck Miller
Sent: Saturday, July 14, 2007 2:29 PM
To: Ricardo Pereira
Cc: tdwg-guid@lists.tdwg.org
Subject: RE: [tdwg-guid] LSID metadata persistence (or lack thereof)

Ricardo,

I disagree on your assertion of consensus on a couple of points.

On 2) there is no consensus/decision on whether XML can be returned from a getData call. I asked this question and it has not been answered. We could disallow XML as an allowed format for getData and allow it only for getMetadata.

We do not have consensus and actually have disagreement on "We shouldn't for example return the bare scientific name of a species in the getData() call just because that can be immutable" because "the name itself is in the metadata". I for one believe that we cannot avoid returning a scientific name byte stream in the getData for an LSID for a scientific name. That requirement is fundamental to what we need for biodiversity data.

Pragmatically and empirically, names and specimens/observations are THE most fundamental data objects existing today in the databases published by GBIF. So if we can't put LSIDs on names, we have failed to enable one of the most fundamental needs of this community. If the definition of LSIDs needs to be amended to enable that, then so be it.

Chuck
________________________________
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Ricardo Pereira
Sent: Fri 7/13/2007 8:12 PM
Cc: tdwg-guid@lists.tdwg.org
Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)

Folks,

Thanks much to all of you who replied to my post. All the posts were really relevant to our discussion.

Before we go ahead, however, let us stop for a minute to try and summarize the points we agree upon and the points in which there is still significant controversy.

I believe that we reached consensus in the following issues:

1) We do agree that *LSID metadata is not required to be persistent* (i.e. clients cannot assume it is immutable). See note [1].

2) We should not force XML representations of data to be byte identical just to return that in the LSID getData() call. We must find another way to fulfill this requirement.

3) We should not try to return something in the LSID getData() call just for the sake of it. We shouldn't for example return the bare scientific name of a species in the getData() call just because that can be immutable and thus fulfill the requirement from the LSID spec. This is counterproductive because the name itself is in the metadata already and no client would gain anything from calling getData() in this case.

We have also raised new issues that may be worth discussing (in their own separate thread if possible):

4) We "may" bend the immutability rule of LSID getData() to our benefit and accept data that is not byte stream identical, but only "semantically" identical (depending on content type maybe). If we do this, we may use the LSID getData() call more effectively to identify real datasets such as matrices, identification keys, etc.

5) As Brian pointed out, we may need to revisit what we call data and metadata. We have been using the LSID getMetadata() call to return what some people may call data (taxon names, specimens, collections). And we forgot completely that there may be other kinds of data out there that may be returned in the getData() call and that those still need metadata to describe them. I think this may be worth discussing in a separate thread.

Did I leave anything out?
If so, please let us know by replying to my post and adding a short entry to either list above.

Cheers,

Ricardo

Notes:
-------
[1] Matt may disagree with me here, but my point is that we can't force all authorities (i.e. data providers) to keep perfect archives of all versions of their databases given a heterogeneous and distributed environment we operate in. While some may want to provide this feature, other providers may not want or be able to.

Richard Pyle wrote:
> It seems to me that there is a third method to resolving the problem:
>
> When we want to identify an object that is itself digital in nature (e.g., a database record, or a binary data file such as a PDF, JPG, ASCII, Unicode, or whatever), we resolve said binary object via getData(). If, for some reason, we change the exact bit-sequence of that digital/binary object (e.g., color-correct an image, change a text string from ASCII to Unicode, or whatever...), we assign a new LSID to it (whether that "new" LSID differs from the "old" LSID only via the optional "Revision" part of the LSID, or via a new Object Identification part, is a topic for another debate).
>
> When we want to identify an object that does not itself have a digital manifestation -- like a physical object (e.g., specimen or a particular printed copy of a publication) or an abstract/conceptual object (e.g., a taxon name, a taxon concept, a geographic place, or a cited publication) -- then we return *nothing* in response to getData(), and we treat all the attributes of said physical/abstract/conceptual object of interest to us as metadata.
>
> If there are cases where certain metadata elements of an object without an inherent digital existence need to persist (and there are), yet we also want to allow modifications to metadata elements without the need to generate new identifiers for the underlying object (and we do) -- then we deal with those within our own community via adopted standards and best practices.
>
> I would disagree strongly with bending the existing LSID standard, and would just as strongly favor working within its existing framework (which, I think, we can). I would also disagree with the practice of embedding XML documents as "data" for an LSID, unless the LSID is intended to represent the XML document itself (in which case there might be a different LSID to represent the database record that was used to generate the XML document; and yet another LSID to represent the abstract concept that the database record was created to represent -- like a taxon name, for example).
>
> If we want to use LSIDs to pass around XML packages (that are not rendered as RDF) about abstract objects (e.g., taxon names), why doesn't our community define within our semantic vocabulary something along the lines of "TCS_XML", which can be established as a standard metadata component for LSIDs assigned to taxon concepts (i.e., abstract objects, identified by "data-less" LSIDs). The exact bytestream of the content of that metadata element can change, without changing its canonical rendering.
>
> I'm beginning to suspect (strongly) that I am completely missing some fundamental point here -- and perhaps it is the same point that underlies the apparent antagonism towards LSIDs in general (which I do not yet share). But I am fairly certain we are dealing with some level of miscommunication here.
>
> Aloha,
> Rich
>
>> -----Original Message-----
>> From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of P. Bryan Heidorn
>> Sent: Friday, July 13, 2007 12:48 PM
>> To: Dave Vieglais
>> Cc: tdwg-guid@lists.tdwg.org
>> Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
>>
>> There seem to be two methods to resolving this problem.
>>
>> One is to change the LSID definitions to allow semantic equivalence in the data and not require exact bit stream equivalence.
>>
>> The other option is to change the data representation so that it is "easily" reduced to a repeatable canonical form. For example, it is almost as easy as saying where XML ordering does not specify order of elements, elements will be ordered alphabetically. Seems stupid but it almost works, except where you have repeating elements with the same element name, where it does not work.
>>
>> It seems a little odd to bend the standards for the data being delivered to fit the requirement of the LSID spec. In theory, the other standard developers who set the data being delivered did not fix order because it did not matter.
>>
>> This is different from Chuck's observation that the semantics of the element within some of the standards are insufficiently specified. So, what we mean is a darwin mode species name is just a string and nothing more now.
>>
>> --Bryan
>>
>> On Jul 13, 2007, at 5:18 PM, Dave Vieglais wrote:
>>
>>> I think we are all in agreement that the data and metadata referenced by an LSID remains unchanged (in the case of the metadata, semantic equivalence is a requirement for reasons such as outlined by Matt).
>>>
>>> My question is to do purely with the data that an LSID references through the getData() operation. The form of that data could be anything really - an encrypted byte stream, digital image, Open Office document, spreadsheet, xml document...
>>>
>>> We all know that the same data can be represented many ways that are logically, semantically and functionally equivalent yet form a different set of bytes when serialized. Data expressed in XML is one example (is <a/> = <a /> = <a></a> ?). A palette-based image is another - the order of colors in the palette may be changed, and the pixel values adjusted to match the new palette order, but the image is still the same. There are many more simple examples that can be constructed that violate the unchanged bytes rule but for all practical and functional purposes the data are unchanged.
>>>
>>> As mentioned previously, enforcing and implementing the unchanged bytes rule is not challenging. It is however quite different from stating that the data are returned unchanged. It is this that I, and I'm sure a lot of other implementors, would appreciate consensus on.
>>>
>>> Dave V.
>>>
>>> On Jul 14, 2007, at 09:20, Matthew Jones wrote:
>>>
>>>> In terms of the metadata returned from an LSID, or any other digital identifier, there are definite cases where metadata must be semantically persistent in order to preserve the utility of data and accuracy of scientific results.
>>>>
>>>> As a trivial example, given a set of observations collected at time t, one can represent the data for those observations in dataset D and the metadata for the dataset, including the time value t, in a metadata document M. In a later event, it is discovered that t was entered incorrectly, and needs to be adjusted, creating metadata document M'. That M and M' are not congruent is critical knowledge when analyzing data from D with data from another dataset D2. In other words, because there is no true distinction between data and metadata (any given piece of information can be stored in either location), a proper archive must be able to distinguish any changes in the data and any changes in the metadata.
>>>>
>>>> That said, there are some metadata that could change with little or no impact on data interpretation (e.g., the spelling of the street on which Technician Tom gets his snailmail). But at the current time it's impossible to distinguish this kind of metadata from the important kind in the general case of the existing metadata standards in use (e.g., FGDC BDP, ISO 19115, EML, GML, etc).
>>>>
>>>> Our process in the KNB/SEEK/NCEAS and other ecological data archives is to give persistent identifiers to both data objects and metadata objects, and provide new identifiers when either changes.
>>>>
>>>> Matt
>>>>
>>>> Dave Vieglais wrote:
>>>>
>>>>> Hi Bob,
>>>>> Just because a standard is published does not mean that it is practical. Requiring that a set of bytes referenced by an LSID are unchanged has a lot of implications with respect to the implementation of data services. For example, if it is agreed to abide by the rule that the blob referenced by an LSID remains forever unchanged, then that implies that the data provider stores the data as a blob, rather than risking the process of reconstructing on the fly from some database, especially for the example of data expressed in XML where functionally identical objects (constructed using different DOM libraries for example) are not identical blobs.
>>>>>
>>>>> Asserting that two instances of an object with the same LSID are semantically equivalent is a vastly more complicated process than asserting that the canonical representation of those instances are identical. Generally there can be defined a simple set of guidelines for constructing the canonical form of an object (e.g. for xml, http://www.w3.org/TR/xml-c14n) whereas asserting semantic equivalence is an ongoing topic of research.
>>>>>
>>>>> Requiring identical blobs is certainly possible, but people need to be aware of the implications of such a requirement in the early stages of designing a system to support such a specification. My preference for the canonical form relaxes the implementation requirements considerably whilst still maintaining the integrity of the data and the intent of the LSID.
>>>>>
>>>>> regards,
>>>>> Dave V.
>>>>>
>>>>> On Jul 14, 2007, at 08:08, Bob Morris wrote:
>>>>>
>>>>>> This entire discussion confuses me. The LSID standard is published. Why is there a discussion of what an LSID should be? The standard requires that the data, as defined by the return of getData, to be identical for all resolutions of the LSID. From page 9 of the LSID spec:
>>>>>>
>>>>>> " bytes getData (LSID lsid)
>>>>>> bytes getDataByRange (LSID lsid, integer start, integer length)
>>>>>> Metadata_response getMetadata (LSID lsid, string[] accepted_formats)
>>>>>> Metadata_response getMetadataSubset (LSID lsid, string[] accepted_formats, string selector)
>>>>>> The data retrieval services may implement all of the methods, or only methods for retrieving data, or only methods for retrieving associated metadata. The same LSID named data object must be resolved always to the same set of bytes. Therefore, all of the data retrieval services return the same results for the same LSID. The user has, however, the choice of which one of these to utilize depending on its location, known quality of service and other attributes. With metadata, the situation is different. Each data retrieval service can provide different metadata for the same LSID."
>>>>>>
>>>>>> This doesn't seem very ambiguous to me, and doesn't have anything to do with imperfect storage of data or anything else about the physical or electronic world. If two calls to getData() with the same argument on two occasions to possibly two different resolution services do not yield the same set of bytes, then one or the other or both of those is not executing a compliant service response. Unless this discussion is really "Shall we call something other than the return of getData by the term 'data associated with the LSID'?" there seems to be nothing to discuss.
>>>>>>
>>>>>> Bob
>>>>>>
>>>>>> On 7/13/07, Paul Kirk p.kirk@cabi.org wrote:
>>>>>>
>>>>>>> In an imperfect world there is no such thing as an 'identical-byte-stream' because the technology we use is imperfect ... the disk controllers which manage our bytes and the disk we use to store our bytes have recognized error rates. Perhaps I'm being a pedant in the above analysis but I was almost persuaded that except for digital objects (images, sounds) which can be data all other 'things' (names, specimen accession numbers) had to be metadata. This to me makes no sense in the real but imperfect world we live in. An LSID assigned to a name (e.g. Homo sapiens) is assigned to the name as data, not metadata. What is 'identical' here is that if the spelling has to change for any reason the new spelling gets a new LSID and the now incorrect spelling gets deprecated (but is still resolvable) with a pointer to the correct spelling/LSID in the metadata.
>>>>>>>
>>>>>>> OK?
>>>>>>>
>>>>>>> Paul
>>>>>>>
>>>>>>> ________________________________
>>>>>>> From: tdwg-guid-bounces@lists.tdwg.org on behalf of Chuck Miller
>>>>>>> Sent: Fri 13/07/2007 19:03
>>>>>>> To: Dave Vieglais
>>>>>>> Cc: tdwg-guid@lists.tdwg.org
>>>>>>> Subject: RE: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
>>>>>>>
>>>>>>> Dave,
>>>>>>> What you say is true. But, I think we already have too many variations, subtleties, and reinterpretations which are endlessly debated.
>>>>>>>
>>>>>>> The LSID standard would be simple, clear and consistent if we used the identical-byte-stream definition. The LSID would uniquely tag a persistent byte stream. A persistent byte stream is always the same thing without any further explanation or clarification.
>>>>>>>
>>>>>>> The provider of an LSID byte-stream would need to commit to keeping that byte-stream persistent and not represent it in multiple ways, even though technically they could. If they can't commit to that, then it can't be an LSID byte-stream.
>>>>>>>
>>>>>>> And in the name of simplicity and clarity, if they had to provide different byte-stream representations then they would have to assign a different LSID to each and use "SameAs" metadata.
>>>>>>>
>>>>>>> Chuck
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Dave Vieglais [mailto:vieglais@ku.edu]
>>>>>>> Sent: Friday, July 13, 2007 12:42 PM
>>>>>>> To: Chuck Miller
>>>>>>> Cc: Ricardo Pereira; tdwg-guid@lists.tdwg.org
>>>>>>> Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
>>>>>>>
>>>>>>> Hi Ricardo, Chuck,
>>>>>>> Asserting that the byte stream returned as data associated with an LSID should never change is perhaps a bit confusing from a programmatic view. There are for example many ways to represent data in xml that are identical from an information content point of view, but the byte streams could be very different.
>>>>>>>
>>>>>>> Perhaps it might be better to state something like "the canonical representation of the data associated with an LSID must not change", or something to that effect?
>>>>>>>
>>>>>>> Dave V.
>>>>>>>
>>>>>>> On Jul 14, 2007, at 05:29, Chuck Miller wrote:
>>>>>>>
>>>>>>>> Ricardo,
>>>>>>>>
>>>>>>>> Looking at this definition: "Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change"
>>>>>>>>
>>>>>>>> Perhaps this is a more straightforward way to conceive LSIDs. The LSID goes with a byte stream. It's that byte stream that must stay the same. So, if there is a byte stream associated with a collection that needs to stay the same, then whatever that byte stream happens to be is the data that gets an LSID assigned to it. That sure seems a clearer definition of what is data and what is metadata, rather than the issue of primary object and all that.
>>>>>>>>
>>>>>>>> So we can create a new definition in the context of LSIDs: Data is a byte stream that is persistent, never changes and can have an LSID. Metadata is a byte stream that is non-persistent, might change and is only associated with an LSID.
>>>>>>>>
>>>>>>>> The institution who assigns an LSID can make their own decision about whether the byte stream being provided is persistent or non-persistent. By assigning an LSID to any byte stream, whatever it is, the institution is declaring it to be data and persistent.
>>>>>>>>
>>>>>>>> So, in the example given of an observation record with a determination that needs to remain fixed and unchanged, by assigning an LSID to that observation+determination it would be "declared to be data" and unchangeable. A different determination would then be different data with a different LSID. That would provide a solution for those who want to employ it. Others could choose not to use it.
>>>>>>>>
>>>>>>>> Chuck
>>>>>>>>
>>>>>>>> From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of Ricardo Pereira
>>>>>>>> Sent: Friday, July 13, 2007 9:47 AM
>>>>>>>> To: tdwg-guid@lists.tdwg.org
>>>>>>>> Subject: [tdwg-guid] LSID metadata persistence (or lack thereof)
>>>>>>>>
>>>>>>>> Hi there folks,
>>>>>>>>
>>>>>>>> As Chuck mentioned a few weeks ago, we do have a few outstanding issues to address regarding LSIDs. I would like to discuss those one by one, in an orderly manner, and reach consensus as much as we can. Then we can sum them up in a TDWG standard, possibly by or shortly after the Bratislava conference.
>>>>>>>>
>>>>>>>> The first issue I would like to discuss is LSID metadata persistence. First, let me remind you of a corollary established by the LSID specification:
>>>>>>>>
>>>>>>>> Corollary 1: LSIDs are not guaranteed to be resolvable indefinitely.
>>>>>>>>
>>>>>>>> In other words, there is no guarantee that one will always be able to retrieve the data associated with an LSID as the authority may choose (or be forced) not to resolve an LSID anymore.
>>>>>>>>
>>>>>>>> Second, let me distinguish this kind of persistence I'm talking about from two other related concepts (which we'll not discuss in this thread):
>>>>>>>>
>>>>>>>> 1) Persistence of Assignment: Once assigned to an object, an LSID is indefinitely associated with it. The same LSID cannot be assigned to another object. Ever! The LSID may not be resolvable anymore, but it cannot be assigned to another object. This is established by the LSID specification.
>>>>>>>>
>>>>>>>> 2) Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change. Although the LSID may not be resolvable anymore (according to corollary 1), the data associated with an LSID must never ever change. That's defined by the LSID spec, too.
>>>>>>>>
>>>>>>>> What I want to discuss here is the persistence of LSID metadata (what is returned by the getMetadata call) or the lack thereof.
>>>>>>>>
>>>>>>>> A use case associated with metadata persistence is when someone collects observation records (and implicitly, their determinations) and runs an experiment (a model or simulation) with it. This person may want to record the identifiers of the points used so that someone using the results of that experiment may refer back to the primary data, to validate or repeat the experiment.
>>>>>>>>
>>>>>>>> The bad news is that LSID identification scheme (or any other GUID that I know of) was not designed to guarantee metadata persistence, and thus it cannot implement the use case above by itself. To implement that use case, the specification would have to guarantee that the metadata (which we are using here as data) is immutable. But it doesn't.
>>>>>>>>
>>>>>>>> Most of us wish that metadata was persistent, but it isn't. Many things can change in the metadata: a new determination, a mispeling that is corrected, many things. We just cannot guarantee that the metadata will look like it was sometime ago.
>>>>>>>>
>>>>>>>> We then reach the following conclusion.
>>>>>>>>
>>>>>>>> Corollary 2: LSIDs metadata is not immutable nor persistent.
>>>>>>>>
>>>>>>>> The consequence of this corollary is that, if you need to refer back to a piece of information (metadata) associated with an LSID, exactly as it was when you got it, you must make a copy of it, or arrange that someone else make that copy for you.
>>>>>>>>
>>>>>>>> In other words, a client cannot assume that the metadata associated with an LSID today will be the same tomorrow. If the client does assume that, it may be relying on a false assumption and its output may be flawed.
>>>>>>>>
>>>>>>>> If we are not happy with that conclusion, we may develop an additional component in our architecture, an archive of some sort, to handle (meta)data persistence. That is exactly what the STD-DOI project (http://www.std-doi.de/) and SEEK (http://seek.ecoinformatics.org) have done to some extent.
>>>>>>>>
>>>>>>>> While we cannot guarantee that LSID metadata is persistent nor immutable, we can definitely document how the metadata have changed through metadata versioning. That's the topic of the next thread. We will move on to discuss metadata versioning as soon as we are done with metadata persistence.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Ricardo
>>>>>>
>>>>>> --Robert A. Morris
>>>>>> Professor of Computer Science
>>>>>> UMASS-Boston
>>>>>> ram@cs.umb.edu
>>>>>> http://bdei.cs.umb.edu/
>>>>>> http://www.cs.umb.edu/~ram
>>>>>> http://www.cs.umb.edu/~ram/calendar.html
>>>>>> phone (+1)617 287 6466

_______________________________________________
tdwg-guid mailing list
tdwg-guid@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-guid
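Dave Vieglais's question in the quoted thread above (is <a/> = <a /> = <a></a>?) and his pointer to Canonical XML can be checked with a short Python sketch. It assumes Python 3.8 or later, where the standard library provides `xml.etree.ElementTree.canonicalize` (C14N 2.0); it illustrates the "canonical form" idea only and says nothing about what an LSID resolver actually does.

```python
from xml.etree.ElementTree import canonicalize

# Three serializations of the same empty element.
variants = ["<a/>", "<a />", "<a></a>"]

# Compared as raw bytes, all three are different byte streams.
raw_forms = {v.encode("utf-8") for v in variants}

# Compared after canonicalization, they collapse to a single form.
canonical_forms = {canonicalize(v) for v in variants}

print(len(raw_forms))        # 3
print(len(canonical_forms))  # 1
```

This is the difference between "identical bytes" and "identical canonical representation"; semantic equivalence in the broader sense discussed in the thread is a much harder problem that canonicalization does not address.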
In my previous post, I quoted the LSID Best Practices page (http://www-128.ibm.com/developerworks/opensource/library/os-lsidbp/) on describing "Abstract" LSIDs. Here is the full section:
*************************************** Abstract LSIDs
The data behind the data bytes of a concept might exist in multiple data formats or derivations. One approach using a single LSID would be to append all different instances together, using some token to separate the different formats. This solution is poor for many reasons, primarily because the client must download all formats. The best approach is to create a different LSID for each data format or for derivations and connect them with a single abstract LSID.
The benefit of using an abstract scheme is that it allows for LSIDs that do not name actual data bytes but instead provide only metadata documents. These LSIDs can be used to represent abstract notions, such as a gene or protein, which may have many concrete representations. The metadata documents associated with these abstract LSIDs can contain multiple relationships pointing to LSIDs that name data bytes.
In this way, researchers can use a series of LSIDs to create an interconnected metadata graph to name objects that may have many different representations. The abstract LSID provides the anchor point for software and users to explore the metadata and obtain further pointers to all the concrete LSID references that contain data, along with the data's exact relationship to the abstract concept. This level of indirection is very powerful. ***************************************
Previously, we've debated about whether an LSID assigned to a non-digital object should be assigned to the "Abstract" object, or to a specific database record created for that object. I'll stick with the Taxon Name example, but the same principles apply to other non-digital objects like specimens, observations, reference citations, etc.
Many, many databases in the world include a database record to represent the butterflyfish genus described by Linnaeus in 1758 (which, for the sake of simplicity, I'll henceforth refer to via the ASCII rendering "Chaetodon").
Database records (rows) are, inherently, digital objects, and therefore can (with some level of established convention) be represented by binary "data" -- retrievable via getData(). Thus, the many, many database records out there can each receive a proper data-bearing LSID. Obviously, there would need to be mechanisms to make sure that the bytestream returned by getData() for these inherently digital database records is always bit-consistent. This could be relatively easy if the only "data" returned for the LSID is a specified encoding of the primary key value for the database record, and all the other columns/fields were returned via getMetadata(). But the point is, a database record *is* an inherently digital object, and therefore *can* be legitimately represented by a data-bearing (non-Abstract) LSID.
We could then assign an "Abstract" LSID for the "idea" or "notion" of the scientific name "Chaetodon", and use that LSID in the spirit of the above-quoted best practices description of Abstract LSIDs to track "further pointers to all the concrete LSID [for database records established for the genus Chaetodon] references that contain data".
That would effectively allow the Abstract LSID to serve the needs of those of us who *want* a shared, reusable, persistent identifier for the idea/notion/concept of the taxon name "Chaetodon", which itself serves as an index of sorts to all manner of database records (digital objects) that contain data (and metadata) associated with that taxon name.
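To make the arrangement concrete, here is a minimal Python sketch of the two kinds of LSID described above. Everything in it is an assumption for illustration: the class names, the `hasRepresentation` metadata key and the `example.org` LSIDs are hypothetical, and the methods only mimic getData() and getMetadata() rather than implement the LSID resolution protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RecordLsid:
    """Data-bearing LSID for one database row (hypothetical model)."""
    lsid: str
    primary_key: int
    columns: Dict[str, str] = field(default_factory=dict)

    def get_data(self) -> bytes:
        # Only a fixed encoding of the primary key is treated as "data",
        # so the returned bytes stay identical when other columns are edited.
        return str(self.primary_key).encode("ascii")

    def get_metadata(self) -> Dict[str, str]:
        # All other columns travel as metadata, which may change over time.
        return dict(self.columns)


@dataclass
class AbstractLsid:
    """Data-less LSID for the notion of a name (hypothetical model)."""
    lsid: str
    label: str
    records: List[RecordLsid] = field(default_factory=list)

    def get_data(self) -> Optional[bytes]:
        # An abstract LSID names no bytes at all.
        return None

    def get_metadata(self) -> Dict[str, object]:
        # The metadata points to every concrete record LSID for the name.
        return {"label": self.label,
                "hasRepresentation": [r.lsid for r in self.records]}


# Two databases each hold their own row for the same genus name.
row_a = RecordLsid("urn:lsid:museum-a.example.org:names:1001", 1001,
                   {"name": "Chaetodon", "author": "Linnaeus", "year": "1758"})
row_b = RecordLsid("urn:lsid:museum-b.example.org:taxa:77", 77,
                   {"name": "Chaetodon", "author": "L.", "year": "1758"})
genus = AbstractLsid("urn:lsid:registry.example.org:names:chaetodon",
                     "Chaetodon", [row_a, row_b])
```

A client holding only the abstract LSID can walk `genus.get_metadata()["hasRepresentation"]` to reach each record LSID, which is the level of indirection the Best Practices text describes.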
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences and Associate Zoologist in Ichthyology Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
Isn't this where I came in last week Rich?
The LSID urn:lsid:indexfungorum.org:names:178962 is assigned to the IF database record for Amanita phalloides (the deathcap - used as one of the EoL sample species pages [well done EoL]). If I recall correctly the getData() returns "Amanita phalloides" [not the primary key of the database record, which is 178962 - Rich, why did you restrict the getData() to returning a PK?]; the getMetadata() returns "(Vaill. ex Fr.) Link", "1833", etc. Whatever the encoding, this LSID will always return the same 'payload' - if the coding value of any of the characters in the string has to change for whatever reason (e.g. typographical 'spelling' error, Code-required orthographic correction, even capitalization of the specific epithet) it gets a new LSID (a new database record), and the 'old' LSID (the 'old' database record) has a column containing the new PK and a metadata element in the getMetadata() containing the new LSID. Kevin will correct me if I'm wrong on this.
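The behaviour recalled above (a corrected string gets a new LSID, and the old record's metadata points forward to it) can be sketched roughly as follows. This is only an illustration of the description in this message, not Index Fungorum's actual implementation; the `NameRegistry` class, the `replacedBy` key and the `example.org` authority are all made up for the sketch.

```python
from typing import Dict


class NameRegistry:
    """Toy registry in which name strings are immutable LSID data."""

    def __init__(self, authority: str, namespace: str) -> None:
        self.authority = authority
        self.namespace = namespace
        self._data: Dict[str, bytes] = {}       # never modified once set
        self._metadata: Dict[str, dict] = {}    # may gain a forward pointer
        self._next_key = 1

    def register(self, name: str, **metadata: str) -> str:
        # The primary key doubles as the object part of the LSID.
        lsid = f"urn:lsid:{self.authority}:{self.namespace}:{self._next_key}"
        self._next_key += 1
        self._data[lsid] = name.encode("utf-8")
        self._metadata[lsid] = dict(metadata)
        return lsid

    def correct(self, old_lsid: str, corrected_name: str) -> str:
        # A changed string is a new record with a new LSID; the old data
        # is left untouched and only the old metadata is extended.
        carried = {k: v for k, v in self._metadata[old_lsid].items()
                   if k != "replacedBy"}
        new_lsid = self.register(corrected_name, **carried)
        self._metadata[old_lsid]["replacedBy"] = new_lsid
        return new_lsid

    def get_data(self, lsid: str) -> bytes:
        return self._data[lsid]

    def get_metadata(self, lsid: str) -> dict:
        return dict(self._metadata[lsid])


registry = NameRegistry("example.org", "names")
# The misspelling below is deliberate, to have something to correct.
old = registry.register("Amanita phaloides", year="1833")
new = registry.correct(old, "Amanita phalloides")
```

After the correction, `registry.get_data(old)` still returns the original bytes, and a client can follow `replacedBy` in the old metadata to reach the corrected record.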
Is this your proposed solution Rich?
Paul
PS the author string above could have been represented by "(" & urn:lsid:ipni.org:authors:11023-1 & "ex" & urn:lsid:ipni.org:authors:2913-1 & ")" & urn:lsid:ipni.org:authors:22401-1 but more elegantly rendered in XML ... ;-)
-----Original Message----- From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of Richard Pyle Sent: 15 July 2007 20:00 To: tdwg-guid@lists.tdwg.org Subject: [tdwg-guid] An approach to Abstract LSIDs[Scanned]
> Isn't this where I came in last week Rich?
Sort of....but we weren't clear back then about the role of the "Abstract" LSID that would encompass all of the many database-record LSIDs. I know this is almost a no-brainer for those who have been part of the LSID discussions over the years, but I think part of our confusion is that we're talking about data-bearing LSIDs applied to database records as if *they* would be the LSIDs we also use to represent the abstract notion of "the name".
> The LSID urn:lsid:indexfungorum.org:names:178962 is assigned to the IF database record for Amanita phalloides (the deathcap - used as one of the EoL sample species pages [well done EoL]).
Right -- so this is really a "Name Usage" instance -- that is, the usage of the name "Amanita phalloides" by Index Fungorum. Or, maybe it's a usage instance from some other publication that IF indexes, like the original description (=protologue) of the name "Amanita phalloides".
> If I recall correctly the getData() returns "Amanita phalloides"
Not according to Kevin (who agrees with me on the data-less LSIDs for names) -- but he can answer that himself when he gets back to his email (I think he and Sally are just now getting onboard a plane leaving Hawaii as I type this).
> [not the primary key of the database record, which is 178962 - Rich, why did you restrict the getData() to returning a PK?]
My rationale is this: If getData() returns a bytestream (rather than nothing), then pretty much by definition the LSID identifies a digital object -- not an abstract object. The "name" is an abstract object, with no digital (or even physical) manifestation. So, if the LSID returns binary data via getData(), then the LSID identifies a digital object, which in the scenario I described would be a computer database row (record). I suggested the PK as a "natural" binary representation for a database record because it's the attribute of a database record that is LEAST likely to ever need to be changed. Technically, if the PK changes, then you're really talking about a *different* database row, and as such, it would be a different digital object, and as such, it would need a new LSID.
In most cases, the content of other columns (fields) in a database record is more subject to change. If you embedded content of other columns/fields into the "data" part of the LSID, then you would be duty-bound (per LSID specs) to generate a new LSID every time you changed any part of any column/field that was included within the scope of "data" returned by getData().
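A rough way to see the trade-off is to fingerprint the bytes that getData() would return under each design and check whether an edit changes them. The sketch below is only an illustration: the serialization format, the function names and the `remarks` column are assumptions, and SHA-256 is simply a convenient way to compare byte streams.

```python
import hashlib
from typing import Callable, Dict


def digest(data: bytes) -> str:
    """Fingerprint of a byte stream; equal digests mean identical bytes."""
    return hashlib.sha256(data).hexdigest()


def data_pk_only(row: Dict[str, object]) -> bytes:
    # Design A: getData() returns only an encoding of the primary key.
    return str(row["pk"]).encode("ascii")


def data_all_columns(row: Dict[str, object]) -> bytes:
    # Design B: getData() returns every column, serialized in sorted
    # key order so the output is deterministic for a given row.
    return "|".join(f"{k}={row[k]}" for k in sorted(row)).encode("utf-8")


def needs_new_lsid(before: Dict[str, object], after: Dict[str, object],
                   to_data: Callable[[Dict[str, object]], bytes]) -> bool:
    # The LSID spec requires the data bytes never to change, so any
    # difference in the digest means a new LSID has to be issued.
    return digest(to_data(before)) != digest(to_data(after))


row = {"pk": 178962, "name": "Amanita phalloides", "year": "1833"}
edited = dict(row, remarks="checked")  # hypothetical edit to another column
```

With these two rows, `needs_new_lsid(row, edited, data_pk_only)` is false while `needs_new_lsid(row, edited, data_all_columns)` is true, which is the reusability argument in a nutshell.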
Because I like the idea of GUID reusability, my inclination would be to follow a protocol that least necessitated the generation of new GUIDs for objects that I would otherwise intuitively think to be the "same" thing. Frankly, the biologist in me is FAR more interested in GUIDs for abstract objects (i.e., objects without inherent digital manifestation, such as taxon names, specimens, etc.), than I am interested in GUIDs that identify specific database records.
> Is this your proposed solution Rich?
Not exactly....but I only had 5 hrs sleep last night, and it's been a REALLY long day (11pm now), so it's probably best for everyone concerned that I shut up now and go to bed....
:-)
Aloha, Rich
Now I am worried. This is becoming very confusing.
For a while there I almost believed that we might use LSIDs as unique identifiers but ... I (CANB) simply cannot afford the risk associated with any implementation of LSIDs within our database as either object keys or instance identifiers when what we really need is to stay with our trusted, persistent and opaque surrogates.
But given that we already use a GUID to manage the identity of an object, the LSID still adds two very useful methods to our persistence model:
getData to return to a particular instance of an object (same state) and;
getMetadata to establish relationships within and between states.
The LSID becomes a surrogate for a query about an object rather than the object itself. The relationship object:LSID is one to many.
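One way to picture this one-to-many relationship is a log that keeps the local object identifier stable and issues a fresh LSID for each recorded state. The sketch below is an assumption-laden illustration, not CANB's system: the `StateLog` class, the metadata keys and the `example.org` authority are hypothetical, and using the optional revision part of the LSID syntax to number states is just one possible convention.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class StateLog:
    """Maps one stable object identifier to many state LSIDs."""

    def __init__(self, authority: str, namespace: str) -> None:
        self.authority = authority
        self.namespace = namespace
        self._snapshots: Dict[str, bytes] = {}   # LSID -> frozen bytes
        self._owner: Dict[str, str] = {}         # LSID -> object identifier
        self._history: Dict[str, List[str]] = defaultdict(list)

    def record_state(self, object_id: str, snapshot: bytes) -> str:
        # Each new state gets its own LSID; the revision part counts states.
        revision = len(self._history[object_id]) + 1
        lsid = (f"urn:lsid:{self.authority}:{self.namespace}:"
                f"{object_id}:{revision}")
        self._snapshots[lsid] = snapshot
        self._owner[lsid] = object_id
        self._history[object_id].append(lsid)
        return lsid

    def get_data(self, lsid: str) -> bytes:
        # Always returns the same bytes: one particular state of the object.
        return self._snapshots[lsid]

    def get_metadata(self, lsid: str) -> Dict[str, Optional[str]]:
        # Relationships within and between states of the same object.
        object_id = self._owner[lsid]
        history = self._history[object_id]
        position = history.index(lsid)
        return {"object": object_id,
                "previousState": history[position - 1] if position else None,
                "latestState": history[-1]}


log = StateLog("example.org", "specimens")
first = log.record_state("canb-0001", b"determination: Acacia sp.")
second = log.record_state("canb-0001", b"determination: Acacia dealbata")
```

Here the local surrogate `canb-0001` (a made-up value) remains the identity of the object, while each LSID answers the question "what did this object look like in that state?".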
To meet our TDWG obligations we will deliver data sets about objects uniquely identified by LSIDs and we will establish the necessary resolvers. The question of underlying persistence implied by this agreement is another matter. There is nothing in the candidate standard to help data providers deal with the issues of object, or instance, identity management and many will simply find it beyond their resources and/or capabilities.
Perhaps there is a way, peer-to-peer like, if providers can be convinced to use both object and instance identifiers, for our aggregators to provide services delivering the kinds of metadata and instance persistence required. I don't think that is going to come from many providers.
While LSIDs may provide a useful framework for managing and testing object persistence their true value will still lie in the guarantee, even if ephemeral or without the benefits of resolution, of an object's identity and provenance; and in the advantages their mere presence in a dataset can offer to both providers and users of these data.
greg
Greg,
I believe that there is a misunderstanding here.
The proposal of assigning LSIDs to physical-world entities and all of their digital representations is perfectly feasible and sensible, but it is not a required practice. In other words, you are not obligated to keep separate LSIDs for physical and digital objects when assigning and resolving LSIDs.
While some providers may find it useful to represent the world in such a way (and agree to perform the extra management tasks it requires), other providers may very well adopt a simpler approach. For example, many providers will find it simpler to continue to use the data model they currently implement in their institutions and just assign an LSID to every record that they wish to share from their databases. This way they would keep the mapping between LSIDs and database records one to one. In those cases, data providers don't even need to care about whether they are modeling physical entities or just their digital representations.
I believe that you assumed that this simpler approach to assigning LSIDs was feasible, and that it was your main motivation for adopting LSIDs. I am very sure that this assumption still holds true.
Furthermore, I believe that you would still benefit from replacing your local trusted, persistent and opaque surrogates with LSIDs, because LSIDs have all those properties you mentioned and more: they are globally (not only locally) unique, their syntax has been standardized, they have a standard resolution mechanism, and they provide provenance. A good example of putting LSIDs to good use would be to keep track of the relationships between taxon names exported by APNI to IPNI.
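As a small illustration of the standardized syntax, an LSID has the form urn:lsid:authority:namespace:object, with an optional trailing revision. The parser below is a minimal sketch under that reading of the syntax; the `Lsid` tuple and `parse_lsid` function are names invented here, and the sketch does not validate the characters allowed in each part.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split an LSID into its parts, raising ValueError if malformed."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    # The "urn" and "lsid" prefixes are compared without regard to case.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID URN: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"empty authority, namespace or object: {text!r}")
    # A missing or empty revision is reported as None.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(parts[2], parts[3], parts[4], revision)


# LSIDs quoted earlier in this thread.
fungus = parse_lsid("urn:lsid:indexfungorum.org:names:178962")
author = parse_lsid("urn:lsid:ipni.org:authors:11023-1")
```

The authority part (`fungus.authority`, `author.authority`) is what tells a client where to look for a resolver, which is where the provenance comes from.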
I hope this makes things clearer and helps you see the LSID specification again as a useful tool to share and manage your data records.
Best regards,
Ricardo
Hi Rich, the question I posed about getData() has nothing to do with the actual data being referenced - that is, and should be, opaque to the LSID service itself (apart from the metadata describing the data, but that is not part of the service). Heck, it could be data about the number of coconuts consumed last year for all that matters. The question was about the functionality of the protocol and services that implement it. If an LSID is assigned to some data, then right now it is required that the data retrieved by getData() is always exactly the same byte sequence. That's fine. No more discussion required. Leave it be.
The issue that does concern me, though, is that requiring the exact same byte stream for data identified by an LSID can raise unexpected implementation issues that seem to be overly restrictive without improving functionality. My impression of LSIDs and their utility has always been as pointers to data which must always be consistent regardless of how or when the data is retrieved. This is not necessarily the same thing as saying the byte stream used to represent the data is always the same, and many examples of this can be provided. There are, however, simple ways around this limitation (such as creating a new method as I outlined elsewhere) and perhaps there should be a little further discussion on this specific aspect of the LSID specification.
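One example of the gap between "same data" and "same byte stream": two JSON serializations of one record can differ in key order and whitespace, so they fail a byte-for-byte comparison even though they decode to equal values. The sketch below is only an illustration of that point; the record content and the `canonical_bytes` helper are assumptions, and canonicalizing before comparison is one possible workaround rather than anything the LSID specification defines.

```python
import hashlib
import json


def canonical_bytes(raw: bytes) -> bytes:
    """Re-serialize JSON with sorted keys and no optional whitespace."""
    value = json.loads(raw.decode("utf-8"))
    return json.dumps(value, sort_keys=True,
                      separators=(",", ":")).encode("utf-8")


def sha256(raw: bytes) -> str:
    return hashlib.sha256(raw).hexdigest()


# The same record as two different tools might write it out.
served_monday = b'{"genus": "Chaetodon", "year": 1758}'
served_friday = b'{"year":1758,"genus":"Chaetodon"}'

same_bytes = sha256(served_monday) == sha256(served_friday)
same_data = (sha256(canonical_bytes(served_monday))
             == sha256(canonical_bytes(served_friday)))
```

Here `same_bytes` is false and `same_data` is true: a strict reading of the getData() rule would treat the two responses as a violation, while a comparison on canonical form would not.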
Dave V.
On Jul 16, 2007, at 06:22, Richard Pyle wrote:
I'm not sure I understand this fixation with the getData() call. Why is it so important to use that call to retrieve bytestream information relating to objects that are not themselves inherently digital? Much of what we are interested in within the biodiversity informatics community, in terms of what we want to establish identifiers for, are not inherently digital objects and therefore should NOT have any bytes returned for getData(). Some of our objects *are* inherently digital (PDFs, image files of various formats, video clips, audio files, possibly Genbank sequences in a specified format and encoding, etc.) To me, the distinction is very simple: is the object that the LSID identifies a binary data file? If yes, then the binary data become the data of the LSID. If no, then the LSID has no binary "data" (sensu LSID Spec), and returns only metadata through getMetadata(). The LSID spec refers to such LSIDs as "Abstract" (or sometimes "Conceptual") LSIDs.
It's really not that complicated -- unless, as I suggested previously, I am missing something fundamentally important.
I don't understand the advantage we gain by "force-fitting" some digitized rendering of an otherwise non-digital object. Taxon Names (for example) have no inherent digital manifestation. We create an artificial digital representation of them by stringing ASCII or Unicode characters together in a way that resembles (in principle) the characters otherwise represented by ink on paper. But if we want to embed such a character string as "data" for an LSID, then the LSID is really an identifier for the *character string* itself, NOT the "notion" or "idea" or "concept" of the taxon name. As a taxonomist and biodiversity informatics manager, I have very little use for LSIDs that identify specific character strings. I want an LSID that identifies the shared understanding of a taxon name -- not an artificial/substitute rendering of the taxon name. I see no advantage to creating one LSID for a text string that encodes a taxon name as UTF-8, and another LSID for the same name encoded as UTF-16, and so on, and so on. These variants are purely artificial from the perspective of what I want an LSID for (i.e., the idea/notion/concept of a taxon name).
I do acknowledge that the idea of an "Abstract" LSID was really meant to serve as an "umbrella" of sorts to tie together multiple data-bearing LSIDs. The classic example is an image that can be represented as a RAW, a TIFF, or a JPEG file format. Assuming all three image files derive from the same shutter-release event of a camera, then the intended function of an "Abstract" LSID is to serve to gather together the LSIDs established for each of the three file formats of the "same" image. The images are the "same" only in the conceptual sense -- i.e., that they all derive from the shutter-release event. But the point is, the purpose of the "Abstract" LSID is really intended to be a mechanism of organizing data-bearing LSIDs that refer to different digital renderings of the "same thing". From the "LSID Best Practices" website (http://www-128.ibm.com/developerworks/opensource/library/os-lsidbp/), under the heading "Abstract LSIDs":
"The abstract LSID provides the anchor point for software and users to explore the metadata and obtain further pointers to all the concrete LSID references that contain data, along with the data's exact relationship to the abstract concept."
This implies that "Abstract" LSIDs should exist primarily to aggregate data-bearing LSIDs.
For the most part, I don't think this is what we are really trying to do when we want to assign LSIDs to non-digital objects like taxon names, specimens, etc. So, in a sense, what I am advocating deviates a bit from the intention of an "Abstract" LSID. But at least I'm not outright violating the fundamental tenets of the LSID spec, like trying to apply a single LSID to more than one bytestream returnable via getData().
So, again, I return to my original confusion: why all the fixation with the getData() call?
The only reasons I can think of are:
- Semantics (of the human communication kind): We're uncomfortable thinking of things like referring to the text string C-e-n-t-r-o-p-y-g-e (minus the dashes) as being mere "metadata" for the angelfish genus described by Kaup in 1860 -- when it just feels like the "actual" name to us (and hence should be thought of as "data").
- Persistence: We want to embed information as "data" for the LSID because we want to make sure the "same information" is always there, and the LSID spec emphasizes the permanent relationship between an LSID and its data. The only trouble is, we want to define the word "same" in this context in a way that is utterly incomprehensible (without all manner of comparison algorithms) to a computer. *We* know that "Chaetodon" is the "same" as "Chætodon", so we want a single LSID to refer to the genus name for butterflyfishes described by Linnaeus in 1758. And we don't like being required to always choose one rendering or the other to embed as the bit-identical "data" for the LSID.
- Performance(?): This is where I may be missing something fundamental. Are there characteristics of the getData() call that are far superior to getMetadata()?
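To make the "same to us, different to a computer" point concrete, here is a minimal Python sketch. It assumes UTF-8 encoding, and the ligature-folding table is a hypothetical community convention, not anything defined by the LSID spec. It shows that the two renderings are different byte streams, that standard Unicode normalization does not unify them, and that only an explicit, agreed-upon rule does.

```python
import unicodedata

ASCII_FORM = "Chaetodon"
LIGATURE_FORM = "Ch\u00e6todon"  # "Chætodon", written with the ae ligature

# Hypothetical folding convention; a community would have to agree on this table.
LIGATURE_TABLE = {"\u00e6": "ae", "\u00c6": "Ae", "\u0153": "oe", "\u0152": "Oe"}


def fold_ligatures(name: str) -> str:
    """Replace each ligature with its plain-letter spelling."""
    return "".join(LIGATURE_TABLE.get(ch, ch) for ch in name)


# The two renderings serialize to different byte streams.
same_bytes = ASCII_FORM.encode("utf-8") == LIGATURE_FORM.encode("utf-8")

# Unicode compatibility normalization leaves the ligature untouched.
nfkd_equal = unicodedata.normalize("NFKD", LIGATURE_FORM) == ASCII_FORM

# Only the explicit convention makes the two names compare equal.
folded_equal = fold_ligatures(LIGATURE_FORM) == ASCII_FORM

print(same_bytes, nfkd_equal, folded_equal)
```

The design point is that "sameness" here lives in a convention layered on top of the bytes, which is exactly why it cannot be left to the bit-identical getData() rule.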
As for number 1: all I can say is "get over it". Our unfortunate reality in biodiversity informatics is a preponderance of homonymy -- not just in taxon names, but in our human-mitigated communication lexicon as well.
As for number 2: We can deal with persistence through layers of standards and convention within our community. Almost everything we talk about involves an assumption of adherence to standards and conventions. If we want persistent metadata, then we need to formalize a document detailing which metadata elements should be mandatory and/or persistent and/or have other properties that we as a community feel are important. This document would also outline when metadata may be modified for a given LSID, vs. when a new LSID should be generated, allowing certain metadata elements for each to remain unchanged (e.g., perhaps one LSID for "Chaetodon" and another for "Chætodon", for the object type "Digital Taxon Name Rendering"). The document would also outline how multiple LSIDs should be cross-referenced to each other (e.g., the two "DTNR" objects identified by two different LSIDs in the previous example would both refer to the same Abstract LSID established for the butterflyfish genus name described by Linnaeus in 1758).
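A minimal Python sketch of what such a convention might look like in practice. Everything here is an assumption for illustration: the LSID strings, the metadata element names (objectType, renderingOf, comment), and the choice of which elements are persistent are all hypothetical, not taken from any TDWG document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LsidRecord:
    lsid: str
    object_type: str
    data: Optional[bytes]               # None for a "data-less" abstract LSID
    rendering_of: Optional[str] = None  # cross-reference to the abstract LSID


# Hypothetical identifiers: one abstract LSID for the genus name,
# and one data-bearing LSID per digital rendering of it.
ABSTRACT = LsidRecord("urn:lsid:example.org:names:chaetodon-1758", "Taxon Name", None)
RENDERINGS = [
    LsidRecord("urn:lsid:example.org:dtnr:1", "Digital Taxon Name Rendering",
               "Chaetodon".encode("utf-8"), ABSTRACT.lsid),
    LsidRecord("urn:lsid:example.org:dtnr:2", "Digital Taxon Name Rendering",
               "Ch\u00e6todon".encode("utf-8"), ABSTRACT.lsid),
]

# Hypothetical policy: changing one of these elements requires a new LSID.
PERSISTENT_ELEMENTS = frozenset({"objectType", "renderingOf"})


def renderings_of(abstract_lsid: str, records: list) -> list:
    """Return the LSIDs of all renderings that point at the abstract LSID."""
    return [r.lsid for r in records if r.rendering_of == abstract_lsid]


def requires_new_lsid(old_metadata: dict, new_metadata: dict) -> bool:
    """True if any persistent metadata element differs between versions."""
    return any(old_metadata.get(k) != new_metadata.get(k) for k in PERSISTENT_ELEMENTS)


print(renderings_of(ABSTRACT.lsid, RENDERINGS))
```

Under this sketch, correcting a free-text comment would leave the LSID alone, while re-pointing a rendering at a different abstract name would call for a new identifier.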
As for number 3: I just hope someone can explain to me where I missed the boat.
One final note: I do see a way that we can preserve the spirit of intent for the "Abstract LSID" in our domain for things like Taxon Names. Rather than explain it here, I will follow up with another email describing it.
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences and Associate Zoologist in Ichthyology Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of Chuck Miller Sent: Saturday, July 14, 2007 2:29 PM To: Ricardo Pereira Cc: tdwg-guid@lists.tdwg.org Subject: RE: [tdwg-guid] LSID metadata persistence (or lack thereof)
Ricardo, I disagree on your assertion of consensus on a couple of points.
On 2) there is no consensus/decision on whether XML can be returned from a getData call. I asked this question and it has not been answered. We could disallow XML as an allowed format for getData and allow it only for getMetadata. We do not have consensus, and actually have disagreement, on "We shouldn't for example return the bare scientific name of a species in the getData() call just because that can be immutable" because "the name itself is in the metadata". I for one believe that we cannot avoid returning a scientific name byte stream in the getData for an LSID for a scientific name. That requirement is fundamental to what we need for biodiversity data. Pragmatically and empirically, names and specimens/observations are THE most fundamental data objects existing today in the databases published by GBIF. So if we can't put LSIDs on names, we have failed to enable one of the most fundamental needs of this community. If the definition of LSIDs needs to be amended to enable that, then so be it.
Chuck
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Ricardo Pereira Sent: Fri 7/13/2007 8:12 PM Cc: tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
Folks, Thanks much to all of you who replied to my post. All the posts were really relevant to our discussion.
Before we go ahead, however, let us stop for a minute to try and summarize the points we agree upon and the points on which there is still significant controversy.
I believe that we reached consensus on the following issues:
- We do agree that *LSID metadata is not required to be persistent* (i.e. clients cannot assume it is immutable). See note [1].
- We should not force XML representations of data to be byte identical just to return that in the LSID getData() call. We must find another way to fulfill this requirement.
- We should not try to return something in the LSID getData() call just for the sake of it. We shouldn't for example return the bare scientific name of a species in the getData() call just because that can be immutable and thus fulfill the requirement from the LSID spec. This is counterproductive because the name itself is in the metadata already and no client would gain anything from calling getData() in this case.
We have also raised new issues that may be worth discussing (in their own separate thread if possible):
- We "may" bend the immutability rule of LSID getData() to our benefit and accept data that is not byte stream identical, but only "semantically" identical (depending on content type maybe). If we do this, we may use the LSID getData() call more effectively to identify real datasets such as matrices, identification keys, etc.
- As Brian pointed out, we may need to revisit what we call data and metadata. We have been using the LSID getMetadata() call to return what some people may call data (taxon names, specimens, collections). And we forgot completely that there may be other kinds of data out there that may be returned in the getData() call and that those still need metadata to describe them. I think this may be worth discussing in a separate thread.
Did I leave anything out? If so, please let us know by replying to my post and adding a short entry to either list above.
Cheers,
Ricardo
Notes:
[1] Matt may disagree with me here, but my point is that we can't force all authorities (i.e. data providers) to keep perfect archives of all versions of their databases given a heterogeneous and distributed environment we operate in. While some may want to provide this feature, other providers may not want or be able to.
Richard Pyle wrote:
It seems to me that there is a third method to resolving the problem:
When we want to identify an object that is itself digital in nature (e.g., a database record, or a binary data file such as a PDF, JPG, ASCII, Unicode, or whatever), we resolve said binary object via getData(). If, for some reason, we change the exact bit-sequence of that digital/binary object (e.g., color-correct an image, change a text string from ASCII to Unicode, or whatever...), we assign a new LSID to it (whether that "new" LSID differs from the "old" LSID only via the optional "Revision" part of the LSID, or via a new Object Identification part, is a topic for another debate).
When we want to identify an object that does not itself have a digital manifestation -- like a physical object (e.g., specimen or a particular printed copy of a publication) or an abstract/conceptual object (e.g., a taxon name, a taxon concept, a geographic place, or a cited publication) -- then we return *nothing* in response to getData(), and we treat all the attributes of said physical/abstract/conceptual object of interest to us as metadata.
If there are cases where certain metadata elements of an object without an inherent digital existence need to persist (and there are), yet we also want to allow modifications to metadata elements without the need to generate new identifiers for the underlying object (and we do) -- then we deal with those within our own community via adopted standards and best practices.
I would disagree strongly with bending the existing LSID standard, and would just as strongly favor working within its existing framework (which, I think, we can). I would also disagree with the practice of embedding XML documents as "data" for an LSID, unless the LSID is intended to represent the XML document itself (in which case there might be a different LSID to represent the database record that was used to generate the XML document; and yet another LSID to represent the abstract concept that the database record was created to represent -- like a taxon name, for example).
If we want to use LSIDs to pass around XML packages (that are not rendered as RDF) about abstract objects (e.g., taxon names), why doesn't our community define within our semantic vocabulary something along the lines of "TCS_XML", which can be established as a standard metadata component for LSIDs assigned to taxon concepts (i.e., abstract objects, identified by "data-less" LSIDs)? The exact bytestream of the content of that metadata element can change, without changing its canonical rendering.
I'm beginning to suspect (strongly) that I am completely missing some fundamental point here -- and perhaps it is the same point that underlies the apparent antagonism towards LSIDs in general (which I do not yet share). But I am fairly certain we are dealing with some level of miscommunication here.
Aloha, Rich
-----Original Message----- From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of P. Bryan Heidorn Sent: Friday, July 13, 2007 12:48 PM To: Dave Vieglais Cc: tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
There seem to be two methods to resolving this problem.
One is to change the LSID definitions to allow semantic equivalence in the data and not require exact bit stream equivalence.
The other option is to change the data representation so that it is "easily" reduced to a repeatable canonical form. For example, it is almost as easy as saying that where XML does not specify the order of elements, elements will be ordered alphabetically. Seems stupid but it almost works, except where you have repeating elements with the same element name, where it does not work.
It seems a little odd to bend the standards for the data being delivered to fit the requirement of the LSID spec. In theory, the other standard developers who set the data being delivered did not fix order because it did not matter.
This is different from Chuck's observation that the semantics of the element within some of the standards are insufficiently specified. So, what we mean is a darwin mode species name is just a string and nothing more now.
--Bryan
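The "order elements alphabetically" idea, and the repeated-element case where it breaks down, can be sketched in a few lines of Python using the standard library's xml.etree.ElementTree. This is a deliberately naive illustration under the assumption that element order carries no meaning; the element names are made up, and this is not a real canonicalization algorithm.

```python
import xml.etree.ElementTree as ET


def sort_children_by_tag(element: ET.Element) -> None:
    """Recursively reorder child elements alphabetically by tag name."""
    element[:] = sorted(element, key=lambda child: child.tag)
    for child in element:
        sort_children_by_tag(child)


def naive_canonical(xml_text: str) -> str:
    """Serialize the document after sorting sibling elements by tag."""
    root = ET.fromstring(xml_text)
    sort_children_by_tag(root)
    return ET.tostring(root, encoding="unicode")


# Distinct tags in different order: sorting makes the two documents agree.
doc_a = "<name><genus>Chaetodon</genus><author>Linnaeus</author></name>"
doc_b = "<name><author>Linnaeus</author><genus>Chaetodon</genus></name>"
distinct_tags_agree = naive_canonical(doc_a) == naive_canonical(doc_b)

# Repeated tags: the sort key ties, so the original order survives
# and the two documents still serialize differently.
rep_a = "<name><synonym>A</synonym><synonym>B</synonym></name>"
rep_b = "<name><synonym>B</synonym><synonym>A</synonym></name>"
repeated_tags_agree = naive_canonical(rep_a) == naive_canonical(rep_b)

print(distinct_tags_agree, repeated_tags_agree)
```

One could extend the sort key to include element content, but that starts to encode assumptions about whether repeated elements are a set or a sequence, which is exactly the information the data standard left unspecified.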
On Jul 13, 2007, at 5:18 PM, Dave Vieglais wrote:
I think we are all in agreement that the data and metadata referenced by an LSID remains unchanged (in the case of the metadata, semantic equivalence is a requirement for reasons such as outlined by Matt).
My question is to do purely with the data that an LSID references through the getData() operation. The form of that data could be anything really - an encrypted byte stream, digital image, Open Office document, spreadsheet, xml document...
We all know that the same data can be represented many ways that are logically, semantically and functionally equivalent yet form a different set of bytes when serialized. Data expressed in XML is one example (is <a/> = <a /> = <a></a> ?). A palette-based image is another - the order of colors in the palette may be changed, and the pixel values adjusted to match the new palette order, but the image is still the same. There are many more simple examples that can be constructed that violate the unchanged bytes rule but for all practical and functional purposes the data are unchanged.
As mentioned previously, enforcing and implementing the unchanged bytes rule is not challenging. It is however quite different from stating that the data are returned unchanged. It is this that I, and I'm sure a lot of other implementors, would appreciate consensus on.
Dave V.
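The <a/> versus <a /> versus <a></a> question can be checked directly. The sketch below assumes Python 3.8 or later, where the standard library provides xml.etree.ElementTree.canonicalize (an implementation of W3C Canonical XML 2.0). It is only an illustration of the "canonical representation" idea, not a proposal for how an LSID resolver should work.

```python
from xml.etree.ElementTree import canonicalize

# Three serializations of the same empty element.
variants = ["<a/>", "<a />", "<a></a>"]

# As raw byte streams they are all different...
distinct_byte_streams = {v.encode("utf-8") for v in variants}

# ...but they reduce to a single canonical form.
canonical_forms = {canonicalize(xml_data=v) for v in variants}

print(len(distinct_byte_streams), len(canonical_forms))
```

So a rule of the form "the canonical representation must not change" would treat these three as the same data, while a strict identical-bytes rule would treat them as three different objects.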
On Jul 14, 2007, at 09:20, Matthew Jones wrote:
In terms of the metadata returned from an LSID, or any other digital identifier, there are definite cases where metadata must be semantically persistent in order to preserve the utility of data and accuracy of scientific results.
As a trivial example, given a set of observations collected at time t, one can represent the data for those observations in dataset D and the metadata for the dataset, including the time value t, in a metadata document M. In a later event, it is discovered that t was entered incorrectly, and needs to be adjusted, creating metadata document M'. That M and M' are not congruent is critical knowledge when analyzing data from D with data from another dataset D2. In other words, because there is no true distinction between data and metadata (any given piece of information can be stored in either location), a proper archive must be able to distinguish any changes in the data and any changes in the metadata.
That said, there are some metadata that could change with little or no impact on data interpretation (e.g., the spelling of the street on which Technician Tom gets his snailmail). But at the current time it is impossible to distinguish this kind of metadata from the important kind in the general case of the existing metadata standards in use (e.g., FGDC BDP, ISO 19115, EML, GML, etc).
Our process in the KNB/SEEK/NCEAS and other ecological data archives is to give persistent identifiers to both data objects and metadata objects, and provide new identifiers when either changes.
Matt
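The policy of issuing a new identifier whenever a data or metadata object changes can be sketched as follows. This is a toy in-memory illustration: the class name, the "prefix:name.revision" identifier scheme, and the use of SHA-256 to detect change are all assumptions made for the example, not a description of how the KNB/SEEK/NCEAS archives are actually implemented.

```python
import hashlib


class VersionedArchive:
    """Toy archive that mints a new identifier whenever content changes."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self._history = {}  # object name -> list of (identifier, digest)

    def deposit(self, name: str, content: bytes) -> str:
        """Return the identifier for this exact content, minting one if new."""
        digest = hashlib.sha256(content).hexdigest()
        versions = self._history.setdefault(name, [])
        if versions and versions[-1][1] == digest:
            return versions[-1][0]  # unchanged: keep the current identifier
        identifier = f"{self.prefix}:{name}.{len(versions) + 1}"
        versions.append((identifier, digest))
        return identifier


archive = VersionedArchive("urn:example")          # hypothetical prefix
m_first = archive.deposit("M", b"time=10:00")      # metadata document M
m_same = archive.deposit("M", b"time=10:00")       # re-deposit, no change
m_fixed = archive.deposit("M", b"time=11:00")      # corrected document M'

print(m_first, m_same, m_fixed)
```

With this arrangement an analysis that cites the first identifier can always be distinguished from one that used the corrected metadata, which is the property the M versus M' example calls for.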
Dave Vieglais wrote:
Hi Bob,
Just because a standard is published does not mean that it is practical. Requiring that a set of bytes referenced by an LSID are unchanged has a lot of implications with respect to the implementation of data services. For example, if it is agreed to abide by the rule that the blob referenced by an LSID remains forever unchanged, then that implies that the data provider stores the data as a blob, rather than risking the process of reconstructing on the fly from some database, especially for the example of data expressed in XML where functionally identical objects (constructed using different DOM libraries for example) are not identical blobs.
Asserting that two instances of an object with the same LSID are semantically equivalent is a vastly more complicated process than asserting that the canonical representations of those instances are identical. Generally there can be defined a simple set of guidelines for constructing the canonical form of an object (e.g. for xml http://www.w3.org/TR/xml-c14n) whereas asserting semantic equivalence is an ongoing topic of research.
Requiring identical blobs is certainly possible, but people need to be aware of the implications of such a requirement in the early stages of designing a system to support such a specification. My preference for the canonical form relaxes the implementation requirements considerably whilst still maintaining the integrity of the data and the intent of the LSID.
regards, Dave V.
On Jul 14, 2007, at 08:08, Bob Morris wrote:
> This entire discussion confuses me. The LSID standard is
> published. Why is there a discussion of what an LSID should be? The standard requires the data, as defined by the return of getData, to be identical for all resolutions of the LSID. From page 9 of the LSID spec:
> "bytes getData (LSID lsid)
> bytes getDataByRange (LSID lsid, integer start, integer length)
> Metadata_response getMetadata (LSID lsid, string[] accepted_formats)
> Metadata_response getMetadataSubset (LSID lsid, string[] accepted_formats, string selector)
> The data retrieval services may implement all of the methods, or only methods for retrieving data, or only methods for retrieving associated metadata.
> The same LSID named data object must be resolved always to the same set of bytes. Therefore, all of the data retrieval services return the same results for the same LSID. The user has, however, the choice of which one of these to utilize depending on its location, known quality of service and other attributes. With metadata, the situation is different. Each data retrieval service can provide different metadata for the same LSID."
> This doesn't seem very ambiguous to me, and doesn't have anything to do with imperfect storage of data or anything else about the physical or electronic world. If two calls to getData() with the same argument on two occasions to possibly two different resolution services do not yield the same set of bytes, then one or the other or both of those is not executing a compliant service response. Unless this discussion is really "Shall we call something other than the return of getData by the term 'data associated with the LSID'?", there seems to be nothing to discuss.
> Bob
> On 7/13/07, Paul Kirk p.kirk@cabi.org wrote:
>> In an imperfect world there is no such thing as an 'identical-byte-stream' because the technology we use is imperfect ... the disk controllers which manage our bytes and the disk we use to store our bytes have recognized error rates. Perhaps I'm being a pedant in the above analysis but I was almost persuaded that except for digital objects (images, sounds) which can be data all other 'things' (names, specimen accession numbers) had to be metadata. This to me makes no sense in the real but imperfect world we live in. An LSID assigned to a name (e.g. Homo sapiens) is assigned to the name as data, not metadata. What is 'identical' here is that if the spelling has to change for any reason the new spelling gets a new LSID and the now incorrect spelling gets deprecated (but is still resolvable) with a pointer to the correct spelling/LSID in the metadata.
>> OK?
>> Paul
>> From: tdwg-guid-bounces@lists.tdwg.org on behalf of Chuck Miller
>> Sent: Fri 13/07/2007 19:03
>> To: Dave Vieglais
>> Cc: tdwg-guid@lists.tdwg.org
>> Subject: RE: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
>> Dave,
>> What you say is true. But, I think we already have too many variations, subtleties, and reinterpretations which are endlessly debated.
>> The LSID standard would be simple, clear and consistent if we used the identical-byte-stream definition. The LSID would uniquely tag a persistent byte stream. A persistent byte stream is always the same thing without any further explanation or clarification.
>> The provider of an LSID byte-stream would need to commit to keeping that byte-stream persistent and not represent it in multiple ways, even though technically they could. If they can't commit to that, then it can't be an LSID byte-stream.
>> And in the name of simplicity and clarity, if they had to provide different byte-stream representations then they would have to assign a different LSID to each and use "SameAs" metadata.
>> Chuck
>> -----Original Message-----
>> From: Dave Vieglais [mailto:vieglais@ku.edu]
>> Sent: Friday, July 13, 2007 12:42 PM
>> To: Chuck Miller
>> Cc: Ricardo Pereira; tdwg-guid@lists.tdwg.org
>> Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
>> Hi Ricardo, Chuck,
>> Asserting that the byte stream returned as data associated with an LSID should never change is perhaps a bit confusing from a programmatic view. There are for example many ways to represent data in xml that are identical from an information content point of view, but the byte streams could be very different.
>> Perhaps it might be better to state something like "the canonical representation of the data associated with an LSID must not change", or something to that effect?
>> Dave V.
>> On Jul 14, 2007, at 05:29, Chuck Miller wrote:
>>> Ricardo,
>>> Looking at this definition: "Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change"
>>> Perhaps this is a more straightforward way to conceive LSIDs. The LSID goes with a byte stream. It's that byte stream that must stay the same. So, if there is a byte stream associated with a collection that needs to stay the same, then whatever that byte stream happens to be is the data that gets an LSID assigned to it. That sure seems a clearer definition of what is data and what is metadata, rather than the issue of primary object and all that.
>>> So we can create a new definition in the context of LSIDs: Data is a byte stream that is persistent, never changes and can have an LSID. Metadata is a byte stream that is non-persistent, might change and is only associated with an LSID.
>>> The institution who assigns an LSID can make their own decision about whether the byte stream being provided is persistent or non-persistent. By assigning an LSID to any byte stream, whatever it is, the institution is declaring it to be data and persistent.
>>> So, in the example given of an observation record with a determination that needs to remain fixed and unchanged, by assigning an LSID to that observation+determination it would be "declared to be data" and unchangeable. A different determination would then be different data with a different LSID. That would provide a solution for those who want to employ it. Others could choose not to use it.
>>> Chuck
>>> From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of Ricardo Pereira
>>> Sent: Friday, July 13, 2007 9:47 AM
>>> To: tdwg-guid@lists.tdwg.org
>>> Subject: [tdwg-guid] LSID metadata persistence (or lack thereof)
>>> Hi there folks,
>>> As Chuck mentioned a few weeks ago, we do have a few outstanding issues to address regarding LSIDs. I would like to discuss those one by one, in an orderly manner, and reach consensus as much as we can. Then we can sum them up in a TDWG standard, possibly by or shortly after the Bratislava conference.
>>> The first issue I would like to discuss is LSID metadata persistence. First, let me remind you of a corollary established by the LSID specification:
>>> Corollary 1: LSIDs are not guaranteed to be resolvable indefinitely.
>>> In other words, there is no guarantee that one will always be able to retrieve the data associated with an LSID as the authority may choose (or be forced) not to resolve an LSID anymore.
>>> Second, let me distinguish this kind of persistence I'm talking about from other two related concepts (which we'll not discuss in this thread):
>>> 1) Persistence of Assignment: Once assigned to an object, an LSID is indefinitely associated with it. The same LSID cannot be assigned to another object. Ever! The LSID may not be resolvable anymore, but it cannot be assigned to another object. This is established by the LSID specification.
>>> 2) Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change. Although the LSID may not be resolvable anymore (according to corollary 1), the data associated with an LSID must never ever change. That's defined by the LSID spec, too.
>>> What I want to discuss here is the persistence of LSID metadata (what is returned by the getMetadata call) or the lack thereof.
>>> A use case associated with metadata persistence is when someone collects observation records (and implicitly, their determinations) and runs an experiment (a model or simulation) with it. This person may want to record the identifiers of the points used so that someone using the results of that experiment may refer back to the primary data, to validate or repeat the experiment.
>>> The bad news is that LSID identification scheme (or any other GUID that I know of) was not designed to guarantee metadata persistence, and thus it cannot implement the use case above by itself. To implement that use case, the specification would have to guarantee that the metadata (which we are using here as data) is immutable. But it doesn't.
>>> Most of us wish that metadata was persistent, but it isn't. Many things can change in the metadata: a new determination, a mispeling that is corrected, many things. We just cannot guarantee that the metadata will look like it was sometime ago.
>>> We then reach the following conclusion.
>>> Corollary 2: LSIDs metadata is not immutable nor persistent.
>>> The consequence of this corollary is that, if you need to refer back to a piece of information (metadata) associated with an LSID, exactly as it was when you got it, you must make a copy of it, or arrange that someone else make that copy for you.
>>> In other words, a client cannot assume that the metadata associated with an LSID today will be the same tomorrow. If the client does assume that, it may be relying on a false assumption and its output may be flawed.
>>> If we are not happy with that conclusion, we may develop an additional component in our architecture, an archive of some sort, to handle (meta)data persistence. That is exactly what the STD-DOI project (http://www.std-doi.de/) and SEEK (http://seek.ecoinformatics.org) have done to some extent.
>>> While we cannot guarantee that LSID metadata is persistent nor immutable, we can definitely document how the metadata have changed through metadata versioning. That's the topic of the next thread. We will move on to discuss metadata versioning as soon as we are done with metadata persistence.
>>> Cheers,
>>> Ricardo
> --Robert A. Morris
> Professor of Computer Science
> UMASS-Boston
> ram@cs.umb.edu
> http://bdei.cs.umb.edu/
> http://www.cs.umb.edu/~ram
> http://www.cs.umb.edu/~ram/calendar.html
> phone (+1)617 287 6466
tdwg-guid mailing list tdwg-guid@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-guid
Bob, That's more or less what I was trying to say. LSID = same bytes. Metadata about LSID = varying bytes. That's a very simple definition. It is divorced from the question of primary object, first class object, or any of that stuff. Just simply same bytes.
I don't know what to say about serving up LSID data bytes from a database via XML through DOM. Dave is suggesting that could cause more trouble than it saves and I can see that point since you can't control how reused code works in the future, so the bytes might change. Maybe an LSID provider should never do that.
But, is the LSID getData call supposed to return data bytes in XML form? That is, not the metadata, the data.
Chuck
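One way to read the "maybe an LSID provider should never do that" remark is to freeze the bytes once, when the LSID is assigned, and always serve that stored blob instead of regenerating it from the database. The Python sketch below is a minimal in-memory illustration of that idea; the class, its method names, and the example LSID are hypothetical and not part of any LSID toolkit.

```python
import hashlib


class FrozenDataStore:
    """Toy store that fixes the getData() bytes at LSID assignment time."""

    def __init__(self) -> None:
        self._blobs = {}  # LSID string -> frozen bytes

    def assign(self, lsid: str, data: bytes) -> str:
        """Bind bytes to an LSID once; refuse any later, different bytes."""
        if lsid in self._blobs and self._blobs[lsid] != data:
            raise ValueError(f"{lsid} is already bound to different bytes")
        self._blobs[lsid] = bytes(data)
        return hashlib.sha256(data).hexdigest()

    def get_data(self, lsid: str) -> bytes:
        """Always return the stored blob, never a regenerated serialization."""
        return self._blobs[lsid]


store = FrozenDataStore()
example_lsid = "urn:lsid:example.org:records:42"  # hypothetical LSID
digest = store.assign(example_lsid, b"<record><id>42</id></record>")

print(digest, store.get_data(example_lsid))
```

The trade-off is storage and bookkeeping on the provider side in exchange for immunity from changes in whatever library happens to serialize the XML in the future.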
-----Original Message----- From: Bob Morris [mailto:morris.bob@gmail.com] Sent: Friday, July 13, 2007 3:08 PM To: Paul Kirk Cc: Chuck Miller; Dave Vieglais; tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
This entire discussion confuses me. The LSID standard is published. Why is there a discussion of what an LSID should be? The standard requires the data, as defined by the return of getData, to be identical for all resolutions of the LSID. From page 9 of the LSID spec:
"bytes getData (LSID lsid)
bytes getDataByRange (LSID lsid, integer start, integer length)
Metadata_response getMetadata (LSID lsid, string[] accepted_formats)
Metadata_response getMetadataSubset (LSID lsid, string[] accepted_formats, string selector)
The data retrieval services may implement all of the methods, or only methods for retrieving data, or only methods for retrieving associated metadata. The same LSID named data object must be resolved always to the same set of bytes. Therefore, all of the data retrieval services return the same results for the same LSID. The user has, however, the choice of which one of these to utilize depending on its location, known quality of service and other attributes. With metadata, the situation is different. Each data retrieval service can provide different metadata for the same LSID."
This doesn't seem very ambiguous to me, and doesn't have anything to do with imperfect storage of data or anything else about the physical or electronic world. If two calls to getData() with the same argument on two occasions to possibly two different resolution services do not yield the same set of bytes, then one or the other or both of those is not executing a compliant service response. Unless this discussion is really "Shall we call something other than the return of getData by the term 'data associated with the LSID'?", there seems to be nothing to discuss.
Bob
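The compliance condition described above (every getData() call for the same LSID yields the same set of bytes, whichever service answers) is easy to express as a check. In the Python sketch below the resolvers are stand-in functions and the LSID is made up; a real check would call actual resolution services, which is outside the scope of this illustration.

```python
import hashlib
from typing import Callable, Iterable

# A resolver is modelled as any callable that maps an LSID to its data bytes.
Resolver = Callable[[str], bytes]


def get_data_is_consistent(lsid: str, resolvers: Iterable[Resolver]) -> bool:
    """True if every resolver returns byte-identical data for the LSID."""
    digests = {hashlib.sha256(resolve(lsid)).hexdigest() for resolve in resolvers}
    return len(digests) <= 1


# Stand-in services (assumptions for the example, not real endpoints).
def service_one(lsid: str) -> bytes:
    return b"<a/>"


def service_two(lsid: str) -> bytes:
    return b"<a/>"


def service_reserialized(lsid: str) -> bytes:
    return b"<a></a>"  # same information, different bytes


lsid = "urn:lsid:example.org:data:1"  # hypothetical LSID
print(get_data_is_consistent(lsid, [service_one, service_two]))
print(get_data_is_consistent(lsid, [service_one, service_reserialized]))
```

Under the spec text quoted above, the second pair would count as non-compliant even though a human would call the two responses equivalent.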
On 7/13/07, Paul Kirk p.kirk@cabi.org wrote:
In an imperfect world there is no such thing as an 'identical-byte-stream' because the technology we use is imperfect ... the disk controllers which manage our bytes and the disk we use to store our bytes have recognized error rates. Perhaps I'm being a pedant in the above analysis but I was almost persuaded that except for digital objects (images, sounds) which can be data all other 'things' (names, specimen accession numbers) had to be metadata. This to me makes no sense in the real but imperfect world we live in. An LSID assigned to a name (e.g. Homo sapiens) is assigned to the name as data, not metadata. What is 'identical' here is that if the spelling has to change for any reason the new spelling gets a new LSID and the now incorrect spelling gets deprecated (but is still resolvable) with a pointer to the correct spelling/LSID in the metadata.
OK?
Paul
From: tdwg-guid-bounces@lists.tdwg.org on behalf of Chuck Miller
Sent: Fri 13/07/2007 19:03
To: Dave Vieglais
Cc: tdwg-guid@lists.tdwg.org
Subject: RE: [tdwg-guid] LSID metadata persistence (or lack thereof)[Scanned]
Dave,
What you say is true. But, I think we already have too many variations, subtleties, and reinterpretations which are endlessly debated.
The LSID standard would be simple, clear and consistent if we used the identical-byte-stream definition. The LSID would uniquely tag a persistent byte stream. A persistent byte stream is always the same thing without any further explanation or clarification.
The provider of an LSID byte-stream would need to commit to keeping that byte-stream persistent and not represent it in multiple ways, even though technically they could. If they can't commit to that, then it can't be an LSID byte-stream.
And in the name of simplicity and clarity, if they had to provide different byte-stream representations then they would have to assign a different LSID to each and use "SameAs" metadata.
Chuck
-----Original Message-----
From: Dave Vieglais [mailto:vieglais@ku.edu]
Sent: Friday, July 13, 2007 12:42 PM
To: Chuck Miller
Cc: Ricardo Pereira; tdwg-guid@lists.tdwg.org
Subject: Re: [tdwg-guid] LSID metadata persistence (or lack thereof)
Hi Ricardo, Chuck, Asserting that the byte stream returned as data associated with an LSID should never change is perhaps a bit confusing from a programmatic view. There are for example many ways to represent data in xml that are identical from an information content point of view, but the byte streams could be very different.
Perhaps it might be better to state something like "the canonical representation of the data associated with an LSID must not change", or something to that effect?
Dave V.
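To illustrate the point about XML, here is a small sketch using Python's standard xml.etree.ElementTree.canonicalize (C14N 2.0, available from Python 3.8). The two documents are invented examples. They differ only in attribute order and quote style, so their raw bytes differ while their canonical forms match. This is just one possible notion of "canonical representation", not something the LSID spec prescribes:

```python
import xml.etree.ElementTree as ET

# Two hypothetical serializations of the same record.
doc_a = '<record id="1" rank="species"><name>Homo sapiens</name></record>'
doc_b = "<record rank='species' id='1'><name>Homo sapiens</name></record>"

# The raw byte streams are different ...
raw_equal = doc_a.encode("utf-8") == doc_b.encode("utf-8")

# ... but canonicalization normalizes attribute order and quoting.
canon_a = ET.canonicalize(doc_a)
canon_b = ET.canonicalize(doc_b)
canon_equal = canon_a == canon_b

print(raw_equal, canon_equal)
```

Note that canonicalization by default preserves whitespace inside text content, so documents that differ in indentation would still need an agreed normalization rule.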
On Jul 14, 2007, at 05:29, Chuck Miller wrote:
Ricardo,
Looking at this definition: "Persistence of LSID Data: The data associated with an LSID (i.e, the byte stream returned by the LSID getData call) must never change"
Perhaps this is a more straightforward way to conceive LSIDs. The LSID goes with a byte stream. It's that byte stream that must stay the same. So, if there is a byte stream associated with a collection that needs to stay the same, then whatever that byte stream happens to be is the data that gets an LSID assigned to it. That sure seems a clearer definition of what is data and what is metadata, rather than the issue of primary object and all that.
So we can create a new definition in the context of LSIDs: Data is a byte stream that is persistent, never changes and can have an LSID. Metadata is a byte stream that is non-persistent, might change and is only associated with an LSID.
The institution that assigns an LSID can make its own decision about whether the byte stream being provided is persistent or non-persistent. By assigning an LSID to any byte stream, whatever it is, the institution is declaring it to be data and persistent.
So, in the example given of an observation record with a determination that needs to remain fixed and unchanged, by assigning an LSID to that observation+determination it would be "declared to be data" and unchangeable. A different determination would then be different data with a different LSID. That would provide a solution for those who want to employ it. Others could choose not to use it.
Chuck
From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of Ricardo Pereira
Sent: Friday, July 13, 2007 9:47 AM
To: tdwg-guid@lists.tdwg.org
Subject: [tdwg-guid] LSID metadata persistence (or lack thereof)
Hi there folks,
As Chuck mentioned a few weeks ago, we do have a few outstanding issues to address regarding LSIDs. I would like to discuss those one by one, in an orderly manner, and reach consensus as much as we can. Then we can sum them up in a TDWG standard, possibly by or shortly after the Bratislava conference.
The first issue I would like to discuss is LSID metadata persistence. First, let me remind you of a corollary established by the LSID specification:
Corollary 1: LSIDs are not guaranteed to be resolvable indefinitely.
In other words, there is no guarantee that one will always be able to retrieve the data associated with an LSID as the authority may choose (or be forced) not to resolve an LSID anymore.
Second, let me distinguish the kind of persistence I'm talking about from two other related concepts (which we'll not discuss in this thread):
1) Persistence of Assignment: Once assigned to an object, an LSID is indefinitely associated with it. The same LSID cannot be assigned to another object. Ever! The LSID may not be resolvable anymore, but it cannot be assigned to another object. This is established by the LSID specification.
2) Persistence of LSID Data: The data associated with an LSID (i.e., the byte stream returned by the LSID getData call) must never change. Although the LSID may not be resolvable anymore (according to corollary 1), the data associated with an LSID must never ever change. That's defined by the LSID spec, too.
What I want to discuss here is the persistence of LSID metadata (what is returned by the getMetadata call) or the lack thereof.
A use case associated with metadata persistence is when someone collects observation records (and implicitly, their determinations) and runs an experiment (a model or simulation) with them. This person may want to record the identifiers of the points used so that someone using the results of that experiment may refer back to the primary data, to validate or repeat the experiment.
The bad news is that the LSID identification scheme (or any other GUID scheme that I know of) was not designed to guarantee metadata persistence, and thus it cannot implement the use case above by itself. To implement that use case, the specification would have to guarantee that the metadata (which we are using here as data) is immutable. But it doesn't.
Most of us wish that metadata were persistent, but it isn't. Many things can change in the metadata: a new determination, a misspelling that is corrected, many things. We just cannot guarantee that the metadata will look as it did some time ago.
We then reach the following conclusion.
Corollary 2: LSID metadata is neither immutable nor persistent.
The consequence of this corollary is that, if you need to refer back to a piece of information (metadata) associated with an LSID, exactly as it was when you got it, you must make a copy of it, or arrange that someone else make that copy for you.
In other words, a client cannot assume that the metadata associated with an LSID today will be the same tomorrow. If the client does assume that, it may be relying on a false assumption and its output may be flawed.
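As a rough illustration of "making a copy", a client could keep a timestamped snapshot of the metadata it used, along with a digest for later comparison. This is only a sketch under stated assumptions: the LSID, the metadata strings, and the function names are hypothetical, and how the metadata is actually fetched (the getMetadata call) is left out:

```python
import hashlib
import json
from datetime import datetime, timezone


def snapshot_metadata(lsid: str, metadata: bytes) -> dict:
    """Record the metadata exactly as retrieved, with time and digest."""
    return {
        "lsid": lsid,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(metadata).hexdigest(),
        "metadata": metadata.decode("utf-8"),
    }


def has_changed(snapshot: dict, current_metadata: bytes) -> bool:
    """True when the metadata served now differs from the snapshot."""
    return hashlib.sha256(current_metadata).hexdigest() != snapshot["sha256"]


# Hypothetical LSID and metadata, for illustration only.
lsid = "urn:lsid:example.org:observations:1234"
original = b"<determination>Homo sapiens</determination>"
revised = b"<determination>Homo neanderthalensis</determination>"

snap = snapshot_metadata(lsid, original)
archived = json.dumps(snap)  # what the client would store with its results

print(has_changed(snap, original))
print(has_changed(snap, revised))
```

Anyone repeating the experiment can then work from the archived copy rather than from whatever the authority returns at that later time.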
If we are not happy with that conclusion, we may develop an additional component in our architecture, an archive of some sort, to handle (meta)data persistence. That is exactly what the STD-DOI project (http://www.std-doi.de/) and SEEK (http://seek.ecoinformatics.org) have done to some extent.
While we cannot guarantee that LSID metadata is persistent or immutable, we can definitely document how the metadata has changed through metadata versioning. That's the topic of the next thread. We will move on to discuss metadata versioning as soon as we are done with metadata persistence.
Cheers,
Ricardo
tdwg-guid mailing list tdwg-guid@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-guid
participants (10)
- Bob Morris
- Chuck Miller
- Dave Vieglais
- Greg Whitbread
- Matthew Jones
- P. Bryan Heidorn
- Paul Kirk
- Ricardo Pereira
- Richard Pyle
- Sally Hinchcliffe