[tdwg-guid] Immutability of LSID data
Folks,
Let's pick one controversial issue at a time and discuss it. I suggest we pick the "easiest" ones first. Let's pick the immutability of LSID data next.
Let us first review which methods are provided by the LSID data services:
bytes getData(LSID lsid)
bytes getDataByRange(LSID lsid, integer start, integer length)
I wasn't the one who came up with the LSID spec, but I suppose that those methods were specifically designed to handle sequence data (DNA and protein data). The getDataByRange method in particular was designed to allow clients to refer to very specific subsets of those sequences.
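To make the two calls concrete, here is a minimal Python sketch of a byte-stable data service. The class name `InMemoryLsidData`, the example LSID, and the stored sequence are all hypothetical, and the error handling is an assumption; the LSID spec defines these operations as web service calls, not as a Python API.

```python
class InMemoryLsidData:
    """Toy stand-in for an LSID data service holding immutable byte strings."""

    def __init__(self, store: dict[str, bytes]):
        # Copy into a private dict so callers cannot swap stored data later.
        self._store = dict(store)

    def get_data(self, lsid: str) -> bytes:
        # Mirrors getData(lsid): the whole byte stream, identical on every call.
        return self._store[lsid]

    def get_data_by_range(self, lsid: str, start: int, length: int) -> bytes:
        # Mirrors getDataByRange(lsid, start, length): a slice of the stream.
        data = self._store[lsid]
        if start < 0 or length < 0 or start + length > len(data):
            raise ValueError("requested range is outside the data")
        return data[start:start + length]


# Hypothetical LSID and sequence, used only for illustration.
service = InMemoryLsidData({"urn:lsid:example.org:seq:1": b"ACGTACGTTTGA"})
codon = service.get_data_by_range("urn:lsid:example.org:seq:1", 9, 3)
print(codon)  # b'TGA'
```

A range is only meaningful here because the stored bytes never change; that is the property the rest of this thread debates.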
No doubt this is all very useful for the bioinformatics folks, but as we've seen in previous discussions, it is not as useful for us in the biodiversity (and ecological) informatics communities. The main reason is that some of our data is represented in XML, which is not guaranteed to be serialized as the very same stream of bytes every time. But it may still be helpful to use the getData call to retrieve such data.
The question under discussion in this thread is whether we should bend the LSID rules to accept XML data in getData calls. My proposal, which I think gathers the points presented in the previous discussion thread, is that whatever is served using getData be "semantically immutable". Semantic immutability would then depend on the content type of the data returned. For example:
1) If data is of content type text/plain, application/octet-stream, image/*, etc., then it must always be returned as the exact byte stream sequence (just like the LSID spec states now).
2) If data is in XML, i.e., it is of content type text/xml, text/html (God forbid), then it must always return an equivalent XML DOM tree;
3) If data is application/rdf+xml or application/rdf+n3 (i.e. RDF data), the getData call must always return the same RDF graph;
4) And so on for every other MIME type out there.
The implications of bending the LSID getData calls like that are:
a) One may not use getDataByRange call for data that is not byte stream equivalent (item #1 above). Authorities would have to return an error message when getDataByRange is called on a "semantically immutable" object.
b) Some may claim that caching of LSIDs and the associated data would be impossible. But since the data is always "semantically immutable", what's wrong with caching it?
c) Authorities wouldn't be able to return data in alternate MIME types (RDF in XML or N3 or Turtle) as there is no parameter that specifies that. Not a problem I suppose.
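The content-type rules and implication (a) above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `same_content` and `range_allowed` are hypothetical, payloads are assumed to be UTF-8, and "equivalent XML DOM tree" is approximated by comparing Canonical XML output from the standard library (Python 3.8 or later). RDF graph comparison would need an RDF library and is not shown, so RDF types fall through to the strict byte rule here.

```python
from xml.etree.ElementTree import canonicalize

XML_TYPES = {"text/xml", "application/xml"}


def same_content(content_type: str, first: bytes, second: bytes) -> bool:
    """Compare two getData payloads under the rule for their content type."""
    if content_type in XML_TYPES:
        # Assumption: payloads are UTF-8. Canonical XML (C14N 2.0) sorts
        # attributes and normalizes quoting and empty-element syntax;
        # strip_text=True also strips whitespace around text content.
        canon_first = canonicalize(first.decode("utf-8"), strip_text=True)
        canon_second = canonicalize(second.decode("utf-8"), strip_text=True)
        return canon_first == canon_second
    # Everything else falls back to the strict rule: identical bytes.
    return first == second


def range_allowed(content_type: str) -> bool:
    """Implication (a): byte ranges only make sense for byte-identical data."""
    return content_type not in XML_TYPES


compact = b'<record id="1" name="x"><value/></record>'
pretty = b"<record name='x' id='1'>\n  <value></value>\n</record>"
print(same_content("text/xml", compact, pretty))    # True
print(same_content("text/plain", compact, pretty))  # False
```

Whether insignificant whitespace should count as a difference is itself a policy decision that a specification for "semantic immutability" would have to settle.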
I agree with Dave in that we would gain much if we bent the LSID rule about immutability of data. We would have a more general solution that would fit the needs of a broader set of providers, without impacting the authorities that today don't use getData that much, such as the providers of names, concepts, observations, specimens, authors, and collections.
So the questions I pose to our group in this thread are:
"Should we allow 'semantically immutable' data to be returned in the getData call? How exactly do we do it (i.e., what would be a specification for it)?"
I don't really see a problem bending the LSID rules a bit, as outlined above. What do you think?
Cheers,
Ricardo
On 7/16/07, Ricardo Pereira ricardo@tdwg.org wrote:
> b) Some may claim that caching of LSIDs and the associated data would be impossible. But since the data is always "semantically immutable", what's wrong with caching it?
One thing that is wrong with it is that if a conforming client acquires the data with a getData call from two different sources, and they return different byte strings, then the client is permitted to signal an error and possibly break an application that exercises a blind faith in the power of "semantic immutability".
I am not sure I follow you completely, Bob, but I think you are pointing out an important issue for "semantic immutability" versus "byte/bit-level immutability". If a client retrieves data from two different sources under a byte-level immutability contract, a simple equivalence test should be able to verify the byte-level equivalence. Under the semantic immutability contract, a more complex test for equivalence would be required, one fitted, for example, to the MIME type.
In practice I do not think this is an issue. If clients act on blind faith under either contract, they would not test the equivalence. In fact they would usually retrieve a particular LSID from only one service. The blind-faith client would process the data as if the data provider were following the contract, and no more. The client could not assume byte-level immutability when there is only semantic immutability, because that may indeed break the client code. A cached byte-level representation of data from one call cannot be compared byte for byte with semantically immutable data. If XML is carried in the data, all operations must be consistent with XML operations. I do not see this as a problem.
Since in the biodiversity community LSID data payloads would be about a large variety of objects, clients would always need to check the data types before most processing operations. The data type information would be encoded in the metadata but could also be segregated by service provider (though even there, for good form, the metadata should encode the data type). The metadata needs to encode both the physical layout of the bits and the "use" (there must be a better word). For example, the data could be a Darwin Core record, a Dublin Core record, or SDD. All are XML, but the legal operations over that XML differ depending on the "use". Some clients could just pass the data through without being concerned about this, but other clients would need to process it accordingly, perhaps ignoring types they know nothing about.
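As a rough illustration of that last point, a client might dispatch on both the physical format and the "use" declared in the metadata, and skip anything it does not recognize. The metadata keys `format` and `use`, the labels `darwin-core` and `dublin-core`, and the handler functions below are all hypothetical; nothing here is defined by the LSID spec or by TDWG.

```python
from typing import Callable, Dict, Optional

# Hypothetical "use" labels mapped to handlers; real metadata would carry
# agreed identifiers rather than these made-up strings.
HANDLERS: Dict[str, Callable[[bytes], str]] = {
    "darwin-core": lambda payload: f"Darwin Core record ({len(payload)} bytes)",
    "dublin-core": lambda payload: f"Dublin Core record ({len(payload)} bytes)",
}


def process(metadata: dict, payload: bytes) -> Optional[str]:
    """Handle a payload only if both its format and its use are understood."""
    if metadata.get("format") != "text/xml":
        return None  # physical layout this client cannot parse
    handler = HANDLERS.get(metadata.get("use"))
    if handler is None:
        return None  # XML, but a "use" this client knows nothing about
    return handler(payload)


print(process({"format": "text/xml", "use": "darwin-core"}, b"<r/>"))
print(process({"format": "text/xml", "use": "sdd"}, b"<r/>"))  # None
```

A pass-through client would skip the dispatch entirely and forward the bytes untouched.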
------
Unrelated to Bob's comment I would like to add a point about digital from birth vs made digital data.
What is data and what is metadata has no relation to being digital or not. There was data and metadata long before there were computers. Galileo, studying the time objects take to move down an inclined plane, collected data: the time, distance, angle, and mass of the objects. At least the time and the distance recorded in his notebooks are data. If we re-represent his data from the notebook in digital format in 2007 so we can process it in an Excel spreadsheet, it is still the same data. If we just take a photo of the book we might have a different beast, but as long as we leave his numbers as numbers it is the same data. The metadata about the inclined plane experiment would include information about the apparatus used. For example, he might have bells that ring at different locations/distances along the inclined plane; it might be made of a wooden frame with brass rails. All this metadata tells us about the data; it is data about the data.
Similar arguments can be made about specimens. A digital representation of a specimen is still data. No one is arguing that the specimen is a species or a species concept. A specimen glued to paper or in a photo can be assigned to a species concept, meaning someone has said this is an X. As such we can treat it as an exemplar of X. If it is a type we can even say it is a very good example of X, but it does not cover the entire concept of X. The image of the specimen can be data. We need not treat it as metadata just because it is digital or because there is an object or event in the world that is the primary representation. Galileo's numbers also existing in the notebook do not make the numbers in the computer any less data. We will want to add metadata to the digital numbers to tell the user that they came from Galileo's notebook.
Bryan
P. Bryan Heidorn Graduate School of Library and Information Science University of Illinois at Urbana-Champaign pheidorn@uiuc.edu (V)217/ 244-7792 (F)217/ 244-3302 http://www.uiuc.edu/goto/heidorn Online Calendar: http://www.uiuc.edu/goto/heidorncalendar
There is no way to guarantee that a particular application which passes an LSID to another application can expect anything other
-- Robert A. Morris Professor of Computer Science UMASS-Boston ram@cs.umb.edu http://bdei.cs.umb.edu/ http://www.cs.umb.edu/~ram http://www.cs.umb.edu/~ram/calendar.html phone (+1)617 287 6466 _______________________________________________ tdwg-guid mailing list tdwg-guid@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-guid
My last escaped prematurely. I meant:
There is no way that an application that passes an LSID to another application can know that the second program will abide by some non-standard TDWG-defined contract about something called an LSID. Any program that passes a URI beginning with urn:lsid with an implicit or explicit request for a getData() call cannot be assured of anything about the chain of custody except what is in the LSID spec.
I wholly \agree/ with the need to have semantically persistent services, together with agreed-upon, named algorithms which establish the identity of two data streams for that purpose. What I don't agree with is calling the hook urn:lsid and the method getData().
Since the infrastructure at TDWG and elsewhere is in place for LSID, I think I would address this issue not by defining a new standard that is a clone of LSID except with a different definition of getData(), but rather think about whether there can be stuff in the getMetadata() calls and returns that permit an assertion by the callee that some bit of stuff has been provided under the semantic persistence contract. Yes, this will lead to needing a call to getMetadata() for stuff that some people insist is data (and also insist there is a difference). This is the cost of doing robust business. Yes, some people will write non-compliant getData() services. Yes, applications that deal with those will sometimes break. As Bruce Stein said in a breakout group last week in the Observation Modeling workshop: "You can't legislate against illegal activity."
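One way to picture the idea of a metadata assertion combined with agreed-upon, named identity algorithms is the sketch below. It is an assumption-laden illustration, not a proposal from the thread: the metadata key `identityAlgorithm` and the algorithm names `byte-identity` and `xml-c14n` are invented here, the metadata is modelled as a plain dict rather than whatever getMetadata() actually returns, and payloads are assumed to be UTF-8. A client that finds no assertion, or one it does not understand, falls back to what the LSID spec guarantees.

```python
from xml.etree.ElementTree import canonicalize


def _bytes_equal(first: bytes, second: bytes) -> bool:
    # The only identity test the LSID spec itself supports.
    return first == second


def _xml_c14n_equal(first: bytes, second: bytes) -> bool:
    # Assumption: UTF-8 payloads, compared after XML canonicalization.
    return canonicalize(first.decode("utf-8")) == canonicalize(second.decode("utf-8"))


# Hypothetical registry of named identity algorithms.
ALGORITHMS = {
    "byte-identity": _bytes_equal,
    "xml-c14n": _xml_c14n_equal,
}


def same_object(metadata: dict, first: bytes, second: bytes) -> bool:
    """Pick the identity test that the metadata asserts, defaulting to bytes."""
    name = metadata.get("identityAlgorithm", "byte-identity")
    compare = ALGORITHMS.get(name, _bytes_equal)  # unknown name: stay strict
    return compare(first, second)


a = b'<r b="2" a="1"/>'
b = b'<r a="1" b="2"></r>'
print(same_object({}, a, b))                                 # False
print(same_object({"identityAlgorithm": "xml-c14n"}, a, b))  # True
```

The strict default means a client that ignores the assertion still behaves as the LSID spec describes.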
I do not know if I stated my position on the issue of getData() immutability. There is an installed base of applications "expecting" that data returned by getData() will always have the same bit pattern. Because of that, and because of the existing definition of getData() in the LSID spec, we should not mess with that contract. That leaves two options for "semantically immutable" data: either call it metadata and return it in getMetadata() or, as I would prefer, extend the LSID spec with a new method such as getMimeData() or getflexData(). We can argue for a long time about the name, but this method can validly return XML, RDF, or other data types that may have semantically equivalent representations with different byte sequences. With this solution we would not need to support illegal activity.
Returning XML data in the getMetadata() operation is probably OK (not illegal, perhaps borderline), but what about those other data types that may have different byte streams but still identical content? Expressing those in the getMetadata() operation may be unwieldy. Would the data be returned as an attachment to getMetadata()? Always returning the data in the getMetadata() operation may also be inefficient (a trade-off between the number of calls to a service and the volume of data returned for each call).
The simplest and cleanest mechanism to do this seems to be through a new method. The signature might be like the getMetadata() operation:
bytes getSemanticallyEquivalentData(LSID lsid, string[] accepted_formats)
where accepted_formats is an optional parameter specifying a list of acceptable MIME types for the data (in order of preference). The list of MIME types supported by the service (there may be only one) can be expressed in the metadata.
The new method can be defined simply by extending the WSDL document that describes the data retrieval services.
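The format selection implied by accepted_formats could look like the Python sketch below. Everything in it is illustrative: the `STORE` contents, the example LSID, the placeholder payloads, and the choice to raise `LookupError` when no format matches are assumptions, since the message above does not say how a service should respond in that case.

```python
from typing import Optional, Sequence

# Hypothetical store: one LSID with two representations. The first entry is
# treated as the service default (dicts preserve insertion order).
STORE = {
    "urn:lsid:example.org:names:42": {
        "application/rdf+xml": b"placeholder RDF/XML payload",
        "text/plain": b"placeholder plain-text payload",
    }
}


def choose_format(supported: Sequence[str], accepted: Optional[Sequence[str]]) -> str:
    """Return the first accepted MIME type the service supports."""
    if not accepted:
        return supported[0]  # parameter omitted: use the service default
    for mime in accepted:  # accepted is in order of client preference
        if mime in supported:
            return mime
    raise LookupError("none of the accepted formats is supported")


def get_semantically_equivalent_data(
    lsid: str, accepted_formats: Optional[Sequence[str]] = None
) -> bytes:
    representations = STORE[lsid]
    mime = choose_format(list(representations), accepted_formats)
    return representations[mime]


lsid = "urn:lsid:example.org:names:42"
print(get_semantically_equivalent_data(lsid, ["text/n3", "text/plain"]))
print(get_semantically_equivalent_data(lsid))
```

Returning only bytes follows the suggested signature; a client would still need the metadata, or some other channel, to learn which of its accepted formats was actually chosen.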
On Jul 17, 2007, at 05:50, P. Bryan Heidorn wrote:
I do not know if I stated my position on the issue of getData() immutability. There is an installed base of application "expecting" that data returned by getData() will always have the same bit pattern. Because of that and the existing definition of getData() in the LSID spec we should not mess with that contract. That leaves two options for "semantically immutable" data. Either call it metadata and return it in getMetaData() or I would prefer an extension to the LSID spec to allow a new method getMimeData() or getflexData() we can argue for a long time about the name but this method can validly return XML, RDF or other data types that many have semantically equivalent representations with different byte orders. With this solution we would not need to support illegal activity.
On Jul 16, 2007, at 12:10 PM, Bob Morris wrote:
My last escaped prematurely. I meant:
There is no way that an application that passes an LSID to another application can know that the second program will abide by some non-standard TDWG-defined contract about something called an LSID. Any program that passes a uri beginning with urn:lsid with an implicit or explicit request for a getData() call cannot be assured of anything about the chain of custody except what is in the LSID spec.
I wholly \agree/ with the need to have semantically persistent services, together with agreed upon, named, algorithms which establish the identity of two data streams for that purpose. What I don't agree with is calling the hook urn:lsid and the method getData()
Since the infrastructure at TDWG and elsewhere is in place for LSID, I think I would address this issue not by defining a new standard that is a clone of LSID except with a different definition of getData(), but rather think about whether there can be stuff in the getMetadata() calls and returns that permit an assertion by the callee that some bit of stuff has been provided under the semantic persistence contract. Yes, this will lead to needing a call to getMetadata() for stuff that some people insist is data (and also insist there is a difference). This is the cost of doing robust business. Yes, some people will write non-compliant getData() services. Yes, applications that deal with those will sometimes break. As Bruce Stein said in a breakout group last week in the Observation Modeling workshop: "You can't legislate against illegal activity."
On 7/16/07, Bob Morris morris.bob@gmail.com wrote:
There is no way to guarantee that a particular application which passes an LSID to another application can expect anything other
On 7/16/07, P. Bryan Heidorn pheidorn@uiuc.edu wrote:
I am not sure if I follow completely, Bob, but I think you are pointing out an important issue for "semantic immutability" versus "byte/bit-level immutability". If a client retrieves data from two different sources under a byte-level immutability contract, a simple equivalence test should be able to verify the byte-level equivalence. Under the semantic immutability contract, a more complex test for equivalence would be required, one that fits, for example, the MIME type.
In practice I do not think this is an issue. If clients act under blind faith under either contract, they would not test the equivalence. In fact they would usually only retrieve a particular LSID from one service. The blind-faith client would process the data as if the data provider is following the contract and no more. The client could not assume byte-level immutability when there is only semantic immutability, because it may indeed break the client code. A cached byte-level representation of data from one call cannot be compared byte for byte with semantically immutable data. If XML is carried in the data, all operations must be consistent with XML operations. I do not see this as a problem.
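To illustrate the difference between the two equivalence tests, here is a minimal Python sketch. It assumes that "semantically equivalent XML" can be approximated by comparing canonical (C14N 2.0) serializations with insignificant whitespace stripped; the thread does not fix an algorithm, and the function names and the sample record are invented for illustration.

```python
import xml.etree.ElementTree as ET


def bytes_equal(a: bytes, b: bytes) -> bool:
    """Byte-level contract: the two payloads must be identical."""
    return a == b


def xml_equivalent(a: bytes, b: bytes) -> bool:
    """Semantic contract for XML (an assumption, not part of the LSID spec):
    compare canonical forms, ignoring attribute order and layout whitespace."""
    canon_a = ET.canonicalize(a.decode("utf-8"), strip_text=True)
    canon_b = ET.canonicalize(b.decode("utf-8"), strip_text=True)
    return canon_a == canon_b


def equivalent(a: bytes, b: bytes, mime_type: str) -> bool:
    """Pick the equivalence test that fits the declared MIME type."""
    if mime_type in ("text/xml", "application/xml"):
        return xml_equivalent(a, b)
    return bytes_equal(a, b)


# Two serializations of the same hypothetical record from two sources.
first = b'<record id="1" lang="en"><name>Puma concolor</name></record>'
second = b'<record lang="en"   id="1">\n  <name>Puma concolor</name>\n</record>'

print(bytes_equal(first, second))              # byte-level test
print(equivalent(first, second, "text/xml"))   # semantic test for XML
```

A client written against the byte-level contract would flag these two payloads as different, while a client that knows the payload is XML could treat them as the same data.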
Since in the biodiversity community LSID data payloads would be about a large variety of objects, clients would always need to check the data types before most processing operations. The data type information would be encoded in the metadata but could also be segregated by service provider (but even there, for good form, the metadata should encode the data type). The metadata needs to encode both the physical layout of the bits and the "use" (there must be a better word). For example, the data could be Darwin Core records, Dublin Core records or SDD. All are XML, but the legal operations over that XML differ depending on the "use". Some clients could just pass the data through without being concerned about this, but other clients would need to process accordingly, perhaps ignoring types they know nothing about.
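A client of this kind might be sketched as follows in Python. The metadata keys ("format", "use") and the "use" labels are hypothetical placeholders; the thread does not define a vocabulary for them.

```python
from typing import Callable, Dict, Optional, Tuple

Handler = Callable[[bytes], str]

# (MIME type, "use") -> handler. Both labels are assumptions for illustration.
HANDLERS: Dict[Tuple[str, str], Handler] = {
    ("text/xml", "darwin-core"): lambda data: f"Darwin Core record, {len(data)} bytes",
    ("text/xml", "dublin-core"): lambda data: f"Dublin Core record, {len(data)} bytes",
}


def process(metadata: Dict[str, str], data: bytes) -> Optional[str]:
    """Check the declared type and use before touching the payload.

    Returns None for combinations this client knows nothing about,
    so such payloads are ignored rather than misinterpreted.
    """
    key = (metadata.get("format", ""), metadata.get("use", ""))
    handler = HANDLERS.get(key)
    if handler is None:
        return None
    return handler(data)


print(process({"format": "text/xml", "use": "darwin-core"}, b"<r/>"))
print(process({"format": "text/xml", "use": "sdd"}, b"<r/>"))
```

The point of the lookup on both fields is that two payloads with the same MIME type can still require different handling.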
Unrelated to Bob's comment, I would like to add a point about digital-from-birth vs. made-digital data.
What is data and what is metadata has no relation to being digital or not. There was data and metadata long before there were computers. Galileo, studying the time objects take to move down an inclined plane, collected data: the time, distance, angle and mass of the objects. At least the time and the distance recorded in his notebooks are data. If we re-represent his data from the notebook in digital format in 2007 so we can process it in an Excel spreadsheet, it is still the same data. If we just take a photo of the book we might have a different beast, but as long as we leave his numbers as numbers it is the same data. The metadata about the inclined plane experiment would include information about the apparatus used. For example, he might have bells that ring at different locations/distances along the inclined plane; it might be made of a wooden frame with brass rails. All this metadata tells us about the data; it is data about the data. Similar arguments can be made about specimens. A digital representation of a specimen is still data. No one is arguing that the specimen is a species or a species concept. A specimen glued to paper or in a photo can be assigned to a species concept, meaning someone has said this is an X. As such we can treat it as an exemplar of X. If it is a type we can even say it is a very good example of X, but it does not cover the entire concept of X. The image of the specimen can be data. We need not treat it as metadata just because it is digital or because there is an object or event in the world that is the primary representation. Galileo's numbers also existing in the notebook do not make the numbers in the computer any less data. We will want to add metadata to the digital numbers to tell the user that they came from Galileo's notebook.
Bryan
P. Bryan Heidorn Graduate School of Library and Information Science University of Illinois at Urbana-Champaign pheidorn@uiuc.edu (V)217/ 244-7792 (F)217/ 244-3302 http://www.uiuc.edu/goto/heidorn Online Calendar: http://www.uiuc.edu/goto/heidorncalendar
On Jul 16, 2007, at 9:01 AM, Bob Morris wrote:
On 7/16/07, Ricardo Pereira ricardo@tdwg.org wrote:
> b) Some may claim that caching of LSIDs and the associated data would be impossible. But since the data is always "semantically immutable", what's wrong with caching it?
One thing that is wrong with it is that if a conforming client acquires the data with a getData call from two different sources, and they return different byte strings, then the client is permitted to signal an error and possibly break an application that exercises a blind faith in the power of "semantic immutability".
-- Robert A. Morris Professor of Computer Science UMASS-Boston ram@cs.umb.edu http://bdei.cs.umb.edu/ http://www.cs.umb.edu/~ram http://www.cs.umb.edu/~ram/calendar.html phone (+1)617 287 6466 _______________________________________________ tdwg-guid mailing list tdwg-guid@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-guid
Guys,
It looks like we are converging towards extending the LSID specification to add a new method that returns semantically equivalent data.
If we all agree, I'll write up a specification for this new method and add it to the TDWG LSID Applicability Statement.
Also, if you agree, I'd like to move on to discuss the next issue.
What do you think?
Cheers,
Ricardo
Dave Vieglais wrote:
Returning XML data in the getMetadata() operation is probably ok (not illegal, perhaps borderline), but what about those other data types that may have different byte streams but still identical content? Expressing those in the getMetadata() operation may be unwieldy. Would the data be returned as an attachment to getMetadata()? Always returning the data in the getMetadata operation may also be inefficient (toss up between the number of calls to a service and the volume of data returned for each call).
The simplest and cleanest mechanism to do this seems to be through a new method. The signature might be like the getMetadata() operation:
bytes getSemanticallyEquivalentData(LSID lsid, string[] accepted_formats)
where accepted_formats is an optional parameter specifying a list of acceptable MIME types for the data (in order of preference). A list of MIME types supported by the service (there may be only one) can be expressed in the metadata.
The new method can be defined simply by extending the WSDL document that describes the data retrieval services.
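The format selection implied by accepted_formats could be sketched in Python as below. This is only an illustration of the proposed signature, not an implementation of any existing LSID library: the in-memory STORE, the example LSID, and the choice to fall back to the first supported format when accepted_formats is omitted are all assumptions.

```python
from typing import Dict, Optional, Sequence

# Hypothetical store: LSID -> {MIME type: serialized representation}.
LSID = "urn:lsid:example.org:specimens:1234"
STORE: Dict[str, Dict[str, bytes]] = {
    LSID: {
        "application/rdf+xml": b"<rdf/>",
        "text/xml": b"<record/>",
    }
}


def choose_format(supported: Sequence[str],
                  accepted: Optional[Sequence[str]] = None) -> Optional[str]:
    """Return the first accepted MIME type the service supports.

    With no accepted list, fall back to the service's first format
    (an assumption; the proposal leaves this case open).
    """
    if not supported:
        return None
    if not accepted:
        return supported[0]
    for mime in accepted:  # accepted is in order of client preference
        if mime in supported:
            return mime
    return None


def get_semantically_equivalent_data(
        store: Dict[str, Dict[str, bytes]],
        lsid: str,
        accepted_formats: Optional[Sequence[str]] = None) -> bytes:
    """Sketch of the proposed method over an in-memory store."""
    representations = store[lsid]
    chosen = choose_format(list(representations), accepted_formats)
    if chosen is None:
        raise LookupError(f"no acceptable format for {lsid}")
    return representations[chosen]


print(get_semantically_equivalent_data(STORE, LSID, ["text/html", "text/xml"]))
```

Here the client prefers text/html, which the service does not offer, so the second preference is served instead.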
On Jul 17, 2007, at 05:50, P. Bryan Heidorn wrote:
I do not know if I stated my position on the issue of getData() immutability. There is an installed base of applications "expecting" that data returned by getData() will always have the same bit pattern. Because of that, and because of the existing definition of getData() in the LSID spec, we should not mess with that contract. That leaves two options for "semantically immutable" data: either call it metadata and return it in getMetadata(), or, as I would prefer, extend the LSID spec with a new method such as getMimeData() or getflexData(). We can argue for a long time about the name, but this method could validly return XML, RDF or other data types that may have semantically equivalent representations with different byte sequences. With this solution we would not need to support illegal activity.
Extending the specification might seem like the honorable thing to do, but who is going to implement it? How are we going to build it into the list of services that providers are already obliged to support if they wish to participate? Why would we want to do it when we already have a robust protocol for delivering data using any format in TAPIR services?
The fact that LSID was a purpose built, off-the-shelf solution was a strong attractor in the initial selection process. If now we find that it is just not suitable then we should probably re-evaluate that decision before we rush off to roll our own. Especially now that we know that LSIDs are not the hook into the semantic web that we thought they might be.
How many of us believe in a database federation based on tdwgLSID.getSemanticallyEquivalentData(LSID, format) calls?
The important issues here are not ones of resolution - which has come as a consequence of choosing LSID and duplicates what we have elsewhere - but in delivering on our requirement to establish provenance, manage uniqueness and support persistence. We will find LSIDs *already* embedded in tdwgFORMAT records and expect to believe that they represent globally unique keys for these data objects and their relationships.
Isn't getSemanticallyEquivalentData(LSID, format) already handled by http://tapir_provider/?op=search&model=formatSpec&LSID=... along with op=metadata, op=capabilities, op=inventory, etc?
greg
From the LSID spec (http://www.omg.org/docs/dtc/04-05-01.pdf), page 18, LSID Resolution Service, getData method:
"Note that the semantics of the returned bytes is not defined by this specification. It is either known from an external documentation, or (preferably) it is available by reading the metadata for this particular lsid."
So, LSID per se defines nothing about the content of the returned bytes and defers the whole thing to a separately specified definition or via metadata.
What the LSID spec does however state clearly is that the bytes cannot change, including the case of no bytes.
Page 16, Specification: "If an LSID represents real data, the LSID Resolution service (described elsewhere in this document) must resolve always the same set of bytes representing such data. If an LSID represents an abstract entity the LSID resolution service must always resolve an empty result."
Page 17, getAvailableServices: "The same LSID named data object must be resolved always to the same set of bytes."
Chuck
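The byte-level rule quoted above can be checked mechanically. The following Python sketch records a digest of the first getData() result seen for each LSID and compares later results against it; the class name and the example LSIDs are hypothetical, and using SHA-256 is simply one convenient way to compare byte strings without storing them.

```python
import hashlib
from typing import Dict


class ImmutabilityChecker:
    """Remembers a digest per LSID and flags any later byte-level change."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}

    def check(self, lsid: str, data: bytes) -> bool:
        """Return True if data matches what was first seen for this LSID."""
        digest = hashlib.sha256(data).hexdigest()
        # setdefault stores the digest on first sight and returns the stored one.
        previous = self._seen.setdefault(lsid, digest)
        return previous == digest


checker = ImmutabilityChecker()
sequence_lsid = "urn:lsid:example.org:sequences:42"  # hypothetical LSID
abstract_lsid = "urn:lsid:example.org:concepts:7"    # hypothetical abstract entity

print(checker.check(sequence_lsid, b"ACGT"))    # first sight
print(checker.check(sequence_lsid, b"ACGT\n"))  # one extra byte
print(checker.check(abstract_lsid, b""))        # abstract entity: empty result
```

Under this check a trailing newline is enough to count as a change, which is exactly why re-serialized XML has trouble meeting the getData() contract.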
-----Original Message----- From: tdwg-guid-bounces@lists.tdwg.org [mailto:tdwg-guid-bounces@lists.tdwg.org] On Behalf Of P. Bryan Heidorn Sent: Monday, July 16, 2007 12:50 PM To: Bob Morris Cc: tdwg-guid@lists.tdwg.org Subject: Re: [tdwg-guid] Immutability of LSID data
I do not know if I stated my position on the issue of getData() immutability. There is an installed base of application "expecting" that data returned by getData() will always have the same bit pattern. Because of that and the existing definition of getData() in the LSID spec we should not mess with that contract. That leaves two options for "semantically immutable" data. Either call it metadata and return it in getMetaData() or I would prefer an extension to the LSID spec to allow a new method getMimeData() or getflexData() we can argue for a long time about the name but this method can validly return XML, RDF or other data types that many have semantically equivalent representations with different byte orders. With this solution we would not need to support illegal activity.
On Jul 16, 2007, at 12:10 PM, Bob Morris wrote:
My last escaped prematurely. I meant:
There is no way that an application that passes an LSID to another application can know that the second program will abide by some non-standard TDWG-defined contract about something called an LSID. Any program that passes a uri beginning with urn:lsid with an implicit or explicit request for a getData() call cannot be assured of anything about the chain of custody except what is in the LSID spec.
I wholly \agree/ with the need to have semantically persistent services, together with agreed upon, named, algorithms which establish the identity of two data streams for that purpose. What I don't agree with is calling the hook urn:lsid and the method getData()
Since the infrastructure at TDWG and elsewhere is in place for LSID, I think I would address this issue not by defining a new standard that is a clone of LSID except with a different definition of getData(), but rather think about whether there can be stuff in the getMetadata() calls and returns that permit an assertion by the callee that some bit of stuff has been provided under the semantic persistence contract. Yes, this will lead to needing a call to getMetadata() for stuff that some people insist is data (and also insist there is a difference). This is the cost of doing robust business. Yes, some people will write non-compliant getData() services. Yes, applications that deal with those will sometimes break. As Bruce Stein said in a breakout group last week in the Observation Modeling workshop: "You can't legislate against illegal activity."
On 7/16/07, Bob Morris morris.bob@gmail.com wrote:
There is no way to guarantee that a particular application which passes an LSID to another application can expect anything other
On 7/16/07, P. Bryan Heidorn pheidorn@uiuc.edu wrote:
I am not sure if I follow completely Bob but I think you are
pointing
out an important issue for "semantics immutability" versus "byte/
bit-
level immunity". If a client retrieves data from two different clients under a byte-level immutability contract a simple
equivalence
test should be able to verify the byte-level equivalence. Under the semantic immutability contract, a more complex text for equivalence would be required to fit for example the mime-type.
In practice I do not think this is an issue. If clients act under blind faith under either contract they would not text the equivalence. In fact they would usually only retrieve a particular LSID from one service. The blind faith client would process the
data
as if the data provider is following the contract and no more. The client could not assume byte-level immutability when there is only semantic immutability because it may indeed break the client code. Caching a byte-level representation of data from one call can
not be
compared with semantic data. If XML is carried in the data all operations must be consistent with XML operations. I do not see
this
as a problem.
Since in the biodiversity community LSID data payloads would be about a large variety of objects, clients would always need to check the data types before most processing operations. The data type information would be encoded in the metadata but could also be segregated by service provider (but even there, for good form, the metadata should encode the data type). The metadata needs to encode both the physical layout of the bits and the "use" (there must be a better word). For example, the data could be a Darwin Core record, a Dublin Core record or SDD. All are XML, but the legal operations over that XML differ depending on the "use". Some clients could just pass the data through without being concerned about this, but other clients would need to process accordingly, perhaps ignoring types they know nothing about.
Unrelated to Bob's comment, I would like to add a point about digital-from-birth versus made-digital data.
What is data and what is metadata has no relation to being digital or not. There was data and metadata long before there were computers. Galileo, studying the time objects take to move down an inclined plane, collected data: the time, distance, angle and mass of the objects. At least the time and the distance recorded in his notebooks are data. If we re-represent his data from the notebook in digital format in 2007 so we can process it in an Excel spreadsheet, it is still the same data. If we just take a photo of the book we might have a different beast, but as long as we leave his numbers as numbers it is the same data. The metadata about the inclined plane experiment would include information about the apparatus used. For example, he might have had bells that ring at different locations/distances along the inclined plane, and it might be made of a wooden frame with brass rails. All this metadata tells us about the data; it is data about the data. Similar arguments can be made about specimens. A digital representation of a specimen is still data. No one is arguing that the specimen is a species or a species concept. A specimen glued to paper or in a photo can be assigned to a species concept, meaning someone has said this is an X. As such we can treat it as an exemplar of X. If it is a type we can even say it is a very good example of X, but it does not cover the entire concept of X. The image of the specimen can be data. We need not treat it as metadata just because it is digital or because there is an object or event in the world that is now the primary representation. Galileo's numbers also existing in the notebook do not make the numbers in the computer any less data. We will want to add metadata to the digital numbers to tell the user that they came from Galileo's notebook.
Bryan
P. Bryan Heidorn Graduate School of Library and Information Science University of Illinois at Urbana-Champaign pheidorn@uiuc.edu (V)217/ 244-7792 (F)217/ 244-3302 http://www.uiuc.edu/goto/heidorn Online Calendar: http://www.uiuc.edu/goto/heidorncalendar
On Jul 16, 2007, at 9:01 AM, Bob Morris wrote:
On 7/16/07, Ricardo Pereira ricardo@tdwg.org wrote:
One thing that is wrong with it is that if a conforming client acquires the data with a getData call from two different sources, and they return different byte strings, then the client is permitted to signal an error and possibly break an application that exercises a blind faith in the power of "semantic immutability".
b) Some may claim that caching of LSIDs and the associated data would be impossible. But since the data is always "semantically immutable", what's wrong with caching it?
-- Robert A. Morris Professor of Computer Science UMASS-Boston ram@cs.umb.edu http://bdei.cs.umb.edu/ http://www.cs.umb.edu/~ram http://www.cs.umb.edu/~ram/calendar.html phone (+1)617 287 6466 _______________________________________________ tdwg-guid mailing list tdwg-guid@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-guid
Dear all,
Since the infrastructure at TDWG and elsewhere is in place for LSID, I think I would address this issue not by defining a new standard that is a clone of LSID except with a different definition of getData(), but rather think about whether there can be stuff in the getMetadata() calls and returns that permit an assertion by the callee that some bit
Maybe a workaround would be to add a fingerprint (hash key) of the delivered data to the metadata? At least it would then be transparent to the client if two providers deliver different data.
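As a rough illustration of the fingerprint idea, the sketch below computes a SHA-256 digest of the bytes returned by getData and compares it with a value carried in the metadata. The metadata layout (a plain dictionary with a `sha256` key) is purely an assumption for this example; the LSID spec does not define such a field.

```python
import hashlib


def fingerprint(payload: bytes) -> str:
    """Return the hex SHA-256 digest of a getData payload."""
    return hashlib.sha256(payload).hexdigest()


def matches_metadata(payload: bytes, metadata: dict) -> bool:
    """Check a payload against the fingerprint published in the metadata.

    'sha256' is a hypothetical metadata key used only in this sketch.
    """
    return fingerprint(payload) == metadata.get("sha256")


# A provider would publish the digest alongside the data ...
original = b"ACGTACGTTAGC"
metadata = {"sha256": fingerprint(original)}

# ... and a client can detect that a second source returned other bytes.
from_other_source = b"ACGTACGTTAGG"
```

Note that a digest of the raw bytes only supports the byte-level contract; for XML or RDF payloads the digest would have to be taken over some agreed canonical form.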
In general I think the discussion (and also the discussion on persistence) should again raise the question of whether some kind of trustworthiness certificate for LSID providers should be introduced. Doesn't the reliability of data immutability and persistence ultimately depend on the trustworthiness of the data provider? I am thinking, for example, of the Open Archival Information System (OAIS) model, which is now an ISO standard (ISO 14721). Incidentally, the model distinguishes between the data object (bit stream) and representation information, which, if applied to LSIDs, would help with the immutability problem. N.B.: it must _of course_ be within the responsibility of a data center to adapt the representation form of a digital object to the needs of its users; e.g., it would make no sense to continue to deliver documents in an early-1980s StarWriter format!
Best regards, Robert
Hi Bryan,
What is data and what is metadata has no relation to being digital or not. There was data and metadata long before there were computers.
Again, we are coming back to this communication problem. I agree with you in the context of the words "data" and "metadata" as most of us probably define them. But we are talking about LSIDs, and so we should follow the definitions of these words in the context of the LSID spec. It may be terribly unfortunate that the LSID spec defines "data" differently from how most of us would use that word -- just as it is terribly unfortunate that a "named concept" has essentially nothing to do with either a taxon "concept" or a taxon "name", or that a "Class" written in C++ has no relationship to the "Class" Mammalia, or that a data "type" has nothing to do with a "type" specimen, or the fact that all of these "homonyms" cause problems that are different from the sorts of problems created by taxonomic "homonyms" -- among dozens of other frustrating language barriers we have.
However, in the context of LSIDs, which is what we are now discussing, the word "data" does indeed unambiguously refer to a digital/binary bytestream, and *not* the kind of "data" that Galileo collected.
Aloha, Rich
sigh, Sorry I did not reply earlier. My time was eaten up with proposal writing and the like.
I agree that we are talking about digital data for the LSID and I did not intend to insinuate otherwise in my prior message. It is just that people are putting Galileo's data online in digital format and do need unique identifiers, it just is not biodiversity data. The issues have been addressed many times and it is important to learn from past experience.
We can say that it is important to be able to ensure the properties of the data service, so that the digesting process can make assumptions about the data. In Java and many other languages there is a bit-level equivalence operator such as "==". This is relevant to the concept of homonyms.
Hannu pointed out that it is nice to be able to make assumptions about the nature of the data being delivered. You can, for example, know you can use "==" in your program and assume it should return true if the service is following the rules (of bit-level immutability). When we say two things are equivalent in these languages we mean "equivalent" under the language's operators. The LSID getData service is defined in these terms, which is very reasonable for many forms of data, including molecular sequences (except that genetic matching algorithms frequently treat a genetic sequence and its complement as equivalent because they are both halves of the same double helix). So I would guess that even the molecular community who defined LSID might have people who are unhappy with the current definition. In some languages we are allowed to overload operators such as "==" with our own definition of equivalence. The language designers did this because people often need different definitions of equivalence, particularly for complex data types.
In many programming tasks, bit-level equivalence is not needed and is indeed problematic. So RDF and software such as DOM define equivalence not as bit-level matching of 1's and 0's in a particular order but as a higher-order construct. So we can have a born-digital object that describes a species of plant. "<leaf><arrangement>alternate</arrangement><length unit="mm">10</length></leaf>" is equivalent to "<leaf><length unit="mm">10</length><arrangement>alternate</arrangement></leaf>"
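A sketch of such a higher-order equivalence test follows. It assumes that the order of sibling elements carries no meaning for this kind of record; that is an application-level choice rather than something XML itself guarantees (canonical XML, for instance, preserves element order). The helper names are invented for the example, and text that follows a child element is ignored for brevity.

```python
import xml.etree.ElementTree as ET


def signature(elem: ET.Element) -> tuple:
    """Build an order-insensitive, comparable summary of an element."""
    text = (elem.text or "").strip()
    attrs = tuple(sorted(elem.attrib.items()))
    # Sorting the child signatures makes sibling order irrelevant.
    children = tuple(sorted(signature(child) for child in elem))
    return (elem.tag, attrs, text, children)


def same_content(xml_a: str, xml_b: str) -> bool:
    """True if both documents hold the same elements, in any sibling order."""
    return signature(ET.fromstring(xml_a)) == signature(ET.fromstring(xml_b))


leaf_a = ('<leaf><arrangement>alternate</arrangement>'
          '<length unit="mm">10</length></leaf>')
leaf_b = ('<leaf><length unit="mm">10</length>'
          '<arrangement>alternate</arrangement></leaf>')
leaf_c = ('<leaf><length unit="cm">10</length>'
          '<arrangement>alternate</arrangement></leaf>')
```

Here `leaf_a == leaf_b` is false as a string comparison, while `same_content(leaf_a, leaf_b)` is true; `leaf_c` differs in the unit and is rejected by both tests.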
There are applications in biodiversity informatics where bit-level equivalence is useful, so I support keeping getData's requirement of bit-level equivalence. Other branches of biodiversity informatics, however, would benefit from a different definition of equivalence. This can be handled with an LSID extension as a new function. Who pays for development of this new function is important. We can roll out a more constrained standard with getData as is and later add the new getDataRepresentationallyEquivalent.
So, let's move ahead, adopt LSID and start using it for the cases where bit-level equivalence is acceptable, and either expand it later or develop a different standard to give unique identifiers for the other applications.
Hi Ricardo,
I certainly agree with the direction you want to take the discussion in, but I do want to make a couple of comments:
I wasn't the one who came up with the LSID spec, but I suppose that those methods were specifically designed to handle sequence data (DNA and protein data). The getDataByRange method in particular was designed to allow clients to refer to very specific subsets of those sequences.
No doubt that this is all very useful for the bioinformatics folks, but as we've seen in previous discussions, it is not as useful for us in the biodiversity (and ecological) informatics communities. The main reason is that some of our data is represented in XML, which cannot be serialized as the very same stream of bytes every time. But it may still be helpful to use the getData call to retrieve such data.
I am dubious that we will eventually find much use for the getData() call for any non-digital objects (which, in my understanding, includes many of the things we want to exchange data about). However, I think that the getData() call *does* have value for a non-trivial portion of data objects that *are* of interest to us in the biodiversity informatics community. First of all, the data generated by the bioinformatics folks are of interest to our community, and will increasingly become so as time moves on. But I think the getData() call could also be of value to other objects of interest to us as well. Examples include cropped regions of image files, individual pages of multi-page scanned paper document files, specific segments of video files, specific segments of audio files, among others. Obviously, this would depend on the nature of binary file itself, such that a contiguous block of bytes extracted from within the complete binary file would represent a meaningful, render-able unit of information (possibly not . But the point is, I do believe that getData() does have potential use to us in our data domain for certain tasks.
More fundamentally, however, I want to echo something you said in an earlier post: "We should not try to return something in the LSID getData() call just for the sake of it." In other words, if our LSIDs identify something other than a *static* binary data file (which, in my mind, a database record or a dynamically generated XML file usually do *not* represent -- for the reasons that Dave and others have already pointed out), then we should find ways to make use of the information as returned via getMetadata().
I am still a bit confused as to why we need to define all of these "semantically immutable" rules just to allow us to make use of the getData() call for objects that are not static binary files. I can certainly understand a set of rules that our community defines in terms of managing versioning of metadata and dealing with the sorts of use-cases that Matt describes, but I don't see why we can't layer that on top of the existing LSID specs and methods, rather than "bend" the existing LSID spec and/or develop new methods. Like Bryan said:
But I definitely think it would be a mistake to re-define what "data" means in the context of LSIDs (i.e., to allow mutability in certain cases, and in so doing fail to fulfill the contract for serving LSIDs).
Aloha, Rich
Indeed, I think we need to look closer at handling of the 6th, optional element in the LSID spec, revision.
If the XML stream returned as data needs to change, that would lead to issuing a new revision. It is not said clearly that getData should return data from the latest revision (if omitted), although I'd expect that. In other words, if the revision has changed, the data returned does not have to be the same, although it might look like that it has illegally changed if you omit the revision.
But I must say that I like receiving immutable getData. For instance, if I use an LSID to identify a taxonomic concept, I expect the description not to change without notice. Someone may add a new, more elaborate revision, though, which is OK.
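For reference, the revision discussed here is the last, optional component of the identifier, after the "urn", "lsid", authority, namespace and object parts. The sketch below splits an LSID into those parts; the class and function names and the example identifiers are made up for illustration, and the check is deliberately looser than the full syntax in the spec.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]  # None when the sixth element is omitted


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into parts."""
    parts = text.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifiers: the same object with and without a revision.
pinned = parse_lsid("urn:lsid:example.org:concepts:1234:2")
unpinned = parse_lsid("urn:lsid:example.org:concepts:1234")
```

A client that wants the description "not to change without notice" would store and resolve the pinned form; what a provider returns for the unpinned form is exactly the open question raised above.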
Regards, Hannu
--
Rich wrote (in the other thread): I believe the answer to Ricardo's example is better addressed in the next discussion, concerning methods for data versioning. I think the answer to this issue (persistence of metadata) necessarily must be solved via that discussion (versioning), so maybe we should discuss the versioning issue first.
participants (9)
- Bob Morris
- Chuck Miller
- Dave Vieglais
- Greg Whitbread
- Hannu Saarenmaa
- P. Bryan Heidorn
- Ricardo Pereira
- Richard Pyle
- Robert Huber