Re: [tdwg-tag] SourceForge LSID project websites broken - role for TDWG?
Thanks, Kathi.
I appreciate your comments and understand your concerns. This certainly is a social problem - no technology solution will take it away. A large proportion (though certainly not all) of the issues surrounding LSIDs will arise with any technology which tries to address the problem.
I seem to be in the minority in believing that we can use LSIDs as one part of a strategy to develop a community infrastructure for our data. However, we do need to start from somewhere if we want to do anything about the persistence of our data. We need some foundations before we can properly worry about "intelligent caching and harvesting mechanisms" (which I agree we need).
So - here is my outline for how I think we could move forward from these discussions:
1. An identifier scheme which aims to provide some long-term persistence probably needs to embody at least three key facts: who generated/published the data object, what data collection this object belongs to, and which data object from the specified data collection this one is. These correspond roughly to the Darwin Core InstitutionCode/CollectionCode/CatalogNumber triple and to the three main substitutable elements in an LSID. Some systems such as DOI may obscure the whoGeneratedTheData part somewhat. Some systems such as DOI and PURL may not always have an explicit whatCollectionItBelongsTo part, but dealing with collections promises to be an organisational simplification for most purposes.
2. TDWG should recommend the LSID as one suitable model for constructing GUIDs (i.e. "urn:lsid:<whoGeneratedTheDataObject>:<whatCollectionItBelongsTo>:<whichItemInTheCollectionItIs>"). We could propose (or adopt) some other syntax for this, but this gives us a neat enough way to encapsulate what we need to know. The "urn:lsid:" part can be seen as a useful flag that this is indeed to be considered as an identifier.
3. Where feasible, TDWG should recommend that these LSIDs should be associated with a resolver implementing the standard LSID mechanism. Frankly I am a lot less bothered by the resolvability of most identifiers than I am about their consistent use, so I have no problem with the idea of assigning LSIDs to things which do not currently resolve.
4. TDWG should require that a path exist to retrieve the associated data using an HTTP resolver to proxy the LSID (i.e. http://whoGeneratedTheDataObject.org/<optional_path_elements>/<lsid>) and that our practice is to consider this proxified version to be identical for comparison purposes with the bare LSID. For LSIDs resolvable using the standard LSID mechanism, this path can be http://lsid.tdwg.org/<lsid>. In cases in which the data are only accessible via HTTP, we have broken the LSID specification - although it seems there may be nobody other than us to care about that fact.
5. All references to LSIDs within RDF documents should use the proxified form.
6. TDWG and its partners should establish a PURL-like service which makes it easy to register data sets to be associated with identifiers of this form. In other words, a service should exist (around a domain secured for this purpose into the future) which associates data providers with an appropriate whoGeneratedTheDataObject element and associates their data collections with an appropriate whatCollectionItBelongsTo element and associated URL pattern for retrieving RDF data for the individual data objects. The exact details could vary, but assume that TDWG sets up this service at http://lsid.tdwg.org/ and that CSIRO wishes to register the ANIC data collection and to have individual specimen records associated with LSID-based identifiers. Assume further that ANIC has a script on its servers which can return the RDF data for these specimens, say at http://www.csiro.au/anic/specimens/<catalogueNumber>. The registration process could result in the LSID urn:lsid:tdwg.org:csiro.anic:12345 and the HTTP URI http://lsid.tdwg.org/urn:lsid:tdwg.org:csiro.anic:12345 both being mapped through to http://www.csiro.au/anic/specimens/12345. It would probably be preferable for the LSID in this case to be urn:lsid:csiro.tdwg.org:anic:12345 (which would make relocation of all LSID services for a single data provider easy, but could require large numbers of SRV records to be managed by TDWG). (I would note that it would be easy for the infrastructure to allow the data provider to choose whether the whole LSID or just the final ID element should be passed to the final URL.)
7. TDWG and its partners should use this same infrastructure to handle alternative resolution paths as required in the future - if alternative identifier schemes become the preferred option. This infrastructure could also add significant other functions, including e.g. 1) intelligent caching of data, 2) validation of RDF data, and 3) simultaneous registration of DOIs associated with metadata for each data collection to make it easier for them to be cited by journal articles.
8. Any provider may opt at any time to use alternative HTTP-resolvable identifiers in place of LSIDs (e.g. DOIs, handles, PURLs), but must consider the technological and social implications of keeping these identifiers alive into the future.
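As an illustration of points 2 and 4 above, here is a minimal Python sketch of splitting an LSID into its three main elements and treating a proxified HTTP form as identical to the bare LSID for comparison. The function names are hypothetical, and lower-casing the authority before comparison is an assumption of this sketch rather than something stated in the outline.

```python
from typing import NamedTuple, Optional

# Proxy base taken from the outline above; any other proxy host would work too.
PROXY_PREFIX = "http://lsid.tdwg.org/"


class Lsid(NamedTuple):
    authority: str                  # whoGeneratedTheDataObject
    namespace: str                  # whatCollectionItBelongsTo
    object_id: str                  # whichItemInTheCollectionItIs
    revision: Optional[str] = None  # optional trailing element


def parse_lsid(text: str) -> Lsid:
    """Split 'urn:lsid:<authority>:<namespace>:<object>[:<revision>]'."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty LSID element in: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    # Assumption: the authority is compared case-insensitively.
    return Lsid(authority.lower(), namespace, object_id, revision)


def strip_proxy(identifier: str) -> str:
    """Return the bare LSID from a bare or HTTP-proxied identifier."""
    start = identifier.lower().find("urn:lsid:")
    if start == -1:
        raise ValueError(f"no LSID found in: {identifier!r}")
    return identifier[start:]


def proxify(lsid: str, prefix: str = PROXY_PREFIX) -> str:
    """Build the proxified form recommended for use inside RDF documents."""
    parse_lsid(lsid)  # validate before wrapping
    return prefix + lsid


def same_identifier(a: str, b: str) -> bool:
    """Compare two identifiers, ignoring any HTTP proxy wrapper."""
    return parse_lsid(strip_proxy(a)) == parse_lsid(strip_proxy(b))
```

With these helpers, same_identifier(proxify(x), x) holds for any well-formed LSID x, which is the comparison rule proposed in point 4.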
As far as I can see, this approach allows us to develop a community-based approach to managing identifiers in a way which builds on LSIDs for those who have already minted them. It would be easy for us to reinvent this as a PURL-based approach in the future. The costs should not be great and it gives us a better chance of avoiding the confusion of random-URLs-pointing-at-random-data-formats being offered as semantically useful GUIDs.
Whatever happens, TDWG needs to finalise an applicability statement for how LSIDs should be used by those providers who have chosen or who will choose to use them for biodiversity data. This does not mandate that everyone MUST use LSIDs.
Does this seem worth pursuing?
Donald
Donald Hobern, Director, Atlas of Living Australia CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601 Phone: (02) 62464352 Mobile: 0437990208 Email: Donald.Hobern@csiro.au Web: http://www.ala.org.au/
-----Original Message----- Date: Mon, 6 Apr 2009 10:15:00 +0200 From: Schleidt Katharina katharina.schleidt@umweltbundesamt.at Subject: Re: [tdwg-tag] SourceForge LSID project websites broken - role for TDWG? To: Roger Hyam rogerhyam@mac.com, Peter DeVries pete.devries@gmail.com Cc: "tdwg-tag@lists.tdwg.org" tdwg-tag@lists.tdwg.org Message-ID: 8638F29270898544933A7663226809E5EAAA6060@PCMAIL3.umweltbundesamt.at Content-Type: text/plain; charset="utf-8"
Hi all,
I admit I'm glad that this topic does seem to be back in discussion. I've been worried about LSIDs from the outset, but did not have the time or resources at the time of decision to do anything about it. Most of this discussion reflects what we've been discussing here in Vienna ever since the topic came up. Here is an excerpt from a recent mail of mine:
- I have never been a proponent of LSIDs. More to the point, I have been against their adoption from the outset. The reasons for this are:
o It's misusing a technical solution as an answer to a social problem. Although LSIDs entail a list of (quite necessary) requirements, such as persistent IDs and dependable availability of online references, the technology can in no way guarantee them; it just nicely covers the problem up.
o I do not see the technology being supported. IBM dropped it, and Cambridge Semantics Inc. also seems to have gone other ways.
o An example of the lack of dependability of LSID servers seems to me to be the eternal problem with the TDWG LSID Server.
o I'm worried that a group such as TDWG, which doesn't have the backing to push through technology development, is moving towards requiring all adopters to implement non-mainstream technology in order to maintain compatibility.
We've come to the conclusion, as mentioned several times in this thread, that what we really need is the commitment to persistence, and no technology will support us in that. Why waste nonexistent funds sorting out an esoteric technology nobody's supporting? Why not just buy a domain, pass a hat and set up a trust fund with 1000€ (or $), and agree to have this domain available through some institution (e.g. a university) for the next 100 years? After that, my non-existent great-grandchildren can sort out the rest!
@Matt: http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-03-05.html is online again! And a short absence/down-time will happen in all distributed technologies. If anything, I believe that we should worry more about intelligent caching and harvesting mechanisms!
:)
kathi
On Apr 7, 2009, at 1:55 AM, Donald.Hobern@csiro.au wrote:
> Assume further that ANIC has a script on its servers which can return the RDF data for these specimens, say at http://www.csiro.au/anic/specimens/<catalogueNumber>. The registration process could result in the LSID urn:lsid:tdwg.org:csiro.anic:12345
Wouldn't that say, according to your proposed usage guideline, that tdwg.org is whoGeneratedTheData and csiro.anic is whatCollectionItBelongsTo, when in reality CSIRO generated the data and ANIC is the collection it belongs to?
I understand why you're suggesting the LSID formatted as you do, and you might say that the name-mangling isn't too drastic. But don't data owners have a strong sense of ownership in their data objects and in their collections? And more importantly, don't you think that a usage guideline that contradicts itself (or that is bound to be internally inconsistent) will continue to raise debate and be in the way of broader adoption?
> and the HTTP URI http://lsid.tdwg.org/urn:lsid:tdwg.org:csiro.anic:12345 both being mapped through to http://www.csiro.au/anic/specimens/12345.
Wouldn't http://purl.tdwg.org/CSIRO/ANIC/12345 be shorter, do more justice to the names of whoGeneratedTheData and whatCollectionItBelongsTo, be easier to implement, and have the same possibilities to implement caching etc., in fact using standard software such as mod_proxy for Apache?
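To make the comparison concrete, here is a rough Python sketch in which both URI styles discussed above are resolved through the same registry to the same provider URL. The registry structure, the function names and the case-folding of the PURL path are assumptions made for illustration only; neither proposal specifies them.

```python
# Hypothetical registry: (provider, collection) -> URL pattern for RDF data.
# The single entry reuses the ANIC example from the thread.
REGISTRY = {
    ("csiro", "anic"): "http://www.csiro.au/anic/specimens/{id}",
}

PURL_PREFIX = "http://purl.tdwg.org/"
LSID_PROXY_PREFIX = "http://lsid.tdwg.org/urn:lsid:tdwg.org:"


def resolve_purl(url: str) -> str:
    """Map http://purl.tdwg.org/<PROVIDER>/<COLLECTION>/<id> to the provider URL."""
    if not url.startswith(PURL_PREFIX):
        raise ValueError(f"not a PURL-style identifier: {url!r}")
    parts = url[len(PURL_PREFIX):].split("/")
    if len(parts) != 3:
        raise ValueError(f"expected provider/collection/id in: {url!r}")
    provider, collection, item = parts
    # Assumption: provider and collection names are matched case-insensitively.
    pattern = REGISTRY[(provider.lower(), collection.lower())]
    return pattern.format(id=item)


def resolve_lsid_proxy(url: str) -> str:
    """Map http://lsid.tdwg.org/urn:lsid:tdwg.org:<provider>.<collection>:<id>."""
    if not url.startswith(LSID_PROXY_PREFIX):
        raise ValueError(f"not a proxied tdwg.org LSID: {url!r}")
    namespace, _, item = url[len(LSID_PROXY_PREFIX):].partition(":")
    provider, _, collection = namespace.partition(".")
    if not (provider and collection and item):
        raise ValueError(f"malformed LSID in: {url!r}")
    return REGISTRY[(provider, collection)].format(id=item)
```

Either function could sit behind a simple HTTP redirect. The difference between the two proposals lies in what the identifier looks like and what it signals, not in how hard the lookup is.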
Just some thoughts.
-hilmar
Thanks, Hilmar.
I agree that using tdwg.org as the authority for the LSID is less than ideal - hence my recommendation later that we should consider instead using e.g. csiro.tdwg.org (and I don't think it should be tdwg.org - perhaps something more neutral like csiro.bio-id.org). My concern there was the proliferation of SRV records if we support the LSID protocol.
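The SRV concern can be sketched as follows. LSID authority discovery looks up a DNS SRV record under the "_lsid._tcp" label of the authority name, so one authority per provider means one record per provider. The helper names and the second provider name below are hypothetical, and the sketch only builds the query names; it performs no DNS lookups.

```python
from typing import Iterable, List


def srv_query_name(authority: str) -> str:
    """DNS name an LSID client would query to locate the authority's resolver."""
    return f"_lsid._tcp.{authority.lower()}"


def srv_names_needed(providers: Iterable[str], base: str = "tdwg.org",
                     per_provider: bool = True) -> List[str]:
    """List the SRV names a central host would have to maintain.

    per_provider=True  -> authorities like csiro.tdwg.org (one record each)
    per_provider=False -> a single shared authority (one record in total)
    """
    if per_provider:
        return sorted(srv_query_name(f"{p}.{base}") for p in providers)
    return [srv_query_name(base)]


# "example-museum" is a made-up provider used only to show the growth in records.
print(srv_names_needed(["csiro", "example-museum"]))
print(srv_names_needed(["csiro", "example-museum"], per_provider=False))
```

The per-provider layout keeps relocation of a single provider's services easy, at the cost of a record count that grows with the number of registered providers.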
You are also correct that the big issue with this is the question of ownership. Quite frankly, if we had believed in 2006 that institutions would be prepared to cede responsibility for handling their identifiers to a third party, the recommendations from the TDWG workshops would probably have been rather different. Part of the reason for adopting LSIDs was because institutions did not seem to want to use an identifier which might imply that a third-party was responsible for the data.
The PURL form would have some benefits and would be a perfectly consistent alternative. I seem to be the only person who wants to avoid an outright capitulation to using HTTP URIs to identify objects in our domain. However, in case anyone cares, here again are my reasons why I prefer HTTP-wrapped non-HTTP identifiers over plain HTTP URIs:
1. The "urn:lsid:" part of the identifier serves as a clear statement of intent which is not present with an HTTP URI. We could mandate that ONLY http://purl.tdwg.org/ URIs count as GUIDs in our domain and that e.g. http://www.csiro.au/ URIs cannot do so, but that seems an arrogant and arbitrary rule. However, if we simply encourage everyone to use PURL URIs from any domain, what separates such a URI from any HTTP URL with no planned persistence? I see this as a short cut to casual assignment of volatile identifiers based on web application structures and hence to rapid identifier rot.
2. I still feel intense discomfort (pace the W3C) over adopting identifiers prefixed HTTP:// for objects such as type specimens which have had an important place in the literature for decades and which can expect still to be referenced in 50 years' time. Even though the HTTP protocol feels like the air we breathe right now, it seems certain to be superseded at some point. Do we want to use identifiers which will seem totally "retro" in the future? The usual objection is that HTTP is certain to outlast the LSID protocol. I agree fully, but the urn: prefix is making a statement about naming, not about technology.
If I am alone in these feelings, the suggested PURL route may be simpler, but we should consider what can be done to maximise the robustness of their use.
Best wishes,
Donald
Just hosting SRV records or supplying a redirect service does not actually provide any persistence at all to the data/metadata. A GUID that persistently resolves to a 500 error rather than a "not found" is not helpful.
I have said in the past "If persistence is important to you then keep your own copy." This is how it has worked for hundreds of years in the library community. If the reason for having a centralised resolution mechanism is to try to support persistence, then the centralised service should actually cache metadata (not data). I would imagine a scalable infrastructure would be quite simple to implement. Data could be stored in a Lucene index or Hadoop cluster or something. It would only be a very large hash table and only keep the latest version of the RDF.
Without some kind of persistence mechanism, the only advantage of LSIDs is that they *look* like they are supposed to be persistent. Unfortunately, because many people are using UUIDs as their object identifiers, LSIDs actually look like something you wouldn't want to look at, let alone expose to a user! CoL actually hide them because they look like this:
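A minimal Python sketch of the "very large hash table" idea follows, assuming an in-memory dictionary in place of Lucene or Hadoop. The class and function names are hypothetical, and falling back to the cached copy when the provider is unreachable is one possible policy chosen for illustration, not something specified above.

```python
from typing import Callable, Dict, Optional


class LatestMetadataCache:
    """Keeps only the most recent RDF metadata seen for each GUID."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def put(self, guid: str, rdf: str) -> None:
        # Overwrites any earlier version: only the latest RDF is kept.
        self._store[guid] = rdf

    def get(self, guid: str) -> Optional[str]:
        return self._store.get(guid)


def resolve(guid: str, cache: LatestMetadataCache,
            fetch: Callable[[str], str]) -> Optional[str]:
    """Return live metadata if the provider answers, else the cached copy.

    `fetch` stands in for whatever retrieves RDF from the provider; it is
    expected to raise OSError (e.g. ConnectionError) when the provider is down.
    """
    try:
        rdf = fetch(guid)
    except OSError:
        return cache.get(guid)  # may be None if the GUID was never seen
    cache.put(guid, rdf)
    return rdf
```

In this arrangement a provider outage degrades to slightly stale metadata instead of a 500 error, which is the kind of persistence a bare redirect service cannot offer.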
urn:lsid:catalogueoflife.org:taxon:d755ba3e-29c1-102b-9a4a-00304854f820:ac2009
No normal person is going to read this or type it in. I am afraid that when people started using UUIDs in LSIDs it blew the sociological argument for LSIDs out of the water for me. I had carefully designed BCI identifiers to be human readable and writable like this:
urn:lsid:biocol.org:col:15670
That would work as a footnote in a paper, but the only way a UUID can work in that context is if it is hyperlinked, and to be hyperlinked it will have to be an HTTP URL underneath. This raises the question of why we are displaying a non-human-readable string as the human-readable part of a hyperlink! So we hide the LSID completely and have no sociological advantage.
I understand why people used UUIDs. There are good technical reasons especially in distributed systems.
If LSIDs are a brand then they need a "unique selling proposition" and that implies something behind them beyond what can be had for free from other brands. You must use LSIDs because.... "We recommend them" is not an adequate answer.
Another point that worries me is that all discussion of LSIDs is about how to publish them, not how to consume them. LSIDs are better than HTTP URIs for the client because... (I still can't answer this question.)
Currently the reason for me tagging my data with GUIDs has to be that it enables users to access and exploit my data in cost-effective ways they couldn't before, whilst crediting me with producing it so that I can attract funding to my organisation to curate and collect more data.
The reason for clients using GUIDs is that it enables them to mix and match data in ways they couldn't before so as to produce more, higher quality scientific publications and so attract funding and kudos.
These are the selling points for GUIDs. How well do LSIDs enable them?
To summarise this overlong post: we have to have a service that adds *real value* (of the order of magnitude that CrossRef adds to DOIs) to LSID usage. Without this we are better off sticking with today's standard web technologies.
Sorry for so many words. I don't have time to write less today.
Roger
Thanks, Roger.
Certainly SRV records don't help with the actual persistence, although they may help with relocatability. The real issue underlying your point on persistence is whether or not we are interested in offering our data for integration, e.g. using semantic web technologies. If so, we need to address the underlying social issues. I agree that a central caching system would be the right way to go to make this all efficient and stable - like GenBank. However, our community has always been dubious about such centralisation. Ultimately the issue is a cost-benefit question about the costs of integration against the real applications to which the data are put.
Donald
Centralization is a double edged sword ... but one which can work if the centralization (with redundancy) is 'community owned and managed' ... and perhaps the decentralized GBIF model can work to our advantage here with perhaps 5 or 6 nodes taking responsibility for the 'centralized' entity we all appear to want but are too frightened to agree to ... ;-)
My two penn'orth
Paul
-----Original Message----- From: tdwg-tag-bounces@lists.tdwg.org [mailto:tdwg-tag-bounces@lists.tdwg.org] On Behalf Of Donald.Hobern@csiro.au Sent: 07 April 2009 10:33 To: rogerhyam@mac.com Cc: tdwg-tag@lists.tdwg.org Subject: Re: [tdwg-tag] SourceForge LSID project websites broken - role for TDWG?
Thanks, Roger.
Certainly SRV records don't help with the actual persistence, although they may help with relocatability. The real issue underlying your point on persistence is whether or not we are interested in offering our data for integration, e.g. using semantic web technologies. If so, we need to address the underlying social issues. I agree that a central caching system would be the right way to go to make this all efficient and stable - like GenBank. However our community has always been dubious about such centralisation. Ultimately the issue is a cost-benefit question about the costs of integration against the real applications to which the data are put.
Donald
Donald Hobern, Director, Atlas of Living Australia CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601 Phone: (02) 62464352 Mobile: 0437990208 Email: Donald.Hobern@csiro.au Web: http://www.ala.org.au/
-----Original Message----- From: Roger Hyam [mailto:rogerhyam@mac.com] Sent: Tuesday, 7 April 2009 6:39 PM To: Hobern, Donald (Entomology, Black Mountain) Cc: hlapp@duke.edu; tdwg-tag@lists.tdwg.org Subject: Re: [tdwg-tag] SourceForge LSID project websites broken - role for TDWG?
Just hosting SRV records or supplying a redirect service does not actually provide any persistence at all to the data/metadata. A GUID that persistently resolves to a 500 error rather than a "not found" is not helpful.
I have said in the past "If persistence is important to you then keep your own copy." This is how it has worked for 100s of years in the library community. If the reason for having a centralised resolution mechanism is to try and support persistence then the centralised service should actually cache metadata (not data). I would imagine a scalable infrastructure would be quite simple to implement. Data could be stored in a Lucene index or Hadoop cluster or something. It would only be a very large hash table and only keep the latest version of the RDF.
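The "very large hash table" that keeps only the latest RDF can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a design: the class name, the `fetch` callback and the in-memory dictionary are all hypothetical, and a real service would sit on whatever store (Lucene, Hadoop or something else) the community chose.

```python
from typing import Callable, Dict, Optional


class MetadataCache:
    """Keeps only the most recent RDF document seen for each GUID."""

    def __init__(self) -> None:
        self._latest: Dict[str, str] = {}

    def resolve(self, guid: str, fetch: Callable[[str], str]) -> Optional[str]:
        """Ask the provider first; fall back to the cached copy on failure.

        `fetch` is a hypothetical callback that returns the provider's
        current RDF for the GUID, or raises if the provider is unreachable.
        """
        try:
            rdf = fetch(guid)
        except Exception:
            # Provider offline: serve the last copy we saw, if any.
            return self._latest.get(guid)
        self._latest[guid] = rdf  # overwrite, so only the latest version is kept
        return rdf
```

The point of the sketch is that persistence comes from the stored copy, not from the redirect: once a provider disappears, only GUIDs that were cached beforehand can still return metadata.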
Without some kind of persistence mechanism the only advantage of LSIDs is that they *look* like they are supposed to be persistent. Unfortunately, because many people are using UUIDs as their object identifiers, LSIDs actually look like something you wouldn't want to look at, let alone expose to a user! CoL actually hide them because they look like this:
urn:lsid:catalogueoflife.org:taxon:d755ba3e-29c1-102b-9a4a-00304854f820:ac2009
No normal person is going to read this or type it in. I am afraid that when people started using UUIDs in LSIDs it blew the sociological argument for LSIDs out of the water for me. I had carefully designed BCI identifiers to be human readable and writable like this:
urn:lsid:biocol.org:col:15670
That would work as a footnote in a paper, but the only way a UUID can work in that context is if it is hyperlinked, and to be hyperlinked it will have to be an HTTP URL underneath. That raises the question of why we are displaying a non-human-readable string as the human-readable part of a hyperlink! So we hide the LSID completely and have no sociological advantage.
I understand why people used UUIDs. There are good technical reasons especially in distributed systems.
If LSIDs are a brand then they need a "unique selling proposition" and that implies something behind them beyond what can be had for free from other brands. You must use LSIDs because.... "We recommend them" is not an adequate answer.
Another point that worries me is that all discussion of LSIDs is about how to publish them not how to consume them. LSIDs are better than HTTP URIs for the client because... (I still can't answer this question)
Currently the reason for me tagging my data with GUIDs has to be because it enables users to access and exploit my data in cost effective ways they couldn't before whilst crediting me with producing it so that I can attract funding to my organisation to curate and collect more data.
The reason for clients using GUIDs is that it enables them to mix and match data in ways they couldn't before so as to produce more, higher quality scientific publications and so attract funding and kudos.
These are the selling points for GUIDs. How well do LSIDs enable them?
To summarise this overlong post: we have to have a service that adds *real value* (of the order of magnitude that CrossRef adds to DOIs) to LSID usage. Without this we are better off sticking with today's standard web technologies.
Sorry for so many words. I don't have time to write less today.
Roger
On 7 Apr 2009, at 08:17, Donald.Hobern@csiro.au wrote:
Thanks, Hilmar.
I agree that using tdwg.org as the authority for the LSID is less than ideal - hence my recommendation later that we should consider instead using e.g. csiro.tdwg.org (and I don't think it should be tdwg.org - perhaps something more neutral like csiro.bio-id.org). My concern there was the proliferation of SRV records if we support the LSID protocol.
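The concern about proliferating SRV records follows from how LSID authorities are located: as far as I understand the LSID resolution specification, a client looks up a DNS SRV record named `_lsid._tcp.<authority>`, so every per-institution authority needs its own record. The Python sketch below only formats such records in the usual zone-file layout; the resolver host name is a placeholder and the default port, priority, weight and TTL are arbitrary assumptions.

```python
def srv_owner_name(authority: str) -> str:
    """DNS name a client would query to find the resolver for an authority."""
    return f"_lsid._tcp.{authority}."


def srv_record(authority: str, target: str, port: int = 80,
               priority: int = 1, weight: int = 0, ttl: int = 3600) -> str:
    """One zone-file line: owner, TTL, class, type, priority, weight, port, target."""
    return (f"{srv_owner_name(authority)} {ttl} IN SRV "
            f"{priority} {weight} {port} {target}.")


def records_for(authorities, target: str):
    """One SRV record per authority, all pointing at the same shared resolver."""
    return [srv_record(authority, target) for authority in authorities]
```

For example, `records_for(["csiro.bio-id.org", "kew.bio-id.org"], "resolver.example.org")` yields two separate records (the second authority is invented for the example), which is the growth in DNS administration that a single shared authority would avoid.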
You are also correct that the big issue with this is the question of ownership. Quite frankly, if we had believed in 2006 that institutions would be prepared to cede responsibility for handling their identifiers to a third party, the recommendations from the TDWG workshops would probably have been rather different. Part of the reason for adopting LSIDs was because institutions did not seem to want to use an identifier which might imply that a third-party was responsible for the data.
The PURL form would have some benefits and would be a perfectly consistent alternative. I seem to be the only person who wants to avoid an outright capitulation to using HTTP URIs to identify objects in our domain. However, in case anyone cares, here again are my reasons why I prefer HTTP-wrapped non-HTTP identifiers over plain HTTP URIs:
- The "urn:lsid:" part of the identifier serves as a clear statement of intent which is not present with an HTTP URI. We could mandate that ONLY http://purl.tdwg.org/ URIs count as GUIDs in our domain and that e.g. http://www.csiro.au/ URIs cannot do so, but that seems an arrogant and arbitrary rule. However, if we simply encourage everyone to use PURL URIs from any domain, what separates such a URI from any HTTP URL with no planned persistence? I see this as a short cut to casual assignment of volatile identifiers based on web application structures and hence to rapid identifier rot.
- I still feel intense discomfort (pace the W3C) over adopting identifiers prefixed HTTP:// for objects such as type specimens which have had an important place in the literature for decades and which can expect still to be referenced in 50 years time. Even though the HTTP protocol feels like the air we breathe right now, it seems certain to be superseded at some point. Do we want to use identifiers which will seem totally "retro" in the future? The usual objection is that HTTP is certain to outlast the LSID protocol. I agree fully, but the urn: prefix is making a statement about naming, not about technology.
If I am alone in these feelings, the suggested PURL route may be simpler, but we should consider what can be done to maximise the robustness of their use.
Best wishes,
Donald
Donald Hobern, Director, Atlas of Living Australia CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601 Phone: (02) 62464352 Mobile: 0437990208 Email: Donald.Hobern@csiro.au Web: http://www.ala.org.au/
A few random comments:
Donald wrote:
InstitutionCode/CollectionCode/CatalogueNumber triple and to the three main substitutable elements in an LSID. Some systems such as DOI may obscure the whoGeneratedTheData
This assumes that it's good to have lots of metadata embedded in the identifier. This level of "branding" might be fine for specimens (assuming each data provider has the ability to serve their own data), but what about shared identifiers such as taxon names -- I suspect having to "choose a brand" is going to be an obstacle to adoption for just the identifiers that we most need to share. Identifiers such as DOIs have less branding (although publishers have managed to attach branding significant to the few digits after the "10." prefix).
Note also that DOIs (and Handles) can be queried for metadata, see Tony Hammond's OpenHandle project (http://www.crossref.org/CrossTech/2008/10/the_last_mile.html and http://code.google.com/p/openhandle/), so we don't need to embed this in the actual identifier itself.
- The "urn:lsid:" part of the identifier serves as a clear statement of intent which is not present with an HTTP URI. We could mandate that ONLY http://purl.tdwg.org/ URIs count as GUIDs in our domain and that e.g. http://www.csiro.au/ URIs cannot do
Yes, but intent matters little unless backed up by actual services.
You are also correct that the big issue with this is the question of ownership. Quite frankly, if we had believed in 2006 that institutions would be prepared to cede responsibility for handling their identifiers to a third party, the recommendations from the TDWG workshops would probably have been rather different. Part of the reason for adopting LSIDs was because institutions did not seem to want to use an identifier which might imply that a third-party was responsible for the data.
If commercial rivals usually at each other's throats (e.g., publishers) can get over these issues and form CrossRef, surely biodiversity providers can get over this issue. That we can't suggests that the field hasn't bought into the idea of global identifiers and sharing yet.
Roger wrote:
I have said in the past "If persistence is important to you then keep your own copy." This is how it has worked for 100s of years in the library community. If the reason for having a centralised resolution mechanism is to try and support persistence then the centralised service should actually cache metadata (not data). I would imagine a scalable infrastructure would be quite simple to implement. Data could be stored in a Lucene index or Hadoop cluster or something. It would only be a very large hash table and only keep the latest version of the RDF.
This sounds a lot like CrossRef to me. Cache the metadata and provide services on top. Deja vu all over again.
No normal person is going to read this or type it in. I am afraid that when people started using UUIDs in LSIDs it blew the sociological argument for LSIDs out of the water for me. I had carefully designed BCI identifiers to be human readable and writable like this
Yep. Plus the irony of having a globally unique identifier (the UUID) as part of another globally unique identifier (LSID), which is then part of another identifier (the HTTP proxied version of the LSID). We're not making things easy for ourselves.
So, for the sake of a straw man, why don't we:
1. Use DOIs/Handles, assigned by a central agency
2. Provide a central set of services running on top of these identifiers, modelled upon CrossRef but specific to our data types. Among the services are an HTTP proxy that supports 303 redirects (a la linked data)
3. The central service monitors data availability and has a "league table" of performance (or some related measure of data quality). It has a central cache to ensure data consumers are minimally affected if a provider goes offline.
If we are wedded to HTTP then LSIDs don't make much sense. If we have concerns about HTTP-based identifiers, then why not use a system that has already proved itself (DOI/Handle)? Surely we need a better argument than the "Concorde fallacy" that we've invested so much effort in LSIDs so far it's too late to stop...
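As a rough illustration of points 2 and 3 of the straw man, the Python sketch below shows the decision a central proxy might make for each request: answer with a 303 redirect to the provider's current location, or to a central cache copy when the provider is known to be offline. Everything named here is hypothetical, including the identifier with the made-up prefix 10.9999 and the cache base URL; how the set of offline providers is maintained is left out.

```python
from typing import Dict, FrozenSet, Tuple
from urllib.parse import quote


def proxy_response(identifier: str,
                   locations: Dict[str, str],
                   offline: FrozenSet[str] = frozenset(),
                   cache_base: str = "http://cache.example.org/") -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status, headers) for a request to resolve `identifier`.

    `locations` maps identifiers to the provider's current URL.
    `offline` holds identifiers whose provider is currently unreachable.
    """
    target = locations.get(identifier)
    if target is None:
        return 404, {}                      # unknown identifier
    if identifier in offline:
        # Provider is down: send the client to the central cached copy instead.
        target = cache_base + quote(identifier, safe="")
    return 303, {"Location": target}        # 303 See Other, linked-data style
```

The same table of locations that drives the redirects is what a monitoring job would walk to build the "league table" of provider availability.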
Regards
Rod
--------------------------------------------------------- Roderic Page Professor of Taxonomy DEEB, FBLS Graham Kerr Building University of Glasgow Glasgow G12 8QQ, UK
Email: r.page@bio.gla.ac.uk Tel: +44 141 330 4778 Fax: +44 141 330 2792 AIM: rodpage1962@aim.com Facebook: http://www.facebook.com/profile.php?id=1112517192 Twitter: http://twitter.com/rdmpage Blog: http://iphylo.blogspot.com Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
A few non random comments on Rod's random comments on Donald's proposal
On Tue, Apr 7, 2009 at 11:19 AM, Roderic Page r.page@bio.gla.ac.uk wrote:
A few random comments:
Donald wrote:
InstitutionCode/CollectionCode/CatalogueNumber triple and to the three main substitutable elements in an LSID. Some systems such as DOI may obscure the whoGeneratedTheData
Rod responded:
This assumes that it's good to have lots of metadata embedded in the identifier. This level of "branding" might be fine for specimens (assuming each data provider has the ability to serve their own data), but what about shared identifiers such as taxon names -- I suspect having to "choose a brand" is going to be an obstacle to adoption for just the identifiers that we most need to share. Identifiers such as DOIs have less branding (although publishers have managed to attach branding significant to the few digits after the "10." prefix).
Bob cites: "LSIDs are intended to be semantically opaque, in that the LSID assigned to a resource should not be counted on to describe the characteristics or attributes of the resource that the LSID refers to. The users of the LSIDs are permitted to use individual components (as specified elsewhere in this document) of LSIDs - although the LSID component parts themselves should be treated as opaque pieces of the identifier." LSID spec, Section 8.
It's regrettable that the LSID spec is so poorly written that it permits the useless term "should". Alas, I suppose that leaves room for argument with my position that LSIDs with embedded metadata are not LSIDs--they are something else based on the LSID syntax. There's nothing inherently wrong with, oh, say, a Handles implementation based on prefacing LSID syntax with something controlled. See below.
Rod remarks:
Note also that DOIs (and Handles) can be queried for metadata, see Tony Hammond's OpenHandle project (http://www.crossref.org/CrossTech/2008/10/the_last_mile.html and http://code.google.com/p/openhandle/), so we don't need to embed this in the actual identifier itself.
Bob replies: DOIs \are/ Handles. This is the (unstated?) reason that http://wiki.tdwg.org/twiki/bin/view/GUID/TechnologyComparison is filled with comparisons of the form "DOI: Same as Handles".
DOI is an implementation of Handles, with the additional treatment of things about which Handles is silent. See http://www.doi.org/factsheets/DOIHandle.html. When I read that document casually, I come to the initial conclusion that Donald's proposal is essentially doing the same kind of extension to Handles (possibly a Good Thing if correct), except for allowing metadata in the identifier (yech!).
--Bob
There seem to be so many strong viewpoints, expectations for social compliance as well as institutional commitment, and disagreements on the best suited technical avenues that I'm wondering whether this wouldn't be the stuff out of which a position paper could be distilled for e-Biosphere, with the aim of publishing it as a community paper in a journal later.
Maybe something along the lines of Hilse & Kothe, Implementing Persistent Identifiers (http://www.knaw.nl/ecpa/publ/pdf/2732.pdf), but focused on a comparison between a few alternatives for biodiversity informatics and their implementation and institutional consequences?
Maybe that would help to better move the community forward in between those discussions?
Or is someone writing that already?
-hilmar
The Hilse & Kothe document to which Hilmar refers is a great overview of technologies, issues and the steps that any institution should take if they wish to use persistent identifiers for data.
Aside from the documents produced from the two TDWG GUID workshops, and the LSID Applicability Statement currently under review, I don't think we have any such document under development for our community.
Personally I would like to see this developed as a package consisting of several elements. The first of these should include the kind of general material found in the Hilse & Kothe document to give context to the whole idea of persistent identification and the processes and policies necessary to make it work. The rest should be applicability statements for LSIDs and for other identifier technologies which might be adopted for biodiversity data.
One factor which has been somewhat ignored in all this discussion is the form which metadata should take when associated with different identifier types. The LSID specification is relatively clear in this regard - requiring RDF metadata. I believe that the main reason we are all interested in globally-unique identifiers (and in the persistent and resolvable flavours of such identifiers) is that we hope to simplify the task of cross-institutional data integration. Our primary goal is to facilitate some level of machine interoperability around our data. This means that we do need to mandate appropriate standard practices for associating machine-readable metadata with any identifier, whether it is an LSID, a PURL, a DOI or something else. If we adopt a less challenging form of GUID, but do not increase the number of providers offering interoperable metadata through such identifiers, we will not have gained anything.
In other words, the most important thing of all is to make some real progress with the TDWG ontology work.
Donald
Donald Hobern, Director, Atlas of Living Australia CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601 Phone: (02) 62464352 Mobile: 0437990208 Email: Donald.Hobern@csiro.au Web: http://www.ala.org.au/
A few comments on semantic opacity...
1. My examples ("urn:lsid:csiro.tdwg.org:anic:12345") were deliberately transparent (or at least translucent) to make it easier to follow the example, but I would have no real problem with them having a form more like "urn:lsid:bio-id.org:9876:12345".
2. I think a single-minded drive towards semantic opacity would be as quixotic and self-destructive as anything we could do. UUIDs are nicely opaque, and we could build a DOI-like system which maps individual UUIDs to their current locations. Such an approach would be painful and an administrative nightmare. I also suspect that such opaque identifiers would be resisted by most users. If we step away from such a pure implementation, the alternatives all embed some kind of semantic cues which make the system operate better. The form of a DOI encodes relevant data on the source of the object. PURLs and LSIDs do the same. The point with semantic opacity in the LSID specification is that it is not possible for a client to make inferences about the location of data based on the subelements within the LSID. It is up to the resolver implementations to determine how to return the data. Once this point is accepted, I would in fact say that the presence of some semantic clues within the identifier text is a good thing. The clues may for various reasons no longer conform to the reality of how the metadata are managed, but a user may still rapidly glean relevant indications whether an identifier is worth resolving (it may indicate that it relates to a nomenclatural record, or that it, at least originally, was minted by some respected source). I see such clues as having the same kind of value which has enabled Linnaean nomenclature to persist so long. My preference for LSIDs would therefore be for them to be like the ones Roger minted for BCI.
3. I also note that this discussion has suggested remarkable near-unanimity from many people in their distaste for LSIDs. However I fear that the level of agreement would be little higher if we were discussing DOIs, or PURLs. Some of the objections have been that LSIDs do not fit well with the key technologies of the semantic web and that something more like PURLs would be the right course to follow. Other objections have related to the semantic near-transparency of many LSIDs or the absence of strongly centralised support with the implication that something more like DOIs would be better. Both arguments have value, but they point in different directions. The various identifier schemes make up a landscape within which no identifier scheme represents an adaptive peak in all contexts. We need to develop applicability statements for how to use several of these schemes as alternatives for biodiversity data and we need to identify the drivers which may guide different providers to different schemes for different purposes.
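The reading of semantic opacity in point 2 (a client may look at the parts of an LSID, but only the resolver knows where the data live) can be sketched as a lookup table in Python. The registry contents and function name are hypothetical; the opaque LSID and the CSIRO URL are the ones used as examples in this thread, and the "new-host" URL is a placeholder.

```python
from typing import Dict, Tuple

# Hypothetical registry held by the resolver, keyed by (authority, namespace).
# The client never sees or guesses these URL templates.
Registry = Dict[Tuple[str, str], str]


def resolve(lsid: str, registry: Registry) -> str:
    """Map an LSID to its current data location using the registry alone."""
    parts = lsid.split(":")
    if len(parts) < 5 or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid!r}")
    authority, namespace, object_id = parts[2:5]
    template = registry[(authority, namespace)]   # KeyError if unregistered
    return template.format(object_id=object_id)


registry: Registry = {
    ("bio-id.org", "9876"): "http://www.csiro.au/anic/specimens/{object_id}",
}
```

If the collection moves, only the registry entry changes, e.g. `registry[("bio-id.org", "9876")] = "http://new-host.example.org/specimens/{object_id}"`, and `urn:lsid:bio-id.org:9876:12345` keeps working. A more readable identifier such as `urn:lsid:csiro.tdwg.org:anic:12345` would be resolved in exactly the same way; the readable parts are hints for people, not routing information.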
Donald
Donald Hobern, Director, Atlas of Living Australia CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601 Phone: (02) 62464352 Mobile: 0437990208 Email: Donald.Hobern@csiro.au Web: http://www.ala.org.au/
General comments on decision making...
This is a technical discussion list. We are clever techie people. Given a challenge of X resources and Y requirements we can come up with a preferred list of solutions and probably implement one of them with our eyes closed.
To use a fine English idiom "you cut your cloth to suit your purse". We don't have a shared purse so arguments about how to cut our cloth will never be resolved. X is undefined and I am not sure Y is that well defined.
I know this is "chicken and egg" in that we need to come up with requirements to request a purse, but we really need to have some indication that someone will be willing to commit long term resources to the common good before we can present a menu of choices for what money could be spent on.
1.0 developers/year (in perpetuity) gets us a server or two managed to support some kind of DNS based SRV hosting or redirect services or Handle system, with support for some library development and help desk stuff. (Note I am not talking servers or meetings or reports or technology; I am talking commitment to pay people to have it as their responsibility to maintain the system both socially and technically - for the long term!!!).
Without the indication that someone (a consortium perhaps) is likely to formally commit to a minimum of this level of resources, we are wasting our time talking about resolution mechanisms that are not DNS based, i.e. anything other than variations on the PURL model.
If we don't have the money to build a walled garden we have to graze on the common with everyone else.
Roger
(BTW: I am not totally convinced that building a walled garden is the way forward but would happily come and graze in it if some benefactor would fund its perpetual maintenance).
On 8 Apr 2009, at 01:40, Donald.Hobern@csiro.au wrote:
A few comments on semantic opacity...
- My examples ("urn:lsid:csiro.tdwg.org:anic:12345") were
deliberately transparent (or at least translucent) to make it easier to follow the example, but I would have no real problem with them having a form more like "urn:lsid:bio-id.org:9876:12345".
- I think a single-minded drive towards semantic opacity would be
as quixotic and self-destructive as anything we could do. UUIDs are nicely opaque, and we could build a DOI-like system which maps individual UUIDs to their current locations. Such an approach would be painful and an administrative nightmare. I also suspect that such opaque identifiers would be resisted by most users. If we step away from such a pure implementation, the alternatives all embed some kind of semantic cues which make the system operate better. The form of a DOI encodes relevant data on the source of the object. PURLs and LSIDs do the same. The point with semantic opacity in the LSID specification is that it is not possible for a client to make inferences about the location of data based on the subelements within the LSID. It is up to the resolver implementations to determine how to return the data. Once this point is accepted, I would in fact say that the presence of some semantic clues within the identifier text is a good thing. The clues may for various reasons no longer conform to the reality of how the metadata are managed, but a user may still rapidly glean relevant indications whether an identifier is worth resolving (it may indicate that it relates to a nomenclatural record, or that it, at least originally, was minted by some respected source). I see such clues as having the same kind of value which has enabled Linnaean nomenclature to persist so long. My preference for LSIDs would therefore be for them to be like the ones Roger minted for BCI.
- I also note that this discussion has suggested remarkable near-
unanimity from many people in their distaste for LSIDs. However I fear that the level of agreement would be little higher if we were discussing DOIs, or PURLs. Some of the objections have been that LSIDs do not fit well with the key technologies of the semantic web and that something more like PURLs would be the right course to follow. Other objections have related to the semantic near- transparency of many LSIDs or the absence of strongly centralised support with the implication that something more like DOIs would be better. Both arguments have value, but they point in different directions. The various identifier schemes make up a landscape within which no identifier scheme represents an adaptive peak in all contexts. We need to develop applicability statements for how to use several of these schemes as alternatives for biodiversity data and we need to identify the drivers which may guide different providers to different schemes for different purposes.
Donald
Donald Hobern, Director, Atlas of Living Australia CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601 Phone: (02) 62464352 Mobile: 0437990208 Email: Donald.Hobern@csiro.au Web: http://www.ala.org.au/
-----Original Message-----
From: Bob Morris [mailto:morris.bob@gmail.com]
Sent: Wednesday, 8 April 2009 2:04 AM
To: Roderic Page
Cc: Hobern, Donald (Entomology, Black Mountain); Roger Hyam; tdwg-tag@lists.tdwg.org
Subject: Re: [tdwg-tag] SourceForge LSID project websites broken - role for TDWG?
A few non random comments on Rod's random comments on Donald's proposal
On Tue, Apr 7, 2009 at 11:19 AM, Roderic Page r.page@bio.gla.ac.uk wrote:
A few random comments:
Donald wrote:
InstitutionCode/CollectionCode/CatalogueNumber triple and to the three main substitutable elements in an LSID. Some systems such as DOI may obscure the whoGeneratedTheData
Rod responded:
This assumes that it's good to have lots of metadata embedded in the identifier. This level of "branding" might be fine for specimens (assuming each data provider has the ability to serve their own data), but what about shared identifiers such as taxon names -- I suspect having to "choose a brand" is going to be an obstacle to adoption for just the identifiers that we most need to share. Identifiers such as DOIs have less branding (although publishers have managed to attach branding significance to the few digits after the "10." prefix).
Bob cites: "LSIDs are intended to be semantically opaque, in that the LSID assigned to a resource should not be counted on to describe the characteristics or attributes of the resource that the LSID refers to. The users of the LSIDs are permitted to use individual components (as specified elsewhere in this document) of LSIDs - although the LSID component parts themselves should be treated as opaque pieces of the identifier." LSID spec, Section 8.
It's regrettable that the LSID spec is so poorly written that it permits the useless term "should". Alas, I suppose that leaves room for argument with my position that LSIDs with embedded metadata are not LSIDs--they are something else based on the LSID syntax. There's nothing inherently wrong with, oh, say, a Handles implementation based on prefacing LSID syntax with something controlled. See below.
Rod remarks:
Note also that DOIs (and Handles) can be queried for metadata, see Tony Hammond's OpenHandle project (http://www.crossref.org/CrossTech/2008/10/the_last_mile.html and http://code.google.com/p/openhandle/), so we don't need to embed this in the actual identifier itself.
Bob replies: DOIs \are/ Handles. This is the (unstated?) reason that http://wiki.tdwg.org/twiki/bin/view/GUID/TechnologyComparison is filled with comparisons of the form "DOI: Same as Handles".
DOI is an implementation of Handles, with the additional treatment of things about which Handles is silent (see http://www.doi.org/factsheets/DOIHandle.html). When I read that document casually, I come to the initial conclusion that Donald's proposal is essentially doing the same kind of extension to Handles (possibly a Good Thing if correct), except for allowing metadata in the identifier (yech!).
--Bob
-- Robert A. Morris Professor of Computer Science UMASS-Boston ram@cs.umb.edu http://bdei.cs.umb.edu/ http://www.cs.umb.edu/~ram http://www.cs.umb.edu/~ram/calendar.html phone (+1)617 287 6466 _______________________________________________ tdwg-tag mailing list tdwg-tag@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tag
I resonate with Roger :-).
More than anything else, these functions in our community need an economic model. Possibilities include an inspired foundation to endow it or hitchhiking with a bigger community with the long-term support already planned and working. Giving up some technical independence and compromising on short-term technical objectives should not get in the way of getting something going, one small step at a time.
It is interesting to see where long-term vision comes from at a global community level. We have organizations big and small, national and international who are naturally preoccupied with supporting their own requirements, as we all are.
But where are the biodiversity heroes with resources?
_____________________________ James H. Beach Biodiversity Institute University of Kansas 1345 Jayhawk Boulevard Lawrence, KS 66045, USA T 785 864-4645, F 785 864-5335
________________________________
From: tdwg-tag-bounces@lists.tdwg.org on behalf of Roger Hyam
Sent: Wed 4/8/2009 4:57 AM
To: Donald.Hobern@csiro.au
Cc: morris.bob@gmail.com; tdwg-tag@lists.tdwg.org
Subject: Re: [tdwg-tag] SourceForge LSID project websites broken - role for TDWG?
General comments on decision making...
This is a technical discussion list. We are clever techie people. Given a challenge of X resources and Y requirements we can come up with a preferred list of solutions and probably implement one of them with our eyes closed.
To use a fine English idiom "you cut your cloth to suit your purse". We don't have a shared purse so arguments about how to cut our cloth will never be resolved. X is undefined and I am not sure Y is that well defined.
I know this is "chicken and egg" in that we need to come up with requirements to request a purse, but we really need to have some indication that someone will be willing to commit long-term resources to the common good before we can present a menu of choices for what money could be spent on.
1.0 developers/year (in perpetuity) gets us a server or two managed to support some kind of DNS-based SRV hosting or redirect services or Handle system, with support for some library development and help desk stuff. (Note I am not talking about servers or meetings or reports or technology; I am talking about a commitment to pay people to have it as their responsibility to maintain the system both socially and technically - for the long term!!!).
Without the indication that someone (a consortium perhaps) is likely to formally commit to a minimum of this level of resources, we are wasting our time talking about any resolution mechanisms other than DNS-based ones (i.e. variations on the PURL model).
If we don't have the money to build a walled garden we have to graze on the common with everyone else.
Roger
(BTW: I am not totally convinced that building a walled garden is the way forward but would happily come and graze in it if some benefactor would fund its perpetual maintenance).
One of the core elements of the data domain we work in is the names we apply to life on this planet we occupy with nM other species. Two big international projects working in this domain are GBIF and EoL. Both are investing in a Global Names Architecture (probably to the order of $50k to date). At the core of this (pilot implementation) are UUIDs. If these two projects (and by implication those who would use the data and services they provide ... IUCN, FAO, WHO, the two 'invasives' projects [alien invasives are one of the biggest threats to loss of biodiversity - I've heard people say] etc.) want the GNA to work, is it not in their interest to implement the most appropriate GUID technology to tie the names (a very very very small but important part of the data domain) to the rest of the biodiversity data they serve/manage/mobilize?
I guess we have taken the small step and got something going - we decided to go with LSIDs. I'm not sure why the TDWG LSID resolver keeps breaking (I'm not that much of a techie but I do know that if something isn't broken don't try to fix it). As to other GUIDs ... didn't we reject DOIs because of the 'real'**[see below] cost - or can I get the 500k DOIs I need for Index Fungorum for no real cost, the 1.5M I need for the British Fungi database I look after in my spare time for no real cost, or can 'we' get the nnnM DOIs we need to assign to our natural history collections as they are digitized, at no real cost ... can the DOI supporters answer this? And if the use of DOIs is not without real cost, why are we still discussing them?
Another two penn'orth, and time to get back to real work ... ;-)
Paul
** real cost is when you have to reach into your pocket and part with cash, unreal cost (a.k.a. hidden cost) is what I'm doing now ... CABI is paying my salary but I'm not really doing what CABI pays me to do ... ;-)
Donald is right that the discussion points in contradictory directions:
1. Some people believe we need a non-http system. This points to DOI or handles.
a) Certainly DOI is successful in the publishing industry - not as technology but as business model. If we want DOIs, we can get them as cheaply as possible from http://www.tib-hannover.de/en/the-tib/doi-registration-agency/ - see mails of 2008-11-25 to 2008-11-26 on this list.
b) Or we can ride piggy-back on the DOI = handles technology. Create a BOI-handle system using the same software. But as Roger points out, this is a business model and needs a business model. It is not a technology that solves any problems technologically.
2. Technologically I have not seen any argument why -- with respect to desiring a RESOLVABLE entity -- an EoL / CoL / GNA / GBIF / TDWG / whatever urn:lsid:persistent-identifier.tdwg.org:anic:12345 is in any way better than http://persistent-identifier.tdwg.org/anic/12345. (Yes, in theory, LSIDs allow for resolving without using DNS, but in practice DNS is the method, so http://persistent-identifier.tdwg.org/ is the central resolving point. And I believe you can probably create - in theory again - a non-DNS-based resolution for parts of http; the only complication is that this would require an enumerated list of prefixes, whereas changing the resolution of urn:lsid can use a single prefix to hook in.)
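To make the comparison in point 2 concrete, here is a minimal Python sketch of the mapping between the two forms. The function names are hypothetical, the sketch assumes the three-part authority/namespace/object layout used in the example above, and it deliberately does not handle the optional LSID revision part.

```python
from urllib.parse import urlsplit


def lsid_to_http(lsid: str) -> str:
    """Rewrite urn:lsid:<authority>:<namespace>:<object> as an http URL.

    Assumption: the authority is a DNS name that also hosts an http resolver.
    """
    parts = lsid.split(":")
    if len(parts) != 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not a three-part LSID: {lsid!r}")
    authority, namespace, obj = parts[2:]
    if not (authority and namespace and obj):
        raise ValueError(f"empty LSID component in {lsid!r}")
    return f"http://{authority}/{namespace}/{obj}"


def http_to_lsid(url: str) -> str:
    """Inverse mapping: http://<authority>/<namespace>/<object> to an LSID."""
    split = urlsplit(url)
    segments = [s for s in split.path.split("/") if s]
    if split.scheme != "http" or not split.netloc or len(segments) != 2:
        raise ValueError(f"not of the form http://authority/namespace/object: {url!r}")
    return f"urn:lsid:{split.netloc}:{segments[0]}:{segments[1]}"


if __name__ == "__main__":
    print(lsid_to_http("urn:lsid:persistent-identifier.tdwg.org:anic:12345"))
    print(http_to_lsid("http://persistent-identifier.tdwg.org/anic/12345"))
```

The round trip loses nothing, which is the point being argued: for resolution purposes both strings carry the same three pieces of information.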
http is a core web technology. It is working now and it will continue, and after that there will be a very wide upgrade path. It is the thing standard software like CMS, wikis, etc. work with.
I maintain my point that, this being the technical list, we place too much focus on interaction between large-scale technologies, and ignore the cost of explaining to 99.999999% of biologists that they should ignore the LSIDs they see on their screen and use the lsid-through-http method instead. And then, surprise, what people really need to use is a CENTRAL and HTTP-based resolver. And, surprise, this will show up as millions of http links in publications, be it PDF, CMS, wiki, whatever.
The only thing with http-URLs that I see is wrong is that they are NORMALLY not properly managed as persistent URLs (link-rot). To create a multilateral system that fits humans, the semantic web, and management practices alike, we could provide a cooking recipe for data providers:
* create a http://persistent-identifier.yourorganisation.tld domain
* make sure it resolves the objects you want to publish there and that you are prepared to manage the stability of the service over a long time-period.
* be prepared that by prefixing your domain with "persistent-identifier" you make a promise in the name of "yourorganisation". Others will monitor how well that promise is kept.
* be prepared to re-assign the domain to a central gbif/etc provider should your organisation no longer be prepared to maintain the service.
* set up a content negotiation, so that humans see html, machines see rdf when they resolve (detailed recipe here...)
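As an illustration of the last step only, a minimal Python sketch of the content-negotiation decision might look like this. It is not the detailed recipe itself: the function names are hypothetical, and the choice of application/rdf+xml as the machine format and html as the fallback are assumptions.

```python
from typing import List, Optional, Tuple

HTML = "text/html"
RDF = "application/rdf+xml"  # assumption: RDF/XML is the machine-readable format


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    prefs = []
    for item in header.split(","):
        fields = [f.strip() for f in item.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when not given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep their header order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def choose_representation(accept_header: Optional[str]) -> str:
    """Pick RDF for clients that ask for it, html for everyone else."""
    for media_type, q in parse_accept(accept_header or ""):
        if q <= 0:
            continue
        if media_type == RDF:
            return RDF
        if media_type in (HTML, "application/xhtml+xml", "text/*", "*/*"):
            return HTML
    return HTML  # humans (and unknown clients) see html


if __name__ == "__main__":
    print(choose_representation("application/rdf+xml"))
    print(choose_representation("text/html,application/xhtml+xml,*/*;q=0.8"))
```

A resolver at the persistent-identifier domain would call something like choose_representation on each request and then either serve the chosen format directly or redirect to a format-specific URL; which of the two is preferable is left open here.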
----
If you want the benefits of branding, central resolution and business model, use a handle technology -- if you want a multilateral technology, use http?
Gregor
PURL := Technology and likely beyond the reach of many data providers. LSID is probably the easy way out because they do not need to be resolved to be useful. An unresolvable PURL is no PURL at all. Either way depends on infrastructure.
Right now I am thinking, why not have both? Different clients, different requirements, different solutions. We do this all the time. Is there a reason why we should not alias PURL and LSID (or DOI) as identifiers for the one resource and join in both games? PURLs from providers already committed to LSID and using UUID local identifiers may end up looking a little strange if not repetitive but there would be little change in overhead for most providers. For most of us GUIDs are for external consumption anyway.
For the infrastructure, which may be quite lightweight, though with a heavy social component, I have gleaned a list of candidate requirements from the thread:
. GUIDs are URIs
. Objects may have more than one GUID (URI aliases)
. GUIDs are resolvable, at least for dereferenceable URIs.
. GUIDs resolve to RDF using standard vocabularies
. Data Provider resolves GUID or delegates authority for this responsibility.
. Authorities use sub-domains to simplify transfer of delegation
. GUIDs := <uri class>[:/+]<authority>[:/]<namespace>[:/]<uniqueLocalIdentifier>
. GUID classes may be delegated independently
. Delegate/Provider and GUID assignment must be registered
. Delegates guarantee to serve provider metadata intact
. Derivative objects must reference source URI
. Derivative objects must not present as aliases.
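The GUID pattern in that list can be sketched as a regular expression. This is only one reading of it: I take "[:/+]" to mean one or more ":" or "/" characters, the set of URI classes (urn:lsid, http, https) is an assumption, and the names GUID_PATTERN, parse_guid and are_aliases are hypothetical.

```python
import re
from typing import Dict, Optional

# <uri class>[:/+]<authority>[:/]<namespace>[:/]<uniqueLocalIdentifier>
GUID_PATTERN = re.compile(
    r"^(?P<uri_class>urn:lsid|https?)"  # assumed list of URI classes
    r"[:/]+"                            # ":" for LSIDs, "://" for http
    r"(?P<authority>[^:/]+)"
    r"[:/]"
    r"(?P<namespace>[^:/]+)"
    r"[:/]"
    r"(?P<local_id>.+)$"
)


def parse_guid(guid: str) -> Optional[Dict[str, str]]:
    """Split a GUID into its four parts, or return None if it does not fit."""
    match = GUID_PATTERN.match(guid)
    return match.groupdict() if match else None


def are_aliases(guid_a: str, guid_b: str) -> bool:
    """True if two GUIDs differ at most in URI class (e.g. LSID vs. http)."""
    a, b = parse_guid(guid_a), parse_guid(guid_b)
    if a is None or b is None:
        return False
    keys = ("authority", "namespace", "local_id")
    return all(a[k] == b[k] for k in keys)


if __name__ == "__main__":
    print(parse_guid("urn:lsid:csiro.tdwg.org:anic:12345"))
    print(are_aliases(
        "urn:lsid:persistent-identifier.tdwg.org:anic:12345",
        "http://persistent-identifier.tdwg.org/anic/12345",
    ))
```

Under this reading, an LSID and an http URL that share authority, namespace and local identifier are simply two aliases for the one resource, which fits the "why not have both" suggestion.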
I imagine existing aggregators stepping in to take on delegation of resolution services and LSID proxies where required. Will they object?
greg
appropriate English idioms: Cut the cloth ... silk purse ... casting purls ...
On 8 Apr 2009, at 01:40, Donald.Hobern@csiro.au wrote:
A few comments on semantic opacity...
- My examples ("urn:lsid:csiro.tdwg.org:anic:12345") were deliberately transparent (or at least translucent) to make it easier to follow the example, but I would have no real problem with them having a form more like "urn:lsid:bio-id.org:9876:12345".
- I think a single-minded drive towards semantic opacity would be as quixotic and self-destructive as anything we could do. UUIDs are nicely opaque, and we could build a DOI-like system which maps individual UUIDs to their current locations. Such an approach would be painful and an administrative nightmare. I also suspect that such opaque identifiers would be resisted by most users. If we step away from such a pure implementation, the alternatives all embed some kind of semantic cues which make the system operate better. The form of a DOI encodes relevant data on the source of the object. PURLs and LSIDs do the same. The point with semantic opacity in the LSID specification is that it is not possible for a client to make inferences about the location of data based on the subelements within the LSID. It is up to the resolver implementations to determine how to return the data. Once this point is accepted, I would in fact say that the presence of some semantic clues within the identifier text is a good thing. The clues may for various reasons no longer conform to the reality of how the metadata are managed, but a user may still rapidly glean relevant indications whether an identifier is worth resolving (it may indicate that it relates to a nomenclatural record, or that it, at least originally, was minted by some respected source). I see such clues as having the same kind of value which has enabled Linnaean nomenclature to persist so long. My preference for LSIDs would therefore be for them to be like the ones Roger minted for BCI.
- I also note that this discussion has suggested remarkable near-unanimity from many people in their distaste for LSIDs. However I fear that the level of agreement would be little higher if we were discussing DOIs, or PURLs. Some of the objections have been that LSIDs do not fit well with the key technologies of the semantic web and that something more like PURLs would be the right course to follow. Other objections have related to the semantic near-transparency of many LSIDs or the absence of strongly centralised support with the implication that something more like DOIs would be better. Both arguments have value, but they point in different directions. The various identifier schemes make up a landscape within which no identifier scheme represents an adaptive peak in all contexts. We need to develop applicability statements for how to use several of these schemes as alternatives for biodiversity data and we need to identify the drivers which may guide different providers to different schemes for different purposes.
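The opacity point in the second comment above can be made concrete with a minimal Python sketch. It assumes the usual "urn:lsid:authority:namespace:object[:revision]" layout. The RESOLVERS table and its example.org endpoints are hypothetical stand-ins for however a client actually discovers a resolver service. The client splits the LSID only to find out whom to ask, then hands over the whole identifier unchanged, so the transparent and the opaque example are treated identically.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split an LSID into its parts without interpreting them."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    return Lsid(*parts[2:])


# Hypothetical lookup table (an assumption for illustration only): it stands in
# for whatever mechanism tells a client which service answers for an authority.
RESOLVERS = {
    "csiro.tdwg.org": "https://resolver-a.example.org/lsid",
    "bio-id.org": "https://resolver-b.example.org/lsid",
}


def metadata_request(text: str):
    """Return (service, identifier) for a metadata lookup.

    The namespace and object parts are never used to guess a location;
    the resolver service decides how to find the data.
    """
    lsid = parse_lsid(text)
    service = RESOLVERS[lsid.authority.lower()]
    return service, text


if __name__ == "__main__":
    for example in ("urn:lsid:csiro.tdwg.org:anic:12345",
                    "urn:lsid:bio-id.org:9876:12345"):
        print(parse_lsid(example), metadata_request(example))
```

A human reader may still take "anic" as a useful cue. The code simply declines to rely on it.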
Donald
Donald Hobern, Director, Atlas of Living Australia CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601 Phone: (02) 62464352 Mobile: 0437990208 Email: Donald.Hobern@csiro.au Web: http://www.ala.org.au/
-----Original Message-----
From: Bob Morris [mailto:morris.bob@gmail.com]
Sent: Wednesday, 8 April 2009 2:04 AM
To: Roderic Page
Cc: Hobern, Donald (Entomology, Black Mountain); Roger Hyam; tdwg-tag@lists.tdwg.org
Subject: Re: [tdwg-tag] SourceForge LSID project websites broken - role for TDWG?
A few non random comments on Rod's random comments on Donald's proposal
On Tue, Apr 7, 2009 at 11:19 AM, Roderic Page r.page@bio.gla.ac.uk wrote:
A few random comments:
Donald wrote:
InstitutionCode/CollectionCode/CatalogueNumber triple and to the three main substitutable elements in an LSID. Some systems such as DOI may obscure the whoGeneratedTheData
Rod responded:
This assumes that it's good to have lots of metadata embedded in the identifier. This level of "branding" might be fine for specimens (assuming each data provider has the ability to serve their own data), but what about shared identifiers such as taxon names -- I suspect having to "choose a brand" is going to be an obstacle to adoption for just the identifiers that we most need to share. Identifiers such as DOIs have less branding (although publishers have managed to attach branding significance to the few digits after the "10." prefix).
Bob cites: "LSIDs are intended to be semantically opaque, in that the LSID assigned to a resource should not be counted on to describe the characteristics or attributes of the resource that the LSID refers to. The users of the LSIDs are permitted to use individual components (as specified elsewhere in this document) of LSIDs - although the LSID component parts themselves should be treated as opaque pieces of the identifier." LSID spec, Section 8.
It's regrettable that the LSID spec is so poorly written that it permits the useless term "should". Alas, I suppose that leaves room for argument with my position that LSIDs with embedded metadata are not LSIDs--they are something else based on the LSID syntax. There's nothing inherently wrong with, oh, say, a Handles implementation based on prefacing LSID syntax with something controlled. See below.
Rod remarks:
Note also that DOIs (and Handles) can be queried for metadata, see Tony Hammond's OpenHandle project (http://www.crossref.org/CrossTech/2008/10/the_last_mile.html and http://code.google.com/p/openhandle/), so we don't need to embed this in the actual identifier itself.
Bob replies: DOIs \are/ Handles. This is the (unstated?) reason that http://wiki.tdwg.org/twiki/bin/view/GUID/TechnologyComparison is filled with comparisons of the form "DOI: Same as Handles"
DOI is an implementation of Handles, with the additional treatment of things about which Handles is silent (see http://www.doi.org/factsheets/DOIHandle.html). When I read that document casually, I come to the initial conclusion that Donald's proposal is essentially doing the same kind of extension to Handles (possibly a Good Thing if correct), except for allowing metadata in the identifier (yech!).
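To illustrate "DOIs are Handles", here is a minimal Python sketch. It assumes only the prefix/suffix form shared by both and the two public proxy hosts, hdl.handle.net and doi.org. The DOI used is a made-up placeholder and the helper names are hypothetical. The "10." prefix test is the same "few digits" branding Rod mentions.

```python
def split_handle(handle: str):
    """Split a Handle (or DOI) into (prefix, suffix) at the first '/'."""
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix handle: {handle!r}")
    return prefix, suffix


def is_doi(handle: str) -> bool:
    """A DOI is a Handle whose prefix starts with '10.'."""
    prefix, _ = split_handle(handle)
    return prefix.startswith("10.")


def proxy_urls(handle: str):
    """List proxy URLs for a handle; DOIs additionally get the doi.org proxy.

    No percent-encoding is applied here, which a real client would need
    for suffixes containing reserved characters.
    """
    urls = ["https://hdl.handle.net/" + handle]
    if is_doi(handle):
        urls.append("https://doi.org/" + handle)
    return urls


if __name__ == "__main__":
    placeholder = "10.1234/example.5678"  # made-up DOI, for illustration only
    print(split_handle(placeholder), proxy_urls(placeholder))
```

The identifier carries no descriptive metadata beyond the registrant prefix. Anything richer has to come from querying the resolution system.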
--Bob
-- Robert A. Morris Professor of Computer Science UMASS-Boston ram@cs.umb.edu http://bdei.cs.umb.edu/ http://www.cs.umb.edu/~ram http://www.cs.umb.edu/~ram/calendar.html phone (+1)617 287 6466 _______________________________________________ tdwg-tag mailing list tdwg-tag@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tag
Really like this analogy Roger. Guess it is ancestral, but I had not heard it before - will henceforth use it shamelessly.
It opens some interesting philosophical questions for TDWG (and other community projects). For example, even if we had the funds to build a walled garden (gated community? sheltered workshop?) I am prepared to entertain argument that we should not do it, investing rather in improving and defending the commons, fending off the looming tragedy thereof.
TDWG is (or is supposed to be) about 'standards' and I have always taken this to be standards "by all, for all", but increasingly we get drawn into implementation and application, in effect walled instances of the standards intended for the commons. This is unavoidable as the standards have to be ground-truthed and reality-tested, but the boundary between a standard and its implementation is becoming grey and oftentimes I get this uneasy feeling that the commons is getting walled off by tribes of Google wannabes with their perfect hammer for our imperfect nail.
(Also like, "we are clever techie people". Clever techie people brought us thalidomide, Microsoft, the Titanic, Hiroshima, the leaf blower, silly putty, climate change and computer driven derivative trading...)
jim - deciding whether he should go to work or lie down in front of the GUI/LSID/DOI bulldozers... :)
On Wed, Apr 8, 2009 at 7:57 PM, Roger Hyam rogerhyam@mac.com wrote:
If we don't have the money to build a walled garden we have to graze on the common with everyone else.
participants (10)
- Beach, James H
- Bob Morris
- Donald.Hobern@csiro.au
- greg whitbread
- Gregor Hagedorn
- Hilmar Lapp
- Jim Croft
- Paul Kirk
- Roderic Page
- Roger Hyam