[tdwg-tag] SourceForge LSID project websites broken - role for TDWG?

Wed Apr 1 09:20:27 CEST 2009

On 31 Mar 2009, at 23:59, <Donald.Hobern at csiro.au> wrote:

>
> Although the model I outlined does move in the direction of the DOI  
> model, it does not oblige any data provider to use a centrally  
> administered service (i.e. a "trusted party") unless they wish to do  
> so.  The autonomy of data providers to make their own decisions and  
> to be the primary (or sole) point of contact for a data record was a  
> significant factor in the TDWG GUID discussions a few years ago.   
> Views on this point may have changed sufficiently that it is no  
> longer an issue.  Personally I would still prefer a scheme with  
> identifiers which are obviously identifiers even when met out of  
> context, but the doi: prefix should cover that for DOIs.

I think that this has been the Achilles heel of biodiversity  
informatics. Having data providers be the primary (or sole) point of  
contact leads naturally to having a distributed identifier system  
(like LSIDs). However, providers are not good at keeping their data  
online (http://bioguid.info/status), and LSIDs are non-trivial to  
install.  Some organisations, such as Index of Organism Names, haven't
made them properly resolvable (i.e., they've skipped the DNS SRV
record bit), and some organisations seem incapable of keeping LSIDs
working (e.g., the Catalogue of Life:
http://darwin.zoology.gla.ac.uk/~rpage/lsid/tester/?q=urn%3Alsid%3Acatalogueoflife.org%3Ataxon%3Aef0ae064-29c1-102b-9a4a-00304854f820%3Aac2009 ).
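
To make "the DNS SRV record bit" concrete, here is a rough sketch in
Python of the first step a client has to take: split an LSID into its
parts and work out which DNS name to ask for. The function and class
names are my own invention for illustration, and the "_lsid._tcp"
service label is what I understand the LSID specification to use, so
check it against the spec before relying on it. No network lookup is
done here.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    """The parts of urn:lsid:authority:namespace:object[:revision]."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> Lsid:
    """Split an LSID string into its components, or raise ValueError."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, object_id, revision)


def srv_query_name(lsid: Lsid) -> str:
    """DNS name a client would query for an SRV record (assumed label)."""
    return f"_lsid._tcp.{lsid.authority}"


if __name__ == "__main__":
    example = parse_lsid(
        "urn:lsid:catalogueoflife.org:taxon:"
        "ef0ae064-29c1-102b-9a4a-00304854f820:ac2009"
    )
    print(example)
    print(srv_query_name(example))
```

If the provider never publishes that SRV record, a standard client has
nowhere to go after this step, which is the failure mode described
above.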

Of course, even DOIs break, but there are tools to report these, and  
social mechanisms to deal with repeat offenders. I'm not wishing to
advocate any identifier, but I think it is worth asking: if we go
down a centrally managed route, at what point do we end up
replicating CrossRef, and at what point would it make sense to avoid
that duplication of effort?

>
>
> I agree that the real question about the model I outlined is whether  
> there remain enough reasons to justify using LSIDs rather than  
> PURLs.  In many ways what I have described turns out to be an  
> attempt to replicate the benefits of PURLs, while holding on to some  
> of the other perceived benefits of LSIDs.  Those benefits would be:
>
> 1. An identifier format which does not depend on the persistence of  
> any particular protocol, HTTP included.
> 2. An identifier format which is clearly recognisable in all  
> contexts (even when prepended with a resolver URL).
> 3. An identifier scheme which requires data providers to consider  
> issues of persistence when issuing identifiers.
>
> The third of these remains my key practical concern about PURLs as  
> identifiers within our community.  Use of standard PURL  
> infrastructure such as http://purl.oclc.org/ certainly does require  
> data providers to recognise the distinction between name and  
> location, and supplies the tools to handle future changes in the  
> layout of the provider's web site.  However, unless we mandate the  
> use of a limited number of community-owned domains for our PURLs  
> (e.g. http://purl.oclc.org/ and http://purl.tdwg.org/), we have no  
> good a priori way to identify which URLs are PURLs and which URLs  
> are simply today's web locations for a document.  I therefore worry  
> that we will just open the door from the very beginning for a sloppy  
> approach to selecting identifiers.

This is a good point. This leaves us DOIs, Handles, and LSIDs. Given  
that none of these play well with the Linked Data 303 model, we would  
need to maintain an HTTP resolver that had this functionality (I have one  
for DOIs, e.g. http://bioguid.info/doi:10.1016/j.ympev.2006.06.014 ),  
or rely on the Linked Data community to maintain such tools. So,  
unless we go directly to URLs, we will need to support some sort of  
resolution mechanism if we want to play ball with the Semantic Web  
crowd.
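
As a rough illustration of what such a resolver has to do for the 303
model, here is a minimal sketch in Python. It is not how bioguid.info
is implemented. The "/data/" and "/page/" document paths, the list of
prefixes, and the function name are all assumptions made up for the
example, and the Accept handling is deliberately naive (no q-values).

```python
from typing import Dict, Tuple

# Identifier schemes this hypothetical resolver accepts.
KNOWN_PREFIXES = ("doi:", "hdl:", "urn:lsid:")


def see_other(path: str, accept: str = "") -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a request to /<identifier>.

    The identifier URI names the thing itself, so we never answer 200
    here. Instead we send a 303 to a document about the thing: RDF for
    clients that ask for it, an HTML page for everyone else.
    """
    identifier = path.lstrip("/")
    if not identifier.lower().startswith(KNOWN_PREFIXES):
        return 404, {}
    wants_rdf = "application/rdf+xml" in accept.lower()
    target = "/data/" if wants_rdf else "/page/"
    return 303, {"Location": target + identifier, "Vary": "Accept"}


if __name__ == "__main__":
    print(see_other("/doi:10.1016/j.ympev.2006.06.014",
                    "application/rdf+xml"))
    print(see_other("/doi:10.1016/j.ympev.2006.06.014", "text/html"))
```

The point is only that somebody has to run and maintain this layer
for every non-HTTP identifier scheme we adopt.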

Part of me thinks that if we were serious about all this, we would  
drop the "roll your own" approach, bite the bullet, and adopt DOIs. We  
are data publishers, after all.

Having said that, I agree with Kevin that LSIDs forced us to think  
about RDF, which was a good thing (and we probably wouldn't have if  
we'd adopted another GUID).

The other issue is that LSIDs are now in the published literature,  
so some poor bastards are going to have to support them for some  
time to come...

Rod

>
>
> The real battle in getting a persistent identifier scheme to work is  
> in the social support for the system, including institutional  
> commitment to think about the nature of persistent identifiers and  
> to factor the costs of maintaining such identifiers into their  
> planning for web infrastructure.  Many of these costs should be the  
> same for well-managed PURLs and for LSIDs.  Right now we are making  
> it much more complex than it need be for data providers seamlessly  
> to use LSIDs.  The relative costs of the two approaches could be  
> more or less equal.  However we need to decide whether the perceived  
> benefits I outline above justify the effort of our putting this  
> extra work into LSIDs.
>
> Many thanks to all who are prepared still after all this time to  
> debate this matter...
>
> Donald
>
> Donald Hobern, Director, Atlas of Living Australia
> CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601
> Phone: (02) 62464352 Mobile: 0437990208
> Email: Donald.Hobern at csiro.au
> Web: http://www.ala.org.au/
>
>
> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp at duke.edu]
> Sent: Wednesday, 1 April 2009 6:13 AM
> To: Hobern, Donald (Entomology, Black Mountain)
> Cc: tdwg-tag at lists.tdwg.org
> Subject: Re: [tdwg-tag] SourceForge LSID project websites broken -  
> role for TDWG?
>
> I like the idea of the trusted party but:
>
> - aren't the accordingly formatted LSIDs substituting (and hence
> falsely pretending) an authority (lsid.tdwg.org) for records for which
> it is not, in fact, an authority? (I think one of the ill-designed
> things about LSID is that authority, which no-one should need to care
> about for identifying and resolving an object, was made part of the
> identifier in an explicit way.)
>
> - what would determine whether a party is trusted or not, and what
> would prevent a non-trusted party from doing the same? (I.e., LSIDs
> issued by a non-trusted party would still be resolvable through any
> LSID resolver and not recognizable as "un-trusted" as such, unlike for
> example if someone other than a DOI-accredited registry were to issue
> a DOI - it wouldn't resolve.)
>
> - how is the registration and redirection different on a fundamental
> level from the services that the OCLC provides with purl.org? (I.e.,
> this infrastructure exists already for HTTP URIs, and will in all
> likelihood be sustained longer and better than anything that would
> rely solely on the Biodiversity Informatics field for support.)
>
> - if what we really want is the stability that the DOI business model
> has shown to provide, then why shouldn't we use DOIs? (There is an
> accredited DOI registry now for primary and secondary scientific data.
> They promise to bring the price down into the cents/DOI range - that
> may still sound like a lot, but haven't we just said that the idea of
> a sustainable infrastructure for resolvable GUIDs costs money and
> requires commitment from everyone?)
>
> My $0.02 (some day worth 2 DOIs :)
>
>        -hilmar
>
> On Mar 30, 2009, at 10:38 PM, Donald.Hobern at csiro.au wrote:
>
>> Some thoughts on where I think we should be going with LSIDs...
>>
>> My take is still that having something that is a standard identifier
>> scheme is a major benefit and it is not a big stretch for us to
>> recognise that <ResolverURL><Identifier> should be considered the
>> same as <Identifier> for purposes of inferring identity.  I still
>> believe that it would be a total disaster for us simply to say that
>> we advocate the use of PURL-style identifiers, since this is a
>> slippery slope with no separation between a good identifier and a
>> web-server-du-jour URL with no planned persistence.  Vince Smith's
>> blog is a reminder that relying even on institutional domain names
>> is risky.
>>
>> I believe that we need to get behind providing solid infrastructure
>> for LSIDs.  This should be aimed at making it as easy as possible
>> for any provider to set up LSIDs without touching their own DNS
>> records unless they want to do so.  We should have a central service
>> which does the central registration part of what DOI.org does for
>> DOIs, BUT NOT AT THE LEVEL OF INDIVIDUAL LSIDs.  We are trying to
>> establish something like this here in Australia under the
>> taxonomy.org.au domain.  This could be done at a global, national or
>> community level.  What is needed is simply the following:
>>
>> 1. A trusted party (TDWG, GBIF, EOL, ALA, etc.) commits to handle
>> the DNS resolution side of LSIDs for any data providers wishing to
>> use its services.
>>
>> 2. Any provider can register a dataset with the trusted party and
>> will receive a corresponding LSID namespace to use in their
>> identifiers.  To register the dataset they need to be able to
>> provide a parameterised URL which takes a single parameter - either
>> the full LSID for the record, or just the final record-id part of
>> the LSID (this could be a configuration choice when registering the
>> data set) - and which returns the corresponding data record as RDF
>> (if we don't drive the use of structured data with GUIDs, we are not
>> solving anything, but there could be an option to return the records
>> to the trusted party in some simpler format and for the trusted
>> party to generate the RDF).
>>
>> 3. The trusted party registers itself in DNS as the resolver for
>> LSIDs for its domain and hosts a resolver implementation which
>> extracts the LSID namespace from LSIDs and forwards the request to
>> the appropriate data provider with either the record-id or the whole
>> LSID as a parameter.  The trusted party also hosts an HTTP LSID
>> proxy and prepends the proxy URL to all identifiers in RDF documents.
>>
>> A working example with TDWG as the trusted party and an NHM
>> Sphingidae data set as the data set to be shared.
>>
>> A. TDWG registers lsid.tdwg.org in DNS as an LSID service.
>>
>> B. NHM registers a TAPIR database (or whatever CGI interface they
>> like) for their Sphingidae database using the namespace
>> "nhm.sphingidae" and the endpoint
>> "http://nhm.ac.uk/tapir/sphingidae?op=s&...&darwin:GUID=%S"
>> (where %S is to be replaced with the actual request GUID).
>>
>> C. NHM populates its records with GUID values of the pattern
>> "urn:lsid:lsid.tdwg.org:nhm.sphingidae:<record-id>".
>>
>> D. A user follows a link http://lsid.tdwg.org/urn:lsid:lsid.tdwg.org:nhm.sphingidae:12345
>> and hits the LSID resolver.  The LSID resolver maps
>> "nhm.sphingidae" to the NHM endpoint and requests the record, which
>> it then returns to the user.
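
Steps B to D can be sketched in a few lines of Python. This is only
an illustration of the lookup described above, not anything TDWG
runs. The registry contents, the example.org endpoint (the real TAPIR
URL above is elided, so I have not tried to reproduce it) and the
function name are assumptions for the example. The boolean is the
"full LSID or just the record-id" configuration choice from point 2.

```python
from urllib.parse import quote

# Proxy prefix from step D; <ResolverURL><Identifier> is treated as
# the same thing as <Identifier>.
PROXY = "http://lsid.tdwg.org/"

# Hypothetical registry: namespace -> (URL template, send full LSID?).
REGISTRY = {
    "nhm.sphingidae": ("http://example.org/sphingidae?guid=%S", True),
}


def provider_url(identifier: str, registry=REGISTRY) -> str:
    """Map an LSID (bare or proxied) to the provider's endpoint URL.

    Raises ValueError for malformed input and KeyError for a namespace
    that has not been registered.
    """
    if identifier.startswith(PROXY):
        identifier = identifier[len(PROXY):]
    parts = identifier.split(":")
    if len(parts) < 5 or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {identifier!r}")
    namespace, record_id = parts[3], parts[4]
    template, send_full_lsid = registry[namespace]
    value = identifier if send_full_lsid else record_id
    # Percent-encode so colons in the LSID survive as a query value.
    return template.replace("%S", quote(value, safe=""))


if __name__ == "__main__":
    print(provider_url(
        "http://lsid.tdwg.org/urn:lsid:lsid.tdwg.org:nhm.sphingidae:12345"
    ))
```

The resolver would then fetch that URL and hand the RDF back to the
user. Everything provider-specific lives in one registry entry.
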
>>
>> This could be enhanced in many different ways to make it more robust
>> and flexible:
>>
>> 1. As mentioned, the trusted party could map other formats to RDF
>> (indeed it could have templates for embedding data in Darwin Core,
>> etc.).
>>
>> 2. The trusted party could automatically prepend LSIDs in response
>> data with references to its own proxy so that early 21st century WWW
>> technology works as expected.
>>
>> 3. The trusted party could add additional services around hosted
>> copies of the data and could manage a metadata record for the
>> resource.
>>
>> 4. The trusted party could in fact use DOIs for the namespace part
>> (in other words the NHM example would end up using something like
>> urn:lsid:lsid.tdwg.org:10.1000/987:12345 as the identifier).  If the
>> 10.1000/987 DOI served as a citable identifier for the dataset and
>> could be resolved to get the metadata for the dataset, it could be
>> elegant on several different levels.
>>
>> This is really all so easy.  As mentioned, taxonomy.org.au has been
>> going through the teething pains of doing this for an Australian
>> therevid data set held in Mandala in Illinois.  I would hope that we
>> could quickly roll this out as a service for any Australian data
>> providers and then try deploying a similar set-up with TDWG, GBIF or
>> EOL.
>>
>> At least that's what I think...
>>
>> One other basic point is that, if we abandon LSIDs and still want a
>> GUID solution with some promise that the data could be relocated, we
>> need a system which somewhere embeds the concepts of provider,
>> dataset and record and can use these to track down the record.  This
>> means we can't allow a total free-for-all on identifiers and need
>> either a robust heavy central registry of records like DOI or need
>> to have a standard place for these three elements in the GUID.  Once
>> we get that far, we may as well adopt LSIDs even if we choose (as
>> the major party using the model) to extend or even replace the
>> models for resolving them.
>>
>> Donald
>>
>>
>>
>> Donald Hobern, Director, Atlas of Living Australia
>> CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601
>> Phone: (02) 62464352 Mobile: 0437990208
>> Email: Donald.Hobern at csiro.au
>> Web: http://www.ala.org.au/
>>
>> -----Original Message-----
>> From: tdwg-tag-bounces at lists.tdwg.org
>> [mailto:tdwg-tag-bounces at lists.tdwg.org] On Behalf Of renato at cria.org.br
>> Sent: Tuesday, 31 March 2009 12:29 PM
>> To: tdwg-tag at lists.tdwg.org
>> Subject: Re: [tdwg-tag] SourceForge LSID project websites broken -
>> role for TDWG?
>>
>> I think I would prefer to see a different solution. Dropping LSIDs
>> altogether seems a bit drastic after all the work that was done. If
>> we had a perfect GUID technology, I would understand this kind of
>> decision, but we all know we don't have such a thing. On the other
>> hand, focusing exclusively on LSIDs could prevent some of our data
>> providers from serving and maintaining GUIDs. So why not just offer
>> an alternative?
>>
>> If clients will need to deal with different types of GUIDs anyway,
>> especially if they will have to interact with different types of
>> providers, the matter of having to agree on and to adopt a single
>> GUID technology becomes less important. We already live in a world
>> where different types of GUIDs are being provided.
>>
>> Personally I've always preferred PURLs for their simplicity and
>> compliance with existing tools and technologies, although I know
>> they have drawbacks. If some of our data providers can reliably
>> serve LSIDs - great. But if LSIDs are too complicated for other data
>> providers, I don't see any problem for our community to create an
>> additional applicability statement for another GUID technology. The
>> most important thing, in my opinion, is to agree on the data
>> models/vocabularies that our GUIDs will resolve to, no matter the
>> resolution mechanism used. But that's another story...
>>
>> Best Regards,
>> --
>> Renato
>>
>>
>>> Perhaps the question is whether LSIDs are a hurdle to adoption of
>>> the use of GUIDs or an aid to it.
>>>
>>> DOIs are not just a technology; they are a business model plus a
>>> technology (they use HANDLE for the technology). It is worth the
>>> client overcoming technical difficulties in their use because of the
>>> value added by the publisher paying for the associated
>>> infrastructure.  I would argue that DOIs/HANDLE are, in fact, a
>>> complete pain because they don't integrate well with semantic web
>>> technologies but that they are carried along purely by the business
>>> model.
>>>
>>> In advocating the use of LSIDs we are advocating the pain without
>>> the benefits. Just like DOIs they are awkward and non-standard to
>>> set up. They need to be constantly explained. They don't work in
>>> semantic web technologies. They don't even integrate with XML (could
>>> you host an XML Schema on an LSID?). All this would be OK if they
>>> had an associated business model - but they don't.
>>>
>>> My personal belief is that we should either put together a business
>>> model (with the financial backing of big projects and within the
>>> next few months) where some core services are provided by a third
>>> party or we should drop LSIDs altogether. Alas I fear the big
>>> projects are more interested in data volume and pretty pictures than
>>> doing good science and providing basic services (I am being
>>> contentious for emphasis so don't take it personally).
>>>
>>> From the technical perspective this:
>>>
>>> urn:lsid:zoobank.org:act:8BDC0735-FEA4-4298-83FA-D04F67C3FBEC
>>>
>>> is far harder than this:
>>>
>>> http://purl.zoobank.org/8BDC0735-FEA4-4298-83FA-D04F67C3FBEC
>>>
>>> so we need a good business case for doing the former. What is it?
>>>
>>> All the best,
>>>
>>> Roger
>>>
>>>
>>> On 23 Mar 2009, at 01:58, Kevin Richards wrote:
>>>
>>>> As convener of the GUID subgroup of TDWG TAG, I thought I should
>>>> add some comments.
>>>>
>>>> The debate over LSIDs, their suitability, technical issues, etc.,
>>>> has been going on for some years now in the TDWG community (and
>>>> also within a few other communities - especially the HCLS Health
>>>> Care and Life Science semantic web group).  Most issues have been
>>>> raised and dealt with, and as with most technologies, there is no
>>>> perfect solution for a GUID technology.  To review these
>>>> discussions see the TDWG pages at http://wiki.tdwg.org/GUID/ and
>>>> http://www.tdwg.org/activities/guid/documents/ . Documents that
>>>> cover an introduction to GUIDs/LSIDs, applicability statements,
>>>> and technical issues can be found there.
>>>>
>>>> I feel we are getting to a stage with LSIDs where a lot of people
>>>> in this community have had some sort of dealing with the
>>>> technology (whether it is setting up an LSID resolver, or
>>>> using/resolving them through client software) and we therefore
>>>> have a good range of experiences, knowledge and conclusions about
>>>> the use of LSIDs.  As part of the TDWG meeting in Montpellier this
>>>> year, we hope to hold a session for "LSIDs in Practice" which
>>>> should give us a good indication of any LSID issues, and how they
>>>> have been dealt with in practice.
>>>>
>>>> Also, there are several activities going on that should aid with
>>>> the adoption of LSIDs, such as development of software tools and
>>>> services, and as we speak the LSID web site is being transferred
>>>> to a TDWG server to be hosted there (it has been a bit of a
>>>> technical hurdle for some of us to get this web site moved, so you
>>>> may need to bear with us for a little while).
>>>>
>>>> Generally the technical issues of LSIDs are relatively minor.  The
>>>> more obvious issues (such as persistence - i.e. that an LSID will
>>>> be resolvable indefinitely, and that community support and
>>>> technological aids will always be available) tend to be
>>>> community/social issues.  What really makes the success of any
>>>> initiative is the community support and drive behind it, and the
>>>> same is true of whatever technologies we adopt in the TDWG
>>>> community.  The important thing therefore is that we start using
>>>> the GUIDs, linking them up with other GUIDs/data, distributing
>>>> them, promoting "authoritative" GUIDs, and then I really believe
>>>> any remaining issues will be easily overcome.
>>>>
>>>> Thanks
>>>> Kevin
>>
>>
>> _______________________________________________
>> tdwg-tag mailing list
>> tdwg-tag at lists.tdwg.org
>> http://lists.tdwg.org/mailman/listinfo/tdwg-tag
>
> --
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:- hlapp at duke dot edu :
> ===========================================================
>
>

---------------------------------------------------------
Roderic Page
Professor of Taxonomy
DEEB, FBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Email: r.page at bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
AIM: rodpage1962 at aim.com
Facebook: http://www.facebook.com/profile.php?id=1112517192
Twitter: http://twitter.com/rdmpage
Blog: http://iphylo.blogspot.com
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html