I'm with Tim on this one. Taking one of Rod's other posts ("LSIDs, disaster or opportunity", http://iphylo.blogspot.com/2009/04/lsids-disaster-or-opportunity.html) a bit further, I think coming up with a simple, extensible URL resolver would give us many benefits and allow LSIDs to carry extra, added information for all to use. Looking at his example, a URL would get permanent tracking that would also record referrers, location and traffic. A summary of the link could even be a page in itself: a cached version, a screenshot, or just a scrape of the code, pulling out the HTML tags, for future reference in case the real link goes down.

We could use the ability to create a customizable prefix (e.g. http://someresolvr.com/bhl/SDFoijF) to roughly follow DOI conventions, and could even save old DOIs or handles for historical purposes in a field attached to the new URL, or for reuse, making the new URL resolve to a current DOI with a simple suffix on the new URL (e.g. http://someresolvr.com/bhl/SDFoijF/DOI). In the same way we could use user input, plus data pulled about the URL semantically, to generate RDFa (by using pyRdfa, http://www.w3.org/2007/08/pyRdfa/), then expose that for all newly created URLs, and come up with a standard to make it predictable (e.g. http://someresolvr.com/bhl/SDFoijF/RDF).

The example at bit.ly shows the use of OpenCalais (http://opencalais.com/) to get more background information on the original link, but the resolver could also be pointed at other services we provide/use in biodiversity to give a snapshot of context/content across the board. Users of the service could log in to examine/add/edit the data by hand if desired, so they would still retain ultimate control over how their record is presented. Thus, from a simple URL, we could build a complete summary that builds on what we're given while sharing it all back out.
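To make the URL layout above concrete, here is a minimal sketch of how a resolver might split an incoming path into its parts. The prefix ("bhl"), the short identifier ("SDFoijF") and the format suffixes ("DOI", "RDF") are all taken from the examples above; everything else (function name, behaviour on bad input) is an assumption for illustration only.

```python
def parse_resolver_path(path):
    """Split a resolver path into (prefix, short_id, fmt).

    fmt is None for a plain redirect, or an upper-cased suffix
    such as "DOI" or "RDF" requesting an alternate representation.
    """
    parts = [p for p in path.strip("/").split("/") if p]
    if len(parts) == 2:
        prefix, short_id = parts
        return prefix, short_id, None
    if len(parts) == 3:
        prefix, short_id, fmt = parts
        return prefix, short_id, fmt.upper()
    raise ValueError("unrecognised resolver path: %r" % path)

# e.g. http://someresolvr.com/bhl/SDFoijF/DOI
print(parse_resolver_path("/bhl/SDFoijF/DOI"))
```

Keeping the suffix handling in one place like this is what would make the scheme predictable across all newly created URLs.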
Then the architecture (aka, the fun part) would be simple and distributed. A webserver able to process PHP, plus the database CouchDB (http://couchdb.apache.org/), would be all that is needed to run the resolver. CouchDB is schema-less, so the way it handles replication is very simple; it is built to be distributed, handing out only the bits that have changed during replication, and it scales in the same manner.

Having a batch of main servers behind a URL in a pooled setup (think of a simplified/smaller version of the pool of networked Unix time servers, http://www.pool.ntp.org/) would allow round-robin DNS, or a ucarp setup ("ucarp allows a couple of hosts to share common virtual IP addresses in order to provide automatic failover." http://www.ucarp.org/project/ucarp), so if one main server went down, another would automatically take over without the user needing to change the URL.

Plus, to battle heavy usage of the main servers, we could use the idea of primary and secondary servers as outlined in the pool.ntp.org model: an institution with heavy usage could become a secondary host and run its own resolver simply, with almost no maintenance. It would just need the PHP files, which would be a versioned project, plus a cron task to replicate the database from a pool of the main servers. The institution's resolver could be customized to appear as its own (e.g. http://someresolvr.bhl.org/bhl/SDFoijF) and, for simplicity, could be read-only. This way a link like http://someresolvr.com/bhl/SDFoijF could be resolved against any institution's server, such as http://someresolvr.bhl.org/bhl/SDFoijF or http://someresolvr.ebio.org/bhl/SDFoijF, as all of the databases would be the same, although maybe a day behind, depending on the replication schedule.
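The secondary host's cron task above maps directly onto CouchDB's real /_replicate HTTP endpoint, which takes a JSON body naming a source and a target. The host name, port and database name ("resolvr") below are assumptions for illustration; only the endpoint and body shape come from CouchDB itself.

```python
import json

def replication_body(source, target):
    """Build the JSON body a secondary host would POST to its local
    CouchDB at /_replicate to pull the latest changes from the pool."""
    return json.dumps({"source": source, "target": target})

body = replication_body("http://pool.someresolvr.com:5984/resolvr",
                        "resolvr")

# The nightly cron job would then be roughly:
#   curl -X POST http://localhost:5984/_replicate \
#        -H 'Content-Type: application/json' -d "$body"
print(body)
```

Because CouchDB replication only transfers changed documents, running this once a day keeps a secondary's bandwidth and maintenance cost close to zero.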
New entries would only be entered on a main server, or in 'the pool' (e.g. http://pool.someresolvr.com/); those changes would then be in the database, ready to be handed out to everyone on the next replication (I won't add my P2P ideas in this email; they may not be needed for the deltas that would have to be transferred daily or weekly). Add to all of this that CouchDB is designed as "...a distributed, fault-tolerant and schema-free document-oriented database", which fits what we want to do: build a store of documents (data) about a URL that we can serve, while being a permanent, sustainable resolver to the original document. If the service ever died, it could be resurrected from anyone's copy of the database (think LOCKSS, Lots of Copies Keep Stuff Safe, http://www.lockss.org/lockss/Home), so that no data, original or accumulated, would be lost. The data could be exported from the database as XML and then migrated from that to a desired platform.
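As a sketch of the "store of documents about a URL" idea, one CouchDB document per short URL might accumulate everything discussed so far. Every field name and value here is a hypothetical example; only the idea (one schema-less JSON document collecting the original link, old identifiers, tracking data and a cached scrape) is from the text above.

```python
import json

# Hypothetical resolver document for the short URL bhl/SDFoijF.
record = {
    "_id": "bhl/SDFoijF",                        # the short identifier
    "url": "http://www.example.org/page/12345",  # where it resolves to
    "doi": None,                                 # old DOI, kept for reuse
    "lsid": None,                                # old LSID, if any
    "referrers": [],                             # accumulated tracking data
    "cached_html": None,                         # scrape, in case the link dies
}
print(json.dumps(record, indent=2))
```

Because the document is plain JSON, the LOCKSS-style resurrection path is straightforward: anyone's replica can be dumped and re-imported, or exported to XML for another platform.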
I have not been dealing with LSIDs as long as most on this list, so I expect I'm glossing over (or missing) some of the concepts; please let me know where I've gone wrong. This is a needed service, and it's a project I'd like to be involved in building.
Thanks
P
On Tue, Apr 28, 2009 at 4:06 AM, Tim Robertson trobertson@gbif.org wrote:
Probably no one knew about it, but TDWG did offer to help with this (for free) for a long time (since Aug 2007), but no one took it up I believe:
http://www.tdwg.org/activities/online-services/lsid-authority-ids/
I think the idea is worth exploring further. Perhaps a quick "hands up who wants this?" vote to canvass interest?
Crazy idea: I would propose taking it further and suggesting "centralising" a cache of the LSID responses as well (with a configurable expiry date on items in the cache), to alleviate load on servers (thinking of IPNI, for example). It would not be drastically expensive to offer a billion-LSID cache with high availability using cloud computing ("centralised" in the sense that it sits on a distributed cloud such as Amazon S3 + CloudFront). We could share Amazon S3 as a data store and everyone just pays their own PUT cost ($0.01 per 1,000 records), and then we share the bandwidth cost. I think this is worth exploring regardless of the GUID technology used... What was it Rod said? "Distributed begets centralized".
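Taking the quoted figures at face value, a quick back-of-the-envelope check of the one-off PUT cost for the billion-record cache (assuming one PUT per cached LSID response, at the 2009-era rate of $0.01 per 1,000 requests):

```python
# Rough cost sketch; the rate and record count are from Tim's email,
# the one-PUT-per-record assumption is mine.
records = 1_000_000_000
put_cost_per_1000 = 0.01
total_put_cost = records / 1000 * put_cost_per_1000
print(total_put_cost)  # roughly $10,000, one-off, shared across providers
```

Spread across all participating providers, that supports the "not drastically expensive" claim; bandwidth would be the recurring cost.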
Tim
On 28 Apr 2009, at 10:09, Robert Huber wrote:
Dear Dave,
Good to read something about this issue on this list! I like the idea; it reminds me a bit of how Handle and/or DOI manage this. The DOI system allows registrants to act as DOI resellers, and this is working. Prices vary, but packages of several thousand DOIs are sold for an annual fee (http://www.medra.org/en/terms.htm) or given away for free (http://www.std-doi.de). Unless someone does it for free (GBIF?), selling LSIDs could be part of the business model for TDWG. And in analogy to the Handle system, a fee for registering some blabla.tdwg.org subdomain (was it authority?) as you mentioned would surely help to make the whole LSID system more persistent.
Well, and I hope LSID registrants would manage the metadata issue better than the DOI system does. Most people overlook the fact that while DOI registrants do have to register DOIs together with metadata, there is no common way to retrieve that metadata. CrossRef, which is mentioned frequently here, does not hold the metadata for every DOI, only for the DOIs it registered itself. Try to get metadata, for example, for DOIs from other registrants, such as doi:10.1594/PANGAEA.339110.
best regards, Robert
On Tue, Apr 28, 2009 at 12:25 AM, Dave Vieglais vieglais@ku.edu wrote:
I'm not sure if anyone has suggested this strategy (I'll be surprised if not):
TDWG seems determined to use LSIDs for GUIDs, yet the technical issues for implementation are discouraging enough for some to defer deployment. Perhaps TDWG could offer as a bonus for membership (or perhaps a small additional charge) the provision of some elements of the LSID infrastructure stack, overloading the tdwg.org domain?
Then, instead of having each institution create DNS entries such as "mydepartment.institution.org" and deal with the SRV details, use TDWG as a kind of registrar and do something like "mycollectionid.tdwg.org". TDWG would then be responsible for the appropriate DNS SRV registration, and could even operate a resolver and/or redirection service for that domain.
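For concreteness, the SRV registration TDWG would manage for each member might look something like the following zone-file fragment. The `_lsid._tcp` label follows the LSID resolution convention; the host names and the choice of port 80 are illustrative assumptions, not anything from the thread.

```
; Hypothetical entry in the tdwg.org zone for one member collection,
; delegating LSID resolution to a shared TDWG resolver host.
_lsid._tcp.mycollectionid.tdwg.org. IN SRV 0 1 80 resolver.tdwg.org.
```

The member would then only mint identifiers; the DNS plumbing and the resolver behind it would live entirely with TDWG.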
The income would not be very much (say $10/year per org * 100 participants = $1k), but it should be a lot less expensive for the entire community than the total cost of each organization operating its own infrastructure (perhaps ($10/year DNS + $1000/year operations) * 100 participants = $101k).
So as not to overload the TDWG infrastructure, it would be sensible to encourage technically astute groups (e.g. GBIF, EoL, NCEAS) to contribute computing cycles and fallback DNS services to ensure reliability of the entire system.
The end result could be a reliable, distributed infrastructure for LSID (or whatever GUID scheme is decided upon) resolution that conforms to the requirements / specifications derived from the TDWG procedures at a small cost for participation. The low cost and deferral of technical overhead to a knowledgeable group would hopefully encourage participation by a broader audience in this piece of fundamental architecture.
(It may also help reduce the endless cycles of discussion about GUIDs and LSIDs)
Dave V.
_______________________________________________
tdwg-tag mailing list
tdwg-tag@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tag
-- Dr. Robert Huber,
WDC-MARE / PANGAEA - www.pangaea.de
Stratigraphy.net - www.stratigraphy.net
MARUM - Center for Marine Environmental Sciences
University Bremen
Leobener Strasse POP 330 440
28359 Bremen
Phone ++49 421 218-65593, Fax ++49 421 218-65505
e-mail rhuber@wdc-mare.org