[Tdwg-guid] DOIs and persistence -- when DOIs go bad
I know we've pretty much abandoned DOIs, but I thought the following example might serve as a reminder that persistence is a social problem, not a technical one. DOIs are perhaps the most widespread GUIDs relevant to our area, and have a lot of technical and financial resources behind them. But they can still fail.
For example, the DOI 10.1600/0363-6445(2003)028[0387:PORLFR]2.0.CO;2 doesn't exist in CrossRef (try http://dx.doi.org/10.1600/0363-6445(2003)028%5B0387:PORLFR%5D2.0.CO;2). The paper with this DOI is here: http://www.bioone.org/perlserv/?request=get-abstract&doi=10.1600%2F0363-6445%282003%29028%5B0387%3APORLFR%5D2.0.CO%3B2
This problem affects a lot of BioOne-hosted journals (for another example see my comments on using Flickr to store Open Access images - http://iphylo.blogspot.com/2006/05/open-access-taxonomy.html). I've been in touch with both CrossRef and BioOne about this, and it may be a few months before this is resolved (BioOne have installed some new software, and haven't registered their DOIs -- which as far as I can figure out violates CrossRef's rules).
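A side note on the resolver link quoted earlier: the square brackets in these SICI-style DOIs have to be percent-encoded before the DOI can go into a URL. A minimal Python sketch of that step follows. The function name `doi_to_url` is made up for illustration, and the set of characters left unescaped is an assumption chosen only to reproduce the dx.doi.org link as written in this message.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"  # resolver base used in the link in this message


def doi_to_url(doi: str) -> str:
    """Return a resolver URL for a DOI, escaping characters such as [ and ]."""
    # Assumption: "/", "(", ")", ":" and ";" are left as-is to mirror the
    # link in the message; other reserved characters are percent-encoded.
    return RESOLVER + quote(doi, safe="/():;")


print(doi_to_url("10.1600/0363-6445(2003)028[0387:PORLFR]2.0.CO;2"))
```

Whether the resulting link actually resolves is a separate matter, which is the point of the example above: a well-formed URL says nothing about whether the DOI has been registered.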
The DOI example above is particularly annoying for me, because it relates to IPNI's LSIDs. One of my favourite plant taxa is Poissonia heterantha (lsidres:urn:lsid:ipni.org:names:20012728-1). The IPNI metadata for this name include the following:
<tn:publishedIn>Syst. Bot. 28(2): 401 (2003). </tn:publishedIn>
Wouldn't it be nice to have something like:
<tn:publishedIn rdf:resource="doi:10.1600/0363-6445(2003)028[0387:PORLFR]2.0.CO;2" />
Given a DOI, users can locate the article quickly and, via CrossRef, extract metadata. The more taxonomic literature is associated with GUIDs such as DOIs, Handles, and LSIDs, the better.
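To illustrate what a client could do with such a link, here is a small Python sketch that pulls DOIs out of `tn:publishedIn` elements. It is only a sketch under stated assumptions: the namespace URI bound to the `tn:` prefix, the wrapping `tn:TaxonName` element and the function name `published_in_dois` are all assumptions for illustration and should be checked against real IPNI metadata.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: namespace URI for the tn: prefix; verify against real IPNI output.
TN_NS = "http://rs.tdwg.org/ontology/voc/TaxonName#"

# Hypothetical metadata fragment built around the element proposed above.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:tn="{TN_NS}">
  <tn:TaxonName rdf:about="urn:lsid:ipni.org:names:20012728-1">
    <tn:publishedIn rdf:resource="doi:10.1600/0363-6445(2003)028[0387:PORLFR]2.0.CO;2" />
  </tn:TaxonName>
</rdf:RDF>"""


def published_in_dois(rdf_xml: str) -> list[str]:
    """Collect DOIs from tn:publishedIn elements whose resource is a doi: URI."""
    root = ET.fromstring(rdf_xml)
    dois = []
    for elem in root.iter(f"{{{TN_NS}}}publishedIn"):
        resource = elem.get(f"{{{RDF_NS}}}resource", "")
        if resource.startswith("doi:"):
            dois.append(resource[len("doi:"):])
    return dois


print(published_in_dois(SAMPLE))
```

Elements that still carry only a text citation (no `rdf:resource`) are simply skipped, so the same routine could run over records that have not yet been linked.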
It's interesting (and disconcerting) that even DOIs can be screwed up.
Rod
------------------------------------------------------------------------ ---------------------------------------- Professor Roderic D. M. Page Editor, Systematic Biology DEEB, IBLS Graham Kerr Building University of Glasgow Glasgow G12 8QP United Kingdom
Phone: +44 141 330 4778 Fax: +44 141 330 2792 email: r.page@bio.gla.ac.uk web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html
Subscribe to Systematic Biology through the Society of Systematic Biologists Website: http://systematicbiology.org Search for taxon names: http://darwin.zoology.gla.ac.uk/~rpage/portal/ Find out what we know about a species: http://ispecies.org Rod's rants on phyloinformatics: http://iphylo.blogspot.com
Rod wrote: ...
> The DOI example above is particularly annoying for me, because it relates to IPNI's LSIDs. One of my favourite plant taxa is Poissonia heterantha (lsidres:urn:lsid:ipni.org:names:20012728-1). The IPNI metadata for this name include the following:
> <tn:publishedIn>Syst. Bot. 28(2): 401 (2003). </tn:publishedIn>
> Wouldn't it be nice to have something like:
> <tn:publishedIn rdf:resource="doi:10.1600/0363-6445(2003)028[0387:PORLFR]2.0.CO;2" />
> Given a DOI, users can locate the article quickly and, via CrossRef, extract metadata. The more taxonomic literature is associated with GUIDs such as DOIs, Handles, and LSIDs, the better.
This somewhat relates to a cross conversation (sorry, fell off the mailing list) we've been having with Neil Thomson about his BHL proposal, and it points up another social problem, this time with legacy data.
Even if the DOIs were working as advertised, what would it take to get to the point where we could supplement 'Syst. Bot. 28(2): 401 (2003).' with "doi:10.1600/0363-6445(2003)028[0387:PORLFR]2.0.CO;2" ?
This example is actually quite close to being doable programmatically. 'Syst. Bot.' is a standardised publication abbreviation in IPNI, and the collation is in a standard form too, so if we had a source of DOIs tied to articles in Syst. Bot. we could imagine writing a routine to programmatically import the relevant DOI into the IPNI record.
We're chugging away with standardisation in IPNI and we're doing updates at the moment that are cleaning up aspects of 10,000, 15,000, up to 50,000 records at a time. But there are 1.5 million records to do and we can still find (just from within Poissonia) citations like 'Adansonia, ix. (1870) 295. ' or 'Bol. Mus. Hist. Nat. Tucuman no. 6: 8. 1925 Hauman, in Kew Bull. 1925: 279. 1925 ' which look much less tractable.
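As a rough illustration of why the standardised form is tractable and the legacy forms are not, here is a Python sketch that parses the 'Syst. Bot. 28(2): 401 (2003).' pattern with a regular expression. The pattern and the function name `parse_citation` are assumptions for illustration, not IPNI's actual routine, and the regex covers only this one collation style.

```python
import re

# Matches the standardised form "<abbreviation> <vol>(<issue>): <page> (<year>)."
# The issue in parentheses is optional.
CITATION = re.compile(
    r"^(?P<journal>.+?)\s+(?P<volume>\d+)(?:\((?P<issue>\d+)\))?:\s*"
    r"(?P<page>\d+)\s*\((?P<year>\d{4})\)\.?\s*$"
)


def parse_citation(text: str):
    """Return the citation parts as a dict, or None if the form is not recognised."""
    match = CITATION.match(text.strip())
    return match.groupdict() if match else None


for citation in (
    "Syst. Bot. 28(2): 401 (2003).",
    "Adansonia, ix. (1870) 295. ",
    "Bol. Mus. Hist. Nat. Tucuman no. 6: 8. 1925 Hauman, in Kew Bull. 1925: 279. 1925 ",
):
    print(repr(citation), "->", parse_citation(citation))
```

The first citation splits cleanly into journal, volume, issue, page and year; the two legacy examples return None and would fall through to manual or fuzzier handling.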
60% of publications are not standardised in IPNI, and of the standardised ones many of them predate DOIs (are there any moves among journals to retrofit DOIs to older series?). Plus I've noticed in the IPNI and other editors at Kew a strange reluctance to type in strings that look like '10.1600/0363-6445(2003)028[0387:PORLFR]2.0.CO;2' - I don't know whether barcode reading technology could help here...
I think that any solution to standardising citations of literature must inevitably include (if not be confined to) some use of the infrastructure that DOIs represent. But we will also have to come up with a plan that will help us deal with the vast weight of legacy data that databases like IPNI carry.
Sally*** Sally Hinchcliffe *** Computer section, Royal Botanic Gardens, Kew *** tel: +44 (0)20 8332 5708 *** S.Hinchcliffe@rbgkew.org.uk
Sally wrote
> Even if the DOIs were working as advertised, what would it take to get to the point where we could supplement 'Syst. Bot. 28(2): 401 (2003).' with "doi:10.1600/0363-6445(2003)028[0387:PORLFR]2.0.CO;2" ?
> This example is actually quite close to being doable programmatically. 'Syst. Bot.' is a standardised publication abbreviation in IPNI, and the collation is in a standard form too, so if we had a source of DOIs tied to articles in Syst. Bot. we could imagine writing a routine to programmatically import the relevant DOI into the IPNI record.
CrossRef provide an interface to extract DOIs (see http://iphylo.blogspot.com/2006/05/crossrefs-openurl-resolver.html for an example). There's also a web form that you can use to play around with: http://www.crossref.org/guestquery/.
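A hedged sketch of what such a lookup might look like from Python: it only builds the query URL and does not contact CrossRef. The endpoint URL and the parameter names (`pid`, `title`, `volume`, `issue`, `spage`, `date`) are assumptions based on CrossRef's OpenURL interface and should be checked against their documentation; `crossref_openurl` and the e-mail address are made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: endpoint and parameter names; verify against CrossRef's documentation.
OPENURL_ENDPOINT = "http://www.crossref.org/openurl"


def crossref_openurl(title, volume, spage, year, pid, issue=None):
    """Build an OpenURL-style query asking for the DOI of one article."""
    params = {"pid": pid, "title": title, "volume": volume}
    if issue is not None:
        params["issue"] = issue
    params["spage"] = spage
    params["date"] = year
    return OPENURL_ENDPOINT + "?" + urlencode(params)


# The query needs the article's FIRST page. The "0387" inside the DOI looks
# like that first page, whereas 401 in the IPNI citation is the page on which
# the name appears, so 401 alone would probably not match.
print(crossref_openurl("Systematic Botany", 28, 387, 2003,
                       pid="you@example.org", issue=2))
```

That page mismatch is worth keeping in mind for any automated import: a name-level citation identifies a page inside an article, so the article's page range has to come from somewhere else before a lookup like this can succeed.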
> We're chugging away with standardisation in IPNI and we're doing updates at the moment that are cleaning up aspects of 10,000, 15,000, up to 50,000 records at a time. But there are 1.5 million records to do and we can still find (just from within Poissonia) citations like 'Adansonia, ix. (1870) 295. ' or 'Bol. Mus. Hist. Nat. Tucuman no. 6: 8. 1925 Hauman, in Kew Bull. 1925: 279. 1925 ' which look much less tractable
Hard work. There is a small literature on citation matching which may be relevant, e.g: http://citeseer.ist.psu.edu/pasula02identity.html and http://citeseer.ist.psu.edu/lawrence99autonomous.html.
> 60% of publications are not standardised in IPNI, and of the standardised ones many of them predate DOIs (are there any moves among journals to retrofit DOIs to older series?)
Some are. I've noticed some very old articles hosted by Springer have DOIs (sadly, I can't give a specific example). Of course you don't need to solely rely on DOIs. Handles should also be catered for. For example, the American Museum of Natural History provide full-text of current and back issues of their publications: http://digitallibrary.amnh.org/dspace/. Of course, they don't do plants, but if other institutions adopt the AMNH's approach of using DSpace, then handles are part of the mix. Some repositories use URLs.
> Plus I've noticed in the IPNI and other editors at Kew a strange reluctance to type in strings that look like '10.1600/0363-6445(2003)028[0387:PORLFR]2.0.CO;2' - I don't know whether barcode reading technology could help here...
Yeah, but there's this cool invention called "cut and paste" ;-) This can all be automated. Look at Connotea - you can highlight a DOI in your browser, click on a bookmarklet and hey presto, it's added to Connotea. Also, much of this would be done by scripts, web crawling, etc.
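As a sketch of the "done by scripts" part, the following Python snippet picks DOI-like strings out of pasted or crawled text. The regular expression is a loose heuristic rather than an official DOI grammar, and `find_dois` is a hypothetical helper name.

```python
import re

# Loose heuristic: "10.", a numeric registrant code, "/", then everything
# up to the next whitespace. This is not an official DOI grammar.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/\S+")


def find_dois(text: str) -> list[str]:
    """Return DOI-like strings found in free text."""
    # Trim sentence punctuation and quotes that often trail a pasted DOI.
    return [m.group(0).rstrip(".,'\"") for m in DOI_PATTERN.finditer(text)]


sample = "strings that look like '10.1600/0363-6445(2003)028[0387:PORLFR]2.0.CO;2' are no fun to type"
print(find_dois(sample))
```

Because SICI-style DOIs can legitimately contain brackets, parentheses and semicolons, the pattern stops only at whitespace; the price is that closing brackets or other punctuation glued onto the end of a DOI would need extra handling.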
> I think that any solution to standardising citations of literature must inevitably include (if not be confined to) some use of the infrastructure that DOIs represent. But we will also have to come up with a plan that will help us deal with the vast weight of legacy data that databases like IPNI carry
I think there are two things to be addressed here. The first is GUIDs for old literature that no publisher is ever going to assign DOIs to (the publisher may no longer exist, the work is in the public domain, etc.).
Also, several modern publishers have not adopted DOIs but instead use fairly standard URLs. These include a lot of journals relevant to us (e.g., American Journal of Botany and other HighWire Press journals).
I think major name databases such as IPNI would have to provide GUIDs for the literature they have. Then, map as much as possible to external GUIDs.
This raises the second issue, multiple GUIDs for publications. We have this already, with DOIs, URLs, JSTOR, and PubMed ids for the same papers.
It would be cool to have a GUID "broker" that could, say, take a DOI and say "well, IPNI calls this reference xxxxxx, and MOBOT have it as yyyyyy, and there's a handle hdl:zzzzz for it, etc." It could also mimic CrossRef by taking bibliographic data and attempting to match that to an existing GUID. In other words, it would be an OpenURL resolver.
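To make the broker idea concrete, here is a toy in-memory sketch in Python. It only covers the identifier-to-identifier part, not the matching of bibliographic data. The class name `GuidBroker`, its methods, and the `ipni:` and `mobot:` prefixes are hypothetical; the identifiers are the placeholders from the paragraph above.

```python
from collections import defaultdict


class GuidBroker:
    """Toy broker: each record is the set of identifiers known for one publication."""

    def __init__(self):
        self._record_of = {}              # identifier -> record number
        self._members = defaultdict(set)  # record number -> identifiers
        self._next = 0

    def link(self, *identifiers):
        """Declare that all the given identifiers refer to the same publication."""
        existing = {self._record_of[i] for i in identifiers if i in self._record_of}
        if existing:
            target = min(existing)
        else:
            target = self._next
            self._next += 1
        # Merge any other records these identifiers already belonged to.
        for record in existing - {target}:
            for ident in self._members.pop(record):
                self._record_of[ident] = target
                self._members[target].add(ident)
        for ident in identifiers:
            self._record_of[ident] = target
            self._members[target].add(ident)

    def equivalents(self, identifier):
        """Return the other identifiers known for the same publication."""
        record = self._record_of.get(identifier)
        if record is None:
            return set()
        return self._members[record] - {identifier}


broker = GuidBroker()
doi = "doi:10.1600/0363-6445(2003)028[0387:PORLFR]2.0.CO;2"
broker.link(doi, "ipni:xxxxxx")             # placeholder identifiers
broker.link("mobot:yyyyyy", "hdl:zzzzz")
broker.link("hdl:zzzzz", doi)               # joins the two records
print(sorted(broker.equivalents(doi)))
```

A real service would need persistence and some notion of who asserted each link, since two databases may disagree about whether two identifiers refer to the same work.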
Regards
Rod
Ah Rod
>> Plus I've noticed in the IPNI and other editors at Kew a strange reluctance to type in strings that look like '10.1600/0363-6445(2003)028[0387:PORLFR]2.0.CO;2' - I don't know whether barcode reading technology could help here...
> Yeah, but there's this cool invention called "cut and paste" ;-) This can all be automated. Look at Connotea - you can highlight a DOI in your browser, click on a bookmarklet and hey presto, it's added to Connotea. Also, much of this would be done by scripts, web crawling, etc.
Some day I must take you to the IPNI editors' room to see the piles of books and journals from the library that they work with. You may remember books ...
In my dream world all editors of botanical journals and publishers of botanical books will supply IPNI with a TCS document appended to the end of a PDF with all of the new names, new taxa and new combinations, plus of course the relevant DOIs, and they will simply need to check that the physical representation does exist (no electronic publishing allowed in botany) and then approve all of the records to be uploaded. And then get on with back standardisation of the rest.
Then I wake up ...
Thanks for the links - will pass them on to our standardisation programmer
Sally
On 26 May 2006, at 11:30, Sally Hinchcliffe wrote:
> Ah Rod
> Some day I must take you to the IPNI editors' room to see the piles of books and journals from the library that they work with. You may remember books ...
Books, they're made of -- what ya call it -- "paper", right? I've written one, and edited another, so I'm vaguely familiar with the concept ;-)
> In my dream world all editors of botanical journals and publishers of botanical books will supply IPNI with a TCS document appended to the end of a PDF with all of the new names, new taxa and new combinations, plus of course the relevant DOIs, and they will simply need to check that the physical representation does exist (no electronic publishing allowed in botany) and then approve all of the records to be uploaded. And then get on with back standardisation of the rest.
> Then I wake up ...
OK,
1. What about library catalogues? Surely a lot of literature will have an electronic card catalogue somewhere? What about WorldCat - http://www.oclc.org/worldcat/. Is it not possible to search global library catalogues to extract basic metadata about a lot of literature, or am I being - as usual - hopelessly naive?
2. For the future, doing what you suggest (names + DOIs) would be trivial, especially if taxonomy gets its act together and sets up a journal that accepts appropriately marked up documents in the first place. As soon as we realise that what we need is a database that thinks it's a journal, we will make progress. Any notion of capturing data after the fact in the 21st century is ludicrous (which is why I've no expectation that Zoobank will succeed).
Rod
Rod
> 1. What about library catalogues? Surely a lot of literature will have an electronic card catalogue somewhere? What about WorldCat - http://www.oclc.org/worldcat/. Is it not possible to search global library catalogues to extract basic metadata about a lot of literature, or am I being - as usual - hopelessly naive?
It's true, there are electronic sources of some of this information, and it's possible that we are capturing in IPNI some information which is then _also_ captured (from the same journal) in the Kew Record and then _also_ captured (at least for books) by the library catalogue. So a central place where some of the metadata - especially horrible hairy things like DOIs and ISBNs - could be captured once and re-used might be helpful, ESPECIALLY if the source was the publisher itself (that's what Amazon makes you do, after all) - someone with an incentive to get it right and correct it if it's wrong. Actually just doing it once at Kew would be an improvement (but we'd need to actually write the tools to do that). Of course, there will be no substitute for the editors actually scanning (i.e. with their eyes) the literature themselves to check that everything is present and correct - descriptions, types and so on. And the KR team are looking for different things as well - everyone has additional information they need from a publication that they will want to extract.
The more I think about it the more I can see that we do need a really good robust system for identifying and cross linking all these things. DOIs will do part of it, but I guess BHL will have to cover the rest ... As to whether we end up with something centralised or standardised, I haven't really thought about it enough to say ...
Sally
participants (2): Roderic Page, Sally Hinchcliffe