October 2005 - tdwg-tag - lists.tdwg.org

Re: [tdwg-tapir] {Definitely Spam?} Modification in the Inventoryresponse
by "Döring, Markus" 14 Nov '05

I like the idea. It will make the inventory a little more "customizable", similar to the searches.

Are there any objections, or other preferred names for the new attribute? It could be, for example, elementName, responseElement, responseName, tag, tagname, name or renamed. I think I quite like tag.

I also think it should be optional and default to the <value> tag if not supplied.

Markus

-----Original Message-----
From: tdwg-tapir-bounces(a)lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On behalf of Javier privat
Sent: Tuesday, 25 October 2005 12:41
To: tdwg-tapir(a)lists.tdwg.org
Subject: [tdwg-tapir] {Definitely Spam?} Modification in the Inventoryresponse

Dear all,

Samy Gaiji, from IPGRI, sent us an email yesterday with comments about TAPIR. He considers the response format of the inventory operation inconvenient. For those who do not remember, an inventory operation looks like this:

Request:
------------
<?xml version='1.0' encoding='UTF-8'?>
<request>
  <header />
  <inventory count='true' start='0' limit='50'
      xmlns:dwc='http://digir.net/schema/conceptual/darwin/2003/1.0'>
    <concepts>
      <concept path='dwc:/Country' />
      <concept path='dwc:/Genus' />
    </concepts>
  </inventory>
</request>
------------

Response:
------------
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <header></header>
  <inventory>
    <record>
      <value>AUSTRALIA</value>
      <value>Calicium</value>
    </record>
    <summary start="0" totalReturned="50" totalMatched="73" next="50" />
  </inventory>
</response>
------------

He finds this hard to parse, because all concepts are named 'value' and he has to trust that the elements are returned in the same order in which they were requested. To me this does not look like a big issue, but in any case here is a proposal that makes it possible to assign names to the elements in the response.

Request:
------------
<?xml version='1.0' encoding='UTF-8'?>
<request>
  <header />
  <inventory count='true' start='0' limit='50'
      xmlns:dwc='http://digir.net/schema/conceptual/darwin/2003/1.0'>
    <concepts>
      <concept path='dwc:/Country' elementName='Country' />
      <concept path='dwc:/Genus' elementName='Genus' />
    </concepts>
  </inventory>
</request>
------------

Response:
------------
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <header></header>
  <inventory>
    <record>
      <Country>AUSTRALIA</Country>
      <Genus>Calicium</Genus>
    </record>
    <summary start="0" totalReturned="50" totalMatched="73" next="50" />
  </inventory>
</response>
------------

You can find attached a modification of the latest protocol schema that includes this.

What are your thoughts on this?
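To make the parsing concern concrete, here is a minimal sketch (not part of the original thread) of how a client might read the two response styles with Python's standard xml.etree.ElementTree. The function names and the trimmed sample documents are assumptions made for illustration; only the element names come from the messages above.

```python
import xml.etree.ElementTree as ET

# Concept paths in the order they were sent in the (hypothetical) request.
REQUESTED_PATHS = ["dwc:/Country", "dwc:/Genus"]

# Trimmed versions of the two response styles discussed in the thread.
POSITIONAL = """<response><header/><inventory>
  <record><value>AUSTRALIA</value><value>Calicium</value></record>
</inventory></response>"""

NAMED = """<response><header/><inventory>
  <record><Country>AUSTRALIA</Country><Genus>Calicium</Genus></record>
</inventory></response>"""


def parse_positional(xml_text, requested_paths):
    """Pair each <value> with the concept requested at the same position.

    The client has to remember the request order; nothing in the
    response itself says which value belongs to which concept.
    """
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter("record"):
        values = [v.text for v in record.findall("value")]
        records.append(dict(zip(requested_paths, values)))
    return records


def parse_named(xml_text):
    """Read each child element under its own tag name; order is irrelevant."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in record}
            for record in root.iter("record")]


print(parse_positional(POSITIONAL, REQUESTED_PATHS))
print(parse_named(NAMED))
```

With the positional form the mapping lives in client state; with the named form it travels with the response, which seems to be the convenience being asked for.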

Topic 3: GUIDs for Taxon Names and Taxon Concepts
by Donald Hobern 30 Oct '05

[ Another topic for comments. Please keep the Topic number in responses. ]

Topic 3: GUIDs for Taxon Names and Taxon Concepts

Another key area in which TDWG has recognised the need for globally unique identifiers is in connection with taxon names and the various concepts associated with them. This issue also intersects with that of identifiers for taxonomic publications.

Definitions

In the following discussion, a "taxon name" is a scientific name string which simply identifies a name assigned in the taxonomic literature. In many cases such a name may have been applied in different ways by the original author and subsequent taxonomists. Each such application of a taxon name by a taxonomist to a set of organisms is here referred to as a "taxon concept". An understanding of the taxon concept adopted by a researcher is frequently essential if data are to be interpreted correctly.

In its most basic form a "taxon concept" can be considered to be the use of a given "taxon name" in a given "taxonomic publication", in other words something that could be represented as "Agenus aspecies Author1 Year1 sec. Author2 Year2". One possible approach to assigning identifiers to taxon concepts would therefore be to assign identifiers to taxon names and to taxonomic publications, and to use a combination of these identifiers to identify each taxon concept.

Note that a taxon concept may be defined at least in part by a set of assertions about the relationship between the present concept and the concepts adopted by earlier taxonomists. In addition, it is possible for other researchers to make their own assertions about the relationships between the concepts published by different taxonomists. Much of the interest and value to be gained from modeling taxonomy relates to the interpretation of these asserted relationships.

Although the distinction between taxon names and taxon concepts may seem (over-)subtle, it is important that we should know whether we are referring simply to a nomenclaturally valid name, quite independently of any set of organisms to which it may be applied, or to a taxon concept which somehow applies such a name to such a set of organisms. Without this distinction, we will be restricted in our ability to develop biodiversity informatics, although of course there will be many cases in which all we can say is that a data set refers to some unspecified taxon concept associated with a given taxon name.

Identifiers

Clearly there are many situations in which a taxon name can itself be treated as a unique identifier without any apparent ambiguity about which name is being referenced (e.g. Turdus merula; Poa annua), but the existence of homonyms prevents this from being generally true. Even when taxon names include citations of the original publications (e.g. Turdus merula Linnaeus, 1758; Poa annua L.), they can be very difficult to compare since the form of the citations may vary greatly. Even where there is no ambiguity about which name is being referenced, such a name does not by itself serve to identify which associated concept is being referenced.

There are many different systems in place for associating other identifiers with either taxon names or taxon concepts. ITIS (http://www.itis.usda.gov/, http://www.cbif.gc.ca/pls/itisca, http://siit.conabio.gob.mx/) assigns Taxonomic Serial Numbers (TSNs) to each name in its system. Other species databases have their own identifiers for taxon concepts. Recording schemes often have their own identifiers for taxa (e.g. Bradley and Fletcher numbers for Lepidoptera in the UK, various systems of four-letter codes for North American bird species). These are often used to provide some stability and clarity in the taxonomy used by a given project.

Questions

I would therefore like to ask the following questions of any of you who use scientific names in your databases (either taxonomic databases recording a list of taxa, or databases recording information about taxa, specimens, observations, etc.):

1. Is your data organised using taxon names or taxon concepts?
2. Do you assign any reusable identifiers to taxon names or concepts (i.e. identifiers used in more than one database)?
3. If so, what is the process for assigning new identifiers for additional taxa and for accommodating taxonomic change?
4. Where are these identifiers used (other organizations, databases, data exchange, recording forms, etc.)?
5. Do you use identifiers from any external classification within your database?
6. Would there be any social or technical roadblocks to replacing these identifiers with a single identifier that was guaranteed to be unique?

As before, I am looking for information on existing practices and any requirements that would need to be accommodated within any general system of identifiers.

Thanks,

Donald

---------------------------------------------------------------
Donald Hobern (dhobern(a)gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
---------------------------------------------------------------
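The idea under "Definitions" of identifying a taxon concept by combining a name identifier with a publication identifier can be sketched as follows. This is only an illustration under stated assumptions: the class names, the identifier strings ("name:1", "pub:2", "pub:3") and the "|" separator are all hypothetical and are not proposed anywhere in the message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonName:
    name_id: str       # hypothetical identifier for the name itself
    name_string: str   # e.g. "Agenus aspecies Author1 Year1"


@dataclass(frozen=True)
class Publication:
    publication_id: str  # hypothetical identifier for the publication
    citation: str        # e.g. "Author2 Year2"


@dataclass(frozen=True)
class TaxonConcept:
    """The use of one taxon name in one taxonomic publication."""
    name: TaxonName
    publication: Publication

    @property
    def concept_id(self):
        # Assumed scheme: join the two identifiers with a separator.
        return f"{self.name.name_id}|{self.publication.publication_id}"

    def label(self):
        return f"{self.name.name_string} sec. {self.publication.citation}"


name = TaxonName("name:1", "Agenus aspecies Author1 Year1")
concept_a = TaxonConcept(name, Publication("pub:2", "Author2 Year2"))
concept_b = TaxonConcept(name, Publication("pub:3", "Author3 Year3"))

print(concept_a.concept_id, "->", concept_a.label())
print(concept_b.concept_id, "->", concept_b.label())
```

Both concepts share one name identifier but receive distinct concept identifiers, which is the distinction the message asks data holders to keep in mind.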

Re: Topic 3: GUIDs for Taxon Names and Taxon Concepts
by Richard Pyle 30 Oct '05

Thank you, Donald, for starting the discussion thread for which I have been waiting (not so) patiently for a very long time, and in a context that might, for the first time, possibly lead to some meaningful resolution.

Of course, I wholeheartedly endorse your approach to distinguishing "names" from "concepts" as different informational objects, and also support the basic notion that a "Concept Object" can be conveniently and reliably represented as a combination of a "Name" and some sort of documented usage of the Name (usually in the form of a publication).

Before I provide my own answers to your specific questions, though, I want to underscore what I feel is a fundamentally important issue that needs to be addressed early on in any serious discussion of GUIDs for taxonomic names. There is no broad agreement on what a unit "Name" really is, or should be. Consider the following list:

1. Pomacanthidae
2. Pomacanthinae
3. Centropyge
4. Xiphypops
5. Centropyge (Xiphypops)
6. Centropyge flavicaudus
7. Centropyge flavicauda
8. Xiphypops flavicaudus
9. Centropyge (Xiphypops) flavicauda
10. Centropyge fisheri
11. Centropyge fisheri flavicauda
12. Centropyge (Xiphypops) fisheri flavicauda

How many Name-GUIDs would be needed for the above list?

From one perspective there would be twelve GUIDs -- one for each "namestring". In ITIS, there would be ten TSNs (#9 would not receive a separate TSN from #7, nor would #12 receive a separate TSN from #11). From the botanical perspective (imagining these as botanical names), there would be at least seven (#6 & #7 would be spelling variants of the same "name", and I don't believe that #9 and #12 would be treated as different "names" from #7 and #11, respectively), and perhaps eight (not sure if #1 & #2 would be the same or different "names", the former being at rank Family, and the latter Subfamily). From the zoological perspective, there may be only five: [1+2], [3], [4+5], [6+7+8+9+11+12], [10] (the various flavors of each "Name" unit would be considered attributes of the usage -- i.e., tied to the Concept object).

Before a GUID system can be implemented for taxon names, there needs to be a clear definition of what "unit" of name should receive a unique GUID, vs. what textual elements represent attributes of a usage (~concept) instance. No definition is perfectly unambiguous in all cases, but I think it's important that the broader community adopt a SINGLE definition of what a Name unit is. Having separate systems for Botany vs. Zoology vs. whatever would, I think, go a very long way toward defeating the purpose of establishing taxon name GUIDs in the first place.

Now on to the specific questions:

> Is your data organised using taxon names or taxon concepts?

I use Taxon concepts as the core unit, with only one series of ID #s (32-bit integers). Name IDs are derived from a defined subset of Concept IDs (the original description usage instance for each name). For a full explanation, see: www.phyloinformatics.org/pdf/1.pdf

Note: I would NOT recommend this approach (name IDs derived from a subset of concept IDs) for GUIDs. It works WONDERFULLY and elegantly for my Taxonomer application, where ID numbers are always passed in context. But for universally accessed GUIDs, there may be ambiguity whether ID#12345 references the concept asserted within the original description of a name, or just the concept-less name object.

> Do you assign any reusable identifiers to taxon names or concepts
> (i.e. identifiers used in more than one database)?

I guess it depends on what you mean by "one database". I think the best answer to your question for the "databases" I manage is "yes".

> If so, what is the process in assigning new identifiers for additional
> taxa and for accommodating taxonomic change?

New names & concepts are created from multiple sources, and identifiers are assigned automatically within a single, common taxon data table accessed by all sources via the network. Because records represent Name-usage instances, they never need to change (except for correcting data entry/transcription errors). Changing taxonomies are documented automatically simply by virtue of the fact that each usage is treated as a separate record, so the data table creates a history of alternate usages over time. A single internal "current use" taxonomy is established by selecting a single usage record for each "Name" (sensu zoological perspective), representing the specific usage that we feel got it "right".

> Where are these identifiers used (other organizations,
> databases, data exchange, recording forms, etc.)?

At this moment, they are used only internally within our institution. Soon, they will be shared among partners of the Pacific Basin Information Node (PBIN) -- part of the U.S. National Biological Information Infrastructure (NBII).

> Do you use identifiers from any external classification
> within your database?

Not sure what this means, exactly, but we do cross-map our IDs to other IDs (e.g., ITIS TSNs, Catalog of Fishes ID numbers, etc.). And the nature of our data structure (tracking usage instances) automatically keeps track of multiple classifications.

> Would there be any social or technical roadblocks to
> replacing these identifiers with a single identifier
> that was guaranteed to be unique?

Not really -- depending on how a Name "unit" is scoped (as per my discussion above).

Aloha,
Rich

Richard L. Pyle, PhD
Database Coordinator for Natural Sciences
and Associate Zoologist in Ichthyology
Department of Natural Sciences, Bishop Museum
1525 Bernice St., Honolulu, HI 96817
Ph: (808)848-4115, Fax: (808)847-8252
email: deepreef(a)bishopmuseum.org
http://hbs.bishopmuseum.org/staff/pylerichard.html
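The counting argument for the twelve namestrings can be checked with a small sketch. The data structures and function names are assumptions made for illustration; the groupings themselves are the ones stated in the message (namestring view, ITIS view, zoological view). The botanical count is left out because the message itself is unsure whether it is seven or eight.

```python
# The twelve namestrings from the message, keyed by their list number.
NAMESTRINGS = {
    1: "Pomacanthidae",
    2: "Pomacanthinae",
    3: "Centropyge",
    4: "Xiphypops",
    5: "Centropyge (Xiphypops)",
    6: "Centropyge flavicaudus",
    7: "Centropyge flavicauda",
    8: "Xiphypops flavicaudus",
    9: "Centropyge (Xiphypops) flavicauda",
    10: "Centropyge fisheri",
    11: "Centropyge fisheri flavicauda",
    12: "Centropyge (Xiphypops) fisheri flavicauda",
}

# ITIS view as described: #9 shares a TSN with #7, and #12 with #11.
ITIS_SHARES_ID_WITH = {9: 7, 12: 11}

# Zoological view as described: five name units.
ZOOLOGICAL_UNITS = [{1, 2}, {3}, {4, 5}, {6, 7, 8, 9, 11, 12}, {10}]


def count_namestrings(names):
    """One identifier per distinct string."""
    return len(set(names.values()))


def count_with_shared_ids(names, shares_id_with):
    """One identifier per entry, except entries mapped onto another entry."""
    return len({shares_id_with.get(number, number) for number in names})


def count_units(units, names):
    """One identifier per unit; also check the units cover every entry once."""
    covered = sorted(number for unit in units for number in unit)
    if covered != sorted(names):
        raise ValueError("units must cover each namestring exactly once")
    return len(units)


print(count_namestrings(NAMESTRINGS))                           # namestring view
print(count_with_shared_ids(NAMESTRINGS, ITIS_SHARES_ID_WITH))  # ITIS view
print(count_units(ZOOLOGICAL_UNITS, NAMESTRINGS))               # zoological view
```

The same twelve strings yield twelve, ten or five identifiers depending on the definition of a name unit, which is why the message argues for agreeing on a single definition first.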

List test.
by Roger Hyam 28 Oct '05

Just a quick test to check the list is running OK before more people are invited to join.

--
-------------------------------------
Roger Hyam
Technical Architect
Taxonomic Databases Working Group
-------------------------------------
http://www.tdwg.org
roger(a)tdwg.org
+44 1578 722782
-------------------------------------

Re: [tdwg-tapir] {Definitely Spam?} Modification inthe Inventoryresponse
by "Döring, Markus" 26 Oct '05

...and what about naming the attribute just "label"? More human-friendly, isn't it?

<concept path='dwc:/Country' label='Country' />
<concept path='dwc:/Country' tag='Country' />
<concept path='dwc:/Country' elementName='Country' />

Markus

-----Original Message-----
From: tdwg-tapir-bounces(a)lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On behalf of Roger Hyam
Sent: Tuesday, 25 October 2005 15:24
To: Döring, Markus
Cc: tdwg-tapir(a)lists.tdwg.org
Subject: Re: [tdwg-tapir] {Definitely Spam?} Modification inthe Inventoryresponse

Seems OK to me as well

--
-------------------------------------
Roger Hyam
Technical Architect
Taxonomic Databases Working Group
-------------------------------------
http://www.tdwg.org
roger(a)tdwg.org
+44 1578 722782
-------------------------------------
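The suggestion in this thread that the naming attribute be optional and fall back to <value> could look roughly like this on the provider side. It is a sketch only: the attribute is assumed here to be called "tag" (the thread had not settled on a name), and the function names and the trimmed request are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed request: the first concept carries the assumed 'tag' attribute,
# the second does not and should therefore fall back to <value>.
REQUEST = """<request><header/><inventory><concepts>
  <concept path='dwc:/Country' tag='Country'/>
  <concept path='dwc:/Genus'/>
</concepts></inventory></request>"""


def response_tags(request_xml, attribute="tag"):
    """Return the element name to use for each requested concept, in order."""
    root = ET.fromstring(request_xml)
    return [concept.get(attribute, "value") for concept in root.iter("concept")]


def build_record(tags, row):
    """Build one <record> whose children are named by the chosen tags."""
    record = ET.Element("record")
    for tag, text in zip(tags, row):
        ET.SubElement(record, tag).text = text
    return ET.tostring(record, encoding="unicode")


tags = response_tags(REQUEST)
print(tags)
print(build_record(tags, ["AUSTRALIA", "Calicium"]))
```

Passing attribute="label" or attribute="elementName" to response_tags would cover the other candidate names without changing the fallback behaviour.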

Re: Topic 2: GUIDs for Collections and Specimens
by Roderic Page 25 Oct '05

Apologies to Dave for being slightly dense, I now understand the proposed solution, which works fine for MVZ records (and others). I should have paid more attention to the XML I was getting back, which would have shown me what I needed to know.

This temporary hack isn't a great solution to the problem of GUIDs for specimens, but works for me at present. The only pain is trying to map classic specimen codes such as "MVZ 193037" from GenBank records onto the correct specimen, but that's another story.

To perhaps be slightly more relevant to this discussion, the issue of multiple identifiers for the "same" information keeps coming up. For example, information on a MVZ specimen may be retrieved using DiGIR (as an XML document), directly from the MVZ web site (as an HTML document, with a different specimen id from the DiGIR record), or through GBIF (with yet another id). For my own purposes I'm linking the different representations, so that my database knows about them (for the technically minded, I'm using RDF so the link is made using the "rdf:sameAs" tag). In the case of specimens I'm guessing the information is usually the "same" (typically it is ultimately served by the same source database), but in other cases it can be very different (e.g., publications where resolving a PubMed id and a DOI lead to very different digital documents).

Regards

Rod

On 21 Oct 2005, at 00:18, Dave Vieglais wrote:

> Hi Rod,
> I was just pointing out that if you include CollectionCode in your
> example then you would not have the duplication of records that occurs
> in the example. The combination of InstitutionCode, CollectionCode, and
> CatalogNumber should provide a GUID to a specimen record. So to
> slightly modify your example
>
>   DiGIR provider URL : resource : CollectionCode : specimen code
>
> will generally be sufficient, but in some cases, a single server
> resource may offer records from several institutions, hence:
>
>   DiGIR provider URL : resource : InstitutionCode : CollectionCode : specimen code
>
> would be unique. It would be a simple matter to extend DiGIR slightly
> to support direct resolution of such an identifier. Perhaps something
> like:
>
>   http://some.server/digir.php?id=resource/InstitutionCode/CollectionCode/CatalogNumber
>
> would be sufficient to identify a single record and retrieve its
> digital representation as well.
>
> regards,
> Dave V.
>
> Roderic Page wrote:
>> My point is that it isn't always done (and the MVZ example concerns
>> totally different specimens, rather than preparations of the same
>> specimen). My aim is not to criticise DiGIR and Darwin Core
>> specifically (although the absence of a GUID is a major weakness), but
>> simply to provide a concrete example where digital records for totally
>> different specimens are not clearly distinguished. In the MVZ example,
>> one could retrieve the record for the desired specimen if one searched
>> on the taxonomic name, but this is cumbersome -- ideally I want a GUID
>> that can be resolved to the appropriate specimen independent of any
>> other information. DiGIR can do this, so long as DiGIR providers use
>> different resource names for different collections.
>>
>> Regards
>>
>> Rod
>>
>> On 20 Oct 2005, at 23:11, Dave Vieglais wrote:
>>
>>> Hi Roderic,
>>> In general, for records retrieved from data sources exposed using the
>>> Darwin Core one should be able to combine InstitutionCode,
>>> CollectionCode and CatalogNumber to provide unique identifiers for
>>> those records. This is not always the case, however; the most common
>>> example is probably the presence of records for different
>>> preparations of the same specimen.
>>>
>>> regards,
>>> Dave V.
>>>
>>> Roderic Page wrote:
>>>
>>>> As a consumer of specimen GUIDs, I've found specimens to be
>>>> frustrating to deal with as individual collections don't guarantee
>>>> uniqueness of identifiers (Donald's point 2 below). For example, in
>>>> the absence of specimen GUIDs (such as LSIDs) I'd hoped to use a
>>>> three part identifier based on the DiGIR provider, e.g.
>>>>
>>>>   DiGIR provider URL : resource : specimen code
>>>>
>>>> Hence,
>>>>
>>>> digir.fieldmuseum.org/digir/DiGIR.php:MammalsDwC2:FMNH158106
>>>> \-----------------------------------/ \---------/ \--------/
>>>>               provider                 resource    specimen
>>>>
>>>> identifies specimen FMNH 158106 of Tatera robusta at the Field
>>>> Museum in Chicago. The idea behind this crude hack is that the
>>>> identifier can be resolved (there's enough information in the
>>>> identifier to retrieve the record, see for example
>>>> http://darwin.zoology.gla.ac.uk/~rpage/hacks/2/index.html ).
>>>>
>>>> To my horror, if I do this for MVZ 148946, I get three specimens
>>>> back, one each for Chaetodipus baileyi baileyi, Calidris mauri, and
>>>> Rana cascadae. This is an instance where the same specimen code is
>>>> being used in three different collections (mammals, birds, and
>>>> herps). I guess MVZ could have avoided this by using a different
>>>> name for the 'resource' field for each collection.
>>>>
>>>> I offer this as an example of where GUIDs are vital if we are to
>>>> avoid linking to the wrong information, and also where individual
>>>> providers need to ensure that the identifiers they generate are
>>>> unique.
>>>>
>>>> Regards
>>>>
>>>> Rod

Professor Roderic D. M. Page
Editor, Systematic Biology
DEEB, IBLS
Graham Kerr Building
University of Glasgow
Glasgow G12 8QP
United Kingdom

Phone: +44 141 330 4778
Fax: +44 141 330 2792
email: r.page(a)bio.gla.ac.uk
web: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
reprints: http://taxonomy.zoology.gla.ac.uk/rod/pubs.html

Subscribe to Systematic Biology through the Society of Systematic Biologists Website: http://systematicbiology.org
Search for taxon names at http://darwin.zoology.gla.ac.uk/~rpage/portal/
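The MVZ 148946 collision and Dave's proposed fix can be illustrated with a short sketch. The provider URL, the resource name and the CollectionCode values below are placeholders invented for the example (the thread does not give them); only the Darwin Core field names, the catalog number and the three taxon names come from the messages.

```python
from collections import defaultdict

PROVIDER = "some.server/digir.php"  # hypothetical provider URL
RESOURCE = "MVZ"                    # hypothetical resource name

# Three records sharing one specimen code; CollectionCode values are assumed.
RECORDS = [
    {"InstitutionCode": "MVZ", "CollectionCode": "Mammals",
     "CatalogNumber": "148946", "ScientificName": "Chaetodipus baileyi baileyi"},
    {"InstitutionCode": "MVZ", "CollectionCode": "Birds",
     "CatalogNumber": "148946", "ScientificName": "Calidris mauri"},
    {"InstitutionCode": "MVZ", "CollectionCode": "Herps",
     "CatalogNumber": "148946", "ScientificName": "Rana cascadae"},
]


def three_part_key(rec):
    """provider : resource : specimen code (institution code + catalog number)."""
    specimen_code = rec["InstitutionCode"] + rec["CatalogNumber"]
    return ":".join([PROVIDER, RESOURCE, specimen_code])


def five_part_key(rec):
    """provider : resource : InstitutionCode : CollectionCode : CatalogNumber."""
    return ":".join([PROVIDER, RESOURCE, rec["InstitutionCode"],
                     rec["CollectionCode"], rec["CatalogNumber"]])


def group_by_key(records, key_func):
    """Map each identifier to the taxa of all records that resolve to it."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_func(rec)].append(rec["ScientificName"])
    return dict(groups)


for key, taxa in group_by_key(RECORDS, three_part_key).items():
    print(key, "->", taxa)   # one identifier, three unrelated specimens
for key, taxa in group_by_key(RECORDS, five_part_key).items():
    print(key, "->", taxa)   # one identifier per specimen
```

As Dave notes in the thread, even the longer key can fail when several preparations of one specimen are served as separate records, so this is a workaround rather than a true GUID.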

[tdwg-tapir] TapirLite and SimpleFiltering pages
by "Döring, Markus" 25 Oct '05

Hi Roger & Renato,

I somehow wasn't subscribed to the list properly, so I nearly repeated all of Renato's comments on the wiki without being aware of this mail ...

I think we said earlier that all parameters in views should be optional by nature - this is also how the pywrapper implements it. Does it make sense for any view to have mandatory parameters in filters? If a parameter is not being used in the actual view call, then this part of the filter should be ignored. Otherwise it would evaluate to false and therefore no AND combinations of parameters would be possible.

But if we have a view exposing only one parameter, let's say the objectID, and this one is not given - that also doesn't make too much sense. So this could be a case where a parameter is required.

From the formal aspect I support Renato's idea of having another flag "optional" or "required" to indicate this. We could also use an attribute "use='required'" as in attribute definitions in XML Schema. By default I think all parameters should be optional.

Most of the changes are now included in the current schema proposal, but views are still a separate category in capabilities. I don't mind changing this the way Renato showed below.

One other issue to discuss is importing/including directives in views and schemas. Would it be of great value to have includes in views? If so, would we need include directives on the base of the protocol:

<include href="dwc_base.xml" />

or should we define XML processing instructions like this:

<?include href="dwc_base.xml" ?>

If we want to be able to do includes at any place in a document, I think we have to go for PIs. Just something to think about if you are bored.

But my main concern is still the IndexingElementExplosion:
http://ww3.bgbm.org/protocolwiki/IndexingElementExplosion

I hope to have some new ideas about the explosion in my next mail.

Thanks,
Markus

----------------------------------------

Hi Roger,

No need to be nervous, tapirs are friendly animals... ;-)

I really like the idea of TapirLite. Originally the capabilities response had a specific section to indicate the supported operations. I think we could bring it back, making ping, metadata, and capabilities the only mandatory operations, as suggested. For consistency, perhaps we could make the accepted views subelements of the corresponding operation element. And since dynamic views can actually be represented by the functionality of the search operation, they would become optional. So for TapirLite implementations, that section could look like:

<operations>
  <ping/>
  <metadata/>
  <capabilities/>
  <view>
    <view identifier="http://tdwg.org/tapir/views/a" alias="a"/>
    <view identifier="http://tdwg.org/tapir/views/b" alias="b"/>
  </view>
</operations>

I also like the idea of only using view ids: GUIDs redirecting to the respective XML definitions. The alias would be the view name used in URLs.

About filtering, I think it's already possible to have an empty section "operators" in the capabilities response. And when a TapirLite provider says it understands a particular view, even if that view contains an XML-encoded filter, the provider could hard code the local translation for that filter and not necessarily be able to parse generic filters.

Regarding the new "id-defined" operator, I was wondering whether there's another way to achieve the same results. Perhaps by creating an additional attribute in the <parameter> element called "optional". "Optional" could also be optional, and when not specified the parameter would be considered mandatory. An explicit optional="true" combined with the absence of the parameter could have the effect of telling the parser to ignore that condition. Just another idea...

Best Regards,
--
Renato

On 20 Oct 2005 at 11:41, Roger Hyam wrote:

> Hi Everyone,
>
> I am nervous at being the first to post to this most esteemed list but
> here goes.
>
> I have just added two pages to the wiki concerning minor changes that
> could be made to the protocol to make it easier to implement 'Lite'
> versions of Tapir providers.
>
> http://ww3.bgbm.org/protocolwiki/TapirLite
> http://ww3.bgbm.org/protocolwiki/SimpleFiltering
>
> Please read and add your support or reservations to the wiki or
> discuss it here.
>
> All the best,
>
> Roger
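The behaviour discussed in this thread (ignore a filter condition when an optional parameter is not supplied, but reject the call when a required one is missing) can be sketched as below. Everything here is an assumption for illustration: the dictionary layout, the function names and the choice of "optional by default", which follows Markus's preference rather than Renato's variant.

```python
def build_conditions(parameter_specs, supplied):
    """Turn a view's parameter declarations into active equality conditions.

    parameter_specs: list of dicts with 'name', 'concept' and an optional
    'required' flag (assumed default: False, i.e. optional).
    supplied: dict of parameter values given in the actual view call.
    """
    conditions = []
    for spec in parameter_specs:
        name = spec["name"]
        if name in supplied:
            conditions.append((spec["concept"], supplied[name]))
        elif spec.get("required", False):
            raise ValueError(f"missing required parameter: {name}")
        # Optional and not supplied: the condition is dropped, so it
        # cannot force the AND combination below to evaluate to false.
    return conditions


def matches(record, conditions):
    """AND-combine all active conditions against one record."""
    return all(record.get(concept) == value for concept, value in conditions)


SEARCH_VIEW = [
    {"name": "country", "concept": "Country"},
    {"name": "genus", "concept": "Genus"},
]
LOOKUP_VIEW = [
    {"name": "objectID", "concept": "ObjectID", "required": True},
]

record = {"Country": "AUSTRALIA", "Genus": "Calicium"}
only_genus = build_conditions(SEARCH_VIEW, {"genus": "Calicium"})
print(only_genus, matches(record, only_genus))
```

Calling build_conditions(LOOKUP_VIEW, {}) raises ValueError, which corresponds to the single-parameter objectID case described in the message.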

Re: [tdwg-tapir] TapirLite and SimpleFiltering pages
by "Döring, Markus" 24 Oct '05

I totally agree. I can't remember who suggested the include idea, and I can't see any further arguments for it right now. Let's drop includes for now.

Markus

-----Original Message-----
From: tdwg-tapir-bounces(a)lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On behalf of Roger Hyam
Sent: Monday, 24 October 2005 12:31
To: Renato De Giovanni
Cc: tdwg-tapir(a)lists.tdwg.org
Subject: Re: [tdwg-tapir] TapirLite and SimpleFiltering pages

Include mechanisms: Does having views referred to by URL remove the need for some form of include mechanism in the view itself? The file that is retrieved from the URL can be created by whatever dynamic means is most useful: it could be XSLT or even old-fashioned server-side includes. No single view is likely to get so large that it is worth dicing up, is it? Perhaps this is something that could be pushed forward into a future version?

Renato De Giovanni wrote:

Hello Markus,

Some quick comments about the three topics:

1- Filter parameters: I do remember that we discussed this before, but somehow I had the impression that we didn't come to any concrete conclusion (and if we did, I think that's something we left out of the integration document, so I'm sorry if I forgot about this detail...). Anyway, I definitely agree that it would be better if we could have it formalized as an attribute.

2- Include directives: Not sure if there'll be significant benefits here, but if someone could give an interesting example of such functionality, maybe we can consider it.

3- "IndexingElementExplosion": I already included a comment in the wiki. I really think that's a "non-problem"...

Best Regards,
--
Renato

On 21 Oct 2005 at 19:40, Döring, Markus wrote:

Hi Roger & Renato,

I somehow wasn't subscribed to the list properly, so I nearly repeated all of Renato's comments on the wiki without being aware of this mail ...

I think we said earlier that all parameters in views should be optional by nature - this is also how the pywrapper implements it. Does it make sense for any view to have mandatory parameters in filters? If a parameter is not being used in the actual view call, then this part of the filter should be ignored. Otherwise it would evaluate to false and therefore no AND combinations of parameters would be possible. But if we have a view exposing only one parameter, let's say the objectID, and this one is not given - that also doesn't make too much sense. So this could be a case where a parameter is required.

From the formal aspect I support Renato's idea of having another flag "optional" or "required" to indicate this. We could also use an attribute "use='required'" as in attribute definitions in XML Schema. By default I think all parameters should be optional.

Most of the changes are now included in the current schema proposal, but views are still a separate category in capabilities. I don't mind changing this the way Renato showed below.

One other issue to discuss is importing/including directives in views and schemas. Would it be of great value to have includes in views? If so, would we need include directives on the base of the protocol:

  <include href="dwc_base.xml">

or should we define XML processing instructions like this:

  <?include href="dwc_base.xml" ?>

If we want to be able to do includes at any place in a document, I think we have to go for PIs. Just something to think about if you are bored.

But my main concern is still the IndexingElementExplosion:
http://ww3.bgbm.org/protocolwiki/IndexingElementExplosion

I hope to have some new ideas about the explosion in my next mail.

Thanks,
Markus

_______________________________________________
tdwg-tapir mailing list
tdwg-tapir(a)lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-tapir_lists.tdwg.org

--
-------------------------------------
Roger Hyam
Technical Architect
Taxonomic Databases Working Group
-------------------------------------
http://www.tdwg.org
roger(a)tdwg.org
+44 1578 722782
-------------------------------------
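
The rule Markus describes for view filters (a condition whose parameter was not supplied is ignored rather than evaluated to false, unless the parameter is flagged as required) can be sketched roughly as follows. This is only an illustration and not the pywrapper implementation: the names FilterParam, MissingParameter, active_conditions and matches are hypothetical, the concept paths are illustrative, and equality is assumed as the only comparison.

```python
from dataclasses import dataclass
from typing import Dict, List, Mapping


@dataclass(frozen=True)
class FilterParam:
    """One parameterised equality condition in a view filter (hypothetical model)."""
    name: str               # parameter name exposed by the view
    concept: str            # concept the supplied value is compared against
    required: bool = False  # mirrors use='required'; optional by default


class MissingParameter(ValueError):
    """Raised when a required parameter is absent from the view call."""


def active_conditions(params: List[FilterParam],
                      supplied: Mapping[str, str]) -> Dict[str, str]:
    """Return only the conditions whose parameters were actually supplied."""
    conditions: Dict[str, str] = {}
    for p in params:
        if p.name in supplied:
            conditions[p.concept] = supplied[p.name]
        elif p.required:
            raise MissingParameter(p.name)
        # Optional and not supplied: the condition is dropped, so it cannot
        # evaluate to false and break the AND combination.
    return conditions


def matches(record: Mapping[str, str], conditions: Mapping[str, str]) -> bool:
    """AND-combine the remaining conditions against one record."""
    return all(record.get(concept) == value
               for concept, value in conditions.items())
```

With two optional parameters and only one supplied, the filter reduces to a single condition; a view exposing only a required objectID would instead raise MissingParameter when called without it.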


Re: Topic 2: GUIDs for Collections and Specimens
by Donald Hobern 23 Oct '05


Thanks, Chuck, for this detailed response.

You are quite right that we need to be clear what we mean by "specimen". Your clarification of MOBOT's use of identifiers shows not only that there are many identifiers in use, but also that they may apply to any in a series of increasingly refined objects (or sets of objects), and that there are good reasons for wanting to be able to identify each item in that series. If we think of this in software modeling terms, each of these could be a separate object which could be manipulated and referenced independently of the others.

Different communities within biological collections will clearly have different series of identifiable objects. For example, an entomological collection could have the following series:

(Survey?) -> Contents of a (malaise/light/water/etc.) trap -> Individual insect -> Insect part (genitalia preparation, leg removed for DNA analysis) -> (DNA preparation?)

Handling of plankton samples, culture collections and seedbank accessions will be different again. Within botanical collections, is there any attempt to indicate that two separate collecting events relate to the same plant or clonal population?

Depending on its needs and purpose, an individual collection may track different items in these series. Individual insects may be part of a numbered series or have their own numbers.

As Chuck suggests, this means that it is not clear that we have a single common definition of "specimen" that would be accepted by all of us. My use of the word "subsample" and the phrase "identifiable set" in my original question was an attempt to recognise that one group's specimen may be seen by another group as just a part of a specimen or as a set of specimens. The ABCD Schema uses the general term Unit to reflect the variation between different items recorded by different providers.

It seems to me that there are various ways that we can try to handle this:

1. We could try to develop wording that explains what we agree to be a reasonable shared definition of a specimen that can be applied by each collection to select an appropriate identifier or require them to generate a new one. This seems unlikely ever to be successful given the wide range of situations, collections and databases that need to be covered.

2. We could let each provider give an indication of the nature of the item being referenced (sample with multiple organisms, individual organism, tissue, etc.; living material, dead material) using terminology that is appropriate to their community. This may help human readers to interpret the data but does not allow us to reason reliably about the data we receive. This is close to the approach followed today by Darwin Core (BasisOfRecord) and the ABCD Schema (Unit/RecordBasis).

3. We could work as a community to develop and enforce a controlled terminology for the nature of items referenced. By limiting the range of terms that can be used, it should in many cases be possible to reason more clearly about what each record describes.

4. We could go further and manage the controlled terminology as an ontology that includes hierarchically-arranged definitions (e.g. a CultureCollection isA LiveUnit, a HerbariumSheet isA DeadUnit) and other relationships (e.g. a Tissue derivesFrom a DeadUnit). There would be more work in doing this, but the BioMOBY project provides one example of how to build such an ontology as an open community activity.

As we consider the use of GUIDs, I would really also like us to think about the fourth of these options. Any "Unit" (or whatever else we may use as a generic term for a biological item being recorded) can be identified as belonging to a particular class of objects identified within a shared ontology. We can do this by having an element whose value must be the identifier for an object class registered in the ontology. This allows an institution to make an assertion that one record relates to an individual dead organism and that another relates to a tissue sample, and for those assertions to be ones that software applications can process. Better still, the presence of GUIDs for each of these records would allow us to add an extra element to the tissue sample record that securely identifies the specimen from which it was taken.

The bottom line here is that we certainly need to do some work to make sure that we know what we are talking about when we speak of a "specimen" (or any other similar term), but that we can use a combination of GUIDs and a shared ontology to transcend the difficulties this could present, and to construct subtle and informative webs of information.

Donald

---------------------------------------------------------------
Donald Hobern (dhobern(a)gbif.org)
Programme Officer for Data Access and Database Interoperability
Global Biodiversity Information Facility Secretariat
Universitetsparken 15, DK-2100 Copenhagen, Denmark
Tel: +45-35321483 Mobile: +45-28751483 Fax: +45-35321480
---------------------------------------------------------------
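
The fourth option Donald outlines (each record names a class registered in a shared ontology, and a GUID link records what a unit was derived from) might look something like the sketch below. The class names CultureCollection, HerbariumSheet, LiveUnit, DeadUnit and Tissue come from his message; everything else is assumed for illustration, including the root class "Unit", the placement of Tissue directly under it, the UnitRecord fields, and the helper names is_a and source_of. Any GUID strings used with it would be made-up placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional

# Hypothetical registry of ontology classes: child class -> parent class (isA).
IS_A: Dict[str, str] = {
    "LiveUnit": "Unit",
    "DeadUnit": "Unit",
    "CultureCollection": "LiveUnit",
    "HerbariumSheet": "DeadUnit",
    "Tissue": "Unit",  # assumed placement; the message does not specify it
}


def is_a(unit_class: str, ancestor: str) -> bool:
    """Walk the isA chain upwards to test class membership."""
    current: Optional[str] = unit_class
    while current is not None:
        if current == ancestor:
            return True
        current = IS_A.get(current)
    return False


@dataclass(frozen=True)
class UnitRecord:
    guid: str                           # globally unique identifier of this unit
    unit_class: str                     # must name a class in the ontology
    derives_from: Optional[str] = None  # GUID of the unit this one was taken from

    def __post_init__(self) -> None:
        if self.unit_class != "Unit" and self.unit_class not in IS_A:
            raise ValueError(f"unregistered class: {self.unit_class}")


def source_of(record: UnitRecord,
              registry: Mapping[str, UnitRecord]) -> Optional[UnitRecord]:
    """Resolve the derivesFrom link through a GUID -> record lookup."""
    if record.derives_from is None:
        return None
    return registry.get(record.derives_from)
```

Because the class is an identifier rather than free text, software can check that the source of a tissue record is some kind of DeadUnit without knowing in advance that the provider calls it a HerbariumSheet.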


Re: Topic 2: GUIDs for Collections and Specimens
by Chuck Miller 21 Oct '05


I am responding to Donald's questions as they apply at Missouri Botanical Garden.

As several have described, there are multiple layers of identification that occur with specimens, particularly botanical specimens. Our physical herbarium specimens are structured in a hierarchy, starting from the original plant that was collected down to individual pieces with labels.

COLLECTION

Identification begins at collection. Multiple "samples" are usually taken from one plant, or an entire small plant may be taken. A collector's number is assigned to the sample in the collector's field book along with notes, and the samples themselves are also numbered. Samples of other plants of the same kind may also be taken, with different numbers assigned to each in the field book and on the sample. Samples may be made up of multiple pieces - leaves and stems, fruits, seeds, bark, etc. - some may be dried, others left wet. All of the pieces/samples of the one plant described in one numbered field book entry belong to the one organism noted by the collector.

PREPARATION

The pieces of dried or wet samples are shipped back to MBG with their identifying numbers. Nowadays, the information from the field book is recorded in Tropicos, including the collector's number. A unique TropicosID number is assigned in the database to the specimen or "sample", and the data from the field book is recorded, including the collector's name and number. Accession numbers are assigned to each of the pieces of the sample that will be "mounted" in a different way. A mounting sheet has the accession number pre-printed on the sheet, and the number applies to whatever is mounted on the sheet. But a separate large fruit from the same plant would, for instance, be put in a bag and assigned a different accession number. Nowadays, these accession numbers are also recorded in Tropicos. A label is printed for the sheet and duplicate labels are printed for each of the related "accessions". They are all the same label with the TropicosID and collector's number on them.

DUPLICATES

Labels are also printed for the "duplicate" samples, but no accession numbers are assigned to them and they are not mounted. The duplicates may be sent unmounted to specialists for determination or to other herbaria. The identification of these samples/specimens is what is printed on the included label - which includes Tropicos ID, Collector's Name and Collector's Number. The receiving institution may or may not assign additional numbers, mount the sample on a sheet, database it, etc. Totally up to them.

MOUNTING

The flat pieces are mounted on the sheets; large samples may require multiple sheets for one copy. Large things (fruits, bark, branches) may be put into bags or other holding methods. A barcode number is attached to the sheet and to any additional pieces/accessions, and is recorded in Tropicos. A different barcode is on each piece or accession. So, barcodes have a one-to-one match to accession numbers. The duplicate printed labels are also attached to the sheet and any related pieces/accessions. If an attached barcode comes off and is lost, a new, replacement barcode is attached and updated in Tropicos.

The use of Lead Collector's Last Name and field book (also called catalog) number is very common in botany - e.g. CROAT 10100. The collector-number method is frequently used in reference literature, with the addition of the Index Herbariorum code for the institution where the specimen was seen or obtained. Duplicates of CROAT 10100 could be at MO, K, P, F, etc., and those sheets may have different accession numbers or no accession number at all.

Donald's Questions:

1. What identifiers (how many per specimen) get assigned to specimens in your organisation or domain (field numbers, catalogue numbers, etc.)?

On one mounted specimen sheet at MBG are the following numbers/identifiers:
- Accession number (100% unique)
- Barcode number (100% unique)
- Tropicos ID (applies to all accessions and barcodes for one sample/specimen)
- Collector's name and number (applies to all accessions, barcodes, TropicosIDs, and duplicate samples/labels sent to other institutions from the original collected organism)
All of these numbers are recorded in the Tropicos database.

2. What is the scope of uniqueness for each of these identifiers (notebook page, collector, database, institution, global, etc.)?

I attempted to describe this above.
- Collector's numbers are commonly unique to a collector and don't repeat across notebooks, but the numbers are not unique in themselves; they are unique only when combined with the collector's name.
- Accession numbers and barcodes are unique to the sheet/bag they are attached to, are one-to-one with each other, and are unique within the institution.
- TropicosID is unique within the database and the institution and is supposed to be one-to-one with collector/collector number.
- Lead collector last name plus number is unique within the database and within the institution but not unique globally.

3. Can you explain the life cycle of each of these identifiers (who assigns them, how they are subsequently tracked)?

Described at the beginning.

4. Can you give examples of how these identifiers are used to retrieve the specimen and/or information on the specimen?

The primary search for specimens in Tropicos is by collector name and number.

5. Would there be any social or technical roadblocks to replacing these identifiers with a single identifier that was guaranteed to be unique?

Technically, it would require addition of an "alias" identifier and additional programming to enable searching on the alias. Since there are 4 identifiers in hierarchical relationship, which of them could be the "single" identifier? This goes to my continuing question of "what are we trying to identify"? The original specimen (and its duplicates), a specific sheet, a specific part of a sheet, or part of a specimen in an alcohol bottle separate from the sheet?

6. In the case of subsamples from a specimen, can you identify issues around associating the sample and associated information with the source specimen and associated information?

By subsample, are we referring to the occurrence of "duplicates" of the original organism or rather to the pieces of it, like bark, fruit, leaves? What constitutes the "specimen" versus the sample? We really need to sharpen the language in these discussions to eliminate the round-robin responses that occur as everyone states their opinion of what they think the terms mean but no one decides exactly the definition to be used by everyone.

The biggest issue to me is that there are no standards for identification of anything below the level of the original collecting event, and even the collector name + number is just a common practice in botany, not a "standard" and not universal by any means. The term "accession" means different things to different institutions. Accession number at MBG refers to an associated part of a specimen, not the whole specimen. Does catalog number mean the same thing everywhere? To some it means the collector's number.

I suppose another issue is that because of the common practice in botany of collecting duplicate samples and sending them around to other institutions, any worldwide count of databased specimens that does not account for these duplicates will overstate the real number.

The subject of specimen identifiers is somewhat linked to that of collection identifiers, since Darwin Core and the ABCD Schema have used institution and collection codes together with catalogue numbers to identify specimens in the absence of GUIDs. It would also be useful here to collect information on the following:

7. How are your specimens organised into larger identifiable sets (collections, named collections, databases, institutions, etc.)?

We don't separate our collections into sets; they are all part of one herbarium collection. Accessions combine into one specimen. Duplicate specimens can be at other institutions. We do record the institutions where we know duplicates of a specimen are located, but we do not record the other institution's catalog numbers.

8. What identifiers get assigned to each of these sets in your organization or domain (institution codes, collection codes, Index Herbariorum acronyms, etc.)?

9. Can you explain the life cycle of each of these identifiers (who assigns them, how they are subsequently tracked)?

10. Can you give examples of how these identifiers are used to locate the set and/or information on the set?

11. Would there be any social or technical roadblocks to replacing these identifiers with a single identifier that was guaranteed to be unique?

Previously discussed.
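
The identifier hierarchy described above (one collector name and number, one TropicosID per specimen, and a one-to-one accession number and barcode per mounted sheet or bag) can be written down as a small data model. This is only a sketch of the relationships as described in the message, not the Tropicos schema: the class names, field names and helper functions are hypothetical, and duplicates held at other institutions are reduced to a list of herbarium codes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Accession:
    """One mounted sheet or bagged piece of a specimen."""
    accession_number: str  # unique within the institution
    barcode: str           # one-to-one with the accession number


@dataclass
class Specimen:
    """One collected organism as recorded from the field book."""
    tropicos_id: int       # unique within the database
    collector: str         # lead collector's last name
    collector_number: str  # field book number, unique only per collector
    accessions: List[Accession] = field(default_factory=list)
    duplicates_at: List[str] = field(default_factory=list)  # herbarium codes

    @property
    def collector_key(self) -> Tuple[str, str]:
        # The number alone is not unique; it must be paired with the name.
        return (self.collector.upper(), self.collector_number)


def index_by_collector(specimens: List[Specimen]) -> Dict[Tuple[str, str], Specimen]:
    """Build the primary lookup: collector name plus collector number."""
    return {s.collector_key: s for s in specimens}


def replace_barcode(specimen: Specimen, accession_number: str, new_barcode: str) -> None:
    """Attach a replacement barcode to an accession whose barcode was lost."""
    for accession in specimen.accessions:
        if accession.accession_number == accession_number:
            accession.barcode = new_barcode
            return
    raise KeyError(accession_number)
```

A lookup keyed on ("CROAT", "10100") then returns the specimen together with all of its accessions, which mirrors the search by collector name and number mentioned in the answer to question 4.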
