Re: [tdwg-tag] [tdwg-content] Canonical name parsing
Hi Donald,
scientificName, with its current definition [1], is a great term and should continue to be used as such. As with most Darwin Core terms, it offers flexibility, so it's not an impediment to publishing data. In the GBIF context, this term is considered mandatory: records without it are ignored during indexing (I believe). All of this can stay.
canonicalScientificName would be an *additional* term with a *clear rule* (see my proposed definition [2]). This is the case for other Darwin Core terms as well, such as decimalLatitude [3], minimumElevationInMeters [4] or countryCode [5]. They serve as a ready-to-use addition/alternative to verbatimLatitude [6], verbatimElevation [7] and country [8] respectively. These terms don't stop anyone from publishing data, but data publishers who can provide this kind of information have the choice to do so. It would be the same for canonicalScientificName.
And yes, an aggregator like GBIF can play an important role in providing consistent data to its users and figuring out what they really need, but not all data is consumed that way. In addition, I hope a user would be able to download cleaned data from the GBIF portal as Darwin Core. Wouldn't it be nice if the parsed canonicalScientificName created by GBIF could be provided in its proper term? There are users out there who want this!
Regards,
Peter
[1] http://rs.tdwg.org/dwc/terms/index.htm#scientificName
[2] http://code.google.com/p/darwincore/issues/detail?id=150
[3] http://rs.tdwg.org/dwc/terms/index.htm#decimalLatitude
[4] http://rs.tdwg.org/dwc/terms/index.htm#minimumElevationInMeters
[5] http://rs.tdwg.org/dwc/terms/index.htm#countryCode
[6] http://rs.tdwg.org/dwc/terms/index.htm#verbatimLatitude
[7] http://rs.tdwg.org/dwc/terms/index.htm#verbatimElevation
[8] http://rs.tdwg.org/dwc/terms/index.htm#country
On Wed, Mar 14, 2012 at 11:19, Donald Hobern (GBIF) dhobern@gbif.org wrote:
Hi Peter.
I certainly agree that aggregators only represent one use case here but, having seen a lot of the mess of real-world data, I don't believe that simply adding a new term will fix this problem for the users you describe. To get the results you want, we would need a sufficiently large majority of data sets to follow the rules perfectly that we could ignore those that were non-conformant. This would mean we should mandate that every data set must use the new element (with or without the existing scientificName element) and that they must present scientific names in the expected way (or else have their data considered non-compliant). Until now, the philosophy on publishing Darwin Core data has been to make it as easy as possible for data providers to expose their data, even at the expense of greater complexity for consumers. I suspect that we would have a lot less data available for use now if we had taken a more stringent approach.
In some ways, this proposal reminds me of the structures in ABCD which seek to offer users verbatim and more normalised ways to represent several types of information. This actually makes consuming all the possible forms of such data very complex, since a record may contain all variant forms or just any one of them. If multiple forms are available, which one should be considered the primary version?
I suspect that things may also get complicated as soon as you discuss botanical subspecies, varieties, subvarieties, forms and subforms. There are recommended ways to abbreviate the rank markers in these cases but some variation can be expected.
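To illustrate the kind of variation meant here, the following is a minimal Python sketch that rewrites a few rank-marker spellings to the abbreviations recommended for botanical names (subsp., var., subvar., f., subf.). The list of variants and the function name `normalize_rank_markers` are assumptions made for illustration; they are not taken from any GBIF library and are certainly incomplete.

```python
# Hypothetical table of rank-marker variants seen in the wild, mapped to the
# recommended abbreviation. Only exact, lowercase tokens are matched so that
# author initials such as "F." are left untouched.
RANK_MARKERS = {
    "subsp": "subsp.", "ssp.": "subsp.", "ssp": "subsp.",
    "var": "var.", "v.": "var.",
    "subvar": "subvar.",
    "fo.": "f.", "forma": "f.",
    "subf": "subf.",
}


def normalize_rank_markers(name: str) -> str:
    """Replace known rank-marker variants in a name string.

    The first token is never rewritten, so a genus or higher-taxon name
    cannot be mistaken for a marker. Unknown tokens pass through unchanged.
    """
    tokens = name.split()
    rewritten = tokens[:1] + [RANK_MARKERS.get(t, t) for t in tokens[1:]]
    return " ".join(rewritten)


if __name__ == "__main__":
    print(normalize_rank_markers("Acer saccharum ssp. nigrum"))
    print(normalize_rank_markers("Salix alba v. vitellina"))
```

A lookup table like this only covers spellings someone has already seen, which is why some variation has to be expected in practice.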
Of course aggregators should be providing more robust services for accessing exactly what you want in a consistent, predictable way. I would suggest that the best place to attack the problem is to define exactly what a typical user needs to see, and then for GBIF and similar projects to work on delivering predictable data downloads and web services that clean out all of these nomenclatural inconsistencies, and perhaps also add value in other ways such as augmenting the data with associated environmental values (as the Atlas of Living Australia does). This would allow us all to work together on developing a consistent and predictable algorithm for handling interpretation of name strings, including synonymy, misspellings, virus names and everything else that makes this such a difficult problem.
Best wishes,
Donald
Donald Hobern - GBIF Director - dhobern@gbif.org Global Biodiversity Information Facility http://www.gbif.org/ GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark Tel: +45 3532 1471 Mob: +45 2875 1471 Fax: +45 2875 1480
-----Original Message-----
From: peter.desmet.cubc@gmail.com [mailto:peter.desmet.cubc@gmail.com] On Behalf Of Peter Desmet
Sent: Wednesday, March 14, 2012 3:41 PM
To: Tim Robertson [GBIF]
Cc: Donald Hobern (GBIF); dev Developers; TDWG content mailing list; TDWG TAG mailing list; Christian Gendreau
Subject: Re: Canonical name parsing
Hi Tim,
I agree, aggregators like GBIF and Canadensys will have to deal with clean and dirty data in each field anyway: they need code libraries to deal with this and it is good that these are being developed. But that doesn't help someone who wants to use data from a Darwin Core Archive with his data in Excel, or a Roderic Page who wants to get things done for a prototype. Having to use Java libraries or even the Name Parser [1] (though both great) is a barrier to data use. Darwin Core (Archives) is not only used for machine-to-machine interaction; humans use it too, and I think we should allow easy hacking (I mean this in the good sense), especially for something as important as the scientific name. In addition, as a data publisher (e.g. for our VASCAN checklist [2]) I *do* have the information to provide a clean and simple-to-use canonicalScientificName, but I just can't share it via the otherwise excellent biodiversity sharing standard Darwin Core. I think that's a pity.
Peter
[1] http://tools.gbif.org/nameparser/
[2] http://data.canadensys.net/vascan
PS: Yes, Canadensys will use the GBIF interpretation libraries. Since we develop in Java as well, using those libraries is as easy as the proverbial "one line of code". We're looking forward to testing them and providing patches to enhance them. Open source FTW! :-)
On Wed, Mar 14, 2012 at 07:32, Tim Robertson [GBIF] trobertson@gbif.org wrote:
Hi Peter,
I'm replying off the TDWG list, since it is a bit of a tangent to your discussion. If you feel it is relevant, please CC the list again.
At GBIF, as you know, we have to interpret content of every level of quality. I tend to agree with Donald that this would not really help in consumption, as in my experience we will have to deal with both clean and dirty data in each field *anyway* when this is used at network scale. I would rather see us evolve the interpretation libraries to handle all the corner cases, which we need to develop anyway. We already do a pretty decent job at extracting canonicals. This is further enhanced when you couple the extracted canonical with a fuzzy match against the "authoritative names" we can now index thanks to the availability of checklists in DwC-A format.
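The combination described here (extract a canonical, then fuzzy-match it against a list of authoritative names) can be sketched roughly as follows. This is not the GBIF NameParser: the extraction rule is a deliberately naive assumption (keep the first token plus any following tokens that start with a lowercase letter), the function names are hypothetical, and the fuzzy match uses `difflib` from the Python standard library rather than whatever the GBIF libraries use.

```python
import difflib
from typing import List, Optional


def naive_canonical(name: str) -> str:
    """Very rough canonical extraction (an assumption, not GBIF's algorithm).

    Keeps the first token and every following token that starts with a
    lowercase letter, stopping at the first token that does not (typically
    the authorship). Lowercase author particles such as "de" will fool it.
    """
    tokens = name.split()
    kept = tokens[:1]
    for token in tokens[1:]:
        if not token[0].islower():
            break
        kept.append(token)
    return " ".join(kept)


def match_authoritative(name: str, checklist: List[str],
                        cutoff: float = 0.85) -> Optional[str]:
    """Return the closest checklist name for the extracted canonical, or None."""
    candidate = naive_canonical(name)
    matches = difflib.get_close_matches(candidate, checklist, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    checklist = ["Pinus strobus", "Picea glauca"]  # stand-in for an indexed checklist
    # A misspelled epithet with authorship still resolves to the checklist name.
    print(match_authoritative("Pinus strobis L.", checklist))
```

The cutoff of 0.85 is an arbitrary choice for the sketch; a real service would need to tune it and handle homonyms, which a plain string-similarity score cannot distinguish.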
I know you are a Java shop. Are you using the GBIF interpretation libraries [1] at the moment? If not, is there a reason why you don't? They are used in all GBIF projects (portal, checklistbank etc), and the more we enhance them, the better it is for everyone. We have significant test coverage [2,3], and quite some man-months (years?) have already been spent on their development, with input from some real regular expression experts (most notably Markus D. and Dave M.). All our work is Maven-ized, versioned and available in our Maven repository [4].
I hope these are interesting to you. We would welcome any patches to enhance them, or assistance in identifying the corner cases and capturing those as unit tests.
Hope this helps, Tim
[1] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/main/java/org/gbif/ecat/parser/NameParser.java
[2] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/test/java/org/gbif/ecat/parser/NameParserTest.java
[3] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/#src%2Ftest%2Fresources
[4] http://repository.gbif.org/index.html#nexus-search;quick~ecat-common
-- Peter Desmet Biodiversity Informatics Manager Canadensys - www.canadensys.net
Université de Montréal Biodiversity Centre 4101 rue Sherbrooke est Montreal, QC, H1X2B2 Canada
Phone: 514-343-6111 #82354 Fax: 514-343-2288 Email: peter.desmet@umontreal.ca / peter.desmet.cubc@gmail.com Skype: anderhalv Public profile: http://www.linkedin.com/in/peterdesmet
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
I guess the parts that confuse me are:
1) What providers are able to produce a canonicalScientificName as per Peter’s definition, but are unable to provide the pre-parsed elements of genus | subgenus | specificEpithet | infraspecificEpithet?
2) What consumers could make use of a canonicalScientificName as per Peter’s definition, but are unable to make (even better) use of the pre-parsed elements of genus | subgenus | specificEpithet | infraspecificEpithet?
Aloha, Rich
This message is only intended for the addressee named above. Its contents may be privileged or otherwise protected. Any unauthorized use, disclosure or copying of this message or its contents is prohibited. If you have received this message by mistake, please notify us immediately by reply mail or by collect telephone call. Any personal opinions expressed in this message do not necessarily represent the views of the Bishop Museum.
Rich,
I wish those terms were sufficient, but as mentioned in the justification for http://code.google.com/p/darwincore/issues/detail?id=150:
genus, specificEpithet, infraspecificEpithet: concatenated, these terms are identical to the canonicalScientificName for genera, species and infraspecific taxa. For higher taxa or infrageneric taxa, these terms are not sufficient. In addition, there is some ambiguity regarding the genus definition: for synonyms, is it the accepted genus or the genus that is part of the synonym name? See: http://lists.tdwg.org/pipermail/tdwg-content/2010-November/002052.html. In the former case, the genus cannot be used to concatenate a canonicalScientificName.
To give an example for a higher taxon:
scientificName: Magnoliidae Novák ex Takhtajan
taxonRank: subclass
There is no place to share the canonical name "Magnoliidae" for this taxon.
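The gap can be made concrete with a small Python sketch. It assumes a record is a plain dictionary keyed by Darwin Core term names and that the canonical name is built by simple concatenation, as the justification quoted above describes; the function name `canonical_from_parts` is hypothetical, and the ambiguity about which genus a synonym record carries is ignored.

```python
from typing import Optional


def canonical_from_parts(record: dict) -> Optional[str]:
    """Concatenate genus, specificEpithet and infraspecificEpithet.

    Returns None when there is no genus, which is the situation for taxa
    above genus rank: the parsed terms give nothing to build the name from.
    """
    genus = record.get("genus")
    if not genus:
        return None
    parts = [genus,
             record.get("specificEpithet"),
             record.get("infraspecificEpithet")]
    return " ".join(p for p in parts if p)


if __name__ == "__main__":
    species = {"genus": "Acer", "specificEpithet": "saccharum"}
    subclass = {"scientificName": "Magnoliidae Novák ex Takhtajan",
                "taxonRank": "subclass"}
    print(canonical_from_parts(species))   # built from the parsed terms
    print(canonical_from_parts(subclass))  # None: nothing to concatenate
```

For the subclass record the only route to "Magnoliidae" is parsing scientificName, which is the step a dedicated term would let publishers skip.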
Peter
'For higher taxa or infrageneric taxa, these terms are not sufficient' ... why?
Paul
________________________________
From: tdwg-tag-bounces@lists.tdwg.org [tdwg-tag-bounces@lists.tdwg.org] on behalf of Peter Desmet [peter.desmet@umontreal.ca] Sent: 14 March 2012 18:26 To: Richard Pyle Cc: TDWG content mailing list; Donald Hobern (GBIF); dev Developers; Christian Gendreau; TDWG TAG mailing list Subject: Re: [tdwg-tag] [tdwg-content] Canonical name parsing
Rich,
I wish those terms were sufficient, but as mentioned in the justification for http://code.google.com/p/darwincore/issues/detail?id=150:
genus, specificEpithet, infraspecificEpithet: concatenated, these terms are identical to the canonicalScientificName for genera, species and infraspecific taxa. For higher taxa or infrageneric taxa, these terms are not sufficient. In addition, there is some ambiguity regarding the genus definition: for synonyms, is it the accepted genus or the genus that is part of the synonym name? See: http://lists.tdwg.org/pipermail/tdwg-content/2010-November/002052.html. In the former case, the genus cannot be used to concatenate a canonicalScientificName.
To give an example for a higher taxon:
scientificName: Magnoliidae Novák ex Takhtajan
taxonRank: subclass
There is no place to share the canonical name "Magnoliidae" for this taxon.
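The concatenation described above can be sketched roughly as follows. This is only an illustration: the function name and the dictionary-based record layout are assumptions and are not part of Darwin Core or the proposal, and "Aus bus" is a placeholder name.

```python
from typing import Optional

# Darwin Core terms named in the justification above.
NAME_PART_TERMS = ("genus", "specificEpithet", "infraspecificEpithet")


def canonical_from_parts(record: dict) -> Optional[str]:
    """Concatenate the name-part terms; return None if there is no genus."""
    parts = [(record.get(term) or "").strip() for term in NAME_PART_TERMS]
    if not parts[0]:
        return None
    return " ".join(part for part in parts if part)


# Placeholder species-level record ("Aus bus" is a stand-in, not a real name).
species = {"genus": "Aus", "specificEpithet": "bus"}
# The higher-taxon example from this message: no name-part term applies.
subclass = {"scientificName": "Magnoliidae Novák ex Takhtajan", "taxonRank": "subclass"}

print(canonical_from_parts(species))   # Aus bus
print(canonical_from_parts(subclass))  # None
```

For the subclass record nothing can be concatenated, which is the gap the example points at.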
Peter
On Wed, Mar 14, 2012 at 14:13, Richard Pyle <deepreef@bishopmuseum.org> wrote:
I guess the parts that confuse me are:
1) What providers are able to produce a canonicalScientificName as per Peter’s definition, but are unable to provide the pre-parsed elements of genus | subgenus | specificEpithet | infraspecificEpithet?
2) What consumers could make use of a canonicalScientificName as per Peter’s definition, but are unable to make (even better) use of the pre-parsed elements of genus | subgenus | specificEpithet | infraspecificEpithet?
Aloha, Rich
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Peter Desmet Sent: Wednesday, March 14, 2012 7:03 AM To: Donald Hobern (GBIF) Cc: TDWG content mailing list; Christian Gendreau; Tim Robertson [GBIF]; TDWG TAG mailing list; dev Developers Subject: Re: [tdwg-content] Canonical name parsing
Hi Donald,
scientificName, with its current definition [1], is a great term and should continue to be used as such. As with most Darwin Core terms, it offers flexibility, so it's not an impediment to publishing data. In the GBIF context, this term is considered mandatory: records without it are ignored during indexing (I believe). All of this can stay.
canonicalScientificName would be an additional term with a clear rule (see my proposed definition [2]). This is the case for other Darwin Core terms as well, such as decimalLatitude [3], minimumElevationInMeters [4] or countryCode [5]. They serve as a ready-to-use addition/alternative to verbatimLatitude [6], verbatimElevation [7] and country [8] respectively. These terms don't stop anyone from publishing data, but data publishers who can provide this kind of information have the choice to do so. It would be the same for canonicalScientificName.
And yes, an aggregator like GBIF can play an important role in providing consistent data to its users and figuring out what they really need, but not all data is consumed that way. In addition, I hope a user would be able to download cleaned data from the GBIF portal as Darwin Core. Wouldn't it be nice if the parsed canonicalScientificName created by GBIF could be provided in its proper term? There are users out there who want this!
Regards,
Peter
[1] http://rs.tdwg.org/dwc/terms/index.htm#scientificName
[2] http://code.google.com/p/darwincore/issues/detail?id=150
[3] http://rs.tdwg.org/dwc/terms/index.htm#decimalLatitude
[4] http://rs.tdwg.org/dwc/terms/index.htm#minimumElevationInMeters
[5] http://rs.tdwg.org/dwc/terms/index.htm#countryCode
[6] http://rs.tdwg.org/dwc/terms/index.htm#verbatimLatitude
[7] http://rs.tdwg.org/dwc/terms/index.htm#verbatimElevation
[8] http://rs.tdwg.org/dwc/terms/index.htm#country
On Wed, Mar 14, 2012 at 11:19, Donald Hobern (GBIF) <dhobern@gbif.org> wrote:
Hi Peter.
I certainly agree that aggregators only represent one use case here but, having seen a lot of the mess of real-world data, I don't believe that simply adding a new term will fix this problem for the users you describe. To get the results you want, we would need a sufficiently large majority of data sets to follow the rules perfectly that we could ignore those that were non-conformant. This would mean we should mandate that every data set must use the new element (with or without the existing scientificName element) and that they must present scientific names in the expected way (or else have their data considered non-compliant). Until now, the philosophy on publishing Darwin Core data has been to make it as easy as possible for data providers to expose their data, even at the expense of greater complexity for consumers. I suspect that we would have a lot less data available for use now if we had taken a more stringent approach.
In some ways, this proposal reminds me of the structures in ABCD which seek to offer users verbatim and more normalised ways to represent several types of information. This actually makes consuming all the possible forms of such data very complex, since a record may contain all variant forms or just any one of them. If multiple forms are available, which one should be considered the primary version?
I suspect that things may also get complicated as soon as you discuss botanical subspecies, varieties, subvarieties, forms and subforms. There are recommended ways to abbreviate the rank markers in these cases but some variation can be expected.
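To make the rank-marker variation concrete, here is a minimal sketch of normalising a few spellings to one form. The lookup table is hypothetical and far from exhaustive, the function name is an assumption, and "Aus bus ... cus" are placeholder names rather than real taxa.

```python
# Hypothetical lookup of rank-marker spellings; the variants listed are
# examples only and do not cover everything found in real data.
RANK_MARKERS = {
    "subsp.": "subsp.", "subsp": "subsp.", "ssp.": "subsp.", "ssp": "subsp.",
    "var.": "var.", "var": "var.",
    "f.": "f.", "forma": "f.",
}


def normalize_rank_markers(name: str) -> str:
    """Replace known rank-marker variants token by token; keep other tokens."""
    tokens = name.split()
    return " ".join(RANK_MARKERS.get(token.lower(), token) for token in tokens)


print(normalize_rank_markers("Aus bus ssp. cus"))   # Aus bus subsp. cus
print(normalize_rank_markers("Aus bus forma cus"))  # Aus bus f. cus
```

A table like this only handles the spellings someone has thought to list, which is why some variation will still slip through.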
Of course aggregators should be providing more robust services for accessing exactly what you want in a consistent, predictable way and I would suggest that the best place to attack the problem is to define exactly what a typical user needs to see and then for GBIF and similar projects to work on delivering predictable data downloads and web services that clean out all of these nomenclatural inconsistencies - and perhaps also add value in other ways such as augmenting the data with associated environmental values (as the Atlas of Living Australia does). This would allow us all to work together on developing a consistent and predictable algorithm for handling interpretation of name strings, including synonymy, misspellings, virus names and everything else that makes this such a difficult problem.
Best wishes,
Donald
Donald Hobern - GBIF Director - dhobern@gbif.org Global Biodiversity Information Facility http://www.gbif.org/ GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark Tel: +45 3532 1471 Mob: +45 2875 1471 Fax: +45 2875 1480
-----Original Message----- From: peter.desmet.cubc@gmail.com [mailto:peter.desmet.cubc@gmail.com] On Behalf Of Peter Desmet Sent: Wednesday, March 14, 2012 3:41 PM To: Tim Robertson [GBIF] Cc: Donald Hobern (GBIF); dev Developers; TDWG content mailing list; TDWG TAG mailing list; Christian Gendreau Subject: Re: Canonical name parsing
Hi Tim,
I agree, aggregators like GBIF and Canadensys will have to deal with clean and dirty data in each field anyway: they need code libraries to deal with this and it is good that these are being developed. But that doesn't help someone who wants to combine data from a Darwin Core Archive with their own data in Excel, or a Roderic Page who wants to get things done for a prototype. Having to use Java libraries or even the Name Parser [1] (though both are great) is a barrier to data use. Darwin Core (Archives) is not only used for machine-to-machine interaction; humans use it too, and I think we should allow easy hacking (I mean this in the good sense), especially for something as important as the scientific name. In addition, as a data publisher (e.g. for our VASCAN checklist [2]) I *do* have the information to provide a clean and simple-to-use canonicalScientificName, but I just can't share it via the otherwise excellent biodiversity sharing standard Darwin Core. I think that's a pity.
Peter
[1] http://tools.gbif.org/nameparser/
[2] http://data.canadensys.net/vascan
PS: Yes, Canadensys will use the GBIF interpretation libraries. Since we develop in Java as well, using those libraries is as easy as the proverbial "one line of code". We're looking forward to testing them and providing patches to enhance them. Open source FTW! :-)
On Wed, Mar 14, 2012 at 07:32, Tim Robertson [GBIF] <trobertson@gbif.org> wrote:
Hi Peter,
I'm replying off the TDWG list, since it is a bit of a tangent to your discussion. If you feel it is relevant, please CC the list again.
At GBIF as you know, we have to interpret all kinds of quality of content. I tend to agree with Donald that this would not really help in consumption, as in my experience we will have to deal with both clean and dirty data in each field *anyway* when this is used at network scale. I would rather see us evolve the interpretation libraries to handle all the corner cases, which we need to develop anyway. We already do a pretty decent job at extracting canonicals. This is further enhanced when you couple the extracted canonical with a fuzzy match against the "authoritative names" we can now index thanks to the availability of checklists in DwC-A format.
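The idea of coupling an extracted canonical with a fuzzy match can be sketched like this. It uses Python's standard-library difflib rather than the GBIF libraries, so it is not how GBIF implements matching; the function name, the tiny name list and the 0.85 cutoff are all assumptions chosen for illustration.

```python
import difflib
from typing import Optional, Sequence

# Stand-in for an index of "authoritative names" built from checklists.
AUTHORITATIVE_NAMES = ["Magnoliidae", "Abies"]


def fuzzy_match(canonical: str,
                names: Sequence[str] = AUTHORITATIVE_NAMES,
                cutoff: float = 0.85) -> Optional[str]:
    """Return the closest authoritative name, or None if nothing is close."""
    matches = difflib.get_close_matches(canonical, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(fuzzy_match("Magnolidae"))  # Magnoliidae (misspelling is corrected)
print(fuzzy_match("Pinus"))       # None (not in the stand-in list)
```

A real index would hold far more names, so the cutoff would need tuning to avoid matching one valid name to a different, similarly spelled one.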
I know you are a Java shop. Are you using the GBIF interpretation libraries [1] at the moment? If not, is there a reason why you don't? They are used in all GBIF projects (portal, checklistbank etc), and the more we enhance them, the better it is for everyone. We have significant test coverage [2,3], and quite a few man-months (years?) have already been spent on their development, with some real regular expression experts (most notably Markus D. and Dave M.). All our work is Maven-ized, versioned and available in our Maven repository [4].
I hope these are interesting to you. We would welcome any patches to enhance them, or assistance in identifying the corner cases and capturing those as unit tests.
Hope this helps, Tim
[1] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/main/java/org/gbif/ecat/parser/NameParser.java
[2] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/test/java/org/gbif/ecat/parser/NameParserTest.java
[3] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/#src%2Ftest%2Fresources
[4] http://repository.gbif.org/index.html#nexus-search;quick~ecat-common
P Think Green - don't print this email unless you really need to
************************************************************************ The information contained in this e-mail and any files transmitted with it is confidential and is for the exclusive use of the intended recipient. If you are not the intended recipient please note that any distribution, copying or use of this communication or the information in it is prohibited.
Whilst CAB International trading as CABI takes steps to prevent the transmission of viruses via e-mail, we cannot guarantee that any e-mail or attachment is free from computer viruses and you are strongly advised to undertake your own anti-virus precautions.
If you have received this communication in error, please notify us by e-mail at cabi@cabi.org or by telephone on +44 (0)1491 832111 and then delete the e-mail and any copies of it.
CABI is an International Organization recognised by the UK Government under Statutory Instrument 1982 No. 1071...
**************************************************************************
Hi Paul,
Higher taxon: "Magnoliidae Novák ex Takhtajan" (a subclass).
- scientificName: Magnoliidae Novák ex Takhtajan
- taxonRank: subclass
But there are no terms to share the canonical name "Magnoliidae". The only available options are kingdom, phylum, class, order, family, genus, subgenus, specificEpithet, infraspecificEpithet, none of which are appropriate.
Solution:
- canonicalScientificName: Magnoliidae
Infrageneric taxon: "Abies sect. Amabilis (Matzenko) Farjon & Rushforth" (a section).
- scientificName: Abies sect. Amabilis (Matzenko) Farjon & Rushforth
- taxonRank: section
- genus: Abies
But there are no terms to share "Abies Amabilis", "Abies sect. Amabilis", "Abies section Amabilis" or even "Amabilis". The only available options are kingdom, phylum, class, order, family, genus, subgenus, specificEpithet, infraspecificEpithet, none of which are appropriate. Why we have subgenus but not *infragenericEpithet* is another issue; with such a term I would at least be able to share "Amabilis".
Solution:
- canonicalScientificName: Abies Amabilis
- taxonRank: section
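As a sketch of what a consumer could do with the two records above, assuming they were published as delimited text with a canonicalScientificName column: the term is only proposed, and the column layout and function name here are assumptions.

```python
import csv
import io

# The two example records above as comma-separated text.
# canonicalScientificName is the *proposed* term, not an existing one.
DATA = """scientificName,taxonRank,genus,canonicalScientificName
Magnoliidae Novák ex Takhtajan,subclass,,Magnoliidae
Abies sect. Amabilis (Matzenko) Farjon & Rushforth,section,Abies,Abies Amabilis
"""


def canonical_names(text: str) -> list:
    """Return (canonicalScientificName, taxonRank) pairs, no parsing needed."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["canonicalScientificName"], row["taxonRank"]) for row in reader]


for name, rank in canonical_names(DATA):
    print(name, rank)
```

The consumer reads the canonical name straight from the column, with no authorship stripping or name parser involved.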
Peter
Peter, you just described what TCS offered... this was all covered in the discussion on TCS (and many more things that have been discussed recently). The guide to using it covers some of the thoughts behind these issues, I think: http://www.tdwg.org/fileadmin/subgroups/tnc/User_Guide.pdf
From: tdwg-tag-bounces@lists.tdwg.org [mailto:tdwg-tag-bounces@lists.tdwg.org] On Behalf Of Peter Desmet Sent: 14 March 2012 20:11 To: Paul Kirk Cc: TDWG content mailing list; Donald Hobern (GBIF); TDWG TAG mailing list; Christian Gendreau; dev Developers Subject: Re: [tdwg-tag] [tdwg-content] Canonical name parsing
Hi Paul,
Higher taxon: "Magnoliidae Novák ex Takhtajan" (a subclass). - scientificName: Magnoliidae Novák ex Takhtajan - taxonRank: subclass But there are no terms to share the canonical name "Magnoliidae". The only available options are kingdom, phylum, class, order, family, genus, subgenus, specificEpithet, infraspecificEpithet, none of which are appropriate.
Solution: - canonicalScientificName: Magnoliidae
Infrageneric taxon: "Abies sect. Amabilis (Matzenko) Farjon & Rushforth" (a section) - scientificName: Abies sect. Amabilis (Matzenko) Farjon & Rushforth - taxonRank: section - genus: Abies But there are no terms to share "Abies Amabilis", "Abies sect. Amabilis", "Abies section Amabilis" or even "Amabilis". The only available options are kingdom, phylum, class, order, family, genus, subgenus, specificEpithet, infraspecificEpithet, none of which are appropriate. Why we have subgenus, but not infragenericEpithet is another issue. I would at least be able to share "Amabilis".
Solution: - canonicalScientificName: Abies Amabilis - taxonRank: section
Peter
There is no place to share the canonical name "Magnoliidae" for this taxon.
On Wed, Mar 14, 2012 at 14:37, Paul Kirk <p.kirk@cabi.orgmailto:p.kirk@cabi.org> wrote:
'For higher taxa or infrageneric taxa, these terms are not sufficient' ... why?
Paul
________________________________ From: tdwg-tag-bounces@lists.tdwg.orgmailto:tdwg-tag-bounces@lists.tdwg.org [tdwg-tag-bounces@lists.tdwg.orgmailto:tdwg-tag-bounces@lists.tdwg.org] on behalf of Peter Desmet [peter.desmet@umontreal.camailto:peter.desmet@umontreal.ca] Sent: 14 March 2012 18:26 To: Richard Pyle Cc: TDWG content mailing list; Donald Hobern (GBIF); dev Developers; Christian Gendreau; TDWG TAG mailing list
Subject: Re: [tdwg-tag] [tdwg-content] Canonical name parsing
Rich,
I wished those terms were sufficient, but as mentioned in the justification for http://code.google.com/p/darwincore/issues/detail?id=150:
genus, specificEpithet, infraspecificEpithet: concatenated, this terms are identical to the canonicalScientificName for genera, species and infraspecific taxa. For higher taxa or infrageneric taxa, these terms are not sufficient. In addition, there is some ambiguity regarding the genus definition: for synonyms, is it the accepted genus or the genus that is part of the synonym name? See: http://lists.tdwg.org/pipermail/tdwg-content/2010-November/002052.html. In the former case, the genus cannot be used to concatenate a canonicalScientificName. To give an example for a higher taxon: scientificName: Magnoliidae Novák ex Takhtajan taxonRank: subclass
There is no place to share the canonical name "Magnoliidae" for this taxon.
Peter
On Wed, Mar 14, 2012 at 14:13, Richard Pyle <deepreef@bishopmuseum.orgmailto:deepreef@bishopmuseum.org> wrote:
I guess the parts that confuse me are:
1) What providers are able to produce a canonicalScientificName as per Peter’s definition, but are unable to provide the pre-parsed elements of genus | subgenus | specificEpithet | infraspecificEpithet?
2) What consumers could make use of a canonicalScientificName as per Peter’s definition, but are unable to make (even better) use of the pre-parsed elements of genus | subgenus | specificEpithet | infraspecificEpithet?
Aloha, Rich
From: tdwg-content-bounces@lists.tdwg.orgmailto:tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.orgmailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Peter Desmet Sent: Wednesday, March 14, 2012 7:03 AM To: Donald Hobern (GBIF) Cc: TDWG content mailing list; Christian Gendreau; Tim Robertson [GBIF]; TDWG TAG mailing list; dev Developers Subject: Re: [tdwg-content] Canonical name parsing
Hi Donald,
scientificName, with its current definition [1], is a great term and should continue to be used as such. As with most Darwin Core terms, it offers flexibility, so it's not an impediment to publishing data. In the GBIF context, this term is considered mandatory: records without it are ignored during indexing (I believe). All of this can stay.
canonicalScientificName would be an additional term with a clear rule (see my proposed definition [2]). This is the case for other Darwin Core terms as well, such as decimalLatitude [3], minimumElevationInMeters [4] or countryCode [5]. They serve as a ready-to-use addition/alternative to verbatimLatitude [6], verbatimElevation [7] and country [8] respectively. These terms don't stop anyone from publishing data, but data publishers who can provide this kind of information have the choice to do so. It would be the same for canonicalScientificName.
And yes, an aggregator like GBIF can play an important role in providing consistent data to its users and figuring out what they really need, but not all data is consumed that way. In addition, I hope a user would be able to download cleaned data from the GBIF portal as Darwin Core. Wouldn't it be nice that the parsed canonicalScientificName created by GBIF can be provided in its proper term? There are users out there who want this!
Regards,
Peter
[1] http://rs.tdwg.org/dwc/terms/index.htm#scientificName
[2] http://code.google.com/p/darwincore/issues/detail?id=150
[3] http://rs.tdwg.org/dwc/terms/index.htm#decimalLatitude
[4] http://rs.tdwg.org/dwc/terms/index.htm#minimumElevationInMeters
[5] http://rs.tdwg.org/dwc/terms/index.htm#countryCode
[6] http://rs.tdwg.org/dwc/terms/index.htm#verbatimLatitude
[7] http://rs.tdwg.org/dwc/terms/index.htm#verbatimElevation
[8] http://rs.tdwg.org/dwc/terms/index.htm#country
On Wed, Mar 14, 2012 at 11:19, Donald Hobern (GBIF) <dhobern@gbif.org> wrote:
Hi Peter.
I certainly agree that aggregators only represent one use case here but, having seen a lot of the mess of real-world data, I don't believe that simply adding a new term will fix this problem for the users you describe. To get the results you want, we would need a sufficiently large majority of data sets to follow the rules perfectly that we could ignore those that were non-conformant. This would mean we should mandate that every data set must use the new element (with or without the existing scientificName element) and that they must present scientific names in the expected way (or else have their data considered non-compliant). Until now, the philosophy on publishing Darwin Core data has been to make it as easy as possible for data providers to expose their data, even at the expense of greater complexity for consumers. I suspect that we would have a lot less data available for use now if we had taken a more stringent approach.
In some ways, this proposal reminds me of the structures in ABCD which seek to offer users verbatim and more normalised ways to represent several types of information. This actually makes consuming all the possible forms of such data very complex, since a record may contain all variant forms or just any one of them. If multiple forms are available, which one should be considered the primary version?
I suspect that things may also get complicated as soon as you discuss botanical subspecies, varieties, subvarieties, forms and subforms. There are recommended ways to abbreviate the rank markers in these cases but some variation can be expected.
Of course aggregators should be providing more robust services for accessing exactly what you want in a consistent, predictable way and I would suggest that the best place to attack the problem is to define exactly what a typical user needs to see and then for GBIF and similar projects to work on delivering predictable data downloads and web services that clean out all of these nomenclatural inconsistencies - and perhaps also add value in other ways such as augmenting the data with associated environmental values (as the Atlas of Living Australia does). This would allow us all to work together on developing a consistent and predictable algorithm for handling interpretation of name strings, including synonymy, misspellings, virus names and everything else that makes this such a difficult problem.
Best wishes,
Donald
Donald Hobern - GBIF Director - dhobern@gbif.org
Global Biodiversity Information Facility http://www.gbif.org/
GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark
Tel: +45 3532 1471 Mob: +45 2875 1471 Fax: +45 2875 1480
-----Original Message-----
From: peter.desmet.cubc@gmail.com [mailto:peter.desmet.cubc@gmail.com] On Behalf Of Peter Desmet
Sent: Wednesday, March 14, 2012 3:41 PM
To: Tim Robertson [GBIF]
Cc: Donald Hobern (GBIF); dev Developers; TDWG content mailing list; TDWG TAG mailing list; Christian Gendreau
Subject: Re: Canonical name parsing
Hi Tim,
I agree, aggregators like GBIF and Canadensys will have to deal with clean and dirty data in each field anyway: they need code libraries to deal with this and it is good that these are being developed. But, that doesn't help someone who wants to use data from a Darwin Core Archive with his data in Excel or a Roderic Page who wants to get things done for a prototype. Having to use Java libraries or even the Name Parser [1] (though both great) is a barrier to data use. Darwin Core (Archives) is not only used for machine to machine interaction, humans use it too, and I think we should allow easy hacking (I mean this in the good sense), especially for something as important as the scientific name. In addition, as a data publisher (e.g. for our VASCAN checklist) I *do* have the information to provide a clean and simple to use canonicalScientificName, but I just can't share it via the otherwise excellent biodiversity sharing standard Darwin Core. I think that's a pity.
Peter
[1] http://tools.gbif.org/nameparser/ [2] http://data.canadensys.net/vascan
PS: Yes, Canadensys will use the GBIF interpretation libraries. Since we develop in Java as well, using those libraries is as easy as the proverbial "one line of code". We're looking forward to testing them and providing patches to enhance them. Open source FTW! :-)
On Wed, Mar 14, 2012 at 07:32, Tim Robertson [GBIF] <trobertson@gbif.org> wrote:
Hi Peter,
I'm replying off the TDWG list, since it is a bit of a tangent to your discussion. If you feel it is relevant, please CC the list again.
At GBIF as you know, we have to interpret all kinds of quality of content. I tend to agree with Donald that this would not really help in consumption, as in my experience we will have to deal with both clean and dirty data in each field *anyway* when this is used at network scale. I would rather see us evolve the interpretation libraries to handle all the corner cases, which we need to develop anyway. We already do a pretty decent job at extracting canonicals. This is further enhanced when you couple the extracted canonical with a fuzzy match against the "authoritative names" we can now index thanks to the availability of checklists in DwC-A format.
I know you are a Java shop. Are you using the GBIF interpretation libraries [1] at the moment? If not, is there a reason why you don't? They are used in all GBIF projects (portal, checklistbank etc), and the more we enhance them, the better it is for everyone. We have a significant test coverage [2,3] and there have been quite some man months (years?) spent already in their development and with some real regular expression experts (most notably Markus D. and Dave M.). All our work is Maven-ized, versioned and available in our Maven repository [4].
I hope these are interesting to you. We would welcome any patches to enhance them, or assistance in identifying the corner cases and capturing those as unit tests.
Hope this helps, Tim
[1] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/main/java/org/gbif/ecat/parser/NameParser.java
[2] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/test/java/org/gbif/ecat/parser/NameParserTest.java
[3] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/#src%2Ftest%2Fresources
[4] http://repository.gbif.org/index.html#nexus-search;quick~ecat-common
-- Peter Desmet Biodiversity Informatics Manager Canadensys - www.canadensys.net
Université de Montréal Biodiversity Centre 4101 rue Sherbrooke est Montreal, QC, H1X2B2 Canada
Phone: 514-343-6111 #82354 Fax: 514-343-2288 Email: peter.desmet@umontreal.ca / peter.desmet.cubc@gmail.com Skype: anderhalv Public profile: http://www.linkedin.com/in/peterdesmet
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
This message is only intended for the addressee named above. Its contents may be privileged or otherwise protected. Any unauthorized use, disclosure or copying of this message or its contents is prohibited. If you have received this message by mistake, please notify us immediately by reply mail or by collect telephone call. Any personal opinions expressed in this message do not necessarily represent the views of the Bishop Museum.
Think Green - don't print this email unless you really need to
The information contained in this e-mail and any files transmitted with it is confidential and is for the exclusive use of the intended recipient. If you are not the intended recipient please note that any distribution, copying or use of this communication or the information in it is prohibited.
Whilst CAB International trading as CABI takes steps to prevent the transmission of viruses via e-mail, we cannot guarantee that any e-mail or attachment is free from computer viruses and you are strongly advised to undertake your own anti-virus precautions. If you have received this communication in error, please notify us by e-mail at cabi@cabi.org or by telephone on +44 (0)1491 832111 and then delete the e-mail and any copies of it.
CABI is an International Organization recognised by the UK Government under Statutory Instrument 1982 No. 1071...
tdwg-tag mailing list tdwg-tag@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-tag
Edinburgh Napier University is one of Scotland's top universities for graduate employability. 93.2% of graduates are in work or further study within six months of leaving. The university is also proud winner of the Queen's Anniversary Prize for Higher and Further Education 2009, awarded for innovative housing construction for environmental benefit and quality of life.
This message is intended for the addressee(s) only and should not be read, copied or disclosed to anyone else outwith the University without the permission of the sender. It is your responsibility to ensure that this message and any attachments are scanned for viruses or other defects. Edinburgh Napier University does not accept liability for any loss or damage which may result from this email or any attachment, or for errors or omissions arising after it was sent. Email is not a secure medium. Email entering the University's system is subject to routine monitoring and filtering by the University.
Edinburgh Napier University is a registered Scottish charity. Registration number SC018373
I do not have an opinion on this issue but wanted to note that the TaxonName part of the TDWG Ontology appears to be fully functional. Given that the TaxonName and TaxonConcept ontologies are based on TCS, there may be existing terms (based on TCS) with stable URIs to represent exactly what people want to say. They wouldn't be Darwin Core terms, but they would be defined and have stable URIs nonetheless. For example,
http://rs.tdwg.org/ontology/voc/TaxonName#nameComplete which can be abbreviated tn:nameComplete where tn:=http://rs.tdwg.org/ontology/voc/TaxonName#
is defined as "The complete uninomial, binomial or trinomial name without any authority or year components."
Thus one could mark up data as <tn:nameComplete>Homo sapiens</tn:nameComplete> and theoretically this would have meaning to the extent to which people take the TDWG Ontology seriously. But that is a different item for discussion...
Steve
To view the rdf, see: http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/Taxo... and http://code.google.com/p/tdwg-ontology/source/browse/trunk/ontology/voc/Taxo...
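As a rough illustration of the markup Steve describes, the following Python sketch serialises a name as a tn:nameComplete element using only the standard library. The helper name name_complete_element is hypothetical; only the namespace URI and the nameComplete term are taken from the message above, and nothing here is prescribed by the TDWG Ontology itself.

```python
import xml.etree.ElementTree as ET

# Namespace URI quoted in the message; "tn" is the abbreviation used there.
TN = "http://rs.tdwg.org/ontology/voc/TaxonName#"
ET.register_namespace("tn", TN)


def name_complete_element(name: str) -> str:
    """Return a serialised tn:nameComplete element holding the given name.

    Hypothetical helper for illustration only.
    """
    elem = ET.Element(f"{{{TN}}}nameComplete")
    elem.text = name
    return ET.tostring(elem, encoding="unicode")


if __name__ == "__main__":
    print(name_complete_element("Homo sapiens"))
```

The serialiser also emits the xmlns:tn declaration on the element, so the fragment stays meaningful when lifted out of a larger RDF document.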
On 3/14/2012 3:15 PM, Kennedy, Jessie wrote:
Peter – you just described what TCS offered… this was all covered in the discussion on TCS… (and many more things that have been discussed recently)
The guide to using it covers some of the thoughts behind these issues I think…
http://www.tdwg.org/fileadmin/subgroups/tnc/User_Guide.pdf
From: tdwg-tag-bounces@lists.tdwg.org [mailto:tdwg-tag-bounces@lists.tdwg.org] On Behalf Of Peter Desmet
Sent: 14 March 2012 20:11
To: Paul Kirk
Cc: TDWG content mailing list; Donald Hobern (GBIF); TDWG TAG mailing list; Christian Gendreau; dev Developers
Subject: Re: [tdwg-tag] [tdwg-content] Canonical name parsing
Hi Paul,
Higher taxon: "Magnoliidae Novák ex Takhtajan" (a subclass).
scientificName: Magnoliidae Novák ex Takhtajan
taxonRank: subclass
But there are no terms to share the canonical name "Magnoliidae". The only available options are kingdom, phylum, class, order, family, genus, subgenus, specificEpithet, infraspecificEpithet, none of which are appropriate.
Solution:
- canonicalScientificName: Magnoliidae
Infrageneric taxon: "Abies sect. Amabilis (Matzenko) Farjon & Rushforth" (a section)
scientificName: Abies sect. Amabilis (Matzenko) Farjon & Rushforth
taxonRank: section
genus: Abies
But there are no terms to share "Abies Amabilis", "Abies sect. Amabilis", "Abies section Amabilis" or even "Amabilis". The only available options are kingdom, phylum, class, order, family, genus, subgenus, specificEpithet, infraspecificEpithet, none of which are appropriate. Why we have subgenus, but not *infragenericEpithet* is another issue. I would at least be able to share "Amabilis".
Solution:
canonicalScientificName: Abies Amabilis
taxonRank: section
Peter
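To make the gap in these two examples concrete, here is a minimal Python sketch of how a consumer might look for a canonical name in a record. It assumes records are plain dicts keyed by Darwin Core term names, and it treats canonicalScientificName as if it existed, although it is only the term proposed in this thread. The function name canonical_name is hypothetical.

```python
from typing import Optional

# Existing Darwin Core terms that can be concatenated into a canonical name.
PARSED_TERMS = ("genus", "specificEpithet", "infraspecificEpithet")


def canonical_name(record: dict) -> Optional[str]:
    """Prefer the proposed canonicalScientificName; otherwise fall back to
    concatenating the parsed terms. Returns None when neither is available."""
    explicit = record.get("canonicalScientificName")  # proposed term, not in Darwin Core
    if explicit:
        return explicit
    parts = [record.get(term) for term in PARSED_TERMS]
    joined = " ".join(part for part in parts if part)
    return joined or None


if __name__ == "__main__":
    subclass = {"scientificName": "Magnoliidae Novák ex Takhtajan", "taxonRank": "subclass"}
    section = {"scientificName": "Abies sect. Amabilis (Matzenko) Farjon & Rushforth",
               "taxonRank": "section", "genus": "Abies"}
    print(canonical_name(subclass))  # nothing to concatenate for a higher taxon
    print(canonical_name(section))   # only the genus survives for a section
```

The fallback also inherits the genus ambiguity mentioned earlier in the thread: if genus holds the accepted genus of a synonym, the concatenated string is not the name that was published.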
On Wed, Mar 14, 2012 at 14:37, Paul Kirk <p.kirk@cabi.org> wrote:
'For higher taxa or infrageneric taxa, these terms are not sufficient' ... why?
Paul
From: tdwg-tag-bounces@lists.tdwg.org [tdwg-tag-bounces@lists.tdwg.org] on behalf of Peter Desmet [peter.desmet@umontreal.ca]
Sent: 14 March 2012 18:26
To: Richard Pyle
Cc: TDWG content mailing list; Donald Hobern (GBIF); dev Developers; Christian Gendreau; TDWG TAG mailing list
Subject: Re: [tdwg-tag] [tdwg-content] Canonical name parsing
Rich,
I wished those terms were sufficient, but as mentioned in the justification for http://code.google.com/p/darwincore/issues/detail?id=150:
genus, specificEpithet, infraspecificEpithet: concatenated, these terms are identical to the canonicalScientificName for genera, species and infraspecific taxa. For higher taxa or infrageneric taxa, these terms are not sufficient. In addition, there is some ambiguity regarding the genus definition: for synonyms, is it the accepted genus or the genus that is part of the synonym name? See: http://lists.tdwg.org/pipermail/tdwg-content/2010-November/002052.html. In the former case, the genus cannot be used to concatenate a canonicalScientificName.
To give an example for a higher taxon:
scientificName: Magnoliidae Novák ex Takhtajan
taxonRank: subclass
There is no place to share the canonical name "Magnoliidae" for this taxon.
Peter
Hi Peter,
Right -- so the problem is in how to provide monomials at ranks other than those provided by DwC (kingdom | phylum | class | order | family | genus | subgenus | specificEpithet | infraspecificEpithet). These would include all the subs and supers and infra-rank-group ranks, etc.
I think the rationale the last time this was discussed was that these rarely come with authorships, so in most cases the scientificName will be identical to what you describe for canonicalScientificName. In cases where there is one of these names not covered by the existing canonical terms that *does* have an authorship, then in theory it should be reasonably straightforward to strip the canonical bit out of scientificName in almost all cases by trimming after the first space (monomials not having spaces within the namestring itself).
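A minimal sketch of the "trim after the first space" idea described above, assuming the input really is a monomial followed by an optional authorship. The function name `canonical_monomial` is made up for illustration and is not part of Darwin Core or any GBIF library.

```python
def canonical_monomial(scientific_name: str) -> str:
    """Return the part of a monomial name string before the first whitespace.

    Assumption: the name is a monomial (e.g. a subclass or superfamily), so the
    name itself contains no spaces and anything after the first space is authorship.
    """
    tokens = scientific_name.split(None, 1)  # split once on any run of whitespace
    return tokens[0] if tokens else ""


# Example taken from elsewhere in this thread:
print(canonical_monomial("Magnoliidae Novák ex Takhtajan"))
```

Applied to a binomial or trinomial, this would return only the genus, which is why the approach is limited to ranks that have no parsed term of their own.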
No, it's not perfect; and yes, it's less than ideal. But the small amount of noise generated by this is VASTLY smaller than the noise that already exists in the source data.
Another limitation is for cases of infrasubspecific quadrinomials (where you have Genus species subspecies variety; or Genus species subspecies form; or Genus species variety form; or Genus species subspecies variety form; etc.). In those cases, the convention is to only provide the terminal epithet for infraspecificEpithet, and the full-context name in scientificName.
And, of course, there is the problem that Gregor mentioned about autonyms -- although with the parsed content, I still think it's fairly simple to reconstruct the full string algorithmically. It can be stripped by simply finding the scientificNameAuthorship *within* scientificName (doesn't need to be at the end of scientificName) and excising it; and to concatenate it from the parsed bits, just look to see if specificEpithet=infraspecificEpithet and check nomenclaturalCode, and place the authorship in the concatenated form appropriately.
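The two operations described above (excising the authorship from scientificName, and re-concatenating the parsed bits with the authorship in the right place for autonyms) could be sketched roughly as follows. This is an illustration under stated assumptions, not anyone's actual implementation: the function names, the `rank_marker` parameter (Darwin Core has no dedicated term for the rank marker among those listed above) and the set of code labels treated as botanical are all hypothetical.

```python
# Assumption: these nomenclaturalCode values denote the botanical code.
BOTANICAL_CODES = {"ICBN", "ICN", "ICNAFP"}


def strip_authorship(scientific_name: str, authorship: str) -> str:
    """Excise the authorship wherever it occurs inside the name string."""
    if not authorship or authorship not in scientific_name:
        return " ".join(scientific_name.split())
    without = scientific_name.replace(authorship, "", 1)  # first occurrence only
    return " ".join(without.split())  # collapse the gap left behind


def build_scientific_name(genus, specific_epithet, infraspecific_epithet="",
                          authorship="", rank_marker="", nomenclatural_code=""):
    """Concatenate parsed parts, placing the authorship appropriately.

    For a botanical autonym (specificEpithet == infraspecificEpithet) the
    authorship follows the specific epithet; otherwise it goes at the end.
    """
    species = [genus, specific_epithet]
    infra = [rank_marker, infraspecific_epithet] if infraspecific_epithet else []
    is_autonym = (
        bool(infraspecific_epithet)
        and specific_epithet == infraspecific_epithet
        and nomenclatural_code.upper() in BOTANICAL_CODES
    )
    parts = species + [authorship] + infra if is_autonym else species + infra + [authorship]
    return " ".join(p for p in parts if p)


# "Aus bus cus" are placeholder names, not real taxa.
print(strip_authorship("Poa pratensis L. subsp. pratensis", "L."))
print(build_scientific_name("Poa", "pratensis", "pratensis", "L.", "subsp.", "ICBN"))
print(build_scientific_name("Aus", "bus", "cus", "Smith", "subsp.", "ICBN"))
```

A plain substring search will misfire if the authorship string happens to occur inside an epithet, so a real implementation would need to be more careful than this.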
The hybrid issue is a complex one, and isn't really helped that much by the addition of canonicalScientificName.
But I think Rod captured this one best when he asked, "Why do we let edge cases determine what we do?"
If monomials with authorship at ranks not covered by the existing DwC terms, and/or the need to have parsed quadrinomials, represent a real problem for sharing data via DwC, then I would agree with Donald that the best approach would be to refine the definition of scientificName such that it should only include authorship when the provider is unable to parse the authorship bits from the name bits. When the provider can provide them pre-parsed, then scientificName should be effectively the same as what you proposed for canonicalScientificName, and the authorship bits should be included as scientificNameAuthorship. Of course, that leaves a problem for the data consumer: when a record has no scientificNameAuthorship, should that be interpreted as a case where no authorship information is known, or as a case where the authorship has not been parsed?
Don't get me wrong -- I understand why there would be some value in having something like canonicalScientificName (like I said, this has already been suggested and debated, so the potential need is there). However, part of the problem is in how that thing is defined. Paul has already indicated the issue of whether or not to include the infraspecific rank tag. We may have some people who would like canonicalScientificNameWithRanks, and some who would like canonicalScientificNameWithoutRanks; then we might have some who want canonicalScientificNameWithInfrageneric, and some canonicalScientificNameWithoutInfrageneric; then we might have some who want canonicalScientificNameTrinomialOnly...
The point here is that I suspect we will find that different communities have slightly different ideas about what they would want in a canonicalScientificName term definition.
Again, I haven't looked up the dialog from the last time we had this conversation in the context of DwC, but a lot of what I write above sounds familiar to me; so I think these are some of the reasons why the decision was made not to create a canonicalScientificName term.
I'm not saying it's not useful, and I'm not even saying I'm opposed to the idea. But I am saying that there are non-obvious reasons why it could, as Donald has suggested, either simply move the same problem from one place to another, or ultimately create more noise and confusion than it attempts to solve.
Finally, I agree with Jessie that TCS confronted this issue and, in my opinion, dealt with it elegantly (that was another place this conversation happened). But I was focusing on DwC, as that is how the original question was framed.
Aloha, Rich
Richard L. Pyle, PhD Database Coordinator for Natural Sciences Associate Zoologist in Ichthyology Dive Safety Officer Department of Natural Sciences, Bishop Museum 1525 Bernice St., Honolulu, HI 96817 Ph: (808)848-4115, Fax: (808)847-8252 email: deepreef@bishopmuseum.org http://hbs.bishopmuseum.org/staff/pylerichard.html
Note: This disclaimer formally apologizes for the disclaimer below, over which I have no control.
From: peter.desmet.cubc@gmail.com [mailto:peter.desmet.cubc@gmail.com] On Behalf Of Peter Desmet
Sent: Wednesday, March 14, 2012 8:27 AM
To: Richard Pyle
Cc: Donald Hobern (GBIF); TDWG content mailing list; dev Developers; Christian Gendreau; Tim Robertson [GBIF]; TDWG TAG mailing list
Subject: Re: [tdwg-content] Canonical name parsing
Rich,
I wish those terms were sufficient, but as mentioned in the justification for http://code.google.com/p/darwincore/issues/detail?id=150:
genus, specificEpithet, infraspecificEpithet: concatenated, these terms are identical to the canonicalScientificName for genera, species and infraspecific taxa. For higher taxa or infrageneric taxa, these terms are not sufficient. In addition, there is some ambiguity regarding the genus definition: for synonyms, is it the accepted genus or the genus that is part of the synonym name? See: http://lists.tdwg.org/pipermail/tdwg-content/2010-November/002052.html. In the former case, the genus cannot be used to concatenate a canonicalScientificName.
To give an example for a higher taxon:
scientificName: Magnoliidae Novák ex Takhtajan
taxonRank: subclass
There is no place to share the canonical name "Magnoliidae" for this taxon.
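To make the gap concrete, here is a small sketch of what a consumer could do with the existing parsed terms alone. The function name `canonical_from_parsed_terms` is hypothetical, and the record layout (a plain dict keyed by Darwin Core term names) is an assumption made for illustration.

```python
from typing import Optional


def canonical_from_parsed_terms(record: dict) -> Optional[str]:
    """Concatenate genus, specificEpithet and infraspecificEpithet if present.

    Returns None when none of the three terms is filled in, which is the
    situation for higher taxa such as the subclass example above.
    """
    terms = ("genus", "specificEpithet", "infraspecificEpithet")
    parts = [v for v in ((record.get(t) or "").strip() for t in terms) if v]
    return " ".join(parts) if parts else None


species = {"scientificName": "Poa pratensis L.", "genus": "Poa", "specificEpithet": "pratensis"}
subclass = {"scientificName": "Magnoliidae Novák ex Takhtajan", "taxonRank": "subclass"}

print(canonical_from_parsed_terms(species))   # name assembled from parsed terms
print(canonical_from_parsed_terms(subclass))  # nothing to assemble from
```

The sketch also inherits the genus ambiguity mentioned above: if genus holds the accepted genus of a synonym, the concatenated string would not match the synonym's own name.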
Peter
On Wed, Mar 14, 2012 at 14:13, Richard Pyle deepreef@bishopmuseum.org wrote:
I guess the parts that confuse me are:
1) What providers are able to produce a canonicalScientificName as per Peter’s definition, but are unable to provide the pre-parsed elements of genus | subgenus | specificEpithet | infraspecificEpithet?
2) What consumers could make use of a canonicalScientificName as per Peter’s definition, but are unable to make (even better) use of the pre-parsed elements of genus | subgenus | specificEpithet | infraspecificEpithet?
Aloha, Rich
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Peter Desmet
Sent: Wednesday, March 14, 2012 7:03 AM
To: Donald Hobern (GBIF)
Cc: TDWG content mailing list; Christian Gendreau; Tim Robertson [GBIF]; TDWG TAG mailing list; dev Developers
Subject: Re: [tdwg-content] Canonical name parsing
Hi Donald,
scientificName, with its current definition [1], is a great term and should continue to be used as such. As with most Darwin Core terms, it offers flexibility, so it's not an impediment for publishing data. In the GBIF context, this term is considered mandatory: records without it are ignored during indexing (I believe). All of this can stay.
canonicalScientificName would be an additional term with a clear rule (see my proposed definition [2]). This is the case for other Darwin Core terms as well, such as decimalLatitude [3], minimumElevationInMeters [4] or countryCode [5]. They serve as a ready-to-use addition/alternative to verbatimLatitude [6], verbatimElevation [7] and country [8] respectively. These terms don't stop anyone from publishing data, but data publishers who can provide this kind of information have the choice to do so. It would be the same for canonicalScientificName.
And yes, an aggregator like GBIF can play an important role in providing consistent data to its users and figuring out what they really need, but not all data is consumed that way. In addition, I hope a user would be able to download cleaned data from the GBIF portal as Darwin Core. Wouldn't it be nice if the parsed canonicalScientificName created by GBIF could be provided in its proper term? There are users out there who want this!
Regards,
Peter
[1] http://rs.tdwg.org/dwc/terms/index.htm#scientificName [2] http://code.google.com/p/darwincore/issues/detail?id=150 [3] http://rs.tdwg.org/dwc/terms/index.htm#decimalLatitude [4] http://rs.tdwg.org/dwc/terms/index.htm#minimumElevationInMeters [5] http://rs.tdwg.org/dwc/terms/index.htm#countryCode [6] http://rs.tdwg.org/dwc/terms/index.htm#verbatimLatitude [7] http://rs.tdwg.org/dwc/terms/index.htm#verbatimElevation [8] http://rs.tdwg.org/dwc/terms/index.htm#country
On Wed, Mar 14, 2012 at 11:19, Donald Hobern (GBIF) dhobern@gbif.org wrote:
Hi Peter.
I certainly agree that aggregators only represent one use case here but, having seen a lot of the mess of real-world data, I don't believe that simply adding a new term will fix this problem for the users you describe. To get the results you want, we would need a sufficiently large majority of data sets to follow the rules perfectly that we could ignore those that were non-conformant. This would mean we should mandate that every data set must use the new element (with or without the existing scientificName element) and that they must present scientific names in the expected way (or else have their data considered non-compliant). Until now, the philosophy on publishing Darwin Core data has been to make it as easy as possible for data providers to expose their data, even at the expense of greater complexity for consumers. I suspect that we would have a lot less data available for use now if we had taken a more stringent approach.
In some ways, this proposal reminds me of the structures in ABCD which seek to offer users verbatim and more normalised ways to represent several types of information. This actually makes consuming all the possible forms of such data very complex, since a record may contain all variant forms or just any one of them. If multiple forms are available, which one should be considered the primary version?
I suspect that things may also get complicated as soon as you discuss botanical subspecies, varieties, subvarieties, forms and subforms. There are recommended ways to abbreviate the rank markers in these cases but some variation can be expected.
Of course aggregators should be providing more robust services for accessing exactly what you want in a consistent, predictable way and I would suggest that the best place to attack the problem is to define exactly what a typical user needs to see and then for GBIF and similar projects to work on delivering predictable data downloads and web services that clean out all of these nomenclatural inconsistencies - and perhaps also add value in other ways such as augmenting the data with associated environmental values (as the Atlas of Living Australia does). This would allow us all to work together on developing a consistent and predictable algorithm for handling interpretation of name strings, including synonymy, misspellings, virus names and everything else that makes this such a difficult problem.
Best wishes,
Donald
Donald Hobern - GBIF Director - dhobern@gbif.org Global Biodiversity Information Facility http://www.gbif.org/ GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark Tel: +45 3532 1471 Mob: +45 2875 1471 Fax: +45 2875 1480
-----Original Message-----
From: peter.desmet.cubc@gmail.com [mailto:peter.desmet.cubc@gmail.com] On Behalf Of Peter Desmet
Sent: Wednesday, March 14, 2012 3:41 PM
To: Tim Robertson [GBIF]
Cc: Donald Hobern (GBIF); dev Developers; TDWG content mailing list; TDWG TAG mailing list; Christian Gendreau
Subject: Re: Canonical name parsing
Hi Tim,
I agree, aggregators like GBIF and Canadensys will have to deal with clean and dirty data in each field anyway: they need code libraries to deal with this and it is good that these are being developed. But that doesn't help someone who wants to use data from a Darwin Core Archive with his data in Excel or a Roderic Page who wants to get things done for a prototype. Having to use Java libraries or even the Name Parser [1] (though both great) is a barrier to data use. Darwin Core (Archives) is not only used for machine-to-machine interaction, humans use it too, and I think we should allow easy hacking (I mean this in the good sense), especially for something as important as the scientific name. In addition, as a data publisher (e.g. for our VASCAN checklist [2]) I *do* have the information to provide a clean and simple-to-use canonicalScientificName, but I just can't share it via the otherwise excellent biodiversity sharing standard Darwin Core. I think that's a pity.
Peter
[1] http://tools.gbif.org/nameparser/ [2] http://data.canadensys.net/vascan
PS: Yes, Canadensys will use the GBIF interpretation libraries. Since we develop in Java as well, using those libraries is as easy as the proverbial "one line of code". We're looking forward to testing them and providing patches to enhance them. Open source FTW! :-)
On Wed, Mar 14, 2012 at 07:32, Tim Robertson [GBIF] trobertson@gbif.org wrote:
Hi Peter,
I'm replying off the TDWG list, since it is a bit of a tangent to your discussion. If you feel it is relevant, please CC the list again.
At GBIF, as you know, we have to interpret content of widely varying quality. I tend to agree with Donald that this would not really help in consumption, as in my experience we will have to deal with both clean and dirty data in each field *anyway* when this is used at network scale. I would rather see us evolve the interpretation libraries to handle all the corner cases, which we need to develop anyway. We already do a pretty decent job at extracting canonicals. This is further enhanced when you couple the extracted canonical with a fuzzy match against the "authoritative names" we can now index thanks to the availability of checklists in DwC-A format.
I know you are a Java shop. Are you using the GBIF interpretation libraries [1] at the moment? If not, is there a reason why you don't? They are used in all GBIF projects (portal, checklistbank etc), and the more we enhance them, the better it is for everyone. We have significant test coverage [2,3], and quite a few man-months (years?) have already been spent on their development, with input from some real regular expression experts (most notably Markus D. and Dave M.). All our work is Maven-ized, versioned and available in our Maven repository [4].
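As a rough illustration of coupling an extracted canonical with a fuzzy match against a list of authoritative names, here is a sketch using Python's standard `difflib` module. It is not how the GBIF libraries work (those are Java, and their matching logic is not described in this thread). The checklist contents, the cutoff value and the function name `match_canonical` are assumptions chosen for the example.

```python
import difflib
from typing import Optional

# Hypothetical stand-in for names indexed from a checklist.
AUTHORITATIVE_NAMES = ["Poa pratensis", "Poa annua", "Magnoliidae"]


def match_canonical(canonical: str, names=AUTHORITATIVE_NAMES,
                    cutoff: float = 0.8) -> Optional[str]:
    """Return the closest authoritative name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(canonical, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(match_canonical("Poa pratenssis"))  # misspelled canonical
print(match_canonical("Quercus robur"))   # not in the list at all
```

The cutoff trades false matches against missed ones, and a production matcher would presumably also use rank, authorship and classification to disambiguate.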
I hope these are interesting to you. We would welcome any patches to enhance them, or assistance in identifying the corner cases and capturing those as unit tests.
Hope this helps, Tim
[1] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/main/java/org/gbif/ecat/parser/NameParser.java [2] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/test/java/org/gbif/ecat/parser/NameParserTest.java [3] http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/src/#src%2Ftest%2Fresources [4] http://repository.gbif.org/index.html#nexus-search;quick~ecat-common
-- Peter Desmet Biodiversity Informatics Manager Canadensys - www.canadensys.net
Université de Montréal Biodiversity Centre 4101 rue Sherbrooke est Montreal, QC, H1X2B2 Canada
Phone: 514-343-6111 #82354 Fax: 514-343-2288 Email: peter.desmet@umontreal.ca / peter.desmet.cubc@gmail.com Skype: anderhalv Public profile: http://www.linkedin.com/in/peterdesmet
This message is only intended for the addressee named above. Its contents may be privileged or otherwise protected. Any unauthorized use, disclosure or copying of this message or its contents is prohibited. If you have received this message by mistake, please notify us immediately by reply mail or by collect telephone call. Any personal opinions expressed in this message do not necessarily represent the views of the Bishop Museum.
_______________________________________________
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
and ...
an infraspecific name in ICNAFP (previously ICBN) without an infraspecific rank is NOT a name, so less than useless ... :-)
Paul
________________________________________
From: tdwg-tag-bounces@lists.tdwg.org [tdwg-tag-bounces@lists.tdwg.org] on behalf of Richard Pyle [deepreef@bishopmuseum.org]
Sent: 14 March 2012 18:13
To: 'Peter Desmet'; 'Donald Hobern (GBIF)'
Cc: 'TDWG content mailing list'; 'dev Developers'; 'Christian Gendreau'; 'TDWG TAG mailing list'
Subject: Re: [tdwg-tag] [tdwg-content] Canonical name parsing
I guess the parts that confuse me are:
1) What providers are able to produce a canonicalScientificName as per Peter’s definition, but are unable to provide the pre-parsed elements of genus | subgenus | specificEpithet | infraspecificEpithet?
2) What consumers could make use of a canonicalScientificName as per Peter’s definition, but are unable to make (even better) use of the pre-parsed elements of genus | subgenus | specificEpithet | infraspecificEpithet?
Aloha, Rich
----- Original Message -----
From: "Paul Kirk" p.kirk@cabi.org
Sent: Wednesday, March 14, 2012 7:34 PM
and ...
an infraspecific name in ICNAFP (previously ICBN) without an infraspecific rank-indicator is NOT a name, so less than useless ... :-)
*** Surely this is not right. The "connecting term" denoting the rank is not part of the name. However, its use is mandatory.
In higher ranks a similar problem may occur, where a non-standard termination is used or where a name may be used in more than one rank. Anyway, it never hurts to have a rank-indicating term for names in higher ranks, for those users who are not familiar with the standard terminations.
Paul
P.S. I would guess "ICNafp"?
participants (6)
- Kennedy, Jessie
- Paul Kirk
- Paul van Rijckevorsel
- Peter Desmet
- Richard Pyle
- Steve Baskauf