[tdwg-content] Darwin Core vernacularName field

Sun Jul 24 16:27:35 CEST 2011

Here are some of my thoughts, as someone who has had my fair share of experience in consuming a wide range of DwC/ABCD/other specimen/observation data and attempted to bring it all together within a DwC-centric framework.

First, thanks to the care taken by John and the other developers of DwC, there are actually very few terms in DwC that are good candidates for repetition in Simple DwC.  If DwC had a term like otherCatalogNumber, I can see few good reasons why the no-repetition restriction would have to apply to that term.  In fact, DwC does not have otherCatalogNumber.  It has otherCatalogNumbers, which sidesteps the issue.  I tend to think that processing of other catalog numbers would be much simpler for all concerned if each number were provided separately rather than as a human-readable concatenation.
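
To illustrate the extra work a concatenated otherCatalogNumbers value puts on the consumer, here is a minimal Python sketch.  The candidate delimiters, the function name split_catalog_numbers and the sample values are all assumptions made for illustration; nothing here is prescribed by DwC.

```python
def split_catalog_numbers(value, delimiters=(" | ", ";", ",")):
    """Split a concatenated otherCatalogNumbers string into a list.

    The delimiter list is a guess: the consumer cannot know which
    separator a given provider chose, so each is tried in order.
    """
    for delimiter in delimiters:
        if delimiter in value:
            return [part.strip() for part in value.split(delimiter) if part.strip()]
    # No known delimiter found: treat the whole string as one number.
    stripped = value.strip()
    return [stripped] if stripped else []


# Hypothetical sample values.
print(split_catalog_numbers("NPS YELLO6778 | MBG 33424"))
print(split_catalog_numbers("FMNH:Mammal:1234; FMNH:Mammal:1235"))
```

A catalog number that itself contains a comma or semicolon would be split wrongly by this guesswork, which is the kind of problem that separately supplied values would avoid.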

Secondly, the real problem with repeated metadata terms comes when there are implicit nested semantic relationships between terms.  decimalLatitude and decimalLongitude are a good example.  Repeated pairs of these terms without some organising structure could not safely be interpreted.  If it were considered important for a vernacular name to be associated with the language in which the name is used, perhaps through a vernacularNameLanguage term to accompany vernacularName, this problem would occur.  However, as far as I can see, DwC is making no attempt to track the language of vernacular names.  I would say therefore that vernacularName (as currently used in DwC) is remarkably close to the fictitious otherCatalogNumber in the previous paragraph.  Repeating this term would not cause semantic problems.  It would only cause problems for certain kinds of serialisation or storage of the data.
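
The difference between the two cases can be sketched in a few lines of Python.  The flat list-of-pairs representation, the helper values_of and the coordinate values are assumptions chosen for illustration only.

```python
from itertools import product

# A flat record in which two terms are each repeated twice, with no
# structure saying which latitude belongs with which longitude.
# The coordinate values are made-up sample data.
flat_terms = [
    ("decimalLatitude", "-35.28"),
    ("decimalLongitude", "149.13"),
    ("decimalLatitude", "45.95"),
    ("decimalLongitude", "-66.64"),
]


def values_of(terms, name):
    """Collect every value supplied for one term."""
    return [value for term, value in terms if term == name]


latitudes = values_of(flat_terms, "decimalLatitude")
longitudes = values_of(flat_terms, "decimalLongitude")

# Without an ordering guarantee, every combination is a candidate point.
candidate_points = list(product(latitudes, longitudes))
print(len(candidate_points))

# Repeated vernacularName values have no partner term to be paired with,
# so a plain list of them loses nothing.
names = values_of(
    [("vernacularName", "Chives"), ("vernacularName", "Ciboulette")],
    "vernacularName",
)
print(names)
```

Two intended points become four candidates once the pairing is lost, whereas the two names remain simply two names.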

Thirdly, any consumer of DwC needs to be able to handle many issues.  Repeated vernacular names are among the less problematic.  In practice, any expectation that providers will serve ABCD or non-Simple DwC will require most clients to deal with complex cases.

My feeling is therefore that it might in theory be beneficial for us to define a useful form of DwC which varied from Simple DwC only in that it allowed repetition of some terms.  However, while the only obvious term requiring this exemption is vernacularName, it may be simplest for those providers that need to serve multiple vernacular names for a single record not to claim to use Simple DwC.  As suggested, for most serious clients, the real requirement will be to consume any DwC, not just simple.

This brings me however to something that is a very real concern to me.  Class-based DwC representations of data may be very complex.  The reason that Simple DwC exists is that it corresponds with a range of end uses which are well understood and which rely on what-species-was-recorded-when-and-where-and-with-what-level-of-evidence.  This means essentially that these consumers need, for any DwC record, to be able to determine at least the scientificName, decimalLatitude, decimalLongitude, eventDate and basisOfRecord (and preferably the coordinatePrecision, coordinateUncertaintyInMeters and some of the record/provider-identifier terms).  Even when a collection database contains multiple identifications for a specimen, there will normally be a current-best identification.  The same holds for other repeatable database elements.  We certainly need to be able to stream out the full complexity of our biodiversity data, but have we ensured that there is a reliable and consistent way for consumers to take the class-based data and derive the equivalent of the most appropriate Simple DwC for the same data?  If not, what can be done to promote the reliability and consistency of such interpretations?
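
As a rough sketch of what such a derivation might look like, the Python below flattens one nested record into the core Simple DwC terms named above.  The nested dictionary layout, the function name to_simple_dwc, the sample record and above all the selection rule (latest dateIdentified wins) are assumptions for illustration; agreeing on that rule is precisely the open question.

```python
def to_simple_dwc(occurrence):
    """Flatten one nested occurrence into a single Simple DwC-style dict.

    Assumed rule: the identification with the latest dateIdentified is
    the current best one. Real collections may flag this differently.
    """
    identifications = occurrence.get("identifications", [])
    # ISO 8601 date strings sort correctly as plain text.
    best = max(
        identifications,
        key=lambda ident: ident.get("dateIdentified", ""),
        default={},
    )
    event = occurrence.get("event", {})
    location = occurrence.get("location", {})
    return {
        "scientificName": best.get("scientificName"),
        "decimalLatitude": location.get("decimalLatitude"),
        "decimalLongitude": location.get("decimalLongitude"),
        "eventDate": event.get("eventDate"),
        "basisOfRecord": occurrence.get("basisOfRecord"),
    }


# Hypothetical specimen with two identifications.
record = {
    "basisOfRecord": "PreservedSpecimen",
    "event": {"eventDate": "1987-06-14"},
    "location": {"decimalLatitude": 45.95, "decimalLongitude": -66.64},
    "identifications": [
        {"scientificName": "Allium sp.", "dateIdentified": "1987-06-20"},
        {"scientificName": "Allium schoenoprasum", "dateIdentified": "2003-02-11"},
    ],
}
simple = to_simple_dwc(record)
print(simple["scientificName"])
```

Two consumers that pick different rules here (latest date, a provider flag, a verification status) would derive different Simple DwC records from the same source data.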

I'd be very interested in thoughts on this last part (if, at this time of night, I've made sense).

Thanks,

Donald

Donald Hobern, Director, Atlas of Living Australia
CSIRO Ecosystem Sciences, GPO Box 1700, Canberra, ACT 2601
Phone: (02) 62464352 Mobile: 0437990208 
Email: Donald.Hobern at csiro.au
Web: http://www.ala.org.au/ 

-----Original Message-----
From: tdwg-content-bounces at lists.tdwg.org [mailto:tdwg-content-bounces at lists.tdwg.org] On Behalf Of Bob Morris
Sent: Saturday, 23 July 2011 1:57 AM
To: tuco at berkeley.edu
Cc: tdwg-content at lists.tdwg.org; Geoffrey Allen
Subject: Re: [tdwg-content] Darwin Core vernacularName field

Your point is fair enough, and living with Simple DwC is a Good Thing for people with not much experience. But by and large the people who write and support tools like IPT and similar aids are experienced software engineers who would have little trouble implementing, e.g., serving multiple records against the same ResourceID.  The issue would then become what problems this presents to existing or future consuming applications, and how the cost of solving those problems compares to that of solving those that arise from some other solution, such as having to include an atomizer to parse a concatenation-based string. (The ability to do that probably carries a somewhat lower experience barrier to entry than integrating records.)
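
For a sense of the consumer-side effort involved in integrating multiple records, here is a minimal Python sketch.  The function name integrate, the use of plain dictionaries, the identifier UNB-0001 and the sample values are assumptions for illustration, not part of any existing tool.

```python
from collections import defaultdict


def integrate(records, id_field="resourceID"):
    """Merge records sharing an ID, collecting distinct values per term."""
    merged = defaultdict(lambda: defaultdict(list))
    for record in records:
        resource = merged[record[id_field]]
        for term, value in record.items():
            # Skip the ID itself and avoid storing the same value twice.
            if term != id_field and value not in resource[term]:
                resource[term].append(value)
    return {rid: dict(terms) for rid, terms in merged.items()}


# Hypothetical records: one full record plus one carrying only an
# additional vernacular name for the same resource.
records = [
    {
        "resourceID": "UNB-0001",
        "scientificName": "Allium schoenoprasum",
        "vernacularName": "Chives",
    },
    {"resourceID": "UNB-0001", "vernacularName": "Ciboulette"},
]
result = integrate(records)
print(result["UNB-0001"]["vernacularName"])
```

The merge itself is short; the harder part for a real application is knowing that it must keep looking for further records with the same ID before treating a resource as complete.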

Bob

On Fri, Jul 22, 2011 at 11:31 AM, John Wieczorek <tuco at berkeley.edu> wrote:
> It's a storage issue, a generation issue, a transportation issue, a 
> processing issue, a consumption issue - it affects all aspects of a 
> workflow. It is meant to help those whose lives are not steeped in 
> informatics, and who have no desire to tread there - in fact, the 
> majority of those providing data and who would not be able to under 
> current conditions without tools such as the GBIF Integrated 
> Publishing Toolkit (IPT) or without assistance.
>
> On Fri, Jul 22, 2011 at 8:16 AM, joel sachs <jsachs at csee.umbc.edu> wrote:
>> Hi John,
>>
>> The description of Simple Darwin Core justifies the restriction by 
>> saying that it's just like the restriction in relational databases. 
>> But that's a storage issue, not a representation issue. Maybe my real 
>> question is: Whose life is Simple Darwin Core supposed to simplify, 
>> the data provider's, or the aggregator's?
>>
>> Joel.
>>
>>
>>
>>
>> On Fri, 22 Jul 2011, John Wieczorek wrote:
>>
>>> Joel, is the description of the Simple Darwin Core
>>> (http://rs.tdwg.org/dwc/terms/simple/index.htm) insufficient to 
>>> explain the restriction?
>>>
>>> I would say that the goal of many of "us" is to encourage everyone 
>>> to share biodiversity information. I would even go so far as to say 
>>> that our success as biodiversity informaticians will be to make sure 
>>> that most people never have to think in rdf. Like any good 
>>> infrastructure, it should disappear from everyday concern.
>>>
>>> On Fri, Jul 22, 2011 at 6:42 AM, joel sachs <jsachs at csee.umbc.edu> wrote:
>>>>
>>>> I'd love it if someone could explain the reason for this 
>>>> restriction on Simple Darwin Core. It seems somewhat anachronistic, 
>>>> given that we're encouraging everyone to think in rdf. On the 
>>>> representation side, repetition of a field poses no problems for 
>>>> spreadsheets, xml, or rdf.  On the storage side, it is an issue for 
>>>> RDBMS systems; but, consuming applications can address this by 
>>>> creating the kinds of records Bob describes below. Am I missing 
>>>> something?
>>>>
>>>> Many thanks,
>>>> Joel.
>>>>
>>>>
>>>> On Thu, 21 Jul 2011, Bob Morris wrote:
>>>>
>>>>> There's a general issue with repeated attributes in a metadata 
>>>>> record of any kind.  Depending on the representation language, 
>>>>> when there is more than one such thing in the record, it can be 
>>>>> difficult to specify any linkages between them when they are semantically related.
>>>>>
>>>>> One general solution is to have multiple metadata records for the 
>>>>> same resource. This can be costly if there is a powerful reason 
>>>>> that every such record should carry the complete set of attributes 
>>>>> except for the repeated ones, but in the case you put on the 
>>>>> table, I think the only powerful reason would take the form "There 
>>>>> are a lot of stupid DwC applications out there that might discover 
>>>>> a record that has nothing in it but, say, the French vernacular 
>>>>> name and a resourceID, and stop there without ever looking for/at 
>>>>> another record with the same resourceID and more comprehensive 
>>>>> metadata, and integrating the results at the application level."
>>>>>
>>>>> A response might be "But the point of simple DwC is to support 
>>>>> simple applications." But "simple application" is not the same 
>>>>> thing as "simple minded application", and my guess is that 
>>>>> addressing the issue of multiple metadata records at the 
>>>>> application side is, for many applications, less programming effort than other workarounds.
>>>>>
>>>>>
>>>>> Bob Morris
>>>>>
>>>>>
>>>>> On Thu, Jul 21, 2011 at 11:23 AM, Geoffrey Allen <gsallen at unb.ca> wrote:
>>>>>>
>>>>>> Greetings,
>>>>>> I have recently begun the process of digitising the 60,000 
>>>>>> specimen vouchers from the UNB herbarium. The textual data for 
>>>>>> 40,000+ of those has already been entered into a database, and I 
>>>>>> am now trying to map those values to DwC so that we may share the 
>>>>>> data with other collections.
>>>>>> I have some concern over the fact that simple DwC does not allow 
>>>>>> the repetition or extension of certain fields. The vernacularName 
>>>>>> field is a particular problem. New Brunswick is Canada's only 
>>>>>> officially bilingual province; as such, our specimens are all 
>>>>>> identified with both their English and French common names in the 
>>>>>> database. It would be very useful if we could extend DwC, 
>>>>>> creating something along the lines of <vernacularName lang=en>, 
>>>>>> or allow nesting of elements, perhaps in the form:
>>>>>> <vernacularName>
>>>>>> <English>Chives</English>
>>>>>> <French>Ciboulette, brulotte</French>
>>>>>> </vernacularName>
>>>>>> The other option, as I see it, is that we store the English and French 
>>>>>> common names in our own fields, and then concatenate the two to 
>>>>>> create the DwC:vernacularName field. I see this option as less 
>>>>>> than ideal since it may hinder search/browsability. It may also 
>>>>>> cause a host of other problems from interpreting to storing the 
>>>>>> data. The herbarium with whom we first intend to share the data 
>>>>>> has already expressed a concern that their system cannot handle 
>>>>>> the diacritics found in many of the French names (!). They would 
>>>>>> like the Eng. common names, but not the French. This is more 
>>>>>> difficult to achieve if we concat the values.
>>>>>> One additional thought is that the herbarium's imprint, _Flora of 
>>>>>> New Brunswick_, also includes common names in Maliseet and 
>>>>>> Mi'kmaq wherever possible. Although these two aboriginal 
>>>>>> languages do not currently exist in the dataset we are using, 
>>>>>> there is the potential that they may be added at some point in 
>>>>>> the future.
>>>>>> It seems to me that the repetition of fields may be necessary in 
>>>>>> other instances too. I am having some difficulty figuring out how 
>>>>>> to record all the location data we have for the specimens, which 
>>>>>> are indicated using verbal descriptions, Lat/Long, UTM, and NTS 
>>>>>> coordinates - in many cases using all 4 for a single sample, but 
>>>>>> I will save the details for another posting.
>>>>>> I will watch for the group's thoughts on this problem.
>>>>>> Many thanks,
>>>>>> Geoffrey
>>>>>> --------------------------------------------
>>>>>> Geoffrey Allen
>>>>>> Digital Projects Librarian
>>>>>> Electronic Text Centre
>>>>>> Harriet Irving Library
>>>>>> University of New Brunswick
>>>>>> Fredericton, NB  E3B 5H5
>>>>>> Tel: (506) 447-3250
>>>>>> Fax: (506) 453-4595
>>>>>> gsallen at unb.ca
>>>>>>
>>>>>> _______________________________________________
>>>>>> tdwg-content mailing list
>>>>>> tdwg-content at lists.tdwg.org
>>>>>> http://lists.tdwg.org/mailman/listinfo/tdwg-content
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Robert A. Morris
>>>>>
>>>>> Emeritus Professor  of Computer Science UMASS-Boston 100 Morrissey 
>>>>> Blvd Boston, MA 02125-3390 IT Staff Filtered Push Project 
>>>>> Department of Organismal and Evolutionary Biology Harvard 
>>>>> University
>>>>>
>>>>>
>>>>> email: morris.bob at gmail.com
>>>>> web: http://efg.cs.umb.edu/
>>>>> web: http://etaxonomy.org/mw/FilteredPush
>>>>> http://www.cs.umb.edu/~ram
>>>>> phone (+1) 857 222 7992 (mobile)

--
Robert A. Morris

Emeritus Professor  of Computer Science
UMASS-Boston
100 Morrissey Blvd
Boston, MA 02125-3390
IT Staff
Filtered Push Project
Department of Organismal and Evolutionary Biology Harvard University

email: morris.bob at gmail.com
web: http://efg.cs.umb.edu/
web: http://etaxonomy.org/mw/FilteredPush
http://www.cs.umb.edu/~ram
phone (+1) 857 222 7992 (mobile)