[tdwg-content] Delimiters for Darwin Core list-type terms

Steve Baskauf steve.baskauf at vanderbilt.edu
Mon Oct 7 22:24:39 CEST 2013


I don't want to put the cart before the horse here because the Darwin 
Core RDF Guide has not been formally introduced as an addition to Darwin 
Core (it's waiting until John W. feels the time is right to do so in the 
context of dealing with the various issues he's been working through).  
But it is in the queue with recommendation of the RDF Task Group to 
adopt and can be viewed online.  I mention this because it contains a 
specific recommendation for how to deal with terms that have multiple 
values in a concatenated list.  See
http://code.google.com/p/tdwg-rdf/wiki/DwcRdfGuideProposal#2.5.1_Definition_of_dwcuri:_terms
for the details. 

In a nutshell, the guide establishes the convention that the existing 
dwc: namespace terms be used with literals which are formatted as 
described in the existing standard (e.g. a delimited, concatenated 
list).  It creates new versions of the terms (in a new namespace 
dwcuri:) which are intended to be repeatable and to have single values 
which are URI references. 
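
To make the distinction concrete, here is a minimal sketch in Python. It is 
not taken from the guide: the tuple-based "triples", the choice of 
recordedBy, the " | " delimiter, and the example.org URIs are all 
assumptions made purely for illustration.

```python
def literal_statement(subject, term, values, delimiter=" | "):
    """One dwc: statement whose object is a single concatenated literal."""
    # The delimiter is an assumption; the thread has not settled on one.
    return [(subject, f"dwc:{term}", delimiter.join(values))]


def uri_statements(subject, term, uris):
    """One dwcuri: statement per URI reference; the property is repeated."""
    return [(subject, f"dwcuri:{term}", uri) for uri in uris]


# Hypothetical occurrence with two collectors.
occurrence = "http://example.org/occurrence/1"
as_literal = literal_statement(occurrence, "recordedBy", ["A. Smith", "B. Jones"])
as_uris = uri_statements(
    occurrence,
    "recordedBy",
    ["http://example.org/agent/1", "http://example.org/agent/2"],
)

for statement in as_literal + as_uris:
    print(statement)
```

The literal form yields a single statement that a consumer still has to 
split, while the URI form yields one statement per value and needs no 
delimiter at all.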

I am hesitant to bring this up because the guide has not been formally 
introduced, nor has a 30-day discussion period been declared.  But I 
think in light of this discussion, it is important for people to know 
that the guide does address this issue in the context of RDF. 

Steve

Markus Döring wrote:
> +1 to remove multi value recommendations from the main DwC definitions and leave it to the implementations to deal with lists if needed.
>
> As many terms currently are in plural we can easily create new terms for single values. I am not entirely convinced though that these terms are very useful if they combine various properties into an unstructured string. In these cases it might be better to define several single value terms instead. A quick attempt to create single value terms:
>
>
> Terms currently in plural form:
> ----------------------------------------
> dataGeneralizations
> -> dataGeneralization
>
> dynamicProperties
> -> dynamicProperty
>
> preparations
> -> preparation
>
> associatedSequences
> -> associatedSequence
>
> georeferenceSources
> -> georeferenceSource
>
> associatedReferences
> -> associatedReference  (id and/or citation string ?)
>
> otherCatalogNumbers
> -> otherCatalogNumber   (needed at all if there is already catalogNumber ?)
>
> previousIdentifications
> -> previousIdentification  (combines various identification properties into human string. Redefine as just the previously identified scientificName or deprecate entirely ?)
>
> associatedOccurrences
> -> associatedOccurrence  (combines occurrenceID with relation type. Maybe just associatedOccurrenceID ? Or just deprecate it in favor of the ResourceRelationship terms ?)
>
> associatedTaxa
> -> associatedTaxon  (combines taxonID or scientificName with associationType)
>
>
>
> Terms already in singular form. 
> ------------------------------------
> Can we redefine those terms or do we need to create new ones?
>
> typeStatus
> # seems to combine any property about typification/type designation right now, although I have mostly seen a single status so far.
> The discussion page recommends values for the "status portion of the content". How about restricting the term to this status / kind of type?
>
> vernacularName
> # I have not seen anyone using this as a list. Is anyone aware of such a case? Might not cause too much trouble to redefine
>
> recordedBy
>
> associatedMedia   
>
> higherClassification   
> # isn't a single classification always a list? 
>
> higherGeography
>
> informationWithheld
>
>
>
> Markus
>
>
>
>
> On 07.10.2013, at 18:15, joel sachs wrote:
>
>   
>> On Mon, 7 Oct 2013, John Wieczorek wrote:
>>
>>     
>>> On Mon, Oct 7, 2013 at 4:54 PM, Tim Robertson [GBIF]
>>> <trobertson at gbif.org> wrote:
>>>       
>>>>> I kind of expected it was futile to make the plea "Please ignore the
>>>>> issue of whether the idea of list-type terms is a
>>>>> good idea or not - that is not the issue we're trying to resolve
>>>>> here." I had to try.
>>>>>           
>>>> Sorry John, I fell into your trap.
>>>>         
>>> :-) You were not alone.
>>>
>>>       
>>>> Surely though, talking about solutions to problems rather than the problem itself is the _worst_ option available to us?
>>>>         
>>> I just wanted to keep the issues separate and focus on the one
>>> submitted, for the very reason that it will otherwise get too broad and
>>> contentious to provide any solutions at all. I don't much like
>>> spending energy when that is the likely outcome. The process more
>>> often seems to yield results when it is kept simple. And in this case,
>>> having a better recommendation does not set us in any worse position
>>> than we are now. No one presented an issue to the tracker recommending
>>> the deprecation of all "list" terms.
>>>       
>> I do plan on submitting an issue to the tracker. It won't be to deprecate the terms, but, as Tim suggests, to change the definitions so that the recommendations on how to deal with multiple values are left to the various representation guides (text, xml, rdf). (If anyone else wants to submit this issue first, please go ahead.)
>>
>> I'm glad that Tim fell into your trap, as it raises an important issue, without precluding Darwin Core (through the representation guides) from providing consistent guidance on these terms.
>>
>> Joel.
>>
>>
>>
>>
>>
>>
>>
>>     
>>>> Consider the likes of this list term
>>>> http://rs.tdwg.org/dwc/terms/#typeStatus
>>>> The description suggests a separated and concatenated list but the example
>>>> (unless I misunderstand) is showing only 1 list item which is a triplet of
>>>> "type + author + pub" in a human readable form.  This one field is actually
>>>> suggesting a structure of a repeatable triplet, so need 2 delimiters if
>>>> machines are to extract the scientific name for the typification .   Perhaps
>>>> these terms are really just verbatim text blocks intended for human
>>>> consumption (which is fine with me, and we don't need to define delimiters)?
>>>> Or perhaps we should be discussing terms to atomize them further (e.g.
>>>> introduce dwc:typeName and dwc:typePublication)?
>>>>         
>>> Yes, the example gives only one typeStatus entry, not a list. Yes, one
>>> can argue that the content mixes concepts if those distinct concepts
>>> are of interest. A look at the history of typeStatus will reveal that
>>> it has its origins deep in the Darwin Core history, and no one has yet
>>> suggested that it should be other than what it is. Another item for
>>> the issue tracker if anyone wants to defend a change.
>>>
>>>       
>>>> If we are heading this way though, can I also suggest we consider declaring
>>>> the expected ordering on lists where omitted?  The likes of
>>>> http://rs.tdwg.org/dwc/terms/#higherGeography doesn't  have one whereas
>>>> http://rs.tdwg.org/dwc/terms/#higherGeography does.
>>>>         
>>> Same example. Did you mean
>>> http://rs.tdwg.org/dwc/terms/#higherClassification? It would be a fine
>>> thing to amend the recommendation for higherGeography to suggest the
>>> ordering. If anyone seconds the motion I'll create an issue for it.
>>>
>>>       
>>>>> There are definitely more rigorous ways to share the information than
>>>>> in concatenated lists. The "list" terms are just as Tim describes, an
>>>>> attempt to share in a flat data structure data that do not fit well in
>>>>> a flat structure, but are nevertheless of common interest. There
>>>>> probably shouldn't be an expectation that one could process the
>>>>> content of such fields and derive individual values, as we can't even
>>>>> get simple content under control yet (see
>>>>> http://soyouthinkyoucandigitize.wordpress.com/2013/07/18/data-diversity-of-the-week-sex/).
>>>>>
>>>>> Nevertheless, these terms do exist, and they expect lists, and people
>>>>> are using them in distinct ways that make them a challenge to process.
>>>>> It would be nice to give guidance. I have no problem if that guidance
>>>>> stays out of the term definitions, but we have a legacy problem of
>>>>> definitions that tell us that the content should consist of a
>>>>> delimited list.
>>>>>           
>>>> Other than deprecating and redefining as new concepts (terms), I don't see
>>>> any robust way I am afraid.  Some things are just not meant to be
>>>> denormalized.
>>>>         
>>> That would be a fine conclusion as well, if we can get consensus. I
>>> would then just add secondary documentation saying "Beware all ye who
>>> enter (data) here."
>>>
>>>       
>>>> Cheers,
>>>> Tim
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Oct 7, 2013 at 4:02 PM, Tim Robertson [GBIF]
>>>> <trobertson at gbif.org> wrote:
>>>>
>>>> I suspect any attempt to find a universal delimiter will be flaky at best -
>>>> see Unicode character U+001F (the unit separator) as an example [1].
>>>> I would urge DwC to stop at only defining the concept of each term and leave
>>>> it to the serialization formats, schema definitions, data models etc (e.g.
>>>> DwC-A, XML, RDF, JSON, HTML, Excel templates etc) to define those kinds of
>>>> things.
>>>>
>>>> If you were to design an XML schema you would use things like:
>>>>
>>>> <tim:identifications>
>>>>   <dwc:scientificName>A</dwc:scientificName>
>>>>   <dwc:scientificName>B</dwc:scientificName>
>>>>   <dwc:scientificName>C</dwc:scientificName>
>>>> </tim:identifications>
>>>>
>>>> and not:
>>>>
>>>> <tim:identifications>
>>>>   <dwc:scientificName>A|B|C</dwc:scientificName>
>>>> </tim:identifications>
>>>>
>>>> I don't think it wise for the DwC standard to suggest anyone should.
>>>>
>>>> I suspect this request stems from those working with denormalized data
>>>> structures, and trying to shoe-horn all data into flat structures (e.g.
>>>> DwC-A).  I think that is a dangerous path to go down, and makes things more
>>>> difficult for both producers and consumers.  Very quickly you will get into
>>>> the situation where you will want to also suggest "well the element at index
>>>> [0] of field X should be interpreted as the index [0] for field Y" (e.g.
>>>> identifications and identification dates).
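
The index-alignment problem described above can be sketched as follows. This 
is only an illustration: the "|" delimiter, the field contents, and the 
function name pair_by_index are assumptions, not anything Darwin Core 
defines.

```python
from itertools import zip_longest

DELIMITER = "|"  # assumed delimiter; the thread does not settle on one


def pair_by_index(names_field, dates_field, delimiter=DELIMITER):
    """Pair the i-th name with the i-th date from two flat, delimited fields."""
    names = [part.strip() for part in names_field.split(delimiter)]
    dates = [part.strip() for part in dates_field.split(delimiter)]
    # zip_longest exposes a length mismatch as None instead of silently
    # dropping the unmatched entries the way zip() would.
    return list(zip_longest(names, dates))


# Hypothetical record: three identifications but only two dates.
pairs = pair_by_index("A | B | C", "2001-05-02 | 2010-11-30")
print(pairs)
```

Nothing in the flat record says which of the three names lacks a date, so 
the consumer can only guess; a structure that keeps each name together with 
its date does not have this ambiguity.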
>>>>
>>>> Cheers,
>>>> Tim
>>>>
>>>> [1] http://www.fileformat.info/info/unicode/char/1f/index.htm
>>>>
>>>> On Oct 7, 2013, at 3:45 PM, Steve Baskauf wrote:
>>>>
>>>> I don't have an opinion about what the recommended delimiter should be, but
>>>> I think it would be beneficial for there to be consistency between Darwin
>>>> Core and Audubon Core.  You can see what the recommendation is for Audubon
>>>> Core at
>>>> http://terms.gbif.org/wiki/Audubon_Core_%281.0_normative%29#Lists_of_plain_text_values
>>>> - it's the pipe "|".  Either Darwin Core should go with this, or if there is
>>>> a consensus reached here that is different, then AC should be changed before
>>>> it is ratified, which potentially could happen in a matter of weeks.  It is
>>>> highly likely that there will be records that are a mixture of AC and DwC,
>>>> so it would not be a good thing for the recommendations to differ.
>>>>
>>>> Steve
>>>>
>>>> Markus Döring wrote:
>>>>
>>>> Hi John et al.,
>>>>
>>>> I would like to see a single recommended default delimiter, preferably the
>>>> semicolon, as it's natural and hardly used in values.
>>>> For DwC archives there is a multiValueDelimiter attribute for every term
>>>> mapping that allows other delimiters to be declared if needed.
>>>>
>>>> Currently it is hardly possible to detect multiple values in a field. You
>>>> can test for some often-used delimiters, but even then you never know if
>>>> they were meant to be delimiters.
>>>> Having a single default value helps to get the idea of multi values across
>>>> and makes it a bit more accessible, I believe.
>>>>
>>>> dwc:vernacularName I would personally prefer to see as a single-value term,
>>>> as it is mostly useful in combination with a locale and rarely is shared on
>>>> its own.
>>>> Seeing dwc:typeStatus being a multi-value term also feels wrong, as the name
>>>> is in singular while the others carry the multi-value nature in the name
>>>> already.
>>>>
>>>> Markus
>>>>
>>>> On 07.10.2013, at 12:28, John Wieczorek wrote:
>>>>
>>>> Dear all,
>>>>
>>>> On the list of pending Darwin Core issues is a topic of general
>>>> concern about terms that could or do recommend the concatenation and
>>>> delimiting of a list of values. The specific issue was submitted on
>>>> the Darwin Core Project site at
>>>> https://code.google.com/p/darwincore/issues/detail?id=168. Right now
>>>> there is variation in the recommendations of distinct terms.
>>>>
>>>> The Darwin Core terms that could be used to hold lists include the
>>>> following (use the index at
>>>> http://rs.tdwg.org/dwc/terms/index.htm#theterms to find and see the
>>>> details of each of these):
>>>>
>>>> informationWithheld
>>>> dataGeneralizations
>>>> dynamicProperties
>>>> recordedBy
>>>> preparations
>>>> otherCatalogNumbers
>>>> previousIdentifications
>>>> associatedMedia
>>>> associatedReferences
>>>> associatedOccurrences
>>>> associatedSequences
>>>> associatedTaxa
>>>> higherGeography
>>>> georeferenceSources
>>>> typeStatus
>>>> higherClassification
>>>> vernacularName
>>>>
>>>> There are some issues. Many terms do not show examples. Most of those
>>>> that do show examples recommend semi-colon (';') -
>>>> associatedOccurrences, recordedBy, preparations, otherCatalogNumbers,
>>>> previousIdentifications, higherGeography, georeferenceSources, and
>>>> higherClassification. The example for higherClassification does not
>>>> have spaces after the semi-colon while all others do.
>>>>
>>>> Terms that could hold a list of URLs would require a delimiter that
>>>> would be an invalid part of a URL unless it was escaped. This
>>>> precludes comma (','), semi-colon (';'), and colon (':'), among
>>>> others. One possibility here might be the vertical bar or "pipe"
>>>> ('|').
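
A rough sketch of why the pipe works for URL lists, using only Python's 
standard library. The function names and the example.org URLs are 
hypothetical; the point is just that ';', ',' and ':' can appear unescaped 
inside a URL, whereas a literal '|' can be percent-encoded as %7C before 
joining.

```python
from urllib.parse import quote

DELIMITER = "|"  # the "pipe" suggested in the message above


def join_urls(urls, delimiter=DELIMITER):
    """Concatenate URLs, percent-encoding any literal delimiter inside a URL."""
    escaped = quote(delimiter, safe="")  # "|" becomes "%7C"
    return delimiter.join(url.replace(delimiter, escaped) for url in urls)


def split_urls(field, delimiter=DELIMITER):
    """Split a delimited field into URLs, dropping surrounding spaces and empty items."""
    return [part.strip() for part in field.split(delimiter) if part.strip()]


# Hypothetical URLs containing ';', ',' and ':' as ordinary URL characters.
urls = [
    "http://example.org/media/1.jpg;size=large",
    "http://example.org/search?ids=1,2,3",
]
field = join_urls(urls)
print(field)
print(split_urls(field))
```

Splitting the same field on ';' or ',' would cut these URLs into pieces, 
which is the problem the paragraph above describes.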
>>>>
>>>> The term dynamicProperties is meant to take key-value pairs. The
>>>> examples suggest the format key=value, with any list delimited by a
>>>> semi-colon, for example, "tragusLengthInMeters=0.014;
>>>> weightInGrams=120". The example for associatedTaxa also shows a
>>>> key-value pair ("host: Quercus alba"), but it is formatted differently
>>>> from the examples for dynamicProperties. There are other terms, such
>>>> as vernacularName, which could potentially also take a key-value pair,
>>>> though it is not currently recommended to be a list.
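
As an illustration of the key=value convention just described, a consumer 
might parse such a field roughly as below. The function name and its 
parameters are made up for this sketch, and it assumes that neither the 
delimiter nor the separator occurs inside a key or a value.

```python
def parse_dynamic_properties(value, pair_delimiter=";", key_value_separator="="):
    """Parse 'key=value; key=value' text into a dict of strings."""
    properties = {}
    for item in value.split(pair_delimiter):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing delimiter
        key, separator, raw = item.partition(key_value_separator)
        if not separator:
            raise ValueError(f"not a key-value pair: {item!r}")
        properties[key.strip()] = raw.strip()
    return properties


# The dynamicProperties example quoted in the message above.
dynamic = parse_dynamic_properties("tragusLengthInMeters=0.014; weightInGrams=120")
print(dynamic)

# The associatedTaxa example uses ':' rather than '=' between key and value.
taxa = parse_dynamic_properties("host: Quercus alba", key_value_separator=":")
print(taxa)
```

The fact that the two terms need different separator arguments is the 
inconsistency the paragraph above points out.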
>>>>
>>>> Please ignore the issue of whether the idea of list-type terms is a
>>>> good idea or not - that is not the issue we're trying to resolve here.
>>>> Instead, the issue is whether a consistent recommendation can be made
>>>> for how to delimit the values in a list. And if not a consistent
>>>> recommendation, can we make specific recommendations for distinct
>>>> terms? If specific recommendations can be made for a term, should that
>>>> be reflected in examples within the term definitions, or should such
>>>> recommendations reside only in Type 3 supplementary documentation such
>>>> as that which can be found on the Darwin Core Project site at, for
>>>> example,
>>>> https://code.google.com/p/darwincore/wiki/Occurrence#associatedSequences?
>>>> Should some of these terms have specific recommendations to contain
>>>> only single values (e.g., vernacularName), in which case they are not
>>>> really viable in Simple Darwin Core?
>>>>
>>>> Cheers,
>>>>
>>>> John
>>>>
>>>> _______________________________________________
>>>> tdwg-content mailing list
>>>> tdwg-content at lists.tdwg.org
>>>> http://lists.tdwg.org/mailman/listinfo/tdwg-content
>>>>

-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
PMB 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 322-4942
If you fax, please phone or email so that I will know to look for it.
http://bioimages.vanderbilt.edu
