Re: [tdwg-content] What I learned at the TechnoBioBlitz

12 Oct 2010

      This conversation about values for basisOfRecord, establishmentMeans, 
and the nature of what actually constitutes a dwc:Occurrence is very 
important.  We have sitting on the table before us several official 
requests for additions and modifications to Darwin Core:
http://code.google.com/p/darwincore/issues/detail?id=68
http://code.google.com/p/darwincore/issues/detail?id=69
http://code.google.com/p/darwincore/issues/detail?id=80
and
http://code.google.com/p/darwincore/issues/detail?id=81
that cannot and should not be decided until this discussion occurs.  In 
particular, a discussion of what exactly a dwc:Occurrence is lies at the 
heart of much of what we are discussing in this thread and is critical 
to other processes that are moving forward, such as guidelines for how 
we represent things in RDF.  On this list I requested discussion of this 
suite of topics when I proposed the Darwin Core modifications, and I 
asked members of the TAG to have this discussion at the TDWG 
meeting.  It didn't happen in either place, so I'm glad it's happening 
here now. 

Roger has correctly noted that we colloquially talk about Occurrences in 
two ways that are fundamentally different.  We use Occurrence to mean 
(1) that a species occurs generically at a particular locality (the 
"checklist" use), and (2) that a particular individual organism was 
noticed at a particular place at a particular time.  
Based on the clarification that John Wieczorek gave in 
the thread that surrounds 
http://lists.tdwg.org/pipermail/tdwg-content/2009-October/000280.html, 
an Occurrence record simply asserts that an organism was someplace at a 
certain time (and doesn't imply fitness for any particular use, such as 
documenting distributions).  This is consistent with meaning (2).  I 
think that the "checklist" use (meaning 1) really should be called 
something else because it is conceptually something very different.

Assuming that when we talk about a dwc:Occurrence we intend meaning (2), 
it is important to clarify what aspect of an organism occurring 
somewhere at some time we intend for dwc:Occurrence to mean.  When 
people talk about Occurrences, the conversation often goes awry because 
different people consider an Occurrence to include different sets of 
conceptual entities.  I don't know if images can be embedded in messages sent to 
the list, so look at this image:
http://bioimages.vanderbilt.edu/pages/resource-diagram.gif
before reading further.  In that diagram, I'm trying to be as generic as 
possible.  I think it is the intention of both TDWG and GBIF to go 
beyond thinking that Occurrences can only be specimens.  So consider 
that this generic Occurrence could be a PreservedSpecimen, but could 
also be an image of an organism, DNA sample, or any other token of the 
presence of the Organism at a particular time and place (or a 
HumanObservation that has no token at all).  I have heard people say 
that an Occurrence is a dctype:Event.  That recognizes the arrow on the 
left side of the diagram which represents the time and place of the 
Occurrence.  I have heard people say that if we photograph an organism, 
that is an "observation" with associated media.  That recognizes the 
collected metadata (i.e. the "observation") part and the representation 
of the organism part (the photograph).  When we talk about a 
PreservedSpecimen being an Occurrence, we probably intend the metadata 
as well as the physical thing in a jar or glued to a sheet of paper (the 
representation of the organism) and may or may not include the arrow on 
the left.  I have taken the position that an Occurrence includes all of 
the components shown in the diagram.  I'm not saying that this is the 
correct or only view on this subject, but if somebody intends for an 
Occurrence to mean something else, then they need to be clear about 
which component(s) of the diagram they are talking about. 

Being conceptually clear about these things is important because that 
clarity informs the decision-making process about the pending issues 
that I mentioned, such as whether DigitalStillImage should be added as a 
DwC type (and hence have a URI and be an accepted value for 
dwc:basisOfRecord) and how we should structure RDF when we try to 
describe the properties of an Occurrence.  If by "basisOfRecord" we mean 
a representation or token on which the Occurrence is based (or lack of 
token in the case of observations), then we should add as DwC types any 
type of physical or digital artifact that will be used by several people 
to document that an Occurrence existed at some point.  It would not make 
logical sense to say that sometimes the basisOfRecord can be an artifact 
like a specimen, but other supporting artifacts such as digital images 
cannot and must be relegated to being associatedMedia. 
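
To make the two readings of basisOfRecord concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the set of accepted type names is limited to values mentioned in this message and is not the full DwC type vocabulary, DigitalStillImage is the proposed addition rather than an accepted type, and the function name, record layout, person name, and example.org URL are all hypothetical.

```python
# Illustrative subset only: type names mentioned in this thread, not the
# full Darwin Core type vocabulary.
ACCEPTED_TYPES = {"PreservedSpecimen", "LivingSpecimen", "HumanObservation"}

# Proposed addition discussed in the pending issues; NOT an accepted DwC type.
PROPOSED_TYPES = {"DigitalStillImage"}


def check_basis_of_record(record, allow_proposed=False):
    """Return True if the record's basisOfRecord is in the allowed set."""
    allowed = (ACCEPTED_TYPES | PROPOSED_TYPES) if allow_proposed else ACCEPTED_TYPES
    return record.get("basisOfRecord") in allowed


# Hypothetical records: the same photographed organism, expressed two ways.
# (a) The image itself is the token on which the Occurrence is based.
image_as_basis = {
    "basisOfRecord": "DigitalStillImage",
    "recordedBy": "A. Photographer",
}
# (b) The image is relegated to associatedMedia of an observation.
image_as_media = {
    "basisOfRecord": "HumanObservation",
    "recordedBy": "A. Photographer",
    "associatedMedia": "http://example.org/photo.jpg",
}
```

Under the current types only form (b) passes the check; form (a) passes only when the proposed type is allowed, which is the change the pending requests ask for.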

I am not going to say more on this topic right now, partly because I 
have mid-semester progress reports to finish by the end of the day, but 
mostly because I wrote a paper discussing these issues and it lays out 
the conceptual framework I'm talking about better than I can in an 
email.  I have cited that paper both in my requests for the Darwin Core 
changes and in previous emails to this list.  However, based on the 
various emails that have been flying around, I don't think many people 
on the list have read it.  That paper isn't a spur-of-the-moment rant.  
I spent over a year writing it, solicited and received comments about it 
from a number of people including several people on the TAG, and went 
through the peer review process for several months before it was finally 
published this spring.  It does not necessarily represent "the correct" 
view on the topics that we are discussing, but I believe that it does 
represent a logically consistent way of conceptualizing Occurrences and 
how a broad range of types of Occurrences can be described and related 
to other resources.  If others can present clear and consistent 
alternatives to the framework that I've suggested, I would like to hear 
what they are.  The article, Biodiversity Informatics 7:14-44, can be 
accessed at https://journals.ku.edu/index.php/jbi/article/view/3664 .  
In particular, take note of the discussion on pp. 27-28 regarding the 
criterion for determining whether an Occurrence documents a species' 
distribution, p. 28 where I discuss the difference between the use of 
dwc:recordedBy and dcterms:created, and p. 29 where I suggest controlled 
values for dwc:establishmentMeans that can be used for differentiating 
the extent to which an individual documented by an Occurrence occurs 
"naturally" at its location (native, naturalized, adventive, or 
cultivated; these are intended to apply to either plants or animals, so 
a farm or zoo animal would be considered "cultivated").  I would be happy 
to define and propose these as a controlled vocabulary.  These are all things 
that have come up in this thread.  I also should note that I have been 
successfully applying this framework to live plant images at 
http://bioimages.vanderbilt.edu where I serve RDF that is consistent 
with the design discussed in the paper. 
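
The four establishmentMeans values suggested above could be treated as a controlled vocabulary along the following lines. This is a minimal Python sketch, not a definition of the terms: the class and function names are hypothetical, and the normalization rule (trim whitespace, lowercase) is an assumption made for illustration.

```python
from enum import Enum


class EstablishmentMeans(Enum):
    """The four values suggested in this message, with no definitions implied."""
    NATIVE = "native"
    NATURALIZED = "naturalized"
    ADVENTIVE = "adventive"
    CULTIVATED = "cultivated"  # per the message, also covers farm or zoo animals


def parse_establishment_means(value):
    """Map free text to a controlled value, or None if it is not in the list."""
    try:
        return EstablishmentMeans(value.strip().lower())
    except ValueError:
        return None
```

A value outside the list, such as "planted", comes back as None instead of being silently accepted, which is the practical benefit of a controlled vocabulary over free text.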

I would like to say more about the relationship between LivingSpecimens, 
Individuals, establishmentMeans, and indicating whether an Occurrence 
documents a species' distribution, but that will have to wait until later.

Steve Baskauf

joel sachs wrote:
...
One of the goals of the recent bioblitz was to think about the suitability 
and appropriateness of TDWG standards for citizen science. Robert Stevenson 
has volunteered to take the lead on preparing a technobioblitz lessons 
learned document, and though the scope of this document is not yet 
determined, I think the audience will include bioblitz organizers, 
software developers, and TDWG as a whole. I hope no one is shy about 
sharing lessons they think they learned, or suggestions that they have. We 
can use the bioblitz google group for this discussion, and copy in 
tdwg-content when our discussion is standards-specific.
Here are some of my immediate observations:
1. Darwin Core is almost exactly right for citizen science. However, there 
is a desperate need for examples and templates of its use. To illustrate 
this need: one of the developers spoke of the design choice between "a 
simple csv file and a Darwin Core record". But a simple csv file is a 
legitimate representation of Darwin Core! To be fair to the developer, 
such a sentence might not have struck me as absurd a year ago, before 
Remsen said "let's use DwC for the bioblitz".
We provided a couple of example DwC records (text and rdf) in the bioblitz 
data profile [1]. I  think the lessons learned document should include an 
on-line catalog of cut-and-pasteable examples covering a variety of use 
cases, together with a dead simple description of DwC, something like 
"Darwin Core is a collection of terms, together with definitions."
Here are areas where we augmented or diverged from DwC in the bioblitz:
i. We added obs:observedBy [2], since there is no equivalent property in 
DwC, and it's important in Citizen Science (though often not available).
ii. We used geo:lat and geo:long [3] instead of DwC terms for latitude and 
longitude. The geo namespace is a well used and supported standard, and 
records with geo coordinates are automatically mapped by several 
applications. Since everyone was using GPS to retrieve their coordinates, 
we were able to assume WGS-84 as the datum.
If someone had used another datum, say XYZ, we would have added columns to 
the Fusion table so that they could have expressed their coordinates in 
DwC, as, e.g.:
DwC:decimalLatitude=41.5
DwC:decimalLongitude=-70.7
DwC:geodeticDatum=XYZ
(I would argue that it should be kosher DwC to express the above as simply 
XYZ:lat and XYZ:long. DwC already incorporates terms from other 
namespaces, such as Dublin Core, so there is precedent for this.)
2. DwC:scientificName might be more user friendly than taxonomy:binomial 
and the other taxonomy machine tags EOL uses for flickr images.  If 
DwC:scientificName isn't self-explanatory enough, a user can look it up, 
and see that any scientific name is acceptable, at any taxonomic rank, or 
not having any rank. And once we have a scientific name, higher ranks can 
be inferred.
3. Catalogue of Life was an important part of the workflow, but we 
had some problems with it. Future bioblitzes might consider using 
something like a CoL fork, as recently described by Rod Page [4].
4. We didn't include "basisOfRecord" in the original data profile, and so 
it wasn't a column in the Fusion Table [5]. But when a transcriber felt it 
was necessary to include in order to capture data in a particular field 
sheet, she just added the column to the table. This flexibility of schema 
is important, and is in harmony with the semantic web.
5. There seemed to be enthusiasm for another field event at next year's 
TDWG. This could be an opportunity to gather other types of data (e.g. 
character data) and thereby i) expose meeting participants to another set 
of everyday problems from the world of biodiversity workflows, and 
ii) try other TDWG technology on for size, e.g. the observation exchange 
format, annotation framework, etc.
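
As an illustration of observation 1 above (a simple csv file is a legitimate representation of Darwin Core), here is a minimal Python sketch that writes records whose column headers are Darwin Core term names. It is only a sketch under stated assumptions: the function name to_dwc_csv is hypothetical, the species name is made up, the coordinates are the ones from the example in observation 1, and the datum is written out explicitly as WGS84 rather than assumed.

```python
import csv
import io

# Column headers are Darwin Core term names.
FIELDS = ["scientificName", "decimalLatitude", "decimalLongitude", "geodeticDatum"]


def to_dwc_csv(records):
    """Serialize a list of dicts keyed by DwC term names to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical record: the species name is invented for illustration.
records = [
    {
        "scientificName": "Quercus alba",
        "decimalLatitude": 41.5,
        "decimalLongitude": -70.7,
        "geodeticDatum": "WGS84",
    }
]
csv_text = to_dwc_csv(records)
```

Adding a column such as basisOfRecord, as the transcriber did in observation 4, would only require appending the term name to FIELDS.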
Happy Thanksgiving to all in Canada -
Joel.
----
1. http://groups.google.com/group/tdwg-bioblitz/web/tdwg-bioblitz-profile-v1-1
2. Slightly bastardizing our old observation ontology - 
http://spire.umbc.edu/ontologies/Observation.owl
3. http://www.w3.org/2003/01/geo/
4. http://iphylo.blogspot.com/2010/10/replicating-and-forking-data-in-2010.html
5. http://tables.googlelabs.com/DataSource?dsrcid=248798
-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu