[tdwg-content] New Darwin Core terms proposed relating to material samples

Mon May 27 03:53:58 CEST 2013

Hi Steve,

Again, great post.  Some more detailed comments:

> I suppose that Rich and Rob W. have already looked at
> http://code.google.com/p/darwin-sw/wiki/TaxonomicHeterogeneity .
> I think it pretty much encapsulates what they are talking about.

Yes (see the P.S. of my reply to Gregor) -- and it's this reasoning that
eventually convinced me that you were right about this.

> I should note that the way DSW defines dsw:IndividualOrganism does not
> require it to be a single organism.
> It can be a collection of organisms (herd, colony, school) or part of an
> organism (tissue).

Yes -- which is exactly why we defined it this way.  The only reason we
stripped the "organism" part (or, more precisely, why we think of
"individualOrganism" as a subclass of "individual" -- though we haven't
formally established it that way yet), is because in our application, we
want to track individuals that are not organisms (I'd be happy to provide
examples -- but that falls outside the scope of this conversation....and
this email list).  The point is, our concept of "Individual" is the
superclass (perhaps we should call it "Object"?), which includes a subclass
of things that would be properly labeled as "IndividualOrganism" (or maybe
"BiologicalObject").

> The basic requirement is that it is a "taxonomically homogeneous entity". 

> In a variant form of DSW (dsw_alt.owl) we included "taxonomically
> heterogeneous entity" (THeE)
> which would basically include what Rich and Rob W. are talking about (lots
> of organisms which are separable
> and aren't necessarily from the same lowest taxonomic level).

No, that's not quite right.  We've been defining "Individual" (in the
context of organisms) as what you refer to as "taxonomically homogeneous
entity".  In other words, if more than one taxon is involved, we split them
into appropriate subset individual instances (so that each individual is
taxonomically homogeneous).  Like I said, your arguments won me over.  But
that doesn't mean that the concept of THeE isn't a necessary one for our
community.  Examples are things like rocks containing multiple fossil
organisms from multiple taxa, a collected rock on the seafloor containing
multiple organisms, a soil sample, or a seafloor core (etc.).  These are
units of information that fall within the realm of "CollectionObject", but
not within the concept of a "taxonomically homogeneous entity" (i.e.,
individualOrganism).
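
As a rough illustration of the splitting rule described above, here is a
minimal Python sketch.  The class name, field names, and ID scheme are
hypothetical (they are not taken from DwC, DSW, or any actual database), and
the taxon labels are placeholders:

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Individual:
    """A taxonomically homogeneous entity (hypothetical class name)."""
    individual_id: str
    taxon: str
    specimen_ids: list = field(default_factory=list)


def split_by_taxon(lot_id, specimens):
    """Split a mixed lot into one Individual per taxon.

    `specimens` is a list of (specimen_id, taxon) pairs.
    """
    groups = defaultdict(list)
    for specimen_id, taxon in specimens:
        groups[taxon].append(specimen_id)
    # Sort by taxon so the generated IDs are stable between runs.
    return [
        Individual(f"{lot_id}-{n}", taxon, ids)
        for n, (taxon, ids) in enumerate(sorted(groups.items()), start=1)
    ]


# Placeholder data: one collected lot containing two taxa.
lot = [("s1", "Taxon A"), ("s2", "Taxon B"), ("s3", "Taxon A")]
individuals = split_by_taxon("LOT42", lot)
```

Each resulting record is taxonomically homogeneous.  Something like a soil
sample or seafloor core would not go through this split; it would need its
own class of record.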

Somehow these things need to be defined well enough that we can develop
vocabularies and ontologies that are both practical to implement and meet
our data management needs.

> It should be no surprise that THeE does what Rich wants because we 
> included it in DSW because during the preceding discussion Rich said 
> he wanted something like it.  

Again, while I agree there is a need for a THeE "thing", this is not what
I'm talking about in this thread.  I see that as a separate discussion --
which we need to have, but which will probably only confuse this one.  I
think we should focus on the relationship between two things.  The first is
the "taxonomically homogeneous entity", which is what Rob W. and I include
within our notion of "individual" as discussed in my earlier posts on this
thread (and also, I believe, what dwc:individualID is intended to
represent).  The second is "materialSample", which I generally see as a
subclass of "individual" (= dwc:individualID = Pyle/Whitton "Individual" =
Baskauf "taxonomically homogeneous entity").

> In dsw_alt.owl, properties like "hasPart" and "isPartOf" are used to 
> connect physical entities whose properties can be inferred by
> inheritance.
> What this diagram includes that Rich did not mention are "tokens"
> (evidence).

Same here.  We use the term "Evidence", but I think we define it slightly
differently from your "token".  Again, this is an important topic -- but one
I deliberately left out of my earlier posts, because my real question
pertains to the proposal of "materialSample" as a new dwc term (along with
materialSampleID).  I want to make sure I understand what the relationship
is between this concept, and the concept represented by dwc:individualID
(e.g., would "dwc:materialSample" constitute a subclass of dwc:individual,
if the latter also existed?)  I'm not at all opposed to the introduction of
"materialSample/ID" in DWC -- but I am not sure there has been enough
clarification on the relationship/distinction between what these new terms
represent, and what is already represented via dwc:individualID (and, of
course, the relevant terms in Darwin-SW and its kin).

> We defined a class for evidence, but we also considered not having
> evidence being an explicit class.  Not defining an explicit Token class
> would have simplified the diagram at the bottom of the page - one could
> just say that there should be evidence and it should be linked to the
> resource it documents.
> Token and THeE/IndividualOrganism are not disjoint classes - the physical
> entity can be the evidence if somebody "owns" it and makes it available
> for people to examine.
> However, in DSW, Token and THeE are not synonymous because we allow
> evidence to include things that are not physically derived from the
> entity (e.g. images, sounds, string data records) in addition to physical
> specimens.

I have much to comment on concerning the above text, but I'll save it for a
different thread with a different subject line.

> I think that we have to be careful when we say "we don't need X", 
> "there is pressing value for X but not for Y", 
> "X is too vaguely defined", etc.  

Agreed!  I hope nobody thinks I was suggesting any of these things about
materialSample.  If so, I was misunderstood.  What I want to discuss is how
these things conceptually relate to each other.  One of the biggest problems
with DWC in actual use is not so much whether we have the right terms, or
have too many, or whatever; rather, it's that the terms we have leave enough
room for interpretation that people use them in subtly (or not-so-subtly)
different ways.  This is one of the things I heard repeatedly in Berlin last
week -- that different providers use the same terms to represent different
information.

I think it's incumbent on us to minimize further generation of this sort of
ambiguity, so I want to make sure we are clear in what the terms are, how
they are meant to be used, and how they relate to other terms in DWC.

> MaterialSample does exactly what the metagenomics people need because they

> invented it to serve the purposes they want it to serve (handle material
> samples in which one may or may not ever know what all organisms are
> included or even if there are organisms in it).  Individual (sensu
> Pyle/Whitton)/THeE is just vague enough to do what Rich and Rob W. want
> it to do with their lots and specimens, but is too vague for Rob G.

Again, just to be absolutely clear: "Individual (sensu Pyle/Whitton)" <>
"THeE (sensu Baskauf)".

> IndividualOrganism (sensu DSW) and Token does exactly what Cam Webb and I
> want it to do with our images, specimens, DNA samples, and data records,
> and the requirement that IndividualOrganism be taxonomically homogeneous
> allows us to infer that a determination applied to one resource also
> applies to other resources which are derived from the same
> IndividualOrganism (a requirement not stated by the others) but it's too
> restrictive for both Rob G and Rich.

What you describe above is *EXACTLY* what "Individual (sensu Pyle/Whitton)"
is intended to be, for *exactly* the same reasons.  So I hope we can now
dispense completely with the incorrect equivalence of "Individual (sensu
Pyle/Whitton)" with "THeE (sensu Baskauf)".
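
The inference described in the quoted passage (a determination applied to one
resource also applies to other resources derived from the same individual)
could be sketched roughly as follows.  This is only a minimal sketch assuming
a simple part-of chain; the class and attribute names are invented for
illustration and do not come from DSW or DwC:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndividualNode:
    """One record in a part-of hierarchy (hypothetical names throughout)."""
    node_id: str
    part_of: Optional["IndividualNode"] = None  # e.g. DNA -> tissue -> specimen
    determination: Optional[str] = None         # taxon asserted directly on this record

    def effective_determination(self) -> Optional[str]:
        """Walk up the part-of chain until a determination is found."""
        node = self
        while node is not None:
            if node.determination is not None:
                return node.determination
            node = node.part_of
        return None


# Placeholder data: only the specimen carries a determination.
specimen = IndividualNode("specimen-1", determination="Taxon A")
tissue = IndividualNode("tissue-1", part_of=specimen)
dna = IndividualNode("dna-1", part_of=tissue)

inherited = dna.effective_determination()
```

The walk is only safe because every node in the chain is assumed to be
taxonomically homogeneous; with a heterogeneous parent, the inherited name
could be wrong for some of its parts.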

I think "materialSample" is much more analogous to (but not identical to)
"collectionObject".  The similarities include:

- Something extracted from nature that may or may not involve a biological
organism (the concept of "collectionObject" could be used for a mineralogy
collection, or a soil sample with no organisms in it).
- Something that is often taxonomically homogeneous, but not necessarily so
- Something that is studied or examined outside of its occurrence in nature

The main difference I see is that, traditionally at least,
"collectionObject" refers to something typically intended to be maintained
for a long time (like a voucher specimen), whereas "materialSample"
explicitly includes things that only exist as such for a temporary period
(e.g., are lost or destroyed upon analysis).  But I don't think that
"collectionObject" necessarily excludes things that are temporary, which is
why "materialSample" is really either equivalent to "collectionObject" or
a subclass of it.

So, we have this cloud of overlapping terms: "individual" (as implied by
dwc:individualID), "individualOrganism" (as defined in DSW),
"collectionObject" (as used historically in the ASC and MVZ models, and more
recently in the iDigBio document that I sent the link for), and
"materialSample" (as currently proposed).

The overlap of these things vastly exceeds the non-overlap, so I think it's
important that we refine the definitions to more precisely articulate how
they are the same, and how they are different.  Once we get the concepts
sorted out, *then* we should figure out the best terms to use.
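
One way to make "how they are the same, and how they are different" testable
is to list which example objects each term would cover and compare the sets.
The sketch below does that in Python.  The memberships are illustrative
guesses based on the examples in this thread, not agreed definitions, and the
dictionary and function names are made up:

```python
# Which example objects would each term cover?  (Illustrative guesses only.)
coverage = {
    "individual": {"whole fish", "school of fish", "tissue sample"},
    "individualOrganism": {"whole fish", "school of fish", "tissue sample"},
    "collectionObject": {"whole fish", "tissue sample", "soil sample",
                         "mineral specimen"},
    "materialSample": {"tissue sample", "soil sample"},
}


def compare(a, b):
    """Return (shared, only_a, only_b) for two terms."""
    sa, sb = coverage[a], coverage[b]
    return sa & sb, sa - sb, sb - sa


shared, only_ms, only_co = compare("materialSample", "collectionObject")

# Under these assumed memberships, materialSample behaves like a subclass
# of collectionObject: everything it covers is also a collectionObject.
is_subclass = coverage["materialSample"] <= coverage["collectionObject"]
```

If someone can name a real object that lands in `only_ms`, the subclass
reading is wrong and the definitions need another pass.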

> If we start in on the game of saying "WE need the features that I think 
> are important but not the features that YOU think are important" then 
> we are in for another month of massive email traffic on this list and 
> will end up no better off than we were when we started.

ABSOLUTELY!  Let's assume that any need expressed by anyone is, by
definition, important.

> I think that it is clear from this and preceding discussions that there is

> a need for some system of tracking things that are like
> individuals/organisms/samples/lots.
> It is my belief that what needs to happen is:
> 1. define clearly what the various stakeholders want to accomplish by
>    their version of individuals/organisms/samples/lots (i.e. use
>    cases/competency questions)
> 2. use set theory or some other kind of logical system to describe
>    clearly how the various versions of individuals/organisms/samples/lots
>    are related to each other
> 3. examine alternative mechanisms for defining the relationships among
>    the variously defined individuals/organisms/samples/lots terms and
>    determine how each approach can or cannot satisfy the use
>    cases/competency questions.
> 4. use one or more mechanisms which pass test #3 to define the terms that
>    are deemed necessary and include them in some TDWG standard which may
>    or may not be Darwin Core.

YES!!!  This is *exactly* what I was heading towards, but you captured it
much more clearly than I did.  But I think it is a bit of a
multi-dimensional issue.  On one axis is, as you outline above, the
hierarchy from "group" things down to "part" things (school, flock, lot,
specimen, specimen part, tissue sample, DNA extraction, etc.).  Another axis
is the "taxonomically homogeneous" vs. "taxonomically heterogeneous"
distinction (the latter of which is mostly needed for curated objects,
rather than objects as they occur in nature -- although one could conceive
of an "ecosystem" as a top-level THeE that our community may want to track).
And yet another axis is the whole issue of "what is a 'thing'", contrasted
with "what is evidence of a 'thing'" (e.g., a specimen in a jar is a
"thing", whereas an image of that specimen is evidence of that "thing").

> In September 2011, John Wieczorek had packaged several of the proposed
> class additions to Darwin Core into a concrete proposal:
> http://lists.tdwg.org/pipermail/tdwg-content/2011-September/002727.html .
> This proposal was deferred by the Executive Committee (see the last
> comment at http://code.google.com/p/darwincore/issues/detail?id=117 )
> "... until we can further examine broader changes including the new
> classes and any insights that might come out of the RDF Interest Group."

Ah!  OK, this specifically answers my question concerning where we stand
on discussions of an "individual" class within DWC.  The last thing I want
to do is slow down progress, so if a quick addition of materialSample/ID
will help solve some problem, then I'm all for it.  But to save long/complex
conversations later on (which will inevitably happen when we try to
reconcile dwc:individualID and the proposed dwc:materialSampleID), it's
probably worth at least a bit of discussion now.  If that discussion shows
no signs of resolution anytime soon, then by all means, let's not bog the
process down unnecessarily.  But one of the reasons I spoke up on this
thread is that something is different now from last September: we (Rob
W. and I) now have about a year's worth of experience dealing with the
ontological relationships between "Location", "Event", "Occurrence",
"Evidence", "Individual", "Determination" and "Taxon" (among other classes
of things, like "Agent" and "Reference") -- both in terms of harvesting
legacy data, and in building workflows to capture new data in real time.
(Indeed, as I write this, Rob W. is currently aboard a NOAA ship somewhere
between the Northwestern Hawaiian Islands and Johnston Atoll, on a 1-month
cruise gathering in-situ observations, imagery, voucher specimens, and
tissue samples, and is using the workflow that he and I developed to
integrate these new data with pre-existing
images/specimens/observations/literature reports/tissue samples/etc., to
build an evidence-based checklist for the Northwestern Hawaiian Islands.)  In
other words, what we have now that we didn't have a year or two ago is
real-world experience organizing this kind of information.

> So the RDF Task Group has specifically been charged with 
> the task of examining the addition of additional classes to Darwin Core 
> and their implications.  The RDF TG has assembled competency 
> questions http://code.google.com/p/tdwg-rdf/wiki/CompetencyQuestions 
> and use cases http://code.google.com/p/tdwg-rdf/wiki/UseCases but 
> has not moved beyond that.  So that's a start on Item #1 in the list
> above.
> However, the process has not moved beyond that.  I recently made an 
> appeal to the TG for someone to take up work on delivering some 
> concrete progress on deliverables, but got no responses.  I cannot be 
> the person to move this forward for two reasons.  One is that I already 
> have my hands full with the DwC RDF guide (which doesn't address these 
> issues) and the other is that I have reached the limits of my technical 
> skills and am not able to take leadership on items #2-#4.  Who will
> champion this?

As soon as Rob W. returns from his cruise, he and I will discuss this in
much more detail, and perhaps we can help make progress in this area.  In
the meantime, I'll review the content at the links you provided.

> At the risk of making this email too long, I will add one more comment.  
> There seems to be a developing consensus that an OWL ontology 
> structured according to the OBO Foundry (http://www.obofoundry.org/) 
> principles is the answer to #2 and #3 above.  However, I have yet to see 
> the evidence that the complexity introduced by a formal OWL ontology 
> is necessary or any actual concrete examples of how an OBO-style ontology 
> would be used to satisfy the use cases.  We have shown with DSW that 
> some use cases can be met using only simple RDF and SPARQL 
> (i.e. no actual reasoner involved).  I presume that Rich and Rob W. have
> in hand a technical solution to their use cases that doesn't involve RDF
> at all.

Indeed, you are correct!  Most of our implementation is at the database
model and UI workflow level.  We have not made an attempt to translate this
into RDF/OWL/OBO sorts of documentation -- in part because this is not our
area of expertise.  We can provide ER diagrams and a plethora of use-case
examples (from collected specimens with and without tissue samples
and their DNA-ish derivatives, to tissue samples without vouchers, to
in-situ imagery, to post-collection imagery, to observation-only occurrence,
to literature-based occurrence, to telemetry-based occurrence data, to
managing misidentifications, to determination and collecting-event
inheritance via a hierarchy of individuals, etc.).  We are by no means at the
end-game of all of this.  Rob and I still argue about a number of things
(including how to precisely draw the lines between "Location", "Event", and
"Occurrence", and which properties should be associated with instances in
each class; as well as the question of whether a specimen in a Museum is
really an "Occurrence" or an "Individual"); and we have not yet taken a stab
at dealing with THeE use-cases.  But we do have a lot of real-world
experience now dealing with what is probably 90-95% of the data in our
community.

> So I think that there need to be some iterations of defining and testing 
> before we adopt a technology by acclimation.  We've been down that 
> road before with the TDWG Ontology and look how that turned out.

Agreed!  Again, I'll discuss with Rob after he gets home (mid-June), and
we'll try to put together some ER diagrams and a bunch of detailed
use-cases.  Perhaps others can provide use-cases that they have encountered
(the more complex, the better), to challenge our approach to these sorts of
things.

Aloha,
Rich