Hi Steve,
Again, great post. Some more detailed comments:
I suppose that Rich and Rob W. have already looked at
http://code.google.com/p/darwin-sw/wiki/TaxonomicHeterogeneity .
I think it pretty much encapsulates what they are talking about.
Yes (see the P.S. of my reply to Gregor) -- and it's this reasoning that eventually convinced me that you were right about this.
I should note that the way DSW defines dsw:IndividualOrganism does not
require it to be a single organism.
It can be a collection of organisms (herd, colony, school) or part of an
organism (tissue).
Yes -- which is exactly why we defined it this way. The only reason we stripped the "organism" part (or, more precisely, why we think of "individualOrganism" as a subclass of "individual" -- though we haven't formally established it that way yet) is that in our application we want to track individuals that are not organisms (I'd be happy to provide examples -- but that falls outside the scope of this conversation... and this email list). The point is, our concept of "Individual" is the superclass (perhaps we should call it "Object"?), which includes a subclass of things that would be properly labeled as "IndividualOrganism" (or maybe "BiologicalObject").
The basic requirement is that it is a "taxonomically homogeneous entity".
In a variant form of DSW (dsw_alt.owl) we included "taxonomically
heterogeneous entity" (THeE)
which would basically include what Rich and Rob W. are talking about (lots
of organisms which are separable
and aren't necessarily from the same lowest taxonomic level).
No, that's not quite right. We've been defining "Individual" (in the context of organisms) as what you refer to as "taxonomically homogeneous entity". In other words, if more than one taxon is involved, we split them into appropriate subset individual instances (so that each individual is taxonomically homogeneous). Like I said, your arguments won me over. But that doesn't mean that the concept of THeE isn't a necessary one for our community. Examples are things like rocks containing multiple fossil organisms from multiple taxa, a collected rock on the seafloor containing multiple organisms, a soil sample, or a seafloor core (etc.). These are units of information that fall within the realm of "CollectionObject", but not within the concept of a "taxonomically homogeneous entity" (i.e., individualOrganism).
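To make the splitting rule concrete, here is a minimal sketch in Python of how a mixed object might be divided into taxonomically homogeneous individuals. It is only an illustration: the class names, field names, labels, and the "Taxon A"/"Taxon B" placeholders are all hypothetical, and none of them are DwC or DSW terms.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Organism:
    label: str  # hypothetical label for one organism on the collected object
    taxon: str  # hypothetical determination (placeholder names only)


@dataclass
class Individual:
    taxon: str
    members: list = field(default_factory=list)


def split_into_individuals(organisms):
    """Group organisms by taxon so that each Individual is taxonomically homogeneous."""
    by_taxon = defaultdict(list)
    for organism in organisms:
        by_taxon[organism.taxon].append(organism)
    # Sort by taxon name so the result order is predictable.
    return [Individual(taxon, members) for taxon, members in sorted(by_taxon.items())]


# A hypothetical collected rock carrying organisms from two taxa.
rock = [
    Organism("rock-1/a", "Taxon A"),
    Organism("rock-1/b", "Taxon A"),
    Organism("rock-1/c", "Taxon B"),
]
individuals = split_into_individuals(rock)
for individual in individuals:
    print(individual.taxon, [m.label for m in individual.members])
```

The rock itself (the heterogeneous thing) is deliberately not modeled here; only the homogeneous subsets are.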
Somehow these things need to be defined well enough that we can develop vocabularies and ontologies that are both practical to implement and meet our data management needs.
It should be no surprise that THeE does what Rich wants: we included it in DSW because, during the preceding discussion, Rich said he wanted something like it.
Again, while I agree there is a need for a THeE "thing", this is not what I'm talking about in this thread. I see that as a separate discussion -- which we need to have, but will probably only confuse this discussion. I think we should focus on the relationship between two things. The first is a "taxonomically homogeneous entity", which is what Rob W. and I include within our notion of "individual" as discussed in my earlier posts on this thread, and also, I believe, what dwc:individualID is intended to represent. The second is "materialSample", which I generally see as a subclass of "individual" (=dwc:individualID = Pyle/Whitton "Individual" = Baskauf "taxonomically homogeneous entity").
In dsw_alt.owl, properties like "hasPart" and "isPartOf" are used to connect physical entities whose properties can be inferred by
inheritance.
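One possible reading of "inferred by inheritance" is that a part takes on a property asserted on whatever it is part of. The Python sketch below illustrates that reading only; it is not how dsw_alt.owl is implemented, and the `Entity` class, its fields, and the example names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    name: str
    is_part_of: Optional["Entity"] = None  # stands in for an isPartOf link
    determination: Optional[str] = None    # set only where asserted directly


def inherited_determination(entity):
    """Walk up the isPartOf links until an asserted determination is found."""
    seen = set()
    current = entity
    while current is not None and id(current) not in seen:
        if current.determination is not None:
            return current.determination
        seen.add(id(current))  # guard against accidental cycles
        current = current.is_part_of
    return None


# Hypothetical chain: lot -> specimen -> tissue sample.
lot = Entity("lot", determination="Taxon A")
specimen = Entity("specimen", is_part_of=lot)
tissue = Entity("tissue sample", is_part_of=specimen)
orphan = Entity("unlinked sample")

print(inherited_determination(tissue))  # determination asserted on the lot
print(inherited_determination(orphan))  # nothing to inherit
```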
What this diagram includes that Rich did not mention are "tokens"
(evidence).
Same here. We use the term "Evidence", but I think we define it slightly differently from your "token". Again, this is an important topic -- but one I deliberately left out of my earlier posts, because my real question pertains to the proposal of "materialSample" as a new dwc term (along with materialSampleID). I want to make sure I understand what the relationship is between this concept, and the concept represented by dwc:individualID (e.g., would "dwc:materialSample" constitute a subclass of dwc:individual, if the latter also existed?) I'm not at all opposed to the introduction of "materialSample/ID" in DWC -- but I am not sure there has been enough clarification on the relationship/distinction between what these new terms represent, and what is already represented via dwc:individualID (and, of course, the relevant terms in Darwin-SW and its kin).
We defined a class for evidence, but we also considered not having
evidence
being an explicit class. Not defining an explicit Token class would have
simplified
the diagram at the bottom of the page - one could just say that there
should be
evidence and it should be linked to the resource it documents. Token and THeE/IndividualOrganism are not disjoint classes - the physical
entity
can be the evidence if somebody "owns" it and makes it available for
people to examine.
However, in DSW, Token and THeE are not synonymous because we allow
evidence
to include things that are not physically derived from the entity (e.g.
images, sounds,
string data records) in addition to physical specimens.
I have much to comment on concerning the above text, but I'll save it for a different thread with a different subject line.
I think that we have to be careful when we say "we don't need X", "there is pressing value for X but not for Y", "X is too vaguely defined", etc.
Agreed! I hope nobody thinks I was suggesting any of these things about materialSample. If so, I was misunderstood. What I want to discuss is how these things conceptually relate to each other. One of the biggest problems with DWC in actual use is not so much whether we have the right terms, or have too many, or whatever; rather, it's that the terms we have leave enough room for interpretation that people use them in subtly (or not-so-subtly) different ways. This is one of the things I heard repeatedly in Berlin last week -- that different providers use the same terms to represent different information.
I think it's incumbent on us to minimize further generation of this sort of ambiguity, so I want to make sure we are clear in what the terms are, how they are meant to be used, and how they relate to other terms in DWC.
MaterialSample does exactly what the metagenomics people need because they
invented it to serve the purposes they want it to serve (handle material
samples
in which one may or may not ever know what all organisms are included or
even
if there are organisms in it). Individual (sensu Pyle/Whitton)/THeH is
just vague
enough to do what Rich and Rob W. want it to do with their lots and
specimens,
but is too vague for Rob G.
Again, just to be absolutely clear: "Individual (sensu Pyle/Whitton)" <> "THeH (sensu Baskauf)"
IndividualOrganism (sensu DSW) and Token do exactly what Cam Webb and I
want them to do
with our images, specimens, DNA samples, and data records, and the
requirement that
IndividualOrganism be taxonomically homogeneous allows us to infer that a
determination
applied to one resource also applies to other resources which are derived
from the same
IndividualOrganism (a requirement not stated by the others) but it's too
restrictive for both
Rob G and Rich.
What you describe above is *EXACTLY* what "Individual (sensu Pyle/Whitton)" is intended to be, for *exactly* the same reasons. So I hope we can now dispense completely with the incorrect equivalence of "Individual (sensu Pyle/Whitton)" with "THeH (sensu Baskauf)".
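The inference described above (a determination made from one resource carrying over to every other resource derived from the same individual) can be sketched by attaching determinations to the individual rather than to the evidence. This is a minimal Python illustration under that assumption; the class names, function names, and identifiers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IndividualOrganism:
    identifier: str
    determinations: list = field(default_factory=list)


@dataclass
class Token:
    kind: str                         # e.g. "image", "specimen", "DNA sample"
    derived_from: IndividualOrganism  # the homogeneous individual it documents


def determine(token, taxon):
    """Record a determination made from one token against its individual."""
    token.derived_from.determinations.append(taxon)


def determinations_for(token):
    """Every determination of the individual applies to each of its tokens."""
    return list(token.derived_from.determinations)


fish = IndividualOrganism("ind-1")
image = Token("image", fish)
dna = Token("DNA sample", fish)

determine(image, "Taxon A")     # identified from the image only
print(determinations_for(dna))  # the DNA sample shares that determination
```

The inference only holds because the individual is required to be taxonomically homogeneous; with a heterogeneous individual the same lookup would give wrong answers.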
I think "materialSample" is much more analogous to (but not identical to) "collectionObject". The similarities include:
- Something extracted from nature that may or may not involve a biological organism (the concept of "collectionObject" could be used for a mineralogy collection, or a soil sample with no organisms in it).
- Something that is often taxonomically homogeneous, but not necessarily so
- Something that is studied or examined outside of its occurrence in nature
The main difference I see is that, traditionally at least, "collectionObject" refers to something typically intended to be maintained for a long time (like a voucher specimen), whereas "materialSample" explicitly includes things that only exist as such for a temporary period (e.g., are lost or destroyed upon analysis). But I don't think that "collectionObject" necessarily excludes things that are temporary, which is why "materialSample" really is either equivalent to an instance of "collectionObject", or is a subclass of "collectionObject".
So, we have this cloud of overlapping terms: "individual" (as implied by dwc:individualID), "individualOrganism" (as defined in DSW), "collectionObject" (as used historically in the ASC and MVZ models, and more recently in the iDigBio document that I sent the link for), and "materialSample" (as currently proposed).
The overlap of these things vastly exceeds the non-overlap, so I think it's important that we refine the definitions to more precisely articulate how they are the same, and how they are different. Once we get the concepts sorted out, *then* we should figure out the best terms to use.
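One low-tech way to articulate "how they are the same, and how they are different" is to list example things under each term and compare the lists as sets. The Python sketch below does that. The memberships are assumptions made up for illustration, not agreed definitions, and changing them changes the answers, which is rather the point.

```python
# Hypothetical example things; which term covers which thing is an assumption.
individual = {"fish in a jar", "school of fish", "tissue sample", "whale sighted at sea"}
collection_object = {"fish in a jar", "tissue sample", "soil core", "mineral specimen"}
material_sample = {"tissue sample", "soil core"}


def compare(first, second):
    """Report what two terms share and what each covers alone."""
    return {
        "both": first & second,
        "only_first": first - second,
        "only_second": second - first,
    }


report = compare(individual, collection_object)
print(sorted(report["both"]))
print(sorted(report["only_first"]))
print(sorted(report["only_second"]))

# Subclass questions become subset tests over the same examples.
print(material_sample <= collection_object)
print(material_sample <= individual)
```

With these particular examples, whether "materialSample" sits wholly inside "individual" turns on a single case (the soil core), which shows how a handful of agreed test cases could pin the definitions down.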
If we start in on the game of saying "WE need the features that I think are important but not the features that YOU think are important" then we are in for another month of massive email traffic on this list and will end up no better off than we were when we started.
ABSOLUTELY! Let's assume that any need expressed by anyone is, by definition, important.
I think that it is clear from this and preceding discussions that there is
a need for some system of tracking things that are like
individuals/organisms/samples/lots.
It is my belief that what needs to happen is:
1. define clearly what the various stakeholders want to accomplish by their version of individuals/organisms/samples/lots (i.e. use cases/competency questions)
2. use set theory or some other kind of logical system to describe clearly how the various versions of individuals/organisms/samples/lots are related to each other
3. examine alternative mechanisms for defining the relationships among the variously defined individuals/organisms/samples/lots terms and determine how each approach can or cannot satisfy the use cases/competency questions.
4. use one or more mechanisms which pass test #3 to define the terms that are deemed necessary and include them in some TDWG standard which may or may not be Darwin Core.
YES!!! This is *exactly* what I was heading towards, but you captured it much more clearly than I did. But I think it is a bit of a multi-dimensional issue. On one axis is, as you outline above, the hierarchy from "group" things down to "part" things (school, flock, lot, specimen, specimen part, tissue sample, DNA extraction, etc.). Another axis is the "taxonomically homogeneous" vs. "taxonomically heterogeneous" distinction (the latter of which is mostly needed for curated objects, rather than objects as they occur in nature -- although one could conceive of an "ecosystem" as a top-level THeE that our community may want to track). And yet another axis is the whole issue of "what is a 'thing'", contrasted with "what is evidence of a 'thing'" (e.g., a specimen in a jar is a "thing", whereas an image of that specimen is evidence of that "thing").
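The three axes can be treated as independent attributes of whatever is being tracked. The Python sketch below is one hypothetical way to write that down. The enum names, and where each example is placed along each axis, are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Scale(Enum):
    GROUP = "group"
    WHOLE = "whole"
    PART = "part"


class Composition(Enum):
    HOMOGENEOUS = "taxonomically homogeneous"
    HETEROGENEOUS = "taxonomically heterogeneous"


class Role(Enum):
    THING = "thing"
    EVIDENCE = "evidence of a thing"


@dataclass(frozen=True)
class Tracked:
    label: str
    scale: Scale
    composition: Composition
    role: Role


items = [
    Tracked("school of fish", Scale.GROUP, Composition.HOMOGENEOUS, Role.THING),
    Tracked("specimen in a jar", Scale.WHOLE, Composition.HOMOGENEOUS, Role.THING),
    Tracked("tissue sample", Scale.PART, Composition.HOMOGENEOUS, Role.THING),
    Tracked("image of that specimen", Scale.WHOLE, Composition.HOMOGENEOUS, Role.EVIDENCE),
    Tracked("soil sample", Scale.WHOLE, Composition.HETEROGENEOUS, Role.THING),
]


def select(candidates, **criteria):
    """Keep the items whose attributes match every given axis value."""
    return [
        item for item in candidates
        if all(getattr(item, axis) is value for axis, value in criteria.items())
    ]


heterogeneous_things = select(
    items, composition=Composition.HETEROGENEOUS, role=Role.THING
)
print([item.label for item in heterogeneous_things])
```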
In September 2011, John Wieczorek had packaged several of the proposed
class additions
to Darwin Core into a concrete proposal:
http://lists.tdwg.org/pipermail/tdwg-content/2011-September/002727.html .
This proposal was deferred by the Executive Committee (see the last
comment at
http://code.google.com/p/darwincore/issues/detail?id=117 ) "... until we
can further
examine broader changes including the new classes and any insights that
might come
out of the RDF Interest Group."
Ah! OK, this specifically answers my question concerning "where do we stand on discussions of an 'individual' class within DWC?" The last thing I want to do is slow down progress, so if a quick addition of materialSample/ID will help solve some problem, then I'm all for it. But to save long/complex conversations later on (which will inevitably happen when we try to reconcile dwc:individualID and the proposed dwc:materialSampleID), it's probably worth at least a bit of discussion now. If that discussion shows no signs of resolution anytime soon, then by all means, let's not bog the process down unnecessarily. But one of the reasons I spoke up on this thread is that what is different now from last September is that we (Rob W. and I) now have about a year's worth of experience dealing with the ontological relationships between "Location", "Event", "Occurrence", "Evidence", "Individual", "Determination" and "Taxon" (among other classes of things, like "Agent" and "Reference") -- both in terms of harvesting legacy data, and in building workflows to capture new data in real time. (Indeed, as I write this, Rob W. is aboard a NOAA ship somewhere between the Northwestern Hawaiian Islands and Johnston Atoll, on a 1-month cruise gathering in-situ observations, imagery, voucher specimens, and tissue samples, and is using the workflow that he and I developed to integrate these new data with pre-existing images/specimens/observations/literature reports/tissue samples/etc., to build an evidence-based checklist for the Northwestern Hawaiian Islands.) In other words, what we have now that we didn't have a year or two ago is real-world experience organizing this kind of information.
So the RDF Task Group has specifically been charged with the task of examining the addition of additional classes to Darwin Core and their implications. The RDF TG has assembled competency questions http://code.google.com/p/tdwg-rdf/wiki/CompetencyQuestions and use cases http://code.google.com/p/tdwg-rdf/wiki/UseCases but has not moved beyond that. So that's a start on Item #1 in the list
above.
However, the process has not moved beyond that. I recently made an appeal to the TG for someone to take up work on delivering some concrete progress on deliverables, but got no responses. I cannot be the person to move this forward for two reasons. One is that I already have my hands full with the DwC RDF guide (which doesn't address these issues) and the other is that I have reached the limits of my technical skills and am not able to take leadership on items #2-#4. Who will
champion this?
As soon as Rob W. returns from his cruise, he and I will discuss this in much more detail, and perhaps we can help make progress in this area. In the meantime, I'll review the content at the links you provided.
At the risk of making this email too long, I will add one more comment. There seems to be a developing consensus that an OWL ontology structured according to the OBO Foundry (http://www.obofoundry.org/) principles is the answer to #2 and #3 above. However, I have yet to see the evidence that the complexity introduced by a formal OWL ontology is necessary or any actual concrete examples of how an OBO-style ontology would be used to satisfy the use cases. We have shown with DSW that some use cases can be met using only simple RDF and SPARQL (i.e. no actual reasoner involved). I presume that Rich and Rob W. have
in
hand a technical solution to their use cases that doesn't involve RDF at
all.
Indeed, you are correct! Most of our implementation is at the database model and UI workflow level. We have not made an attempt to translate this into RDF/OWL/OBO sorts of documentation -- in part because this is not our area of expertise. We can provide ER diagrams and a plethora of use-case examples (from collected specimens with and without tissue samples and their DNA-ish derivatives, to tissue samples without vouchers, to in-situ imagery, to post-collection imagery, to observation-only occurrence, to literature-based occurrence, to telemetry-based occurrence data, to managing misidentifications, to determination and collecting-event inheritance via a hierarchy of individuals, etc.). We are by no means at the end-game of all of this. Rob and I still argue about a number of things (including how to precisely draw the lines between "Location", "Event", and "Occurrence", and which properties should be associated with instances in each class; as well as the question of whether a specimen in a Museum is really an "Occurrence" or an "Individual"); and we have not yet taken a stab at dealing with THeH use-cases. But we do have a lot of real-world experience now dealing with what is probably 90-95% of the data in our community.
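For readers less familiar with the "simple RDF and SPARQL, no reasoner" approach mentioned above: at its core it is pattern matching over subject/predicate/object statements. The sketch below imitates that in plain Python. It is not real RDF or SPARQL, and every identifier and predicate name (the "ex:" ones) is hypothetical rather than taken from DSW or DwC.

```python
# Statements as (subject, predicate, object); all names are made up.
triples = {
    ("ex:image1", "ex:derivedFrom", "ex:ind1"),
    ("ex:tissue1", "ex:derivedFrom", "ex:ind1"),
    ("ex:image2", "ex:derivedFrom", "ex:ind2"),
    ("ex:ind1", "ex:hasDetermination", "Taxon A"),
}


def match(pattern, store):
    """Yield the triples that fit a pattern in which None is a wildcard."""
    for triple in store:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple


def evidence_for_taxon(taxon, store):
    """Find everything derived from an individual determined as the given taxon."""
    individuals = {s for s, _, _ in match((None, "ex:hasDetermination", taxon), store)}
    return sorted(
        s for s, _, o in match((None, "ex:derivedFrom", None), store)
        if o in individuals
    )


print(evidence_for_taxon("Taxon A", triples))
```

Nothing is inferred here; the answer comes from joining two explicit statements, which is the sense in which no reasoner is involved.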
So I think that there need to be some iterations of defining and testing before we adopt a technology by acclamation. We've been down that road before with the TDWG Ontology and look how that turned out.
Agreed! Again, I'll discuss with Rob after he gets home (mid-June), and we'll try to put together some ER diagrams and a bunch of detailed use-cases. Perhaps others can provide use-cases that they have encountered (the more complex, the better), to challenge our approach to these sorts of things.
Aloha, Rich