Hi Steve,
Many thanks for taking the time to carefully articulate all of this. Though your post is long, I think it is clear and well-written, and perhaps a good genesis for a web or wiki page somewhere, to add to the collective documentation of DwC-space.
A couple of comments, as I read through what you wrote:
However, this decision does not excuse us from thinking carefully about whether a term can be appropriately applied to a resource that is a member of some class (e.g. should we say that a digital photograph has a scientific name?). Placing a term within a class is a suggestion that the term would appropriately be applied as a property of an instance of a class.
I'm a bit unsure where this notion of "digital photograph has a scientific name" comes into play. My best guess is that if the basisOfRecord for an Occurrence record is "Digital Image" (not actually listed among the examples at http://rs.tdwg.org/dwc/terms/#basisOfRecord), perhaps a consumer of such a record will misinterpret it as though the Image *is* the item that Occurs (and hence has a scientific name). But I think of basisOfRecord as the "basis" of our *belief* (aka "evidence") that the Occurrence was real. That is, the Occurrence is always understood to refer to an organism (though the documentation doesn't say this explicitly), and the "basis" of the Occurrence is the reason we have for believing that the organism occurred at a place and time. Again, my interpretation of this may be wrong, and it's confounded by our tendency to shortcut information. For example, often in our community the statement "an individual organism that was documented to occur at a place and time was identified by someone as belonging to a taxon concept that is best represented by this scientific name" is truncated to "Occurrence has scientific name". Because we tend to do this, I can easily understand taking it one step further and reducing the statement "an individual organism that was documented *by a digital image* to occur at a place and time was identified by someone as belonging to a taxon concept that is best represented by this scientific name" to "Digital Image has scientific name".
So, maybe we should try to avoid these short-cut representations of our data as much as possible?
- When users want to "flatten" and simplify their databases, they tend to eliminate one-to-many (1:M) relationships in favor of one-to-one (1:1) relationships. The result of that is differences like we saw in
http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif (which allows 1:M relationships between Occurrences and Events and between Events and Locations) and
http://bioimages.vanderbilt.edu/pages/rich-diagram2.gif (which "atomizes" every Occurrence by considering it to have its own separate eventTime and Location information).
Another way to look at the difference between diagram1 and diagram2 is that the latter is simply a flattened (aka "denormalized") version of the first. I don't think the latter is really a more "atomized" version, because if eventTime and Location are recorded with such precision, then they could easily be represented in the structure of diagram1 -- except that the 1:M links would tend to shift in abundance towards more 1:1 (there is no problem representing a 1:M relationship where most instances actually are 1:1). But what really makes diagram2 different is that it is likely to include replication of identical Event property values over multiple records in cases where there really is a 1:M relationship between Event and Occurrence. No harm there -- it's just a bit denormalized. Denormalization is fine for mechanisms to transmit bulk content around -- especially if the content was generated from more normalized original data structures at the source.
A. There is nothing intrinsically "right" or "wrong" about either of these approaches, because they each have their own advantages. The 1:M approach is more efficient, but results in a more complicated database, while the 1:1 approach results in a simpler database but may require repeating some or many term values in the records.
Exactly. Perhaps I misunderstood your point about "atomized".
C. This collapsing of the diagram is also the reason for some disagreement about whether a term belongs in a certain class or not. In the example above, 1:1 people would say that eventDate is a property of an Occurrence, while 1:M people would say that eventDate is a property of an Event.
That's not quite how I would characterize it. I would say, if you establish an Event class at all, then eventDate is clearly a member of that Class. I think the real question is "Do we need an Event class?" If yes, then eventDate belongs to it. If no, then we "collapse" eventDate to Occurrence.
By the way, when I say "Do we need an Event class?", I mean it at two levels. At one level, the question is: "Is it useful to establish it within DwC?" At another level, it's "How shall I structure my package of data using DwC terms?" My understanding (which is crude) is that even with an Event class defined in DwC, I still have the choice of representing my Occurrence data as:
1) "Normalized":
===================
occurrenceID: 1234
eventID: 9876
identificationID: 7654
individualCount: 4
recordedBy: "J. Smith"
===================
eventID: 9876
LocationID: 4567
eventDate: 24-October-2010
eventTime: 02:13:00
===================
LocationID: 4567
decimalLatitude: 52.453016
decimalLongitude: 13.309418
geodeticDatum: "WGS84"
country: "Germany"
locality: "Botanischer Garten Und Botanisches Museum Berlin-Dahlem"
===================
identificationID: 7654
taxonID: 2345
identifiedBy: "Richard Pyle"
dateIdentified: 24-October-2010
===================
taxonID: 2345
scientificName: "Homo sapiens Linnaeus 1758"
namePublishedIn: "Linnaeus, C. 1758. Systema Naturae...."
nameAccordingTo: "Linnaeus, C. 1758. Systema Naturae...."
===================
2) "Flattened":
===================
occurrenceID: 1234
individualCount: 4
recordedBy: "J. Smith"
eventDate: 24-October-2010
eventTime: 02:13:00
decimalLatitude: 52.453016
decimalLongitude: 13.309418
geodeticDatum: "WGS84"
country: "Germany"
locality: "Botanischer Garten Und Botanisches Museum Berlin-Dahlem"
identifiedBy: "Richard Pyle"
dateIdentified: 24-October-2010
scientificName: "Homo sapiens Linnaeus 1758"
namePublishedIn: "Linnaeus, C. 1758. Systema Naturae...."
nameAccordingTo: "Linnaeus, C. 1758. Systema Naturae...."
===================
In my understanding, both of these would be legitimate implementations of the DwC terms. The difference is that in the first case, the content is normalized such that the values of the different properties (sorry, Bob -- not sure of the correct word here) are inherited through the various "[class]ID" links; whereas in the "flattened" version, the properties are represented directly on the Occurrence instance.
The advantage of the first is that the atomized and ID'd class instances can be reused for multiple occurrences, whereas the advantage of the second is that it greatly simplifies the content structure.
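To make the "inherited through the [class]ID links" idea concrete, here is a minimal sketch in Python. It is only my illustration, not anything defined by DwC: the in-memory tables, the LINKS mapping, and the function names are all assumptions, and the values are simply the ones from the example records above (using the DwC spelling occurrenceID).

```python
# Hypothetical in-memory tables, one per class, keyed by their [class]ID.
occurrences = {
    "1234": {"eventID": "9876", "identificationID": "7654",
             "individualCount": 4, "recordedBy": "J. Smith"},
}
events = {
    "9876": {"LocationID": "4567", "eventDate": "24-October-2010",
             "eventTime": "02:13:00"},
}
locations = {
    "4567": {"decimalLatitude": 52.453016, "decimalLongitude": 13.309418,
             "geodeticDatum": "WGS84", "country": "Germany",
             "locality": "Botanischer Garten Und Botanisches Museum Berlin-Dahlem"},
}
identifications = {
    "7654": {"taxonID": "2345", "identifiedBy": "Richard Pyle",
             "dateIdentified": "24-October-2010"},
}
taxa = {
    "2345": {"scientificName": "Homo sapiens Linnaeus 1758",
             "namePublishedIn": "Linnaeus, C. 1758. Systema Naturae....",
             "nameAccordingTo": "Linnaeus, C. 1758. Systema Naturae...."},
}

# Assumption: each linking term points at exactly one table.
LINKS = {"eventID": events, "LocationID": locations,
         "identificationID": identifications, "taxonID": taxa}


def flatten(record):
    """Replace every [class]ID link with the properties it points to."""
    flat = {}
    for term, value in record.items():
        if term in LINKS:
            # Follow the link and pull the linked properties up a level.
            flat.update(flatten(LINKS[term][value]))
        else:
            flat[term] = value
    return flat


def flatten_occurrence(occurrence_id):
    """Build the flattened record for one Occurrence."""
    flat = {"occurrenceID": occurrence_id}
    flat.update(flatten(occurrences[occurrence_id]))
    return flat


flat_record = flatten_occurrence("1234")
```

The result carries the same 15 terms as the flattened record above, with none of the linking IDs left in it (the key order differs, which does not matter for a dictionary).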
- I would propose that the "right" relationship diagram is not necessarily one that caters to a certain "right" philosophical point of view. Rather, the "right" diagram is the one that allows users to define the relationships that they need for the organization of their metadata in the simplest manner, and which provides the most clarity about what resources of various kinds are, and how they are connected.
Agreed. But another component to "rightness" is the extent to which users want to re-use content from the various class instances. For example, it's incredibly easy to convert the Normalized version to the flattened version. But it's not always so easy to parse the flattened version back to the normalized one. You can always do it by creating unique values of the provided terms for each class, but this can be potentially misleading and artificial -- especially if there were more properties for each of the individual class entities that were not included with the packaged data.
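Here is a rough Python sketch of that reverse direction, again purely as an illustration. The function name split_out, the "synthetic-N" identifiers, and occurrence records 1235 and 1236 are hypothetical; only the first record's values come from the example earlier in this message.

```python
def split_out(flat_records, terms, id_term):
    """Pull `terms` out of flat records into a separate table.

    One synthetic ID is minted per distinct combination of values,
    which is exactly the step that can be artificial: two records
    that merely share the same values are treated as one instance.
    """
    ids_by_values = {}   # tuple of values -> synthetic ID
    table = {}           # synthetic ID -> extracted properties
    slimmed = []         # original records with `terms` replaced by a link
    for record in flat_records:
        values = tuple(record.get(term) for term in terms)
        if values not in ids_by_values:
            new_id = f"synthetic-{len(ids_by_values) + 1}"
            ids_by_values[values] = new_id
            table[new_id] = dict(zip(terms, values))
        rest = {k: v for k, v in record.items() if k not in terms}
        rest[id_term] = ids_by_values[values]
        slimmed.append(rest)
    return table, slimmed


# Records 1235 and 1236 are made up for the illustration.
flat_records = [
    {"occurrenceID": "1234", "recordedBy": "J. Smith",
     "eventDate": "24-October-2010", "eventTime": "02:13:00"},
    {"occurrenceID": "1235", "recordedBy": "J. Smith",
     "eventDate": "24-October-2010", "eventTime": "02:13:00"},
    {"occurrenceID": "1236", "recordedBy": "J. Smith",
     "eventDate": "25-October-2010", "eventTime": "09:00:00"},
]

event_table, slim_occurrences = split_out(
    flat_records, ("eventDate", "eventTime"), "eventID")
```

Records 1234 and 1235 end up pointing at the same synthetic Event only because their date and time match. Whether they really were the same Event, and whether that Event had other properties at the source, cannot be recovered from the flat data.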
There also seemed to be a consensus that an observation was simply an Occurrence that did not have an associated token.
Well...technically the "token" in this case is a pattern of neurons in the observer's brain that constitute a memory....but that may be a bit abstract.
http://bioimages.vanderbilt.edu/pages/token-assumed.gif which I will refer to as the "assumed token" model and
http://bioimages.vanderbilt.edu/pages/token-explicit.gif which I will refer to as the "explicit token" model.
Nice -- and without reading another word of your message, I'm going to take a chance and say that I conceptually agree with your "token-explicit" diagram. The hard part is (as with the case of class:individual) deciding whether this level of "normalization" is valuable for DwC purposes.
I believe that historically the assumed token model has been the one which most people have had in mind.
Actually, I've always envisioned it as you have in your token-explicit version (and have said as much at various meetings to discuss DwC, going back to 1.0). In fact, I remember discussing this exact issue with Stan Blum long before DwC existed (he was the first to suggest to me the term "evidence" in this context -- which I think is functionally equivalent to your "token"). However, I've conceded that this level of normalization would probably be too much for the intended purpose of the DwC terms. But I'll keep an open mind on that.
Before the new DwC standard, we had specimens and we had observations. In order to avoid redundancies in terms for those two types of "things", a combined "thing" called "Occurrence" was created. An Occurrence that was an observation didn't have a token and an Occurrence that was a specimen had a physical or living specimen as its token.
My rationalization of it in the early days (pre-DwC) was that *everything* was effectively an observation, and beyond that, the only question was a matter of evidence. In my earliest models, I categorized "evidence" into "Specimen", "Image", "Literature Report", and "Unvouchered Observation" (I was using the word "voucher" in the general sense, as in the verb "to vouch" -- not in the more specific sense for our community, which implies "Specimen preserved in Museum"). My read on the history of DwC is that it was initially established as a means to aggregate and/or share Specimen data amongst Museums (hence its Specimen-centric nature). Later, the Specimen/Observation dichotomy was introduced to allow DwC content to provide more sophisticated and complete representations of the occurrence of organisms in place and time, because there was much more information than what existed as specimens in Museums. In my mind, the "Observation" side was effectively a collapsing of my "Image", "Literature Report" and "Unvouchered Observation" -- which I was OK with in the context of the time, because the vast majority of content available in computer databases then came from museum specimen databases, and from observational databases (largely in the bird realm).
So...I see the current iteration of DwC as another step in the evolution of moving from "sharing and aggregating specimen data among museums" to "documenting biodiversity in nature". It's not all the way into the fully normalized representation of biodiversity data, but it's far enough that it is a nice compromise between practical and effective for the majority of the user constituency. In my mind, the next logical step in this evolutionary trajectory would be to recognize "Individual" as a class (which DwC is already primed for, via individualID).
It is not clear how one is supposed to handle the actual metadata for the image that serves as the token.
That seems to me to be in the domain of the TDWG MRTG group (http://www.tdwg.org/standards/638/).
Unlike specimens where the token's metadata terms are placed in the Occurrence class, I guess in the case of an image one is supposed to use associatedMedia to link the so-called MachineObservation to the image record. If DNA were extracted, one would link the sequence to the Occurrence using associatedSequences (although it's not clear to me what the basisOfRecord for that would be - "TookATissueSample"?). But what does one do for other kinds of tokens, like seeds or tissue samples - create terms like associatedSeed and associatedTissueSample?
In my mind, things like seeds, tissue samples, and DNA sequences are simply different kinds of specimens (just like dried skeletons vs. botanical pressed sheets vs. whole organisms in jars of alcohol vs. prepared skins, etc.) They may have certain properties specific to each subclass of specimen, but fundamentally I think it's fair to treat them as specimens. DNA sequences are a bit different, of course, because they are not the "stuff" of an organism, but rather an indirect representation of the "stuff". In my mind, that difference justifies associatedSequences, where we don't have associatedSeeds, associatedTeeth, associatedSkins, associatedSkeletons, etc.
However, if I'm going to make this admission, I demand that the other guilty parties also confess, namely people who want to assert that Occurrences have properties that actually are properties of specimens.
I'll plead innocent of those charges, as I always understood the representation of Specimen-properties as applied to Occurrence instances as just another compromise of "flat" vs. "normalized", in much the same way that applying properties of Locations to Events, and Locations+Events to Occurrences, would likewise be compromises in the interest of simplification.
If we accept the explicit token model, then as a biodiversity informatics resource type "observation" will have to disappear into a puff of nothingness
Not necessarily. See my comment earlier about patterns of neurons in a human brain that constitute a memory. Just as a digital image rendered on a hard disk requires certain machinery to convert into photons that strike our retinas (i.e., a computer and monitor), so too does a memory require such machinery (e.g., the brain itself, transmission of sound waves via vocal cords, sound waves striking ear drums, etc.) This may sound weird, but I'm being serious: a human memory is, fundamentally, every bit as much of a "token" as a specimen or a digital image. It's just considerably less accessible and well-resolved.
Realistically, I can't see this kind of separation ever happening, given the amount of trouble it's been just to get a few people to admit that Individuals exist.
I don't think the issue was ever in convincing people that Individuals exist -- that much, I think, was clear to everyone (as proof: see dwc:individualID). The issue was always more about where the current DwC should lie on the scale of highly flattened (e.g., DwC 1.0) to highly normalized (e.g., ABCD and CDM). It's necessarily a compromise between modelling the information "as it really is", vs. modelling the information in a way that's both accessible to the majority of content providers, and useful to the majority of content consumers. I think we both understand what the trade-offs are in either direction. The question is, what is the "sweet spot" for the majority of our community at this time in history?
I would venture that at the time DwC 1.0 was developed, it hit the sweet spot reasonably well. As more content holders develop increasingly sophisticated DBMS for their content, and as the user community delves into increasingly sophisticated analyses of the data, the "sweet spot" will shift from the flattened end of the scale to the normalized end of the scale. And, I would hope, DwC will evolve accordingly.
It is just too hard to get motion to happen in the TDWG community.
People make the same complaint about another organization that I'm involved with (ICZN). But here's the thing: as in the case of nomenclature, stability in itself can be a very important thing. If DwC changed every six months, then by the time people developed software apps to work with it, those apps would already be obsolete. If someone writes code that consumes DwC content as expressed in the current version of DwC, then that code may break if people start providing content with class:individual and class:token content. If our community is going to move forward successfully, I think standards like DwC need to evolve in a punctuated way, rather than a gradualist way (same goes for the Codes of nomenclature). That is, a bit of inertia in the system is probably a good thing.
OK, I've now gone on for eight pages of text explaining the rationale behind the question. So I'll return to the basic question: is the consensus for modeling the relationship between an Occurrence and associated token(s) the assumed token model:
http://bioimages.vanderbilt.edu/pages/token-assumed.gif
or the explicit token model:
http://bioimages.vanderbilt.edu/pages/token-explicit.gif
?
Here's how I would answer: When modelling my own databases, tracking my own content, I would *definitely* (and indeed already have, for a long time now) go with the token-explicit.
But when deciding on a community data exchange standard (i.e., DwC), compromise between flat and normalized is still a necessity, and as such, the answer in terms of modifying DwC needs to take into account the form of the bulk of the existing content, the needs of the bulk of the existing users/consumers, and the virtues of stability of Standards in a world where software app development time stretches for months or years.
Maybe the answer to this is to treat different versions of DwC as concurrent, rather than serial. That is, as long as the next most sophisticated version can easily be "collapsed" to all previous versions (aka, backward compatibility), then maybe we just need a clear mechanism for consuming applications to indicate desired DwC version. That way, apps developed to work with v2.1 can indicate to a provider that is capable of producing v3.6 content, that they want it in v2.1 format. Assuming we maintain backward compatibility (i.e., the more-normalized version can be easily collapsed to the more flattened version), then it should be a very simple matter for the content provider to stream the same content in v2.1 format.
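A very rough Python sketch of what I mean, with everything in it assumed for the sake of illustration: the version numbers are just the made-up ones above (not real DwC releases), and the LINKS_ALLOWED table and serve function are hypothetical, not any existing protocol or API.

```python
# Hypothetical provider-side store: an Occurrence links to a separate Event.
EVENTS = {"9876": {"eventDate": "24-October-2010", "eventTime": "02:13:00"}}
OCCURRENCES = {"1234": {"eventID": "9876", "recordedBy": "J. Smith"}}

# Which table each linking term points at.
LINK_TARGETS = {"eventID": EVENTS}

# Assumption: each version is described by the linking terms it permits.
# "2.1" stands for a fully flat version, "3.6" for one with an Event class.
LINKS_ALLOWED = {
    "2.1": frozenset(),
    "3.6": frozenset({"eventID"}),
}


def serve(occurrence_id, requested_version):
    """Return one Occurrence in the shape the consumer asked for."""
    allowed = LINKS_ALLOWED[requested_version]
    out = {"occurrenceID": occurrence_id}
    for term, value in OCCURRENCES[occurrence_id].items():
        if term in LINK_TARGETS and term not in allowed:
            # The requested version has no such class: collapse the
            # linked properties directly onto the Occurrence.
            out.update(LINK_TARGETS[term][value])
        else:
            out[term] = value
    return out


old_style = serve("1234", "2.1")   # eventDate/eventTime on the Occurrence
new_style = serve("1234", "3.6")   # eventID link kept; Event served separately
```

The point is only that collapsing goes in one direction: a provider holding the more normalized form can answer either request, while a provider holding only the flat form cannot reliably produce the linked one.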
But now I'm dabbling in areas that are WAY outside my scope of expertise...
Anyway...I would reiterate that I, for one, appreciate that you took the time to write all this down (took me over 3 hours to read & respond -- so obviously I care! -- of course, I'm waiting for a taxi to go to the airport, so really not much else for me to do right now). If I didn't reply to parts of your message, it was either because I agreed with you and had nothing to elaborate or expound upon, or I didn't really understand (e.g., all the rdf stuff).
Aloha, Rich