Thanks everyone, I'm not so much concerned about producing dwc in whatever form. Adapting the schemas or IPT "extension" definitions is straightforward and, as John says, we could even easily add support for all 3 classes as core records in the IPT.
My concern is rather on the consumer's side of things. When I get a record in dwca with a rowType of dwc:Occurrence, I currently treat it as if it's an observation, a specimen or anything that we used to accept as an occurrence. With the change I should be able to say this is *not* a collection object or organism, but that is something I can't say for sure as I don't know which version of dwc the record adheres to. Is this a no-brainer that doesn't matter in practice?
I think it matters, and is not a no-brainer, and the solution depends on the implementation. For Simple Darwin Core the distinction is indeed simple, and wouldn't change anything, because the information contained in a record is defined by basisOfRecord. An old-style Occurrence that was a PreservedSpecimen will be a new-style CollectionObject that is a PreservedSpecimen. There will be no difference except to those who care that PreservedSpecimen will no longer refine Occurrence; rather, it will refine CollectionObject. Existing Simple Darwin Core observation records would have to change only two fields (individualID to organismID and occurrenceRemarks to organismRemarks under some circumstances), if they were already in use. Existing Simple Darwin Core records of things that fall into the new CollectionObject category would have to change up to four fields (1) individualID to organismID, 2) occurrenceRemarks to organismRemarks under some circumstances, 3) associatedOccurrences to associatedOrganisms in some cases, and 4) associatedOccurrences to associatedCollectionObjects in some cases), if they were already in use.
These are still issues from the publisher's point of view, which I'm not so much concerned about. But what about consumers receiving mixed versions with old and new records? There are certain term name changes that you mentioned, so a consumer must know about the historical terms too. But as basisOfRecord for simple dwc did and still does define the true record "class", there doesn't seem to be a big change after all.
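To make the "consumer must know about the historical terms" point concrete, here is a minimal sketch of what a consumer might do with flat records arriving under either set of term names. It is only an illustration under stated assumptions: the function name and the record layout (a plain dict keyed by term name) are made up for the example, and because the thread does not spell out the "some circumstances" under which the conditional renames apply, the sketch flags those terms for review instead of rewriting them.

```python
# Rename that the thread describes without any condition attached.
RENAMED_TERMS = {"individualID": "organismID"}

# Renames the thread says apply only "under some circumstances" or "in some
# cases". The conditions are not specified, so these are flagged, not rewritten.
CONDITIONAL_TERMS = {
    "occurrenceRemarks": ("organismRemarks",),
    "associatedOccurrences": ("associatedOrganisms", "associatedCollectionObjects"),
}


def normalize_record(record):
    """Return (normalized_record, review_notes) for one flat record dict.

    Hypothetical helper: maps historical term names to the proposed ones and
    lists the terms a human (or smarter rule) would still have to look at.
    """
    normalized = {}
    notes = []
    for term, value in record.items():
        new_term = RENAMED_TERMS.get(term, term)
        if new_term != term and new_term in record:
            # The record already carries the new-style term; keep that value.
            continue
        normalized[new_term] = value
        if term in CONDITIONAL_TERMS and value:
            candidates = ", ".join(CONDITIONAL_TERMS[term])
            notes.append(f"{term} may need to become: {candidates}")
    return normalized, notes


old_record = {
    "basisOfRecord": "PreservedSpecimen",
    "individualID": "tree-42",
    "occurrenceRemarks": "bark damaged",
}
record, notes = normalize_record(old_record)
print(record)
print(notes)
```

Note that basisOfRecord passes through untouched, which matches the observation that it did and still does define the record "class" for simple dwc.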
For dwc archives the row type - which is a dwc "class" term - is more crucial and consumers' logic depends on it. But for GBIF at least I would think there is not much of a change as we - at least currently - treat all occurrence records the same way.
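As a rough sketch of the consumer logic in question, the snippet below reads the core rowType out of an archive's meta.xml and decides whether the archive would go down the same processing path as occurrences do today. Several things here are assumptions: the sample meta.xml is invented, the function names are hypothetical, and the CollectionObject and Organism URIs are simply the proposed class names appended to the existing dwc terms namespace, since those classes are only a proposal at this point.

```python
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"
DWC_TERMS = "http://rs.tdwg.org/dwc/terms/"

# Occurrence is an existing class term. CollectionObject and Organism are the
# classes proposed in this thread, so their URIs here are hypothetical.
OCCURRENCE_LIKE = {
    DWC_TERMS + "Occurrence",
    DWC_TERMS + "CollectionObject",
    DWC_TERMS + "Organism",
}

# Invented sample descriptor, trimmed to the parts the sketch needs.
META_XML = """\
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence" ignoreHeaderLines="1">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
  </core>
</archive>
"""


def core_row_type(meta_xml):
    """Return the rowType URI declared on the archive's core element."""
    root = ET.fromstring(meta_xml)
    core = root.find(DWC_TEXT_NS + "core")
    if core is None:
        raise ValueError("meta.xml has no <core> element")
    return core.get("rowType")


def handled_like_occurrence(meta_xml):
    """True if the core rowType would be processed the way occurrences are now."""
    return core_row_type(meta_xml) in OCCURRENCE_LIKE


print(core_row_type(META_XML))
print(handled_like_occurrence(META_XML))
```

A consumer that treats all three row types alike, as in this sketch, is not affected by the old-versus-new ambiguity; one that wants to tell them apart cannot do so from rowType alone.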
Or does the problem lie rather in the implementation technology and we should do versioning of our "schemas" and transmit them with records, but not the dwc namespace? At first glance that actually sounds like a good way to go. The dwc xml schemas would have to have a new attribute with a default value in the root element in that case though - is that something agreeable?
I think XML schemas will have to be versioned unless we can make them completely backward compatible. I don't like the prospects of maintenance for the latter option. I understand what you mean by adding an attribute for version in the schemas, but what did you mean by "but not the dwc namespace?"
Versioning the namespace is something people (also in TDWG) usually do for versioning standards. DwC used to do this before, but it was abandoned for reasons I can't remember. Versioning the XML schemas themselves using the schema attribute doesn't really help, as we don't exchange the schema files but the instances they define. So if the namespace always remains the same, another option would be to define an additional attribute/element that becomes part of every record instance. For dwc archives this could be a new attribute in the meta.xml file; for simple xml, for example, a new version attribute as in <SimpleDarwinRecordSet version="1.1">
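For illustration, this is roughly how a consumer could read such an instance-level version attribute while the namespace stays fixed. It is a sketch only: the version attribute does not exist in the current schemas, the default of "1.0" for instances without it is an assumed value, and the namespace URI in the samples is only there to show that it is identical in both documents.

```python
import xml.etree.ElementTree as ET

# Hypothetical: the version to assume when an instance carries no version
# attribute, i.e. it was produced before the attribute was introduced.
DEFAULT_VERSION = "1.0"


def record_set_version(xml_text, default=DEFAULT_VERSION):
    """Return the root element's version attribute, or the default if absent."""
    root = ET.fromstring(xml_text)
    return root.get("version", default)


# Same namespace in both samples; only the proposed attribute differs.
NEW_STYLE = (
    '<SimpleDarwinRecordSet version="1.1" '
    'xmlns="http://rs.tdwg.org/dwc/xsd/simpledarwincore/"/>'
)
OLD_STYLE = (
    '<SimpleDarwinRecordSet '
    'xmlns="http://rs.tdwg.org/dwc/xsd/simpledarwincore/"/>'
)

print(record_set_version(NEW_STYLE))  # prints 1.1
print(record_set_version(OLD_STYLE))  # prints 1.0
```

The same lookup would work for a version attribute on the root element of meta.xml, since an unqualified attribute is read the same way whatever the element's namespace.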
Markus
I'm curious. Are any of the Darwin Core XML schemas in use beyond the Apiary and the GermPlasm extensions?
Markus
On Sep 13, 2011, at 7:39 PM, Steve Baskauf wrote:
Markus, Well, I don't know that I'd go so far as to say that it's a drastic change in semantics, at least in the formal semantics of the normative document (which I _think_ can be viewed at http://code.google.com/p/darwincore/source/browse/trunk/rdf/dwcterms.rdf). That document says (in human terms) that Occurrence is an rdfs:Class, that its status is Recommended, and some bookkeeping stuff about versioning. The main change is in the rdfs:comment which presents the description "The category of information pertaining to evidence of an occurrence in nature, in a collection, or in a dataset (specimen, observation, etc.)." but that's really a human thing and as I've said, there has been quite a bit of misunderstanding about what the "human" definition means. As has been noted on this list before, DwC doesn't get into domain and range issues for its terms at all and usually doesn't get into subclassing, so there is very little in the normative document to be "broken" in terms of semantics. That's a rather different situation than changing an owl ontology class definition where relationships among classes and their properties are likely to be more complicated (e.g. disjoint classes, subclassing, range, domain) and therefore more easily "broken".
There is the issue that a number of property terms would have their dwcattributes:organizedInClass property changed from dwc:Occurrence to something else. But my understanding was that the organization of the property terms under the DwC classes was more of a suggestion as opposed to a declaration of domain. So it doesn't seem likely that the understanding of machines will be "broken" by this change since I don't think that much of any machine reasoning based on DwC is going on at the present. But this is getting beyond my area of expertise, so maybe others can clarify things on this point.
There is an RDF version of DwC viewable at http://code.google.com/p/darwincore/source/browse/trunk/rdf/dwctermshistory.... which actually has dated versions of the terms (e.g. Occurrence-2009-04-29). But I must confess, I don't understand how this document is related to the dwcterms.rdf document I mentioned above. Perhaps John can enlighten us...
Steve
Markus Döring (GBIF) wrote:
Hi Steve, I agree it is a good thing to be more clear about what an occurrence actually is, and I wouldn't disagree with the proposed 3 classes. Still, there is a drastic change in semantics of an existing term, Occurrence, and I would feel more comfortable if we could tell those different usages apart. Whether that's via a namespace-based versioning of (all?) darwin core terms, through the use of a different term name, or something else, I don't know.
Don't you think this is an issue? Would you also change an OWL ontology class definition in the same way, and wouldn't that be harmful to existing instances?
Markus
With regards to Markus' concern about whether people will be able to know whether somebody is talking about a "new-style" Occurrence or an "old" Occurrence, I would assert that the "old" Occurrence didn't really have a clear meaning. If you review the summary of the discussion on Occurrence, you can see that it was used to mean at least three different kinds of "things" by different people. What John is actually doing with his proposal is to add clarity about what an Occurrence is where it didn't exist before. I think that is a good thing. If, by the "old" kind of Occurrence people are meaning that Occurrence is a fancier name for PreservedSpecimen (which I believe is how some people in the museum community are thinking of it), then I would say that such a characterization is incorrect (based on the apparent consensus) and that clarifying the incorrectness of that view is a really good thing.
Steve
Éamonn Ó Tuama (GBIF) wrote:
It would be good to hear from someone who is familiar with the work going on in the Observations Task Group and could explain how a generic model for observations/measurements (e.g. OBOE) might help sort out these issues. It seems to me that we are trying to build in an ad-hoc manner an increasingly complex model on top of DwC which is really just a glossary of terms. That does not seem like a good approach - but I'm no modeller :-) _Éamonn
-----Original Message-----
From: Dag Endresen (GBIF) [mailto:dendresen@gbif.org]
Sent: 13 September 2011 12:18
To: "Markus Döring (GBIF)"
Cc: tdwg-content@lists.tdwg.org; Éamonn Ó Tuama
Subject: Re: [tdwg-content] Occurrences, Organisms, and CollectionObjects: a review
Hi Markus,
I believe that the discussion here originates from the view that the "CollectionObject"/"Sample" is a different thing from the "Organism" - and that there can be a relationship between CollectionObjects/Samples and Organisms that could be difficult to describe if these things are identified as the same thing (occurrenceID). Do you think that the "Occurrence" would be seen as a thing different from the proposed CollectionObject/Sample and Organism - or as a super-class that would include CollectionObjects/Samples and Organisms? Would the semantics of Occurrence change?
I fully share your view that the Darwin Core Archive (DwC-A) would not be suited to share the full complex relationship between entities - even if persistent identifiers would be used. However if we start to describe and include other things (core types) than only the taxon and occurrences then perhaps the DwC-A could be a useful way to provide a simple list of these entities? This could perhaps provide easier indexing and discovery of these new entities?
Dag
On Tue, 13 Sep 2011 10:03:00 +0200, Markus Döring (GBIF) wrote:
> I have to say that the change in semantics to the Occurrence class makes me a bit nervous. Can someone try help fighting my fears?
> DarwinCore has no versioning of namespaces, so there is no way for a consumer to detect if its an old style Occurrence or a new one. I am currently parsing various RSS feeds and even though its a mess having to parse 10 different styles I am glad that at least the designers made sure they all have their own namespace! Also removing or renaming terms might cause serious problems. Would discrete versions of dwc with their own namespace hurt?
> Another observation relates to dwc archives and its star schema. As an index to data that has been flattened there is no problem with more classes and core row types, but if you want it as a way to transfer complete normalized data it will not work. But that never really was the intention and I simply wanted to stress that fact.
> Markus
> On Sep 9, 2011, at 4:52 PM, Steve Baskauf wrote:
>> Richard Pyle wrote:
>>> I'm also wondering if we necessarily need to "break" the traditional view of the "Occurrence" class in order to implement Organism and CollectionObject. As long as we keep in mind that DwC is a vocabulary of terms focused on representing an exchange standard (rather than a full-blown Ontology), perhaps Occurrence records can continue to be represented in the traditional way as "flat" content, but the Organism and CollectionObject classes allow us to present data in a somewhat more "normalized" way in those circumstances that call for it (e.g. tracking individuals or groups over time [Organism], or managing fossil rocks with multiple taxa [CollectionObject] -- to name just two).
>> I've been thinking about this issue of "backward compatibility" with respect to Occurrences if the CollectionObject/Sample/Token/whatever class is adopted. I really don't think it is going to be as big of a deal as people are making it out to be.
>> It seems to me that the main problems arise in two areas: when one wants to be clear about typing and when one wants to express relationships in a system where it is possible to do through semantics (like RDF). In that kind of circumstance, it's bad (oh yeh, I forgot - the term is "naughty") to say something like
>> resourceA hasOccurrence resourceB
>> when resourceB isn't actually an Occurrence. "Wrong" typing also happens all the time because the classes don't exist (yet) to do the typing correctly. As a case in point, in the Morphbank system, I have multiple images of the same tree. In that system the tree is typed as a "specimen". That is totally wrong because the tree isn't a specimen, but what else is it going to be typed as? There isn't (yet) an appropriate class to put it in.
>> Although these two problems (wrong typing and using a term with the wrong kind of object which are actually different manifestations of the same class-based problem) are naughty, realistically very few people are actually using a system that is "semantic-aware" (e.g. serving and consuming RDF) so right now making those mistakes doesn't really "break" anything. Most data providers are using traditional databases or even Excel spreadsheets where the DwC terms are just column headings with no real "meaning" other than what the data managers intend for them to mean. So if a manager has a table where each line contains a record for a specimen and has a column heading for a column entitled "dwc:catalogNumber", there isn't really anything other than an idea in the manager's head that the catalogNumber is a property of a specimen or Occurence or CollectionObject. If each line in the database table is "flat" such that one specimen=one CollectionObject=one Occurrence, all that is required to make catalogNumber be a property of a CollectionObject instead of an Occurrence is a different way of thinking in the managers mind because there are really no semantics embedded in the table. We are already doing this kind of mental gymnastics with existing classes like dwc:Identification. If our hypothetical database manager has a column heading that says "dwc:identifiedBy" in the specimen table, that is really a property of dwc:Identification, not dwc:Occurrence but again that is a distinction that is only going to be made in the manager's mind. Making the distinction really only becomes an issue when the database stops being "flat" for a particular relationship, e.g. if the database wants to allow multiple Identifications per specimen record. Then the database structure must be changed accordingly to accommodate that "normalization".
>> What we have here at the present moment is a situation where data providers don't have any way to have anything but "flat" records where 1 specimen=1 Occurrence=1 Organism. By adding the Organism and CollectionObject classes, we allow people who need or want to have less "flat" (=more "normalized") databases to have something to call the entities that are represented by the new tables they create to handle 1:many relationships instead of 1:1 relationships. Anybody who only cares about 1:1 relationships really doesn't need to worry about the fact that the new class exists, just as people currently don't have to worry about the Identification class if they only allow one Identification per specimen in their database.
>> So I guess what I'm saying is that if a database manager has a table labeled Occurrence, they really don't have to freak out if we now tell them that their table actually should be labeled CollectionObject as long as there is only one CollectionObject per Occurrence. They didn't freak out before when we told them that they should call their table "Occurrence" instead of "Observation" or "Specimen" in 2009, did they?
>> I think what I'm saying here is what Rich was trying to say in the paragraph I quoted, but I'm not sure.
>> Steve
>> --
>> Steven J. Baskauf, Ph.D., Senior Lecturer
>> Vanderbilt University Dept. of Biological Sciences
>> postal mail address:
>> VU Station B 351634
>> Nashville, TN 37235-1634, U.S.A.
>> delivery address:
>> 2125 Stevenson Center
>> 1161 21st Ave., S.
>> Nashville, TN 37235
>> office: 2128 Stevenson Center
>> phone: (615) 343-4582, fax: (615) 343-6707
>> http://bioimages.vanderbilt.edu
>> _______________________________________________
>> tdwg-content mailing list
>> tdwg-content@lists.tdwg.org
>> http://lists.tdwg.org/mailman/listinfo/tdwg-content