I'm going to start this post with two comments about RDF. I think some people think they have a phobia of RDF (I know that I did at first). What I really think is that they have a phobia of RDF represented as XML or RDF represented in N3 notation. This point has been made before: RDF is a system for describing properties of and relationships among resources (i.e. things that can be assigned identifiers) but it does not have only one particular way that these properties and relationships must be specified. It is perfectly correct to represent RDF entirely in pictures (i.e. as an RDF "graph", see http://www.w3.org/TR/rdf-primer/ and ignore all of the text - just look at the figures). RDF graph notation wouldn't be of much use to a computer, but that graph could easily be translated into one of the other notations (XML or N3) and then a computer would understand it perfectly. Since RDF is something that is specifically designed to represent relationships among classes of resources, it is the perfect thing to clearly lay out what we mean when we have a discussion of the sort that we are having here. One of the reasons why I am so keen to make diagrams of the sort I posted in the first message in this series is because once you have the diagram, it is a relatively simple matter to change the shapes of the boxes and add arrows instead of triangles or lines with crow's feet and voila! you have an RDF graph. It then becomes an academic exercise to have an RDF model in XML or whatever format you like. I am of the opinion that we are actually pretty close to a consensus about what the diagram should be, which means that we are also pretty close to a simple RDF model for Darwin Core.
The other comment about RDF is that we need to work out a basic model now. Partly this is because there are already several people who have been contributing to this discussion who are already writing RDF or who intend to do so in the near future. If we have any delusions about doing even the simplest kind of machine reasoning in the future, we all need to be using the same basic diagram (i.e. model). The other reason why we need to work this out now is that if we don't, we will impede the process of utilizing GUIDs/Persistent Identifiers. The draft TDWG GUID Applicability Statement (http://www.tdwg.org/stdtrack/article/download/150/51, recommendation 10) says clearly that a proper GUID should be able to be dereferenced to provide an RDF/XML representation (did I use "dereferenced" right, Bob?). If we don't agree on how to represent the classes of resources that are of interest to the DwC community in RDF, then we are setting up a situation where TDWG makes a recommendation (on how GUIDs are implemented) that is impossible for people to follow. I believe that it is best to settle on a basic model now rather than at an indefinite point in the future for this reason.
Having given this rationale, I'm going to talk about how we look at classes and types in Darwin Core and how the need for an RDF representation of DwC should influence our view on this topic. In Darwin Core as it stands (see the "Audience" section of http://rs.tdwg.org/dwc/terms/index.htm), classes are simply categories that group terms that describe instances of the class. The description specifically states that the terms are intended to be properties of the class (i.e. properties of instances of the class). When DwC terms are used as column headings in a database table, there isn't any "rule" that says that one must specify the type of thing to which that term applies.
On the other hand, I think that it is considered a Bad Thing in RDF to apply properties to a resource having an unspecified type. It's not impossible to do so, but specifying the rdf:type of a resource is one of the most fundamental things that one does in creating a description of the resource. This is recognized in the TDWG GUID Applicability Statement (recommendation 11) which says that objects identified by GUIDs should be typed using a well-known vocabulary. One "well-known vocabulary" is the Darwin Core Type Vocabulary (http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm). There isn't any formal relationship between the Darwin Core classes in the dwc: (http://rs.tdwg.org/dwc/terms/) namespace and the types in the dwctype: (http://rs.tdwg.org/dwc/dwctype/) namespace. We could use the dwctypes to describe resources that we want to say are instances of dwc: classes (and meet the GUID guidelines), but that would raise problems that I will get into later. The point is that as Darwin Core is currently set up, there isn't a formal relationship between the dwc: classes that are used to group the terms and the dwctype: types that could be used to rdf:type them. As it is described, the dwctype vocabulary is simply stated to be used as values for basisOfRecord and as I pointed out in the previous post, basisOfRecord only really works when Occurrences are limited to having a single token.
In RDF, the relationship between classes and types is different from the way it currently stands in Darwin Core. RDF classes and types are tied together by definition (http://www.w3.org/TR/2004/REC-rdf-schema-20040210/#ch_type). If you assert that a resource has an rdf:type of X, you are simultaneously asserting that the resource is an instance of class X. The relationship between a class in RDF and the declaration of rdf:type is so entwined that naming an XML container element after the class of which the resource is an instance is identical to an explicit declaration of type. The following two examples produce exactly the same result if you paste them into an RDF validator like http://www.w3.org/RDF/Validator/ :
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
    <rdf:type rdf:resource="http://rs.tdwg.org/dwc/terms/Occurrence"/>
  </rdf:Description>
</rdf:RDF>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
  <dwc:Occurrence rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
  </dwc:Occurrence>
</rdf:RDF>
Even though there is no explicit declaration of rdf:type in the text of the second example (i.e. the dwc:Occurrence container element is empty), the validator treats the code as if a type property were stated explicitly - you can see that the resulting triple and graph created by the validator show the RDF as having made an explicit declaration of rdf:type=dwc:Occurrence.
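For anyone who would rather check this equivalence in code than in the validator, here is a minimal sketch using only the Python standard library. It is not a real RDF parser: the function name type_triples is made up for this illustration, and it only handles the two simple patterns shown above (an rdf:Description with an rdf:type child, and a typed container element). A full RDF library would be the right tool for anything beyond this.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# First form: explicit rdf:type property inside rdf:Description.
EXPLICIT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
    <rdf:type rdf:resource="http://rs.tdwg.org/dwc/terms/Occurrence"/>
  </rdf:Description>
</rdf:RDF>"""

# Second form: the container element is named after the class.
TYPED_NODE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
  <dwc:Occurrence rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
  </dwc:Occurrence>
</rdf:RDF>"""


def type_triples(rdf_xml):
    """Return the set of (subject, rdf:type, class) triples found in a
    simple RDF/XML document (hypothetical helper, illustration only)."""
    triples = set()
    root = ET.fromstring(rdf_xml)
    for node in root:
        subject = node.get("{%s}about" % RDF_NS)
        # ElementTree names elements as "{namespace}local". Any container
        # other than rdf:Description is a typed node whose element name,
        # namespace plus local name, is the class URI.
        if node.tag != "{%s}Description" % RDF_NS:
            namespace, local = node.tag[1:].split("}")
            triples.add((subject, RDF_NS + "type", namespace + local))
        # Explicit rdf:type property elements inside the container.
        for prop in node.findall("{%s}type" % RDF_NS):
            class_uri = prop.get("{%s}resource" % RDF_NS)
            triples.add((subject, RDF_NS + "type", class_uri))
    return triples


explicit_triples = type_triples(EXPLICIT)
typed_node_triples = type_triples(TYPED_NODE)
print(explicit_triples == typed_node_triples)
```

Both documents reduce to the same single triple: the resource, rdf:type, and http://rs.tdwg.org/dwc/terms/Occurrence.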
So my point is that to enable people to follow the TDWG GUID recommendations and provide RDF that tells people the type of the resource, TDWG bears a responsibility to provide GUID users with terms that are suitable for use as an rdf:type property for every class of resource to which we can reasonably be expected to want to assign a GUID. In my book, that's every box shown on the summary diagram http://bioimages.vanderbilt.edu/pages/full-model.jpg except for tokens (and excluding Time if we agree that we will always denormalize it out of existence as a class). I exclude tokens as a group because they are not a single class of resource. Any type of resource that provides evidence that an Occurrence happened can be a token. In some cases (such as images and sounds) those types are already defined in Dublin Core. Darwin Core would only need to define types for things that aren't defined elsewhere, such as the Collecting Units in the ASC model (but this is the topic of the third installment).
One way to do this (and the way that I favor) is to make sure that there is a Darwin Core class for every category of resource for which one would reasonably expect to assign a GUID. Referring to the full model diagram, the only categories that don't have classes at the moment are Individual (which I have proposed to add), Time (which may or may not be necessary), and Collecting Unit (again, more on this in the final installment). The first category could be created by voting to accept my proposal about the class Individual. The last would require a new recommendation, but I think that Rich has pretty much suggested that this should happen when he says that there are a lot of terms in the Occurrence category that don't belong there (i.e. they belong with Collecting Units). So it would make sense from the point of view of a more logical organization of terms to do this anyway. As Bob has pointed out, in RDF making a declaration of rdf:type=X is the same thing as declaring that class X exists. So why not make the rdf:types BE the Darwin Core classes so we will be declaring something that actually does already exist instead of conjuring up virtual classes from types that we make up? There have been some people who have questioned my proposal for adding Individual as a DwC class on the basis that it is not clear that anybody "needs" it. What I am stating here is that anybody who plans to write RDF following recommendations based on a fully normalized Darwin Core RDF model (which should be EVERYONE who writes RDF using Darwin Core!) "needs" all of the classes that connect resources they plan to describe. That means that anybody who plans to connect Occurrence metadata to Identifications should be doing it in their RDF through named instances of the dwc:Individual class.
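To make the "named node" idea concrete, here is a small sketch in Python with triples written as plain tuples. Everything in the ex: namespace is hypothetical: the linking predicates ex:occurrenceOf and ex:identifies are invented for this illustration and are not Darwin Core terms, and dwc:Individual is the proposed class, not an existing one. The only point is the shape of the graph, in which Occurrences and Identifications meet at an Individual.

```python
# Triples as (subject, predicate, object) tuples using prefixed names.
# ASSUMPTION: ex:occurrenceOf and ex:identifies are made-up predicates,
# and dwc:Individual is the class proposed in this thread.
TRIPLES = [
    ("ex:occ1", "rdf:type", "dwc:Occurrence"),
    ("ex:occ2", "rdf:type", "dwc:Occurrence"),
    ("ex:ind1", "rdf:type", "dwc:Individual"),
    ("ex:id1", "rdf:type", "dwc:Identification"),
    # Two Occurrences documenting the same Individual (one-to-many).
    ("ex:occ1", "ex:occurrenceOf", "ex:ind1"),
    ("ex:occ2", "ex:occurrenceOf", "ex:ind1"),
    # One Identification applied to that Individual.
    ("ex:id1", "ex:identifies", "ex:ind1"),
]


def identifications_for(occurrence, triples):
    """Find Identifications for an Occurrence by walking through the
    Individual node that both are connected to."""
    individuals = {
        o for s, p, o in triples
        if s == occurrence and p == "ex:occurrenceOf"
    }
    return sorted(
        s for s, p, o in triples
        if p == "ex:identifies" and o in individuals
    )


print(identifications_for("ex:occ1", TRIPLES))
print(identifications_for("ex:occ2", TRIPLES))
```

Neither Occurrence mentions the Identification directly. Both reach it only through the shared Individual, which is why the Individual needs a name (i.e. an identifier) even if it carries few properties of its own.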
Another alternative would be to fix the dwctype vocabulary, but that would be messier. The dwctype vocabulary is designated as the controlled vocabulary for basisOfRecord, so it is a bit dangerous to mess with it without breaking basisOfRecord. The other problem as was noted earlier on the list is that currently certain types in the DwC type vocabulary are declared as subClasses of other types, and that these declarations will cause unintentional assertions that don't make sense in the context of the general model that we've been discussing (namely that every PhysicalSpecimen is an Occurrence which is also an Event). It seems to me that there is more "fixing" required here than is worth the effort given that we can just use the classes as the rdf:types as I described in the previous paragraph.
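The unintended-assertion problem can be seen in a few lines of Python. This is a sketch of the standard RDFS subclass entailment (if x has type C and C is a subClassOf D, then x has type D), not a full reasoner. The subclass chain below simply mirrors the one described in the paragraph above; the prefixed names and ex:specimen1 are placeholders and are not copied from the published vocabulary.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def rdfs_type_closure(triples):
    """Repeatedly apply: (x type C) and (C subClassOf D) => (x type D),
    until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for c, q, d in list(inferred):
                if q == SUBCLASS_OF and c == o:
                    new_triple = (s, RDF_TYPE, d)
                    if new_triple not in inferred:
                        inferred.add(new_triple)
                        changed = True
    return inferred


# ASSUMPTION: placeholder names following the chain described in the text.
ASSERTED = {
    ("dwctype:PhysicalSpecimen", SUBCLASS_OF, "dwctype:Occurrence"),
    ("dwctype:Occurrence", SUBCLASS_OF, "dwctype:Event"),
    ("ex:specimen1", RDF_TYPE, "dwctype:PhysicalSpecimen"),
}

entailed = rdfs_type_closure(ASSERTED) - ASSERTED
for triple in sorted(entailed):
    print(triple)
```

Typing a single resource as a PhysicalSpecimen is enough for a reasoner to conclude that it is also an Occurrence and an Event, whether or not the person writing the RDF meant to say so.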
The final alternative would be to make the TDWG Ontology functional and use it to type resources. Although there has been some recent discussion on the list about working on the Ontology, at the present moment there isn't a clear plan or timeline to finish it. Telling people to wait for something that may never happen is not an acceptable alternative to me. I think it is clear that there are multiple people and institutions that are either ready to write RDF in support of GUIDs or are already doing it now. Six months is about the longest timeframe that I think is reasonable for coming up with a solution to the typing problem discussed above and to have some kind of basic guidelines for the structuring of RDF. A general model based on the existing Darwin Core classes is the only path forward that I can see as feasible in that time frame and a general model could always be built into a more sophisticated model (i.e. the Ontology) at leisure if anyone cared to take the time. If TDWG doesn't get its act together on a six-month to one year time scale, people will simply give up and write Darwin Core-based RDF without any TDWG guidelines. It has been suggested that a Task Group be formed to draft a DwC RDF Guide. I would be very keen to see that happen and would be willing to be involved in the process, but I'm not interested in it if the process doesn't start out with some version of the consensus model we've discussed here and with some quick decision from the TAG about how to handle the rdf:typing problem. Without those two things, there would just be endless unproductive debate about how to go about building the model from scratch and I've got better things to do than that.
I will end this with one final comment about the proposed Individual class in this context. I have stated clearly in several earlier posts that I don't think that the Individual class really has many properties and that it functions primarily as a named node to facilitate one-to-many relationships with other classes. This may strike some people as odd, given that the primary purpose of classes in the existing Darwin Core seems to be to group similar terms that can act as properties for the class. What became apparent to me when I was creating the diagrams for the first post was that if the Time terms are pulled out of the Event class (as they probably should be in a fully normalized model) and the "Collecting Unit" terms are pulled out of the Occurrence class (as I think must happen if we separate tokens from Occurrences), there are also very few property terms left in the Event and Occurrence classes. Most of the terms that remain are "housekeeping" ones used for remarks, or to make note of the person who documented the instance and when. Most of the terms that actually describe measurable properties are found in the peripheral classes like Location, Time, and Collecting Unit. Just as in the case of the proposed Individual class, the Event and Occurrence classes are primarily named nodes that connect other classes. The only reason they have very many terms at present is because we have some of the terms in the "wrong" place for a fully normalized model. I think that it is also no coincidence that these three classes (Event, Occurrence, and Individual) are also the three that we have had the most trouble defining. I think that's precisely because they have very few properties of their own. They do roughly correspond to things for which we have conceptual images, which is why we are able to come up with meaningful names for them. But as I have argued, it is better to define them according to what we want them to DO rather than by our mental image of them.
And that is a lead-in to the third and final post.