[tdwg-content] Background for the Individual class proposal. 2. Classes and types

13 Nov 2010

I'm going to start this post with two comments about RDF.  I think some 
people think they have a phobia of RDF (I know that I did at first).  
What I really think is that they have a phobia of RDF represented as XML 
or RDF represented in N3 notation.  This point has been made before: RDF 
is a system for describing properties of and relationships among 
resources (i.e. things that can be assigned identifiers) but it does not 
have only one particular way that these properties and relationships 
must be specified.  It is perfectly correct to represent RDF entirely in 
pictures (i.e. as an RDF "graph", see http://www.w3.org/TR/rdf-primer/ 
and ignore all of the text - just look at the figures).  RDF graph 
notation wouldn't be of much use to a computer, but that graph could 
easily be translated into one of the other notations (XML or N3) and 
then a computer would understand it perfectly.  Since RDF is something 
that is specifically designed to represent relationships among classes 
of resources, it is the perfect thing to clearly lay out what we mean 
when we have a discussion of the sort that we are having here.  One of 
the reasons why I am so keen to make diagrams of the sort I posted in 
the first message in this series is because once you have the diagram, 
it is a relatively simple matter to change the shapes of the boxes and 
add arrows instead of triangles or lines with crow's feet and voila! you 
have an RDF graph.  It then becomes an academic exercise to have an RDF 
model in XML or whatever format you like.  I am of the opinion that we 
are actually pretty close to a consensus about what the diagram should 
be, which means that we are also pretty close to a simple RDF model for 
Darwin Core. 

The other comment about RDF is that we need to work out a basic model 
now.  Partly this is because there are already several people who have 
been contributing to this discussion who are already writing RDF or who 
intend to do so in the near future.  If we have any delusions about 
doing even the most simple kind of machine reasoning in the future, we 
all need to be using the same basic diagram (i.e. model).  The other 
reason why we need to work this out now is that if we don't, we will 
impede the process of utilizing GUIDs/Persistent Identifiers.  The draft 
TDWG GUID Applicability Statement 
(http://www.tdwg.org/stdtrack/article/download/150/51, recommendation 10) 
says clearly that a proper GUID should be able to be dereferenced to 
provide an RDF/XML representation (did I use "dereferenced" right, 
Bob?).  If we don't agree on how to represent the classes of resources 
that are of interest to the DwC community in RDF then we are setting up 
the situation where TDWG makes a recommendation (on how GUIDs are 
implemented) that is impossible for people to follow.  I believe that it 
is best to settle on a basic model now rather than at an indefinite 
point in the future for this reason.

Having given this rationale, I'm going to talk about how we look at 
classes and types in Darwin Core and how the need for an RDF 
representation of DwC should influence our view on this topic.  In 
Darwin Core as it stands (see the "Audience" section of 
http://rs.tdwg.org/dwc/terms/index.htm), classes are simply categories 
that group terms that describe instances of the class.  The description 
specifically states that the terms are intended to be properties of the 
class (i.e. properties of instances of the class).  When DwC terms are 
used as column headings in a database table, there isn't any "rule" that 
says that one must specify the type of thing to which that term applies. 

On the other hand, I think that it is considered a Bad Thing in RDF to 
apply properties to a resource having an unspecified type.  It's not 
impossible to do so, but specifying the rdf:type of a resource is one of 
the most fundamental things that one does in creating a description of 
the resource.   This is recognized in the TDWG GUID Applicability 
Statement (recommendation 11) which says that objects identified by 
GUIDs should be typed using a well-known vocabulary.  One "well-known 
vocabulary" is the Darwin Core Type Vocabulary 
(http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm).  There isn't 
any formal relationship between the Darwin Core classes in the dwc: 
(http://rs.tdwg.org/dwc/terms/) namespace and the types in the dwctype: 
(http://rs.tdwg.org/dwc/dwctype/) namespace.  We could use the dwctypes 
to describe resources that we want to say are instances of dwc: classes 
(and meet the GUID guidelines), but that would raise problems that I 
will get into later.  The point is that as Darwin Core is currently set 
up, there isn't a formal relationship between the dwc: classes that are 
used to group the terms and the dwctype: types that could be used to 
rdf:type them.  As it is described, the dwctype vocabulary is simply 
stated to be used as values for basisOfRecord and as I pointed out in 
the previous post, basisOfRecord only really works when Occurrences are 
limited to having a single token. 

In RDF, the relationship between classes and types is different from the 
way it currently stands in Darwin Core.  RDF classes and types are tied 
together by definition 
(http://www.w3.org/TR/2004/REC-rdf-schema-20040210/#ch_type).  If you 
assert that a resource has an rdf:type of X, you are simultaneously 
asserting that the resource is an instance of class X.  The relationship 
between a class in RDF and the declaration of rdf:type is so entwined 
that naming an XML container element after the class of which the 
resource is an instance is equivalent to an explicit declaration of 
type.  The following 
two examples produce exactly the same result if you paste them into an 
RDF validator like http://www.w3.org/RDF/Validator/ :

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description 
rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
    <rdf:type rdf:resource="http://rs.tdwg.org/dwc/terms/Occurrence"/>
  </rdf:Description>
</rdf:RDF>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
  <dwc:Occurrence 
rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
  </dwc:Occurrence>
</rdf:RDF>

Even though there is no explicit declaration of rdf:type in the text of 
the second example (i.e. the dwc:Occurrence container element is empty), 
the validator treats the code as if a type property were stated 
explicitly - you can see that the resulting triple and graph created by 
the validator show the RDF as having made an explicit declaration of 
rdf:type=dwc:Occurrence. 
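
For anyone who would rather check this equivalence in code than in the 
validator, here is a minimal sketch in Python.  It is not a real RDF 
parser: the function name type_triples is my own invention for 
illustration, and it handles only top-level node elements that carry 
rdf:about, which is all the two examples above need.  It uses only the 
standard-library XML reader.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_TYPE = RDF_NS + "type"

# First example: explicit rdf:type inside rdf:Description.
EXPLICIT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
    <rdf:type rdf:resource="http://rs.tdwg.org/dwc/terms/Occurrence"/>
  </rdf:Description>
</rdf:RDF>"""

# Second example: the container element is named after the class.
TYPED_NODE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
  <dwc:Occurrence rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
  </dwc:Occurrence>
</rdf:RDF>"""


def type_triples(rdf_xml):
    """Return the set of (subject, rdf:type, class) triples found.

    Toy reader (an assumption of this sketch, not a full RDF/XML parser):
    only top-level node elements with an rdf:about attribute are handled.
    """
    triples = set()
    root = ET.fromstring(rdf_xml)
    for node in root:
        subject = node.get("{%s}about" % RDF_NS)
        # ElementTree spells qualified names as "{namespace}local".
        ns, local = node.tag[1:].split("}", 1)
        if ns + local != RDF_NS + "Description":
            # Typed node element: the element name itself is the class.
            triples.add((subject, RDF_TYPE, ns + local))
        for prop in node:
            pns, plocal = prop.tag[1:].split("}", 1)
            if pns + plocal == RDF_TYPE:
                # Explicit rdf:type property element.
                obj = prop.get("{%s}resource" % RDF_NS)
                triples.add((subject, RDF_TYPE, obj))
    return triples


if __name__ == "__main__":
    print(type_triples(EXPLICIT) == type_triples(TYPED_NODE))
```

Both inputs reduce to the same single triple, which is the point: the 
element name in the second form carries the type assertion.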

So my point is that to enable people to follow the TDWG GUID 
recommendations and provide RDF that tells people the type of the 
resource, TDWG bears a responsibility to provide GUID users with terms 
that are suitable for use as an rdf:type property for every class of 
resource to which we can reasonably be expected to want to assign a GUID.  
In my book, that's every box shown on the summary diagram 
http://bioimages.vanderbilt.edu/pages/full-model.jpg except for tokens 
(and excluding Time if we agree that we will always denormalize it out 
of existence as a class).  I exclude tokens as a group because they are 
not a single class of resource.  Any type of resource that provides 
evidence that an Occurrence happened can be a token.  In some cases 
(such as images and sounds) those types are already defined in Dublin 
Core.  Darwin Core would only need to define types for things that 
aren't defined elsewhere, such as the Collecting Units in the ASC model 
(but this is the topic of the third installment). 

One way to do this (and the way that I favor) is to make sure that there 
is a Darwin Core class for every category of resource to which one 
would reasonably expect to assign a GUID.  Referring to the full model 
diagram, the only categories that don't have classes at the moment are 
Individual (which I have proposed to add), Time (which may or may not be 
necessary), and Collecting Unit (again, more on this in the final 
installment).  The first category could be created by voting to accept 
my proposal about the class Individual.  The last would require a new 
recommendation, but I think that Rich has pretty much suggested that 
this should happen when he says that there are a lot of terms in the 
Occurrence category that don't belong there (i.e. they belong with 
Collecting Units).  So it would make sense from the point of view of a 
more logical organization of terms to do this anyway.  As Bob has 
pointed out, in RDF making a declaration of rdf:type=X is the same thing 
as declaring that class X exists.  So why not make the rdf:types BE the 
Darwin Core classes so we will be declaring something that actually does 
already exist instead of conjuring up virtual classes from types that we 
make up?  There have been some people who have questioned my proposal 
for adding Individual as a DwC class on the basis that it is not clear 
that anybody "needs" it.  What I am stating here is that anybody who 
plans to write RDF following recommendations based on a fully normalized 
Darwin Core RDF model (which should be EVERYONE who writes RDF using 
Darwin Core!) "needs" all of the classes that connect resources they 
plan to describe.  That means that anybody who plans to connect 
Occurrence metadata to Identifications should be doing it in their RDF 
through named instances of the dwc:Individual class.
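
To make the "named node" idea concrete, here is a rough sketch in 
Python that treats RDF as nothing more than a list of triples.  Every 
identifier in the ex: namespace, including the predicate names 
ex:ofIndividual and ex:identifies, is hypothetical and made up for this 
illustration; Darwin Core does not define them, and dwc:Individual is 
still only a proposed class.  The sketch shows only that two 
Occurrences can share Identifications once both point at the same named 
Individual.

```python
# All ex: identifiers and predicates are hypothetical, for illustration only.
TRIPLES = [
    ("ex:occ1", "rdf:type", "dwc:Occurrence"),
    ("ex:occ2", "rdf:type", "dwc:Occurrence"),
    ("ex:ind1", "rdf:type", "dwc:Individual"),      # proposed class
    ("ex:id1", "rdf:type", "dwc:Identification"),
    ("ex:id2", "rdf:type", "dwc:Identification"),
    # Many Occurrences can document one Individual ...
    ("ex:occ1", "ex:ofIndividual", "ex:ind1"),
    ("ex:occ2", "ex:ofIndividual", "ex:ind1"),
    # ... and many Identifications can apply to that same Individual.
    ("ex:id1", "ex:identifies", "ex:ind1"),
    ("ex:id2", "ex:identifies", "ex:ind1"),
]


def identifications_for(occurrence, triples):
    """Find Identifications reachable from an Occurrence via its Individual."""
    individuals = {
        o for s, p, o in triples
        if s == occurrence and p == "ex:ofIndividual"
    }
    return sorted(
        s for s, p, o in triples
        if p == "ex:identifies" and o in individuals
    )


if __name__ == "__main__":
    print(identifications_for("ex:occ1", TRIPLES))
    print(identifications_for("ex:occ2", TRIPLES))
```

Without the ex:ind1 node, each Identification would have to be repeated 
for every Occurrence it applies to.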

Another alternative would be to fix the dwctype vocabulary, but that 
would be messier.  The dwctype vocabulary is designated as the 
controlled vocabulary for basisOfRecord, so changing it risks breaking 
basisOfRecord.  The other problem, as was 
noted earlier on the list, is that currently certain types in the DwC 
type vocabulary are declared as subClasses of other types, and that 
these declarations will cause unintentional assertions that don't make 
sense in the context of the general model that we've been discussing 
(namely that every PhysicalSpecimen is an Occurrence which is also an 
Event).  It seems to me that there is more "fixing" required here than 
is worth the effort given that we can just use the classes as the 
rdf:types as I described in the previous paragraph. 
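
The subClass problem can be sketched in a few lines of Python.  The 
chain used here (PhysicalSpecimen, Occurrence, Event) is taken from the 
wording of this post and has not been checked against the vocabulary 
file itself, and a dictionary with one parent per class is a 
simplification of rdfs:subClassOf, which allows several.  The function 
name inferred_types is my own.

```python
def inferred_types(asserted_type, subclass_of):
    """Follow subclass links upward and collect every implied type.

    subclass_of maps a class name to its single parent class; that
    single-parent shape is an assumption made to keep the sketch short.
    """
    types = []
    seen = set()
    current = asserted_type
    while current is not None and current not in seen:
        seen.add(current)          # guard against accidental cycles
        types.append(current)
        current = subclass_of.get(current)
    return types


# Chain as described in this post, not read from the vocabulary file.
SUBCLASS_OF = {
    "PhysicalSpecimen": "Occurrence",
    "Occurrence": "Event",
}

if __name__ == "__main__":
    print(inferred_types("PhysicalSpecimen", SUBCLASS_OF))
```

A reasoner that honors those declarations will conclude that a 
resource typed only as PhysicalSpecimen is also an Occurrence and an 
Event, whether or not anyone intended to say so.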

The final alternative would be to make the TDWG Ontology functional and 
use it to type resources.  Although there has been some recent 
discussion on the list about working on the Ontology, at the present 
moment there isn't a clear plan or timeline to finish it.  Telling 
people to wait for something that may never happen is not an acceptable 
alternative to me.  I think it is clear that there are multiple people 
and institutions that are either ready to write RDF in support of GUIDs 
or are already doing it now.  Six months is about the longest timeframe 
that I think is reasonable for coming up with a solution to the typing 
problem discussed above and to have some kind of basic guidelines for 
the structuring of RDF.  A general model based on the existing Darwin 
Core classes is the only path forward that I can see as feasible in that 
time frame, and a general model could always be built into a more 
sophisticated model (i.e. the Ontology) at leisure if anyone cared to 
take the time.  If TDWG doesn't get its act together on a six-month to 
one year time scale, people will simply give up and write Darwin 
Core-based RDF without any TDWG guidelines.  It has been suggested that 
a Task Group be formed to draft a DwC RDF Guide.  I would be very keen 
to see that happen and would be willing to be involved in the process, 
but I'm not interested in it if the process doesn't start out with some 
version of the consensus model we've discussed here and with some quick 
decision from the TAG about how to handle the rdf:typing problem.  
Without those two things, there would just be endless unproductive 
debate about how to go about building the model from scratch and I've 
got better things to do than that.

I will end this with one final comment about the proposed Individual 
class in this context.  I have stated clearly in several earlier posts 
that I don't think that the Individual class really has many properties 
and that it functions primarily as a named node to facilitate 
one-to-many relationships with other classes.  This may strike some 
people as odd, given that the primary purpose of classes in the existing 
Darwin Core seems to be to group similar terms that can act as 
properties for the class.  What became apparent to me when I was 
creating the diagrams for the first post was that if the Time terms are 
pulled out of the Event class (as they probably should be in a fully 
normalized model) and the "Collecting Unit" terms are pulled out of the 
Occurrence class (as I think must happen if we separate tokens from 
Occurrences), there are also very few property terms left in the Event 
and Occurrence classes.  Most of the terms that remain are 
"housekeeping" ones used for remarks, or to make note of the person who 
documented the instance and when.  Most of the terms that actually 
describe measurable properties are found in the peripheral classes like 
Location, Time, and Collecting Unit.  Just as in the case of the 
proposed Individual class, the Event and Occurrence classes are 
primarily named nodes that connect other classes.  The only reason they 
have very many terms at present is that we have some of the terms 
in the "wrong" place for a fully normalized model.  I think that it is 
also no coincidence that these three classes (Event, Occurrence, and 
Individual) are also the three that we have had the most trouble 
defining.  I think that's precisely because they have very few 
properties of their own.  They do roughly correspond to things for which 
we have conceptual images, which is why we are able to come up with 
meaningful names for them.  But as I have argued, it is better to define 
them according to what we want them to DO rather than by our mental 
image of them.  And that is a lead-in to the third and final post.

-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu
