[tdwg-content] Background for the Individual class proposal. 2. Classes and types
Steve Baskauf
steve.baskauf at vanderbilt.edu
Sat Nov 13 17:23:54 CET 2010
I'm going to start this post with two comments about RDF. Some people
seem to think they have a phobia of RDF (I know that I did at first).
What I really think is that they have a phobia of RDF represented as XML
or RDF represented in N3 notation. This point has been made before: RDF
is a system for describing properties of and relationships among
resources (i.e. things that can be assigned identifiers) but it does not
have only one particular way that these properties and relationships
must be specified. It is perfectly correct to represent RDF entirely in
pictures (i.e. as an RDF "graph", see http://www.w3.org/TR/rdf-primer/
and ignore all of the text - just look at the figures). RDF graph
notation wouldn't be of much use to a computer, but that graph could
easily be translated into one of the other notations (XML or N3) and
then a computer would understand it perfectly. Since RDF is something
that is specifically designed to represent relationships among classes
of resources, it is the perfect thing to clearly lay out what we mean
when we have a discussion of the sort that we are having here. One of
the reasons why I am so keen to make diagrams of the sort I posted in
the first message in this series is because once you have the diagram,
it is a relatively simple matter to change the shapes of the boxes and
add arrows instead of triangles or lines with crow's feet and voila! you
have an RDF graph. It then becomes an academic exercise to have an RDF
model in XML or whatever format you like. I am of the opinion that we
are actually pretty close to a consensus about what the diagram should
be, which means that we are also pretty close to a simple RDF model for
Darwin Core.
The other comment about RDF is that we need to work out a basic model
now. Partly this is because there are already several people who have
been contributing to this discussion who are already writing RDF or who
intend to do so in the near future. If we have any delusions about
doing even the most simple kind of machine reasoning in the future, we
all need to be using the same basic diagram (i.e. model). The other
reason why we need to work this out now is that if we don't, we will
impede the process of utilizing GUIDs/Persistent Identifiers. The draft
TDWG GUID Applicability Statement
(http://www.tdwg.org/stdtrack/article/download/150/51, recommendation 10)
says clearly that a proper GUID should be able to be dereferenced to
provide an RDF/XML representation (did I use "dereferenced" right,
Bob?). If we don't agree on how to represent the classes of resources
that are of interest to the DwC community in RDF then we are setting up
the situation where TDWG makes a recommendation (on how GUIDs are
implemented) that is impossible for people to follow. I believe that it
is best to settle on a basic model now rather than at an indefinite
point in the future for this reason.
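As a rough illustration of what "dereferencing to RDF/XML" usually means in practice, a client asks for the GUID's URI over HTTP and states in the Accept header that it wants application/rdf+xml. The sketch below only builds such a request and does not send it. The URI is the example used later in this post; whether that server (or any particular GUID provider) honors the Accept header is an assumption, and build_rdf_request is a hypothetical helper name.

```python
from urllib.request import Request

# Media type registered for RDF/XML.
RDF_XML = "application/rdf+xml"


def build_rdf_request(guid_uri: str) -> Request:
    """Build (but do not send) an HTTP GET request that asks for RDF/XML.

    Hypothetical helper for illustration; it is not part of any TDWG guideline.
    """
    return Request(guid_uri, headers={"Accept": RDF_XML})


# Example URI taken from this post; server behavior is assumed, not verified.
req = build_rdf_request("http://bioimages.vanderbilt.edu/baskauf/10692#occ")
print(req.get_method(), req.get_header("Accept"))
```

Sending the request (for example with urllib.request.urlopen) would then return whatever representation the provider chooses to serve, which is exactly why an agreed RDF model matters.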
Having given this rationale, I'm going to talk about how we look at
classes and types in Darwin Core and how the need for an RDF
representation of DwC should influence our view on this topic. In
Darwin Core as it stands (see the "Audience" section of
http://rs.tdwg.org/dwc/terms/index.htm), classes are simply categories
that group terms that describe instances of the class. The description
specifically states that the terms are intended to be properties of the
class (i.e. properties of instances of the class). When DwC terms are
used as column headings in a database table, there isn't any "rule" that
says that one must specify the type of thing to which that term applies.
On the other hand, I think that it is considered a Bad Thing in RDF to
apply properties to a resource having an unspecified type. It's not
impossible to do so, but specifying the rdf:type of a resource is one of
the most fundamental things that one does in creating a description of
the resource. This is recognized in the TDWG GUID Applicability
Statement (recommendation 11) which says that objects identified by
GUIDs should be typed using a well-known vocabulary. One "well-known
vocabulary" is the Darwin Core Type Vocabulary
(http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm). There isn't
any formal relationship between the Darwin Core classes in the dwc:
(http://rs.tdwg.org/dwc/terms/) namespace and the types in the dwctype:
(http://rs.tdwg.org/dwc/dwctype/) namespace. We could use the dwctypes
to describe resources that we want to say are instances of dwc: classes
(and meet the GUID guidelines), but that would raise problems that I
will get into later. The point is that as Darwin Core is currently set
up, there isn't a formal relationship between the dwc: classes that are
used to group the terms and the dwctype: types that could be used to
rdf:type them. As described, the dwctype vocabulary is simply intended
to supply values for basisOfRecord, and as I pointed out in
the previous post, basisOfRecord only really works when Occurrences are
limited to having a single token.
In RDF, the relationship between classes and types is different from the
way it currently stands in Darwin Core. RDF classes and types are tied
together by definition
(http://www.w3.org/TR/2004/REC-rdf-schema-20040210/#ch_type). If you
assert that a resource has an rdf:type of X, you are simultaneously
asserting that the resource is an instance of class X. The relationship
between a class in RDF and the declaration of rdf:type is so entwined
that naming an XML container element after the class of which the resource
is an instance is identical to an explicit declaration of type. The following
two examples produce exactly the same result if you paste them into an
RDF validator like http://www.w3.org/RDF/Validator/ :

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description
      rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
    <rdf:type rdf:resource="http://rs.tdwg.org/dwc/terms/Occurrence"/>
  </rdf:Description>
</rdf:RDF>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
  <dwc:Occurrence
      rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
  </dwc:Occurrence>
</rdf:RDF>

Even though there is no explicit declaration of rdf:type in the text of
the second example (i.e. the dwc:Occurrence container element is empty),
the validator treats the code as if a type property were stated
explicitly - you can see that the resulting triple and graph created by
the validator show the RDF as having made an explicit declaration of
rdf:type=dwc:Occurrence.
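For anyone who would rather check this locally than paste into the validator, here is a minimal sketch in Python using only the standard library. It is not a real RDF parser: the type_triples function is a hypothetical helper that handles only the two patterns shown above (a top-level node with rdf:about, typed either by its element name or by an rdf:type child). Under that assumption, both documents reduce to the same single triple.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_TYPE = RDF_NS + "type"

# First form from the post: explicit rdf:type inside rdf:Description.
EXPLICIT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
<rdf:type rdf:resource="http://rs.tdwg.org/dwc/terms/Occurrence"/>
</rdf:Description>
</rdf:RDF>"""

# Second form from the post: the element name itself carries the type.
TYPED_NODE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<dwc:Occurrence rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
</dwc:Occurrence>
</rdf:RDF>"""


def type_triples(rdf_xml):
    """Return the set of (subject, rdf:type, class) triples in a tiny document.

    Illustrative only: covers just the two patterns used in this post.
    """
    triples = set()
    for node in ET.fromstring(rdf_xml):
        subject = node.get("{%s}about" % RDF_NS)
        # ElementTree spells qualified names as "{namespace}local".
        namespace, local = node.tag[1:].split("}")
        if namespace + local != RDF_NS + "Description":
            # A typed node element implies an rdf:type triple.
            triples.add((subject, RDF_TYPE, namespace + local))
        for child in node:
            if child.tag == "{%s}type" % RDF_NS:
                triples.add((subject, RDF_TYPE, child.get("{%s}resource" % RDF_NS)))
    return triples


print(type_triples(EXPLICIT) == type_triples(TYPED_NODE))  # prints True
```

A full RDF library would be the right tool for real data; the point here is only that the element name and the explicit rdf:type property carry the same information.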
So my point is that to enable people to follow the TDWG GUID
recommendations and provide RDF that tells people the type of the
resource, TDWG bears a responsibility to provide GUID users with terms
that are suitable for use as an rdf:type property for every class of
resource that we can reasonably be expected to want to assign a GUID.
In my book, that's every box shown on the summary diagram
http://bioimages.vanderbilt.edu/pages/full-model.jpg except for tokens
(and excluding Time if we agree that we will always denormalize it out
of existence as a class). I exclude tokens as a group because they are
not a single class of resource. Any type of resource that provides
evidence that an Occurrence happened can be a token. In some cases
(such as images and sounds) those types are already defined in Dublin
Core. Darwin Core would only need to define types for things that
aren't defined elsewhere, such as the Collecting Units in the ASC model
(but this is the topic of the third installment).
One way to do this (and the way that I favor) is to make sure that there
is a Darwin Core class for every category of resource for which one
would reasonably expect to assign a GUID. Referring to the full model
diagram, the only categories that don't have classes at the moment are
Individual (which I have proposed to add), Time (which may or may not be
necessary), and Collecting Unit (again, more on this in the final
installment). The first category could be created by voting to accept
my proposal about the class Individual. The last would require a new
recommendation, but I think that Rich has pretty much suggested that
this should happen when he says that there are a lot of terms in the
Occurrence category that don't belong there (i.e. they belong with
Collecting Units). So it would make sense from the point of view of a
more logical organization of terms to do this anyway. As Bob has
pointed out, in RDF making a declaration of rdf:type=X is the same thing
as declaring that class X exists. So why not make the rdf:types BE the
Darwin Core classes so we will be declaring something that actually does
already exist instead of conjuring up virtual classes from types that we
make up? There have been some people who have questioned my proposal
for adding Individual as a DwC class on the basis that it is not clear
that anybody "needs" it. What I am stating here is that anybody who
plans to write RDF following recommendations based on a fully normalized
Darwin Core RDF model (which should be EVERYONE who writes RDF using
Darwin Core!) "needs" all of the classes that connect resources they
plan to describe. That means that anybody who plans to connect
Occurrence metadata to Identifications should be doing it in their RDF
through named instances of the dwc:Individual class.
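To make the "named node" idea concrete, here is a small sketch with triples held as plain Python tuples. Everything in the example.org namespace is hypothetical, including the two linking predicates (occurrenceOf and identifies), which are made-up stand-ins and not Darwin Core terms. dwc:Individual is the class proposed in this series, not an existing term. The sketch shows two Occurrences reaching an Identification only by way of the Individual they share.

```python
DWC = "http://rs.tdwg.org/dwc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace for illustration

# Hypothetical linking predicates; NOT Darwin Core terms.
OCCURRENCE_OF = EX + "occurrenceOf"
IDENTIFIES = EX + "identifies"

triples = [
    (EX + "ind1", RDF_TYPE, DWC + "Individual"),  # proposed class
    (EX + "occ1", RDF_TYPE, DWC + "Occurrence"),
    (EX + "occ2", RDF_TYPE, DWC + "Occurrence"),
    (EX + "id1", RDF_TYPE, DWC + "Identification"),
    (EX + "occ1", OCCURRENCE_OF, EX + "ind1"),
    (EX + "occ2", OCCURRENCE_OF, EX + "ind1"),
    (EX + "id1", IDENTIFIES, EX + "ind1"),
]


def identifications_for(occurrence, graph):
    """Find Identifications that apply to an Occurrence via its Individual."""
    individuals = {o for s, p, o in graph if s == occurrence and p == OCCURRENCE_OF}
    return sorted(s for s, p, o in graph if p == IDENTIFIES and o in individuals)


print(identifications_for(EX + "occ1", triples))  # ['http://example.org/id1']
print(identifications_for(EX + "occ2", triples))  # ['http://example.org/id1']
```

Both Occurrences pick up the same Identification without either of them carrying identification properties directly, which is the one-to-many behavior the Individual node is meant to provide.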
Another alternative would be to fix the dwctype vocabulary, but that
would be messier. The dwctype vocabulary is designated as the
controlled vocabulary for basisOfRecord, so it is hard to change it
without breaking basisOfRecord. The other problem, as was
noted earlier on the list, is that currently certain types in the DwC
type vocabulary are declared as subClasses of other types, and that
these declarations will cause unintentional assertions that don't make
sense in the context of the general model that we've been discussing
(namely that every PhysicalSpecimen is an Occurrence which is also an
Event). It seems to me that there is more "fixing" required here than
is worth the effort given that we can just use the classes as the
rdf:types as I described in the previous paragraph.
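The unintended assertions come from ordinary RDFS subclass reasoning: if a resource has rdf:type C and C is an rdfs:subClassOf D, a reasoner concludes that the resource also has rdf:type D. A minimal sketch of that rule follows. The subclass chain is transcribed from the description in this post and has not been checked against the published type vocabulary; ex:specimen1 is a hypothetical resource, and the prefixed names are plain strings rather than resolved URIs.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def entail_types(triples):
    """Apply the RDFS subclass rule repeatedly until nothing new is inferred."""
    inferred = set(triples)
    while True:
        new = {
            (s, RDF_TYPE, d)
            for (s, p, o) in inferred if p == RDF_TYPE
            for (c, q, d) in inferred if q == SUBCLASS_OF and c == o
        } - inferred
        if not new:
            return inferred
        inferred |= new


# Subclass chain as described in this post (an assumption, not verified).
graph = {
    ("dwctype:PhysicalSpecimen", SUBCLASS_OF, "dwctype:Occurrence"),
    ("dwctype:Occurrence", SUBCLASS_OF, "dwctype:Event"),
    ("ex:specimen1", RDF_TYPE, "dwctype:PhysicalSpecimen"),  # hypothetical
}

types = sorted(o for s, p, o in entail_types(graph)
               if s == "ex:specimen1" and p == RDF_TYPE)
print(types)  # ['dwctype:Event', 'dwctype:Occurrence', 'dwctype:PhysicalSpecimen']
```

A single type statement about a specimen therefore ends up asserting that the specimen is an Event, which is the kind of conclusion that does not fit the general model.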
The final alternative would be to make the TDWG Ontology functional and
use it to type resources. Although there has been some recent
discussion on the list about working on the Ontology, at the present
moment there isn't a clear plan or timeline to finish it. Telling
people to wait for something that may never happen is not an acceptable
alternative to me. I think it is clear that there are multiple people
and institutions that are either ready to write RDF in support of GUIDs
or are already doing it now. Six months is about the longest timeframe
that I think is reasonable for coming up with a solution to the typing
problem discussed above and to have some kind of basic guidelines for
the structuring of RDF. A general model based on the existing Darwin
Core classes is the only path forward that I can see as feasible in that
time frame and a general model could always be built into a more
sophisticated model (i.e. the Ontology) at leisure if anyone cared to
take the time. If TDWG doesn't get its act together on a six-month to
one year time scale, people will simply give up and write Darwin
Core-based RDF without any TDWG guidelines. It has been suggested that
a Task Group be formed to draft a DwC RDF Guide. I would be very keen
to see that happen and would be willing to be involved in the process,
but I'm not interested in it if the process doesn't start out with some
version of the consensus model we've discussed here and with some quick
decision from the TAG about how to handle the rdf:typing problem.
Without those two things, there would just be endless unproductive
debate about how to go about building the model from scratch and I've
got better things to do than that.
I will end this with one final comment about the proposed Individual
class in this context. I have stated clearly in several earlier posts
that I don't think that the Individual class really has many properties
and that it functions primarily as a named node to facilitate
one-to-many relationships with other classes. This may strike some
people as odd, given that the primary purpose of classes in the existing
Darwin Core seems to be to group similar terms that can act as
properties for the class. What became apparent to me when I was
creating the diagrams for the first post was that if the Time terms are
pulled out of the Event class (as they probably should be in a fully
normalized model) and the "Collecting Unit" terms are pulled out of the
Occurrence class (as I think must happen if we separate tokens from
Occurrences), there are also very few property terms left in the Event
and Occurrence classes. Most of the terms that remain are
"housekeeping" ones used for remarks, or to make note of the person who
documented the instance and when. Most of the terms that actually
describe measurable properties are found in the peripheral classes like
Location, Time, and Collecting Unit. Just as in the case of the
proposed Individual class, the Event and Occurrence classes are
primarily named nodes that connect other classes. The only reason they
have very many terms at the present is because we have some of the terms
in the "wrong" place for a fully normalized model. I think that it is
also no coincidence that these three classes (Event, Occurrence, and
Individual) are also the three that we have had the most trouble
defining. I think that's precisely because they have very few
properties of their own. They do roughly correspond to things for which
we have conceptual images, which is why we are able to come up with
meaningful names for them. But as I have argued, it is better to define
them according to what we want them to DO rather than by our mental
image of them. And that is a lead-in to the third and final post.
--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences
postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.
delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235
office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu