[tdwg-content] If you need something for referring to a population, then it is probably best to do it as a related class
Steve Baskauf
steve.baskauf at vanderbilt.edu
Sun May 1 15:01:09 CEST 2011
OK, Pete, I'm going to try to write the other half of the email that I
promised. I'm going to start by saying that some of what I'm talking
about here has already been posted on the Darwin-SW (DSW) wiki page
called RelationshipToExistingModels
(http://code.google.com/p/darwin-sw/wiki/RelationshipToExistingModels).
I've actually wanted to bring this up with you for about six months, but
have never taken the time to put it into an email. Since this is going
out to the list, I'll include some comments about background that you
already know (assuming a broader audience). I'd welcome your comments
and feedback on what I've said and whether you think it is accurate or not.
One of the things that I think makes it difficult for people to follow
what you are proposing on taxonconcept.org is that the structure of your
RDF is complex. I'm not saying that is a bad thing, I'm just saying
that if you combine that with people's general unfamiliarity with RDF
and the difficulty that some people have with visualizing RDF in XML
format, it just isn't accessible to most people. Even that difficulty
in itself isn't necessarily a bad thing because RDF isn't really
intended to be understood primarily by people - it's designed to be
understood by computers, so many people on this list don't really need
to care about it. Nevertheless, in order to have a discussion about a
proposal, one must be able to visualize it. I am a very right-brained
person and must have maps, diagrams, and graphs to conceptualize
things. So the first thing I did was to go to
http://www.w3.org/RDF/Validator/, put in the URI of your example, and tell
the parser to give me graph only. In the RelationshipToExistingModels
wiki page, I looked at
http://ocs.taxonconcept.org/ocs/f522444a-2dd9-400e-be59-47213ef38cb9.rdf
which provided information about an occurrence. One of the most obvious
features of the resulting graph is that it is complex. I used the word
"reticulated" because there are many cross-connections between the
nodes. It takes a bit of time and a large-screen monitor to sort it all
out, but if one ignores the literals and concentrates only on the nodes
that are labeled with URIs, the structure is actually very similar to
the structure we used in DSW. So that tells me that we (Cam and I) are
seeing the biodiversity informatics world in a very similar manner to
the way you have been seeing it. One obvious difference is in the class
names used to type resources - we mostly used DwC classes and you used
ones that you defined in your ontology, but that is a cosmetic
difference if one assumes that the classes in DSW and at
taxonconcept.org represent the same thing.
If one considers the basic RDF graph for DSW shown on the DSW home page
(http://code.google.com/p/darwin-sw/), with the exception of dsw:Token
which can be evidence for several things (and foaf:Person which was kind
of thrown in at the end), the basic structure of DSW is linear. There
is a connection between each class wherever there is a potential need
for a one-to-many join between class instances (see triangles [="crow's
feet"] on the Fully Normalized Model on the RelationshipToExistingModels
wiki page; DSW is like this diagram except there is no Time class, and
TaxonNameUsage in the diagram is the Taxon class in DSW). The
connections are made by object properties that we defined in the DSW
ontology. A major difference between the DSW structure and the
structure that can be seen in the RDF graph of the taxonconcept.org
occurrence example is that there is not just one connection that can be
used to traverse the classes. For example, in DSW, to obtain
information about Occurrences that are associated with an
Identification, one would have to "surf" from a dwc:Identification
instance to dsw:Individual instance using the dsw:identifies property,
then from the dsw:Individual instance to the dwc:Occurrence instance
using the dsw:hasOccurrence property. Similarly, in the
taxonconcept.org example, one could go from the txn:Identification
instance to the txn:SpeciesIndividual instance using the
txn:identificationOfIndividual property, then from the
txn:SpeciesIndividual instance to the txn:Occurrence instance using the
txn:individualHasOccurrence property. However, taxonconcept.org also
allows one to make the connection from the txn:Identification instance
directly to the txn:Occurrence instance using the
txn:identificationHasOccurrence, skipping the txn:SpeciesIndividual
altogether. Similar "shortcuts" connect other classes in
taxonconcept.org whose analogues in DSW are separated by intervening
classes, e.g. from txn:Occurrence to txn:SpeciesConcept (roughly
analogous to dwc:Taxon) by txn:occurrenceHasSpeciesConcept, from
dwc_area:Area (roughly analogous to dcterms:Location) to
txn:SpeciesConcept by txn:areaHasObservedSpeciesConcept, etc. I don't
think the taxonconcept.org ontology has every possible connection
between every class, but in theory one could do that if one wanted.
There would be even more connections and shortcuts if the
taxonconcept.org ontology included a class that is analogous to
dwc:Event (it's "flattened out" of the taxonconcept.org model, see
http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001710.html
and the posts that precede and follow it for more on the topic of
"flattening" databases). It is the presence of these "shortcut"
properties in the taxonconcept.org ontology that makes its RDF graph so
complex and "reticulated" and the absence of them that makes a DWC RDF
graph much simpler.
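
To make the two routes concrete, here is a minimal sketch in Python
that treats a handful of triples as plain tuples. The property names
are the ones mentioned above; the ex: instance identifiers and the
helper functions are made up for illustration and are not part of
either ontology.

```python
# Toy triple store: a set of (subject, predicate, object) tuples.
# Property names come from the discussion above; the ex: instance
# identifiers are invented for illustration only.
TRIPLES = {
    ("ex:id1", "txn:identificationOfIndividual", "ex:ind1"),
    ("ex:ind1", "txn:individualHasOccurrence", "ex:occ1"),
    ("ex:ind1", "txn:individualHasOccurrence", "ex:occ2"),
    # Shortcut triples that the provider must assert in addition:
    ("ex:id1", "txn:identificationHasOccurrence", "ex:occ1"),
    ("ex:id1", "txn:identificationHasOccurrence", "ex:occ2"),
}


def objects(triples, subject, predicate):
    """Return every object o for which (subject, predicate, o) is present."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def occurrences_via_individual(triples, identification):
    """Two hops: Identification -> SpeciesIndividual -> Occurrence."""
    found = set()
    for individual in objects(triples, identification,
                              "txn:identificationOfIndividual"):
        found |= objects(triples, individual, "txn:individualHasOccurrence")
    return found


def occurrences_via_shortcut(triples, identification):
    """One hop using the shortcut property."""
    return objects(triples, identification, "txn:identificationHasOccurrence")


two_hop = occurrences_via_individual(TRIPLES, "ex:id1")
one_hop = occurrences_via_shortcut(TRIPLES, "ex:id1")
print(sorted(two_hop), sorted(one_hop))
```

Both routes reach the same Occurrences here, but only because the
shortcut triples were asserted alongside the two-step ones.
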
Which approach is correct? As the old adage says "anybody can say
anything about anything". There isn't really anything intrinsically
"wrong" with either including or excluding "shortcut" properties. I am
guessing that the reason why your ontology has them and DSW doesn't may
be a reflection of the reasons why the ontologies were created. From
what you've said in the past, I gather that you would like to facilitate
assembling masses of metadata in triple stores and run SPARQL queries on
them to discover interesting things. Cam and I want to make it possible
to apply GUIDs to very diverse kinds of things and be able to track what
happens to them if they end up in different places. These are not
necessarily mutually exclusive desires, but they do represent a
difference in outlook. I know virtually nothing about SPARQL, but at
the risk of exposing myself as an ignoramus, I'm going to mention SPARQL
queries in this post anyway. I assume from your examples that it is
relatively easy to run a query to discover resources that are one object
property-step away from a subject resource. I would assume that it
would be much more difficult to run such queries on things that are five
object property-steps apart. For example, if one wanted to know all of
the instances of txn:SpeciesConcept that occurred at a dwc_area:Area,
all one would have to do is to search for all of the objects of the
txn:areaHasObservedSpeciesConcept properties for instances of that
particular dwc_area:Area in a triple store. In DSW, one would need to
look for all of the dwc:Events that happened at that dcterms:Location,
then find all of the dwc:Occurrences that happened at those dwc:Events,
then find out which dsw:Individuals were represented in those
dwc:Occurrences, then look up all of the dwc:Identifications for those
dsw:Individuals, and finally make a non-redundant list of dwc:Taxon
instances that were represented in those dwc:Identifications. I don't
know if there is a simple SPARQL query for that, but I doubt it. So
from the standpoint of querying, the "shortcut" property method that
taxonconcept.org uses is much better.
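
As a rough sketch of what the multi-step lookup involves, the Python
below follows a chain of properties over plain tuples. It is not a
SPARQL query, and the hyp: property names and ex: instances are
placeholders that I have assumed for illustration; they are not DSW
terms, and the real properties may point in the other direction.

```python
# Hypothetical forward links along the chain
# Location -> Event -> Occurrence -> Individual -> Identification -> Taxon.
# The hyp: property names and ex: instances are placeholders, not DSW terms.
PATH = [
    "hyp:locationHasEvent",
    "hyp:eventHasOccurrence",
    "hyp:occurrenceOfIndividual",
    "hyp:individualHasIdentification",
    "hyp:identificationToTaxon",
]

TRIPLES = {
    ("ex:loc1", "hyp:locationHasEvent", "ex:event1"),
    ("ex:loc1", "hyp:locationHasEvent", "ex:event2"),
    ("ex:event1", "hyp:eventHasOccurrence", "ex:occ1"),
    ("ex:event2", "hyp:eventHasOccurrence", "ex:occ2"),
    ("ex:occ1", "hyp:occurrenceOfIndividual", "ex:ind1"),
    ("ex:occ2", "hyp:occurrenceOfIndividual", "ex:ind2"),
    ("ex:ind1", "hyp:individualHasIdentification", "ex:id1"),
    ("ex:ind2", "hyp:individualHasIdentification", "ex:id2"),
    ("ex:id1", "hyp:identificationToTaxon", "ex:taxonA"),
    ("ex:id2", "hyp:identificationToTaxon", "ex:taxonA"),
}


def follow_path(triples, start, path):
    """Follow each predicate in turn; a set keeps the result non-redundant."""
    frontier = {start}
    for predicate in path:
        frontier = {o for s, p, o in triples
                    if p == predicate and s in frontier}
    return frontier


taxa_at_location = follow_path(TRIPLES, "ex:loc1", PATH)
print(taxa_at_location)
```

The traversal itself is short, but it only works if whoever writes it
already knows the full chain of classes and properties.
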
However, there is an important problem with the "shortcut" strategy. In
order to be able to make a simple query that makes use of single-step
properties, one must know what kind of query a user will want to make
and then make sure that there is a shortcut property that connects the
classes of interest. This requires either a crystal ball to be able to
predict what people are interested in asking, or just making up
properties for every possible shortcut. If I'm doing the math right,
with the 6 classes that are included in the existing DwC plus
IndividualOrganism (or SpeciesIndividual if you prefer), there would be
15 connections among the classes which would make 30 object properties
required to connect them if one wanted every connection to have a pair
of inverse properties to enable going in either direction. If one
included the Token class, that would make 21 pairs. The burden would
then fall on the metadata provider to provide values for all of those
properties. And although 21 connections doesn't sound that bad, there
could actually be a lot more property assignments than that
because there isn't any restriction that says that there will only be
one value for a property. If an organism has two Identifications, then
every xxxHasIdentification kind of property is going to have two
values. If there are many Identifications (e.g.
http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf) there
would be many values. Essentially, the metadata provider is left with
the job of pre-running every kind of query that a user could possibly
want to do.
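
The counts above follow from the usual formula for unordered pairs:
n classes give n(n-1)/2 connections, so 15 corresponds to 6 classes
and 21 to 7. A small Python check (the function name is just for
illustration):

```python
from math import comb


def shortcut_counts(n_classes):
    """Return (connections, properties) for a fully connected set of classes.

    Each unordered pair of classes is one connection; giving every
    connection a pair of inverse properties doubles the property count.
    """
    connections = comb(n_classes, 2)
    return connections, 2 * connections


for n in (6, 7):
    connections, properties = shortcut_counts(n)
    print(n, connections, properties)
```

Each class added to the model adds one new connection per existing
class, so the number of shortcut properties grows quadratically.
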
An alternative to this would be to simplify the model by "flattening out"
certain classes (making the model less "normalized"). You did that with
Event. In my Biodiversity Informatics article
(https://journals.ku.edu/index.php/jbi/article/view/3664) I did it for
Event and Location. Historically museum people have "flattened out"
IndividualOrganism and Token. People normalize out Identification all
of the time. As Rich Pyle pointed out in the post I cited above, people
"flatten" more complex models into simpler models all the time because
it is convenient and it makes their databases simpler and easier to
manage. But if our desire is to come up with a general model that will
work for museum people and their old specimen labels, bird and whale
observation people, DNA barcoding people, people who document live
organisms with images and sound, bioblitzers, etc. it has to include
every class that participants can reasonably need to have to facilitate
needed "one-to-many database joins" or whatever you want to call that.
In October, I posted a message to the tdwg-content list where I warned
against setting precedents for using "wrong" RDF
(http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001663.html).
In that post, my point was that people should not apply properties to
instances of the wrong class. That is exactly what happens when people
simplify models by eliminating classes that have only one-to-one
relationships with other classes in their database. So if I were to
restate my complaint again, I'd frame it this way: a "wrong" RDF model
is one that leaves out classes that potential users may need to express
the complexities of their data. This principle was an underlying
assumption when we constructed DSW, and to know what classes people
needed, we looked at the discussion that took place on the tdwg-content
list in Oct/Nov. As I point out on the RelationshipToExistingModels
wiki page, we could have included a Time class, but as a practical
matter, nobody has expressed a need for it (at least yet). So given
this principle, reducing the number of shortcut properties by getting
rid of classes is simply not an option for any model that hopes to
include all of the kinds of metadata that one would like to describe
within a community.
So the bottom line, in my opinion, is that in a model as complex as what
we need in the biodiversity informatics community (i.e. a "fully
normalized" model) we simply cannot hope to create and assign object
properties that connect every class. Hence in DSW we only created the
minimum number of object properties needed to express what we considered
the fundamental relationships among the classes. What this means is
that it simply may not be possible to make simple SPARQL queries on the
data to find out what people want to know. Rather, the burden will fall
on software developers to create software that can traverse the network
of connections among the classes and extract the information that they
need to answer the questions they want people to be able to pose through
use of their software. Nailing down what the community consensus is on
the classes and their connections is the first step to being able to
create that kind of software.
This email is already too long, but I think that I need to make one more
point about the impossibility of expecting a metadata provider to
pre-populate all of the necessary "shortcut" properties that one would
want to use in simple SPARQL queries. If there is only one person at
one institution creating all of the metadata, then it is easy to make
sure that all of the subject resources are assigned values for the
appropriate shortcut object properties. (I think this is the case in
the example SPARQL queries that you have put out on the list, i.e. all
of the metadata was provided by you - I didn't go back and look at the
examples again, so I could be wrong about that.) However, in the
situation that Cam and I are interested in, the various connected
resources may be at different institutions with metadata submitted "to
the cloud" by different people. For example, the tree
http://bioimages.vanderbilt.edu/uncg/84 is in the University of North
Carolina at Greensboro arboretum. An image of that tree,
http://bioimages.vanderbilt.edu/kirchoff/em1968 , is in the Bioimages
image collection. A specimen from that tree,
http://bioimages.vanderbilt.edu/specimen/ncu592805 , is in the
University of North Carolina herbarium in Chapel Hill. Although at the
moment these URIs are all under the http://bioimages.vanderbilt.edu
subdomain, I would hope that at some point in the future, there would be
permanent GUIDs for all of them (except the image) under the management
of someone other than me. Hopefully there will be a GUID for the Taxon
assigned to an Identification of the tree which would eventually be
managed at some community-maintained place like the Global Name Use Bank
(GNUB). So let's say I used the shortcut model and assigned a
"dsw:occurrenceHasTaxon" property (which doesn't actually exist in DSW
at the present) to the Occurrence documented by the image in my
collection (URI=http://bioimages.vanderbilt.edu/kirchoff/em1968#occ).
Now let's say that a /Quercus/ expert looks at the UNC specimen and
decides that it is some different species (i.e. creates a different
dwc:Identification). There now should be an additional value of the
"dsw:occurrenceHasTaxon" property of the Occurrence metadata that I'm
managing, but I'm not going to know that because the Identification has
been made by somebody else, not me. [I should note that the BiSciCol
project is hoping to make it possible for people to find out this kind
of thing, see http://biscicol.blogspot.com/ .] Is it my responsibility
to continually trawl the cloud and always be updating all of the many
shortcut properties that would be possible to assign to the resources
whose metadata I'm managing? If I don't do that, then SPARQL queries
that people would run on "dsw:occurrenceHasTaxon" properties would miss
information that had been added to the cloud by others - it would only
find out things that I already knew when I created my metadata record
for the resource I control. It seems to me that a major point of Linked
Open Data is that individuals add to the cloud by contributing their
little bit to it and that Wonderful Things happen when people find out
stuff by connecting those bits with other bits contributed by other
people at another place in the cloud. If we create a system that only
works when people are expected to know in advance what those Wonderful
Things are, then the whole exercise becomes pointless.
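
A small Python sketch of that scenario, again using plain tuples
rather than a real triple store. The tree and Occurrence URIs are the
ones given above; the ex: identifiers and the hyp:toTaxon property are
placeholders I have assumed, and dsw:occurrenceHasTaxon is the
hypothetical shortcut that does not exist in DSW.

```python
# Triples asserted by two different providers. The tree and occurrence
# URIs come from the text; the ex: identifiers and hyp:toTaxon are
# placeholders. dsw:occurrenceHasTaxon is the hypothetical shortcut
# discussed above and is not an actual DSW property.
TREE = "http://bioimages.vanderbilt.edu/uncg/84"
OCC = "http://bioimages.vanderbilt.edu/kirchoff/em1968#occ"

my_triples = {
    (TREE, "dsw:hasOccurrence", OCC),
    ("ex:identification1", "dsw:identifies", TREE),
    ("ex:identification1", "hyp:toTaxon", "ex:taxonA"),
    # Shortcut value computed when the record was created:
    (OCC, "dsw:occurrenceHasTaxon", "ex:taxonA"),
}

# Added later by someone else who examined the herbarium specimen.
expert_triples = {
    ("ex:identification2", "dsw:identifies", TREE),
    ("ex:identification2", "hyp:toTaxon", "ex:taxonB"),
}


def taxa_by_shortcut(triples, occurrence):
    """Read only the pre-populated shortcut property."""
    return {o for s, p, o in triples
            if s == occurrence and p == "dsw:occurrenceHasTaxon"}


def taxa_by_traversal(triples, occurrence):
    """Walk Occurrence -> Individual -> Identification -> Taxon."""
    individuals = {s for s, p, o in triples
                   if p == "dsw:hasOccurrence" and o == occurrence}
    identifications = {s for s, p, o in triples
                       if p == "dsw:identifies" and o in individuals}
    return {o for s, p, o in triples
            if p == "hyp:toTaxon" and s in identifications}


cloud = my_triples | expert_triples
stale = taxa_by_shortcut(cloud, OCC)
current = taxa_by_traversal(cloud, OCC)
print(stale, current)
```

The shortcut still reports only the taxon known when the record was
written, while the traversal also picks up the later Identification.
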
Anyway, I hope that this explains to some extent one of the reasons why
Cam and I created DSW rather than just jumping in and using the
taxonconcept.org ontology. We wanted something considerably simpler.
I was going to comment/ask about at least one more thing about the
ontology at taxonconcept.org, but this email is already way too long, so
I'll take that up in a subsequent email.
Steve
Peter DeVries wrote:
> I am still somewhat puzzled why TDWG seems so opposed to adopting
> anything that comes from outside a small clique?
>
> I was thinking that it would be best to create a separate class that
> can be used for populations of a species.
>
> This would require adding an additional tag to the TaxonConcept
> Species Concept Model, which currently includes several tags like entities
>
> http://lod.taxonconcept.org/ses/mCcSp#Species <- The Species Concept
> for the Cougar
>
> See http://lod.taxonconcept.org/ses/v6n7p.html HTML
> http://lod.taxonconcept.org/ses/v6n7p.rdf RDF
>
> http://lsd.taxonconcept.org/describe/?url=http%3A%2F%2Flod.taxonconcept.org%2Fses%2Fv6n7p%23Species
> Knowledge Base View (http://bit.ly/gMFqR1)
>
> The model mints URI's for the following related entities. See RDF. or
> KB View
>
> http://lod.taxonconcept.org/ses/mCcSp#Image - An image of a Cougar
> http://lod.taxonconcept.org/ses/mCcSp#Occurrence - An occurrence of a
> Cougar
> http://lod.taxonconcept.org/ses/mCcSp#Individual - An individual Cougar
> http://lod.taxonconcept.org/ses/mCcSp#Taxonomy - A Basic Taxonomy
> for the Cougar, one alternative among many potential classifications
> http://lod.taxonconcept.org/ses/mCcSp#NCBI_Taxonomy - The NCBI
> Taxonomy for Cougar, or starting at the lowest available clade
> http://lod.taxonconcept.org/ses/mCcSp#OriginalDescription - The
> Original Description of the Cougar, ideally with links to the PDF or
> BHL URI.
>
>
> Here is how a subset of these would relate to the new #Population Tag
> and related semantic entities.
>
>
> This tag is used for an individual organism that is an instance of
> the species concept.
> This allows you to refer to an individual cougar in a way that is
> separate from the concept of cougar and retains links to other data
> relating to that species concept.
>
>
> <txn:SpeciesIndividualTag
> rdf:about="http://lod.taxonconcept.org/ses/v6n7p#Individual">
> <dcterms:title>A Tag for individuals of the species concept Puma
> concolor se:v6n7p</dcterms:title>
> <skos:prefLabel>A Tag-like resource that is used to label
> individuals of the species concept Puma concolor se:v6n7p</skos:prefLabel>
>
> <dcterms:identifier>http://lod.taxonconcept.org/ses/v6n7p#Individual</dcterms:identifier>
> <dcterms:description>A lightweight tag that can be used to label
> individuals of this species. These allow individual organisms to be
> modeled as instances of SpeciesIndividualTag</dcterms:description>
> <dcterms:isPartOf
> rdf:resource="http://lod.taxonconcept.org/ses/v6n7p#Species"/>
> <wdrs:describedby
> rdf:resource="http://lod.taxonconcept.org/ses/v6n7p.rdf"/>
> </txn:SpeciesIndividualTag>
>
> Add a tag for a species population to the species concept RDF.
> This allows you to refer to a population of cougars in a way that is
> separate from an individual cougar and retains links to other data
> relating to that species concept.
>
> <txn:SpeciesPopulationTag
> rdf:about="http://lod.taxonconcept.org/ses/v6n7p#Population">
> <dcterms:title>A Tag for populations of the species concept Puma
> concolor se:v6n7p</dcterms:title>
> <skos:prefLabel>A Tag-like resource that is used to label
> populations of the species concept Puma concolor se:v6n7p</skos:prefLabel>
>
> <dcterms:identifier>http://lod.taxonconcept.org/ses/v6n7p#Population</dcterms:identifier>
> <dcterms:description>A lightweight tag that can be used to label
> populations of this species. These allow populations of a species to
> be modeled as instances of SpeciesIndividualTag</dcterms:description>
> <dcterms:isPartOf
> rdf:resource="http://lod.taxonconcept.org/ses/v6n7p#Species"/>
> <wdrs:describedby
> rdf:resource="http://lod.taxonconcept.org/ses/v6n7p.rdf"/>
> </txn:SpeciesPopulationTag>
>
>
> This is the RDF for a population, it has as one of its parts an
> individual organism.
> It is typed to indicate that it refers to a population of Cougars.
>
> <owl:Class
> rdf:about="http://lod.taxonconcept.org/pops/NorthAmericanCougarPopulation">
> <rdf:type
> rdf:resource="http://lod.taxonconcept.org/ses/v6n7p#Population"/>
> <skos:prefLabel>The population of North American Cougars Puma
> concolor se:v6n7 </skos:prefLabel>
> <dcterms:hasPart
> rdf:resource="http://ocs.taxonconcept.org/ocs/51cd124d-78c5-40aa-a7ff-2e3f58ca6ade#Individual"/>
> <wdrs:describedby
> rdf:resource="http://lod.taxonconcept.org/pops/NorthAmericanCougarPopulation.rdf"/>
> </owl:Class>
>
> Respectfully,
>
> - Pete
>
> -------------------------------------------------------------------------------------
>
> Pete DeVries
>
> Department of Entomology
>
> University of Wisconsin - Madison
>
> 445 Russell Laboratories
>
> 1630 Linden Drive
>
> Madison, WI 53706
>
> Email: pdevries at wisc.edu <mailto:pdevries at wisc.edu>
>
> TaxonConcept <http://www.taxonconcept.org/> & GeoSpecies
> <http://lod.geospecies.org/> Knowledge Bases
>
> A Semantic Web, Linked Open Data <http://linkeddata.org/> Project
>
> ---------------------------------------------------------------------------------------
>
>
--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences
postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.
delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235
office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu