[tdwg-content] If you need something for referring to a population, then it is probably best to do it as a related class

Sun May 1 15:01:09 CEST 2011

OK, Pete, I'm going to try to write the other half of the email that I 
promised.  I'm going to start by saying that some of what I'm talking 
about here has already been posted on the Darwin-SW (DSW) wiki page 
called RelationshipToExistingModels 
(http://code.google.com/p/darwin-sw/wiki/RelationshipToExistingModels).  
I've actually wanted to bring this up with you for about six months, but 
have never taken the time to put it into an email.  Since this is going 
out to the list, I'll include some comments about background that you 
already know (assuming a broader audience).  I'd welcome your comments 
and feedback on what I've said and whether you think it is accurate or not.

One of the things that I think makes it difficult for people to follow 
what you are proposing on taxonconcept.org is that the structure of your 
RDF is complex.  I'm not saying that is a bad thing, I'm just saying 
that if you combine that with people's general unfamiliarity with RDF 
and the difficulty that some people have with visualizing RDF in XML 
format, it just isn't accessible to most people.  Even that difficulty 
in itself isn't necessarily a bad thing because RDF isn't really 
intended to be understood primarily by people - it's designed to be 
understood by computers, so many people on this list don't really need 
to care about it.  Nevertheless, in order to have a discussion about a 
proposal, one must be able to visualize it.  I am a very right-brained 
person and must have maps, diagrams, and graphs to conceptualize 
things.  So the first thing I did was to go to 
http://www.w3.org/RDF/Validator/, put in the URI of your example, and tell 
the parser to give me graph only.  In the RelationshipToExistingModels 
wiki page, I looked at 
http://ocs.taxonconcept.org/ocs/f522444a-2dd9-400e-be59-47213ef38cb9.rdf 
which provided information about an occurrence.  One of the most obvious 
features of the resulting graph is that it is complex.  I used the word 
"reticulated" because there are many cross-connections between the 
nodes.  It takes a bit of time and a large-screen monitor to sort it all 
out, but if one ignores the literals and concentrates only on the nodes 
that are labeled with URIs, the structure is actually very similar to 
the structure we used in DSW.  So that tells me that we (Cam and I) are 
seeing the biodiversity informatics world in a very similar manner to 
the way you have been seeing it.  One obvious difference is in the class 
names used to type resources - we mostly used DwC classes and you used 
ones that you defined in your ontology, but that is a cosmetic 
difference if one assumes that the classes in DSW and at 
taxonconcept.org represent the same thing.

If one considers the basic RDF graph for DSW shown on the DSW home page 
(http://code.google.com/p/darwin-sw/), with the exception of dsw:Token 
which can be evidence for several things (and foaf:Person which was kind 
of thrown in at the end), the basic structure of DSW is linear.  There 
is a connection between each class wherever there is a potential need 
for a one-to-many join between class instances (see triangles [="crow's 
feet"] on the Fully Normalized Model on the RelationshipToExistingModels 
wiki page; DSW is like this diagram except there is no Time class, and 
TaxonNameUsage in the diagram is the Taxon class in DSW).  The 
connections are made by object properties that we defined in the DSW 
ontology.  A major difference between the DSW structure and the 
structure that can be seen in the RDF graph of the taxonconcept.org 
occurrence example is that there is not just one connection that can be 
used to traverse the classes.  For example, in DSW, to obtain 
information about Occurrences that are associated with an 
Identification, one would have to "surf" from a dwc:Identification 
instance to dsw:Individual instance using the dsw:identifies property, 
then from the dsw:Individual instance to the dwc:Occurrence instance 
using the dsw:hasOccurrence property.  Similarly, in the 
taxonconcept.org example, one could go from the txn:Identification 
instance to the txn:SpeciesIndividual instance using the 
txn:identificationOfIndividual property, then from the 
txn:SpeciesIndividual instance to the txn:Occurrence instance using the 
txn:individualHasOccurrence property.  However, taxonconcept.org also 
allows one to make the connection from the txn:Identification instance 
directly to the txn:Occurrence instance using the 
txn:identificationHasOccurrence, skipping the txn:SpeciesIndividual 
altogether.  Similar "shortcuts" connect other classes in 
taxonconcept.org whose analogues in DSW are separated by intervening 
classes, e.g. from txn:Occurrence to txn:SpeciesConcept (roughly 
analogous to dwc:Taxon) by txn:occurrenceHasSpeciesConcept, from 
dwc_area:Area (roughly analogous to dcterms:Location) to 
txn:SpeciesConcept by txn:areaHasObservedSpeciesConcept, etc.  I don't 
think the taxonconcept.org ontology has every possible connection 
between every class, but in theory one could do that if one wanted.  
There would be even more connections and shortcuts if the 
taxonconcept.org ontology included a class that is analogous to 
dwc:Event (it's "flattened out" of the taxonconcept.org model, see 
http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001710.html 
and the posts that precede and follow it for more on the topic of 
"flattening" databases).  It is the presence of these "shortcut" 
properties in the taxonconcept.org ontology that makes its RDF graph so 
complex and "reticulated" and the absence of them that makes a DSW RDF 
graph much simpler.
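
To make the two routes concrete, here is a minimal Python sketch that
treats an RDF graph as a plain set of (subject, predicate, object)
tuples.  The property names are the ones mentioned above; the instance
names (ex:id1, ex:ind1, ex:occ1, ex:occ2) are invented for illustration,
and putting DSW and taxonconcept.org properties on the same instances is
only a convenience for comparing the two approaches, not something
either ontology prescribes.

```python
# Triples are (subject, predicate, object) strings.
TRIPLES = {
    # DSW-style: two hops, Identification -> Individual -> Occurrence
    ("ex:id1", "dsw:identifies", "ex:ind1"),
    ("ex:ind1", "dsw:hasOccurrence", "ex:occ1"),
    ("ex:ind1", "dsw:hasOccurrence", "ex:occ2"),
    # taxonconcept.org-style shortcut: one hop, Identification -> Occurrence
    ("ex:id1", "txn:identificationHasOccurrence", "ex:occ1"),
    ("ex:id1", "txn:identificationHasOccurrence", "ex:occ2"),
}


def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def two_hop(triples, start, first, second):
    """Follow `first` from start, then `second` from each intermediate node."""
    found = set()
    for middle in objects(triples, start, first):
        found |= objects(triples, middle, second)
    return found


via_individual = two_hop(TRIPLES, "ex:id1", "dsw:identifies", "dsw:hasOccurrence")
via_shortcut = objects(TRIPLES, "ex:id1", "txn:identificationHasOccurrence")
print(sorted(via_individual))
print(sorted(via_shortcut))
```

Both routes reach the same Occurrences here, but only because the
shortcut triples were written out by hand alongside the two-step ones.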

Which approach is correct?  As the old adage says "anybody can say 
anything about anything".  There isn't really anything intrinsically 
"wrong" with either including or excluding "shortcut" properties.  I am 
guessing that the reason why your ontology has them and DSW doesn't may 
be a reflection of the reasons why the ontologies were created.  From 
what you've said in the past, I gather that you would like to facilitate 
assembling masses of metadata in triple stores and run SPARQL queries on 
them to discover interesting things.  Cam and I want to make it possible 
to apply GUIDs to very diverse kinds of things and be able to track what 
happens to them if they end up in different places.  These are not 
necessarily mutually exclusive desires, but they do represent a 
difference in outlook.  I know virtually nothing about SPARQL, but at 
the risk of exposing myself as an ignoramus, I'm going to mention SPARQL 
queries in this post anyway.  I assume from your examples that it is 
relatively easy to run a query to discover resources that are one object 
property-step away from a subject resource.  I would assume that it 
would be much more difficult to run such queries on things that are five 
object property-steps apart.  For example, if one wanted to know all of 
the instances of txn:SpeciesConcept that occurred at a dwc_area:Area, 
all one would have to do is to search for all of the objects of the 
txn:areaHasObservedSpeciesConcept properties for instances of that 
particular dwc_area:Area in a triple store.  In DSW, one would need to 
look for all of the dwc:Events that happened at that dcterms:Location, 
then find all of the dwc:Occurrences that happened at those dwc:Events, 
then find out which dsw:Individuals were represented in those 
dwc:Occurrences, then look up all of the dwc:Identifications for those 
dsw:Individuals, and finally make a non-redundant list of dwc:Taxon 
instances that were represented in those dwc:Identifications.  I don't 
know if there is a simple SPARQL query for that, but I doubt it.  So 
from the standpoint of querying, the "shortcut" property method that 
taxonconcept.org uses is much better.
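
As a rough illustration of what the five-step route involves, the
following Python sketch walks a chain of properties over a small
in-memory set of triples.  The property names (ex:locates, ex:eventOf,
ex:occurrenceOf, ex:hasIdentification, ex:toTaxon) and all instance
names are placeholders made up for this example; they are assumptions,
not a statement of what DSW actually calls these properties.

```python
# Hypothetical property names, one per step of the chain described above.
PATH = [
    "ex:locates",            # Location -> Event
    "ex:eventOf",            # Event -> Occurrence
    "ex:occurrenceOf",       # Occurrence -> Individual
    "ex:hasIdentification",  # Individual -> Identification
    "ex:toTaxon",            # Identification -> Taxon
]

TRIPLES = {
    ("loc1", "ex:locates", "ev1"),
    ("loc1", "ex:locates", "ev2"),
    ("ev1", "ex:eventOf", "occ1"),
    ("ev2", "ex:eventOf", "occ2"),
    ("occ1", "ex:occurrenceOf", "ind1"),
    ("occ2", "ex:occurrenceOf", "ind2"),
    ("ind1", "ex:hasIdentification", "id1"),
    ("ind2", "ex:hasIdentification", "id2"),
    ("ind2", "ex:hasIdentification", "id3"),  # one individual, two identifications
    ("id1", "ex:toTaxon", "taxonA"),
    ("id2", "ex:toTaxon", "taxonA"),
    ("id3", "ex:toTaxon", "taxonB"),
    # A second location whose taxon must not show up in the answer for loc1.
    ("loc2", "ex:locates", "ev3"),
    ("ev3", "ex:eventOf", "occ3"),
    ("occ3", "ex:occurrenceOf", "ind3"),
    ("ind3", "ex:hasIdentification", "id4"),
    ("id4", "ex:toTaxon", "taxonC"),
}


def step(triples, nodes, predicate):
    """Return all objects reachable from any node in `nodes` via `predicate`."""
    return {o for s, p, o in triples if p == predicate and s in nodes}


def follow_path(triples, start, path):
    """Apply each predicate in `path` in turn, starting from one resource."""
    nodes = {start}
    for predicate in path:
        nodes = step(triples, nodes, predicate)
    return nodes


taxa_at_loc1 = follow_path(TRIPLES, "loc1", PATH)
print(sorted(taxa_at_loc1))
```

Because each step returns a set, the final result is already the
non-redundant list of Taxon instances.  The traversal is mechanical once
the property names are agreed on, but it takes five lookups where the
shortcut takes one.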

However, there is an important problem with the "shortcut" strategy.  In 
order to be able to make a simple query that makes use of single-step 
properties, one must know what kind of query a user will want to make 
and then make sure that there is a shortcut property that connects the 
classes of interest.  This requires either a crystal ball to be able to 
predict what people are interested in asking, or just making up 
properties for every possible shortcut.  If I'm doing the math right, 
with the 6 classes that are included in the existing DwC plus 
IndividualOrganism (or SpeciesIndividual if you prefer), there would be 
15 connections among the classes which would make 30 object properties 
required to connect them if one wanted every connection to have a pair 
of inverse properties to enable going in either direction.  If one 
included the Token class, that would make 21 pairs.  The burden would 
then fall on the metadata provider to provide values for all of those 
properties. And although 21 connections doesn't sound that bad, there 
could actually be a lot more property assignments than that 
because there isn't any restriction that says that there will only be 
one value for a property.  If an organism has two Identifications, then 
every xxxHasIdentification kind of property is going to have two 
values.  If there are many Identifications (e.g. 
http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf) there 
would be many values.  Essentially, the metadata provider is left with 
the job of pre-running every kind of query that a user could possibly 
want to do. 
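
The counts above can be checked with a few lines of Python.  This
assumes that the six classes already include IndividualOrganism and that
adding Token brings the total to seven, which is the reading that matches
the figures of 15 and 21; every unordered pair of classes is one
connection, and each connection needs two inverse properties.

```python
from math import comb


def shortcut_counts(n_classes):
    """Return (connections, object properties) for a fully connected model."""
    connections = comb(n_classes, 2)  # n * (n - 1) / 2 unordered pairs
    properties = 2 * connections      # one property per direction
    return connections, properties


for n in (6, 7):
    connections, properties = shortcut_counts(n)
    print(f"{n} classes: {connections} connections, {properties} properties")
```

The number of connections grows roughly with the square of the number of
classes, so each class added to the model costs more than the one before.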

An alternative to this would be to simplify the model by "flattening out" 
certain classes (making the model less "normalized").  You did that with 
Event.  In my Biodiversity Informatics article 
(https://journals.ku.edu/index.php/jbi/article/view/3664) I did it for 
Event and Location.  Historically museum people have "flattened out" 
IndividualOrganism and Token.  People normalize out Identification all 
of the time.  As Rich Pyle pointed out in the post I cited above, people 
"flatten" more complex models into simpler models all the time because 
it is convenient and it makes their databases simpler and easier to 
manage.  But if our desire is to come up with a general model that will 
work for museum people and their old specimen labels, bird and whale 
observation people, DNA barcoding people, people who document live 
organisms with images and sound, bioblitzers, etc. it has to include 
every class that participants can reasonably need to have to facilitate 
needed "one-to-many database joins" or whatever you want to call that.  
In October, I posted a message to the tdwg-content list where I warned 
against setting precedents for using "wrong" RDF 
(http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001663.html).  
In that post, my point was that people should not apply properties to 
instances of the wrong class.  That is exactly what happens when people 
simplify models by eliminating classes that have only one-to-one 
relationships with other classes in their database.  So if I were to 
restate my complaint again, I'd frame it this way: a "wrong" RDF model 
is one that leaves out classes that potential users may need to express 
the complexities of their data.  This principle was an underlying 
assumption when we constructed DSW, and to know what classes people 
needed, we looked at the discussion that took place on the tdwg-content 
list in Oct/Nov.  As I point out on the RelationshipToExistingModels 
wiki page, we could have included a Time class, but as a practical 
matter, nobody has expressed a need for it (at least yet).  So given 
this principle, reducing the number of shortcut properties by getting 
rid of classes is simply not an option for any model that hopes to 
include all of the kinds of metadata that one would like to describe 
within a community. 

So the bottom line, in my opinion, is that in a model as complex as what 
we need in the biodiversity informatics community (i.e. a "fully 
normalized" model) we simply cannot hope to create and assign object 
properties that connect every class.  Hence in DSW we only created the 
minimum number of object properties needed to express what we considered 
the fundamental relationships among the classes.  What this means is 
that it simply may not be possible to make simple SPARQL queries on the 
data to find out what people want to know.  Rather, the burden will fall 
on software developers to create software that can traverse the network 
of connections among the classes and extract the information that they 
need to answer the questions they want people to be able to pose through 
use of their software.  Nailing down what the community consensus is on 
the classes and their connections is the first step to being able to 
create that kind of software. 

This email is already too long, but I think that I need to make one more 
point about the impossibility of expecting a metadata provider to 
pre-populate all of the necessary "shortcut" properties that one would 
want to use in simple SPARQL queries.  If there is only one person at 
one institution creating all of the metadata, then it is easy to make 
sure that all of the subject resources are assigned values for the 
appropriate shortcut object properties.  (I think this is the case in 
the example SPARQL queries that you have put out on the list, i.e. all 
of the metadata was provided by you - I didn't go back and look at the 
examples again, so I could be wrong about that.)  However, in the 
situation that Cam and I are interested in, the various connected 
resources may be at different institutions with metadata submitted "to 
the cloud" by different people.  For example, the tree 
http://bioimages.vanderbilt.edu/uncg/84 is in the University of North 
Carolina at Greensboro arboretum.  An image of that tree, 
http://bioimages.vanderbilt.edu/kirchoff/em1968 , is in the Bioimages 
image collection.  A specimen from that tree, 
http://bioimages.vanderbilt.edu/specimen/ncu592805 , is in the 
University of North Carolina herbarium in Chapel Hill.  Although at the 
moment these URIs are all under the http://bioimages.vanderbilt.edu 
subdomain, I would hope that at some point in the future, there would be 
permanent GUIDs for all of them (except the image) under the management 
of someone other than me.  Hopefully there will be a GUID for the Taxon 
assigned to an Identification of the tree which would eventually be 
managed at some community-maintained place like the Global Name Use Bank 
(GNUB).  So let's say I used the shortcut model and assigned a 
"dsw:occurrenceHasTaxon" property (which doesn't actually exist in DSW 
at the present) to the Occurrence documented by the image in my 
collection (URI=http://bioimages.vanderbilt.edu/kirchoff/em1968#occ).  
Now let's say that a Quercus expert looks at the UNC specimen and 
decides that it is some different species (i.e. creates a different 
dwc:Identification).  There now should be an additional value of the 
"dsw:occurrenceHasTaxon" property of the Occurrence metadata that I'm 
managing, but I'm not going to know that because the Identification has 
been made by somebody else, not me.  [I should note that the BiSciCol 
project is hoping to make it possible for people to find out this kind 
of thing, see http://biscicol.blogspot.com/ .]  Is it my responsibility 
to continually trawl the cloud and always be updating all of the many 
shortcut properties that would be possible to assign to the resources 
whose metadata I'm managing?  If I don't do that, then SPARQL queries 
that people would run on "dsw:occurrenceHasTaxon" properties would miss 
information that had been added to the cloud by others - it would only 
find out things that I already knew when I created my metadata record 
for the resource I control.  It seems to me that a major point of Linked 
Open Data is that individuals add to the cloud by contributing their 
little bit to it and that Wonderful Things happen when people find out 
stuff by connecting those bits with other bits contributed by other 
people at another place in the cloud.  If we create a system that only 
works when people are expected to know in advance what those Wonderful 
Things are, then the whole exercise becomes pointless. 
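
The stale-shortcut problem can be sketched in Python as follows.  The
"dsw:occurrenceHasTaxon" property is the hypothetical one from the
scenario above; the other property names (ex:occurrenceOf,
ex:hasIdentification, ex:toTaxon) and all of the instance names are
invented for this example and are not taken from DSW.

```python
# What I published: the normalized links plus one hand-filled shortcut value.
MINE = {
    ("ex:occ", "ex:occurrenceOf", "ex:tree"),
    ("ex:tree", "ex:hasIdentification", "ex:id_original"),
    ("ex:id_original", "ex:toTaxon", "ex:taxon_original"),
    ("ex:occ", "dsw:occurrenceHasTaxon", "ex:taxon_original"),  # shortcut
}

# What the expert published later, elsewhere, without my knowledge.
THEIRS = {
    ("ex:tree", "ex:hasIdentification", "ex:id_expert"),
    ("ex:id_expert", "ex:toTaxon", "ex:taxon_revised"),
}


def step(triples, nodes, predicate):
    """Return all objects reachable from any node in `nodes` via `predicate`."""
    return {o for s, p, o in triples if p == predicate and s in nodes}


def follow_path(triples, start, path):
    """Apply each predicate in `path` in turn, starting from one resource."""
    nodes = {start}
    for predicate in path:
        nodes = step(triples, nodes, predicate)
    return nodes


cloud = MINE | THEIRS  # the merged graph a consumer would see

by_shortcut = follow_path(cloud, "ex:occ", ["dsw:occurrenceHasTaxon"])
by_traversal = follow_path(
    cloud, "ex:occ", ["ex:occurrenceOf", "ex:hasIdentification", "ex:toTaxon"]
)
print(sorted(by_shortcut))
print(sorted(by_traversal))
```

In this toy graph the shortcut still reports only the taxon I knew about
when I wrote my record, while the traversal over the merged triples also
picks up the expert's later Identification.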

Anyway, I hope that this explains to some extent one of the reasons why 
Cam and I created DSW rather than just jumping in and using the 
taxonconcept.org ontology.  We wanted something considerably simpler.

I was going to comment/ask about at least one more thing about the 
ontology at taxonconcept.org, but this email is already way too long, so 
I'll take that up in a subsequent email.

Steve

Peter DeVries wrote:
> I am still somewhat puzzled why TDWG seems so opposed to adopting 
> anything that comes from outside a small clique?
>
> I was thinking that it would be best to create a separate class that 
> can be used for populations of a species.
>
> This would require adding an additional tag to the TaxonConcept 
> Species Concept Model, which currently includes several tags like entities
>
> http://lod.taxonconcept.org/ses/mCcSp#Species <- The Species Concept 
> for the Cougar
>
> See http://lod.taxonconcept.org/ses/v6n7p.html HTML
>        http://lod.taxonconcept.org/ses/v6n7p.rdf  RDF
>       
>  http://lsd.taxonconcept.org/describe/?url=http%3A%2F%2Flod.taxonconcept.org%2Fses%2Fv6n7p%23Species 
> Knowledge Base View (http://bit.ly/gMFqR1)
>  
> The model mints URI's for the following related entities. See RDF. or 
> KB View
>
> http://lod.taxonconcept.org/ses/mCcSp#Image      - An image of a Cougar
> http://lod.taxonconcept.org/ses/mCcSp#Occurrence - An occurrence of a 
> Cougar
> http://lod.taxonconcept.org/ses/mCcSp#Individual - An individual Cougar
> http://lod.taxonconcept.org/ses/mCcSp#Taxonomy   - A Basic Taxonomy 
> for the Cougar, one alternative among many potential classifications
> http://lod.taxonconcept.org/ses/mCcSp#NCBI_Taxonomy - The NCBI 
> Taxonomy for Cougar, or starting at the lowest available clade
> http://lod.taxonconcept.org/ses/mCcSp#OriginalDescription - The 
> Original Description of the Cougar, ideally with links to the PDF or 
> BHL URI.
>     
>     
> Here is how a subset of these would relate to the new #Population Tag 
> and related semantic entities.
>
>
> This tag is used for an individual organism that is an instance of 
> the species concept.
> This allows you to refer to an individual cougar in a way that is 
> separate from the concept of cougar and retains links to other data 
> relating to that species concept.
>
>
>   <txn:SpeciesIndividualTag 
> rdf:about="http://lod.taxonconcept.org/ses/v6n7p#Individual">
>     <dcterms:title>A Tag for individuals of the species concept Puma 
> concolor se:v6n7p</dcterms:title>
>     <skos:prefLabel>A Tag-like resource that is used to label 
> individuals of the species concept Puma concolor se:v6n7p</skos:prefLabel>
>     
> <dcterms:identifier>http://lod.taxonconcept.org/ses/v6n7p#Individual</dcterms:identifier>
>     <dcterms:description>A lightweight tag that can be used to label 
> individuals of this species. These allow individual organisms to be 
> modeled as instances of SpeciesIndividualTag</dcterms:description>
>     <dcterms:isPartOf 
> rdf:resource="http://lod.taxonconcept.org/ses/v6n7p#Species"/>
>     <wdrs:describedby 
> rdf:resource="http://lod.taxonconcept.org/ses/v6n7p.rdf"/>
>   </txn:SpeciesIndividualTag>
>
> Add a tag for a species population to the species concept RDF.
> This allows you to refer to a population of cougars in a way that is 
> separate from an individual cougar and retains links to other data 
> relating to that species concept.
>
>   <txn:SpeciesPopulationTag 
> rdf:about="http://lod.taxonconcept.org/ses/v6n7p#Population">
>     <dcterms:title>A Tag for populations of the species concept Puma 
> concolor se:v6n7p</dcterms:title>
>     <skos:prefLabel>A Tag-like resource that is used to label 
> populations of the species concept Puma concolor se:v6n7p</skos:prefLabel>
>     
> <dcterms:identifier>http://lod.taxonconcept.org/ses/v6n7p#Population</dcterms:identifier>
>     <dcterms:description>A lightweight tag that can be used to label 
> populations of this species. These allow populations of a species to 
> be modeled as instances of SpeciesPopulationTag</dcterms:description>
>     <dcterms:isPartOf 
> rdf:resource="http://lod.taxonconcept.org/ses/v6n7p#Species"/>
>     <wdrs:describedby 
> rdf:resource="http://lod.taxonconcept.org/ses/v6n7p.rdf"/>
>   </txn:SpeciesPopulationTag>
>
>
> This is the RDF for a population; it has as one of its parts an 
> individual organism.
> It is typed to indicate that it refers to a population of Cougars.
>
>   <owl:Class 
> rdf:about="http://lod.taxonconcept.org/pops/NorthAmericanCougarPopulation">
>     <rdf:type 
> rdf:resource="http://lod.taxonconcept.org/ses/v6n7p#Population"/>
>     <skos:prefLabel>The population of North American Cougars Puma 
> concolor se:v6n7 </skos:prefLabel>
>     <dcterms:hasPart 
> rdf:resource="http://ocs.taxonconcept.org/ocs/51cd124d-78c5-40aa-a7ff-2e3f58ca6ade#Individual"/>
>     <wdrs:describedby 
> rdf:resource="http://lod.taxonconcept.org/pops/NorthAmericanCougarPopulation.rdf"/>
>   </owl:Class>
>
> Respectfully,
>
> - Pete
>
> -------------------------------------------------------------------------------------
>
> Pete DeVries
>
> Department of Entomology
>
> University of Wisconsin - Madison
>
> 445 Russell Laboratories
>
> 1630 Linden Drive
>
> Madison, WI 53706
>
> Email: pdevries at wisc.edu <mailto:pdevries at wisc.edu>
>
> TaxonConcept  <http://www.taxonconcept.org/> &  GeoSpecies 
> <http://lod.geospecies.org/> Knowledge Bases
>
> A Semantic Web, Linked Open Data  <http://linkeddata.org/> Project
>
> ---------------------------------------------------------------------------------------
>
>

-- 
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN  37235-1634,  U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582,  fax: (615) 343-6707
http://bioimages.vanderbilt.edu
