[tdwg-content] Schema-last and crazy: correlated?

20 Feb 2011

Hilmar,

Schema-last, to me, is an attitude of holding back (sometimes forever) before
i) restricting the vocabulary available to users; and/or
ii) defining a semantics that draws inferences way beyond a user's 
assertions.

I think this attitude can apply not only to the terms of an ontology, but 
to the general shape and style of the ontology, and I am concerned about 
GBIF/TDWG assuming that its ontologies should be DL in flavour. By DL, I 
mean more than whether an ontology is technically within the OWL-DL profile. I 
mean the general approach of building classifiers, which, traditionally, 
has been the goal of description logics. So, by DL in flavour, I mean 
making heavy use of domain and range restrictions, functional and 
inverseFunctional properties, class definition via property 
restriction, etc. This DL-based approach seems to be working in genomics.

Will it work in biodiversity informatics? One cause for concern is that 
the current Darwin Core, which is simple, is widely misunderstood and 
intimidates many. It is possible that the problem will be solved with 
tighter restriction and more formalisms. But I'm skeptical.

Even if we are able, through the laborious process of doing things 
a certain way, to build classifiers for biodiversity informatics 
artifacts (occurrence records, evidence, identifications, etc.) in the 
same way that we can build them for actual objects of biology (genes, 
taxa, etc.), why would we want to? The natural world comes without labels, 
so it's helpful to be able to synthesize everything that we know about 
something to determine what it is. But human-made information artifacts 
are typically labeled, or have their types implied by context.

I'm currently arguing with someone off-list about what I think is my 
minimal example, one that I hope everyone can agree on. It's about domain 
constraints on "hasIdentification". If I say

"http://fu.bar hasIdentification rabbit",

should we, as a community, interpret that to mean that http://fu.bar is an 
individualOrganism (as opposed to, say, a picture)? Must I, as a guy who 
likes to make assertions, be told either

a. that I need additional vocabulary terms: pictureHasIdentification, 
occurrenceHasIdentification, individualHasIdentification, etc.
or
b. that I need to limit hasIdentification to describing a single type of 
thing?

If you can convince me of either (a) or (b) above, then I'll be inclined 
to accept your entire vision for the semantic web.
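
To make the stakes concrete, here is a minimal Python sketch of what a 
domain constraint on hasIdentification would entail under the standard RDFS 
domain rule. The class names (IndividualOrganism, Picture) and the DOMAIN 
table are hypothetical, chosen only for illustration; nothing here is an 
actual Darwin Core or TDWG definition.

```python
# Hypothetical domain declaration: hasIdentification rdfs:domain IndividualOrganism.
DOMAIN = {"hasIdentification": "IndividualOrganism"}


def infer_types(triples, domain):
    """RDFS domain rule: (s p o) plus (p rdfs:domain C) entails (s rdf:type C)."""
    inferred = set()
    for s, p, o in triples:
        if p in domain:
            inferred.add((s, "rdf:type", domain[p]))
    return inferred


# What I actually asserted: fu.bar is a picture, identified as a rabbit.
assertions = {
    ("http://fu.bar", "rdf:type", "Picture"),
    ("http://fu.bar", "hasIdentification", "rabbit"),
}

# With the domain constraint, the picture is also typed as an organism.
with_domain = infer_types(assertions, DOMAIN)

# Schema-last: no domain declared, so nothing is inferred beyond my assertions.
without_domain = infer_types(assertions, {})
```

Note that the rule does not reject my assertion. It quietly concludes that 
http://fu.bar is also an IndividualOrganism, and only an additional 
disjointness axiom between the two classes would flag a conflict.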

A few more comments, in-line, below ...

On Thu, 17 Feb 2011, Hilmar Lapp wrote:
...
On Feb 17, 2011, at 3:23 PM, Shawn Bowers wrote:
...
Both OBOE and EQ do introduce classes that prescribe how to structure new 
classes and type individuals
That's actually not quite true. The EQ model itself doesn't prescribe any 
new classes or the types that individuals must be of; instead it simply says 
that a phenotype instance can be expressed as some instance of a quality Q 
that inheres_in some instance of an entity E, and thus a class of phenotypes 
(or observations of an organism's characteristics) is the intersection of 
all instances of Q (a subclass restriction), and all things that inhere_in E 
(a property restriction).
While typically we will draw Q and E from certain ontologies (such as PATO 
for qualities), you can designate any class (term) in those places, and the 
class expression by itself will not support inferences about the nature of Q 
or E or their instances (the ontologies that Q and E are drawn from do 
that). The class expression itself is often anonymous, but there are 
(so-called "pre-composed") ontologies that identify and label them.
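
As a rough illustration of the class expression described above, the 
following Python sketch computes the members of an EQ class as the 
intersection of "instances of Q" and "things that inhere_in some E". The 
instance data and the class names (Length, Color, Petiole, Leaf) are made 
up for the example, and subclass reasoning over the Q and E ontologies is 
deliberately left out.

```python
# Hypothetical instance data: asserted types, and what each quality inheres in.
types = {
    "q1": {"Length"},
    "q2": {"Length"},
    "q3": {"Color"},
    "petiole1": {"Petiole"},
    "leaf1": {"Leaf"},
}
inheres_in = {"q1": "petiole1", "q2": "leaf1", "q3": "petiole1"}


def eq_members(quality, entity):
    """Members of the EQ class: instances of `quality` that inhere in some `entity`."""
    # Subclass restriction: everything typed as the quality Q.
    of_quality = {i for i, ts in types.items() if quality in ts}
    # Property restriction: everything that inheres_in an instance of E.
    in_entity = {
        i for i, bearer in inheres_in.items()
        if entity in types.get(bearer, set())
    }
    return of_quality & in_entity


petiole_length = eq_members("Length", "Petiole")
```

Any class name can be passed for `quality` or `entity`; the expression 
itself says nothing about what kinds of things they are.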
That being said, while EQ in principle allows you to do real crazy things if 
you want to (which perhaps is what Joel means by schema-last?), if you want 
to be able to do discovery and reasoning with a set of EQ class expressions 
from different sources, they will need to follow some shared conventions, 
such as not simply making up quality and entity terms as needed, but drawing 
them from PATO and shared entity ontologies.
Conversely, OBOE does prescribe the nature of the things that it relates to 
each other in the model, the cardinality of those relationships, and what it 
means for an instance if it has such a relationship. For example, if I 
assert o oboe:ofEntity e, the semantics of oboe:ofEntity prescribe that o is 
an instance of oboe:Observation, e is an instance of oboe:Entity, and if I 
also assert o oboe:ofEntity e1, it prescribes that e and e1 are identical, 
i.e., the same instance.
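
The entailments just described can be sketched in a few lines of Python. 
This assumes, following the description above rather than the OBOE 
ontology file itself, that oboe:ofEntity has domain oboe:Observation, range 
oboe:Entity, and is functional; the individuals o, e and e1 are 
placeholders.

```python
def of_entity_entailments(triples):
    """Derive type and identity facts from oboe:ofEntity assertions.

    Assumed axioms: domain oboe:Observation, range oboe:Entity, functional.
    """
    types, same_as, first_value = set(), set(), {}
    for s, p, o in sorted(triples):
        if p != "oboe:ofEntity":
            continue
        # Domain and range: the subject is an Observation, the object an Entity.
        types.add((s, "rdf:type", "oboe:Observation"))
        types.add((o, "rdf:type", "oboe:Entity"))
        # Functionality: a second value for the same subject must denote
        # the same individual as the first.
        known = first_value.setdefault(s, o)
        if known != o:
            same_as.add(frozenset((known, o)))
    return types, same_as


assertions = {("o", "oboe:ofEntity", "e"), ("o", "oboe:ofEntity", "e1")}
types, same_as = of_entity_entailments(assertions)
```

Two assertions go in; three type facts and one identity fact come out.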
I think these differences are a result of how they were motivated, and it is 
interesting to me that Joel would pick these as examples for illustrating 
"schema-lastishness".
An example of why I see EQ being more schema-last than OBOE is the 
question you recently forwarded to the Observations list: How do you 
represent "petiole 5x longer than wide"?

In EQ, you could say something like:
<5:1 length to width ratio> <inheres_in> <petiole>
and then wait for some more examples of ratios to come in, before deciding 
how to update your Quality ontology to handle ratios.
In OBOE (please correct me if I'm wrong), it seems (to me) that you need 
to make more of an ontological commitment to express the same thing.

(Also, could you please direct me to sources of OBOE instance data? A 
quick search of TDWG-Observation, SONet, Google, and Swoogle only turned up the 
ontology itself, and a few examples of the "how do you do this in OBOE" 
variety.)
...
OBOE was motivated by having a unified data model for 
observational data, in the interest of better data exchange and integration. 
I think all its class and property constraints are a reflection of that - 
there is a desire not to "allow anything". Conversely, EQ wouldn't make for 
a good model in which to exchange arbitrary observational data - there would 
be no guarantees for what you get. However, it is very powerful for 
reasoning over the semantics of the observations (see the Washington et al 
2009 paper), which is what it was conceived for.
I like the Washington paper a lot. One thing it illustrates to me is the 
power that comes from the judicious use of an appropriate domain ontology 
with which to value simple attributes. One of the most important 
recommendations in the KOS report, IMO, is the one I quoted to Pete: 
"Promote widespread 
adoption of URI-based standard values for key Darwin Core attribute 
values." Constructing appropriate ontologies for these values strikes me 
as a much better way to bring DwC on to the semantic web than recrafting 
DwC as an OWL ontology. (I'm not opposed to the latter, which may serve a 
data validation need, but I don't think it's necessary for typical data 
integration use cases.)
...
On Thu, Feb 17, 2011 at 11:28 AM, joel sachs <jsachs@csee.umbc.edu> wrote:
...
Do you (or does anyone else on the list) know the status of OBD? From the
NCBO FAQ:
Funny you should ask. We're in the final stages of writing up a manuscript 
about it. I can share a preprint with you next week. OBD is what is 
underpinning the Phenoscape Knowledgebase (http://kb.phenoscape.org).
The URL is http://www.berkeleybop.org/obd/. It is still pretty outdated, but 
will be updated very soon.
...
Is it still the plan to integrate OBD into BioPortal?
I don't think so. And there are lots of resources working on that (at least 
in the biomedical domain), so it'd be hard for them to pick what to follow.
...
So in the OBOE case, the characteristics (color, perimeter texture, basic 
shape) are given a priori, while in the EQ case they would (presumably) be 
abstracted during subsequent ontology development.
Yes. They are implied by the subclass structure of PATO (and thus subject to 
change).
...
it might be worth experimenting with tag-driven ontology evolution, as in 
[1], where tags are associated to concepts in an ontology. [...] So the 
domain expert/knowledge engineer partnership is preserved, but with the 
domain expert role being replaced by collective wisdom from the community.
Are you aware of the "Fast, Cheap, and Out of Control" paper from Mark 
Wilkinson's group:
Good et al. 2006. Fast, Cheap and Out of Control: A Zero Curation Model for 
Ontology Development. Pacific Symposium on Biocomputing 11: 128-139.
http://psb.stanford.edu/psb-online/proceedings/psb06/good.pdf
Cool, thanks. Looks like what they're describing is, essentially, the 
first VoCamp.

Joel.
...
-hilmar
-- 
===========================================================
: Hilmar Lapp  -:- Durham, NC -:- informatics.nescent.org :
===========================================================