Hilmar,
I guess I'm now guilty of conflating concepts myself, namely "instance-data generation as an integral component of the ontology development spiral", and "schema last". They're distinct, but related in the sense that the latter can be seen as an extreme case of the former. Separating them:
Instance Data. Do you (or does anyone else on the list) know the status of OBD? From the NCBO FAQ:
---
What is OBD? OBD is a database for storing data typed using OBO ontologies.
Where is it? In development!
Is there a demo? See http://www.fruitfly.org/~cjm/obd
Datasets? See the above URL for now.
---
But the demo link is broken, and it's hard to find information on OBD that isn't a few years old. Is it still the plan to integrate OBD into BioPortal? If not, then maybe the "Missing Functionality [of BioPortal]" section of the KOS report should include a subsection about providing access to instance data. Considering GBIF's data holdings, it seems like it would be a shame to not integrate data browsing into any ontology browsing infrastructure that GBIF provides.
Schema Last. I think schema-last is a malleable enough buzzword that we can hijack it slightly, and I've been wondering about what it should mean in the context of TDWG ontologies. Some ontology paradigms are inherently more schema-last-ish than others. For example, EQ strikes me as more schema-last-ish than OBOE or Prometheus. Extending an example from the Fall, EQ gives:
fruit - green
bark - brown
leaves - yellow
leaves - ridged
leaves - broad
and OBOE gives
fruit - colour - green
bark - colour - brown
leaves - colour - yellow
leaves - perimeter texture - ridged
leaves - basic shape - broad
So in the OBOE case, the characteristics (colour, perimeter texture, basic shape) are given a priori, while in the EQ case they would (presumably) be abstracted during subsequent ontology development. In theory, the two approaches may be isomorphic, since, presumably, the OBOE characteristics are also abstracted from examples collected during the requirements gathering process. In practice, though, I suspect that EQ leaves more scope for instance-informed schemas. I have no basis for this suspicion other than intuition, and would welcome any evidence or references that anyone can provide.
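To make the contrast concrete, here is a minimal sketch of the two encodings as plain tuples, plus one way EQ data could feed back into schema design. The tuple shapes are illustrative only, not the actual EQ or OBOE vocabularies:

```python
from collections import defaultdict

# EQ: each statement is just (entity, quality); the characteristic
# ("colour", "perimeter texture", ...) is left implicit, to be
# abstracted later during ontology development.
eq_statements = [
    ("fruit", "green"),
    ("bark", "brown"),
    ("leaves", "yellow"),
    ("leaves", "ridged"),
    ("leaves", "broad"),
]

# OBOE: each statement names its characteristic a priori.
oboe_statements = [
    ("fruit", "colour", "green"),
    ("bark", "colour", "brown"),
    ("leaves", "colour", "yellow"),
    ("leaves", "perimeter texture", "ridged"),
    ("leaves", "basic shape", "broad"),
]

# An "instance-informed schema" step for EQ: cluster the observed
# qualities by entity and hand the clusters to the ontology engineer,
# who then names the characteristics.
qualities_by_entity = defaultdict(list)
for entity, quality in eq_statements:
    qualities_by_entity[entity].append(quality)

print(dict(qualities_by_entity))
# {'fruit': ['green'], 'bark': ['brown'], 'leaves': ['yellow', 'ridged', 'broad']}
```

The point of the sketch is only that the EQ pile defers the characteristic-naming decision until after instance data has accumulated, whereas OBOE forces it at record time.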
Also, schema-last could perhaps be a guiding philosophy as we seek to put in place a mechanism for facilitating ontology update and evolution. For example, it might be worth experimenting with tag-driven ontology evolution, as in [1], where tags are associated with concepts in an ontology. If a tag can't be mapped into the ontology, the ontology engineer takes this as a clue that the ontology needs revision. So the domain expert/knowledge engineer partnership is preserved, but with the domain expert role being played by collective wisdom from the community. Passant's focus was information retrieval, where the only reasoning was using subsumption hierarchies to expand the scope of a query, but the principle should apply to other reasoning tasks as well. The example in my mind is using a DL representation of SDD as the basis for polyclave keys. When users enter terms not in the ontology, it would trigger a process that could lead to ontology update.
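The tag-driven loop is simple enough to sketch. This is a toy after Passant's idea, not his implementation; the concept labels and the exact-match mapping rule are my own placeholders (a real system would match against synonyms, do fuzzy matching, etc.):

```python
# Toy tag-to-concept mapper: unmapped tags are queued for the
# ontology engineer as candidate triggers for ontology revision.

# Hypothetical concept labels; a real system would query the ontology.
ontology_labels = {"leaf", "fruit", "bark", "colour", "shape"}

revision_queue = []

def handle_user_tag(tag):
    """Map a user tag to a concept, or queue it for engineer review."""
    t = tag.strip().lower()
    if t in ontology_labels:
        return t
    # Unmapped tag: a clue that the ontology may need revision.
    revision_queue.append(tag)
    return None

handle_user_tag("fruit")        # maps cleanly
handle_user_tag("pubescence")   # unknown term -> queued for review
print(revision_queue)           # ['pubescence']
```

In the polyclave-key scenario, `handle_user_tag` would sit behind the key's term-entry box, so every failed lookup doubles as a curation signal.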
I don't dispute the importance of involving individual domain experts, especially at the beginning, but also throughout the process. And I agree that catalyzing this process is, indeed, a job for TDWG.
Joel.
1. Passant, http://www.icwsm.org/papers/paper15.html
On Tue, 15 Feb 2011, Hilmar Lapp wrote:
Hi Joel -
I'm in full agreement re: the importance of generating instance data as a driving principle in developing an ontology. This is indeed the case in all the OBO Foundry ontologies I'm familiar with, in the form of data curation needs driving ontology development. Perhaps that bias is why I treat this as implicit.
That being said, it has also been found that in specific subject areas progress can be made fastest if you convene a small group of domain experts and simply model the knowledge about those subject areas, rather than doing so piecemeal in response to data curation needs.
BTW I don't think Freebase is a good example here. I don't think the model of intense centralized data and vocabulary curation that it employs is tenable within our domain, and I have a hard time imagining how schema-last would not result in an incoherent data soup otherwise. But then perhaps I just don't understand what you mean by schema-last.
-hilmar
Sent with a tap.
On Feb 15, 2011, at 8:24 PM, joel sachs jsachs@csee.umbc.edu wrote:
Hi Hilmar,
No argument from me, just my prejudice against "solution via ontology", and my enthusiasm for "schema-last" - the idea that the schema reveals itself after you've populated the knowledge base. This was never really possible with relational databases, where a table must be defined before it can be populated. But graph databases (especially the "anyone can say anything" semantic web) practically invite a degree of schema-last. Examples include Freebase (schema-last by design), and FOAF, whose specification is so widely ignored and mis-used (often to good effect) that the de-facto spec is the one that can be abstracted from FOAF files in the wild.
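"Abstracting the de-facto spec from files in the wild" can itself be sketched mechanically: tally which predicates people actually use across harvested triples and rank them. The triples below are invented, and real FOAF harvesting would of course parse RDF rather than tuples:

```python
from collections import Counter

# Invented stand-ins for triples scraped from FOAF files in the wild.
triples = [
    ("#me",  "foaf:name",   "Alice"),
    ("#me",  "foaf:knows",  "#bob"),
    ("#bob", "foaf:name",   "Bob"),
    ("#bob", "foaf:weblog", "http://example.org/blog"),
    ("#bob", "foaf:knows",  "#me"),
]

# The emergent "schema" is whatever predicates are actually used,
# ranked by frequency of use.
predicate_counts = Counter(p for _, p, _ in triples)
de_facto_schema = [p for p, _ in predicate_counts.most_common()]
print(de_facto_schema)  # ['foaf:name', 'foaf:knows', 'foaf:weblog']
```

Crude as it is, this is the schema-last move in miniature: the vocabulary description follows usage rather than prescribing it.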
The semantic web is littered with ontologies lacking instance data; my hope is that generating instance data is a significant part of the ontology building process for each of the ontologies proposed by the report. By "generating instance data" I mean not simply marking up a few example records, but generating millions of triples to query over as part of the development cycle. This will indicate both the suitability of the ontology to the use cases, and also its ease of use.
I like the order in which the GBIF report lists its infrastructure recommendations: persistent URIs (the underpinning of everything); followed by competency questions and use cases (very helpful in keeping the work grounded); followed by OWL ontologies (to facilitate reasoning). Perhaps the only place where we differ is that you're comfortable with "incorporate instance data into the ontology design process" being implicit, while I never tire of seeing that point hammered home.
Regards - Joel.
On Mon, 14 Feb 2011, Hilmar Lapp wrote:
On Feb 14, 2011, at 12:05 PM, joel sachs wrote:
I think the recommendations are heavy on building ontologies, and light on suggesting paths to linked data representations of instance data.
Good observation. I can't speak for all of the authors, but in my experience building Linked Data representations is mostly a technical problem, and thus much easier compared to building soundly engineered, commonly agreed upon ontologies with deep domain knowledge capture. The latter is hard, because it requires overcoming a lot of social challenges.
As for the GBIF report, personally I think linked biodiversity data representations will come at about the same pace whether or not GBIF pushes on that front (though GBIF can help make those representations better by provisioning stable resolvable identifier services, URIs etc). There is a unique opportunity though for "neutral" organizations such as GBIF (or, in fact, TDWG), to significantly accelerate the development of sound ontologies by catalyzing the community engagement, coherence, and discourse that is necessary for them.
-hilmar
=========================================================== : Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org : ===========================================================