Hi,
More comments below ...
On Mon, Feb 21, 2011 at 1:51 PM, joel sachs jsachs@csee.umbc.edu wrote:
Shawn,
I'm not sure if we're agreeing. Comments below ...
On Sat, 19 Feb 2011, Shawn Bowers wrote:
Hi,
Within the database community, "schema-first" refers to having to fix a data structure (like the attributes in a relational table) before adding data, whereas "schema-last" refers to being able to add schema after data have been added.
Even in an rdbms, you can add schema after data, for example with "Alter Table".
While this is true, it is still "schema-first", since you can't add data to a table without there already being a column to store it in!
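To make the contrast concrete, here is a minimal sqlite3 sketch (the table and column names are invented for illustration): even with ALTER TABLE available, a column has to exist before any row can carry a value for it.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Schema first: the column set is fixed before any data goes in.
    cur.execute("CREATE TABLE specimen (id INTEGER PRIMARY KEY, taxon TEXT)")
    cur.execute("INSERT INTO specimen (taxon) VALUES ('Quercus alba')")

    # This fails: there is no 'leaf_colour' column yet.
    try:
        cur.execute("INSERT INTO specimen (taxon, leaf_colour) "
                    "VALUES ('Acer rubrum', 'yellow')")
    except sqlite3.OperationalError as e:
        print("rejected:", e)

    # ALTER TABLE lets the schema evolve -- but always ahead of the data
    # that needs it.
    cur.execute("ALTER TABLE specimen ADD COLUMN leaf_colour TEXT")
    cur.execute("INSERT INTO specimen (taxon, leaf_colour) "
                "VALUES ('Acer rubrum', 'yellow')")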
RDF is different in that i) the schema can be distributed, and ii) the schema definition and data definition languages are the same. So it is literally true that "schema is data too". So, to the extent that it makes sense to use a term like schema-last, it seems reasonable to apply it to practices rather than languages.
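The same exercise in RDF shows the difference. A minimal rdflib sketch, with an invented example.org vocabulary: the instance triples go in first, and the "schema" arrives later as ordinary triples in the same graph.

    from rdflib import Graph, Literal, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/")
    g = Graph()

    # Data first: a fact about a leaf, with no class or property
    # declared anywhere.
    g.add((EX.leaf42, EX.colour, Literal("yellow")))

    # Schema last: the vocabulary itself is just more triples.
    g.add((EX.colour, RDF.type, RDF.Property))
    g.add((EX.leaf42, RDF.type, EX.Leaf))
    g.add((EX.Leaf, RDF.type, RDFS.Class))

Nothing distinguishes the "schema" triples from the "data" triples except our reading of them, which is the sense in which schema is data too.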
I agree that it makes sense to apply the idea to practices (although my impression is that, within the database community, the phrase primarily refers to data models).
I think "type" or "semantics" is a better term here than "schema" (which to me implies data structure/storage structure, but usually only minimal constraints). So, e.g., "semantics-later" versus "semantics-first".
Could you point me to endpoints where I can query data via the OBOE ontology?
Nothing that is publicly available at this time. We have created a couple of prototypes for querying datasets through the ontology: one is ObsDB (a recent paper on this system, published at eScience 2010, can be found here: http://www.cs.gonzaga.edu/~bowers/papers/escience-2010.pdf), and another evaluates different query algorithms for efficiently answering similar queries over large repositories such as the KNB. There is also a web-based query UI for querying annotated datasets, but it doesn't yet expose the datasets as instance data. We're working on these tools now within the Semtools and SONet projects.
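For a flavour of what such a query might look like, here is a hedged rdflib/SPARQL sketch. The filename and data are hypothetical, and the prefix and property names follow OBOE's published Observation/Measurement pattern but should be checked against the released OBOE OWL file:

    from rdflib import Graph

    g = Graph()
    g.parse("annotated-dataset.rdf")  # hypothetical OBOE-annotated instance data

    # Observation -> Measurement -> value, per OBOE's core pattern.
    q = """
    PREFIX oboe: <http://ecoinformatics.org/oboe/oboe.1.0/oboe-core.owl#>
    SELECT ?obs ?val WHERE {
        ?obs a oboe:Observation ;
             oboe:hasMeasurement ?m .
        ?m oboe:hasValue ?val .
    }
    """
    for obs, val in g.query(q):
        print(obs, val)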
For example, we could require that all instance data be validated with an ontology, and not have a mechanism for updating the ontology in response to the frustrations of our users.
This statement seems contrary to the use of the OWL framework ...
It's contrary to common sense, but compatible with OWL. If I'm exaggerating about ontologies not being responsive to the frustrations of their users, it's because most ontologies don't have users. I'll check Swoogle for some statistics to back that up, but does anyone really dispute it?
I think an obvious counterexample is GO and its associated ontologies, which seem to be heavily used for a number of different purposes.
Shawn
Joel.
Shawn
On Sat, Feb 19, 2011 at 6:32 PM, joel sachs jsachs@csee.umbc.edu wrote:
Hi Shawn,
On Thu, 17 Feb 2011, Shawn Bowers wrote:
Hi Joel,
I think the OWL model in general is "schema-last".
A good point, although I would phrase it differently and say that rdf, in general, is highly compatible with schema-last. But it's also compatible with decidedly non-schema-last practices. For example, we could require that all instance data be validated with an ontology, and not have a mechanism for updating the ontology in response to the frustrations of our users.
Anyway, I'd be happy to stop using the phrase, and instead talk about specifics of what our ontologies should look like, and where they should come from.
Regards, Joel.
In particular, the only fixed "schema" is the triple model (subject, predicate, object), and one can add and remove triples as needed. I don't think OBOE or EQ (or any other OWL ontology) is any more schema-first or schema-last than another, since they are all based on OWL/RDF. By contrast, a particular dataset (with specific attributes) is a typical example of "schema-first": before I can store data rows, I have to define the attributes (and this would be true in, e.g., Darwin Core).

In both OBOE and EQ, one could have a set of triples, and then come along later at any time and add triples that give type information to existing individuals, and so on. Both OBOE and EQ do introduce classes that prescribe how to structure new classes and type individuals -- but given this, it would be really hard to say one is more "schema-last" than the other because of these basic upper-level classes.
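For example (an rdflib sketch with invented names; none of these terms come from the actual OBOE or EQ vocabularies), the typing triples can arrive arbitrarily later than the individuals they describe:

    from rdflib import Graph, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/")
    g = Graph()

    # An individual asserted with no type information at all.
    g.add((EX.obs7, EX.hasValue, EX.green))

    # Later -- possibly much later -- type the same individual and hang
    # its class under an upper-level class, OBOE/EQ-style.
    g.add((EX.obs7, RDF.type, EX.ColourObservation))
    g.add((EX.ColourObservation, RDFS.subClassOf, EX.Observation))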
Shawn
On Thu, Feb 17, 2011 at 11:28 AM, joel sachs jsachs@csee.umbc.edu wrote:
Hilmar,
I guess I'm now guilty of conflating concepts myself, namely "instance-data generation as an integral component of the ontology development spiral", and "schema last". They're distinct, but related in the sense that the latter can be seen as an extreme case of the former. Separating them:
Instance Data. Do you (or does anyone else on the list) know the status of OBD? From the NCBO FAQ:
What is OBD? OBD is a database for storing data typed using OBO ontologies
Where is it? In development!
Is there a demo? See http://www.fruitfly.org/~cjm/obd
Datasets? See the above URL for now
But the demo link is broken, and it's hard to find information on OBD that isn't a few years old. Is it still the plan to integrate OBD into BioPortal? If not, then maybe the "Missing Functionality [of BioPortal]" section of the KOS report should include a subsection about providing access to instance data. Considering GBIF's data holdings, it would be a shame not to integrate data browsing into any ontology-browsing infrastructure that GBIF provides.
Schema Last. I think schema-last is a malleable enough buzzword that we can hijack it slightly, and I've been wondering about what it should mean in the context of TDWG ontologies. Some ontology paradigms are inherently more schema-last-ish than others. For example, EQ strikes me as more schema-last-ish than OBOE or Prometheus. Extending an example from the Fall, EQ gives:
fruit - green
bark - brown
leaves - yellow
leaves - ridged
leaves - broad
and OBOE gives
fruit - colour - green
bark - colour - brown
leaves - colour - yellow
leaves - perimeter texture - ridged
leaves - basic shape - broad
So in the OBOE case, the characteristics (colour, perimeter texture, basic shape) are given a priori, while in the EQ case they would (presumably) be abstracted during subsequent ontology development. In theory, the two approaches may be isomorphic, since, presumably, the OBOE characteristics are also abstracted from examples collected as part of the requirements-gathering process. In practice, though, I suspect that EQ leaves more scope for instance-informed schemas. I have no basis for this suspicion other than intuition, and would welcome any evidence or references that anyone can provide.
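One way to see the structural difference is to write both styles down as triples. This is only a caricature -- every name below is invented for illustration, not drawn from the real EQ or OBOE vocabularies -- but it shows where the a-priori characteristic sits:

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()

    # EQ-style: the quality attaches directly to the entity; no
    # characteristic is named up front.
    g.add((EX.leaf1, EX.hasQuality, EX.yellow))

    # OBOE-style: the characteristic (colour) is an explicit node, fixed
    # a priori, and the measurement hangs off it.
    g.add((EX.obs1, EX.ofEntity, EX.leaf1))
    g.add((EX.obs1, EX.hasMeasurement, EX.m1))
    g.add((EX.m1, EX.ofCharacteristic, EX.Colour))
    g.add((EX.m1, EX.hasValue, Literal("yellow")))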
Also, schema-last could perhaps be a guiding philosophy as we seek to put in place a mechanism for facilitating ontology update and evolution. For example, it might be worth experimenting with tag-driven ontology evolution, as in [1], where tags are associated with concepts in an ontology. If a tag can't be mapped into the ontology, the ontology engineer takes this as a clue that the ontology needs revision. So the domain expert/knowledge engineer partnership is preserved, but with the individual domain expert replaced by the collective wisdom of the community. Passant's focus was information retrieval, where the only reasoning is using subsumption hierarchies to expand the scope of a query, but the principle should apply to other reasoning tasks as well. The example in my mind is using a DL representation of SDD as the basis for polyclave keys: when users enter terms not in the ontology, it would trigger a process that could lead to ontology update.
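A toy version of that trigger might look as follows. This is a sketch under assumptions: the ontology filename is hypothetical, and matching user terms against rdfs:label is just one possible mapping strategy.

    from rdflib import Graph
    from rdflib.namespace import RDFS

    def unmapped_tags(tags, ontology):
        # Tags with no rdfs:label match anywhere in the ontology are the
        # candidates to queue for the ontology engineer.
        labels = {str(o).lower() for o in ontology.objects(None, RDFS.label)}
        return [t for t in tags if t.lower() not in labels]

    onto = Graph()
    onto.parse("sdd-ontology.owl", format="xml")  # hypothetical DL rendering of SDD
    for tag in unmapped_tags(["ridged", "pubescent", "glaucous"], onto):
        print("revision candidate:", tag)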
I don't dispute the importance of involving individual domain experts, especially at the beginning, but also throughout the process. And I agree that catalyzing this process is, indeed, a job for TDWG.
Joel.
On Tue, 15 Feb 2011, Hilmar Lapp wrote:
Hi Joel -
I'm in full agreement re: importance of generating instance data as driving principle in developing an ontology. This is the case indeed in all the OBO Foundry ontologies I'm familiar with, in the form of data curation needs driving ontology development. Which is perhaps my bias as to why I treat this as implicit.
That being said, it has also been found that in specific subject areas progress can be made fastest if you convene a small group of domain experts and simply model the knowledge about those subject areas, rather than doing so piecemeal in response to data curation needs.
BTW I don't think Freebase is a good example here. I don't think the model of intense centralized data and vocabulary curation that it employs is tenable within our domain, and I have a hard time imagining how schema-last would not result in an incoherent data soup otherwise. But then perhaps I just don't understand what you mean by schema-last.
-hilmar
On Feb 15, 2011, at 8:24 PM, joel sachs jsachs@csee.umbc.edu wrote:
> Hi Hilmar,
>
> No argument from me, just my prejudice against "solution via ontology", and my enthusiasm for "schema-last" - the idea that the schema reveals itself after you've populated the knowledge base. This was never really possible with relational databases, where a table must be defined before it can be populated. But graph databases (especially the "anyone can say anything" semantic web) practically invite a degree of schema-last. Examples include Freebase (schema-last by design), and FOAF, whose specification is so widely ignored and mis-used (often to good effect) that the de-facto spec is the one that can be abstracted from FOAF files in the wild.
>
> The semantic web is littered with ontologies lacking instance data; my hope is that generating instance data is a significant part of the ontology-building process for each of the ontologies proposed by the report. By "generating instance data" I mean not simply marking up a few example records, but generating millions of triples to query over as part of the development cycle. This will indicate both the suitability of the ontology to the use cases, and also its ease of use.
>
> I like the order in which the GBIF report lists its infrastructure recommendations: persistent URIs (the underpinning of everything); followed by competency questions and use cases (very helpful in the prevention of mental masturbation); followed by OWL ontologies (to facilitate reasoning). Perhaps the only place where we differ is that you're comfortable with "incorporate instance data into the ontology design process" being implicit, while I never tire of seeing that point hammered home.
>
> Regards - Joel.
>
> On Mon, 14 Feb 2011, Hilmar Lapp wrote:
>
>> On Feb 14, 2011, at 12:05 PM, joel sachs wrote:
>>
>>> I think the recommendations are heavy on building ontologies, and light on suggesting paths to linked data representations of instance data.
>>
>> Good observation. I can't speak for all of the authors, but in my experience building Linked Data representations is mostly a technical problem, and thus much easier compared to building soundly engineered, commonly agreed-upon ontologies with deep domain knowledge capture. The latter is hard, because it requires overcoming a lot of social challenges.
>>
>> As for the GBIF report, personally I think linked biodiversity data representations will come at about the same pace whether or not GBIF pushes on that front (though GBIF can help make those representations better by provisioning stable resolvable identifier services, URIs etc). There is a unique opportunity though for "neutral" organizations such as GBIF (or, in fact, TDWG) to significantly accelerate the development of sound ontologies by catalyzing the community engagement, coherence, and discourse that is necessary for them.
>>
>> -hilmar
>> --
>> ===========================================================
>> : Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
>> ===========================================================