[tdwg-content] More schema-last (was Monkey Business)

Shawn Bowers bowers at gonzaga.edu
Mon Feb 21 23:46:23 CET 2011


Hi,

More comments below ...

On Mon, Feb 21, 2011 at 1:51 PM, joel sachs <jsachs at csee.umbc.edu> wrote:
> Shawn,
>
> I'm not sure if we're agreeing. Comments below ...
>
> On Sat, 19 Feb 2011, Shawn Bowers wrote:
>
>> Hi,
>>
>> Within the database community, schema first refers to having to fix a
>> data structure (like the attributes in a relational table) before
>> adding data, whereas schema last refers to being able to add schema
>> after data have been added.
>
> Even in an rdbms, you can add schema after data, for example with "Alter
> Table".

While this is true, it is still "schema-first" since you can't add
some data to a table without there already being a column in the table
to store the data into!
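
To make the point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (specimen, taxon, leaf_colour) are made up for illustration and do not come from Darwin Core or any real schema. The insert that mentions an undeclared column is rejected, and it only succeeds once "ALTER TABLE" has added the column, i.e., the schema change still has to come before the data.

```python
import sqlite3


def demo_schema_first():
    """Show that a relational table rejects data for a column it does not have yet."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table, for illustration only.
    conn.execute("CREATE TABLE specimen (id INTEGER PRIMARY KEY, taxon TEXT)")
    conn.execute("INSERT INTO specimen (taxon) VALUES (?)", ("Quercus alba",))

    # Try to store a value for a column that has not been declared.
    try:
        conn.execute(
            "INSERT INTO specimen (taxon, leaf_colour) VALUES (?, ?)",
            ("Acer rubrum", "yellow"),
        )
        rejected = False
    except sqlite3.OperationalError:
        # SQLite refuses the statement because the column is unknown.
        rejected = True

    # The schema has to be extended first; only then does the same insert work.
    conn.execute("ALTER TABLE specimen ADD COLUMN leaf_colour TEXT")
    conn.execute(
        "INSERT INTO specimen (taxon, leaf_colour) VALUES (?, ?)",
        ("Acer rubrum", "yellow"),
    )

    rows = conn.execute(
        "SELECT taxon, leaf_colour FROM specimen ORDER BY id"
    ).fetchall()
    conn.close()
    return rejected, rows


if __name__ == "__main__":
    print(demo_schema_first())
```

Note that the row stored before the "ALTER TABLE" simply gets NULL for the new column, so the structure is imposed on old data as well.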

> RDF is different in that i) the schema can be distributed, and ii)
> the schema definition and data definition languages are the same. So it is
> literally true that "schema is data too". So, to the extent that it makes
> sense to use a term like schema-last, it seems reasonable to apply it to
> practices rather than languages.

I agree that it makes sense to apply the ideas to practices
(although my impression is that the phrase is primarily about data
models within the database community).

I think "type" or "semantics" is a better term here than "schema"
(which to me implies data structure/storage structure, but usually
only minimal constraints).  So, e.g., "semantics-later" versus
"semantics-first".
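
As a rough illustration of what I mean by "semantics-later", here is a small sketch in plain Python that treats a graph as nothing more than a set of (subject, predicate, object) triples. It is not OBOE, EQ, or any real RDF library; the "ex:" identifiers are invented for the example, and "rdf:type" and "rdfs:subClassOf" are used only as plain strings. The individual is described first, and the type information is added afterwards without touching the existing triples.

```python
RDF_TYPE = "rdf:type"


def types_of(triples, subject):
    """Return the set of types asserted for a subject in a set of triples."""
    return {o for (s, p, o) in triples if s == subject and p == RDF_TYPE}


def build_graph():
    """Add data first, then add type ("semantics") triples later."""
    # Step 1: plain data about an individual, with no type information.
    # All ex: names are hypothetical.
    triples = {
        ("ex:leaf1", "ex:colour", "yellow"),
        ("ex:leaf1", "ex:shape", "broad"),
    }
    types_before = types_of(triples, "ex:leaf1")

    # Step 2: later, give the existing individual a type and place that
    # type in a class hierarchy. The earlier triples are left unchanged.
    triples.add(("ex:leaf1", RDF_TYPE, "ex:Leaf"))
    triples.add(("ex:Leaf", "rdfs:subClassOf", "ex:PlantPart"))
    return types_before, triples


if __name__ == "__main__":
    before, graph = build_graph()
    print(before)
    print(types_of(graph, "ex:leaf1"))
```

The only fixed structure here is the triple itself, which is why adding the type later needs no change to what was already stored.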

> Could you point me to endpoints where I can query data via the OBOE
> ontology?

Nothing that is publicly available at this time. We have created a
couple of prototypes of querying datasets through the ontology, one of
which is ObsDB (a recent paper published at e-science 2010 on this
system can be found here:
http://www.cs.gonzaga.edu/~bowers/papers/escience-2010.pdf) and
another for evaluating different query algorithms for efficiently
answering similar queries over large repositories such as the KNB.
There is also a web-based query UI for querying annotated datasets,
but it doesn't yet expose the datasets as instance data.  We're
working on these tools now within the Semtools and SONet projects.

>>> For example, we could require that
>>> all instance data be validated with an ontology, and not have a mechanism
>>> for updating the ontology in response to the frustrations of our users.
>>
>> This statement seems contrary to the use of the OWL framework ...
>
> It's contrary to common sense, but compatible with OWL. If I'm exaggerating
> about ontologies not being responsive to the frustrations of their users,
> it's because most ontologies don't have users. I'll check Swoogle for some
> statistics to back that up, but does anyone really dispute it?

I think an obvious counterexample is GO and its associated ontologies,
which seem to be heavily used for a number of different purposes.

Shawn

>
>
> Joel.
>
>
>>
>> Shawn
>>
>> On Sat, Feb 19, 2011 at 6:32 PM, joel sachs <jsachs at csee.umbc.edu> wrote:
>>>
>>> Hi Shawn,
>>>
>>> On Thu, 17 Feb 2011, Shawn Bowers wrote:
>>>
>>>> Hi Joel,
>>>>
>>>> I think the OWL model in general is "schema-last".
>>>
>>> A good point, although I would phrase it differently and say that rdf, in
>>> general, is highly compatible with schema-last. But it's also compatible
>>> with decidedly non schema-last practices. For example, we could require
>>> that
>>> all instance data be validated with an ontology, and not have a mechanism
>>> for updating the ontology in response to the frustrations of our users.
>>>
>>> Anyway, I'd be happy to stop using the phrase, and instead talk about
>>> specifics of what our ontologies should look like, and where they should
>>> come from.
>>>
>>> Regards,
>>> Joel.
>>>
>>>
>>>> In particular, the
>>>> only fixed "schema" is the triple model (subject, predicate, object),
>>>> and one can add and remove triples as needed. I don't think OBOE or EQ
>>>> (or any other OWL ontology) is any more schema-first versus
>>>> schema-last than the other -- since they are based on OWL/RDF.
>>>> Alternatively, a particular dataset (with specific attributes) is a
>>>> typical example of "schema first", i.e., before I can store data rows,
>>>> I have to define the attributes (so this would be true in, e.g.,
>>>> Darwin Core). In both OBOE and EQ, one could have a set of triples,
>>>> and then come along later at any time and add triples that give type
>>>> information to existing individuals, etc. Both OBOE and EQ do
>>>> introduce classes that prescribe how to structure new classes and type
>>>> individuals -- but it would be really hard given this to say one is
>>>> more "schema last" than the other because of these basic upper-level
>>>> classes.
>>>>
>>>> Shawn
>>>>
>>>> On Thu, Feb 17, 2011 at 11:28 AM, joel sachs <jsachs at csee.umbc.edu>
>>>> wrote:
>>>>>
>>>>> Hilmar,
>>>>>
>>>>> I guess I'm now guilty of conflating concepts myself, namely
>>>>> "instance-data generation as an integral component of the ontology
>>>>> development spiral",
>>>>> and "schema last". They're distinct, but related in the sense that the
>>>>> latter can be seen as an extreme case of the former. Separating them:
>>>>>
>>>>> Instance Data.
>>>>> Do you (or does anyone else on the list) know the status of OBD? From
>>>>> the
>>>>> NCBO FAQ:
>>>>>
>>>>> ---
>>>>> What is OBD?
>>>>> OBD is a database for storing data typed using OBO ontologies
>>>>>
>>>>> Where is it?
>>>>> In development!
>>>>>
>>>>> Is there a demo?
>>>>> See http://www.fruitfly.org/~cjm/obd
>>>>>
>>>>> Datasets
>>>>> See the above URL for now
>>>>> ---
>>>>>
>>>>> But the demo link is broken, and it's hard to find information on OBD
>>>>> that
>>>>> isn't a few years old. Is it still the plan to integrate OBD into
>>>>> BioPortal? If not, then maybe the "Missing Functionality [of
>>>>> BioPortal]"
>>>>> section of the KOS report should include a subsection about providing
>>>>> access to instance data. Considering GBIF's data holdings, it seems
>>>>> like
>>>>> it would be a shame to not integrate data browsing into any ontology
>>>>> browsing infrastructure that GBIF provides.
>>>>>
>>>>> Schema Last.
>>>>> I think schema-last is a malleable enough buzzword that we can hijack
>>>>> it
>>>>> slightly, and I've been wondering about what it should mean in the
>>>>> context
>>>>> of TDWG ontologies. Some ontology paradigms are inherently more
>>>>> schema-last-ish than others. For example, EQ strikes me as more
>>>>> schema-last-ish than OBOE or Prometheus. Extending an example from the
>>>>> Fall, EQ gives:
>>>>>
>>>>> fruit - green
>>>>> bark - brown
>>>>> leaves - yellow
>>>>> leaves - ridged
>>>>> leaves - broad
>>>>>
>>>>> and OBOE gives
>>>>>
>>>>> fruit - colour - green
>>>>> bark - colour - brown
>>>>> leaves - colour - yellow
>>>>> leaves - perimeter texture - ridged
>>>>> leaves - basic shape - broad
>>>>>
>>>>> So in the OBOE case, the characteristics (color, perimeter texture,
>>>>> basic
>>>>> shape) are given a priori, while in the EQ case they would (presumably)
>>>>> be
>>>>> abstracted during subsequent ontology development. In theory, these two
>>>>> approaches may be isomorphic, since, presumably, the OBOE
>>>>> characteristics
>>>>> are also abstracted from examples collected as part of the requirements
>>>>> gathering process. In practice, though, I suspect that EQ leaves more
>>>>> scope for instance-informed schemas. I have no basis for this suspicion
>>>>> other than intuition, and would welcome any evidence or references
>>>>> that
>>>>> anyone can provide.
>>>>>
>>>>> Also, schema-last could perhaps be a guiding philosophy as we seek to
>>>>> put
>>>>> in place a mechanism for facilitating ontology update and evolution.
>>>>> For
>>>>> example, it might be worth experimenting with tag-driven ontology
>>>>> evolution, as in [1], where tags are associated to concepts in an
>>>>> ontology. If a tag
>>>>> can't be mapped into the ontology, the ontology engineer takes this as
>>>>> a
>>>>> clue that the ontology needs revision. So the domain expert/knowledge
>>>>> engineer
>>>>> partnership is preserved, but with the domain expert role being
>>>>> replaced
>>>>> by collective wisdom from the community. Passant's focus was
>>>>> information
>>>>> retrieval, where the only reasoning is using subsumption hierarchies to
>>>>> expand the scope of a query, but the principle should apply to
>>>>> other reasoning tasks as well. The example in my mind is using a DL
>>>>> representation of SDD as the basis for polyclave keys. When users enter
>>>>> terms not in the ontology, it would trigger a process that could lead
>>>>> to
>>>>> ontology update.
>>>>>
>>>>> I don't dispute the importance of involving individual domain experts,
>>>>> especially at the beginning, but also throughout the process. And I
>>>>> agree
>>>>> that catalyzing this process is, indeed, a job for TDWG.
>>>>>
>>>>> Joel.
>>>>>
>>>>> 1. Passant, http://www.icwsm.org/papers/paper15.html
>>>>>
>>>>>
>>>>> On Tue, 15 Feb 2011, Hilmar Lapp wrote:
>>>>>
>>>>>> Hi Joel -
>>>>>>
>>>>>> I'm in full agreement re: importance of generating instance data as
>>>>>> driving principle in developing an ontology. This is the case indeed
>>>>>> in all
>>>>>> the OBO Foundry ontologies I'm familiar with, in the form of data
>>>>>> curation
>>>>>> needs driving ontology development. Which is perhaps my bias as to why
>>>>>> I
>>>>>> treat this as implicit.
>>>>>>
>>>>>> That being said, it has also been found that in specific subject areas
>>>>>> progress can be made fastest if you convene a small group of domain
>>>>>> experts
>>>>>> and simply model the knowledge about those subject areas, rather than
>>>>>> doing
>>>>>> so piecemeal in response to data curation needs.
>>>>>>
>>>>>> BTW I don't think Freebase is a good example here. I don't think the
>>>>>> model of intense centralized data and vocabulary curation that it
>>>>>> employs is
>>>>>> tenable within our domain, and I have a hard time imagining how
>>>>>> schema-last
>>>>>> would not result in an incoherent data soup otherwise. But then
>>>>>> perhaps I
>>>>>> just don't understand what you mean by schema-last.
>>>>>>
>>>>>> -hilmar
>>>>>>
>>>>>> Sent with a tap.
>>>>>>
>>>>>> On Feb 15, 2011, at 8:24 PM, joel sachs <jsachs at csee.umbc.edu> wrote:
>>>>>>
>>>>>>> Hi Hilmar,
>>>>>>>
>>>>>>> No argument from me, just my prejudice against "solution via
>>>>>>> ontology",
>>>>>>> and my enthusiasm for "schema-last" - the idea that the schema
>>>>>>> reveals
>>>>>>> itself after you've populated the knowledge base. This was never
>>>>>>> really
>>>>>>> possible with relational databases, where a table must be defined
>>>>>>> before it
>>>>>>> can be populated. But graph databases (especially the "anyone can say
>>>>>>> anything" semantic web) practically invite a degree of schema-last.
>>>>>>> Examples include Freebase (schema-last by design), and FOAF, whose
>>>>>>> specification is so widely ignored and mis-used (often to good
>>>>>>> effect), that
>>>>>>> the de-facto spec is the one that can be abstracted from FOAF files
>>>>>>> in the
>>>>>>> wild.
>>>>>>>
>>>>>>> The semantic web is littered with ontologies lacking instance data;
>>>>>>> my
>>>>>>> hope is that generating instance data is a significant part of the
>>>>>>> ontology
>>>>>>> building process for each of the ontologies proposed by the report.
>>>>>>> By
>>>>>>> "generating instance data" I mean not simply marking up a few example
>>>>>>> records, but generating millions of triples to query over as part of
>>>>>>> the
>>>>>>> development cycle. This will indicate both the suitability of the
>>>>>>> ontology
>>>>>>> to the use cases, and also its ease of use.
>>>>>>>
>>>>>>> I like the order in which the GBIF report lists its infrastructure
>>>>>>> recommendations. Persistent URIs (the underpinning of everything);
>>>>>>> followed by competency questions and use cases (very helpful in the
>>>>>>> prevention of mental masturbation); followed by OWL ontologies (to
>>>>>>> facilitate reasoning). Perhaps the only place where we differ is that
>>>>>>> you're
>>>>>>> comfortable with "incorporate instance data into the ontology design
>>>>>>> process" being implicit, while I never tire of seeing that point
>>>>>>> hammered
>>>>>>> home.
>>>>>>>
>>>>>>> Regards - Joel.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 14 Feb 2011, Hilmar Lapp wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Feb 14, 2011, at 12:05 PM, joel sachs wrote:
>>>>>>>>
>>>>>>>>> I think the recommendations are heavy on building ontologies, and
>>>>>>>>> light on suggesting paths to linked data representations of
>>>>>>>>> instance data.
>>>>>>>>
>>>>>>>>
>>>>>>>> Good observation. I can't speak for all of the authors, but in my
>>>>>>>> experience building Linked Data representations is mostly a
>>>>>>>> technical
>>>>>>>> problem, and thus much easier compared to building soundly
>>>>>>>> engineered,
>>>>>>>> commonly agreed upon ontologies with deep domain knowledge capture.
>>>>>>>> The
>>>>>>>> latter is hard, because it requires overcoming a lot of social
>>>>>>>> challenges.
>>>>>>>>
>>>>>>>> As for the GBIF report, personally I think linked biodiversity data
>>>>>>>> representations will come at about the same pace whether or not GBIF
>>>>>>>> pushes
>>>>>>>> on that front (though GBIF can help make those representations
>>>>>>>> better by
>>>>>>>> provisioning stable resolvable identifier services, URIs etc). There
>>>>>>>> is a
>>>>>>>> unique opportunity though for "neutral" organizations such as GBIF
>>>>>>>> (or, in
>>>>>>>> fact, TDWG), to significantly accelerate the development of sound
>>>>>>>> ontologies
>>>>>>>> by catalyzing the community engagement, coherence, and discourse
>>>>>>>> that is
>>>>>>>> necessary for them.
>>>>>>>>
>>>>>>>>    -hilmar
>>>>>>>> --
>>>>>>>> ===========================================================
>>>>>>>> : Hilmar Lapp  -:- Durham, NC -:- informatics.nescent.org :
>>>>>>>> ===========================================================
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>>> tdwg-content mailing list
>>>>> tdwg-content at lists.tdwg.org
>>>>> http://lists.tdwg.org/mailman/listinfo/tdwg-content
>>>>>
>>>
>

