More Strange Monkey Business-like things in GBIF KOS Document
Does this accurately characterize my project?
"in the GeoSpecies project104 based on a small purpose-built ontology105 of mosquito-borne human pathogens."
Did they bother to read any of the seven other examples on this page?
Or here http://www.taxonconcept.org/
or here http://www.delicious.com/kidehen/pivot_collection_app+linked_geo_species
or here http://www.slideshare.net/pjdwi/biodiversity-informatics-on-the-semantic-web
Also note that the particular link they used in the document for reference 104 does not work:
http://about.geospecies.org/index.htm
while this one does:
http://about.geospecies.org/index.html
Also the "small" TaxonConcept SPARQL endpoint has ~27 million triples.
It might also be useful to explain how reasoning can be used on the larger data sets.
Do they have an example of reasoning that works on a data set over 100 million triples?
Is there some reason why there is so much "push" towards specialized near proprietary solutions like LSID's and LOD unfriendly vocabularies?
Respectfully,
- Pete
Pete:
On Feb 11, 2011, at 5:24 PM, Peter DeVries wrote:
Is there some reason why there is so much "push" towards specialized near proprietary solutions like LSID's and LOD unfriendly vocabularies?
Would you mind to elaborate what you mean by this?
-hilmar
Sure, I will need to be brief.
1) The KOS document is still largely dismissive of Linked Open Data
2) If you look at the current Darwin Core as represented by the TDWG BioBlitz Occurrence Data Set, http://www.cs.umbc.edu/~jsachs/occurrences/TechnoBioblitzOccurrences.rdf :
a) it uses its own date vocabulary and formatting rather than dc:date
b) I don't think the current version of the vocabulary resolves correctly following LOD standards
c) other than the *geo* vocabulary, which TDWG does not seem to agree with, how much of this uses any other commonly used LOD vocabulary?
How well does this data set work to query for occurrences of a given species?
Or identifications or observations by a particular person?
Was there any thought to identifying which of the various identifications is the preferred one for mapping etc?
These are all issues that become apparent when you start marking up records and attempting queries.
I modified this somewhat so that at least some of the occurrences are tied not only to a particular non-normalized name but also to a species concept.
I also started to normalize the various text strings for people to a standard URI. The data itself has the same person identified with several name variations.
It is here: http://lod.taxonconcept.org/tdwg2010bioblitz/TechnoBioblitzOccurrences_dates...
As I showed earlier, this version has the same person linked via the same identifier and allows browsing and queries based on semantically informative identifiers.
To query the original data for occurrences of a given species you would need to know all the various names that people entered for that taxon.
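To make the contrast concrete, here is a rough SPARQL sketch (two separate queries; the prefixes, the concept-linking predicate, and the species-concept URI are placeholders rather than the exact terms in either file):

  PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
  PREFIX txn: <http://lod.taxonconcept.org/ontology/txn.owl#>

  # Against the original data: every name variant has to be enumerated by hand.
  SELECT ?occurrence WHERE {
    ?occurrence dwc:scientificName ?name .
    FILTER (?name IN ("Sciurus carolinensis", "Sciurus carolinensis Gmelin, 1788",
                      "gray squirrel", "Grey Squirrel"))
  }

  # Against records tied to a species concept: one triple pattern is enough.
  # (Run as a separate query; the predicate and concept URI are illustrative.)
  SELECT ?occurrence WHERE {
    ?occurrence txn:occurrenceHasSpeciesConcept
                <http://lod.taxonconcept.org/ses/EXAMPLE#Species> .
  }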
Here is an example of an improved record http://bit.ly/hy4HFi (people and species concepts unambiguously defined with URIs and linked to related records).
Here is a poorly linked record http://bit.ly/fSReZS (the same people and the same species labeled with a number of different string combinations, etc.).
In the latter it is not clear who recorded what, what other things they recorded, or which things are the same species and which are different species.
In summary, to get these to work in a way that others expect I had to make them more like the TaxonConcept.org records.
I have been advocating for some of these differences, like a URI for each species concept and the adoption of well-understood external vocabularies, on this and other lists since 2006.
Considering how many examples and discussions I have been involved in, it seems a bit strange that the authors of the KOS paper characterize my efforts as
"in the GeoSpecies project104 based on a small purpose-built ontology105 of mosquito-borne human pathogens."
Respectfully,
- Pete
Pete,
A few things are being conflated here. Teasing them out:
1. My read of the sentence
"As examples, see the OpenLink Data Explorer [102] and offer it Quercus alba or the SPARQL query [103] in the GeoSpecies project [104] based on a small purpose-built ontology [105] of mosquito-borne human pathogens."
is that it's the SPARQL query that's based on a small purpose built ontology of mosquito-borne human pathogens, not the GeoSpecies project. I think it's appropriate that the only example of linked biodiversity data given by the report comes from GeoSpecies/taxonconcept.org, since you've gone farther in this space than anyone else.
2. My rdf representation of the TDWG bioblitz data is primarily an experiment in representing Darwin Core on the semantic web. Amongst those thinking about the right way to do this, I probably advocate the least amount of change from the current Darwin Core: one or two new classes, and possibly some "hasX" properties, where X is a class. I can see some utility for range constraints on these classes, but would avoid domain constraints almost entirely. Others on tdwg-content are advocating a different style. That is all to say that http://www.cs.umbc.edu/~jsachs/occurrences/TechnoBioblitzOccurrences.rdf may not be the best example of everything wrong with TDWG, only because many in TDWG would not endorse it. That said, I'll address some of your points:
i. For the most part, there is no overlap between Darwin Core and "commonly used Linked Data vocabularies". DwC itself encourages use of Dublin Core where appropriate. The exceptions are for dateTimes and locations. I don't know why these exceptions were made (though I guess the reasons are in the archives somewhere). In any event, I rejected the somewhat baroque DwC construction for location, and opted to use the geo vocabulary. You're probably right that it makes sense to use dcterms for timestamps as well.
ii. The dataset works pretty well to query for instances of a particular species. It's not hard to query for people either. It would, I agree, be easier if people's names were more standardized, and assigning URIs of the sort you created (e.g. http://lod.taxonconcept.org/people/tdwg2010bioblitz#Donald_Hobern) is one way to do that. Your approach is in harmony with the recommendation of the GBIF report to "Promote the widespread adoption of URI-based standard values for key Darwin Core attribute values" (Recommendation 3.1.j).
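As a rough Turtle sketch of both points (the occurrence URI, coordinates and date are placeholders, and whether dwc:recordedBy should take a URI object rather than a free-text literal is exactly the kind of convention we would need to agree on):

  @prefix dwc:     <http://rs.tdwg.org/dwc/terms/> .
  @prefix geo:     <http://www.w3.org/2003/01/geo/wgs84_pos#> .
  @prefix dcterms: <http://purl.org/dc/terms/> .
  @prefix foaf:    <http://xmlns.com/foaf/0.1/> .
  @prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

  # One occurrence using widely deployed LOD terms for place and time,
  # and a URI for the recorder instead of a free-text string.
  <http://example.org/occ/123>
      geo:lat        "38.97"^^xsd:decimal ;
      geo:long       "-76.94"^^xsd:decimal ;
      dcterms:date   "2010-09-26"^^xsd:date ;
      dwc:recordedBy <http://lod.taxonconcept.org/people/tdwg2010bioblitz#Donald_Hobern> .

  <http://lod.taxonconcept.org/people/tdwg2010bioblitz#Donald_Hobern>
      a foaf:Person ;
      foaf:name "Donald Hobern" .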
iii. The current version of the data uses taxonconceptIDs from taxonconcept.org for 411 of the records. It remains (I think) non-trivial to assign taxonconceptIDs appropriately to all occurrence records. In response to some of the anomalies you pointed out earlier, I also made a pass at normalizing the transparent ETHAN IDs that the dataset uses.
iv. In regards to identifying which identifications are preferred, there are a number of ways forward. What would you suggest?
3. Broadly speaking, I don't see anything objectionable in the report, although I think the recommendations are heavy on building ontologies, and light on suggesting paths to linked data representations of instance data.
Joel.
On Feb 14, 2011, at 12:05 PM, joel sachs wrote:
I think the recommendations are heavy on building ontologies, and light on suggesting paths to linked data representations of instance data.
Good observation. I can't speak for all of the authors, but in my experience building Linked Data representations is mostly a technical problem, and thus much easier compared to building soundly engineered, commonly agreed upon ontologies with deep domain knowledge capture. The latter is hard, because it requires overcoming a lot of social challenges.
As for the GBIF report, personally I think linked biodiversity data representations will come at about the same pace whether or not GBIF pushes on that front (though GBIF can help make those representations better by provisioning stable resolvable identifier services, URIs etc). There is a unique opportunity though for "neutral" organizations such as GBIF (or, in fact, TDWG), to significantly accelerate the development of sound ontologies by catalyzing the community engagement, coherence, and discourse that is necessary for them.
-hilmar
Hi Hilmar,
No argument from me, just my prejudice against "solution via ontology", and my enthusiasm for "schema-last" - the idea that the schema reveals itself after you've populated the knowledge base. This was never really possible with relational databases, where a table must be defined before it can be populated. But graph databases (especially the "anyone can say anything" semantic web) practically invite a degree of schema-last. Examples include Freebase (schema-last by design), and FOAF, whose specification is so widely ignored and mis-used (often to good effect), that the de-facto spec is the one that can be abstracted from FOAF files in the wild.
The semantic web is littered with ontologies lacking instance data; my hope is that generating instance data is a significant part of the ontology building process for each of the ontologies proposed by the report. By "generating instance data" I mean not simply marking up a few example records, but generating millions of triples to query over as part of the development cycle. This will indicate both the suitability of the ontology to the use cases, and also its ease of use.
I like the order in which the GBIF report lists its infrastructure recommendations. Persistent URIs (the underpinning of everything); followed by competency questions and use cases (very helpful in the prevention of mental masturbation); followed by OWL ontologies (to facilitate reasoning). Perhaps the only place where we differ is that you're comfortable with "incorporate instance data into the ontology design process" being implicit, while I never tire of seeing that point hammered home.
Regards - Joel.
Hi Joel -
I'm in full agreement re: importance of generating instance data as driving principle in developing an ontology. This is the case indeed in all the OBO Foundry ontologies I'm familiar with, in the form of data curation needs driving ontology development. Which is perhaps my bias as to why I treat this as implicit.
That being said, it has also been found that in specific subject areas progress can be made fastest if you convene a small group of domain experts and simply model the knowledge about those subject areas, rather than doing so piecemeal in response to data curation needs.
BTW I don't think Freebase is a good example here. I don't think the model of intense centralized data and vocabulary curation that it employs is tenable within our domain, and I have a hard time imagining how schema-last would not result in an incoherent data soup otherwise. But then perhaps I just don't understand what you mean by schema-last.
-hilmar
Sent with a tap.
Hilmar,
I guess I'm now guilty of conflating concepts myself, namely "instance-data generation as an integral component of the ontology development spiral", and "schema last". They're distinct, but related in the sense that the latter can be seen as an extreme case of the former. Separating them:
Instance Data. Do you (or does anyone else on the list) know the status of OBD? From the NCBO FAQ:
What is OBD? OBD is a database for storing data typed using OBO ontologies.
Where is it? In development!
Is there a demo? See http://www.fruitfly.org/~cjm/obd
Datasets? See the above URL for now.
But the demo link is broken, and it's hard to find information on OBD that isn't a few years old. Is it still the plan to integrate OBD into BioPortal? If not, then maybe the "Missing Functionality [of BioPortal]" section of the KOS report should include a subsection about providing access to instance data. Considering GBIF's data holdings, it seems like it would be a shame to not integrate data browsing into any ontology browsing infrastructure that GBIF provides.
Schema Last. I think schema-last is a malleable enough buzzword that we can hijack it slightly, and I've been wondering about what it should mean in the context of TDWG ontologies. Some ontology paradigms are inherently more schema-last-ish than others. For example, EQ strikes me as more schema-last-ish than OBOE or Prometheus. Extending an example from the Fall, EQ gives:
fruit - green
bark - brown
leaves - yellow
leaves - ridged
leaves - broad
and OBOE gives
fruit - colour - green
bark - colour - brown
leaves - colour - yellow
leaves - perimeter texture - ridged
leaves - basic shape - broad
So in the OBOE case, the characteristics (color, perimeter texture, basic shape) are given a priori, while in the EQ case they would (presumably) be abstracted during subsequent ontology development. In theory, these two approaches may be isomorphic, since, presumably, the OBOE characteristics are also abstracted from examples collected as part of the requirements gathering process. In practice, though, I suspect that EQ leaves more scope for instance-informed schemas. I have no basis for this suspicion other than intuition, and would welcome any evidence or references that anyone can provide.
Also, schema-last could perhaps be a guiding philosophy as we seek to put in place a mechanism for facilitating ontology update and evolution. For example, it might be worth experimenting with tag-driven ontology evolution, as in [1], where tags are associated to concepts in an ontology. If a tag can't be mapped into the ontology, the ontology engineer takes this as a clue that the ontology needs revision. So the domain expert/knowledge engineer partnership is preserved, but with the domain expert role being replaced by collective wisdom from the community. Passant's focus was information retrieval, where the only reasoning is using subsumption hierarchies to expand the scope of a query, but the principle should apply to other reasoning tasks as well. The example in my mind is using a DL representation of SDD as the basis for polyclave keys. When users enter terms not in the ontology, it would trigger a process that could lead to ontology update.
I don't dispute the importance of involving individual domain experts, especially at the beginning, but also throughout the process. And I agree that catalyzing this process is, indeed, a job for TDWG.
Joel.
1. Passant, http://www.icwsm.org/papers/paper15.html
Hi Joel,
I think the OWL model in general is "schema-last". In particular, the only fixed "schema" is the triple model (subject, predicate, object), and one can add and remove triples as needed. I don't think OBOE or EQ (or any other OWL ontology) is any more schema-first versus schema-last than the other -- since they are based on OWL/RDF. Alternatively, a particular dataset (with specific attributes) is a typical example of "schema first", i.e., before I can store data rows, I have to define the attributes (so this would be true in, e.g., Darwin Core). In both OBOE and EQ, one could have a set of triples, and then come along later at any time and add triples that give type information to existing individuals, etc. Both OBOE and EQ do introduce classes that prescribe how to structure new classes and type individuals -- but it would be really hard given this to say one is more "schema last" than the other because of these basic upper-level classes.
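For instance (all of the names below are made up, purely to illustrate the point), one can assert data first and type it later:

  @prefix ex: <http://example.org/ns#> .

  # Day one: record the measurements with no schema beyond the triple model.
  ex:obs42 ex:leafColour "yellow" ;
           ex:site       ex:plot7 .

  # Any time later: layer typing triples onto the same individuals
  # without touching the original assertions.
  ex:obs42 a ex:Observation .
  ex:plot7 a ex:SamplingLocation .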
Shawn
On Feb 17, 2011, at 3:23 PM, Shawn Bowers wrote:
Both OBOE and EQ do introduce classes that prescribe how to structure new classes and type individuals
That's actually not quite true. The EQ model itself doesn't prescribe any new classes or the types that individuals must be of; instead it simply says that a phenotype instance can be expressed as some instance of a quality Q that inheres_in some instance of an entity E, and thus a class of phenotypes (or observations of an organism's characteristics) is the intersection of all instances of Q (a subclass restriction), and all things that inhere_in E (a property restriction).
While typically we will draw Q and E from certain ontologies (such as PATO for qualities), you can designate any class (term) in those places, and the class expression by itself will not support inferences about the nature of Q or E or their instances (the ontologies that Q and E are drawn from do that). The class expression itself is often anonymous, but there are (so-called "pre-composed") ontologies that identify and label them.
That being said, while EQ in principle allows you to do real crazy things if you want to (which perhaps is what Joel means by schema-last?), if you want to be able to do discovery and reasoning with a set of EQ class expressions from different sources, they will need to follow some shared conventions, such as not simply making up quality and entity terms as needed, but drawing them from PATO and shared entity ontologies.
Conversely, OBOE does prescribe the nature of the things that it relates to each other in the model, the cardinality of those relationships, and what it means for an instance to have such a relationship. For example, if I assert o oboe:ofEntity e, the semantics of oboe:ofEntity prescribe that o is an instance of oboe:Observation, e is an instance of oboe:Entity, and if I also assert o oboe:ofEntity e1, it prescribes that e and e1 are identical, i.e., the same instance.
I think these differences are a result of how they were motivated, and it is interesting to me that Joel would pick these as examples for illustrating "schema-lastishness". OBOE was motivated by having a unified data model for observational data, in the interest of better data exchange and integration. I think all its class and property constraints are a reflection of that - there is a desire not to "allow anything". Conversely, EQ wouldn't make for a good model in which to exchange arbitrary observational data - there would be no guarantees for what you get. However, it is very powerful for reasoning over the semantics of the observations (see the Washington et al 2009 paper), which is what it was conceived for.
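In rough OWL (Turtle) terms, the contrast looks something like the following. The names here are placeholders; real EQ annotations would draw the quality from PATO, the entity from an anatomy or similar ontology, and inheres_in from RO, and the OBOE terms live in the oboe namespace:

  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix :     <http://example.org/sketch#> .

  # EQ style: a phenotype class is just the intersection of a quality class
  # and a property restriction on inheres_in; nothing new is prescribed.
  :YellowLeaf owl:equivalentClass [
      a owl:Class ;
      owl:intersectionOf ( :Yellow
                           [ a owl:Restriction ;
                             owl:onProperty :inheres_in ;
                             owl:someValuesFrom :Leaf ] )
  ] .

  # OBOE style: the property itself carries domain, range and functionality,
  # so asserting "o :ofEntity e" types o and e and forces a single entity
  # per observation.
  :ofEntity a owl:ObjectProperty , owl:FunctionalProperty ;
      rdfs:domain :Observation ;
      rdfs:range  :Entity .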
On Thu, Feb 17, 2011 at 11:28 AM, joel sachs jsachs@csee.umbc.edu wrote:
Do you (or does anyone else on the list) know the status of OBD? From the NCBO FAQ:
Funny you should ask. We're in the final stages of writing up a manuscript about it. I can share a preprint with you next week. OBD is what is underpinning the Phenoscape Knowledgebase (http://kb.phenoscape.org ).
The URL is http://www.berkeleybop.org/obd/. It is still pretty outdated, but will be updated very soon.
Is it still the plan to integrate OBD into BioPortal?
I don't think so. And there are lots of resources working on that (at least in the biomedical domain), so it'd be hard for them to pick what to follow.
So in the OBOE case, the characteristics (color, perimeter texture, basic shape) are given a priori, while in the EQ case they would (presumably) be abstracted during subsequent ontology development.
Yes. They are implied by the subclass structure of PATO (and thus subject to change).
it might be worth experimenting with tag-driven ontology evolution, as in [1], where tags are associated to concepts in an ontology. [...] So the domain expert/knowledge engineer partnership is preserved, but with the domain expert role being replaced by collective wisdom from the community.
Are you aware of the "Fast, Cheap, and Out of Control" paper from Mark Wilkinson's group: Good et al. 2006. Fast, Cheap and Out of Control: A Zero Curation Model for Ontology Development. Pacific Symposium on Biocomputing 11: 128-139.
http://psb.stanford.edu/psb-online/proceedings/psb06/good.pdf
-hilmar
Hi Hilmar,
Both OBOE and EQ do introduce classes that prescribe how to structure new classes and type individuals
That's actually not quite true. The EQ model itself doesn't prescribe any new classes or the types that individuals must be of; instead it simply says that a phenotype instance can be expressed as some instance of a quality Q that inheres_in some instance of an entity E, and thus a class of phenotypes (or observations of an organism's characteristics) is the intersection of all instances of Q (a subclass restriction), and all things that inhere_in E (a property restriction).
There are two ways to type things in OWL, classes and properties (I should have said properties in addition to classes above, since OBOE also introduces properties). So, in this way, the "inheres_in" property is how EQ prescribes type information on "instances". It also sounds like it prescribes E's and Q's (since this really defines what inheres_in is), and so at least implicitly these are types also "introduced" by EQ.
While typically we will draw Q and E from certain ontologies (such as PATO for qualities), you can designate any class (term) in those places, and the class expression by itself will not support inferences about the nature of Q or E or their instances (the ontologies that Q and E are drawn from do that). The class expression itself is often anonymous, but there are (so-called "pre-composed") ontologies that identify and label them.
But, one would imagine that designating a class within an inheres_in statement (even if anonymous) means it is either an E or a Q (at least implicitly, i.e., it may not be inferable from EQ that this is the case, but that seems like a detail). Of course, PATO as a realization of EQ uses a Quality class.
That being said, while EQ in principle allows you to do real crazy things if you want to (which perhaps is what Joel means by schema-last?), if you want to be able to do discovery and reasoning with a set of EQ class expressions from different sources, they will need to follow some shared conventions, such as not simply making up quality and entity terms as needed, but drawing them from PATO and shared entity ontologies.
Right.
Conversely, OBOE does prescribe the nature of the things that it relates to each other in the model, the cardinality of those relationships, and what it means for an instance to have such a relationship. For example, if I assert o oboe:ofEntity e, the semantics of oboe:ofEntity prescribe that o is an instance of oboe:Observation, e is an instance of oboe:Entity, and if I also assert o oboe:ofEntity e1, it prescribes that e and e1 are identical, i.e., the same instance.
Yes, this is true.
BTW. I'm a bit confused though -- is EQ an OWL ontology? Or is it purely an abstract model that prescribes a convention for defining qualities, with concrete quality and entity ontologies being drawn from other places (like PATO)? Where is the inheres_in property defined?
I think these differences are a result of how they were motivated, and it is interesting to me that Joel would pick these as examples for illustrating "schema-lastishness". OBOE was motivated by having a unified data model for observational data, in the interest of better data exchange and integration. I think all its class and property constraints are a reflection of that - there is a desire not to "allow anything".
I agree with this.
As an example, one of the driving use cases for OBOE is annotating relational data sets, in which the attributes within a given data set are tagged with observation/measurement types, and from these annotations OBOE instance data (i.e., sets of triples) are automatically generated.
Conversely, EQ wouldn't make for a good model in which to exchange arbitrary observational data - there would be no guarantees for what you get. However, it is very powerful for reasoning over the semantics of the observations (see the Washington et al 2009 paper), which is what it was conceived for.
Right ... and I think ideally EQ models could be used within OBOE for assigning qualities to specific observations. This would allow for both the reasoning abilities of OBOE (e.g., context, units, etc.) plus those for qualities via EQ.
Shawn
On Feb 18, 2011, at 12:46 AM, Shawn Bowers wrote:
BTW. I'm a bit confused though -- is EQ an OWL ontology? Or is it purely an abstract model that prescribes a convention for defining qualities, with concrete quality and entity ontologies being drawn from other places (like PATO)?
It's an abstract model. It can be expressed and implemented in OWL (and also in OBO). It is a model for defining phenotype classes (though indeed in this model an EQ phenotype (class) is a subclass of a quality (class)). The quality and entity terms are drawn from ontologies that exist independently of (and in part predate) EQ.
Where is the inheres_in property defined?
In RO (the Relations Ontology, see Smith et al, 2005).
-hilmar
Great, thanks. This was my impression, but was starting to get confused.
I've read the Smith paper some time ago ... I'll go back and look again at the inheres_in property. I have read about inheres_in in the other papers on EQ -- but wasn't sure where it is defined.
Thanks again,
Shawn
Might I make a suggestion? When the topic of a thread diverges significantly from the original subject line, let's send it to the list with a new subject line that reflects the nature of the new thread. I don't know if I'm the only person who does this, but I do go to the list archives to try to find old posts that I remember and to which I would like to refer. However, that gets really hard when there are a lot of posts and the subject lines don't correspond to the topic of the posts. This is a good example of a series of posts to which I'm likely to want to refer in the future but I have a feeling I wouldn't find them under this subject line.
Thanks, Steve
Hilmar,
Schema-last, to me, is an attitude of holding back (sometimes forever) before i) restricting the vocabulary available to users; and/or ii) defining a semantics that draws inferences way beyond a user's assertions.
I think this attitude can apply not only to the terms of an ontology, but to the general shape and style of the ontology, and I am concerned about GBIF/TDWG assuming that its ontologies should be DL in flavour. By DL, I mean more than whether an ontology is technically within the OWL-DL profile. I mean the general approach of building classifiers, which, traditionally, has been the goal of description logics. So, by DL in flavour, I mean making heavy use of domain and range restrictions, functional and inverseFunctional properties, class definition via property restriction, etc. This DL-based approach seems to be working in genomics.
Will it work in biodiversity informatics? One cause for concern is that the current Darwin Core, which is simple, is widely misunderstood and intimidates many. It is possible that the problem will be solved with tighter restriction and more formalisms. But I'm skeptical.
Even if we are able, through the laborious process of doing things a certain way, to build classifiers for biodiversity informatics artifacts (occurrence records, evidence, identifications, etc.) in the same way that we can build them for actual objects of biology (genes, taxa, etc.), why would we want to? The natural world comes without labels, so it's helpful to be able to synthesize everything that we know about something to determine what it is. But human-made information artifacts are typically labeled, or have their types implied by context.
I'm currently arguing with someone off-list about what I think is my minimal example, that I hope that everyone can agree on. It's about domain constraints on "hasIdentification". If I say
"http://fu.bar hasIdentifcation rabbit",
should we, as a community, interpret that to mean that http://fu.bar is an individulOrganism (as opposed to, say, a picture)? Must I, as a guy who likes to make assertions, be told either
a. that I need additional vocabulary terms: pictureHasIdentification, occurenceHasIdentification, individualHasIdentification, etc. or b. that I need to limit hasIdentification to describing a single type of thing.
If you can convince me of either (a) or (b) above, then I'll be inclined to accept your entire vision for the semantic web.
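To spell out what (b) would entail in Turtle (the namespace and class name here are hypothetical, just to illustrate the mechanics of a domain axiom):

  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix ex:   <http://example.org/terms#> .

  # A domain axiom on the (hypothetical) property:
  ex:hasIdentification rdfs:domain ex:IndividualOrganism .

  # My bare assertion:
  <http://fu.bar> ex:hasIdentification ex:rabbit .

  # Under RDFS/OWL semantics a reasoner now concludes
  #   <http://fu.bar> rdf:type ex:IndividualOrganism .
  # even if http://fu.bar is actually a picture -- the "constraint" does not
  # raise an error, it silently reclassifies the subject.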
A few more comments, in-line, below ...
On Thu, 17 Feb 2011, Hilmar Lapp wrote:
On Feb 17, 2011, at 3:23 PM, Shawn Bowers wrote:
Both OBOE and EQ do introduce classes that prescribe how to structure new classes and type individuals
That's actually not quite true. The EQ model itself doesn't prescribe any new classes or the types that individuals must be of; instead it simply says that a phenotype instance can be expressed as some instance of a quality Q that inheres_in some instance of an entity E, and thus a class of phenotypes (or observations of an organism's characteristics) is the intersection of all instances of Q (a subclass restriction), and all things that inhere_in E (a property restriction).
While typically we will draw Q and E from certain ontologies (such as PATO for qualities), you can designate any class (term) in those places, and the class expression by itself will not support inferences about the nature of Q or E or their instances (the ontologies that Q and E are drawn from do that). The class expression itself is often anonymous, but there are (so-called "pre-composed") ontologies that identify and label them.
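For concreteness, a minimal Turtle/OWL sketch of such an EQ class expression (the quality, entity, and inheres_in IRIs below are placeholders; the real ontologies, e.g. PATO, use their own identifiers):

  @prefix owl: <http://www.w3.org/2002/07/owl#> .
  @prefix ex:  <http://example.org/eq#> .   # placeholder IRIs

  # A "curved femur" phenotype: the intersection of a quality class (Q)
  # and a restriction on inheres_in pointing at an entity class (E).
  ex:CurvedFemurPhenotype owl:equivalentClass [
      a owl:Class ;
      owl:intersectionOf (
          ex:curved                              # Q, e.g. a PATO quality
          [ a owl:Restriction ;
            owl:onProperty ex:inheres_in ;
            owl:someValuesFrom ex:femur ]        # E, an anatomical entity
      )
  ] .

In a pre-composed ontology the expression gets a name and label as here; in post-composed use it typically stays anonymous.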
That being said, while EQ in principle allows you to do real crazy things if you want to (which perhaps is what Joel means by schema-last?), if you want to be able to do discovery and reasoning with a set of EQ class expressions from different sources, they will need to follow some shared conventions, such as not simply making up quality and entity terms as needed, but drawing them from PATO and shared entity ontologies.
Conversely, OBOE does prescribe the nature of the things that it relates to each other in the model, the cardinality of those relationships, and what it means for instances to stand in such a relationship. For example, if I assert o oboe:ofEntity e, the semantics of oboe:ofEntity prescribe that o is an instance of oboe:Observation, e is an instance of oboe:Entity, and if I also assert o oboe:ofEntity e1, it prescribes that e and e1 are identical, i.e., the same instance.
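A minimal Turtle rendering of the semantics described above (the property and class names follow the prose; the actual OBOE IRIs may differ):

  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix oboe: <http://example.org/oboe#> .   # placeholder namespace

  oboe:ofEntity a owl:ObjectProperty , owl:FunctionalProperty ;
      rdfs:domain oboe:Observation ;
      rdfs:range  oboe:Entity .

  # Given   :o oboe:ofEntity :e , :e1 .
  # a reasoner infers that :o is an oboe:Observation, that :e and :e1 are
  # oboe:Entity instances, and (because the property is functional) that
  # :e owl:sameAs :e1 .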
I think these differences are a result of how they were motivated, and it is interesting to me that Joel would pick these as examples for illustrating "schema-lastishness".
An example of why I see EQ being more schema-last than OBOE is the question you recently forwarded to the Observations list: How do you represent "petiole 5x longer than wide"?
In EQ, you could say something like: <5:1 length to width ratio> <inheres_in> <petiole> and then wait for some more examples of ratios to come in, before deciding how to update your Quality ontology to handle ratios. In OBOE (please correct me if I'm wrong), it seems (to me) that you need to make more of an ontological commitment to express the same thing.
(Also, could you please direct me to sources of OBOE instance data? A quick search of TDWG-Observation, SONet, Google, and Swoogle only turned up the ontology itself, and a few examples of the "how do you do this in OBOE" variety.)
OBOE was motivated by having a unified data model for observational data, in the interest of better data exchange and integration. I think all its class and property constraints are a reflection of that - there is a desire not to "allow anything". Conversely, EQ wouldn't make for a good model in which to exchange arbitrary observational data - there would be no guarantees for what you get. However, it is very powerful for reasoning over the semantics of the observations (see the Washington et al 2009 paper), which is what it was conceived for.
I like the Washington paper a lot. One thing it illustrates to me is the power that comes from the judicious use of an appropriate domain ontology with which to value simple attributes. One of the most important recommendations in the KOS report, IMO, is the one I quoted to Pete: "Promote widespread adoption of URI-based standard values for key Darwin Core attribute values." Constructing appropriate ontologies for these values strikes me as a much better way to bring DwC on to the semantic web than recrafting DwC as an OWL ontology. (I'm not opposed to the latter, which may serve a data validation need, but I don't think it's necessary for typical data integration use cases.)
On Thu, Feb 17, 2011 at 11:28 AM, joel sachs jsachs@csee.umbc.edu wrote:
Do you (or does anyone else on the list) know the status of OBD? From the NCBO FAQ:
Funny you should ask. We're in the final stages of writing up a manuscript about it. I can share a preprint with you next week. OBD is what is underpinning the Phenoscape Knowledgebase (http://kb.phenoscape.org).
The URL is http://www.berkeleybop.org/obd/. It is still pretty outdated, but will be updated very soon.
Is it still the plan to integrate OBD into BioPortal?
I don't think so. And there are lots of resources working on that (at least in the biomedical domain), so it'd be hard for them to pick what to follow.
So in the OBOE case, the characteristics (color, perimeter texture, basic shape) are given a priori, while in the EQ case they would (presumably) be abstracted during subsequent ontology development.
Yes. They are implied by the subclass structure of PATO (and thus subject to change).
it might be worth experimenting with tag-driven ontology evolution, as in [1], where tags are associated to concepts in an ontology. [...] So the domain expert/knowledge engineer partnership is preserved, but with the domain expert role being replaced by collective wisdom from the community.
Are you aware of the "Fast, Cheap, and Out of Control" paper from Mark Wilkinson's group: Good et al. 2006. Fast, Cheap and Out of Control: A Zero Curation Model for Ontology Development. Pacific Symposium on Biocomputing 11: 128-139.
http://psb.stanford.edu/psb-online/proceedings/psb06/good.pdf
Cool, thanks. Looks like what they're describing is, essentially, the first VoCamp.
Joel.
On 20/02/2011, at 1:24 PM, joel sachs wrote:
I'm currently arguing with someone off-list about what I think is my minimal example, that I hope that everyone can agree on. It's about domain constraints on "hasIdentification". If I say
"http://fu.bar hasIdentifcation rabbit",
should we, as a community, interpret that to mean that http://fu.bar is an individulOrganism (as opposed to, say, a picture)? Must I, as a guy who likes to make assertions, be told either
It's been a while since I chimed in on this list.
hasIdentification has an RDF namespace. If the full name of the predicate is actually
http://tdwg.org/voc/Organism#hasIdentification
Then it's probably quite reasonable to make the type assumption. If you want to make it more general, then define a more general predicate
http://tdwg.org/voc/Common#hasIdentification
and a type (IdentifiableThing), and make subclass/subproperty assertions. If the namespace/ontology that you are importing makes it clear that we are talking about organisms, then a person who uses that predicate to describe a painting is misusing the vocabulary and deserves what they get.
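A minimal Turtle sketch of that layering (the Common and Organism namespaces and class names here are illustrative placeholders, not resolvable TDWG terms):

  @prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix common:   <http://example.org/voc/Common#> .     # placeholder
  @prefix organism: <http://example.org/voc/Organism#> .   # placeholder

  # General predicate with a deliberately broad domain:
  common:hasIdentification rdfs:domain common:IdentifiableThing .

  # Narrower predicate for the organism case:
  organism:hasIdentification
      rdfs:subPropertyOf common:hasIdentification ;
      rdfs:domain        organism:IndividualOrganism .

  organism:IndividualOrganism rdfs:subClassOf common:IdentifiableThing .

  # Someone describing a painting can safely use common:hasIdentification,
  # while data marked up with organism:hasIdentification still entails the
  # organism typing.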
Paul,
Is that Organism#hasIdentification URI from the TDWG ontology? I thought the TDWG ontology was de facto deprecated. Am I wrong about that?
Not only does http://tdwg.org/voc/Organism#hasIdentification not dereference, but it doesn't even generate real hits via Google. (Even http://tdwg.org/voc is broken.)
My thoughts about hasIdentification are in the context of representing Darwin Core as rdf. I think it's important that we continue to allow (and encourage) spreadsheet representations of DwC, and that these map naturally to de-normalized rdf.
I agree that it makes sense, as you suggest, to define hasIdentification as a property without domain constraints, and then introduce subProperties individualHasIdentification, occurrenceHasIdentification, pictureHasIdentification, etc., each with the appropriate domain. Then, applications that know what they're doing can apply the correct property.
You wrote: "a person who uses that predicate to describe a painting is misusing the vocabulary and deserves what they get."
The problem is that it's not just the person who misuses a vocabulary that gets a mess of incorrect inferences. We all do.
Joel.
Hi Joel,
I see schema last as part of an iterative process in which you mark things up as you think they will work and revise the data and ontology until it allows the kinds of queries etc. that you want.
The current Darwin Core maps to spreadsheets and XML, but I think we might want to work on a representation that works well as Linked Open Data.
This form could be created by cleaning and normalizing DarwinCore submissions.
- Pete
On 24/02/2011, at 9:13 AM, Peter DeVries wrote:
I see schema last as part of an iterative process in which you mark things up as you think they will work and revise the data and ontology until it allows the kinds of queries etc. that you want.
Absolutely.
The key is: automated testing. I'm doing this with my XML schemas - modify the code that generates my XML, update my schema, then *run a test* to confirm that my generated data and my schema agree. The process for RDF is similar - run your test data through a reasoner, and see if it goes "nope".
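A tiny sketch of what such a reasoner-based test can look like for RDF (all class and property names are invented for the example):

  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix ex:   <http://example.org/test#> .   # hypothetical test vocabulary

  # Vocabulary under test:
  ex:Occurrence owl:disjointWith ex:Identification .
  ex:hasIdentification rdfs:domain ex:Occurrence .

  # Deliberately bad test data: an Identification used as the subject.
  ex:ident42 rdf:type ex:Identification ;
             ex:hasIdentification ex:ident43 .

  # The domain axiom also types ex:ident42 as an ex:Occurrence, which clashes
  # with the disjointness axiom, so a reasoner reports an inconsistency --
  # the automated "nope" -- rather than a human finding the error later.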
The other key is sensible use of namespaces and rules so that the previous vocabularies can be left alone.
2010.rdf:
    predicate hasFoo
    predicate hasBar
    predicate hasBaz

2011.rdf:
    import 2010
    predicate hasFoo
    predicate hasBar
    predicate hasBaz

    <!-- we have narrowed the meaning of foo -->
    hasFoo subpredicateof 2010:hasFoo

    <!-- we have broadened the meaning of bar -->
    2010:hasBar subpredicateof hasBar

    <!-- Baz has changed its meaning, but this term and the old one are substantially the same -->
    predicate hasBazNarrowlyDefined
    predicate hasBazBroadlyDefined
    2010:hasBaz subpredicateof hasBazBroadlyDefined
    hasBaz subpredicateof hasBazBroadlyDefined
    hasBazNarrowlyDefined subpredicateof 2010:hasBaz
    hasBazNarrowlyDefined subpredicateof hasBaz
People using the old vocabulary are unaffected. People using the new one can read ontologies using the old vocabulary with no problem.
On 24/02/2011, at 9:05 AM, joel sachs wrote:
Is that Organism#hasIdentification URI from the TDWG ontology? I thought the TDWG ontology was de facto deprecated. Am I wrong about that?
(Sorry about the delay in replying. We have had Tony Rees up here and are trying to integrate Taxamatch into our search service. Could be good.)
I was speaking more about the idea of subclassing properties in general than making specific comment of particular "real" ontology terms.
You wrote: "a person who uses that predicate to describe a painting is misusing the vocabulary and deserves what they get."
The problem is that it's not just the person who misuses a vocabulary that gets a mess of incorrect inferences. We all do.
True, but there's just no way to avoid that. Although we like to talk about the semantic web as "all predicates everywhere", that global Ontology is inconsistent. In practice, whenever you reason, you have to take a set of ontologies that you trust - whether you specify them by literal filenames or simply by saying "I trust everything at some SPARQL endpoint".
With a tighter vocabulary, if someone has a bad predicate somewhere then anyone who uses that ontology - whether directly or by indirect inclusion - winds up with an inconsistent vocabulary that can't be reasoned over.
But - the curators of that data are simply getting it wrong, *provided* that the documentation is clear enough about how the predicates *should* be used.
I suppose a parallel is the DNS system. One bad DNS record has a ripple effect. For that reason, DNS servers don't take records from just anyone - there's a network of trust and responsibility.
The benefit of a tighter vocabulary is that "getting it wrong" becomes a machine-detectable occurrence.
As for usability: the situation is, say, that someone wants to say that something that is not an occurrence has an identification, and the TDWG vocabulary declares that hasIdentification has domain Occurrence. Well ... then they simply don't mean "hasIdentification" in the tdwg vocabulary sense of the word.
A: "Green" means any colour whose HSB equivalent has a hue of 0.22 to 0.44 B: my car is green A: no it isn't, it's teal B: well, *I* think it's green A: Cool! Use your own vocabulary namespace and define green how you like. B: But I want to use *your* term. A: why? B: so that people looking for what you call green cars will find mine A: but your car isn't what we call green. It's what we call teal. If someone searches for what we call green and gets your car, they will not get what they want to find. B: but my car *is* green!
And round and round it goes.
C: Ok - how about we define a colour "greenish" and declare that anything that is green or teal is therefore greenish? A: I don't want to add that to my vocabulary of colours C: cool. make a separate vocabulary, host it somewhere. B can use that.
B: but people won't know to search for "greenish" if it's in a separate vocabulary Me: ask A to add it B: A? A: Nope. "greenish" is right out. B: C? C: Dude, you simply don't own A's vocabulary, and that's all there is to it. Define your own, import it in your ontology, and anyone who doesn't like it just has to live without your data.
D: Hey! I want to use B's data, but I don't like B's vocabulary, particularly not this "greenish" thing. A, B, C: I'm sorry, D, but what you are asking for is inherently impossible.
Hi Shawn,
On Thu, 17 Feb 2011, Shawn Bowers wrote:
Hi Joel,
I think the OWL model in general is "schema-last".
A good point, although I would phrase it differently and say that rdf, in general, is highly compatible with schema-last. But it's also compatible with decidedly non schema-last practices. For example, we could require that all instance data be validated with an ontology, and not have a mechanism for updating the ontology in response to the frustrations of our users.
Anyway, I'd be happy to stop using the phrase, and instead talk about specifics of what our ontologies should look like, and where they should come from.
Regards, Joel.
In particular, the only fixed "schema" is the triple model (subject, predicate, object), and one can add and remove triples as needed. I don't think OBOE or EQ (or any other OWL ontology) is any more schema-first versus schema-last than the other -- since they are based on OWL/RDF. Alternatively, a particular dataset (with specific attributes) is a typical example of "schema first", i.e., before I can store data rows, I have to define the attributes (so this would be true in, e.g., Darwin Core). In both OBOE and EQ, one could have a set of triples, and then come along later at any time and add triples that give type information to existing individuals, etc. Both OBOE and EQ do introduce classes that prescribe how to structure new classes and type individuals -- but it would be really hard given this to say one is more "schema last" than the other because of these basic upper-level classes.
Shawn
On Thu, Feb 17, 2011 at 11:28 AM, joel sachs jsachs@csee.umbc.edu wrote:
Hilmar,
I guess I'm now guilty of conflating concepts myself, namely "instance-data generation as an integral component of the ontology development spiral", and "schema last". They're distinct, but related in the sense that the latter can be seen as an extreme case of the former. Separating them:
Instance Data. Do you (or does anyone else on the list) know the status of OBD? From the NCBO FAQ:
What is OBD? OBD is a database for storing data typed using OBO ontologies
Where is it? In development!
Is there a demo? See http://www.fruitfly.org/~cjm/obd
Datasets See the above URL for now
But the demo link is broken, and it's hard to find information on OBD that isn't a few years old. Is it still the plan to integrate OBD into BioPortal? If not, then maybe the "Missing Functionality [of BioPortal]" section of the KOS report should include a subsection about providing access to instance data. Considering GBIF's data holdings, it seems like it would be a shame to not integrate data browsing into any ontology browsing infrastructure that GBIF provides.
Schema Last. I think schema-last is a malleable enough buzzword that we can hijack it slightly, and I've been wondering about what it should mean in the context of TDWG ontologies. Some ontology paradigms are inherently more schema-last-ish than others. For example, EQ strikes me as more schema-last-ish than OBOE or Prometheus. Extending an example from the Fall, EQ gives:
fruit - green
bark - brown
leaves - yellow
leaves - ridged
leaves - broad

and OBOE gives

fruit - colour - green
bark - colour - brown
leaves - colour - yellow
leaves - perimeter texture - ridged
leaves - basic shape - broad
So in the OBOE case, the characteristics (color, perimeter texture, basic shape) are given a priori, while in the EQ case they would (presumably) be abstracted during subsequent ontology development. In theory, these two approaches may be isomorphic, since, presumably, the OBOE characteristics are also abstracted from examples collected as part of the requirements gathering process. In practice, though, I suspect that EQ leaves more scope for instance-informed schemas. I have no basis for this suspicion other than intuition, and would welcome any evidence or references that anyone can provide.
Also, schema-last could perhaps be a guiding philosophy as we seek to put in place a mechanism for facilitating ontology update and evolution. For example, it might be worth experimenting with tag-driven ontology evolution, as in [1], where tags are associated to concepts in an ontology. If a tag can't be mapped into the ontology, the ontology engineer takes this as a clue that the ontology needs revision. So the domain expert/knowledge engineer partnership is preserved, but with the domain expert role being replaced by collective wisdom from the community. Passant's focus was information retrieval, where the only reasoning is using subsumption hierarchies to expand the scope of a query, but the principle should apply to other reasoning tasks as well. The example in my mind is using a DL representation of SDD as the basis for polyclave keys. When users enter terms not in the ontology, it would trigger a process that could lead to ontology update.
I don't dispute the importance of involving individual domain experts, especially at the beginning, but also throughout the process. And I agree that catalyzing this process is, indeed, a job for TDWG.
Joel.
On Tue, 15 Feb 2011, Hilmar Lapp wrote:
Hi Joel -
I'm in full agreement re: importance of generating instance data as driving principle in developing an ontology. This is the case indeed in all the OBO Foundry ontologies I'm familiar with, in the form of data curation needs driving ontology development. Which is perhaps my bias as to why I treat this as implicit.
That being said, it has also been found that in specific subject areas progress can be made fastest if you convene a small group of domain experts and simply model the knowledge about those subject areas, rather than doing so piecemeal in response to data curation needs.
BTW I don't think Freebase is a good example here. I don't think the model of intense centralized data and vocabulary curation that it employs is tenable within our domain, and I have a hard time imagining how schema-last would not result in an incoherent data soup otherwise. But then perhaps I just don't understand what you mean by schema-last.
-hilmar
Sent with a tap.
On Feb 15, 2011, at 8:24 PM, joel sachs jsachs@csee.umbc.edu wrote:
Hi Hilmar,
No argument from me, just my prejudice against "solution via ontology", and my enthusiasm for "schema-last" - the idea that the schema reveals itself after you've populated the knowledge base. This was never really possible with relational databases, where a table must be defined before it can be populated. But graph databases (especially the "anyone can say anything" semantic web) practically invite a degree of schema-last. Examples include Freebase (schema-last by design), and FOAF, whose specification is so widely ignored and mis-used (often to good effect), that the de-facto spec is the one that can be abstracted from FOAF files in the wild.
The semantic web is littered with ontologies lacking instance data; my hope is that generating instance data is a significant part of the ontology building process for each of the ontologies proposed by the report. By "generating instance data" I mean not simply marking up a few example records, but generating millions of triples to query over as part of the development cycle. This will indicate both the suitability of the ontology to the use cases, and also its ease of use.
I like the order in which the GBIF report lists its infrastructure recommendations. Persistent URIs (the underpinning of everything); followed by competency questions and use cases (very helpful in the prevention of mental masturbation); followed by OWL ontologies (to facilitate reasoning). Perhaps the only place where we differ is that you're comfortable with "incorporate instance data into the ontology design process" being implicit, while I never tire of seeing that point hammered home.
Regards - Joel.
On Mon, 14 Feb 2011, Hilmar Lapp wrote:
On Feb 14, 2011, at 12:05 PM, joel sachs wrote:
I think the recommendations are heavy on building ontologies, and light on suggesting paths to linked data representations of instance data.
Good observation. I can't speak for all of the authors, but in my experience building Linked Data representations is mostly a technical problem, and thus much easier compared to building soundly engineered, commonly agreed upon ontologies with deep domain knowledge capture. The latter is hard, because it requires overcoming a lot of social challenges.
As for the GBIF report, personally I think linked biodiversity data representations will come at about the same pace whether or not GBIF pushes on that front (though GBIF can help make those representations better by provisioning stable resolvable identifier services, URIs etc). There is a unique opportunity though for "neutral" organizations such as GBIF (or, in fact, TDWG), to significantly accelerate the development of sound ontologies by catalyzing the community engagement, coherence, and discourse that is necessary for them.
-hilmar
Hi,
Within the database community, schema first refers to having to fix a data structure (like the attributes in a relational table) before adding data, whereas schema last refers to being able to add schema after data have been added. So, at least to me w.r.t. this more traditional use of the phrase, saying "schema last" isn't quite the right usage, although I think I understand what you are trying to get at.
In general, I don't usually equate an ontology with the notion of a "schema" per se ... For example, we typically use OBOE together with an annotation language, but in doing so, OBOE is not used as the data storage language. Datasets are stored in their native format (tabular datasets), and the annotations can be thought of as specifying views over the underlying data tables. One can then query the underlying data through the views specified by the annotation language (e.g., for data discovery), but never have to explicitly store data as OBOE instances.
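As a rough illustration of the annotation-as-view idea (the annotation predicates below are invented for the sketch, not the actual OBOE annotation language, and the OBOE class names are placeholders):

  # Underlying data stays in its native tabular form, e.g. a CSV:
  #   site , dbh
  #   A1   , 34.2

  @prefix ann:  <http://example.org/annotation#> .   # hypothetical annotation vocabulary
  @prefix oboe: <http://example.org/oboe#> .          # placeholder OBOE namespace

  ann:dbhColumn
      ann:annotatesAttribute     "dbh" ;          # which column is described
      ann:observedEntity         oboe:Tree ;      # what kind of thing was observed
      ann:measuredCharacteristic oboe:Diameter .  # what was measured about it

  # A discovery query for "diameter measurements of trees" can be answered
  # through this view without ever materializing the rows as OBOE instances.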
For example, we could require that all instance data be validated with an ontology, and not have a mechansim for updating the ontology in response to the frustrations of our users.
This statement seems contrary to the use of the OWL framework ...
Shawn
Shawn,
I'm not sure if we're agreeing. Comments below ...
On Sat, 19 Feb 2011, Shawn Bowers wrote:
Hi,
Within the database community, schema first refers to having to fix a data structure (like the attributes in a relational table) before adding data, whereas schema last refers to being able to add schema after data have been added.
Even in an rdbms, you can add schema after data, for example with "Alter Table". RDF is different in that i) the schema can be distributed, and ii) the schema definition and data definition languages are the same. So it is literally true that "schema is data too". So, to the extent that it makes sense to use a term like schema-last, it seems reasonable to apply it to practices rather than languages.
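A small Turtle illustration of "schema is data too" - instance triples first, vocabulary triples added later to the same graph (all names are made up):

  @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix ex:   <http://example.org/sightings#> .   # made-up vocabulary

  # Day one: just record what was seen; no schema exists yet.
  ex:obs1 ex:sawTaxon ex:taxon123 ;
          ex:sawAt    "Lake Park" .

  # Months later: the schema arrives as ordinary triples in the same graph.
  ex:sawTaxon a rdf:Property ;
              rdfs:label "saw taxon" ;
              rdfs:range ex:SpeciesConcept .

  ex:taxon123 a ex:SpeciesConcept .

Nothing had to be altered or migrated when the schema showed up, which is the sense in which the practice, not the language, is schema-last.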
The first several years of the semantic web were hampered by the attitude that, to be on the semantic web, you first need an ontology. My guess is that many in TDWG believe this to be true, due to the emphasis we've placed on ontologies over the years. "Ontologies where necessary, but not necessarily ontologies" strikes me as a good motto for the semantic web.
So, at least to me w.r.t. this more traditional use of the phrase, saying "schema last" isn't quite the right usage, although I think I understand what you are trying to get at.
In general, I don't usually equate an ontology with the notion of a "schema" per se ... For example, we typically use OBOE together with an annotation language, but in doing so, OBOE is not used as the data storage language. Datasets are stored in their native format (tabular datasets), and the annotations can be thought of as specifying views over the underlying data tables. One can then query the underlying data through the views specified by the annotation language (e.g., for data discovery), but never have to explicitly store data as OBOE instances.
Could you point me to endpoints where I can query data via the OBOE ontology?
For example, we could require that all instance data be validated with an ontology, and not have a mechansim for updating the ontology in response to the frustrations of our users.
This statement seems contrary to the use of the OWL framework ...
It's contrary to common sense, but compatible with OWL. If I'm exaggerating about ontologies not being responsive to the frustrations of their users, it's because most ontologies don't have users. I'll check Swoogle for some statistics to back that up, but does anyone really dispute it?
Joel.
Joel:
On Feb 21, 2011, at 4:51 PM, joel sachs wrote:
most ontologies don't have users. I'll check Swoogle for some statistics to back that up, but does anyone really dispute it?
I'm not sure that's a useful statement by itself. It is akin to saying that most software source code doesn't have users, and therefore the way we think about software is flawed.
So, of course if you count any ontology that has ever been started by anyone, the majority of those will likely not have users. That doesn't mean at all that that is necessarily also so for each and every community of practice. Most of the ontologies in the OBO Foundry/Library do have users, and publications arising from that.
And what does that then mean for TDWG / Biodiversity ontologies, if you mean to say that most of those do not have users? I don't claim to know, but I think it does go to suggest 3 things: 1) Ontologies created by a narrow (not the same as small) group of people and intended to be used by many will likely end up not getting used at all. 2) To get domain scientists engaged in ontology development at breadth, training and community are not dispensable. 3) Ontology building is time consuming, and merely talking about ontologies, or developing ontologies for the sake of having developed ontologies, doesn't justify anyone's time investment. But using them to demonstrate biological discovery does.
I"m a big fan of LOD, in particular *because* it does not require full- blown ontologies for entry. I'm hugely in favor of de-siloing data, and LOD has much promise in this regard by applying the ultimate normalization. But we should also not fool ourselves into believing that somehow normalizing all data into triple form will let us discover new knowledge. I have yet to see the paper that reports a scientific discovery from a flat vocabulary LOD-style RDF integration that you couldn't have achieved in a fraction of the time by cobbling together a database schema and some massaging scripts.
-hilmar
Hilmar,
You're a fan of LOD who sees a number of use cases where deep domain ontologies play a crucial role. So am I. Our differences are irreconcilable!
If we're arguing, it could be because we differ on what those use cases are, and, generally, how to characterize them. My sense, with regard to the ontologies recommended by the GBIF KOS report:
Darwin Core: As I've been arguing, I wouldn't get too carried away here.
SDD: This is a great example of a good match for description logics. SDD expressed as OWL2-DL *could* be a path towards robust polyclave keys, which (I believe) have long been a goal not just for citizen science, but for field identification in general. I stress "could" above, because I'm a little surprised it hasn't happened yet. There was movement in that direction at least as far back as 2005 [http://dcpapers.dublincore.org/ojs/pubs/article/viewFile/808/804], and I heard the idea discussed back at the Montpellier VoCamp. I don't know if lack of progress here is because of lack of funding, or because the problem is a lot harder than it at first appears. I'd love to see a concerted effort in this direction, starting modestly, focusing on a small taxonomic group for which there is already a lot of SDD instance data. (This would, IMHO, make a strong funding proposal.)
Taxonomic treatments: I don't know a lot about this, but, as I previously indicated, I think that ontologies for the artifacts of human behaviour should be less constrained than ontologies for the natural world.
SPM: We're mostly talking here about the resources that humans want and expect in a species description. This should be straightforward.
Moving beyond the report to characterizing the ontological needs of general use cases:
DATA INTEGRATION: Recourse to upper level ontologies for data integration has so far proved to be of limited utility. Can anyone point me to examples of this approach working for anything other than contrived examples, or narrow domain areas? Maybe OBOE will be the first to succeed.
DISCOVERING NEW KNOWLEDGE (as envisioned by Einstein and the Queen in last year's KR classic, http://www.xtranormal.com/watch/7471601/): This is one of the most potentially exciting areas of the semantic web, and has been for ten years. Consider two approaches to answering the query "Find occurrences of invasive species."
i. In the ETHAN ontology, we define a taxon as invasive by asserting it to be a subClass of a class of invaders, like the class "GISDThing". So querying for occurrences of invaders simply involves looking for occurrences, and then doing subsumption reasoning over the Invasives ontology and branches of the Tree of Life. (A minimal sketch of this approach follows after this list.)
ii. What if, instead, we defined an invader as any species which has a definite tendency to expand its range into areas where it is unwanted (Thorpe's definition)? Could we still answer the query "Find occurrences of invasive species."? To do so with this definition would, potentially, involve the discovery of a new scientific fact, the discovery that a species, previously thought benign, is, in fact, invasive. Is there the prospect of being able to do this? Maybe. It would take a lot of work (and would be another good funding proposal).
(I do realize that the line between data integration and discovering new knowledge is blurry. If you can integrate the data, you can apply exploratory data mining techniques to discover new knowledge. So by discovering new knowledge via ontologies, I (like Einstein and the Queen) mean that it's the OWL reasoner itself that's making the discoveries.)
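A minimal Turtle sketch of approach (i); the class names, the GISDThing idiom, and the occurrence predicate are written in the spirit of ETHAN rather than copied from it:

  @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix ex:   <http://example.org/invasives#> .   # placeholder namespace

  # Invasiveness asserted by subclassing into the class of listed invaders:
  ex:Dreissena_polymorpha rdfs:subClassOf ex:GISDThing .   # zebra mussel, a listed invader

  # An occurrence record:
  ex:occ987 ex:recordsOrganism ex:organism42 .
  ex:organism42 rdf:type ex:Dreissena_polymorpha .

  # "Find occurrences of invasive species" then reduces to: find occurrences
  # whose organism is classified (by subsumption) as an ex:GISDThing.
  # The reasoner classifies ex:organism42 accordingly, so ex:occ987 matches --
  # no new scientific fact is needed, unlike definition (ii).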
Before responding to a couple of your specific comments, I want to stress for anyone following that there is (or should be) no tension between LOD and ontologies. LOD is simply the RESTful way to do the semantic web, and is the current semantic web best practice. So whether our semantic web rests on fancy ontologies or simple ones, we all (I think) agree that the ontologies and instance data should be published according to best practices.
Further comments below ...
On Mon, 21 Feb 2011, Hilmar Lapp wrote:
Joel:
On Feb 21, 2011, at 4:51 PM, joel sachs wrote:
most ontologies don't have users. I'll check Swoogle for some statistics to back that up, but does anyone really dispute it?
I'm not sure that's a useful statement by itself. It is akin to saying that most software source code doesn't have users, and therefore the way we think about software is flawed.
True. I apologize for trying to pull a fast one. (Although the way most people think about software *is* flawed.)
So, of course if you count any ontology that has ever been started by anyone, the majority of those will likely not have users. That doesn't mean at all that that is necessarily also so for each and every community of practice. Most of the ontologies in the OBO Foundry/Library do have users, and publications arising from that.
And what does that then mean for TDWG / Biodiversity ontologies, if you mean to say that most of those do not have users? I don't claim to know, but I think it does go to suggest 3 things: 1) Ontologies created by a narrow (not the same as small) group of people and intended to be used by many will likely end up not getting used at all. 2) To get domain scientists engaged in ontology development at breadth, training and community are not dispensable. 3) Ontology building is time consuming, and merely talking about ontologies, or developing ontologies for the sake of having developed ontologies, doesn't justify anyone's time investment. But using them to demonstrate biological discovery does.
I agree with all the above. The only point I would add (which is how this conversation got started) is 4) Ontologies that are developed without generating significant amounts of instance data as part of the development spiral start life with two strikes against them.
I'm a big fan of LOD, in particular *because* it does not require full-blown ontologies for entry. I'm hugely in favor of de-siloing data, and LOD has much promise in this regard by applying the ultimate normalization. But we should also not fool ourselves into believing that somehow normalizing all data into triple form will let us discover new knowledge. I have yet to see the paper that reports a scientific discovery from a flat vocabulary LOD-style RDF integration that you couldn't have achieved in a fraction of the time by cobbling together a database schema and some massaging scripts.
You can always cobble something together if you happen to know where the data is, and have easy access to it. LOD exposes data.
Have you seen any papers that report a scientific discovery from fancy ontologies *on the semantic web*? The Washington paper, which we both think is a good example of ontologies at work, doesn't mention RDF, OWL, or the semantic web. Ontologies predate the semantic web, and live fine without it. Many of us are working at migrating Washington's approach onto the semantic web. Time will tell if description logics distribute well.
Joel.
-hilmar
--
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
I stress "could" above, because I'm a little surprised it hasn't happened yet. There was movement in that direction at least as far back as 2005 [http://dcpapers.dublincore.org/ojs/pubs/article/viewFile/808/804], and I heard the idea discussed back at the Montpellier VoCamp. I don't know if the lack of progress here is because of lack of funding, or because the problem is a lot harder than it at first appears. I'd love to see a concerted effort in this direction, starting modestly, focusing on a small taxonomic group for which there is already a lot of SDD instance data. (This would, IMHO, make a strong funding proposal.)
It is lack of funding and personnel resources. The SDD working group did realize as far back as ca. 2000 (Noel Cross was the first to point that out to me) that RDF is a match for SDD. However, it was really difficult to grasp the implications at that time; we found that, rather than making progress on the subject domain (building on DELTA and slightly modernizing it), we were hitting a wall. So we purposely decided to stay closer to object-oriented modeling and map this to XML Schema. It took us 5 years to gather the necessary competence in XML Schema to finalize SDD, and when at the end of that period the leader of TDWG made the decision that everything needed to be re-done in RDF, we were exhausted and in fact almost all SDD core members were out of funding for identification or descriptive work.
I welcome any initiative to create an OWL-compatible form of the core concepts of SDD - with or without my participation.
Gregor
On Feb 25, 2011, at 8:56 AM, Gregor Hagedorn wrote:
I welcome any initiative to create an OWL-compatible form of the core concepts of SDD - with or without my participation.
Sounds like another great target for a VoCamp. Were any of you guys (meaning SDD folks) at our 2009 VoCamp?
-hilmar
Hi,
More comments below ...
On Mon, Feb 21, 2011 at 1:51 PM, joel sachs jsachs@csee.umbc.edu wrote:
Shawn,
I'm not sure if we're agreeing. Comments below ...
On Sat, 19 Feb 2011, Shawn Bowers wrote:
Hi,
Within the database community, schema first refers to having to fix a data structure (like the attributes in a relational table) before adding data, whereas schema last refers to being able to add schema after data have been added.
Even in an RDBMS, you can add schema after data, for example with ALTER TABLE.
While this is true, it is still "schema-first" since you can't add some data to a table without there already being a column in the table to store the data into!
RDF is different in that i) the schema can be distributed, and ii) the schema definition and data definition languages are the same. So it is literally true that "schema is data too". So, to the extent that it makes sense to use a term like schema-last, it seems reasonable to apply it to practices rather than languages.
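Concretely (a toy sketch with rdflib, using a made-up ex: namespace rather than any particular vocabulary): the "schema" can arrive as just another chunk of RDF, in the same syntax, possibly from a different source than the data.

from rdflib import Graph

data = """
@prefix ex: <http://example.org/> .
ex:obs1 a ex:Lichen .
"""

schema = """
@prefix ex: <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
ex:Lichen rdfs:subClassOf ex:Organism .
"""

g = Graph()
g.parse(data=data, format="turtle")    # instance data first
g.parse(data=schema, format="turtle")  # schema afterwards, same language, same graph
print(len(g))  # 2 triples; the store itself doesn't distinguish schema from data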
I agree that it makes sense to apply the ideas to practices (although my impression is that the phrase is primarily about data models within the database community).
I think "type" or "semantics" is a better term here than "schema" (which to me implies data structure/storage structure, but usually only minimal constraints). So, e.g., "semantics-later" versus "semantics-first".
Could you point me to endpoints where I can query data via the OBOE ontology?
Nothing that is publicly available at this time. We have created a couple of prototypes of querying datasets through the ontology, one of which is ObsDB (a recent paper published at e-science 2010 on this system can be found here: http://www.cs.gonzaga.edu/~bowers/papers/escience-2010.pdf) and another for evaluating different query algorithms for efficiently answering similar queries over large repositories such as the KNB. There is also a web-based query UI for querying annotated datasets, but it doesn't yet expose the datasets as instance data. We're working on these tools now within the Semtools and SONet projects.
For example, we could require that all instance data be validated with an ontology, and not have a mechanism for updating the ontology in response to the frustrations of our users.
This statement seems contrary to the use of the OWL framework ...
It's contrary to common sense, but compatible with OWL. If I'm exaggerating about ontologies not being responsive to the frustrations of their users, it's because most ontologies don't have users. I'll check Swoogle for some statistics to back that up, but does anyone really dispute it?
I think an obvious counterexample is GO and its associated ontologies, which seem to be heavily used for a number of different purposes.
Shawn
Joel.
Shawn
On Sat, Feb 19, 2011 at 6:32 PM, joel sachs jsachs@csee.umbc.edu wrote:
Hi Shawn,
On Thu, 17 Feb 2011, Shawn Bowers wrote:
Hi Joel,
I think the OWL model in general is "schema-last".
A good point, although I would phrase it differently and say that RDF, in general, is highly compatible with schema-last. But it's also compatible with decidedly non schema-last practices. For example, we could require that all instance data be validated with an ontology, and not have a mechanism for updating the ontology in response to the frustrations of our users.
Anyway, I'd be happy to stop using the phrase, and instead talk about specifics of what our ontologies should look like, and where they should come from.
Regards, Joel.
In particular, the only fixed "schema" is the triple model (subject, predicate, object), and one can add and remove triples as needed. I don't think OBOE or EQ (or any other OWL ontology) is any more schema-first versus schema-last than the other -- since they are based on OWL/RDF. Alternatively, a particular dataset (with specific attributes) is a typical example of "schema first", i.e., before I can store data rows, I have to define the attributes (so this would be true in, e.g., Darwin Core). In both OBOE and EQ, one could have a set of triples, and then come along later at any time and add triples that give type information to existing individuals, etc. Both OBOE and EQ do introduce classes that prescribe how to structure new classes and type individuals -- but it would be really hard given this to say one is more "schema last" than the other because of these basic upper-level classes.
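As a small illustration of that last point (rdflib, with invented ex: terms rather than the actual OBOE or EQ classes): type information can be bolted onto existing individuals at any time, and a subsumption-aware lookup then picks up the old data.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# Day 1: bare observation triples, no typing at all.
g.add((EX.m1, EX.hasValue, Literal("green")))
g.add((EX.m1, EX.ofEntity, EX.fruit17))
print(list(g.subjects(RDF.type, EX.Measurement)))   # [] -- nothing typed yet

# Much later: add the types and a bit of hierarchy after the fact.
g.add((EX.m1, RDF.type, EX.ColourMeasurement))
g.add((EX.ColourMeasurement, RDFS.subClassOf, EX.Measurement))

# Walking rdfs:subClassOf backwards from Measurement now finds m1.
for cls in g.transitive_subjects(RDFS.subClassOf, EX.Measurement):
    for individual in g.subjects(RDF.type, cls):
        print(individual, "is a", cls)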
Shawn
On Thu, Feb 17, 2011 at 11:28 AM, joel sachs jsachs@csee.umbc.edu wrote:
Hilmar,
I guess I'm now guilty of conflating concepts myself, namely "instance-data generation as an integral component of the ontology development spiral", and "schema last". They're distinct, but related in the sense that the latter can be seen as an extreme case of the former. Separating them:
Instance Data. Do you (or does anyone else on the list) know the status of OBD? From the NCBO FAQ:
What is OBD? OBD is a database for storing data typed using OBO ontologies
Where is it? In development!
Is there a demo? See http://www.fruitfly.org/~cjm/obd
Datasets: See the above URL for now
But the demo link is broken, and it's hard to find information on OBD that isn't a few years old. Is it still the plan to integrate OBD into BioPortal? If not, then maybe the "Missing Functionality [of BioPortal]" section of the KOS report should include a subsection about providing access to instance data. Considering GBIF's data holdings, it seems like it would be a shame to not integrate data browsing into any ontology browsing infrastructure that GBIF provides.
Schema Last. I think schema-last is a malleable enough buzzword that we can hijack it slightly, and I've been wondering about what it should mean in the context of TDWG ontologies. Some ontology paradigms are inherently more schema-last-ish than others. For example, EQ strikes me as more schema-last-ish than OBOE or Prometheus. Extending an example from the Fall, EQ gives:
fruit - green
bark - brown
leaves - yellow
leaves - ridged
leaves - broad
and OBOE gives
fruit - colour - green
bark - colour - brown
leaves - colour - yellow
leaves - perimeter texture - ridged
leaves - basic shape - broad
So in the OBOE case, the characteristics (colour, perimeter texture, basic shape) are given a priori, while in the EQ case they would (presumably) be abstracted during subsequent ontology development. In theory, these two approaches may be isomorphic, since, presumably, the OBOE characteristics are also abstracted from examples collected as part of the requirements gathering process. In practice, though, I suspect that EQ leaves more scope for instance-informed schemas. I have no basis for this suspicion other than intuition, and would welcome any evidence or references that anyone can provide.
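For what it's worth, here is roughly how the two patterns look as triples (rdflib again, with invented ex: terms standing in for the real EQ and OBOE vocabularies):

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")
g = Graph()

# EQ-ish: the quality hangs directly off the entity; "colour" as a
# characteristic would only be abstracted later.
g.add((EX.fruit1, EX.hasQuality, EX.green))

# OBOE-ish: the characteristic is part of the structure from the start.
g.add((EX.obs1, RDF.type, EX.Observation))
g.add((EX.obs1, EX.ofEntity, EX.fruit1))
g.add((EX.obs1, EX.hasMeasurement, EX.meas1))
g.add((EX.meas1, EX.ofCharacteristic, EX.Colour))
g.add((EX.meas1, EX.hasValue, Literal("green")))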
Also, schema-last could perhaps be a guiding philosophy as we seek to put in place a mechanism for facilitating ontology update and evolution. For example, it might be worth experimenting with tag-driven ontology evolution, as in [1], where tags are associated with concepts in an ontology. If a tag can't be mapped into the ontology, the ontology engineer takes this as a clue that the ontology needs revision. So the domain expert/knowledge engineer partnership is preserved, but with the domain expert role being replaced by collective wisdom from the community. Passant's focus was information retrieval, where the only reasoning is using subsumption hierarchies to expand the scope of a query, but the principle should apply to other reasoning tasks as well. The example in my mind is using a DL representation of SDD as the basis for polyclave keys. When users enter terms not in the ontology, it would trigger a process that could lead to ontology update.
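The trigger itself could start out very crude. Something along these lines (a hypothetical helper, not Passant's actual mechanism, and assuming the ontology's concepts carry rdfs:label annotations) would already surface the tags that point at gaps:

from rdflib import Graph
from rdflib.namespace import RDFS

def unmapped_tags(ontology: Graph, tags):
    """Return the user-entered tags that match no rdfs:label in the ontology."""
    labels = {str(o).strip().lower() for o in ontology.objects(None, RDFS.label)}
    return [t for t in tags if t.strip().lower() not in labels]

# e.g. unmapped_tags(sdd_ontology, ["ridged", "scabrous", "bullate"])
# Anything returned is a hint to the ontology engineer that revision is needed.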
I don't dispute the importance of involving individual domain experts, especially at the beginning, but also throughout the process. And I agree that catalyzing this process is, indeed, a job for TDWG.
Joel.
On Tue, 15 Feb 2011, Hilmar Lapp wrote:
Hi Joel -
I'm in full agreement re: importance of generating instance data as driving principle in developing an ontology. This is the case indeed in all the OBO Foundry ontologies I'm familiar with, in the form of data curation needs driving ontology development. Which is perhaps my bias as to why I treat this as implicit.
That being said, it has also been found that in specific subject areas progress can be made fastest if you convene a small group of domain experts and simply model the knowledge about those subject areas, rather than doing so piecemeal in response to data curation needs.
BTW I don't think Freebase is a good example here. I don't think the model of intense centralized data and vocabulary curation that it employs is tenable within our domain, and I have a hard time imagining how schema-last would not result in an incoherent data soup otherwise. But then perhaps I just don't understand what you mean by schema-last.
-hilmar
Sent with a tap.
On Feb 15, 2011, at 8:24 PM, joel sachs jsachs@csee.umbc.edu wrote:
> Hi Hilmar,
>
> No argument from me, just my prejudice against "solution via ontology", and my enthusiasm for "schema-last" - the idea that the schema reveals itself after you've populated the knowledge base. This was never really possible with relational databases, where a table must be defined before it can be populated. But graph databases (especially the "anyone can say anything" semantic web) practically invite a degree of schema-last. Examples include Freebase (schema-last by design), and FOAF, whose specification is so widely ignored and mis-used (often to good effect), that the de-facto spec is the one that can be abstracted from FOAF files in the wild.
>
> The semantic web is littered with ontologies lacking instance data; my hope is that generating instance data is a significant part of the ontology building process for each of the ontologies proposed by the report. By "generating instance data" I mean not simply marking up a few example records, but generating millions of triples to query over as part of the development cycle. This will indicate both the suitability of the ontology to the use cases, and also its ease of use.
>
> I like the order in which the GBIF report lists its infrastructure recommendations. Persistent URIs (the underpinning of everything); followed by competency questions and use cases (very helpful in the prevention of mental masturbation); followed by OWL ontologies (to facilitate reasoning). Perhaps the only place where we differ is that you're comfortable with "incorporate instance data into the ontology design process" being implicit, while I never tire of seeing that point hammered home.
>
> Regards - Joel.
>
> On Mon, 14 Feb 2011, Hilmar Lapp wrote:
>
>> On Feb 14, 2011, at 12:05 PM, joel sachs wrote:
>>
>>> I think the recommendations are heavy on building ontologies, and light on suggesting paths to linked data representations of instance data.
>>
>> Good observation. I can't speak for all of the authors, but in my experience building Linked Data representations is mostly a technical problem, and thus much easier compared to building soundly engineered, commonly agreed upon ontologies with deep domain knowledge capture. The latter is hard, because it requires overcoming a lot of social challenges.
>>
>> As for the GBIF report, personally I think linked biodiversity data representations will come at about the same pace whether or not GBIF pushes on that front (though GBIF can help make those representations better by provisioning stable resolvable identifier services, URIs etc). There is a unique opportunity though for "neutral" organizations such as GBIF (or, in fact, TDWG), to significantly accelerate the development of sound ontologies by catalyzing the community engagement, coherence, and discourse that is necessary for them.
>>
>> -hilmar
>> --
>> : Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
participants (7)
- Gregor Hagedorn
- Hilmar Lapp
- joel sachs
- Paul Murray
- Peter DeVries
- Shawn Bowers
- Steve Baskauf