A plea around basisOfRecord (Was: Proposed new Darwin Core terms - abundance, abundanceAsPercent)
It's been a couple of weeks, but I said I'd try to write something about a more general concern I have around the way we use basisOfRecord and dcterms:type to hold values like occurrence, event and materialSample. This is something that has concerned me for years and that, I worry, is making everything we all do much messier than it need be.
I believe that the way we have come to use Darwin Core basisOfRecord is confused and unhelpful. I really wish we used Darwin Core like this:
1. basisOfRecord should be used ONLY to indicate the type of evidence that lies behind a record, which is a key aspect of whether the record is likely to be useful for different purposes
2. basisOfRecord values should be taken from a hierarchical vocabulary with three main branches:
a. specimens (i.e. biological material that can be reviewed), with a hierarchy of subordinate values such as pinnedSpecimen, herbariumSheet, etc.
b. derived, non-biological evidence (not sure what name), with a hierarchy of subordinate values such as dnaSequence, soundRecording, stillImage, etc.
c. asserted observations with no revisitable evidence other than the authority of the observer
3. TDWG should deliver a basic ontology in the form of a graph of key relationships between the most significant conceptual entities in our world (TaxonName, TaxonConcept, Identification, Collection, Specimen, Locality, Agent, ...)
4. This ontology should not attempt to map all the complexity of biodiversity-related data, just provide the high-level map and key relationships (TaxonConcept hasName TaxonName, Specimen heldIn Collection, etc.); it should leave definition of other properties as a separate, open-ended activity for the community
5. This ontology should be reviewed at regular intervals and versioned as necessary to address critical gaps, provided that backwards compatibility is maintained (splitting a class into multiple constituent classes probably won't break anything, so start simple)
6. The Darwin Core vocabulary should be published as a flat, open-ended list of terms with clear definitions that can be freely combined as columns in denormalised records
7. Every Darwin Core term should be documented to be tightly associated with a single, fixed class in the ontology (e.g. scientificName and specificEpithet are ALWAYS considered to be properties of a TaxonName whether or not that TaxonName object is clearly referenced or separated out)
8. Every data publisher should be encouraged to share all relevant data elements in their source data in the most convenient normalised or denormalised form, provided they use the recognised Darwin Core properties for elements that match the definition for those terms, and provided they give some metadata for other elements. Possible forms include:
a. A completely hierarchical, ABCD-like, XML representation
b. A completely flat denormalised, simple-DwC-like, CSV representation, if the data includes no elements with higher cardinality
c. A set of flat, relational, CSV representations, as with Darwin Core Archive star schemas, but with freedom to have more complex graphed relationships as needed
9. Each table of CSV data in 8b and 8c is a view that corresponds to a linear subgraph of the TDWG ontology, identified by the classes of the DwC properties used; this allows us to infer the shape of the data in terms of the ontology
10. If we do this, we do not need to worry about whether a record is a checklist record, an event, an occurrence, a material sample or whatever else, although we could use the dcterms:type property, or some new property, to hold this detail as a further clue to intent and possible use for the record
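To make items 1 and 2 concrete, the hierarchical vocabulary could be held as a simple child-to-parent map. This is only a sketch: the three branch names (specimen, derivedEvidence, assertedObservation) and the helper functions are assumptions made for illustration, not existing Darwin Core terms; only the leaf examples come from the list above.

```python
# Hypothetical hierarchical basisOfRecord vocabulary, stored as child -> parent.
# Branch names are placeholders (the proposal itself says "not sure what name").
PARENT = {
    "specimen": None,
    "derivedEvidence": None,
    "assertedObservation": None,
    "pinnedSpecimen": "specimen",
    "herbariumSheet": "specimen",
    "dnaSequence": "derivedEvidence",
    "soundRecording": "derivedEvidence",
    "stillImage": "derivedEvidence",
}


def branch_of(value):
    """Return the top-level branch that a basisOfRecord value belongs to."""
    if value not in PARENT:
        raise KeyError(f"unknown basisOfRecord value: {value}")
    while PARENT[value] is not None:
        value = PARENT[value]
    return value


def is_a(value, ancestor):
    """True if value equals ancestor or sits somewhere below it."""
    while value is not None:
        if value == ancestor:
            return True
        value = PARENT.get(value)
    return False
```

A consumer that only wants revisitable biological material could then filter on `is_a(value, "specimen")` without knowing every subordinate value in advance.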
Here is an example. In today's terms, what sort of DwC record is this? Do I really have to replace recordId with eventId, occurrenceId or similar? And which should I choose?
recordId, decimalLatitude, decimalLongitude, coordinatePrecision, eventDate, scientificName, individualCount
I think it is clear that this record tells us that there was a recording event at a particular time and place where someone or some process recorded a given number of individual organisms which were identified as representatives of a taxon concept with a name corresponding to the supplied scientific name. In other words, this gives us some properties from a subgraph that might include, say, instances of TDWG Event, Locality, Date, Occurrence, Identification, TaxonConcept and TaxonName classes. None of these is specifically referenced but we can unambiguously fold the flat record onto the ontology. We can moreover then use the combination of supplied elements to decide whether this record would be of interest to GBIF, a national information facility, a tool cataloguing uses of scientific names, etc. The same will also apply if multiple CSV tables are provided as in 8c.
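The folding step can be sketched in a few lines, assuming that each term has been documented against a single class as in item 7. The term-to-class assignments below are guesses for illustration, not an agreed TDWG mapping, and the record values are made up.

```python
# Assumed term -> class assignments (item 7); illustrative only.
TERM_CLASS = {
    "decimalLatitude": "Locality",
    "decimalLongitude": "Locality",
    "coordinatePrecision": "Locality",
    "eventDate": "Event",
    "scientificName": "TaxonName",
    "individualCount": "Occurrence",
}


def fold(record):
    """Group the columns of one flat record by the class each term belongs to.

    Returns (folded, unmapped): folded maps class name -> {term: value},
    unmapped holds columns with no documented class (e.g. recordId).
    """
    folded, unmapped = {}, {}
    for term, value in record.items():
        cls = TERM_CLASS.get(term)
        if cls is None:
            unmapped[term] = value
        else:
            folded.setdefault(cls, {})[term] = value
    return folded, unmapped


# Made-up values for the seven columns in the example record.
record = {
    "recordId": "r1",
    "decimalLatitude": "55.70",
    "decimalLongitude": "12.56",
    "coordinatePrecision": "0.01",
    "eventDate": "2013-10-13",
    "scientificName": "Parus major",
    "individualCount": "3",
}
folded, unmapped = fold(record)
```

Here `folded` names the classes the record touches (Locality, Event, TaxonName, Occurrence) without the publisher ever declaring a record type, and `recordId` is left as a plain row identifier.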
I have thought about this for a long time and cannot yet think of an area in which this would not work efficiently and unambiguously for all concerned. There are some cases where multiple instances of the same ontology class would be referenced within a single record, which may mean more care is needed by the publisher (e.g. if an insect specimen record includes a reference to a host plant). There may be cases where automated review of the data indicates that there are impossible combinations or ambiguities that the publisher must resolve. However I believe we could use this approach to generalise all mobilisation and consumption of biodiversity data (including all the things we have addressed under ABCD, SDD, TCS, Plinian Core, etc.) and to make it genuinely possible for any data holder to share all the data they have in a form that makes sense to them, while allowing others to consume these data intelligently.
Right now, I think our confused use of basisOfRecord is almost the only thing that stops us from exploring this. We have blurred the question of the evidence for a record with the question of the shape of the record as a subgraph. These are different things. Separating them will allow us to get away from some of our unresolvable debates and open up the doors to much simpler data sharing and reuse.
Thanks,
Donald
----------------------------------------------------------------------
Donald Hobern - GBIF Director - dhobern@gbif.org
Global Biodiversity Information Facility http://www.gbif.org/
GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark
Tel: +45 3532 1471 Mob: +45 2875 1471 Fax: +45 2875 1480
----------------------------------------------------------------------
Hi Donald,
MANY thanks for this! And you are certainly not alone in your concerns about these issues. In fact, we have planned a Symposium for "Documenting DarwinCore" (https://mbgserv18.mobot.org/ocs/index.php/tdwg/2013/schedConf/trackPolicies#track11), and one of the four sessions (Session 3, to be precise) of the symposium focuses exactly on this issue of basisOfRecord/dcterms:type/etc.
Another session (Session 2) will focus on proposed and perhaps-to-be-proposed new classes (Individual, MaterialSample, Evidence), and will start out with a series of graphs illustrating the existing high-level ontology and possible alternative high-level ontologies, as you indicate in your items 3 & 4.
Aloha, Rich
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Donald Hobern [GBIF] Sent: Sunday, October 13, 2013 5:13 AM To: 'TDWG Content Mailing List' Cc: 'Chuck Miller' Subject: [tdwg-content] A plea around basisOfRecord (Was: Proposed new Darwin Core terms - abundance, abundanceAsPercent)
Thanks, Rich.
Very pleased to see this. With this encouragement, I'll say just a little bit more about why I think this is a critical need.
I see the model I describe as the perfect real-world realisation of most of the key components in the GBIO Framework (http://www.biodiversityinformatics.org/), as follows:
1. Everyone zips up whatever data they have from each resource (databases, field instruments, sequencers, data extracted from literature, checklists, whatever) into a DwC Archive using whatever DwC elements they can for data elements and describing other elements not currently recognised in DwC (the GBIO DATA layer)
2. These archives should be placed in repositories that offer basic services (DOIs, annotation services, etc.) (the GBIO CULTURE layer)
3. Harvesters assess the contents of each archive and determine what views can be supported from the supplied elements (occurrence records for GBIF, name usage records, species interactions, etc.) and catalogue these views in relevant discovery indexes (GBIF, Catalogue of Life, TraitBank, etc.) (the GBIO EVIDENCE layer)
4. Users can at any time annotate elements in the archives to provide mappings for (potentially more recently defined) DwC or other properties, opening up new options for reuse
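Step 3 is the part that lends itself most readily to a sketch. Assuming a harvester knows which terms each kind of view needs, it can decide what an archive supports purely from the columns supplied. The view names and their required terms below are hypothetical, not the actual rules used by GBIF or any other index.

```python
# Hypothetical minimum terms per view; a real harvester would define its own.
VIEW_REQUIREMENTS = {
    "occurrence": {
        "scientificName",
        "eventDate",
        "decimalLatitude",
        "decimalLongitude",
    },
    "nameUsage": {"scientificName"},
}


def supported_views(columns):
    """Return the sorted names of views whose required terms are all present."""
    available = set(columns)
    return sorted(
        view
        for view, required in VIEW_REQUIREMENTS.items()
        if required <= available
    )


# Columns from the example record earlier in this thread.
example_columns = [
    "recordId",
    "decimalLatitude",
    "decimalLongitude",
    "coordinatePrecision",
    "eventDate",
    "scientificName",
    "individualCount",
]
views = supported_views(example_columns)
```

Annotations that map a previously unrecognised column to a DwC term (step 4) would simply enlarge the set of available columns, and so potentially the list of supported views.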
Donald
-----Original Message----- From: Richard Pyle [mailto:deepreef@bishopmuseum.org] Sent: Sunday, October 13, 2013 6:49 PM To: 'Donald Hobern [GBIF]'; 'TDWG Content Mailing List' Cc: 'Chuck Miller' Subject: RE: [tdwg-content] A plea around basisOfRecord (Was: Proposed new Darwin Core terms - abundance, abundanceAsPercent)
Ahem. These zipped up archives will be obsolete approximately 1 minute after zipping, if not earlier.
How is it planned to update them as names change, names of localities change, date and code and lat and long and collector and and and and and field contents are corrected over the coming days, months, weeks and years? And then when the new version of the database, field instrument output, etc. is zipped up tomorrow, how will that replace the one zipped up the day before, and already fed into the user community, along with its previous versions of the field contents? Which version of "my" database will the harvester use? My most recently zipped up and submitted, or the one from a week ago or month ago or year ago? If the db is zipped up, how will the user annotate it and that annotation get back to the parent db?
Just a thought from out in a mud hole.
Smile.
Dan and Winnie
On Oct 13, 2013, at 1:24 PM, Donald Hobern [GBIF] dhobern@gbif.org wrote:
Thanks, Dan.
Quite correct. What I had in mind was a twin approach that supports versioning of data. For many resources, no one is making further updates to the primary data. Placing it in a repository with stable management of identifiers and annotation services would allow the whole community to collaborate in enhancing the data. For data sets with active management at source, good practice with identifiers would allow issuing of new versions at regular intervals. If every data set has a DOI and every record has a unique id within that data set, we should be able to handle these issues. Hybrid approaches with active offline management and community annotation tools are more complex but not unachievable. FilteredPush has been working on some of this. Standardising approaches to identifiers and annotations would make it all work.
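As a rough illustration of why stable record ids matter here, the sketch below compares two versions of a data set keyed by record id and carries annotations forward to the new version. The function names, the annotation structure and the sample records are all assumptions for illustration; this is not how FilteredPush or any existing repository works.

```python
def diff_versions(old, new):
    """Compare two versions of a data set, each a dict of recordId -> record."""
    old_ids, new_ids = set(old), set(new)
    return {
        "added": sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
        "changed": sorted(i for i in old_ids & new_ids if old[i] != new[i]),
    }


def carry_annotations(annotations, new):
    """Split annotations into those whose target record still exists and the rest."""
    kept = [a for a in annotations if a["recordId"] in new]
    orphaned = [a for a in annotations if a["recordId"] not in new]
    return kept, orphaned


# Two made-up versions of the same data set; r2 has had its name updated.
v1 = {
    "r1": {"scientificName": "Parus major"},
    "r2": {"scientificName": "Parus caeruleus"},
}
v2 = {
    "r1": {"scientificName": "Parus major"},
    "r2": {"scientificName": "Cyanistes caeruleus"},
    "r3": {"scientificName": "Turdus merula"},
}
annotations = [{"recordId": "r2", "note": "name should be updated"}]

changes = diff_versions(v1, v2)
kept, orphaned = carry_annotations(annotations, v2)
```

Because the record ids are unchanged between versions, a harvester can tell that r2 was corrected rather than replaced, and the annotation on r2 can be routed back to the source.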
Donald
From: Daniel Janzen [mailto:djanzen@sas.upenn.edu] Sent: Sunday, October 13, 2013 8:37 PM To: Donald Hobern [GBIF] Cc: 'Richard Pyle'; 'TDWG Content Mailing List'; 'Chuck Miller' Subject: Re: [tdwg-content] A plea around basisOfRecord (Was: Proposed new Darwin Core terms - abundance, abundanceAsPercent)
Donald,
With regards to the uncertainty about the meaning of dwc:basisOfRecord, the proposed Darwin Core RDF Guide attempts to inject clarity into the situation. It does so in two ways:
1. It allows dwc:basisOfRecord to be used with literal (text) values to allow existing implementations to expose whatever values they currently have for that term. However, it specifies that rdf:type should be used exclusively as the property for specifying URI-reference values intended to indicate the type of the subject resource. [1] There is some ambiguity about what the subject is of a dwc:basisOfRecord property (the resource, or the record about the resource?). However, there is no similar ambiguity about rdf:type, which always serves to indicate the class of which the subject resource is an instance.
2. It specifies that classes in the Darwin Core Type vocabulary namespace (dwctype: = http://rs.tdwg.org/dwc/dwctype/ ) should be used for typing resources in the biodiversity domain rather than any corresponding classes in the main Darwin Core namespace (dwc: = http://rs.tdwg.org/dwc/terms/ ). [2] In other words, if given the choice between dwc:Occurrence and dwctype:Occurrence, use dwctype:Occurrence. The guide proposes to add to the type vocabulary any classes which exist in the dwc: namespace and not in the dwctype: namespace (e.g. dwc:Identification). The intention is that the DwC type vocabulary would be what its name suggests: the vocabulary for describing types. There are some issues involving the current definitions in the type vocabulary, which I won't go into in this email. As Rich said earlier, this is a topic for one of the Documenting Darwin Core sessions at the meeting.
Although these guidelines would hold force specifically for RDF implementations, this is a convention that could be followed in other implementations.
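The two conventions above can be sketched by writing the triples out by hand in N-Triples form. The namespace URIs are the ones given in this message; the subject URI under example.org, the helper name and the literal value are assumptions for illustration only.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DWC = "http://rs.tdwg.org/dwc/terms/"
DWCTYPE = "http://rs.tdwg.org/dwc/dwctype/"


def describe(subject, type_name, basis_literal):
    """Return two N-Triples lines for one resource.

    rdf:type carries a URI from the dwctype: namespace (point 2), while
    dwc:basisOfRecord carries whatever literal text the provider has (point 1).
    """
    # Minimal escaping so the literal stays a valid quoted string.
    literal = basis_literal.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f"<{subject}> <{RDF_TYPE}> <{DWCTYPE}{type_name}> .",
        f'<{subject}> <{DWC}basisOfRecord> "{literal}" .',
    ]


# Hypothetical resource URI and legacy basisOfRecord value.
lines = describe("http://example.org/occ/1", "Occurrence", "PreservedSpecimen")
```

The type statement is unambiguous about its subject, while the basisOfRecord literal is passed through as the provider supplied it.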
Steve
[1] http://code.google.com/p/tdwg-rdf/wiki/DwcRdfGuideProposal#2.3.1.4_Other_pre...
[2] http://code.google.com/p/tdwg-rdf/wiki/DwcRdfGuideProposal#2.3.1.5_Classes_t...
Thanks, Steve.
Taking this back to the concerns I raised at the beginning, I think my concern can best be expressed by the fact that the rdf:type for many published records is not easily defined (or at least leads to arguments about whether some of the available data elements can properly apply to an object of that class). I think the majority of our records are best seen as a denormalised view of a join between instances of different classes rather than as an instance of a class.
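A small sketch of what "denormalised view of a join" means in practice: three normalised tables are joined into flat rows, and the result is not an instance of any one class. The table layouts, the key names used for joining and all values are made up for illustration.

```python
# Made-up normalised source tables.
events = {
    "e1": {
        "eventDate": "2013-10-13",
        "decimalLatitude": "55.70",
        "decimalLongitude": "12.56",
    }
}
occurrences = [{"occurrenceID": "o1", "eventID": "e1", "individualCount": "3"}]
identifications = {"o1": {"scientificName": "Parus major"}}


def flatten(occurrences, events, identifications):
    """Join each occurrence with its event and identification into one flat row."""
    rows = []
    for occ in occurrences:
        row = dict(occ)
        row.update(events.get(occ["eventID"], {}))
        row.update(identifications.get(occ["occurrenceID"], {}))
        rows.append(row)
    return rows


rows = flatten(occurrences, events, identifications)
```

The single resulting row carries properties of an Event, an Occurrence and an Identification at once, which is why assigning it one rdf:type leads to arguments.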
Your comments in your other messages about on-going TDWG work on ontologies are much appreciated. I would like to see that work carrying through to accepted recommendations and for the main Darwin Core vocabulary for the time being not to get distracted by whether the associated records are Events, Occurrences, MaterialSamples or whatever.
Thanks again.
Donald
From: Steve Baskauf [mailto:steve.baskauf@vanderbilt.edu] Sent: Monday, October 14, 2013 12:45 AM To: Donald Hobern [GBIF] Cc: 'Richard Pyle'; 'TDWG Content Mailing List'; 'Chuck Miller' Subject: Re: [tdwg-content] A plea around basisOfRecord (Was: Proposed new Darwin Core terms - abundance, abundanceAsPercent)
Donald,
With regard to the uncertainty about the meaning of dwc:basisOfRecord, the proposed Darwin Core RDF Guide attempts to inject clarity into the situation. It does so in two ways:
1. It allows dwc:basisOfRecord to be used with literal (text) values to allow existing implementations to expose whatever values they currently have for that term. However, it specifies that rdf:type should be used exclusively as the property for specifying URI-reference values intended to indicate the type of the subject resource. [1] There is some ambiguity about what the subject of a dwc:basisOfRecord property is (the resource, or the record about the resource?). However, there is no similar ambiguity about rdf:type, which always serves to indicate the class of which the subject resource is an instance.
2. It specifies that classes in the Darwin Core Type vocabulary namespace (dwctype: = http://rs.tdwg.org/dwc/dwctype/ ) should be used for typing resources in the biodiversity domain rather than any corresponding classes in the main Darwin Core namespace (dwc: = http://rs.tdwg.org/dwc/terms/ ). [2] In other words, if given the choice between dwc:Occurrence and dwctype:Occurrence, use dwctype:Occurrence. The guide proposes to add to the type vocabulary any classes which exist in the dwc: namespace and not in the dwctype: namespace (e.g. dwc:Identification). The intention is that the DwC type vocabulary would be what its name suggests: the vocabulary for describing types. There are some issues involving the current definitions in the type vocabulary, which I won't go into in this email. As Rich said earlier, this is a topic for one of the Documenting Darwin Core sessions at the meeting.
Although these guidelines would hold force specifically for RDF implementations, this is a convention that could be followed in other implementations.
Steve
[1] http://code.google.com/p/tdwg-rdf/wiki/DwcRdfGuideProposal#2.3.1.4_Other_predicates_used_to_indicate_type
[2] http://code.google.com/p/tdwg-rdf/wiki/DwcRdfGuideProposal#2.3.1.5_Classes_to_be_used_for_type_declarations_of_resources_de
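To make the two conventions in Steve's message concrete, here is a minimal sketch that writes the corresponding statements as N-Triples lines using only the Python standard library. The namespace URIs for dwc: and dwctype: are the ones quoted in the message; the subject URI, the literal value and the helper name `describe_record` are assumptions made for illustration and are not taken from the guide.

```python
# Namespace URIs quoted in the message above; rdf: is the standard W3C namespace.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DWC = "http://rs.tdwg.org/dwc/terms/"
DWCTYPE = "http://rs.tdwg.org/dwc/dwctype/"


def describe_record(subject_uri, legacy_basis_of_record):
    """Return two N-Triples lines for one resource (hypothetical helper).

    rdf:type carries a URI reference to a dwctype: class, while
    dwc:basisOfRecord only carries the literal the provider already has.
    """
    # Escape backslashes and double quotes so the literal stays valid N-Triples.
    escaped = legacy_basis_of_record.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f"<{subject_uri}> <{RDF}type> <{DWCTYPE}Occurrence> .",
        f'<{subject_uri}> <{DWC}basisOfRecord> "{escaped}" .',
    ]


# Hypothetical subject URI and literal value, for illustration only.
triples = describe_record("http://example.org/occurrence/123", "PreservedSpecimen")
for line in triples:
    print(line)
```

The point of the split is that a consumer can rely on rdf:type for the class of the resource and treat the basisOfRecord literal as free text.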
Donald Hobern [GBIF] wrote:
Thanks, Rich.
Very pleased to see this. With this encouragement, I'll say just a little bit more about why I think this is a critical need.
I see the model I describe as the perfect real-world realisation of most of the key components in the GBIO Framework (http://www.biodiversityinformatics.org/), as follows:
- Everyone zips up whatever data they have from each resource (databases, field instruments, sequencers, data extracted from literature, checklists, whatever) into a DwC Archive using whatever DwC elements they can for data elements and describing other elements not currently recognised in DwC (the GBIO DATA layer)
- These archives should be placed in repositories that offer basic services (DOIs, annotation services, etc.) (the GBIO CULTURE layer)
- Harvesters assess the contents of each archive and determine what views can be supported from the supplied elements (occurrence records for GBIF, name usage records, species interactions, etc.) and catalogue these views in relevant discovery indexes (GBIF, Catalogue of Life, TraitBank, etc.) (the GBIO EVIDENCE layer)
- Users can at any time annotate elements in the archives to provide mappings for (potentially more recently defined) DwC or other properties, opening up new options for reuse
Donald
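The harvester step described above (deciding which views an archive can support from the elements it supplies) could be sketched roughly as follows. This is only an illustration: the view names and the minimum term sets in `VIEW_REQUIREMENTS` are assumptions, not requirements published by GBIF, Catalogue of Life or any other index.

```python
# Hypothetical minimum term sets per view; real indexes would define their own.
VIEW_REQUIREMENTS = {
    "occurrence": {"scientificName", "eventDate", "decimalLatitude", "decimalLongitude"},
    "name_usage": {"scientificName"},
    "species_interaction": {"scientificName", "associatedTaxa"},
}


def supported_views(columns):
    """Return the sorted names of views whose required terms are all present."""
    present = set(columns)
    return sorted(
        view for view, required in VIEW_REQUIREMENTS.items() if required <= present
    )


# Column headers of a hypothetical archive table.
archive_columns = [
    "recordId", "scientificName", "eventDate",
    "decimalLatitude", "decimalLongitude", "individualCount",
]
print(supported_views(archive_columns))
```

Under these assumed requirements the archive qualifies for the occurrence and name usage views but not for species interactions, because no interaction term is supplied.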
Donald,
With respect to your second paragraph, the proposed Darwin Core RDF Guide [1] purposefully avoids getting entangled in issues of defining exactly what the Darwin Core classes are and how they are related to each other. To put it in other words, it does not define object properties connecting the main Darwin Core classes. The guide restricts itself to spelling out how users can express Darwin Core properties from the main vocabulary as RDF in a consistent way.
Creating properties to describe the normalized relationships is an important task, but is left to another layer that could be built on top of the Guide and which requires additional consensus building as to how those relationships should be modeled. Alternatively, there could be several models and procedures could be established for converting from one to the other. But the guide stays out of this and sticks with the basics (as you say, avoids getting distracted).
Steve
[1] http://code.google.com/p/tdwg-rdf/wiki/DwcRdf
--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences
postal mail address: PMB 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 322-4942
If you fax, please phone or email so that I will know to look for it.
http://bioimages.vanderbilt.edu
I've always been somewhat puzzled by the disconnect between the TDWG LSID ontology (e.g., http://rs.tdwg.org/ontology/voc/TaxonConcept ) which has a rich set of classes and links between those classes, and Darwin Core (e.g., http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm ) which overlaps with this vocabulary and, in my opinion, does a worse job in some areas, notably taxon names and concepts. Maybe the LSID vocabulary suffered from the limited uptake of LSIDs (apart from the nomenclators and Catalogue of Life) or from the complexity of dealing with RDF, but it seems that much of the essential work was done when Roger Hyam created that ontology.
What might help is a way to visualise the TDWG LSID ontology in terms of the interconnections between the different classes. I'm not aware of such a visualisation (nor of an equivalent one for the Darwin Core classes).
In any event, it seems odd to have two distinct ontologies that are both in use, and which overlap so significantly.
Regards
Rod
On 13 Oct 2013, at 16:12, Donald Hobern [GBIF] wrote:
It’s been a couple of weeks but I said I’d try to write something about a more general concern I have around the way we use basisOfRecord and dcterms:type to hold values like occurrence, event and materialSample. This is something that has concerned me for years and that, I worry, is making everything we all do much messier than it need be.
I believe that the way we have come to use Darwin Core basisOfRecord is confused and unhelpful. I really wish we used Darwin Core like this:
1. basisOfRecord should be used ONLY to indicate the type of evidence that lies behind a record – a key aspect of whether the record is likely to be useful for different purposes
2. basisOfRecord values should be taken from a hierarchical vocabulary with three main branches:
a. “specimens” (i.e. biological material that can be reviewed), with a hierarchy of subordinate values such as “pinnedSpecimen”, “herbariumSheet”, etc.
b. derived, non-biological evidence (not sure what name), with a hierarchy of subordinate values such as “dnaSequence”, “soundRecording”, “stillImage”, etc.
c. asserted observations with no revisitable evidence other than the authority of the observer
3. TDWG should deliver a basic ontology in the form of a graph of key relationships between the most significant conceptual entities in our world (TaxonName, TaxonConcept, Identification, Collection, Specimen, Locality, Agent, …)
4. This ontology should not attempt to map all the complexity of biodiversity-related data – just provide the high-level map and key relationships (TaxonConcept hasName TaxonName, Specimen heldIn Collection, etc.) – it should leave definition of other properties as a separate, open-ended activity for the community
5. This ontology should be reviewed at regular intervals and versioned as necessary to address critical gaps – provided that backwards compatibility is maintained (splitting a class into multiple constituent classes probably won’t break anything, so start simple)
6. The Darwin Core vocabulary should be published as a flat, open-ended list of terms with clear definitions that can be freely combined as columns in denormalised records
7. Every Darwin Core term should be documented to be tightly associated with a single, fixed class in the ontology (e.g. scientificName and specificEpithet are ALWAYS considered to be properties of a TaxonName whether or not that TaxonName object is clearly referenced or separated out)
8. Every data publisher should be encouraged to share all relevant data elements in their source data in the most convenient normalised or denormalised form, provided they use the recognised Darwin Core properties for elements that match the definition for those terms, and provided they give some metadata for other elements. Possible forms include:
a. A completely hierarchical, ABCD-like, XML representation
b. A completely flat denormalised, simple-DwC-like, CSV representation, if the data includes no elements with higher cardinality
c. A set of flat, relational, CSV representations, as with Darwin Core Archive star schemas, but with freedom to have more complex graphed relationships as needed
9. Each table of CSV data in 8b and 8c is a view that corresponds to a linear subgraph of the TDWG ontology, identified by the classes of the DwC properties used – this allows us to infer the “shape” of the data in terms of the ontology
10. If we do this, we do not need to worry about whether a record is a checklist record, an event, an occurrence, a material sample or whatever else, although we could use the dcterms:type property, or some new property, to hold this detail as a further clue to intent and possible use for the record
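As a rough illustration of point 2, the three-branch vocabulary could be held as a simple parent map. The leaf values are the examples named in the list; the branch labels "specimen", "derivedEvidence" and "assertedObservation" and both helper functions are hypothetical, since the message itself leaves the naming open.

```python
# Parent of each value; None marks the root of a branch.
# Branch labels are hypothetical; leaf values are the examples from the list.
PARENT = {
    "specimen": None,
    "pinnedSpecimen": "specimen",
    "herbariumSheet": "specimen",
    "derivedEvidence": None,
    "dnaSequence": "derivedEvidence",
    "soundRecording": "derivedEvidence",
    "stillImage": "derivedEvidence",
    "assertedObservation": None,
}


def branch_of(value):
    """Walk up the hierarchy and return the top-level branch of a value."""
    while PARENT[value] is not None:
        value = PARENT[value]
    return value


def is_revisitable(value):
    """True when evidence exists beyond the authority of the observer."""
    return branch_of(value) != "assertedObservation"


print(branch_of("herbariumSheet"), is_revisitable("herbariumSheet"))
print(branch_of("assertedObservation"), is_revisitable("assertedObservation"))
```

A consumer that only cares whether a record can be re-examined can then filter on the branch without knowing every subordinate value.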
Here is an example. In today’s terms, what sort of DwC record is this? Do I really have to replace “recordId” with “eventId”, “occurrenceId” or similar? And which should I choose?
recordId, decimalLatitude, decimalLongitude, coordinatePrecision, eventDate, scientificName, individualCount
I think it is clear that this record tells us that there was a recording event at a particular time and place where someone or some process recorded a given number of individual organisms which were identified as representatives of a taxon concept with a name corresponding to the supplied scientific name. In other words this gives us some properties from a subgraph that might include, say, instances of TDWG Event, Locality, Date, Occurrence, Identification, TaxonConcept and TaxonName classes. None of these is specifically referenced but we can unambiguously fold the flat record onto the ontology. We can moreover then use the combination of supplied elements to decide whether this record would be of interest to GBIF, a national information facility, a tool cataloguing uses of scientific names, etc. The same will also apply if multiple CSV tables are provided as in 8c.
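The folding of a flat record onto the ontology can be sketched as a lookup from each column to its single fixed class (point 7). The assignments in `TERM_CLASS` below are assumptions made for illustration, not an agreed TDWG mapping, and `fold` is a hypothetical helper name.

```python
# Hypothetical term-to-class assignments (one fixed class per term, as in point 7).
TERM_CLASS = {
    "decimalLatitude": "Locality",
    "decimalLongitude": "Locality",
    "coordinatePrecision": "Locality",
    "eventDate": "Event",
    "scientificName": "TaxonName",
    "individualCount": "Occurrence",
}


def fold(columns):
    """Group column names by ontology class; unmapped columns go under None."""
    shape = {}
    for column in columns:
        shape.setdefault(TERM_CLASS.get(column), []).append(column)
    return shape


# The example record header from the message above.
columns = [
    "recordId", "decimalLatitude", "decimalLongitude", "coordinatePrecision",
    "eventDate", "scientificName", "individualCount",
]
shape = fold(columns)
for cls, terms in shape.items():
    print(cls, terms)
```

The set of classes that appears is the inferred "shape" of the record, and a neutral identifier such as recordId simply remains unassigned instead of having to be renamed to eventId or occurrenceId.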
I have thought about this for a long time and cannot yet think of an area in which this would not work efficiently – and unambiguously – for all concerned. There are some cases where multiple instances of the same ontology class would be referenced within a single record, which may mean more care is needed by the publisher (e.g. if an insect specimen record includes a reference to a host plant). There may be cases where automated review of the data indicates that there are impossible combinations or ambiguities that the publisher must resolve. However I believe we could use this approach to generalise all mobilisation and consumption of biodiversity data (including all the things we have addressed under ABCD, SDD, TCS, Plinian Core, etc.) and to make it genuinely possible for any data holder to share all the data they have in a form that makes sense to them, while allowing others to consume these data intelligently.
Right now, I think our confused use of basisOfRecord is almost the only thing that stops us from exploring this. We have blurred the question of the evidence for a record, with the question of the “shape” of the record as a subgraph. These are different things. Separating them will allow us to get away from some of our unresolvable debates and open up the doors to much simpler data sharing and reuse.
Thanks,
Donald
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
--------------------------------------------------------- Roderic Page Professor of Taxonomy Institute of Biodiversity, Animal Health and Comparative Medicine College of Medical, Veterinary and Life Sciences Graham Kerr Building University of Glasgow Glasgow G12 8QQ, UK
Email: r.page@bio.gla.ac.uk Tel: +44 141 330 4778 Fax: +44 141 330 2792 Skype: rdmpage Facebook: http://www.facebook.com/rdmpage LinkedIn: http://uk.linkedin.com/in/rdmpage Twitter: http://twitter.com/rdmpage Blog: http://iphylo.blogspot.com Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html Wikipedia: http://en.wikipedia.org/wiki/Roderic_D._M._Page Citations: http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ ORCID: http://orcid.org/0000-0002-7101-9767
Rod --- There are a couple of different conceptions of interrelationships between Darwin Core "classes", including the Darwin Core Semantic Web effort led by Steve Baskauf and Cam Webb, and the BiSciCol project. Darwin Core SW is here: https://code.google.com/p/darwin-sw/ and the BiSciCol "take" is here: http://biscicol.blogspot.com/2013_03_01_archive.html. The Darwin Core SW version includes new classes not in Darwin Core, while BiSciCol uses only existing class terms and a very simple set of predicates.
I think in many people's view, including those of the authors of the above (although I hate speaking for them), neither DW-SW nor DW-BiSciCol may really be able to handle the current needs for linking resources together effectively. There has been a major effort to refocus away from jury-rigging Darwin Core to serve in a more semantic framework, and to push towards other solutions that align biodiversity standards more with the OBO Foundry (http://www.obofoundry.org/). The Biocollections Ontology (BCO; https://code.google.com/p/bco/) represents what I hope is a clear rethinking of the challenge that does connect back to the Darwin Core.
Best, Rob
Thanks, Rob.
I obviously have to admit to being well out of touch with what is going on in these ontology discussions. However part of the reason for my plea is that I think the basic business of facilitating data publishing and discovery is currently being harmed by the Darwin Core classes. It is much better to proceed with the ontology work and to leave significant flexibility in the publishing of data from real systems. I feel confident this will allow rapid progress on all fronts. Right now we are putting barriers in the way of content mobilisation.
Donald
From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Robert Guralnick Sent: Sunday, October 13, 2013 10:11 PM To: Roderic Page Cc: TDWG Content Mailing List Subject: Re: [tdwg-content] A plea around basisOfRecord (Was: Proposed new Darwin Core terms - abundance, abundanceAsPercent)
Rod --- There are a couple of different conceptions of interrelationships between Darwin Core "classes", including the Darwin Core Semantic Web effort led by Steve Baskauf and Cam Webb, and the BiSciCol project. Darwin Core SW is here: https://code.google.com/p/darwin-sw/ and the BiSciCol "take" is here: http://biscicol.blogspot.com/2013_03_01_archive.html. The Darwin Core SW version includes new classes not in Darwin Core, while BiSciCol uses only existing class terms and a very simple set of predicates.
I think in many people's view, including those of the authors of the above (although I hate speaking for them), neither DW-SW nor DW-BiSciCol may really be able to handle the current needs for linking resources together effectively. There has been a major effort to refocus away from jury-rigging Darwin Core to serve in a more semantic framework and to push towards other solutions that align biodiversity standards more with the OBO Foundry (http://www.obofoundry.org/). The Biocollections Ontology (BCO; https://code.google.com/p/bco/) represents what I hope is a clear rethinking of the challenge that does connect back to the Darwin Core.
Best, Rob
On Sun, Oct 13, 2013 at 1:52 PM, Roderic Page r.page@bio.gla.ac.uk wrote:
I've always been somewhat puzzled by the disconnect between the TDWG LSID ontology (e.g., http://rs.tdwg.org/ontology/voc/TaxonConcept ) which has a rich set of classes and links between those classes, and Darwin Core (e.g., http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm ) which overlaps with this vocabulary and, in my opinion, does a worse job in some areas, notably taxon names and concepts. Maybe the LSID vocabulary suffered from the limited uptake of LSIDs (apart from the nomenclators and Catalogue of Life) or from the complexity of dealing with RDF, but it seems that much of the essential work was done when Roger Hyam created that ontology.
What might help is a way to visualise the TDWG LSID ontology in terms of the interconnections between the different classes. I'm not aware of such a visualisation (nor of an equivalent one for the Darwin Core classes).
In any event, it seems odd to have two distinct ontologies that are both in use, and which overlap so significantly.
Regards
Rod
On 13 Oct 2013, at 16:12, Donald Hobern [GBIF] wrote:
It's been a couple of weeks but I said I'd try to write something about a more general concern I have around the way we use basisOfRecord and dcterms:type to hold values like occurrence, event and materialSample. This is something that has concerned me for years and that, I worry, is making everything we all do much messier than it need be.
I believe that the way we have come to use Darwin Core basisOfRecord is confused and unhelpful. I really wish we used Darwin Core like this:
1. basisOfRecord should be used ONLY to indicate the type of evidence that lies behind a record, a key aspect of whether the record is likely to be useful for different purposes
2. basisOfRecord values should be taken from a hierarchical vocabulary with three main branches:
a. "specimens" (i.e. biological material that can be reviewed), with a hierarchy of subordinate values such as "pinnedSpecimen", "herbariumSheet", etc.
b. derived, non-biological evidence (not sure what name), with a hierarchy of subordinate values such as "dnaSequence", "soundRecording", "stillImage", etc.
c. asserted observations with no revisitable evidence other than the authority of the observer
3. TDWG should deliver a basic ontology in the form of a graph of key relationships between the most significant conceptual entities in our world (TaxonName, TaxonConcept, Identification, Collection, Specimen, Locality, Agent, ...)
4. This ontology should not attempt to map all the complexity of biodiversity-related data, just provide the high-level map and key relationships (TaxonConcept hasName TaxonName, Specimen heldIn Collection, etc.); it should leave definition of other properties as a separate, open-ended activity for the community
5. This ontology should be reviewed at regular intervals and versioned as necessary to address critical gaps, provided that backwards compatibility is maintained (splitting a class into multiple constituent classes probably won't break anything, so start simple)
6. The Darwin Core vocabulary should be published as a flat, open-ended list of terms with clear definitions that can be freely combined as columns in denormalised records
7. Every Darwin Core term should be documented to be tightly associated with a single, fixed class in the ontology (e.g. scientificName and specificEpithet are ALWAYS considered to be properties of a TaxonName whether or not that TaxonName object is clearly referenced or separated out)
8. Every data publisher should be encouraged to share all relevant data elements in their source data in the most convenient normalised or denormalised form, provided they use the recognised Darwin Core properties for elements that match the definition for those terms, and provided they give some metadata for other elements. Possible forms include:
a. A completely hierarchical, ABCD-like, XML representation
b. A completely flat denormalised, simple-DwC-like, CSV representation, if the data includes no elements with higher cardinality
c. A set of flat, relational, CSV representations, as with Darwin Core Archive star schemas, but with freedom to have more complex graphed relationships as needed
9. Each table of CSV data in 8b and 8c is a view that corresponds to a linear subgraph of the TDWG ontology, identified by the classes of the DwC properties used; this allows us to infer the "shape" of the data in terms of the ontology
10. If we do this, we do not need to worry about whether a record is a checklist record, an event, an occurrence, a material sample or whatever else, although we could use the dcterms:type property, or some new property, to hold this detail as a further clue to intent and possible use for the record
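As an illustration of points 1 and 2 above, a hierarchical evidence vocabulary could be sketched roughly as follows. This is only a sketch: the leaf values are the ones named in the list, but the three branch names (`specimen`, `derivedEvidence`, `assertedObservation`) and the helper functions are placeholders invented here, not agreed Darwin Core values.

```python
# Hypothetical evidence vocabulary: each value maps to its parent value.
# The leaf names come from the proposal; the branch names are placeholders.
PARENT = {
    "specimen": None,
    "derivedEvidence": None,
    "assertedObservation": None,
    "pinnedSpecimen": "specimen",
    "herbariumSheet": "specimen",
    "dnaSequence": "derivedEvidence",
    "soundRecording": "derivedEvidence",
    "stillImage": "derivedEvidence",
}


def branch_of(value):
    """Walk up the hierarchy and return the top-level branch for a value."""
    if value not in PARENT:
        raise KeyError(f"unknown basisOfRecord value: {value}")
    while PARENT[value] is not None:
        value = PARENT[value]
    return value


def is_revisitable(value):
    """True when the evidence can be re-examined (anything but a bare assertion)."""
    return branch_of(value) != "assertedObservation"
```

A consumer that only cares about reviewable evidence could then filter on `is_revisitable(record["basisOfRecord"])` without needing to know every subordinate value.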
Here is an example. In today's terms, what sort of DwC record is this? Do I really have to replace "recordId" with "eventId", "occurrenceId" or similar? And which should I choose?
recordId, decimalLatitude, decimalLongitude, coordinatePrecision, eventDate, scientificName, individualCount
I think it is clear that this record tells us that there was a recording event at a particular time and place where someone or some process recorded a given number of individual organisms which were identified as representatives of a taxon concept with a name corresponding to the supplied scientific name. In other words this gives us some properties from a subgraph that might include, say, instances of TDWG Event, Locality, Date, Occurrence, Identification, TaxonConcept and TaxonName classes. None of these is specifically referenced but we can unambiguously fold the flat record onto the ontology. We can moreover then use the combination of supplied elements to decide whether this record would be of interest to GBIF, a national information facility, a tool cataloguing uses of scientific names, etc. The same will also apply if multiple CSV tables are provided as in 8c.
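The folding described above might look something like the following sketch. It assumes a fixed term-to-class table as in point 7; only the assignment of scientificName and specificEpithet to TaxonName is stated in the message, so the other assignments, the function names and the sample values are all assumptions made for illustration.

```python
# Illustrative term -> class table. Only scientificName/specificEpithet ->
# TaxonName is stated in the proposal; the other assignments are assumptions.
TERM_CLASS = {
    "decimalLatitude": "Locality",
    "decimalLongitude": "Locality",
    "coordinatePrecision": "Locality",
    "eventDate": "Event",
    "scientificName": "TaxonName",
    "specificEpithet": "TaxonName",
    "individualCount": "Occurrence",
}


def fold(record):
    """Group a flat record's values under the class each term belongs to.

    Returns (by_class, unmapped): a dict of class -> {term: value}, and a
    dict of columns with no registered class (e.g. a local recordId).
    """
    by_class, unmapped = {}, {}
    for term, value in record.items():
        cls = TERM_CLASS.get(term)
        if cls is None:
            unmapped[term] = value
        else:
            by_class.setdefault(cls, {})[term] = value
    return by_class, unmapped


def shape(record):
    """The 'shape' of a record: the sorted list of classes it touches."""
    by_class, _ = fold(record)
    return sorted(by_class)


# Sample row using the columns from the example; the values are invented.
example = {
    "recordId": "rec-1",
    "decimalLatitude": "55.87",
    "decimalLongitude": "-4.29",
    "coordinatePrecision": "0.01",
    "eventDate": "2013-10-13",
    "scientificName": "Vanessa atalanta",
    "individualCount": "3",
}
```

Under these assumptions, `shape(example)` lists the classes the row touches, and `recordId` stays in the unmapped columns as a plain local identifier, so the publisher never has to decide between eventId and occurrenceId.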
I have thought about this for a long time and cannot yet think of an area in which this would not work efficiently and unambiguously for all concerned. There are some cases where multiple instances of the same ontology class would be referenced within a single record, which may mean more care is needed by the publisher (e.g. if an insect specimen record includes a reference to a host plant). There may be cases where automated review of the data indicates that there are impossible combinations or ambiguities that the publisher must resolve. However I believe we could use this approach to generalise all mobilisation and consumption of biodiversity data (including all the things we have addressed under ABCD, SDD, TCS, Plinian Core, etc.) and to make it genuinely possible for any data holder to share all the data they have in a form that makes sense to them, while allowing others to consume these data intelligently.
Right now, I think our confused use of basisOfRecord is almost the only thing that stops us from exploring this. We have blurred the question of the evidence for a record with the question of the "shape" of the record as a subgraph. These are different things. Separating them will allow us to get away from some of our unresolvable debates and open up the doors to much simpler data sharing and reuse.
Thanks,
Donald
Sorry, I don't agree at all.
The core Darwin-SW classes include only Darwin Core classes and the two proposed DwC classes (Organism and CollectionObject, a.k.a. dsw:IndividualOrganism and dsw:Evidence), which underwent a 30-day public comment period [1] and were submitted to the Executive, which recommended further consideration by the RDF Task Group and the community at large. The Documenting Darwin Core sessions at the TDWG meeting will pick up these and other open issues for further discussion and hopefully move them towards closure one way or the other. If the two proposed classes are at some point accepted for inclusion in DwC, Darwin-SW will use the new classes and deprecate dsw:IndividualOrganism and dsw:Evidence, leaving only Darwin Core classes as the core classes in Darwin-SW.
It is NOT my view that Darwin-SW is unable to handle current needs for linking resources effectively. If anyone wants to know why I say that, come to our talk in the Friday 9AM session on Ontologies and Formal Models at the meeting. We will show how real SPARQL queries on Darwin-SW-based data can address important competency questions involving diverse linked resources. Or see me any time during the meeting earlier in the week and I'll be happy to give you a personal demonstration not limited to 9 minutes.
Steve
[1] http://lists.tdwg.org/pipermail/tdwg-content/2011-September/002727.html see also open issue https://code.google.com/p/darwincore/issues/detail?id=69
Rod,
http://code.google.com/p/tdwg-rdf/wiki/BiodiversityOntologies which has been online since last March.
Steve
Roderic Page wrote: ...
What might help is a way to visualise the TDWG LSID ontology in terms of the interconnections between the different classes. I'm not aware of such a visualisation (nor of an equivalent one for the Darwin Core classes).
In any event, it seems odd to have two distinct ontologies that are both in use, and which overlap so significantly.
Regards
Rod
Hi Steve,
Thanks for the link, sorry I'd missed it.
Regards
Rod
On 14 Oct 2013, at 15:34, Steve Baskauf wrote:
Rod, http://code.google.com/p/tdwg-rdf/wiki/BiodiversityOntologies which has been online since last March. Steve
Roderic Page wrote: ...
What might help is a way to visualise the TDWG LSID ontology in terms of the interconnections between the different classes. I'm not aware of such a visualisation (nor of an equivalent one for the Darwin Core classes).
In any event, it seems odd to have two distinct ontologies that are both in use, and which overlap so significantly.
Regards
Rod On 13 Oct 2013, at 16:12, Donald Hobern [GBIF] wrote:
It’s been a couple of weeks but I said I’d try to write something about a more general concern I have around the way we use basisOfRecord and dcterms:type to hold values like occurrence, event and materialSample. This is something that has concerned me for years and that, I worry, is making everything we all do much messier than it need be.
I believe that the way we have come to use Darwin Core basisOfRecord is confused and unhelpful. I really wish we used Darwin Core like this:
1. basisOfRecord should be used ONLY to indicate the type of evidence that lies behind a record – a key aspect of whether the record is likely to be useful for different purposes
2. basisOfRecord values should be taken from a hierarchical vocabulary with three main branches:
a. “specimens” (i.e. biological material that can be reviewed), with a hierarchy of subordinate values such as “pinnedSpecimen”, “herbariumSheet”, etc.
b. derived, non-biological evidence (not sure what name), with a hierarchy of subordinate values such as “dnaSequence”, “soundRecording”, “stillImage”, etc.
c. asserted observations with no revisitable evidence other than the authority of the observer
3. TDWG should deliver a basic ontology in the form of a graph of key relationships between the most significant conceptual entities in our world (TaxonName, TaxonConcept, Identification, Collection, Specimen, Locality, Agent, …)
4. This ontology should not attempt to map all the complexity of biodiversity-related data – just provide the high-level map and key relationships (TaxonConcept hasName TaxonName, Specimen heldIn Collection, etc.) – it should leave definition of other properties as a separate, open-ended activity for the community
5. This ontology should be reviewed at regular intervals and versioned as necessary to address critical gaps – provided that backwards compatibility is maintained (splitting a class into multiple constituent classes probably won’t break anything, so start simple)
6. The Darwin Core vocabulary should be published as a flat, open-ended list of terms with clear definitions that can be freely combined as columns in denormalised records
7. Every Darwin Core term should be documented to be tightly associated with a single, fixed class in the ontology (e.g. scientificName and specificEpithet are ALWAYS considered to be properties of a TaxonName whether or not that TaxonName object is clearly referenced or separated out)
8. Every data publisher should be encouraged to share all relevant data elements in their source data in the most convenient normalised or denormalised form, provided they use the recognised Darwin Core properties for elements that match the definition for those terms, and provided they give some metadata for other elements. Possible forms include:
a. A completely hierarchical, ABCD-like, XML representation
b. A completely flat denormalised, simple-DwC-like, CSV representation, if the data includes no elements with higher cardinality
c. A set of flat, relational, CSV representations, as with Darwin Core Archive star schemas, but with freedom to have more complex graphed relationships as needed
9. Each table of CSV data in 8b and 8c is a view that corresponds to a linear subgraph of the TDWG ontology, identified by the classes of the DwC properties used – this allows us to infer the “shape” of the data in terms of the ontology
10. If we do this, we do not need to worry about whether a record is a checklist record, an event, an occurrence, a material sample or whatever else, although we could use the dcterms:type property, or some new property, to hold this detail as a further clue to intent and possible use for the record
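As a rough illustration of the hierarchical vocabulary in point 2, here is a minimal Python sketch. The three branch names ("specimen", "derivedEvidence", "assertedObservation") and the helper functions are placeholders invented for this example, since no names have been agreed for them; only the subordinate values are the examples listed above.

```python
# Hypothetical hierarchical vocabulary for basisOfRecord.
# Branch names are placeholders (assumptions); the subordinate values
# are the examples given in point 2 of the proposal.
PARENT = {
    "specimen": None,
    "derivedEvidence": None,
    "assertedObservation": None,
    "pinnedSpecimen": "specimen",
    "herbariumSheet": "specimen",
    "dnaSequence": "derivedEvidence",
    "soundRecording": "derivedEvidence",
    "stillImage": "derivedEvidence",
}


def lineage(value):
    """Return the path from a value up to its top-level branch."""
    if value not in PARENT:
        raise KeyError(f"unknown basisOfRecord value: {value!r}")
    path = [value]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def branch(value):
    """Return the top-level branch a value belongs to."""
    return lineage(value)[-1]


def is_revisitable(value):
    """True when the record rests on evidence that others could re-examine."""
    return branch(value) != "assertedObservation"


print(lineage("herbariumSheet"))
print(branch("stillImage"))
print(is_revisitable("assertedObservation"))
```

A consumer that only cares about the broad kind of evidence can filter on the branch, while a publisher remains free to supply the most specific value it has.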
Here is an example. In today’s terms, what sort of DwC record is this? Do I really have to replace “recordId” with “eventId”, “occurrenceId” or similar? And which should I choose?
recordId, decimalLatitude, decimalLongitude, coordinatePrecision, eventDate, scientificName, individualCount
I think it is clear that this record tells us that there was a recording event at a particular time and place where someone or some process recorded a given number of individual organisms which were identified as representatives of a taxon concept with a name corresponding to the supplied scientific name. In other words this gives us some properties from a subgraph that might include, say, instances of TDWG Event, Locality, Date, Occurrence, Identification, TaxonConcept and TaxonName classes. None of these is specifically referenced but we can unambiguously fold the flat record onto the ontology. We can moreover then use the combination of supplied elements to decide whether this record would be of interest to GBIF, a national information facility, a tool cataloguing uses of scientific names, etc. The same will also apply if multiple CSV tables are provided as in 8c.
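To make the folding step concrete, here is a minimal Python sketch of how a consumer might infer the "shape" of the example record from its column names alone. The term-to-class table is an assumption made for illustration: only the assignment of scientificName and specificEpithet to TaxonName is stated in the proposal, and the other assignments, the function names and the consumer rule are hypothetical.

```python
# Assumed term-to-class assignments (illustrative only). The proposal
# says every term should have one fixed class, but only
# scientificName/specificEpithet -> TaxonName is given explicitly.
TERM_CLASS = {
    "decimalLatitude": "Locality",
    "decimalLongitude": "Locality",
    "coordinatePrecision": "Locality",
    "eventDate": "Event",
    "scientificName": "TaxonName",
    "specificEpithet": "TaxonName",
    "individualCount": "Occurrence",
}


def infer_shape(header):
    """Group column names by ontology class; unmapped columns go under None."""
    shape = {}
    for column in header:
        shape.setdefault(TERM_CLASS.get(column), []).append(column)
    return shape


def looks_like_occurrence_data(classes):
    """Hypothetical consumer rule: a place, a time and a name are all present."""
    return {"Locality", "Event", "TaxonName"} <= classes


raw_header = ("recordId, decimalLatitude, decimalLongitude, "
              "coordinatePrecision, eventDate, scientificName, individualCount")
header = [name.strip() for name in raw_header.split(",")]

shape = infer_shape(header)
classes = {cls for cls in shape if cls is not None}

print(sorted(classes))
print(shape[None])  # columns with no recognised class, e.g. a local record key
print(looks_like_occurrence_data(classes))
```

This sketch only reports the classes that the columns reference directly. Intermediate classes such as Identification and TaxonConcept would have to be filled in by walking the ontology graph between the referenced classes, which is not shown here.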
I have thought about this for a long time and cannot yet think of an area in which this would not work efficiently – and unambiguously – for all concerned. There are some cases where multiple instances of the same ontology class would be referenced within a single record, which may mean more care is needed by the publisher (e.g. if an insect specimen record includes a reference to a host plant). There may be cases where automated review of the data indicates that there are impossible combinations or ambiguities that the publisher must resolve. However, I believe we could use this approach to generalise all mobilisation and consumption of biodiversity data (including all the things we have addressed under ABCD, SDD, TCS, Plinian Core, etc.) and to make it genuinely possible for any data holder to share all the data they have in a form that makes sense to them, while allowing others to consume these data intelligently.
Right now, I think our confused use of basisOfRecord is almost the only thing that stops us from exploring this. We have blurred the question of the evidence for a record with the question of the “shape” of the record as a subgraph. These are different things. Separating them will allow us to get away from some of our unresolvable debates and open up the doors to much simpler data sharing and reuse.
Thanks,
Donald
Donald Hobern - GBIF Director - dhobern@gbif.org Global Biodiversity Information Facility http://www.gbif.org/ GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark Tel: +45 3532 1471 Mob: +45 2875 1471 Fax: +45 2875 1480
Roderic Page Professor of Taxonomy Institute of Biodiversity, Animal Health and Comparative Medicine College of Medical, Veterinary and Life Sciences Graham Kerr Building University of Glasgow Glasgow G12 8QQ, UK
Email: r.page@bio.gla.ac.uk Tel: +44 141 330 4778 Fax: +44 141 330 2792 Skype: rdmpage Facebook: http://www.facebook.com/rdmpage LinkedIn: http://uk.linkedin.com/in/rdmpage Twitter: http://twitter.com/rdmpage Blog: http://iphylo.blogspot.com Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html Wikipedia: http://en.wikipedia.org/wiki/Roderic_D._M._Page Citations: http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ ORCID: http://orcid.org/0000-0002-7101-9767
-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences
postal mail address: PMB 351634 Nashville, TN 37235-1634, U.S.A.
delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235
office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 322-4942 If you fax, please phone or email so that I will know to look for it. http://bioimages.vanderbilt.edu
participants (6): Daniel Janzen, Donald Hobern [GBIF], Richard Pyle, Robert Guralnick, Roderic Page, Steve Baskauf