Darwin Core vernacularName field
Greetings,
I have recently begun the process of digitising the 60,000 specimen vouchers from the UNB herbarium. The textual data for 40,000+ of those has already been entered into a database, and I am now trying to map those values to DwC so that we may share the data with other collections.
I have some concern over the fact that simple DwC does not allow the repetition or extension of certain fields. The vernacularName field is a particular problem. New Brunswick is Canada's only officially bilingual province; as such, our specimens are all identified with both their English and French common names in the database. It would be very useful if we could extend DwC, creating something along the lines of <vernacularName lang="en">, or allow nesting of elements, perhaps in the form:

    <vernacularName>
      <English>Chives</English>
      <French>Ciboulette, brulotte</French>
    </vernacularName>
The other option, as I see it, is that we store the English and French common names in our own fields, and then concatenate the two to create the DwC:vernacularName field. I see this option as less than ideal since it may hinder searching and browsing. It may also cause a host of other problems, from interpreting to storing the data. The herbarium with which we first intend to share the data has already expressed a concern that their system cannot handle the diacritics found in many of the French names (!). They would like the English common names, but not the French. This is more difficult to achieve if we concatenate the values.
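A minimal sketch of how that second option might look, assuming the English and French names stay in separate local fields and the single vernacularName value is only generated at export time. The field names `common_name_en` and `common_name_fr`, the catalogue number, and the " | " separator are all hypothetical, not part of DwC:

```python
# Hypothetical local record; the common_name_* keys are illustrative only.
record = {
    "catalogNumber": "UNB-00001",  # made-up identifier
    "scientificName": "Allium schoenoprasum",
    "common_name_en": "Chives",
    "common_name_fr": "Ciboulette, brulotte",
}


def vernacular_name(rec, languages):
    """Build a single vernacularName value from the requested languages."""
    names = [rec[f"common_name_{lang}"] for lang in languages
             if rec.get(f"common_name_{lang}")]
    # " | " is an assumed separator. The French value already contains a
    # comma, so a comma could not be used to split the languages apart again.
    return " | ".join(names)


# Export for a partner that wants both languages.
both = vernacular_name(record, ["en", "fr"])
# Export for a partner that wants English only.
english_only = vernacular_name(record, ["en"])

print(both)
print(english_only)
```

Keeping the languages apart locally means the English-only export is a matter of choosing which fields to include. Once the values have been concatenated, the recipient would have to know the separator convention to pull them apart again.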
One additional thought is that the herbarium's imprint, _Flora of New Brunswick_, also includes common names in Maliseet and Mi'kmaq wherever possible. Although these two aboriginal languages do not currently exist in the dataset we are using, there is the potential that they may be added at some point in the future.
It seems to me that the repetition of fields may be necessary in other instances too. I am having some difficulty figuring out how to record all the location data we have for the specimens, which are indicated using verbal descriptions, Lat/Long, UTM, and NTS coordinates - in many cases using all 4 for a single sample, but I will save the details for another posting.
I will watch for the group's thoughts on this problem.
Many thanks, Geoffrey --------------------------------------------
Geoffrey Allen Digital Projects Librarian Electronic Text Centre Harriet Irving Library University of New Brunswick Fredericton, NB E3B 5H5 Tel: (506) 447-3250 Fax: (506) 453-4595 gsallen@unb.ca
Hi Geoffrey,
There is (currently) no elegant way to publish multiple vernacular names in simple DwC; it is one of the limitations of simple DwC.
It is, however, possible to publish multiple vernacular names (along with their language and some other information) if you use extensions. There is even an official GBIF Vernacular Name extension: http://rs.gbif.org/extension/gbif/1.0/vernacularname.xml. This extension is a text file in which every line is a vernacular name, with a link (via an ID) back to the core file, which contains the specimen information. More information here: http://code.google.com/p/gbif-ecat/wiki/DwCArchive
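As a rough sketch of that core-plus-extension layout (not an official GBIF example), the following builds a core table and a vernacular-name table in memory and looks names up by ID. The identifier `s1` and the row values are made up, and the column names `vernacularName` and `language` are assumed to match terms in the extension definition linked above, which should be checked before use. A real Darwin Core Archive also needs a meta.xml descriptor, which is omitted here:

```python
import csv
import io

# Core rows: one per taxon or specimen. The id value is made up.
core_rows = [
    {"id": "s1", "scientificName": "Allium schoenoprasum"},
]

# Extension rows: one per vernacular name, pointing back to the core by id.
vernacular_rows = [
    {"id": "s1", "vernacularName": "Chives", "language": "en"},
    {"id": "s1", "vernacularName": "Ciboulette", "language": "fr"},
    {"id": "s1", "vernacularName": "brulotte", "language": "fr"},
]


def to_tsv(rows):
    """Serialise a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


core_text = to_tsv(core_rows)
extension_text = to_tsv(vernacular_rows)


def names_for(core_id, extension_tsv, language=None):
    """Return the vernacular names linked to one core record."""
    reader = csv.DictReader(io.StringIO(extension_tsv), delimiter="\t")
    return [row["vernacularName"] for row in reader
            if row["id"] == core_id
            and (language is None or row["language"] == language)]


print(names_for("s1", extension_text))
print(names_for("s1", extension_text, language="en"))
```

Because each name is its own row with its own language value, a consumer that cannot handle the French names can simply filter on the language column.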
However, this extension was intended for the use of species checklists, with a TAXON core file. I don't think it has ever been used to link to an occurrence/specimen core file. I do know that some herbaria record vernacular names on a specimen level, but that is not really best practice, since vernacular names are in fact properties of taxa, and only by relation of specimens: Alpine alumroot and heuchère glabre are vernacular names for Heuchera glabra, and thus of every living or preserved specimen of that species (http://data.canadensys.net/vascan/name/alpine%20alumroot).
If you want to see an example of a checklist Darwin Core file using a vernacular name extension, we created one for all the vascular plants in Canada (might be useful for your herbarium): http://data.canadensys.net/vascan/dataset You can also search and create your own Darwin Core files for Canadian plants using our checklist builder: http://data.canadensys.net/vascan/checklist
Hope this helps,
Peter
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
On Thu, 21 Jul 2011 11:59:19 -0400 Peter Desmet peter.desmet@umontreal.ca wrote:
I do know that some herbaria record vernacular names on a specimen level, but that is not really best practice, since vernacular names are in fact properties of taxa, and only by relation of specimens:
The place where an association of vernacular names with specimens is appropriate is in ethnobotany, where a specimen may be a voucher for the use of a particular vernacular name by a particular person or group for a particular plant.
Other than that, I'd concur, vernacular names are best treated as related to taxa.
-Paul
There's a general issue with repeated attributes in a metadata record of any kind. Depending on the representation language, when there is more than one such thing in the record, it can be difficult to specify any linkages between them when they are semantically related.
One general solution is to have multiple metadata records for the same resource. This can be costly if there is a powerful reason that every such record should carry the complete set of attributes except for the repeated ones, but in the case you put on the table, I think the only powerful reason would take the form "There are a lot of stupid DwC applications out there that might discover a record that has nothing in it but, say, the French vernacular name and a resourceID, and stop there without ever looking for/at another record with the same resourceID and more comprehensive metadata, and integrating the results at the application level."
A response might be "But the point of simple DwC is to support simple applications." But "simple application" is not the same thing as "simple-minded application", and my guess is that addressing the issue of multiple metadata records on the application side is, for many applications, less programming effort than other workarounds.
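To give a sense of the effort involved, here is a minimal sketch of the application-side integration, assuming each partial record is a flat dictionary that carries a shared `resourceID`. The key name follows the wording above; the other keys and all values are made up:

```python
from collections import defaultdict

# Several partial records describing the same resources (values made up).
records = [
    {"resourceID": "unb:1", "scientificName": "Allium schoenoprasum",
     "vernacularName": "Chives"},
    {"resourceID": "unb:1", "vernacularName": "Ciboulette"},
    {"resourceID": "unb:2", "scientificName": "Heuchera glabra"},
]


def merge_by_resource(rows, key="resourceID"):
    """Collect every value seen for each field, grouped by resource."""
    merged = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for field, value in row.items():
            if field == key:
                continue
            values = merged[row[key]][field]
            if value not in values:  # keep each distinct value once, in order
                values.append(value)
    return {rid: dict(fields) for rid, fields in merged.items()}


merged = merge_by_resource(records)
print(merged["unb:1"]["vernacularName"])
```

This only shows the grouping step. It does not record which values are semantically linked to one another (for example, a name and its language), which is the harder part of the general issue described above.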
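Bob Morris
-- Robert A. Morris
Emeritus Professor of Computer Science UMASS-Boston 100 Morrissey Blvd Boston, MA 02125-3390 IT Staff Filtered Push Project Department of Organismal and Evolutionary Biology Harvard University
email: morris.bob@gmail.com web: http://efg.cs.umb.edu/ web: http://etaxonomy.org/mw/FilteredPush http://www.cs.umb.edu/~ram phone (+1) 857 222 7992 (mobile)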
I'd love it if someone could explain the reason for this restriction on Simple Darwin Core. It seems somewhat anachronistic, given that we're encouraging everyone to think in RDF. On the representation side, repetition of a field poses no problems for spreadsheets, XML, or RDF. On the storage side, it is an issue for RDBMS systems, but consuming applications can address this by creating the kinds of records Bob describes. Am I missing something?
Many thanks, Joel.
Joel, is the description of the Simple Darwin Core (http://rs.tdwg.org/dwc/terms/simple/index.htm) insufficient to explain the restriction?
I would say that the goal of many of "us" is to encourage everyone to share biodiversity information. I would even go so far as to say that our success as biodiversity informaticians will be to make sure that most people never have to think in rdf. Like any good infrastructure, it should disappear from everyday concern.
Hi John,
The description of Simple Darwin Core justifies the restriction by saying that it's just like the restriction in relational databases. But that's a storage issue, not a representation issue. Maybe my real question is: Whose life is Simple Darwin Core supposed to simplify, the data provider's, or the aggregator's?
Joel.
It's a storage issue, a generation issue, a transportation issue, a processing issue, a consumption issue: it affects all aspects of a workflow. Simple Darwin Core is meant to help those whose lives are not steeped in informatics, and who have no desire to tread there. That is, in fact, the majority of those providing data, and they would not be able to do so under current conditions without tools such as the GBIF Integrated Publishing Toolkit (IPT) or without assistance.
On Fri, Jul 22, 2011 at 8:16 AM, joel sachs jsachs@csee.umbc.edu wrote:
Hi John,
The description of Simple Darwin Core justifies the restriction by saying that it's just like the restriction in relational databases. But that's a storage issue, not a representation issue. Maybe my real question is: Whose life is Simple Darwin Core supposed to simplify, the data provider's, or the aggregator's?
Joel.
On Fri, 22 Jul 2011, John Wieczorek wrote:
Joel, is the description of the Simple Darwin Core (http://rs.tdwg.org/dwc/terms/simple/index.htm) insufficient to explain the restriction?
I would say that the goal of many of "us" is to encourage everyone to share biodiversity information. I would even go so far as to say that our success as biodiversity informaticians will be to make sure that most people never have to think in rdf. Like any good infrastructure, it should disappear from everyday concern.
On Fri, Jul 22, 2011 at 6:42 AM, joel sachs jsachs@csee.umbc.edu wrote:
I'd love it if someone could explain the reason for this restriction on Simple Darwin Core. It seems somewhat anachronistic, given that we're encouraging everyone to think in rdf. On the representation side, repetition of a field poses no problems for spreadsheets, xml, or rdf. On the storage side, it is an issue for RDBMS systems; but, consuming applications can address this by creating the kinds of records Bob describes below. Am I missing something?
Many thanks, Joel.
On Thu, 21 Jul 2011, Bob Morris wrote:
There's a general issue with repeated attributes in a metadata record of any kind. Depending on the representation language, when there is more than one such thing in the record, it can be difficult to specify any linkages between them when they are semantically related.
One general solution is to have multiple metadata records for the same resource. This can be costly if there is a powerful reason that every such record should carry the complete set of attributes except for the repeated ones, but in the case you put on the table, I think the only powerful reason would take the form "There are a lot of stupid DwC applications out there that might discover a record that has nothing in it but, say, the French vernacular name and a resourceID, and stop there without ever looking for/at another record with the same resourceID and more comprehensive metadata, and integrating the results at the application level."
A response might be "But the point of simple DwC is to support simple applications." But "simple application" is not the same thing as "simple minded application", and my guess is that addressing the issue of multiple metadata records at the application side is, for many applications, less programming effort than other workarounds.
Bob Morris
-- Robert A. Morris
Emeritus Professor of Computer Science UMASS-Boston 100 Morrissey Blvd Boston, MA 02125-3390 IT Staff Filtered Push Project Department of Organismal and Evolutionary Biology Harvard University
email: morris.bob@gmail.com web: http://efg.cs.umb.edu/ web: http://etaxonomy.org/mw/FilteredPush http://www.cs.umb.edu/~ram phone (+1) 857 222 7992 (mobile)
John,
I'm sorry - I still don't understand. How does telling a data provider that he can't re-use a field make his life easier?
Thanks, Joel.
On Fri, 22 Jul 2011, John Wieczorek wrote:
It's a storage issue, a generation issue, a transportation issue, a processing issue, a consumption issue - it affects all aspects of a workflow. It is meant to help those whose lives are not steeped in informatics, and who have no desire to tread there - in fact, the majority of those providing data, who would not be able to do so under current conditions without tools such as the GBIF Integrated Publishing Toolkit (IPT) or without assistance.
On Fri, Jul 22, 2011 at 8:16 AM, joel sachs jsachs@csee.umbc.edu wrote:
Hi John,
The description of Simple Darwin Core justifies the restriction by saying that it's just like the restriction in relational databases. But that's a storage issue, not a representation issue. Maybe my real question is: Whose life is Simple Darwin Core supposed to simplify, the data provider's, or the aggregator's?
Joel.
On Fri, 22 Jul 2011, John Wieczorek wrote:
Joel, is the description of the Simple Darwin Core (http://rs.tdwg.org/dwc/terms/simple/index.htm) insufficient to explain the restriction?
I would say that the goal of many of "us" is to encourage everyone to share biodiversity information. I would even go so far as to say that our success as biodiversity informaticians will be to make sure that most people never have to think in rdf. Like any good infrastructure, it should disappear from everyday concern.
On Fri, Jul 22, 2011 at 6:42 AM, joel sachs jsachs@csee.umbc.edu wrote:
I'd love it if someone could explain the reason for this restriction on Simple Darwin Core. It seems somewhat anachronistic, given that we're encouraging everyone to think in rdf. On the representation side, repetition of a field poses no problems for spreadsheets, xml, or rdf. On the storage side, it is an issue for RDBMS systems; but, consuming applications can address this by creating the kinds of records Bob describes below. Am I missing something?
Many thanks, Joel.
I really appreciate all of the feedback on this point. Lots of interesting ideas to think about.
Looking at the various responses, it seems to me that by not allowing the repetition of fields, DwC limits its usefulness to information managers such as myself. Some of the work-arounds, such as the GBIF Vernacular Name extension that Peter Desmet pointed me towards, look useful in the particular example that I gave, but won't work in others. It is also a fairly complex process, belying the "simple" part of DwC.
Such a process definitely wouldn't work with some of the other fields that I would like to repeat for our data. I quite dislike the idea of concatenating all the sample collectors from one specimen into a single field, since that will make the process of finding individuals more challenging. It would not be possible to create a relational table for our collectors such as the one for vernacular names.
The other field(s) that I need to repeat pertain to location data. Our dataset currently lists location information in at least five different systems (decimal Lat/Long; deg. min. sec.; UTM; NTS; and verbatim descriptions), and often up to four are used on a given sample. At times the UTM data is generated from degrees Lat/Long, but at other times the reverse is true (and, of course, there is no way of telling from the database alone which is the original). Further, small errors abound in the data that could have crept in during conversions, or possibly even reside with the original data. The data from over 40,000 specimens have already been entered into the database in this manner, and no one is going to go back to double check them all. I desperately want to keep ALL the location data out of fear that we might not present the one accurate measure, and creating a relational table for every geographic point in New Brunswick (let alone the rest of Canada!) is out of the question. (This description of the locational data has been significantly simplified from the actual reality, so please don't start nailing me on technicalities here.)
It seems to me, then, that we will have to maintain the data in our own metadata system, and use that to generate DwC (along the lines of what Bob Morris recommended in his first response). That's fine by us, but should, perhaps, be of some concern to a metadata standards working group. Since Darwin Core will not be our de facto standard, its generation and accuracy will be of less relevance to us. I fear the DwC records will become out of date, or start to reflect errors as their maintenance becomes less important to us. Furthermore, it suggests that we may have to create duplicate sets of data, rather than one set that can be easily harvested for use by other collections.
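To make the "maintain locally, generate DwC" idea concrete, here is a minimal Python sketch of flattening a richer local record into a single flat record. The local record layout, the `to_flat_record` helper, the sample values (invented for illustration and not converted from one another), and the choice to join leftover coordinate strings with " | " are all assumptions. The output keys `decimalLatitude`, `decimalLongitude`, `verbatimLocality`, and `verbatimCoordinates` are Darwin Core term names, but how best to populate them should be checked against the standard's own documentation.

```python
# Hypothetical local record that keeps every location representation.
specimen = {
    "catalogNumber": "UNB-0001",
    "locations": [
        {"system": "decimal degrees", "value": "45.96, -66.64"},
        {"system": "UTM", "value": "19T 682500E 5092000N"},
        {"system": "verbatim description", "value": "Fredericton, near the river"},
    ],
}


def to_flat_record(specimen):
    """Build one flat record; the local store remains the source of truth."""
    flat = {"catalogNumber": specimen["catalogNumber"]}
    leftovers = []
    for loc in specimen["locations"]:
        if loc["system"] == "decimal degrees" and "decimalLatitude" not in flat:
            # Only the first decimal pair is promoted to the numeric fields.
            lat, lon = (float(part) for part in loc["value"].split(","))
            flat["decimalLatitude"], flat["decimalLongitude"] = lat, lon
        elif loc["system"] == "verbatim description":
            flat["verbatimLocality"] = loc["value"]
        else:
            # Everything else is carried along as labelled verbatim text.
            leftovers.append(f'{loc["system"]}: {loc["value"]}')
    if leftovers:
        flat["verbatimCoordinates"] = " | ".join(leftovers)
    return flat


flat = to_flat_record(specimen)
print(flat)
```

Nothing is discarded from the local record; the flat record is a derived view that can be regenerated whenever the local data changes.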
From my perspective, it would be nice if we could mark this biological data up in one well-designed, flexible metadata standard. The Dublin Core group, of course, recognised the importance of flexibility, allowing for Qualified DC along with their simple set, and XML is as popular as it is today because of its extensibility. I would worry that DwC might be painting itself into a corner if it tries to adhere to too narrow a set of rules.
Perplexing stuff, indeed.
Thanks again for all your advice, Geoffrey --------------------------------------------
Geoffrey Allen Digital Projects Librarian Electronic Text Centre Harriet Irving Library University of New Brunswick Fredericton, NB E3B 5H5 Tel: (506) 447-3250 Fax: (506) 453-4595 gsallen@unb.ca
Wait. You weren't talking about using Simple DwC for data in the backend, were you? That's not the primary purpose of Simple DwC, which is, rather, an exchange standard. It could be used in the backend if the backend is a single flat table living within its restrictions, but you already know that you can't live with that.
I don't think DwC per se is in danger of painting itself into a corner. The GBIF Data Portal serves 293M taxon occurrence records ingested from 339 different data providers and served in, among other forms, DwC. Seems like a pretty big corner to me. (Though I may be the only regular reader of this list who doesn't know whether it ingests and/or serves Simple DwC.... ).
I'd dare say that you'd get a lot of disagreement from TDWG members who have designed complex XML-Schemas about how wonderful the extension mechanisms for XML are, if one cares about structure constrained by an XML-Schema. One is pretty much limited to use of xs:any --- which somewhat defeats the purpose of a schema language---or something dynamic like WSDL and SOAP wherein the client discovers the Schema at query time, or runtime-applied rule languages like Schematron.
Bob Morris
On Fri, Jul 22, 2011 at 11:03 AM, Bob Morris morris.bob@gmail.com wrote:
Wait. You weren't talking about using Simple DwC for data in the backend, were you? That's not the primary purpose of Simple DwC, which is, rather, an exchange standard. It could be used in the backend if the backend is a single flat table living within its restrictions, but you already know that you can't live with that.
Exactly. This illustrates very well why the Darwin Core in general and Simple Darwin Core specifically should not be misconstrued as a model for a database design. There was never any such intention.
I don't think DwC per se is in danger of painting itself into a corner. The GBIF Data Portal serves 293M taxon occurrence records ingested from 339 different data providers and served in, among other forms, DwC. Seems like a pretty big corner to me. (Though I may be the only regular reader of this list who doesn't know whether it ingests and/or serves Simple DwC.... ).
Hopefully my last post shed some light on that. GBIF can harvest Darwin Core Archives, among other forms and methods of transport of Darwin Core and other data sets. In the Darwin Core Archive, there is always a core record, which can be represented completely in Simple Darwin Core.
Dear all,
I recommend reading all of the documents pertaining to the Darwin Core (http://rs.tdwg.org/dwc/), as there are many misconceptions surfacing here. I'll try to summarize, but this shouldn't be taken as a substitute for the "facts", which are the standard as published.
The Darwin Core is first and foremost a set of terms - the common ground through which we seek to convey biodiversity information, so that when we label something with one of these terms, we all have the potential to understand what it means. The canonical form of this set of terms is defined in RDF. The rest is about implementation, for which there are many documents and reference specifications as part of the standard, and many software tools that make use of them.
The Simple Darwin Core is just one of the many ways of using Darwin Core terms. Simple Darwin Core has its uses and its limitations. It is easy to produce, reflects a lot of the data we have, and must be "flat" by design. The Simple Darwin Core can be used as a "core" to structured data as well, as implemented in the GBIF Integrated Publishing Toolkit (IPT) software, which accepts a "core" record as a Taxon or Occurrence and allows that to be extended in structures that can be expressed as a star schema - basically one level removed from "flat", where all extensions are related only to the core, but each of which can express a one-to-many relationship to that core record. This isn't "The Darwin Core", it is one example of what someone has done with the Darwin Core in software to make it useful. This text-based use of the Darwin Core is supported by the documentation in the Darwin Core Text Guide (http://rs.tdwg.org/dwc/terms/guides/text/index.htm), which describes the specifications of a Darwin Core Archive.
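The star arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two inline CSV texts, the column names `id`, `coreid`, and `language`, and the `names_for` helper are made up for the example and are not taken from the Darwin Core Text Guide or from any published extension definition.

```python
import csv
import io

# Hypothetical core file: one flat row per record.
core_csv = """id,scientificName
1,Allium schoenoprasum
"""

# Hypothetical extension file: many rows may point at one core id.
extension_csv = """coreid,vernacularName,language
1,Chives,en
1,Ciboulette,fr
1,brulotte,fr
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def names_for(core_id, extension_rows, language=None):
    """Return the vernacular names linked to a core id, optionally for one language."""
    return [
        row["vernacularName"]
        for row in extension_rows
        if row["coreid"] == core_id
        and (language is None or row["language"] == language)
    ]


core_rows = read_rows(core_csv)
extension_rows = read_rows(extension_csv)

all_names = names_for(core_rows[0]["id"], extension_rows)
english_names = names_for(core_rows[0]["id"], extension_rows, language="en")
print(all_names, english_names)
```

Because each name sits in its own extension row, a consumer that wants only the English names can filter on the language column instead of picking apart a concatenated string.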
That doesn't mean Darwin Core can't support highly relational data. Reference schemas for XML (http://darwincore.googlecode.com/svn/trunk/xsd/), and documents on how to use XML for Darwin Core (http://rs.tdwg.org/dwc/terms/guides/xml/index.htm) are a part of the standard, and have been both implemented and extended on at least two occasions (Germplasm Extension for genetic resources - http://code.google.com/p/darwincore-germplasm/downloads/detail?name=ipt_germ...; and Apiary Extension for Herbarium specimen labels - http://www.apiaryproject.org/about-apiary-project). These schemas can be used to share documents in XML using, for example, the TapirLink software (http://wiki.tdwg.org/twiki/bin/view/TAPIR/TapirLink), which implements another of TDWG's standards - TAPIR (the TDWG Access Protocol for Information Retrieval - http://www.tdwg.org/standards/449/). This XML-based use of the Darwin Core is supported by the Darwin Core XML Guide (http://rs.tdwg.org/dwc/terms/guides/xml/index.htm), which describes how to use and construct Darwin Core schemas.
People will naturally ask about Darwin Core in RDF. The canonical form of the Darwin Core is an RDF document, which contains all of the attributes of every term, including the RDF attributes that relate all of the terms to each other, and to terms in other standards such as Dublin Core. There is no RDF Guide in the body of Darwin Core documents published with the standard. This was intentional. It reflects our level of competence as a community in semantic-web technologies at the time the standard was published. Many excellent discussions around that topic have taken place here on this list in an effort to fill the gap for those who would like to link biodiversity and other data in new ways. That subject begs for dedicated attention from those who have the skills and resources to lead it forward.
In summary, the Darwin Core is a living standard (in the sense of being active), having mechanisms to expand and adapt around its core competency, which is the definition of the meaning of the common terms through which we would like to promote the sharing of biodiversity information.
Your point is fair enough, and living with Simple DwC is a Good Thing for people with not much experience. But by and large the people who write and support tools like IPT and similar aids are experienced software engineers who would have little trouble implementing, e.g., serving multiple records against the same ResourceID. The issue would then become what problems this presents to existing or future consuming applications, and how the cost of solving those problems compares to that of solving those that arise from some other solution, such as having to include an atomizer to parse a concatenation-based string. (Probably the ability to do that carries a somewhat lower experience barrier to entry than integrating records.)
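For comparison, a minimal sketch of such an atomizer in Python follows. The concatenation convention it parses ("en: Chives | fr: Ciboulette, brulotte") is purely an assumption for this example, as is the `atomize` helper; nothing here is prescribed by Darwin Core. Untagged text is filed under "und", the ISO 639-2 code for an undetermined language.

```python
def atomize(value, entry_sep="|", tag_sep=":", name_sep=","):
    """Split a concatenated vernacular-name string into {language: [names]}."""
    names = {}
    for entry in value.split(entry_sep):
        entry = entry.strip()
        if not entry:
            continue
        tag, found, text = entry.partition(tag_sep)
        if not found:
            # No language tag present: keep the text, mark language undetermined.
            tag, text = "und", entry
        parts = [name.strip() for name in text.split(name_sep) if name.strip()]
        names.setdefault(tag.strip(), []).extend(parts)
    return names


parsed = atomize("en: Chives | fr: Ciboulette, brulotte")
print(parsed)
```

The sketch also shows where the approach is fragile: a common name that itself contains a comma, colon, or pipe would be split incorrectly, so provider and consumer must agree on the delimiters in advance.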
Bob
On Fri, Jul 22, 2011 at 11:31 AM, John Wieczorek tuco@berkeley.edu wrote:
It's a storage issue, a generation issue, a transportation issue, a processing issue, a consumption issue - it affects all aspects of a workflow. It is meant to help those whose lives are not steeped in informatics, and who have no desire to tread there - in fact, the majority of those providing data, who would not be able to do so under current conditions without tools such as the GBIF Integrated Publishing Toolkit (IPT) or without assistance.
On Fri, Jul 22, 2011 at 8:16 AM, joel sachs jsachs@csee.umbc.edu wrote:
Hi John,
The description of Simple Darwin Core justifies the restriction by saying that it's just like the restriction in relational databases. But that's a storage issue, not a representation issue. Maybe my real question is: Whose life is Simple Darwin Core supposed to simplify, the data provider's, or the aggregator's?
Joel.
On Fri, 22 Jul 2011, John Wieczorek wrote:
Joel, is the description of the Simple Darwin Core (http://rs.tdwg.org/dwc/terms/simple/index.htm) insufficient to explain the restriction?
I would say that the goal of many of "us" is to encourage everyone to share biodiversity information. I would even go so far as to say that our success as biodiversity informaticians will be to make sure that most people never have to think in rdf. Like any good infrastructure, it should disappear from everyday concern.
On Fri, Jul 22, 2011 at 6:42 AM, joel sachs jsachs@csee.umbc.edu wrote:
I'd love it if someone could explain the reason for this restriction on Simple Darwin Core. It seems somewhat anachronistic, given that we're encouraging everyone to think in rdf. On the representation side, repetition of a field poses no problems for spreadsheets, xml, or rdf. On the storage side, it is an issue for RDBMS systems; but, consuming applications can address this by creating the kinds of records Bob describes below. Am I missing something?
Many thanks, Joel.
On Thu, 21 Jul 2011, Bob Morris wrote:
There's a general issue with repeated attributes in a metadata record of any kind. Depending on the representation language, when there is more than one such thing in the record, it can be difficult to specify any linkages between them when they are semantically related.
One general solution is to have multiple metadata records for the same resource. This can be costly if there is a powerful reason that every such record should carry the complete set of attributes except for the repeated ones, but in the case you put on the table, I think the only powerful reason would take the form "There are a lot of stupid DwC applications out there that might discover a record that has nothing in it but, say, the French vernacular name and a resourceID, and stop there without ever looking for/at another record with the same resourceID and more comprehensive metadata, and integrating the results at the application level."
A response might be "But the point of simple DwC is to support simple applications." But "simple application" is not the same thing as "simple minded application", and my guess is that addressing the issue of multiple metadata records at the application side is, for many applications, less programming effort than other workarounds.
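As a rough illustration of what "integrating the results at the application level" could involve, the sketch below merges several partial records that share a resourceID. The record layout, the function name integrate_records, the identifier "UNB-0001", and the rule that the first value wins for non-repeatable terms are all assumptions made for this example.

```python
def integrate_records(records, id_field="resourceID", repeatable=("vernacularName",)):
    """Merge partial records that share an identifier into one record per resource.

    Terms listed in `repeatable` are collected into lists; for every other
    term the first value seen is kept (an arbitrary rule for this sketch).
    """
    merged = {}
    for record in records:
        rid = record[id_field]
        target = merged.setdefault(rid, {id_field: rid})
        for term, value in record.items():
            if term == id_field:
                continue
            if term in repeatable:
                values = target.setdefault(term, [])
                if value not in values:
                    values.append(value)
            else:
                target.setdefault(term, value)
    return merged


records = [
    {"resourceID": "UNB-0001", "scientificName": "Allium schoenoprasum", "vernacularName": "Chives"},
    {"resourceID": "UNB-0001", "vernacularName": "Ciboulette"},  # carries only the French name
]
integrated = integrate_records(records)
```

A consumer that stops at the second record alone would see only the French name, which is the "simple minded application" failure described above.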
Bob Morris
tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content
-- Robert A. Morris
Emeritus Professor of Computer Science UMASS-Boston 100 Morrissey Blvd Boston, MA 02125-3390 IT Staff Filtered Push Project Department of Organismal and Evolutionary Biology Harvard University
email: morris.bob@gmail.com web: http://efg.cs.umb.edu/ web: http://etaxonomy.org/mw/FilteredPush http://www.cs.umb.edu/~ram phone (+1) 857 222 7992 (mobile)
Here are some of my thoughts, as someone who has had my fair share of experience in consuming a wide range of DwC/ABCD/other specimen/observation data and attempted to bring it all together within a DwC-centric framework.
First, thanks to the care taken by John and the other developers of DwC, there are actually very few terms in DwC that are good candidates for repetition in Simple DwC. If DwC had a term like otherCatalogNumber, I can see few good reasons why the no-repetition restriction would have to apply to that term. In fact, DwC does not have otherCatalogNumber. It has otherCatalogNumbers, which sidesteps the issue. I tend to think that processing of other catalog numbers would be much simpler for all concerned if each number was provided separately rather than as a human-readable concatenation.
Secondly, the real problem with repeated metadata terms comes when there are implicit nested semantic relationships between terms. decimalLatitude and decimalLongitude are a good example. Repeated pairs of these terms without some organising structure could not safely be interpreted. If it was considered important for a vernacular name to be associated with the language where the name is used, perhaps through a vernacularNameLanguage term to accompany vernacularName, this problem would occur. However, as far as I can see, DwC is making no attempt to track the language of vernacular names. I would say therefore that vernacularName (as currently used in DwC) is remarkably close to the fictitious otherCatalogNumber in the previous paragraph. Repeating this term would not cause semantic problems. It would only cause problems for certain kinds of serialisation or storage of the data.
Thirdly, any consumer of DwC needs to be able to handle many issues. Repeated vernacular names are among the less problematic. In practice, any expectation that providers will serve ABCD or non-Simple DwC will require most clients to deal with complex cases.
My feeling is therefore that it might in theory be beneficial for us to define a useful form of DwC which varied from Simple DwC only in that it allowed repetition of some terms. However, while the only obvious term requiring this exemption is vernacularName, it may be simplest for those providers that need to serve multiple vernacular names for a single record not to claim to use Simple DwC. As suggested, for most serious clients, the real requirement will be to consume any DwC, not just simple.
This brings me, however, to something that is a very real concern to me. Class-based DwC representations of data may be very complex. The reason that Simple DwC exists is that it corresponds with a range of end uses which are well understood and which rely on what-species-was-recorded-when-and-where-and-with-what-level-of-evidence. This means essentially that these consumers need, for any DwC record, to be able to determine at least the scientificName, decimalLatitude, decimalLongitude, eventDate and basisOfRecord (and preferably the coordinatePrecision, coordinateUncertaintyInMeters and some of the record/provider-identifier terms). Even when a collection database contains multiple identifications for a specimen, there will normally be a current-best identification. Similarly for other repeatable database elements. We certainly need to be able to stream out the full complexity of our biodiversity data, but have we ensured that there is a reliable and consistent way for consumers to take the class-based data and derive the equivalent of the most appropriate Simple DwC for the same data? If not, what can be done to promote the reliability and consistency of such interpretations?
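One possible derivation rule can be sketched as follows. This is only an illustration under stated assumptions: the nested "identifications" list and the "isCurrent" flag are hypothetical structures invented for the example, the sample values are made up, and picking the flagged (or else most recently dated) identification is just one candidate rule, not an agreed convention.

```python
def to_simple_record(occurrence):
    """Flatten an occurrence with nested identifications into one flat record.

    Prefers identifications flagged 'isCurrent' (hypothetical flag); among the
    candidates, the latest dateIdentified wins. ISO 8601 dates compare
    correctly as strings, and a missing date sorts first.
    """
    identifications = occurrence.get("identifications", [])
    current = [i for i in identifications if i.get("isCurrent")]
    candidates = current or identifications
    best = max(candidates, key=lambda i: i.get("dateIdentified", ""), default={})
    simple = {k: v for k, v in occurrence.items() if k != "identifications"}
    simple["scientificName"] = best.get("scientificName")
    return simple


occurrence = {
    "basisOfRecord": "PreservedSpecimen",
    "eventDate": "1998-06-14",
    "decimalLatitude": 45.95,
    "decimalLongitude": -66.64,
    "identifications": [
        {"scientificName": "Allium sp.", "dateIdentified": "1998-06-20"},
        {"scientificName": "Allium schoenoprasum", "dateIdentified": "2005-03-02", "isCurrent": True},
    ],
}
simple = to_simple_record(occurrence)
```

Two consumers that chose different rules here (for example, earliest versus latest identification) would derive different Simple DwC records from the same source, which is the consistency concern raised above.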
I'd be very interested in thoughts on this last part (if, at this time of night, I've made sense).
Thanks,
Donald
Donald Hobern, Director, Atlas of Living Australia CSIRO Ecosystem Sciences, GPO Box 1700, Canberra, ACT 2601 Phone: (02) 62464352 Mobile: 0437990208 Email: Donald.Hobern@csiro.au Web: http://www.ala.org.au/
Geoffrey, I want to point out that ABCD, another TDWG collection data standard, does enable multiple, including vernacular, names. ABCD is being used in Europe and I believe GBIF interfaces with it along with DwC and DwCArchive. DwCArchive also enables multiple names but is not a TDWG standard. Either could be a solution to your situation.
Chuck
Slight correction: the Darwin Core Archive is an implementation of the Darwin Core Text specification, which is a part of the standard.
There are W3C standards for expressing (tagging) the language of elements in an XML document [1], which is built into the XML standard (the xml:lang attribute), and for the language of strings in RDF documents [2] (like this: "some string"@en-US). The tags for languages are also standardized [3].
I'd strongly recommend against reinventing mechanisms for this - XML is for exchanging, and not displaying information. Reinventions of the standard (like the below, or putting the language into parentheses) typically appear motivated by how one would like to display the information - that's what XSLT or custom programming is for, though.
Using the notation for text in RDF, you could easily enumerate several strings, each tagged with a different language and perhaps delimited by commas, for a single instance of a DwC-A field.
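A minimal sketch of how a consumer might read such a field follows. It assumes the value is a comma-delimited run of quoted strings in the "text"@tag form; that exact serialisation, the function name parse_tagged_names, and the sample value are assumptions for illustration. The pattern tolerates backslash-escaped quotes inside a string but does not unescape them.

```python
import re

# A double-quoted string followed by @ and a language tag such as en or en-US.
TAGGED_LITERAL = re.compile(
    r'"((?:[^"\\]|\\.)*)"@([A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*)'
)


def parse_tagged_names(field_value):
    """Group language-tagged strings by lower-cased language tag."""
    names = {}
    for text, lang in TAGGED_LITERAL.findall(field_value):
        names.setdefault(lang.lower(), []).append(text)
    return names


names = parse_tagged_names('"Chives"@en, "Ciboulette"@fr, "brulotte"@fr')
english = names.get("en", [])
```

Because each string carries its own tag, a partner that wants only the English names can select them without guessing at delimiters inside the names themselves.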
-hilmar
[1] http://www.w3.org/TR/xml-i18n-bp/#AuthLang
[2] http://www.w3.org/2007/OWL/wiki/InternationalizedStringSpec#Preliminaries
[3] http://www.w3.org/International/questions/qa-choosing-language-tags
I was thinking that the DarwinCore names should be set to be subproperties of either rdfs:label or the corresponding SKOS label properties:
skos:prefLabel - dwc:scientificName
skos:altLabel - dwc:vernacularName
This would allow consuming services to understand how to interpret them.
If you look at the DBpedia record for the Cougar you will find the following:
<rdf:Description rdf:about="http://dbpedia.org/resource/Cougar"><rdfs:label xml:lang="en">Cougar</rdfs:label></rdf:Description>
<rdf:Description rdf:about="http://dbpedia.org/resource/Cougar"><rdfs:label xml:lang="fr">Puma</rdfs:label></rdf:Description>
So if the DarwinCore is RDF, then having two vernacularNames that differ in their language tag should not be a problem.
If this is strict schema-bound XML, then it is.
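To show how a consumer could tell such repeated labels apart, here is a small sketch using Python's standard xml.etree.ElementTree. It reads the XML structure only and is not a full RDF parser; the wrapping rdf:RDF element and the function name labels_by_language are assumptions added so the fragment is a complete, parseable document, and rdfs:label stands in for whichever name property is actually used.

```python
import xml.etree.ElementTree as ET

RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"
# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DOCUMENT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://dbpedia.org/resource/Cougar">
    <rdfs:label xml:lang="en">Cougar</rdfs:label>
    <rdfs:label xml:lang="fr">Puma</rdfs:label>
  </rdf:Description>
</rdf:RDF>"""


def labels_by_language(rdf_xml):
    """Collect the text of every rdfs:label element, keyed by its xml:lang value."""
    root = ET.fromstring(rdf_xml)
    labels = {}
    for label in root.iter(f"{{{RDFS_NS}}}label"):
        labels.setdefault(label.get(XML_LANG), []).append(label.text)
    return labels


labels = labels_by_language(DOCUMENT)
```

The two labels repeat the same property on the same resource, and the language tag alone is enough to keep them distinct.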
- Pete
Peter, dwc terms, even though defined with the help of rdf, are not tied to a technology and cannot represent complex objects with multiple properties. Subclassing DC terms is something we do in the definitions (aka "refines"), but that doesn't mean they inherit the DC implementations of xml or rdf.
That said, I agree that it makes sense to have the 2 name terms refine some label, maybe even dc:title, to draw from DC as we do in other places. But that won't give us the language attribute Geoffrey is looking for.
Markus
PS: From the last years' experience at GBIF, I can only say that sharing common names with the vernacular names extension for dwc archives has worked very well for various partners, both publishing and consuming: http://rs.gbif.org/extension/gbif/1.0/vernacularname.xml
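As a rough sketch of how an extension file can carry several names per core record, the example below joins a core table to a separate table of vernacular names and filters by language. The comma-separated layout, the "id" and "coreid" column headers, the sample identifier, and the function name names_for are assumptions for illustration; in a real archive the files and their columns are described by the archive's meta.xml, and the extension definition linked above should be consulted for the actual terms.

```python
import csv
import io

# Hypothetical core file: one row per record.
CORE = """id,scientificName
UNB-0001,Allium schoenoprasum
"""

# Hypothetical extension file: any number of rows per core record.
EXTENSION = """coreid,vernacularName,language
UNB-0001,Chives,en
UNB-0001,Ciboulette,fr
UNB-0001,brulotte,fr
"""


def names_for(core_text, extension_text, wanted_language=None):
    """Return {core id: [vernacular names]}, optionally limited to one language."""
    names = {}
    for row in csv.DictReader(io.StringIO(extension_text)):
        if wanted_language is None or row["language"] == wanted_language:
            names.setdefault(row["coreid"], []).append(row["vernacularName"])
    core_rows = csv.DictReader(io.StringIO(core_text))
    return {row["id"]: names.get(row["id"], []) for row in core_rows}


all_names = names_for(CORE, EXTENSION)
english_names = names_for(CORE, EXTENSION, wanted_language="en")
```

With one name per row, a consumer that wants only the English names can filter on the language column instead of parsing a concatenated string.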
On Jul 25, 2011, at 21:52, Peter DeVries wrote:
I was thinking that the DarwinCore names should be set to be either subproperties of rdfs:label or
skos:preflabel - dwc:scientificName skos:altlabel - dwc:vernacularName
This would consuming services to understand how to interpret them.
If you look at the DBpedia record for the Cougar you will find the following:
<rdf:Description rdf:about="http://dbpedia.org/resource/Cougar%22%3E<rdfs:label xml:lang="en">Cougar</rdfs:label></rdf:Description> <rdf:Description rdf:about="http://dbpedia.org/resource/Cougar%22%3E<rdfs:label xml:lang="fr">Puma</rdfs:label></rdf:Description>
So if the DarwinCore is RDF, then having two vernacularNames that differ in their language tag should not be a problem.
If this is strict, schema-bound XML, then it is.
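To make the RDF case concrete, here is a small sketch that reads two language-tagged labels for one resource and looks them up by language. It uses only Python's standard-library XML parser rather than an RDF toolkit, and the wrapping `rdf:RDF` element and the `labels_by_lang` helper are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"
# ElementTree expands the built-in xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

doc = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://dbpedia.org/resource/Cougar">
    <rdfs:label xml:lang="en">Cougar</rdfs:label>
    <rdfs:label xml:lang="fr">Puma</rdfs:label>
  </rdf:Description>
</rdf:RDF>"""


def labels_by_lang(rdf_xml):
    """Map each resource URI to {language tag: label text}.

    If a resource has several labels in the same language,
    this simple sketch keeps only the last one.
    """
    result = {}
    root = ET.fromstring(rdf_xml)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        about = desc.get(f"{{{RDF_NS}}}about")
        langs = result.setdefault(about, {})
        for label in desc.findall(f"{{{RDFS_NS}}}label"):
            langs[label.get(XML_LANG)] = label.text
    return result


labels = labels_by_lang(doc)
cougar = labels["http://dbpedia.org/resource/Cougar"]
```

The same property appears twice on one resource, and the language tag alone tells a consumer which value to use.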
- Pete
On Thu, Jul 21, 2011 at 10:23 AM, Geoffrey Allen gsallen@unb.ca wrote:
[...]
--
Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 Email: pdevries@wisc.edu TaxonConcept & GeoSpecies Knowledge Bases A Semantic Web, Linked Open Data Project
participants (11): Markus Döring (GBIF), Bob Morris, Chuck Miller, Donald.Hobern@csiro.au, Geoffrey Allen, Hilmar Lapp, joel sachs, John Wieczorek, Paul J. Morris, Peter Desmet, Peter DeVries