Idea for Discussion, Differentiating between "type's" of identifiers

Peter DeVries

3 Oct 2010

10:54

I have been thinking about the following pattern, in part after looking at the GBIF vocabulary. I am not sure it is even a good idea, but it might be worth some discussion.

For those fields that have both a string and an "ID" form, the following pattern might be useful:

hasScientificName = string form
hasScientificNameURI = resolvable, LOD-compliant identifier
hasScientificNameLSID = LSID identifier, which could be resolvable once you add the "http:proxy" etc.

This allows all three forms to be included if desired, and it also provides a hint as to how the field should be interpreted or resolved.

One group could also provide a mapping service, so that each record does not need to include all three forms but systems can still find the matching LSID for a given URI or vice versa.

My concern was that it would be difficult for other systems to infer how a scientificNameID should be interpreted. Is it an LSID, a URI, a UUID, etc.? This impacts the structure of the RDF.

* Note that the actual identifiers might not be correct; the example below is more about the form of the RDF
* For instance, I don't think it is correct to see the COL LSID as just a namestring
* Also, in this example the GNI name does not exactly match the string name

<dwc:hasScientificName>Puma concolor (Linnaeus 1771)</dwc:hasScientificName>
<dwc:hasScientificNameURI rdf:resource="http://gni.globalnames.org/name_strings/6c3dc35f-d901-5cc5-b9c8-ad241069b9f8"/>
<dwc:hasScientificNameLSID rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/>

Some systems may choke on the LSID form, assuming that it uses a standard resolution mechanism, so it might be best to use this form:

<dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID>

- Pete

----------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
TaxonConcept Knowledge Base <http://www.taxonconcept.org/> / GeoSpecies Knowledge Base <http://lod.geospecies.org/>
About the GeoSpecies Knowledge Base <http://about.geospecies.org/>
------------------------------------------------------------
_______________________________________________
tdwg-content mailing list
tdwg-content@lists.tdwg.org
http://lists.tdwg.org/mailman/listinfo/tdwg-content
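
The question raised in the message above ("is it an LSID, a URI, a UUID?") is the guessing that a consumer has to do when a single ID field can hold any of these. A rough sketch of that guessing in Python follows. The function names are made up for illustration, and the proxy prefix is an assumption standing in for the "http:proxy" mentioned in the message; substitute whichever resolver applies.

```python
import re
import uuid
from urllib.parse import urlparse

# LSID syntax: urn:lsid:<authority>:<namespace>:<object>[:<revision>]
LSID_PATTERN = re.compile(
    r"^urn:lsid:[^:\s]+:[^:\s]+:[^:\s]+(?::[^:\s]+)?$", re.IGNORECASE
)

# Assumed proxy prefix, not taken from the thread.
LSID_PROXY = "http://lsid.tdwg.org/"


def classify_identifier(value):
    """Return 'lsid', 'http_uri', 'uuid', or 'string' for a raw field value."""
    text = value.strip()
    if LSID_PATTERN.match(text):
        return "lsid"
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "http_uri"
    try:
        uuid.UUID(text)
        return "uuid"
    except ValueError:
        return "string"


def lsid_to_proxy_uri(lsid):
    """Prefix an LSID with the assumed HTTP proxy to get an http form."""
    if classify_identifier(lsid) != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    return LSID_PROXY + lsid.strip()
```

With separate hasScientificName, hasScientificNameURI, and hasScientificNameLSID properties, none of this sniffing would be needed, because the property name already says which form to expect.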


Gregor Hagedorn

3 Oct

11:33


I second this. One of the advantages of GUIDs over normal, domain- and scope-specific identifiers is that they work independently of rules. The TDWG rules for turning LSIDs into resolvable, semantic-web-compliant http URIs partly break this: they require knowledge of the http-to-LSID-to-http conversions, and of which form is canonical in which situation. A pattern that supports resolvable, SW-compliant identifiers in parallel to LSIDs avoids this.

Gregor

Steve Baskauf

4 Oct

08:41


Although this specific example deals with taxonomic name identifiers, it is related to a previous discussion on this list about how we should use the dwc:xxxxxID terms and other terms (such as recordedBy and identifiedBy) that could have either a string (literal) or URI form. Although I don't really want to see an unnecessary proliferation of Darwin Core terms, I think that in the interest of clarity (particularly where RDF is involved) there should either be multiple terms that make it clear what form of identifier is expected, or else an understanding that in RDF the default for such a term is a URI, which would then have an rdfs:label property holding the string form. I think the former would be preferable to the latter.

I came to this opinion when trying to write RDF describing an herbarium specimen. The collector should be the dwc:recordedBy property of the specimen. Optimally, there would be a database in which known collectors were assigned URIs, so that "Glen N. Montz", "Glen Montz", "G. N. Montz", etc. would all be different labels for the same resource. However, realistically, I'm not going to drop what I'm doing to set up such a database (even if I were capable of doing it, which I'm not). So I ended up just writing it as <dwc:recordedBy>Glen N. Montz</dwc:recordedBy>, even though I knew it probably wasn't the best thing.

In a large Occurrence database compiled from RDF created by a lot of people, there might end up being a mixture of strings and URIs for the dwc:recordedBy properties of the specimens. It seems to me that it would be better to have properties like dwc:recordedBy for strings and dwc:recordedByURI for a corresponding URI (and I suppose dwc:recordedByLSID if anyone wants to use it). Of course, this would require a number of term additions to DwC and a clarification in the DwC documentation that the generic version is intended for strings.

With respect to the example

<dwc:hasScientificNameLSID rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/>

I think you are right that (with the possible exception of rdfs:seeAlso) there is an expectation that an rdf:resource attribute will be a resolvable URI that produces RDF. So

<dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID>

is probably better.

Steve

--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu
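
The split between a string-valued dwc:recordedBy and a URI-valued dwc:recordedByURI suggested above can be sketched with Python's standard xml.etree.ElementTree. This is only an illustration under assumptions: dwc:recordedByURI is the term proposed in this thread rather than an existing Darwin Core term, the helper names are invented, and the example.org URIs in the usage note are placeholders.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DWC_NS = "http://rs.tdwg.org/dwc/terms/"

# Make the serializer use the familiar rdf: and dwc: prefixes.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dwc", DWC_NS)


def add_collector(description, value):
    """Attach a collector as a literal or as a URI reference, chosen by form."""
    if value.startswith(("http://", "https://")):
        # Hypothetical URI-valued counterpart proposed in the thread.
        prop = ET.SubElement(description, f"{{{DWC_NS}}}recordedByURI")
        prop.set(f"{{{RDF_NS}}}resource", value)
    else:
        # Plain string form goes into the generic term as a literal.
        prop = ET.SubElement(description, f"{{{DWC_NS}}}recordedBy")
        prop.text = value
    return prop


def describe_specimen(specimen_uri, collectors):
    """Return an RDF/XML string describing one specimen and its collectors."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(root, f"{{{RDF_NS}}}Description")
    description.set(f"{{{RDF_NS}}}about", specimen_uri)
    for collector in collectors:
        add_collector(description, collector)
    return ET.tostring(root, encoding="unicode")
```

Calling describe_specimen("http://example.org/specimen/1", ["Glen N. Montz", "http://example.org/people/montz"]) would emit one literal-valued and one resource-valued property, so a consumer never has to guess which form a given property holds.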

Peter DeVries

6 Oct

00:02


Hi Steve,

You are probably right that it might be best to use rdfs:label, but I am thinking we might be able to get the same result by defining the string variants as subproperties of rdfs:label. This would make them an rdfs:label, but a special kind of rdfs:label.

This is one of those things that I would test with Sindice and URIburner to see if they interpret these correctly. It would require a live vocabulary that Sindice could look at to determine that hasScientificName is to be treated as an rdfs:label.

- Pete
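
What a consumer would have to do to treat hasScientificName as "a special kind of rdfs:label" is follow rdfs:subPropertyOf links in the vocabulary. A minimal sketch, assuming triples are held as plain Python tuples of prefixed names; the helper names are invented, and dwc:hasScientificName is the term proposed in this thread, not an existing one. Whether Sindice or URIburner actually do this is exactly what the message above proposes to test.

```python
RDFS_LABEL = "rdfs:label"
RDFS_SUBPROPERTY_OF = "rdfs:subPropertyOf"


def super_properties(prop, triples):
    """Return prop plus every property reachable via rdfs:subPropertyOf."""
    found = {prop}
    frontier = [prop]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == RDFS_SUBPROPERTY_OF and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def labels_of(subject, triples):
    """Collect values whose property is rdfs:label or a subproperty of it."""
    return sorted(
        o
        for s, p, o in triples
        if s == subject and RDFS_LABEL in super_properties(p, triples)
    )
```

Given a vocabulary triple ("dwc:hasScientificName", "rdfs:subPropertyOf", "rdfs:label") and a data triple (some concept, "dwc:hasScientificName", "Puma concolor"), labels_of would return the name string for that concept even though the data never mentions rdfs:label directly.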

Rutger Vos

01:01


For labels, would it perhaps make sense to use skos:prefLabel and skos:altLabel?

--
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading RG6 6BX
United Kingdom
Tel: +44 (0) 118 378 7535
http://www.nexml.org
http://rutgervos.blogspot.com

Peter DeVries

05:53


Yes, I believe skos:prefLabel and skos:altLabel are also subproperties of rdfs:label. So you could do any of these, but if you want the property to have the name scientificName and be interpreted as an rdfs:label, one way is to make it a subproperty of either rdfs:label or one of the SKOS label properties. In the end, they are all interpreted as a kind of rdfs:label.

...

I don't really see why this is so different from DC.

Yes, people using DC also have the problem that they can't figure out whether to put a URI or a string here.

If you look at the expected states in this example, you will see the names of the states. That is because Sindice knows to put in the label associated with the GeoNames URI; otherwise you would see a GeoNames URI. The same is true for the GNI names. The GeoNames label comes from the GeoNames RDF; I simply mark these up using the GeoNames URIs. In a sense, you use the URI and you get the label for "free".

<http://sig.ma/search?pid=e95218de8a57e4cda099116caa25c5ac>

The use of hasScientificName vs. scientificName is simply to make the triples read more naturally. It is not required.

<concept> dwc:hasScientificName "Puma concolor"

I think part of the problem we are having is that people are not recognizing how different RDF is from straight XML. You really just have to add some variant of rdfs:label in one of the files, and all the other things that reference that URI will get the label for free. So as long as the GNI or GeoNames RDF contains the label, I don't need to include it in my RDF. At the level of the cloud, or of the contents of the triple store, the label only has to be associated with a particular URI once.

- Pete
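
The "use the URI and get the label for free" point can be illustrated by merging a record's triples with triples from another source and substituting labels at display time. This is a toy sketch, not how Sindice or sig.ma work internally: graphs are plain sets of tuples, the render function is invented, and the ex: identifiers in the usage note are placeholders.

```python
RDFS_LABEL = "rdfs:label"


def render(record_graph, *label_sources):
    """Show a record's triples, replacing object URIs with labels where known.

    The record itself carries only URIs; labels may come from any other
    graph that happens to describe the same URI.
    """
    merged = set(record_graph).union(*label_sources)
    # One label per URI is enough; sorting keeps the choice deterministic.
    labels = {s: o for s, p, o in sorted(merged) if p == RDFS_LABEL}
    return sorted((s, p, labels.get(o, o)) for s, p, o in record_graph)
```

If a record holds ("ex:occ1", "dwc:hasScientificNameURI", "ex:name1") and a separate names graph holds ("ex:name1", "rdfs:label", "Puma concolor"), rendering the record against the names graph shows the name string, while any object URI that no source labels is shown unchanged.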

Markus Döring

06:00


...

I think part of the problem we are having is that people are not recognizing how different RDF is from straight XML.

You really just have to add some variant of rdfs:Label in one of the files and all the other things that reference that URI will get the label for free.

So as long as the GNI or GeoNames RDF contains the label in its RDF, I don't need to include that in my RDF.

At the level of the cloud or the contents of the triple store the label only has to be associated with a particular URI once.

Does that mean you don't need any additional literal term at all, and all dwc terms should only be used with URIs?

Markus

...

- Pete

On Wed, Oct 6, 2010 at 3:01 AM, Rutger Vos <rutgeraldo@gmail.com> wrote: For labels, would it perhaps make sense to use skos:prefLabel and skos:altLabel?

On Wed, Oct 6, 2010 at 8:02 AM, Peter DeVries <pete.devries@gmail.com> wrote:

...
Hi Steve, You are probably right that it might be best to use rdfs:Label, but I am thinking we might be able to get the same result my defining the string variants as subproperties of rdfs:Label. This would make them an rdfs:Label but a special kind of rdfs:Label. This is one of those things that I would test with Sindice and URIburner to see if they interpret these correctly. This would require a live vocabulary that Sindice could look at to determine that hasScientificName is to be treated as a rdfs:Label. - Pete

On Mon, Oct 4, 2010 at 10:41 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:

...
Although this specific example deals with taxonomic name identifiers, it is related to a previous discussion on this list about how we should use the dwc:xxxxxID terms and other terms (such as recordedBy and identifiedBy) that could have either a string (literal) or URI form. Although I don't really want to see an unnecessary proliferation of Darwin Core terms, I think that in the interest of clarity (particularly where RDF is involved) there either should be multiple terms that make it clear what form of identifier is expected, or else there should be an understanding that in RDF the default for such a term is a URI which would then have an rdfs:label property which was the string form. I think the former would be preferable to the latter.

I came to this opinion when trying to write RDF describing an herbarium specimen. The collector should be the dwc:recordedBy property of the specimen. Optimally, there would be a database in which known collectors were assigned URIs so that "Glen N. Montz", "Glen Montz", "G. N. Montz", etc. would all be different labels for the same resource. However, realistically, I'm not going to drop what I'm doing to set up such a database (even if I were capable of doing it, which I'm not). So I ended up just writing it as <dwc:recordedBy>Glen N. Montz</dwc:recordedBy> even though I knew it probably wasn't the best thing. In a large Occurrence database that was compiled from the RDF created by a lot of people, there might end up being a mixture of strings and URIs for dwc:recordedBy properties of the specimens. It seems to me that it would be better to have properties like dwc:recordedBy for strings and dwc:recordedByURI for a corresponding URI (and I suppose dwc:recordedByLSID if anyone wants to use it). Of course, this would require a number of term additions to DwC and clarification in the DwC documentation that the generic version was intended for strings.

With respect to the example <dwc:hasScientificNameLSID rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/> I think you are right that (with the possible exception of rdfs:seeAlso) there is an expectation that an rdf:resource attribute will be a resolvable URI that produces RDF. So

<dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID> is probably better.
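To make the distinction concrete, here is a small Python sketch (standard library only) that reads an RDF/XML fragment and reports whether each property is written as an rdf:resource reference or as a plain literal. The `dwc` namespace URI and the wrapping rdf:Description are placeholders added so the fragment parses; they are assumptions for illustration, not the real Darwin Core namespace or a recommended record layout.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DWC_NS = "http://example.org/dwc-sketch/"  # placeholder, NOT the real DwC namespace

DOC = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dwc="{DWC_NS}">
  <rdf:Description rdf:about="http://example.org/occurrence/1">
    <dwc:hasScientificName>Puma concolor (Linnaeus 1771)</dwc:hasScientificName>
    <dwc:hasScientificNameURI rdf:resource="http://gni.globalnames.org/name_strings/6c3dc35f-d901-5cc5-b9c8-ad241069b9f8"/>
    <dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID>
  </rdf:Description>
</rdf:RDF>"""


def property_kinds(rdf_xml):
    """Map each property's local name to ('resource' or 'literal', value)."""
    root = ET.fromstring(rdf_xml)
    kinds = {}
    for description in root:
        for prop in description:
            # ElementTree spells qualified names as "{namespace}local".
            local_name = prop.tag.split("}", 1)[1]
            resource = prop.get(f"{{{RDF_NS}}}resource")
            if resource is not None:
                kinds[local_name] = ("resource", resource)
            else:
                kinds[local_name] = ("literal", (prop.text or "").strip())
    return kinds
```

With the LSID written as element text, a consumer sees it as a literal and has no reason to try dereferencing it; only the hasScientificNameURI value is presented as a link.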

Steve

Peter DeVries wrote:

I have been thinking about the following pattern, in part after looking at the GBIF vocabulary.

I am not sure if it is even a good idea, but it might be worth some discussion.

For those fields that have both a string and an "ID" form, maybe the following pattern might be useful:

hasScientificName = string form
hasScientificNameURI = resolvable, LOD-compliant identifier
hasScientificNameLSID = LSID identifier, which could be resolvable once you add the "http:proxy" etc.

This allows all three forms to be included if desired; it also provides a hint as to how the field should be interpreted or resolved.

One group could also provide a mapping service so that each record does not need to include all three forms, but systems could still find the matching LSID for a given URI or vice versa.

My concern was that it would be difficult to infer how a scientificNameID should be interpreted by other systems.

Is this an LSID, is it a URI, is it a UUID, etc.?

This impacts the structure of the RDF.

* Note that the actual identifiers might not be correct; the example below is more about the form of the RDF
* For instance, I don't think it is probably correct to see the COL LSID as just a namestring
* Also, in this example the GNI name does not exactly match the string name

<dwc:hasScientificName>Puma concolor (Linnaeus 1771)</dwc:hasScientificName>
<dwc:hasScientificNameURI rdf:resource="http://gni.globalnames.org/name_strings/6c3dc35f-d901-5cc5-b9c8-ad241069b9f8"/>
<dwc:hasScientificNameLSID rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/>

Some systems may choke on the LSID form, assuming that it uses a standard resolution mechanism.

So it might be best to use this form:

<dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID>

- Pete

----------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
TaxonConcept Knowledge Base / GeoSpecies Knowledge Base
About the GeoSpecies Knowledge Base
------------------------------------------------------------

-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences

postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.

delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235

office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707 http://bioimages.vanderbilt.edu


_______________________________________________ tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content

-- Dr. Rutger A. Vos School of Biological Sciences Philip Lyle Building, Level 4 University of Reading Reading RG6 6BX United Kingdom Tel: +44 (0) 118 378 7535 http://www.nexml.org http://rutgervos.blogspot.com


Peter DeVries

06:48

New subject: Idea for Discussion, Differentiating between "type's" of identifiers

...

Does that mean you don't need any additional literal term at all, and all dwc terms should only be used with URIs?

Yes, and it might be interesting to try this with those records for which there might be billions, or at least tens of millions, like occurrences. This would result in the most efficient and least ambiguous data format. By that I mean it might be interesting to try this with a test set and see how well it works and what people think.

I was thinking about the LSID addition because there are people who will want to track or include their LSIDs. If it is included in the vocabulary then they can. Those who don't will not have to put those in their records.

Also, it seems that one of the biggest problems we have with the DarwinCore is that users have trouble interpreting what to put into the different fields. I think that what I am suggesting makes this easier to understand. Many times it has not been clear to me what to put in these different fields, so I can imagine the situation is much worse for those who have not been IT people for 22 years.

- Pete

On Wed, Oct 6, 2010 at 8:00 AM, Markus Döring <m.doering@mac.com> wrote:

...

Does that mean you don't need any additional literal term at all, and all dwc terms should only be used with URIs? Markus


Markus Döring

01:18

New subject: Idea for Discussion, Differentiating between "type's" of identifiers

Steve, Pete,

I'd like to draw your attention to a basic DarwinCore design pattern. Dwc has the goal of being technology independent by simply providing a list of abstract terms one can use in various arenas such as xml, rdf, xhtml, csv etc. And even within those there might be various ways of using them (e.g. we have a normalised and a simple flat xml schema); that's why we should have a guideline for each of them on how to use them. We are missing such a guideline for rdf currently, hence this debate.

Whether scientificName is a literal string or some complex object shouldn't matter - it's defined to be a scientific name. Such a dwc rdf property could either hold a literal string or a url to some name rdf:resource (potentially with an rdfs:label).

With the introduction of many ID terms we have diluted that idea a little already, in my mind. We could just as well have used scientificName in xml to hold some identifier for that name. All URNs tell you what they are by their urn prefix (not necessarily how to resolve them), so you can easily detect a UUID, LSID, http(s) url, ftp, doi and apply the conventional resolution mechanism. The hardest problem is the local ids and other plain identifiers. It was mainly for those that we created the ID terms (at least in my mind). I am feeling rather uncomfortable discussing the introduction of specific dwc terms for each type of id. Maybe we should remove all id terms in dwc and use the specific guidelines to specify these? At least if you really think having all those id terms for rdf is a good thing, I would feel much more comfortable going down this route instead of diluting dwc by adding more and more rather redundant terms. The abstract concept is key to a dwc term, not the actual data type forced by the technology you are using it with. Would you want several date terms for various date formats? In fact we do that already to some degree (eventDate, eventTime, year, month, day, verbatimEventDate) and I always felt this is not a good idea. There are also a number of verbatimXXX terms in dwc which also contradict this pattern.

Talking about new dwc terms - in the examples given, a property like "hasScientificName" is not strictly the correct dwc term, which is simply scientificName. I think it would be fine to have the convention in the rdf guidelines to use hasDwcTerm instead of dwcTerm; this is exactly what an rdf guideline is for. On the flip side I am sure this only applies to some terms; recordedBy, for example, is likely to remain as it is. It's unclear to me what is really best to do. Always stick to the original dwc terms? Refine them through some rdfs or owl schema and define the relation to the original term? Should we still use the same namespace in this case?

As an rdf beginner, even after a few years of exposure, I wonder if we can't simply stick to the non-ID terms and use them either as literals or with a uri pointer. Since in the rdf world a resolvable http uri is really required for resource relations to work, why not simply mandate this in the guidelines? If you only happen to have non-resolvable uris like lsids or dois, the guidelines should be asking you to use proxied versions, knowing it will break rdf frameworks and lod conventions otherwise. On the resolving side one could always include such urns with owl:sameAs (or something alike), I believe. But how many non-resolvable ids with no matching http counterpart are really out there yet?

- Markus

On Oct 6, 2010, at 9:02, Peter DeVries wrote:

...

Hi Steve,

You are probably right that it might be best to use rdfs:label, but I am thinking we might be able to get the same result by defining the string variants as subproperties of rdfs:label.

This would make them an rdfs:label, but a special kind of rdfs:label.

This is one of those things that I would test with Sindice and URIburner to see if they interpret these correctly.

This would require a live vocabulary that Sindice could look at to determine that hasScientificName is to be treated as an rdfs:label.

- Pete


Gregor Hagedorn

03:00

New subject: Idea for Discussion, Differentiating between "type's" of identifiers

...

All URNs tell you what they are by their urn prefix (not necessarily how to resolve them)

LSIDs do not tell you this; you only know they are URNs, which means the semantic web stops here. I believe, though I may be wrong, that switching between an http-proxied form of an LSID and a pure LSID requires rules that semantic web processors cannot process as sameAs.

I believe this implies that if you want LSIDs in the semantic web, you need to provide the http-proxied versions and the pure LSIDs in *parallel*.

Keeping it parallel requires some design pattern. This could be a structure inside scientific name of course, but since both identifier forms are attributes of the class, Pete's proposal made a lot of sense to me FOR AN RDF implementation.

Gregor
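One way to picture the "parallel" pattern is the short Python sketch below, which derives an http-proxied form from a pure LSID and returns both side by side. The proxy base URL is a made-up placeholder and the two keys simply reuse the property names proposed earlier in the thread; neither is an agreed Darwin Core convention.

```python
# Hypothetical proxy base: substitute whatever resolver is actually in use.
PROXY_BASE = "http://lsid.example.org/"


def proxied_form(lsid, proxy_base=PROXY_BASE):
    """Return an http form of the LSID by prefixing it with a proxy base URL."""
    if not lsid.lower().startswith("urn:lsid:"):
        raise ValueError(f"not an LSID: {lsid!r}")
    return proxy_base + lsid


def parallel_identifiers(lsid):
    """Carry the pure LSID and its proxied http form together in one record."""
    return {
        "hasScientificNameLSID": lsid,
        "hasScientificNameURI": proxied_form(lsid),
    }
```

The mapping between the two forms lives in code (or in a mapping service) rather than being something an RDF processor is expected to infer.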

Markus Döring

05:55

New subject: Idea for Discussion, Differentiating between "type's" of identifiers

Like all URNs, LSIDs do include the "Namespace Identifier" that identifies them as LSIDs:

URN:LSID:<Authority>:<Namespace>:<ObjectID>[:<Version>]

The problem is that most or even all rdf frameworks have no clue how to resolve anything other than http, hence the need for proxies. Given that, is there still a need for LSID or even URN specific terms? Isn't it good enough to use some sameAs assertion on the resolved object if no one can deal with a urn as a link anyway?

Markus

On Oct 6, 2010, at 12:00, Gregor Hagedorn wrote:

...

...
All URNs tell you what they are by their urn prefix (not necessarily how to resolve them)

LSIDs do not tell you this; you only know they are URNs, which means the semantic web stops here. I believe, though I may be wrong, that switching between an http-proxied form of an LSID and a pure LSID requires rules that semantic web processors cannot process as sameAs.

I believe this implies that if you want LSIDs in the semantic web, you need to provide the http-proxied versions and the pure LSIDs in *parallel*.

Keeping it parallel requires some design pattern. This could be a structure inside scientific name of course, but since both identifier forms are attributes of the class, Pete's proposal made a lot of sense to me FOR AN RDF implementation.

Gregor

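The URN:LSID:<Authority>:<Namespace>:<ObjectID>[:<Version>] layout quoted in Markus's message can be split mechanically. The following is a rough Python sketch that only checks the colon-separated shape; the `Lsid` type and `parse_lsid` name are invented for this example, and the code does not attempt full validation against the LSID specification.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    version: Optional[str]


def parse_lsid(text):
    """Split urn:lsid:<authority>:<namespace>:<object>[:<version>] into parts."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of fields in {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"missing urn:lsid: prefix in {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"empty field in {text!r}")
    version = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], version)
```

Applied to the Catalogue of Life example from this thread, the authority is `catalogueoflife.org`, the namespace is `taxon`, and `ac2010` is the optional version.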

Hilmar Lapp

05:17

New subject: Idea for Discussion, Differentiating between "type's" of identifiers

On Oct 6, 2010, at 4:18 AM, Markus Döring wrote:

...

Dwc has the goal of being technology independent

I think this is worth stressing, and a Good Thing(tm). Adding technology-specific fields to DwC with the sole goal of making it more suitable for a particular technology is a recipe for mess and thus non-compliance, rather than clarity and simplicity.

...

And even within those there might be various ways of using them (e.g. we have a normalised and a simple flat xml schema), thats why we should have a guideline for each of them on how to use them. We are missing such a guideline for rdf currently, hence this debate.

I agree. I don't really see why this is so different from DC. There is no dcterms:identifierDOI or dcterms:identifierISSN or dcterms:identifierISBN either. Instead, there is a documented social convention on what to put into the dcterms:identifier field. That still leaves ample room for not following that convention, but it does keep the standard clear, and gives consumers one element to inspect when they want an identifier. I would rather have a simple and clear standard with documented social conventions, and a blessed validator tool that tells providers whether they are following the conventions or not.

...

With the introduction of many ID terms we have diluted that idea a little already, in my mind. We could just as well have used scientificName in xml to hold some identifier for that name.

I actually do think that there is a difference between a name (usually a label) and an identifier.

-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
===========================================================
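As a rough illustration of the kind of check a convention validator might begin with, the Python sketch below guesses what sort of identifier a single field value holds from its prefix, along the lines Markus described earlier in the thread. The function name and the category labels are invented for this example and are not part of any existing Darwin Core tool.

```python
import uuid


def identifier_kind(value):
    """Guess the identifier type of a field value from its prefix."""
    text = value.strip()
    lower = text.lower()
    if lower.startswith("urn:lsid:"):
        return "lsid"
    if lower.startswith("urn:uuid:"):
        return "uuid"
    if lower.startswith(("http://", "https://")):
        return "http"
    if lower.startswith("ftp://"):
        return "ftp"
    if lower.startswith("doi:"):
        return "doi"
    try:
        # A bare UUID has no prefix, so try to parse it directly.
        uuid.UUID(text)
        return "uuid"
    except ValueError:
        # Anything else is treated as a local id or a plain string.
        return "local"
```

A validator built on this could flag records where one field mixes several kinds, which is the ambiguity the thread is concerned with; the harder case of telling a local id from a name string is not addressed here.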

Steve Baskauf

7 Oct

07:41

New subject: Idea for Discussion, Differentiating between "type's" of identifiers

I agree that it is best to avoid a proliferation of terms, and I agree that it is best to keep Darwin Core technology independent to the maximum extent possible. However, I think that the case of facilitating HTTP URIs is a special one because of the requirements of GUIDs/Persistent Identifiers. Both the TDWG and GBIF guidelines, such as they currently stand, say that GUIDs must be resolvable, that in their resolution they must return RDF, and that the RDF has to be in an XML format. Like it or not, that is what we have. Given the amount of time that it seems to have taken to settle on that much, I think it is best for us to decide to live with it, warts and all, rather than re-opening the discussion and delaying the implementation of GUIDs for another five years.

Given that assumption, there needs to be within Darwin Core some way to support this particular "technology" (Linked Data, RDF/XML) even if we don't do "special" things to support other technologies such as LSID, DOI, etc. The point is well taken that most of those other technologies have mechanisms for turning their identifiers into URIs, and the aforementioned guidelines lay out how owl:sameAs can be used within the RDF to associate the non-HTTP-resolvable forms with the URIs.

Based on my admittedly limited experience with trying to write RDF using Darwin Core terms, I think that in most cases there already exist appropriate terms for getting the job done. What may be lacking is concrete examples and community consensus on what terms to use for what. I also think that there are probably some "ID" terms where it isn't really very important (from an RDF point of view) that there exist both a URI form and a text string form. I'm thinking of something like dwc:identificationID, which is most likely to be needed to allow a machine to make a connection between some resource and its identification. The machine isn't going to care if there is a human-readable version. In contrast, something like dwc:collectionID is likely to need both a URI version (e.g. the proxied version of the BCI LSID) for the machines and a string version (the name of the collection as it would be displayed) for humans. I think that trying to make example/template RDF for various types of resources will help make it clear in which cases one version (URI), the other (string), or both are actually necessary.

I "volunteered" a couple of weeks ago to have a go at writing an RDF guide for Darwin Core. I am still willing to do this, although I'm still getting caught up at work from being at the TDWG meeting. However, next week we have fall break and I will make it a priority to come up with a draft which can be the subject of discussion. As a part of this process, I think it would be good to create one or more "boilerplate" RDF files for the various kinds of resources that are likely to be identified with GUIDs (e.g. Occurrences, Taxa, etc.). This can also be a subject of discussion, and I think it will help to clarify what will meet the actual needs that we have discussed in this thread. I have a pretty clear picture of what I think Occurrence RDF should look like. I'm going to have to depend on Pete and others to deal with the taxonomy part.

Steve

Markus Döring wrote:

...

Steve, Pete,

I'd like to draw your attention to a basic Darwin Core design pattern. DwC has the goal of being technology independent by simply providing a list of abstract terms one can use in various arenas such as XML, RDF, XHTML, CSV, etc. And even within those there might be various ways of using them (e.g. we have a normalised and a simple flat XML schema); that's why we should have a guideline for each of them on how to use them. We are currently missing such a guideline for RDF, hence this debate.

Whether scientificName is a literal string or some complex object shouldn't matter - it's defined to be a scientific name. Such a DwC RDF property could hold either a literal string or a URL pointing to some name resource via rdf:resource (potentially with an rdfs:label).

With the introduction of many ID terms we have, in my mind, already diluted that idea a little. We could just as well have used scientificName in XML to hold some identifier for that name. All URIs tell you what they are by their scheme or URN prefix (though not necessarily how to resolve them), so you can easily detect a UUID, LSID, http(s) URL, FTP URL, or DOI and apply the conventional resolution mechanism. The hardest problem is the local IDs and other plain identifiers. It was mainly for those that we created the ID terms (at least in my mind). I am feeling rather uncomfortable discussing the introduction of specific DwC terms for each type of ID. Maybe we should remove all ID terms from DwC and use the specific guidelines to specify these? At least if you really think having all those ID terms for RDF is a good thing, I would feel much more comfortable going down this route instead of diluting DwC by adding more and more rather redundant terms. The abstract concept is key to a DwC term, not the actual data type forced by the technology you are using it with. Would you want several date terms for various date formats? In fact we do that already to some degree (eventDate, eventTime, year, month, day, verbatimEventDate) and I always felt this is not a good idea. There are also a number of verbatimXXX terms in DwC which also contradict this pattern.
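The prefix-based detection Markus describes can be sketched roughly as follows. This is only an illustration, not part of any DwC guideline: the function name, the category labels, and the choice to treat a bare UUID as "uuid" and everything unrecognised as "local" are all assumptions made for the example.

```python
import re

# A bare UUID such as 6c3dc35f-d901-5cc5-b9c8-ad241069b9f8 (no urn: prefix).
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def classify_identifier(value: str) -> str:
    """Guess the identifier type from its scheme or URN prefix.

    Returns one of "lsid", "uuid", "http", "ftp", "doi" or "local".
    The category names are hypothetical, chosen for this sketch only.
    """
    v = value.strip()
    lower = v.lower()
    if lower.startswith("urn:lsid:"):
        return "lsid"
    if lower.startswith("urn:uuid:") or UUID_RE.match(v):
        return "uuid"
    if lower.startswith(("http://", "https://")):
        return "http"
    if lower.startswith("ftp://"):
        return "ftp"
    if lower.startswith(("doi:", "info:doi/")):
        return "doi"
    # Anything else is treated as a local or otherwise plain identifier,
    # which is the case the prefix alone cannot help with.
    return "local"
```

Note that the prefix only says what kind of identifier it is; as Markus points out, it does not by itself say how to resolve it.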

Talking about new DwC terms - in the examples given, a property like "hasScientificName" is not strictly the correct DwC term, which is simply scientificName. I think it would be fine for the RDF guidelines to adopt the convention of using hasDwcTerm instead of dwcTerm; this is exactly what an RDF guideline is for. On the flip side, I am sure this only applies to some terms; recordedBy, for example, is likely to remain as it is. It's unclear to me what is really best to do. Always stick to the original DwC terms? Refine them through some RDFS or OWL schema and define the relation to the original term? Should we still use the same namespace in this case?

As an RDF beginner, even after a few years of exposure, I wonder if we can't simply stick to the non-ID terms and use them either as literals or with a URI pointer. Since in the RDF world a resolvable HTTP URI is really required for resource relations to work, why not simply mandate this in the guidelines? If you only happen to have non-resolvable URIs like LSIDs or DOIs, the guidelines should ask you to use proxied versions, knowing that it will otherwise break RDF frameworks and LOD conventions. On the resolving side one could always include such URNs with owl:sameAs (or something similar), I believe. But how many non-resolvable IDs with no matching HTTP counterpart are really out there yet?
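A small sketch of the "proxied version plus owl:sameAs" idea, under stated assumptions: the proxy base URL http://lsid.example.org/ is a placeholder and not a real resolver, both function names are hypothetical, and the statement is written as an N-Triples line purely for brevity (the guidelines discussed in this thread call for RDF/XML).

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def proxy_lsid(lsid: str, proxy_base: str = "http://lsid.example.org/") -> str:
    """Turn an LSID into an HTTP URI by prepending a proxy base URL.

    proxy_base is a placeholder for whatever HTTP proxy a provider uses.
    """
    if not lsid.lower().startswith("urn:lsid:"):
        raise ValueError(f"not an LSID: {lsid!r}")
    return proxy_base + lsid


def same_as_statement(http_uri: str, lsid: str) -> str:
    """Return one N-Triples line linking the HTTP form to the bare LSID."""
    return f"<{http_uri}> <{OWL_SAME_AS}> <{lsid}> ."
```

With this, a record would point at the proxied HTTP URI in its resource links, and the bare LSID would survive only as the object of the owl:sameAs statement.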

- Markus

On Oct 6, 2010, at 9:02, Peter DeVries wrote:

...
Hi Steve,

You are probably right that it might be best to use rdfs:label, but I am thinking we might be able to get the same result by defining the string variants as subproperties of rdfs:label.

This would make them an rdfs:label, but a special kind of rdfs:label.

This is one of those things that I would test with Sindice and URIburner to see if they interpret these correctly.

This would require a live vocabulary that Sindice could look at to determine that hasScientificName is to be treated as an rdfs:label.
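What a consumer would have to do with such a subproperty declaration can be sketched as below. This is a toy, single-step version of the RDFS subPropertyOf rule over plain tuples, not how Sindice or URIburner actually work; the function name and the example triples are assumptions, and chains of subproperties are deliberately not followed.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"


def infer_labels(triples):
    """Return the rdfs:label triples implied by direct subproperties.

    triples is a set of (subject, predicate, object) string tuples that
    mixes vocabulary statements and data statements.
    """
    # Properties the vocabulary declares as direct subproperties of rdfs:label.
    label_props = {
        s for (s, p, o) in triples
        if p == RDFS_SUBPROPERTY_OF and o == RDFS_LABEL
    }
    # Every statement using such a property also holds with rdfs:label.
    return {
        (s, RDFS_LABEL, o) for (s, p, o) in triples if p in label_props
    }
```

The point of the sketch is the dependency Pete mentions: the inference only happens if the consumer can actually fetch the vocabulary statement.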

- Pete

On Mon, Oct 4, 2010 at 10:41 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:

Although this specific example deals with taxonomic name identifiers, it is related to a previous discussion on this list about how we should use the dwc:xxxxxID terms and other terms (such as recordedBy and identifiedBy) that could have either a string (literal) or URI form. Although I don't really want to see an unnecessary proliferation of Darwin Core terms, I think that in the interest of clarity (particularly where RDF is involved) there either should be multiple terms that make it clear what form of identifier is expected, or else there should be an understanding that in RDF the default for such a term is a URI, which would then have an rdfs:label property holding the string form. I think the former would be preferable to the latter.

I came to this opinion when trying to write RDF describing an herbarium specimen. The collector should be the dwc:recordedBy property of the specimen. Optimally, there would be a database in which known collectors were assigned URIs so that "Glen N. Montz", "Glen Montz", "G. N. Montz", etc. would all be different labels for the same resource. However, realistically, I'm not going to drop what I'm doing to set up such a database (even if I were capable of doing it, which I'm not). So I ended up just writing it as <dwc:recordedBy>Glen N. Montz</dwc:recordedBy> even though I knew it probably wasn't the best thing. In a large Occurrence database that was compiled from the RDF created by a lot of people, there might end up being a mixture of strings and URIs for dwc:recordedBy properties of the specimens. It seems to me like it would be better to have properties like dwc:recordedBy for strings and dwc:recordedByURI for a corresponding URI (and I suppose dwc:recordedByLSID if anyone wants to use it). Of course, this would require a number of term additions to DwC and clarification in the DwC documentation that the generic version was intended for strings.

With respect to the example

<dwc:hasScientificNameLSID rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/>

I think you are right that (with the possible exception of rdfs:seeAlso) there is an expectation that an rdf:resource attribute will be a resolvable URI that produces RDF. So

<dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID>

is probably better.
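The two serialisations being compared (an rdf:resource link versus a plain literal) can be generated with the Python standard library as a quick way to see the difference. This is a sketch only: the subject URI http://example.org/taxon/1 is made up, the helper name is hypothetical, and hasScientificNameURI / hasScientificNameLSID are Pete's proposed property names, not existing Darwin Core terms.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DWC = "http://rs.tdwg.org/dwc/terms/"

# Ask ElementTree to use the conventional prefixes when serialising.
ET.register_namespace("rdf", RDF)
ET.register_namespace("dwc", DWC)


def describe(subject_uri: str, prop: str, value: str, as_resource: bool) -> str:
    """Serialise one property of one subject as a small RDF/XML document.

    as_resource=True writes the value into an rdf:resource attribute, which
    a consumer will treat as a URI it may try to dereference.
    as_resource=False writes the value as a plain literal.
    """
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": subject_uri}
    )
    element = ET.SubElement(desc, f"{{{DWC}}}{prop}")
    if as_resource:
        element.set(f"{{{RDF}}}resource", value)
    else:
        element.text = value
    return ET.tostring(root, encoding="unicode")
```

Calling describe(..., "hasScientificNameLSID", lsid, as_resource=False) yields the literal form Steve prefers for LSIDs, while as_resource=True would be reserved for identifiers that really are HTTP-resolvable.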

Steve

Peter DeVries wrote:

...
I have been thinking about the following pattern. In part after looking at the GBIF vocabulary.

I am not sure if it is even a good idea but might be worth some discussion.

For those fields that have both a string and "ID" form maybe the following pattern might be useful

hasScientificName = string form
hasScientificNameURI = Resolvable LOD compliant identifier
hasScientificNameLSID = LSID identifier which could be resolvable once you add the "http:proxy" etc.

This allows all three forms to be included if desired, it also provides a hint as to how the field should be interpreted or resolved.

One group could also provide a mapping service so that each record does not need to include all three forms, but would allow systems to find the matching LSID for a given URI or vice versa.
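The core of such a mapping service is just a two-way lookup, sketched below. The class and method names are hypothetical, nothing here describes an existing service, and a real one would sit behind a database and a web interface rather than in-memory dictionaries.

```python
from typing import Dict, Optional


class IdentifierMap:
    """In-memory two-way lookup between HTTP URIs and LSIDs."""

    def __init__(self) -> None:
        self._uri_to_lsid: Dict[str, str] = {}
        self._lsid_to_uri: Dict[str, str] = {}

    def add(self, uri: str, lsid: str) -> None:
        """Register that uri and lsid identify the same thing."""
        self._uri_to_lsid[uri] = lsid
        self._lsid_to_uri[lsid] = uri

    def lsid_for(self, uri: str) -> Optional[str]:
        """Return the LSID registered for uri, or None if unknown."""
        return self._uri_to_lsid.get(uri)

    def uri_for(self, lsid: str) -> Optional[str]:
        """Return the HTTP URI registered for lsid, or None if unknown."""
        return self._lsid_to_uri.get(lsid)
```

A record would then carry only one form of the identifier, and a consumer needing the other form would ask the service.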

My concern was that it would be difficult to infer how a scientificNameID should be interpreted by other systems.

Is this an LSID, is it a URI, is it a UUID, etc.?

This impacts the structure of the RDF.

* Note that the actual identifiers might not be correct; the example below is more about the form of the RDF
* For instance, I don't think it is correct to treat the COL LSID as just a name string
* Also, in this example the GNI name does not exactly match the string name

<dwc:hasScientificName>Puma concolor (Linnaeus 1771)</dwc:hasScientificName>
<dwc:hasScientificNameURI rdf:resource="http://gni.globalnames.org/name_strings/6c3dc35f-d901-5cc5-b9c8-ad241069b9f8"/>
<dwc:hasScientificNameLSID rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/>

Some systems may choke on the LSID form, assuming that it uses a standard resolution mechanism.

So it might be best to use this form

<dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID>

- Pete

---------------------------------------------------------------- Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base / GeoSpecies Knowledge Base About the GeoSpecies Knowledge Base ------------------------------------------------------------

-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences

postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.

delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235

office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707

http://bioimages.vanderbilt.edu

_______________________________________________ tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content

Roger Hyam

08:38

New subject: Idea for Discussion, Differentiating between "type's" of identifiers

I am only just keeping up with this thread so excuse me if I speak out of turn.

I wonder if we should have a normative rendering of DwC in RDF at all.

There will probably be many, many decisions to take in developing such a model. The correct answer to each one of these decisions will depend on the use-case for the model. If we don't have at least one clear use-case for *consuming* the data then our decisions will either be arbitrary or we will argue around and around in circles, unable to decide on the perfect answer. A use-case should involve answering a real-world question, not a hypothetical one, and it should be testable.

As an example, I am of the growing opinion that taxonomic classifications would be rendered perfectly adequately in SKOS - each classification being a free-standing thesaurus linked to other thesauri/classifications using the standard SKOS terms. Any tool that understood SKOS would then "understand" taxonomy. One could produce a mapping from DwC checklists to SKOS to do this. It would be totally inappropriate to convert a DwC dataset of specimens into a SKOS thesaurus, as specimens could never be considered concepts.

I hesitate to formally propose the SKOS model (I thought about presenting it at TDWG) because I haven't found even a decent SKOS browser or an application that would demonstrate the utility of this approach - though I have ideas...

Anyhow, just my tuppence worth - before departing for a long weekend :)

All the best,

Roger

On 7 Oct 2010, at 15:41, Steve Baskauf wrote:

...


Rutger Vos

08:49

New subject: Idea for Discussion, Differentiating between "type's" of identifiers

...

As an example I am of the growing opinion that taxonomic classifications would be rendered perfectly adequately in SKOS - each classification being a free standing thesaurus linked to other thesauri/classifications using the standard SKOS terms.

This is how TreeBASE shows links between OTU labels submitted to it and their matches to uBio NameBank records and NCBI taxonomy records, using the skos:closeMatch predicate (this was deemed appropriate also because the TreeBASE webapp determines these links mostly just by string matching).

-- Dr. Rutger A. Vos School of Biological Sciences Philip Lyle Building, Level 4 University of Reading Reading RG6 6BX United Kingdom Tel: +44 (0) 118 378 7535 http://www.nexml.org http://rutgervos.blogspot.com
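To make the "skos:closeMatch by string matching" idea concrete, here is a minimal sketch. It is not TreeBASE's actual matching code: the normalisation rule (underscores to spaces, case-insensitive, collapsed whitespace), the function names, and the example.org URIs used when calling it are all assumptions for illustration.

```python
SKOS_CLOSE_MATCH = "http://www.w3.org/2004/02/skos/core#closeMatch"


def normalize(label: str) -> str:
    """Lower-case, turn underscores into spaces and collapse whitespace."""
    return " ".join(label.replace("_", " ").lower().split())


def close_matches(otu_labels, name_records):
    """Link OTUs to name records whose normalised strings are identical.

    otu_labels maps an OTU URI to its submitted label.
    name_records maps a name string to the URI of a name record.
    Returns a list of (otu_uri, skos:closeMatch, name_uri) tuples.
    """
    index = {normalize(name): uri for name, uri in name_records.items()}
    triples = []
    for otu_uri, label in otu_labels.items():
        name_uri = index.get(normalize(label))
        if name_uri is not None:
            triples.append((otu_uri, SKOS_CLOSE_MATCH, name_uri))
    return triples
```

Because the link rests on nothing more than a string comparison, a weak predicate such as skos:closeMatch is a better fit here than an identity claim like owl:sameAs.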

Arlin Stoltzfus

09:01

This issue came up in the phylogenetic standards subgroup, which (at the TDWG meeting last week) started to assess current best practices for publishing a phylogenetic tree electronically. One of the issues that came up is that there is no standard way of making a link between a tree (a clade or node) and a species. Some of the file formats used in phylogenetics can do this explicitly, and others can't. I made some toy examples of encodings on our twiki here:

http://wiki.tdwg.org/twiki/bin/view/Phylogenetics/LinkingTrees2010#A_set_of_...

Here is how it is done in phyloXML:

<clade>
  <taxonomy>
    <id provider="UBio">http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:3...</id>
    <scientific_name>Gallus gallus</scientific_name>
  </taxonomy>
</clade>

and here is how it is rendered in the NeXML output from TreeBASE:

<otu id="otu4" label="Gallus_gallus_CAA25046.1">
  <meta href="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:3..." id="meta5" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
  <meta content="Gallus gallus" datatype="xsd:string" id="meta6" property="skos:altLabel" xsi:type="nex:LiteralMeta"/>
</otu>

phyloXML and NeXML are emerging XML standards for phylogenetics.

Arlin

On Oct 7, 2010, at 11:49 AM, Rutger Vos wrote:

...


------- Arlin Stoltzfus (arlin@umd.edu) Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST IBBR, 9600 Gudelsky Drive, Rockville, MD tel: 240 314 6208; web: www.molevol.org
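For anyone wanting to consume annotations of the kind shown in the NeXML example, a rough standard-library sketch follows. It assumes a simplified fragment: the real NeXML default namespace is omitted, the href is a made-up example.org placeholder rather than the truncated uBio URL from the message, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a NeXML <otu> element; the href is a placeholder.
SNIPPET = """\
<otu xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     id="otu4" label="Gallus_gallus_CAA25046.1">
  <meta href="http://example.org/namebank/0000" id="meta5"
        rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
  <meta content="Gallus gallus" datatype="xsd:string" id="meta6"
        property="skos:altLabel" xsi:type="nex:LiteralMeta"/>
</otu>
"""


def read_otu(xml_text: str) -> dict:
    """Collect the resource links and literal annotations of one <otu>."""
    otu = ET.fromstring(xml_text)
    links = []
    literals = []
    for meta in otu.findall("meta"):
        if "href" in meta.attrib:
            # Resource annotation: predicate in rel, target URI in href.
            links.append((meta.get("rel"), meta.get("href")))
        elif "content" in meta.attrib:
            # Literal annotation: predicate in property, value in content.
            literals.append((meta.get("property"), meta.get("content")))
    return {
        "id": otu.get("id"),
        "label": otu.get("label"),
        "links": links,
        "literals": literals,
    }
```

The split between href-bearing and content-bearing meta elements mirrors the URI-versus-string distinction discussed earlier in this thread.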

Steve Baskauf

09:04

New subject: Idea for Discussion, Differentiating between "type's" of identifiers

Well, I'm not really proposing to create anything as complicated as a normative rendering of DwC in RDF. I just think it would be useful to have a guide similar to http://rs.tdwg.org/dwc/terms/guides/xml/index.htm and http://rs.tdwg.org/dwc/terms/guides/text/index.htm that would help people know how they could use DwC terms as properties in RDF. People are going to do that whether there is a guide or not - this would just make it easier. If people want to develop more sophisticated systems for doing semantic reasoning that's facilitated by RDF, all power to them (it's just not going to be me).

I'm really trying to be focused on making it possible to meet the minimal requirements for exposing metadata as outlined in the Linked Data/GUID guidelines. That's a much lower bar than, say, creating formats and applications that "understand" taxonomy. If we can't do this (quick and dirty RDF for occurrences and the like) relatively quickly and in a relatively simple way, then let's just forget about having GUIDs that are actionable and just tell people that we don't expect actionability.

Steve

Roger Hyam wrote:

...

Markus Döring wrote:

...
Steve, Pete,

I'd like to draw your attention to a basic DarwinCore design pattern. Dwc has the goal of being technology independent by simply providing a list of abstract terms one can use in various arenas such as xml, rdf, xhtml, csv etc. And even within those there might be various ways of using them (e.g. we have a normalised and a simple flat xml schema); that's why we should have a guideline for each of them on how to use them. We are missing such a guideline for rdf currently, hence this debate.

Whether scientificName is a literal string or some complex object shouldn't matter - it's defined to be a scientific name. Such a dwc rdf property could either hold a literal string or a url to some name rdf:resource (potentially with an rdfs:label).

With the introduction of many ID terms we have diluted that idea a little already in my mind. We could have as well used scientificName in xml to hold some identifier for that name. All URNs tell you what they are by their urn prefix (not necessarily how to resolve them), so you can easily detect a UUID, LSID, http(s) url, ftp, doi and apply the conventional resolution mechanism. The hardest problem is the local ids and other plain identifiers. For those mainly we created the ID terms (at least in my mind). I am feeling rather uncomfortable discussing the introduction of specific dwc terms for each type of id. Maybe we should remove all id terms in dwc and use the specific guidelines to specify these? At least if you really think having all those id terms for rdf is a good thing I would feel much more comfortable going down this route instead of diluting dwc by adding more and more rather redundant terms. The abstract concept is key to a dwc term, not the actual data type forced by the technology you are using it with. Would you want several date terms for various date formats? In fact we do that already to some degree (eventDate, eventTime, year, month, day, verbatimEventDate) and I always felt this is not a good idea. There are also a number of verbatimXXX terms in dwc which also contradict this pattern.
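
The prefix-based detection described above can be illustrated with a minimal sketch. It assumes that an identifier's form is recognisable from its leading characters alone; the function name `classify_identifier`, the category labels, and the treatment of a bare `doi:` prefix are all illustrative assumptions, not part of Darwin Core or any guideline.

```python
import re
from urllib.parse import urlparse

# Bare UUID in the usual 8-4-4-4-12 hexadecimal layout.
_UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def classify_identifier(value: str) -> str:
    """Guess the form of an identifier from its prefix (hypothetical helper)."""
    text = value.strip()
    lowered = text.lower()
    if lowered.startswith("urn:lsid:"):
        return "lsid"
    if lowered.startswith("urn:uuid:") or _UUID_RE.match(text):
        return "uuid"
    # Assumption: DOIs arrive with a "doi:" prefix rather than as bare "10.xxxx/..." strings.
    if lowered.startswith("doi:"):
        return "doi"
    scheme = urlparse(text).scheme.lower()
    if scheme in ("http", "https"):
        return "http"
    if scheme == "ftp":
        return "ftp"
    # Anything else is treated as a local id or plain string.
    return "local"
```

As the message notes, knowing the form does not by itself say how to resolve the identifier, and local ids still fall through to the catch-all case.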

Talking about new dwc terms - in the examples given, a property like "hasScientificName" is not strictly the correct dwc term, which is simply scientificName. I think it would be fine to have the convention in the rdf guidelines to use hasDwcTerm instead of dwcTerm; this is exactly what an rdf guideline is for. On the flip side I am sure this only applies to some terms; recordedBy for example is likely to remain as it is. It's unclear to me what is best to do really. Always stick to the original dwc terms? Refine them through some rdfs or owl schema and define the relation to the original term? Should we still use the same namespace in this case?

As an rdf beginner, even after a few years of exposure, I wonder if we can't simply stick to the non-ID terms and use them either as literals or with a uri pointer. As in the rdf world a resolvable http URI is really required for resource relations to work, why not simply mandate this in the guidelines? If you only happen to have non-resolvable uris like lsids or dois, the guidelines should be asking you to use proxied versions, knowing it will break rdf frameworks and lod conventions otherwise. On the resolving side one could always include such urns with owl:sameAs (or something similar) I believe. But how many non-resolvable ids with no matching http counterpart are really out there yet?
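
A minimal sketch of the proxying idea above: turn an LSID into an HTTP URI by prepending a proxy base, and keep the original URN attached with owl:sameAs. The proxy base `http://lsid.example.org/` is a placeholder assumption, not a real resolver, and the helper names are hypothetical.

```python
from xml.sax.saxutils import quoteattr

# Assumption: placeholder proxy base; substitute whatever proxy a guideline would name.
PROXY_BASE = "http://lsid.example.org/"


def proxy_lsid(lsid: str, proxy_base: str = PROXY_BASE) -> str:
    """Return an HTTP form of an LSID by prefixing it with the proxy base."""
    if not lsid.lower().startswith("urn:lsid:"):
        raise ValueError(f"not an LSID: {lsid!r}")
    return proxy_base.rstrip("/") + "/" + lsid


def same_as_description(lsid: str, proxy_base: str = PROXY_BASE) -> str:
    """Build an RDF/XML fragment linking the proxied URI to the bare URN."""
    about = quoteattr(proxy_lsid(lsid, proxy_base))  # quoteattr adds the quotes
    resource = quoteattr(lsid)
    return (
        f"<rdf:Description rdf:about={about}>\n"
        f"  <owl:sameAs rdf:resource={resource}/>\n"
        f"</rdf:Description>"
    )
```

The fragment omits the enclosing rdf:RDF element and namespace declarations, so it is only meaningful inside a complete document that declares the rdf and owl prefixes.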

- Markus

On Oct 6, 2010, at 9:02, Peter DeVries wrote:

...
Hi Steve,

You are probably right that it might be best to use rdfs:label, but I am thinking we might be able to get the same result by defining the string variants as subproperties of rdfs:label.

This would make them an rdfs:label but a special kind of rdfs:label.

This is one of those things that I would test with Sindice and URIburner to see if they interpret these correctly.

This would require a live vocabulary that Sindice could look at to determine that hasScientificName is to be treated as an rdfs:label.
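
What a consumer would have to do with such a vocabulary can be sketched as a small rdfs:subPropertyOf inference over plain triples. This is an illustration under assumptions: triples are modelled as Python tuples of prefixed names, `infer_labels` is a hypothetical helper, and hasScientificName is the term proposed in this thread rather than an existing Darwin Core term.

```python
RDFS_LABEL = "rdfs:label"
RDFS_SUBPROPERTY_OF = "rdfs:subPropertyOf"


def infer_labels(triples):
    """Return the rdfs:label triples entailed by rdfs:subPropertyOf statements."""
    # Map each property to the set of its declared super-properties.
    parents = {}
    for subject, predicate, obj in triples:
        if predicate == RDFS_SUBPROPERTY_OF:
            parents.setdefault(subject, set()).add(obj)

    def is_label_property(prop, seen=None):
        # Walk up the subproperty chain; 'seen' guards against cycles.
        if seen is None:
            seen = set()
        if prop == RDFS_LABEL:
            return True
        if prop in seen:
            return False
        seen.add(prop)
        return any(is_label_property(p, seen) for p in parents.get(prop, ()))

    inferred = set()
    for subject, predicate, obj in triples:
        if predicate in (RDFS_LABEL, RDFS_SUBPROPERTY_OF):
            continue
        if is_label_property(predicate):
            inferred.add((subject, RDFS_LABEL, obj))
    return inferred
```

A tool that does not fetch the vocabulary, or does not apply this entailment, would never see the string as a label, which is why testing against live consumers is suggested.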

- Pete

On Mon, Oct 4, 2010 at 10:41 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:

Although this specific example deals with taxonomic name identifiers, it is related to a previous discussion on this list about how we should use the dwc:xxxxxID terms and other terms (such as recordedBy and identifiedBy) that could have either a string (literal) or URI form. Although I don't really want to see an unnecessary proliferation of Darwin Core terms, I think that in the interest of clarity (particularly where RDF is involved) there either should be multiple terms that make it clear what form of identifier is expected, or else there should be an understanding that in RDF the default for such a term is a URI which would then have an rdfs:label property which was the string form. I think the former would be preferable to the latter.

I came to this opinion when trying to write RDF describing an herbarium specimen. The collector should be the dwc:recordedBy property of the specimen. Optimally, there would be a database in which known collectors were assigned URIs so that "Glen N. Montz", "Glen Montz", "G. N. Montz", etc. would all be different labels for the same resource. However, realistically, I'm not going to drop what I'm doing to set up such a database (even if I were capable of doing it, which I'm not). So I ended up just writing it as <dwc:recordedBy>Glen N. Montz</dwc:recordedBy> even though I knew it probably wasn't the best thing. In a large Occurrence database that was compiled from the RDF created by a lot of people, there might end up being a mixture of strings and URIs for dwc:recordedBy properties of the specimens. It seems to me like it would be better to have properties like dwc:recordedBy for strings and dwc:recordedByURI for a corresponding URI (and I suppose dwc:recordedByLSID if anyone wants to use it). Of course, this would require a number of term additions to DwC and clarification in the DwC documentation that the generic version was intended for strings.
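
The mixture of strings and URIs described above can be made concrete with a small sketch that serialises the same property either as a literal or as an rdf:resource reference, depending on the value. The helper `property_element` and the example person URI are assumptions for illustration; only the Darwin Core terms namespace and the RDF namespace are taken as given.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

DWC_NS = "http://rs.tdwg.org/dwc/terms/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Use the conventional prefixes when serialising.
ET.register_namespace("dwc", DWC_NS)
ET.register_namespace("rdf", RDF_NS)


def property_element(term: str, value: str) -> ET.Element:
    """Build a dwc property element: rdf:resource for HTTP URIs, literal text otherwise."""
    element = ET.Element(f"{{{DWC_NS}}}{term}")
    if urlparse(value).scheme in ("http", "https"):
        element.set(f"{{{RDF_NS}}}resource", value)
    else:
        element.text = value
    return element


# Usage: the same term ends up in two different shapes.
literal = property_element("recordedBy", "Glen N. Montz")
linked = property_element("recordedBy", "http://example.org/people/montz")  # hypothetical URI
print(ET.tostring(literal, encoding="unicode"))
print(ET.tostring(linked, encoding="unicode"))
```

A consumer of such data has to inspect every dwc:recordedBy element to see which shape it has, which is the ambiguity that separate string and URI terms would remove.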

With respect to the example

<dwc:hasScientificNameLSID rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/>

I think you are right that (with the possible exception of rdfs:seeAlso) there is an expectation that an rdf:resource attribute will be a resolvable URI that produces RDF. So

<dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID>

is probably better.

Steve

Peter DeVries wrote:

...
I have been thinking about the following pattern. In part after looking at the GBIF vocabulary.

I am not sure if it is even a good idea but might be worth some discussion.

For those fields that have both a string and "ID" form maybe the following pattern might be useful

hasScientificName = string form
hasScientificNameURI = Resolvable LOD compliant identifier
hasScientificNameLSID = LSID identifier which could be resolvable once you add the "http:proxy" etc.

This allows all three forms to be included if desired, it also provides a hint as to how the field should be interpreted or resolved.

One group could also provide a mapping service so that each record does not need to include all three forms, but would allow systems to find the matching LSID for a given URI or vice versa.

My concern was that it would be difficult to infer how a scientificNameID should be interpreted by other systems.

Is this an LSID, is it a URI, is it a UUID, etc.?

This impacts the structure of the RDF.

* Note that the actual identifiers might not be correct, the example below is more about the form of the RDF
* For instance, I don't think it is probably correct to see the COL LSID as just a namestring
* Also in this example the GNI name does not exactly match the string name

<dwc:hasScientificName>Puma concolor (Linnaeus 1771)</dwc:hasScientificName>
<dwc:hasScientificNameURI rdf:resource="http://gni.globalnames.org/name_strings/6c3dc35f-d901-5cc5-b9c8-ad241069b9f8"/>
<dwc:hasScientificNameLSID rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/>

Some systems may choke on the LSID form, assuming that it uses a standard resolution mechanism.

So it might be best to use this form

<dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID>

- Pete

---------------------------------------------------------------- Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base / GeoSpecies Knowledge Base About the GeoSpecies Knowledge Base ------------------------------------------------------------

-- Steven J. Baskauf, Ph.D., Senior Lecturer Vanderbilt University Dept. of Biological Sciences

postal mail address: VU Station B 351634 Nashville, TN 37235-1634, U.S.A.

delivery address: 2125 Stevenson Center 1161 21st Ave., S. Nashville, TN 37235

office: 2128 Stevenson Center phone: (615) 343-4582, fax: (615) 343-6707

http://bioimages.vanderbilt.edu


Peter DeVries

10:30

New subject: Idea for Discussion, Differentiating between "type's" of identifiers

I like Steve's idea of setting up some test data. I have found that this is really helpful in figuring out how to represent entities in a way that allows useful queries.

I also like Roger's idea of exploring the taxonomic hierarchies using SKOS. I have something like this for the Catalog of Life classifications. It only goes down to Class.

* I am not suggesting that this is the best way or the only way, and it may not be the most appropriate way to model these in SKOS depending on your particular use case.

One thing that might be really interesting is to follow Roger's suggestion, try this, and see if we can create some hierarchy ontologies which can then be run against each other to determine where they are the same and where they are different.

There are two more issues that are worth mentioning.

In some hierarchies there are "unassigned" clades which have other clades nested inside of them. You can see this in the Catalog of Life Fungi. To deal with this you need to create some sort of "Fungi_Class_Unassigned" clade in which to place the subordinate classes. This looks somewhat awkward at first, but it will make sense when you start making your own.

The other issue is that you need to figure out a way to have the URI for each clade be unique, and since some clades have the same name you might have to do something like plantGenus_Acer to make sure each is unique. It might be useful to develop some common pattern for this and I would be interested in what Roger suggests. Also, in the case of the Catalog of Life the clades change each year, so you will have to plan for that.

Here is the ontology: http://lod.taxonconcept.org/ontology/phylo/CoL/CoL_2010_base.owl

Another pattern for this would be http://lod.taxonconcept.org/ontology/phylo/CoL/2010/base.owl

* I use "base" because I anticipate that once you get below something like order or family you might need to split the lower clades into separate files.
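
One possible common pattern for the clade names mentioned above (plantGenus_Acer, Fungi_Class_Unassigned, a year in the path) could be sketched as follows. The base URI template uses example.org and the function names are hypothetical; this is only an illustration of the naming idea, not the scheme actually used in the ontology linked above.

```python
import re

# Assumption: placeholder base with the classification year in the path.
BASE_TEMPLATE = "http://example.org/ontology/phylo/CoL/{year}/base.owl#"


def _slug(text: str) -> str:
    """Collapse anything that is not a letter or digit into single underscores."""
    return re.sub(r"[^A-Za-z0-9]+", "_", text.strip()).strip("_")


def clade_local_name(rank: str, name: str, qualifier: str = "") -> str:
    """Build e.g. 'plantGenus_Acer' from rank='genus', name='Acer', qualifier='plant'."""
    rank_part = _slug(rank).capitalize()
    prefix = _slug(qualifier).lower() + rank_part if qualifier else rank_part
    return f"{prefix}_{_slug(name)}"


def unassigned_clade_name(parent: str, rank: str) -> str:
    """Build a placeholder such as 'Fungi_Class_Unassigned' for an unassigned rank."""
    return f"{_slug(parent)}_{_slug(rank).capitalize()}_Unassigned"


def clade_uri(local_name: str, year: int, base_template: str = BASE_TEMPLATE) -> str:
    """Combine the year-specific base and the local name into a full URI."""
    return base_template.format(year=year) + local_name
```

Putting the year in the base keeps each annual classification in its own URI space, so a clade that moves or disappears in a later edition does not silently change meaning.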
Here is the ontology documentation: http://lod.taxonconcept.org/ontology/phylo/CoL/doc/index.html

This was made using Protege, which you can get from Stanford: http://protege.stanford.edu/

The particular reason I made this is described here: http://www.taxonconcept.org/taxonconcept-blog/2010/6/10/a-species-has_many-c...

Again, this might not be structured in the best way for all people, but some might find the example useful and we might be able to come up with a common solution for dealing with genera or other clades that have identical names. The URIs for these clades need to be different but the rdfs:label could be the same.

There are a lot of features of SKOS; for one thing, it can easily handle labels for the different clades in many different languages. This is especially good for vernacular or common names.

The only side effect of using SKOS is that it entails the classes as skos:Concepts. Some in the LOD community think this might be a problem, others do not. In using SKOS to document a clade you are making it a skos:Concept, but classes can have many types, so something can be both a skos:Concept and a dwc:Rank etc.

Here is a good resource on SKOS: http://www.w3.org/TR/skos-primer/

- Pete

On Thu, Oct 7, 2010 at 10:38 AM, Roger Hyam <rogerhyam@mac.com> wrote:

...

I am only just keeping up with this thread so excuse me if I speak out of turn.

I wonder if we should have a normative rendering of DwC in RDF at all.

There will probably be many many decisions to take in developing such a model. The correct answer to each one of these decisions will depend on the use-case for the model. If we don't have at least one clear use-case for *consuming* the data then our decisions will be either arbitrary or we will argue around and around in circles unable to decide on the perfect answer. A use-case should involve answering a real world question not a hypothetical one and it should be testable.

As an example, I am of the growing opinion that taxonomic classifications would be rendered perfectly adequately in SKOS - each classification being a free-standing thesaurus linked to other thesauri/classifications using the standard SKOS terms. Any tool that understood SKOS would then "understand" taxonomy. One could produce a mapping from DwC checklists to SKOS to do this. It would be totally inappropriate to convert a DwC dataset of specimens into a SKOS thesaurus, as specimens could never be considered concepts.
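
A mapping from a DwC checklist to SKOS of the kind suggested above might look roughly like this. It is a sketch under assumptions: checklist rows are dictionaries keyed by the Darwin Core terms taxonID, scientificName and parentNameUsageID, identifiers are already usable as URIs, and the function name `checklist_to_skos` is hypothetical.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def checklist_to_skos(rows, scheme_uri):
    """Turn checklist rows into (subject, predicate, object) SKOS triples."""
    triples = []
    for row in rows:
        concept = row["taxonID"]
        triples.append((concept, RDF_TYPE, SKOS + "Concept"))
        triples.append((concept, SKOS + "inScheme", scheme_uri))
        triples.append((concept, SKOS + "prefLabel", row["scientificName"]))
        parent = row.get("parentNameUsageID")
        if parent:
            # The parent taxon becomes the broader concept.
            triples.append((concept, SKOS + "broader", parent))
        else:
            # Rows without a parent are the top concepts of this classification.
            triples.append((scheme_uri, SKOS + "hasTopConcept", concept))
    return triples
```

Each classification gets its own scheme URI, which matches the idea of free-standing thesauri; links between classifications would need additional SKOS mapping properties that this sketch does not attempt.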

I hesitate to formally propose the SKOS model (I thought about presenting it at TDWG) because I haven't found even a decent SKOS browser or an application that would demonstrate the utility of this approach - though I have ideas...

Anyhow just my tuppence worth - before departing for a long weekend :)

All the best,

Roger

On 7 Oct 2010, at 15:41, Steve Baskauf wrote:

...

Blum, Stan

10:29

New subject: Idea for Discussion, Differentiating between "type's" of identifiers

Hi Steve,

Sorry, I missed your message below (as well as your response to Roger) before I sent my reply about the utility of an RDF guide for DwC. Obviously, I think it’s a great idea.

To do this within the “normal” TDWG process, this should be done as a Task Group. I could help you draft a charter for that, which would then need to be reviewed by the TAG and Exec. Once approved, we would put the charter up on the web site, and do our best to provide any other resources that would help speed the task.

I don’t mean to slow you down. The Charter doesn’t have to be elaborate. Its function is to let others in TDWG and beyond know that this task is proceeding, who to contact, how to get involved, etc. It also gives you the backing of the TDWG community.

Let me know if you’d like to pursue this.

-Stan

On 10/7/10 7:41 AM, "Steve Baskauf" <steve.baskauf@vanderbilt.edu> wrote:

I agree that it is best to avoid a proliferation of terms and I agree that it is best to keep Darwin Core technology independent to the maximum extent possible. However, I think that the case of facilitating HTTP URIs is a special one because of the requirements of GUIDs/Persistent Identifiers. Both the TDWG and GBIF guidelines such as they currently stand say that GUIDs must be resolvable, that in their resolution they must return RDF, and that the RDF has to be in an XML format. Like it or not, that is what we have. Given the amount of time that it seems to have taken to settle on that much, I think it is best for us to decide to live with it, warts and all, rather than re-opening the discussion and delaying the implementation of GUIDs for another five years.

Given that assumption, there needs to be within Darwin Core some way to support this particular "technology" (Linked Data, RDF/XML) even if we don't do "special" things to support other technologies such as LSID, DOI, etc. The point is well taken that most of those other technologies have mechanisms for turning their identifiers into URIs and the aforementioned guidelines lay out how owl:sameAs can be used within the RDF to associate the non-HTTP-resolvable forms with the URIs. Based on my admittedly limited experience with trying to write RDF using Darwin Core terms, I think that in most cases there already exist appropriate terms for getting the job done. What may be lacking is concrete examples and community consensus on what terms to use for what. I also think that there are probably some "ID" terms where it isn't really very important (from an RDF point of view) that there exist both a URI form and a text string form. I'm thinking of something like dwc:identificationID, which is most likely to be needed to allow a machine to make a connection between some resource and its identification. The machine isn't going to care if there is a human-readable version. In contrast, something like dwc:collectionID is likely to need both a URI version (e.g. proxied version of the BCI LSID) for the machines and a string version (the name of the collection as it would be displayed) for humans. I think that trying to make example/template RDF for various types of resources will help make it clear in which cases one version (URI), the other (string), or both are actually necessary.

I "volunteered" a couple weeks ago to have a go at writing an RDF guide for Darwin Core. I am still willing to do this, although I'm still getting caught up at work from being at the TDWG meeting. However, next week we have fall break and I will make it a priority to come up with a draft which can be the subject of discussion. As a part of this process, I think it would be good to create one or more "boilerplate" RDF files for the various kinds of resources that are likely to be identified with GUIDs (e.g. Occurrences, Taxa, etc.). This can also be a subject of discussion and I think it will help to clarify what will meet the actual needs that we have discussed in this thread. I have a pretty clear picture of what I think Occurrence RDF should look like. I'm going to have to depend on Pete and others to deal with the taxonomy part.

Steve

Markus Döring wrote:

Steve, Pete,

I'd like to draw your attention to a basic DarwinCore design pattern. Dwc has the goal of being technology independent by simply providing a list of abstract terms one can use in various arenas such as xml, rdf, xhtml, csv etc. And even within those there might be various ways of using them (e.g. we have a normalised and a simple flat xml schema); that's why we should have a guideline for each of them on how to use them. We are missing such a guideline for rdf currently, hence this debate.

Whether scientificName is a literal string or some complex object shouldn't matter - it's defined to be a scientific name. Such a dwc rdf property could either hold a literal string or a url to some name rdf:resource (potentially with an rdfs:label).

With the introduction of many ID terms we have diluted that idea a little already in my mind. We could have as well used scientificName in xml to hold some identifier for that name. All URNs tell you what they are by their urn prefix (not necessarily how to resolve them), so you can easily detect a UUID, LSID, http(s) url, ftp, doi and apply the conventional resolution mechanism. The hardest problem is the local ids and other plain identifiers. For those mainly we created the ID terms (at least in my mind). I am feeling rather uncomfortable discussing the introduction of specific dwc terms for each type of id. Maybe we should remove all id terms in dwc and use the specific guidelines to specify these? At least if you really think having all those id terms for rdf is a good thing I would feel much more comfortable going down this route instead of diluting dwc by adding more and more rather redundant terms. The abstract concept is key to a dwc term, not the actual data type forced by the technology you are using it with. Would you want several date terms for various date formats? In fact we do that already to some degree (eventDate, eventTime, year, month, day, verbatimEventDate) and I always felt this is not a good idea. There are also a number of verbatimXXX terms in dwc which also contradict this pattern.

Talking about new dwc terms - in the examples given, a property like "hasScientificName" is not strictly the correct dwc term, which is simply scientificName. I think it would be fine to have the convention in the rdf guidelines to use hasDwcTerm instead of dwcTerm; this is exactly what an rdf guideline is for. On the flip side I am sure this only applies to some terms; recordedBy for example is likely to remain as it is. It's unclear to me what is best to do really. Always stick to the original dwc terms? Refine them through some rdfs or owl schema and define the relation to the original term? Should we still use the same namespace in this case?

As an rdf beginner, even after a few years of exposure, I wonder if we can't simply stick to the non-ID terms and use them either as literals or with a uri pointer. As in the rdf world a resolvable http URI is really required for resource relations to work, why not simply mandate this in the guidelines? If you only happen to have non-resolvable uris like lsids or dois, the guidelines should be asking you to use proxied versions, knowing it will break rdf frameworks and lod conventions otherwise. On the resolving side one could always include such urns with owl:sameAs (or something similar) I believe. But how many non-resolvable ids with no matching http counterpart are really out there yet?
- Markus On Oct 6, 2010, at 9:02, Peter DeVries wrote: Hi Steve, You are probably right that it might be best to use rdfs:Label, but I am thinking we might be able to get the same result my defining the string variants as subproperties of rdfs:Label. This would make them an rdfs:Label but a special kind of rdfs:Label. This is one of those things that I would test with Sindice and URIburner to see if they interpret these correctly. This would require a live vocabulary that Sindice could look at to determine that hasScientificName is to be treated as a rdfs:Label. - Pete On Mon, Oct 4, 2010 at 10:41 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> <mailto:steve.baskauf@vanderbilt.edu> wrote: Although this specific example deals with taxonomic name identifiers, it is related to a previous discussion on this list about how we should use the dwc:xxxxxID terms and other terms (such as recordedBy and identifiedBy) that could have either a string (literal) or URI form. Although I don't really want to see an unnecessary proliferation of Darwin Core terms, I think that in the interest of clarity (particularly where RDF is involved) there either should be multiple terms that make it clear what form of identifier is expected, or else there should be an understanding that in RDF the default for such a term is a URI which would then have an rdfs:Label property which was the string form. I think the former would be preferable to the latter. I came to this opinion when trying to write RDF describing an herbarium specimen. The collector should be the dwc:recordedBy property of the specimen. Optimally, there would be a database in which known collectors were assigned URIs so that "Glen N. Montz", "Glen Montz", "G. N. Montz", etc. would all be different labels for the same resource. However, realistically, I'm not going to drop what I'm doing to set up such a database (even if I were capable of doing it, which I'm not). So I ended up just writing it as <dwc:recordedBy>Glen N. 
Montz</dwc:recordedBy> even though I knew it wasn't probably the best thing. In a large Occurrence database that was compiled from the RDF created by a lot of people, there might end up being a mixture of strings and URIs for dwc:recordedBy properties of the specimens. It seems to me like it would be better to have properties like dwc:recordedBy for strings and dwc:recordedByURI for a corresponding URI (and I suppose dwc:reco rdedByLSID if anyone wants to use it). Of course, this would require a number of term additions to DwC and clarification in the DwC documentation that the generic version was intended for strings. With respect to the example <dwc:hasScientificNameLSID rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/> I think you are right that (with the possible exception of rdfs:seeAlso) there is an expectation that an rdf:resource attribute will be a resolvable URI that produces RDF. So <dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID> is probably better. Steve Peter DeVries wrote: I have been thinking about the following pattern. In part after looking at the GBIF vocabulary. I am not sure if it is even a good idea but might be worth some discussion. For those fields that have both a string and "ID" form maybe the following pattern might be useful hasScientificName = string form hasScientificNameURI = Resolvable LOD compliant identifier hasScientificNameLSID = LSID identifier which could be resolvable once you add the "http:proxy" <http:proxy> etc. This allows all three forms to be included if desired, it also provides a hint as to how the field should be interpreted or resolved. One group could also provide a mapping service so that each record does not need to include all three forms, but would allow systems to find the matching LSID for a given URI or vs. versa. 
My concern was that it would be difficult to infer how a scientificNameID should be interpreted by other systems. Is this an LSD, is it a URI, is it a UUID etc. ? This impacts the structure of the RDF. * Note that the actual identifiers might not be correct, the example below is more about the form of the RDF * For instance, I don't think it is probably correct to see the COL LSID as just a namestring * Also in this example the GNI name does not exactly match the string name <dwc:hasScientificName>Puma concolor (Linnaeus 1771)</dwc:hasScientificName> <dwc:hasScientificNameURI rdf:resource="http://gni.globalnames.org/name_strings/6c3dc35f-d901-5cc5-b9c8-ad241069b9f8" <http://gni.globalnames.org/name_strings/6c3dc35f-d901-5cc5-b9c8-ad241069b9f8> /> <dwc:hasScientificNameLSID rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/> Some system may choke on the LSID form assuming that it uses a standard resolution mechanism So it might be best to use this form <dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID> - Pete ---------------------------------------------------------------- Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base / GeoSpecies Knowledge Base About the GeoSpecies Knowledge Base ------------------------------------------------------------

Thomas Bandholtz

8 Oct

05:07

New subject: [tdwg-tag] Idea for Discussion, Differentiating between "type's" of identifiers

Hi all,

having just joined this list, I find it a great idea to have such an RDF Task Group. I am going to publish a little species catalog for the Federal Environment Agency in Germany as Linked Data, and I am looking for the best way to express it in RDF.

In the Linked Data cloud [1] I find several related contributions, such as Geospecies, TaxonConcept, EUNIS, and more. Comparing these approaches I prefer the idea of reusing SKOS [2] labels and hierarchical relations, as in the Geospecies example [3].

It might be a good idea to apply the SKOS XL extension as well to go deeper into the taxon name properties. Finally, I would add the taxon ranks as a distinct concept scheme and link them to the taxon concepts with a mapping relation.

Certainly this is not the only way to go, but it is rather simple and will be easily understood as SKOS is quite common in the Linked Data community. This might be called "Simple Darwin Core" and give room for more complex ontology approaches beyond that.

Looking forward to discussion,
Thomas

[1] http://lod-cloud.net
[2] http://www.w3.org/TR/2009/NOTE-skos-primer-20090818/
[3] http://lod.geospecies.org/ses/v6n7p.rdf

Peter J. DeVries

On 07.10.2010 19:29, Blum, Stan wrote:

...

Hi Steve,

Sorry, I missed your message below (as well as your response to Roger) before I sent my reply about the utility of an RDF guide for DwC. Obviously, I think it’s a great idea. To do this within the “normal” TDWG process, this should be done as a Task Group. I could help you draft a charter for that, which would then need to be reviewed by the TAG and Exec. Once approved, we would put the charter up on the web site, and do our best to provide any other resources that would help speed the task. I don’t mean to slow you down. The Charter doesn’t have to be elaborate. Its function is to let others in TDWG and beyond know that this task is proceeding, whom to contact, how to get involved, etc. It also gives you the backing of the TDWG community.

Let me know if you’d like to pursue this.

-Stan

On 10/7/10 7:41 AM, "Steve Baskauf" <steve.baskauf@vanderbilt.edu> wrote:

I agree that it is best to avoid a proliferation of terms and I agree that it is best to keep Darwin Core technology independent to the maximum extent possible. However, I think that the case of facilitating HTTP URIs is a special one because of the requirements of GUIDs/Persistent Identifiers. Both the TDWG and GBIF guidelines such as they currently stand say that GUIDs must be resolvable, that in their resolution they must return RDF, and that the RDF has to be in an XML format. Like it or not, that is what we have. Given the amount of time that it seems to have taken to settle on that much, I think it is best for us to decide to live with it, warts and all, rather than re-opening the discussion and delaying the implementation of GUIDs for another five years.

Given that assumption, there needs to be within Darwin Core some way to support this particular "technology" (Linked Data, RDF/XML) even if we don't do "special" things to support other technologies such as LSID, DOI, etc. The point is well taken that most of those other technologies have mechanisms for turning their identifiers into URIs and the aforementioned guidelines lay out how owl:sameAs can be used within the RDF to associate the non-HTTP-resolvable forms with the URIs. Based on my admittedly limited experience with trying to write RDF using Darwin Core terms, I think that in most cases there already exist appropriate terms for getting the job done. What may be lacking is concrete examples and community consensus on what terms to use for what. I also think that there are probably some "ID" terms where it isn't really very important (from an RDF point of view) that there exist both a URI form and a text string form. I'm thinking of something like dwc:identificationID, which is most likely to be needed to allow a machine to make a connection between some resource and its identification. The machine isn't going to care if there is a human-readable version. In contrast, something like dwc:collectionID is likely to need both a URI version (e.g. proxied version of the BCI LSID) for the machines and a string version (the name of the collection as it would be displayed) for humans. I think that trying to make example/template RDF for various types of resources will help make it clear in which cases one version (URI), the other (string), or both are actually necessary.

I "volunteered" a couple weeks ago to have a go at writing an RDF guide for Darwin Core. I am still willing to do this, although I'm still getting caught up at work from being at the TDWG meeting. However, next week we have fall break and I will make it a priority to come up with a draft which can be the subject of discussion. As a part of this process, I think it would be good to create one or more "boilerplate" RDF files for the various kinds of resources that are likely to be identified with GUIDs (e.g. Occurrences, Taxa, etc.). This can also be a subject of discussion and I think it will help to clarify what will meet the actual needs that we have discussed in this thread. I have a pretty clear picture of what I think Occurrence RDF should look like. I'm going to have to depend on Pete and others to deal with the taxonomy part.

Steve

Markus Döring wrote:

Steve, Pete,

I'd like to draw your attention to a basic DarwinCore design pattern. Dwc has the goal of being technology independent by simply providing a list of abstract terms one can use in various arenas such as xml, rdf, xhtml, csv etc. And even within those there might be various ways of using them (e.g. we have a normalised and a simple flat xml schema); that's why we should have a guideline for each of them on how to use them. We are missing such a guideline for rdf currently, hence this debate.

Whether scientificName is a literal string or some complex object shouldn't matter - it's defined to be a scientific name. Such a dwc rdf property could either hold a literal string or a url to some name rdf:resource (potentially with an rdfs:label).
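The two forms described in this paragraph can be sketched with Python's standard xml.etree.ElementTree. This is only an illustration: the helper name scientific_name_element is hypothetical, the example values are the ones Pete used in this thread, and the dwc namespace URI is assumed to be the Darwin Core terms namespace.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DWC_NS = "http://rs.tdwg.org/dwc/terms/"  # assumed Darwin Core terms namespace

# Ask ElementTree to serialize with the familiar rdf: and dwc: prefixes.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dwc", DWC_NS)


def scientific_name_element(value: str, as_resource: bool) -> ET.Element:
    """Build a dwc:scientificName element (hypothetical helper).

    as_resource=True  -> <dwc:scientificName rdf:resource="..."/>
    as_resource=False -> <dwc:scientificName>...</dwc:scientificName>
    """
    element = ET.Element(f"{{{DWC_NS}}}scientificName")
    if as_resource:
        # URI form: the value points at a resource describing the name.
        element.set(f"{{{RDF_NS}}}resource", value)
    else:
        # Literal form: the value is the name string itself.
        element.text = value
    return element


literal = scientific_name_element("Puma concolor (Linnaeus 1771)", as_resource=False)
resource = scientific_name_element(
    "http://gni.globalnames.org/name_strings/6c3dc35f-d901-5cc5-b9c8-ad241069b9f8",
    as_resource=True,
)
print(ET.tostring(literal, encoding="unicode"))
print(ET.tostring(resource, encoding="unicode"))
```

Either way the property is the same term; only the shape of its value differs, which is the point being made here.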

With the introduction of many ID terms we have diluted that idea a little already in my mind. We could have as well used scientificName in xml to hold some identifier for that name. All URNs tell you what they are by their urn prefix (not necessarily how to resolve them), so you can easily detect a UUID, LSID, http(s) url, ftp, doi and apply the conventional resolution mechanism. The hardest problems are the local ids and other plain identifiers. For those mainly we created the ID terms (at least in my mind). I am feeling rather uncomfortable discussing the introduction of specific dwc terms for each type of id. Maybe we should remove all id terms in dwc and use the specific guidelines to specify these? At least if you really think having all those id terms for rdf is a good thing I would feel much more comfortable going down this route instead of diluting dwc by adding more and more rather redundant terms. The abstract concept is key to a dwc term, not the actual data type forced by the technology you are using it with. Would you want several date terms for various date formats? In fact we do that already to some degree (eventDate, eventTime, year, month, day, verbatimEventDate) and I always felt this is not a good idea. There are also a number of verbatimXXX terms in dwc which also contradict this pattern.
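As a rough illustration of the prefix-based detection mentioned above, the following Python sketch guesses the kind of identifier from how the string starts. The function name classify_identifier and the category labels are assumptions made for this example, and the prefix list is deliberately incomplete; local and other plain identifiers fall through to a catch-all, which is exactly the hard case.

```python
import uuid


def classify_identifier(value: str) -> str:
    """Guess the kind of identifier from its prefix or shape (hypothetical helper)."""
    text = value.strip()
    lowered = text.lower()
    if lowered.startswith("urn:lsid:"):
        return "lsid"
    if lowered.startswith("urn:uuid:"):
        return "uuid"
    if lowered.startswith(("http://", "https://")):
        return "http"
    if lowered.startswith("ftp://"):
        return "ftp"
    if lowered.startswith(("doi:", "info:doi/")):
        return "doi"
    try:
        # A bare UUID carries no prefix, so test its shape instead.
        uuid.UUID(text)
        return "uuid"
    except ValueError:
        pass
    # Anything else is treated as a local or otherwise plain identifier.
    return "local"


for example in (
    "urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010",
    "http://gni.globalnames.org/name_strings/6c3dc35f-d901-5cc5-b9c8-ad241069b9f8",
    "6c3dc35f-d901-5cc5-b9c8-ad241069b9f8",
    "Glen N. Montz",
):
    print(classify_identifier(example), example)
```

Knowing the kind of identifier says nothing about whether it actually resolves, which is the caveat in the parenthesis above.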

Talking about new dwc terms - in the examples given, a property like "hasScientificName" is not strictly the correct dwc term, which is simply scientificName. I think it would be fine to have the convention in the rdf guidelines to use hasDwcTerm instead of dwcTerm; this is exactly what an rdf guideline is for. On the flip side I am sure this only applies to some terms; recordedBy for example is likely to remain as it is. It's unclear to me what is best to do really. Always stick to the original dwc terms? Refine them through some rdfs or owl schema and define the relation to the original term? Should we still use the same namespace in this case?

As an rdf beginner, even after a few years of exposure, I wonder if we can't simply stick to the non-ID terms and use them either as literals or with a uri pointer. As in the rdf world a resolvable http uri is really required for resource relations to work, why not simply mandate this in the guidelines? If you only happen to have non-resolvable uris like lsids or dois, the guidelines should be asking you to use proxied versions, knowing it will break rdf frameworks and lod conventions otherwise. On the resolving side one could always include such urns with owl:sameAs (or something similar) I believe. But how many non-resolvable ids with no matching http counterpart are really out there yet?
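To make the proxying idea concrete, here is a small Python sketch that turns an LSID into an HTTP URI by prefixing a proxy base, and then states the link between the two forms with owl:sameAs as an N-Triples line. The function names are hypothetical, and the default proxy base URL is an assumption used for illustration; substitute whatever proxy the identifier's provider documents.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def proxy_lsid(lsid: str, proxy_base: str = "http://lsid.tdwg.org/") -> str:
    """Return an HTTP URI formed by prefixing an LSID with a proxy base URL.

    The default proxy_base is an assumption for this sketch.
    """
    if not lsid.lower().startswith("urn:lsid:"):
        raise ValueError(f"not an LSID: {lsid!r}")
    return proxy_base.rstrip("/") + "/" + lsid


def same_as_statement(http_uri: str, lsid: str) -> str:
    """Build an N-Triples line asserting owl:sameAs between the two forms."""
    return f"<{http_uri}> <{OWL_SAME_AS}> <{lsid}> ."


lsid = "urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"
http_uri = proxy_lsid(lsid)
print(http_uri)
print(same_as_statement(http_uri, lsid))
```

With this arrangement the HTTP form is what goes into rdf:resource, while the bare LSID is still recorded for systems that know how to resolve it.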

- Markus

On Oct 6, 2010, at 9:02, Peter DeVries wrote:

Hi Steve,

You are probably right that it might be best to use rdfs:label, but I am thinking we might be able to get the same result by defining the string variants as subproperties of rdfs:label.

This would make them an rdfs:label, but a special kind of rdfs:label.

This is one of those things that I would test with Sindice and URIburner to see if they interpret these correctly.

This would require a live vocabulary that Sindice could look at to determine that hasScientificName is to be treated as an rdfs:label.

- Pete

On Mon, Oct 4, 2010 at 10:41 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:

Although this specific example deals with taxonomic name identifiers, it is related to a previous discussion on this list about how we should use the dwc:xxxxxID terms and other terms (such as recordedBy and identifiedBy) that could have either a string (literal) or URI form. Although I don't really want to see an unnecessary proliferation of Darwin Core terms, I think that in the interest of clarity (particularly where RDF is involved) there either should be multiple terms that make it clear what form of identifier is expected, or else there should be an understanding that in RDF the default for such a term is a URI which would then have an rdfs:label property which was the string form. I think the former would be preferable to the latter.

I came to this opinion when trying to write RDF describing an herbarium specimen. The collector should be the dwc:recordedBy property of the specimen. Optimally, there would be a database in which known collectors were assigned URIs so that "Glen N. Montz", "Glen Montz", "G. N. Montz", etc. would all be different labels for the same resource. However, realistically, I'm not going to drop what I'm doing to set up such a database (even if I were capable of doing it, which I'm not). So I ended up just writing it as <dwc:recordedBy>Glen N. Montz</dwc:recordedBy> even though I knew it probably wasn't the best thing. In a large Occurrence database that was compiled from the RDF created by a lot of people, there might end up being a mixture of strings and URIs for dwc:recordedBy properties of the specimens. It seems to me like it would be better to have properties like dwc:recordedBy for strings and dwc:recordedByURI for a corresponding URI (and I suppose dwc:recordedByLSID if anyone wants to use it). Of course, this would require a number of term additions to DwC and clarification in the DwC documentation that the generic version was intended for strings.
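The collector case can be sketched in Python as a lookup from name variants to a single URI, emitting the string form always and the URI form only when the collector is known. Everything named here is hypothetical: the collector URI is a made-up example.org address, and dwc:recordedByURI is the term proposed in this message, not an existing Darwin Core term.

```python
# Hypothetical collector URI, used only for illustration.
MONTZ_URI = "http://example.org/collector/montz"

# Several labels for the same resource, as described above.
COLLECTOR_LABELS = {
    "Glen N. Montz": MONTZ_URI,
    "Glen Montz": MONTZ_URI,
    "G. N. Montz": MONTZ_URI,
}


def recorded_by_properties(name: str) -> dict:
    """Return the string form, plus the URI form when the name is known."""
    properties = {"dwc:recordedBy": name}
    uri = COLLECTOR_LABELS.get(name)
    if uri is not None:
        # dwc:recordedByURI is the term proposed in this thread.
        properties["dwc:recordedByURI"] = uri
    return properties


print(recorded_by_properties("Glen Montz"))
print(recorded_by_properties("A. Collector"))
```

Keeping the two properties separate means a consumer never has to guess whether a dwc:recordedBy value is a string or a URI.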

With respect to the example

<dwc:hasScientificNameLSID rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/>

I think you are right that (with the possible exception of rdfs:seeAlso) there is an expectation that an rdf:resource attribute will be a resolvable URI that produces RDF. So

<dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID>

is probably better.

Steve

Peter DeVries wrote:

I have been thinking about the following pattern. In part after looking at the GBIF vocabulary.

I am not sure if it is even a good idea but might be worth some discussion.

For those fields that have both a string and "ID" form maybe the following pattern might be useful

hasScientificName = string form
hasScientificNameURI = Resolvable LOD compliant identifier
hasScientificNameLSID = LSID identifier which could be resolvable once you add the "http:proxy" etc.

This allows all three forms to be included if desired, it also provides a hint as to how the field should be interpreted or resolved.

One group could also provide a mapping service so that each record does not need to include all three forms, but would allow systems to find the matching LSID for a given URI or vice versa.

My concern was that it would be difficult to infer how a scientificNameID should be interpreted by other systems.

Is this an LSID, is it a URI, is it a UUID, etc.?

This impacts the structure of the RDF.

* Note that the actual identifiers might not be correct, the example below is more about the form of the RDF
* For instance, I don't think it is probably correct to see the COL LSID as just a namestring
* Also in this example the GNI name does not exactly match the string name

<dwc:hasScientificName>Puma concolor (Linnaeus 1771)</dwc:hasScientificName>
<dwc:hasScientificNameURI rdf:resource="http://gni.globalnames.org/name_strings/6c3dc35f-d901-5cc5-b9c8-ad241069b9f8"/>
<dwc:hasScientificNameLSID rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/>

Some systems may choke on the LSID form, assuming that it uses a standard resolution mechanism.

So it might be best to use this form

<dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID>

- Pete

---------------------------------------------------------------- Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base / GeoSpecies Knowledge Base About the GeoSpecies Knowledge Base ------------------------------------------------------------


-- Thomas Bandholtz, thomas.bandholtz@innoq.com, http://www.innoq.com innoQ Deutschland GmbH, Halskestr. 17, D-40880 Ratingen, Germany Phone: +49 228 9288490 Mobile: +49 178 4049387 Fax: +49 228 9288491

Bob Morris

19 Oct

06:21

New subject: [tdwg-tag] Idea for Discussion, Differentiating between "type's" of identifiers

It is very important to keep the words "Linked Data" and "Linked Open Data" clearly in the conversation and properly used. Neither of them is equivalent to the less well-defined "Semantic Web". Whether temporarily or not, neither LD nor LOD addresses some specific use cases that are important for other semantic applications. As an example, some of the recommended practices for LOD do not currently support tractable reasoning either fundamentally or with current reasoners. Intractable reasoning means, among other things, that it is possible to launch queries that will never complete, and for which it is not possible to know in advance whether that is the case or not.

The current conversation is treading on rather deep issues, some of which are on the bleeding edge of Knowledge Representation research. Premature decisions or KR-naive decisions will likely revisit the history of Darwin Core itself. That is, a tremendously useful solution will go a long way, provoke misuse along the way, and then come up against a stone wall requiring a major multi-year re-architecture effort, perhaps with huge expense to retrofit to the previous uses.

For some insight into what the problems are, one could do worse than read the thread that begins(?) with http://lists.w3.org/Archives/Public/public-lod/2010Jul/0330.html That thread addresses the current wobbly state of the FOAF ontology, which to my knowledge still remains without an agreed-upon form that guarantees tractable reasoning.

Also, see especially the bullet points on "Most Notably" in the Objectives of the 1st Workshop on Knowledge Injection into and Extraction from Linked Data http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=10142&copyownerid=1...

Bob Morris

On Fri, Oct 8, 2010 at 8:07 AM, Thomas Bandholtz <thomas.bandholtz@innoq.com> wrote:

...

Hi all,

having just joined this list, I find it a great idea to have such an RDF Task Group. I am going to publish a little species catalog for the Federal Environment Agency in Germany as Linked Data, and I am looking for the best way to express it in RDF.

In the Linked Data cloud [1] I find several related contributions, such as Geospecies, TaxonConcept, EUNIS, and more. Comparing these approaches I prefer the idea of reusing SKOS [2] labels and hierarchical relations, as in the Geospecies example [3].

It might be a good idea to apply the SKOS XL extension as well to go deeper into the taxon name properties. Finally, I would add the taxon ranks as a distinct concept scheme and link them to the taxon concepts with a mapping relation.

Certainly this is not the only way to go, but it is rather simple and will be easily understood as SKOS is quite common in the Linked Data community. This might be called "Simple Darwin Core" and give room for more complex ontology approaches beyond that.

Looking forward to discussion, Thomas

[1] http://lod-cloud.net [2] http://www.w3.org/TR/2009/NOTE-skos-primer-20090818/ [3] http://lod.geospecies.org/ses/v6n7p.rdf

Peter J. DeVries

Am 07.10.2010 19:29, schrieb Blum, Stan:

Hi Steve,

Sorry, I missed your message below (as well as your response to Roger) before I sent my reply about the utility of an RDF guide for DwC. Obviously, I think it’s a great idea. To do this within the “normal” TDWG process, this should be done as a Task Group. I could help you draft a charter for that, which would then need to be reviewed by the TAG and Exec. Once approved, we would put the charter up on the web site, and do our best to provide any other resources that would help speed the task. I don’t mean to slow you down. The Charter doesn’t have to be elaborate. It’s function is to let others in TDWG and beyond know that this task is proceeding, who to contact, how to get involved, etc. It also gives you the backing of the TDWG community.

Let me know if you’d like to pursue this.

-Stan

On 10/7/10 7:41 AM, "Steve Baskauf" <steve.baskauf@vanderbilt.edu> wrote:

I agree that it is best to avoid a proliferation of terms and I agree that it is best to keep Darwin Core technology independent to the maximum extent possible. However, I think that the case of facilitating HTTP URIs is a special one because of the requirements of GUIDs/Persistent Identifiers. Both the TDWG and GBIF guidelines such as they currently stand say that GUIDs must be resolvable, that in their resolution they must return RDF, and that the RDF has to be in an XML format. Like it or not, that is what we have. Given the amount of time that it seems to have taken to settle on that much, I think it is best for us to decide to live with it, warts and all, rather than re-opening the discussion and delaying the implementation of GUIDs for another five years.

Given that assumption, there needs to be within Darwin Core some way to support this particular "technology" (Linked Data, RDF/XML) even if we don't do "special" things to support other technologies such as LSID, DOI, etc. The point is well taken that most of those other technologies have mechanisms for turning their identifiers into URIs and the aforementioned guidelines lay out how owl:sameAs can be used within the RDF to associate the non-HTTP-resolvable forms with the URIs. Based on my admittedly limited experience with trying to write RDF using Darwin Core terms, I think that in most cases there already exists appropriate terms for getting the job done. What may be lacking is concrete examples and community consensus on what terms to use for what. I also think that there are probably some "ID" terms where it isn't really very important (from an RDF point of view) that there exist both a URI form and a text string form. I'm thinking of something like dwc:identificationID, which is mostly likely to be needed to allow a machine to make a connection between some resource and its identification. The machine isn't going to care if there is a human-readable version. In contrast, something like dwc:collectionID is likely to need both a URI version (e.g. proxied version of the BCI LSID) for the machines and a string version (the name of the collection as it would be displayed) for humans. I think that trying to make example/template RDF for various types of resources will help make it clear in which cases one version (URI), the other (string), or both are actually necessary.

I "volunteered" a couple weeks ago to have a go at writing an RDF guide for Darwin Core. I am still willing to do this, although I'm still getting caught up at work from being at the TDWG meeting. However, next week we have fall break and I will make it a priority to come up with a draft which can be the subject of discussion. As a part of this process, I think it would be good to create one or more "boilerplate" RDF files for the various kinds of resources that are likely to be identified with GUIDs (e.g. Occurrences, Taxa, etc.). This can also be a subject of discussion and I think it will help to clarify what will meet the actual needs that we have discussed in this thread. I have a pretty clear picture of what I think Occurrence RDF should look like. I'm going to have to depend on Pete and others to deal with the taxonomy part.

Steve

Markus Döring wrote:

Steve, Pete,

Id like to draw your attention on a basic DarwinCore design pattern. Dwc has the goal of being technology independent by simply providing a list of abstract terms one can use in various arenas such as xml, rdf, xhtml, csv etc. And even within those there might be various ways of using them (e.g. we have a normalised and a simple flat xml schema), thats why we should have a guideline for each of them on how to use them. We are missing such a guideline for rdf currently, hence this debate.

Whether scientificName is a literal string or some complex object shouldnt matter - its defined to be a scientific name. Such a dwc rdf property could either hold a literal string or a url to some name rdf:resource (potentially with a rdfs:label).

With the introduction of many ID terms we have diluted that idea a little already, in my mind. We could just as well have used scientificName in XML to hold some identifier for that name. All URNs tell you what they are by their URN prefix (not necessarily how to resolve them), so you can easily detect a UUID, LSID, http(s) URL, FTP, or DOI and apply the conventional resolution mechanism. The hardest problem is the local IDs and other plain identifiers. It was mainly for those that we created the ID terms (at least in my mind). I am feeling rather uncomfortable discussing the introduction of specific DwC terms for each type of ID. Maybe we should remove all ID terms in DwC and use the specific guidelines to specify these? At least, if you really think having all those ID terms for RDF is a good thing, I would feel much more comfortable going down this route instead of diluting DwC by adding more and more rather redundant terms. The abstract concept is key to a DwC term, not the actual data type forced by the technology you are using it with. Would you want several date terms for various date formats? In fact we do that already to some degree (eventDate, eventTime, year, month, day, verbatimEventDate) and I always felt this is not a good idea. There are also a number of verbatimXXX terms in DwC which also contradict this pattern.
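The prefix-based detection described above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_identifier`, the category labels, and the bare-UUID pattern are assumptions made for this sketch, not part of Darwin Core or any TDWG guideline.

```python
import re
from urllib.parse import urlparse

# Bare UUIDs (no "urn:uuid:" prefix) are recognised by shape; this pattern is an assumption.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def classify_identifier(value: str) -> str:
    """Guess the kind of identifier from its prefix (hypothetical helper)."""
    text = value.strip()
    lowered = text.lower()
    if lowered.startswith("urn:lsid:"):
        return "lsid"
    if lowered.startswith("urn:uuid:") or UUID_RE.fullmatch(text):
        return "uuid"
    if lowered.startswith("doi:") or lowered.startswith("info:doi/"):
        return "doi"
    scheme = urlparse(text).scheme.lower()
    if scheme in ("http", "https"):
        return "http"
    if scheme == "ftp":
        return "ftp"
    if lowered.startswith("urn:"):
        return "urn"
    # Anything else is treated as a local or plain identifier, the hard case noted above.
    return "local"
```

A consumer could call this before deciding which resolution mechanism to try. Local identifiers fall through to the last branch, which is exactly where the prefix carries no information.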

Talking about new DwC terms - in the examples given, a property like "hasScientificName" is not strictly the correct DwC term, which is simply scientificName. I think it would be fine to have the convention in the RDF guidelines to use hasDwcTerm instead of dwcTerm; this is exactly what an RDF guideline is for. On the flip side, I am sure this only applies to some terms; recordedBy, for example, is likely to remain as it is. It's unclear to me what is best to do, really. Always stick to the original DwC terms? Refine them through some RDFS or OWL schema and define the relation to the original term? Should we still use the same namespace in this case?

As an RDF beginner, even after a few years of exposure, I wonder if we can't simply stick to the non-ID terms and use them either as literals or with a URI pointer. As in the RDF world a resolvable HTTP URI is really required for resource relations to work, why not simply mandate this in the guidelines? If you only happen to have non-resolvable URIs like LSIDs or DOIs, the guidelines should ask you to use proxied versions, knowing it will break RDF frameworks and LOD conventions otherwise. On the resolving side one could always include such URNs with owl:sameAs (or something alike), I believe. But how many non-resolvable IDs with no matching HTTP counterpart are really out there yet?
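A rough Python sketch of the "proxied version plus owl:sameAs" idea follows. The proxy base `http://lsid.example.org/` is a placeholder, and the helper names are hypothetical; a real deployment would substitute whatever LSID proxy the community agrees on.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def proxy_lsid(lsid: str, proxy_base: str = "http://lsid.example.org/") -> str:
    """Turn an LSID into an HTTP URI by prefixing a proxy base (placeholder URL)."""
    if not lsid.lower().startswith("urn:lsid:"):
        raise ValueError(f"not an LSID: {lsid!r}")
    return proxy_base.rstrip("/") + "/" + lsid


def same_as_triple(lsid: str, proxy_base: str = "http://lsid.example.org/"):
    """Return a (subject, predicate, object) triple linking the proxied URI to the bare URN."""
    return (proxy_lsid(lsid, proxy_base), OWL_SAME_AS, lsid)
```

The resolvable HTTP form becomes the subject that Linked Data clients follow, while the original URN stays discoverable through the owl:sameAs statement.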

- Markus

On Oct 6, 2010, at 9:02, Peter DeVries wrote:

Hi Steve,

You are probably right that it might be best to use rdfs:label, but I am thinking we might be able to get the same result by defining the string variants as subproperties of rdfs:label.

This would make them an rdfs:label, but a special kind of rdfs:label.

This is one of those things that I would test with Sindice and URIburner to see if they interpret these correctly.

This would require a live vocabulary that Sindice could look at to determine that hasScientificName is to be treated as an rdfs:label.
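What a consumer would have to do with such a vocabulary can be sketched in Python as a tiny forward-chaining step over triples: the standard RDFS subproperty rule, under which a statement made with a subproperty also holds for its superproperty. The triple representation as plain tuples and the function name are assumptions for illustration; hasScientificName is the proposed term from this thread, not an existing Darwin Core term.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
DWC = "http://rs.tdwg.org/dwc/terms/"


def infer_superproperties(triples):
    """If (s, p, o) and (p, rdfs:subPropertyOf, q) hold, add (s, q, o); repeat until stable."""
    known = set(triples)
    changed = True
    while changed:
        changed = False
        sub_of = {(s, o) for s, p, o in known if p == RDFS_SUBPROPERTY_OF}
        for s, p, o in list(known):
            for sub, sup in sub_of:
                if p == sub and (s, sup, o) not in known:
                    known.add((s, sup, o))
                    changed = True
    return known


# Hypothetical vocabulary statement plus one data statement.
vocabulary = [(DWC + "hasScientificName", RDFS_SUBPROPERTY_OF, RDFS_LABEL)]
data = [("http://example.org/taxon/1", DWC + "hasScientificName", "Puma concolor (Linnaeus 1771)")]
inferred = infer_superproperties(vocabulary + data)
```

After this step the set also contains the name string as an rdfs:label of the example resource. Whether a given crawler actually performs this inference is the open question raised above.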

- Pete

On Mon, Oct 4, 2010 at 10:41 AM, Steve Baskauf <steve.baskauf@vanderbilt.edu> wrote:

Although this specific example deals with taxonomic name identifiers, it is related to a previous discussion on this list about how we should use the dwc:xxxxxID terms and other terms (such as recordedBy and identifiedBy) that could have either a string (literal) or URI form. Although I don't really want to see an unnecessary proliferation of Darwin Core terms, I think that in the interest of clarity (particularly where RDF is involved) there either should be multiple terms that make it clear what form of identifier is expected, or else there should be an understanding that in RDF the default for such a term is a URI which would then have an rdfs:label property which was the string form. I think the former would be preferable to the latter.

I came to this opinion when trying to write RDF describing an herbarium specimen. The collector should be the dwc:recordedBy property of the specimen. Optimally, there would be a database in which known collectors were assigned URIs so that "Glen N. Montz", "Glen Montz", "G. N. Montz", etc. would all be different labels for the same resource. However, realistically, I'm not going to drop what I'm doing to set up such a database (even if I were capable of doing it, which I'm not). So I ended up just writing it as <dwc:recordedBy>Glen N. Montz</dwc:recordedBy> even though I knew it probably wasn't the best thing. In a large Occurrence database that was compiled from the RDF created by a lot of people, there might end up being a mixture of strings and URIs for dwc:recordedBy properties of the specimens. It seems to me like it would be better to have properties like dwc:recordedBy for strings and dwc:recordedByURI for a corresponding URI (and I suppose dwc:recordedByLSID if anyone wants to use it). Of course, this would require a number of term additions to DwC and clarification in the DwC documentation that the generic version was intended for strings.

With respect to the example

<dwc:hasScientificNameLSID rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/>

I think you are right that (with the possible exception of rdfs:seeAlso) there is an expectation that an rdf:resource attribute will be a resolvable URI that produces RDF. So

<dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID>

is probably better.

Steve

Peter DeVries wrote:

I have been thinking about the following pattern. In part after looking at the GBIF vocabulary.

I am not sure if it is even a good idea but might be worth some discussion.

For those fields that have both a string and "ID" form maybe the following pattern might be useful

hasScientificName = string form
hasScientificNameURI = Resolvable LOD compliant identifier
hasScientificNameLSID = LSID identifier which could be resolvable once you add the "http:proxy" etc.

This allows all three forms to be included if desired, it also provides a hint as to how the field should be interpreted or resolved.

One group could also provide a mapping service so that each record does not need to include all three forms, but would allow systems to find the matching LSID for a given URI or vice versa.
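At its simplest, the core of such a mapping service is a two-way lookup table. The sketch below is a hypothetical in-memory version in Python; the class and method names are invented for illustration, and a real service would sit behind an HTTP interface and a persistent store.

```python
class IdentifierMap:
    """Hypothetical two-way lookup between HTTP URIs and LSIDs for the same name."""

    def __init__(self):
        self._uri_to_lsid = {}
        self._lsid_to_uri = {}

    def register(self, uri: str, lsid: str) -> None:
        # Store both directions so either form can be used as the key.
        self._uri_to_lsid[uri] = lsid
        self._lsid_to_uri[lsid] = uri

    def lsid_for(self, uri: str):
        """Return the LSID registered for this URI, or None if unknown."""
        return self._uri_to_lsid.get(uri)

    def uri_for(self, lsid: str):
        """Return the URI registered for this LSID, or None if unknown."""
        return self._lsid_to_uri.get(lsid)
```

With a table like this, a record would only need to carry one form of the identifier, and a consumer that prefers the other form could ask the service for it.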

My concern was that it would be difficult to infer how a scientificNameID should be interpreted by other systems.

Is this an LSID, is it a URI, is it a UUID, etc.?

This impacts the structure of the RDF.

* Note that the actual identifiers might not be correct; the example below is more about the form of the RDF
* For instance, I don't think it is probably correct to see the COL LSID as just a namestring
* Also, in this example the GNI name does not exactly match the string name

<dwc:hasScientificName>Puma concolor (Linnaeus 1771)</dwc:hasScientificName>
<dwc:hasScientificNameURI rdf:resource="http://gni.globalnames.org/name_strings/6c3dc35f-d901-5cc5-b9c8-ad241069b9f8"/>
<dwc:hasScientificNameLSID rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/>

Some systems may choke on the LSID form, assuming that it uses a standard resolution mechanism.

So it might be best to use this form:

<dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID>
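The two forms discussed above (LSID as an rdf:resource attribute versus LSID as a plain literal) can be generated side by side with Python's standard library, which may help when testing how different consumers react. The element name hasScientificNameLSID is the proposed term from this thread rather than an existing Darwin Core term, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DWC_NS = "http://rs.tdwg.org/dwc/terms/"
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dwc", DWC_NS)

# Proposed (not official) property name, in ElementTree's {namespace}local form.
LSID_PROPERTY = f"{{{DWC_NS}}}hasScientificNameLSID"


def lsid_as_resource(lsid: str) -> ET.Element:
    """Form 1: the LSID is an object reference that a consumer may try to dereference."""
    return ET.Element(LSID_PROPERTY, {f"{{{RDF_NS}}}resource": lsid})


def lsid_as_literal(lsid: str) -> ET.Element:
    """Form 2: the LSID is a plain string value that nothing will try to resolve."""
    element = ET.Element(LSID_PROPERTY)
    element.text = lsid
    return element


example = "urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"
print(ET.tostring(lsid_as_resource(example), encoding="unicode"))
print(ET.tostring(lsid_as_literal(example), encoding="unicode"))
```

The literal form trades away the graph link (the LSID is no longer a node that other statements can attach to) in exchange for not tripping up clients that assume every resource reference is HTTP-resolvable.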

- Pete

----------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
TaxonConcept Knowledge Base / GeoSpecies Knowledge Base
About the GeoSpecies Knowledge Base
------------------------------------------------------------


-- Thomas Bandholtz, thomas.bandholtz@innoq.com, http://www.innoq.com innoQ Deutschland GmbH, Halskestr. 17, D-40880 Ratingen, Germany Phone: +49 228 9288490 Mobile: +49 178 4049387 Fax: +49 228 9288491

_______________________________________________ tdwg-content mailing list tdwg-content@lists.tdwg.org http://lists.tdwg.org/mailman/listinfo/tdwg-content

-- Robert A. Morris Emeritus Professor of Computer Science UMASS-Boston 100 Morrissey Blvd Boston, MA 02125-3390 Associate, Harvard University Herbaria email: morris.bob@gmail.com web: http://bdei.cs.umb.edu/ web: http://etaxonomy.org/mw/FilteredPush http://www.cs.umb.edu/~ram phone (+1) 857 222 7992 (mobile)

Peter DeVries

11:02

New subject: [tdwg-tag] Idea for Discussion, Differentiating between "type's" of identifiers

I agree with Bob that solutions that might work well in the LOD Cloud might not be optimal for some uses. In one of my examples I have two links to a particular image because for some services I need to use foaf:depicts to get the image to display properly. For instance, this example: http://sig.ma/search?pid=88b8af943c0734ae3c2322e0d6787417

What I would prefer is a widely accepted solution that allowed me to tie a photograph as supporting documentation to a specimen and an occurrence record. A lot of the issues with FOAF etc. are being worked out, and there is enough overlap between the LOD community and the TDWG community that it often makes sense to ask them, "this is what we need, how would you recommend we do this". That was the procedure I followed to get an "Area" solution that supports a radius and works with the existing geo vocabulary.

You might get suggestions that are not exactly what you are looking for. You might get conflicting opinions. However, the end result is usually a set of well-informed suggestions.

What Bob may not be appreciating is that by working within the greater LOD community you get the same kinds of benefits that you often get from other open source projects - data sets, tools and documentation that can be extremely valuable. I would also argue that some of the techniques and technologies used by the LOD community scale much better and enable more efficient data harvesting, and are, in many ways, easier for small groups to implement than other technologies.

I mentioned this to Steve in a separate email, but to some extent the first step is to determine the kinds of queries you want to be able to make and then use that as a guide for how to design the RDF. I believe that I can design something that works well for the queries that I need and works well in the LOD cloud. What is not clear is whether TDWG will choose to adopt any of this.

Respectfully,

- Pete

On Tue, Oct 19, 2010 at 8:21 AM, Bob Morris <morris.bob@gmail.com> wrote:

...

It is very important to keep the words "Linked Data" and "Linked Open Data" clearly in the conversation and properly used. Neither of them is equivalent to the less well-defined "Semantic Web".

Whether temporarily or not, neither LD nor LOD addresses some specific use cases that are important for other semantic applications. As an example, some of the recommended practices for LOD do not currently support tractable reasoning, either fundamentally or with current reasoners. Intractable reasoning means, among other things, that it is possible to launch queries that will never complete, and for which it is not possible to know in advance whether that is the case or not.

The current conversation is treading on rather deep issues, some of which are on the bleeding edge of Knowledge Representation research. Premature decisions or KR-naive decisions will likely revisit the history of Darwin Core itself. That is, a tremendously useful solution will go a long way, provoke misuse along the way, and then come up against a stone wall requiring a major multi-year re-architecture effort, perhaps with huge expense to retrofit to the previous uses.

For some insight into what the problems are, one could do worse than read the thread that begins(?) with http://lists.w3.org/Archives/Public/public-lod/2010Jul/0330.html

That thread addresses the current wobbly state of the FOAF ontology, which to my knowledge still remains without an agreed-upon form that guarantees tractable reasoning.

Also, see especially the bullet points on "Most Notably" in the Objectives of 1st Workshop on Knowledge Injection into and Extraction from Linked Data

http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=10142&copyownerid=1...

Bob Morris

On Fri, Oct 8, 2010 at 8:07 AM, Thomas Bandholtz <thomas.bandholtz@innoq.com> wrote:

...
Hi all,

having just joined this list, I find it a great idea to have such an RDF Task Group. I am going to publish a little species catalog for the Federal Environment Agency in Germany as Linked Data, and I am looking for the best way to express it in RDF.

In the Linked Data cloud [1] I find several related contributions, such as Geospecies, TaxonConcept, EUNIS, and more. Comparing these approaches I prefer the idea of reusing SKOS [2] labels and hierarchical relations, as in the Geospecies example [3].

It might be a good idea to apply the SKOS XL extension as well to go deeper into the taxon name properties. Finally, I would add the taxon ranks as a distinct concept scheme and link them to the taxon concepts with a mapping relation.

Certainly this is not the only way to go, but it is rather simple and will be easily understood as SKOS is quite common in the Linked Data community. This might be called "Simple Darwin Core" and give room for more complex ontology approaches beyond that.

Looking forward to discussion, Thomas

[1] http://lod-cloud.net
[2] http://www.w3.org/TR/2009/NOTE-skos-primer-20090818/
[3] http://lod.geospecies.org/ses/v6n7p.rdf

Peter J. DeVries

On 07.10.2010 19:29, Blum, Stan wrote:

Hi Steve,

Sorry, I missed your message below (as well as your response to Roger) before I sent my reply about the utility of an RDF guide for DwC. Obviously, I think it’s a great idea. To do this within the “normal” TDWG process, this should be done as a Task Group. I could help you draft a charter for that, which would then need to be reviewed by the TAG and Exec. Once approved, we would put the charter up on the web site, and do our best to provide any other resources that would help speed the task. I don’t mean to slow you down. The charter doesn’t have to be elaborate. Its function is to let others in TDWG and beyond know that this task is proceeding, who to contact, how to get involved, etc. It also gives you the backing of the TDWG community.

Let me know if you’d like to pursue this.

-Stan

On 10/7/10 7:41 AM, "Steve Baskauf" <steve.baskauf@vanderbilt.edu> wrote:

I agree that it is best to avoid a proliferation of terms and I agree that it is best to keep Darwin Core technology independent to the maximum extent possible. However, I think that the case of facilitating HTTP URIs is a special one because of the requirements of GUIDs/Persistent Identifiers. Both the TDWG and GBIF guidelines such as they currently stand say that GUIDs must be resolvable, that in their resolution they must return RDF, and that the RDF has to be in an XML format. Like it or not, that is what we have. Given the amount of time that it seems to have taken to settle on that much, I think it is best for us to decide to live with it, warts and all, rather than re-opening the discussion and delaying the implementation of GUIDs for another five years.

Given that assumption, there needs to be within Darwin Core some way to support this particular "technology" (Linked Data, RDF/XML) even if we don't do "special" things to support other technologies such as LSID, DOI, etc. The point is well taken that most of those other technologies have mechanisms for turning their identifiers into URIs, and the aforementioned guidelines lay out how owl:sameAs can be used within the RDF to associate the non-HTTP-resolvable forms with the URIs. Based on my admittedly limited experience with trying to write RDF using Darwin Core terms, I think that in most cases there already exist appropriate terms for getting the job done. What may be lacking is concrete examples and community consensus on what terms to use for what. I also think that there are probably some "ID" terms where it isn't really very important (from an RDF point of view) that there exist both a URI form and a text string form. I'm thinking of something like dwc:identificationID, which is most likely to be needed to allow a machine to make a connection between some resource and its identification. The machine isn't going to care if there is a human-readable version. In contrast, something like dwc:collectionID is likely to need both a URI version (e.g. proxied version of the BCI LSID) for the machines and a string version (the name of the collection as it would be displayed) for humans. I think that trying to make example/template RDF for various types of resources will help make it clear in which cases one version (URI), the other (string), or both are actually necessary.

I "volunteered" a couple weeks ago to have a go at writing an RDF guide for Darwin Core. I am still willing to do this, although I'm still getting caught up at work from being at the TDWG meeting. However, next week we have fall break and I will make it a priority to come up with a draft which can be the subject of discussion. As a part of this process, I think it would be good to create one or more "boilerplate" RDF files for the various kinds of resources that are likely to be identified with GUIDs (e.g. Occurrences, Taxa, etc.). This can also be a subject of discussion and I think it will help to clarify what will meet the actual needs that we have discussed in this thread. I have a pretty clear picture of what I think Occurrence RDF should look like. I'm going to have to depend on Pete and others to deal with the taxonomy part.

Steve


Cam Webb

00:23

New subject: Idea for Discussion, Differentiating between "type's" of identifiers

Dear Steve, Pete, Stan, and others,

I've just joined this list and have been trying to digest the key points of the many posts of the last few weeks. The posts touching on serializing data about individual organisms in the field as RDF and best-practice application of DwC terms have resonated with me. I am looking for recommendations for modeling individual-level data (with photos, DNA records, identifications, morphology observations, etc.), for thousands of new plant observations in Indonesia over the next 3 years, as Linked Data, and would like to have to invent as few new terms as possible. Pete's Geospecies example and Steve's RDF Appendix in his recent Biodiv. Info. paper (7:17-44) have been very helpful.

...

I "volunteered" a couple weeks ago to have a go at writing an RDF guide for Darwin Core.... As a part of this process, I think it would be good to create one or more "boilerplate" RDF files for the various kinds of resources that are likely to be identified with GUIDs (e.g. Occurrences, Taxa, etc.).

Please sign me up to help! Such a guide is badly needed, for users. Perhaps we could start a page (or set of pages) on the TDWG wiki with the various choices of term for modeling each `hasA...' relationship, with a short outline of the pros and cons. I've been looking for such a collection but haven't found one (other than the DwC ontology at http://rs.tdwg.org/ontology/voc/).

I think the focus on the `Individual as a base Thing' that has been discussed this month is really important for potentially bridging to the work done by the ecological informatics community (e.g., modeling 1,000's of trees in sample plots as LOD) and the phenotype observation community (i.e., Quality-Entity relations as RDF). Determining the vocabulary that best extends DwC to these other domains will be very helpful.

Best,

Cam

+-------------------------------------------------+
| CAMPBELL O. WEBB                                |
| Arnold Arboretum of Harvard University          |
| [ Harvard University Herbaria,                  |
| 22 Divinity Ave, Cambridge MA, 02138, USA ]     |
+-------------------------------------------------+
| Mail: Kotak Pos 2, Sukadana, Kab. Kayong Utara  |
| Kalimantan Barat 78852, Indonesia               |
| Mobile/SMS: +62-813-9917-7663 (GMT+7)           |
| Web/PGP: http://phylodiversity.net/cwebb/       |
+-------------------------------------------------+

Kevin Richards

5 Oct

14:46

New subject: Idea for Discussion, Differentiating between "type's" of identifiers

Sounds like a good idea, but I'm not sure I agree with this. Putting in a hard coded GUID type (LSID) into the name of an element does not sound wise. It restricts the use of that element too much. The issue here is that linked data technologies cannot automatically resolve a URN (basic http get). So it may be better to include all URNs in this idea, eg hasScientificName = string form hasScientificNameID = http resolvable URI hasScientificNameURN = LSID or other URN This also then relies on any consuming technology to "know" what a specific URN type is and how to resolve it, but I suppose it is really systems that are LSID aware that tend to have this issue in the first place, so this would cover this specific area of concern ... So, I suppose I am agreeing with this approach, and like Gregor stated this sounds like a better way than "hacking" with URNs to make them http URLs. Kevin From: tdwg-content-bounces@lists.tdwg.org [mailto:tdwg-content-bounces@lists.tdwg.org] On Behalf Of Peter DeVries Sent: Monday, 4 October 2010 6:55 a.m. To: tdwg-content@lists.tdwg.org Subject: [tdwg-content] Idea for Discussion, Differentiating between "type's" of identifiers I have been thinking about the following pattern. In part after looking at the GBIF vocabulary. I am not sure if it is even a good idea but might be worth some discussion. For those fields that have both a string and "ID" form maybe the following pattern might be useful hasScientificName = string form hasScientificNameURI = Resolvable LOD compliant identifier hasScientificNameLSID = LSID identifier which could be resolvable once you add the "http:proxy" etc. This allows all three forms to be included if desired, it also provides a hint as to how the field should be interpreted or resolved. One group could also provide a mapping service so that each record does not need to include all three forms, but would allow systems to find the matching LSID for a given URI or vs. versa. 
My concern was that it would be difficult to infer how a scientificNameID should be interpreted by other systems. Is this an LSD, is it a URI, is it a UUID etc. ? This impacts the structure of the RDF. * Note that the actual identifiers might not be correct, the example below is more about the form of the RDF * For instance, I don't think it is probably correct to see the COL LSID as just a namestring * Also in this example the GNI name does not exactly match the string name <dwc:hasScientificName>Puma concolor (Linnaeus 1771)</dwc:hasScientificName> <dwc:hasScientificNameURI rdf:resource="http://gni.globalnames.org/name_strings/6c3dc35f-d901-5cc5-b9c8-ad241069b9f8"/> <dwc:hasScientificNameLSID rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/> Some system may choke on the LSID form assuming that it uses a standard resolution mechanism So it might be best to use this form <dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID> - Pete ---------------------------------------------------------------- Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base<http://www.taxonconcept.org/> / GeoSpecies Knowledge Base<http://lod.geospecies.org/> About the GeoSpecies Knowledge Base<http://about.geospecies.org/> ------------------------------------------------------------ ________________________________ Please consider the environment before printing this email Warning: This electronic message together with any attachments is confidential. If you receive it in error: (i) you must not read, use, disclose, copy or retain it; (ii) please contact the sender immediately by reply email and then delete the emails. The views expressed in this email may not be those of Landcare Research New Zealand Limited. 
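The question raised in the quoted message ("is it an LSID, is it a URI, is it a UUID?") can be made concrete with a small sketch. It sorts a value by its form and maps it onto the three field names from Kevin's proposal above. The function names, the kind labels, and the mapping table are illustrative assumptions, not part of Darwin Core or any TDWG standard.

```python
import uuid
from urllib.parse import urlparse

# Field names follow the three-field proposal in the message above.
# The mapping is an assumption made for illustration only.
FIELD_FOR_KIND = {
    "string": "hasScientificName",
    "http-uri": "hasScientificNameID",
    "lsid": "hasScientificNameURN",
    "other-urn": "hasScientificNameURN",
}


def identifier_kind(value: str) -> str:
    """Classify a value by its form alone, without any network access."""
    text = value.strip()
    scheme = urlparse(text).scheme.lower()
    if scheme in ("http", "https"):
        return "http-uri"
    if scheme == "urn":
        # LSIDs are URNs whose namespace identifier is "lsid".
        return "lsid" if text.lower().startswith("urn:lsid:") else "other-urn"
    try:
        uuid.UUID(text)  # raises ValueError unless the text is a UUID
        return "uuid"
    except ValueError:
        return "string"


def field_for(value: str):
    """Return the proposed field name, or None for a bare UUID,
    which the three-field proposal does not cover."""
    return FIELD_FOR_KIND.get(identifier_kind(value))
```

A producer could call `field_for` on each value before serialising a record, so that a consumer never has to guess how a generic "ID" field should be interpreted.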

Hilmar Lapp

16:03

On Oct 5, 2010, at 5:46 PM, Kevin Richards wrote:

...

hasScientificNameID = http resolvable URI
hasScientificNameURN = LSID or other URN

If you define the convention for the value for hasScientificNameID to be a http-resolvable URI, then that obviates the need for the second, doesn't it? LSIDs, as much as Handles, DOIs and other non-HTTP URI identifiers either have already or will have to have proxy forms that make them resolvable HTTP URIs, or otherwise I would argue there is little use for them in the standard.

What is the use-case that requires the LSID, but not other non-HTTP identifier schemes, to be in a separate element, if there is already an element that can be expected to have a resolvable HTTP URI? What is special about LSIDs in this respect?

-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
===========================================================
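The "proxy form" mentioned above usually means prefixing the LSID with the base URL of an HTTP resolver. A minimal sketch of that rewriting follows. The proxy base `http://lsid-proxy.example.org/` is a placeholder invented for this example and is not a real service, and the pattern only checks the general `urn:lsid:authority:namespace:object[:revision]` shape rather than validating the LSID fully.

```python
import re

# Hypothetical proxy base, used for illustration only (not a real service).
PROXY_BASE = "http://lsid-proxy.example.org/"

# General LSID shape: urn:lsid:authority:namespace:object[:revision]
LSID_PATTERN = re.compile(
    r"^urn:lsid:[^:\s]+:[^:\s]+:[^:\s]+(:[^:\s]+)?$", re.IGNORECASE
)


def lsid_to_http(lsid: str, proxy_base: str = PROXY_BASE) -> str:
    """Return an HTTP URI for an LSID by prefixing an HTTP proxy base."""
    text = lsid.strip()
    if not LSID_PATTERN.match(text):
        raise ValueError(f"not an LSID: {lsid!r}")
    return proxy_base + text
```

With a rewriting step like this, a record could carry only the HTTP form in hasScientificNameID, which is the situation Hilmar describes; whether a separate LSID field is still needed is the open question in the thread.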

Peter DeVries

23:53

New subject: Idea for Discussion, Differentiating between "type's" of identifiers

There were a few goals in mind when I thought of this pattern.

1) It makes it clear to processing services how to resolve the entity
2) It makes it more clear to data submitters what should go in that field
3) It allows end users who want to use the LSID system to do so without breaking non-LSID-aware systems.

The use of LSID makes it clear that this should be resolved using an LSID resolution system. If we used URN it would still not be clear what "kind" of URN.

Also remember that just because there is a defined field does not mean you are obligated to use it.

We could also define a field *hasScientificNameURN* for non-resolvable URNs.

What I would like to avoid are cases where consuming agents, including crawlers, generate an error when they try to resolve an LSID by assuming that it will work like proper semantic web URIs. In some cases, these systems stop processing when they encounter these errors.

- Pete

On Tue, Oct 5, 2010 at 6:03 PM, Hilmar Lapp <hlapp@nescent.org> wrote:

...

On Oct 5, 2010, at 5:46 PM, Kevin Richards wrote:

hasScientificNameID = http resolvable URI

...
hasScientificNameURN = LSID or other URN

If you define the convention for the value for hasScientificNameID to be a http-resolvable URI, then that obviates the need for the second, doesn't it? LSIDs, as much as Handles, DOIs and other non-HTTP URI identifiers either have already or will have to have proxy forms that make them resolvable HTTP URIs, or otherwise I would argue there is little use for them in the standard.

What is the use-case that requires the LSID, but not other non-HTTP identifier schemes, to be in a separate element, if there is already an element that can be expected to have a resolvable HTTP URI? What is special about LSIDs in this respect?

-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org : ===========================================================
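The failure mode Pete describes, a crawler that stops when it tries to dereference an LSID as if it were an HTTP URI, can be avoided on the consuming side as well. The sketch below shows one tolerant loop: non-HTTP identifiers are set aside instead of fetched, and a failed fetch is recorded without aborting the run. The function name `crawl` and the `fetch` callable are assumptions for this example; `fetch` stands in for whatever HTTP client the consuming system actually uses.

```python
from urllib.parse import urlparse


def crawl(identifiers, fetch):
    """Dereference only HTTP(S) identifiers and keep going on errors.

    `fetch` is any callable that takes a URI and returns its content;
    it is supplied by the caller, so no network library is assumed here.
    """
    results, skipped, failed = {}, [], []
    for ident in identifiers:
        if urlparse(ident).scheme.lower() not in ("http", "https"):
            # e.g. an LSID or other URN: needs its own resolver, so set it aside
            skipped.append(ident)
            continue
        try:
            results[ident] = fetch(ident)
        except Exception as exc:
            # Record the failure and continue with the next identifier.
            failed.append((ident, str(exc)))
    return results, skipped, failed
```

The skipped list could then be handed to an LSID-aware resolver, which keeps goal 3 above intact without requiring every consumer to understand LSIDs.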

25 comments

13 participants

participants (13)

Arlin Stoltzfus
Blum, Stan
Bob Morris
Cam Webb
Gregor Hagedorn
Hilmar Lapp
Kevin Richards
Markus Döring
Peter DeVries
Roger Hyam
Rutger Vos
Steve Baskauf
Thomas Bandholtz