[tdwg-content] [tdwg-tag] Idea for Discussion, Differentiating between "type's" of identifiers

Fri Oct 8 14:07:27 CEST 2010

 Hi all,

Having just joined this list, I think it is a great idea to have such an
RDF Task Group.
I am going to publish a little species catalog for the Federal
Environment Agency in Germany as Linked Data, and I am looking for the
best way to express it in RDF.

In the Linked Data cloud [1] I find several related contributions, such
as Geospecies, TaxonConcept, EUNIS, and more.
Comparing these approaches I prefer the idea of reusing SKOS [2] labels
and hierarchical relations, as in the Geospecies example [3].

It might be a good idea to apply the SKOS XL extension as well to go
deeper into the taxon name properties.
Finally, I would add the taxon ranks as a distinct concept scheme and
link them to the taxon concepts with a mapping relation.
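
To make this a bit more concrete, here is a minimal sketch of the pattern
I have in mind, written as plain Python that prints N-Triples-like lines.
Please read it as an illustration only: the example.org namespaces, the
concept identifiers, and the choice of skos:relatedMatch as the mapping
relation to the rank scheme are my assumptions, not an agreed vocabulary,
and "Puma" / "Puma concolor" are just sample data.

```python
# Sketch only: the example.org namespaces and all identifiers are hypothetical.
SKOS = "http://www.w3.org/2004/02/skos/core#"
SKOSXL = "http://www.w3.org/2008/05/skos-xl#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/species/"   # hypothetical catalog namespace
RANK = "http://example.org/rank/"    # hypothetical rank concept scheme


def taxon_triples(taxon_id, name, rank_id, parent_id=None):
    """Return (subject, predicate, object) triples for one taxon concept.

    URIs are plain strings; literals are wrapped in double quotes.
    """
    concept = EX + taxon_id
    label = EX + taxon_id + "/label"
    triples = [
        (concept, RDF_TYPE, SKOS + "Concept"),
        (concept, SKOS + "prefLabel", '"%s"' % name),
        # SKOS XL: the name becomes a resource that can carry more properties
        (concept, SKOSXL + "prefLabel", label),
        (label, RDF_TYPE, SKOSXL + "Label"),
        (label, SKOSXL + "literalForm", '"%s"' % name),
        # Rank lives in its own concept scheme; relatedMatch is an assumption
        (concept, SKOS + "relatedMatch", RANK + rank_id),
    ]
    if parent_id is not None:
        # Hierarchy is expressed with the plain SKOS broader relation
        triples.append((concept, SKOS + "broader", EX + parent_id))
    return triples


def to_ntriples(triples):
    """Serialise triples as N-Triples-like lines (no escaping, sketch only)."""
    lines = []
    for s, p, o in triples:
        obj = o if o.startswith('"') else "<%s>" % o
        lines.append("<%s> <%s> %s ." % (s, p, obj))
    return "\n".join(lines)


if __name__ == "__main__":
    data = taxon_triples("puma", "Puma", "genus")
    data += taxon_triples("puma-concolor", "Puma concolor", "species",
                          parent_id="puma")
    print(to_ntriples(data))
```

A real catalog would use an RDF library and proper literal escaping; the
point here is only which SKOS and SKOS XL properties carry the label, the
hierarchy, and the rank.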

Certainly this is not the only way to go, but it is rather simple and
will be easily understood as SKOS is quite common in the Linked Data
community. This might be called "Simple Darwin Core" and give room for
more complex ontology approaches beyond that.

Looking forward to discussion,
Thomas

[1] http://lod-cloud.net
[2] http://www.w3.org/TR/2009/NOTE-skos-primer-20090818/
[3] http://lod.geospecies.org/ses/v6n7p.rdf

Peter J. DeVries

On 07.10.2010 19:29, Blum, Stan wrote:
> Hi Steve,
>
> Sorry, I missed your message below (as well as your response to Roger)
> before I sent my reply about the utility of an RDF guide for DwC.
>  Obviously, I think it’s a great idea.  To do this within the “normal”
> TDWG process, this should be done as a Task Group.  I could help you
> draft a charter for that, which would then need to be reviewed by the
> TAG and Exec.  Once approved, we would put the charter up on the web
> site, and do our best to provide any other resources that would help
> speed the task.  I don’t mean to slow you down.  The Charter doesn’t
> have to be elaborate.  Its function is to let others in TDWG and
> beyond know that this task is proceeding, who to contact, how to get
> involved, etc.  It also gives you the backing of the TDWG community.
>
> Let me know if you’d like to pursue this.
>
> -Stan
>
>
> On 10/7/10 7:41 AM, "Steve Baskauf" <steve.baskauf at vanderbilt.edu> wrote:
>
>     I agree that it is best to avoid a proliferation of terms and I
>     agree that it is best to keep Darwin Core technology independent
>     to the maximum extent possible.  However, I think that the case of
>     facilitating HTTP URIs is a special one because of the
>     requirements of GUIDs/Persistent Identifiers.  Both the TDWG and
>     GBIF guidelines such as they currently stand say that GUIDs must
>     be resolvable, that in their resolution they must return RDF, and
>     that the RDF has to be in an XML format.  Like it or not, that is
>     what we have.  Given the amount of time that it seems to have
>     taken to settle on that much, I think it is best for us to decide
>     to live with it, warts and all, rather than re-opening the
>     discussion and delaying the implementation of GUIDs for another
>     five years.  
>
>     Given that assumption, there needs to be within Darwin Core some
>     way to support this particular "technology" (Linked Data, RDF/XML)
>     even if we don't do "special" things to support other technologies
>     such as LSID, DOI, etc.  The point is well taken that most of
>     those other technologies have mechanisms for turning their
>     identifiers into URIs and the aforementioned guidelines lay out
>     how owl:sameAs can be used within the RDF to associate the
>     non-HTTP-resolvable forms with the URIs.  Based on my admittedly
>     limited experience with trying to write RDF using Darwin Core
>     terms, I think that in most cases there already exists appropriate
>     terms for getting the job done.  What may be lacking is concrete
>     examples and community consensus on what terms to use for what.  I
>     also think that there are probably some "ID" terms where it isn't
>     really very important (from an RDF point of view) that there exist
>     both a URI form and a text string form.  I'm thinking of something
>     like dwc:identificationID, which is most likely to be needed to
>     allow a machine to make a connection between some resource and its
>     identification.  The machine isn't going to care if there is a
>     human-readable version.  In contrast, something like
>     dwc:collectionID is likely to need both a URI version (e.g.
>     proxied version of the BCI LSID) for the machines and a string
>     version (the name of the collection as it would be displayed) for
>     humans.  I think that trying to make example/template RDF for
>     various types of resources will help make it clear in which cases
>     one version (URI), the other (string), or both are actually necessary.
>
>     I "volunteered" a couple weeks ago to have a go at writing an RDF
>     guide for Darwin Core.  I am still willing to do this, although
>     I'm still getting caught up at work from being at the TDWG
>     meeting.  However, next week we have fall break and I will make it
>     a priority to come up with a draft which can be the subject of
>     discussion.  As a part of this process, I think it would be good
>     to create one or more "boilerplate" RDF files for the various
>     kinds of resources that are likely to be identified with GUIDs
>     (e.g. Occurrences, Taxa, etc.).  This can also be a subject of
>     discussion and I think it will help to clarify what will meet the
>     actual needs that we have discussed in this thread.  I have a
>     pretty clear picture of what I think Occurrence RDF should look
>     like. I'm going to have to depend on Pete and others to deal with
>     the taxonomy part.
>
>     Steve
>
>     Markus Döring wrote:
>
>
>         Steve, Pete,
>
>         I'd like to draw your attention to a basic Darwin Core design
>         pattern. DwC has the goal of being technology independent by
>         simply providing a list of abstract terms one can use in
>         various arenas such as XML, RDF, XHTML, CSV etc. And even
>         within those there might be various ways of using them (e.g.
>         we have a normalised and a simple flat XML schema); that's why
>         we should have a guideline for each of them on how to use
>         them. We are currently missing such a guideline for RDF, hence
>         this debate.
>
>         Whether scientificName is a literal string or some complex
>         object shouldn't matter - it's defined to be a scientific name.
>         Such a DwC RDF property could hold either a literal string or
>         a URL to some name rdf:resource (potentially with an rdfs:label).
>
>         With the introduction of many ID terms we have already diluted
>         that idea a little, in my mind. We could just as well have used
>         scientificName in XML to hold some identifier for that name.
>         All URNs tell you what they are by their URN prefix (not
>         necessarily how to resolve them), so you can easily detect a
>         UUID, LSID, http(s) URL, FTP, DOI and apply the conventional
>         resolution mechanism. The hardest problem is the local IDs
>         and other plain identifiers. It was mainly for those that we
>         created the ID terms (at least in my mind). I am feeling rather
>         uncomfortable discussing the introduction of specific DwC terms
>         for each type of ID. Maybe we should remove all ID terms in DwC
>         and use the specific guidelines to specify these? At least, if
>         you really think having all those ID terms for RDF is a good
>         thing, I would feel much more comfortable going down this route
>         instead of diluting DwC by adding more and more rather
>         redundant terms. The abstract concept is key to a DwC term,
>         not the actual data type forced by the technology you are
>         using it with. Would you want several date terms for various
>         date formats? In fact we do that already to some degree
>         (eventDate, eventTime, year, month, day, verbatimEventDate)
>         and I always felt this is not a good idea. There are also a
>         number of verbatimXXX terms in DwC which also contradict this
>         pattern.
>
>         Talking about new DwC terms - in the examples given, a property
>         like "hasScientificName" is not strictly the correct DwC term,
>         which is simply scientificName. I think it would be fine to
>         have the convention in the RDF guidelines to use hasDwcTerm
>         instead of dwcTerm; this is exactly what an RDF guideline is
>         for. On the flip side, I am sure this only applies to some
>         terms; recordedBy, for example, is likely to remain as it is.
>         It's unclear to me what is really best to do. Always stick to
>         the original DwC terms? Refine them through some RDFS or OWL
>         schema and define the relation to the original term? Should we
>         still use the same namespace in this case?
>
>         As an RDF beginner, even after a few years of exposure, I wonder
>         if we can't simply stick to the non-ID terms and use them either
>         as literals or with a URI pointer. As in the RDF world a
>         resolvable HTTP URI is really required for resource relations
>         to work, why not simply mandate this in the guidelines? If you
>         only happen to have non-resolvable URIs like LSIDs or DOIs, the
>         guidelines should ask you to use proxied versions,
>         knowing it will break RDF frameworks and LOD conventions
>         otherwise. On the resolving side one could always include such
>         URNs with owl:sameAs (or something similar), I believe. But how
>         many non-resolvable IDs with no matching HTTP counterpart are
>         really out there yet?
>
>         - Markus
>
>
>         On Oct 6, 2010, at 9:02, Peter DeVries wrote:
>
>
>             Hi Steve,
>
>             You are probably right that it might be best to use
>             rdfs:label, but I am thinking we might be able to get the same
>             result by defining the string variants as subproperties of
>             rdfs:label.
>
>             This would make them an rdfs:label, but a special kind of
>             rdfs:label.
>
>             This is one of those things that I would test with Sindice
>             and URIburner to see if they interpret these correctly.
>
>             This would require a live vocabulary that Sindice could
>             look at to determine that hasScientificName is to be
>             treated as an rdfs:label.
>
>             - Pete
>
>             On Mon, Oct 4, 2010 at 10:41 AM, Steve Baskauf
>             <steve.baskauf at vanderbilt.edu> wrote:
>             Although this specific example deals with taxonomic name
>             identifiers, it is related to a previous discussion on
>             this list about how we should use the dwc:xxxxxID terms
>             and other terms (such as recordedBy and identifiedBy) that
>             could have either a string (literal) or URI form.
>             Although I don't really want to see an unnecessary
>             proliferation of Darwin Core terms, I think that in the
>             interest of clarity (particularly where RDF is involved)
>             there either should be multiple terms that make it clear
>             what form of identifier is expected, or else there should
>             be an understanding that in RDF the default for such a
>             term is a URI which would then have an rdfs:label property
>             holding the string form.  I think the former would be
>             preferable to the latter.
>
>             I came to this opinion when trying to write RDF describing
>             an herbarium specimen.  The collector should be the
>             dwc:recordedBy property of the specimen.  Optimally, there
>             would be a database in which known collectors were
>             assigned URIs so that "Glen N. Montz", "Glen Montz", "G.
>             N. Montz", etc. would all be different labels for the same
>             resource.  However, realistically, I'm not going to drop
>             what I'm doing to set up such a database (even if I were
>             capable of doing it, which I'm not).  So I ended up just
>             writing it as <dwc:recordedBy>Glen N.
>             Montz</dwc:recordedBy> even though I knew it probably
>             wasn't the best thing.  In a large Occurrence database
>             that was compiled from the RDF created by a lot of people,
>             there might end up being a mixture of strings and URIs for
>             dwc:recordedBy properties of the specimens.  It seems to
>             me like it would be better to have properties like
>             dwc:recordedBy for strings and dwc:recordedByURI for a
>             corresponding URI (and I suppose dwc:recordedByLSID if
>             anyone wants to use it).  Of course, this
>             would require a number of term additions to DwC and
>             clarification in the DwC documentation that the generic
>             version was intended for strings.
>
>             With respect to the example
>
>             <dwc:hasScientificNameLSID
>             rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/>
>             I think you are right that (with the possible exception of
>             rdfs:seeAlso) there is an expectation that an rdf:resource
>             attribute will be a resolvable URI that produces RDF.  So
>             <dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID>
>             is probably better.
>
>             Steve
>
>
>             Peter DeVries wrote:
>
>                 I have been thinking about the following pattern. In
>                 part after looking at the GBIF vocabulary.
>
>                 I am not sure if it is even a good idea but might be
>                 worth some discussion.
>
>                 For those fields that have both a string and "ID" form
>                 maybe the following pattern might be useful
>
>                 hasScientificName = string form
>                 hasScientificNameURI = Resolvable LOD compliant identifier
>                 hasScientificNameLSID = LSID identifier which could be
>                 resolvable once you add the "http:proxy" etc.
>
>                 This allows all three forms to be included if desired,
>                 it also provides a hint as to how the field should be
>                 interpreted or resolved.
>
>                 One group could also provide a mapping service so that
>                 each record does not need to include all three forms,
>                 but which would allow systems to find the matching LSID
>                 for a given URI or vice versa.
>
>                 My concern was that it would be difficult to infer how
>                 a scientificNameID should be interpreted by other systems.
>
>                 Is this an LSID, is it a URI, is it a UUID etc.?
>
>                 This impacts the structure of the RDF.
>
>                 * Note that the actual identifiers might not be
>                 correct, the example below is more about the form of
>                 the RDF
>                 * For instance, I don't think it is correct
>                 to see the COL LSID as just a namestring
>                 * Also in this example the GNI name does not exactly
>                 match the string name
>
>                 <dwc:hasScientificName>Puma concolor (Linnaeus
>                 1771)</dwc:hasScientificName>
>                 <dwc:hasScientificNameURI
>                 rdf:resource="http://gni.globalnames.org/name_strings/6c3dc35f-d901-5cc5-b9c8-ad241069b9f8"/>
>                 <dwc:hasScientificNameLSID
>                 rdf:resource="urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010"/>
>
>                 Some systems may choke on the LSID form, assuming that
>                 it uses a standard resolution mechanism.
>
>                 So it might be best to use this form
>
>                 <dwc:hasScientificNameLSID>urn:lsid:catalogueoflife.org:taxon:24e7d624-60a7-102d-be47-00304854f810:ac2010</dwc:hasScientificNameLSID>
>
>                 - Pete
>
>                 ----------------------------------------------------------------
>                 Pete DeVries
>                 Department of Entomology
>                 University of Wisconsin - Madison
>                 445 Russell Laboratories
>                 1630 Linden Drive
>                 Madison, WI 53706
>                 TaxonConcept Knowledge Base / GeoSpecies Knowledge Base
>                 About the GeoSpecies Knowledge Base
>                 ------------------------------------------------------------

-- 
Thomas Bandholtz, thomas.bandholtz at innoq.com, http://www.innoq.com 
innoQ Deutschland GmbH, Halskestr. 17, D-40880 Ratingen, Germany
Phone: +49 228 9288490 Mobile: +49 178 4049387 Fax: +49 228 9288491
