November 2010 - tdwg-content

Background for the Individual class proposal. 2. Classes and types
by Steve Baskauf 13 Nov '10

I'm going to start this post with two comments about RDF. I think some people think they have a phobia of RDF (I know that I did at first). What I really think is that they have a phobia of RDF represented as XML or RDF represented in N3 notation. This point has been made before: RDF is a system for describing properties of and relationships among resources (i.e. things that can be assigned identifiers) but it does not have only one particular way that these properties and relationships must be specified. It is perfectly correct to represent RDF entirely in pictures (i.e. as an RDF "graph", see http://www.w3.org/TR/rdf-primer/ and ignore all of the text - just look at the figures). RDF graph notation wouldn't be of much use to a computer, but that graph could easily be translated into one of the other notations (XML or N3) and then a computer would understand it perfectly. Since RDF is something that is specifically designed to represent relationships among classes of resources, it is the perfect thing to clearly lay out what we mean when we have a discussion of the sort that we are having here. One of the reasons why I am so keen to make diagrams of the sort I posted in the first message in this series is because once you have the diagram, it is a relatively simple matter to change the shapes of the boxes and add arrows instead of triangles or lines with crow's feet and voila! you have an RDF graph. It then becomes an academic exercise to have an RDF model in XML or whatever format you like. I am of the opinion that we are actually pretty close to a consensus about what the diagram should be, which means that we are also pretty close to a simple RDF model for Darwin Core. The other comment about RDF is that we need to work out a basic model now. Partly this is because there are already several people who have been contributing to this discussion who are already writing RDF or who intend to do so in the near future. 
If we have any delusions about doing even the simplest kind of machine reasoning in the future, we all need to be using the same basic diagram (i.e. model). The other reason why we need to work this out now is that if we don't, we will impede the process of utilizing GUIDs/Persistent Identifiers. The draft TDWG GUID Applicability Statement (http://www.tdwg.org/stdtrack/article/download/150/51, recommendation 10) says clearly that a proper GUID should be able to be dereferenced to provide an RDF/XML representation (did I use "dereferenced" right, Bob?). If we don't agree on how to represent in RDF the classes of resources that are of interest to the DwC community, then we are setting up a situation where TDWG makes a recommendation (on how GUIDs are implemented) that is impossible for people to follow. For this reason, I believe that it is best to settle on a basic model now rather than at an indefinite point in the future.

Having given this rationale, I'm going to talk about how we look at classes and types in Darwin Core and how the need for an RDF representation of DwC should influence our view on this topic. In Darwin Core as it stands (see the "Audience" section of http://rs.tdwg.org/dwc/terms/index.htm), classes are simply categories that group terms that describe instances of the class. The description specifically states that the terms are intended to be properties of the class (i.e. properties of instances of the class). When DwC terms are used as column headings in a database table, there isn't any "rule" that says that one must specify the type of thing to which that term applies. On the other hand, I think that it is considered a Bad Thing in RDF to apply properties to a resource having an unspecified type. It's not impossible to do so, but specifying the rdf:type of a resource is one of the most fundamental things that one does in creating a description of the resource.
This is recognized in the TDWG GUID Applicability Statement (recommendation 11), which says that objects identified by GUIDs should be typed using a well-known vocabulary. One "well-known vocabulary" is the Darwin Core Type Vocabulary (http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm). There isn't any formal relationship between the Darwin Core classes in the dwc: (http://rs.tdwg.org/dwc/terms/) namespace and the types in the dwctype: (http://rs.tdwg.org/dwc/dwctype/) namespace. We could use the dwctypes to describe resources that we want to say are instances of dwc: classes (and meet the GUID guidelines), but that would raise problems that I will get into later. The point is that as Darwin Core is currently set up, there isn't a formal relationship between the dwc: classes that are used to group the terms and the dwctype: types that could be used to rdf:type them. As it is described, the dwctype vocabulary is simply stated to be used as values for basisOfRecord and, as I pointed out in the previous post, basisOfRecord only really works when Occurrences are limited to having a single token.

In RDF, the relationship between classes and types is different from the way it currently stands in Darwin Core. RDF classes and types are tied together by definition (http://www.w3.org/TR/2004/REC-rdf-schema-20040210/#ch_type). If you assert that a resource has an rdf:type of X, you are simultaneously asserting that the resource is an instance of class X. The relationship between a class in RDF and the declaration of rdf:type is so entwined that naming an XML container element after the class of which the resource is an instance is identical to an explicit declaration of type.
The following two examples produce exactly the same result if you paste them into an RDF validator like http://www.w3.org/RDF/Validator/ :

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
    <rdf:type rdf:resource="http://rs.tdwg.org/dwc/terms/Occurrence"/>
  </rdf:Description>
</rdf:RDF>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
  <dwc:Occurrence rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
  </dwc:Occurrence>
</rdf:RDF>

Even though there is no explicit declaration of rdf:type in the text of the second example (i.e. the dwc:Occurrence container element is empty), the validator treats the code as if a type property were stated explicitly - you can see that the resulting triple and graph created by the validator show the RDF as having made an explicit declaration of rdf:type=dwc:Occurrence.

So my point is that to enable people to follow the TDWG GUID recommendations and provide RDF that tells people the type of the resource, TDWG bears a responsibility to provide GUID users with terms that are suitable for use as an rdf:type property for every class of resource to which we can reasonably be expected to want to assign a GUID. In my book, that's every box shown on the summary diagram http://bioimages.vanderbilt.edu/pages/full-model.jpg except for tokens (and excluding Time if we agree that we will always denormalize it out of existence as a class). I exclude tokens as a group because they are not a single class of resource. Any type of resource that provides evidence that an Occurrence happened can be a token. In some cases (such as images and sounds) those types are already defined in Dublin Core. Darwin Core would only need to define types for things that aren't defined elsewhere, such as the Collecting Units in the ASC model (but this is the topic of the third installment).
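The equivalence of the two RDF/XML examples discussed above can also be checked mechanically. What follows is only a minimal sketch, not a real RDF parser: it assumes documents of exactly the shape used in this post (node elements directly under rdf:RDF, each with an rdf:about attribute) and uses only the Python standard library. The function name type_triples is made up for this illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_TYPE = RDF_NS + "type"

# First example from the post: the type is stated with an explicit rdf:type property.
EXPLICIT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
    <rdf:type rdf:resource="http://rs.tdwg.org/dwc/terms/Occurrence"/>
  </rdf:Description>
</rdf:RDF>"""

# Second example from the post: the container element is named after the class.
TYPED_NODE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
  <dwc:Occurrence rdf:about="http://bioimages.vanderbilt.edu/baskauf/10692#occ">
  </dwc:Occurrence>
</rdf:RDF>"""


def expand(tag):
    """Turn ElementTree's '{namespace}local' tag into a full IRI."""
    namespace, local = tag[1:].split("}")
    return namespace + local


def type_triples(rdf_xml):
    """Collect (subject, rdf:type, class) triples from a small RDF/XML document.

    Assumption: only the two patterns shown in this post are handled.
    """
    triples = set()
    for node in ET.fromstring(rdf_xml):
        subject = node.get("{%s}about" % RDF_NS)
        node_iri = expand(node.tag)
        # A container element other than rdf:Description implies an rdf:type triple.
        if node_iri != RDF_NS + "Description":
            triples.add((subject, RDF_TYPE, node_iri))
        # An explicit rdf:type property element states the same thing directly.
        for prop in node:
            if expand(prop.tag) == RDF_TYPE:
                triples.add((subject, RDF_TYPE, prop.get("{%s}resource" % RDF_NS)))
    return triples


print(sorted(type_triples(EXPLICIT)))
print(type_triples(EXPLICIT) == type_triples(TYPED_NODE))
```

Both documents reduce to the same single triple, which is the behavior described for the validator. For real data a full RDF library would be the safer choice, since RDF/XML has many more abbreviations than the two handled here.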
One way to do this (and the way that I favor) is to make sure that there is a Darwin Core class for every category of resource for which one would reasonably expect to assign a GUID. Referring to the full model diagram, the only categories that don't have classes at the moment are Individual (which I have proposed to add), Time (which may or may not be necessary), and Collecting Unit (again, more on this in the final installment). The first category could be created by voting to accept my proposal about the class Individual. The last would require a new recommendation, but I think that Rich has pretty much suggested that this should happen when he says that there are a lot of terms in the Occurrence category that don't belong there (i.e. they belong with Collecting Units). So it would make sense from the point of view of a more logical organization of terms to do this anyway. As Bob has pointed out, in RDF making a declaration of rdf:type=X is the same thing as declaring that class X exists. So why not make the rdf:types BE the Darwin Core classes so we will be declaring something that actually does already exist instead of conjuring up virtual classes from types that we make up? There have been some people who have questioned my proposal for adding Individual as a DwC class on the basis that it is not clear that anybody "needs" it. What I am stating here is that anybody who plans to write RDF following recommendations based on a fully normalized Darwin Core RDF model (which should be EVERYONE who writes RDF using Darwin Core!) "needs" all of the classes that connect resources they plan to describe. That means that anybody who plans to connect Occurrence metadata to Identifications should be doing it in their RDF through named instances of the dwc:Individual class. Another alternative would be to fix the dwctype vocabulary, but that would be messier. 
The dwctype vocabulary is designated as the controlled vocabulary for basisOfRecord, so it is hard to change it without breaking basisOfRecord. The other problem, as was noted earlier on the list, is that certain types in the DwC type vocabulary are currently declared as subClasses of other types, and these declarations will cause unintentional assertions that don't make sense in the context of the general model that we've been discussing (namely that every PhysicalSpecimen is an Occurrence, which is also an Event). It seems to me that there is more "fixing" required here than is worth the effort, given that we can just use the classes as the rdf:types as I described in the previous paragraph.

The final alternative would be to make the TDWG Ontology functional and use it to type resources. Although there has been some recent discussion on the list about working on the Ontology, at the present moment there isn't a clear plan or timeline to finish it. Telling people to wait for something that may never happen is not an acceptable alternative to me.

I think it is clear that there are multiple people and institutions that are either ready to write RDF in support of GUIDs or are already doing it now. Six months is about the longest timeframe that I think is reasonable for coming up with a solution to the typing problem discussed above and for having some kind of basic guidelines for the structuring of RDF. A general model based on the existing Darwin Core classes is the only path forward that I can see as feasible in that time frame, and a general model could always be built into a more sophisticated model (i.e. the Ontology) at leisure if anyone cared to take the time. If TDWG doesn't get its act together on a six-month to one-year time scale, people will simply give up and write Darwin Core-based RDF without any TDWG guidelines. It has been suggested that a Task Group be formed to draft a DwC RDF Guide.
I would be very keen to see that happen and would be willing to be involved in the process, but I'm not interested in it if the process doesn't start out with some version of the consensus model we've discussed here and with some quick decision from the TAG about how to handle the rdf:typing problem. Without those two things, there would just be endless unproductive debate about how to go about building the model from scratch, and I've got better things to do than that.

I will end this with one final comment about the proposed Individual class in this context. I have stated clearly in several earlier posts that I don't think that the Individual class really has many properties and that it functions primarily as a named node to facilitate one-to-many relationships with other classes. This may strike some people as odd, given that the primary purpose of classes in the existing Darwin Core seems to be to group similar terms that can act as properties for the class. What became apparent to me when I was creating the diagrams for the first post was that if the Time terms are pulled out of the Event class (as they probably should be in a fully normalized model) and the "Collecting Unit" terms are pulled out of the Occurrence class (as I think must happen if we separate tokens from Occurrences), there are also very few property terms left in the Event and Occurrence classes. Most of the terms that remain are "housekeeping" ones used for remarks, or to make note of the person who documented the instance and when. Most of the terms that actually describe measurable properties are found in the peripheral classes like Location, Time, and Collecting Unit. Just as in the case of the proposed Individual class, the Event and Occurrence classes are primarily named nodes that connect other classes. The only reason they have very many terms at present is that we have some of the terms in the "wrong" place for a fully normalized model.
I think that it is also no coincidence that these three classes (Event, Occurrence, and Individual) are also the three that we have had the most trouble defining. I think that's precisely because they have very few properties of their own. They do roughly correspond to things for which we have conceptual images, which is why we are able to come up with meaningful names for them. But as I have argued, it is better to define them according to what we want them to DO rather than by our mental image of them. And that is a lead-in to the third and final post.

--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu

Background for the Individual class proposal. 1. Denormalization of models and correspondence to the ASC model
by Steve Baskauf 13 Nov '10

This is part 1 of three messages that attempt to summarize the issues that we have been discussing over the last month and to suggest a solution and a way forward. If you zone out when you get emails longer than three lines, please erase the messages and go on with your life. Unfortunately this is a complicated topic and I'm trying to lay out the issues in the simplest and most straightforward way that I can. The first email (this one) describes how a fully normalized model of Darwin Core can arise from modifying the ASC model to meet articulated needs of the Darwin Core constituency. The second email will describe why we need to come to a consensus on this and the criteria that I think should be considered before a decision is reached. The third email discusses the issue that Rich has raised as to whether the proposed Individual class should have a rather narrow scope (as I have advocated) or if it should be broadened to include other functions. I have separated this material into three emails because they are really separate but related issues and may each spawn threads relating to the particular issue. ---------------------------------------------------------------------- To try to get a better understanding of the issues we have been discussing, I went back to the Association of Systematics Collections (ASC) information that Stan posted at http://wiki.tdwg.org/twiki/bin/view/TAG/HistoricalDocuments - in particular, the chart http://wiki.tdwg.org/twiki/bin/viewfile/TAG/HistoricalDocuments?rev=1;filen… I have cut out a section of that chart that will fit on one screen and have created several images that have various models involving Darwin Core classes pasted at the top. Each subsequent Darwin Core class model is more normalized than the previous one. Below each model I show how that denormalization maps to the ASC model. 
The first diagram is the ASC model itself: http://bioimages.vanderbilt.edu/pages/asc-model.jpg

There are several differences in names between ASC and DwC. dwc:Location corresponds to Locality in ASC, dwc:Event corresponds to Collecting Event in ASC, dwc:Identification corresponds to Determination in ASC, and Collecting Unit in ASC corresponds to a subset of what I have been calling the "token" (evidence) that is limited to organisms, their pieces, and their conglomerations. One may quibble about exact correspondence, but I think that fundamentally those things are congruent. In the ASC model, the lines with crow's feet correspond to one-to-many relationships, with the foot at the "many" end. In my diagram a triangle does the same thing, with the point of the triangle representing the "one" end. As you can see, the subset of the ASC model shown here can be summarized in simplified form using DwC classes (excluding for the time being the parts of the model that fall into the DwC Taxon class). The ASC model reflects the "museum" perspective: in many or most cases the whole organism is collected, or if only part of the organism is collected (e.g. a tree branch) the organism is rarely re-visited for additional collections. So this model is denormalized (flattened) to the extent that it doesn't allow for multiple types of tokens per organism or for resampling of the organism over time.

The second diagram represents Darwin Core at the time it became a standard in 2009: http://bioimages.vanderbilt.edu/pages/darwin-core-model.jpg

The difference from the previous diagram is the creation of the Occurrence class. This class recognizes the needs of the observation community because it allows one to connect Events to Determinations directly without forcing them to be associated with a physical object (token). This modification was beneficial because terms describing the act of documenting the presence of a taxon during an Event are shared between observations and specimen collection.
This model presupposes that there is no more than one token per Occurrence. dwc:basisOfRecord is used to describe the nature of that one token. Terms for handling tokens other than specimens are not well developed. The third diagram is a slight modification of the second and is what I've been calling the "explicit token" model: http://bioimages.vanderbilt.edu/pages/dwc-explicit-token-model.jpg The only difference between it and the previous model is that there is now recognition that the token is a separate thing from the Occurrence. Types of tokens other than specimens (such as images and sounds) are recognized explicitly as means of documenting Occurrences. The lines connecting Occurrence to tokens have "crow's feet" on the token side, allowing that there may be one to many tokens that act as evidence for a single Occurrence. When I complain that basisOfRecord "doesn't work", it is with this model in mind. In this model, there is not one single "basis" (token) for a record - under this model there would need to be the possibility to have multiple basisOfRecord values for an Occurrence, which I don't really think is supported currently in DwC. The fourth diagram, which I call the "full model" adds one more component to the explicit token model: http://bioimages.vanderbilt.edu/pages/full-model.jpg This model is what I consider to be the fully normalized version of Darwin Core (excluding the Taxon parts). This model introduces the Individual class exactly as I have defined it in my proposed term addition: as a node that connects Occurrences to Identifications (a.k.a. Determinations). This is not really an addition to the existing Darwin Core standard because the term individualID already exists in the Occurrence class. My proposal simply gives a name to the thing that is the object of individualID - in fact my original justification for the term addition says exactly that. 
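The role of Individual in the full model, as a node that groups many Occurrences and carries the Identifications, can be sketched as a small data structure. This is only an illustration under stated assumptions: the Python class and field names are invented for the sketch (they are not Darwin Core terms), Event and Location are left out and the event date is folded into Occurrence for brevity, and all identifiers and taxon names in the sample data are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    """A piece of evidence that an Occurrence happened (specimen, image, sound...)."""
    kind: str
    identifier: str


@dataclass
class Identification:
    """One opinion about which taxon the Individual represents."""
    identified_by: str
    taxon: str


@dataclass
class Occurrence:
    """One documented sampling of an Individual; may have one to many tokens."""
    occurrence_id: str
    event_date: str  # ISO 8601 string (simplification: no separate Event/Time class)
    tokens: List[Token] = field(default_factory=list)


@dataclass
class Individual:
    """Named node connecting many Occurrences to many Identifications."""
    individual_id: str
    occurrences: List[Occurrence] = field(default_factory=list)
    identifications: List[Identification] = field(default_factory=list)


def identifications_for_occurrence(individuals, occurrence_id):
    """Find the Identifications that apply to an Occurrence via its Individual."""
    for individual in individuals:
        if any(o.occurrence_id == occurrence_id for o in individual.occurrences):
            return individual.identifications
    return []


# Hypothetical sample data: one tree documented twice, identified by two people.
tree = Individual(
    individual_id="ind-1",
    occurrences=[
        Occurrence("occ-1", "2009-05-01", [Token("StillImage", "img-1")]),
        Occurrence("occ-2", "2010-06-15", [Token("PhysicalSpecimen", "spec-1"),
                                           Token("StillImage", "img-2")]),
    ],
    identifications=[
        Identification("Observer A", "Quercus"),
        Identification("Observer B", "Quercus alba"),
    ],
)

for occ in tree.occurrences:
    names = [i.taxon for i in identifications_for_occurrence([tree], occ.occurrence_id)]
    print(occ.occurrence_id, len(occ.tokens), names)
```

Because the Identifications hang off the Individual rather than the Occurrence, both samplings of the tree share the same set of opinions, and the second Occurrence can carry two tokens without needing two basisOfRecord values.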
The fundamental purpose that Individual serves is to accommodate the "crow's foot" on the Occurrence side of the line that connects Individual to Occurrence, i.e. to allow re-sampling over time and space. That is all. The line going to Identification/Determination has to be connected somewhere, and it makes sense to connect it to Individual rather than Occurrence since the resampled entity is not going to change its identity from one sampling to another.

I have done one more thing in this model to make it more normalized. It's a spin-off from Paul Murray's post http://lists.tdwg.org/pipermail/tdwg-content/2010-October/001771.html which got me thinking: if we were to treat time in the same way we are treating Locations and other entities, a fully normalized model would have a class for Time. Time can have varying degrees of specificity (just like Location and Taxon), and there is a one-to-many relationship between Time and Event (i.e. there can be many Events going on at different Locations at a given Time, just like there can be many Events at different Times at a given Location). We almost always denormalize the Time class out of our models because in most cases it can be represented as a single ISO 8601 string. But as Paul points out, Time can be a complicated thing that one might want to model in a more sophisticated way than a single string. I'm not suggesting that we should do this in Darwin Core if nobody needs it, but the point is that it COULD be done. There probably already is a class for Time defined by somebody else (does anyone know about this?).

In summary, the fully normalized model that I have presented seems to be consistent with almost all of the discussion that has taken place on the list recently. Although the ASC model is "more normalized" than this in some parts, I haven't heard many of the participants in the discussion advocating for a general Darwin Core model that is more complex than what I've presented in the last link.
Obviously, individuals (humans) could add many more classes of things in their own personal models, but I think the classes in this last model can accommodate nearly all of the resources people have said that they want to describe using Darwin Core.

End of part 1

--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu

TaxonConcept Ontology, Test Data Set, SPARQL Endpoint, HTML and RDF representations
by Peter DeVries 11 Nov '10

I have cleaned up my ontology* at: http://lod.taxonconcept.org/ontology/txn.owl

It validates here: http://owl.cs.manchester.ac.uk/validator/

Here is a small sample of species concepts, occurrences and related data in one gzipped .rdf: http://lod.taxonconcept.org/txn_base.rdf.gz

It includes these examples: http://lod.taxonconcept.org/examples.html

This small file should allow people to test inferencing etc. I expect and would encourage people to try it to see if they can find something wrong, or some utility that they would like.

I have a SPARQL endpoint that is described here: http://www.taxonconcept.org/sparql-endpoint/

The data set should also be available on the LOD Cloud Endpoint:
http://lod.openlinksw.com/sparql
http://lod.openlinksw.com/isparql/

* This is live most of the time except when I am updating the server or data.

The HTML and RDF representations have changed since my last update. Here are two examples:
http://lod.taxonconcept.org/ses/iuCXz.html
http://lod.taxonconcept.org/ses/iuCXz.rdf
http://lod.taxonconcept.org/ses/dwAmr.html
http://lod.taxonconcept.org/ses/dwAmr.rdf

* Once I have this worked out I have no problem with TDWG/GBIF/EoL taking this over. I just need a live namespace in which changes can be made quickly.

Respectfully,

- Pete

---------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
TaxonConcept Knowledge Base <http://www.taxonconcept.org/> / GeoSpecies Knowledge Base <http://lod.geospecies.org/>
About the GeoSpecies Knowledge Base <http://about.geospecies.org/>
------------------------------------------------------------

TDWG and the Darwin Core
by Jim Croft 11 Nov '10

"Alpha and Beta discuss the lack of a semantically rich version of the Darwin Core." you are a bad man Bob Morris... :) http://www.xtranormal.com/watch/7632561/ -- _________________ Jim Croft ~ jim.croft(a)gmail.com ~ +61-2-62509499 ~ http://www.google.com/profiles/jim.croft 'A civilized society is one which tolerates eccentricity to the point of doubtful sanity.' - Robert Frost, poet (1874-1963)

Request for vote on proposals to add Individual as a Darwin Core class and to add the term individualRemarks as a term within that class
by Steve Baskauf 10 Nov '10

I am pleased with the significant and thoughtful discussion that has taken place on the tdwg-content email list regarding the relationships among Occurrences, Individuals, and other entities that are a part of the community's thinking about biodiversity metadata and the way that those metadata are structured. It appears from the discussion that there is widespread acceptance of the idea that Individual as a concept has a place in the structuring of biodiversity metadata and that there is some consensus about what "Individual" means (i.e. an entity ranging from actual biological individuals to small coherent populations that can reliably be asserted to represent a single taxon). Whether that acceptance and consensus constitutes a compelling need for adding two new terms (the class dwc:Individual and dwc:individualRemarks) to the Darwin Core standard is the point of a TAG "vote".

Given the discussion that has occurred, it seems to me that there are two reasons why there is an actual need for those terms. One reason is that if members of the Darwin Core constituency intend to structure their metadata in a fully normalized manner that includes grouping Occurrences by Individuals (and it appears that there are at least several who intend to do this), the term dwc:individualRemarks is needed to provide a means to indicate the nature of the individual (i.e. is it a biological individual, clonal individuals, a small population, etc.?) and the class dwc:Individual is needed as the category within which to put individualRemarks so as to indicate that individualRemarks is a property of Individuals. The second reason for explicitly recognizing Individual as a class is that it would place a term representing the concept of "Individual" within a "well-known vocabulary". I feel that would be critical for facilitating the ultimate development of a recommendation for the representation of Darwin Core as RDF.
At this point, it is not clear to me that there are any other existing DwC terms that should be moved to a new Individual class. Originally, I suggested that individualCount should be placed in that class, but I no longer think so. Counting the number of individuals is really something that happens when an Occurrence takes place, and a small cohesive group of a single taxon (e.g. wolf pack or plant population) could have an individualCount that changes over time. As was discussed earlier on the email list, the xxxxxxID terms probably really belong in the Record-level terms category rather than being listed within particular classes. So I don't believe that dwc:individualID should be in the proposed class either. As I detailed in my Biodiversity Informatics paper, an Individual is really an entity that serves primarily as a node that allows the grouping of other resources (namely Occurrences and Identifications). As such, it really has few (or no) properties that can be known outside of Occurrences.

Thus I would like to "call the question" on the issue of the proposal. I would suggest that the issue of adding the class dwc:Individual and the term dwc:individualRemarks within it be addressed in a single vote, since there is little point in having one term without the other. I would also hope that those on the TAG who choose to vote would review the list discussion carefully first. The fact that the question of "what exactly is an Individual?" came up a few times after it had been clearly answered in the thread is an indication that some people entered the thread later on without the benefit of having read some of the earlier posts.

Steve

--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu

Re: [tdwg-content] tdwg-content Digest, Vol 20, Issue 17
by Nico Cellinese 09 Nov '10

Steve is using Species as ranks in his definition and I think this is the wrong approach. Let's make all this rank agnostic please! Use the word taxon! What if I have a group of organisms that represents a polyphyletic species and I want to name a lineage (group of organisms) within this traditionally recognized species that I am not recognizing as a species per se (as in rank of species)? In other words, Identifications and ranks are two different things, so let's abandon ranks for a more objective discussion on taxa. Individuals are definitively not species.

Nico (the other one)

> On Wed, Nov 3, 2010 at 7:24 AM, Steve Baskauf <steve.baskauf(a)vanderbilt.edu> wrote:
>
> John,
> I'm not sure that I agree with your analysis that the definition prevents the possibility of making an Identification at a rank less specific than a species. My revised definition says that the Individual should only include groups of organisms that are reliably known to be of a single species - it doesn't say that we need to know what that species is (i.e. an identification to genus or family can be made with the hope that someone down the line would be able to refine the identification to species). Clarification on this point could be added to the comment or the Google Code page, but I don't think there is a problem with the definition per se. However, if there is a consensus that the definition is too restrictive, I would not object to changing the wording of the definition from "species (or lower taxonomic rank if it exists)" to "taxon" if there were clarification added to the comments or Google Code page that Individual was not intended to include aggregations of multiple species.
>
> I agree that there is a need for a term that represents "collections", "bags", "aggregations", or whatever you want to call an aggregation that includes multiple species. But I have never intended that Individual should be that term. If we expand Individual to include aggregates, then it becomes unusable for its original intended purpose. I would prefer for someone to propose a different term for aggregates of individuals instead of adding that function to Individual. Then define the relationship of this new thing to Individual as a one:many relationship (one aggregation:many Individuals).
>
> Steve
>
> John Wieczorek wrote:
> > Most of you probably do not receive postings from the Google Code site for Darwin Core. Steve B. updated the proposal for the new term Individual, and then commentary ensued on the Issue tracker. Since there remains an unresolved issue, I'm bringing the discussion back here by adding the commentary stream below. The unresolved issue in Steve's amendment is the restriction in the definition to "a single species (or lower taxonomic rank if it exists)."
> >
> > Rich argues that we should not obviate the capability of applying an Identification to an aggregate (e.g., fossil), where the aggregate consists of multiple taxa.
> >
> > Steve argues that Identifications should be applied only to aggregates of a single taxon.
> >
> > Steve, aside from the aggregate issue (which should be solved satisfactorily), your suggestion is too restrictive, because it would obviate the possibility of making an Identification (even for a single organism) to any rank less specific than a species. That is a loss of capability, and therefore unreasonable.
> >
> > Comment 7 <http://code.google.com/p/darwincore/issues/detail?id=69&colspec=ID%20Type%2…> by baskaufs <http://code.google.com/u/baskaufs/>, Today (8 hours ago)
> > As a result of the discussion that has taken place on the tdwg-content email list during 2010 October and November, I am updating the term recommendation for Individual as follows:
> >
> > Definition: The category of information pertaining to an individual organism or a group of individual organisms that can reliably be known to represent a single species (or lower taxonomic rank if it exists).
> >
> > Comment: Instances of this class can serve the purpose of connecting one or more instances of the Darwin Core class Occurrence to one or more instances of the Darwin Core class Identification.
> >
> > Refines: N/A
> >
> > Please note that as a precautionary measure, I have removed the statement that Individual refines http://purl.org/dc/dcmitype/PhysicalObject because the definition of PhysicalObject specifically mentions that the object is inanimate. I am not currently aware of any well-known term that defines living things.
> >
> > Steve Baskauf
> >
> > Comment 8 <http://code.google.com/p/darwincore/issues/detail?id=69&colspec=ID%20Type%2…> by deepreef(a)hawaii.rr.com <http://code.google.com/u/deepreef@hawaii.rr.com/>, Today (8 hours ago)
> > I think the definition should be "...represent a single taxon". We shouldn't restrict it to members of the same species (or lower), because then we technically can't include things that may represent more than one species, yet would best be treated within the scope of an Individual.
> >
> > Also, I'm slightly partial to the term "Organism" for this class, rather than "Individual", because it's more clearly tied to the biology domain, and less likely to collide with the word "Individual" in other domains. I know such collision is not a technical problem, but it might lead to some confusion.
> >
> > Comment 9 <http://code.google.com/p/darwincore/issues/detail?id=69&colspec=ID%20Type%2…> by baskaufs <http://code.google.com/u/baskaufs/>, Today (8 hours ago)
> > Well, the reason that I defined it to be members of the same species is to ensure that the term Individual can serve the primary function that I perceived was needed: to make the connection from occurrences to identifications. When I said one or more identifications, I meant one or more opinions about what that single species (or lower) was, not that there could be multiple identifications of several different species that happened to be in the same "bag" such as the contents of a pitfall trap containing multiple species, an image that contained several species, or a specimen that contained parasites of a different species. I think that there is a need for a term for this other kind of thing (a heterogeneous "lot", "batch", or something), but I think that including this in the definition of Individual defeats the purpose for which I proposed it. If there were several different species in the "Individual", then one would have to specify which identification went with which biological individual within the "lot", which would result in actually breaking down the "lot" into single species "Individuals" anyway.


Re: [tdwg-content] taxonomy != identification
by Dusty 05 Nov '10


Collections contain things that do not map nicely to a single taxon name of any (or no) rank. It's not clear to me if this proposal will support those kinds of data or not. A few examples: Uncertainty: http://arctos.database.museum/guid/KWP:Ento:1703 Composite specimens: http://arctos.database.museum/guid/UAM:Herb:12718 Hybrids: http://arctos.database.museum/guid/UAM:Mamm:3517 Things that aren't taxonomy at all: http://arctos.database.museum/guid/UAM:ES:3405 -D On Wed, Nov 3, 2010 at 10:07 PM, Peter DeVries <pete.devries(a)gmail.com>wrote: > > What I would recommend is that you treat a specimen that is identified to > an order (Perciformes) with something like the following. > > Species => Order Perciformes species undetermined. > > The individual is still an instance of a species, however that species has > yet to be determined. > > What would work best is to have some standard way of writing the green > string above. > > This would allow the occurrences that are of individuals identified only to > the Order Perciformes, to be interpreted as a species that falls somewhere > within the Order Perciformes. > > - Pete > > > --------------------------------------------------------------- > Pete DeVries > Department of Entomology > University of Wisconsin - Madison > 445 Russell Laboratories > 1630 Linden Drive > Madison, WI 53706 > TaxonConcept Knowledge Base <http://www.taxonconcept.org/> / GeoSpecies > Knowledge Base <http://lod.geospecies.org/> > About the GeoSpecies Knowledge Base <http://about.geospecies.org/> > ------------------------------------------------------------ > > _______________________________________________ > tdwg-content mailing list > tdwg-content(a)lists.tdwg.org > http://lists.tdwg.org/mailman/listinfo/tdwg-content > >


What is an Occurrence? [followup to "Wrong" RDF and What I learned... threads]
by Steve Baskauf 04 Nov '10


After the flurry of emails recently, I had an opportunity to carefully read all the way through the threads again, followed by enforced "think time" during my long commute. I was actually pretty cheerful after that because I think that in essence, most of the conversation about what constitutes an Occurrence really boils down to the same thing. So I have sat down and tried to summarize what seems to me to be a consensus about Occurrences. To follow my points, please refer to the diagram at: http://bioimages.vanderbilt.edu/pages/occurrence-diagram.gif Consensus on relationships 1. The fundamental definition of an Occurrence involves evidence that a representative of a taxon occurred at a place and time. Note 1.A: For clarity, I have modified John's statement in his last email by replacing "taxon" with "representative of a taxon". I'm considering a taxon to be an abstract concept that is applied to individuals or groups of organisms. Note 1.B. This definition is far more useful than the official definition of the class Occurrence "The category of information pertaining to evidence of an occurrence..." which is essentially circular. Note 1.C: This statement is extremely broad because the evidence could be of many sorts, the representative could range from a single individual to all organisms on the earth, the taxon could be anyone's definition at any taxonomic level, the place could range from a GPS point with uncertainty of less than 10 meters to the entire planet earth, and the time could range from a shutter click of less than one second to 3.4 billion years. 2. The diagram is an attempt to summarize in pictorial form statements and relationships that have been described in the thread. The taxon representative is recorded as existing at a particular time and place (the arrow) and the result is an Occurrence record. 
That Occurrence record exists as metadata which may be associated with a token that can be used to voucher the fact that the taxon representative existed. That token may be the organism itself (or a living part of it as in a twig for grafting), all or part of the organism in preserved form, an electronic representation such as an image or sound recording, and other kinds of things like tissue or DNA samples. There may also be no token at all, in which case we call the Occurrence record an observation. Based on direct observation of the taxon representative, examination of one or more tokens, or both, some determiner asserts that a taxon concept applies to the taxon representative and as a result a scientific name can be used to "identify" the taxon representative. (There may be a lot of other complicated stuff above the Identification box, but that will have to be filled in by the taxonomists.) Note 2.A: I have mapped onto this diagram the letters that John used in his last email to refer to entities that are involved in an Occurrence (T, E, L, O, and G). I will beg the forgiveness of fossil people because I don't really know how the geological context fits in. I'm assuming that it is a way of asserting time and location on a much broader scale than we do for extant organisms. Note 2.B: I have put a dotted line around the part of the diagram that I think includes all the things that people might consider part of the Occurrence itself. I have left out "T" and the other parts related to identification because it seems to me that you can have an occurrence that you document which does not yet (and perhaps never will) have an identification. The Occurrence still asserts that a taxon representative existed at a time and place; we just don't yet know what the taxon is. 3. The red lines indicate the relationships that connect the various entities (I'm going to go ahead and call them resources). 
Consistent with popular opinion, the Occurrence record is the center of the universe and most things are connected to it. Note 3.A: I am sticking to my guns and refuse to connect the Identification directly to the Occurrence. It is the taxon representative that is being identified, not the occurrence. One can assert another sort of relationship between the identification and the occurrence if one wants to say that one consulted the occurrence metadata and token in order to decide about the identification, but it is not correct to say that the Identification identifies either the Occurrence metadata or the token (as Rich pointed out). OK, so that's step one - defining what is related to what. If anyone disagrees with these relationships, please clarify or create your own diagram. Complicating circumstances/caveats 1. It is noted and recognized that some users will not care to include all of these relationships in their models. In the interest of simplification or "flattening" the relationships, they may wish to collapse some parts of this diagram (e.g. incorporate time and location metadata within the Occurrence metadata rather than considering them separate resources, applying scientific names directly to the taxon representatives without defining a taxon concept or recording the determination metadata, connecting identifications directly to the occurrence, etc.). This doesn't mean that the relationships don't exist, it just means that some users don't care about them. 2. It is recognized that different users will be interested in or able to specify the various resources to differing degrees of precision. Examples: A photographer might record times to the nearest second, a collector may only be interested in noting the date on which a specimen was collected. A location may be specified to the precision of a GPS reading or be defined as some geographic or political subdivision. 
The taxon representative may be an individual organism, a flock or clump, or some larger aggregation of taxon representatives. That's step two. If I've missed any complications, please point them out. My opinions about the implications of this diagram 1. The circle I've labeled as "taxon representative" is the resource type that I'm proposing to be represented by the class Individual. You will note that in both the definition of dwc:individualID ("An identifier for an individual or named group of individual organisms...") and the proposed class definition ("The category of information pertaining to an individual or named group of individual organisms represented in an Occurrence"), groups of individual organisms are included. Thus John's example of a fossil having myriad individuals, or Richard's examples of thousands of plankton, a large school of fish, a herd of wildebeest, or a flock of birds, could all be categorized as "Individual" under this definition if there is a reasonable expectation that all of the individuals in the group are members of the same taxon. Perhaps there is a better name for this resource, but since dwc:individualID was already extant, I chose Individual as the class name for consistency with the pattern established with other classes and their associated xxxxID terms. 2. Although in note 1.C. I have taken the ranges of the various resources to their logical extreme (as was done previously in the thread), I think that as a practical matter we can adopt guidelines to set reasonable values for the "normal" ranges of the resources. One such guideline might be that we suggest a range that can accommodate about 95% of the user needs within the community (this came from Rich's comment about satisfying 95% of the user need with an establishmentMeans controlled vocabulary). For example, it was suggested that the range for the location of an Occurrence could span the entire planet Earth. True enough, but virtually nobody would find such a span useful. 
95% of users would probably find a range between a GPS reading with 10 meter precision and the extent of a county or province useful for recording the location of an Occurrence. I can suggest similar "useful" ranges: one second to one day for an event time (excluding fossils), one individual organism to the number of organisms that would fit within a 50 meter radius for an "individual", and taxon identified to family for plants and maybe mammals, genus for birds, and order for insects. So framing the definition of an Occurrence in these terms it would be something like: "An occurrence involves evidence (consisting of a physical token, electronic record, or personal observation) that a representative (ranging from a single individual to the number that would fit on a football field) of a taxon (hopefully identified to some lower taxonomic level) occurred at a place (determined to a precision between that of a GPS reading and the size of a county/province) and time (spanning one second to one day)." A few people might object to this level of restrictiveness, but I would guess that it would make 95% of us happy. 3. With the exception of the "missing" class Individual, every resource type on this diagram except for the "token" and Scientific name has a Darwin Core class. Every resource type on the diagram except for "token" has a dwc:xxxxID term that can be used to refer to a GUID for the resource. The implication of this is that any resource on this diagram except for the token and taxon representative (i.e. Individual) is ready to be represented in RDF by Darwin Core terms in the sense that the relationships (red lines) can be represented by the xxxxID terms and that the resources can be rdfs:type'd using Darwin Core classes. (Lacking a class for the scientific name doesn't seem like a big deal to me since the scientific name can be a string literal - but then I'm not a taxonomist.) 4. 
OK, I've avoided it as long as I can, so I'm going to confess now to the RDF-phobes. The red lines and shapes are something pretty close to an RDF graph. What that means is that if the community can agree that this diagram correctly represents the relationships among the kinds of biodiversity resources that we care about, then the matter of providing guidelines on how to represent Darwin Core in RDF suddenly gets a lot simpler. Just convert the "picture" of the RDF graph into XML format and we have a template. Alright, that's an oversimplification, but I think it is essentially true because the most difficult part of achieving a consensus on RDF representations is to decide how we connect the resource types, not on the literals that we hang onto resources as properties. 5. While I'm beating the RDF drum again, the importance of my opinion number 2 can be extended into the GUID adoption process. In my comments to Kevin about the Beginner's Guide to Persistent Identifiers, I think I commented on the question of how one decides whether a GUID needs to be assigned to something or not. I believe that the answer to that question boils down to this: we need a GUID for any resource that will be referenced by more than one other resource. Do we need to be able to assign a GUID to Taxon concepts? Yes, because it is likely that many identifications will want to reference a particular taxon concept. Do we need to be able to assign a GUID to an Event? Maybe or maybe not. If every occurrence has its own separate time recorded, then no GUID is needed because the time is just a part of every separate occurrence record. If the event is defined to be a time range that represents a collecting trip, then there may be many Occurrences that are associated with that trip and all of them could reference the GUID for that event rather than repeating the event information for every Occurrence. 
The point here is that every shape (class of resources) on this diagram at least has the POTENTIAL to be a node connecting multiple resources and therefore should have the capability of being assigned a GUID, having its own RDF record, and being appropriately typed (presumably by a DwC class). So this is a final technical argument for why we need to have the DwC class Individual. Whether people ultimately choose to assign GUIDs to particular resource types is their own choice, but they need to at least be ABLE to if they need that resource to serve as a node given the structure of their metadata.

We need to clarify how the "token" thing fits in, but I'm stopping there for now. I would very much appreciate responses indicating that:
A. you agree with the diagram and connections (and consider this definition and diagram a consensus)
B. you disagree with the diagram (and articulate why)
C. you can provide an alternative diagram or explanation of the relationships among the classes related to Occurrences.

Thanks for your patience with another tome.

Steve

--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences
postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.
delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235
office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu
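The rule of thumb in opinion 5 above (a resource needs a GUID when more than one other resource will reference it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers (occ:1, ind:A, and so on) and the ex: predicate names are hypothetical placeholders, not Darwin Core terms, and the triples are plain tuples rather than real RDF.

```python
from collections import defaultdict

# Hypothetical resources and relationships, loosely following the diagram.
# The "ex:" predicates are invented for this sketch; they are not DwC terms.
triples = [
    ("occ:1", "ex:occurrenceOf", "ind:A"),
    ("occ:2", "ex:occurrenceOf", "ind:A"),
    ("occ:1", "ex:atEvent", "evt:1"),
    ("occ:2", "ex:atEvent", "evt:2"),
    ("evt:1", "ex:atLocation", "loc:X"),
    ("evt:2", "ex:atLocation", "loc:X"),
    ("id:1", "ex:identifies", "ind:A"),
    ("id:1", "ex:toTaxonConcept", "tc:Q"),
]

def shared_nodes(triples):
    """Return the resources referenced by more than one distinct subject."""
    referrers = defaultdict(set)
    for subject, _predicate, obj in triples:
        referrers[obj].add(subject)
    return sorted(node for node, subjects in referrers.items() if len(subjects) > 1)

# Resources that act as connecting nodes and so would need their own GUID.
print(shared_nodes(triples))
```

In this made-up data set the Individual and the Location are shared, while each Event is referenced by only one Occurrence, which matches the "maybe or maybe not" answer given for Events above.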


What is dwc:basisOfRecord for?
by Steve Baskauf 03 Nov '10


OK, I know that this sounds like a stupid question, but I really want somebody who was involved in the development and maintenance of the current DwC standard to tell me how the term dwc:basisOfRecord is supposed to be used (not what it IS - I've seen the definition at http://rs.tdwg.org/dwc/terms/index.htm#basisOfRecord)? I would like for the answer of this question to be separated from the issue of what the Darwin Core type vocabulary (http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm) is for. I re-read the lengthy thread starting with http://lists.tdwg.org/pipermail/tdwg-content/2009-October/000301.html which talked a lot about basisOfRecord and its relationship to other ways of typing things. I don't want to re-plough that ground again, but I couldn't find the post that stated what the final decision was. I remember that there was a decision to NOT create the recordClass term which was the subject of much discussion. I guess my confusion at this point is with the inclusion of both "Occurrence" and "PreservedSpecimen" in the same list. Let's say that I have a flat database where I include metadata about the Occurrence (such as dwc:recordedBy) and the specimen (such as dwc:preparations) in the same line. What is the basisOfRecord for that line? I would guess that the "basis of the record" was the specimen. But the line in the record also represents an Occurrence. It seems like there is a lack of clarity as to whether basisOfRecord is supposed to indicate the type of the record (which would be an Occurrence record) or whether it's supposed to indicate the kind of evidence on which the record is based (which would be PreservedSpecimen). There have been various times where I've seen a database record that includes basisOfRecord and it seems to be inconsistently applied. I can see how the Darwin Core type vocabulary could be useful - it pretty much lays out useful values that one could give for rdfs:type. But basisOfRecord as a term is confusing me. 
Steve

--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences
postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.
delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235
office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu
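To make the ambiguity in the question above concrete, here is a small Python sketch of a flat record that mixes Occurrence metadata (recordedBy) with specimen metadata (preparations). The record values, the SPECIMEN_TERMS grouping, and the function names are assumptions made for illustration; nothing here is an official Darwin Core rule for choosing basisOfRecord.

```python
# Terms that describe the physical specimen rather than the act of recording.
# This grouping is an assumption made for illustration only.
SPECIMEN_TERMS = {"preparations", "disposition"}

# A hypothetical flat record mixing Occurrence and specimen metadata in one line.
flat_record = {
    "recordedBy": "A. Collector",
    "eventDate": "2010-11-03",
    "preparations": "pressed and mounted",
}

def split_record(record):
    """Separate specimen (token) terms from the remaining Occurrence terms."""
    token = {k: v for k, v in record.items() if k in SPECIMEN_TERMS}
    occurrence = {k: v for k, v in record.items() if k not in SPECIMEN_TERMS}
    return occurrence, token

def readings_of_basis_of_record(record):
    """Return the value each of the two readings would suggest."""
    _occurrence, token = split_record(record)
    return {
        # Reading 1: basisOfRecord names the type of the record itself.
        "type_of_record": "Occurrence",
        # Reading 2: basisOfRecord names the kind of evidence behind the record.
        # Falling back to HumanObservation when no specimen terms are present
        # is a simplification; MachineObservation would be equally possible.
        "kind_of_evidence": "PreservedSpecimen" if token else "HumanObservation",
    }

print(readings_of_basis_of_record(flat_record))
```

The same line of data yields two defensible answers, which is the inconsistency the question is asking about.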


Treatise on Occurrence, tokens, and basisOfRecord
by Steve Baskauf 02 Nov '10


I have been dreading trying to write this post which I have promised (or threatened depending on if you have enjoyed or been annoyed by the previous lengthy thread) for some time. I have dreaded it because this is a complicated subject and not one that is amenable to terse messages. However, after the previous conversation with Rich et al., I feel for the first time that I have the questions (not answers!) clearly in my mind. So rather than starting off rambling about LivingSpecimens and establishmentMeans as I had planned, I'm going to start by laying down several principles that have come into clarity in my mind after the previous conversation and the attempt to map things out in a diagram. I will apologize in advance for failure to use the correct database or IT technical terms when I'm in unfamiliar territory. Until there is a consensus about how we deal with the "tokens" we use to document Occurrences, I'm not sure that what I have to say about those other topics will make sense. PRINCIPLES (derived from earlier discussion) 1. We have a number of kinds of "things" (which I will henceforth refer to as "resources") that are useful for describing and organizing metadata that we collect in our attempts to document biodiversity. For many of these types of resources, we have defined classes to categorize the terms that can be used to describe the properties of resources that are instances of that class. Describing the class helps us to understand the type of resources that constitute instances of that class. 2. A conscious decision was made to avoid formally defining rdfs:domain for Darwin Core terms. This decision was made to provide flexibility in the way the terms can be used and to avoid the situation where semantic clients would draw incorrect or silly conclusions about what kind of things resources are. 
However, this decision does not excuse us from thinking carefully about whether a term can be appropriately applied to a resource that is a member of some class (e.g. should we say that a digital photograph has a scientific name?). Placing a term within a class is a suggestion that the term would appropriately be applied as a property of an instance of a class. 3. When users want to "flatten" and simplify their databases, they tend to eliminate one-to-many (1:M) relationships in favor of one-to-one (1:1) relationships. The result of that is differences like we saw in http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif (which allows 1:M relationships between Occurrences and Events and between Events and Locations) and http://bioimages.vanderbilt.edu/pages/rich-diagram2.gif (which "atomizes" every Occurrence by considering it to have its own separate eventTime and Location information). A. There is nothing intrinsically "right" or "wrong" about either of these approaches, because they each have their own advantages. The 1:M approach is more efficient, but results in a more complicated database, while the 1:1 approach results in a simpler database but may require repeating some or many term values in the records. B. The choices that users make in these situations is the cause of much of the disagreement about whether a certain class should exist or not since the people taking the 1:1 approach "collapse" the relationship diagram and eliminate classes they don't need while people who take the 1:M approach need instances of the class to act as nodes to connect their "many" resources to some other thing. C. This collapsing of the diagram is also the reason for some disagreement about whether a term belongs in a certain class or not. In the example above, 1:1 people would say that eventDate is a property of an Occurrence, while 1:M people would say that eventDate is a property of an Event. D. 
The choice of users on this issue influences their decision about whether or not to create resources that are instances of classes and hence to assign them identifiers. If users take the 1:M approach, they need identifiers for resources that are acting as connecting nodes so that they can make reference to that resource in the metadata of the many things they are connecting to it. If users take the 1:1 approach, they probably will skip creating explicit resources (and their corresponding identifiers) for resources of the class that they are "collapsing" out of the diagram. 4. I would propose that the "right" relationship diagram is not necessarily one that caters to a certain "right" philosophical point of view. Rather, the "right" diagram is the one that allows users to define the relationships that they need for the organization of their metadata in the simplest manner, and which provides the most clarity about what resources of various kinds are, and how they are connected. A. "Right" as I have defined it above depends on how broadly the relationship diagram is intended to apply. An individual person or organization with limited interests may have a relationship diagram that is simpler than the diagram shown at http://bioimages.vanderbilt.edu/pages/rich-diagram1.gif or might choose to add classes for other things that are of personal interest. An organization focused on different issues or with broader interests might opt for many more or different classes that would be connected to those shown in the diagram. B. Given what I just said in A, what is "right" for Darwin Core is going to be defined by the needs of the Darwin Core constituency. At the TDWG meeting, John Wieczorek made a statement which I will paraphrase as "in order for a term to make it into Darwin Core, at least two people had to want it". 
I'm not sure to what extent he was joking about this, but it makes the point that one must consider community needs before saying that a certain part of the "diagram" is necessary. I think that the reason that Rich and I were so quickly able to come to a consensus on the organization of the left side of the diagram is because he realized that there was a significant part of the DwC constituency that needed a way to group occurrences (i.e. needed Individuals) and I realized that there was a significant part of the constituency that needed to group multiple Events at a Locality and multiple Occurrences at an Event. So in evaluating alternative conceptual systems for organizing resources, the question has to be asked as to the extent that an alternative allows broad segments of the DwC constituency to organize their metadata in an efficient and conceptually sensible way. If one alternative is more broadly applicable and conceptually clear than another, then that alternative is better regardless of the philosophical underpinnings of the argument. 5. The last point is one that has run as an undercurrent through various TDWG threads but which may not have been explicitly stated in this particular thread. That is that there should be a separation between what a resource IS and what we want to use a resource FOR. To use technical terms, we need to separate the "type" of a resource from its fitness of use. A digital image IS a digital image. It might be used FOR documenting that an organism was at a particular location at a particular time, but it could be used to illustrate a character, as a part of a visual key, as media for an educational presentation, as art, and probably many other things that aren't popping into my mind at the moment. I believe that much of the confusion about "what is an Occurrence" comes from a failure to make this distinction. 
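The trade-off described in Principle 3 can be shown with a short Python sketch. It starts from a 1:M arrangement in which Events and Locations are stored once and referenced by identifier, then "collapses" it into the 1:1 form by copying the shared values into every Occurrence. The field names follow Darwin Core terms (occurrenceID, eventID, locationID, eventDate, locality), but the data values and the flatten function are hypothetical and only meant as an illustration.

```python
# 1:M form: each Location and Event is stored once and referenced by identifier.
locations = {"loc:1": {"locality": "Site A"}}
events = {"evt:1": {"eventDate": "2010-11-02", "locationID": "loc:1"}}
occurrences = [
    {"occurrenceID": "occ:1", "eventID": "evt:1"},
    {"occurrenceID": "occ:2", "eventID": "evt:1"},
]

def flatten(occurrences, events, locations):
    """Collapse to the 1:1 form by copying Event and Location values
    into every Occurrence row, so no connecting identifiers are needed."""
    rows = []
    for occ in occurrences:
        event = events[occ["eventID"]]
        location = locations[event["locationID"]]
        rows.append({
            "occurrenceID": occ["occurrenceID"],
            "eventDate": event["eventDate"],    # now reads as a property of the Occurrence
            "locality": location["locality"],   # repeated in every row
        })
    return rows

for row in flatten(occurrences, events, locations):
    print(row)
```

In the flattened rows eventDate appears as a property of the Occurrence and is repeated for each row, while in the 1:M form it is a property of the Event and is stored once. That is the source of the disagreement noted in 3.C about which class a term belongs to.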
THE ISSUE OF THE TOKEN Earlier in the thread of "What is an Occurrence", there was a general consensus that an Occurrence often had a "thing" that was associated with it that served as evidence that a taxon representative (i.e. Individual) occurred at a particular Location at a particular time. In my Biodiversity Informatics paper, I called this thing a "representation", but I now believe that "token" is a better term and will use it hereafter. There also seemed to be a consensus that an observation was simply an Occurrence that did not have an associated token. (This is with the understanding that observation is being narrowly defined as a type of Occurrence, with a definable time and location, as opposed to what I called the "checklist" definition which indicated that some undefined taxon representative was present in some defined geographical area at an indefinite time.) In one of my earlier posts, I pleaded for somebody to tell me whether there was an assumption that the token was considered a part of the Occurrence or whether it was a separate thing. I did not get any responses, which I'm construing to mean that people weren't sure about this. At the present, I now have a clearer idea of the general principles I outlined above, and also have the "Rich" diagram for modeling relationships, so I'm going to again pose this question, but in what I hope is a clearer way. I have re-made the earlier diagram as Rich suggested, using triangles rather than arrows. The wide side of the triangle is the "many" side of the relationship and the point is the "one" side. As before, I'm deferring on the right side of the diagram (to the right of Identification) to the taxonomists for now, so let's keep that out of the discussion for the moment. 
I have also clarified the diagram by coloring in the actual DwC classes to distinguish them from selected terms that fall within those classes (non-colored boxes) and which can be used as properties of resources that are instances of the class. The two alternatives that I'm discussing are: http://bioimages.vanderbilt.edu/pages/token-assumed.gif which I will refer to as the "assumed token" model and http://bioimages.vanderbilt.edu/pages/token-explicit.gif which I will refer to as the "explicit token" model. I believe that historically the assumed token model has been the one which most people have had in mind. Before the new DwC standard, we had specimens and we had observations. In order to avoid redundancies in terms for those two types of "things", a combined "thing" called "Occurrence" was created. An Occurrence that was an observation didn't have a token and an Occurrence that was a specimen had a physical or living specimen as its token. That's all pretty simple and sensible and we see evidence of this kind of thinking in the descriptions given at http://rs.tdwg.org/dwc/terms/index.htm . A record for an Occurrence has a thing called its dwc:basisOfRecord that presumably describes the kind of token (if any). So if the token were a preserved specimen, we would say that [Occurrence] basisOfRecord [PreservedSpecimen]. If there were no token we would say [Occurrence] basisOfRecord [HumanObservation] or [Occurrence] basisOfRecord [MachineObservation]. Referring back to the assumed token diagram, in the case of a specimen there is no explicit reference to the specimen as a separate entity. The terms related to the specimen, such as preparations and disposition, are just plopped into the Occurrence class, which implies that they are properties of the Occurrence itself. There seems to be a general consensus that other kinds of tokens can be used to document an Occurrence. 
However, the way that the current Darwin Core terms are designed and placed within classes is very inconsistent as to how it handles types of tokens other than specimens. According to the instructions at the top of http://rs.tdwg.org/dwc/terms/index.htm, a camera trap bird sighting should have [Occurrence] basisOfRecord [MachineObservation]. It is not clear how one is supposed to handle the actual metadata for the image that serves as the token. Unlike specimens, where the token's metadata terms are placed in the Occurrence class, I guess in the case of an image one is supposed to use associatedMedia to link the so-called MachineObservation to the image record. If DNA were extracted, one would link the sequence to the Occurrence using associatedSequences (although it's not clear to me what the basisOfRecord for that would be - "TookATissueSample"?). But what does one do for other kinds of tokens, like seeds or tissue samples - create terms like associatedSeed and associatedTissueSample? I think that the ResourceRelationship terms were supposed to handle this problem, but I have yet to see an example of exactly how this was supposed to work.

As an attempt to resolve this confusion in my mind, I wrote the Biodiversity Informatics paper that I've promoted frequently on this list (https://journals.ku.edu/index.php/jbi/article/view/3664). In that paper, I take the basic assumed token model and broaden it in an attempt to make it work for all kinds of tokens. Because I assumed that each Occurrence has a single token, I "collapsed the diagram" and connected the properties of the token directly to the Occurrence resource (as was modeled when specimen properties were placed within the Occurrence class). If there were several tokens for a given Individual, I "flattened" the records by creating a separate Occurrence resource for each token.
The model was generalized further by allowing secondary Occurrence records where the token was not derived directly from the organism but rather derived from a primary Occurrence record. This allows for complicated circumstances such as those found in a botanical garden, where a seed or cutting might be collected from a tree, with subsequent generation of a LivingSpecimen, which might have a PreservedSpecimen collected from it and a DigitalStillImage taken of the preserved specimen. You can see examples of the complex types of situations I tried to handle at http://bioimages.vanderbilt.edu/pages/conceptual-scheme-insect.gif and http://bioimages.vanderbilt.edu/pages/conceptual-scheme-botanical.gif . I created my own terms (like sernec:derivativeOccurrence and sernec:derivedFrom) to describe the connections among the individual and the various layers of Occurrences.

Does this system work? Yes, but there are a number of problems associated with it. The first problem is related to Principle 4 above. In order for this system to work, there needs to be a consensus in the DwC community about several things. One is that each Occurrence must have only one token. If we are going to "type" Occurrences by their basisOfRecord (and the acceptable values for basisOfRecord are officially DwC types, see http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm) then an Occurrence can't have two values for basisOfRecord. Yet it is clear from the discussion we've had that people would like a single Occurrence to be able to have multiple tokens as documentation.

The second problem is that there needs to be a consensus that a secondary Occurrence can exist at all (i.e. can you call the image of a specimen "an Occurrence"?). It is clear to me from the discussion that when people are thinking about what an Occurrence means, they have in mind the documentation of the time and place of the Individual in its environment.
In a previous communication, John Wieczorek clarified that terms describing Occurrences like recordedBy and eventDate should only apply to primary occurrences and that it would not be appropriate to use them as properties of what I'm calling a secondary occurrence (such as the image of a specimen). So I dealt with this by creating a distinction between Occurrences that document the distribution of a taxon (using the term sernec:documentsDistribution) and those that don't. This is something like the old validDistributionFlag, but I defined documentsDistribution specifically as having a value of "true" only for Occurrences that were derived directly from the Individual (gray arrows in the two diagrams from the paper).

But I think that the worst "crime" of the system I suggested is violation of Principle 5 above. By asserting an unvarying 1:1 relationship between the Occurrence and its token and by collapsing my relationship diagram to not explicitly include a resource that is the token itself, I am confusing the USE of an Occurrence (to demonstrate that a representative of a taxon was present at a particular Location at a particular time) with what the token IS (a dead organism in a jar or glued to paper, an electronic representation of photon patterns, a series of characters representing a nucleotide sequence). So I'm charging myself with this "crime", pleading guilty, and accepting my sentence, which is to admit that the system I suggested in the Biodiversity Informatics paper is "wrong" based on the principles I outlined above. What this amounts to is an acceptance of the "rightness" of the explicit token model (in the sense that I defined "right" in Principle 3 above).

However, if I'm going to make this admission, I demand that the other guilty parties also confess, namely people who want to assert that Occurrences have properties that actually are properties of specimens.
If we are going to have a system that actually works, we can't straddle the fence and say that the assumed token model is correct for specimens and that the explicit token model is correct for every other kind of token. If we accept the explicit token model, then the specimen will have to come off of its throne and be a token like all of the other ways that we provide evidence that an Occurrence happened. If we accept the explicit token model, then as a biodiversity informatics resource type "observation" will have to disappear into a puff of nothingness just like the "luminiferous ether", "centrifugal force", and other kinds of things that we thought we needed to have to explain things but which turned out to be unnecessary when we figured out more basic explanations. A human observation will simply be an Occurrence that doesn't have a token (which is what I've heard some people say all along). If we allow the Occurrence/token relationship to be a one-to-many relationship rather than one-to-one, then HumanObservation is just the one-to-zero case of the more general one-to-many. For those of you who like the idea of a "machine observation", that is just an Occurrence with a token that is whatever type of resource the machine produces (electronic data file, image of the organism, image of a graph, or whatever).

ADVANTAGES OF RECOGNIZING TOKENS EXPLICITLY

If we accept the explicit token model over the assumed token model, a number of problems get solved. Just as was the case with Events, people who want to flatten things out by having only one token per Occurrence can do so. For example, if I want to atomize things by defining my Occurrence to have taken place during an Event that lasted only the one second within which my camera shutter clicked, I can do that and have only a single token associated with that Occurrence.
On the other hand, if others want to define their Occurrence as taking place over the time over which they photographed, collected a leaf tissue sample, and then collected a branch of a tree for an herbarium specimen, then they can do that and associate all of those tokens (one or more images, the tissue sample, and the preserved specimen) with the single Occurrence.

Another important benefit will come down the line when we actually try to develop RDF templates. Right now it is not exactly clear (at least to me) how properties should be divided up among resources that are being described in the RDF. Based on the assumed token model, I have been including the metadata for the token within the container element for the Occurrence. This leads to some of the kind of odd assertions that people have been objecting to, such as

[Occurrence] dcterms:rights ["(c) 2002 Steven J. Baskauf"]

or

[Occurrence] preparations ["skin"].

In the explicit token model, dividing metadata up appropriately among separate Occurrence and token resources makes more sense, e.g.

[Occurrence] recordedBy ["Joe Curator"]
[image] dcterms:rights ["(c) 2002 Steven J. Baskauf"]
[specimen] preparations ["skin"]

If we wanted to be really explicit about this, we probably should have a separate class for PhysicalSpecimens and separate the terms that describe specimens from those that describe Occurrences in general. There might be some difficulty in doing this because there are some terms that might be hard to decide about, like catalogNumber. I don't really think the catalogNumber is a property of the Occurrence, because it makes more sense to me to say

[specimen] catalogNumber ["12345"]

than

[Occurrence] catalogNumber ["12345"]

Realistically, I can't see this kind of separation ever happening, given the amount of trouble it's been just to get a few people to admit that Individuals exist. It is just too hard to get motion to happen in the TDWG community.
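The division of metadata between the Occurrence and its tokens can be sketched in a few lines of Python. This is only an illustration, not a proposed implementation: the identifiers occ:1, image:1 and specimen:1 are hypothetical, the dwc: prefixes on recordedBy and preparations are added here for uniformity, and the mapping that says which resource each property really describes is an assumption made for the example.

```python
# Triples are plain (subject, predicate, object) tuples.
# All identifiers (occ:1, image:1, specimen:1) are hypothetical.

# Assumed token model: every property hangs off the Occurrence.
ASSUMED_TRIPLES = [
    ("occ:1", "dwc:recordedBy", "Joe Curator"),
    ("occ:1", "dcterms:rights", "(c) 2002 Steven J. Baskauf"),
    ("occ:1", "dwc:preparations", "skin"),
]

# Assumption for this sketch: the resource each token-level property
# really describes. Properties not listed stay on the Occurrence.
DESCRIBED_RESOURCE = {
    "dcterms:rights": "image:1",
    "dwc:preparations": "specimen:1",
}


def make_explicit(triples, described_resource):
    """Re-attach token-level properties to the token they describe."""
    explicit = []
    for subject, predicate, obj in triples:
        new_subject = described_resource.get(predicate, subject)
        explicit.append((new_subject, predicate, obj))
    return explicit


if __name__ == "__main__":
    for triple in make_explicit(ASSUMED_TRIPLES, DESCRIBED_RESOURCE):
        print(triple)
```

In a real data set the mapping would have to be decided term by term, which is exactly where a term like catalogNumber becomes awkward.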
As a practical matter, people who "compress" the system (which we admit happens and make a concession to in Principle 3) by having record tables where a single row contains the metadata for both the Occurrence and the token (i.e. treat it as a 1:1 relationship) will simply have a column heading for catalogNumber and not care whether the catalogNumber applies to the Occurrence or the token. It's the people who want to do the more complicated stuff like simultaneously keep track of multiple tokens per Occurrence (like several images, a sound recording, and a specimen), people who want to write RDF, or people who want to merge databases containing many types of tokens who will have to pay attention to this distinction. Physical specimens would really be the only kind of class we would have to create because there already is a rich vocabulary for media items that is separate from DwC (i.e. the MRTG schema) and there are probably also vocabularies for stuff like tissue samples and DNA sequences (although I'm not familiar with them).

TYPING

Bob has warned us about the dangers of asserting that a term always applies to a certain type of resource by asserting that the term has an rdfs:domain. However, we should not avoid attempting to assert that a resource is itself of a certain type. Describing the "type" of a resource is an important part of letting potential users assess the possible fitness of use of that resource. For example, you can collect DNA from a preserved specimen but not from an image. You can include an image in a print journal article but not a sound recording. You can build a range map from Occurrences, but not from DNA samples. In RDF, one of the basic properties that should be described about every resource is its rdf:type. In the generic Linked Data world, you can pretty much use anything that you want as an rdf:type.
If you decide to use something obscure, then the danger is that nobody else will have any idea what kind of thing you are describing. The Draft TDWG GUID Applicability Statement recommendation 11 says that "Objects in the biodiversity informatics domain that are identified by a GUID should be typed using the TDWG ontology or other well-known vocabularies in accordance with the TDWG common architecture." So in our community, we can't just type resources any way we want. But exactly how we SHOULD type things isn't clear. There isn't any functioning TDWG ontology at the moment.

I have found it useful to use the DwC class as the rdf:type in my attempts to write RDF. That works pretty well for things that have DwC classes. But if we follow the explicit token model, we need to have some consensus on what we will use as the rdf:type for the tokens. At this point it looks to me like it would make sense to have the convention that for tokens one uses either a dcterms:type or a Darwin Core type (i.e. one of the types listed at http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm, although as I already noted, there is no need for HumanObservation in the case of describing a token because human observations don't have tokens). There isn't any sort of "collision" here of the sort that happened right after the adoption of the Darwin Core Standard when we tried to merge the Dublin and Darwin Core types (see http://www.keytonature.eu/wiki/MRTGv08_Type_term_inconsistent_with_DwC and http://lists.tdwg.org/pipermail/tdwg-content/2009-October/000301.html with many following responses for the gruesome details) since rdf:type doesn't demand any particular type vocabulary. I'm not entirely happy with this approach because for digital still images the logical type would be dctype:StillImage, which doesn't give any indication as to whether the image is film or digital, but I guess at this point in the 21st century most consuming applications will probably just assume digital anyway.
So (assuming that Individuals become a DwC class) I guess I don't really see that there is any problem in using the current Darwin Core classes to indicate the rdf:type of every kind of resource that we would be reasonably likely to assign GUIDs to EXCEPT for tokens. Typing of tokens could be done using a combination of Darwin Core and Dublin Core types.

What I'm left scratching my head about is basisOfRecord. When I subscribed to the assumed token model (i.e. when I wrote the Biodiversity Informatics paper), I thought I knew what basisOfRecord meant. It meant the kind of token that backed up an Occurrence. So when I wrote RDF for a specimen (as in http://bioimages.vanderbilt.edu/rdf/examples/lsu000/0428.rdf) I used the "hand grenade" approach to typing. I lobbed every kind of "typing" that I knew of at the Occurrence record for a specimen:

[Occurrence] rdf:type [dwc:Occurrence]
[Occurrence] dwc:basisOfRecord [dwctype:PreservedSpecimen]

and

[Occurrence] dcterms:type [dctype:PhysicalObject]

Under the explicit token model, I would just use

[Occurrence] rdf:type [dwc:Occurrence]

for the Occurrence and

[specimen] rdf:type [dwctype:PreservedSpecimen]

for the specimen itself. If I also took an image at the same time and wanted to say that it was part of the same Occurrence as the specimen, I would use

[image] rdf:type [dctype:StillImage]

Under the explicit token model, I really can't see any use for dwc:basisOfRecord. Despite the resolution of the "train wreck" involving dcterms:type that we narrowly avoided after the adoption of Darwin Core, the definition still says "the specific nature of the data record - a subtype of the dcterms:type." I think this is clearly wrong because I think we established that it was NOT a subtype of dcterms:type in that discussion that I referenced above. So what is basisOfRecord??? What is "the data record" of which we are describing the nature?
If it's the Occurrence, then I think the consensus that I'm hearing in the discussion is that an Occurrence data record shouldn't have as its type any of the dwctype terms except for dwctype:Occurrence. So what are all of the other terms like PreservedSpecimen for???

Under the explicit token model, what we really need is NOT basisOfRecord. What we need is some term like "dwc:tokenID" if you like the Darwin Core IDREF style, or "dwc:hasToken" if you prefer the style of the Linked Data community. In both cases, the object of the term would be an identifier for the token that's associated with a subject Occurrence. This term could be applied from zero (for observations) to many times to an Occurrence. People who want to flatten everything out will just ignore this term and cram all their metadata for the Occurrence, token, Event, and Location onto one line in their metadata table. People who are going to use any kind of one-to-many relationships at all will have to figure out how to handle that anyway and won't be daunted by having more than one dwc:tokenID per Occurrence. In the spirit of the complicated resource relationship diagrams from my paper, one could link primary tokens (like specimens) to secondary tokens (like specimen images) by using dwc:tokenID as well. Any kind of token (primary, secondary, tertiary, ad infinitum) could be linked to the Occurrence that it supports with dwc:occurrenceID.

WHAT DOES THIS DEMAND OF US?

OK, I've now gone on for eight pages of text explaining the rationale behind the question. So I'll return to the basic question: is the consensus for modeling the relationship between an Occurrence and associated token(s) the assumed token model:

http://bioimages.vanderbilt.edu/pages/token-assumed.gif

or the explicit token model:

http://bioimages.vanderbilt.edu/pages/token-explicit.gif ?
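To make the explicit token model concrete, here is a minimal Python sketch of the zero-to-many linking described above. It assumes the proposed (not adopted) term dwc:tokenID, hypothetical identifiers such as occ:1 and specimen:1, and triples stored as plain tuples; the function names tokens_of and flatten are invented for this example.

```python
# Hypothetical data. "dwc:tokenID" is the term proposed in this post,
# not an existing Darwin Core term.
TRIPLES = [
    ("occ:1", "dwc:tokenID", "specimen:1"),     # primary token
    ("occ:1", "dwc:tokenID", "image:1"),        # primary token
    ("specimen:1", "dwc:tokenID", "image:2"),   # secondary token (image of the specimen)
    ("occ:2", "dwc:recordedBy", "Joe Curator"), # observation: no token at all
]


def tokens_of(subject, triples, include_derived=False):
    """Return the tokens linked to a subject, in the order encountered.

    With include_derived=True, also follow token-to-token links
    (secondary, tertiary, ... tokens) breadth-first.
    """
    links = {}
    for s, p, o in triples:
        if p == "dwc:tokenID":
            links.setdefault(s, []).append(o)

    found = list(links.get(subject, []))
    if include_derived:
        seen = set(found)
        queue = list(found)
        while queue:
            current = queue.pop(0)
            for token in links.get(current, []):
                if token not in seen:
                    seen.add(token)
                    found.append(token)
                    queue.append(token)
    return found


def flatten(occurrence, triples):
    """One (occurrence, token) row per primary token; one row with None if there is none."""
    rows = [(occurrence, token) for token in tokens_of(occurrence, triples)]
    return rows or [(occurrence, None)]


if __name__ == "__main__":
    print(tokens_of("occ:1", TRIPLES, include_derived=True))
    print(flatten("occ:2", TRIPLES))
```

The flatten function is the "compressed" view: people who keep one row per Occurrence and token get it for free, and an observation is simply the row whose token is empty.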
If we accept the assumed token model with all of its warts, then for consistency's sake, we must create dwctype terms for each of the types of tokens that people would reasonably want to use as evidence for Occurrences (and my proposal for adding DigitalStillImage as a Darwin Core type stands). We must also resign ourselves to assigning a separate Occurrence to each token that users want to use to document the presence of a taxon at a time and place. We also must accept having goofy-sounding statements like

[Occurrence] dcterms:rights ["(c) 2002 Steven J. Baskauf"]

If we accept the explicit token model, then we need to either dump basisOfRecord or come up with some rational explanation for what it actually means (and my proposal to add DigitalStillImage as a Darwin Core type becomes irrelevant). We also need to create some kind of term like dwc:tokenID that will allow connections to be made between Occurrence records and their tokens. For people who want to flatten out their Occurrence records and put the tokens together with the Occurrence (i.e. "compress the diagram" to get rid of the token resource), and who feel some need to indicate the type of the token that they are using, let them use any appropriate term from the Dublin Core or Darwin Core types as a value for rdf:type.

Until we make one of these choices or the other and "fix" Darwin Core to work in a consistent way, we are just going to continue to misunderstand each other because each person will just "know an Occurrence when they see it".

In the interest of space, I am going to defer on explaining my opinions about LivingSpecimen and establishmentMeans. Those explanations are contingent on the conclusion that we reach on this issue.

Steve

--
Steven J. Baskauf, Ph.D., Senior Lecturer
Vanderbilt University Dept. of Biological Sciences

postal mail address:
VU Station B 351634
Nashville, TN 37235-1634, U.S.A.

delivery address:
2125 Stevenson Center
1161 21st Ave., S.
Nashville, TN 37235

office: 2128 Stevenson Center
phone: (615) 343-4582, fax: (615) 343-6707
http://bioimages.vanderbilt.edu
