[tdwg-tag] RE: tdwg-tag Digest, Vol 22, Issue 5

Thu Oct 11 23:30:17 CEST 2007

Roger writes:
> Why is region/geography a special case and not covered like any other
> kind of context with a subjectTag? It could point to a polygon or
> TDWG geographic region or whatever.

The principal question to me is whether we simply want to have

tag = distribution
text = occurring in higher altitude
value = archibiota
value = visible-only-in-summer
value = high-altitude-distribution
value = italy

that is, the consumer has to figure out what the meaning of the
categorical data is and what the relevance of them is with regard to
distribution, or whether specific attributes structure this
information so that a consumer can easily find information it is
interested in, like

tag = distribution
text = occurring in higher altitude
status = archibiota
geoArea = italy
modifier = visible-only-in-summer
altitude = high-altitude-distribution

Structurally the first is clearly possible, but it seems to require a
lot of semantic analysis to interpret, and my feeling is that it is
liable to misinterpretation.
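
To make the difference concrete, here is a minimal Python sketch. The
dictionary layout, the vocabulary set and both helper functions are
assumptions made for illustration and are not part of SPM; only the
attribute names and values come from the two examples above.

```python
# Hypothetical records mirroring the two examples above; not an SPM API.
flat = {
    "tag": "distribution",
    "text": "occurring in higher altitude",
    "values": ["archibiota", "visible-only-in-summer",
               "high-altitude-distribution", "italy"],
}

structured = {
    "tag": "distribution",
    "text": "occurring in higher altitude",
    "status": "archibiota",
    "geoArea": "italy",
    "modifier": "visible-only-in-summer",
    "altitude": "high-altitude-distribution",
}

# Assumed external vocabulary that a consumer of the flat form would need
# in order to recognise which value is a geographic area.
KNOWN_GEO_AREAS = {"italy", "germany", "brazil"}


def geo_area_from_flat(record):
    """Guess the geographic area by testing every value against a vocabulary."""
    return [v for v in record["values"] if v in KNOWN_GEO_AREAS]


def geo_area_from_structured(record):
    """Read the geographic area directly from its named attribute."""
    return record.get("geoArea")
```

With the flat form the consumer depends on having the right vocabulary
at hand and still does not learn what role the value plays; with the
structured form the attribute name carries that information.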

> The same could be argued for associatedTaxon. (I prefer
> associatedTaxon to organismInteraction - the two taxa concerned  may
> not interact we may just be saying they occur in the same habitat or
> that Taxon A has some shared characteristic with  taxon B - though of
> course every atom in the universe does interact with every other in
> some way I suppose.)

There could be a taxon association, and that is interesting raw
information on a specimen level. However, when talking about
properties/traits of organisms, we are talking of knowledge, and the
fact that somewhere, sometime on earth two organisms have been seen
together is rather uninteresting. I want to learn about pathogens,
pollinators, not the fox that happened to be present while the plant
was pollinated. That is exactly what I would hope to express by using
organismInteraction instead.

> What is the difference between tagging something with a geographic
> region and tagging it with a taxon?

Or tagging it with a color code, or tagging it with a nomenclatural
code, or tagging it with a museum name? Clearly all are categories,
but it seems to me they are of a different kind, i.e. the reason
they are added is different.

> If this is boiled down far enough we don't need the InfoItem as a
> container and can use common vocabularies (DublinCore) for most of
> this stuff.
>
> <tdwg:SpeciesProfile rdf:about="http://my.guid.could.be.lsid.org">
>         <tdwg:aboutTaxon rdf:about="urn:lsid:of:some:taxon" />
>         <tdwg:associatedTaxon rdf:about="urn:lsid:of:some:other:taxon" />
>         <dc:description>
>                 This is some text about how good this is to eat and other stuff.
>         </dc:description>
>         <dc:subject  rdf:about="http://my.controlled.list.of.terms#cooking" />
>         <dc:subject  rdf:about="http://my.controlled.list.of.terms#Brazil" />
> </tdwg:SpeciesProfile >

My problem with this is that it is unclear whether the taxon occurs in
Brazil, occurs in Argentina and is imported and cooked in Brazil, or
whether it occurs and is cooked in Germany, but the cooking recipe
originates from Brazil.

In contrast, having

       <dc:distribution rdf:resource="http://my.controlled.list.of.terms#Brazil" />

seems to make this clear, provided semantics are defined for distribution.
I am mostly interested in analyzing taxon-specific information for
identification and phylogenetics, and it seems to me that the first
kind of communication would be worthless for such purposes.
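
As a rough illustration, the two forms can be written as plain
(subject, predicate, object) triples in Python. The shortened
identifiers ("profile1", "taxonA", "terms#...") and the objects()
helper are hypothetical; dc:distribution is used only as in the
snippet above and still needs defined semantics.

```python
# Triples as (subject, predicate, object); identifiers are shortened and
# hypothetical, chosen only to mirror the RDF snippets in this message.
generic = [
    ("profile1", "tdwg:aboutTaxon", "taxonA"),
    ("profile1", "dc:subject", "terms#cooking"),
    ("profile1", "dc:subject", "terms#Brazil"),
]

specific = [
    ("profile1", "tdwg:aboutTaxon", "taxonA"),
    ("profile1", "dc:subject", "terms#cooking"),
    ("profile1", "dc:distribution", "terms#Brazil"),
]


def objects(triples, subject, predicate):
    """Return all objects stated for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```

In the generic form, asking for dc:subject returns cooking and Brazil
side by side with nothing to say how Brazil relates to the taxon. In
the specific form, asking for dc:distribution returns Brazil alone.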

> If we want to express Taxon-Taxon interactions Kevin Richards and I
> already came up with something to use for the HerbIMI LSID Authority
>
> http://rs.tdwg.org/ontology/voc/TaxonOccurrenceInteraction
>
> Note that this defines interactions between *occurrences* not taxa
> and the occurrences provide the context. I am not sure that this is
> the place to get into defining interactions other than in the most
> general way.

The values of interaction kind should be defined elsewhere. However, I
would prefer that the concept that such values exist be visible and
defined. I am considering placing them as tags on the interaction
class, but other solutions are possible. I propose to have a special type for it
because we have an interaction of

I don't want to reject Roger's idea of remerging classes. Separating
out datatypes to me is a vehicle to be able to come up with sharper
definitions of the semantics of the various class attributes,
expressing which is a measurement and which an aggregation, what is
aggregable, what is a context and what is scope under which
aggregation was performed, what is a subclassing of the aggregation
concept, what is subclassing of value concept, what is a frequency,
what a probability of a statement. This is the major concern I have
about the generality of SPM 0.2. The list was full of ideas about the
purposes for which value and context could be used, but when receiving
the data no generic decoding seemed possible to me.

In SDD we distinguished between original measurements (SampleData) and
aggregations (SummaryData - which often already occur on the single
specimen level), between an aggregation Scope, and a Modification
(subclassing) of values and characters. The solution chosen is heavily
skewed towards acceptance. Originally we had frequency, probability,
value modification and character modification, all as values and text.
However, it was considered too complex so that now the modifier is
overloaded with all these (but the modifier concepts carry a
classification that allows making these distinctions in the end). The
solutions in SDD are particular, and it would be good to make them
more general as a result of the current discussion - but I do think
the issues we tried to solve are real.

All this is largely irrelevant for free-form text, but what we are
discussing here is not free-form text but exactly this kind of
structured data.

> If you look at the DublinCore definition of "subject" it says:  "The
> topic of the resource. Typically, the topic will be represented using
> keywords, key phrases, or classification codes. Recommended best
> practice is to use a controlled vocabulary. To describe the spatial
> or temporal topic of the resource, use the Coverage element." The
> coverage element says "Spatial characteristics of the intellectual
> content of the resource."

> Could SPM just boil down to a "controlled vocabulary" for DublinCore
> metadata tags on chunks of text plus a predicate to indicate the
> taxon we are talking about? We could just do an applicability
> statement on how to tag them in HTML!

In DublinCore the taxon would simply be a value of subject. The
problem that results with this is we would end up with:

subject = pathogen, pollinator, taxon1, taxon2, taxon3,
coverage = Germany, Italy, UK, summer, 1950-2007

which is ok for roughly finding something that might be interesting to
read (which is what DC is good for) but almost worthless if you want
to figure out what pathogens taxon2 has in Germany. That is what we
need the container / envelope for, keeping things together.
Furthermore, my intuition is that it is significantly easier to
process data if in advance I know something is a status value, a
geographic area, a taxon, a tag - all of which may come from an
external rather than TDWG vocabulary - rather than having to figure
this out using owl.
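
A minimal Python sketch of why the container matters. The flat record
repeats the DublinCore example above; the container records, their
attribute names and the pathogens_of() helper are assumptions, and the
assignment of roles and areas to taxa is invented purely for
illustration.

```python
# Flat Dublin Core style metadata, as in the example above.
dc_record = {
    "subject": ["pathogen", "pollinator", "taxon1", "taxon2", "taxon3"],
    "coverage": ["Germany", "Italy", "UK", "summer", "1950-2007"],
}

# Hypothetical container records. Which taxon plays which role in which
# area is invented here purely for illustration.
containers = [
    {"aboutTaxon": "taxon2", "associatedTaxon": "taxon1",
     "role": "pathogen", "geoArea": "Germany"},
    {"aboutTaxon": "taxon2", "associatedTaxon": "taxon3",
     "role": "pollinator", "geoArea": "Italy"},
]


def pathogens_of(records, taxon, geo_area):
    """Return the associated taxa recorded as pathogens of `taxon` in `geo_area`."""
    return [
        r["associatedTaxon"]
        for r in records
        if r["aboutTaxon"] == taxon
        and r["role"] == "pathogen"
        and r["geoArea"] == geo_area
    ]
```

The flat record contains all the same words, but nothing in it links
"pathogen" to a particular taxon or country, so the equivalent query
cannot be answered from dc_record alone.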

But that is a question of principle. In current SPM I found it very hard
to figure out which is the context and what context means. Context in
an actual observation / specimen is quite clear, but I find it
difficult to have "invasive" as value and "Germany" as context. Others
might want to have "frequently" and "rarely" as values and "Germany"
as context. Or both...

I cannot say whether in the future software will simply effortlessly
figure out what kind of category a value is (taxon, geoarea,
frequency, distributionstatus, conservationstatus, etc.) and analyse
the implications. But the Brazilian cooking example to me indicates
that without some guidance, drastically alternative interpretations
are possible.

What I am after with defining multiple classes like FreeFormText,
Markuptext, Distribution, Interaction, QuantitativeMeasurement,
CategoricalMeasurement, MolecularSequence is to give enough guidance
to be able to explain how in a particular case the attributes relate
to each other - and provide an appropriate context for extension with
further attributes. My current understanding is that if we do not explain
this outside of RDF/OWL we would be forced to model it through
reification, which we all seem to strive to avoid...
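
As a sketch of what such guidance could look like on the consumer
side, here are two of the classes written as Python dataclasses. The
class names Distribution and Interaction come from the list above and
the Distribution attributes from the earlier example; the Interaction
attributes (associatedTaxon, interactionKind) and the geo_areas()
helper are hypothetical and only meant to show the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Distribution:
    text: str
    geoArea: Optional[str] = None
    status: Optional[str] = None
    modifier: Optional[str] = None
    altitude: Optional[str] = None


@dataclass
class Interaction:
    text: str
    associatedTaxon: str   # hypothetical attribute name
    interactionKind: str   # hypothetical; value from an external vocabulary


def geo_areas(items):
    """Collect geographic areas from Distribution items only."""
    return [i.geoArea for i in items
            if isinstance(i, Distribution) and i.geoArea]


items = [
    Distribution("occurring in higher altitude", geoArea="italy",
                 status="archibiota"),
    Interaction("illustrative text", associatedTaxon="taxon1",
                interactionKind="pathogen"),
]
```

Because each class states which attributes it has, a consumer can
select by type and attribute name instead of inferring the kind of
each value.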

Gregor