[Tdwg-guid] Annotation and Link Out

18 May 2006

I've been thinking about annotation, attribution, and LSID Foreign 
Authority Notification (FAN) and wanted to continue the discussions from 
last month.

While I like the idea of notification of annotations, I'm not a fan of 
FAN.  In the first paragraph of the FAN proposal it states "An example 
authority might return metadata service endpoints that contain metadata 
(for example annotations) about LSIDs for which it is not the authority".

I'll explore this idea using two example LSID authorities, A and B.  
Assume that A has published some metadata identified by an LSID it 
assigned (A is the original source) and that B wants to annotate it.  B 
could be doing so for one of three reasons: to suggest a revision to A's 
metadata (correction), to propose additional information that 
supplements A's metadata (annotation), or to intentionally confuse 
clients and disrupt the system (spam).

A close reading of the FAN proposal makes me think that, according to 
the authors, B will create assertions directly about A's metadata 
(meaning B will use A's LSIDs as the subject of assertions).  FAN 
provides a method for B to notify A that it has done so and suggests a 
method for telling clients of A's resolver to directly contact B (using 
a modified form of resolution) to get additional information about A's 
LSID.  This leads me to believe that the authors think that a single 
metadata object could be split between multiple authorities.

Because of this I see three problems with FAN:  it increases the burden 
on LSID resolution clients and perhaps on the LSID resolution system as 
a whole, it makes attribution of assertions about an LSID indirect, and 
it makes it easier to spam the network.

FAN-aware resolution is more intensive than normal resolution and 
requires the following process:  The client knows an LSID.  It performs 
normal resolution by querying DNS for the service record of the 
authority, contacting the authority to find the correct endpoint, and 
calling getMetadata() on that endpoint.  Next it must parse this 
metadata and look for the special predicate which indicates that a 
foreign authority contains more information about an LSID.  Since the 
client does not know if it has retrieved the entire metadata object 
(some pieces may exist on other authorities), it must perform a modified 
getMetadata() call for each foreign authority.  To do so, the proposal 
suggests that the client again queries DNS for the foreign authority 
(because only the authority name is returned in the first resolution), 
gets its endpoint, and calls getMetadata() with the original LSID 
(though it was assigned by a different authority).  The implied final 
step is that the metadata from the foreign authority is merged with the 
metadata from the authoritative source before being handed back to 
calling code.
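
The steps above can be sketched in a few lines of Python.  This is only 
a toy illustration under stated assumptions:  the authorities are 
in-memory dictionaries standing in for DNS and the resolution 
endpoints, and the predicate name fan:foreignAuthority is a made-up 
placeholder, not the predicate from the FAN proposal.

```python
# Toy, in-memory stand-ins for DNS, authorities, and endpoints (all hypothetical).
FOREIGN_AUTHORITY = "fan:foreignAuthority"  # assumed predicate name, not from the proposal

# authority name -> {lsid: list of (subject, predicate, object) triples}
AUTHORITIES = {
    "A": {
        "urn:lsid:A:names:1": [
            ("urn:lsid:A:names:1", "fullScientificName", "Salmo saler, Linnaeus, 1758"),
            ("urn:lsid:A:names:1", FOREIGN_AUTHORITY, "B"),
        ]
    },
    "B": {
        "urn:lsid:A:names:1": [
            ("urn:lsid:A:names:1", "hasSpecimen", "urn:lsid:B:specimens:3"),
        ]
    },
}

lookups = []  # records one entry per simulated DNS + endpoint round trip


def get_metadata(authority, lsid):
    """Simulate DNS lookup, endpoint discovery, and getMetadata() for one authority."""
    lookups.append(authority)
    return list(AUTHORITIES.get(authority, {}).get(lsid, []))


def authority_of(lsid):
    """Return the authority part of an LSID shaped like urn:lsid:<authority>:<ns>:<id>."""
    return lsid.split(":")[2]


def fan_resolve(lsid):
    """Resolve normally, then query every foreign authority named in the result and merge."""
    merged = get_metadata(authority_of(lsid), lsid)
    foreign = [o for (_, p, o) in merged if p == FOREIGN_AUTHORITY]
    for name in foreign:
        merged.extend(get_metadata(name, lsid))
    return merged


triples = fan_resolve("urn:lsid:A:names:1")
```

Each foreign authority named in A's metadata costs one more full round 
trip (the lookups list grows by one per authority), and the client 
receives B's specimen assertion whether or not it wanted it.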

The assumption that a metadata object *could* be split across multiple 
authorities forces the client to query all foreign authorities, even if 
those authorities only contain annotations that the client is not 
interested in.  To look at an example case of names and specimens, it 
might be that we want notification of usage of names by specimens so we 
can immediately retrieve all specimens for a name.  Assume A issues a 
name and B issues a specimen.
Under FAN, the B authority could add an assertion to their data store 
that says [urn:lsid:A:names:1  hasSpecimen  urn:lsid:B:specimens:3].  I 
consider this an annotation (rather than a revision, modification, or 
correction) because the fact that B linked a specimen to a name does not 
change the semantic meaning of the name object (it merely provides 
additional information).  If you're interested in just the name and you 
don't know by resolving it against A that you've got the complete set of 
name metadata, you're forced to get a bunch of information about 
specimens that you're not interested in (and there could be a great deal 
of this kind of information).

To complicate matters, C might want to propose a correction to A.  C 
might believe that A misspelled the name.  Consider this simple example:

A's original assertion:
urn:lsid:A:names:1    fullScientificName    "Salmo saler, Linnaeus, 1758"

C's proposed modification:
urn:lsid:A:names:1    fullScientificName    "Salmo salar, Linnaeus, 1758"

Under FAN, C would propose its modification and notify A that it did 
so.  When a FAN-aware client goes to resolve A:names:1, it will get back 
some data that changes the meaning of A's original object (C's 
assertion) and a lot of data that's not directly about it (all the 
specimen assertions).  What's worse, it now has both of these assertions 
in the same model:

urn:lsid:A:names:1    fullScientificName    "Salmo saler, Linnaeus, 1758"
urn:lsid:A:names:1    fullScientificName    "Salmo salar, Linnaeus, 1758"

If the names class is defined with fullScientificName as a functional 
property (an instance can have only one value for the property) we're 
in trouble and we no longer have a valid OWL instance.  So the FAN 
merge process has to know both that this is a problem, by understanding 
the ontology for names, and how to fix it, by deciding whether to trust 
C and then selecting one assertion or the other, but not both.
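
As a rough sketch of the check a FAN merge process would need, the 
following Python looks for functional properties that end up with more 
than one value on the same subject.  It assumes the client already 
knows from the names ontology that fullScientificName is functional, 
and it treats distinct literal strings as distinct values; it makes no 
attempt at the harder part, which is deciding which assertion to keep.

```python
from collections import defaultdict

# Assumption: the names ontology declares this property functional.
FUNCTIONAL_PROPERTIES = {"fullScientificName"}

# The merged model from the example above.
merged = [
    ("urn:lsid:A:names:1", "fullScientificName", "Salmo saler, Linnaeus, 1758"),  # from A
    ("urn:lsid:A:names:1", "fullScientificName", "Salmo salar, Linnaeus, 1758"),  # from C
]


def functional_violations(triples, functional=FUNCTIONAL_PROPERTIES):
    """Return {(subject, predicate): values} for functional properties with >1 value."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p in functional:
            values[(s, p)].add(o)
    return {key: vals for key, vals in values.items() if len(vals) > 1}


conflicts = functional_violations(merged)
```

Here conflicts has a single entry, keyed by the name LSID and 
fullScientificName, holding both spellings.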

How does the client decide to trust or distrust C?  The FAN proposal 
suggests that the client gets a semantic description of C and uses that 
to make the decision.  I think it more likely that the code that calls 
the client on behalf of the user should decide whether or not to trust C.

So to fix all this we could propose that the FAN-aware client has to be 
triply smart.  In addition to having to know how to deal with foreign 
authorities it should provide 1.) some mechanism for getting information 
about a foreign authority and deciding whether to trust it or not, 2.) 
some mechanism for conditionally merging the foreign metadata it cares 
about (proposed modifications for example) while discarding other 
information (specimens), and 3.) the ability to know whether or not its 
merge process violates the integrity of the metadata.  This seems to be 
quite a burden.  I think it will add not only complexity to LSID 
resolution, but also slow it down tremendously.  LSID resolution is a 
low-level data identification and access mechanism.  Since so many other 
services will build upon it, it ought to be fast.

That brings up the idea of attribution.  In FAN, the attribution of an 
assertion is not directly tied to the authority of the LSID that acts as 
the subject of the assertion.  Instead, since foreign authorities are 
allowed to make statements directly about an LSID issued by someone 
else, the resolution client has to deduce attribution of each assertion 
during the merge process (using the foreign authority predicate as 
evidence).  After the merge process is complete, downstream processes 
will have lost this attribution information.
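
A small Python sketch of that loss, using plain triples and assuming 
the merge is a simple set union (the FAN proposal may intend something 
richer):

```python
# Assertions as retrieved from each authority during FAN resolution.
from_a = {("urn:lsid:A:names:1", "fullScientificName", "Salmo saler, Linnaeus, 1758")}
from_c = {("urn:lsid:A:names:1", "fullScientificName", "Salmo salar, Linnaeus, 1758")}

# Plain merge: this is all a downstream process receives.
merged = from_a | from_c


def authority_of(lsid):
    """Return the authority part of an LSID shaped like urn:lsid:<authority>:<ns>:<id>."""
    return lsid.split(":")[2]


# The only evidence left in each triple is its subject LSID, which names A both times.
apparent_sources = {authority_of(s) for (s, _, _) in merged}

# Keeping attribution means carrying the source alongside every triple.
attributed = {(t, "A") for t in from_a} | {(t, "C") for t in from_c}
actual_sources = {src for (_, src) in attributed}
```

After the union, apparent_sources contains only A even though one of 
the two assertions came from C; the extra bookkeeping in attributed is 
what every FAN-aware client would have to pass along.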

Finally, FAN opens the door to spammers because attribution is not 
directly tied to the authority name.  If I trust A and I resolve the 
name id above I could get back information from C.  It could easily be 
the case that I don't trust C (because I believe it to be a spammer) but 
that my LSID resolution client doesn't know this fact.  The end user of 
the client may be using a FAN-enabled resolver or not and never know.  
The FAN proposal suggests that clients be notified, but what is your 
average piece of software that calls a resolution client on behalf of a 
user to do with this information?

I propose we take an alternative approach.  I like the notification 
system in FAN, but I dislike the assumption about distributed objects 
and indirect attribution.  I would rather we agree on an annotation and 
linking system that holds the following to be true:

1.)  An authority is never allowed to make an assertion whose subject is 
an LSID that it is not authoritative for.
2.)  LSIDs authoritatively attribute an assertion to the organization 
that owns the fully qualified authority domain (this includes sub 
domains to support the case of hosted authority services).
3.)  Objects are not distributed across authorities but instead link to 
other first-class objects that may exist within other authorities.
4.)  Annotations, corrections, etc. are valuable, and as such should be 
first class objects that are attributed to their issuers by the normal 
process of 2 above.  This is annotation as linking; the subject of the 
annotation is the LSID of the annotation, it is attributed to the 
authority of the issuer, and an agreed upon predicate is used to define 
the object that acts as domain of an annotation (what it applies to).  
This guarantees correct attribution of annotations throughout the network.
5.)  We cannot stop spammers from publishing malicious assertions but we 
can validate a given assertion acquired from some source external to the 
LSID system by resolving the LSID and checking the resulting metadata 
against the assertion.  Spammers will find it difficult to circumvent 
this method because it requires hijacking the DNS system.  If you trust 
that a source gave you a complete and full copy of an object you don't 
have to resolve its LSID.
6.)  We will often have to use systems that are external to LSID in 
order to find the data/metadata we're interested in (at the very least 
to identify LSIDs to resolve).  These other systems have to understand 
how to attribute assertions to their owners and can do so using the 
rules above without relying on metadata about a foreign authority.
7.)  We don't have to assume (as FAN does) that an authority is 
responsible for providing all information in the universe that relates 
in any way to its LSIDs.
8.) If you get back invalid metadata from an authority (not incorrect, 
but metadata that, for example, violates a functional property 
restriction), you automatically know who is at fault (the organization 
that owns the authority name).
9.) A FAN-style notification system could still be used for logging and 
tracking the use of one's data or for proposing corrections to an 
issuer who might want to approve and incorporate those suggestions into 
their published data.
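
To illustrate rules 1, 4, and 5, here is a minimal Python sketch.  The 
annotation LSID urn:lsid:C:annotations:7 and the predicates appliesTo 
and proposedFullScientificName are invented for the example (no such 
vocabulary has been agreed upon), and resolve() is an in-memory 
stand-in for real LSID resolution.

```python
APPLIES_TO = "appliesTo"  # hypothetical "domain of the annotation" predicate


def authority_of(lsid):
    """Return the authority part of an LSID shaped like urn:lsid:<authority>:<ns>:<id>."""
    return lsid.split(":")[2]


def obeys_rule_1(issuer, triples):
    """Rule 1: every subject must be an LSID issued by the publishing authority."""
    return all(authority_of(s) == issuer for (s, _, _) in triples)


# Rule 4: C's correction is its own first-class object that links to A's name.
annotation = [
    ("urn:lsid:C:annotations:7", APPLIES_TO, "urn:lsid:A:names:1"),
    ("urn:lsid:C:annotations:7", "proposedFullScientificName",
     "Salmo salar, Linnaeus, 1758"),
]

# The FAN-style equivalent, where C asserts directly about A's LSID.
fan_style = [
    ("urn:lsid:A:names:1", "fullScientificName", "Salmo salar, Linnaeus, 1758"),
]

# Stand-in for resolution: LSID -> metadata served by its own authority.
STORE = {"urn:lsid:C:annotations:7": annotation}


def resolve(lsid):
    """Return the metadata the LSID's authority serves, or an empty list."""
    return STORE.get(lsid, [])


def validate(assertion):
    """Rule 5: confirm an externally acquired assertion by resolving its subject."""
    return assertion in resolve(assertion[0])
```

With these definitions obeys_rule_1("C", annotation) holds while 
obeys_rule_1("C", fan_style) does not, the annotation is attributed to 
C simply by reading the authority out of its subject LSID, and 
validate() accepts only assertions that the subject's own authority 
actually serves.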

So we might use a FAN-like system, but only for foreign annotation 
notification, not foreign authority notification.

Any comments?  Have I misunderstood the assumptions behind FAN?

-Steve

Steven Perry
