[Tdwg-guid] Annotation and Link Out
I've been thinking about annotation, attribution, and LSID Foreign Authority Notification and wanted to continue the discussions from last month.
While I like the idea of notification of annotations, I'm not a fan of FAN. In the first paragraph of the FAN proposal it states "An example authority might return metadata service endpoints that contain metadata (for example annotations) about LSIDs for which it is not the authority".
I'll explore this idea using two example LSID authorities, A and B. Assume that A has published some metadata identified by an LSID it assigned (A is the original source) and that B wants to annotate it. B could be doing so for one of three reasons: to suggest a revision to A's metadata (correction), to propose additional information that supplements A's metadata (annotation), or to intentionally confuse clients and disrupt the system (spam).
A close reading of the FAN proposal makes me think that, according to the authors, B will create assertions directly about A's metadata (meaning B will use A's LSIDs as the subject of assertions). FAN provides a method for B to notify A that it has done so and suggests a method for telling clients of A's resolver to directly contact B (using a modified form of resolution) to get additional information about A's LSID. This leads me to believe that the authors think that a single metadata object could be split between multiple authorities.
Because of this I see three problems with FAN: it increases the burden on LSID resolution clients and perhaps on the LSID resolution system as a whole, it makes attribution of assertions about an LSID indirect, and it makes it easier to spam the network.
FAN-aware resolution is more intensive than normal resolution and requires the following process: The client knows an LSID. It performs normal resolution by querying DNS for the service record of the authority, contacting the authority to find the correct endpoint, and calling getMetadata() on that endpoint. Next it must parse this metadata and look for the special predicate which indicates that a foreign authority contains more information about an LSID. Since the client does not know whether it has retrieved the entire metadata object (some pieces may exist on other authorities), it must perform a modified getMetadata() call for each foreign authority. To do so, the proposal suggests that the client again queries DNS for the foreign authority (because only the authority name is returned in the first resolution), gets its endpoint, and calls getMetadata() with the original LSID (though it was assigned by a different authority). The implied final step is that the metadata from the foreign authority is merged with the metadata from the authoritative source before being handed back to calling code.
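To make the cost of the process described above concrete, here is a minimal Python sketch that simulates it against in-memory data. Everything in it is an assumption made for illustration: the predicate name fan:hasForeignAuthority, the AUTHORITIES table, and the helper functions are hypothetical and do not come from the LSID specification or the FAN proposal. Each simulated getMetadata() call is logged together with the DNS and endpoint lookups that precede it.

```python
# Hypothetical in-memory stand-ins for DNS and authority services.
FOREIGN_AUTHORITY = "fan:hasForeignAuthority"  # assumed predicate name

NAME = "urn:lsid:A:names:1"

# authority name -> {lsid -> list of (subject, predicate, object) triples}
AUTHORITIES = {
    "A": {NAME: [
        (NAME, "fullScientificName", "Salmo saler, Linnaeus, 1758"),
        (NAME, FOREIGN_AUTHORITY, "B"),
        (NAME, FOREIGN_AUTHORITY, "C"),
    ]},
    "B": {NAME: [(NAME, "hasSpecimen", "urn:lsid:B:specimens:3")]},
    "C": {NAME: [(NAME, "fullScientificName", "Salmo salar, Linnaeus, 1758")]},
}


def authority_of(lsid):
    """Return the authority component of urn:lsid:<authority>:<ns>:<id>."""
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    return parts[2]


def get_metadata(authority, lsid, log):
    """Simulate DNS lookup, endpoint discovery, and getMetadata()."""
    log.append(f"dns:{authority}")
    log.append(f"endpoint:{authority}")
    log.append(f"getMetadata:{authority}")
    return list(AUTHORITIES.get(authority, {}).get(lsid, []))


def resolve_fan_aware(lsid):
    """Resolve normally, then repeat the full round trip per foreign authority."""
    log = []
    merged = get_metadata(authority_of(lsid), lsid, log)
    foreign = [o for (_, p, o) in merged if p == FOREIGN_AUTHORITY]
    for name in foreign:
        # Same LSID, different authority: the "modified" getMetadata() call.
        merged.extend(get_metadata(name, lsid, log))
    return merged, log


merged, log = resolve_fan_aware(NAME)
```

In this toy setup, one LSID with two foreign authorities costs three full round trips (nine logged steps) instead of one, and the merged result mixes triples from A, B, and C with nothing recording which authority supplied which triple.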
The assumption that a metadata object *could* be split across multiple authorities forces the client to query all foreign authorities, even if those authorities only contain annotations that the client is not interested in. To look at an example case of names and specimens, it might be that we want notification of usage of names by specimens so we can immediately retrieve all specimens for a name. Assume A issues a name and B issues a specimen. Under FAN, the B authority could add an assertion to their data store that says [urn:lsid:A:names:1 hasSpecimen urn:lsid:B:specimens:3]. I consider this an annotation (rather than a revision, modification, or correction) because the fact that B linked a specimen to a name does not change the semantic meaning of the name object (it merely provides additional information). If you're interested in just the name and you don't know by resolving it against A that you've got the complete set of name metadata, you're forced to get a bunch of information about specimens that you're not interested in (and there could be a great deal of this kind of information).
To complicate matters, C might want to propose a correction to A. C might believe that A misspelled the name. Consider this simple example:
A's original assertion: urn:lsid:A:names:1 fullScientificName "Salmo saler, Linnaeus, 1758"
C's proposed modification: urn:lsid:A:names:1 fullScientificName "Salmo salar, Linnaeus, 1758"
Under FAN, C would propose its modification and notify A that it did so. When a FAN-aware client goes to resolve A:names:1, it will get back some data that changes the meaning of A's original object (C's assertion) and a lot of data that's not directly about it (all the specimen assertions). What's worse, it now has both of these assertions in the same model:
urn:lsid:A:names:1 fullScientificName "Salmo saler, Linnaeus, 1758"
urn:lsid:A:names:1 fullScientificName "Salmo salar, Linnaeus, 1758"
If the names class is defined with fullScientificName as a functional property (can have only one assertion about the property on an instance) we're in trouble and we no longer have a valid OWL instance. So the FAN merge process has to know both that this is a problem, by understanding the ontology for names, and how to fix it, by deciding whether to trust C or not and select one or the other assertion but not both.
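The check a merge process would need can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the set FUNCTIONAL_PROPERTIES stands in for knowledge that would really come from the names ontology, and the sketch treats two distinct literal values as a conflict, which is a simplification of what an OWL reasoner does.

```python
from collections import defaultdict

# Assumption: the names ontology declares this property functional.
FUNCTIONAL_PROPERTIES = {"fullScientificName"}


def functional_violations(triples, functional=FUNCTIONAL_PROPERTIES):
    """Map (subject, predicate) to its sorted distinct values wherever a
    functional property carries more than one value on the same subject."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p in functional:
            values[(s, p)].add(o)
    return {key: sorted(vals) for key, vals in values.items() if len(vals) > 1}


merged = [
    ("urn:lsid:A:names:1", "fullScientificName", "Salmo saler, Linnaeus, 1758"),
    ("urn:lsid:A:names:1", "fullScientificName", "Salmo salar, Linnaeus, 1758"),
    ("urn:lsid:A:names:1", "hasSpecimen", "urn:lsid:B:specimens:3"),
]
conflicts = functional_violations(merged)
```

Detecting the conflict is the easy half. The sketch says nothing about which of the two values to keep, and that is exactly the trust decision discussed next.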
How does the client decide to trust or distrust C? The FAN proposal suggests that the client gets a semantic description of C and uses that to make the decision. I think it more likely that the code that calls the client on behalf of the user should decide whether or not to trust C.
So to fix all this we could propose that the FAN-aware client has to be triply smart. In addition to having to know how to deal with foreign authorities it should provide 1.) some mechanism for getting information about a foreign authority and deciding whether to trust it or not, 2.) some mechanism for conditionally merging the foreign metadata it cares about (proposed modifications for example) while discarding other information (specimens), and 3.) the ability to know whether or not its merge process violates the integrity of the metadata. This seems to be quite a burden. I think it will add not only complexity to LSID resolution, but also slow it down tremendously. LSID resolution is a low-level data identification and access mechanism. Since so many other services will build upon it, it ought to be fast.
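As a rough sketch of what "triply smart" means in practice, the Python function below combines the three mechanisms in one merge step. All names are hypothetical, the trust list and predicate filter are supplied by the calling code (as argued above), and the policy of keeping the authoritative value when a functional property conflicts is just one possible choice, not something FAN prescribes.

```python
def conditional_merge(own, foreign, trusted, wanted, functional):
    """own: triples from the authoritative source.
    foreign: {authority name: triples fetched from that foreign authority}.
    trusted: authority names the caller chose to trust (mechanism 1).
    wanted: predicates the caller cares about (mechanism 2).
    functional: predicates limited to one value per subject (mechanism 3).
    Returns (merged triples, list of (authority, triple, reason) rejections)."""
    merged = list(own)
    rejected = []
    taken = {(s, p) for (s, p, _) in own if p in functional}
    for authority, triples in foreign.items():
        for triple in triples:
            s, p, _ = triple
            if authority not in trusted:
                reason = "untrusted"
            elif p not in wanted:
                reason = "not wanted"
            elif p in functional and (s, p) in taken and triple not in merged:
                reason = "functional conflict"
            else:
                reason = None
            if reason:
                rejected.append((authority, triple, reason))
            elif triple not in merged:
                merged.append(triple)
                if p in functional:
                    taken.add((s, p))
    return merged, rejected


NAME = "urn:lsid:A:names:1"
own = [(NAME, "fullScientificName", "Salmo saler, Linnaeus, 1758")]
foreign = {
    "B": [(NAME, "hasSpecimen", "urn:lsid:B:specimens:3")],
    "C": [(NAME, "fullScientificName", "Salmo salar, Linnaeus, 1758")],
    "X": [(NAME, "comment", "unsolicited text")],
}
merged, rejected = conditional_merge(
    own, foreign,
    trusted={"B", "C"},
    wanted={"fullScientificName"},
    functional={"fullScientificName"},
)
```

Note that all three foreign authorities still had to be contacted before any of their triples could be discarded, so the filtering adds work to resolution rather than saving any.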
That brings up the idea of attribution. In FAN, the attribution of an assertion is not directly tied to the authority of the LSID that acts as the subject of the assertion. Instead, since foreign authorities are allowed to make statements directly about an LSID issued by someone else, the resolution client has to deduce attribution of each assertion during the merge process (using the foreign authority predicate as evidence). After the merge process is complete, downstream processes will have lost this attribution information.
Finally, FAN opens the door to spammers because attribution is not directly tied to the authority name. If I trust A and I resolve the name id above I could get back information from C. It could easily be the case that I don't trust C (because I believe it to be a spammer) but that my LSID resolution client doesn't know this fact. The end user of the client may be using a FAN-enabled resolver or not and never know. The FAN proposal suggests that clients be notified, but what is your average piece of software that calls a resolution client on behalf of a user to do with this information?
I propose we take an alternative approach. I like the notification system in FAN, but I dislike the assumption about distributed objects and indirect attribution. I would rather we agree on an annotation and linking system that holds the following to be true:
1.) An authority is never allowed to make an assertion whose subject is an LSID that it is not authoritative for.
2.) LSIDs authoritatively attribute an assertion to the organization that owns the fully qualified authority domain (this includes sub domains to support the case of hosted authority services).
3.) Objects are not distributed across authorities but instead link to other first-class objects that may exist within other authorities.
4.) Annotations, corrections, etc. are valuable, and as such should be first class objects that are attributed to their issuers by the normal process of 2 above. This is annotation as linking; the subject of the annotation is the LSID of the annotation, it is attributed to the authority of the issuer, and an agreed upon predicate is used to define the object that acts as domain of an annotation (what it applies to). This guarantees correct attribution of annotations throughout the network.
5.) We cannot stop spammers from publishing malicious assertions but we can validate a given assertion acquired from some source external to the LSID system by resolving the LSID and checking the resulting metadata against the assertion. Spammers will find it difficult to circumvent this method because it requires hijacking the DNS system. If you trust that a source gave you a complete and full copy of an object you don't have to resolve its LSID.
6.) We will often have to use systems that are external to LSID in order to find the data/metadata we're interested in (at the very least to identify LSIDs to resolve). These other systems have to understand how to attribute assertions to their owners and can do so using the rules above without relying on metadata about a foreign authority.
7.) We don't have to assume (as FAN does) that an authority is responsible for providing all information in the universe that relates in any way to its LSIDs.
8.) If you get back invalid metadata from an authority (not incorrect, but metadata that, for example, violates a functional property restriction), you automatically know who is at fault (the organization that owns the authority name).
9.) A FAN style notification system could still be used for logging and tracking the use of one's data or for proposing corrections to an issuer who might want to approve and incorporate those suggestions into their published data.
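A small Python sketch may help show how rules 1, 4, and 5 fit together, using C's spelling correction as the example. The predicate names annotates, proposedPredicate, and proposedValue are placeholders for whatever vocabulary would actually be agreed upon, and the resolve argument is a stand-in for a real LSID resolution client; none of these come from an existing specification.

```python
ANNOTATES = "annotates"                   # assumed linking predicate
PROPOSED_PREDICATE = "proposedPredicate"  # assumed
PROPOSED_VALUE = "proposedValue"          # assumed


def authority_of(lsid):
    """Return the authority component of urn:lsid:<authority>:<ns>:<id>."""
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    return parts[2]


def may_assert(issuer, triple):
    """Rule 1: an issuer may only use its own LSIDs as subjects.
    (Exact string match for brevity.)"""
    return authority_of(triple[0]) == issuer


def make_annotation(issuer, local_id, target_lsid, predicate, value):
    """Rule 4: express a statement about someone else's LSID as a
    first-class object identified under the issuer's own authority."""
    annotation = f"urn:lsid:{issuer}:annotations:{local_id}"
    return [
        (annotation, ANNOTATES, target_lsid),
        (annotation, PROPOSED_PREDICATE, predicate),
        (annotation, PROPOSED_VALUE, value),
    ]


def is_confirmed(triple, resolve):
    """Rule 5: accept a triple obtained elsewhere only if resolving its
    subject LSID returns the same triple."""
    return triple in resolve(triple[0])


NAME = "urn:lsid:A:names:1"
original = (NAME, "fullScientificName", "Salmo saler, Linnaeus, 1758")
direct = (NAME, "fullScientificName", "Salmo salar, Linnaeus, 1758")
correction = make_annotation("C", 1, NAME, "fullScientificName",
                             "Salmo salar, Linnaeus, 1758")

published = {NAME: [original]}          # what A's resolver would return
resolve = lambda lsid: published.get(lsid, [])
```

Under these rules C's direct assertion about A's LSID is not allowed, while every triple of the correction object has a C-issued subject, so attribution follows from the LSID alone and A's name object is never altered by a merge.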
So we might use a FAN-like system, but only for foreign annotation notification, not foreign authority notification.
Any comments? Have I misunderstood the assumptions behind FAN?
-Steve
Hi Steve, I think you have a very strong grasp of the FAN system. FAN is of course just a proposal so I would be happy to work with any interested parties in coming up with a more useful specification. However, for now, I will go on the defensive to see what (if anything) we can keep in the current spec. I will try to address your three main concerns, though you had some interesting subtle points as well.
1.) Resolution Burden
FAN resolution may seem expensive, but in reality it requires an additional getMetadata() call to get the foreign authorities. However, in the general case, a client might have to do several getMetadata() calls given the authoritative WSDL anyway. FAN then requires a getAvailableServices() call and (possibly) many getMetadata() calls for every foreign authority. Let me point out a few mitigating factors in this seemingly expensive and complicated operation.
i) The functionality can be, and already is, built into the Java client.
ii) Caching at all points can greatly improve performance (even though FAN information and metadata can change, it can still be cached to some extent).
iii) If you think about it, FAN lays out the minimal amount of work necessary to resolve metadata hosted by another entity. We can't have the main authority point directly to foreign metadata services because these services could change... we use LSID resolution to provide a level of indirection against this problem.
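The caching point (ii) could look something like the following time-limited cache, written here in Python purely as an illustration. The class name, the fetch callable, and the 60-second lifetime are all assumptions; this is not how the Java client mentioned above is implemented.

```python
import time


class TTLCache:
    """Cache results of an expensive lookup for a limited time, since FAN
    information and metadata can change."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch      # e.g. a function performing getMetadata()
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (expiry time, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]        # still fresh: no network round trip
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value


calls = []


def fake_fetch(lsid):
    """Stand-in for a remote call; records each invocation."""
    calls.append(lsid)
    return len(calls)


fake_now = [0.0]
cache = TTLCache(fake_fetch, ttl_seconds=60, clock=lambda: fake_now[0])
first = cache.get("urn:lsid:A:names:1")
second = cache.get("urn:lsid:A:names:1")   # served from the cache
fake_now[0] = 61.0
third = cache.get("urn:lsid:A:names:1")    # expired, fetched again
```

A cache of this kind trades freshness for speed: within the lifetime window a client may act on foreign-authority lists or metadata that have since changed.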
2.) Attribution of assertions
(This is the only point where I might come off as a bit rough so I apologize in advance. :) I also apologize if I have misunderstood your arguments on this point.)
To restrict who can say what about URIs in the semantic web is dangerous and eventually impossible. Adding a layer of indirection as suggested in requirement #3 is unnecessary when viewing the RDF graph as a whole. Also, where would such a triple live? It shouldn't be in the main metadata, so you would end up with something isomorphic to FAN metadata.
Alternatively, consider the following. Just because the main authority refers to the foreign authority does not mean that it fully endorses the foreign metadata as true statements about the LSID in question. In fact, because metadata is used to list the foreign authorities, it can include qualifications about the foreign sources, such as "I don't trust this one" or "this one is 50% trustworthy", etc.
3.) Spam
If a given authority allowed anybody to register itself as a foreign authority without advising FAN enabled clients of its untrustworthiness, then yes, spamming would be a problem. But certainly, we can imagine a trust-based registration system, where trusted sources could register as foreign authorities, and the FAN metadata would label them as such. Anonymous sources would be either denied or flagged in the FAN metadata.
Now I will try to address some of your 9 points below
1.) An authority is never allowed to make an assertion whose subject is an LSID that it is not authoritative for.
I truly believe that requirements such as these fly in the face of the flexibility of RDF.
2.) LSIDs authoritatively attribute an assertion to the organization that owns the fully qualified authority domain (this includes sub domains to support the case of hosted authority services)
This is definitely true for metadata services pointed to by the authoritative WSDL.
3.) Objects are not distributed across authorities but instead link to other first-class objects that may exist within other authorities.
It's a bit dangerous to think of resources as distributed objects. A URI is a URI, and anybody in the world can refer to it or make statements about it.
4.) Annotations, corrections, etc. are valuable, and as such should be first class objects that are attributed to their issuers by the normal process of 2 above. This is annotation as linking; the subject of the annotation is the LSID of the annotation, it is attributed to the authority of the issuer, and an agreed upon predicate is used to define the object that acts as domain of an annotation (what it applies to). This guarantees correct attribution of annotations throughout the network.
I don't disagree that some annotations might need to be formal and separate structures. (In fact, we have a system built that can do this in a rather elegant fashion.)
5.) We cannot stop spammers from publishing malicious assertions but we can validate a given assertion acquired from some source external to the LSID system by resolving the LSID and checking the resulting metadata against the assertion. Spammers will find it difficult to circumvent this method because it requires hijacking the DNS system. If you trust that a source gave you a complete and full copy of an object you don't have to resolve its LSID.
I don't have much to offer here. This is something I look forward to discussing together.
6.) We will often have to use systems that are external to LSID in order to find the data/metadata we're interested in (at the very least to identify LSIDs to resolve). These other systems have to understand how to attribute assertions to their owners and can do so using the rules above without relying on metadata about a foreign authority.
Certainly we can use other systems for annotation discovery but why not use LSID mechanisms as much as possible?
7.) We don't have to assume (as FAN does) that an authority is responsible for providing all information in the universe that relates in any way to its LSIDs
Agreed, but the authoritative source is the best place to start.
8.) If you get back invalid metadata from an authority (not incorrect, but metadata that, for example, violates a functional property restriction), you automatically know who is at fault (the organization that owns the authority name).
With FAN, this works as well... when you download foreign metadata, you can validate it before merging with your main model.
9.) A FAN style notification system could still be used for logging and tracking the use of one's data or for proposing corrections to an issuer who might want to approve and incorporate those suggestions into their published data.
Neat idea!
I look forward to discussing this further with you.
- Ben
TDWG-GUID mailing list TDWG-GUID@mailman.nhm.ku.edu http://mailman.nhm.ku.edu/mailman/listinfo/tdwg-guid
participants (2)
- Benjamin H Szekely
- Steven Perry