I'd prefer to see highly structured information as the "default", with loosely structured "blobs" being optional, rather than the other way round.
This preference has been endorsed by several people, and the disagreement seems to come down to whether a character list is required or optional.
In my draft model, a character list is optional (but obviously pretty necessary if anyone wants to parse the description). If a character list is present, it can be used for validation; if one is absent, no validation can occur. Similarly, if a part of a description is tagged as a defined character, it's useable as such; if another part is not so tagged, it's not.
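To make the "validate only if a list is present" behaviour concrete, here is a minimal Python sketch. It is not part of any agreed model: the element name CHARACTER_LIST (with an underscore, so that it parses as XML), the attribute names, the function check_description, and the rule that a tagged character missing from the list counts as a failure are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def check_description(doc_xml):
    """Return (status, problems) for a document whose character list is optional.

    Assumed rule: untagged text is always acceptable; a tagged CHARACTER is a
    problem only when a list exists and does not define that character.
    """
    root = ET.fromstring(doc_xml)
    char_list = root.find("CHARACTER_LIST")  # hypothetical element name
    if char_list is None:
        # No list, so nothing to validate against.
        return "unvalidated", []
    defined = {c.get("Name") for c in char_list.findall("CHARACTER")}
    problems = []
    for desc in root.iter("DESCRIPTION"):
        for char in desc.iter("CHARACTER"):
            name = char.get("Character_Name")
            if name not in defined:
                problems.append(name)
    return ("invalid" if problems else "valid"), problems

WITH_LIST = """
<DOCUMENT>
  <CHARACTER_LIST>
    <CHARACTER Name="Leaves"/>
  </CHARACTER_LIST>
  <DESCRIPTION Name="Viola odorata">
    Perennial herb.
    <CHARACTER Character_Name="Leaves">present</CHARACTER>
    <CHARACTER Character_Name="scent">a marvelous perfume</CHARACTER>
  </DESCRIPTION>
</DOCUMENT>
"""

WITHOUT_LIST = """
<DOCUMENT>
  <DESCRIPTION Name="Viola eminens">
    Perennial herb spreading by stolons.
  </DESCRIPTION>
</DOCUMENT>
"""

print(check_description(WITH_LIST))
print(check_description(WITHOUT_LIST))
```

The untagged text ("Perennial herb.") passes through untouched in both cases; only the tagged parts are ever checked, and only when there is a list to check them against.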
Eric offered:
<DOCUMENT>
  <DESCRIPTION Taxon_Name = "Viola odorata">
    <CHARACTER type="defined" Character_Name = "Leaves">
      <STATE State_Name = "present"/>
    </CHARACTER>
    <CHARACTER type="arbitrary" Character_Name = "scent">
      a marvelous perfume on a perfect spring day
    </CHARACTER>
  </DESCRIPTION>
</DOCUMENT>
But this still requires atomisation of the description (the character "scent" has been atomised out of the description 'ground matrix'). Since we're agreed that an 'arbitrary' character is not usefully parseable, what's the point of atomising it?
How would this model accommodate my example:
<DOCUMENT>
  <DESCRIPTION Name = "Viola eminens">
    Perennial herb spreading by stolons; rootstock sometimes somewhat
    swollen and bulbous at the stem bases. Stems contracted so that the
    leaves form rosettes, never elongate with caulescent leaves. Leaves
    broad-reniform, the largest (10-)12-15(-25) mm long, (20-)25-35(-45) mm
    wide, 1.5-3.2 times wider than long, usually with a broad basal sinus;
    lamina with 9-20 +/- prominent teeth, glabrous or with scattered
    unicellular hairs on the upper surface, +/- concolorous bright green;
    petioles 2-8 cm long; stipules narrowly triangular, usually with several
    small, glandular teeth on each side. Flowers ... etc
  </DESCRIPTION>
</DOCUMENT>
if Thiele and Prober are not yet up to the stage of atomising the description. Under Eric's model, it seems to me that this would need to be:
<DOCUMENT>
  <CHARACTER_LIST>
    <CHARACTER Name = "Free text" Type = "arbitrary"/>
  </CHARACTER_LIST>
  <DESCRIPTION Name = "Viola eminens">
    <CHARACTER Character_Name = "Free text">
      Perennial herb spreading by stolons; rootstock sometimes somewhat
      swollen and bulbous at the stem bases. Stems contracted so that the
      leaves form rosettes, never elongate with caulescent leaves. Leaves
      broad-reniform, the largest (10-)12-15(-25) mm long, (20-)25-35(-45) mm
      wide, 1.5-3.2 times wider than long, usually with a broad basal sinus;
      lamina with 9-20 +/- prominent teeth, glabrous or with scattered
      unicellular hairs on the upper surface, +/- concolorous bright green;
      petioles 2-8 cm long; stipules narrowly triangular, usually with several
      small, glandular teeth on each side. Flowers ... etc
    </CHARACTER>
  </DESCRIPTION>
</DOCUMENT>
This amounts to tagging the blob as tagless.
In terms of a data model, tagging data adds information to it, so surely the untagged state is more basic (fundamental) than the tagged one. Whether our default behaviour should be to tag data is, to me, a sociological question, and should be handled outside the data model.
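One way to see the "untagged is more basic" point is that a partly tagged description still contains the blob. The Python sketch below assumes mixed content is allowed inside DESCRIPTION. The character names "Leaf shape" and "Petiole length" and the helpers plain_text and tagged_characters are hypothetical, chosen only for the example.

```python
import xml.etree.ElementTree as ET

# A description where only two fragments have been tagged so far.
PARTLY_TAGGED = (
    '<DESCRIPTION Name="Viola eminens">Perennial herb spreading by stolons. '
    'Leaves <CHARACTER Character_Name="Leaf shape">broad-reniform</CHARACTER>; '
    'petioles <CHARACTER Character_Name="Petiole length">2-8 cm long</CHARACTER>.'
    '</DESCRIPTION>'
)

def plain_text(description):
    """Drop all tags and return the underlying untagged description."""
    return "".join(description.itertext())

def tagged_characters(description):
    """Return only the fragments that someone has chosen to tag."""
    return {
        c.get("Character_Name"): (c.text or "")
        for c in description.iter("CHARACTER")
    }

desc = ET.fromstring(PARTLY_TAGGED)
print(plain_text(desc))
print(tagged_characters(desc))
```

Removing the tags always gets you back to the free text, whereas the reverse step needs someone to supply information, which is why the amount of tagging looks like a policy choice rather than something the model has to impose.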
Cheers - k