[tdwg-tapir] tapir metadata issues

Renato De Giovanni renato at cria.org.br
Wed Jul 4 18:15:43 CEST 2007


Hi all,

Let me share with you more thoughts about this subject using Wouter's
original message as a reference.

> dc:language is mandatory. What to do with data that
> is not language specific?
> Example: we are going to use Tapir for sharing lists of
> scientific names. Should the language be Latin in that case?
> We are thinking about specifying English (eng) as the default in
> that case. The recommendation is to use IANA Language subtags.
> Probably better to recommend the languages from ethnologue.org
> (3-letter abbreviations). This is because the data can be in many
> more languages than the IANA list covers, for instance common
> names in extinct languages. This is different from the xml:lang
> attribute, which is primarily for application development.

The fact that part of the data being exposed by a service contains
scientific names doesn't mean that a user needs to understand Latin to
make use of this content. In my opinion, dc:language should only be used
to indicate that users need to know one or more specific languages if they
want to understand the content being served. The best example in our case
would probably be species description data. In this case dc:language
should definitely be used to indicate the language in which species are
described.

If a service exposes only pure taxonomic data or just names, without any
kind of description, I would probably not specify any language as part of
the TAPIR metadata. This holds even if the content includes common names in
the most unusual languages, because names are essentially identifiers used
to designate entities.

However, when exposing common names associated with a taxon, I certainly
agree it's desirable to specify the language, but dc:language would not be
appropriate here, since it is only a general statement about the whole
content of the service. It would be necessary to have a specific concept
to indicate the language of each common name, and the content of this
concept could be IANA codes, Ethnologue codes, or any other option.
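
To make the distinction concrete, here is a minimal sketch of a per-name
language concept. The record layout, the field names and the example names
are purely illustrative assumptions; nothing here comes from an existing
TAPIR schema, and the codes could come from whichever list is chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonName:
    """Hypothetical record: one common name plus the language it is in."""
    name: str
    language: str  # code from whichever list is agreed on (IANA, Ethnologue, ...)


# Illustrative data only: two common names attached to the same taxon.
NAMES = [
    CommonName("red fox", "en"),
    CommonName("Rotfuchs", "de"),
]


def names_in(language, records):
    """Return the common names recorded for one language code."""
    return [r.name for r in records if r.language == language]
```

The point of the sketch is only that the language travels with each name,
so the service-level dc:language does not have to describe it.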

So now I think I agree with Markus that we could keep the existing
approach and force a specific language standard through the spec. This
standard could certainly be IANA, unless we expect services to provide
content (related to descriptions, explanations, etc.) in really unusual
languages.

By the way, even when the service content is not associated with any
particular language, we could keep dc:language as a mandatory element.
I've just discovered that the IANA code "zxx" means "No linguistic
content".

Would it be OK for everybody if we keep dc:language as a mandatory element,
but now unbounded (repeatable), and then require IANA codes through the spec?
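
As a rough sketch of what such a rule could mean for a client, the check
below accepts one or more dc:language elements and rejects values that do
not even look like a language tag. The "metadata" wrapper element and the
function name are assumptions made up for the example, not the actual TAPIR
response structure, and the pattern only tests the shape of a code; it does
not look anything up in the IANA registry.

```python
import re
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core elements namespace

# Loose shape check only: a 2-3 letter primary subtag, optionally followed
# by hyphen-separated subtags of 1-8 alphanumerics.
TAG_SHAPE = re.compile(r"^[A-Za-z]{2,3}(-[A-Za-z0-9]{1,8})*$")


def language_problems(xml_text):
    """Return a list of problems with dc:language (empty list means OK)."""
    root = ET.fromstring(xml_text)
    values = [(el.text or "").strip()
              for el in root.iter(f"{{{DC_NS}}}language")]
    if not values:
        # Mandatory: at least one dc:language must be present.
        return ["dc:language is missing"]
    # Unbounded: any number of values, each must look like a language tag.
    return [f"malformed language code: {v!r}"
            for v in values if not TAG_SHAPE.match(v)]


# Hypothetical metadata fragments, for illustration only.
DESCRIPTIONS = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:language>en</dc:language>
  <dc:language>pt</dc:language>
</metadata>"""

NAMES_ONLY = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:language>zxx</dc:language>
</metadata>"""
```

A names-only service would pass this check with the single value "zxx",
and a service with descriptions in several languages would simply repeat
the element.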

Best Regards,
--
Renato


> I think it may not be enough. ISO 639-2 (3-letter codes) lists about 500
> languages, if I am right, and Ethnologue about 7000. The data can be in any
> language or dialect, especially common names or herbal information. The
> Ethnologue 3-letter code list has the advantage of having a link between
> languages and countries, although the ISO country list they use is not
> completely up to date. Usually I prefer ISO standards, but in this case I
> am not sure.
>
> Wouter
>
> ----- Original Message -----
> From: "Döring, Markus" <m.doering at BGBM.org>
> To: "Wouter Addink" <wouter at eti.uva.nl>; <tdwg-tapir at lists.tdwg.org>
> Sent: Wednesday, July 04, 2007 12:08 PM
> Subject: Re: [tdwg-tapir] tapir metadata issues
>
>
> Isn't RFC 3066, as used by XML Schema, enough?
> Any arguments against it?
>
> RFC 3066 specifies the primary language to be ISO 639-2.
> The Library of Congress, maintainers of ISO 639-2, has made the list of
> languages registered available on the Internet. It can be found at
>
> http://www.loc.gov/standards/iso639-2/langhome.html
> http://www.w3.org/TR/xmlschema-2/#language
> http://www.ietf.org/rfc/rfc3066.txt




