[tdwg-tapir] tapir metadata issues

Jim Graham jim at nrel.colostate.edu
Wed Jul 4 19:16:51 CEST 2007

Everything Renato is saying sounds right on and I think the invasive
community will be able to agree on a language standard as long as it covers
enough languages.

- ISO 639-2 (3-letter codes) should work, but the 2-letter codes miss
some language distinctions.
- The only IANA codes I am aware of are the "Internet Assigned Numbers
Authority" and these are country codes rather than language codes - did I
miss something here?
- I'm not as familiar with "Ethnologue", and since it is not an ISO
standard, it may be harder to sell.

My only other comment is to restate that scientific names should be treated
as language-independent.  I see them as codes for taxa (they used to be
Latin but that has changed over the years - see the Chinese dinosaur


-----Original Message-----
From: tdwg-tapir-bounces at lists.tdwg.org
[mailto:tdwg-tapir-bounces at lists.tdwg.org] On Behalf Of Renato De Giovanni
Sent: Wednesday, July 04, 2007 10:16 AM
To: tdwg-tapir at lists.tdwg.org
Subject: Re: [tdwg-tapir] tapir metadata issues

Hi all,

Let me share with you more thoughts about this subject using Wouter's
original message as a reference.

> dc:language is mandatory. What to do with data that is not language
> specific?
> Example: we are going to use TAPIR for sharing lists of scientific
> names. Should the language be Latin in that case?
> We are thinking about specifying English (eng) as the default in that
> case. The recommendation is to use IANA Language subtags.
> It is probably better to recommend the languages from ethnologue.org
> (3-letter abbreviations), because the data can be in many more
> languages than the IANA list covers, for instance common names in
> extinct languages. This is different from the xml:lang attribute,
> which is primarily for application development.

The fact that part of the data being exposed by a service contains
scientific names doesn't mean that a user needs to understand Latin to make
use of this content. In my opinion, dc:language should only be used to
indicate that users need to know one or more specific languages if they want
to understand the content being served. The best example in our case would
probably be species description data. In this case dc:language should
definitely be used to indicate the language in which species are described.

If a service exposes only pure taxonomic data or just names, without any
kind of description, I would probably not specify any language as part of
TAPIR metadata. This holds even if the content includes common names in the
most unusual languages, because names are essentially identifiers used to
designate entities.

However, when exposing common names associated with a taxon, I certainly
agree it's desirable to specify the language, but dc:language would not be
appropriate here since it's just a general reference about the whole content
of the service. It would be necessary to have a specific concept to indicate
the language for each common name, and the content of this concept could be
IANA codes, Ethnologue, or any other option.
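
To make the per-name idea concrete, here is a minimal sketch in Python. The
class and field names (CommonName, TaxonRecord, language) are hypothetical
and are not TAPIR concepts; the codes shown are 3-letter codes used only as
an illustration, since the choice of code list is still open.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CommonName:
    """A vernacular name tagged with its own language code (hypothetical)."""
    name: str
    language: str  # code from whichever list the spec eventually picks


@dataclass
class TaxonRecord:
    """A taxon with a language-independent scientific name (hypothetical)."""
    scientific_name: str  # treated as an identifier, so no language attached
    common_names: List[CommonName] = field(default_factory=list)

    def names_in(self, language: str) -> List[str]:
        """Return the common names recorded for one language code."""
        return [cn.name for cn in self.common_names if cn.language == language]


record = TaxonRecord(
    scientific_name="Panthera onca",
    common_names=[
        CommonName("jaguar", "eng"),
        CommonName("onça-pintada", "por"),
    ],
)
```

Here record.names_in("por") selects only the Portuguese name, while the
scientific name stays untagged.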

So now I think I agree with Markus that we could keep the existing approach
and force a specific language standard through the spec. This standard could
certainly be IANA, unless we expect services to provide content (related to
descriptions, explanations, etc.) in really unusual languages.

By the way, even when the service content is not associated with any
particular language, we could keep dc:language as a mandatory element.
I've just discovered that the IANA code "zxx" means "No linguistic content".

Would it be OK for everybody if we keep dc:language a mandatory element, but
now unbounded, and then force through the spec the use of IANA codes?
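
As a rough sketch of what that rule could look like on the consuming side,
the Python below checks a list of dc:language values. It assumes the values
arrive as plain strings; the function names are hypothetical, and the check
only tests the shape of the primary subtag (2 or 3 letters), it does not
look anything up in the IANA registry.

```python
import re

# Shape of a primary language subtag: 2 or 3 ASCII letters (e.g. "en", "eng").
# Assumption: a shape check only, with no lookup against the IANA registry.
_PRIMARY_SUBTAG = re.compile(r"^[A-Za-z]{2,3}$")

# Code mentioned in the thread for "No linguistic content".
NO_LINGUISTIC_CONTENT = "zxx"


def check_dc_language(values):
    """Return a list of problems found in a service's dc:language values."""
    problems = []
    if not values:
        # Mandatory element: at least one value must be present.
        problems.append("dc:language is mandatory: at least one value is required")
    for value in values:
        # Only the part before the first hyphen is inspected.
        primary = value.split("-")[0]
        if not _PRIMARY_SUBTAG.match(primary):
            problems.append(f"not a well-formed language code: {value!r}")
    return problems


def has_linguistic_content(values):
    """True if any declared language is something other than 'zxx'."""
    return any(v.lower() != NO_LINGUISTIC_CONTENT for v in values)
```

A names-only service could then declare ["zxx"], which passes the check
while has_linguistic_content(["zxx"]) returns False.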

Best Regards,

> I think it may not be enough.  ISO 639-2 (3 letter codes) lists about 
> 500 languages if I am right. Ethnologue about 7000. The data can be in 
> any language or dialect, especially common names or herbal 
> information. The Ethnologue 3-letter code list has the advantage of
> having a link between languages and countries, although the ISO
> countries list they use is not completely up to date. Usually I prefer
> ISO standards, but in this case I am not sure.
> Wouter
> ----- Original Message -----
> From: "Döring, Markus" <m.doering at BGBM.org>
> To: "Wouter Addink" <wouter at eti.uva.nl>; <tdwg-tapir at lists.tdwg.org>
> Sent: Wednesday, July 04, 2007 12:08 PM
> Subject: Re: [tdwg-tapir] tapir metadata issues
> Isn't rfc3066 as used by xml schema enough?
> Any arguments against it?
> RFC3066 specifies the primary language to be ISO 639-2.
> The Library of Congress, maintainers of ISO 639-2, has made the list 
> of languages registered available on the Internet. It can be found at
> http://www.loc.gov/standards/iso639-2/langhome.html
> http://www.w3.org/TR/xmlschema-2/#language
> http://www.ietf.org/rfc/rfc3066.txt
