[tdwg-tapir] tapir metadata issues
Hi, I am currently implementing Tapir metadata responses and have some issues.
- Indexing preferences: the documentation states @Frequency (starting with uppercase), but the schema uses @frequency. I assume I have to implement @frequency.
- dc:language is mandatory. What should we do with data that is not language specific? Example: we are going to use Tapir for sharing lists of scientific names. Should the language be Latin in that case? We are thinking about specifying English (eng) as the default in that case. The recommendation is to use IANA Language subtags. It would probably be better to recommend the languages from ethnologue.org (3-letter abbreviations), because the data can be in many more languages than the IANA languages cover, for instance common names in extinct languages. This is different from the xml:lang attribute, which is primarily for application development.
- The example for an accesspoint is http://example.net/tapir.cgi. I think this may be misleading; it would perhaps be better to use http://example.net/tapir.cgi/yourdatasource as the example.
Wouter
Hi Wouter,
Many thanks for raising these issues. I'll comment on each one below.
- Indexing preferences: the documentation states @Frequency (starting with uppercase), but the schema uses @frequency. I assume I have to implement @frequency.
You're right. I've just fixed the specification and I plan to release a new version in the coming weeks.
- dc:language is mandatory. What should we do with data that is not language specific? Example: we are going to use Tapir for sharing lists of scientific names. Should the language be Latin in that case? We are thinking about specifying English (eng) as the default in that case. The recommendation is to use IANA Language subtags. It would probably be better to recommend the languages from ethnologue.org (3-letter abbreviations), because the data can be in many more languages than the IANA languages cover, for instance common names in extinct languages. This is different from the xml:lang attribute, which is primarily for application development.
I agree with you, and I think it should not be a problem for the existing applications if we make these changes:
- Make dc:language an optional element.
- Change the cardinality of dc:language to "unbounded".
- Change the recommendation about the content of dc:language by including ethnologue codes as another option (probably the main option). Note that it will still be just a recommendation, not a normative statement.
I'll wait to see if there's some feedback about this before making the changes.
- The example for an accesspoint is http://example.net/tapir.cgi. I think this may be misleading; it would perhaps be better to use http://example.net/tapir.cgi/yourdatasource as the example.
I think it's OK to keep this one as it is. A TAPIR accesspoint is just a URL, so it's really up to the provider to decide which pattern to use.
Thanks again, -- Renato
Renato, thanks for your comments.
- Make dc:language an optional element.
- Change the cardinality of dc:language to "unbounded".
- Change the recommendation about the content of dc:language by including ethnologue codes as another option (probably the main option). Note that it will still be just a recommendation, not a normative statement.
Ok. Perhaps we should also add an optional attribute for specifying the code standard used, if any? I think that should not affect current implementations. The problem is that you cannot do anything with an abbreviation if you do not know what it means. Making assumptions can be dangerous. For instance, you could assume that "SW" means Swedish, or that it means Swahili. If you know that it is an IANA subtag, you can use it, and you can also raise an error if an abbreviation is not present in the standard used.
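The point about needing to know the code standard can be illustrated with a small Python sketch. The `CODE_LISTS` table and the `resolve_language` helper are hypothetical, the keys "iso639-1" and "iso639-2" are labels chosen only for this illustration, and the tables hold just a few entries; a real implementation would load complete code lists.

```python
# Hypothetical, deliberately tiny excerpts of two code lists.
# A real implementation would load the full registries.
CODE_LISTS = {
    "iso639-1": {"sv": "Swedish", "sw": "Swahili"},
    "iso639-2": {"swe": "Swedish", "swa": "Swahili", "zxx": "No linguistic content"},
}


def resolve_language(code, standard):
    """Return the language name for `code` under the declared `standard`.

    Raises ValueError when the standard is unknown or the code is not
    part of it, instead of guessing what the abbreviation means.
    """
    try:
        codes = CODE_LISTS[standard]
    except KeyError:
        raise ValueError(f"unknown code standard: {standard!r}")
    key = code.lower()
    if key not in codes:
        raise ValueError(f"{code!r} is not a valid {standard} code")
    return codes[key]
```

With the standard declared, `resolve_language("SW", "iso639-1")` resolves to Swahili instead of relying on a guess, while `resolve_language("SW", "iso639-2")` raises an error because "SW" is not a 3-letter code in that list.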
Another comment about the Tapir metadata: when giving courses on installing Tapirlink, I noticed that none of the roughly 10 (Dutch) students could figure out by themselves what 'relatedEntity' means. They all needed help with that. Perhaps the documentation of that element should be expanded?
Cheers, Wouter
Greetings,
I understand making dc:language optional, but I'd be really concerned about allowing the language code to come from different standards. The "SW" example Wouter points out would be a real concern. Can we have multiple language elements, each of which is tied to a specific language code standard? That way we cannot make the type of mistake where "SW" is misinterpreted as Swedish or Swahili.
Thanks, Jim
Jim Graham Natural Resource Ecology Laboratory Colorado State University Fort Collins, CO 80524 jim@nrel.colostate.edu 970-491-0410
Hi there,
This really is weird. I was confident that the XML Schema language data type was used. It defines natural language identifiers as defined by RFC 3066.
http://www.w3.org/TR/xmlschema-2/#language
http://www.ietf.org/rfc/rfc3066.txt
But when I looked up the TAPIR schema, I found that it makes use of the Dublin Core schema, which states the following:
<xs:element name="language" substitutionGroup="any"/>
<xs:element name="any" type="SimpleLiteral" abstract="true"/>
<xs:complexType name="SimpleLiteral">
  <xs:complexContent mixed="true">
    <xs:restriction base="xs:anyType">
      <xs:sequence>
        <xs:any processContents="lax" minOccurs="0" maxOccurs="0"/>
      </xs:sequence>
      <xs:attribute ref="xml:lang" use="optional"/>
    </xs:restriction>
  </xs:complexContent>
</xs:complexType>
So you can use anything for the dc:language element AND tag it with an optional xml:lang attribute. That's weird:
<dc:language xml:lang="en">swuaheli</dc:language>
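To make the two layers in that example concrete, here is a minimal Python sketch that reads the element content and the xml:lang attribute separately. It only checks well-formedness (the standard library parser does not validate against a schema), the `read_language` helper is hypothetical, and the `xmlns:dc` declaration is added so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# The example element from the message, with the dc namespace declared.
snippet = f'<dc:language xmlns:dc="{DC_NS}" xml:lang="en">swuaheli</dc:language>'


def read_language(xml_text):
    """Return (content, xml_lang) for a single dc:language element."""
    element = ET.fromstring(xml_text)
    if element.tag != f"{{{DC_NS}}}language":
        raise ValueError(f"not a dc:language element: {element.tag}")
    # xml:lang describes the language of the element text itself,
    # not the language of the data the service exposes.
    return element.text, element.get(f"{{{XML_NS}}}lang")


content, xml_lang = read_language(snippet)
```

Here `content` is the free-text value and `xml_lang` is the tag describing that text, which is why the combination looks odd when the content is itself meant to be a language code.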
Markus
Hi Markus,
I agree it's unusual, but I think the DC authors were also considering the possibility of having the content of dc:language in natural language. You can see an example in their own documentation:
http://dublincore.org/2006/12/18/dces.rdf#
Anyway, the documentation also says that "Recommended best practice is to use a controlled vocabulary such as RFC 3066 [RFC3066]", which is what you're suggesting.
I'll try to list some alternatives in another message.
Regards, -- Renato
Hi all,
I see the following alternatives to the language issue:
1) Indicate through the specification one particular standard to be used by dc:language.
or
2) Include dc:language elements inside a new element with an attribute indicating the standard being used, such as:
<contentLanguages standard="ethnologue">
  <dc:language>aaa</dc:language>
  <dc:language>aab</dc:language>
</contentLanguages>
Where "standard" could be an extensible controlled vocabulary.
or
3) Extend the dc:language type so that it accepts a similar "standard" attribute.
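As a rough illustration of how a client might consume option 2, the following Python sketch reads the wrapper element and its "standard" attribute. The `read_content_languages` helper is hypothetical, and placing contentLanguages in no namespace is an assumption made only so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Option 2 as proposed, with the dc namespace declared. Leaving
# contentLanguages without a namespace is an assumption of this sketch.
fragment = f"""
<contentLanguages xmlns:dc="{DC_NS}" standard="ethnologue">
  <dc:language>aaa</dc:language>
  <dc:language>aab</dc:language>
</contentLanguages>
"""


def read_content_languages(xml_text):
    """Return (standard, [codes]) from a contentLanguages wrapper."""
    root = ET.fromstring(xml_text)
    standard = root.get("standard")
    if standard is None:
        raise ValueError("contentLanguages must declare a standard")
    codes = [
        (element.text or "").strip()
        for element in root.findall(f"{{{DC_NS}}}language")
    ]
    return standard, codes
```

Because the standard is declared once on the wrapper, every dc:language child is interpreted against the same code list, which matches the idea that mixing standards within one response is not needed.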
Are there other alternatives we should consider?
I think the requirements are that:
* Language can be optional.
* There can be multiple languages.
* We must somehow know what is the standard used for the language.
I don't think it would be necessary to allow multiple language elements where each one could be potentially related to different standards.
I don't have strong feelings about this, although I would be more inclined to choose option 2. Option 1 would have less impact on existing implementations and installations, but we would need to be sure that the standard we choose really covers all needs.
What do you think?
Regards, -- Renato
I would prefer 1 but if it will be hard for a single standard to meet everyone's needs then I'd support 2. Any of the options would be preferred to the existing situation.
Jim
I can't see why we shouldn't mandate one specific standard. One variable less. I would vote for option #1.
Markus
If we are confident we have a standard that suits all, I have nothing against it.
Wouter
Isn't RFC 3066, as used by XML Schema, enough? Any arguments against it?
RFC 3066 takes its primary language subtags from ISO 639: the two-letter ISO 639-1 code where one exists, otherwise the three-letter ISO 639-2 code. The Library of Congress, maintainers of ISO 639-2, has made the list of registered languages available on the Internet. It can be found at
http://www.loc.gov/standards/iso639-2/langhome.html http://www.w3.org/TR/xmlschema-2/#language http://www.ietf.org/rfc/rfc3066.txt
I think it may not be enough. ISO 639-2 (3-letter codes) lists about 500 languages, if I am right; Ethnologue lists about 7000. The data can be in any language or dialect, especially common names or herbal information. The Ethnologue 3-letter code list has the advantage of having a link between languages and countries, although the ISO country list they use is not completely up to date. Usually I prefer ISO standards, but in this case I am not sure.
Wouter
Hi all,
Let me share with you more thoughts about this subject using Wouter's original message as a reference.
dc:language is mandatory. What should we do with data that is not language specific? Example: we are going to use Tapir for sharing lists of scientific names. Should the language be Latin in that case? We are thinking about specifying English (eng) as the default in that case. The recommendation is to use IANA Language subtags. It would probably be better to recommend the languages from ethnologue.org (3-letter abbreviations), because the data can be in many more languages than the IANA languages cover, for instance common names in extinct languages. This is different from the xml:lang attribute, which is primarily for application development.
The fact that part of the data being exposed by a service contains scientific names doesn't mean that a user needs to understand Latin to make use of this content. In my opinion, dc:language should only be used to indicate that users need to know one or more specific languages if they want to understand the content being served. The best example in our case would probably be species description data. In this case dc:language should definitely be used to indicate the language in which species are described.
If a service exposes only pure taxonomic data or just names, without any kind of description, I would probably not specify any language as part of TAPIR metadata, even if the content includes common names in the most unusual languages, because names are essentially identifiers used to designate entities.
However, when exposing common names associated with a taxon, I certainly agree it's desirable to specify the language, but dc:language would not be appropriate here since it's just a general reference about the whole content of the service. It would be necessary to have a specific concept to indicate the language for each common name, and the content of this concept could be IANA codes, ethnologue, or any other option.
So now I think I agree with Markus that we could keep the existing approach and force a specific language standard through the spec. This standard could certainly be IANA, unless we expect services to provide content (related to descriptions, explanations, etc.) in really unusual languages.
By the way, even when the service content is not associated with any particular language, we could keep dc:language as a mandatory element. I've just discovered that the IANA code "zxx" means "No linguistic content".
Would it be OK for everybody if we keep dc:language a mandatory element, but now unbounded, and then force through the spec the use of IANA codes?
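A minimal sketch of what this proposal could look like on the provider side, in Python: one dc:language element per code, with a fallback to "zxx" when the content has no associated language. The `language_elements` helper is hypothetical and does not reflect how any existing TAPIR provider builds its metadata.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
NO_LINGUISTIC_CONTENT = "zxx"  # code mentioned above for "No linguistic content"


def language_elements(codes):
    """Build one dc:language element per code.

    Falls back to a single "zxx" element when no code is given, so the
    element can stay mandatory while also being unbounded.
    """
    codes = list(codes) or [NO_LINGUISTIC_CONTENT]
    elements = []
    for code in codes:
        element = ET.Element(f"{{{DC_NS}}}language")
        element.text = code
        elements.append(element)
    return elements
```

A names-only service would call `language_elements([])` and emit a single "zxx" element, while a service with species descriptions in two languages would pass both codes and get two elements.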
Best Regards, -- Renato
Everything Renato is saying sounds right on and I think the invasive community will be able to agree on a language standard as long as it covers enough languages.
- The ISO 639-2 (3-letter) codes should work, but the 2-letter codes miss some language distinctions.
- The only IANA I am aware of is the "Internet Assigned Numbers Authority", and its codes are country codes rather than language codes. Did I miss something here?
- I'm not as familiar with "Ethnologue", and since it is not ISO, it may be harder to sell.
My only other comment is to restate that scientific names should be treated as language-independent. I see them as codes for taxa (they used to be Latin, but that has changed over the years; see the Chinese dinosaur "Tsintaosaurus").
Thanks, Jim
Hi, I agree on all points with Renato. It is good to know that there is an IANA code for 'no linguistic content'. It would be good to add that to the documentation, because it is not obvious that such a code exists. There is even a code for 'undetermined', if I remember correctly. I also agree with Jim that scientific names should not be treated as linguistic content.
Wouter
----- Original Message ----- From: "Jim Graham" jim@nrel.colostate.edu To: "'Renato De Giovanni'" renato@cria.org.br; tdwg-tapir@lists.tdwg.org Sent: Wednesday, July 04, 2007 7:16 PM Subject: RE: [tdwg-tapir] tapir metadata issues
Everything Renato is saying sounds right on and I think the invasive community will be able to agree on a language standard as long as it covers enough languages.
- The ISO 639-2 (3 letter codes) should work but the 2-letter codes miss some language distinctions.
- The only IANA codes I am aware of are the "Internet Assigned Numbers Authority" and these are country codes rather than language codes - did I miss something here?
- I'm not as familiar with "Ethnologue" and since it is not ISO, it may be harder to sell
My only other comment is to restate that scientific names should be treated as language-independent. I see them as codes for taxa (they used to be Latin but that has changed over the years - see the Chinese dinosaur "Tsintaosaurus").
Thanks, Jim
-----Original Message----- From: tdwg-tapir-bounces@lists.tdwg.org [mailto:tdwg-tapir-bounces@lists.tdwg.org] On Behalf Of Renato De Giovanni Sent: Wednesday, July 04, 2007 10:16 AM To: tdwg-tapir@lists.tdwg.org Subject: Re: [tdwg-tapir] tapir metadata issues
Hi all,
Let me share with you more thoughts about this subject using Wouter's original message as a reference.
dc:language is mandatory. What to do with data that is not language specific? Example: we are going to use Tapir for sharing lists of scientific names. Should the language be Latin in that case? We are thinking about specifying English (eng) as the default in that case. The recommendation is to use IANA Language subtags. It is probably better to recommend the languages from ethnologue.org (3-letter abbreviations), because the data can be in many more languages than the IANA languages, for instance common names in extinct languages. This is different from the xml:lang attribute, which is primarily for application development.
The fact that part of the data being exposed by a service contains scientific names doesn't mean that a user needs to understand Latin to make use of this content. In my opinion, dc:language should only be used to indicate that users need to know one or more specific languages if they want to understand the content being served. The best example in our case would probably be species description data. In this case dc:language should definitely be used to indicate the language in which species are described.
If a service exposes only pure taxonomic data or just names, without any kind of description, I would probably not specify any language as part of TAPIR metadata. This holds even if the content includes common names in the most unusual languages, because names are essentially identifiers used to designate entities.
However, when exposing common names associated with a taxon, I certainly agree it's desirable to specify the language, but dc:language would not be appropriate here since it's just a general reference about the whole content of the service. It would be necessary to have a specific concept to indicate the language for each common name, and the content of this concept could be IANA codes, Ethnologue codes, or any other option.
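To make the distinction concrete, the following Python sketch attaches a language code to each common name rather than to the service as a whole. The class and field names (CommonName, Taxon, names_by_language) are hypothetical and do not come from any TAPIR schema; the codes shown are two-letter codes purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CommonName:
    name: str
    language: str  # code from whichever scheme is eventually agreed on


@dataclass
class Taxon:
    scientific_name: str  # treated as a language-independent identifier
    common_names: list = field(default_factory=list)

    def names_by_language(self):
        """Group the common names of this taxon by their language code."""
        grouped = defaultdict(list)
        for common_name in self.common_names:
            grouped[common_name.language].append(common_name.name)
        return dict(grouped)


lion = Taxon(
    "Panthera leo",
    [CommonName("lion", "en"), CommonName("leeuw", "nl"), CommonName("leão", "pt")],
)
```

Here the scientific name carries no language at all, while a service-level dc:language would only be needed if the service also served descriptive text.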
So now I think I agree with Markus that we could keep the existing approach and force a specific language standard through the spec. This standard could certainly be IANA, unless we expect services to provide content (related to descriptions, explanations, etc.) in really unusual languages.
By the way, even when the service content is not associated with any particular language, we could keep dc:language as a mandatory element. I've just discovered that the IANA code "zxx" means "No linguistic content".
Would it be OK for everybody if we keep dc:language a mandatory element, but now unbounded, and then force through the spec the use of IANA codes?
Best Regards, -- Renato
Jim,
Sorry I didn't answer your message before.
What I've just included in the specification is that dc:language in TAPIR metadata responses must follow RFC 4646 (http://www.rfc-editor.org/rfc/rfc4646.txt) and therefore use language codes specified by the IANA Language Subtag Registry (http://www.iana.org/assignments/language-subtag-registry). So yes, it's the same IANA that you were thinking of.
I hope these are the correct and most up-to-date references for what we need.
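For anyone wanting to check dc:language values against the registry, here is a minimal Python sketch. It assumes the registry's record layout of "%%"-separated records with "Type:" and "Subtag:" fields; the SAMPLE text is a hand-abbreviated excerpt (other fields omitted), and the function names are hypothetical. It ignores folded continuation lines and subtag ranges, and it only checks the primary language subtag, so it is not a full RFC 4646 validator.

```python
# Hand-abbreviated excerpt in the registry's record layout (fields omitted).
SAMPLE = """\
%%
Type: language
Subtag: en
Description: English
%%
Type: language
Subtag: und
Description: Undetermined
%%
Type: language
Subtag: zxx
Description: No linguistic content
%%
Type: region
Subtag: BR
Description: Brazil
"""


def language_subtags(registry_text):
    """Collect the subtags of all records whose Type is 'language'."""
    subtags = set()
    for record in registry_text.split("%%"):
        fields = {}
        for line in record.strip().splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields.get("Type") == "language" and "Subtag" in fields:
            subtags.add(fields["Subtag"].lower())
    return subtags


def has_registered_primary_language(tag, subtags):
    """Check only the first subtag of a tag such as 'en-BR'."""
    return tag.split("-")[0].lower() in subtags


KNOWN = language_subtags(SAMPLE)
```

In practice the full registry file would be read instead of SAMPLE, for example has_registered_primary_language("zxx", KNOWN) for a names-only service.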
Best Regards, -- Renato
Hi Wouter,
Another comment about the Tapir metadata: when giving courses on installing Tapirlink, I noticed that none of the roughly 10 (Dutch) students could figure out by themselves what 'relatedEntity' means. They all needed help with that. Perhaps the documentation of that element should be expanded?
I agree that the element name is a bit vague, but element and attribute names don't have to be human-readable and changing them at this stage would be complicated.
The documentation about this element says:
"A Related Entity can be for example the organisation or group that is hosting the service, providing the data, sponsoring the network, etc. This allows acknowledgement to any kind of organisation or even person that is somehow related to the service."
Please let me know if you have more ideas about how to improve it. You can contact me directly about this.
Thanks again, -- Renato
Hi, It would be nice to have a custom slot for contacts, for instance to include a personal website URL.
Wouter
Wouter,
I think it should be no problem to add this custom slot. I'll take the liberty to do this now, but I'll wait a bit more before publishing the schema until we decide about the dc:language issue.
Regards, -- Renato
Wouter, Renato, A personal website is part of vCard. If we do not have that already, I'd suggest including more optional vCard elements. Having a custom slot is good anyway, so no need to change that back.
Markus
On 03.07.2007 at 5:14, "Renato De Giovanni" (renato@cria.org.br) wrote:
Wouter,
I think it should be no problem to add this custom slot. I'll take the liberty to do this now, but I'll wait a bit more before publishing the schema until we decide about the dc:language issue.
Regards,
Renato
Good suggestion,
Wouter
----- Original Message ----- From: "Döring, Markus" m.doering@BGBM.org To: "Renato De Giovanni" renato@cria.org.br; tdwg-tapir@lists.tdwg.org Sent: Wednesday, July 04, 2007 9:50 AM Subject: Re: [tdwg-tapir] tapir metadata issues
Wouter, Renato, A personal website is part of vCard. If we do not have that already, I'd suggest including more optional vCard elements. Having a custom slot is good anyway, so no need to change that back.
Markus
participants (4)
- Döring, Markus
- Jim Graham
- Renato De Giovanni
- Wouter Addink