[tdwg-tag] character encodings

Mon Jan 11 03:27:39 CET 2010

Hi TAGgers

My recommendation is to always declare the character set explicitly in the <head> of the HTML document. This avoids many of these rendering problems, because the browser no longer has to guess the encoding. This article is probably the bible on character sets and their correct usage, including in HTML: http://www.joelonsoftware.com/articles/Unicode.html.

Once this is sorted out, either:

a. Use UTF-8 as the character set and supply characters as-is, e.g. put the actual u-umlaut character (ü), encoded as UTF-8, in the document where required.

b. (If item a above fails.) Use HTML entities whenever a non-ASCII character is required, e.g. the text "&uuml;" in place of the actual u-umlaut character.
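
To make the difference between items a and b concrete, here is a minimal Python sketch. The function names are hypothetical and only for illustration; html.entities.codepoint2name is the standard-library table of HTML 4 entity names, and characters without a named entity fall back to a numeric reference.

```python
import html.entities


def as_utf8_bytes(text):
    """Item a: keep the characters as-is and serve the document as UTF-8."""
    return text.encode("utf-8")


def as_ascii_with_entities(text):
    """Item b: replace every non-ASCII character with an HTML entity."""
    parts = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 128:
            # Plain ASCII passes through unchanged.
            parts.append(ch)
        elif code_point in html.entities.codepoint2name:
            # Named entity where HTML 4 defines one, e.g. &uuml;
            parts.append("&" + html.entities.codepoint2name[code_point] + ";")
        else:
            # Otherwise a numeric character reference, e.g. &#256;
            parts.append("&#%d;" % code_point)
    return "".join(parts)
```

The output of item b is pure ASCII, so it survives a stack that mangles encodings, at the cost of being harder to read and search in its raw form.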

I have found that, in practice, item a is really only possible if you control the character set at every stage of your software stack: the text editor used by someone sending you an XML file; the parser you use to parse that file; the database connection and client you use to write the data to a database; the database connection and client you use to read it back; the Content-Type emitted by the server and/or by your web pages when a user's web browser requests them; and finally the user's web browser, which, if old enough, might not handle UTF-8 data properly despite all your effort.
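
As a rough sketch of what "controlling each stage" can look like, the Python example below walks one label through a parser, a database, and an HTTP response. Everything specific here is an assumption for illustration: the function name round_trip, the element names <state> and <label>, the table name, and the use of an in-memory SQLite database stand in for whatever your real stack uses.

```python
import html
import sqlite3
import xml.etree.ElementTree as ET


def round_trip(xml_bytes):
    """Carry one label from XML input to an HTML response (illustrative only)."""
    # 1. Parser: pass raw bytes so the encoding named in the XML
    #    declaration is honoured, instead of decoding by guesswork first.
    root = ET.fromstring(xml_bytes)
    label = root.findtext("label")

    # 2. Database: write and read the value as text through bound
    #    parameters; the driver handles the storage encoding.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE states (label TEXT)")
    conn.execute("INSERT INTO states (label) VALUES (?)", (label,))
    (stored,) = conn.execute("SELECT label FROM states").fetchone()
    conn.close()

    # 3. Server: state the charset in the Content-Type header and in the
    #    <head>, and encode the body with that same charset.
    content_type = "text/html; charset=utf-8"
    page = (
        '<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8"></head>'
        "<body>" + html.escape(stored) + "</body></html>"
    )
    return content_type, page.encode("utf-8")
```

The input may arrive in another encoding (ISO-8859-1, say) as long as its XML declaration says so; from the parser onward the value is ordinary Unicode text, and it is encoded exactly once, on the way out.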

My recommendation here is to code for the future, as old browsers get replaced reasonably quickly.

Cheers,
Ben

-----Original Message-----
From: tdwg-tag-bounces at lists.tdwg.org [mailto:tdwg-tag-bounces at lists.tdwg.org] On Behalf Of Bob Morris
Sent: Tuesday, 22 December 2009 21:58
To: Gregor Hagedorn
Cc: tdwg-content at lists.tdwg.org; Hong cui; Terry Catapano; Technical Architecture Group mailing list
Subject: Re: [tdwg-tag] character encodings

Yes, you are quite right that XML should take care of it.  And also
right that using Java Readers and Writers addresses the
well-formedness issue.  Generally, XML management is quite easy in
Java 6.  However, my experience has been that some browsers do not
render some characters (or encodings?) perfectly and so I wondered
whether there are recommendations.
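
The point about using writers rather than handwritten strings can be sketched as follows. This is a Python analogue, not the Java Readers and Writers mentioned above, and the function name and the element names <state> and <label> are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def build_state_xml(label):
    """Serialize one label with an XML writer rather than string concatenation."""
    root = ET.Element("state")
    # The writer escapes markup characters such as < and & on output.
    ET.SubElement(root, "label").text = label
    buffer = io.BytesIO()
    # Ask for an XML declaration so the encoding is stated in the file itself.
    ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
    return buffer.getvalue()
```

A label such as "a < b & 5°" comes out well-formed and reads back unchanged, which is exactly what string concatenation tends to get wrong sooner or later.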

On Tue, Dec 22, 2009 at 5:08 AM, Gregor Hagedorn <g.m.hagedorn at gmail.com> wrote:
> Hi Bob
>
> is that necessary with xml? XML defines a set of allowable character
> encodings and any xml-processor "should" be able to read them.
>
> Problems we experienced in Key to Nature really go back to Java
> programmers treating XML output as the production of strings in
> handwritten code, rather than using XML readers or writers. The result
> was generally (sooner or later...) non-well-formed XML.
>
> Gregor
>
> 2009/12/21 Bob Morris <morris.bob at gmail.com>:
>> Does SDD (resp. tdwg) have any best practice about character encodings
>> and character sets in XML?  The sad fact of life is that many printed,
>> and word processing, systematics and ecology documents have odd
>> symbols even in characters and states, e.g. the degree sign, +- sign,
>> the circled 'x', em-dashes, ..., which are not always rendered by
>> browsers.
>

-- 
Robert A. Morris
Professor of Computer Science (nominally retired)
UMASS-Boston
100 Morrissey Blvd
Boston, MA 02125-3390
Associate, Harvard University Herbaria
email: ram at cs.umb.edu
web: http://bdei.cs.umb.edu/
web: http://etaxonomy.org/FilteredPush
http://www.cs.umb.edu/~ram
phone (+1)617 287 6466
