Hi TAGgers

My recommendation is to always define the character set explicitly in the <head> of the HTML document. This avoids many of the problems. This article is probably the bible on character sets and their correct usage, including in HTML:
http://www.joelonsoftware.com/articles/Unicode.html

Once this is sorted out, either:

a. Use UTF-8 as the character set and supply characters as-is, e.g. put the actual u-umlaut character (code point U+00FC) in the document as required.

b. (If item a above fails.) Use HTML entities whenever a non-ASCII (8-bit) character is required, e.g. the text "&uuml;" in place of the actual u-umlaut character.

I have found that, in practice, item a is really only possible if you are able to control the character set of each part of your software stack: the text editor used by someone sending you an XML file; the parser you use to parse the XML file; the database connection and client you use to place that data into a database; the database connection and client you use to read data from the database; the Content-Type emitted by the server and/or by your web pages when requested by a user's web browser; and finally the user's web browser, which, if old enough, might not handle UTF-8 data properly despite all your effort. My recommendation here is to code for the future, as old browsers get replaced reasonably quickly.

Cheers, Ben

-----Original Message-----
From: tdwg-tag-bounces@lists.tdwg.org [mailto:tdwg-tag-bounces@lists.tdwg.org] On Behalf Of Bob Morris
Sent: Tuesday, 22 December 2009 21:58
To: Gregor Hagedorn
Cc: tdwg-content@lists.tdwg.org; Hong cui; Terry Catapano; Technical Architecture Group mailing list
Subject: Re: [tdwg-tag] character encodings

Yes, you are quite right that XML should take care of it. And also right that using Java Readers and Writers addresses the well-formedness issue. Generally, XML management is quite easy in Java 6.
However, my experience has been that some browsers do not render some characters (or encodings?) perfectly, and so I wondered whether there are recommendations.

On Tue, Dec 22, 2009 at 5:08 AM, Gregor Hagedorn <g.m.hagedorn@gmail.com> wrote:
Hi Bob
Is that necessary with XML? XML defines a set of allowable character encodings, and any XML processor "should" be able to read them.
The problems we experienced in Key to Nature really go back to Java programmers treating the writing of XML as the production of strings in handwritten code, rather than using XML readers or writers. The result was generally (sooner or later...) non-well-formed XML.
Gregor
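Gregor's contrast between handwritten string production and a proper XML writer can be sketched as follows. This is only an illustration, written in Python rather than Java, using the standard library's xml.etree.ElementTree (Python 3.8 or later is assumed for the xml_declaration argument). The element names (state, label), the sample text, and the helper names are hypothetical and are not taken from SDD or any TDWG schema.

```python
import xml.etree.ElementTree as ET


def naive_xml(label: str) -> str:
    """Handwritten string production: markup characters are not escaped."""
    return "<state><label>" + label + "</label></state>"


def written_xml(label: str, encoding: str = "utf-8") -> bytes:
    """Let the library serialise: it escapes markup characters and
    emits an XML declaration that names the encoding."""
    state = ET.Element("state")
    ET.SubElement(state, "label").text = label
    return ET.tostring(state, encoding=encoding, xml_declaration=True)


def is_well_formed(document) -> bool:
    """Return True if an XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


def read_label(document: bytes) -> str:
    """Parse the bytes; the parser picks the encoding from the declaration."""
    return ET.fromstring(document).find("label").text


# Hypothetical state text with a degree sign, a plus-minus sign,
# and two characters that are markup in XML ("<" and "&").
label = "leaf angle < 45\u00b0 & margin \u00b1 entire"

print(is_well_formed(naive_xml(label)))                  # False
print(is_well_formed(written_xml(label, "iso-8859-1")))  # True
print(read_label(written_xml(label, "iso-8859-1")) == label)  # True
print(read_label(written_xml(label, "utf-8")) == label)       # True
```

The same text survives a round trip in either encoding because the reader takes the encoding from the XML declaration, which is the behaviour Gregor says any XML processor "should" provide.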
2009/12/21 Bob Morris <morris.bob@gmail.com>:
Does SDD (resp. TDWG) have any best practice about character encodings and character sets in XML? The sad fact of life is that many printed, and word processing, systematics and ecology documents have odd symbols even in characters and states, e.g. the degree sign, +- sign, the circled 'x', em-dashes, ..., which are not always rendered by browsers.
-- 
Robert A. Morris
Professor of Computer Science (nominally retired)
UMASS-Boston
100 Morrissey Blvd
Boston, MA 02125-3390
Associate, Harvard University Herbaria
email: ram@cs.umb.edu
web: http://bdei.cs.umb.edu/
web: http://etaxonomy.org/FilteredPush
http://www.cs.umb.edu/~ram
phone (+1)617 287 6466
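Ben's options (a) and (b) from the top of the thread, and his warning about one part of the stack assuming a different character set, can be sketched as follows. This is a minimal illustration in Python using only the standard library; the helper names and the sample symbols are hypothetical, and the <meta charset> form shown is one common way of declaring the character set in <head> (the thread does not say which form Ben had in mind).

```python
import html

# u-umlaut, degree sign, plus-minus sign, em dash
symbols = "\u00fc \u00b0 \u00b1 \u2014"


def as_utf8_page(text: str) -> bytes:
    """Option (a): declare UTF-8 in <head> and send the characters as-is."""
    page = (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>demo</title></head>\n'
        "<body><p>" + html.escape(text) + "</p></body></html>\n"
    )
    return page.encode("utf-8")


def as_ascii_fragment(text: str) -> bytes:
    """Option (b): keep the bytes pure ASCII by replacing every non-ASCII
    character with a numeric character reference."""
    return html.escape(text).encode("ascii", errors="xmlcharrefreplace")


def misread_as_latin1(text: str) -> str:
    """What a Latin-1 consumer sees when it is handed UTF-8 bytes."""
    return text.encode("utf-8").decode("latin-1")


print(as_ascii_fragment(symbols))  # b'&#252; &#176; &#177; &#8212;'

# Named and numeric references decode to the same character.
print(html.unescape("&uuml;") == html.unescape("&#252;") == "\u00fc")  # True

# One mismatched layer turns u-umlaut into two unrelated characters.
print(misread_as_latin1("\u00fc") == "\u00c3\u00bc")  # True
```

Option (b) survives a mismatched layer because the bytes are plain ASCII, at the cost of less readable source text; option (a) depends on every layer agreeing on UTF-8.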