June 16, 2003 2:13 PM PDT
XML and Unicode: Mix with care
Published by the Unicode Consortium, Unicode is a standard character set for computers that aims to assign a number to every character in every written language. XML (Extensible Markup Language), a World Wide Web Consortium (W3C) recommendation for marking up digital documents and creating new markup languages for specific tasks or industries, relies on Unicode and closely tracks its revisions.
But a technical report released by the Unicode Consortium--and simultaneously published as a note by the W3C's internationalization activity--warns document authors that some Unicode features are going to cause XML applications, HTML browsers, and other programs to choke.
The conflict between Unicode and Web markup languages stems from the fundamentally different philosophies that underlie the character set and the Web standards. While Unicode produces a one-for-one, linear correspondence for every character on the page, XML and its Web-based relatives are more flexible in that they let authors assign different style and functional attributes to a single character, word or page.
For example, Unicode provides what are called "compatibility characters," separate numbers that designate superscript or subscript numerals or letters. With HTML or XML, by contrast, the author would use the basic character and then style it as superscript or subscript.
All things being equal, the W3C advises authors to use the markup alternatives.
Compatibility characters are "just not the long-term, sound way to do things," said Martin Duerst, the W3C's internationalization activity lead and a visiting scientist at the Massachusetts Institute of Technology's Laboratory for Computer Science. "We're urging authors to use Unicode in a responsible and adequate way when it's used with XML."
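As a rough illustration of the markup alternative, the Python sketch below swaps superscript and subscript compatibility characters for the basic character wrapped in HTML-style tags. It is only one way to do this, not a method taken from the report: the function name compat_to_markup is hypothetical, and the choice of <sup> and <sub> as the target markup is an assumption. It relies on the standard unicodedata module, which reports a compatibility character's decomposition with a "<super>" or "<sub>" tag.

```python
import unicodedata
from xml.sax.saxutils import escape


def compat_to_markup(text):
    """Replace superscript/subscript compatibility characters with markup.

    Hypothetical helper for illustration; the <sup>/<sub> element names
    are an assumption about the target markup vocabulary.
    """
    out = []
    for ch in text:
        # e.g. "<super> 0032" for U+00B2, "<sub> 0032" for U+2082, "" for "x"
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith("<super> "):
            tag = "sup"
        elif decomp.startswith("<sub> "):
            tag = "sub"
        else:
            # Ordinary character: keep it, escaping &, < and > for markup.
            out.append(escape(ch))
            continue
        # Everything after the tag is the basic character(s), in hex.
        base = "".join(chr(int(cp, 16)) for cp in decomp.split()[1:])
        out.append("<{0}>{1}</{0}>".format(tag, escape(base)))
    return "".join(out)


print(compat_to_markup("x\u00b2 + H\u2082O"))
```

Each compatibility character is converted on its own, so a run of several superscripts produces several adjacent elements; a production tool would probably merge them.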
Many times, authors know that their Unicode is destined to be read by Web browsers and other XML applications. But some of the conflicts crop up as a surprise when XML applications are fed information from older databases and information repositories.
That's when applications that are designed for markup languages start stuttering on characters that designate things like vertical tabs, form feeds and other control codes.
"In the report we go through a lot of different kinds of characters that, in one way or another, may make sense in a legacy system or in plain text, but once you have markup at your disposition, you can use structure," Duerst said. "You want to use structure instead of a character, a number. If you're using XML, use what XML makes available. Control character stuff really doesn't work."
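To make the control-character problem concrete, here is a minimal Python sketch that scans text from a legacy source for characters that XML 1.0 does not permit in a document at all. The allowed ranges follow the Char production of the XML 1.0 recommendation (tab, line feed, carriage return, and most characters from U+0020 upward); the function name find_disallowed is hypothetical, and the sketch only reports offending characters rather than deciding what structure should replace them.

```python
import re

# Complement of the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML10_DISALLOWED = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_disallowed(text):
    """Return (offset, code point) pairs for characters XML 1.0 forbids.

    Hypothetical helper for illustration only.
    """
    return [
        (m.start(), "U+{:04X}".format(ord(m.group())))
        for m in _XML10_DISALLOWED.finditer(text)
    ]


# A legacy record that uses a vertical tab and a form feed as separators.
legacy = "page one\x0bpage two\x0c"
print(find_disallowed(legacy))
```

A check like this could run at the boundary where older database content is fed into an XML application, so that the offending characters are caught before a parser rejects the document.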
The fourth version of Unicode will be out in book form later this year. Prepublication versions of Unicode 4.0 are available online now.