Google: Unicode conquers ASCII on the Web
I picture it happening this way. The Roman alphabet is on the run, pursued by a much larger army of Arabic characters with long scimitar-like ligatures, Chinese characters that look like throwing stars, and European peasant letters bristling with umlauts, cedillas, and tildes.
Unicode now is the most common character encoding method on the Web. (Credit: Google)
Unicode has overtaken ASCII as the most popular character encoding scheme on the World Wide Web, Mark Davis, Google's senior international software architect, said in a blog post. Also vanquished at almost exactly the same time was the Western European encoding.
Unicode is a character encoding standard that gracefully accommodates dozens of languages as well as Roman characters with diacritical marks. ASCII, a tried-and-true, decades-old standard, is limited to 128 characters, or 256 in its 8-bit extensions, and has a hard time extending beyond the range of a century-old Remington typewriter.
Unicode vanquished ASCII and Western European within 10 days in December, Davis said.
"What's more impressive than simply overtaking them is the speed with which this happened," he added, pointing to a graph showing the meteoric rise of Unicode.
Google's a fan of Unicode Web sites. When the company processes data from Web sites, it first converts the data into Unicode if it's not already in that form. That improves international search abilities.
"The continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover," he said.
Google just converted to Unicode 5.1, he added, "so people speaking languages such as Malayalam can now search for words containing the new characters."
One disadvantage Unicode has compared with ASCII, though, is that it takes at least twice as much memory to store a Roman alphabet character because Unicode uses more bytes to enumerate its vastly larger range of alphabetic symbols.
Stephen Shankland writes about a wide range of technology and products, but has a particular focus on browsers and digital photography. He joined CNET News in 1998 and since then also has covered Google, Yahoo, servers, supercomputing, Linux and open-source software, and science. E-mail Stephen, or follow him on Twitter at http://www.twitter.com/stshank. 



"... though, is that it takes at least twice as much memory to store a
Roman alphabet character".
That's not really true with UTF-8. For unaccented Western/Roman
characters, UTF-8 takes up exactly one byte per character, just
like ASCII. When you get into accent marks and non-Roman
character sets, though, UTF-8 takes two or more bytes
per character.
See:
http://en.wikipedia.org/wiki/UTF-8
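A quick way to check the byte counts described above is to encode a few characters and measure them. This is only a minimal Python sketch; the sample characters and the helper name `utf8_length` are picked for illustration and do not come from the article.

```python
# Sample characters, written as escapes so the code points are unambiguous.
SAMPLES = {
    "A": "unaccented Roman letter",
    "\u00e9": "Roman letter e with an acute accent",
    "\u4e2d": "Chinese character",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}


def utf8_length(text: str) -> int:
    """Return how many bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


for char, description in SAMPLES.items():
    print(f"{char!r} ({description}): {utf8_length(char)} byte(s)")
```

The unaccented letter comes out at one byte, the same as ASCII, while the other samples need two, three, and four bytes.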
In Windows programs, text is typically represented as UTF-16 internally, which does take up more space, but generally behaves faster, since the Windows APIs are natively UTF-16.
The older single-byte/double-byte API equivalents are quietly converted to Unicode on each call, which can slow programs down a bit if they are particularly text-heavy.
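The storage side of that trade-off can be sketched in Python by encoding the same Roman-alphabet string three ways. The sample string and the helper name `encoded_sizes` are assumptions for illustration. The sketch measures size only and says nothing about how fast the Windows APIs run.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of the text in ASCII, UTF-8, and UTF-16."""
    return {
        "ascii": len(text.encode("ascii")),
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" is little-endian without a byte order mark.
        "utf-16": len(text.encode("utf-16-le")),
    }


headline = "Unicode conquers ASCII on the Web"
for name, size in encoded_sizes(headline).items():
    print(f"{name}: {size} bytes")
```

For pure ASCII text like this headline, the UTF-8 size matches the ASCII size, and the UTF-16 size is double.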
Also, since web pages consist largely of HTML tags and client-side scripts, which are made up of pure ASCII characters, these take up no more space in UTF-8 than if the page were ordinary ASCII or some ISO ASCII extension set.
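To put a rough number on that, the Python sketch below compares the size of a small page in UTF-8 with its size in a single-byte ISO encoding. The page content and the helper name `utf8_overhead` are made up for illustration, and Latin-1 (ISO 8859-1) stands in for "some ISO ASCII extension set".

```python
# A tiny hypothetical page: ASCII markup around a few accented words.
PAGE = (
    "<html><head><title>Caf\u00e9</title></head>"
    "<body><p>Cr\u00e8me br\u00fbl\u00e9e</p></body></html>"
)


def utf8_overhead(html: str, legacy: str = "latin-1") -> int:
    """Return the extra bytes UTF-8 needs compared with a single-byte encoding."""
    return len(html.encode("utf-8")) - len(html.encode(legacy))


print(f"UTF-8 size: {len(PAGE.encode('utf-8'))} bytes")
print(f"Extra bytes versus Latin-1: {utf8_overhead(PAGE)}")
```

The only extra bytes come from the four accented letters. The tags cost exactly the same in both encodings.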
But it's another thing to actually have UTF-8 encoded characters in your text -- ones that are not also part of basic ASCII. My guess is that only a small percentage of pages served as UTF-8 actually "use" it, for all the reasons already expressed by others.
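One way to test that guess on a given page is to look for any byte above the ASCII range. This is a minimal Python sketch, assuming the page body is already available as raw bytes; the function name `uses_non_ascii` is hypothetical.

```python
def uses_non_ascii(raw: bytes) -> bool:
    """Return True if the data contains at least one byte outside ASCII (0x00-0x7F)."""
    return any(byte > 0x7F for byte in raw)


plain_page = b"<p>hello</p>"
accented_page = "<p>caf\u00e9</p>".encode("utf-8")

print(uses_non_ascii(plain_page))
print(uses_non_ascii(accented_page))
```

A page labeled UTF-8 for which this returns False is byte-for-byte identical to its ASCII form.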
- by krosavcheg May 9, 2008 2:18 AM PDT
- 1) The "meteoric rise" of Unicode is indisputable, but the graph is misleading. 75% of the web is still not Unicode. Since the family of Unicode text encodings aims to replace all other encodings, the graph really should have only 2 lines, "Unicode encodings" and "other encodings".
2) As other commenters remarked, the overhead of Unicode encodings is minimal. Overhead should never be an argument against using a Unicode encoding. Anyone who has to deal with multiple text encodings in organically evolved (i.e. not carefully designed) IT systems will agree.
wcoenen (logged in with bugmenot.com)