The technology known as Extensible Markup Language has become a nearly universal way to share information online. But there's a growing recognition that XML's benefits sometimes come with a price tag: sluggish performance.
That problem is now spawning efforts to speed up XML traffic. Proponents say a skinnier XML will boost the speed of everything from Internet commerce to data exchange between cell phones. But so far, there's no agreement on the technology to make that happen.
Here's the problem: Right now, the XML standard calls for information to be stored as text. That means that an XML document, such as a purchase order or a Web page, can be easily viewed by a person or "read" by a machine, either through widely available text editors or XML parsers.
News.context
What's new:
Concerns over Extensible Markup Language's performance have spawned efforts to speed up traffic with a binary format.
Bottom line: Although there's a growing recognition that XML speed is a potential problem, there's no agreement on how to fix the problem. Critics are wary of making XML proprietary and spoiling its success.
But performance problems result from XML's tendency to create very large files. That's in part because XML formatting calls for each element within a document to be tagged with labels written out as text. What's more, XML-based protocols, called Web services, also generate a great deal of XML traffic.
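As a rough illustration of that overhead, the sketch below compares the total size of a small document with the size of the character data it actually carries. The purchase-order fragment and the `markup_overhead` helper are made up for this example and are not taken from any real system.

```python
import xml.etree.ElementTree as ET

# Hypothetical purchase-order fragment, invented for illustration.
DOC = (
    "<purchaseOrder>"
    "<customerName>Smith</customerName>"
    "<itemNumber>1042</itemNumber>"
    "<quantity>3</quantity>"
    "</purchaseOrder>"
)


def markup_overhead(xml_text):
    """Return (total_bytes, data_bytes) for an XML string."""
    root = ET.fromstring(xml_text)
    # itertext() yields only the character data, not the tags.
    data = "".join(root.itertext())
    return len(xml_text.encode("utf-8")), len(data.encode("utf-8"))


total, data = markup_overhead(DOC)
share = 100 * (total - data) / total
print(f"{total} bytes in the document, {data} bytes of data ({share:.0f}% markup)")
```

The ratio depends entirely on tag names and nesting, so a real document may look better or worse than this toy one.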
"Not only is XML verbose, but it's extremely wasteful in how much space it needs to use for the amount of true data that it is sending," said Jeff Lamb, chief technology officer of Leader Technologies, which uses XML extensively in teleconferencing applications and believes that a change is needed.
The leading candidate to help alleviate XML's performance woes is a technology called binary XML, which calls for a new format that compresses XML transmissions.
Sun Microsystems has started an open-source Fast Infoset Project based on binary XML, and the standards body responsible for XML, the World Wide Web Consortium (W3C), has formed the Binary Characterization Working Group to consider putting XML in binary format.
On the face of it, compressing XML documents by using a different file format may seem like a reasonable way to address sluggish performance. But the very idea has many people--including an XML pioneer within Sun--worried that incompatible versions of XML will result.
"If I were world dictator, I'd put a kibosh on binary XML, and I'm quite confident that the people who are pushing for it would find another solution," said Tim Bray, who's both co-inventor of XML and an executive in Sun's software group.
"But as it is, these people think they're right and they're not stupid, so maybe they are right. Thus, let's hope that they play nice with standards bodies and provide that free open-source software--all of which the Sun Fast-Infoset people are doing, to their credit," Bray said.
Putting the squeeze on XML
The Fast Infoset plan, which represents more than a year of work, proposes that XML documents get shrunk down into a binary format in order to speed up transmission of files over the Internet. Sun has chosen a compression method that's already a standard used in the telecommunications industry.
The Sun engineers behind Fast Infoset argue that binary
If the transmission of XML is the concern, why not compress the data during the transmission process? That way it can be stored in its native format for editing and searching, but compressed to a small size for transmission.
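A minimal sketch of that idea, assuming generic DEFLATE compression through Python's standard `zlib` module. The payload is invented for illustration, and nothing here is specific to any product mentioned in this thread.

```python
import zlib

# Hypothetical repetitive XML payload, invented for illustration.
rows = "".join(
    f"<customer><lastName>Name{i}</lastName><city>Denver</city></customer>"
    for i in range(200)
)
document = f"<customers>{rows}</customers>".encode("utf-8")

# Sender: compress just before the bytes go on the wire.
wire_bytes = zlib.compress(document, level=6)

# Receiver: decompress and recover the original text, byte for byte.
received = zlib.decompress(wire_bytes)

print(len(document), "bytes stored as text,", len(wire_bytes), "bytes transmitted")
```

The stored document stays plain text, so editors and search tools keep working. Only the bytes in transit change.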
Some storage systems, like Oracle, allow for behind-the-scenes compression, so compression could even spill over to storage systems and be transparent to the consumers and producers of XML.
In order to maintain compatibility among various systems, everyone must agree on how the bits are to be transformed. When we discuss binary formats, compression is implied. But what form of compression? The entire packet, the header, the body, etc.? At what point do you compress? At what level of the stack? Everyone must agree so everyone can understand the transmission.
Compression will only save transmission times, but it will not speed the parsing of the data, which I imagine is at least 50% of the bottleneck. Binary XML would help here.
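To make the parsing point concrete, here is a small sketch that decodes the same three values from a text XML record and from a fixed binary layout. The record, the field names and the `!IHd` layout are assumptions made up for this example; the binary side is a plain `struct` stand-in, not Fast Infoset or any proposed binary XML format.

```python
import struct
import timeit
import xml.etree.ElementTree as ET

# Hypothetical record: an order id, a quantity and a price.
XML_RECORD = b"<order><id>1042</id><quantity>3</quantity><price>19.5</price></order>"

# The same values in a fixed layout (network byte order):
# unsigned 32-bit int, unsigned 16-bit int, 64-bit float.
LAYOUT = struct.Struct("!IHd")
BIN_RECORD = LAYOUT.pack(1042, 3, 19.5)


def decode_xml(raw):
    """Parse the text form and convert each field from its string value."""
    root = ET.fromstring(raw)
    return (
        int(root.findtext("id")),
        int(root.findtext("quantity")),
        float(root.findtext("price")),
    )


def decode_binary(raw):
    """Unpack the fixed layout directly into numbers."""
    return LAYOUT.unpack(raw)


# Timings vary by machine and interpreter; treat them as indicative only.
xml_seconds = timeit.timeit(lambda: decode_xml(XML_RECORD), number=2000)
bin_seconds = timeit.timeit(lambda: decode_binary(BIN_RECORD), number=2000)
print(f"XML: {len(XML_RECORD)} bytes, {xml_seconds:.4f}s for 2000 decodes")
print(f"binary: {len(BIN_RECORD)} bytes, {bin_seconds:.4f}s for 2000 decodes")
```

Compressing `XML_RECORD` for transport would shrink the bytes sent, but the receiver would still have to decompress and then run the same text parse.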
...and was the winner. The HTTP protocol also exchanges text information. Tim Bray is right: the benefits of text information are so huge that binary XML is a bad idea. Optimizations can always be done locally if needed. It's the same story as assembler compared to high-level languages.
The company Exos Services Inc., in Denver, has already developed a very efficient XML platform. It prioritizes Internet stream packets, which causes no bog-down. Check them out: http://www.exosservices.com/
First, look at a binary format such as CORBA: it's much more interoperable than XML these days. Just go to the Axis forum and see how many people are having problems between .NET and Java.
Get real. CORBA has both communication interoperability and source-code interoperability. SOAP is a truly backward standard in these regards.
I work on both and I know the heart of it. I developed enterprise applications using both and I know. CORBA is 5 to 10 times faster. It's much more scalable.
So, what's the problem with CORBA? These are the points addressed by Web services/SOAP, and that's why it shines, not because of text-based transportation:
1) It's not embraced by all vendors.
2) It does not use the Web as the means of transportation.
3) The addressing scheme is really stupid (not a simple HTTP URL, but a weird and long IOR string).
4) Lookup goes through JNDI or a name service. Most of the time, people know exactly what the address is and don't need a repository server to find the service.
5) The programming model in CORBA is screwed up. To put it in short: it's stupid.
   a) It's like a black hole, sucking all your programming into it. There's an article talking about this.
   b) It forces you to use CORBA objects when you don't need to, or want to.
   c) The architecture is too complex, and vendors hardly get it right. For example, take well-known, mature libraries such as ACE/TAO and test them in a real-world app: they would not handle TCP/IP packet corruption. Or MICO, for example, would not have a timeout option (in the version I last checked a while ago). This shows how hard you have to work to get these right.
So, the SOAP/WS advantage is not that it is text-based (text helps a little). CORBA's problem has nothing to do with binary. Interoperability does not come from being text or not (it comes from being an open standard).
Another misleading point: it is not only the network that slows down the connection. For example, take out the network and use the same computer for the client and the server. It's still very slow. Why? CPU cycles. XML consumes so many CPU cycles that 100% is utilized. This is not the case for CORBA.
My advice is to look at the problem seriously. Work with it like I do. Find the exact problem and fix it. Do not speculate blindly. Bray is just too blind, and biased, because it's his baby.
Here are some similarities between SOAP/WS and CORBA, to expose some of the myths SOAP/WS brings:
1) Both use a separate declaration language: WSDL for SOAP and IDL for CORBA. IDL is much easier to master.
2) The programming style most vendors support is to generate the stub and skeleton from the declaration language. So no difference here. Both are cumbersome and take the same amount of work.
CORBA was the dream, but, Microsoft killed it. Corba was way ahead of its time. But, it offers far more than just data and services interoperability. XML offers a simple data transport whereas CORBA offers full data/object/method distribution. Far more complex and difficult to integrate, and, currently, far too much overhead. What you gain in data transport you lose in processing speed.
I couldn't agree more. When I brief people on Web Services, I tell them it's next generation CORBA.
My company uses the IFX XML standard in financial applications. The documents can be quite large and intricate. They are typically used between dedicated clients and servers, where a substantial investment is made in their development. Performance is ALWAYS an issue. As you pointed out, and as the Sun article showed, the XML-to-language object binding can be the dominant cost. This is network independent, as you pointed out with the intraprocess example.
Sun articles: http://java.sun.com/developer/technicalArticles/WebServices/fastWS/ and http://java.sun.com/developer/technicalArticles/xml/fastinfoset/
There is no reason to create a new standard / fragment XML compatibility. Transparent compression when/if needed is the way to go.
HTML is verbose too, and the proper way to handle it is to set up on-the-fly compression (such as Apache's mod_deflate): http://httpd.apache.org/docs-2.0/mod/mod_deflate.html
It significantly cuts bandwidth usage and speeds up transmission, with very little impact on processing power. The best part is that once it's set up, you just forget about it; it's completely transparent.
If initial research puts the performance boost at 3x for binary XML, then at the rate which hardware performance grows this will be matched in 2 years. XML's usage will surely grow, but XML itself will not get more complex. In a few years from now, even the most constrained devices (PDA, cameras, personal devices) will likely have ample computing power and storage to handle today's text-based XML. Why bother with what will undoubtedly become a fragmented market of open and proprietary binary-XML specs?
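A quick check of the arithmetic behind that estimate, as a sketch: if performance doubles every T years, a gain of factor k takes T * log2(k) years. The doubling periods tried below are assumptions, since the comment does not state one.

```python
import math


def years_to_match(speedup, doubling_period_years):
    """Years of hardware growth needed to equal a given speedup factor."""
    return doubling_period_years * math.log2(speedup)


# Assumed doubling periods, in years; none of these come from the comment.
for period in (1.0, 1.5, 2.0):
    years = years_to_match(3.0, period)
    print(f"doubling every {period} years: 3x reached after {years:.2f} years")
```

Under these assumptions the 2-year figure holds only if performance doubles somewhat faster than every 18 months.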
Why not simply add a header tag section to the top of an XML document that defines what 'short tags' (tokens) in the rest of the document should expand to? It's the old data dictionary song, with a faster beat... very easy to automate prior to transmission.
<root>
  <dictionary>
    <define token="a1" fullname="LastName"/>
  </dictionary>
  <a1>Smith</a1>
</root>
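The receiving side of this scheme could be sketched as below. The `expand_tokens` helper is hypothetical, and it assumes the dictionary is a direct child of the root element, as in the sample above.

```python
import xml.etree.ElementTree as ET

COMPACT = """<root>
  <dictionary>
    <define token="a1" fullname="LastName"/>
  </dictionary>
  <a1>Smith</a1>
</root>"""


def expand_tokens(xml_text):
    """Replace short tags with the full names declared in <dictionary>."""
    root = ET.fromstring(xml_text)
    dictionary = root.find("dictionary")
    if dictionary is None:
        return xml_text  # no dictionary section, nothing to expand
    names = {d.get("token"): d.get("fullname") for d in dictionary.findall("define")}
    root.remove(dictionary)  # the dictionary itself is not part of the data
    for element in root.iter():
        element.tag = names.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


expanded = expand_tokens(COMPACT)
print(expanded)
```

The compact form is still well-formed XML, so existing parsers can read it unchanged; only consumers that want the full names need the expansion step.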
People are already using binary formats for XML interchange
Whether Tim Bray likes it or not, people are already using binary interchange formats.
The question before the W3C is whether we should try to gain support for a single such format, to promote interoperability, or whether people should go off and use the 80 or so formats in widespread use today.
Many concerns about integration and compatibility are being resolved by companies implementing XML and service-oriented architectures. At the same time, due to a lack of proper standards and to improper implementations, it is still hard to bundle large amounts of information within XML documents for transactions. There is a need to create compression standards and robust XML architectures. However, a single vendor controlling and pushing a format may ultimately prove to be the downfall of the rapid growth in XML adoption, because of competing formats.
In my experience, the vast majority of XML is machine generated, and also the element start and end tags often consume more space than the data itself, so why not allow the closing tag to be </> and infer the element name that it closes? This would be a trivial code change, and based on the XML I see on a daily basis, this is probably a 30% space saving, along with some associated reduction in disk and network times. Parsing time probably doesn't reduce by much, but still worth having IMHO.
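A sketch of how the proposed `</>` form could be produced and reversed. The `shorten` and `restore` helpers are hypothetical and assume simple machine-generated XML: no comments, CDATA sections, processing instructions, or `>` inside attribute values. The shortened text is no longer well-formed XML 1.0, so a standard parser would reject it until it is restored.

```python
import re

NAME = r"[A-Za-z_][\w.\-:]*"


def shorten(xml_text):
    """Replace every named end tag with the proposed </> form."""
    return re.sub(rf"</{NAME}\s*>", "</>", xml_text)


def restore(short_text):
    """Rebuild named end tags by tracking open elements on a stack."""
    stack, out, pos = [], [], 0
    for m in re.finditer(rf"</>|<({NAME})[^<>]*?(/?)>", short_text):
        out.append(short_text[pos:m.start()])
        if m.group(0) == "</>":
            out.append(f"</{stack.pop()}>")  # closes the innermost open element
        else:
            if not m.group(2):  # not self-closing, so it stays open
                stack.append(m.group(1))
            out.append(m.group(0))
        pos = m.end()
    out.append(short_text[pos:])
    return "".join(out)


# Hypothetical sample document, invented for illustration.
DOC = (
    '<order><customer><lastName>Smith</lastName>'
    '<firstName>Ann</firstName></customer><item sku="1042"/></order>'
)
short = shorten(DOC)
saving = 100 * (len(DOC) - len(short)) / len(DOC)
print(short)
print(f"{len(DOC)} -> {len(short)} bytes ({saving:.0f}% smaller)")
```

The saving depends on tag length and nesting depth, so the 30% figure should be measured on real documents rather than assumed from a toy sample like this one.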
http://www.som.tulane.edu/tccep/documents/CI_Defined.pdf
It's amazing
The question before the W3C is whether we should try to gain support for a single such format, to promote interoperability, or whether people should go off and use the 80 or so formats in widespread use today.
Liam
(W3C XML Activity Lead)