The following article was originally published on XML.com.1
The Apache XML developer lists have seen some turmoil recently over a proposed plan to develop their next generation XML parser. However a phoenix may have been borne out of the ashes: this week we report on "Spinnaker," a.k.a. the Xerces Refactoring Initiative.
The announcement listed goals for the Spinnaker effort itself. Amongst these was the desire to produce a parser suitable for inclusion in the Java Development Kit, a goal that apparently had internal backing from Sun.
Is it possible that, in the future, we hear about submissions to the tree *before* everyone goes home on Friday? I want us all to work together on the future of the Xerces parser instead of being surprised by a new source tree over a weekend.
These opening remarks set the tone for a heated debate, which threatened to turn into an IBM versus Sun squabble over who was playing fair by open source rules. Abrasive remarks were exchanged, with both sides quickly disavowing any corporate-lead motives, whilst questioning those of the other side. James Davidson believed the timing of his announcement was irrelevant:
This is open source, Apache style. People work whenever they work and that's the way this all works. Most Apache developers don't work on the main sources during the typical M-F 8-4 (local time) window. They work when they get time, or the muse is with them, or whatever. There are no limits, it's a 24/7 shop and to be blunt, conformance with a corporate schedule isn't part of the mandate.
The problem is not about Sun vs IBM. It is not about corporate vs individuals. Not about creating a new project. Not about working on week-ends. Not about questioning the current goals or implementation. The problem is about making decision on your own. Not communicating. Not consulting others in the project.
Ironically, it seems that a lack of communication, along with a good deal of miscommunication, was the root cause of the argument. This is symptomatic of an internal divide within the Apache parser development community. Stefano Mazzocchi, in valiant attempts to defuse the situation, highlighted the cause of the divide:
Both Xerces and Crimson are "donations" to the Apache effort. However, these donations haven't been made into the arms of a waiting team of Apache volunteers. Both projects are still largely driven by development teams at IBM and Sun. Mazzocchi's key point is that simply providing a public CVS server does not make for a good open source project. If the development teams continue as they have done prior to opening the source (e.g., by having off-line design meetings), then there's little chance for a developer community to form around the project.
Mazzocchi suggested that the Tomcat developers have managed to make the transition from internally driven development to a full open source community, whereas the Xerces, Xalan, and Crimson teams have yet to achieve it. Rajiv Mordani further illustrated this point:
...since most of you'll work in the same office there is a lot of things that happen in there that should be actually done on the mailing lists. I don't remember seeing a mail going out proposing the implementation of schemas for [example]. The only info that was sent out is that the repository will be a little unstable for the next few days as schema support is being added to Xerces so please use the tagged version of the workspace.
It would be wrong to single out just the IBM developers in this regard. It appears that the Sun team is as much at fault--although, as Brett McLaughin commented, it's the external perspective that matters:
As tempers cooled, it quickly became apparent that the perception that neither team is interested in the other's software was incorrect. Once the rhetoric was laid aside, a lot of common ground was discovered. At that point the real work started.
To help push matters forward, Sam Ruby suggested that the project should be known as the Xerces Refactoring Initiative (XRI)--a name he described as "intentionally boring." Stefano Mazzocchi produced a "virtual press release" to explain the common purpose of the initiative:
The Apache XML community started a "Refactoring Initiative" to create [a] next generation XML parser for Java. This will be known as "codename Spinnaker" (or simply spinnaker) and will be the collective design whiteboard where the worldwide Apache community of individuals will openly develop such [a] new XML parser. The final result of this RI will be determined when "final" status is reached and, as always, decided by the community.
In his opening announcement, James Davidson listed several perceived problems with Xerces. These included performance problems under HotSpot, and code complexity. Both of these appear to be the result of efforts put in by the Xerces team to optimize the parser for the Java 1.1 platform. While the efforts have clearly paid off, the optimizations do make the code harder to understand, and limit the ability of HotSpot to automatically optimize performance on Java 1.2 and 1.3 virtual machines.
The main objection I have to the current Xerces code is not about which VM it runs on, but on the ability for a developer to understand the code and make changes to it. I think one major reason there has not been more developer participation is because the code is difficult to understand ... I looked at two other parsers that I believe are easier to understand: Aelfred2 (not Apache) and Crimson/ProjectX... With Xerces it takes much more effort to understand the code and I prefer not to make changes to code I don't understand.
I have lots of users on the JDOM lists who think Xerces is great, but are totally scared by the code--unfortunately, I can't blame them. They would rather spend their time working on JDOM, or something similar.
The fact is that we're also interested in a new version of Xerces which is more modular. As a matter of fact, we've given it quite some thoughts already, and [have] even written down a first draft of a design document on it that we'll be happy to send out as input.
We don't think the current model is perfect. We refer to it as "the star model." Basically every piece knows about the others and the flow of information is far from being linear. Instead, we're thinking of a pipeline model. Where every piece works as a box with an input and output stream working independently of the others. It will be hard to make this as efficient as the current parser. But this, among other things, would let one to easily plugin the validator or not.
I think the point is to build/refactor a next generation, commercially viable, parser based on the current state of the art (including and especially Schema support), and the collected requirements. It is indeed pointless, in my opinion, to talk about "adapting" other code bases--what is going to occur is a mining operation from Xerces and Crimson. Anything that's open source with the right license is open to be mined for ideas...
There's stuff that microparsers like nanoxml can contribute to the discussion. Python XML parsers (and there is more than one) are quite good. If we are talking James Clark, let's not forget expat; this is a very good parser and represents the core of the Perl XML processing family.
The project has now taken on a life of its own: differences seem to have been set aside (at least temporarily) to focus instead on the technology. Ed Staub has taken on the task of co-ordinating the collection and publication of the XRI requirements. The discussion has already thrown out some interesting ideas such as Grammar caching (avoiding reparsing of the same DTD multiple times), and compiling an XML Schema into a custom parser.
There's obviously a long road ahead for XRI/Spinnaker but it should yield some interesting results. Don't go looking for a Xerces 2 just yet, and don't despair, as Xerces 1 is not about to be abandoned: the IBM team is currently hard at work completing XML Schema support.
Perhaps the key benefit of this project, despite its rocky beginnings, is that the internal fragility of the Apache XML may be removed. With effort from both the Xerces and Crimson teams, as well as the wider developer community, we'll not only benefit from a next-generation parser, but also a more stable and collaborative process underpinning the development of a vital component in the XML infrastructure.
1. http://www.xml.com/pub/a/2000/07/19/deviant/index.html. Copyright ©2000 O'Reilly & Associates, Inc. Reprinted with permission from O'Reilly & Associates, Inc.