Working with XML and Java
3.1 Reading an XML Data Stream
Problem
You want to read an XML document as a fast stream. Your dataset is too large for DOM, and you want a more selective API than SAX offers.
Solution
Use the StAX API in Java SE 6 to “pull” parse your document.
Discussion
Java has given us a number of ways to work with XML documents, including the popular DOM and SAX. The most recent addition is StAX, or Streaming API for XML, which is largely the brainchild of Oracle/BEA. While all three of these methods of parsing XML have advantages, they have shortcomings too.
StAX is generally the most efficient of the three for streaming through XML, and is therefore particularly well suited to demanding work such as data binding and processing SOAP messages. Oracle/BEA’s WebLogic 9 and 10 use this parser internally within the application server, as does GlassFish v2.
DOM offers an easy-to-use API, and has an advantage over SAX and StAX in that it is XPath-capable. But it also forces you to read the entire document into memory. This is fine for small documents, but can damage performance for sizeable documents, and can be ultimately prohibitive for very large documents. One European bank network regularly transfers multi-gigabyte XML files within their SOA; they’re not using DOM to deal with it.
SAX, on the other hand, handles this problem by working as a “push” parser; that is, events are generated for each structure the parser encounters within the document, and the programmer can choose to deal with those he’s interested in. The disadvantage here is that SAX will typically generate a lot of events that the programmer doesn’t care about. Moreover, the SAX API does not offer iterative processing of your document, and blasts through the whole thing from beginning to end. In this model, the parser controls the processing of the document.
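For contrast with what follows, here is a minimal sketch of the push model just described. It is not part of the recipe: the class name SaxPushSketch and the helper titles are made up for illustration, and it assumes the default, non-namespace-aware SAXParserFactory so that element names arrive in qName. The parser calls the handler for every element and text node, and the handler has to track its own state to ignore the ones it doesn't care about.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxPushSketch {
    // Collects the text of every <title> element (hypothetical helper).
    static List<String> titles(String xml) {
        final List<String> found = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            // non-null only while the parser is inside a <title> element
            private StringBuilder text;

            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                if ("title".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // called for all text, wanted or not
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("title".equals(qName)) {
                    found.add(text.toString());
                    text = null;
                }
            }
        };
        try {
            // parse() returns only after every event has been pushed
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<catalog><book><title>King Lear</title></book>"
                   + "<book><title>Hamlet</title></book></catalog>";
        System.out.println(titles(xml));
    }
}
```

Note that the application never gets a chance to say "stop" or "skip ahead" between callbacks; control returns only when parse() finishes or throws.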
The StAX API gives you control akin to the Java I/O RandomAccessFile: you can skip sections of the document, work with a subsection of the document, pause and resume processing, or stop processing at any time. Using the “pull” model for processing, the application is in charge of how the document is processed, and exerts this control by indicating which items it’s interested in working with; the parser then pulls them out of the event stream.
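To make "stop processing at any time" concrete, the following sketch pulls events only until it finds the first title element and then simply stops asking for more. The class PullEarlyExit and the method firstTitle are illustrative names, not part of the recipe, and the document is passed as a String purely to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class PullEarlyExit {
    // Returns the text of the first <title>, or null if there is none
    // (hypothetical helper).
    static String firstTitle(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (r.hasNext()) {
                    // the application asks for each event; nothing is pushed
                    if (r.next() == XMLStreamConstants.START_ELEMENT
                            && "title".equals(r.getLocalName())) {
                        // reads the text up to the matching end tag
                        return r.getElementText();
                    }
                }
                return null;
            } finally {
                r.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Cannot parse", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<catalog><book><title>King Lear</title></book>"
                   + "<book><title>Hamlet</title></book></catalog>";
        System.out.println(firstTitle(xml));
    }
}
```

Because the loop returns as soon as it has what it needs, the remaining books are never requested from the parser.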
Streaming parsers can work with documents whose format is only loosely known, but you do need to know what you want to work with beforehand: you have to tell the parser what you want to pull.
But the infosets that StAX creates are very small, and are immediate candidates for garbage collection. This gives your XML processing work a small footprint, making it ideal for use not just with small heap devices such as mobile phones, but with long-running server-side applications too.
Unlike SAX, StAX is able to write XML documents. This reduces the number of APIs you have to deal with. That having been said, in addition to reading and writing XML data, there are two different models for parsing data with StAX: the cursor model and the iterator model.
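This recipe concentrates on reading, but a brief sketch of the writing side may help. It uses XMLStreamWriter, the cursor-style writer in javax.xml.stream; the class StaxWriteSketch and the method bookXml are hypothetical names chosen for this example, and it writes a single book fragment rather than a complete document.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class StaxWriteSketch {
    // Builds one <book> element with a sku attribute and a <title> child
    // (hypothetical helper).
    static String bookXml(String sku, String title) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("book");
            w.writeAttribute("sku", sku);
            w.writeStartElement("title");
            // reserved characters such as & and < are escaped for us
            w.writeCharacters(title);
            w.writeEndElement(); // closes title
            w.writeEndElement(); // closes book
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Cannot write", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(bookXml("123_xaa", "King Lear"));
    }
}
```

The writer mirrors the reader: you emit start, characters, and end events in order, and the implementation takes care of the angle brackets and escaping.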
Using the StAX cursor model: XMLStreamReader
The XMLStreamReader interface does the heavy lifting here. Using this interface, you can read everything about both the structure of a document and its content in a stream of events. To receive these events, use the hasNext method to determine if there are any more events to read, and the next method to get an integer token for the next event.
Using that token, you switch on the different kinds of parse events, and do some work if the current event represents something in which you’re interested.
You can capture the following event types, each representing a different aspect of a document’s structure. They are integer constants defined in XMLStreamConstants, which both XMLStreamReader and XMLEvent extend:
• CDATA
• CHARACTERS
• COMMENT
• DTD
• START_DOCUMENT
• END_DOCUMENT
• START_ELEMENT
• END_ELEMENT
• ENTITY_DECLARATION
• NAMESPACE
• NOTATION_DECLARATION
• PROCESSING_INSTRUCTION
• SPACE
The cursor moves forward through the document from start to finish, pointing to each item along the way. If you’re used to using SAX, this should be familiar.
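Before moving to the full example, a small sketch of the hasNext/next loop can show how the cursor reports these event types. The class EventTrace and its trace method are made-up names for illustration; the method just records the kind of each event in the order the cursor reaches it. Note that the reader already sits on START_DOCUMENT before the first call to next, so that event is not recorded here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class EventTrace {
    // Records the kind of every event the cursor visits (hypothetical helper).
    static List<String> trace(String xml) {
        List<String> seen = new ArrayList<String>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                int type = r.next(); // integer token for the next event
                switch (type) {
                    case XMLStreamConstants.START_ELEMENT:
                        seen.add("START_ELEMENT " + r.getLocalName());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        seen.add("END_ELEMENT " + r.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        seen.add("CHARACTERS");
                        break;
                    case XMLStreamConstants.COMMENT:
                        seen.add("COMMENT");
                        break;
                    case XMLStreamConstants.END_DOCUMENT:
                        seen.add("END_DOCUMENT");
                        break;
                    default:
                        // any event type this sketch does not name
                        seen.add("OTHER(" + type + ")");
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Cannot parse", e);
        }
        return seen;
    }

    public static void main(String[] args) {
        for (String s : trace("<a><!--note--><b>hi</b></a>")) {
            System.out.println(s);
        }
    }
}
```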
Let’s start with a very simple and rather poorly defined XML file as an example. Use this XML file as the basis for your parsing:
<catalog>
  <book sku="123_xaa">
    <title>King Lear</title>
    <author>William Shakespeare</author>
    <price>6.95</price>
    <category>classics</category>
  </book>
  <book sku="988_yty">
    <title>Hamlet</title>
    <author>William Shakespeare</author>
    <price>5.95</price>
    <category>classics</category>
  </book>
  <book sku="434_asd">
    <title>1984</title>
    <author>George Orwell</author>
    <price>12.95</price>
    <category>classics</category>
  </book>
  <book sku="876_pep">
    <title>Java Generics and Collections</title>
    <authors>
      <author>Maurice Naftalin</author>
      <author>Phillip Wadler</author>
    </authors>
    <price>34.99</price>
    <category>programming</category>
  </book>
</catalog>
So here you have a catalog that holds a bunch of books, presumably for sale. Each book has an identifier, a title, one or more authors, and so on. The trickiest part of the Catalog.xml file is that the authors element is optional, used only if the book has more than one author.
The program in Example 3-1 illustrates how to use the cursor parsing method in StAX.
Here the author objects are stored in a TreeSet as they are discovered; the Tree part provides natural sorting, and the Set part ensures uniqueness.
Example 3-1. Using the cursor parsing method in StAX
package com.sc.ch02.stax;

import static java.lang.System.out;

import java.io.InputStream;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

public class StaxCursor {
    private static final String db = "/ch02/Catalog.xml";

    //we'll hold values here as we find them
    private Set<String> uniqueAuthors;

    public static void main(String... args) {
        StaxCursor p = new StaxCursor();
        p.find();
    }

    //constructor
    public StaxCursor() {
        uniqueAuthors = new TreeSet<String>();
    }

    //parse the document and offload work to helpers
    public void find() {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        //forward-only, most efficient way to read
        XMLStreamReader reader = null;
        //get ahold of the file
        final InputStream is =
            StaxCursor.class.getResourceAsStream(db);
        //whether current event represents elem, attrib, etc
        int eventType;
        String current = "";
        try {
            //create the reader from the stream
            reader = xif.createXMLStreamReader(is);
            //work with stream and get the type of event
            //we're inspecting
            while (reader.hasNext()) {
                //because this is Cursor, we get an integer token to next event
                eventType = reader.next();
                //do different work depending on current event
                switch (eventType) {
                    case XMLEvent.START_ELEMENT:
                        //save element name for later
                        current = reader.getName().toString();
                        printSkus(current, reader);
                        break;
                    case XMLEvent.CHARACTERS:
                        findAuthors(current, reader);
                        break;
                }
            } //end loop
            out.println("Unique Authors=" + uniqueAuthors);
        } catch (XMLStreamException e) {
            out.println("Cannot parse: " + e);
        }
    }

    //get the name and value of the book's sku attribute
    private void printSkus(String current, XMLStreamReader r) {
        current = r.getName().toString();
        if ("book".equals(current)) {
            String k = r.getAttributeName(0).toString();
            String v = r.getAttributeValue(0);
            out.println("AttribName " + k + "=" + v);
        }
    }

    //inspect author elements and read their values.
    private void findAuthors(String current, XMLStreamReader r)
            throws XMLStreamException {
        if ("author".equals(current)) {
            String v = r.getText().trim();
            //can get whitespace value, so ignore
            if (v.length() > 0) {
                uniqueAuthors.add(v);
            }
        }
    }
}
The reader’s getText method gives the event value, and the getAttributeValue method is used with an integer indicating the index of the attribute you want a value for.
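Looking an attribute up by index works for this catalog because sku is the only attribute on book, but it becomes fragile once other attributes appear. As a sketch of an alternative, XMLStreamReader also offers getAttributeValue(namespaceURI, localName), which returns null when the attribute is absent; passing null for the namespace skips the namespace check. The class AttributeByName and the method skus are illustrative names only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class AttributeByName {
    // Collects the sku attribute of every <book>, wherever it appears in
    // the attribute list (hypothetical helper).
    static List<String> skus(String xml) {
        List<String> found = new ArrayList<String>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "book".equals(r.getLocalName())) {
                    // null namespace: match on the local name alone
                    String sku = r.getAttributeValue(null, "sku");
                    if (sku != null) {
                        found.add(sku);
                    }
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Cannot parse", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<catalog><book lang=\"en\" sku=\"123_xaa\"/>"
                   + "<book sku=\"988_yty\"/><book/></catalog>";
        System.out.println(skus(xml));
    }
}
```

In the sample input the first book carries an extra attribute ahead of sku and the last has none at all; the lookup by name handles both without special cases.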
Running the program produces the following results:
AttribName sku=123_xaa
AttribName sku=988_yty
AttribName sku=434_asd
AttribName sku=876_pep
Unique Authors=[George Orwell, Maurice Naftalin, Phillip Wadler, William Shakespeare]
In this example, you are interested in authors (which are their own element) and SKU values (which are attributes of the book element). You save the name of the current node for each iteration of the loop so that you can match it in your two processing methods.
Normally, you’ll just want to use the StAX implementation that comes with Java SE 6. But it’s worth noting that Sun has an implementation of StAX available as a separate download from https://sjsxp.dev.java.net/.
This implementation is built on Xerces 2, and is very lazy (a good thing for parsers!). There are other StAX implementations available as well, such as those from Oracle.
Using the StAX iterator model
The iterator API is the more flexible and easily extensible of the two models.
Let’s parse the same Catalog.xml document just defined with the other StAX model, the iterator. This is shown in Example 3-2.
Example 3-2. Reading XML with StAX iterator
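import static java.lang.System.out;
import java.io.InputStream;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;

public class StaxIterator {
    private static final String db = "/ch02/Catalog.xml";

    public void find() {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        //event-based reader; each event is returned as an object
        XMLEventReader reader = null;
        //get ahold of the file
        final InputStream is =
            StaxIterator.class.getResourceAsStream(db);
        try {
            //create the reader from the stream
            reader = xif.createXMLEventReader(is);
            //work with stream and inspect each event object
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                if (e.isStartElement()) {
                    //null unless this element has a sku attribute
                    Attribute sku = e.asStartElement()
                        .getAttributeByName(new QName("sku"));
                    if (sku != null) {
                        out.println(sku);
                    }
                }
            } //end loop
        } catch (XMLStreamException e) {
            out.println("Cannot parse: " + e);
        }
    }
}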
Executing the program gives the following output:
sku='123_xaa'
sku='988_yty'
sku='434_asd'
sku='876_pep'
As you can see, the two parsing models are very similar, with slightly different ways of handling events.
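One way to see the iterator model's extensibility is to wrap an XMLEventReader in a filter. The sketch below is an illustration rather than part of the recipe: it uses XMLInputFactory.createFilteredReader with an EventFilter that accepts only start elements, so the loop never sees text, comments, or end tags. The class FilteredElements and the method elementNames are made-up names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.EventFilter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class FilteredElements {
    // Lists element names in document order, using a filtered event reader
    // (hypothetical helper).
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<String>();
        try {
            XMLInputFactory xif = XMLInputFactory.newInstance();
            XMLEventReader base =
                xif.createXMLEventReader(new StringReader(xml));
            // the filter decides which events reach the loop below
            XMLEventReader filtered = xif.createFilteredReader(base,
                new EventFilter() {
                    public boolean accept(XMLEvent event) {
                        return event.isStartElement();
                    }
                });
            while (filtered.hasNext()) {
                XMLEvent e = filtered.nextEvent();
                names.add(e.asStartElement().getName().getLocalPart());
            }
            filtered.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Cannot parse", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<catalog><book sku=\"123_xaa\">"
                   + "<title>King Lear</title></book></catalog>";
        System.out.println(elementNames(xml));
    }
}
```

Because events are objects rather than cursor state, they can be filtered, queued, or handed to other methods, which is harder to do with the cursor model.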