Wednesday, March 10, 2010

XPath and Java

XPath is a domain-specific language used to extract data from an XML document. It's supported by many general-purpose languages, including C#, PHP, and Java. Take the following XML document for example:

<library>
    <book>
        <language>en</language>
        <title>Solaris</title>
        <author>Stanislaw Lem</author>
    </book>
    <book>
        <language>fr</language>
        <title>Le Petit Prince</title>
        <author>Antoine de Saint-Exupéry</author>
    </book>
    <book>
        <language>en</language>
        <title>Dune</title>
        <author>Frank Herbert</author>
    </book>
</library>

The following XPath query retrieves the titles of all English-language books:

/library/book[language='en']/title

Java makes working with XPath (and XML in general) kind of complicated, since many different classes are involved. First, the XML document must be loaded into a DOM (Document Object Model).

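// Requires java.io.File, javax.xml.transform.*, javax.xml.transform.dom.DOMResult,
// javax.xml.transform.stream.StreamSource, and org.w3c.dom.Node.
// A Transformer with no stylesheet copies the XML file into an in-memory DOM tree.
StreamSource source = new StreamSource(new File("books.xml"));
DOMResult result = new DOMResult();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(source, result);
Node documentRoot = result.getNode(); // the document node, parent of <library>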

This means that the XML text is read into memory and organized into a tree of nodes where each tag is an element node. The top element node would be the <library> element, which would have three child element nodes (<book>), and so on.
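To make the tree structure concrete, here is a minimal sketch that parses a shortened, inline copy of the library document and lists the element children of the top element. The class name DomTreeSketch and the inline XML string are assumptions made for illustration, and it loads the document with DocumentBuilderFactory rather than the Transformer approach shown above, purely to keep the example short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTreeSketch {
    // Hypothetical, shortened copy of the library document (authors omitted).
    static final String XML = "<library>"
        + "<book><language>en</language><title>Solaris</title></book>"
        + "<book><language>fr</language><title>Le Petit Prince</title></book>"
        + "<book><language>en</language><title>Dune</title></book>"
        + "</library>";

    // Parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Collects the names of the direct children that are element nodes,
    // skipping text nodes such as whitespace between tags.
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Node library = parse(XML).getDocumentElement();
        System.out.println(library.getNodeName());      // prints "library"
        System.out.println(childElementNames(library)); // prints "[book, book, book]"
    }
}
```

The node type check matters for real files: when the XML is indented, the whitespace between tags shows up as text nodes in the child list.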



To demonstrate the power of XPath, this is what Java code might look like if XPath did not exist. The programmer would have to manually iterate through the entire DOM to get what she needed.

List<String> englishTitles = new ArrayList<String>();
Node books = documentRoot.getFirstChild();
if (books.getNodeName().equals("library")){
    for (int i = 0; i < books.getChildNodes().getLength(); ++i){
        Node book = books.getChildNodes().item(i);
        if (book.getNodeName().equals("book")){
            boolean english = false;
            String title = null;
            for (int j = 0; j < book.getChildNodes().getLength(); ++j){
                Node bookChild = book.getChildNodes().item(j);
                if (bookChild.getNodeName().equals("language") &&
                        bookChild.getTextContent().equals("en")){
                    english = true;
                }
                if (bookChild.getNodeName().equals("title")){
                    title = bookChild.getTextContent();
                }
            }
            if (english && title != null){
                englishTitles.add(title);
            }
        }
    }
}
for (String title : englishTitles){
    System.out.println(title);
}

As you can see, this is very tedious and error-prone! XPath is the better solution by far:

XPath xpath = XPathFactory.newInstance().newXPath();
NodeList nodeList = (NodeList)xpath.evaluate("/library/book[language='en']/title",
        documentRoot, XPathConstants.NODESET);
for (int i = 0; i < nodeList.getLength(); ++i){
    Node node = nodeList.item(i);
    System.out.println(node.getTextContent());
}
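For anyone who wants to try the query without a books.xml file on disk, here is a rough, self-contained sketch. The class name EnglishTitles, the helper method select, and the inline XML string are all made up for this example, and the document is parsed with DocumentBuilderFactory instead of the Transformer code above to keep it compact.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class EnglishTitles {
    // Hypothetical, shortened copy of the library document (authors omitted).
    static final String XML = "<library>"
        + "<book><language>en</language><title>Solaris</title></book>"
        + "<book><language>fr</language><title>Le Petit Prince</title></book>"
        + "<book><language>en</language><title>Dune</title></book>"
        + "</library>";

    // Evaluates an XPath expression against the XML text and returns
    // the text content of every matching node.
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // prints "[Solaris, Dune]"
        System.out.println(select(XML, "/library/book[language='en']/title"));
    }
}
```

Changing the predicate to language='fr' returns only "Le Petit Prince", with no change to the Java code.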

Namespaces

XML documents often use namespaces. These are sort of like Java packages: they group related elements together and prevent name collisions. Let's say that each <language> element belonged to a namespace:

<book>
    <language xmlns="http://translate.google.com">en</language>
    <title>Solaris</title>
    <author>Stanislaw Lem</author>
</book>

Note: While namespaces technically can be anything (like "abc123" for example), they should be globally unique. There's no way to enforce this, so the convention is to use a URI belonging to the person or company creating the namespace. For example, if Oracle wants to use a namespace, they can be fairly certain that no one else in the entire world is using one starting with "http://www.oracle.com".


To make Java aware of namespaces, a NamespaceContext object must be created and set on the XPath object.

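XPath xpath = XPathFactory.newInstance().newXPath();
xpath.setNamespaceContext(new NamespaceContext() {
    // Maps a prefix used in the XPath query to its namespace URI.
    public String getNamespaceURI(String prefix) {
        if ("tr".equals(prefix)){
            return "http://translate.google.com";
        }
        // The NamespaceContext contract asks for "" (javax.xml.XMLConstants.NULL_NS_URI),
        // not null, when a prefix is unbound.
        return XMLConstants.NULL_NS_URI;
    }
    // The reverse lookups below are not needed for the query in this post.
    public Iterator getPrefixes(String uri) {
        return null;
    }
    public String getPrefix(String uri) {
        return null;
    }
});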

This will assign the prefix "tr" to the namespace "http://translate.google.com". The prefix can be anything, but the namespace must match the one in the XML document.

NodeList nodeList = (NodeList)xpath.evaluate("/library/book[tr:language='en']/title",
        documentRoot, XPathConstants.NODESET);
for (int i = 0; i < nodeList.getLength(); ++i){
    Node node = nodeList.item(i);
    System.out.println(node.getTextContent());
}
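Putting the namespace pieces together, here is a self-contained sketch you can run without any files. The class name NamespacedTitles, the helper methods, and the inline XML string are assumptions made for this example. It parses with DocumentBuilderFactory, which is not namespace-aware by default, so the sketch calls setNamespaceAware(true) before parsing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespacedTitles {
    static final String TR_URI = "http://translate.google.com";

    // Hypothetical document in which every <language> element is namespaced.
    static final String XML = "<library>"
        + "<book><language xmlns=\"" + TR_URI + "\">en</language><title>Solaris</title></book>"
        + "<book><language xmlns=\"" + TR_URI + "\">fr</language><title>Le Petit Prince</title></book>"
        + "<book><language xmlns=\"" + TR_URI + "\">en</language><title>Dune</title></book>"
        + "</library>";

    // Binds the prefix "tr" to TR_URI and answers the reverse lookups as well.
    static NamespaceContext context() {
        return new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "tr".equals(prefix) ? TR_URI : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return TR_URI.equals(uri) ? "tr" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                String prefix = getPrefix(uri);
                List<String> prefixes = prefix == null
                    ? Collections.<String>emptyList()
                    : Collections.singletonList(prefix);
                return prefixes.iterator();
            }
        };
    }

    // Runs an XPath expression against XML and returns the matching text values.
    static List<String> select(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default for this factory
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(context());
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // prints "[Solaris, Dune]"
        System.out.println(select("/library/book[tr:language='en']/title"));
        // prints "[]" because an unprefixed name only matches elements with no namespace
        System.out.println(select("/library/book[language='en']/title"));
    }
}
```

The second query shows the most common surprise: once an element lives in a namespace, the unprefixed name no longer matches it, even though the XML itself uses no prefix.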

To learn more about XPath, you can visit w3schools.com.
