This is the third article in a series dealing with using XPath within a Java application. This article focuses attention on tips for improving XPath performance.
The first two articles were:
- A Java XPath Best Practices Tutorial
Which showed how to setup and evaluate XPath expressions using Java.
- Java XPath Examples
Which explored XPath syntax and showed some examples of XPath expressions.
The code used in this article uses Java SE 8u111. There are no other frameworks nor tools are referenced in this article.
2. Parser Choice
Essentially there are two types of parsers used to parse XML data:
- DOM – Document Object Model – This popular class of parsers reads the entire XML file and constructs the DOM in memory. Since the DOM is memory resident, evaluation of the XPath expressions is usually faster than with SAX parsers.
- SAX – Simple API for XML – These parsers are typically event driven, use much less memory, and are better suited for large XML files. Since SAX parsers make a single pass through the XML data file, supported axes include only those that do not require “backing up” the parser to visit nodes previously processed nodes. For example the axes following-sibling is supported, but the axes preceding-sibling is not. SAX parsers are usually slower than their DOM counterparts. SAX parsers are event driven which means if a DOM (or DOM sub-trees) are used it is the responsibility of the user to build, maintain, and manage DOMs. In short: SAX parsers are slower and more complicated to use, but the trade-off is that SAX parsers can handle much larger XML files.
3. Performance Tests
This section evaluates several, but similar XPath expressions for performance in regards to compilation and execution.
3.1 XML Data Used
The XPath expressions used in the performance tests were run against two XML files. Both files conform to the following structure:
Data files: entryLevel_Short.xml and entryLevel_Big.xml
<?xml version="1.0" encoding="UTF-8"?> <TheRoot> <Entry> <Kid1></Kid1> <Kid2></Kid2> <Kid3></Kid3> <Kid4></Kid4> <Kid5></Kid5> <Kid6></Kid6> <Kid7></Kid7> <Kid8></Kid8> <Kid9></Kid9> <Kid10></Kid10> </Entry> <Entry> <Kid1></Kid1> <Kid2></Kid2> <Kid3></Kid3> <Kid4></Kid4> <Kid5></Kid5> <Kid6></Kid6> <Kid7></Kid7> <Kid8></Kid8> <Kid9></Kid9> <Kid10></Kid10> </Entry> <Entry> <Kid1></Kid1> <Kid2></Kid2> <Kid3></Kid3> <Kid4></Kid4> <Kid5></Kid5> <Kid6></Kid6> <Kid7></Kid7> <Kid8></Kid8> <Kid9></Kid9> <Kid10></Kid10> </Entry> . . . A. The file enrtyLevel_Short.xml has 2,000 ‘Entry’ nodes B. The file entryLevel_Big.xml has 20,000 ‘Entry’ nodes C. Each ‘Entry’ node, in both files, has 10 children nodes. Kid1 through Kid10
3.2 Code Used
The code used to compile and evaluate the XPath expressions using XPath objects, previously presented in the Java XPath Examples article (https://examples.javacodegeeks.com/core-java/xml/xpath/java-xpath-examples/). The code was updated with a loop to perform the evaluation, compilation, and execution of the XPath expression a preset number of times.
Code adopted from Java XPath Best Practices Tutorial article (https://examples.javacodegeeks.com/core-java/xpath-best-practices-tutorial/) has been added to evaluate ‘predicate’ based XPath expressions by ‘Walking the DOM’, with the addition of a loop to validate and execute the expression.
The complete code used is contained in the class XPathPerformanceTester and is included, along with the XML data files (entryLevel_Short.xml and entryLevel_Big.xml), as part of the download attached to this article.
3.3 Performance Comparison Results
Below is a table that compares the same XPath expression when evaluated against XML data files with:
- 2,000 ‘Entry’ nodes – namely entryLevel_Short.xml
- 20,000 ‘Entry’ nodes – namely entryLevel_Big.xml
Table showing performance when compiling and evaluating different XPath expressions with smaller and larger XML files
Average time using Average time using Expression entryLevel_Short.xml entryLevel_Big.xml Comment /TheRoot/Entry 82ms 319ms First Entry node //Kid1/.. 88ms 550ms Parent of first Kid1 //Kid1/parent::node() 42ms 416ms Parent of first Kid1 /TheRoot/Entry/Kid1 32ms 322ms Returns first Kid1 node
Rows 5 and 6 both return the parent of the first Kid1 node. Row 5 locates the parent by using the relative location operator ‘..’ to get the parent, where Row 6 uses the ‘parent::’ axes to get the same result. Notice when executed against the smaller XML file, the expression ‘//Kid1/..’ expression takes over twice as long to execute as the ‘//Kid1/parent::node()’ expression. Compare that to the execution times for a file with 10% more nodes, the average execution for the ‘//Kid1/parent:: node()’ expression grows by roughly 10% as expected, while the execution time for the ‘//Kid1/..’ expression extends by approximately 20%.
Table showing performance when using predicates with XPath objects versus ‘Walking the DOM’ as applied to smaller and larger XML files
Average time using Average time using Expression entryLevel_Short.xml entryLevel_Big.xml Comment /TheRoot/Entry[last()] 847ms 88,794ms Last Entry node ~Entry[last()] 15ms 30ms Last Entry node /TheRoot/Entry 34ms 203ms 1,000th Entry Node ~Entry 2ms 15ms 1,000th Entry Node /TheRoot/Entry 39ms 218ms 2,000th Entry Node ~Entry 2ms 14ms 2,000th Entry Node /TheRoot/Entry 97ms 392ms 20,000th Entry Node ~Entry undefined-Not Enough Data 16ms 20,000th Entry Node
Expressions in that begin with ‘~’ are evaluated using ‘Walking the DOM’ code.
The expression ‘//Entry[last()]’ returns the last ‘Entry’ node in the data file. Notice that the timing results for Row 4, takes over an order of magnitude longer to execute over any other expression when used with the smaller XML file, and over 2 orders of magnitude when applied to the larger XML file.
The table shows that the ‘Walk the DOM’ code can determine the same result as the XPath Object code in a fraction of the time.
However, the comparison is not fair since the ‘Walk the DOM’ code is optimized for expressions using a predicate to find either the nth or last occurrence of a single node. The ‘Walk the DOM’ code is very specific and fails if used with more complicated expressions.
Granted this timing data is anecdotal at best. However, it does show some relative performance relationships between XPath expressions and the data.
XPath performance is a matter of tradeoffs of the different components.
DOM parsers are generally faster than SAX parsers. The tradeoff is that SAX parsers are more complicated to use since they are single pass, and event driven, but they can handle much larger XML files than DOM parsers.
Using ‘Walk the DOM’ code can be blazingly fast but is not very robust. Small changes to the XML schema can require massive code changes to accommodate. The tradeoff is that using XPath Objects usually only requires a change in the XPath expression to accommodate variations in the XML schema, but has poorer performance.
5. Alternative Products
Here is a partial list of XPath related products which address different aspects of performance or functionality:
- CachedXPathAPI’s which boasts to be faster by up to 100X over the XPath API. More information can be found here: http://xml.apache.org/xalan-j/apidocs/org/apache/xpath/CachedXPathAPI.html
- SAXPath utilizes a simple callback interface API that abstracts the details of parsing and processing data with a SAX parser. More information can be found here: http://www.saxpath.org/
- VTD-XML Is easier to use than SAX, uses significantly less memory than a DOM parser, and supports XPath. More information can be found here: http://vtd-xml.sourceforge.net/
- XMLDog supports the evaluation of multiple XPath expressions during a single pass of an XML file using a SAX parser. More information can be found here: https://github.com/santhosh-tekuri/jlibs/wiki/XMLDog
6. Download the Source Code
Download the XML data and source code used in his article.
You can download the full source code for this article here: JavaXPathPerformanceTips.zip