XPath

Java XPath Performance Tips

1. Introduction

This is the third article in a series dealing with using XPath within a Java application. This article focuses attention on tips for improving XPath performance.

The first two articles were:

The code used in this article uses Java SE 8u111. There are no other frameworks nor tools are referenced in this article.

2. Parser Choice

Essentially there are two types of parsers used to parse XML data:

  • DOM – Document Object Model – This popular class of parsers reads the entire XML file and constructs the DOM in memory. Since the DOM is memory resident, evaluation of the XPath expressions is usually faster than with SAX parsers.
  • SAX – Simple API for XML – These parsers are typically event driven, use much less memory, and are better suited for large XML files. Since SAX parsers make a single pass through the XML data file, supported axes include only those that do not require “backing up” the parser to visit nodes previously processed nodes. For example the axes following-sibling is supported, but the axes preceding-sibling is not. SAX parsers are usually slower than their DOM counterparts. SAX parsers are event driven which means if a DOM (or DOM sub-trees) are used it is the responsibility of the user to build, maintain, and manage DOMs. In short: SAX parsers are slower and more complicated to use, but the trade-off is that SAX parsers can handle much larger XML files.

3. Performance Tests

This section evaluates several, but similar XPath expressions for performance in regards to compilation and execution.

3.1 XML Data Used

The XPath expressions used in the performance tests were run against two XML files. Both files conform to the following structure:

Data files: entryLevel_Short.xml and entryLevel_Big.xml

<?xml version="1.0" encoding="UTF-8"?>
<TheRoot>
    <Entry>
        <Kid1></Kid1>
        <Kid2></Kid2>
        <Kid3></Kid3>
        <Kid4></Kid4>
        <Kid5></Kid5>
        <Kid6></Kid6>
        <Kid7></Kid7>
        <Kid8></Kid8>
        <Kid9></Kid9>
        <Kid10></Kid10>
    </Entry>
    <Entry>
        <Kid1></Kid1>
        <Kid2></Kid2>
        <Kid3></Kid3>
        <Kid4></Kid4>
        <Kid5></Kid5>
        <Kid6></Kid6>
        <Kid7></Kid7>
        <Kid8></Kid8>
        <Kid9></Kid9>
        <Kid10></Kid10>
    </Entry>
    <Entry>
        <Kid1></Kid1>
        <Kid2></Kid2>
        <Kid3></Kid3>
        <Kid4></Kid4>
        <Kid5></Kid5>
        <Kid6></Kid6>
        <Kid7></Kid7>
        <Kid8></Kid8>
        <Kid9></Kid9>
        <Kid10></Kid10>
    </Entry>
              .
              .
              .

A. The file enrtyLevel_Short.xml has 2,000 ‘Entry’ nodes

B. The file entryLevel_Big.xml has 20,000 ‘Entry’ nodes

C. Each ‘Entry’ node, in both files, has 10 children nodes.
    Kid1 through Kid10

3.2 Code Used

The code used to compile and evaluate the XPath expressions using XPath objects, previously presented in the Java XPath Examples article (https://examples.javacodegeeks.com/core-java/xml/xpath/java-xpath-examples/). The code was updated with a loop to perform the evaluation, compilation, and execution of the XPath expression a preset number of times.

Code adopted from Java XPath Best Practices Tutorial article (https://examples.javacodegeeks.com/core-java/xpath-best-practices-tutorial/) has been added to evaluate ‘predicate’ based XPath expressions by ‘Walking the DOM’, with the addition of a loop to validate and execute the expression.

The complete code used is contained in the class XPathPerformanceTester and is included, along with the XML data files (entryLevel_Short.xml and entryLevel_Big.xml), as part of the download attached to this article.

3.3 Performance Comparison Results

Below is a table that compares the same XPath expression when evaluated against XML data files with:

  • 2,000 ‘Entry’ nodes – namely entryLevel_Short.xml
  • 20,000 ‘Entry’ nodes – namely entryLevel_Big.xml

Table showing performance when compiling and evaluating different XPath expressions with smaller and larger XML files

                       Average time using       Average time using     
Expression             entryLevel_Short.xml     entryLevel_Big.xml     Comment

/TheRoot/Entry          82ms                       319ms                First Entry node
//Kid1/..               88ms                       550ms                Parent of first Kid1
//Kid1/parent::node()   42ms                       416ms                Parent of first Kid1
/TheRoot/Entry/Kid1     32ms                       322ms                Returns first Kid1 node

Rows 5 and 6 both return the parent of the first Kid1 node. Row 5 locates the parent by using the relative location operator ‘..’ to get the parent, where Row 6 uses the ‘parent::’ axes to get the same result. Notice when executed against the smaller XML file, the expression ‘//Kid1/..’ expression takes over twice as long to execute as the ‘//Kid1/parent::node()’ expression. Compare that to the execution times for a file with 10% more nodes, the average execution for the ‘//Kid1/parent:: node()’ expression grows by roughly 10% as expected, while the execution time for the ‘//Kid1/..’ expression extends by approximately 20%.

Table showing performance when using predicates with XPath objects versus ‘Walking the DOM’ as applied to smaller and larger XML files

                          Average time using       Average time using     
Expression                entryLevel_Short.xml     entryLevel_Big.xml     Comment

/TheRoot/Entry[last()]    847ms                    88,794ms               Last Entry node
~Entry[last()]             15ms                        30ms               Last Entry node
/TheRoot/Entry[1000]       34ms                       203ms               1,000th Entry Node
~Entry[1000]                2ms                        15ms               1,000th Entry Node
/TheRoot/Entry[2000]       39ms                       218ms               2,000th Entry Node
~Entry[2000]                2ms                        14ms               2,000th Entry Node
/TheRoot/Entry[20000]      97ms                       392ms               20,000th Entry Node
~Entry[20000]             undefined-Not Enough Data    16ms               20,000th Entry Node

Expressions in that begin with ‘~’ are evaluated using ‘Walking the DOM’ code.

The expression ‘//Entry[last()]’ returns the last ‘Entry’ node in the data file. Notice that the timing results for Row 4, takes over an order of magnitude longer to execute over any other expression when used with the smaller XML file, and over 2 orders of magnitude when applied to the larger XML file.

The table shows that the ‘Walk the DOM’ code can determine the same result as the XPath Object code in a fraction of the time.

However, the comparison is not fair since the ‘Walk the DOM’ code is optimized for expressions using a predicate to find either the nth or last occurrence of a single node. The ‘Walk the DOM’ code is very specific and fails if used with more complicated expressions.

Granted this timing data is anecdotal at best. However, it does show some relative performance relationships between XPath expressions and the data.

4. Conclusion

XPath performance is a matter of tradeoffs of the different components.

DOM parsers are generally faster than SAX parsers. The tradeoff is that SAX parsers are more complicated to use since they are single pass, and event driven, but they can handle much larger XML files than DOM parsers.

Using ‘Walk the DOM’ code can be blazingly fast but is not very robust. Small changes to the XML schema can require massive code changes to accommodate. The tradeoff is that using XPath Objects usually only requires a change in the XPath expression to accommodate variations in the XML schema, but has poorer performance.

5. Alternative Products

Here is a partial list of XPath related products which address different aspects of performance or functionality:

6. Download the Source Code

Download the XML data and source code used in his article.

Download
You can download the full source code for this article here: JavaXPathPerformanceTips.zip

David Guinivere

David graduated from University of California, Santa Cruz with a Bachelor’s degree in Information Science. He has been actively working with Java since the days of J2SE 1.2 in 1998. His work has largely involved integrating Java and SQL. He has worked on a wide range of projects including e-commerce, CRM, Molecular Diagnostic, and Video applications. As a freelance developer he is actively studying Big Data, Cloud and Web Development.
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Inline Feedbacks
View all comments
Back to top button