XPath

Java XPath Performance DOM vs SAX Example

1. Introduction

This article compares the performance of the Java DOM parser distributed with Java and a SAX parser distributed with the Home Edition of Saxon for evaluating various XPath expressions.

Typically DOM parsers can evaluate XPath expressions faster than SAX parsers since DOM parsers construct and retain the DOM document in memory when parsing XML data. SAX parsers, on the other hand, are event-driven, single pass parsers which uses minimal amounts of memory and can consequently handle much larger XML files.

The Saxon parser combines the best of both worlds in that it enables XPath expressions to be evaluated using TreeInfo objects. TreeInfo objects keep information that is common to all the nodes in a tree. Also, TreeInfo objects are a contextual equivalent to memory resident portions of the DOM document.

1.1. Sample Code Requirements

The example code allows the user to compare the time taken to compile and evaluate XPath expressions using either the Java DOM parser or the Saxon-HE (Home Edition) SAX parser. After evaluating the XPath expression, the user can view the results and optionally get the time to compile and evaluate the XPath expression (Averaged over 5,000 compilation and evaluations).

The example code uses the Java DOM parser supplied with Java SE 8u111, and the Saxon-HE (Home Edition) version 9.7 SAX parser.

NOTE: Due to a documented bug in the saxon9he.jar, the sample code (available in the download for this article) must be built and executed using the IntelliJ IDEA to avoid a runtime error.

The qual2017.xml file is a collection of QualifierRecords and is the default data file which must be downloaded from the U.S. National Library of Medicine using the following URL: https://www.nlm.nih.gov/mesh/download_mesh.html

Download Saxon-HE (Home Edition) for compiling and executing the example code for this article. The Saxon HE package, documentation, additional code samples are available from the Saxon website (http://saxon.sourceforge.net/). It is also strongly recommended you download the resources (which contains sample code, and user documentation).

Tip
You may skip project creation and jump directly to download of the example code below.

2. The Data

The default data used by the example code presented in this article, is a single XML file. The qual2017.xml file is 269Mb in size, and describes a set of Qualifier Records as defined by the U.S. National Library of Medicine. Below a partial listing from the qual2017.xml is shown here to give the user an idea of the structure of the data.

NOTE: “…” indicates more of the same XML tags appear in the downloaded qual2017.xml file.

qual2017.xml

<!DOCTYPE QualifierRecordSet SYSTEM "https://www.nlm.nih.gov/databases/dtd/nlmqualifierrecordset_20170101.dtd">
<QualifierRecordSet LanguageCode = "eng">
    <QualifierRecord>
        <QualifierUI>Q000000981</QualifierUI>
            <QualifierName>
                <String>diagnostic imaging</String>
            </QualifierName>
            <DateCreated>
                <Year>2016</Year>
                <Month>06</Month>
                <Day>29</Day>
            </DateCreated>
            <DateRevised>
                <Year>2016</Year>
                <Month>06</Month>
                <Day>08</Day>
            </DateRevised>
            <DateEstablished>
                <Year>2017</Year>
                <Month>01</Month>
                <Day>01</Day>
            </DateEstablished>
            <Annotation>subheading only; coordinate with specific  imaging technique if pertinent
            </Annotation>
            <HistoryNote>2017(1967)
            </HistoryNote>
            <TreeNumberList>
                <TreeNumber>Y04.010</TreeNumber>
            </TreeNumberList>
            <ConceptList>
                <Concept PreferredConceptYN="Y">
                    <ConceptUI>M000614856</ConceptUI>
                    <ConceptName>
                        <String>diagnostic imaging</String>
                    </ConceptName>
                    <ScopeNote>Used for the visualization of an anatomical structure or for the diagnosis of disease.  Commonly used imaging techniques include radiography, radionuclide imaging, thermography, tomography, and ultrasonography
                    </ScopeNote>
                    <ConceptRelationList>
                        <ConceptRelation RelationName="NRW">
                            <Concept1UI>M000614856</Concept1UI>
                            <Concept2UI>M0030904</Concept2UI>
                       </ConceptRelation>
                           ...
                   </ConceptRelationList>
                       ....
            </ConceptList>
                ...
        </QualifierRecord>
            ...
</QualifierRecordSet>

A list of the XML tags and the number of occurrences is provided here to facilitate verification of XPath expression results.

List of the XML tags along with the number of times that the tag appears in the qual2017.xml file:

Abbreviation         80
Annotation           80
Concept             239
Concept1UI          318
Concept2UI          318
ConceptList          80
ConceptName         239
ConceptRelation     318
ConceptRelationList 210
ConceptUI           239
DateCreated          83
DateEstablished      80
DateRevised          80
Day                 243
EntryVersion         80
HistoryNote          80
Month               243
OnlineNote           78
QualifierName        80
QualifierRecord      80
QualifierRecordSet    1
QualifierUI          80
ScopeNote            83
SortVersion           7
String              624
Term                305
TermList            239
TermUI              305
TreeNumber           95
TreeNumberList       80
Year                243

3. Running the Example Code Application

The example code included with this article is a Java console application which gives the user the ability to check the performance of XPath expressions when compiled and evaluated using either the Java DOM parser or the SAX parser provided with Saxon HE.

Below is the main console prompt for the example code application:

=+=+=+=+=+=+=+=+
XML file:qual2017.xml
Parser Type:None
Last command: 


Enter , !d, !s, !x, !q, !help: 

Explanation of the example code application options:

!d to switch to DOM parser
!s to switch to the SAX parser
!x to change XML file

!q to exit (default)

!help displays this message

Any input that does not begin with "!" is assumed to be an expression.

Below is a sample session with the example code application:

=+=+=+=+=+=+=+=+
XML file:qual2017.xml
Parser Type:None
Last command: 


Enter , !d, !s, !x, !q, !help:   !d
Switching to DOM parser...Done.
=+=+=+=+=+=+=+=+
XML file:qual2017.xml
Parser Type:DOM
Last command: 


Enter , !d, !s, !x, !q, !help:   count(//SortVersion)
count(//SortVersion) counted 7.0 records in qual2017.xml.
Run time test for 5000 executions? (Y or N)  y
Average time to compile and evaluate:  7.4756 (Total: 37378)
=+=+=+=+=+=+=+=+
XML file:qual2017.xml
Parser Type:DOM
Last command: count(//SortVersion)


Enter , !d, !s, !x, !q, !help:   !s
Switching to SAX parser...Done.
=+=+=+=+=+=+=+=+
XML file:qual2017.xml
Parser Type:SAX
Last command: count(//SortVersion)


Enter , !d, !s, !x, !q, !help:   //EntryVersion/parent::node()
There are 80 nodes in the result.
Display results?  (Y or N)  y
1: Term = T000895609diagnostic imaging20160219DGDIAG IMAGE
2: Term = T060543abnormalitiesABABNORM
3: Term = T060555administration & dosageADADMINISTRATION AADMIN

   ... Condensed for brevity
   (NOTE: The Saxon HE parser shows the concatenation
          of the children node values for any node in 
          the node set generated by the expression.)

77: Term = T061387contraindicationsCTCONTRA
78: Term = T061435agonistsAGAGON
79: Term = T061436virologyVIVIROL
80: Term = T490705ethics20020424ESETHICS
Run time test for 5000 executions? (Y or N)  y
Average time to compile and evaluate:  0.1424 (Total: 712)
=+=+=+=+=+=+=+=+
XML file:qual2017.xml
Parser Type:SAX
Last command: //EntryVersion/parent::node()


Enter , !d, !s, !x, !q, !help:   !q
Do you wish to exit?  (Y or N):   y

3.2 Instantiating the Parsers

The following methods show how to instantiate the parsers. When the user uses the !d or !s commands to change parsers, the structures for the current parser are set to null to free up memory when they are garbage collected.

Methods to Instantiate the DOM or SAX Parser

    // Initialize the DOM parser and load the DOM document
    private void initializeDOMParser() {
        System.out.print(PROMPT_DOM_PARSER);
        // Set SAX structures to null to release memory
        saxTreeInfo = null;
        xpathObj = null;

        // Instantiate the DOM parser
        domParser = getDOMParser();
        // Load the DOM Document from the XML data using the parser
        try {
            domDocument = domParser.parse(xml_file);
        } catch (SAXException | IOException e) {
            System.out.println("Exception in loadDOMDocument():  " + e.getMessage());
            e.printStackTrace();
        }
        // Instantiate an XPath object which compiles
        // and evaluates XPath expressions.
        xpathObj = XPathFactory.newInstance().newXPath();
        System.out.println(PROMPT_DONE);
    }

    // Initialize the SAX parser and it's support structures
    private void initializeSAXParser() {
        System.out.print(PROMPT_SAX_PARSER);
        // Set DOM structures to null to release memory
        domDocument = null;
        domParser = null;
        xpathObj = null;
        try {
            // The following initialization code is specific to Saxon
            // Please refer to SaxonHE documentation for details
            XPathFactory xpFactory = new net.sf.saxon.xpath.XPathFactoryImpl();
            xpathObj = xpFactory.newXPath();

            // Build the source document.
            InputSource inputSrc = new InputSource(new File(xml_file).getAbsolutePath());
            SAXSource saxSrc = new SAXSource(inputSrc);
            net.sf.saxon.Configuration config = 
                ((net.sf.saxon.xpath.XPathFactoryImpl) 
                xpFactory).getConfiguration();
            saxTreeInfo = config.buildDocumentTree(saxSrc);
        } catch (XPathException e) {
            System.out.println("Exception in initialize():  " + e.getMessage());
            e.printStackTrace();
        }
        System.out.println(PROMPT_DONE);
    }

3.3. Compiling and Evaluating XPath Expressions

XPath expressions are compiled and evaluated in a single method. A loop provides the ability to measure the time required for compilation and evaluation of an XPath expression. Initially, the loop is executed once, and the user prompted if the results of the evaluation should be displayed. A single execution of the loop allows the user the opportunity to fine tune the XPath expression to ensure the desired output. After giving the user the chance to inspect the results, the the user is prompted if the XPath expression should be timed to measure the compilation and execution time. To get a more accurate time the compilation and evaluation operation is averaged over 5,000 times.

NOTE: XPath expressions return either a Number or a Set (List) of Nodes. Expressions of the form count(<expression>) are the only expressions that the example code application recognizes as returning a Number, all other expressions are assumed to return a Set of Nodes.

Method for Compiling and Evaluating XPath Expressions

    private void displayTimedResults(float testTimeTotal) {
        System.out.printf("%s %.4f (%s %.0f)\n", PROMPT_AVG_TIME,
                (float) (testTimeTotal / (float)TOTAL_TEST_COUNT),
                PROMPT_TOTAL, testTimeTotal);
    }

    // Display standardized numeric result
    private void displayNumericResult(String command,
                                      Number numResult,
                                      String fileName) {
        System.out.println(command + " counted " + numResult +
                " records in " + fileName + ".");
    }
   
    // Display standardized node contents
    private void displayNodeResult(int nodeIdx, String name,
                                   String value) {
        System.out.println((nodeIdx + 1) + ": " + name + " = " + value);    }

    // Execute XPath command based on selected parser
    private void executeXPath(String cmd, ParserType parserInUse,
                              boolean fNumberResult, Mode mode) {
        XPathExpression processCmd = null;
        Number resultNum = null;
        long testTimeStart = 0;
        // Setup the number of times to compile and evaluate
        // the expression based on the mode
        float testTimeTotal = 0;
        int maxTestCount = 1;
        if (mode == Mode.Test) {
            maxTestCount = TOTAL_TEST_COUNT;
        }
        try {
            switch (parserInUse) {
                case DOM:
                    if (fNumberResult) {
                        for (int testCount = 0; 
                             testCount < maxTestCount;
                             testCount++)
                        {
                            testTimeStart = System.currentTimeMillis();
                            processCmd = xpathObj.compile(cmd);
                            resultNum = 
                                (Number) processCmd.evaluate(domDocument, 
                                             XPathConstants.NUMBER);
                            testTimeTotal += System.currentTimeMillis() – 
                                                 testTimeStart;
                        }
                        if (mode == Mode.Eval) {
                            displayNumericResult(cmd, resultNum, xml_file);
                        } else {
                            displayTimedResults(testTimeTotal);
                        }
                    } else {
                        // Result will be a list
                        NodeList resultNodeList = null;
                        for (int testCount = 0; 
                             testCount  0))
                            {
                                int nodeCount = resultNodeList.getLength();
                                System.out.println("There are " + nodeCount +
                                   " nodes in the result:");

                                String displayResults = 
                                    promptUser(PROMPT_DISPLAY_RESULTS, "N");
                                if (displayResults.equalsIgnoreCase("Y")) {
                                    // Go through each node in the list
                                    // and display the node name and value
                                    for (int i = 0; i < nodeCount; i++) {
                                        Node node = resultNodeList.item(i);
                                        displayNodeResult(i, 
                                            node.getNodeName(),
                                            node.getNodeValue());
                                    }
                                }
                            }
                        } else {
                            displayTimedResults(testTimeTotal);
                        }
                    }
                    break;
                case SAX:
                    if (fNumberResult) {
                        for (int testCount = 0; 
                             testCount < maxTestCount;
                             testCount++)
                        {
                            testTimeStart = 
                                System.currentTimeMillis();
                            processCmd = xpathObj.compile(cmd);
                            resultNum = 
                                (Number) processCmd.evaluate(saxTreeInfo, 
                                             XPathConstants.NUMBER);
                            testTimeTotal += System.currentTimeMillis() – 
                                testTimeStart;
                        }
                        if (mode == Mode.Eval) {
                            displayNumericResult(cmd, resultNum, xml_file); 
                        } else {
                            displayTimedResults(testTimeTotal);
                        }
                    } else {
                        // Result will be a list of nodes
                        List resultNodeList = null;
                        for (int testCount = 0;
                             testCount  0))
                            {
                                int nodeCount = resultNodeList.size();
                                System.out.println("There are " + 
                                    nodeCount + " nodes in the result.");

                                String displayResults = 
                                    promptUser(PROMPT_DISPLAY_RESULTS, "N");
                                if (displayResults.equalsIgnoreCase("Y")) {
                                    // Go through each node in the list
                                    // and display the node name and value
                                    for (int i = 0; i < nodeCount; i++) {
                                        NodeInfo cNode = 
                                            (NodeInfo) resultNodeList.get(i);
                                        displayNodeResult(i,
                                             cNode.getDisplayName(),
                                             cNode.getStringValue());
                                    }
                                }
                            }
                        } else {
                            displayTimedResults(testTimeTotal);
                        }
                    }
                    break;
                default:
                    return;
            }
        } catch (XPathExpressionException e) {
            System.out.println("Exception in executeXPath():  " + 
                                e.getMessage());
            e.printStackTrace();
        }
    }

4. DOM and SAX Performance Results

The main point of this article is to show the performance difference when using a DOM parser and a SAX parser.

Table showing performance difference between Java DOM parser and Saxon-HE SAX parser for compiling and evaluating XPath expressions.

Expression                                                               DOM       SAX
count(//ConceptName)                                                     6.666     0.018
				
//Concept[@PreferredConceptYN]/@PreferredConceptYN                       7.1174    0.0228

//EntryVersion/parent::node()                                            6.8054    0.0232
//EntryVersion/..                                                        6.4556    0.0324

/QualifierRecordSet/QualifierRecord                                      6.362     0.0202
/QualifierRecordSet/QualifierRecord/QualifierUI/child::node()            6.477     0.0294
/QualifierRecordSet/QualifierRecord[5]/QualifierUI/child::node()         6.4176    0.025
/QualifierRecordSet/QualifierRecord[80]/QualifierUI/child::node()        6.5328    0.0408
/QualifierRecordSet/QualifierRecord[last()]/QualifierUI/child::node()    6.6468    0.0388

Note: Measurements in the table above are in milliseconds

From the data presented above there are a couple of items worth noting:

  • The performance of the Saxon-HE SAX parser is over 2 orders of magnitude better than the Java DOM parser.
  • XPath expressions using an axes step (i.e.: parent::node()) compile and evaluate faster than those expressions which instead use a node specificaion ( i.e.: .. ) under the Java DOM parser but slower under the Saxon-HE SAX parser. (Refer to rows 6 and 7 in the preceding table.)

4. Conclusion

On the surface SAX parsers are known for being:

  1. Slower than DOM parsers.
  2. Event driven, requiring the DefaultHandler class be extended in order to provide an event handler suited for the application
  3. Not usually supporting XPath expressions
  4. Processing XML files in a single pass. Making it difficult to implement “backward looking” axes such as parent::, ancestor::, preceding:: or preceding-sibling:: provided the parser supports XPath expressions
  5. Being able to process much larger XML files since they do not keep data in memory

The Saxon HE SAX parser goes against most of those traditions since as compared to the Java DOM pasrser since it:

  1. Is faster
  2. Supports XPath expressions, including support for “backward looking” axes
  3. Does suffer similar memory restrictions when using XPath expressions

5. Related Articles

Below are links to the other articles in this series:

6. Download the Source Code

Download the XML data and source code used in his article.

Download
You can download the full source code for this article here: SaxDomPerformance.zip

David Guinivere

David graduated from University of California, Santa Cruz with a Bachelor’s degree in Information Science. He has been actively working with Java since the days of J2SE 1.2 in 1998. His work has largely involved integrating Java and SQL. He has worked on a wide range of projects including e-commerce, CRM, Molecular Diagnostic, and Video applications. As a freelance developer he is actively studying Big Data, Cloud and Web Development.
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Inline Feedbacks
View all comments
Back to top button