XML parsing in Python
In this article, we will explain XML parsing in Python.
1. Introduction
Python is an easy-to-learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.
The Python interpreter is easily extended with new functions and data types implemented in C or C++ (or other languages callable from C). Python is also suitable as an extension language for customizable applications.
2. XML
XML stands for eXtensible Markup Language. It is a markup language much like HTML and is designed to store and transport data. XML is a W3C Recommendation. XML and HTML were designed with different goals. XML was designed to carry data – with focus on what data is, HTML was designed to display data – with focus on how data looks. XML tags are not predefined like HTML tags. With XML, the author must define both the tags and the document structure. XML stores data in plain text format. This provides a software- and hardware-independent way of storing, transporting, and sharing data. It also makes it easier to expand or upgrade to new operating systems, new applications, or new browsers, without losing data. With XML, data can be available to all kinds of “reading machines” like people, computers, voice machines, news feeds, etc.
3. XML parsing in Python Example
In this section, we will see a working example of how to parse an XML in Python. The xml.etree.ElementTree
module implements a simple and efficient API for parsing and creating XML data. The ElementTree
class represents the whole XML document as a tree, and Element
represents a single node in this tree. Interactions with the whole document (reading and writing to/from files) are usually done on the ElementTree
level. Interactions with a single XML element and its sub-elements are done on the Element
level.
Now, let us first create a simple xml file containing user information like below:
user.xml
<?xml version="1.0"?>
<users>
<user id="001">
<first-name>Mohammad</first-name>
<middle-name>Meraj</middle-name>
<surname>Zia</surname>
<dob>01/02/1900</dob>
</user>
<user id="002">
<first-name>Steve</first-name>
<middle-name></middle-name>
<surname>Gates</surname>
<dob>11/12/1900</dob>
</user>
</users>
The xml file contains information about two users, it contains their id (it’s an attribute), names, and date of birth details. Now let us see how we will use the xml.etree.ElementTree
module to parse this xml:
import xml.etree.ElementTree as ET
tree = ET.parse('user.xml')
root = tree.getroot()
for child in root:
print(child.tag, child.attrib)
First, we imported the module. Then we called the parse method passing the xml file. This method will parse the XML document into element
tree. Then we will call the getroot()
method to get the root object. After this we will iterate through the child elements to print the tag and attribute. When you will execute the above code you will see a similar output as below:
user {'id': '001'}
user {'id': '002'}
You can also use the fromstring()
method to parse XML from a string directly into
an Element
, which is the root element of the parsed tree. We can also access a specific child element using the correct index. Let’s say we want to access the date of birth of the user with id 002. We know that user 002 is the second element and date of birth is the fourth inside the <user>
. So we can call root[1][3].text
which will return us '11/12/1900'
.
Note that not all elements of the XML input will end up as elements of the parsed tree. Currently, this module skips over any XML comments, processing instructions, and document type declarations in the input.
4. Non-blocking parsing
Most parsing functions provided by this module require the whole document to be read at once before returning any result. It is possible to use XMLParser
and feed data into it incrementally. Sometimes what the user really wants is to be able to parse XML incrementally, without blocking operations, while enjoying the convenience of fully constructed Element
objects. The most powerful tool for doing this is XMLPullParser
. It does not require a blocking read to obtain the XML data, and is instead fed with data incrementally with XMLPullParser.feed()
calls. To get the parsed XML elements, call XMLPullParser.read_events()
5. Common funtions
In this section we will look at some of the common functions used in XML parsing.
One of such is findall()
. This funtion find all matching sub elements by tag name or path. There is a similar method find()
, this method finds the first matching element by tag name or path. Below is the example showing the usage of these:
import xml.etree.ElementTree as ET
tree = ET.parse('user.xml')
root = tree.getroot()
for user in root.findall('user'):
print(user.find('first-name').text, user.find('surname').text)
Element.text
accesses the element’s text content. Element.get()
accesses the element’s attributes. When you run the above code you will get a similar output as below:
Mohammad Zia
Steve Gates
6. Summary
In this article, we looked what XML is and how to parse an XML document in Python. This is a very simple exmple demonstrating parsing using the ElementTree module. There are various other sophisticated modules to parse XML as well but for this we used the basic ones. We also looked at some of the most commonly used methods when parsing an XML document.
This example only focusses on blocking parsing which means that your code will wait till it can load the whole XML in memory. There are ways to parse XML in a non-blocking fashion but that’s out of scope for this article.
You can find more Python tutorials here.