The basics of XML for web-scraping

If you are interested in web-scraping like I am, it is very useful, if not essential, to know something about XML. XML stands for Extensible Markup Language. It was designed to transport and store data, whereas HTML was designed to display data. XML keeps the data separate from its HTML presentation and simplifies data sharing and transport, because it stores data as plain text, which gives a software- and hardware-independent way of storing it.

Next I will summarize what I have learned about XML from [1]. From the web-scraping point of view, I think the most relevant section is the one about the XML tree structure, while the sections about naming practices and the attribute vs. element debate are here only to give better background on XML. This is not intended to be a comprehensive review of the subject, and this post may change as I learn more about XML and about which parts of it are useful for web-scraping.

XML tree structure

The following valid XML document, taken from [1], will be used as an example.


<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>

  • First, XML tags are not predefined; you must define your own. In our example above, the tags bookstore and book don’t have any predefined meaning and were chosen by the developer of the XML example. Tag names are usually chosen to describe the kind of data that the XML structure is supposed to hold. In this example, it is quite clear that the bookstore element contains several book elements, and that each book element has a title element, an author element, and so on.
  • XML tags are case sensitive.
  • Every XML element must have a closing tag, unlike in HTML, where some closing tags can be omitted.
  • XML documents must contain a root element. This element is the parent of all other elements. The terms parent, child, and sibling are used to describe the relationships between elements. Parent elements have children, and children on the same level are called siblings. All elements can have children. So, in the example above, bookstore is the root element and each book is a child of bookstore. title, author, year and price are siblings and children of book.
  • All elements can have text content and attributes, just like in HTML. So, the title element of the first book has “Everyday Italian” as its text content and an attribute lang with the value “en”.
  • XML attribute values must be quoted. So lang=en would be incorrect; the correct form is lang="en".
  • Comments in XML: <!-- This is a comment -->
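For web-scraping, the points above translate fairly directly into code. Here is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree as the parser (my choice; [1] does not prescribe one). The helper names summarize_books and is_well_formed are made up for this post.

```python
import xml.etree.ElementTree as ET

# The bookstore example from [1], embedded as a string.
BOOKSTORE_XML = """\
<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>
"""


def summarize_books(xml_text):
    """Return one dict per <book>: category, title, language and price."""
    root = ET.fromstring(xml_text)          # root element: <bookstore>
    books = []
    for book in root.findall("book"):       # <book> children of the root
        title = book.find("title")          # first <title> child of this book
        books.append({
            "category": book.get("category"),         # attribute of <book>
            "title": title.text,                      # text content of <title>
            "lang": title.get("lang"),                # attribute of <title>
            "price": float(book.findtext("price")),   # text of <price> as a number
        })
    return books


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it raises ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


for entry in summarize_books(BOOKSTORE_XML):
    print(entry)

# Each of these breaks one of the rules listed above.
print(is_well_formed("<title lang=en>Learning XML</title>"))    # unquoted attribute value
print(is_well_formed("<title>Learning XML</Title>"))            # tags are case sensitive
print(is_well_formed("<book><title>Learning XML</book>"))       # missing closing tag
```

The first loop walks from the root to its children and reads text and attributes; the last three calls all print False, because the parser rejects documents that break the quoting, case, or closing-tag rules.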

XML Naming Rules and Best Naming Practices

  1. XML elements must follow these naming rules:
    • Names can contain letters, numbers, and other characters
    • Names cannot start with a number or punctuation character
    • Names cannot start with the letters xml (or XML, or Xml, etc)
    • Names cannot contain spaces
  2. Make names descriptive. Names with an underscore separator are nice: <first_name>, <last_name>
  3. Names should be short and simple, like this: <book_title>; not like this: <the_title_of_the_book>.
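The naming rules in item 1 can be sketched as a small check. This is only an approximation under my own assumptions: it accepts ASCII letters, digits, hyphens, underscores and periods, allows an underscore as the first character, and the function name is_valid_name is hypothetical. A real XML parser follows the full specification and may be more permissive, for example with non-ASCII letters.

```python
import re

# Simplified pattern (an assumption, not the full XML grammar):
# first character is a letter or underscore, the rest are
# letters, digits, underscores, periods or hyphens.
_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_valid_name(name):
    """Check an element name against the simplified rules listed above."""
    if name.lower().startswith("xml"):      # xml, XML, Xml, ... are reserved
        return False
    # fullmatch rejects spaces, an empty name, and a leading digit or punctuation
    return _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["first_name", "book_title", "1book", "first name", "XmlData"]:
    print(candidate, is_valid_name(candidate))
```

Only the first two candidates pass; the others start with a digit, contain a space, or start with the letters xml.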

XML Elements vs. Attributes

Take a look at the following examples:


<person sex="female">
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>

<person>
  <sex>female</sex>
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>

Both examples provide the same information, but in the first example sex is an attribute while in the second it is an element. Attributes are handy in HTML, but for XML the advice in [1] is to avoid them and use elements instead.
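The difference matters when scraping, because the two forms are read with different calls. A minimal sketch, again assuming Python's xml.etree.ElementTree; the helper read_sex is a name I made up, and it simply tries the attribute first and falls back to the child element.

```python
import xml.etree.ElementTree as ET

# The two person examples from above.
AS_ATTRIBUTE = """\
<person sex="female">
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>
"""

AS_ELEMENT = """\
<person>
  <sex>female</sex>
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
</person>
"""


def read_sex(person):
    """Return the sex of a <person>, stored either as an attribute or as a child element."""
    value = person.get("sex")               # attribute lookup; None if absent
    if value is None:
        value = person.findtext("sex")      # text of the first <sex> child; None if absent
    return value


print(read_sex(ET.fromstring(AS_ATTRIBUTE)))
print(read_sex(ET.fromstring(AS_ELEMENT)))
```

Both calls print female, so a scraper that does not control the source format can handle either layout with one small helper.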

Some of the problems with using attributes are:

  • attributes cannot contain multiple values (elements can)
  • attributes cannot contain tree structures (elements can)
  • attributes are not easily expandable (for future changes)
  • attributes are difficult to read and maintain.

In general, use elements for the data itself and attributes for metadata, that is, information about the data that is not part of it (an identifier, for example).
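The first limitation in the list above is easy to see in practice. In this sketch, which assumes Python's xml.etree.ElementTree, the phone elements and their numbers are invented for illustration and are not part of the examples in [1]; the helper parses is also a made-up name.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the <phone> elements are invented for this illustration.
PERSON_XML = """\
<person>
  <firstname>Anna</firstname>
  <lastname>Smith</lastname>
  <phone>555-0100</phone>
  <phone>555-0101</phone>
</person>
"""

# An element can simply be repeated to hold several values.
person = ET.fromstring(PERSON_XML)
phones = [phone.text for phone in person.findall("phone")]
print(phones)


def parses(xml_text):
    """Return True if xml_text is accepted by the parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# An attribute cannot be repeated on the same element: the parser rejects it.
DUPLICATE_ATTRIBUTE = '<person phone="555-0100" phone="555-0101"/>'
print(parses(DUPLICATE_ATTRIBUTE))
```

The repeated element yields a list with both numbers, while the repeated attribute is not even well-formed XML. Packing several values into one attribute string is possible, but then the scraper has to split that string itself.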

References:

[1] XML tutorial from w3schools.
