24 09 | 2013

Working with XML using standard Unix tools

Written by Tanguy

Classified in : Homepage, Debian, Command line, To remember

Like it or not, XML is used everywhere, even in cases where a simpler line-oriented text format would have been sufficient. Unfortunately, standard tools such as grep, sed or awk work line by line and are poorly suited to XML, where an element can span several lines or share a line with others. Let us take the following example:

<chapter
    xmlns="http://docbook.org/ns/docbook" version="5.0">
    <title>The Debian distribution</title>

    <para>Debian is a free operating system, describing itself as “the
    universal operating system”. It is mostly known as a GNU/Linux
    distribution, but it also exist in other variants such as GNU/Hurd
    and GNU/kFreeBSD…</para>
</chapter>
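
To make the problem concrete, here is a minimal sketch using a shortened copy of the example above, stored in a shell variable (the variable names are only illustrative). It shows that a grep pattern anchored to the chapter start tag finds nothing, because the tag and its attributes sit on different lines:

```shell
#!/bin/sh
# Shortened copy of the example: the start tag spans two lines.
xml='<chapter
    xmlns="http://docbook.org/ns/docbook" version="5.0">
    <title>The Debian distribution</title>
</chapter>'

# Naive attempt: require "<chapter" and "version=" on the same line.
# grep works one line at a time, so this counts zero matching lines.
# "|| true" only hides grep's non-zero exit status when nothing matches.
naive=$(printf '%s\n' "$xml" | grep -c '<chapter.*version=' || true)

# Matching the attribute alone succeeds here, but it no longer checks
# which element the attribute belongs to.
loose=$(printf '%s\n' "$xml" | grep -c 'version=' || true)

echo "anchored to <chapter: $naive matching line(s)"
echo "attribute alone:      $loose matching line(s)"
```

The second pattern only works by luck: any other element carrying a version attribute would match as well.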

PYX and xml2

There are at least two line-oriented alternative formats for XML:

  • PYX is an event-oriented format, derived from the ESIS output of SGML parsers, which can be used with the tool XMLStarlet,
  • xml2 is a tool that can transform XML to a content-oriented format.

This is what our example would look like in PYX:

(chapter
Aversion 5.0
-\n
(title
-The Debian distribution
)title
-\n\n
(para
-Debian is a free operating system, describing itself as
-“the\n    universal operating system”. It is mostly known as a GNU/Linux\n    distribution, but it also exist in other variants such as GNU/Hurd\n    and GNU/kFreeBSD…
)para
-\n
)chapter
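
In PYX the first character of each line gives the kind of event: ( opens an element, ) closes it, A is an attribute and - is text. A rough sketch of what this allows, run on a shortened copy of the output above kept in a shell variable (the variable names are illustrative):

```shell
#!/bin/sh
# Shortened PYX stream; the \n sequences are literal, as in PYX itself.
pyx='(chapter
Aversion 5.0
-\n
(title
-The Debian distribution
)title
-\n
)chapter'

# Lines starting with "(" open an element: strip the marker to list names.
elements=$(printf '%s\n' "$pyx" | sed -n 's/^(//p')

# Between "(title" and ")title", lines starting with "-" hold the text.
title=$(printf '%s\n' "$pyx" | sed -n '/^(title$/,/^)title$/s/^-//p')

echo "elements:"
printf '%s\n' "$elements"
echo "title: $title"
```

With a real file, the variable would be replaced by the output of xmlstarlet pyx.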

And in the xml2 format:

/chapter/@xmlns=http://docbook.org/ns/docbook
/chapter/@version=5.0
/chapter/title=The Debian distribution
/chapter/para=Debian is a free operating system, describing itself as “the
/chapter/para=    universal operating system”. It is mostly known as a GNU/Linux
/chapter/para=    distribution, but it also exist in other variants such as GNU/Hurd
/chapter/para=    and GNU/kFreeBSD…

Examples of use

We want to extract the DocBook version number. This is hard to do reliably on the raw XML, but with xml2 the attribute gets a line of its own:

$ xml2 < chapter.xml | grep '^/chapter/@version=' \
    | cut -d= -f2
5.0
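
One caveat with cut -d= -f2 is that it truncates a value that itself contains an = sign. A possible variant, sketched here with awk, matches the path literally and keeps everything after the first =. The function name xml2_value is made up for this example, and the input is a copy of the xml2 output shown earlier rather than a live run of xml2:

```shell
#!/bin/sh
# Copy of (part of) the xml2 output shown above.
flat='/chapter/@xmlns=http://docbook.org/ns/docbook
/chapter/@version=5.0
/chapter/title=The Debian distribution
/chapter/para=Debian is a free operating system, describing itself as “the
/chapter/para=    universal operating system”. It is mostly known as a GNU/Linux'

# xml2_value PATH: print the value of every input line whose path is PATH.
# index() compares the path as a plain string (no regex surprises), and
# substr() keeps any "=" characters that appear inside the value.
xml2_value() {
    awk -v path="$1" '
        index($0, path "=") == 1 { print substr($0, length(path) + 2) }
    '
}

version=$(printf '%s\n' "$flat" | xml2_value /chapter/@version)
title=$(printf '%s\n' "$flat" | xml2_value /chapter/title)

echo "version: $version"
echo "title:   $title"
```

On a real file this would be used as xml2 < chapter.xml | xml2_value /chapter/@version.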

We want to move the title into an info element, using PYX:

$ xmlstarlet pyx chapter.xml | sed -e '/^(title$/i\
(info
/^)title$/a\
)info' | xmlstarlet p2x
<chapter version="5.0">
    <info><title>The Debian distribution</title></info>
    […]

We could go further, adding a keywords entry in that info tag for instance, but you get the idea: when you want to work with XML in a reliable way, try xmlstarlet pyx or xml2.
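
For the curious, here is one way the keywords idea might look, as a sketch only. It uses awk rather than sed, works on a shortened PYX stream held in a variable, and the function name add_info and the keyword "Debian" are invented for the example. keywordset and keyword are the DocBook elements for this purpose:

```shell
#!/bin/sh
# Shortened PYX stream for the example chapter.
pyx='(chapter
Aversion 5.0
(title
-The Debian distribution
)title
)chapter'

# add_info KEYWORD: wrap each title in an info element and append a
# keywordset holding the single keyword KEYWORD right after the title.
add_info() {
    awk -v kw="$1" '
        $0 == "(title" { print "(info" }
        { print }
        $0 == ")title" {
            print "(keywordset"
            print "(keyword"
            print "-" kw
            print ")keyword"
            print ")keywordset"
            print ")info"
        }
    '
}

result=$(printf '%s\n' "$pyx" | add_info Debian)
printf '%s\n' "$result"
```

In a real pipeline this would sit between the two XMLStarlet calls: xmlstarlet pyx chapter.xml | add_info Debian | xmlstarlet p2x. Note that, like the sed version, it touches every title in the document, so a chapter with titled sections would need a stricter test.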

2 comments

Tuesday 24 September 2013 at 20:31, piero said: #1

Amazing.
What I need.
Thank You.

Wednesday 25 September 2013 at 11:09, wodny said: #2

Nice one. I didn't know the PYX format. I've always used XPath to extract data.
