Tuesday, February 5, 2013

How to parse an XML file in Ruby using libxml

I wanted to try out libxml instead of Nokogiri to see if it's any better at parsing XML files. Here's a quick program that parses the ISO 3166-2 codes and converts them to delimited text (tab-separated here, though a CSV would work the same way).

Let's say your file has the following XML structure (taken from Debian's ISO 3166-2 codes):


<iso_3166_2_entries>
  <iso_3166_country code="AD">
    <iso_3166_subset type="Parish">
       <iso_3166_2_entry code="AD-07"    name="Andorra la Vella" />
       <iso_3166_2_entry code="AD-02"    name="Canillo" />
       .
       .
       .
    </iso_3166_subset>
  </iso_3166_country>
</iso_3166_2_entries>



And you want to convert it to a delimited text file with one row per entry, of the form Country|Type|Code|Name, using a tab as the separator.
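
To make the target row format concrete, here is a minimal sketch of how a single row could be built from the four values. The helper name `tsv_row` is made up for illustration and is not part of libxml; it assumes each entry code has a country prefix followed by a hyphen, as in the sample above.

```ruby
# Hypothetical helper (not part of libxml): builds one tab-separated row.
# Assumes entry_code looks like "AD-07": country prefix, hyphen, suffix.
def tsv_row(country_code, subset_type, entry_code, name)
  # Keep only what follows the first hyphen, e.g. "AD-07" becomes "07".
  # If there is no hyphen, the whole code is kept unchanged.
  suffix = entry_code.split('-', 2).last
  [country_code, subset_type, suffix, name].map(&:strip).join("\t")
end

puts tsv_row('AD', 'Parish', 'AD-07', 'Andorra la Vella')
```

Building the row as an array and joining it keeps the separator in one place, so switching from tabs to commas or pipes is a one-character change.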





require 'rubygems'
require 'open-uri'
require 'xml'

# You can also grab the raw XML from http source (see below)
# raw_xml = open("http://somewebsite.myfile=.xml").read


# Here we read the XML from a local file and parse it into a document
source = XML::Parser.file('myfile.xml')
content = source.parse

# Select every country element directly under the root
countries = content.root.find('./iso_3166_country')
countries.each do |country|
  # Don't process empty or blank countries
  if !country.children.empty? && !country.inner_xml.strip.empty?
    subsets = country.find('iso_3166_subset')
    subsets.each do |subset|
      entries =  subset.find('iso_3166_2_entry')
      entries.each do |entry|
        code = entry.attributes['code']
        name = entry.attributes['name']
        # Output one tab-separated row; the entry code drops its country prefix ("AD-07" becomes "07")
        puts [country.attributes['code'].strip,
              subset.attributes['type'].strip,
              code.slice(code.index('-') + 1..code.length).strip,
              name.strip].join("\t")
      end      
    end
  end
end
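
The program above prints its rows to standard output, so the simplest way to get a file is to redirect the output from the shell. If you would rather write the file from Ruby, the following is a minimal sketch, assuming the rows have already been collected as arrays of strings. The helper name `write_tsv` and the output file name are hypothetical, and the temp directory is used only to keep the example self-contained.

```ruby
require 'tmpdir'

# Hypothetical helper: writes rows (arrays of strings) as tab-separated lines.
def write_tsv(path, rows)
  File.open(path, 'w') do |file|
    rows.each { |row| file.puts row.join("\t") }
  end
end

# Sample rows taken from the XML snippet earlier in the post
rows = [
  ['AD', 'Parish', '07', 'Andorra la Vella'],
  ['AD', 'Parish', '02', 'Canillo']
]

# Assumed output location; any writable path would do
output_path = File.join(Dir.tmpdir, 'iso_3166_2.tsv')
write_tsv(output_path, rows)
```

In the parsing loop, each `puts` would become a `rows << [...]` so the file is written once at the end.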
