Ruby/Rails: Parse large XMLs (SAX parsers, Pull parsers) + example of Pull parser

When you have to parse huge XML files (hundreds of megabytes), loading the whole document into memory is not an option.

XML parsers

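The most popular XML parsers fall into two broad categories:

  • tree-based (aka DOM parsers), which parse the whole XML file and turn it into one huge in-memory tree of nodes.
  • event-based (SAX and Pull parsers), which read the file sequentially and report each node as they reach it, without building the whole tree.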

A DOM parser therefore consumes a lot of memory (it stores the whole tree), but it is much easier to use. You can do all sorts of cool stuff with the tree, like XPath selectors, CSS selectors, converting the XML to a hash, etc. The vast majority of Ruby XML examples and tutorials refer to DOM parsers.
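To make the convenience of a tree concrete, here is a minimal sketch using REXML from Ruby's standard distribution (Nokogiri's DOM API is similar in spirit). The catalog document and the variable names are made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical sample data, small enough to fit in memory comfortably.
xml = <<~XML
  <catalog>
    <product id="1"><name>Lamp</name><price>20</price></product>
    <product id="2"><name>Desk</name><price>150</price></product>
  </catalog>
XML

# The whole document is parsed into an in-memory tree up front.
doc = REXML::Document.new(xml)

# Once the tree exists, XPath can reach any node directly.
names       = REXML::XPath.match(doc, '//product/name').map(&:text)
second_name = REXML::XPath.first(doc, "//product[@id='2']/name").text

puts names.inspect
puts second_name
```

This is fine for small inputs, but the tree grows with the file, which is exactly the problem with very large XML.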

However, in the 5% of situations where the XML file is really big, storing it in memory is not an option.

Event-based XML parsers

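There are two types of event-based parsers:

  • SAX parsers, where the parser drives: you register callbacks and it calls them as it meets each start tag, end tag and text node.
  • Pull parsers, where you control a “cursor” in the XML file and advance it yourself, asking for the next node when you are ready for it.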
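The difference between the two styles is easier to see side by side. The sketch below collects the product names from a tiny, made-up document twice: once with REXML's stream listener (callback style, close to SAX) and once with REXML's pull parser. The class name NameListener and the sample data are assumptions for illustration only.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'rexml/parsers/pullparser'

# Hypothetical sample data.
XML_DATA = <<~XML
  <catalog>
    <product id="1"><name>Lamp</name></product>
    <product id="2"><name>Desk</name></product>
  </catalog>
XML

# Callback style: the parser drives and calls back into the listener.
class NameListener
  include REXML::StreamListener

  attr_reader :names

  def initialize
    @names   = []
    @in_name = false
  end

  def tag_start(name, _attrs)
    @in_name = (name == 'name')
  end

  def text(data)
    @names << data if @in_name
  end

  def tag_end(_name)
    @in_name = false
  end
end

listener = NameListener.new
REXML::Document.parse_stream(XML_DATA, listener)
pushed_names = listener.names

# Pull style: the caller drives and asks for the next event itself.
pulled_names = []
in_name = false
parser  = REXML::Parsers::PullParser.new(XML_DATA)
while parser.has_next?
  event = parser.pull
  if event.start_element?
    in_name = (event[0] == 'name')   # event[0] is the tag name
  elsif event.text?
    pulled_names << event[0] if in_name   # event[0] is the text
  elsif event.end_element?
    in_name = false
  end
end

puts pushed_names.inspect
puts pulled_names.inspect
```

In both cases only the current event is held in memory. With callbacks the state lives in the listener object, while with a pull parser it lives in ordinary local variables, which tends to be easier to follow.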

Declarative XML SAX parsing library

There is an interesting SAX-based library that automatically parses the file into objects. The catch is that the programmer must declare the structure of the subtree up front. Check out SaxMachine.

Example using Nokogiri’s Pull Parser

I wanted to implement a method that would iterate over <product> tags and transform each <product> … </product> subtree into a hash. The subtrees don’t share a fixed structure, so I can’t declare it up front. Below is a snippet from my function.

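      require 'nokogiri'

      # Yields one hash per <tag_name> subtree (default: <product>) found in input.
      def each_offer(input, options = {})
        reader   = Nokogiri::XML::Reader(input)
        tag_name = options[:tag_name] || 'product'

        # each entry is [element name, attributes hash or text value]
        stack = []
        while reader.read
          finished = false
          case reader.node_type
            # start element
            when Nokogiri::XML::Reader::TYPE_ELEMENT
              # ignore everything outside a <tag_name> subtree
              next if stack.empty? && reader.name != tag_name

              stack.push([reader.name.to_s, reader.attributes])
              # <foo/> produces no end-element event, so close it right away
              finished = reader.self_closing?

            # text element
            when Nokogiri::XML::Reader::TYPE_TEXT, Nokogiri::XML::Reader::TYPE_CDATA
              # the text replaces the attributes hash of the enclosing element
              stack.last[1] = reader.value unless stack.empty?

            # end element
            when Nokogiri::XML::Reader::TYPE_END_ELEMENT
              finished = !stack.empty?
          end
          next unless finished

          elem   = stack.pop
          parent = stack.last

          # I finished the node
          if parent.nil?
            yield(elem[1])
            next
          end

          # else, attach the element to its parent
          key = elem[0]
          parent_childs = parent[1]
          if parent_childs.has_key?(key)
            # a repeated child name is collected into an array
            unless parent_childs[key].is_a? Array
              parent_childs[key] = [parent_childs[key]]
            end

            parent_childs[key] << elem[1]
          else
            parent_childs[key] = elem[1]
          end
        end
      end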

References

REXML

http://ruby-doc.org/core/classes/REXML/Parsers/SAX2Parser.html

http://stdlib.rubyonrails.org/libdoc/rexml/rdoc/classes/REXML/Parsers/PullParser.html

Nokogiri

http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/SAX/Parser.html

http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Reader.html

LibXML

http://libxml.rubyforge.org/rdoc/

Posts on XML event parser

Comparing SAX parsers in Ruby

SO question on large XML

Transforming XML and the ReXML’s Pull Parser

Processing large XML files (SAX example)

2 Responses

  1. I had observed all these issues mentioned above, and have come up with

    http://amolnpujari.wordpress.com/2012/03/31/reading_huge_xml-rb/

  2. example – https://github.com/amolpujari/reading-huge-xml/blob/master/examples/item.rb
