When you have to parse huge XML files (on the order of hundreds of MB), loading the whole document into memory is not an option.
XML parsers
The most popular XML parsers can be split into two big categories:
- tree-based (aka DOM parsers) – that parse the whole XML file and transform it into a huge tree of nodes.
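- event-based (SAX and pull parsers) – that read the file sequentially and report each tag as they encounter it, without keeping the whole document in memory.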
A DOM parser consumes a lot of memory (since it stores the whole tree in memory), but it is much easier to use. You can do all sorts of cool stuff with the tree, like XPath selectors, CSS selectors, converting the XML to a hash, etc. The vast majority of Ruby XML examples and tutorials refer to DOM parsers.
However, in the 5% of situations where the XML file is really big, storing it in memory is not an option.
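To make the trade-off concrete, here is a minimal sketch of the DOM style. It uses REXML only because it ships with Ruby (Nokogiri offers the same kind of tree API); the sample catalog and its tag names are made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical sample document, small enough to fit in memory.
xml = <<~XML
  <catalog>
    <product id="1"><name>Lamp</name><price>19.90</price></product>
    <product id="2"><name>Desk</name><price>120.00</price></product>
  </catalog>
XML

# The whole document is parsed into an in-memory tree up front.
doc = REXML::Document.new(xml)

# Once the tree exists, XPath queries are one-liners.
names = REXML::XPath.match(doc, '//product/name').map(&:text)
desk_price = REXML::XPath.first(doc, "//product[@id='2']/price").text

puts names.inspect
puts desk_price
```

This is convenient, but the tree for a file of several hundred MB would have to be built completely before the first query can run.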
Event-based XML parsers
There are two types of event-based parsers:
- Push parsers like SAX, where you react to encountered tags as you get them. You can find here a comparison of the principal Ruby SAX parsers.
- Pull parsers, where you control a “cursor” in the XML file that you can move with simple primitives like go up/go down etc.
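The difference between the two styles is easier to see side by side. The sketch below collects the same data twice, once with a push parser and once with a pull parser. It uses REXML so that it runs without extra gems; the `NameListener` class and the sample document are hypothetical names chosen for this example.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'rexml/parsers/pullparser'

# Hypothetical sample document.
XML_SAMPLE = <<~XML
  <catalog>
    <product><name>Lamp</name></product>
    <product><name>Desk</name></product>
  </catalog>
XML

# Push style: the parser drives and calls back into the listener.
class NameListener
  include REXML::StreamListener
  attr_reader :names

  def initialize
    @names = []
    @in_name = false
  end

  def tag_start(name, _attrs)
    @in_name = true if name == 'name'
  end

  def text(data)
    @names << data if @in_name
  end

  def tag_end(name)
    @in_name = false if name == 'name'
  end
end

listener = NameListener.new
REXML::Document.parse_stream(XML_SAMPLE, listener)
push_names = listener.names

# Pull style: the caller drives and asks for the next event when ready.
pull_names = []
in_name = false
parser = REXML::Parsers::PullParser.new(XML_SAMPLE)
while parser.has_next?
  event = parser.pull
  if event.start_element? && event[0] == 'name'
    in_name = true
  elsif event.end_element? && event[0] == 'name'
    in_name = false
  elsif event.text? && in_name
    pull_names << event[0] # event[0] is the raw text of the node
  end
end

puts push_names.inspect
puts pull_names.inspect
```

In both cases only the current event is held in memory. With the push parser the state has to live in the listener object, while with the pull parser it can stay in local variables of the loop.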
Declarative XML SAX parsing library
There is an interesting XML SAX parsing library that automatically parses the file into an object. However, the programmer must declare the structure of the subtree. Check out SaxMachine.
Example using Nokogiri’s Pull Parser
I wanted to implement a method that iterates over <product> tags and transforms each subtree <product> … </product> into a hash. I don’t have a standard subtree structure. Below is a snippet from my function.
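require 'nokogiri'

# Yields every <product> subtree (or options[:tag_name]) as a nested hash.
# Leaf elements become strings, repeated sibling names become arrays.
# Note: when an element has both attributes and text, the text wins.
def each_offer(input, options = {}, &block)
  reader   = Nokogiri::XML::Reader(input)
  tag_name = options[:tag_name] || 'product'
  stack    = [] # [name, value] pairs, innermost element last

  while reader.read
    case reader.node_type
    when Nokogiri::XML::Reader::TYPE_ELEMENT
      # skip everything outside the subtree we are looking for
      next if stack.empty? && reader.name != tag_name
      stack.push([reader.name, reader.attributes])
      # <tag/> never produces an end-element event, so close it right away
      close_element(stack, &block) if reader.empty_element?
    when Nokogiri::XML::Reader::TYPE_TEXT, Nokogiri::XML::Reader::TYPE_CDATA
      stack.last[1] = reader.value unless stack.empty?
    when Nokogiri::XML::Reader::TYPE_END_ELEMENT
      close_element(stack, &block) unless stack.empty?
    end
  end
end

# Pops the finished element and either yields it (the subtree is complete)
# or stores it in its parent.
def close_element(stack)
  name, value = stack.pop
  parent = stack.last
  return yield(value) if parent.nil?

  children = parent[1]
  return unless children.is_a?(Hash) # mixed content: parent already holds text

  if children.key?(name)
    children[name] = [children[name]] unless children[name].is_a?(Array)
    children[name] << value
  else
    children[name] = value
  end
end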
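To show the shape of the hashes this kind of stack-based approach produces, here is a self-contained sketch of the same idea on top of REXML's pull parser, which needs no extra gems. It is not the Nokogiri function above: `each_subtree` and the sample catalog are hypothetical, and unlike libxml2's reader, REXML also reports whitespace between tags as text, so blank text is skipped explicitly.

```ruby
require 'rexml/parsers/pullparser'

# Yields every <tag_name> subtree as a nested hash. Leaf elements become
# strings and repeated sibling names become arrays.
def each_subtree(xml, tag_name = 'product')
  parser = REXML::Parsers::PullParser.new(xml)
  stack = [] # [name, value] pairs; value is a Hash of children or a String

  while parser.has_next?
    event = parser.pull
    if event.start_element?
      # ignore elements outside the wanted subtree
      next if stack.empty? && event[0] != tag_name
      stack.push([event[0], event[1].dup]) # event[1] is the attribute hash
    elsif (event.text? || event.cdata?) && !stack.empty?
      text = event[0].strip
      stack.last[1] = text unless text.empty?
    elsif event.end_element? && !stack.empty?
      name, value = stack.pop
      parent = stack.last
      if parent.nil?
        yield value # the subtree is complete
      elsif parent[1].is_a?(Hash)
        children = parent[1]
        if children.key?(name)
          children[name] = [children[name]] unless children[name].is_a?(Array)
          children[name] << value
        else
          children[name] = value
        end
      end
    end
  end
end

# Hypothetical sample document.
SAMPLE = <<~XML
  <catalog>
    <product id="1">
      <name>Lamp</name>
      <tag>home</tag>
      <tag>light</tag>
    </product>
    <product id="2">
      <name>Desk</name>
      <dimensions><width>120</width><depth>60</depth></dimensions>
    </product>
  </catalog>
XML

products = []
each_subtree(SAMPLE) { |product| products << product }
products.each { |product| puts product.inspect }
```

The first product comes out with "tag" mapped to the array ["home", "light"], and the second one with "dimensions" mapped to a nested hash. In real use you would process each product inside the block instead of collecting them, otherwise the memory advantage is lost.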
References
ReXML
http://ruby-doc.org/core/classes/REXML/Parsers/SAX2Parser.html
http://stdlib.rubyonrails.org/libdoc/rexml/rdoc/classes/REXML/Parsers/PullParser.html
Nokogiri
http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/SAX/Parser.html
http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Reader.html
LibXML
http://libxml.rubyforge.org/rdoc/
Posts on XML event parser
Transforming XML and the ReXML’s Pull Parser
Processing large XML files (SAX example)
I had observed all these issues mentioned above, and have come up with
http://amolnpujari.wordpress.com/2012/03/31/reading_huge_xml-rb/
example – https://github.com/amolpujari/reading-huge-xml/blob/master/examples/item.rb