How can I use Nokogiri to write a HUGE XML file?
I have a Rails application that uses delayed_job in a reporting feature to run some very large reports. One of these generates a massive XML file, and with the bad old way the code is written it can take literally days. Having seen impressive benchmarks on the internet, I thought Nokogiri could give us some nontrivial performance gains.
However, the only examples I can find involve using the Nokogiri Builder to create an xml object, then using .to_xml to write the whole thing. But there isn't enough memory in my zip code to handle that for a file of this size.
So can I use Nokogiri to stream or write this data out to file?
Solution 1:[1]
Nokogiri's Builder is designed to work in memory: you build a DOM, and it is converted to XML when you serialize it. It's easy to use, but there are trade-offs, and holding the whole document in memory is one of them.
You might want to look into using Erubis to generate the XML. Rather than gathering all the data up front and keeping the logic in a controller, as we'd normally do with Rails, you can save memory by putting the logic in the template and having it iterate over your data, which should help with the resource demands.
If you need the XML in a file you might need to do that using redirection:
erubis options templatefile.erb > xmlfile
This is a very simple example, but it shows you could easily define a template to generate XML:
<%
asdf = (1..5).to_a
%>
<xml>
<element>
<% asdf.each do |i| %>
<subelement><%= i %></subelement>
<% end %>
</element>
</xml>
which, when I call erubis test.erb, outputs:
<xml>
<element>
<subelement>1</subelement>
<subelement>2</subelement>
<subelement>3</subelement>
<subelement>4</subelement>
<subelement>5</subelement>
</element>
</xml>
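The same template idea can also be driven from Ruby instead of the shell. The following is only a sketch, and it swaps in the standard library's ERB rather than Erubis; `ROW_TEMPLATE` and `render_items` are hypothetical names. It renders one small fragment per item and writes it straight to an IO object, so the complete document never has to exist as a single string. Note that `<%= %>` does not escape XML special characters here.

```ruby
require 'erb'

# Template for a single element (stdlib ERB, used here in place of Erubis).
ROW_TEMPLATE = ERB.new("<subelement><%= value %></subelement>\n")

# Hypothetical helper: writes the wrapper tags and one rendered
# fragment per item to any object that responds to <<.
def render_items(io, items)
  io << "<xml>\n<element>\n"
  items.each do |value|
    io << ROW_TEMPLATE.result_with_hash(value: value)
  end
  io << "</element>\n</xml>\n"
end

render_items($stdout, 1..5)
```

Passing an open File instead of $stdout writes the result to disk without shell redirection.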
EDIT:
The string concatenation was taking forever...
Yes, it can, largely because of garbage collection: each + creates a new String object, whereas << appends to the existing one in place. You don't show any code for how you're building your strings, but Ruby works better when you use << to append one string to another than when you use +.
It also might work better to not try to keep everything in a string, but instead to write it immediately to disk, appending to an open file as you go.
Again, without code examples I'm shooting in the dark about what you might be doing or why things run slow.
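As a rough illustration of the append-as-you-go idea, here is a minimal sketch that uses only core Ruby. The name `write_report`, the `<report>`/`<row>` element names, and the shape of `records` are all assumptions, since the original code was never shown. Each record is escaped, appended with `<<`, and written to the open file immediately.

```ruby
# Hypothetical helper: stream records to an XML file one element at a time,
# so memory use does not grow with the number of records.
def write_report(path, records)
  File.open(path, 'w') do |file|
    file << %(<?xml version="1.0" encoding="UTF-8"?>\n)
    file << "<report>\n"
    records.each do |record|
      line = String.new                 # small mutable buffer for this row only
      line << '  <row>'
      line << record.to_s.encode(xml: :text)  # escapes &, < and >
      line << "</row>\n"
      file << line                      # written out right away
    end
    file << "</report>\n"
  end
end
```

Because `records` only needs to respond to `each`, it can be a lazy enumerator or a batched database cursor rather than a fully loaded array.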
Solution 2:[2]
You don't need to build the whole XML document in memory with Nokogiri; just use Nokogiri to build whatever subtree of the document makes sense, and use Node#write_to (which every element inherits) to write one element at a time.
Here's an example that can write as long a document as you're willing to wait for:
#!/usr/bin/env ruby
require 'nokogiri'
if (count = ARGV[0].to_i) < 1
  $stderr.puts("Usage: #{File.basename(__FILE__)} <count>")
  exit 1
end
def build_child(index)
  builder = Nokogiri::XML::Builder.new do |xml|
    xml.child_element(index: index) do |child|
      child.text("This is child #{index}")
    end
  end
  builder.doc.root
end
nokogiri_options = { encoding: 'UTF-8' }
puts '<?xml version="1.0" encoding="UTF-8"?>'
puts '<root_element>'
(0...count).each do |index|
  child_element = build_child(index)
  child_element.write_to($stdout, nokogiri_options)
  puts
end
puts '</root_element>'
If you want to be extra-fancy (or support more complex Nokogiri options) you can even use Nokogiri to generate the XML declaration and root element by writing an empty root document to a StringIO:
require 'stringio' # explicit, in case nothing else has loaded it
def build_root_doc
  builder = Nokogiri::XML::Builder.new do |xml|
    xml.root_element do |root|
      root.text("\n") # ensure separate opening/closing tags
    end
  end
  builder.doc
end
root_xml = StringIO.open do |tmp|
  build_root_doc.write_to(tmp, nokogiri_options)
  tmp.string
end
# <?xml version="1.0" encoding="UTF-8"?>
# <root_element>
# </root_element>
# split at start of closing tag
header, footer = %r{([^/]+)(</.*)}.match(root_xml)[1..2]
puts header
(0...count).each do |index|
  child_element = build_child(index)
  child_element.write_to($stdout, nokogiri_options)
  puts
end
puts footer
Output:
$ ./big-nokogiri.rb 2
<?xml version="1.0" encoding="UTF-8"?>
<root_element>
<child_element index="0">This is child 0</child_element>
<child_element index="1">This is child 1</child_element>
</root_element>
$ ./big-nokogiri.rb 1000000 | tail -f
<child_element index="999996">This is child 999996</child_element>
<child_element index="999997">This is child 999997</child_element>
<child_element index="999998">This is child 999998</child_element>
<child_element index="999999">This is child 999999</child_element>
</root_element>
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Community |
| Solution 2 | David Moles |
