Parsing large XML files with Ruby and Nokogiri

I have a large XML file (about 10K rows) that I need to parse regularly. It is in this format:

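    <root>
      <totalcount>10000</totalcount>
      <items>
        <item>
          <cat>Category</cat>
          <name>Name 1</name>
          <value>Val 1</value>
        </item>
        ...... 10,000 more times
      </items>
    </root>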

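What I want to do is parse each node with Nokogiri to count the number of items in one category. Then I want to subtract that number from total_count to get output that reads "Count of Interest_Category: n, Count of All Else: z".

This is my code now: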

    #!/usr/bin/ruby
    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'

    icount = 0
    xmlfeed = Nokogiri::XML(open("/path/to/file/all.xml"))
    all_items = xmlfeed.xpath("//items")

    all_items.each do |adv|
      if (adv.children.filter("cat").first.child.inner_text.include? "partofcatname")
        icount = icount + 1
      end
    end

    othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount

    puts icount
    puts othercount

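This seems to work, but it is very slow! I'm talking more than 10 minutes for 10,000 items. Is there a better way? Am I doing something in a less-than-optimal way?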

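You can dramatically reduce the execution time by changing your code to the following. Just change the "99" to whatever category you want to check: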

    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'

    icount = 0
    xmlfeed = Nokogiri::XML(open("test.xml"))
    items = xmlfeed.xpath("//item")
    items.each do |item|
      text = item.children.children.first.text
      if ( text =~ /99/ )
        icount += 1
      end
    end

    othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount

    puts icount
    puts othercount

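This took about three seconds on my machine. I think a key mistake you made was iterating over "items" instead of creating a collection of the "item" nodes. That made your iteration code awkward and slow.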

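Here's an example comparing a SAX parser count with a DOM-based count, counting 500,000 items, each in one of seven categories. First, the output:

Create XML file: 1.7s
Count via SAX: 12.9s
Create DOM: 1.6s
Count via DOM: 2.5s

Both techniques produce the same hash, counting the number of items seen in each category: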

 {"Cats"=>71423, "Llamas"=>71290, "Pigs"=>71730, "Sheep"=>71491, "Dogs"=>71331, "Cows"=>71536, "Hogs"=>71199}

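The SAX version takes 12.9s to count and categorize, while the DOM version takes only 1.6s to create the DOM elements and 2.5s more to find and categorize all the values. The DOM version is around 3x as fast!

...but that's not the whole story. We have to look at RAM usage as well.

For 500,000 items, SAX (12.9s) peaks at 238MB of RAM; DOM (4.1s) peaks at 1.0GB.
For 1,000,000 items, SAX (25.5s) peaks at 243MB of RAM; DOM (8.1s) peaks at 2.0GB.
For 2,000,000 items, SAX (55.1s) peaks at 250MB of RAM; DOM (???) peaks at 3.2GB.

I had enough memory on my machine to handle 1,000,000 items, but at 2,000,000 I ran out of RAM and had to start using virtual memory. Even with an SSD and a fast machine, I let the DOM code run for almost ten minutes before finally killing it.

It is very likely that the long times you are reporting are because you are running out of RAM and continuously hitting the disk as part of virtual memory. If you can fit the DOM into memory, use it, because it is fast. If you can't, however, you really have to use the SAX version.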
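The arithmetic behind that trade-off can be checked against the measurements listed above. This is only a back-of-the-envelope sketch: it assumes 1GB = 1024MB, and the projection assumes DOM memory grows linearly with the item count, which the 500,000 and 1,000,000 rows suggest but do not prove. The constant and method names are made up for the illustration.

```ruby
# Measurements copied from the runs above: items => [seconds, peak RAM in MB].
# Assumption: 1GB is taken as 1024MB.
SAX_RUNS = { 500_000 => [12.9, 238], 1_000_000 => [25.5, 243] }.freeze
DOM_RUNS = { 500_000 => [4.1, 1024], 1_000_000 => [8.1, 2048] }.freeze

# How many times faster the DOM run was than the SAX run.
def dom_speedup(items)
  (SAX_RUNS[items][0] / DOM_RUNS[items][0]).round(1)
end

# Peak DOM memory per item, in KB.
def dom_kb_per_item(items)
  (DOM_RUNS[items][1] * 1024.0 / items).round(1)
end

# Linear projection of peak DOM memory (MB) from the 1,000,000-item run.
# Assumption: memory scales linearly with item count.
def projected_dom_mb(items)
  DOM_RUNS[1_000_000][1] * items / 1_000_000
end

puts "DOM speedup at 500k items: #{dom_speedup(500_000)}x"
puts "DOM memory per item: #{dom_kb_per_item(500_000)} KB"
puts "SAX growth from 500k to 1M items: #{SAX_RUNS[1_000_000][1] - SAX_RUNS[500_000][1]} MB"
puts "Projected DOM peak at 2M items: #{projected_dom_mb(2_000_000)} MB"
```

Under these assumptions the DOM costs roughly 2KB of RAM per item while SAX memory stays nearly flat, so comparing the projected DOM peak with the RAM actually available is a quick way to pick between the two.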

Here's the test code:

    require 'nokogiri'

    CATEGORIES = %w[ Cats Dogs Hogs Cows Sheep Pigs Llamas ]
    ITEM_COUNT = 500_000

    def test!
      create_xml
      sleep 2; GC.start # Time to read memory before cleaning the slate
      test_sax
      sleep 2; GC.start # Time to read memory before cleaning the slate
      test_dom
    end

    # Run the block, print how long it took, and return the block's value
    def time(label)
      t1 = Time.now
      yield.tap{ puts "%s: %.1fs" % [ label, Time.now-t1 ] }
    end

    def test_sax
      item_counts = time("Count via SAX") do
        counter = CategoryCounter.new
        # Use parse_file so we can stream data from disk instead of flooding RAM
        Nokogiri::XML::SAX::Parser.new(counter).parse_file('tmp.xml')
        counter.category_counts
      end
      # p item_counts
    end

    def test_dom
      doc = time("Create DOM"){ File.open('tmp.xml','r'){ |f| Nokogiri.XML(f) } }
      counts = time("Count via DOM") do
        counts = Hash.new(0)
        doc.xpath('//cat').each do |cat|
          counts[cat.children[0].content] += 1
        end
        counts
      end
      # p counts
    end

    class CategoryCounter < Nokogiri::XML::SAX::Document
      attr_reader :category_counts
      def initialize
        @category_counts = Hash.new(0)
      end
      def start_element(name,att=nil)
        @count = name=='cat'
      end
      def characters(str)
        if @count
          @category_counts[str] += 1
          @count = false
        end
      end
    end

    # Element names other than <totalcount>, <items>, <item> and <cat> are assumed
    def create_xml
      time("Create XML file") do
        File.open('tmp.xml','w') do |f|
          f << "<root>
            <totalcount>10000</totalcount>
            <items>
            #{
              ITEM_COUNT.times.map{ |i|
                "<item>
                  <cat>#{CATEGORIES.sample}</cat>
                  <name>Name #{i}</name>
                  <value>Value #{i}</value>
                </item>"
              }.join("\n")
            }
            </items>
          </root>"
        end
      end
    end

    test! if __FILE__ == $0

How does the DOM counting work?

If we strip away some of the test structure, the DOM-based counter looks like this:

    # Open the file on disk and pass it to Nokogiri so that it can stream read;
    # Better than  doc = Nokogiri.XML(IO.read('tmp.xml'))
    # which requires us to load a huge string into memory just to parse it
    doc = File.open('tmp.xml','r'){ |f| Nokogiri.XML(f) }

    # Create a hash with default '0' values for any 'missing' keys
    counts = Hash.new(0)

    # Find every `<cat>` element in the document (assumes one per <item>)
    doc.xpath('//cat').each do |cat|
      # Get the child text node's content and use it as the key to the hash
      counts[cat.children[0].content] += 1
    end

How does the SAX counting work?

First, let's focus on this code:

    class CategoryCounter < Nokogiri::XML::SAX::Document
      attr_reader :category_counts
      def initialize
        @category_counts = Hash.new(0)
      end
      def start_element(name,att=nil)
        @count = name=='cat'
      end
      def characters(str)
        if @count
          @category_counts[str] += 1
          @count = false
        end
      end
    end

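When we create a new instance of this class we get an object that has a Hash defaulting to 0 for all values, and a couple of methods that can be called on it. The SAX parser will call these methods as it runs through the document.

Each time the SAX parser sees a new element, it calls the start_element method on this class. When that happens, we set a flag based on whether this element is named "cat" or not (so that we can find its name later).
Each time the SAX parser slurps up a chunk of text, it calls the characters method of our object. When that happens, we check whether the last element we saw was a category (i.e. whether @count was set to true); if so, we use the value of this text node as the category name and add one to our counter.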
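To see the flag logic in isolation, the two callbacks can be exercised without any parser by replaying, by hand, the events a SAX parser would fire. This is a sketch only: `FlagCounter` is a hypothetical stand-in that does not inherit from `Nokogiri::XML::SAX::Document`, and `EVENTS` is a made-up event stream for two items (the `name` element is an assumption).

```ruby
# Hypothetical stand-in for CategoryCounter: same two callbacks, no Nokogiri.
class FlagCounter
  attr_reader :category_counts

  def initialize
    @category_counts = Hash.new(0)
    @count = false
  end

  # Fired for every opening tag: remember whether it was a <cat>.
  def start_element(name, attrs = [])
    @count = (name == 'cat')
  end

  # Fired for every chunk of text: count it only right after a <cat> tag.
  def characters(str)
    return unless @count
    @category_counts[str] += 1
    @count = false
  end
end

# Made-up event stream for two <item>s, in the order a parser would report them.
EVENTS = [
  [:start_element, 'item'], [:characters, "\n  "],
  [:start_element, 'cat'],  [:characters, 'Pigs'],
  [:start_element, 'name'], [:characters, 'Name 0'],
  [:start_element, 'item'], [:characters, "\n  "],
  [:start_element, 'cat'],  [:characters, 'Cows'],
  [:start_element, 'name'], [:characters, 'Name 1']
].freeze

counter = FlagCounter.new
EVENTS.each { |callback, arg| counter.public_send(callback, arg) }
p counter.category_counts
```

The whitespace chunk after each `item` tag and the text after each `name` tag are ignored because the flag is false at that point; only the text that immediately follows a `cat` tag is counted.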

To use our custom object with Nokogiri's SAX parser we do this:

    # Create a new instance, with its empty hash
    counter = CategoryCounter.new

    # Create a new parser that will call methods on our object, and then
    # use `parse_file` so that it streams data from disk instead of flooding RAM
    Nokogiri::XML::SAX::Parser.new(counter).parse_file('tmp.xml')

    # Once that's done, we can get the hash of category counts back from our object
    counts = counter.category_counts
    p counts["Pigs"]

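I'd recommend using a SAX parser rather than a DOM parser for a file this large. Nokogiri has a nice SAX parser built in: http://nokogiri.org/Nokogiri/XML/SAX.html

The SAX way of doing things suits large files because it doesn't build a giant DOM tree, which in your case is overkill; you can build up your own structures as events fire (for counting nodes, for example).

Check out Greg Weber's version of Paul Dix's sax-machine gem: http://blog.gregweber.info/posts/2011-06-03-high-performance-rb-part1

Parsing a large file with SaxMachine seems to load the whole file into memory

sax-machine makes the code much simpler; Greg's variant makes it stream.

You may want to try this out: https://github.com/amolpujari/reading-huge-xml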

    HugeXML.read xml, elements_lookup do |element|
      # => element{ :name, :value, :attributes}
    end

I also tried using Ox.
