How do I use Nokogiri to grab the HTML between two HTML comments?

I have some HTML pages where the content to be extracted is marked with HTML comments, like this:

     .....
    <!-- begin content -->
     <div>some text</div>
     <div><p>Some more elements</p></div>
    <!-- end content -->
     ...

I am using Nokogiri and trying to extract the HTML between the <!-- begin content --> and <!-- end content --> comments.

I want to extract the complete elements between these two HTML comments:

     <div>some text</div>
     <div><p>Some more elements</p></div>

I can get a text-only version using this characters callback:

    require 'nokogiri'

    class TextExtractor < Nokogiri::XML::SAX::Document
      def initialize
        @interesting = false
        @text = ""
        @html = ""
      end

      # Called for every comment; toggles capture at the marker comments.
      def comment(string)
        case string.strip        # strip leading and trailing whitespace
        when /^begin content/    # match the starting comment
          @interesting = true
        when /^end content/      # match the closing comment
          @interesting = false
        end
      end

      # Called for every run of text; kept only between the markers.
      def characters(string)
        @text << string if @interesting
      end
    end

With @text I get the text-only version, but I need the full HTML stored in @html.

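Extracting the content between two nodes is not something we normally do; usually we want the content inside a particular node. Comments are nodes too, just a special type of node.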

    require 'nokogiri'

    doc = Nokogiri::HTML(<<~EOT)
      <body>
      <!-- begin content -->
       <div>some text</div>
       <div><p>Some more elements</p></div>
      <!-- end content -->
      </body>
    EOT

The starting node can be found by looking for a comment that contains the specified text:

    start_comment = doc.at("//comment()[contains(.,'begin content')]")
    # => a Nokogiri::XML::Comment node

Once that is found, a loop is needed that stores the current node and then moves to the next sibling, until it reaches another comment:

    content = Nokogiri::XML::NodeSet.new(doc)
    contained_node = start_comment.next_sibling

    loop do
      # Stop at the next comment, or when the siblings run out.
      break if contained_node.nil? || contained_node.comment?
      content << contained_node
      contained_node = contained_node.next_sibling
    end

    content.to_html # => "\n <div>some text</div>\n <div><p>Some more elements</p></div>\n"
