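Nokogiri: grabbing text along with formatting and link tags

How can I use Nokogiri to recursively capture all of the text together with its formatting tags?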

This is text in the TD with strong tags This is a child node. with bold tags "another line of text to a link " This is text inside a div inside another div inside a paragraph tag

For example, I want to capture:

"This is text in the TD with strong tags" "This is a child node. with bold tags" "another line of text to a link " "This is text inside a div inside another div inside a paragraph tag"

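I can't just use .text() because it strips the formatting tags, and I don't know how to recurse.

Added detail: Sanitize looks like an interesting gem, and I'm reading up on it now. However, some additional information may clarify what I need to do.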

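I need to traverse each node, grab the text, process it, and put it back. So I would grab the text "This is text in the TD with strong tags" and modify it to something like "This is modified text in the TD with strong tags". Then I would go to the next tag in div 1 and grab its text, "This is a child node. with bold tags", modify it to "This is a modified child node. with bold tags", and put it back. Then go to the next div, #2, grab the text "another line of text to a link", modify it to "another line of modified text to a link", and put it back. Finally, go to the next node inside Div #2 and grab the text in the paragraph tag, which becomes "This is modified text inside a div inside another div inside a paragraph tag".

So after everything has been processed, the new HTML should look like this: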

This is modified text in the TD with strong tags This is a modified child node. with bold tags "another line of modified text to a link " This is modified text inside a div inside another div inside a paragraph tag

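My quasi-code is below, but I'm really stuck on two parts. One: grabbing only the text with its formatting. Sanitize helps, but Sanitize grabs all the tags. I need to keep the formatting of the formatted text, including whitespace and so on, without grabbing unrelated child tags. Two: traversing all the children that are directly related to the full-text tags.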

    # Quasi-code
    doc = Nokogiri.HTML(html)
    kids = doc.at('div#1')
    text_kids = kids.descendant_elements
    text_kids.each do |i|
      # grab full text (full sentences and paragraphs) with formatting tags
      # currently, I have no way to grab just the text with formatting and not the other tags
      modified_text = processing_code(i.full_text_w_formating())
      i.full_text_w_formating = modified_text
    end

    def processing_code(string)
      # code to process string (not relevant for this example)
      return modified_string
    end

    # Recursive 1
    class Nokogiri::XML::Node
      def descendant_elements
        # This is flawed because it grabs every child and even
        # splits it based on any tag.
        # I need to traverse down only the text related children.
        element_children.map{ |kid| [kid, kid.descendant_elements] }.flatten
      end
    end

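I would combine two strategies: Nokogiri to extract the content you want, then a blacklist/whitelist program to strip the tags you don't want or keep the ones you do.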

    require 'nokogiri'
    require 'sanitize'

    html = '
    This is text in the TD with strong tags This is a child node. with bold tags "another line of text to a link " This is text inside a div inside another div inside a paragraph tag
    '

    doc = Nokogiri.HTML(html)
    html_fragment = doc.at('div#1').to_html

This captures the div#1 node as an HTML string:

This is text in the TD with strong tags This is a child node. with bold tags "another line of text to a link " This is text inside a div inside another div inside a paragraph tag

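The trailing closing tags are the result of the two opening tags that are never closed. That may be deliberate, but without closing tags Nokogiri will do some fix-up to make the HTML correct.

Pass html_fragment to the Sanitize gem: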

    doc = Sanitize.clean(
      html_fragment,
      :elements => %w[ a b em strong ],
      :attributes => {
        'a' => %w[ href ],
      },
    )

The returned text looks like this:

This is text in the TD with strong tags This is a child node. with bold tags "another line of text to a link " This is text inside a div inside another div inside a paragraph tag

Again, because the HTML is malformed and lacks closing tags, there are two trailing closing tags.
