XPath axis: get all following nodes until

I have the following HTML example:

 <h2>Foo bar</h2>
 <p>lorem</p>
 <p>ipsum</p>
 <p>etc</p>
 <h2>Bar baz</h2>
 <p>dum dum dum</p>
 <p>poopfiddles</p>

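I want to extract all the paragraphs that follow the 'Foo bar' header, up to the point where I reach the 'Bar baz' header. The text of the 'Bar baz' header is not known in advance, so unfortunately I can't use the answer provided by bougyman. I can of course use something like //h2[text()='Foo bar']/following::p, but that grabs every paragraph after this header. So I could traverse the node set and push paragraphs into an array until the text matches that of the next header, but let's be honest, that is never as cool as being able to do it in XPath.

Is there a way to do this that I'm missing?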

Use:

 (//h2[. = 'Foo bar'])[1]/following-sibling::p [1 = count(preceding-sibling::h2[1] | (//h2[. = 'Foo bar'])[1])]

If every h2 is guaranteed to have a distinct value, this can be simplified to:

 //h2[. = 'Foo bar']/following-sibling::p [1 = count(preceding-sibling::h2[1] | ../h2[. = 'Foo bar'])]

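This means: select all p elements that are following siblings of the h2 (the first or only one in the document) whose string value is 'Foo bar', and for which the first preceding-sibling h2 is exactly that same h2 (the first or only one in the document) whose string value is 'Foo bar'.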

Here we use a method for finding out whether two nodes are identical:

 count($n1 | $n2) = 1

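This is true() exactly when the nodes $n1 and $n2 are the same node: the union of a node with itself contains one node, while the union of two different nodes contains two.

This expression can be generalized: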

 $x/following-sibling::p [1 = count(preceding-sibling::node()[name() = name($x)][1] | $x)]

This selects all the "immediate following siblings" of any node specified by $x.

In XPath 2.0 (I know this doesn't help you...) the simplest solution is probably

 h2[. = 'Foo bar']/following-sibling::* except h2[. = 'Bar baz']/(. | following-sibling::*)

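But like the other solutions presented, this is likely (in the absence of an optimizer that recognizes the pattern) to be linear in the number of elements beyond the second h2, whereas you would really like a solution whose performance depends only on the number of elements selected. I've always felt it would be nice to have an until operator: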

 h2[. = 'Foo bar']/(following-sibling::* until . = 'Bar baz')

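In its absence, an XSLT or XQuery solution using recursion is likely to perform better when the number of nodes to be selected is small compared with the number of following siblings.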

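This XPath 1.0 statement selects all of the <p> elements that are siblings following an <h2> whose string value is equal to "Foo bar", and that are also followed by an <h2> sibling element whose first preceding-sibling <h2> has a string value of "Foo bar".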

 //p[preceding-sibling::h2[.='Foo bar']] [following-sibling::h2[ preceding-sibling::h2[1][.='Foo bar']]]

Just because it is not among the answers yet, here is the classic XPath 1.0 set exclusion:

 A - B = $A[count(.|$B)!=count($B)]

For this case:

 (//h2[.='Foo bar'] /following-sibling::p) [count(.|../h2[.='Foo bar'] /following-sibling::h2[1] /following-sibling::p) != count(../h2[.='Foo bar'] /following-sibling::h2[1] /following-sibling::p)]

Note: this is the negation of the Kaysian method.

XPath 2.0 has the operator << ($node1 << $node2 is true if $node1 precedes $node2 in document order), so with that you could use //h2[. = 'Foo bar']/following-sibling::p[. << //h2[. = 'Bar baz']]. However, I don't know whether nokogiri supports XPath 2.0.

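 require 'nokogiri'

 # A <root> wrapper gives the XML document the single root element it needs.
 doc = Nokogiri::XML(<<~ENDXML)
   <root>
     <h2>Foo</h2>
     <p>lorem</p>
     <p>ipsum</p>
     <p>etc</p>
     <h2>Bar</h2>
     <p>dum dum dum</p>
     <p>poopfiddles</p>
   </root>
 ENDXML

 # Every <p> after the "Foo" heading that has no "Bar" heading before it.
 a = doc.xpath('//h2[text()="Foo"]/following::p[not(preceding::h2[text()="Bar"])]')
 puts a.map { |n| n.to_s }
 #=> <p>lorem</p>
 #=> <p>ipsum</p>
 #=> <p>etc</p>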

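I suspected that it might be more efficient to just walk the DOM using next_sibling until you hit the end:

 # Assumes doc holds the question's markup (headings "Foo bar" and "Bar baz").
 node = doc.at_xpath('//h2[text()="Foo bar"]').next_sibling
 stop = doc.at_xpath('//h2[text()="Bar baz"]')
 a = []
 while node && node != stop
   a << node unless node.type == 3 # skip text nodes
   node = node.next_sibling
 end
 puts a.map { |n| n.to_s }
 #=> <p>lorem</p>
 #=> <p>ipsum</p>
 #=> <p>etc</p>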

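However, this is not faster. In a few simple tests, I found the XPath-only approach (the first solution) to be about 2x as fast as this loop-and-test, even when there are a very large number of paragraphs after the stop node. When there are many nodes to capture (and few after the stop node), it performs even better, in the 6x-10x range.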

How about matching the second heading instead? If you only want the top section, match the second heading and grab everything above it:

 doc.xpath("//h2[text()='Bar baz']/preceding-sibling::p").map { |m| m.text }
 #=> ["lorem", "ipsum", "etc"]

Or, if you don't know the text of the second heading, go one level further:

 doc.xpath("//h2[text()='Foo bar']/following-sibling::h2/preceding-sibling::p").map { |it| it.text }
 #=> ["lorem", "ipsum", "etc"]
