How do I select the first 280 words, up to the nearest period?

I need to extract a shorter passage, containing a specified number of words, from a longer text. I can do it like this:

    text = "There was a very big cat that was sitting on the ledge. It was overlooking the garden. The dog next door watched with curiosity."
    # The exclusive range 0...15 takes exactly 15 words (0..15 would take 16).
    text.split[0...15].join(' ')
    #=> "There was a very big cat that was sitting on the ledge. It was overlooking"

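I would like the selection to run on to the next period, so that I don't end up with a partial sentence.

Is there a way, perhaps using a regular expression, to capture the text up to and including the first period that follows the 15th word?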

You can use

 (?:\w+[,.?!]?\s+){14}(?:\w+,?\s+)*?\w+[.?!] 

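This repeats a word, optionally followed by a comma, period, question mark, or exclamation mark, and then whitespace, 14 times. It then lazily repeats a word (optionally followed by a comma) and whitespace, and finishes with a word followed by a period, question mark, or exclamation mark. Because the lazy part cannot consume a sentence terminator, the match ends at the first sentence-ending punctuation mark on or after the 15th word.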

https://regex101.com/r/ardIQ7/4
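To try the pattern in Ruby rather than on regex101, a minimal sketch might look like the following. The method name `excerpt_through_period` is made up for illustration; the pattern itself is the one given above, unchanged.

```ruby
# Hypothetical helper name; the pattern is copied verbatim from this answer.
def excerpt_through_period(text)
  pattern = /(?:\w+[,.?!]?\s+){14}(?:\w+,?\s+)*?\w+[.?!]/
  # String#[] with a Regexp returns the first match, or nil if nothing matches.
  text[pattern]
end

text = "There was a very big cat that was sitting on the ledge. " \
       "It was overlooking the garden. The dog next door watched with curiosity."

puts excerpt_through_period(text)
```

With the sample text this should print the first two sentences. For a text with fewer than 15 words the method returns `nil`, so a caller may want to fall back to the full text, for example with `excerpt_through_period(text) || text`.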

    r = /
        (?:            # begin a non-capture group
          \p{Alpha}+   # match one or more letters
          [.!?]?       # optionally ('?' following ']') match one of the 3 punctuation chars
          [ ]+         # match one or more spaces
        )              # end non-capture group
        {14,}?         # execute the preceding non-capture group at least 14 times, lazily ('?')
        \p{Alpha}+     # match one or more letters
        [.!?]          # match one of the three punctuation characters
        /x             # free-spacing regex definition mode

    text[r]
      #=> "There was a very big cat that was sitting on the ledge. It was overlooking
      #    the garden."

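Free-spacing mode strips whitespace from the pattern, which is why the space character above is written inside a character class (`[ ]+`). Written conventionally, the regular expression is as follows.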

 /(?:\p{Alpha}+[.!?]? +){14,}?\p{Alpha}+[.!?]/ 
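Since the title asks about 280 words while the example uses 15, it may help to make the word count a parameter. The sketch below is one way to do that, assuming a made-up method name `leading_sentences`; it builds the same pattern with the repetition count interpolated.

```ruby
# Hypothetical helper: returns the shortest run of whole sentences matched by
# the pattern that contains at least min_words words, or nil if none matches.
def leading_sentences(text, min_words)
  # The group repeats at least (min_words - 1) times; the final word is
  # matched separately, together with its closing punctuation.
  pattern = /(?:\p{Alpha}+[.!?]? +){#{min_words - 1},}?\p{Alpha}+[.!?]/
  text[pattern]
end

text = "There was a very big cat that was sitting on the ledge. " \
       "It was overlooking the garden. The dog next door watched with curiosity."

puts leading_sentences(text, 15)
puts leading_sentences(text, 12)
```

Like the original, this pattern is unanchored and treats only runs of letters as words, so text containing commas, digits, or apostrophes may match from a later position or not at all.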

You could do something along these lines:

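    text = "There was a very big cat that was sitting on the ledge. It was overlooking the garden. The dog next door watched with curiosity."
    tgt = 15
    # Split the text into sentences, each ending in a period plus an optional space.
    old_text = text.scan(/[^.]+\.\s?/)
    new_text = []
    # Add whole sentences until the word target is reached or no sentences remain.
    # (An array is always truthy, so test for emptiness to avoid looping forever.)
    while !old_text.empty? && new_text.join.scan(/\b\p{Alpha}+\b/).length < tgt
      new_text << old_text.shift
    end
    p new_text.join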

Prints:

 "There was a very big cat that was sitting on the ledge. It was overlooking the garden. " 

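This works for ordinary sentences of any length, and stops as soon as adding one more sentence brings the total to the word target or beyond.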
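The same idea can be wrapped in a reusable method. The following is only a sketch under a few assumptions that are not in the answer above: the name `take_sentences` is invented, `?` and `!` are also treated as sentence terminators, and a running word count replaces re-scanning the joined text on every pass.

```ruby
# Hypothetical helper: collects whole sentences from the start of text until
# at least min_words words have been gathered (or the sentences run out).
def take_sentences(text, min_words)
  # Each sentence keeps its terminator and at most one trailing whitespace char.
  sentences = text.scan(/[^.!?]+[.!?]\s?/)
  picked = []
  word_count = 0
  sentences.each do |sentence|
    break if word_count >= min_words
    picked << sentence
    word_count += sentence.scan(/\b\p{Alpha}+\b/).length
  end
  # Drop the trailing space left behind by the last sentence.
  picked.join.rstrip
end

text = "There was a very big cat that was sitting on the ledge. " \
       "It was overlooking the garden. The dog next door watched with curiosity."

puts take_sentences(text, 15)
```

If the text holds fewer words than requested, the method simply returns every complete sentence it found. Trailing text with no closing punctuation is not returned, because the `scan` pattern only matches terminated sentences.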