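How do I select the first 280 words, up to the nearest period?
I need to extract a shorter passage with a specified number of words from a longer text. I can do it like this: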
text = "There was a very big cat that was sitting on the ledge. It was overlooking the garden. The dog next door watched with curiosity."
# The exclusive range 0...15 takes exactly 15 words (0..15 would take 16).
text.split[0...15].join(' ')
#=> "There was a very big cat that was sitting on the ledge. It was overlooking"
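I want to extend the selection to the next period, so that I don't end up with a partial sentence.
Is there a way, perhaps with a regular expression, to take the text up to and including the first period that follows the 15th word?
You can use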
(?:\w+[,.?!]?\s+){14}(?:\w+,?\s+)*?\w+[.?!]
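This repeats a word, optionally followed by a comma, period, question mark, or exclamation mark, and then whitespace, 14 times. It then lazily repeats a word (optionally followed by a comma) plus whitespace, and finally matches one more word followed by a period, question mark, or exclamation mark. That ensures the match ends at the first sentence-ending punctuation mark at or after the 15th word from the start.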
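As a quick check, the pattern above can be applied to the sample text from the question with String#[]. This is only a minimal sketch; the variable names pattern and excerpt are mine, not part of the answer.

```ruby
# Sample text from the question.
text = "There was a very big cat that was sitting on the ledge. " \
       "It was overlooking the garden. " \
       "The dog next door watched with curiosity."

# 14 words (each optionally followed by punctuation), then lazily more
# words until one is followed by sentence-ending punctuation.
pattern = /(?:\w+[,.?!]?\s+){14}(?:\w+,?\s+)*?\w+[.?!]/

# String#[] with a regexp returns the matched substring, or nil if there is no match.
excerpt = text[pattern]
puts excerpt
#=> There was a very big cat that was sitting on the ledge. It was overlooking the garden.
```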
r = /
    (?:           # begin a non-capture group
      \p{Alpha}+  # match one or more letters
      [.!?]?      # optionally ('?' following ']') match one of the 3 punctuation chars
      [ ]+        # match one or more spaces
    )             # end non-capture group
    {14,}?        # execute the preceding non-capture group at least 14 times, lazily ('?')
    \p{Alpha}+    # match one or more letters
    [.!?]         # match one of the three punctuation characters
    /x            # free-spacing regex definition mode

text[r]
  #=> "There was a very big cat that was sitting on the ledge. It was overlooking
  #    the garden."
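Free-spacing mode strips whitespace from the pattern, which is why the space character above is placed in a character class ([ ]+). Written conventionally, the regular expression is as follows.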
/(?:\p{Alpha}+[.!?]? +){14,}?\p{Alpha}+[.!?]/
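Since the title asks for 280 words rather than 15, the word count can be interpolated into the same pattern. The sketch below assumes a helper name, excerpt_to_sentence_end, that is purely illustrative. It inherits the pattern's limitations: words followed by a comma or other unlisted punctuation are not matched, and the result is nil when the text has fewer than n words.

```ruby
# Hypothetical helper (name is illustrative): returns the text up to the
# first sentence end at or after the n-th word, or nil if nothing matches.
# Assumes n >= 1.
def excerpt_to_sentence_end(text, n)
  # n - 1 "word + spaces" groups (lazily more), then one final word that
  # is followed by '.', '!' or '?'.
  pattern = /(?:\p{Alpha}+[.!?]? +){#{n - 1},}?\p{Alpha}+[.!?]/
  text[pattern]
end

text = "There was a very big cat that was sitting on the ledge. " \
       "It was overlooking the garden. " \
       "The dog next door watched with curiosity."

puts excerpt_to_sentence_end(text, 15)
#=> There was a very big cat that was sitting on the ledge. It was overlooking the garden.
```

For the original use case, the call would be excerpt_to_sentence_end(long_text, 280), where long_text is whatever string holds the full article.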
You could do something along these lines:
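text = "There was a very big cat that was sitting on the ledge. It was overlooking the garden. The dog next door watched with curiosity."
tgt = 15
# Split the text into sentences, each ending in a period plus an optional space.
old_text = text.scan(/[^.]+\.\s?/)
new_text = []
# Move whole sentences across until the word count passes the target.
# The empty? check stops the loop when the text runs out of sentences
# (an array is always truthy, so testing old_text alone would loop forever).
while !old_text.empty? && new_text.join.scan(/\b\p{Alpha}+\b/).length <= tgt
  new_text << old_text.shift
end
p new_text.join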
Prints:
"There was a very big cat that was sitting on the ledge. It was overlooking the garden. "
This works for ordinary sentences of any length, and stops as soon as adding another sentence takes the count past the word target.
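The same sentence-by-sentence idea can be wrapped in a reusable method. The following is a sketch under my own assumptions, not the answer's code: the name take_sentences is hypothetical, sentences may also end in '!' or '?', and it stops once at least target words have been collected (so a sentence ending exactly on the target word is not followed by another one). Any trailing text without closing punctuation is ignored.

```ruby
# Hypothetical helper: collects whole sentences until at least `target`
# words have been taken. Not the original answer's exact stopping rule.
def take_sentences(text, target)
  # Each sentence: a run of non-terminators, the terminator(s), trailing space.
  sentences = text.scan(/[^.!?]+[.!?]+\s*/)
  taken = []
  words = 0
  sentences.each do |sentence|
    break if words >= target
    taken << sentence
    words += sentence.scan(/\p{Alpha}+/).length
  end
  # Drop the trailing whitespace left after the last sentence.
  taken.join.rstrip
end

text = "There was a very big cat that was sitting on the ledge. " \
       "It was overlooking the garden. " \
       "The dog next door watched with curiosity."

puts take_sentences(text, 15)
#=> There was a very big cat that was sitting on the ledge. It was overlooking the garden.
```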