How do I remove URLs from text?

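I'd like some help parsing text in Ruby.

Given:

@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3

I want to remove all hyperlinks and return plain text: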

@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands

    foo = "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"
    r = foo.gsub(/http:\/\/[\w\.:\/]+/, '')
    puts r
    # @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
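The pattern above only matches http:// links built from word characters, dots, colons and slashes, and it leaves the space in front of the link behind. A minimal sketch of a slightly more permissive variant follows. The helper name strip_links is hypothetical and not part of the original answer, and the sketch assumes a link runs up to the next whitespace character.

```ruby
# Hypothetical helper (not from the original answer): removes http:// and
# https:// links together with the whitespace directly in front of them.
def strip_links(text)
  # \S+ treats everything up to the next whitespace as part of the link
  text.gsub(%r{\s*https?://\S+}, '').strip
end

tweet = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
puts strip_links(tweet)
```

Because the link is assumed to end at whitespace, punctuation that directly follows a link (a closing period, for example) is removed along with it.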

This is an old question, but a good one. Here's an answer that relies on Ruby's built-in URI:

    require 'set'
    require 'uri'

    text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'

    schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i

    URI.extract(text).each do |url|
      text.gsub!(url, '') if (url[schemes_regex])
    end

    puts text.squeeze(' ')

A pass through IRB shows what is happening and the results it produces:

I define the text to search:

    irb(main):004:0* text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
    => "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"

I define a regex of the URI schemes we want to react to. This is a defensive move, because URI returns false positives during its search step:

    irb(main):006:0* schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i
    => /^(?:FTP|HTTP|HTTPS|LDAP|LDAPS|MAILTO)/i
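To see the defensive filter on its own, here is a small sketch that applies the same regex to two candidate strings. The candidates array is written out by hand for illustration; it mirrors what URI.extract reports for the sample text.

```ruby
require 'uri'

# Same regex as in the answer: anchored at the start, case-insensitive
schemes_regex = /^(?:#{URI.scheme_list.keys.join('|')})/i

# Hand-written candidates (illustrative), matching what URI.extract reports
candidates = ['BreakingNews:', 'http://news.bnonews.com/u4z3']

# String#[] with a regex returns the matched text, or nil when nothing matches
kept = candidates.select { |candidate| candidate[schemes_regex] }
p kept
```

Note that the regex only checks that a candidate starts with a known scheme name; it does not require the colon that follows a scheme, so it is a coarse filter rather than a validator.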

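Have URI walk through the text looking for URLs. For each one found, if it uses a scheme we want to react to, remove every occurrence of it from the text: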

    irb(main):008:0* URI.extract(text).each do |url|
    irb(main):009:1*   text.gsub!(url, '') if (url[schemes_regex])
    irb(main):010:1> end

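These are the URLs that URI.extract found. It mistakenly reported BreakingNews: because of the trailing colon. I guess it's not very sophisticated, but it's fine for normal use: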

 => ["BreakingNews:", "http://news.bnonews.com/u4z3"]
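URI.extract also accepts an optional list of schemes as its second argument, which is another way to keep the BreakingNews: false positive out of the result. The sketch below assumes only http and https links matter for this text. Ruby's documentation marks URI.extract as obsolete, so it may print a warning when Ruby runs in verbose mode.

```ruby
require 'uri'

text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'

# Assumption: only http and https links should be removed
urls = URI.extract(text, %w[http https])

# Remove each extracted URL, then tidy up the leftover whitespace
cleaned = urls.reduce(text) { |acc, url| acc.gsub(url, '') }.squeeze(' ').strip
puts cleaned
```

Unlike the gsub! version, this variant builds a new string and leaves text unmodified.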

Show the resulting text:

    irb(main):012:0* puts text.squeeze(' ')
    @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands

It can be done the quick and dirty way or the sophisticated way. I'm showing the sophisticated way:

    require 'rubygems'
    require 'hpricot' # you may need to install this gem
    require 'open-uri'

    ## first, get the embedded/framed HTML file's URL
    start_url = 'http://news.bnonews.com/u4z3'
    doc = Hpricot(open(start_url)) # on Ruby 3.0+, open-uri needs URI.open here
    news_html_url = doc.at('//link[@href]').to_s.match(/(http[^"]+)/)

    ## now get the news text; it's in the 3rd <p> tag of the framed HTML file
    doc2 = Hpricot(open(news_html_url.to_s))
    news_text = doc2.at('//p[3]').to_plain_text

    puts news_text

Try to understand what the code does at each step, and apply that knowledge to your future projects. These pages can help:

http://wiki.github.com/why/hpricot/an-hpricot-showcase

http://code.whytheluckystiff.net/doc/hpricot/
