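How to extract URLs from text

How do I extract all URLs from a plain text file in Ruby?

I have tried some libraries, but they fail in certain cases. What is the best way?

Which cases fail?

Using the pattern from the regexpert library, you can use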

regexp = /(^$)|(^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix

Then run scan on the text.

Edit: it seems the regexp also matches empty strings. Just remove the initial (^$) alternative and it works.
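As a rough illustration of that edit, here is the same pattern with the (^$) alternative removed, run through scan. The sample text and the variable names are made up for this sketch. Because the pattern is anchored with ^ and $, it only picks up lines that consist entirely of a URL, so a URL embedded in a sentence is skipped.

```ruby
# Sketch: the regexpert pattern without the leading (^$) alternative.
url_line = /(^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix

# Hypothetical sample input, one candidate per line.
sample_text = <<~TEXT
  see below
  http://foo.example.org/bla
  inline http://example.com/x is skipped
  https://example.com
TEXT

# scan returns one array of captures per match;
# the first capture is the outer group, i.e. the whole URL.
line_urls = sample_text.scan(url_line).map(&:first)
puts line_urls.inspect
```

If the URLs can appear anywhere inside a line, the anchors would have to be dropped or a different pattern used.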

If you prefer to use what Ruby already ships with:

 require "uri"

 URI.extract("text here http://foo.example.org/bla and here mailto:test@example.com and here also.")
 # => ["http://foo.example.org/bla", "mailto:test@example.com"]

Read more: http://railsapi.com/doc/ruby-v1.8/classes/URI.html#M004495
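If only web links are wanted, URI.extract also accepts a list of schemes as an optional second argument. The sketch below reuses the sentence from the example above; the variable names are illustrative. Note that newer Ruby versions mark URI.extract as obsolete and warn about it in verbose mode, so treat this as a convenience rather than a long-term API.

```ruby
require "uri"

sample = "text here http://foo.example.org/bla and here mailto:test@example.com and here also."

# The optional second argument limits extraction to the listed schemes,
# so the mailto: address is left out.
web_urls = URI.extract(sample, ["http", "https"])
puts web_urls.inspect
```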

I have used the twitter-text gem:

 require "twitter-text"

 class UrlParser
   include Twitter::Extractor
 end

 urls = UrlParser.new.extract_urls("http://stackoverflow.com")
 puts urls.inspect

You can use a regular expression with .scan():

 string.scan(/(https?:\/\/([-\w\.]+)+(:\d+)?(\/([\w\/_\.]*(\?\S+)?)?)?)/)

You can start from that regular expression and adjust it to your needs.
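One thing to keep in mind when adapting it: because the pattern contains capture groups, scan returns an array of captures for each match rather than a flat list of strings. A minimal sketch, with a made-up input string, that keeps only the outermost group (the full URL):

```ruby
# Same pattern as above; the outermost group captures the whole URL.
pattern = /(https?:\/\/([-\w\.]+)+(:\d+)?(\/([\w\/_\.]*(\?\S+)?)?)?)/

# Hypothetical input for illustration.
string = "Visit http://example.com/path?x=1 or https://foo.org:8080/a/b.html today"

# Each scan result is an array of group captures; take the first one.
found = string.scan(pattern).map(&:first)
puts found.inspect
```

Turning the inner groups into non-capturing ones, written (?:...), would be another way to get at the full match, at the cost of a noisier pattern.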

 require 'uri'
 foo = # foo.to_s => "http://sofzh.miximages.com/ruby/00u0u_gKHnmtWe0Jk_600x450.jpg"

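Edit: explanation

For anyone having trouble parsing URIs out of JSON responses or out of scraping tools such as Nokogiri or Mechanize, this solution worked for me.

If your input looks something like this: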

 "http://sofzh.miximages.com/ruby/c31IkbM.gifv;http://sofzh.miximages.com/ruby/c31IkbM.gifvhttp://sofzh.miximages.com/ruby/c31IkbM.gifv"

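That is, if the URLs do not necessarily have whitespace around them and may be separated by any delimiter, or by no delimiter at all, you can use the following approach: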

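 def process_images(raw_input)
   return [] if raw_input.nil?

   # Split on the scheme prefix and drop whatever precedes the first URL.
   urls = raw_input.split('http')
   urls.shift

   # Restore the prefix and cut each URL at the first whitespace, comma or semicolon.
   urls.map { |url| "http#{url}".strip.split(/[\s\,\;]/)[0] }
 end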
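To show how this behaves, here is a small usage sketch. The method is repeated so the snippet runs on its own, and the example.com URLs are placeholders, not real data. Since the split is on the literal text 'http', a URL that contains "http" again later in its path or query would be cut in two, so this suits inputs like the one above more than arbitrary text.

```ruby
def process_images(raw_input)
  return [] if raw_input.nil?

  # Split on the scheme prefix and drop whatever precedes the first URL.
  urls = raw_input.split('http')
  urls.shift

  # Restore the prefix and cut each URL at the first whitespace, comma or semicolon.
  urls.map { |url| "http#{url}".strip.split(/[\s\,\;]/)[0] }
end

# Placeholder input: one ';' separator, then two URLs with nothing between them.
raw = "http://example.com/a.gif;https://example.com/b.gifhttp://example.com/c.gif"
images = process_images(raw)
puts images.inspect
```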

Hope this helps!
