Tag: anemone

Ruby scraper. How to export to CSV?: I wrote this Ruby script to scrape product information from a manufacturer's website. Scraping the product objects and storing them in an array works, but I can't figure out how to export the array data to a CSV file. This error is thrown: scraper.rb:45: undefined method `send_data' for main:Object (NoMethodError) I don't understand this piece of code. What does it do, and why isn't it working? send_data csv_data, :type => 'text/csv; charset=iso-8859-1; header=present', :disposition => "attachment; filename=products.csv" Full code: #!/usr/bin/ruby require 'rubygems' require 'anemone' require 'fastercsv' productsArray = Array.new class Product attr_accessor :name, :sku, :desc end # Scraper Code Anemone.crawl("http://retail.pelicanbayltd.com/") do |anemone| anemone.on_every_page do |page| currentPage = Product.new #Product info parsing currentPage.name = page.doc.css(".page_headers").text currentPage.sku = page.doc.css("tr:nth-child(2) […]
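
A note on the error in this excerpt: `send_data` is a Rails controller method (it streams a download to a browser), so a standalone script has no such method on `main`. Below is a minimal sketch of writing the array to a file instead. It assumes Ruby 1.9 or later, where the standard `csv` library took over from `fastercsv`. The sample product and the `products_to_csv` helper are hypothetical and not part of the original script.

```ruby
require 'csv'

# Same shape as the Product class in the question.
class Product
  attr_accessor :name, :sku, :desc
end

# Hypothetical helper: turns an array of Product objects into CSV text.
def products_to_csv(products)
  CSV.generate do |csv|
    csv << %w[name sku desc] # header row
    products.each do |product|
      csv << [product.name, product.sku, product.desc]
    end
  end
end

# Made-up sample data standing in for the scraped results.
sample = Product.new
sample.name = 'Example widget'
sample.sku  = 'EX-001'
sample.desc = 'Plain, "quoted" text'

csv_data = products_to_csv([sample])

# Write the CSV text to disk instead of calling send_data.
File.open('products.csv', 'w') { |file| file.write(csv_data) }
```

On Ruby 1.8 with the `fastercsv` gem, `FasterCSV.generate` should be usable in the same way.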

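Skip web pages with extension pdf, zip from crawling in Anemone: I'm developing a crawler with the anemone gem (Ruby 1.8.7 and Rails 3.1.1). How can I skip web pages with extensions such as pdf, doc, zip, etc. so that they are not crawled or downloaded?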
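
One possible approach, sketched under the assumption that Anemone's `skip_links_like` accepts regular expressions and tests them against each link's path: pass it a pattern on the file extension. The snippet below exercises only the pattern itself with the standard library. `SKIP_EXTENSIONS` and `skip_path?` are hypothetical names, and the Anemone call appears only as a comment.

```ruby
require 'uri'

# Hypothetical pattern: paths ending in .pdf, .doc or .zip (case-insensitive).
SKIP_EXTENSIONS = /\.(pdf|doc|zip)\z/i

# Hypothetical helper that mimics a path-based skip check.
def skip_path?(url)
  !!(URI.parse(url).path =~ SKIP_EXTENSIONS)
end

puts skip_path?('http://example.com/files/manual.PDF')    # a link to skip
puts skip_path?('http://example.com/products/index.html') # a link to crawl

# Inside a crawl this would presumably look like:
#   Anemone.crawl(start_url) do |anemone|
#     anemone.skip_links_like(SKIP_EXTENSIONS)
#   end
```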

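Ruby + Anemone Web Crawler: regex to match URLs ending in a series of digits: Suppose I'm trying to crawl a website and skip pages whose URLs end like this: http://HIDDENWEBSITE.com/anonimize/index.php?page=press_and_news&subpage=20060117 I'm currently using the Anemone gem in Ruby to build the crawler. I'm using the skip_links_like method, but my pattern never seems to match. I'm trying to make it as generic as possible, so that it doesn't depend on subpage but only on =2105925 (digits). I've tried /=\d+$/ and /\?.*\d+$/ but neither seems to work. This is similar to "Skip web pages with extension pdf, zip from crawling in Anemone", but I can't get it to work with digits instead of extensions. Also, on http://regexpal.com/ the pattern =\d+$ successfully matches http://misc.com/test/index.php?page=news&subpage=20060118 EDIT: Here is my full code. I wonder whether anyone can see exactly what's wrong. require 'anemone' … Anemone.crawl(url, :depth_limit => 3, :obey_robots_txt => true) do |anemone| anemone.skip_links_like /\?.*\d+$/ anemone.on_every_page do |page| pURL = page.url.to_s puts "Now checking: " + pURL bestGuess[pURL] = match_freq( manList, page.doc.inner_text ) puts "Successfully checked" end end My output looks like this: … Now […]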
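
A possible explanation, offered as an assumption rather than a confirmed diagnosis: if `skip_links_like` tests each pattern against the link's path only, the query string (`?page=…&subpage=…`) is never seen, so a pattern anchored on `=\d+$` cannot match even though it matches the full URL on regexpal. The sketch below shows the difference using only the standard `uri` library and the example URL from the question.

```ruby
require 'uri'

url     = 'http://misc.com/test/index.php?page=news&subpage=20060118'
pattern = /=\d+$/
uri     = URI.parse(url)

# The path stops before the "?", so the trailing digits are not part of it.
path_matches = !!(uri.path =~ pattern)

# The full URL string and the query string both end in "=20060118".
full_matches  = !!(uri.to_s =~ pattern)
query_matches = !!(uri.query =~ pattern)

puts "path:  #{uri.path.inspect} matches? #{path_matches}"
puts "query: #{uri.query.inspect} matches? #{query_matches}"
puts "full URL matches? #{full_matches}"
```

If that assumption holds, filtering on the full URL may behave as intended, for example with Anemone's `focus_crawl` and a block that returns `page.links.reject { |link| link.to_s =~ /=\d+$/ }`.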