Ruby on Rails: scraping images with XPath and JSON

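I am trying to scrape images from websites. So far I have been using Nokogiri and XPath, with limited success. For a typical site whose HTML has img tags with a src attribute, I can use: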

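    require 'nokogiri'
    require 'open-uri'

    # URI.open is the open-uri entry point on current Rubies (plain open no longer accepts URLs)
    tmp2 = Nokogiri::HTML(URI.open(site_url))
    tmp2.xpath("//img/@src").each do |src|
      # ...do whatever
    end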

However, sites like Amazon and eBay only load certain images through JavaScript. If I look at the page source, I can see the data in an array. For example, from Amazon (source: http://www.amazon.com/Threads-Thought-Womens-Dreams-X-Small/dp/B00T46V758/ref=sr_1_5?s=apparel&ie=UTF8&qid=1433555447&sr=1-5):

   P.when('jQuery', 'cf').execute(function($, cf){ P.load.js('http://z-ecx.images-amazon.com/images/G/01/browser-scripts/imageBlock-udp-airy/imageBlock-udp-airy-4060168860._V1_.js'); }); P.when('A', 'jQuery', 'ImageBlockATF', 'cf').register('ImageBlockBTF', function(A, $, imageBlockATF, cf){ var data = {"indexToColor":[],"burjImageBlock":0,"isSwatchHoverConsistent":1,"heroFocalPoint":null,"visualDimensions":["color_name"],"productGroupID":"apparel_display_on_website","newVideoMissing":0,"useIV":0,"useClickZoom":null,"useChildVideos":0,"numColors":7,"logMetrics":0,"defaultColor":"initial","airyConfig":{"enableContinuousPlay":null,"installFlashButtonText":"Install Flash Player","contentTitle":null,"autoplayCutOffTimeSeconds":null,"ageGate":{"monthNames":["January","February","March","April","May","June","July","August","September","October","November","December"],"deniedPrompt":"We're sorry. You are not old enough to watch this video.","submitText":"Submit","prompt":"This video is not intended for all audiences. What date were you born?"},"videoAds":null,"videoUnsupportedPrompt":"Sorry, this video is unsupported on this browser.","desiredMode":null,"swfUrl":"http://g-ecx.images-amazon.com/images/G/01/vap/video/airy2/prod/2.0.1102.0/flash/AiryBasicRenderer._V304902271_.swf","isAutoplayEnabled":null,"installFlashPrompt":"Adobe Flash Player is required to watch this video.","isLiveStream":null,"regionCode":"NA","contentId":null,"playbackErrorPrompt":"Sorry, an error has occurred while attempting video playback. 
Please try again later.","contentMinAge":null,"isForesterTrackingDisabled":null,"streamingUrls":null,"parentId":null,"foresterMetadataParams":{"client":"Dpx","requestId":"1MX7VHFRVAS6TWY64BXC","marketplaceId":"ATVPDKIKX0DER","session":"182-9511970-7757812","method":"Apparel.ImageBlock"},"jsUrl":"http://z-ecx.images-amazon.com/images/G/01/vap/video/airy2/prod/2.0.1102.0/js/airy.chromeless._V304902265_.js"},"mainImageMaxSizes":null,"staticStrings":{"playVideo":"Click to play video","rollOverToZoom":"Roll over image to zoom in","images":"Images","video":"video","clickToZoom":"Click on image to zoom in","touchToZoom":"Touch the image to zoom in","videos":"Videos","close":"Close","pleaseSelect":"Please select","clickToExpand":"Click to open expanded view","allMedia":"All Media"},"notThumbnailClickImmersiveView":1,"gIsNewTwister":1,"title":"Threads 4 Thought Women's Tabitha Basic Tank Top","ivRepresentativeAsin":{"6":"B00T46V76W","4":"B00WM3O7ES","1":"B00T46YZES","3":"B00WM3NLPE","2":"B00T46VD16","5":"B00T46VGXQ"},"mainImageSizes":[[342,445],[385,500],[425,550],[466,606],[522,679]],"isQuickview":0,"ipadVideoSizes":[[340,444],[384,500]],"colorToAsin":{"Coral Dreams":{"asin":"B00T46V76W"},"Heather Grey":{"asin":"B00WM3NLPE"},"Black":{"asin":"B00T46YZES"},"White":{"asin":"B00T46VGXQ"},"Deep Blue Sea":{"asin":"B00T46VD16"},"Sea 
Glass":{"asin":"B00WM3O7ES"}},"thumbExperimentEnabledValue":1,"showLITBOnClick":0,"videoSizes":[[342,445],[384,500]],"stretchyGoodnessWidth":[1280,1440,1640,1800],"autoplayVideo":0,"hoverZoomIndicator":"","sitbReftag":"","useHoverZoom":1,"staticImages":{"zoomOut":"http://sofzh.miximages.com/ruby-on-rails/zoom-out._V184888738_.bmp","hoverZoomIcon":"http://sofzh.miximages.com/ruby-on-rails/icon_zoom._V138923886_.png","zoomIn":"http://sofzh.miximages.com/ruby-on-rails/zoom-in._V184888790_.bmp","zoomLensBackground":"http://sofzh.miximages.com/ruby-on-rails/tile._V211431200_.gif","videoThumbIcon":"http://sofzh.miximages.com/ruby-on-rails/video._V183716339_SX38_SY50_CR,0,0,38,50_.jpg","spinner":"http://sofzh.miximages.com/ruby-on-rails/loading-large_labeled._V192238949_.gif","zoomInCur":"http://g-ecx.images-amazon.com/images/G/01/detail-page/cursors/zoomIn._V323082799_.cur","videoSWFPath":"http://g-ecx.images-amazon.com/images/G/01/Quarterdeck/en_US/video/20110518115040892/Video._V178668404_.swf","arrow":"http://sofzh.miximages.com/ruby-on-rails/sprite-vertical-popover-arrow._V186877868_.png","zoomOutCur":"http://g-ecx.images-amazon.com/images/G/01/detail-page/cursors/zoomOut._V323082798_.cur"},"videos":[],"gPreferChildVideos":0,"altsOnLeft":1,"ivImageSetKeys":{"Coral Dreams":"6","Heather Grey":"3","Black":"1","initial":0,"White":"5","Deep Blue Sea":"2","Sea Glass":"4"},"useHoverZoomIpad":"","isUDP":1,"alwaysIncludeVideo":0,"widths":[1280,1440,1640,1800],"maxAlts":7,"useChromelessVideoPlayer":1,"mainImageHeightPartitions":null}; data["customerImages"] = eval('[]'); data["colorImages"] = {"Coral 
Dreams":[{"large":"http://sofzh.miximages.com/ruby-on-rails/41FGlhksmtL.jpg","variant":"MAIN","hiRes":"http://sofzh.miximages.com/ruby-on-rails/81iXQbkcpiL._UL1500_.jpg","thumb":"http://sofzh.miximages.com/ruby-on-rails/41FGlhksmtL._SR38,50_.jpg","main":{"http://sofzh.miximages.com/ruby-on-rails/81iXQbkcpiL._UX466_.jpg":["466","606"],"http://sofzh.miximages.com/ruby-on-rails/81iXQbkcpiL._UX522_.jpg":["522","679"],"http://sofzh.miximages.com/ruby-on-rails/81iXQbkcpiL._UY550_.jpg":["423","550"],"http://sofzh.miximages.com/ruby-on-rails/81iXQbkcpiL._UX342_.jpg":["342","445"],"http://sofzh.miximages.com/ruby-on-rails/81iXQbkcpiL._UY500_.jpg":["385","500"]}},{"large":"http://sofzh.miximages.com/ruby-on-rails/41XR9o0cV-L.jpg","variant":"BACK","hiRes":"http://sofzh.miximages.com/ruby-on-rails/81bVmFiRu0L._UL1500_.jpg","thumb":"http://sofzh.miximages.com/ruby-on-rails/41XR9o0cV-L._SR38,50_.jpg","main":{"http://sofzh.miximages.com/ruby-on-rails/81bVmFiRu0L._UY500_.jpg":["385","500"],"http://sofzh.miximages.com/ruby-on-rails/81bVmFiRu0L._UX522_.jpg":["522","679"],"http://sofzh.miximages.com/ruby-on-rails/81bVmFiRu0L._UX342_.jpg":["342","445"],"http://sofzh.miximages.com/ruby-on-rails/81bVmFiRu0L._UX466_.jpg":["466","606"],"http://sofzh.miximages.com/ruby-on-rails/81bVmFiRu0L._UY550_.jpg":["423","550"]}}],"Heather 
Grey":[{"large":"http://sofzh.miximages.com/ruby-on-rails/41f-8R8Eu-L.jpg","variant":"MAIN","hiRes":"http://sofzh.miximages.com/ruby-on-rails/81dTYkBL+xL._UL1500_.jpg","thumb":"http://sofzh.miximages.com/ruby-on-rails/41f-8R8Eu-L._SR38,50_.jpg","main":{"http://sofzh.miximages.com/ruby-on-rails/81dTYkBL+xL._UX466_.jpg":["466","606"],"http://sofzh.miximages.com/ruby-on-rails/81dTYkBL+xL._UY500_.jpg":["385","500"],"http://sofzh.miximages.com/ruby-on-rails/81dTYkBL+xL._UY550_.jpg":["423","550"],"http://sofzh.miximages.com/ruby-on-rails/81dTYkBL+xL._UX522_.jpg":["522","679"],"http://sofzh.miximages.com/ruby-on-rails/81dTYkBL+xL._UX342_.jpg":["342","445"]}},{"large":"http://sofzh.miximages.com/ruby-on-rails/41gLiFBbcdL.jpg","variant":"BACK","hiRes":"http://sofzh.miximages.com/ruby-on-rails/81ua3AXCpJL._UL1500_.jpg","thumb":"http://sofzh.miximages.com/ruby-on-rails/41gLiFBbcdL._SR38,50_.jpg","main":{"http://sofzh.miximages.com/ruby-on-rails/81ua3AXCpJL._UX342_.jpg":["342","445"],"http://sofzh.miximages.com/ruby-on-rails/81ua3AXCpJL._UY550_.jpg":["423","550"],"http://sofzh.miximages.com/ruby-on-rails/81ua3AXCpJL._UY500_.jpg":["385","500"],"http://sofzh.miximages.com/ruby-on-rails/81ua3AXCpJL._UX522_.jpg":["522","679"],"http://sofzh.miximages.com/ruby-on-rails/81ua3AXCpJL._UX466_.jpg":["466","606"]}}],"Black":[{"large":"http://sofzh.miximages.com/ruby-on-rails/41BxSpfEM7L.jpg","variant":"MAIN","hiRes":"http://sofzh.miximages.com/ruby-on-rails/81+TW8762BL._UL1500_.jpg","thumb":"http://sofzh.miximages.com/ruby-on-rails/41BxSpfEM7L._SR38,50_.jpg","main":{"http://sofzh.miximages.com/ruby-on-rails/81+TW8762BL._UY550_.jpg":["423","550"],"http://sofzh.miximages.com/ruby-on-rails/81+TW8762BL._UX342_.jpg":["342","445"],"http://sofzh.miximages.com/ruby-on-rails/81+TW8762BL._UX522_.jpg":["522","679"],"http://sofzh.miximages.com/ruby-on-rails/81+TW8762BL._UY500_.jpg":["385","500"],"http://sofzh.miximages.com/ruby-on-rails/81+TW8762BL._UX466_.jpg":["466","606"]}},{"large":"http://sofzh.m
iximages.com/ruby-on-rails/41Gf+W-cPTL.jpg","variant":"BACK","hiRes":"http://sofzh.miximages.com/ruby-on-rails/81SJwuaCspL._UL1500_.jpg","thumb":"http://sofzh.miximages.com/ruby-on-rails/41Gf+W-cPTL._SR38,50_.jpg","main":{"http://sofzh.miximages.com/ruby-on-rails/81SJwuaCspL._UY500_.jpg":["385","500"],"http://sofzh.miximages.com/ruby-on-rails/81SJwuaCspL._UX522_.jpg":["522","679"],"http://sofzh.miximages.com/ruby-on-rails/81SJwuaCspL._UX342_.jpg":["342","445"],"http://sofzh.miximages.com/ruby-on-rails/81SJwuaCspL._UX466_.jpg":["466","606"],"http://sofzh.miximages.com/ruby-on-rails/81SJwuaCspL._UY550_.jpg":["423","550"]}}],"White":[{"large":"http://sofzh.miximages.com/ruby-on-rails/41tElK2wPKL.jpg","variant":"MAIN","hiRes":"http://sofzh.miximages.com/ruby-on-rails/81kKgU75rIL._UL1500_.jpg","thumb":"http://sofzh.miximages.com/ruby-on-rails/41tElK2wPKL._SR38,50_.jpg","main":{"http://sofzh.miximages.com/ruby-on-rails/81kKgU75rIL._UY550_.jpg":["423","550"],"http://sofzh.miximages.com/ruby-on-rails/81kKgU75rIL._UX522_.jpg":["522","679"],"http://sofzh.miximages.com/ruby-on-rails/81kKgU75rIL._UY500_.jpg":["385","500"],"http://sofzh.miximages.com/ruby-on-rails/81kKgU75rIL._UX342_.jpg":["342","445"],"http://sofzh.miximages.com/ruby-on-rails/81kKgU75rIL._UX466_.jpg":["466","606"]}},{"large":"http://sofzh.miximages.com/ruby-on-rails/31lEDIs4cqL.jpg","variant":"BACK","hiRes":"http://sofzh.miximages.com/ruby-on-rails/81OBgvbUR7L._UL1500_.jpg","thumb":"http://sofzh.miximages.com/ruby-on-rails/31lEDIs4cqL._SR38,50_.jpg","main":{"http://sofzh.miximages.com/ruby-on-rails/81OBgvbUR7L._UX466_.jpg":["466","606"],"http://sofzh.miximages.com/ruby-on-rails/81OBgvbUR7L._UX342_.jpg":["342","445"],"http://sofzh.miximages.com/ruby-on-rails/81OBgvbUR7L._UX522_.jpg":["522","679"],"http://sofzh.miximages.com/ruby-on-rails/81OBgvbUR7L._UY500_.jpg":["385","500"],"http://sofzh.miximages.com/ruby-on-rails/81OBgvbUR7L._UY550_.jpg":["423","550"]}}],"Deep Blue 
Sea":[{"large":"http://sofzh.miximages.com/ruby-on-rails/41oNq3KmSGL.jpg","variant":"MAIN","hiRes":"http://sofzh.miximages.com/ruby-on-rails/81MtZtmxVLL._UL1500_.jpg","thumb":"http://sofzh.miximages.com/ruby-on-rails/41oNq3KmSGL._SR38,50_.jpg","main":{"http://sofzh.miximages.com/ruby-on-rails/81MtZtmxVLL._UX342_.jpg":["342","445"],"http://sofzh.miximages.com/ruby-on-rails/81MtZtmxVLL._UX522_.jpg":["522","679"],"http://sofzh.miximages.com/ruby-on-rails/81MtZtmxVLL._UY550_.jpg":["423","550"],"http://sofzh.miximages.com/ruby-on-rails/81MtZtmxVLL._UY500_.jpg":["385","500"],"http://sofzh.miximages.com/ruby-on-rails/81MtZtmxVLL._UX466_.jpg":["466","606"]}},{"large":"http://sofzh.miximages.com/ruby-on-rails/41AJgd1OuYL.jpg","variant":"BACK","hiRes":"http://sofzh.miximages.com/ruby-on-rails/81uLEksrYFL._UL1500_.jpg","thumb":"http://sofzh.miximages.com/ruby-on-rails/41AJgd1OuYL._SR38,50_.jpg","main":{"http://sofzh.miximages.com/ruby-on-rails/81uLEksrYFL._UX342_.jpg":["342","445"],"http://sofzh.miximages.com/ruby-on-rails/81uLEksrYFL._UY500_.jpg":["385","500"],"http://sofzh.miximages.com/ruby-on-rails/81uLEksrYFL._UX522_.jpg":["522","679"],"http://sofzh.miximages.com/ruby-on-rails/81uLEksrYFL._UX466_.jpg":["466","606"],"http://sofzh.miximages.com/ruby-on-rails/81uLEksrYFL._UY550_.jpg":["423","550"]}}],"Sea 
Glass":[{"large":"http://sofzh.miximages.com/ruby-on-rails/418vg-re8oL.jpg","variant":"MAIN","hiRes":"http://sofzh.miximages.com/ruby-on-rails/81YgtD-bEwL._UL1500_.jpg","thumb":"http://sofzh.miximages.com/ruby-on-rails/418vg-re8oL._SR38,50_.jpg","main":{"http://sofzh.miximages.com/ruby-on-rails/81YgtD-bEwL._UX342_.jpg":["342","445"],"http://sofzh.miximages.com/ruby-on-rails/81YgtD-bEwL._UX522_.jpg":["522","679"],"http://sofzh.miximages.com/ruby-on-rails/81YgtD-bEwL._UX466_.jpg":["466","606"],"http://sofzh.miximages.com/ruby-on-rails/81YgtD-bEwL._UY500_.jpg":["385","500"],"http://sofzh.miximages.com/ruby-on-rails/81YgtD-bEwL._UY550_.jpg":["423","550"]}},{"large":"http://sofzh.miximages.com/ruby-on-rails/41lcpC41VSL.jpg","variant":"BACK","hiRes":"http://sofzh.miximages.com/ruby-on-rails/814+6ZLwIxL._UL1500_.jpg","thumb":"http://sofzh.miximages.com/ruby-on-rails/41lcpC41VSL._SR38,50_.jpg","main":{"http://sofzh.miximages.com/ruby-on-rails/814+6ZLwIxL._UY500_.jpg":["385","500"],"http://sofzh.miximages.com/ruby-on-rails/814+6ZLwIxL._UX342_.jpg":["342","445"],"http://sofzh.miximages.com/ruby-on-rails/814+6ZLwIxL._UX522_.jpg":["522","679"],"http://sofzh.miximages.com/ruby-on-rails/814+6ZLwIxL._UX466_.jpg":["466","606"],"http://sofzh.miximages.com/ruby-on-rails/814+6ZLwIxL._UY550_.jpg":["423","550"]}}]}; data["heroImage"] = {}; data["landingAsinColor"] = 'Coral Dreams'; data["shouldApplyResizeFix"] = false; return data; });  

The file names I want to scrape have no src attribute (e.g. http://sofzh.miximages.com/ruby-on-rails/81+TW8762BL._UY500_.jpg). In this case the array is called data["colorImages"].

But I can't hard-code that name, because the same thing happens on eBay. For example: http://www.ebay.com/itm/Summer-Women-Casual-Chiffon-Loose-Tops-Batwing-Short-Sleeve-Loose-T-Shirt-Blouse-/351411949784?pt=LH_DefaultDomain_0&var=&hash=item51d1c8d0d8

There, the file names I need are in "enImgCarousel".

As a side note, when I run the following JavaScript bookmarklet on each URL, I do get the correct images:

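    // The HTML inside the string literals was stripped from the post as it
    // survives here; the <img> markup below is an assumed reconstruction.
    a = '';
    for (b = 0; b < document.images.length; b++) {
      a += '<img src="' + document.images[b].src + '"><br>';
    }
    if (a != '') {
      document.write(a);
      void(document.close());
    } else {
      alert('No images!');
    }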

Back to Nokogiri and XPath, I have also tried:

    tmp2.xpath("//img").each do |src| ...

    tmp2.xpath("html//img").each do |src| ...

Any ideas on what I should do, or which direction to go in?

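Here is another way to get what you are after: you can use Capybara and Poltergeist. With them I got the images you want, so I assume you won't have to dig into the JavaScript with this solution.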

If you are scraping, I suggest you consider Capybara with Poltergeist; you can find plenty of references for it.

Here is the code I tried. Hope it helps!

    require 'capybara'
    require 'capybara/dsl'
    require 'capybara/poltergeist'
    require 'nokogiri'

    # Makes visit and page available at the top level of the script
    include Capybara::DSL

    Capybara.register_driver :poltergeist_debug do |app|
      Capybara::Poltergeist::Driver.new(app, inspector: true)
    end
    Capybara.javascript_driver = :poltergeist_debug
    Capybara.current_driver = :poltergeist_debug

    # Amazon case
    visit('https://www.amazon.com/dp/B00T46V758/?tag=stackoverfl08-20')
    doc_amazon = Nokogiri::HTML.parse(page.html)
    doc_amazon.xpath("//img/@src").each do |src|
      p src.value
    end

    # eBay case
    visit('https://www.ebay.com/itm/Summer-Women-Casual-Chiffon-Loose-Tops-Batwing-Short-Sleeve-Loose-T-Shirt-Blouse-/351411949784?pt=LH_DefaultDomain_0&var=&hash=item51d1c8d0d8')
    doc_ebay = Nokogiri::HTML.parse(page.html)
    doc_ebay.xpath("//img/@src").each do |src|
      p src.value
    end
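The src values printed by the loops above may be relative or protocol-relative, depending on the site. A minimal sketch of normalizing them against the page URL with Ruby's standard `uri` library; `absolute_image_urls` is a hypothetical helper name, not part of Capybara or Nokogiri:

```ruby
require 'uri'

# Resolves each src against the page URL, skipping blanks and inline data: URIs.
# Returns unique absolute URLs in their original order.
def absolute_image_urls(page_url, srcs)
  resolved = srcs.map do |src|
    src = src.to_s.strip
    next if src.empty? || src.start_with?('data:')
    begin
      URI.join(page_url, src).to_s
    rescue URI::InvalidURIError
      nil # skip values that are not valid URI references
    end
  end
  resolved.compact.uniq
end
```

With the code above, this could be called as `absolute_image_urls(page.current_url, doc_ebay.xpath("//img/@src").map(&:value))`.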

If you want to dig deeper into it (it seems you don't):

    # Amazon: main product image
    doc_amazon.xpath("//div[@id='imgTagWrapperId']/img").attribute('src').value
    # => "http://sofzh.miximages.com/ruby-on-rails/81+TW8762BL._UX453_.jpg"

    # eBay: main product image
    doc_ebay.xpath("//div[@id='mainImgHldr']/img[@id='icImg']").attribute('src').value
    # => "http://sofzh.miximages.com/ruby-on-rails/s-l300.jpg"
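Since the question points out that the URLs sit in `data["colorImages"]` inside an inline script, another option is to pull that JSON out of the raw HTML without rendering the page. This is only a sketch: the regular expression assumes the assignment ends at the first `};`, as it does in the sample from the question, and `extract_color_images` and `color_image_urls` are hypothetical helper names.

```ruby
require 'json'

# Returns the parsed hash assigned to data["colorImages"] in the page's inline
# JavaScript, or nil when no such assignment is found or it is not valid JSON.
def extract_color_images(html)
  # Assumption: the object literal ends at the first "};" after the assignment.
  match = html.match(/data\["colorImages"\]\s*=\s*(\{.*?\});/m)
  return nil unless match
  JSON.parse(match[1])
rescue JSON::ParserError
  nil
end

# Flattens the per-color image lists into one array of URLs for the given key
# ("hiRes", "large" or "thumb" in the sample data).
def color_image_urls(html, key = 'hiRes')
  images = extract_color_images(html) || {}
  images.values.flatten.map { |img| img[key] }.compact.uniq
end
```

The variable name differs per site (the question mentions "enImgCarousel" on eBay), so the pattern would need to be adapted for each one.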

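Are you trying to build a database of competitors, with pricing and so on?
Do you want to scrape a whole category or individual sellers? The reason I ask is that you can get an RSS feed of the items each seller has listed, if they have enabled that feature. That way you don't have to waste time scraping pages when you can get the central data from the RSS feed.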

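When you parse the page, depending on where in the page you are (the carousel you mentioned), the index you run into comes from hidden thumbnails that stand in for the larger images.
I suggest looking at the eBay API and the Amazon API first, and then looking for the seller's RSS feed.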
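On the thumbnail point: in the sample data quoted in the question, each "thumb" URL is the matching "large" URL with an extra `._SR38,50_` segment before the extension. A minimal sketch that strips such a segment, assuming the pattern holds for other images too (it is inferred from that one sample, not from any documented scheme); `strip_size_modifier` is a hypothetical helper name:

```ruby
# Removes a "._<modifier>_" segment that sits directly before the file
# extension, e.g. "41FGlhksmtL._SR38,50_.jpg" becomes "41FGlhksmtL.jpg".
# URLs without such a segment are returned unchanged.
def strip_size_modifier(url)
  url.sub(/\._[^.\/]*_(?=\.\w+\z)/, '')
end
```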

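As for getting past any JavaScript issues: the page loads its rotating slideshows and carousels dynamically, so you need to fetch the fully rendered page, with all images in a scrapable state. Selenium drives a real browser and can do that; Mechanize (as RAJ suggests) and Beautiful Soup only see the HTML the server sends, so on their own they will miss images injected by JavaScript.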

If there is anything else I can help with, feel free to post your source.

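Sorry, I am posting this answer from my phone, so I can't write the full code right now, but I can give you an approach. You should use Mechanize with selenium-webdriver & Watir rather than just Nokogiri.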

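With that setup you can handle elements that come from JavaScript. You can simulate real actions in the browser: click a link or button from code, wait for the images to load, and then scrape them. Mechanize alone does not execute JavaScript; it is the selenium-webdriver or Watir part that makes all of this easy.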
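The load, wait, then scrape flow can be sketched with the selenium-webdriver gem. This is an illustration rather than the answerer's code: it assumes Firefox and its driver are installed, and `image_sources` and `scrape_rendered_images` are hypothetical helper names.

```ruby
# Collects unique, non-empty src values from objects that respond to
# attribute('src'), such as Selenium elements.
def image_sources(elements)
  elements.map { |el| el.attribute('src') }.compact.reject(&:empty?).uniq
end

# Opens the page in a real browser, waits until at least one <img> exists,
# then returns the src of every image in the rendered DOM.
def scrape_rendered_images(url, timeout: 10)
  require 'selenium-webdriver' # gem; loaded lazily so image_sources works without it
  driver = Selenium::WebDriver.for(:firefox)
  driver.get(url)
  wait = Selenium::WebDriver::Wait.new(timeout: timeout)
  wait.until { driver.find_elements(tag_name: 'img').any? }
  image_sources(driver.find_elements(tag_name: 'img'))
ensure
  driver&.quit
end

# Usage (starts a browser):
#   puts scrape_rendered_images('https://www.amazon.com/dp/B00T46V758/')
```

Clicking a thumbnail before collecting would be a `driver.find_element(...).click` call placed ahead of the wait; the selector depends on the site.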