Scraping data across multiple pages with a loop of link clicks

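We are trying to find a way to use a single Mechanize instance to scrape all the data we want from the UCAS website and add it to arrays. At the moment we are struggling to code the link clicks with Mechanize. Could anyone help with three consecutive link clicks, inside a loop that runs through all the search results pages? The first link, which shows all of a university's courses, is in the div with class morecourseslink.

The second link, which shows the course name, duration and qualification, is in the div with class coursenamearea.

The third link is in the div coursedetailsshowable and has the id coursedetailtab_entryreqs.

At the moment we are scraping the uninames like this: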

    class PagesController # [...]
        while next_page_link = page.at('.pager a[text()=">"]')
          page = mechanize.get(next_page_link['href'])
          page.search('li.result h3').each do |h3|
            name = h3.text
            @uninames_array.push(name)
          end
        end
        puts @uninames_array.to_s
      end
    end

The course names, durations and qualifications come from the following:

    require 'mechanize'

    mechanize = Mechanize.new
    @duration_array = []
    @qual_array = []
    @courses_array = []

    page = mechanize.get('http://search.ucas.com/search/results?Vac=2&AvailableIn=2016&IsFeatherProcessed=True&page=1&providerids=41')

    # Print the durations and qualifications found on the first results page
    page.search('div.courseinfoduration').each do |x|
      puts x.text.strip
    end
    page.search('div.courseinfooutcome').each do |y|
      puts y.text.strip
    end

    # Follow the ">" pager link and collect the durations
    while next_page_link = page.at('.pager a[text()=">"]')
      page = mechanize.get(next_page_link['href'])
      page.search('div.courseinfoduration').each do |x|
        name = x
        @duration_array.push(name)
        puts x.text.strip
      end
    end

    # Follow the ">" pager link and collect the qualifications
    while next_page_link = page.at('.pager a[text()=">"]')
      page = mechanize.get(next_page_link['href'])
      page.search('div.courseinfooutcome').each do |y|
        name = y
        @qual_array.push(name)
        puts y.text.strip
      end
    end

    # Print the course names on the current page
    page.search('div.coursenamearea h4').each do |h4|
      puts h4.text.strip
    end

    # Follow the ">" pager link and collect the course names
    while next_page_link = page.at('.pager a[text()=">"]')
      page = mechanize.get(next_page_link['href'])
      page.search('div.coursenamearea h4').each do |h4|
        name = h4.text
        @courses_array.push(name)
        puts h4.text.strip
      end
    end

If you want to do this with a single Mechanize instance, why not string everything together and store the pages you need to jump back to in variables?
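As a rough illustration of "storing the pages in variables", the sketch below follows a chain of links with one agent and keeps every visited page in a hash, so any of them can be revisited later. The helper name `follow_link_chain` and the step names are hypothetical. The CSS selectors in the usage note are guesses built from the div names in the question, so they would need checking against the real markup.

```ruby
# Follow a chain of links with a single agent and remember each page visited.
#
# agent     - anything that responds to #get(url), e.g. a Mechanize instance
# start_url - URL of the first page
# steps     - hash of { step_name => css_selector_of_the_link_to_click }
#
# Each page only needs to respond to #at(css) and return a node that
# supports node['href'], as Mechanize pages do. Returns a hash of
# { step_name => page }, with the first page stored under :start.
def follow_link_chain(agent, start_url, steps)
  pages = { start: agent.get(start_url) }
  current = pages[:start]

  steps.each do |name, selector|
    link = current.at(selector)
    break if link.nil? # stop quietly when a link in the chain is missing

    current = agent.get(link['href'])
    pages[name] = current
  end

  pages
end
```

With Mechanize this might be called as `follow_link_chain(Mechanize.new, url, { courses: 'div.morecourseslink a', course: 'div.coursenamearea a', entry_reqs: '#coursedetailtab_entryreqs' })`. After that, `pages[:courses]` is still available even once the agent has moved on to the entry requirements page.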

If all of that code works, you can string it together into a single method call:

    def home
      require 'mechanize'
      mechanize = Mechanize.new

      # University names: first results page, then every page behind the ">" link
      @uninames_array = []
      page = mechanize.get('http://search.ucas.com/search/providers?CountryCode=3&RegionCode=&Lat=&Lng=&Feather=&Vac=2&Query=&ProviderQuery=&AcpId=&Location=scotland&IsFeatherProcessed=True&SubjectCode=&AvailableIn=2016')
      page.search('li.result h3').each do |h3|
        name = h3.text
        @uninames_array.push(name)
      end
      while next_page_link = page.at('.pager a[text()=">"]')
        page = mechanize.get(next_page_link['href'])
        page.search('li.result h3').each do |h3|
          name = h3.text
          @uninames_array.push(name)
        end
      end

      # Course durations, qualifications and names for one provider
      @duration_array = []
      @qual_array = []
      @courses_array = []
      page = mechanize.get('http://search.ucas.com/search/results?Vac=2&AvailableIn=2016&IsFeatherProcessed=True&page=1&providerids=41')
      page.search('div.courseinfoduration').each do |x|
        puts x.text.strip
      end
      page.search('div.courseinfooutcome').each do |y|
        puts y.text.strip
      end
      while next_page_link = page.at('.pager a[text()=">"]')
        page = mechanize.get(next_page_link['href'])
        page.search('div.courseinfoduration').each do |x|
          name = x
          @duration_array.push(name)
          puts x.text.strip
        end
      end
      while next_page_link = page.at('.pager a[text()=">"]')
        page = mechanize.get(next_page_link['href'])
        page.search('div.courseinfooutcome').each do |y|
          name = y
          @qual_array.push(name)
          puts y.text.strip
        end
      end
      page.search('div.coursenamearea h4').each do |h4|
        puts h4.text.strip
      end
      while next_page_link = page.at('.pager a[text()=">"]')
        page = mechanize.get(next_page_link['href'])
        page.search('div.coursenamearea h4').each do |h4|
          name = h4.text
          @courses_array.push(name)
          puts h4.text.strip
        end
      end
    end
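One thing to watch when the loops are simply strung together: each `while` loop leaves `page` on the last results page, so a later `while` loop that starts from that same `page` finds no ">" link and never runs. The values on the first page are also only printed, not pushed. A possible alternative, sketched here with a hypothetical helper `scrape_all_pages`, is to walk the pager once and collect every field on each page as you go. The hash keys used in the usage note are made up for illustration.

```ruby
# Walk every results page once and collect the stripped text of several
# CSS selectors per page.
#
# agent         - anything that responds to #get(url), e.g. a Mechanize instance
# first_url     - URL of the first results page
# selectors     - hash of { result_key => css_selector }
# next_selector - selector for the "next page" link (taken from the question)
#
# Pages need to respond to #search(css) and #at(css), as Mechanize pages do.
def scrape_all_pages(agent, first_url, selectors, next_selector = '.pager a[text()=">"]')
  # Start every key with an empty array so selectors without matches still appear
  results = selectors.keys.each_with_object({}) { |key, hash| hash[key] = [] }
  page = agent.get(first_url)

  loop do
    # Collect every field on the current page, including the first one
    selectors.each do |key, css|
      page.search(css).each { |node| results[key] << node.text.strip }
    end

    next_link = page.at(next_selector)
    break unless next_link # no ">" link means this was the last page

    page = agent.get(next_link['href'])
  end

  results
end
```

With Mechanize this could be called as `scrape_all_pages(Mechanize.new, results_url, { courses: 'div.coursenamearea h4', durations: 'div.courseinfoduration', quals: 'div.courseinfooutcome' })`. It returns a hash of arrays in place of the three instance variables.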