Iconv::IllegalSequence when using WWW::Mechanize

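I'm trying to do some web scraping, but the WWW::Mechanize gem doesn't seem to like the encoding and crashes.
The POST request results in a 302 redirect (which Mechanize follows, so far so good), and the resulting page seems to crash it. I googled quite a bit, but nothing I found so far solves this problem. Does any of you have an idea?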

Code:

    require 'rubygems'
    require 'mechanize'

    agent = WWW::Mechanize.new
    agent.user_agent_alias = 'Mac Safari'

    answer = agent.post('https://www.budget.de/de/reservierung/privatkunden/step1/schnellbuchung',
      { "Country" => "Deutschland",
        "Abholstation" => "Aalen",
        "Abgabestation" => "Aalen",
        "Abholdatum" => "26.02.2009",
        "Abholzeit_stunde" => "13",
        "Abholzeit_minute" => "30",
        "Abgabedatum" => "28.02.2009",
        "Abgabezeit_stunde" => "13",
        "Abgabezeit_minute" => "30",
        "CountryID" => "DE",
        "AbholstationID" => "AA1",
        "AbgabestationID" => "AA1" })

    puts answer.body

Error:

    D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `iconv': "\204nderungen vorbe"... (Iconv::IllegalSequence)
        from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `to_native_charset'
        from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_header_handler.rb:29:in `handle'
        from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
        from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
        from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_body_parser.rb:35:in `handle'
        from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
        from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
        from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/pre_connect_hook.rb:14:in `handle'
        from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:25:in `handle'
        from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:494:in `fetch_page'
        from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:545:in `fetch_page'
        from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:403:in `post_form'
        from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:322:in `post'
        from test.rb:7

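The page is definitely UTF-8, but Mechanize uses NKF (a core Ruby library) to guess the encoding, and for some reason it comes up with Shift JIS. The quickest way to fix this is to override Mechanize's encoding map so that, when it tries to convert the body to UTF-8 with Iconv, it passes UTF-8 as the source encoding too. You can do that like this: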

    WWW::Mechanize::Util::CODE_DIC[:SJIS] = "UTF-8"

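Put that after the line that requires the Mechanize library. You will probably want to set the value back right afterwards or, even better, find the root cause of the problem and submit a patch if necessary.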
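Restoring the entry once the request is done could be sketched roughly as follows. The `with_charset_override` helper is hypothetical, and the hash used here is only a stand-in whose entries are assumed for illustration; with Mechanize 0.9.1 loaded you would pass `WWW::Mechanize::Util::CODE_DIC` instead.

```ruby
# Hypothetical helper: override one entry of an encoding map for the
# duration of the block, then put the previous state back even if the
# block raises.
def with_charset_override(dic, key, value)
  had_key = dic.key?(key)
  previous = dic[key]
  dic[key] = value
  yield
ensure
  if had_key
    dic[key] = previous
  else
    dic.delete(key)
  end
end

# Stand-in for WWW::Mechanize::Util::CODE_DIC (entries are assumptions).
code_dic = { :SJIS => "SHIFT_JIS", :UTF8 => "UTF-8" }

during = with_charset_override(code_dic, :SJIS, "UTF-8") do
  # The agent.post(...) call would go here.
  code_dic[:SJIS]
end

puts during            # value seen while the override is active
puts code_dic[:SJIS]   # value after the block has finished
```

Using `ensure` means a failed request cannot leave the map permanently modified, which matters if other code in the same process really does fetch Shift JIS pages.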

Note: I worked this out by using the backtrace to debug the Mechanize library. The to_native_charset method calls detect_charset, and that is where the problem lies.
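To illustrate why a wrong guess ends in a conversion error, here is a minimal sketch that decodes the same UTF-8 bytes under both encodings. It uses `String#encode` (Ruby 1.9 and later) instead of the Iconv call that Mechanize 0.9.1 makes, so it assumes both converters treat this input the same way; `decodable_as?` is a hypothetical helper and the sample text is reconstructed from the error message.

```ruby
# Hypothetical helper: returns true if +bytes+ can be decoded from the
# +from+ encoding. Transcoding to UTF-16LE forces a real conversion pass
# even when +from+ is already UTF-8.
def decodable_as?(bytes, from)
  bytes.dup.force_encoding(from).encode("UTF-16LE")
  true
rescue EncodingError
  false
end

# UTF-8 bytes for the German text; "Ä" is the byte pair \xC3\x84
# (octal \303\204, matching the \204 in the error message).
body = "\xC3\x84nderungen vorbehalten".b

puts decodable_as?(body, "UTF-8")      # the encoding the page really uses
puts decodable_as?(body, "Shift_JIS")  # the encoding that was guessed
```

Read as Shift JIS, the byte `\xC3` is a valid single-byte character on its own, so the converter only trips on the following `\x84n` pair, which is consistent with the error message starting at `\204nderungen`.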

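In my case, the get method returned a Mechanize::File, which does not deal with encodings at all.
I was able to fix it by converting manually with Iconv, but this only works if you already know the encoding.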

    result = @agent.get uri
    # Mechanize::File instead of Mechanize::Page is returned,
    # so we have to convert manually
    result = Iconv.conv("utf-8", "iso-8859-1", result.body)
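Iconv was removed from Ruby's standard library in Ruby 2.0, so on a newer interpreter the same manual conversion could be sketched with `String#encode`. The `to_utf8` helper and the sample bytes are hypothetical; as above, this only works when the source encoding is already known.

```ruby
# Hypothetical helper: interpret +raw+ as bytes in +source_encoding+ and
# return a new UTF-8 string. The input string is left untouched.
def to_utf8(raw, source_encoding)
  raw.dup.force_encoding(source_encoding).encode("UTF-8")
end

# Sample ISO-8859-1 bytes; "Ä" is the single byte 0xC4 in that encoding.
latin1_body = "\xC4nderungen vorbehalten".b

converted = to_utf8(latin1_body, "ISO-8859-1")
puts converted
puts converted.encoding
```

With a real response you would pass `result.body` in place of `latin1_body`.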