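How do I parse an HTML table with Nokogiri?

I'm trying to parse a table, but I don't know how to save the data from it. I want to save the data in each row so it looks like:

    ['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452]

The sample table is: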

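    html = <<EOT
    . . .
    <table class="open">
      <tr>
        <th>Table name</th>
        <th>Column name 1</th>
        <th>Column name 2</th>
        <th>Column name 3</th>
        <th>Column name 4</th>
        <th>Column name 5</th>
      </tr>
      <tr>
        <th>Raw name 1</th>
        <td>2,094</td>
        <td>0,017</td>
        <td>0,098</td>
        <td>0,113</td>
        <td>0,452</td>
      </tr>
      <tr>
        <th>Raw name 5</th>
        <td>2,094</td>
        <td>0,017</td>
        <td>0,098</td>
        <td>0,113</td>
        <td>0,452</td>
      </tr>
    </table>
    EOT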


My scraper code is:

        doc = Nokogiri::HTML(open(html), nil, 'UTF-8')
        tables = doc.css('div.open')
        @tablesArray = []
        tables.each do |table|
          title = table.css('tr[1] > th').text
          cell_data = table.css('tr > td').text
          raw_name = table.css('tr > th').text
          @tablesArray << Table.new(cell_data, raw_name)
        end
        render template: 'scrape_krasecology'
      end
    end

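When I try to display the data in the HTML page, it looks like all the column names are stored in a single element of the array, and all the data is stored the same way.

The key to the problem is that calling #text on multiple results returns the concatenation of the #text of each individual element.

Let's look at what each step does: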

    # Finds all <table>s with class open
    # I'm assuming you have only one <table> so
    #  you don't actually have to loop through
    #  all tables, instead you can just operate
    #  on the first one. If that is not the case,
    #  you can use a loop the way you did
    tables = doc.css('table.open')

    # The text of all <th>s in <tr> one in the table
    title = table.css('tr[1] > th').text

    # The text of all <td>s in all <tr>s in the table
    # You obviously wanted just the <td>s in one <tr>
    cell_data = table.css('tr > td').text

    # The text of all <th>s in all <tr>s in the table
    # You obviously wanted just the <th>s in one <tr>
    raw_name = table.css('tr > th').text

Now that we know what is wrong, here is a possible solution:

    require 'nokogiri'

    html = <<EOT
    <table class="open">
      <tr>
        <th>Table name</th>
        <th>Column name 1</th>
        <th>Column name 2</th>
        <th>Column name 3</th>
        <th>Column name 4</th>
        <th>Column name 5</th>
      </tr>
      <tr>
        <th>Raw name 1</th>
        <td>1001</td>
        <td>1002</td>
        <td>1003</td>
        <td>1004</td>
        <td>1005</td>
      </tr>
      <tr>
        <th>Raw name 2</th>
        <td>2001</td>
        <td>2002</td>
        <td>2003</td>
        <td>2004</td>
        <td>2005</td>
      </tr>
      <tr>
        <th>Raw name 3</th>
        <td>3001</td>
        <td>3002</td>
        <td>3003</td>
        <td>3004</td>
        <td>3005</td>
      </tr>
    </table>
    EOT

    doc = Nokogiri::HTML(html, nil, 'UTF-8')

    # Fetches only the first <table>. If you have
    #  more than one, you can loop the way you
    #  originally did.
    table = doc.css('table.open').first

    # Fetches all rows (<tr>s)
    rows = table.css('tr')

    # The column names are the first row (shift returns
    #  the first element and removes it from the array).
    # On that row we get the text of each individual <th>
    # This will be Table name, Column name 1, Column name 2...
    column_names = rows.shift.css('th').map(&:text)

    # On each of the remaining rows
    text_all_rows = rows.map do |row|

      # We get the name (<th>)
      # On the first row this will be Raw name 1
      #  on the second - Raw name 2, etc.
      row_name = row.css('th').text

      # We get the text of each individual value (<td>)
      # On the first row this will be 1001, 1002, 1003...
      #  on the second - 2001, 2002, 2003... etc
      row_values = row.css('td').map(&:text)

      # We map the name, followed by all the values
      [row_name, *row_values]
    end

    p column_names
    # => ["Table name", "Column name 1", "Column name 2",
    #     "Column name 3", "Column name 4", "Column name 5"]

    p text_all_rows
    # => [["Raw name 1", "1001", "1002", "1003", "1004", "1005"],
    #     ["Raw name 2", "2001", "2002", "2003", "2004", "2005"],
    #     ["Raw name 3", "3001", "3002", "3003", "3004", "3005"]]

    # If you want to combine them
    text_all_rows.each do |row_as_text|
      p column_names.zip(row_as_text).to_h
    end
    # =>
    # {"Table name"=>"Raw name 1", "Column name 1"=>"1001", "Column name 2"=>"1002", "Column name 3"=>"1003", "Column name 4"=>"1004", "Column name 5"=>"1005"}
    # {"Table name"=>"Raw name 2", "Column name 1"=>"2001", "Column name 2"=>"2002", "Column name 3"=>"2003", "Column name 4"=>"2004", "Column name 5"=>"2005"}
    # {"Table name"=>"Raw name 3", "Column name 1"=>"3001", "Column name 2"=>"3002", "Column name 3"=>"3003", "Column name 4"=>"3004", "Column name 5"=>"3005"}


Your desired output is not valid Ruby:

    ['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452]
    # ~> -:1: Invalid octal digit
    # ~> ['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452]

I'll assume you want the numbers quoted.
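As a quick illustration of why the unquoted version fails, here is a minimal sketch in plain Ruby. It uses eval only so the broken literal can be shown without stopping the script; the variable names are mine, not the asker's.

```ruby
# Inside an array literal the comma separates elements, so 0,017 is read as
# two integers: 0 and the octal literal 017 (decimal 15).
parsed = eval('[0,017]')
p parsed # => [0, 15]

# 094 also starts with 0, but 9 is not an octal digit, so the parser
# rejects the whole literal with a SyntaxError.
error = begin
  eval('[2,094]')
  nil
rescue SyntaxError => e
  e
end
p error.class # => SyntaxError

# Quoting the values keeps them exactly as they appear in the table.
row = ['Raw name 1', '2,094', '0,017', '0,098', '0,113', '0,452']
p row
```

So even the cells that happen to parse would silently turn into different numbers, which is another reason to keep them as strings until you convert them deliberately.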

Stripping out the parts that keep the code from working, reducing the HTML to a more manageable example, and then running it:

    require 'nokogiri'

    html = <<EOT
    <table class="open">
      <tr>
        <th>Table name</th>
        <th>Column name 1</th>
        <th>Column name 2</th>
      </tr>
      <tr>
        <th>Raw name 1</th>
        <td>2,094</td>
        <td>0,017</td>
      </tr>
      <tr>
        <th>Raw name 5</th>
        <td>2,094</td>
        <td>0,017</td>
      </tr>
    </table>
    EOT

    doc = Nokogiri::HTML(html)
    tables = doc.css('table.open')
    tables_data = []

    tables.each do |table|
      title = table.css('tr[1] > th').text # !> assigned but unused variable - title
      cell_data = table.css('tr > td').text
      raw_name = table.css('tr > th').text
      tables_data << [cell_data, raw_name]
    end

Which results in:

    tables_data
    # => [["2,0940,0172,0940,017",
    #      "Table nameColumn name 1Column name 2Raw name 1Raw name 5"]]

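The first thing to notice is that you assign title but never use it. That possibly happened while you were cleaning up your code.

css, like search and xpath, returns a NodeSet, which is akin to an array of Nodes. When you call text or inner_text on a NodeSet, it returns the text of every node concatenated into a single string. The documentation says:

"Get the inner text of all contained Node objects."

This is its behavior: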

    require 'nokogiri'

    doc = Nokogiri::HTML('<p>foo</p><p>bar</p>')

    doc.css('p').text # => "foobar"

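Instead, you should iterate over each node found and extract its text individually. This has been covered many times here on SO:

    doc.css('p').map{ |node| node.text } # => ["foo", "bar"]

That can be simplified to:

    doc.css('p').map(&:text) # => ["foo", "bar"]

See also "How to avoid joining all text from Nodes when scraping".

For content, text and inner_text when used on a single Node, the documentation says:

"Returns the content for this Node."

So you need to go after the text of the individual nodes: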

    require 'nokogiri'

    html = <<EOT
    <table class="open">
      <tr>
        <th>Table name</th>
        <th>Column name 1</th>
        <th>Column name 2</th>
        <th>Column name 3</th>
        <th>Column name 4</th>
        <th>Column name 5</th>
      </tr>
      <tr>
        <th>Raw name 1</th>
        <td>2,094</td>
        <td>0,017</td>
        <td>0,098</td>
        <td>0,113</td>
        <td>0,452</td>
      </tr>
      <tr>
        <th>Raw name 5</th>
        <td>2,094</td>
        <td>0,017</td>
        <td>0,098</td>
        <td>0,113</td>
        <td>0,452</td>
      </tr>
    </table>
    EOT

    tables_data = []

    doc = Nokogiri::HTML(html)

    doc.css('table.open').each do |table|

      # find all rows in the current table, then iterate over the second all the way to the final one...
      table.css('tr')[1..-1].each do |tr|

        # collect the cell data and raw names from the remaining rows' cells...
        raw_name = tr.at('th').text
        cell_data = tr.css('td').map(&:text)

        # aggregate it...
        tables_data += [raw_name, cell_data]
      end
    end

Which now results in:

    tables_data
    # => ["Raw name 1",
    #     ["2,094", "0,017", "0,098", "0,113", "0,452"],
    #     "Raw name 5",
    #     ["2,094", "0,017", "0,098", "0,113", "0,452"]]

You can figure out how to coerce the quoted numbers into decimals Ruby will accept, or manipulate the inner arrays however you want.
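One way to do both, sketched under the assumption that the comma is a decimal separator (so 2,094 means 2.094). The to_number helper and the rows variable are names made up for illustration, and tables_data is simply the result shown above pasted in as a literal.

```ruby
# The flat result from above: a name followed by its array of cell strings.
tables_data = [
  'Raw name 1', ['2,094', '0,017', '0,098', '0,113', '0,452'],
  'Raw name 5', ['2,094', '0,017', '0,098', '0,113', '0,452']
]

# Assumption: "2,094" uses a decimal comma. Float() raises ArgumentError on
# anything that is not a clean number, so bad cells are not silently ignored.
to_number = ->(text) { Float(text.tr(',', '.')) }

# Pair each name with its cells, then flatten into one array per row.
rows = tables_data.each_slice(2).map do |raw_name, cell_data|
  [raw_name, *cell_data.map(&to_number)]
end

p rows.first # => ["Raw name 1", 2.094, 0.017, 0.098, 0.113, 0.452]
```

If the comma is actually a thousands separator in your data, swap the tr call for text.delete(',') and convert with Integer() instead.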

I assume you borrowed some code from here or from some other related reference (apologies if this is the wrong reference): http://quabr.com/34781600/ruby-nokogiri-parse-html-table

However, if you want to capture all the rows, you can change the code as follows:

    doc = Nokogiri::HTML(open(html), nil, 'UTF-8')

    # We need .open tr, because we want to capture all the columns from a specific table's row
    @tablesArray = doc.css('table.open tr').reduce([]) do |array, row|
      # This will allow us to create result as this your illustrated one
      # ie. ['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452]
      array << row.css('th, td').map(&:text)
    end

    render template: 'scrape_krasecology'

Hope this helps you resolve your issue.

