What is a semantically correct way to parse CSV from SQL Server 2008?

I received a CSV dump from SQL Server 2008 that contains lines like the following:

    Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
    Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
    Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
    Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
    Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00

parse_dbenhur is pretty, but can it be rewritten to handle fields that contain both commas and quotes? parse_ugly is, well, ugly.

    # @dbenhur's excellent answer, which works 100% for what i originally asked for
    SEP = /(?:,|\Z)/
    QUOTED = /"([^"]*)"/
    UNQUOTED = /([^,]*)/
    FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/
    def parse_dbenhur(line)
      line.scan(FIELD)[0...-1].map{ |matches| matches[0] || matches[1] }
    end

    def parse_ugly(line)
      dumb_fields = line.chomp.split(',').map { |v| v.gsub(/\s+/, ' ') }
      fields = []
      open = false
      dumb_fields.each_with_index do |v, i|
        open ? fields.last.concat(v) : fields.push(v)
        open = (v.start_with?('"') and (v.count('"') % 2 == 1) and dumb_fields[i+1] and dumb_fields[i+1].start_with?(' ')) || (open and !v.end_with?('"'))
      end
      fields.map { |v| (v.start_with?('"') and v.end_with?('"')) ? v[1..-2] : v }
    end

    lines = []
    lines << 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00'
    lines << 'Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00'
    lines << 'Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00'
    lines << 'Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00'
    lines << 'Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00'

    require 'csv'
    lines.each do |line|
      puts
      puts line
      begin
        c = CSV.parse_line(line)
        puts "#{c.to_csv.chomp} (size #{c.length})"
      rescue
        puts "FasterCSV says: #{$!}"
      end
      a = parse_ugly(line)
      puts "#{a.to_csv.chomp} (size #{a.length})"
      b = parse_dbenhur(line)
      puts "#{b.to_csv.chomp} (size #{b.length})"
    end

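Here is the output when it runs. For each input line it prints the line itself, then the result from CSV (or its error), then parse_ugly, then parse_dbenhur: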

    Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
    FasterCSV says: Illegal quoting in line 1.
    Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)
    Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)

    Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
    FasterCSV says: Unclosed quoted field on line 1.
    Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)
    Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)

    Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
    FasterCSV says: Missing or stray quote in line 1
    Electrical,197135021E,"SERVICE ""OUTLETS""",1997-05-15 00:00:00 (size 4)
    Electrical,197135021E,"""SERVICE"," ""OUTLETS""""",1997-05-15 00:00:00 (size 5)

    Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
    FasterCSV says: Missing or stray quote in line 1
    Electrical,197135021E,"SERVICE ""OUTLETS"" FOOBAR",1997-05-15 00:00:00 (size 4)
    Electrical,197135021E,"""SERVICE"," ""OUTLETS"" FOOBAR""",1997-05-15 00:00:00 (size 5)

    Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00
    Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00 (size 4)
    Construction,198120036B,"""""MERITER""","""DO IT CTR"""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)
    Construction,198120036B,"""""""MERITER""""","""""DO IT CTR"""""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)

UPDATE

Note that the CSV wraps a field in double quotes when the field contains a comma.
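To make the contrast concrete, here is a minimal sketch using Ruby's standard csv library on two of the sample lines. The last sample line doubles every quote inside its quoted field, which is the usual CSV convention, so it parses cleanly. The first sample line has a bare quote inside an unquoted field, which the standard parser rejects. The variable names (good, bad, error) are just for illustration.

```ruby
require 'csv'

# Last sample line: quotes inside the quoted field are doubled.
good = 'Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00'
fields = CSV.parse_line(good)
p fields.length   # 4
puts fields[2]    # "MERITER","DO IT CTR", "NCR" AND "TRACE" ALTERATION

# First sample line: a bare quote inside an unquoted field.
bad = 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00'
error = begin
  CSV.parse_line(bad)
  nil
rescue CSV::MalformedCSVError => e
  e
end
puts error.message   # the parser's complaint about the stray quote
```

So the difficulty is limited to the lines where SQL Server left inner quotes undoubled.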

UPDATE 2

It's OK if the commas get stripped from the field in question... my parse_ugly method doesn't preserve them.

UPDATE 3

I learned from the client that it is SQL Server 2008 that exports this odd CSV; this has been reported to Microsoft here and here.

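UPDATE 4

@dbenhur's answer works perfectly for what I originally asked for, but it points out that I neglected to show lines containing both commas and quotes. I will accept @dbenhur's answer, but I hope it can be improved to work on all of the lines above.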

ABSOLUTELY FINAL UPDATE

This code works (and I think it is semantically correct):

    QUOTED = /"((?:[^"]|(?:""(?!")))*)"/
    SEPQ = /,(?! )/
    UNQUOTED = /([^,]*)/
    SEPU = /,(?=(?:[^ ]|(?: +[^",]*,)))/
    FIELD = /(?:#{QUOTED}#{SEPQ})|(?:#{UNQUOTED}#{SEPU})|\Z/
    def parse_sql_server_2008_csv_line(line)
      line.scan(FIELD)[0...-1].map{ |matches| (matches[0] || matches[1]).tr(',', ' ').gsub(/\s+/, ' ') }
    end

Adapted from @dbenhur's answer and from @ghostdog74's answer to "How can I process a CSV file with 'bad commas'?"

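The following uses a regexp and String#scan. I observe that in the broken CSV format you're dealing with, " only has quoting properties when it appears at the beginning and end of a field.

Scan moves through the string, successively matching the regexp, so the regexp can assume that its starting match point is the beginning of a field. We construct the regexp so that it matches either a balanced quoted field with no internal quotes (QUOTED) or a string of non-commas (UNQUOTED). Whichever alternative matches, it must be followed by a separator, which can be either a comma or the end of the string (SEP).

Because UNQUOTED can match a zero-length field before a separator, the scan always matches an empty field at the end, which we discard with [0...-1]. Scan produces an array of tuples; each tuple is an array of the capture groups, so we map over each element and pick the alternative that was captured with matches[0] || matches[1].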
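A small sketch of what scan returns before any cleanup may make this easier to follow. It uses the same four patterns as this answer, stored in local variables, and a made-up input string (sample is hypothetical, not taken from the dump).

```ruby
# Same patterns as in this answer, kept in local variables.
sep      = /(?:,|\Z)/
quoted   = /"([^"]*)"/
unquoted = /([^,]*)/
field    = /(?:#{quoted}|#{unquoted})#{sep}/

sample = 'a,"b,c",d'   # hypothetical input for illustration

# Each tuple is [quoted_capture, unquoted_capture]; exactly one is non-nil.
raw = sample.scan(field)
p raw
# => [[nil, "a"], ["b,c", nil], [nil, "d"], [nil, ""]]

# The final tuple is the zero-length match at the end of the string,
# so it is dropped before picking the captured alternative.
fields = raw[0...-1].map { |q, u| q || u }
p fields
# => ["a", "b,c", "d"]
```

The trailing [nil, ""] tuple is the empty field described above, which is why the slice [0...-1] is needed.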

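None of your example lines shows a field that contains both a comma and a quote. I don't know how such a field would be legally represented, and this code probably won't recognize it correctly.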

    SEP = /(?:,|\Z)/
    QUOTED = /"([^"]*)"/
    UNQUOTED = /([^,]*)/

    FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/

    def ugly_parse line
      line.scan(FIELD)[0...-1].map{ |matches| matches[0] || matches[1] }
    end

    # lines is the array of sample lines from the question
    lines.each do |l|
      puts l
      puts ugly_parse(l).inspect
      puts
    end

    # Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00
    # ["Electrical", "197135021E", "SERVICE, OUTLETS", "1997-05-15 00:00:00"]
    #
    # Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
    # ["Plumbing", "196222006P", "REPLACE LEAD WATER SERVICE W/1\" COPPER", "1996-08-09 00:00:00"]
    #
    # Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
    # ["Construction", "197133031B", "\"MORGAN SHOES\" ALT", "1997-05-13 00:00:00"]
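The caveat about fields that contain both a comma and a quote can be checked directly against one of the lines added to the question later. This is only a sketch: the patterns are the ones from this answer, kept in local variables, and parse is a hypothetical lambda standing in for the method.

```ruby
# Patterns from this answer, kept in local variables.
sep      = /(?:,|\Z)/
quoted   = /"([^"]*)"/
unquoted = /([^,]*)/
field    = /(?:#{quoted}|#{unquoted})#{sep}/

# Hypothetical stand-in for the parsing method.
parse = ->(line) { line.scan(field)[0...-1].map { |q, u| q || u } }

line = 'Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00'
result = parse.call(line)

p result.length   # 5, one more than the 4 intended fields
p result[2]       # "\"SERVICE"
p result[3]       # " \"OUTLETS\"\""
```

QUOTED matches "SERVICE, " but is then followed by O rather than a separator, so the regexp falls back to UNQUOTED and splits at the inner comma. That agrees with the size 5 result shown for this line in the question's output.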

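If your CSV doesn't use double quotes as a legitimate quoting character, adjust the options passed to CSV to include :quote_char => "\0", and then you can do this (strings wrapped for clarity):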

    1.9.3p327 > puts 'Construction,197133031B,"MORGAN SHOES" ALT,
                      1997-05-13 00:00:00'.parse_csv(:quote_char => "\0")
    Construction
    197133031B
    "MORGAN SHOES" ALT
    1997-05-13 00:00:00

    1.9.3p327 > puts 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,
                      1996-08-09 00:00:00'.parse_csv(:quote_char => "\0")
    Plumbing
    196222006P
    REPLACE LEAD WATER SERVICE W/1" COPPER
    1996-08-09 00:00:00