Ruby CSV: calculating returns

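I want to calculate 2-day returns for a list of companies. All the information is stored in a CSV file with the following structure: the first column is the company name, the second is the date, the third is the price, and the fourth is the return = p(t + 2) / p(t).

(1) The CSV is 1.8 GB. Using "CSV.each_with_index .." is very slow. If I use "CSV.foreach", I cannot look up the price two days ahead.

(2) Some prices are missing. So even if I use CSV.each_with_index, row i + 2 may not correspond to the correct date.

Thanks for your help.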

Input:

    [
      ['a', '2014-6-1', '1'],
      ['a', '2014-6-2', '2'],
      ['a', '2014-6-4', '3'],
      ['a', '2014-6-5', '4'],
      ['b', '2014-6-1', '1'],
      ['b', '2014-6-2', '2'],
      ['b', '2014-6-3', '3'],
      ['b', '2014-6-4', '4'],
      ['b', '2014-6-5', '5']
    ]

Output:

    [
      ['a', '2014-6-1', '1', ''],    # Missing because no 2014-6-3 price for a
      ['a', '2014-6-2', '2', '1.5'], # p(a2014-6-4)/p(a2014-6-2) = 1.5
      ['a', '2014-6-4', '3', ''],    # Missing because no 2014-6-6 price
      ['a', '2014-6-5', '4', ''],    # Missing because no 2014-6-7 price
      ['b', '2014-6-1', '1', '3'],
      ['b', '2014-6-2', '2', '2'],
      ['b', '2014-6-3', '3', '1.7'],
      ['b', '2014-6-4', '4', ''],
      ['b', '2014-6-5', '5', '']
    ]

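The logic I have in mind is as follows. It is the same as the logic in the first comment. I have not coded the second part because I am not sure of a good way to merge a large CSV with itself in Ruby. I also considered searching the following observations for the nth business day, but I want to avoid each_with_index because the CSV is very large. I don't know how to implement this logic in Ruby.

(1) Compute the nth business day after each date. (2) Merge the dataset with itself to get the price on that nth business day.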

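    require 'csv'
    require 'time'          # needed for Time.parse
    require 'business_time'

    # 30/60/90/365 business days
    # Columns: cdate ncusip prc permno firm
    csvIn = 'in.csv'
    csvOut = 'out.csv'

    csv = CSV.open(csvOut, "w")
    # Header row (column names inferred from the rows written below)
    csv << ['cdate', 'ncusip', 'prc', 'permno', 'firm', 'day60']
    CSV.foreach(csvIn, :headers => true) do |row|
      current_date = Time.parse(row['cdate'])
      day60 = 42.business_days.after(current_date)
      csv << [row['cdate'], row['ncusip'], row['prc'], row['permno'], row['firm'], day60]
    end
    csv.close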

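Your code introduces some new requirements, such as finding the nth business day, that the question does not clearly define. It would probably be better to open a separate question about "the fastest way to find the nth business day in Ruby".

So I will only address the requirements implied by the comments in your sample output.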

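Requirements:

Read a large CSV file that contains dates as formatted strings
For each day's record, find the price n days later (n = 2) within the same group
Append to each record the ratio of the two prices, or leave it empty if there is no price n days later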
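
To pin these requirements down, here is a minimal in-memory sketch run on the sample input from the question. It is only a reference for the expected output, not a solution for a 1.8 GB file. The helper name `two_day_returns` and the constant `N` are made up for illustration, and the sketch assumes calendar days rather than business days.

```ruby
require 'date'

N = 2 # look-ahead in calendar days (an assumption, not business days)

# Hypothetical helper: rows are [name, date_string, price_string].
def two_day_returns(rows, n = N)
  # Index every price by [name, date] so the price n days later is one lookup away.
  prices = {}
  rows.each do |name, date_str, price|
    prices[[name, Date.strptime(date_str, '%Y-%m-%d')]] = price.to_f
  end

  rows.map do |name, date_str, price|
    later = prices[[name, Date.strptime(date_str, '%Y-%m-%d') + n]]
    ratio = later ? (later / price.to_f).round(2) : '' # empty when no price n days later
    [name, date_str, price, ratio]
  end
end

sample = [
  ['a', '2014-6-1', '1'], ['a', '2014-6-2', '2'], ['a', '2014-6-4', '3'],
  ['a', '2014-6-5', '4'], ['b', '2014-6-1', '1'], ['b', '2014-6-2', '2'],
  ['b', '2014-6-3', '3'], ['b', '2014-6-4', '4'], ['b', '2014-6-5', '5']
]

returns = two_day_returns(sample)
returns.each { |row| p row }
```

Because the whole price index lives in memory, this form is mainly useful as a correctness check for faster implementations.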

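Basic benchmark:

By repeating the sample data 40,000 times (9 rows each), I got a 10 MB CSV file with 360,000 records.

My first idea is to build a Buffer class that holds the records whose n-days-later record has not been seen yet. When a new record is pushed into the buffer, the buffer shifts out every record dated n or more days before the new one.

But first I need to know the processing time of the basic operations this implementation might use. Then, by choosing the more efficient operations, I can estimate a lower bound on the total processing time:

Convert a date-formatted string to a Date at least 360,000 times
Compare two dates 360,000 times
Get the date n days after another date 360,000 times
Calculate the number of days between two dates 360,000 times
Compare two dates stored in an array of arrays 360,000 times
Push a row into the buffer and shift one out 360,000 times
Append a ratio or an empty string to every record 360,000 times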

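I have heard that CSV is a very inefficient way to parse, so I will also compare the processing time of two ways of parsing the file:

Use CSV.foreach to read the CSV file line by line and parse each line into an array
Use IO.read to read the CSV file into one string at once, then split that string into arrays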
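
As a small illustration of the two parsing approaches, the sketch below runs both on a three-line temporary file and checks that they produce the same arrays. The file contents are made up for the example, and the manual split is only safe under the assumption that no field contains commas, quotes, or whitespace.

```ruby
require 'csv'
require 'tempfile'

# Write a tiny sample file so the sketch is self-contained.
file = Tempfile.new(['sample', '.csv'])
file.write("a,2014-6-1,1\na,2014-6-2,2\nb,2014-6-1,1\n")
file.close

# Approach 1: CSV.foreach parses one line at a time (handles quoting, but is slower).
csv_rows = []
CSV.foreach(file.path) { |row| csv_rows << row }

# Approach 2: read the whole file into one string and split it by hand.
# String#split with no argument splits on whitespace, i.e. on the newlines here.
io_rows = File.read(file.path).split.map { |line| line.split(',') }

puts "same rows: #{csv_rows == io_rows}"
file.unlink
```

The manual split skips all of the quoting and escaping rules that the CSV library implements, which is where its speed advantage comes from.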

Basic benchmark script:

    require 'csv'
    require 'date'
    require 'benchmark'

    Benchmark.bm { |x|
      date1 = Date.today
      date2 = Date.today.next
      date_str = '2014-6-1'
      a = [[1, 2, 4, date2], 2, [1, 2, 4, date1]]
      res = []   # result array used by report 7

      # 1. convert date formatted string to date at least 360,000 times
      x.report("1.string2date") { 360000.times { Date.strptime(date_str, "%Y-%m-%d") } }
      # 2. compare two days for 360,000 times
      x.report("2.DateCompare") { 360000.times { date2 >= date1 } }
      # 3. get the date that is n days after another date for 360,000 times
      x.report("3.DateAdd2   ") { 360000.times { date1 + 2 } }
      # 4. calculate the days between two dates for 360,000 times
      x.report("4.Date Differ") { 360000.times { date2 - date1 } }
      # 5. compare two dates stored in an array of arrays for 360,000 times
      x.report("5.ArrDateComp") { 360000.times { a.last[3] > a.first[3] } }
      # 6. push a row into buffer and shift out for 360,000 times
      x.report("6.array shift") { 360000.times { a << [1, 2, 3]; a.shift } }
      # 7. append a ratio or empty string to every record for 360,000 times
      x.report("7.Add Ratio  ") {
        360000.times { res << (['1', '2014-6-1', "3"] << (2 == 2 ? (3.to_f / 2.to_f).round(2) : "")) }
      }
      # File parsing: CSV library versus reading the whole file and splitting by hand
      x.report('CSVparse     ') { CSV.foreach("data.csv") { |row| } }
      x.report('IOread       ') {
        IO.read("data.csv").split.inject([]) { |memo, o| memo << o.split(',') }.each { |row| }
      }
    }

Results:

                        user     system      total        real
    1.string2date   0.827000   0.000000   0.827000 (  0.820001)
    2.DateCompare   0.078000   0.000000   0.078000 (  0.070000)
    3.DateAdd2      0.109000   0.000000   0.109000 (  0.110000)
    4.Date Differ   0.359000   0.000000   0.359000 (  0.360000)
    5.ArrDateComp   0.109000   0.000000   0.109000 (  0.110001)
    6.array shift   0.094000   0.000000   0.094000 (  0.090000)
    7.Add Ratio     0.530000   0.000000   0.530000 (  0.530000)
    CSVparse        2.902000   0.016000   2.918000 (  2.910005)
    IOread          0.515000   0.015000   0.530000 (  0.540000)

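Analysis of the results

Converting a date-formatted string to a Date is the slowest of operations 1 to 7, so it should be done during file parsing, which guarantees the conversion happens only once per record.
Comparing two dates is roughly 4.6 times faster than calculating the number of days between two dates (0.078 s versus 0.359 s), so I will store the date n days later in the buffer, rather than an integer number of days since the epoch.
The total processing time includes at least parts 1, 2, 3, 5, 6 and 7, so the estimated lower bound is about 1.75 seconds. Some overhead is not included.
With CSV parsing, the lower bound is about 4.65 seconds (1.747 + 2.902).
With IO#read and split, the lower bound is 2.262 seconds (1.747 + 0.515).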
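
The sums behind these lower bounds can be checked directly against the benchmark table. The sketch below just adds up the "user" column; the hash names `per_record` and `parsing` are made up for the example.

```ruby
# Timings in seconds, copied from the "user" column of the benchmark table.
per_record = {
  '1.string2date' => 0.827,
  '2.DateCompare' => 0.078,
  '3.DateAdd2'    => 0.109,
  '5.ArrDateComp' => 0.109,
  '6.array shift' => 0.094,
  '7.Add Ratio'   => 0.530
}
parsing = { 'CSV.foreach' => 2.902, 'IO.read + split' => 0.515 }

# Lower bound for the per-record operations alone.
base = per_record.values.sum.round(3)
puts "operations only: #{base} s"

# Add the cost of each parsing approach on top.
totals = parsing.transform_values { |secs| (base + secs).round(3) }
totals.each { |name, secs| puts "#{name}: #{secs} s" }
```

Operation 4 (the date difference) is left out because the buffer design stores the target date instead of computing day counts.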

Implementation of the Buffer class and its push method

    require 'date'

    class Buff
      def initialize
        @buff = []
        @n = 2
      end

      def push_date(row)
        # Rows are stored in the buffer with two Date values appended:
        #   ["a", "2014-6-1", "1", #<Date>, #<Date>]
        # The last element is the date n days after the record's date.
        res = []
        @buff << (row << (row[3] + @n))
        # Shift out every record that the new row either matches or has passed.
        while @buff.last[3] >= @buff.first[4] || row[0] != @buff.first[0]
          v = (@buff.last[3] == @buff.first[4] && row[0] == @buff.first[0] ? (row[2].to_f / @buff.first[2].to_f).round(2) : "")
          res << (@buff.shift[0..2] << v)
        end
        res
      end

      # Records still in the buffer at the end of input have no later price.
      def tails
        @buff.inject([]) { |res, x| res << (x[0..2] << "") }
      end

      def clear
        @buff = []
      end
    end

Benchmark

    buff = Buff.new
    res = []
    Benchmark.bm { |x|
      buff.clear
      res = []
      x.report("CSVdate") {
        CSV.foreach("data.csv") { |row|
          buff.push_date(row << Date.strptime(row[1], "%Y-%m-%d")).each { |r| res << r }
        }
        buff.tails.each { |r| res << r }
      }

      buff.clear
      res = []
      x.report("IOdate") {
        IO.read("data.csv").split.inject([]) { |memo, o| memo << o.split(',') }.each { |row|
          buff.push_date(row << Date.strptime(row[1], "%Y-%m-%d")).each { |r| res << r }
        }
        buff.tails.each { |r| res << r }
      }
    }
    puts "output result count:#{res.size}"
    puts "Here is the first 12 sample outputs:"
    res[0..11].each { |r| puts r.to_s }

Results

                  user     system      total        real
    CSVdate   6.411000   0.047000   6.458000 (  6.500009)
    IOdate    3.557000   0.109000   3.666000 (  3.710005)
    output result count:360000
    Here is the first 12 sample outputs:
    ["a", "2014-6-1", "1", ""]
    ["a", "2014-6-2", "2", 1.5]
    ["a", "2014-6-4", "3", ""]
    ["a", "2014-6-5", "4", ""]
    ["b", "2014-6-1", "1", 3.0]
    ["b", "2014-6-2", "2", 2.0]
    ["b", "2014-6-3", "3", 1.67]
    ["b", "2014-6-4", "4", ""]
    ["b", "2014-6-5", "5", ""]
    ["a", "2014-6-1", "1", ""]
    ["a", "2014-6-2", "2", 1.5]
    ["a", "2014-6-4", "3", ""]

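Conclusion

The actual processing time is 3.557 seconds, about 57% slower than the estimated lower bound of 2.262 seconds, but some overhead was not accounted for in the estimate.
The CSV version is roughly twice as slow as the IO#read version.
We should use IO#read to read the input file block by block, to prevent out-of-memory errors.
There must still be some room for tuning.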
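
One way to keep memory bounded on a file of the size mentioned in the question is to stream it line by line instead of reading it whole. The sketch below is not the Buff class above but a compact variant of the same sliding-window idea. The method name `each_return_row` is hypothetical, and it assumes the input is sorted by company and then by date, contains no quoted fields, and uses calendar days.

```ruby
require 'date'
require 'stringio'

# Hypothetical streaming helper: reads any IO-like object line by line and
# yields each finished row as [name, date_string, price_string, ratio_or_empty].
def each_return_row(io, n = 2)
  window = [] # pending rows: [name, date_string, price_string, Date]
  io.each_line do |line|
    name, date_str, price = line.chomp.split(',')
    next if name.nil? # skip blank lines
    date = Date.strptime(date_str, '%Y-%m-%d')

    # Flush pending rows that this row either matches or has already passed.
    while (head = window.first) && (head[0] != name || date >= head[3] + n)
      matched = head[0] == name && date == head[3] + n
      ratio = matched ? (price.to_f / head[2].to_f).round(2) : ''
      yield head[0..2] << ratio
      window.shift
    end
    window << [name, date_str, price, date]
  end
  # Rows left at the end of input have no price n days later.
  window.each { |row| yield row[0..2] << '' }
end

# Demo on the sample data; StringIO stands in for a real file handle.
data = StringIO.new(<<~CSV)
  a,2014-6-1,1
  a,2014-6-2,2
  a,2014-6-4,3
  a,2014-6-5,4
  b,2014-6-1,1
  b,2014-6-2,2
  b,2014-6-3,3
  b,2014-6-4,4
  b,2014-6-5,5
CSV

out = []
each_return_row(data) { |row| out << row }
out.each { |row| p row }
```

For a real file, the same method could be called as `File.open('data.csv') { |f| each_return_row(f) { |row| ... } }`, writing each row out as it is yielded, so that only the pending window is held in memory at any time.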

UPDATE1：

Tuning

A faster push, obtained by changing the order of the group comparison and the date comparison:

    class Buff
      def push_fast(row)
        # Rows are stored in the buffer with two Date values appended:
        #   ["a", "2014-6-1", "1", #<Date>, #<Date>]
        # The last element is the date n days after the record's date.
        res = []
        row << (row[3] + @n)
        # Changing the order of the two comparisons reduces the number of date comparisons.
        while @buff.first && (row[0] != @buff.first[0] || row[3] >= @buff.first[4])
          v = (row[0] == @buff.first[0] && row[3] == @buff.first[4] ? (row[2].to_f / @buff.first[2].to_f).round(2) : "")
          res << (@buff.shift[0..2] << v)
        end
        @buff << row
        res
      end
    end

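Benchmark results

                 user     system      total        real
    IOdate   3.806000   0.031000   3.837000 (  3.830005)
    IOfast   3.323000   0.062000   3.385000 (  3.390005)

This gains about 0.48 seconds. Comparing the group first saves many date comparisons: when the group changes, all buffered records are shifted out without any date comparison.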

Here is another possibility:

As Jaugar suggested, use CSV or IO to do a simple read into an array.
Create a hash of ids and dates.
Iterate over the input array and look up date + 2 in the hash.
Output the calculation.

Assuming you have an array named input as described above, it looks like this:

    # convert.rb
    require 'date'

    class Convert
      attr_reader :dates_hash

      def initialize
        @dates_hash = {}
        @n = 2
        @date_converter = {}
      end

      def add_to_hash(row)
        # Build a hash of ids, dates and values, like this:
        #   {"a"=>{#<Date: 2014-6-1> => 1.0, #<Date: 2014-6-2> => 2.0} ... etc.}
        id = row[0]
        date = to_date(row[1])
        value = row[2].to_f # to_f so that fractional prices are not truncated
        # Merge using a block, so that for a given id (like "a"), the inner hashes
        # append rather than replace each other
        @dates_hash.merge!({ id => { date => value } }) do |key, x, y|
          x.merge(y)
        end
      end

      def input(input_array)
        input_array.map do |row|
          id = row[0]
          date = to_date(row[1])
          value = row[2]
          # Build an output row of id, date, original price and the ratio.
          # later_price is the price @n days later; if it exists, do the calculation.
          later_price = @dates_hash[id][date + @n]
          row[0..2] << (later_price ? later_price / value.to_f : '')
        end
      end

      def to_date(date)
        # Convert to Date and memoize
        @date_converter[date] ||= Date.parse(date)
      end
    end

I benchmarked it following Jaugar's example, using this script:

    require 'csv'
    require 'benchmark'
    require './convert'

    buff = []
    output = []
    Benchmark.bm { |x|
      x.report("convert with csv") {
        buff = [] # reset so rows from an earlier report are not converted again
        converter = Convert.new
        CSV.foreach("data.csv") do |row|
          buff << row
          converter.add_to_hash(row)
        end
        output = converter.input(buff)
      }
      x.report("convert with IO") {
        buff = [] # reset so rows from an earlier report are not converted again
        converter = Convert.new
        IO.readlines("data.csv").map { |row| row.chomp.split(',') }.each do |row|
          buff << row
          converter.add_to_hash(row)
        end
        output = converter.input(buff)
      }
    }
    puts "Here is the first 12 sample outputs:"
    output[0..11].each { |x| puts x.to_s }

Jaugar's benchmark on my machine:

                   user     system      total        real
    CSVdate   11.270000   0.020000  11.290000 ( 11.302404)
    IOdate     8.740000   0.020000   8.760000 (  8.756997)

My benchmark:

                            user     system      total        real
    convert with csv   10.450000   0.090000  10.540000 ( 10.546727)
    convert with IO    12.850000   0.120000  12.970000 ( 12.972962)
    ["a", " 2014-6-1", " 1", ""]
    ["a", " 2014-6-2", " 2", 1.5]
    ["a", " 2014-6-4", " 3", ""]
    ["a", " 2014-6-5", " 4", ""]
    ["b", " 2014-6-1", " 1", 3.0]
    ["b", " 2014-6-2", " 2", 2.0]
    ["b", " 2014-6-3", " 3", 1.6666666666666667]
    ["b", " 2014-6-4", " 4", ""]
    ["b", " 2014-6-5", " 5", ""]
    ["a", " 2014-6-1", " 1", ""]
    ["a", " 2014-6-2", " 2", 1.5]
    ["a", " 2014-6-4", " 3", ""]

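I am not sure why my IO version is slower, but it is in a similar range. One likely cause is that these figures were measured with a single buff array shared by both reports, so the IO run also re-converted the rows left over from the CSV run. My hash is very small only because the data in the CSV keeps repeating. I am not sure how this performs in a more realistic case with a larger hash, but it should not degrade badly, because hash lookups are very efficient.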
