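How to remove dupes from blocks of text

What is a smart and simple way to remove dupes within blocks in a text file? Blocks are separated by two newlines (that is, a blank line).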

Before:

    apple
    banana
    apple
    cherry
    cherry

    delta
    epsilon
    delta
    epsilon

    apple pie
    delta
    delta

After:

    apple
    banana
    cherry

    delta
    epsilon

    apple pie
    delta

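Thanks. It should work on a Mac. Unicode must be allowed. Any shell method/language/command is fine. Dupes are not necessarily consecutive. Bonus if leading/trailing whitespace is ignored, or if a comma can be used as the delimiter within a record.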

    $ awk '!NF{delete seen} !seen[$0]++' file
    apple
    banana
    cherry

    delta
    epsilon

    apple pie
    delta

With GNU awk for gensub(), ignoring (rather than removing) leading/trailing whitespace would be:

 $ awk '!NF{delete seen} !seen[gensub(/^\s+|\s+$/,"","g")]++' file 

I don't know what you mean by "can use a comma as the delimiter within a record" in this context.
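On a Mac without GNU awk installed (the stock awk has no gensub()), the same `seen` idea can be sketched in Ruby. This is only a minimal sketch, not the answerer's code: the method name `dedupe_blocks` is made up for illustration, and it assumes a whitespace-only line ends a block, as `!NF` does in the awk version.

```ruby
require "set"

# Hypothetical helper (not from the original answer) mirroring the awk logic.
# A line is kept if its stripped form has not been seen yet in the current
# block; a blank line resets the set of seen keys.
def dedupe_blocks(text)
  seen = Set.new
  out = []
  text.each_line do |line|
    key = line.strip            # ignore leading/trailing whitespace when comparing
    if key.empty?
      seen.clear                # blank line: a new block starts
      out << line
    elsif seen.add?(key)        # add? returns nil when the key was already present
      out << line               # keep the line as written (whitespace is not removed)
    end
  end
  out.join
end

sample = "apple\nbanana\n  apple\ncherry\n\ndelta\nepsilon\ndelta\n"
puts dedupe_blocks(sample)
```

Comparing on the stripped key while emitting the untouched line matches the "ignore rather than remove" behaviour of the gensub() version.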

ruby!

    text =<<_
    apple
    banana
    apple
    cherry
    cherry

    delta
    epsilon
    delta
    epsilon

    apple pie
    delta
    delta
    _

    r1 = /
         (?<=\n) # match a newline in a positive lookbehind
         \n      # match a newline
         /x      # extended/free-spacing regex definition mode

    r2 = /
         (?<=\n) # match a newline in a positive lookbehind
         /x

    puts text.split(r1).map { |s| s.split(r2).uniq.join }.join("\n")
      # apple
      # banana
      # cherry
      #
      # delta
      # epsilon
      #
      # apple pie
      # delta

The steps:

    a = text.split(r1)
      #=> ["apple\nbanana\napple\ncherry\ncherry\n",
      #    "delta\nepsilon\ndelta\nepsilon\n",
      #    "apple pie\ndelta\ndelta\n"]
    a.map { |s| s.split(r2) }
      #=> [["apple\n", "banana\n", "apple\n", "cherry\n", "cherry\n"],
      #    ["delta\n", "epsilon\n", "delta\n", "epsilon\n"],
      #    ["apple pie\n", "delta\n", "delta\n"]]
    a.map { |s| s.split(r2).uniq }
      #=> [["apple\n", "banana\n", "cherry\n"],
      #    ["delta\n", "epsilon\n"],
      #    ["apple pie\n", "delta\n"]]
    b = a.map { |s| s.split(r2).uniq.join }
      #=> ["apple\nbanana\ncherry\n",
      #    "delta\nepsilon\n",
      #    "apple pie\ndelta\n"]
    b.join("\n")
      #=> "apple\nbanana\ncherry\n\ndelta\nepsilon\n\napple pie\ndelta\n"

This might work for you (GNU sed):

 sed -r ':a;N;s/\b((\S+)\b.*)\n\2$/\1/;/^$/M!ba' file 

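Gather lines in the pattern space (PS) until a blank line or the end of the file. Pattern-match the last line read against the previous lines and, if it matches, delete the last line. If the last line is empty (or the end of the file has been reached), print all the lines retained in the PS.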
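The buffering described above can be sketched in Ruby as follows. This illustrates the idea rather than translating the sed script: it compares whole lines, and the name `dedupe_buffered` is an assumption made for the example.

```ruby
# Hypothetical sketch of the buffer-and-flush idea: collect the lines of a
# block, drop an incoming line that is already in the buffer, and flush the
# buffer when a blank line or the end of the input is reached.
def dedupe_buffered(lines)
  result = []
  buffer = []                   # plays the role of the pattern space
  lines.each do |line|
    if line.empty?
      result.concat(buffer)     # blank line: flush the block
      result << line
      buffer = []
    elsif !buffer.include?(line)
      buffer << line            # first occurrence within this block
    end
  end
  result.concat(buffer)         # end of input: flush what is left
end

input = ["delta", "epsilon", "delta", "", "apple pie", "delta", "delta"]
puts dedupe_buffered(input)
```

Because the buffer is emptied at each blank line, "delta" survives once in each block where it occurs.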

Given:

    $ cat file
    apple
    banana
    apple
    cherry
    cherry

    delta
    epsilon
    delta
    epsilon

    apple pie
    delta
    delta

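You can use Ruby's paragraph-mode command-line switch to treat blank lines as the separator between records, and set the field separator to \n. Then uniquify each block:

    $ ruby -00 -F'\n' -lane '$><<$F.uniq.join("\n")<<"\n\n"' file
    apple
    banana
    cherry

    delta
    epsilon

    apple pie
    delta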

Explanation:

    $ ruby -00 -F'\n' -lane '$><<$F.uniq.join("\n")<<"\n\n"'

    ruby          # ruby 1.9+ only, I think
    -00           # paragraph mode: split records on \n\n
    -F'\n'        # split fields on \n
    -lane         # options:
                  #   -l  chomp the separator from each record
                  #   -a  auto-split each record into $F
                  #   -n  loop over the input without auto-printing
                  #   -e  run the script given on the command line
    $>            # STDOUT
    <<            # append
    $F            # the split fields
    .uniq         # made unique
    .join("\n")   # joined back into a string
    <<"\n\n"      # add back the record separator
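Written out as a script, the one-liner is roughly equivalent to the following sketch. It assumes the whole file fits in memory and relies on `String#each_line("")`, which reads in paragraph mode; `uniq_blocks` is a name chosen for the example, not something from the answer.

```ruby
# Hypothetical script form of the one-liner: split the text into paragraphs,
# split each paragraph into lines, and keep the first occurrence of each line.
def uniq_blocks(text)
  text.each_line("").map do |paragraph|      # "" selects paragraph mode
    fields = paragraph.split("\n").reject(&:empty?)
    fields.uniq.join("\n") + "\n\n"          # add back the record separator
  end.join
end

puts uniq_blocks("apple\nbanana\napple\n\ndelta\ndelta\n")
```

For a real file, `uniq_blocks(File.read("file"))` would be one way to call it.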

Alternatively, you can use a Ruby hash to count the fields and then print only the keys of the hash:

    $ ruby -00 -F'\n' -lane 'h=Hash.new(0)
    $F.each {|f| h[f]+=1 }
    p h
    puts h.keys.join("\n")<<"\n\n"
    ' file
    {"apple"=>2, "banana"=>1, "cherry"=>2}
    apple
    banana
    cherry

    {"delta"=>2, "epsilon"=>2}
    delta
    epsilon

    {"apple pie"=>1, "delta"=>2}
    apple pie
    delta

(In Ruby 1.9+, hashes preserve insertion order, so this prints the words in file order.)
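A short check of that ordering claim, as a sketch: the helper name `count_fields` is made up for the example, and the counts are only there to mirror the one-liner.

```ruby
# Hypothetical helper mirroring the one-liner: count each field in a hash.
# Keys come back in the order they were first inserted (Ruby 1.9+).
def count_fields(fields)
  counts = Hash.new(0)
  fields.each { |f| counts[f] += 1 }
  counts
end

counts = count_fields(%w[cherry apple cherry banana apple])
p counts.keys
```

The keys follow first appearance, not alphabetical order, which is why printing them reproduces the file order.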

Then, if you want to add , as a potential field separator, you can do:

 $ ruby -00 -F'\n|,' -lane '$><<$F.uniq.join("\n")<<"\n\n"' file
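What that split does to a block can be sketched as below. The sample record with commas is invented for illustration (the question gives no comma-separated data), and the `strip` step is an extra that covers the leading/trailing-whitespace bonus; the one-liner above does not do it.

```ruby
# Hypothetical sketch: split a block on newlines or commas, trim each field,
# and keep the first occurrence of each.
def uniq_fields(block)
  block.split(/\n|,/).map(&:strip).reject(&:empty?).uniq
end

p uniq_fields("apple, banana,apple\ncherry , banana")
```

Without the `strip`, " banana" and "banana" would count as different fields.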