How to extract specific information from many text files

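I have more than 200 txt files; one of them is shown further below as an example. I want to read them one by one, extract specific information from each, and export it to an xls file.

For example, how can I get the following information into an xls file?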

    TOTAL ENERGY = -444.38126 EV
    ELECTRONIC ENERGY = -840.31531 EV
    CORE-CORE REPULSION = 395.93406 EV
    GRADIENT NORM = 0.91931 = 0.45965 PER ATOM
    DIPOLE = 2.66600 DEBYE POINT GROUP: C2v
    NO. OF FILLED LEVELS = 6
    IONIZATION POTENTIAL = 10.352991 EV
    HOMO LUMO ENERGIES (EV) = -10.353 0.402
    MOLECULAR WEIGHT = 30.0262
    COSMO AREA = 60.70 SQUARE ANGSTROMS
    COSMO VOLUME = 42.52 CUBIC ANGSTROMS

I have read several posts saying that this can be done with

 sed -n ".." file.txt 

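The problem is that even this would take me a long time, because I would have to read one file at a time into bash and then search for each of these labels:

    HEAT OF FORMATION
    TOTAL ENERGY
    ELECTRONIC ENERGY
    CORE-CORE REPULSION
    GRADIENT NORM
    DIPOLE
    NO. OF FILLED LEVELS
    IONIZATION POTENTIAL
    HOMO LUMO ENERGIES (EV)
    MOLECULAR WEIGHT
    COSMO AREA
    COSMO VOLUME

Then I would paste the results into the xls file line by line, together with the value that belongs to each label. One complete file looks like this: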

    SUMMARY OF PM7 CALCULATION, Site No: 29451

    MOPAC2016 (Version: 18.063M)
    Tue Mar 20 15:08:13 2018
    No. of days remaining = 349

    Empirical Formula: C H2 O = 4 atoms

    SYMMETRY
    Formaldehyde

    GEOMETRY OPTIMISED USING EIGENVECTOR FOLLOWING (EF).
    SCF FIELD WAS ACHIEVED

    HEAT OF FORMATION = -25.54241 KCAL/MOL = -106.86944 KJ/MOL
    TOTAL ENERGY = -444.38126 EV
    ELECTRONIC ENERGY = -840.31531 EV
    CORE-CORE REPULSION = 395.93406 EV
    GRADIENT NORM = 0.91931 = 0.45965 PER ATOM
    DIPOLE = 2.66600 DEBYE POINT GROUP: C2v
    NO. OF FILLED LEVELS = 6
    IONIZATION POTENTIAL = 10.352991 EV
    HOMO LUMO ENERGIES (EV) = -10.353 0.402
    MOLECULAR WEIGHT = 30.0262
    COSMO AREA = 60.70 SQUARE ANGSTROMS
    COSMO VOLUME = 42.52 CUBIC ANGSTROMS

    MOLECULAR DIMENSIONS (Angstroms)

    Atom Atom Distance
    H 3 O 1 2.00299
    H 4 O 1 1.65067
    H 4 C 2 0.00000

    SCF CALCULATIONS = 4
    WALL-CLOCK TIME = 0.309 SECONDS
    COMPUTATION TIME = 0.033 SECONDS

    FINAL GEOMETRY OBTAINED
    SYMMETRY
    Formaldehyde

    O 0.00000000 +0 0.0000000 +0 0.0000000 +0 0 0 0
    C 1.20614565 +1 0.0000000 +0 0.0000000 +0 1 0 0
    H 1.09115836 +1 121.2760970 +1 0.0000000 +0 2 1 0
    H 1.09115836 +0 121.2760970 +0 180.0000000 +0 2 1 3

    3 1 4
    3 2 4

I want to export the data to a single csv file, with the values from each file on their own row, one row below the other, like this:

 data1 444.38126 EV -840.31531 EV 395.93406 EV 0.91931 = 0.45965 PER ATOM 2.66600 C2v 6 10.352991 -10.353 0.402 30.0262 60.70 42.52 

I know how to read a file line by line. Let's assume the output file is output.txt:

    line_num = 0
    text = File.open('output.txt').read
    text.gsub!(/\r\n?/, "\n")
    text.each_line do |line|
      print "#{line_num += 1} #{line}"
    end

So this reads the file line by line. Now I am trying to extract the information:

    line_num = 0
    text = File.open('output.txt').read
    text.gsub!(/\r\n?/, "\n")
    text.each_line do |line|
      # Print everything after the first "=" on the TOTAL ENERGY line.
      if line[/TOTAL ENERGY/]
        puts line.split("=", 2)[-1].strip
      end
      # For the remaining labels, toggle is assigned but not used yet.
      if line[/ELECTRONIC ENERGY/]
        toggle = 1
        next
      end
      if line[/CORE-CORE REPULSION/]
        toggle = 1
        next
      end
      if line[/GRADIENT NORM/]
        toggle = 1
        next
      end
      if line[/DIPOLE/]
        toggle = 1
        next
      end
      if line[/NO\. OF FILLED LEVELS/]
        toggle = 1
        next
      end
      if line[/IONIZATION POTENTIAL/]
        toggle = 1
        next
      end
      # The parentheses must be escaped to match "(EV)" literally.
      if line[/HOMO LUMO ENERGIES \(EV\)/]
        toggle = 1
        next
      end
      if line[/MOLECULAR WEIGHT/]
        toggle = 1
        next
      end
      if line[/COSMO AREA/]
        toggle = 1
        next
      end
      if line[/COSMO VOLUME/]
        toggle = 1
        next
      end
    end
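A table-driven alternative to the chain of `if` tests might look like the sketch below. It is only an illustration: `LABELS`, `extract_fields`, and `to_row` are hypothetical names, and it assumes every value of interest sits on a single line of the form `LABEL = value`, as in the sample file.

```ruby
# Labels to look for, in the order the columns should appear.
LABELS = [
  'HEAT OF FORMATION', 'TOTAL ENERGY', 'ELECTRONIC ENERGY',
  'CORE-CORE REPULSION', 'GRADIENT NORM', 'DIPOLE',
  'NO. OF FILLED LEVELS', 'IONIZATION POTENTIAL',
  'HOMO LUMO ENERGIES (EV)', 'MOLECULAR WEIGHT',
  'COSMO AREA', 'COSMO VOLUME'
].freeze

# Hypothetical helper: returns a Hash mapping each label found in `text`
# to everything after the first "=" on that line (units included).
def extract_fields(text)
  fields = {}
  text.gsub(/\r\n?/, "\n").each_line do |line|
    LABELS.each do |label|
      # Regexp.escape keeps "(" and "." in the labels literal.
      next unless line =~ /^\s*#{Regexp.escape(label)}\s*=/
      fields[label] = line.split('=', 2)[-1].strip
    end
  end
  fields
end

# Hypothetical helper: one output row per file, with "" for missing labels.
def to_row(name, fields)
  [name] + LABELS.map { |label| fields.fetch(label, '') }
end
```

Calling `to_row('data1', extract_fields(File.read('output.txt')))` would then give one array per file, which can be written out as one csv row.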

Does it have to be Ruby? How about reading the files with bash and then formatting the results in Excel?

For example:

    for filename in *.txt; do
      # Prefix each line with its file name, keep only "LABEL = value" lines.
      awk '{print FILENAME ":" $0}' "$filename" | grep '[A-Z]\{3,\}.*=' >> r.csv
    done

This creates the file r.csv, which you can open in Excel and format with the menu Data -> Text to Columns.

For example, you can use the character "=" as the column separator.
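If you would rather stay in Ruby, roughly the same idea can be sketched with the `csv` library, so that the split on "=" happens before Excel ever sees the file. This is only a sketch: `label_value_rows` and `write_rows` are hypothetical names, and the filter (three or more capital letters followed by an "=") is an assumption about which lines matter.

```ruby
require 'csv'

# Hypothetical helper: turns the text of one file into [file, label, value]
# rows, keeping only lines that look like "LABEL = value".
def label_value_rows(filename, text)
  rows = []
  text.each_line do |line|
    # Keep lines with three or more capital letters followed by an "=".
    next unless line =~ /[A-Z]{3,}.*=/
    # Split on the first "=" only, so "0.91931 = 0.45965 PER ATOM" stays whole.
    label, value = line.split('=', 2).map(&:strip)
    rows << [filename, label, value]
  end
  rows
end

# Hypothetical driver: collects the rows of every file matching `pattern`
# into the csv file at `out_path`.
def write_rows(pattern, out_path)
  CSV.open(out_path, 'w') do |csv|
    csv << %w[file label value]
    Dir.glob(pattern).sort.each do |path|
      label_value_rows(File.basename(path), File.read(path)).each { |row| csv << row }
    end
  end
end
```

Something like `write_rows('*.txt', 'r.csv')` would then produce a file with one label per row, which Excel should open in three columns without the Text to Columns step.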