Finding whether a sentence contains a specific phrase in Ruby

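Right now I check whether a sentence contains a specific word by splitting the sentence into an array and then calling include? to see whether the word is in it. Something like: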

"This is my awesome sentence.".split(" ").include?('awesome') 

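But I'd like to know the fastest way to do this with a phrase. For example, I might want to see whether the sentence "This is my awesome sentence." contains the phrase "my awesome sentence". I'm scraping sentences and comparing them against a large number of phrases, so speed is somewhat important.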

Here are some variations:

    require 'benchmark'

    lorem = ('Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut' # !> unused literal ignored
             'enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in' # !> unused literal ignored
             'reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident,' # !> unused literal ignored
             'sunt in culpa qui officia deserunt mollit anim id est laborum.' * 10) << ' foo'

    lorem.split.include?('foo') # => true
    lorem['foo']                # => "foo"
    lorem.include?('foo')       # => true
    lorem[/foo/]                # => "foo"
    lorem[/fo{2}/]              # => "foo"
    lorem[/foo$/]               # => "foo"
    lorem[/fo{2}$/]             # => "foo"
    lorem[/fo{2}\Z/]            # => "foo"
    /foo/.match(lorem)[-1]      # => "foo"
    /foo$/.match(lorem)[-1]     # => "foo"
    /foo/ =~ lorem              # => 621

    n = 500_000

    puts RUBY_VERSION
    puts "n=#{ n }"
    Benchmark.bm(25) do |x|
      x.report("array search:")             { n.times { lorem.split.include?('foo') } }
      x.report("literal search:")           { n.times { lorem['foo']                } }
      x.report("string include?:")          { n.times { lorem.include?('foo')       } }
      x.report("regex:")                    { n.times { lorem[/foo/]                } }
      x.report("wildcard regex:")           { n.times { lorem[/fo{2}/]              } }
      x.report("anchored regex:")           { n.times { lorem[/foo$/]               } }
      x.report("anchored wildcard regex:")  { n.times { lorem[/fo{2}$/]             } }
      x.report("anchored wildcard regex2:") { n.times { lorem[/fo{2}\Z/]            } }
      x.report("/regex/.match")             { n.times { /foo/.match(lorem)[-1]      } }
      x.report("/regex$/.match")            { n.times { /foo$/.match(lorem)[-1]     } }
      x.report("/regex/ =~")                { n.times { /foo/ =~ lorem              } }
      x.report("/regex$/ =~")               { n.times { /foo$/ =~ lorem             } }
      x.report("/regex\Z/ =~")              { n.times { /foo\Z/ =~ lorem            } }
    end

Results for Ruby 1.9.3:

    1.9.3
    n=500000
                                   user     system      total        real
    array search:             12.960000   0.010000  12.970000 ( 12.978311)
    literal search:            0.800000   0.000000   0.800000 (  0.807110)
    string include?:           0.760000   0.000000   0.760000 (  0.758918)
    regex:                     0.660000   0.000000   0.660000 (  0.657608)
    wildcard regex:            0.660000   0.000000   0.660000 (  0.660296)
    anchored regex:            0.660000   0.000000   0.660000 (  0.664025)
    anchored wildcard regex:   0.660000   0.000000   0.660000 (  0.664897)
    anchored wildcard regex2:  0.320000   0.000000   0.320000 (  0.328876)
    /regex/.match              1.430000   0.000000   1.430000 (  1.424602)
    /regex$/.match             1.430000   0.000000   1.430000 (  1.434538)
    /regex/ =~                 0.530000   0.000000   0.530000 (  0.538128)
    /regex$/ =~                0.540000   0.000000   0.540000 (  0.536318)
    /regexZ/ =~                0.210000   0.000000   0.210000 (  0.214547)

And for 1.8.7:

    1.8.7
    n=500000
                                   user     system      total        real
    array search:             21.250000   0.000000  21.250000 ( 21.296039)
    literal search:            0.660000   0.000000   0.660000 (  0.660102)
    string include?:           0.610000   0.000000   0.610000 (  0.612433)
    regex:                     0.950000   0.000000   0.950000 (  0.946308)
    wildcard regex:            2.840000   0.000000   2.840000 (  2.850198)
    anchored regex:            0.950000   0.000000   0.950000 (  0.951270)
    anchored wildcard regex:   2.870000   0.010000   2.880000 (  2.874209)
    anchored wildcard regex2:  2.870000   0.000000   2.870000 (  2.868291)
    /regex/.match              1.470000   0.000000   1.470000 (  1.479383)
    /regex$/.match             1.480000   0.000000   1.480000 (  1.498106)
    /regex/ =~                 0.680000   0.000000   0.680000 (  0.677444)
    /regex$/ =~                0.700000   0.000000   0.700000 (  0.704486)
    /regexZ/ =~                0.700000   0.000000   0.700000 (  0.701943)

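So, from the 1.9.3 results, a fixed-string search like 'foobar'['foo'] is slower than the regex 'foobar'[/foo/], which in turn is slower than the equivalent 'foobar' =~ /foo/. On 1.8.7 the ordering differs: the fixed-string search beats the regex slice there.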

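The OP's original solution suffers badly because it walks the string twice: once to split it into individual words, and a second time to iterate over the resulting array looking for the actual target word. Its performance degrades as the string grows.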
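To make the difference concrete, here is a minimal sketch of the two approaches side by side. The helper names `word_in_sentence?` and `phrase_in_sentence?` are made up for illustration; only `String#split`, `Array#include?` and `String#include?` are core Ruby. Note that the split-based check cannot find a multi-word phrase at all, because no single array element contains a space.

```ruby
# Hypothetical helper: the OP's approach. Builds an array of words,
# then scans that array, so the data is traversed twice.
def word_in_sentence?(sentence, word)
  sentence.split(" ").include?(word)
end

# Hypothetical helper: a single substring scan, which also works for phrases.
def phrase_in_sentence?(sentence, phrase)
  sentence.include?(phrase)
end

sentence = "This is my awesome sentence."

puts word_in_sentence?(sentence, "awesome")               # prints true
puts word_in_sentence?(sentence, "my awesome sentence")   # prints false
puts phrase_in_sentence?(sentence, "my awesome sentence") # prints true
```

One trade-off to be aware of: a substring check also matches inside words ("awe" is found in "awesome"), whereas the split-based check only matches whole space-separated tokens.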


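EDIT: One thing I find interesting about Ruby's performance is that an anchored regex is slightly slower than an unanchored one. In Perl, when I first ran this kind of benchmark several years ago, the opposite was true.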


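Here's an updated version using Fruity. The various expressions return different results. Any of them will do if you only want to know whether the target string exists. If you want to know whether the value is at the end of the string, as these tests check, or to get the position of the target, then some are definitely faster than others, so choose accordingly.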

    require 'fruity'

    TARGET_STR = (' ' * 100) + ' foo'

    TARGET_STR['foo']            # => "foo"
    TARGET_STR[/foo/]            # => "foo"
    TARGET_STR[/fo{2}/]          # => "foo"
    TARGET_STR[/foo$/]           # => "foo"
    TARGET_STR[/fo{2}$/]         # => "foo"
    TARGET_STR[/fo{2}\Z/]        # => "foo"
    TARGET_STR[/fo{2}\z/]        # => "foo"
    TARGET_STR[/foo\Z/]          # => "foo"
    TARGET_STR[/foo\z/]          # => "foo"
    /foo/.match(TARGET_STR)[-1]  # => "foo"
    /foo$/.match(TARGET_STR)[-1] # => "foo"
    /foo/ =~ TARGET_STR          # => 101
    /foo$/ =~ TARGET_STR         # => 101
    /foo\Z/ =~ TARGET_STR        # => 101
    TARGET_STR.include?('foo')   # => true
    TARGET_STR.index('foo')      # => 101
    TARGET_STR.rindex('foo')     # => 101

    puts RUBY_VERSION
    puts "TARGET_STR.length = #{ TARGET_STR.length }"

    puts
    puts 'compare fixed string vs. unanchored regex'
    compare do
      fixed_str        { TARGET_STR['foo'] }
      unanchored_regex { TARGET_STR[/foo/] }
    end

    puts
    puts 'compare /foo/ to /fo{2}/'
    compare do
      unanchored_regex  { TARGET_STR[/foo/]   }
      unanchored_regex2 { TARGET_STR[/fo{2}/] }
    end

    puts
    puts 'compare unanchored vs. anchored regex' # !> assigned but unused variable - delay
    compare do
      unanchored_regex      { TARGET_STR[/foo/]   }
      anchored_regex_dollar { TARGET_STR[/foo$/]  }
      anchored_regex_Z      { TARGET_STR[/foo\Z/] }
      anchored_regex_z      { TARGET_STR[/foo\z/] }
    end

    puts
    puts 'compare /foo/, match and =~'
    compare do
      unanchored_regex    { TARGET_STR[/foo/]           }
      unanchored_match    { /foo/.match(TARGET_STR)[-1] }
      unanchored_eq_match { /foo/ =~ TARGET_STR         }
    end

    puts
    puts 'compare fixed, unanchored, Z, include?, index and rindex'
    compare do
      fixed_str        { TARGET_STR['foo']          }
      unanchored_regex { TARGET_STR[/foo/]          }
      anchored_regex_Z { TARGET_STR[/foo\Z/]        }
      include_eh       { TARGET_STR.include?('foo') }
      _index           { TARGET_STR.index('foo')    }
      _rindex          { TARGET_STR.rindex('foo')   }
    end

The results are:

    # >> 2.2.3
    # >> TARGET_STR.length = 104
    # >>
    # >> compare fixed string vs. unanchored regex
    # >> Running each test 8192 times. Test will take about 1 second.
    # >> fixed_str is faster than unanchored_regex by 2x ± 0.1
    # >>
    # >> compare /foo/ to /fo{2}/
    # >> Running each test 8192 times. Test will take about 1 second.
    # >> unanchored_regex2 is similar to unanchored_regex
    # >>
    # >> compare unanchored vs. anchored regex
    # >> Running each test 8192 times. Test will take about 1 second.
    # >> anchored_regex_z is similar to anchored_regex_Z
    # >> anchored_regex_Z is faster than unanchored_regex by 19.999999999999996% ± 10.0%
    # >> unanchored_regex is similar to anchored_regex_dollar
    # >>
    # >> compare /foo/, match and =~
    # >> Running each test 8192 times. Test will take about 1 second.
    # >> unanchored_eq_match is faster than unanchored_regex by 2x ± 0.1 (results differ: 101 vs foo)
    # >> unanchored_regex is faster than unanchored_match by 3x ± 0.1
    # >>
    # >> compare fixed, unanchored, Z, include?, index and rindex
    # >> Running each test 32768 times. Test will take about 3 seconds.
    # >> _rindex is similar to include_eh (results differ: 101 vs true)
    # >> include_eh is faster than _index by 10.000000000000009% ± 10.0% (results differ: true vs 101)
    # >> _index is faster than fixed_str by 19.999999999999996% ± 10.0% (results differ: 101 vs foo)
    # >> fixed_str is faster than anchored_regex_Z by 39.99999999999999% ± 10.0%
    # >> anchored_regex_Z is similar to unanchored_regex
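The "results differ" notes in that output are a reminder that these calls are not interchangeable: some return a boolean, some a position, some the matched text. A small sketch that gathers them in one place may help when choosing. The method name `search_results` and its hash keys are made up for illustration, and `String#end_with?` is my addition (it does not appear in the benchmark above) as a regex-free way to test the end of a string.

```ruby
# Hypothetical helper: run several core search calls and collect what each returns.
def search_results(haystack, needle)
  pattern = /#{Regexp.escape(needle)}/   # escape so the needle is matched literally
  {
    include:  haystack.include?(needle),   # true or false
    index:    haystack.index(needle),      # Integer position or nil
    slice:    haystack[needle],            # the substring or nil
    regex:    haystack[pattern],           # the matched text or nil
    match_op: (pattern =~ haystack),       # Integer position or nil
    at_end:   haystack.end_with?(needle)   # true or false
  }
end

target = (' ' * 100) + ' foo'

search_results(target, 'foo').each do |name, value|
  puts "#{name}: #{value.inspect}"
end
# prints:
# include: true
# index: 101
# slice: "foo"
# regex: "foo"
# match_op: 101
# at_end: true
```

If all you need is a yes/no answer, any of these works in a conditional, since `nil` and `false` are the only falsy values in Ruby.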

Changing the size of the string reveals a lot worth knowing.

Changing to 1,000 characters:

    # >> 2.2.3
    # >> TARGET_STR.length = 1004
    # >>
    # >> compare fixed string vs. unanchored regex
    # >> Running each test 4096 times. Test will take about 1 second.
    # >> fixed_str is faster than unanchored_regex by 50.0% ± 10.0%
    # >>
    # >> compare /foo/ to /fo{2}/
    # >> Running each test 2048 times. Test will take about 1 second.
    # >> unanchored_regex2 is similar to unanchored_regex
    # >>
    # >> compare unanchored vs. anchored regex
    # >> Running each test 8192 times. Test will take about 1 second.
    # >> anchored_regex_z is faster than anchored_regex_Z by 10.000000000000009% ± 10.0%
    # >> anchored_regex_Z is faster than unanchored_regex by 3x ± 0.1
    # >> unanchored_regex is similar to anchored_regex_dollar
    # >>
    # >> compare /foo/, match and =~
    # >> Running each test 4096 times. Test will take about 1 second.
    # >> unanchored_eq_match is similar to unanchored_regex (results differ: 1001 vs foo)
    # >> unanchored_regex is faster than unanchored_match by 2x ± 0.1
    # >>
    # >> compare fixed, unanchored, Z, include?, index and rindex
    # >> Running each test 32768 times. Test will take about 4 seconds.
    # >> _rindex is faster than anchored_regex_Z by 2x ± 1.0 (results differ: 1001 vs foo)
    # >> anchored_regex_Z is faster than include_eh by 2x ± 0.1 (results differ: foo vs true)
    # >> include_eh is faster than fixed_str by 10.000000000000009% ± 10.0% (results differ: true vs foo)
    # >> fixed_str is similar to _index (results differ: foo vs 1001)
    # >> _index is similar to unanchored_regex (results differ: 1001 vs foo)

Bumping it up to 10,000:

    # >> 2.2.3
    # >> TARGET_STR.length = 10004
    # >>
    # >> compare fixed string vs. unanchored regex
    # >> Running each test 512 times. Test will take about 1 second.
    # >> fixed_str is faster than unanchored_regex by 39.99999999999999% ± 10.0%
    # >>
    # >> compare /foo/ to /fo{2}/
    # >> Running each test 256 times. Test will take about 1 second.
    # >> unanchored_regex2 is similar to unanchored_regex
    # >>
    # >> compare unanchored vs. anchored regex
    # >> Running each test 8192 times. Test will take about 3 seconds.
    # >> anchored_regex_z is similar to anchored_regex_Z
    # >> anchored_regex_Z is faster than unanchored_regex by 21x ± 1.0
    # >> unanchored_regex is similar to anchored_regex_dollar
    # >>
    # >> compare /foo/, match and =~
    # >> Running each test 256 times. Test will take about 1 second.
    # >> unanchored_eq_match is similar to unanchored_regex (results differ: 10001 vs foo)
    # >> unanchored_regex is faster than unanchored_match by 10.000000000000009% ± 10.0%
    # >>
    # >> compare fixed, unanchored, Z, include?, index and rindex
    # >> Running each test 32768 times. Test will take about 18 seconds.
    # >> _rindex is faster than anchored_regex_Z by 2x ± 0.1 (results differ: 10001 vs foo)
    # >> anchored_regex_Z is faster than include_eh by 15x ± 1.0 (results differ: foo vs true)
    # >> include_eh is similar to _index (results differ: true vs 10001)
    # >> _index is similar to fixed_str (results differ: 10001 vs foo)
    # >> fixed_str is faster than unanchored_regex by 39.99999999999999% ± 10.0%
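Before swapping `$` for `\Z` or `\z` purely for speed, it is worth checking that they mean the same thing for your data. They agree on the benchmark string, which has no newlines, but not in general. The following sketch only illustrates the matching semantics; the constant names are made up, and it makes no claim about why the timings above differ.

```ruby
# Illustrative inputs (hypothetical names): one ends with a newline,
# the other has more text after the newline.
WITH_NEWLINE = "foo\n"
MULTI_LINE   = "foo\nbar"

# $ matches at the end of any line.
p WITH_NEWLINE[/foo$/]   # => "foo"
p MULTI_LINE[/foo$/]     # => "foo"

# \Z matches at the end of the string, or just before a final newline.
p WITH_NEWLINE[/foo\Z/]  # => "foo"
p MULTI_LINE[/foo\Z/]    # => nil

# \z matches only at the very end of the string.
p WITH_NEWLINE[/foo\z/]  # => nil
p MULTI_LINE[/foo\z/]    # => nil
```

So for scraped text that may contain embedded or trailing newlines, the three anchors can give different answers, and the fastest one is only useful if it is also the right one.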

You can easily check whether a string contains another string using square brackets, like so:

    irb(main):084:0> "This is my awesome sentence."["my awesome sentence"]
    => "my awesome sentence"
    irb(main):085:0> "This is my awesome sentence."["cookies for breakfast?"]
    => nil

It returns the substring if it finds it, and nil otherwise. It should be very fast.
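Since the question is about checking one sentence against a large number of phrases, here is a rough sketch of how that might look. The method name `matching_phrases` and the sample phrase list are made up for illustration. `Regexp.union` is a core method that builds one alternation from a list of strings, escaping each of them; whether it is faster than looping over the phrases is something to benchmark on your own data rather than assume.

```ruby
# Hypothetical helper: return every phrase that occurs in the sentence.
def matching_phrases(sentence, phrases)
  phrases.select { |phrase| sentence.include?(phrase) }
end

sentence = "This is my awesome sentence."
phrases  = ["my awesome sentence", "cookies for breakfast?", "This is"]

p matching_phrases(sentence, phrases)
# => ["my awesome sentence", "This is"]

# Alternative: a single regex that matches any of the phrases.
# Regexp.union escapes the strings, so "?" is treated literally.
any_phrase = Regexp.union(phrases)
p sentence[any_phrase]   # => "This is" (the leftmost match in the sentence)
```

The first form tells you which phrases matched; the second only tells you that at least one did, and returns the first one found.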

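This is a non-answer, showing the benchmark of @TheTinMan's code on Ruby 1.9.2 on OS X. Note the differences in relative performance, especially the improvement in the second and third tests.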

                                   user     system      total        real
    array search:              7.960000   0.000000   7.960000 (  7.962338)
    literal search:            0.450000   0.010000   0.460000 (  0.445905)
    string include?:           0.400000   0.000000   0.400000 (  0.400932)
    regex:                     0.510000   0.000000   0.510000 (  0.512635)
    wildcard regex:            0.520000   0.000000   0.520000 (  0.514800)
    anchored regex:            0.510000   0.000000   0.510000 (  0.513328)
    anchored wildcard regex:   0.520000   0.000000   0.520000 (  0.517759)
    /regex/.match              0.940000   0.000000   0.940000 (  0.943471)
    /regex$/.match             0.940000   0.000000   0.940000 (  0.936782)
    /regex/ =~                 0.440000   0.000000   0.440000 (  0.446921)
    /regex$/ =~                0.450000   0.000000   0.450000 (  0.447904)

I ran these with Benchmark.bmbm, but there was no difference between the rehearsal round and the actual timings shown above.
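For anyone who has not used it, `Benchmark.bmbm` runs every report twice, a rehearsal pass followed by the real pass, to reduce the effect of warm-up and garbage collection on the numbers. A cut-down sketch with just two of the tests might look like this. The method name `compare_searches`, the shortened text and the small iteration count are my own choices for illustration, not the settings used for the table above.

```ruby
require 'benchmark'

# Hypothetical helper: time two of the search styles with a rehearsal pass.
# Returns the array of Benchmark::Tms objects from the real pass.
def compare_searches(text, n)
  Benchmark.bmbm do |x|
    x.report("string include?:") { n.times { text.include?('foo') } }
    x.report("/regex/ =~")       { n.times { /foo/ =~ text } }
  end
end

text = ('sunt in culpa qui officia deserunt mollit anim id est laborum.' * 10) << ' foo'

# Prints a "Rehearsal" table followed by the final timings.
results = compare_searches(text, 10_000)
puts results.size   # prints 2, one entry per report
```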

If you're not familiar with regular expressions, I believe they can solve your problem here:

http://www.regular-expressions.info/ruby.html

Basically you create a regular expression object that looks for "awesome" (most likely case-insensitively), and then you can do

 /regex/.match(string) 

which returns the match data. If you want the index at which the match starts, you can do the following:

    match = "This is my awesome sentence." =~ /awesome/
    puts match   #This will return the index of the first letter, so the first a in awesome
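The case-insensitive variant mentioned above could be sketched like this. The method name `find_phrase` is made up for illustration; `Regexp.escape`, `Regexp.new` and the `Regexp::IGNORECASE` option are core Ruby. Escaping matters once the phrase comes from data rather than from a literal, since characters like "." or "?" would otherwise be treated as regex syntax.

```ruby
# Hypothetical helper: case-insensitive search for a literal phrase.
# Returns the index where the phrase starts, or nil if it is absent.
def find_phrase(sentence, phrase)
  pattern = Regexp.new(Regexp.escape(phrase), Regexp::IGNORECASE)
  sentence =~ pattern
end

sentence = "This is my AWESOME sentence."

match = /awesome/i.match(sentence)
p match[0]         # => "AWESOME" (the text that actually matched)
p match.begin(0)   # => 11 (where the match starts)

p find_phrase(sentence, "my awesome sentence.")   # => 8
p find_phrase(sentence, "cookies for breakfast?") # => nil
```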

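I'd read through that article for more details, as it explains this better than I would. If you don't want to understand it in depth and just want to jump in and use it, I'd recommend this: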

http://www.ruby-doc.org/core/classes/Regexp.html