如何检测字符串内相同的部分?

我试图将解码算法想要的问题分解成更小的问题。 这是第一部分。

题:

  • 两个字符串:s1和s2
  • s1的一部分与s2的一部分相同
  • 空间是分隔符
  • 如何提取相同的部分?

例1:

s1 = "12 November 2010 - 1 visitor" s2 = "6 July 2010 - 100 visitors" the identical parts are "2010", "-", "1" and "visitor" 

例2:

 s1 = "Welcome, John!" s2 = "Welcome, Peter!" the identical parts are "Welcome," and "!" 

例3 :(澄清“!”示例)

 s1 = "Welcome, Sam!" s2 = "Welcome, Tom!" the identical parts are "Welcome," and "m!" 

Python和Ruby首选。 谢谢

编辑:更新了此示例以使用所有示例,包括#1:

 def scan(s1, s2): # Find the longest match where s1 starts with s2 # Returns None if no matches l = len(s1) while 1: if not l: return None elif s1[:l] == s2[:l]: return s1[:l] else: l -= 1 def contains(s1, s2): D = {} # Remove duplicates using a dict L1 = s1.split(' ') L2 = s2.split(' ') # Don't add results which have already # been processed to satisfy example #1! DProcessed = {} for x in L1: yy = 0 for y in L2: if yy in DProcessed: yy += 1 continue # Scan from the start to the end of the words a = scan(x, y) if a: DProcessed[yy] = None D[a] = None break # Scan from the end to the start of the words a = scan(x[::-1], y[::-1]) if a: DProcessed[yy] = None D[a[::-1]] = None break yy += 1 return list(D.keys()) print contains("12 November 2010 - 1 visitor", "6 July 2010 - 100 visitors") print contains("Welcome, John!", "Welcome, Peter!") print contains("Welcome, Sam!", "Welcome, Tom!") 

哪个输出:

 ['1', 'visitor', '-', '2010'] ['Welcome,', '!'] ['Welcome,', 'm!'] 

例如1

 >>> s1 = 'November 2010 - 1 visitor' >>> s2 = '6 July 2010 - 100 visitors' >>> >>> [i for i in s1.split() if any(j for j in s2.split() if i in j)] ['2010', '-', '1', 'visitor'] >>> 

对彼此而言

 >>> s1 = "Welcome, John!" >>> s2 = "Welcome, Peter!" >>> [i for i in s1.replace('!',' !').split() if any(j for j in s2.replace('!',' !').split() if i in j)] ['Welcome,', '!'] >>> 

注意 :以上代码不适用于示例3,即刚刚添加的OP

 s1 = "12 November 2010 - 1 visitor" s2 = "6 July 2010 - 100 visitors" l1 = s1.split() for item in l1: if item in s2: print item 

这在空白上分裂。

同样在单词边界上拆分的解决方案(为了捕获示例2中的! )在Python中不起作用,因为re.split()不会在零长度匹配上拆分。

第三个例子,即使任何字符串的子字符串都是潜在的匹配,也会因为许多可能的变化而使事情变得复杂得多(对于1234 ,我必须检查123434 ,并且对于每个额外数字,排列的数量呈指数增加。

完整的Ruby解决方案:

 def start_similar(i, j) front = '' for ix in (0...([i.size, j.size].min)) if i[ix] == j[ix] then front += i[ix].chr else break end end return front end def back_similar(i, j) back = '' for ix in (0...([i.size, j.size].min)).to_a.reverse dif = i.size < j.size ? j.size - i.size : i.size - j.size ci = i[ i.size < j.size ? ix : ix + dif ] cj = j[ i.size > j.size ? ix : ix + dif ] if ci == cj then back = ci.chr + back else break end end return back end def scan(x, y) a, b = x.split(' '), y.split(' ') result = [] for i in a do for j in b do result << start_similar(i, j) result << back_similar(i, j) end end return result.uniq.select do |r| not r.empty? end end puts scan( "12 November 2010 - 1 visitor", "6 July 2010 - 100 visitors" ).inspect # ["1", "2010", "0", "-", "visitor"] puts scan( "Welcome, John!", "Welcome, Peter!" ).inspect # ["Welcome,", "!"] puts scan( "Welcome, Sam!", "Welcome, Tom!" ).inspect # ["Welcome,", "m!"]