如何检测字符串内相同的部分？

我试图将解码算法想要的问题分解成更小的问题。这是第一部分。

题：

两个字符串：s1和s2
s1的一部分与s2的一部分相同
空间是分隔符
如何提取相同的部分？

例1：

s1 = "12 November 2010 - 1 visitor" s2 = "6 July 2010 - 100 visitors" the identical parts are "2010", "-", "1" and "visitor"

例2：

 s1 = "Welcome, John!" s2 = "Welcome, Peter!" the identical parts are "Welcome," and "!"

例3 :(澄清“！”示例）

 s1 = "Welcome, Sam!" s2 = "Welcome, Tom!" the identical parts are "Welcome," and "m!"

Python和Ruby首选。谢谢

编辑：更新了此示例以使用所有示例，包括＃1：

 def scan(s1, s2): # Find the longest match where s1 starts with s2 # Returns None if no matches l = len(s1) while 1: if not l: return None elif s1[:l] == s2[:l]: return s1[:l] else: l -= 1 def contains(s1, s2): D = {} # Remove duplicates using a dict L1 = s1.split(' ') L2 = s2.split(' ') # Don't add results which have already # been processed to satisfy example #1! DProcessed = {} for x in L1: yy = 0 for y in L2: if yy in DProcessed: yy += 1 continue # Scan from the start to the end of the words a = scan(x, y) if a: DProcessed[yy] = None D[a] = None break # Scan from the end to the start of the words a = scan(x[::-1], y[::-1]) if a: DProcessed[yy] = None D[a[::-1]] = None break yy += 1 return list(D.keys()) print contains("12 November 2010 - 1 visitor", "6 July 2010 - 100 visitors") print contains("Welcome, John!", "Welcome, Peter!") print contains("Welcome, Sam!", "Welcome, Tom!")

哪个输出：

 ['1', 'visitor', '-', '2010'] ['Welcome,', '!'] ['Welcome,', 'm!']

例如1

 >>> s1 = 'November 2010 - 1 visitor' >>> s2 = '6 July 2010 - 100 visitors' >>> >>> [i for i in s1.split() if any(j for j in s2.split() if i in j)] ['2010', '-', '1', 'visitor'] >>>

对彼此而言

 >>> s1 = "Welcome, John!" >>> s2 = "Welcome, Peter!" >>> [i for i in s1.replace('!',' !').split() if any(j for j in s2.replace('!',' !').split() if i in j)] ['Welcome,', '!'] >>>

注意：以上代码不适用于示例3，即刚刚添加的OP

 s1 = "12 November 2010 - 1 visitor" s2 = "6 July 2010 - 100 visitors" l1 = s1.split() for item in l1: if item in s2: print item

这在空白上分裂。

同样在单词边界上拆分的解决方案（为了捕获示例2中的! ）在Python中不起作用，因为re.split()不会在零长度匹配上拆分。

第三个例子，即使任何字符串的子字符串都是潜在的匹配，也会因为许多可能的变化而使事情变得复杂得多（对于1234 ，我必须检查1234 ， 3和4 ，并且对于每个额外数字，排列的数量呈指数增加。

完整的Ruby解决方案：

 def start_similar(i, j) front = '' for ix in (0...([i.size, j.size].min)) if i[ix] == j[ix] then front += i[ix].chr else break end end return front end def back_similar(i, j) back = '' for ix in (0...([i.size, j.size].min)).to_a.reverse dif = i.size < j.size ? j.size - i.size : i.size - j.size ci = i[ i.size < j.size ? ix : ix + dif ] cj = j[ i.size > j.size ? ix : ix + dif ] if ci == cj then back = ci.chr + back else break end end return back end def scan(x, y) a, b = x.split(' '), y.split(' ') result = [] for i in a do for j in b do result << start_similar(i, j) result << back_similar(i, j) end end return result.uniq.select do |r| not r.empty? end end puts scan( "12 November 2010 - 1 visitor", "6 July 2010 - 100 visitors" ).inspect # ["1", "2010", "0", "-", "visitor"] puts scan( "Welcome, John!", "Welcome, Peter!" ).inspect # ["Welcome,", "!"] puts scan( "Welcome, Sam!", "Welcome, Tom!" ).inspect # ["Welcome,", "m!"]

如何检测字符串内相同的部分？

在Web浏览器中实现聊天系统

使用ruby或python在文件中查找

为什么在OS X上的Ruby（而不是Python）中的分叉进程中`Time.utc`会变慢？

为什么一些正则表达式引擎在单个输入字符串中匹配。*两次？

解决Python的共享服务器安全问题

heroku运行rake db：migrate – / bin / sh：1：rake：not found

将URL输出的JSON保存到文件中

语法推理库？

哪种脚本语言更适合笔测试？

将Hex Chars转换为比特串（Python或Ruby）