2013年12月25日 星期三

多重分隔號字串切割之方法暨效能比較 ( Compare performance of string splitting with multiple delimiters )

在Python中,str內建split()函式(method)做字串分割用
strT = "a1,a2,a3"
strT.split(',')

#output => ['a1', 'a2', 'a3']
單個分隔符號可以使用split()
本文探討多個分隔符號(例如:"a1,a2;a3-a4#a5")情況下的字串分割方法並比較其效能:
目標:
input: "a1,a2;a3-a4#a5"
output: ["a1", "a2", "a3", "a4", "a5"]

1. 使用re模組(module):
    import re
    strT = "a1,a2;a3-a4#a5"
    re.split(",|;|-|#", strT)
不過使用re模組有一個地方需注意,若分隔符號裡含有re的特殊字元時使用此法會出錯
例如:
import re
strDot = "a1,a2;a3-a4.a5"
re.split(",|;|-|.", strDot)
'.'是re模組的特殊字元
此時會輸出 ['', '', '', '', '', '', '', '', '', '', '', '', '', '', ''] 的非預期結果。
再看另一個例子,
import re
strQM = "a1,a2;a3?a4#a5"
re.split(",|;|?|#", strQM)
'?'是re模組的特殊字元
這次Python丟出"......sre_constants.error: nothing to repeat" 的例外(exception)訊息
至於會輸出非預期的切割結果或拋出例外訊息可能得研究re模組的原始碼
若欲知哪些是re的特殊字元,請使用Python的help功能查詢
在Python的Shell視窗(執行IDLE)輸入
>>>import re
>>>help(re)
2. 多重迴圈:
此段程式碼參考網友river0830的分享而作
    strT = "a1,a2;a3-a4#a5"
    tupDelimiter = (',', ';', '-', '#')
    stack = [strT,]
    for subDelimiter in tupDelimiter:
        for i, subString in enumerate(stack):
            subStack = subString.split(subDelimiter)
            stack.pop(i)
            for j, _subString in enumerate(subStack):
                stack.insert(i+j, _subString)
3. 遞迴:
借用多重迴圈方法的概念,改寫為遞迴的形式
def strSplit_Recursion(listSource, listDelimiters):
    listRebuild = listSource
    if (len(listDelimiters) == 0):
        print listRebuild
    else:
        for i, strItem in enumerate(listSource):
            listSubStack = strItem.split(listDelimiters[-1])
            listRebuild.pop(i)
            for j, _strItem in enumerate(listSubStack):
                listRebuild.insert(i+j, _strItem)
    
        listDelimiters.pop()
        strSplit_Recursion(listRebuild, listDelimiters)

if __name__ == '__main__':
    strSplit_Recursion([r"a1,a2;a3-a4#a5"], ['-', ',', ';', '#'])
4. 取代:
先將字串內全部的分隔符號都置換成同一個,之後便可使用split()將字串分割
    strT = "a1,a2;a3-a4#a5"
    tupDelimiter = (',', ';', '-', '#')
    for charDelimiter in tupDelimiter:
        strT = strT.replace(charDelimiter, ';')
    strT = strT.split(';')
寫隻小程式比較四者的執行時間:
from __future__ import print_function
import re

def strSplit_re(strSource, delimiters):
    re.split(delimiters, strSource)

def strSplit_Iteration(strSource, tupDelimiter):
    stack = [strSource,]
    for subDelimiter in tupDelimiter:
        for i, subString in enumerate(stack):
            subStack = subString.split(subDelimiter)
            stack.pop(i)
            for j, _subString in enumerate(subStack):
                stack.insert(i+j, _subString)

def strSplit_Recursion(listSource, listDelimiters):
    listRebuild = listSource
    if (len(listDelimiters) == 0):
        pass
    else:
        for i, strItem in enumerate(listSource):
            listSubStack = strItem.split(listDelimiters[-1])
            listRebuild.pop(i)
            for j, _strItem in enumerate(listSubStack):
                listRebuild.insert(i+j, _strItem)
    
        listDelimiters.pop()
        strSplit_Recursion(listRebuild, listDelimiters)
        
def strSplit_Replace(strSource, tupDelimiter):
    for subDelimiter in tupDelimiter:
        strSource = strSource.replace(subDelimiter, ';')
    strSource = strSource.split(";")


def f1():
    strSplit_re(r"a1,a2;a3-a4#a5", ",|;|-|#")
    
def f2():
    strSplit_Iteration(r"a1,a2;a3-a4#a5", (',', ';', '-', '#'))

def f3():
    strSplit_Recursion([r"a1,a2;a3-a4#a5"], ['-', ',', ';', '#'])

def f4():
    strSplit_Replace(r"a1,a2;a3-a4#a5", (',', ';', '-', '#'))

if __name__ == '__main__':
    import timeit
    t1 = timeit.timeit("f1()", setup="from __main__ import f1")
    t2 = timeit.timeit("f2()", setup="from __main__ import f2")
    t3 = timeit.timeit("f3()", setup="from __main__ import f3")
    t4 = timeit.timeit("f4()", setup="from __main__ import f4")
    print ("Time of strSplit_re(): ", t1)
    print ("Time of strSplit_Iteration(): ", t2)
    print ("Time of strSplit_Recursion(): ", t3)
    print ("Time of strSplit_Replace(): ", t4)
註解:
1. 匯入"from __future__ import print_function"令Python2.7得以使用Python3的print method
2. timeit是Python裡用來計算執行時間的模塊(module),可用來計算function的執行時間,也可用來計算一小段代碼的執行時間,預設計算執行1,000,000(1百萬)次的所需時間。有興趣的讀者可自行參考Python官方文件,裡面可以找到timeit.timeit的用法

環境:
作業系統:Windows 7
Python編譯器版本:2.7.5

結果:
作法1. 藉由re模組的方法:2.03140......sec
作法2. 多重迴圈拆解法:15.70410......sec
作法3. 遞迴法:17.69398......sec
作法4. 全置換成單一分隔符號法:1.47379......sec

English Version:
According to Python tutorial, it told us that str.split() method could split string with 1 delimiter. For example,
strT = "a1,a2,a3"
strT.split(',')

#output => ['a1', 'a2', 'a3']
But how about string with multiple delimiters ? How to split string with 2 or more delimiters ?
For example, someone would like to achieve the target as below:
input: "a1,a2;a3-a4#a5"
output: ["a1", "a2", "a3", "a4", "a5"]
We will discuss 4 approaches to split string with multiple delimiters and compare the execution time among them.

1. Using "re" module:
    import re
    strT = "a1,a2;a3-a4#a5"
    re.split(",|;|-|#", strT)
However, you have to avoid the special character of the re module, otherwise it will output unexpected result or exception message. There are 2 examples as below:
import re
strDot = "a1,a2;a3-a4.a5"
re.split(",|;|-|.", strDot)
Dot character('.') is one of the special character of the re module.
This code snippets will output unexpected result ['', '', '', '', '', '', '', '', '', '', '', '', '', '', ''].
Another example,
import re
strQM = "a1,a2;a3?a4#a5"
re.split(",|;|?|#", strQM)
Question mark('?') is also one of the special character of the re module.
It shows "......sre_constants.error: nothing to repeat" exception message when delimiters include question mark.
It may needs to research source code of the re module for the reason.
But you can use "help" function of Python to get special characters of the re module.
If you would like to get the usage of re module, you can execute IDLE to open Python Shell Windows, and key in the following command sequentially:
>>>import re
>>>help(re)
2. Multilayer for-loop:
This code snippets is refer to another related article from netizen river0830.
    strT = "a1,a2;a3-a4#a5"
    tupDelimiter = (',', ';', '-', '#')
    stack = [strT,]
    for subDelimiter in tupDelimiter:
        for i, subString in enumerate(stack):
            subStack = subString.split(subDelimiter)
            stack.pop(i)
            for j, _subString in enumerate(subStack):
                stack.insert(i+j, _subString)
3. Recursion:
Draw on the concept of multilayer for-loop, rewrite to recursion approach.
def strSplit_Recursion(listSource, listDelimiters):
    listRebuild = listSource
    if (len(listDelimiters) == 0):
        print listRebuild
    else:
        for i, strItem in enumerate(listSource):
            listSubStack = strItem.split(listDelimiters[-1])
            listRebuild.pop(i)
            for j, _strItem in enumerate(listSubStack):
                listRebuild.insert(i+j, _strItem)
    
        listDelimiters.pop()
        strSplit_Recursion(listRebuild, listDelimiters)

if __name__ == '__main__':
    strSplit_Recursion([r"a1,a2;a3-a4#a5"], ['-', ',', ';', '#'])
4. Replacement:
Replace all delimiters to the same, and you can use split() to split string.
    strT = "a1,a2;a3-a4#a5"
    tupDelimiter = (',', ';', '-', '#')
    for charDelimiter in tupDelimiter:
        strT = strT.replace(charDelimiter, ';')
    strT = strT.split(';')
OK! Let's compare the execution time among them via the following program.
from __future__ import print_function
import re

def strSplit_re(strSource, delimiters):
    re.split(delimiters, strSource)

def strSplit_Iteration(strSource, tupDelimiter):
    stack = [strSource,]
    for subDelimiter in tupDelimiter:
        for i, subString in enumerate(stack):
            subStack = subString.split(subDelimiter)
            stack.pop(i)
            for j, _subString in enumerate(subStack):
                stack.insert(i+j, _subString)

def strSplit_Recursion(listSource, listDelimiters):
    listRebuild = listSource
    if (len(listDelimiters) == 0):
        pass
    else:
        for i, strItem in enumerate(listSource):
            listSubStack = strItem.split(listDelimiters[-1])
            listRebuild.pop(i)
            for j, _strItem in enumerate(listSubStack):
                listRebuild.insert(i+j, _strItem)
    
        listDelimiters.pop()
        strSplit_Recursion(listRebuild, listDelimiters)
        
def strSplit_Replace(strSource, tupDelimiter):
    for subDelimiter in tupDelimiter:
        strSource = strSource.replace(subDelimiter, ';')
    strSource = strSource.split(";")


def f1():
    strSplit_re(r"a1,a2;a3-a4#a5", ",|;|-|#")
    
def f2():
    strSplit_Iteration(r"a1,a2;a3-a4#a5", (',', ';', '-', '#'))

def f3():
    strSplit_Recursion([r"a1,a2;a3-a4#a5"], ['-', ',', ';', '#'])

def f4():
    strSplit_Replace(r"a1,a2;a3-a4#a5", (',', ';', '-', '#'))

if __name__ == '__main__':
    import timeit
    t1 = timeit.timeit("f1()", setup="from __main__ import f1")
    t2 = timeit.timeit("f2()", setup="from __main__ import f2")
    t3 = timeit.timeit("f3()", setup="from __main__ import f3")
    t4 = timeit.timeit("f4()", setup="from __main__ import f4")
    print ("Time of strSplit_re(): ", t1)
    print ("Time of strSplit_Iteration(): ", t2)
    print ("Time of strSplit_Recursion(): ", t3)
    print ("Time of strSplit_Replace(): ", t4)
Remark:
1. Invoke "from __future__ import print_function" on the Python2.7 if someone would like to use print function just like Python3.
2. "timeit" is a module of Python which can measure the execution time of the code snippets with 1,000,000(1 million) times repetitions as default. Please refer to Python Documentation for timeit usage. There is timeit.timeit on the inside.

Execute environment:
OS: Windows 7
Python version: 2.7.5

List the execution time with 1 million repetitions:
1. Using "re" module: 2.03140......sec
2. Multilayer for-loop: 15.70410......sec
3. Recursion: 17.69398......sec
4. Replacement: 1.47379......sec

沒有留言:

張貼留言