Is there a way to combine multiple re.sub operations into one to make it faster in Python?
I have a dataframe column whose values look like the input below.
Input = '{1:A06YCASDB2LXXXXX000000}{2:A303TYDBTM2AXXD}{3:{108:23158}}{4:\r\n:20:APS0182405\r\n:23B:DRED\r\n:32A:182349USD3280,00\r\n:33B:USD31280,00\r\n:52M:/73240222\r\nRAWR UK Ltd\r\n28 School Road\r\nfast\r\nCo. Angrid\r\n:57A:TETRIS\r\n:59:/BU500023231012000066241\r\nDUMMYNAME DUMMYLASTNAME\r\nPLACE/REST\r\n:70:PA74536/39\r\n:71A:OUR\r\n-}'
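I have developed a chain regex method that applies multiple re.sub operations:
import re

def chainRegex(string):
    # Replace field tags such as ":20:" or ":32A:" and CRLF line endings with spaces
    string = re.sub(r":\d{2}[A-Z]?:", " ", string)
    string = re.sub("\r\n", " ", string)
    # Strip every non-letter character from each whitespace-separated token
    string = [re.sub("([^a-zA-Z ]+?)", "", i) for i in string.split()]
    # Drop the tokens that became empty
    string = list(filter(None, string))
    return string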
The expected output is the list below.
output = ['AYCASDBLXXXXXATYDBTMAXXD', 'APS', 'DRED', 'USD', 'USD', 'RAWR', 'UK', 'Ltd', 'School', 'Road', 'fast', 'Co', 'Angrid', 'TETRIS', 'BU', 'DUMMYNAME', 'DUMMYLASTNAME', 'PLACEREST', 'PA', 'OUR']
Is there a way to combine these multiple re.sub operations into one to make it faster, or is there a faster alternative? A parsing approach won't work because the structure of the string is sometimes corrupted (missing {} or keys).
Solution 1:[1]
You can use:
import re

def chainRegex(string):
    x = re.sub(r"(?::\d{2}[A-Z]?:|\r\n)+", " ", string).split()
    return [w for w in ["".join(c for c in i if c.isalpha()) for i in x] if w != ""]
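As a rough check, the two versions can be run against the question's sample and compared with the expected list. This is only a sketch: the names `chain_regex_original`, `chain_regex_combined`, `SAMPLE` and `EXPECTED` are made up for illustration, and the sample is written with a CRLF before `PLACE/REST`, which is what the expected output implies. The timing loop prints whatever your machine measures; no particular speedup is claimed here.

```python
import re
import timeit

# Sample from the question, written with real CRLF escapes.
# Assumption: there is a CRLF before "PLACE/REST", as the expected output implies.
SAMPLE = (
    "{1:A06YCASDB2LXXXXX000000}{2:A303TYDBTM2AXXD}{3:{108:23158}}{4:\r\n"
    ":20:APS0182405\r\n:23B:DRED\r\n:32A:182349USD3280,00\r\n"
    ":33B:USD31280,00\r\n:52M:/73240222\r\nRAWR UK Ltd\r\n"
    "28 School Road\r\nfast\r\nCo. Angrid\r\n:57A:TETRIS\r\n"
    ":59:/BU500023231012000066241\r\nDUMMYNAME DUMMYLASTNAME\r\n"
    "PLACE/REST\r\n:70:PA74536/39\r\n:71A:OUR\r\n-}"
)

EXPECTED = [
    "AYCASDBLXXXXXATYDBTMAXXD", "APS", "DRED", "USD", "USD", "RAWR", "UK",
    "Ltd", "School", "Road", "fast", "Co", "Angrid", "TETRIS", "BU",
    "DUMMYNAME", "DUMMYLASTNAME", "PLACEREST", "PA", "OUR",
]


def chain_regex_original(string):
    """The multi-step version from the question."""
    string = re.sub(r":\d{2}[A-Z]?:", " ", string)
    string = re.sub("\r\n", " ", string)
    words = [re.sub("([^a-zA-Z ]+?)", "", w) for w in string.split()]
    return list(filter(None, words))


def chain_regex_combined(string):
    """The single-substitution version from the answer."""
    words = re.sub(r"(?::\d{2}[A-Z]?:|\r\n)+", " ", string).split()
    letters_only = ("".join(c for c in w if c.isalpha()) for w in words)
    return [w for w in letters_only if w]


# Both versions should reproduce the expected list for this sample.
assert chain_regex_original(SAMPLE) == EXPECTED
assert chain_regex_combined(SAMPLE) == EXPECTED

# Time each version on the same input; results depend on the machine.
for fn in (chain_regex_original, chain_regex_combined):
    seconds = timeit.timeit(lambda: fn(SAMPLE), number=2000)
    print(f"{fn.__name__}: {seconds:.4f} s for 2000 calls")
```

Measuring on your own data is worthwhile, since the relative cost of the regex pass and the per-character filtering will vary with the length and shape of the strings.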
Here,
- re.sub(r"(?::\d{2}[A-Z]?:|\r\n)+", " ", string).split() finds every run of one or more occurrences of either a colon, two digits, an optional uppercase letter and a colon, or a CRLF line ending, replaces each run with a single space, and splits the result on whitespace
- ["".join(c for c in i if c.isalpha()) for i in x] removes all non-letters from each word
- [w for w in ... if w != ""] omits the empty items.
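The three steps described above can be traced on a small input. The sketch below uses a shortened, made-up string in the same shape as the question's data (the variable names are illustrative only) and prints the intermediate result after each step.

```python
import re

# Shortened, made-up sample in the same shape as the question's input.
sample = ":20:APS0182405\r\n:23B:DRED\r\n:59:/BU5\r\nPLACE/REST\r\n-}"

# Step 1: collapse each run of field tags and CRLFs into one space, then split.
tokens = re.sub(r"(?::\d{2}[A-Z]?:|\r\n)+", " ", sample).split()
print(tokens)        # ['APS0182405', 'DRED', '/BU5', 'PLACE/REST', '-}']

# Step 2: keep only the alphabetic characters of each token.
letters_only = ["".join(c for c in t if c.isalpha()) for t in tokens]
print(letters_only)  # ['APS', 'DRED', 'BU', 'PLACEREST', '']

# Step 3: drop the tokens that ended up empty.
result = [w for w in letters_only if w != ""]
print(result)        # ['APS', 'DRED', 'BU', 'PLACEREST']
```

One behavioural difference is worth keeping in mind: `str.isalpha()` also accepts non-ASCII letters such as `é`, whereas the question's `[^a-zA-Z ]` pattern keeps ASCII letters only. For input like the sample the two agree.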
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Wiktor Stribiżew |
