'Pandas: regex search function skipping matches and returning incorrect results
I need to extract alpha numeric code from various fields in data. The code could be in any of three fields. I need the first match -- then extract the code with a substitution/capture group to populate the "code" column.
In the toy example below, the output is what I would expect ( AKA: it "works" ). In my production version, it skips obvious matches ( not all, small number ) and on another small number, the substitution results in smushed results or a full return of the matched field.
Toy data:
df3 = pd.DataFrame([[1000, 'File X234 version 2.pdf', 'My Title 8209','BR1',''],
[1001, 'File_X003.pdf', 'Title X003', 'BR1',''],
[1003, 'File.pdf', 'BR3 8200', 'BR2',''],
[1004, 'BR5_file.doc','BR4 F200','BR1',''],
[1005, 'file.txt', 'Title', 'BR1', ''],
[1006, '8208 doc3.txt', 'doc3', 'BR4', '']],
columns=['ID', 'File Name', 'Title', 'Type','Code'])
REGEX paterns with capture groups:
patternA = re.compile(r'^.*(82[0-9]{2}).*$', re.IGNORECASE)
patternB = re.compile(r'^.*(X[0-9]{3}).*$', re.IGNORECASE)
patternC = re.compile(r'^.*(F[0-9]{3}).*$', re.IGNORECASE)
Function very similar to production version:
def func_bar(x):
text = x['File Name'] + x['Title']
if patternA.search(text):
value = patternA.sub(r"\1", text)
return value
elif patternB.search(text):
value = patternB.sub(r'\1', text)
return value
elif patternC.search(text):
value = patternC.sub(r'\1', text)
return value
else: return "Not Found"
Apply function to column:
df3['Code'] = df3.apply(func_bar, axis=1)
Success(?):
ID File Name Title Type Code
0 1000 File X234 version 2.pdf My Title 8209 BR1 8209
1 1001 File_X003.pdf Title X003 BR1 X003
2 1003 File.pdf BR3 8200 BR2 8200
3 1004 BR5_file.doc BR4 F200 BR1 F200
4 1005 file.txt Title BR1 Not Found
5 1006 8208 doc3.txt doc3 BR4 8208
Any ideas why my production version is:
- skipping just a few obvious matches?
- Smushing output ( only a few )
- Returning the whole matched field?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
