Pandas: regex search function skipping matches and returning incorrect results

I need to extract an alphanumeric code from various fields in my data. The code could be in any of three fields. I need the first match, and then I extract the code with a substitution and capture group to populate the "Code" column.

In the toy example below, the output is what I would expect (i.e., it "works"). In my production version, it skips a small number of obvious matches, and for another small number of rows the substitution returns smushed results or the full matched field.

Toy data:

import re
import pandas as pd

df3 = pd.DataFrame([[1000, 'File X234 version 2.pdf', 'My Title 8209', 'BR1', ''],
                    [1001, 'File_X003.pdf', 'Title X003', 'BR1', ''],
                    [1003, 'File.pdf', 'BR3 8200', 'BR2', ''],
                    [1004, 'BR5_file.doc', 'BR4 F200', 'BR1', ''],
                    [1005, 'file.txt', 'Title', 'BR1', ''],
                    [1006, '8208 doc3.txt', 'doc3', 'BR4', '']],
                   columns=['ID', 'File Name', 'Title', 'Type', 'Code'])

Regex patterns with capture groups:

patternA   = re.compile(r'^.*(82[0-9]{2}).*$', re.IGNORECASE)
patternB   = re.compile(r'^.*(X[0-9]{3}).*$', re.IGNORECASE)
patternC   = re.compile(r'^.*(F[0-9]{3}).*$', re.IGNORECASE)
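
Two properties of the `^.*( ... ).*$` wrapper may be relevant to the production symptoms, although that is only a guess without the real data. The sketch below uses made-up strings (`two_codes`, `multi_line`) and a hypothetical unwrapped pattern `bare` to show that the greedy leading `.*` keeps the last candidate on a line, and that a string with an embedded newline does not match at all because `.` does not match `\n` by default.

```python
import re

patternA = re.compile(r'^.*(82[0-9]{2}).*$', re.IGNORECASE)

# Two candidates on one line: the greedy leading .* keeps the LAST one
two_codes = 'File 8201 rev 8202.pdf'
print(patternA.sub(r'\1', two_codes))      # 8202

# An embedded newline: '.' cannot cross it, so the wrapped pattern fails
multi_line = 'File 8209\nversion 2.pdf'
print(patternA.search(multi_line))         # None

# Hypothetical unwrapped pattern: search() finds the leftmost code
bare = re.compile(r'82[0-9]{2}')
print(bare.search(two_codes).group(0))     # 8201
print(bare.search(multi_line).group(0))    # 8209
```

If the production fields can contain line breaks, this alone could explain a match that looks obvious but is skipped.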

Function very similar to production version:

def func_bar(x):
    # Concatenate the two text fields into one search string
    text = x['File Name'] + x['Title']

    # Try each pattern in priority order; on a hit, replace the whole
    # matched text with capture group 1
    if patternA.search(text):
        return patternA.sub(r'\1', text)
    elif patternB.search(text):
        return patternB.sub(r'\1', text)
    elif patternC.search(text):
        return patternC.sub(r'\1', text)
    else:
        return "Not Found"
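
For comparison, here is a minimal sketch of an alternative that reads the match object directly instead of substituting. It is not the production code: `PATTERNS` and `find_code` are hypothetical names, the patterns are assumed to be the ones above without the `^.*` and `.*$` wrappers, and the fields are joined with a space so that text from two fields cannot fuse into a false code. Non-string values such as NaN are skipped.

```python
import re

# Assumed unwrapped variants of the three patterns, in priority order
PATTERNS = [
    re.compile(r'82[0-9]{2}', re.IGNORECASE),
    re.compile(r'X[0-9]{3}', re.IGNORECASE),
    re.compile(r'F[0-9]{3}', re.IGNORECASE),
]

def find_code(row, fields=('File Name', 'Title')):
    """Return the first code found in the given fields, or 'Not Found'.

    row can be a pandas Series (as passed by DataFrame.apply with axis=1)
    or any mapping from column name to value.
    """
    # Keep only string values, so NaN or numeric cells are ignored
    parts = [row[f] for f in fields if isinstance(row[f], str)]
    # Join with a space so 'doc 82' + '09 notes' does not become '8209'
    text = ' '.join(parts)
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            # The matched text itself; nothing else can leak into the result
            return match.group(0)
    return 'Not Found'
```

With the toy frame this could be applied as `df3['Code'] = df3.apply(find_code, axis=1)`. Note that `search` returns the leftmost occurrence for each pattern, whereas the wrapped patterns return the last one on the line.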

Apply function to column:

df3['Code'] = df3.apply(func_bar, axis=1)

Success(?):

    ID          File Name                   Title           Type    Code
0   1000        File X234 version 2.pdf     My Title 8209   BR1     8209
1   1001        File_X003.pdf               Title X003      BR1     X003
2   1003        File.pdf                    BR3 8200        BR2     8200
3   1004        BR5_file.doc                BR4 F200        BR1     F200
4   1005        file.txt                    Title           BR1     Not Found
5   1006        8208 doc3.txt               doc3            BR4     8208

Any ideas why my production version is:

  • skipping a few obvious matches?
  • smushing the output (only in a few rows)?
  • returning the whole matched field?
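
One way to narrow this down would be to flag the input values that are likely to trip the wrapped patterns. The helper below is only a sketch under assumptions: `diagnose` and `ANY_CODE` are hypothetical names, and `ANY_CODE` assumes the production codes have the same three shapes as in the toy patterns.

```python
import re

# Unwrapped union of the three code shapes from the toy patterns
# (an assumption: the production patterns may differ)
ANY_CODE = re.compile(r'82[0-9]{2}|[XF][0-9]{3}', re.IGNORECASE)

def diagnose(value):
    """List reasons why the ^.*(...).*$ approach might misbehave on value."""
    if not isinstance(value, str):
        return ['not a string (for example NaN)']
    reasons = []
    if '\n' in value:
        # '.' does not match a newline unless re.DOTALL is set
        reasons.append('contains a newline')
    if len(ANY_CODE.findall(value)) > 1:
        # Greedy wrappers keep the last candidate, not the first
        reasons.append('more than one candidate code')
    return reasons

print(diagnose('File 8209\nv2'))            # ['contains a newline']
print(diagnose('File X234 My Title 8209'))  # ['more than one candidate code']
print(diagnose('file.txt Title'))           # []
```

Running it over the concatenated text of the rows that go wrong should show whether any of these conditions is actually present.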


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
