Matching values in a Python DataFrame using multiple regex patterns
I'm trying to create a new column that is filled with a value ('company') when the value in another column matches one of the patterns in the regex below:
"INC|INC$|INC$|LTD$|CORP$|CORPORATION$|COMPANY$|LLC$|\*LLC$|\*,INC$|\*,CORP$|\*LTD$|\*CORP$|LEASING|TRANSPORTATION|CONSULTANTS|SERVICES|INCORPORATED"
Here is what I tried:
patterns = [".INC.","INC$", ",INC$","LTD$", "CORP$", "CORPORATION$", "COMPANY$", "LLC$", ".*([a-zA-Z]+)LLC$", ".*([a-zA-Z]+),INC$", ".*([a-zA-Z]+),CORP$", ".*([a-zA-Z]+)LTD$", ".*([a-zA-Z]+)CORP$", "LEASING", "TRANSPORTATION", "CONSULTANTS", "SERVICES", "INCORPORATED"]
patterns = re.compile('|'.join(patterns))
data.loc[data['OwnerName'].str.contains(patterns), 'owner'] = 'company'
It matches and labels some strings but not others. For instance, xxx,INC is matched but xxx INC is not.
Could you please point out what I am doing wrong? Thanks!
The xxx, INC string should be set to company when it matches, but it is not.
Solution 1:[1]
To match optional trailing whitespace, you can add \s* before $.
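As a minimal sketch of what \s* changes, assuming the unmatched values carry trailing spaces (the sample names and the variable names below are hypothetical, not taken from the question's data):

```python
import re

# The same anchored pattern, without and with optional trailing whitespace.
strict = re.compile(r"INC$")
lenient = re.compile(r"INC\s*$")

# Hypothetical sample values; the last one has trailing spaces.
names = ["xxx,INC", "xxx INC", "xxx INC  "]

strict_hits = [bool(strict.search(name)) for name in names]
lenient_hits = [bool(lenient.search(name)) for name in names]

print(strict_hits)
print(lenient_hits)
```

Here the strict pattern misses only the value with trailing spaces, while the lenient one matches all three. If the real column has no trailing whitespace, the cause is likely something else.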
Also, some values in the regex you provided are redundant. You can greatly shorten the pattern by using:
import re

patterns = ["INC", r"LTD\s*$", r"CORP\s*$", r"CORPORATION\s*$", r"COMPANY\s*$", r"LLC\s*$", "LEASING", "TRANSPORTATION", "CONSULTANTS", "SERVICES"]
patterns = re.compile('|'.join(patterns))
data.loc[data['OwnerName'].str.contains(patterns), 'owner'] = 'company'
Use raw string literals (the r"..." prefix) when a pattern contains a literal backslash, such as \s, to avoid invalid escape sequence warnings.
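To check the shortened pattern end to end, here is a small sketch on a made-up DataFrame. The sample rows are hypothetical. It also makes two choices that are assumptions rather than part of the answer: it passes the joined pattern string directly to str.contains instead of a compiled object, and it adds na=False so that missing names count as non-matches.

```python
import pandas as pd

patterns = [
    "INC", r"LTD\s*$", r"CORP\s*$", r"CORPORATION\s*$", r"COMPANY\s*$",
    r"LLC\s*$", "LEASING", "TRANSPORTATION", "CONSULTANTS", "SERVICES",
]
pattern = "|".join(patterns)

# Hypothetical sample data; the real OwnerName column will differ.
data = pd.DataFrame(
    {"OwnerName": ["xxx,INC", "xxx INC", "ACME LLC ", "JOHN SMITH", None]}
)

# na=False (an assumption here) keeps the mask strictly boolean
# when OwnerName contains missing values.
mask = data["OwnerName"].str.contains(pattern, na=False)
data.loc[mask, "owner"] = "company"

print(data)
```

The first three rows are labeled company and the last two stay empty. Note that the unanchored "INC" alternative also matches inside longer words (VINCENT, for example), so it may be worth tightening it with word boundaries such as r"\bINC\b" if that causes false positives in your data.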
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Wiktor Stribiżew |
