Refine cleaning address data function
I am working on cleaning user input address data.
data = {'address':['211 S. 10TH AVE APT 4',
'11095 FRAZIER DR',
'1020 BLUEBERRY CT SE ,',
'7614 202 AVE E',
'8013 SO. ALASKA ST.',
'529 GOLDENTEMPLE PL', '123 LOVE BIRD CT'
]}
What I would like to do is:
1. remove leading and trailing whitespace while keeping a single space between the words of the original entry
2. convert the text to proper (title) case
3. spell out abbreviated street names
In the initial approach I was successful with the first two goals:
def nospecial(address_text):
    import re  # use regex
    # remove non-alphanumeric characters but keep spaces
    address_text = re.sub("[^a-zA-Z0-9 ]+", "", address_text)
    # strip leading and trailing whitespace and convert to title case
    address_text = address_text.strip().title()
    return address_text
I thought a for loop would work for my third goal, so I modified the above into:
def st_suffix():
    return {'Dr': 'Drive',
            'Rd': 'Road', 'Blvd':'Boulevard',
            'St':'Street', 'Ste':'Suite',
            'Apts': 'Apartments', 'Apt':'Apartment',
            'Ct':'Court', 'Cir':'Circle'}
def nospecial(address_text):
    import re #use regex
    abbv = st_suffix() # get dict
    address_text = re.sub("[^a-zA-Z0-9 ]+", "",text) # remove non-alphanumeric characters but leave one space
    address_text = address_text.strip().title() #strip leading and trailing white spaces and change to proper cases
    for suffix in address: #go through my address text and search for abbreviated keys above and spell out
        rep = abbv[address_text] if address_text in abbv.keys() else address_text[suffix] #check dict
    return text
With this last version, I get a TypeError: string indices must be integers. I think my mistake is in the for-loop line, but I am not sure. Please help. Thank you.
Solution 1:[1]
You can use
import pandas as pd
data = {'address':['211 S. 10TH AVE APT 4',
'11095 FRAZIER DR',
'1020 BLUEBERRY CT SE ,',
'7614 202 AVE E',
'8013 SO. ALASKA ST.',
'529 GOLDENTEMPLE PL', '123 LOVE BIRD CT'
]}
df=pd.DataFrame(data)
d = {r'\bDr\b\.?': 'Drive',
r'\bRd\b\.?': 'Road', r'\bBlvd\b\.?':'Boulevard',
r'\bSt\b\.?':'Street', r'\bSte\b\.?':'Suite',
r'\bApts\b\.?': 'Apartments', r'\bApt\b\.?':'Apartment',
r'\bCt\b\.?':'Court', r'\bCir\b\.?':'Circle'}
df['address'] = df['address'].str.split().str.join(' ').str.title().replace(d, regex=True)
Note:
- d is a dictionary with regexps used as keys and replacements as values; \b denotes a word boundary and \.? matches an optional dot character
- .str.split().str.join(' ') removes leading/trailing whitespace and keeps only one space between each non-whitespace chunk in the string
- .str.title() converts strings to title case
- .replace(d, regex=True) replaces matches with the values of the d dictionary.
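To see what the word boundaries and the optional dot do in isolation, here is a small sketch using the standard re module on single strings. The names ST_PATTERN and expand_st are made up for this illustration and are not part of the answer above.

```python
import re

# Hypothetical names for illustration; the pattern is one key from the dictionary d
ST_PATTERN = r'\bSt\b\.?'

def expand_st(text):
    # \b on both sides means "St" must be a whole word;
    # \.? also consumes a dot that directly follows it
    return re.sub(ST_PATTERN, 'Street', text)

print(expand_st('8013 So. Alaska St.'))  # 8013 So. Alaska Street
print(expand_st('12 Stone Ste 4'))       # 12 Stone Ste 4 (no whole-word "St" here)
```

Without the \b anchors, the "St" at the start of "Stone" and "Ste" would be rewritten as well.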
Output:
>>> df['address'].str.split().str.join(' ').str.title().replace(d, regex=True)
0 211 S. 10Th Ave Apartment 4
1 11095 Frazier Drive
2 1020 Blueberry Court Se ,
3 7614 202 Ave E
4 8013 So. Alaska Street
5 529 Goldentemple Pl
6 123 Love Bird Court
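If you would rather keep a plain function that works on one string at a time, the following is a minimal sketch of one way to do it, not the only one. The TypeError in the question most likely comes from address_text[suffix]: looping over a string yields characters, and a string cannot be indexed with another string. Splitting the text into words and looking each word up in the dictionary avoids that. The names clean_address and ABBREVIATIONS are hypothetical. Unlike the pandas version above, this sketch also strips punctuation, as the original nospecial function did.

```python
import re

# Same abbreviations as in the question (hypothetical constant name)
ABBREVIATIONS = {
    'Dr': 'Drive', 'Rd': 'Road', 'Blvd': 'Boulevard',
    'St': 'Street', 'Ste': 'Suite',
    'Apts': 'Apartments', 'Apt': 'Apartment',
    'Ct': 'Court', 'Cir': 'Circle',
}

def clean_address(address_text):
    # Drop everything except letters, digits and spaces
    text = re.sub(r'[^a-zA-Z0-9 ]+', '', address_text)
    # split() without arguments discards leading, trailing and repeated whitespace
    words = text.title().split()
    # Replace a word only when it is a known abbreviation; otherwise keep it as is
    return ' '.join(ABBREVIATIONS.get(word, word) for word in words)

print(clean_address('8013 SO. ALASKA ST.'))     # 8013 So Alaska Street
print(clean_address('1020 BLUEBERRY CT SE ,'))  # 1020 Blueberry Court Se
```

Because whole words are compared against the dictionary keys, 'St' and 'Ste' cannot be confused with each other, so no regex word boundaries are needed here.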
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Wiktor Stribiżew |
