Refine cleaning address data function

I am working on cleaning user input address data.

data = {'address':['211 S. 10TH AVE APT 4', 
                    '11095 FRAZIER DR', 
                    '1020 BLUEBERRY CT SE ,', 
                    '7614 202 AVE E',
                    '8013 SO. ALASKA ST.',
                   '529 GOLDENTEMPLE PL', '123 LOVE BIRD CT'
]}

What I would like to do is: 1. remove leading and trailing whitespace while keeping a single space between the words of the original entry, 2. convert to proper (title) case, 3. spell out abbreviated street names.

In the initial approach I was successful with the first two goals:

def nospecial(address_text):
    import re  # use regex
    address_text = re.sub("[^a-zA-Z0-9 ]+", "", address_text)  # remove non-alphanumeric characters but keep spaces
    address_text = address_text.strip().title()  # strip leading and trailing whitespace and convert to title case
    return address_text

I thought a for loop would work for my third goal, so I modified the above into:

def st_suffix():
    return {'Dr': 'Drive',
            'Rd': 'Road', 'Blvd':'Boulevard',
            'St':'Street', 'Ste':'Suite',
            'Apts': 'Apartments', 'Apt':'Apartment',
            'Ct':'Court', 'Cir':'Circle'}


def nospecial(address_text):
    import re  # use regex
    abbv = st_suffix()  # get dict
    address_text = re.sub("[^a-zA-Z0-9 ]+", "", address_text)  # remove non-alphanumeric characters but keep spaces
    address_text = address_text.strip().title()  # strip leading and trailing whitespace and convert to title case
    for suffix in address_text:  # go through my address text and search for abbreviated keys above and spell out
        rep = abbv[address_text] if address_text in abbv.keys() else address_text[suffix]  # check dict
    return address_text

With this last version, I get "TypeError: string indices must be integers". I think my mistake is in the for-loop line, but I am not sure. Please help. Thank you.



Solution 1:[1]

You can use

import pandas as pd
data = {'address':['211 S. 10TH AVE APT 4', 
    '11095 FRAZIER DR', 
    '1020 BLUEBERRY CT SE ,', 
    '7614 202 AVE E',
    '8013 SO. ALASKA ST.',
    '529 GOLDENTEMPLE PL', '123 LOVE BIRD CT'
]}
df = pd.DataFrame(data)
d = {r'\bDr\b\.?': 'Drive',
    r'\bRd\b\.?': 'Road', r'\bBlvd\b\.?':'Boulevard',
    r'\bSt\b\.?':'Street', r'\bSte\b\.?':'Suite',
    r'\bApts\b\.?': 'Apartments', r'\bApt\b\.?':'Apartment',
    r'\bCt\b\.?':'Court', r'\bCir\b\.?':'Circle'}
df['address'] = df['address'].str.split().str.join(' ').str.title().replace(d, regex=True)

Note:

  • d is a dictionary with regular expressions as keys and replacements as values; \b denotes a word boundary and \.? matches an optional dot
  • .str.split().str.join(' ') - removes leading/trailing whitespace and keeps only one space between each non-whitespace chunk in the string
  • .str.title() - converts strings to title case
  • .replace(d, regex=True) - replaces each regex match with the corresponding value from d.
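
To see what the \b and \.? parts of those keys do in isolation, here is a small sketch that uses only the standard re module. The second sample string is made up for illustration and is not from the question's data.

```python
import re

pattern = r'\bSt\b\.?'  # same shape as the keys in d

# The optional dot is consumed together with the abbreviation.
with_dot = re.sub(pattern, 'Street', '8013 So. Alaska St.')

# \b keeps the pattern from matching inside longer words such as "Ste" or "Stone".
untouched = re.sub(pattern, 'Street', '12 Stone Ste 4')

print(with_dot)   # 8013 So. Alaska Street
print(untouched)  # 12 Stone Ste 4
```

Without the word boundaries, 'St' would also match the start of 'Ste', which is why each key is wrapped in \b.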

Output:

>>> df['address'].str.split().str.join(' ').str.title().replace(d, regex=True)
0    211 S. 10Th Ave Apartment 4
1            11095 Frazier Drive
2      1020 Blueberry Court Se ,
3                 7614 202 Ave E
4         8013 So. Alaska Street
5            529 Goldentemple Pl
6            123 Love Bird Court
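
If pandas is not needed, the same idea can be applied to one string at a time. The TypeError in the question most likely arises because looping over a string yields single characters, so address_text[suffix] indexes the string with a character instead of an integer. Below is a minimal sketch, assuming the goal is to look up whole words in the question's abbreviation dictionary. The name clean_address is hypothetical, and dropping punctuation follows the question's first attempt rather than the pandas answer above.

```python
import re

# Abbreviation table taken from the question.
ST_SUFFIX = {'Dr': 'Drive', 'Rd': 'Road', 'Blvd': 'Boulevard',
             'St': 'Street', 'Ste': 'Suite',
             'Apts': 'Apartments', 'Apt': 'Apartment',
             'Ct': 'Court', 'Cir': 'Circle'}


def clean_address(address_text):
    """Hypothetical helper: drop punctuation, tidy spaces and case, expand suffixes."""
    # Drop everything except letters, digits and spaces.
    address_text = re.sub(r'[^a-zA-Z0-9 ]+', '', address_text)
    # split() with no argument discards leading, trailing and repeated whitespace.
    words = address_text.title().split()
    # Look up each whole word; keep it unchanged when it is not an abbreviation.
    return ' '.join(ST_SUFFIX.get(word, word) for word in words)


print(clean_address('  8013 SO. ALASKA ST.'))   # 8013 So Alaska Street
print(clean_address('1020 BLUEBERRY CT SE ,'))  # 1020 Blueberry Court Se
```

Unlike the pandas version, this variant removes the dots and the stray comma, because the punctuation is stripped before the words are looked up.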

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Wiktor Stribiżew