Standardizing the location information in tweet data

I'm working with user location information from tweets, and I want to derive a standardized location tag from this user-entered data. If the location is within the USA, it should return the name of the state; otherwise it should return the country name. Basically something like:

text = ["New York, NY, USA", "Santa Monica, California", "ShanDong, China"]
output = text.standardize()

output
["New York", "California", "China"]

It should also tolerate users' typos. Is there a recommended library? Any thoughts on this would be really appreciated!



Solution 1:[1]

Here's what I would do, and what I actually did recently in a project with tweets: take a list of the possible states and territories of the US, then write a function that checks whether a given string contains the name of any of them. If so, return the state name. Otherwise, return the last part of the string, after the final comma.

text = ["New York, NY, USA", "Santa Monica, California", "ShanDong, China"]

states = ['Alaska', 'Alabama', 'Arkansas', 'American Samoa', 'Arizona', 'California', 'Colorado', 'Connecticut', 'District of Columbia', 'Delaware', 'Florida', 'Georgia', 'Guam', 'Hawaii', 'Iowa', 'Idaho', 'Illinois', 'Indiana', 'Kansas', 'Kentucky', 'Louisiana', 'Massachusetts', 'Maryland', 'Maine', 'Michigan', 'Minnesota', 'Missouri', 'Northern Mariana Islands', 'Mississippi', 'Montana', 'North Carolina', 'North Dakota', 'Nebraska', 'New Hampshire', 'New Jersey', 'New Mexico', 'Nevada', 'New York', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Virginia', 'Virgin Islands', 'Vermont', 'Washington', 'Wisconsin', 'West Virginia', 'Wyoming']

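def standardize(text):
    # Check longer names first so "West Virginia" is not matched as "Virginia".
    for state in sorted(states, key=len, reverse=True):
        if state in text:
            return state
    # No state found: fall back to the part after the last comma.
    return text.split(", ")[-1]

text_2 = [standardize(i) for i in text]
print(text_2)
# Prints ['New York', 'California', 'China']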
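
The question also asks for some tolerance to typos, which exact substring matching does not give you. One possible way to add it, sketched here rather than offered as a definitive solution, is fuzzy matching with difflib.get_close_matches from Python's standard library. The function name fuzzy_standardize, the shortened STATES list, and the cutoff of 0.8 are all assumptions made for illustration; in practice you would pass the full list of states and tune the cutoff on your own data.

```python
import difflib

# Hypothetical, shortened list for illustration; use the full list in practice.
STATES = ["California", "New York", "Texas", "Washington"]


def fuzzy_standardize(location, cutoff=0.8):
    """Return the closest state name for any comma-separated part.

    Falls back to the last part of the string if nothing is close enough.
    The cutoff value is an assumption and should be tuned on real data.
    """
    parts = [p.strip() for p in location.split(",") if p.strip()]
    # Compare in lowercase so that casing differences do not count as typos.
    lowered = {s.lower(): s for s in STATES}
    for part in parts:
        match = difflib.get_close_matches(
            part.lower(), list(lowered), n=1, cutoff=cutoff
        )
        if match:
            return lowered[match[0]]
    return parts[-1] if parts else location


print([fuzzy_standardize(t) for t in
       ["Santa Monica, Califronia", "New Yrok, NY", "ShanDong, China"]])
```

A higher cutoff reduces false matches between similar names but tolerates fewer typos. This sketch compares each comma-separated part as a whole, so it will not find a state name embedded in a longer free-text phrase.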

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 leo_val