'Unable to rename/replace categories in a dataframe after removing unicode u
I am trying to rename the categories in a dataframe after removing the unicode u with a .replace('u','',regex) method due to the method removing the other 'u's in the text as well. I have tried using the replace, and the rename_categories method to change the categories into desired format using a dictionary to map but it remains unchanged after removing the unicode u. Does anyone know a better way I can approach this? I have attached a link to the CSV I am working with.

'''uploaded = files.upload()
yelpdf = pd.read_csv(io.BytesIO(uploaded['yelp_reviews.csv']))
print(yelpdf['NoiseLevel'].value_counts())
yelpdf['NoiseLevel'] = yelpdf['NoiseLevel'].astype(str)
update_NoiseLevel = {'average': 'Average', 'lod': 'Loud', 'qiet': 'Quiet', 'very_lod': 'Very Loud'}
yelpdf['NoiseLevel'] = yelpdf['NoiseLevel'].replace('u','',regex=True)
yelpdf['NoiseLevel'] = yelpdf['NoiseLevel'].astype('category')
yelpdf['NoiseLevel'] = yelpdf['NoiseLevel'].cat.rename_categories(update_NoiseLevel)
yelpdf['NoiseLevel'] = yelpdf['NoiseLevel'].replace(update_NoiseLevel)
print(yelpdf['NoiseLevel'].value_counts())'''
its a CSV file with yelp data and this issue is occurring within the NoiseLevel column
Solution 1:[1]
Try str.extract before create category (if needed)
df = pd.read_excel('yelp_reviews.xlsx')
df['NoiseLevel'] = df['NoiseLevel'].str.extract("(?:u')?([^']*)")
Output:
>>> df['NoiseLevel'].unique()
array(['average', 'quiet', nan, 'loud', 'very_loud'], dtype=object)
>>> df['NoiseLevel'].head(10)
0 average
1 average
2 average
3 average
4 average
5 average
6 average
7 quiet
8 NaN
9 NaN
Name: NoiseLevel, dtype: object
Solution 2:[2]
I did post a solution last night but it was not correct. This is the CORRECTED one:
Basically all the issues arise from the way the day is stored. In my case (in my GooggleColab notebook) , some data was coming in with “u’” and some others with JUST the first quote only.
This inconsistency is the source of all the problems, and in particular this is why "renaming" is not working. Your names are also misspelled… (No?) update_NoiseLevel = {'average': 'Average', 'lod': 'Loud', 'qiet': 'Quiet', 'very_lod': 'Very Loud'} it should be: update_NoiseLevel = {'average': 'Average', 'loud': 'Loud', 'quiet': 'Quiet', 'very_loud': 'Very Loud'}
BEFORE you upload the CSV in Colab, you need to make a little adjustment to the “Noise_level”, column so that all have quotes, in both sides… etc I used (IN EXCEL):
FORMULA =IF(H2="","unclasssifed",IF(LEFT(H2,1)<>"u",CONCATENATE("'",H2),IF(LEFT(H2,1)="u",H2)))
H2 = “NoiseLevel” column in excel…, you may do that also in CSV using Python… For the empty ones, that is nan, I used the term “unclassified”…
Steps to resolve this (inconsistency) and make the renaming work:
1. Import file in colab:
import pandas as pd
import io
df = pd.read_csv(io.StringIO(uploaded['stelios_copy_of_yelp_reviews - yelp_reviews (4).csv'].decode('utf-8')))
df.head()
2. Create a function to remove ALL the brackets and "u'" all in your data, in that column ... (!) and apply it!
def remove_another(x):
string = str(x)
aaa = string.replace("u'", "").replace("'", "")
return aaa
df['NoiseLevel_u_removed'] = df['NoiseLevel_modified_excel'].apply(remove_another)
3. Correct this! and RUN remaining commands
update_NoiseLevel = {'average': 'Average', 'loud': 'Loud', 'quiet': 'Quiet', 'very_loud': 'Very Loud', 'unclassified': 'Unclassified'}
print(df['NoiseLevel_u_removed'].value_counts())
df['NoiseLevel_category'] = df['NoiseLevel_u_removed'].astype('category')
df['NoiseLevel_u_removed'] = df['NoiseLevel_category'].cat.rename_categories(update_NoiseLevel)
df['NoiseLevel_u_removed'][0:23]
I hope this helps you !
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Corralien |
| Solution 2 |

