'How to convert a datatype of a column with both integer and decimal numbers in Python?

I was working on wine data on kaggle. Where there was a column named price has values like $32, $17, $15.99, Nan wine_data.isnull().sum()--After applying this code, there were a lot of missing values so I wrote another code i.e. wine_data['designation'].fillna(wine_data['designation'].mode()[0], inplace = True) wine_data['varietal'].fillna(wine_data['varietal'].mode()[0], inplace = True) wine_data['appellation'].fillna(wine_data['appellation'].mode()[0], inplace = True) wine_data['alcohol'].fillna(wine_data['alcohol'].mode()[0], inplace = True) wine_data['price'].fillna(wine_data['price'].mode()[0], inplace = True) wine_data['reviewer'].fillna(wine_data['reviewer'].mode()[0], inplace = True) wine_data['review'].fillna(wine_data['review'].mode()[0], inplace = True)

Then I wanted to do a correlation of alcohol with rating and price with rating but both alcohol and price column has '%' and '$' these characters.So, I applied this code. wine_data = wine_data.assign(alcohol_num = lambda row: row["alcohol"].replace("%", "", regex=True).astype('float')) wine_data = wine_data.assign(price_numbers= wine_data['price'].str.replace('$','',regex = True)).astype('float')

It's throwing me an error like-- could not convert string to float: 'J. Lohr 2000 Hilltop Vineyard Cabernet Sauvignon (Paso Robles)'

Then I tried this code: wine_data = wine_data.assign(price_numbers= wine_data['price'].str.replace('$','',regex = True)).astype('int') It's throwing me an error like-- invalid literal for int() with base 10: 'J. Lohr 2000 Hilltop Vineyard Cabernet Sauvignon (Paso Robles)'



Solution 1:[1]

Your data is not clean. One of the elements in your price column keeps containing the string 'J. Lohr 2000 Hilltop Vineyard Cabernet Sauvignon (Paso Robles)', which is why the column cannot be converted to float, even though you did some other cleansing steps.

You want be a bit more structured in your data cleansing: Do one step after the other, take a look at the intermediate df, and do not try to do many cleansing steps at once with an apply() function. If you have a messy dataset, maybe 10 steps are required, no way you can do all of that with a single apply() call.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 KingOtto