Why do we use X_train.select_dtypes(exclude=['object']) to drop categorical variables from a DataFrame in pandas?
Source: https://www.kaggle.com/code/alexisbcook/categorical-variables
To drop categorical variables, we use the command
drop_X_train = X_train.select_dtypes(exclude=['object'])
Doesn't it make more sense to use
drop_X_train = X_train.select_dtypes(exclude=['string'])
since categorical variables have data type string?
Solution 1:[1]
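pandas deliberately stores text as native Python strings, which require the object dtype, so a column of plain strings is reported as object rather than string. The string dtype is a separate extension type that a column only gets when you ask for it, as column C in the example shows. See "pandas distinction between str and object types".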
Also see: https://pandas.pydata.org/docs/user_guide/text.html
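import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})  # plain Python strings
df["B"] = df["A"].astype("category")  # explicit categorical dtype
df["C"] = df["A"].astype("string")  # explicit pandas string dtype
df.info()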
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       4 non-null      object
 1   B       4 non-null      category
 2   C       4 non-null      string
dtypes: category(1), object(1), string(1)
memory usage: 328.0+ bytes
print(df)
   A  B  C
0  a  a  a
1  b  b  b
2  c  c  c
3  a  a  a
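To tie this back to the question, here is a minimal sketch of what exclude=['object'] does on a small frame. The frame is a hypothetical stand-in for X_train and its column names are made up for illustration. The text column is given dtype=object explicitly so the example does not depend on how a particular pandas version infers string columns.

```python
import pandas as pd

# Hypothetical stand-in for X_train; the column names are invented for this sketch.
X_train = pd.DataFrame(
    {
        "rooms": [2, 3, 4],
        "price": [1.5, 2.0, 3.25],
        # dtype=object is set explicitly so the text column is stored as
        # native Python strings, matching column A in the example above.
        "street": pd.Series(["Pave", "Grvl", "Pave"], dtype=object),
    }
)

# Drop every object-dtype column, keeping the numeric ones.
drop_X_train = X_train.select_dtypes(exclude=["object"])

kept = list(drop_X_train.columns)
dropped = [c for c in X_train.columns if c not in kept]
print(kept)     # columns that survive the filter
print(dropped)  # columns removed because their dtype is object
```

Checking X_train.dtypes before filtering is a quick way to confirm which dtype the text columns actually have in your own data.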
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
