Why do we use X_train.select_dtypes(exclude=['object']) to drop categorical variables from a DataFrame in pandas?

Source: https://www.kaggle.com/code/alexisbcook/categorical-variables

To drop categorical variables, the tutorial uses the command

drop_X_train = X_train.select_dtypes(exclude=['object'])

Doesn't it make more sense to use

drop_X_train = X_train.select_dtypes(exclude=['string'])

since categorical variables have the string data type?



Solution 1:[1]

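pandas has traditionally stored text as native Python strings, and a column of native Python strings gets the object dtype. That is why the tutorial excludes 'object': it is the dtype that text columns have by default, as column A in the example shows. The dedicated 'string' dtype only appears when you opt in to it, for example with astype("string"). See also the related question on the pandas distinction between str and object types.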

Also see: https://pandas.pydata.org/docs/user_guide/text.html

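import pandas as pd

# Column A holds plain Python strings, so pandas reports it as object
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
# The same values converted explicitly to the category and string dtypes
df["B"] = df["A"].astype("category")
df["C"] = df["A"].astype("string")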

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   A       4 non-null      object  
 1   B       4 non-null      category
 2   C       4 non-null      string  
dtypes: category(1), object(1), string(1)
memory usage: 328.0+ bytes

print(df)

   A  B  C
0  a  a  a
1  b  b  b
2  c  c  c
3  a  a  a
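
To tie this back to the question, here is a minimal sketch of what select_dtypes(exclude=['object']) keeps and drops. The frame and its column names (price, color, size) are hypothetical stand-ins for X_train, not data from the Kaggle tutorial, and the text column is created with dtype=object explicitly so the result does not depend on the default string handling of your pandas version.

```python
import pandas as pd

# Hypothetical stand-in for X_train: one numeric column, one text column
# stored as object, and one column converted to the category dtype.
X_train = pd.DataFrame({
    "price": [1.0, 2.5, 3.0, 4.2],
    "color": pd.Series(["red", "blue", "red", "green"], dtype=object),
    "size": pd.Series(["S", "M", "L", "M"]).astype("category"),
})

# Excluding 'object' removes the text column stored as Python strings...
drop_X_train = X_train.select_dtypes(exclude=["object"])
print(drop_X_train.dtypes)

# ...but a column with the category dtype survives, so it has to be
# excluded explicitly if it should be dropped as well.
drop_all = X_train.select_dtypes(exclude=["object", "category"])
print(drop_all.dtypes)
```

So excluding 'object' works in the tutorial only because its text columns were never converted to another dtype. If your own data uses the category dtype, list that in exclude too.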

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1