I got Error ==> ValueError: could not convert string to float: '?'
I'm running the following Python script:
print("(Positive Patients ST depression): " + str(pos_data['oldpeak'].mean()))
print("(Negative Patients ST depression): " + str(neg_data['oldpeak'].mean()))
print("(Positive Patients thalach): " + str(pos_data['thalach'].mean()))
print("(Negative Patients thalach): " + str(neg_data['thalach'].mean()))
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 1)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
However, I get an error on the second-to-last line:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-32-436d450d7687> in <module>()
1 from sklearn.preprocessing import StandardScaler
2 sc = StandardScaler()
----> 3 x_train = sc.fit_transform(x_train)
4 x_test = sc.transform(x_test)
4 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
744 array = array.astype(dtype, casting="unsafe", copy=False)
745 else:
--> 746 array = np.asarray(array, order=order, dtype=dtype)
747 except ComplexWarning as complex_warning:
748 raise ValueError(
ValueError: could not convert string to float: '?'
Can anyone explain a little bit about this?
Solution 1:[1]
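You are trying to scale string data, but scikit-learn's StandardScaler (like most scikit-learn estimators and ML algorithms in general) accepts only numerical data. The traceback shows that at least one cell in x_train holds the string '?', which cannot be converted to a float.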
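Before converting anything, it can help to see which columns actually hold non-numeric values. The following is a minimal sketch, assuming `data` is a pandas DataFrame as in the question; the small frame and the helper `find_non_numeric` are hypothetical stand-ins written for illustration, not part of the original script.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the real frame comes from the question's dataset.
data = pd.DataFrame({
    "thalach": ["150", "?", "172"],
    "oldpeak": ["2.3", "1.4", "?"],
    "target": [1, 0, 1],
})


def find_non_numeric(df):
    """Return {column: sorted unique values that cannot be parsed as numbers}."""
    bad = {}
    for col in df.columns:
        # errors="coerce" turns every unparsable entry into NaN instead of raising.
        converted = pd.to_numeric(df[col], errors="coerce")
        # Keep entries that were present originally but became NaN after coercion.
        mask = converted.isna() & df[col].notna()
        if mask.any():
            bad[col] = sorted(df.loc[mask, col].astype(str).unique())
    return bad


non_numeric = find_non_numeric(data)
print(non_numeric)
```

Any column that shows up in the result needs cleaning before it reaches StandardScaler.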
So you need to turn your text data into numerical data. You can do this with one of scikit-learn's vectorizers: CountVectorizer or TfidfVectorizer (recommended). These vectorizers take all your data, split it into words, and assign a numeric value to each word, so you can use it as you wish.
Of course, there are other vectorizing techniques, but start with TF-IDF.
Note that vectorizers are meant for free text. If a column only contains a placeholder such as '?' for missing values, convert that placeholder to NaN and impute or drop those values instead.
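For the specific error in the question, the offending value is the single string '?', which looks like a missing-value marker and not free text. The sketch below assumes that reading. The small DataFrame is a hypothetical stand-in for `data`, and median imputation is only one possible choice; dropping the affected rows would work as well.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for `data`; "?" marks missing measurements (an assumption).
data = pd.DataFrame({
    "thalach": ["150", "?", "172", "160", "141", "155", "168", "149", "130", "175"],
    "oldpeak": ["2.3", "1.4", "?", "0.8", "0.0", "1.2", "0.6", "2.0", "3.1", "0.4"],
    "target": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Force every column to a numeric dtype; anything unparsable (including "?") becomes NaN.
data = data.apply(pd.to_numeric, errors="coerce")

X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Fill the gaps with the column median, learned from the training split only.
imputer = SimpleImputer(strategy="median")
x_train = imputer.fit_transform(x_train)
x_test = imputer.transform(x_test)

# With purely numeric input, scaling now works as in the original script.
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
```

If the data is loaded with pd.read_csv, passing na_values="?" at load time should have the same effect as the coercion step. Note that errors="coerce" also hides any other unparsable values, so it is worth inspecting the columns first.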
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | K0mp0t |
