I got Error ==> ValueError: could not convert string to float: '?'

I'm running the following Python script:

print("(Positive Patients ST depression): " + str(pos_data['oldpeak'].mean()))
print("(Negative Patients ST depression): " + str(neg_data['oldpeak'].mean()))

print("(Positive Patients thalach): " + str(pos_data['thalach'].mean()))
print("(Negative Patients thalach): " + str(neg_data['thalach'].mean()))

X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 1)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

However, I get the following error on the second-to-last line:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-436d450d7687> in <module>()
      1 from sklearn.preprocessing import StandardScaler
      2 sc = StandardScaler()
----> 3 x_train = sc.fit_transform(x_train)
      4 x_test = sc.transform(x_test)

4 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    744                     array = array.astype(dtype, casting="unsafe", copy=False)
    745                 else:
--> 746                     array = np.asarray(array, order=order, dtype=dtype)
    747             except ComplexWarning as complex_warning:
    748                 raise ValueError(

ValueError: could not convert string to float: '?'

Can anyone explain a little bit about this?



Solution 1:[1]

You are trying to scale string data, but scikit-learn's StandardScaler (like many scikit-learn estimators and ML algorithms in general) accepts only numerical data. The last line of the traceback shows the value it could not convert: the string '?'.
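To find out where the strings come from, it can help to list the non-numeric columns and the values in them that fail conversion. The following is only a minimal sketch: the small `data` frame and its `ca` column are hypothetical stand-ins for the question's dataset, whose actual columns may differ.

```python
import pandas as pd

# Hypothetical stand-in for the question's `data`; the real columns may differ.
data = pd.DataFrame({
    "thalach": [150, 187, 172, 178],
    "oldpeak": [2.3, 3.5, 1.4, 0.8],
    "ca": ["0", "?", "2", "1"],  # a single '?' keeps the whole column non-numeric
    "target": [1, 1, 0, 0],
})

# Columns that pandas did not parse as numbers.
non_numeric_cols = data.select_dtypes(exclude="number").columns.tolist()

# For each such column, collect the values that cannot be converted to a number.
bad_values = {}
for col in non_numeric_cols:
    converted = pd.to_numeric(data[col], errors="coerce")  # unparseable -> NaN
    mask = converted.isna() & data[col].notna()
    bad_values[col] = sorted(data.loc[mask, col].unique())

print(non_numeric_cols)
print(bad_values)
```

Running the same check on the real `data` frame should show which columns contain '?' before anything is passed to `StandardScaler`.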

So you need to convert your text data to numerical data. You can do this with one of scikit-learn's vectorizers: CountVectorizer or TfidfVectorizer (recommended). These vectorizers take your text, split it into words, and assign each word a numeric value (a count or a TF-IDF weight), so the result can be passed to estimators that expect numbers.
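As a rough illustration of what a vectorizer does, here is a minimal sketch using `TfidfVectorizer`. The `notes` list is a hypothetical free-text column invented for this example; the dataset in the question may not contain any text like it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical free-text data; not taken from the question's dataset.
notes = [
    "chest pain on exertion",
    "no chest pain",
    "shortness of breath on exertion",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(notes)  # sparse matrix, one row per document

# Mapping from each word to its column index in the matrix.
print(vectorizer.vocabulary_)
# (number of documents, number of distinct words)
print(tfidf.shape)
```

Each row of `tfidf` is a numeric vector, so it can be fed to estimators that reject raw strings. This only makes sense for columns that hold real text.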

Of course, there are other vectorizing techniques, but start with TF-IDF.
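If the offending strings turn out to be placeholders rather than real text, a vectorizer may not be needed. The error message suggests '?' could be a missing-value marker, but that is an assumption about the dataset. Under that assumption, one possible sketch is to coerce the features to numbers, impute the gaps, and then scale. The `data` frame below is a hypothetical stand-in, and median imputation is just one choice among several.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the question's `data`; the real columns may differ.
data = pd.DataFrame({
    "thalach": [150, 187, 172, 178, 163, 148, 153, 173, 162, 174],
    "ca": ["0", "?", "2", "1", "0", "0", "?", "1", "0", "3"],
    "target": [1, 1, 0, 0, 1, 0, 1, 0, 1, 0],
})

# Convert every feature column to numbers; anything unparseable ('?') becomes NaN.
# Note: errors="coerce" also hides other bad strings, so inspect the data first.
features = data.iloc[:, :-1].apply(pd.to_numeric, errors="coerce")
X = features.to_numpy(dtype=float)
y = data.iloc[:, -1].values

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Fill NaN with the training-set median, then standardize.
preprocess = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
x_train = preprocess.fit_transform(x_train)
x_test = preprocess.transform(x_test)  # reuse statistics learned on the training set

print(x_train.shape, x_test.shape)
```

Fitting the imputer and scaler on the training split only, as in the original script, keeps test-set statistics out of the preprocessing.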

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 K0mp0t