RandomForestClassifier in PySpark

I have separate training and test data sets, and I have already converted both from RDDs to PySpark DataFrames. The training set has 9 columns: the first 8 are feature columns and the last one is the label, which contains only 1s and 0s. The test set has 8 columns, all of them feature columns. Below is the code I have written, but I cannot get the predictions DataFrame.

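# Imports assumed here; they were not shown in the original post
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import FloatType

train_cols = df_train.columns  # list of training columns (8 features + label)
test_cols = df_test.columns    # list of test columns (8 features)

# Cast every column to FloatType
for c in train_cols:
    df_train = df_train.withColumn(c, df_train[c].cast(FloatType()))
for c in test_cols:
    df_test = df_test.withColumn(c, df_test[c].cast(FloatType()))

# Build (features, label) rows from the training set
trainingData = df_train.rdd.map(lambda row: (Vectors.dense(row[0:-1]), row[-1])).toDF(["features", "label"])

# Assemble the test columns into a single "features" vector column
assemble_f = VectorAssembler(inputCols=test_cols, outputCol="features")
output = assemble_f.transform(df_test)
output.show()

rf = RandomForestClassifier(featuresCol="features", labelCol="label")
rfModel = rf.fit(trainingData)

predictions = rfModel.transform(output)
predictions.show()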

When I try to get the predictions DataFrame, I get the error "'bool' object is not subscriptable". I do not understand where exactly I went wrong in the code.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
