TypeError while tokenizing a column in a Spark DataFrame

I'm trying to tokenize a string column from a Spark DataFrame.

The DataFrame schema is as follows:

df: 
index ---> Integer 
question ---> String
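
For reference, a minimal sketch of such a DataFrame (the sample rows are hypothetical, invented only to match the schema above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample rows matching the schema above
df = spark.createDataFrame(
    [(1, "What is Spark?"), (2, "How do I tokenize a column?")],
    ["index", "question"],
)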

This is how I'm using the Spark Tokenizer:

Quest = df.withColumn("question", col("Question").cast(StringType()))
tokenizer = Tokenizer(inputCol=Quest, outputCol="question_parts")

But I get the following error:

Invalid param value given for param "inputCol". Could not convert <class 'pyspark.sql.dataframe.DataFrame'> to string type

I also replaced the first line of my code with the following alternatives, but neither resolved the error:

Quest = df.select(concat_ws(" ",col("question")))

and

Quest= df.withColumn("question", concat_ws(" ",col("question")))

What's my mistake here?



Solution 1 [1]

The mistake is in the second line: Tokenizer's inputCol expects a column name (a string), but you're passing it a DataFrame. df.withColumn() returns a DataFrame with the column you just created appended, so Quest is a DataFrame, not a column name. Setting inputCol="question" gives you what you need. You then need to transform your DataFrame using the tokenizer.

Try:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType
from pyspark.ml.feature import Tokenizer

df = df.withColumn("Question", col("Question").cast(StringType()))
tokenizer = Tokenizer(inputCol="Question", outputCol="question_parts")
tokenized = tokenizer.transform(df)
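
You can then inspect the result to confirm the tokenization worked (the exact output depends on your data):

tokenized.select("Question", "question_parts").show(truncate=False)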

Edit:
I'm not sure you intended to create a new column in the first line, so I've changed the column name in the withColumn call from "question" to "Question" so that it replaces the existing column instead. It also looks from your data as though the column is already a string - if so, the cast step isn't necessary.
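
If you want to verify the column type before deciding whether the cast is needed, you can check the schema first:

df.printSchema()
# or inspect the dtypes directly, e.g. [('index', 'int'), ('question', 'string')]
print(df.dtypes)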

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Stack Overflow