Character-level TF-IDF in PySpark?

I have a dataset that consists of 10M rows:

>>> df

    name            job                                     company
0   Amanda Arroyo   Herbalist                               Norton-Castillo
1   Victoria Brown  Outdoor activities/education manager    Bowman-Jensen
2   Amy Henry       Chemist, analytical                     Wilkerson, Guerrero and Mason

And I want to calculate the character-level 3-gram TF-IDF vectors for the name column, as I could easily do with sklearn:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
X = tfidf.fit_transform(df['name'])

The problem is that I can't find any reference to a character-level analyzer in the Spark documentation or in the HashingTF API docs.

Is this achievable at all with PySpark?



Solution 1:[1]

The tools are available: Spark ML provides TF-IDF (HashingTF/IDF) and an NGram transformer, which together cover what sklearn's TfidfVectorizer does here.

Yes, it is achievable.

Example of characters being tokenized (note that Tokenizer splits on whitespace):

from pyspark.ml.feature import Tokenizer

# Tokenizer splits on whitespace, so "a b c" comes out as single characters.
df = spark.createDataFrame([("a b c",)], ["text"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokenizer.transform(df).head()
# Row(text='a b c', words=['a', 'b', 'c'])
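
Putting the pieces together, here is a minimal sketch (not from the original answer) of a character-level 3-gram TF-IDF pipeline. It assumes an active spark session and a name column as in the question, and uses a small UDF for the character split, since Spark has no built-in character analyzer and Tokenizer only splits on whitespace.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
from pyspark.ml.feature import NGram, HashingTF, IDF
from pyspark.ml import Pipeline

# Split each name into a list of individual characters.
to_chars = F.udf(lambda s: list(s), ArrayType(StringType()))

df = spark.createDataFrame(
    [("Amanda Arroyo",), ("Victoria Brown",), ("Amy Henry",)], ["name"]
)
chars = df.withColumn("chars", to_chars("name"))

pipeline = Pipeline(stages=[
    NGram(n=3, inputCol="chars", outputCol="char_3grams"),  # character 3-grams, e.g. "a m a"
    HashingTF(inputCol="char_3grams", outputCol="tf"),      # hashed term frequencies
    IDF(inputCol="tf", outputCol="tfidf"),                  # rescale by inverse document frequency
])

model = pipeline.fit(chars)
model.transform(chars).select("name", "tfidf").show(truncate=False)

Note that NGram joins tokens with spaces (so grams look like "a m a" rather than sklearn's "ama"), and HashingTF hashes n-grams into a fixed-size vector instead of building an explicit vocabulary; CountVectorizer could be swapped in if an explicit vocabulary is needed.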

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

[1] Stack Overflow