Character-level TFIDF in PySpark?
I have a dataset that consists of 10M rows:
>>> df
             name                                   job                        company
0   Amanda Arroyo                             Herbalist                Norton-Castillo
1  Victoria Brown  Outdoor activities/education manager                  Bowman-Jensen
2       Amy Henry                   Chemist, analytical  Wilkerson, Guerrero and Mason
And I want to calculate the character-level 3-gram TF-IDF vectors for the name column, as I would easily do with sklearn:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
X = tfidf.fit_transform(df['name'])
The problem is that I can't see any reference to it in the Spark documentation or in the HashingTF API docs.
Is this achievable at all with PySpark?
Solution 1:[1]
The tools are available:
TFIDF Spark vs sklearn, and ngrams.
Yes, it is achievable.
Example of single characters being tokenized (the default Tokenizer splits on whitespace):
from pyspark.ml.feature import Tokenizer

df = spark.createDataFrame([("a b c",)], ["text"])
tokenizer = Tokenizer(outputCol="words")
tokenizer.setInputCol("text")
tokenizer.transform(df).head()
# Row(text='a b c', words=['a', 'b', 'c'])
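For character-level 3-grams specifically, the same building blocks can be chained into a pipeline: a RegexTokenizer to split each string into single characters, NGram for the 3-grams, and HashingTF (or CountVectorizer) followed by IDF for the weighting. The sketch below is a minimal illustration under those assumptions, not code from the original answer; the column names (chars, ngrams, tf, tfidf) and the numFeatures value are arbitrary choices.

from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, IDF

df = spark.createDataFrame(
    [("Amanda Arroyo",), ("Victoria Brown",), ("Amy Henry",)], ["name"]
)

# With gaps=False the pattern matches tokens rather than delimiters,
# so "." turns every single character (including spaces) into a token.
chars = RegexTokenizer(inputCol="name", outputCol="chars", pattern=".", gaps=False)

# Slide a window of 3 tokens (characters) over each name.
ngrams = NGram(n=3, inputCol="chars", outputCol="ngrams")

# Term frequencies via hashing, then the IDF re-weighting.
tf = HashingTF(inputCol="ngrams", outputCol="tf", numFeatures=1 << 18)
idf = IDF(inputCol="tf", outputCol="tfidf")

model = Pipeline(stages=[chars, ngrams, tf, idf]).fit(df)
model.transform(df).select("name", "tfidf").show(truncate=False)

Note that NGram joins tokens with a space, so a character 3-gram comes out as "a m a" rather than sklearn's "ama"; the resulting features are equivalent up to that naming difference. Swap HashingTF for CountVectorizer if you need an explicit, invertible vocabulary.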
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
