How to use pyspark VectorAssembler
I'm trying to use PySpark's VectorAssembler, but it does not seem to work properly. I have a DataFrame of Twitter data with one row per hashtag and one column per day of the year, holding the number of times that hashtag was used on that day. I want to assemble these columns into a single vector column, and my code is:
%%spark
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols = daily_hashtag_matrix.columns[1:], outputCol = "vector")
output = assembler.transform(daily_hashtag_matrix)
daily_vector = output.select( "vector")
daily_vector.show(n=15)
However, the output is not what I expected: a few rows contain the vector I want, but most do not, as shown below:
+--------------------+
|              vector|
+--------------------+
|(356,[6,62,98,228...|
|(356,[4,10,11,12,...|
|(356,[12,117,209,...|
|(356,[186,187],[1...|
|    (356,[79],[1.0])|
|(356,[152,168],[1...|
|(356,[1,15,25,29,...|
|(356,[3,4,5,9,11,...|
|(356,[38,57,184,2...|
|(356,[3,6,9,17,35...|
|(356,[18,31,49,90...|
|   (356,[351],[1.0])|
|[3.0,1.0,0.0,0.0,...|
|(356,[102,103],[4...|
|(356,[6,110,206],...|
+--------------------+
I would like every row to look like the 13th row of the output. What am I doing wrong? Thanks in advance.
Solution 1:[1]
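What you're seeing in the 13th row is a DenseVector, while the other rows are SparseVectors, a representation used when most of the values in the row are zeros. Both encode the same kind of vector: the sparse form is printed as (size, [indices], [values]) and stores only the non-zero entries, and VectorAssembler picks whichever representation is more compact for each row. So nothing is wrong with your output. Take a look at the sample below: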
from pyspark.ml.feature import VectorAssembler
df = spark.createDataFrame([
(1, 0, 3),
(0, 0, 0),
], ["a", "b", "c"])
vecAssembler = VectorAssembler(outputCol="features")
vecAssembler.setInputCols(["a", "b", "c"])
vecAssembler.transform(df).collect()
[Row(a=1, b=0, c=3, features=DenseVector([1.0, 0.0, 3.0])),
Row(a=0, b=0, c=0, features=SparseVector(3, {}))]
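To read the sparse rows by hand, it may help to expand one yourself. The following is a minimal pure-Python sketch, not PySpark code: the helper name sparse_to_dense is hypothetical and exists only for illustration. It assumes the printed form (size, [indices], [values]) and fills every position that is not listed with 0.0.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a plain list of floats."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size          # every position defaults to zero
    for i, v in zip(indices, values):
        dense[i] = float(v)       # fill in only the stored non-zero entries
    return dense


# Row 5 of the question's output, printed by Spark as (356,[79],[1.0])
row = sparse_to_dense(356, [79], [1.0])
print(len(row), row[79], sum(row))
```

Inside PySpark you would not need such a helper: SparseVector.toArray() performs the same expansion on a single vector, and on Spark 3.0 or later pyspark.ml.functions.vector_to_array should convert a whole vector column into an array column. Most ML estimators accept both representations, so converting is usually only needed for display or export.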
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | pltc |
