How to use pyspark VectorAssembler

I'm trying to use pyspark's VectorAssembler, but it doesn't seem to be working properly. I have a dataframe of Twitter data with one row per hashtag and one column per day of the year, holding the number of times that hashtag was used on that day. I want to assemble these columns into a vector, and my code is:

%%spark
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols = daily_hashtag_matrix.columns[1:], outputCol = "vector")
output = assembler.transform(daily_hashtag_matrix)
daily_vector = output.select( "vector")
daily_vector.show(n=15)

However, the output is not what I expected: a few rows contain the vector I want, but most do not. See below:

+--------------------+
|              vector|
+--------------------+
|(356,[6,62,98,228...|
|(356,[4,10,11,12,...|
|(356,[12,117,209,...|
|(356,[186,187],[1...|
|    (356,[79],[1.0])|
|(356,[152,168],[1...|
|(356,[1,15,25,29,...|
|(356,[3,4,5,9,11,...|
|(356,[38,57,184,2...|
|(356,[3,6,9,17,35...|
|(356,[18,31,49,90...|
|   (356,[351],[1.0])|
|[3.0,1.0,0.0,0.0,...|
|(356,[102,103],[4...|
|(356,[6,110,206],...|
+--------------------+

I would like all rows to look like the 13th row of the output. What am I doing wrong? Thanks in advance.



Solution 1:[1]

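What you're seeing in the 13th row is a DenseVector, while the other rows are SparseVectors. Both represent the same kind of data. A SparseVector stores only the vector size, the indices of the nonzero entries, and their values, so (356,[79],[1.0]) is a 356-element vector that is zero everywhere except for 1.0 at index 79. VectorAssembler picks whichever representation takes less memory for each row, so rows where most of the values are zeros come out sparse. Take a look at the sample below: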

from pyspark.ml.feature import VectorAssembler

df = spark.createDataFrame([
    (1, 0, 3),
    (0, 0, 0),
], ["a", "b", "c"])
vecAssembler = VectorAssembler(outputCol="features")
vecAssembler.setInputCols(["a", "b", "c"])
vecAssembler.transform(df).collect()

[Row(a=1, b=0, c=3, features=DenseVector([1.0, 0.0, 3.0])),
 Row(a=0, b=0, c=0, features=SparseVector(3, {}))]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 pltc