user_id mismatch in PySpark ALS
I trained an ALS model on a dataset with 450,000 unique user_id values. I then extracted the user matrix with model.userMatrix and inner-joined that dataframe with my train dataframe on user_matrix.id == train.user_id. I expected the join to return a dataframe with the same number of unique user_id values as train (and the same number of id values as user_matrix). To my surprise, the result has only about 110,000 unique user_id values, i.e. not all of the user_id values are present in user_matrix, which should ideally have been the case.
I am unable to understand this. The count of unique user_id values in train and the count of id values in user_matrix are almost the same, but the inner join shows that the two sets of values are not equal.
Am I missing something here?
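One way to narrow this down is to look at which ids fall out of the join on each side, rather than only at the counts. The following is a minimal sketch in plain Python, assuming the distinct ids from both dataframes have been collected to the driver (for example via `select(...).distinct().collect()`), which should be feasible for a few hundred thousand values. The function name `compare_id_sets` and the sample values are hypothetical, and the sketch makes no claim about the actual cause.

```python
from collections import Counter


def compare_id_sets(train_ids, factor_ids):
    """Summarize how two collections of ids overlap (hypothetical helper)."""
    train, factors = set(train_ids), set(factor_ids)
    return {
        # Python type of the ids on each side; differing types are a red flag
        "train_types": Counter(type(i).__name__ for i in train),
        "factor_types": Counter(type(i).__name__ for i in factors),
        # what an inner join on equal values would keep
        "in_both": len(train & factors),
        # ids that an inner join would drop, per side
        "only_in_train": sorted(train - factors),
        "only_in_factors": sorted(factors - train),
    }


# Made-up sample: the factor ids look like a 0-based re-indexing
# of the training ids, so only some values coincide.
report = compare_id_sets([2, 3, 10, 11], [0, 1, 2, 3])
print(report["in_both"])          # 2
print(report["only_in_train"])    # [10, 11]
print(report["only_in_factors"])  # [0, 1]
```

If `only_in_factors` is non-empty, the model holds ids that never occur in train.user_id, which would suggest the model was fitted on a different or transformed id column rather than that the join is losing rows.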
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
