user_id mismatch in PySpark ALS

I trained an ALS model on a dataset with 450,000 unique user_id values. I then extracted the user matrix from it using model.userMatrix and inner-joined that dataframe with my train dataframe on user_matrix.id == train.user_id. I expected the inner join to return a dataframe with the same number of unique user_id values as train (and the same number of id values as user_matrix). To my surprise, the resulting dataframe has only about 110,000 unique user_id values, i.e. not all of the user_id values are present in user_matrix, which should ideally have been the case.

I am unable to understand this. The counts of user_id in train and id in user_matrix are almost the same, but the two sets of values are not equal (as the inner join result shows).

Am I missing something here?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
