Duplicate rows appearing after inner join

I am working with two PySpark DataFrames, each containing millions of rows, and I am trying to join them. Neither dataset contains duplicates, and the only column they share is a unique user identifier, yet joining them with an inner or left join produces duplicated rows.

I have:

  • D1 - user data set including the userID and some user-level variables such as 'gender' and 'date of birth'. No duplicates appear for D1.
+-----+-------------+----+--------+
| usID|   date_birth|code|  gender|
+-----+-------------+----+--------+
|  ID1|   2017-07-10|  L1|       F|
|  ID2|   2000-05-02|  L1|       M|
|  ID3|   1990-08-30|  L2|       F|
|  ID4|   1995-07-04|  L2|       M|
+-----+-------------+----+--------+
from pyspark.sql import functions as F

# Count rows that are identical across every column
print(df1.groupBy(df1.columns)\
     .count()\
     .where(F.col('count') > 1)\
     .count())

# OUTPUT: 0 -> No user duplicates
  • D2 - purchases data set including the user ID and details of a particular purchase. The same user can appear multiple times, but each purchase is unique: for a given user, every purchase has a distinct Unix timestamp.
+-----+-------------+-------+
| usID|    time_unix|item_id|
+-----+-------------+-------+
|  ID1|   1653063112|  53063|
|  ID1|   1653763145|  53064|
|  ID2|   1653089114|  53064|
|  ID2|   1653091516|  53062|
+-----+-------------+-------+
print(df2.groupBy(df2.columns)\
     .count()\
     .where(F.col('count') > 1)\
     .count()) 

# OUTPUT: 0 -> No purchase duplicates

My goal is to add the user information from D1 to every row in D2 that has a matching user ID.

# Add info from df1 to df2 with user unique identifier: usID
df = df2.join(df1, on = ['usID'], how = 'left').cache() 

# Testing duplicate rows
print(df.groupBy(df.columns)\
    .count()\
    .where(F.col('count') > 1)\
    .count()) 

# OUTPUT: 1.942.799.248 duplicates (30% of the rows got duplicated)
# df output sample:
+-----+-------------+-------+-------------+----+--------+
| usID|    time_unix|item_id|   date_birth|code|  gender|
+-----+-------------+-------+-------------+----+--------+
|  ID1|   1653063112|  53063|   2017-07-10|  L1|       F|
|  ID1|   1653063112|  53063|   2017-07-10|  L1|       F|
|  ID1|   1653063112|  53063|   2017-07-10|  L1|       F|
+-----+-------------+-------+-------------+----+--------+

Why are duplicated rows (with identical content in every column) appearing? Does it have something to do with the amount of data being joined?
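
One thing the checks above do not establish is that usID is unique in D1: grouping by all columns only finds rows that repeat in every column, not user IDs that appear on more than one row. The plain-Python sketch below uses hypothetical toy rows (not the real data) and a naive `left_join` helper written only for illustration, so that it runs without a Spark cluster. It shows a full-row check passing while a key-level check fails, and the join returning more rows than the left side. Note that this kind of fan-out produces rows that differ in at least one column, so on its own it may explain a growing row count but not rows that are identical in every column, as in the sample output.

```python
from collections import Counter

# Hypothetical toy rows standing in for df1 (users) and df2 (purchases).
users = [
    ("ID1", "2017-07-10", "L1", "F"),
    ("ID1", "2017-07-10", "L2", "F"),  # same usID, different code: not a full-row duplicate
    ("ID2", "2000-05-02", "L1", "M"),
]
purchases = [
    ("ID1", 1653063112, 53063),
    ("ID2", 1653089114, 53064),
]

def full_row_duplicates(rows):
    """Number of distinct rows that occur more than once (the check used in the question)."""
    return sum(1 for n in Counter(rows).values() if n > 1)

def repeated_keys(rows):
    """Keys (first field of each row) that occur on more than one row."""
    counts = Counter(row[0] for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

def left_join(left, right):
    """Naive left join on the first field of each row (illustration only)."""
    width = len(right[0]) - 1 if right else 0
    by_key = {}
    for row in right:
        by_key.setdefault(row[0], []).append(row[1:])
    joined = []
    for row in left:
        # Unmatched left rows are kept and padded with None, as in a left join.
        matches = by_key.get(row[0], [(None,) * width])
        joined.extend(row + match for match in matches)
    return joined

joined = left_join(purchases, users)
print(full_row_duplicates(users))   # full-row check on users
print(repeated_keys(users))         # key-level check on users
print(len(purchases), len(joined))  # row counts before and after the join
```

In PySpark, the corresponding key-level check would be along the lines of `df1.groupBy('usID').count().where(F.col('count') > 1).count()`, with a non-zero result meaning that some user IDs are repeated in D1.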



Solution 1:[1]

Inner text in template, as @Yuriy suggested, or:

<pre>{{ testText }}</pre>

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Misha Mashina