Duplicate rows appearing after inner join
I am working with two PySpark DataFrames containing millions of rows each and I am trying to join them. Neither data set contains duplicates, and the only column they share is a unique identifier, yet when I join them with an inner or left join I get duplicated rows.
I have:
- D1 - a user data set containing the user ID (usID) and some user-level variables such as 'gender' and 'date of birth'. No duplicates appear in D1.
+-----+-------------+----+--------+
| usID| date_birth|code| gender|
+-----+-------------+----+--------+
| ID1| 2017-07-10| L1| F |
| ID2| 2000-05-02| L1| M |
| ID3| 1990-08-30| L2| F |
| ID4| 1995-07-04| L2| M |
+-----+-------------+----+--------+
from pyspark.sql import functions as F

print(df1.groupBy(df1.columns)\
    .count()\
    .where(F.col('count') > 1)\
    .count())
# OUTPUT: 0 -> No user duplicates
- D2 - a purchases data set containing the user ID and the details of each purchase. The same user appears multiple times, but each purchase is unique: every purchase by a given user has a distinct Unix timestamp.
+-----+-------------+-------+
| usID| time_unix|item_id|
+-----+-------------+-------+
| ID1| 1653063112| 53063|
| ID1| 1653763145| 53064|
| ID2| 1653089114| 53064|
| ID2| 1653091516| 53062|
+-----+-------------+-------+
print(df2.groupBy(df2.columns)\
.count()\
.where(F.col('count') > 1)\
.count())
# OUTPUT: 0 -> No purchase duplicates
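For reference, the checks above look for duplicated full rows; a check restricted to the join key alone would look like this (a sketch reusing the same F alias and data frames):

print(df1.groupBy('usID')\
    .count()\
    .where(F.col('count') > 1)\
    .count())
# A value > 0 would mean usID repeats in df1 even though full rows are unique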
My goal is to attach the user information from D1 to every row in D2 with a matching user ID.
# Add info from df1 to df2 with user unique identifier: usID
df = df2.join(df1, on = ['usID'], how = 'left').cache()
# Testing duplicate rows
print(df.groupBy(df.columns)\
.count()\
.where(F.col('count') > 1)\
.count())
# OUTPUT: 1,942,799,248 duplicates (30% of the rows got duplicated)
# df output sample:
+-----+-------------+-------+-------------+----+--------+
| usID| time_unix|item_id| date_birth|code| gender|
+-----+-------------+-------+-------------+----+--------+
| ID1| 1653063112| 53063| 2017-07-10| L1| F |
| ID1| 1653063112| 53063| 2017-07-10| L1| F |
| ID1| 1653063112| 53063| 2017-07-10| L1| F |
+-----+-------------+-------+-------------+----+--------+
Why are these duplicated rows (identical in every column) appearing? Does it have something to do with the amount of data being joined?
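In case it helps with diagnosing this, a check like the following (a sketch reusing the joined df and the F alias from above) should list which usID values are behind the duplicated output rows:

dup_keys = df.groupBy(df.columns)\
    .count()\
    .where(F.col('count') > 1)\
    .select('usID')\
    .distinct()
dup_keys.show(10)
# Shows up to 10 usID values whose joined rows come out duplicated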
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
