Joining two RDDs in PySpark?
I asked a previous question about two RDDs I was trying to join. Here are the two RDDs:
+------+---+
| _1| _2|
+------+---+
|Python| 36|
| C| 6|
| C#| 8|
+------+---+
+------+---+
| _1| _2|
+------+---+
|Python| 10|
| C| 1|
| C#| 1|
+------+---+
After running the following line of code on the two RDDs, this was the result:
joined_rdd = rdd1.join(rdd2).map(lambda x: (x[0], *x[1]))
+------+---+---+
| _1| _2| _3|
+------+---+---+
|Python| 36| 10|
| C| 6| 1|
| C#| 8| 1|
+------+---+---+
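For reference, here is a minimal standalone sketch of that step; the sc.parallelize setup is only there to mirror the tables above (in my real pipeline the RDDs come from elsewhere):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical setup that mirrors the two tables above.
rdd1 = sc.parallelize([("Python", 36), ("C", 6), ("C#", 8)])
rdd2 = sc.parallelize([("Python", 10), ("C", 1), ("C#", 1)])

# join() on pair RDDs produces (key, (value1, value2));
# the map flattens that into (key, value1, value2).
joined_rdd = rdd1.join(rdd2).map(lambda x: (x[0], *x[1]))
print(joined_rdd.collect())
# [('Python', 36, 10), ('C', 6, 1), ('C#', 8, 1)]  (partition order may vary)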
This was exactly what I wanted. But if I now want to join another RDD to this three-column joined_rdd, how might I do that? The code I used originally does not work here, and every variation I have tried fails to produce the result I want (my best guess at the kind of re-keying needed is sketched after the tables below). Here is what I want it to look like:
rdd3
+------+---+
| _1| _2|
+------+---+
|Python| 8|
| C| 15|
| C#|100|
+------+---+
After joining with joined_rdd:
final_joined_rdd
+------+---+---+---+
|    _1| _2| _3| _4|
+------+---+---+---+
|Python| 36| 10| 8|
| C| 6| 1| 15|
| C#| 8| 1|100|
+------+---+---+---+
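My best guess is that joined_rdd has to be turned back into a pair RDD before the second join, something like the sketch below, but I'm not sure this is the right (or idiomatic) way, and the variable names are just illustrative:

# Re-pair joined_rdd: (key, v1, v2) -> (key, (v1, v2)) so join() can use the key again.
rekeyed = joined_rdd.map(lambda x: (x[0], x[1:]))

# join() gives (key, ((v1, v2), v3)); flatten that into (key, v1, v2, v3).
final_joined_rdd = rekeyed.join(rdd3).map(lambda x: (x[0], *x[1][0], x[1][1]))

print(final_joined_rdd.collect())
# Expected: [('Python', 36, 10, 8), ('C', 6, 1, 15), ('C#', 8, 1, 100)]  (order may vary)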
Any help to achieve this result would be appreciated, thanks!
Note: I cannot convert these RDDs to DataFrames and then join them, because the RDDs are really just DStream objects.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow