PySpark cross join between 2 dataframes with millions of records

I have 2 dataframes, A (35 million records) and B (30,000 records).

A

| Text |
|------|
| pqr  |
| xyz  |

B

| Title |
|-------|
| a     |
| b     |
| c     |
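
For reference, a minimal sketch constructing toy stand-ins for A and B (the SparkSession setup is an assumption; on Databricks, spark already exists, and the real A has ~35 million rows):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for the real dataframes (A is ~35M rows, B is ~30k)
# Spark resolves column names case-insensitively by default,
# so col('text') below also matches the Text column.
A = spark.createDataFrame([("pqr",), ("xyz",)], ["Text"])
B = spark.createDataFrame([("a",), ("b",), ("c",)], ["Title"])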

Below, dataframe C is obtained after a cross join between A and B (crossJoin takes no join condition, since a cross join pairs every row of A with every row of B):

c = A.crossJoin(B)

C

| text | Title |
|------|-------|
| pqr  | a     |
| pqr  | b     |
| pqr  | c     |
| xyz  | a     |
| xyz  | b     |
| xyz  | c     |

Both columns above are of type String.

I am performing the below operation, and it results in a Spark error (Job aborted due to stage failure):

from pyspark.sql.functions import col, when
display(c.withColumn("Contains", when(col('text').contains(col('Title')), 1).otherwise(0)).filter(col('Contains') == 0).distinct())

Any suggestions on how this join should be done to avoid the Spark error in the resulting operations?

Spark error message: [screenshot omitted]

Solution 1:[1]

Try using a broadcast join. Broadcast the smaller dataframe B (~30,000 rows) rather than the 35-million-row A, so Spark ships a copy of B to every executor instead of shuffling A:

from pyspark.sql.functions import broadcast
c = A.crossJoin(broadcast(B))
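
With the broadcast hint on the cross join, Spark plans a broadcast nested loop join: each partition of A is paired against a local copy of B, so the large dataframe never has to be shuffled across the cluster.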

If you don't need the extra "Contains" column, then you can filter directly. Note the ~ negation, which matches the original Contains == 0 condition (rows where text does not contain Title):

from pyspark.sql.functions import col
display(c.filter(~col("text").contains(col("Title"))).distinct())
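
Putting it together, a minimal end-to-end sketch (assuming the toy A and B constructed above; show() stands in for the Databricks-specific display()):

from pyspark.sql.functions import broadcast, col

# Broadcast the ~30k-row B so the cross join never shuffles the 35M-row A
c = A.crossJoin(broadcast(B))

# Keep only pairs where Text does NOT contain Title (the Contains == 0 case)
result = c.filter(~col("Text").contains(col("Title"))).distinct()
result.show()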

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
[1] Solution 1: n1tk