What is a reason why two PySpark dataframes wouldn't be equal if they have the same schema and data?

Suppose we have two PySpark dataframes df1 and df2, each with the same number of rows (5 rows). If df1.schema == df2.schema and df1.take(5) == df2.take(5), why wouldn't df1 and df2 compare equal?



Solution 1:[1]

Data handled by Spark is distributed across worker nodes (executors), and the partitioning and row order are neither stable nor predictable, so comparing df1 == df2 directly makes no sense: it does not compare contents, and even identical data can come back in a different order on different runs. If you truly want to compare them, and as long as they have the same schema, compare them as sets: check that the counts match and that the difference is empty in both directions, i.e. df1.subtract(df2).count() == 0 and df2.subtract(df1).count() == 0. Note that subtract() behaves like EXCEPT DISTINCT, so the extra count check guards against the two frames differing only in duplicate rows.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 pltc