Why does Spark need S3 to connect to a Redshift warehouse, while Python pandas can read Redshift tables directly?

Sorry in advance for this dumb question. I am just beginning with AWS and PySpark. I was reviewing the PySpark library and I see that PySpark needs a tempdir in S3 to be able to read data from Redshift. My question is: why does PySpark need this temporary S3 directory? Other libraries, like pandas for instance, can read Redshift tables directly without using any temporary directory. Thanks to everyone.

Luis



Solution 1:

The Redshift data source uses Amazon S3 to efficiently transfer data in and out of Redshift and uses JDBC to automatically trigger the appropriate COPY and UNLOAD commands on Redshift.

See https://docs.databricks.com/data/data-sources/aws/amazon-redshift.html
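As a minimal sketch of what that looks like in practice (the cluster endpoint, bucket, table name, and credentials are all placeholders, the connector jar is assumed to be on the classpath, and the format string and option names follow the open-source spark-redshift community connector rather than anything Databricks-specific):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-read").getOrCreate()

# Hypothetical connection details; option names follow the
# spark-redshift community connector.
df = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev?user=me&password=secret")
    .option("dbtable", "public.sales")
    # S3 staging area: the connector issues an UNLOAD into this prefix,
    # and the Spark executors then read the resulting files in parallel.
    .option("tempdir", "s3a://my-bucket/redshift-temp/")
    .option("forward_spark_s3_credentials", "true")
    .load()
)

df.show()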

This is an implementation choice made for performance. Redshift's bulk UNLOAD and COPY commands move data to and from S3 in parallel across the cluster's nodes, and Spark's executors can then read or write those S3 files in parallel as well. At scale, that is far faster than streaming rows one at a time over a single JDBC connection, which is essentially what pandas does when it reads a Redshift table directly.
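For contrast, here is a rough sketch of the direct-connection path pandas takes (connection details are hypothetical; psycopg2 works here because Redshift speaks the PostgreSQL wire protocol). Every row travels through that one connection, with no S3 staging involved:

import pandas as pd
import psycopg2

# Hypothetical cluster details; Redshift accepts PostgreSQL-protocol
# clients such as psycopg2 on port 5439.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="me",
    password="secret",
)

# pandas pulls the whole result set over this single connection into
# local memory: fine for small tables, a bottleneck for large ones.
df = pd.read_sql("SELECT * FROM public.sales", conn)
conn.close()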
