Why does Spark need S3 to connect to a Redshift warehouse, while Python pandas can read Redshift tables directly?
Sorry in advance for this dumb question. I am just beginning with AWS and PySpark. While reviewing the PySpark library, I saw that PySpark needs a tempdir in S3 to be able to read data from Redshift. My question is: why does PySpark need this temporary S3 directory? Other libraries, pandas for instance, can read Redshift tables directly without using any temporary directory. Thanks to everyone.
Luis
Solution 1:[1]
The Redshift data source uses Amazon S3 to efficiently transfer data in and out of Redshift and uses JDBC to automatically trigger the appropriate COPY and UNLOAD commands on Redshift.
See https://docs.databricks.com/data/data-sources/aws/amazon-redshift.html
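As a concrete illustration, a read through that connector looks roughly like the sketch below. The cluster endpoint, database, table name, bucket, and credentials are placeholders, not values from the question:

```python
# Minimal sketch of reading a Redshift table with the spark-redshift
# connector described in the Databricks docs above. All connection
# details below are placeholder assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-read").getOrCreate()

df = (
    spark.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://examplecluster.abc123.us-west-2"
                   ".redshift.amazonaws.com:5439/dev?user=username&password=pass")
    .option("dbtable", "my_table")
    # The S3 staging area the question asks about: the connector issues
    # an UNLOAD of my_table to this path, then Spark executors read the
    # resulting files in parallel.
    .option("tempdir", "s3a://my-bucket/spark-redshift-temp/")
    .option("forward_spark_s3_credentials", "true")
    .load()
)

df.show()
```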
This is an implementation choice made for performance. Redshift's bulk UNLOAD (for reads) and COPY (for writes) move data through S3 in parallel across the cluster, which is much faster than streaming every row through a single JDBC connection the way a plain Spark JDBC source, or pandas, would.
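For contrast, the pandas path the question mentions is a single database connection through which every row travels, so no S3 staging area is needed, but nothing happens in parallel either. A minimal sketch, again with placeholder connection details:

```python
# Minimal sketch of reading a Redshift table directly with pandas over
# a single Postgres-protocol connection. Endpoint, database, and
# credentials are placeholder assumptions.
import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host="examplecluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="username",
    password="pass",
)

# Every row is fetched through this one connection by the driver
# process; fine for small results, slow for warehouse-scale tables.
df = pd.read_sql("SELECT * FROM my_table", conn)
conn.close()
```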
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |