Connecting a Spark cluster to an S3 bucket
We have configured a Spark standalone cluster in our organization. The cluster currently has a single worker, and both the master and the worker run Apache Spark 3.2.1 (built for Hadoop 3.2). I have verified that a machine with PySpark installed can connect to this cluster and run jobs without issues.
However, when I try to connect my Spark application to an S3 bucket (hosted by our organization, not AWS), it fails. I have tried many configurations and installed dependencies such as
- hadoop-aws
- aws-java-sdk
- hadoop-common
- aws-java-sdk-bundle
But the connection was never successful (each attempt fails with a different error depending on which dependency I attach to the application). These are the dependencies most commonly mentioned in other people's solutions, but there does not seem to be one agreed-upon fix; the suggestions vary widely.
The latest code I have used:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config("spark.master", url)\
    .config("spark.app.name", app_name)\
    .config("spark.executor.memory", executor_memory)\
    .config("spark.driver.memory", driver_memory)\
    .config("spark.executor.instances", 2)\
    .config("spark.executor.cores", 2)\
    .config("spark.driver.extraJavaOptions",
            "-Dhttp.proxyHost=proxyhost -Dhttp.proxyPort=proxyport -Dhttps.proxyHost=proxyhost -Dhttps.proxyPort=proxyport -Dcom.amazonaws.services.s3.enableV4=true")\
    .config("spark.executor.extraJavaOptions",
            "-Dhttp.proxyHost=proxyhost -Dhttp.proxyPort=proxyport -Dhttps.proxyHost=proxyhost -Dhttps.proxyPort=proxyport -Dcom.amazonaws.services.s3.enableV4=true")\
    .config("spark.jars.packages", "com.amazonaws:aws-java-sdk-bundle:1.12.175,org.apache.hadoop:hadoop-aws:3.3.1,org.apache.hadoop:hadoop-client:3.3.1")\
    .config("spark.hadoop.fs.s3a.access.key", access_key)\
    .config("spark.hadoop.fs.s3a.secret.key", secret_key)\
    .config("spark.hadoop.fs.s3a.endpoint", endpoint_url)\
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")\
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")\
    .config("spark.hadoop.fs.s3a.path.style.access", "true")\
    .getOrCreate()
With the code above, the Jupyter notebook cell that reads a specific CSV file from the S3 bucket starts running but never finishes executing. The code in that cell is
df = spark.read.csv("s3a://working-data-fraud/file.csv", header=True, sep=",")
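A quick way to confirm which Hadoop build the S3A connector jars have to match is to ask the driver-side JVM. This is a diagnostic sketch that uses PySpark's internal _jvm py4j gateway, so treat it as a debugging aid rather than a stable public API:

# Print the Hadoop version loaded in the driver-side JVM.
# org.apache.hadoop.util.VersionInfo ships with hadoop-common.
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())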
Solution 1:[1]
Have a look at the Apache Hadoop S3A (hadoop-aws) documentation.
Only ever use the same aws-java-sdk-bundle JAR that your hadoop-aws release was built with; these are fussy releases. Whatever the right version is, it is not a 1.12.x release.
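A minimal sketch of what that advice implies, assuming hadoop-aws 3.3.1 (whose POM, to my knowledge, declares aws-java-sdk-bundle 1.11.901; verify against the POM of whichever hadoop-aws release matches the Hadoop build actually on your cluster):

from pyspark.sql import SparkSession

# Assumption: hadoop-aws 3.3.1 pairs with aws-java-sdk-bundle 1.11.901.
# Confirm this in the hadoop-aws POM for the release you deploy.
packages = ",".join([
    "org.apache.hadoop:hadoop-aws:3.3.1",
    "com.amazonaws:aws-java-sdk-bundle:1.11.901",
])

spark = (
    SparkSession.builder
    .config("spark.master", url)
    .config("spark.jars.packages", packages)
    .config("spark.hadoop.fs.s3a.access.key", access_key)
    .config("spark.hadoop.fs.s3a.secret.key", secret_key)
    .config("spark.hadoop.fs.s3a.endpoint", endpoint_url)
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

The same matching rule applies to hadoop-aws itself: keep it at the same version as the Hadoop jars already on the cluster rather than mixing releases.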
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | stevel |
