Connecting Spark cluster to S3 bucket

We have configured a Spark standalone cluster in our organization. Currently the cluster has only one worker. Both the master and the worker run Apache Spark 3.2.1 (Hadoop 3.2). I have verified that connecting from a machine (with PySpark installed) to this cluster works fine.

However, when I try to connect my Spark application to an S3 bucket (hosted by our organization, not AWS), it fails. I have tried many configurations and installed dependencies such as

  • hadoop-aws
  • aws-java-sdk
  • hadoop-common
  • aws-java-sdk-bundle

But the connection was never successful (each dependency I attach to the application produces a different error). These are the dependencies most commonly mentioned when I look at how others have solved this, yet there is no single fix; the suggested solutions all differ.
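One way to take the guesswork out of choosing those versions (a minimal sketch, assuming a plain SparkSession named spark is already running, e.g. one created without any S3 settings) is to ask the JVM which Hadoop version the Spark installation actually ships, and pick hadoop-aws with exactly that version:

# VersionInfo comes from hadoop-common, which Spark already bundles, so this
# reports the Hadoop version the installation was built against.
hadoop_version = spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
print(hadoop_version)  # e.g. "3.3.1" -> use org.apache.hadoop:hadoop-aws:3.3.1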

The latest code I have used:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.master", url) \
    .config("spark.app.name", app_name) \
    .config("spark.executor.memory", executor_memory) \
    .config("spark.driver.memory", driver_memory) \
    .config("spark.executor.instances", 2) \
    .config("spark.executor.cores", 2) \
    .config("spark.driver.extraJavaOptions",
            "-Dhttp.proxyHost=proxyhost -Dhttp.proxyPort=proxyport -Dhttps.proxyHost=proxyhost -Dhttps.proxyPort=proxyport -Dcom.amazonaws.services.s3.enableV4=true") \
    .config("spark.executor.extraJavaOptions",
            "-Dhttp.proxyHost=proxyhost -Dhttp.proxyPort=proxyport -Dhttps.proxyHost=proxyhost -Dhttps.proxyPort=proxyport -Dcom.amazonaws.services.s3.enableV4=true") \
    .config("spark.jars.packages", "com.amazonaws:aws-java-sdk-bundle:1.12.175,org.apache.hadoop:hadoop-aws:3.3.1,org.apache.hadoop:hadoop-client:3.3.1") \
    .config("spark.hadoop.fs.s3a.access.key", access_key) \
    .config("spark.hadoop.fs.s3a.secret.key", secret_key) \
    .config("spark.hadoop.fs.s3a.endpoint", endpoint_url) \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .getOrCreate()
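One thing worth checking in a notebook: getOrCreate() reuses any session that is already running, and in that case builder settings such as spark.jars.packages are silently ignored, so the S3A classes may never reach the driver or executors. A minimal sanity check (the "NOT SET" fallback string is just illustrative):

# If a session was already running, stop it first so the builder settings
# (in particular spark.jars.packages) are actually applied:
# spark.stop()

# Verify the packages setting made it into the active session's configuration.
print(spark.sparkContext.getConf().get("spark.jars.packages", "NOT SET"))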

With the above code, the cell (in the Jupyter notebook) that reads a specific CSV file from the S3 bucket runs but never finishes executing. The code is:

df = spark.read.csv('s3a://working-data-fraud/file.csv', header=True, sep=',')
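When the endpoint or proxy settings are wrong, S3A's default retry and timeout behaviour can make a read like this appear to hang rather than fail. A minimal diagnostic sketch, assuming the session above: the timeout keys are standard fs.s3a.* options with purely illustrative values, and the bucket listing goes through the Hadoop FileSystem API directly so any connection error surfaces immediately instead of disappearing into retries.

# Illustrative fail-fast settings; append to the builder before getOrCreate():
#   .config("spark.hadoop.fs.s3a.connection.establish.timeout", "5000")
#   .config("spark.hadoop.fs.s3a.connection.timeout", "5000")
#   .config("spark.hadoop.fs.s3a.attempts.maximum", "1")
#   .config("spark.hadoop.fs.s3a.retry.limit", "1")

# Sanity check: list the bucket through the Hadoop FileSystem API (via py4j),
# bypassing the DataFrame reader, so S3A raises its error right away.
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
path = jvm.org.apache.hadoop.fs.Path("s3a://working-data-fraud/")
fs = path.getFileSystem(hadoop_conf)
for status in fs.listStatus(path):
    print(status.getPath())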


Solution 1:[1]

Have a look at the Apache Hadoop AWS docs.

Only ever use the same aws-java-sdk-bundle JAR that the Hadoop release was built with; these are fussy about version matching. It is not a 1.12.x release, whatever it was.
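Concretely for the configuration above, that means dropping hadoop-client from spark.jars.packages (the Spark installation already ships its own Hadoop client jars) and pinning aws-java-sdk-bundle to whatever version the hadoop-aws 3.3.1 POM declares rather than a 1.12.x one. A minimal sketch; the 1.11.901 version below is my reading of hadoop-aws 3.3.1's dependency and should be verified against its POM:

from pyspark.sql import SparkSession

# hadoop-aws must match the Hadoop version the Spark install was built with,
# and aws-java-sdk-bundle must match the version hadoop-aws itself depends on
# (1.11.901 for hadoop-aws 3.3.1 is an assumption -- check the POM).
spark = (
    SparkSession.builder
    .config("spark.master", url)
    .config("spark.app.name", app_name)
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.3.1,"
            "com.amazonaws:aws-java-sdk-bundle:1.11.901")
    .config("spark.hadoop.fs.s3a.access.key", access_key)
    .config("spark.hadoop.fs.s3a.secret.key", secret_key)
    .config("spark.hadoop.fs.s3a.endpoint", endpoint_url)
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)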

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 stevel