How to read a CSV file from an S3 bucket using PySpark

I'm using Apache Spark 3.1.0 with Python 3.9.6. I'm trying to read a CSV file from an AWS S3 bucket, like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
file = "s3://bucket/file.csv"

c = spark.read\
    .csv(file)\
    .count()

print(c)

But I'm getting the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o26.csv.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"

I understand that I need to add extra libraries, but I couldn't find clear information on which ones and which versions. I tried adding the following to my code, but I'm still getting the same error:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'

How can I fix this?



Solution 1:[1]

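You need to use hadoop-aws version 3.2.0 for Spark 3 (more precisely, the hadoop-aws version should match the Hadoop version your Spark build ships with). Specifying the hadoop-aws library in --packages is enough to read files from S3.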

--packages org.apache.hadoop:hadoop-aws:3.2.0

You need to set the configurations below.

spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<access_key>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<secret_key>")

After that you can read the CSV file. Note the s3a:// scheme instead of s3://.

spark.read.csv("s3a://bucket/file.csv")
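
As a minimal sketch of how the pieces above might fit together in one script: the package and the two keys can also be supplied while the session is being built, using Spark's spark.jars.packages setting and the spark.hadoop. prefix, which Spark copies into the Hadoop configuration. The helper names (to_s3a, s3a_conf, build_session) and the placeholder credentials are assumptions made for illustration, and the default hadoop-aws version assumes a Spark build that ships with Hadoop 3.2.0.

```python
def to_s3a(path: str) -> str:
    """Rewrite an s3:// URI to s3a://, the scheme handled by the S3A connector in hadoop-aws."""
    prefix = "s3://"
    if path.startswith(prefix):
        return "s3a://" + path[len(prefix):]
    return path


def s3a_conf(access_key: str, secret_key: str, hadoop_aws_version: str = "3.2.0") -> dict:
    """Spark settings equivalent to --packages plus the two hadoopConfiguration().set() calls."""
    return {
        # Assumption: this version matches the Hadoop version bundled with your Spark.
        "spark.jars.packages": f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def build_session(conf: dict):
    """Create a SparkSession with the given settings (requires pyspark)."""
    # Imported lazily so the two helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage (needs pyspark installed and network access to fetch the package):
#   spark = build_session(s3a_conf("<access_key>", "<secret_key>"))
#   print(spark.read.csv(to_s3a("s3://bucket/file.csv")).count())
```

The spark.jars.packages setting only takes effect if it is set before the first session (and its JVM) is created in the process, so build_session should run before any other Spark code.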

Solution 2:[2]

Thanks Mohana for the pointer! After racking my brain for more than a day, I was finally able to figure it out. Summarizing what I learned:

Check which version of Hadoop your Spark comes with:

print(f'pyspark hadoop version: {spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()}')

or look at the Hadoop jars in the Spark installation directory:

ls jars/hadoop*.jar

The issue I was having was that I had an older version of Spark, installed a while back, that came with Hadoop 2.7 and was messing everything up.

This should give you a rough idea of which binaries you need to download.

For me it was Spark 3.2.1 and Hadoop 3.3.1.
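
The ls jars/hadoop*.jar check above can be turned into a small helper that reads the Hadoop version off the jar file names and builds the matching hadoop-aws coordinate. This is only a sketch: the function names are hypothetical, and it assumes the jars are named like hadoop-client-api-3.3.1.jar, hadoop-client-3.2.0.jar or hadoop-common-2.7.4.jar; a Spark build laid out differently would need a different pattern.

```python
import re
from pathlib import Path

# Assumption: the Hadoop version can be read from one of these core jar names,
# e.g. hadoop-client-api-3.3.1.jar or hadoop-common-2.7.4.jar. Helper jars with
# their own versioning (such as hadoop-shaded-guava) are deliberately ignored.
_HADOOP_JAR = re.compile(r"^hadoop-(?:client-api|client|common)-(\d+\.\d+\.\d+)\.jar$")


def bundled_hadoop_versions(jars_dir: str) -> set:
    """Return the set of Hadoop versions found in the jar names under jars_dir."""
    versions = set()
    for jar in Path(jars_dir).glob("hadoop-*.jar"):
        match = _HADOOP_JAR.match(jar.name)
        if match:
            versions.add(match.group(1))
    return versions


def hadoop_aws_coordinate(jars_dir: str) -> str:
    """Build the --packages coordinate for the hadoop-aws release matching the bundled Hadoop."""
    versions = bundled_hadoop_versions(jars_dir)
    if len(versions) != 1:
        # Zero or several versions usually means a wrong directory or a mixed-up install.
        raise ValueError(f"expected exactly one Hadoop version, found: {sorted(versions)}")
    return f"org.apache.hadoop:hadoop-aws:{versions.pop()}"
```

Calling hadoop_aws_coordinate("spark/jars") on an installation like the one described here would be expected to give the hadoop-aws 3.3.1 coordinate. The ValueError for multiple versions is a deliberate choice, since leftover jars from an older install are exactly the kind of problem described above.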

Hence I downloaded:

https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.3.1
https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.11.901 (added this just in case)

I placed these jar files in the Spark installation directory: spark/jars/

spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 runner.py

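Here runner.py contains your code snippet that reads from AWS S3. Note that spark-submit options such as --packages must come before the script name; anything after it is passed to the script as an argument.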

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Mohana B C
Solution 2 Achilleus