PySpark read CSV file from S3 bucket: AnalysisException: Path does not exist

In Google Colab I'm trying to get PySpark to read a CSV file from an S3 bucket.

This is my code:

# Read in data from S3 Buckets
from pyspark import SparkFiles
url = "https://bucket-name.s3.amazonaws.com/filename.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("filename.csv"), sep=",", header=True)

# Show DataFrame
df.show()

And this is my return:

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-14-5d0cdc44d2c4> in <module>()
      4 url = "https://bucket-name.s3.amazonaws.com/filename.csv"
      5 spark.sparkContext.addFile(url)
----> 6 df = spark.read.csv(SparkFiles.get("filename.csv"), sep=",", header=True)
      7 
      8 # Show DataFrame

2 frames
/content/spark-3.1.2-bin-hadoop3.2/python/pyspark/sql/utils.py in deco(*a, **kw)
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

AnalysisException: Path does not exist: file:/tmp/spark-d308539f-6371-4081-b6f4-e5f13ca7ed5b/userFiles-05f00260-eb10-4e31-8a5f-3abc12a17149/filename.csv

I'm trying to have it read the file from the S3 bucket. I've enabled public access on both the bucket and the file.



Solution 1:[1]

I have done this slightly differently:

import boto3
import io


def get_bucket(bucket_name: str):
    """
    Return the specified bucket.

    :param bucket_name: name of the bucket to return
    :return: the bucket
    """
    s3 = boto3.resource("s3")
    bucket = s3.Bucket(bucket_name)
    return bucket


def read_file(bucket, key, encoding="utf-8") -> str:
    # Download the object into an in-memory buffer, rewind it,
    # then decode the bytes as text
    file_obj = io.BytesIO()
    bucket.download_fileobj(key, file_obj)
    file_obj.seek(0)
    wrapper = io.TextIOWrapper(file_obj, encoding=encoding)
    return wrapper.read()


bucket = get_bucket("myBucket")
file_as_str = read_file(bucket, <KEY>)
# Split the file into one string per record so Spark can parallelize it
csvData = spark.sparkContext.parallelize(file_as_str.splitlines())
df = spark.read.option("header", True).option("inferSchema", True).option("sep", ",").csv(csvData)


Note that <KEY> is the S3 key of your file.
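The `parallelize` step works because `spark.read.csv` can also build a DataFrame from an RDD of CSV-formatted strings, one record per element. A minimal pure-Python sketch of that splitting step, using made-up sample data in place of the downloaded file (no AWS or Spark needed):

```python
import csv

# Hypothetical CSV content, standing in for what read_file() returns
file_as_str = "id,name,score\n1,alice,90\n2,bob,85\n"

# splitlines() yields one CSV record per element -- the shape
# spark.sparkContext.parallelize() expects
lines = file_as_str.splitlines()
print(lines)  # ['id,name,score', '1,alice,90', '2,bob,85']

# Sanity-check that each line parses as a CSV row
rows = list(csv.reader(lines))
print(rows[0])  # ['id', 'name', 'score']
```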

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Dharman