Pyspark read csv file from S3 bucket : AnalysisException: Path does not exist
In Google Colab I'm trying to get PySpark to read a CSV file from an S3 bucket.
This is my code:
```python
# Read in data from an S3 bucket
from pyspark import SparkFiles

url = "https://bucket-name.s3.amazonaws.com/filename.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("filename.csv"), sep=",", header=True)

# Show DataFrame
df.show()
```
And this is the traceback I get:
```
---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-14-5d0cdc44d2c4> in <module>()
      4 url = "https://bucket-name.s3.amazonaws.com/filename.csv"
      5 spark.sparkContext.addFile(url)
----> 6 df = spark.read.csv(SparkFiles.get("filename.csv"), sep=",", header=True)
      7
      8 # Show DataFrame

2 frames
/content/spark-3.1.2-bin-hadoop3.2/python/pyspark/sql/utils.py in deco(*a, **kw)
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

AnalysisException: Path does not exist: file:/tmp/spark-d308539f-6371-4081-b6f4-e5f13ca7ed5b/userFiles-05f00260-eb10-4e31-8a5f-3abc12a17149/filename.csv
```
I'm trying to have it read the file from the S3 bucket. I've enabled public access permissions on both the bucket and the file.
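One plausible reading of the error: `SparkFiles.get` does not verify that `addFile` actually downloaded anything; it essentially just joins Spark's driver-local temp root with the file name, so it returns a path whether or not the file exists. A minimal sketch of that behaviour (`spark_files_get` here is a hypothetical re-implementation for illustration, not the real PySpark API):

```python
import os

def spark_files_get(root_dir: str, filename: str) -> str:
    # Hypothetical re-implementation: SparkFiles.get essentially joins the
    # driver's temp root with the file name, without checking existence.
    return os.path.join(root_dir, filename)

path = spark_files_get("/tmp/spark-d308539f/userFiles-05f00260", "filename.csv")
# If addFile failed to fetch the URL, this path points at nothing,
# and spark.read.csv then raises "Path does not exist":
print(os.path.exists(path))
```

So the exception suggests the download step itself failed silently, rather than `spark.read.csv` being called incorrectly.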
Solution 1:[1]
I have done this slightly differently:
```python
import io

import boto3


def get_bucket(bucket_name: str):
    """
    Returns the specified bucket.

    :param bucket_name: the name of the bucket to return
    :return: the bucket
    """
    s3 = boto3.resource("s3")
    return s3.Bucket(bucket_name)


def read_file(bucket, key, encoding="utf-8") -> str:
    """Download the object at `key` into memory and decode it to text."""
    file_obj = io.BytesIO()
    bucket.download_fileobj(key, file_obj)
    file_obj.seek(0)  # rewind before reading
    return io.TextIOWrapper(file_obj, encoding=encoding).read()


bucket = get_bucket("myBucket")
file_as_str = read_file(bucket, <KEY>)

# Parallelize the file contents as one record per line, then let Spark parse it
csv_data = spark.sparkContext.parallelize(file_as_str.splitlines())
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .option("sep", ",")
    .csv(csv_data)
)
```
Note that <KEY> is the S3 key of your file.
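You can exercise the download-and-decode step of `read_file` without touching S3 by passing it a stand-in for the boto3 bucket object (the `FakeBucket` class below is hypothetical, for illustration only; it mimics `Bucket.download_fileobj` writing bytes into the supplied buffer):

```python
import io

def read_file(bucket, key, encoding="utf-8") -> str:
    """Download an object into memory and decode it to text."""
    file_obj = io.BytesIO()
    bucket.download_fileobj(key, file_obj)
    file_obj.seek(0)  # rewind before reading
    return io.TextIOWrapper(file_obj, encoding=encoding).read()

class FakeBucket:
    """Hypothetical stand-in mimicking boto3's Bucket.download_fileobj."""
    def __init__(self, objects):
        self._objects = objects

    def download_fileobj(self, key, fileobj):
        fileobj.write(self._objects[key])

bucket = FakeBucket({"filename.csv": b"name,age\nalice,30\nbob,25\n"})
text = read_file(bucket, "filename.csv")
rows = text.splitlines()  # lines ready for sparkContext.parallelize
# rows == ["name,age", "alice,30", "bob,25"]
```

Splitting into lines first matters: passing a file-like object straight to `parallelize` can hand Spark records with trailing newlines, whereas `splitlines()` yields clean one-record-per-line strings.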
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Dharman |
