Unable to read parquet files from directory with pyarrow

I'm using pyarrow (0.12.1) to read parquet files from S3.

Here is the code I'm using:

import s3fs
from pyarrow.parquet import ParquetDataset

s3 = s3fs.S3FileSystem()
base_pya_dataset = ParquetDataset('s3://bucket1/source/schema/table_name/2019_10_31_19_59_16', filesystem=s3)

I get the following error when trying to create the ParquetDataset:

"errorMessage": "Corrupted file, smaller than file footer",
  "errorType": "ArrowIOError",

What am I doing wrong? The thing that is very confusing to me is that I had this working before (yesterday). Nothing that I can spot has changed beyond the parquet files I'm using. Do the parquet files have to be of a certain type?

I have already tried adding a trailing forward slash to the path.

When I feed it a path to one file, it works. Clearly something is wrong with the way it reads the files from the directory I give it.



Solution 1:[1]

That error may mean it's trying to read a file that isn't Parquet.

I'd recommend upgrading to the latest version of pyarrow (0.15.1) and trying again. There has been lots of development since 0.12.1 and it's possible that whatever corner you've run into has been addressed.
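One way to check for a non-Parquet file is to look at the magic bytes: an unencrypted Parquet file starts and ends with the four bytes PAR1, so anything shorter than that framing, such as a zero-byte marker object, cannot be valid. The following is a minimal sketch, not part of pyarrow; the helper names `looks_like_parquet` and `find_suspect_files` are hypothetical, and passing `s3.open` as `open_fn` assumes the s3fs file objects support `seek`.

```python
import os

PARQUET_MAGIC = b"PAR1"
# Lower bound: 4-byte leading magic + 4-byte footer length + 4-byte trailing magic.
MIN_PARQUET_SIZE = 12


def looks_like_parquet(path, open_fn=open):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    with open_fn(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if size < MIN_PARQUET_SIZE:
            # Too small to hold even the framing bytes (e.g. an empty marker file).
            return False
        f.seek(0)
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def find_suspect_files(paths, open_fn=open):
    """Return the paths that do not look like Parquet files, in input order."""
    return [p for p in paths if not looks_like_parquet(p, open_fn)]


# Hypothetical usage against S3 (assumption: `s3` is an s3fs.S3FileSystem):
#   paths = s3.ls("bucket1/source/schema/table_name/2019_10_31_19_59_16")
#   print(find_suspect_files(paths, open_fn=s3.open))
```

Any path this reports is a candidate for the file that triggers the error. Note that the check only inspects the framing bytes; it does not validate the footer itself.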

Solution 2:[2]

After more testing, it seems that ParquetDataset does not work on a directory of plain parquet files (or even a single one), even if they all have the same schema. When I put partitioned parquet files written by Spark in a directory, I receive no errors.

Even when I download those partitioned parquet files from Spark and then re-upload them to S3, the read fails. I can't imagine how downloading the parquet files and re-uploading them would mess up the schema to the point where it raises an exception (especially since I do not open or modify the actual parquet files).

It is very strange, and it makes me a bit uneasy about relying on this.
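A possible workaround, offered as a guess rather than a confirmed diagnosis, is to pass ParquetDataset an explicit list of data files instead of the directory, skipping empty objects and anything without a .parquet suffix. The sketch below assumes the listing is a list of dicts with "name" and "size" keys, which is the shape fsspec-style filesystems return from `ls(path, detail=True)`; the helper `select_data_files` is hypothetical.

```python
def select_data_files(entries, suffix=".parquet"):
    """Keep the names of non-empty entries that end with `suffix`.

    `entries` is assumed to be a list of dicts with "name" and "size" keys
    (an assumption about the listing format, not a pyarrow API).
    """
    return [
        e["name"]
        for e in entries
        if e.get("size", 0) > 0 and e["name"].endswith(suffix)
    ]


# Hypothetical usage against S3 (assumption: `s3` is an s3fs.S3FileSystem):
#   entries = s3.ls("bucket1/source/schema/table_name/2019_10_31_19_59_16", detail=True)
#   dataset = ParquetDataset(select_data_files(entries), filesystem=s3)
```

If the read succeeds with the filtered list but fails on the directory, the difference between the two listings shows which object is the culprit.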

Solution 3:[3]

I know this is an old question, but this error came to me recently. I believe you should not have the "s3://" prefix on your file path if you include a filesystem parameter.

Check this out as reference: https://issues.apache.org/jira/browse/ARROW-10937
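As a minimal sketch of that suggestion, the scheme can be stripped from the path before it is handed to ParquetDataset together with the filesystem object. The helper name `strip_s3_scheme` is hypothetical, and whether the prefix actually causes the failure may depend on the pyarrow and s3fs versions in use.

```python
def strip_s3_scheme(path):
    """Return `path` without a leading "s3://", leaving other paths unchanged."""
    prefix = "s3://"
    return path[len(prefix):] if path.startswith(prefix) else path


# Hypothetical usage (assumption: `s3` is an s3fs.S3FileSystem):
#   path = strip_s3_scheme("s3://bucket1/source/schema/table_name/2019_10_31_19_59_16")
#   dataset = ParquetDataset(path, filesystem=s3)
```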

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Neal Richardson
Solution 2 nojohnny101
Solution 3 KhareS