Reading a Parquet file from S3 on an EMR cluster is taking a long time

I am trying to read a parquet file (not compressed) into a pandas dataframe on an EMR cluster. I am using EMR 6.4 and parquet version 1.1.5; we are in the process of upgrading to the latest version. The parquet file is about 350 MB, but the read operation seems to be taking around 3 hours.

import pandas as pd
df = pd.read_parquet(<s3 link>)

Any ideas on what could be causing this delay? The same code has been working well until now, and the step usually completes in about 4 minutes. This file does seem to be a bit larger than earlier ones; previously the file size was around 250 MB. I'd appreciate any suggestions on how to troubleshoot this further.
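For reference, this is the kind of check I was planning to run to separate the S3 transfer time from the parquet decode time. It is only a sketch, not my actual job code: the bucket and key names are placeholders, and it assumes boto3 and a parquet engine (pyarrow) are available on the cluster.

import time
from io import BytesIO

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Download the object first, to time the S3 transfer on its own.
t0 = time.time()
obj = s3.get_object(Bucket="my-bucket", Key="path/to/file.parquet")  # placeholder bucket/key
data = obj["Body"].read()
print(f"download: {time.time() - t0:.1f}s, {len(data) / 1e6:.0f} MB")

# Then parse the downloaded bytes with the same pandas call, to time the decode on its own.
t0 = time.time()
df = pd.read_parquet(BytesIO(data))
print(f"parse: {time.time() - t0:.1f}s, rows={len(df)}")

If the download step alone accounts for most of the time, that would point at networking or S3 throughput rather than pandas/parquet itself.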


