Unable to infer schema when loading Parquet file

response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite") # Success
print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

But then:

outcome2 = sqlc.read.parquet(response)  # fail

fails with:

AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

Using Spark 2.1.1. Also fails in 2.2.0.

I found this bug report, but it was fixed in 2.0.1 and 2.1.0.

UPDATE: This works when connected with master="local", and fails when connected with master="mysparkcluster".

Solution 1:^[1]

This error usually occurs when you try to read an empty directory as Parquet. Your outcome DataFrame is probably empty.

You can check whether the DataFrame is empty with outcome.rdd.isEmpty() before writing it.
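The check above can be wrapped in a small guard. This is only a sketch: write_parquet_if_nonempty is a hypothetical helper name, and it assumes df behaves like a PySpark DataFrame (it only uses df.rdd.isEmpty() and df.write.parquet(path, mode=...), both of which appear in the question or this answer).

```python
def write_parquet_if_nonempty(df, path, mode="overwrite"):
    """Write df to path as Parquet unless it has no rows.

    Hypothetical helper: returns True if the write happened,
    False if it was skipped because the DataFrame was empty.
    """
    # isEmpty() takes at most one row instead of counting all of them.
    if df.rdd.isEmpty():
        return False
    df.write.parquet(path, mode=mode)
    return True
```

With the question's variables this would be called as write_parquet_if_nonempty(outcome, response), and the later read would only be attempted when it returns True.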

Solution 2:^[2]

In my case, the error occurred because I was trying to read a Parquet file whose name started with an underscore (e.g. _lots_of_data.parquet). I'm not sure why this was an issue, but removing the leading underscore solved the problem.

See also:

Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

Solution 3:^[3]

I'm using AWS Glue and received this error while reading data from a Data Catalog table (location: S3 bucket). After a bit of analysis I realised that it was caused by the file not being available at the file location (in my case, the S3 bucket path).

Glue was trying to apply the Data Catalog table schema to a file that didn't exist.

After copying the file into the S3 bucket location, the issue was resolved.

Hope this helps someone who encounters/encountered an error in AWS Glue.

Solution 4:^[4]

This happens when you try to read a table that is empty. If data has been correctly inserted into the table, there should be no problem.

The same thing happens with ORC, not only with Parquet.

Solution 5:^[5]

To emphasize @Davos's answer in a comment: you will encounter this exact exception if your file name starts with a dot (.) or an underscore (_).

val df = spark.read.format("csv").option("delimiter", "|").option("header", "false")
         .load("/Users/myuser/_HEADER_0")

org.apache.spark.sql.AnalysisException: 
Unable to infer schema for CSV. It must be specified manually.;

The solution is to rename the file and try again (e.g. rename _HEADER to HEADER).
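As far as I know, Spark's file listing skips names that begin with _ or . because it treats them as hidden or metadata files, so a path like the one above ends up looking like an empty input. A quick pre-check might look like the sketch below. The function name is hypothetical, and the rule is deliberately simplified: the real listing logic has exceptions (for example partition directories and Parquet metadata files) that are not modelled here.

```python
import os


def is_hidden_from_spark(path):
    """Return True if the last path component starts with '_' or '.'.

    Simplified approximation of the hidden-file rule; not Spark's
    actual implementation.
    """
    name = os.path.basename(path.rstrip("/"))
    return name.startswith(("_", "."))
```

For the example in this answer, is_hidden_from_spark("/Users/myuser/_HEADER_0") is True, while the renamed "/Users/myuser/HEADER_0" gives False.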

Solution 6:^[6]

In my case, the error occurred because the filename contained underscores. Rewriting / reading the file without underscores (hyphens were OK) solved the problem...

Solution 7:^[7]

I see there are already many answers, but the issue I faced was that my Spark job was trying to read a file that was being overwritten by another Spark job started earlier. It sounds bad, but I made that mistake.

Solution 8:^[8]

For me this happened when I thought I was loading the correct file path but was actually pointing to an incorrect folder.

Solution 9:^[9]

This happened to me with a Parquet file that was still in the process of being written. You just need to wait until it has been completely written.
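One way to wait is to poll for the _SUCCESS marker that Spark's default output committer normally drops into the output directory when a job finishes. The sketch below makes several assumptions: the directory is on a local or mounted filesystem (HDFS or S3 would need their own client), the marker has not been disabled in the writer's configuration, and the function name and timeout parameters are made up for illustration.

```python
import os
import time


def wait_for_success_marker(directory, timeout_s=60.0, poll_s=1.0):
    """Poll until directory/_SUCCESS exists.

    Returns True once the marker is found, False if timeout_s elapses first.
    """
    marker = os.path.join(directory, "_SUCCESS")
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(marker):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

The read would then only be attempted when the function returns True.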

Solution 10:^[10]

I ran into a similar problem with reading a csv

spark.read.csv("s3a://bucket/spark/csv_dir/.")

gave an error of:

org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.;

I found that it works if I remove the trailing dot, i.e.:

spark.read.csv("s3a://bucket/spark/csv_dir/")

I tested this for Parquet as well: adding a trailing dot gives this error:

org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;

Solution 11:^[11]

I just encountered the same problem, but none of the solutions here worked for me. I was trying to merge the row groups of my Parquet files on HDFS by first reading them and then writing them to another place using:

df = spark.read.parquet('somewhere')
df.write.parquet('somewhere else')

But later, when I queried it with

spark.sql('SELECT sth FROM parquet.`hdfs://host:port/parquetfolder/` WHERE .. ')

it showed the same problem. I finally solved this by using pyarrow:

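import pyarrow as pa
import pyarrow.parquet as pq

df = spark.read.parquet('somewhere')
pdf = df.toPandas()                # collects the whole dataset onto the driver
adf = pa.Table.from_pandas(pdf)
fs = pa.hdfs.connect()             # legacy HDFS API from older pyarrow releases
fw = fs.open(path, 'wb')           # path: destination file on HDFS
pq.write_table(adf, fw)
fw.close()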

Solution 12:^[12]

Check whether .parquet files are available at the response path. I assume that either the files don't exist or they sit in nested (partitioned) folders. If the files are spread over several levels of folders, append /* for each level.

In my case the .parquet files were three folders below base_path, so I gave the path as base_path/*/*/*
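To find out how many /* you need, you can walk the directory and count the folder levels between the base path and the .parquet files. The sketch below assumes a local or mounted filesystem and "/" as the path separator; parquet_globs is a hypothetical helper, not part of Spark.

```python
import os


def parquet_globs(base_path):
    """Return one glob pattern per folder depth at which .parquet files occur.

    A file directly inside base_path has depth 0, so the pattern is
    base_path itself; files three folders down give base_path/*/*/*.
    """
    depths = set()
    for root, _dirs, files in os.walk(base_path):
        if any(name.endswith(".parquet") for name in files):
            rel = os.path.relpath(root, base_path)
            depths.add(0 if rel == "." else len(rel.split(os.sep)))
    base = base_path.rstrip("/")
    return [base + "/*" * depth for depth in sorted(depths)]
```

An empty list means no .parquet files were found at all, which is itself a likely cause of the error. Otherwise the patterns can be passed on with spark.read.parquet(*patterns).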

Solution 13:^[13]

As others mentioned, in my case this error appeared when I was reading S3 keys that did not exist. One solution is to keep only the keys that do exist:

import com.amazonaws.services.s3.AmazonS3URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import java.net.URI

def addEndpointToUrl(url: String, domain: String = "s3.amazonaws.com"): String = {
  val uri = new URI(url)
  val hostWithEndpoint = uri.getHost + "." + domain
  new URI(uri.getScheme, uri.getUserInfo, hostWithEndpoint, uri.getPort, uri.getPath, uri.getQuery, uri.getFragment).toString
}

def createS3URI(url: String): AmazonS3URI = {
  try {
    // try to instantiate AmazonS3URI with url
    new AmazonS3URI(url)
  } catch {
    case e: IllegalArgumentException if e.getMessage.
      startsWith("Invalid S3 URI: hostname does not appear to be a valid S3 endpoint") => {
      new AmazonS3URI(addEndpointToUrl(url))
    }
  }
}

def s3FileExists(spark: SparkSession, url: String): Boolean = {
  val amazonS3Uri: AmazonS3URI = createS3URI(url)
  val s3BucketUri = new URI(s"${amazonS3Uri.getURI().getScheme}://${amazonS3Uri.getBucket}")

  FileSystem
    .get(s3BucketUri, spark.sparkContext.hadoopConfiguration)
    .exists(new Path(url))
}

and you can use it as:

val partitions = List(yesterday, today, tomorrow)
  .map(f => somepath + "/date=" + f)
  .filter(f => s3FileExists(spark, f))

val df = spark.read.parquet(partitions: _*)

For that solution I took some code from the spark-redshift project.
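The same filtering idea can be sketched in PySpark terms. This is not a port of the Scala code above: existing_partition_paths is a hypothetical helper, and the exists argument defaults to a local-filesystem check. For S3 you would have to pass in a predicate backed by an S3 client, which is not shown here.

```python
import os


def existing_partition_paths(base_path, dates, exists=os.path.exists):
    """Build base_path/date=<d> for each date and keep only paths that exist.

    exists is any callable taking a path and returning a bool.
    """
    candidates = ["{}/date={}".format(base_path, d) for d in dates]
    return [p for p in candidates if exists(p)]
```

If the returned list is non-empty it can be passed on as spark.read.parquet(*paths). An empty list should be handled separately, since reading with no paths will fail as well.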

Solution 14:^[14]

You are just loading a Parquet file, and of course it had a valid schema; otherwise it would not have been saved as Parquet. This error means one of two things:

Either the Parquet file does not exist (in 99.99% of cases this is the issue; Spark error messages are often less than obvious),
or the Parquet file somehow got corrupted, or it is not a Parquet file at all.
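Both possibilities can be checked quickly without Spark, because a Parquet file starts and ends with the 4-byte magic string PAR1. The sketch below assumes a single file on a local or mounted filesystem, and looks_like_parquet is a hypothetical helper name. It only checks the magic bytes, so a True result does not prove the file is fully intact, and files written with an encrypted footer use a different magic string and would be reported as False.

```python
import os

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if path is a file that starts and ends with b'PAR1'."""
    if not os.path.isfile(path):
        return False
    # Smallest possible layout: magic + 4-byte footer length + magic.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A False result for a path you expected to be Parquet points to a missing file, a truncated write, or a file in some other format.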

Solution 15:^[15]

I ran into this issue because of a folder-in-folder problem.

For example, folderA.parquet was supposed to contain the partitions, but instead it contained folderB.parquet, which in turn held the partitions.

The resolution was to move the files up to the parent folder and delete the subfolder.

Solution 16:^[16]

You can read with /*:

outcome2 = sqlc.read.parquet(f"{response}/*")  # works for me

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2	ostrokach
Solution 3	Ash
Solution 4	Anxo P
Solution 5	ibaralf
Solution 6	meeh
Solution 7
Solution 8	mani_nz
Solution 9
Solution 10	lockwobr
Solution 11	cloudray
Solution 12	Shams
Solution 13	Kyr
Solution 14
Solution 15	Rishabh Agarwal
Solution 16	Evandro Mendes
