'Unable to infer schema when loading Parquet file
response = "mi_or_chd_5"
outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite") # Success
print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
But then:
outcome2 = sqlc.read.parquet(response) # fail
fails with:
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
in
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
The documentation for parquet says the format is self describing, and the full schema was available when the parquet file was saved. What gives?
Using Spark 2.1.1. Also fails in 2.2.0.
Found this bug report, but was fixed in 2.0.1, 2.1.0.
UPDATE: This work when on connected with master="local", and fails when connected to master="mysparkcluster".
Solution 1:[1]
This error usually occurs when you try to read an empty directory as parquet. Probably your outcome Dataframe is empty.
You could check if the DataFrame is empty with outcome.rdd.isEmpty() before writing it.
Solution 2:[2]
In my case, the error occured because I was trying to read a parquet file which started with an underscore (e.g. _lots_of_data.parquet). Not sure why this was an issue, but removing the leading underscore solved the problem.
See also:
Solution 3:[3]
I'm using AWS Glue and I received this error while reading data from a data catalog table (location: s3 bucket). After bit of analysis I realised that, this is due to file not available in file location(in my case s3 bucket path).
Glue was trying to apply data catalog table schema on a file which doesn't exist.
After copying file into s3 bucket file location, issue got resolved.
Hope this helps someone who encounters/encountered an error in AWS Glue.
Solution 4:[4]
This case occurs when you try to read a table that is empty. If the table had correctly inserted data, there should be no problem.
Besides that with parquet, the same thing happens with ORC.
Solution 5:[5]
Just to emphasize @Davos answer in a comment, you will encounter this exact exception error, if your file name has a dot . or an underscore _ at start of the filename
val df = spark.read.format("csv").option("delimiter", "|").option("header", "false")
.load("/Users/myuser/_HEADER_0")
org.apache.spark.sql.AnalysisException:
Unable to infer schema for CSV. It must be specified manually.;
Solution is to rename the file and try again (e.g. _HEADER rename to HEADER)
Solution 6:[6]
In my case, the error occurred because the filename contained underscores. Rewriting / reading the file without underscores (hyphens were OK) solved the problem...
Solution 7:[7]
I see there are already so many Answers. But the issue I faced was my Spark job was trying to read a file which is being overwritten by another Spark job that was previously started. It sounds bad, but I did that mistake.
Solution 8:[8]
For me this happened when I thought loading the correct file path but instead pointed a incorrect folder
Solution 9:[9]
Happened to me for a parquet file that was in the process of being written to. Just need to wait for it to be completely written.
Solution 10:[10]
I ran into a similar problem with reading a csv
spark.read.csv("s3a://bucket/spark/csv_dir/.")
gave an error of:
org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.;
I found if I removed the trailing . and then it works. ie:
spark.read.csv("s3a://bucket/spark/csv_dir/")
I tested this for parquet adding a trailing . and you get an error of:
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
Solution 11:[11]
I just encountered the same problem but none of the solutions here work for me. I try to merge the row groups of my parquet files on hdfs by first reading them and write it to another place using:
df = spark.read.parquet('somewhere')
df.write.parquet('somewhere else')
But later when I query it with
spark.sql('SELECT sth FROM parquet.`hdfs://host:port/parquetfolder/` WHERE .. ')
It shows the same problem. I finally solve this by using pyarrow:
df = spark.read.parquet('somewhere')
pdf = df.toPandas()
adf = pa.Table.from_pandas(pdf) # import pyarrow as pa
fs = pa.hdfs.connect()
fw = fs.open(path, 'wb')
pq.write_table(adf, fw) # import pyarrow.parquet as pq
fw.close()
Solution 12:[12]
Check if .parquet files available at the response path. I am assuming, either files don't exist or it may be exist in some internal(partitioned) folders. If files are available under multiple hierarchy folders then append /* for each folder.
As in my case .parquet files were under 3 folders from base_path, so I have given path as base_path/*/*/*
Solution 13:[13]
As others mentioned, in my case this error appeared when I was reading S3 keys that did not exist. A solution is filter-in keys that do exist:
import com.amazonaws.services.s3.AmazonS3URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import java.net.URI
def addEndpointToUrl(url: String, domain: String = "s3.amazonaws.com"): String = {
val uri = new URI(url)
val hostWithEndpoint = uri.getHost + "." + domain
new URI(uri.getScheme, uri.getUserInfo, hostWithEndpoint, uri.getPort, uri.getPath, uri.getQuery, uri.getFragment).toString
}
def createS3URI(url: String): AmazonS3URI = {
try {
// try to instantiate AmazonS3URI with url
new AmazonS3URI(url)
} catch {
case e: IllegalArgumentException if e.getMessage.
startsWith("Invalid S3 URI: hostname does not appear to be a valid S3 endpoint") => {
new AmazonS3URI(addEndpointToUrl(url))
}
}
}
def s3FileExists(spark: SparkSession, url: String): Boolean = {
val amazonS3Uri: AmazonS3URI = createS3URI(url)
val s3BucketUri = new URI(s"${amazonS3Uri.getURI().getScheme}://${amazonS3Uri.getBucket}")
FileSystem
.get(s3BucketUri, spark.sparkContext.hadoopConfiguration)
.exists(new Path(url))
}
and you can use it as:
val partitions = List(yesterday, today, tomorrow)
.map(f => somepath + "/date=" + f)
.filter(f => s3FileExists(spark, f))
val df = spark.read.parquet(partitions: _*)
For that solution I took some code out of spark-redshift project, here.
Solution 14:[14]
You are just loading a parquet file , Of course parquet had valid schema. Otherwise it would not been saved as parquet. This error means -
- Either parquet file does not exist . (99.99% cases this is the issue. Spark error messages are often less obvious)
- Somehow the parquet file got corrupted or Or It's not a parquet file at all
Solution 15:[15]
I ran into this issue because of folder in folder issue.
for example folderA.parquet was supposed to have partion.... but instead it have folderB.parquet which inside have partition.
Resolution, transfer the file to parent folder and delete the subfolder.
Solution 16:[16]
you can read with /*
outcome2 = sqlc.read.parquet(f"{response}/*") # work for me
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
