Spark 2.4: Parquet column cannot be converted in file, Column: [Impressions], Expected: bigint, Found: BINARY
I'm facing a weird issue that I cannot understand.
I have source data with a column "Impressions" that is sometimes a bigint / sometimes a string (when I manually explore the data).
The Hive schema registered for this column is Long.
Thus, when loading the data:
spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW adwords_ads_agg_Yxz AS
SELECT
a.customer_id
, a.Campaign
, ...
, SUM(BIGINT(a.Impressions)) as Impressions
, SUM(BIGINT(a.Cost))/1000000 as Cost
FROM adwords_ad a
LEFT JOIN ds_ad_mapping m ON BIGINT(a.Ad_ID) = BIGINT(m.adEngineId) AND a.customer_id = m.reportAccountId
WHERE a.customer_id in (...)
AND a.day >= DATE('2019-02-01')
GROUP BY
a.customer_id
, ...
""")
I'm making sure that everything gets converted to BIGINT. The error happens later, at this step:
spark.sql("CACHE TABLE adwords_ads_agg_Yxz")
After seeing this error, I ran the same code in a notebook to debug further, first making sure the column really is cast to BIGINT / long:
from pyspark.sql import functions as f
from pyspark.sql.types import LongType

df = df.withColumn("Impressions", f.col("Impressions").cast(LongType()))
df.createOrReplaceTempView('adwords_ads_agg_Yxz')
and then printing the schema from this freshly converted df:
root
|-- customer_id: long (nullable = true)
|-- Campaign: string (nullable = true)
|-- MatchType: string (nullable = true)
|-- League: string (nullable = false)
|-- Ad_Group: string (nullable = true)
|-- Impressions: long (nullable = true) <- Here!
|-- Cost: double (nullable = true)
and then doing the caching, but the error remains:
Spark Job Progress An error occurred while calling o84.sql. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 47.0 failed 4 times, most recent failure: Lost task 9.3 in stage 47.0 (TID 2256, ip-172-31-00-00.eu-west-1.compute.internal, executor 10): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file s3a://bucket/prod/reports/adwords_ad/customer_id=1111111/date=2019-11-21/theparquetfile.snappy.parquet. Column: [Impressions], Expected: bigint, Found: BINARY
Has anyone encountered this problem and / or has an idea what is causing it?
If I remove the caching, the error happens instead when writing the data to Parquet. I also don't understand why the error refers to the adwords_ad table at this point, when I'm trying to refresh / write a temporary view.
Solution 1:[1]
When a Hive table is defined over Parquet files and you read it with Spark, Spark takes the schema from the Parquet files themselves, not from the Hive table definition.
So if Impressions is stored as BINARY in your Parquet files' schema, it does not matter that the Hive table declares it as Long: Spark uses the schema from the Parquet file, and the error is raised while scanning the file, before your BIGINT casts are applied.
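A quick way to confirm this (a minimal sketch, not part of the original answer; the path is copied from the stack trace and must be adjusted to your bucket) is to read the failing file directly and print the schema Spark derives from the Parquet footer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read only the file named in the error message and print the schema that
# is actually stored in its Parquet footer. If Impressions shows up as
# string here, the file stores it as BINARY regardless of the Hive table
# definition, and the cast in the query runs too late to help.
path = ("s3a://bucket/prod/reports/adwords_ad/"
        "customer_id=1111111/date=2019-11-21/theparquetfile.snappy.parquet")
spark.read.parquet(path).printSchema()

If the physical type differs between files (long in some, string in others), the most robust fix is usually to rewrite the offending files with a consistent type.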
Solution 2:[2]
Try disabling the vectorized Parquet reader, as described in the troubleshooting guide below:
https://docs.microsoft.com/en-us/azure/databricks/kb/scala/spark-job-fail-parquet-column-convert
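For example (a sketch: spark.sql.parquet.enableVectorizedReader is a standard Spark SQL setting, but whether the fallback reader copes with these particular files depends on the data):

# Fall back from the vectorized Parquet reader to the parquet-mr based
# reader for this session, then retry the step that failed.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

spark.sql("CACHE TABLE adwords_ads_agg_Yxz")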
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ShemTov |
| Solution 2 | Vibha |
