'Hive beeline and spark load count doesn't match for hive tables

I am using spark 2.4.4 and hive 2.3 ...

Using spark, I am loading a dataframe as Hive table using DF.insertInto(hiveTable)

if new table is created during run (of course before insertInto thru spark.sql) or existing tables created by spark 2.4.4 - everything works fine.

Issue is, if I am attempting to load some existing tables (older tables created spark 2.2 or before) - facing issues with COUNT of records. Diff count when count of target hive table is done thru beeline vs spark sql.

Please assist.



Solution 1:[1]

There seems to be an issue with sync of hive-Metastore and spark-catalog for hive tables (with parquet file format) created o2.n spark 2 (or before - with comple /nested data tydata type) and loaded using spark v2.4.

Usual case, spark.catalog.refresh(<hive-table-name>) will refresh the stats from hiveMetastore to spark.catalog.

In this case, an explicit spark.catalog.resfreshByPath(<location-maprfs-path>) need to bed executed to refresh the stats.pet*

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 VimalK