On HDInsight 4.0, how can I save/update a table in Hive and have that table be readable from Oozie/Spark/Jupyter?

Today we have this scenario:

  1. Azure HDInsight 4.0 cluster
  2. Workflows running on Oozie
  3. On this version, Spark and Hive no longer share metadata
  4. We came from HDInsight 3.6; to cope with this change, we now use the Hive Warehouse Connector (HWC)
    • Before: df.write.saveAsTable("tableName", mode="overwrite")
    • Now: df.write.mode("overwrite").format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option('table', "tableName").save()
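For context, an HWC write normally needs a session built from the SparkSession first. A minimal sketch of how our "Now" line fits into a full job, assuming the HWC assembly jar is on the Spark classpath and spark.sql.hive.hiveserver2.jdbc.url is configured for the cluster (table and data names are illustrative):

```python
# Sketch of a Hive Warehouse Connector write on HDInsight 4.0.
# Assumes the HWC jar is on the classpath and the HiveServer2
# JDBC URL is set in the Spark configuration.
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("hwc-write").getOrCreate()
hive = HiveWarehouseSession.session(spark).build()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Overwrite a Hive managed table through HWC instead of saveAsTable
df.write.mode("overwrite") \
    .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR) \
    .option("table", "tableName") \
    .save()
```

This cannot run outside the cluster, since it depends on the HWC jar and a live interactive HiveServer2.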

The problem is at this point: using HWC makes it possible to save tables in Hive, but the Hive databases/tables are not visible to Spark, Oozie, and Jupyter; they only see tables in the Spark catalog.

So this is a major problem for us, because it is not possible to read data from Hive managed tables and use it in an Oozie workflow.

To make it possible to save a table in Hive and have it visible across the whole cluster, I made these configuration changes in Ambari:

  1. hive > hive.strict.managed.tables = false
  2. spark2 > metastore.catalog.default = hive
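For reference, the two properties above as raw configuration values; the exact Ambari section names (e.g. a custom hive-site or spark2 hive-site override) may vary by cluster version, so treat this as a sketch:

```
# Hive service configuration (hive-site)
hive.strict.managed.tables = false

# Spark2 service configuration (hive-site override for Spark)
metastore.catalog.default = hive
```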

And now it is possible to save a table in Hive the "old" way, with df.write.saveAsTable.

But there is a problem when the table is updated/overwritten:

pyspark.sql.utils.AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.security.AccessControlException: Permission denied: user=hive, 
path="wasbs://[email protected]/hive/warehouse/managed/table"
:user:supergroup:drwxr-xr-x);'

So, I have two questions:

  1. Is this the correct way to save a table in Hive so that it is visible across the whole cluster?
  2. How can I avoid this permission error on table overwrite? Keep in mind, this error occurs when we execute the Oozie workflow.

Thanks!



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow