How to skip a serial column in a Greenplum table while inserting from a Spark DataFrame to Greenplum

Here is all the required information and the code:

import org.apache.spark.sql.SaveMode

// Connection options for the Greenplum Connector for Apache Spark
val gscReadOptionMap = Map(
      "url" -> s"jdbc:postgresql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}",
      "user" -> jdbcUsername,
      "password" -> jdbcPassword,
      "dbschema" -> "public",
      "dbtable" -> dbtable
)

// Append the DataFrame rows to the Greenplum table
final_df.write
      .format("greenplum")
      .options(gscReadOptionMap)
      .mode(SaveMode.Append)
      .save()

Greenplum table schema:

    Column    |            Type             |                                   Modifiers
--------------+-----------------------------+-------------------------------------------------------------------------------
 auto_id      | bigint                      | not null default nextval('tmp_test_tpledger_timeuser2_auto_id_seq'::regclass)
 userid       | character varying(128)      | not null
 eventtime    | timestamp without time zone | not null
 time_spent   | bigint                      |

Spark DataFrame schema:

root
 |-- userid: string (nullable = true)
 |-- eventtime: timestamp (nullable = true)
 |-- time_spent: long (nullable = true)

When trying to write data from Spark to Greenplum, I get the following error.

22/05/02 13:30:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/05/02 13:30:37 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
22/05/02 13:30:38 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
java.lang.RuntimeException: Spark DataFrame must include column[s] "auto_id" when writing to Greenplum Database table.
    at io.pivotal.greenplum.spark.externaltable.RowTransformer$.getFunction(RowTransformer.scala:47)
    at io.pivotal.greenplum.spark.GreenplumRelationProvider.saveDataFrame(GreenplumRelationProvider.scala:153)
    at io.pivotal.greenplum.spark.GreenplumRelationProvider.createRelation(GreenplumRelationProvider.scala:115)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)



Solution 1:[1]

Right now the Greenplum Connector for Apache Spark requires that every column in the Greenplum Database table also be present in the Spark DataFrame. If the Spark DataFrame has more columns than the Greenplum table, the extra columns are ignored; but if a table column is missing from the DataFrame, as auto_id is here, the write fails with the error above. We can consider your feature request for future releases of the Connector.
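
Given that constraint, one possible workaround (not part of the original answer, just a sketch) is to write the DataFrame into a staging table whose columns exactly match the DataFrame, and then copy the rows into the target table inside Greenplum so that auto_id is filled in by its sequence default. The staging table name tmp_test_staging below is hypothetical and would have to be created beforehand with the columns userid, eventtime and time_spent; the JDBC call reuses the connection details from the question.

// Hypothetical workaround sketch: write to a staging table without the serial
// column, then move the rows inside Greenplum so auto_id takes its default.
import java.sql.DriverManager

import org.apache.spark.sql.SaveMode

// 1) Write the DataFrame to a staging table whose columns exactly match
//    the DataFrame schema (userid, eventtime, time_spent) -- no auto_id.
final_df.write
      .format("greenplum")
      .options(gscReadOptionMap + ("dbtable" -> "tmp_test_staging"))
      .mode(SaveMode.Append)
      .save()

// 2) Copy the staged rows into the real table; auto_id is omitted from the
//    column list, so Greenplum fills it from its nextval(...) default.
val conn = DriverManager.getConnection(
      s"jdbc:postgresql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}",
      jdbcUsername, jdbcPassword)
try {
      conn.createStatement().execute(
            s"""INSERT INTO ${dbtable} (userid, eventtime, time_spent)
               |SELECT userid, eventtime, time_spent FROM tmp_test_staging""".stripMargin)
} finally {
      conn.close()
}

Any mechanism that runs the INSERT ... SELECT on the Greenplum side works equally well; the key point is that auto_id is left out of the column list so its default applies.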

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 denalex