Databricks Delta table column with double data type to store long value

  1. I created a JSON schema with one of the fields having the data type double initially, and this schema was used to create the Databricks Delta table (see the sketch after this list).

  2. Initially the ORC file had double as the data type for that particular column, but it has now changed to long in the ORC file.

  3. The ORC file is read using readStream, as shown below.
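
A minimal sketch of what that schema might look like (the column names, including the field that was originally double, are assumptions for illustration, not from the original post):

from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

# Hypothetical schema: 'amount' is the field that was declared double
# and later started arriving as long in the ORC files
my_schema = StructType([
    StructField('employeeid', StringType(), True),
    StructField('amount', DoubleType(), True),
    StructField('updateddate', TimestampType(), True),
])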

readStream

mydata_readstream = (
    spark.readStream
    .format("orc")
    .schema(my_schema)  # the schema defines the field as double
    .load('/mnt/myapp/data/content.orc')
)
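
As an aside, .schema() also accepts a DDL-formatted string instead of a StructType; an equivalent sketch using the assumed column names from above:

mydata_readstream = (
    spark.readStream
    .format("orc")
    .schema("employeeid STRING, amount DOUBLE, updateddate TIMESTAMP")
    .load('/mnt/myapp/data/content.orc')
)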

writeStream

from pyspark.sql import functions as F

def loadMyDataInWriteStream(df, batch_id):
  # Replace 'updateddate' with just its year; withColumn overwrites the
  # existing column, so an explicit drop beforehand is not needed
  my_column = 'updateddate'
  (
    df
    .dropDuplicates(['employeeid'])
    .withColumn(my_column, F.date_format(F.col(my_column), 'yyyy').cast('int'))
    .repartition(1)
    .write
    .partitionBy(my_column)
    .format("orc")
    .mode("append")
    .save('/mnt/myapp/data/output/')
  )

ldc_fault_writestream = (
  mydata_readstream
  .writeStream
  .trigger(once=True)
  .option("checkpointLocation", "/mnt/mypath/checkpoints/myhistory")
  .foreachBatch(loadMyDataInWriteStream)
  .start()
)
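
For completeness, the type mismatch itself can be avoided by upcasting inside foreachBatch before the write; a minimal sketch, assuming the mismatched field is the hypothetical 'amount' column from above:

  # Inside loadMyDataInWriteStream, before the write: explicitly cast the
  # column that now arrives as long back to double, so the batch matches
  # the schema the table was created with ('amount' is a made-up name)
  df = df.withColumn('amount', F.col('amount').cast('double'))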

Below is the exception I see when the ORC file field is of the long data type.

File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 2442, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
  File "/databricks/spark/python/pyspark/sql/utils.py", line 202, in call
    raise e
  File "/databricks/spark/python/pyspark/sql/utils.py", line 199, in call
    self.func(DataFrame(jdf, self.sql_ctx), batch_id)
  File "<command-813392196322563>", line 5, in loadMyDataInWriteStream
    df
  File "/databricks/spark/python/pyspark/sql/readwriter.py", line 1136, in save
    self._jwrite.save(path)
  File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/databricks/spark/python/pyspark/sql/utils.py", line 117, in deco
    return f(*a, **kw)
  File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o565.save.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:307)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:194)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:121)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:119)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:144)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:213)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:257)

Schema of ORC file

(screenshot from the ORC file; the field in question shows as long)

Schema from the Delta table

(screenshot from the Delta table; the field in question shows as double)

Question:

  • If I update the data type of the field to long in the JSON schema, I noticed there is no exception thrown and the data gets written to the Delta table, even though the table was created with the double data type.

Is this correct behavior for a Delta table, where the column is defined as double but it can store long data as well?
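
To make the question concrete, a minimal repro sketch (the table and column names here are made up):

# Hypothetical Delta table with a double column
spark.sql("CREATE TABLE IF NOT EXISTS demo_tbl (employeeid STRING, amount DOUBLE) USING DELTA")

# Appending a long (bigint) value into the double column succeeds with no
# exception; the value appears to be implicitly upcast to double
(
  spark.sql("SELECT 'e1' AS employeeid, CAST(10 AS BIGINT) AS amount")
  .write
  .format("delta")
  .mode("append")
  .saveAsTable("demo_tbl")
)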



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
