Databricks Delta table column with double data type to store long value
I created a JSON schema in which one of the fields has the data type `double`, and this schema was used to create the Databricks Delta table. Initially the ORC file also used `double` for that particular column, but the column's data type in the ORC file has since changed to `long`.
The ORC file is read using readStream:
```python
mydata_readstream = (
    spark.readStream
        .format("orc")
        .schema("my-schema")  # the schema defines the field as double
        .load('/mnt/myapp/data/content.orc')
)
```
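For context, `"my-schema"` stands in for the actual schema. A minimal sketch of an equivalent DDL string is shown below; `myfield` is a hypothetical name for the column whose ORC type changed from double to long (the post does not name it), and the other types are placeholders:

```python
# Hypothetical DDL string standing in for "my-schema"; 'myfield' is a
# placeholder for the column that is declared as double but now arrives
# as long in the ORC files.
my_schema = "employeeid STRING, myfield DOUBLE, updateddate TIMESTAMP"
```

Because streaming reads are lazy, a mismatch between this declared `double` and the `long` actually stored in the ORC files only surfaces when the micro-batch is materialized, which is presumably why the failure below shows up inside the `save()` call.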
The data is written with writeStream and foreachBatch:

```python
ldc_fault_writestream = (
    mydata_readstream
        .writeStream
        .trigger(once=True)
        .option("checkpointLocation", "/mnt/mypath/checkpoints/myhistory")
        .foreachBatch(loadMyDataInWriteStream)
        .start()
)
```
```python
from pyspark.sql import functions as F

def loadMyDataInWriteStream(df, batch_id):
    my_column = 'updateddate'
    (
        df
        .dropDuplicates(['employeeid'])
        # withColumn overwrites the existing 'updateddate' column in place,
        # keeping only the year as an int for partitioning
        .withColumn(my_column, F.date_format(F.col(my_column), 'yyyy').cast('int'))
        .repartition(1)
        .write
        .partitionBy('updateddate')
        .format("orc")
        .mode("append")
        .save('/mnt/myapp/data/output/')
    )
```
Below is the exception I see when the ORC file field has the `long` data type:
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 2442, in _call_proxy
return_value = getattr(self.pool[obj_id], method)(*params)
File "/databricks/spark/python/pyspark/sql/utils.py", line 202, in call
raise e
File "/databricks/spark/python/pyspark/sql/utils.py", line 199, in call
self.func(DataFrame(jdf, self.sql_ctx), batch_id)
File "<command-813392196322563>", line 5, in loadMyDataInWriteStream
df
File "/databricks/spark/python/pyspark/sql/readwriter.py", line 1136, in save
self._jwrite.save(path)
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/databricks/spark/python/pyspark/sql/utils.py", line 117, in deco
return f(*a, **kw)
File "/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o565.save.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:307)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:194)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:121)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:119)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:144)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:213)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:257)
Schema of ORC file
Schema from the Delta table
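Both schemas can also be inspected programmatically. In the sketch below the ORC path is the one used in readStream above, while the Delta table path is a placeholder, since the post does not show where the table lives:

```python
# Schema actually stored in the ORC files (inferred from the file footers,
# independent of the schema declared in readStream).
spark.read.format("orc").load('/mnt/myapp/data/content.orc').printSchema()

# Schema of the Delta table; '/mnt/myapp/delta/mytable' is a placeholder path.
spark.read.format("delta").load('/mnt/myapp/delta/mytable').printSchema()
```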
Question:
- If I update the data type of the field to `long` in the JSON schema, I noticed that no exception is thrown and the data gets written to the Delta table, even though the table was created with the `double` data type.
Is this correct behavior for a Delta table, where the column is defined as `double` but it can store `long` data as well?
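To check whether that is expected, a minimal reproduction sketch along these lines could be used; the table name, column names, and sample data are assumptions, and the comments describe the behavior reported in the question rather than a guaranteed outcome:

```python
# Create a small Delta table with the column declared as double.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_type_check (employeeid STRING, myfield DOUBLE)
    USING DELTA
""")

# Append a batch in which the same column arrives as long.
long_df = spark.createDataFrame([("e1", 42)], "employeeid STRING, myfield LONG")
long_df.write.format("delta").mode("append").saveAsTable("my_type_check")

# If the append succeeds, the table schema should still report double,
# i.e. the long values were implicitly up-cast on write.
spark.table("my_type_check").printSchema()
```

If the append goes through with the schema still showing `double`, that matches the behavior described above: the `long` values are up-cast to `double` on write while the table keeps its declared column type.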
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow