Error writing a partitioned Delta table from a multi-task job in Azure Databricks
I have a notebook that writes to a Delta table with a statement similar to the following:
match = "current.country = updates.country and current.process_date = updates.process_date"
deltaTable = DeltaTable.forPath(spark, silver_path)
deltaTable.alias("current")\
.merge(
data.alias("updates"),
match) \
.whenMatchedUpdate(
set = update_set,
condition = condition) \
.whenNotMatchedInsert(values = values_set)\
.execute()
The multi-task job has two tasks that run in parallel.

When the job runs, the following error is displayed:
ConcurrentAppendException: Files were added to partition [country=Panamá, process_date=2022-01-01 00:00:00] by a concurrent update. Please try the operation again.
Each task is passed a different country (Panama, Ecuador) and the same date as parameters, so each execution should only write the data for the country it receives. The Delta table is partitioned by the country and process_date fields. Any ideas what I'm doing wrong? How should I specify the partition to be affected when using the "merge" statement?
I would appreciate any clarification on how to work with partitions in these cases, since this is new to me.
Update: I made an adjustment to the condition to specify the country and process date, as indicated here (ConcurrentAppendException). Now I get the following error message:
ConcurrentAppendException: Files were added to the root of the table by a concurrent update. Please try the operation again.
I can't think of what could be causing the error. I will keep investigating.
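For reference, the adjustment described above presumably pins the current task's partition values as literals in the merge condition; a minimal sketch, where country and process_date stand in for the parameters passed to each task (the variable names are illustrative, not from the original notebook):

# Hypothetical sketch of the adjusted condition; country and process_date
# stand in for the parameters passed to this particular task.
match = (
    f"current.country = '{country}' "
    f"and current.process_date = '{process_date}' "
    "and current.country = updates.country "
    "and current.process_date = updates.process_date"
)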
Solution 1:[1]
Error – ConcurrentAppendException: Files were added to the root of the table by a concurrent update. Please try the operation again.
This exception is often thrown during concurrent DELETE, UPDATE, or MERGE operations. While the concurrent operations may be physically updating different partition directories, one of them may read the same partition that the other one concurrently updates, thus causing a conflict. You can avoid this by making the separation explicit in the operation condition.
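Applied to the question, "making the separation explicit" means putting the literal partition values handled by each task into the merge condition, so Delta can verify that the concurrent merges touch disjoint partitions. A minimal sketch, assuming each task receives its own country and process_date as parameters; the variable names and values below are illustrative, and update_set, condition and values_set are as in the question:

from delta.tables import DeltaTable

# Values assumed to be passed to this particular task (hypothetical):
country = "Panamá"
process_date = "2022-01-01 00:00:00"

# Restrict the source data to the single partition this task owns
# (an extra precaution, not shown in the original post).
updates_df = data.where(
    (data.country == country) & (data.process_date == process_date)
)

# Pin the partition values as literals so the concurrent merges
# provably touch disjoint partitions.
match = (
    f"current.country = '{country}' "
    f"and current.process_date = '{process_date}' "
    "and current.country = updates.country "
    "and current.process_date = updates.process_date"
)

(DeltaTable.forPath(spark, silver_path).alias("current")
    .merge(updates_df.alias("updates"), match)
    .whenMatchedUpdate(set=update_set, condition=condition)
    .whenNotMatchedInsert(values=values_set)
    .execute())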
An UPDATE query is executed against the Delta Lake target table when an 'Update Strategy' transformation is used in the mapping. When multiple 'Update Strategy' transformations are used for the same target table, multiple UPDATE queries run in parallel and the target data becomes unpredictable. Because of this, using more than one 'Update Strategy' transformation per Databricks Delta Lake table in a mapping is not supported. Redesign the mapping so that there is only one 'Update Strategy' transformation per Delta Lake table.
Solution -
With one 'Update Strategy' transformation per Databricks Delta Lake table, the mapping runs to completion successfully.
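Independently of the redesign, the exception message itself suggests retrying the operation. A small retry wrapper is one pragmatic way to do that; a sketch, assuming update_set, condition and values_set are defined as in the question:

import time

def merge_with_retry(delta_table, updates_df, match, max_attempts=3):
    # Retry the merge a few times when a concurrent transaction wins the race,
    # as the ConcurrentAppendException message itself suggests.
    for attempt in range(1, max_attempts + 1):
        try:
            (delta_table.alias("current")
                .merge(updates_df.alias("updates"), match)
                .whenMatchedUpdate(set=update_set, condition=condition)
                .whenNotMatchedInsert(values=values_set)
                .execute())
            return
        except Exception as exc:
            # On Databricks the conflict usually surfaces as a JVM exception,
            # so match on the class name rather than importing a Python type.
            if "ConcurrentAppendException" in str(exc) and attempt < max_attempts:
                time.sleep(10 * attempt)  # simple linear back-off before retrying
            else:
                raise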
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | AbhishekKhandave-MT |
