'Inserting Records To Delta Table Through Databricks

I wanted to insert 100,000 records into a delta table using databricks. I am trying to insert data by using a simple for loop , something like -

revision_date = '01/04/2022'
for i in range( 0 , 100,000):
    spark.sql(""" insert into db.delta_table_name values ( 'Class1' , '{revision_date}' + i """)

The problem is , it takes awfully long to insert data using insert statement in databricks. It took almost 5+ hours to complete this. Can anyone suggest an alternative or a solution for this problem in databricks.

My Cluster configuration is - 168 GB, 24 core, DBR 9.1 LTS,Spark 3.1.2



Solution 1:[1]

The loop through enormous INSERT operations on Delta Table costs a lot because it involves a new Transaction Logging for every single INSERT command. May read more on the doc.

Instead, it would be better to create a whole Spark dataframe first and then execute just one WRITE operation to insert data into Delta Table. The example code below will do in less than a minute.

from pyspark.sql.functions import expr, row_number, lit, to_date, date_add
from pyspark.sql.window import Window
columns = ['col1']
rows = [['Class1']]
revision_date = '01/04/2022'

# just create a one record dataframe
df = spark.createDataFrame(rows, columns)

# duplicate to 100,000 records
df = df.withColumn('col1', expr('explode(array_repeat(col1,100000))'))

# create date column
df = df.withColumn('revision_date', lit(revision_date))
df = df.withColumn('revision_date', to_date('revision_date', 'dd/MM/yyyy'))

# create sequence column 
w = Window().orderBy(lit('X'))
df = df.withColumn("col2", row_number().over(w))

# use + operation to add date
df = df.withColumn("revision_date", df.revision_date + df.col2)

# drop unused column
df = df.drop("col2")

# write to the delta table location
df.write.format('delta').mode('overwrite').save('/location/of/your/delta/table')

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 PhuriChal