Databricks Python/PySpark code to find the age of a blob in an Azure container

Looking for Databricks Python/PySpark code to copy Azure blobs that are older than 30 days from one container to another container.



Solution 1:[1]

  • The copy itself is simple, assuming both containers are already mounted to DBFS (see the mount sketch after this list):

    dbutils.fs.cp("/mnt/xxx/file_A", "/mnt/yyy/file_A", True)
    
  • The difficult part is checking the blob modification time. According to the docs, the modification time is only returned by the dbutils.fs.ls command on Databricks Runtime 10.2 or above. You can check the Runtime version with the command below.

    spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")
    

    The returned value is the Databricks Runtime version followed by the Scala version, e.g. 10.4.x-scala2.12.
    If you get lucky with the version, you can do something like:

    import time

    ts_now = time.time()

    for file in dbutils.fs.ls('/mnt/xxx'):
      # modificationTime is in epoch milliseconds, while time.time() is in seconds
      if ts_now - file.modificationTime / 1000 > 30 * 86400:
        dbutils.fs.cp(f'/mnt/xxx/{file.name}', f'/mnt/yyy/{file.name}', True)
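
Note that dbutils.fs.ls is not recursive, so the loop above only considers the top level of the mount. For completeness, the /mnt/xxx and /mnt/yyy paths used throughout assume both containers are already mounted to DBFS. Below is a minimal sketch of mounting the source and target containers with a storage account key; the storage account, container names, and secret scope/key are placeholders, not details from the question.

    # All names below are placeholders; substitute your own storage account,
    # container names and secret scope/key.
    storage_account = "mystorageaccount"

    for container, mount_point in [("source-container", "/mnt/xxx"),
                                   ("target-container", "/mnt/yyy")]:
      # Skip containers that are already mounted.
      if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        dbutils.fs.mount(
          source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
          mount_point=mount_point,
          extra_configs={
            f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
              dbutils.secrets.get(scope="my-scope", key="storage-account-key")
          }
        )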
    

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: PhuriChal