How to implement multiprocessing in Azure Databricks - Python

I need to get the details of each file in a directory. It is taking a long time, so I want to use multiprocessing so that execution finishes sooner.

My code is like this:

from datetime import datetime
from pathlib import Path
from os.path import getmtime, getsize
from multiprocessing import Pool, Process

def iterate_directories(root_dir):
    for child in Path(root_dir).iterdir():
        if child.is_file():
            # Last-modified date and size in bytes of the current file
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            file_size = getsize(child)
            # further steps...
        else:
            iterate_directories(child)  # I need this to run in a separate process (in parallel)
    

I tried to make the recursive call in a separate process, as shown below, but it is not working. It exits the loop immediately.

else:
    p = Process(target=iterate_directories, args=(child))
    Pros.append(p)  # Pros is declared as an empty list.
    p.start()

for p in Pros:
    if not p.is_alive():
        p.join()

What am I missing here? How can I process the sub-directories in parallel?
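
A few things in the attempt above look like likely culprits. `args=(child)` is not a tuple (the parentheses alone do nothing), so it would need to be `args=(child,)`. `p.join()` is only called for processes that have already finished, so the parent never waits for the ones still running. And anything a child process computes stays in that process unless it is explicitly sent back. Below is a minimal sketch of one way to fan the work out per top-level sub-directory and collect the results. Note that it swaps in a thread pool (`concurrent.futures.ThreadPoolExecutor`) rather than separate processes, on the assumption that the work is dominated by file-system calls; the helper names `file_details`, `scan_directory` and `scan_in_parallel` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from pathlib import Path


def file_details(path):
    """Return (path, modified_date, size_in_bytes) for one file."""
    info = path.stat()
    return (str(path), datetime.fromtimestamp(info.st_mtime).date(), info.st_size)


def scan_directory(root_dir):
    """Walk one directory tree sequentially and collect details of every file."""
    return [file_details(p) for p in Path(root_dir).rglob("*") if p.is_file()]


def scan_in_parallel(root_dir, max_workers=8):
    """Scan each top-level sub-directory of root_dir in its own worker thread."""
    entries = list(Path(root_dir).iterdir())
    # Files directly under the root are handled in the calling thread.
    results = [file_details(p) for p in entries if p.is_file()]
    subdirs = [p for p in entries if p.is_dir()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Iterating over map() waits for each worker and yields results in input order.
        for sub_results in pool.map(scan_directory, subdirs):
            results.extend(sub_results)
    return results
```

`ProcessPoolExecutor` exposes the same `map` interface if separate processes are really needed, but then the worker function has to be importable at module level and the entry point should be guarded with `if __name__ == "__main__":`. Whether either variant is actually faster depends on the storage behind the directory, so it is worth timing both against the sequential version.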



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
