How can I make code that runs a process once per file parallel?

import cv2
import time
import glob

img_array = []
start_time = time.time()
for file in glob.glob("C:/pictures/*.jpg"):
    a = cv2.imread(file)
    img_array.append(a)
    height, width, layer = a.shape
    size = (width, height)
video = cv2.VideoWriter('C:/pictures/project.avi', cv2.VideoWriter_fourcc(*'mp4v'), 15, size)

for img in img_array:
    video.write(img)
video.release()
print("--- %s seconds ---" % (time.time() - start_time))

This is the serial code I wrote first. The program merges several images into a video using the cv2 library.

import cv2         
import time
import glob
from multiprocessing import Process

img_array = []
def convert():
    start_time = time.time()
    for file in glob.glob("C:/pictures/*.jpg"):
        a = cv2.imread(file)
        img_array.append(a)
        height, width, layer = a.shape
        size = (width, height)
    # create the writer once, after the loop, as in the serial version
    video = cv2.VideoWriter('C:/pictures/project.avi', cv2.VideoWriter_fourcc(*'mp4v'), 15, size)
    for img in img_array:
        video.write(img)
    video.release()
    print("--- %s seconds ---" % (time.time() - start_time))
if __name__ == '__main__':
    procs = []
    proc = Process(target=convert)   # starts a single worker process
    procs.append(proc)
    proc.start()
    for proc in procs:
        proc.join()

And this is the parallel version of the serial code above, using multiprocessing.Process. Parallel should be faster than serial, but my parallel version is usually slower than or about the same as the serial one. I am not even sure I converted my serial code to parallel correctly. If anyone could help me, I would really appreciate it.

import cv2
import time    
import glob
import multiprocessing

shape = (1000, 1000)
img_array = []

def convert(files):
    a = cv2.imread(files)
    resized = cv2.resize(a, shape)
    img_array.append(resized)
    video = cv2.VideoWriter('C:/pictures/project.avi', cv2.VideoWriter_fourcc(*'mp4v'), 15, shape)

    for img in img_array:
        video.write(img)
    video.release()

if __name__ == '__main__':
    start_time = time.time()
    p = multiprocessing.Pool()
    for files in glob.glob("C:/pictures/*.jpg"):
        p.apply_async(convert, [files])
    p.close()
    p.join()
    print("--- %s seconds ---" % (time.time() - start_time))

This is the code written in response to the comment below.
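For reference, the usual way to get a real speedup here is to parallelize only the independent per-file step (the `cv2.imread`/`cv2.resize` calls) and keep the single `VideoWriter` in the parent process, since a writer handle cannot be shared across worker processes. Below is a minimal sketch of that pattern; a pure stand-in function replaces the cv2 calls so the skeleton runs anywhere, and `build` is a hypothetical name:

```python
import multiprocessing

def per_file_work(path):
    # Stand-in for the per-file step (cv2.imread + cv2.resize in the
    # question). It must be a top-level function so it can be pickled
    # and sent to the worker processes.
    return path.upper()

def build(paths):
    with multiprocessing.Pool() as pool:
        # Parallel part: one task per file. map() returns results in
        # input order, so frame order is preserved.
        frames = pool.map(per_file_work, paths)
    # Serial part: the single writer (cv2.VideoWriter in the question)
    # stays in the parent process and consumes the results in order.
    return frames

if __name__ == '__main__':
    print(build(["a.jpg", "b.jpg"]))
```

Note that multiprocessing only pays off when the per-file work is substantial; for fast decodes, the cost of shipping image arrays between processes can eat the gain, which may explain the timings observed above.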



Solution 1:[1]

Below is an example of my multiprocessing code. It processes multiple files at once within a specified folder. Simply create a folder where you'd like to dump all the files for parallel processing; I named mine "Folder". It is also important to create another folder where a test file will live for initializing the function. This file can be an Excel test dummy. For this example, I created a folder titled "mastertemp" and added a test file titled "mastertemp.xlsx".

Also, with the code I provided, the multiprocessing speed depends on the number of CPUs available. I generally use Google Cloud AI Notebooks, where I can set a specified number of CPUs. For instance, if I have 8 CPUs, the code will process 8 files simultaneously until there are no more files to process.

I hope this helps out!

import multiprocessing
import glob
import pandas as pd

# MULTI-PROCESS

# created a 'mastertemp' folder and inserted a test excel file to launch the parallel process

file = 'mastertemp/mastertemp.xlsx'   # test file used for initializing

def process(file):
    '''
    Write your code here: what do you want to do with each file?
    Below, I simply read it into a DataFrame.
    '''
    df = pd.read_excel(file)
    print(file)

# The __main__ guard is required so worker processes can import this
# module without re-launching the pool (this matters on Windows, where
# multiprocessing spawns rather than forks).
if __name__ == '__main__':
    p = multiprocessing.Pool()

    # created a 'Folder' titled folder where all excel files will live
    # f = excel files
    for f in glob.glob("Folder/*.xlsx"):
        # launch a task for each file (ish).
        # The result will be approximately one process per CPU core available.
        p.apply_async(process, [f])
    p.close()
    p.join()  # wait for all child processes to finish
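To make the CPU dependence described above explicit: `Pool()` defaults to `os.cpu_count()` workers, and the count can be capped with the `processes` argument. A small sketch (the cap of 8 and the helper name `pool_size` are just illustrative, mirroring the 8-CPU example above):

```python
import multiprocessing
import os

def pool_size(limit=8):
    # Never ask for more workers than CPUs are available; os.cpu_count()
    # can return None on some platforms, hence the fallback to 1.
    return min(limit, os.cpu_count() or 1)

if __name__ == '__main__':
    with multiprocessing.Pool(processes=pool_size()) as pool:
        # sanity check that the pool runs with the chosen size
        print(pool.map(abs, [-3, 4, -5]))  # -> [3, 4, 5]
```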

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1