Hold multidimensional numpy arrays of images in a pandas dataframe and write to disk

I am trying to read about 300K pairs of images from URLs listed in a pandas dataframe, convert each image to a numpy array, and then calculate the MSE and SSIM values for each of the 300K pairs.

Basically what I want is something like this: output dataframe

I preprocess each image before calculating SSIM and MSE with this function:

import cv2
import numpy as np
from urllib.request import urlopen

def url_to_image(url, readFlag=cv2.IMREAD_COLOR):
    # download the image, convert the raw bytes to a numpy array,
    # and decode them into an OpenCV image
    resp = urlopen(url)
    image = np.asarray(bytearray(resp.read()), dtype="uint8")
    image = cv2.imdecode(image, readFlag)
    return image

What I want to do is iterate row by row, call the above function to get the numpy arrays for the pair of images to be compared, calculate MSE and SSIM, and add these two values as columns to the dataframe, like this:

for index, row in df.iterrows():
    # fetch each image once and reuse it for both metrics
    airbnb_img = url_to_image(row["AIRBNB_IMAGE"])
    homeaway_img = url_to_image(row["HOMEAWAY_IMAGE"])
    df.at[index, 'mse'] = mse(airbnb_img, homeaway_img)
    df.at[index, 'ssim'] = ssim(airbnb_img, homeaway_img, multichannel=True)
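For context, `ssim` here is presumably scikit-image's `structural_similarity`, but `mse` is not a built-in anywhere, so I wrote it myself. A minimal sketch of the plain per-pixel mean squared error in NumPy (my own helper, not from a library):

```python
import numpy as np

def mse(image_a, image_b):
    # mean of the squared per-pixel differences; cast to float first so
    # the uint8 subtraction cannot wrap around
    a = image_a.astype("float64")
    b = image_b.astype("float64")
    return float(np.mean((a - b) ** 2))
```

Identical images score 0, and larger values mean the images differ more.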

I tried this code and it takes forever, and it looks like there is a memory leak. My question: is there a way to avoid pandas holding the 300K pairs of 400x400 matrices in memory before I can write to a file?

Can I instead write the results to a CSV file line by line and free the memory before reading the next pair of images?
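What I have in mind is something like the sketch below: score each pair, append one line to the CSV immediately, and let the arrays go out of scope so only one pair is ever in memory. The `fake_image` helper is a stand-in for the real `url_to_image` download (so the sketch runs offline), and the two-row dataframe is made-up sample data:

```python
import csv

import numpy as np
import pandas as pd

def fake_image(seed):
    # hypothetical stand-in for url_to_image: fabricates a deterministic
    # 4x4 RGB array instead of downloading and decoding a real image
    rng = np.random.default_rng(seed)
    return rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

def mse(image_a, image_b):
    # plain per-pixel mean squared error
    a = image_a.astype("float64")
    b = image_b.astype("float64")
    return float(np.mean((a - b) ** 2))

# made-up sample data; in the real dataframe these columns hold image URLs
df = pd.DataFrame({"AIRBNB_IMAGE": [1, 2], "HOMEAWAY_IMAGE": [3, 4]})

# stream results straight to disk: each pair is fetched, scored, written,
# and then discarded, so nothing accumulates in the dataframe
with open("scores.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["AIRBNB_IMAGE", "HOMEAWAY_IMAGE", "mse"])
    for row in df.itertuples(index=False):
        img_a = fake_image(row.AIRBNB_IMAGE)
        img_b = fake_image(row.HOMEAWAY_IMAGE)
        writer.writerow([row.AIRBNB_IMAGE, row.HOMEAWAY_IMAGE, mse(img_a, img_b)])
```

The same SSIM call from the loop above would slot in next to `mse` once scikit-image is imported; I left it out here to keep the sketch minimal.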



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow