Google Drive and Colaboratory Virtual Machine are not syncing properly

I'm trying to download and extract the Google Speech Commands Dataset using a Google Colab notebook. The tar archive is fairly large in absolute terms, but small by ML-dataset standards. After executing the code snippet attached below, I can see that the archive has been extracted correctly and all the files are present on the Virtual Machine's disk (100K+ files, as expected).

I understand that syncing the Colab Virtual Machine's storage with Google Drive takes some time. But even after waiting quite a while (almost a day), Google Drive is not getting updated properly. Running a couple of lines of code reveals that only 10-12 directories (out of 36) have been updated correctly, and the rest are empty. Is this some kind of bug in Google Drive's sync process, or am I doing something incorrectly? Any help or advice would be appreciated. Thanks in advance!
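
(To be concrete, the check I'm running is essentially something like the snippet below, just counting how many of the extracted class directories on the Drive side actually contain files; the path is the same dataset directory used in the code further down.)

import os

dataset_dir = "/gdrive/My Drive/Temp/ML/Final/dataset/"
# Class directories that the extraction should have created.
subdirs = [d for d in os.listdir(dataset_dir)
           if os.path.isdir(os.path.join(dataset_dir, d))]
# How many of them actually contain files on the Drive side.
non_empty = [d for d in subdirs if os.listdir(os.path.join(dataset_dir, d))]
print(f"{len(non_empty)} of {len(subdirs)} directories contain files")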

I'm using this fairly simple piece of code.

import os
import tarfile
import requests
from tqdm import tqdm

from google.colab import drive
drive.mount('/gdrive', force_remount = True)


# url = "http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz"
url = "http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz"
file_name = "speech.tar.gz"
dir_path = "/gdrive/My Drive/Temp/ML/Final/dataset/"
download_path = dir_path + file_name

r = requests.get(url, stream = True)
with open(download_path, 'wb') as file:
    for block in r.iter_content(chunk_size = 4096):
        if block:
            file.write(block)
print("[INFO] Download completed.")

def extract_tar(path, dest_dir):
    os.chdir(dest_dir)
    with tarfile.open(name = path) as tar:
        members = tar.getmembers()
        for member in tqdm(iterable = members, total = len(members)):
            tar.extract(member = member)

extract_tar(download_path, dir_path)
print("[INFO] Extraction completed.")

!rm /gdrive/My\ Drive/Temp/ML/Final/dataset/speech.tar.gz
print("[CAUTION] File deleted!")


Solution 1:[1]

I also faced this issue, and I was finally able to resolve it (with just one line of code).

  1. I have a very big file (around 6 GB) uploaded to Google Drive.

  2. I tried to extract it using an online cloud converter, but that is paid if your file is bigger than 1 GB.

  3. Then I tried to extract it in Colab using:

    # mount google drive
    from google.colab import drive
    drive.mount('gdrive', force_remount=True)
    root_dir = '/content/gdrive/'

    # extract the file
    !unrar x <rar_file>
    

I could see that the file was getting extracted, but I couldn't find the extracted files in Google Drive.

So, as in the original question here, I also thought that maybe Google Colab and Google Drive were not getting synced.

But when I looked at it closely, I found that Colab extracts the files into the /content folder (no matter which directory you run the unrar command from).

So I simply used this command and it solved my problem:

!unrar x <rar_file> <dest_folder_where_we_want_to_extract>
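
For example (hypothetical paths, assuming the archive sits in the mounted Drive folder), the call would look something like:

    !unrar x "/content/gdrive/My Drive/archive.rar" "/content/gdrive/My Drive/extracted/"
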
  4. At this point, all the extracted files might still be in the Colab VM and not fully synced to Drive.

So we have to explicitly sync Colab and Drive:

    drive.flush_and_unmount()   

This might take a couple of hours if the data runs to several GB; otherwise it's very fast.

(Please run this command in a separate cell, not in the same cell where you extract the archive.)

  5. As the above command unmounts the drive, please remount it:

    from google.colab import drive
    drive.mount('gdrive', force_remount=True)
    root_dir = '/content/gdrive/'
    

Solution 2:[2]

Do matters improve if you call drive.flush_and_unmount()?

Details: https://colab.research.google.com/notebooks/io.ipynb#scrollTo=D78AM1fFt2ty
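
As a minimal sketch of how that would slot into the original snippet (assuming the same /gdrive mount point as in the question), the flow would be:

    from google.colab import drive

    drive.mount('/gdrive', force_remount=True)

    # ... download and extract the dataset under /gdrive/My Drive/... as in the question ...

    # Flush all pending writes to Drive, then unmount.
    drive.flush_and_unmount()

    # Remount if Drive is needed again afterwards.
    drive.mount('/gdrive', force_remount=True)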

Solution 3:[3]

Try setting Google Drive's display language to English if you are using a different language. This did the trick for me.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Bob Smith
Solution 3 ouflak