Attempting to manually download MNIST PyTorch dataset in Databricks

I've tried a few different approaches to get the dataset manually loaded into Databricks' DBFS so that PyTorch can load it. However, the MNIST dataset seems to be just a binary file. Am I expected to unzip it first, or should I point directly to the gzipped file? So far all my attempts have produced this error:

 train_dataset = datasets.MNIST(
     13     'dbfs:/FileStore/tarballs/train_images_idx3_ubyte.gz',
     14     train=True,
RuntimeError: Dataset not found. You can use download=True to download it

I am aware I can set download=True; however, because of firewalls this is not an option, and I want to upload the files and wire them in myself via DBFS. Has anyone else done this?

EDIT: @alexey suggested that I need to add the extra directories MNIST/raw to the path.


I then changed the input to:

  train_dataset = datasets.MNIST(
    '/dbfs/FileStore/tarballs',
    train=True,
    download=False,
    transform=transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]))
  data_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

But I still get the same error.
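One way to narrow this down is to list which files the loader would not find under the root being passed in. The snippet below is a minimal sketch, not torchvision's own check: `missing_mnist_files` is a hypothetical helper, and it assumes a torchvision version that looks for the four unpacked files, under their original hyphenated names, in `<root>/MNIST/raw`.

```python
from pathlib import Path

# Original MNIST file names (unpacked), assumed to be what the loader looks for.
EXPECTED_FILES = [
    "train-images-idx3-ubyte",
    "train-labels-idx1-ubyte",
    "t10k-images-idx3-ubyte",
    "t10k-labels-idx1-ubyte",
]


def missing_mnist_files(root):
    """Return the expected file names that are absent from <root>/MNIST/raw."""
    raw_dir = Path(root) / "MNIST" / "raw"
    return [name for name in EXPECTED_FILES if not (raw_dir / name).is_file()]
```

Calling `missing_mnist_files('/dbfs/FileStore/tarballs')` in the notebook would show whether the files are in the wrong directory, still compressed, or named differently (for example with underscores instead of hyphens).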




Solution 1:[1]

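My code and directory layout. The first argument to datasets.MNIST is the root directory; the data files sit in MNIST/raw beneath it, under their original names: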

import torch
from torchvision import datasets, transforms

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../colabx/data', train=True, download=False,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])))

....\colabx\data\MNIST\raw>ls
t10k-images-idx3-ubyte     train-images-idx3-ubyte
t10k-images-idx3-ubyte.gz  train-images-idx3-ubyte.gz
t10k-labels-idx1-ubyte     train-labels-idx1-ubyte
t10k-labels-idx1-ubyte.gz  train-labels-idx1-ubyte.gz
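
The layout above can be reproduced from the uploaded .gz files with a few lines of standard-library Python. This is a minimal sketch rather than part of the original answer: `stage_mnist` is a hypothetical helper, and the fallback to underscored names is an assumption based on the underscored file name in the question. It also assumes a torchvision version that looks for the unpacked files in `<root>/MNIST/raw`; other versions may check different locations.

```python
import gzip
import shutil
from pathlib import Path

# Original MNIST file names, matching the directory listing above.
MNIST_FILES = [
    "train-images-idx3-ubyte",
    "train-labels-idx1-ubyte",
    "t10k-images-idx3-ubyte",
    "t10k-labels-idx1-ubyte",
]


def stage_mnist(src_dir, root):
    """Copy uploaded .gz files into <root>/MNIST/raw and unpack them there.

    Hypothetical helper: accepts uploads named with hyphens or underscores
    and always writes the hyphenated names into the raw directory.
    """
    src_dir = Path(src_dir)
    raw_dir = Path(root) / "MNIST" / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)

    for name in MNIST_FILES:
        # Try the original name first, then the underscored variant.
        candidates = [
            src_dir / f"{name}.gz",
            src_dir / f"{name.replace('-', '_')}.gz",
        ]
        src = next((p for p in candidates if p.is_file()), None)
        if src is None:
            raise FileNotFoundError(f"no upload found for {name}.gz in {src_dir}")

        gz_target = raw_dir / f"{name}.gz"
        if src.resolve() != gz_target.resolve():
            shutil.copyfile(src, gz_target)

        # Write the unpacked file next to the .gz, as in the listing above.
        with gzip.open(gz_target, "rb") as f_in, open(raw_dir / name, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)

    return raw_dir
```

On Databricks this might be called as `stage_mnist('/dbfs/FileStore/tarballs', '/dbfs/FileStore/tarballs')`, with the same root then passed to `datasets.MNIST(..., download=False)`. Note the `/dbfs/...` form of the path: ordinary Python file functions go through that local mount and do not understand `dbfs:/...` URIs.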

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Alexey Birukov