How to shrink a git repository by removing all outputs from all Jupyter notebooks in the history
I'd like to shrink a git repository by removing all outputs from every Jupyter notebook in the repository's history. Is this even possible?
Solution 1:[1]
Yes, it's possible using git filter-repo.
- Install git filter-repo
- Get a clean clone of your repository, and make sure you keep a backup in case something goes wrong.
- Go to the repository's root directory, and run the following command:
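git filter-repo --blob-callback '
import json
try:
    # blob.data holds the raw bytes of one file version
    notebook = json.loads(blob.data)
    # Only a JSON object with a "cells" key is treated as a notebook
    if isinstance(notebook, dict) and "cells" in notebook:
        for cell in notebook["cells"]:
            if "outputs" in cell:
                cell["outputs"] = []
        blob.data = (json.dumps(notebook, ensure_ascii=False, indent=1,
                                sort_keys=True) + "\n").encode("utf-8")
except (json.JSONDecodeError, UnicodeDecodeError):
    # Not JSON (for example a binary file): leave the blob untouched
    pass
'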
This command runs the Python code on every blob in the history and tries to parse each one as JSON. If parsing succeeds and returns a dict with a "cells" key, then we're (almost certainly) dealing with a Jupyter notebook, so we go through the cells and replace each cell's outputs with an empty list. The code then uses json.dumps() to serialize the notebook again and replaces the blob's previous data with the result.
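If you want to try the stripping logic before rewriting any history, the same steps can be sketched as a standalone Python function. This is only a sketch: the names strip_outputs and sample are made up for illustration and are not part of git filter-repo, and the sample notebook is hand-written test data rather than a real file.

```python
import json


def strip_outputs(data: bytes) -> bytes:
    """Return notebook bytes with every cell's outputs emptied.

    Input that is not a notebook (binary data, invalid JSON, or JSON
    without a "cells" key) is returned unchanged.
    """
    try:
        notebook = json.loads(data)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return data  # not JSON, so not a notebook
    if not isinstance(notebook, dict) or "cells" not in notebook:
        return data  # valid JSON, but not shaped like a notebook
    for cell in notebook["cells"]:
        if isinstance(cell, dict) and "outputs" in cell:
            cell["outputs"] = []
    text = json.dumps(notebook, ensure_ascii=False, indent=1, sort_keys=True)
    return (text + "\n").encode("utf-8")


# A tiny hand-written notebook (hypothetical sample data).
sample = json.dumps({
    "cells": [
        {"cell_type": "code", "source": ["1 + 1"],
         "outputs": [{"output_type": "execute_result",
                      "data": {"text/plain": ["2"]}}]},
        {"cell_type": "markdown", "source": ["# Title"]},
    ],
    "nbformat": 4,
}).encode("utf-8")

stripped = json.loads(strip_outputs(sample))
print(stripped["cells"][0]["outputs"])  # []
```

Running it on a few notebooks from your working tree and checking the result is a cheap way to confirm the behaviour before running the history rewrite.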
I tried this on a repository of about 200 MB containing many notebooks, and it shrank to 20 MB. I found this quite nice, so I thought I'd share.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | MiniQuark |
