pytorch.to(device="cuda") leads to killed process in Docker container
This is the Dockerfile:
FROM registry.access.redhat.com/ubi8/python-39:1-48
USER root
....
COPY pyproject.toml poetry.lock start-singleuser.sh setup-volume.sh jupyter_notebook_config.py ${PY_PKG_DIR}/
COPY config/krb5.conf /etc/krb5.conf
COPY config/core-site.xml /etc/hadoop/hadoop-3.3.1/etc/hadoop/core-site.xml
COPY config/hdfs-site.xml /etc/hadoop/hadoop-3.3.1/etc/hadoop/hdfs-site.xml
COPY content/root/
# System Setup
RUN DIST=$(. /etc/os-release; echo $ID$VERSION_ID)\
&& dnf -y --enablerepo "ubi-8*" update --nobest\
&& curl -s -L https://nvidia.github.io/libnvidia-container/$DIST/libnvidia-container.repo |tee /etc/yum.repos.d/libnvidia-container.repo\
&& dnf -y --enablerepo "ubi-8*" install java-11-openjdk nvidia-container-toolkit nodejs nss_wrapper git\
&& dnf -y --enablerepo "ubi-8*" clean all\
&& dnf list installed
RUN ln -sf /usr/share/zoneinfo/Europe/Berlin /etc/localtime\
&& echo "Europe/Berlin" > /etc/timezone\
&& python3 -m ensurepip --upgrade\
&& python3 -m pip --no-cache-dir install poetry\
&& poetry config virtualenvs.create false\
&& cd ${PY_PKG_DIR}\
&& poetry install --no-dev -v
RUN keytool -import -alias ad_r8737 -keystore ${JAVA_HOME}/lib/security/cacerts -file /usr/share/pki/ca-trust-source/anchors/R8737.pem -noprompt -storepass changeit
RUN chmod g+w /etc/krb5.conf\
&& mkdir -p /opt/app-root/src/.local\
&& chmod g+w /opt/app-root/src/.local
ENV PIP_CONFIG_FILE=/opt/pip/pip.conf\
# PIP_TARGET=${PACKAGE_DIR}\
PYTHONPATH="${PYTHONPATH}:/jup/packages"\
HOME=${WORKING_DIR}
RUN python3 -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())"
RUN python3 -m pip install --upgrade pip --target="/opt/app-root/lib/python3.9/site-packages"
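# Install PyTorch wheels built against CUDA 11.3 from the official PyTorch extra index.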
RUN python3 -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113\
&& pip install jupyter --target="/opt/app-root/lib/python3.9/site-packages"
RUN python3 -m IPython kernel install --name "torch_cuda" --display-name "Kernel for development with GPUs"
# Is there a reason to do this differently?
RUN chmod -R 775 /opt/pip
RUN chmod -R 775 /proc/driver
EXPOSE 8888
ENTRYPOINT ["/bin/bash", "/opt/app-root/src/app-pkg/start-singleuser.sh"]
The container runs in a Kubernetes cluster, and as soon as I try to use the GPU, the Python process is instantly killed:
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.rand(4).to(device="cuda")
Killed
(app-root)
I am not able to debug this, since there is no stack trace or any other output; the process just dies with "Killed". Does anyone know what might cause this and how to fix it?
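My best guess so far is the kernel OOM killer, because a plain "Killed" with no Python traceback usually means the process received SIGKILL from outside of Python. To check that theory from inside the pod, something along these lines should print the cgroup memory limit and current usage (a minimal sketch; the paths assume cgroup v1, on cgroup v2 the files are /sys/fs/cgroup/memory.max and /sys/fs/cgroup/memory.current):
def read_value(path):
    # Return the file's contents, or None if the path does not exist in this container.
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

# cgroup v1 paths (assumption); Kubernetes enforces the pod's memory limit here.
print("cgroup memory limit:", read_value("/sys/fs/cgroup/memory/memory.limit_in_bytes"))
print("cgroup memory usage:", read_value("/sys/fs/cgroup/memory/memory.usage_in_bytes"))
If the limit turns out to be small, initializing the CUDA context alone could already push the container over it, and the kernel would then kill the Python process without any traceback.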
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0.
