Occasional "could not translate host name" error in Postgres, Django and Docker Swarm setup
I have a Docker Swarm stack with two nodes, only one of which is a manager. It runs one replica of db on the manager and 3 replicas of web (the Django backend). Occasionally I get this error in the logs of my web container:
psycopg2.OperationalError: could not translate host name "db" to address: Name or service not known
/usr/local/lib/python3.8/site-packages/django/core/management/commands/makemigrations.py:105: RuntimeWarning: Got an error checking a consistent migration history performed for database connection 'default': could not translate host name "db" to address: Name or service not known
When I was building this locally, I got this error, for example, after rebooting my machine, but then I just ran docker-compose down and up again and it disappeared. (I never found another solution for this.)
However now in my swarm stack I do not have a workaround.
I don't know what exactly is causing this. I've tried everything I could find: changing SQL_HOST to localhost, prefixing the service name with the stack name (stack_db), adding POSTGRES_HOST_AUTH_METHOD=trust to the db environment, putting web and db on the same network, changing the postgres image to postgres:13.4-alpine, and adding a depends_on rule. For the depends_on rule I deploy with the command below, which also runs the file through docker-compose config so that the env files are taken into consideration:
docker stack deploy -c <(docker-compose -f my-compose-stack.yml config | yq e '(.services[] | select(.depends_on | tag == "!!map")).depends_on |= (. | keys)' -) stack
Nothing seems to work. I even tried running docker compose up and down on my stack file and then deploying it. The weird thing is that sometimes, all of a sudden, it works. I don't know what's breaking it or what's fixing it. Please help me figure this out.
This is my docker-stack:
version: "3.3"
services:
  db:
    image: postgres:13.4-alpine
    ports:
      - "5432:5432"
    command: "-c logging_collector=on"
    volumes:
      - ./database/postgres_data:/var/lib/postgresql/data/
    networks:
      - data_network
    environment:
      - POSTGRES_USER=student
      - POSTGRES_PASSWORD=x
      - POSTGRES_DB=x
      - POSTGRES_HOST_AUTH_METHOD=trust
    deploy:
      placement:
        constraints:
          - node.role==manager
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
  web:
    image: xxx
    depends_on:
      db:
        condition: service_started
    command: bash -c "python manage.py makemigrations && python manage.py migrate && python manage.py runserver 0.0.0.0:8000"
    ports:
      - 8000:8000
    env_file:
      - .env.dev
    volumes:
      - migrations-volume:/elpaso/api/migrations/
    deploy:
      replicas: 3
      restart_policy:
        condition: on-failure
    networks:
      - web_network
      - data_network
networks:
  web_network:
    driver: overlay
  data_network:
    driver: overlay
volumes:
  migrations-volume:
In my .env I have
SQL_HOST=db
SQL_PORT=5432
SQL_USER=student
SQL_PASSWORD=x
SQL_DATABASE=x
SQL_ENGINE=django.db.backends.postgresql
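The project's settings.py is not shown, so the following is only an assumed sketch of how these variables typically reach Django. The helper name `database_settings` and the fallback defaults are hypothetical. It illustrates that the value of SQL_HOST ends up as the `HOST` entry, which is the host name psycopg2 reports in the error.

```python
import os


def database_settings(env=os.environ):
    """Build a Django-style DATABASES dict from the SQL_* variables.

    Hypothetical helper: the real settings.py is not part of the question.
    """
    return {
        "default": {
            "ENGINE": env.get("SQL_ENGINE", "django.db.backends.postgresql"),
            "NAME": env.get("SQL_DATABASE", ""),
            "USER": env.get("SQL_USER", ""),
            "PASSWORD": env.get("SQL_PASSWORD", ""),
            # With SQL_HOST=db this is the name that must resolve through
            # the overlay network's DNS.
            "HOST": env.get("SQL_HOST", "localhost"),
            "PORT": env.get("SQL_PORT", "5432"),
        }
    }
```

In a settings module this would be used as `DATABASES = database_settings()`.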
There are no logs in my database service, and everything else seems to be working. An hour ago the web service was up and running; the error appeared after I removed the stack and deployed it again. I should mention that I also have an nginx container on my manager and 3 replicas of React, but I excluded them since I don't believe they are related. Please let me know if there is any more information I can provide. Thank you!
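One thing worth noting is that `docker stack deploy` ignores `depends_on`, so the web replicas may run their migration commands before the name "db" is resolvable. Whether that startup race is the actual cause here is an assumption. A minimal sketch of a check that retries the DNS lookup before the migrations run could look like this; the function name `wait_for_host` and its default retry values are hypothetical.

```python
import socket
import time


def wait_for_host(host, attempts=30, delay=2.0,
                  resolver=socket.getaddrinfo, sleep=time.sleep):
    """Return True once `host` resolves, False after `attempts` failed lookups.

    `resolver` and `sleep` are injectable so the loop can be exercised
    without real DNS or real waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            # Same kind of lookup that fails with
            # "Name or service not known" in the psycopg2 error.
            resolver(host, None)
            return True
        except socket.gaierror as exc:
            print(f"attempt {attempt}/{attempts}: cannot resolve {host!r}: {exc}")
            if attempt < attempts:
                sleep(delay)
    return False
```

Calling `wait_for_host(os.environ.get("SQL_HOST", "db"))` from a small script that runs before `python manage.py migrate` would at least show in the web logs whether the name becomes resolvable after a few seconds or never does.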
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow