Occasional "could not translate host name" error in a Postgres, Django and Docker Swarm setup

I have a Docker Swarm stack with two nodes, only one of which is a manager. There is one replica of db on the manager and 3 replicas of web (the Django backend). Occasionally I get this error in the logs of my web container:

psycopg2.OperationalError: could not translate host name "db" to address: Name or service not known
/usr/local/lib/python3.8/site-packages/django/core/management/commands/makemigrations.py:105: RuntimeWarning: Got an error checking a consistent migration history performed for database connection 'default': could not translate host name "db" to address: Name or service not known

When I was building this locally, I got this error, for example, after rebooting my machine, but then I just ran docker-compose down and up again and it disappeared (I never found another solution for this). However, in my swarm stack I do not have a workaround.

I don't know what exactly is causing this. I've tried everything I could find: changing SQL_HOST to localhost, prefixing the service with the stack name (stack_db), adding POSTGRES_HOST_AUTH_METHOD=trust to the db environment, putting web and db on the same network, changing the postgres image to postgres:13.4-alpine, and adding a depends_on rule, which I convert with a script in my deploy command (I also pass the file through docker-compose config so that the env files are taken into consideration):

docker stack deploy -c <(docker-compose -f my-compose-stack.yml config | yq e '(.services[] | select(.depends_on | tag == "!!map")).depends_on |= (. | keys)' -) stack

Nothing seems to work. I even tried to docker compose up and down on my stack file and then deploy it. The weird thing is that sometimes, all of a sudden, it works. I don't know what's breaking it or what's fixing it. Please help me figure this out.
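To narrow down when the name stops resolving, a small check like the one below could be run inside a web container before the migration commands. It is only a sketch: the function name wait_for_host and its parameters are made up for illustration, and it only retries the same lookup that fails in the traceback (socket.getaddrinfo), so it shows whether the failure is transient but does not fix the underlying cause.

```python
import socket
import time


def wait_for_host(host, port, attempts=10, delay=2.0,
                  resolve=socket.getaddrinfo, sleep=time.sleep):
    """Return True as soon as `host` resolves, False if every attempt fails.

    `resolve` and `sleep` are injectable only so the function can be
    exercised without a real network or real waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            resolve(host, port)
            return True
        except socket.gaierror as exc:
            # Same class of failure as "could not translate host name".
            print(f"attempt {attempt}/{attempts}: cannot resolve {host!r}: {exc}")
            if attempt < attempts:
                sleep(delay)
    return False
```

For example, wait_for_host("db", 5432) could be called from a short script at the start of the web command, so that makemigrations and migrate only run once the lookup succeeds.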

This is my docker-stack:

version: "3.3"

services:
  db:
    image: postgres:13.4-alpine
    ports:
      - "5432:5432"
    command: "-c logging_collector=on"
    volumes:
      - ./database/postgres_data:/var/lib/postgresql/data/
    networks:
      - data_network
    environment:
      - POSTGRES_USER=student
      - POSTGRES_PASSWORD=x
      - POSTGRES_DB=x
      - POSTGRES_HOST_AUTH_METHOD=trust
    deploy:
      placement:
        constraints:
          - node.role==manager
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
  web:
    image: xxx
    depends_on:
      db:
        condition: service_started
    command: bash -c "python manage.py makemigrations && python manage.py migrate && python manage.py runserver 0.0.0.0:8000"
    ports:
      - 8000:8000
    env_file:
      - .env.dev
    volumes:
      - migrations-volume:/elpaso/api/migrations/
    deploy:
      replicas: 3
      restart_policy:
        condition: on-failure
    networks:
      - web_network
      - data_network
networks:
  web_network:
    driver: overlay
  data_network:
    driver: overlay
volumes:
  migrations-volume:

In my .env I have:

SQL_HOST=db 
SQL_PORT=5432 
SQL_USER=student
SQL_PASSWORD=x 
SQL_DATABASE=x 
SQL_ENGINE=django.db.backends.postgresql
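For context, these variables are typically consumed in settings.py along the following lines. This is a sketch of the usual pattern, not a copy of my actual settings: the helper name database_settings and the fallback defaults are assumptions, and the strip() calls are only there so that stray whitespace around a value in the env file cannot end up in the host name.

```python
import os


def database_settings(env=os.environ):
    """Build a Django DATABASES dict from the SQL_* environment variables."""

    def get(key, default):
        # Strip whitespace so a value like "db " is treated as "db".
        return env.get(key, default).strip()

    return {
        "default": {
            "ENGINE": get("SQL_ENGINE", "django.db.backends.postgresql"),
            "NAME": get("SQL_DATABASE", "postgres"),
            "USER": get("SQL_USER", "postgres"),
            "PASSWORD": get("SQL_PASSWORD", ""),
            "HOST": get("SQL_HOST", "db"),
            "PORT": get("SQL_PORT", "5432"),
        }
    }


# In settings.py this would be used as:
# DATABASES = database_settings()
```

With this layout, HOST is whatever SQL_HOST contains, so the "db" in the error message is the service name from the stack file being looked up from inside the web container.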

There are no logs in my database service, and everything else seems to be working. An hour ago the web service was up and running; after removing the stack and deploying again, this happened. I should mention that I also have an nginx container on my manager and 3 replicas of React, but I excluded them since I don't believe they are related. Please let me know if there is any more information I can provide. Thank you!



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
