Airflow scheduler does not start after Google Composer upgrade

Good morning,

After upgrading Google Composer to version 1.18 and Apache Airflow to version 1.10.15 (using Composer's automatic upgrade), the scheduler does not seem to be able to start.

Airflow message: "The scheduler does not appear to be running. Last heartbeat was received 1 day ago. The DAGs list may not update, and new tasks will not be scheduled."

After getting this message, I tried the following:

  • Restarting the web server: gcloud beta composer environments restart-web-server

  • Restarting the Airflow scheduler: kubectl get deployment airflow-scheduler -o yaml | kubectl replace --force -f -

  • Looking at the pod details: kubectl describe pod airflow-scheduler

    Last State:  Terminated
    Reason:      Error
    Exit Code:   1
    Started:     Wed, 23 Feb 2022 15:59:13 +0000
    Finished:    Wed, 23 Feb 2022 16:04:09 +0000

  • Deleting the pod and waiting for it to come back up by itself: kubectl delete pod airflow-scheduler-...

  • EDIT 1: The logs from the pod:

Dags and plugins are not synced yet

  • EDIT 2: Additional logs:

Building synchronization state...
Starting synchronization...
Copying gs://europe-west1-********-bucket/dags/sql/...
Skipping attempt to download to filename ending with slash (/home/airflow/gcs/dags/sql/). This typically happens when using gsutil to download from a subdirectory created by the Cloud Console (https://cloud.google.com/console)
/ [0/1 files][ 0.0 B/ 11.0 B] 0% Done
InvalidUrl Error: Invalid destination path: /home/airflow/gcs/dags/sql/
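The sync log above complains about a destination ending in a slash, which the message itself attributes to a folder created in the Cloud Console. One way to spot such entries is to filter a saved bucket listing for object URLs that end in "/". The sketch below is only an illustration: the function name find_slash_objects is made up, and it assumes the listing has one object URL per line (for example, saved output of a recursive gsutil ls).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read object URLs from stdin (one per line) and
# print only those whose name ends with "/". Directory headers that end
# in "/:" are not matched.
find_slash_objects() {
  # grep exits with 1 when nothing matches; treat that as "no findings".
  grep -E '/$' || true
}
```

Any URL it prints (here that would be the dags/sql/ entry) is a candidate to inspect in the bucket before deciding whether it is an empty placeholder.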

The pod keeps restarting on its own and sometimes shows CrashLoopBackOff, which indicates that a container is repeatedly crashing after restarting.

I'm not sure what else I could do :/.

Thanks for the help :)



Solution 1:[1]

The problem you are facing is related to resources reaching their limits, which is preventing the scheduler from starting.

My assumptions about what could be happening:

  1. The limits set on the scheduler are causing the gcsfuse process to get killed. Can you remove them to check whether that stops the crash loop?
  2. The Kubernetes cluster does not have enough resources for the Composer agent to start the scheduler job. You can add resources to the cluster.
  3. A corrupted entry is being picked up when the scheduler starts. In that case, you could restart the scheduler yourself by connecting to the instance over SSH.
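To help tell these cases apart, it can be useful to pull the last termination reason and exit code out of the kubectl describe pod output. The following is a minimal sketch, not an official Composer tool: both function names are made up, and it assumes the usual "Last State / Reason / Exit Code" layout of the describe output. Exit code 137 means the process received SIGKILL (128 + 9), which is what a container killed for exceeding its memory limit reports, usually together with the reason OOMKilled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl describe pod` output on stdin and
# print "<reason> <exit code>" for each container that has a Last State.
last_termination() {
  awk '
    /Last State:/           { in_last = 1; reason = ""; next }
    in_last && /Reason:/    { reason = $2 }
    in_last && /Exit Code:/ { print reason, $3; in_last = 0 }
  '
}

# Hypothetical helper: rough interpretation of a reason/exit-code pair.
# $1 = reason, $2 = exit code
classify_termination() {
  if [ "$1" = "OOMKilled" ] || [ "$2" = "137" ]; then
    echo "killed (possible memory limit)"
  else
    echo "exited on its own (check container logs)"
  fi
}
```

For example, kubectl describe pod airflow-scheduler-... | last_termination prints one line per container. A reason of Error with exit code 1, as in the question, suggests the process exited by itself rather than being killed, so the container logs are the next place to look.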

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jose Gutierrez Paliza