AWS MWAA: Glue Crawler issue

I have manually provisioned a Glue Crawler and am now attempting to run it via Airflow (in AWS).

Based on the docs, there seem to be plenty of ways to handle this objective compared to other tasks within the Glue environment. However, I'm having trouble with this seemingly simple scenario.

The following code defines the basic setup for Glue[Crawler]+Airflow. Assume there are some other working tasks that are defined before and after it, which are not included here.

run_crawler = AwsGlueCrawlerHook()
run_crawler.start_crawler(crawler_name="foo-crawler")

Now, here is an example flow:

json2parquet >> run_crawler >> parquet2redshift

Given all this, the following error manifests on the Airflow Webserver UI:

Broken DAG: An error occurred (CrawlerRunningException) when calling the StartCrawler operation: Crawler with name housing-raw-crawler-crawler-b3be889 has already started

I get it: why don't you use something other than the start_crawler method...? Fair point, but I don't know what else to employ. I just want to start the crawler after some upstream tasks have successfully completed but am unable to.

How should I resolve this problem?



Solution 1:[1]

json2parquet >> run_crawler >> parquet2redshift

In Airflow, the bitwise right shift Python operator (>>) is used to define a downstream relationship between two operators (i.e. instances of BaseOperator subclasses).

Declaring a DAG > Task Dependencies (Airflow)

run_crawler = AwsGlueCrawlerHook()
run_crawler.start_crawler(crawler_name="foo-crawler")

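run_crawler (AwsGlueCrawlerHook) is not an operator; it is a subclass of BaseHook. The >> (and <<) Python operators can only be used with objects that are subclasses of BaseOperator. In addition, because start_crawler is called at the top level of the DAG file, it runs every time the file is parsed rather than when a task executes. That is most likely why the error shows up as a Broken DAG: a run started by an earlier parse is still in progress when the file is parsed again.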

airflow.hooks.base
airflow.models.baseoperator
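The distinction above can be illustrated with a toy model in plain Python. This is only a sketch: the Task and Hook classes below are hypothetical stand-ins written for this example, not Airflow's BaseOperator or BaseHook, and Airflow's real error type and messages differ. It shows two things: >> only works between task-like objects, and wrapping the hook call in a task defers it until the task actually runs.

```python
class Task:
    """Toy stand-in for an operator (hypothetical, not Airflow's BaseOperator)."""

    def __init__(self, task_id, fn):
        self.task_id = task_id
        self.fn = fn              # work to perform when the task is executed
        self.downstream = []

    def __rshift__(self, other):
        # a >> b records b as downstream of a and returns b so chains work.
        if not isinstance(other, Task):
            raise TypeError(f"cannot chain {type(other).__name__}; expected a Task")
        self.downstream.append(other)
        return other


class Hook:
    """Toy stand-in for a hook: a plain client object that cannot be chained."""

    def __init__(self):
        self.calls = []

    def start_crawler(self, crawler_name):
        self.calls.append(crawler_name)


hook = Hook()
json2parquet = Task("json2parquet", lambda: None)
parquet2redshift = Task("parquet2redshift", lambda: None)

# Wrapping the hook call in a task defers it until the task is executed.
run_crawler = Task("run_crawler", lambda: hook.start_crawler("foo-crawler"))
json2parquet >> run_crawler >> parquet2redshift

calls_at_parse_time = list(hook.calls)  # snapshot before any task has run
run_crawler.fn()                        # simulate the scheduler executing the task
calls_at_run_time = list(hook.calls)

# Chaining the hook itself is rejected by the toy Task class.
try:
    json2parquet >> hook
    chain_error = None
except TypeError as exc:
    chain_error = exc
```

In the question's code the hook call sits outside any task, so it behaves like calling hook.start_crawler directly at the top of this script: it fires as soon as the file is loaded.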

How should I resolve this problem?

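run_crawler needs to be implemented as an operator (i.e. an instance of a BaseOperator subclass), so that the crawler is started when the task runs and not when the DAG file is parsed.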
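One way to do this is to wrap the hook call in a PythonOperator. The following is a minimal sketch, not a tested DAG. It assumes Airflow 2.x with the Amazon provider installed, a connection named aws_default, and a provider release that exposes the hook as GlueCrawlerHook (the question's AwsGlueCrawlerHook appears to be the older name for the same hook). The dag_id, start_date, and the start_foo_crawler function are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.glue_crawler import GlueCrawlerHook


def start_foo_crawler():
    # Runs only when the task executes, not when the DAG file is parsed.
    hook = GlueCrawlerHook(aws_conn_id="aws_default")
    hook.start_crawler("foo-crawler")
    # Block until the crawl finishes so downstream tasks see the updated catalog.
    hook.wait_for_crawler_completion("foo-crawler", poll_interval=10)


with DAG(
    dag_id="glue_crawler_python_operator_sketch",  # hypothetical name
    start_date=datetime(2022, 1, 1),               # hypothetical date
    catchup=False,
) as dag:
    run_crawler = PythonOperator(
        task_id="run_crawler",
        python_callable=start_foo_crawler,
    )
    # With the other tasks defined in the same DAG:
    # json2parquet >> run_crawler >> parquet2redshift
```

Because run_crawler is now an operator, it can take part in the >> chain from the question.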

PythonOperator is one type of operator, and it can wrap the hook call. The GlueCrawlerOperator is more feature-rich with respect to creating, updating, and running a Glue crawler. The operator executes idempotently. For example, if a crawler with the same name already exists, the operator will update it to match the given config and run it. Otherwise, it will create the crawler and then run it.

GlueCrawlerOperator (Airflow)
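A minimal sketch of the GlueCrawlerOperator approach follows. It assumes Airflow 2.x with a release of the Amazon provider that ships GlueCrawlerOperator, and a connection named aws_default; the dag_id and start_date are hypothetical. Since the crawler in the question was provisioned manually, the config here passes only its name. If the crawler did not exist, the create call would need the full crawler definition (role, targets, and so on).

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator

with DAG(
    dag_id="glue_crawler_operator_sketch",  # hypothetical name
    start_date=datetime(2022, 1, 1),        # hypothetical date
    catchup=False,
) as dag:
    run_crawler = GlueCrawlerOperator(
        task_id="run_crawler",
        # "Name" identifies the existing crawler; no other settings are changed.
        config={"Name": "foo-crawler"},
        aws_conn_id="aws_default",
        poll_interval=10,  # seconds between status checks while the crawl runs
    )
    # With the other tasks defined in the same DAG:
    # json2parquet >> run_crawler >> parquet2redshift
```

The crawler is started only when the run_crawler task executes, after json2parquet has succeeded.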

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Andrew Nguonly