Orchestration of an NLP model via Airflow and Kubernetes

This is more of an architecture question. I have a data engineering background and have been using Airflow to orchestrate ETL tasks for a while, but I have limited knowledge of containerization and Kubernetes. I have been tasked with coming up with a good-practice framework for productionalizing our data science models using an orchestration engine, namely Airflow.

Our data science team creates many NLP models to process text documents from various sources. Previously, a model was created by an external team, which required us to create an Anaconda environment, install libraries on it, and run the model. Running the model was very manual: a data engineer would spin up an EC2 instance, set up the model, download the files to the instance, process the files with the model, and take the output for further processing.

We are trying to move away from this to an automated pipeline where an Airflow DAG orchestrates all of it. The part I am struggling with is running the model.

These are the logical steps I am thinking of. Please let me know if you think this would be feasible. All of this would be done in Airflow. Steps 2, 3, and 4 are the ones I am totally unsure how to achieve; a rough sketch of what I imagine follows the list.

  1. Download files from FTP to S3
  2. Dynamically spin up a Kubernetes cluster and create parallel pods based on the number of files to be processed
  3. Split the files between those pods so each pod only processes its own subset of files
  4. Collate the model output from each pod into an S3 location
  5. Do post-processing on the results
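
Roughly, this is the DAG shape I have in mind. It is only a sketch: the image name, bucket, and task callables are placeholders, and the `KubernetesPodOperator` part (steps 2-4) is exactly what I am unsure about.

```python
# Rough sketch only: image, bucket, and callables are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    dag_id="nlp_document_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    download_to_s3 = PythonOperator(
        task_id="download_ftp_to_s3",
        python_callable=lambda: None,  # placeholder for the FTP -> S3 download
    )

    run_model = KubernetesPodOperator(
        task_id="run_nlp_model",
        name="nlp-model",
        image="my-registry/nlp-model:latest",                # placeholder image
        arguments=["--input", "s3://my-bucket/incoming/"],   # placeholder bucket
        get_logs=True,
    )

    post_process = PythonOperator(
        task_id="post_process",
        python_callable=lambda: None,  # placeholder for post-processing
    )

    download_to_s3 >> run_model >> post_process
```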

I am unsure how I can spin up a Kubernetes cluster from Airflow at runtime, and especially how to split files between pods so that each pod only processes its own chunk of files and pushes its output to a shared location.

Running the model has two modes: daily and complete. Daily processes the delta of files added since the last run, whereas complete is a historical reprocessing of the whole document catalogue that we run every 6 months. As you can imagine, the back catalogue requires a lot of parallel processing and many pods running in parallel to get through the number of documents.

I know this is a very generic post, but my lack of Kubernetes knowledge is the issue, and any help pointing me in the right direction would be appreciated.



Solution 1:[1]

Normally people schedule containers or pods as needed on top of an existing K8s cluster; however, I am not sure how frequently you need to create the K8s cluster itself.

K8s cluster setup:

You can create the K8s cluster in different ways; it mostly depends on the cloud provider and the options they offer, such as an SDK, a CLI, etc.

Here is one example of using this approach with Airflow to create an AWS EKS cluster: https://leftasexercise.com/2019/04/01/python-up-an-eks-cluster-part-i/
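
As a hedged sketch of what that looks like with boto3 (the role ARN, subnet IDs, region, and cluster name below are all placeholders; you could wrap this in a `PythonOperator`):

```python
# Minimal sketch: create an EKS control plane with boto3 and wait until it is active.
# Role ARN, subnet IDs, region, and cluster name are placeholders.
import boto3

def create_eks_cluster():
    eks = boto3.client("eks", region_name="eu-west-1")
    eks.create_cluster(
        name="nlp-processing",
        roleArn="arn:aws:iam::123456789012:role/eks-cluster-role",
        resourcesVpcConfig={"subnetIds": ["subnet-aaa111", "subnet-bbb222"]},
    )
    # Cluster creation is asynchronous; block until it is ACTIVE before scheduling pods.
    eks.get_waiter("cluster_active").wait(name="nlp-processing")
```

Note that this only creates the control plane; you still need worker nodes (for example a managed node group) before any pods can run, which the linked article walks through.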

Most cloud providers also support a CLI, so you may be able to create the K8s cluster with just the CLI as well.

If you want to use GCP GKE, you can also check the operators for creating a cluster: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/kubernetes_engine.html
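
For example, a sketch using the operators from that provider (the project ID, location, and cluster body values are placeholders):

```python
# Sketch: create a GKE cluster for the run and tear it down afterwards.
# Project ID, location, and cluster spec are placeholders.
from airflow.providers.google.cloud.operators.kubernetes_engine import (
    GKECreateClusterOperator,
    GKEDeleteClusterOperator,
)

create_cluster = GKECreateClusterOperator(
    task_id="create_gke_cluster",
    project_id="my-gcp-project",
    location="europe-west1-b",
    body={"name": "nlp-processing", "initial_node_count": 3},
)

delete_cluster = GKEDeleteClusterOperator(
    task_id="delete_gke_cluster",
    project_id="my-gcp-project",
    location="europe-west1-b",
    name="nlp-processing",
    trigger_rule="all_done",  # tear down even if processing tasks fail
)
```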

> Split files between those pods so each pod can only process its subset of files

This depends more on the file structure. You can mount S3 directly to all the pods, or you can keep the files on NFS and mount that into the pods, but in either case you have to manage the directory structure (or the list of keys each pod receives) so that every pod only works on its own subset of files.
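
One hedged sketch of the "tell each pod only about its own keys" approach: list the S3 keys, split them into chunks, and launch one `KubernetesPodOperator` per chunk, passing the keys as arguments. Bucket, prefix, image, and chunk count are placeholders; on Airflow 2.3+ you could do the same thing more cleanly with dynamic task mapping.

```python
# Sketch: list keys in S3, split into roughly equal chunks, and give each pod its own
# chunk as command-line arguments. Bucket, prefix, image, and chunk count are placeholders.
from datetime import datetime

import boto3
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

def list_keys(bucket, prefix):
    s3 = boto3.client("s3")
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

def chunk(items, n_chunks):
    # Round-robin split so chunks end up roughly equal in size.
    return [items[i::n_chunks] for i in range(n_chunks)]

with DAG(dag_id="nlp_parallel_processing", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    # Listing at parse time keeps the sketch short; an upstream task plus dynamic
    # task mapping would avoid hitting S3 every time the DAG file is parsed.
    for i, subset in enumerate(chunk(list_keys("my-bucket", "incoming/"), n_chunks=10)):
        KubernetesPodOperator(
            task_id=f"process_chunk_{i}",
            name=f"nlp-model-{i}",
            image="my-registry/nlp-model:latest",
            arguments=["--keys", ",".join(subset)],  # the pod downloads only these keys
            get_logs=True,
        )
```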

> Collate output of model from each pod into S3 location

You can use boto3 to upload the files to S3, or you can also mount the S3 bucket directly to the pod.

Beyond that, it depends on your setup: how big the generated files are and where they are stored.
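
As a sketch of the boto3 route, this is roughly what the last step inside each pod's container could look like (bucket name, prefix, and output directory are placeholders):

```python
# Sketch: upload this pod's output files back to a shared S3 prefix with boto3.
# Bucket name, prefix, and output directory are placeholders.
import os
import boto3

def upload_outputs(output_dir="/output", bucket="my-bucket", prefix="model-output/"):
    s3 = boto3.client("s3")
    for filename in os.listdir(output_dir):
        local_path = os.path.join(output_dir, filename)
        if os.path.isfile(local_path):
            s3.upload_file(local_path, bucket, prefix + filename)

if __name__ == "__main__":
    upload_outputs()
```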

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Harsh Manvar