AWS Batch - how to run more concurrent jobs

I have just started working with AWS Batch for my deep learning workload. I have created a compute environment with the following configuration:

  • min vCPUs: 0
  • max vCPUs: 16
  • Instance type: g4dn family, g3s family, g3 family, p3 family
  • allocation strategy: BEST_FIT_PROGRESSIVE

The vCPU limit for my account is 16, and each of my jobs requires 16 GB of memory. I observe that at most 2 jobs run concurrently at any point in time. I was previously using the BEST_FIT allocation strategy and switched to BEST_FIT_PROGRESSIVE, but still only 2 jobs run concurrently. This limits the amount of experimentation I can do in a given time. What can I do to increase the number of jobs that can run concurrently?



Solution 1:[1]

I figured it out myself just now. I'm posting an answer here in case anyone finds it helpful in the future. It turns out that the instances assigned to my jobs were g4dn.2xlarge, and each of these instances takes up 8 vCPUs. Since my vCPU limit is 16, only 2 jobs can run concurrently. One solution is to ask AWS to increase the vCPU limit by creating a new support case. Another is to modify the compute environment to use GPU instances that consume 4 vCPUs (the lowest possible on AWS), in which case a maximum of 4 jobs can run concurrently.
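The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration, assuming one job per instance (as observed above); the function name `max_concurrent_jobs` is made up for this example and is not part of any AWS API.

```python
def max_concurrent_jobs(max_vcpus: int, vcpus_per_instance: int) -> int:
    """Upper bound on concurrent jobs, assuming each job occupies one whole instance."""
    # Batch cannot launch an instance that would push the total past max_vcpus,
    # so the number of instances is the integer quotient.
    return max_vcpus // vcpus_per_instance


# 8-vCPU instances (g4dn.2xlarge) under a 16-vCPU limit
print(max_concurrent_jobs(16, 8))  # 2

# 4-vCPU GPU instances under the same limit
print(max_concurrent_jobs(16, 4))  # 4
```

Raising the limit has the same effect from the other side: with 8-vCPU instances, a limit of 32 vCPUs would allow 4 concurrent jobs.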

Solution 2:[2]

There are two kinds of solutions:

  1. Configure your compute environment with EC2 instances whose vCPU count is a multiple of the vCPUs in your job definition. For example, with a compute environment that uses 8-vCPU EC2 instances and a limit of 128 vCPUs, a job definition that requests 8 vCPUs lets you run up to 16 concurrent jobs, because 16 concurrent jobs x 8 vCPUs = 128 vCPUs. (Take into account the allocation strategy and the memory of your instances, which matters if your jobs consume memory resources too.)
  2. Use multi-node parallel jobs. This is a very interesting solution because in this scenario the EC2 instances' vCPU count does not need to be a multiple of the vCPUs in your job definition, and jobs can span multiple Amazon EC2 instances.
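To see why the "multiple of the job's vCPUs" advice in option 1 matters, and how memory can become the limiting factor instead, here is a rough Python estimate. It is a sketch under simplifying assumptions: a single instance type, no GPU constraints, and all instance memory available to jobs (in practice some is reserved, so the usable amount is lower). The function name and the 32 GiB instance memory figure are hypothetical values chosen for illustration.

```python
def concurrent_jobs(max_vcpus: int,
                    instance_vcpus: int,
                    instance_memory_mib: int,
                    job_vcpus: int,
                    job_memory_mib: int) -> int:
    """Rough estimate of concurrent jobs for a single instance type."""
    # Number of instances that fit under the compute environment's vCPU limit.
    instances = max_vcpus // instance_vcpus
    # Jobs per instance are capped by whichever resource runs out first.
    per_instance = min(instance_vcpus // job_vcpus,
                       instance_memory_mib // job_memory_mib)
    return instances * per_instance


GIB = 1024  # MiB per GiB

# The example from the answer: 8-vCPU instances, 128-vCPU limit, 8-vCPU jobs.
print(concurrent_jobs(128, 8, 32 * GIB, 8, 16 * GIB))  # 16

# 4-vCPU jobs divide the instance evenly: two jobs per instance.
print(concurrent_jobs(128, 8, 32 * GIB, 4, 16 * GIB))  # 32

# 5-vCPU jobs do not: one job per instance, 3 vCPUs idle on each.
print(concurrent_jobs(128, 8, 32 * GIB, 5, 16 * GIB))  # 16

# Memory as the bottleneck: 20 GiB jobs fit only once per 32 GiB instance.
print(concurrent_jobs(128, 8, 32 * GIB, 4, 20 * GIB))  # 16
```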

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Vinay Joshi
Solution 2 ldipotet