GPU nodegroup in EKS
I cannot create a GPU nodegroup in EKS. I get the following error from CloudFormation: [!] retryable error (Throttling: Rate exceeded status code: 400, request id: 1e091568-812c-45a5-860b-d0d028513d28) from cloudformation/DescribeStacks - will retry after delay of 988.442104ms
This is my clusterconfig.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: CLUSTER_NAME
  region: AWS_REGION
nodeGroups:
  - name: NODE_GROUP_NAME_GPU
    ami: auto
    minSize: MIN_SIZE
    maxSize: MAX_SIZE
    instancesDistribution:
      instanceTypes: ["g4dn.xlarge", "g4dn.2xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotInstancePools: 1
    privateNetworking: true
    securityGroups:
      withShared: true
      withLocal: true
      attachIDs: [SECURITY_GROUPS]
    iam:
      instanceProfileARN: IAM_PROFILE_ARN
      instanceRoleARN: IAM_ROLE_ARN
    ssh:
      allow: true
      publicKeyPath: '----'
    tags:
      k8s.io/cluster-autoscaler/node-template/taint/dedicated: nvidia.com/gpu=true
      k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: 'true'
      k8s.io/cluster-autoscaler/enabled: 'true'
    labels:
      lifecycle: Ec2Spot
      nvidia.com/gpu: 'true'
      k8s.amazonaws.com/accelerator: nvidia-tesla
    taints:
      nvidia.com/gpu: "true:NoSchedule"
Solution 1:[1]
The resolution was to install the NVIDIA plugins on the cluster so that the cluster can identify the GPU nodes.
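One way to check that the GPU nodes are recognized is to schedule a small test pod onto the nodegroup. The manifest below is only a sketch and is not part of the original answer. It assumes the NVIDIA device plugin for Kubernetes is already running and advertising the `nvidia.com/gpu` resource. The pod name `gpu-check` and the image placeholder `CUDA_IMAGE` are hypothetical; substitute any CUDA-capable image that ships `nvidia-smi`. The node selector and toleration mirror the label and taint from the config in the question.

```yaml
# Hypothetical test pod; the name and image are placeholders, not from the original post.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: Never
  # Select nodes carrying the label set in the nodegroup config.
  nodeSelector:
    nvidia.com/gpu: "true"
  # Tolerate the nodegroup taint nvidia.com/gpu=true:NoSchedule.
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: gpu-check
      image: CUDA_IMAGE
      command: ["nvidia-smi"]
      resources:
        limits:
          # Schedulable only if a device plugin exposes this resource on the node.
          nvidia.com/gpu: 1
```

If the pod stays in `Pending` with an insufficient `nvidia.com/gpu` message, the nodes are most likely not advertising GPUs yet, which points back to the missing plugin.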
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Jumana Kass |