azurerm_kubernetes_cluster - AKS encountered an internal error while attempting the requested operation

I'm trying to create an AKS cluster with a default node pool.

Terraform version: 1.1.7

AzureRM version: 2.99 (after updating, the behavior is the same on AzureRM 3.5)

First error message:

Failure sending request: StatusCode=400 -- Original Error: Code="SubnetsAssociatedWithNATgatewayWhenOutboundTypeIsStandardLoadBalancer"

After that error I set outbound_type = "userAssignedNATGateway", since the provider's default outbound_type of "loadBalancer" is rejected when the node subnet already has a NAT Gateway attached. Unfortunately it doesn't help; I get the following error message:

waiting for creation of Cluster: (Managed Cluster Name "managed-cluster" / Resource Group "managed-cluster-rg"): Code="CreateVMSSAgentPoolFailed" Message="AKS encountered an internal error while attempting the requested Creating operation. AKS will continuously retry the requested operation until successful or a retry timeout is hit. Check back to see if the operation requires resubmission.

I tried it many times. I also checked the resource group activity log, but everything looks fine.

The code without the outbound type worked until around mid-April. Has something changed on the Azure side?

  resource "azurerm_kubernetes_cluster" "aks" {
  depends_on              = [azurerm_subnet_nat_gateway_association.aks_nat_gw]
  name                    = var.prefix
  location                = azurerm_resource_group.rg.location
  resource_group_name     = azurerm_resource_group.rg.name
  node_resource_group     = "${var.prefix}-nodes-rg"
  dns_prefix              = var.prefix
  sku_tier                = "Paid"
  private_cluster_enabled = false

  kubernetes_version = data.azurerm_kubernetes_service_versions.current.latest_version

  automatic_channel_upgrade = "patch" 
  auto_scaler_profile {
    balance_similar_node_groups      = true
    max_graceful_termination_sec     = "600"
    scale_down_utilization_threshold = "0.7"
    skip_nodes_with_local_storage    = false # Nodes should use local storage only as cache or temporary space, not as persistent storage
    skip_nodes_with_system_pods      = false 
  }

  addon_profile {
    http_application_routing {
      enabled = false
    }
    kube_dashboard {
      enabled = false
    }
    azure_policy {
      enabled = true
    }
  }

  default_node_pool {
    name                         = "system"
    os_disk_size_gb              = var.system_pool_config.disk_size
    os_disk_type                 = var.system_pool_config.os_disk_type
    vm_size                      = var.system_pool_config.vm_size
    min_count                    = var.system_pool_config.min_count
    max_count                    = var.system_pool_config.max_count
    vnet_subnet_id               = azurerm_subnet.subnet.id
    tags                         = var.tags
    only_critical_addons_enabled = true
    enable_auto_scaling          = true
    enable_host_encryption       = true
    max_pods                     = 30 # Default | Changing this will require more IPs; check the subnet and adjust the max node count accordingly
    availability_zones           = [1, 2, 3]
    orchestrator_version         = var.system_pool_config.orchestrator_version

    upgrade_settings {
      max_surge = var.system_pool_config.max_surge
    }

  }
  identity {
    type = "SystemAssigned"
  }

  linux_profile {
    admin_username = "kubernetes"
    ssh_key {
      key_data = tls_private_key.node-ssh-key.public_key_openssh
    }
  }

  role_based_access_control {
    enabled = true

    azure_active_directory {
      managed                = true
      admin_group_object_ids = var.admin_group_object_ids
    }
  }


  network_profile {
    network_plugin     = "azure"
    network_mode       = "transparent"
    network_policy     = "calico"
    service_cidr       = "10.0.0.0/16"
    dns_service_ip     = "10.0.0.10"
    docker_bridge_cidr = "172.17.0.1/16"
    load_balancer_sku  = "standard"
    load_balancer_profile {
      outbound_ports_allocated  = 0
      idle_timeout_in_minutes   = 4
      managed_outbound_ip_count = 1
    }
    outbound_type = "userAssignedNATGateway"
  }

  tags = var.tags
}

There is a NAT Gateway already attached to the subnet.
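For reference, a minimal sketch of that wiring. Only azurerm_subnet_nat_gateway_association.aks_nat_gw is the name actually referenced in the cluster's depends_on; the other resource names here are illustrative, not my exact configuration:

# Illustrative NAT Gateway setup; names other than the subnet
# association below are placeholders.
resource "azurerm_public_ip" "nat_gw" {
  name                = "${var.prefix}-nat-gw-ip"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  allocation_method   = "Static"
  sku                 = "Standard"
}

resource "azurerm_nat_gateway" "aks" {
  name                = "${var.prefix}-nat-gw"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  sku_name            = "Standard"
}

# Give the NAT Gateway an outbound public IP
resource "azurerm_nat_gateway_public_ip_association" "aks" {
  nat_gateway_id       = azurerm_nat_gateway.aks.id
  public_ip_address_id = azurerm_public_ip.nat_gw.id
}

# The association the cluster's depends_on points at
resource "azurerm_subnet_nat_gateway_association" "aks_nat_gw" {
  subnet_id      = azurerm_subnet.subnet.id
  nat_gateway_id = azurerm_nat_gateway.aks.id
}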

I also opened a GitHub issue, but there has been no response: https://github.com/hashicorp/terraform-provider-azurerm/issues/16712



Solution 1:[1]

Please follow the workflow here that describes how to create an AKS cluster with a user-assigned NAT Gateway; you just need to translate it into Terraform.

Basically, userAssignedNATGateway needs a user-assigned managed identity (see step 2 in the workflow) instead of a SystemAssigned identity. You additionally need to give the new managed identity the Network Contributor and Monitoring Metrics Publisher roles:

# Create Managed identity
resource "azurerm_user_assigned_identity" "example" {
  resource_group_name = azurerm_resource_group.example.name
  location            = azurerm_resource_group.example.location
  name                = "natclusterid"
}

# Role assignment to be able to manage the virtual network
resource "azurerm_role_assignment" "aks_vnet_contributor" {
  scope                            = azurerm_resource_group.example.id
  role_definition_name             = "Network Contributor"
  principal_id                     = azurerm_user_assigned_identity.example.principal_id
  skip_service_principal_aad_check = true
}

# Role assignment to publish metrics
resource "azurerm_role_assignment" "aks_metrics_publisher" {
  scope                            = azurerm_kubernetes_cluster.aks.id
  role_definition_name             = "Monitoring Metrics Publisher"
  principal_id                     = azurerm_user_assigned_identity.example.principal_id
  skip_service_principal_aad_check = true
}

# Create AKS with managed identity
resource "azurerm_kubernetes_cluster" "aks" {
  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.example.id]
  }
} 
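One ordering caveat worth adding (my note, an assumption rather than part of the linked workflow): the cluster block only references the identity, so Terraform is free to create the cluster before the Network Contributor assignment exists, and AKS may then lack permission to configure the NAT Gateway during provisioning. An explicit dependency in the cluster resource avoids this:

  # Assumed addition to the azurerm_kubernetes_cluster resource above:
  # wait for the Network Contributor assignment before provisioning.
  depends_on = [azurerm_role_assignment.aks_vnet_contributor]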

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Philip Welz