(AWS Step Functions) send_task_success raises "Task Timed Out" but the task is still running

Goal

I wanted to make a proof of concept of the callback pattern. This is where a step function puts a message and a task token on an SQS queue, the queue is wired up to some arbitrary work, and when that work is done you hand the token back to the step function so it knows to continue.

Problem

I started testing by manually starting executions of the step function, and after a few failures I hit on what should have worked. send_task_success was called, but all I ever got back was this: An error occurred (TaskTimedOut) when calling the SendTaskSuccess operation: Task Timed Out: 'Provided task does not exist anymore'.

My architecture (you can skip this part)

I did this all in terraform.

Permissions

I'm going to skip all the IAM permission details for brevity but the idea is:

The queue has the following with the resource of my lambda
- lambda:CreateEventSourceMapping
- lambda:ListEventSourceMappings
- lambda:ListFunctions
The step function has the following with the resource of my queue
- sqs:SendMessage
The lambda has
- AWSLambdaBasicExecutionRole
- AWSLambdaSQSQueueExecutionRole
- states:SendTaskSuccess with step function resource

Terraform

resource "aws_sqs_queue" "queue" {
  name_prefix = "${local.project_name}-"
  fifo_queue = true
  # FIFO queues need deduplication: either this, or a MessageDeduplicationId on every message
  content_based_deduplication = true
  policy = templatefile(
    "policy/queue.json",
    {lambda_arn = aws_lambda_function.run_job.arn}
  )
}

resource "aws_sfn_state_machine" "step" {
  name = local.project_name
  role_arn = aws_iam_role.step.arn
  type = "STANDARD"
  definition = templatefile(
    "states.json", {
      sqs_url = aws_sqs_queue.queue.url
    }
  )
}

resource "aws_lambda_function" "run_job" {
  function_name = local.project_name
  description = "Runs a job"
  role = aws_iam_role.lambda.arn

  architectures = ["arm64"]
  runtime = "python3.9"
  filename = var.zip_path
  handler = "main.main"
}

resource "aws_lambda_event_source_mapping" "trigger_lambda" {
  event_source_arn = aws_sqs_queue.queue.arn
  enabled = true
  function_name = aws_lambda_function.run_job.arn
  batch_size = 1
}

Notes:

For my use case I definitely want a FIFO queue. However, there are two funny things you have to do to make a FIFO work (that also make me question what the heck the implementation is doing).

Deduplication. This can either be content-based deduplication for the whole queue, or a MessageDeduplicationId set on a per-message basis.
MessageGroupId. This is set on a per-message basis.

I don't have to worry about the deduplication because every item I put in this queue comes with a unique guid.
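As a rough illustration of the two FIFO requirements above, here is a minimal sketch that builds the arguments for an SQS send_message call. The helper name build_fifo_message, the queue URL and the sample values are made up for this example; only the parameter names (QueueUrl, MessageBody, MessageGroupId, MessageDeduplicationId) come from the SQS API.

```python
import json


def build_fifo_message(queue_url, body, group_id, dedup_id=None):
    """Build keyword arguments for an SQS send_message call on a FIFO queue.

    Hypothetical helper: it only assembles the arguments and sends nothing.
    """
    kwargs = {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(body),
        # Required on every message sent to a FIFO queue.
        "MessageGroupId": group_id,
    }
    if dedup_id is not None:
        # Only needed when content-based deduplication is off for the queue.
        kwargs["MessageDeduplicationId"] = dedup_id
    return kwargs


# Made-up values; the job guid doubles as the deduplication id.
message = build_fifo_message(
    "https://sqs.example.invalid/123456789012/demo.fifo",
    {"job_guid": "1234"},
    group_id="me_group",
    dedup_id="1234",
)
print(message)
```

With a real queue, the result could be passed along as boto3.client("sqs").send_message(**message).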

State Machine

I expect this to be executed with a json that includes "job": "some job guid" at the top level.

{
    "Comment": "This is a thing.",
    "StartAt": "RunJob",
    "States": {
        "RunJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
            "Parameters": {
                "QueueUrl": "${sqs_url}",
                "MessageBody": {
                    "Message": {
                        "job_guid.$": "$.job",
                        "TaskToken.$": "$$.Task.Token"
                    }
                },
                "MessageGroupId": "me_group"
            },
            "Next": "Finish"
        },
        "Finish": {
            "Type": "Succeed"
        }
    }
}

Notes:

"RunJob"'s resource is not the arn of the queue followed by .waitForTaskToken. Seems obvious since it starts with arn:aws:states but it threw me for a bit.
Inside "MessageBody" I'm pretty sure you can just put whatever you want. For sure I know you can rename "TaskToken" to whatever you want.
You need "MessageGroupId" because it's required when you are using a FIFO queue (ordering is guaranteed per message group).

Python

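import boto3
from json import loads

def main(event, context):
    # batch_size is 1 in the event source mapping, so there is exactly one record.
    body = event["Records"][0]["body"]
    # The body is a JSON string; the payload sits under the "Message" key.
    message = loads(body)["Message"]
    task_token = message["TaskToken"]
    job_guid = message["job_guid"]
    print(f"{task_token=}")
    print(f"{job_guid=}")
    client = boto3.client('stepfunctions')
    # output must be a JSON string; the raw body already is one.
    client.send_task_success(taskToken=task_token, output=body)
    return {"statusCode": 200, "body": "All good"}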

Notes:

event["Records"][0]["body"] is a JSON string.
In send_task_success, output expects a string containing JSON, basically the output of dumps. It just so happens that event["Records"][0]["body"] is already stringified JSON, so that's why I'm passing it.
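If you would rather return your own result than echo the message body, a sketch along these lines should work. The helper names parse_record and build_output, the "status" field and the sample values are assumptions for illustration; the record shape mirrors the "MessageBody" defined in the state machine above.

```python
import json


def parse_record(record):
    """Return (task_token, job_guid) from one SQS record as delivered to Lambda."""
    message = json.loads(record["body"])["Message"]
    return message["TaskToken"], message["job_guid"]


def build_output(job_guid, status):
    """Serialize a result dict into the JSON string that send_task_success expects."""
    return json.dumps({"job_guid": job_guid, "status": status})


# Sample record with made-up values, shaped like the state machine's MessageBody.
record = {"body": json.dumps({"Message": {"job_guid": "1234", "TaskToken": "token-abc"}})}
task_token, job_guid = parse_record(record)
output = build_output(job_guid, "done")
print(task_token, output)
```

The important part is that output is a str holding JSON, not a dict; passing a dict to send_task_success is rejected by boto3's parameter validation.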

aws-lambda amazon-sqs aws-step-functions

Solution 1: [1]

This is the way lambda + sqs works:

A message comes into SQS.
SQS passes that off to a lambda. At the same time it makes the item in the queue invisible. It doesn't delete the item at this stage.
If the lambda returns, the item is deleted from the queue. If not, the item becomes visible again once its visibility timeout runs out and is retried (as long as it hasn't been longer than the message retention period since the item was initially added to the queue).
Since a FIFO queue has to deal with the items in a message group in order, this means that, if the lambda never succeeds, SQS will just keep retrying that item until the retention period runs out, and never process anything else in the group in the meantime.

Note: a failure is an exception, timeout, permissions error, etc. If the lambda returns normally, regardless of what's returned, that counts as a success.
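The blocking behaviour described above can be illustrated with a toy model. This is only a sketch of the reasoning, not how SQS or Lambda are implemented: the function simulate, its parameters and the timings are all invented for the example, time is counted in abstract seconds, and retention stands for how long a message may live in the queue before it is dropped.

```python
from collections import deque


def simulate(messages, handler, visibility_timeout, retention, max_time):
    """Toy model of one FIFO message group feeding a Lambda-like handler.

    messages: list of (sent_at, body) in queue order.
    handler(body) raises on failure and returns normally on success.
    Returns a log of (time, body, outcome) tuples.
    """
    queue = deque(messages)
    log = []
    now = 0
    while queue and now <= max_time:
        sent_at, body = queue[0]          # only the head of the group is eligible
        now = max(now, sent_at)
        if now - sent_at >= retention:
            queue.popleft()               # too old: dropped without being processed
            log.append((now, body, "expired"))
            continue
        try:
            handler(body)
        except Exception:
            log.append((now, body, "failed"))
            now += visibility_timeout     # hidden for a while, then retried
            continue
        queue.popleft()                   # normal return: message is deleted
        log.append((now, body, "deleted"))
    return log


def handler(body):
    # Pretend job-1 belongs to an aborted execution, so it can never succeed.
    if body == "job-1":
        raise RuntimeError("stale task token")


log = simulate([(0, "job-1"), (10, "job-2")], handler,
               visibility_timeout=30, retention=90, max_time=600)
for entry in log:
    print(entry)
```

In this model job-2 is only handled after job-1 has been retried several times and finally dropped, which matches the symptom of a new execution seemingly never reaching the lambda.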

What happened to me is as follows:

First step function execution: There's some sort of configuration error in my lambda or something. I fix it and re-deploy the lambda. I abort this invocation and delete the lambda logs.

Second step function execution: Everything is properly configured this time, but my lambda doesn't receive the new message. Since the lambda failed the first time, that first item wasn't removed from SQS. SQS will just keep retrying the same item until it is successful. However, the execution it belongs to was aborted, so it will never be successful. Nothing else on the queue will ever see the light of day. However, I don't know this. I just see a failed attempt in the logs. So I delete the logs and abort the execution.

Subsequent executions: Finally, the message retention period is hit for the first item in the queue, so SQS drops it and moves on to the second item in the queue. I already aborted that execution too. Etc.

Here are a few approaches to fixing this:

For my particular use-case, it probably doesn't make sense to retry a lambda. So I could set up a dead-letter queue. This is a queue that takes all the failed jobs from the main queue. SQS can be configured to only send it to the dead letter queue after n retries but I would just send it there immediately. The dead letter queue would then be attached to a lambda that deals with cleaning up any resources that need cleaning.
For development, I should wrap everything in a big try except block. If there's an exception, print it to the logs, but clear out the queue so I don't have a build up.
For development, I should use a really short message retention period. The minimum SQS allows is 60 seconds. This ensures that my lambda is going to be executed once or maybe twice but that's it. This should be used in addition to the previous suggestion to catch things like permissions errors.
I found this stackoverflow post about SQS retry logic that I thought was helpful too.
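The "big try except block" idea from the list above could look roughly like this. It is a development-only sketch, not a recommendation for production: the helper handle_record, the optional sfn_client parameter (added so the handler can be exercised with a fake client) and the response body are assumptions made for this example. The only AWS call used is send_task_success.

```python
import json
import logging

logger = logging.getLogger(__name__)


def handle_record(record, sfn_client):
    """Process one SQS record; never raises, so the message always gets deleted."""
    try:
        message = json.loads(record["body"])["Message"]
        sfn_client.send_task_success(
            taskToken=message["TaskToken"],
            output=json.dumps({"job_guid": message["job_guid"]}),
        )
        return True
    except Exception:
        # Development only: log the error and swallow it so the lambda returns
        # normally and a stale message cannot block the rest of the queue.
        logger.exception("Dropping message after failure")
        return False


def main(event, context, sfn_client=None):
    if sfn_client is None:
        import boto3  # imported lazily so the sketch can be tried without AWS
        sfn_client = boto3.client("stepfunctions")
    results = [handle_record(record, sfn_client) for record in event["Records"]]
    return {"statusCode": 200, "body": json.dumps({"succeeded": sum(results)})}
```

Swallowing the exception trades retries for a queue that keeps moving, so anything that genuinely needs a retry or cleanup would still want the dead-letter queue approach.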

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	FailureGod
