'SQS Lambda Trigger with Visibility Timeout extension

I'm working on a solution where I have a SQS queue with Lambda trigger. My understanding is Lambda will receive messages in batches to be processed, and once Lambda function is successful, the messages in the SQS queue is automatically deleted. However, how do I only allow some of those messages to be deleted?

Let's assume this use case: Lambda function receives a batch with 10 messages, and only 7 messages are valid and can be processed, and the other 3 messages needs to be reprocessed at later point.

My initial thought was I could update the visibility timeout via boto3.sqs.change_visibility_timeout for each of the 3 messages to have it reprocessed after the timeout, however, since overall lambda function execution is successful, all 10 messages are deleted from SQS queue.

Any suggestions?



Solution 1:[1]

Yes, by default, the Lambda function deletes all the messages upon success. You would need to handle this in your code, but not by changing the visibility timeout of the messages.

Add DLQ (dead-letter queue) that will actually handle the failed messages (messages go to DLQ after a certain number of failed attempts to be processed, depending on how you set it up)

You have few options here:

  1. You can handle each item yourself, and delete messages that are processed successfully. In case of a message that's not successful, you can throw an error and it won't be deleted automatically by the lambda function
  2. If you use JavaScript you can try with Middy
  3. If you use Python, you can use Lambda Powertools Python

Solution 2:[2]

For AWS Lambdas with an SQS trigger, by default, when your function encounters an error processing one or more messages in a given batch, the entire batch is marked as a failure. All of the messages in the batch are made visible again in the queue. Depending on your redrive policy, you can end up repeatedly processing successful messages along with the failures.

Rather than change the visibility timeout, the simplest way to specify which messages should be retried later and which can safely be deleted from the queue is to change the function response type to ReportBatchItemFailures. This allows you to return a list of failed message ids, indicating that only those messages in the batch should be made visible again in the queue.

Here's what the reporting syntax looks like for a handler function in Node.js:

exports.handler = async (event) => {
    // Process the event
    const batchItemFailureResponse = {
        batchItemFailures: [
            {
                itemIdentifier: "idFailedMessage1"
            },
            {
                itemIdentifier: "idFailedMessage2"
            },
            {
                itemIdentifier: "idFailedMessage3"
            }
        ]
    };
    return batchItemFailureResponse;
};

There is more information to be found in the official documentation.

This response type is configured when setting a queue as an event source for the Lambda. If you're configuring from the console, navigate to the Lambda function page, select the Configuration tab, and then choose Triggers. Then choose Add Trigger and choose the SQS trigger type. In addition to providing the standard parameters, be sure to check the box under Report batch item failures after expanding Additional Settings. It should look something like this:

Add trigger with batch failure reporting

This parameter must be set when first creating the trigger.

This response type can also be defined if you use CloudFormation templates to provision your resources. See the AWS documentation for more information. Note that if you use AWS SAM event source mappings, the documentation suggests that adding FunctionResponseTypes to the YAML with ReportBatchItemFailures in the type list isn't supported. That is incorrect, the documentation is simply outdated. There is an open issue around addressing this oversight.

Finally, in addition to reporting batch item failures, you should provision a target DLQ (dead-letter queue) and determine a reasonable maximum receive count so that action can be taken on messages that fail repeatedly.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Caldazar
Solution 2 EricMortberg