How do I fail a specific SQS message in a batch from a Lambda?

I have a Lambda with an SQS trigger. When it gets hit, a batch of records from SQS comes in (usually about 10 at a time, I think). If I return a failed status code from the handler, all 10 messages will be retried. If I return a success code, they'll all be removed from the queue. What if 1 out of those 10 messages failed and I want to retry just that one?

exports.handler = async (event) => {

    for(const e of event.Records){
        try {
            let body = JSON.parse(e.body);
            // do things
        }
        catch(e){
            // one message failed, i want it to be retried
        }        
    }

    // returning this causes ALL messages in 
    // this batch to be removed from the queue
    return {
        statusCode: 200,
        body: 'Finished.'
    };
};

Do I have to manually re-add that one message back to the queue? Or can I return a status from my handler that indicates that one message failed and should be retried?



Solution 1:[1]

Yes, you have to manually re-add the failed messages to the queue.

What I suggest is keeping a fail count. If every message in the batch failed, simply return a failed status so the whole batch is retried. Otherwise, if the fail count is lower than the batch size (< 10 here), send only the failed messages back to the queue individually.
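
A minimal sketch of this fail-count approach, under stated assumptions: `process` and `resend` are hypothetical callables supplied by the caller (in a real function, `resend` would wrap something like boto3's `send_message` for the source queue), and raising an exception stands in for the "failed status" that makes Lambda retry the whole batch.

```python
def handle_batch(records, process, resend):
    """Process a batch; retry everything if all failed, else re-queue only failures.

    `process` and `resend` are hypothetical callables, not part of any AWS API.
    Returns the number of messages that were re-sent.
    """
    failed = []
    for record in records:
        try:
            process(record)
        except Exception:
            failed.append(record)

    if records and len(failed) == len(records):
        # Every message failed: raise so the whole batch goes back to the queue.
        raise RuntimeError("all messages in the batch failed")

    # Only some failed: put those back; the batch itself is deleted on success.
    for record in failed:
        resend(record["body"])
    return len(failed)
```

Keep in mind that a re-sent message is a new message, so its receive count starts over. That matters if the queue relies on a redrive policy to move repeatedly failing messages to a dead-letter queue.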

Solution 2:[2]

You have to programmatically delete each message after processing it successfully.

You can set a flag to true if any of the messages fails, and raise an error after processing all the messages in the batch. The successful messages will already have been deleted, and the remaining messages will be reprocessed according to the queue's retry policy.

So as per the below logic only failed and unprocessed messages will get retried.

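import os

import boto3

sqs = boto3.client("sqs")
# Queue URL read from an environment variable (the variable name is an assumption)
QUEUE_URL = os.environ["QUEUE_URL"]


def handler(event, context):
    any_failed = False
    for message in event["Records"]:  # the event key is "Records", capital R
        try:
            message_body = message["body"]
            print("do some processing :)")
        except Exception:
            # leave the failed message in the queue so it becomes visible again
            any_failed = True
            continue
        # delete only the messages that were processed successfully
        sqs.delete_message(
            QueueUrl=QUEUE_URL,
            ReceiptHandle=message["receiptHandle"],
        )
    if any_failed:
        # raising stops Lambda from deleting the whole batch
        raise RuntimeError("one or more messages failed")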

Alternatively, collect the message ID and receipt handle of each successfully processed message and delete them all with a batch delete operation:

response = sqs.delete_message_batch(
    QueueUrl='string',
    Entries=[
        {
            'Id': 'string',
            'ReceiptHandle': 'string'
        },
    ]
)
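
As a rough sketch of how the `Entries` list might be built from the Lambda event records: the helper below is hypothetical (the name `build_delete_batches` is not part of boto3) and splits the entries into groups of at most 10, since `delete_message_batch` accepts no more than 10 entries per call. It uses each record's `messageId` as the batch-local `Id`.

```python
def build_delete_batches(records, batch_size=10):
    """Group processed records into Entries lists for delete_message_batch.

    Hypothetical helper: `records` are SQS records from the Lambda event
    that were processed successfully.
    """
    entries = [
        {"Id": r["messageId"], "ReceiptHandle": r["receiptHandle"]}
        for r in records
    ]
    # One sub-list per API call, each holding at most `batch_size` entries.
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]
```

Each returned group can then be passed as `Entries` to one `delete_message_batch` call. The response contains a `Failed` list, which is worth checking because individual deletions can fail even when the call itself succeeds.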

Solution 3:[3]

As per the AWS documentation, SQS event source mappings now support handling of partial failures out of the box. The gist of the linked article is as follows:

  1. Include ReportBatchItemFailures in the FunctionResponseTypes list of your event source mapping configuration
  2. The response syntax in case of failures has to be modified to have {"batchItemFailures": [{"itemIdentifier": "id2"},{"itemIdentifier": "id4"}]}, where id2 and id4 are the messageIds of the failed messages in a batch
  3. Quoting the documentation as is:

Lambda treats a batch as a complete success if your function returns any of the following:

An empty batchItemFailures list

A null batchItemFailures list

An empty EventResponse

A null EventResponse

Lambda treats a batch as a complete failure if your function returns any of the following:

An invalid JSON response

An empty string itemIdentifier

A null itemIdentifier

An itemIdentifier with a bad key name

An itemIdentifier value with a message ID that doesn't exist

According to the documentation at the time of writing, SAM support was not yet available for this feature. However, one of the AWS Labs examples shows its usage in SAM, and it worked for me when tested.
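
A minimal Python sketch of a handler that returns the partial batch response described above. It assumes ReportBatchItemFailures is enabled on the event source mapping; `process` is a hypothetical placeholder for the real per-message work.

```python
import json


def process(body):
    """Hypothetical per-message work; raises an exception on failure."""
    if "fail" in body:
        raise ValueError("processing failed")


def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Report only this message as failed, identified by its messageId.
            failures.append({"itemIdentifier": record["messageId"]})
    # An empty list is treated as a complete success for the batch.
    return {"batchItemFailures": failures}
```

With this shape, only the messages listed in `batchItemFailures` become visible in the queue again, while the rest of the batch is deleted.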

Solution 4:[4]

You need to design your app in a different way. Here are a few ideas; they are not the best, but they will solve your problem.

Solution 1:

Note: When an SQS event source mapping is initially created and enabled, or when messages first appear after a period with no traffic, the Lambda service begins polling the SQS queue using five parallel long-polling connections. As per the AWS documentation, the default duration for a long poll from AWS Lambda to SQS is 20 seconds.

Solution 2:

Use AWS Step Functions

Step Functions will call the Lambda and handle the retry logic on failure, with configurable exponential back-off if needed.
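
To illustrate what such a retry configuration might look like, here is a sketch that builds a state machine definition as a Python dict. The field names (`Retry`, `ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, `BackoffRate`) come from the Amazon States Language; the helper name `task_with_retry`, the state name, the retry values, and the placeholder ARN are assumptions for illustration only.

```python
import json


def task_with_retry(lambda_arn, max_attempts=3, interval_seconds=2, backoff_rate=2.0):
    """Build a Task state that retries the Lambda with exponential back-off."""
    return {
        "Type": "Task",
        "Resource": lambda_arn,
        "Retry": [
            {
                "ErrorEquals": ["States.ALL"],  # retry on any error
                "IntervalSeconds": interval_seconds,  # wait before the first retry
                "MaxAttempts": max_attempts,
                "BackoffRate": backoff_rate,  # multiplier applied to each wait
            }
        ],
        "End": True,
    }


# Placeholder ARN: replace REGION, ACCOUNT_ID, and FUNCTION_NAME with real values.
definition = {
    "StartAt": "ProcessMessage",
    "States": {
        "ProcessMessage": task_with_retry(
            "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME"
        )
    },
}

print(json.dumps(definition, indent=2))
```

The printed JSON is the kind of definition a state machine would be created from. With these example values, the waits between attempts would be 2, 4, and 8 seconds.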

Solution 3:

Use a CloudWatch scheduled event to trigger a Lambda function that polls for FAILED messages.

Error handling for a given event source depends on how Lambda is invoked. Amazon CloudWatch Events invokes your Lambda function asynchronously.

Solution 5:[5]

AWS supports partial batch responses. Here is an example in TypeScript:

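import type { SQSEvent, SQSBatchResponse } from 'aws-lambda'

type Result = {
  itemIdentifier: string
  status: 'failed' | 'success'
}

// Type guard: keeps only the settled promises that were fulfilled
const isFulfilled = <T>(
  result: PromiseFulfilledResult<T> | PromiseRejectedResult
): result is PromiseFulfilledResult<T> => result.status === 'fulfilled'

// Type guard: keeps only the results whose processing was marked as failed
const isFailed = (
  result: PromiseFulfilledResult<Result>
): result is PromiseFulfilledResult<
  Omit<Result, 'status'> & { status: 'failed' }
> => result.value.status === 'failed'

export const handler = async (sqsEvent: SQSEvent): Promise<SQSBatchResponse> => {
  const results = await Promise.allSettled(
    sqsEvent.Records.map(async (record): Promise<Result> => {
      try {
        // process the record here
        return { status: 'success', itemIdentifier: record.messageId }
      } catch (e) {
        console.error(e)
        return { status: 'failed', itemIdentifier: record.messageId }
      }
    })
  )

  // Lambda expects the failed message IDs under the batchItemFailures key
  return {
    batchItemFailures: results
      .filter(isFulfilled)
      .filter(isFailed)
      .map((result) => ({
        itemIdentifier: result.value.itemIdentifier,
      })),
  }
}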
Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Deiv
Solution 2
Solution 3
Solution 4
Solution 5 Krystian Mateusiak