Kinesis Firehose putting JSON objects in S3 without separator comma

Before sending the data, I call JSON.stringify on it, and it looks like this:

{"data": [{"key1": value1, "key2": value2}, {"key1": value1, "key2": value2}]}

But once it passes through AWS API Gateway and Kinesis Firehose puts it in S3, it looks like this:

    {
     "key1": value1, 
     "key2": value2
    }{
     "key1": value1, 
     "key2": value2
    }

The separator commas between the JSON objects are gone, but I need them to process the data properly.

Template in the API Gateway:

#set($root = $input.path('$'))
{
    "DeliveryStreamName": "some-delivery-stream",
    "Records": [
#foreach($r in $root.data)
#set($data = "{
    ""key1"": ""$r.value1"",
    ""key2"": ""$r.value2""
}")
    {
        "Data": "$util.base64Encode($data)"
    }#if($foreach.hasNext),#end
#end
    ]
}

Solution 1:^[1]

I had this same problem recently, and the only answers I could find were either to append a line break ("\n") to every JSON message before posting it to the Kinesis stream, or to use a raw JSON decoder that can process concatenated JSON objects without delimiters.

I posted a Python solution in a related Stack Overflow post: https://stackoverflow.com/a/49417680/1546785
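For the first option (appending a line break on the producer side), here is a minimal sketch in Python. The helper name build_firehose_records and the sample items are made up for illustration; the stream name is taken from the question's template. The actual send via boto3's put_record_batch is left commented out because it needs AWS credentials and an existing delivery stream.

```python
import json


def build_firehose_records(items):
    """Serialize each item as one JSON line, in the shape put_record_batch expects."""
    # The trailing "\n" is what keeps the objects separable once Firehose
    # concatenates the record payloads into a single S3 object.
    return [{"Data": (json.dumps(item) + "\n").encode("utf-8")} for item in items]


records = build_firehose_records([{"key1": 1, "key2": 2}, {"key1": 3, "key2": 4}])

# Sending is left commented out (requires credentials and a real stream):
# import boto3
# boto3.client("firehose").put_record_batch(
#     DeliveryStreamName="some-delivery-stream", Records=records
# )
```

With this in place, each object ends up on its own line in the S3 file rather than being joined as `}{`.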

Solution 2:^[2]

Once AWS Firehose dumps the JSON objects to S3, it is perfectly possible to read the individual JSON objects from the files.

Using Python, you can use the raw_decode method of JSONDecoder from the json package:

from json import JSONDecoder, JSONDecodeError
import re
import json
import boto3

NOT_WHITESPACE = re.compile(r'[^\s]')

def decode_stacked(document, pos=0, decoder=JSONDecoder()):
    while True:
        match = NOT_WHITESPACE.search(document, pos)
        if not match:
            return
        pos = match.start()

        try:
            obj, pos = decoder.raw_decode(document, pos)
        except JSONDecodeError:
            # do something sensible if there's some error
            raise
        yield obj

s3 = boto3.resource('s3')

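s3_object = s3.Object("my-bucket", "my-firehose-json-key.json")
# read() returns bytes; decode to str so the str regex and raw_decode can work on it
file_content = s3_object.get()['Body'].read().decode('utf-8')
for obj in decode_stacked(file_content):
    print(json.dumps(obj))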
    #  { "key1":value1,"key2":value2}
    #  { "key1":value1,"key2":value2}

source: https://stackoverflow.com/a/50384432/1771155
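To see what raw_decode does without touching S3, here is a small sketch on an in-memory string (the sample values are made up). raw_decode returns the decoded object together with the index where it stopped, and that index is what lets you continue with the next object. It does not skip leading whitespace on its own, which is why a whitespace-skipping regex is useful when objects are separated by spaces or newlines.

```python
from json import JSONDecoder

decoder = JSONDecoder()

# Two JSON objects back to back, the way Firehose writes them without a delimiter.
stacked = '{"key1": 1, "key2": 2}{"key1": 3, "key2": 4}'

# Decode the first object; `end` is the index just past its closing brace.
first, end = decoder.raw_decode(stacked, 0)

# Resume decoding from where the first object ended.
second, end2 = decoder.raw_decode(stacked, end)

print(first, second)
```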

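Using Glue / PySpark, you can use the following. Note that sc.textFile splits the input on newlines, so this assumes each JSON object is on its own line: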

import json

rdd = sc.textFile("s3a://my-bucket/my-firehose-file-containing-json-objects")
df = rdd.map(lambda x: json.loads(x)).toDF()
df.show()

source: https://stackoverflow.com/a/62984450/1771155

Solution 3:^[3]

One approach you could consider is to configure data processing for your Kinesis Firehose delivery stream by adding a Lambda function as its data processor, which would be executed before finally delivering the data to the S3 bucket.

DeliveryStream:
  ...
  Type: AWS::KinesisFirehose::DeliveryStream
  Properties:
    DeliveryStreamType: DirectPut
    ExtendedS3DestinationConfiguration:
      ...
      BucketARN: !GetAtt MyDeliveryBucket.Arn
      ProcessingConfiguration:
        Enabled: true
        Processors:
          - Parameters:
              - ParameterName: LambdaArn
                ParameterValue: !GetAtt MyTransformDataLambdaFunction.Arn
            Type: Lambda
    ...

And in the Lambda function, ensure that '\n' is appended to the record's JSON string. Below is the Lambda function myTransformData.ts in Node.js:

import {
  FirehoseTransformationEvent,
  FirehoseTransformationEventRecord,
  FirehoseTransformationHandler,
  FirehoseTransformationResult,
  FirehoseTransformationResultRecord,
} from 'aws-lambda';

const createDroppedRecord = (
  recordId: string
): FirehoseTransformationResultRecord => {
  return {
    recordId,
    result: 'Dropped',
    data: Buffer.from('').toString('base64'),
  };
};

const processData = (
  payloadStr: string,
  record: FirehoseTransformationEventRecord
) => {
  let jsonRecord;
  // ...
  // Process the original payload
  // and create the record as JSON
  return jsonRecord;
};

const transformRecord = (
  record: FirehoseTransformationEventRecord
): FirehoseTransformationResultRecord => {
  try {
    const payloadStr = Buffer.from(record.data, 'base64').toString();
    const jsonRecord = processData(payloadStr, record);
    if (!jsonRecord) {
      console.error('Error creating json record');
      return createDroppedRecord(record.recordId);
    }
    return {
      recordId: record.recordId,
      result: 'Ok',
      // Ensure that '\n' is appended to the record's JSON string.
      data: Buffer.from(JSON.stringify(jsonRecord) + '\n').toString('base64'),
    };
  } catch (error) {
    // Backticks are required for ${...} interpolation to take effect
    console.error(`Error processing record ${record.recordId}: `, error);
    return createDroppedRecord(record.recordId);
  }
};

const transformRecords = (
  event: FirehoseTransformationEvent
): FirehoseTransformationResult => {
  const records: FirehoseTransformationResultRecord[] = [];
  for (const record of event.records) {
    const transformed = transformRecord(record);
    records.push(transformed);
  }
  return { records };
};

export const handler: FirehoseTransformationHandler = async (
  event,
  _context
) => {
  const transformed = transformRecords(event);
  return transformed;
};

Once the newline delimiter is in place, AWS services such as Athena can work properly with the JSON records in the S3 bucket, instead of seeing only the first record.
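For your own consumers, newline-delimited output is also straightforward to parse. A minimal sketch in Python, assuming the file content has already been read into a string (the helper name parse_json_lines and the sample data are made up for illustration):

```python
import json


def parse_json_lines(text):
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Sample content shaped like a Firehose file after the '\n' fix.
sample = '{"key1": 1, "key2": 2}\n{"key1": 3, "key2": 4}\n'
rows = parse_json_lines(sample)
print(rows)
```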

Solution 4:^[4]

Please use this code to solve your issue:


__Author__ = "Soumil Nitin Shah"
import json
import boto3
import base64


class MyHasher(object):
    def __init__(self, key):
        self.key = key

    def get(self):
        keys = str(self.key).encode("UTF-8")
        keys = base64.b64encode(keys)
        keys = keys.decode("UTF-8")
        return keys

def lambda_handler(event, context):

    output = []
    for record in event['records']:

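        payload = base64.b64decode(record['data'])

        # Parse the payload (e.g. an EventBridge event) and re-serialize it with a
        # trailing newline. json.dumps is used because str() on a dict produces
        # Python repr (single quotes), which is not valid JSON.
        serialize_payload = json.dumps(json.loads(payload)) + "\n"
        hasherHelper = MyHasher(key=serialize_payload)
        encoded_data = hasherHelper.get()

        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': encoded_data
        }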
        print("output_record", output_record)

        output.append(output_record)

    return {'records': output}

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Tom Chapin
Solution 2	Vincent Claes
Solution 3	Yuci
Solution 4	Soumil Nitin Shah
