'Kinesis Firehose putting JSON objects in S3 without seperator comma
Before sending the data I am using JSON.stringify to the data and it looks like this
{"data": [{"key1": value1, "key2": value2}, {"key1": value1, "key2": value2}]}
But once it passes through AWS API Gateway and Kinesis Firehose puts it to S3 it looks like this
{
"key1": value1,
"key2": value2
}{
"key1": value1,
"key2": value2
}
The seperator comma between the JSON objects are gone but I need it to process data properly.
Template in the API Gateway:
#set($root = $input.path('$'))
{
"DeliveryStreamName": "some-delivery-stream",
"Records": [
#foreach($r in $root.data)
#set($data = "{
""key1"": ""$r.value1"",
""key2"": ""$r.value2""
}")
{
"Data": "$util.base64Encode($data)"
}#if($foreach.hasNext),#end
#end
]
}
Solution 1:[1]
I had this same problem recently, and the only answers I was able to find were basically just to add line breaks ("\n") to the end of every JSON message whenever you posted them to the Kinesis stream, or to use a raw JSON decoder method of some sort that can process concatenated JSON objects without delimiters.
I posted a python code solution which can be found over here on a related Stack Overflow post: https://stackoverflow.com/a/49417680/1546785
Solution 2:[2]
Once AWS Firehose dumps the JSON objects to s3, it's perfectly possible to read the individual JSON objects from the files.
Using Python you can use the raw_decode function from the json package
from json import JSONDecoder, JSONDecodeError
import re
import json
import boto3
NOT_WHITESPACE = re.compile(r'[^\s]')
def decode_stacked(document, pos=0, decoder=JSONDecoder()):
while True:
match = NOT_WHITESPACE.search(document, pos)
if not match:
return
pos = match.start()
try:
obj, pos = decoder.raw_decode(document, pos)
except JSONDecodeError:
# do something sensible if there's some error
raise
yield obj
s3 = boto3.resource('s3')
obj = s3.Object("my-bukcet", "my-firehose-json-key.json")
file_content = obj.get()['Body'].read()
for obj in decode_stacked(file_content):
print(json.dumps(obj))
# { "key1":value1,"key2":value2}
# { "key1":value1,"key2":value2}
source: https://stackoverflow.com/a/50384432/1771155
Using Glue / Pyspark you can use
import json
rdd = sc.textFile("s3a://my-bucket/my-firehose-file-containing-json-objects")
df = rdd.map(lambda x: json.loads(x)).toDF()
df.show()
Solution 3:[3]
One approach you could consider is to configure data processing for your Kinesis Firehose delivery stream by adding a Lambda function as its data processor, which would be executed before finally delivering the data to the S3 bucket.
DeliveryStream:
...
Type: AWS::KinesisFirehose::DeliveryStream
Properties:
DeliveryStreamType: DirectPut
ExtendedS3DestinationConfiguration:
...
BucketARN: !GetAtt MyDeliveryBucket.Arn
ProcessingConfiguration:
Enabled: true
Processors:
- Parameters:
- ParameterName: LambdaArn
ParameterValue: !GetAtt MyTransformDataLambdaFunction.Arn
Type: Lambda
...
And in the Lambda function, ensure that '\n' is appended to the record's JSON string, see below the Lambda function myTransformData.ts in Node.js:
import {
FirehoseTransformationEvent,
FirehoseTransformationEventRecord,
FirehoseTransformationHandler,
FirehoseTransformationResult,
FirehoseTransformationResultRecord,
} from 'aws-lambda';
const createDroppedRecord = (
recordId: string
): FirehoseTransformationResultRecord => {
return {
recordId,
result: 'Dropped',
data: Buffer.from('').toString('base64'),
};
};
const processData = (
payloadStr: string,
record: FirehoseTransformationEventRecord
) => {
let jsonRecord;
// ...
// Process the orginal payload,
// And create the record in JSON
return jsonRecord;
};
const transformRecord = (
record: FirehoseTransformationEventRecord
): FirehoseTransformationResultRecord => {
try {
const payloadStr = Buffer.from(record.data, 'base64').toString();
const jsonRecord = processData(payloadStr, record);
if (!jsonRecord) {
console.error('Error creating json record');
return createDroppedRecord(record.recordId);
}
return {
recordId: record.recordId,
result: 'Ok',
// Ensure that '\n' is appended to the record's JSON string.
data: Buffer.from(JSON.stringify(jsonRecord) + '\n').toString('base64'),
};
} catch (error) {
console.error('Error processing record ${record.recordId}: ', error);
return createDroppedRecord(record.recordId);
}
};
const transformRecords = (
event: FirehoseTransformationEvent
): FirehoseTransformationResult => {
let records: FirehoseTransformationResultRecord[] = [];
for (const record of event.records) {
const transformed = transformRecord(record);
records.push(transformed);
}
return { records };
};
export const handler: FirehoseTransformationHandler = async (
event,
_context
) => {
const transformed = transformRecords(event);
return transformed;
};
Once the newline delimiter is in place, AWS services such as Athena will be able to work properly with the JSON record data in the S3 bucket, not just seeing the first JSON record only.
Solution 4:[4]
please use this code to solve your issue
__Author__ = "Soumil Nitin Shah"
import json
import boto3
import base64
class MyHasher(object):
def __init__(self, key):
self.key = key
def get(self):
keys = str(self.key).encode("UTF-8")
keys = base64.b64encode(keys)
keys = keys.decode("UTF-8")
return keys
def lambda_handler(event, context):
output = []
for record in event['records']:
payload = base64.b64decode(record['data'])
"""Get the payload from event bridge and just get data attr """""
serialize_payload = str(json.loads(payload)) + "\n"
hasherHelper = MyHasher(key=serialize_payload)
hash = hasherHelper.get()
output_record = {
'recordId': record['recordId'],
'result': 'Ok',
'data': hash
}
print("output_record", output_record)
output.append(output_record)
return {'records': output}
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Tom Chapin |
| Solution 2 | Vincent Claes |
| Solution 3 | Yuci |
| Solution 4 | Soumil Nitin Shah |
