CSV parsing errors from AWS S3 read stream
I am trying to read a CSV file from S3, transform each row, and then write the file back to S3 (to a different location) using streams. The code looks something like this:
const csv = require('csv');
const stream = require('stream');
const AWS = require('aws-sdk');

const s3 = new AWS.S3();

const readFromS3 = (bucket, key) => {
  return s3.getObject({
    Bucket: bucket,
    Key: key
  }).createReadStream();
};

const writeToS3 = (bucket, destinationFile) => {
  const writeStream = new stream.PassThrough();
  return {
    writeStream,
    uploadToS3: s3.upload({
      Bucket: bucket,
      Key: destinationFile,
      Body: writeStream,
      ContentType: 'text/csv'
    }).promise()
  };
};

// bucket, sourceFile, destinationFile and callback are provided by the surrounding handler
const readStream = readFromS3(bucket, sourceFile);
const parse = csv.parse({ skip_lines_with_error: true });
const transform = csv.transform(async (row, next) => {...});
const stringify = csv.stringify();
const { writeStream, uploadToS3 } = writeToS3(bucket, destinationFile);

writeStream.on('end', async () => {
  const uploadResponse = await uploadToS3;
  callback(null, uploadResponse);
});

readStream.pipe(parse).pipe(transform).pipe(stringify).pipe(writeStream);
For most files this solution works as expected. For some files I get one of the following errors:
CSV_QUOTE_NOT_CLOSED Quote Not Closed: the parsing is finished with an opening quote at line 570651 undefined
CSV_INCONSISTENT_RECORD_LENGTH Invalid Record Length: expect 7, got 5 on line 599619
The obvious explanation would be that the files have formatting issues, but I have manually verified that they are formatted correctly. Additionally, running the code on a problem file sometimes produces the desired output without any change to the file. My suspicion is that the S3 read stream doesn't deliver the file line by line: it emits chunks of arbitrary size, so some lines end up split at random places.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow