How to write Parquet files to AWS S3 using Apache Beam Java?
I am trying to convert JSON -> GenericRecord -> Parquet and then write the result to S3. I am able to produce the Parquet output, but I don't know how to write the Parquet file directly to S3 without first storing it on the local filesystem.
Code I wrote for this:

```java
public static void main(String[] args) throws IOException {
    PipelineOptionsFactory.register(MainConfig.class);
    MainConfig options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MainConfig.class);
    Pipeline pipeLine = Pipeline.create(options);

    // Configure static AWS credentials on the pipeline options
    BasicAWSCredentials awsCredentials =
        new BasicAWSCredentials(options.getAWSAccessKey(), options.getAWSSecretKey());
    options.setAwsCredentialsProvider(new AWSStaticCredentialsProvider(awsCredentials));

    Schema jsonSchema = new Schema.Parser().parse(schemaString);
    logger.info("Schema fields: {}", jsonSchema.getFields());

    pipeLine
        .apply("ReadMyFile", TextIO.read().from(options.getInput()))
        .apply("Convert Json To Generic Record", ParDo.of(new JsonToGeneralRecord(jsonSchema)))
        .setCoder(AvroCoder.of(GenericRecord.class, jsonSchema))
        .apply("Generate the Parquet files",
            FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(jsonSchema))
                .to(options.getOutput())
                .withNumShards(1)
                .withSuffix(".parquet"));

    pipeLine.run();
}
```
In the end, I just want to write that Parquet output directly to S3.
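One approach worth noting: when the `beam-sdks-java-io-amazon-web-services` module is on the classpath, Beam registers an S3 filesystem, so `FileIO` can write to an `s3://` URI the same way it writes to a local path. The sketch below assumes that module is present and that the pipeline options extend `S3Options`; the region and bucket name are placeholders, not values from the question.

```java
// Minimal sketch, assuming beam-sdks-java-io-amazon-web-services is a dependency.
// Region and bucket below are placeholders; substitute your own.
S3Options s3Options = options.as(S3Options.class);
s3Options.setAwsRegion("us-east-1"); // placeholder region
s3Options.setAwsCredentialsProvider(new AWSStaticCredentialsProvider(awsCredentials));

// With the S3 filesystem registered, the existing write step can target S3
// directly by pointing the output at an s3:// URI:
pipeLine.apply("Generate the Parquet files",
    FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(jsonSchema))
        .to("s3://my-bucket/output/") // placeholder bucket
        .withNumShards(1)
        .withSuffix(".parquet"));
```

Equivalently, the region can be supplied on the command line (e.g. `--awsRegion=us-east-1`) and the `s3://` path passed as the existing `--output` option, leaving the pipeline code unchanged.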
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow