Search & delete data from Parquet objects in an S3 bucket using AWS

I have an S3 bucket, my-s3-bucket, which contains multiple Parquet files (all with a similar schema) under multiple directories, for example:

  1. s3://my-s3-bucket/path1/path2/path3/path4/file1.snappy.parquet
  2. s3://my-s3-bucket/path1/path2/path3/path4/file2.snappy.parquet
  3. s3://my-s3-bucket/path1/path2/path3/path5/file3.snappy.parquet
  4. s3://my-s3-bucket/path1/path2/path3/path5/file4.snappy.parquet
  5. s3://my-s3-bucket/path1/path2/path6/path7/file5.snappy.parquet
  6. s3://my-s3-bucket/path1/path2/path6/path7/file6.snappy.parquet

.....

Each Parquet file contains two columns, user-id and favorite-number, with several rows. My use case involves receiving notifications throughout the day, each containing two attributes: notification-id and user-id. For each notification, I need to check all the Parquet files in my-s3-bucket to see whether the notification's user-id is present in any of them. Wherever it is present, I need to delete that user-id's row(s) from the Parquet file. For each notification, I also need to send a response to an API or SNS topic indicating whether the user-id was FoundAndDeleted or NotFound. Apart from deleting the data for the given user-id, I want my Parquet files to remain intact, as I don't want to lose any other data.
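For reference, the per-file delete boils down to filtering out the matching rows and rewriting the file; here is a minimal sketch of that logic in plain Python (the Parquet read/write itself, e.g. via pyarrow, is elided, and the function name is my own):

```python
def drop_user_rows(rows, user_id):
    """Remove every row belonging to user_id from an in-memory table.

    rows mirrors the Parquet schema as a list of dicts with keys
    "user-id" and "favorite-number". Returns (remaining_rows, found),
    where found tells the caller whether to respond FoundAndDeleted
    or NotFound. In the real pipeline you would load the Parquet file
    into this shape, filter, and write the file back, so every other
    row stays intact.
    """
    remaining = [row for row in rows if row["user-id"] != user_id]
    return remaining, len(remaining) != len(rows)


rows = [
    {"user-id": "u1", "favorite-number": 7},
    {"user-id": "u2", "favorite-number": 3},
]
kept, found = drop_user_rows(rows, "u1")
# found is True; kept holds only u2's row
```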

Given the above use-case, I want to build a simple end-to-end workflow of AWS Resources which can help me achieve both objectives:

  1. Searching for & deleting data from multiple Parquet files in S3
  2. Tracking whether the data was FoundAndDeleted or NotFound, so that I can respond to the incoming notification accordingly.

I thought of using an SQS queue to receive all the incoming notifications, and an AWS Lambda function to pick up and process these notifications from the queue and send the responses. For searching in S3, I thought of using AWS Athena; for deleting data rows from Parquet files, AWS Glue (Spark) or AWS Fargate. I also thought of batch-processing the notifications as an optimisation. However, I am still confused about two aspects:

  1. Deciding the right choice of AWS resources, and how to break down and assign my tasks to them
  2. What the precise workflow should look like, and how to connect the various components together to achieve the complete functionality
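One way the Lambda side of such a workflow might look, sketched under some assumptions of my own: notifications arrive as flat JSON bodies on the SQS queue, a Glue table (e.g. built by a crawler over the bucket) exposes the Parquet data to Athena, and Athena's "$path" pseudo-column reports which S3 objects contain the user-id. The database, table, output location, and topic ARN below are placeholders, and the actual rewrite of each file is left to a Glue/Spark job:

```python
import json


def parse_notification(body):
    """Extract (notification-id, user-id) from an SQS message body,
    assuming the notification is a flat JSON object with those keys."""
    msg = json.loads(body)
    return msg["notification-id"], msg["user-id"]


def find_files_with_user(user_id, database="my_db", table="my_table"):
    """Ask Athena which Parquet objects contain rows for user_id.

    Uses Athena's "$path" pseudo-column, which returns the S3 object
    each matching row came from. Result polling is elided; boto3 is
    imported lazily so the pure helpers stay importable without it.
    """
    import boto3

    athena = boto3.client("athena")
    query = (
        f'SELECT DISTINCT "$path" FROM "{database}"."{table}" '
        f'WHERE "user-id" = \'{user_id}\''
    )
    athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    # ... poll get_query_execution, then collect the matching S3
    # paths from get_query_results ...
    return []  # placeholder until the results are parsed


def handler(event, context):
    """SQS-triggered Lambda: one event carries a batch of records."""
    import boto3

    sns = boto3.client("sns")
    for record in event["Records"]:
        notification_id, user_id = parse_notification(record["body"])
        files = find_files_with_user(user_id)
        # Rewriting each file in `files` without user_id's rows
        # (keeping everything else intact) would be delegated to a
        # Glue/Spark job or a Fargate task here.
        status = "FoundAndDeleted" if files else "NotFound"
        sns.publish(
            TopicArn="arn:aws:sns:REGION:ACCOUNT:notification-responses",
            Message=json.dumps(
                {"notification-id": notification_id, "status": status}
            ),
        )
```

The batching optimisation mentioned above would slot naturally into this handler, e.g. by collecting all user-ids in the SQS batch into a single Athena `IN (...)` query and a single Glue run.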

Please let me know your thoughts on how I can achieve this end-to-end workflow using AWS.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
