Terabytes of data on AWS S3, want to do data processing followed by modeling on AWS SageMaker

I have terabytes of data stored in an S3 bucket in Parquet format. I want to develop a prototype and then scale it up. The data is very large: it grows by about 50 GB every day. It is conventional structured data (not image, video, or audio). The history window for the prototype isn't decided yet, but if it is 3 months, that comes to 90 days x 50 GB = 4,500 GB for the prototype (or 9,000 GB for 6 months).

I want to do data processing, derive some new variables, and run EDA, followed by modeling (feature engineering and unsupervised deep learning algorithms). Can anyone suggest the best approach here? For example: use a SageMaker notebook, write the data-processing Python scripts there, save the processed data to an S3 prefix, and then apply the algorithms? Or use EMR for data processing followed by SageMaker for EDA and modeling? Or is there a better way?



Solution 1:[1]

There are a couple of ways, depending on the finer details. You could use SageMaker Data Wrangler for EDA, or SageMaker Processing with the ShardedByS3Key data distribution to scale the preprocessing out across multiple instances (see the sketch below). After EDA you can use SageMaker built-in algorithms or bring your own, and SageMaker Pipelines for CI/CD.
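As a minimal sketch of the scale-out preprocessing step, the SageMaker Python SDK lets you launch a Processing job where each instance receives a disjoint shard of the input objects via ShardedByS3Key. The IAM role ARN, the S3 prefixes, and preprocess.py below are placeholders, not values from the question; the SKLearnProcessor and container version are one possible choice, not the only one.

```python
# Sketch: distributed preprocessing of Parquet files with SageMaker Processing.
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # hypothetical execution role

processor = SKLearnProcessor(
    framework_version="1.2-1",      # adjust to an available scikit-learn container version
    role=role,
    instance_type="ml.m5.4xlarge",
    instance_count=4,               # each instance gets a disjoint subset of the S3 objects
)

processor.run(
    code="preprocess.py",           # your script: read Parquet, derive new variables, write results
    inputs=[
        ProcessingInput(
            source="s3://my-bucket/raw/",                # hypothetical input prefix
            destination="/opt/ml/processing/input",
            s3_data_distribution_type="ShardedByS3Key",  # split objects across instances
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://my-bucket/processed/",     # hypothetical output prefix
        )
    ],
)
```

Since the daily 50 GB increments arrive as separate Parquet objects, sharding by S3 key spreads them roughly evenly across the instances, and the same script scales from the 3-month prototype to the full history by raising instance_count.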

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Anoop