Data Warehousing for Defined schema and its invoices
As part of our use case, we generate invoice documents for goods transported by ship and cargo. Each document contains details of a container and its contents. We need to store these invoice documents for 15 years and be able to retrieve them.
Here are the details -
Container Name | Origin Port | Destination Port -> Invoice Name
We need to be able to retrieve the invoice name using container name, origin port, or destination port, or any combination of these columns (similar to SQL).
Each invoice will be at least 40 to 70 MB.
Any suggestions on building this? We use AWS as our cloud provider. I just need some pointers to help me get started.
One approach is to use Redshift + Athena backed by Spark jobs.
Solution 1:[1]
Invoices of that size likely include images and don't fit well in a database, but your retrieval needs sound very much like a database. I've done work with clients in the past with similar needs and used Redshift for the analytics on the relational data and S3 for storage of large non-relational data (images). The data table in Redshift can have a JSON (SUPER type) column that contains pointers to any S3 objects and descriptors for what each S3 object is. Any number of S3 objects (up to the max size of the JSON value, which is large) can be referenced by a single row in the data table.
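To make the layout concrete, here is a minimal sketch of the idea, not a definitive implementation. It uses Python's built-in `sqlite3` as a local stand-in for Redshift so it runs anywhere, and a plain text column holding JSON in place of a SUPER column. The table name `invoice_index`, the column names, the bucket `example-invoice-archive`, and the sample rows are all hypothetical. Only the small lookup row lives in the database; the 40 to 70 MB invoice itself would sit in S3 under the referenced key.

```python
import json
import sqlite3

# sqlite3 is only a stand-in for Redshift here; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE invoice_index (
        container_name   TEXT NOT NULL,
        origin_port      TEXT NOT NULL,
        destination_port TEXT NOT NULL,
        invoice_name     TEXT NOT NULL,
        s3_objects       TEXT NOT NULL  -- JSON list of S3 pointers and descriptors
    )
    """
)


def add_invoice(conn, container, origin, destination, invoice_name, s3_objects):
    """Store one lookup row; the invoice file itself stays in S3."""
    conn.execute(
        "INSERT INTO invoice_index VALUES (?, ?, ?, ?, ?)",
        (container, origin, destination, invoice_name, json.dumps(s3_objects)),
    )


def find_invoices(conn, container=None, origin=None, destination=None):
    """Look up invoices by any combination of the three key columns."""
    filters = {
        "container_name": container,
        "origin_port": origin,
        "destination_port": destination,
    }
    # Column names come from the fixed dict above; values are bound as parameters.
    active = {col: val for col, val in filters.items() if val is not None}
    sql = "SELECT invoice_name, s3_objects FROM invoice_index"
    if active:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in active)
    sql += " ORDER BY invoice_name"
    rows = conn.execute(sql, list(active.values()))
    return [(name, json.loads(objects)) for name, objects in rows]


def pointer(invoice_name):
    """Build a hypothetical S3 pointer with a descriptor for the object."""
    return [{
        "bucket": "example-invoice-archive",
        "key": f"invoices/{invoice_name}",
        "kind": "invoice",
    }]


# Made-up sample rows for illustration only.
add_invoice(conn, "CONT-001", "Singapore", "Rotterdam", "INV-1001.pdf", pointer("INV-1001.pdf"))
add_invoice(conn, "CONT-002", "Singapore", "Hamburg", "INV-1002.pdf", pointer("INV-1002.pdf"))
add_invoice(conn, "CONT-003", "Shanghai", "Rotterdam", "INV-1003.pdf", pointer("INV-1003.pdf"))

for name, objects in find_invoices(conn, origin="Singapore", destination="Rotterdam"):
    print(name, objects[0]["key"])
```

Calling `find_invoices` with one, two, or all three arguments covers the "combination of columns" requirement from the question. The returned bucket and key are what you would then pass to whatever S3 client you use to download the actual document.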
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Bill Weiner |
