'Ideal data type / structure / model for storing device data with different parameters / attributes in snowflake
We are in the process of designing a dimensional data model in snow flake to store data from different devices (from solar and wind plants) for reporting / analytical purposes. The data is currently residing in influxDB as time series data, one of the challenges in designing the target DB model is - different devices emit data for different parameters (even though the devices have a super set of parameters, it can vary and chances are there that new parameters can be added to the super set).
One of the key ask is to not have any development efforts (coding) when new parameters / devices are added, hence the model and design needs to have the flexibility to store the data accordingly, with certain configurations, Following are the options,
- Create wide fact tables with all the superset parameters and store nulls for devices which do not send the data.
Pros: Lesser data volume compared to #2.
Cons: a) Will have some effort when new parameters are added. b) Depending on the reporting tool (which will be mostly custom built and not a BI tool) the selection of data from different parameters might not be straight forward like using a where clause based on the needed parameters.
- Create narrow fact tables, the parameter will become a dimension table along with other dimensions and will have a reference to it by ID column, and the value will be present in one column.
Pros: a) No efforts / schema changes when new parameters are added. b) Ease of selecting and filtering data based on the selected parameters.
Cons: a) Data Volume - There are 1000's of devices and multiple parameters under them, so approximately per day it will go to 90M records (~1GB - the base data itself is huge and the unpivot would increase the record count dramatically). b) Performance considerations due to increased data volume especially while querying data.
- Use the support provided by snowflake for semi structured data. OBJECT datatype seems to be a good fit, the parameter name and value can be stored as a key value pair.
Pros: a) No efforts / schema changes when new parameters are added. b) Data volume not increased. c) Ease of selecting and filtering data based on functions provided by SQL - Is this true, based on the documentation, the querying looks straight forward especially for OBJECT datatype. However need confirmation.
Cons: a) Performance considerations due to the usage of semi structured data types - From the documentation , it mentions that the VARIANT data type stores the data in columnar format wherever possible (data remain in json where it is not able to convert) , but there is no mention about the OBJECT data type and how it is handled with this data type. So want to understand whether this will have a considerable performance impact or not.
So, considering the above, what would be the ideal way to store this kind of data where the structure changes dynamically based on different devices.
Solution 1:[1]
Option 3 is my favorite for laziness, cost, and performance reasons:
Snowflake uses the same storage ideas for OBJECTs and VARIANTs, it will be optimized for columnar - as long as your underlying object/variant is well suited for it. This means good performance and compression.
Object/variant will need the less maintenance when adding new parameters.
But option 1 has some advantages too:
It's a good idea for governability to know all your columns and their purposes.
3rd party tools understand navigating columns much better than figuring out objects.
Then you could have a great mix of 3+1:
- Store everything as object/variant.
- Create a view that parses the object and names columns for 3rd party tools.
- When new fields are added, you will just need to update the definition of the view.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Felipe Hoffa |
