Validating per-machine maximal field values using wide checks of pandera's class-based API (SchemaModel)
I am currently creating my first validation schema using pandera's class-based API for a module of a large Python project. The module outputs a simple pandas DataFrame holding signal values of 4 machines, in the following format:
| A* | timestamp | signal1 | signal2 | [More Signals] |
|---:|:--------------------|--------:|--------:|---------------:|
| 0 | 2021-10-01 02:04:00 | 10602.3 | 540.186 | ... |
| 1 | 2021-10-01 02:04:01 | 10702.3 | 541.200 | ... |
| 2 | 2021-10-01 02:04:02 | 10602.5 | 542.385 | ... |
| 3 | 2021-10-01 02:04:03 | 10602.9 | 543.123 | ... |
| 4 | 2021-10-01 02:04:04 | 11002.1 | 544.535 | ... |
*"A" is the machine name. It differs between datasets but always takes 1 of 4 values. Each DF carries it as the name of its horizontal axis (`df.columns.name`).
Since the logic of the module is still in progress, the main task for now is a very basic validation of minimal and maximal values for each signal. For now it has been decided that the minimal valid value for every signal is always 0. The harder part is that the maximal valid value for each signal differs between the 4 machines.
All information on the machines and their signals' maximal values is stored in a JSON file. This file is loaded into an object so the Python module can reach each of the dictionaries:
```json
{
    "machine_ids": [
        "1",
        "A",
        "B",
        "C"
    ],
    "signal_max_bounds": {
        "1": {
            "signal1": 11000,
            "signal2": 550,
            "signal3": 17,
            "signal4": 3000
        },
        "A": {
            "signal1": 15000,
            "signal2": 700,
            "signal3": 20,
            "signal4": 6000
        },
        ...
    }
}
```
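For reference, such a config object could be instanced with the standard library alone. This is a minimal sketch, not the project's actual loader: the inline config text and the `SimpleNamespace` wrapper are assumptions, chosen only so that `json_instance.machine_ids` and `json_instance.signal_max_bounds` resolve the way the schema below uses them:

```python
import json
from types import SimpleNamespace

# Inline stand-in for the real JSON file (hypothetical values).
CONFIG_TEXT = """
{
    "machine_ids": ["1", "A"],
    "signal_max_bounds": {
        "1": {"signal1": 11000, "signal2": 550, "signal3": 17, "signal4": 3000},
        "A": {"signal1": 15000, "signal2": 700, "signal3": 20, "signal4": 6000}
    }
}
"""

# Expose the parsed JSON through attribute access, matching how the
# schema refers to json_instance.machine_ids etc.
json_instance = SimpleNamespace(**json.loads(CONFIG_TEXT))
```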
I am trying to keep the validation process as simple as possible, so that it is always applicable with only one SchemaModel class. I decided not to create 4 separate tidy-data checks for each signal's field (@pa.check), since the static fields of a SchemaModel class cannot adapt to the DF's name to pick the correct maximal value at runtime. Instead, I took the path of a custom wide check of the whole DF with @pa.dataframe_check.
The resulting validation schema works, and the validation output looks just as expected. The code of the schema is at the end of this question.
Since I am new to pandera and to validation in general, I would like to know whether there is a better solution for this task, using pandera functionality, other validation modules, or by including such a basic algorithm directly in the logic of the module.
My concerns are:
- The number of signals, as well as their order, might change for each machine -> we always have to manually update the fields of the schema and adjust the logic of the custom wide check.
- The order of the signals in the DFs is not critical for validation. However, addressing all of them at once in a single wide check requires that the columns (whether selected by index with .iloc or by name with .loc) strictly match the order of the signal names in the JSON file for the check's final assertion -> the only workaround I can imagine is splitting the wide check into several ones, each returning 1 Series for one signal only. That would be anything but correct or efficient, since wide checks are meant to validate the whole dataset.
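One way around the order coupling, sketched here with plain pandas (outside pandera, and with hypothetical bounds and data), is to align the comparison by column *name* rather than position, using `DataFrame.le` with a `Series` of bounds, since flexible comparisons align on labels:

```python
import pandas as pd

# Hypothetical per-machine bounds, keyed by signal name.
bounds = {"signal1": 15000.0, "signal2": 700.0}

# Note the column order deliberately differs from the bounds order.
df = pd.DataFrame({"signal2": [540.0, 900.0], "signal1": [100.0, 200.0]})

# .le() aligns the Series index with the column labels, so the
# column order in df (or in the JSON) no longer matters.
result = df.le(pd.Series(bounds))
```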
- The first column of the DataFrame holds DateTime values that are only validated for uniqueness; for now their values are taken as correct thanks to the logic of the module. However, a single wide check over the whole DF requires the boolean result DF to have the same shape as the validated DF (matching row and column counts) -> I had to include an extra line in the wide check just to insert an all-True column in place of the datetimes to match the shape of the validated DF. This also looks anything but optimal.
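If the check only needs to flag rows rather than individual cells, a dataframe-level check can also return a boolean Series (pandera accepts a scalar bool, a boolean Series, or a boolean DataFrame from a wide check), which sidesteps the shape matching and the padded timestamp column. The row reduction itself is plain pandas; the bounds and values below are illustrative:

```python
import pandas as pd

bounds = {"signal1": 15000.0, "signal2": 700.0}
df = pd.DataFrame({"signal1": [100.0, 200.0], "signal2": [540.0, 900.0]})

# Collapse the per-cell comparison into one boolean per row; columns
# not listed in bounds (e.g. timestamp) are simply never compared.
row_ok = df[list(bounds)].le(pd.Series(bounds)).all(axis=1)
```

The trade-off is granularity: a row-wise result reports which rows failed, but not which signal caused the failure.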
- If the validation schema ever needs to be extended with more complicated rules that genuinely require wide checks, e.g. addressing the mean values of each column, would it be right to keep the wide check for the maximal values? -> We would once again end up with several wide checks in one schema, which is logically incorrect, while cramming so many assertions into one wide check works against clean code.
Of course there might be many other issues with the current validation. I would appreciate hearing the opinions of experienced developers on this matter. I am also posting a link to this question in the Discussions section of pandera's GitHub repo.
Adding a machine_id column is not desired, since we store all production-related metadata in a separate part of the module to keep the resulting dataset of 17-20k rows somewhat lighter. Overhead matters a lot in this project.
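For completeness, a DF of the described shape can carry its machine id without an extra column by setting the name of the columns axis, which is exactly what the schema's wide check reads back. A small plain-pandas sketch with illustrative values:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2021-10-01 02:04:00", "2021-10-01 02:04:01"]),
        "signal1": [10602.3, 10702.3],
        "signal2": [540.186, 541.200],
    }
)
# The machine id lives on the axis name, not in a column of its own,
# so it adds no per-row storage overhead.
df.columns.name = "A"
```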
And finally here is the code for the validation class with the wide-check of DataFrames:
```python
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, DateTime, Series


class ValidationSchema(pa.SchemaModel):
    timestamp: Series[DateTime] = pa.Field(unique=True, nullable=False)
    signal1: Series[float] = pa.Field(ge=0, nullable=False, coerce=True)
    signal2: Series[float] = pa.Field(ge=0, nullable=False, coerce=True)
    signal3: Series[float] = pa.Field(ge=0, nullable=False, coerce=True)
    signal4: Series[int] = pa.Field(ge=0, nullable=True, coerce=True)

    @pa.dataframe_check
    def validate_max_values(cls, dataset: pd.DataFrame) -> DataFrame[bool]:
        # The machine id is carried as the name of the columns axis.
        machine_id = dataset.columns.name
        if machine_id not in json_instance.machine_ids:
            raise ValueError(f"Unknown machine id: {machine_id}")
        max_bounds_dict = json_instance.signal_max_bounds[machine_id]
        # Element-wise comparison; relies on the column order matching
        # the order of the signal names in the JSON file.
        result_df = dataset.iloc[:, 1:] <= list(max_bounds_dict.values())
        # Pad with an all-True timestamp column so the boolean result
        # matches the shape of the validated DF.
        result_df.insert(loc=0, column="timestamp", value=True)
        return result_df
```
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
