Python 3.8+: Test if one YAML is a subset of another
In short, I am using a YAML file to configure the parameters of some pipelines / functions. In Python, this is a nested dictionary, and parameters can themselves be arrays / dictionaries. It would be helpful to iterate through all of the configuration files and find those in which a given subset of values is specified, e.g.
# toy example of all parameters a config file might have
- param_a: 1
- param_b:
  - b1: 'a'
  - b2: [1,2,3]
# want all configs with these values
- param_a: 1
- param_b:
  - b2: [1,2,3]
Of course one could do recursion on each nested dictionary, but rather than reinvent the wheel, I was wondering if there is a tried and true solution.
I have seen some related questions (about confirming that two dictionaries are identical), and DeepDiff pops up. However, it is unclear whether, when testing for a subset, DeepDiff will return all of the missing keys. Thoughts?
For now I am using the following, assuming the YAML has been loaded properly as a nested dictionary:
def is_config_subset(truth, params):
    '''
    Arguments:
    ----------
        truth (dict): dictionary of parameters to compare to
        params (dict): dictionary of parameters to test

    Returns:
    ----------
        result (bool) whether or not `params` is a subset of `truth`
    '''
    if type(truth) != type(params): return False
    for key, val in params.items():
        if key not in truth: return False
        if type(val) is dict:
            if not is_config_subset(truth[key], val):
                return False
        else:
            if not truth[key] == val: return False
    return True

print(is_config_subset({'a':1, 'b':2}, {'b':2}))
print(is_config_subset({'a':1, 'b':2}, {'b':2, 'c':3}))
print(is_config_subset({'a':1, 'b':2, 'c':[1,2,3]}, {'b':2, 'c':[1,2,3]}))
print(is_config_subset({'a':1, 'b':2, 'c':[1,2,3]}, {'b':2, 'c':[1,2]}))
print(is_config_subset({'a':1, 'b':2, 'c':[1,2,3]}, {'a':2, 'b':2}))
True
False
True
False
False
This is probably a simplistic example and will not work in all cases.
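Since the question also asks about getting all of the missing keys back, here is a minimal sketch of a variant that collects the path of every missing or mismatched entry instead of returning a bool. The name `find_mismatches` and the dotted-path format are illustrative assumptions, not part of any library, and lists are still compared by plain equality, as in the function above.

```python
def find_mismatches(truth, params, prefix=""):
    """Return paths in `params` that are missing from, or differ in, `truth`.

    Hypothetical helper: an empty list means `params` is a subset of `truth`.
    """
    problems = []
    for key, val in params.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if key not in truth:
            problems.append(f"missing: {path}")
        elif isinstance(val, dict) and isinstance(truth[key], dict):
            # Recurse into nested dictionaries, extending the path.
            problems.extend(find_mismatches(truth[key], val, path))
        elif truth[key] != val:
            # Scalars and lists are compared by plain equality.
            problems.append(f"differs: {path}")
    return problems


truth = {"a": 1, "b": {"b1": "a", "b2": [1, 2, 3]}}
print(find_mismatches(truth, {"a": 1, "b": {"b2": [1, 2, 3]}}))
# []
print(find_mismatches(truth, {"a": 2, "b": {"b3": 0}, "c": 3}))
# ['differs: a', 'missing: b.b3', 'missing: c']
```

Returning a list keeps the boolean test available as `not find_mismatches(truth, params)`.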
Solution 1:[1]
I think the simplest - and perhaps most efficient - solution would be to define your own recursive helper function, something like is_subset. Since YAML and JSON are very similar in format, you can first load your YAML data into a Python type - either a list or dict - and pass it to a recursive function, which performs the subset check according to predefined criteria.
For instance, here's a simple (mostly complete) example to get you started.
def is_subset(superset, o):
    """Check if `o` is subset of `superset`"""
    ss_type, o_type = type(superset), type(o)
    if ss_type != o_type:
        return False

    # now we know that both `superset` and `o` are the same type
    if o_type is dict:
        for o_key, o_value in o.items():
            # a key missing from `superset` can never match, even if
            # the value in `o` is None
            if o_key not in superset:
                return False
            if not is_subset(superset[o_key], o_value):
                return False
        return True

    if o_type is list:
        # an empty list can be considered a subset
        if not o:
            return True
        # on other hand, if superset list is empty, we return false
        if not superset:
            return False

        first_o_type = type(o[0])
        first_ss_type = type(superset[0])
        if first_o_type != first_ss_type:
            return False

        if first_o_type is dict:
            # merge each list of dicts into a single dict before comparing
            merged_ss, merged_o = {}, {}
            for v in o:
                if type(v) is dict:  # type check to be safe
                    merged_o.update(v)
                    # On Python 3.9+, the syntax would be:
                    #   merged_o |= v
            for v in superset:
                if type(v) is dict:  # type check to be safe
                    merged_ss.update(v)
            return is_subset(merged_ss, merged_o)

        elif first_o_type is list:
            # nested lists are compared position by position
            for idx, o_value in enumerate(o):
                try:
                    ss_value = superset[idx]
                except IndexError:
                    return False
                if not is_subset(ss_value, o_value):
                    return False
            return True

        else:  # a list of simple types, like [1, 2, 3]
            for o_value in o:
                if o_value not in superset:
                    return False
            return True

    # it's a simple type - not list or dict
    return superset == o

Note there are some edge cases it doesn't cover - for example, lists in the YAML config with mixed data types, such as a list of dict and str values. I'll leave it up to you to decide how to handle those edge cases, in case it's worth covering them as well.
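As one possible way to handle the mixed-type lists mentioned above, here is a minimal sketch that treats a list as a subset when every one of its items matches at least one item of the superset list. The name `is_subset_mixed` is hypothetical, and this is a different design choice from the helper above: lists of dicts are not merged and nested lists are not compared by position, so it may not suit every config layout.

```python
def is_subset_mixed(superset, o):
    """Check if `o` is a subset of `superset`, allowing mixed-type lists."""
    if isinstance(o, dict):
        if not isinstance(superset, dict):
            return False
        # every key of `o` must exist in `superset` with a matching value
        return all(
            key in superset and is_subset_mixed(superset[key], value)
            for key, value in o.items()
        )
    if isinstance(o, list):
        if not isinstance(superset, list):
            return False
        # every item of `o` must match at least one item of `superset`
        return all(
            any(is_subset_mixed(ss_item, o_item) for ss_item in superset)
            for o_item in o
        )
    # simple types: require the same type and an equal value
    return type(superset) is type(o) and superset == o


mixed = [{"name": "x", "size": 1}, "debug", 3]
print(is_subset_mixed(mixed, ["debug", {"name": "x"}]))  # True
print(is_subset_mixed(mixed, ["debug", 4]))              # False
```

The any-match rule costs O(n * m) comparisons per pair of lists, which should be fine for small config files.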
In any case, here's how you'd use the helper function that we declared above:
import yaml

superset_config = yaml.safe_load("""
# toy example of all parameters a config file might have
- param_a: 1
- param_b:
  - b1: 'a'
  - b2: [1,2,3]
""")

subset_config = yaml.safe_load("""
# want all configs with these values
- param_a: 1
- param_b:
  - b2: [1,2,3]
""")

assert is_subset(superset_config, subset_config)

subset_config[0]['param_c'] = 'test'
assert not is_subset(superset_config, subset_config)

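To tie this back to the original goal of searching many configuration files, here is a rough sketch of filtering a set of already-loaded configs. The names `matching_configs` and `dict_subset`, and the file names, are made up for illustration. `dict_subset` is only a stand-in for nested dicts so the snippet runs on its own; in practice you would pass in the `is_subset` helper from above.

```python
def dict_subset(superset, o):
    """Stand-in check for nested dicts only; swap in `is_subset` from above."""
    if isinstance(o, dict):
        return isinstance(superset, dict) and all(
            k in superset and dict_subset(superset[k], v) for k, v in o.items()
        )
    return superset == o


def matching_configs(configs, wanted, check=dict_subset):
    """Return the names of all configs that contain `wanted`."""
    return [name for name, cfg in configs.items() if check(cfg, wanted)]


# Hypothetical configs keyed by file name, as if each had already been
# loaded with yaml.safe_load
configs = {
    "run1.yaml": {"param_a": 1, "param_b": {"b1": "a", "b2": [1, 2, 3]}},
    "run2.yaml": {"param_a": 2, "param_b": {"b1": "a", "b2": [1, 2, 3]}},
    "run3.yaml": {"param_a": 1, "param_b": {"b1": "z", "b2": [1, 2, 3]}},
}
wanted = {"param_a": 1, "param_b": {"b2": [1, 2, 3]}}
print(matching_configs(configs, wanted))  # ['run1.yaml', 'run3.yaml']
```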
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
