Python 3.8+: Test if one YAML is a subset of another

In short, I am using a YAML file as a configuration for the parameters of some pipelines / functions. In Python, this is loaded as a nested dictionary, and parameters can themselves be arrays or dictionaries. It would be helpful to iterate through all of the configuration files and find those in which a given subset of values is specified, e.g.:

# toy example of all parameters a config file might have
- param_a: 1
- param_b: 
  - b1: 'a'
  - b2: [1,2,3]
# want all configs with these values
- param_a: 1
- param_b: 
  - b2: [1,2,3]

Of course one could do recursion on each nested dictionary, but rather than reinvent the wheel, I was wondering if there is a tried and true solution.

I have seen some related questions (about confirming that two dictionaries are identical) where DeepDiff pops up. However, it is unclear whether DeepDiff, when used to test for a subset, will report all of the missing keys. Thoughts?

For now I am using this, assuming the YAML has been loaded properly as a nested dictionary:

def is_config_subset(truth, params):
    '''
    Arguments:
    ----------
        truth (dict): dictionary of parameters to compare to
        params (dict): dictionary of parameters to test
    Returns:
    ----------
        result (bool) whether or not `params` is a subset of `truth`
    '''
    if not type(truth) == type(params): return False
    for key, val in params.items():
        if key not in truth: return False
        if type(val) is dict:
            if not is_config_subset(truth[key], val): 
                return False
        else:            
            if not truth[key] == val: return False
    return True

print(is_config_subset({'a':1, 'b':2}, {'b':2}))
print(is_config_subset({'a':1, 'b':2}, {'b':2, 'c':3}))
print(is_config_subset({'a':1, 'b':2, 'c':[1,2,3]}, {'b':2, 'c':[1,2,3]}))
print(is_config_subset({'a':1, 'b':2, 'c':[1,2,3]}, {'b':2, 'c':[1,2]}))
print(is_config_subset({'a':1, 'b':2, 'c':[1,2,3]}, {'a':2, 'b':2}))

True
False
True
False
False

This is probably a simplistic example and will not work in all cases.
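On the question of seeing which keys are missing, a dependency-free variation on the function above could collect the offending key paths instead of returning a single boolean. This is only a minimal sketch: the name `find_mismatches` and the `(path, reason)` tuple format are made up for illustration, and like `is_config_subset` it only descends into nested dictionaries, comparing lists and scalars with `==`.

```python
def find_mismatches(truth, params, path=""):
    """Return (path, reason) tuples explaining why `params` is not a subset of `truth`.

    Hypothetical helper: an empty list means `params` is a subset.
    Only nested dicts are walked; every other value is compared with `!=`.
    """
    mismatches = []
    for key, val in params.items():
        # build a dotted path such as "param_b.b2" for reporting
        here = f"{path}.{key}" if path else str(key)
        if key not in truth:
            mismatches.append((here, "missing"))
        elif isinstance(val, dict) and isinstance(truth[key], dict):
            mismatches.extend(find_mismatches(truth[key], val, here))
        elif truth[key] != val:
            mismatches.append((here, "different"))
    return mismatches


print(find_mismatches({'a': 1, 'b': 2}, {'b': 2, 'c': 3}))
print(find_mismatches({'a': 1, 'b': {'x': 1}}, {'a': 2, 'b': {'y': 1}}))
```

Collecting every mismatch rather than stopping at the first one costs a full traversal, but it makes it easier to see why a given config was rejected.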



Solution 1:[1]

I think the simplest, and perhaps most efficient, solution would be to define your own recursive helper function, something like is_subset. Since YAML and JSON are very similar in format, you can first load your YAML data into a Python object (either a list or a dict) and pass it to a recursive function that performs the subset check according to predefined criteria.

For instance, here's a simple (mostly complete) example to get you started.

def is_subset(superset, o):
    """Check if `o` is subset of `superset`"""
    ss_type, o_type = type(superset), type(o)

    if ss_type != o_type:
        return False

    # now we know that both `superset` and `o` are the same type

    if o_type is dict:

        for o_key, o_value in o.items():
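            # a key missing from `superset` means `o` is not a subset
            # (`.get` would wrongly match a missing key against a `None` value)
            if o_key not in superset:
                return False

            if not is_subset(superset[o_key], o_value):
                return False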
        else:
            return True

    if o_type is list:

        # an empty list can be considered a subset
        if not o:
            return True

        # on other hand, if superset list is empty, we return false
        if not superset:
            return False

        first_o_type = type(o[0])
        first_ss_type = type(superset[0])

        if first_o_type != first_ss_type:
            return False

        if first_o_type is dict:
            merged_ss, merged_o = {}, {}
            for v in o:
                if type(v) is dict:  # type check to be safe
                    merged_o.update(v)
                    # On Python 3.9+, the syntax would be:
                    #   merged_o |= v

            for v in superset:
                if type(v) is dict:  # type check to be safe
                    merged_ss.update(v)

            return is_subset(merged_ss, merged_o)

        elif first_o_type is list:
            for idx, o_value in enumerate(o):
                try:
                    ss_value = superset[idx]
                except IndexError:
                    return False

                if not is_subset(ss_value, o_value):
                    return False
            else:
                return True

        else:  # a list of simple types, like [1, 2, 3]
            for o_value in o:
                if o_value not in superset:
                    return False
            else:
                return True

    # it's a simple type - not list or dict
    return superset == o

Note that there are some edge cases it doesn't cover, for example lists in the YAML config with mixed data types, such as a list of dict and str values. I'll leave it up to you to decide how to handle those edge cases, if it's worth covering them as well.
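One possible way to handle mixed-type lists is to stop looking at the first element and instead require that every item of the candidate list matches at least one item of the superset list. The sketch below is an alternative under that assumption, not a drop-in replacement: `is_subset_mixed` is a hypothetical name, and unlike the function above it ignores list order and multiplicity everywhere (so `[1, 1]` counts as a subset of `[1]`) and does not merge lists of dicts.

```python
def is_subset_mixed(superset, o):
    """Check if `o` is a subset of `superset`, tolerating mixed-type lists.

    Assumed semantics (illustrative only):
      - dict: every key of `o` exists in `superset` with a matching value
      - list: every item of `o` matches at least one item of `superset`
      - anything else: same type and equal
    """
    if isinstance(o, dict):
        if not isinstance(superset, dict):
            return False
        return all(
            key in superset and is_subset_mixed(superset[key], value)
            for key, value in o.items()
        )

    if isinstance(o, list):
        if not isinstance(superset, list):
            return False
        return all(
            any(is_subset_mixed(candidate, item) for candidate in superset)
            for item in o
        )

    # simple type: the explicit type check keeps `True` from matching `1`
    return type(superset) is type(o) and superset == o


mixed_superset = ['x', {'a': 1, 'b': 2}, [1, 2]]
print(is_subset_mixed(mixed_superset, [{'a': 1}, 'x']))
print(is_subset_mixed(mixed_superset, [{'a': 2}]))
```

The nested `any` inside `all` makes list comparison quadratic in the worst case, which is usually acceptable for small configuration files.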

In any case, here's how you'd use the helper function that we declared above:

import yaml


superset_config = yaml.safe_load("""
# toy example of all parameters a config file might have
- param_a: 1
- param_b:
  - b1: 'a'
  - b2: [1,2,3]
""")

subset_config = yaml.safe_load("""
# want all configs with these values
- param_a: 1
- param_b:
  - b2: [1,2,3]
""")


assert is_subset(superset_config, subset_config)

subset_config[0]['param_c'] = 'test'
assert not is_subset(superset_config, subset_config)
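To come back to the original goal of scanning many configuration files, any subset predicate can be applied to each loaded config in turn. The sketch below assumes the configs have already been loaded (for example with `yaml.safe_load`) into a dict keyed by file name; the file names, the `matching_configs` helper, and the simplified `dict_subset` stand-in predicate are all made up for illustration.

```python
def dict_subset(superset, o):
    """Stand-in predicate: walks nested dicts only, compares other values with ==."""
    if isinstance(o, dict):
        return isinstance(superset, dict) and all(
            key in superset and dict_subset(superset[key], value)
            for key, value in o.items()
        )
    return superset == o


def matching_configs(configs, query, predicate=dict_subset):
    """Return the sorted names of all configs that contain `query` as a subset."""
    return sorted(name for name, cfg in configs.items() if predicate(cfg, query))


# hypothetical, already-loaded configs keyed by file name
configs = {
    "run_a.yaml": {"param_a": 1, "param_b": {"b1": "a", "b2": [1, 2, 3]}},
    "run_b.yaml": {"param_a": 2, "param_b": {"b1": "a", "b2": [1, 2, 3]}},
    "run_c.yaml": {"param_a": 1, "param_b": {"b2": [1, 2, 3]}},
}
query = {"param_a": 1, "param_b": {"b2": [1, 2, 3]}}

print(matching_configs(configs, query))  # ['run_a.yaml', 'run_c.yaml']
```

Passing the predicate as a parameter keeps the scanning logic independent of whichever subset rules you settle on for lists.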

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow