How to tag nodes implicitly in YAML (PyYAML)

Consider this YAML file:

!my-type
name: My type
items:
  - name: First item
    number: 42
  - name: Second item
    number: 43

There is one top-level object that contains a collection of dictionaries, and I can load it fine with PyYAML. Now I want to use a proper class instead of these item dictionaries:

!my-type
name: My type
items:
  - !my-type-item
    name: First item
    number: 42
  - !my-type-item
    name: Second item
    number: 43

But this syntax is cumbersome and redundant, since all items in this collection are of the same type. And it gets very ugly when there are hundreds of these items. Is it possible to tag these items implicitly?

I considered using yaml.add_path_resolver, but this API does not seem to be public or stable.



Solution 1:[1]

The YAML spec says

Resolving the tag of a node must only depend on the following three parameters: (1) the non-specific tag of the node, (2) the path leading from the root to the node and (3) the content (and hence the kind) of the node.

which means that tagging nodes implicitly based on their path is in accordance with the spec. I guess this is what add_path_resolver tries to implement.
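As a rough illustration of path-dependent tagging that does not rely on add_path_resolver, the sketch below composes the node graph first and then sets the tag on every entry of the root `items` sequence before construction. This is not the approach used in the rest of this answer. The names `PathTaggingLoader`, `construct_item` and `construct_my_type` are made up for this example, and overriding `get_single_node` is an assumption about a convenient hook; it only covers single-document loads via `yaml.load`.

```python
from dataclasses import dataclass

import yaml


@dataclass
class MyTypeItem:
    name: str
    number: int


@dataclass
class MyType:
    name: str
    items: list


class PathTaggingLoader(yaml.SafeLoader):
    """Hypothetical loader that tags the entries of the root `items` list."""

    def get_single_node(self):
        root = super().get_single_node()
        if isinstance(root, yaml.MappingNode):
            for key, value in root.value:
                is_items = isinstance(key, yaml.ScalarNode) and key.value == "items"
                if is_items and isinstance(value, yaml.SequenceNode):
                    for child in value.value:
                        if isinstance(child, yaml.MappingNode):
                            # Same effect as writing `- !my-type-item` in the file.
                            child.tag = "!my-type-item"
        return root


def construct_item(loader, node):
    # construct_mapping applies PyYAML's usual scalar resolution ("42" -> 42).
    return MyTypeItem(**loader.construct_mapping(node, deep=True))


def construct_my_type(loader, node):
    # deep=True builds the nested items before MyType is instantiated.
    return MyType(**loader.construct_mapping(node, deep=True))


# Registered on the subclass only, so yaml.SafeLoader itself stays untouched.
PathTaggingLoader.add_constructor("!my-type-item", construct_item)
PathTaggingLoader.add_constructor("!my-type", construct_my_type)

document = """
!my-type
name: My type
items:
  - name: First item
    number: 42
  - name: Second item
    number: 43
"""

result = yaml.load(document, Loader=PathTaggingLoader)
print(result)
```

The path is hard-coded here (root mapping, key `items`, any sequence entry); a more general version would need its own way of describing paths.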

The problem here is that Python does not have classes with declared, typed fields. Languages that have those can inspect them and load data with the proper type implicitly (done by SnakeYAML, go-yaml et al.). With PyYAML, to do this you'll need to implement a custom constructor, e.g.:

import yaml

def get_value(node, name):
    assert isinstance(node, yaml.MappingNode)
    for key, value in node.value:
        assert isinstance(key, yaml.ScalarNode)
        if key.value == name:
            return value

class MyTypeItem:
    def __init__(self, name, number):
        self.name, self.number = name, number

    @classmethod
    def from_yaml(cls, loader, node):
        name = get_value(node, "name")
        assert isinstance(name, yaml.ScalarNode)

        number = get_value(node, "number")
        assert isinstance(number, yaml.ScalarNode)

        return cls(name.value, int(number.value))

    def __repr__(self):
        return f"MyTypeItem(name={self.name}, number={self.number})"

class MyType(yaml.YAMLObject):
    yaml_tag = "!my-type"

    def __init__(self, name, items):
        self.name, self.items = name, items

    @classmethod
    def from_yaml(cls, loader, node):
        name = get_value(node, "name")
        assert isinstance(name, yaml.ScalarNode)

        items = get_value(node, "items")
        assert isinstance(items, yaml.SequenceNode)

        return cls(name.value,
                [MyTypeItem.from_yaml(loader, n) for n in items.value])

    def __repr__(self):
        return f"MyType(name={self.name}, items={self.items})"

# Named `document` rather than `input` to avoid shadowing the builtin.
document = """
!my-type
name: My type
items:
  - name: First item
    number: 42
  - name: Second item
    number: 43
"""

print(yaml.load(document, Loader=yaml.FullLoader))

This gives you:

MyType(name=My type, items=[MyTypeItem(name=First item, number=42), MyTypeItem(name=Second item, number=43)])

Only the top-level class derives from yaml.YAMLObject and has a yaml_tag, so that PyYAML can use it automatically for the root node. MyTypeItem.from_yaml is called explicitly from MyType and thus doesn't need to be registered with PyYAML (you can register it as well if you also want to load files that contain such an item directly).

You need to do conversions to non-string values manually (as shown with int(number.value)) since .value of any scalar node is always a string.
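To see the difference concretely, the minimal sketch below compares the raw `.value` of each scalar node with what PyYAML's own constructor produces for the same mapping node. Driving a `yaml.SafeLoader` by hand like this (`get_single_node`, then `construct_mapping`) is only done here for inspection; the variable names are illustrative.

```python
import yaml

loader = yaml.SafeLoader("name: First item\nnumber: 42\n")
try:
    node = loader.get_single_node()
    # Raw node content: every scalar value is still a string.
    raw = {key.value: value.value for key, value in node.value}
    # Constructed content: implicit tags such as int have been applied.
    converted = loader.construct_mapping(node, deep=True)
finally:
    loader.dispose()

print(raw)        # {'name': 'First item', 'number': '42'}
print(converted)  # {'name': 'First item', 'number': 42}
```

Inside a from_yaml method, calling loader.construct_mapping(node, deep=True) in the same way should yield already-converted values, at the cost of getting a plain dict rather than individual nodes.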

Solution 2:[2]

To make it easier on yourself, I would suggest using dataclasses along with the dataclass-wizard library for a high-level approach.

Here's an approach using YAMLWizard and the PyYAML library for parsing YAML to a nested dataclass structure:

from __future__ import annotations

from dataclasses import dataclass
from dataclass_wizard import YAMLWizard


@dataclass
class MyContainer(YAMLWizard):
    name: str
    items: list[MyItem]


@dataclass
class MyItem:
    name: str
    number: int


if __name__ == '__main__':
    yaml = """
    name: My type
    items:
      - name: First item
        number: 42
      - name: Second item
        number: 43
    """

    c = MyContainer.from_yaml(yaml)
    print(c)

Output:

MyContainer(name='My type', items=[MyItem(name='First item', number=42), MyItem(name='Second item', number=43)])

Note: This requires the yaml extra, which then brings in the PyYAML dependency:

$ pip install dataclass-wizard[yaml]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1   flyx
Solution 2   rv.kvetch