Hydra - Override parameters when instantiating function

I am working on a data pipeline that follows a structure like this:

-- src/
---- etl.py
---- scripts/
------ moduleA.py
------ moduleB.py

I want to parametrise the scripts with Hydra. I have already done it for moduleA, which can be run independently:

import os

import hydra
from omegaconf import DictConfig

@hydra.main(config_path=os.getcwd(), config_name="config")
def main(cfg: DictConfig) -> None:

    # Parse args
    input_file: str = cfg.params.input_file

    do_stuff(input_file)

I would like to use the same approach for moduleB and the other scripts, and to be able to call each of their main() functions from etl.py, which will act as the orchestrator.

TL;DR: Is it possible to parametrise a function that reads from a config file without giving up Hydra? I would like etl.py to look something like this:

import os

import hydra
from omegaconf import DictConfig

from scripts.moduleA import main as process_moduleA

@hydra.main(config_path=os.getcwd(), config_name="config")
def main(cfg: DictConfig) -> None:

    # Parse args
    input_file: str = cfg.params.input_file

    process_moduleA(input_file)

Many thanks in advance!!



Solution 1:[1]

The typical pattern is to use an if __name__ == "__main__" guard in your moduleA.py and moduleB.py files:

# moduleA.py
import os

import hydra
from omegaconf import DictConfig

@hydra.main(config_path=os.getcwd(), config_name="config")
def main(cfg: DictConfig) -> None:
    ...

# Runs only when moduleA.py is executed directly, not when it is imported
if __name__ == "__main__":
    main()

This way, when you import moduleA from your etl.py script, main() is not executed at import time, so the Hydra machinery in moduleA.py is not triggered by the import itself.
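To see what the guard does, here is a minimal sketch that uses only the standard library. It is an illustration, not the real moduleA.py: the Hydra decorator is left out on purpose, and the `calls` list, the `load_module` helper and the file names are hypothetical stand-ins. It writes a small module with a guard to a temporary directory, imports it, and checks that main() only runs when called explicitly.

```python
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical stand-in for moduleA.py. The @hydra.main decorator is omitted
# so that the sketch runs without Hydra installed.
MODULE_SOURCE = '''
calls = []  # records every invocation of main()

def main(input_file="default.csv"):
    calls.append(input_file)

if __name__ == "__main__":
    main()
'''


def load_module(name: str, path: Path):
    """Import a module from an explicit file path (hypothetical helper)."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "moduleA.py"
    module_path.write_text(MODULE_SOURCE)
    # On import, __name__ is "moduleA", so the guarded main() call is skipped.
    module_a = load_module("moduleA", module_path)

calls_after_import = list(module_a.calls)

# The orchestrator decides when main() runs and with which argument.
module_a.main("data.csv")
calls_after_call = list(module_a.calls)

print(calls_after_import, calls_after_call)
```

The same reasoning applies to the decorated function: the import alone does nothing, and anything that happens afterwards is the result of an explicit call from etl.py.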

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jasha