Making all functions available at `package/__init__.py`, how can I prevent module-internal code from importing from the top-level `__init__.py`?
I'm writing a data manipulation package based on pandas. For the parts that have a functional style, I would like to make my package hierarchy flatter. Currently, functions need to be imported using calls such as:

```python
from package.module.submodule import my_function
```
The proposed change would make it possible to import

```python
from package import my_function
```
To achieve this, functions and other objects would be imported into `package/__init__.py` so that they are available in the top-level namespace. This is how pandas does it; for example, `pandas/__init__.py` makes it possible to write

```python
from pandas import DataFrame
```

even though the `DataFrame` class is actually defined inside `pandas.core.frame`. You would normally have to import it like this: `from pandas.core.frame import DataFrame`, but since it is imported in the top-level `__init__.py`, it is made available at the top level.
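A minimal sketch of what such a re-exporting `__init__.py` could look like, using the hypothetical `package`/`my_function` names from above:

```python
# package/__init__.py
# Re-export selected objects so users can write `from package import my_function`.
from package.module.submodule import my_function

# __all__ documents the public API (and restricts star-imports).
__all__ = ["my_function"]
```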
Making functions available as top-level imports:

- would expose a flat hierarchy to users and make the package easier to use,
- but internally (in the package code) we should not import from `package/__init__.py` directly, to avoid creating circular references.

Searching the pandas codebase for `from pandas import`, it seems that pandas always avoids importing from the top level (except in test scripts, which do use `from pandas import DataFrame`). I don't know how to enforce this.

- Maybe this tool can be helpful: pylint-forbidden-imports,
- or rather flake8-tidy-imports, since we are already using black and flake8 as a pre-commit hook. flake8-tidy-imports makes it possible to define which imports are forbidden. It seems the rules apply to the whole package, though, and not to a specific location in the package.
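For reference, a sketch of what such a ban could look like with the `banned-modules` option of flake8-tidy-imports (the message text is my own; as noted, this applies package-wide rather than to one location):

```ini
[flake8]
# flake8-tidy-imports: forbid importing from the top-level package
banned-modules =
    package = Import from the defining submodule, not from package/__init__.py
```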
Related questions
- Best practices for top level __init__.py imports
- Can someone explain `__all__` in Python?
  - the accepted answer mentions: "I personally write an `__all__` early in my development lifecycle for modules so that others who might use my code know what they should use and not use."
Solution 1:[1]
I think the concern you are expressing comes from the fact that "importing a sub-module" and "importing a sub-module during the import of a package" are not the same thing. For example, writing this in IPython:

```python
from module.sub.file import func
```

and writing the same line from within the `module` package

```python
from module.sub.file import func
```

do not do the same thing (even though they look identical). This is because if `module` has already started its initialization, then subsequent imports of its sub-modules will not re-initialize `module`, nor does `module` need to have finished initializing before its sub-modules are imported. This is very similar to how class inheritance works.

This means that it is perfectly valid for a package to pull various functions from all of its sub-modules, while each of its sub-modules can explicitly import from the others through the package itself, without causing an infinite loop. This is by design. For example:
```python
# module/__init__.py
from .sub1.file1 import func1
from .sub2.file2 import func2
```

```python
# module/sub1/file1.py
from module.sub2.file2 import func2

def func1(x):
    return func2(x) + x
```

```python
# module/sub2/file2.py
def func2(x):
    return x + 1
```

(`module/sub1/__init__.py` and `module/sub2/__init__.py` are empty.)
Here the sub-module `sub1` depends on `sub2`. The line `from module.sub2.file2 import func2` normally means:

1. execute `module/__init__.py` and load `sub2` from its namespace
2. execute `module/sub2/__init__.py` and load `file2` from its namespace
3. execute `module/sub2/file2.py` and load `func2` from its namespace

but during a call of `from module import func1`, when we reach the line `from module.sub2.file2 import func2` in `file1.py`, we have either already run, or are in the middle of running, `module/__init__.py` and `module/sub1/__init__.py`. This means the line effectively does:

1. `module/__init__.py` is currently being executed... skip
2. `module/sub2/__init__.py` was already loaded... skip
3. execute `module/sub2/file2.py` and load `func2` from its namespace
In general, if `module/__init__.py` is currently being executed, then additional calls to that statement will simply be skipped. You can quite literally `import module` itself and it will be skipped outright even though it hasn't finished loading itself. Add some print statements:
```python
# module/__init__.py
print('start init module')
from .sub1.file1 import func1
from .sub2.file2 import func2
print('end init module')
```

```python
# module/sub1/file1.py
print('loading module from file1.py')
import module
print('done loading module from file1.py')
from module.sub2.file2 import func2

def func1(x):
    return func2(x) + x
```

```python
# module/sub2/file2.py
def func2(x):
    return x + 1
```
Now run `from module import func1`:

```
start init module
loading module from file1.py
# Notice that nothing is printed here, meaning module/__init__.py was not run again,
# even though we explicitly wrote "import module". Additionally, "module" hadn't
# even finished executing its own __init__.py file.
done loading module from file1.py
end init module
```
This is awesome from a design perspective. It means that `sub2` could very well have been a package completely separate from `module` but dependent on it. Then at some point, someone said, "let's drop that independent package into our `module` package as a sub-module". The entire folder is just dropped in (without changing any code), and `module` can import from it like a local sub-package without any worry of accidentally creating an import loop, even though the sub-package depends on other parts of `module` itself.
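The mechanism behind this skipping is `sys.modules`: as soon as a module starts executing, a (partially initialized) module object is registered there, and any `import` statement that finds the name in `sys.modules` reuses that object instead of re-executing the file. A small self-contained demonstration (the `demo_pkg` name and temp-directory setup are mine, not from the answer above):

```python
import os
import sys
import tempfile

# Build a tiny package on disk whose __init__.py imports itself mid-initialization.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "demo_pkg")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write(
        "import sys\n"
        "early = 'demo_pkg' in sys.modules   # already registered, though incomplete\n"
        "import demo_pkg                     # skipped: found in sys.modules\n"
        "partial = hasattr(demo_pkg, 'done') # 'done' is not defined yet -> False\n"
        "done = True\n"
    )

sys.path.insert(0, root)
import demo_pkg

print(demo_pkg.early)    # True: registered in sys.modules before init finished
print(demo_pkg.partial)  # False: the self-import saw a partially initialized module
print(demo_pkg.done)     # True: initialization then completed normally
```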
Solution 2:[2]
Your problem is exactly stated in the documentation and solved by using intra-package referencing. You refer to the sub-modules using

```python
from ..frame import DataFrame
```

instead of using

```python
from pandas.core.frame import DataFrame
```
I can see it also worked for people here.
This type of referencing is commonly used; for example, the Baidu team uses it in their OCR engine to import all modules.

Stick to your idea of flat top-level imports, because asking users to spell out deep paths is harsh if they are beginners.
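To make the two styles concrete, a sketch using the hypothetical layout from the question (`package/module/submodule.py` defining `my_function`, imported from a sibling subpackage; the file name `helpers.py` is mine):

```python
# package/second_module/helpers.py  (hypothetical file)

# Absolute intra-package import: spells out the full path.
from package.module.submodule import my_function

# Equivalent relative import: '..' climbs from package.second_module up to package.
from ..module.submodule import my_function
```

Relative imports only work inside a package (the file must be run as part of `package`, not as a standalone script).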
Solution 3:[3]
I've searched through the pandas GitHub repository and was unable to find a pre-commit hook that addresses your specific problem, so I adapted the use-pd_array-in-core hook from the pandas repository.
My test setup has the following folder structure (excluding the .git folder):
```
├── package
│   ├── __init__.py
│   ├── module
│   │   └── submodule.py
│   └── second_module
│       └── submodule.py
├── .pre-commit-config.yaml
└── scripts
    └── import_from_submodules.py
```
The `.pre-commit-config.yaml` contains

```yaml
repos:
  - repo: local
    hooks:
      - id: import-from-submodules
        name: Import from appropriate submodules
        language: python
        entry: python scripts/import_from_submodules.py package
        files: ^package/
        types: [python]
```
and the `import_from_submodules.py` file contains

```python
"""
Check that all imports reference the correct submodule and do not import directly
from __init__.py, even though that is technically possible.

This is meant to be run as a pre-commit hook - to run it manually, you can do:
pre-commit run import-from-submodules --all-files
"""
from __future__ import annotations

import argparse
import ast
import sys
from typing import Sequence


class Visitor(ast.NodeVisitor):
    def __init__(self, package_name: str, path: str) -> None:
        self.package_name = package_name
        self.path = path
        self.error_message = (
            "{path}:{lineno}:{col_offset}: "
            f"Don't import from {self.package_name}, "
            f"import from {self.package_name}.submodule instead\n"
        )

    def visit_Import(self, node: ast.Import) -> None:
        if any(module.name == self.package_name for module in node.names):
            msg = self.error_message.format(
                path=self.path, lineno=node.lineno, col_offset=node.col_offset
            )
            sys.stdout.write(msg)
            sys.exit(1)
        super().generic_visit(node)

    def visit_ImportFrom(self, node: ast.ImportFrom) -> None:
        if node.module == self.package_name:
            msg = self.error_message.format(
                path=self.path, lineno=node.lineno, col_offset=node.col_offset
            )
            sys.stdout.write(msg)
            sys.exit(1)
        super().generic_visit(node)


def import_from_submodules(package_name: str, content: str, path: str) -> None:
    tree = ast.parse(content)
    visitor = Visitor(package_name, path)
    visitor.visit(tree)


def main(argv: Sequence[str] | None = None) -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("package_name")
    parser.add_argument("paths", nargs="*")
    args = parser.parse_args(argv)
    for path in args.paths:
        with open(path, encoding="utf-8") as fd:
            content = fd.read()
        import_from_submodules(args.package_name, content, path)


if __name__ == "__main__":
    main()
```
This uses the `ast` module to parse the Python source code of every Python file in the package directory and visits each `import <module>` and `from <module> import <function>` statement.
If the <module> part equals the package name (which is a command-line parameter of the script you can set in the pre-commit config), the position of the offending line is printed and the script exits with a nonzero exit code to indicate that there are errors.
Let's say there is a function `fun` inside `package/module/submodule.py`, which is also imported in `__init__.py` and included in `__all__`.

Inside `package/second_module/submodule.py` the following lines would raise an error if you run `pre-commit run import-from-submodules --all-files`:

```python
import package
from package import fun
from ..package import fun
```

whereas

```python
from package.module.submodule import fun
from ..module.submodule import fun
```

do not. Note that the relative import examples illustrate that all leading dots of relative imports are ignored when comparing the module name to the given package name.
I hope this covers your use case. You are of course welcome to change the error message to something more helpful/clear. The `ast` module is extremely powerful if you want to extend this code. For example, the use-pd_array-in-core pre-commit hook mentioned earlier also flags all `pd.array` expressions by checking every attribute access.
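As a sketch of that kind of extension (my own minimal adaptation, not the actual pandas hook): a `visit_Attribute` method can flag attribute accesses such as `pd.array` in the same way the import visitors above flag imports:

```python
import ast


class AttributeVisitor(ast.NodeVisitor):
    """Collect locations of `pd.array` attribute accesses in a source tree."""

    def __init__(self) -> None:
        self.offenses: list[tuple[int, int]] = []

    def visit_Attribute(self, node: ast.Attribute) -> None:
        # Match expressions of the exact form `pd.array`.
        if (
            isinstance(node.value, ast.Name)
            and node.value.id == "pd"
            and node.attr == "array"
        ):
            self.offenses.append((node.lineno, node.col_offset))
        self.generic_visit(node)


source = "import pandas as pd\nx = pd.array([1, 2])\ny = pd.Series([3])\n"
visitor = AttributeVisitor()
visitor.visit(ast.parse(source))
print(visitor.offenses)  # [(2, 4)]: only the pd.array access on line 2 is flagged
```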
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Bobby Ocean |
| Solution 2 | Esraa Abdelmaksoud |
| Solution 3 | BurningKarl |
