Parsing class and function dependencies from a project

I'm trying to run some analysis of class and function dependencies in a Python code base. My first step was to create a .csv file for import into Excel using Python's csv module and regular expressions.

The current version of what I have looks like this:

import re
import os
import csv
from os.path import join


class ClassParser(object):
   class_expr = re.compile(r'class (.+?)(?:\((.+?)\))?:')
   python_file_expr = re.compile(r'^\w+[.]py$')

   def findAllClasses(self, python_file):
      """ Read in a python file and return all the class names
      """
      with open(python_file) as infile:
         everything = infile.read()
         class_names = ClassParser.class_expr.findall(everything)
         return class_names

   def findAllPythonFiles(self, directory):
      """ Find all the python files starting from a top level directory
      """
      python_files = []
      for root, dirs, files in os.walk(directory):
         for file in files:
            if ClassParser.python_file_expr.match(file):
               python_files.append(join(root,file))
      return python_files

   def parse(self, directory, output_directory="classes.csv"):
      """ Parse the directory and spit out a csv file
      """
      with open(output_directory,'w') as csv_file:
         writer = csv.writer(csv_file)
         python_files = self.findAllPythonFiles(directory)
         for file in python_files:
            classes = self.findAllClasses(file)
            for classname in classes:
               writer.writerow([classname[0], classname[1], file])

if __name__=="__main__":
   parser = ClassParser()
   parser.parse("/path/to/my/project/main/directory")

This generates a .csv output in format:

class name, inherited classes (also comma separated), file
class name, inherited classes (also comma separated), file
... etc. ...

I'm at the point where I'd like to start parsing function declarations and definitions in addition to the class names. My question: is there a better way to get the class names, inherited class names, function names, parameter names, etc.?

NOTE: I've considered using the Python ast module, but I don't have experience with it and don't know how to use it to get the desired information or if it can even do that.

EDIT: In response to Martin Thurau's request for more information - The reason I'm trying to solve this issue is because I've inherited a rather lengthy (100k+ lines) project that has no discernible structure to its modules, classes and functions; they all exist as a collection of files in a single source directory.

Some of the source files contain dozens of tangentially related classes and are 10k+ lines long which makes them difficult to maintain. I'm starting to perform analysis for the relative difficulty of taking every class and packaging it into a more cohesive structure using The Hitchhiker's Guide to Packaging as a base. Part of what I care about for that analysis is how intertwined a class is with other classes in its file and what imported or inherited classes a particular class relies on.



Solution 1:[1]

I've made a start on implementing this. Put the following code in a file and run it, passing the name of a file or directory to analyse. It prints every class it finds, the file the class was found in, and the class's bases. It is not intelligent: if two Foo classes are defined in your code base, it will not tell you which one a given base refers to. But it is a start.

This code uses the Python ast module to parse .py files and find all the ClassDef nodes. It then uses the meta package to turn each base-class expression back into source text, so you will need to install that package.

$ pip install -e git+https://github.com/srossross/Meta.git#egg=meta
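
If installing meta is not an option, the base-class expressions can probably be rendered with the standard library alone. The following is a minimal sketch, assuming Python 3.9 or later (where ast.unparse is available); class_bases is a hypothetical helper name and is not part of the script further down.

```python
import ast


def class_bases(source):
    """Return (class name, [base expressions as source text]) pairs."""
    tree = ast.parse(source)
    return [
        # ast.unparse turns an expression node back into source text,
        # e.g. the Attribute node for `models.BooleanField`.
        (node.name, [ast.unparse(base) for base in node.bases])
        for node in ast.walk(tree)
        if isinstance(node, ast.ClassDef)
    ]


sample = "class FeaturedField(models.BooleanField):\n    pass\n"
print(class_bases(sample))
```

On older interpreters ast.unparse does not exist, which is where a third-party package such as meta comes in.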

Example output, run against django-featured-item:

$ python class-finder.py /path/to/django-featured-item/featureditem/
FeaturedField,../django-featured-item/featureditem/fields.py,models.BooleanField
SingleFeature,../django-featured-item/featureditem/tests.py,models.Model
MultipleFeature,../django-featured-item/featureditem/tests.py,models.Model
Author,../django-featured-item/featureditem/tests.py,models.Model
Book,../django-featured-item/featureditem/tests.py,models.Model
FeaturedField,../django-featured-item/featureditem/tests.py,TestCase

The code:

# class-finder.py
import ast
import csv
import meta
import os
import sys

def find_classes(node, in_file):
    if isinstance(node, ast.ClassDef):
        yield (node, in_file)

    if hasattr(node, 'body'):
        for child in node.body:
            # `yield from find_classes(child, in_file)` in Python 3.3+
            for x in find_classes(child, in_file): yield x


def print_classes(classes, out):
    writer = csv.writer(out)
    for cls, in_file in classes:
        writer.writerow([cls.name, in_file] +
            [meta.asttools.dump_python_source(base).strip()
                for base in cls.bases])


def process_file(file_path):
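    # Read the file inside a `with` block so the handle is closed promptly.
    with open(file_path, 'r') as source_file:
        root = ast.parse(source_file.read(), file_path)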
    for cls in find_classes(root, file_path):
        yield cls


def process_directory(dir_path):
    for entry in os.listdir(dir_path):
        for cls in process_file_or_directory(os.path.join(dir_path, entry)):
            yield cls


def process_file_or_directory(file_or_directory):
    if os.path.isdir(file_or_directory):
        return process_directory(file_or_directory)
    elif file_or_directory.endswith('.py'):
        return process_file(file_or_directory)
    else:
        return []

if __name__ == '__main__':
    classes = process_file_or_directory(sys.argv[1])
    print_classes(classes, sys.stdout)
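
The question also asks about function names and parameter names, which the script above does not cover. Here is a rough sketch of how the same ast approach could be extended, assuming Python 3.8 or later; find_functions and the dotted naming scheme are illustrative choices of mine, not part of the script above.

```python
import ast


def find_functions(source):
    """Yield (dotted name, [parameter names]) for every def in source."""

    def visit(node, prefix):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = child.args
                params = [a.arg for a in args.posonlyargs + args.args]
                if args.vararg:
                    params.append('*' + args.vararg.arg)
                params += [a.arg for a in args.kwonlyargs]
                if args.kwarg:
                    params.append('**' + args.kwarg.arg)
                yield prefix + child.name, params
                # Descend so that nested functions are reported too.
                yield from visit(child, prefix + child.name + '.')
            elif isinstance(child, ast.ClassDef):
                # Methods are reported as ClassName.method_name.
                yield from visit(child, prefix + child.name + '.')
            else:
                # Defs can also sit inside if/try/with blocks.
                yield from visit(child, prefix)

    yield from visit(ast.parse(source), '')


sample = '''
class ClassParser(object):
    def parse(self, directory, output_directory="classes.csv"):
        pass

def main(*args, **kwargs):
    pass
'''
for name, params in find_functions(sample):
    print(name, params)
```

Like the class finder, this only reports what is declared. It does not resolve which function a call site actually refers to, so it is a starting point for dependency analysis rather than a complete answer.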

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Laurel