databuild/plans/17-python-dsl-generator-fix.md
2025-08-06 17:49:15 -07:00

6.5 KiB

Plan 17: Python DSL Generator Fix

Problem Statement

The databuild_dsl_generator rule currently fails when trying to generate DSL code because it cannot properly provide dependencies to the underlying generator script. The issue manifests as import errors like "No module named 'databuild.test'" or "No module named 'betterproto2'".

Root Cause Analysis

The current implementation has two separate approaches that don't work well together:

  1. Direct py_binary approach (//databuild/dsl/python:generator) - Works but requires hard-coding dependencies
  2. DSL generator rule approach (databuild_dsl_generator) - Should allow dynamic deps but has runfiles propagation issues

The core issue is that when databuild_dsl_generator creates a wrapper script that calls the py_binary generator via subprocess, the dependencies specified in the rule's deps attribute aren't available in the Python environment of the subprocess.

Current State

  • databuild_dsl_generator rule exists in databuild/rules.bzl (lines 944-1125)
  • Generic generator.py exists in databuild/dsl/python/generator.py
  • py_binary generator exists in databuild/dsl/python/BUILD.bazel (lines 10-22)
  • Test case: //databuild/test/app/dsl:graph.generate

Solution Design: Custom py_binary per DSL

Instead of using a generic generator + wrapper script approach, create a custom py_binary for each databuild_dsl_generator target that includes the specific dependencies.

Architecture Changes

  1. Remove subprocess approach: Stop calling the generic py_binary via subprocess
  2. Create custom py_binary: Generate a py_binary rule dynamically with the specific deps
  3. Embed generator logic: Include the generator logic directly in the custom binary
  4. Proper runfiles: Ensure all dependencies flow through Bazel's runfiles mechanism

Implementation Plan

Phase 1: Restructure Generator Logic

  1. Extract core generator logic from databuild/dsl/python/generator.py into a library:

    # databuild/dsl/python/generator_lib.py
    def generate_dsl_package(module_path: str, graph_attr: str, output_dir: str):
        # Move the core logic here
    
  2. Create generator library target in databuild/dsl/python/BUILD.bazel:

    py_library(
        name = "generator_lib",
        srcs = ["generator_lib.py"],
        deps = [":dsl", "//databuild:py_proto"],
    )
    
  3. Update standalone generator to use the library:

    # databuild/dsl/python/generator.py
    from databuild.dsl.python.generator_lib import generate_dsl_package
    
    def main():
        generate_dsl_package(sys.argv[1], sys.argv[2], sys.argv[3])
    

Phase 2: Modify DSL Generator Rule

  1. Update _databuild_dsl_generator_impl in databuild/rules.bzl:

    • Remove the subprocess-based wrapper script approach
    • Create a custom py_binary that includes the generator library + user deps
    • Generate the custom binary's source code with the specific module/attr parameters
  2. New implementation structure:

    def _databuild_dsl_generator_impl(ctx):
        # Create custom generator script
        custom_generator = ctx.actions.declare_file(ctx.label.name + "_custom_generator.py")
    
        # Generate script content with embedded parameters
        script_content = generate_custom_generator_script(
            module_path=module_path,
            graph_attr=ctx.attr.graph_attr,
            package_path=package_path
        )
    
        ctx.actions.write(output=custom_generator, content=script_content)
    
        # Create runfiles with all dependencies
        runfiles = ctx.runfiles(files=[custom_generator, ctx.file.graph_file])
        for dep in ctx.attr.deps:
            runfiles = runfiles.merge(dep.default_runfiles)
        # Include generator_lib and py_proto dependencies
        runfiles = runfiles.merge(ctx.attr._generator_lib.default_runfiles)
        runfiles = runfiles.merge(ctx.attr._py_proto.default_runfiles)
    
        return [DefaultInfo(executable=custom_generator, runfiles=runfiles)]
    

Phase 3: Update Rule Attributes

  1. Add generator_lib dependency to rule attributes:

    "_generator_lib": attr.label(
        default = "@databuild//databuild/dsl/python:generator_lib",
    ),
    
  2. Remove unnecessary attributes like _generator, _betterproto2 that are no longer needed

Phase 4: Custom Generator Script Template

  1. Create template for custom generator scripts:
    def generate_custom_generator_script(module_path, graph_attr, package_path):
        return f"""#!/usr/bin/env python3
    import os
    import sys
    from databuild.dsl.python.generator_lib import generate_dsl_package
    
    def main():
        workspace_root = os.environ.get('BUILD_WORKSPACE_DIRECTORY')
        if workspace_root:
            output_dir = os.path.join(workspace_root, '{package_path}')
        else:
            output_dir = '.'
    
        try:
            generate_dsl_package('{module_path}', '{graph_attr}', output_dir)
        except Exception as e:
            print(f"ERROR: Generation failed: {{e}}", file=sys.stderr)
            sys.exit(1)
    
    if __name__ == "__main__":
        main()
    """
    

Testing Strategy

  1. Unit test the generator library separately from the rule
  2. Test with existing DSL (//databuild/test/app/dsl:graph.generate)
  3. Verify dependency isolation - ensure different DSL generators don't interfere
  4. Test pip dependency propagation - ensure betterproto2 and other external deps work

Success Criteria

  • bazel run //databuild/test/app/dsl:graph.generate works without import errors
  • Users can specify deps = [":their_dsl_src"] and have those deps available to generator
  • No hard-coded dependencies in the core generator components
  • Proper error messages when imports fail
  • Generated code works with existing DataBuild infrastructure

Risks and Mitigations

Risk: Bazel rule complexity increases Mitigation: Extract reusable functions, add comprehensive comments

Risk: Runfiles might still not propagate correctly Mitigation: Test with various dependency types early in implementation

Files to Modify

  • databuild/dsl/python/BUILD.bazel - Add generator_lib target
  • databuild/dsl/python/generator_lib.py - New file with extracted logic
  • databuild/dsl/python/generator.py - Simplify to use library
  • databuild/rules.bzl - Rewrite _databuild_dsl_generator_impl
  • databuild/BUILD.bazel - Ensure betterproto2 is properly included in py_proto