6.5 KiB
Plan 17: Python DSL Generator Fix
Problem Statement
The databuild_dsl_generator rule currently fails when trying to generate DSL code because it cannot properly provide dependencies to the underlying generator script. The issue manifests as import errors like "No module named 'databuild.test'" or "No module named 'betterproto2'".
Root Cause Analysis
The current implementation has two separate approaches that don't work well together:
- Direct py_binary approach (
//databuild/dsl/python:generator) - Works but requires hard-coding dependencies - DSL generator rule approach (
databuild_dsl_generator) - Should allow dynamic deps but has runfiles propagation issues
The core issue is that when databuild_dsl_generator creates a wrapper script that calls the py_binary generator via subprocess, the dependencies specified in the rule's deps attribute aren't available in the Python environment of the subprocess.
Current State
databuild_dsl_generatorrule exists indatabuild/rules.bzl(lines 944-1125)- Generic
generator.pyexists indatabuild/dsl/python/generator.py - py_binary generator exists in
databuild/dsl/python/BUILD.bazel(lines 10-22) - Test case:
//databuild/test/app/dsl:graph.generate
Solution Design: Custom py_binary per DSL
Instead of using a generic generator + wrapper script approach, create a custom py_binary for each databuild_dsl_generator target that includes the specific dependencies.
Architecture Changes
- Remove subprocess approach: Stop calling the generic py_binary via subprocess
- Create custom py_binary: Generate a py_binary rule dynamically with the specific deps
- Embed generator logic: Include the generator logic directly in the custom binary
- Proper runfiles: Ensure all dependencies flow through Bazel's runfiles mechanism
Implementation Plan
Phase 1: Restructure Generator Logic
-
Extract core generator logic from
databuild/dsl/python/generator.pyinto a library:# databuild/dsl/python/generator_lib.py def generate_dsl_package(module_path: str, graph_attr: str, output_dir: str): # Move the core logic here -
Create generator library target in
databuild/dsl/python/BUILD.bazel:py_library( name = "generator_lib", srcs = ["generator_lib.py"], deps = [":dsl", "//databuild:py_proto"], ) -
Update standalone generator to use the library:
# databuild/dsl/python/generator.py from databuild.dsl.python.generator_lib import generate_dsl_package def main(): generate_dsl_package(sys.argv[1], sys.argv[2], sys.argv[3])
Phase 2: Modify DSL Generator Rule
-
Update
_databuild_dsl_generator_implindatabuild/rules.bzl:- Remove the subprocess-based wrapper script approach
- Create a custom py_binary that includes the generator library + user deps
- Generate the custom binary's source code with the specific module/attr parameters
-
New implementation structure:
def _databuild_dsl_generator_impl(ctx): # Create custom generator script custom_generator = ctx.actions.declare_file(ctx.label.name + "_custom_generator.py") # Generate script content with embedded parameters script_content = generate_custom_generator_script( module_path=module_path, graph_attr=ctx.attr.graph_attr, package_path=package_path ) ctx.actions.write(output=custom_generator, content=script_content) # Create runfiles with all dependencies runfiles = ctx.runfiles(files=[custom_generator, ctx.file.graph_file]) for dep in ctx.attr.deps: runfiles = runfiles.merge(dep.default_runfiles) # Include generator_lib and py_proto dependencies runfiles = runfiles.merge(ctx.attr._generator_lib.default_runfiles) runfiles = runfiles.merge(ctx.attr._py_proto.default_runfiles) return [DefaultInfo(executable=custom_generator, runfiles=runfiles)]
Phase 3: Update Rule Attributes
-
Add generator_lib dependency to rule attributes:
"_generator_lib": attr.label( default = "@databuild//databuild/dsl/python:generator_lib", ), -
Remove unnecessary attributes like
_generator,_betterproto2that are no longer needed
Phase 4: Custom Generator Script Template
- Create template for custom generator scripts:
def generate_custom_generator_script(module_path, graph_attr, package_path): return f"""#!/usr/bin/env python3 import os import sys from databuild.dsl.python.generator_lib import generate_dsl_package def main(): workspace_root = os.environ.get('BUILD_WORKSPACE_DIRECTORY') if workspace_root: output_dir = os.path.join(workspace_root, '{package_path}') else: output_dir = '.' try: generate_dsl_package('{module_path}', '{graph_attr}', output_dir) except Exception as e: print(f"ERROR: Generation failed: {{e}}", file=sys.stderr) sys.exit(1) if __name__ == "__main__": main() """
Testing Strategy
- Unit test the generator library separately from the rule
- Test with existing DSL (
//databuild/test/app/dsl:graph.generate) - Verify dependency isolation - ensure different DSL generators don't interfere
- Test pip dependency propagation - ensure betterproto2 and other external deps work
Success Criteria
bazel run //databuild/test/app/dsl:graph.generateworks without import errors- Users can specify
deps = [":their_dsl_src"]and have those deps available to generator - No hard-coded dependencies in the core generator components
- Proper error messages when imports fail
- Generated code works with existing DataBuild infrastructure
Risks and Mitigations
Risk: Bazel rule complexity increases Mitigation: Extract reusable functions, add comprehensive comments
Risk: Runfiles might still not propagate correctly Mitigation: Test with various dependency types early in implementation
Files to Modify
databuild/dsl/python/BUILD.bazel- Add generator_lib targetdatabuild/dsl/python/generator_lib.py- New file with extracted logicdatabuild/dsl/python/generator.py- Simplify to use librarydatabuild/rules.bzl- Rewrite_databuild_dsl_generator_impldatabuild/BUILD.bazel- Ensure betterproto2 is properly included in py_proto