databuild/plans/17-python-dsl-generator-fix.md
2025-08-06 17:49:15 -07:00

164 lines
6.5 KiB
Markdown

# Plan 17: Python DSL Generator Fix
## Problem Statement
The `databuild_dsl_generator` rule currently fails when trying to generate DSL code because it cannot properly provide dependencies to the underlying generator script. The issue manifests as import errors like "No module named 'databuild.test'" or "No module named 'betterproto2'".
## Root Cause Analysis
The current implementation has two separate approaches that don't work well together:
1. **Direct py_binary approach** (`//databuild/dsl/python:generator`) - Works but requires hard-coding dependencies
2. **DSL generator rule approach** (`databuild_dsl_generator`) - Should allow dynamic deps but has runfiles propagation issues
The core issue is that when `databuild_dsl_generator` creates a wrapper script that calls the py_binary generator via subprocess, the dependencies specified in the rule's `deps` attribute aren't available in the Python environment of the subprocess.
## Current State
- `databuild_dsl_generator` rule exists in `databuild/rules.bzl` (lines 944-1125)
- Generic `generator.py` exists in `databuild/dsl/python/generator.py`
- py_binary generator exists in `databuild/dsl/python/BUILD.bazel` (lines 10-22)
- Test case: `//databuild/test/app/dsl:graph.generate`
## Solution Design: Custom py_binary per DSL
Instead of using a generic generator + wrapper script approach, create a custom py_binary for each `databuild_dsl_generator` target that includes the specific dependencies.
### Architecture Changes
1. **Remove subprocess approach**: Stop calling the generic py_binary via subprocess
2. **Create custom py_binary**: Generate a py_binary rule dynamically with the specific deps
3. **Embed generator logic**: Include the generator logic directly in the custom binary
4. **Proper runfiles**: Ensure all dependencies flow through Bazel's runfiles mechanism
## Implementation Plan
### Phase 1: Restructure Generator Logic
1. **Extract core generator logic** from `databuild/dsl/python/generator.py` into a library:
```python
# databuild/dsl/python/generator_lib.py
def generate_dsl_package(module_path: str, graph_attr: str, output_dir: str):
# Move the core logic here
```
2. **Create generator library target** in `databuild/dsl/python/BUILD.bazel`:
```python
py_library(
name = "generator_lib",
srcs = ["generator_lib.py"],
deps = [":dsl", "//databuild:py_proto"],
)
```
3. **Update standalone generator** to use the library:
```python
# databuild/dsl/python/generator.py
from databuild.dsl.python.generator_lib import generate_dsl_package
def main():
generate_dsl_package(sys.argv[1], sys.argv[2], sys.argv[3])
```
### Phase 2: Modify DSL Generator Rule
4. **Update `_databuild_dsl_generator_impl`** in `databuild/rules.bzl`:
- Remove the subprocess-based wrapper script approach
- Create a custom py_binary that includes the generator library + user deps
- Generate the custom binary's source code with the specific module/attr parameters
5. **New implementation structure**:
```python
def _databuild_dsl_generator_impl(ctx):
# Create custom generator script
custom_generator = ctx.actions.declare_file(ctx.label.name + "_custom_generator.py")
# Generate script content with embedded parameters
script_content = generate_custom_generator_script(
module_path=module_path,
graph_attr=ctx.attr.graph_attr,
package_path=package_path
)
ctx.actions.write(output=custom_generator, content=script_content)
# Create runfiles with all dependencies
runfiles = ctx.runfiles(files=[custom_generator, ctx.file.graph_file])
for dep in ctx.attr.deps:
runfiles = runfiles.merge(dep.default_runfiles)
# Include generator_lib and py_proto dependencies
runfiles = runfiles.merge(ctx.attr._generator_lib.default_runfiles)
runfiles = runfiles.merge(ctx.attr._py_proto.default_runfiles)
return [DefaultInfo(executable=custom_generator, runfiles=runfiles)]
```
### Phase 3: Update Rule Attributes
6. **Add generator_lib dependency** to rule attributes:
```python
"_generator_lib": attr.label(
default = "@databuild//databuild/dsl/python:generator_lib",
),
```
7. **Remove unnecessary attributes** like `_generator`, `_betterproto2` that are no longer needed
### Phase 4: Custom Generator Script Template
8. **Create template for custom generator scripts**:
```python
def generate_custom_generator_script(module_path, graph_attr, package_path):
return f"""#!/usr/bin/env python3
import os
import sys
from databuild.dsl.python.generator_lib import generate_dsl_package
def main():
workspace_root = os.environ.get('BUILD_WORKSPACE_DIRECTORY')
if workspace_root:
output_dir = os.path.join(workspace_root, '{package_path}')
else:
output_dir = '.'
try:
generate_dsl_package('{module_path}', '{graph_attr}', output_dir)
except Exception as e:
print(f"ERROR: Generation failed: {{e}}", file=sys.stderr)
sys.exit(1)
if __name__ == "__main__":
main()
"""
```
## Testing Strategy
1. **Unit test the generator library** separately from the rule
2. **Test with existing DSL** (`//databuild/test/app/dsl:graph.generate`)
3. **Verify dependency isolation** - ensure different DSL generators don't interfere
4. **Test pip dependency propagation** - ensure betterproto2 and other external deps work
## Success Criteria
- [ ] `bazel run //databuild/test/app/dsl:graph.generate` works without import errors
- [ ] Users can specify `deps = [":their_dsl_src"]` and have those deps available to generator
- [ ] No hard-coded dependencies in the core generator components
- [ ] Proper error messages when imports fail
- [ ] Generated code works with existing DataBuild infrastructure
## Risks and Mitigations
**Risk**: Bazel rule complexity increases
**Mitigation**: Extract reusable functions, add comprehensive comments
**Risk**: Runfiles might still not propagate correctly
**Mitigation**: Test with various dependency types early in implementation
## Files to Modify
- `databuild/dsl/python/BUILD.bazel` - Add generator_lib target
- `databuild/dsl/python/generator_lib.py` - New file with extracted logic
- `databuild/dsl/python/generator.py` - Simplify to use library
- `databuild/rules.bzl` - Rewrite `_databuild_dsl_generator_impl`
- `databuild/BUILD.bazel` - Ensure betterproto2 is properly included in py_proto