diff --git a/plans/15-dsl-graph-generation.md b/plans/15-dsl-graph-generation.md
index f563856..4dd5a69 100644
--- a/plans/15-dsl-graph-generation.md
+++ b/plans/15-dsl-graph-generation.md
@@ -141,7 +141,181 @@ bazel run @test_graph//:job_lookup -- \
 
 ---
 
-### Phase 3: Graph Integration
+### Phase 3: Two-Phase Code Generation
+**Goal**: Implement two-phase code generation that works within Bazel's constraints
+
+#### Key Learning
+Previous attempts failed because of fundamental Bazel constraints:
+- **Loading vs Execution phases**: `load()` statements are evaluated before any genrule executes
+- **Dynamic target generation**: Bazel requires the complete build graph before execution begins
+- **Hermeticity**: Generated BUILD files must be in the source tree, not bazel-bin
+
+The solution: **Two-phase generation** following established patterns from protobuf, thrift, and other code generators.
+
+#### Two-Phase Workflow
+
+**Phase 1: Code Generation** (run by developer)
+```bash
+bazel run //databuild/test/app/dsl:graph.generate
+# Generates BUILD.bazel and Python binaries into the source tree
+```
+
+**Phase 2: Building** (normal Bazel workflow)
+```bash
+bazel build //databuild/test/app/dsl:graph.analyze
+bazel run //databuild/test/app/dsl:graph.service -- --port 8080
+```
+
+#### Implementation Tasks
+
+1. **Create `databuild_dsl_generator` rule**:
+   ```python
+   databuild_dsl_generator(
+       name = "graph.generate",
+       graph_file = "graph.py",
+       output_package = "//databuild/test/app/dsl",
+       deps = [":dsl_src"],
+   )
+   ```
+
+2. **Implement generator that writes to source tree**:
+   ```python
+   def _databuild_dsl_generator_impl(ctx):
+       script = ctx.actions.declare_file(ctx.label.name + "_generator.py")
+
+       # Create a script that:
+       # 1. Loads the DSL graph
+       # 2. Generates BUILD.bazel and binaries
+       # 3. Writes them to the source tree
+       script_content = """
+import os
+import sys
+workspace_root = os.environ.get('BUILD_WORKSPACE_DIRECTORY')
+sys.path.insert(0, workspace_root)  # Add workspace root to path
+output_dir = os.path.join(workspace_root, '{package_path}')
+
+# Load and generate
+from {module_path} import {graph_attr}
+{graph_attr}.generate_bazel_package('{name}', output_dir)
+print(f'Generated BUILD.bazel and binaries in {{output_dir}}')
+       """.format(
+           package_path = ctx.attr.output_package.strip("//").replace(":", "/"),
+           module_path = ctx.file.graph_file.path.replace("/", ".").replace(".py", ""),
+           graph_attr = ctx.attr.graph_attr,
+           name = ctx.attr.name.replace(".generate", ""),
+       )
+
+       ctx.actions.write(
+           output = script,
+           content = script_content,
+           is_executable = True,
+       )
+
+       return [DefaultInfo(executable = script)]
+   ```
+
+3. **Update `DataBuildGraph.generate_bazel_package()` to target source tree**:
+   ```python
+   def generate_bazel_package(self, name: str, output_dir: str) -> None:
+       """Generate BUILD.bazel and binaries into the source directory."""
+       # Generate BUILD.bazel with real databuild targets
+       self._generate_build_bazel(output_dir, name)
+
+       # Generate job binaries
+       self._generate_job_binaries(output_dir)
+
+       # Generate job lookup
+       self._generate_job_lookup(output_dir)
+
+       print(f"Generated package in {output_dir}")
+       print(f"Run 'bazel build :{name}.analyze' to use")
+   ```
+
+4. **Create standard BUILD.bazel template**:
+   ```python
+   def _generate_build_bazel(self, output_dir: str, name: str):
+       # Generate proper databuild_job and databuild_graph targets
+       # that will work exactly like hand-written ones
+       build_content = self._build_template.format(
+           jobs = self._format_jobs(),
+           graph_name = f"{name}_graph",
+           job_targets = self._format_job_targets(),
+       )
+
+       with open(os.path.join(output_dir, "BUILD.bazel"), "w") as f:
+           f.write(build_content)
+   ```
+
+#### Interface Design
+
+**For DSL Authors**:
+```python
+# In graph.py
+graph = DataBuildGraph("my_graph")
+
+@graph.job
+class MyJob(DataBuildJob):
+    # ... job definition
+```
+
+**For Users**:
+```bash
+# Generate code (phase 1)
+bazel run //my/app:graph.generate
+
+# Use generated code (phase 2)
+bazel build //my/app:graph.analyze
+bazel run //my/app:graph.service
+```
+
+**In BUILD.bazel**:
+```python
+databuild_dsl_generator(
+    name = "graph.generate",
+    graph_file = "graph.py",
+    output_package = "//my/app",
+    deps = [":my_deps"],
+)
+
+# After generation, this file will contain:
+# databuild_graph(name = "graph_graph", ...)
+# databuild_job(name = "my_job", ...)
+# py_binary(name = "my_job_binary", ...)
+```
+
+#### Benefits of This Approach
+
+✅ **Works within Bazel constraints** - No dynamic target generation
+✅ **Follows established patterns** - Same as protobuf, thrift, OpenAPI generators
+✅ **Inspectable output** - Users can see the generated BUILD.bazel
+✅ **Version controllable** - Generated files can be checked in if desired
+✅ **Incremental builds** - Standard Bazel caching applies to the generated targets
+✅ **Clean separation** - Generation and building are separate phases
+
+#### Tests & Verification
+```bash
+# Test: Code generation
+bazel run //databuild/test/app/dsl:graph.generate
+# Should create BUILD.bazel and Python files in the source tree
+
+# Test: Generated targets work
+bazel build //databuild/test/app/dsl:graph_graph.analyze
+# Should build successfully using the generated BUILD.bazel
+
+# Test: End-to-end functionality
+bazel run //databuild/test/app/dsl:graph_graph.analyze -- "color_vote_report/2024-01-01/red"
+# Should work exactly like the hand-written graph
+```
+
+#### Success Criteria
+- Generator creates a valid BUILD.bazel in the source tree
+- Generated targets are indistinguishable from hand-written ones
+- Full DataBuild functionality works through generated code
+- Clean developer workflow with clear phase separation
+
+---
+
+### Phase 4: Graph Integration
 **Goal**: Generate complete databuild graph targets with all operational variants
 
 #### Deliverables