Update plan

Stuart Axelbrooke 2025-08-03 02:04:38 -07:00
parent cdc47bddfe
commit 492c30c0bc


@@ -141,7 +141,181 @@ bazel run @test_graph//:job_lookup -- \
---
### Phase 3: Graph Integration
### Phase 3: Two-Phase Code Generation
**Goal**: Implement proper two-phase code generation that works within Bazel's constraints
#### Key Learning
Previous attempts failed due to fundamental Bazel constraints:
- **Loading vs Execution phases**: `load()` statements run before genrules execute
- **Dynamic target generation**: Bazel requires the complete build graph before execution begins
- **Hermeticity**: Generated BUILD files must live in the source tree, not in `bazel-bin`
The solution: **Two-phase generation** following established patterns from protobuf, thrift, and other code generators.
#### Two-Phase Workflow
**Phase 1: Code Generation** (run by developer)
```bash
bazel run //databuild/test/app/dsl:graph.generate
# Generates BUILD.bazel and Python binaries into source tree
```
**Phase 2: Building** (normal Bazel workflow)
```bash
bazel build //databuild/test/app/dsl:graph.analyze
bazel run //databuild/test/app/dsl:graph.service -- --port 8080
```
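Phase 1 can write into the source tree because `bazel run` exports `BUILD_WORKSPACE_DIRECTORY`, which points at the workspace root; the variable is not set for actions executed by `bazel build`. The following is a minimal sketch of how a generator script might resolve its output directory and fail with a clear message otherwise. The helper name `resolve_output_dir` is hypothetical and not part of DataBuild.

```python
import os


def resolve_output_dir(package_path, environ=None):
    """Return the source-tree directory for a workspace-relative package path.

    Hypothetical helper: raises RuntimeError when BUILD_WORKSPACE_DIRECTORY
    is missing, which is the case when the script is not started by `bazel run`.
    """
    env = os.environ if environ is None else environ
    workspace_root = env.get("BUILD_WORKSPACE_DIRECTORY")
    if not workspace_root:
        raise RuntimeError(
            "BUILD_WORKSPACE_DIRECTORY is not set; "
            "run this generator with 'bazel run'"
        )
    return os.path.join(workspace_root, package_path)
```

Passing the environment as a parameter keeps the helper easy to exercise in a unit test without touching the real process environment.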
#### Implementation Tasks
1. **Create `databuild_dsl_generator` rule**:
```python
databuild_dsl_generator(
    name = "graph.generate",
    graph_file = "graph.py",
    output_package = "//databuild/test/app/dsl",
    deps = [":dsl_src"],
)
```
2. **Implement generator that writes to source tree**:
```python
def _databuild_dsl_generator_impl(ctx):
    script = ctx.actions.declare_file(ctx.label.name + "_generator.py")

    # Create a script that:
    # 1. Loads the DSL graph
    # 2. Generates BUILD.bazel and binaries
    # 3. Writes them to the source tree
    # The shebang lets Bazel execute the written file directly.
    script_content = """#!/usr/bin/env python3
import os
import sys

# Add workspace root to path (BUILD_WORKSPACE_DIRECTORY is set by `bazel run`)
workspace_root = os.environ.get('BUILD_WORKSPACE_DIRECTORY')
sys.path.insert(0, workspace_root)
output_dir = os.path.join(workspace_root, '{package_path}')

# Load and generate
from {module_path} import {graph_attr}
{graph_attr}.generate_bazel_package('{name}', output_dir)
print(f'Generated BUILD.bazel and binaries in {{output_dir}}')
""".format(
        # "//my/app" -> "my/app"
        package_path = ctx.attr.output_package.lstrip("/").replace(":", "/"),
        # "my/app/graph.py" -> "my.app.graph"; drop the suffix first so that
        # directories starting with "py" are not mangled
        module_path = ctx.file.graph_file.path.removesuffix(".py").replace("/", "."),
        graph_attr = ctx.attr.graph_attr,
        name = ctx.attr.name.replace(".generate", ""),
    )

    ctx.actions.write(
        output = script,
        content = script_content,
        is_executable = True,
    )
    return [DefaultInfo(executable = script)]
```
3. **Update `DataBuildGraph.generate_bazel_package()` to target source tree**:
```python
def generate_bazel_package(self, name: str, output_dir: str) -> None:
    """Generate BUILD.bazel and binaries into source directory"""
    # Generate BUILD.bazel with real databuild targets
    self._generate_build_bazel(output_dir, name)
    # Generate job binaries
    self._generate_job_binaries(output_dir)
    # Generate job lookup
    self._generate_job_lookup(output_dir)
    print(f"Generated package in {output_dir}")
    # f-string so that {name} is interpolated
    print(f"Run 'bazel build :{name}.analyze' to use")
```
4. **Create standard BUILD.bazel template**:
```python
def _generate_build_bazel(self, output_dir: str, name: str):
    # Generate proper databuild_job and databuild_graph targets
    # that will work exactly like hand-written ones
    build_content = self._build_template.format(
        jobs = self._format_jobs(),
        graph_name = f"{name}_graph",
        job_targets = self._format_job_targets(),
    )
    with open(os.path.join(output_dir, "BUILD.bazel"), "w") as f:
        f.write(build_content)
```
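The template step can be sketched in isolation. This is a rough illustration only: the plan does not show the real `_build_template`, so the `BUILD_TEMPLATE` text, the `binary` and `jobs` attribute names, and the helpers `render_build_file` and `write_build_file` are all assumptions made for the example. A real template would also need the `load()` statements for the DataBuild rules.

```python
import os
import tempfile

# Hypothetical template; the real `_build_template` is not shown in the plan.
# Attribute names (`binary`, `jobs`) are assumptions for illustration.
BUILD_TEMPLATE = """\
# Generated file, do not edit by hand.
{jobs}

databuild_graph(
    name = "{graph_name}",
    jobs = [{job_targets}],
)
"""


def render_build_file(name, job_names):
    """Render BUILD.bazel content for a graph name and a list of job names."""
    jobs = "\n\n".join(
        'databuild_job(\n    name = "{0}",\n    binary = ":{0}_binary",\n)'.format(job)
        for job in job_names
    )
    job_targets = ", ".join('":{}"'.format(job) for job in job_names)
    return BUILD_TEMPLATE.format(
        jobs=jobs, graph_name=name + "_graph", job_targets=job_targets
    )


def write_build_file(output_dir, content):
    """Write the rendered content to output_dir/BUILD.bazel and return the path."""
    path = os.path.join(output_dir, "BUILD.bazel")
    with open(path, "w") as f:
        f.write(content)
    return path


if __name__ == "__main__":
    # Write into a temporary directory instead of the real source tree.
    with tempfile.TemporaryDirectory() as tmp:
        build_path = write_build_file(tmp, render_build_file("graph", ["my_job"]))
        with open(build_path) as f:
            print(f.read())
```

Because `str.format` treats braces as placeholders, any literal `{` or `}` in a real template (for example a Starlark dict) would have to be doubled.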
#### Interface Design
**For DSL Authors**:
```python
# In graph.py
graph = DataBuildGraph("my_graph")
@graph.job
class MyJob(DataBuildJob):
    # ... job definition
```
**For Users**:
```bash
# Generate code (phase 1)
bazel run //my/app:graph.generate
# Use generated code (phase 2)
bazel build //my/app:graph.analyze
bazel run //my/app:graph.service
```
**In BUILD.bazel**:
```python
databuild_dsl_generator(
    name = "graph.generate",
    graph_file = "graph.py",
    output_package = "//my/app",
    deps = [":my_deps"],
)
# After generation, this file will contain:
# databuild_graph(name = "graph_graph", ...)
# databuild_job(name = "my_job", ...)
# py_binary(name = "my_job_binary", ...)
```
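The `output_package` label has to be mapped to a directory below the workspace root, and `graph_file` to an importable module name. A sketch of those two conversions in plain Python follows. Both helper names are hypothetical, and dropping any `:target` part of the label is an assumption about the intended behaviour.

```python
def label_to_package_path(label):
    """'//my/app' or '//my/app:target' -> 'my/app' (hypothetical helper)."""
    if not label.startswith("//"):
        raise ValueError("expected an absolute label, got {!r}".format(label))
    # Assumption: only the package part matters; any ':target' is dropped.
    return label[2:].split(":", 1)[0]


def source_path_to_module(path):
    """'my/app/graph.py' -> 'my.app.graph' (hypothetical helper)."""
    if not path.endswith(".py"):
        raise ValueError("expected a .py file, got {!r}".format(path))
    # Remove the extension before swapping separators.
    return path[: -len(".py")].replace("/", ".")


if __name__ == "__main__":
    print(label_to_package_path("//my/app"))
    print(source_path_to_module("my/app/graph.py"))
```

Removing the extension first matters: chaining `.replace("/", ".").replace(".py", "")` would also delete `.py` from the middle of a path such as `my/pylib/graph.py`.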
#### Benefits of This Approach
- **Works within Bazel constraints** - No dynamic target generation
- **Follows established patterns** - Same as protobuf, thrift, OpenAPI generators
- **Inspectable output** - Users can see generated BUILD.bazel
- **Version controllable** - Generated files can be checked in if desired
- **Incremental builds** - Standard Bazel caching applies, because the generated targets are ordinary targets
- **Clean separation** - Generation and building are separate phases
#### Tests & Verification
```bash
# Test: Code generation
bazel run //databuild/test/app/dsl:graph.generate
# Should create BUILD.bazel and Python files in source tree
# Test: Generated targets work
bazel build //databuild/test/app/dsl:graph_graph.analyze
# Should build successfully using generated BUILD.bazel
# Test: End-to-end functionality
bazel run //databuild/test/app/dsl:graph_graph.analyze -- "color_vote_report/2024-01-01/red"
# Should work exactly like hand-written graph
```
#### Success Criteria
- Generator creates valid BUILD.bazel in source tree
- Generated targets are indistinguishable from hand-written ones
- Full DataBuild functionality works through generated code
- Clean developer workflow with clear phase separation
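If the generated `BUILD.bazel` is checked in, the first criterion can be guarded by a drift check that compares the file on disk with freshly rendered content. This is a sketch under that assumption; `build_file_is_current` is a hypothetical helper, and how the expected content is produced is left to the generator.

```python
import os
import tempfile


def build_file_is_current(output_dir, expected_content):
    """Return True if output_dir/BUILD.bazel exists and matches expected_content."""
    path = os.path.join(output_dir, "BUILD.bazel")
    try:
        with open(path) as f:
            return f.read() == expected_content
    except FileNotFoundError:
        # A missing file counts as out of date.
        return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(build_file_is_current(tmp, "# generated\n"))
```

A CI job could call such a check and ask the developer to rerun the generate step when it returns `False`.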
---
### Phase 4: Graph Integration
**Goal**: Generate complete databuild graph targets with all operational variants
#### Deliverables