# DSL Graph Generation: Bazel Module Generation from Python DSL
## Motivation & High-Level Goals
### Problem Statement
DataBuild's Python DSL provides an ergonomic interface for defining data processing graphs, but currently lacks a deployment path. Users can define jobs and graphs using the DSL, but cannot easily package and deploy them as complete, hermetic applications. This limits the DSL's utility as a production-ready interface.
### Strategic Goals
1. **Seamless Deployment**: Enable DSL-defined graphs to be built and deployed as complete bazel modules
2. **Hermetic Packaging**: Generate self-contained modules with all dependencies resolved
3. **Interface Consistency**: Maintain CLI/Service interchangeability principle across generated modules
4. **Production Readiness**: Support container deployment and external dependency management
### Success Criteria
- DSL graphs can be compiled to standalone bazel modules (`@my_generated_graph//...`)
- Generated modules support the full databuild interface (analyze, build, service, container images)
- External repositories can depend on databuild core and generate working applications
- End-to-end deployment pipeline from DSL definition to running containers
## Required Reading
### Core Design Documents
- [`DESIGN.md`](../DESIGN.md) - Overall databuild architecture and principles
- [`design/core-build.md`](../design/core-build.md) - Job and graph execution semantics
- [`design/graph-specification.md`](../design/graph-specification.md) - DSL interfaces and patterns
- [`design/service.md`](../design/service.md) - Service interface requirements
- [`design/deploy-strategies.md`](../design/deploy-strategies.md) - Deployment patterns
### Key Source Files
- [`databuild/dsl/python/dsl.py`](../databuild/dsl/python/dsl.py) - Current DSL implementation
- [`databuild/test/app/dsl/graph.py`](../databuild/test/app/dsl/graph.py) - Reference DSL usage
- [`databuild/rules.bzl`](../databuild/rules.bzl) - Bazel rules for jobs and graphs
- [`databuild/databuild.proto`](../databuild/databuild.proto) - Core interfaces
### Understanding Prerequisites
1. **Job Architecture**: Jobs have `.cfg`, `.exec`, and main targets following a subcommand pattern (illustrated after this list)
2. **Graph Structure**: Graphs require job lookup, analyze, build, and service variants
3. **Bazel Modules**: External repos use `@workspace//...` references for generated content
4. **CLI/Service Consistency**: Both interfaces must produce identical artifacts and behaviors
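For concreteness, the subcommand pattern from prerequisite 1 looks like this in practice (the target label and JSON payload are illustrative, adapted from the test app):
```bash
# Plan: resolve a requested output partition into a JobConfig (printed as JSON)
bazel run //databuild/test/app:ingest_color_votes.cfg -- \
  "daily_color_votes/2024-01-01/red"
# Execute: run the job against a JobConfig supplied on stdin
echo '{"outputs":[...], "env":{"DATA_DATE":"2024-01-01"}}' | \
  bazel run //databuild/test/app:ingest_color_votes.exec
```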
## Implementation Plan
### Phase 1: Basic Generation Infrastructure
**Goal**: Establish foundation for generating bazel modules from DSL definitions
#### Deliverables
- Extend `DataBuildGraph` with a `generate_bazel_module()` method
- Generate minimal `MODULE.bazel` with databuild core dependency
- Generate `BUILD.bazel` with job and graph target stubs
- Basic workspace creation and file writing utilities
#### Implementation Tasks
1. Add `generate_bazel_module(workspace_name: str, output_dir: str)` to `DataBuildGraph`
2. Create a template system for `MODULE.bazel` and `BUILD.bazel` generation (sketched after this list)
3. Implement file system utilities for creating workspace structure
4. Add basic validation for DSL graph completeness
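As a sketch of task 2, the `MODULE.bazel` template could look like the following; the module names, versions, and `render_module_bazel` helper are illustrative assumptions, not settled API:
```python
# Hypothetical MODULE.bazel template; names and versions are placeholders
MODULE_BAZEL_TEMPLATE = """\
module(name = "{workspace_name}", version = "0.1.0")

bazel_dep(name = "databuild", version = "{databuild_version}")
bazel_dep(name = "rules_python", version = "0.31.0")
"""

def render_module_bazel(workspace_name: str, databuild_version: str) -> str:
    return MODULE_BAZEL_TEMPLATE.format(
        workspace_name=workspace_name,
        databuild_version=databuild_version,
    )
```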
#### Tests & Verification
```bash
# Test: Basic generation succeeds
python -c "
from databuild.test.app.dsl.graph import graph
graph.generate_bazel_module('test_graph', '/tmp/generated')
"
# Test: Generated files are valid
cd /tmp/generated
bazel build //... # Should succeed without errors
# Test: Module can be referenced externally
# In separate workspace:
# bazel build @test_graph//...
```
#### Success Criteria
- Generated `MODULE.bazel` has correct databuild dependency
- Generated `BUILD.bazel` is syntactically valid
- External workspace can reference `@generated_graph//...` targets
- No compilation errors in generated bazel files
---
### Phase 2: Job Binary Generation
**Goal**: Convert DSL job classes into executable databuild job targets
#### Deliverables
- Auto-generate job binary Python files with config/exec subcommand handling
- Create `databuild_job` targets for each DSL job class
- Implement job lookup binary generation
- Wire partition pattern matching to job target resolution
#### Implementation Tasks
1. Create job binary template with subcommand dispatching:
```python
# Generated job_binary.py template
import json
import sys

# One job instance serves both subcommands
job_instance = MyDSLJob()

if sys.argv[1] == "config":
    # Resolve the requested output partitions into a JobConfig
    config = job_instance.config(parse_outputs(sys.argv[2:]))
    print(json.dumps(config))
elif sys.argv[1] == "exec":
    # Read the JobConfig produced by the config step from stdin
    config = json.loads(sys.stdin.read())
    job_instance.exec(config)
```
2. Generate job lookup binary from DSL job registrations:
```python
# Generated lookup.py
# JOB_MAPPINGS: generated dict of compiled partition-pattern regexes -> job target labels
def lookup_job_for_partition(partition_ref: str) -> str:
    for pattern, job_target in JOB_MAPPINGS.items():
        if pattern.match(partition_ref):
            return job_target
    raise ValueError(f"No job found for: {partition_ref}")
```
3. Create `databuild_job` targets in the generated `BUILD.bazel` (see the fragment after this list)
4. Handle DSL job dependencies and imports in generated files
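For illustration, the generated `BUILD.bazel` fragment for a single DSL job might look like this; the load paths and the `binary` attribute are assumptions to be confirmed against `databuild/rules.bzl`:
```python
load("@databuild//databuild:rules.bzl", "databuild_job")
load("@rules_python//python:defs.bzl", "py_binary")

# Hypothetical generated fragment for one DSL job
py_binary(
    name = "ingest_color_votes_binary",
    srcs = ["ingest_color_votes_binary.py"],
    deps = [":dsl_src"],
)

databuild_job(
    name = "ingest_color_votes",
    binary = ":ingest_color_votes_binary",
)
```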
#### Tests & Verification
```bash
# Test: Job config execution
bazel run @test_graph//:ingest_color_votes.cfg -- \
"daily_color_votes/2024-01-01/red"
# Should output valid JobConfig JSON
# Test: Job exec execution
echo '{"outputs":[...], "env":{"DATA_DATE":"2024-01-01"}}' | \
bazel run @test_graph//:ingest_color_votes.exec
# Should execute successfully
# Test: Job lookup
bazel run @test_graph//:job_lookup -- \
"daily_color_votes/2024-01-01/red"
# Should output: //:ingest_color_votes
```
#### Success Criteria
- All DSL jobs become executable `databuild_job` targets
- Job binaries correctly handle config/exec subcommands
- Job lookup correctly maps partition patterns to job targets
- Generated jobs maintain DSL semantic behavior
---
### Phase 3: Two-Phase Code Generation
**Goal**: Implement proper two-phase code generation that works within Bazel's constraints
#### Key Learning
Previous attempts failed due to fundamental Bazel constraints:
- **Loading vs Execution phases**: `load()` statements run before genrules execute
- **Dynamic target generation**: Bazel requires the complete build graph before execution begins
- **Hermeticity**: Generated BUILD files must be in source tree, not bazel-bin
The solution: **Two-phase generation** following established patterns from protobuf, thrift, and other code generators.
#### Two-Phase Workflow
**Phase 1: Code Generation** (run by developer)
```bash
bazel run //databuild/test/app/dsl:graph.generate
# Generates BUILD.bazel and Python binaries into source tree
```
**Phase 2: Building** (normal Bazel workflow)
```bash
bazel build //databuild/test/app/dsl:graph_graph.analyze
bazel run //databuild/test/app/dsl:graph_graph.service -- --port 8080
```
#### Implementation Tasks
1. **Create `databuild_dsl_generator` rule**:
```python
databuild_dsl_generator(
    name = "graph.generate",
    graph_file = "graph.py",
    graph_attr = "graph",  # module-level attribute holding the DataBuildGraph
    output_package = "//databuild/test/app/dsl",
    deps = [":dsl_src"],
)
```
2. **Implement a generator that writes to the source tree**:
```python
def _databuild_dsl_generator_impl(ctx):
    script = ctx.actions.declare_file(ctx.label.name + "_generator.py")
    # The emitted script, when `bazel run`:
    # 1. Loads the DSL graph
    # 2. Generates BUILD.bazel and job binaries
    # 3. Writes them into the source tree
    script_content = """#!/usr/bin/env python3
import os

# `bazel run` sets BUILD_WORKSPACE_DIRECTORY to the workspace root,
# which lets the script write into the source tree rather than bazel-bin
workspace_root = os.environ['BUILD_WORKSPACE_DIRECTORY']
output_dir = os.path.join(workspace_root, '{package_path}')

# Load the DSL graph and generate the package
from {module_path} import {graph_attr}
{graph_attr}.generate_bazel_package('{name}', output_dir)
print(f'Generated BUILD.bazel and binaries in {{output_dir}}')
""".format(
        package_path = ctx.attr.output_package.lstrip("/").replace(":", "/"),
        module_path = ctx.file.graph_file.path.replace(".py", "").replace("/", "."),
        graph_attr = ctx.attr.graph_attr,
        name = ctx.attr.name.replace(".generate", ""),
    )
    ctx.actions.write(
        output = script,
        content = script_content,
        is_executable = True,
    )
    return [DefaultInfo(executable = script)]
```
3. **Update `DataBuildGraph.generate_bazel_package()` to target the source tree**:
```python
def generate_bazel_package(self, name: str, output_dir: str) -> None:
    """Generate BUILD.bazel and binaries into the source directory."""
    # Generate BUILD.bazel with real databuild targets
    self._generate_build_bazel(output_dir, name)
    # Generate job binaries
    self._generate_job_binaries(output_dir)
    # Generate job lookup
    self._generate_job_lookup(output_dir)
    print(f"Generated package in {output_dir}")
    print(f"Run 'bazel build :{name}.analyze' to use")
```
4. **Create standard BUILD.bazel template** (a possible template shape follows below):
```python
def _generate_build_bazel(self, output_dir: str, name: str):
    # Generate proper databuild_job and databuild_graph targets
    # that work exactly like hand-written ones
    build_content = self._build_template.format(
        jobs = self._format_jobs(),
        graph_name = f"{name}_graph",
        job_targets = self._format_job_targets(),
    )
    with open(os.path.join(output_dir, "BUILD.bazel"), "w") as f:
        f.write(build_content)
```
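A possible shape for `_build_template`, assuming the rules load from `@databuild//databuild:rules.bzl` (the load path and target attributes are guesses to be confirmed against the real rules):
```python
# Hypothetical _build_template; placeholders are filled by _generate_build_bazel
_build_template = '''\
load("@databuild//databuild:rules.bzl", "databuild_graph", "databuild_job")

{job_targets}

databuild_graph(
    name = "{graph_name}",
    jobs = [{jobs}],
)
'''
```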
#### Interface Design
**For DSL Authors**:
```python
# In graph.py
graph = DataBuildGraph("my_graph")

@graph.job
class MyJob(DataBuildJob):
    ...  # job definition
```
**For Users**:
```bash
# Generate code (phase 1)
bazel run //my/app:graph.generate
# Use generated code (phase 2)
bazel build //my/app:graph.analyze
bazel run //my/app:graph.service
```
**In BUILD.bazel**:
```python
databuild_dsl_generator(
    name = "graph.generate",
    graph_file = "graph.py",
    graph_attr = "graph",
    output_package = "//my/app",
    deps = [":my_deps"],
)

# After generation, this file will also contain:
#   databuild_graph(name = "graph_graph", ...)
#   databuild_job(name = "my_job", ...)
#   py_binary(name = "my_job_binary", ...)
```
#### Benefits of This Approach
- **Works within Bazel constraints** - no dynamic target generation
- **Follows established patterns** - same approach as protobuf, thrift, and OpenAPI generators
- **Inspectable output** - users can read the generated BUILD.bazel
- **Version controllable** - generated files can be checked in if desired
- **Incremental builds** - standard Bazel caching works unchanged
- **Clean separation** - generation and building are separate phases
#### Tests & Verification
```bash
# Test: Code generation
bazel run //databuild/test/app/dsl:graph.generate
# Should create BUILD.bazel and Python files in source tree
# Test: Generated targets work
bazel build //databuild/test/app/dsl:graph_graph.analyze
# Should build successfully using generated BUILD.bazel
# Test: End-to-end functionality
bazel run //databuild/test/app/dsl:graph_graph.analyze -- "color_vote_report/2024-01-01/red"
# Should work exactly like hand-written graph
```
#### Success Criteria
- Generator creates valid BUILD.bazel in source tree
- Generated targets are indistinguishable from hand-written ones
- Full DataBuild functionality works through generated code
- Clean developer workflow with clear phase separation
---
### Phase 4: Graph Integration
**Goal**: Generate complete databuild graph targets with all operational variants
#### Deliverables
- Generate `databuild_graph` target with analyze/build/service capabilities
- Create all graph variant targets (`.analyze`, `.build`, `.service`, etc.)
- Wire job dependencies into graph configuration
- Generate container deployment targets
#### Implementation Tasks
1. Generate `databuild_graph` target with the complete job list (sketched after this list)
2. Create all required graph variants:
- `my_graph.analyze` - Planning capability
- `my_graph.build` - CLI execution
- `my_graph.service` - HTTP service
- `my_graph.service.image` - Container image
3. Configure job lookup and dependency wiring
4. Add graph label and identification metadata
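Sketched below is what the generated graph target could look like; the load path, `lookup` attribute, and job names are illustrative, and the `databuild_graph` macro is expected to expand into the variants listed above:
```python
load("@databuild//databuild:rules.bzl", "databuild_graph")

# Hypothetical generated target; the macro would expand into my_graph.analyze,
# my_graph.build, my_graph.service, and my_graph.service.image
databuild_graph(
    name = "my_graph",
    jobs = [
        ":ingest_color_votes",
        ":build_color_vote_report",
    ],
    lookup = ":job_lookup",
)
```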
#### Tests & Verification
```bash
# Test: Graph analysis
bazel run @test_graph//:my_graph.analyze -- \
"color_vote_report/2024-01-01/red"
# Should output complete job execution plan
# Test: Graph building
bazel run @test_graph//:my_graph.build -- \
"daily_color_votes/2024-01-01/red"
# Should execute end-to-end build
# Test: Service deployment
bazel run @test_graph//:my_graph.service -- --port 8081
# Should start HTTP service on port 8081
# Test: Container generation
bazel build @test_graph//:my_graph.service.image
# Should create deployable container image
```
#### Success Criteria
- Graph targets provide full databuild functionality
- CLI and service interfaces produce identical results
- All graph operations work with generated job targets
- Container images are deployable and functional
---
### Phase 5: Dependency Resolution
**Goal**: Handle external pip packages and bazel dependencies in generated modules
#### Deliverables
- User-declared dependency system in DSL
- Generated `MODULE.bazel` with proper pip and bazel dependencies
- Dependency validation and conflict resolution
- Support for requirements files and version pinning
#### Implementation Tasks
1. Extend `DataBuildGraph` constructor to accept dependencies:
```python
graph = DataBuildGraph(
    "//my_graph",
    pip_deps=["pandas>=2.0.0", "numpy"],
    bazel_deps=["@my_repo//internal:lib"],
)
```
2. Generate `MODULE.bazel` with pip extension configuration:
```python
pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
    hub_name = "pip_deps",
    python_version = "3.11",
    requirements_lock = "//:requirements_lock.txt",
)
use_repo(pip, "pip_deps")
```
3. Create requirements file generation from declared dependencies (minimal sketch after this list)
4. Add dependency validation during generation
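A minimal sketch of task 3, assuming the graph stores its declared pip dependencies in a `pip_deps` list; producing the pinned `requirements_lock.txt` from this file (e.g. with pip-compile) would be a separate locking step:
```python
import os

def _generate_requirements(self, output_dir: str) -> None:
    # Write declared pip deps verbatim; a separate locking step turns this
    # into the requirements_lock.txt referenced from MODULE.bazel
    path = os.path.join(output_dir, "requirements.txt")
    with open(path, "w") as f:
        f.write("\n".join(sorted(self.pip_deps)) + "\n")
```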
#### Tests & Verification
```bash
# Test: Pip dependencies resolved
bazel build @test_graph//:my_job
# Should succeed with pandas/numpy available
# Test: Cross-module references work
# Generate graph that depends on @other_repo//lib
bazel build @test_graph//:dependent_job
# Should resolve external bazel dependencies
# Test: Container includes all deps
bazel run @test_graph//:my_graph.service.image_load
docker run databuild_test_graph_service:latest python -c "import pandas"
# Should succeed - pandas available in container
```
#### Success Criteria
- Generated modules resolve all external dependencies
- Pip packages are available to job execution
- Cross-repository bazel dependencies work correctly
- Container images include complete dependency closure
---
### Phase 6: End-to-End Deployment
**Goal**: Complete production deployment pipeline with observability
#### Deliverables
- Production-ready container images with proper configuration
- Integration with existing databuild observability systems
- Build event log compatibility
- Performance optimization and resource management
#### Implementation Tasks
1. Optimize generated container images for production use
2. Ensure build event logging works correctly in generated modules
3. Add resource configuration and limits to generated targets
4. Create deployment documentation and examples
5. Performance testing and optimization
#### Tests & Verification
```bash
./run_e2e_tests.sh
```
#### Success Criteria
- Generated modules are production-ready
- Full observability and logging integration
- Performance meets production requirements
- CLI/Service consistency maintained
- Complete deployment documentation
## Validation Strategy
### Integration with Existing Tests
- Extend `run_e2e_tests.sh` to cover generated modules (example commands after this list)
- Add generated module tests to CI/CD pipeline
- Use existing test app DSL as primary test case
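For example, the `run_e2e_tests.sh` extension could exercise the full two-phase flow using the commands from the phases above (target names assume the test app layout):
```bash
# Hypothetical additions to run_e2e_tests.sh: generate, then exercise the module
bazel run //databuild/test/app/dsl:graph.generate
bazel build //databuild/test/app/dsl:graph_graph.analyze
bazel run //databuild/test/app/dsl:graph_graph.analyze -- "color_vote_report/2024-01-01/red"
```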
### Performance Benchmarks
- Graph analysis speed comparison (DSL vs hand-written bazel)
- Container image size optimization
- Job execution overhead measurement
### Correctness Verification
- Build event log structure validation
- Partition resolution accuracy testing
- Dependency resolution completeness checks