# DSL Graph Generation: Bazel Module Generation from Python DSL

## Motivation & High-Level Goals

### Problem Statement

DataBuild's Python DSL provides an ergonomic interface for defining data processing graphs, but it currently lacks a deployment path. Users can define jobs and graphs with the DSL, but cannot easily package and deploy them as complete, hermetic applications. This limits the DSL's utility as a production-ready interface.

### Strategic Goals

1. **Seamless Deployment**: Enable DSL-defined graphs to be built and deployed as complete Bazel modules
2. **Hermetic Packaging**: Generate self-contained modules with all dependencies resolved
3. **Interface Consistency**: Maintain the CLI/Service interchangeability principle across generated modules
4. **Production Readiness**: Support container deployment and external dependency management

### Success Criteria

- DSL graphs can be compiled to standalone Bazel modules (`@my_generated_graph//...`)
- Generated modules support the full databuild interface (analyze, build, service, container images)
- External repositories can depend on databuild core and generate working applications
- An end-to-end deployment pipeline runs from DSL definition to running containers

## Required Reading

### Core Design Documents

- [`DESIGN.md`](../DESIGN.md) - Overall databuild architecture and principles
- [`design/core-build.md`](../design/core-build.md) - Job and graph execution semantics
- [`design/graph-specification.md`](../design/graph-specification.md) - DSL interfaces and patterns
- [`design/service.md`](../design/service.md) - Service interface requirements
- [`design/deploy-strategies.md`](../design/deploy-strategies.md) - Deployment patterns

### Key Source Files

- [`databuild/dsl/python/dsl.py`](../databuild/dsl/python/dsl.py) - Current DSL implementation
- [`databuild/test/app/dsl/graph.py`](../databuild/test/app/dsl/graph.py) - Reference DSL usage
- [`databuild/rules.bzl`](../databuild/rules.bzl) - Bazel rules for jobs and graphs
- [`databuild/databuild.proto`](../databuild/databuild.proto) - Core interfaces

### Understanding Prerequisites

1. **Job Architecture**: Jobs expose `.cfg`, `.exec`, and main targets following the subcommand pattern
2. **Graph Structure**: Graphs require job lookup plus analyze, build, and service variants
3. **Bazel Modules**: External repos use `@workspace//...` references for generated content
4. **CLI/Service Consistency**: Both interfaces must produce identical artifacts and behaviors

## Implementation Plan

### Phase 1: Basic Generation Infrastructure

**Goal**: Establish the foundation for generating Bazel modules from DSL definitions

#### Deliverables

- A `DataBuildGraph.generate_bazel_module()` method
- Generated minimal `MODULE.bazel` with the databuild core dependency
- Generated `BUILD.bazel` with job and graph target stubs
- Basic workspace creation and file-writing utilities

#### Implementation Tasks

1. Add `generate_bazel_module(workspace_name: str, output_dir: str)` to `DataBuildGraph` (sketched below)
2. Create a template system for `MODULE.bazel` and `BUILD.bazel` generation
3. Implement file system utilities for creating the workspace structure
4. Add basic validation for DSL graph completeness
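
A minimal sketch of task 1, assuming the `MODULE.bazel`/`BUILD.bazel` content ultimately comes from the template system in task 2 (the template string and version pin here are illustrative placeholders):

```python
import os

# Illustrative template only; the real content comes from the template system.
_MODULE_TEMPLATE = '''module(name = "{workspace_name}")
bazel_dep(name = "databuild", version = "0.0.0")  # version/override are placeholders
'''

def generate_bazel_module(self, workspace_name: str, output_dir: str) -> None:
    """Write a minimal MODULE.bazel and a BUILD.bazel stub into output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "MODULE.bazel"), "w") as f:
        f.write(_MODULE_TEMPLATE.format(workspace_name=workspace_name))
    with open(os.path.join(output_dir, "BUILD.bazel"), "w") as f:
        # Job and graph target stubs are filled in by later phases.
        f.write("# Generated by DataBuildGraph.generate_bazel_module\n")
```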

#### Tests & Verification

```bash
# Test: Basic generation succeeds
python -c "
from databuild.test.app.dsl.graph import graph
graph.generate_bazel_module('test_graph', '/tmp/generated')
"

# Test: Generated files are valid
cd /tmp/generated
bazel build //...  # Should succeed without errors

# Test: Module can be referenced externally
# In a separate workspace:
# bazel build @test_graph//...
```

#### Success Criteria

- Generated `MODULE.bazel` declares the correct databuild dependency
- Generated `BUILD.bazel` is syntactically valid
- An external workspace can reference `@generated_graph//...` targets
- No errors when loading or building the generated Bazel files

---

### Phase 2: Job Binary Generation

**Goal**: Convert DSL job classes into executable databuild job targets

#### Deliverables

- Auto-generate job binary Python files with config/exec subcommand handling
- Create `databuild_job` targets for each DSL job class
- Implement job lookup binary generation
- Wire partition pattern matching to job target resolution

#### Implementation Tasks

1. Create a job binary template with subcommand dispatching:

```python
# Generated job_binary.py template
import json
import sys

job_instance = MyDSLJob()
if sys.argv[1] == "config":
    # Map the requested output partitions to a JobConfig, emitted as JSON
    config = job_instance.config(parse_outputs(sys.argv[2:]))
    print(json.dumps(config))
elif sys.argv[1] == "exec":
    # Read the JobConfig produced by the config step from stdin and execute
    config = json.loads(sys.stdin.read())
    job_instance.exec(config)
```

2. Generate the job lookup binary from DSL job registrations:

```python
# Generated lookup.py
def lookup_job_for_partition(partition_ref: str) -> str:
    # JOB_MAPPINGS (emitted by the generator) maps compiled partition
    # patterns to job targets; see the sketch after this list.
    for pattern, job_target in JOB_MAPPINGS.items():
        if pattern.match(partition_ref):
            return job_target
    raise ValueError(f"No job found for: {partition_ref}")
```

3. Create `databuild_job` targets in the generated `BUILD.bazel`

4. Handle DSL job dependencies and imports in generated files
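
A hypothetical sketch of how the generator might emit `JOB_MAPPINGS` from the patterns that DSL jobs register; the glob-to-regex translation and the `jobs` input shape are assumptions, not the DSL's actual API:

```python
import re

def format_job_mappings(jobs: dict[str, list[str]]) -> str:
    """Render lookup.py's JOB_MAPPINGS from job name -> partition patterns."""
    lines = ["import re", "", "JOB_MAPPINGS = {"]
    for job_name, patterns in jobs.items():
        for pattern in patterns:
            # Translate a simple glob like "daily_color_votes/*/*" into a
            # regex where "*" matches exactly one path segment.
            regex = "^" + re.escape(pattern).replace(r"\*", "[^/]+") + "$"
            lines.append(f"    re.compile({regex!r}): '//:{job_name}',")
    lines.append("}")
    return "\n".join(lines)
```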

#### Tests & Verification

```bash
# Test: Job config execution
bazel run @test_graph//:ingest_color_votes.cfg -- \
  "daily_color_votes/2024-01-01/red"
# Should output valid JobConfig JSON

# Test: Job exec execution
echo '{"outputs":[...], "env":{"DATA_DATE":"2024-01-01"}}' | \
  bazel run @test_graph//:ingest_color_votes.exec
# Should execute successfully

# Test: Job lookup
bazel run @test_graph//:job_lookup -- \
  "daily_color_votes/2024-01-01/red"
# Should output: //:ingest_color_votes
```

#### Success Criteria

- All DSL jobs become executable `databuild_job` targets
- Job binaries correctly handle config/exec subcommands
- Job lookup correctly maps partition patterns to job targets
- Generated jobs maintain DSL semantic behavior

---

### Phase 3: Two-Phase Code Generation

**Goal**: Implement two-phase code generation that works within Bazel's constraints

#### Key Learning

Previous attempts failed due to fundamental Bazel constraints, illustrated by the sketch after this list:

- **Loading vs. execution phases**: `load()` statements are evaluated before genrules execute
- **Dynamic target generation**: Bazel requires the complete build graph before execution begins
- **Hermeticity**: Generated BUILD files must live in the source tree, not in `bazel-bin`
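
A minimal illustration of the constraint (not from the codebase): a genrule produces its output at execution time, but `load()` is resolved during loading, so a package can never load a file generated within the same build:

```python
# BUILD.bazel -- this does NOT work
# ERROR: load() is resolved at loading time, before any rule executes, so
# ":generated.bzl" does not exist yet (it would only appear in bazel-bin,
# and Bazel only loads source files anyway).
load(":generated.bzl", "JOBS")

genrule(
    name = "gen_jobs",
    outs = ["generated.bzl"],
    cmd = "echo 'JOBS = []' > $@",
)
```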

The solution: **two-phase generation**, following the established pattern of protobuf, thrift, and other code generators.

#### Two-Phase Workflow

**Phase 1: Code generation** (run by the developer)
```bash
bazel run //databuild/test/app/dsl:graph.generate
# Generates BUILD.bazel and Python binaries into the source tree
```

**Phase 2: Building** (normal Bazel workflow)
```bash
bazel build //databuild/test/app/dsl:graph.analyze
bazel run //databuild/test/app/dsl:graph.service -- --port 8080
```

#### Implementation Tasks

1. **Create a `databuild_dsl_generator` rule**:

```python
databuild_dsl_generator(
    name = "graph.generate",
    graph_file = "graph.py",
    graph_attr = "graph",  # name of the graph object in graph.py, used by the implementation below
    output_package = "//databuild/test/app/dsl",
    deps = [":dsl_src"],
)
```

2. **Implement a generator that writes to the source tree**:

```python
def _databuild_dsl_generator_impl(ctx):
    script = ctx.actions.declare_file(ctx.label.name + "_generator.py")

    # Create a script that:
    # 1. Loads the DSL graph
    # 2. Generates BUILD.bazel and binaries
    # 3. Writes them into the source tree (BUILD_WORKSPACE_DIRECTORY is set
    #    by `bazel run` to the workspace root)
    script_content = """#!/usr/bin/env python3
import os
import sys

workspace_root = os.environ.get('BUILD_WORKSPACE_DIRECTORY')
sys.path.insert(0, workspace_root)  # make workspace modules importable
output_dir = os.path.join(workspace_root, '{package_path}')

# Load the DSL graph and generate into the source tree
from {module_path} import {graph_attr}
{graph_attr}.generate_bazel_package('{name}', output_dir)
print(f'Generated BUILD.bazel and binaries in {{output_dir}}')
""".format(
        # "//databuild/test/app/dsl" -> "databuild/test/app/dsl"
        package_path = ctx.attr.output_package.strip("//").replace(":", "/"),
        # "databuild/test/app/dsl/graph.py" -> "databuild.test.app.dsl.graph"
        module_path = ctx.file.graph_file.path.replace(".py", "").replace("/", "."),
        graph_attr = ctx.attr.graph_attr,
        name = ctx.attr.name.replace(".generate", ""),
    )

    ctx.actions.write(
        output = script,
        content = script_content,
        is_executable = True,
    )

    return [DefaultInfo(executable = script)]
```

3. **Update `DataBuildGraph.generate_bazel_package()` to target the source tree**:

```python
def generate_bazel_package(self, name: str, output_dir: str) -> None:
    """Generate BUILD.bazel and binaries into a source directory."""
    # Generate BUILD.bazel with real databuild targets
    self._generate_build_bazel(output_dir, name)

    # Generate job binaries
    self._generate_job_binaries(output_dir)

    # Generate job lookup
    self._generate_job_lookup(output_dir)

    print(f"Generated package in {output_dir}")
    print(f"Run 'bazel build :{name}.analyze' to use")
```

4. **Create a standard BUILD.bazel template** (a possible template shape is sketched below):

```python
def _generate_build_bazel(self, output_dir: str, name: str):
    # Generate proper databuild_job and databuild_graph targets that
    # work exactly like hand-written ones
    build_content = self._build_template.format(
        jobs = self._format_jobs(),
        graph_name = f"{name}_graph",
        job_targets = self._format_job_targets(),
    )

    with open(os.path.join(output_dir, "BUILD.bazel"), "w") as f:
        f.write(build_content)
```
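
A hypothetical shape for `self._build_template`; the load path and attribute names are assumptions here, with the authoritative rule definitions in `databuild/rules.bzl`:

```python
_build_template = '''load("@databuild//databuild:rules.bzl", "databuild_graph", "databuild_job")

{job_targets}

databuild_graph(
    name = "{graph_name}",
    jobs = [{jobs}],
)
'''
```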

#### Interface Design

**For DSL Authors**:
```python
# In graph.py
graph = DataBuildGraph("my_graph")

@graph.job
class MyJob(DataBuildJob):
    # ... job definition: a config() that maps requested outputs to a
    # JobConfig, and an exec() that materializes them (see Phase 2)
```

**For Users**:
```bash
# Generate code (phase 1)
bazel run //my/app:graph.generate

# Use the generated code (phase 2)
bazel build //my/app:graph.analyze
bazel run //my/app:graph.service
```

**In BUILD.bazel**:
```python
databuild_dsl_generator(
    name = "graph.generate",
    graph_file = "graph.py",
    graph_attr = "graph",
    output_package = "//my/app",
    deps = [":my_deps"],
)

# After generation, this file will also contain:
# databuild_graph(name = "graph_graph", ...)
# databuild_job(name = "my_job", ...)
# py_binary(name = "my_job_binary", ...)
```

#### Benefits of This Approach

- ✅ **Works within Bazel constraints** - No dynamic target generation
- ✅ **Follows established patterns** - Same as protobuf, thrift, and OpenAPI generators
- ✅ **Inspectable output** - Users can read the generated BUILD.bazel
- ✅ **Version controllable** - Generated files can be checked in if desired
- ✅ **Incremental builds** - Standard Bazel caching applies unchanged
- ✅ **Clean separation** - Generation and building are distinct phases

#### Tests & Verification

```bash
# Test: Code generation
bazel run //databuild/test/app/dsl:graph.generate
# Should create BUILD.bazel and Python files in the source tree

# Test: Generated targets work
bazel build //databuild/test/app/dsl:graph_graph.analyze
# Should build successfully using the generated BUILD.bazel

# Test: End-to-end functionality
bazel run //databuild/test/app/dsl:graph_graph.analyze -- "color_vote_report/2024-01-01/red"
# Should behave exactly like a hand-written graph
```

#### Success Criteria

- Generator creates a valid BUILD.bazel in the source tree
- Generated targets are indistinguishable from hand-written ones
- Full DataBuild functionality works through generated code
- Clean developer workflow with clear phase separation

---

### Phase 4: Graph Integration

**Goal**: Generate complete databuild graph targets with all operational variants

#### Deliverables

- Generate a `databuild_graph` target with analyze/build/service capabilities
- Create all graph variant targets (`.analyze`, `.build`, `.service`, etc.)
- Wire job dependencies into the graph configuration
- Generate container deployment targets

#### Implementation Tasks

1. Generate the `databuild_graph` target with the complete job list
2. Create all required graph variants:
   - `my_graph.analyze` - Planning capability
   - `my_graph.build` - CLI execution
   - `my_graph.service` - HTTP service
   - `my_graph.service.image` - Container image
3. Configure job lookup and dependency wiring (see the sketch after this list)
4. Add graph label and identification metadata
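
An illustrative shape for the generated graph target; the attribute names are assumptions for this sketch, with the authoritative definitions in `databuild/rules.bzl`:

```python
# Generated BUILD.bazel excerpt (attribute names assumed for illustration)
databuild_graph(
    name = "my_graph",
    jobs = [
        ":ingest_color_votes",
        ":build_color_vote_report",  # hypothetical second job
    ],
    lookup = ":job_lookup",
)
# Expected to expand into my_graph.analyze, my_graph.build,
# my_graph.service, and my_graph.service.image variants.
```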

#### Tests & Verification

```bash
# Test: Graph analysis
bazel run @test_graph//:my_graph.analyze -- \
  "color_vote_report/2024-01-01/red"
# Should output the complete job execution plan

# Test: Graph building
bazel run @test_graph//:my_graph.build -- \
  "daily_color_votes/2024-01-01/red"
# Should execute an end-to-end build

# Test: Service deployment
bazel run @test_graph//:my_graph.service -- --port 8081
# Should start the HTTP service on port 8081

# Test: Container generation
bazel build @test_graph//:my_graph.service.image
# Should create a deployable container image
```
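
The CLI/Service consistency criterion below can also be spot-checked mechanically. A hedged sketch, assuming the service is already running on port 8081 and exposes an analyze endpoint at `/analyze` (the endpoint path and query parameter are assumptions, not the documented service API):

```python
import json
import subprocess
import urllib.parse
import urllib.request

ref = "color_vote_report/2024-01-01/red"

# Plan from the CLI variant
cli_plan = subprocess.run(
    ["bazel", "run", "@test_graph//:my_graph.analyze", "--", ref],
    capture_output=True, text=True, check=True,
).stdout

# Plan from the hypothetical HTTP endpoint
url = "http://localhost:8081/analyze?partition=" + urllib.parse.quote(ref, safe="")
with urllib.request.urlopen(url) as resp:
    service_plan = resp.read().decode()

assert json.loads(cli_plan) == json.loads(service_plan), "CLI and service plans differ"
```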

#### Success Criteria

- Graph targets provide full databuild functionality
- CLI and service interfaces produce identical results
- All graph operations work with generated job targets
- Container images are deployable and functional

---

### Phase 5: Dependency Resolution

**Goal**: Handle external pip packages and Bazel dependencies in generated modules

#### Deliverables

- User-declared dependency system in the DSL
- Generated `MODULE.bazel` with proper pip and Bazel dependencies
- Dependency validation and conflict resolution
- Support for requirements files and version pinning

#### Implementation Tasks

1. Extend the `DataBuildGraph` constructor to accept dependencies:

```python
graph = DataBuildGraph(
    "//my_graph",
    pip_deps=["pandas>=2.0.0", "numpy"],
    bazel_deps=["@my_repo//internal:lib"],
)
```

2. Generate `MODULE.bazel` with the pip extension configured:

```python
pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
    hub_name = "pip_deps",
    python_version = "3.11",
    requirements_lock = "//:requirements_lock.txt",
)
use_repo(pip, "pip_deps")  # make the @pip_deps hub visible to this module
```

3. Create requirements file generation from the declared dependencies (sketched below)

4. Add dependency validation during generation
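
A minimal sketch of task 3, assuming the constructor from task 1 stores the declared pins on `self.pip_deps` and that they are written out directly; a real implementation would run a lock step (e.g. pip-compile) to produce a true `requirements_lock.txt` with hashes:

```python
import os

def _generate_requirements(self, output_dir: str) -> None:
    """Write declared pip_deps into the requirements lock file stub."""
    # Sort for deterministic output so regeneration doesn't churn the file
    lines = sorted(self.pip_deps)  # e.g. ["numpy", "pandas>=2.0.0"]
    with open(os.path.join(output_dir, "requirements_lock.txt"), "w") as f:
        f.write("\n".join(lines) + "\n")
```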

#### Tests & Verification

```bash
# Test: Pip dependencies resolved
bazel build @test_graph//:my_job
# Should succeed with pandas/numpy available

# Test: Cross-module references work
# Generate a graph that depends on @other_repo//lib
bazel build @test_graph//:dependent_job
# Should resolve external Bazel dependencies

# Test: Container includes all deps
bazel run @test_graph//:my_graph.service.image_load
docker run databuild_test_graph_service:latest python -c "import pandas"
# Should succeed - pandas available in the container
```

#### Success Criteria

- Generated modules resolve all external dependencies
- Pip packages are available during job execution
- Cross-repository Bazel dependencies work correctly
- Container images include the complete dependency closure

---

### Phase 6: End-to-End Deployment

**Goal**: Complete the production deployment pipeline with observability

#### Deliverables

- Production-ready container images with proper configuration
- Integration with existing databuild observability systems
- Build event log compatibility
- Performance optimization and resource management

#### Implementation Tasks

1. Optimize generated container images for production use
2. Ensure build event logging works correctly in generated modules
3. Add resource configuration and limits to generated targets
4. Create deployment documentation and examples
5. Performance testing and optimization

#### Tests & Verification

```bash
./run_e2e_tests.sh
```

#### Success Criteria

- Generated modules are production-ready
- Full observability and logging integration
- Performance meets production requirements
- CLI/Service consistency is maintained
- Complete deployment documentation

## Validation Strategy

### Integration with Existing Tests

- Extend `run_e2e_tests.sh` to cover generated modules
- Add generated-module tests to the CI/CD pipeline
- Use the existing test app DSL as the primary test case

### Performance Benchmarks

- Graph analysis speed comparison (DSL-generated vs. hand-written Bazel)
- Container image size optimization
- Job execution overhead measurement

### Correctness Verification

- Build event log structure validation
- Partition resolution accuracy testing
- Dependency resolution completeness checks