# DSL Graph Generation: Bazel Module Generation from Python DSL

## Motivation & High-Level Goals

### Problem Statement
DataBuild's Python DSL provides an ergonomic interface for defining data processing graphs, but currently lacks a deployment path. Users can define jobs and graphs using the DSL, but cannot easily package and deploy them as complete, hermetic applications. This limits the DSL's utility as a production-ready interface.
### Strategic Goals
- **Seamless Deployment**: Enable DSL-defined graphs to be built and deployed as complete Bazel modules
- **Hermetic Packaging**: Generate self-contained modules with all dependencies resolved
- **Interface Consistency**: Maintain the CLI/Service interchangeability principle across generated modules
- **Production Readiness**: Support container deployment and external dependency management
### Success Criteria

- DSL graphs can be compiled to standalone bazel modules (`@my_generated_graph//...`)
- Generated modules support the full databuild interface (analyze, build, service, container images)
- External repositories can depend on databuild core and generate working applications
- End-to-end deployment pipeline from DSL definition to running containers
## Required Reading

### Core Design Documents

- `DESIGN.md` - Overall databuild architecture and principles
- `design/core-build.md` - Job and graph execution semantics
- `design/graph-specification.md` - DSL interfaces and patterns
- `design/service.md` - Service interface requirements
- `design/deploy-strategies.md` - Deployment patterns
### Key Source Files

- `databuild/dsl/python/dsl.py` - Current DSL implementation
- `databuild/test/app/dsl/graph.py` - Reference DSL usage
- `databuild/rules.bzl` - Bazel rules for jobs and graphs
- `databuild/databuild.proto` - Core interfaces
### Understanding Prerequisites

- **Job Architecture**: Jobs have `.cfg`, `.exec`, and main targets with a subcommand pattern
- **Graph Structure**: Graphs require job lookup, analyze, build, and service variants
- **Bazel Modules**: External repos use `@workspace//...` references for generated content
- **CLI/Service Consistency**: Both interfaces must produce identical artifacts and behaviors
## Implementation Plan

### Phase 1: Basic Generation Infrastructure

**Goal**: Establish the foundation for generating Bazel modules from DSL definitions.

#### Deliverables

- Extend `DataBuildGraph` with a `generate_bazel_module()` method
- Generate a minimal `MODULE.bazel` with the databuild core dependency
- Generate `BUILD.bazel` with job and graph target stubs
- Basic workspace creation and file-writing utilities

#### Implementation Tasks

- Add `generate_bazel_module(workspace_name: str, output_dir: str)` to `DataBuildGraph`
- Create a template system for `MODULE.bazel` and `BUILD.bazel` generation (see the sketch after this list)
- Implement file-system utilities for creating the workspace structure
- Add basic validation for DSL graph completeness
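
One way the method and template system could fit together is sketched below. This is a minimal sketch, assuming a plain string-template approach; the `_render_build_stubs()` helper and the `MODULE.bazel` contents (including the `databuild` module name and version pin) are illustrative assumptions, not the final API.

```python
import os

# Illustrative MODULE.bazel template; the real template system, module name,
# and version pin are Phase 1 deliverables and may differ.
_MODULE_TEMPLATE = """module(name = "{name}", version = "0.1.0")

bazel_dep(name = "databuild", version = "0.1.0")
"""


def generate_bazel_module(self, workspace_name: str, output_dir: str) -> None:
    """DataBuildGraph method sketch: write MODULE.bazel plus BUILD.bazel stubs."""
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "MODULE.bazel"), "w") as f:
        f.write(_MODULE_TEMPLATE.format(name=workspace_name))
    with open(os.path.join(output_dir, "BUILD.bazel"), "w") as f:
        # Hypothetical helper that renders stub targets for the registered jobs.
        f.write(self._render_build_stubs())
```
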
#### Tests & Verification

```bash
# Test: Basic generation succeeds
python -c "
from databuild.test.app.dsl.graph import graph
graph.generate_bazel_module('test_graph', '/tmp/generated')
"

# Test: Generated files are valid
cd /tmp/generated
bazel build //...  # Should succeed without errors

# Test: Module can be referenced externally
# In a separate workspace:
# bazel build @test_graph//...
```

#### Success Criteria

- Generated `MODULE.bazel` has the correct databuild dependency
- Generated `BUILD.bazel` is syntactically valid
- An external workspace can reference `@generated_graph//...` targets
- No compilation errors in generated Bazel files

### Phase 2: Job Binary Generation

**Goal**: Convert DSL job classes into executable databuild job targets.

#### Deliverables
- Auto-generate job binary Python files with config/exec subcommand handling
- Create `databuild_job` targets for each DSL job class
- Implement job lookup binary generation
- Wire partition pattern matching to job target resolution

#### Implementation Tasks

- Create a job binary template with subcommand dispatching:

```python
# Generated job_binary.py template
import json
import sys

job_instance = MyDSLJob()
if sys.argv[1] == "config":
    config = job_instance.config(parse_outputs(sys.argv[2:]))
    print(json.dumps(config))
elif sys.argv[1] == "exec":
    config = json.loads(sys.stdin.read())
    job_instance.exec(config)
```

- Generate the job lookup binary from DSL job registrations (a sketch of how `JOB_MAPPINGS` can be rendered follows this list):

```python
# Generated lookup.py
def lookup_job_for_partition(partition_ref: str) -> str:
    for pattern, job_target in JOB_MAPPINGS.items():
        if pattern.match(partition_ref):
            return job_target
    raise ValueError(f"No job found for: {partition_ref}")
```

- Create `databuild_job` targets in the generated `BUILD.bazel`
- Handle DSL job dependencies and imports in generated files
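
To connect the two snippets above, the generator has to turn the DSL's job registrations into the `JOB_MAPPINGS` table baked into the generated `lookup.py`. A minimal sketch, assuming each registered job class exposes its output partition-ref patterns through a hypothetical `partition_patterns()` accessor:

```python
def render_job_mappings(jobs: dict) -> str:
    """Render the JOB_MAPPINGS literal embedded in the generated lookup.py.

    `jobs` maps a Bazel target name (e.g. "ingest_color_votes") to its DSL job
    class; partition_patterns() is an assumed accessor, not part of the current DSL.
    """
    lines = ["import re", "", "JOB_MAPPINGS = {"]
    for target_name, job_cls in sorted(jobs.items()):
        for pattern in job_cls.partition_patterns():
            lines.append(f"    re.compile(r'{pattern}'): '//:{target_name}',")
    lines.append("}")
    return "\n".join(lines) + "\n"
```
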
#### Tests & Verification

```bash
# Test: Job config execution
bazel run @test_graph//:ingest_color_votes.cfg -- \
  "daily_color_votes/2024-01-01/red"
# Should output valid JobConfig JSON

# Test: Job exec execution
echo '{"outputs":[...], "env":{"DATA_DATE":"2024-01-01"}}' | \
  bazel run @test_graph//:ingest_color_votes.exec
# Should execute successfully

# Test: Job lookup
bazel run @test_graph//:job_lookup -- \
  "daily_color_votes/2024-01-01/red"
# Should output: //:ingest_color_votes
```

#### Success Criteria

- All DSL jobs become executable `databuild_job` targets
- Job binaries correctly handle config/exec subcommands
- Job lookup correctly maps partition patterns to job targets
- Generated jobs maintain DSL semantic behavior

### Phase 3: Two-Phase Code Generation

**Goal**: Implement two-phase code generation that works within Bazel's constraints.

#### Key Learning
Previous attempts failed due to fundamental Bazel constraints:
- **Loading vs. execution phases**: `load()` statements run before genrules execute
- **Dynamic target generation**: Bazel requires the complete build graph before execution begins
- **Hermeticity**: Generated BUILD files must live in the source tree, not in `bazel-bin`
The solution: Two-phase generation following established patterns from protobuf, thrift, and other code generators.

#### Two-Phase Workflow

**Phase 1: Code generation** (run by the developer):

```bash
bazel run //databuild/test/app/dsl:graph.generate
# Generates BUILD.bazel and Python binaries into the source tree
```

**Phase 2: Building** (normal Bazel workflow):

```bash
bazel build //databuild/test/app/dsl:graph.analyze
bazel run //databuild/test/app/dsl:graph.service -- --port 8080
```

#### Implementation Tasks

1. **Create `databuild_dsl_generator` rule**:

```python
databuild_dsl_generator(
    name = "graph.generate",
    graph_file = "graph.py",
    output_package = "//databuild/test/app/dsl",
    deps = [":dsl_src"],
)
```

2. **Implement the generator so it writes to the source tree**:

```python
def _databuild_dsl_generator_impl(ctx):
    script = ctx.actions.declare_file(ctx.label.name + "_generator.py")

    # Create a script that:
    # 1. Loads the DSL graph
    # 2. Generates BUILD.bazel and binaries
    # 3. Writes them to the source tree
    script_content = """
import os
import sys

# Add workspace root to path
workspace_root = os.environ.get('BUILD_WORKSPACE_DIRECTORY')
sys.path.insert(0, workspace_root)
output_dir = os.path.join(workspace_root, '{package_path}')

# Load and generate
from {module_path} import {graph_attr}
{graph_attr}.generate_bazel_package('{name}', output_dir)
print(f'Generated BUILD.bazel and binaries in {{output_dir}}')
""".format(
        package_path = ctx.attr.output_package.strip("//").replace(":", "/"),
        module_path = ctx.file.graph_file.path.replace("/", ".").replace(".py", ""),
        graph_attr = ctx.attr.graph_attr,
        name = ctx.attr.name.replace(".generate", ""),
    )

    ctx.actions.write(
        output = script,
        content = script_content,
        is_executable = True,
    )

    return [DefaultInfo(executable = script)]
```

3. **Update `DataBuildGraph.generate_bazel_package()` to target source tree**:
```python
def generate_bazel_package(self, name: str, output_dir: str) -> None:
    """Generate BUILD.bazel and binaries into the source directory."""
    # Generate BUILD.bazel with real databuild targets
    self._generate_build_bazel(output_dir, name)

    # Generate job binaries
    self._generate_job_binaries(output_dir)

    # Generate job lookup
    self._generate_job_lookup(output_dir)

    print(f"Generated package in {output_dir}")
    print(f"Run 'bazel build :{name}.analyze' to use")
```

4. **Create the standard BUILD.bazel template** (a sketch of a possible template follows this list):

```python
def _generate_build_bazel(self, output_dir: str, name: str):
    # Generate proper databuild_job and databuild_graph targets
    # that work exactly like hand-written ones
    build_content = self._build_template.format(
        jobs = self._format_jobs(),
        graph_name = f"{name}_graph",
        job_targets = self._format_job_targets(),
    )
    with open(os.path.join(output_dir, "BUILD.bazel"), "w") as f:
        f.write(build_content)
```
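
For item 4, `self._build_template` could be little more than a module-level string with per-job and graph-level placeholders. The sketch below is illustrative only: the `load()` label and the `databuild_graph`/`databuild_job` attribute names (`jobs`, `lookup`) are assumptions that must be aligned with whatever `databuild/rules.bzl` actually defines.

```python
# Illustrative BUILD.bazel template for _generate_build_bazel(); attribute names
# and the load() label are assumptions, not the real databuild rule API.
_BUILD_TEMPLATE = '''\
load("@databuild//databuild:rules.bzl", "databuild_graph", "databuild_job")

{job_targets}

databuild_graph(
    name = "{graph_name}",
    jobs = [{jobs}],
    lookup = ":job_lookup",
)
'''
```
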
#### Interface Design

**For DSL authors:**

```python
# In graph.py
graph = DataBuildGraph("my_graph")

@graph.job
class MyJob(DataBuildJob):
    # ... job definition
    ...
```

**For users:**

```bash
# Generate code (phase 1)
bazel run //my/app:graph.generate

# Use generated code (phase 2)
bazel build //my/app:graph.analyze
bazel run //my/app:graph.service
```

**In `BUILD.bazel`:**

```python
databuild_dsl_generator(
    name = "graph.generate",
    graph_file = "graph.py",
    output_package = "//my/app",
    deps = [":my_deps"],
)

# After generation, this file will contain:
# databuild_graph(name = "graph_graph", ...)
# databuild_job(name = "my_job", ...)
# py_binary(name = "my_job_binary", ...)
```

#### Benefits of This Approach

- ✅ **Works within Bazel constraints** - no dynamic target generation
- ✅ **Follows established patterns** - same approach as protobuf, thrift, and OpenAPI generators
- ✅ **Inspectable output** - users can read the generated BUILD.bazel
- ✅ **Version controllable** - generated files can be checked in if desired
- ✅ **Incremental builds** - standard Bazel caching works as usual
- ✅ **Clean separation** - generation and building are separate phases

#### Tests & Verification

```bash
# Test: Code generation
bazel run //databuild/test/app/dsl:graph.generate
# Should create BUILD.bazel and Python files in the source tree

# Test: Generated targets work
bazel build //databuild/test/app/dsl:graph_graph.analyze
# Should build successfully using the generated BUILD.bazel

# Test: End-to-end functionality
bazel run //databuild/test/app/dsl:graph_graph.analyze -- "color_vote_report/2024-01-01/red"
# Should work exactly like the hand-written graph
```

#### Success Criteria
- Generator creates valid BUILD.bazel in source tree
- Generated targets are indistinguishable from hand-written ones
- Full DataBuild functionality works through generated code
- Clean developer workflow with clear phase separation

### Phase 4: Graph Integration

**Goal**: Generate complete databuild graph targets with all operational variants.

#### Deliverables
- Generate a `databuild_graph` target with analyze/build/service capabilities
- Create all graph variant targets (`.analyze`, `.build`, `.service`, etc.)
- Wire job dependencies into graph configuration
- Generate container deployment targets

#### Implementation Tasks

- Generate the `databuild_graph` target with the complete job list (see the sketch after this list)
- Create all required graph variants:
  - `my_graph.analyze` - planning capability
  - `my_graph.build` - CLI execution
  - `my_graph.service` - HTTP service
  - `my_graph.service.image` - container image
- Configure job lookup and dependency wiring
- Add graph label and identification metadata
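
The "complete job list" wiring can reuse the formatting helpers introduced in Phase 3. A minimal sketch of `_format_jobs()`, assuming the graph keeps its registrations in a `self._jobs` mapping from Bazel target name to DSL job class (an assumption about internal state, not the current implementation):

```python
def _format_jobs(self) -> str:
    """Render the job list that _generate_build_bazel() splices into the graph target."""
    # self._jobs is assumed to map Bazel target names to registered DSL job classes.
    return ", ".join(f'":{name}"' for name in sorted(self._jobs))
```
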
#### Tests & Verification

```bash
# Test: Graph analysis
bazel run @test_graph//:my_graph.analyze -- \
  "color_vote_report/2024-01-01/red"
# Should output the complete job execution plan

# Test: Graph building
bazel run @test_graph//:my_graph.build -- \
  "daily_color_votes/2024-01-01/red"
# Should execute an end-to-end build

# Test: Service deployment
bazel run @test_graph//:my_graph.service -- --port 8081
# Should start an HTTP service on port 8081

# Test: Container generation
bazel build @test_graph//:my_graph.service.image
# Should create a deployable container image
```

#### Success Criteria
- Graph targets provide full databuild functionality
- CLI and service interfaces produce identical results
- All graph operations work with generated job targets
- Container images are deployable and functional

### Phase 5: Dependency Resolution

**Goal**: Handle external pip packages and Bazel dependencies in generated modules.

#### Deliverables
- User-declared dependency system in DSL
- Generated `MODULE.bazel` with proper pip and bazel dependencies
- Dependency validation and conflict resolution
- Support for requirements files and version pinning

#### Implementation Tasks

- Extend the `DataBuildGraph` constructor to accept dependencies:

```python
graph = DataBuildGraph(
    "//my_graph",
    pip_deps=["pandas>=2.0.0", "numpy"],
    bazel_deps=["@my_repo//internal:lib"],
)
```

- Generate `MODULE.bazel` with pip extension configuration:

```python
pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
    hub_name = "pip_deps",
    python_version = "3.11",
    requirements_lock = "//:requirements_lock.txt",
)
```

- Create requirements-file generation from the declared dependencies (see the sketch after this list)
- Add dependency validation during generation
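
A sketch of requirements-file generation plus a basic conflict check over the declared `pip_deps`. The output filename and the regex-based package-name extraction are simplifications; a real implementation would parse PEP 508 requirements properly before feeding the result to the `rules_python` lock-file workflow.

```python
import os
import re


def generate_requirements(pip_deps: list, output_dir: str) -> None:
    """Write requirements.in from DSL-declared pip dependencies (sketch only)."""
    seen = {}
    for dep in pip_deps:
        # Crude package-name extraction; real code should use a PEP 508 parser.
        name = re.split(r"[<>=!~\[ ;]", dep.strip(), maxsplit=1)[0].lower()
        if name in seen and seen[name] != dep:
            raise ValueError(f"Conflicting requirements for '{name}': {seen[name]} vs {dep}")
        seen[name] = dep
    with open(os.path.join(output_dir, "requirements.in"), "w") as f:
        f.write("\n".join(sorted(pip_deps)) + "\n")
```
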
#### Tests & Verification

```bash
# Test: Pip dependencies resolved
bazel build @test_graph//:my_job
# Should succeed with pandas/numpy available

# Test: Cross-module references work
# Generate a graph that depends on @other_repo//lib
bazel build @test_graph//:dependent_job
# Should resolve external bazel dependencies

# Test: Container includes all deps
bazel run @test_graph//:my_graph.service.image_load
docker run databuild_test_graph_service:latest python -c "import pandas"
# Should succeed - pandas is available in the container
```

#### Success Criteria
- Generated modules resolve all external dependencies
- Pip packages are available to job execution
- Cross-repository bazel dependencies work correctly
- Container images include complete dependency closure

### Phase 6: End-to-End Deployment

**Goal**: Complete the production deployment pipeline with observability.

#### Deliverables
- Production-ready container images with proper configuration
- Integration with existing databuild observability systems
- Build event log compatibility
- Performance optimization and resource management

#### Implementation Tasks
- Optimize generated container images for production use
- Ensure build event logging works correctly in generated modules
- Add resource configuration and limits to generated targets
- Create deployment documentation and examples
- Performance testing and optimization

#### Tests & Verification

```bash
./run_e2e_tests.sh
```

#### Success Criteria
- Generated modules are production-ready
- Full observability and logging integration
- Performance meets production requirements
- CLI/Service consistency maintained
- Complete deployment documentation

## Validation Strategy

### Integration with Existing Tests

- Extend `run_e2e_tests.sh` to test generated modules
- Add generated-module tests to the CI/CD pipeline
- Use existing test app DSL as primary test case

### Performance Benchmarks
- Graph analysis speed comparison (DSL vs hand-written bazel)
- Container image size optimization
- Job execution overhead measurement

### Correctness Verification
- Build event log structure validation
- Partition resolution accuracy testing (see the sketch below)
- Dependency resolution completeness checks
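
For example, partition resolution accuracy can be spot-checked by driving the generated job lookup over known partition refs. The sketch below reuses the mapping from the Phase 2 test; the target labels are illustrative and would come from the test app in practice.

```python
import subprocess

# Expected mapping taken from the Phase 2 job-lookup test; extend with more cases.
EXPECTED = {
    "daily_color_votes/2024-01-01/red": "//:ingest_color_votes",
}


def check_partition_resolution() -> None:
    for partition_ref, expected_target in EXPECTED.items():
        result = subprocess.run(
            ["bazel", "run", "@test_graph//:job_lookup", "--", partition_ref],
            capture_output=True, text=True, check=True,
        )
        actual = result.stdout.strip().splitlines()[-1]
        assert actual == expected_target, (
            f"{partition_ref}: got {actual}, expected {expected_target}"
        )


if __name__ == "__main__":
    check_partition_resolution()
    print("Partition resolution spot-check passed")
```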