DSL Graph Generation: Bazel Module Generation from Python DSL
Motivation & High-Level Goals
Problem Statement
DataBuild's Python DSL provides an ergonomic interface for defining data processing graphs, but currently lacks a deployment path. Users can define jobs and graphs using the DSL, but cannot easily package and deploy them as complete, hermetic applications. This limits the DSL's utility as a production-ready interface.
Strategic Goals
- Seamless Deployment: Enable DSL-defined graphs to be built and deployed as complete bazel modules
- Hermetic Packaging: Generate self-contained modules with all dependencies resolved
- Interface Consistency: Maintain CLI/Service interchangeability principle across generated modules
- Production Readiness: Support container deployment and external dependency management
Success Criteria
- DSL graphs can be compiled to standalone bazel modules (@my_generated_graph//...)
- Generated modules support the full databuild interface (analyze, build, service, container images)
- External repositories can depend on databuild core and generate working applications
- End-to-end deployment pipeline from DSL definition to running containers
Required Reading
Core Design Documents
- DESIGN.md - Overall databuild architecture and principles
- design/core-build.md - Job and graph execution semantics
- design/graph-specification.md - DSL interfaces and patterns
- design/service.md - Service interface requirements
- design/deploy-strategies.md - Deployment patterns
Key Source Files
- databuild/dsl/python/dsl.py - Current DSL implementation
- databuild/test/app/dsl/graph.py - Reference DSL usage
- databuild/rules.bzl - Bazel rules for jobs and graphs
- databuild/databuild.proto - Core interfaces
Understanding Prerequisites
- Job Architecture: Jobs have .cfg, .exec, and main targets with subcommand pattern
- Graph Structure: Graphs require job lookup, analyze, build, and service variants
- Bazel Modules: External repos use @workspace//... references for generated content
- CLI/Service Consistency: Both interfaces must produce identical artifacts and behaviors
Implementation Plan
Phase 1: Basic Generation Infrastructure
Goal: Establish foundation for generating bazel modules from DSL definitions
Deliverables
- Extend DataBuildGraph with a generate_bazel_module() method
- Generate minimal MODULE.bazel with databuild core dependency
- Generate BUILD.bazel with job and graph target stubs
- Basic workspace creation and file writing utilities
Implementation Tasks
- Add generate_bazel_module(workspace_name: str, output_dir: str) to DataBuildGraph
- Create template system for MODULE.bazel and BUILD.bazel generation
- Implement file system utilities for creating workspace structure
- Add basic validation for DSL graph completeness
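The generator's core can be sketched as a small template renderer. This is a minimal illustration of the Phase 1 tasks above, assuming string templates and a pinned databuild version; the actual template contents and the rule names they reference are not final.

```python
import os

# Hypothetical MODULE.bazel template; the databuild module name and
# versioning scheme are assumptions, not the final API.
MODULE_TEMPLATE = """module(name = "{workspace_name}", version = "0.1.0")

bazel_dep(name = "databuild", version = "{databuild_version}")
"""

# Placeholder BUILD.bazel; job and graph targets are filled in by
# later phases of the plan.
BUILD_TEMPLATE = """# Generated by the databuild DSL -- do not edit.
"""

def generate_bazel_module(workspace_name: str, output_dir: str,
                          databuild_version: str = "0.1.0") -> None:
    """Create a minimal bazel module skeleton for a DSL graph."""
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "MODULE.bazel"), "w") as f:
        f.write(MODULE_TEMPLATE.format(workspace_name=workspace_name,
                                       databuild_version=databuild_version))
    with open(os.path.join(output_dir, "BUILD.bazel"), "w") as f:
        f.write(BUILD_TEMPLATE)
```

Keeping the templates as plain module-level strings makes the "basic validation" task straightforward: the generator can assert on rendered output before writing anything to disk.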
Tests & Verification
# Test: Basic generation succeeds
python -c "
from databuild.test.app.dsl.graph import graph
graph.generate_bazel_module('test_graph', '/tmp/generated')
"
# Test: Generated files are valid
cd /tmp/generated
bazel build //... # Should succeed without errors
# Test: Module can be referenced externally
# In separate workspace:
# bazel build @test_graph//...
Success Criteria
- Generated MODULE.bazel has correct databuild dependency
- Generated BUILD.bazel is syntactically valid
- External workspace can reference @generated_graph//... targets
- No compilation errors in generated bazel files
Phase 2: Job Binary Generation
Goal: Convert DSL job classes into executable databuild job targets
Deliverables
- Auto-generate job binary Python files with config/exec subcommand handling
- Create databuild_job targets for each DSL job class
- Implement job lookup binary generation
- Wire partition pattern matching to job target resolution
Implementation Tasks
- Create job binary template with subcommand dispatching:

```python
# Generated job_binary.py template (sketch); MyDSLJob and parse_outputs
# are provided by the generated module's imports
import sys
import json

job_instance = MyDSLJob()
if sys.argv[1] == "config":
    config = job_instance.config(parse_outputs(sys.argv[2:]))
    print(json.dumps(config))
elif sys.argv[1] == "exec":
    config = json.loads(sys.stdin.read())
    job_instance.exec(config)
```

- Generate job lookup binary from DSL job registrations:

```python
# Generated lookup.py; JOB_MAPPINGS is emitted from DSL registrations
def lookup_job_for_partition(partition_ref: str) -> str:
    for pattern, job_target in JOB_MAPPINGS.items():
        if pattern.match(partition_ref):
            return job_target
    raise ValueError(f"No job found for: {partition_ref}")
```

- Create databuild_job targets in generated BUILD.bazel
- Handle DSL job dependencies and imports in generated files
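The BUILD.bazel side of these tasks can also be template-driven. A hedged sketch of emitting one databuild_job stanza per DSL job class follows; the rule's attribute names here (binary) are assumptions, and the real attributes come from databuild/rules.bzl.

```python
# Hypothetical renderer for databuild_job targets in the generated
# BUILD.bazel; attribute names are illustrative, not the real rule API.
JOB_TARGET_TEMPLATE = """databuild_job(
    name = "{name}",
    binary = ":{name}_binary",
)
"""

def render_job_targets(job_names):
    """Render BUILD.bazel stanzas for a list of DSL job names."""
    header = 'load("@databuild//:rules.bzl", "databuild_job")\n\n'
    # Sort so regeneration is deterministic and diffs stay small.
    return header + "\n".join(
        JOB_TARGET_TEMPLATE.format(name=name) for name in sorted(job_names)
    )
```

Deterministic ordering matters here: generated files get checked into or diffed against workspaces, so regeneration should be byte-stable for an unchanged DSL graph.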
Tests & Verification
# Test: Job config execution
bazel run @test_graph//:ingest_color_votes.cfg -- \
"daily_color_votes/2024-01-01/red"
# Should output valid JobConfig JSON
# Test: Job exec execution
echo '{"outputs":[...], "env":{"DATA_DATE":"2024-01-01"}}' | \
bazel run @test_graph//:ingest_color_votes.exec
# Should execute successfully
# Test: Job lookup
bazel run @test_graph//:job_lookup -- \
"daily_color_votes/2024-01-01/red"
# Should output: //:ingest_color_votes
Success Criteria
- All DSL jobs become executable databuild_job targets
- Job binaries correctly handle config/exec subcommands
- Job lookup correctly maps partition patterns to job targets
- Generated jobs maintain DSL semantic behavior
Phase 3: Graph Integration
Goal: Generate complete databuild graph targets with all operational variants
Deliverables
- Generate databuild_graph target with analyze/build/service capabilities
- Create all graph variant targets (.analyze, .build, .service, etc.)
- Wire job dependencies into graph configuration
- Generate container deployment targets
Implementation Tasks
- Generate databuild_graph target with complete job list
- Create all required graph variants:
  - my_graph.analyze - Planning capability
  - my_graph.build - CLI execution
  - my_graph.service - HTTP service
  - my_graph.service.image - Container image
- Configure job lookup and dependency wiring
- Add graph label and identification metadata
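The graph stanza can be rendered the same way as the job targets. This sketch assumes the databuild_graph rule takes a jobs list and a lookup label and that the rule macro itself expands into the .analyze/.build/.service variants; those attribute names are guesses pending the real rule signature in databuild/rules.bzl.

```python
# Hypothetical renderer for the generated databuild_graph target;
# the "jobs" and "lookup" attributes are assumptions.
GRAPH_TEMPLATE = """databuild_graph(
    name = "{name}",
    jobs = [
{jobs}
    ],
    lookup = ":job_lookup",
)
"""

def render_graph_target(name, job_labels):
    """Render the databuild_graph stanza with its complete job list."""
    jobs = "\n".join(f'        "{label}",' for label in sorted(job_labels))
    return GRAPH_TEMPLATE.format(name=name, jobs=jobs)
```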
Tests & Verification
# Test: Graph analysis
bazel run @test_graph//:my_graph.analyze -- \
"color_vote_report/2024-01-01/red"
# Should output complete job execution plan
# Test: Graph building
bazel run @test_graph//:my_graph.build -- \
"daily_color_votes/2024-01-01/red"
# Should execute end-to-end build
# Test: Service deployment
bazel run @test_graph//:my_graph.service -- --port 8081
# Should start HTTP service on port 8081
# Test: Container generation
bazel build @test_graph//:my_graph.service.image
# Should create deployable container image
Success Criteria
- Graph targets provide full databuild functionality
- CLI and service interfaces produce identical results
- All graph operations work with generated job targets
- Container images are deployable and functional
Phase 4: Dependency Resolution
Goal: Handle external pip packages and bazel dependencies in generated modules
Deliverables
- User-declared dependency system in DSL
- Generated MODULE.bazel with proper pip and bazel dependencies
- Dependency validation and conflict resolution
- Support for requirements files and version pinning
Implementation Tasks
- Extend DataBuildGraph constructor to accept dependencies:

```python
graph = DataBuildGraph(
    "//my_graph",
    pip_deps=["pandas>=2.0.0", "numpy"],
    bazel_deps=["@my_repo//internal:lib"],
)
```

- Generate MODULE.bazel with pip extension configuration:

```starlark
pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
    hub_name = "pip_deps",
    python_version = "3.11",
    requirements_lock = "//:requirements_lock.txt",
)
```

- Create requirements file generation from declared dependencies
- Add dependency validation during generation
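The requirements-generation and validation tasks above can be combined into one small step. A sketch under the assumption that conflicts are detected by package name before pip ever runs, so bad declarations fail at generation time:

```python
import re

def render_requirements(pip_deps):
    """Write user-declared pip deps as a requirements.txt body.

    Validation is deliberately minimal: reject two different specs for
    the same package name so conflicting pins surface at generation
    time rather than during pip resolution inside bazel.
    """
    seen = {}
    for spec in pip_deps:
        # Package name is everything before the first version operator.
        name = re.split(r"[<>=!~\[]", spec, maxsplit=1)[0].strip().lower()
        if name in seen and seen[name] != spec:
            raise ValueError(f"Conflicting requirements for {name!r}: "
                             f"{seen[name]!r} vs {spec!r}")
        seen[name] = spec
    return "\n".join(sorted(seen.values())) + "\n"
```

The lock file that pip.parse consumes (requirements_lock.txt) would still be produced by a resolver over this input; this step only normalizes and sanity-checks what the user declared.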
Tests & Verification
# Test: Pip dependencies resolved
bazel build @test_graph//:my_job
# Should succeed with pandas/numpy available
# Test: Cross-module references work
# Generate graph that depends on @other_repo//lib
bazel build @test_graph//:dependent_job
# Should resolve external bazel dependencies
# Test: Container includes all deps
bazel run @test_graph//:my_graph.service.image_load
docker run databuild_test_graph_service:latest python -c "import pandas"
# Should succeed - pandas available in container
Success Criteria
- Generated modules resolve all external dependencies
- Pip packages are available to job execution
- Cross-repository bazel dependencies work correctly
- Container images include complete dependency closure
Phase 5: End-to-End Deployment
Goal: Complete production deployment pipeline with observability
Deliverables
- Production-ready container images with proper configuration
- Integration with existing databuild observability systems
- Build event log compatibility
- Performance optimization and resource management
Implementation Tasks
- Optimize generated container images for production use
- Ensure build event logging works correctly in generated modules
- Add resource configuration and limits to generated targets
- Create deployment documentation and examples
- Performance testing and optimization
Tests & Verification
./run_e2e_tests.sh
Success Criteria
- Generated modules are production-ready
- Full observability and logging integration
- Performance meets production requirements
- CLI/Service consistency maintained
- Complete deployment documentation
Validation Strategy
Integration with Existing Tests
- Extend run_e2e_tests.sh to test generated modules
- Add generated module tests to CI/CD pipeline
- Use existing test app DSL as primary test case
Performance Benchmarks
- Graph analysis speed comparison (DSL vs hand-written bazel)
- Container image size optimization
- Job execution overhead measurement
Correctness Verification
- Build event log structure validation
- Partition resolution accuracy testing
- Dependency resolution completeness checks
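Partition resolution accuracy lends itself to a generation-time check. A hedged sketch, assuming regex-style patterns as in the generated lookup.py above: every sample partition ref must resolve to exactly one job target, catching both gaps and ambiguous overlaps.

```python
import re

def validate_job_mappings(mappings, sample_refs):
    """Check that each sample partition ref matches exactly one job.

    mappings: {pattern_string: job_target_label}
    sample_refs: representative partition refs that must resolve.
    Raises ValueError on zero matches (gap) or >1 match (ambiguity).
    """
    compiled = [(re.compile(p), t) for p, t in mappings.items()]
    for ref in sample_refs:
        hits = [t for p, t in compiled if p.match(ref)]
        if len(hits) != 1:
            raise ValueError(
                f"{ref!r} matched {len(hits)} jobs ({hits}), expected 1")
    return True
```

Run against the test app's known partition refs, this doubles as a regression test for the generated lookup binary without needing a bazel invocation.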