DSL Graph Generation: Bazel Module Generation from Python DSL

Motivation & High-Level Goals

Problem Statement

DataBuild's Python DSL provides an ergonomic interface for defining data processing graphs, but currently lacks a deployment path. Users can define jobs and graphs using the DSL, but cannot easily package and deploy them as complete, hermetic applications. This limits the DSL's utility as a production-ready interface.

Strategic Goals

  1. Seamless Deployment: Enable DSL-defined graphs to be built and deployed as complete bazel modules
  2. Hermetic Packaging: Generate self-contained modules with all dependencies resolved
  3. Interface Consistency: Maintain CLI/Service interchangeability principle across generated modules
  4. Production Readiness: Support container deployment and external dependency management

Success Criteria

  • DSL graphs can be compiled to standalone bazel modules (@my_generated_graph//...)
  • Generated modules support the full databuild interface (analyze, build, service, container images)
  • External repositories can depend on databuild core and generate working applications
  • End-to-end deployment pipeline from DSL definition to running containers

Required Reading

Core Design Documents

Key Source Files

Understanding Prerequisites

  1. Job Architecture: Jobs have .cfg, .exec, and main targets with subcommand pattern
  2. Graph Structure: Graphs require job lookup, analyze, build, and service variants
  3. Bazel Modules: External repos use @workspace//... references for generated content
  4. CLI/Service Consistency: Both interfaces must produce identical artifacts and behaviors

Implementation Plan

Phase 1: Basic Generation Infrastructure

Goal: Establish foundation for generating bazel modules from DSL definitions

Deliverables

  • Add DataBuildGraph.generate_bazel_module() method
  • Generate minimal MODULE.bazel with databuild core dependency
  • Generate BUILD.bazel with job and graph target stubs
  • Basic workspace creation and file writing utilities

Implementation Tasks

  1. Add generate_bazel_module(workspace_name: str, output_dir: str) to DataBuildGraph (see the sketch after this list)
  2. Create template system for MODULE.bazel and BUILD.bazel generation
  3. Implement file system utilities for creating workspace structure
  4. Add basic validation for DSL graph completeness
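
A minimal sketch of the generation entry point from task 1, assuming a simple template-substitution approach; the template contents, helper names, and version pin are illustrative rather than final API:

from pathlib import Path

# Illustrative MODULE.bazel template; the real databuild version pin is TBD.
_MODULE_TEMPLATE = """module(name = "{workspace_name}")

bazel_dep(name = "databuild", version = "0.1.0")
"""

def generate_bazel_module(self, workspace_name: str, output_dir: str) -> None:
    """Sketch of the DataBuildGraph.generate_bazel_module method body."""
    root = Path(output_dir)
    root.mkdir(parents=True, exist_ok=True)
    # MODULE.bazel pins the databuild core dependency
    (root / "MODULE.bazel").write_text(_MODULE_TEMPLATE.format(workspace_name=workspace_name))
    # BUILD.bazel holds stubs only in Phase 1; Phases 2-3 fill in jobs and graph variants
    (root / "BUILD.bazel").write_text("# Generated by the databuild DSL\n")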

Tests & Verification

# Test: Basic generation succeeds
python -c "
from databuild.test.app.dsl.graph import graph
graph.generate_bazel_module('test_graph', '/tmp/generated')
"

# Test: Generated files are valid
cd /tmp/generated
bazel build //...  # Should succeed without errors

# Test: Module can be referenced externally
# In separate workspace:
# bazel build @test_graph//...

Success Criteria

  • Generated MODULE.bazel has correct databuild dependency
  • Generated BUILD.bazel is syntactically valid
  • External workspace can reference @generated_graph//... targets
  • No compilation errors in generated bazel files

Phase 2: Job Binary Generation

Goal: Convert DSL job classes into executable databuild job targets

Deliverables

  • Auto-generate job binary Python files with config/exec subcommand handling
  • Create databuild_job targets for each DSL job class
  • Implement job lookup binary generation
  • Wire partition pattern matching to job target resolution

Implementation Tasks

  1. Create job binary template with subcommand dispatching:

    # Generated job_binary.py template
    import json
    import sys

    # MyDSLJob and parse_outputs are emitted into the generated
    # workspace alongside this binary at generation time.
    job_instance = MyDSLJob()

    if sys.argv[1] == "config":
        # Resolve the requested output partitions into a JobConfig, printed as JSON
        config = job_instance.config(parse_outputs(sys.argv[2:]))
        print(json.dumps(config))
    elif sys.argv[1] == "exec":
        # Execute against a JobConfig supplied on stdin
        config = json.loads(sys.stdin.read())
        job_instance.exec(config)
    
  2. Generate job lookup binary from DSL job registrations:

    # Generated lookup.py
    import re

    # JOB_MAPPINGS maps compiled partition patterns to bazel job targets and is
    # emitted from the DSL job registrations, e.g.
    # {re.compile(r"daily_color_votes/.*"): "//:ingest_color_votes"}.
    def lookup_job_for_partition(partition_ref: str) -> str:
        for pattern, job_target in JOB_MAPPINGS.items():
            if pattern.match(partition_ref):
                return job_target
        raise ValueError(f"No job found for: {partition_ref}")
    
  3. Create databuild_job targets in generated BUILD.bazel (see the sketch after this list)

  4. Handle DSL job dependencies and imports in generated files
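
A sketch of how the generator might emit those targets, assuming a databuild_job macro exported from databuild core that derives the .cfg and .exec variants from the main target; the load path and attribute names are assumptions:

# Hypothetical BUILD.bazel rendering; load path and attributes are illustrative.
_JOB_TEMPLATE = '''databuild_job(
    name = "{name}",
    binary = ":{name}_binary",
)
'''

def render_job_targets(job_names: list[str]) -> str:
    header = 'load("@databuild//:defs.bzl", "databuild_job")\n\n'
    # One databuild_job stanza per DSL job class
    return header + "\n".join(_JOB_TEMPLATE.format(name=name) for name in job_names)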

Tests & Verification

# Test: Job config execution
bazel run @test_graph//:ingest_color_votes.cfg -- \
  "daily_color_votes/2024-01-01/red"
# Should output valid JobConfig JSON

# Test: Job exec execution  
echo '{"outputs":[...], "env":{"DATA_DATE":"2024-01-01"}}' | \
  bazel run @test_graph//:ingest_color_votes.exec
# Should execute successfully

# Test: Job lookup
bazel run @test_graph//:job_lookup -- \
  "daily_color_votes/2024-01-01/red"
# Should output: //:ingest_color_votes

Success Criteria

  • All DSL jobs become executable databuild_job targets
  • Job binaries correctly handle config/exec subcommands
  • Job lookup correctly maps partition patterns to job targets
  • Generated jobs maintain DSL semantic behavior

Phase 3: Graph Integration

Goal: Generate complete databuild graph targets with all operational variants

Deliverables

  • Generate databuild_graph target with analyze/build/service capabilities
  • Create all graph variant targets (.analyze, .build, .service, etc.)
  • Wire job dependencies into graph configuration
  • Generate container deployment targets

Implementation Tasks

  1. Generate databuild_graph target with complete job list (sketched after this list)
  2. Create all required graph variants:
    • my_graph.analyze - Planning capability
    • my_graph.build - CLI execution
    • my_graph.service - HTTP service
    • my_graph.service.image - Container image
  3. Configure job lookup and dependency wiring
  4. Add graph label and identification metadata
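
A sketch of the graph stanza the generator might emit, again assuming a databuild_graph macro in databuild core that derives the variant targets (.analyze, .build, .service, .service.image) from the main target; attribute names are assumptions:

# Hypothetical databuild_graph rendering; attribute names are illustrative.
_GRAPH_TEMPLATE = '''databuild_graph(
    name = "{graph_name}",
    jobs = [{job_labels}],
    lookup = ":job_lookup",
)
'''

def render_graph_target(graph_name: str, job_names: list[str]) -> str:
    labels = ", ".join(f'":{name}"' for name in job_names)
    return _GRAPH_TEMPLATE.format(graph_name=graph_name, job_labels=labels)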

Tests & Verification

# Test: Graph analysis
bazel run @test_graph//:my_graph.analyze -- \
  "color_vote_report/2024-01-01/red"
# Should output complete job execution plan

# Test: Graph building
bazel run @test_graph//:my_graph.build -- \
  "daily_color_votes/2024-01-01/red"
# Should execute end-to-end build

# Test: Service deployment
bazel run @test_graph//:my_graph.service -- --port 8081
# Should start HTTP service on port 8081

# Test: Container generation
bazel build @test_graph//:my_graph.service.image
# Should create deployable container image

Success Criteria

  • Graph targets provide full databuild functionality
  • CLI and service interfaces produce identical results
  • All graph operations work with generated job targets
  • Container images are deployable and functional

Phase 4: Dependency Resolution

Goal: Handle external pip packages and bazel dependencies in generated modules

Deliverables

  • User-declared dependency system in DSL
  • Generated MODULE.bazel with proper pip and bazel dependencies
  • Dependency validation and conflict resolution
  • Support for requirements files and version pinning

Implementation Tasks

  1. Extend DataBuildGraph constructor to accept dependencies:

    graph = DataBuildGraph(
        "//my_graph",
        pip_deps=["pandas>=2.0.0", "numpy"],
        bazel_deps=["@my_repo//internal:lib"]
    )
    
  2. Generate MODULE.bazel with pip extension configuration:

    pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
    pip.parse(
        hub_name = "pip_deps",
        python_version = "3.11",
        requirements_lock = "//:requirements_lock.txt",
    )
    use_repo(pip, "pip_deps")
    
  3. Create requirements file generation from declared dependencies (sketched after this list)

  4. Add dependency validation during generation
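
A sketch of the requirements generation step from task 3; producing the requirements_lock.txt referenced from MODULE.bazel still needs a separate resolution step (e.g., pip-compile), which this sketch does not perform:

from pathlib import Path

def write_requirements(pip_deps: list[str], output_dir: str) -> None:
    # One requirement specifier per line, sorted for deterministic output.
    # This file is the input to lock-file generation, not the lock file itself.
    Path(output_dir, "requirements.txt").write_text("\n".join(sorted(pip_deps)) + "\n")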

Tests & Verification

# Test: Pip dependencies resolved
bazel build @test_graph//:my_job
# Should succeed with pandas/numpy available

# Test: Cross-module references work
# Generate graph that depends on @other_repo//lib
bazel build @test_graph//:dependent_job
# Should resolve external bazel dependencies

# Test: Container includes all deps
bazel run @test_graph//:my_graph.service.image_load
docker run databuild_test_graph_service:latest python -c "import pandas"
# Should succeed - pandas available in container

Success Criteria

  • Generated modules resolve all external dependencies
  • Pip packages are available to job execution
  • Cross-repository bazel dependencies work correctly
  • Container images include complete dependency closure

Phase 5: End-to-End Deployment

Goal: Complete production deployment pipeline with observability

Deliverables

  • Production-ready container images with proper configuration
  • Integration with existing databuild observability systems
  • Build event log compatibility
  • Performance optimization and resource management

Implementation Tasks

  1. Optimize generated container images for production use
  2. Ensure build event logging works correctly in generated modules
  3. Add resource configuration and limits to generated targets
  4. Create deployment documentation and examples
  5. Performance testing and optimization

Tests & Verification

./run_e2e_tests.sh

Success Criteria

  • Generated modules are production-ready
  • Full observability and logging integration
  • Performance meets production requirements
  • CLI/Service consistency maintained
  • Complete deployment documentation

Validation Strategy

Integration with Existing Tests

  • Extend run_e2e_tests.sh to test generated modules
  • Add generated module tests to CI/CD pipeline
  • Use existing test app DSL as primary test case

Performance Benchmarks

  • Graph analysis speed comparison (DSL vs hand-written bazel)
  • Container image size optimization
  • Job execution overhead measurement

Correctness Verification

  • Build event log structure validation
  • Partition resolution accuracy testing
  • Dependency resolution completeness checks