
DSL Graph Generation: Bazel Module Generation from Python DSL

Motivation & High-Level Goals

Problem Statement

DataBuild's Python DSL provides an ergonomic interface for defining data processing graphs, but currently lacks a deployment path. Users can define jobs and graphs using the DSL, but cannot easily package and deploy them as complete, hermetic applications. This limits the DSL's utility as a production-ready interface.

Strategic Goals

  1. Seamless Deployment: Enable DSL-defined graphs to be built and deployed as complete bazel modules
  2. Hermetic Packaging: Generate self-contained modules with all dependencies resolved
  3. Interface Consistency: Maintain CLI/Service interchangeability principle across generated modules
  4. Production Readiness: Support container deployment and external dependency management

Success Criteria

  • DSL graphs can be compiled to standalone bazel modules (@my_generated_graph//...)
  • Generated modules support the full databuild interface (analyze, build, service, container images)
  • External repositories can depend on databuild core and generate working applications
  • End-to-end deployment pipeline from DSL definition to running containers

Required Reading

Core Design Documents

Key Source Files

Understanding Prerequisites

  1. Job Architecture: Jobs have .cfg, .exec, and main targets that share a subcommand pattern
  2. Graph Structure: Graphs require job lookup, analyze, build, and service variants
  3. Bazel Modules: External repos use @workspace//... references for generated content
  4. CLI/Service Consistency: Both interfaces must produce identical artifacts and behaviors

Implementation Plan

Phase 1: Basic Generation Infrastructure

Goal: Establish foundation for generating bazel modules from DSL definitions

Deliverables

  • Add a DataBuildGraph.generate_bazel_module() method
  • Generate minimal MODULE.bazel with databuild core dependency
  • Generate BUILD.bazel with job and graph target stubs
  • Basic workspace creation and file writing utilities

Implementation Tasks

  1. Add generate_bazel_module(workspace_name: str, output_dir: str) to DataBuildGraph
  2. Create template system for MODULE.bazel and BUILD.bazel generation (minimal examples are sketched after this list)
  3. Implement file system utilities for creating workspace structure
  4. Add basic validation for DSL graph completeness
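
For orientation, a minimal sketch of what the generated files might contain, assuming databuild is published as a bzlmod module named databuild (the module names, versions, and load path below are placeholders, not a confirmed layout):

# Hypothetical generated MODULE.bazel
module(name = "test_graph", version = "0.1.0")

bazel_dep(name = "databuild", version = "0.1.0")

# Hypothetical generated BUILD.bazel stub; job and graph targets are
# filled in by later phases
load("@databuild//:defs.bzl", "databuild_graph", "databuild_job")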

Tests & Verification

# Test: Basic generation succeeds
python -c "
from databuild.test.app.dsl.graph import graph
graph.generate_bazel_module('test_graph', '/tmp/generated')
"

# Test: Generated files are valid
cd /tmp/generated
bazel build //...  # Should succeed without errors

# Test: Module can be referenced externally
# In separate workspace:
# bazel build @test_graph//...

Success Criteria

  • Generated MODULE.bazel has correct databuild dependency
  • Generated BUILD.bazel is syntactically valid
  • External workspace can reference @generated_graph//... targets
  • No compilation errors in generated bazel files

Phase 2: Job Binary Generation

Goal: Convert DSL job classes into executable databuild job targets

Deliverables

  • Auto-generate job binary Python files with config/exec subcommand handling
  • Create databuild_job targets for each DSL job class
  • Implement job lookup binary generation
  • Wire partition pattern matching to job target resolution

Implementation Tasks

  1. Create job binary template with subcommand dispatching:

    # Generated job_binary.py template
    import json
    import sys

    if sys.argv[1] == "config":
        job_instance = MyDSLJob()
        config = job_instance.config(parse_outputs(sys.argv[2:]))
        print(json.dumps(config))
    elif sys.argv[1] == "exec":
        job_instance = MyDSLJob()
        config = json.loads(sys.stdin.read())
        job_instance.exec(config)
    
  2. Generate job lookup binary from DSL job registrations (a hypothetical JOB_MAPPINGS table is sketched after this list):

    # Generated lookup.py
    def lookup_job_for_partition(partition_ref: str) -> str:
        for pattern, job_target in JOB_MAPPINGS.items():
            if pattern.match(partition_ref):
                return job_target
        raise ValueError(f"No job found for: {partition_ref}")
    
  3. Create databuild_job targets in generated BUILD.bazel

  4. Handle DSL job dependencies and imports in generated files
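
The lookup binary reads from a generated JOB_MAPPINGS table. A hypothetical example follows, with patterns invented from the test app's partition refs; the real generator would derive both patterns and targets from the DSL job registrations:

# Hypothetical generated JOB_MAPPINGS for lookup.py
import re

JOB_MAPPINGS = {
    re.compile(r"^daily_color_votes/[^/]+/[^/]+$"): "//:ingest_color_votes",
    re.compile(r"^color_vote_report/[^/]+/[^/]+$"): "//:color_vote_report",
}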

Tests & Verification

# Test: Job config execution
bazel run @test_graph//:ingest_color_votes.cfg -- \
  "daily_color_votes/2024-01-01/red"
# Should output valid JobConfig JSON

# Test: Job exec execution  
echo '{"outputs":[...], "env":{"DATA_DATE":"2024-01-01"}}' | \
  bazel run @test_graph//:ingest_color_votes.exec
# Should execute successfully

# Test: Job lookup
bazel run @test_graph//:job_lookup -- \
  "daily_color_votes/2024-01-01/red"
# Should output: //:ingest_color_votes

Success Criteria

  • All DSL jobs become executable databuild_job targets
  • Job binaries correctly handle config/exec subcommands
  • Job lookup correctly maps partition patterns to job targets
  • Generated jobs maintain DSL semantic behavior

Phase 3: Two-Phase Code Generation

Goal: Implement proper two-phase code generation that works within Bazel's constraints

Key Learning

Previous attempts failed due to fundamental Bazel constraints:

  • Loading vs Execution phases: load() statements run before genrules execute
  • Dynamic target generation: Bazel requires the complete build graph before execution begins
  • Hermeticity: Generated BUILD files must be in source tree, not bazel-bin

The solution is two-phase generation, following the established pattern of protobuf, thrift, and other code generators.

Two-Phase Workflow

Phase 1: Code Generation (run by developer)

bazel run //databuild/test/app/dsl:graph.generate
# Generates BUILD.bazel and Python binaries into source tree

Phase 2: Building (normal Bazel workflow)

bazel build //databuild/test/app/dsl:graph.analyze
bazel run //databuild/test/app/dsl:graph.service -- --port 8080

Implementation Tasks

  1. Create databuild_dsl_generator rule:

    databuild_dsl_generator(
        name = "graph.generate",
        graph_file = "graph.py", 
        output_package = "//databuild/test/app/dsl",
        deps = [":dsl_src"],
    )
    
  2. Implement generator that writes to source tree:

    def _databuild_dsl_generator_impl(ctx):
        script = ctx.actions.declare_file(ctx.label.name + "_generator.py")
    
        # Create a script that:
        # 1. Loads the DSL graph
        # 2. Generates BUILD.bazel and binaries 
        # 3. Writes them to the source tree
        script_content = """#!/usr/bin/env python3
    import os
    import sys

    # Add workspace root to path
    workspace_root = os.environ.get('BUILD_WORKSPACE_DIRECTORY')
    output_dir = os.path.join(workspace_root, '{package_path}')

    # Load and generate
    from {module_path} import {graph_attr}
    {graph_attr}.generate_bazel_package('{name}', output_dir)
    print(f'Generated BUILD.bazel and binaries in {{output_dir}}')
    """.format(
            package_path = ctx.attr.output_package.strip("//").replace(":", "/"),
            module_path = ctx.file.graph_file.path.replace("/", ".").replace(".py", ""),
            graph_attr = ctx.attr.graph_attr,
            name = ctx.attr.name.replace(".generate", ""),
        )

        ctx.actions.write(
            output = script,
            content = script_content,
            is_executable = True,
        )

        return [DefaultInfo(executable = script)]
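
    # To round out the sketch, a rule declaration binding these attributes.
    # Hypothetical: the attribute set is inferred from the usage above, not
    # taken from an existing implementation.
    databuild_dsl_generator = rule(
        implementation = _databuild_dsl_generator_impl,
        executable = True,
        attrs = {
            "graph_file": attr.label(allow_single_file = [".py"], mandatory = True),
            "graph_attr": attr.string(default = "graph"),
            "output_package": attr.string(mandatory = True),
            "deps": attr.label_list(),
        },
    )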

  3. Update DataBuildGraph.generate_bazel_package() to target source tree:

    def generate_bazel_package(self, name: str, output_dir: str) -> None:
        """Generate BUILD.bazel and binaries into source directory"""
        # Generate BUILD.bazel with real databuild targets
        self._generate_build_bazel(output_dir, name)

        # Generate job binaries
        self._generate_job_binaries(output_dir)

        # Generate job lookup
        self._generate_job_lookup(output_dir)

        print(f"Generated package in {output_dir}")
        print(f"Run 'bazel build :{name}.analyze' to use")
  4. Create standard BUILD.bazel template (the template string is sketched after this list):
    def _generate_build_bazel(self, output_dir: str, name: str):
        # Generate proper databuild_job and databuild_graph targets
        # that will work exactly like hand-written ones
        build_content = self._build_template.format(
            jobs = self._format_jobs(),
            graph_name = f"{name}_graph",
            job_targets = self._format_job_targets(),
        )
    
        with open(os.path.join(output_dir, "BUILD.bazel"), "w") as f:
            f.write(build_content)
    
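The _build_template string itself is not spelled out in this plan. One plausible shape, with the load path and format slots as assumptions:

# Hypothetical _build_template (module-level constant in the generator)
_build_template = """\
load("@databuild//:defs.bzl", "databuild_graph", "databuild_job")

{job_targets}

databuild_graph(
    name = "{graph_name}",
    jobs = [{jobs}],
)
"""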

Interface Design

For DSL Authors:

# In graph.py
graph = DataBuildGraph("my_graph")

@graph.job
class MyJob(DataBuildJob):
    # ... job definition

For Users:

# Generate code (phase 1)
bazel run //my/app:graph.generate

# Use generated code (phase 2) 
bazel build //my/app:graph.analyze
bazel run //my/app:graph.service

In BUILD.bazel:

databuild_dsl_generator(
    name = "graph.generate",
    graph_file = "graph.py",
    output_package = "//my/app", 
    deps = [":my_deps"],
)

# After generation, this file will contain:
# databuild_graph(name = "graph_graph", ...)
# databuild_job(name = "my_job", ...)
# py_binary(name = "my_job_binary", ...)

Benefits of This Approach

  • Works within Bazel constraints: no dynamic target generation
  • Follows established patterns: same approach as protobuf, thrift, and OpenAPI generators
  • Inspectable output: users can read the generated BUILD.bazel
  • Version controllable: generated files can be checked in if desired
  • Incremental builds: standard Bazel caching works unchanged
  • Clean separation: generation and building are distinct phases

Tests & Verification

# Test: Code generation
bazel run //databuild/test/app/dsl:graph.generate
# Should create BUILD.bazel and Python files in source tree

# Test: Generated targets work
bazel build //databuild/test/app/dsl:graph_graph.analyze  
# Should build successfully using generated BUILD.bazel

# Test: End-to-end functionality
bazel run //databuild/test/app/dsl:graph_graph.analyze -- "color_vote_report/2024-01-01/red"
# Should work exactly like hand-written graph

Success Criteria

  • Generator creates valid BUILD.bazel in source tree
  • Generated targets are indistinguishable from hand-written ones
  • Full DataBuild functionality works through generated code
  • Clean developer workflow with clear phase separation

Phase 4: Graph Integration

Goal: Generate complete databuild graph targets with all operational variants

Deliverables

  • Generate databuild_graph target with analyze/build/service capabilities
  • Create all graph variant targets (.analyze, .build, .service, etc.)
  • Wire job dependencies into graph configuration
  • Generate container deployment targets

Implementation Tasks

  1. Generate databuild_graph target with complete job list (sketched after this list)
  2. Create all required graph variants:
    • my_graph.analyze - Planning capability
    • my_graph.build - CLI execution
    • my_graph.service - HTTP service
    • my_graph.service.image - Container image
  3. Configure job lookup and dependency wiring
  4. Add graph label and identification metadata
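
The stanza referenced in task 1 might look like the following; the job names come from the test app used throughout this plan, and the databuild_graph attribute names are assumptions rather than the rule's confirmed API:

databuild_graph(
    name = "my_graph",
    jobs = [
        ":ingest_color_votes",
        ":color_vote_report",
    ],
    lookup = ":job_lookup",
)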

Tests & Verification

# Test: Graph analysis
bazel run @test_graph//:my_graph.analyze -- \
  "color_vote_report/2024-01-01/red"
# Should output complete job execution plan

# Test: Graph building
bazel run @test_graph//:my_graph.build -- \
  "daily_color_votes/2024-01-01/red"
# Should execute end-to-end build

# Test: Service deployment
bazel run @test_graph//:my_graph.service -- --port 8081
# Should start HTTP service on port 8081

# Test: Container generation
bazel build @test_graph//:my_graph.service.image
# Should create deployable container image

Success Criteria

  • Graph targets provide full databuild functionality
  • CLI and service interfaces produce identical results
  • All graph operations work with generated job targets
  • Container images are deployable and functional

Phase 5: Dependency Resolution

Goal: Handle external pip packages and bazel dependencies in generated modules

Deliverables

  • User-declared dependency system in DSL
  • Generated MODULE.bazel with proper pip and bazel dependencies
  • Dependency validation and conflict resolution
  • Support for requirements files and version pinning

Implementation Tasks

  1. Extend DataBuildGraph constructor to accept dependencies:

    graph = DataBuildGraph(
        "//my_graph",
        pip_deps=["pandas>=2.0.0", "numpy"],
        bazel_deps=["@my_repo//internal:lib"]
    )
    
  2. Generate MODULE.bazel with pip extension configuration:

    pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
    pip.parse(
        hub_name = "pip_deps",
        python_version = "3.11", 
        requirements_lock = "//:requirements_lock.txt"
    )
    
  3. Create requirements file generation from declared dependencies (sketched after this list)

  4. Add dependency validation during generation
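
A minimal sketch of task 3, assuming os is imported and self.pip_deps holds the declared constraints; producing the actual requirements_lock.txt would still need a lock step (e.g. pip-compile), which is out of scope here:

def _generate_requirements(self, output_dir: str) -> None:
    # Hypothetical helper: write declared pip deps to a requirements input file
    with open(os.path.join(output_dir, "requirements.in"), "w") as f:
        for dep in self.pip_deps:
            f.write(dep + "\n")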

Tests & Verification

# Test: Pip dependencies resolved
bazel build @test_graph//:my_job
# Should succeed with pandas/numpy available

# Test: Cross-module references work
# Generate graph that depends on @other_repo//lib
bazel build @test_graph//:dependent_job
# Should resolve external bazel dependencies

# Test: Container includes all deps
bazel run @test_graph//:my_graph.service.image_load
docker run databuild_test_graph_service:latest python -c "import pandas"
# Should succeed - pandas available in container

Success Criteria

  • Generated modules resolve all external dependencies
  • Pip packages are available to job execution
  • Cross-repository bazel dependencies work correctly
  • Container images include complete dependency closure

Phase 6: End-to-End Deployment

Goal: Complete production deployment pipeline with observability

Deliverables

  • Production-ready container images with proper configuration
  • Integration with existing databuild observability systems
  • Build event log compatibility
  • Performance optimization and resource management

Implementation Tasks

  1. Optimize generated container images for production use
  2. Ensure build event logging works correctly in generated modules
  3. Add resource configuration and limits to generated targets
  4. Create deployment documentation and examples
  5. Performance testing and optimization

Tests & Verification

./run_e2e_tests.sh

Success Criteria

  • Generated modules are production-ready
  • Full observability and logging integration
  • Performance meets production requirements
  • CLI/Service consistency maintained
  • Complete deployment documentation

Validation Strategy

Integration with Existing Tests

  • Extend run_e2e_tests.sh to test generated modules
  • Add generated module tests to CI/CD pipeline
  • Use existing test app DSL as primary test case

Performance Benchmarks

  • Graph analysis speed comparison (DSL vs hand-written bazel)
  • Container image size optimization
  • Job execution overhead measurement

Correctness Verification

  • Build event log structure validation
  • Partition resolution accuracy testing
  • Dependency resolution completeness checks