From 40d42e03dd0a0b476e1ab0048241b57376ad903d Mon Sep 17 00:00:00 2001
From: Stuart Axelbrooke
Date: Fri, 1 Aug 2025 20:17:56 -0700
Subject: [PATCH] Add plan for dsl graph generation

---
 plans/dsl-graph-generation.md | 292 ++++++++++++++++++++++++++++++++++
 1 file changed, 292 insertions(+)
 create mode 100644 plans/dsl-graph-generation.md

diff --git a/plans/dsl-graph-generation.md b/plans/dsl-graph-generation.md
new file mode 100644
index 0000000..f563856
--- /dev/null
+++ b/plans/dsl-graph-generation.md
@@ -0,0 +1,292 @@
+# DSL Graph Generation: Bazel Module Generation from Python DSL
+
+## Motivation & High-Level Goals
+
+### Problem Statement
+DataBuild's Python DSL provides an ergonomic interface for defining data processing graphs, but currently lacks a deployment path. Users can define jobs and graphs using the DSL, but cannot easily package and deploy them as complete, hermetic applications. This limits the DSL's utility as a production-ready interface.
+
+### Strategic Goals
+1. **Seamless Deployment**: Enable DSL-defined graphs to be built and deployed as complete bazel modules
+2. **Hermetic Packaging**: Generate self-contained modules with all dependencies resolved
+3. **Interface Consistency**: Maintain CLI/Service interchangeability principle across generated modules
+4. **Production Readiness**: Support container deployment and external dependency management
+
+### Success Criteria
+- DSL graphs can be compiled to standalone bazel modules (`@my_generated_graph//...`)
+- Generated modules support the full databuild interface (analyze, build, service, container images)
+- External repositories can depend on databuild core and generate working applications
+- End-to-end deployment pipeline from DSL definition to running containers
+
+## Required Reading
+
+### Core Design Documents
+- [`DESIGN.md`](../DESIGN.md) - Overall databuild architecture and principles
+- [`design/core-build.md`](../design/core-build.md) - Job and graph execution semantics
+- [`design/graph-specification.md`](../design/graph-specification.md) - DSL interfaces and patterns
+- [`design/service.md`](../design/service.md) - Service interface requirements
+- [`design/deploy-strategies.md`](../design/deploy-strategies.md) - Deployment patterns
+
+### Key Source Files
+- [`databuild/dsl/python/dsl.py`](../databuild/dsl/python/dsl.py) - Current DSL implementation
+- [`databuild/test/app/dsl/graph.py`](../databuild/test/app/dsl/graph.py) - Reference DSL usage
+- [`databuild/rules.bzl`](../databuild/rules.bzl) - Bazel rules for jobs and graphs
+- [`databuild/databuild.proto`](../databuild/databuild.proto) - Core interfaces
+
+### Understanding Prerequisites
+1. **Job Architecture**: Jobs have `.cfg`, `.exec`, and main targets with subcommand pattern
+2. **Graph Structure**: Graphs require job lookup, analyze, build, and service variants
+3. **Bazel Modules**: External repos use `@workspace//...` references for generated content
+4. **CLI/Service Consistency**: Both interfaces must produce identical artifacts and behaviors
+
+## Implementation Plan
+
+### Phase 1: Basic Generation Infrastructure
+**Goal**: Establish foundation for generating bazel modules from DSL definitions
+
+#### Deliverables
+- Add a `DataBuildGraph.generate_bazel_module()` method
+- Generate minimal `MODULE.bazel` with databuild core dependency
+- Generate `BUILD.bazel` with job and graph target stubs
+- Basic workspace creation and file writing utilities
+
+#### Implementation Tasks
+1. Add `generate_bazel_module(workspace_name: str, output_dir: str)` to `DataBuildGraph`
+2. Create template system for `MODULE.bazel` and `BUILD.bazel` generation
+3. Implement file system utilities for creating workspace structure
+4. Add basic validation for DSL graph completeness
+
+#### Tests & Verification
+```bash
+# Test: Basic generation succeeds
+python -c "
+from databuild.test.app.dsl.graph import graph
+graph.generate_bazel_module('test_graph', '/tmp/generated')
+"
+
+# Test: Generated files are valid
+cd /tmp/generated
+bazel build //...  # Should succeed without errors
+
+# Test: Module can be referenced externally
+# In separate workspace:
+# bazel build @test_graph//...
+```
+
+#### Success Criteria
+- Generated `MODULE.bazel` has correct databuild dependency
+- Generated `BUILD.bazel` is syntactically valid
+- External workspace can reference `@generated_graph//...` targets
+- No compilation errors in generated bazel files
+
+---
+
+### Phase 2: Job Binary Generation
+**Goal**: Convert DSL job classes into executable databuild job targets
+
+#### Deliverables
+- Auto-generate job binary Python files with config/exec subcommand handling
+- Create `databuild_job` targets for each DSL job class
+- Implement job lookup binary generation
+- Wire partition pattern matching to job target resolution
+
+#### Implementation Tasks
+1. Create job binary template with subcommand dispatching:
+   ```python
+   # Generated job_binary.py template
+   # MyDSLJob and parse_outputs are placeholders filled in by the generator
+   import json
+   import sys
+
+   job_instance = MyDSLJob()
+   if sys.argv[1] == "config":
+       # Resolve requested output partitions into a JobConfig, emitted as JSON
+       config = job_instance.config(parse_outputs(sys.argv[2:]))
+       print(json.dumps(config))
+   elif sys.argv[1] == "exec":
+       # Read the JobConfig produced by the config step from stdin and execute
+       config = json.loads(sys.stdin.read())
+       job_instance.exec(config)
+   ```
+
+2. Generate job lookup binary from DSL job registrations:
+   ```python
+   # Generated lookup.py
+   # JOB_MAPPINGS maps compiled partition-ref patterns to bazel job targets
+   def lookup_job_for_partition(partition_ref: str) -> str:
+       for pattern, job_target in JOB_MAPPINGS.items():
+           if pattern.match(partition_ref):
+               return job_target
+       raise ValueError(f"No job found for: {partition_ref}")
+   ```
+
+3. Create `databuild_job` targets in generated `BUILD.bazel`
+4. Handle DSL job dependencies and imports in generated files
+
+#### Tests & Verification
+```bash
+# Test: Job config execution
+bazel run @test_graph//:ingest_color_votes.cfg -- \
+  "daily_color_votes/2024-01-01/red"
+# Should output valid JobConfig JSON
+
+# Test: Job exec execution
+echo '{"outputs":[...], "env":{"DATA_DATE":"2024-01-01"}}' | \
+  bazel run @test_graph//:ingest_color_votes.exec
+# Should execute successfully
+
+# Test: Job lookup
+bazel run @test_graph//:job_lookup -- \
+  "daily_color_votes/2024-01-01/red"
+# Should output: //:ingest_color_votes
+```
+
+#### Success Criteria
+- All DSL jobs become executable `databuild_job` targets
+- Job binaries correctly handle config/exec subcommands
+- Job lookup correctly maps partition patterns to job targets
+- Generated jobs maintain DSL semantic behavior
+
+---
+
+### Phase 3: Graph Integration
+**Goal**: Generate complete databuild graph targets with all operational variants
+
+#### Deliverables
+- Generate `databuild_graph` target with analyze/build/service capabilities
+- Create all graph variant targets (`.analyze`, `.build`, `.service`, etc.)
+- Wire job dependencies into graph configuration
+- Generate container deployment targets
+
+#### Implementation Tasks
+1. Generate `databuild_graph` target with complete job list
+2. Create all required graph variants:
+   - `my_graph.analyze` - Planning capability
+   - `my_graph.build` - CLI execution
+   - `my_graph.service` - HTTP service
+   - `my_graph.service.image` - Container image
+3. Configure job lookup and dependency wiring
+4. Add graph label and identification metadata
+
+#### Tests & Verification
+```bash
+# Test: Graph analysis
+bazel run @test_graph//:my_graph.analyze -- \
+  "color_vote_report/2024-01-01/red"
+# Should output complete job execution plan
+
+# Test: Graph building
+bazel run @test_graph//:my_graph.build -- \
+  "daily_color_votes/2024-01-01/red"
+# Should execute end-to-end build
+
+# Test: Service deployment
+bazel run @test_graph//:my_graph.service -- --port 8081
+# Should start HTTP service on port 8081
+
+# Test: Container generation
+bazel build @test_graph//:my_graph.service.image
+# Should create deployable container image
+```
+
+#### Success Criteria
+- Graph targets provide full databuild functionality
+- CLI and service interfaces produce identical results
+- All graph operations work with generated job targets
+- Container images are deployable and functional
+
+---
+
+### Phase 4: Dependency Resolution
+**Goal**: Handle external pip packages and bazel dependencies in generated modules
+
+#### Deliverables
+- User-declared dependency system in DSL
+- Generated `MODULE.bazel` with proper pip and bazel dependencies
+- Dependency validation and conflict resolution
+- Support for requirements files and version pinning
+
+#### Implementation Tasks
+1. Extend `DataBuildGraph` constructor to accept dependencies:
+   ```python
+   graph = DataBuildGraph(
+       "//my_graph",
+       pip_deps=["pandas>=2.0.0", "numpy"],
+       bazel_deps=["@my_repo//internal:lib"]
+   )
+   ```
+
+2. Generate `MODULE.bazel` with pip extension configuration:
+   ```python
+   pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
+   pip.parse(
+       hub_name = "pip_deps",
+       python_version = "3.11",
+       requirements_lock = "//:requirements_lock.txt"
+   )
+   ```
+
+3. Create requirements file generation from declared dependencies
+4. Add dependency validation during generation
+
+#### Tests & Verification
+```bash
+# Test: Pip dependencies resolved
+bazel build @test_graph//:my_job
+# Should succeed with pandas/numpy available
+
+# Test: Cross-module references work
+# Generate graph that depends on @other_repo//lib
+bazel build @test_graph//:dependent_job
+# Should resolve external bazel dependencies
+
+# Test: Container includes all deps
+bazel run @test_graph//:my_graph.service.image_load
+docker run databuild_test_graph_service:latest python -c "import pandas"
+# Should succeed - pandas available in container
+```
+
+#### Success Criteria
+- Generated modules resolve all external dependencies
+- Pip packages are available to job execution
+- Cross-repository bazel dependencies work correctly
+- Container images include complete dependency closure
+
+---
+
+### Phase 5: End-to-End Deployment
+**Goal**: Complete production deployment pipeline with observability
+
+#### Deliverables
+- Production-ready container images with proper configuration
+- Integration with existing databuild observability systems
+- Build event log compatibility
+- Performance optimization and resource management
+
+#### Implementation Tasks
+1. Optimize generated container images for production use
+2. Ensure build event logging works correctly in generated modules
+3. Add resource configuration and limits to generated targets
+4. Create deployment documentation and examples
+5. Performance testing and optimization
+
+#### Tests & Verification
+```bash
+./run_e2e_tests.sh
+```
+
+#### Success Criteria
+- Generated modules are production-ready
+- Full observability and logging integration
+- Performance meets production requirements
+- CLI/Service consistency maintained
+- Complete deployment documentation
+
+## Validation Strategy
+
+### Integration with Existing Tests
+- Extend `run_e2e_tests.sh` to test generated modules
+- Add generated module tests to CI/CD pipeline
+- Use existing test app DSL as primary test case
+
+### Performance Benchmarks
+- Graph analysis speed comparison (DSL vs hand-written bazel)
+- Container image size optimization
+- Job execution overhead measurement
+
+### Correctness Verification
+- Build event log structure validation
+- Partition resolution accuracy testing
+- Dependency resolution completeness checks
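
As a concrete illustration of the Phase 1 deliverable, the sketch below shows roughly what a `generate_bazel_module` implementation could look like. It is a hypothetical sketch, not the actual implementation: the template contents, the `databuild` `bazel_dep` coordinates, and the version strings are all assumptions.

```python
import os
import string
import tempfile

# Minimal MODULE.bazel template; the databuild bazel_dep name/version are assumed.
MODULE_TEMPLATE = string.Template(
    'module(name = "$workspace_name", version = "0.1.0")\n'
    'bazel_dep(name = "databuild", version = "0.1.0")\n'
)

# Stub BUILD.bazel; job and graph targets would be filled in by later phases.
BUILD_TEMPLATE = (
    'load("@databuild//:rules.bzl", "databuild_job", "databuild_graph")\n'
    "# Job and graph targets are generated in later phases.\n"
)


def generate_bazel_module(workspace_name: str, output_dir: str) -> None:
    """Write the minimal workspace skeleton described in Phase 1."""
    os.makedirs(output_dir, exist_ok=True)
    files = {
        "MODULE.bazel": MODULE_TEMPLATE.substitute(workspace_name=workspace_name),
        "BUILD.bazel": BUILD_TEMPLATE,
    }
    for filename, content in files.items():
        with open(os.path.join(output_dir, filename), "w") as fh:
            fh.write(content)


# Example usage mirroring the Phase 1 verification step
out_dir = os.path.join(tempfile.mkdtemp(), "generated")
generate_bazel_module("test_graph", out_dir)
module_text = open(os.path.join(out_dir, "MODULE.bazel")).read()
print(module_text)
```

In this shape, the DSL object only needs to know its job list and declared dependencies; everything else is string templating plus file writes, which keeps Phase 1 testable without invoking bazel at generation time.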