Add plan for dsl graph generation

This commit is contained in:
Stuart Axelbrooke 2025-08-01 20:17:56 -07:00
parent 2ad4ae6d3c
commit 40d42e03dd


# DSL Graph Generation: Bazel Module Generation from Python DSL
## Motivation & High-Level Goals
### Problem Statement
DataBuild's Python DSL provides an ergonomic interface for defining data processing graphs, but currently lacks a deployment path. Users can define jobs and graphs using the DSL, but cannot easily package and deploy them as complete, hermetic applications. This limits the DSL's utility as a production-ready interface.
### Strategic Goals
1. **Seamless Deployment**: Enable DSL-defined graphs to be built and deployed as complete bazel modules
2. **Hermetic Packaging**: Generate self-contained modules with all dependencies resolved
3. **Interface Consistency**: Maintain CLI/Service interchangeability principle across generated modules
4. **Production Readiness**: Support container deployment and external dependency management
### Success Criteria
- DSL graphs can be compiled to standalone bazel modules (`@my_generated_graph//...`)
- Generated modules support the full databuild interface (analyze, build, service, container images)
- External repositories can depend on databuild core and generate working applications
- End-to-end deployment pipeline from DSL definition to running containers
## Required Reading
### Core Design Documents
- [`DESIGN.md`](../DESIGN.md) - Overall databuild architecture and principles
- [`design/core-build.md`](../design/core-build.md) - Job and graph execution semantics
- [`design/graph-specification.md`](../design/graph-specification.md) - DSL interfaces and patterns
- [`design/service.md`](../design/service.md) - Service interface requirements
- [`design/deploy-strategies.md`](../design/deploy-strategies.md) - Deployment patterns
### Key Source Files
- [`databuild/dsl/python/dsl.py`](../databuild/dsl/python/dsl.py) - Current DSL implementation
- [`databuild/test/app/dsl/graph.py`](../databuild/test/app/dsl/graph.py) - Reference DSL usage
- [`databuild/rules.bzl`](../databuild/rules.bzl) - Bazel rules for jobs and graphs
- [`databuild/databuild.proto`](../databuild/databuild.proto) - Core interfaces
### Understanding Prerequisites
1. **Job Architecture**: Jobs have `.cfg`, `.exec`, and main targets with subcommand pattern
2. **Graph Structure**: Graphs require job lookup, analyze, build, and service variants
3. **Bazel Modules**: External repos use `@workspace//...` references for generated content
4. **CLI/Service Consistency**: Both interfaces must produce identical artifacts and behaviors
## Implementation Plan
### Phase 1: Basic Generation Infrastructure
**Goal**: Establish foundation for generating bazel modules from DSL definitions
#### Deliverables
- Add a `DataBuildGraph.generate_bazel_module()` method
- Generate minimal `MODULE.bazel` with databuild core dependency
- Generate `BUILD.bazel` with job and graph target stubs
- Basic workspace creation and file writing utilities
#### Implementation Tasks
1. Add `generate_bazel_module(workspace_name: str, output_dir: str)` to `DataBuildGraph`
2. Create template system for `MODULE.bazel` and `BUILD.bazel` generation
3. Implement file system utilities for creating workspace structure
4. Add basic validation for DSL graph completeness
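The tasks above might be sketched as follows. This is a hypothetical shape, not the real implementation: the template contents and the `databuild` version pin are assumptions, and the method would live on `DataBuildGraph` rather than at module level.

```python
from pathlib import Path

# Hypothetical MODULE.bazel template; the databuild module name and
# version pin are assumptions, refined by Phase 4's dependency work.
MODULE_TEMPLATE = """\
module(name = "{workspace_name}")

bazel_dep(name = "databuild", version = "{databuild_version}")
"""

def generate_bazel_module(workspace_name: str, output_dir: str) -> None:
    """Write the minimal workspace skeleton from tasks 1-3."""
    root = Path(output_dir)
    root.mkdir(parents=True, exist_ok=True)
    (root / "MODULE.bazel").write_text(
        MODULE_TEMPLATE.format(
            workspace_name=workspace_name,
            databuild_version="0.1.0",  # assumed pin
        )
    )
    # Stub BUILD.bazel; Phases 2-3 fill in job and graph targets.
    (root / "BUILD.bazel").write_text("# generated by the databuild DSL\n")
```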
#### Tests & Verification
```bash
# Test: Basic generation succeeds
python -c "
from databuild.test.app.dsl.graph import graph
graph.generate_bazel_module('test_graph', '/tmp/generated')
"

# Test: Generated files are valid
cd /tmp/generated
bazel build //...  # Should succeed without errors

# Test: Module can be referenced externally
# In a separate workspace:
# bazel build @test_graph//...
```
#### Success Criteria
- Generated `MODULE.bazel` has correct databuild dependency
- Generated `BUILD.bazel` is syntactically valid
- External workspace can reference `@generated_graph//...` targets
- No compilation errors in generated bazel files
---
### Phase 2: Job Binary Generation
**Goal**: Convert DSL job classes into executable databuild job targets
#### Deliverables
- Auto-generate job binary Python files with config/exec subcommand handling
- Create `databuild_job` targets for each DSL job class
- Implement job lookup binary generation
- Wire partition pattern matching to job target resolution
#### Implementation Tasks
1. Create job binary template with subcommand dispatching:
```python
# Generated job_binary.py template
import json
import sys

job_instance = MyDSLJob()  # concrete DSL job class substituted at generation time
if sys.argv[1] == "config":
    config = job_instance.config(parse_outputs(sys.argv[2:]))
    print(json.dumps(config))
elif sys.argv[1] == "exec":
    config = json.loads(sys.stdin.read())
    job_instance.exec(config)
```
2. Generate job lookup binary from DSL job registrations:
```python
# Generated lookup.py
# JOB_MAPPINGS (compiled partition pattern -> job target) is emitted
# from the DSL job registrations at generation time.
def lookup_job_for_partition(partition_ref: str) -> str:
    for pattern, job_target in JOB_MAPPINGS.items():
        if pattern.match(partition_ref):
            return job_target
    raise ValueError(f"No job found for: {partition_ref}")
```
3. Create `databuild_job` targets in generated `BUILD.bazel`
4. Handle DSL job dependencies and imports in generated files
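The generation side of task 2 could emit `JOB_MAPPINGS` along these lines. The registration structure and glob convention are assumptions about the DSL, not its current API:

```python
import re

# Hypothetical view of the DSL's registrations: glob-style partition
# patterns mapped to bazel job targets (names are illustrative).
DSL_JOBS = {
    "daily_color_votes/*/*": "//:ingest_color_votes",
    "color_vote_report/*/*": "//:color_vote_report",
}

def compile_mappings(jobs):
    """Build the compiled-regex JOB_MAPPINGS table consumed by the
    generated lookup.py; '*' matches exactly one path segment."""
    return {
        re.compile("^" + pattern.replace("*", "[^/]+") + "$"): target
        for pattern, target in jobs.items()
    }

JOB_MAPPINGS = compile_mappings(DSL_JOBS)
```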
#### Tests & Verification
```bash
# Test: Job config execution
bazel run @test_graph//:ingest_color_votes.cfg -- \
  "daily_color_votes/2024-01-01/red"
# Should output valid JobConfig JSON

# Test: Job exec execution
echo '{"outputs":[...], "env":{"DATA_DATE":"2024-01-01"}}' | \
  bazel run @test_graph//:ingest_color_votes.exec
# Should execute successfully

# Test: Job lookup
bazel run @test_graph//:job_lookup -- \
  "daily_color_votes/2024-01-01/red"
# Should output: //:ingest_color_votes
```
#### Success Criteria
- All DSL jobs become executable `databuild_job` targets
- Job binaries correctly handle config/exec subcommands
- Job lookup correctly maps partition patterns to job targets
- Generated jobs maintain DSL semantic behavior
---
### Phase 3: Graph Integration
**Goal**: Generate complete databuild graph targets with all operational variants
#### Deliverables
- Generate `databuild_graph` target with analyze/build/service capabilities
- Create all graph variant targets (`.analyze`, `.build`, `.service`, etc.)
- Wire job dependencies into graph configuration
- Generate container deployment targets
#### Implementation Tasks
1. Generate `databuild_graph` target with complete job list
2. Create all required graph variants:
- `my_graph.analyze` - Planning capability
- `my_graph.build` - CLI execution
- `my_graph.service` - HTTP service
- `my_graph.service.image` - Container image
3. Configure job lookup and dependency wiring
4. Add graph label and identification metadata
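Since the variant targets are expected to come from the `databuild_graph` rule itself, the generated `BUILD.bazel` might only need a single graph declaration. A sketch, fenced like the doc's other bazel snippets; the attribute names here are assumptions to be checked against `databuild/rules.bzl`:

```python
# Generated BUILD.bazel (sketch; attribute names are assumed)
load("@databuild//databuild:rules.bzl", "databuild_graph", "databuild_job")

databuild_job(
    name = "ingest_color_votes",
    binary = ":ingest_color_votes_bin",  # Phase 2 generated job binary
)

# databuild_graph is expected to fan out the .analyze/.build/.service
# and .service.image variants from this one declaration.
databuild_graph(
    name = "my_graph",
    jobs = [":ingest_color_votes"],
    lookup = ":job_lookup",
)
```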
#### Tests & Verification
```bash
# Test: Graph analysis
bazel run @test_graph//:my_graph.analyze -- \
  "color_vote_report/2024-01-01/red"
# Should output complete job execution plan

# Test: Graph building
bazel run @test_graph//:my_graph.build -- \
  "daily_color_votes/2024-01-01/red"
# Should execute end-to-end build

# Test: Service deployment
bazel run @test_graph//:my_graph.service -- --port 8081
# Should start HTTP service on port 8081

# Test: Container generation
bazel build @test_graph//:my_graph.service.image
# Should create deployable container image
```
#### Success Criteria
- Graph targets provide full databuild functionality
- CLI and service interfaces produce identical results
- All graph operations work with generated job targets
- Container images are deployable and functional
---
### Phase 4: Dependency Resolution
**Goal**: Handle external pip packages and bazel dependencies in generated modules
#### Deliverables
- User-declared dependency system in DSL
- Generated `MODULE.bazel` with proper pip and bazel dependencies
- Dependency validation and conflict resolution
- Support for requirements files and version pinning
#### Implementation Tasks
1. Extend `DataBuildGraph` constructor to accept dependencies:
```python
graph = DataBuildGraph(
    "//my_graph",
    pip_deps=["pandas>=2.0.0", "numpy"],
    bazel_deps=["@my_repo//internal:lib"],
)
```
2. Generate `MODULE.bazel` with pip extension configuration:
```python
pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
    hub_name = "pip_deps",
    python_version = "3.11",
    requirements_lock = "//:requirements_lock.txt",
)
use_repo(pip, "pip_deps")
```
3. Create requirements file generation from declared dependencies
4. Add dependency validation during generation
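Tasks 3 and 4 might reduce to a small helper like the following sketch. The lock file itself would still come from rules_python's locking workflow, and "validation" here is only duplicate-package detection; the function name is hypothetical:

```python
from pathlib import Path

def write_requirements(pip_deps, output_dir):
    """Emit requirements.txt from the DSL-declared pip deps (task 3),
    rejecting duplicate package declarations (a minimal form of task 4)."""
    # Crude name extraction: strip version specifiers like ">=2.0.0".
    names = [d.split("=")[0].split(">")[0].split("<")[0].strip() for d in pip_deps]
    dupes = {n for n in names if names.count(n) > 1}
    if dupes:
        raise ValueError(f"Conflicting declarations for: {sorted(dupes)}")
    (Path(output_dir) / "requirements.txt").write_text(
        "\n".join(sorted(pip_deps)) + "\n"
    )
```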
#### Tests & Verification
```bash
# Test: Pip dependencies resolved
bazel build @test_graph//:my_job
# Should succeed with pandas/numpy available

# Test: Cross-module references work
# Generate graph that depends on @other_repo//lib
bazel build @test_graph//:dependent_job
# Should resolve external bazel dependencies

# Test: Container includes all deps
bazel run @test_graph//:my_graph.service.image_load
docker run databuild_test_graph_service:latest python -c "import pandas"
# Should succeed - pandas available in container
```
#### Success Criteria
- Generated modules resolve all external dependencies
- Pip packages are available to job execution
- Cross-repository bazel dependencies work correctly
- Container images include complete dependency closure
---
### Phase 5: End-to-End Deployment
**Goal**: Complete production deployment pipeline with observability
#### Deliverables
- Production-ready container images with proper configuration
- Integration with existing databuild observability systems
- Build event log compatibility
- Performance optimization and resource management
#### Implementation Tasks
1. Optimize generated container images for production use
2. Ensure build event logging works correctly in generated modules
3. Add resource configuration and limits to generated targets
4. Create deployment documentation and examples
5. Performance testing and optimization
#### Tests & Verification
```bash
./run_e2e_tests.sh
```
#### Success Criteria
- Generated modules are production-ready
- Full observability and logging integration
- Performance meets production requirements
- CLI/Service consistency maintained
- Complete deployment documentation
## Validation Strategy
### Integration with Existing Tests
- Extend `run_e2e_tests.sh` to test generated modules
- Add generated module tests to CI/CD pipeline
- Use existing test app DSL as primary test case
### Performance Benchmarks
- Graph analysis speed comparison (DSL vs hand-written bazel)
- Container image size optimization
- Job execution overhead measurement
### Correctness Verification
- Build event log structure validation
- Partition resolution accuracy testing
- Dependency resolution completeness checks