# DSL Graph Generation: Bazel Module Generation from Python DSL

## Motivation & High-Level Goals

### Problem Statement
DataBuild's Python DSL provides an ergonomic interface for defining data processing graphs, but it currently lacks a deployment path. Users can define jobs and graphs using the DSL, but cannot easily package and deploy them as complete, hermetic applications. This limits the DSL's utility as a production-ready interface.

### Strategic Goals
1. **Seamless Deployment**: Enable DSL-defined graphs to be built and deployed as complete bazel modules
2. **Hermetic Packaging**: Generate self-contained modules with all dependencies resolved
3. **Interface Consistency**: Maintain the CLI/Service interchangeability principle across generated modules
4. **Production Readiness**: Support container deployment and external dependency management

### Success Criteria
- DSL graphs can be compiled to standalone bazel modules (`@my_generated_graph//...`)
- Generated modules support the full databuild interface (analyze, build, service, container images)
- External repositories can depend on databuild core and generate working applications
- End-to-end deployment pipeline from DSL definition to running containers

## Required Reading

### Core Design Documents
- [`DESIGN.md`](../DESIGN.md) - Overall databuild architecture and principles
- [`design/core-build.md`](../design/core-build.md) - Job and graph execution semantics
- [`design/graph-specification.md`](../design/graph-specification.md) - DSL interfaces and patterns
- [`design/service.md`](../design/service.md) - Service interface requirements
- [`design/deploy-strategies.md`](../design/deploy-strategies.md) - Deployment patterns

### Key Source Files
- [`databuild/dsl/python/dsl.py`](../databuild/dsl/python/dsl.py) - Current DSL implementation
- [`databuild/test/app/dsl/graph.py`](../databuild/test/app/dsl/graph.py) - Reference DSL usage
- [`databuild/rules.bzl`](../databuild/rules.bzl) - Bazel rules for jobs and graphs
- [`databuild/databuild.proto`](../databuild/databuild.proto) - Core interfaces

### Understanding Prerequisites
1. **Job Architecture**: Jobs have `.cfg`, `.exec`, and main targets with a subcommand pattern
2. **Graph Structure**: Graphs require job lookup, analyze, build, and service variants
3. **Bazel Modules**: External repos use `@workspace//...` references for generated content
4. **CLI/Service Consistency**: Both interfaces must produce identical artifacts and behaviors
## Implementation Plan

### Phase 1: Basic Generation Infrastructure
**Goal**: Establish the foundation for generating bazel modules from DSL definitions

#### Deliverables
- Extend `DataBuildGraph` with a `generate_bazel_module()` method
- Generate minimal `MODULE.bazel` with databuild core dependency
- Generate `BUILD.bazel` with job and graph target stubs
- Basic workspace creation and file writing utilities

#### Implementation Tasks
1. Add `generate_bazel_module(workspace_name: str, output_dir: str)` to `DataBuildGraph`
2. Create a template system for `MODULE.bazel` and `BUILD.bazel` generation (see the sketch after this list)
3. Implement file system utilities for creating the workspace structure
4. Add basic validation for DSL graph completeness
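
A minimal sketch of tasks 1-3, assuming `DataBuildGraph` tracks its jobs in a `self.jobs` mapping (that attribute name is a guess) and that plain string templates suffice for Phase 1. Later phases replace the `BUILD.bazel` stub with real job and graph targets.

```python
# Sketch only: shape of the method body intended to live on DataBuildGraph.
# Assumes a `self.jobs` registry; the databuild module name and version in
# the templates are placeholders, not the final API.
from pathlib import Path

MODULE_TEMPLATE = '''module(name = "{workspace_name}", version = "0.1.0")

bazel_dep(name = "databuild", version = "0.1.0")  # placeholder dep name/version
'''

BUILD_STUB_TEMPLATE = '''# Generated by DataBuildGraph.generate_bazel_module() -- do not edit.
# databuild_job / databuild_graph targets are emitted in later phases.
'''


def generate_bazel_module(self, workspace_name: str, output_dir: str) -> None:
    """Write a minimal bazel module for this graph into output_dir."""
    if not self.jobs:
        raise ValueError("Graph defines no jobs; nothing to generate")

    root = Path(output_dir)
    root.mkdir(parents=True, exist_ok=True)
    (root / "MODULE.bazel").write_text(
        MODULE_TEMPLATE.format(workspace_name=workspace_name)
    )
    (root / "BUILD.bazel").write_text(BUILD_STUB_TEMPLATE)
```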

#### Tests & Verification
```bash
# Test: Basic generation succeeds
python -c "
from databuild.test.app.dsl.graph import graph
graph.generate_bazel_module('test_graph', '/tmp/generated')
"

# Test: Generated files are valid
cd /tmp/generated
bazel build //... # Should succeed without errors

# Test: Module can be referenced externally
# In separate workspace:
# bazel build @test_graph//...
```

#### Success Criteria
- Generated `MODULE.bazel` has correct databuild dependency
- Generated `BUILD.bazel` is syntactically valid
- External workspace can reference `@generated_graph//...` targets
- No compilation errors in generated bazel files

---

### Phase 2: Job Binary Generation
**Goal**: Convert DSL job classes into executable databuild job targets

#### Deliverables
- Auto-generate job binary Python files with config/exec subcommand handling
- Create `databuild_job` targets for each DSL job class
- Implement job lookup binary generation
- Wire partition pattern matching to job target resolution

#### Implementation Tasks
1. Create job binary template with subcommand dispatching:
   ```python
   # Generated job_binary.py template
   import json
   import sys

   job_instance = MyDSLJob()

   if sys.argv[1] == "config":
       # Resolve the requested outputs into a JobConfig and emit it as JSON
       config = job_instance.config(parse_outputs(sys.argv[2:]))
       print(json.dumps(config))
   elif sys.argv[1] == "exec":
       # Read the JobConfig from stdin and execute the job
       config = json.loads(sys.stdin.read())
       job_instance.exec(config)
   ```

2. Generate job lookup binary from DSL job registrations:
   ```python
   # Generated lookup.py
   # JOB_MAPPINGS: compiled partition-ref pattern -> bazel job target,
   # emitted from the DSL job registrations at generation time.

   def lookup_job_for_partition(partition_ref: str) -> str:
       for pattern, job_target in JOB_MAPPINGS.items():
           if pattern.match(partition_ref):
               return job_target
       raise ValueError(f"No job found for: {partition_ref}")
   ```

3. Create `databuild_job` targets in the generated `BUILD.bazel` (see the sketch after this list)
4. Handle DSL job dependencies and imports in generated files
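
To make task 3 concrete, a rough sketch of how the generator might emit those targets. The `load()` label and the `databuild_job` attribute names used below (`binary`, `visibility`) are illustrative guesses, not the real rule signature from `databuild/rules.bzl`.

```python
# Sketch only: renders a BUILD.bazel with one databuild_job per DSL job class.
# The load() label and attribute names are illustrative placeholders; see
# databuild/rules.bzl for the actual rule.
from typing import Iterable


def render_build_file(job_names: Iterable[str]) -> str:
    parts = ['load("@databuild//databuild:rules.bzl", "databuild_job")']
    for name in job_names:
        parts.append(
            f'''
databuild_job(
    name = "{name}",
    binary = ":{name}_binary",  # the generated job_binary.py wrapper
    visibility = ["//visibility:public"],
)'''
        )
    return "\n".join(parts) + "\n"
```

For example, `render_build_file(["ingest_color_votes"])` produces a file that can replace the Phase 1 `BUILD.bazel` stub.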

#### Tests & Verification
```bash
# Test: Job config execution
bazel run @test_graph//:ingest_color_votes.cfg -- \
  "daily_color_votes/2024-01-01/red"
# Should output valid JobConfig JSON

# Test: Job exec execution
echo '{"outputs":[...], "env":{"DATA_DATE":"2024-01-01"}}' | \
  bazel run @test_graph//:ingest_color_votes.exec
# Should execute successfully

# Test: Job lookup
bazel run @test_graph//:job_lookup -- \
  "daily_color_votes/2024-01-01/red"
# Should output: //:ingest_color_votes
```

#### Success Criteria
- All DSL jobs become executable `databuild_job` targets
- Job binaries correctly handle config/exec subcommands
- Job lookup correctly maps partition patterns to job targets
- Generated jobs maintain DSL semantic behavior

---

### Phase 3: Graph Integration
**Goal**: Generate complete databuild graph targets with all operational variants

#### Deliverables
- Generate `databuild_graph` target with analyze/build/service capabilities
- Create all graph variant targets (`.analyze`, `.build`, `.service`, etc.)
- Wire job dependencies into graph configuration
- Generate container deployment targets

#### Implementation Tasks
1. Generate the `databuild_graph` target with the complete job list (see the sketch after this list)
2. Create all required graph variants:
   - `my_graph.analyze` - Planning capability
   - `my_graph.build` - CLI execution
   - `my_graph.service` - HTTP service
   - `my_graph.service.image` - Container image
3. Configure job lookup and dependency wiring
4. Add graph label and identification metadata
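
Continuing the rendering sketch from Phase 2, the graph stanza might look roughly like the following. The attribute names (`jobs`, `job_lookup`) are placeholders rather than the actual `databuild_graph` signature in `databuild/rules.bzl`, and the `.analyze`/`.build`/`.service` variants are presumed to be created by the rule itself, as they are for hand-written graphs.

```python
# Sketch only: appends a databuild_graph stanza to the generated BUILD.bazel.
# Attribute names here (jobs, job_lookup) are illustrative guesses; the real
# rule signature is defined in databuild/rules.bzl.
from typing import Iterable


def render_graph_target(graph_name: str, job_names: Iterable[str]) -> str:
    jobs = ", ".join(f'":{name}"' for name in job_names)
    return f'''
databuild_graph(
    name = "{graph_name}",
    jobs = [{jobs}],
    job_lookup = ":job_lookup",  # the Phase 2 lookup binary
    visibility = ["//visibility:public"],
)
'''
```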

#### Tests & Verification
```bash
# Test: Graph analysis
bazel run @test_graph//:my_graph.analyze -- \
  "color_vote_report/2024-01-01/red"
# Should output complete job execution plan

# Test: Graph building
bazel run @test_graph//:my_graph.build -- \
  "daily_color_votes/2024-01-01/red"
# Should execute end-to-end build

# Test: Service deployment
bazel run @test_graph//:my_graph.service -- --port 8081
# Should start HTTP service on port 8081

# Test: Container generation
bazel build @test_graph//:my_graph.service.image
# Should create deployable container image
```

#### Success Criteria
- Graph targets provide full databuild functionality
- CLI and service interfaces produce identical results
- All graph operations work with generated job targets
- Container images are deployable and functional

---

### Phase 4: Dependency Resolution
**Goal**: Handle external pip packages and bazel dependencies in generated modules

#### Deliverables
- User-declared dependency system in DSL
- Generated `MODULE.bazel` with proper pip and bazel dependencies
- Dependency validation and conflict resolution
- Support for requirements files and version pinning

#### Implementation Tasks
1. Extend the `DataBuildGraph` constructor to accept dependencies:
   ```python
   graph = DataBuildGraph(
       "//my_graph",
       pip_deps=["pandas>=2.0.0", "numpy"],
       bazel_deps=["@my_repo//internal:lib"],
   )
   ```

2. Generate `MODULE.bazel` with pip extension configuration:
   ```python
   pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
   pip.parse(
       hub_name = "pip_deps",
       python_version = "3.11",
       requirements_lock = "//:requirements_lock.txt",
   )
   use_repo(pip, "pip_deps")  # make @pip_deps visible to the module
   ```

3. Create requirements file generation from declared dependencies (see the sketch after this list)
4. Add dependency validation during generation
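
For task 3, a minimal sketch that turns the declared `pip_deps` into a requirements file. Producing the pinned `requirements_lock.txt` that `pip.parse` expects is assumed to be a separate locking step (e.g. pip-compile), not shown here.

```python
# Sketch only: writes requirements.txt from the dependencies declared on the
# graph. Locking (requirements_lock.txt) is assumed to be a separate step.
from pathlib import Path
from typing import Iterable


def write_requirements(output_dir: str, pip_deps: Iterable[str]) -> Path:
    """Write one requirement specifier per line, sorted for stable diffs."""
    path = Path(output_dir) / "requirements.txt"
    path.write_text("\n".join(sorted(pip_deps)) + "\n")
    return path


# e.g. write_requirements("/tmp/generated", ["pandas>=2.0.0", "numpy"])
```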

#### Tests & Verification
```bash
# Test: Pip dependencies resolved
bazel build @test_graph//:my_job
# Should succeed with pandas/numpy available

# Test: Cross-module references work
# Generate graph that depends on @other_repo//lib
bazel build @test_graph//:dependent_job
# Should resolve external bazel dependencies

# Test: Container includes all deps
bazel run @test_graph//:my_graph.service.image_load
docker run databuild_test_graph_service:latest python -c "import pandas"
# Should succeed - pandas available in container
```

#### Success Criteria
- Generated modules resolve all external dependencies
- Pip packages are available to job execution
- Cross-repository bazel dependencies work correctly
- Container images include complete dependency closure

---

### Phase 5: End-to-End Deployment
**Goal**: Complete the production deployment pipeline with observability

#### Deliverables
- Production-ready container images with proper configuration
- Integration with existing databuild observability systems
- Build event log compatibility
- Performance optimization and resource management

#### Implementation Tasks
1. Optimize generated container images for production use
2. Ensure build event logging works correctly in generated modules
3. Add resource configuration and limits to generated targets
4. Create deployment documentation and examples
5. Performance testing and optimization

#### Tests & Verification
```bash
./run_e2e_tests.sh
```

#### Success Criteria
- Generated modules are production-ready
- Full observability and logging integration
- Performance meets production requirements
- CLI/Service consistency maintained
- Complete deployment documentation

## Validation Strategy

### Integration with Existing Tests
- Extend `run_e2e_tests.sh` to test generated modules
- Add generated module tests to the CI/CD pipeline
- Use the existing test app DSL as the primary test case (see the sketch below)
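
A sketch of what such a test could look like, assuming pytest and a `bazel` binary on PATH; the targets and imports mirror the Phase 1 verification commands.

```python
# Sketch only: generates a module from the test app DSL into a temp directory
# and checks that bazel can build it, mirroring the Phase 1 verification steps.
# Assumes pytest and a `bazel` binary on PATH; not yet wired into CI.
import subprocess

from databuild.test.app.dsl.graph import graph


def test_generated_module_builds(tmp_path):
    graph.generate_bazel_module("test_graph", str(tmp_path))

    # The generator should emit at least the module and build files
    assert (tmp_path / "MODULE.bazel").exists()
    assert (tmp_path / "BUILD.bazel").exists()

    # Generated files are valid: `bazel build //...` should succeed
    result = subprocess.run(
        ["bazel", "build", "//..."],
        cwd=tmp_path,
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0, result.stderr
```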

### Performance Benchmarks
- Graph analysis speed comparison (DSL vs hand-written bazel)
- Container image size optimization
- Job execution overhead measurement

### Correctness Verification
- Build event log structure validation
- Partition resolution accuracy testing
- Dependency resolution completeness checks