Add python dsl options
Stuart Axelbrooke 2025-07-21 22:52:20 -07:00
parent 24482e2cc4
commit 4bb8af2c74
2 changed files with 107 additions and 1 deletions


@ -1,7 +1,9 @@
# DataBuild
DataBuild is a trivially-deployable, partition-oriented, declarative build system.
Orchestration logic changes frequently, so just don't write it.
DataBuild is a trivially-deployable, partition-oriented, declarative data build system.
For important context, check out [DESIGN.md](./DESIGN.md). Also, check out [`databuild.proto`](./databuild/databuild.proto) for key system interfaces.

104
plans/python-dsl.md Normal file

@ -0,0 +1,104 @@
# Python DSL Exploration: Foundational Ideas for DataBuild's Evolution
This document explores how Python's expressiveness could reshape DataBuild's interface, not as an implementation plan, but as a collection of themes and insights that could inspire future evolution.
## Core Narratives
### 1. The Prefect Inspiration: Achieving 10x Conciseness
DataBuild currently requires ~1000 lines of Bazel + Rust to express what could potentially be written in ~100 lines of Python. This order-of-magnitude difference suggests fundamental opportunities for abstraction.
### 2. From Orchestration to Relations: The SQL Insight
The key realization: if orchestration logic changes frequently, we shouldn't make it easier to write; we should eliminate writing it entirely. Just as SQL focuses on relational algebra rather than execution plans, DataBuild could focus on data relationships rather than orchestration steps.
### 3. The Spectrum of Approaches
#### Pure Python (Maximum Dynamism)
```python
@db.job(outputs=lambda date: [f"processed/{date}"])
def process(date: str, raw: Partition) -> Partition:
    return transform(raw)
```
- Runtime introspection discovers dependencies
- Decorators provide the interface
- Trade-off: Sacrifices compile-time guarantees for expressiveness
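
To make "runtime introspection discovers dependencies" concrete, here is a minimal sketch of how a decorator could record a job and read its inputs from the function signature. Everything in it is an assumption for illustration: `job`, `JobSpec`, `REGISTRY`, and this `Partition` class are hypothetical names, not part of DataBuild today.

```python
import inspect
from dataclasses import dataclass
from typing import Callable, Dict, List


class Partition:
    """Hypothetical handle to a materialized partition."""

    def __init__(self, ref: str):
        self.ref = ref


@dataclass
class JobSpec:
    func: Callable
    outputs: Callable[..., List[str]]
    inputs: List[str]  # parameter names annotated as Partition
    params: List[str]  # remaining parameters, e.g. `date`


# Hypothetical global registry filled in at import time.
REGISTRY: Dict[str, JobSpec] = {}


def job(outputs: Callable[..., List[str]]):
    """Register a function as a job and introspect its signature."""

    def decorator(func: Callable) -> Callable:
        sig = inspect.signature(func)
        # Parameters annotated as Partition are treated as upstream data.
        inputs = [n for n, p in sig.parameters.items() if p.annotation is Partition]
        params = [n for n, p in sig.parameters.items() if p.annotation is not Partition]
        REGISTRY[func.__name__] = JobSpec(func, outputs, inputs, params)
        return func

    return decorator


@job(outputs=lambda date: [f"processed/{date}"])
def process(date: str, raw: Partition) -> Partition:
    return Partition(f"processed/{date}")


spec = REGISTRY["process"]
print(spec.inputs, spec.params, spec.outputs("2025-07-21"))
```

Nothing here is checked until the module is imported, which is the trade-off named above: a mistyped annotation surfaces at runtime rather than at build time.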
#### Hybrid Approaches (Best of Both Worlds)
Multiple strategies explored:
- **Python DSL → Bazel Generation**: Python defines, Bazel executes
- **Python Orchestrator + Bazel Workers**: Python handles coordination, Bazel handles computation
- **Dual-Mode System**: Development in Python, production in Bazel
- **Gradual Migration**: Start pure Python, migrate heavy jobs to Bazel over time
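
As one illustration of the "Python defines, Bazel executes" strategy, the sketch below renders a Python-side job declaration as Starlark text. The rule name `databuild_job` and its attributes are assumptions made up for this example; the real rule names and attributes would come from DataBuild's Bazel rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobDecl:
    """Hypothetical Python-side declaration of a job."""

    name: str
    output_pattern: str
    deps: List[str]


def render_build_target(job: JobDecl) -> str:
    """Render one job as Starlark text (`databuild_job` is a hypothetical rule)."""
    deps = ", ".join(f'"{d}"' for d in job.deps)
    return (
        "databuild_job(\n"
        f'    name = "{job.name}",\n'
        f'    output_pattern = "{job.output_pattern}",\n'
        f"    deps = [{deps}],\n"
        ")\n"
    )


print(render_build_target(JobDecl("process", "processed/{date}", [":ingest"])))
```

Generating text keeps Bazel as the single source of truth for execution, at the cost of a code-generation step that has to stay in sync with the Python declarations.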
#### Pure Declarative (The Ultimate Vision)
```python
@rdb.partition("clean/{date}")
class CleanData:
    @rdb.derives_from("raw/*/{{date}}")
    def transform(self, raw_partitions: List[Partition]) -> Partition:
        # Pure functional relationship, no orchestration
        pass
```
## Foundational Themes
### 1. Declarative Over Imperative
The evolution from "do this, then that" to "this depends on that" represents a fundamental shift in how we think about data pipelines. The interface should express relationships, not recipes.
### 2. Pattern-Based Dependencies
Instead of listing each dependency explicitly, patterns such as `raw/*/{{date}}` (every raw source for a given date) or `features/[date-30:date]` (a trailing 30-day window) can express complex relationships concisely. This mirrors SQL's ability to express joins and windows declaratively.
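
A minimal sketch of how such patterns might be resolved into concrete partition references, using only the standard library. The function names, the single-brace `{date}` placeholder, and the reading of `[date-30:date]` as an inclusive trailing window are all assumptions; the document does not define a pattern grammar.

```python
from datetime import date, timedelta
from fnmatch import fnmatchcase
from typing import Iterable, List


def match_glob(pattern: str, partitions: Iterable[str], **values: str) -> List[str]:
    """Fill `{name}` placeholders, then glob-match against known partition refs."""
    concrete = pattern.format(**values)
    # Note: fnmatch's `*` also matches `/`, so this is looser than a path glob.
    return sorted(p for p in partitions if fnmatchcase(p, concrete))


def expand_window(prefix: str, end: date, days_back: int) -> List[str]:
    """Expand an inclusive trailing window into one partition ref per day."""
    return [
        f"{prefix}/{(end - timedelta(days=n)).isoformat()}"
        for n in range(days_back, -1, -1)
    ]


known = ["raw/us/2025-07-21", "raw/eu/2025-07-21", "raw/us/2025-07-20"]
print(match_glob("raw/*/{date}", known, date="2025-07-21"))
print(expand_window("features", date(2025, 7, 21), 2))
```

A glob needs a catalog of known partitions to match against, while a window can be expanded from the pattern alone; a real pattern language would likely need to distinguish these two cases.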
### 3. Interface/Implementation Separation
The most promising approaches separate:
- **Interface**: How users express data relationships (Python's domain)
- **Implementation**: How computations execute (Bazel/Rust's domain)
### 4. Correctness Through Constraints
Rather than compile-time checking of imperative code, correctness could come from:
- Functional transformations (no side effects)
- Pattern-based completeness (all dependencies captured)
- Relational integrity (cycles impossible by construction)
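
One way to approximate "cycles impossible by construction" is to reject any declaration that would close a loop at the moment it is made, so a cyclic graph can never be registered. The sketch below assumes a hypothetical `RelationGraph` class and uses the standard library's `graphlib` (Python 3.9+) for the check.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Set


class RelationGraph:
    """Hypothetical store of 'output derives from inputs' declarations."""

    def __init__(self) -> None:
        self._deps: Dict[str, Set[str]] = {}

    def declare(self, output: str, derives_from: Set[str]) -> None:
        # Validate a candidate copy so a rejected declaration leaves no trace.
        candidate = {**self._deps, output: set(derives_from)}
        try:
            TopologicalSorter(candidate).prepare()
        except CycleError as err:
            raise ValueError(
                f"declaring {output!r} would create a cycle: {err.args[1]}"
            ) from err
        self._deps = candidate


graph = RelationGraph()
graph.declare("clean", {"raw"})
graph.declare("features", {"clean"})
try:
    graph.declare("raw", {"features"})
except ValueError as exc:
    print(exc)
```

This is detection at declaration time rather than a structural guarantee; a stricter design might make cycles unrepresentable, for example by only allowing references to partitions declared earlier.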
### 5. Runtime Intelligence
With declarative relationships, the system can:
- Build optimal execution plans at runtime
- Adapt to resource availability
- Skip unnecessary recomputation
- Parallelize automatically
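
The capabilities above can be sketched with a small planner that takes declared relationships plus the set of partitions that already exist, and returns "waves" of work where everything in a wave can run in parallel. The function name `plan_waves` is hypothetical, and the sketch assumes an existing partition is always up to date, which a real system would need to verify.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def plan_waves(deps: Dict[str, Set[str]], existing: Set[str]) -> List[List[str]]:
    """Group missing partitions into waves that can each build in parallel."""
    # Drop partitions that already exist, both as targets and as dependencies.
    pending = {
        part: {d for d in upstream if d not in existing}
        for part, upstream in deps.items()
        if part not in existing
    }
    sorter = TopologicalSorter(pending)
    sorter.prepare()
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all of these are unblocked now
        waves.append(ready)
        sorter.done(*ready)
    return waves


deps = {
    "clean/us": {"raw/us"},
    "clean/eu": {"raw/eu"},
    "report": {"clean/us", "clean/eu"},
}
print(plan_waves(deps, existing={"raw/us", "raw/eu", "clean/us"}))
print(plan_waves(deps, existing={"raw/us", "raw/eu"}))
```

Because the plan is computed from relationships and current state, rerunning it after a partial failure yields only the remaining work, with no orchestration code to update.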
## Key Insights
### The Orchestration Paradox
"Orchestration logic changes frequently, so we shouldn't implement it directly at all." This paradox suggests that the solution to complex orchestration isn't better orchestration tools, but eliminating orchestration entirely through declarative relationships.
### The SQL Analogy
SQL's success comes from focusing on relational algebra rather than execution. DataBuild could similarly focus on data relationships rather than build steps. Users declare "what depends on what," not "how to build things."
### The Gradient of Guarantees
Different parts of the system need different guarantees:
- **Relationship declarations**: Need flexibility, benefit from Python
- **Computational execution**: Need hermeticity, benefit from Bazel
- **Runtime planning**: Need intelligence, benefit from Rust
## Future Explorations
### Interface Evolution Paths
1. **Gradual Enhancement**: Keep current Bazel interface, add Python layer on top
2. **Parallel Tracks**: Maintain both Bazel-first and Python-first interfaces
3. **Fundamental Reimagining**: Redesign around pure declarative relationships
### Technical Investigations
- How to preserve Bazel's hermeticity while allowing Python's dynamism
- Pattern matching languages for partition dependencies
- Query planning algorithms for data pipelines
- Time-travel and what-if analysis capabilities
### Philosophical Questions
- Is orchestration a fundamental need or an implementation detail?
- Can we achieve both expressiveness and correctness?
- What would "SQL for data pipelines" actually look like?
## Conclusion
These explorations suggest that DataBuild's future might not be in making orchestration easier, but in making it unnecessary. By focusing on declarative data relationships rather than imperative build steps, we could achieve both the expressiveness of Python and the guarantees of Bazel, while eliminating entire categories of complexity.
The ultimate vision: users declare what data depends on what other data, and the system figures out everything else, much as a SQL engine plans the execution of a query.