Python DSL Exploration: Foundational Ideas for DataBuild's Evolution
This document explores how Python's expressiveness could reshape DataBuild's interface. It is not an implementation plan, but a collection of themes and insights that could inspire future evolution.
Core Narratives
1. The Prefect Inspiration: Achieving 10x Conciseness
DataBuild currently requires ~1000 lines of Bazel + Rust to express what could potentially be ~100 lines of Python. This order-of-magnitude difference suggests fundamental opportunities for abstraction.
2. From Orchestration to Relations: The SQL Insight
The key realization: if orchestration logic changes frequently, we shouldn't make it easier to write; we should eliminate writing it entirely. Just as SQL focuses on relational algebra rather than execution plans, DataBuild could focus on data relationships rather than orchestration steps.
3. The Spectrum of Approaches
Pure Python (Maximum Dynamism)
```python
@db.job(outputs=lambda date: [f"processed/{date}"])
def process(date: str, raw: Partition) -> Partition:
    return transform(raw)
```
- Runtime introspection discovers dependencies
- Decorators provide the interface
- Trade-off: Sacrifices compile-time guarantees for expressiveness
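The decorator mechanics above can be sketched concretely. This is a toy illustration under stated assumptions, not DataBuild's actual API: `JOBS`, `job`, and the `process` body here are hypothetical, showing only how a decorator could register a job and defer output resolution to planning time.

```python
# Toy registry sketch (hypothetical, not DataBuild's real API): the decorator
# records each function plus its `outputs` lambda, which a planner can call
# at runtime to discover what a job produces for a given partition key.
JOBS = {}

def job(outputs):
    def register(fn):
        JOBS[fn.__name__] = {"fn": fn, "outputs": outputs}
        return fn
    return register

@job(outputs=lambda date: [f"processed/{date}"])
def process(date, raw):
    # Body runs only at execution time; registration is pure metadata.
    return raw

# The planner can ask what `process` produces without running it:
print(JOBS["process"]["outputs"]("2024-01-01"))  # ['processed/2024-01-01']
```

The point of the sketch: because `outputs` is a callable rather than a static declaration, dependency discovery happens by introspection at runtime.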
Hybrid Approaches (Best of Both Worlds)
Multiple strategies explored:
- Python DSL → Bazel Generation: Python defines, Bazel executes
- Python Orchestrator + Bazel Workers: Python handles coordination, Bazel handles computation
- Dual-Mode System: Development in Python, production in Bazel
- Gradual Migration: Start pure Python, migrate heavy jobs to Bazel over time
Pure Declarative (The Ultimate Vision)
```python
@rdb.partition("clean/{date}")
class CleanData:
    @rdb.derives_from("raw/*/{date}")
    def transform(self, raw_partitions: List[Partition]) -> Partition:
        # Pure functional relationship, no orchestration
        pass
```
Foundational Themes
1. Declarative Over Imperative
The evolution from "do this, then that" to "this depends on that" represents a fundamental shift in how we think about data pipelines. The interface should express relationships, not recipes.
2. Pattern-Based Dependencies
Instead of explicitly listing dependencies, patterns like `raw/*/{date}` or `features/[date-30:date]` can express complex relationships concisely. This mirrors SQL's ability to express joins and windows declaratively.
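As an illustration, here is what a windowed pattern like `features/[date-30:date]` could expand to at planning time. The `expand_window` helper is hypothetical, not part of DataBuild; it simply shows the expansion a pattern engine would perform.

```python
from datetime import date, timedelta

# Hypothetical expansion of a window pattern such as
# "features/[date-30:date]": a 30-day trailing window of
# daily feature partitions, oldest first.
def expand_window(day: date, days: int = 30) -> list[str]:
    return [f"features/{day - timedelta(days=d)}" for d in range(days, 0, -1)]

deps = expand_window(date(2024, 1, 31))
print(deps[0], deps[-1])  # features/2024-01-01 features/2024-01-30
```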
3. Interface/Implementation Separation
The most promising approaches separate:
- Interface: How users express data relationships (Python's domain)
- Implementation: How computations execute (Bazel/Rust's domain)
4. Correctness Through Constraints
Rather than compile-time checking of imperative code, correctness could come from:
- Functional transformations (no side effects)
- Pattern-based completeness (all dependencies captured)
- Relational integrity (cycles impossible by construction)
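The last constraint can be made concrete with the standard library: once relationships are declared as a graph, a cycle is a structural error the system can reject before anything runs. The `relations` mapping below is an illustrative stand-in for declared dependencies, not a real DataBuild structure.

```python
from graphlib import TopologicalSorter, CycleError

# Illustrative declared relationships: partition -> partitions it derives from.
relations = {
    "processed/2024-01-01": {"raw/api/2024-01-01", "raw/ftp/2024-01-01"},
    "report/2024-01-01": {"processed/2024-01-01"},
}

try:
    order = list(TopologicalSorter(relations).static_order())
except CycleError as err:
    # A declared cycle is rejected before any job runs.
    raise SystemExit(f"dependency cycle: {err.args[1]}")

print(order[-1])  # report/2024-01-01 comes last: all inputs precede it
```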
5. Runtime Intelligence
With declarative relationships, the system can:
- Build optimal execution plans at runtime
- Adapt to resource availability
- Skip unnecessary recomputation
- Parallelize automatically
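"Skip unnecessary recomputation" can be sketched as a memoized walk over the declared relations: anything already materialized is simply not rebuilt. All names here (`relations`, `build`, `cache`) are illustrative assumptions, not DataBuild's API.

```python
# Sketch: build `target` by recursively materializing its declared inputs,
# skipping anything the cache already holds.
def materialize(target, relations, build, cache):
    if target in cache:  # already computed: skip
        return cache[target]
    inputs = [materialize(dep, relations, build, cache)
              for dep in relations.get(target, [])]
    cache[target] = build(target, inputs)
    return cache[target]

relations = {"report/d1": ["clean/d1"], "clean/d1": ["raw/d1"]}
cache = {"raw/d1": "RAW"}  # raw data already materialized
built = []

def build(target, inputs):
    built.append(target)  # record what actually ran
    return f"{target}({','.join(inputs)})"

materialize("report/d1", relations, build, cache)
print(built)  # ['clean/d1', 'report/d1']; raw/d1 was skipped
```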
Key Insights
The Orchestration Paradox
"Orchestration logic changes frequently, so we shouldn't implement it directly at all." This paradox suggests that the solution to complex orchestration isn't better orchestration tools, but eliminating orchestration entirely through declarative relationships.
The SQL Analogy
SQL's success comes from focusing on relational algebra rather than execution. DataBuild could similarly focus on data relationships rather than build steps. Users declare "what depends on what," not "how to build things."
The Gradient of Guarantees
Different parts of the system need different guarantees:
- Relationship declarations: Need flexibility, benefit from Python
- Computational execution: Need hermeticity, benefit from Bazel
- Runtime planning: Need intelligence, benefit from Rust
Future Explorations
Interface Evolution Paths
- Gradual Enhancement: Keep current Bazel interface, add Python layer on top
- Parallel Tracks: Maintain both Bazel-first and Python-first interfaces
- Fundamental Reimagining: Redesign around pure declarative relationships
Technical Investigations
- How to preserve Bazel's hermeticity with Python's dynamism
- Pattern matching languages for partition dependencies
- Query planning algorithms for data pipelines
- Time-travel and what-if analysis capabilities
Python as the Pattern Language
A crucial realization: we don't need a new DSL for partition patterns - Python itself is the DSL! Python's native string handling capabilities provide everything needed:
Built-in Pattern Matching (Python 3.10+)
```python
# Split the reference into path segments; each case destructures one pattern.
match partition_ref.split("/"):
    case ["raw", source, date] if source in ["api", "ftp"]:
        return {"source": source, "date": date}
    case ["model", name, version] if version.startswith("v") and version[1:].isdigit():
        return {"name": name, "version": int(version[1:])}
```
F-String Templates as Bidirectional Patterns
```python
from parse import parse  # third-party "parse" library, the inverse of str.format()

# Define pattern
pattern = "s3://bucket/{table}/date={date}/hour={hour:02d}"

# Extract (string -> values)
result = parse(pattern, "s3://bucket/events/date=2024-01-01/hour=13")
# result.named == {'table': 'events', 'date': '2024-01-01', 'hour': 13}

# Generate (values -> string)
pattern.format(table='events', date='2024-01-01', hour=13)
# 's3://bucket/events/date=2024-01-01/hour=13'
```
Native Python Pattern Types
- Glob patterns: `raw/*/2024-01-*` using `fnmatch` or `pathlib`
- Regular expressions: Named groups for complex patterns
- Type annotations: `Annotated[RawPartition, "all sources for date"]`
- String templates: Standard library `string.Template`
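For instance, glob matching over partition references needs nothing beyond the standard library:

```python
from fnmatch import fnmatch

# Glob-style filtering of partition references, stdlib only.
refs = ["raw/api/2024-01-05", "raw/ftp/2024-01-17", "raw/api/2024-02-01"]
january = [r for r in refs if fnmatch(r, "raw/*/2024-01-*")]
print(january)  # ['raw/api/2024-01-05', 'raw/ftp/2024-01-17']
```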
Why This Is Beautiful
- No new syntax - Developers already know these patterns
- Full IDE support - Autocomplete, refactoring, type checking
- Composable - Mix glob, regex, f-strings as needed
- Testable - Just Python strings and functions
- Bidirectional - Same pattern for matching and generating
The key insight: Python's string formatting is already a pattern language. We don't need to build a DSL - we need to embrace Python as the DSL.
Rust as an Alternative DSL
While Python offers maximum expressiveness, Rust presents an intriguing alternative that could provide compile-time guarantees without sacrificing elegance:
Procedural Macros for Job Definition
```rust
#[job]
fn process_daily(
    date: Date,
    #[partition("raw/{source}/{date}")] source: Partition<RawData>,
) -> Partition<ProcessedData, "processed/{date}"> {
    let raw_data = source.load()?;
    transform(raw_data, date).save()
}

#[job(resources = "memory=4G,cpu=2")]
fn aggregate_weekly(
    week: Week,
    #[partition("processed/[date:date+7]")] daily: PartitionSet<ProcessedData>,
) -> Partition<WeeklyAggregate, "aggregated/week={week}"> {
    daily.load_all()?.aggregate().save()
}
```
Type-Safe Pattern Matching
```rust
// Compile-time pattern validation using const generics.
// Note: `&'static str` const parameters currently require the nightly
// `adt_const_params` feature; this is a sketch of the idea.
pub struct Partition<T, const PATTERN: &'static str> {
    _phantom: PhantomData<T>,
}

// Pattern validated at compile time (is_valid_pattern must be a const fn)
impl<T, const PATTERN: &'static str> Partition<T, PATTERN> {
    const VALIDATE: () = {
        assert!(is_valid_pattern(PATTERN));
    };
}
```
Runtime Dependency Resolution with Compile-Time Structure
```rust
#[job(
    // Pattern template at compile time
    inputs = "raw/{source}/{date}",
    // But resolution happens at runtime
)]
impl ProcessDaily {
    fn resolve_dependencies(&self, args: &JobArgs) -> Result<Vec<String>> {
        let date = &args.date;
        let mut deps = vec![];
        // Dynamic resolution based on runtime state
        for source in self.catalog.available_sources(date)? {
            deps.push(format!("raw/{}/{}", source, date));
        }
        // Conditional dependencies
        if date.weekday() == Weekday::Monday {
            deps.push(format!("raw/{}/{}", "archive", date.previous_friday()));
        }
        Ok(deps)
    }
}
```
Why Rust DSL is Compelling
- Zero-cost abstractions - Macros expand to optimal code with no runtime overhead
- Compile-time validation - Pattern syntax checked at compile time
- Type safety - Full type checking across job boundaries
- Single language - If engine is Rust, no polyglot complexity
- IDE support - Full IntelliSense, refactoring, and go-to-definition
The Hybrid Best: Compile-Time Structure, Runtime Flexibility
```rust
// Macro provides syntax sugar and compile-time checks
#[derive(DataBuildJob)]
struct ComplexJob;

impl ComplexJob {
    const INPUT_TEMPLATE: &'static str = "raw/{source}/{date}";

    // But actual resolution is dynamic
    fn resolve_config(&self, args: &JobArgs, ctx: &Context) -> Result<JobConfig> {
        // Full runtime flexibility for dependency calculation
        let sources = ctx.catalog.list_sources(&args.date)?;
        let inputs = sources.iter()
            .map(|s| format!("raw/{}/{}", s, args.date))
            .collect();
        Ok(JobConfig { inputs, outputs: vec![/* ... */] })
    }
}
```
This Rust approach offers an interesting middle ground: stronger guarantees than Python while maintaining expressiveness, and seamless integration with a Rust-based execution engine.
Philosophical Questions
- Is orchestration a fundamental need or an implementation detail?
- Can we achieve both expressiveness and correctness?
- What would "SQL for data pipelines" actually look like?
Conclusion
These explorations suggest that DataBuild's future might not be in making orchestration easier, but in making it unnecessary. By focusing on declarative data relationships rather than imperative build steps, we could achieve both the expressiveness of Python and the guarantees of Bazel, while eliminating entire categories of complexity.
The ultimate vision: users declare what data depends on what other data, and the system figures out everything else - just like SQL.