Python DSL Exploration: Foundational Ideas for DataBuild's Evolution
This document explores how Python's expressiveness could reshape DataBuild's interface. It is not an implementation plan, but a collection of themes and insights that could inspire future evolution.
Core Narratives
1. The Prefect Inspiration: Achieving 10x Conciseness
DataBuild currently requires ~1000 lines of Bazel + Rust to express what could potentially be ~100 lines of Python. This order-of-magnitude difference suggests fundamental opportunities for abstraction.
2. From Orchestration to Relations: The SQL Insight
The key realization: if orchestration logic changes frequently, we shouldn't make it easier to write; we should eliminate writing it entirely. Just as SQL focuses on relational algebra rather than execution plans, DataBuild could focus on data relationships rather than orchestration steps.
3. The Spectrum of Approaches
Pure Python (Maximum Dynamism)
```python
@db.job(outputs=lambda date: [f"processed/{date}"])
def process(date: str, raw: Partition) -> Partition:
    return transform(raw)
```
- Runtime introspection discovers dependencies
- Decorators provide the interface
- Trade-off: Sacrifices compile-time guarantees for expressiveness
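The decorator mechanics above can be sketched concretely. This is a toy illustration under stated assumptions, not DataBuild's actual API: `JOBS`, `job`, and the `process` body here are hypothetical, showing only how a decorator could register a job and defer output resolution to planning time.

```python
# Toy registry sketch (hypothetical, not DataBuild's real API): the decorator
# records each function plus its `outputs` lambda, which a planner can call
# at runtime to discover what a job produces for a given partition key.
JOBS = {}

def job(outputs):
    def register(fn):
        JOBS[fn.__name__] = {"fn": fn, "outputs": outputs}
        return fn
    return register

@job(outputs=lambda date: [f"processed/{date}"])
def process(date, raw):
    # Body runs only at execution time; registration is pure metadata.
    return raw

# The planner can ask what `process` produces without running it:
print(JOBS["process"]["outputs"]("2024-01-01"))  # ['processed/2024-01-01']
```

The point of the sketch: because `outputs` is a callable rather than a static declaration, dependency discovery happens by introspection at runtime.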
Hybrid Approaches (Best of Both Worlds)
Multiple strategies explored:
- Python DSL → Bazel Generation: Python defines, Bazel executes
- Python Orchestrator + Bazel Workers: Python handles coordination, Bazel handles computation
- Dual-Mode System: Development in Python, production in Bazel
- Gradual Migration: Start pure Python, migrate heavy jobs to Bazel over time
Pure Declarative (The Ultimate Vision)
```python
@rdb.partition("clean/{date}")
class CleanData:
    @rdb.derives_from("raw/*/{date}")
    def transform(self, raw_partitions: List[Partition]) -> Partition:
        # Pure functional relationship, no orchestration
        pass
```
Foundational Themes
1. Declarative Over Imperative
The evolution from "do this, then that" to "this depends on that" represents a fundamental shift in how we think about data pipelines. The interface should express relationships, not recipes.
2. Pattern-Based Dependencies
Instead of explicitly listing dependencies, patterns like `raw/*/{date}` or `features/[date-30:date]` can express complex relationships concisely. This mirrors SQL's ability to express joins and windows declaratively.
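As an illustration, here is what a windowed pattern like `features/[date-30:date]` could expand to at planning time. The `expand_window` helper is hypothetical, not part of DataBuild; it simply shows the expansion a pattern engine would perform.

```python
from datetime import date, timedelta

# Hypothetical expansion of a window pattern such as
# "features/[date-30:date]": a 30-day trailing window of
# daily feature partitions, oldest first.
def expand_window(day: date, days: int = 30) -> list[str]:
    return [f"features/{day - timedelta(days=d)}" for d in range(days, 0, -1)]

deps = expand_window(date(2024, 1, 31))
print(deps[0], deps[-1])  # features/2024-01-01 features/2024-01-30
```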
3. Interface/Implementation Separation
The most promising approaches separate:
- Interface: How users express data relationships (Python's domain)
- Implementation: How computations execute (Bazel/Rust's domain)
4. Correctness Through Constraints
Rather than compile-time checking of imperative code, correctness could come from:
- Functional transformations (no side effects)
- Pattern-based completeness (all dependencies captured)
- Relational integrity (cycles impossible by construction)
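The last constraint can be made concrete with the standard library: once relationships are declared as a graph, a cycle is a structural error the system can reject before anything runs. The `relations` mapping below is an illustrative stand-in for declared dependencies, not a real DataBuild structure.

```python
from graphlib import TopologicalSorter, CycleError

# Illustrative declared relationships: partition -> partitions it derives from.
relations = {
    "processed/2024-01-01": {"raw/api/2024-01-01", "raw/ftp/2024-01-01"},
    "report/2024-01-01": {"processed/2024-01-01"},
}

try:
    order = list(TopologicalSorter(relations).static_order())
except CycleError as err:
    # A declared cycle is rejected before any job runs.
    raise SystemExit(f"dependency cycle: {err.args[1]}")

print(order[-1])  # report/2024-01-01 comes last: all inputs precede it
```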
5. Runtime Intelligence
With declarative relationships, the system can:
- Build optimal execution plans at runtime
- Adapt to resource availability
- Skip unnecessary recomputation
- Parallelize automatically
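"Skip unnecessary recomputation" can be sketched as a memoized walk over the declared relations: anything already materialized is simply not rebuilt. All names here (`relations`, `build`, `cache`) are illustrative assumptions, not DataBuild's API.

```python
# Sketch: build `target` by recursively materializing its declared inputs,
# skipping anything the cache already holds.
def materialize(target, relations, build, cache):
    if target in cache:  # already computed: skip
        return cache[target]
    inputs = [materialize(dep, relations, build, cache)
              for dep in relations.get(target, [])]
    cache[target] = build(target, inputs)
    return cache[target]

relations = {"report/d1": ["clean/d1"], "clean/d1": ["raw/d1"]}
cache = {"raw/d1": "RAW"}  # raw data already materialized
built = []

def build(target, inputs):
    built.append(target)  # record what actually ran
    return f"{target}({','.join(inputs)})"

materialize("report/d1", relations, build, cache)
print(built)  # ['clean/d1', 'report/d1']; raw/d1 was skipped
```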
Key Insights
The Orchestration Paradox
"Orchestration logic changes frequently, so we shouldn't implement it directly at all." This paradox suggests that the solution to complex orchestration isn't better orchestration tools, but eliminating orchestration entirely through declarative relationships.
The SQL Analogy
SQL's success comes from focusing on relational algebra rather than execution. DataBuild could similarly focus on data relationships rather than build steps. Users declare "what depends on what," not "how to build things."
The Gradient of Guarantees
Different parts of the system need different guarantees:
- Relationship declarations: Need flexibility, benefit from Python
- Computational execution: Need hermeticity, benefit from Bazel
- Runtime planning: Need intelligence, benefit from Rust
Future Explorations
Interface Evolution Paths
- Gradual Enhancement: Keep current Bazel interface, add Python layer on top
- Parallel Tracks: Maintain both Bazel-first and Python-first interfaces
- Fundamental Reimagining: Redesign around pure declarative relationships
Technical Investigations
- How to preserve Bazel's hermeticity with Python's dynamism
- Pattern matching languages for partition dependencies
- Query planning algorithms for data pipelines
- Time-travel and what-if analysis capabilities
Python as the Pattern Language
A crucial realization: we don't need a new DSL for partition patterns - Python itself is the DSL! Python's native string handling capabilities provide everything needed:
Built-in Pattern Matching (Python 3.10+)
```python
# Split the reference into path segments; each case destructures one pattern.
match partition_ref.split("/"):
    case ["raw", source, date] if source in ["api", "ftp"]:
        return {"source": source, "date": date}
    case ["model", name, version] if version.startswith("v") and version[1:].isdigit():
        return {"name": name, "version": int(version[1:])}
```
F-String Templates as Bidirectional Patterns
```python
from parse import parse  # third-party "parse" library, the inverse of str.format()

# Define pattern
pattern = "s3://bucket/{table}/date={date}/hour={hour:02d}"

# Extract (string -> values)
result = parse(pattern, "s3://bucket/events/date=2024-01-01/hour=13")
# result.named == {'table': 'events', 'date': '2024-01-01', 'hour': 13}

# Generate (values -> string)
pattern.format(table='events', date='2024-01-01', hour=13)
# 's3://bucket/events/date=2024-01-01/hour=13'
```
Native Python Pattern Types
- Glob patterns: `raw/*/2024-01-*` using `fnmatch` or `pathlib`
- Regular expressions: Named groups for complex patterns
- Type annotations: `Annotated[RawPartition, "all sources for date"]`
- String templates: Standard library `string.Template`
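For instance, glob matching over partition references needs nothing beyond the standard library:

```python
from fnmatch import fnmatch

# Glob-style filtering of partition references, stdlib only.
refs = ["raw/api/2024-01-05", "raw/ftp/2024-01-17", "raw/api/2024-02-01"]
january = [r for r in refs if fnmatch(r, "raw/*/2024-01-*")]
print(january)  # ['raw/api/2024-01-05', 'raw/ftp/2024-01-17']
```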
Why This Is Beautiful
- No new syntax - Developers already know these patterns
- Full IDE support - Autocomplete, refactoring, type checking
- Composable - Mix glob, regex, f-strings as needed
- Testable - Just Python strings and functions
- Bidirectional - Same pattern for matching and generating
The key insight: Python's string formatting is already a pattern language. We don't need to build a DSL - we need to embrace Python as the DSL.
Rust as an Alternative DSL
While Python offers maximum expressiveness, Rust presents an intriguing alternative that could provide compile-time guarantees without sacrificing elegance:
Procedural Macros for Job Definition
```rust
#[job]
fn process_daily(
    date: Date,
    #[partition("raw/{source}/{date}")] source: Partition<RawData>,
) -> Partition<ProcessedData, "processed/{date}"> {
    let raw_data = source.load()?;
    transform(raw_data, date).save()
}

#[job(resources = "memory=4G,cpu=2")]
fn aggregate_weekly(
    week: Week,
    #[partition("processed/[date:date+7]")] daily: PartitionSet<ProcessedData>,
) -> Partition<WeeklyAggregate, "aggregated/week={week}"> {
    daily.load_all()?.aggregate().save()
}
```
Type-Safe Pattern Matching
```rust
// Compile-time pattern validation using const generics.
// Note: `&'static str` const parameters currently require the nightly
// `adt_const_params` feature; this is a sketch of the idea.
pub struct Partition<T, const PATTERN: &'static str> {
    _phantom: PhantomData<T>,
}

// Pattern validated at compile time (is_valid_pattern must be a const fn)
impl<T, const PATTERN: &'static str> Partition<T, PATTERN> {
    const VALIDATE: () = {
        assert!(is_valid_pattern(PATTERN));
    };
}
```
Runtime Dependency Resolution with Compile-Time Structure
```rust
#[job(
    // Pattern template at compile time
    inputs = "raw/{source}/{date}",
    // But resolution happens at runtime
)]
impl ProcessDaily {
    fn resolve_dependencies(&self, args: &JobArgs) -> Result<Vec<String>> {
        let date = &args.date;
        let mut deps = vec![];
        // Dynamic resolution based on runtime state
        for source in self.catalog.available_sources(date)? {
            deps.push(format!("raw/{}/{}", source, date));
        }
        // Conditional dependencies
        if date.weekday() == Weekday::Monday {
            deps.push(format!("raw/{}/{}", "archive", date.previous_friday()));
        }
        Ok(deps)
    }
}
```
Why Rust DSL is Compelling
- Zero-cost abstractions - Macros expand to optimal code with no runtime overhead
- Compile-time validation - Pattern syntax checked at compile time
- Type safety - Full type checking across job boundaries
- Single language - If engine is Rust, no polyglot complexity
- IDE support - Full IntelliSense, refactoring, and go-to-definition
The Hybrid Best: Compile-Time Structure, Runtime Flexibility
```rust
// Macro provides syntax sugar and compile-time checks
#[derive(DataBuildJob)]
struct ComplexJob;

impl ComplexJob {
    const INPUT_TEMPLATE: &'static str = "raw/{source}/{date}";

    // But actual resolution is dynamic
    fn resolve_config(&self, args: &JobArgs, ctx: &Context) -> Result<JobConfig> {
        // Full runtime flexibility for dependency calculation
        let sources = ctx.catalog.list_sources(&args.date)?;
        let inputs = sources.iter()
            .map(|s| format!("raw/{}/{}", s, args.date))
            .collect();
        Ok(JobConfig { inputs, outputs: vec![/* ... */] })
    }
}
```
This Rust approach offers an interesting middle ground: stronger guarantees than Python while maintaining expressiveness, and seamless integration with a Rust-based execution engine.
Philosophical Questions
- Is orchestration a fundamental need or an implementation detail?
- Can we achieve both expressiveness and correctness?
- What would "SQL for data pipelines" actually look like?
Conclusion
These explorations suggest that DataBuild's future might not be in making orchestration easier, but in making it unnecessary. By focusing on declarative data relationships rather than imperative build steps, we could achieve both the expressiveness of Python and the guarantees of Bazel, while eliminating entire categories of complexity.
The ultimate vision: users declare what data depends on what other data, and the system figures out everything else - just like SQL.