stuart/databuild

Stuart Axelbrooke 4bb8af2c74

Add python dsl options

2025-07-21 22:52:20 -07:00

Python DSL Exploration: Foundational Ideas for DataBuild's Evolution

This document explores how Python's expressiveness could reshape DataBuild's interface, not as an implementation plan, but as a collection of themes and insights that could inspire future evolution.

Core Narratives

1. The Prefect Inspiration: Achieving 10x Conciseness

DataBuild currently requires ~1000 lines of Bazel + Rust to express what could potentially be ~100 lines of Python. This order-of-magnitude difference suggests that a higher-level abstraction could absorb much of what is written by hand today.

2. From Orchestration to Relations: The SQL Insight

The key realization: if orchestration logic changes frequently, the goal should not be to make it easier to write, but to eliminate writing it entirely. Just as SQL focuses on relational algebra rather than execution plans, DataBuild could focus on data relationships rather than orchestration steps.

3. The Spectrum of Approaches

Pure Python (Maximum Dynamism)

# Illustrative only: `db`, `Partition`, and `transform` are assumed to exist elsewhere.
@db.job(outputs=lambda date: [f"processed/{date}"])
def process(date: str, raw: Partition) -> Partition:
    return transform(raw)

Runtime introspection discovers dependencies
Decorators provide the interface
Trade-off: Sacrifices compile-time guarantees for expressiveness
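
To make "runtime introspection discovers dependencies" concrete, here is a minimal sketch of how such a decorator might work. Everything in it (`job`, `JobSpec`, `REGISTRY`, and the `Partition` stand-in) is a hypothetical name invented for illustration, not an existing DataBuild API. It assumes that any parameter annotated as `Partition` is an input dependency and that every other parameter is a plain argument.

```python
import inspect
from dataclasses import dataclass
from typing import Callable, Dict, List, get_type_hints


@dataclass
class Partition:
    """Hypothetical stand-in for a reference to a materialized partition."""
    ref: str


@dataclass
class JobSpec:
    func: Callable
    outputs: Callable[..., List[str]]
    params: List[str]  # plain parameters such as `date`
    inputs: List[str]  # parameters annotated as Partition


# Hypothetical global registry of declared jobs, keyed by function name.
REGISTRY: Dict[str, JobSpec] = {}


def job(outputs: Callable[..., List[str]]):
    """Register a function as a job and record its inputs by introspection."""
    def decorator(func):
        hints = get_type_hints(func)
        names = list(inspect.signature(func).parameters)
        # Assumption: a Partition-typed parameter is an input dependency.
        inputs = [n for n in names if hints.get(n) is Partition]
        params = [n for n in names if n not in inputs]
        REGISTRY[func.__name__] = JobSpec(func, outputs, params, inputs)
        return func
    return decorator


@job(outputs=lambda date: [f"processed/{date}"])
def process(date: str, raw: Partition) -> Partition:
    return Partition(ref=f"processed/{date}")


spec = REGISTRY["process"]
print(spec.inputs, spec.params, spec.outputs("2025-07-21"))
# prints: ['raw'] ['date'] ['processed/2025-07-21']
```

Nothing here is checked until the module is imported, which is the trade-off named above: a mistyped annotation surfaces at runtime rather than at build time.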

Hybrid Approaches (Best of Both Worlds)

Multiple strategies explored:

Python DSL → Bazel Generation: Python defines, Bazel executes
Python Orchestrator + Bazel Workers: Python handles coordination, Bazel handles computation
Dual-Mode System: Development in Python, production in Bazel
Gradual Migration: Start pure Python, migrate heavy jobs to Bazel over time
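
As a rough illustration of the "Python defines, Bazel executes" strategy, the sketch below renders a Python job definition as Bazel BUILD text. DataBuild's own Bazel rules are not shown in this document, so Bazel's built-in `genrule` is used as a stand-in; `JobDef` and `to_genrule` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class JobDef:
    """Hypothetical Python-side description of one job."""
    name: str
    srcs: List[str]
    outs: List[str]
    cmd: str


def to_genrule(job: JobDef) -> str:
    """Render one job as a Bazel genrule stanza.

    json.dumps is used for quoting; for plain ASCII strings its output
    is also a valid Starlark string or list literal.
    """
    return (
        "genrule(\n"
        f"    name = {json.dumps(job.name)},\n"
        f"    srcs = {json.dumps(job.srcs)},\n"
        f"    outs = {json.dumps(job.outs)},\n"
        f"    cmd = {json.dumps(job.cmd)},\n"
        ")\n"
    )


# In a genrule, $< is the single source file and $@ the single output.
example = JobDef("process", ["raw.csv"], ["processed.csv"], "cp $< $@")
text = to_genrule(example)
print(text)
```

A real generator would target whatever rules DataBuild defines and would need to decide when regeneration happens; this only shows the direction of the translation.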

Pure Declarative (The Ultimate Vision)

from typing import List

# Illustrative only: `rdb` and `Partition` are assumed to exist elsewhere.
@rdb.partition("clean/{date}")
class CleanData:
    @rdb.derives_from("raw/*/{{date}}")
    def transform(self, raw_partitions: List[Partition]) -> Partition:
        # Pure functional relationship, no orchestration
        pass
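
One way decorators like these could be implemented is to have them record relationships and do nothing else, leaving all execution decisions to a separate planner. The sketch below is an assumption-laden illustration: `partition`, `derives_from`, and `RELATIONS` are hypothetical, and placeholders are written with single braces (`{date}`) on the assumption that both brace styles in this document denote a named placeholder.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping an output pattern to its input patterns.
RELATIONS: Dict[str, List[str]] = {}


def derives_from(*patterns: str):
    """Attach input patterns to a method without running anything."""
    def mark(func: Callable) -> Callable:
        func._input_patterns = list(patterns)
        return func
    return mark


def partition(output_pattern: str):
    """Class decorator: collect input patterns from the marked methods."""
    def register(cls):
        inputs: List[str] = []
        for attr in vars(cls).values():
            inputs.extend(getattr(attr, "_input_patterns", []))
        RELATIONS[output_pattern] = inputs
        return cls
    return register


@partition("clean/{date}")
class CleanData:
    @derives_from("raw/*/{date}")
    def transform(self, raw_partitions):
        return raw_partitions


print(RELATIONS)
# prints: {'clean/{date}': ['raw/*/{date}']}
```

The class body contains no ordering or scheduling logic; the registry is the only thing a planner would need to read.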

Foundational Themes

1. Declarative Over Imperative

The evolution from "do this, then that" to "this depends on that" represents a fundamental shift in how we think about data pipelines. The interface should express relationships, not recipes.

2. Pattern-Based Dependencies

Instead of explicitly listing dependencies, patterns like raw/*/{{date}} or features/[date-30:date] can express complex relationships concisely. This mirrors SQL's ability to express joins and windows declaratively.
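
The two pattern styles mentioned above could be resolved along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, placeholders use single braces, `*` is matched with the standard library's `fnmatch` rules (so it can also match across `/`), and a window such as `[date-30:date]` is taken to include both endpoints at daily granularity.

```python
import datetime as dt
from fnmatch import fnmatchcase
from typing import List


def resolve_glob(pattern: str, known: List[str], **bindings: str) -> List[str]:
    """Fill {name} placeholders, then match '*' against known partitions."""
    concrete = pattern.format(**bindings)
    return sorted(p for p in known if fnmatchcase(p, concrete))


def resolve_window(prefix: str, end: dt.date, days: int) -> List[str]:
    """Expand a trailing window into daily partitions, oldest first.

    Assumption: both endpoints are included, so `days=30` yields 31 names.
    """
    return [
        f"{prefix}/{(end - dt.timedelta(days=d)).isoformat()}"
        for d in range(days, -1, -1)
    ]


known = ["raw/us/2025-07-21", "raw/eu/2025-07-21", "raw/us/2025-07-20"]

print(resolve_glob("raw/*/{date}", known, date="2025-07-21"))
# prints: ['raw/eu/2025-07-21', 'raw/us/2025-07-21']

print(resolve_window("features", dt.date(2025, 7, 21), 2))
# prints: ['features/2025-07-19', 'features/2025-07-20', 'features/2025-07-21']
```

A dedicated pattern language would likely want stricter semantics than `fnmatch` provides, for example a `*` that stops at path separators.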

3. Interface/Implementation Separation

The most promising approaches separate:

Interface: How users express data relationships (Python's domain)
Implementation: How computations execute (Bazel/Rust's domain)

4. Correctness Through Constraints

Rather than compile-time checking of imperative code, correctness could come from:

Functional transformations (no side effects)
Pattern-based completeness (all dependencies captured)
Relational integrity (cycles impossible by construction)
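
One possible reading of the last point is that a cyclic set of relationships is rejected at the moment the dependency graph is assembled, before anything runs. The sketch below shows that check using Python's standard `graphlib` module (Python 3.9+); `build_order` is a hypothetical helper and the partition names are made up.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def build_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return a valid build order, or raise ValueError on a cycle.

    `depends_on` maps each partition to the partitions it depends on.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # The second element of err.args is the list of nodes in the cycle.
        raise ValueError(f"cyclic dependency: {err.args[1]}") from err


print(build_order({"clean": {"raw"}, "features": {"clean"}}))
# prints: ['raw', 'clean', 'features']

try:
    build_order({"a": {"b"}, "b": {"a"}})
except ValueError as exc:
    print("rejected:", exc)
```

Because the check needs only the declared relationships, it can run when declarations are loaded rather than partway through a build.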

5. Runtime Intelligence

With declarative relationships, the system can:

Build optimal execution plans at runtime
Adapt to resource availability
Skip unnecessary recomputation
Parallelize automatically
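
Two of these capabilities, skipping unnecessary recomputation and parallelizing automatically, can be sketched with a small planner over declared relationships. The code below is illustrative only: `plan_waves` is a hypothetical function, and it assumes that a partition which already exists is up to date (staleness detection is not modeled).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def plan_waves(
    depends_on: Dict[str, Set[str]], target: str, existing: Set[str]
) -> List[List[str]]:
    """Group the partitions needed for `target` into parallel waves."""
    # Walk back from the target, stopping at partitions that already exist.
    needed: Set[str] = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in existing or node in needed:
            continue
        needed.add(node)
        stack.extend(depends_on.get(node, ()))

    # Keep only the edges between partitions that still have to be built.
    graph = {n: depends_on.get(n, set()) & needed for n in needed}
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything here can run at once
        waves.append(ready)
        sorter.done(*ready)
    return waves


deps = {
    "report": {"clean_us", "clean_eu"},
    "clean_us": {"raw_us"},
    "clean_eu": {"raw_eu"},
}

print(plan_waves(deps, "report", existing={"raw_us", "raw_eu", "clean_eu"}))
# prints: [['clean_us'], ['report']]
```

Each inner list is a set of partitions with no unmet dependencies on one another, so a scheduler could dispatch them concurrently, sized to whatever resources are available.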

Key Insights

The Orchestration Paradox

"Orchestration logic changes frequently, so we shouldn't implement it directly at all." This paradox suggests that the solution to complex orchestration isn't better orchestration tools, but eliminating orchestration entirely through declarative relationships.

The SQL Analogy

SQL succeeds in large part because users state relational queries and the engine chooses how to execute them. DataBuild could similarly focus on data relationships rather than build steps. Users declare "what depends on what," not "how to build things."

The Gradient of Guarantees

Different parts of the system need different guarantees:

Relationship declarations: Need flexibility, benefit from Python
Computational execution: Need hermeticity, benefit from Bazel
Runtime planning: Need intelligence, benefit from Rust

Future Explorations

Interface Evolution Paths

Gradual Enhancement: Keep current Bazel interface, add Python layer on top
Parallel Tracks: Maintain both Bazel-first and Python-first interfaces
Fundamental Reimagining: Redesign around pure declarative relationships

Technical Investigations

How to preserve Bazel's hermeticity alongside Python's dynamism
Pattern matching languages for partition dependencies
Query planning algorithms for data pipelines
Time-travel and what-if analysis capabilities
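
As a small illustration of what a what-if query could look like once relationships are declared, the sketch below answers "which partitions would need rebuilding if this one changed?" by walking the dependency graph downstream. `downstream_of` and the partition names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_of(depends_on: Dict[str, Set[str]], changed: str) -> List[str]:
    """List every partition affected if `changed` were modified."""
    # Invert "X depends on Y" into "Y feeds X".
    feeds: Dict[str, Set[str]] = {}
    for node, inputs in depends_on.items():
        for src in inputs:
            feeds.setdefault(src, set()).add(node)

    # Breadth-first walk over consumers, transitively.
    affected: Set[str] = set()
    queue = deque([changed])
    while queue:
        for consumer in feeds.get(queue.popleft(), ()):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return sorted(affected)


pipeline = {
    "clean": {"raw"},
    "features": {"clean"},
    "report": {"features", "users"},
}

print(downstream_of(pipeline, "clean"))
# prints: ['features', 'report']
```

The query runs entirely on declarations, without executing any job, which is what makes this kind of analysis cheap in a declarative design.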

Philosophical Questions

Is orchestration a fundamental need or an implementation detail?
Can we achieve both expressiveness and correctness?
What would "SQL for data pipelines" actually look like?

Conclusion

These explorations suggest that DataBuild's future might not be in making orchestration easier, but in making it unnecessary. By focusing on declarative data relationships rather than imperative build steps, we could achieve both the expressiveness of Python and the guarantees of Bazel, while eliminating entire categories of complexity.

The ultimate vision: users declare what data depends on what other data, and the system figures out everything else, much as a SQL engine works out how to execute a declarative query.
