databuild/plans/python-dsl.md
Stuart Axelbrooke 4bb8af2c74
Add python dsl options
2025-07-21 22:52:20 -07:00

Python DSL Exploration: Foundational Ideas for DataBuild's Evolution

This document explores how Python's expressiveness could reshape DataBuild's interface, not as an implementation plan, but as a collection of themes and insights that could inspire future evolution.

Core Narratives

1. The Prefect Inspiration: Achieving 10x Conciseness

DataBuild currently requires ~1000 lines of Bazel + Rust to express what could potentially be ~100 lines of Python. This order-of-magnitude difference suggests fundamental opportunities for abstraction.

2. From Orchestration to Relations: The SQL Insight

The key realization: if orchestration logic changes frequently, we shouldn't make it easier to write; we should eliminate the need to write it at all. Just as SQL focuses on relational algebra rather than execution plans, DataBuild could focus on data relationships rather than orchestration steps.

3. The Spectrum of Approaches

Pure Python (Maximum Dynamism)

@db.job(outputs=lambda date: [f"processed/{date}"])
def process(date: str, raw: Partition) -> Partition:
    return transform(raw)
  • Runtime introspection discovers dependencies
  • Decorators provide the interface
  • Trade-off: Sacrifices compile-time guarantees for expressiveness
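
To make "runtime introspection discovers dependencies" concrete, here is a minimal sketch of how such a decorator might work. It is not DataBuild's API: `job`, `JobSpec`, `REGISTRY`, and this `Partition` class are hypothetical names chosen for illustration, and the rule "a parameter annotated as Partition is an input" is an assumption.

```python
import inspect
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Partition:
    """Hypothetical stand-in for a partition handle."""
    ref: str


@dataclass
class JobSpec:
    name: str
    func: Callable
    outputs: Callable[..., List[str]]
    inputs: List[str]  # names of parameters annotated as Partition


REGISTRY: Dict[str, JobSpec] = {}


def job(outputs: Callable[..., List[str]]):
    """Hypothetical decorator: register a job and infer its inputs."""
    def decorate(func):
        params = inspect.signature(func).parameters
        # Assumption: any parameter annotated as Partition is an upstream input.
        inputs = [n for n, p in params.items() if p.annotation is Partition]
        REGISTRY[func.__name__] = JobSpec(func.__name__, func, outputs, inputs)
        return func
    return decorate


@job(outputs=lambda date: [f"processed/{date}"])
def process(date: str, raw: Partition) -> Partition:
    return Partition(f"processed/{date}")


spec = REGISTRY["process"]
print(spec.inputs)                 # ['raw']
print(spec.outputs("2025-07-21"))  # ['processed/2025-07-21']
```

Nothing is checked until the module is imported and the decorator runs, which is exactly the trade-off noted above: a misspelled annotation surfaces at runtime rather than at build time.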

Hybrid Approaches (Best of Both Worlds)

Several strategies were explored:

  • Python DSL → Bazel Generation: Python defines, Bazel executes
  • Python Orchestrator + Bazel Workers: Python handles coordination, Bazel handles computation
  • Dual-Mode System: Development in Python, production in Bazel
  • Gradual Migration: Start pure Python, migrate heavy jobs to Bazel over time
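
As a rough illustration of the first strategy (Python defines, Bazel executes), the sketch below renders Python declarations into BUILD-file text. The rule name `databuild_job` and its `binary` and `outputs` attributes are assumptions made for this example, not a statement of what DataBuild's Bazel rules look like; `JobDecl` and `render_build` are likewise hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobDecl:
    name: str
    binary: str  # Bazel label of the executable (hypothetical attribute)
    outputs: List[str] = field(default_factory=list)


def render_build(jobs: List[JobDecl]) -> str:
    """Render hypothetical `databuild_job` targets as BUILD-file text."""
    blocks = []
    for j in jobs:
        outs = ", ".join(f'"{o}"' for o in j.outputs)
        blocks.append(
            "databuild_job(\n"
            f'    name = "{j.name}",\n'
            f'    binary = "{j.binary}",\n'
            f"    outputs = [{outs}],\n"
            ")"
        )
    return "\n\n".join(blocks) + "\n"


build_text = render_build(
    [JobDecl("process", "//jobs:process_bin", ["processed/{date}"])]
)
print(build_text)
```

Generating text keeps Bazel as the single source of truth for execution, at the cost of a code-generation step that has to be rerun whenever the Python definitions change.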

Pure Declarative (The Ultimate Vision)

from typing import List

@rdb.partition("clean/{date}")
class CleanData:
    @rdb.derives_from("raw/*/{{date}}")
    def transform(self, raw_partitions: List[Partition]) -> Partition:
        # Pure functional relationship, no orchestration
        pass
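
One way to read the snippet above is that the decorators only record relationships and never run anything. The following sketch shows that idea under stated assumptions: `partition`, `derives_from`, and `RELATIONS` are hypothetical stand-ins for whatever `rdb` would provide, and the patterns are stored verbatim without being interpreted.

```python
from typing import Dict, List

# Maps an output pattern to the upstream patterns it derives from.
RELATIONS: Dict[str, List[str]] = {}


def derives_from(*patterns: str):
    """Hypothetical method decorator: attach upstream patterns to a method."""
    def mark(func):
        func._derives_from = list(patterns)
        return func
    return mark


def partition(pattern: str):
    """Hypothetical class decorator: collect the patterns marked on methods."""
    def register(cls):
        upstream: List[str] = []
        for attr in vars(cls).values():
            # Attributes without the marker contribute nothing.
            upstream.extend(getattr(attr, "_derives_from", []))
        RELATIONS[pattern] = upstream
        return cls
    return register


@partition("clean/{date}")
class CleanData:
    @derives_from("raw/*/{{date}}")
    def transform(self, raw_partitions):
        ...


print(RELATIONS)
```

The resulting dictionary is plain data, so a planner written in any language could consume it without importing user code.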

Foundational Themes

1. Declarative Over Imperative

The evolution from "do this, then that" to "this depends on that" represents a fundamental shift in how we think about data pipelines. The interface should express relationships, not recipes.

2. Pattern-Based Dependencies

Instead of listing dependencies explicitly, users could write patterns such as raw/*/{{date}} or features/[date-30:date] to express complex relationships concisely. This mirrors SQL's ability to express joins and windows declaratively.
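
The sketch below shows how two such patterns might be resolved into concrete partition references. The semantics are assumptions, since the pattern language is still an open question: placeholders are written with single braces and filled by `str.format`, `*` is matched with `fnmatch` (where it also matches `/`), and the date window is inclusive at both ends. `resolve_glob` and `resolve_window` are hypothetical helper names.

```python
from datetime import date, timedelta
from fnmatch import fnmatchcase
from typing import Iterable, List


def resolve_glob(pattern: str, day: date, catalog: Iterable[str]) -> List[str]:
    """Fill in {date}, then match the `*` wildcard against known partitions."""
    concrete = pattern.format(date=day.isoformat())
    # Note: fnmatch's `*` is not path-aware, so it can span several segments.
    return sorted(p for p in catalog if fnmatchcase(p, concrete))


def resolve_window(prefix: str, day: date, days_back: int) -> List[str]:
    """Expand a trailing window like features/[date-30:date], oldest first."""
    return [
        f"{prefix}/{(day - timedelta(days=n)).isoformat()}"
        for n in range(days_back, -1, -1)
    ]


catalog = ["raw/web/2025-07-21", "raw/app/2025-07-21", "raw/web/2025-07-20"]
today = date(2025, 7, 21)

print(resolve_glob("raw/*/{date}", today, catalog))
# ['raw/app/2025-07-21', 'raw/web/2025-07-21']
print(resolve_window("features", today, 2))
# ['features/2025-07-19', 'features/2025-07-20', 'features/2025-07-21']
```

The glob needs a catalog of existing partitions to match against, while the window can be expanded from the date alone; a real pattern language would have to decide which of these behaviours each construct gets.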

3. Interface/Implementation Separation

The most promising approaches separate:

  • Interface: How users express data relationships (Python's domain)
  • Implementation: How computations execute (Bazel/Rust's domain)

4. Correctness Through Constraints

Rather than compile-time checking of imperative code, correctness could come from:

  • Functional transformations (no side effects)
  • Pattern-based completeness (all dependencies captured)
  • Relational integrity (cycles impossible by construction)
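
Two of these constraints can be checked mechanically once relationships are plain data. The sketch below uses Python's standard `graphlib` (3.9+) to reject cyclic or incomplete declarations. Note that it detects cycles when declarations are validated, which is weaker than making them impossible by construction; `validate` and the shape of `relations` are assumptions for illustration.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def validate(relations: Dict[str, Set[str]], sources: Set[str]) -> List[str]:
    """Return a valid build order, or raise if declarations are inconsistent.

    `relations` maps each derived partition to the partitions it depends on;
    `sources` are partitions that exist without being derived.
    """
    declared = set(relations) | sources
    referenced = {d for deps in relations.values() for d in deps}
    missing = referenced - declared
    if missing:
        raise ValueError(f"undeclared dependencies: {sorted(missing)}")
    # static_order() raises CycleError if the graph contains a cycle.
    return list(TopologicalSorter(relations).static_order())


order = validate({"clean": {"raw"}, "report": {"clean"}}, sources={"raw"})
print(order)  # ['raw', 'clean', 'report']

try:
    validate({"a": {"b"}, "b": {"a"}}, sources=set())
except CycleError as err:
    print("rejected:", err.args[0])
```

Because the check runs over declarations alone, it could happen before any job executes, for example at import time or in CI.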

5. Runtime Intelligence

With declarative relationships, the system can:

  • Build optimal execution plans at runtime
  • Adapt to resource availability
  • Skip unnecessary recomputation
  • Parallelize automatically
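
A small sketch of what runtime planning could look like: given declared relationships, a target, and the set of partitions that already exist, compute waves of work where everything in a wave can run in parallel. This covers only the "skip" and "parallelize" bullets, treats "already exists" as the sole skip criterion (no staleness check), and ignores resource limits. The function name `plan` and the data shapes are assumptions.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def plan(
    relations: Dict[str, Set[str]], wanted: str, existing: Set[str]
) -> List[List[str]]:
    """Return waves of partitions to build; each wave can run in parallel."""
    # Walk upstream from the target, stopping at partitions that already exist.
    needed: Dict[str, Set[str]] = {}
    stack = [wanted]
    while stack:
        node = stack.pop()
        if node in existing or node in needed:
            continue
        deps = relations.get(node, set())
        needed[node] = {d for d in deps if d not in existing}
        stack.extend(needed[node])

    sorter = TopologicalSorter(needed)
    sorter.prepare()
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything unblocked right now
        waves.append(ready)
        sorter.done(*ready)
    return waves


relations = {
    "report": {"clean", "features"},
    "clean": {"raw"},
    "features": {"raw"},
}
print(plan(relations, "report", existing={"raw"}))
# [['clean', 'features'], ['report']]
print(plan(relations, "report", existing={"raw", "clean"}))
# [['features'], ['report']]
```

The same declarations yield different plans depending on what already exists, which is the point: users never write either plan by hand.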

Key Insights

The Orchestration Paradox

"Orchestration logic changes frequently, so we shouldn't implement it directly at all." This paradox suggests that the solution to complex orchestration isn't better orchestration tools, but eliminating orchestration entirely through declarative relationships.

The SQL Analogy

SQL's success comes from focusing on relational algebra rather than execution. DataBuild could similarly focus on data relationships rather than build steps. Users declare "what depends on what," not "how to build things."

The Gradient of Guarantees

Different parts of the system need different guarantees:

  • Relationship declarations: Need flexibility, benefit from Python
  • Computational execution: Need hermeticity, benefit from Bazel
  • Runtime planning: Need intelligence, benefit from Rust

Future Explorations

Interface Evolution Paths

  1. Gradual Enhancement: Keep current Bazel interface, add Python layer on top
  2. Parallel Tracks: Maintain both Bazel-first and Python-first interfaces
  3. Fundamental Reimagining: Redesign around pure declarative relationships

Technical Investigations

  • How to preserve Bazel's hermeticity with Python's dynamism
  • Pattern matching languages for partition dependencies
  • Query planning algorithms for data pipelines
  • Time-travel and what-if analysis capabilities

Philosophical Questions

  • Is orchestration a fundamental need or an implementation detail?
  • Can we achieve both expressiveness and correctness?
  • What would "SQL for data pipelines" actually look like?

Conclusion

These explorations suggest that DataBuild's future might not be in making orchestration easier, but in making it unnecessary. By focusing on declarative data relationships rather than imperative build steps, we could achieve both the expressiveness of Python and the guarantees of Bazel, while eliminating entire categories of complexity.

The ultimate vision: users declare what data depends on what other data, and the system figures out everything else, just as a SQL engine does.