databuild/plans/12-dsl.md


DataBuild Interface Evolution: Strategic Options and Technical Decisions

This document outlines the key technical decisions for evolving DataBuild's interface, examining each option through the lens of modern data infrastructure needs.

Executive Summary

DataBuild must choose between three fundamental interface strategies:

  1. Pure Bazel (current): Maximum guarantees, maximum verbosity
  2. High-Level DSL: Expressive interfaces that compile to Bazel
  3. Pure Declarative: Eliminate orchestration entirely through relational modeling

The Core Technical Decisions

Decision 1: Where Should Dependency Logic Live?

Option A: In-Job Config (Current Design)

# Job knows its own dependencies
from datetime import timedelta

def config(self, date):
    # date is a datetime.date; the job also needs yesterday's partition
    yesterday = date - timedelta(days=1)
    return {"inputs": [f"raw/{date}", f"raw/{yesterday}"]}
  • Locality of knowledge - dependency logic next to usage
  • Natural evolution - changes happen in one place
  • Performance overhead - subprocess per config call
  • Thrives in: Complex enterprise environments where jobs have intricate, evolving dependencies
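
The subprocess overhead is easiest to see in a sketch of the planner loop: one process spawn per job just to read its config. The `--config` flag and the JSON-on-stdout contract here are hypothetical, not DataBuild's actual protocol:

```python
import json
import subprocess

def plan(jobs: dict[str, list[str]], date: str) -> dict[str, list[str]]:
    """Collect each job's inputs by invoking its binary once per config call.

    `jobs` maps a job name to the argv of its binary. Every iteration pays
    a full process spawn, so planning time grows linearly with graph size.
    """
    graph = {}
    for name, argv in jobs.items():
        # Hypothetical contract: the binary prints its config as JSON.
        out = subprocess.run(
            [*argv, "--config", date],
            capture_output=True, text=True, check=True,
        )
        graph[name] = json.loads(out.stdout)["inputs"]
    return graph
```

This is the "seconds" side of the seconds-vs-microseconds trade-off in Option B below.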

Option B: Graph-Level Declaration

databuild_job(
    name = "process_daily",
    depends_on = ["raw/{date}"],
    produces = ["processed/{date}"]
)
  • Static analysis - entire graph visible without execution
  • Performance - microseconds vs seconds for planning
  • Flexibility - harder to express dynamic dependencies
  • Implicit coupling - job code must re-implement the dependency resolution already declared at the graph level
  • Thrives in: High-frequency trading systems, real-time analytics where planning speed matters

Option C: Hybrid Pattern-Based

# Patterns at graph level, resolution at runtime
@job(dependency_pattern="raw/{source}/[date-window:date]")
def aggregate(date, window=7):
    # The runtime expands the pattern into the exact partitions in the window
    ...
  • Best of both - fast planning with flexibility
  • Progressive disclosure - simple cases simple
  • Complexity - two places to look
  • Thrives in: Modern data platforms serving diverse teams with varying sophistication
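
For the `[date-window:date]` pattern above, runtime resolution might expand the window into one concrete partition per day. The `resolve_window` helper and the inclusive-window semantics are assumptions for illustration, not DataBuild API:

```python
from datetime import date, timedelta

def resolve_window(source: str, end: date, window: int) -> list[str]:
    """Expand raw/{source}/[date-window:date] into concrete partition paths.

    Returns one partition per day, oldest first, covering `window` days
    up to and including `end`.
    """
    return [
        f"raw/{source}/{end - timedelta(days=offset)}"
        for offset in range(window - 1, -1, -1)
    ]
```

The graph planner only sees the pattern; the exact partition list materializes when the job runs.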

Decision 2: Interface Language Choice

Option A: Pure Bazel (Status Quo)

databuild_job(
    name = "etl",
    binary = ":etl_binary",
)

Narrative: "The Infrastructure-as-Code Platform"

  • For organizations that value reproducibility above all else
  • Where data pipelines are mission-critical infrastructure
  • Teams that already use Bazel for other systems

Strengths:

  • Hermetic builds guarantee reproducibility
  • Multi-language support out of the box
  • Battle-tested deployment story

Weaknesses:

  • High barrier to entry
  • Verbose for simple cases
  • Limited expressiveness

Option B: Python DSL → Bazel Compilation

@db.job
def process(date: str, raw: partition("raw/{date}")) -> partition("clean/{date}"):
    return raw.load().transform().save()

Narrative: "The Developer-First Data Platform"

  • For data teams that move fast and iterate quickly
  • Where Python is already the lingua franca
  • Organizations prioritizing developer productivity

Strengths:

  • Roughly 10x more concise than the Bazel equivalent
  • Natural for data scientists/engineers
  • Rich ecosystem integration

Weaknesses:

  • Additional compilation step
  • Python-centric (less multi-language)
  • Debugging across abstraction layers

Option C: Rust DSL with Procedural Macros

#[job]
fn process(
    #[partition("raw/{date}")] input: Partition<Data>,
) -> Result<Partition<Output, "output/{date}">, JobError> {
    input.load()?.transform().save()
}

Narrative: "The High-Performance Data Platform"

  • For organizations processing massive scale
  • Where performance and correctness are equally critical
  • Teams willing to invest in Rust expertise

Strengths:

  • Compile-time guarantees with elegance
  • Zero-cost abstractions
  • Single language with execution engine

Weaknesses:

  • Steep learning curve
  • Smaller talent pool
  • Less flexible than Python

Decision 3: Orchestration Philosophy

Option A: Explicit Orchestration (Traditional)

  • Users define execution order and dependencies
  • Similar to Airflow, Prefect, Dagster
  • Thrives in: Organizations with complex business logic requiring explicit control
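
A hand-wired pipeline makes the cost of explicit orchestration concrete: the user, not the system, owns the ordering. The three step functions below are illustrative stand-ins:

```python
def ingest(d: str) -> str:
    return f"raw/{d}"

def clean(d: str) -> str:
    return f"clean/{d}"

def report(d: str) -> str:
    return f"report/{d}"

def run_pipeline(d: str) -> list[str]:
    # The ordering below is the user's responsibility; swap two lines
    # and the pipeline silently reads stale or missing partitions.
    return [ingest(d), clean(d), report(d)]
```
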

Option B: Implicit Orchestration (Current DataBuild)

  • Users define jobs and dependencies
  • System figures out execution order
  • Thrives in: Data engineering teams wanting to focus on transformations, not plumbing
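
Under the hood, "figuring out execution order" amounts to a topological sort over declared inputs and outputs. A minimal sketch, assuming jobs are described as plain input/output dicts (not DataBuild's actual planner):

```python
from graphlib import TopologicalSorter

def execution_order(jobs: dict[str, dict[str, list[str]]]) -> list[str]:
    """Derive an execution order from declared inputs/outputs.

    `jobs` maps a job name to {"inputs": [...], "outputs": [...]}.
    A job depends on whichever job produces one of its inputs.
    """
    producer = {
        out: name for name, spec in jobs.items() for out in spec["outputs"]
    }
    deps = {
        name: {producer[i] for i in spec["inputs"] if i in producer}
        for name, spec in jobs.items()
    }
    # TopologicalSorter raises CycleError on circular dependencies.
    return list(TopologicalSorter(deps).static_order())
```

The user writes only the jobs; the ordering falls out of the declarations.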

Option C: No Orchestration (Pure Declarative)

@partition("clean/{date}")
class CleanData:
    source = "raw/*/{date}"
    
    def transform(self, raw):
        # Pure function, no orchestration
        return clean(merge(raw))

Narrative: "The SQL-for-Data-Pipelines Platform"

  • Orchestration is an implementation detail
  • Users declare relationships, system handles everything
  • Thrives in: Next-generation data platforms, organizations ready to rethink data processing

Strengths:

  • Eliminates entire categories of bugs
  • Enables powerful optimizations

Weaknesses:

  • Paradigm shift for users
  • Less control over execution
  • Harder to debug when things go wrong

Strategic Recommendations by Use Case

For Startups/Fast-Moving Teams

Recommendation: Python DSL → Bazel

  • Start with Python for rapid development
  • Compile to Bazel for production
  • Migrate critical jobs to native Bazel/Rust over time

For Enterprise/Regulated Industries

Recommendation: Pure Bazel with Graph-Level Dependencies

  • Maintain full auditability and reproducibility
  • Use graph-level deps for performance
  • Consider Rust DSL for new greenfield projects

For Next-Gen Data Platforms

Recommendation: Pure Declarative with Rust Implementation

  • Leap directly to declarative model
  • Build on Rust for performance and correctness
  • Pioneer the "SQL for pipelines" approach

Implementation Patterns

Pattern 1: Gradual Migration

Current Bazel → Python DSL (compile to Bazel) → Pure Declarative
  • Low risk, high compatibility
  • Teams can adopt at their own pace
  • Preserves existing investments

Pattern 2: Parallel Tracks

Bazel Interface (production)
     ↕️
Python Interface (development)
  • Different interfaces for different use cases
  • Development velocity without sacrificing production guarantees
  • Higher maintenance burden

Pattern 3: Clean Break

New declarative system alongside legacy
  • Fastest path to innovation
  • No legacy constraints
  • Requires significant investment

Key Technical Insights

Single Source of Truth Principle

Whichever path is chosen, dependency declaration and resolution must be co-located:

# Good: Single source
def process(input: partition("raw/{date}")):
    return input.load().transform()

# Bad: Split sources
# In config: depends = ["raw/{date}"]
# In code: data = load("raw/{date}")  # Duplication!

The Pattern Language Insight

No new DSL needed for patterns - leverage existing language features:

  • Python: f-strings, glob, regex
  • Rust: const generics, pattern matching
  • Both: bidirectional pattern template libraries

The Orchestration Elimination Insight

The highest abstraction isn't better orchestration - it's no orchestration. Like SQL eliminated query planning from user concern, DataBuild could eliminate execution planning.

Conclusion

The optimal path depends on organizational maturity and ambition:

  1. Conservative Evolution: Enhance Bazel with better patterns and graph-level deps
  2. Developer-Focused: Python DSL compiling to Bazel, maintaining guarantees
  3. Revolutionary Leap: Pure declarative relationships with Rust implementation

Each path has merit. The key is choosing one that aligns with your organization's data infrastructure philosophy and long-term vision.