DataBuild Interface Evolution: Strategic Options and Technical Decisions
This document outlines the key technical decisions for evolving DataBuild's interface, examining each option through the lens of modern data infrastructure needs.
Executive Summary
DataBuild must choose between three fundamental interface strategies:
- Pure Bazel (current): Maximum guarantees, maximum verbosity
- High-Level DSL: Expressive interfaces that compile to Bazel
- Pure Declarative: Eliminate orchestration entirely through relational modeling
The Core Technical Decisions
Decision 1: Where Should Dependency Logic Live?
Option A: In-Job Config (Current Design)
# Job knows its own dependencies
def config(self, date):
    previous = date - timedelta(days=1)  # date is a datetime.date
    return {"inputs": [f"raw/{date}", f"raw/{previous}"]}
- ✅ Locality of knowledge - dependency logic next to usage
- ✅ Natural evolution - changes happen in one place
- ❌ Performance overhead - subprocess per config call
- Thrives in: Complex enterprise environments where jobs have intricate, evolving dependencies
Option B: Graph-Level Declaration
databuild_job(
    name = "process_daily",
    depends_on = ["raw/{date}"],
    produces = ["processed/{date}"],
)
- ✅ Static analysis - entire graph visible without execution
- ✅ Performance - microseconds vs seconds for planning
- ❌ Flexibility - harder to express dynamic dependencies
- ❌ Implicit coupling - jobs must still duplicate dependency-resolution logic in their own code
- Thrives in: High-frequency trading systems, real-time analytics where planning speed matters
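A minimal sketch of what graph-level declaration buys: with job specs as plain data (the job names and the `plan` helper below are illustrative, not DataBuild API), the entire execution plan resolves by string substitution, without running any job code.

```python
# Hypothetical graph-level declarations: each job lists the partition
# patterns it consumes and produces.
JOBS = {
    "process_daily": {
        "depends_on": ["raw/{date}"],
        "produces": ["processed/{date}"],
    },
    "report_daily": {
        "depends_on": ["processed/{date}"],
        "produces": ["report/{date}"],
    },
}

def plan(target_date):
    """Resolve and order the whole graph by substitution alone --
    no job code executes during planning."""
    resolved = {
        name: {
            key: [p.format(date=target_date) for p in patterns]
            for key, patterns in spec.items()
        }
        for name, spec in JOBS.items()
    }
    # Map each concrete partition to the job that produces it.
    produced_by = {
        part: name
        for name, spec in resolved.items()
        for part in spec["produces"]
    }
    order, seen = [], set()

    def visit(name):
        # Depth-first: producers are appended before their consumers.
        if name in seen:
            return
        seen.add(name)
        for dep in resolved[name]["depends_on"]:
            if dep in produced_by:
                visit(produced_by[dep])
        order.append(name)

    for name in JOBS:
        visit(name)
    return order
```

Because everything is static data, this is the "microseconds vs seconds" planning path: no subprocess is spawned per job.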
Option C: Hybrid Pattern-Based
# Patterns at graph level, resolution at runtime
@job(dependency_pattern="raw/{source}/[date-window:date]")
def aggregate(date, window=7):
    # Runtime resolves the pattern into exact partitions
    ...
- ✅ Best of both - fast planning with flexibility
- ✅ Progressive disclosure - simple cases simple
- ❌ Complexity - two places to look
- Thrives in: Modern data platforms serving diverse teams with varying sophistication
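To make the runtime half of the hybrid concrete, here is one way the `[date-window:date]` range could expand into concrete daily partitions (the `resolve_window` helper and the `raw/{source}/...` layout are assumptions for illustration):

```python
from datetime import date, timedelta

def resolve_window(source, end, window):
    """Expand the hypothetical 'raw/{source}/[date-window:date]' range
    into the concrete daily partitions it covers, inclusive of `end`."""
    days = [end - timedelta(days=i) for i in range(window - 1, -1, -1)]
    return [f"raw/{source}/{d.isoformat()}" for d in days]

# A 7-day window ending 2024-01-07 covers Jan 1 through Jan 7.
parts = resolve_window(source="clicks", end=date(2024, 1, 7), window=7)
```

The graph planner only needs the pattern to draw edges; the exact partition list is computed here, at execution time.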
Decision 2: Interface Language Choice
Option A: Pure Bazel (Status Quo)
databuild_job(
    name = "etl",
    binary = ":etl_binary",
)
Narrative: "The Infrastructure-as-Code Platform"
- For organizations that value reproducibility above all else
- Where data pipelines are mission-critical infrastructure
- Teams that already use Bazel for other systems
Strengths:
- Hermetic builds guarantee reproducibility
- Multi-language support out of the box
- Battle-tested deployment story
Weaknesses:
- High barrier to entry
- Verbose for simple cases
- Limited expressiveness
Option B: Python DSL → Bazel Compilation
@db.job
def process(date: str, raw: partition("raw/{date}")) -> partition("clean/{date}"):
    return raw.load().transform().save()
Narrative: "The Developer-First Data Platform"
- For data teams that move fast and iterate quickly
- Where Python is already the lingua franca
- Organizations prioritizing developer productivity
Strengths:
- Far more concise than the equivalent Bazel
- Natural for data scientists/engineers
- Rich ecosystem integration
Weaknesses:
- Additional compilation step
- Python-centric (less multi-language)
- Debugging across abstraction layers
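The compilation step itself can be simple. A hypothetical `@db.job`-style decorator could just register functions at import time and emit the Bazel stanzas from Option A (every name below is illustrative, not an existing DataBuild API):

```python
REGISTRY = []

def job(fn):
    """Hypothetical decorator: records the function so a later compile
    step can emit Bazel targets; nothing executes at import time."""
    REGISTRY.append(fn)
    return fn

def compile_to_bazel():
    # Emit one databuild_job stanza per registered job, mirroring the
    # pure-Bazel form shown earlier.
    return "\n".join(
        'databuild_job(\n'
        f'    name = "{fn.__name__}",\n'
        f'    binary = ":{fn.__name__}_binary",\n'
        ')'
        for fn in REGISTRY
    )

@job
def process(date):
    ...

build_file = compile_to_bazel()
```

The generated BUILD file is what production actually runs, which is how this option keeps Bazel's guarantees while letting developers write Python.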
Option C: Rust DSL with Procedural Macros
#[job]
fn process(
    #[partition("raw/{date}")] input: Partition<Data>
) -> Partition<Output, "output/{date}"> {
    input.load()?.transform().save()
}
Narrative: "The High-Performance Data Platform"
- For organizations processing massive scale
- Where performance and correctness are equally critical
- Teams willing to invest in Rust expertise
Strengths:
- Compile-time guarantees with elegance
- Zero-cost abstractions
- Single language with execution engine
Weaknesses:
- Steep learning curve
- Smaller talent pool
- Less flexible than Python
Decision 3: Orchestration Philosophy
Option A: Explicit Orchestration (Traditional)
- Users define execution order and dependencies
- Similar to Airflow, Prefect, Dagster
- Thrives in: Organizations with complex business logic requiring explicit control
Option B: Implicit Orchestration (Current DataBuild)
- Users define jobs and dependencies
- System figures out execution order
- Thrives in: Data engineering teams wanting to focus on transformations, not plumbing
Option C: No Orchestration (Pure Declarative)
@partition("clean/{date}")
class CleanData:
    source = "raw/*/{date}"

    def transform(self, raw):
        # Pure function, no orchestration
        return clean(merge(raw))
Narrative: "The SQL-for-Data-Pipelines Platform"
- Orchestration is an implementation detail
- Users declare relationships, system handles everything
- Thrives in: Next-generation data platforms, organizations ready to rethink data processing
Strengths:
- Eliminates entire categories of bugs
- Enables powerful optimizations
Weaknesses:
- Paradigm shift for users
- Less control over execution
- Harder to debug when things go wrong
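The "users declare relationships, system handles everything" claim can be made concrete with a toy resolver: partitions register what they are built from, and a recursive `materialize` walks the declared relationships. Everything below is a hypothetical sketch, not DataBuild's API:

```python
PARTITIONS = {}

def partition(pattern, source=None):
    """Hypothetical registry: each partition declares what it is built
    from; nobody declares when or in what order anything runs."""
    def register(fn):
        PARTITIONS[pattern] = (source, fn)
        return fn
    return register

@partition("raw/{date}")
def raw_events(date, inputs):
    return [f"event@{date}"]  # stand-in for reading source data

@partition("clean/{date}", source="raw/{date}")
def clean_events(date, inputs):
    return [e.upper() for e in inputs]  # pure transformation

def materialize(pattern, date):
    # The system walks declared relationships; execution order is an
    # implementation detail, exactly as the narrative suggests.
    source, fn = PARTITIONS[pattern]
    inputs = materialize(source, date) if source else None
    return fn(date, inputs)

result = materialize("clean/{date}", "2024-01-01")
```

Because the resolver owns execution, it is free to cache, parallelize, or reorder materialization, which is where the "powerful optimizations" strength comes from.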
Strategic Recommendations by Use Case
For Startups/Fast-Moving Teams
Recommendation: Python DSL → Bazel
- Start with Python for rapid development
- Compile to Bazel for production
- Migrate critical jobs to native Bazel/Rust over time
For Enterprise/Regulated Industries
Recommendation: Pure Bazel with Graph-Level Dependencies
- Maintain full auditability and reproducibility
- Use graph-level deps for performance
- Consider Rust DSL for new greenfield projects
For Next-Gen Data Platforms
Recommendation: Pure Declarative with Rust Implementation
- Leap directly to declarative model
- Build on Rust for performance and correctness
- Pioneer the "SQL for pipelines" approach
Implementation Patterns
Pattern 1: Gradual Migration
Current Bazel → Python DSL (compile to Bazel) → Pure Declarative
- Low risk, high compatibility
- Teams can adopt at their own pace
- Preserves existing investments
Pattern 2: Parallel Tracks
Bazel Interface (production)
↕️
Python Interface (development)
- Different interfaces for different use cases
- Development velocity without sacrificing production guarantees
- Higher maintenance burden
Pattern 3: Clean Break
New declarative system alongside legacy
- Fastest path to innovation
- No legacy constraints
- Requires significant investment
Key Technical Insights
Single Source of Truth Principle
Whichever path is chosen, dependency declaration and resolution must be co-located:
# Good: Single source
def process(input: partition("raw/{date}")):
    return input.load().transform()

# Bad: Split sources
# In config: depends = ["raw/{date}"]
# In code: data = load("raw/{date}")  # Duplication!
The Pattern Language Insight
No new DSL needed for patterns - leverage existing language features:
- Python: f-strings, glob, regex
- Rust: const generics, pattern matching
- Both: bidirectional pattern template libraries
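As an example of leveraging existing language features, Python's `string.Formatter` lets a single template act as both a formatter and a parser, which is the essence of a bidirectional pattern library (the `pattern_to_regex` helper is a sketch, not an existing API):

```python
import re
from string import Formatter

def pattern_to_regex(template):
    """Turn 'raw/{source}/{date}' into a regex with named groups, so the
    same template both formats partition names and parses them back."""
    out = ""
    for literal, field, _, _ in Formatter().parse(template):
        out += re.escape(literal)
        if field:
            out += f"(?P<{field}>[^/]+)"
    return re.compile(out + "$")

template = "raw/{source}/{date}"
name = template.format(source="clicks", date="2024-01-01")       # forward
parsed = pattern_to_regex(template).match(name).groupdict()      # backward
```

One template string is the single source of truth for both directions, so patterns never drift between the job that writes partitions and the planner that matches them.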
The Orchestration Elimination Insight
The highest abstraction isn't better orchestration - it's no orchestration. Like SQL eliminated query planning from user concern, DataBuild could eliminate execution planning.
Conclusion
The optimal path depends on organizational maturity and ambition:
- Conservative Evolution: Enhance Bazel with better patterns and graph-level deps
- Developer-Focused: Python DSL compiling to Bazel, maintaining guarantees
- Revolutionary Leap: Pure declarative relationships with Rust implementation
Each path has merit. The key is choosing one that aligns with your organization's data infrastructure philosophy and long-term vision.