DataBuild Interface Evolution: Strategic Options and Technical Decisions
This document outlines the key technical decisions for evolving DataBuild's interface, examining each option through the lens of modern data infrastructure needs.
Executive Summary
DataBuild must choose between three fundamental interface strategies:
- Pure Bazel (current): Maximum guarantees, maximum verbosity
- High-Level DSL: Expressive interfaces that compile to Bazel
- Pure Declarative: Eliminate orchestration entirely through relational modeling
The Core Technical Decisions
Decision 1: Where Should Dependency Logic Live?
Option A: In-Job Config (Current Design)
# Job knows its own dependencies
def config(self, date):
    previous = date - timedelta(days=1)  # date is a datetime.date
    return {"inputs": [f"raw/{date}", f"raw/{previous}"]}
- ✅ Locality of knowledge - dependency logic next to usage
- ✅ Natural evolution - changes happen in one place
- ❌ Performance overhead - subprocess per config call
- Thrives in: Complex enterprise environments where jobs have intricate, evolving dependencies
Option B: Graph-Level Declaration
databuild_job(
    name = "process_daily",
    depends_on = ["raw/{date}"],
    produces = ["processed/{date}"],
)
- ✅ Static analysis - entire graph visible without execution
- ✅ Performance - microseconds vs seconds for planning
- ❌ Flexibility - harder to express dynamic dependencies
- ❌ Implicit coupling - jobs must still duplicate dependency-resolution logic in their own code
- Thrives in: High-frequency trading systems, real-time analytics where planning speed matters
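A minimal sketch of what graph-level declaration buys: with job specs as plain data (the job names and the `plan` helper below are illustrative, not DataBuild API), the entire execution plan resolves by string substitution, without running any job code.

```python
# Hypothetical graph-level declarations: each job lists the partition
# patterns it consumes and produces.
JOBS = {
    "process_daily": {
        "depends_on": ["raw/{date}"],
        "produces": ["processed/{date}"],
    },
    "report_daily": {
        "depends_on": ["processed/{date}"],
        "produces": ["report/{date}"],
    },
}

def plan(target_date):
    """Resolve and order the whole graph by substitution alone --
    no job code executes during planning."""
    resolved = {
        name: {
            key: [p.format(date=target_date) for p in patterns]
            for key, patterns in spec.items()
        }
        for name, spec in JOBS.items()
    }
    # Map each concrete partition to the job that produces it.
    produced_by = {
        part: name
        for name, spec in resolved.items()
        for part in spec["produces"]
    }
    order, seen = [], set()

    def visit(name):
        # Depth-first: producers are appended before their consumers.
        if name in seen:
            return
        seen.add(name)
        for dep in resolved[name]["depends_on"]:
            if dep in produced_by:
                visit(produced_by[dep])
        order.append(name)

    for name in JOBS:
        visit(name)
    return order
```

Because everything is static data, this is the "microseconds vs seconds" planning path: no subprocess is spawned per job.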
Option C: Hybrid Pattern-Based
# Patterns at graph level, resolution at runtime
@job(dependency_pattern="raw/{source}/[date-window:date]")
def aggregate(date, window=7):
    # Runtime resolves the pattern into exact partitions
    ...
- ✅ Best of both - fast planning with flexibility
- ✅ Progressive disclosure - simple cases simple
- ❌ Complexity - two places to look
- Thrives in: Modern data platforms serving diverse teams with varying sophistication
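To make the runtime half of the hybrid concrete, here is one way the `[date-window:date]` range could expand into concrete daily partitions (the `resolve_window` helper and the `raw/{source}/...` layout are assumptions for illustration):

```python
from datetime import date, timedelta

def resolve_window(source, end, window):
    """Expand the hypothetical 'raw/{source}/[date-window:date]' range
    into the concrete daily partitions it covers, inclusive of `end`."""
    days = [end - timedelta(days=i) for i in range(window - 1, -1, -1)]
    return [f"raw/{source}/{d.isoformat()}" for d in days]

# A 7-day window ending 2024-01-07 covers Jan 1 through Jan 7.
parts = resolve_window(source="clicks", end=date(2024, 1, 7), window=7)
```

The graph planner only needs the pattern to draw edges; the exact partition list is computed here, at execution time.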
Decision 2: Interface Language Choice
Option A: Pure Bazel (Status Quo)
databuild_job(
    name = "etl",
    binary = ":etl_binary",
)
Narrative: "The Infrastructure-as-Code Platform"
- For organizations that value reproducibility above all else
- Where data pipelines are mission-critical infrastructure
- Teams that already use Bazel for other systems
Strengths:
- Hermetic builds guarantee reproducibility
- Multi-language support out of the box
- Battle-tested deployment story
Weaknesses:
- High barrier to entry
- Verbose for simple cases
- Limited expressiveness
Option B: Python DSL → Bazel Compilation
@db.job
def process(date: str, raw: partition("raw/{date}")) -> partition("clean/{date}"):
    return raw.load().transform().save()
Narrative: "The Developer-First Data Platform"
- For data teams that move fast and iterate quickly
- Where Python is already the lingua franca
- Organizations prioritizing developer productivity
Strengths:
- Far more concise than the equivalent Bazel
- Natural for data scientists/engineers
- Rich ecosystem integration
Weaknesses:
- Additional compilation step
- Python-centric (less multi-language)
- Debugging across abstraction layers
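The compilation step itself can be simple. A hypothetical `@db.job`-style decorator could just register functions at import time and emit the Bazel stanzas from Option A (every name below is illustrative, not an existing DataBuild API):

```python
REGISTRY = []

def job(fn):
    """Hypothetical decorator: records the function so a later compile
    step can emit Bazel targets; nothing executes at import time."""
    REGISTRY.append(fn)
    return fn

def compile_to_bazel():
    # Emit one databuild_job stanza per registered job, mirroring the
    # pure-Bazel form shown earlier.
    return "\n".join(
        'databuild_job(\n'
        f'    name = "{fn.__name__}",\n'
        f'    binary = ":{fn.__name__}_binary",\n'
        ')'
        for fn in REGISTRY
    )

@job
def process(date):
    ...

build_file = compile_to_bazel()
```

The generated BUILD file is what production actually runs, which is how this option keeps Bazel's guarantees while letting developers write Python.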
Option C: Rust DSL with Procedural Macros
#[job]
fn process(
    #[partition("raw/{date}")] input: Partition<Data>
) -> Partition<Output, "output/{date}"> {
    input.load()?.transform().save()
}
Narrative: "The High-Performance Data Platform"
- For organizations processing massive scale
- Where performance and correctness are equally critical
- Teams willing to invest in Rust expertise
Strengths:
- Compile-time guarantees with elegance
- Zero-cost abstractions
- Single language with execution engine
Weaknesses:
- Steep learning curve
- Smaller talent pool
- Less flexible than Python
Decision 3: Orchestration Philosophy
Option A: Explicit Orchestration (Traditional)
- Users define execution order and dependencies
- Similar to Airflow, Prefect, Dagster
- Thrives in: Organizations with complex business logic requiring explicit control
Option B: Implicit Orchestration (Current DataBuild)
- Users define jobs and dependencies
- System figures out execution order
- Thrives in: Data engineering teams wanting to focus on transformations, not plumbing
Option C: No Orchestration (Pure Declarative)
@partition("clean/{date}")
class CleanData:
    source = "raw/*/{date}"

    def transform(self, raw):
        # Pure function, no orchestration
        return clean(merge(raw))
Narrative: "The SQL-for-Data-Pipelines Platform"
- Orchestration is an implementation detail
- Users declare relationships, system handles everything
- Thrives in: Next-generation data platforms, organizations ready to rethink data processing
Strengths:
- Eliminates entire categories of bugs
- Enables powerful optimizations
Weaknesses:
- Paradigm shift for users
- Less control over execution
- Harder to debug when things go wrong
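The "users declare relationships, system handles everything" claim can be made concrete with a toy resolver: partitions register what they are built from, and a recursive `materialize` walks the declared relationships. Everything below is a hypothetical sketch, not DataBuild's API:

```python
PARTITIONS = {}

def partition(pattern, source=None):
    """Hypothetical registry: each partition declares what it is built
    from; nobody declares when or in what order anything runs."""
    def register(fn):
        PARTITIONS[pattern] = (source, fn)
        return fn
    return register

@partition("raw/{date}")
def raw_events(date, inputs):
    return [f"event@{date}"]  # stand-in for reading source data

@partition("clean/{date}", source="raw/{date}")
def clean_events(date, inputs):
    return [e.upper() for e in inputs]  # pure transformation

def materialize(pattern, date):
    # The system walks declared relationships; execution order is an
    # implementation detail, exactly as the narrative suggests.
    source, fn = PARTITIONS[pattern]
    inputs = materialize(source, date) if source else None
    return fn(date, inputs)

result = materialize("clean/{date}", "2024-01-01")
```

Because the resolver owns execution, it is free to cache, parallelize, or reorder materialization, which is where the "powerful optimizations" strength comes from.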
Strategic Recommendations by Use Case
For Startups/Fast-Moving Teams
Recommendation: Python DSL → Bazel
- Start with Python for rapid development
- Compile to Bazel for production
- Migrate critical jobs to native Bazel/Rust over time
For Enterprise/Regulated Industries
Recommendation: Pure Bazel with Graph-Level Dependencies
- Maintain full auditability and reproducibility
- Use graph-level deps for performance
- Consider Rust DSL for new greenfield projects
For Next-Gen Data Platforms
Recommendation: Pure Declarative with Rust Implementation
- Leap directly to declarative model
- Build on Rust for performance and correctness
- Pioneer the "SQL for pipelines" approach
Implementation Patterns
Pattern 1: Gradual Migration
Current Bazel → Python DSL (compile to Bazel) → Pure Declarative
- Low risk, high compatibility
- Teams can adopt at their own pace
- Preserves existing investments
Pattern 2: Parallel Tracks
Bazel Interface (production)
↕️
Python Interface (development)
- Different interfaces for different use cases
- Development velocity without sacrificing production guarantees
- Higher maintenance burden
Pattern 3: Clean Break
New declarative system alongside legacy
- Fastest path to innovation
- No legacy constraints
- Requires significant investment
Key Technical Insights
Single Source of Truth Principle
Whichever path is chosen, dependency declaration and resolution must be co-located:
# Good: Single source
def process(input: partition("raw/{date}")):
    return input.load().transform()

# Bad: Split sources
# In config: depends = ["raw/{date}"]
# In code: data = load("raw/{date}")  # Duplication!
The Pattern Language Insight
No new DSL needed for patterns - leverage existing language features:
- Python: f-strings, glob, regex
- Rust: const generics, pattern matching
- Both: bidirectional pattern template libraries
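As an example of leveraging existing language features, Python's `string.Formatter` lets a single template act as both a formatter and a parser, which is the essence of a bidirectional pattern library (the `pattern_to_regex` helper is a sketch, not an existing API):

```python
import re
from string import Formatter

def pattern_to_regex(template):
    """Turn 'raw/{source}/{date}' into a regex with named groups, so the
    same template both formats partition names and parses them back."""
    out = ""
    for literal, field, _, _ in Formatter().parse(template):
        out += re.escape(literal)
        if field:
            out += f"(?P<{field}>[^/]+)"
    return re.compile(out + "$")

template = "raw/{source}/{date}"
name = template.format(source="clicks", date="2024-01-01")       # forward
parsed = pattern_to_regex(template).match(name).groupdict()      # backward
```

One template string is the single source of truth for both directions, so patterns never drift between the job that writes partitions and the planner that matches them.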
The Orchestration Elimination Insight
The highest abstraction isn't better orchestration - it's no orchestration. Like SQL eliminated query planning from user concern, DataBuild could eliminate execution planning.
Conclusion
The optimal path depends on organizational maturity and ambition:
- Conservative Evolution: Enhance Bazel with better patterns and graph-level deps
- Developer-Focused: Python DSL compiling to Bazel, maintaining guarantees
- Revolutionary Leap: Pure declarative relationships with Rust implementation
Each path has merit. The key is choosing one that aligns with your organization's data infrastructure philosophy and long-term vision.