# DataBuild Interface Evolution: Strategic Options and Technical Decisions

This document outlines the key technical decisions for evolving DataBuild's interface, examining each option through the lens of modern data infrastructure needs.

## Executive Summary

DataBuild must choose between three fundamental interface strategies:

1. **Pure Bazel** (current): Maximum guarantees, maximum verbosity
2. **High-Level DSL**: Expressive interfaces that compile to Bazel
3. **Pure Declarative**: Eliminate orchestration entirely through relational modeling

## The Core Technical Decisions

### Decision 1: Where Should Dependency Logic Live?

**Option A: In-Job Config (Current Design)**

```python
# Job knows its own dependencies
def config(self, date):
    return {"inputs": [f"raw/{date}", f"raw/{date - 1}"]}
```

- ✅ **Locality of knowledge** - dependency logic lives next to its usage
- ✅ **Natural evolution** - changes happen in one place
- ❌ **Performance overhead** - one subprocess per config call
- **Thrives in**: Complex enterprise environments where jobs have intricate, evolving dependencies

**Option B: Graph-Level Declaration**

```starlark
databuild_job(
    name = "process_daily",
    depends_on = ["raw/{date}"],
    produces = ["processed/{date}"],
)
```

- ✅ **Static analysis** - the entire graph is visible without executing job code
- ✅ **Performance** - microseconds vs. seconds for planning
- ❌ **Flexibility** - harder to express dynamic dependencies
- ❌ **Implicit coupling** - jobs must duplicate dependency resolution logic internally
- **Thrives in**: High-frequency trading systems and real-time analytics, where planning speed matters

**Option C: Hybrid Pattern-Based**

```python
# Patterns at graph level, resolution at runtime
@job(dependency_pattern="raw/{source}/[date-window:date]")
def aggregate(date, window=7):
    # Runtime resolves the exact partitions
    ...
```

- ✅ **Best of both** - fast planning with flexibility
- ✅ **Progressive disclosure** - simple cases stay simple
- ❌ **Complexity** - two places to look (see the sketch below)
- **Thrives in**: Modern data platforms serving diverse teams with varying sophistication
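To make the hybrid concrete, here is a minimal sketch, assuming a hypothetical `@job` decorator and registry (`JOBS`, `resolve_window`, and the window syntax are illustrative, not DataBuild's actual API): the dependency pattern is visible to the planner without running any job code, while the date window expands to concrete partitions only at runtime.

```python
from datetime import date, timedelta

# Graph-level registry: the planner reads patterns from here without executing jobs.
JOBS: dict[str, dict] = {}

def job(dependency_pattern: str):
    """Hypothetical decorator: registers a job's dependency pattern at graph level."""
    def decorator(fn):
        JOBS[fn.__name__] = {"pattern": dependency_pattern, "fn": fn}
        return fn
    return decorator

def resolve_window(pattern: str, source: str, end: date, window: int) -> list[str]:
    """Runtime step: expand 'raw/{source}/[date-window:date]' into concrete partitions."""
    prefix = pattern.split("/[")[0].format(source=source)  # e.g. "raw/events"
    return [f"{prefix}/{(end - timedelta(days=i)).isoformat()}" for i in range(window)]

@job(dependency_pattern="raw/{source}/[date-window:date]")
def aggregate(run_date: date, window: int = 7):
    # Planning saw only the pattern; execution resolves the exact partitions.
    return resolve_window(JOBS["aggregate"]["pattern"], "events", run_date, window)

print(JOBS["aggregate"]["pattern"])           # raw/{source}/[date-window:date]
print(aggregate(date(2024, 1, 7), window=3))
# ['raw/events/2024-01-07', 'raw/events/2024-01-06', 'raw/events/2024-01-05']
```

The split is the point: planning stays fast because it only reads the registry, while the flexibility lives in the runtime resolution step.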
### Decision 2: Interface Language Choice

**Option A: Pure Bazel (Status Quo)**

```starlark
databuild_job(
    name = "etl",
    binary = ":etl_binary",
)
```

**Narrative**: "The Infrastructure-as-Code Platform"

- For organizations that value reproducibility above all else
- Where data pipelines are mission-critical infrastructure
- Teams that already use Bazel for other systems

**Strengths**:

- Hermetic builds guarantee reproducibility
- Multi-language support out of the box
- Battle-tested deployment story

**Weaknesses**:

- High barrier to entry
- Verbose for simple cases
- Limited expressiveness

**Option B: Python DSL → Bazel Compilation**

```python
@db.job
def process(date: str, raw: partition("raw/{date}")) -> partition("clean/{date}"):
    return raw.load().transform().save()
```

**Narrative**: "The Developer-First Data Platform"

- For data teams that move fast and iterate quickly
- Where Python is already the lingua franca
- Organizations prioritizing developer productivity

**Strengths**:

- 10x more concise than Bazel
- Natural for data scientists and engineers
- Rich ecosystem integration

**Weaknesses**:

- Additional compilation step
- Python-centric (less multi-language)
- Debugging across abstraction layers

**Option C: Rust DSL with Procedural Macros**

```rust
#[job]
fn process(
    #[partition("raw/{date}")] input: Partition,
) -> Partition {
    input.load()?.transform().save()
}
```

**Narrative**: "The High-Performance Data Platform"

- For organizations processing data at massive scale
- Where performance and correctness are equally critical
- Teams willing to invest in Rust expertise

**Strengths**:

- Compile-time guarantees with elegance
- Zero-cost abstractions
- A single language shared with the execution engine

**Weaknesses**:

- Steep learning curve
- Smaller talent pool
- Less flexible than Python

### Decision 3: Orchestration Philosophy

**Option A: Explicit Orchestration (Traditional)**

- Users define execution order and dependencies
- Similar to Airflow, Prefect, and Dagster
- **Thrives in**: Organizations with complex business logic requiring explicit control

**Option B: Implicit Orchestration (Current DataBuild)**

- Users define jobs and dependencies
- The system figures out execution order
- **Thrives in**: Data engineering teams wanting to focus on transformations, not plumbing

**Option C: No Orchestration (Pure Declarative)**

```python
@partition("clean/{date}")
class CleanData:
    source = "raw/*/{date}"

    def transform(self, raw):
        # Pure function, no orchestration
        return clean(merge(raw))
```

**Narrative**: "The SQL-for-Data-Pipelines Platform"

- Orchestration is an implementation detail
- Users declare relationships; the system handles everything else
- **Thrives in**: Next-generation data platforms and organizations ready to rethink data processing

**Strengths**:

- Eliminates entire categories of bugs
- Enables powerful optimizations

**Weaknesses**:

- Paradigm shift for users
- Less control over execution
- Harder to debug when things go wrong

## Strategic Recommendations by Use Case

### For Startups/Fast-Moving Teams

**Recommendation**: Python DSL → Bazel

- Start with Python for rapid development
- Compile to Bazel for production
- Migrate critical jobs to native Bazel/Rust over time

### For Enterprise/Regulated Industries

**Recommendation**: Pure Bazel with Graph-Level Dependencies

- Maintain full auditability and reproducibility
- Use graph-level dependencies for planning performance
- Consider the Rust DSL for greenfield projects

### For Next-Gen Data Platforms

**Recommendation**: Pure Declarative with a Rust Implementation

- Leap directly to the declarative model
- Build on Rust for performance and correctness
- Pioneer the "SQL for pipelines" approach

## Implementation Patterns

### Pattern 1: Gradual Migration

```
Current Bazel → Python DSL (compile to Bazel) → Pure Declarative
```

- Low risk, high compatibility
- Teams can adopt at their own pace
- Preserves existing investments

### Pattern 2: Parallel Tracks

```
Bazel Interface (production)
            ↕️
Python Interface (development)
```

- Different interfaces for different use cases
- Development velocity without sacrificing production guarantees
- Higher maintenance burden

### Pattern 3: Clean Break

```
New declarative system alongside the legacy one
```

- Fastest path to innovation
- No legacy constraints
- Requires significant investment

## Key Technical Insights

### Single Source of Truth Principle

Whichever path is chosen, dependency declaration and resolution must be co-located:

```python
# Good: single source
def process(input: partition("raw/{date}")):
    return input.load().transform()

# Bad: split sources
# In config: depends = ["raw/{date}"]
# In code:   data = load("raw/{date}")  # Duplication!
```

### The Pattern Language Insight

No new DSL is needed for patterns - existing language features suffice (see the sketch below):

- Python: f-strings, glob, regex
- Rust: const generics, pattern matching
- Both: bidirectional pattern template libraries
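As a sketch of what a bidirectional pattern template could look like in Python, using only the standard library (the `PartitionPattern` class is hypothetical): a single template string renders partition IDs via `str.format` and parses them back via a regex derived from that same template.

```python
import re
import string

class PartitionPattern:
    """One template, two directions: format params -> id, parse id -> params."""

    def __init__(self, template: str):
        self.template = template
        # Derive a parsing regex from the same template that str.format renders.
        parts = []
        for literal, field, _, _ in string.Formatter().parse(template):
            parts.append(re.escape(literal))
            if field is not None:
                parts.append(f"(?P<{field}>[^/]+)")  # one path segment per field
        self._regex = re.compile("^" + "".join(parts) + "$")

    def render(self, **params: str) -> str:
        return self.template.format(**params)

    def parse(self, partition_id: str) -> dict[str, str] | None:
        m = self._regex.match(partition_id)
        return m.groupdict() if m else None

p = PartitionPattern("raw/{source}/{date}")
assert p.render(source="events", date="2024-01-07") == "raw/events/2024-01-07"
assert p.parse("raw/events/2024-01-07") == {"source": "events", "date": "2024-01-07"}
assert p.parse("clean/2024-01-07") is None  # non-matching ids are rejected
```

Because rendering and parsing share one template, the two directions cannot drift apart - the same single-source-of-truth argument made above.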
### The Orchestration Elimination Insight

The highest abstraction isn't better orchestration - it's no orchestration. Just as SQL removed query planning from the user's concern, DataBuild could remove execution planning.

## Conclusion

The optimal path depends on organizational maturity and ambition:

1. **Conservative Evolution**: Enhance Bazel with better patterns and graph-level dependencies
2. **Developer-Focused**: A Python DSL compiling to Bazel, maintaining its guarantees
3. **Revolutionary Leap**: Pure declarative relationships with a Rust implementation

Each path has merit. The key is choosing the one that aligns with your organization's data infrastructure philosophy and long-term vision.