# Why DataBuild? A Case for Declarative Data Orchestration

Bullet points that should eventually become a blog post.

## Introduction

- **The Vision**: What if data engineers could iterate fearlessly on massive pipelines? If individual engineers could onboard and support 10x as many datasets?
- **The Problem**: Most orchestration abstractions are brittle; moving quickly and correctly without complete product certainty is difficult
- **The Reality**: Teams either move fast and break things, or move slowly to avoid breaking things
- **The Promise**: Declarative data orchestration enables both speed and correctness
- **The Inspiration**: Learning from the declarative evolutions in Bazel, SQL, and Kubernetes

## The Hidden Costs of Data Interfaces

### Coupling by Obscurity

- Data interfaces hide critical dependencies between jobs
- Violates fundamental software design principles:
  - **Dependencies**: Can't understand impact in isolation
  - **Obscurity**: Critical coupling information isn't visible in code
  - **Change amplification**: Simple changes require modifications everywhere
- Example: A bug fix in dataset A breaks jobs B, C, D...
  but you don't know until runtime

### The Orchestration Trap

- Engineers spend too much time writing, updating, and manually testing orchestration
- Orchestration code is:
  - Constantly changing as requirements evolve
  - Nearly impossible to test meaningfully
  - Brittle: it breaks when the dependency graph changes

## The Declarative Alternative

### Learning from Proven Systems

- **Bazel**: Declare targets and dependencies → system handles build orchestration
- **SQL**: Declare what data you want → query planner handles execution strategy
- **Kubernetes**: Declare desired state → controllers handle deployment orchestration

### Inversion of Control for Data

- Engineers declare **what**: jobs and data dependencies
- System handles **how**: execution order, parallelization, failure recovery
- Enables **local reasoning**: understand jobs in isolation
- Supports **fearless iteration**: changes automatically propagate correctly

### Continuous Reconciliation

- Triggers periodically ask: "ensure all expected partitions exist"
- DataBuild determines what's missing, stale, or needs rebuilding
- System maintains desired state without manual intervention
- Self-healing pipelines that adapt to changing upstream data

## Operational Simplicity

### Stateless by Design

- Each build request is independent and ephemeral
- Append-only event log vs. complex mutable orchestrator state
- No database migrations or careful state preservation across versions

### Deployment as Code Compilation

- Following Bazel's model: build the binary, ship it
- Auto-generated deployment configs (Helm charts, etc.)
- Version updates are the norm, not the exception

### Separation of Concerns

- **DataBuild**: Dependency resolution + execution planning
- **External systems**: Scheduling (cron/triggers), infrastructure (Kubernetes)
- **Result**: Operational complexity focused where it belongs

## The DataBuild Vision

### Core Tenets

- No dependency knowledge necessary to materialize data
- Only local dependency knowledge needed to develop
- Explicit coupling via declared data dependencies
- Automatic orchestration delegation

### What This Enables

- **Fearless iteration**: Change any part of a large graph confidently
- **Trivial deployment**: Single binary updates, no complex state management
- **Automatic correctness**: System prevents composition bugs at "compile time"
- **Scalable development**: Near-zero marginal effort for new datasets

## Conclusion

- Data engineering doesn't have to be this hard
- Declarative approaches have transformed other domains
- DataBuild brings these proven patterns to data orchestration
- The future: engineers focus on business logic, systems handle the complexity
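The core loop described above — engineers declare jobs and data dependencies, and the system figures out what is missing and in what order to rebuild it — can be sketched in a few lines. This is a minimal illustration only, using a hypothetical `Job` declaration and `plan` function; it is not DataBuild's actual API, which these notes don't specify.

```python
from dataclasses import dataclass, field

# Hypothetical declaration format for illustration -- not DataBuild's real API.
@dataclass
class Job:
    name: str
    produces: str                                    # dataset this job materializes
    depends_on: list = field(default_factory=list)   # upstream datasets it reads

def plan(jobs, expected, existing):
    """Reconciliation sketch: given declared jobs, the set of expected
    datasets, and the set that already exists, return the jobs to run,
    ordered by their declared dependencies."""
    producer = {j.produces: j for j in jobs}
    stale = set(expected) - set(existing)
    # Anything downstream of a missing/stale dataset must also be rebuilt.
    changed = True
    while changed:
        changed = False
        for j in jobs:
            if j.produces not in stale and any(d in stale for d in j.depends_on):
                stale.add(j.produces)
                changed = True
    # Order the stale datasets so every job runs after its upstreams.
    ordered, seen = [], set()
    def visit(ds):
        if ds in seen or ds not in stale:
            return
        seen.add(ds)
        for dep in producer[ds].depends_on:
            visit(dep)
        ordered.append(producer[ds].name)
    for ds in stale:
        visit(ds)
    return ordered

jobs = [
    Job("build_a", produces="a"),
    Job("build_b", produces="b", depends_on=["a"]),
    Job("build_c", produces="c", depends_on=["b"]),
]
# "a" is missing, so its downstreams "b" and "c" are rebuilt too,
# in dependency order.
print(plan(jobs, expected={"a", "b", "c"}, existing={"b", "c"}))
```

Note how the declaration carries only local knowledge (each job names just its own inputs and output), while the transitive "what breaks downstream" reasoning lives entirely in the planner — the inversion of control the notes argue for.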