# Why DataBuild? A Case for Declarative Data Orchestration

Bullet points that should eventually become a blog post.

## Introduction
- **The Vision**: What if data engineers could iterate fearlessly on massive pipelines? What if each engineer could onboard and support 10x as many datasets?
- **The Problem**: Most orchestration abstractions are brittle, so it is hard to move both quickly and correctly while product requirements are still uncertain
- **The Reality**: Teams either move fast and break things, or move slowly to avoid breaking things
- **The Promise**: Declarative data orchestration enables both speed and correctness
- **The Inspiration**: Learning from the declarative evolutions in Bazel, SQL, and Kubernetes

## The Hidden Costs of Data Interfaces

### Coupling by Obscurity
- Data interfaces hide critical dependencies between jobs
- This violates fundamental software design principles:
    - **Dependencies**: Can't understand impact in isolation
    - **Obscurity**: Critical coupling information isn't visible in code
    - **Change amplification**: Simple changes require modifications everywhere
- Example: A bug fix in dataset A breaks jobs B, C, and D, but nobody finds out until runtime

### The Orchestration Trap
- Engineers spend too much time writing, updating, and manually testing orchestration
- Orchestration code is:
    - Constantly changing as requirements evolve
    - Nearly impossible to test meaningfully
    - Brittle: it breaks whenever the dependency graph changes

## The Declarative Alternative

### Learning from Proven Systems
- **Bazel**: Declare targets and dependencies → system handles build orchestration
- **SQL**: Declare what data you want → query planner handles execution strategy
- **Kubernetes**: Declare desired state → controllers handle deployment orchestration

### Inversion of Control for Data
- Engineers declare **what**: jobs and data dependencies
- System handles **how**: execution order, parallelization, failure recovery
- Enables **local reasoning**: understand jobs in isolation
- Supports **fearless iteration**: changes automatically propagate correctly
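
To make the what/how split concrete, here is a minimal Python sketch. It is not DataBuild's actual API: the `Job` class, the `plan` function, and the dataset names are all hypothetical. Each job declares only the datasets it reads and writes, and the execution order is derived from those declarations.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Job:
    """Hypothetical job declaration: a name plus the datasets it reads and writes."""
    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def plan(jobs: list[Job]) -> list[str]:
    """Derive an execution order from declared data dependencies alone."""
    producer = {out: job.name for job in jobs for out in job.outputs}
    sorter = TopologicalSorter()
    for job in jobs:
        # A job depends on whichever job produces each of its inputs.
        # Inputs with no producer are treated as external sources.
        upstream = [producer[i] for i in job.inputs if i in producer]
        sorter.add(job.name, *upstream)
    return list(sorter.static_order())


# Jobs are declared in arbitrary order; no job names another job.
jobs = [
    Job("report", inputs=("daily_agg",), outputs=("report",)),
    Job("aggregate", inputs=("clean_events",), outputs=("daily_agg",)),
    Job("clean", inputs=("raw_events",), outputs=("clean_events",)),
]
order = plan(jobs)
print(order)  # ['clean', 'aggregate', 'report']
```

Because no job refers to another job by name, inserting a new step between `clean` and `aggregate` only requires declaring its inputs and outputs. The order is recomputed rather than hand-edited.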

### Continuous Reconciliation
- Triggers periodically issue one request: "ensure all expected partitions exist"
- DataBuild determines what's missing, stale, or needs rebuilding
- System maintains desired state without manual intervention
- Self-healing pipelines that adapt to changing upstream data
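
The reconciliation step can be sketched as a comparison between desired and observed state. This is an illustration under stated assumptions, not DataBuild's implementation: daily partitions keyed by ISO date, integer timestamps, and the names `expected_partitions` and `reconcile` are all hypothetical.

```python
from datetime import date, timedelta


def expected_partitions(start: date, end: date) -> set[str]:
    """Desired state: one daily partition for each date in [start, end]."""
    days = (end - start).days + 1
    return {(start + timedelta(days=n)).isoformat() for n in range(days)}


def reconcile(
    expected: set[str],
    built_at: dict[str, int],
    upstream_updated_at: dict[str, int],
) -> list[str]:
    """Return the partitions to build: missing ones plus stale ones.

    A partition is considered stale when its upstream data changed
    after the partition was last built (an assumed staleness rule).
    """
    existing = set(built_at)
    missing = expected - existing
    stale = {
        p for p in expected & existing
        if upstream_updated_at.get(p, 0) > built_at[p]
    }
    return sorted(missing | stale)


expected = expected_partitions(date(2024, 1, 1), date(2024, 1, 3))
built_at = {"2024-01-01": 100, "2024-01-02": 100}  # last build time per partition
upstream_updated_at = {"2024-01-02": 150}          # upstream changed after the build
to_build = reconcile(expected, built_at, upstream_updated_at)
print(to_build)  # ['2024-01-02', '2024-01-03']
```

The trigger never says which partitions to build. It only restates the desired state, and the difference from observed state becomes the work list.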

## Operational Simplicity

### Stateless by Design
- Each build request is independent and ephemeral
- An append-only event log replaces complex, mutable orchestrator state
- No database migrations or careful state preservation across versions
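
As a rough illustration of why an append-only log avoids migrations, the sketch below derives current status by folding over events instead of updating rows in place. The `Event` shape and the event kinds are assumptions made for this example, not DataBuild's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical log entry; events are only ever appended, never edited."""
    partition: str
    kind: str  # "requested", "succeeded", or "failed"


def current_status(log: list[Event]) -> dict[str, str]:
    """Fold the log into a view of each partition's latest status."""
    status: dict[str, str] = {}
    for event in log:
        # Later events override earlier ones for the same partition.
        status[event.partition] = event.kind
    return status


log: list[Event] = []
log.append(Event("sales/2024-01-01", "requested"))
log.append(Event("sales/2024-01-01", "failed"))
log.append(Event("sales/2024-01-01", "requested"))  # retry
log.append(Event("sales/2024-01-01", "succeeded"))
log.append(Event("sales/2024-01-02", "requested"))

status = current_status(log)
print(status)
```

The status view is disposable. A new version can compute a different view from the same log, so there is no mutable state to carry across upgrades.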

### Deployment as Code Compilation
- Following Bazel's model: build binary, ship it
- Auto-generated deployment configs (Helm charts, etc.)
- Version updates are the norm, not the exception

### Separation of Concerns
- **DataBuild**: Dependency resolution + execution planning
- **External systems**: Scheduling (cron/triggers), infrastructure (Kubernetes)
- **Result**: Operational complexity stays where it belongs

## The DataBuild Vision

### Core Tenets
- Materializing data requires no knowledge of its dependencies
- Developing a job requires knowledge of only its local dependencies
- Coupling is explicit, expressed through declared data dependencies
- Orchestration is delegated to the system automatically

### What This Enables
- **Fearless iteration**: Change any part of a large graph confidently
- **Trivial deployment**: Single binary updates, no complex state management
- **Automatic correctness**: System catches composition bugs at "compile time", before any job runs
- **Scalable development**: Near-zero marginal effort for new datasets
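
One way to picture a "compile time" check is a validation pass over the declared graph before anything executes. The sketch below is an assumption about what such a check could look like, not a description of DataBuild's behavior. The `validate` function, the dictionary layout, and the dataset names are hypothetical.

```python
def validate(jobs: dict[str, dict[str, list[str]]], sources: set[str]) -> list[str]:
    """Return composition errors found in the declarations, without running any job."""
    errors: list[str] = []
    producers: dict[str, str] = {}

    # Each dataset should have exactly one producing job.
    for name, spec in jobs.items():
        for out in spec["outputs"]:
            if out in producers:
                errors.append(f"{out!r} is produced by both {producers[out]} and {name}")
            producers[out] = name

    # Each input must come from a declared job or a known external source.
    for name, spec in jobs.items():
        for inp in spec["inputs"]:
            if inp not in producers and inp not in sources:
                errors.append(f"{name}: input {inp!r} has no declared producer")
    return errors


jobs = {
    "clean": {"inputs": ["raw_events"], "outputs": ["clean_events"]},
    # Typo: "clean_event" instead of "clean_events".
    "aggregate": {"inputs": ["clean_event"], "outputs": ["daily_agg"]},
}
errors = validate(jobs, sources={"raw_events"})
print(errors)  # ["aggregate: input 'clean_event' has no declared producer"]
```

With imperative orchestration, a typo like this tends to surface as a runtime failure. Here it is reported from the declarations alone.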

## Conclusion
- Data engineering doesn't have to be this hard
- Declarative approaches have transformed other domains
- DataBuild brings these proven patterns to data orchestration
- The future: engineers focus on business logic while systems handle the complexity