Why DataBuild? A Case for Declarative Data Orchestration

Bullet points that should eventually become a blog post.

Introduction

The Vision: What if data engineers could iterate fearlessly on massive pipelines? What if individual engineers could onboard and support 10x as many datasets?
The Problem: Most orchestration abstractions are brittle, so moving quickly and correctly without complete product certainty is difficult
The Reality: Teams either move fast and break things, or move slowly to avoid breaking things
The Promise: Declarative data orchestration enables both speed and correctness
The Inspiration: Learning from the declarative evolutions in Bazel, SQL, and Kubernetes

The Hidden Costs of Data Interfaces

Coupling by Obscurity

Data interfaces hide critical dependencies between jobs
Violates fundamental software design principles by introducing:
- Dependencies: Can't understand impact in isolation
- Obscurity: Critical coupling information isn't visible in code
- Change amplification: Simple changes require modifications everywhere
Example: Bug fix in dataset A breaks jobs B, C, D... but you don't know until runtime

The Orchestration Trap

Engineers spend too much time writing, updating, and manually testing orchestration
Orchestration code is:
- Constantly changing as requirements evolve
- Nearly impossible to test meaningfully
- Brittle: it breaks whenever the dependency graph changes

The Declarative Alternative

Learning from Proven Systems

Bazel: Declare targets and dependencies → system handles build orchestration
SQL: Declare what data you want → query planner handles execution strategy
Kubernetes: Declare desired state → controllers handle deployment orchestration

Inversion of Control for Data

Engineers declare what: jobs and data dependencies
System handles how: execution order, parallelization, failure recovery
Enables local reasoning: understand jobs in isolation
Supports fearless iteration: changes automatically propagate correctly
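
A minimal sketch of the what/how split above, not DataBuild's actual API: the `Job` class and `plan` function are hypothetical names chosen for illustration. Each job declares only the datasets it reads and writes, and the execution order is derived from those declarations using the standard library's `graphlib`.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Job:
    """Hypothetical job declaration: a name plus the datasets it reads and writes."""
    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def plan(jobs):
    """Derive an execution order from declared data dependencies alone."""
    # Map every dataset to the job that produces it.
    producer = {out: job.name for job in jobs for out in job.outputs}
    # A job depends on whichever jobs produce its inputs; inputs with no
    # producer are treated as external sources.
    graph = {
        job.name: {producer[i] for i in job.inputs if i in producer}
        for job in jobs
    }
    return list(TopologicalSorter(graph).static_order())


# Jobs are declared in arbitrary order; nobody writes the ordering by hand.
jobs = [
    Job("report", inputs=("clean_events",), outputs=("daily_report",)),
    Job("ingest", inputs=(), outputs=("raw_events",)),
    Job("clean", inputs=("raw_events",), outputs=("clean_events",)),
]
order = plan(jobs)
```

Adding a new job means adding one declaration. The order is recomputed, and no existing job needs to change.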

Continuous Reconciliation

Triggers periodically ask: "ensure all expected partitions exist"
DataBuild determines what's missing, stale, or needs rebuilding
System maintains desired state without manual intervention
Self-healing pipelines that adapt to changing upstream data
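
The reconciliation step can be sketched as a pure comparison between expected and observed partitions. This is an illustration under assumptions, not DataBuild's implementation: `PartitionState`, `reconcile`, and the integer logical timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical record of what is known about a built partition."""
    built_at: int             # logical time of the last successful build
    upstream_changed_at: int  # latest logical time any upstream input changed


def reconcile(expected, existing):
    """Compare the desired partitions with what exists; report what to build."""
    # Expected partitions that have never been built.
    missing = sorted(p for p in expected if p not in existing)
    # Built partitions whose upstream data changed after the last build.
    stale = sorted(
        p for p in expected
        if p in existing
        and existing[p].upstream_changed_at > existing[p].built_at
    )
    return {"missing": missing, "stale": stale}


expected = {"date=01", "date=02", "date=03"}
existing = {
    "date=01": PartitionState(built_at=10, upstream_changed_at=5),
    "date=02": PartitionState(built_at=10, upstream_changed_at=12),
}
work = reconcile(expected, existing)
```

A trigger that runs this comparison periodically needs no memory of earlier runs, because each pass recomputes the gap from scratch.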

Operational Simplicity

Stateless by Design

Each build request is independent and ephemeral
Append-only event log vs. complex mutable orchestrator state
No database migrations or careful state preservation across versions
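
One way to picture the append-only log is sketched below, with a hypothetical `BuildEvent` type and event kinds that are assumptions rather than DataBuild's schema. Current status is never stored; it is derived by replaying the log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildEvent:
    """Hypothetical immutable log entry."""
    partition: str
    kind: str  # assumed kinds: "requested", "succeeded", "failed"


def current_status(log):
    """Fold the log into the latest known status per partition."""
    status = {}
    for event in log:
        # Later events overwrite earlier ones; the log itself is never edited.
        status[event.partition] = event.kind
    return status


log = [
    BuildEvent("date=01", "requested"),
    BuildEvent("date=02", "requested"),
    BuildEvent("date=01", "succeeded"),
    BuildEvent("date=02", "failed"),
    BuildEvent("date=02", "requested"),
]
status = current_status(log)
```

Because state is a function of the log, a new version of the derivation code can be deployed and simply replay the same events, with no in-place migration of mutable state.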

Deployment as Code Compilation

Following Bazel's model: build binary, ship it
Auto-generated deployment configs (Helm charts, etc.)
Version updates are the norm, not the exception

Separation of Concerns

DataBuild: Dependency resolution + want propagation
External systems: Scheduling (cron/triggers), infrastructure (Kubernetes)
Result: Operational complexity focused where it belongs

The DataBuild Vision

Core Tenets

No dependency knowledge necessary to materialize data
Only local dependency knowledge needed to develop
Explicit coupling via declared data dependencies
Automatic orchestration delegation

What This Enables

Fearless iteration: Change any part of a large graph confidently
Trivial deployment: Single binary updates, no complex state management
Automatic correctness: System prevents composition bugs at "compile time"
Scalable development: Near-zero marginal effort for new datasets
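
As an illustration of catching composition bugs at "compile time", the sketch below checks a declared graph before anything runs. The `validate` function and the shape of the `jobs` mapping are hypothetical, and a real system would likely check more (for example, cycles and partition compatibility).

```python
def validate(jobs, sources):
    """Check a declared graph before execution.

    jobs maps a job name to (inputs, outputs); sources is the set of
    datasets provided from outside the graph. Returns a list of errors.
    """
    errors = []
    producers = {}
    # Every dataset should have exactly one producing job.
    for name, (_inputs, outputs) in jobs.items():
        for out in outputs:
            if out in producers:
                errors.append(f"{out}: produced by both {producers[out]} and {name}")
            else:
                producers[out] = name
    # Every input must come from a job in the graph or a declared source.
    for name, (inputs, _outputs) in jobs.items():
        for inp in inputs:
            if inp not in producers and inp not in sources:
                errors.append(f"{name}: input {inp} has no producer")
    return errors


jobs = {
    "clean": (["raw_events"], ["clean_events"]),
    "report": (["clean_events", "dimensions"], ["daily_report"]),
}
errors = validate(jobs, sources={"raw_events"})
```

Here the missing `dimensions` producer is reported when the graph is validated rather than when `report` fails at runtime.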

Conclusion

Data engineering doesn't have to be this hard
Declarative approaches have transformed other domains
DataBuild brings these proven patterns to data orchestration
The future: engineers focus on business logic, systems handle the complexity
