Why DataBuild? A Case for Declarative Data Orchestration
Bullet points that should eventually become a blog post.
Introduction
- The Vision: What if data engineers could iterate fearlessly on massive pipelines? What if individual engineers could onboard and support 10x as many datasets?
- The Problem: Most orchestration abstractions are brittle, so moving quickly and correctly without complete product certainty is difficult
- The Reality: Teams either move fast and break things, or move slowly to avoid breaking things
- The Promise: Declarative data orchestration enables both speed and correctness
- The Inspiration: Learning from the declarative evolutions in Bazel, SQL, and Kubernetes
The Hidden Costs of Data Interfaces
Coupling by Obscurity
- Data interfaces hide critical dependencies between jobs
- Violates fundamental software design principles:
  - Dependencies: Can't understand impact in isolation
  - Obscurity: Critical coupling information isn't visible in code
  - Change amplification: Simple changes require modifications everywhere
- Example: Bug fix in dataset A breaks jobs B, C, D... but you don't know until runtime
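A minimal sketch of this failure mode (job names, paths, and columns are hypothetical): job_b reads the file job_a writes, but nothing in either function records that dependency, so a change in job_a's schema only surfaces when job_b crashes at runtime.

```python
import pandas as pd  # requires pandas plus a parquet engine such as pyarrow

def job_a(out_path: str = "dataset_a.parquet") -> None:
    # "Bug fix" in dataset A: the amount column is renamed to amount_usd.
    # Nothing here says who depends on this file or its schema.
    df = pd.DataFrame({"user_id": [1, 2], "amount_usd": [9.99, 4.50]})
    df.to_parquet(out_path)

def job_b(in_path: str = "dataset_a.parquet") -> float:
    # Implicitly coupled to job_a through a shared path and schema.
    df = pd.read_parquet(in_path)
    return df["amount"].sum()  # KeyError: the coupling surfaces only at runtime

if __name__ == "__main__":
    job_a()
    job_b()
```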
The Orchestration Trap
- Engineers spend too much time writing, updating, and manually testing orchestration
- Orchestration code is:
  - Constantly changing as requirements evolve
  - Nearly impossible to test meaningfully
  - Brittle and breaks when the dependency graph changes
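For contrast, a sketch of the style of orchestration code this refers to (all job names are stand-ins): ordering, parallelism, and retries are hard-coded, so every change to the graph means editing and re-testing this script by hand.

```python
import random

# Stand-in job stubs; imagine each wraps a real extract/transform task.
def extract_users() -> None: print("extract users")
def extract_orders() -> None: print("extract orders")
def join_users_orders() -> None:
    if random.random() < 0.3:  # simulate a flaky dependency
        raise RuntimeError("transient failure")
    print("join users and orders")
def build_daily_report() -> None: print("build daily report")

def run_pipeline() -> None:
    extract_users()              # ordering is baked in
    extract_orders()             # independent, but serialized anyway
    for attempt in range(3):     # ad-hoc retry policy for one step only
        try:
            join_users_orders()  # silently assumes both extracts ran first
            break
        except RuntimeError:
            if attempt == 2:
                raise
    build_daily_report()         # every new consumer means editing this function

if __name__ == "__main__":
    run_pipeline()
```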
The Declarative Alternative
Learning from Proven Systems
Bazel: Declare targets and dependencies → system handles build orchestration
SQL: Declare what data you want → query planner handles execution strategy
Kubernetes: Declare desired state → controllers handle deployment orchestration
Inversion of Control for Data
- Engineers declare what: jobs and data dependencies
- System handles how: execution order, parallelization, failure recovery
- Enables local reasoning: understand jobs in isolation
- Supports fearless iteration: changes automatically propagate correctly
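A sketch of what "declare the what" could look like; the Job structure and dataset names are illustrative, not DataBuild's actual API. Each job states only what it reads and writes, and the execution order falls out of those declarations.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    # A job declares only its inputs and outputs; no ordering, no scheduling.
    name: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)

JOBS = [
    Job("ingest_events", outputs=["events/{date}"]),
    Job("sessionize", inputs=["events/{date}"], outputs=["sessions/{date}"]),
    Job("daily_report", inputs=["sessions/{date}"], outputs=["report/{date}"]),
]

def execution_order(jobs: list[Job]) -> list[str]:
    """Derive a topological run order from the declarations (the 'how')."""
    producers = {out: j.name for j in jobs for out in j.outputs}
    deps = {j.name: {producers[i] for i in j.inputs if i in producers} for j in jobs}
    order: list[str] = []
    while deps:
        ready = sorted(n for n, d in deps.items() if not d)
        if not ready:
            raise ValueError("cycle in declared dependencies")
        order.extend(ready)
        for n in ready:
            del deps[n]
        for d in deps.values():
            d.difference_update(ready)
    return order

if __name__ == "__main__":
    print(execution_order(JOBS))  # ['ingest_events', 'sessionize', 'daily_report']
```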
Continuous Reconciliation
- Triggers periodically ask: "ensure all expected partitions exist"
- DataBuild determines what's missing, stale, or needs rebuilding
- System maintains desired state without manual intervention
- Self-healing pipelines that adapt to changing upstream data
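Under simplifying assumptions (daily partitions, invented helper names), the reconciliation step reduces to a diff between the partitions that should exist and the ones that do; staleness checks against upstream versions would follow the same pattern.

```python
from datetime import date, timedelta

def expected_partitions(start: date, end: date) -> set[str]:
    """Every daily partition that should exist between start and end, inclusive."""
    days = (end - start).days + 1
    return {(start + timedelta(d)).isoformat() for d in range(days)}

def reconcile(existing: set[str], start: date, end: date) -> set[str]:
    """Partitions that are missing and therefore need to be built."""
    return expected_partitions(start, end) - existing

if __name__ == "__main__":
    # Pretend the warehouse already holds these partitions.
    existing = {"2024-06-01", "2024-06-02", "2024-06-04"}
    missing = reconcile(existing, date(2024, 6, 1), date(2024, 6, 5))
    print(sorted(missing))  # ['2024-06-03', '2024-06-05'] -> request builds for these
```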
Operational Simplicity
Stateless by Design
- Each build request is independent and ephemeral
- Append-only event log vs. complex mutable orchestrator state
- No database migrations or careful state preservation across versions
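A toy illustration of the append-only idea (event names are made up): the only write is an append, and any view of current state is derived by replaying the log, so there is no mutable orchestrator state to migrate between versions.

```python
# The log only grows; nothing is updated in place.
log: list[dict] = []

def append(event_type: str, partition: str) -> None:
    log.append({"type": event_type, "partition": partition})

def partition_status() -> dict[str, str]:
    """Replay the log to derive the latest status of every partition."""
    status: dict[str, str] = {}
    for event in log:
        status[event["partition"]] = event["type"]
    return status

if __name__ == "__main__":
    append("build_requested", "events/2024-06-03")
    append("build_succeeded", "events/2024-06-03")
    append("build_requested", "events/2024-06-05")
    print(partition_status())
    # {'events/2024-06-03': 'build_succeeded', 'events/2024-06-05': 'build_requested'}
```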
Deployment as Code Compilation
- Following Bazel's model: build binary, ship it
- Auto-generated deployment configs (Helm charts, etc.)
- Version updates are the norm, not the exception
Separation of Concerns
- DataBuild: Dependency resolution + want propagation
- External systems: Scheduling (cron/triggers), infrastructure (Kubernetes)
- Result: Operational complexity focused where it belongs
The DataBuild Vision
Core Tenets
- No dependency knowledge necessary to materialize data
- Only local dependency knowledge needed to develop
- Explicit coupling via declared data dependencies
- Automatic orchestration delegation
What This Enables
- Fearless iteration: Change any part of a large graph confidently
- Trivial deployment: Single binary updates, no complex state management
- Automatic correctness: System prevents composition bugs at "compile time" (see the sketch after this list)
- Scalable development: Near-zero marginal effort for new datasets
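One way to read "compile time" here, sketched with hypothetical structures: before anything executes, check that every declared input has a declared producer, so a composition bug shows up as a validation error rather than a failed run.

```python
# Hypothetical pre-execution check over declared jobs: every input needs a producer.
def validate(jobs: dict[str, dict[str, list[str]]]) -> list[str]:
    produced = {out for spec in jobs.values() for out in spec["outputs"]}
    return [
        f"{name}: no producer for input '{inp}'"
        for name, spec in jobs.items()
        for inp in spec["inputs"]
        if inp not in produced
    ]

if __name__ == "__main__":
    jobs = {
        "sessionize": {"inputs": ["events/{date}"], "outputs": ["sessions/{date}"]},
        "daily_report": {"inputs": ["sessions/{date}", "users/{date}"],
                         "outputs": ["report/{date}"]},
    }
    for error in validate(jobs):
        print(error)  # daily_report: no producer for input 'users/{date}'
```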
Conclusion
- Data engineering doesn't have to be this hard
- Declarative approaches have transformed other domains
- DataBuild brings these proven patterns to data orchestration
- The future: engineers focus on business logic, systems handle the complexity