# Why DataBuild? A Case for Declarative Data Orchestration

Bullet points that should eventually become a blog post.

## Introduction

- **The Vision**: What if data engineers could iterate fearlessly on massive pipelines? What if each engineer could onboard and support 10x as many datasets?
- **The Problem**: Most orchestration abstractions are brittle, so moving both quickly and correctly is difficult while product requirements are still uncertain
- **The Reality**: Teams either move fast and break things, or move slowly to avoid breaking things
- **The Promise**: Declarative data orchestration enables both speed and correctness
- **The Inspiration**: Lessons from the shift to declarative models in Bazel, SQL, and Kubernetes

## The Hidden Costs of Data Interfaces

### Coupling by Obscurity

- Data interfaces hide critical dependencies between jobs
- This violates fundamental software design principles:
  - **Dependencies**: The impact of a change can't be understood in isolation
  - **Obscurity**: Critical coupling information isn't visible in code
  - **Change amplification**: Simple changes require modifications everywhere
- Example: A bug fix in dataset A breaks jobs B, C, D... but you don't find out until runtime
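
To make the dataset A example concrete, here is a minimal sketch of what explicit dependency declarations buy you: the set of affected jobs can be computed before anything runs. The `READS` and `WRITES` tables and the `downstream_jobs` helper are hypothetical names invented for this illustration; they are not DataBuild's API.

```python
from collections import deque

# Hypothetical declarations: the datasets each job reads.
READS = {
    "job_b": ["dataset_a"],
    "job_c": ["dataset_a"],
    "job_d": ["dataset_b"],
}
# Hypothetical declarations: the dataset each job writes.
WRITES = {"job_b": "dataset_b", "job_c": "dataset_c", "job_d": "dataset_d"}


def downstream_jobs(dataset: str) -> list[str]:
    """Return every job that transitively depends on `dataset`, in BFS order."""
    affected: list[str] = []
    seen: set[str] = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for job in sorted(READS):
            if current in READS[job] and job not in seen:
                seen.add(job)
                affected.append(job)
                # The job's output may feed further jobs, so follow it too.
                queue.append(WRITES[job])
    return affected
```

With these declarations, `downstream_jobs("dataset_a")` returns `["job_b", "job_c", "job_d"]`, so the blast radius of the bug fix is known before runtime.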

### The Orchestration Trap

- Engineers spend too much time writing, updating, and manually testing orchestration
- Orchestration code is:
  - Constantly changing as requirements evolve
  - Nearly impossible to test meaningfully
  - Brittle: it breaks whenever the dependency graph changes

## The Declarative Alternative

### Learning from Proven Systems

- **Bazel**: Declare targets and dependencies → system handles build orchestration
- **SQL**: Declare what data you want → query planner handles execution strategy
- **Kubernetes**: Declare desired state → controllers handle deployment orchestration

### Inversion of Control for Data

- Engineers declare **what**: jobs and their data dependencies
- System handles **how**: execution order, parallelization, failure recovery
- Enables **local reasoning**: each job can be understood in isolation
- Supports **fearless iteration**: changes automatically propagate correctly
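
As a rough sketch of the what/how split, the snippet below derives execution order and parallelism purely from declared dependencies, using Python's standard `graphlib` module. The job names and the `plan_waves` helper are assumptions made up for this example, not how DataBuild plans builds.

```python
from graphlib import TopologicalSorter

# Hypothetical declarations: each job lists only the jobs whose output it reads.
DEPENDS_ON = {
    "clean_events": [],
    "clean_users": [],
    "sessions": ["clean_events"],
    "daily_report": ["sessions", "clean_users"],
}


def plan_waves(depends_on: dict[str, list[str]]) -> list[list[str]]:
    """Group jobs into waves; jobs within one wave could run in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose inputs are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Here `plan_waves(DEPENDS_ON)` yields `[["clean_events", "clean_users"], ["sessions"], ["daily_report"]]`. No job declares anything about ordering; the plan falls out of the dependency graph.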

### Continuous Reconciliation

- Triggers periodically issue a single request: "ensure all expected partitions exist"
- DataBuild determines what's missing, stale, or needs rebuilding
- System maintains desired state without manual intervention
- Result: self-healing pipelines that adapt to changing upstream data
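
The reconciliation idea can be sketched as a comparison between desired and observed state. Everything below is an illustrative assumption: the `PartitionState` record, the logical timestamps, and the staleness rule (an input changed after the last build) are not taken from DataBuild.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    built_at: int             # logical time of the last successful build
    upstream_changed_at: int  # latest logical time at which any input changed


def reconcile(expected: set[str], existing: dict[str, PartitionState]) -> dict[str, list[str]]:
    """Compare desired partitions against observed state and list the work needed."""
    missing = sorted(p for p in expected if p not in existing)
    stale = sorted(
        p for p in expected
        if p in existing and existing[p].upstream_changed_at > existing[p].built_at
    )
    return {"missing": missing, "stale": stale}
```

A trigger would call `reconcile` on a schedule and submit builds for whatever it returns. Running it again after those builds succeed returns empty lists, which is what makes the loop safe to repeat.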

## Operational Simplicity

### Stateless by Design

- Each build request is independent and ephemeral
- An append-only event log replaces complex, mutable orchestrator state
- No database migrations or careful state preservation across versions
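
One way to picture the append-only log: current state is never stored and mutated, it is derived by replaying events. The `BuildEvent` shape and status strings below are hypothetical, chosen only to illustrate the pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildEvent:
    partition: str
    status: str  # hypothetical values: "requested", "succeeded", "failed"


def current_status(log: list[BuildEvent]) -> dict[str, str]:
    """Derive the latest status per partition by replaying the log in order."""
    status: dict[str, str] = {}
    for event in log:
        # Later events overwrite earlier ones; the log itself is never edited.
        status[event.partition] = event.status
    return status
```

Because the view is recomputed from the log, a new version of the system can change how it interprets events without migrating any stored state.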

### Deployment as Code Compilation

- Follows Bazel's model: build a binary, ship it
- Auto-generated deployment configs (Helm charts, etc.)
- Version updates are the norm, not the exception

### Separation of Concerns

- **DataBuild**: Dependency resolution + execution planning
- **External systems**: Scheduling (cron/triggers), infrastructure (Kubernetes)
- **Result**: Operational complexity stays where it belongs

## The DataBuild Vision

### Core Tenets

- No dependency knowledge is needed to materialize data
- Only local dependency knowledge is needed to develop a job
- Coupling is explicit, via declared data dependencies
- Orchestration is delegated to the system automatically

### What This Enables

- **Fearless iteration**: Change any part of a large graph confidently
- **Trivial deployment**: Single-binary updates, no complex state management
- **Automatic correctness**: System prevents composition bugs at "compile time"
- **Scalable development**: Near-zero marginal effort for new datasets
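
To illustrate what catching composition bugs at "compile time" could mean, the sketch below checks a declared graph without running any job. The `jobs` layout (`inputs`/`outputs` lists), the `sources` set of externally provided datasets, and `validate_graph` itself are all assumptions for this example rather than DataBuild's actual checks.

```python
def validate_graph(jobs: dict[str, dict[str, list[str]]], sources: set[str]) -> list[str]:
    """Return the composition errors found without running any job."""
    errors = []
    producers: dict[str, str] = {}
    # Each dataset should have exactly one producing job.
    for name in sorted(jobs):
        for output in jobs[name]["outputs"]:
            if output in producers:
                errors.append(f"{output}: produced by both {producers[output]} and {name}")
            else:
                producers[output] = name
    # Each input must come from a declared job or a known external source.
    for name in sorted(jobs):
        for needed in jobs[name]["inputs"]:
            if needed not in producers and needed not in sources:
                errors.append(f"{name}: input {needed} has no producer")
    return errors
```

An empty list means the graph composes. Anything else can fail a CI check long before a pipeline run would have surfaced the problem.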

## Conclusion

- Data engineering doesn't have to be this hard
- Declarative approaches have transformed other domains
- DataBuild brings these proven patterns to data orchestration
- The future: engineers focus on business logic while systems handle the complexity