parent
52869abc07
commit
38956ac7d4
2 changed files with 81 additions and 8 deletions
|
|
@@ -5,7 +5,7 @@ DataBuild is a trivially-deployable, partition-oriented, declarative data build
|
||||||
|
|
||||||
DataBuild is for teams at data-driven orgs who need reliable, flexible, and correct data pipelines and are tired of manually orchestrating complex dependency graphs. You define Jobs (that take input data partitions and produce output partitions), compose them into Graphs (partition dependency networks), and DataBuild handles the rest. Just ask it to build a partition, and DataBuild resolves the jobs that need to run, plans execution order, runs builds concurrently, and tracks and exposes build progress. Instead of writing orchestration code that breaks when dependencies change, you focus on the data transformations while DataBuild ensures your pipelines are correct, observable, and reliable.
|
||||||
|
|
||||||
For important context, check out [DESIGN.md](./DESIGN.md). Also, check out [`databuild.proto`](./databuild/databuild.proto) for key system interfaces. Key features:
|
For important context, check out [DESIGN.md](./DESIGN.md), along with designs in [design/](./design/). Also, check out [`databuild.proto`](./databuild/databuild.proto) for key system interfaces. Key features:
|
||||||
|
|
||||||
- **Declarative dependencies** - Ask for data, get data. Define partition dependencies and DataBuild automatically plans what jobs to run and when.
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@@ -1,11 +1,84 @@
|
||||||
|
|
||||||
# Why DataBuild?
|
# Why DataBuild? A Case for Declarative Data Orchestration
|
||||||
|
|
||||||
(work in progress)
|
Bullet points that should eventually become a blog post.
|
||||||
|
|
||||||
Why?
|
## Introduction
|
||||||
- Orchestration logic changes all the time, better to not write it directly
|
- **The Vision**: What if data engineers could iterate fearlessly on massive pipelines? What if individual engineers could onboard and support 10x as many datasets?
|
||||||
- Declarative -> Compile time correctness (e.g. can detect when no job produces a partition pattern)
|
- **The Problem**: Most orchestration abstractions are brittle, so it is hard to move quickly and correctly while product requirements are still uncertain
|
||||||
- Compartmentalized jobs + data deps -> Simplicity and compartmentalization of complexity
|
- **The Reality**: Teams either move fast and break things, or move slowly to avoid breaking things
|
||||||
- Bazel based -> Easy to deploy, maintain, and update
|
- **The Promise**: Declarative data orchestration enables both speed and correctness
|
||||||
|
- **The Inspiration**: Learning from the declarative evolutions in Bazel, SQL, and Kubernetes
|
||||||
|
|
||||||
|
## The Hidden Costs of Data Interfaces
|
||||||
|
|
||||||
|
### Coupling by Obscurity
|
||||||
|
- Data interfaces hide critical dependencies between jobs
|
||||||
|
- Violates fundamental software design principles:
|
||||||
|
  - **Dependencies**: Can't understand impact in isolation
|
||||||
|
  - **Obscurity**: Critical coupling information isn't visible in code
|
||||||
|
  - **Change amplification**: Simple changes require modifications everywhere
|
||||||
|
- Example: Bug fix in dataset A breaks jobs B, C, D... but you don't know until runtime
|
||||||
|
|
||||||
|
### The Orchestration Trap
|
||||||
|
- Engineers spend too much time writing, updating, and manually testing orchestration
|
||||||
|
- Orchestration code is:
|
||||||
|
  - Constantly changing as requirements evolve
|
||||||
|
  - Nearly impossible to test meaningfully
|
||||||
|
  - Brittle and breaks when the dependency graph changes
|
||||||
|
|
||||||
|
## The Declarative Alternative
|
||||||
|
|
||||||
|
### Learning from Proven Systems
|
||||||
|
**Bazel**: Declare targets and dependencies → system handles build orchestration
|
||||||
|
**SQL**: Declare what data you want → query planner handles execution strategy
|
||||||
|
**Kubernetes**: Declare desired state → controllers handle deployment orchestration
|
||||||
|
|
||||||
|
### Inversion of Control for Data
|
||||||
|
- Engineers declare **what**: jobs and data dependencies
|
||||||
|
- System handles **how**: execution order, parallelization, failure recovery
|
||||||
|
- Enables **local reasoning**: understand jobs in isolation
|
||||||
|
- Supports **fearless iteration**: changes automatically propagate correctly
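
The split between **what** and **how** can be illustrated with a minimal sketch. It is not DataBuild's actual API: the `JOBS` dictionary, the partition names, and the `plan` function are hypothetical, and the ordering uses Python's standard `graphlib` as a stand-in for a real planner. Each job declares only its own inputs and outputs, and the execution order is derived from those declarations.

```python
from graphlib import TopologicalSorter

# Hypothetical declarations: each job names only the partitions it reads and writes.
JOBS = {
    "ingest": {"inputs": [], "outputs": ["raw/2024-01-01"]},
    "clean": {"inputs": ["raw/2024-01-01"], "outputs": ["clean/2024-01-01"]},
    "report": {"inputs": ["clean/2024-01-01"], "outputs": ["report/2024-01-01"]},
}


def plan(target, jobs):
    """Return the job names needed to build `target`, in a valid execution order."""
    # Index every declared output partition by the job that produces it.
    producer = {out: name for name, job in jobs.items() for out in job["outputs"]}
    deps = {}  # job name -> set of upstream job names

    def visit(partition):
        if partition not in producer:
            raise LookupError(f"no job produces {partition!r}")
        name = producer[partition]
        if name not in deps:
            deps[name] = set()
            for upstream in jobs[name]["inputs"]:
                deps[name].add(visit(upstream))
        return name

    visit(target)
    # graphlib raises CycleError if the declarations contain a cycle.
    return list(TopologicalSorter(deps).static_order())
```

Calling `plan("report/2024-01-01", JOBS)` yields `ingest`, `clean`, `report` in that order. Adding a new upstream job changes the declarations only; no ordering code has to be edited.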
|
||||||
|
|
||||||
|
### Continuous Reconciliation
|
||||||
|
- Triggers periodically ask: "ensure all expected partitions exist"
|
||||||
|
- DataBuild determines what's missing, stale, or needs rebuilding
|
||||||
|
- System maintains desired state without manual intervention
|
||||||
|
- Self-healing pipelines that adapt to changing upstream data
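
One way to picture a reconciliation pass is the sketch below. The function name, the timestamp-based staleness rule, and the partition names are assumptions made for illustration; the outline does not specify how DataBuild decides that a partition is stale.

```python
def partitions_to_build(expected, built_at, inputs_updated_at):
    """Return the expected partitions that are missing or stale.

    expected: partitions the trigger wants to exist
    built_at: partition -> time of its last successful build
    inputs_updated_at: partition -> latest time any of its inputs changed
    """
    wanted = []
    for partition in expected:
        if partition not in built_at:
            wanted.append(partition)  # missing: never built
        elif inputs_updated_at.get(partition, 0) > built_at[partition]:
            wanted.append(partition)  # stale: inputs changed after the last build
    return wanted


# Hypothetical state observed by a periodic trigger.
EXPECTED = ["sales/2024-01-01", "sales/2024-01-02", "sales/2024-01-03"]
BUILT_AT = {"sales/2024-01-01": 100, "sales/2024-01-02": 100}
INPUTS_UPDATED_AT = {"sales/2024-01-02": 150}

to_build = partitions_to_build(EXPECTED, BUILT_AT, INPUTS_UPDATED_AT)
```

Here `to_build` contains `sales/2024-01-02` (stale) and `sales/2024-01-03` (missing). Running the same pass again after those builds succeed returns an empty list, which is what makes the trigger safe to repeat.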
|
||||||
|
|
||||||
|
## Operational Simplicity
|
||||||
|
|
||||||
|
### Stateless by Design
|
||||||
|
- Each build request is independent and ephemeral
|
||||||
|
- Append-only event log vs. complex mutable orchestrator state
|
||||||
|
- No database migrations or careful state preservation across versions
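
The append-only idea can be sketched as follows. The `Event` fields and status names are hypothetical and are not taken from `databuild.proto`. The point is that current state is derived by replaying the log, so there is no mutable record to migrate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    partition: str
    status: str  # hypothetical values: "requested", "running", "succeeded", "failed"


def current_status(events):
    """Fold the log into the latest status per partition; the log is never mutated."""
    status = {}
    for event in events:
        status[event.partition] = event.status
    return status


LOG = [
    Event("sales/2024-01-01", "requested"),
    Event("sales/2024-01-01", "running"),
    Event("sales/2024-01-02", "requested"),
    Event("sales/2024-01-01", "succeeded"),
]

snapshot = current_status(LOG)
```

A retry or a new build only appends further events. Any earlier view of the system can be reproduced by replaying a prefix of the log.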
|
||||||
|
|
||||||
|
### Deployment as Code Compilation
|
||||||
|
- Following Bazel's model: build binary, ship it
|
||||||
|
- Auto-generated deployment configs (Helm charts, etc.)
|
||||||
|
- Version updates are the norm, not the exception
|
||||||
|
|
||||||
|
### Separation of Concerns
|
||||||
|
- **DataBuild**: Dependency resolution + execution planning
|
||||||
|
- **External systems**: Scheduling (cron/triggers), infrastructure (Kubernetes)
|
||||||
|
- **Result**: Operational complexity focused where it belongs
|
||||||
|
|
||||||
|
## The DataBuild Vision
|
||||||
|
|
||||||
|
### Core Tenets
|
||||||
|
- No dependency knowledge necessary to materialize data
|
||||||
|
- Only local dependency knowledge needed to develop
|
||||||
|
- Explicit coupling via declared data dependencies
|
||||||
|
- Orchestration is delegated to the system automatically
|
||||||
|
|
||||||
|
### What This Enables
|
||||||
|
- **Fearless iteration**: Change any part of a large graph confidently
|
||||||
|
- **Trivial deployment**: Single binary updates, no complex state management
|
||||||
|
- **Automatic correctness**: System prevents composition bugs at "compile time"
|
||||||
|
- **Scalable development**: Near-zero marginal effort for new datasets
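
A "compile time" composition check could look like the sketch below. It is an assumption-laden illustration: the `GRAPH` structure, the pattern strings, and the `external` parameter are hypothetical, and patterns are compared as plain strings, not matched against each other. The check runs over the declarations alone, before any job executes.

```python
def unproduced_inputs(jobs, external=()):
    """Return (job, input pattern) pairs that no job output or external source provides."""
    produced = set(external)
    for job in jobs.values():
        produced.update(job["outputs"])
    problems = []
    for name, job in sorted(jobs.items()):
        for pattern in job["inputs"]:
            if pattern not in produced:
                problems.append((name, pattern))
    return problems


# Hypothetical graph: "report" reads fx_rates/*, which nothing declares as an output.
GRAPH = {
    "clean": {"inputs": ["raw/*"], "outputs": ["clean/*"]},
    "report": {"inputs": ["clean/*", "fx_rates/*"], "outputs": ["report/*"]},
}

problems = unproduced_inputs(GRAPH, external=["raw/*"])
```

Here `problems` flags `fx_rates/*` on the `report` job, so the broken composition surfaces when the graph is defined, not when the job first runs.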
|
||||||
|
|
||||||
|
## Conclusion
|
||||||
|
- Data engineering doesn't have to be this hard
|
||||||
|
- Declarative approaches have transformed other domains
|
||||||
|
- DataBuild brings these proven patterns to data orchestration
|
||||||
|
- The future: engineers focus on business logic, systems handle the complexity
|
||||||
|
|
|
||||||