databuild/design/why-databuild.md
2025-09-03 21:32:17 -07:00

Why DataBuild? A Case for Declarative Data Orchestration

Bullet points that should eventually become a blog post.

Introduction

  • The Vision: What if data engineers could iterate fearlessly on massive pipelines? What if each engineer could onboard and support 10x as many datasets?
  • The Problem: Most orchestration abstractions are brittle, so it is hard to move both quickly and correctly while product requirements are still uncertain
  • The Reality: Teams either move fast and break things, or move slowly to avoid breaking things
  • The Promise: Declarative data orchestration enables both speed and correctness
  • The Inspiration: Learning from the declarative evolutions in Bazel, SQL, and Kubernetes

The Hidden Costs of Data Interfaces

Coupling by Obscurity

  • Data interfaces hide critical dependencies between jobs
  • Violates fundamental software design principles:
    • Dependencies: Can't understand impact in isolation
    • Obscurity: Critical coupling information isn't visible in code
    • Change amplification: Simple changes require modifications everywhere
  • Example: Bug fix in dataset A breaks jobs B, C, D... but you don't know until runtime

The Orchestration Trap

  • Engineers spend too much time writing, updating, and manually testing orchestration
  • Orchestration code is:
    • Constantly changing as requirements evolve
    • Nearly impossible to test meaningfully
    • Brittle, breaking whenever the dependency graph changes

The Declarative Alternative

Learning from Proven Systems

  • Bazel: Declare targets and dependencies → system handles build orchestration
  • SQL: Declare what data you want → query planner handles execution strategy
  • Kubernetes: Declare desired state → controllers handle deployment orchestration

Inversion of Control for Data

  • Engineers declare what: jobs and data dependencies
  • System handles how: execution order, parallelization, failure recovery
  • Enables local reasoning: understand jobs in isolation
  • Supports fearless iteration: changes automatically propagate correctly
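
To make the what/how split concrete, here is a minimal sketch in Python. It is not DataBuild's API: the job names, the `JOBS` dictionary shape, and the `execution_order` helper are all hypothetical. Each job declares only the dataset it produces and the datasets it reads, and the run order is derived from those declarations.

```python
from graphlib import TopologicalSorter

# Hypothetical declarations: each job names the dataset it produces and the
# datasets it reads. None of these names come from DataBuild itself.
JOBS = {
    "report": {"output": "weekly_report", "inputs": ["daily_stats", "events_clean"]},
    "daily_rollup": {"output": "daily_stats", "inputs": ["events_clean"]},
    "clean_events": {"output": "events_clean", "inputs": ["events_raw"]},
}


def execution_order(jobs):
    """Derive a valid run order from declared data dependencies alone."""
    # Map each dataset to the job that produces it.
    producer = {spec["output"]: name for name, spec in jobs.items()}
    # A job depends on the producers of its inputs. Inputs with no producer
    # (here "events_raw") are treated as external source data.
    graph = {
        name: {producer[d] for d in spec["inputs"] if d in producer}
        for name, spec in jobs.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(JOBS)
print(order)
```

The jobs are declared in the "wrong" order on purpose: nobody wrote down that `clean_events` runs first, yet the derived order puts it ahead of its consumers. Adding a new job means adding one entry, with no orchestration code to update.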

Continuous Reconciliation

  • Triggers periodically ask: "ensure all expected partitions exist"
  • DataBuild determines what's missing, stale, or needs rebuilding
  • System maintains desired state without manual intervention
  • Self-healing pipelines that adapt to changing upstream data
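
One way to picture a reconciliation pass is sketched below in Python. This is an illustration under stated assumptions, not DataBuild's implementation: the daily partition keys, the integer timestamps, and the `expected_partitions` and `reconcile` functions are invented for the example. The trigger states which partitions should exist, and the pass reports which ones are missing or older than their upstream data.

```python
from datetime import date, timedelta


def expected_partitions(start, end):
    """Every daily partition key the trigger wants to exist, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def reconcile(expected, built_at, upstream_changed_at):
    """Return (partition, reason) pairs for everything that needs a build."""
    to_build = []
    for key in expected:
        if key not in built_at:
            to_build.append((key, "missing"))
        elif upstream_changed_at.get(key, 0) > built_at[key]:
            # Upstream data changed after our last build of this partition.
            to_build.append((key, "stale"))
    return to_build


expected = expected_partitions(date(2025, 1, 1), date(2025, 1, 4))
# Hypothetical bookkeeping: partition key -> time of last successful build.
built_at = {"2025-01-01": 100, "2025-01-02": 100, "2025-01-03": 100}
# Hypothetical upstream change times; 2025-01-02 was rewritten after we built it.
upstream_changed_at = {"2025-01-02": 150}

plan = reconcile(expected, built_at, upstream_changed_at)
print(plan)
```

Running the same pass again after the two builds succeed would return an empty plan, which is what makes it safe to call periodically.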

Operational Simplicity

Stateless by Design

  • Each build request is independent and ephemeral
  • An append-only event log replaces complex mutable orchestrator state
  • No database migrations or careful state preservation across versions
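
The append-only idea can be sketched in a few lines of Python. The event fields (`partition`, `type`) and the event type names below are assumptions made for illustration, not DataBuild's actual schema. The point is that current state is never stored or mutated; it is recomputed by folding over the log.

```python
def partition_status(events):
    """Fold an append-only event list into the latest status per partition."""
    status = {}
    for event in events:  # events are only ever appended, never edited
        status[event["partition"]] = event["type"]
    return status


# Hypothetical event log; in practice this would be durable storage.
log = []
log.append({"partition": "daily_stats/2025-01-01", "type": "build_requested"})
log.append({"partition": "daily_stats/2025-01-01", "type": "build_succeeded"})
log.append({"partition": "daily_stats/2025-01-02", "type": "build_requested"})
log.append({"partition": "daily_stats/2025-01-02", "type": "build_failed"})

status = partition_status(log)
print(status)
```

Because the derived view is disposable, a new version of the system can change how it interprets the log without migrating stored state.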

Deployment as Code Compilation

  • Following Bazel's model: build binary, ship it
  • Auto-generated deployment configs (Helm charts, etc.)
  • Version updates are the norm, not the exception

Separation of Concerns

  • DataBuild: Dependency resolution + want propagation
  • External systems: Scheduling (cron/triggers), infrastructure (Kubernetes)
  • Result: Operational complexity focused where it belongs

The DataBuild Vision

Core Tenets

  • No dependency knowledge necessary to materialize data
  • Only local dependency knowledge needed to develop
  • Explicit coupling via declared data dependencies
  • Orchestration delegated to the system automatically

What This Enables

  • Fearless iteration: Change any part of a large graph confidently
  • Trivial deployment: Single binary updates, no complex state management
  • Automatic correctness: System prevents composition bugs at "compile time"
  • Scalable development: Near-zero marginal effort for new datasets
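
As a rough illustration of catching composition bugs before anything runs, the Python sketch below checks a declared graph for two problems: an input that nothing produces, and a dataset claimed by two jobs. The `check_graph` function and the job dictionary shape are hypothetical and say nothing about how DataBuild performs its own checks.

```python
def check_graph(jobs, source_datasets):
    """Return a list of composition errors found before any job runs."""
    errors = []
    producers = {}
    for name, spec in jobs.items():
        out = spec["output"]
        if out in producers:
            errors.append(f"{out}: produced by both {producers[out]} and {name}")
        else:
            producers[out] = name
    # Datasets a job may read: anything produced in the graph or declared
    # as an external source.
    available = set(producers) | set(source_datasets)
    for name, spec in jobs.items():
        for dep in spec["inputs"]:
            if dep not in available:
                errors.append(f"{name}: input {dep} has no producer")
    return errors


# Hypothetical graph with a typo: "events_cleaned" instead of "events_clean".
jobs = {
    "clean_events": {"output": "events_clean", "inputs": ["events_raw"]},
    "daily_rollup": {"output": "daily_stats", "inputs": ["events_cleaned"]},
}

errors = check_graph(jobs, source_datasets={"events_raw"})
print(errors)
```

In a hand-written orchestrator the typo would surface only when `daily_rollup` ran and found no data. With declared dependencies it can be rejected at check time.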

Conclusion

  • Data engineering doesn't have to be this hard
  • Declarative approaches have transformed other domains
  • DataBuild brings these proven patterns to data orchestration
  • The future: engineers focus on business logic, systems handle the complexity