databuild/design/requirements.md
Stuart Axelbrooke c2bd4f230c
Add explicit requirements.md
2025-09-07 18:03:55 -07:00

Requirements

Data Model

  • Partition oriented: All data is made up of partitions. Partitions are atomic, mutually exclusive, collectively exhaustive, and final.
  • All partitions can be produced by jobs contained in the graph.
  • The graph is able to unambiguously determine which job produces a given partition (if any) - the same partition cannot be produced by multiple jobs.
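The unique-producer rule above can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the databuild codebase: the `Job` and `Graph` classes, the use of a regular expression to describe the partitions a job can produce, and the partition reference format are all assumptions made for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Job:
    """Hypothetical job declaration: a name plus a pattern over partition refs."""
    name: str
    output_pattern: str  # regex that must match the whole partition ref


class AmbiguousProducer(Exception):
    """Raised when more than one job claims the same partition."""


class Graph:
    def __init__(self, jobs: list[Job]):
        self.jobs = jobs

    def producer_of(self, partition: str) -> Optional[Job]:
        """Return the single job that produces `partition`, or None if no job does."""
        matches = [j for j in self.jobs if re.fullmatch(j.output_pattern, partition)]
        if len(matches) > 1:
            names = ", ".join(j.name for j in matches)
            raise AmbiguousProducer(f"{partition} is claimed by multiple jobs: {names}")
        return matches[0] if matches else None


# Illustrative graph with two non-overlapping jobs (names are made up).
graph = Graph([
    Job("ingest_events", r"events/\d{4}-\d{2}-\d{2}"),
    Job("daily_report", r"reports/\d{4}-\d{2}-\d{2}"),
])
```

In this sketch, ambiguity is detected at lookup time. A real implementation would more likely reject overlapping job declarations when the graph is defined, in keeping with the preference for early correctness checks.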

Execution Model

  • The system must be built on a reliable control system that reconciles the desired data state with the current state (à la Kubernetes).
  • Job composition is automated based on explicit partition-based data-deps.
  • Partitions are the dependency signals: Jobs explicitly signal when specific upstream data deps are missing (so that they can be built).
  • Jobs must be idempotent, stateless functions (conditioned on their runtime-resolved config).
  • Jobs must be safely runnable concurrently.
  • Reasoning about jobs and produced partitions should be completely local (when no interrogation of upstreams is necessary).
  • All build state is internal to the graph service.
  • Should gracefully handle multiple sources requesting overlapping but different sets of partitions.
  • Must support both batched and single-partition jobs - some jobs efficiently process many partitions together, others process one at a time.
  • System must support heterogeneous compute platforms - Jobs can run locally, in containers, or on external systems (EMR, Databricks, BigQuery).
  • Users interact with the system primarily through declarative statements: wanting partitions to exist, marking partitions as invalid (taints), and defining jobs that transform partitions. The system handles all imperative orchestration internally.
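As a rough illustration of how several of these requirements fit together, the sketch below shows a reconciliation loop in which wants are declarative, jobs signal missing upstream partitions, and the loop turns those signals into new wants. All names (`MissingDeps`, `BuildState`, `reconcile`, the example jobs, and the partition reference format) are hypothetical and assumed for illustration only. It runs jobs in-process and serially, which a real system would not.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class MissingDeps(Exception):
    """Hypothetical signal a job raises when upstream partitions are absent."""

    def __init__(self, partitions):
        super().__init__(f"missing upstream partitions: {sorted(partitions)}")
        self.partitions = set(partitions)


@dataclass
class BuildState:
    wants: set[str] = field(default_factory=set)  # desired state (declarative)
    live: set[str] = field(default_factory=set)   # current state
    log: list[str] = field(default_factory=list)  # partitions built, in order


def reconcile(state: BuildState, producer_of, max_passes: int = 10) -> bool:
    """Run jobs for wanted-but-missing partitions until wants are met or passes run out."""
    for _ in range(max_passes):
        pending = state.wants - state.live
        if not pending:
            return True
        for partition in sorted(pending):
            job = producer_of(partition)
            if job is None:
                continue  # no job produces this partition; it stays pending
            try:
                job(partition, state.live)
                state.live.add(partition)
                state.log.append(partition)
            except MissingDeps as signal:
                # The job only knows its own deps; the loop turns them into wants.
                state.wants |= signal.partitions
    return not (state.wants - state.live)


def build_raw(partition, live):
    """Example leaf job with no upstream dependencies."""


def build_report(partition, live):
    """Example job that needs the raw partition for the same date."""
    upstream = "raw/" + partition.split("/", 1)[1]
    if upstream not in live:
        raise MissingDeps({upstream})


def producer_of(partition):
    if partition.startswith("raw/"):
        return build_raw
    if partition.startswith("reports/"):
        return build_report
    return None


# Two sources request overlapping sets; set union means each partition is built once.
state = BuildState()
state.wants |= {"reports/2025-09-07"}
state.wants |= {"reports/2025-09-07", "reports/2025-09-08"}
done = reconcile(state, producer_of)
```

The requester never names `raw/...` partitions. The report job discovers them locally and signals them, which is the "no dependency knowledge necessary to materialize data" property in miniature.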

Deployment Model

  • Trivially deployable: Jobs are described by Bazel targets that allow them to be trivially executed and packaged (literally, via code gen, or some other way).
  • Graphs must be composable: one graph must have an efficient way to explicitly depend on data produced by another graph.
  • Deployment updates must not break in-flight work - Continuous deployment is the norm; system must handle version transitions gracefully.

Observability Model

  • Observing build system state, decisions, and rationale should be easy (via CLI, web app, or API).
  • "Why doesn't this partition exist yet?" must be answerable for any requested partition, terminating in either "these jobs are still in progress" or "these jobs failed".
  • The structure and interfaces of the CLI, API, and web app should be fundamentally the same.
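One way to picture the "why doesn't this partition exist yet" requirement is a walk over upstream partitions that stops only at running or failed jobs. The sketch below is an assumption-laden illustration: the `Status` enum, the `why_missing` function, and the dictionaries describing status and upstream edges are all hypothetical and do not reflect the actual service API.

```python
from __future__ import annotations

from enum import Enum


class Status(Enum):
    LIVE = "live"
    RUNNING = "running"
    FAILED = "failed"
    WAITING = "waiting"  # blocked on one or more upstream partitions


def why_missing(partition: str,
                status: dict[str, Status],
                upstream: dict[str, list[str]]) -> list[str]:
    """Walk upstream from `partition`; return the running/failed jobs that explain its absence."""
    reasons = []
    seen = set()
    stack = [partition]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        state = status[current]
        if state is Status.RUNNING:
            reasons.append(f"job for {current} is still in progress")
        elif state is Status.FAILED:
            reasons.append(f"job for {current} failed")
        elif state is Status.WAITING:
            # Not a terminal answer: keep looking at what it is waiting on.
            stack.extend(upstream.get(current, []))
        # LIVE partitions explain nothing and are skipped.
    return sorted(reasons)


# Illustrative state: the report waits on one running and one failed upstream.
status = {
    "reports/2025-09-07": Status.WAITING,
    "raw/2025-09-07": Status.RUNNING,
    "dims/users": Status.FAILED,
}
upstream = {"reports/2025-09-07": ["raw/2025-09-07", "dims/users"]}

reasons = why_missing("reports/2025-09-07", status, upstream)
```

Because the same function could back the CLI, API, and web app, a design like this would also keep the three interfaces structurally aligned.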

Correctness Model

  • Compile-time correctness is the engine of long-term productivity and maintainability. Whenever reasonably possible, we choose compile-time correctness assertion mechanisms.

Tenets

  • Correctness over speed.
  • Explicit over implicit - Make all dependencies and decisions visible.
  • Simple over clever - Reconciliation loops over complex state machines.
  • No dependency knowledge necessary to materialize data: requesting data should require no knowledge of its dependencies, and implementing a new data-building job should require only local dependency knowledge. Global knowledge should never be necessary where existing jobs are already sufficient.