parent cf449529a3
commit c2bd4f230c
5 changed files with 54 additions and 90 deletions
17 DESIGN.md
@@ -5,6 +5,8 @@ DataBuild is a trivially-deployable, partition-oriented, declarative build syste
## Philosophy

Inspired by [these requirements.](./design/requirements.md)
Many large-scale systems for producing data leave the complexity of true orchestration to the user - even DAG-based systems for implementing dependencies leave the system as a collection of DAGs, requiring engineers to solve the same "why doesn't this data exist?" and "how do I build this data?" problems over and over.

DataBuild takes inspiration from modern data orchestration and build systems to fully internalize this complexity, using the Job concept to localize all decisions about turning upstream data into output data (making all dependencies explicit), and the Graph concept to handle composition of jobs, enabling continuous data reconciliation for data platforms of all sizes. With Jobs and Graphs, DataBuild takes complete responsibility for the data build process, allowing engineers to consider only the concerns local to the jobs relevant to their feature.

@@ -22,12 +24,6 @@ Graphs and jobs are defined in [bazel](https://bazel.build), allowing graphs (an
- **Bazel Targets** - Bazel is a fast, extensible, and hermetic build system. DataBuild uses bazel targets to describe graphs and jobs, making graphs themselves deployable applications. Implementing a DataBuild app is the process of integrating your data build jobs into `databuild_job` bazel targets, and connecting them with a `databuild_graph` target.
- [**Graph Definition Languages**](design/graph-specification.md) - Application libraries in Python/Rust/Scala that use language features to enable ergonomic and succinct specification of jobs and graphs.

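As a rough illustration, wiring jobs into a graph might look like the following BUILD sketch. Only the `databuild_job` and `databuild_graph` rule names come from this document; every attribute name below is hypothetical, not DataBuild's actual rule API:

```python
# Hypothetical BUILD file sketch -- attribute names are illustrative only.
databuild_job(
    name = "clean_events",
    binary = ":clean_events_bin",   # assumed: the executable implementing the job
    outputs = ["events/{date}"],    # assumed: partition pattern this job produces
)

databuild_graph(
    name = "events_graph",
    jobs = [":clean_events"],       # assumed: jobs composed into this graph
)
```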
### Partition / Job Assumptions and Best Practices

- **Partitions are atomic and final** - Either the data is complete, or it's "not there".
- **Partitions are mutually exclusive and collectively exhaustive** - Row membership in a partition should be unambiguous and consistent.
- **Jobs are idempotent** - For the same input data and parameters, the same partition is produced (functionally).

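A minimal sketch of what these assumptions look like in practice. The partition scheme and job body here are hypothetical illustrations, not part of DataBuild's API:

```python
from datetime import datetime

def partition_key(event_time: datetime) -> str:
    """Assign a row to exactly one partition. Membership is unambiguous and
    consistent, so partitions are mutually exclusive and collectively
    exhaustive over all event times."""
    return event_time.date().isoformat()

def build_partition(rows: list[dict], key: str) -> list[dict]:
    """Idempotent job body: a pure function of its inputs, so re-running it
    on the same inputs yields (functionally) the same partition."""
    selected = [r for r in rows if partition_key(r["ts"]) == key]
    return sorted(selected, key=lambda r: r["ts"])
```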
## Bazel Components

### Job

@@ -60,6 +56,15 @@ Taints allow for manual/programmatic invalidation of built partitions. Partition
- Orchestration decisions and application logic are innately coupled.
- "systemd for data platforms"

## What About Configuration?

Configuration is all the information provided to a job that isn't a) the data the job reads or b) the partitions the job is being asked to produce. This could be info like "what modeling strategy do we use for this customer" or "when was this feed configured", etc. It has the inconvenient property of being critical for practical business value while also being difficult to fit in as data (since you often want to change and "tweak" it).

DataBuild explicitly and intentionally treats configuration as a job-internal concept: jobs are not pure functions, but it is a good idea for almost all of the implementation to be purely functional. It's recommended to calculate structured job configuration up front (along with trying to resolve the required input data), then invoke the rest of your job as a pure function over the config and data.

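The recommended shape - resolve config and inputs up front, then run a pure core - might look like this sketch (all names are illustrative, not DataBuild API):

```python
def resolve_config(customer: str) -> dict:
    """Impure edge: look up job configuration (modeling strategy, etc.).
    Illustrative stand-in for reading a real config store."""
    return {"strategy": "linear" if customer == "acme" else "baseline"}

def transform(config: dict, data: list[int]) -> list[int]:
    """Pure core: the same config + data always yields the same output."""
    factor = 2 if config["strategy"] == "linear" else 1
    return [x * factor for x in data]

def run_job(customer: str, data: list[int]) -> list[int]:
    """Thin impure shell around a pure function over config and data."""
    return transform(resolve_config(customer), data)
```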
What about situations where data is configured by a web app, etc.? Taints are a great way to invalidate partitions that are impacted by config changes, and you can create callbacks in your application to taint impacted partitions.

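Such a callback could be wired up along these lines. This is a hypothetical sketch: `taint` stands in for whatever taint API your graph exposes, and the change shape is invented:

```python
def impacted_partitions(change: dict) -> list[str]:
    """Map a config change to the partition keys it invalidates
    (hypothetical change shape: {"dataset": ..., "dates": [...]})."""
    return [f"{change['dataset']}/{d}" for d in change["dates"]]

def on_config_change(change: dict, taint) -> list[str]:
    """Callback for the web app: taint every partition impacted by the
    config change so the graph rebuilds them."""
    keys = impacted_partitions(change)
    for key in keys:
        taint(key)
    return keys
```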
## Assumptions

- Job -> partition relationships are canonical, job runs are idempotent

@@ -103,3 +103,6 @@ End to end testing:
```bash
./run_e2e_tests.sh
```

#### Test Strategy

Where possible, we make invalid state unrepresentable via Rust's type system. Where that is not possible, we prefer [property-testing](https://en.wikipedia.org/wiki/Software_testing#Property_testing), with a handful of bespoke tests to capture critical edge cases or important behaviors.

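In Rust this would typically use a crate such as proptest; the style is sketched here with a hand-rolled generator in Python for brevity. The partition scheme is a toy stand-in, not DataBuild code - the point is asserting a property over many random inputs rather than a few hand-picked cases:

```python
import random

def toy_partition_key(day: int) -> int:
    """Toy partition assignment: day number -> weekly partition id."""
    return day // 7

def test_membership_is_consistent_and_exclusive():
    """Property: every day falls inside exactly the partition it is
    assigned to, for many randomly generated days."""
    rng = random.Random(0)  # seeded for reproducibility
    for _ in range(1000):
        day = rng.randrange(10_000)
        k = toy_partition_key(day)
        assert k * 7 <= day < (k + 1) * 7
```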
38 design/requirements.md Normal file

@@ -0,0 +1,38 @@
# Requirements

## Data Model

- Partition oriented: All data is made up of partitions. Partitions are atomic, mutually exclusive, collectively exhaustive, and final.
- All partitions can be produced by jobs contained in the graph.
- The graph is able to unambiguously determine which job produces a given partition (if any) - the same partition cannot be produced by multiple jobs.

## Execution Model

- The system must rely on a reliable control system for reconciling the desired data state with the current state (a la k8s).
- Job composition is automated based on explicit partition-based data-deps.
- Partitions are the dependency signals: Jobs explicitly signal when specific upstream data deps are missing (so that they can be built).
- Jobs must be idempotent, stateless functions (conditioned on their runtime-resolved config).
- Jobs must be safely runnable concurrently.
- Reasoning about jobs and produced partitions should be completely local (when no interrogation of upstreams is necessary).
- All build state is internal to the graph service.
- Should gracefully handle multiple sources requesting overlapping but different sets of partitions.
- Must support both batched and single-partition jobs - Some jobs efficiently process many partitions together; others process one at a time.
- System must support heterogeneous compute platforms - Jobs can run locally, in containers, or on external systems (EMR, Databricks, BigQuery).
- Users interact with the system primarily through declarative statements: wanting partitions to exist, marking partitions as invalid (taints), and defining jobs that transform partitions. The system handles all imperative orchestration internally.

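The execution model above amounts to a k8s-style reconciliation pass: compare desired state with current state and plan the difference. A minimal sketch, with illustrative names only:

```python
def reconcile(wanted: set[str], existing: set[str],
              producer: dict[str, str]) -> dict[str, set[str]]:
    """One reconciliation pass: diff the desired data state against the
    current state, and group the missing partitions by the job that
    produces them (the build plan for this pass)."""
    missing = wanted - existing
    plan: dict[str, set[str]] = {}
    for partition in missing:
        job = producer.get(partition)
        if job is None:
            raise ValueError(f"no job produces {partition!r}")
        plan.setdefault(job, set()).add(partition)
    return plan
```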
## Deployment Model

- Trivially deployable: Jobs are described by bazel targets that allow them to be trivially executed and packaged (literally, via code gen, or some other way).
- Graphs must be composable: one graph must have an efficient way to explicitly depend on data produced by another graph.
- Deployment updates must not break in-flight work - Continuous deployment is the norm; the system must handle version transitions gracefully.

## Observability Model

- Observing build system state, decisions, and rationale should be easy (via CLI, web app, or API).
- "Why doesn't this partition exist yet?" must be answerable for any requested partition, terminating in either "these jobs are still in progress" or "these jobs failed".
- The structure and interfaces of the CLI, API, and web app should be fundamentally the same.

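Answering "why doesn't this partition exist yet?" can be sketched as a walk up the producer graph that terminates at running or failed jobs. All names here are illustrative; DataBuild's real data structures are not specified in this document:

```python
def explain_missing(partition: str, producer: dict[str, str],
                    status: dict[str, str], deps: dict[str, list[str]]) -> list[str]:
    """Walk upstream from a missing partition until reaching jobs that
    are running or failed; those are the answer to "why not yet?"."""
    job = producer[partition]
    state = status.get(job, "pending")
    if state in ("running", "failed"):
        return [f"{partition}: job {job} is {state}"]
    reasons: list[str] = []
    for dep in deps.get(job, []):          # pending job: explain its inputs
        reasons += explain_missing(dep, producer, status, deps)
    return reasons or [f"{partition}: job {job} is {state}"]
```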
## Correctness Model

- Compile-time correctness is the engine of long-term productivity and maintainability. Whenever it is reasonably possible, we choose compile-time correctness assertion mechanisms.

## Tenets

- Correctness over speed.
- Explicit over implicit - Make all dependencies and decisions visible.
- Simple over clever - Reconciliation loops over complex state machines.
- No dependency knowledge necessary to materialize data - Only local dependency knowledge should be necessary to implement new data-building jobs, and global knowledge should never be necessary where existing jobs are already sufficient.

@@ -26,6 +26,8 @@ trait GraphService {
The purpose of the API is to enable remote, programmatic interaction with databuild applications, and to host endpoints needed by the [web app](#web-app).

[Notes about details, context, and askama views.](https://claude.ai/share/76622c1c-7489-496e-be81-a64fef24e636)
## Web App

The web app visualizes databuild application state via features like listing past builds, job statistics, partition liveness, build request status, etc. This section specifies the hierarchy of functions of the web app. Pages
@@ -1,84 +0,0 @@
# Why DataBuild? A Case for Declarative Data Orchestration

Bullet points that should eventually become a blog post.

## Introduction

- **The Vision**: What if data engineers could iterate fearlessly on massive pipelines? If individual engineers could onboard and support 10x as many datasets?
- **The Problem**: Most orchestration abstractions are brittle - moving quickly and correctly without complete product certainty is difficult
- **The Reality**: Teams either move fast and break things, or move slowly to avoid breaking things
- **The Promise**: Declarative data orchestration enables both speed and correctness
- **The Inspiration**: Learning from the declarative evolutions in Bazel, SQL, and Kubernetes

## The Hidden Costs of Data Interfaces

### Coupling by Obscurity

- Data interfaces hide critical dependencies between jobs
- Violates fundamental software design principles:
  - **Dependencies**: Can't understand impact in isolation
  - **Obscurity**: Critical coupling information isn't visible in code
  - **Change amplification**: Simple changes require modifications everywhere
- Example: Bug fix in dataset A breaks jobs B, C, D... but you don't know until runtime

### The Orchestration Trap

- Engineers spend too much time writing, updating, and manually testing orchestration
- Orchestration code is:
  - Constantly changing as requirements evolve
  - Nearly impossible to test meaningfully
  - Brittle and breaks when the dependency graph changes

## The Declarative Alternative

### Learning from Proven Systems

- **Bazel**: Declare targets and dependencies → system handles build orchestration
- **SQL**: Declare what data you want → query planner handles execution strategy
- **Kubernetes**: Declare desired state → controllers handle deployment orchestration

### Inversion of Control for Data

- Engineers declare **what**: jobs and data dependencies
- System handles **how**: execution order, parallelization, failure recovery
- Enables **local reasoning**: understand jobs in isolation
- Supports **fearless iteration**: changes automatically propagate correctly

### Continuous Reconciliation

- Triggers periodically ask: "ensure all expected partitions exist"
- DataBuild determines what's missing, stale, or needs rebuilding
- System maintains desired state without manual intervention
- Self-healing pipelines that adapt to changing upstream data

## Operational Simplicity

### Stateless by Design

- Each build request is independent and ephemeral
- Append-only event log vs. complex mutable orchestrator state
- No database migrations or careful state preservation across versions

### Deployment as Code Compilation

- Following Bazel's model: build binary, ship it
- Auto-generated deployment configs (Helm charts, etc.)
- Version updates are the norm, not the exception

### Separation of Concerns

- **DataBuild**: Dependency resolution + want propagation
- **External systems**: Scheduling (cron/triggers), infrastructure (Kubernetes)
- **Result**: Operational complexity focused where it belongs

## The DataBuild Vision

### Core Tenets

- No dependency knowledge necessary to materialize data
- Only local dependency knowledge needed to develop
- Explicit coupling via declared data dependencies
- Automatic orchestration delegation

### What This Enables

- **Fearless iteration**: Change any part of a large graph confidently
- **Trivial deployment**: Single binary updates, no complex state management
- **Automatic correctness**: System prevents composition bugs at "compile time"
- **Scalable development**: Near-zero marginal effort for new datasets

## Conclusion

- Data engineering doesn't have to be this hard
- Declarative approaches have transformed other domains
- DataBuild brings these proven patterns to data orchestration
- The future: engineers focus on business logic, systems handle the complexity