Stuart Axelbrooke, 2025-07-19

DataBuild Design

DataBuild is a trivially-deployable, partition-oriented, declarative build system. Where data orchestration flows are normally imperative and implicit (do this, then do that, etc.), DataBuild uses stated data dependencies to make the process declarative and explicit. DataBuild scales the declarative approach of tools like DBT to meet the needs of modern, broadly integrated data and ML organizations that consume data from many sources, with inputs arriving on highly varying schedules. DataBuild enables confident, bounded completeness in a world where input data is effectively never complete at any given time.

Philosophy

Many large-scale systems for producing data leave the complexity of true orchestration to the user - even DAG-based systems for implementing dependencies leave the system as a collection of DAGs, requiring engineers to answer the same questions again and again: "why doesn't this data exist?" and "how do I build this data?"

DataBuild takes inspiration from modern data orchestration and build systems to fully internalize this complexity, using the Job concept to localize all decisions about turning upstream data into output data (and making all dependencies explicit), and the Graph concept to handle composition of jobs, answering what sequence of jobs must be run to build a specific partition of data. With Jobs and Graphs, DataBuild takes complete responsibility for the data build process, allowing engineers to consider only the concerns local to the jobs relevant to their feature.

Graphs and jobs are defined in Bazel, allowing graphs (and their constituent jobs) to be built and deployed trivially.

Concepts

  • Partitions - A partition is an atomic unit of data. DataBuild's data dependencies work by using partition references (e.g. s3://some/dataset/date=2025-06-01) as dependency signals between jobs, allowing the construction of build graphs to produce arbitrary partitions.
  • Jobs - A job's exec entrypoint builds partitions from partitions, and its config entrypoint specifies which partitions are required to produce the requested partition(s), along with the specific config to run exec with to build them.
  • Graphs - A graph composes jobs to achieve multi-job orchestration, using a lookup mechanism to resolve a requested partition to the job that can build it. Together with its constituent jobs, a graph can fully plan the build of any set of partitions. Most interactions with a DataBuild app happen through a graph.
  • Build Event Log - Encodes the state of the system, recording build requests, job activity, partition production, etc., enabling DataBuild to run as a deployed application.
  • Bazel targets - Bazel is a fast, extensible, and hermetic build system. DataBuild uses Bazel targets to describe graphs and jobs, making graphs themselves deployable applications. Implementing a DataBuild app is the process of wrapping your data build jobs in databuild_job targets and connecting them with a databuild_graph target.
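
As a sketch of how these targets fit together (the load path, the py_binary wrapper, and the binary attribute name here are assumptions for illustration; jobs and lookup follow the databuild_graph rule described below):

```starlark
# BUILD.bazel (hypothetical example)
load("//databuild:defs.bzl", "databuild_job", "databuild_graph")

py_binary(
    name = "colors_job_bin",
    srcs = ["colors_job.py"],
)

# Wrap the data build binary in a databuild_job target.
databuild_job(
    name = "colors_job",
    binary = ":colors_job_bin",
)

# Connect jobs into a deployable graph with a lookup binary.
databuild_graph(
    name = "colors_graph",
    jobs = [":colors_job"],
    lookup = ":lookup_bin",
)
```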

Partition / Job Assumptions and Best Practices

  • Partitions are atomic and final - Either the data is complete, or it's "not there".
  • Partitions are mutually exclusive and collectively exhaustive - Row membership in a partition should be unambiguous and consistent.
  • Jobs are idempotent - For the same input data and parameters, the same partition is produced (functionally).

Partition Delegation

If a partition is already up to date, or is already being built by an earlier build request, a new build request "delegates" to that build request. Instead of running the job that builds the partition again, it emits a delegation event in the build event log, explicitly pointing to the build action it delegates to.
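
The delegation decision can be sketched as a scan over the event log; the event shapes and field names below are assumptions for illustration, not DataBuild's actual schema:

```python
# Hypothetical sketch of the delegation check described above.

def plan_action(partition, event_log):
    """Return the action a new build request takes for `partition`."""
    for event in reversed(event_log):
        if event.get("partition") != partition:
            continue
        if event["type"] == "partition_built":
            # Already up to date: delegate to the completed build action.
            return {"type": "delegate", "to": event["build_id"]}
        if event["type"] == "job_started":
            # Already being built: delegate to the in-flight build action.
            return {"type": "delegate", "to": event["build_id"]}
    # No prior activity for this partition: run the job ourselves.
    return {"type": "run_job"}

log = [
    {"type": "job_started", "partition": "my_dataset/color=red", "build_id": "b1"},
]
print(plan_action("my_dataset/color=red", log))   # delegates to b1
print(plan_action("my_dataset/color=blue", log))  # runs the job
```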

Components

Job

The databuild_job rule expects to reference a binary that adheres to the following expectations:

  • For the config subcommand, it prints the JSON job config to stdout based on the requested partitions, e.g. for a binary bazel-bin/my_binary, it prints a valid job config when called like bazel-bin/my_binary config my_dataset/color=red my_dataset/color=blue.
  • For the exec subcommand, it produces the partitions that were requested of the config subcommand when run with the job config that config produced. E.g., if config had produced {..., "args": ["red", "blue"], "env": {"MY_ENV": "foo"}}, then calling MY_ENV=foo bazel-bin/my_binary exec red blue should produce partitions my_dataset/color=red and my_dataset/color=blue.
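
A minimal job binary satisfying this contract might look like the following sketch; the partition-parsing logic and the config fields beyond "args" and "env" are assumptions:

```python
#!/usr/bin/env python3
"""Hypothetical DataBuild job binary implementing config and exec."""
import json
import os
import sys

def config(partitions):
    # Map each requested partition (e.g. "my_dataset/color=red") to an
    # exec argument, and emit the JSON job config on stdout.
    colors = [p.split("color=")[1] for p in partitions]
    print(json.dumps({
        "args": colors,
        "env": {"MY_ENV": "foo"},
    }))

def exec_(colors):
    # Produce one partition per color; printing stands in for actually
    # materializing data at the partition path.
    for color in colors:
        print(f"built my_dataset/color={color} (MY_ENV={os.environ.get('MY_ENV')})")

if __name__ == "__main__" and len(sys.argv) > 1:
    subcommand, rest = sys.argv[1], sys.argv[2:]
    if subcommand == "config":
        config(rest)
    elif subcommand == "exec":
        exec_(rest)
    else:
        sys.exit(f"unknown subcommand: {subcommand}")
```

Invoked as `my_binary config my_dataset/color=red my_dataset/color=blue`, this prints the JSON job config; invoked as `MY_ENV=foo my_binary exec red blue`, it builds both partitions.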

Graph

The databuild_graph rule expects two attributes, jobs and lookup:

  • The lookup binary target should return a JSON object whose keys are job labels and whose values are the lists of partitions each job is responsible for producing. This enables graph planning by walking backwards through the data dependency graph.
  • The jobs list should contain every job involved in the graph. The graph recursively calls config to resolve the full set of jobs to run.
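
A lookup binary can be as simple as a routing table from dataset to job label; the labels and the prefix-based routing rule below are assumptions for illustration:

```python
#!/usr/bin/env python3
"""Hypothetical lookup binary: resolves requested partitions to the
job labels responsible for producing them."""
import json
import sys

# Static routing table: dataset name -> job label (hypothetical labels).
ROUTES = {
    "my_dataset": "//jobs:colors_job",
    "my_report": "//jobs:report_job",
}

def lookup(partitions):
    # Group requested partitions under the job that can build each one.
    out = {}
    for partition in partitions:
        dataset = partition.split("/")[0]
        out.setdefault(ROUTES[dataset], []).append(partition)
    return out

if __name__ == "__main__" and len(sys.argv) > 1:
    print(json.dumps(lookup(sys.argv[1:])))
```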

Build Event Log (BEL)

The BEL encodes all relevant build actions that occur, enabling concurrent builds. This includes:

  • Graph events, including "build requested", "build started", "analysis started", "build failed", "build completed", etc.
  • Job events, including "..."

The BEL is similar to event-sourced systems: all application state is rendered from aggregations over the BEL. This keeps the BEL simple while also powering concurrent builds, the data catalog, and the DataBuild service.
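
Rendering state from the log can be sketched as a fold over events; the event names mirror the graph events listed above, while the build_id field and last-event-wins aggregation are assumptions:

```python
# Hypothetical sketch of event-sourced state rendering over the BEL.

def render_build_state(events):
    """Aggregate BEL events into a per-build status map: the latest
    event for each build_id determines that build's current state."""
    state = {}
    for event in events:
        state[event["build_id"]] = event["type"]
    return state

bel = [
    {"build_id": "b1", "type": "build requested"},
    {"build_id": "b1", "type": "build started"},
    {"build_id": "b1", "type": "build completed"},
    {"build_id": "b2", "type": "build requested"},
]
print(render_build_state(bel))
# {'b1': 'build completed', 'b2': 'build requested'}
```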