databuild/DESIGN.md

DataBuild Design

DataBuild is a trivially-deployable, partition-oriented, declarative build system. Where data orchestration flows are normally imperative and implicitly coupled (do this, then do that, and so on), DataBuild uses stated data dependencies to make the process declarative and explicit. DataBuild scales the declarative nature of tools like DBT to meet the needs of modern, broadly integrated data and ML organizations, which consume data from many sources arriving on highly varying schedules. DataBuild enables confident, bounded completeness in a world where input data is effectively never complete at any given time.

Philosophy

Inspired by these requirements.

Many large-scale systems for producing data leave the complexity of true orchestration to the user - even DAG-based systems for implementing dependencies leave the system as a collection of DAGs, requiring engineers to answer the same questions over and over: "why doesn't this data exist?" and "how do I build this data?"

DataBuild takes inspiration from modern data orchestration and build systems to fully internalize this complexity, using the Job concept to localize all decisions of turning upstream data into output data (and making all dependencies explicit); and the Graph concept to handle composition of jobs, enabling continuous data reconciliation for data platforms of all sizes. With Jobs and Graphs, DataBuild takes complete responsibility for the data build process, allowing engineers to consider concerns only local to the jobs relevant to their feature.

Graphs and jobs are defined in bazel, allowing graphs (and their constituent jobs) to be built and deployed trivially.

Concepts

  • Partitions - A partition is an atomic unit of data. DataBuild's data dependencies work by using partition references (e.g. s3://some/dataset/date=2025-06-01) as dependency signals between jobs, allowing the construction of build graphs to produce arbitrary partitions.
  • Jobs - Build requested partitions from specific input partitions, or raise when input partitions are missing (reporting which requested partitions can't be built because of which specific missing partitions).
  • Graphs - Compose jobs to achieve multi-job orchestration, using a lookup mechanism to resolve a requested partition to the job that can build it. Together with its constituent jobs, a graph can fully build any set of partitions. Most interactions with a DataBuild app happen through a graph.
  • Build Event Log - Encodes the state of the system, recording partition wants, job activity, partition production, etc., enabling DataBuild to run as a deployed application.
  • Wants - Partition wants can be registered with DataBuild, enabling continuous data reconciliation and build of wanted partitions as soon as their graph-external dependencies are met.
  • Taints - Taints mark a partition as invalid, indicating that readers should not use it, and that it should be rebuilt when requested or depended upon. If there is a still-active want for the tainted partition, it will be rebuilt immediately.
  • Bazel Targets - Bazel is a fast, extensible, and hermetic build system. DataBuild uses bazel targets to describe graphs and jobs, making graphs themselves deployable applications. Implementing a DataBuild app is the process of integrating your data build jobs into databuild_job bazel targets and connecting them with a databuild_graph target.
  • Graph Definition Languages - Application libraries in Python/Rust/Scala that use language features to enable ergonomic and succinct specification of jobs and graphs.
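
The concepts above come together in a pair of bazel targets. A hypothetical BUILD file might look like the following sketch; the load path, target names, and any attributes beyond binary, jobs, and lookup are assumptions, not the actual rule API:

```python
# BUILD.bazel (hypothetical layout)
load("@databuild//:defs.bzl", "databuild_job", "databuild_graph")

databuild_job(
    name = "daily_features",
    binary = ":daily_features_bin",  # accepts the partitions it must produce
)

databuild_graph(
    name = "feature_graph",
    jobs = [":daily_features"],
    lookup = ":feature_lookup_bin",  # maps requested partitions -> job labels
)
```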

Bazel Components

Job

The databuild_job rule requires just a binary target that it can execute, and any relevant metadata that helps the graph call it properly. The referenced binary should accept a list of partitions that it needs to produce, and if any required partitions are missing, report which are missing and which requested partitions they prevent from being built.
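
Under this contract, the core of a job binary can be sketched as follows. The build-or-report behavior is the contract described above; the dependency rule, function names, and in-memory "existing inputs" set are invented for illustration:

```python
# Sketch of a job's core logic: for each requested partition, either build it
# or report which missing input partitions block it. Real jobs would read and
# write actual data; here "building" is a no-op.
def run_job(requested, existing_inputs):
    """Return (built, missing_report) for the requested partitions."""
    built, missing_report = [], {}
    for partition in requested:
        needed = upstream_deps(partition)
        missing = [p for p in needed if p not in existing_inputs]
        if missing:
            # Report which missing inputs block this requested partition.
            missing_report[partition] = missing
        else:
            built.append(partition)  # a real job would write the data here
    return built, missing_report

def upstream_deps(partition):
    # Toy dependency rule (purely illustrative): each date partition reads
    # the raw feed for the same date.
    date = partition.rsplit("date=", 1)[1]
    return [f"s3://raw/feed/date={date}"]
```

A graph can then use the missing report to walk backwards and schedule the jobs that produce the missing inputs.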

Jobs are executed via a wrapper component that provides observability, error handling, and standardized communication with the graph. The wrapper captures all job output as structured logs, enabling comprehensive monitoring without requiring jobs to have network connectivity.
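
The wrapper idea can be sketched as a small process runner; the record field names here are assumptions, not the actual log schema:

```python
import subprocess
import time

# Run the job binary, capture every line it prints, and re-emit each line as
# a structured log record, followed by a final exit event.
def run_wrapped(cmd, partitions):
    proc = subprocess.run(cmd + partitions, capture_output=True, text=True)
    records = [
        {"ts": time.time(), "job": cmd[0], "line": line}
        for line in proc.stdout.splitlines()
    ]
    records.append(
        {"ts": time.time(), "job": cmd[0], "event": "exit", "code": proc.returncode}
    )
    return records
```

Because the wrapper owns all communication, the job itself only needs to print to stdout and exit with a status code.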

Graph

The databuild_graph rule expects two fields, jobs and lookup:

  • The lookup binary target should return a JSON object with keys as job labels and values as the list of partitions that each job is responsible for producing. This enables graph planning by walking backwards in the data dependency graph.
  • The jobs list should just be a list of all jobs involved in the graph. The graph will recursively call config to resolve the full set of jobs to run.
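
As a concrete sketch of the lookup output shape described above (the routing rule and job labels here are invented), a lookup might map partition references to job labels like so:

```python
import json

# Hypothetical lookup: given requested partition references, return a JSON
# object with job labels as keys and the partitions each job is responsible
# for producing as values.
def lookup(requested_partitions):
    mapping = {}
    for ref in requested_partitions:
        if ref.startswith("s3://features/"):
            label = "//jobs:daily_features"
        else:
            label = "//jobs:raw_ingest"
        mapping.setdefault(label, []).append(ref)
    return json.dumps(mapping)
```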

Build Event Log (BEL)

The BEL encodes all relevant build actions that occur, enabling distributed/concurrent builds. This includes submitted wants, job events (started, succeeded, partitions missing, etc.), and so on.

The BEL is similar to event-sourced systems, as all application state is rendered from aggregations over the BEL. This enables the BEL to stay simple while also powering concurrent builds, the data catalog, and the DataBuild service.
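
In event-sourced style, any view of the system is a fold over the ordered log. A minimal sketch, where the event kind names are assumptions based on the event types listed above:

```python
# Render application state as a fold over the BEL: wants, live partitions,
# and taints are all derived, never stored directly.
def render_state(events):
    state = {"wants": set(), "live": set(), "tainted": set()}
    for ev in events:
        kind, partition = ev["kind"], ev["partition"]
        if kind == "want_registered":
            state["wants"].add(partition)
        elif kind == "partition_built":
            state["live"].add(partition)
            state["tainted"].discard(partition)  # a rebuild clears a taint
        elif kind == "partition_tainted":
            state["tainted"].add(partition)
    return state
```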

Wants and Taints

"Wants" are the main mechanism for eventually building partitions. In real-world scenarios, it is standard for data to arrive late, or not at all. Wants cause the DataBuild graph to continually attempt to build the wanted partitions while they aren't live, and enable it to list wants that are past SLA.

Taints allow for manual/programmatic invalidation of built partitions. Partitions tainted since their last build are treated as non-existent, and will be rebuilt if any other wanted partition depends on them. This also opens the door to invalidating downstream partitions.
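
The want/taint semantics above reduce to a small set computation (names illustrative): a still-wanted partition is (re)built if it is missing or tainted.

```python
# A tainted partition is treated as non-existent, so reconciliation rebuilds
# any wanted partition that is missing or tainted.
def partitions_to_build(wants, live, tainted):
    return {p for p in wants if p not in live or p in tainted}
```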

Key Insights

  • Orchestration logic changes all the time - better to not write it at all.
  • Orchestration decisions and application logic are innately coupled.
  • "systemd for data platforms"

What About Configuration?

Configuration is all the information provided to a job that isn't a) the data the job reads or b) the partitions the job is being asked to produce. This could be information like "what modeling strategy do we use for this customer" or "when was this feed configured". It has the inconvenient property of being critical for practical business value while also being difficult to treat as data (since you often want to change and "tweak" it).

DataBuild explicitly and intentionally treats configuration as a job-internal concept. Jobs are not pure functions, but it is a good idea for almost all of the implementation to be purely functional: it's recommended to calculate structured job configuration up front (along with trying to resolve the required input data), then invoke the rest of the job as a pure function over the config and data.
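
The recommended shape can be sketched in two functions; the config fields and transformation here are hypothetical:

```python
# Impure edge: resolve structured config up front (in practice this might
# read a config service or database).
def resolve_config(customer):
    return {"customer": customer, "strategy": "seasonal"}

# Pure core: same config + same data => same output, which keeps job runs
# idempotent and easy to reason about.
def build_partition(config, rows):
    factor = 2 if config["strategy"] == "seasonal" else 1
    return [r * factor for r in rows]
```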

What about situations where data is configured by a web app, etc? Taints are a great way to invalidate partitions that are impacted by config changes, and you can create callbacks in your application to taint impacted partitions.

Assumptions

  • Job -> partition relationships are canonical, and job runs are idempotent
  • The data deps / partitions read by a fully parameterized job run are consistent over time: same parameters = same read partitions