DataBuild Design
DataBuild is a trivially-deployable, partition-oriented, declarative build system. Where data orchestration flows are normally imperative and implicit (do this, then do that, etc.), DataBuild uses stated data dependencies to make the process declarative and explicit. DataBuild scales the declarative approach of tools like DBT to meet the needs of modern, broadly integrated data and ML organizations, which consume data from many sources arriving on a highly varying basis. DataBuild enables confident, bounded completeness in a world where input data is effectively never complete at any given time.
Philosophy
Many large-scale systems for producing data leave the complexity of true orchestration to the user. Even DAG-based systems for implementing dependencies leave the overall system as a collection of DAGs, requiring engineers to solve the same "why doesn't this data exist?" and "how do I build this data?" problems over and over.
DataBuild takes inspiration from modern data orchestration and build systems to fully internalize this complexity: the Job concept localizes all decisions about turning upstream data into output data (making all dependencies explicit), and the Graph concept handles composition of jobs, answering what sequence of jobs must run to build a specific partition of data. With Jobs and Graphs, DataBuild takes complete responsibility for the data build process, letting engineers focus only on concerns local to the jobs relevant to their feature.
Graphs and jobs are defined in Bazel, allowing graphs (and their constituent jobs) to be built and deployed trivially.
Concepts
- Partitions - A partition is an atomic unit of data. DataBuild's data dependencies work by using partition references (e.g. `s3://some/dataset/date=2025-06-01`) as dependency signals between jobs, allowing the construction of build graphs to produce arbitrary partitions.
- Jobs - A job's `exec` entrypoint builds partitions from partitions, and its `config` entrypoint specifies what partitions are required to produce the requested partition(s), along with the specific config to run `exec` with to build said partitions.
- Graphs - A graph composes jobs to achieve multi-job orchestration, using a `lookup` mechanism to resolve a requested partition to the job that can build it. Together with its constituent jobs, a graph can fully plan the build of any set of partitions. Most interactions with a DataBuild app happen through a graph.
- Build Event Log - Encodes the state of the system, recording build requests, job activity, partition production, etc. to enable running DataBuild as a deployed application.
- Wants - Partition wants can be registered with DataBuild, causing it to build the wanted partitions as soon as their graph-external dependencies are met.
- Taints - Taints mark a partition as invalid, indicating that readers should not use it and that it should be rebuilt when requested or depended upon. If there is a still-active want for the tainted partition, it is rebuilt immediately.
- Bazel Targets - Bazel is a fast, extensible, and hermetic build system. DataBuild uses Bazel targets to describe graphs and jobs, making graphs themselves deployable applications. Implementing a DataBuild app is the process of integrating your data build jobs in `databuild_job` targets and connecting them with a `databuild_graph` target (see the BUILD sketch after this list).
- Graph Specification Strategies (coming soon) - Application libraries in Python/Rust/Scala that use language features to enable ergonomic and succinct specification of jobs and graphs.
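To make the Bazel integration concrete, here is a minimal BUILD sketch. The `jobs` and `lookup` attributes are documented under Components below; the load path and the job rule's `binary` attribute name are assumptions for illustration.

```starlark
# BUILD -- sketch only; load path and `binary` attribute are assumed,
# while `jobs` and `lookup` are documented below.
load("@databuild//:defs.bzl", "databuild_graph", "databuild_job")

databuild_job(
    name = "my_job",
    binary = ":my_binary",  # binary exposing `config` and `exec` subcommands
)

databuild_graph(
    name = "my_graph",
    jobs = [":my_job"],     # all jobs involved in the graph
    lookup = ":my_lookup",  # resolves partition refs to job labels
)
```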
Partition / Job Assumptions and Best Practices
- Partitions are atomic and final - Either the data is complete or it's "not there".
- Partitions are mutually exclusive and collectively exhaustive - Row membership in a partition should be unambiguous and consistent (see the sketch after this list).
- Jobs are idempotent - For the same input data and parameters, the same partition is produced (functionally).
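As a concrete illustration of these properties, here is a sketch (the dataset and key are hypothetical) of a deterministic, pure row-to-partition assignment. Deriving the partition only from row data makes membership unambiguous, and determinism is what makes idempotent reruns possible.

```python
# Every row maps to exactly one partition, derived only from row data:
# membership is unambiguous (mutually exclusive), every row lands
# somewhere (collectively exhaustive), and reruns assign identically.
def partition_ref(row: dict) -> str:
    return f"s3://some/dataset/date={row['event_date']}"

assert partition_ref({"event_date": "2025-06-01"}) == \
    "s3://some/dataset/date=2025-06-01"
```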
Partition Delegation
If a partition is already up to date, or is already being built by a previous build request, a new build request will "delegate" to that build request. Instead of running the job to build said partition again, it will emit a delegation event in the build event log, explicitly pointing to the build action it is delegating to.
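A sketch of this delegation decision, assuming a simplified in-memory BEL (the interface and event fields here are hypothetical; only the behavior of delegating instead of re-running comes from the text above):

```python
# Hypothetical BEL interface illustrating delegation: if the partition is
# live or already being built, no new job run is started.
from dataclasses import dataclass, field

@dataclass
class BuildEventLog:
    live: set = field(default_factory=set)         # up-to-date partitions
    in_flight: dict = field(default_factory=dict)  # partition -> action id
    events: list = field(default_factory=list)

    def plan(self, partition: str, run_job) -> None:
        if partition in self.live:
            return  # already up to date: nothing to run
        action = self.in_flight.get(partition)
        if action is not None:
            # Delegate: point explicitly at the in-flight build action.
            self.events.append({"type": "delegated",
                                "partition": partition,
                                "delegates_to": action})
            return
        run_job(partition)
```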
Components
Job
The `databuild_job` rule expects to reference a binary that adheres to the following expectations:
- For the `config` subcommand, it prints the JSON job config to stdout based on the requested partitions. E.g., for a binary `bazel-bin/my_binary`, it prints a valid job config when called like `bazel-bin/my_binary config my_dataset/color=red my_dataset/color=blue`.
- For the `exec` subcommand, it produces the partitions that were requested of the `config` subcommand, when configured with the job config it produced. E.g., if `config` had printed `{..., "args": ["red", "blue"], "env": {"MY_ENV": "foo"}}`, then calling `MY_ENV=foo bazel-bin/my_binary exec red blue` should produce partitions `my_dataset/color=red` and `my_dataset/color=blue` (a minimal sketch follows this list).
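A minimal sketch of such a binary in Python, following the `config`/`exec` contract above. The config schema beyond `args` and `env` is elided in this doc, so only those fields are shown; the file-writing in `exec_` is a stand-in for real partition storage.

```python
#!/usr/bin/env python3
"""Sketch of a DataBuild job binary for the my_dataset example above."""
import json
import os
import sys

def config(partitions):
    # Map each requested partition ref (e.g. "my_dataset/color=red") to
    # the exec-time argument that rebuilds it -- here, the color value.
    colors = [p.split("color=")[-1] for p in partitions]
    print(json.dumps({"args": colors, "env": {"MY_ENV": "foo"}}))

def exec_(colors):
    # Produce one partition per requested color; local files stand in
    # for whatever storage the real job would write to.
    for color in colors:
        path = f"my_dataset/color={color}"
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "part.json"), "w") as f:
            json.dump({"color": color, "env": os.environ["MY_ENV"]}, f)

if __name__ == "__main__":
    subcommand, args = sys.argv[1], sys.argv[2:]
    config(args) if subcommand == "config" else exec_(args)
```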
Jobs are executed via a wrapper component that provides observability, error handling, and standardized communication with the graph. The wrapper captures all job output as structured logs, enabling comprehensive monitoring without requiring jobs to have network connectivity.
Graph
The `databuild_graph` rule expects two fields, `jobs` and `lookup`:
- The `lookup` binary target should return a JSON object whose keys are job labels and whose values are the lists of partitions each job is responsible for producing. This enables graph planning by walking backwards through the data dependency graph (see the lookup sketch after this list).
- The `jobs` list should simply list all jobs involved in the graph. The graph recursively calls `config` to resolve the full set of jobs to run.
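A sketch of a lookup binary, assuming the requested partition refs arrive as command-line arguments (the exact invocation is not specified here, and the dataset-to-job routing is hypothetical):

```python
#!/usr/bin/env python3
"""Sketch: resolve requested partition refs to the jobs that build them."""
import json
import sys
from collections import defaultdict

# Hypothetical routing: each dataset is owned by exactly one job label.
OWNERS = {"my_dataset": "//jobs:my_job"}

if __name__ == "__main__":
    assignment = defaultdict(list)
    for ref in sys.argv[1:]:
        dataset = ref.split("/")[0]
        assignment[OWNERS[dataset]].append(ref)
    # Keys are job labels; values are the partitions each job produces.
    print(json.dumps(assignment))
```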
Build Event Log (BEL)
The BEL encodes all relevant build actions that occur, enabling concurrent builds. This includes:
- Graph events, including "build requested", "build started", "analysis started", "build failed", "build completed", etc.
- Job events, including "..."
The BEL follows the pattern of event-sourced systems: all application state is rendered from aggregations over the BEL. This keeps the BEL simple while also powering concurrent builds, the data catalog, and the DataBuild service.
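To illustrate the event-sourced pattern, here is a minimal sketch of rendering one piece of state from the log. The event names come from the list above, but the field names and aggregation logic are assumptions.

```python
# Illustrative only: derive a build's current status by folding over the
# BEL. Field names ("type", "build_id") are assumed, not documented.
from typing import Iterable

def build_status(events: Iterable[dict], build_id: str) -> str:
    status = "unknown"
    for event in events:
        if event.get("build_id") == build_id:
            status = event["type"]  # last matching event wins
    return status

# e.g. build_status(log, "b1") == "build completed" after the final event
```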
Triggers and Wants (Coming Soon)
"Wants" are the main mechanism for continually building partitions over time. In real world scenarios, it is standard for data to arrive late, or not at all. Wants cause the databuild graph to continually attempt to build the wanted partitions until a) the partitions are live or b) the want expires, at which another script can be run. Wants are the mechanism that implements SLA checking.
You can also use cron-based triggers, which return partition refs that they want built.
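Since this API is still to come, here is only a hedged sketch of the shape such a cron trigger might take; the function name and signature are assumptions.

```python
# Hypothetical cron trigger: runs on a schedule and returns the partition
# refs it wants built. The actual trigger interface is not yet specified.
from datetime import date, timedelta

def daily_trigger(today: date | None = None) -> list[str]:
    """Want yesterday's partition once the new day starts."""
    today = today or date.today()
    yesterday = today - timedelta(days=1)
    return [f"s3://some/dataset/date={yesterday.isoformat()}"]
```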
Key Insights
- Orchestration logic changes all the time - better not to write it at all.
- Orchestration decisions and application logic are innately coupled.
Assumptions
- Job -> partition relationships are canonical, and job runs are idempotent.