# DataBuild Design

DataBuild is a trivially-deployable, partition-oriented, declarative build system. Where data orchestration flows are normally imperative and implicit (do this, then do that, and so on), DataBuild uses stated data dependencies to make the process declarative and explicit. DataBuild scales the declarative nature of tools like dbt to meet the needs of modern, broadly integrated data and ML organizations, which consume data from many sources arriving on highly varying schedules. DataBuild enables confident, bounded completeness in a world where input data is effectively never complete at any given time.

## Philosophy

Many large-scale systems for producing data leave the complexity of true orchestration to the user - even DAG-based systems for expressing dependencies leave the overall system as a collection of DAGs, requiring engineers to answer the same questions over and over: "why doesn't this data exist?" and "how do I build this data?"

DataBuild takes inspiration from modern data orchestration and build systems to fully internalize this complexity. The Job concept localizes all decisions about turning upstream data into output data (making all dependencies explicit), and the Graph concept handles composition of jobs, answering what sequence of jobs must run to build a specific partition of data. With Jobs and Graphs, DataBuild takes complete responsibility for the data build process, allowing engineers to consider only the concerns local to the jobs relevant to their feature.

Graphs and jobs are defined in [Bazel](https://bazel.build), allowing graphs (and their constituent jobs) to be built and deployed trivially.

## Concepts

- **Partitions** - A partition is an atomic unit of data. DataBuild's data dependencies work by using partition references (e.g. `s3://some/dataset/date=2025-06-01`) as dependency signals between jobs, allowing the construction of build graphs that produce arbitrary partitions (see the sketch after this list).
- **Jobs** - A job's `exec` entrypoint builds output partitions from input partitions, and its `config` entrypoint specifies which partitions are required to produce the requested partition(s), along with the specific config to run `exec` with to build them.
- **Graphs** - A graph composes jobs to achieve multi-job orchestration, using a `lookup` mechanism to resolve a requested partition to the job that can build it. Together with its constituent jobs, a graph can fully plan the build of any set of partitions. Most interactions with a DataBuild app happen through a graph.
- **Build Event Log** - Encodes the state of the system, recording build requests, job activity, partition production, etc. to enable running DataBuild as a deployed application.
- **Wants** - Partition wants can be registered with DataBuild, causing it to build the wanted partitions as soon as their graph-external dependencies are met.
- **Taints** - Taints mark a partition as invalid, indicating that readers should not use it, and that it should be rebuilt when requested or depended upon. If there is a still-active want for the tainted partition, it will be rebuilt immediately.
- **Bazel Targets** - Bazel is a fast, extensible, and hermetic build system. DataBuild uses Bazel targets to describe graphs and jobs, making graphs themselves deployable applications. Implementing a DataBuild app is the process of wrapping your data build jobs in `databuild_job` targets and connecting them with a `databuild_graph` target.
- [**Graph Specification Strategies**](design/graph-specification.md) (coming soon) - Application libraries in Python/Rust/Scala that use language features to enable ergonomic and succinct specification of jobs and graphs.
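
Because partition references are plain strings, the core dependency mechanism needs nothing more than string parsing. A minimal illustration follows; the `PartitionRef` name and the dataset/coordinate split are hypothetical, not DataBuild's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PartitionRef:
    """Hypothetical decomposition of a partition ref like
    s3://some/dataset/date=2025-06-01 into dataset + coordinates."""
    dataset: str            # e.g. "s3://some/dataset"
    coords: dict[str, str]  # e.g. {"date": "2025-06-01"}

def parse_ref(ref: str) -> PartitionRef:
    # Path segments containing "=" are partition coordinates; the
    # remaining segments identify the dataset.
    parts = ref.rstrip("/").split("/")
    coords = dict(p.split("=", 1) for p in parts if "=" in p)
    dataset = "/".join(p for p in parts if "=" not in p)
    return PartitionRef(dataset, coords)

assert parse_ref("s3://some/dataset/date=2025-06-01").coords == {"date": "2025-06-01"}
```
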
### Partition / Job Assumptions and Best Practices
- **Partitions are atomic and final** - Either the data is complete, or it's "not there".
- **Partitions are mutually exclusive and collectively exhaustive** - A row's membership in a partition should be unambiguous and consistent.
- **Jobs are idempotent** - For the same input data and parameters, the same partition is produced (functionally).
### Partition Delegation
If a partition is already up to date, or is already being built by a previous build request, a new build request "delegates" to that earlier request. Instead of running the job that builds the partition again, the graph emits a delegation event in the build event log, explicitly pointing to the build action it is delegating to.
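
As a rough illustration, a delegation event might carry the following fields; the names here are hypothetical, not DataBuild's actual schema:

```python
# Hypothetical shape of a delegation event in the build event log.
# Field names are illustrative; the real schema may differ.
delegation_event = {
    "type": "partition_delegated",
    "partition": "my_dataset/color=red",
    "build_request": "req-042",    # the new request that is delegating
    "delegated_to": "action-017",  # the build action already producing it
}
```
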
## Components
### Job
The `databuild_job` rule expects to reference a binary that adheres to the following contract (a sketch follows this list):

- For the `config` subcommand, the binary prints the JSON job config to stdout based on the requested partitions. E.g., a binary `bazel-bin/my_binary` prints a valid job config when called like `bazel-bin/my_binary config my_dataset/color=red my_dataset/color=blue`.
- For the `exec` subcommand, the binary produces the partitions that were requested of the `config` subcommand, when run with the job config it produced. E.g., if `config` had printed `{..., "args": ["red", "blue"], "env": {"MY_ENV": "foo"}}`, then calling `MY_ENV=foo bazel-bin/my_binary exec red blue` should produce partitions `my_dataset/color=red` and `my_dataset/color=blue`.
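
A minimal job binary satisfying this contract might look like the sketch below. The `build_color_partition` helper is hypothetical, and only the `args` and `env` fields from the example above are assumed in the config schema:

```python
import json
import os
import sys

def build_color_partition(color: str) -> None:
    # Stand-in for the real partition-writing logic (e.g. a Spark job
    # or an S3 write); hypothetical, not part of DataBuild.
    print(f"writing my_dataset/color={color}", file=sys.stderr)

def config(partitions: list[str]) -> dict:
    # Map requested partition refs like "my_dataset/color=red" to the
    # args/env that `exec` needs. Any config fields beyond "args" and
    # "env" are out of scope for this sketch.
    colors = [ref.split("color=", 1)[1] for ref in partitions]
    return {"args": colors, "env": {"MY_ENV": "foo"}}

def run(colors: list[str]) -> None:
    assert os.environ.get("MY_ENV") == "foo"  # env is supplied by the caller
    for color in colors:
        build_color_partition(color)

if __name__ == "__main__":
    subcommand, args = sys.argv[1], sys.argv[2:]
    if subcommand == "config":
        print(json.dumps(config(args)))
    elif subcommand == "exec":
        run(args)
```
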
Jobs are executed via a wrapper component that provides observability, error handling, and standardized communication with the graph. The wrapper captures all job output as structured logs, enabling comprehensive monitoring without requiring jobs to have network connectivity.
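
A minimal sketch of what such a wrapper could do, assuming nothing about the real component beyond "run the binary and capture its output as structured records":

```python
import json
import os
import subprocess
import sys
import time

def run_wrapped(cmd: list[str], env: dict[str, str]) -> int:
    # Hypothetical wrapper: run the job binary and turn each output line
    # into a structured log record, so the job itself never needs
    # network connectivity to be observable.
    proc = subprocess.Popen(cmd, env={**os.environ, **env},
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    for line in proc.stdout:
        print(json.dumps({"ts": time.time(), "line": line.rstrip()}))
    return proc.wait()
```
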
### Graph
The `databuild_graph` rule expects two attributes, `jobs` and `lookup`:

- The `lookup` binary target should return a JSON object with job labels as keys and, as values, the list of partitions each job is responsible for producing (see the sketch after this list). This enables graph planning by walking backwards through the data dependency graph.
- The `jobs` list should simply contain every job involved in the graph. The graph will recursively call `config` to resolve the full set of jobs to run.
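
A minimal lookup sketch, assuming the requested partition refs arrive as argv and that the prefix-based routing below is purely illustrative:

```python
import json
import sys

def lookup(partitions: list[str]) -> dict[str, list[str]]:
    # Hypothetical routing: any partition under my_dataset/ is produced
    # by //jobs:my_binary. A real lookup would encode the graph's actual
    # partition-to-job mapping.
    out: dict[str, list[str]] = {}
    for ref in partitions:
        if ref.startswith("my_dataset/"):
            out.setdefault("//jobs:my_binary", []).append(ref)
    return out

if __name__ == "__main__":
    print(json.dumps(lookup(sys.argv[1:])))
```
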
### Build Event Log (BEL)
The BEL encodes all relevant build actions that occur, enabling concurrent builds. This includes:

- Graph events, including "build requested", "build started", "analysis started", "build failed", "build completed", etc.
- Job events, including "..."

The BEL is similar to [event-sourced](https://martinfowler.com/eaaDev/EventSourcing.html) systems, as all application state is rendered from aggregations over the BEL. This enables the BEL to stay simple while also powering concurrent builds, the data catalog, and the DataBuild service.
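
For example, a partition's liveness can be derived by folding over the log. A minimal sketch in the event-sourcing style; the event type names (`partition_produced`, `partition_tainted`) are hypothetical, and only the aggregation shape is taken from the text:

```python
# Derive "is this partition live?" purely from the event log:
# state is an aggregation over the BEL, never stored separately.
def partition_is_live(events: list[dict], partition: str) -> bool:
    live = False
    for event in events:  # events in log order
        if event.get("partition") != partition:
            continue
        if event["type"] == "partition_produced":
            live = True
        elif event["type"] == "partition_tainted":
            live = False  # tainted partitions must be rebuilt
    return live
```
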
### Triggers and Wants (Coming Soon)
["Wants"](./design/wants.md) are the main mechanism for continually building partitions over time. In real world scenarios, it is standard for data to arrive late, or not at all. Wants cause the databuild graph to continually attempt to build the wanted partitions until a) the partitions are live or b) the want expires, at which another script can be run. Wants are the mechanism that implements SLA checking.
|
|
|
|
You can also use cron-based triggers, which return partition refs that they want built.
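
A cron-based trigger could then be as simple as a function that emits the refs it wants for the current period. A sketch under assumed naming; nothing here is DataBuild's actual trigger API:

```python
from datetime import date, timedelta

def daily_trigger(today: date) -> list[str]:
    # Hypothetical trigger: each day, want yesterday's partition of some
    # dataset. The returned refs would be registered as wants.
    yesterday = today - timedelta(days=1)
    return [f"s3://some/dataset/date={yesterday.isoformat()}"]

print(daily_trigger(date(2025, 6, 2)))  # ['s3://some/dataset/date=2025-06-01']
```
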
## Key Insights
- Orchestration logic changes all the time - better to not write it at all.
- Orchestration decisions and application logic are innately coupled.
- "systemd for data platforms"
## Assumptions
- Job -> partition relationships are canonical, and job runs are idempotent.