parent ec6494ee59
commit 7e889856e9
4 changed files with 74 additions and 386 deletions

DESIGN.md (new file, +57)
@@ -0,0 +1,57 @@

# DataBuild Design

DataBuild is a trivially-deployable, partition-oriented, declarative build system. Where data orchestration flows are normally imperative and implicit (do this, then do that, etc.), DataBuild uses stated data dependencies to make the process declarative and explicit. DataBuild scales the declarative approach of tools like DBT to meet the needs of modern, broadly integrated data and ML organizations that consume data from many sources, arriving on highly varying schedules. DataBuild enables confident, bounded completeness in a world where input data is effectively never complete at any given time.

## Philosophy

Many large-scale systems for producing data leave the complexity of true orchestration to the user - even DAG-based systems for implementing dependencies leave the system as a collection of DAGs, requiring engineers to repeatedly answer the same questions: "why doesn't this data exist?" and "how do I build this data?"

DataBuild takes inspiration from modern data orchestration and build systems to fully internalize this complexity, using the Job concept to localize all decisions about turning upstream data into output data (and making all dependencies explicit), and the Graph concept to handle composition of jobs, answering what sequence of jobs must be run to build a specific partition of data. With Jobs and Graphs, DataBuild takes complete responsibility for the data build process, allowing engineers to consider only the concerns local to the jobs relevant to their feature.

Graphs and jobs are defined in [bazel](https://bazel.build), allowing graphs (and their constituent jobs) to be built and deployed trivially.

## Concepts

- **Partitions** - A partition is an atomic unit of data. DataBuild's data dependencies work by using partition references (e.g. `s3://some/dataset/date=2025-06-01`) as dependency signals between jobs, allowing the construction of build graphs to produce arbitrary partitions.
- **Jobs** - The atomic unit of work. A job's `exec` entrypoint builds partitions from upstream partitions, and its `config` entrypoint specifies which partitions are required to produce the requested partition(s), along with the specific config to run `exec` with to build them.
- **Graphs** - A graph composes jobs to achieve multi-job orchestration, using a `lookup` mechanism to resolve a requested partition to the job that can build it. Together with its constituent jobs, a graph can fully plan the build of any set of partitions. Most interactions with a DataBuild app happen through a graph.
- **Build Event Log** - Encodes the state of the system, recording build requests, job activity, partition production, etc., enabling DataBuild to run as a deployed application.
- **Bazel targets** - Bazel is a fast, extensible, and hermetic build system. DataBuild uses bazel targets to describe graphs and jobs, making graphs themselves deployable applications. Implementing a DataBuild app is the process of wrapping your data build jobs in `databuild_job` bazel targets and connecting them with a `databuild_graph` target.

### Partition / Job Assumptions and Best Practices

- **Partitions are atomic and final** - Either the data is complete or it's "not there".
- **Partitions are mutually exclusive and collectively exhaustive** - Row membership in a partition should be unambiguous and consistent.
- **Jobs are idempotent** - For the same input data and parameters, the same partition is produced (functionally).

### Partition Delegation

If a partition is already up to date, or is already being built by a previous build request, a new build request "delegates" to that build request. Instead of running the job that builds the partition again, it emits a delegation event in the build event log, explicitly pointing to the build action it is delegating to.

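For illustration, here is a minimal sketch of this delegation decision against a simplified, in-memory view of the build event log. The event shape and kind names are hypothetical, not the real schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BuildEvent:
    """Hypothetical, simplified build event log entry."""
    kind: str                 # e.g. "job_started", "partition_built", "delegated"
    partition_ref: str        # e.g. "my_dataset/color=red"
    build_request_id: str
    delegates_to: Optional[str] = None  # build request being delegated to


def plan_partition(ref: str, request_id: str, log: list[BuildEvent]) -> BuildEvent:
    """Delegate if the partition is already built or being built; otherwise start a job."""
    for event in reversed(log):  # most recent activity for this partition wins
        if event.partition_ref != ref:
            continue
        if event.kind in ("partition_built", "job_started"):
            # Emit a delegation event pointing at the build request that owns the work.
            return BuildEvent("delegated", ref, request_id,
                              delegates_to=event.build_request_id)
    return BuildEvent("job_started", ref, request_id)
```
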
## Components

### Job

The `databuild_job` rule expects to reference a binary that adheres to the following contract:

- For the `config` subcommand, it prints the JSON job config to stdout for the requested partitions. E.g., for a binary `bazel-bin/my_binary`, it prints a valid job config when called like `bazel-bin/my_binary config my_dataset/color=red my_dataset/color=blue`.
- For the `exec` subcommand, it produces the partitions that were requested from the `config` subcommand when run with the job config it produced. E.g., if `config` had produced `{..., "args": ["red", "blue"], "env": {"MY_ENV": "foo"}}`, then calling `MY_ENV=foo bazel-bin/my_binary exec red blue` should produce partitions `my_dataset/color=red` and `my_dataset/color=blue`.

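To make this contract concrete, here is a minimal sketch of such a binary in Python. The `args`/`env` keys follow the job config example above; the `inputs` key, the color-based routing, and the output paths are illustrative assumptions rather than part of the real interface (see [`databuild.proto`](./databuild/databuild.proto) for the actual interfaces):

```python
#!/usr/bin/env python3
"""Sketch of a DataBuild job binary exposing `config` and `exec` subcommands."""
import json
import os
import sys


def config(partitions: list[str]) -> dict:
    # Derive the exec invocation from the requested partition refs,
    # e.g. "my_dataset/color=red" -> exec arg "red".
    colors = [p.split("color=")[-1] for p in partitions]
    return {
        "args": colors,
        "env": {"MY_ENV": "foo"},
        "inputs": [f"upstream_dataset/color={c}" for c in colors],  # hypothetical key
    }


def exec_(colors: list[str]) -> None:
    # Produce one partition per requested color; the storage layout is illustrative.
    for color in colors:
        out_dir = f"/data/my_dataset/color={color}"
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, "part-0.json"), "w") as f:
            json.dump({"color": color, "env": os.environ.get("MY_ENV")}, f)


if __name__ == "__main__":
    subcommand, rest = sys.argv[1], sys.argv[2:]
    if subcommand == "config":
        print(json.dumps(config(rest)))
    elif subcommand == "exec":
        exec_(rest)
    else:
        sys.exit(f"unknown subcommand: {subcommand}")
```
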
### Graph

The `databuild_graph` rule expects two attributes, `jobs` and `lookup`:

- The `lookup` binary target should return a JSON object whose keys are job labels and whose values are the lists of partitions each job is responsible for producing. This enables graph planning by walking backwards through the data dependency graph, as sketched below.
- The `jobs` list should contain every job involved in the graph. The graph recursively calls each job's `config` to resolve the full set of jobs to run.

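As an illustration of the `lookup` contract, here is a minimal sketch of a lookup binary, assuming it receives the requested partition refs as CLI arguments and routes them by dataset prefix. The job labels and the routing rule are hypothetical:

```python
#!/usr/bin/env python3
"""Sketch of a graph `lookup` binary: maps requested partition refs to the job
that can build each of them, printed as {job_label: [partition_ref, ...]}."""
import json
import sys
from collections import defaultdict


def job_for(partition_ref: str) -> str:
    # Illustrative routing rule: map the dataset prefix to a bazel job label.
    dataset = partition_ref.split("/", 1)[0]
    return {
        "reviews": "//jobs:ingest_reviews",          # hypothetical labels
        "review_stats": "//jobs:compute_review_stats",
    }[dataset]


if __name__ == "__main__":
    lookup = defaultdict(list)
    for ref in sys.argv[1:]:
        lookup[job_for(ref)].append(ref)
    print(json.dumps(lookup))
```
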
### Build Event Log (BEL)

The BEL encodes every relevant build action that occurs, enabling concurrent builds. This includes:

- Graph events, including "build requested", "build started", "analysis started", "build failed", "build completed", etc.
- Job events, including "..."

The BEL is similar to [event-sourced](https://martinfowler.com/eaaDev/EventSourcing.html) systems, in that all application state is rendered from aggregations over the BEL. This keeps the BEL simple while also powering concurrent builds, the data catalog, and the DataBuild service.

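To make the event-sourcing idea concrete, here is a sketch of deriving a build request's status by folding over BEL events. The graph event names come from the list above; the tuple shape and status values are assumptions for illustration:

```python
# Sketch: derive a build request's status purely from its BEL events.
# Events are (build_request_id, event_type) tuples here for simplicity;
# the real log would carry richer, typed records.
TERMINAL = {"build failed": "FAILED", "build completed": "COMPLETED"}


def build_status(request_id: str, events: list[tuple[str, str]]) -> str:
    status = "UNKNOWN"
    for rid, event_type in events:  # replay in log order
        if rid != request_id:
            continue
        if event_type == "build requested":
            status = "REQUESTED"
        elif event_type == "analysis started":
            status = "ANALYZING"
        elif event_type == "build started":
            status = "RUNNING"
        elif event_type in TERMINAL:
            status = TERMINAL[event_type]
    return status


# Example: replaying a small log.
log = [
    ("req-1", "build requested"),
    ("req-1", "analysis started"),
    ("req-1", "build started"),
    ("req-1", "build completed"),
]
assert build_status("req-1", log) == "COMPLETED"
```
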
README.md (65 changed lines)
@@ -1,57 +1,26 @@

# DataBuild

A bazel-based data build system.
DataBuild is a trivially-deployable, partition-oriented, declarative build system.

For important context, check out [the manifesto](./manifesto.md), and [core concepts](./core-concepts.md). Also, check out [`databuild.proto`](./databuild/databuild.proto) for key system interfaces.
For important context, check out [DESIGN.md](./DESIGN.md). Also, check out [`databuild.proto`](./databuild/databuild.proto) for key system interfaces.

## Testing
## Usage

See the [podcast example BUILD file](examples/podcast_reviews/BUILD.bazel).

## Development

### Testing

DataBuild core testing:

```bash
bazel test //...
```

End-to-end testing:

### Quick Test
Run the comprehensive end-to-end test suite:
```bash
./run_e2e_tests.sh
```

### Core Unit Tests
```bash
# Run all core DataBuild tests
./scripts/bb_test_all

# Remote testing
./scripts/bb_remote_test_all
```

### Manual Testing
```bash
# Test basic graph CLI build
cd examples/basic_graph
bazel run //:basic_graph.build -- "generated_number/pippin"

# Test podcast reviews CLI build
cd examples/podcast_reviews
bazel run //:podcast_reviews_graph.build -- "reviews/date=2020-01-01"

# Test service builds
bazel run //:basic_graph.service -- --port=8080
# Then in another terminal:
curl -X POST -H "Content-Type: application/json" \
  -d '{"partitions": ["generated_number/pippin"]}' \
  http://localhost:8080/api/v1/builds
```

### Event Validation Tests
The end-to-end tests validate that CLI and Service builds emit identical events:
- **Event count alignment**: CLI and Service must generate the same total event count
- **Event type breakdown**: Job, partition, and build_request events must match exactly
- **Event consistency**: Both interfaces represent the same logical build process

Example test output:
```
Event breakdown:
  Job events: CLI=2, Service=2
  Partition events: CLI=3, Service=3
  Request events: CLI=9, Service=9
✅ All build events (job, partition, and request) are identical
✅ Total event counts are identical: 14 events each
```

core-concepts.md (deleted, -187)
@@ -1,187 +0,0 @@

# Tenets
- No dependency knowledge necessary to materialize data
- Only local dependency knowledge to develop
- Not a framework (what does this mean?)

# Organizing Philosophy

Many large-scale systems for producing data leave the complexity of true orchestration to the user - even DAG-based systems for implementing dependencies leave the system as a collection of DAGs, requiring engineers to repeatedly answer the same questions: "why doesn't this data exist?" and "how do I build this data?"

DataBuild takes inspiration from modern data orchestration and build systems to fully internalize this complexity, using the Job concept to localize all decisions about turning upstream data into output data (and making all dependencies explicit), and the Graph concept to handle composition of jobs, answering what sequence of jobs must be run to build a specific partition of data. With Jobs and Graphs, DataBuild takes complete responsibility for the data build process, allowing engineers to consider only the concerns local to the jobs relevant to their feature.

Graphs and jobs are defined in [bazel](https://bazel.build), allowing graphs (and their constituent jobs) to be built and deployed trivially.

# Nouns / Verbs / Phases

## Partitions
DataBuild is fundamentally about composing graphs of jobs and partitions of data, where partitions are the things we want to produce, or the nodes between jobs. E.g., in a machine learning pipeline, a partition would be the specific training dataset produced for a given date, model version, etc., which would in turn be read by the model training job, which would itself produce a partition representing the trained model.

Partitions are assumed to be atomic and final (conditional on their input partitions being final), such that it is unambiguous in what cases a partition must be (re)calculated.

## Partition References

A partition reference (or partition ref) is a serialized reference to a literal partition of data. It can be anything, so long as it uniquely identifies its partition, but something path-like or URI-like is generally advisable for ergonomic purposes; e.g. `/datasets/reviews/v1/date=2025-05-04/country=usa` or `dal://ranker/features/return_stats/2025/05/04/`.

## Jobs
```mermaid
flowchart LR
    upstream_a[(Upstream Partition A)]
    upstream_b[(Upstream Partition B)]
    job[Job]
    output_c[(Output Partition C)]
    output_d[(Output Partition D)]
    upstream_a & upstream_b --> job --> output_c & output_d
```

In DataBuild, `Job`s are the atomic unit of data processing, representing the mapping of upstream partitions into output partitions. A job is defined by two capabilities: 1) it exposes an executable that runs the job and produces the desired partitions of data (configured via env vars and args), returning manifests that describe the produced partitions; and 2) it exposes a configuration executable that turns references to desired partitions into a job config that fully configures the job executable to produce those partitions.

Jobs are assumed to be idempotent and independent, such that two jobs configured to produce separate partitions can run without interaction. These assumptions allow jobs to state only their immediate upstream and output data dependencies (the partitions they consume and produce), and in a graph leave no ambiguity about what must be done to produce a desired partition.

Jobs are implemented via the [`databuild_job`](databuild/rules.bzl) bazel rule. An extremely basic job definition can be found in the [basic_job example](./examples/basic_job/).

## Graphs
A `Graph` is the composition of jobs and partitions via their data dependencies. Graphs answer "what partitions does a job require to produce its outputs?" and "what job must be run to produce a given partition?" Defining a graph relies only on the list of involved jobs and a lookup executable that transforms desired partitions into the job(s) that produce them.

Graphs expose two entrypoints: `graph.analyze`, which produces the literal `JobGraph` specifying the structure of the build graph to be executed to build a specific set of partitions (enabling visualization, planning, precondition checking, etc); and `graph.build`, which runs the build process for a set of requested partitions (relying on `graph.analyze` to plan). Other entrypoints are described in the [graph README](databuild/graph/README.md).

Graphs are implemented via the [`databuild_graph`](databuild/rules.bzl) bazel rule. A basic graph definition can be found in the [basic_graph example](./examples/basic_graph/).

### Implementing a Graph
To make a fully described graph, engineers must define:

- `databuild_job`s
  - Implementing the exec and config targets for each
- A `databuild_graph` (referencing a `lookup` binary to resolve jobs)

And that's it!

## Catalog
A catalog is a database of partition manifests and past/in-progress graph builds and job runs. When run with a catalog, graphs can:

- Skip jobs whose outputs are already present and up to date.
- Safely run data builds in parallel, delegating overlapping partition requests to already scheduled/running jobs.

TODO - plan and implement this functionality.

---
# Appendix

## Future

- Partition versions - e.g. how to not invalidate prior produced data with every code change?
  - merkle tree + semver as implementation?
  - mask upstream changes that aren't major
  - content addressable storage based on action keys that point to merkle tree
- compile to set of build files? (thrash with action graph?)
  - catalog of partition manifests + code artifacts enables this
  - start with basic presence check?

## Questions
- How does partition overlap work? Can it be pruned? Or throw during configure? This sounds like a very common case
  - Answer: this is a responsibility of a live service backed by a datastore. If jobs are in fact independent, then refs requested by another build can be "delegated" to the already-running jobs for those refs.
- How do we implement job lookup for graphs? Is this a job catalog thing?
  - Answer: Yes, job graphs have a `lookup` attr
- How do graphs handle caching? We can't plan a whole graph if job configs contain mtimes, etc (we don't know when the job will finish). So it must detect stale partitions (and downstreams) that need to be rebuilt?
- How do we handle non-materialize relationships outside the graph?
  - Answer: Provide build modes, but otherwise awaiting external data is a non-core problem

## Ideas
- Should we have an "optimistic" mode that builds all partitions that can be built?
- Emit an event stream for observability purposes?

## Partition Overlap
For example, we have two partitions we want to build for two different concerns, e.g. pulled by two separate triggers, and both of these partitions depend on some of the same upstreams.

- Do we need managed state, which is the "pending build graph"? Do we need an (internal, at least) data catalog?
  - Leave a door open, but don't get nerd sniped
- Make sure the `JobGraph` is merge-able
  - How do we merge data deps? (timeout is time based) - Do we need to?

## Data Ver & Invalidation
Sometimes there are minor changes that don't invalidate past produced data, and sometimes there are major changes that do invalidate past partitions. Examples:

- No invalidation: add an optional field for a new feature not relevant to past data
- Invalidation: whoops, we were calculating the score wrong

This is separate from "version the dataset", since a dataset version represents a structure/meaning, and partitions produced in the past can be incorrect for the intended structure/meaning and legitimately need to be overwritten. In contrast, new dataset versions introduce new intended structure/meaning. This should be an optional concept (e.g. default version is `v0.0.0`).

## Why Deployability Matters
This needs to be deployable trivially from day one because:
- We want to "launch jobs" in an un-opinionated way - tell bazel what platform you're building for, then boop the results off to that system, and run it
- Being able to vend executables makes building weakly coupled apps easy (not a framework)

# Demo Development
1. `databuild_job` ✅
   1. `databuild_job.cfg` ✅
   2. `databuild_job.exec` ✅
   3. Tests ✅
   4. `databuild_job` (to `cfg` and `exec`) ✅
   5. Deployable `databuild_job` ✅
2. `databuild_graph` ✅
   1. `databuild_graph.analyze` ✅
   2. `databuild_graph` provider ✅
   3. `databuild_graph.exec` ✅
   4. `databuild_graph.build` ✅
   5. `databuild_graph.mermaid` ✅
5. podcast reviews example
6. Reflect (data versioning/caching/partition manifests, partition overlap, ...?)

# Factoring
- Core - graph description, build, analysis, and execution
- Service - job/partition catalog, parallel execution, triggers, exposed service
- Product - Accounts/RBAC, auth, delegates for exec/storage

# Service Sketch

```mermaid
flowchart
    codebase
    subgraph service
        data_service
    end
    subgraph database
        job_catalog
        partition_catalog
    end
    codebase -- deployed_to --> data_service
    data_service -- logs build events --> job_catalog
    data_service -- queries/records partition manifest --> partition_catalog
```

# Scratch
Implementation:

- Bazel to describe jobs/graphs
- Whatever you want to implement jobs and graphs (need solid interfaces)

```starlark
databuild_graph(
    name = "my_graph",
    jobs = [":my_job", ...],
    plan = ":my_graph_plan",
)

py_binary(
    name = "my_graph_plan",
    ...
)

databuild_job(
    name = "my_job",
    configure = ":my_job_configure",
    run = ":my_job_binary",
)

scala_binary(
    name = "my_job_configure",
    ...
)

scala_binary(
    name = "my_job_binary",
    ...
)
```

manifesto.md (deleted, -151)
@@ -1,151 +0,0 @@

# DataBuild Manifesto

## Why

### Motivation

The modern ML company today is a data company, whose value is derived from ingesting, processing, and refining data. The vast majority of our data providers deliver data in discrete batches, most interfaces where data is vended internally or to customers are batch-based, and the flow this data follows from consumption to vending is often long and complex. We have the opportunity to define a simple and declarative fabric that connects our data inputs to our data outputs in a principled manner, separating concerns by scale and minimizing our operational overhead. This fabric would remove the human judgment required to produce any data asset, and take responsibility for achieving the end-to-end data production process.

This fabric also allows us to completely define the scope of involved code and data for a given concern, increasing engineering velocity and quality. It also allows separating concerns by scale: keeping job-internal logic separate from job/dataset composition logic, minimizing the number of things that must be considered when changing or authoring new code and data. It is also important to be practical, not dogmatic: this system should thrive in an environment with experimentation and other orchestration strategies, so long as the assumptions below hold true.

These capabilities also help us achieve important eng and business goals:

- Automatically handling updated data
- Correctness checking across data dependencies
- Automatically enforcing data usage policies
- Job reuse & running differently configured instances of the same job in parallel
- Lineage tracking

### Assumptions

First, let's state a few key assumptions from current best practices and standards:

- Batches (partitions) are mutually exclusive and collectively exhaustive by domain (dataset).
- A step in the data production process (job) can completely define what partitions it needs and produces.
- Produced partitions are final, conditional on their inputs being final.
- Jobs are idempotent and (practically) produce no side effects aside from output partitions.

### Analogy: Build Systems

One immediate analogy is software [build systems](https://en.wikipedia.org/wiki/Build_automation), like Bazel, which use build targets as nodes in a graph that can be queried to produce desired artifacts correctly. Build systems rely on declared edges between targets to resolve what work needs to be done and in what order, also allowing for sensible caching. These base assumptions allow the build system to automatically handle orchestration for any build request, meaning incredibly complex build processes (like building whole OS releases), which are otherwise too complex for humans to orchestrate themselves, are solvable trivially by computers. This also allows rich interaction with the build process, like querying dependencies and reasoning about the build graph. Related: see the [dazzle demo](https://docs.google.com/presentation/d/18tL4f_fXCkoaQ7zeSs0AaciryRCR4EmX7hd1wL3Zy8c/edit#slide=id.g2ee07a77300_0_8) discussing the similarities.

The complicating factor for us is that we rely on data for extensive caching. We could, in principle, reprocess all received data (e.g. in an ingest bucket) whenever we want to produce an output partition for customers, treating the produced data itself as a build target, ephemeral and valid only for the requested partition. However, this is incredibly wasteful, especially when the same partition of intermediate data is read potentially hundreds of times.

This is our motivation for a different kind of node in our build graph: dataset partitions. It's not clear exactly how we should handle these nodes differently, but it's obvious that rebuilding a partition is orders of magnitude more expensive than rebuilding a code build target, so we want to cache these as long as possible. That said, there is obvious opportunity to rebuild partitions when their inputs change, either because new/updated data arrives or because their definition changes.

### Analogy: Dataflow Programming

[Dataflow programming](https://en.wikipedia.org/wiki/Dataflow_programming) is another immediate analogy, where programs are described as a DAG of operations rather than a sequence of operations (a la imperative paradigms). This implicitly allows parallel execution, as operations run when their inputs are available, not when explicitly requested.

![[Pasted image 20250323203936.png]]

Zooming out, we can take the build process described above and model it instead as a dataflow program: jobs are operations, partitions are values, and the topo-sorted build action graph is the compiled program. One valuable insight from dataflow programming is the concept of "ports", like a function parameter or return value. Bazel has a similar concept, where build rules expose parameters they expect targets to set (e.g. `srcs`, `deps`, `data`, etc.) that enable sandboxing and hermeticity. Here, we can extend the port concept to data as well, allowing jobs to specify what partitions they need before they can run, and having that explicitly control job-internal data resolution for a single source of truth.

### Analogy: Workflow Orchestration Systems


### Practicality and Experimentation

What separates this from build systems is the practical consideration that sometimes we just need to notebook up some data to get results to a customer faster, or run an experiment without productionizing the code. This is a deviation from the build system analogy, where you would never dream of reaching in and modifying a `.o` file as part of the build process, but we regularly and intentionally produce data that is not yet encoded in a job. This is a superpower of data interfaces, and an ability we very much want to maintain.

### Stateless, Declarative Data Builds

In essence, what we want to achieve is declarative, stateless, repeatable data builds, in a system that is easy to modify and experiment with, and which is easy to use, improve, and verify the correctness of. We want a system that takes responsibility for achieving the end-to-end journey of producing data outputs.

## How

Here's what we need:

- A set of nouns and organizing semantics that completely describe the data build process
- A strategy for composing jobs & data

### DataBuild Nouns

- **Dataset** - Data meta-containers / bookshelves of partitions
- **Partition** - Atomic units of data
- **Job** - Transforms read partitions into written partitions
- **Job Target** - Transforms reference(s) to desired partition(s) into the job config and data deps that produce them
- **Data deps** - The set of references to partitions required to run a job and produce the desired output partitions
- **Code build time / data build time / job run time** - Separate build phases / scopes of responsibility. Code build time is when code artifacts are compiled, data build time is when jobs are configured and data dependencies are resolved/materialized, and job run time is when the job is actually run (generally at the end of data build time).

### Composition

An important question here is how we describe composition between jobs and data, and how we practically achieve partitions that require multiple jobs to build.

Imperative composition (e.g. "do A, then do B, then…") is difficult to maintain and easy to mess up without granular data dep checking. Instead, because job targets define their data deps, we can rely on "pull semantics" to define the composition of jobs and data, a la [Luigi](https://luigi.readthedocs.io/en/stable/). This would be achieved with a "materialize" type of data dependency, where a `materialize(A)` dep on B would mean that we ensure A exists before building B, building A if necessary.

This means we can achieve workloads of multiple jobs by invoking materialize data deps at build time. In practice, this means a DAG generated for a given job target would invoke other DAGs before it ran its own job, to ensure its upstream data deps were present. To ensure we have observability of the whole flow for a given requested build, we can log a build request ID alongside other build metadata, or rely on orchestration systems that support this innately (like Prefect).

This creates a convenient interface for our upcoming PerfOpt web app (and similar applications), requiring them only to ask for a partition to exist in order to fulfill a segment refresh or reconfiguration.

### Composition With Non-DataBuild Deps

A key requirement for any successful organizing system is the flexibility to integrate with other systems. This is achieved out of the box thanks to the organizing assumptions above: DataBuild operations can wait patiently for partitions that are built by other systems, and expose the ability to "pull" via events/requests/etc. Success in this space means enabling each group of engineers to solve problems as they see fit, with a happy path that covers most work and yields the efficiency and quality that stem from a shared implementation.

---

# Appendix

## Questions

### Are they data build programs?

A tempting analogy is to say that we are compiling a program that builds the data, e.g. compiling the set and order of jobs to produce a desired output. This seems similar to, yet innately different from, the "it's a build system" analogy. The difference seems related to expectations about isolation: build systems generally allow and assume a significant amount of isolation, whereas more general programs make a weaker assumption. This is a key distinction, as we quite commonly and intentionally provide ad-hoc data for experimentation or quick customer turnaround, and are likely to continue to do so in the future.

### Explicit coupling across data?

Naive data interfaces allow laundering of data: e.g. we may fix a bug in job A and not realize that we need to rerun jobs B and C because they consume job A's produced data. DataBuild as a concept brings focus to the data dependency relationship, making it explicit which jobs should be rerun after the bug fix. This creates a new question: "how much should we optimize data reconciliation?" We could introduce concepts like minor versioning of datasets or explicitly consumed columns that would let us detect more accurately which jobs need to be rerun, but this depends on excessively rerunning jobs being a regrettable outcome. However, if job runs are cheap, the added complexity from these concepts may be more expensive to the business than the wasted compute. Through this lens, we should lean on the simpler, more aggressive recalculation strategy until it's obvious that we need to increase efficiency.

### Is this just `make` for data?

The core problem that makes DataBuild necessary beyond what Make offers is the unique nature of data processing at scale:

1. Data dependencies are more complex than code dependencies
2. Data processing is significantly more expensive than code compilation
3. Data often requires temporal awareness (partitioning by time)
4. Data work involves a mix of production systems and experimentation
5. Data projects require both rigorous pipelines and flexibility for exploration

In essence, DataBuild isn't just Make with partition columns added - it's a reconceptualization of build systems specifically for data processing flows, recognizing the unique properties of datasets versus software artifacts, while still leveraging the power of declarative dependency management.

### What are the key JTBDs?

- Configure and run jobs
- Declare job and dataset targets, and their relationships
  - With what? Bazel-like language? Dagster-like?
- Assert correctness of the data build graph
  - Including data interfaces?
- Catalog partition liveness and related metadata
  - All in postgres?
- Managing data reconciliation and recalculation
  - Built off of a single event stream?
- Enforcing data usage policies (implementing taints)
- Tracking data lineage, job history
- Data cache management

- Data access? Implement accessors in execution frameworks, e.g. spark/duckdb/standard python/scala/etc?
- Automated QA, alerting, and notifications? Or establishing ownership of datasets? (or just ship metrics and let users handle elsewhere)

## Components
- Data catalog / partition event log
- Orchestrator / scheduler
- Compiler (description --> build graph)

```mermaid
flowchart LR
    compiler([Compiler])
    orchestrator([Orchestrator])
    data_catalog[(Data Catalog)]
    job_log[(Job Log)]
    databuild_code --> compiler --> databuild_graph
    databuild_graph & data_catalog --> orchestrator --> job_runs --> job_log & data_catalog
```

Notes:
- Data access details & lineage tracking may need to happen via the same component, but this is considered an "internal to job runs" concern currently.

### Data Catalog
The data catalog is an essential component that enables cache-aware planning of data builds. It is a mapping of `(partition_ref, mtime) -> partition_manifest`. Access to the described data graph and the data catalog is all that is needed to plan out the net work required to materialize a new partition. Jobs themselves are responsible for any short-circuiting of work that happens out of band.

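A rough sketch of that mapping, with an illustrative manifest shape (the real manifest schema and staleness rules are not specified here):

```python
# Sketch: a data catalog as a mapping of (partition_ref, mtime) -> partition_manifest.
# The manifest contents and the "net work" check are illustrative assumptions.
from typing import NamedTuple


class PartitionManifest(NamedTuple):
    partition_ref: str
    input_refs: tuple[str, ...]  # upstream partitions this partition was built from
    built_at: float              # job completion time (epoch seconds)


Catalog = dict[tuple[str, float], PartitionManifest]


def needs_build(catalog: Catalog, partition_ref: str, mtime: float) -> bool:
    """A partition needs (re)building if no manifest exists for this (ref, mtime)."""
    return (partition_ref, mtime) not in catalog
```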