# DataBuild Manifesto
## Why
### Motivation
The modern ML company is a data company: its value derives from ingesting, processing, and refining data. The vast majority of our data providers deliver data in discrete batches, most interfaces where data is vended internally or to customers are batch-based, and the flow this data follows from consumption to vending is often long and complex. We have the opportunity to define a simple, declarative fabric that connects our data inputs to our data outputs in a principled manner, separating concerns by scale and minimizing our operational overhead. This fabric would remove the human judgment currently required to produce any data asset and take responsibility for the end-to-end data production process.
This fabric also lets us completely define the scope of code and data involved in a given concern, increasing engineering velocity and quality. It separates concerns by scale, keeping job-internal logic apart from job/dataset composition logic, which minimizes the number of things that must be considered when changing or authoring code and data. It is also important to be practical rather than dogmatic: this system should thrive in an environment with experimentation and other orchestration strategies, so long as the assumptions below hold true.
These capabilities also help us achieve important engineering and business goals:
- Automatically handling updated data
- Correctness checking across data dependencies
- Automatically enforcing data usage policies
- Job reuse, including running the same job with different configurations in parallel
- Lineage tracking
### Assumptions
First, let’s state a few key assumptions from current best practices and standards:
- Batches (partitions) are mutually exclusive and collectively exhaustive by domain (dataset).
- A step in the data production process (job) can completely define what partitions it needs and produces.
- Produced partitions are final, conditional on their inputs being final.
- Jobs are idempotent and (practically) produce no side effects aside from output partitions.
### Analogy: Build Systems
One immediate analogy is software [build systems](https://en.wikipedia.org/wiki/Build_automation), like Bazel, which use build targets as nodes in a graph that can be queried to produce desired artifacts correctly. Build systems rely on declared edges between targets to resolve what work needs to be done and in what order, which also allows for sensible caching. These base assumptions let the build system handle orchestration automatically for any build request, so incredibly complex build processes (like building whole OS releases) that are too complex for humans to orchestrate themselves become trivially solvable by computers. This also allows rich interaction with the build process, like querying dependencies and reasoning about the build graph. Related: see the [dazzle demo](https://docs.google.com/presentation/d/18tL4f_fXCkoaQ7zeSs0AaciryRCR4EmX7hd1wL3Zy8c/edit#slide=id.g2ee07a77300_0_8) discussing the similarities.
The complicating factor for us is that we rely on data for extensive caching. We could, in principle, reprocess all received data (e.g. in an ingest bucket) when we want to produce any output partition for customers, treating the produced data itself as a build target, ephemeral and valid only for the requested partition. However, this is incredibly wasteful, especially when the same partition of intermediate data is read potentially hundreds of times.
This is our motivation for a different kind of node in our build graph: dataset partitions. It's not clear exactly how we should handle these nodes differently, but it is obvious that rebuilding a partition is orders of magnitude more expensive than rebuilding a code build target, so we want to cache partitions as long as possible. That said, there is an obvious opportunity to rebuild partitions when their inputs change, whether because new/updated data arrives or because a job definition changes.
### Analogy: Dataflow Programming
[Dataflow programming](https://en.wikipedia.org/wiki/Dataflow_programming) is another immediate analogy, where programs are described as a DAG of operations rather than a sequence of operations (à la imperative paradigms). This implicitly allows parallel execution, since operations run when their inputs are available rather than when explicitly invoked (as in imperative programs).
![[Pasted image 20250323203936.png]]
Zooming out, we can take the build process described above and model it instead as a dataflow program: jobs are operations, partitions are values, and the topo-sorted build action graph is the compiled program. One valuable insight from dataflow programming is the concept of "ports", analogous to a function parameter or return value. Bazel has a similar concept, where build rules expose parameters they expect targets to set (e.g. `srcs`, `deps`, `data`, etc.), which enables sandboxing and hermeticity. Here, we can extend the port concept to data as well, allowing jobs to specify what partitions they need before they can run, and having that declaration explicitly control job-internal data resolution so there is a single source of truth.
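To make the port idea concrete, here is a minimal sketch of what an explicit data-port declaration could look like (the names `PartitionRef`, `JobPorts`, and the `impressions`/`labels`/`daily_features` datasets are hypothetical, not an existing API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PartitionRef:
    """A reference to one partition of a dataset, e.g. impressions / 2025-03-20."""
    dataset: str
    partition_key: str

@dataclass
class JobPorts:
    """Explicit data 'ports': the only partitions a job may read or write."""
    inputs: list[PartitionRef]
    outputs: list[PartitionRef]

def daily_features_ports(day: str) -> JobPorts:
    # The job target derives ports from the requested output partition, so the
    # orchestrator and the job itself share a single source of truth for data deps.
    return JobPorts(
        inputs=[PartitionRef("impressions", day), PartitionRef("labels", day)],
        outputs=[PartitionRef("daily_features", day)],
    )
```

The point of the sketch is that the same declaration the orchestrator uses for scheduling is the one the job uses to resolve its reads and writes.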
### Analogy: Workflow Orchestration Systems
### Practicality and Experimentation
What separates this from build systems is the practical consideration that sometimes we just need to notebook up some data to get results to a customer faster, or run an experiment without productionizing the code. This is a deviation from the build system analogy, where you would never dream of reaching in and modifying a `.o` file as part of the build process, but we regularly and intentionally produce data that is not yet encoded as a job. This is a superpower of data interfaces, and an ability we very much want to maintain.
### Stateless, Declarative Data Builds
In essence, what we want to achieve is declarative, stateless, repeatable data builds, in a system that is easy to modify and experiment with, and that is easy to use, improve, and verify the correctness of. We want a system that takes responsibility for the end-to-end journey of producing data outputs.
## How
Here’s what we need:
- A set of nouns and organizing semantics that completely describe the data build process
- A strategy for composing jobs & data
### DataBuild Nouns
- **Dataset** - Data meta-containers / bookshelves of partitions
- **Partition** - Atomic units of data
- **Job** - Transforms read partitions into written partitions
- **Job Target** - Transforms reference(s) to desired partition(s) into the job config and data deps that produce them (see the sketch after this list)
- **Data deps** - The set of references to partitions required to run a job and produce the desired output partitions
- **Code build time / data build time / job run time** - Separate build phases / scopes of responsibility. Code build time is when code artifacts are compiled, data build time is when jobs are configured and data dependencies are resolved/materialized, and job run time is when the job actually runs (generally at the end of data build time).
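As a rough illustration of how these nouns fit together (a sketch with hypothetical names, not a committed interface), a job target can be modeled as a function from a desired partition to a runnable job config plus its data deps:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Partition:
    """Atomic unit of data, addressed by dataset name and partition key."""
    dataset: str
    key: str

@dataclass
class JobConfig:
    """What a job needs at job run time: an entrypoint plus parameters."""
    entrypoint: str
    args: dict = field(default_factory=dict)

@dataclass
class ResolvedTarget:
    """The output of data build time: a configured job, its data deps, and its outputs."""
    config: JobConfig
    data_deps: list[Partition]
    outputs: list[Partition]

def segment_stats_target(day: str) -> ResolvedTarget:
    """Job target: turns a reference to a desired partition into config + data deps."""
    return ResolvedTarget(
        config=JobConfig(entrypoint="jobs/segment_stats.py", args={"day": day}),
        data_deps=[Partition("events", day), Partition("segments", day)],
        outputs=[Partition("segment_stats", day)],
    )
```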
### Composition
An important question here is how we describe composition between jobs and data, and how we practically build partitions that require multiple jobs to produce.
Imperative composition (e.g. "do A, then do B, then…") is difficult to maintain and easy to get wrong without granular data dep checking. Instead, because job targets define their data deps, we can rely on "pull semantics" to define the composition of jobs and data, à la [Luigi](https://luigi.readthedocs.io/en/stable/). This would be achieved with a "materialize" type of data dependency, where a `materialize(A)` dep on B means we ensure A exists before building B, building A if necessary.
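A minimal sketch of these pull semantics, assuming an in-memory stand-in for the data catalog and hypothetical helper names (a real implementation would live in the orchestrator):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PartitionRef:
    dataset: str
    key: str

def materialize(partition, catalog, targets, run_job):
    """Ensure `partition` exists, building it (and its upstreams) first if needed.

    catalog: set of PartitionRefs that are already live (stand-in for the data catalog)
    targets: dataset name -> job target, i.e. fn(PartitionRef) -> (job_config, data_deps)
    run_job: callable that actually executes a configured job
    """
    if partition in catalog:
        return  # cache hit: the partition is already live

    job_config, data_deps = targets[partition.dataset](partition)

    # Pull semantics: recursively materialize every upstream dep before running.
    for dep in data_deps:
        materialize(dep, catalog, targets, run_job)

    run_job(job_config)     # the job writes `partition` as its only side effect
    catalog.add(partition)  # record the new partition as live

# Example: asking for segment_stats/2025-03-20 pulls events/2025-03-20 first.
targets = {
    "events": lambda p: ({"job": "ingest_events", "day": p.key}, []),
    "segment_stats": lambda p: ({"job": "segment_stats", "day": p.key},
                                [PartitionRef("events", p.key)]),
}
materialize(PartitionRef("segment_stats", "2025-03-20"), set(), targets,
            run_job=lambda cfg: print("running", cfg))
```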
This means we can achieve workloads of multiple jobs by invoking materialize data deps at build time. In practice, a DAG generated for a given job target would invoke other DAGs before running its own job, to ensure its upstream data deps are present. To keep observability of the whole flow for a given requested build, we can log a build request ID alongside other build metadata, or rely on orchestration systems that support this natively (like Prefect).
This creates a convenient interface for our upcoming PerfOpt web app (and similar applications): to fulfill a segment refresh or reconfiguration, they only need to ask for a partition to exist.
### Composition With Non-DataBuild Deps
A key requirement for any successful organizing system is the flexibility to integrate with other systems. This comes out of the box thanks to the organizing assumptions above: DataBuild operations can wait patiently for partitions that are built by other systems, and can expose the ability to "pull" via events/requests/etc. Success in this space means enabling each group of engineers to solve problems as they see fit, with a happy path that covers most work and delivers the efficiency and quality that come from a shared implementation.
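As a sketch of what this could look like (the catalog API shown here is an assumption, not something defined yet), a dep on a partition produced by a non-DataBuild system reduces to waiting for the data catalog to report it live, rather than invoking a job target:

```python
import time

def wait_for_external_partition(catalog, partition, poll_seconds=60, timeout_seconds=6 * 3600):
    """Block until a partition built by another system shows up in the data catalog.

    `catalog.exists(partition)` is an assumed interface; an event-driven setup could
    subscribe to partition events instead of polling.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if catalog.exists(partition):
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"partition never became live: {partition}")
```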
---
# Appendix
## Questions
### Are they data build programs?
A tempting analogy is to say that we are compiling a program that builds the data, e.g. compiling the set and order of jobs needed to produce a desired output. This seems similar to, yet innately different from, the "it's a build system" analogy. The difference seems related to expectations about isolation: build systems generally allow and assume a significant amount of isolation, whereas more general programs make weaker assumptions. This is a key distinction, as we quite commonly and intentionally provide ad-hoc data for experimentation or quick customer turnaround, and are likely to continue doing so.
### Explicit coupling across data?
Naive data interfaces allow laundering of data, e.g. we may fix a bug in job A and not realize that we need to rerun jobs B and C because they consume job A’s produced data. DataBuild as a concept brings focus to the data dependency relationship, making it explicit what jobs could be rerun after the bug fix. This creates a new question, “how much should we optimize data reconciliation?” We could introduce concepts like minor versioning of datasets or explicitly consumed columns that would allow us to detect more accurately which jobs need to be rerun, but this depends on excessively rerunning jobs being a regrettable outcome. However, if job runs are cheap, the added complexity from these concepts may be more expensive than the wasted compute to the business. Through this lens, we should lean on the simpler, more aggressive recalculation strategy until it’s obvious that we need to increase efficiency.
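A sketch of the simple, aggressive strategy (structures here are hypothetical): using the dependency edges that job targets already declare, a fix to job A marks every partition downstream of A's outputs for rebuild, with no attempt to detect whether the consumed columns actually changed.

```python
def downstream_partitions(consumers, changed):
    """Return every partition reachable downstream of the changed ones.

    consumers: dict mapping a partition to the partitions built from it
               (derivable by inverting the data deps that job targets declare)
    changed:   partitions whose producing job was fixed and re-run
    """
    to_rebuild, stack = set(), list(changed)
    while stack:
        part = stack.pop()
        for downstream in consumers.get(part, ()):
            if downstream not in to_rebuild:
                to_rebuild.add(downstream)
                stack.append(downstream)
    return to_rebuild

# Example: fixing the job that writes A forces B, C, and (via B) D to rebuild.
print(downstream_partitions({"A": ["B", "C"], "B": ["D"]}, ["A"]))  # {'B', 'C', 'D'}
```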
### Is this just `make` for data?
The core problem that makes DataBuild necessary beyond what Make offers is the unique nature of data processing at scale:
1. Data dependencies are more complex than code dependencies
2. Data processing is significantly more expensive than code compilation
3. Data often requires temporal awareness (partitioning by time)
4. Data work involves a mix of production systems and experimentation
5. Data projects require both rigorous pipelines and flexibility for exploration
In essence, DataBuild isn't just Make with partition columns added - it's a reconceptualization of build systems specifically for data processing flows, recognizing the unique properties of datasets versus software artifacts, while still leveraging the power of declarative dependency management.
### What are the key JTBDs?
- Configure and run jobs
- Declare job and dataset targets, and their relationships
  - With what? Bazel-like language? Dagster-like?
- Assert correctness of the data build graph
  - Including data interfaces?
- Catalog partition liveness and related metadata
  - All in postgres?
- Managing data reconciliation and recalculation
  - Built off of a single event stream?
- Enforcing data usage policies (implementing taints)
- Tracking data lineage, job history
- Data cache management

- Data access? Implement accessors in execution frameworks, e.g. spark/duckdb/standard python/scala/etc?
- Automated QA, alerting, and notifications? Or establishing ownership of datasets? (or just ship metrics and let users handle elsewhere)
## Components
- Data catalog / partition event log
- Orchestrator / scheduler
- Compiler (description --> build graph)
```mermaid
flowchart LR
    compiler([Compiler])
    orchestrator([Orchestrator])
    data_catalog[(Data Catalog)]
    job_log[(Job Log)]
    databuild_code --> compiler --> databuild_graph
    databuild_graph & data_catalog --> orchestrator --> job_runs --> job_log & data_catalog
```
Notes:
- Data access details & lineage tracking may need to happen via the same component, but are currently considered an "internal to job runs" concern.
### Data Catalog
The data catalog is an essential component that enables cache-aware planning of data builds. It is a mapping of `(partition_ref, mtime) -> partition_manifest`. Access to the described data graph and the data catalog is all that is needed to plan out the net work needed to materialize a new partition. Jobs themselves are responsible for any short-circuiting for work that happens out of band.
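A minimal sketch of cache-aware planning on top of that mapping (names are hypothetical, and a simple newer-than-upstream `mtime` rule stands in for whatever freshness policy we adopt): walk the described data graph and keep only the job targets whose output partitions are missing or stale.

```python
def plan_net_work(catalog, graph, requested):
    """Return the job targets that actually need to run to materialize `requested`.

    catalog:   partition_ref -> mtime of the live partition (a simplified view of the
               (partition_ref, mtime) -> partition_manifest mapping)
    graph:     partition_ref -> (producing_target, [upstream partition_refs])
    requested: partition_refs the caller wants materialized
    """
    plan, planned, visited = [], set(), set()

    def visit(ref):
        if ref in visited:
            return
        visited.add(ref)
        target, upstreams = graph[ref]
        for up in upstreams:
            visit(up)
        mtime = catalog.get(ref)
        stale = (
            mtime is None                                           # never built
            or any(up in planned for up in upstreams)               # upstream will be rebuilt
            or any(catalog.get(up, 0) > mtime for up in upstreams)  # upstream is newer
        )
        if stale:
            plan.append(target)
            planned.add(ref)

    for ref in requested:
        visit(ref)
    return plan  # post-order traversal keeps upstream targets ahead of their consumers
```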