add some docs

Stuart Axelbrooke 2025-04-25 12:46:29 -07:00
parent 8eba58f328
commit 25480a72e5
3 changed files with 361 additions and 0 deletions

README.md (new file, +6)

@@ -0,0 +1,6 @@
# DataBuild
A Bazel-based data build system.
For important context, check out [the manifesto](./manifesto.md) and [core concepts](./core-concepts.md).

core-concepts.md (new file, +202)

@@ -0,0 +1,202 @@
# Tenets
- No dependency knowledge necessary to materialize data
- Only local dependency knowledge to develop
- Not a framework (what does this mean?)
# Verbs / Phases
- What do engineers describe?
- Jobs - unit of data processing
- Configure executable (rule impl)
- Run executable (generated file contents)
- Graphs - data service
- Plan (resolving partition refs to jobs that build them)
**Jobs**
1. `job.configure(refs) -> Seq[JobConfig]` - Produces a `JobConfig` for producing a set of partitions
2. `job.execute(refs) -> Seq[PartitionManifest]` - Produces a job config and immediately runs the job run executable with it
**Graphs**
1. `graph.plan(refs) -> JobGraph` - Produces a `JobGraph` that fully describes the building of requested partition refs (with preconditions)
2. `graph.execute(refs) -> Seq[PartitionManifest]` - Executes the `JobGraph` eagerly, producing the partition manifests of the underlying finished jobs (emits build events?)
3. `graph.lookup(refs) -> Map[JobLabel, ref]`
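As a rough sketch (not a settled API), these verbs could be expressed as Scala interfaces over the case classes defined below; the trait and type names here are assumptions:
```scala
// Hypothetical interfaces for the verbs above. PartitionRef, JobConfig,
// PartitionManifest, and JobGraph refer to the case classes sketched below;
// JobLabel is assumed to be a string-like identifier.
type JobLabel = String

trait Job {
  // Resolve the requested partition refs into fully parameterized job runs.
  def configure(refs: Seq[PartitionRef]): Seq[JobConfig]
  // Configure and immediately run, returning manifests for the produced partitions.
  def execute(refs: Seq[PartitionRef]): Seq[PartitionManifest]
}

trait Graph {
  // Resolve partition refs into the job configs that build them (with preconditions).
  def plan(refs: Seq[PartitionRef]): JobGraph
  // Execute the planned graph eagerly and surface the resulting manifests.
  def execute(refs: Seq[PartitionRef]): Seq[PartitionManifest]
  // Look up which job is responsible for each requested ref.
  def lookup(refs: Seq[PartitionRef]): Map[JobLabel, PartitionRef]
}
```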
## Job Configuration
The process of fully parameterizing a job run based on the desired partitions. This parameterizes the job run executable.
```
case class JobConfig(
// The partitions that this parameterization produces
outputs: Set[PartitionRef],
inputs: Set[DataDep],
args: Seq[String],
env: Map[String, String],
// Path to executable that will be run
executable: String,
)
case class DataDep(
// E.g. query, materialize
depType: DepType,
ref: PartitionRef,
// Excluded for now
// timeoutSeconds: int,
)
```
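As an illustration only, a date-partitioned job's `configure` might map each requested ref to one config that materializes the matching upstream partition. The dataset names, paths, the `partitionKey` field on `PartitionRef`, and `DepType.Materialize` are assumptions, not part of the spec:
```scala
// Hypothetical configure step for a date-partitioned job: each requested ref
// becomes one JobConfig that materializes the matching upstream partition.
def configure(refs: Seq[PartitionRef]): Seq[JobConfig] =
  refs.map { ref =>
    JobConfig(
      outputs = Set(ref),
      inputs = Set(DataDep(DepType.Materialize, PartitionRef(s"raw_reviews/${ref.partitionKey}"))),
      args = Seq("--date", ref.partitionKey),
      env = Map("OUTPUT_ROOT" -> "s3://example-warehouse"),
      executable = "jobs/reviews/run.sh",
    )
  }
```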
## Job Execution
Jobs produce partition manifests:
```
case class PartitionManifest(
outputs: Set[PartitionRef],
inputs: Set[PartitionManifest],
startTime: Long,
endTime: Long,
config: JobConfig,
)
```
## Graph Planning
The set of job configs that, if run in topo-sorted order, will produce these `outputs`.
```
case class JobGraph(
outputs: Set[PartitionRef],
// Needs the executable too? How do we reference here?
nodes: Set[JobConfig],
)
```
The databuild graph needs to:
- Analyze: use the provided partition refs to determine all involved jobs and their configs in a job graph
- Plan: Determine the literal jobs to execute (skipping/pruning valid cached partitions)
- Compile: compile the graph into an artifact that runs the materialize process <-- sounds like an application consideration?
Perhaps these are different capabilities - e.g. producing a bash script that runs the build process is fundamentally separate from the smarter, stateful service that manages these builds over time, prunes cached builds, etc. And we could make the **partition catalog pluggable**!
```mermaid
flowchart
jobs & outputs --> plan --> job_graph
job_graph --> compile.bash --> build_script
job_graph & partition_catalog --> partition_build_service
```
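One possible shape for that pluggable partition catalog, sketched as an interface (the method names are assumptions; a basic implementation could be a simple presence check, a smarter one a database of manifests):
```scala
// Sketch of a pluggable partition catalog used at plan time.
trait PartitionCatalog {
  // Return the manifest for a partition if it has already been built and is still valid.
  def lookup(ref: PartitionRef): Option[PartitionManifest]
  // Record the manifest produced by a finished job run.
  def record(manifest: PartitionManifest): Unit
}

// Planning could then prune job configs whose outputs are all cached already.
def prune(graph: JobGraph, catalog: PartitionCatalog): JobGraph =
  JobGraph(
    outputs = graph.outputs,
    nodes = graph.nodes.filterNot(cfg => cfg.outputs.forall(ref => catalog.lookup(ref).isDefined)),
  )
```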
# Build Graph / Action Graph
- merkle tree + dataset versions (semver?)
- mask upstream changes that aren't major
- content addressable storage based on action keys that point to merkle tree
- compile to set of build files? (thrash with action graph?)
- catalog of partition manifests + code artifacts enables this
- start with basic presence check
- side effects expected
- partition manifests as output artifact?
- this is orchestration layer concern because `configure` needs to be able to invalidate cache
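A rough sketch of what a content-addressable action key could look like, hashing the job config together with the keys of its input partitions so any upstream change cascades merkle-style (the hashing scheme and the use of `toString` as a stand-in for canonical serialization are assumptions):
```scala
import java.security.MessageDigest

// Hypothetical action key: SHA-256 over the job config plus the (sorted) action
// keys of its input partitions, so a change anywhere upstream yields a new key.
def actionKey(config: JobConfig, inputKeys: Seq[String]): String = {
  val digest = MessageDigest.getInstance("SHA-256")
  digest.update(config.toString.getBytes("UTF-8")) // stand-in for a canonical serialization
  inputKeys.sorted.foreach(k => digest.update(k.getBytes("UTF-8")))
  digest.digest().map("%02x".format(_)).mkString
}
```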
# Assumptions
- Job runs are independent, e.g. if run X is already producing partition A, run Y can safely prune A... during configure?
- Job runs are idempotent (e.g. overwrite)
- A `databuild_graph` can be deployed "unambiguously" (lol)
# Questions
- How does partition overlap work? Can it be pruned? Or throw during configure? This sounds like a very common case
- Answer: this is a responsibility of a live service backed by a datastore. If jobs are in fact independent, then refs requested by another build can be "delegated" to the jobs already building those refs.
- How do we implement job lookup for graphs? Is this a job catalog thing?
- Answer: Yes, job graphs have a `lookup` attr
- How do graphs handle caching? We can't plan a whole graph if job configs contain mtimes, etc (we don't know when the job will finish). So it must detect stale partitions (and downstreams) that need to be rebuilt?
- How do we handle non-materialize relationships outside the graph?
- Answer: Provide build modes, but otherwise awaiting external data is a non-core problem
## Ideas
- Should we have an "optimistic" mode that builds all partitions that can be built?
- Emit an event stream for observability purposes?
## Partition Overlap
For example, we have two partitions we want to build for two different concerns (e.g. pulled by two separate triggers), and both of these partitions depend on some of the same upstreams.
- Do we need managed state, which is the "pending build graph"? Do we need an (internal, at least) data catalog?
- Leave a door open, but don't get nerd sniped
- Make sure the `JobGraph` is merge-able
- How do we merge data deps? (timeout is time based) - Do we need to?
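If `JobGraph` stays a plain set of configs, merging two pending graphs could be as simple as unioning outputs and nodes, with identical `JobConfig`s deduplicating via set semantics (a sketch that ignores data-dep timeout reconciliation):
```scala
// Sketch: merge two pending JobGraphs; duplicate JobConfigs collapse because
// nodes is a Set, so overlapping upstream work is only planned once.
def merge(a: JobGraph, b: JobGraph): JobGraph =
  JobGraph(
    outputs = a.outputs ++ b.outputs,
    nodes = a.nodes ++ b.nodes,
  )
```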
## Data Ver & Invalidation
Sometimes there are minor changes that don't invalidate past produced data, and sometimes there are major changes that do invalidate past partitions. Examples:
- No invalidate: add optional field for new feature not relevant for past data
- Invalidate: whoops, we were calculating the score wrong
This is separate from "version the dataset", since a dataset version represents a structure/meaning, and partitions produced in the past can be incorrect for the intended structure/meaning, and legitimately need to be overwritten. In contrast, new dataset versions allow new intended structure/meaning. This should be an optional concept (e.g. default version is `v0.0.0`).
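For instance, with semver-style dataset versions, a cached partition could stay valid as long as its major version matches the current definition, while minor/patch bumps leave it alone (a sketch; defaulting to `v0.0.0` keeps the concept opt-in):
```scala
// Sketch: a partition built at builtVersion remains valid for currentVersion
// unless the major version changed (i.e. a "whoops, the score was wrong" fix).
case class DatasetVersion(major: Int, minor: Int, patch: Int)

def stillValid(builtVersion: DatasetVersion, currentVersion: DatasetVersion): Boolean =
  builtVersion.major == currentVersion.major
```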
## Why Deployability Matters
This needs to be deployable trivially from day one because:
- We want to "launch jobs" in an un-opinionated way - tell bazel what platform you're building for, then boop the results off to that system, and run it
- Being able to vend executables makes building weakly coupled apps easy (not a framework)
# Demo Development
1. `databuild_job`
1. `databuild_job.cfg`
2. `databuild_job.exec`
3. Tests ✅
4. `databuild_job` (to `cfg` and `exec`) ✅
5. Deployable `databuild_job`
2. `databuild_graph`
1. `databuild_graph.analyze`
2. `databuild_graph` provider ✅
3. `databuild_graph.exec`
4. `databuild_graph.build`
5. `databuild_graph.mermaid`
5. podcast reviews example
6. Reflect (data versioning/caching/partition manifests, partition overlap, ...?)
# Factoring
- Core - graph description, build, analysis, and execution
- Service - job/partition catalog, parallel execution, triggers, exposed service
- Product - Accounts/RBAC, auth, delegates for exec/storage
# Service Sketch
```mermaid
flowchart
codebase
subgraph service
data_service
end
subgraph database
job_catalog
partition_catalog
end
codebase -- deployed_to --> data_service
data_service -- logs build events --> job_catalog
data_service -- queries/records partition manifest --> partition_catalog
```
# Scratch
Implementation:
- Bazel to describe jobs/graphs
- Whatever you want to implement jobs and graphs (need solid interfaces)
```python
databuild_graph(
name = "my_graph",
jobs = [":my_job", ...],
plan = ":my_graph_plan",
)
py_binary(
name = "my_graph_plan",
...
)
databuild_job(
name = "my_job",
configure = ":my_job_configure",
run = ":my_job_binary",
)
scala_binary(
name = "my_job_configure",
...
)
scala_binary(
name = "my_job_binary",
...
)
```

manifesto.md (new file, +153)

@@ -0,0 +1,153 @@
# DataBuild Manifesto
## Why
### Motivation
The modern ML company is a data company whose value derives from ingesting, processing, and refining data. The vast majority of our data providers deliver data in discrete batches, we deliver discrete batches to our customers, and the flow this data takes from consumption to vending is often long and complex. We have the opportunity to define a simple but expressive fabric that connects our data inputs to our data outputs in a principled manner, separating concerns by scale and minimizing our operational overhead. This fabric would remove the human judgment required to produce any data asset, and take responsibility for achieving the end-to-end data production process.
This fabric also allows us to completely define the scope of involved code and data for a given concern, increasing engineering velocity and quality. It also separates concerns by scale: keeping job-internal logic separate from job/dataset composition logic minimizes the number of things that must be considered when changing or authoring new code and data. It is also important to be practical, not dogmatic: this system should thrive in an environment with experimentation and other orchestration strategies, so long as the assumptions below hold true.
These capabilities also help us achieve important eng and business goals:
- Automatically handling updated data
- Correctness checking across data dependencies
- Automatically enforcing data usage policies
- Job reuse & using differently configured but same jobs in parallel
- Lineage tracking
### Assumptions
First, let's state a few key assumptions from current best practices and standards:
- Batches (partitions) are mutually exclusive and collectively exhaustive by domain (dataset).
- A step in the data production process (job) can completely define what partitions it needs and produces.
- Produced partitions are final, conditional on their inputs being final.
- Jobs are idempotent and (practically) produce no side effects aside from output partitions.
### Analogy: Build Systems
One immediate analogy is software [build systems](https://en.wikipedia.org/wiki/Build_automation), like Bazel, which model build targets as nodes in a graph that can be queried to produce desired artifacts correctly. Build systems rely on declared edges between targets to resolve what work needs to be done and in what order, which also allows for sensible caching. These base assumptions let the build system automatically handle orchestration for any build request, meaning incredibly complex build processes (like building whole OS releases), otherwise too complex for humans to orchestrate themselves, become trivially solvable by computers. This also allows rich interaction with the build process, like querying dependencies and reasoning about the build graph. Related: see [dazzle demo](https://docs.google.com/presentation/d/18tL4f_fXCkoaQ7zeSs0AaciryRCR4EmX7hd1wL3Zy8c/edit#slide=id.g2ee07a77300_0_8) discussing the similarities.
The complicating factor for us is that we rely on data for extensive caching. We could, in principle, reprocess all received data (e.g. in an ingest bucket) when we want to produce any output partition for customers, treating the produced data itself as a build target, ephemeral and valid only for the requested partition. However, this is incredibly wasteful, especially when the same partition of intermediate data is read potentially hundreds of times.
This is our motivation for a different kind of node in our build graph: dataset partitions. It's not clear exactly how we should handle these nodes differently, but it's obvious that rebuilding a partition is orders of magnitude more expensive than rebuilding a code build target, so we want to cache these as long as possible. That said, there is obvious opportunity to rebuild partitions when their inputs change, either based on new/updated data arriving, or by definition change.
### Analogy: Dataflow Programming
[Dataflow programming](https://en.wikipedia.org/wiki/Dataflow_programming) is another immediate analogy, where programs are described as a DAG of operations rather than a sequence of operations (à la imperative paradigms). This implicitly allows parallel execution, as operations run when their inputs are available, not when explicitly requested (à la imperative programs).
Zooming out, we can take the build process described above and model it instead as a dataflow program: jobs are operations, partitions are values, and the topo-sorted build action graph is the compiled program. One valuable insight from dataflow programming is the concept of “ports”, like a function parameter or return value. Bazel has a similar concept, where build rules expose parameters they expect targets to set (e.g. `srcs`, `deps`, `data`, etc.) that enable sandboxing and hermeticity. Here, we can extend the port concept to data as well, allowing jobs to specify what partitions they need before they can run, and having that explicitly control job-internal data resolution for a single source of truth.
### Analogy: Workflow Orchestration Systems
### Practicality and Experimentation
What separates this from build systems is the practical consideration that sometimes we just need to notebook up some data to get results to a customer faster, or run an experiment without productionizing the code. This is a deviation from the build system analogy, where you would never dream of reaching in and modifying a `.o` file as part of the build process, but we regularly and intentionally produce data that is not yet encoded in a job. This is a super power of data interfaces, and an ability we very much want to maintain.
### Stateless, Declarative Data Builds
In essence, what we want to achieve is declarative, stateless, repeatable data builds, in a system that is easy to modify and experiment with, and which is easy to use, improve, and verify the correctness of. We want a system that takes responsibility for achieving the end to end journey in producing data outputs.
## How
Here's what we need:
- A set of nouns and organizing semantics that completely describe the data build process
- A strategy for composing jobs & data
### DataBuild Nouns
- **Dataset** - Data meta-containers / Bookshelves of partitions
- **Partition** - Atomic units of data
- **Job** - Transforms read partitions into written partitions
- **Job Target** - Transforms reference(s) to desired partition(s) into the job config and data deps that produce them
- **Data deps** - The set of references to partitions required to run a job and produce the desired output partitions
- **Code build time / data build time / job run time** - Separate build phases / scopes of responsibility. Code build time is when code artifacts are compiled, data build time is when jobs are configured and data dependencies are resolved/materialized, and job run time is when the job is actually run (generally at the end of data build time).
### Composition
An important question here is how we describe composition between jobs and data, and how we practically achieve partitions that require multiple jobs to build.
Imperative composition (e.g. “do A, then do B, then…”) is difficult to maintain and easy to mess up without granular data dep checking. Instead, because job targets define their data deps, we can rely on “pull semantics” to define the composition of jobs and data, à la [Luigi](https://luigi.readthedocs.io/en/stable/). This would be achieved with a “materialize” type of data dependency, where a `materialize(A)` dep declared by B means we ensure A exists before building B, building A if necessary.
This means we can achieve workloads of multiple jobs by invoking materialize data deps at build time. In practice, this means a DAG generated for a given job target would invoke other DAGs before it ran its own job to ensure its upstream data deps were present. To ensure we have observability of the whole flow for a given requested build, we can log a build request ID alongside other build metadata, or rely on orchestration systems that support this innately (like Prefect).
This creates a convenient interface between our upcoming PerfOpt web app (and similar applications), requiring them only to ask for a partition to exist to fulfill a segment refresh or reconfiguration.
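In terms of the core-concepts sketch, such a pull could surface as a `materialize`-typed data dep declared by the downstream job target; the constructor shapes and the `DepType.Materialize` name below are assumptions:
```scala
// Illustrative only: job B declares a materialize dep on a partition of dataset A,
// so building B first ensures that partition exists, building it if necessary.
val bDeps: Set[DataDep] = Set(
  DataDep(depType = DepType.Materialize, ref = PartitionRef("dataset_a/2025-04-25"))
)
```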
### Composition With Non-DataBuild Deps
A key requirement for any successful organizing system is the flexibility to integrate with other systems. This is achieved out of the box thanks to the organizing assumptions above, meaning that DataBuild operations can wait patiently for partitions that are otherwise built by other systems, and expose the ability to “pull” via events/requests/etc. Success in this space means enabling each group of engineers to solve problems as they see fit, with a happy path that covers most work and delivers the efficiency and quality that stem from shared implementation.
---
# Appendix
## Questions
### Are they data build programs?
A tempting analogy is to say that we are compiling a program that builds the data, e.g. compiling the set and order of jobs to produce a desired output. This seems similar to, yet innately different from, the “it's a build system” analogy. This seems related to expectations about isolation, as build systems generally allow and assume a significant amount of isolation, where more general programs have a weaker assumption. This is a key distinction, as we intentionally provide ad-hoc data for experimentation or quick customer turnaround quite commonly, and are likely to continue in the future.
### Explicit coupling across data?
Naive data interfaces allow laundering of data, e.g. we may fix a bug in job A and not realize that we need to rerun jobs B and C because they consume job A's produced data. DataBuild as a concept brings focus to the data dependency relationship, making it explicit which jobs could be rerun after the bug fix. This creates a new question: “how much should we optimize data reconciliation?” We could introduce concepts like minor versioning of datasets or explicitly consumed columns that would allow us to detect more accurately which jobs need to be rerun, but this depends on excessively rerunning jobs being a regrettable outcome. However, if job runs are cheap, the added complexity from these concepts may be more expensive to the business than the wasted compute. Through this lens, we should lean on the simpler, more aggressive recalculation strategy until it's obvious that we need to increase efficiency.
### Is this just `make` for data?
The core problem that makes DataBuild necessary beyond what Make offers is the unique nature of data processing at scale:
1. Data dependencies are more complex than code dependencies
2. Data processing is significantly more expensive than code compilation
3. Data often requires temporal awareness (partitioning by time)
4. Data work involves a mix of production systems and experimentation
5. Data projects require both rigorous pipelines and flexibility for exploration
In essence, DataBuild isn't just Make with partition columns added - it's a reconceptualization of build systems specifically for data processing flows, recognizing the unique properties of datasets versus software artifacts, while still leveraging the power of declarative dependency management.
### What are the key JTBDs?
- Configure and run jobs
- Declare job and dataset targets, and their relationships
- With what? Bazel-like language? Dagster-like?
- Assert correctness of the data build graph
- Including data interfaces?
- Catalog partition liveness and related metadata
- All in postgres?
- Managing data reconciliation and recalculation
- Built off of single event stream?
- Enforcing data usage policies (implementing taints)
- Tracking data lineage, job history
- Data cache management
- Data access? Implement accessors in execution frameworks, e.g. spark/duckdb/standard python/scala/etc?
- Automated QA, alerting, and notifications?? Or establishing ownership of datasets? (or just ship metrics and let users handle elsewhere)
## Components
- Data catalog / partition event log
- Orchestrator / scheduler
- Compiler (description --> build graph)
```mermaid
flowchart LR
compiler([Compiler])
orchestrator([Orchestrator])
data_catalog[(Data Catalog)]
job_log[(Job Log)]
databuild_code --> compiler --> databuild_graph
databuild_graph & data_catalog --> orchestrator --> job_runs --> job_log & data_catalog
```
Notes:
- Data access details & lineage tracking may need to happen via the same component, but are considered an "internal to job runs" concern currently.
### Data Catalog
The data catalog is an essential component that enables cache-aware planning of data builds. It is a mapping of `(partition_ref, mtime) -> partition_manifest`. Access to the described data graph and the data catalog is all that is needed to plan out the net work needed to materialize a new partition. Jobs themselves are responsible for any short-circuiting for work that happens out of band.
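A minimal sketch of that mapping, assuming an in-memory stand-in for whatever datastore backs the catalog:
```scala
// Sketch: the data catalog as a mapping of (partition_ref, mtime) -> partition_manifest.
// A planner asks for the latest manifest for a ref to decide whether net work is needed.
class DataCatalog(entries: Map[(PartitionRef, Long), PartitionManifest]) {
  def latest(ref: PartitionRef): Option[PartitionManifest] =
    entries.collect { case ((r, mtime), manifest) if r == ref => (mtime, manifest) }
      .maxByOption(_._1)
      .map(_._2)
}
```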
## Entity Relationships
Note: Data services are missing below. They could be "libraries" for data, e.g. filling an organizational role for data dependencies at a larger scale.