Tenets
- No dependency knowledge necessary to materialize data
- Only local dependency knowledge to develop
- Not a framework (what does this mean?)
Organizing Philosophy
Many large-scale systems for producing data leave the complexity of true orchestration to the user - even DAG-based systems for implementing dependencies leave the system as a collection of DAGs, requiring engineers to answer the same "why doesn't this data exist?" and "how do I build this data?" questions over and over.
DataBuild takes inspiration from modern data orchestration and build systems to fully internalize this complexity, using the Job concept to localize all decisions of turning upstream data into output data (and making all dependencies explicit); and the Graph concept to handle composition of jobs, answering what sequence of jobs must be run to build a specific partition of data. With Jobs and Graphs, DataBuild takes complete responsibility for the data build process, allowing engineers to consider concerns only local to the jobs relevant to their feature.
Graphs and jobs are defined in bazel, allowing graphs (and their constituent jobs) to be built and deployed trivially.
Nouns / Verbs / Phases
Partitions
DataBuild is fundamentally about composing graphs of jobs and partitions of data, where partitions are the things we want to produce - the nodes between jobs. E.g., in a machine learning pipeline, a partition would be the specific training dataset produced for a given date, model version, etc., which would in turn be read by the model training job, which would itself produce a partition representing the trained model.
Partitions are assumed to be atomic, and final once their input partitions are final, so that it is unambiguous in which cases a partition must be (re)calculated.
Partition References
A partition reference (or partition ref) is a serialized reference to a literal partition of data. This can be anything, so long as it uniquely identifies its partition, but something path-like or URI-like is generally advisable for ergonomics purposes; e.g. /datasets/reviews/v1/date=2025-05-04/country=usa or dal://ranker/features/return_stats/2025/05/04/.
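As a purely illustrative sketch (the ref scheme and dimension names below are assumptions; DataBuild only requires that a ref uniquely identify its partition), a path-like ref can be split into a dataset prefix and key=value dimensions:

# Hypothetical helper: split a path-like partition ref such as
# /datasets/reviews/v1/date=2025-05-04/country=usa into a dataset prefix
# and its key=value dimensions. Illustrative only; DataBuild treats refs
# as opaque identifiers.
def parse_ref(ref: str) -> tuple[str, dict[str, str]]:
    prefix, dims = [], {}
    for part in ref.strip("/").split("/"):
        if "=" in part:
            key, value = part.split("=", 1)
            dims[key] = value
        elif part:
            prefix.append(part)
    return "/".join(prefix), dims

# parse_ref("/datasets/reviews/v1/date=2025-05-04/country=usa")
# -> ("datasets/reviews/v1", {"date": "2025-05-04", "country": "usa"})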
Jobs
flowchart LR
upstream_a[(Upstream Partition A)]
upstream_b[(Upstream Partition B)]
job[Job]
output_c[(Output Partition C)]
output_d[(Output Partition D)]
upstream_a & upstream_b --> job --> output_c & output_d
In DataBuild, Jobs are the atomic unit of data processing, representing the mapping of upstream partitions into output partitions. A job is defined by two capabilities: 1) it exposes an executable that runs the job and produces the desired partitions of data (configured via env vars and args), returning manifests that describe the produced partitions; and 2) it exposes a configuration executable that turns references to desired partitions into a job config that fully configures the job executable to produce those partitions.
Jobs are assumed to be idempotent and independent, such that two jobs configured to produce separate partitions can run without interaction. These assumptions allow jobs to state only their immediate upstream and output data dependencies (the partitions they consume and produce), leaving no ambiguity in a graph about what must be done to produce a desired partition.
Jobs are implemented via the databuild_job bazel rule. An extremely basic job definition can be found in the basic_job example.
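As a sketch of the configuration half of that contract (the argv/JSON shapes and the input-ref mapping below are assumptions for illustration, not the rule's actual interface), a job's config executable might read the desired partition refs and print a config that fully determines the exec invocation:

# Hypothetical config executable for a job. Given the refs of the partitions
# this job should produce, emit a config that fully determines the exec
# invocation. The argv/JSON shapes and the input-ref mapping are illustrative
# assumptions, not DataBuild's actual interface.
import json
import sys

def main() -> None:
    desired = sys.argv[1:]  # partition refs this job should produce
    config = {
        # Explicit upstream data deps (purely illustrative mapping).
        "inputs": ["/datasets/raw_reviews" + ref.removeprefix("/datasets/reviews/v1")
                   for ref in desired],
        # Partitions the exec invocation will write.
        "outputs": desired,
        # Env vars and args that configure the exec executable.
        "env": {"OUTPUT_FORMAT": "parquet"},
        "args": ["--refs", *desired],
    }
    json.dump(config, sys.stdout)

if __name__ == "__main__":
    main()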
Graphs
A Graph is the composition of jobs and partitions via their data dependencies. Graphs answer "what partitions does a job require to produce its outputs?" and "what job must be run to produce a given partition?" Defining a graph requires only the list of involved jobs and a lookup executable that transforms desired partitions into the job(s) that produce them.
Graphs expose two entrypoints: graph.analyze, which produces the literal JobGraph specifying the structure of the build graph to be executed to build a specific set of partitions (enabling visualization, planning, precondition checking, etc.); and graph.build, which runs the build process for a set of requested partitions (relying on graph.analyze to plan). Other entrypoints are described in the graph README.
Graphs are implemented via the databuild_graph bazel rule. A basic graph definition can be found in the basic_graph example.
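For illustration, a lookup executable could be as simple as a prefix match from partition refs to the jobs that produce them (the routing table, job labels, and output shape here are assumptions):

# Hypothetical lookup executable for a graph: map each requested partition ref
# to the label of the job that produces it. The routing rule, job labels, and
# output shape are assumptions for illustration.
import json
import sys

PRODUCERS = {
    "/datasets/reviews/": "//jobs:ingest_reviews",
    "/datasets/review_features/": "//jobs:build_review_features",
    "/models/ranker/": "//jobs:train_ranker",
}

def lookup(ref: str) -> str:
    for prefix, job in PRODUCERS.items():
        if ref.startswith(prefix):
            return job
    raise SystemExit(f"no job produces {ref}")

def main() -> None:
    json.dump({ref: lookup(ref) for ref in sys.argv[1:]}, sys.stdout)

if __name__ == "__main__":
    main()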
Implementing a Graph
To make a fully described graph, engineers must define:
- Implementing the exec and config targets for each databuild_job
- A databuild_graph (referencing a lookup binary to resolve jobs)
And that's it!
Catalog
A catalog is a database of partition manifests and past/in-progress graph builds and job runs. When run with a catalog, graphs can:
- Skip jobs whose outputs are already present and up to date.
- Safely run data builds in parallel, delegating overlapping partition requests to already scheduled/running jobs.
TODO - plan and implement this functionality.
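To make the skipping behavior concrete, here is a minimal sketch of the presence check it implies (the catalog schema and the code_version column are assumptions; this is part of the TODO above):

# Sketch of the "skip if already built" check a catalog could support. The
# sqlite schema and the code_version column are assumptions for illustration.
import sqlite3

def is_built(catalog: sqlite3.Connection, ref: str, code_version: str) -> bool:
    """True if a manifest already exists for this partition ref at this code version."""
    row = catalog.execute(
        "SELECT 1 FROM partition_manifests WHERE ref = ? AND code_version = ?",
        (ref, code_version),
    ).fetchone()
    return row is not None

def refs_to_build(catalog: sqlite3.Connection, requested: list[str], code_version: str) -> list[str]:
    """Filter a requested partition set down to the refs that still need a job run."""
    return [ref for ref in requested if not is_built(catalog, ref, code_version)]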
Appendix
Future
- Partition versions - e.g. how to not invalidate prior produced data with every code change?
- merkle tree + semver as implementation?
- mask upstream changes that aren't major
- content addressable storage based on action keys that point to merkle tree
- compile to set of build files? (thrash with action graph?)
- catalog of partition manifests + code artifacts enables this
- start with basic presence check?
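One possible shape for the action keys mentioned above, sketched under the merkle-tree idea: hash the producing job's code digest together with the action keys of its upstream partitions, so changes propagate downstream through the graph (the hash inputs are assumptions, not a committed design).

# Sketch of a merkle-style action key. Purely illustrative; the inputs to the
# hash (code digest, upstream keys) are assumptions about what would matter.
import hashlib

def action_key(job_code_digest: str, upstream_keys: list[str]) -> str:
    h = hashlib.sha256()
    h.update(job_code_digest.encode())
    for key in sorted(upstream_keys):  # sorted so the key is order-independent
        h.update(key.encode())
    return h.hexdigest()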
Questions
- How does partition overlap work? Can it be pruned? Or throw during configure? This sounds like a very common case
- Answer: this is a responsibility of a live service backed by a datastore. If jobs are in fact independent, then refs requested by another build can be "delegated" to the jobs already scheduled or running for those refs.
- How do we implement job lookup for graphs? Is this a job catalog thing?
- Answer: Yes, job graphs have a lookup attr
- How do graphs handle caching? We can't plan a whole graph if job configs contain mtimes, etc (we don't know when the job will finish). So it must detect stale partitions (and downstreams) that need to be rebuilt?
- How do we handle non-materialized relationships outside the graph?
- Answer: Provide build modes, but otherwise awaiting external data is a non-core problem
Ideas
- Should we have an "optimistic" mode that builds all partitions that can be built?
- Emit an event stream for observability purposes?
Partition Overlap
For example, we have two partitions we want to build for two different concerns, e.g. pulled by two separate triggers, and both of these partitions depend on some of the same upstreams.
- Do we need managed state, which is the "pending build graph"? Do we need an (internal, at least) data catalog?
- Leave a door open, but don't get nerd sniped
- Make sure the JobGraph is merge-able
- How do we merge data deps? (timeout is time based)
- Do we need to?
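A sketch of the merge this implies, keyed on output partition refs so overlapping requests collapse onto the same job node (representing a JobGraph as a mapping of output ref to job node is an assumption for illustration):

# Sketch of merging two pending JobGraphs so overlapping partition requests are
# delegated to already-scheduled jobs. The JobGraph representation (output
# partition ref -> job node) is an illustrative assumption.
def merge_job_graphs(a: dict[str, dict], b: dict[str, dict]) -> dict[str, dict]:
    merged = dict(a)
    for ref, node in b.items():
        # If the partition is already covered by a scheduled node, delegate to
        # it instead of scheduling a duplicate job.
        if ref not in merged:
            merged[ref] = node
    return merged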
Data Ver & Invalidation
Sometimes there are minor changes that don't invalidate past produced data, and sometimes there are major changes that do invalidate past partitions. Examples:
- No invalidate: add optional field for new feature not relevant for past data
- Invalidate: whoops, we were calculating the score wrong
This is separate from "version the dataset": a dataset version represents a structure/meaning, while partitions produced in the past can be incorrect for that intended structure/meaning and legitimately need to be overwritten. In contrast, a new dataset version introduces a new intended structure/meaning. This should be an optional concept (e.g. the default version is v0.0.0).
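A sketch of the masking rule this suggests, treating only major version bumps as invalidating previously produced partitions (the v-prefixed semver format and the major-only rule are assumptions):

# Sketch of semver-based invalidation: only a major version bump invalidates
# partitions produced under an older version. The v-prefixed format and the
# major-only rule are illustrative assumptions.
def invalidates(produced_version: str, current_version: str) -> bool:
    produced_major = int(produced_version.lstrip("v").split(".")[0])
    current_major = int(current_version.lstrip("v").split(".")[0])
    return current_major > produced_major

# invalidates("v0.3.1", "v0.4.0") -> False  (minor change, keep old partitions)
# invalidates("v0.3.1", "v1.0.0") -> True   (major change, rebuild)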
Why Deployability Matters
This needs to be deployable trivially from day one because:
- We want to "launch jobs" in an un-opinionated way - tell bazel what platform you're building for, then boop the results off to that system, and run it
- Being able to vend executables makes building weakly coupled apps easy (not a framework)
Demo Development
- databuild_job ✅
- databuild_job.cfg ✅
- databuild_job.exec ✅
- Tests ✅
- databuild_job (to cfg and exec) ✅
- Deployable databuild_job ✅
- databuild_graph ✅
- databuild_graph.analyze ✅
- databuild_graph provider ✅
- databuild_graph.exec ✅
- databuild_graph.build ✅
- databuild_graph.mermaid ✅
- podcast reviews example
- Reflect (data versioning/caching/partition manifests, partition overlap, ...?)
- Reflect (data versioning/caching/partition manifests, partition overlap, ...?)
Factoring
- Core - graph description, build, analysis, and execution
- Service - job/partition catalog, parallel execution, triggers, exposed service
- Product - Accounts/RBAC, auth, delegates for exec/storage
Service Sketch
flowchart
codebase
subgraph service
data_service
end
subgraph database
job_catalog
partition_catalog
end
codebase -- deployed_to --> data_service
data_service -- logs build events --> job_catalog
data_service -- queries/records partition manifest --> partition_catalog
Scratch
Implementation:
- Bazel to describe jobs/graphs
- Whatever you want to implement jobs and graphs (need solid interfaces)
databuild_graph(
    name = "my_graph",
    jobs = [":my_job", ...],
    plan = ":my_graph_plan",
)
py_binary(
    name = "my_graph_plan",
    ...
)
databuild_job(
    name = "my_job",
    configure = ":my_job_configure",
    run = ":my_job_binary",
)
scala_binary(
    name = "my_job_configure",
    ...
)
scala_binary(
    name = "my_job_binary",
    ...
)