Tenets
- No dependency knowledge necessary to materialize data
- Only local dependency knowledge to develop
- Not a framework (what does this mean?)
Organizing Philosophy
Many large-scale systems for producing data leave the complexity of true orchestration to the user - even DAG-based systems for implementing dependencies leave the system as a collection of DAGs, requiring engineers to answer the same "why doesn't this data exist?" and "how do I build this data?" questions over and over.
DataBuild takes inspiration from modern data orchestration and build systems to fully internalize this complexity, using the Job concept to localize all decisions of turning upstream data into output data (and making all dependencies explicit); and the Graph concept to handle composition of jobs, answering what sequence of jobs must be run to build a specific partition of data. With Jobs and Graphs, DataBuild takes complete responsibility for the data build process, allowing engineers to consider concerns only local to the jobs relevant to their feature.
Graphs and jobs are defined in bazel, allowing graphs (and their constituent jobs) to be built and deployed trivially.
Nouns / Verbs / Phases
Partitions
DataBuild is fundamentally about composing graphs of jobs and partitions of data, where partitions are the things we want to produce - the nodes between jobs. E.g., in a machine learning pipeline, a partition would be the specific training dataset produced for a given date, model version, etc., which would in turn be read by the model training job, which would itself produce a partition representing the trained model.
Partitions are assumed to be atomic, and final once their input partitions are final, so that it is unambiguous in which cases a partition must be (re)calculated.
Partition References
A partition reference (or partition ref) is a serialized reference to a literal partition of data. This can be anything, so long as it uniquely identifies its partition, but something path-like or URI-like is generally advisable for ergonomics purposes; e.g. /datasets/reviews/v1/date=2025-05-04/country=usa or dal://ranker/features/return_stats/2025/05/04/.
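As a purely illustrative sketch (the ref scheme and dimension names below are assumptions; DataBuild only requires that a ref uniquely identify its partition), a path-like ref can be split into a dataset prefix and key=value dimensions:

# Hypothetical helper: split a path-like partition ref such as
# /datasets/reviews/v1/date=2025-05-04/country=usa into a dataset prefix
# and its key=value dimensions. Illustrative only; DataBuild treats refs
# as opaque identifiers.
def parse_ref(ref: str) -> tuple[str, dict[str, str]]:
    prefix, dims = [], {}
    for part in ref.strip("/").split("/"):
        if "=" in part:
            key, value = part.split("=", 1)
            dims[key] = value
        elif part:
            prefix.append(part)
    return "/".join(prefix), dims

# parse_ref("/datasets/reviews/v1/date=2025-05-04/country=usa")
# -> ("datasets/reviews/v1", {"date": "2025-05-04", "country": "usa"})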
Jobs
flowchart LR
upstream_a[(Upstream Partition A)]
upstream_b[(Upstream Partition B)]
job[Job]
output_c[(Output Partition C)]
output_d[(Output Partition D)]
upstream_a & upstream_b --> job --> output_c & output_d
In DataBuild, Jobs are the atomic unit of data processing, representing the mapping of upstream partitions into output partitions. A job is defined by two capabilities: 1) it exposes an executable that runs the job and produces the desired partitions of data (configured via env vars and args), returning manifests that describe the produced partitions; and 2) it exposes a configuration executable that turns references to desired partitions into a job config that fully configures the job executable to produce those partitions.
Jobs are assumed to be idempotent and independent, such that two jobs configured to produce separate partitions can run without interaction. These assumptions allow jobs to state only their immediate upstream and output data dependencies (the partitions they consume and produce), leaving no ambiguity in a graph about what must be done to produce a desired partition.
Jobs are implemented via the databuild_job bazel rule. An extremely basic job definition can be found in the basic_job example.
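As a sketch of the configuration half of that contract (the argv/JSON shapes and the input-ref mapping below are assumptions for illustration, not the rule's actual interface), a job's config executable might read the desired partition refs and print a config that fully determines the exec invocation:

# Hypothetical config executable for a job. Given the refs of the partitions
# this job should produce, emit a config that fully determines the exec
# invocation. The argv/JSON shapes and the input-ref mapping are illustrative
# assumptions, not DataBuild's actual interface.
import json
import sys

def main() -> None:
    desired = sys.argv[1:]  # partition refs this job should produce
    config = {
        # Explicit upstream data deps (purely illustrative mapping).
        "inputs": ["/datasets/raw_reviews" + ref.removeprefix("/datasets/reviews/v1")
                   for ref in desired],
        # Partitions the exec invocation will write.
        "outputs": desired,
        # Env vars and args that configure the exec executable.
        "env": {"OUTPUT_FORMAT": "parquet"},
        "args": ["--refs", *desired],
    }
    json.dump(config, sys.stdout)

if __name__ == "__main__":
    main()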
Graphs
A Graph is the composition of jobs and partitions via their data dependencies. Graphs answer "what partitions does a job require to produce its outputs?" and "what job must be run to produce a given partition?" Defining a graph requires only the list of involved jobs and a lookup executable that transforms desired partitions into the job(s) that produce them.
Graphs expose two entrypoints: graph.analyze, which produces the literal JobGraph specifying the structure of the build graph to be executed to build a specific set of partitions (enabling visualization, planning, precondition checking, etc.); and graph.build, which runs the build process for a set of requested partitions (relying on graph.analyze to plan). Other entrypoints are described in the graph README.
Graphs are implemented via the databuild_graph bazel rule. A basic graph definition can be found in the basic_graph example.
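For illustration, a lookup executable could be as simple as a prefix match from partition refs to the jobs that produce them (the routing table, job labels, and output shape here are assumptions):

# Hypothetical lookup executable for a graph: map each requested partition ref
# to the label of the job that produces it. The routing rule, job labels, and
# output shape are assumptions for illustration.
import json
import sys

PRODUCERS = {
    "/datasets/reviews/": "//jobs:ingest_reviews",
    "/datasets/review_features/": "//jobs:build_review_features",
    "/models/ranker/": "//jobs:train_ranker",
}

def lookup(ref: str) -> str:
    for prefix, job in PRODUCERS.items():
        if ref.startswith(prefix):
            return job
    raise SystemExit(f"no job produces {ref}")

def main() -> None:
    json.dump({ref: lookup(ref) for ref in sys.argv[1:]}, sys.stdout)

if __name__ == "__main__":
    main()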
Implementing a Graph
To make a fully described graph, engineers must define:
- Implementing the exec and config targets for each databuild_job
- A databuild_graph (referencing a lookup binary to resolve jobs)
And that's it!
Catalog
A catalog is a database of partition manifests and past/in-progress graph builds and job runs. When run with a catalog, graphs can:
- Skip jobs whose outputs are already present and up to date.
- Safely run data builds in parallel, delegating overlapping partition requests to already scheduled/running jobs.
TODO - plan and implement this functionality.
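To make the skipping behavior concrete, here is a minimal sketch of the presence check it implies (the catalog schema and the code_version column are assumptions; this is part of the TODO above):

# Sketch of the "skip if already built" check a catalog could support. The
# sqlite schema and the code_version column are assumptions for illustration.
import sqlite3

def is_built(catalog: sqlite3.Connection, ref: str, code_version: str) -> bool:
    """True if a manifest already exists for this partition ref at this code version."""
    row = catalog.execute(
        "SELECT 1 FROM partition_manifests WHERE ref = ? AND code_version = ?",
        (ref, code_version),
    ).fetchone()
    return row is not None

def refs_to_build(catalog: sqlite3.Connection, requested: list[str], code_version: str) -> list[str]:
    """Filter a requested partition set down to the refs that still need a job run."""
    return [ref for ref in requested if not is_built(catalog, ref, code_version)]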
Appendix
Future
- Partition versions - e.g. how to not invalidate prior produced data with every code change?
- merkle tree + semver as implementation?
- mask upstream changes that aren't major
- content addressable storage based on action keys that point to merkle tree
- compile to set of build files? (thrash with action graph?)
- catalog of partition manifests + code artifacts enables this
- start with basic presence check?
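One possible shape for the action keys mentioned above, sketched under the merkle-tree idea: hash the producing job's code digest together with the action keys of its upstream partitions, so changes propagate downstream through the graph (the hash inputs are assumptions, not a committed design).

# Sketch of a merkle-style action key. Purely illustrative; the inputs to the
# hash (code digest, upstream keys) are assumptions about what would matter.
import hashlib

def action_key(job_code_digest: str, upstream_keys: list[str]) -> str:
    h = hashlib.sha256()
    h.update(job_code_digest.encode())
    for key in sorted(upstream_keys):  # sorted so the key is order-independent
        h.update(key.encode())
    return h.hexdigest()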
Questions
- How does partition overlap work? Can it be pruned? Or throw during configure? This sounds like a very common case
- Answer: this is a responsibility of a live service backed by a datastore. If jobs are in fact independent, then refs requested by another build can be "delegated" to the jobs already scheduled or running for those refs.
- How do we implement job lookup for graphs? Is this a job catalog thing?
- Answer: Yes, job graphs have a lookup attr
- How do graphs handle caching? We can't plan a whole graph if job configs contain mtimes, etc (we don't know when the job will finish). So it must detect stale partitions (and downstreams) that need to be rebuilt?
- How do we handle non-materialized relationships outside the graph?
- Answer: Provide build modes, but otherwise awaiting external data is a non-core problem
Ideas
- Should we have an "optimistic" mode that builds all partitions that can be built?
- Emit an event stream for observability purposes?
Partition Overlap
For example, we have two partitions we want to build for two different concerns, e.g. pulled by two separate triggers, and both of these partitions depend on some of the same upstreams.
- Do we need managed state, which is the "pending build graph"? Do we need an (internal, at least) data catalog?
- Leave a door open, but don't get nerd sniped
- Make sure the JobGraph is merge-able
- How do we merge data deps? (timeout is time based)
- Do we need to?
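A sketch of the merge this implies, keyed on output partition refs so overlapping requests collapse onto the same job node (representing a JobGraph as a mapping of output ref to job node is an assumption for illustration):

# Sketch of merging two pending JobGraphs so overlapping partition requests are
# delegated to already-scheduled jobs. The JobGraph representation (output
# partition ref -> job node) is an illustrative assumption.
def merge_job_graphs(a: dict[str, dict], b: dict[str, dict]) -> dict[str, dict]:
    merged = dict(a)
    for ref, node in b.items():
        # If the partition is already covered by a scheduled node, delegate to
        # it instead of scheduling a duplicate job.
        if ref not in merged:
            merged[ref] = node
    return merged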
Data Ver & Invalidation
Sometimes there are minor changes that don't invalidate past produced data, and sometimes there are major changes that do invalidate past partitions. Examples:
- No invalidate: add optional field for new feature not relevant for past data
- Invalidate: whoops, we were calculating the score wrong
This is separate from "version the dataset": a dataset version represents a structure/meaning, while partitions produced in the past can be incorrect for that intended structure/meaning and legitimately need to be overwritten. In contrast, a new dataset version introduces a new intended structure/meaning. This should be an optional concept (e.g. the default version is v0.0.0).
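A sketch of the masking rule this suggests, treating only major version bumps as invalidating previously produced partitions (the v-prefixed semver format and the major-only rule are assumptions):

# Sketch of semver-based invalidation: only a major version bump invalidates
# partitions produced under an older version. The v-prefixed format and the
# major-only rule are illustrative assumptions.
def invalidates(produced_version: str, current_version: str) -> bool:
    produced_major = int(produced_version.lstrip("v").split(".")[0])
    current_major = int(current_version.lstrip("v").split(".")[0])
    return current_major > produced_major

# invalidates("v0.3.1", "v0.4.0") -> False  (minor change, keep old partitions)
# invalidates("v0.3.1", "v1.0.0") -> True   (major change, rebuild)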
Why Deployability Matters
This needs to be deployable trivially from day one because:
- We want to "launch jobs" in an un-opinionated way - tell bazel what platform you're building for, then boop the results off to that system, and run it
- Being able to vend executables makes building weakly coupled apps easy (not a framework)
Demo Development
- databuild_job ✅
- databuild_job.cfg ✅
- databuild_job.exec ✅
- Tests ✅
- databuild_job (to cfg and exec) ✅
- Deployable databuild_job ✅
- databuild_graph ✅
- databuild_graph.analyze ✅
- databuild_graph provider ✅
- databuild_graph.exec ✅
- databuild_graph.build ✅
- databuild_graph.mermaid ✅
- podcast reviews example
- Reflect (data versioning/caching/partition manifests, partition overlap, ...?)
- Reflect (data versioning/caching/partition manifests, partition overlap, ...?)
Factoring
- Core - graph description, build, analysis, and execution
- Service - job/partition catalog, parallel execution, triggers, exposed service
- Product - Accounts/RBAC, auth, delegates for exec/storage
Service Sketch
flowchart
codebase
subgraph service
data_service
end
subgraph database
job_catalog
partition_catalog
end
codebase -- deployed_to --> data_service
data_service -- logs build events --> job_catalog
data_service -- queries/records partition manifest --> partition_catalog
Scratch
Implementation:
- Bazel to describe jobs/graphs
- Whatever you want to implement jobs and graphs (need solid interfaces)
databuild_graph(
    name = "my_graph",
    jobs = [":my_job", ...],
    plan = ":my_graph_plan",
)
py_binary(
    name = "my_graph_plan",
    ...
)
databuild_job(
    name = "my_job",
    configure = ":my_job_configure",
    run = ":my_job_binary",
)
scala_binary(
    name = "my_job_configure",
    ...
)
scala_binary(
    name = "my_job_binary",
    ...
)