databuild/core-concepts.md

Tenets

  • No dependency knowledge necessary to materialize data
  • Only local dependency knowledge to develop
  • Not a framework (what does this mean?)

Verbs / Phases

  • What do engineers describe?
    • Jobs - unit of data processing
      • Configure executable (rule impl)
      • Run executable (generated file contents)
    • Graphs - data service
      • Plan (resolving partition refs to jobs that build them)

Jobs

  1. job.configure(refs) -> Seq[JobConfig] - Produces the job configs needed to build the requested set of partitions
  2. job.execute(refs) -> Seq[PartitionManifest] - Configures the job and immediately runs its run executable with the resulting configs
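
A minimal sketch of what this interface could look like (the trait name and exact signatures are assumptions, reusing the case classes defined later in this doc):

trait Job {
	// Resolve the requested partition refs into fully parameterized job configs
	def configure(refs: Set[PartitionRef]): Seq[JobConfig]
	// Configure, then immediately run the run executable; one manifest per job run
	def execute(refs: Set[PartitionRef]): Seq[PartitionManifest]
}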

Graphs

  1. graph.plan(refs) -> JobGraph - Produces a JobGraph that fully describes the building of requested partition refs (with preconditions)
  2. graph.execute(refs) -> Seq[PartitionManifest] - Executes the JobGraph eagerly, produces the partition manifests of the underlying finished jobs (emits build events?)
  3. graph.lookup(refs) -> Map[JobLabel, PartitionRef] - Maps each involved job (by label) to the ref it covers for this request
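
A corresponding sketch for graphs (again a sketch; JobLabel and the exact lookup shape are taken from the signatures above):

trait Graph {
	// Produce a JobGraph fully describing the build of the requested refs
	def plan(refs: Set[PartitionRef]): JobGraph
	// Execute the plan eagerly and return the finished jobs' manifests
	def execute(refs: Set[PartitionRef]): Seq[PartitionManifest]
	// Map each involved job (by label) to the ref it covers for this request
	def lookup(refs: Set[PartitionRef]): Map[JobLabel, PartitionRef]
}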

Job Configuration

The process of fully parameterizing a job run based on the desired partitions, i.e. producing the arguments and environment for the job run executable.

case class JobConfig(
	// The partitions that this parameterization produces
	outputs: Set[PartitionRef],
	// Upstream partitions this run depends on
	inputs: Set[DataDep],
	// Arguments and environment passed to the run executable
	args: Seq[String],
	env: Map[String, String],
	// Path to executable that will be run
	executable: String,
)

case class DataDep(
	// E.g. query, materialize
	depType: DepType,
	ref: PartitionRef,
	// Excluded for now
	// timeoutSeconds: Int,
)
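
To make the parameterization concrete, here is a hypothetical configure for a daily-aggregation job. PartitionRef's fields, DepType's variants, the dataset names, and the executable path are all made up for illustration:

// Assumed shapes, for illustration only
case class PartitionRef(dataset: String, date: String)
sealed trait DepType
object DepType {
	case object Query extends DepType
	case object Materialize extends DepType
}

object DailyAggJob {
	def configure(refs: Set[PartitionRef]): Seq[JobConfig] =
		refs.toSeq.map { ref =>
			JobConfig(
				// This parameterization produces exactly the requested partition
				outputs = Set(ref),
				// ...and reads the same day's raw events (hypothetical upstream)
				inputs = Set(DataDep(DepType.Materialize, PartitionRef("events_raw", ref.date))),
				args = Seq("--date", ref.date),
				env = Map("OUTPUT_DATASET" -> ref.dataset),
				executable = "bin/daily_agg",
			)
		}
}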

Job Execution

Jobs produce partition manifests:

case class PartitionManifest(
	outputs: Set[PartitionRef],
	// Manifests of the upstream partitions this run consumed
	inputs: Set[PartitionManifest],
	startTime: Long,
	endTime: Long,
	config: JobConfig,
)
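
A rough sketch of the execution step, assuming a simple local process runner (the process handling and failure behavior here are placeholders, not a proposal):

import scala.sys.process._

def runJob(cfg: JobConfig, upstream: Set[PartitionManifest]): PartitionManifest = {
	val start = System.currentTimeMillis()
	// Run the configured executable with its args and environment
	val exit = Process(cfg.executable +: cfg.args, None, cfg.env.toSeq: _*).!
	require(exit == 0, s"job failed with exit code $exit: ${cfg.executable}")
	PartitionManifest(
		outputs = cfg.outputs,
		inputs = upstream,
		startTime = start,
		endTime = System.currentTimeMillis(),
		config = cfg,
	)
}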

Graph Planning

A JobGraph is the set of job configs that, if run in topo-sorted order, will produce the requested outputs.

case class JobGraph(
	outputs: Set[PartitionRef],
	// Needs the executable too? How do we reference here?
	nodes: Set[JobConfig],
)
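
For example, a naive in-process executor can repeatedly run every node whose in-graph inputs are already built, which is equivalent to walking the nodes in topo-sorted order (a sketch only; parallelism, caching, and error handling omitted, and runJob is the execution sketch above):

def executeGraph(graph: JobGraph, upstream: Set[PartitionManifest]): Seq[PartitionManifest] = {
	// Refs produced inside this graph, as opposed to external/precondition refs
	val internal = graph.nodes.flatMap(_.outputs)
	var built = upstream.flatMap(_.outputs)
	var pending = graph.nodes
	var manifests = Vector.empty[PartitionManifest]
	while (pending.nonEmpty) {
		// A node is ready when each input is either already built or external
		val ready = pending.filter(_.inputs.forall(d => built(d.ref) || !internal(d.ref)))
		require(ready.nonEmpty, "cycle or unsatisfiable dependency in JobGraph")
		for (cfg <- ready) {
			val manifest = runJob(cfg, upstream)
			manifests :+= manifest
			built ++= manifest.outputs
		}
		pending --= ready
	}
	manifests
}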

The databuild graph needs to:

  • Analyze: use the provided partition refs to determine all involved jobs and their configs in a job graph
  • Plan: Determine the literal jobs to execute (skipping/pruning valid cached partitions)
  • Compile: compile the graph into an artifact that runs the materialize process (though this sounds like an application-level concern?)

Perhaps these are different capabilities - e.g. producing a bash script that runs the build process is fundamentally separate from the smarter, stateful thing that manages these builds over time, pruning cached builds, etc. And we could make the partition catalog pluggable!

flowchart
	jobs & outputs --> plan --> job_graph
	job_graph --> compile.bash --> build_script
	job_graph & partition_catalog --> partition_build_service

Build Graph / Action Graph

  • merkle tree + dataset versions (semver?)
    • mask upstream changes that aren't major
    • content addressable storage based on action keys that point to merkle tree
    • compile to set of build files? (thrash with action graph?)
    • catalog of partition manifests + code artifacts enables this
    • start with a basic presence check (sketched after this list)
  • side effects expected
  • partition manifests as output artifact?
    • this is orchestration layer concern because configure needs to be able to invalidate cache
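
A sketch of that basic presence check, assuming the action key is just a content hash of the JobConfig and that the partition catalog can be queried per ref (the catalog interface here is hypothetical):

import java.security.MessageDigest

// Hypothetical catalog of recorded partition manifests
trait PartitionCatalog {
	def latestManifest(ref: PartitionRef): Option[PartitionManifest]
}

// Action key: a content hash of the fully parameterized config
// (hashing toString is a placeholder for a real canonical serialization)
def actionKey(cfg: JobConfig): String =
	MessageDigest.getInstance("SHA-256")
		.digest(cfg.toString.getBytes("UTF-8"))
		.map("%02x".format(_))
		.mkString

// A node can be pruned if every output already has a manifest recorded
// under the same action key
def canPrune(cfg: JobConfig, catalog: PartitionCatalog): Boolean =
	cfg.outputs.forall { ref =>
		catalog.latestManifest(ref).exists(m => actionKey(m.config) == actionKey(cfg))
	}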

Assumptions

  • Job runs are independent, e.g. if run X is already producing partition A, run Y can safely prune A... during configure?
  • Job runs are idempotent (e.g. overwrite)
  • A databuild_graph can be deployed "unambiguously" (lol)

Questions

  • How does partition overlap work? Can it be pruned? Or throw during configure? This sounds like a very common case
    • Answer: this is a responsibility of a live service backed by a datastore. If jobs are in fact independent, then refs requested by another build can be "delegated" to the jobs already building those refs.
  • How do we implement job lookup for graphs? Is this a job catalog thing?
    • Answer: Yes, job graphs have a lookup attr
  • How do graphs handle caching? We can't plan a whole graph up front if job configs contain mtimes, etc. (we don't know when upstream jobs will finish), so planning must detect stale partitions (and their downstreams) that need to be rebuilt?
  • How do we handle non-materialize relationships outside the graph?
    • Answer: Provide build modes, but otherwise awaiting external data is a non-core problem

Ideas

  • Should we have an "optimistic" mode that builds all partitions that can be built?
  • Emit an event stream for observability purposes?

Partition Overlap

For example, we have two partitions we want to build for two different concerns (e.g. pulled by two separate triggers), and both of these partitions depend on some of the same upstreams.

  • Do we need managed state, which is the "pending build graph"? Do we need an (internal, at least) data catalog?
  • Leave a door open, but don't get nerd sniped
  • Make sure the JobGraph is merge-able (a merge sketch follows this list)
  • How do we merge data deps (the timeout is time-based)? Do we need to?
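
A merge sketch, assuming that two configs claiming the same outputs are the same parameterization (which is exactly the property overlap handling has to guarantee):

def merge(a: JobGraph, b: JobGraph): JobGraph = {
	// Deduplicate nodes by the partitions they produce
	val nodes = (a.nodes ++ b.nodes)
		.groupBy(_.outputs)
		.map { case (_, configs) => configs.head }
		.toSet
	JobGraph(outputs = a.outputs ++ b.outputs, nodes = nodes)
}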

Data Ver & Invalidation

Sometimes there are minor changes that don't invalidate past produced data, and sometimes there are major changes that do invalidate past partitions. Examples:

  • No invalidate: add optional field for new feature not relevant for past data
  • Invalidate: whoops, we were calculating the score wrong

This is separate from "version the dataset": a dataset version represents an intended structure/meaning, and partitions produced in the past can be incorrect for that intended structure/meaning and legitimately need to be overwritten. In contrast, a new dataset version introduces a new intended structure/meaning. This should be an optional concept (e.g. the default version is v0.0.0).
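
Under a semver-style scheme this could reduce to a one-line check: a past partition is only invalidated when the major version has moved beyond what its manifest recorded (the version type and where it is stored are assumptions):

// Hypothetical dataset version; default is v0.0.0 when unspecified
case class DataVersion(major: Int, minor: Int, patch: Int)

// A previously produced partition is stale only on a major version bump
def invalidatedBy(produced: DataVersion, current: DataVersion): Boolean =
	current.major > produced.major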

Why Deployability Matters

This needs to be deployable trivially from day one because:

  • We want to "launch jobs" in an un-opinionated way - tell Bazel what platform you're building for, then boop the results off to that system and run them
  • Being able to vend executables makes building weakly coupled apps easy (not a framework)

Demo Development

  1. databuild_job
    1. databuild_job.cfg
    2. databuild_job.exec
    3. Tests
    4. databuild_job (to cfg and exec)
    5. Deployable databuild_job
  2. databuild_graph
    1. databuild_graph.analyze
    2. databuild_graph provider
  3. databuild_graph.exec
  4. databuild_graph.build
  5. databuild_graph.mermaid
  6. podcast reviews example
  7. Reflect (data versioning/caching/partition manifests, partition overlap, ...?)

Factoring

  • Core - graph description, build, analysis, and execution
  • Service - job/partition catalog, parallel execution, triggers, exposed service
  • Product - Accounts/RBAC, auth, delegates for exec/storage

Service Sketch

flowchart
	codebase
	subgraph service
		data_service
	end
	subgraph database
		job_catalog
		partition_catalog
	end
	codebase -- deployed_to --> data_service
	data_service -- logs build events --> job_catalog
	data_service -- queries/records partition manifest --> partition_catalog

Scratch

Implementation:

  • Bazel to describe jobs/graphs
  • Whatever language you want to implement jobs and graphs (needs solid interfaces)

databuild_graph(
	name = "my_graph",
	jobs = [":my_job", ...],
	plan = ":my_graph_plan",
)

py_binary(
	name = "my_graph_plan",
	...
)

databuild_job(
	name = "my_job",
	configure = ":my_job_configure",
	run = ":my_job_binary",
)

scala_binary(
	name = "my_job_configure",
	...
)

scala_binary(
	name = "my_job_binary",
	...
)