databuild/core-concepts.md

Tenets

  • No dependency knowledge necessary to materialize data
  • Only local dependency knowledge to develop
  • Not a framework (what does this mean?)

Verbs / Phases

  • What do engineers describe?
    • Jobs - unit of data processing
      • Configure executable (rule impl)
      • Run executable (generated file contents)
    • Graphs - data service
      • Plan (resolving partition refs to jobs that build them)

Jobs

  1. job.configure(refs) -> Seq[JobConfig] - Produces the job configs needed to build the requested set of partitions
  2. job.execute(refs) -> Seq[PartitionManifest] - Configures the job and immediately runs its run executable with the resulting configs
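
A minimal sketch of what this interface could look like (the trait name and exact signatures are assumptions, reusing the case classes defined later in this doc):

trait Job {
	// Resolve the requested partition refs into fully parameterized job configs
	def configure(refs: Set[PartitionRef]): Seq[JobConfig]
	// Configure, then immediately run the run executable; one manifest per job run
	def execute(refs: Set[PartitionRef]): Seq[PartitionManifest]
}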

Graphs

  1. graph.plan(refs) -> JobGraph - Produces a JobGraph that fully describes the building of requested partition refs (with preconditions)
  2. graph.execute(refs) -> Seq[PartitionManifest] - Executes the JobGraph eagerly, produces the partition manifests of the underlying finished jobs (emits build events?)
  3. graph.lookup(refs) -> Map[JobLabel, PartitionRef] - Maps each involved job (by label) to the ref it covers for this request
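
A corresponding sketch for graphs (again a sketch; JobLabel and the exact lookup shape are taken from the signatures above):

trait Graph {
	// Produce a JobGraph fully describing the build of the requested refs
	def plan(refs: Set[PartitionRef]): JobGraph
	// Execute the plan eagerly and return the finished jobs' manifests
	def execute(refs: Set[PartitionRef]): Seq[PartitionManifest]
	// Map each involved job (by label) to the ref it covers for this request
	def lookup(refs: Set[PartitionRef]): Map[JobLabel, PartitionRef]
}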

Job Configuration

The process of fully parameterizing a job run based on the desired partitions, i.e. producing the arguments and environment for the job run executable.

case class JobConfig(
	// The partitions that this parameterization produces
	outputs: Set[PartitionRef],
	// Upstream partitions this run depends on
	inputs: Set[DataDep],
	// Arguments and environment passed to the run executable
	args: Seq[String],
	env: Map[String, String],
	// Path to executable that will be run
	executable: String,
)

case class DataDep(
	// E.g. query, materialize
	depType: DepType,
	ref: PartitionRef,
	// Excluded for now
	// timeoutSeconds: Int,
)
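
To make the parameterization concrete, here is a hypothetical configure for a daily-aggregation job. PartitionRef's fields, DepType's variants, the dataset names, and the executable path are all made up for illustration:

// Assumed shapes, for illustration only
case class PartitionRef(dataset: String, date: String)
sealed trait DepType
object DepType {
	case object Query extends DepType
	case object Materialize extends DepType
}

object DailyAggJob {
	def configure(refs: Set[PartitionRef]): Seq[JobConfig] =
		refs.toSeq.map { ref =>
			JobConfig(
				// This parameterization produces exactly the requested partition
				outputs = Set(ref),
				// ...and reads the same day's raw events (hypothetical upstream)
				inputs = Set(DataDep(DepType.Materialize, PartitionRef("events_raw", ref.date))),
				args = Seq("--date", ref.date),
				env = Map("OUTPUT_DATASET" -> ref.dataset),
				executable = "bin/daily_agg",
			)
		}
}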

Job Execution

Jobs produce partition manifests:

case class PartitionManifest(
	outputs: Set[PartitionRef],
	// Manifests of the upstream partitions this run consumed
	inputs: Set[PartitionManifest],
	startTime: Long,
	endTime: Long,
	config: JobConfig,
)
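
A rough sketch of the execution step, assuming a simple local process runner (the process handling and failure behavior here are placeholders, not a proposal):

import scala.sys.process._

def runJob(cfg: JobConfig, upstream: Set[PartitionManifest]): PartitionManifest = {
	val start = System.currentTimeMillis()
	// Run the configured executable with its args and environment
	val exit = Process(cfg.executable +: cfg.args, None, cfg.env.toSeq: _*).!
	require(exit == 0, s"job failed with exit code $exit: ${cfg.executable}")
	PartitionManifest(
		outputs = cfg.outputs,
		inputs = upstream,
		startTime = start,
		endTime = System.currentTimeMillis(),
		config = cfg,
	)
}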

Graph Planning

A JobGraph is the set of job configs that, if run in topo-sorted order, will produce the requested outputs.

case class JobGraph(
	outputs: Set[PartitionRef],
	// Needs the executable too? How do we reference here?
	nodes: Set[JobConfig],
)
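
For example, a naive in-process executor can repeatedly run every node whose in-graph inputs are already built, which is equivalent to walking the nodes in topo-sorted order (a sketch only; parallelism, caching, and error handling omitted, and runJob is the execution sketch above):

def executeGraph(graph: JobGraph, upstream: Set[PartitionManifest]): Seq[PartitionManifest] = {
	// Refs produced inside this graph, as opposed to external/precondition refs
	val internal = graph.nodes.flatMap(_.outputs)
	var built = upstream.flatMap(_.outputs)
	var pending = graph.nodes
	var manifests = Vector.empty[PartitionManifest]
	while (pending.nonEmpty) {
		// A node is ready when each input is either already built or external
		val ready = pending.filter(_.inputs.forall(d => built(d.ref) || !internal(d.ref)))
		require(ready.nonEmpty, "cycle or unsatisfiable dependency in JobGraph")
		for (cfg <- ready) {
			val manifest = runJob(cfg, upstream)
			manifests :+= manifest
			built ++= manifest.outputs
		}
		pending --= ready
	}
	manifests
}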

The databuild graph needs to:

  • Analyze: use the provided partition refs to determine all involved jobs and their configs in a job graph
  • Plan: Determine the literal jobs to execute (skipping/pruning valid cached partitions)
  • Compile: compile the graph into an artifact that runs the materialize process (though this sounds like an application-level concern?)

Perhaps these are different capabilities - e.g. producing a bash script that runs the build process is fundamentally separate from the smarter, stateful thing that manages these builds over time, pruning cached builds, etc. And we could make the partition catalog pluggable!

flowchart
	jobs & outputs --> plan --> job_graph
	job_graph --> compile.bash --> build_script
	job_graph & partition_catalog --> partition_build_service

Build Graph / Action Graph

  • merkle tree + dataset versions (semver?)
    • mask upstream changes that aren't major
    • content addressable storage based on action keys that point to merkle tree
    • compile to set of build files? (thrash with action graph?)
    • catalog of partition manifests + code artifacts enables this
    • start with a basic presence check (sketched after this list)
  • side effects expected
  • partition manifests as output artifact?
    • this is orchestration layer concern because configure needs to be able to invalidate cache
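
A sketch of that basic presence check, assuming the action key is just a content hash of the JobConfig and that the partition catalog can be queried per ref (the catalog interface here is hypothetical):

import java.security.MessageDigest

// Hypothetical catalog of recorded partition manifests
trait PartitionCatalog {
	def latestManifest(ref: PartitionRef): Option[PartitionManifest]
}

// Action key: a content hash of the fully parameterized config
// (hashing toString is a placeholder for a real canonical serialization)
def actionKey(cfg: JobConfig): String =
	MessageDigest.getInstance("SHA-256")
		.digest(cfg.toString.getBytes("UTF-8"))
		.map("%02x".format(_))
		.mkString

// A node can be pruned if every output already has a manifest recorded
// under the same action key
def canPrune(cfg: JobConfig, catalog: PartitionCatalog): Boolean =
	cfg.outputs.forall { ref =>
		catalog.latestManifest(ref).exists(m => actionKey(m.config) == actionKey(cfg))
	}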

Assumptions

  • Job runs are independent, e.g. if run X is already producing partition A, run Y can safely prune A... during configure?
  • Job runs are idempotent (e.g. overwrite)
  • A databuild_graph can be deployed "unambiguously" (lol)

Questions

  • How does partition overlap work? Can it be pruned? Or throw during configure? This sounds like a very common case
    • Answer: this is a responsibility of a live service backed by a datastore. If jobs are in fact independent, then refs requested by another build can be "delegated" to the jobs already building those refs.
  • How do we implement job lookup for graphs? Is this a job catalog thing?
    • Answer: Yes, job graphs have a lookup attr
  • How do graphs handle caching? We can't plan a whole graph up front if job configs contain mtimes, etc. (we don't know when upstream jobs will finish), so planning must detect stale partitions (and their downstreams) that need to be rebuilt?
  • How do we handle non-materialize relationships outside the graph?
    • Answer: Provide build modes, but otherwise awaiting external data is a non-core problem

Ideas

  • Should we have an "optimistic" mode that builds all partitions that can be built?
  • Emit an event stream for observability purposes?

Partition Overlap

For example, we have two partitions we want to build for two different concerns (e.g. pulled by two separate triggers), and both of these partitions depend on some of the same upstreams.

  • Do we need managed state, which is the "pending build graph"? Do we need an (internal, at least) data catalog?
  • Leave a door open, but don't get nerd sniped
  • Make sure the JobGraph is merge-able (a merge sketch follows this list)
  • How do we merge data deps (the timeout is time-based)? Do we need to?
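
A merge sketch, assuming that two configs claiming the same outputs are the same parameterization (which is exactly the property overlap handling has to guarantee):

def merge(a: JobGraph, b: JobGraph): JobGraph = {
	// Deduplicate nodes by the partitions they produce
	val nodes = (a.nodes ++ b.nodes)
		.groupBy(_.outputs)
		.map { case (_, configs) => configs.head }
		.toSet
	JobGraph(outputs = a.outputs ++ b.outputs, nodes = nodes)
}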

Data Ver & Invalidation

Sometimes there are minor changes that don't invalidate past produced data, and sometimes there are major changes that do invalidate past partitions. Examples:

  • No invalidate: add optional field for new feature not relevant for past data
  • Invalidate: whoops, we were calculating the score wrong

This is separate from "version the dataset": a dataset version represents an intended structure/meaning, and partitions produced in the past can be incorrect for that intended structure/meaning and legitimately need to be overwritten. In contrast, a new dataset version introduces a new intended structure/meaning. This should be an optional concept (e.g. the default version is v0.0.0).
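
Under a semver-style scheme this could reduce to a one-line check: a past partition is only invalidated when the major version has moved beyond what its manifest recorded (the version type and where it is stored are assumptions):

// Hypothetical dataset version; default is v0.0.0 when unspecified
case class DataVersion(major: Int, minor: Int, patch: Int)

// A previously produced partition is stale only on a major version bump
def invalidatedBy(produced: DataVersion, current: DataVersion): Boolean =
	current.major > produced.major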

Why Deployability Matters

This needs to be deployable trivially from day one because:

  • We want to "launch jobs" in an un-opinionated way - tell Bazel what platform you're building for, then boop the results off to that system and run them
  • Being able to vend executables makes building weakly coupled apps easy (not a framework)

Demo Development

  1. databuild_job
    1. databuild_job.cfg
    2. databuild_job.exec
    3. Tests
    4. databuild_job (to cfg and exec)
    5. Deployable databuild_job
  2. databuild_graph
    1. databuild_graph.analyze
    2. databuild_graph provider
  3. databuild_graph.exec
  4. databuild_graph.build
  5. databuild_graph.mermaid
  6. podcast reviews example
  7. Reflect (data versioning/caching/partition manifests, partition overlap, ...?)

Factoring

  • Core - graph description, build, analysis, and execution
  • Service - job/partition catalog, parallel execution, triggers, exposed service
  • Product - Accounts/RBAC, auth, delegates for exec/storage

Service Sketch

flowchart
	codebase
	subgraph service
		data_service
	end
	subgraph database
		job_catalog
		partition_catalog
	end
	codebase -- deployed_to --> data_service
	data_service -- logs build events --> job_catalog
	data_service -- queries/records partition manifest --> partition_catalog

Scratch

Implementation:

  • Bazel to describe jobs/graphs
  • Whatever language you want to implement jobs and graphs (needs solid interfaces)

databuild_graph(
	name = "my_graph",
	jobs = [":my_job", ...],
	plan = ":my_graph_plan",
)

py_binary(
	name = "my_graph_plan",
	...
)

databuild_job(
	name = "my_job",
	configure = ":my_job_configure",
	run = ":my_job_binary",
)

scala_binary(
	name = "my_job_configure",
	...
)

scala_binary(
	name = "my_job_binary",
	...
)