# Tenets
- No dependency knowledge necessary to materialize data
- Only local dependency knowledge to develop
- Not a framework (what does this mean?)
# Verbs / Phases
- What do engineers describe?
  - Jobs - unit of data processing
    - Configure executable (rule impl)
    - Run executable (generated file contents)
  - Graphs - data service
    - Plan (resolving partition refs to jobs that build them)
**Jobs**
1. `job.configure(refs) -> Seq[JobConfig]` - Produces the `JobConfig`s needed to build a set of partitions
2. `job.execute(refs) -> Seq[PartitionManifest]` - Produces a job config and immediately runs the job's run executable with it
**Graphs**
1. `graph.plan(refs) -> JobGraph` - Produces a `JobGraph` that fully describes the building of the requested partition refs (with preconditions)
2. `graph.execute(refs) -> Seq[PartitionManifest]` - Executes the `JobGraph` eagerly, producing the partition manifests of the underlying finished jobs (emits build events?)
3. `graph.lookup(refs) -> Map[JobLabel, ref]` - Resolves which job builds each requested partition ref
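A minimal Scala sketch of these verbs as interfaces, assuming `PartitionRef` and `JobLabel` are simple string-backed case classes (not defined elsewhere in these notes) and using the `JobConfig`, `PartitionManifest`, and `JobGraph` shapes from the sections below:
```scala
// Hypothetical identifier types; the real shapes are still undecided.
case class PartitionRef(dataset: String, partition: String)
case class JobLabel(value: String)

trait Job {
  // Produce the configs needed to build the requested partitions.
  def configure(refs: Set[PartitionRef]): Seq[JobConfig]
  // Configure and immediately run, returning manifests of what was built.
  def execute(refs: Set[PartitionRef]): Seq[PartitionManifest]
}

trait Graph {
  // Resolve the requested refs into a fully described job graph.
  def plan(refs: Set[PartitionRef]): JobGraph
  // Eagerly run the planned graph, returning the finished jobs' manifests.
  def execute(refs: Set[PartitionRef]): Seq[PartitionManifest]
  // Resolve which job is responsible for each requested partition ref.
  def lookup(refs: Set[PartitionRef]): Map[JobLabel, PartitionRef]
}
```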
## Job Configuration
The process of fully parameterizing a job run from the desired partition refs; the resulting `JobConfig` parameterizes the job run executable.
```scala
case class JobConfig(
  // The partitions that this parameterization produces
  outputs: Set[PartitionRef],
  inputs: Set[DataDep],
  args: Seq[String],
  env: Map[String, String],
  // Path to the executable that will be run
  executable: String,
)

case class DataDep(
  // E.g. query, materialize
  depType: DepType,
  ref: PartitionRef,
  // Excluded for now
  // timeoutSeconds: Int,
)
```
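A minimal sketch of a job's `configure` for a date-partitioned dataset, using the shapes above; the `DepType` cases, dataset names, and executable path are all assumptions for illustration:
```scala
// Hypothetical dep types; the notes only name "query" and "materialize".
sealed trait DepType
case object Materialize extends DepType
case object Query extends DepType

object DailyScoresJob {
  // One JobConfig per requested output partition (here, one per date).
  def configure(refs: Set[PartitionRef]): Seq[JobConfig] =
    refs.toSeq.map { ref =>
      JobConfig(
        outputs = Set(ref),
        // Each output date depends on the raw events for the same date.
        inputs = Set(DataDep(Materialize, PartitionRef("raw_events", ref.partition))),
        args = Seq("--date", ref.partition),
        env = Map("OUTPUT_DATASET" -> ref.dataset),
        executable = "bazel-bin/jobs/daily_scores/run"
      )
    }
}
```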
## Job Execution
Jobs produce partition manifests:
```scala
case class PartitionManifest(
  outputs: Set[PartitionRef],
  inputs: Set[PartitionManifest],
  startTime: Long,
  endTime: Long,
  config: JobConfig,
)
```
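A sketch of `job.execute` as configure-then-run with wall-clock bounds, using the shapes above; `runExecutable` and `lookupManifest` are hypothetical stand-ins for launching the configured executable and fetching the manifests of already-built input partitions:
```scala
def execute(configs: Seq[JobConfig],
            runExecutable: JobConfig => Unit,
            lookupManifest: PartitionRef => PartitionManifest): Seq[PartitionManifest] =
  configs.map { config =>
    val start = System.currentTimeMillis()
    runExecutable(config) // side effects expected: the output partitions get written
    PartitionManifest(
      outputs = config.outputs,
      inputs = config.inputs.map(dep => lookupManifest(dep.ref)),
      startTime = start,
      endTime = System.currentTimeMillis(),
      config = config
    )
  }
```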
## Graph Planning
A plan is the set of job configs that, if run in topo-sorted order, will produce the requested `outputs`.
```scala
case class JobGraph(
  outputs: Set[PartitionRef],
  // Needs the executable too? How do we reference here?
  nodes: Set[JobConfig],
)
```
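The topo-sorted order can be derived from the configs themselves, since a node's inputs name partitions that other nodes output. A depth-first sketch, assuming the graph is acyclic and treating inputs with no producer in the graph as external:
```scala
// Orders nodes so every config runs after the configs that produce its inputs.
def topoSort(graph: JobGraph): Seq[JobConfig] = {
  // Which node produces each partition (overlapping outputs not handled here).
  val producers: Map[PartitionRef, JobConfig] =
    graph.nodes.flatMap(node => node.outputs.map(_ -> node)).toMap

  val ordered = scala.collection.mutable.LinkedHashSet.empty[JobConfig]

  def visit(node: JobConfig): Unit =
    if (!ordered.contains(node)) {
      node.inputs.flatMap(dep => producers.get(dep.ref)).foreach(visit)
      ordered += node
    }

  graph.nodes.foreach(visit)
  ordered.toSeq
}
```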
The databuild graph needs to:
- Analyze: use the provided partition refs to determine all involved jobs and their configs in a job graph
- Plan: determine the literal jobs to execute (skipping/pruning valid cached partitions)
- Compile: compile the graph into an artifact that runs the materialize process <-- sounds like an application consideration? (a sketch follows the diagram below)
Perhaps these are different capabilities - e.g. producing a bash script that runs the build process is fundamentally separate from the smarter, stateful thing that manages these builds over time, pruning cached builds, etc. And we could make the **partition catalog pluggable**!
```mermaid
flowchart
jobs & outputs --> plan --> job_graph
job_graph --> compile.bash --> build_script
job_graph & partition_catalog --> partition_build_service
```
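A sketch of the `compile.bash` leg: once the nodes are topo-sorted, the build script is just the configs serialized as shell commands. Quoting is naive here, and `topoSort` is the planning sketch above:
```scala
def compileToBash(graph: JobGraph): String = {
  val commands = topoSort(graph).map { config =>
    val env  = config.env.toSeq.sorted.map { case (k, v) => s"$k='$v'" }.mkString(" ")
    val args = config.args.mkString(" ")
    s"$env ${config.executable} $args".trim
  }
  ("#!/usr/bin/env bash" +: "set -euo pipefail" +: commands).mkString("\n")
}
```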
# Build Graph / Action Graph
- merkle tree + dataset versions (semver?)
- mask upstream changes that aren't major
- content-addressable storage based on action keys that point into the merkle tree (see the action-key sketch after this list)
- compile to set of build files? (thrash with action graph?)
- catalog of partition manifests + code artifacts enables this
- start with basic presence check
- side effects expected
- partition manifests as output artifact?
- this is an orchestration-layer concern because `configure` needs to be able to invalidate the cache
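A sketch of the action-key idea: hash everything that could change the output (a digest of the executable, the args, the env, and the keys already recorded for the input partitions), then use presence of that key in the partition catalog as the basic cache check. The digest inputs here are assumptions:
```scala
import java.security.MessageDigest

// Content-addressable key for a configured job run; `executableDigest` and
// `inputKeys` are hypothetical values supplied by the build/catalog layer.
def actionKey(config: JobConfig,
              executableDigest: String,
              inputKeys: Map[PartitionRef, String]): String = {
  val parts = Seq(
    executableDigest,
    config.args.mkString("\u0000"),
    config.env.toSeq.sorted.map { case (k, v) => s"$k=$v" }.mkString("\u0000"),
    config.inputs.toSeq.map(dep => inputKeys.getOrElse(dep.ref, "external")).sorted.mkString("\u0000")
  )
  val bytes = MessageDigest.getInstance("SHA-256").digest(parts.mkString("\u0001").getBytes("UTF-8"))
  bytes.map("%02x".format(_)).mkString
}
```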
# Assumptions
- Job runs are independent, e.g. if run X is already producing partition A, run Y can safely prune A... during configure?
- Job runs are idempotent (e.g. overwrite)
- A `databuild_graph` can be deployed "unambiguously" (lol)
# Questions
- How does partition overlap work? Can it be pruned? Or throw during configure? This sounds like a very common case
  - Answer: this is a responsibility of a live service backed by a datastore. If jobs are in fact independent, then refs requested by another build can be "delegated" to the jobs already building those refs.
- How do we implement job lookup for graphs? Is this a job catalog thing?
  - Answer: Yes, job graphs have a `lookup` attr
- How do graphs handle caching? We can't plan a whole graph if job configs contain mtimes, etc. (we don't know when the job will finish). So the graph must detect stale partitions (and their downstreams) that need to be rebuilt?
- How do we handle non-materialize relationships outside the graph?
  - Answer: Provide build modes, but otherwise awaiting external data is a non-core problem
## Ideas
- Should we have an "optimistic" mode that builds all partitions that can be built?
- Emit an event stream for observability purposes?
## Partition Overlap
For example, we have two partitions we want to build for two different concerns, e.g. pulled by two separate triggers, and both of these partitions depend on some of the same upstreams.
- Do we need managed state, i.e. the "pending build graph"? Do we need an (internal, at least) data catalog?
- Leave a door open, but don't get nerd-sniped
- Make sure the `JobGraph` is merge-able (a sketch follows this list)
- How do we merge data deps? (timeouts are time-based) - Do we need to?
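A sketch of what "merge-able" could mean at the `JobGraph` level, assuming the shapes above: union the requested outputs, dedupe identical nodes, and treat two distinct configs claiming the same outputs as an error rather than trying to resolve them:
```scala
def merge(a: JobGraph, b: JobGraph): JobGraph = {
  // Identical configs dedupe via the Set; differing configs for the same
  // outputs are a conflict this sketch refuses to resolve.
  val byOutputs = (a.nodes ++ b.nodes).groupBy(_.outputs)
  val conflicts = byOutputs.collect { case (outputs, configs) if configs.size > 1 => outputs }
  require(conflicts.isEmpty, s"conflicting configs for outputs: $conflicts")
  JobGraph(outputs = a.outputs ++ b.outputs, nodes = a.nodes ++ b.nodes)
}
```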
## Data Ver & Invalidation
Sometimes there are minor changes that don't invalidate previously produced data, and sometimes there are major changes that do invalidate past partitions. Examples:
- No invalidate: add optional field for new feature not relevant for past data
- Invalidate: whoops, we were calculating the score wrong
This is separate from "version the dataset": a dataset version represents an intended structure/meaning, and past partitions can be incorrect for that intended structure/meaning and legitimately need to be overwritten, whereas a new dataset version introduces a new intended structure/meaning. This should be an optional concept (e.g. the default version is `v0.0.0`).
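A sketch of how a semver-style data version could drive invalidation, assuming each partition manifest were extended to record the version it was built with (a field the `PartitionManifest` above does not yet have):
```scala
// Hypothetical data version: a major bump invalidates previously built
// partitions; minor/patch bumps do not. Defaults to v0.0.0 so the concept
// stays optional.
case class DataVersion(major: Int, minor: Int, patch: Int)
object DataVersion { val default: DataVersion = DataVersion(0, 0, 0) }

def invalidates(current: DataVersion, builtWith: DataVersion): Boolean =
  current.major > builtWith.major
```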
## Why Deployability Matters
This needs to be deployable trivially from day one because:
- We want to "launch jobs" in an unopinionated way - tell Bazel what platform you're building for, then boop the results off to that system, and run it
- Being able to vend executables makes building weakly coupled apps easy (not a framework)
# Demo Development
1. `databuild_job`
1. `databuild_job.cfg`
2. `databuild_job.exec`
3. Tests
4. `databuild_job` (to `cfg` and `exec`)
5. Deployable `databuild_job`
2. `databuild_graph`
1. `databuild_graph.analyze`
2. `databuild_graph` provider
3. `databuild_graph.exec`
4. `databuild_graph.build`
5. `databuild_graph.mermaid`
5. podcast reviews example
6. Reflect (data versioning/caching/partition manifests, partition overlap, ...?)
# Factoring
- Core - graph description, build, analysis, and execution
- Service - job/partition catalog, parallel execution, triggers, exposed service
- Product - Accounts/RBAC, auth, delegates for exec/storage
# Service Sketch
```mermaid
flowchart
codebase
subgraph service
data_service
end
subgraph database
job_catalog
partition_catalog
end
codebase -- deployed_to --> data_service
data_service -- logs build events --> job_catalog
data_service -- queries/records partition manifest --> partition_catalog
```
# Scratch
Implementation:
- Bazel to describe jobs/graphs
- Whatever you want to implement jobs and graphs with (need solid interfaces)
```python
databuild_graph(
    name = "my_graph",
    jobs = [":my_job", ...],
    plan = ":my_graph_plan",
)

py_binary(
    name = "my_graph_plan",
    ...
)

databuild_job(
    name = "my_job",
    configure = ":my_job_configure",
    run = ":my_job_binary",
)

scala_binary(
    name = "my_job_configure",
    ...
)

scala_binary(
    name = "my_job_binary",
    ...
)
```