Tenets
- No dependency knowledge necessary to materialize data
- Only local dependency knowledge to develop
- Not a framework (what does this mean?)
Verbs / Phases
- What do engineers describe?
- Jobs - unit of data processing
  - Configure executable (rule impl)
  - Run executable (generated file contents)
- Graphs - data service
  - Plan (resolving partition refs to jobs that build them)
Jobs
- job.configure(refs) -> Seq[JobConfig] - Produces a JobConfig for producing a set of partitions
- job.execute(refs) -> Seq[PartitionManifest] - Produces a job config and immediately runs the job run executable with it
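A minimal trait sketch of this surface, assuming the JobConfig and PartitionManifest shapes defined below; the names Job and PartitionRef are placeholders, not a settled API.

// Placeholder until the ref format is decided; likely dataset + partition key.
final case class PartitionRef(value: String)

// Illustrative only; JobConfig and PartitionManifest are defined below.
trait Job {
  // Fully parameterize runs that would produce the requested partitions.
  def configure(refs: Set[PartitionRef]): Seq[JobConfig]
  // Produce configs and immediately run the job run executable for each.
  def execute(refs: Set[PartitionRef]): Seq[PartitionManifest]
}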
Graphs
- graph.plan(refs) -> JobGraph - Produces a JobGraph that fully describes the building of requested partition refs (with preconditions)
- graph.execute(refs) -> Seq[PartitionManifest] - Executes the JobGraph eagerly, produces the partition manifests of the underlying finished jobs (emits build events?)
- graph.lookup(refs) -> Map[JobLabel, ref]
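The same kind of sketch for the graph surface; Graph is a placeholder name, and JobLabel is assumed to be a job's identity (e.g. its Bazel label).

trait Graph {
  // Assumed: a job is identified by something like its Bazel label.
  type JobLabel = String

  // Resolve partition refs into a JobGraph that fully describes the build.
  def plan(refs: Set[PartitionRef]): JobGraph
  // Execute the plan eagerly; return manifests of the finished jobs.
  def execute(refs: Set[PartitionRef]): Seq[PartitionManifest]
  // Which job (by label) is responsible for which requested ref.
  def lookup(refs: Set[PartitionRef]): Map[JobLabel, PartitionRef]
}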
Job Configuration
The process of fully parameterizing a job run based on the desired partitions; the resulting JobConfig parameterizes the job run executable.
case class JobConfig(
  // The partitions that this parameterization produces
  outputs: Set[PartitionRef],
  // Upstream data this run depends on
  inputs: Set[DataDep],
  // Command-line arguments passed to the executable
  args: Seq[String],
  // Environment variables set for the run
  env: Map[String, String],
  // Path to executable that will be run
  executable: String,
)
case class DataDep(
  // E.g. query, materialize
  depType: DepType,
  // The upstream partition being depended on
  ref: PartitionRef,
  // Excluded for now
  // timeoutSeconds: Int,
)
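As a concrete (entirely hypothetical) illustration, configuring one daily partition of a reviews dataset might yield something like the following; the refs, paths, args, and the DepType value are made up, and DepType itself is not defined yet.

val exampleConfig = JobConfig(
  outputs = Set(PartitionRef("reviews/date=2024-01-01")),
  inputs = Set(
    // Hypothetical: materialize-type dependency on the raw upstream partition.
    DataDep(depType = DepType.Materialize, ref = PartitionRef("raw_reviews/date=2024-01-01"))
  ),
  args = Seq("--date", "2024-01-01"),
  env = Map("OUTPUT_ROOT" -> "/data/reviews"),
  executable = "bazel-bin/jobs/build_reviews"
)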
Job Execution
Jobs produce partition manifests:
case class PartitionManifest(
  // The partitions this run produced
  outputs: Set[PartitionRef],
  // Manifests of the upstream partitions consumed
  inputs: Set[PartitionManifest],
  // Run start/end times (epoch millis)
  startTime: Long,
  endTime: Long,
  // The config that parameterized this run
  config: JobConfig,
)
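A sketch of what execution could look like if the run executable is an external process; process handling, error policy, and timing here are assumptions, not decisions.

import scala.sys.process._

def runJob(config: JobConfig, resolvedInputs: Set[PartitionManifest]): PartitionManifest = {
  val start = System.currentTimeMillis()
  // Invoke the run executable with the configured args and env; block until it exits.
  val exitCode = Process(config.executable +: config.args, None, config.env.toSeq: _*).!
  require(exitCode == 0, s"job run failed with exit code $exitCode")
  val end = System.currentTimeMillis()
  PartitionManifest(
    outputs = config.outputs,
    inputs = resolvedInputs,
    startTime = start,
    endTime = end,
    config = config
  )
}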
Graph Planning
A JobGraph is the set of job configs that, if run in topologically sorted order, will produce the requested outputs.
case class JobGraph(
  // The partition refs this graph was planned for
  outputs: Set[PartitionRef],
  // Needs the executable too? How do we reference here?
  nodes: Set[JobConfig],
)
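A sketch of eager execution over such a graph, reusing the runJob sketch above. Assumed dependency rule: a config is runnable once every input ref that some node in the graph produces has been built; refs produced outside the graph are treated as externally satisfied, and upstream manifest wiring is elided.

def executeGraph(graph: JobGraph): Seq[PartitionManifest] = {
  val producedInGraph: Set[PartitionRef] = graph.nodes.flatMap(_.outputs)
  var remaining = graph.nodes
  var built = Set.empty[PartitionRef]
  val manifests = scala.collection.mutable.ArrayBuffer.empty[PartitionManifest]
  while (remaining.nonEmpty) {
    // Runnable configs: all in-graph input refs are already built.
    val ready = remaining.filter(c => c.inputs.map(_.ref).filter(producedInGraph).subsetOf(built))
    require(ready.nonEmpty, "cycle detected in JobGraph")
    ready.foreach { c =>
      manifests += runJob(c, Set.empty) // upstream manifests elided
      built ++= c.outputs
    }
    remaining = remaining -- ready
  }
  manifests.toSeq
}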
The databuild graph needs to:
- Analyze: use the provided partition refs to determine all involved jobs and their configs in a job graph
- Plan: Determine the literal jobs to execute (skipping/pruning valid cached partitions)
- Compile: compile the graph into an artifact that runs the materialize process <-- sounds like an application consideration?
Perhaps these are different capabilities: producing a bash script that runs the build process is fundamentally separate from the smarter, stateful thing that manages these builds over time, prunes cached builds, etc. And we could make the partition catalog pluggable! (A compile-to-bash sketch follows the flowchart below.)
flowchart
jobs & outputs --> plan --> job_graph
job_graph --> compile.bash --> build_script
job_graph & partition_catalog --> partition_build_service
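A sketch of the compile.bash path in the flowchart: given configs already in dependency order (e.g. from plan), emit one command per job. Quoting, escaping, and env scoping are deliberately naive.

def compileToBash(orderedConfigs: Seq[JobConfig]): String = {
  val commands = orderedConfigs.map { c =>
    // Naive: each job's env vars are prefixed onto its command; no shell escaping.
    val env = c.env.map { case (k, v) => s"$k='$v'" }.mkString(" ")
    (env +: c.executable +: c.args).filter(_.nonEmpty).mkString(" ")
  }
  ("#!/usr/bin/env bash" +: "set -euo pipefail" +: commands).mkString("\n")
}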
Build Graph / Action Graph
- merkle tree + dataset versions (semver?)
- mask upstream changes that aren't major
- content addressable storage based on action keys that point to merkle tree
- compile to set of build files? (thrash with action graph?)
- catalog of partition manifests + code artifacts enables this
- start with basic presence check (see the sketch after this list)
- side effects expected
- partition manifests as output artifact?
- this is an orchestration layer concern because configure needs to be able to invalidate cache
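A first cut at the basic presence check mentioned above, assuming a hypothetical PartitionCatalog of recorded manifests; content addressing and configure-driven invalidation would replace this later.

trait PartitionCatalog {
  // Returns the manifest for a ref if it has already been built and recorded.
  def lookup(ref: PartitionRef): Option[PartitionManifest]
}

// Drop configs whose outputs are all already present in the catalog.
def pruneCached(graph: JobGraph, catalog: PartitionCatalog): JobGraph =
  graph.copy(nodes = graph.nodes.filterNot(c => c.outputs.forall(r => catalog.lookup(r).isDefined)))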
Assumptions
- Job runs are independent, e.g. if run X is already producing partition A, run Y can safely prune A... during configure?
- Job runs are idempotent (e.g. overwrite)
- A databuild_graph can be deployed "unambiguously" (lol)
Questions
- How does partition overlap work? Can it be pruned? Or throw during configure? This sounds like a very common case
- Answer: this is a responsibility of a live service backed by a datastore. If jobs are in fact independent, then refs requested by another build can be "delegated" to the jobs already running for those refs.
- How do we implement job lookup for graphs? Is this a job catalog thing?
- Answer: Yes, job graphs have a lookup attr
- How do graphs handle caching? We can't plan a whole graph if job configs contain mtimes, etc. (we don't know when each job will finish), so the graph must detect stale partitions (and their downstreams) that need to be rebuilt?
- How do we handle non-materialize relationships outside the graph?
- Answer: Provide build modes, but otherwise awaiting external data is a non-core problem
Ideas
- Should we have an "optimistic" mode that builds all partitions that can be built?
- Emit an event stream for observability purposes?
Partition Overlap
For example, we have two partitions we want to build for 2 different concerns, e.g. pulled by two separate triggers, and both of these partitions depend on some of the same upstreams.
- Do we need managed state, which is the "pending build graph"? Do we need an (internal, at least) data catalog?
- Leave a door open, but don't get nerd sniped
- Make sure the JobGraph is merge-able (a merge sketch follows this list)
  - How do we merge data deps? (timeout is time-based)
  - Do we need to?
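A sketch of what merge-ability could mean, under the assumption that configs producing the same output set are interchangeable; how to reconcile differing configs, and how (or whether) to merge data deps, stay open questions.

def merge(a: JobGraph, b: JobGraph): JobGraph = {
  // De-duplicate configs by the partition set they produce; keep one representative each.
  val mergedNodes = (a.nodes ++ b.nodes).groupBy(_.outputs).values.map(_.head).toSet
  JobGraph(outputs = a.outputs ++ b.outputs, nodes = mergedNodes)
}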
Data Ver & Invalidation
Sometimes there are minor changes that don't invalidate past produced data, and sometimes there are major changes that do invalidate past partitions. Examples:
- No invalidate: add optional field for new feature not relevant for past data
- Invalidate: whoops, we were calculating the score wrong
This is separate from "version the dataset": a dataset version represents a structure/meaning, and partitions produced in the past can be incorrect for that intended structure/meaning and legitimately need to be overwritten. In contrast, new dataset versions introduce new intended structure/meaning. This should be an optional concept (e.g. the default version is v0.0.0).
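A sketch of that rule under an assumed semver-style dataset version (defaulting to v0.0.0): only a major bump invalidates previously produced partitions, while minor/patch changes are masked.

final case class DatasetVersion(major: Int, minor: Int, patch: Int)

object DatasetVersion {
  // Optional concept: datasets that don't opt in are implicitly v0.0.0.
  val Default: DatasetVersion = DatasetVersion(0, 0, 0)
}

// A partition built with `producedWith` is stale iff the major version changed.
def invalidates(current: DatasetVersion, producedWith: DatasetVersion): Boolean =
  current.major != producedWith.major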
Why Deployability Matters
This needs to be deployable trivially from day one because:
- We want to "launch jobs" in an un-opinionated way - tell bazel what platform you're building for, then boop the results off to that system, and run it
- Being able to vend executables makes building weakly coupled apps easy (not a framework)
Demo Development
- databuild_job ✅
- databuild_job.cfg ✅
- databuild_job.exec ✅
- Tests ✅
- databuild_job (to cfg and exec) ✅
- Deployable databuild_job ✅
- databuild_graph ✅
- databuild_graph.analyze ✅
- databuild_graph provider ✅
- databuild_graph.exec ✅
- databuild_graph.build ✅
- databuild_graph.mermaid
- podcast reviews example
- Reflect (data versioning/caching/partition manifests, partition overlap, ...?)
Factoring
- Core - graph description, build, analysis, and execution
- Service - job/partition catalog, parallel execution, triggers, exposed service
- Product - Accounts/RBAC, auth, delegates for exec/storage
Service Sketch
flowchart
codebase
subgraph service
data_service
end
subgraph database
job_catalog
partition_catalog
end
codebase -- deployed_to --> data_service
data_service -- logs build events --> job_catalog
data_service -- queries/records partition manifest --> partition_catalog
Scratch
Implementation:
- Bazel to describe jobs/graphs
- Whatever you want to implement jobs and graphs (need solid interfaces)
databuild_graph(
name = "my_graph",
jobs = [":my_job", ...],
plan = ":my_graph_plan",
)
py_binary(
name = "my_graph_plan",
...
)
databuild_job(
name = "my_job",
configure = ":my_job_configure",
run = ":my_job_binary",
)
scala_binary(
name = "my_job_configure",
...
)
scala_binary(
name = "my_job_binary",
...
)