# Tenets

- No dependency knowledge necessary to materialize data
- Only local dependency knowledge to develop
- Not a framework (what does this mean?)
# Verbs / Phases

- What do engineers describe?
  - Jobs - unit of data processing
    - Configure executable (rule impl)
    - Run executable (generates file contents)
  - Graphs - data service
    - Plan (resolving partition refs to jobs that build them)
**Jobs**

1. `job.configure(refs) -> Seq[JobConfig]` - Produces a `JobConfig` for producing a set of partitions
2. `job.execute(refs) -> Seq[PartitionManifest]` - Produces a job config and immediately runs the job executable with it

**Graphs**

1. `graph.plan(refs) -> JobGraph` - Produces a `JobGraph` that fully describes the building of the requested partition refs (with preconditions)
2. `graph.execute(refs) -> Seq[PartitionManifest]` - Executes the `JobGraph` eagerly and produces the partition manifests of the underlying finished jobs (emits build events?)
3. `graph.lookup(refs) -> Map[JobLabel, ref]`
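A hypothetical Python rendering of the verbs above, to make the signatures concrete. The type names mirror the Scala case classes sketched later in this doc; the exact field shapes here are illustrative assumptions, not a committed interface.

```python
# Illustrative shapes only — field choices are assumptions.
from dataclasses import dataclass
from typing import Mapping, Protocol, Sequence, Set


@dataclass(frozen=True)
class PartitionRef:
    dataset: str
    partition: str  # e.g. a date key like "2024-01-01"


@dataclass(frozen=True)
class JobConfig:
    outputs: frozenset       # of PartitionRef
    args: tuple              # of str
    env: Mapping[str, str]
    executable: str          # path to the run executable


@dataclass(frozen=True)
class PartitionManifest:
    outputs: frozenset       # of PartitionRef
    config: JobConfig


class Job(Protocol):
    def configure(self, refs: Set[PartitionRef]) -> Sequence[JobConfig]: ...
    def execute(self, refs: Set[PartitionRef]) -> Sequence[PartitionManifest]: ...


class Graph(Protocol):
    def plan(self, refs: Set[PartitionRef]) -> object: ...  # -> JobGraph
    def execute(self, refs: Set[PartitionRef]) -> Sequence[PartitionManifest]: ...
    def lookup(self, refs: Set[PartitionRef]) -> Mapping[str, PartitionRef]: ...
```

`Protocol` keeps this structural rather than nominal, matching the "not a framework" tenet: any object with these methods qualifies.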
## Job Configuration

The process of fully parameterizing a job run based on the desired partitions. This parameterizes the job run executable.

```scala
case class JobConfig(
  // The partitions that this parameterization produces
  outputs: Set[PartitionRef],
  inputs: Set[DataDep],
  args: Seq[String],
  env: Map[String, String],
  // Path to the executable that will be run
  executable: String
)

case class DataDep(
  // E.g. query, materialize
  depType: DepType,
  ref: PartitionRef
  // Excluded for now
  // timeoutSeconds: Int
)
```
## Job Execution

Jobs produce partition manifests:

```scala
case class PartitionManifest(
  outputs: Set[PartitionRef],
  inputs: Set[PartitionManifest],
  startTime: Long,
  endTime: Long,
  config: JobConfig
)
```
## Graph Planning

The set of job configs that, if run in topo-sorted order, will produce these `outputs`.

```scala
case class JobGraph(
  outputs: Set[PartitionRef],
  // Needs the executable too? How do we reference it here?
  nodes: Set[JobConfig]
)
```
The databuild graph needs to:

- Analyze: use the provided partition refs to determine all involved jobs and their configs in a job graph
- Plan: determine the literal jobs to execute (skipping/pruning valid cached partitions)
- Compile: compile the graph into an artifact that runs the materialize process <-- sounds like an application consideration?

Perhaps these are different capabilities - e.g. producing a bash script that runs the build process is a fundamentally separate thing from the smarter stateful thing that manages these builds over time, pruning cached builds, etc. And we could make the **partition catalog pluggable**!
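A sketch of the Plan step under stated assumptions: `JOBS` stands in for the result of each job's `configure` (ref -> producing job and upstream refs), the catalog is just a set of refs already present, and Python's stdlib `graphlib` does the topo-sort. All names here are hypothetical.

```python
# Plan: walk back from the requested refs, prune partitions the catalog
# already has (and stop descending through them), topo-sort the rest.
from graphlib import TopologicalSorter

# ref -> (job_name, upstream refs) — stand-in for configure() output
JOBS = {
    "daily_report/2024-01-01": ("report_job", ["clicks/2024-01-01", "users/latest"]),
    "clicks/2024-01-01": ("clicks_job", []),
    "users/latest": ("users_job", []),
}


def plan(requested, cached):
    """Return job names to run, topo-sorted, pruning cached partitions."""
    deps = {}
    stack = list(requested)
    while stack:
        ref = stack.pop()
        if ref in cached or ref in deps:
            continue  # cached partitions (and their upstreams) are pruned
        _, upstream = JOBS[ref]
        live = [u for u in upstream if u not in cached]
        deps[ref] = live
        stack.extend(live)
    order = TopologicalSorter(deps).static_order()
    return [JOBS[ref][0] for ref in order]
```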
```mermaid
flowchart
  jobs & outputs --> plan --> job_graph
  job_graph --> compile.bash --> build_script
  job_graph & partition_catalog --> partition_build_service
```
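The `compile.bash` edge above could start as nothing more than rendering topo-sorted job configs into a standalone shell script. A sketch, with `JobConfig` flattened to a plain dict for brevity (the dict keys follow the case class fields; everything else is illustrative):

```python
# Compile a topo-sorted list of job configs into a bash build script.
import shlex


def compile_bash(configs):
    """Render topo-sorted job configs as one sequential bash script."""
    lines = ["#!/usr/bin/env bash", "set -euo pipefail", ""]
    for cfg in configs:
        env = " ".join(f"{k}={shlex.quote(v)}" for k, v in sorted(cfg["env"].items()))
        args = " ".join(shlex.quote(a) for a in cfg["args"])
        lines.append(f"# builds: {', '.join(sorted(cfg['outputs']))}")
        lines.append(" ".join(p for p in [env, cfg["executable"], args] if p))
    return "\n".join(lines) + "\n"
```

This is the dumb stateless end of the spectrum: no catalog, no pruning, just the "vend an executable" story.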
# Build Graph / Action Graph

- merkle tree + dataset versions (semver?)
- mask upstream changes that aren't major
- content-addressable storage based on action keys that point to the merkle tree
- compile to a set of build files? (thrash with the action graph?)
- a catalog of partition manifests + code artifacts enables this
- start with a basic presence check
- side effects expected
- partition manifests as the output artifact?
- this is an orchestration-layer concern because `configure` needs to be able to invalidate the cache
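One possible reading of "action keys" above: a stable digest over the job config plus the identities of its input manifests, so any unmasked upstream change yields a new key. A sketch with illustrative shapes (the real key would hash the merkle-tree entries, not raw strings):

```python
# Content-addressable action key: hash of config + input identities.
import hashlib
import json


def action_key(config, input_manifests):
    """Stable digest over a job config and its input manifest ids."""
    payload = json.dumps(
        {"config": config, "inputs": sorted(input_manifests)},
        sort_keys=True,  # canonical ordering keeps the digest stable
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```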
# Assumptions

- Job runs are independent, e.g. if run X is already producing partition A, run Y can safely prune A... during configure?
- Job runs are idempotent (e.g. overwrite)
- A `databuild_graph` can be deployed "unambiguously" (lol)
# Questions

- How does partition overlap work? Can it be pruned? Or throw during configure? This sounds like a very common case
  - Answer: this is a responsibility of a live service backed by a datastore. If jobs are in fact independent, then refs requested by another build can be "delegated" to the jobs already running for those refs.
- How do we implement job lookup for graphs? Is this a job catalog thing?
  - Answer: Yes, job graphs have a `lookup` attr
- How do graphs handle caching? We can't plan a whole graph if job configs contain mtimes, etc. (we don't know when the job will finish). So it must detect stale partitions (and downstreams) that need to be rebuilt?
- How do we handle non-materialize relationships outside the graph?
  - Answer: Provide build modes, but otherwise awaiting external data is a non-core problem
## Ideas

- Should we have an "optimistic" mode that builds all partitions that can be built?
- Emit an event stream for observability purposes?
## Partition Overlap

For example, we have two partitions we want to build for two different concerns, e.g. pulled by two separate triggers, and both of these partitions depend on some of the same upstreams.

- Do we need managed state, i.e. the "pending build graph"? Do we need an (internal, at least) data catalog?
  - Leave a door open, but don't get nerd sniped
- Make sure the `JobGraph` is merge-able
  - How do we merge data deps? (timeout is time-based) - Do we need to?
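If job configs are deterministic for a given ref set, "merge-able" can mean plain set union: shared upstreams collapse to one node. A sketch mirroring the `JobGraph` case class, with hashable tuples standing in for `JobConfig` (all shapes here are assumptions):

```python
# Merge two JobGraphs for the overlap case: union outputs and nodes,
# deduping job configs that two triggers both requested.
from dataclasses import dataclass


@dataclass(frozen=True)
class JobGraph:
    outputs: frozenset  # of partition refs
    nodes: frozenset    # of (hashable) job configs

    def merge(self, other: "JobGraph") -> "JobGraph":
        # Union is only sound if configure() is deterministic, so the
        # same ref always produces an identical (equal) config node.
        return JobGraph(self.outputs | other.outputs, self.nodes | other.nodes)
```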
## Data Ver & Invalidation

Sometimes there are minor changes that don't invalidate past produced data, and sometimes there are major changes that do invalidate past partitions. Examples:

- No invalidation: add an optional field for a new feature not relevant to past data
- Invalidation: whoops, we were calculating the score wrong

This is separate from "version the dataset", since a dataset version represents a structure/meaning, and partitions produced in the past can be incorrect for the intended structure/meaning and legitimately need to be overwritten. In contrast, new dataset versions allow new intended structure/meaning. This should be an optional concept (e.g. the default version is `v0.0.0`).
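The minor/major masking rule could start as a plain major-component comparison. A sketch assuming `vMAJOR.MINOR.PATCH` strings and no external semver library:

```python
# Only a major version bump invalidates previously produced partitions.
def invalidates(old_version: str, new_version: str) -> bool:
    """True when the bump is major, i.e. past partitions are stale."""
    def major(v: str) -> int:
        return int(v.lstrip("v").split(".")[0])
    return major(new_version) != major(old_version)
```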
## Why Deployability Matters

This needs to be deployable trivially from day one because:

- We want to "launch jobs" in an un-opinionated way - tell Bazel what platform you're building for, then boop the results off to that system and run it
- Being able to vend executables makes building weakly coupled apps easy (not a framework)
# Demo Development

1. `databuild_job` ✅
   1. `databuild_job.cfg` ✅
   2. `databuild_job.exec` ✅
   3. Tests ✅
   4. `databuild_job` (to `cfg` and `exec`) ✅
   5. Deployable `databuild_job` ✅
2. `databuild_graph` ✅
   1. `databuild_graph.analyze` ✅
   2. `databuild_graph` provider ✅
   3. `databuild_graph.exec` ✅
   4. `databuild_graph.build` ✅
   5. `databuild_graph.mermaid`
3. podcast reviews example
4. Reflect (data versioning/caching/partition manifests, partition overlap, ...?)
# Factoring

- Core - graph description, build, analysis, and execution
- Service - job/partition catalog, parallel execution, triggers, exposed service
- Product - accounts/RBAC, auth, delegates for exec/storage
# Service Sketch

```mermaid
flowchart
  codebase
  subgraph service
    data_service
  end
  subgraph database
    job_catalog
    partition_catalog
  end
  codebase -- deployed_to --> data_service
  data_service -- logs build events --> job_catalog
  data_service -- queries/records partition manifest --> partition_catalog
```
# Scratch

Implementation:

- Bazel to describe jobs/graphs
- Whatever you want to implement jobs and graphs (need solid interfaces)

```python
databuild_graph(
    name = "my_graph",
    jobs = [":my_job", ...],
    plan = ":my_graph_plan",
)

py_binary(
    name = "my_graph_plan",
    ...
)

databuild_job(
    name = "my_job",
    configure = ":my_job_configure",
    run = ":my_job_binary",
)

scala_binary(
    name = "my_job_configure",
    ...
)

scala_binary(
    name = "my_job_binary",
    ...
)
```