# Tenets

- No dependency knowledge necessary to materialize data
- Only local dependency knowledge to develop
- Not a framework (what does this mean?)

# Verbs / Phases

What do engineers describe?

- Jobs - unit of data processing
  - Configure executable (rule impl)
  - Run executable (generated file contents)
- Graphs - data service
  - Plan (resolving partition refs to jobs that build them)

Jobs:

1. `job.configure(refs) -> Seq[JobConfig]` - Produces a `JobConfig` for producing a set of partitions
2. `job.execute(refs) -> Seq[PartitionManifest]` - Produces a job config and immediately runs the job run executable with it

Graphs:

1. `graph.plan(refs) -> JobGraph` - Produces a `JobGraph` that fully describes the building of the requested partition refs (with preconditions)
2. `graph.execute(refs) -> Seq[PartitionManifest]` - Executes the `JobGraph` eagerly and produces the partition manifests of the underlying finished jobs (emits build events?)
3. `graph.lookup(refs) -> Map[JobLabel, ref]` - Resolves which job builds each requested ref

## Job Configuration

The process of fully parameterizing a job run based on the desired partitions. This parameterizes the job run executable.

```scala
case class JobConfig(
  // The partitions that this parameterization produces
  outputs: Set[PartitionRef],
  inputs: Set[DataDep],
  args: Seq[String],
  env: Map[String, String],
  // Path to the executable that will be run
  executable: String,
)

case class DataDep(
  // E.g. query, materialize
  depType: DepType,
  ref: PartitionRef,
  // Excluded for now
  // timeoutSeconds: Int,
)
```

## Job Execution

Jobs produce partition manifests:

```scala
case class PartitionManifest(
  outputs: Set[PartitionRef],
  inputs: Set[PartitionManifest],
  startTime: Long,
  endTime: Long,
  config: JobConfig,
)
```

## Graph Planning

The set of job configs that, if run in topo-sorted order, will produce these `outputs`.

```scala
case class JobGraph(
  outputs: Set[PartitionRef],
  // Needs the executable too? How do we reference here?
  nodes: Set[JobConfig],
)
```

The databuild graph needs to:

- Analyze: use the provided partition refs to determine all involved jobs and their configs in a job graph
- Plan: determine the literal jobs to execute (skipping/pruning valid cached partitions)
- Compile: compile the graph into an artifact that runs the materialize process <-- sounds like an application consideration?

Perhaps these are different capabilities - e.g. producing a bash script that runs the build process is a fundamentally separate thing from the smarter stateful thing that manages these builds over time, pruning cached builds, etc. And we could make the **partition catalog pluggable**!

```mermaid
flowchart
  jobs & outputs --> plan --> job_graph
  job_graph --> compile.bash --> build_script
  job_graph & partition_catalog --> partition_build_service
```

# Build Graph / Action Graph

- Merkle tree + dataset versions (semver?)
  - Mask upstream changes that aren't major
- Content-addressable storage based on action keys that point to the Merkle tree
- Compile to a set of build files? (thrash with the action graph?)
- A catalog of partition manifests + code artifacts enables this
  - Start with a basic presence check
  - Side effects expected
- Partition manifests as output artifact?
  - This is an orchestration-layer concern because `configure` needs to be able to invalidate the cache

# Assumptions

- Job runs are independent, e.g. if run X is already producing partition A, run Y can safely prune A... during configure?
- Job runs are idempotent (e.g. overwrite)
- A `databuild_graph` can be deployed "unambiguously" (lol)

# Questions

- How does partition overlap work? Can it be pruned? Or should it throw during configure? This sounds like a very common case.
  - Answer: this is a responsibility of a live service backed by a datastore. If jobs are in fact independent, then refs requested by another build can be "delegated" to the jobs already running for those refs.
- How do we implement job lookup for graphs? Is this a job catalog thing?
  - Answer: Yes, job graphs have a `lookup` attr
- How do graphs handle caching? We can't plan a whole graph if job configs contain mtimes, etc. (we don't know when the job will finish). So the graph must detect stale partitions (and their downstreams) that need to be rebuilt?
- How do we handle non-materialize relationships outside the graph?
  - Answer: Provide build modes, but otherwise awaiting external data is a non-core problem

## Ideas

- Should we have an "optimistic" mode that builds all partitions that can be built?
- Emit an event stream for observability purposes?

## Partition Overlap

For example, we have two partitions we want to build for two different concerns, e.g. pulled by two separate triggers, and both of these partitions depend on some of the same upstreams.

- Do we need managed state, which is the "pending build graph"? Do we need an (internal, at least) data catalog?
  - Leave a door open, but don't get nerd-sniped
- Make sure the `JobGraph` is merge-able
  - How do we merge data deps? (timeout is time-based)
  - Do we need to?

## Data Ver & Invalidation

Sometimes there are minor changes that don't invalidate past produced data, and sometimes there are major changes that do invalidate past partitions. Examples:

- No invalidation: add an optional field for a new feature that isn't relevant for past data
- Invalidation: whoops, we were calculating the score wrong

This is separate from "versioning the dataset": a dataset version represents a structure/meaning, and partitions produced in the past can be incorrect for the intended structure/meaning and legitimately need to be overwritten. In contrast, new dataset versions allow new intended structure/meaning.

This should be an optional concept (e.g. the default version is `v0.0.0`).
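To make the major/minor distinction concrete, here is a minimal sketch (in Python, with hypothetical names) where a past partition is invalidated only when the dataset's major version has advanced past the version recorded in its manifest:

```python
def parse_version(v: str) -> tuple:
    """Parse a `vMAJOR.MINOR.PATCH` string into an int tuple."""
    return tuple(int(part) for part in v.lstrip("v").split("."))

def is_invalidated(partition_version: str, current_version: str) -> bool:
    """A past partition is invalidated only by a major version bump.

    Minor/patch bumps (e.g. adding an optional field) leave old
    partitions valid; a major bump (e.g. the score was computed
    wrong) means they must be rebuilt.
    """
    return parse_version(current_version)[0] > parse_version(partition_version)[0]

# Default version is v0.0.0, so the concept stays optional
assert not is_invalidated("v0.0.0", "v0.1.0")  # minor change: still valid
assert is_invalidated("v0.1.0", "v1.0.0")      # major change: rebuild
```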
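The "run in topo-sorted order, pruning valid cached partitions" idea from Graph Planning can also be sketched directly. This assumes hypothetical `JobConfig`-like dicts (with `name`, `outputs`, and materialize-only `inputs`) and a partition catalog that is just a set of already-built refs; a real planner would also need cycle detection and staleness checks.

```python
def plan_order(configs, cached):
    """Return job names in topo-sorted order, pruning jobs whose
    outputs are all already present in the partition catalog."""
    # Map each partition ref to the name of the config that produces it
    producer = {ref: cfg["name"] for cfg in configs for ref in cfg["outputs"]}
    by_name = {cfg["name"]: cfg for cfg in configs}
    ordered, visited = [], set()

    def visit(name):
        if name in visited:
            return
        visited.add(name)
        cfg = by_name[name]
        for ref in cfg["inputs"]:
            if ref in producer:  # dep is built inside this graph
                visit(producer[ref])
        # Prune jobs whose outputs are all cached already
        if not all(ref in cached for ref in cfg["outputs"]):
            ordered.append(name)

    for name in by_name:
        visit(name)
    return ordered

configs = [
    {"name": "raw", "outputs": ["raw/2024-01-01"], "inputs": []},
    {"name": "agg", "outputs": ["agg/2024-01-01"], "inputs": ["raw/2024-01-01"]},
]
print(plan_order(configs, cached=set()))               # ['raw', 'agg']
print(plan_order(configs, cached={"raw/2024-01-01"}))  # ['agg']
```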
## Why Deployability Matters

This needs to be trivially deployable from day one because:

- We want to "launch jobs" in an un-opinionated way - tell Bazel what platform you're building for, then boop the results off to that system and run it
- Being able to vend executables makes building weakly coupled apps easy (not a framework)

# Demo Development

1. `databuild_job` ✅
   1. `databuild_job.cfg` ✅
   2. `databuild_job.exec` ✅
   3. Tests ✅
   4. `databuild_job` (to `cfg` and `exec`) ✅
   5. Deployable `databuild_job` ✅
2. `databuild_graph` ✅
   1. `databuild_graph.analyze` ✅
   2. `databuild_graph` provider ✅
   3. `databuild_graph.exec` ✅
   4. `databuild_graph.build` ✅
   5. `databuild_graph.mermaid`
3. Podcast reviews example
4. Reflect (data versioning/caching/partition manifests, partition overlap, ...?)

# Factoring

- Core - graph description, build, analysis, and execution
- Service - job/partition catalog, parallel execution, triggers, exposed service
- Product - accounts/RBAC, auth, delegates for exec/storage

# Service Sketch

```mermaid
flowchart
  codebase
  subgraph service
    data_service
  end
  subgraph database
    job_catalog
    partition_catalog
  end
  codebase -- deployed_to --> data_service
  data_service -- logs build events --> job_catalog
  data_service -- queries/records partition manifests --> partition_catalog
```

# Scratch

Implementation:

- Bazel to describe jobs/graphs
- Whatever you want to implement jobs and graphs (need solid interfaces)

```python
databuild_graph(
    name = "my_graph",
    jobs = [":my_job", ...],
    plan = ":my_graph_plan",
)

py_binary(
    name = "my_graph_plan",
    ...
)

databuild_job(
    name = "my_job",
    configure = ":my_job_configure",
    run = ":my_job_binary",
)

scala_binary(
    name = "my_job_configure",
    ...
)

scala_binary(
    name = "my_job_binary",
    ...
)
```
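To firm up the "need solid interfaces" point, here is a hedged sketch of what a `my_job_configure` binary could look like: it takes the requested partition refs as arguments and prints one `JobConfig` per ref as JSON. The field names mirror the `JobConfig` case class above; the CLI shape, the `upstream/` dependency convention, and the helper names are assumptions, not a settled interface.

```python
import json
import sys

def configure(refs):
    """Hypothetical configure step: emit one JobConfig per requested ref.

    Mirrors the JobConfig case class fields: outputs, inputs, args,
    env, executable.
    """
    configs = []
    for ref in refs:
        configs.append({
            "outputs": [ref],
            # Assumed convention: each partition depends on the matching
            # upstream partition; real logic would resolve actual deps.
            "inputs": [{"depType": "materialize", "ref": f"upstream/{ref}"}],
            "args": ["--partition", ref],
            "env": {},
            "executable": "my_job_binary",
        })
    return configs

if __name__ == "__main__":
    # e.g. ./my_job_configure reviews/2024-01-01 reviews/2024-01-02
    json.dump(configure(sys.argv[1:]), sys.stdout, indent=2)
```

The run executable would then be invoked once per `JobConfig`, with that config's `args` and `env`, and would be expected to produce exactly the listed `outputs`.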