# Tenets
- No dependency knowledge necessary to materialize data
- Only local dependency knowledge to develop
- Not a framework (what does this mean?)
# Verbs / Phases
- What do engineers describe?
  - Jobs - unit of data processing
    - Configure executable (rule impl)
    - Run executable (generated file contents)
  - Graphs - data service
    - Plan (resolving partition refs to jobs that build them)
**Jobs**
1. `job.configure(refs) -> Seq[JobConfig]` - Produces the `JobConfig`s needed to build a set of partitions
2. `job.execute(refs) -> Seq[PartitionManifest]` - Produces a job config and immediately runs the job's run executable with it
**Graphs**
1. `graph.plan(refs) -> JobGraph` - Produces a `JobGraph` that fully describes the building of the requested partition refs (with preconditions)
2. `graph.execute(refs) -> Seq[PartitionManifest]` - Executes the `JobGraph` eagerly, producing the partition manifests of the underlying finished jobs (emits build events?)
3. `graph.lookup(refs) -> Map[JobLabel, ref]` - Resolves which job builds each requested partition ref
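A minimal Scala sketch of these verbs as interfaces, assuming `PartitionRef` and `JobLabel` are simple string-backed case classes (not defined elsewhere in these notes) and using the `JobConfig`, `PartitionManifest`, and `JobGraph` shapes from the sections below:
```scala
// Hypothetical identifier types; the real shapes are still undecided.
case class PartitionRef(dataset: String, partition: String)
case class JobLabel(value: String)

trait Job {
  // Produce the configs needed to build the requested partitions.
  def configure(refs: Set[PartitionRef]): Seq[JobConfig]
  // Configure and immediately run, returning manifests of what was built.
  def execute(refs: Set[PartitionRef]): Seq[PartitionManifest]
}

trait Graph {
  // Resolve the requested refs into a fully described job graph.
  def plan(refs: Set[PartitionRef]): JobGraph
  // Eagerly run the planned graph, returning the finished jobs' manifests.
  def execute(refs: Set[PartitionRef]): Seq[PartitionManifest]
  // Resolve which job is responsible for each requested partition ref.
  def lookup(refs: Set[PartitionRef]): Map[JobLabel, PartitionRef]
}
```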
## Job Configuration
The process of fully parameterizing a job run from the desired partition refs; the resulting `JobConfig` parameterizes the job run executable.
```scala
case class JobConfig(
  // The partitions that this parameterization produces
  outputs: Set[PartitionRef],
  inputs: Set[DataDep],
  args: Seq[String],
  env: Map[String, String],
  // Path to the executable that will be run
  executable: String,
)

case class DataDep(
  // E.g. query, materialize
  depType: DepType,
  ref: PartitionRef,
  // Excluded for now
  // timeoutSeconds: Int,
)
```
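A minimal sketch of a job's `configure` for a date-partitioned dataset, using the shapes above; the `DepType` cases, dataset names, and executable path are all assumptions for illustration:
```scala
// Hypothetical dep types; the notes only name "query" and "materialize".
sealed trait DepType
case object Materialize extends DepType
case object Query extends DepType

object DailyScoresJob {
  // One JobConfig per requested output partition (here, one per date).
  def configure(refs: Set[PartitionRef]): Seq[JobConfig] =
    refs.toSeq.map { ref =>
      JobConfig(
        outputs = Set(ref),
        // Each output date depends on the raw events for the same date.
        inputs = Set(DataDep(Materialize, PartitionRef("raw_events", ref.partition))),
        args = Seq("--date", ref.partition),
        env = Map("OUTPUT_DATASET" -> ref.dataset),
        executable = "bazel-bin/jobs/daily_scores/run"
      )
    }
}
```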
## Job Execution
Jobs produce partition manifests:
```scala
case class PartitionManifest(
  outputs: Set[PartitionRef],
  inputs: Set[PartitionManifest],
  startTime: Long,
  endTime: Long,
  config: JobConfig,
)
```
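A sketch of `job.execute` as configure-then-run with wall-clock bounds, using the shapes above; `runExecutable` and `lookupManifest` are hypothetical stand-ins for launching the configured executable and fetching the manifests of already-built input partitions:
```scala
def execute(configs: Seq[JobConfig],
            runExecutable: JobConfig => Unit,
            lookupManifest: PartitionRef => PartitionManifest): Seq[PartitionManifest] =
  configs.map { config =>
    val start = System.currentTimeMillis()
    runExecutable(config) // side effects expected: the output partitions get written
    PartitionManifest(
      outputs = config.outputs,
      inputs = config.inputs.map(dep => lookupManifest(dep.ref)),
      startTime = start,
      endTime = System.currentTimeMillis(),
      config = config
    )
  }
```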
## Graph Planning
A plan is the set of job configs that, if run in topo-sorted order, will produce the requested `outputs`.
```scala
case class JobGraph(
  outputs: Set[PartitionRef],
  // Needs the executable too? How do we reference here?
  nodes: Set[JobConfig],
)
```
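The topo-sorted order can be derived from the configs themselves, since a node's inputs name partitions that other nodes output. A depth-first sketch, assuming the graph is acyclic and treating inputs with no producer in the graph as external:
```scala
// Orders nodes so every config runs after the configs that produce its inputs.
def topoSort(graph: JobGraph): Seq[JobConfig] = {
  // Which node produces each partition (overlapping outputs not handled here).
  val producers: Map[PartitionRef, JobConfig] =
    graph.nodes.flatMap(node => node.outputs.map(_ -> node)).toMap

  val ordered = scala.collection.mutable.LinkedHashSet.empty[JobConfig]

  def visit(node: JobConfig): Unit =
    if (!ordered.contains(node)) {
      node.inputs.flatMap(dep => producers.get(dep.ref)).foreach(visit)
      ordered += node
    }

  graph.nodes.foreach(visit)
  ordered.toSeq
}
```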
The databuild graph needs to:
- Analyze: use the provided partition refs to determine all involved jobs and their configs in a job graph
- Plan: determine the literal jobs to execute (skipping/pruning valid cached partitions)
- Compile: compile the graph into an artifact that runs the materialize process <-- sounds like an application consideration? (a sketch follows the diagram below)
Perhaps these are different capabilities - e.g. producing a bash script that runs the build process is fundamentally separate from the smarter, stateful thing that manages these builds over time, pruning cached builds, etc. And we could make the **partition catalog pluggable**!
```mermaid
flowchart
jobs & outputs --> plan --> job_graph
job_graph --> compile.bash --> build_script
job_graph & partition_catalog --> partition_build_service
```
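A sketch of the `compile.bash` leg: once the nodes are topo-sorted, the build script is just the configs serialized as shell commands. Quoting is naive here, and `topoSort` is the planning sketch above:
```scala
def compileToBash(graph: JobGraph): String = {
  val commands = topoSort(graph).map { config =>
    val env  = config.env.toSeq.sorted.map { case (k, v) => s"$k='$v'" }.mkString(" ")
    val args = config.args.mkString(" ")
    s"$env ${config.executable} $args".trim
  }
  ("#!/usr/bin/env bash" +: "set -euo pipefail" +: commands).mkString("\n")
}
```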
# Build Graph / Action Graph
- merkle tree + dataset versions (semver?)
- mask upstream changes that aren't major
- content-addressable storage based on action keys that point into the merkle tree (see the action-key sketch after this list)
- compile to set of build files? (thrash with action graph?)
- catalog of partition manifests + code artifacts enables this
- start with basic presence check
- side effects expected
- partition manifests as output artifact?
- this is an orchestration-layer concern because `configure` needs to be able to invalidate the cache
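A sketch of the action-key idea: hash everything that could change the output (a digest of the executable, the args, the env, and the keys already recorded for the input partitions), then use presence of that key in the partition catalog as the basic cache check. The digest inputs here are assumptions:
```scala
import java.security.MessageDigest

// Content-addressable key for a configured job run; `executableDigest` and
// `inputKeys` are hypothetical values supplied by the build/catalog layer.
def actionKey(config: JobConfig,
              executableDigest: String,
              inputKeys: Map[PartitionRef, String]): String = {
  val parts = Seq(
    executableDigest,
    config.args.mkString("\u0000"),
    config.env.toSeq.sorted.map { case (k, v) => s"$k=$v" }.mkString("\u0000"),
    config.inputs.toSeq.map(dep => inputKeys.getOrElse(dep.ref, "external")).sorted.mkString("\u0000")
  )
  val bytes = MessageDigest.getInstance("SHA-256").digest(parts.mkString("\u0001").getBytes("UTF-8"))
  bytes.map("%02x".format(_)).mkString
}
```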
# Assumptions
- Job runs are independent, e.g. if run X is already producing partition A, run Y can safely prune A... during configure?
- Job runs are idempotent (e.g. overwrite)
- A `databuild_graph` can be deployed "unambiguously" (lol)
# Questions
- How does partition overlap work? Can it be pruned? Or throw during configure? This sounds like a very common case
  - Answer: this is a responsibility of a live service backed by a datastore. If jobs are in fact independent, then refs requested by another build can be "delegated" to the jobs already building those refs.
- How do we implement job lookup for graphs? Is this a job catalog thing?
  - Answer: Yes, job graphs have a `lookup` attr
- How do graphs handle caching? We can't plan a whole graph if job configs contain mtimes, etc. (we don't know when the job will finish). So the graph must detect stale partitions (and their downstreams) that need to be rebuilt?
- How do we handle non-materialize relationships outside the graph?
  - Answer: Provide build modes, but otherwise awaiting external data is a non-core problem
## Ideas
- Should we have an "optimistic" mode that builds all partitions that can be built?
- Emit an event stream for observability purposes?
## Partition Overlap
For example, we have two partitions we want to build for two different concerns, e.g. pulled by two separate triggers, and both of these partitions depend on some of the same upstreams.
- Do we need managed state, i.e. the "pending build graph"? Do we need an (internal, at least) data catalog?
- Leave a door open, but don't get nerd-sniped
- Make sure the `JobGraph` is merge-able (a sketch follows this list)
- How do we merge data deps? (timeouts are time-based) - Do we need to?
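A sketch of what "merge-able" could mean at the `JobGraph` level, assuming the shapes above: union the requested outputs, dedupe identical nodes, and treat two distinct configs claiming the same outputs as an error rather than trying to resolve them:
```scala
def merge(a: JobGraph, b: JobGraph): JobGraph = {
  // Identical configs dedupe via the Set; differing configs for the same
  // outputs are a conflict this sketch refuses to resolve.
  val byOutputs = (a.nodes ++ b.nodes).groupBy(_.outputs)
  val conflicts = byOutputs.collect { case (outputs, configs) if configs.size > 1 => outputs }
  require(conflicts.isEmpty, s"conflicting configs for outputs: $conflicts")
  JobGraph(outputs = a.outputs ++ b.outputs, nodes = a.nodes ++ b.nodes)
}
```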
## Data Ver & Invalidation
Sometimes there are minor changes that don't invalidate previously produced data, and sometimes there are major changes that do invalidate past partitions. Examples:
- No invalidate: add optional field for new feature not relevant for past data
- Invalidate: whoops, we were calculating the score wrong
This is separate from "version the dataset": a dataset version represents an intended structure/meaning, and past partitions can be incorrect for that intended structure/meaning and legitimately need to be overwritten, whereas a new dataset version introduces a new intended structure/meaning. This should be an optional concept (e.g. the default version is `v0.0.0`).
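A sketch of how a semver-style data version could drive invalidation, assuming each partition manifest were extended to record the version it was built with (a field the `PartitionManifest` above does not yet have):
```scala
// Hypothetical data version: a major bump invalidates previously built
// partitions; minor/patch bumps do not. Defaults to v0.0.0 so the concept
// stays optional.
case class DataVersion(major: Int, minor: Int, patch: Int)
object DataVersion { val default: DataVersion = DataVersion(0, 0, 0) }

def invalidates(current: DataVersion, builtWith: DataVersion): Boolean =
  current.major > builtWith.major
```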
## Why Deployability Matters
This needs to be deployable trivially from day one because:
- We want to "launch jobs" in an unopinionated way - tell Bazel what platform you're building for, then boop the results off to that system, and run it
- Being able to vend executables makes building weakly coupled apps easy (not a framework)
# Demo Development
1. `databuild_job`
1. `databuild_job.cfg`
2. `databuild_job.exec`
3. Tests
4. `databuild_job` (to `cfg` and `exec`)
5. Deployable `databuild_job`
2. `databuild_graph`
1. `databuild_graph.analyze`
2. `databuild_graph` provider
3. `databuild_graph.exec`
4. `databuild_graph.build`
5. `databuild_graph.mermaid`
5. podcast reviews example
6. Reflect (data versioning/caching/partition manifests, partition overlap, ...?)
# Factoring
- Core - graph description, build, analysis, and execution
- Service - job/partition catalog, parallel execution, triggers, exposed service
- Product - Accounts/RBAC, auth, delegates for exec/storage
# Service Sketch
```mermaid
flowchart
codebase
subgraph service
data_service
end
subgraph database
job_catalog
partition_catalog
end
codebase -- deployed_to --> data_service
data_service -- logs build events --> job_catalog
data_service -- queries/records partition manifest --> partition_catalog
```
# Scratch
Implementation:
- Bazel to describe jobs/graphs
- Whatever you want to implement jobs and graphs with (need solid interfaces)
```python
databuild_graph(
    name = "my_graph",
    jobs = [":my_job", ...],
    plan = ":my_graph_plan",
)

py_binary(
    name = "my_graph_plan",
    ...
)

databuild_job(
    name = "my_job",
    configure = ":my_job_configure",
    run = ":my_job_binary",
)

scala_binary(
    name = "my_job_configure",
    ...
)

scala_binary(
    name = "my_job_binary",
    ...
)
```