databuild/design/core-build.md

# Core Build
Purpose: Centralize the build logic and semantics in a performant, correct core.
## Architecture
- Jobs depend on input partitions and produce output partitions.
- Graphs compose jobs to fully plan and execute builds of requested partitions.
- Both jobs and graphs emit events via the [build event log](./build-event-log.md) to update build state.
- A common interface for executing job and graph build actions is shared by all clients (e.g. CLI,
service)
- Jobs and graphs use wrappers to implement configuration and [observability](./observability.md)
- Graph-based composition is the basis for databuild application [deployment](./deploy-strategies.md)
## Jobs
Jobs are the atomic unit of work in databuild, executed via a Rust-based wrapper that provides:
- Structured logging and telemetry collection
- Platform-agnostic execution across local, container, and cloud environments
- Zero-network-dependency operation via log-based communication
- Standardized error handling and exit code categorization
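The exit-code categorization might be sketched as below; the `JobExit` variants and the `categorize` helper are illustrative assumptions, not the wrapper's actual interface:

```rust
// Hedged sketch of exit-code categorization in the job wrapper.
// Variant names are assumptions for illustration only.
#[derive(Debug, PartialEq)]
enum JobExit {
    Success,         // task exited 0
    TaskFailed(i32), // task exited non-zero
    LaunchFailed,    // task never produced an exit code (e.g. killed by signal)
}

fn categorize(exit_code: Option<i32>) -> JobExit {
    match exit_code {
        Some(0) => JobExit::Success,
        Some(code) => JobExit::TaskFailed(code),
        None => JobExit::LaunchFailed,
    }
}
```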
### `job.config`
Purpose: Enable planning of the execution graph. Executed in-process when possible for speed. For interface details, see
[`PartitionRef`](./glossary.md#partitionref) and [`JobConfig`](./glossary.md#jobconfig) in
[`databuild.proto`](../databuild/databuild.proto).
```rust
trait DataBuildJob {
    fn config(outputs: Vec<PartitionRef>) -> JobConfig;
}
```
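A concrete job's `config` might look like the sketch below. The `DailySalesJob` name, the `raw_sales/` input naming, and the struct fields are hypothetical; the real types come from `databuild.proto`:

```rust
// Minimal stand-ins for the protobuf types; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct PartitionRef {
    id: String,
}

#[derive(Debug)]
struct JobConfig {
    inputs: Vec<PartitionRef>,
    outputs: Vec<PartitionRef>,
}

struct DailySalesJob;

impl DailySalesJob {
    // For each requested output partition, declare the input partitions it
    // needs, so graph analysis can keep walking backwards.
    fn config(outputs: Vec<PartitionRef>) -> JobConfig {
        let inputs = outputs
            .iter()
            .map(|p| PartitionRef { id: format!("raw_sales/{}", p.id) })
            .collect();
        JobConfig { inputs, outputs }
    }
}
```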
#### `job.config` State Diagram
```mermaid
flowchart TD
begin((begin)) --> validate_args
emit_job_config_fail --> fail((fail))
validate_args -- fail --> emit_arg_validate_fail --> emit_job_config_fail
validate_args -- success --> emit_arg_validate_success --> run_config
run_config -- fail --> emit_config_fail --> emit_job_config_fail
run_config -- success --> emit_config_success ---> success((success))
```
### `job.exec`
Purpose: Execute the job inside the exec wrapper.
```rust
trait DataBuildJob {
    fn exec(config: JobConfig) -> PartitionManifest;
}
```
#### `job.exec` State Diagram
```mermaid
flowchart TD
begin((begin)) --> wrapper_validate_config
emit_job_exec_fail --> fail((fail))
wrapper_validate_config -- fail --> emit_config_validate_fail --> emit_job_exec_fail
wrapper_validate_config -- success --> emit_config_validate_success --> wrapper_launch_task
wrapper_launch_task -- fail --> emit_task_launch_fail --> emit_job_exec_fail
wrapper_launch_task -- success --> emit_task_launch_success --> wrapper_monitor_task
wrapper_monitor_task -- heartbeat timer --> emit_heartbeat --> wrapper_monitor_task
wrapper_monitor_task -- job stderr --> emit_log_entry --> wrapper_monitor_task
wrapper_monitor_task -- job stdout --> emit_log_entry --> wrapper_monitor_task
wrapper_monitor_task -- non-zero exit --> emit_task_failed --> emit_job_exec_fail
wrapper_monitor_task -- zero exit --> emit_task_success --> emit_partition_manifest
emit_partition_manifest --> success((success))
```
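The `wrapper_launch_task` / `wrapper_monitor_task` portion of the diagram above can be sketched roughly as follows. `run_and_monitor` is a hypothetical stand-in: it collects string events instead of writing to the build event log, and stderr handling and heartbeats are elided:

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

// Hedged sketch: launch the task with piped output, forward each stdout line
// as an emit_log_entry event, then categorize the exit status.
fn run_and_monitor(cmd: &str) -> (Vec<String>, bool) {
    // wrapper_launch_task
    let mut child = Command::new("sh")
        .arg("-c")
        .arg(cmd)
        .stdout(Stdio::piped())
        .spawn()
        .expect("task launch failed");

    // wrapper_monitor_task: job stdout --> emit_log_entry
    let mut events = Vec::new();
    let reader = BufReader::new(child.stdout.take().unwrap());
    for line in reader.lines().flatten() {
        events.push(format!("emit_log_entry: {line}"));
    }

    // zero exit --> emit_task_success; non-zero exit --> emit_task_failed
    let success = child.wait().map(|s| s.success()).unwrap_or(false);
    (events, success)
}
```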
## Graphs
Graphs are the unit of composition. To `analyze` (plan) task graphs (see [`JobGraph`](./glossary.md#jobgraph)), they
iteratively walk back from the requested output partitions, invoking `job.config` until no unresolved partitions
remain. To `build` partitions, the graph runs `analyze` then iteratively executes the resulting task graph.
### `graph.analyze`
Purpose: Produce a complete task graph to materialize a requested set of partitions.
```rust
trait DataBuildGraph {
    fn analyze(outputs: Vec<PartitionRef>) -> JobGraph;
}
```
#### `graph.analyze` State Diagram
```mermaid
flowchart TD
begin((begin)) --> initialize_missing_partitions --> dispatch_missing_partitions
emit_graph_analyze_fail --> fail((fail))
dispatch_missing_partitions -- fail --> emit_partition_dispatch_fail --> emit_graph_analyze_fail
dispatch_missing_partitions -- success --> cycle_detected?
cycle_detected? -- yes --> emit_cycle_detected --> emit_graph_analyze_fail
cycle_detected? -- no --> remaining_missing_partitions?
remaining_missing_partitions? -- yes --> dispatch_missing_partitions
remaining_missing_partitions? -- no --> emit_job_graph --> success((success))
```
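The iterative backward walk can be sketched as below. This is a simplified stand-in: partitions are plain strings, a `config` callback substitutes for real `job.config` dispatch (returning `None` for partitions that already exist), and cycle detection is elided:

```rust
use std::collections::{HashSet, VecDeque};

type PartitionRef = String;

// Hedged sketch of graph.analyze: walk back from the requested outputs,
// invoking config for each missing partition until none remain unresolved.
// Returns (partition, inputs) pairs standing in for the task graph.
fn analyze(
    requested: Vec<PartitionRef>,
    config: impl Fn(&PartitionRef) -> Option<Vec<PartitionRef>>,
) -> Vec<(PartitionRef, Vec<PartitionRef>)> {
    let mut graph = Vec::new();
    let mut seen: HashSet<PartitionRef> = HashSet::new();
    let mut queue: VecDeque<PartitionRef> = requested.into();

    // dispatch_missing_partitions until remaining_missing_partitions? is "no"
    while let Some(part) = queue.pop_front() {
        if !seen.insert(part.clone()) {
            continue; // already planned
        }
        if let Some(inputs) = config(&part) {
            for input in &inputs {
                queue.push_back(input.clone());
            }
            graph.push((part, inputs));
        }
    }
    graph
}
```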
### `graph.build`
Purpose: Analyze, then execute the resulting task graph.
```rust
trait DataBuildGraph {
    fn build(outputs: Vec<PartitionRef>);
}
```
#### `graph.build` State Diagram
```mermaid
flowchart TD
begin((begin)) --> graph_analyze
emit_graph_build_fail --> fail((fail))
graph_analyze -- fail --> emit_graph_build_fail
graph_analyze -- success --> initialize_ready_jobs --> remaining_ready_jobs?
remaining_ready_jobs? -- yes --> emit_remaining_jobs --> schedule_jobs
remaining_ready_jobs? -- none schedulable --> emit_jobs_unschedulable --> emit_graph_build_fail
schedule_jobs -- fail --> emit_job_schedule_fail --> emit_graph_build_fail
schedule_jobs -- success --> emit_job_schedule_success --> await_jobs
await_jobs -- job_failure --> emit_job_failure --> emit_job_cancels --> cancel_running_jobs
cancel_running_jobs --> emit_graph_build_fail
await_jobs -- N seconds since heartbeat --> emit_heartbeat --> await_jobs
await_jobs -- job_success --> remaining_ready_jobs?
remaining_ready_jobs? -- no ---------> emit_graph_build_success --> success((success))
```
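The ready-jobs loop in the diagram above can be sketched as follows. The `Job` triple is a hypothetical stand-in for the analyzed task graph, jobs execute synchronously, and failure handling beyond the unschedulable check is elided:

```rust
use std::collections::HashSet;

// Hypothetical job record: a name, input partitions, and one output partition.
struct Job {
    name: &'static str,
    inputs: Vec<&'static str>,
    output: &'static str,
}

// Hedged sketch of graph.build execution: repeatedly schedule jobs whose
// inputs are all built; fail if jobs remain but none are schedulable.
fn build(mut pending: Vec<Job>, existing: &[&str]) -> Result<Vec<&'static str>, String> {
    let mut built: HashSet<&str> = existing.iter().copied().collect();
    let mut order = Vec::new();
    while !pending.is_empty() {
        // remaining_ready_jobs?
        let (ready, blocked): (Vec<Job>, Vec<Job>) = pending
            .into_iter()
            .partition(|j| j.inputs.iter().all(|i| built.contains(i)));
        if ready.is_empty() {
            // none schedulable --> emit_jobs_unschedulable
            return Err("jobs unschedulable".into());
        }
        for job in ready {
            // job_success --> output partition becomes available
            built.insert(job.output);
            order.push(job.name);
        }
        pending = blocked;
    }
    Ok(order) // emit_graph_build_success
}
```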
## Correctness Strategy
- Core component interfaces are described in [`databuild.proto`](../databuild/databuild.proto), a protobuf interface
shared by all core components and all [GSLs](./graph-specification.md).
- [GSLs](./graph-specification.md) implement ergonomic graph, job, and partition helpers that make coupling explicit
- Graphs automatically detect and raise on non-unique job -> partition mappings
- Graph and job processes are fully described by state diagrams, whose state transitions are logged to the
[build event log](./build-event-log.md).
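The non-unique job -> partition check might look like the sketch below; `check_unique_producers` and its `(job, partition)` pair encoding are illustrative assumptions:

```rust
use std::collections::HashMap;

// Hedged sketch: every partition must be produced by exactly one job, and
// graph analysis raises on any duplicate producer.
fn check_unique_producers(job_outputs: &[(&str, &str)]) -> Result<(), String> {
    let mut producer: HashMap<&str, &str> = HashMap::new();
    for &(job, partition) in job_outputs {
        if let Some(prev) = producer.insert(partition, job) {
            return Err(format!(
                "partition {partition} produced by both {prev} and {job}"
            ));
        }
    }
    Ok(())
}
```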
## Partition Delegation
- Sometimes a partition already exists, or another build request is already planning on producing a partition
- A later build request will delegate to the already existing build request for that partition
- The later build request writes an event to the [build event log](./build-event-log.md) referencing the ID
of the delegate, enabling traceability and visualization
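A delegation event might be shaped like the sketch below; the struct and field names are illustrative, not the real `databuild.proto` schema:

```rust
// Hedged sketch of a delegation event as it might appear in the event log.
#[derive(Debug, PartialEq)]
struct DelegationEvent {
    partition: String,
    // ID of the earlier build request already producing this partition.
    delegated_to_build_id: String,
}

fn delegate(partition: &str, existing_build: &str) -> DelegationEvent {
    DelegationEvent {
        partition: partition.to_string(),
        delegated_to_build_id: existing_build.to_string(),
    }
}
```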
## Heartbeats / Health Checks
- Which strategy do we use?
- If we are launching tasks to a place we can't health check, how could they heartbeat?