databuild/design/core-build.md
2025-07-26 00:48:18 -07:00

Core Build

Purpose: Centralize the build logic and semantics in a performant, correct core.

Architecture

  • Jobs depend on input partitions and produce output partitions.
  • Graphs compose jobs to fully plan and execute builds of requested partitions.
  • Both jobs and graphs emit events via the build event log to update build state.
  • A common interface executes job and graph build actions; different clients (e.g. CLI, service) rely on it.
  • Jobs and graphs use wrappers to implement configuration and observability.
  • Graph-based composition is the basis for databuild application deployment.

Jobs

Jobs are the atomic unit of work in databuild.

  • The job wrapper handles configuration, observability, and record keeping.

job.config

Purpose: Enable planning of the execution graph. Executed in-process when possible for speed. For interface details, see PartitionRef and JobConfig in databuild.proto.

trait DataBuildJob {
  fn config(outputs: Vec<PartitionRef>) -> JobConfig;
}
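As a sketch, a config implementation maps each requested output partition to the inputs it needs. The types below are simplified stand-ins for the PartitionRef and JobConfig messages in databuild.proto, and DailySales is a hypothetical job:

```rust
// Simplified stand-ins for the databuild.proto messages.
#[derive(Debug, Clone, PartialEq)]
pub struct PartitionRef(pub String);

#[derive(Debug)]
pub struct JobConfig {
    pub inputs: Vec<PartitionRef>,
    pub outputs: Vec<PartitionRef>,
}

pub trait DataBuildJob {
    fn config(outputs: Vec<PartitionRef>) -> JobConfig;
}

// Hypothetical job: each sales partition depends on the raw-events
// partition for the same date.
pub struct DailySales;

impl DataBuildJob for DailySales {
    fn config(outputs: Vec<PartitionRef>) -> JobConfig {
        let inputs = outputs
            .iter()
            .map(|p| PartitionRef(p.0.replace("sales/", "raw_events/")))
            .collect();
        JobConfig { inputs, outputs }
    }
}
```

Because config is pure over its arguments, the graph can invoke it in-process during planning without launching a task.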

job.config State Diagram

flowchart TD
    begin((begin)) --> validate_args
    emit_job_config_fail --> fail((fail))
    validate_args -- fail --> emit_arg_validate_fail --> emit_job_config_fail
    validate_args -- success --> emit_arg_validate_success --> run_config
    run_config -- fail --> emit_config_fail --> emit_job_config_fail
    run_config -- success --> emit_config_success ---> success((success))

job.exec

Purpose: Execute job in exec wrapper.

trait DataBuildJob {
  fn exec(config: JobConfig) -> PartitionManifest;
}
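A minimal exec sketch, again with simplified stand-ins for the proto messages: the real wrapper launches a task, awaits its exit code, and calculates metadata (as the state diagram below shows), whereas this stub only claims the configured outputs.

```rust
// Simplified stand-ins for the databuild.proto messages.
#[derive(Debug, Clone, PartialEq)]
pub struct PartitionRef(pub String);

pub struct JobConfig {
    pub inputs: Vec<PartitionRef>,
    pub outputs: Vec<PartitionRef>,
}

#[derive(Debug)]
pub struct PartitionManifest {
    pub partitions: Vec<PartitionRef>,
}

pub trait DataBuildJob {
    fn exec(config: JobConfig) -> PartitionManifest;
}

pub struct DailySales;

impl DataBuildJob for DailySales {
    fn exec(config: JobConfig) -> PartitionManifest {
        // A real implementation launches a task, awaits its exit code,
        // and calculates metadata; this stub just claims the outputs.
        PartitionManifest { partitions: config.outputs }
    }
}
```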

job.exec State Diagram

flowchart TD
    begin((begin)) --> validate_config
    emit_job_exec_fail --> fail((fail))
    validate_config -- fail --> emit_config_validate_fail --> emit_job_exec_fail
    validate_config -- success --> emit_config_validate_success --> launch_task
    launch_task -- fail --> emit_task_launch_fail --> emit_job_exec_fail
    launch_task -- success --> emit_task_launch_success --> await_task
    await_task -- waited N seconds --> emit_heartbeat --> await_task
    await_task -- non-zero exit code --> emit_task_failed --> emit_job_exec_fail
    await_task -- zero exit code --> emit_task_success --> calculate_metadata
    calculate_metadata -- fail --> emit_metadata_calculation_fail --> emit_job_exec_fail
    calculate_metadata -- success --> emit_metadata ---> success((success))

Graphs

Graphs are the unit of composition. To analyze (plan) task graphs (see JobGraph), they iteratively walk back from the requested output partitions, invoking job.config until no unresolved partitions remain. To build partitions, the graph runs analyze then iteratively executes the resulting task graph.

graph.analyze

Purpose: produce a complete task graph to materialize a requested set of partitions.

trait DataBuildGraph {
  fn analyze(outputs: Vec<PartitionRef>) -> JobGraph;
}
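The backward walk can be sketched as follows, with partitions as plain strings: config stands in for dispatching to the owning job's config(), and exists reports already-materialized partitions. The planned set here only guarantees termination; the real analyze also detects cycles explicitly, as the state diagram below shows.

```rust
use std::collections::{HashSet, VecDeque};

/// Sketch of the planning walk: start from the requested outputs and
/// repeatedly dispatch config() for each missing partition until no
/// unresolved partitions remain. Returns (output, inputs) tasks.
pub fn analyze(
    requested: Vec<String>,
    config: impl Fn(&str) -> Vec<String>, // stand-in for the owning job's config()
    exists: impl Fn(&str) -> bool,        // already-materialized partitions
) -> Vec<(String, Vec<String>)> {
    let mut tasks = Vec::new();
    let mut planned = HashSet::new();
    let mut queue: VecDeque<String> = requested.into();
    while let Some(out) = queue.pop_front() {
        if exists(&out) || !planned.insert(out.clone()) {
            continue; // already materialized or already planned
        }
        let inputs = config(&out);
        queue.extend(inputs.iter().cloned());
        tasks.push((out, inputs));
    }
    tasks
}
```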

graph.analyze State Diagram

flowchart TD
    begin((begin)) --> initialize_missing_partitions --> dispatch_missing_partitions
    emit_graph_analyze_fail --> fail((fail))
    dispatch_missing_partitions -- fail --> emit_partition_dispatch_fail --> emit_graph_analyze_fail
    dispatch_missing_partitions -- success --> cycle_detected?
    cycle_detected? -- yes --> emit_cycle_detected --> emit_graph_analyze_fail
    cycle_detected? -- no --> remaining_missing_partitions?
    remaining_missing_partitions? -- yes --> dispatch_missing_partitions
    remaining_missing_partitions? -- no --> emit_job_graph --> success((success))

graph.build

Purpose: analyze, then execute the resulting task graph.

trait DataBuildGraph {
  fn build(outputs: Vec<PartitionRef>);
}
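The execution loop can be sketched over the analyzed (output, inputs) tasks: run every task whose inputs all exist, repeat, and fail if tasks remain but none are schedulable. A real build schedules ready jobs concurrently and awaits their events; here each job "runs" by marking its output built.

```rust
use std::collections::HashSet;

/// Sketch of the build loop over analyzed (output, inputs) tasks.
pub fn build_order(
    tasks: Vec<(String, Vec<String>)>,
    mut exists: HashSet<String>,
) -> Result<Vec<String>, String> {
    let mut pending = tasks;
    let mut order = Vec::new();
    while !pending.is_empty() {
        // A task is ready once all of its input partitions exist.
        let (ready, rest): (Vec<_>, Vec<_>) = pending
            .into_iter()
            .partition(|(_, ins)| ins.iter().all(|i| exists.contains(i)));
        if ready.is_empty() {
            return Err("remaining jobs are unschedulable".to_string());
        }
        for (out, _) in ready {
            exists.insert(out.clone());
            order.push(out);
        }
        pending = rest;
    }
    Ok(order)
}
```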

graph.build State Diagram

flowchart TD
    begin((begin)) --> graph_analyze
    emit_graph_build_fail --> fail((fail))
    graph_analyze -- fail --> emit_graph_build_fail
    graph_analyze -- success --> initialize_ready_jobs --> remaining_ready_jobs?
    remaining_ready_jobs? -- yes --> emit_remaining_jobs --> schedule_jobs
    remaining_ready_jobs? -- none schedulable --> emit_jobs_unschedulable --> emit_graph_build_fail
    schedule_jobs -- fail --> emit_job_schedule_fail --> emit_graph_build_fail
    schedule_jobs -- success --> emit_job_schedule_success --> await_jobs
    await_jobs -- job_failure --> emit_job_failure --> emit_job_cancels --> cancel_running_jobs
    cancel_running_jobs --> emit_graph_build_fail
    await_jobs -- N seconds since heartbeat --> emit_heartbeat --> await_jobs
    await_jobs -- job_success --> remaining_ready_jobs?
    remaining_ready_jobs? -- no ---------> emit_graph_build_success --> success((success))

Correctness Strategy

  • Core component interfaces are described in databuild.proto, a protobuf interface shared by all core components and all GSLs.
  • GSLs implement ergonomic graph, job, and partition helpers that make coupling explicit.
  • Graphs automatically detect and raise on non-unique job -> partition mappings.
  • Graph and job processes are fully described by state diagrams, whose state transitions are logged to the build event log.
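The non-unique mapping check can be sketched as an index from partition to producing job that raises on a duplicate (job and partition names are plain strings here):

```rust
use std::collections::HashMap;

/// Sketch of the uniqueness check: index each output partition by its
/// producing job and raise when two jobs claim the same partition.
pub fn check_unique_producers(
    jobs: &[(&str, Vec<&str>)],
) -> Result<HashMap<String, String>, String> {
    let mut producer: HashMap<String, String> = HashMap::new();
    for (job, outputs) in jobs {
        for out in outputs {
            if let Some(prev) = producer.insert(out.to_string(), job.to_string()) {
                return Err(format!(
                    "partition {out} produced by both {prev} and {job}"
                ));
            }
        }
    }
    Ok(producer)
}
```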

Partition Delegation

  • Sometimes a partition already exists, or another build request is already planning to produce it.
  • A later build request will delegate to the already existing build request for said partition.
  • The later build request will write an event to the build event log referencing the ID of the delegate, allowing traceability and visualization.
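A delegation record in the build event log might look like the sketch below; the field names are assumptions for illustration, not part of databuild.proto.

```rust
/// Sketch of a delegation event: the later build request references the
/// ID of the request it delegates to, linking the two in the event log.
#[derive(Debug, PartialEq)]
pub struct DelegationEvent {
    pub build_request_id: String,
    pub partition: String,
    pub delegated_to: String, // ID of the request already producing the partition
}

pub fn delegate(build_request_id: &str, partition: &str, delegate_id: &str) -> DelegationEvent {
    DelegationEvent {
        build_request_id: build_request_id.to_string(),
        partition: partition.to_string(),
        delegated_to: delegate_id.to_string(),
    }
}
```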

Heartbeats / Health Checks

  • Which strategy do we use?
  • If we are launching tasks to a place we can't health check, how could they heartbeat?