databuild/docs/design/orchestrated-state-machines.md

5.5 KiB

Orchestrated State Machines: A Theory of Application Architecture

Overview

DataBuild's core architecture exemplifies a pattern we call Dependency-Aware State Machine Orchestration or Stateful Dataflow Architecture. This document crystallizes the theory behind this approach and its applications.

The Pattern

At its essence, the pattern is:

Application = State Machines + Dependency Graph + Orchestration Logic

Where:

  • State Machines: Individual entities (Want, JobRun, Partition) with well-defined, type-safe states
  • Dependency Graph: Relationships between entities (wants depend on partitions, partitions depend on job runs)
  • Orchestration Logic: Coordination rules that trigger state transitions when dependencies are satisfied

Key Components

1. State Machines

Each domain entity is modeled as an explicit state machine with:

  • Well-defined states (NotStarted, Running, Completed, Failed, etc.)
  • Type-safe transitions enforced at compile time
  • Immutable state progression via consuming methods

Example from DataBuild:

pub enum JobRun<B: JobRunBackend> {
    NotStarted(JobRunWithState<B, B::NotStartedState>),
    Running(JobRunWithState<B, B::RunningState>),
    Completed(JobRunWithState<B, B::CompletedState>),
    Failed(JobRunWithState<B, B::FailedState>),
    // ...
}

// Can ONLY call run() on NotStarted jobs - compiler enforces this!
impl<B: JobRunBackend> JobRunWithState<B, B::NotStartedState> {
    pub fn run(self, env) -> Result<JobRunWithState<B, B::RunningState>, Error>
}

2. Dependency Graph

Entities are connected through explicit dependencies:

  • Wants → Partitions (wants request specific partitions)
  • Partitions → JobRuns (jobs build partitions)
  • JobRuns → Partitions (jobs declare what they built)

3. Orchestration Logic

A central orchestrator:

  • Observes all entity states
  • Evaluates dependency conditions
  • Triggers state transitions when conditions are met
  • Maintains global consistency invariants

Core Principles

  1. Model domain entities as explicit state machines - Don't hide state in boolean flags
  2. Express dependencies as a graph - Make relationships first-class
  3. Centralize coordination logic - Separate entity behavior from system coordination
  4. Make state transitions event-sourced - Append-only log enables time-travel and auditability
  5. Use types to enforce valid transitions - Catch errors at compile time, not runtime

Advantages

Type Safety

Compile-time guarantees prevent invalid state transitions:

// This will not compile:
let job = JobRun::Running(running_job);
job.run(); // ERROR: no method `run` found for `JobRun<Running>`

Observability

Event-sourced state transitions provide complete audit trail:

  • What's running? Query running jobs
  • What failed? Filter by failed state
  • When did it transition? Check BEL timestamps

Testability

  • State machines can be tested in isolation
  • Orchestration logic can be tested with mock state machines
  • Dependency resolution can be tested independently

Incremental Progress

System can be stopped and restarted:

  • State is persisted in BEL
  • Resume from last known state
  • No need to restart from beginning

Correctness

  • Type system prevents impossible states
  • Event log provides ground truth
  • Dependency graph ensures proper ordering

Real-World Applications

This pattern is the fundamental architecture of:

Build Systems

  • Bazel, Buck, Pants - artifacts depend on other artifacts
  • Your "builds" are literally builds

Workflow Engines

  • Temporal, Prefect, Airflow - DAG of tasks with state
  • Each task is a state machine, orchestrator schedules based on dependencies

Data Orchestration

  • Dagster, Kedro - data assets with lineage
  • Partitions are data assets, jobs are transformations

Game Engines

  • Entity Component Systems - entities have state
  • Game loop orchestrates entity state transitions

Business Process Management

  • BPMN engines - business processes as state machines
  • Workflow engine coordinates process instances

When to Use This Pattern

This architecture is particularly powerful for systems where:

  • Eventual consistency is acceptable (not strict ACID transactions)
  • Incremental progress is important (can checkpoint and resume)
  • Observability is critical (need to know what's happening)
  • Correctness matters (type-safe transitions prevent bugs)
  • Concurrency is inherent (multiple things happening simultaneously)
  • Dependencies are complex (can't just process sequentially)

Implementation Lessons

Use Drain for State Transitions

Clean pattern for moving entities through states:

fn schedule_queued_jobs(&mut self) -> Result<()> {
    let mut new_jobs = Vec::new();
    for job in self.job_runs.drain(..) {
        let transitioned = match job {
            JobRun::NotStarted(ns) => JobRun::Running(ns.run(None)?),
            other => other, // Pass through unchanged
        };
        new_jobs.push(transitioned);
    }
    self.job_runs = new_jobs;
    Ok(())
}

Parameterize State for Type Safety

pub struct JobRunWithState<Backend, State> {
    job_run_id: Uuid,
    state: State,  // Type parameter enforces valid operations
}

Event Sourcing for Auditability

All state changes emit events to append-only log:

self.bel.append_event(&JobRunSuccessEvent {
    job_run_id,
    timestamp
})?;

Separate Entity Logic from Coordination

  • Entity state machines: "What transitions are valid for me?"
  • Orchestrator: "Given all entity states, what should happen next?"