# Resilience Design
Purpose: Enable DataBuild to maintain correctness during version rollouts, unexpected restarts, and continuous deployment.
## Core Tenets
- **Simplicity over complexity** - Avoid distributed state management
- **Correctness over speed** - Accept retries rather than risk corruption
- **Enable good design** - Constraints that guide users toward robust patterns
## Key Mechanisms
### Process Generation Tracking
Each DataBuild service instance receives a monotonically increasing generation identifier (timestamp-based) at startup. The BEL rejects events from older generations, preventing split-brain scenarios where multiple versions compete. This provides natural version fencing without coordination - the newest service always wins, and older instances' events are ignored.
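
As a minimal in-memory sketch of this fencing rule (the `BuildEventLog` class and plain-integer generations here are illustrative, not DataBuild's actual API):

```python
class BuildEventLog:
    """Append-only log that fences out writers from stale generations."""

    def __init__(self):
        self.latest_generation = 0
        self.events = []

    def append(self, generation, event):
        # The newest service instance always wins: anything older than the
        # highest generation seen so far is silently ignored.
        if generation < self.latest_generation:
            return False
        self.latest_generation = generation
        self.events.append((generation, event))
        return True

# Generations are timestamp-based in practice; fixed integers stand in here.
bel = BuildEventLog()
accepted = bel.append(200, "build_started")  # newer instance
rejected = bel.append(100, "build_started")  # stale instance, fenced out
```

Because generations only ever increase at startup, no handshake between old and new instances is needed; the comparison in `append` is the entire protocol.
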
### Builds as Expiring Leases
Build requests function as time-bound leases on partition production. Active builds must periodically heartbeat to maintain their lease. When a service restarts, builds without recent heartbeats are considered orphaned and marked failed. This prevents zombie builds from interfering with new attempts while avoiding the complexity of explicit lease management.
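
A deterministic sketch of the lease rule, using an explicit clock and a hypothetical 30-second timeout (the real threshold is not specified here):

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; illustrative value

class Build:
    def __init__(self, partition, now):
        self.partition = partition
        self.status = "running"
        self.last_heartbeat = now

    def heartbeat(self, now):
        # Active builds periodically renew their lease.
        self.last_heartbeat = now

def reap_orphans(builds, now):
    """On restart, fail any build whose heartbeat lease has expired."""
    for build in builds:
        if build.status == "running" and now - build.last_heartbeat > HEARTBEAT_TIMEOUT:
            build.status = "failed"  # zombie build; a fresh attempt may proceed

fresh = Build("sales/2024-01-01", now=1000.0)
stale = Build("sales/2024-01-02", now=1000.0)
fresh.heartbeat(now=1040.0)          # renewed 10s before the sweep
reap_orphans([fresh, stale], now=1050.0)
```
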
### Want-Level Retry Logic
Wants represent the desired state (partitions that should exist), while builds are ephemeral attempts to achieve that state. When a service restarts, the recovery strategy is deliberately simple: cancel all in-flight builds and let active wants retrigger new attempts. This separation means restarts don't require complex state recovery - the system naturally converges to satisfy outstanding wants.
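
The restart-and-reconverge loop can be sketched as follows (the `Scheduler` shape is assumed for illustration):

```python
class Scheduler:
    def __init__(self):
        self.wants = set()    # durable intent: partitions that should exist
        self.builds = {}      # ephemeral attempts: partition -> status
        self.produced = set() # partitions that already exist

    def on_restart(self):
        # Deliberately simple recovery: cancel every in-flight attempt.
        for partition, status in self.builds.items():
            if status == "running":
                self.builds[partition] = "cancelled"

    def evaluate(self):
        # Any unsatisfied want with no running build retriggers an attempt.
        for partition in self.wants - self.produced:
            if self.builds.get(partition) != "running":
                self.builds[partition] = "running"

sched = Scheduler()
sched.wants.add("events/2024-01-01")
sched.evaluate()    # first attempt starts
sched.on_restart()  # restart cancels it
sched.evaluate()    # the surviving want converges the system again
```
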
### Job Re-entrance for Long-Running Tasks
Long-running jobs on external systems (EMR, BigQuery, Databricks) must survive frequent restarts caused by continuous deployment. DataBuild enables re-entrance through a simple stdout-based state mechanism:
- Jobs emit state by printing specially-marked JSON lines to stdout
- The wrapper intercepts these lines and stores them in the BEL (max 1KB per event)
- On restart, jobs receive their previous state via environment variable
- Jobs can then re-attach to external resources or resume from checkpoints
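
Both sides of the mechanism fit in a few lines; the `##DATABUILD_STATE##` marker and `DATABUILD_JOB_STATE` variable below are stand-in names, not the actual protocol constants:

```python
import json
import os

STATE_MARKER = "##DATABUILD_STATE## "  # stand-in marker
MAX_STATE_BYTES = 1024                 # 1KB cap: store pointers, not data

def emit_state(state):
    """Job side: announce resumable state on stdout."""
    print(STATE_MARKER + json.dumps(state))

def intercept(stdout_lines):
    """Wrapper side: keep the latest marked line small enough for the BEL."""
    latest = None
    for line in stdout_lines:
        if line.startswith(STATE_MARKER):
            payload = line[len(STATE_MARKER):]
            if len(payload.encode("utf-8")) <= MAX_STATE_BYTES:
                latest = payload
    return latest

def previous_state():
    """Job side, after restart: recover state from the environment."""
    raw = os.environ.get("DATABUILD_JOB_STATE")  # stand-in variable name
    return json.loads(raw) if raw else None

# A job stores a pointer to its external resource, never the data itself.
stdout = [STATE_MARKER + json.dumps({"emr_step_id": "s-123"}), "ordinary log"]
captured = intercept(stdout)
```
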
This approach is language-agnostic, requires no libraries, and works across all deployment platforms. The 1KB limit forces jobs to store pointers (job IDs, URLs) rather than data, maintaining proper separation of concerns.
### Write Collision Detection
Rather than preventing duplicate execution through locking, DataBuild detects and handles collisions after the fact. Jobs announce their intent to build a partition via BEL events. If another job is already building or has recently built the partition, the new job delegates to the existing effort. This approach embraces eventual consistency while ensuring correctness through idempotency.
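
A sketch of the claim-and-delegate rule, using a plain dict to stand in for intent events read back from the BEL (the `ttl` value is illustrative):

```python
def claim_partition(claims, partition, build_id, now, ttl=300.0):
    """Announce intent to build; delegate if a live claim already exists."""
    existing = claims.get(partition)
    if existing is not None:
        owner, claimed_at = existing
        if now - claimed_at < ttl:
            return owner  # collision detected: defer to the existing effort
    claims[partition] = (build_id, now)
    return build_id

claims = {}
first = claim_partition(claims, "sales/2024-01-01", "build-A", now=0.0)
second = claim_partition(claims, "sales/2024-01-01", "build-B", now=10.0)
```

Because jobs are idempotent, even the narrow race where two claims land before either is read back costs only duplicate work, never corruption.
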
## Operational Guidelines
### Deployment Strategy
New deployments start with a fresh generation, causing the BEL to reject events from the previous version. Old builds time out and fail naturally, while active wants trigger fresh build attempts under the new generation. This approach requires no coordination between versions and handles both planned rollouts and unexpected restarts.
### Job Design Requirements
Jobs must be idempotent - producing the same output given the same inputs, regardless of how many times they run. This is achieved through:
- Checking if outputs already exist before starting work
- Using atomic writes (write to a staging location, then rename)
- Storing progress markers for resumption
- Re-attaching to external jobs when possible
## Risk Classes and Mitigations
### State Consistency Risks
- **Orphaned jobs**: Handled via heartbeat timeouts and generation tracking
- **Double execution**: Mitigated through idempotency and collision detection
- **Lost progress**: Addressed by want-level retries and job re-entrance
### Coordination Risks
- **Cross-graph desync**: GraphService API provides reliable event streaming with retry
- **Event ordering conflicts**: Generation-based fencing ensures single writer
- **Want satisfaction gaps**: Continuous evaluation ensures wants are eventually satisfied
### Resource Management Risks
- **Storage conflicts**: Atomic writes and existence checks prevent corruption
- **Compute waste**: Accepted trade-off for simplicity and correctness
- **Memory state loss**: System designed to be stateless with full recovery from BEL
## Key Insights
1. **Restarts are routine**: Continuous deployment means services restart frequently, often mid-build. The system must handle this gracefully without operator intervention.
2. **State through stdout**: Job state management via stdout/env vars provides a universal mechanism that works across languages and platforms without libraries.
3. **BEL as source of truth**: All coordination happens through the append-only Build Event Log, eliminating distributed state management complexity.
4. **Wants ensure completion**: By separating intent (wants) from execution (builds), the system can fail builds freely while guaranteeing eventual completion.
5. **Small state constraint**: The 1KB limit on job state forces proper design - jobs store references to external resources, not the resources themselves.
6. **Generation fencing**: Simple timestamp-based generations provide total ordering without coordination, preventing cross-version interference.
## Benefits
This design achieves resilience through simplicity. Rather than building complex resumption logic or distributed coordination, DataBuild embraces restarts as routine events. The system may run slower during transitions but maintains correctness without operator intervention. The same patterns work across local development, containers, and cloud platforms, providing a consistent mental model for users.