From 7ccec59364c130cdb02027e02c0dfa73d44d43c2 Mon Sep 17 00:00:00 2001
From: Stuart Axelbrooke
Date: Tue, 25 Nov 2025 10:28:29 +0800
Subject: [PATCH] update partitions refactor plan

---
 docs/plans/partitions-refactor.md | 149 ++++++++++++++++++++++++++++--
 1 file changed, 139 insertions(+), 10 deletions(-)

diff --git a/docs/plans/partitions-refactor.md b/docs/plans/partitions-refactor.md
index ac513e1..7f0cfbf 100644
--- a/docs/plans/partitions-refactor.md
+++ b/docs/plans/partitions-refactor.md
@@ -208,6 +208,7 @@ Can answer:
 - **UpForRetry**: Upstream dependencies satisfied, partition ready to retry building
 - **Live**: Successfully built
 - **Failed**: Hard failure (shouldn't retry)
+- **UpstreamFailed**: Partition failed because upstream dependencies failed (terminal state)
 - **Tainted**: Marked invalid by taint event
 
 **Removed:** Missing state - partitions only exist when jobs start building them or are completed.
@@ -215,6 +216,7 @@ Can answer:
 Key transitions:
 - Building → UpstreamBuilding (job reports dep miss)
 - UpstreamBuilding → UpForRetry (all upstream deps satisfied)
+- UpstreamBuilding → UpstreamFailed (upstream dependency hard failure)
 - Building → Live (job succeeds)
 - Building → Failed (job hard failure)
 - UpForRetry → Building (new job queued for retry, creates fresh UUID)
@@ -309,6 +311,7 @@ Can answer:
 - Check canonical partition states for all partition refs
 - Transition based on observation (in priority order):
 - If ANY canonical partition is Failed → New → Failed (job can't be safely retried)
+- If ANY canonical partition is UpstreamFailed → New → UpstreamFailed (upstream deps failed)
 - If ALL canonical partitions exist AND are Live → New → Successful (already built!)
 - If ANY canonical partition is Building → New → Building (being built now)
 - If ANY canonical partition is UpstreamBuilding → New → UpstreamBuilding (waiting for deps)
@@ -316,7 +319,7 @@
 - Otherwise (partitions don't exist or other states) → New → Idle (need to schedule)
 - For derivative wants, additional logic may transition to UpstreamBuilding
-  Key insight: Most wants will go New → Idle because partitions won't exist yet (only created when jobs start). Subsequent wants for already-building partitions go New → Building. Wants arriving during dep miss go New → UpstreamBuilding. Wants for partitions ready to retry go New → Idle. Wants for already-Live partitions go New → Successful. Wants for Failed partitions go New → Failed.
+  Key insight: Most wants will go New → Idle because partitions won't exist yet (only created when jobs start). Subsequent wants for already-building partitions go New → Building. Wants arriving during dep miss go New → UpstreamBuilding. Wants for partitions ready to retry go New → Idle. Wants for already-Live partitions go New → Successful. Wants for Failed or UpstreamFailed partitions go New → Failed/UpstreamFailed.
 
 3. **Keep WantSchedulability building check**
@@ -350,6 +353,7 @@
 New transitions needed:
 - **New → Failed:** Any partition failed
+- **New → UpstreamFailed:** Any partition upstream failed
 - **New → Successful:** All partitions live
 - **New → Idle:** Normal case, partitions don't exist
 - **New → Building:** Partitions already building when want created
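+
+An illustrative sketch of the new states and the priority-ordered sensing described above - a hypothetical `PartitionState`/`WantState` pair and a `sense_new_want` helper, not the orchestrator's actual types:
+
+```rust
+use std::collections::HashMap;
+
+// Sketch only: stand-ins for the plan's partition and want states.
+#[derive(Clone, Copy, PartialEq, Eq, Debug)]
+enum PartitionState {
+    Building,
+    UpstreamBuilding,
+    UpForRetry,
+    Live,
+    Failed,
+    UpstreamFailed,
+    Tainted,
+}
+
+#[derive(Clone, Copy, PartialEq, Eq, Debug)]
+enum WantState {
+    New,
+    Idle,
+    Building,
+    UpstreamBuilding,
+    Successful,
+    Failed,
+    UpstreamFailed,
+}
+
+/// Priority-ordered sensing for a New want, given the canonical partition
+/// states for its partition refs (None = canonical partition doesn't exist).
+fn sense_new_want(canonical: &HashMap<String, PartitionState>, refs: &[String]) -> WantState {
+    let states: Vec<Option<PartitionState>> =
+        refs.iter().map(|r| canonical.get(r).copied()).collect();
+
+    if states.iter().any(|s| *s == Some(PartitionState::Failed)) {
+        WantState::Failed
+    } else if states.iter().any(|s| *s == Some(PartitionState::UpstreamFailed)) {
+        WantState::UpstreamFailed
+    } else if !states.is_empty() && states.iter().all(|s| *s == Some(PartitionState::Live)) {
+        WantState::Successful
+    } else if states.iter().any(|s| *s == Some(PartitionState::Building)) {
+        WantState::Building
+    } else if states.iter().any(|s| *s == Some(PartitionState::UpstreamBuilding)) {
+        WantState::UpstreamBuilding
+    } else {
+        // Partitions don't exist, are UpForRetry, or are Tainted → schedule.
+        WantState::Idle
+    }
+}
+```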
@@ -468,14 +472,22 @@ Complete flow when a job has dependency miss:
 - Wants for partition: Building → UpstreamBuilding
 - Wants track the derivative want IDs in their UpstreamBuildingState
 
-4. **Upstream builds complete:**
-   - Derivative wants build upstream partitions → upstream partition becomes Live
-   - **Lookup downstream_waiting:** Get `downstream_waiting[upstream_partition_ref]` → list of UUIDs waiting for this upstream
-   - For each waiting partition UUID:
-     - Get partition from `partitions_by_uuid[uuid]`
-     - Check if ALL its MissingDeps are now satisfied (canonical partitions for all refs are Live)
-     - If satisfied: transition partition UpstreamBuilding → UpForRetry
-   - Remove uuid from `downstream_waiting` entries (cleanup)
+4. **Upstream builds complete or fail:**
+   - **Success case:** Derivative wants build upstream partitions → upstream partition becomes Live
+     - **Lookup downstream_waiting:** Get `downstream_waiting[upstream_partition_ref]` → list of UUIDs waiting for this upstream
+     - For each waiting partition UUID:
+       - Get partition from `partitions_by_uuid[uuid]`
+       - Check if ALL its MissingDeps are now satisfied (canonical partitions for all refs are Live)
+       - If satisfied: transition partition UpstreamBuilding → UpForRetry
+     - Remove uuid from `downstream_waiting` entries (cleanup)
+
+   - **Failure case:** Upstream partition transitions to Failed (hard failure)
+     - **Lookup downstream_waiting:** Get `downstream_waiting[failed_partition_ref]` → list of UUIDs waiting for this upstream
+     - For each waiting partition UUID in UpstreamBuilding state:
+       - Transition partition: UpstreamBuilding → UpstreamFailed
+       - Transition associated wants: UpstreamBuilding → UpstreamFailed
+       - Remove uuid from `downstream_waiting` entries (cleanup)
+     - This propagates failure information down the dependency chain
 
 5. **Want becomes schedulable:**
    - When partition transitions to UpForRetry, wants transition: UpstreamBuilding → Idle
@@ -493,7 +505,8 @@
 - UpstreamBuilding also acts as lease (upstreams not ready, can't retry yet)
 - UpForRetry releases lease (upstreams ready, safe to schedule)
 - Failed releases lease but blocks new wants (hard failure, shouldn't retry)
+- UpstreamFailed releases lease and blocks new wants (upstream deps failed, can't succeed)
-- `downstream_waiting` index enables O(1) lookup of affected partitions when upstreams complete
+- `downstream_waiting` index enables O(1) lookup of affected partitions when upstreams complete or fail
 
 **Taint Handling:**
@@ -657,6 +670,122 @@ JobRunBufferEventV1 {
 - Want completes → partition remains in history
 - Partition expires → new UUID for rebuild, canonical updated
 
+## Implementation FAQs
+
+### Q: Do we need to maintain backwards compatibility with existing events?
+
+**A:** No. We can assume there is no need to maintain backwards compatibility or to retain data produced before this change. This simplifies the implementation significantly - no need to handle old event formats or generate UUIDs for replayed pre-UUID events.
+
+### Q: How should we handle reference errors and index inconsistencies?
+
+**A:** Panic on any reference issues with contextual information. This includes:
+- Missing partition UUIDs in `partitions_by_uuid`
+- Missing canonical pointers in `canonical_partitions`
+- Inverted index inconsistencies (wants_for_partition, downstream_waiting)
+- Invalid state transitions
+
+Add assertions and validation throughout to catch these issues immediately rather than failing silently.
+
+### Q: What about cleanup of the `wants_for_partition` inverted index?
+
+**A:** Don't remove wants from the index when they complete. This is acceptable for the initial implementation: even years of partition builds on a mature data platform would still amount to fewer than a million entries, which is manageable. We can add cleanup later if needed.
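+
+A minimal sketch of the failure case from the dependency-miss flow above and of the panic-on-inconsistency policy; the `BuildState` here is a stand-in that only holds the two maps the sketch needs, and real field names and want handling will differ:
+
+```rust
+use std::collections::HashMap;
+use uuid::Uuid;
+
+// Sketch only: stand-in partition states and build state.
+#[derive(Clone, Copy, PartialEq, Eq, Debug)]
+enum PartitionState {
+    Building,
+    UpstreamBuilding,
+    UpForRetry,
+    Live,
+    Failed,
+    UpstreamFailed,
+    Tainted,
+}
+
+struct BuildState {
+    partitions_by_uuid: HashMap<Uuid, PartitionState>,
+    // upstream partition ref → downstream partition UUIDs waiting on it
+    downstream_waiting: HashMap<String, Vec<Uuid>>,
+}
+
+impl BuildState {
+    /// Failure case of step 4 above: an upstream partition hard-failed, so every
+    /// downstream partition still waiting on it becomes UpstreamFailed.
+    fn propagate_upstream_failure(&mut self, failed_partition_ref: &str) {
+        // Take the whole entry for the failed upstream; removing it is also the cleanup.
+        let waiting = self
+            .downstream_waiting
+            .remove(failed_partition_ref)
+            .unwrap_or_default();
+
+        for uuid in waiting {
+            // Per the reference-error policy: panic with context on a dangling UUID.
+            let state = self
+                .partitions_by_uuid
+                .get_mut(&uuid)
+                .unwrap_or_else(|| panic!("downstream_waiting points at unknown partition {uuid}"));
+
+            if *state == PartitionState::UpstreamBuilding {
+                *state = PartitionState::UpstreamFailed;
+                // The real implementation would also transition the associated wants
+                // (UpstreamBuilding → UpstreamFailed) and scrub this UUID from any
+                // other downstream_waiting entries.
+            }
+        }
+    }
+}
+```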
+
+### Q: What happens when an upstream partition is Tainted instead of becoming Live?
+
+**A:** Tainting an upstream means it is no longer live, and the downstream job should dep miss. The system will operate correctly:
+1. Downstream job discovers upstream is Tainted (not Live) → dep miss
+2. Derivative want created for tainted upstream
+3. Tainted upstream triggers rebuild (new UUID, replaces canonical)
+4. Derivative want succeeds → downstream can resume
+
+### Q: How should UUIDs be generated? Should the Orchestrator calculate them?
+
+**A:** Use deterministic derivation instead of orchestrator generation:
+
+```rust
+// Assumes the `sha2` and `uuid` crates.
+use sha2::{Digest, Sha256};
+use uuid::Uuid;
+
+fn derive_partition_uuid(job_run_id: &str, partition_ref: &str) -> Uuid {
+    // Hash job_run_id + partition_ref bytes
+    let mut hasher = Sha256::new();
+    hasher.update(job_run_id.as_bytes());
+    hasher.update(partition_ref.as_bytes());
+    let hash = hasher.finalize();
+    // Convert the first 16 bytes to a UUID (always 16 bytes, so unwrap is safe)
+    Uuid::from_slice(&hash[0..16]).unwrap()
+}
+```
+
+**Benefits:**
+- No orchestrator UUID state/generation needed
+- Deterministic replay (same job + ref = same UUID)
+- Event schema stays simple (job_run_id + partition refs)
+- Build state derives UUIDs in `handle_job_run_buffer()`
+- No need for `PartitionInstanceRef` message in protobuf
+
+### Q: How do we enforce safe canonical partition access?
+
+**A:** Add and use helper methods in BuildState to enforce correct access patterns:
+- `get_canonical_partition(ref)` - look up the canonical partition for a ref
+- `get_canonical_partition_uuid(ref)` - get the UUID of the canonical partition
+- `get_partition_by_uuid(uuid)` - direct UUID lookup
+- `get_wants_for_partition(ref)` - query the inverted index
+
+The existing `get_partition()` function should be updated to use canonical lookup. Code should always access "current state" via canonical_partitions, not by ref lookup in the deprecated partitions map.
+
+### Q: What is the want schedulability check logic?
+
+**A:** A want is schedulable if, for each of its partition refs:
+- The canonical partition doesn't exist, OR
+- The canonical partition exists and is in Tainted or UpForRetry state
+
+In other words: `!exists || Tainted || UpForRetry`
+
+Building and UpstreamBuilding partitions act as leases (not schedulable).
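+
+A minimal sketch of this check, assuming a stand-in `BuildState` that only holds the canonical map and that the per-ref predicate must hold for every partition ref of the want:
+
+```rust
+use std::collections::HashMap;
+
+// Sketch only: stand-in state enum and BuildState, not the real structs.
+#[derive(Clone, Copy, PartialEq, Eq, Debug)]
+enum PartitionState {
+    Building,
+    UpstreamBuilding,
+    UpForRetry,
+    Live,
+    Failed,
+    UpstreamFailed,
+    Tainted,
+}
+
+struct BuildState {
+    // partition ref → state of the canonical partition instance
+    canonical_partitions: HashMap<String, PartitionState>,
+}
+
+impl BuildState {
+    fn get_canonical_partition(&self, partition_ref: &str) -> Option<PartitionState> {
+        self.canonical_partitions.get(partition_ref).copied()
+    }
+
+    /// `!exists || Tainted || UpForRetry`, applied to every partition ref of the want.
+    fn is_want_schedulable(&self, partition_refs: &[String]) -> bool {
+        partition_refs.iter().all(|r| match self.get_canonical_partition(r) {
+            None => true,                             // canonical partition doesn't exist
+            Some(PartitionState::Tainted) => true,    // invalid, needs rebuild
+            Some(PartitionState::UpForRetry) => true, // upstreams satisfied, safe to retry
+            // Building/UpstreamBuilding act as leases; Live/Failed/UpstreamFailed block scheduling.
+            Some(_) => false,
+        })
+    }
+}
+```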
+
+### Q: Should we implement phases strictly sequentially?
+
+**A:** No. Proceed in the most efficient and productive manner possible. Phases can be combined or reordered where that makes sense. For example, Phase 1 + Phase 2 can be done together since want state sensing depends on the new partition states.
+
+### Q: Should we write tests incrementally or implement everything first?
+
+**A:** Implement tests as we go. Write unit tests for each component as it's implemented, then integration tests for full scenarios.
+
+### Q: Should wants reference partition UUIDs or partition refs?
+
+**A:** Wants should NEVER reference partition instances (via UUID). Wants should ONLY reference canonical partitions via partition ref strings. This is already the case - wants include partition refs, which allows the orchestrator to resolve partition info for want state updates. The separation is:
+- Wants → Partition Refs (canonical, user-facing)
+- Jobs → Partition UUIDs (specific instances, historical)
+
+### Q: Should we add UpstreamFailed state for partitions?
+
+**A:** Yes. This provides symmetry with want semantics and clear terminal state propagation:
+
+**Scenario:**
+1. Partition A: Building → Failed (hard failure)
+2. Partition B needs A, dep misses → UpstreamBuilding
+3. Derivative want created for A, immediately fails (A is Failed)
+4. Partition B: UpstreamBuilding → UpstreamFailed
+
+**Benefits:**
+- Clear signal that the partition can never succeed (upstreams failed)
+- Mirrors Want UpstreamFailed semantics (consistency)
+- Useful for UIs and debugging
+- Prevents indefinite waiting in UpstreamBuilding state
+
+**Transition logic:**
+- When a partition transitions to Failed, look up `downstream_waiting[failed_partition_ref]`
+- For each downstream partition UUID in UpstreamBuilding state, transition to UpstreamFailed
+- This propagates failure information down the dependency chain
+
+**Add to Phase 1 partition states:**
+- **UpstreamFailed**: Partition failed because upstream dependencies failed (terminal state)
+
+**Add transition:**
+- UpstreamBuilding → UpstreamFailed (upstream dependency hard failure)
+
+### Q: Can a job build the same partition ref multiple times?
+
+**A:** No, this is invalid. A job run cannot build the same partition ref multiple times; each partition ref should appear at most once in a job's building_partitions list.
+
 ## Summary
 
 Adding partition UUIDs solves fundamental architectural problems: