update partitions refactor plan

Stuart Axelbrooke 2025-11-25 10:28:29 +08:00
parent dfc1d19237
commit 7ccec59364


@@ -208,6 +208,7 @@ Can answer:
- **UpForRetry**: Upstream dependencies satisfied, partition ready to retry building
- **Live**: Successfully built
- **Failed**: Hard failure (shouldn't retry)
- **UpstreamFailed**: Partition failed because upstream dependencies failed (terminal state)
- **Tainted**: Marked invalid by taint event
**Removed:** Missing state - partitions only exist once a job starts building them or they are completed.
@@ -215,6 +216,7 @@ Can answer:
Key transitions:
- Building → UpstreamBuilding (job reports dep miss)
- UpstreamBuilding → UpForRetry (all upstream deps satisfied)
- UpstreamBuilding → UpstreamFailed (upstream dependency hard failure)
- Building → Live (job succeeds)
- Building → Failed (job hard failure)
- UpForRetry → Building (new job queued for retry, creates fresh UUID)
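The partition lifecycle above can be captured compactly. A minimal Rust sketch, assuming illustrative type and method names (not taken from the existing codebase):

```rust
/// Partition states from the list above (names are illustrative assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PartitionState {
    Building,
    UpstreamBuilding,
    UpForRetry,
    Live,
    Failed,
    UpstreamFailed,
    Tainted,
}

impl PartitionState {
    /// Whether `self` → `next` is one of the key transitions listed above.
    fn can_transition_to(self, next: PartitionState) -> bool {
        use PartitionState::*;
        matches!(
            (self, next),
            (Building, UpstreamBuilding)             // job reports dep miss
                | (UpstreamBuilding, UpForRetry)     // all upstream deps satisfied
                | (UpstreamBuilding, UpstreamFailed) // upstream dependency hard failure
                | (Building, Live)                   // job succeeds
                | (Building, Failed)                 // job hard failure
                | (UpForRetry, Building)             // retry job queued (fresh UUID)
        )
    }
}
```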
@@ -309,6 +311,7 @@ Can answer:
- Check canonical partition states for all partition refs
- Transition based on observation (in priority order):
- If ANY canonical partition is Failed → New → Failed (job can't be safely retried)
- If ANY canonical partition is UpstreamFailed → New → UpstreamFailed (upstream deps failed)
- If ALL canonical partitions exist AND are Live → New → Successful (already built!)
- If ANY canonical partition is Building → New → Building (being built now)
- If ANY canonical partition is UpstreamBuilding → New → UpstreamBuilding (waiting for deps)
@@ -316,7 +319,7 @@ Can answer:
- Otherwise (partitions don't exist or other states) → New → Idle (need to schedule)
- For derivative wants, additional logic may transition to UpstreamBuilding
Key insight: Most wants will go New → Idle because partitions won't exist yet (only created when jobs start). Subsequent wants for already-building partitions go New → Building. Wants arriving during dep miss go New → UpstreamBuilding. Wants for partitions ready to retry go New → Idle. Wants for already-Live partitions go New → Successful. Wants for Failed or UpstreamFailed partitions go New → Failed/UpstreamFailed.
3. **Keep WantSchedulability building check**
@@ -350,6 +353,7 @@ Can answer:
New transitions needed:
- **New → Failed:** Any partition failed
- **New → UpstreamFailed:** Any partition upstream failed
- **New → Successful:** All partitions live
- **New → Idle:** Normal case, partitions don't exist
- **New → Building:** Partitions already building when want created
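A sketch of this New-want sensing logic, applying the same priority order over the canonical partition states resolved for a want's refs (`None` means the partition doesn't exist yet). `WantState` and the function name are assumptions, reusing the `PartitionState` sketch from earlier:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WantState { Idle, Building, UpstreamBuilding, Successful, Failed, UpstreamFailed }

fn initial_want_state(canonical_states: &[Option<PartitionState>]) -> WantState {
    let any = |s: PartitionState| canonical_states.iter().any(|p| *p == Some(s));
    if any(PartitionState::Failed) {
        WantState::Failed // job can't be safely retried
    } else if any(PartitionState::UpstreamFailed) {
        WantState::UpstreamFailed // upstream deps failed
    } else if !canonical_states.is_empty()
        && canonical_states.iter().all(|p| *p == Some(PartitionState::Live))
    {
        WantState::Successful // already built
    } else if any(PartitionState::Building) {
        WantState::Building // being built right now
    } else if any(PartitionState::UpstreamBuilding) {
        WantState::UpstreamBuilding // waiting on upstream deps
    } else {
        WantState::Idle // partitions don't exist (or UpForRetry/Tainted): needs scheduling
    }
}
```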
@@ -468,14 +472,22 @@ Complete flow when a job has dependency miss:
- Wants for partition: Building → UpstreamBuilding
- Wants track the derivative want IDs in their UpstreamBuildingState
4. **Upstream builds complete or fail:**
- **Success case:** Derivative wants build upstream partitions → upstream partition becomes Live
- **Lookup downstream_waiting:** Get `downstream_waiting[upstream_partition_ref]` → list of UUIDs waiting for this upstream
- For each waiting partition UUID:
- Get partition from `partitions_by_uuid[uuid]`
- Check if ALL its MissingDeps are now satisfied (canonical partitions for all refs are Live)
- If satisfied: transition partition UpstreamBuilding → UpForRetry
- Remove uuid from `downstream_waiting` entries (cleanup)
- **Failure case:** Upstream partition transitions to Failed (hard failure)
- **Lookup downstream_waiting:** Get `downstream_waiting[failed_partition_ref]` → list of UUIDs waiting for this upstream
- For each waiting partition UUID in UpstreamBuilding state:
- Transition partition: UpstreamBuilding → UpstreamFailed
- Transition associated wants: UpstreamBuilding → UpstreamFailed
- Remove uuid from `downstream_waiting` entries (cleanup)
- This propagates failure information down the dependency chain
5. **Want becomes schedulable:**
- When partition transitions to UpForRetry, wants transition: UpstreamBuilding → Idle
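A sketch of steps 4-5, showing how the `downstream_waiting` index drives both the success and failure propagation. The struct shapes and function name are illustrative assumptions (reusing the `PartitionState` sketch from earlier), not the real `BuildState`:

```rust
use std::collections::{HashMap, HashSet};
use uuid::Uuid;

struct Partition {
    state: PartitionState,
    missing_deps: HashSet<String>, // partition refs this instance dep-missed on
}

struct BuildState {
    partitions_by_uuid: HashMap<Uuid, Partition>,
    canonical_partitions: HashMap<String, Uuid>,
    downstream_waiting: HashMap<String, Vec<Uuid>>,
}

/// An upstream partition ref has resolved: Live (`upstream_live = true`) or hard-failed.
fn on_upstream_resolved(bs: &mut BuildState, upstream_ref: &str, upstream_live: bool) {
    // O(1) lookup of downstream partitions blocked on this upstream;
    // removing the entry doubles as the index cleanup step.
    let waiting = bs.downstream_waiting.remove(upstream_ref).unwrap_or_default();

    for uuid in waiting {
        // Immutable pass first: are ALL of this partition's missing deps now Live?
        let all_deps_live = bs.partitions_by_uuid.get(&uuid).map_or(false, |p| {
            p.missing_deps.iter().all(|dep| {
                bs.canonical_partitions
                    .get(dep)
                    .and_then(|u| bs.partitions_by_uuid.get(u))
                    .map_or(false, |d| d.state == PartitionState::Live)
            })
        });

        let p = bs
            .partitions_by_uuid
            .get_mut(&uuid)
            .unwrap_or_else(|| panic!("downstream_waiting references unknown partition {uuid}"));
        if p.state != PartitionState::UpstreamBuilding {
            continue; // only partitions still waiting on upstreams react here
        }

        if !upstream_live {
            // Failure case: propagate the hard failure down the dependency chain.
            // Associated wants would transition UpstreamBuilding → UpstreamFailed as well.
            p.state = PartitionState::UpstreamFailed;
        } else if all_deps_live {
            // Success case: every missing dep is satisfied, safe to retry.
            // Associated wants would transition UpstreamBuilding → Idle (schedulable).
            p.state = PartitionState::UpForRetry;
        }
    }
}
```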
@@ -493,7 +505,8 @@ Complete flow when a job has dependency miss:
- UpstreamBuilding also acts as lease (upstreams not ready, can't retry yet)
- UpForRetry releases lease (upstreams ready, safe to schedule)
- Failed releases lease but blocks new wants (hard failure, shouldn't retry)
- UpstreamFailed releases lease and blocks new wants (upstream deps failed, can't succeed)
- `downstream_waiting` index enables O(1) lookup of affected partitions when upstreams complete or fail
**Taint Handling:**
@@ -657,6 +670,122 @@ JobRunBufferEventV1 {
- Want completes → partition remains in history
- Partition expires → new UUID for rebuild, canonical updated
## Implementation FAQs
### Q: Do we need to maintain backwards compatibility with existing events?
**A:** No. We can assume no need to maintain backwards compatibility or retain data produced before this change. This simplifies the implementation significantly - no need to handle old event formats or generate UUIDs for replayed pre-UUID events.
### Q: How should we handle reference errors and index inconsistencies?
**A:** Panic on any reference issues with contextual information. This includes:
- Missing partition UUIDs in `partitions_by_uuid`
- Missing canonical pointers in `canonical_partitions`
- Inverted index inconsistencies (wants_for_partition, downstream_waiting)
- Invalid state transitions
Add assertions and validation throughout to catch these issues immediately rather than failing silently.
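A minimal sketch of this panic-with-context style, assuming the `BuildState` shape sketched earlier; the method name is hypothetical:

```rust
impl BuildState {
    /// Panic with context if the canonical pointer or UUID map are inconsistent.
    fn assert_canonical_consistent(&self, partition_ref: &str) {
        let uuid = self
            .canonical_partitions
            .get(partition_ref)
            .unwrap_or_else(|| panic!("no canonical partition for ref {partition_ref:?}"));
        assert!(
            self.partitions_by_uuid.contains_key(uuid),
            "canonical_partitions points at missing UUID {uuid} for ref {partition_ref:?}"
        );
    }
}
```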
### Q: What about cleanup of the `wants_for_partition` inverted index?
**A:** Don't remove wants from the index when they complete. This is acceptable for the initial implementation: even years of partition builds on a mature data platform would still amount to fewer than a million entries, which is manageable. We can add cleanup later if needed.
### Q: What happens when an upstream partition is Tainted instead of becoming Live?
**A:** Tainting of an upstream means it is no longer live, and the downstream job should dep miss. The system will operate correctly:
1. Downstream job discovers upstream is Tainted (not Live) → dep miss
2. Derivative want created for tainted upstream
3. Tainted upstream triggers rebuild (new UUID, replaces canonical)
4. Derivative want succeeds → downstream can resume
### Q: How should UUIDs be generated? Should the Orchestrator calculate them?
**A:** Use deterministic derivation instead of orchestrator generation:
```rust
use sha2::{Digest, Sha256};
use uuid::Uuid;

fn derive_partition_uuid(job_run_id: &str, partition_ref: &str) -> Uuid {
    // Hash job_run_id + partition_ref bytes.
    let mut hasher = Sha256::new();
    hasher.update(job_run_id.as_bytes());
    hasher.update(partition_ref.as_bytes());
    let hash = hasher.finalize();

    // Take the first 16 bytes as the UUID. The version/variant bits come from the
    // hash, so this is a deterministic 128-bit ID rather than a spec-compliant v4 UUID.
    Uuid::from_slice(&hash[0..16]).expect("SHA-256 output is at least 16 bytes")
}
```
**Benefits:**
- No orchestrator UUID state/generation needed
- Deterministic replay (same job + ref = same UUID)
- Event schema stays simple (job_run_id + partition refs)
- Build state derives UUIDs in `handle_job_run_buffer()`
- No need for `PartitionInstanceRef` message in protobuf
### Q: How do we enforce safe canonical partition access?
**A:** Add and use helper methods in BuildState to enforce correct access patterns:
- `get_canonical_partition(ref)` - lookup canonical partition for a ref
- `get_canonical_partition_uuid(ref)` - get UUID of canonical partition
- `get_partition_by_uuid(uuid)` - direct UUID lookup
- `get_wants_for_partition(ref)` - query inverted index
Existing `get_partition()` function should be updated to use canonical lookup. Code should always access "current state" via canonical_partitions, not by ref lookup in the deprecated partitions map.
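A sketch of these helpers, assuming the `BuildState` fields named in this plan plus a hypothetical `wants_for_partition: HashMap<String, Vec<String>>` inverted index (want IDs as strings); signatures are assumptions, not the existing API:

```rust
impl BuildState {
    fn get_canonical_partition_uuid(&self, partition_ref: &str) -> Option<Uuid> {
        self.canonical_partitions.get(partition_ref).copied()
    }

    fn get_partition_by_uuid(&self, uuid: &Uuid) -> Option<&Partition> {
        self.partitions_by_uuid.get(uuid)
    }

    /// "Current state" of a ref always goes through canonical_partitions,
    /// never a by-ref lookup in the deprecated partitions map.
    fn get_canonical_partition(&self, partition_ref: &str) -> Option<&Partition> {
        self.get_canonical_partition_uuid(partition_ref)
            .and_then(|uuid| self.partitions_by_uuid.get(&uuid))
    }

    fn get_wants_for_partition(&self, partition_ref: &str) -> &[String] {
        self.wants_for_partition
            .get(partition_ref)
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}
```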
### Q: What is the want schedulability check logic?
**A:** A want is schedulable if:
- The canonical partition doesn't exist for any of its partition refs, OR
- The canonical partition exists and is in Tainted or UpForRetry state
In other words: `!exists || Tainted || UpForRetry`
Building and UpstreamBuilding partitions act as leases (not schedulable).
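In code the check is small; a sketch reusing the `Partition`/`PartitionState` shapes assumed earlier:

```rust
/// Schedulability rule above: `!exists || Tainted || UpForRetry`.
fn partition_ref_is_schedulable(canonical: Option<&Partition>) -> bool {
    match canonical {
        // Canonical partition doesn't exist yet: schedulable.
        None => true,
        // Tainted or ready-to-retry partitions are schedulable.
        // Building/UpstreamBuilding act as leases; Live, Failed, and UpstreamFailed
        // also report not schedulable here.
        Some(p) => matches!(p.state, PartitionState::Tainted | PartitionState::UpForRetry),
    }
}
```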
### Q: Should we implement phases strictly sequentially?
**A:** No. Proceed in the most efficient and productive manner possible. Phases can be combined or reordered as makes sense. For example, Phase 1 + Phase 2 can be done together since want state sensing depends on the new partition states.
### Q: Should we write tests incrementally or implement everything first?
**A:** Implement tests as we go. Write unit tests for each component as it's implemented, then integration tests for full scenarios.
### Q: Should wants reference partition UUIDs or partition refs?
**A:** Wants should NEVER reference partition instances (via UUID). Wants should ONLY reference canonical partitions via partition ref strings. This is already the case - wants include partition refs, which allows the orchestrator to resolve partition info for want state updates. The separation is:
- Wants → Partition Refs (canonical, user-facing)
- Jobs → Partition UUIDs (specific instances, historical)
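A tiny sketch of that separation; the type and field names here are illustrative assumptions, not the actual message definitions:

```rust
use uuid::Uuid;

struct Want {
    want_id: String,
    partition_refs: Vec<String>, // canonical, user-facing refs only
}

struct JobRun {
    job_run_id: String,
    building_partition_uuids: Vec<Uuid>, // specific historical instances
}
```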
### Q: Should we add UpstreamFailed state for partitions?
**A:** Yes. This provides symmetry with want semantics and clear terminal state propagation:
**Scenario:**
1. Partition A: Building → Failed (hard failure)
2. Partition B needs A, dep misses → UpstreamBuilding
3. Derivative want created for A, immediately fails (A is Failed)
4. Partition B: UpstreamBuilding → UpstreamFailed
**Benefits:**
- Clear signal that partition can never succeed (upstreams failed)
- Mirrors Want UpstreamFailed semantics (consistency)
- Useful for UIs and debugging
- Prevents indefinite waiting in UpstreamBuilding state
**Transition logic:**
- When a partition transitions to Failed, look up `downstream_waiting[failed_partition_ref]`
- For each downstream partition UUID in UpstreamBuilding state, transition to UpstreamFailed
- This propagates failure information down the dependency chain
**Add to Phase 1 partition states:**
- **UpstreamFailed**: Partition failed because upstream dependencies failed (terminal state)
**Add transition:**
- UpstreamBuilding → UpstreamFailed (upstream dependency hard failure)
### Q: Can a job build the same partition ref multiple times?
**A:** No, this is invalid. A job run cannot build the same partition multiple times. Each partition ref should appear at most once in a job's building_partitions list.
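A small sketch of enforcing that invariant, using the panic-with-context policy from earlier in this FAQ; the function name is hypothetical:

```rust
use std::collections::HashSet;

fn assert_unique_building_partitions(job_run_id: &str, building_partitions: &[String]) {
    let mut seen = HashSet::new();
    for partition_ref in building_partitions {
        assert!(
            seen.insert(partition_ref.as_str()),
            "job run {job_run_id} lists partition ref {partition_ref:?} more than once"
        );
    }
}
```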
## Summary
Adding partition UUIDs solves fundamental architectural problems: