Partition Identity Refactor: Adding UUIDs for Temporal Consistency

Problem Statement

Current Architecture

Partitions are currently keyed only by their reference string (e.g., "data/beta"):

partitions: HashMap<String, Partition>  // ref → partition

When a partition transitions through states (Missing → Building → Live → Tainted), it's the same object mutating. This creates several architectural problems:

Core Issue: Lack of Temporal Identity

The fundamental problem: We cannot distinguish between "the partition being built now" and "the partition built yesterday" or "the partition that will be built tomorrow."

This manifests in several ways:

  1. Ambiguous Job-Partition Relationships

    • When job J completes, which partition instance did it build?
    • If partition is rebuilt, we lose information about previous builds
    • Can't answer: "What was the state of data/beta when job Y ran?"
  2. State Mutation Loss

    • Once a partition transitions Live → Tainted → Missing, the Live state information is lost
    • Can't track "Partition P was built successfully by job J at time T"
    • Lineage and provenance information disappears on each rebuild
  3. Redundant Data Structures (Symptoms)

    • WantAttributedPartitions in JobRunDetail exists to snapshot want-partition relationships
    • Partitions carry want_ids: Vec<String> that get cleared/modified as partitions transition
    • Jobs need to capture relationships at creation time because they can't be reliably reconstructed later

Concrete Bug Example

The bug that led to this design discussion illustrates the problem:

1. Want 1 created for "data/beta" → partition becomes Building
2. Want 2 created for "data/beta" → but partition is ALREADY Building
3. Job has dep miss → creates derivative want
4. System expects all wants to be Building/UpstreamBuilding, but Want 2 is Idle → panic

Root cause: All wants reference the same mutable partition object. We can't distinguish:

  • "The partition instance Want 1 triggered"
  • "The partition instance Want 2 is waiting for"
  • They're the same object, but semantically they represent different temporal relationships

Proposed Solution: Partition UUIDs

Architecture Changes

Two-level indexing:

// All partition instances, keyed by UUID
partitions_by_uuid: HashMap<Uuid, Partition>

// Current/canonical partition for each ref
canonical_partitions: HashMap<String, Uuid>

Key Properties

  1. Immutable Identity: Each partition build gets a unique UUID

    • Partition(uuid-1, ref="data/beta", state=Building) is a distinct entity
    • When rebuilt, create Partition(uuid-2, ref="data/beta", state=Missing)
    • Both can coexist; uuid-1 represents historical fact, uuid-2 is current state
  2. Stable Job References: Jobs reference the specific partition UUIDs they built

    JobRunBufferEventV1 {
        building_partition_uuids: [uuid-1, uuid-2]  // Specific instances being built
    }
    
  3. Wants Reference Refs: Wants continue to reference partition refs, not UUIDs

    WantCreateEventV1 {
        partitions: ["data/beta"]  // User-facing reference
    }
    // Want's state determined by canonical partition for "data/beta"
    
  4. Temporal Queries: Can reconstruct state at any point

    • "What was partition uuid-1's state when job J ran?" → Look up uuid-1, it's immutable
    • "Which wants were waiting for data/beta at time T?" → Check canonical partition at T
    • "What's the current state of data/beta?" → canonical_partitions["data/beta"] → uuid-2

Benefits

1. Removes WantAttributedPartitions Redundancy

Before:

JobRunBufferEventV1 {
    building_partitions: [PartitionRef("data/beta")],
    // Redundant: snapshot want-partition relationship
    servicing_wants: [WantAttributedPartitions {
        want_id: "w1",
        partitions: ["data/beta"]
    }]
}

After:

JobRunBufferEventV1 {
    building_partition_uuids: [uuid-1, uuid-2]
}

// To find serviced wants, use the inverted index in BuildState
for uuid in &job.building_partition_uuids {
    let partition = &partitions_by_uuid[uuid];
    let partition_ref = &partition.partition_ref.r#ref;

    // Look up wants via inverted index (not stored on partition)
    if let Some(want_ids) = wants_for_partition.get(partition_ref) {
        for want_id in want_ids {
            // transition want
        }
    }
}

The relationship is discoverable via inverted index, not baked-in at event creation or stored on partitions.

Key improvement: Partitions don't store want_ids. This is a cleaner separation of concerns:

  • Want → Partition: Inherent (want defines partitions it wants)
  • Partition → Want: Derived (maintained as inverted index in BuildState)

Note on want state vs schedulability:

  • Want state (Building) reflects current reality: "my partitions are being built"
  • Schedulability prevents duplicate jobs: "don't schedule another job if partitions already building"
  • Both mechanisms needed: state for correctness, schedulability for efficiency

2. Proper State Semantics for Wants

Current (problematic):

Want 1 → triggers build → Building (owns the job somehow?)
Want 2 → sees partition Building → stays Idle (different from Want 1?)
Want 3 → same partition → also Idle

With UUIDs and New state:

Want 1 arrives → New → no canonical partition exists → Idle → schedulable
Orchestrator queues job → generates uuid-1 for "data/beta"
Job buffer event → creates Partition(uuid-1, "data/beta", Building)
              → updates canonical["data/beta"] = uuid-1
              → transitions Want 1: Idle → Building
Want 2 arrives → New → canonical["data/beta"] = uuid-1 (Building) → Building
Want 3 arrives → New → canonical["data/beta"] = uuid-1 (Building) → Building
Want 4 arrives → New → canonical["data/beta"] = uuid-1 (Building) → Building

All 4 wants have identical relationship to the canonical partition. The state reflects reality: "is the canonical partition for my ref being built?"

Key insights:

  • Wants don't bind to UUIDs. They look up the canonical partition for their ref and base their state on that.
  • New state makes state determination explicit: want creation → observe world → transition to appropriate state

3. Historical Lineage

// Track partition lineage over time
Partition {
    uuid: uuid-3,
    partition_ref: "data/beta",
    previous_uuid: Some(uuid-2),  // Link to previous instance
    created_at: 1234567890,
    state: Live,
    produced_by_job: Some("job-xyz"),
}

Can answer:

  • "What partitions existed for this ref over time?"
  • "Which job produced this specific partition instance?"
  • "What was the dependency chain when this partition was built?"

Implementation Plan

Phase 1: Add UUID Infrastructure (Non-Breaking)

Goals:

  • Add UUID field to Partition
  • Create dual indexing (by UUID and by ref)
  • Maintain backward compatibility

Changes:

  1. Update Partition struct (databuild/partition_state.rs)

    Add UUID field to partition:

    • uuid: Uuid - Unique identifier for this partition instance
    • Remove want_ids field (now maintained as inverted index in BuildState)

    Update partition state machine:

    States:

    • Building: Job actively building this partition
    • UpstreamBuilding: Job had dep miss, partition waiting for upstream dependencies (stores MissingDeps)
    • UpForRetry: Upstream dependencies satisfied, partition ready to retry building
    • Live: Successfully built
    • Failed: Hard failure (shouldn't retry)
    • UpstreamFailed: Partition failed because upstream dependencies failed (terminal state)
    • Tainted: Marked invalid by taint event

    Removed: Missing state - partitions only exist when jobs start building them or are completed.

    Key transitions:

    • Building → UpstreamBuilding (job reports dep miss)
    • UpstreamBuilding → UpForRetry (all upstream deps satisfied)
    • UpstreamBuilding → UpstreamFailed (upstream dependency hard failure)
    • Building → Live (job succeeds)
    • Building → Failed (job hard failure)
    • UpForRetry → Building (new job queued for retry, creates fresh UUID)
    • Live → Tainted (partition tainted)
  2. Add dual indexing and inverted indexes (databuild/build_state.rs)

    pub struct BuildState {
        partitions_by_uuid: BTreeMap<Uuid, Partition>,           // NEW
        canonical_partitions: BTreeMap<String, Uuid>,             // NEW
        wants_for_partition: BTreeMap<String, Vec<String>>,      // NEW: partition ref → want IDs
        downstream_waiting: BTreeMap<String, Vec<Uuid>>,         // NEW: partition ref → UUIDs waiting for it
        partitions: BTreeMap<String, Partition>,                  // DEPRECATED, keep for now
        // ...
    }
    

    Rationale for inverted indexes:

    wants_for_partition:

    • Partitions shouldn't know about wants (layering violation)
    • Want → Partition is inherent (want defines what it wants)
    • Partition → Want is derived (computed from wants, maintained as index)
    • BuildState owns this inverted relationship

    downstream_waiting:

    • Enables efficient dep miss resolution: when partition becomes Live, directly find which partitions are waiting for it
    • Maps upstream partition ref → list of downstream partition UUIDs that have this ref in their MissingDeps
    • Avoids scanning all UpstreamBuilding partitions when upstreams complete
    • A single index lookup finds the affected partitions (no scan of all partitions)
  3. Partition creation happens at job buffer time

    Partitions are only created when a job starts building them:

    • Orchestrator generates fresh UUIDs when queuing job
    • handle_job_run_buffer() creates partitions directly in Building state with those UUIDs
    • Store in both maps: partitions_by_uuid[uuid] and canonical_partitions[ref] = uuid
    • Keep partitions[ref] updated for backward compatibility during migration

    No partitions created during want creation - wants just register in inverted index.

  4. Add helper methods for accessing partitions by UUID and ref

    • get_canonical_partition(ref) - lookup canonical partition for a ref
    • get_canonical_partition_uuid(ref) - get UUID of canonical partition
    • get_partition_by_uuid(uuid) - direct UUID lookup
    • get_wants_for_partition(ref) - query inverted index
  5. Update inverted index maintenance

    When wants are created, the wants_for_partition index must be updated (see the sketch after this list):

    • Want creation: Add want_id to index for each partition ref in the want
    • Want completion/cancellation: For now, do NOT remove from index. Cleanup can be added later if needed.

    No partition creation needed - just update the index. Partitions are created later when jobs are queued.

    Rationale for not cleaning up:

    • Index size should be manageable for now
    • Cleanup logic is straightforward to add later when needed
    • Avoids complexity around replay (removal operations not in event log)

    Key consideration: The index maps partition refs (not UUIDs) to want IDs, since wants reference refs. When a partition is rebuilt with a new UUID, the same ref continues to map to the same wants until those wants complete.
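
A minimal sketch of the accessors and index maintenance from items 4 and 5, assuming the BuildState fields from item 2; the method names match the list above, but the bodies are illustrative:

use std::collections::BTreeMap;
use uuid::Uuid;

struct Partition { /* uuid, partition_ref, state, ... */ }

struct BuildState {
    partitions_by_uuid: BTreeMap<Uuid, Partition>,
    canonical_partitions: BTreeMap<String, Uuid>,
    wants_for_partition: BTreeMap<String, Vec<String>>, // partition ref → want IDs
}

impl BuildState {
    // Want creation: register the want against every ref it asks for.
    // No partitions are created here; that happens at job buffer time.
    fn register_want(&mut self, want_id: &str, partition_refs: &[String]) {
        for partition_ref in partition_refs {
            self.wants_for_partition
                .entry(partition_ref.clone())
                .or_default()
                .push(want_id.to_string());
        }
    }

    fn get_canonical_partition(&self, partition_ref: &str) -> Option<&Partition> {
        let uuid = self.canonical_partitions.get(partition_ref)?;
        // Index inconsistency is a bug: panic with context (see FAQs).
        Some(self.partitions_by_uuid.get(uuid).unwrap_or_else(|| {
            panic!("canonical uuid {uuid} for {partition_ref} missing from partitions_by_uuid")
        }))
    }

    fn get_wants_for_partition(&self, partition_ref: &str) -> &[String] {
        self.wants_for_partition
            .get(partition_ref)
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}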

Phase 2: Add New State and Want State Sensing

Goals:

  • Add explicit "New" state to Want state machine
  • Wants sense canonical partition state and transition appropriately
  • Clarify distinction between want state and schedulability

Changes:

  1. Add New state to want_state.rs

    Add a new state that represents a want that has just been created but hasn't yet observed the world:

    • NewState - Want has been created from event, state not yet determined
    • Transitions from New:
      • New → Failed (any partition failed)
      • New → UpstreamFailed (any partition upstream-failed)
      • New → Successful (all partitions live)
      • New → Building (any partition building)
      • New → UpstreamBuilding (any partition waiting on upstream deps)
      • New → Idle (partitions don't exist or other states)

    This makes state determination explicit and observable in the event log.

  2. Update handle_want_create() to sense and transition

    During want creation event processing:

    • Create want in New state from WantCreateEventV1
    • Register want in inverted index (wants_for_partition)
    • Check canonical partition states for all partition refs
    • Transition based on observation (in priority order):
      • If ANY canonical partition is Failed → New → Failed (job can't be safely retried)
      • If ANY canonical partition is UpstreamFailed → New → UpstreamFailed (upstream deps failed)
      • If ALL canonical partitions exist AND are Live → New → Successful (already built!)
      • If ANY canonical partition is Building → New → Building (being built now)
      • If ANY canonical partition is UpstreamBuilding → New → UpstreamBuilding (waiting for deps)
      • If ANY canonical partition is UpForRetry → New → Idle (deps satisfied, ready to schedule)
      • Otherwise (partitions don't exist or other states) → New → Idle (need to schedule)
    • For derivative wants, additional logic may transition to UpstreamBuilding

    Key insight: Most wants will go New → Idle, because their partitions won't exist yet (partitions are only created when jobs start). Wants arriving while partitions are already building go New → Building; wants arriving during a dep miss go New → UpstreamBuilding; wants for partitions ready to retry go New → Idle; wants for already-Live partitions go New → Successful; and wants for Failed or UpstreamFailed partitions go New → Failed or New → UpstreamFailed. A sketch of this sensing order follows this list.

  3. Keep WantSchedulability building check

    Important distinction: Want state vs. schedulability are different concerns:

    • Want state (New → Building): "Are my partitions currently being built?" - Reflects reality
    • Schedulability: "Should the orchestrator start a NEW job for this want?" - Prevents duplicate jobs

    Example scenario:

    Want 1: Idle → schedules job → partition becomes Building → want becomes Building
    Want 2 arrives → sees partition Building → New → Building
    Orchestrator polls: both wants are Building, but should NOT schedule another job
    

    The building field in WantUpstreamStatus remains necessary to prevent duplicate job scheduling. A want can be in Building state but not schedulable if partitions are already being built by another job.

    Keep the existing schedulability logic that checks building.is_empty().

  4. Update derivative want handling

    Modify handle_derivative_want_creation() to handle wants in their appropriate states:

    • Building → UpstreamBuilding: Want is Building when dep miss occurs (normal case)
    • UpstreamBuilding → UpstreamBuilding: Want already waiting on upstreams, add another (additional dep miss)

    Note: Idle wants should NOT be present during derivative want creation. If partitions are building (which they must be for a job to report dep miss), wants would have been created in Building state via New → Building transition.

  5. Add required state transitions in want_state.rs

    New transitions needed:

    • New → Failed: Any partition failed
    • New → UpstreamFailed: Any partition upstream failed
    • New → Successful: All partitions live
    • New → Idle: Normal case, partitions don't exist
    • New → Building: Partitions already building when want created
    • New → UpstreamBuilding: Partitions already waiting on upstream deps when want created
    • Building → UpstreamBuilding: Job reports dep miss (first time)
    • UpstreamBuilding → UpstreamBuilding: Additional upstreams added

    Note: New → UpstreamBuilding covers wants that arrive while the canonical partition is already UpstreamBuilding; wants created before the dep miss go New → Building first, then Building → UpstreamBuilding when the dep miss occurs.
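
A sketch of the sensing order from item 2 above, assuming want and partition state enums shaped like those described in Phases 1-2 (these local enums are illustrative stand-ins):

#[derive(Clone, Copy, PartialEq)]
enum PartitionState { Building, UpstreamBuilding, UpForRetry, Live, Failed, UpstreamFailed, Tainted }

#[derive(Clone, Copy, PartialEq)]
enum WantState { New, Idle, Building, UpstreamBuilding, Successful, Failed, UpstreamFailed }

// Decide the transition out of New by observing the canonical partition for
// each of the want's refs, in the priority order listed above. `None` means
// no canonical partition exists for that ref yet.
fn sense_initial_state(canonical_states: &[Option<PartitionState>]) -> WantState {
    use PartitionState::*;
    let any = |s: PartitionState| canonical_states.iter().any(|c| *c == Some(s));
    let all_live = !canonical_states.is_empty()
        && canonical_states.iter().all(|c| *c == Some(Live));

    if any(Failed) {
        WantState::Failed
    } else if any(UpstreamFailed) {
        WantState::UpstreamFailed
    } else if all_live {
        WantState::Successful
    } else if any(Building) {
        WantState::Building
    } else if any(UpstreamBuilding) {
        WantState::UpstreamBuilding
    } else {
        // Covers UpForRetry, Tainted, and refs with no canonical partition yet.
        WantState::Idle
    }
}

In handle_want_create(), the canonical states would come from get_canonical_partition() for each ref in the want, and the returned state is what the want transitions to out of New.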

Phase 3: Update Job Events

Goals:

  • Jobs reference partition UUIDs, not just refs
  • Remove WantAttributedPartitions redundancy

Changes:

  1. Update JobRunBufferEventV1 in databuild.proto

    Add new message and field:

    message PartitionInstanceRef {
        PartitionRef partition_ref = 1;
        string uuid = 2;  // UUID as string
    }
    
    message JobRunBufferEventV1 {
        // ... existing fields ...
        repeated PartitionInstanceRef building_partitions_v2 = 6;  // NEW
        repeated PartitionRef building_partitions = 4;  // DEPRECATED
        repeated WantAttributedPartitions servicing_wants = 5;  // DEPRECATED
    }
    

    This pairs each partition ref with its UUID, solving the mapping problem.

  2. Update handle_job_run_buffer() in build_state.rs (see the sketch after this list)

    Change partition and want lookup logic:

    • Parse UUIDs from event (need partition refs too - consider adding to event or deriving from wants)
    • Create partitions directly in Building state with these UUIDs (no Missing state)
    • Update canonical_partitions to point refs to these new UUIDs
    • Use inverted index (wants_for_partition) to find wants for each partition ref
    • Transition those wants: Idle → Building (or stay Building if already there)
    • Create job run in Queued state

    Key changes:

    • Partitions created here, not during want creation
    • No Missing → Building transition, created directly as Building
    • Use inverted index for want discovery (not stored on partition or in event)
  3. Update Orchestrator's queue_job() in orchestrator.rs

    When creating JobRunBufferEventV1:

    • Get partition refs from wants (existing logic)
    • Generate fresh UUIDs for each unique partition ref (one UUID per ref)
    • Include UUID list in event along with refs (may need to update event schema)
    • Orchestrator no longer needs to track or snapshot want-partition relationships

    Key change: Orchestrator generates UUIDs at job queue time, not looking up canonical partitions. Each job attempt gets fresh UUIDs. The event handler will create partitions in Building state with these UUIDs and update canonical pointers.

    This eliminates WantAttributedPartitions entirely - relationships are discoverable via inverted index.
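
A sketch of the handler flow from item 2, assuming the BuildState fields from Phase 1; the (ref, UUID) pairs are whatever the event carries (or derives, per the FAQs), and job-run bookkeeping is elided:

use std::collections::BTreeMap;
use uuid::Uuid;

enum PartitionState { Building /* , Live, Failed, ... */ }

struct Partition {
    uuid: Uuid,
    partition_ref: String,
    state: PartitionState,
    job_run_id: String,
}

struct BuildState {
    partitions_by_uuid: BTreeMap<Uuid, Partition>,
    canonical_partitions: BTreeMap<String, Uuid>,
    wants_for_partition: BTreeMap<String, Vec<String>>,
}

impl BuildState {
    fn handle_job_run_buffer(&mut self, job_run_id: &str, building: &[(String, Uuid)]) {
        for (partition_ref, uuid) in building {
            // Create the partition directly in Building state (there is no Missing state).
            self.partitions_by_uuid.insert(*uuid, Partition {
                uuid: *uuid,
                partition_ref: partition_ref.clone(),
                state: PartitionState::Building,
                job_run_id: job_run_id.to_string(),
            });
            // Point the canonical ref at the fresh instance; any previous instance
            // stays in partitions_by_uuid as an immutable historical record.
            self.canonical_partitions.insert(partition_ref.clone(), *uuid);

            // Discover serviced wants via the inverted index and transition them
            // Idle → Building (wants already Building stay Building).
            if let Some(want_ids) = self.wants_for_partition.get(partition_ref) {
                for want_id in want_ids {
                    let _ = want_id; // transition want: Idle → Building (actual helper elided)
                }
            }
        }
        // ...then create the job run in Queued state.
    }
}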

Phase 4: Partition Lifecycle Management

Goals:

  • Define when new partition UUIDs are created
  • Handle canonical partition transitions

Canonical Partition Transitions:

New partition UUID created when:

  1. First build: Orchestrator queues job → generates UUID → partition created directly as Building
  2. Taint: Partition tainted → transition current to Tainted state (keeps UUID, stays canonical so readers can see it's tainted)
  3. Rebuild after taint: Existing want (still within TTL) sees tainted partition → triggers new job → orchestrator generates fresh UUID → new partition replaces tainted one in canonical_partitions

Note on TTL/SLA: These are want properties, not partition properties. TTL defines how long after want creation the orchestrator should keep attempting to build partitions. When a partition is tainted, wants within TTL will keep retrying. SLA is an alarm threshold. Partitions don't expire - they stay Live until explicitly tainted or replaced by a new build.

Key principles:

  • Building state as lease: The Building state serves as a lease mechanism. While a partition is in Building state, the orchestrator will not attempt to schedule additional jobs to build that partition. This prevents concurrent/duplicate builds. The lease is released when the partition transitions to Live, Failed, or when a new partition instance with a fresh UUID is created and becomes canonical (e.g., after the building job reports dep miss and a new job is queued).
  • When canonical pointer is updated (e.g., new build replaces tainted partition), old partition UUID remains in partitions_by_uuid for historical queries
  • Canonical pointer always points to the current/active partition instance, whatever its state (Building, UpstreamBuilding, UpForRetry, Live, Failed, UpstreamFailed, or Tainted)
  • Tainted partitions stay canonical until replaced - readers need to see they're tainted
  • Old instances become immutable historical records
  • No Missing state - partitions only exist when jobs are actively building them or completed

Partition Creation:

Partitions created during handle_job_run_buffer():

  • UUIDs come from the event (generated by orchestrator)
  • Create partition directly in Building state with job_run_id
  • Update canonical_partitions map to point ref → UUID
  • Store in partitions_by_uuid
  • If replacing a tainted/failed partition, old one remains in partitions_by_uuid by its UUID

Dep Miss Handling:

Complete flow when a job has dependency miss:

  1. Job reports dep miss:

    • Job building partition uuid-1 encounters missing upstream deps
    • JobRunDepMissEventV1 emitted with MissingDeps (partition refs needed)
    • Derivative wants created for missing upstream partitions
  2. Partition transitions to UpstreamBuilding:

    • Partition uuid-1: Building → UpstreamBuilding
    • Store MissingDeps in partition state (which upstream refs it's waiting for)
    • Update inverted index: For each missing dep ref, add uuid-1 to downstream_waiting[missing_dep_ref]
    • Partition remains canonical (holds lease - prevents concurrent retry attempts)
    • Job run transitions to DepMissed state
  3. Want transitions:

    • Wants for partition: Building → UpstreamBuilding
    • Wants track the derivative want IDs in their UpstreamBuildingState
  4. Upstream builds complete or fail (a sketch of the success case follows the key properties below):

    • Success case: Derivative wants build upstream partitions → upstream partition becomes Live

      • Lookup downstream_waiting: Get downstream_waiting[upstream_partition_ref] → list of UUIDs waiting for this upstream
      • For each waiting partition UUID:
        • Get partition from partitions_by_uuid[uuid]
        • Check if ALL its MissingDeps are now satisfied (canonical partitions for all refs are Live)
        • If satisfied: transition partition UpstreamBuilding → UpForRetry
        • Remove uuid from downstream_waiting entries (cleanup)
    • Failure case: Upstream partition transitions to Failed (hard failure)

      • Lookup downstream_waiting: Get downstream_waiting[failed_partition_ref] → list of UUIDs waiting for this upstream
      • For each waiting partition UUID in UpstreamBuilding state:
        • Transition partition: UpstreamBuilding → UpstreamFailed
        • Transition associated wants: UpstreamBuilding → UpstreamFailed
        • Remove uuid from downstream_waiting entries (cleanup)
      • This propagates failure information down the dependency chain
  5. Want becomes schedulable:

    • When partition transitions to UpForRetry, wants transition: UpstreamBuilding → Idle
    • Orchestrator sees Idle wants with UpForRetry canonical partitions → schedulable
    • New job queued → fresh UUID (uuid-2) generated
    • Partition uuid-2 created as Building, replaces uuid-1 in canonical_partitions
    • Partition uuid-1 (UpForRetry) remains in partitions_by_uuid as historical record
  6. New wants during dep miss:

    • Want arrives while partition is UpstreamBuilding → New → UpstreamBuilding (correctly waits)
    • Want arrives while partition is UpForRetry → New → Idle (correctly schedulable)

Key properties:

  • Building state acts as lease (no concurrent builds)
  • UpstreamBuilding also acts as lease (upstreams not ready, can't retry yet)
  • UpForRetry releases lease (upstreams ready, safe to schedule)
  • Failed releases lease but blocks new wants (hard failure, shouldn't retry)
  • UpstreamFailed releases lease and blocks new wants (upstream deps failed, can't succeed)
  • downstream_waiting index enables a direct index lookup of affected partitions when upstreams complete or fail (see the sketch below)
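
A sketch of the success-case resolution in step 4, assuming MissingDeps are stored as upstream partition refs on the waiting partition (types and names here are illustrative):

use std::collections::{BTreeMap, BTreeSet};
use uuid::Uuid;

#[derive(Clone, Copy, PartialEq)]
enum PartitionState { UpstreamBuilding, UpForRetry, Live }

struct Partition {
    state: PartitionState,
    missing_deps: BTreeSet<String>, // upstream refs this partition is waiting for
}

// Called when the canonical partition for `upstream_ref` becomes Live.
fn on_upstream_live(
    upstream_ref: &str,
    downstream_waiting: &mut BTreeMap<String, Vec<Uuid>>,
    partitions_by_uuid: &mut BTreeMap<Uuid, Partition>,
    canonical_partitions: &BTreeMap<String, Uuid>,
) {
    // Direct lookup of the partitions waiting on this upstream (no scan).
    let waiting = downstream_waiting.remove(upstream_ref).unwrap_or_default();
    for uuid in waiting {
        let all_satisfied = {
            let partition = &partitions_by_uuid[&uuid];
            // Satisfied when every missing dep's canonical partition is Live.
            partition.missing_deps.iter().all(|dep_ref| {
                canonical_partitions
                    .get(dep_ref)
                    .map(|dep_uuid| partitions_by_uuid[dep_uuid].state == PartitionState::Live)
                    .unwrap_or(false)
            })
        };
        if all_satisfied {
            // UpstreamBuilding → UpForRetry; wants then go UpstreamBuilding → Idle.
            partitions_by_uuid.get_mut(&uuid).unwrap().state = PartitionState::UpForRetry;
            // (Removing `uuid` from any other downstream_waiting entries is elided.)
        }
    }
}

The failure case mirrors this lookup but transitions the waiting partitions and their wants to UpstreamFailed.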

Taint Handling:

When partition is tainted (via TaintCreateEvent):

  • Find current canonical UUID for the ref
  • Transition that partition instance to Tainted state (preserves history)
  • Keep in canonical_partitions - readers need to see it's tainted
  • Wants within TTL will see partition is tainted (not Live)
  • Orchestrator will schedule new jobs for those wants
  • New partition created with fresh UUID when next job starts
  • New partition replaces tainted one in canonical_partitions
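
A small sketch of the taint path, under the same assumptions about BuildState fields (illustrative only):

use std::collections::BTreeMap;
use uuid::Uuid;

enum PartitionState { Live, Tainted /* , ... */ }

struct Partition { state: PartitionState }

fn handle_taint(
    partition_ref: &str,
    canonical_partitions: &BTreeMap<String, Uuid>,
    partitions_by_uuid: &mut BTreeMap<Uuid, Partition>,
) {
    // Taint the current canonical instance in place: it keeps its UUID and
    // stays canonical so readers can see the data is tainted.
    if let Some(uuid) = canonical_partitions.get(partition_ref) {
        partitions_by_uuid.get_mut(uuid).unwrap().state = PartitionState::Tainted;
    }
    // Wants still within TTL observe Tainted (not Live); the orchestrator then
    // schedules a new job, and the fresh UUID from that job's buffer event
    // replaces this one in canonical_partitions.
}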

Phase 5: Migration and Cleanup

Goals:

  • Remove deprecated fields
  • Update API responses
  • Complete migration

Changes:

  1. Remove deprecated fields from protobuf

    • building_partitions from JobRunBufferEventV1
    • servicing_wants from JobRunBufferEventV1
    • WantAttributedPartitions message
  2. Remove backward compatibility code

    • partitions: BTreeMap<String, Partition> from BuildState
    • Dual writes/reads
  3. Update API responses to include UUIDs where relevant

    • JobRunDetail can include partition UUIDs built
    • PartitionDetail can include UUID for debugging
  4. Update tests to use UUID-based assertions

Design Decisions & Trade-offs

1. Wants Reference Refs, Not UUIDs

Decision: Wants always reference partition refs (e.g., "data/beta"), not UUIDs.

Rationale:

  • User requests "data/beta" - the current/canonical partition for that ref
  • Want state is based on canonical partition: "is the current partition for my ref being built?"
  • If partition gets tainted/rebuilt, wants see the new canonical partition automatically
  • Simpler mental model: want doesn't care about historical instances

How it works:

// Want creation
want.partitions = ["data/beta"]  // ref, not UUID

// Want state determination
if let Some(canonical_uuid) = canonical_partitions.get("data/beta") {
    let partition = partitions_by_uuid[canonical_uuid];
    match partition.state {
        Building => want.state = Building,
        Live => want can complete,
        ...
    }
} else {
    // No canonical partition exists yet → Idle
}

2. Jobs Reference UUIDs, Not Refs

Decision: Jobs reference the specific partition UUIDs they built.

Rationale:

  • Jobs build specific partition instances
  • Historical record: "Job J built Partition(uuid-1)"
  • Even if partition is later tainted/rebuilt, job's record is immutable
  • Enables provenance: "Which job built this specific partition?"

How it works:

JobRunBufferEventV1 {
    building_partition_uuids: [uuid-1, uuid-2]  // Specific instances
}

3. UUID Generation: When?

Decision: Orchestrator generates UUIDs when queuing jobs, includes them in JobRunBufferEventV1.

Rationale:

  • UUIDs represent specific build attempts, not partition refs
  • Orchestrator is source of truth for "start building these partitions"
  • Event contains UUIDs, making replay deterministic (same UUIDs in event)
  • No UUID generation during event processing - UUIDs are in the event itself

Key insight: The orchestrator generates UUIDs (not BuildState during event handling). This makes UUIDs part of the immutable event log.

4. Canonical Partition: One at a Time

Decision: Only one canonical partition per ref at a time.

Scenario handling:

  • Partition(uuid-1, "data/beta") is Building
  • User requests rebuild → new want sees uuid-1 is Building → want becomes Building
  • Want waits for uuid-1 to complete
  • If uuid-1 completes successfully → want completes
  • If uuid-1 fails or is tainted → new partition instance created (uuid-2), canonical updated

Alternative considered: Multiple concurrent builds with versioning

  • Significantly more complex
  • No existing need for this

5. Event Format: UUID as String

Decision: Store UUIDs as strings in protobuf events.

Rationale:

  • Human-readable in logs/debugging
  • Standard UUID string format (36 chars)
  • Protobuf has no native UUID type

Trade-off: Larger event size (36 bytes vs 16 bytes) - acceptable for debuggability.

Testing Strategy

Unit Tests

  1. Partition UUID uniqueness

    • Creating partitions generates unique UUIDs
    • Same ref at different times gets different UUIDs
  2. Canonical partition tracking

    • canonical_partitions always points to current instance
    • Old instances remain in partitions_by_uuid
  3. Want state determination

    • Want checks canonical partition state
    • Multiple wants see same canonical partition

Integration Tests

  1. Multi-want scenario (reproduces original bug)

    • Want 1 created → New → no partition exists → Idle
    • Job scheduled → orchestrator generates uuid-1 → partition created Building
    • Want 1 transitions Idle → Building (via job buffer event)
    • Wants 2-4 created → New → partition Building (uuid-1) → Building
    • All 4 wants reference same canonical partition uuid-1
    • Job dep miss → all transition to UpstreamBuilding correctly
    • Verifies New state transitions and state sensing work correctly
  2. Rebuild scenario

    • Partition built → Live (uuid-1)
    • Partition tainted → new instance created (uuid-2), canonical updated
    • New wants reference uuid-2
    • Old partition uuid-1 still queryable for history

End-to-End Tests

  1. Full lifecycle
    • Want created → canonical partition determined
    • Job runs → partition transitions through states
    • Want completes → partition remains in history
    • Partition tainted → new UUID created for rebuild, canonical updated

Implementation FAQs

Q: Do we need to maintain backwards compatibility with existing events?

A: No. We can assume no need to maintain backwards compatibility or retain data produced before this change. This simplifies the implementation significantly - no need to handle old event formats or generate UUIDs for replayed pre-UUID events.

Q: How should we handle reference errors and index inconsistencies?

A: Panic on any reference issues with contextual information. This includes:

  • Missing partition UUIDs in partitions_by_uuid
  • Missing canonical pointers in canonical_partitions
  • Inverted index inconsistencies (wants_for_partition, downstream_waiting)
  • Invalid state transitions

Add assertions and validation throughout to catch these issues immediately rather than failing silently.

Q: What about cleanup of the wants_for_partition inverted index?

A: Don't remove wants from the index when they complete. This is acceptable for the initial implementation: even years of partition builds on a mature data platform would amount to fewer than a million entries, which is manageable. We can add cleanup later if needed.

Q: What happens when an upstream partition is Tainted instead of becoming Live?

A: Tainting of an upstream means it is no longer live, and the downstream job should dep miss. The system will operate correctly:

  1. Downstream job discovers upstream is Tainted (not Live) → dep miss
  2. Derivative want created for tainted upstream
  3. Tainted upstream triggers rebuild (new UUID, replaces canonical)
  4. Derivative want succeeds → downstream can resume

Q: How should UUIDs be generated? Should the Orchestrator calculate them?

A: Use deterministic derivation instead of orchestrator generation:

use sha2::{Digest, Sha256};
use uuid::Uuid;

fn derive_partition_uuid(job_run_id: &str, partition_ref: &str) -> Uuid {
    // Hash job_run_id + partition_ref bytes
    let mut hasher = Sha256::new();
    hasher.update(job_run_id.as_bytes());
    hasher.update(partition_ref.as_bytes());
    let hash = hasher.finalize();
    // Convert the first 16 bytes of the digest into a UUID
    Uuid::from_slice(&hash[0..16]).unwrap()
}

Benefits:

  • No orchestrator UUID state/generation needed
  • Deterministic replay (same job + ref = same UUID)
  • Event schema stays simple (job_run_id + partition refs)
  • Build state derives UUIDs in handle_job_run_buffer()
  • No need for PartitionInstanceRef message in protobuf

Q: How do we enforce safe canonical partition access?

A: Add and use helper methods in BuildState to enforce correct access patterns:

  • get_canonical_partition(ref) - lookup canonical partition for a ref
  • get_canonical_partition_uuid(ref) - get UUID of canonical partition
  • get_partition_by_uuid(uuid) - direct UUID lookup
  • get_wants_for_partition(ref) - query inverted index

Existing get_partition() function should be updated to use canonical lookup. Code should always access "current state" via canonical_partitions, not by ref lookup in the deprecated partitions map.

Q: What is the want schedulability check logic?

A: A want is schedulable if:

  • The canonical partition doesn't exist for any of its partition refs, OR
  • The canonical partition exists and is in Tainted or UpForRetry state

In other words: !exists || Tainted || UpForRetry

Building and UpstreamBuilding partitions act as leases (not schedulable).
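
A sketch of that per-ref check, assuming a partition state enum as in Phase 1 (how a multi-ref want combines the per-ref results is left to the existing schedulability logic):

enum PartitionState { Building, UpstreamBuilding, UpForRetry, Live, Failed, UpstreamFailed, Tainted }

// !exists || Tainted || UpForRetry, for a single partition ref.
fn ref_is_schedulable(canonical_state: Option<PartitionState>) -> bool {
    match canonical_state {
        None => true,                             // no canonical partition exists yet
        Some(PartitionState::Tainted) => true,    // stale data, rebuild allowed
        Some(PartitionState::UpForRetry) => true, // upstreams satisfied, safe to retry
        Some(_) => false,                         // Building/UpstreamBuilding hold the lease;
                                                  // Live/Failed/UpstreamFailed need no new job
    }
}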

Q: Should we implement phases strictly sequentially?

A: No. Proceed in the most efficient and productive manner possible. Phases can be combined or reordered as makes sense. For example, Phase 1 + Phase 2 can be done together since want state sensing depends on the new partition states.

Q: Should we write tests incrementally or implement everything first?

A: Implement tests as we go. Write unit tests for each component as it's implemented, then integration tests for full scenarios.

Q: Should wants reference partition UUIDs or partition refs?

A: Wants should NEVER reference partition instances (via UUID). Wants should ONLY reference canonical partitions via partition ref strings. This is already the case - wants include partition refs, which allows the orchestrator to resolve partition info for want state updates. The separation is:

  • Wants → Partition Refs (canonical, user-facing)
  • Jobs → Partition UUIDs (specific instances, historical)

Q: Should we add UpstreamFailed state for partitions?

A: Yes. This provides symmetry with want semantics and clear terminal state propagation:

Scenario:

  1. Partition A: Building → Failed (hard failure)
  2. Partition B needs A, dep misses → UpstreamBuilding
  3. Derivative want created for A, immediately fails (A is Failed)
  4. Partition B: UpstreamBuilding → UpstreamFailed

Benefits:

  • Clear signal that partition can never succeed (upstreams failed)
  • Mirrors Want UpstreamFailed semantics (consistency)
  • Useful for UIs and debugging
  • Prevents indefinite waiting in UpstreamBuilding state

Transition logic:

  • When partition transitions to Failed, lookup downstream_waiting[failed_partition_ref]
  • For each downstream partition UUID in UpstreamBuilding state, transition to UpstreamFailed
  • This propagates failure information down the dependency chain

Add to Phase 1 partition states:

  • UpstreamFailed: Partition failed because upstream dependencies failed (terminal state)

Add transition:

  • UpstreamBuilding → UpstreamFailed (upstream dependency hard failure)

Q: Can a job build the same partition ref multiple times?

A: No, this is invalid. A job run cannot build the same partition multiple times. Each partition ref should appear at most once in a job's building_partitions list.

Summary

Adding partition UUIDs solves fundamental architectural problems:

  • Temporal identity: Distinguish partition instances over time
  • Stable job references: Jobs reference immutable partition UUIDs they built
  • Wants reference refs: Want state based on canonical partition for their ref
  • Discoverable relationships: Remove redundant snapshot data (WantAttributedPartitions)
  • Proper semantics: Want state reflects actual canonical partition state

Key principle: Wants care about "what's the current state of data/beta?" (refs), while jobs and historical queries care about "what happened to this specific partition instance?" (UUIDs).

This refactor enables cleaner code, better observability, and proper event sourcing semantics throughout the system.