databuild/docs/partition-delegation.md
2025-07-12 14:46:39 -07:00

Partition Delegation in DataBuild

Overview

Partition delegation is the coordination mechanism DataBuild uses to prevent duplicate work. When a build request needs a partition that another build is currently producing, or has already produced, it delegates to that build instead of rebuilding the partition. Every delegation is recorded as an event, so the audit trail shows which build supplied each partition and why work was skipped.

Motivation

DataBuild is designed to handle concurrent build requests efficiently. Without delegation, multiple build requests might attempt to build the same partitions simultaneously, leading to:

  • Resource Waste: Multiple processes building identical partitions
  • Race Conditions: Concurrent writes to the same partition outputs
  • Inconsistent State: Different builds potentially producing different results for the same partition
  • Poor Performance: Duplicated computation and I/O overhead

Delegation solves these problems by establishing clear coordination rules and providing complete traceability.

Delegation Types

DataBuild implements two distinct delegation patterns:

1. Active Delegation

When: A partition is currently being built by another active build request.

Behavior:

  • Delegate to the currently executing build request
  • Log DelegationEvent pointing to the active build's ID
  • No job execution occurs for the delegating request
  • Wait for the active build to complete

Event Flow:

Build Request A wants partition X (currently being built by Build B):
1. DelegationEvent(partition=X, delegated_to=Build_B_ID, message="Delegated to active build during execution")
2. No JobEvent created for Build A
3. Task marked as succeeded locally in Build A

2. Historical Delegation

When: A partition already exists and is available (built by a previous request).

Behavior:

  • Delegate to the historical build request that created the partition
  • Log both a DelegationEvent and a JobEvent with status JOB_SKIPPED
  • Provide complete audit trail showing why work was avoided

Event Flow:

Build Request A wants partition X (already available from Build C):
1. DelegationEvent(partition=X, delegated_to=Build_C_ID, message="Delegated to historical build - partition already available")
2. JobEvent(status=JOB_SKIPPED, message="Job skipped - all target partitions already available")
3. Task marked as succeeded locally in Build A
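
The two event flows above differ only in which events are written. The following is a minimal sketch of that difference, not DataBuild's actual code: the `BuildEvent` enum and both function names are hypothetical, and only the message strings are taken from the flows above.

```rust
/// Hypothetical, simplified event type for illustration only;
/// the real DataBuild event definitions may look different.
#[derive(Debug, Clone, PartialEq)]
pub enum BuildEvent {
    /// Partition-level delegation record.
    Delegation {
        partition: String,
        delegated_to: String,
        message: String,
    },
    /// Job-level record with status JOB_SKIPPED.
    JobSkipped { message: String },
}

/// Active delegation: a single DelegationEvent and no JobEvent.
pub fn active_delegation_events(partition: &str, active_build_id: &str) -> Vec<BuildEvent> {
    vec![BuildEvent::Delegation {
        partition: partition.to_string(),
        delegated_to: active_build_id.to_string(),
        message: "Delegated to active build during execution".to_string(),
    }]
}

/// Historical delegation: a DelegationEvent followed by a JOB_SKIPPED JobEvent.
pub fn historical_delegation_events(partition: &str, source_build_id: &str) -> Vec<BuildEvent> {
    vec![
        BuildEvent::Delegation {
            partition: partition.to_string(),
            delegated_to: source_build_id.to_string(),
            message: "Delegated to historical build - partition already available".to_string(),
        },
        BuildEvent::JobSkipped {
            message: "Job skipped - all target partitions already available".to_string(),
        },
    ]
}
```

Calling `active_delegation_events("X", "build-b")` yields one event, while `historical_delegation_events("X", "build-c")` yields two, which is what lets the success rate metrics count the skipped job.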

Multi-Partition Job Coordination

Jobs in DataBuild can produce multiple partitions. Delegation decisions are made at the job level based on all target partitions:

Job Execution Rules

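  1. Execute: If ANY target partition needs building (and none is being built by another request), execute the entire job
  2. Skip: Only if ALL target partitions are already available
  3. Delegate to Active: If ANY target partition is being built by another request, delegate the entire job to that request; this check takes precedence over the other two rules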

Example Scenarios

Scenario 1: Mixed Availability

Job produces partitions [A, B, C]:
- A: Available (from Build X)  
- B: Needs building
- C: Available (from Build Y)

Result: Execute the job (because B needs building)
Events: Normal job execution (JOB_SCHEDULED → JOB_RUNNING → JOB_COMPLETED/FAILED)

Scenario 2: All Available

Job produces partitions [A, B, C]:
- A: Available (from Build X)
- B: Available (from Build Y)  
- C: Available (from Build Z)

Result: Skip the job (all partitions available)
Events: 
- DelegationEvent(A, delegated_to=Build_X_ID)
- DelegationEvent(B, delegated_to=Build_Y_ID)  
- DelegationEvent(C, delegated_to=Build_Z_ID)
- JobEvent(status=JOB_SKIPPED)

Scenario 3: Active Build Conflict

Job produces partitions [A, B]:
- A: Available (from Build X)
- B: Being built by Build Y (active)

Result: Delegate entire job to Build Y
Events:
- DelegationEvent(A, delegated_to=Build_Y_ID, message="Delegated to active build")
- DelegationEvent(B, delegated_to=Build_Y_ID, message="Delegated to active build") 
- No JobEvent (delegated at planning/coordination level)
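
The job-level rules and the three scenarios above can be expressed as one small decision function. This is a sketch under assumptions: `PartitionState`, `JobDecision`, and `decide` are hypothetical names introduced here, and each partition is assumed to have been classified already as available, actively building, or needing a build.

```rust
/// Hypothetical per-partition classification, as seen by the requesting build.
#[derive(Debug, Clone, PartialEq)]
pub enum PartitionState {
    Available { source_build: String },
    ActivelyBuilding { active_build: String },
    NeedsBuilding,
}

/// Hypothetical job-level outcome.
#[derive(Debug, Clone, PartialEq)]
pub enum JobDecision {
    Execute,
    Skip { sources: Vec<String> },
    DelegateToActive { active_build: String },
}

/// Applies the job execution rules to all target partitions of one job.
pub fn decide(outputs: &[PartitionState]) -> JobDecision {
    // Active delegation wins: any partition under construction elsewhere
    // delegates the entire job to that build.
    let active = outputs.iter().find_map(|p| match p {
        PartitionState::ActivelyBuilding { active_build } => Some(active_build.clone()),
        _ => None,
    });
    if let Some(active_build) = active {
        return JobDecision::DelegateToActive { active_build };
    }

    // Execute the whole job if any partition still needs building.
    if outputs.iter().any(|p| matches!(p, PartitionState::NeedsBuilding)) {
        return JobDecision::Execute;
    }

    // Otherwise every partition is available: skip and remember the sources.
    let sources: Vec<String> = outputs
        .iter()
        .filter_map(|p| match p {
            PartitionState::Available { source_build } => Some(source_build.clone()),
            _ => None,
        })
        .collect();
    JobDecision::Skip { sources }
}
```

Note that in this sketch a job with no output partitions falls through to `Skip` with an empty source list; whether DataBuild ever plans such a job is not stated here.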

Build Event Log Integration

Delegation is implemented through the Build Event Log (BEL), which serves as the authoritative source for all build coordination decisions.

Key Event Types

  1. DelegationEvent: Records partition-level delegation with full traceability
  2. JobEvent: Records job-level status including JOB_SKIPPED for historical delegation
  3. PartitionEvent: Tracks partition lifecycle (PARTITION_AVAILABLE, etc.)
  4. BuildRequestEvent: Tracks overall build request status

Event Log Queries

Finding Available Partitions:

SELECT be.build_request_id 
FROM partition_events pe 
JOIN build_events be ON pe.event_id = be.event_id 
WHERE pe.partition_ref = ? AND pe.status = '4' -- PARTITION_AVAILABLE
ORDER BY be.timestamp DESC 
LIMIT 1

Finding Active Builds:

SELECT DISTINCT be.build_request_id 
FROM partition_events pe 
JOIN build_events be ON pe.event_id = be.event_id 
WHERE pe.partition_ref = ? 
AND pe.status IN ('2', '3') -- PARTITION_SCHEDULED or PARTITION_BUILDING
AND be.build_request_id NOT IN (
    SELECT DISTINCT be3.build_request_id
    FROM build_request_events bre
    JOIN build_events be3 ON bre.event_id = be3.event_id
    WHERE bre.status IN ('4', '5') -- BUILD_REQUEST_COMPLETED or BUILD_REQUEST_FAILED
)
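
To make the logic of the active-builds query easier to follow, here is an in-memory equivalent. It is only an illustration: the `PartitionEvent` and `BuildRequestEvent` structs are hypothetical flattened rows (the join against `build_events` is assumed to have happened already), and the numeric status codes are the ones used in the SQL above.

```rust
use std::collections::BTreeSet;

/// Hypothetical row: a partition event already joined with its build request ID.
pub struct PartitionEvent {
    pub build_request_id: String,
    pub partition_ref: String,
    pub status: u8, // 2 = PARTITION_SCHEDULED, 3 = PARTITION_BUILDING
}

/// Hypothetical row: a build request status event.
pub struct BuildRequestEvent {
    pub build_request_id: String,
    pub status: u8, // 4 = BUILD_REQUEST_COMPLETED, 5 = BUILD_REQUEST_FAILED
}

/// Returns the build requests that have scheduled or started building the
/// partition and have not yet reached a terminal status.
pub fn active_builds_for_partition(
    partition_ref: &str,
    partition_events: &[PartitionEvent],
    request_events: &[BuildRequestEvent],
) -> BTreeSet<String> {
    // Mirrors the NOT IN subquery: build requests that completed or failed.
    let finished: BTreeSet<&str> = request_events
        .iter()
        .filter(|e| matches!(e.status, 4 | 5))
        .map(|e| e.build_request_id.as_str())
        .collect();

    partition_events
        .iter()
        .filter(|e| e.partition_ref == partition_ref)
        .filter(|e| matches!(e.status, 2 | 3))
        .filter(|e| !finished.contains(e.build_request_id.as_str()))
        .map(|e| e.build_request_id.clone())
        .collect() // the set plays the role of SELECT DISTINCT
}
```

A build request drops out of the result as soon as it logs a completed or failed status, even though its earlier scheduled and building events for the partition remain in the log.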

Success Rate Calculation

Success rate metrics count a job skipped through historical delegation as a success, because its target partitions are already available.

Job Status Classifications

  • Successful: JOB_COMPLETED (3), JOB_SKIPPED (6)
  • Failed: JOB_FAILED (4)
  • In Progress: JOB_SCHEDULED (1), JOB_RUNNING (2)
  • Cancelled: JOB_CANCELLED (5)

Metrics Queries

-- Job success rate calculation
SELECT 
    job_label,
    COUNT(CASE WHEN status IN ('3', '6') THEN 1 END) as completed_count,
    COUNT(CASE WHEN status = '4' THEN 1 END) as failed_count,
    COUNT(*) as total_count
FROM job_events 
WHERE job_label = ?
GROUP BY job_label

Success Rate = completed_count / total_count, where completed_count includes both executed (JOB_COMPLETED) and skipped (JOB_SKIPPED) jobs.
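
The same calculation can be sketched in code. The `JobStatus` enum below is a hypothetical mirror of the status codes listed under Job Status Classifications, and `success_rate` is an illustrative helper rather than part of DataBuild. Like the formula, it divides by every row passed in, so in-progress and cancelled entries lower the rate.

```rust
/// Hypothetical mirror of the job status codes listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Scheduled = 1,
    Running = 2,
    Completed = 3,
    Failed = 4,
    Cancelled = 5,
    Skipped = 6,
}

/// Success rate = (completed + skipped) / total.
/// Returns None when there are no rows, to avoid dividing by zero.
pub fn success_rate(statuses: &[JobStatus]) -> Option<f64> {
    if statuses.is_empty() {
        return None;
    }
    let succeeded = statuses
        .iter()
        .filter(|s| matches!(s, JobStatus::Completed | JobStatus::Skipped))
        .count();
    Some(succeeded as f64 / statuses.len() as f64)
}
```

For example, the statuses completed, skipped, failed, completed give a rate of 3 / 4 = 0.75, because the skipped job counts as a success.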

Implementation Architecture

Core Components

  1. Event Log Trait (databuild/event_log/mod.rs):

    • get_latest_partition_status(): Check partition availability
    • get_build_request_for_available_partition(): Find historical source
    • get_active_builds_for_partition(): Find concurrent builds

  2. Coordination Logic (databuild/graph/execute.rs):

    • check_build_coordination(): Implements delegation decision rules
    • Multi-partition job evaluation logic
    • Event logging for delegation and job skipping

  3. Dashboard Integration (databuild/service/handlers.rs):

    • Success rate calculations including JOB_SKIPPED
    • Job metrics queries treating delegation as success
    • Proper handling of skipped jobs in analytics

Delegation Decision Algorithm

for each job in execution_plan:
    available_partitions = []
    needs_building = false
    delegated = false

    for each partition in job.outputs:
        if partition.status == PARTITION_AVAILABLE:
            source_build = get_build_request_for_available_partition(partition)
            available_partitions.push((partition, source_build))
        elif partition has active_builds:
            // Active delegation - takes precedence over the other outcomes
            delegate_entire_job_to_active_build()
            delegated = true
            break
        else:
            needs_building = true

    if delegated:
        // Nothing more to do for this job; move on to the next one
        continue
    elif needs_building:
        // Normal execution - some partitions need building
        execute_job_normally()
    else:
        // Historical delegation - all partitions available
        log_delegation_events(available_partitions)
        log_job_skipped_event()
        mark_job_as_succeeded()

Benefits

  1. Efficiency: Eliminates duplicate computation
  2. Consistency: Single source of truth for each partition
  3. Traceability: Complete audit trail via delegation events
  4. Accuracy: Proper success rate calculation including delegated work
  5. Scalability: Supports concurrent build requests without conflicts
  6. Transparency: Clear visibility into why work was or wasn't performed

Future Enhancements

  1. Cross-Build Monitoring: Track when delegated builds complete/fail
  2. Delegation Timeouts: Handle cases where delegated builds stall
  3. Smart Invalidation: Detect when available partitions become stale
  4. Delegation Preferences: Allow builds to specify delegation strategies
  5. Performance Metrics: Track delegation efficiency and resource savings