Partition Delegation in DataBuild
Overview
Partition delegation is a core coordination mechanism in DataBuild: a build request that needs a partition another build is producing, or has already produced, delegates to that build instead of repeating the work. Every delegation is recorded in the Build Event Log, which provides an audit trail of why work was or was not performed.
Motivation
DataBuild is designed to handle concurrent build requests efficiently. Without delegation, multiple build requests might attempt to build the same partitions simultaneously, leading to:
- Resource Waste: Multiple processes building identical partitions
- Race Conditions: Concurrent writes to the same partition outputs
- Inconsistent State: Different builds potentially producing different results for the same partition
- Poor Performance: Duplicated computation and I/O overhead
Delegation solves these problems by establishing clear coordination rules and providing complete traceability.
Delegation Types
DataBuild implements two distinct delegation patterns:
1. Active Delegation
When: A partition is currently being built by another active build request.
Behavior:
- Delegate to the currently executing build request
- Log a DelegationEvent pointing to the active build's ID
- No job execution occurs for the delegating request
- Wait for the active build to complete
Event Flow:
Build Request A wants partition X (currently being built by Build B):
1. DelegationEvent(partition=X, delegated_to=Build_B_ID, message="Delegated to active build during execution")
2. No JobEvent created for Build A
3. Task marked as succeeded locally in Build A
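The event flow above can be sketched in Rust as follows. This is a minimal illustration, not DataBuild's actual types: the struct names, the `delegate_to_active` function, and the `job_event_emitted` flag are hypothetical, and only the field names `partition`, `delegated_to`, and `message` come from the event flow itself.

```rust
/// Hypothetical shape of a delegation record; field names follow the event flow above.
#[derive(Debug, Clone, PartialEq)]
pub struct DelegationEvent {
    pub partition: String,
    pub delegated_to: String,
    pub message: String,
}

/// Hypothetical summary of what the delegating build records for one task.
#[derive(Debug, PartialEq)]
pub struct ActiveDelegationOutcome {
    pub delegation: DelegationEvent,
    /// Active delegation creates no JobEvent for the delegating build.
    pub job_event_emitted: bool,
    /// The task is marked as succeeded locally.
    pub task_succeeded: bool,
}

/// Builds the records for delegating `partition` to a build that is still running.
pub fn delegate_to_active(partition: &str, active_build_id: &str) -> ActiveDelegationOutcome {
    ActiveDelegationOutcome {
        delegation: DelegationEvent {
            partition: partition.to_string(),
            delegated_to: active_build_id.to_string(),
            message: "Delegated to active build during execution".to_string(),
        },
        job_event_emitted: false,
        task_succeeded: true,
    }
}
```

Calling `delegate_to_active("X", "Build_B_ID")` yields one delegation record pointing at Build B and no job event, matching steps 1 to 3.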
2. Historical Delegation
When: A partition already exists and is available (built by a previous request).
Behavior:
- Delegate to the historical build request that created the partition
- Log both DelegationEvent and JOB_SKIPPED events
- Provide complete audit trail showing why work was avoided
Event Flow:
Build Request A wants partition X (already available from Build C):
1. DelegationEvent(partition=X, delegated_to=Build_C_ID, message="Delegated to historical build - partition already available")
2. JobEvent(status=JOB_SKIPPED, message="Job skipped - all target partitions already available")
3. Task marked as succeeded locally in Build A
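As a rough sketch of this flow, the function below produces the two events a historical delegation logs, in order. The `BuildEvent` enum and `historical_delegation_events` are hypothetical names chosen for illustration; the messages are copied from the event flow above.

```rust
/// Hypothetical union of the two event kinds involved in historical delegation.
#[derive(Debug, Clone, PartialEq)]
pub enum BuildEvent {
    Delegation {
        partition: String,
        delegated_to: String,
        message: String,
    },
    Job {
        status: &'static str,
        message: String,
    },
}

/// Returns the events logged when `partition` is already available from `source_build_id`.
pub fn historical_delegation_events(partition: &str, source_build_id: &str) -> Vec<BuildEvent> {
    vec![
        // 1. Point at the build that originally produced the partition.
        BuildEvent::Delegation {
            partition: partition.to_string(),
            delegated_to: source_build_id.to_string(),
            message: "Delegated to historical build - partition already available".to_string(),
        },
        // 2. Record that the job was skipped rather than executed.
        BuildEvent::Job {
            status: "JOB_SKIPPED",
            message: "Job skipped - all target partitions already available".to_string(),
        },
    ]
}
```

Unlike active delegation, this path always emits a job-level event, which is what lets skipped jobs appear in job metrics.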
Multi-Partition Job Coordination
Jobs in DataBuild can produce multiple partitions. Delegation decisions are made at the job level based on all target partitions:
Job Execution Rules
- Delegate to Active: If ANY target partition is being built by another request, delegate the entire job to that build; this check takes precedence over the two rules below
- Execute: Otherwise, if ANY target partition needs building, execute the entire job
- Skip: Only if ALL target partitions are already available
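The rules can be read as a pure function from the states of a job's target partitions to a single decision. The sketch below is one possible encoding; `PartitionState`, `JobDecision`, and `decide` are hypothetical names, and checking active delegation first is an assumption taken from the decision algorithm later in this document.

```rust
/// Hypothetical per-partition state as seen by the requesting build.
#[derive(Debug, Clone, PartialEq)]
pub enum PartitionState {
    Available { source_build: String },
    ActiveIn { build: String },
    NeedsBuilding,
}

/// Hypothetical job-level outcome of the coordination rules.
#[derive(Debug, PartialEq)]
pub enum JobDecision {
    Execute,
    Skip,
    DelegateToActive { build: String },
}

pub fn decide(outputs: &[PartitionState]) -> JobDecision {
    // Assumption: an active build on any target partition wins over the other rules.
    let active = outputs.iter().find_map(|p| match p {
        PartitionState::ActiveIn { build } => Some(build.clone()),
        _ => None,
    });
    if let Some(build) = active {
        return JobDecision::DelegateToActive { build };
    }
    // Any partition that still needs building forces the whole job to run.
    if outputs.iter().any(|p| matches!(p, PartitionState::NeedsBuilding)) {
        return JobDecision::Execute;
    }
    // Every remaining partition is available, so the job can be skipped.
    JobDecision::Skip
}
```

Note that a job with no outputs falls through to `Skip` in this sketch; the document does not say how such a job should be treated.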
Example Scenarios
Scenario 1: Mixed Availability
Job produces partitions [A, B, C]:
- A: Available (from Build X)
- B: Needs building
- C: Available (from Build Y)
Result: Execute the job (because B needs building)
Events: Normal job execution (JOB_SCHEDULED → JOB_RUNNING → JOB_COMPLETED/FAILED)
Scenario 2: All Available
Job produces partitions [A, B, C]:
- A: Available (from Build X)
- B: Available (from Build Y)
- C: Available (from Build Z)
Result: Skip the job (all partitions available)
Events:
- DelegationEvent(A, delegated_to=Build_X_ID)
- DelegationEvent(B, delegated_to=Build_Y_ID)
- DelegationEvent(C, delegated_to=Build_Z_ID)
- JobEvent(status=JOB_SKIPPED)
Scenario 3: Active Build Conflict
Job produces partitions [A, B]:
- A: Available (from Build X)
- B: Being built by Build Y (active)
Result: Delegate entire job to Build Y
Events:
- DelegationEvent(A, delegated_to=Build_Y_ID, message="Delegated to active build")
- DelegationEvent(B, delegated_to=Build_Y_ID, message="Delegated to active build")
- No JobEvent (delegated at planning/coordination level)
Build Event Log Integration
Delegation is implemented through the Build Event Log (BEL), which serves as the authoritative source for all build coordination decisions.
Key Event Types
- DelegationEvent: Records partition-level delegation with full traceability
- JobEvent: Records job-level status including JOB_SKIPPED for historical delegation
- PartitionEvent: Tracks partition lifecycle (PARTITION_AVAILABLE, etc.)
- BuildRequestEvent: Tracks overall build request status
Event Log Queries
Finding Available Partitions:
SELECT be.build_request_id
FROM partition_events pe
JOIN build_events be ON pe.event_id = be.event_id
WHERE pe.partition_ref = ? AND pe.status = '4' -- PARTITION_AVAILABLE
ORDER BY be.timestamp DESC
LIMIT 1
Finding Active Builds:
SELECT DISTINCT be.build_request_id
FROM partition_events pe
JOIN build_events be ON pe.event_id = be.event_id
WHERE pe.partition_ref = ?
  AND pe.status IN ('2', '3') -- PARTITION_SCHEDULED or PARTITION_BUILDING
  AND be.build_request_id NOT IN (
    SELECT DISTINCT be3.build_request_id
    FROM build_request_events bre
    JOIN build_events be3 ON bre.event_id = be3.event_id
    WHERE bre.status IN ('4', '5') -- BUILD_REQUEST_COMPLETED or BUILD_REQUEST_FAILED
  )
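Both queries can also be read as plain filters over event rows. The Rust sketch below mirrors their logic in memory. It assumes a hypothetical `PartitionEventRow` that already combines `partition_events` with `build_events`, integer status codes in place of the string codes used in SQL, and a precomputed set of build requests that have completed or failed.

```rust
use std::collections::HashSet;

/// Hypothetical row: a partition event already joined with its build event.
#[derive(Debug, Clone)]
pub struct PartitionEventRow {
    pub build_request_id: String,
    pub partition_ref: String,
    /// 2 = PARTITION_SCHEDULED, 3 = PARTITION_BUILDING, 4 = PARTITION_AVAILABLE
    pub status: u8,
    pub timestamp: u64,
}

/// Mirrors "Finding Available Partitions": the most recent build that made the partition available.
pub fn latest_available_build(rows: &[PartitionEventRow], partition_ref: &str) -> Option<String> {
    rows.iter()
        .filter(|r| r.partition_ref == partition_ref && r.status == 4)
        .max_by_key(|r| r.timestamp)
        .map(|r| r.build_request_id.clone())
}

/// Mirrors "Finding Active Builds": builds that scheduled or started the partition
/// and have not yet completed or failed.
pub fn active_builds(
    rows: &[PartitionEventRow],
    finished: &HashSet<String>,
    partition_ref: &str,
) -> Vec<String> {
    let mut ids: Vec<String> = rows
        .iter()
        .filter(|r| r.partition_ref == partition_ref && (r.status == 2 || r.status == 3))
        .map(|r| r.build_request_id.clone())
        .filter(|id| !finished.contains(id))
        .collect();
    // Equivalent of SELECT DISTINCT.
    ids.sort();
    ids.dedup();
    ids
}
```

The `finished` set plays the role of the `NOT IN` subquery: a build only counts as active while no terminal build request event exists for it.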
Success Rate Calculation
The delegation system ensures accurate success rate metrics by treating delegation outcomes appropriately:
Job Status Classifications
- Successful: JOB_COMPLETED (3), JOB_SKIPPED (6)
- Failed: JOB_FAILED (4)
- In Progress: JOB_SCHEDULED (1), JOB_RUNNING (2)
- Cancelled: JOB_CANCELLED (5)
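One way to encode this classification is an enum whose discriminants match the numeric codes listed above. The type and function names here (`JobStatus`, `Outcome`, `classify`) are illustrative and may not match DataBuild's generated types.

```rust
/// Job status codes as listed in the classification above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Scheduled = 1,
    Running = 2,
    Completed = 3,
    Failed = 4,
    Cancelled = 5,
    Skipped = 6,
}

/// Hypothetical reporting bucket for a job status.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Successful,
    Failed,
    InProgress,
    Cancelled,
}

pub fn classify(status: JobStatus) -> Outcome {
    match status {
        // A skipped job counts as a success: its partitions were already available.
        JobStatus::Completed | JobStatus::Skipped => Outcome::Successful,
        JobStatus::Failed => Outcome::Failed,
        JobStatus::Scheduled | JobStatus::Running => Outcome::InProgress,
        JobStatus::Cancelled => Outcome::Cancelled,
    }
}
```

Because the `match` is exhaustive, adding a new status later forces an explicit decision about which bucket it belongs to.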
Metrics Queries
-- Job success rate calculation
SELECT
  job_label,
  COUNT(CASE WHEN status IN ('3', '6') THEN 1 END) AS completed_count,
  COUNT(CASE WHEN status = '4' THEN 1 END) AS failed_count,
  COUNT(*) AS total_count
FROM job_events
WHERE job_label = ?
GROUP BY job_label
Success rate = completed_count / total_count, where completed_count includes both executed (JOB_COMPLETED) and skipped (JOB_SKIPPED) jobs.
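The same ratio can be computed in Rust over a list of status codes. This is a minimal sketch: `success_rate` is a hypothetical helper, the codes are integers in place of the string codes used in SQL, and returning `None` for an empty input is a choice made here to avoid dividing by zero.

```rust
/// Fraction of job events whose status is JOB_COMPLETED (3) or JOB_SKIPPED (6).
/// Returns None when there are no events, since the ratio is undefined.
pub fn success_rate(status_codes: &[u8]) -> Option<f64> {
    if status_codes.is_empty() {
        return None;
    }
    let completed = status_codes
        .iter()
        .filter(|&&s| s == 3 || s == 6)
        .count();
    Some(completed as f64 / status_codes.len() as f64)
}
```

For example, `success_rate(&[3, 6, 4, 3])` counts three successes out of four events and returns `Some(0.75)`.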
Implementation Architecture
Clean Separation of Concerns
Analysis Phase (databuild/graph/analyze.rs):
- Purpose: Pure transformation of partition requests → job graph
- Responsibility: Determine what work would be needed (logical plan)
- No delegation logic: Creates jobs for all requested partitions
- Output: Complete job graph representing the logical work
Execution Phase (databuild/graph/execute.rs):
- Purpose: Execute the job graph efficiently with delegation optimization
- Responsibility: Coordinate with concurrent builds and optimize execution
- All delegation logic: Handles both active and historical delegation
- Event logging: Emits all job lifecycle events including JOB_SKIPPED
Core Components
- Event Log Trait (databuild/event_log/mod.rs):
  - get_latest_partition_status(): Check partition availability
  - get_build_request_for_available_partition(): Find historical source
  - get_active_builds_for_partition(): Find concurrent builds
- Execution Coordination Logic (databuild/graph/execute.rs):
  - check_build_coordination(): Implements all delegation decision rules
  - Multi-partition job evaluation logic
  - Event logging for delegation and job skipping
  - Handles both active delegation (to running builds) and historical delegation (to completed builds)
- Dashboard Integration (databuild/service/handlers.rs):
  - Success rate calculations including JOB_SKIPPED
  - Job metrics queries treating delegation as success
  - Proper handling of skipped jobs in analytics
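To make the event log interface concrete, here is a hedged sketch of what such a trait might look like. Only the three method names come from the list above; the trait name `BuildEventLog`, the parameter and return types, the `PartitionStatus` enum, and the `InMemoryLog` implementation are all assumptions made for illustration, and the real signatures in databuild/event_log/mod.rs may differ (for example, they may be async or return a Result).

```rust
use std::collections::HashMap;

/// Hypothetical subset of partition lifecycle states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PartitionStatus {
    Scheduled,
    Building,
    Available,
}

/// Assumed shape of the event log trait; method names are from the component list.
pub trait BuildEventLog {
    fn get_latest_partition_status(&self, partition_ref: &str) -> Option<PartitionStatus>;
    fn get_build_request_for_available_partition(&self, partition_ref: &str) -> Option<String>;
    fn get_active_builds_for_partition(&self, partition_ref: &str) -> Vec<String>;
}

/// Toy implementation that stores only the latest (status, build id) per partition.
#[derive(Default)]
pub struct InMemoryLog {
    pub latest: HashMap<String, (PartitionStatus, String)>,
}

impl BuildEventLog for InMemoryLog {
    fn get_latest_partition_status(&self, partition_ref: &str) -> Option<PartitionStatus> {
        self.latest.get(partition_ref).map(|(status, _)| *status)
    }

    fn get_build_request_for_available_partition(&self, partition_ref: &str) -> Option<String> {
        match self.latest.get(partition_ref) {
            Some((PartitionStatus::Available, build)) => Some(build.clone()),
            _ => None,
        }
    }

    fn get_active_builds_for_partition(&self, partition_ref: &str) -> Vec<String> {
        match self.latest.get(partition_ref) {
            Some((PartitionStatus::Scheduled | PartitionStatus::Building, build)) => {
                vec![build.clone()]
            }
            _ => Vec::new(),
        }
    }
}
```

Keeping coordination queries behind a trait like this lets the execution phase be tested against an in-memory log without a database.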
Delegation Decision Algorithm (Execution Phase)
// Analysis phase creates complete job graph for all requested partitions
job_graph = analyze_partitions(requested_partitions)
// Execution phase optimizes by delegating when possible
for each job in job_graph:
    available_partitions = []
    needs_building = false
    delegated_to_active = false
    for each partition in job.outputs:
        if partition.status == PARTITION_AVAILABLE:
            source_build = get_build_request_for_available_partition(partition)
            available_partitions.push((partition, source_build))
        elif partition has active_builds:
            // Remember the active build; the whole job is delegated below
            delegated_to_active = true
        else:
            needs_building = true
    if delegated_to_active:
        // Active delegation - delegate entire job to running build
        log_delegation_events_to_active_build()
        mark_job_as_succeeded()
    elif needs_building:
        // Normal execution - some partitions need building
        execute_job_normally()
    else:
        // Historical delegation - all partitions available
        log_delegation_events(available_partitions) // Point to source builds
        log_job_skipped_event()
        mark_job_as_succeeded()
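The algorithm above can be sketched as runnable Rust that walks a list of jobs and returns the events it would log. This is an illustration under stated assumptions, not the code in databuild/graph/execute.rs: `PartitionState`, `Job`, and `coordinate` are hypothetical, events are rendered as plain strings, and real execution is represented by a single JOB_SCHEDULED line.

```rust
/// Hypothetical state of one output partition at execution time.
#[derive(Debug, Clone, PartialEq)]
pub enum PartitionState {
    Available(String), // id of the build that produced it
    Active(String),    // id of the build currently producing it
    Missing,
}

/// Hypothetical job: a label plus its output partitions and their states.
pub struct Job {
    pub label: String,
    pub outputs: Vec<(String, PartitionState)>,
}

/// Returns the events, as strings, that the execution phase would log for `jobs`.
pub fn coordinate(jobs: &[Job]) -> Vec<String> {
    let mut log = Vec::new();
    for job in jobs {
        // Active delegation: any output being built elsewhere delegates the whole job.
        let active = job.outputs.iter().find_map(|(_, state)| match state {
            PartitionState::Active(build) => Some(build.clone()),
            _ => None,
        });
        if let Some(build) = active {
            for (partition, _) in &job.outputs {
                log.push(format!("DelegationEvent({}, delegated_to={})", partition, build));
            }
            continue; // no JobEvent for an actively delegated job
        }
        // Normal execution: at least one output still needs building.
        if job.outputs.iter().any(|(_, state)| *state == PartitionState::Missing) {
            log.push(format!("JobEvent({}, JOB_SCHEDULED)", job.label));
            continue;
        }
        // Historical delegation: every output is available, so point at each source build.
        for (partition, state) in &job.outputs {
            if let PartitionState::Available(source) = state {
                log.push(format!("DelegationEvent({}, delegated_to={})", partition, source));
            }
        }
        log.push(format!("JobEvent({}, JOB_SKIPPED)", job.label));
    }
    log
}
```

Evaluating all outputs before acting, instead of branching inside the partition loop, keeps the result independent of the order in which a job lists its outputs.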
Benefits
- Clean Architecture: Clear separation between logical planning (analysis) and execution optimization
- Efficiency: Eliminates duplicate computation through execution-time delegation
- Consistency: Single source of truth for each partition
- Traceability: Complete audit trail via delegation events with full build request traceability
- Accuracy: Proper success rate calculation including delegated work
- Scalability: Supports concurrent build requests without conflicts
- Testability: The analysis phase becomes a pure function (requests → job graph)
- Transparency: Clear visibility into why work was or wasn't performed
Future Enhancements
- Cross-Build Monitoring: Track when delegated builds complete/fail
- Delegation Timeouts: Handle cases where delegated builds stall
- Smart Invalidation: Detect when available partitions become stale
- Delegation Preferences: Allow builds to specify delegation strategies
- Performance Metrics: Track delegation efficiency and resource savings