Add docs
parent 6433fafd90
commit aa561a8281
2 changed files with 240 additions and 0 deletions
233
docs/partition-delegation.md
Normal file
@@ -0,0 +1,233 @@
# Partition Delegation in DataBuild

## Overview

Partition delegation is a core coordination mechanism in DataBuild that prevents duplicate work by allowing build requests to delegate partition creation to other builds. This system ensures efficient resource utilization and provides complete audit trails for all build activities.

## Motivation

DataBuild is designed to handle concurrent build requests efficiently. Without delegation, multiple build requests might attempt to build the same partitions simultaneously, leading to:

- **Resource Waste**: Multiple processes building identical partitions
- **Race Conditions**: Concurrent writes to the same partition outputs
- **Inconsistent State**: Different builds potentially producing different results for the same partition
- **Poor Performance**: Duplicated computation and I/O overhead

Delegation solves these problems by establishing clear coordination rules and providing complete traceability.

## Delegation Types

DataBuild implements two distinct delegation patterns:

### 1. Active Delegation

**When**: A partition is currently being built by another active build request.

**Behavior**:
- Delegate to the currently executing build request
- Log `DelegationEvent` pointing to the active build's ID
- No job execution occurs for the delegating request
- Wait for the active build to complete

**Event Flow**:
```
Build Request A wants partition X (currently being built by Build B):
1. DelegationEvent(partition=X, delegated_to=Build_B_ID, message="Delegated to active build during execution")
2. No JobEvent created for Build A
3. Task marked as succeeded locally in Build A
```

### 2. Historical Delegation

**When**: A partition already exists and is available (built by a previous request).

**Behavior**:
- Delegate to the historical build request that created the partition
- Log both `DelegationEvent` and `JOB_SKIPPED` events
- Provide complete audit trail showing why work was avoided

**Event Flow**:
```
Build Request A wants partition X (already available from Build C):
1. DelegationEvent(partition=X, delegated_to=Build_C_ID, message="Delegated to historical build - partition already available")
2. JobEvent(status=JOB_SKIPPED, message="Job skipped - all target partitions already available")
3. Task marked as succeeded locally in Build A
```
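
The two event flows differ only in which events the delegating request logs. A minimal Rust sketch of that difference follows. The `BuildEvent` and `DelegationTarget` types and the `delegation_events` function are hypothetical names invented for illustration; only the field names and message strings come from the event flows above, and DataBuild's real event definitions may look different.

```rust
/// Hypothetical event shapes. The field names mirror the event flows in this
/// document, not DataBuild's actual definitions.
#[derive(Debug, Clone, PartialEq)]
enum BuildEvent {
    Delegation {
        partition: String,
        delegated_to: String,
        message: String,
    },
    JobSkipped {
        message: String,
    },
}

/// Who the delegating request hands the partition to (hypothetical type).
enum DelegationTarget {
    /// Another build request is building the partition right now.
    Active { build_id: String },
    /// A previous build request already produced the partition.
    Historical { build_id: String },
}

/// Events the delegating build request logs for one partition.
fn delegation_events(partition: &str, target: &DelegationTarget) -> Vec<BuildEvent> {
    match target {
        // Active delegation: one DelegationEvent and no JobEvent.
        DelegationTarget::Active { build_id } => vec![BuildEvent::Delegation {
            partition: partition.to_string(),
            delegated_to: build_id.clone(),
            message: "Delegated to active build during execution".to_string(),
        }],
        // Historical delegation: a DelegationEvent followed by JOB_SKIPPED.
        DelegationTarget::Historical { build_id } => vec![
            BuildEvent::Delegation {
                partition: partition.to_string(),
                delegated_to: build_id.clone(),
                message: "Delegated to historical build - partition already available"
                    .to_string(),
            },
            BuildEvent::JobSkipped {
                message: "Job skipped - all target partitions already available".to_string(),
            },
        ],
    }
}
```

In both cases the delegating request runs no job of its own; the sketch only covers what gets written to the log.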

## Multi-Partition Job Coordination

Jobs in DataBuild can produce multiple partitions. Delegation decisions are made at the **job level** based on **all target partitions**:

### Job Execution Rules

1. **Execute**: If ANY target partition needs building, execute the entire job
2. **Skip**: Only if ALL target partitions are already available
3. **Delegate to Active**: If ANY target partition is being built by another request, delegate the entire job to that request. In the decision algorithm this check takes precedence over rule 1.

### Example Scenarios

**Scenario 1: Mixed Availability**
```
Job produces partitions [A, B, C]:
- A: Available (from Build X)
- B: Needs building
- C: Available (from Build Y)

Result: Execute the job (because B needs building)
Events: Normal job execution (JOB_SCHEDULED → JOB_RUNNING → JOB_COMPLETED/FAILED)
```

**Scenario 2: All Available**
```
Job produces partitions [A, B, C]:
- A: Available (from Build X)
- B: Available (from Build Y)
- C: Available (from Build Z)

Result: Skip the job (all partitions available)
Events:
- DelegationEvent(A, delegated_to=Build_X_ID)
- DelegationEvent(B, delegated_to=Build_Y_ID)
- DelegationEvent(C, delegated_to=Build_Z_ID)
- JobEvent(status=JOB_SKIPPED)
```

**Scenario 3: Active Build Conflict**
```
Job produces partitions [A, B]:
- A: Available (from Build X)
- B: Being built by Build Y (active)

Result: Delegate entire job to Build Y
Events:
- DelegationEvent(A, delegated_to=Build_Y_ID, message="Delegated to active build")
- DelegationEvent(B, delegated_to=Build_Y_ID, message="Delegated to active build")
- No JobEvent (delegated at planning/coordination level)
```

## Build Event Log Integration

Delegation is implemented through the Build Event Log (BEL), which serves as the authoritative source for all build coordination decisions.

### Key Event Types

1. **DelegationEvent**: Records partition-level delegation with full traceability
2. **JobEvent**: Records job-level status including `JOB_SKIPPED` for historical delegation
3. **PartitionEvent**: Tracks partition lifecycle (`PARTITION_AVAILABLE`, etc.)
4. **BuildRequestEvent**: Tracks overall build request status

### Event Log Queries

**Finding Available Partitions**:
```sql
SELECT be.build_request_id
FROM partition_events pe
JOIN build_events be ON pe.event_id = be.event_id
WHERE pe.partition_ref = ? AND pe.status = '4'  -- PARTITION_AVAILABLE
ORDER BY be.timestamp DESC
LIMIT 1
```

**Finding Active Builds**:
```sql
SELECT DISTINCT be.build_request_id
FROM partition_events pe
JOIN build_events be ON pe.event_id = be.event_id
WHERE pe.partition_ref = ?
  AND pe.status IN ('2', '3')  -- PARTITION_SCHEDULED or PARTITION_BUILDING
  AND be.build_request_id NOT IN (
    SELECT DISTINCT be3.build_request_id
    FROM build_request_events bre
    JOIN build_events be3 ON bre.event_id = be3.event_id
    WHERE bre.status IN ('4', '5')  -- BUILD_REQUEST_COMPLETED or BUILD_REQUEST_FAILED
  )
```
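
To make the intent of the two queries easier to follow, here is a sketch of the same lookups over an in-memory slice of rows. Everything in it is an assumption made for illustration: `PartitionRow` is a hypothetical flattened view of a partition event joined with its build event, statuses are held as integers rather than the string codes stored in the tables, and neither function is DataBuild's actual implementation.

```rust
use std::collections::{BTreeSet, HashSet};

// Partition status codes, taken from the comments in the queries above.
const PARTITION_SCHEDULED: u8 = 2;
const PARTITION_BUILDING: u8 = 3;
const PARTITION_AVAILABLE: u8 = 4;

/// Hypothetical flattened row: a partition event joined with its build event.
struct PartitionRow {
    build_request_id: String,
    partition_ref: String,
    status: u8,
    timestamp: u64,
}

/// Most recent build request that made `partition_ref` available, if any.
/// Mirrors the first query (ORDER BY timestamp DESC LIMIT 1).
fn build_for_available_partition<'a>(
    rows: &'a [PartitionRow],
    partition_ref: &str,
) -> Option<&'a str> {
    rows.iter()
        .filter(|r| r.partition_ref == partition_ref && r.status == PARTITION_AVAILABLE)
        .max_by_key(|r| r.timestamp)
        .map(|r| r.build_request_id.as_str())
}

/// Build requests that scheduled or started building `partition_ref` and have
/// not finished. `finished` holds the IDs of build requests that completed or
/// failed, which stands in for the NOT IN subquery.
fn active_builds_for_partition<'a>(
    rows: &'a [PartitionRow],
    finished: &HashSet<String>,
    partition_ref: &str,
) -> BTreeSet<&'a str> {
    rows.iter()
        .filter(|r| r.partition_ref == partition_ref)
        .filter(|r| r.status == PARTITION_SCHEDULED || r.status == PARTITION_BUILDING)
        .filter(|r| !finished.contains(&r.build_request_id))
        .map(|r| r.build_request_id.as_str())
        .collect()
}
```

The set return type plays the role of `SELECT DISTINCT`: a build request that logged both a scheduled and a building event for the partition appears once.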

## Success Rate Calculation

The delegation system ensures accurate success rate metrics by treating delegation outcomes appropriately:

### Job Status Classifications

- **Successful**: `JOB_COMPLETED` (3), `JOB_SKIPPED` (6)
- **Failed**: `JOB_FAILED` (4)
- **In Progress**: `JOB_SCHEDULED` (1), `JOB_RUNNING` (2)
- **Cancelled**: `JOB_CANCELLED` (5)

### Metrics Queries

```sql
-- Job success rate calculation
SELECT
  job_label,
  COUNT(CASE WHEN status IN ('3', '6') THEN 1 END) as completed_count,
  COUNT(CASE WHEN status = '4' THEN 1 END) as failed_count,
  COUNT(*) as total_count
FROM job_events
WHERE job_label = ?
```

Success Rate = `completed_count / total_count`, where `completed_count` includes both executed (`JOB_COMPLETED`) and skipped (`JOB_SKIPPED`) jobs. Because `total_count` is `COUNT(*)`, in-progress and cancelled rows also count toward the denominator.
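
As a worked illustration of the classification and the formula, the sketch below computes the rate from a list of statuses. The `JobStatus` enum and `success_rate` function are hypothetical names; the numeric codes are the ones listed under Job Status Classifications, and the denominator follows the query by counting every status.

```rust
/// Job status codes as listed in the classification in this document.
/// The enum itself is a hypothetical stand-in for DataBuild's real type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobStatus {
    Scheduled = 1,
    Running = 2,
    Completed = 3,
    Failed = 4,
    Cancelled = 5,
    Skipped = 6,
}

impl JobStatus {
    /// Skipped jobs count as successes: their target partitions already exist.
    fn is_success(self) -> bool {
        matches!(self, JobStatus::Completed | JobStatus::Skipped)
    }
}

/// completed_count / total_count over the given statuses.
/// Returns `None` when there are no statuses, to avoid dividing by zero.
fn success_rate(statuses: &[JobStatus]) -> Option<f64> {
    if statuses.is_empty() {
        return None;
    }
    let completed = statuses.iter().filter(|s| s.is_success()).count();
    Some(completed as f64 / statuses.len() as f64)
}
```

For example, one completed, one skipped, one failed, and one running job give two successes out of four, so the rate is 0.5.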

## Implementation Architecture

### Core Components

1. **Event Log Trait** (`databuild/event_log/mod.rs`):
   - `get_latest_partition_status()`: Check partition availability
   - `get_build_request_for_available_partition()`: Find historical source
   - `get_active_builds_for_partition()`: Find concurrent builds

2. **Coordination Logic** (`databuild/graph/execute.rs`):
   - `check_build_coordination()`: Implements delegation decision rules
   - Multi-partition job evaluation logic
   - Event logging for delegation and job skipping

3. **Dashboard Integration** (`databuild/service/handlers.rs`):
   - Success rate calculations including `JOB_SKIPPED`
   - Job metrics queries treating delegation as success
   - Proper handling of skipped jobs in analytics

### Delegation Decision Algorithm

```
for each job in execution_plan:
    available_partitions = []
    needs_building = false

    for each partition in job.outputs:
        if partition.status == PARTITION_AVAILABLE:
            source_build = get_build_request_for_available_partition(partition)
            available_partitions.push((partition, source_build))
        elif partition has active_builds:
            // Active delegation - takes precedence over building
            delegate_entire_job_to_active_build()
            continue with next job
        else:
            needs_building = true

    if !needs_building && available_partitions.len() == job.outputs.len():
        // Historical delegation - all partitions available
        log_delegation_events(available_partitions)
        log_job_skipped_event()
        mark_job_as_succeeded()
    elif needs_building:
        // Normal execution - some partitions need building
        execute_job_normally()
```
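
The per-job decision can also be written as a small pure function, which makes the three scenarios easy to check. This is a sketch under stated assumptions, not the code in `databuild/graph/execute.rs`: `PartitionState`, `JobDecision`, and `decide` are hypothetical names, build IDs are plain strings, and the partition states are passed in rather than looked up in the event log.

```rust
/// State of one output partition as the coordinator sees it (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum PartitionState {
    /// Already built; `source_build` is the request that produced it.
    Available { source_build: String },
    /// Currently being built by another active request.
    BeingBuilt { active_build: String },
    /// Neither available nor in progress.
    Missing,
}

/// Outcome for one job (hypothetical type).
#[derive(Debug, PartialEq)]
enum JobDecision {
    /// Some partition needs building and none is being built elsewhere.
    Execute,
    /// Every partition exists: (partition, source build) pairs to log
    /// as delegation events before recording JOB_SKIPPED.
    Skip(Vec<(String, String)>),
    /// A partition is being built by another request: delegate the whole job.
    DelegateToActive(String),
}

/// Applies the delegation rules to one job's output partitions.
fn decide(outputs: &[(String, PartitionState)]) -> JobDecision {
    let mut available = Vec::new();
    let mut needs_building = false;

    for (partition, state) in outputs {
        match state {
            PartitionState::Available { source_build } => {
                available.push((partition.clone(), source_build.clone()));
            }
            // The first active build found wins, regardless of other partitions.
            PartitionState::BeingBuilt { active_build } => {
                return JobDecision::DelegateToActive(active_build.clone());
            }
            PartitionState::Missing => needs_building = true,
        }
    }

    if needs_building {
        JobDecision::Execute
    } else {
        JobDecision::Skip(available)
    }
}
```

Returning a decision instead of logging events directly keeps the rules testable without an event log; the caller would then emit the matching `DelegationEvent` and `JobEvent` records.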

## Benefits

1. **Efficiency**: Eliminates duplicate computation
2. **Consistency**: Single source of truth for each partition
3. **Traceability**: Complete audit trail via delegation events
4. **Accuracy**: Proper success rate calculation including delegated work
5. **Scalability**: Supports concurrent build requests without conflicts
6. **Transparency**: Clear visibility into why work was or wasn't performed

## Future Enhancements

1. **Cross-Build Monitoring**: Track when delegated builds complete/fail
2. **Delegation Timeouts**: Handle cases where delegated builds stall
3. **Smart Invalidation**: Detect when available partitions become stale
4. **Delegation Preferences**: Allow builds to specify delegation strategies
5. **Performance Metrics**: Track delegation efficiency and resource savings
7
plans/todo.md
Normal file
@@ -0,0 +1,7 @@
- Status indicator for page selection
- On build request detail page, show aggregated job results
- Use path-based navigation instead of hashbang?
- Mermaid visualization of build requests
- Build event job links are not encoding job labels properly