# Wants System

Purpose: Enable declarative specification of data requirements with SLA tracking, cross-graph coordination, and efficient build triggering while maintaining atomic build semantics.

## Overview

The wants system unifies all build requests (manual, scheduled, triggered) under a single declarative model where:
- **Wants declare intent** via events in the [build event log](./build-event-log.md)
- **Builds reactively satisfy** what's currently possible with atomic semantics
- **Monitoring identifies gaps** between declared wants and delivered partitions
- **Cross-graph coordination** happens via the `GraphService` API

## Architecture

### Core Components

1. **PartitionWantEvent**: Declarative specification of data requirements
2. **Build Evaluation**: Reactive logic that attempts to satisfy wants when possible
3. **SLA Monitoring**: External system that queries for expired wants
4. **Cross-Graph Coordination**: Event-driven dependency management across DataBuild instances

### Want Event Schema

Defined in `databuild.proto`:

```protobuf
message PartitionWantEvent {
  string partition_ref = 1;                   // Partition being requested
  int64 created_at = 2;                       // Server time when want registered
  int64 data_timestamp = 3;                   // Business time this partition represents
  optional uint64 ttl_seconds = 4;            // Give up after this long (from created_at)
  optional uint64 sla_seconds = 5;            // SLA violation after this long (from data_timestamp)
  repeated string external_dependencies = 6;  // Cross-graph dependencies
  string want_id = 7;                         // Unique identifier
  WantSource source = 8;                      // How this want was created
}

message WantSource {
  oneof source_type {
    CliManual cli_manual = 1;              // Manual CLI request
    DashboardManual dashboard_manual = 2;  // Manual dashboard request
    Scheduled scheduled = 3;               // Scheduled/triggered job
    ApiRequest api_request = 4;            // External API call
  }
}
```

## Want Lifecycle

### 1. Want Registration

All build requests become wants:

```rust
// CLI: databuild build data/users/2024-01-01
PartitionWantEvent {
    partition_ref: "data/users/2024-01-01".to_string(),
    created_at: now(),
    data_timestamp: None, // data_timestamp, TTL, and SLA must be set explicitly in the request
    ttl_seconds: None,
    sla_seconds: None,
    external_dependencies: vec![], // no externally sourced data necessary
    want_id: generate_uuid(),
    source: WantSource { ... },
}

// Scheduled pipeline: Daily analytics
PartitionWantEvent {
    partition_ref: "analytics/daily/2024-01-01".to_string(),
    created_at: now(),
    data_timestamp: parse_date("2024-01-01"),
    ttl_seconds: Some(365 * 24 * 3600), // Keep trying for 1 year
    sla_seconds: Some(9 * 3600),        // Expected by 9am (9hrs after data_timestamp)
    external_dependencies: vec!["data/users/2024-01-01".to_string()],
    want_id: "daily-analytics-2024-01-01".to_string(),
    source: WantSource { ... },
}
```

### 2. Build Evaluation

DataBuild continuously evaluates build opportunities:

```rust
async fn evaluate_build_opportunities(&self) -> Result<Option<BuildRequest>> {
    let now = current_timestamp_nanos();

    // Get wants that haven't exceeded TTL
    let active_wants = self.get_non_expired_wants(now).await?;

    // Filter to wants where external dependencies are satisfied.
    // The explicit type lets `collect()` infer its target collection.
    let buildable_partitions: Vec<String> = active_wants.into_iter()
        .filter(|want| self.external_dependencies_satisfied(want))
        .map(|want| want.partition_ref)
        .collect();

    if buildable_partitions.is_empty() { return Ok(None); }

    // Create atomic build request for all currently buildable partitions
    Ok(Some(BuildRequest {
        requested_partitions: buildable_partitions,
        reason: "satisfying_active_wants".to_string(),
    }))
}
```

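The filtering step above can be tried in isolation with an in-memory sketch. Everything here is a simplified assumption rather than DataBuild's actual code: `Want` is a hypothetical stand-in for `PartitionWantEvent`, upstream availability is modelled as a plain `HashSet`, and `is_expired` and `buildable_partitions` are illustrative names.

```rust
use std::collections::HashSet;

const NANOS_PER_SEC: i64 = 1_000_000_000;

/// Hypothetical stand-in for `PartitionWantEvent`, reduced to the fields used here.
#[derive(Debug, Clone)]
struct Want {
    partition_ref: String,
    created_at: i64, // nanoseconds
    ttl_seconds: Option<u64>,
    external_dependencies: Vec<String>,
}

/// True once `created_at + ttl` has passed; wants without a TTL never expire.
fn is_expired(want: &Want, now_nanos: i64) -> bool {
    match want.ttl_seconds {
        Some(ttl) => {
            let ttl_nanos = i64::try_from(ttl)
                .unwrap_or(i64::MAX)
                .saturating_mul(NANOS_PER_SEC);
            want.created_at.saturating_add(ttl_nanos) <= now_nanos
        }
        None => false,
    }
}

/// Returns the partitions that can be built right now, or `None` if there are none.
fn buildable_partitions(
    wants: &[Want],
    available_upstream: &HashSet<String>,
    now_nanos: i64,
) -> Option<Vec<String>> {
    let buildable: Vec<String> = wants
        .iter()
        .filter(|w| !is_expired(w, now_nanos))
        // A want is buildable only when every external dependency is available.
        .filter(|w| w.external_dependencies.iter().all(|d| available_upstream.contains(d)))
        .map(|w| w.partition_ref.clone())
        .collect();
    if buildable.is_empty() { None } else { Some(buildable) }
}
```

A want with no external dependencies is buildable immediately, while a want with a missing upstream partition stays out of the request until that partition shows up in `available_upstream`.
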
### 3. Build Triggers

Builds are triggered on:
- **New want registration**: Check if newly wanted partitions are immediately buildable
- **External partition availability**: Check if any blocked wants are now unblocked
- **Manual trigger**: Force re-evaluation (for debugging)

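One way to picture the three triggers is as an enum that narrows which wanted partitions need re-checking. This is only a sketch under assumptions: `EvaluationTrigger`, `BlockedWant`, and `partitions_to_recheck` are hypothetical names, and the real evaluation may simply re-scan all active wants on every trigger.

```rust
/// Hypothetical enum naming the three trigger kinds listed above.
#[derive(Debug, Clone, PartialEq)]
enum EvaluationTrigger {
    WantRegistered { partition_ref: String },
    UpstreamPartitionAvailable { partition_ref: String },
    Manual,
}

/// A want that is waiting on upstream partitions (simplified, hypothetical shape).
struct BlockedWant {
    partition_ref: String,
    external_dependencies: Vec<String>,
}

/// Picks which wanted partitions are worth re-checking for a given trigger.
fn partitions_to_recheck(trigger: &EvaluationTrigger, wants: &[BlockedWant]) -> Vec<String> {
    match trigger {
        // A new want only needs its own partition checked.
        EvaluationTrigger::WantRegistered { partition_ref } => vec![partition_ref.clone()],
        // An upstream arrival can only unblock wants that depend on it.
        EvaluationTrigger::UpstreamPartitionAvailable { partition_ref } => wants
            .iter()
            .filter(|w| w.external_dependencies.contains(partition_ref))
            .map(|w| w.partition_ref.clone())
            .collect(),
        // A manual trigger forces a full re-evaluation.
        EvaluationTrigger::Manual => wants.iter().map(|w| w.partition_ref.clone()).collect(),
    }
}
```
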
## Cross-Graph Coordination
### GraphService API
Graphs expose events for cross-graph coordination:

```rust
trait GraphService {
    async fn list_events(&self, since_idx: i64, filter: EventFilter) -> Result<EventPage>;
}
```

Where `EventFilter` supports partition patterns for efficient subscriptions:

```protobuf
message EventFilter {
  repeated string partition_refs = 1;      // Exact partition matches
  repeated string partition_patterns = 2;  // Glob patterns like "data/users/*"
  repeated string job_labels = 3;          // Job-specific events
  repeated string task_ids = 4;            // Task run events
  repeated string build_request_ids = 5;   // Build-specific events
}
```

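To make the pattern semantics concrete, here is a minimal matcher sketch. It rests on assumptions the source does not state: `*` matches any run of characters (including `/`), an event passes if it matches any exact ref or any pattern, and an empty filter passes everything. `glob_match`, `PartitionFilter`, and `partition_matches` are hypothetical names, not part of the `GraphService` API.

```rust
/// Matches `text` against `pattern`, where `*` matches any (possibly empty) run of bytes.
/// Assumption: `*` also crosses `/`; the source does not specify this.
fn glob_match(pattern: &str, text: &str) -> bool {
    let (p, t) = (pattern.as_bytes(), text.as_bytes());
    let (mut pi, mut ti) = (0usize, 0usize);
    // (pattern index just after the last '*', text index where that '*' currently ends)
    let mut star: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == b'*' {
            star = Some((pi + 1, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some((sp, st)) = star {
            // Backtrack: let the last '*' absorb one more byte.
            pi = sp;
            ti = st + 1;
            star = Some((sp, st + 1));
        } else {
            return false;
        }
    }
    // Whatever is left of the pattern must consist only of '*'.
    p[pi..].iter().all(|&b| b == b'*')
}

/// Simplified stand-in for the two partition fields of `EventFilter`.
struct PartitionFilter {
    partition_refs: Vec<String>,
    partition_patterns: Vec<String>,
}

/// True if the partition matches any exact ref or any glob pattern in the filter.
fn partition_matches(filter: &PartitionFilter, partition_ref: &str) -> bool {
    if filter.partition_refs.is_empty() && filter.partition_patterns.is_empty() {
        return true; // assumed: an empty filter does not restrict by partition
    }
    filter.partition_refs.iter().any(|r| r == partition_ref)
        || filter.partition_patterns.iter().any(|p| glob_match(p, partition_ref))
}
```
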
### Upstream Dependencies

Downstream graphs subscribe to upstream events:

```rust
struct UpstreamDependency {
    service_url: String,              // "https://upstream-databuild.corp.com"
    partition_patterns: Vec<String>,  // ["data/users/*", "ml/models/prod/*"]
    last_sync_idx: i64,
}

// Periodic sync of upstream events.
// Takes `&mut` because the sync cursor (`last_sync_idx`) is advanced at the end.
async fn sync_upstream_events(upstream: &mut UpstreamDependency) -> Result<()> {
    let client = GraphServiceClient::new(&upstream.service_url);
    let filter = EventFilter {
        partition_patterns: upstream.partition_patterns.clone(),
        ..Default::default()
    };

    let events = client.list_events(upstream.last_sync_idx, filter).await?;

    // Process partition availability events
    for event in events.events {
        if let EventType::PartitionEvent(pe) = event.event_type {
            if pe.status_code == PartitionStatus::PartitionAvailable {
                // Trigger local build evaluation
                trigger_build_evaluation().await?;
            }
        }
    }

    upstream.last_sync_idx = events.next_idx;
    Ok(())
}
```

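The cursor handling in the sync loop can be isolated into a small, synchronous sketch. The types below are hypothetical simplifications of the event page, not the real `EventPage` shape, and collecting the newly available partitions so the caller can trigger one evaluation per page (instead of one per event) is a design variation, not something the source prescribes.

```rust
/// Hypothetical, simplified shapes for one page of upstream events.
struct PartitionEvent {
    partition_ref: String,
    available: bool,
}

struct EventPage {
    events: Vec<PartitionEvent>,
    next_idx: i64,
}

/// Cursor state kept per upstream graph (mirrors `last_sync_idx` above).
struct SyncCursor {
    last_sync_idx: i64,
}

/// Applies one page: collects newly available partitions and advances the cursor.
/// Returning the list lets the caller trigger a single build evaluation per page.
fn apply_page(cursor: &mut SyncCursor, page: EventPage) -> Vec<String> {
    let newly_available: Vec<String> = page
        .events
        .into_iter()
        .filter(|e| e.available)
        .map(|e| e.partition_ref)
        .collect();
    // Never move the cursor backwards if a stale page arrives.
    cursor.last_sync_idx = cursor.last_sync_idx.max(page.next_idx);
    newly_available
}
```

Because the cursor only advances after a page has been processed, a failed or skipped sync can resume from the same index on the next attempt.
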
## SLA Monitoring and TTL Management

### SLA Violations

External monitoring systems query for SLA violations:

```sql
-- Find SLA violations (for alerting)
SELECT * FROM partition_want_events w
WHERE w.sla_seconds IS NOT NULL
  AND (w.data_timestamp + (w.sla_seconds * 1000000000)) < ? -- now
  AND NOT EXISTS (
    SELECT 1 FROM partition_events p
    WHERE p.partition_ref = w.partition_ref
      AND p.status_code = ? -- PartitionAvailable
  )
```

### TTL Expiration

Wants with expired TTLs are excluded from build evaluation:

```sql
-- Get active (non-expired) wants
SELECT * FROM partition_want_events w
WHERE (w.ttl_seconds IS NULL OR w.created_at + (w.ttl_seconds * 1000000000) > ?) -- now
  AND NOT EXISTS (
    SELECT 1 FROM partition_events p
    WHERE p.partition_ref = w.partition_ref
      AND p.status_code = ? -- PartitionAvailable
  )
```

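Both queries rely on the same arithmetic: timestamps are in nanoseconds, while `sla_seconds` and `ttl_seconds` are in seconds, hence the `* 1000000000`. The sketch below restates the two predicates in Rust so the anchoring is explicit (SLA counts from `data_timestamp`, TTL from `created_at`). `WantTiming` and its methods are hypothetical names, and the saturating arithmetic is an added safeguard against overflow, not something the queries do.

```rust
const NANOS_PER_SEC: i64 = 1_000_000_000;

/// Timing fields of a want (a hypothetical subset of `PartitionWantEvent`).
struct WantTiming {
    created_at: i64,     // nanoseconds
    data_timestamp: i64, // nanoseconds
    ttl_seconds: Option<u64>,
    sla_seconds: Option<u64>,
}

/// Converts seconds to nanoseconds, saturating instead of overflowing.
fn secs_to_nanos(secs: u64) -> i64 {
    i64::try_from(secs).unwrap_or(i64::MAX).saturating_mul(NANOS_PER_SEC)
}

impl WantTiming {
    /// SLA deadline: `data_timestamp + sla_seconds`.
    fn sla_deadline(&self) -> Option<i64> {
        self.sla_seconds
            .map(|s| self.data_timestamp.saturating_add(secs_to_nanos(s)))
    }

    /// TTL expiry: `created_at + ttl_seconds`.
    fn ttl_expiry(&self) -> Option<i64> {
        self.ttl_seconds
            .map(|s| self.created_at.saturating_add(secs_to_nanos(s)))
    }

    /// Mirrors the SLA query: deadline before `now` and partition not yet available.
    fn sla_violated(&self, now: i64, partition_available: bool) -> bool {
        !partition_available && self.sla_deadline().map_or(false, |d| d < now)
    }

    /// Mirrors the TTL query: no TTL, or expiry still after `now`, and not yet available.
    fn is_active(&self, now: i64, partition_available: bool) -> bool {
        !partition_available && self.ttl_expiry().map_or(true, |e| e > now)
    }
}
```

For a want with `data_timestamp` at midnight and `sla_seconds` of `9 * 3600`, `sla_deadline` lands at 9:00 AM regardless of when the want was registered.
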
## Example Scenarios

### Scenario 1: Daily Analytics Pipeline

```
1. 6:00 AM: Daily trigger creates want for analytics/daily/2024-01-01
   - SLA: 9:00 AM (9 hours after data_timestamp of midnight)
   - TTL: 1 year (keep trying for historical data)
   - External deps: ["data/users/2024-01-01"]

2. 6:01 AM: Build evaluation runs, data/users/2024-01-01 missing
   - No build request generated

3. 8:30 AM: Upstream publishes data/users/2024-01-01
   - Cross-graph sync detects availability
   - Build evaluation triggered
   - BuildRequest[analytics/daily/2024-01-01] succeeds

4. Result: Analytics available at 8:45 AM, within SLA
```

### Scenario 2: Late Data with SLA Miss

```
1. 6:00 AM: Want created for analytics/daily/2024-01-01 (SLA: 9:00 AM)
2. 9:30 AM: SLA monitoring detects violation, sends alert
3. 11:00 AM: Upstream data finally arrives
4. 11:01 AM: Build evaluation triggers, analytics built
5. Result: Late delivery logged, but data still processed
```

### Scenario 3: Manual CLI Build

```
1. User: databuild build data/transform/urgent
2. Want created with short TTL (30 min) and SLA (5 min)
3. Build evaluation: dependencies available, immediate build
4. Result: Fast feedback for interactive use
```

## Benefits

### Unified Build Model
- All builds (manual, scheduled, triggered) use the same want mechanism
- Complete audit trail in the build event log
- Consistent SLA tracking across all build types

### Event-Driven Efficiency
- Builds are only triggered when dependencies change
- Cross-graph coordination via efficient event streaming
- No polling for task readiness within builds

### Atomic Build Semantics
- Individual build requests remain all-or-nothing
- Fast failure provides immediate feedback
- Partial progress via multiple build requests over time

### Flexible SLA Management
- Separate business expectations (SLA) from operational limits (TTL)
- External monitoring with clear blame assignment
- Automatic cleanup of stale wants

### Cross-Graph Scalability
- Reliable HTTP-based coordination (no message loss)
- Efficient filtering via partition patterns
- Decentralized architecture with clear boundaries

## Implementation Notes

### Build Event Log Integration
- Wants are stored as events in the BEL for consistency
- Same query interfaces used for wants and build coordination
- Event-driven architecture throughout

### Service Integration
- GraphService API exposed via HTTP for cross-graph coordination
- Dashboard integration for manual want creation
- External SLA monitoring via BEL queries

### CLI Integration
- CLI commands create manual wants with appropriate TTLs
- Immediate build evaluation for interactive feedback
- Standard build request execution path