databuild/design/wants.md

287 lines
No EOL
9.3 KiB
Markdown

# Wants System
Purpose: Enable declarative specification of data requirements with SLA tracking, cross-graph coordination, and efficient build triggering while maintaining atomic build semantics.
## Overview
The wants system unifies all build requests (manual, scheduled, triggered) under a single declarative model where:
- **Wants declare intent** via events in the [build event log](./build-event-log.md)
- **Builds reactively satisfy** what's currently possible with atomic semantics
- **Monitoring identifies gaps** between declared wants and delivered partitions
- **Cross-graph coordination** happens via the `GraphService` API
## Architecture
### Core Components
1. **PartitionWantEvent**: Declarative specification of data requirements
2. **Build Evaluation**: Reactive logic that attempts to satisfy wants when possible
3. **SLA Monitoring**: External system that queries for expired wants
4. **Cross-Graph Coordination**: Event-driven dependency management across DataBuild instances
### Want Event Schema
Defined in `databuild.proto`:
```protobuf
message PartitionWantEvent {
string partition_ref = 1; // Partition being requested
int64 created_at = 2; // Server time when want registered
int64 data_timestamp = 3; // Business time this partition represents
optional uint64 ttl_seconds = 4; // Give up after this long (from created_at)
optional uint64 sla_seconds = 5; // SLA violation after this long (from data_timestamp)
repeated string external_dependencies = 6; // Cross-graph dependencies
string want_id = 7; // Unique identifier
WantSource source = 8; // How this want was created
}
message WantSource {
oneof source_type {
CliManual cli_manual = 1; // Manual CLI request
DashboardManual dashboard_manual = 2; // Manual dashboard request
Scheduled scheduled = 3; // Scheduled/triggered job
ApiRequest api_request = 4; // External API call
}
}
```
## Want Lifecycle
### 1. Want Registration
All build requests become wants:
```rust
// CLI: databuild build data/users/2024-01-01
PartitionWantEvent {
partition_ref: "data/users/2024-01-01",
created_at: now(),
data_timestamp: None, // These must be set explicitly in the request
ttl_seconds: None,
sla_seconds: None,
external_dependencies: vec![], // no externally sourced data necessary
want_id: generate_uuid(),
source: WantSource { ... },
}
// Scheduled pipeline: Daily analytics
PartitionWantEvent {
partition_ref: "analytics/daily/2024-01-01",
created_at: now(),
data_timestamp: parse_date("2024-01-01"),
ttl_seconds: Some(365 * 24 * 3600), // Keep trying for 1 year
sla_seconds: Some(9 * 3600), // Expected by 9am (9hrs after data_timestamp)
external_dependencies: vec!["data/users/2024-01-01"],
want_id: "daily-analytics-2024-01-01",
source: WantSource { ... },
}
```
### 2. Build Evaluation
DataBuild continuously evaluates build opportunities:
```rust
async fn evaluate_build_opportunities(&self) -> Result<Option<BuildRequest>> {
let now = current_timestamp_nanos();
// Get wants that haven't exceeded TTL
let active_wants = self.get_non_expired_wants(now).await?;
// Filter to wants where external dependencies are satisfied
let buildable_partitions = active_wants.into_iter()
.filter(|want| self.external_dependencies_satisfied(want))
.map(|want| want.partition_ref)
.collect();
if buildable_partitions.is_empty() { return Ok(None); }
// Create atomic build request for all currently buildable partitions
Ok(Some(BuildRequest {
requested_partitions: buildable_partitions,
reason: "satisfying_active_wants".to_string(),
}))
}
```
### 3. Build Triggers
Builds are triggered on:
- **New want registration**: Check if newly wanted partitions are immediately buildable
- **External partition availability**: Check if any blocked wants are now unblocked
- **Manual trigger**: Force re-evaluation (for debugging)
## Cross-Graph Coordination
### GraphService API
Graphs expose events for cross-graph coordination:
```rust
trait GraphService {
async fn list_events(&self, since_idx: i64, filter: EventFilter) -> Result<EventPage>;
}
```
Where `EventFilter` supports partition patterns for efficient subscriptions:
```protobuf
message EventFilter {
repeated string partition_refs = 1; // Exact partition matches
repeated string partition_patterns = 2; // Glob patterns like "data/users/*"
repeated string job_labels = 3; // Job-specific events
repeated string task_ids = 4; // Task run events
repeated string build_request_ids = 5; // Build-specific events
}
```
### Upstream Dependencies
Downstream graphs subscribe to upstream events:
```rust
struct UpstreamDependency {
service_url: String, // "https://upstream-databuild.corp.com"
partition_patterns: Vec<String>, // ["data/users/*", "ml/models/prod/*"]
last_sync_idx: i64,
}
// Periodic sync of upstream events
async fn sync_upstream_events(upstream: &UpstreamDependency) -> Result<()> {
let client = GraphServiceClient::new(&upstream.service_url);
let filter = EventFilter {
partition_patterns: upstream.partition_patterns.clone(),
..Default::default()
};
let events = client.list_events(upstream.last_sync_idx, filter).await?;
// Process partition availability events
for event in events.events {
if let EventType::PartitionEvent(pe) = event.event_type {
if pe.status_code == PartitionStatus::PartitionAvailable {
// Trigger local build evaluation
trigger_build_evaluation().await?;
}
}
}
upstream.last_sync_idx = events.next_idx;
Ok(())
}
```
## SLA Monitoring and TTL Management
### SLA Violations
External monitoring systems query for SLA violations:
```sql
-- Find SLA violations (for alerting)
SELECT * FROM partition_want_events w
WHERE w.sla_seconds IS NOT NULL
AND (w.data_timestamp + (w.sla_seconds * 1000000000)) < ? -- now
AND NOT EXISTS (
SELECT 1 FROM partition_events p
WHERE p.partition_ref = w.partition_ref
AND p.status_code = ? -- PartitionAvailable
)
```
### TTL Expiration
Wants with expired TTLs are excluded from build evaluation:
```sql
-- Get active (non-expired) wants
SELECT * FROM partition_want_events w
WHERE (w.ttl_seconds IS NULL OR w.created_at + (w.ttl_seconds * 1000000000) > ?) -- now
AND NOT EXISTS (
SELECT 1 FROM partition_events p
WHERE p.partition_ref = w.partition_ref
AND p.status_code = ? -- PartitionAvailable
)
```
## Example Scenarios
### Scenario 1: Daily Analytics Pipeline
```
1. 6:00 AM: Daily trigger creates want for analytics/daily/2024-01-01
- SLA: 9:00 AM (3 hours after data_timestamp of midnight)
- TTL: 1 year (keep trying for historical data)
- External deps: ["data/users/2024-01-01"]
2. 6:01 AM: Build evaluation runs, data/users/2024-01-01 missing
- No build request generated
3. 8:30 AM: Upstream publishes data/users/2024-01-01
- Cross-graph sync detects availability
- Build evaluation triggered
- BuildRequest[analytics/daily/2024-01-01] succeeds
4. Result: Analytics available at 8:45 AM, within SLA
```
### Scenario 2: Late Data with SLA Miss
```
1. 6:00 AM: Want created for analytics/daily/2024-01-01 (SLA: 9:00 AM)
2. 9:30 AM: SLA monitoring detects violation, sends alert
3. 11:00 AM: Upstream data finally arrives
4. 11:01 AM: Build evaluation triggers, analytics built
5. Result: Late delivery logged, but data still processed
```
### Scenario 3: Manual CLI Build
```
1. User: databuild build data/transform/urgent
2. Want created with short TTL (30 min) and SLA (5 min)
3. Build evaluation: dependencies available, immediate build
4. Result: Fast feedback for interactive use
```
## Benefits
### Unified Build Model
- All builds (manual, scheduled, triggered) use same want mechanism
- Complete audit trail in build event log
- Consistent SLA tracking across all build types
### Event-Driven Efficiency
- Builds only triggered when dependencies change
- Cross-graph coordination via efficient event streaming
- No polling for task readiness within builds
### Atomic Build Semantics
- Individual build requests remain all-or-nothing
- Fast failure provides immediate feedback
- Partial progress via multiple build requests over time
### Flexible SLA Management
- Separate business expectations (SLA) from operational limits (TTL)
- External monitoring with clear blame assignment
- Automatic cleanup of stale wants
### Cross-Graph Scalability
- Reliable HTTP-based coordination (no message loss)
- Efficient filtering via partition patterns
- Decentralized architecture with clear boundaries
## Implementation Notes
### Build Event Log Integration
- Wants are stored as events in the BEL for consistency
- Same query interfaces used for wants and build coordination
- Event-driven architecture throughout
### Service Integration
- GraphService API exposed via HTTP for cross-graph coordination
- Dashboard integration for manual want creation
- External SLA monitoring via BEL queries
### CLI Integration
- CLI commands create manual wants with appropriate TTLs
- Immediate build evaluation for interactive feedback
- Standard build request execution path