Wants System

Purpose: Enable declarative specification of data requirements with SLA tracking, cross-graph coordination, and efficient build triggering while maintaining atomic build semantics.

Overview

The wants system unifies all build requests (manual, scheduled, triggered) under a single declarative model where:

  • Wants declare intent via events in the build event log
  • Builds reactively satisfy what's currently possible with atomic semantics
  • Monitoring identifies gaps between declared wants and delivered partitions
  • Cross-graph coordination happens via the GraphService API

Architecture

Core Components

  1. PartitionWantEvent: Declarative specification of data requirements
  2. Build Evaluation: Reactive logic that attempts to satisfy wants when possible
  3. SLA Monitoring: External system that queries for expired wants
  4. Cross-Graph Coordination: Event-driven dependency management across DataBuild instances

Want Event Schema

Defined in databuild.proto:

message PartitionWantEvent {
  string partition_ref = 1;               // Partition being requested
  int64 created_at = 2;                   // Server time (nanoseconds) when the want was registered
  int64 data_timestamp = 3;               // Business time (nanoseconds) this partition represents
  optional uint64 ttl_seconds = 4;        // Give up after this long (from created_at)
  optional uint64 sla_seconds = 5;        // SLA violation after this long (from data_timestamp)
  repeated string external_dependencies = 6; // Cross-graph dependencies
  string want_id = 7;                     // Unique identifier
  WantSource source = 8;                  // How this want was created
}

message WantSource {
  oneof source_type {
    CliManual cli_manual = 1;             // Manual CLI request
    DashboardManual dashboard_manual = 2;  // Manual dashboard request  
    Scheduled scheduled = 3;              // Scheduled/triggered job
    ApiRequest api_request = 4;           // External API call
  }
}
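
With prost-style code generation, the oneof becomes an enum nested under a want_source module; a sketch of tagging a CLI-created want (the import path and the CliManual fields are assumptions, since the schema above leaves CliManual unspecified):

use databuild::proto::{want_source::SourceType, CliManual, WantSource};

// Tag a want as originating from a manual CLI invocation.
let source = WantSource {
    source_type: Some(SourceType::CliManual(CliManual {
        // Hypothetical fields for illustration only.
        user: std::env::var("USER").unwrap_or_default(),
        command: "databuild build data/users/2024-01-01".to_string(),
    })),
};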

Want Lifecycle

1. Want Registration

All build requests become wants:

// CLI: databuild build data/users/2024-01-01
PartitionWantEvent {
    partition_ref: "data/users/2024-01-01",
    created_at: now(),
    data_timestamp: None, // data_timestamp, ttl, and sla are unset unless supplied explicitly in the request
    ttl_seconds: None,
    sla_seconds: None,
    external_dependencies: vec![], // no externally sourced data necessary
    want_id: generate_uuid(),
    source: WantSource { ... },
}

// Scheduled pipeline: Daily analytics
PartitionWantEvent {
    partition_ref: "analytics/daily/2024-01-01", 
    created_at: now(),
    data_timestamp: parse_date("2024-01-01"),
    ttl_seconds: Some(365 * 24 * 3600), // Keep trying for 1 year
    sla_seconds: Some(9 * 3600),        // Expected by 9am (9hrs after data_timestamp)
    external_dependencies: vec!["data/users/2024-01-01"],
    want_id: "daily-analytics-2024-01-01",
    source: WantSource { ... },
}
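
Registration itself can be a thin wrapper: append the event to the build event log, then kick evaluation. A minimal sketch, assuming a bel.append handle and an Event wrapper enum (both names are assumptions):

async fn register_want(&self, want: PartitionWantEvent) -> Result<()> {
    // Persist the want first; the build event log is the source of truth.
    self.bel.append(Event::PartitionWant(want)).await?;
    // New want registration is itself a build trigger (see "Build Triggers" below).
    self.trigger_build_evaluation().await
}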

2. Build Evaluation

DataBuild re-evaluates build opportunities whenever a trigger fires (see Build Triggers below):

async fn evaluate_build_opportunities(&self) -> Result<Option<BuildRequest>> {
    let now = current_timestamp_nanos();
    
    // Get wants that haven't exceeded TTL
    let active_wants = self.get_non_expired_wants(now).await?;
    
    // Filter to wants where external dependencies are satisfied
    let buildable_partitions = active_wants.into_iter()
        .filter(|want| self.external_dependencies_satisfied(want))
        .map(|want| want.partition_ref)
        .collect();
        
    if buildable_partitions.is_empty() { return Ok(None); }
    
    // Create atomic build request for all currently buildable partitions
    Ok(Some(BuildRequest { 
        requested_partitions: buildable_partitions,
        reason: "satisfying_active_wants".to_string(),
    }))
}
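
The external_dependencies_satisfied check can be a lookup against locally synced partition events; a sketch, where partition_available (querying the BEL for a PartitionAvailable event) is an assumed helper:

fn external_dependencies_satisfied(&self, want: &PartitionWantEvent) -> bool {
    // Buildable only once every cross-graph dependency has published a
    // PartitionAvailable event that the sync loop has mirrored locally.
    want.external_dependencies
        .iter()
        .all(|dep| self.partition_available(dep))
}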

3. Build Triggers

Builds are triggered on:

  • New want registration: Check if newly wanted partitions are immediately buildable
  • External partition availability: Check if any blocked wants are now unblocked
  • Manual trigger: Force re-evaluation (for debugging)
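
All three triggers can funnel into a single evaluation path; a minimal sketch, where submit_build (handing the request to the build executor) is an assumption:

async fn trigger_build_evaluation(&self) -> Result<()> {
    // Shared by every trigger: re-evaluate, and submit a build request
    // only if some want is currently buildable.
    if let Some(request) = self.evaluate_build_opportunities().await? {
        self.submit_build(request).await?;
    }
    Ok(())
}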

Cross-Graph Coordination

GraphService API

Graphs expose events for cross-graph coordination:

trait GraphService {
    async fn list_events(&self, since_idx: i64, filter: EventFilter) -> Result<EventPage>;
}
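
The shape of EventPage follows from its use in the sync loop below (a batch of matching events plus a cursor); a sketch consistent with that usage:

message EventPage {
  repeated Event events = 1;  // Matching events at or after since_idx
  int64 next_idx = 2;         // Pass as since_idx on the next call
}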

Where EventFilter supports partition patterns for efficient subscriptions:

message EventFilter {
  repeated string partition_refs = 1;        // Exact partition matches  
  repeated string partition_patterns = 2;    // Glob patterns like "data/users/*"
  repeated string job_labels = 3;           // Job-specific events
  repeated string task_ids = 4;             // Task run events
  repeated string build_request_ids = 5;    // Build-specific events
}
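
Server-side, applying the partition portion of the filter is plain glob matching; a sketch using the glob crate (the crate choice is an assumption, and an empty filter is treated here as matching nothing):

use glob::Pattern;

// True if the event's partition matches an exact ref or a glob pattern.
fn matches_partition_filter(partition_ref: &str, filter: &EventFilter) -> bool {
    filter.partition_refs.iter().any(|r| r.as_str() == partition_ref)
        || filter
            .partition_patterns
            .iter()
            .any(|p| Pattern::new(p).map_or(false, |pat| pat.matches(partition_ref)))
}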

Upstream Dependencies

Downstream graphs subscribe to upstream events:

struct UpstreamDependency {
    service_url: String,           // "https://upstream-databuild.corp.com"
    partition_patterns: Vec<String>, // ["data/users/*", "ml/models/prod/*"]
    last_sync_idx: i64,
}

// Periodic sync of upstream events
async fn sync_upstream_events(upstream: &mut UpstreamDependency) -> Result<()> {
    let client = GraphServiceClient::new(&upstream.service_url);
    let filter = EventFilter {
        partition_patterns: upstream.partition_patterns.clone(),
        ..Default::default()
    };
    
    let events = client.list_events(upstream.last_sync_idx, filter).await?;
    
    // Process partition availability events
    for event in events.events {
        if let EventType::PartitionEvent(pe) = event.event_type {
            if pe.status_code == PartitionStatus::PartitionAvailable {
                // Trigger local build evaluation
                trigger_build_evaluation().await?;
            }
        }
    }
    
    upstream.last_sync_idx = events.next_idx;
    Ok(())
}
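
A driver loop keeps the cursor advancing; a sketch using a fixed 30-second tick (the interval is an arbitrary illustrative choice):

async fn run_upstream_sync(mut upstream: UpstreamDependency) {
    let mut ticker = tokio::time::interval(std::time::Duration::from_secs(30));
    loop {
        ticker.tick().await;
        // last_sync_idx only advances after a successful sync, so a failed
        // attempt is simply retried from the same cursor on the next tick.
        if let Err(e) = sync_upstream_events(&mut upstream).await {
            eprintln!("upstream sync failed: {e}");
        }
    }
}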

SLA Monitoring and TTL Management

SLA Violations

External monitoring systems query for SLA violations:

-- Find SLA violations (for alerting)
SELECT * FROM partition_want_events w
WHERE w.sla_seconds IS NOT NULL 
AND (w.data_timestamp + (w.sla_seconds * 1000000000)) < ?  -- now
AND NOT EXISTS (
    SELECT 1 FROM partition_events p 
    WHERE p.partition_ref = w.partition_ref 
    AND p.status_code = ?  -- PartitionAvailable
)
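
The two placeholders bind the current time in nanoseconds (matching the scaling of sla_seconds above) and the PartitionAvailable status code. A sketch of running the query from a monitor, assuming a SQLite-backed event log reachable via rusqlite (both assumptions):

use rusqlite::{params, Connection};

// Returns partition_refs that have missed their SLA and are still missing.
fn find_sla_violations(
    conn: &Connection,
    now_nanos: i64,
    available_code: i64,
) -> rusqlite::Result<Vec<String>> {
    let mut stmt = conn.prepare(
        "SELECT w.partition_ref FROM partition_want_events w
         WHERE w.sla_seconds IS NOT NULL
           AND (w.data_timestamp + (w.sla_seconds * 1000000000)) < ?1
           AND NOT EXISTS (
             SELECT 1 FROM partition_events p
             WHERE p.partition_ref = w.partition_ref AND p.status_code = ?2)",
    )?;
    let refs = stmt.query_map(params![now_nanos, available_code], |row| row.get(0))?;
    refs.collect()
}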

TTL Expiration

Wants with expired TTLs are excluded from build evaluation:

-- Get active (non-expired) wants
SELECT * FROM partition_want_events w
WHERE (w.ttl_seconds IS NULL OR w.created_at + (w.ttl_seconds * 1000000000) > ?) -- now
AND NOT EXISTS (
    SELECT 1 FROM partition_events p 
    WHERE p.partition_ref = w.partition_ref 
    AND p.status_code = ?  -- PartitionAvailable
)

Example Scenarios

Scenario 1: Daily Analytics Pipeline

1. 6:00 AM: Daily trigger creates want for analytics/daily/2024-01-01
   - SLA: 9:00 AM (9 hours after the data_timestamp of midnight)
   - TTL: 1 year (keep trying for historical data)
   - External deps: ["data/users/2024-01-01"]

2. 6:01 AM: Build evaluation runs, data/users/2024-01-01 missing
   - No build request generated

3. 8:30 AM: Upstream publishes data/users/2024-01-01
   - Cross-graph sync detects availability
   - Build evaluation triggered
   - BuildRequest[analytics/daily/2024-01-01] succeeds

4. Result: Analytics available at 8:45 AM, within SLA

Scenario 2: Late Data with SLA Miss

1. 6:00 AM: Want created for analytics/daily/2024-01-01 (SLA: 9:00 AM)
2. 9:30 AM: SLA monitoring detects violation, sends alert
3. 11:00 AM: Upstream data finally arrives
4. 11:01 AM: Build evaluation triggers, analytics built
5. Result: Late delivery logged, but data still processed

Scenario 3: Manual CLI Build

1. User: databuild build data/transform/urgent
2. Want created with short TTL (30 min) and SLA (5 min)  
3. Build evaluation: dependencies available, immediate build
4. Result: Fast feedback for interactive use

Benefits

Unified Build Model

  • All builds (manual, scheduled, triggered) use same want mechanism
  • Complete audit trail in build event log
  • Consistent SLA tracking across all build types

Event-Driven Efficiency

  • Builds only triggered when dependencies change
  • Cross-graph coordination via efficient event streaming
  • No polling for task readiness within builds

Atomic Build Semantics

  • Individual build requests remain all-or-nothing
  • Fast failure provides immediate feedback
  • Partial progress via multiple build requests over time

Flexible SLA Management

  • Separate business expectations (SLA) from operational limits (TTL)
  • External monitoring with clear blame assignment
  • Automatic cleanup of stale wants

Cross-Graph Scalability

  • Reliable HTTP-based coordination (no message loss)
  • Efficient filtering via partition patterns
  • Decentralized architecture with clear boundaries

Implementation Notes

Build Event Log Integration

  • Wants are stored as events in the BEL for consistency
  • Same query interfaces used for wants and build coordination
  • Event-driven architecture throughout

Service Integration

  • GraphService API exposed via HTTP for cross-graph coordination
  • Dashboard integration for manual want creation
  • External SLA monitoring via BEL queries

CLI Integration

  • CLI commands create manual wants with appropriate TTLs
  • Immediate build evaluation for interactive feedback
  • Standard build request execution path
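
For interactive use, TTL and SLA could be surfaced directly as flags; a hypothetical invocation matching Scenario 3 (these flags are illustrative, not an existing interface):

databuild build data/transform/urgent --ttl 30m --sla 5m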