Build Event Log Design

The build event log is the foundation of persistence for DataBuild: a fact table recording events related to build requests, partitions, and jobs. Each graph has exactly one build event log. Other views (potentially materialized) are derived from it by aggregation, for example the partition liveness catalog and the lookup that enables delegation to in-progress partition builds.

1. Schema

The build event log is an append-only event stream that captures all build-related activity. Each event represents a state change in the lifecycle of a build request, partition, or job, or records a delegation from one build request to another.

// Partition lifecycle states
enum PartitionStatus {
  PARTITION_UNKNOWN = 0;
  PARTITION_REQUESTED = 1;     // Partition requested but not yet scheduled
  PARTITION_SCHEDULED = 2;     // Job scheduled to produce this partition
  PARTITION_BUILDING = 3;      // Job actively building this partition
  PARTITION_AVAILABLE = 4;     // Partition successfully built and available
  PARTITION_FAILED = 5;        // Partition build failed
  PARTITION_DELEGATED = 6;     // Request delegated to existing build
}

// Job execution lifecycle
enum JobStatus {
  JOB_UNKNOWN = 0;
  JOB_SCHEDULED = 1;           // Job scheduled for execution
  JOB_RUNNING = 2;             // Job actively executing
  JOB_COMPLETED = 3;           // Job completed successfully
  JOB_FAILED = 4;              // Job execution failed
  JOB_CANCELLED = 5;           // Job execution cancelled
}

// Build request lifecycle
enum BuildRequestStatus {
  BUILD_REQUEST_UNKNOWN = 0;
  BUILD_REQUEST_RECEIVED = 1;   // Build request received and queued
  BUILD_REQUEST_PLANNING = 2;   // Graph analysis in progress
  BUILD_REQUEST_EXECUTING = 3;  // Jobs are being executed
  BUILD_REQUEST_COMPLETED = 4;  // All requested partitions built
  BUILD_REQUEST_FAILED = 5;     // Build request failed
  BUILD_REQUEST_CANCELLED = 6;  // Build request cancelled
}

// Individual build event
message BuildEvent {
  // Event metadata
  string event_id = 1;                    // UUID for this event
  int64 timestamp = 2;                    // Unix timestamp (nanoseconds)
  string build_request_id = 3;            // UUID of the build request
  
  // Event type and payload (one of)
  oneof event_type {
    BuildRequestEvent build_request_event = 10;
    PartitionEvent partition_event = 11;
    JobEvent job_event = 12;
    DelegationEvent delegation_event = 13;
  }
}

// Build request lifecycle event
message BuildRequestEvent {
  BuildRequestStatus status = 1;
  repeated PartitionRef requested_partitions = 2;
  string message = 3;                     // Optional status message
}

// Partition state change event
message PartitionEvent {
  PartitionRef partition_ref = 1;
  PartitionStatus status = 2;
  string message = 3;                     // Optional status message
  string job_run_id = 4;                  // UUID of job run producing this partition (if applicable)
}

// Job execution event
message JobEvent {
  string job_run_id = 1;                  // UUID for this job run
  JobLabel job_label = 2;                 // Job being executed
  repeated PartitionRef target_partitions = 3; // Partitions this job run produces
  JobStatus status = 4;
  string message = 5;                     // Optional status message
  JobConfig config = 6;                   // Job configuration used (for SCHEDULED events)
  repeated PartitionManifest manifests = 7; // Results (for COMPLETED events)
}

// Delegation event (when build request delegates to existing build)
message DelegationEvent {
  PartitionRef partition_ref = 1;
  string delegated_to_build_request_id = 2; // Build request handling this partition
  string message = 3;                     // Optional message
}

Build events capture the complete lifecycle of composite build requests. A single build request can involve multiple partitions, each potentially requiring different jobs. The event stream allows reconstruction of the full state at any point in time.
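As a minimal sketch of that reconstruction, the function below replays partition events up to a chosen timestamp and keeps the last status seen per partition. `PartitionEventRow` and `partition_state_as_of` are hypothetical names introduced for illustration, and the row type is a simplified stand-in for a PartitionEvent joined with its BuildEvent timestamp, not the generated protobuf type.

```rust
use std::collections::HashMap;

/// Mirrors the PartitionStatus enum above (PARTITION_UNKNOWN omitted).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PartitionStatus {
    Requested,
    Scheduled,
    Building,
    Available,
    Failed,
    Delegated,
}

/// Hypothetical, simplified row: a PartitionEvent plus its BuildEvent timestamp.
#[derive(Debug, Clone)]
pub struct PartitionEventRow {
    pub timestamp: i64, // Unix nanoseconds, as in BuildEvent.timestamp
    pub partition_ref: String,
    pub status: PartitionStatus,
}

/// Replays events with `timestamp <= as_of` in timestamp order and returns
/// the last status seen for each partition.
pub fn partition_state_as_of(
    events: &[PartitionEventRow],
    as_of: i64,
) -> HashMap<String, PartitionStatus> {
    let mut ordered: Vec<&PartitionEventRow> =
        events.iter().filter(|e| e.timestamp <= as_of).collect();
    // Stable sort: events sharing a timestamp keep their append order.
    ordered.sort_by_key(|e| e.timestamp);

    let mut state = HashMap::new();
    for event in ordered {
        state.insert(event.partition_ref.clone(), event.status);
    }
    state
}
```

Because the log is append-only, the same replay with a different `as_of` value answers "what did the graph look like then?" without any extra bookkeeping.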

Design Principles

Staleness as Planning Concern: Staleness detection and handling occur during the analysis/planning phase, not during execution. The analyze operation detects partitions that need rebuilding due to upstream changes and includes them in the execution graph. In-progress builds do not react to newly stale partitions - they execute their planned graph to completion.

Delegation as Unidirectional Optimization: When a build request discovers another build is already producing a needed partition, it logs a delegation event and waits for that partition to become available. The delegated-to build request remains unaware of the delegation - it simply continues building its own graph. This eliminates the need for coordination protocols between builds.
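The delegation decision can be sketched as a pure function over what the requesting build sees in the log. `PartitionAction`, `InProgressBuild`, and `plan_partition` are hypothetical names introduced for illustration; how in-progress builds are actually discovered from the event log is left open here.

```rust
/// What a build request does with one needed partition.
#[derive(Debug, PartialEq, Eq)]
pub enum PartitionAction {
    /// No other request is producing it: build it in this request.
    Build,
    /// Another request is already producing it: log a delegation and wait.
    Delegate { delegated_to_build_request_id: String },
}

/// Hypothetical summary of a partition build found in progress in the log.
pub struct InProgressBuild {
    pub build_request_id: String,
    pub partition_ref: String,
}

/// Decides between building and delegating. Only the requesting side runs
/// this; the delegated-to request is never consulted or notified.
pub fn plan_partition(
    own_build_request_id: &str,
    partition_ref: &str,
    in_progress: &[InProgressBuild],
) -> PartitionAction {
    in_progress
        .iter()
        .find(|b| {
            b.partition_ref == partition_ref
                && b.build_request_id != own_build_request_id
        })
        .map(|b| PartitionAction::Delegate {
            delegated_to_build_request_id: b.build_request_id.clone(),
        })
        .unwrap_or(PartitionAction::Build)
}
```

The function reads state and returns a decision without writing anything the other build would need to observe, which is what keeps the optimization unidirectional.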

2. Persistence

The build event log is stored in a normalized relational schema: a core build_events table holds the metadata common to every event, and one table per event type holds the payload. This design supports multiple storage backends while maintaining consistency.

Storage Requirements

  • PostgreSQL: Primary production backend
  • SQLite: Local development and testing
  • Delta tables: Future extensibility for analytics workloads

Table Schema

-- Core event metadata
CREATE TABLE build_events (
    event_id UUID PRIMARY KEY,
    timestamp BIGINT NOT NULL,
    build_request_id UUID NOT NULL,
    event_type TEXT NOT NULL   -- 'build_request', 'partition', 'job', 'delegation'
);

-- Build request lifecycle events
CREATE TABLE build_request_events (
    event_id UUID PRIMARY KEY REFERENCES build_events(event_id),
    status TEXT NOT NULL,      -- BuildRequestStatus enum
    requested_partitions TEXT[] NOT NULL,  -- PostgreSQL array type; not portable to every backend
    message TEXT
);

-- Partition lifecycle events  
CREATE TABLE partition_events (
    event_id UUID PRIMARY KEY REFERENCES build_events(event_id),
    partition_ref TEXT NOT NULL,
    status TEXT NOT NULL,      -- PartitionStatus enum
    message TEXT,
    job_run_id UUID           -- NULL for non-job-related events
);

-- Job execution events
CREATE TABLE job_events (
    event_id UUID PRIMARY KEY REFERENCES build_events(event_id),
    job_run_id UUID NOT NULL,
    job_label TEXT NOT NULL,
    target_partitions TEXT[] NOT NULL,  -- PostgreSQL array type; not portable to every backend
    status TEXT NOT NULL,      -- JobStatus enum
    message TEXT,
    config_json TEXT,          -- JobConfig as JSON (for SCHEDULED events)
    manifests_json TEXT,       -- PartitionManifests as JSON (for COMPLETED events)
    start_time BIGINT,         -- Extracted from config/manifests
    end_time BIGINT            -- Extracted from config/manifests
);

-- Delegation events
CREATE TABLE delegation_events (
    event_id UUID PRIMARY KEY REFERENCES build_events(event_id),
    partition_ref TEXT NOT NULL,
    delegated_to_build_request_id UUID NOT NULL,
    message TEXT
);

-- Indexes for common query patterns
CREATE INDEX idx_build_events_build_request ON build_events(build_request_id, timestamp);
CREATE INDEX idx_build_events_timestamp ON build_events(timestamp);

-- The per-type tables carry no timestamp column; time-ordered reads join
-- back to build_events on event_id.
CREATE INDEX idx_partition_events_partition ON partition_events(partition_ref);
CREATE INDEX idx_partition_events_job_run ON partition_events(job_run_id) WHERE job_run_id IS NOT NULL;

CREATE INDEX idx_job_events_job_run ON job_events(job_run_id);
CREATE INDEX idx_job_events_job_label ON job_events(job_label);
CREATE INDEX idx_job_events_status ON job_events(status);

CREATE INDEX idx_delegation_events_partition ON delegation_events(partition_ref);
CREATE INDEX idx_delegation_events_delegated_to ON delegation_events(delegated_to_build_request_id);
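Each append therefore touches two tables: the shared build_events row and one payload row. The sketch below shows one way the oneof in BuildEvent could map onto the event_type discriminator and the payload table. `EventKind` and its methods are hypothetical names for illustration; the string values are the ones listed in the schema comments above.

```rust
/// Simplified stand-in for the `event_type` oneof in BuildEvent.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EventKind {
    BuildRequest,
    Partition,
    Job,
    Delegation,
}

impl EventKind {
    /// Value stored in build_events.event_type.
    pub fn discriminator(self) -> &'static str {
        match self {
            EventKind::BuildRequest => "build_request",
            EventKind::Partition => "partition",
            EventKind::Job => "job",
            EventKind::Delegation => "delegation",
        }
    }

    /// Table that stores the payload row for this kind of event.
    pub fn payload_table(self) -> &'static str {
        match self {
            EventKind::BuildRequest => "build_request_events",
            EventKind::Partition => "partition_events",
            EventKind::Job => "job_events",
            EventKind::Delegation => "delegation_events",
        }
    }

    /// Tables written by one append, parent first so the foreign key on
    /// event_id is satisfied when the payload row is inserted.
    pub fn tables_written(self) -> [&'static str; 2] {
        ["build_events", self.payload_table()]
    }
}
```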

3. Access Layer

The access layer provides a simple append/query interface for build events, leaving aggregation logic to the service layer.

Core Interface

The interface covers appending events, keyed lookups by build request, partition, job run, and time range, and raw SQL for analytical queries that the keyed lookups do not cover:

trait BuildEventLog {
    // Append new event to the log
    async fn append_event(&self, event: BuildEvent) -> Result<(), Error>;
    
    // Query events by build request
    async fn get_build_request_events(
        &self, 
        build_request_id: &str,
        since: Option<i64>
    ) -> Result<Vec<BuildEvent>, Error>;
    
    // Query events by partition
    async fn get_partition_events(
        &self,
        partition_ref: &str,
        since: Option<i64>
    ) -> Result<Vec<BuildEvent>, Error>;
    
    // Query events by job run
    async fn get_job_run_events(
        &self,
        job_run_id: &str
    ) -> Result<Vec<BuildEvent>, Error>;
    
    // Query events in time range
    async fn get_events_in_range(
        &self,
        start_time: i64,
        end_time: i64
    ) -> Result<Vec<BuildEvent>, Error>;
    
    // Execute raw SQL queries (for dashboard and debugging)
    async fn execute_query(&self, query: &str) -> Result<QueryResult, Error>;
}
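For tests, an in-memory log with the same query shapes can be useful. The sketch below is synchronous and uses a simplified `Event` struct rather than the protobuf BuildEvent, so it does not implement the trait itself. `InMemoryLog` and `Event` are hypothetical names, and two semantics are assumptions the design does not specify: `since` is treated as an inclusive lower bound, and the time range as half-open (`start_time <= t < end_time`).

```rust
/// Simplified event: only the fields the queries below filter on.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub event_id: String,
    pub timestamp: i64, // Unix nanoseconds
    pub build_request_id: String,
    pub partition_ref: Option<String>,
    pub job_run_id: Option<String>,
}

/// Append-only in-memory log; events are returned in append order.
#[derive(Default)]
pub struct InMemoryLog {
    events: Vec<Event>,
}

impl InMemoryLog {
    pub fn append_event(&mut self, event: Event) {
        self.events.push(event);
    }

    /// Assumption: `since` is an inclusive lower bound on the timestamp.
    pub fn get_build_request_events(&self, build_request_id: &str, since: Option<i64>) -> Vec<Event> {
        self.events
            .iter()
            .filter(|e| e.build_request_id == build_request_id)
            .filter(|e| since.map_or(true, |s| e.timestamp >= s))
            .cloned()
            .collect()
    }

    pub fn get_partition_events(&self, partition_ref: &str, since: Option<i64>) -> Vec<Event> {
        self.events
            .iter()
            .filter(|e| e.partition_ref.as_deref() == Some(partition_ref))
            .filter(|e| since.map_or(true, |s| e.timestamp >= s))
            .cloned()
            .collect()
    }

    pub fn get_job_run_events(&self, job_run_id: &str) -> Vec<Event> {
        self.events
            .iter()
            .filter(|e| e.job_run_id.as_deref() == Some(job_run_id))
            .cloned()
            .collect()
    }

    /// Assumption: the range is half-open, start_time <= t < end_time.
    pub fn get_events_in_range(&self, start_time: i64, end_time: i64) -> Vec<Event> {
        self.events
            .iter()
            .filter(|e| e.timestamp >= start_time && e.timestamp < end_time)
            .cloned()
            .collect()
    }
}
```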

Example Analytical Queries

The normalized schema enables dashboard queries like:

-- Job success rates by label
SELECT job_label, 
       COUNT(*) as total_runs,
       SUM(CASE WHEN status = 'JOB_COMPLETED' THEN 1 ELSE 0 END) as successful_runs,
       AVG(end_time - start_time) as avg_duration_ns
FROM job_events 
WHERE status IN ('JOB_COMPLETED', 'JOB_FAILED')
GROUP BY job_label;

-- Recent partition builds
SELECT p.partition_ref, p.status, e.timestamp, j.job_label
FROM partition_events p
JOIN build_events e ON p.event_id = e.event_id
LEFT JOIN job_events j ON p.job_run_id = j.job_run_id AND j.status = 'JOB_COMPLETED'  -- one row per job run
WHERE p.status = 'PARTITION_AVAILABLE'
ORDER BY e.timestamp DESC
LIMIT 100;

-- Build request status events in the last 24 hours (PostgreSQL syntax; timestamps are Unix nanoseconds)
SELECT br.status, COUNT(*) as count
FROM build_request_events br
JOIN build_events e ON br.event_id = e.event_id
WHERE e.timestamp > extract(epoch from now() - interval '24 hours') * 1000000000
GROUP BY br.status;

The service layer builds higher-level operations on top of both the simple interface and direct SQL access.

4. Core Build Implementation Integration

Command Line Interface

The core build implementation (analyze.rs and execute.rs) will be enhanced with build event logging capabilities through new command line arguments:

# Standard usage with build event logging
./analyze --build-event-log sqlite:///tmp/build.db partition_ref1 partition_ref2
./execute --build-event-log sqlite:///tmp/build.db < job_graph.json

# With explicit build request ID for correlation
./analyze --build-event-log postgres://user:pass@host/db --build-request-id 12345678-1234-1234-1234-123456789012

New Command Line Arguments:

  • --build-event-log <URI> - Specify persistence URI for build events (logging to stdout is implicit)
    • sqlite://path - Persist to SQLite database file
    • postgres://connection - Persist to PostgreSQL database
  • --build-request-id <UUID> - Optional build request ID (auto-generated if not provided)
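Parsing the URI could look roughly like the following sketch, which recognizes only the two schemes listed above. `LogBackend` and `parse_build_event_log_uri` are hypothetical names introduced for illustration; the sketch uses only the standard library and does not validate the PostgreSQL connection string.

```rust
/// Backend selected by the --build-event-log URI.
#[derive(Debug, PartialEq, Eq)]
pub enum LogBackend {
    /// sqlite://path - `path` is everything after the scheme.
    Sqlite { path: String },
    /// postgres://connection - the full URI is kept for the database driver.
    Postgres { connection: String },
}

/// Maps a --build-event-log value to a backend, rejecting unknown schemes.
pub fn parse_build_event_log_uri(uri: &str) -> Result<LogBackend, String> {
    if let Some(path) = uri.strip_prefix("sqlite://") {
        if path.is_empty() {
            return Err("sqlite URI is missing a database path".to_string());
        }
        Ok(LogBackend::Sqlite { path: path.to_string() })
    } else if uri.starts_with("postgres://") {
        Ok(LogBackend::Postgres { connection: uri.to_string() })
    } else {
        Err(format!("unsupported build event log URI: {uri}"))
    }
}
```

With this parsing, `sqlite:///tmp/build.db` yields the absolute path `/tmp/build.db`, matching the usage example above.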

Integration Points

In analyze.rs (Graph Analysis Phase):

  1. Build Request Lifecycle: Log BUILD_REQUEST_RECEIVED when analysis starts and BUILD_REQUEST_PLANNING during dependency resolution. BUILD_REQUEST_COMPLETED is not logged here, since it means all requested partitions are built
  2. Staleness Detection: Query build event log for existing PARTITION_AVAILABLE events to identify non-stale partitions that can be skipped
  3. Delegation Logging: Log PARTITION_DELEGATED events when skipping partitions that are already being built by another request
  4. Job Planning: Log PARTITION_SCHEDULED events for partitions that will be built

In execute.rs (Graph Execution Phase):

  1. Execution Lifecycle: Log BUILD_REQUEST_EXECUTING when execution starts, and BUILD_REQUEST_COMPLETED or BUILD_REQUEST_FAILED when it ends
  2. Job Execution Events: Log JOB_SCHEDULED, JOB_RUNNING, JOB_COMPLETED/FAILED events throughout job execution
  3. Partition Status: Log PARTITION_BUILDING when jobs start, PARTITION_AVAILABLE/FAILED when jobs complete
  4. Build Coordination: Check for concurrent builds before starting partition work to avoid duplicate effort
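The events that execute.rs would log for a single job run can be sketched as an ordered list. `LoggedEvent` and `events_for_job_run` are hypothetical names introduced for illustration, and the relative order of job and partition events is an assumption; the design only states which events accompany job start and job completion.

```rust
/// Simplified log entry: which entity changed and its new status name.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LoggedEvent {
    Job { job_run_id: String, status: &'static str },
    Partition { partition_ref: String, status: &'static str },
}

/// Events logged for one job run, in the order assumed by this sketch:
/// job scheduled, job running, partitions building, then the outcome.
pub fn events_for_job_run(
    job_run_id: &str,
    target_partitions: &[&str],
    succeeded: bool,
) -> Vec<LoggedEvent> {
    let job = |status: &'static str| LoggedEvent::Job {
        job_run_id: job_run_id.to_string(),
        status,
    };
    let partitions = |status: &'static str| -> Vec<LoggedEvent> {
        target_partitions
            .iter()
            .map(|p| LoggedEvent::Partition { partition_ref: p.to_string(), status })
            .collect()
    };

    let mut events = vec![job("JOB_SCHEDULED"), job("JOB_RUNNING")];
    events.extend(partitions("PARTITION_BUILDING"));
    if succeeded {
        events.push(job("JOB_COMPLETED"));
        events.extend(partitions("PARTITION_AVAILABLE"));
    } else {
        events.push(job("JOB_FAILED"));
        events.extend(partitions("PARTITION_FAILED"));
    }
    events
}
```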

Non-Stale Partition Handling

The build event log enables intelligent partition skipping:

  1. During Analysis: Query for recent PARTITION_AVAILABLE events to identify partitions that don't need rebuilding
  2. Staleness Logic: Compare partition timestamps with upstream dependency timestamps to determine if rebuilding is needed
  3. Skip Documentation: Log a PARTITION_DELEGATED event for each skipped partition, together with a DelegationEvent whose delegated_to_build_request_id references the existing build request that produced the partition
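The staleness comparison in step 2 can be sketched as a small predicate. `is_stale` is a hypothetical name, and the rule shown (a partition is stale if it was never built or if any upstream became available after it) is one plausible reading of "compare partition timestamps with upstream dependency timestamps", not a confirmed specification.

```rust
/// Returns true when the partition needs rebuilding.
///
/// `partition_available_at` is the timestamp of the latest
/// PARTITION_AVAILABLE event for the partition, if any.
/// `upstream_available_at` holds the latest PARTITION_AVAILABLE timestamp
/// of each upstream dependency. All values are Unix nanoseconds.
pub fn is_stale(partition_available_at: Option<i64>, upstream_available_at: &[i64]) -> bool {
    match partition_available_at {
        // Never built: always needs building.
        None => true,
        // Built: stale only if some upstream was rebuilt afterwards.
        Some(built_at) => upstream_available_at.iter().any(|&up| up > built_at),
    }
}
```

A partition with no upstream dependencies is never stale once built under this rule, since there is nothing newer to compare against.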

Bazel Rules Integration

The databuild_graph rule in rules.bzl will be enhanced to propagate build event logging configuration:

databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],
    lookup = ":job_lookup",
    build_event_log = "sqlite:///tmp/builds.db",  # New attribute
)

Generated Targets Enhancement:

  • my_graph_analyze: Receives --build-event-log argument
  • my_graph_exec: Receives --build-event-log argument
  • my_graph_build: Coordinates build request ID across analyze/execute phases
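Correlating the two phases amounts to choosing one build request ID and passing it to both commands. The sketch below builds the two argument lists. `PhaseCommands` and `phase_commands` are hypothetical names, the ID is supplied by the caller rather than generated, and passing --build-request-id to ./execute is an assumption based on the argument list above.

```rust
/// Argument vectors for the two phases of one build request.
pub struct PhaseCommands {
    pub analyze: Vec<String>,
    pub execute: Vec<String>,
}

/// Builds both command lines so they share the same event log URI and the
/// same build request ID. Partition refs go to the analyze phase only.
pub fn phase_commands(
    build_event_log: &str,
    build_request_id: &str,
    partition_refs: &[&str],
) -> PhaseCommands {
    let shared = [
        "--build-event-log".to_string(),
        build_event_log.to_string(),
        "--build-request-id".to_string(),
        build_request_id.to_string(),
    ];

    let mut analyze = vec!["./analyze".to_string()];
    analyze.extend(shared.iter().cloned());
    analyze.extend(partition_refs.iter().map(|p| p.to_string()));

    let mut execute = vec!["./execute".to_string()];
    execute.extend(shared.iter().cloned());

    PhaseCommands { analyze, execute }
}
```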

Implementation Strategy

Phase 1: Infrastructure

  • Add BuildEventLog trait and implementations for stdout/SQLite/PostgreSQL
  • Update databuild.proto with build event schema
  • Add command line argument parsing to analyze.rs and execute.rs

Phase 2: Analysis Integration

  • Integrate build event logging into analyze.rs
  • Implement staleness detection queries
  • Add partition delegation logic

Phase 3: Execution Integration

  • Integrate build event logging into execute.rs
  • Add job lifecycle event logging
  • Implement build coordination checks

Phase 4: Bazel Integration

  • Update databuild_graph rule with build event log support
  • Add proper argument propagation and request ID correlation
  • End-to-end testing with example graphs

Key Benefits

  1. Stdout Logging: Immediate visibility into build progress, since events are logged to stdout whether or not a persistence URI is given
  2. Persistent History: Database persistence enables build coordination and historical analysis
  3. Intelligent Skipping: Avoid rebuilding partitions that are already available and not stale, so a build runs only the jobs that upstream changes require
  4. Build Coordination: Prevent duplicate work when multiple builds target the same partitions
  5. Audit Trail: Complete record of all build activities for debugging and monitoring