# Build Event Log Design

The foundation of persistence for DataBuild is the build event log, a fact table recording events related to build requests, partitions, and jobs. Each graph has exactly one build event log. Other views (potentially materialized) derive from it and aggregate over it, for example to power the partition liveness catalog and to enable delegation to in-progress partition builds.

## 1. Schema

The build event log is an append-only event stream that captures all build-related activity. Each event represents a state change in a build request, partition, or job lifecycle, or records a delegation from one build request to another.

```protobuf
// Partition lifecycle states
enum PartitionStatus {
  PARTITION_UNKNOWN = 0;
  PARTITION_REQUESTED = 1;     // Partition requested but not yet scheduled
  PARTITION_SCHEDULED = 2;     // Job scheduled to produce this partition
  PARTITION_BUILDING = 3;      // Job actively building this partition
  PARTITION_AVAILABLE = 4;     // Partition successfully built and available
  PARTITION_FAILED = 5;        // Partition build failed
  PARTITION_DELEGATED = 6;     // Request delegated to existing build
}

// Job execution lifecycle
enum JobStatus {
  JOB_UNKNOWN = 0;
  JOB_SCHEDULED = 1;           // Job scheduled for execution
  JOB_RUNNING = 2;             // Job actively executing
  JOB_COMPLETED = 3;           // Job completed successfully
  JOB_FAILED = 4;              // Job execution failed
  JOB_CANCELLED = 5;           // Job execution cancelled
}

// Build request lifecycle
enum BuildRequestStatus {
  BUILD_REQUEST_UNKNOWN = 0;
  BUILD_REQUEST_RECEIVED = 1;   // Build request received and queued
  BUILD_REQUEST_PLANNING = 2;   // Graph analysis in progress
  BUILD_REQUEST_EXECUTING = 3;  // Jobs are being executed
  BUILD_REQUEST_COMPLETED = 4;  // All requested partitions built
  BUILD_REQUEST_FAILED = 5;     // Build request failed
  BUILD_REQUEST_CANCELLED = 6;  // Build request cancelled
}

// Individual build event
message BuildEvent {
  // Event metadata
  string event_id = 1;                    // UUID for this event
  int64 timestamp = 2;                    // Unix timestamp (nanoseconds)
  string build_request_id = 3;            // UUID of the build request
  
  // Event type and payload (one of)
  oneof event_type {
    BuildRequestEvent build_request_event = 10;
    PartitionEvent partition_event = 11;
    JobEvent job_event = 12;
    DelegationEvent delegation_event = 13;
  }
}

// Build request lifecycle event
message BuildRequestEvent {
  BuildRequestStatus status = 1;
  repeated PartitionRef requested_partitions = 2;
  string message = 3;                     // Optional status message
}

// Partition state change event
message PartitionEvent {
  PartitionRef partition_ref = 1;
  PartitionStatus status = 2;
  string message = 3;                     // Optional status message
  string job_run_id = 4;                  // UUID of job run producing this partition (if applicable)
}

// Job execution event
message JobEvent {
  string job_run_id = 1;                  // UUID for this job run
  JobLabel job_label = 2;                 // Job being executed
  repeated PartitionRef target_partitions = 3; // Partitions this job run produces
  JobStatus status = 4;
  string message = 5;                     // Optional status message
  JobConfig config = 6;                   // Job configuration used (for SCHEDULED events)
  repeated PartitionManifest manifests = 7; // Results (for COMPLETED events)
}

// Delegation event (when build request delegates to existing build)
message DelegationEvent {
  PartitionRef partition_ref = 1;
  string delegated_to_build_request_id = 2; // Build request handling this partition
  string message = 3;                     // Optional message
}
```

Build events capture the complete lifecycle of composite build requests. A single build request can involve multiple partitions, each potentially requiring different jobs. The event stream allows reconstruction of the full state at any point in time.
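
As a rough illustration of that reconstruction, the sketch below folds partition events into a latest-status-per-partition view as of a given timestamp. `PartitionEventRecord` and `partition_state_at` are hypothetical names introduced for this example, and the record is a flattened stand-in for a `BuildEvent` carrying a `PartitionEvent`, not the generated protobuf type.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class PartitionEventRecord:
    """Hypothetical flattened view of a BuildEvent wrapping a PartitionEvent."""
    timestamp: int       # Unix nanoseconds, as in BuildEvent.timestamp
    partition_ref: str
    status: str          # PartitionStatus name, e.g. "PARTITION_BUILDING"


def partition_state_at(
    events: Iterable[PartitionEventRecord], as_of: Optional[int] = None
) -> Dict[str, str]:
    """Return the latest status per partition, using only events up to as_of."""
    state: Dict[str, str] = {}
    # Replay in timestamp order so later events overwrite earlier ones.
    for ev in sorted(events, key=lambda e: e.timestamp):
        if as_of is not None and ev.timestamp > as_of:
            break
        state[ev.partition_ref] = ev.status
    return state


log = [
    PartitionEventRecord(100, "data/a", "PARTITION_REQUESTED"),
    PartitionEventRecord(200, "data/a", "PARTITION_BUILDING"),
    PartitionEventRecord(250, "data/b", "PARTITION_REQUESTED"),
    PartitionEventRecord(300, "data/a", "PARTITION_AVAILABLE"),
]

print(partition_state_at(log, as_of=260))  # state while data/a is still building
print(partition_state_at(log))             # state after all events
```

Because the log is append-only, the same replay with a different `as_of` yields the state at any earlier moment without extra bookkeeping.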

### Design Principles

**Staleness as Planning Concern**: Staleness detection and handling occur during the analysis/planning phase, not during execution. The analyze operation detects partitions that need rebuilding due to upstream changes and includes them in the execution graph. In-progress builds do not react to newly stale partitions; they execute their planned graph to completion.

**Delegation as Unidirectional Optimization**: When a build request discovers another build is already producing a needed partition, it logs a delegation event and waits for that partition to become available. The delegated-to build request remains unaware of the delegation - it simply continues building its own graph. This eliminates the need for coordination protocols between builds.
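
The one-way nature of delegation can be sketched as a pure planning decision. This is a minimal sketch under assumptions: `Action`, `IN_PROGRESS`, and `plan_partition` are hypothetical names, and treating `PARTITION_SCHEDULED` and `PARTITION_BUILDING` as the "in progress" statuses is an illustrative choice rather than something the design fixes.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class Action(Enum):
    BUILD = "build"
    DELEGATE = "delegate"


# Assumption: these statuses mean another build is still producing the partition.
IN_PROGRESS = {"PARTITION_SCHEDULED", "PARTITION_BUILDING"}


def plan_partition(
    partition_ref: str,
    my_build_request_id: str,
    latest: Dict[str, Tuple[str, str]],
) -> Tuple[Action, Optional[str]]:
    """Decide whether to build a partition or delegate to an in-progress build.

    `latest` maps partition_ref -> (status, build_request_id) of the newest
    partition event; deriving it from the log is outside this sketch.
    """
    status, owner = latest.get(partition_ref, ("PARTITION_UNKNOWN", ""))
    if status in IN_PROGRESS and owner != my_build_request_id:
        # One-way: the caller logs a DelegationEvent and waits.
        # Nothing is written on behalf of, or sent to, the owning build.
        return Action.DELEGATE, owner
    return Action.BUILD, None


latest = {"data/a": ("PARTITION_BUILDING", "req-1")}
print(plan_partition("data/a", "req-2", latest))  # delegates to req-1
print(plan_partition("data/b", "req-2", latest))  # no one is building it
```

The function only reads state, which mirrors the design point that the delegated-to build never needs to learn about the delegation.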

## 2. Persistence

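The build event log is stored as a core `build_events` table holding the metadata common to every event, plus one table per event type holding the type-specific payload. This normalized design supports multiple storage backends while maintaining consistency. The DDL below uses PostgreSQL types (`UUID`, `TEXT[]`); a SQLite backend would need equivalent `TEXT` columns.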

### Storage Requirements
- **PostgreSQL**: Primary production backend
- **SQLite**: Local development and testing
- **Delta tables**: Future extensibility for analytics workloads

### Table Schema
```sql
-- Core event metadata
CREATE TABLE build_events (
    event_id UUID PRIMARY KEY,
    timestamp BIGINT NOT NULL,
    build_request_id UUID NOT NULL,
    event_type TEXT NOT NULL   -- 'build_request', 'partition', 'job', 'delegation'
);

-- Build request lifecycle events
CREATE TABLE build_request_events (
    event_id UUID PRIMARY KEY REFERENCES build_events(event_id),
    status TEXT NOT NULL,      -- BuildRequestStatus enum
    requested_partitions TEXT[] NOT NULL,
    message TEXT
);

-- Partition lifecycle events  
CREATE TABLE partition_events (
    event_id UUID PRIMARY KEY REFERENCES build_events(event_id),
    partition_ref TEXT NOT NULL,
    status TEXT NOT NULL,      -- PartitionStatus enum
    message TEXT,
    job_run_id UUID           -- NULL for non-job-related events
);

-- Job execution events
CREATE TABLE job_events (
    event_id UUID PRIMARY KEY REFERENCES build_events(event_id),
    job_run_id UUID NOT NULL,
    job_label TEXT NOT NULL,
    target_partitions TEXT[] NOT NULL,
    status TEXT NOT NULL,      -- JobStatus enum
    message TEXT,
    config_json TEXT,          -- JobConfig as JSON (for SCHEDULED events)
    manifests_json TEXT,       -- PartitionManifests as JSON (for COMPLETED events)
    start_time BIGINT,         -- Extracted from config/manifests
    end_time BIGINT            -- Extracted from config/manifests
);

-- Delegation events
CREATE TABLE delegation_events (
    event_id UUID PRIMARY KEY REFERENCES build_events(event_id),
    partition_ref TEXT NOT NULL,
    delegated_to_build_request_id UUID NOT NULL,
    message TEXT
);

-- Indexes for common query patterns
CREATE INDEX idx_build_events_build_request ON build_events(build_request_id, timestamp);
CREATE INDEX idx_build_events_timestamp ON build_events(timestamp);

-- The per-type tables have no timestamp column (it lives in build_events),
-- so these indexes cover the lookup key only; ordering by time requires the
-- join on event_id.
CREATE INDEX idx_partition_events_partition ON partition_events(partition_ref);
CREATE INDEX idx_partition_events_job_run ON partition_events(job_run_id) WHERE job_run_id IS NOT NULL;

CREATE INDEX idx_job_events_job_run ON job_events(job_run_id);
CREATE INDEX idx_job_events_job_label ON job_events(job_label);
CREATE INDEX idx_job_events_status ON job_events(status);

CREATE INDEX idx_delegation_events_partition ON delegation_events(partition_ref);
CREATE INDEX idx_delegation_events_delegated_to ON delegation_events(delegated_to_build_request_id);
```
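
Since every event spans two tables (the core row and the type-specific row), an append should write both atomically. The following is a minimal sketch using Python's `sqlite3` module against a SQLite-adapted subset of the schema (`TEXT` in place of `UUID`); `append_partition_event` is a hypothetical helper, not part of the planned access layer.

```python
import sqlite3
import time
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE build_events (
    event_id TEXT PRIMARY KEY,
    timestamp INTEGER NOT NULL,
    build_request_id TEXT NOT NULL,
    event_type TEXT NOT NULL
);
CREATE TABLE partition_events (
    event_id TEXT PRIMARY KEY REFERENCES build_events(event_id),
    partition_ref TEXT NOT NULL,
    status TEXT NOT NULL,
    message TEXT,
    job_run_id TEXT
);
""")


def append_partition_event(conn, build_request_id, partition_ref, status,
                           message=None, job_run_id=None):
    """Insert the core row and the partition row atomically; return event_id."""
    event_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back if either insert raises
        conn.execute(
            "INSERT INTO build_events VALUES (?, ?, ?, 'partition')",
            (event_id, time.time_ns(), build_request_id),
        )
        conn.execute(
            "INSERT INTO partition_events VALUES (?, ?, ?, ?, ?)",
            (event_id, partition_ref, status, message, job_run_id),
        )
    return event_id


request_id = str(uuid.uuid4())
append_partition_event(conn, request_id, "data/a", "PARTITION_REQUESTED")
row = conn.execute(
    "SELECT e.event_type, p.status "
    "FROM build_events e JOIN partition_events p USING (event_id)"
).fetchone()
print(row)
```

Wrapping both inserts in one transaction keeps the log free of core rows without a payload, which matters because downstream views join the two tables on `event_id`.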

## 3. Access Layer

The access layer provides a simple append/query interface for build events, leaving aggregation logic to the service layer.

### Core Interface
The trait covers appends and the common lookups by build request, partition, job run, and time range; `execute_query` exposes the normalized schema directly for analytical queries:

```rust
trait BuildEventLog {
    // Append new event to the log
    async fn append_event(&self, event: BuildEvent) -> Result<(), Error>;
    
    // Query events by build request
    async fn get_build_request_events(
        &self, 
        build_request_id: &str,
        since: Option<i64>
    ) -> Result<Vec<BuildEvent>, Error>;
    
    // Query events by partition
    async fn get_partition_events(
        &self,
        partition_ref: &str,
        since: Option<i64>
    ) -> Result<Vec<BuildEvent>, Error>;
    
    // Query events by job run
    async fn get_job_run_events(
        &self,
        job_run_id: &str
    ) -> Result<Vec<BuildEvent>, Error>;
    
    // Query events in time range
    async fn get_events_in_range(
        &self,
        start_time: i64,
        end_time: i64
    ) -> Result<Vec<BuildEvent>, Error>;
    
    // Execute raw SQL queries (for dashboard and debugging)
    async fn execute_query(&self, query: &str) -> Result<QueryResult, Error>;
}
```

### Example Analytical Queries
The normalized schema enables dashboard queries like:

```sql
-- Job success rates by label
SELECT job_label, 
       COUNT(*) as total_runs,
       SUM(CASE WHEN status = 'JOB_COMPLETED' THEN 1 ELSE 0 END) as successful_runs,
       AVG(end_time - start_time) as avg_duration_ns
FROM job_events 
WHERE status IN ('JOB_COMPLETED', 'JOB_FAILED')
GROUP BY job_label;

-- Recent partition builds
SELECT p.partition_ref, p.status, e.timestamp, j.job_label
FROM partition_events p
JOIN build_events e ON p.event_id = e.event_id
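-- Match only the completion event; a job run has one job_events row per status
LEFT JOIN job_events j ON p.job_run_id = j.job_run_id AND j.status = 'JOB_COMPLETED'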
WHERE p.status = 'PARTITION_AVAILABLE'
ORDER BY e.timestamp DESC
LIMIT 100;

-- Build request events by status over the last 24 hours (PostgreSQL date functions)
SELECT br.status, COUNT(*) as count
FROM build_request_events br
JOIN build_events e ON br.event_id = e.event_id
WHERE e.timestamp > extract(epoch from now() - interval '24 hours') * 1000000000
GROUP BY br.status;
```

The service layer builds higher-level operations on top of both the simple interface and direct SQL access.
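
To make the first dashboard query concrete, the sketch below runs the job success-rate query against a trimmed `job_events` table in an in-memory SQLite database. The rows and the `//jobs:...` labels are made-up sample data, and `ORDER BY job_label` is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE job_events (
    event_id TEXT PRIMARY KEY,
    job_run_id TEXT NOT NULL,
    job_label TEXT NOT NULL,
    status TEXT NOT NULL,
    start_time INTEGER,
    end_time INTEGER
)""")

# Made-up sample events; times are Unix nanoseconds.
rows = [
    ("e1", "r1", "//jobs:ingest", "JOB_COMPLETED", 0, 2_000_000_000),
    ("e2", "r2", "//jobs:ingest", "JOB_FAILED", 0, 4_000_000_000),
    ("e3", "r3", "//jobs:report", "JOB_COMPLETED", 0, 1_000_000_000),
    ("e4", "r4", "//jobs:report", "JOB_RUNNING", 0, None),  # excluded by WHERE
]
conn.executemany("INSERT INTO job_events VALUES (?, ?, ?, ?, ?, ?)", rows)

SUCCESS_RATE_SQL = """
SELECT job_label,
       COUNT(*) AS total_runs,
       SUM(CASE WHEN status = 'JOB_COMPLETED' THEN 1 ELSE 0 END) AS successful_runs,
       AVG(end_time - start_time) AS avg_duration_ns
FROM job_events
WHERE status IN ('JOB_COMPLETED', 'JOB_FAILED')
GROUP BY job_label
ORDER BY job_label
"""

results = conn.execute(SUCCESS_RATE_SQL).fetchall()
for job_label, total, ok, avg_ns in results:
    print(f"{job_label}: {ok}/{total} succeeded, avg {avg_ns / 1e9:.1f}s")
```

Filtering to terminal statuses means each job run contributes exactly one row, so `COUNT(*)` counts runs rather than events.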

## 4. Core Build Implementation Integration

### Command Line Interface

The core build implementation (`analyze.rs` and `execute.rs`) will be enhanced with build event logging capabilities through new command line arguments:

```bash
# Standard usage with build event logging
./analyze --build-event-log sqlite:///tmp/build.db partition_ref1 partition_ref2
./execute --build-event-log sqlite:///tmp/build.db < job_graph.json

# With explicit build request ID for correlation
./analyze --build-event-log postgres://user:pass@host/db --build-request-id 12345678-1234-1234-1234-123456789012 partition_ref1
```

**New Command Line Arguments:**
- `--build-event-log <URI>` - Specify persistence URI for build events (logging to stdout is implicit)
  - `sqlite://path` - Persist to SQLite database file
  - `postgres://connection` - Persist to PostgreSQL database
- `--build-request-id <UUID>` - Optional build request ID (auto-generated if not provided)
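
One way to dispatch on the URI scheme is sketched below with Python's `urllib.parse`. `parse_build_event_log_uri` is a hypothetical helper; the real parsing would live in the Rust argument handling, and how relative SQLite paths are treated is an assumption of this sketch.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_build_event_log_uri(uri: str) -> Tuple[str, str]:
    """Split a --build-event-log URI into (backend, target)."""
    parsed = urlparse(uri)
    if parsed.scheme == "sqlite":
        # sqlite:///tmp/build.db -> empty netloc, path "/tmp/build.db"
        # sqlite://build.db      -> netloc "build.db", empty path
        path = parsed.netloc + parsed.path
        if not path:
            raise ValueError(f"sqlite URI has no database path: {uri!r}")
        return ("sqlite", path)
    if parsed.scheme == "postgres":
        # Pass the whole URI through as the connection string.
        return ("postgres", uri)
    raise ValueError(f"unsupported build event log URI: {uri!r}")


print(parse_build_event_log_uri("sqlite:///tmp/build.db"))
print(parse_build_event_log_uri("postgres://user:pass@host/db"))
```

Rejecting unknown schemes up front keeps a typo in the flag from silently disabling persistence.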

### Integration Points

**In `analyze.rs` (Graph Analysis Phase):**
1. **Build Request Lifecycle**: Log `BUILD_REQUEST_RECEIVED` when analysis starts and `BUILD_REQUEST_PLANNING` during dependency resolution. `BUILD_REQUEST_COMPLETED` is not logged here, since it means all requested partitions have been built
2. **Staleness Detection**: Query the build event log for existing `PARTITION_AVAILABLE` events to identify non-stale partitions that can be skipped
3. **Delegation Logging**: Log `PARTITION_DELEGATED` events when skipping partitions that are already being built by another request
4. **Job Planning**: Log `PARTITION_SCHEDULED` events for partitions that will be built

**In `execute.rs` (Graph Execution Phase):**
1. **Execution Lifecycle**: Log `BUILD_REQUEST_EXECUTING` when execution starts, then `BUILD_REQUEST_COMPLETED` or `BUILD_REQUEST_FAILED` when it ends
2. **Job Execution Events**: Log `JOB_SCHEDULED`, `JOB_RUNNING`, and `JOB_COMPLETED` or `JOB_FAILED` events throughout job execution
3. **Partition Status**: Log `PARTITION_BUILDING` when jobs start, and `PARTITION_AVAILABLE` or `PARTITION_FAILED` when jobs complete
4. **Build Coordination**: Check for concurrent builds before starting partition work to avoid duplicate effort

### Non-Stale Partition Handling

The build event log lets analysis skip partitions that are already fresh:

1. **During Analysis**: Query for recent `PARTITION_AVAILABLE` events to identify partitions that don't need rebuilding
2. **Staleness Logic**: Compare partition timestamps with upstream dependency timestamps to determine if rebuilding is needed
3. **Skip Documentation**: Log `PARTITION_DELEGATED` events with references to the existing build request ID that produced the partition
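
The timestamp comparison in step 2 might look like the sketch below. It assumes a partition is stale when it has never become available or when any upstream became available after it did; `is_stale` and the `available_at` map (partition ref to the timestamp of its latest `PARTITION_AVAILABLE` event) are hypothetical names for illustration.

```python
from typing import Dict, Iterable


def is_stale(
    partition_ref: str,
    upstream_refs: Iterable[str],
    available_at: Dict[str, int],
) -> bool:
    """Return True if the partition should be rebuilt.

    available_at maps partition_ref -> timestamp (Unix nanoseconds) of its
    most recent PARTITION_AVAILABLE event; absent keys were never built.
    """
    built_at = available_at.get(partition_ref)
    if built_at is None:
        return True  # never built
    for upstream in upstream_refs:
        upstream_built_at = available_at.get(upstream)
        # Assumption: an upstream with no known build is treated as stale
        # input, which errs on the side of rebuilding.
        if upstream_built_at is None or upstream_built_at > built_at:
            return True
    return False


available_at = {"raw/2024-01-01": 100, "clean/2024-01-01": 200}
print(is_stale("clean/2024-01-01", ["raw/2024-01-01"], available_at))

available_at["raw/2024-01-01"] = 300  # upstream rebuilt after downstream
print(is_stale("clean/2024-01-01", ["raw/2024-01-01"], available_at))
```

Keeping the check a pure function of the event log fits the principle that staleness is decided once during planning rather than re-evaluated during execution.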

### Bazel Rules Integration

The `databuild_graph` rule in `rules.bzl` will be enhanced to propagate build event logging configuration:

```python
databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],
    lookup = ":job_lookup",
    build_event_log = "sqlite:///tmp/builds.db",  # New attribute
)
```

**Generated Targets Enhancement:**
- `my_graph_analyze`: Receives `--build-event-log` argument
- `my_graph_exec`: Receives `--build-event-log` argument  
- `my_graph_build`: Coordinates build request ID across analyze/execute phases
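
The coordination done by the build target amounts to choosing one request ID and passing it to both phases. The sketch below only assembles the two command lines; `build_commands` is a hypothetical helper, and the binary names and argument order are assumptions based on the CLI examples in this document rather than the generated targets themselves.

```python
import uuid
from typing import List, Optional, Tuple


def build_commands(
    partition_refs: List[str],
    build_event_log: str,
    build_request_id: Optional[str] = None,
) -> Tuple[List[str], List[str]]:
    """Return (analyze_argv, execute_argv) sharing one build request ID."""
    # Generate the ID once so events from both phases correlate in the log.
    request_id = build_request_id or str(uuid.uuid4())
    shared = [
        "--build-event-log", build_event_log,
        "--build-request-id", request_id,
    ]
    analyze_cmd = ["./analyze", *shared, *partition_refs]
    execute_cmd = ["./execute", *shared]
    return analyze_cmd, execute_cmd


analyze_cmd, execute_cmd = build_commands(
    ["partition_ref1", "partition_ref2"], "sqlite:///tmp/builds.db"
)
print(" ".join(analyze_cmd))
print(" ".join(execute_cmd))
```

Without a shared ID, each phase would auto-generate its own, and the analyze and execute events for one build would appear in the log as two unrelated requests.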

### Implementation Strategy

**Phase 1: Infrastructure**
- Add `BuildEventLog` trait and implementations for stdout/SQLite/PostgreSQL
- Update `databuild.proto` with build event schema
- Add command line argument parsing to `analyze.rs` and `execute.rs`

**Phase 2: Analysis Integration**
- Integrate build event logging into `analyze.rs`
- Implement staleness detection queries
- Add partition delegation logic

**Phase 3: Execution Integration**
- Integrate build event logging into `execute.rs`
- Add job lifecycle event logging
- Implement build coordination checks

**Phase 4: Bazel Integration**
- Update `databuild_graph` rule with build event log support
- Add proper argument propagation and request ID correlation
- End-to-end testing with example graphs

### Key Benefits

1. **Stdout Logging**: Immediate visibility into build progress, since events are always logged to stdout
2. **Persistent History**: Database persistence enables build coordination and historical analysis
3. **Intelligent Skipping**: Avoid rebuilding partitions that are still fresh, so a build only pays for the work that upstream changes require
4. **Build Coordination**: Prevent duplicate work when multiple builds target the same partitions
5. **Audit Trail**: Complete record of all build activities for debugging and monitoring