# Build Event Log Design

The foundation of persistence for DataBuild is the build event log, a fact table recording events related to build requests, partitions, and jobs. Each graph has exactly one build event log, upon which other views (potentially materialized) rely and aggregate, e.g. powering the partition liveness catalog and enabling delegation to in-progress partition builds.

## 1. Schema

The build event log is an append-only event stream that captures all build-related activity. Each event represents a state change in either a build request, partition, or job lifecycle.

```protobuf
// Partition lifecycle states
enum PartitionStatus {
  PARTITION_UNKNOWN = 0;
  PARTITION_REQUESTED = 1;  // Partition requested but not yet scheduled
  PARTITION_SCHEDULED = 2;  // Job scheduled to produce this partition
  PARTITION_BUILDING = 3;   // Job actively building this partition
  PARTITION_AVAILABLE = 4;  // Partition successfully built and available
  PARTITION_FAILED = 5;     // Partition build failed
  PARTITION_DELEGATED = 6;  // Request delegated to existing build
}

// Job execution lifecycle
enum JobStatus {
  JOB_UNKNOWN = 0;
  JOB_SCHEDULED = 1;  // Job scheduled for execution
  JOB_RUNNING = 2;    // Job actively executing
  JOB_COMPLETED = 3;  // Job completed successfully
  JOB_FAILED = 4;     // Job execution failed
  JOB_CANCELLED = 5;  // Job execution cancelled
}

// Build request lifecycle
enum BuildRequestStatus {
  BUILD_REQUEST_UNKNOWN = 0;
  BUILD_REQUEST_RECEIVED = 1;   // Build request received and queued
  BUILD_REQUEST_PLANNING = 2;   // Graph analysis in progress
  BUILD_REQUEST_EXECUTING = 3;  // Jobs are being executed
  BUILD_REQUEST_COMPLETED = 4;  // All requested partitions built
  BUILD_REQUEST_FAILED = 5;     // Build request failed
  BUILD_REQUEST_CANCELLED = 6;  // Build request cancelled
}

// Individual build event
message BuildEvent {
  // Event metadata
  string event_id = 1;          // UUID for this event
  int64 timestamp = 2;          // Unix timestamp (nanoseconds)
  string build_request_id = 3;  // UUID of the build request

  // Event type and payload (one of)
  oneof event_type {
    BuildRequestEvent build_request_event = 10;
    PartitionEvent partition_event = 11;
    JobEvent job_event = 12;
    DelegationEvent delegation_event = 13;
  }
}

// Build request lifecycle event
message BuildRequestEvent {
  BuildRequestStatus status = 1;
  repeated PartitionRef requested_partitions = 2;
  string message = 3;  // Optional status message
}

// Partition state change event
message PartitionEvent {
  PartitionRef partition_ref = 1;
  PartitionStatus status = 2;
  string message = 3;     // Optional status message
  string job_run_id = 4;  // UUID of job run producing this partition (if applicable)
}

// Job execution event
message JobEvent {
  string job_run_id = 1;                       // UUID for this job run
  JobLabel job_label = 2;                      // Job being executed
  repeated PartitionRef target_partitions = 3; // Partitions this job run produces
  JobStatus status = 4;
  string message = 5;                          // Optional status message
  JobConfig config = 6;                        // Job configuration used (for SCHEDULED events)
  repeated PartitionManifest manifests = 7;    // Results (for COMPLETED events)
}

// Delegation event (when build request delegates to existing build)
message DelegationEvent {
  PartitionRef partition_ref = 1;
  string delegated_to_build_request_id = 2;  // Build request handling this partition
  string message = 3;                        // Optional message
}
```

Build events capture the complete lifecycle of composite build requests. A single build request can involve multiple partitions, each potentially requiring different jobs. The event stream allows reconstruction of the full state at any point in time.

### Design Principles

**Staleness as Planning Concern**: Staleness detection and handling occurs during the analysis/planning phase, not during execution. The analyze operation detects partitions that need rebuilding due to upstream changes and includes them in the execution graph. In-progress builds do not react to newly stale partitions - they execute their planned graph to completion.

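As a rough illustration of staleness being decided once, at planning time, the sketch below compares each requested partition's latest `PARTITION_AVAILABLE` timestamp with those of its direct upstreams. The `AvailableAt` map, `is_stale`, and `plan_rebuilds` are hypothetical names introduced here, not part of DataBuild, and the check is deliberately one level deep (it does not propagate staleness transitively).

```rust
use std::collections::HashMap;

/// Hypothetical: latest `PARTITION_AVAILABLE` timestamp (Unix nanoseconds)
/// per partition ref, as it might be derived from the build event log.
type AvailableAt = HashMap<String, i64>;

/// A partition is treated as stale if it was never available, or if any
/// direct upstream became available more recently than it did.
fn is_stale(partition: &str, upstreams: &[&str], available_at: &AvailableAt) -> bool {
    let Some(&built_at) = available_at.get(partition) else {
        return true; // never built
    };
    upstreams.iter().any(|up| match available_at.get(*up) {
        Some(&up_at) => up_at > built_at,
        None => true, // upstream has never been available either
    })
}

/// Planning-time pass: the set of partitions to rebuild is fixed here and
/// is not revisited while the planned graph executes.
fn plan_rebuilds<'a>(
    requested: &[(&'a str, Vec<&'a str>)],
    available_at: &AvailableAt,
) -> Vec<&'a str> {
    requested
        .iter()
        .filter(|(p, ups)| is_stale(p, ups, available_at))
        .map(|(p, _)| *p)
        .collect()
}
```

Because the result is computed from a snapshot of the log, an upstream partition that becomes available after planning does not change what an in-progress build executes.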
**Delegation as Unidirectional Optimization**: When a build request discovers another build is already producing a needed partition, it logs a delegation event and waits for that partition to become available. The delegated-to build request remains unaware of the delegation - it simply continues building its own graph. This eliminates the need for coordination protocols between builds.

## 2. Persistence

The build event log is persisted as a core `build_events` table plus one detail table per event type, mirroring the protobuf schema. This design supports multiple storage backends while maintaining consistency.

### Storage Requirements

- **PostgreSQL**: Primary production backend
- **SQLite**: Local development and testing
- **Delta tables**: Future extensibility for analytics workloads

### Table Schema

```sql
-- Core event metadata
CREATE TABLE build_events (
    event_id UUID PRIMARY KEY,
    timestamp BIGINT NOT NULL,
    build_request_id UUID NOT NULL,
    event_type TEXT NOT NULL  -- 'build_request', 'partition', 'job', 'delegation'
);

-- Build request lifecycle events
CREATE TABLE build_request_events (
    event_id UUID PRIMARY KEY REFERENCES build_events(event_id),
    status TEXT NOT NULL,  -- BuildRequestStatus enum
    requested_partitions TEXT[] NOT NULL,
    message TEXT
);

-- Partition lifecycle events
CREATE TABLE partition_events (
    event_id UUID PRIMARY KEY REFERENCES build_events(event_id),
    partition_ref TEXT NOT NULL,
    status TEXT NOT NULL,  -- PartitionStatus enum
    message TEXT,
    job_run_id UUID  -- NULL for non-job-related events
);

-- Job execution events
CREATE TABLE job_events (
    event_id UUID PRIMARY KEY REFERENCES build_events(event_id),
    job_run_id UUID NOT NULL,
    job_label TEXT NOT NULL,
    target_partitions TEXT[] NOT NULL,
    status TEXT NOT NULL,  -- JobStatus enum
    message TEXT,
    config_json TEXT,      -- JobConfig as JSON (for SCHEDULED events)
    manifests_json TEXT,   -- PartitionManifests as JSON (for COMPLETED events)
    start_time BIGINT,     -- Extracted from config/manifests
    end_time BIGINT        -- Extracted from config/manifests
);

-- Delegation events
CREATE TABLE delegation_events (
    event_id UUID PRIMARY KEY REFERENCES build_events(event_id),
    partition_ref TEXT NOT NULL,
    delegated_to_build_request_id UUID NOT NULL,
    message TEXT
);

-- Indexes for common query patterns.
-- Only build_events has a timestamp column, so indexes on the detail tables
-- cover their own columns; join to build_events for time ordering.
CREATE INDEX idx_build_events_build_request ON build_events(build_request_id, timestamp);
CREATE INDEX idx_build_events_timestamp ON build_events(timestamp);
CREATE INDEX idx_partition_events_partition ON partition_events(partition_ref);
CREATE INDEX idx_partition_events_job_run ON partition_events(job_run_id) WHERE job_run_id IS NOT NULL;
CREATE INDEX idx_job_events_job_run ON job_events(job_run_id);
CREATE INDEX idx_job_events_job_label ON job_events(job_label);
CREATE INDEX idx_job_events_status ON job_events(status);
CREATE INDEX idx_delegation_events_partition ON delegation_events(partition_ref);
CREATE INDEX idx_delegation_events_delegated_to ON delegation_events(delegated_to_build_request_id);
```

## 3. Access Layer

The access layer provides a simple append/query interface for build events, leaving aggregation logic to the service layer.

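To make the split between access layer and service layer concrete, here is a minimal in-memory sketch. It is synchronous, models only partition events, and uses simplified stand-in types (`InMemoryLog`, `PartitionEvent`, `partition_status_as_of` are hypothetical names, not DataBuild's API). Treating `since` as an inclusive lower bound is also an assumption.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the protobuf enum.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum PartitionStatus { Requested, Scheduled, Building, Available, Failed, Delegated }

/// Simplified stand-in for a partition event row.
#[allow(dead_code)]
#[derive(Clone, Debug)]
struct PartitionEvent {
    timestamp: i64, // Unix nanoseconds
    build_request_id: String,
    partition_ref: String,
    status: PartitionStatus,
}

/// Access layer: append plus simple filtered reads, no aggregation.
#[derive(Default)]
struct InMemoryLog { events: Vec<PartitionEvent> }

impl InMemoryLog {
    fn append_event(&mut self, event: PartitionEvent) {
        self.events.push(event);
    }

    /// Events for one partition, optionally from `since` onwards (inclusive).
    fn get_partition_events(&self, partition_ref: &str, since: Option<i64>) -> Vec<PartitionEvent> {
        self.events.iter()
            .filter(|e| e.partition_ref == partition_ref)
            .filter(|e| since.map_or(true, |t| e.timestamp >= t))
            .cloned()
            .collect()
    }

    /// Events with `start <= timestamp <= end`.
    fn get_events_in_range(&self, start: i64, end: i64) -> Vec<PartitionEvent> {
        self.events.iter()
            .filter(|e| e.timestamp >= start && e.timestamp <= end)
            .cloned()
            .collect()
    }
}

/// Service layer: aggregation built on the simple reads above.
/// Returns the latest status of every partition as of `as_of`.
fn partition_status_as_of(log: &InMemoryLog, as_of: i64) -> HashMap<String, PartitionStatus> {
    let mut latest: HashMap<String, (i64, PartitionStatus)> = HashMap::new();
    for e in log.get_events_in_range(i64::MIN, as_of) {
        let entry = latest.entry(e.partition_ref.clone()).or_insert((e.timestamp, e.status));
        if e.timestamp >= entry.0 {
            *entry = (e.timestamp, e.status);
        }
    }
    latest.into_iter().map(|(k, (_, s))| (k, s)).collect()
}
```

Keeping the aggregation outside the log type means the same fold could run against any backend that can return events in a time range, which is the property the append/query split is meant to preserve.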
### Core Interface

The normalized schema enables both simple event queries and complex analytical queries:

```rust
trait BuildEventLog {
    // Append new event to the log
    async fn append_event(&self, event: BuildEvent) -> Result<(), Error>;

    // Query events by build request
    async fn get_build_request_events(
        &self,
        build_request_id: &str,
        since: Option<i64>
    ) -> Result<Vec<BuildEvent>, Error>;

    // Query events by partition
    async fn get_partition_events(
        &self,
        partition_ref: &str,
        since: Option<i64>
    ) -> Result<Vec<BuildEvent>, Error>;

    // Query events by job run
    async fn get_job_run_events(
        &self,
        job_run_id: &str
    ) -> Result<Vec<BuildEvent>, Error>;

    // Query events in time range
    async fn get_events_in_range(
        &self,
        start_time: i64,
        end_time: i64
    ) -> Result<Vec<BuildEvent>, Error>;

    // Execute raw SQL queries (for dashboard and debugging).
    // `QueryResult` is a placeholder for a backend-specific row set type
    // that this document does not specify.
    async fn execute_query(&self, query: &str) -> Result<QueryResult, Error>;
}
```

### Example Analytical Queries

The normalized schema enables dashboard queries like:

```sql
-- Job success rates by label
SELECT
    job_label,
    COUNT(*) as total_runs,
    SUM(CASE WHEN status = 'JOB_COMPLETED' THEN 1 ELSE 0 END) as successful_runs,
    AVG(end_time - start_time) as avg_duration_ns
FROM job_events
WHERE status IN ('JOB_COMPLETED', 'JOB_FAILED')
GROUP BY job_label;

-- Recent partition builds
SELECT
    p.partition_ref,
    p.status,
    e.timestamp,
    j.job_label
FROM partition_events p
JOIN build_events e ON p.event_id = e.event_id
LEFT JOIN job_events j ON p.job_run_id = j.job_run_id
WHERE p.status = 'PARTITION_AVAILABLE'
ORDER BY e.timestamp DESC
LIMIT 100;

-- Build request status summary
SELECT
    br.status,
    COUNT(*) as count
FROM build_request_events br
JOIN build_events e ON br.event_id = e.event_id
WHERE e.timestamp > extract(epoch from now() - interval '24 hours') * 1000000000
GROUP BY br.status;
```

The service layer builds higher-level operations on top of both the simple interface and direct SQL access.

## 4. Core Build Implementation Integration

### Command Line Interface

The core build implementation (`analyze.rs` and `execute.rs`) will be enhanced with build event logging capabilities through new command line arguments:

```bash
# Standard usage with build event logging
./analyze partition_ref1 partition_ref2
./execute --build-event-log sqlite:///tmp/build.db < job_graph.json

# With explicit build request ID for correlation
./analyze --build-event-log postgres://user:pass@host/db --build-request-id 12345678-1234-1234-1234-123456789012
```

**New Command Line Arguments:**

- `--build-event-log <uri>` - Specify persistence URI for build events (logging to stdout is implicit)
  - `sqlite://path` - Persist to SQLite database file
  - `postgres://connection` - Persist to PostgreSQL database
- `--build-request-id <uuid>` - Optional build request ID (auto-generated if not provided)

### Integration Points

**In `analyze.rs` (Graph Analysis Phase):**

1. **Build Request Lifecycle**: Log `BUILD_REQUEST_RECEIVED` when analysis starts, `BUILD_REQUEST_PLANNING` during dependency resolution, and `BUILD_REQUEST_COMPLETED` when analysis finishes
2. **Staleness Detection**: Query build event log for existing `PARTITION_AVAILABLE` events to identify non-stale partitions that can be skipped
3. **Delegation Logging**: Log `PARTITION_DELEGATED` events when skipping partitions that are already being built by another request
4. **Job Planning**: Log `PARTITION_SCHEDULED` events for partitions that will be built

**In `execute.rs` (Graph Execution Phase):**

1. **Execution Lifecycle**: Log `BUILD_REQUEST_EXECUTING` when execution starts
2. **Job Execution Events**: Log `JOB_SCHEDULED`, `JOB_RUNNING`, and `JOB_COMPLETED` or `JOB_FAILED` events throughout job execution
3. **Partition Status**: Log `PARTITION_BUILDING` when jobs start, and `PARTITION_AVAILABLE` or `PARTITION_FAILED` when jobs complete
4. **Build Coordination**: Check for concurrent builds before starting partition work to avoid duplicate effort

### Non-Stale Partition Handling

The build event log enables intelligent partition skipping:

1. **During Analysis**: Query for recent `PARTITION_AVAILABLE` events to identify partitions that don't need rebuilding
2. **Staleness Logic**: Compare partition timestamps with upstream dependency timestamps to determine if rebuilding is needed
3. **Skip Documentation**: Log `PARTITION_DELEGATED` events with references to the existing build request ID that produced the partition

### Bazel Rules Integration

The `databuild_graph` rule in `rules.bzl` will be enhanced to propagate build event logging configuration:

```python
databuild_graph(
    name = "my_graph",
    jobs = [":job1", ":job2"],
    lookup = ":job_lookup",
    build_event_log = "sqlite:///tmp/builds.db",  # New attribute
)
```

**Generated Targets Enhancement:**

- `my_graph_analyze`: Receives `--build-event-log` argument
- `my_graph_exec`: Receives `--build-event-log` argument
- `my_graph_build`: Coordinates build request ID across analyze/execute phases

### Implementation Strategy

**Phase 1: Infrastructure**

- Add `BuildEventLog` trait and implementations for stdout/SQLite/PostgreSQL
- Update `databuild.proto` with build event schema
- Add command line argument parsing to `analyze.rs` and `execute.rs`

**Phase 2: Analysis Integration**

- Integrate build event logging into `analyze.rs`
- Implement staleness detection queries
- Add partition delegation logic

**Phase 3: Execution Integration**

- Integrate build event logging into `execute.rs`
- Add job lifecycle event logging
- Implement build coordination checks

**Phase 4: Bazel Integration**

- Update `databuild_graph` rule with build event log support
- Add proper argument propagation and request ID correlation
- End-to-end testing with example graphs

### Key Benefits

1. **Stdout Logging**: Immediate visibility into build progress with `--build-event-log stdout`
2. **Persistent History**: Database persistence enables build coordination and historical analysis
3. **Intelligent Skipping**: Avoid rebuilding fresh partitions, significantly improving build performance
4. **Build Coordination**: Prevent duplicate work when multiple builds target the same partitions
5. **Audit Trail**: Complete record of all build activities for debugging and monitoring
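The skipping and coordination behaviour summarized above can be pictured as one planning-time decision per requested partition. The sketch below is an assumption-laden illustration, not DataBuild's implementation: `Latest`, `PlanAction`, and `decide` are hypothetical names, and the `fresh` flag stands for whatever staleness check the planner applies.

```rust
/// Hypothetical: what the planner does with one requested partition.
#[derive(Debug, PartialEq)]
enum PlanAction {
    /// A fresh PARTITION_AVAILABLE event exists; nothing to build.
    Skip,
    /// Another build request is already producing it; log a delegation and wait.
    Delegate { to_build_request_id: String },
    /// Schedule a job for it in this build request.
    Build,
}

/// Hypothetical: latest known state of a partition, derived from the event log.
enum Latest<'a> {
    NeverSeen,
    Available { fresh: bool },
    /// Latest event is PARTITION_SCHEDULED or PARTITION_BUILDING.
    InProgress { build_request_id: &'a str },
    Failed,
}

/// Delegation is unidirectional: only the requesting side consults the log,
/// and the delegated-to build is never notified.
fn decide(latest: &Latest<'_>, own_build_request_id: &str) -> PlanAction {
    match latest {
        Latest::Available { fresh: true } => PlanAction::Skip,
        Latest::InProgress { build_request_id }
            if *build_request_id != own_build_request_id =>
        {
            PlanAction::Delegate { to_build_request_id: (*build_request_id).to_owned() }
        }
        // Stale, failed, never seen, or already owned by this request.
        _ => PlanAction::Build,
    }
}
```

For example, `decide(&Latest::InProgress { build_request_id: "br-1" }, "br-2")` yields a `Delegate` action pointing at `br-1`, while a stale or failed partition falls through to `Build`.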