diff --git a/plans/build-event-log.md b/plans/build-event-log.md index cc0ed10..f57563b 100644 --- a/plans/build-event-log.md +++ b/plans/build-event-log.md @@ -4,6 +4,8 @@ The foundation of persistence for DataBuild is the build event log, a fact table ## 1. Schema +The build event log is an append-only event stream that captures all build-related activity. Each event represents a state change in either a build request, partition, or job lifecycle. + ```protobuf // Partition lifecycle states enum PartitionStatus { @@ -13,30 +15,233 @@ enum PartitionStatus { PARTITION_BUILDING = 3; // Job actively building this partition PARTITION_AVAILABLE = 4; // Partition successfully built and available PARTITION_FAILED = 5; // Partition build failed - PARTITION_STALE = 6; // Partition exists but upstream dependencies changed - PARTITION_DELEGATED = 7; // Request delegated to existing build + PARTITION_DELEGATED = 6; // Request delegated to existing build } -// Job lifecycle +// Job execution lifecycle enum JobStatus { - // TODO implement me + JOB_UNKNOWN = 0; + JOB_SCHEDULED = 1; // Job scheduled for execution + JOB_RUNNING = 2; // Job actively executing + JOB_COMPLETED = 3; // Job completed successfully + JOB_FAILED = 4; // Job execution failed + JOB_CANCELLED = 5; // Job execution cancelled } -// Individual partition activity event +// Build request lifecycle +enum BuildRequestStatus { + BUILD_REQUEST_UNKNOWN = 0; + BUILD_REQUEST_RECEIVED = 1; // Build request received and queued + BUILD_REQUEST_PLANNING = 2; // Graph analysis in progress + BUILD_REQUEST_EXECUTING = 3; // Jobs are being executed + BUILD_REQUEST_COMPLETED = 4; // All requested partitions built + BUILD_REQUEST_FAILED = 5; // Build request failed + BUILD_REQUEST_CANCELLED = 6; // Build request cancelled +} + +// Individual build event message BuildEvent { - // TODO implement me + // Event metadata + string event_id = 1; // UUID for this event + int64 timestamp = 2; // Unix timestamp (nanoseconds) + string 
build_request_id = 3; // UUID of the build request + + // Event type and payload (one of) + oneof event_type { + BuildRequestEvent build_request_event = 10; + PartitionEvent partition_event = 11; + JobEvent job_event = 12; + DelegationEvent delegation_event = 13; + } +} + +// Build request lifecycle event +message BuildRequestEvent { + BuildRequestStatus status = 1; + repeated PartitionRef requested_partitions = 2; + string message = 3; // Optional status message +} + +// Partition state change event +message PartitionEvent { + PartitionRef partition_ref = 1; + PartitionStatus status = 2; + string message = 3; // Optional status message + string job_run_id = 4; // UUID of job run producing this partition (if applicable) +} + +// Job execution event +message JobEvent { + string job_run_id = 1; // UUID for this job run + JobLabel job_label = 2; // Job being executed + repeated PartitionRef target_partitions = 3; // Partitions this job run produces + JobStatus status = 4; + string message = 5; // Optional status message + JobConfig config = 6; // Job configuration used (for SCHEDULED events) + repeated PartitionManifest manifests = 7; // Results (for COMPLETED events) +} + +// Delegation event (when build request delegates to existing build) +message DelegationEvent { + PartitionRef partition_ref = 1; + string delegated_to_build_request_id = 2; // Build request handling this partition + string message = 3; // Optional message } ``` -Build events are practically job events, as they are the unit of work, but they also represent progress towards building specific partitions and their downstreams. One build request ID represents the literal request to the service (potentially accepting a provided build request ID). The expectation is that most build requests involve multiple partitions, and we should be able to see the tree structure over time to see jobs succeeding and progress towards the requested partition being built. 
Individual job runs should have their own ID allowing them to be referenced later. +Build events capture the complete lifecycle of composite build requests. A single build request can involve multiple partitions, each potentially requiring different jobs. The event stream allows reconstruction of the full state at any point in time. -TODO narrative + +### Design Principles + +**Staleness as Planning Concern**: Staleness detection and handling occur during the analysis/planning phase, not during execution. The analyze operation detects partitions that need rebuilding due to upstream changes and includes them in the execution graph. In-progress builds do not react to newly stale partitions; they execute their planned graph to completion. + +**Delegation as Unidirectional Optimization**: When a build request discovers another build is already producing a needed partition, it logs a delegation event and waits for that partition to become available. The delegated-to build request remains unaware of the delegation; it simply continues building its own graph. This eliminates the need for coordination protocols between builds. ## 2. Persistence -TODO narrative + design, with requirements: -- Should target postgres, sqlite, and delta tables +The build event log uses a core `build_events` table for event metadata, plus one table per event type (build request, partition, job, delegation) for the payload fields. This normalized design supports multiple storage backends while maintaining consistency. 
+ +### Storage Requirements +- **PostgreSQL**: Primary production backend +- **SQLite**: Local development and testing +- **Delta tables**: Future extensibility for analytics workloads + +### Table Schema +```sql +-- Core event metadata +CREATE TABLE build_events ( + event_id UUID PRIMARY KEY, + timestamp BIGINT NOT NULL, + build_request_id UUID NOT NULL, + event_type TEXT NOT NULL -- 'build_request', 'partition', 'job', 'delegation' +); + +-- Build request lifecycle events +CREATE TABLE build_request_events ( + event_id UUID PRIMARY KEY REFERENCES build_events(event_id), + status TEXT NOT NULL, -- BuildRequestStatus enum + requested_partitions TEXT[] NOT NULL, + message TEXT +); + +-- Partition lifecycle events +CREATE TABLE partition_events ( + event_id UUID PRIMARY KEY REFERENCES build_events(event_id), + partition_ref TEXT NOT NULL, + status TEXT NOT NULL, -- PartitionStatus enum + message TEXT, + job_run_id UUID -- NULL for non-job-related events +); + +-- Job execution events +CREATE TABLE job_events ( + event_id UUID PRIMARY KEY REFERENCES build_events(event_id), + job_run_id UUID NOT NULL, + job_label TEXT NOT NULL, + target_partitions TEXT[] NOT NULL, + status TEXT NOT NULL, -- JobStatus enum + message TEXT, + config_json TEXT, -- JobConfig as JSON (for SCHEDULED events) + manifests_json TEXT, -- PartitionManifests as JSON (for COMPLETED events) + start_time BIGINT, -- Extracted from config/manifests + end_time BIGINT -- Extracted from config/manifests +); + +-- Delegation events +CREATE TABLE delegation_events ( + event_id UUID PRIMARY KEY REFERENCES build_events(event_id), + partition_ref TEXT NOT NULL, + delegated_to_build_request_id UUID NOT NULL, + message TEXT +); + +-- Indexes for common query patterns (timestamp lives only on build_events; join on event_id for time ordering) +CREATE INDEX idx_build_events_build_request ON build_events(build_request_id, timestamp); +CREATE INDEX idx_build_events_timestamp ON build_events(timestamp); +
+CREATE INDEX idx_partition_events_partition ON partition_events(partition_ref); +CREATE INDEX idx_partition_events_job_run ON partition_events(job_run_id) WHERE job_run_id IS NOT NULL; + +CREATE INDEX idx_job_events_job_run ON job_events(job_run_id); +CREATE INDEX idx_job_events_job_label ON job_events(job_label); +CREATE INDEX idx_job_events_status ON job_events(status); + +CREATE INDEX idx_delegation_events_partition ON delegation_events(partition_ref); +CREATE INDEX idx_delegation_events_delegated_to ON delegation_events(delegated_to_build_request_id); +``` ## 3. Access Layer -TODO narrative + design +The access layer provides a simple append/query interface for build events, leaving aggregation logic to the service layer. + +### Core Interface +The normalized schema enables both simple event queries and complex analytical queries: + +```rust +trait BuildEventLog { + // Append new event to the log + async fn append_event(&self, event: BuildEvent) -> Result<(), Error>; + + // Query events by build request + async fn get_build_request_events( + &self, + build_request_id: &str, + since: Option<i64> + ) -> Result<Vec<BuildEvent>, Error>; + + // Query events by partition + async fn get_partition_events( + &self, + partition_ref: &str, + since: Option<i64> + ) -> Result<Vec<BuildEvent>, Error>; + + // Query events by job run + async fn get_job_run_events( + &self, + job_run_id: &str + ) -> Result<Vec<BuildEvent>, Error>; + + // Query events in time range + async fn get_events_in_range( + &self, + start_time: i64, + end_time: i64 + ) -> Result<Vec<BuildEvent>, Error>; + + // Execute raw SQL queries (for dashboard and debugging) + // QueryResult is a placeholder name for a backend-specific row-set type + async fn execute_query(&self, query: &str) -> Result<QueryResult, Error>; +} +``` + +### Example Analytical Queries +The normalized schema enables dashboard queries like: + +```sql +-- Job success rates by label +SELECT job_label, + COUNT(*) as total_runs, + SUM(CASE WHEN status = 'JOB_COMPLETED' THEN 1 ELSE 0 END) as successful_runs, + AVG(end_time - start_time) as avg_duration_ns +FROM job_events +WHERE status IN ('JOB_COMPLETED', 'JOB_FAILED') 
+GROUP BY job_label; + +-- Recent partition builds +SELECT p.partition_ref, p.status, e.timestamp, j.job_label +FROM partition_events p +JOIN build_events e ON p.event_id = e.event_id +LEFT JOIN job_events j ON p.job_run_id = j.job_run_id AND j.status = 'JOB_COMPLETED' +WHERE p.status = 'PARTITION_AVAILABLE' +ORDER BY e.timestamp DESC +LIMIT 100; + +-- Build request status summary +SELECT br.status, COUNT(*) as count +FROM build_request_events br +JOIN build_events e ON br.event_id = e.event_id +WHERE e.timestamp > extract(epoch from now() - interval '24 hours') * 1000000000 +GROUP BY br.status; +``` + +The service layer builds higher-level operations on top of both the simple interface and direct SQL access.
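
As a rough illustration of the kind of higher-level operation the service layer might build, the sketch below replays partition events to recover each partition's status as of a given timestamp. The `PartitionStatus` and `PartitionEvent` types here are simplified, hypothetical stand-ins for the protobuf messages in section 1 (not the generated types), and `partition_state_at` is an assumed helper name, not part of the planned interface.

```rust
use std::collections::HashMap;

// Simplified stand-in for the protobuf PartitionStatus enum (assumption).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PartitionStatus {
    Building,
    Available,
    Failed,
    Delegated,
}

// Simplified stand-in combining BuildEvent.timestamp with PartitionEvent fields (assumption).
#[derive(Debug, Clone)]
struct PartitionEvent {
    timestamp: i64, // Unix nanoseconds
    partition_ref: String,
    status: PartitionStatus,
}

/// Replays events and returns the newest status per partition
/// among events with `timestamp <= as_of`.
fn partition_state_at(
    events: &[PartitionEvent],
    as_of: i64,
) -> HashMap<String, PartitionStatus> {
    // Keep (timestamp, status) so out-of-order input still resolves to the newest event.
    let mut latest: HashMap<String, (i64, PartitionStatus)> = HashMap::new();
    for ev in events.iter().filter(|e| e.timestamp <= as_of) {
        let entry = latest
            .entry(ev.partition_ref.clone())
            .or_insert((ev.timestamp, ev.status));
        if ev.timestamp >= entry.0 {
            *entry = (ev.timestamp, ev.status);
        }
    }
    latest.into_iter().map(|(k, (_, s))| (k, s)).collect()
}
```

Because the log is append-only, the same fold can be rerun with any `as_of` value to view past state; a real implementation would presumably fetch the events through `get_partition_events` or a SQL query rather than take an in-memory slice.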