Add plans / update designs

Stuart Axelbrooke 2025-08-11 21:48:49 -07:00
parent ba18734190
commit 206c97bb66
11 changed files with 1060 additions and 84 deletions


@@ -12,7 +12,7 @@ DataBuild is a bazel-based data build system. Key files:
- [Graph specification](./design/graph-specification.md) - Describes the different libraries that enable more succinct declaration of databuild applications than the core bazel-based interface.
- [Observability](./design/observability.md) - How observability is systematically achieved throughout databuild applications.
- [Deploy strategies](./design/deploy-strategies.md) - Different strategies for deploying databuild applications.
-- [Triggers](./design/triggers.md) - How triggering works in databuild applications.
+- [Wants](./design/wants.md) - How triggering works in databuild applications.
- [Why databuild?](./design/why-databuild.md) - Why to choose databuild instead of other better established orchestration solutions.
Please reference these for any related work, as they indicate key technical bias/direction of the project.


@@ -58,7 +58,7 @@ The BEL encodes all relevant build actions that occur, enabling concurrent build
The BEL is similar to [event-sourced](https://martinfowler.com/eaaDev/EventSourcing.html) systems, as all application state is rendered from aggregations over the BEL. This enables the BEL to stay simple while also powering concurrent builds, the data catalog, and the DataBuild service.
### Triggers and Wants (Coming Soon)
-["Wants"](./design/triggers.md) are the main mechanism for continually building partitions over time. In real world scenarios, it is standard for data to arrive late, or not at all. Wants cause the databuild graph to continually attempt to build the wanted partitions until a) the partitions are live or b) the want expires, at which point another script can be run. Wants are the mechanism that implements SLA checking.
+["Wants"](./design/wants.md) are the main mechanism for continually building partitions over time. In real world scenarios, it is standard for data to arrive late, or not at all. Wants cause the databuild graph to continually attempt to build the wanted partitions until a) the partitions are live or b) the want expires, at which point another script can be run. Wants are the mechanism that implements SLA checking.
You can also use cron-based triggers, which return partition refs that they want built.


@@ -256,6 +256,22 @@ message BuildCancelEvent {
  string reason = 1; // Reason for cancellation
}
+// Partition Want
+message WantSource {
+  // TODO
+}
+message PartitionWant {
+  PartitionRef partition_ref = 1;            // Partition being requested
+  uint64 created_at = 2;                     // Server time when want registered
+  optional uint64 data_timestamp = 3;        // Business time this partition represents
+  optional uint64 ttl_seconds = 4;           // Give up after this long (from created_at)
+  optional uint64 sla_seconds = 5;           // SLA violation after this long (from data_timestamp)
+  repeated string external_dependencies = 6; // Cross-graph dependencies
+  string want_id = 7;                        // Unique identifier
+  WantSource source = 8;                     // How this want was created
+}
// Individual build event
message BuildEvent {
  // Event metadata


@@ -1,34 +1,72 @@
# Build Event Log (BEL)
-Purpose: Store build events and define views summarizing databuild application state, like partition catalog, build
-status summary, job run statistics, etc.
+Purpose: Store build events and provide efficient cross-graph coordination via a minimal, append-only event stream.
## Architecture
- Uses [event sourcing](https://martinfowler.com/eaaDev/EventSourcing.html) /
  [CQRS](https://www.wikipedia.org/wiki/cqrs) philosophy.
- BELs are only ever written to by graph processes (e.g. CLI or service), not the jobs themselves.
-- BEL uses only two types of tables:
-  - The root event table, with event ID, timestamp, message, event type, and ID fields for related event types.
-  - Type-specific event tables (e.g. task event, partition event, build request event, etc.).
-  - This makes it easy to support multiple backends (SQLite, Postgres, and Delta tables are supported initially).
-- Exposes an access layer that mediates writes, and which exposes entity-specific repositories for reads.
+- **Three-layer architecture:**
+  1. **Storage Layer**: Append-only event storage with sequential scanning
+  2. **Query Engine Layer**: App-layer aggregation for entity queries (partition status, build summaries, etc.)
+  3. **Client Layer**: CLI, Service, Dashboard consuming aggregated views
+- **Cross-graph coordination** via a minimal `GraphService` API that supports event streaming since a given index
+- Storage backends focus on efficient append + sequential scan operations (file-based, SQLite, Postgres, Delta Lake)
## Correctness Strategy
- Access layer will evaluate events requested to be written, returning an error if the event is not a correct next
  state based on the involved component's governing state diagram.
- Events are versioned, with each version's schemas stored in [`databuild.proto`](../databuild/databuild.proto).
-## Write Interface
-See [trait definition](../databuild/event_log/mod.rs).
-## Read Repositories
-There are repositories for the following entities:
-- Builds
-- Jobs
-- Partitions
-- Tasks
-Generally the following verbs are available for each:
-- Show
-- List
-- Cancel
+## Storage Layer Interface
+Minimal append-only interface optimized for sequential scanning:
+```rust
+#[async_trait]
+trait BELStorage {
+    async fn append_event(&self, event: BuildEvent) -> Result<i64>; // returns event index
+    async fn list_events(&self, since_idx: i64, filter: EventFilter) -> Result<EventPage>;
+}
+```
+Where `EventFilter` is defined in `databuild.proto` as:
+```protobuf
+message EventFilter {
+  repeated string partition_refs = 1;      // Exact partition matches
+  repeated string partition_patterns = 2;  // Glob patterns like "data/users/*"
+  repeated string job_labels = 3;          // Job-specific events
+  repeated string task_ids = 4;            // Task run events
+  repeated string build_request_ids = 5;   // Build-specific events
+}
+```
+## Query Engine Interface
+App-layer aggregation that scans storage layer events:
+```rust
+struct BELQueryEngine {
+    storage: Box<dyn BELStorage>,
+    partition_status_cache: Option<PartitionStatusCache>,
+}
+impl BELQueryEngine {
+    async fn get_latest_partition_status(&self, partition_ref: &str) -> Result<Option<PartitionStatus>>;
+    async fn get_active_builds_for_partition(&self, partition_ref: &str) -> Result<Vec<String>>;
+    async fn get_build_request_summary(&self, build_id: &str) -> Result<BuildRequestSummary>;
+    async fn list_build_requests(&self, limit: u32, offset: u32, status_filter: Option<BuildRequestStatus>) -> Result<Vec<BuildRequestSummary>>;
+}
+```
+## Cross-Graph Coordination
+Graphs coordinate via the `GraphService` API for efficient event streaming:
+```rust
+trait GraphService {
+    async fn list_events(&self, since_idx: i64, filter: EventFilter) -> Result<EventPage>;
+}
+```
+This enables:
+- **Event-driven reactivity**: Downstream graphs react within seconds of upstream partition availability
+- **Efficient subscriptions**: Only scan events for relevant partitions
+- **Reliable coordination**: HTTP polling avoids the event-loss issues of streaming APIs


@@ -1,6 +1,11 @@
# Service
-Purpose: Enable centrally hostable and human-consumable interface for databuild applications.
+Purpose: Enable centrally hostable and human-consumable interface for databuild applications, plus efficient cross-graph coordination.
+## Architecture
+The service provides two primary capabilities:
+1. **Human Interface**: Web dashboard and HTTP API for build management and monitoring
+2. **Cross-Graph Coordination**: `GraphService` API enabling efficient event-driven coordination between DataBuild instances
## Correctness Strategy
- Rely on databuild.proto, call same shared code in core
@@ -8,6 +13,48 @@ Purpose: Enable centrally hostable and human-consumable interface for databuild
- Core -- databuild.proto --> service -- openapi --> web app
- No magic strings (how? protobuf doesn't have consts. enums values? code gen over yaml?)
+## Cross-Graph Coordination
+Services expose the `GraphService` API for cross-graph dependency management:
+```rust
+trait GraphService {
+    async fn list_events(&self, since_idx: i64, filter: EventFilter) -> Result<EventPage>;
+}
+```
+### Cross-Graph Usage Pattern
+```rust
+// Downstream graph subscribing to upstream partitions
+struct UpstreamDependency {
+    service_url: String,             // e.g., "https://upstream-databuild.corp.com"
+    partition_patterns: Vec<String>, // e.g., ["data/users/*", "ml/models/prod/*"]
+    last_sync_idx: i64,
+}
+// Periodic sync of relevant upstream events
+async fn sync_upstream_events(upstream: &mut UpstreamDependency) -> Result<()> {
+    let client = GraphServiceClient::new(&upstream.service_url);
+    let filter = EventFilter {
+        partition_patterns: upstream.partition_patterns.clone(),
+        ..Default::default()
+    };
+    let events = client.list_events(upstream.last_sync_idx, filter).await?;
+    // Process partition availability events for immediate job triggering
+    for event in events.events {
+        if let EventType::PartitionEvent(pe) = event.event_type {
+            if pe.status_code == PartitionStatus::PartitionAvailable {
+                trigger_dependent_jobs(&pe.partition_ref).await?;
+            }
+        }
+    }
+    upstream.last_sync_idx = events.next_idx;
+    Ok(())
+}
+```
## API
The purpose of the API is to enable remote, programmatic interaction with databuild applications, and to host endpoints
needed by the [web app](#web-app).


design/triggers.md (deleted file, 56 lines)
@@ -1,56 +0,0 @@
# Triggers
Purpose: to enable simple but powerful declarative specification of what data should be built.
## Correctness Strategy
- Wants + TTLs
- ...?
## Wants
Wants cause graphs to try to build the wanted partitions until a) the partitions are live or b) the TTL runs out. Wants
can trigger a callback on TTL expiry, enabling SLA-like behavior. Wants are recorded in the [BEL](./build-event-log.md),
so they can be queried and viewed in the web app, linking to build requests triggered by a given want, enabling
answering of the "why doesn't this partition exist yet?" question.
### Unwants
You can also unwant partitions, which overrides all wants of those partitions prior to the unwant timestamp. This is
primarily to enable the "data source is now disabled" style feature practically necessary in many data platforms.
### Virtual Partitions & External Data
Essentially all data teams consume some external data source, and late arriving data is the rule more than the
exception. Virtual partitions are a way to model external data that is not produced by a graph. For all intents and
purposes, these are standard partitions; the only difference is that the job that "produces" them doesn't actually
do any ETL, it just assesses external data sufficiency and emits a "partition live" event when it's ready to be consumed.
## Triggers
## Taints
- Mechanism for invalidating existing partitions (e.g. we know bad data went into this, need to stop consumers from
using it)
---
- Purpose
  - Every useful data application has triggering to ensure data is built on schedule
- Philosophy
  - Opinionated strategy plus escape hatches
  - Taints
- Two strategies
  - Basic: cron triggered scripts that return partitions
    - Bazel: target with `cron`, `executable` fields, optional `partition_patterns` field to constrain
  - Declarative: want-based, wants cause build requests to be continually retried until the wanted
    partitions are live, or running a `want_failed` script if it times out (e.g. SLA breach)
    - +want and -want
      - +want declares want for 1+ partitions with a timeout, recorded to the [build event log](./build-event-log.md)
      - -want invalidates all past wants of specified partitions (but not future; doesn't impact non-specified
        partitions)
      - Their primary purpose is to prevent an SLA breach alarm when a datasource is disabled, etc.
- Need graph preconditions? And concept of external/virtual partitions or readiness probes?
  - Virtual partitions: allow graphs to say "precondition failed"; can be created in BEL, created via want or
    cron trigger? (e.g. want strategy continually tries to resolve the external data, creating a virtual
    partition once it can find it; cron just runs the script when it's triggered)
  - Readiness probes don't fit the paradigm, feel too imperative.

design/wants.md (new file, 287 lines)

@@ -0,0 +1,287 @@
# Wants System
Purpose: Enable declarative specification of data requirements with SLA tracking, cross-graph coordination, and efficient build triggering while maintaining atomic build semantics.
## Overview
The wants system unifies all build requests (manual, scheduled, triggered) under a single declarative model where:
- **Wants declare intent** via events in the [build event log](./build-event-log.md)
- **Builds reactively satisfy** what's currently possible with atomic semantics
- **Monitoring identifies gaps** between declared wants and delivered partitions
- **Cross-graph coordination** happens via the `GraphService` API
## Architecture
### Core Components
1. **PartitionWantEvent**: Declarative specification of data requirements
2. **Build Evaluation**: Reactive logic that attempts to satisfy wants when possible
3. **SLA Monitoring**: External system that queries for expired wants
4. **Cross-Graph Coordination**: Event-driven dependency management across DataBuild instances
### Want Event Schema
Defined in `databuild.proto`:
```protobuf
message PartitionWantEvent {
  string partition_ref = 1;                  // Partition being requested
  int64 created_at = 2;                      // Server time when want registered
  optional int64 data_timestamp = 3;         // Business time this partition represents
  optional uint64 ttl_seconds = 4;           // Give up after this long (from created_at)
  optional uint64 sla_seconds = 5;           // SLA violation after this long (from data_timestamp)
  repeated string external_dependencies = 6; // Cross-graph dependencies
  string want_id = 7;                        // Unique identifier
  WantSource source = 8;                     // How this want was created
}
message WantSource {
  oneof source_type {
    CliManual cli_manual = 1;             // Manual CLI request
    DashboardManual dashboard_manual = 2; // Manual dashboard request
    Scheduled scheduled = 3;              // Scheduled/triggered job
    ApiRequest api_request = 4;           // External API call
  }
}
```
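As a concrete illustration, here is how the `source_type` oneof could surface in prost-generated Rust; the `databuild` module path and the empty variant messages are assumptions based on standard prost codegen, not confirmed output:
```rust
// Hypothetical prost-generated types; module layout assumed, not confirmed.
use databuild::{want_source::SourceType, CliManual, WantSource};

fn manual_cli_source() -> WantSource {
    WantSource {
        // oneof fields become an Option-wrapped enum in prost
        source_type: Some(SourceType::CliManual(CliManual {})),
    }
}
```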
## Want Lifecycle
### 1. Want Registration
All build requests become wants:
```rust
// CLI: databuild build data/users/2024-01-01
PartitionWantEvent {
    partition_ref: "data/users/2024-01-01",
    created_at: now(),
    data_timestamp: None, // These must be set explicitly in the request
    ttl_seconds: None,
    sla_seconds: None,
    external_dependencies: vec![], // no externally sourced data necessary
    want_id: generate_uuid(),
    source: WantSource { ... },
}
// Scheduled pipeline: Daily analytics
PartitionWantEvent {
    partition_ref: "analytics/daily/2024-01-01",
    created_at: now(),
    data_timestamp: Some(parse_date("2024-01-01")),
    ttl_seconds: Some(365 * 24 * 3600), // Keep trying for 1 year
    sla_seconds: Some(9 * 3600),        // Expected by 9am (9hrs after data_timestamp)
    external_dependencies: vec!["data/users/2024-01-01"],
    want_id: "daily-analytics-2024-01-01",
    source: WantSource { ... },
}
```
### 2. Build Evaluation
DataBuild continuously evaluates build opportunities:
```rust
async fn evaluate_build_opportunities(&self) -> Result<Option<BuildRequest>> {
    let now = current_timestamp_nanos();
    // Get wants that haven't exceeded TTL
    let active_wants = self.get_non_expired_wants(now).await?;
    // Filter to wants where external dependencies are satisfied
    let buildable_partitions: Vec<String> = active_wants.into_iter()
        .filter(|want| self.external_dependencies_satisfied(want))
        .map(|want| want.partition_ref)
        .collect();
    if buildable_partitions.is_empty() {
        return Ok(None);
    }
    // Create atomic build request for all currently buildable partitions
    Ok(Some(BuildRequest {
        requested_partitions: buildable_partitions,
        reason: "satisfying_active_wants".to_string(),
    }))
}
```
### 3. Build Triggers
Builds are triggered on:
- **New want registration**: Check if newly wanted partitions are immediately buildable
- **External partition availability**: Check if any blocked wants are now unblocked
- **Manual trigger**: Force re-evaluation (for debugging)
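A minimal sketch of how these three triggers might funnel into one evaluation entry point, assuming a service-side `BuildEvaluator` with hypothetical `append_want`/`submit_build_request` helpers (only `evaluate_build_opportunities` comes from the section above):
```rust
impl BuildEvaluator {
    // New want registration: record the want, then re-evaluate immediately
    async fn on_want_registered(&self, want: PartitionWantEvent) -> Result<()> {
        self.bel.append_want(want).await?; // hypothetical BEL write helper
        self.trigger_build_evaluation().await
    }

    // Shared entry point also used by cross-graph sync and manual triggers
    async fn trigger_build_evaluation(&self) -> Result<()> {
        if let Some(request) = self.evaluate_build_opportunities().await? {
            self.submit_build_request(request).await?; // hypothetical submission path
        }
        Ok(())
    }
}
```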
## Cross-Graph Coordination
### GraphService API
Graphs expose events for cross-graph coordination:
```rust
trait GraphService {
    async fn list_events(&self, since_idx: i64, filter: EventFilter) -> Result<EventPage>;
}
```
Where `EventFilter` supports partition patterns for efficient subscriptions:
```protobuf
message EventFilter {
  repeated string partition_refs = 1;      // Exact partition matches
  repeated string partition_patterns = 2;  // Glob patterns like "data/users/*"
  repeated string job_labels = 3;          // Job-specific events
  repeated string task_ids = 4;            // Task run events
  repeated string build_request_ids = 5;   // Build-specific events
}
```
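For illustration, a small sketch of how a storage layer might evaluate `partition_patterns` against a partition ref. It supports only the simple `*` globs shown above; this matcher is an assumption, not the shipped implementation:
```rust
/// Match a single '*'-glob against a partition ref (no '?' or character classes).
fn matches_pattern(pattern: &str, partition_ref: &str) -> bool {
    let parts: Vec<&str> = pattern.split('*').collect();
    let mut rest = partition_ref;
    for (i, part) in parts.iter().enumerate() {
        if part.is_empty() {
            continue;
        }
        match rest.find(part) {
            // The first literal segment must anchor at the start of the ref
            Some(pos) if i > 0 || pos == 0 => rest = &rest[pos + part.len()..],
            _ => return false,
        }
    }
    // Without a trailing '*', the last literal segment must anchor at the end
    pattern.ends_with('*') || parts.last().map_or(true, |p| partition_ref.ends_with(p))
}

fn main() {
    assert!(matches_pattern("data/users/*", "data/users/2024-01-01"));
    assert!(!matches_pattern("data/users/*", "ml/models/prod/v1"));
}
```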
### Upstream Dependencies
Downstream graphs subscribe to upstream events:
```rust
struct UpstreamDependency {
    service_url: String,             // "https://upstream-databuild.corp.com"
    partition_patterns: Vec<String>, // ["data/users/*", "ml/models/prod/*"]
    last_sync_idx: i64,
}
// Periodic sync of upstream events
async fn sync_upstream_events(upstream: &mut UpstreamDependency) -> Result<()> {
    let client = GraphServiceClient::new(&upstream.service_url);
    let filter = EventFilter {
        partition_patterns: upstream.partition_patterns.clone(),
        ..Default::default()
    };
    let events = client.list_events(upstream.last_sync_idx, filter).await?;
    // Process partition availability events
    for event in events.events {
        if let EventType::PartitionEvent(pe) = event.event_type {
            if pe.status_code == PartitionStatus::PartitionAvailable {
                // Trigger local build evaluation
                trigger_build_evaluation().await?;
            }
        }
    }
    upstream.last_sync_idx = events.next_idx;
    Ok(())
}
```
## SLA Monitoring and TTL Management
### SLA Violations
External monitoring systems query for SLA violations:
```sql
-- Find SLA violations (for alerting)
SELECT * FROM partition_want_events w
WHERE w.sla_seconds IS NOT NULL
  AND (w.data_timestamp + (w.sla_seconds * 1000000000)) < ? -- now
  AND NOT EXISTS (
    SELECT 1 FROM partition_events p
    WHERE p.partition_ref = w.partition_ref
      AND p.status_code = ? -- PartitionAvailable
  )
```
### TTL Expiration
Wants with expired TTLs are excluded from build evaluation:
```sql
-- Get active (non-expired) wants
SELECT * FROM partition_want_events w
WHERE (w.ttl_seconds IS NULL OR w.created_at + (w.ttl_seconds * 1000000000) > ?) -- now
  AND NOT EXISTS (
    SELECT 1 FROM partition_events p
    WHERE p.partition_ref = w.partition_ref
      AND p.status_code = ? -- PartitionAvailable
  )
```
## Example Scenarios
### Scenario 1: Daily Analytics Pipeline
```
1. 6:00 AM: Daily trigger creates want for analytics/daily/2024-01-01
   - SLA: 9:00 AM (9 hours after data_timestamp of midnight)
   - TTL: 1 year (keep trying for historical data)
   - External deps: ["data/users/2024-01-01"]
2. 6:01 AM: Build evaluation runs, data/users/2024-01-01 missing
   - No build request generated
3. 8:30 AM: Upstream publishes data/users/2024-01-01
   - Cross-graph sync detects availability
   - Build evaluation triggered
   - BuildRequest[analytics/daily/2024-01-01] succeeds
4. Result: Analytics available at 8:45 AM, within SLA
```
### Scenario 2: Late Data with SLA Miss
```
1. 6:00 AM: Want created for analytics/daily/2024-01-01 (SLA: 9:00 AM)
2. 9:30 AM: SLA monitoring detects violation, sends alert
3. 11:00 AM: Upstream data finally arrives
4. 11:01 AM: Build evaluation triggers, analytics built
5. Result: Late delivery logged, but data still processed
```
### Scenario 3: Manual CLI Build
```
1. User: databuild build data/transform/urgent
2. Want created with short TTL (30 min) and SLA (5 min)
3. Build evaluation: dependencies available, immediate build
4. Result: Fast feedback for interactive use
```
## Benefits
### Unified Build Model
- All builds (manual, scheduled, triggered) use same want mechanism
- Complete audit trail in build event log
- Consistent SLA tracking across all build types
### Event-Driven Efficiency
- Builds only triggered when dependencies change
- Cross-graph coordination via efficient event streaming
- No polling for task readiness within builds
### Atomic Build Semantics
- Individual build requests remain all-or-nothing
- Fast failure provides immediate feedback
- Partial progress via multiple build requests over time
### Flexible SLA Management
- Separate business expectations (SLA) from operational limits (TTL)
- External monitoring with clear blame assignment
- Automatic cleanup of stale wants
### Cross-Graph Scalability
- Reliable HTTP-based coordination (no message loss)
- Efficient filtering via partition patterns
- Decentralized architecture with clear boundaries
## Implementation Notes
### Build Event Log Integration
- Wants are stored as events in the BEL for consistency
- Same query interfaces used for wants and build coordination
- Event-driven architecture throughout
### Service Integration
- GraphService API exposed via HTTP for cross-graph coordination
- Dashboard integration for manual want creation
- External SLA monitoring via BEL queries
### CLI Integration
- CLI commands create manual wants with appropriate TTLs
- Immediate build evaluation for interactive feedback
- Standard build request execution path


@@ -178,12 +178,6 @@ py_binary(
    ],
)
-# Legacy test job (kept for compatibility)
-databuild_job(
-    name = "test_job",
-    binary = ":test_job_binary",
-)
# Test target
py_binary(
    name = "test_jobs",

plans/18-bel-refactor.md (new file, 304 lines)

@@ -0,0 +1,304 @@
# BEL Refactoring to 3-Tier Architecture
## Overview
This plan restructures DataBuild's Build Event Log (BEL) access layer from the current monolithic trait to a clean 3-tier architecture as described in [design/build-event-log.md](../design/build-event-log.md). This refactoring creates clear separation of concerns and simplifies the codebase by removing complex storage backends.
## Current State Analysis
The current BEL implementation (`databuild/event_log/mod.rs`) has a single `BuildEventLog` trait that mixes:
- Low-level storage operations (`append_event`, `get_events_in_range`)
- High-level aggregation queries (`list_build_requests`, `get_activity_summary`)
- Application-specific logic (`get_latest_partition_status`, `get_active_builds_for_partition`)
This creates several problems:
- Storage backends must implement complex aggregation logic
- No clear separation between storage and business logic
- Difficult to extend with new query patterns
- Delta Lake implementation adds unnecessary complexity
## Target Architecture
### 1. Storage Layer: `BELStorage` Trait
Minimal append-only interface optimized for sequential scanning:
```rust
#[async_trait]
pub trait BELStorage: Send + Sync {
    /// Append a single event, returns the sequential index
    async fn append_event(&self, event: BuildEvent) -> Result<i64>;

    /// List events with filtering, starting from a given index
    async fn list_events(&self, since_idx: i64, filter: EventFilter) -> Result<EventPage>;

    /// Initialize storage backend (create tables, etc.)
    async fn initialize(&self) -> Result<()>;
}

#[derive(Debug, Clone)]
pub struct EventPage {
    pub events: Vec<BuildEvent>,
    pub next_idx: i64,
    pub has_more: bool,
}
```
### 2. Query Engine Layer: `BELQueryEngine`
App-layer aggregation that scans storage events:
```rust
pub struct BELQueryEngine {
    storage: Arc<dyn BELStorage>,
}

impl BELQueryEngine {
    pub fn new(storage: Arc<dyn BELStorage>) -> Self {
        Self { storage }
    }

    /// Get latest status for a partition by scanning recent events
    pub async fn get_latest_partition_status(&self, partition_ref: &str) -> Result<Option<PartitionStatus>>;

    /// Get all build requests that are currently building a partition
    pub async fn get_active_builds_for_partition(&self, partition_ref: &str) -> Result<Vec<String>>;

    /// Get summary of a build request by aggregating its events
    pub async fn get_build_request_summary(&self, build_id: &str) -> Result<BuildRequestSummary>;

    /// List build requests with pagination and filtering
    pub async fn list_build_requests(&self, request: BuildsListRequest) -> Result<BuildsListResponse>;

    /// Get activity summary for dashboard
    pub async fn get_activity_summary(&self) -> Result<ActivityResponse>;
}
```
### 3. Client Layer: Repository Pattern
Clean interfaces for CLI, Service, and Dashboard (unchanged from current):
```rust
// Existing repositories continue to work, but now use BELQueryEngine
pub struct PartitionsRepository {
    query_engine: Arc<BELQueryEngine>,
}
pub struct BuildsRepository {
    query_engine: Arc<BELQueryEngine>,
}
```
## Implementation Plan
### Phase 1: Create Storage Layer Interface
1. **Define New Storage Trait**
```rust
// In databuild/event_log/storage.rs
pub trait BELStorage { /* as defined above */ }
pub async fn create_bel_storage(uri: &str) -> Result<Arc<dyn BELStorage>>;
```
2. **Add EventFilter to Protobuf**
```protobuf
// In databuild/databuild.proto
message EventFilter {
  repeated string partition_refs = 1;
  repeated string partition_patterns = 2;
  repeated string job_labels = 3;
  repeated string task_ids = 4;
  repeated string build_request_ids = 5;
}
message EventPage {
  repeated BuildEvent events = 1;
  int64 next_idx = 2;
  bool has_more = 3;
}
```
3. **Implement SQLite Storage Backend**
```rust
// In databuild/event_log/sqlite_storage.rs
pub struct SqliteBELStorage {
    pool: sqlx::SqlitePool,
}

#[async_trait]
impl BELStorage for SqliteBELStorage {
    async fn append_event(&self, event: BuildEvent) -> Result<i64> {
        // Simple INSERT returning rowid
        let serialized = serde_json::to_string(&event)?;
        let row_id = sqlx::query("INSERT INTO build_events (event_data) VALUES (?)")
            .bind(serialized)
            .execute(&self.pool)
            .await?
            .last_insert_rowid();
        Ok(row_id)
    }

    async fn list_events(&self, since_idx: i64, filter: EventFilter) -> Result<EventPage> {
        // Efficient sequential scan with filtering
        // Build WHERE clause based on filter criteria (see sketch below)
        // Return paginated results
    }
}
```
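A sketch of the WHERE-clause construction mentioned in the comments above, assuming events are stored as JSON text in an `event_data` column (as in `append_event`) and that SQLite's `json_extract` and `GLOB` are available; the JSON field paths are illustrative assumptions:
```rust
// Illustrative only: field paths inside the JSON payload are assumptions.
fn build_filter_sql(filter: &EventFilter) -> (String, Vec<String>) {
    let mut clauses = vec!["rowid > ?".to_string()]; // caller binds since_idx here
    let mut params: Vec<String> = Vec::new();
    if !filter.partition_refs.is_empty() {
        let placeholders = vec!["?"; filter.partition_refs.len()].join(", ");
        clauses.push(format!(
            "json_extract(event_data, '$.partition_ref') IN ({placeholders})"
        ));
        params.extend(filter.partition_refs.iter().cloned());
    }
    if !filter.partition_patterns.is_empty() {
        // Multiple patterns are alternatives, so OR them together
        let globs = vec![
            "json_extract(event_data, '$.partition_ref') GLOB ?";
            filter.partition_patterns.len()
        ]
        .join(" OR ");
        clauses.push(format!("({globs})"));
        params.extend(filter.partition_patterns.iter().cloned());
    }
    (
        format!(
            "SELECT rowid, event_data FROM build_events WHERE {}",
            clauses.join(" AND ")
        ),
        params,
    )
}
```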
### Phase 2: Create Query Engine Layer
1. **Implement BELQueryEngine**
```rust
// In databuild/event_log/query_engine.rs
impl BELQueryEngine {
    pub async fn get_latest_partition_status(&self, partition_ref: &str) -> Result<Option<PartitionStatus>> {
        // Scan recent partition events to determine current status
        let filter = EventFilter {
            partition_refs: vec![partition_ref.to_string()],
            ..Default::default()
        };
        let events = self.storage.list_events(0, filter).await?;
        self.aggregate_partition_status(&events.events).await
    }

    async fn aggregate_partition_status(&self, events: &[BuildEvent]) -> Result<Option<PartitionStatus>> {
        // Walk through events chronologically to determine final partition status
        // Return the most recent status (see sketch after this list)
    }
}
```
2. **Implement All Current Query Methods**
- Port all methods from current `BuildEventLog` trait
- Use event scanning and aggregation instead of complex SQL queries
- Keep same return types for compatibility
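One possible shape for the `aggregate_partition_status` step left as comments above: take the newest partition event as authoritative. The match mirrors the `EventType::PartitionEvent` shape used in the service examples, but the exact field layout of the generated `BuildEvent` is an assumption:
```rust
impl BELQueryEngine {
    async fn aggregate_partition_status(&self, events: &[BuildEvent]) -> Result<Option<PartitionStatus>> {
        // list_events returns events in index order, so the last partition
        // event reflects the most recent state transition.
        Ok(events
            .iter()
            .filter_map(|e| match &e.event_type {
                EventType::PartitionEvent(pe) => Some(pe), // assumed oneof shape
                _ => None,
            })
            .last()
            .map(|pe| pe.status_code)) // assumed enum-typed field, as used elsewhere
    }
}
```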
### Phase 3: Migrate Existing Code
1. **Update Repository Constructors**
```rust
// Old: PartitionsRepository::new(Arc<dyn BuildEventLog>)
// New: PartitionsRepository::new(Arc<BELQueryEngine>)
impl PartitionsRepository {
    pub fn new(query_engine: Arc<BELQueryEngine>) -> Self {
        Self { query_engine }
    }
}
// Builds listing delegates straight to the query engine
impl BuildsRepository {
    pub async fn list_protobuf(&self, request: BuildsListRequest) -> Result<BuildsListResponse> {
        self.query_engine.list_build_requests(request).await
    }
}
```
2. **Update CLI and Service Initialization**
```rust
// In CLI main.rs and service mod.rs
let storage = create_bel_storage(&event_log_uri).await?;
let query_engine = Arc::new(BELQueryEngine::new(storage));
let partitions_repo = PartitionsRepository::new(query_engine.clone());
let builds_repo = BuildsRepository::new(query_engine.clone());
```
### Phase 4: Remove Legacy Components
1. **Remove Delta Lake Implementation**
```rust
// Delete databuild/event_log/delta.rs
// Remove delta dependencies from MODULE.bazel
// Remove delta:// support from create_build_event_log()
```
2. **Deprecate Old BuildEventLog Trait**
```rust
// Mark as deprecated, keep for backwards compatibility during transition
#[deprecated(note = "Use BELQueryEngine and BELStorage instead")]
pub trait BuildEventLog { /* existing implementation */ }
```
3. **Update Factory Function**
```rust
// In databuild/event_log/mod.rs
pub async fn create_build_event_log(uri: &str) -> Result<Arc<BELQueryEngine>> {
    let storage = if uri == "stdout" {
        Arc::new(stdout::StdoutBELStorage::new()) as Arc<dyn BELStorage>
    } else if uri.starts_with("sqlite://") {
        let path = &uri[9..];
        let storage = sqlite_storage::SqliteBELStorage::new(path).await?;
        storage.initialize().await?;
        Arc::new(storage) as Arc<dyn BELStorage>
    } else if uri.starts_with("postgres://") {
        let storage = postgres_storage::PostgresBELStorage::new(uri).await?;
        storage.initialize().await?;
        Arc::new(storage) as Arc<dyn BELStorage>
    } else {
        return Err(BuildEventLogError::ConnectionError(
            format!("Unsupported build event log URI: {}", uri)
        ));
    };
    Ok(Arc::new(BELQueryEngine::new(storage)))
}
```
### Phase 5: Final Cleanup
1. **Remove Legacy Implementations**
- Delete complex aggregation logic from existing storage backends
- Simplify remaining backends to implement only new `BELStorage` trait
- Remove deprecated `BuildEventLog` trait
2. **Update Documentation**
- Update design docs to reflect new architecture
- Create migration guide for external users
- Update code examples and README
## Benefits of 3-Tier Architecture
### ✅ **Simplified Codebase**
- Removes complex Delta Lake dependencies
- Storage backends focus only on append + scan operations
- Clear separation between storage and business logic
### ✅ **Better Maintainability**
- Single SQLite implementation for most use cases
- Query logic centralized in one place
- Easier to debug and test each layer independently
### ✅ **Future-Ready Foundation**
- Clean foundation for wants system (next phase)
- Easy to add new storage backends when needed
- Query engine ready for cross-graph coordination APIs
### ✅ **Performance Benefits**
- Eliminates complex SQL joins in storage layer
- Enables sequential scanning optimizations
- Cleaner separation allows targeted optimizations
## Success Criteria
### Phase 1-2: Foundation
- [ ] Storage layer trait compiles and tests pass
- [ ] SQLite storage backend supports append + list operations
- [ ] Query engine provides same functionality as current BEL trait
- [ ] EventFilter protobuf types generate correctly
### Phase 3-4: Migration
- [ ] All repositories work with new query engine
- [ ] CLI and service use new architecture
- [ ] Existing functionality unchanged from user perspective
- [ ] Delta Lake implementation removed
### Phase 5: Completion
- [ ] Legacy BEL trait removed
- [ ] Performance meets or exceeds current implementation
- [ ] Documentation updated for new architecture
- [ ] Codebase simplified and maintainable
## Risk Mitigation
1. **Gradual Migration**: Implement new architecture alongside existing code
2. **Feature Parity**: Ensure all existing functionality works before removing old code
3. **Performance Testing**: Benchmark new implementation against current performance
4. **Simple First**: Start with SQLite-only implementation, add complexity later as needed


@@ -0,0 +1,183 @@
# Client-Server CLI Architecture
## Overview
This plan transforms DataBuild's CLI from a monolithic in-process execution model to a Bazel-style client-server architecture. The CLI becomes a thin client that delegates all operations to a persistent service process, enabling better resource management and build coordination.
## Current State Analysis
The current CLI (`databuild/cli/main.rs`) directly:
- Creates event log connections
- Runs analysis and execution in-process
- Spawns bazel processes directly
- No coordination between concurrent CLI invocations
This creates several limitations:
- No coordination between concurrent builds
- Multiple BEL connections from concurrent CLI calls
- Each CLI process spawns separate bazel execution
- No shared execution environment for builds
## Target Architecture
### Bazel-Style Client-Server Model
**CLI (Thin Client)**:
- Auto-starts service if not running
- Delegates all operations to service via HTTP
- Streams progress back to user
- Auto-shuts down idle service
**Service (Persistent Process)**:
- Maintains single BEL connection
- Coordinates builds across multiple CLI calls
- Manages bazel execution processes
- Auto-shuts down after idle timeout
## Implementation Plan
### Phase 1: Service Foundation
1. **Extend Current Service for CLI Operations**
- Add new endpoints to handle CLI build requests
- Move analysis and execution logic from CLI to service
- Service maintains orchestrator state and coordinates builds
- Add real-time progress streaming for CLI consumption
2. **Add CLI-Specific API Endpoints**
- `/api/v1/cli/build` - Handle build requests from CLI
- `/api/v1/cli/builds/{id}/progress` - Stream build progress via Server-Sent Events
- Request/response types for CLI build operations
- Background vs foreground build support
3. **Add Service Auto-Management**
- Service tracks last activity timestamp
- Configurable auto-shutdown timeout (default: 5 minutes)
- Service monitors for idle state and gracefully shuts down
- Activity tracking includes API calls and active builds
4. **Service Port Management**
- Service attempts to bind to preferred port (e.g., 8080)
- If port unavailable, tries next available port in range
- Service writes actual port to lockfile/pidfile for CLI discovery
- CLI reads port from lockfile to connect to running service
- Cleanup lockfile on service shutdown
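A minimal sketch of the lockfile handshake described above; the `.databuild/service.lock` path and the single-line port format are assumptions for illustration:
```rust
use std::{fs, io, path::PathBuf};

fn lockfile_path() -> PathBuf {
    PathBuf::from(".databuild/service.lock") // assumed per-workspace location
}

// Service side: record the port actually bound
fn write_lockfile(port: u16) -> io::Result<()> {
    fs::create_dir_all(".databuild")?;
    fs::write(lockfile_path(), port.to_string())
}

// CLI side: discover a running service, if any
fn discover_service_port() -> Option<u16> {
    fs::read_to_string(lockfile_path()).ok()?.trim().parse().ok()
}

// Service shutdown: clean up so stale ports don't mislead the CLI
fn remove_lockfile() -> io::Result<()> {
    fs::remove_file(lockfile_path())
}
```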
### Phase 2: Thin CLI Implementation
1. **New CLI Main Function**
- Replace existing main with service delegation logic
- Parse arguments and determine target service operation
- Handle service connection and auto-start logic
- Preserve existing CLI interface and help text
2. **Service Client Implementation**
- HTTP client for communicating with service
- Auto-start service if not already running
- Health check and connection retry logic
- Progress streaming for real-time build feedback
3. **Build Command via Service**
- Parse build arguments and create service request
- Submit build request to service endpoint
- Stream progress updates for foreground builds
- Return immediately for background builds with build ID
### Phase 3: Repository Commands via Service
1. **Delegate Repository Commands to Service**
- Partition, build, job, and task commands go through service
- Use existing service API endpoints where available
- Maintain same output formats (table, JSON) as current CLI
- Preserve all existing functionality and options
2. **Service Client Repository Methods**
- Client methods for each repository operation
- Handle pagination, filtering, and formatting options
- Error handling and appropriate HTTP status code handling
- URL encoding for partition references and other parameters
### Phase 4: Complete Migration
1. **Remove Old CLI Implementation**
- Delete existing `databuild/cli/main.rs` implementation
- Remove in-process analysis and execution logic
- Clean up CLI-specific dependencies that are no longer needed
- Update build configuration to use new thin client only
2. **Service Integration Testing**
- End-to-end testing of CLI-to-service communication
- Verify all existing CLI functionality works through service
- Performance testing to ensure no regression
- Error handling validation for various failure modes
### Phase 5: Integration and Testing
1. **Environment Variable Support**
- `DATABUILD_SERVICE_URL` for custom service locations
- `DATABUILD_SERVICE_TIMEOUT` for auto-shutdown configuration
- Existing BEL environment variables passed to service
- Clear precedence rules for configuration sources
2. **Error Handling and User Experience**
- Service startup timeout and clear error messages
- Connection failure handling with fallback suggestions
- Health check logic to verify service readiness
- Graceful handling of service unavailability
## Benefits of Client-Server Architecture
### ✅ **Build Coordination**
- Multiple CLI calls share same service instance
- Coordination between concurrent builds
- Single BEL connection eliminates connection conflicts
### ✅ **Resource Management**
- Auto-shutdown prevents resource leaks
- Service manages persistent connections
- Better isolation between CLI and build execution
- Shared bazel execution environment
### ✅ **Improved User Experience**
- Background builds with `--background` flag
- Real-time progress streaming
- Consistent build execution environment
### ✅ **Simplified Architecture**
- Single execution path through service
- Cleaner separation of concerns
- Reduced code duplication
### ✅ **Future-Ready Foundation**
- Service architecture prepared for additional coordination features
- HTTP API foundation for programmatic access
- Clear separation of concerns between client and execution
## Success Criteria
### Phase 1-2: Service Foundation
- [ ] Service can handle CLI build requests
- [ ] Service auto-shutdown works correctly
- [ ] Service port management and discovery works
- [ ] New CLI can start and connect to service
- [ ] Build requests execute with same functionality as current CLI
### Phase 3-4: Complete Migration
- [ ] All CLI commands work via service delegation
- [ ] Repository commands (partitions, builds, etc.) work via HTTP API
- [ ] Old CLI implementation completely removed
- [ ] Error handling provides clear user feedback
### Phase 5: Polish
- [ ] Multiple concurrent CLI calls work correctly
- [ ] Background builds work as expected
- [ ] Performance meets or exceeds current CLI
- [ ] Service management is reliable and transparent
## Risk Mitigation
1. **Thorough Testing**: Comprehensive testing before removing old CLI
2. **Feature Parity**: Ensure all existing functionality works via service
3. **Performance Validation**: Benchmark new implementation against current performance
4. **Simple Protocol**: Use HTTP/JSON for service communication (not gRPC initially)
5. **Clear Error Messages**: Service startup and connection failures should be obvious to users

plans/20-wants-initial.md (new file, 163 lines)

@@ -0,0 +1,163 @@
# Wants System Implementation
## Overview
This plan implements the wants system described in [design/wants.md](../design/wants.md), transitioning DataBuild from direct build requests to a declarative want-based model with cross-graph coordination and SLA tracking. This builds on the 3-tier BEL architecture and client-server CLI established in the previous phases.
## Prerequisites
This plan assumes completion of:
- **Phase 18**: 3-tier BEL architecture with storage/query/client layers
- **Phase 19**: Client-server CLI architecture with service delegation
## Implementation Phases
### Phase 1: Extend BEL Storage for Wants
1. **Add PartitionWantEvent to databuild.proto**
- Want event schema as defined in design/wants.md
- Want source tracking (CLI, dashboard, scheduled, API)
- TTL and SLA timestamp fields
- External dependency specifications
2. **Extend BELStorage Interface**
- Add `append_want()` method for want events
- Extend `EventFilter` to support want filtering
- Add want-specific query capabilities to storage layer
3. **Implement in SQLite Storage Backend**
- Add wants table with appropriate indexes
- Implement want filtering in list_events()
- Schema migration logic for existing databases
### Phase 2: Basic Want API in Service
1. **Implement Want Management in Service**
- Service methods for creating and querying wants
- Want lifecycle management (creation, expiration, satisfaction)
- Integration with existing service auto-management
2. **Add Want HTTP Endpoints**
- `POST /api/v1/wants` - Create new want
- `GET /api/v1/wants` - List active wants with filtering
- `GET /api/v1/wants/{id}` - Get want details
- `DELETE /api/v1/wants/{id}` - Cancel want
3. **CLI Want Commands**
- `./bazel-bin/my_graph.build want create <partition-ref>` with SLA/TTL options
- `./bazel-bin/my_graph.build want list` with filtering options
- `./bazel-bin/my_graph.build want status <partition-ref>` for want status
- Modify build commands to create wants via service
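For illustration, a hypothetical client-side call against the `POST /api/v1/wants` endpoint above (a reqwest + serde_json + anyhow sketch; the request and response field names mirror `PartitionWantEvent` but are not a confirmed wire format):
```rust
use serde_json::{json, Value};

async fn create_want(base_url: &str, partition_ref: &str) -> anyhow::Result<String> {
    let body = json!({
        "partition_ref": partition_ref,
        "ttl_seconds": 30 * 60, // short-lived manual want, as in the CLI scenario
        "sla_seconds": 5 * 60,
    });
    let resp = reqwest::Client::new()
        .post(format!("{base_url}/api/v1/wants"))
        .json(&body)
        .send()
        .await?
        .error_for_status()?;
    // Assumption: the service responds with the generated want_id
    let created: Value = resp.json().await?;
    Ok(created["want_id"].as_str().unwrap_or_default().to_owned())
}
```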
### Phase 3: Want-Driven Build Evaluation
1. **Implement Build Evaluator in Service**
- Continuous evaluation loop that checks for buildable wants
- External dependency satisfaction checking
- TTL expiration filtering for active wants
2. **Replace Build Request Handling**
- Graph build commands create wants instead of direct build requests
- Service background loop evaluates wants and triggers builds
- Maintain atomic build semantics while satisfying multiple wants
3. **Build Coordination Logic**
- Aggregate wants that can be satisfied by same build
- Priority handling for urgent wants (short SLA)
- Resource coordination across concurrent want evaluation
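The continuous evaluation loop could be as simple as the following sketch (tokio interval; `BuildEvaluator` and `submit_build_request` are the hypothetical service components named earlier, while `evaluate_build_opportunities` comes from design/wants.md):
```rust
use std::{sync::Arc, time::Duration};

async fn run_build_evaluator(evaluator: Arc<BuildEvaluator>) {
    let mut tick = tokio::time::interval(Duration::from_secs(10)); // assumed cadence
    loop {
        tick.tick().await;
        match evaluator.evaluate_build_opportunities().await {
            Ok(Some(request)) => {
                if let Err(e) = evaluator.submit_build_request(request).await {
                    eprintln!("build submission failed: {e}");
                }
            }
            Ok(None) => {} // nothing buildable right now; wait for the next tick
            Err(e) => eprintln!("want evaluation failed: {e}"),
        }
    }
}
```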
### Phase 4: Cross-Graph Coordination
1. **Implement GraphService API**
- HTTP API for cross-graph event streaming as defined in design/wants.md
- Event filtering for efficient partition pattern subscriptions
- Service-to-service communication for upstream dependencies
2. **Upstream Dependency Configuration**
- Service configuration for upstream DataBuild instances
- Partition pattern subscriptions to upstream graphs
- Automatic want evaluation when upstream partitions become available
3. **Cross-Graph Event Sync**
- Background sync process for upstream events
- Triggering local build evaluation on upstream availability
- Reliable HTTP-based coordination to avoid message loss
### Phase 5: SLA Monitoring and Dashboard Integration
1. **SLA Violation Tracking**
- External monitoring endpoints for SLA violations
- Want timeline and status tracking
- Integration with existing dashboard for want visualization
2. **Want Dashboard Features**
- Want creation and monitoring UI
- Cross-graph dependency visualization
- SLA violation dashboard and alerting
3. **Migration from Direct Builds**
- All build requests go through want system
- Remove direct build request pathways
- Update documentation for new build model
## Benefits of Want-Based Architecture
### ✅ **Unified Build Model**
- All builds (manual, scheduled, triggered) use same want mechanism
- Complete audit trail in build event log
- Consistent SLA tracking across all build types
### ✅ **Event-Driven Efficiency**
- Builds only triggered when dependencies change
- Cross-graph coordination via efficient event streaming
- No polling for task readiness within builds
### ✅ **Atomic Build Semantics Preserved**
- Individual build requests remain all-or-nothing
- Fast failure provides immediate feedback
- Partial progress via multiple build requests over time
### ✅ **Flexible SLA Management**
- Separate business expectations (SLA) from operational limits (TTL)
- External monitoring with clear blame assignment
- Automatic cleanup of stale wants
### ✅ **Cross-Graph Scalability**
- Reliable HTTP-based coordination
- Efficient filtering via partition patterns
- Decentralized architecture with clear boundaries
## Success Criteria
### Phase 1: Storage Foundation
- [ ] Want events can be stored and queried in BEL storage
- [ ] EventFilter supports want-specific filtering
- [ ] SQLite backend handles want operations efficiently
### Phase 2: Basic Want API
- [ ] Service can create and query wants via HTTP API
- [ ] Graph build commands work for want management
- [ ] Build commands create wants instead of direct builds
### Phase 3: Want-Driven Builds
- [ ] Service background loop evaluates wants continuously
- [ ] Build evaluation triggers on want creation and external events
- [ ] TTL expiration and external dependency checking work correctly
### Phase 4: Cross-Graph Coordination
- [ ] GraphService API returns filtered events for cross-graph coordination
- [ ] Upstream partition availability triggers downstream want evaluation
- [ ] Service-to-service communication is reliable and efficient
### Phase 5: Complete Migration
- [ ] All builds go through want system
- [ ] Dashboard supports want creation and monitoring
- [ ] SLA violation endpoints provide monitoring integration
- [ ] Documentation reflects new want-based build model
## Risk Mitigation
1. **Incremental Migration**: Implement wants alongside existing build system initially
2. **Performance Validation**: Ensure want evaluation doesn't introduce significant latency
3. **Backwards Compatibility**: Maintain existing build semantics during transition
4. **Monitoring Integration**: Provide clear observability into want lifecycle and performance