# BEL Refactoring to 3-Tier Architecture
## Overview
This plan restructures DataBuild's Build Event Log (BEL) access layer from the current monolithic trait into a 3-tier architecture, as described in [design/build-event-log.md](../design/build-event-log.md). The refactoring separates storage from query logic and simplifies the codebase by removing the Delta Lake storage backend.
## Current State Analysis
The current BEL implementation (`databuild/event_log/mod.rs`) has a single `BuildEventLog` trait that mixes:
- Low-level storage operations (`append_event`, `get_events_in_range`)
- High-level aggregation queries (`list_build_requests`, `get_activity_summary`)
- Application-specific logic (`get_latest_partition_status`, `get_active_builds_for_partition`)
This creates several problems:
- Storage backends must implement complex aggregation logic
- No clear separation between storage and business logic
- Difficult to extend with new query patterns
- Delta Lake implementation adds unnecessary complexity
## Target Architecture
### 1. Storage Layer: `BELStorage` Trait
Minimal append-only interface optimized for sequential scanning:
```rust
#[async_trait]
pub trait BELStorage: Send + Sync {
    /// Append a single event, returns the sequential index
    async fn append_event(&self, event: BuildEvent) -> Result<i64>;

    /// List events with filtering, starting from a given index
    async fn list_events(&self, since_idx: i64, filter: EventFilter) -> Result<EventPage>;

    /// Initialize storage backend (create tables, etc.)
    async fn initialize(&self) -> Result<()>;
}

#[derive(Debug, Clone)]
pub struct EventPage {
    pub events: Vec<BuildEvent>,
    pub next_idx: i64,
    pub has_more: bool,
}
```
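To make the append and pagination contract concrete, here is a minimal in-memory sketch. It is synchronous and std-only so it can run on its own, and it omits filtering. `MemoryBELStorage`, its `page_size` field, and the one-field `BuildEvent` are hypothetical stand-ins, not part of the plan. The sketch also assumes that `since_idx` is inclusive and that `next_idx` is the value to pass on the following call; the plan does not pin down either detail.

```rust
use std::sync::Mutex;

// Hypothetical stand-in for the protobuf-generated BuildEvent.
#[derive(Debug, Clone, PartialEq)]
pub struct BuildEvent {
    pub payload: String,
}

#[derive(Debug, Clone)]
pub struct EventPage {
    pub events: Vec<BuildEvent>,
    pub next_idx: i64,
    pub has_more: bool,
}

/// In-memory log (illustrative only); an event's index is its position
/// in the vector, so indices are 0-based here.
pub struct MemoryBELStorage {
    events: Mutex<Vec<BuildEvent>>,
    page_size: usize,
}

impl MemoryBELStorage {
    pub fn new(page_size: usize) -> Self {
        // A page size of zero would never make progress, so clamp to 1.
        Self { events: Mutex::new(Vec::new()), page_size: page_size.max(1) }
    }

    /// Append one event and return its sequential index.
    pub fn append_event(&self, event: BuildEvent) -> i64 {
        let mut events = self.events.lock().unwrap();
        events.push(event);
        (events.len() - 1) as i64
    }

    /// Return up to `page_size` events with index >= `since_idx`
    /// (assumption: `since_idx` is inclusive).
    pub fn list_events(&self, since_idx: i64) -> EventPage {
        let events = self.events.lock().unwrap();
        let start = (since_idx.max(0) as usize).min(events.len());
        let end = (start + self.page_size).min(events.len());
        EventPage {
            events: events[start..end].to_vec(),
            next_idx: end as i64,
            has_more: end < events.len(),
        }
    }
}
```

A real backend would be async and would apply an `EventFilter`; the point here is only the relationship between the returned index, `next_idx`, and `has_more`.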
### 2. Query Engine Layer: `BELQueryEngine`
App-layer aggregation that scans storage events:
```rust
pub struct BELQueryEngine {
    storage: Arc<dyn BELStorage>,
}

impl BELQueryEngine {
    pub fn new(storage: Arc<dyn BELStorage>) -> Self {
        Self { storage }
    }

    /// Get latest status for a partition by scanning recent events
    pub async fn get_latest_partition_status(&self, partition_ref: &str) -> Result<Option<PartitionStatus>>;

    /// Get all build requests that are currently building a partition
    pub async fn get_active_builds_for_partition(&self, partition_ref: &str) -> Result<Vec<String>>;

    /// Get summary of a build request by aggregating its events
    pub async fn get_build_request_summary(&self, build_id: &str) -> Result<BuildRequestSummary>;

    /// List build requests with pagination and filtering
    pub async fn list_build_requests(&self, request: BuildsListRequest) -> Result<BuildsListResponse>;

    /// Get activity summary for dashboard
    pub async fn get_activity_summary(&self) -> Result<ActivityResponse>;
}
```
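The aggregation behind a method like `get_latest_partition_status` can be sketched as a fold over events in log order. The types below are simplified, hypothetical stand-ins (the real `PartitionStatus` and event shapes come from the protobuf definitions), and the rule "the most recent event for the partition wins" is an assumption about the intended semantics.

```rust
// Hypothetical, simplified stand-ins for the protobuf types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PartitionStatus {
    Requested,
    Building,
    Available,
    Failed,
}

#[derive(Debug, Clone)]
pub struct PartitionEvent {
    pub partition_ref: String,
    pub status: PartitionStatus,
}

/// Events are assumed to be in log (chronological) order, so the last
/// event that mentions the partition carries its current status.
pub fn latest_partition_status(
    events: &[PartitionEvent],
    partition_ref: &str,
) -> Option<PartitionStatus> {
    events
        .iter()
        .rev() // scan from the newest event backwards
        .find(|e| e.partition_ref == partition_ref)
        .map(|e| e.status)
}
```

Returning `None` for a partition with no events mirrors the `Option<PartitionStatus>` in the method signature above.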
### 3. Client Layer: Repository Pattern
Clean interfaces for CLI, Service, and Dashboard (unchanged from current):
```rust
// Existing repositories continue to work, but now use BELQueryEngine
pub struct PartitionsRepository {
    query_engine: Arc<BELQueryEngine>,
}

pub struct BuildsRepository {
    query_engine: Arc<BELQueryEngine>,
}
```
## Implementation Plan
### Phase 1: Create Storage Layer Interface
1. **Define New Storage Trait**
```rust
// In databuild/event_log/storage.rs
pub trait BELStorage { /* as defined above */ }
// Async and Arc-returning, so the result can be awaited and passed to BELQueryEngine::new
pub async fn create_bel_storage(uri: &str) -> Result<Arc<dyn BELStorage>>;
```
2. **Add EventFilter to Protobuf**
```protobuf
// In databuild/databuild.proto
message EventFilter {
  repeated string partition_refs = 1;
  repeated string partition_patterns = 2;
  repeated string job_labels = 3;
  repeated string task_ids = 4;
  repeated string build_request_ids = 5;
}

message EventPage {
  repeated BuildEvent events = 1;
  int64 next_idx = 2;
  bool has_more = 3;
}
```
3. **Implement SQLite Storage Backend**
```rust
// In databuild/event_log/sqlite_storage.rs
pub struct SqliteBELStorage {
    pool: sqlx::SqlitePool,
}

#[async_trait]
impl BELStorage for SqliteBELStorage {
    async fn append_event(&self, event: BuildEvent) -> Result<i64> {
        // Simple INSERT returning rowid
        let serialized = serde_json::to_string(&event)?;
        let row_id = sqlx::query("INSERT INTO build_events (event_data) VALUES (?)")
            .bind(serialized)
            .execute(&self.pool)
            .await?
            .last_insert_rowid();
        Ok(row_id)
    }

    async fn list_events(&self, since_idx: i64, filter: EventFilter) -> Result<EventPage> {
        // Efficient sequential scan with filtering:
        // - Build WHERE clause based on filter criteria
        // - Return paginated results, setting next_idx and has_more
        todo!("filtered, paginated scan over build_events")
    }

    async fn initialize(&self) -> Result<()> {
        // Create the build_events table if it does not exist
        todo!("create schema")
    }
}
```
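The plan leaves the matching semantics of `EventFilter` open. One plausible reading, sketched below, is that an empty list places no constraint, values within one list are OR-ed, and the lists are AND-ed together. This is an assumption, not something the plan states. `EventKeys` and `filter_matches` are hypothetical names, and `partition_patterns` and `task_ids` are left out for brevity.

```rust
// Subset of the EventFilter fields from the protobuf message above.
#[derive(Debug, Clone, Default)]
pub struct EventFilter {
    pub partition_refs: Vec<String>,
    pub job_labels: Vec<String>,
    pub build_request_ids: Vec<String>,
}

// Hypothetical flattened view of the event fields a filter inspects.
#[derive(Debug, Clone, Default)]
pub struct EventKeys {
    pub partition_ref: Option<String>,
    pub job_label: Option<String>,
    pub build_request_id: Option<String>,
}

/// An empty `wanted` list places no constraint; otherwise the event must
/// carry a value that appears in the list (OR within one list).
fn field_matches(wanted: &[String], actual: Option<&str>) -> bool {
    wanted.is_empty() || actual.map_or(false, |a| wanted.iter().any(|w| w == a))
}

/// Assumed semantics: every non-empty list must match (AND across lists).
pub fn filter_matches(filter: &EventFilter, keys: &EventKeys) -> bool {
    field_matches(&filter.partition_refs, keys.partition_ref.as_deref())
        && field_matches(&filter.job_labels, keys.job_label.as_deref())
        && field_matches(&filter.build_request_ids, keys.build_request_id.as_deref())
}
```

Under this reading, `EventFilter::default()` matches every event, which lets a caller scan the whole log with an empty filter. A SQL backend would express the same rules in its WHERE clause.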
### Phase 2: Create Query Engine Layer
1. **Implement BELQueryEngine**
```rust
// In databuild/event_log/query_engine.rs
impl BELQueryEngine {
    pub async fn get_latest_partition_status(&self, partition_ref: &str) -> Result<Option<PartitionStatus>> {
        // Scan recent partition events to determine current status
        let filter = EventFilter {
            partition_refs: vec![partition_ref.to_string()],
            ..Default::default()
        };
        // Note: this reads a single page; a complete implementation would
        // keep calling list_events with next_idx while has_more is true
        let events = self.storage.list_events(0, filter).await?;
        self.aggregate_partition_status(&events.events)
    }

    // Synchronous: pure aggregation over already-fetched events
    fn aggregate_partition_status(&self, events: &[BuildEvent]) -> Result<Option<PartitionStatus>> {
        // Walk through events chronologically to determine final partition status
        // Return the most recent status
        todo!("fold events into the latest status")
    }
}
```
2. **Implement All Current Query Methods**
- Port all methods from current `BuildEventLog` trait
- Use event scanning and aggregation instead of complex SQL queries
- Keep same return types for compatibility
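Because `list_events` returns one `EventPage` at a time, any ported method that aggregates over the full log has to follow `next_idx` until `has_more` is false. The loop can be sketched as below. `scan_all` is a hypothetical helper, the closure stands in for a call to `BELStorage::list_events` with a fixed filter, and the one-field `BuildEvent` is a placeholder for the protobuf type.

```rust
// Hypothetical stand-in for the protobuf-generated BuildEvent.
#[derive(Debug, Clone, PartialEq)]
pub struct BuildEvent {
    pub payload: String,
}

#[derive(Debug, Clone)]
pub struct EventPage {
    pub events: Vec<BuildEvent>,
    pub next_idx: i64,
    pub has_more: bool,
}

/// Collect every event by fetching pages until `has_more` is false.
/// `fetch_page(since_idx)` stands in for a storage call.
pub fn scan_all<F>(mut fetch_page: F) -> Vec<BuildEvent>
where
    F: FnMut(i64) -> EventPage,
{
    let mut all = Vec::new();
    let mut idx = 0;
    loop {
        let page = fetch_page(idx);
        all.extend(page.events);
        // Stop at the last page, and also if the backend fails to advance
        // the cursor, which would otherwise loop forever.
        if !page.has_more || page.next_idx <= idx {
            break;
        }
        idx = page.next_idx;
    }
    all
}
```

Collecting into a `Vec` keeps the sketch short; an aggregation that only needs a running summary could fold each page as it arrives instead of buffering the whole log.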
### Phase 3: Migrate Existing Code
1. **Update Repository Constructors**
```rust
// Old: PartitionsRepository::new(Arc<dyn BuildEventLog>)
// New: PartitionsRepository::new(Arc<BELQueryEngine>)
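impl PartitionsRepository {
    pub fn new(query_engine: Arc<BELQueryEngine>) -> Self {
        Self { query_engine }
    }
}

impl BuildsRepository {
    pub fn new(query_engine: Arc<BELQueryEngine>) -> Self {
        Self { query_engine }
    }

    // Repository methods delegate to the query engine; request and
    // response types must match the engine method being called
    pub async fn list_protobuf(&self, request: BuildsListRequest) -> Result<BuildsListResponse> {
        self.query_engine.list_build_requests(request).await
    }
}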
```
2. **Update CLI and Service Initialization**
```rust
// In CLI main.rs and service mod.rs
let storage = create_bel_storage(&event_log_uri).await?;
let query_engine = Arc::new(BELQueryEngine::new(storage));
let partitions_repo = PartitionsRepository::new(query_engine.clone());
let builds_repo = BuildsRepository::new(query_engine.clone());
```
### Phase 4: Remove Legacy Components
1. **Remove Delta Lake Implementation**
```rust
// Delete databuild/event_log/delta.rs
// Remove delta dependencies from MODULE.bazel
// Remove delta:// support from create_build_event_log()
```
2. **Deprecate Old BuildEventLog Trait**
```rust
// Mark as deprecated, keep for backwards compatibility during transition
#[deprecated(note = "Use BELQueryEngine and BELStorage instead")]
pub trait BuildEventLog { /* existing implementation */ }
```
3. **Update Factory Function**
```rust
// In databuild/event_log/mod.rs
pub async fn create_build_event_log(uri: &str) -> Result<Arc<BELQueryEngine>> {
    let storage = if uri == "stdout" {
        Arc::new(stdout::StdoutBELStorage::new()) as Arc<dyn BELStorage>
    } else if let Some(path) = uri.strip_prefix("sqlite://") {
        let storage = sqlite_storage::SqliteBELStorage::new(path).await?;
        storage.initialize().await?;
        Arc::new(storage) as Arc<dyn BELStorage>
    } else if uri.starts_with("postgres://") {
        let storage = postgres_storage::PostgresBELStorage::new(uri).await?;
        storage.initialize().await?;
        Arc::new(storage) as Arc<dyn BELStorage>
    } else {
        return Err(BuildEventLogError::ConnectionError(
            format!("Unsupported build event log URI: {}", uri)
        ));
    };
    Ok(Arc::new(BELQueryEngine::new(storage)))
}
```
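The URI dispatch in the factory can be unit-tested without opening a database if the scheme parsing is pulled into a pure function. A minimal sketch follows. `BelBackend` and `parse_bel_uri` are hypothetical names, and the error is a plain `String` rather than `BuildEventLogError` so that the example stays self-contained.

```rust
/// Hypothetical enum naming the backends the factory accepts.
#[derive(Debug, PartialEq, Eq)]
pub enum BelBackend<'a> {
    Stdout,
    /// Path portion after the `sqlite://` prefix.
    Sqlite { path: &'a str },
    /// Full connection URI, passed through unchanged.
    Postgres { uri: &'a str },
}

/// Classify a build event log URI without touching any storage.
pub fn parse_bel_uri(uri: &str) -> Result<BelBackend<'_>, String> {
    if uri == "stdout" {
        Ok(BelBackend::Stdout)
    } else if let Some(path) = uri.strip_prefix("sqlite://") {
        Ok(BelBackend::Sqlite { path })
    } else if uri.starts_with("postgres://") {
        Ok(BelBackend::Postgres { uri })
    } else {
        // Includes delta:// once Delta Lake support is removed.
        Err(format!("Unsupported build event log URI: {}", uri))
    }
}
```

With this split, the async factory would only need to match on the parsed variant and construct the corresponding storage.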
### Phase 5: Final Cleanup
1. **Remove Legacy Implementations**
- Delete complex aggregation logic from existing storage backends
- Simplify remaining backends to implement only new `BELStorage` trait
- Remove deprecated `BuildEventLog` trait
2. **Update Documentation**
- Update design docs to reflect new architecture
- Create migration guide for external users
- Update code examples and README
## Benefits of 3-Tier Architecture
### ✅ **Simplified Codebase**
- Removes complex Delta Lake dependencies
- Storage backends focus only on append + scan operations
- Clear separation between storage and business logic
### ✅ **Better Maintainability**
- Single SQLite implementation for most use cases
- Query logic centralized in one place
- Easier to debug and test each layer independently
### ✅ **Future-Ready Foundation**
- Clean foundation for wants system (next phase)
- Easy to add new storage backends when needed
- Query engine ready for cross-graph coordination APIs
### ✅ **Performance Benefits**
- Eliminates complex SQL joins in storage layer
- Enables sequential scanning optimizations
- Cleaner separation allows targeted optimizations
## Success Criteria
### Phase 1-2: Foundation
- [ ] Storage layer trait compiles and tests pass
- [ ] SQLite storage backend supports append + list operations
- [ ] Query engine provides same functionality as current BEL trait
- [ ] EventFilter protobuf types generate correctly
### Phase 3-4: Migration
- [ ] All repositories work with new query engine
- [ ] CLI and service use new architecture
- [ ] Existing functionality unchanged from user perspective
- [ ] Delta Lake implementation removed
### Phase 5: Completion
- [ ] Legacy BEL trait removed
- [ ] Performance meets or exceeds current implementation
- [ ] Documentation updated for new architecture
- [ ] Codebase simplified and maintainable
## Risk Mitigation
1. **Gradual Migration**: Implement new architecture alongside existing code
2. **Feature Parity**: Ensure all existing functionality works before removing old code
3. **Performance Testing**: Benchmark new implementation against current performance
4. **Simple First**: Start with SQLite-only implementation, add complexity later as needed