# Build Graph Service Design
## Overview
The Build Graph Service is a persistent HTTP service that coordinates build requests, tracks partition status, and serves as an operational interface for the DataBuild system. It bridges the gap between the stateless core DataBuild engine and the stateful requirements of production data orchestration.
## Core Architecture
```rust
// Main service interface
#[async_trait]
trait BuildGraphService {
    // Build request lifecycle
    async fn submit_build_request(&self, partitions: Vec<PartitionRef>) -> Result<String, Error>;
    async fn get_build_status(&self, build_request_id: &str) -> Result<BuildRequestStatus, Error>;
    async fn cancel_build_request(&self, build_request_id: &str) -> Result<(), Error>;

    // Partition status queries
    async fn get_partition_status(&self, partition_ref: &str) -> Result<PartitionStatus, Error>;
    async fn get_partition_events(&self, partition_ref: &str) -> Result<Vec<BuildEvent>, Error>;

    // Graph analysis
    async fn analyze_build_graph(&self, partitions: Vec<PartitionRef>) -> Result<JobGraph, Error>;

    // Service queries for dashboard
    async fn get_recent_builds(&self, limit: Option<u32>) -> Result<Vec<BuildSummary>, Error>;
    async fn get_job_metrics(&self, job_label: &str) -> Result<JobMetrics, Error>;
    async fn execute_query(&self, query: &str) -> Result<QueryResult, Error>;
}
```
## Key Components
### 1. Build Request Coordinator
Manages the lifecycle of build requests:
- Receives partition build requests via HTTP
- Calls DataBuild core to analyze required work
- Detects delegation opportunities to existing builds
- Schedules job execution via external orchestrators
- Tracks build progress through event log
### 2. Partition Status Tracker
Provides real-time partition status:
- Queries build event log for partition lifecycle
- Determines partition liveness and staleness
- Handles partition status API endpoints
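
A minimal sketch of how a status could be derived from the event log, assuming the latest event for a partition decides its status and that staleness is a simple age threshold. The event kinds, the `Stale` and `Unknown` variants, and `max_age_secs` are hypothetical; the real event schema and staleness rule may differ.

```rust
/// Hypothetical event kinds; the real build event schema may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PartitionEventKind {
    Requested,
    Building,
    Available,
    Failed,
}

#[derive(Debug, Clone)]
pub struct PartitionEvent {
    pub partition_ref: String,
    pub kind: PartitionEventKind,
    pub timestamp_secs: u64,
}

/// Hypothetical status values reported to API callers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PartitionStatus {
    Unknown,
    Requested,
    Building,
    Available,
    Stale,
    Failed,
}

/// Derive the status of `partition_ref` from its most recent event.
/// `max_age_secs` is an assumed staleness threshold, not part of the design.
pub fn partition_status(
    events: &[PartitionEvent],
    partition_ref: &str,
    now_secs: u64,
    max_age_secs: u64,
) -> PartitionStatus {
    let latest = events
        .iter()
        .filter(|e| e.partition_ref == partition_ref)
        .max_by_key(|e| e.timestamp_secs);

    match latest {
        // No events at all: the partition has never been seen.
        None => PartitionStatus::Unknown,
        Some(e) => match e.kind {
            PartitionEventKind::Requested => PartitionStatus::Requested,
            PartitionEventKind::Building => PartitionStatus::Building,
            PartitionEventKind::Failed => PartitionStatus::Failed,
            PartitionEventKind::Available => {
                // An available partition older than the threshold is stale.
                if now_secs.saturating_sub(e.timestamp_secs) > max_age_secs {
                    PartitionStatus::Stale
                } else {
                    PartitionStatus::Available
                }
            }
        },
    }
}
```

Scanning the whole log per query is only reasonable for small logs; a SQL-backed log would likely push the "latest event per partition" lookup into the query itself.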
### 3. Event Log Interface
Wraps the build event log with service-level operations:
- Appends build events from job executions
- Provides structured queries for dashboard
- Maintains build request to partition mappings
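
As an illustration of these operations, here is an in-memory stand-in for the event log. It is a sketch only: the `BuildEvent` fields and the `InMemoryEventLog` type are hypothetical, and the planned SQLite or PostgreSQL log would replace the vectors and maps with tables.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical event shape; the real build event schema may carry more fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BuildEvent {
    pub build_request_id: String,
    pub partition_ref: Option<String>,
    pub message: String,
}

/// Append-only log plus a build request -> partitions index.
#[derive(Default)]
pub struct InMemoryEventLog {
    events: Vec<BuildEvent>,
    partitions_by_build: HashMap<String, BTreeSet<String>>,
}

impl InMemoryEventLog {
    pub fn new() -> Self {
        Self::default()
    }

    /// Append an event and return its zero-based sequence number.
    pub fn append(&mut self, event: BuildEvent) -> usize {
        // Keep the build request -> partition mapping up to date.
        if let Some(partition) = &event.partition_ref {
            self.partitions_by_build
                .entry(event.build_request_id.clone())
                .or_default()
                .insert(partition.clone());
        }
        self.events.push(event);
        self.events.len() - 1
    }

    /// All events recorded for one partition, in append order.
    pub fn events_for_partition(&self, partition_ref: &str) -> Vec<&BuildEvent> {
        self.events
            .iter()
            .filter(|e| e.partition_ref.as_deref() == Some(partition_ref))
            .collect()
    }

    /// Partitions touched by one build request, sorted by name.
    pub fn partitions_for_build(&self, build_request_id: &str) -> Vec<String> {
        self.partitions_by_build
            .get(build_request_id)
            .map(|set| set.iter().cloned().collect())
            .unwrap_or_default()
    }
}
```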
### 4. Job Executor
Executes jobs directly within the service:
- Spawns job processes using DataBuild core
- Monitors job execution and resource usage
- Captures job outputs and translates to build events
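
One way the spawn-and-translate step might look, using only `std::process::Command`. This is a sketch under assumptions: `JobOutcome`, `JobResult`, and `run_job` are hypothetical names, the job is treated as an arbitrary executable rather than something launched through DataBuild core, and resource monitoring is left out.

```rust
use std::process::Command;

/// Hypothetical outcome that the service would translate into a build event.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum JobOutcome {
    Succeeded,
    Failed { exit_code: Option<i32> },
    SpawnFailed { reason: String },
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct JobResult {
    pub job_label: String,
    pub outcome: JobOutcome,
    pub stdout: String,
}

/// Run `program` to completion, capture its stdout, and classify the result.
pub fn run_job(job_label: &str, program: &str, args: &[&str]) -> JobResult {
    match Command::new(program).args(args).output() {
        Ok(output) => {
            let outcome = if output.status.success() {
                JobOutcome::Succeeded
            } else {
                // `code()` is None when the process was killed by a signal.
                JobOutcome::Failed { exit_code: output.status.code() }
            };
            JobResult {
                job_label: job_label.to_string(),
                outcome,
                stdout: String::from_utf8_lossy(&output.stdout).into_owned(),
            }
        }
        // The process never started (missing binary, permissions, ...).
        Err(err) => JobResult {
            job_label: job_label.to_string(),
            outcome: JobOutcome::SpawnFailed { reason: err.to_string() },
            stdout: String::new(),
        },
    }
}
```

`output()` blocks until the child exits, so an async service would probably run this on a blocking thread pool or use an async process API instead.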
## HTTP API Design
### Build Operations
```
POST /builds
Body: {"partitions": ["dal://table/date=2024-01-01", ...]}
Returns: {"build_request_id": "uuid"}

GET /builds/{build_request_id}
Returns: {"status": "executing", "progress": {"graph": {...}, "events": [...]}, "partitions": [...]}

DELETE /builds/{build_request_id}
Returns: {"cancelled": true}
```
### Partition Status
```
GET /partitions/{partition_ref}/status
Returns: {"status": "available", "last_updated": "timestamp", "build_requests": [...]}

GET /partitions/{partition_ref}/events
Returns: {"events": [...]}
```
### Analysis
```
POST /analyze
Body: {"partitions": ["dal://table/date=2024-01-01"]}
Returns: {"job_graph": {...}}
```
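
The endpoints above could be dispatched with a small route matcher like the following. It is a framework-free sketch: the `Route` enum and `match_route` function are hypothetical, and a real service would more likely rely on its HTTP framework's router. Note that partition refs such as `dal://table/date=2024-01-01` contain `/`; this sketch rejoins the path segments between `/partitions/` and the trailing `status` or `events`, whereas a real API might instead require percent-encoded refs.

```rust
/// Hypothetical parsed form of the endpoints listed above.
#[derive(Debug, PartialEq, Eq)]
pub enum Route {
    SubmitBuild,
    GetBuild { build_request_id: String },
    CancelBuild { build_request_id: String },
    PartitionStatus { partition_ref: String },
    PartitionEvents { partition_ref: String },
    Analyze,
}

/// Map an HTTP method and path to a route, or None if nothing matches.
pub fn match_route(method: &str, path: &str) -> Option<Route> {
    let segments: Vec<&str> = path.trim_matches('/').split('/').collect();
    match (method, segments.as_slice()) {
        ("POST", ["builds"]) => Some(Route::SubmitBuild),
        ("GET", ["builds", id]) => Some(Route::GetBuild {
            build_request_id: id.to_string(),
        }),
        ("DELETE", ["builds", id]) => Some(Route::CancelBuild {
            build_request_id: id.to_string(),
        }),
        // Everything between "partitions" and the final segment is the ref.
        ("GET", ["partitions", partition @ .., "status"]) if !partition.is_empty() => {
            Some(Route::PartitionStatus {
                partition_ref: partition.join("/"),
            })
        }
        ("GET", ["partitions", partition @ .., "events"]) if !partition.is_empty() => {
            Some(Route::PartitionEvents {
                partition_ref: partition.join("/"),
            })
        }
        ("POST", ["analyze"]) => Some(Route::Analyze),
        _ => None,
    }
}
```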
## Data Flow
1. **Build Request Submission**
   - User/system submits partition build request
   - Service generates build request ID
   - Calls DataBuild core to analyze required work
   - Logs BUILD_REQUEST_RECEIVED event
2. **Planning Phase**
   - DataBuild core analyzes partition dependencies
   - Service checks for delegation opportunities
   - Logs BUILD_REQUEST_PLANNING event
   - Creates execution plan with job graph
3. **Execution Phase**
   - Service executes jobs directly using DataBuild core
   - Jobs execute independently, logging events
   - Service aggregates events to track progress
   - Logs BUILD_REQUEST_EXECUTING event
4. **Completion**
   - All jobs complete successfully/with errors
   - Service updates final build request status
   - Logs BUILD_REQUEST_COMPLETED/FAILED event
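
The four phases above can be read as a small state machine. The sketch below assumes the event names map one-to-one onto states, that `BUILD_REQUEST_COMPLETED/FAILED` denotes two separate events, and that failure is reachable from planning or execution only. The transition rules and the `advance` helper are illustrative assumptions, and cancellation is not modeled.

```rust
/// Build request states, named after the events in the data flow.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BuildRequestState {
    Received,
    Planning,
    Executing,
    Completed,
    Failed,
}

impl BuildRequestState {
    /// Event logged when a build request enters this state.
    pub fn event_name(self) -> &'static str {
        match self {
            BuildRequestState::Received => "BUILD_REQUEST_RECEIVED",
            BuildRequestState::Planning => "BUILD_REQUEST_PLANNING",
            BuildRequestState::Executing => "BUILD_REQUEST_EXECUTING",
            BuildRequestState::Completed => "BUILD_REQUEST_COMPLETED",
            BuildRequestState::Failed => "BUILD_REQUEST_FAILED",
        }
    }

    /// Assumed rules: forward-only, with failure possible while planning
    /// or executing. Completed and Failed are terminal.
    pub fn can_transition_to(self, next: BuildRequestState) -> bool {
        use BuildRequestState::*;
        matches!(
            (self, next),
            (Received, Planning)
                | (Planning, Executing)
                | (Planning, Failed)
                | (Executing, Completed)
                | (Executing, Failed)
        )
    }
}

/// Move to `next` and record its event name; illegal moves log nothing.
pub fn advance(
    current: BuildRequestState,
    next: BuildRequestState,
    event_log: &mut Vec<&'static str>,
) -> Result<BuildRequestState, String> {
    if !current.can_transition_to(next) {
        return Err(format!("illegal transition {:?} -> {:?}", current, next));
    }
    event_log.push(next.event_name());
    Ok(next)
}
```

Encoding the transitions in one place keeps the event log consistent: a build request can never log `BUILD_REQUEST_EXECUTING` after it has already completed.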
## Delegation Logic
When a build request analyzes partitions, it checks for existing builds producing the same partitions:
```rust
async fn check_delegation_opportunities(
    &self,
    required_partitions: &[PartitionRef]
) -> Result<Vec<DelegationDecision>, Error> {
    let mut decisions = Vec::new();
    for partition in required_partitions {
        if let Some(existing_build) = self.find_active_build_for_partition(partition).await? {
            decisions.push(DelegationDecision::Delegate {
                partition: partition.clone(),
                to_build_request: existing_build.build_request_id,
            });
        } else {
            decisions.push(DelegationDecision::Build {
                partition: partition.clone(),
            });
        }
    }
    Ok(decisions)
}
```
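
To make the decision rule concrete, here is a synchronous, self-contained variant of the same logic. It is a sketch with assumptions: `PartitionRef` is taken to be a plain string, and the lookup of active builds is replaced by a hypothetical in-memory map from partition to build request ID instead of a query against the event log.

```rust
use std::collections::HashMap;

/// Assumption: a partition ref is represented as a plain string here.
type PartitionRef = String;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DelegationDecision {
    Delegate { partition: PartitionRef, to_build_request: String },
    Build { partition: PartitionRef },
}

/// Decide per partition whether to delegate to an active build or build it.
/// `active_builds` maps each partition to the build request producing it.
pub fn decide_delegation(
    active_builds: &HashMap<PartitionRef, String>,
    required_partitions: &[PartitionRef],
) -> Vec<DelegationDecision> {
    required_partitions
        .iter()
        .map(|partition| match active_builds.get(partition) {
            Some(build_request_id) => DelegationDecision::Delegate {
                partition: partition.clone(),
                to_build_request: build_request_id.clone(),
            },
            None => DelegationDecision::Build {
                partition: partition.clone(),
            },
        })
        .collect()
}
```

For example, with one active build already producing `dal://table/date=2024-01-01`, a request for that partition plus `dal://table/date=2024-01-02` yields one `Delegate` and one `Build` decision, in request order.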
## Minimal Implementation Scope
### Phase 1: Core Service
- HTTP API for build requests and status
- Integration with existing DataBuild core
- SQLite-based build event log
- Basic partition status tracking
- In-process job execution
### Phase 2: Production Features
- PostgreSQL build event log
- Delegation logic for overlapping builds
- Comprehensive HTTP API
### Phase 3: Advanced Features
- Monitoring and alerting integration
## Key Design Questions
1. **Job Execution Strategy**: The service executes jobs directly using DataBuild core for simplicity.
2. **Concurrency Control**: How should the service handle concurrent builds of the same partition? Current design uses delegation, but alternatives include queuing or coordination protocols.
3. **Graph Persistence**: The service will re-analyze job graphs on each request to maintain simplicity.
4. **Multi-Graph Support**: A single service instance will handle one DataBuild graph for simplicity, with graph composition handling cross-graph dependencies.
5. **Event Log Granularity**: How detailed should build events be? More detail enables better observability but increases storage requirements.
## Risk Mitigation
Given the project's emphasis on avoiding over-engineering:
- **Start Simple**: Begin with SQLite, in-process execution, and basic APIs
- **Defer Complexity**: Postpone advanced features like optimization strategies
- **Focus on Core Value**: Prioritize build coordination over peripheral features
- **Maintain Boundaries**: Keep the service focused on orchestration, not data processing
The service should enable the compelling DataBuild vision while remaining implementable within the project's scope constraints.