databuild/plans/03-service-interface-refactor.md

134 lines
No EOL
4.8 KiB
Markdown

# Service Interface Refactor Plan
## Motivation
The Build Graph Service HTTP API successfully implements core DataBuild functionality but has interface misalignments with the proto definitions that reduce type safety and limit functionality. Key issues:
- **Type Safety Loss**: Analysis responses use `serde_json::Value` instead of structured `JobGraph`
- **Missing Manifest Access**: Build responses only return request IDs, no way to retrieve final manifests
- **Incomplete Dashboard Support**: Missing list/metrics endpoints from design specification
- **Inconsistent Error Handling**: Different patterns between gRPC and HTTP
## Core Interface Alignment
### Fix Analysis Response Type Safety
**Problem**: `AnalyzeResponse` uses generic JSON, losing compile-time guarantees
```rust
// Current: loses type safety
pub job_graph: serde_json::Value
// Target: structured type
pub job_graph: JobGraph
```
**Critical Detail**: Ensure `JobGraph` implements `Serialize` properly and remove `serde_json::to_value()` conversion in handlers.
### Add Manifest Retrieval
**Problem**: HTTP API returns build ID immediately but no way to get final manifests (unlike proto's synchronous `GraphBuildResponse`)
**Solution**: Add `GET /api/v1/builds/:id/manifests` endpoint
- Store manifests in `BuildRequestState` during execution
- Update execute command to capture manifests from job graph results
- Enable both async (current) and manifest retrieval patterns
## Missing Core Functionality
### Dashboard Endpoints
The design specification requires but current API lacks:
1. **Recent Builds**: `GET /api/v1/builds?limit=N`
- Enable dashboard functionality
- Support pagination for large build histories
2. **Job Metrics**: `GET /api/v1/jobs/:label/metrics`
- Aggregate success rates, durations, failure patterns
- Essential for operational monitoring
3. **Query Interface**: `POST /api/v1/query`
- Advanced querying beyond predefined endpoints
- SQL passthrough or structured query DSL
**Implementation Strategy**: Extend event log with aggregation queries, add proper pagination support.
## Error Response Standardization
**Problem**: Current `ErrorResponse {error: String}` lacks structure and traceability
**Target**:
```rust
pub struct ErrorResponse {
pub error: ErrorDetail,
pub request_id: Option<String>,
pub timestamp: i64,
}
```
**Error Code Strategy**:
- `INVALID_PARTITION_REF` - Validation failures
- `JOB_LOOKUP_FAILED` - Graph analysis issues
- `ANALYSIS_FAILED` / `EXECUTION_FAILED` - Build process errors
- `BUILD_NOT_FOUND` - Resource not found
- `INTERNAL_ERROR` - Server errors
**Critical**: Add request ID middleware for tracing across async operations.
## Protocol Compatibility Strategy
### Dual Protocol Support
**Narrative**: Some clients need gRPC's performance/streaming, others prefer HTTP's simplicity
**Approach**:
- Implement `DataBuildService` from proto using `tonic`
- Share core logic between HTTP and gRPC handlers
- Add sync mode option: `POST /api/v1/builds?mode=sync` returns manifests directly
**Critical Detail**: Avoid code duplication by extracting shared business logic into service layer.
## Backward Compatibility
**Philosophy**: Never break existing clients during improvements
**Strategy**:
- New endpoints alongside existing ones
- API versioning (`/api/v2`) only for incompatible changes
- Feature flags for experimental functionality
- Deprecation notices with migration paths
## Implementation Priority
### Phase 1: Core Safety (Essential)
1. Fix `AnalyzeResponse` type safety
2. Add manifest retrieval endpoint
3. Standardize error responses with request tracing
**Rationale**: These fix fundamental interface problems affecting all users.
### Phase 2: Dashboard Features (High Value)
1. Recent builds list with pagination
2. Job metrics aggregation
3. Enhanced build state tracking
**Rationale**: Enables operational monitoring and debugging workflows.
### Phase 3: Advanced Features (Complete)
1. Query execution interface
2. gRPC service implementation
3. Sync/async mode support
**Rationale**: Achieves full design specification compliance and protocol flexibility.
## Critical Success Factors
1. **Type Safety**: All responses use structured types, not generic JSON
2. **Interface Completeness**: Every proto service method has HTTP equivalent
3. **Operational Readiness**: Dashboard endpoints support real monitoring needs
4. **Zero Breakage**: Existing clients continue working throughout transition
5. **Performance**: No regression in build execution times
## Risk Mitigation
**Breaking Changes**: Use versioning strategy and maintain parallel endpoints during transitions
**Performance Impact**: Implement caching for metrics aggregation, rate limiting for expensive queries
**Complexity Growth**: Extract shared business logic to avoid duplication between HTTP/gRPC handlers