This commit is contained in:
parent
1bfda923b6
commit
3c817fac81
1 changed files with 134 additions and 0 deletions
134
plans/service-interface-refactor.md
Normal file
134
plans/service-interface-refactor.md
Normal file
|
|
@ -0,0 +1,134 @@
|
||||||
|
# Service Interface Refactor Plan
|
||||||
|
|
||||||
|
## Motivation
|
||||||
|
|
||||||
|
The Build Graph Service HTTP API successfully implements core DataBuild functionality but has interface misalignments with the proto definitions that reduce type safety and limit functionality. Key issues:
|
||||||
|
|
||||||
|
- **Type Safety Loss**: Analysis responses use `serde_json::Value` instead of structured `JobGraph`
|
||||||
|
- **Missing Manifest Access**: Build responses only return request IDs, no way to retrieve final manifests
|
||||||
|
- **Incomplete Dashboard Support**: Missing list/metrics endpoints from design specification
|
||||||
|
- **Inconsistent Error Handling**: Different patterns between gRPC and HTTP
|
||||||
|
|
||||||
|
## Core Interface Alignment
|
||||||
|
|
||||||
|
### Fix Analysis Response Type Safety
|
||||||
|
**Problem**: `AnalyzeResponse` uses generic JSON, losing compile-time guarantees
|
||||||
|
```rust
|
||||||
|
// Current: loses type safety
|
||||||
|
pub job_graph: serde_json::Value
|
||||||
|
|
||||||
|
// Target: structured type
|
||||||
|
pub job_graph: JobGraph
|
||||||
|
```
|
||||||
|
|
||||||
|
**Critical Detail**: Ensure `JobGraph` implements `Serialize` properly and remove `serde_json::to_value()` conversion in handlers.
|
||||||
|
|
||||||
|
### Add Manifest Retrieval
|
||||||
|
**Problem**: HTTP API returns build ID immediately but no way to get final manifests (unlike proto's synchronous `GraphBuildResponse`)
|
||||||
|
|
||||||
|
**Solution**: Add `GET /api/v1/builds/:id/manifests` endpoint
|
||||||
|
- Store manifests in `BuildRequestState` during execution
|
||||||
|
- Update execute command to capture manifests from job graph results
|
||||||
|
- Enable both async (current) and manifest retrieval patterns
|
||||||
|
|
||||||
|
## Missing Core Functionality
|
||||||
|
|
||||||
|
### Dashboard Endpoints
|
||||||
|
The design specification requires but current API lacks:
|
||||||
|
|
||||||
|
1. **Recent Builds**: `GET /api/v1/builds?limit=N`
|
||||||
|
- Enable dashboard functionality
|
||||||
|
- Support pagination for large build histories
|
||||||
|
|
||||||
|
2. **Job Metrics**: `GET /api/v1/jobs/:label/metrics`
|
||||||
|
- Aggregate success rates, durations, failure patterns
|
||||||
|
- Essential for operational monitoring
|
||||||
|
|
||||||
|
3. **Query Interface**: `POST /api/v1/query`
|
||||||
|
- Advanced querying beyond predefined endpoints
|
||||||
|
- SQL passthrough or structured query DSL
|
||||||
|
|
||||||
|
**Implementation Strategy**: Extend event log with aggregation queries, add proper pagination support.
|
||||||
|
|
||||||
|
## Error Response Standardization
|
||||||
|
|
||||||
|
**Problem**: Current `ErrorResponse {error: String}` lacks structure and traceability
|
||||||
|
|
||||||
|
**Target**:
|
||||||
|
```rust
|
||||||
|
pub struct ErrorResponse {
|
||||||
|
pub error: ErrorDetail,
|
||||||
|
pub request_id: Option<String>,
|
||||||
|
pub timestamp: i64,
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Error Code Strategy**:
|
||||||
|
- `INVALID_PARTITION_REF` - Validation failures
|
||||||
|
- `JOB_LOOKUP_FAILED` - Graph analysis issues
|
||||||
|
- `ANALYSIS_FAILED` / `EXECUTION_FAILED` - Build process errors
|
||||||
|
- `BUILD_NOT_FOUND` - Resource not found
|
||||||
|
- `INTERNAL_ERROR` - Server errors
|
||||||
|
|
||||||
|
**Critical**: Add request ID middleware for tracing across async operations.
|
||||||
|
|
||||||
|
## Protocol Compatibility Strategy
|
||||||
|
|
||||||
|
### Dual Protocol Support
|
||||||
|
**Narrative**: Some clients need gRPC's performance/streaming, others prefer HTTP's simplicity
|
||||||
|
|
||||||
|
**Approach**:
|
||||||
|
- Implement `DataBuildService` from proto using `tonic`
|
||||||
|
- Share core logic between HTTP and gRPC handlers
|
||||||
|
- Add sync mode option: `POST /api/v1/builds?mode=sync` returns manifests directly
|
||||||
|
|
||||||
|
**Critical Detail**: Avoid code duplication by extracting shared business logic into service layer.
|
||||||
|
|
||||||
|
## Backward Compatibility
|
||||||
|
|
||||||
|
**Philosophy**: Never break existing clients during improvements
|
||||||
|
|
||||||
|
**Strategy**:
|
||||||
|
- New endpoints alongside existing ones
|
||||||
|
- API versioning (`/api/v2`) only for incompatible changes
|
||||||
|
- Feature flags for experimental functionality
|
||||||
|
- Deprecation notices with migration paths
|
||||||
|
|
||||||
|
## Implementation Priority
|
||||||
|
|
||||||
|
### Phase 1: Core Safety (Essential)
|
||||||
|
1. Fix `AnalyzeResponse` type safety
|
||||||
|
2. Add manifest retrieval endpoint
|
||||||
|
3. Standardize error responses with request tracing
|
||||||
|
|
||||||
|
**Rationale**: These fix fundamental interface problems affecting all users.
|
||||||
|
|
||||||
|
### Phase 2: Dashboard Features (High Value)
|
||||||
|
1. Recent builds list with pagination
|
||||||
|
2. Job metrics aggregation
|
||||||
|
3. Enhanced build state tracking
|
||||||
|
|
||||||
|
**Rationale**: Enables operational monitoring and debugging workflows.
|
||||||
|
|
||||||
|
### Phase 3: Advanced Features (Complete)
|
||||||
|
1. Query execution interface
|
||||||
|
2. gRPC service implementation
|
||||||
|
3. Sync/async mode support
|
||||||
|
|
||||||
|
**Rationale**: Achieves full design specification compliance and protocol flexibility.
|
||||||
|
|
||||||
|
## Critical Success Factors
|
||||||
|
|
||||||
|
1. **Type Safety**: All responses use structured types, not generic JSON
|
||||||
|
2. **Interface Completeness**: Every proto service method has HTTP equivalent
|
||||||
|
3. **Operational Readiness**: Dashboard endpoints support real monitoring needs
|
||||||
|
4. **Zero Breakage**: Existing clients continue working throughout transition
|
||||||
|
5. **Performance**: No regression in build execution times
|
||||||
|
|
||||||
|
## Risk Mitigation
|
||||||
|
|
||||||
|
**Breaking Changes**: Use versioning strategy and maintain parallel endpoints during transitions
|
||||||
|
|
||||||
|
**Performance Impact**: Implement caching for metrics aggregation, rate limiting for expensive queries
|
||||||
|
|
||||||
|
**Complexity Growth**: Extract shared business logic to avoid duplication between HTTP/gRPC handlers
|
||||||
Loading…
Reference in a new issue