From 3c817fac81d0d85a3647ac238959d809535acc8b Mon Sep 17 00:00:00 2001 From: Stuart Axelbrooke Date: Sun, 6 Jul 2025 22:16:42 -0700 Subject: [PATCH] phase 4 - add plan for aligning interfaces --- plans/service-interface-refactor.md | 134 ++++++++++++++++++++++++++++ 1 file changed, 134 insertions(+) create mode 100644 plans/service-interface-refactor.md diff --git a/plans/service-interface-refactor.md b/plans/service-interface-refactor.md new file mode 100644 index 0000000..e5261bc --- /dev/null +++ b/plans/service-interface-refactor.md @@ -0,0 +1,134 @@ +# Service Interface Refactor Plan + +## Motivation + +The Build Graph Service HTTP API successfully implements core DataBuild functionality but has interface misalignments with the proto definitions that reduce type safety and limit functionality. Key issues: + +- **Type Safety Loss**: Analysis responses use `serde_json::Value` instead of structured `JobGraph` +- **Missing Manifest Access**: Build responses only return request IDs, no way to retrieve final manifests +- **Incomplete Dashboard Support**: Missing list/metrics endpoints from design specification +- **Inconsistent Error Handling**: Different patterns between gRPC and HTTP + +## Core Interface Alignment + +### Fix Analysis Response Type Safety +**Problem**: `AnalyzeResponse` uses generic JSON, losing compile-time guarantees +```rust +// Current: loses type safety +pub job_graph: serde_json::Value + +// Target: structured type +pub job_graph: JobGraph +``` + +**Critical Detail**: Ensure `JobGraph` implements `Serialize` properly and remove `serde_json::to_value()` conversion in handlers. + +### Add Manifest Retrieval +**Problem**: HTTP API returns build ID immediately but no way to get final manifests (unlike proto's synchronous `GraphBuildResponse`) + +**Solution**: Add `GET /api/v1/builds/:id/manifests` endpoint +- Store manifests in `BuildRequestState` during execution +- Update execute command to capture manifests from job graph results +- Enable both async (current) and manifest retrieval patterns + +## Missing Core Functionality + +### Dashboard Endpoints +The design specification requires but current API lacks: + +1. **Recent Builds**: `GET /api/v1/builds?limit=N` + - Enable dashboard functionality + - Support pagination for large build histories + +2. **Job Metrics**: `GET /api/v1/jobs/:label/metrics` + - Aggregate success rates, durations, failure patterns + - Essential for operational monitoring + +3. **Query Interface**: `POST /api/v1/query` + - Advanced querying beyond predefined endpoints + - SQL passthrough or structured query DSL + +**Implementation Strategy**: Extend event log with aggregation queries, add proper pagination support. + +## Error Response Standardization + +**Problem**: Current `ErrorResponse {error: String}` lacks structure and traceability + +**Target**: +```rust +pub struct ErrorResponse { + pub error: ErrorDetail, + pub request_id: Option, + pub timestamp: i64, +} +``` + +**Error Code Strategy**: +- `INVALID_PARTITION_REF` - Validation failures +- `JOB_LOOKUP_FAILED` - Graph analysis issues +- `ANALYSIS_FAILED` / `EXECUTION_FAILED` - Build process errors +- `BUILD_NOT_FOUND` - Resource not found +- `INTERNAL_ERROR` - Server errors + +**Critical**: Add request ID middleware for tracing across async operations. + +## Protocol Compatibility Strategy + +### Dual Protocol Support +**Narrative**: Some clients need gRPC's performance/streaming, others prefer HTTP's simplicity + +**Approach**: +- Implement `DataBuildService` from proto using `tonic` +- Share core logic between HTTP and gRPC handlers +- Add sync mode option: `POST /api/v1/builds?mode=sync` returns manifests directly + +**Critical Detail**: Avoid code duplication by extracting shared business logic into service layer. + +## Backward Compatibility + +**Philosophy**: Never break existing clients during improvements + +**Strategy**: +- New endpoints alongside existing ones +- API versioning (`/api/v2`) only for incompatible changes +- Feature flags for experimental functionality +- Deprecation notices with migration paths + +## Implementation Priority + +### Phase 1: Core Safety (Essential) +1. Fix `AnalyzeResponse` type safety +2. Add manifest retrieval endpoint +3. Standardize error responses with request tracing + +**Rationale**: These fix fundamental interface problems affecting all users. + +### Phase 2: Dashboard Features (High Value) +1. Recent builds list with pagination +2. Job metrics aggregation +3. Enhanced build state tracking + +**Rationale**: Enables operational monitoring and debugging workflows. + +### Phase 3: Advanced Features (Complete) +1. Query execution interface +2. gRPC service implementation +3. Sync/async mode support + +**Rationale**: Achieves full design specification compliance and protocol flexibility. + +## Critical Success Factors + +1. **Type Safety**: All responses use structured types, not generic JSON +2. **Interface Completeness**: Every proto service method has HTTP equivalent +3. **Operational Readiness**: Dashboard endpoints support real monitoring needs +4. **Zero Breakage**: Existing clients continue working throughout transition +5. **Performance**: No regression in build execution times + +## Risk Mitigation + +**Breaking Changes**: Use versioning strategy and maintain parallel endpoints during transitions + +**Performance Impact**: Implement caching for metrics aggregation, rate limiting for expensive queries + +**Complexity Growth**: Extract shared business logic to avoid duplication between HTTP/gRPC handlers \ No newline at end of file