# Service Interface Refactor Plan ## Motivation The Build Graph Service HTTP API successfully implements core DataBuild functionality but has interface misalignments with the proto definitions that reduce type safety and limit functionality. Key issues: - **Type Safety Loss**: Analysis responses use `serde_json::Value` instead of structured `JobGraph` - **Missing Manifest Access**: Build responses only return request IDs, no way to retrieve final manifests - **Incomplete Dashboard Support**: Missing list/metrics endpoints from design specification - **Inconsistent Error Handling**: Different patterns between gRPC and HTTP ## Core Interface Alignment ### Fix Analysis Response Type Safety **Problem**: `AnalyzeResponse` uses generic JSON, losing compile-time guarantees ```rust // Current: loses type safety pub job_graph: serde_json::Value // Target: structured type pub job_graph: JobGraph ``` **Critical Detail**: Ensure `JobGraph` implements `Serialize` properly and remove `serde_json::to_value()` conversion in handlers. ### Add Manifest Retrieval **Problem**: HTTP API returns build ID immediately but no way to get final manifests (unlike proto's synchronous `GraphBuildResponse`) **Solution**: Add `GET /api/v1/builds/:id/manifests` endpoint - Store manifests in `BuildRequestState` during execution - Update execute command to capture manifests from job graph results - Enable both async (current) and manifest retrieval patterns ## Missing Core Functionality ### Dashboard Endpoints The design specification requires but current API lacks: 1. **Recent Builds**: `GET /api/v1/builds?limit=N` - Enable dashboard functionality - Support pagination for large build histories 2. **Job Metrics**: `GET /api/v1/jobs/:label/metrics` - Aggregate success rates, durations, failure patterns - Essential for operational monitoring 3. **Query Interface**: `POST /api/v1/query` - Advanced querying beyond predefined endpoints - SQL passthrough or structured query DSL **Implementation Strategy**: Extend event log with aggregation queries, add proper pagination support. ## Error Response Standardization **Problem**: Current `ErrorResponse {error: String}` lacks structure and traceability **Target**: ```rust pub struct ErrorResponse { pub error: ErrorDetail, pub request_id: Option, pub timestamp: i64, } ``` **Error Code Strategy**: - `INVALID_PARTITION_REF` - Validation failures - `JOB_LOOKUP_FAILED` - Graph analysis issues - `ANALYSIS_FAILED` / `EXECUTION_FAILED` - Build process errors - `BUILD_NOT_FOUND` - Resource not found - `INTERNAL_ERROR` - Server errors **Critical**: Add request ID middleware for tracing across async operations. ## Protocol Compatibility Strategy ### Dual Protocol Support **Narrative**: Some clients need gRPC's performance/streaming, others prefer HTTP's simplicity **Approach**: - Implement `DataBuildService` from proto using `tonic` - Share core logic between HTTP and gRPC handlers - Add sync mode option: `POST /api/v1/builds?mode=sync` returns manifests directly **Critical Detail**: Avoid code duplication by extracting shared business logic into service layer. ## Backward Compatibility **Philosophy**: Never break existing clients during improvements **Strategy**: - New endpoints alongside existing ones - API versioning (`/api/v2`) only for incompatible changes - Feature flags for experimental functionality - Deprecation notices with migration paths ## Implementation Priority ### Phase 1: Core Safety (Essential) 1. Fix `AnalyzeResponse` type safety 2. Add manifest retrieval endpoint 3. Standardize error responses with request tracing **Rationale**: These fix fundamental interface problems affecting all users. ### Phase 2: Dashboard Features (High Value) 1. Recent builds list with pagination 2. Job metrics aggregation 3. Enhanced build state tracking **Rationale**: Enables operational monitoring and debugging workflows. ### Phase 3: Advanced Features (Complete) 1. Query execution interface 2. gRPC service implementation 3. Sync/async mode support **Rationale**: Achieves full design specification compliance and protocol flexibility. ## Critical Success Factors 1. **Type Safety**: All responses use structured types, not generic JSON 2. **Interface Completeness**: Every proto service method has HTTP equivalent 3. **Operational Readiness**: Dashboard endpoints support real monitoring needs 4. **Zero Breakage**: Existing clients continue working throughout transition 5. **Performance**: No regression in build execution times ## Risk Mitigation **Breaking Changes**: Use versioning strategy and maintain parallel endpoints during transitions **Performance Impact**: Implement caching for metrics aggregation, rate limiting for expensive queries **Complexity Growth**: Extract shared business logic to avoid duplication between HTTP/gRPC handlers