Service Interface Refactor Plan
Motivation
The Build Graph Service HTTP API successfully implements core DataBuild functionality but has interface misalignments with the proto definitions that reduce type safety and limit functionality. Key issues:
- Type Safety Loss: Analysis responses use serde_json::Value instead of structured JobGraph
- Missing Manifest Access: Build responses only return request IDs, no way to retrieve final manifests
- Incomplete Dashboard Support: Missing list/metrics endpoints from design specification
- Inconsistent Error Handling: Different patterns between gRPC and HTTP
Core Interface Alignment
Fix Analysis Response Type Safety
Problem: AnalyzeResponse uses generic JSON, losing compile-time guarantees
// Current: loses type safety
pub job_graph: serde_json::Value
// Target: structured type
pub job_graph: JobGraph
Critical Detail: Ensure JobGraph implements Serialize properly and remove serde_json::to_value() conversion in handlers.
Add Manifest Retrieval
Problem: The HTTP API returns a build ID immediately but offers no way to get the final manifests (unlike the proto's synchronous GraphBuildResponse)
Solution: Add GET /api/v1/builds/:id/manifests endpoint
- Store manifests in BuildRequestState during execution
- Update execute command to capture manifests from job graph results
- Enable both async (current) and manifest retrieval patterns
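The manifest retrieval steps above can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the service's actual code: `PartitionManifest`, `BuildStatus`, `BuildStore`, and `ManifestLookupError` are hypothetical names, and only `BuildRequestState` comes from the plan itself. A handler for GET /api/v1/builds/:id/manifests would presumably call something like `manifests()` and map the error variants onto HTTP responses.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the proto manifest type.
#[derive(Debug, Clone, PartialEq)]
pub struct PartitionManifest {
    pub partition_ref: String,
}

/// Assumed, simplified build lifecycle.
#[derive(Debug, Clone, PartialEq)]
pub enum BuildStatus {
    Executing,
    Completed,
}

/// Per-build state; the `manifests` field is the addition proposed above.
#[derive(Debug, Clone)]
pub struct BuildRequestState {
    pub status: BuildStatus,
    pub manifests: Vec<PartitionManifest>,
}

/// Hypothetical lookup errors for the retrieval endpoint.
#[derive(Debug, PartialEq)]
pub enum ManifestLookupError {
    BuildNotFound,
    NotFinished,
}

/// In-memory store keyed by build ID (an assumption; real storage may differ).
#[derive(Default)]
pub struct BuildStore {
    builds: HashMap<String, BuildRequestState>,
}

impl BuildStore {
    /// Register a new build in the executing state.
    pub fn start(&mut self, build_id: &str) {
        let state = BuildRequestState {
            status: BuildStatus::Executing,
            manifests: Vec::new(),
        };
        self.builds.insert(build_id.to_string(), state);
    }

    /// Capture one manifest produced during execution; false if the build is unknown.
    pub fn record_manifest(&mut self, build_id: &str, manifest: PartitionManifest) -> bool {
        match self.builds.get_mut(build_id) {
            Some(state) => {
                state.manifests.push(manifest);
                true
            }
            None => false,
        }
    }

    /// Mark a build as completed; false if the build is unknown.
    pub fn complete(&mut self, build_id: &str) -> bool {
        match self.builds.get_mut(build_id) {
            Some(state) => {
                state.status = BuildStatus::Completed;
                true
            }
            None => false,
        }
    }

    /// Return manifests only once the build has completed.
    pub fn manifests(&self, build_id: &str) -> Result<&[PartitionManifest], ManifestLookupError> {
        let state = self
            .builds
            .get(build_id)
            .ok_or(ManifestLookupError::BuildNotFound)?;
        match state.status {
            BuildStatus::Completed => Ok(&state.manifests),
            BuildStatus::Executing => Err(ManifestLookupError::NotFinished),
        }
    }
}
```

Returning a distinct `NotFinished` error lets async clients poll the same endpoint without confusing "still running" with "unknown build".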
Missing Core Functionality
Dashboard Endpoints
The design specification requires the following endpoints, which the current API lacks:
- Recent Builds: GET /api/v1/builds?limit=N
  - Enable dashboard functionality
  - Support pagination for large build histories
- Job Metrics: GET /api/v1/jobs/:label/metrics
  - Aggregate success rates, durations, failure patterns
  - Essential for operational monitoring
- Query Interface: POST /api/v1/query
  - Advanced querying beyond predefined endpoints
  - SQL passthrough or structured query DSL
Implementation Strategy: Extend event log with aggregation queries, add proper pagination support.
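To make the pagination and aggregation ideas concrete, here is a minimal stdlib-only sketch. It operates on in-memory slices rather than the real event log, and every name in it (`BuildSummary`, `Page`, `JobRun`, `JobMetrics`, `MAX_PAGE_SIZE`, the offset-based scheme) is an assumption made for illustration.

```rust
/// Upper bound on page size; the value is an assumption.
const MAX_PAGE_SIZE: usize = 100;

/// Hypothetical row for the recent-builds list.
#[derive(Debug, Clone, PartialEq)]
pub struct BuildSummary {
    pub build_id: String,
    pub requested_at: i64,
}

/// One page of results plus the offset of the next page, if any.
#[derive(Debug)]
pub struct Page<T> {
    pub items: Vec<T>,
    pub next_offset: Option<usize>,
}

/// Newest-first listing with offset pagination.
pub fn recent_builds(builds: &[BuildSummary], limit: usize, offset: usize) -> Page<BuildSummary> {
    // Clamp so a zero or oversized limit cannot stall or overload the client.
    let limit = limit.clamp(1, MAX_PAGE_SIZE);
    let mut sorted: Vec<BuildSummary> = builds.to_vec();
    sorted.sort_by(|a, b| b.requested_at.cmp(&a.requested_at));
    let items: Vec<BuildSummary> = sorted.iter().skip(offset).take(limit).cloned().collect();
    let end = offset + items.len();
    let next_offset = if end < sorted.len() { Some(end) } else { None };
    Page { items, next_offset }
}

/// Hypothetical event-log record for one job execution.
#[derive(Debug, Clone)]
pub struct JobRun {
    pub job_label: String,
    pub succeeded: bool,
    pub duration_ms: u64,
}

/// Aggregates for one job label.
#[derive(Debug, PartialEq)]
pub struct JobMetrics {
    pub total_runs: usize,
    pub success_rate: f64,
    pub avg_duration_ms: f64,
}

/// Aggregate success rate and mean duration; None if the label has no runs.
pub fn job_metrics(runs: &[JobRun], label: &str) -> Option<JobMetrics> {
    let matching: Vec<&JobRun> = runs.iter().filter(|r| r.job_label == label).collect();
    if matching.is_empty() {
        return None;
    }
    let total = matching.len();
    let successes = matching.iter().filter(|r| r.succeeded).count();
    let total_ms: u64 = matching.iter().map(|r| r.duration_ms).sum();
    Some(JobMetrics {
        total_runs: total,
        success_rate: successes as f64 / total as f64,
        avg_duration_ms: total_ms as f64 / total as f64,
    })
}
```

In a real implementation the sort, slice, and aggregation would more likely be pushed down into event-log queries; the sketch only pins down the response shapes the endpoints might return.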
Error Response Standardization
Problem: Current ErrorResponse {error: String} lacks structure and traceability
Target:
pub struct ErrorResponse {
    pub error: ErrorDetail,
    pub request_id: Option<String>,
    pub timestamp: i64,
}
Error Code Strategy:
- INVALID_PARTITION_REF - Validation failures
- JOB_LOOKUP_FAILED - Graph analysis issues
- ANALYSIS_FAILED / EXECUTION_FAILED - Build process errors
- BUILD_NOT_FOUND - Resource not found
- INTERNAL_ERROR - Server errors
Critical: Add request ID middleware for tracing across async operations.
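One possible shape for the structured error, sketched with the standard library only. The plan does not define `ErrorDetail`, so its fields (`code`, `message`) are assumptions, as are the `ErrorCode` enum, the `new` constructor, and the HTTP status mapping. The code strings themselves are the ones listed above.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Error codes from the strategy above, modeled as an enum (an assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorCode {
    InvalidPartitionRef,
    JobLookupFailed,
    AnalysisFailed,
    ExecutionFailed,
    BuildNotFound,
    InternalError,
}

impl ErrorCode {
    /// Wire representation of the code.
    pub fn as_str(self) -> &'static str {
        match self {
            ErrorCode::InvalidPartitionRef => "INVALID_PARTITION_REF",
            ErrorCode::JobLookupFailed => "JOB_LOOKUP_FAILED",
            ErrorCode::AnalysisFailed => "ANALYSIS_FAILED",
            ErrorCode::ExecutionFailed => "EXECUTION_FAILED",
            ErrorCode::BuildNotFound => "BUILD_NOT_FOUND",
            ErrorCode::InternalError => "INTERNAL_ERROR",
        }
    }

    /// Assumed HTTP status mapping; the plan does not specify one.
    pub fn http_status(self) -> u16 {
        match self {
            ErrorCode::InvalidPartitionRef => 400,
            ErrorCode::BuildNotFound => 404,
            ErrorCode::JobLookupFailed
            | ErrorCode::AnalysisFailed
            | ErrorCode::ExecutionFailed
            | ErrorCode::InternalError => 500,
        }
    }
}

/// Hypothetical contents of `ErrorDetail`.
#[derive(Debug, Clone)]
pub struct ErrorDetail {
    pub code: ErrorCode,
    pub message: String,
}

/// Target response shape from the plan.
#[derive(Debug, Clone)]
pub struct ErrorResponse {
    pub error: ErrorDetail,
    pub request_id: Option<String>,
    pub timestamp: i64,
}

impl ErrorResponse {
    /// Build a response stamped with the current Unix time in seconds.
    pub fn new(code: ErrorCode, message: &str, request_id: Option<String>) -> Self {
        let timestamp = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs() as i64)
            .unwrap_or(0); // clock set before the epoch: fall back to 0
        ErrorResponse {
            error: ErrorDetail { code, message: message.to_string() },
            request_id,
            timestamp,
        }
    }
}
```

The `request_id` would be supplied by the request ID middleware, so the same identifier can appear in logs and in the error body.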
Protocol Compatibility Strategy
Dual Protocol Support
Narrative: Some clients need gRPC's performance/streaming, others prefer HTTP's simplicity
Approach:
- Implement DataBuildService from proto using tonic
- Share core logic between HTTP and gRPC handlers
- Add sync mode option: POST /api/v1/builds?mode=sync returns manifests directly
Critical Detail: Avoid code duplication by extracting shared business logic into service layer.
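The shared service layer could look something like the sketch below. It deliberately avoids tonic and any HTTP framework so that it stays self-contained; `BuildCore`, `BuildMode`, `BuildOutcome`, `Manifest`, `http_submit`, and `grpc_submit` are all hypothetical names, and the "build" is a placeholder that emits one manifest per requested partition.

```rust
/// Hypothetical stand-in for the proto manifest type.
#[derive(Debug, Clone, PartialEq)]
pub struct Manifest {
    pub partition_ref: String,
}

/// Whether the caller waits for manifests or only receives a build ID.
pub enum BuildMode {
    Async,
    Sync,
}

/// Result of submitting a build through the shared core.
#[derive(Debug, PartialEq)]
pub enum BuildOutcome {
    Accepted { build_id: String },
    Finished { build_id: String, manifests: Vec<Manifest> },
}

/// Protocol-agnostic business logic shared by both adapters.
#[derive(Default)]
pub struct BuildCore {
    next_id: u64,
}

impl BuildCore {
    pub fn submit(&mut self, partitions: &[String], mode: BuildMode) -> BuildOutcome {
        self.next_id += 1;
        let build_id = format!("build-{}", self.next_id);
        match mode {
            BuildMode::Async => BuildOutcome::Accepted { build_id },
            // Placeholder execution: one manifest per requested partition.
            BuildMode::Sync => BuildOutcome::Finished {
                build_id,
                manifests: partitions
                    .iter()
                    .map(|p| Manifest { partition_ref: p.clone() })
                    .collect(),
            },
        }
    }
}

/// Parse the `mode` query parameter; async stays the default for existing clients.
fn parse_mode(param: Option<&str>) -> Result<BuildMode, String> {
    match param {
        None | Some("async") => Ok(BuildMode::Async),
        Some("sync") => Ok(BuildMode::Sync),
        Some(other) => Err(format!("unknown mode: {}", other)),
    }
}

/// HTTP adapter: validates the query parameter, then delegates to the core.
pub fn http_submit(
    core: &mut BuildCore,
    partitions: &[String],
    mode_param: Option<&str>,
) -> Result<BuildOutcome, String> {
    let mode = parse_mode(mode_param)?;
    Ok(core.submit(partitions, mode))
}

/// gRPC adapter: the proto's build call is synchronous, so it always uses sync mode.
pub fn grpc_submit(core: &mut BuildCore, partitions: &[String]) -> BuildOutcome {
    core.submit(partitions, BuildMode::Sync)
}
```

Each adapter only translates protocol details (query parameters, response framing) and leaves build behavior in one place, which is the duplication the plan wants to avoid.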
Backward Compatibility
Philosophy: Never break existing clients during improvements
Strategy:
- New endpoints alongside existing ones
- API versioning (/api/v2) only for incompatible changes
- Feature flags for experimental functionality
- Deprecation notices with migration paths
Implementation Priority
Phase 1: Core Safety (Essential)
- Fix AnalyzeResponse type safety
- Add manifest retrieval endpoint
- Standardize error responses with request tracing
Rationale: These fix fundamental interface problems affecting all users.
Phase 2: Dashboard Features (High Value)
- Recent builds list with pagination
- Job metrics aggregation
- Enhanced build state tracking
Rationale: Enables operational monitoring and debugging workflows.
Phase 3: Advanced Features (Complete)
- Query execution interface
- gRPC service implementation
- Sync/async mode support
Rationale: Achieves full design specification compliance and protocol flexibility.
Critical Success Factors
- Type Safety: All responses use structured types, not generic JSON
- Interface Completeness: Every proto service method has HTTP equivalent
- Operational Readiness: Dashboard endpoints support real monitoring needs
- Zero Breakage: Existing clients continue working throughout transition
- Performance: No regression in build execution times
Risk Mitigation
Breaking Changes: Use versioning strategy and maintain parallel endpoints during transitions
Performance Impact: Implement caching for metrics aggregation, rate limiting for expensive queries
Complexity Growth: Extract shared business logic to avoid duplication between HTTP/gRPC handlers