databuild/plans/03-service-interface-refactor.md

4.8 KiB

Service Interface Refactor Plan

Motivation

The Build Graph Service HTTP API successfully implements core DataBuild functionality but has interface misalignments with the proto definitions that reduce type safety and limit functionality. Key issues:

  • Type Safety Loss: Analysis responses use serde_json::Value instead of structured JobGraph
  • Missing Manifest Access: Build responses only return request IDs, no way to retrieve final manifests
  • Incomplete Dashboard Support: Missing list/metrics endpoints from design specification
  • Inconsistent Error Handling: Different patterns between gRPC and HTTP

Core Interface Alignment

Fix Analysis Response Type Safety

Problem: AnalyzeResponse uses generic JSON, losing compile-time guarantees

// Current: loses type safety
pub job_graph: serde_json::Value

// Target: structured type
pub job_graph: JobGraph

Critical Detail: Ensure JobGraph implements Serialize properly and remove serde_json::to_value() conversion in handlers.

Add Manifest Retrieval

Problem: HTTP API returns build ID immediately but no way to get final manifests (unlike proto's synchronous GraphBuildResponse)

Solution: Add GET /api/v1/builds/:id/manifests endpoint

  • Store manifests in BuildRequestState during execution
  • Update execute command to capture manifests from job graph results
  • Enable both async (current) and manifest retrieval patterns

Missing Core Functionality

Dashboard Endpoints

The design specification requires but current API lacks:

  1. Recent Builds: GET /api/v1/builds?limit=N

    • Enable dashboard functionality
    • Support pagination for large build histories
  2. Job Metrics: GET /api/v1/jobs/:label/metrics

    • Aggregate success rates, durations, failure patterns
    • Essential for operational monitoring
  3. Query Interface: POST /api/v1/query

    • Advanced querying beyond predefined endpoints
    • SQL passthrough or structured query DSL

Implementation Strategy: Extend event log with aggregation queries, add proper pagination support.

Error Response Standardization

Problem: Current ErrorResponse {error: String} lacks structure and traceability

Target:

pub struct ErrorResponse {
    pub error: ErrorDetail,
    pub request_id: Option<String>,
    pub timestamp: i64,
}

Error Code Strategy:

  • INVALID_PARTITION_REF - Validation failures
  • JOB_LOOKUP_FAILED - Graph analysis issues
  • ANALYSIS_FAILED / EXECUTION_FAILED - Build process errors
  • BUILD_NOT_FOUND - Resource not found
  • INTERNAL_ERROR - Server errors

Critical: Add request ID middleware for tracing across async operations.

Protocol Compatibility Strategy

Dual Protocol Support

Narrative: Some clients need gRPC's performance/streaming, others prefer HTTP's simplicity

Approach:

  • Implement DataBuildService from proto using tonic
  • Share core logic between HTTP and gRPC handlers
  • Add sync mode option: POST /api/v1/builds?mode=sync returns manifests directly

Critical Detail: Avoid code duplication by extracting shared business logic into service layer.

Backward Compatibility

Philosophy: Never break existing clients during improvements

Strategy:

  • New endpoints alongside existing ones
  • API versioning (/api/v2) only for incompatible changes
  • Feature flags for experimental functionality
  • Deprecation notices with migration paths

Implementation Priority

Phase 1: Core Safety (Essential)

  1. Fix AnalyzeResponse type safety
  2. Add manifest retrieval endpoint
  3. Standardize error responses with request tracing

Rationale: These fix fundamental interface problems affecting all users.

Phase 2: Dashboard Features (High Value)

  1. Recent builds list with pagination
  2. Job metrics aggregation
  3. Enhanced build state tracking

Rationale: Enables operational monitoring and debugging workflows.

Phase 3: Advanced Features (Complete)

  1. Query execution interface
  2. gRPC service implementation
  3. Sync/async mode support

Rationale: Achieves full design specification compliance and protocol flexibility.

Critical Success Factors

  1. Type Safety: All responses use structured types, not generic JSON
  2. Interface Completeness: Every proto service method has HTTP equivalent
  3. Operational Readiness: Dashboard endpoints support real monitoring needs
  4. Zero Breakage: Existing clients continue working throughout transition
  5. Performance: No regression in build execution times

Risk Mitigation

Breaking Changes: Use versioning strategy and maintain parallel endpoints during transitions

Performance Impact: Implement caching for metrics aggregation, rate limiting for expensive queries

Complexity Growth: Extract shared business logic to avoid duplication between HTTP/gRPC handlers