databuild/plans/cli-service-build-unification.md
soaxelbrooke bfec05e065
Big change
2025-07-13 21:18:15 -07:00

CLI-Service Build Unification

Problem Statement

The current DataBuild architecture has significant duplication and architectural inconsistencies between CLI and Service build orchestration:

Current Duplication Issues

  1. Event Emission Logic: Service HTTP handlers and CLI binaries contain duplicate orchestration event emission code
  2. Mode Detection: Analysis and execution binaries (analyze.rs and execute.rs) use DATABUILD_CLI_MODE environment variable to conditionally emit different events
  3. Test Complexity: End-to-end tests must account for different event patterns between CLI and Service for identical logical operations

Specific Code References

  • CLI Mode Detection in Analysis: databuild/graph/analyze.rs:555-587 - Emits "Build request received" and "Starting build planning" events only in CLI mode
  • CLI Mode Detection in Execution: databuild/graph/execute.rs:413-428 and execute.rs:753-779 - Emits execution start/completion events only in CLI mode
  • Service Orchestration: databuild/service/handlers.rs - HTTP handlers emit orchestration events independently
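The mode-detection pattern referenced above can be pictured roughly as follows. This is a simplified, synchronous sketch and not the code in analyze.rs or execute.rs: `Event`, `cli_mode_enabled`, and this `analyze` signature are hypothetical, and treating DATABUILD_CLI_MODE as "set means enabled" is an assumption.

```rust
use std::env;

/// Hypothetical event type; the real event schema is not shown in this plan.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    BuildRequestReceived,
    PlanningStarted,
    PartitionScheduled(String),
}

/// Assumption: CLI mode is active whenever the variable is set to any value.
pub fn cli_mode_enabled() -> bool {
    env::var("DATABUILD_CLI_MODE").is_ok()
}

/// Domain logic that also decides whether to emit orchestration events.
/// The `cli_mode` branch is the coupling this plan proposes to remove.
pub fn analyze(partitions: &[&str], cli_mode: bool) -> Vec<Event> {
    let mut events = Vec::new();
    if cli_mode {
        // Orchestration events, emitted only when running as a CLI binary.
        events.push(Event::BuildRequestReceived);
        events.push(Event::PlanningStarted);
    }
    for p in partitions {
        // Domain events, emitted in both modes.
        events.push(Event::PartitionScheduled(p.to_string()));
    }
    events
}
```

In this sketch a caller would pass `cli_mode_enabled()` as the second argument, so the same function yields two different event streams depending on the environment.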

Architectural Problems

  1. Single Responsibility Violation: Analysis and execution binaries serve dual purposes as both shared library functions and CLI entry points
  2. Consistency Risk: Separate implementations of orchestration logic create risk of drift between CLI and Service behavior
  3. Maintenance Burden: Changes to orchestration requirements must be implemented in multiple places

Current Architecture Analysis

Service Flow

HTTP Request → Service Handler → Orchestration Events → Analysis → Execution → Completion Events

The Service has a natural coordination point in the HTTP handler that manages the entire build lifecycle and emits appropriate orchestration events.

CLI Flow

Shell Script → Analysis Binary (CLI mode, emits orchestration events) → Execution Binary (CLI mode, emits orchestration events)

The CLI lacks a natural coordination point, forcing the shared analysis/execution binaries to detect CLI mode and emit orchestration events themselves.

Event Flow Comparison

Service Events (coordinated):

  1. Build request received
  2. Starting build planning
  3. Analysis events (partitions scheduled, jobs configured)
  4. Starting build execution
  5. Execution events (jobs scheduled/completed, partitions available)
  6. Build request completed

CLI Events (mode-dependent):

  • Same events as Service, but emitted conditionally based on DATABUILD_CLI_MODE
  • Creates awkward coupling between orchestration concerns and domain logic
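The six-step Service ordering listed above can be expressed as a small check over an event stream. The sketch below is illustrative only: `Phase` and `follows_lifecycle` are hypothetical names, and mapping each real event to one of these phases is assumed to happen elsewhere.

```rust
/// Hypothetical phases, declared in the order the Service emits them.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Phase {
    Received,
    Planning,
    Analysis,
    ExecutionStarted,
    Execution,
    Completed,
}

/// Returns true when the stream starts with `Received`, ends with `Completed`,
/// and never moves backwards. Only `Analysis` and `Execution` may repeat.
/// It does not require every phase to be present.
pub fn follows_lifecycle(phases: &[Phase]) -> bool {
    let starts_ok = phases.first() == Some(&Phase::Received);
    let ends_ok = phases.last() == Some(&Phase::Completed);
    let ordered = phases.windows(2).all(|w| {
        let repeatable = matches!(w[0], Phase::Analysis | Phase::Execution);
        w[0] < w[1] || (repeatable && w[0] == w[1])
    });
    starts_ok && ends_ok && ordered
}
```

A check of this kind could be run against both CLI and Service output to confirm that they follow the same lifecycle.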

Proposed Shared Library Design

Core Orchestrator API

pub struct BuildOrchestrator {
    event_log: Box<dyn BuildEventLog>,
    build_request_id: String,
    requested_partitions: Vec<PartitionRef>,
}

impl BuildOrchestrator {
    pub fn new(
        event_log: Box<dyn BuildEventLog>, 
        build_request_id: String,
        requested_partitions: Vec<PartitionRef>
    ) -> Self;
    
    // Lifecycle events
    pub async fn start_build(&self) -> Result<(), OrchestrationError>;
    pub async fn start_planning(&self) -> Result<(), OrchestrationError>;
    pub async fn start_execution(&self) -> Result<(), OrchestrationError>;
    pub async fn complete_build(&self, result: BuildResult) -> Result<(), OrchestrationError>;
    
    // Domain events (pass-through to existing logic)
    pub async fn emit_partition_scheduled(&self, partition: &PartitionRef) -> Result<(), OrchestrationError>;
    pub async fn emit_job_scheduled(&self, job: &JobEvent) -> Result<(), OrchestrationError>;
    pub async fn emit_job_completed(&self, job: &JobEvent) -> Result<(), OrchestrationError>;
    pub async fn emit_partition_available(&self, partition: &PartitionEvent) -> Result<(), OrchestrationError>;
    pub async fn emit_delegation(&self, partition: &str, target_build: &str, message: &str) -> Result<(), OrchestrationError>;
}

pub enum BuildResult {
    Success { jobs_completed: usize },
    Failed { jobs_completed: usize, jobs_failed: usize },
    FailFast { trigger_job: String },
}

Event Emission Strategy

The orchestrator will emit standardized events at specific lifecycle points:

  1. Build Lifecycle Events: High-level orchestration (received, planning, executing, completed)
  2. Domain Events: Pass-through wrapper for existing analysis/execution events
  3. Consistent Timing: Routing all events through the orchestrator ensures proper sequencing
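A minimal synchronous sketch of this strategy is shown below, assuming a much simpler log interface than the real `BuildEventLog`. The `EventLog` trait, `MemoryLog`, and the string-based events are hypothetical stand-ins. The lifecycle messages reuse the event names from the Service flow above, while the "Partition scheduled" text is invented for illustration. The proposed API is async and returns `Result`, which this sketch omits.

```rust
/// Hypothetical, simplified stand-in for the `BuildEventLog` trait.
pub trait EventLog {
    fn append(&mut self, build_request_id: &str, message: &str);
}

/// In-memory log used only for this sketch.
#[derive(Default)]
pub struct MemoryLog {
    pub entries: Vec<(String, String)>,
}

impl EventLog for MemoryLog {
    fn append(&mut self, build_request_id: &str, message: &str) {
        self.entries
            .push((build_request_id.to_string(), message.to_string()));
    }
}

/// Every event, lifecycle or domain, goes through `emit`, so all events
/// carry the same build request ID and land in the log in call order.
pub struct Orchestrator<L: EventLog> {
    log: L,
    build_request_id: String,
}

impl<L: EventLog> Orchestrator<L> {
    pub fn new(log: L, build_request_id: &str) -> Self {
        Orchestrator { log, build_request_id: build_request_id.to_string() }
    }

    fn emit(&mut self, message: &str) {
        self.log.append(&self.build_request_id, message);
    }

    // Lifecycle events.
    pub fn start_build(&mut self) { self.emit("Build request received"); }
    pub fn start_planning(&mut self) { self.emit("Starting build planning"); }
    pub fn start_execution(&mut self) { self.emit("Starting build execution"); }
    pub fn complete_build(&mut self) { self.emit("Build request completed"); }

    // Domain event pass-through.
    pub fn emit_partition_scheduled(&mut self, partition: &str) {
        self.emit(&format!("Partition scheduled: {}", partition));
    }

    /// Hands the log back so callers can inspect what was emitted.
    pub fn into_log(self) -> L {
        self.log
    }
}
```

Because the log is a type parameter, the Service and the CLI wrapper could each supply their own backend while sharing the emission code.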

Error Handling

#[derive(Debug, thiserror::Error)]
pub enum OrchestrationError {
    #[error("Event log error: {0}")]
    EventLog(#[from] databuild::event_log::Error),
    
    #[error("Build coordination error: {0}")]
    Coordination(String),
    
    #[error("Invalid build state transition: {current} -> {requested}")]
    InvalidStateTransition { current: String, requested: String },
}
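The `InvalidStateTransition` variant implies that the orchestrator tracks a build state and rejects out-of-order calls. The plan does not specify the states or the allowed transitions, so the sketch below assumes a simple linear lifecycle. `BuildState` and `transition` are hypothetical names, and the error is implemented with the standard library only rather than `thiserror`.

```rust
use std::fmt;

/// Hypothetical build states; the real set is not defined in this plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BuildState {
    Created,
    Received,
    Planning,
    Executing,
    Completed,
}

#[derive(Debug, PartialEq, Eq)]
pub struct InvalidStateTransition {
    pub current: BuildState,
    pub requested: BuildState,
}

impl fmt::Display for InvalidStateTransition {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "Invalid build state transition: {:?} -> {:?}",
            self.current, self.requested
        )
    }
}

impl std::error::Error for InvalidStateTransition {}

/// Assumed transition table: each state may only advance to the next one.
pub fn transition(
    current: BuildState,
    requested: BuildState,
) -> Result<BuildState, InvalidStateTransition> {
    use BuildState::*;
    let allowed = matches!(
        (current, requested),
        (Created, Received) | (Received, Planning) | (Planning, Executing) | (Executing, Completed)
    );
    if allowed {
        Ok(requested)
    } else {
        Err(InvalidStateTransition { current, requested })
    }
}
```

A real implementation would likely also need a path from the earlier states to a failed completion, which this table leaves out.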

Testing Interface

#[cfg(test)]
impl BuildOrchestrator {
    pub fn with_mock_event_log(build_request_id: String) -> (Self, MockEventLog);
    pub fn emitted_events(&self) -> &[BuildEvent];
}
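One way the `with_mock_event_log` helper could work is to hand back a second handle to the same in-memory buffer the orchestrator writes to. The sketch below assumes this shared-handle design. The `MockEventLog` internals, the `BuildEvent` fields, and the synchronous `start_build` are all hypothetical simplifications.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical event shape, reduced to two fields for the sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct BuildEvent {
    pub build_request_id: String,
    pub message: String,
}

/// Cloneable handle; all clones share one underlying buffer.
#[derive(Clone, Default)]
pub struct MockEventLog {
    events: Arc<Mutex<Vec<BuildEvent>>>,
}

impl MockEventLog {
    pub fn append(&self, event: BuildEvent) {
        self.events.lock().unwrap().push(event);
    }

    /// Returns a copy of everything recorded so far.
    pub fn snapshot(&self) -> Vec<BuildEvent> {
        self.events.lock().unwrap().clone()
    }
}

pub struct Orchestrator {
    log: MockEventLog,
    build_request_id: String,
}

impl Orchestrator {
    /// Returns the orchestrator plus a handle the test keeps for assertions.
    pub fn with_mock_event_log(build_request_id: &str) -> (Self, MockEventLog) {
        let log = MockEventLog::default();
        let orchestrator = Orchestrator {
            log: log.clone(),
            build_request_id: build_request_id.to_string(),
        };
        (orchestrator, log)
    }

    pub fn start_build(&self) {
        self.log.append(BuildEvent {
            build_request_id: self.build_request_id.clone(),
            message: "Build request received".to_string(),
        });
    }
}
```

Returning a snapshot copy rather than a borrowed slice avoids holding the lock across assertions, at the cost of one clone per inspection.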

Implementation Phases

Phase 1: Create Shared Orchestration Library

Files to Create:

  • databuild/orchestration/mod.rs - Core orchestrator implementation
  • databuild/orchestration/events.rs - Event type definitions and helpers
  • databuild/orchestration/error.rs - Error types
  • databuild/orchestration/tests.rs - Unit tests for orchestrator

Key Implementation Points:

  • Extract common event emission patterns from Service and CLI
  • Ensure orchestrator is async-compatible with existing event log interface
  • Design for testability with dependency injection

Phase 2: Refactor Service to Use Orchestrator

Files to Modify:

  • databuild/service/handlers.rs - Replace direct event emission with orchestrator calls
  • databuild/service/mod.rs - Integration with orchestrator lifecycle

Implementation:

  • Replace existing event emission code directly with orchestrator calls
  • Ensure proper error handling and async integration

Phase 3: Create New CLI Wrapper

Files to Create:

  • databuild/cli/main.rs - New CLI entry point using orchestrator
  • databuild/cli/error.rs - CLI-specific error handling

Implementation:

// databuild/cli/main.rs
use uuid::Uuid; // used below to generate a build request ID when none is supplied

#[tokio::main]
async fn main() -> Result<(), CliError> {
    let args = parse_cli_args();
    let event_log = create_build_event_log(&args.event_log_uri).await?;
    let build_request_id = args.build_request_id.unwrap_or_else(|| Uuid::new_v4().to_string());
    
    let orchestrator = BuildOrchestrator::new(event_log, build_request_id, args.partitions.clone());
    
    // Emit orchestration events
    orchestrator.start_build().await?;
    orchestrator.start_planning().await?;
    
    // Run analysis
    let graph = run_analysis(&args.partitions, &orchestrator).await?;
    
    orchestrator.start_execution().await?;
    
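    // Run execution. If run_execution returns Err, `?` exits here and
    // complete_build is not reached, so no completion event is emitted.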
    let result = run_execution(graph, &orchestrator).await?;
    
    orchestrator.complete_build(result).await?;
    
    Ok(())
}
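The wrapper above passes whatever `run_execution` returns straight to `complete_build`, so execution has to fold per-job outcomes into a `BuildResult`. A sketch of one possible folding rule follows. `JobOutcome` and `summarize` are hypothetical, and the reading of `FailFast` as "stop at the first failed job when fail-fast is enabled" is an assumption, not something the plan states.

```rust
/// Mirrors the `BuildResult` enum proposed in this plan.
#[derive(Debug, PartialEq)]
pub enum BuildResult {
    Success { jobs_completed: usize },
    Failed { jobs_completed: usize, jobs_failed: usize },
    FailFast { trigger_job: String },
}

/// Hypothetical per-job outcome, carrying the job name.
#[derive(Debug, Clone, PartialEq)]
pub enum JobOutcome {
    Completed(String),
    Failed(String),
}

/// Folds job outcomes, in order, into a single build result.
pub fn summarize(outcomes: &[JobOutcome], fail_fast: bool) -> BuildResult {
    let mut completed = 0;
    let mut failed = 0;
    for outcome in outcomes {
        match outcome {
            JobOutcome::Completed(_) => completed += 1,
            // Assumed semantics: the first failure ends the build in fail-fast mode.
            JobOutcome::Failed(job) if fail_fast => {
                return BuildResult::FailFast { trigger_job: job.clone() };
            }
            JobOutcome::Failed(_) => failed += 1,
        }
    }
    if failed == 0 {
        BuildResult::Success { jobs_completed: completed }
    } else {
        BuildResult::Failed { jobs_completed: completed, jobs_failed: failed }
    }
}
```

Reporting job failures as a `BuildResult` value rather than an `Err` keeps `complete_build` on the normal path, so a completion event is still emitted for failed builds.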

Phase 4: Remove CLI Mode Detection

Files to Modify:

  • databuild/graph/analyze.rs - Remove lines 555-587 (CLI mode orchestration events)
  • databuild/graph/execute.rs - Remove lines 413-428 and 753-779 (CLI mode orchestration events)

Verification:

  • Analysis and execution binaries become pure domain functions
  • No more environment variable mode detection
  • All orchestration handled by wrapper/service
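After this phase, a domain function might look like the sketch below: it receives an event sink from its caller and emits only domain events, with no knowledge of CLI or Service mode. The `Event` type, the closure-based sink, and `plan_build` are hypothetical. The real functions would presumably take the orchestrator instead.

```rust
/// Hypothetical event type covering both orchestration and domain events.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    BuildRequestReceived,
    PlanningStarted,
    PartitionScheduled(String),
}

/// Pure domain function: emits only domain events through the sink it is
/// handed, and reads no environment variables.
pub fn analyze(partitions: &[&str], sink: &mut dyn FnMut(Event)) {
    for p in partitions {
        sink(Event::PartitionScheduled(p.to_string()));
    }
}

/// Wrapper-side coordination, shared by CLI and Service in this sketch:
/// the caller emits orchestration events, then delegates to the domain code.
pub fn plan_build(partitions: &[&str]) -> Vec<Event> {
    let mut events = Vec::new();
    events.push(Event::BuildRequestReceived);
    events.push(Event::PlanningStarted);
    analyze(partitions, &mut |e: Event| events.push(e));
    events
}
```

Calling `analyze` directly yields only partition events, which keeps it usable as a library function from any entry point.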

Phase 5: Update Bazel Rules

Files to Modify:

  • databuild/rules.bzl - Update _databuild_graph_build_impl to use new CLI wrapper instead of direct analysis/execution pipeline

Before:

$(rlocation _main/{analyze_path}) $@ | $(rlocation _main/{exec_path})

After:

$(rlocation _main/{cli_wrapper_path}) $@

Phase 6: Update Tests

Files to Modify:

  • tests/end_to_end/simple_test.sh - Remove separate CLI/Service event validation
  • tests/end_to_end/podcast_simple_test.sh - Same simplification
  • All tests - Update to expect identical event patterns from CLI and Service

Migration Strategy

Direct Replacement Approach

Since we don't need backwards compatibility, we can implement a direct replacement:

  • Replace existing CLI mode detection immediately
  • Refactor Service handlers to use orchestrator directly
  • Update Bazel rules to use new CLI wrapper
  • Update tests to expect unified behavior

Testing Strategy

  1. Unit Tests: Comprehensive orchestrator testing with mock event logs
  2. Integration Tests: Existing end-to-end tests pass with unified implementation
  3. Event Verification: Ensure orchestrator produces expected events for all scenarios
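For the event verification step, one option is to compare CLI and Service logs by event kind only, ignoring fields that legitimately differ between runs. The sketch below assumes events carry a build request ID, a timestamp, and a kind string. `RecordedEvent` and `first_divergence` are hypothetical and do not reflect the real event log schema.

```rust
/// Hypothetical recorded event; IDs and timestamps differ between runs.
#[derive(Debug, Clone)]
pub struct RecordedEvent {
    pub build_request_id: String,
    pub timestamp_ms: u64,
    pub kind: String,
}

/// Projects events down to the part that should match across interfaces.
pub fn event_kinds(events: &[RecordedEvent]) -> Vec<&str> {
    events.iter().map(|e| e.kind.as_str()).collect()
}

/// Returns the index of the first position where the two kind sequences
/// differ, or `None` when they are identical. If one sequence is a strict
/// prefix of the other, the index is the length of the shorter one.
pub fn first_divergence(cli: &[RecordedEvent], service: &[RecordedEvent]) -> Option<usize> {
    let (a, b) = (event_kinds(cli), event_kinds(service));
    if a == b {
        return None;
    }
    let common = a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count();
    Some(common)
}
```

Reporting the first diverging index, rather than a plain boolean, gives a failing end-to-end test a concrete place to start looking.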

File Changes Required

New Files

  • databuild/orchestration/mod.rs - 200+ lines, core orchestrator
  • databuild/orchestration/events.rs - 100+ lines, event helpers
  • databuild/orchestration/error.rs - 50+ lines, error types
  • databuild/orchestration/tests.rs - 300+ lines, comprehensive tests
  • databuild/cli/main.rs - 150+ lines, CLI wrapper
  • databuild/cli/error.rs - 50+ lines, CLI error handling

Modified Files

  • databuild/service/handlers.rs - Replace ~50 lines of event emission with orchestrator calls
  • databuild/graph/analyze.rs - Remove ~30 lines of CLI mode detection
  • databuild/graph/execute.rs - Remove ~60 lines of CLI mode detection
  • databuild/rules.bzl - Update ~10 lines for new CLI wrapper
  • tests/end_to_end/simple_test.sh - Simplify ~20 lines of event validation
  • tests/end_to_end/podcast_simple_test.sh - Same simplification

Build Configuration

  • Update databuild/BUILD.bazel to include orchestration module
  • Update databuild/cli/BUILD.bazel for new CLI binary
  • Modify example graphs to use new CLI wrapper

Benefits & Risk Analysis

Benefits

  1. Maintainability: Single source of truth for orchestration logic eliminates duplication
  2. Consistency: Guaranteed identical events across CLI and Service interfaces
  3. Extensibility: Foundation for SDK, additional CLI commands, monitoring integration
  4. Testing: Simplified test expectations, better unit test coverage of orchestration
  5. Architecture: Clean separation between orchestration and domain logic

Implementation Risks

  1. Regression: Changes to critical path could introduce subtle bugs
  2. Performance: Additional abstraction layer could impact latency
  3. Integration: Bazel build changes could break example workflows

Risk Mitigation

  1. Phased Implementation: Implement in stages with verification at each step
  2. Comprehensive Testing: Thorough unit and integration testing
  3. Event Verification: Ensure identical event patterns to current behavior

Future Architecture Extensions

SDK Integration

The unified orchestrator provides a natural integration point for external SDKs:

// Future SDK usage
let databuild_client = DatabuildClient::new(endpoint);
let orchestrator = databuild_client.create_orchestrator(partitions).await?;

orchestrator.start_build().await?;
let result = databuild_client.execute_build(orchestrator).await?;

Additional CLI Commands

The orchestrator enables consistent event emission across CLI commands:

databuild validate --partitions "data/users" --dry-run
databuild status --build-id "abc123"  
databuild retry --build-id "abc123" --failed-jobs-only

Monitoring Integration

Standardized events provide a foundation for observability:

impl BuildOrchestrator {
    pub fn with_tracing_span(&self, span: tracing::Span) -> Self;
    pub fn emit_otel_metrics(&self) -> Result<(), Error>;
}

CI/CD Pipeline Integration

Orchestrator events enable standardized build reporting across environments:

# GitHub Actions integration
- name: DataBuild
  uses: databuild/github-action@v1
  with:
    partitions: "data/daily_reports"
    event-log: "${{ env.DATABUILD_EVENT_LOG }}"
    # Automatic event collection for build status reporting

Conclusion

This unification addresses fundamental architectural inconsistencies while providing a foundation for future extensibility. The phased implementation, with verification at each step, limits risk. Because backward compatibility is not required, each phase can replace the old code path directly.

The shared orchestrator eliminates the current awkward CLI mode detection pattern and establishes DataBuild as a platform that can support multiple interfaces with guaranteed consistency.