Web Server Implementation Plan

Architecture Summary

Concurrency Model: Event Log Separation

  • Orchestrator runs synchronously in a dedicated thread, owns the BEL (build event log) exclusively
  • Web server reads from shared BEL storage, sends write commands via channel
  • No locks on hot path, orchestrator stays single-threaded
  • Eventual consistency for reads (acceptable since builds take time anyway)

Daemon Model:

  • Server binary started manually from workspace root (for now)
  • Server tracks last request time, shuts down after idle timeout (default: 3 hours)
  • HTTP REST API served on localhost on a random port
  • Future: CLI can auto-discover/start server

Thread Model

Main Process
├─ HTTP Server (tokio multi-threaded runtime)
│  ├─ Request handlers (async, read from BEL storage)
│  └─ Command sender (send writes to orchestrator)
└─ Orchestrator Thread (std::thread, synchronous)
   ├─ Receives commands via mpsc channel
   ├─ Owns BEL (exclusive mutable access)
   └─ Runs existing step() loop

Read Path (Low Latency):

  1. HTTP request → Axum handler
  2. Read events from shared BEL storage (no lock contention)
  3. Reconstruct BuildState from events (can cache this)
  4. Return response

Write Path (Strong Consistency):

  1. HTTP request → Axum handler
  2. Send command via channel to orchestrator
  3. Orchestrator processes command in its thread
  4. Reply sent back via oneshot channel
  5. Return response

Why This Works:

  • Orchestrator remains completely synchronous (no refactoring needed)
  • Reads scale horizontally (multiple handlers, no locks)
  • Writes are serialized through orchestrator (consistent with current model)
  • Event sourcing means reads can be eventually consistent

Phase 1: Foundation - Make BEL Storage Thread-Safe

Goal: Allow BEL storage to be safely shared between orchestrator and web server

Tasks:

  1. Add Send + Sync bounds to BELStorage trait
  2. Wrap SqliteBELStorage::connection in Arc<Mutex<Connection>> or use r2d2 pool
  3. Add read-only methods to BELStorage:
    • list_events(offset: usize, limit: usize) -> Vec<DataBuildEvent>
    • get_event(event_id: u64) -> Option<DataBuildEvent>
    • latest_event_id() -> u64
  4. Add builder method to reconstruct BuildState from events:
    • BuildState::from_events(events: &[DataBuildEvent]) -> Self
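
A rough sketch of the shape tasks 3-4 are aiming for; the trait keeps its existing write methods, BuildState is assumed to have a Default (or equivalent empty constructor), and apply_event stands in for however BuildState currently folds in events:

    pub trait BELStorage: Send + Sync {
        // Existing append/write methods stay as-is...

        // Read-only accessors used by the web server
        fn list_events(&self, offset: usize, limit: usize) -> Vec<DataBuildEvent>;
        fn get_event(&self, event_id: u64) -> Option<DataBuildEvent>;
        fn latest_event_id(&self) -> u64;
    }

    impl BuildState {
        /// Rebuild state by folding every event in order; never mutates storage.
        pub fn from_events(events: &[DataBuildEvent]) -> Self {
            let mut state = BuildState::default();
            for event in events {
                state.apply_event(event); // placeholder for the existing reducer
            }
            state
        }
    }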

Files Modified:

  • databuild/build_event_log.rs - update trait and storage impls
  • databuild/build_state.rs - add from_events() builder

Acceptance Criteria:

  • BELStorage trait has Send + Sync bounds
  • Can clone Arc<SqliteBELStorage> and use from multiple threads
  • Can reconstruct BuildState from events without mutating storage

Phase 2: Web Server - HTTP API with Axum

Goal: HTTP server serving read/write APIs

Tasks:

  1. Add dependencies to MODULE.bazel:

    crate.spec(package = "tokio", features = ["full"], version = "1.0")
    crate.spec(package = "axum", version = "0.7")
    crate.spec(package = "tower", version = "0.4")
    crate.spec(package = "tower-http", features = ["trace", "cors"], version = "0.5")
    
  2. Create databuild/http_server.rs module with:

    • AppState struct holding:
      • bel_storage: Arc<dyn BELStorage> - shared read access
      • command_tx: mpsc::Sender<Command> - channel to orchestrator
      • last_request_time: Arc<AtomicU64> - for idle tracking
    • Axum router with all endpoints
    • Handler functions delegating to existing api_handle_* methods
  3. API Endpoints (a router sketch mapping these to handlers follows this task list):

    GET  /health                       → health check
    GET  /api/wants                    → list_wants
    POST /api/wants                    → create_want
    GET  /api/wants/:id                → get_want
    DELETE /api/wants/:id              → cancel_want
    GET  /api/partitions               → list_partitions
    GET  /api/job_runs                 → list_job_runs
    GET  /api/job_runs/:id/logs/stdout → stream_logs (stub)
    
  4. Handler pattern (reads):

    async fn list_wants(
        State(state): State<AppState>,
        Query(params): Query<ListWantsParams>,
    ) -> Json<ListWantsResponse> {
        // Read events from shared storage (no lock contention with the orchestrator)
        let events = state.bel_storage.list_events(0, 10000);

        // Reconstruct state from the event log
        let build_state = BuildState::from_events(&events);

        // Delegate to the existing API method
        Json(build_state.list_wants(&params.into()))
    }
    
  5. Handler pattern (writes):

    async fn create_want(
        State(state): State<AppState>,
        Json(req): Json<CreateWantRequest>,
    ) -> Result<Json<CreateWantResponse>, StatusCode> {
        // Send command to the orchestrator thread
        let (reply_tx, reply_rx) = oneshot::channel();
        state.command_tx
            .send(Command::CreateWant(req, reply_tx))
            .await
            .map_err(|_| StatusCode::SERVICE_UNAVAILABLE)?;

        // Wait for the orchestrator's reply
        let response = reply_rx.await.map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;
        Ok(Json(response))
    }
    

Files Created:

  • databuild/http_server.rs - new module

Files Modified:

  • databuild/lib.rs - add pub mod http_server;
  • MODULE.bazel - add dependencies

Acceptance Criteria:

  • Server starts on localhost random port, prints "Listening on http://127.0.0.1:XXXXX"
  • All read endpoints return correct JSON responses
  • Write endpoints return stub responses (Phase 4 will connect to orchestrator)

Phase 3: CLI - HTTP Client

Goal: CLI that sends HTTP requests to running server

Tasks:

  1. Add dependencies to MODULE.bazel:

    crate.spec(package = "clap", features = ["derive"], version = "4.0")
    crate.spec(package = "reqwest", features = ["blocking", "json"], version = "0.11")
    
  2. Create databuild/bin/databuild.rs main binary:

    #[derive(Parser)]
    #[command(name = "databuild")]
    enum Cli {
        /// Start the databuild server
        Serve(ServeArgs),
    
        /// Create a want for partitions
        Build(BuildArgs),
    
    /// Want operations
    #[command(subcommand)]
    Want(WantCommand),
    
        /// Stream job run logs
        Logs(LogsArgs),
    }
    
    #[derive(Args)]
    struct ServeArgs {
        #[arg(long, default_value = "8080")]
        port: u16,
    }
    
    #[derive(Subcommand)]
    enum WantCommand {
        Create(CreateWantArgs),
        List,
        Get { want_id: String },
        Cancel { want_id: String },
    }
    
  3. Server address discovery:

    • For now: hardcode http://localhost:8080 or accept --server-url flag
    • Future: read from .databuild/server.json file
  4. HTTP client implementation:

    fn list_wants(server_url: &str) -> Result<Vec<WantDetail>> {
        let client = reqwest::blocking::Client::new();
        let resp = client.get(&format!("{}/api/wants", server_url))
            .send()?
            .json::<ListWantsResponse>()?;
        Ok(resp.data)
    }
    
  5. Commands:

    • databuild serve --port 8080 - Start server (blocks)
    • databuild build part1 part2 - Create want for partitions
    • databuild want list - List all wants
    • databuild want get <id> - Get specific want
    • databuild want cancel <id> - Cancel want
    • databuild logs <job_run_id> - Stream logs (stub)
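
A sketch of the write side of the client, mirroring list_wants in task 4 (the CreateWantRequest field name is illustrative, not checked against the real struct):

    fn create_want(server_url: &str, partitions: Vec<String>) -> Result<CreateWantResponse> {
        let client = reqwest::blocking::Client::new();
        let req = CreateWantRequest { partitions }; // field name is a placeholder
        let resp = client
            .post(format!("{}/api/wants", server_url))
            .json(&req)
            .send()?
            .error_for_status()?
            .json::<CreateWantResponse>()?;
        Ok(resp)
    }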

Files Created:

  • databuild/bin/databuild.rs - new CLI binary

Files Modified:

  • databuild/BUILD.bazel - add rust_binary target for databuild CLI

Acceptance Criteria:

  • Can run databuild serve to start server
  • Can run databuild want list in another terminal and see wants
  • Commands print pretty JSON or formatted tables

Phase 4: Orchestrator Integration - Command Channel

Goal: Connect orchestrator to web server via message passing

Tasks:

  1. Create databuild/commands.rs with command enum:

    pub enum Command {
        CreateWant(CreateWantRequest, oneshot::Sender<CreateWantResponse>),
        CancelWant(CancelWantRequest, oneshot::Sender<CancelWantResponse>),
        // Only write operations need commands
    }
    
  2. Update Orchestrator:

    • Add command_rx: mpsc::Receiver<Command> field
    • In step() method, before polling:
      // Process all pending commands
      while let Ok(cmd) = self.command_rx.try_recv() {
          match cmd {
              Command::CreateWant(req, reply) => {
                  let resp = self.bel.api_handle_want_create(req);
                  let _ = reply.send(resp); // Ignore send errors
              }
              // ... other commands
          }
      }
      
  3. Create server startup function in http_server.rs:

    pub fn start_server(
        bel_storage: Arc<dyn BELStorage>,
        port: u16,
    ) -> (std::thread::JoinHandle<()>, mpsc::Sender<Command>) {
        let (cmd_tx, cmd_rx) = mpsc::channel(100);

        // Spawn orchestrator in a background thread
        let orch_bel = bel_storage.clone();
        let _orch_handle = std::thread::spawn(move || {
            let mut orch = Orchestrator::new_with_commands(orch_bel, cmd_rx);
            orch.join().unwrap();
        });

        // Run the HTTP server on its own thread so the tokio runtime
        // outlives this function instead of being dropped on return
        let command_tx = cmd_tx.clone();
        let http_handle = std::thread::spawn(move || {
            let runtime = tokio::runtime::Runtime::new().unwrap();
            runtime.block_on(async move {
                let app_state = AppState {
                    bel_storage,
                    command_tx,
                    last_request_time: Arc::new(AtomicU64::new(0)),
                };

                let app = create_router(app_state);
                let addr = SocketAddr::from(([127, 0, 0, 1], port));
                let listener = tokio::net::TcpListener::bind(addr).await.unwrap();
                axum::serve(listener, app).await.unwrap();
            });
        });

        (http_handle, cmd_tx)
    }
    
  4. Update databuild serve command to use start_server()
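
Task 4 could then look roughly like this in the CLI's main, with the SqliteBELStorage constructor and database path as placeholders:

    Cli::Serve(args) => {
        let bel_storage: Arc<dyn BELStorage> =
            Arc::new(SqliteBELStorage::open(".databuild/bel.sqlite")?); // path is illustrative
        let (http_handle, _cmd_tx) = start_server(bel_storage, args.port);

        // Block until the server thread exits (Ctrl-C, or idle shutdown once Phase 5 lands)
        http_handle.join().expect("server thread panicked");
    }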

Files Created:

  • databuild/commands.rs - new module

Files Modified:

  • databuild/orchestrator.rs - accept command channel, process in step()
  • databuild/http_server.rs - send commands for writes
  • databuild/bin/databuild.rs - use start_server() in serve command

Acceptance Criteria:

  • Creating a want via HTTP actually creates it in BuildState
  • Orchestrator processes commands without blocking its main loop
  • Can observe wants being scheduled into job runs

Phase 5: Daemon Lifecycle - Auto-Shutdown

Goal: Server shuts down gracefully after idle timeout

Tasks:

  1. Update AppState to track last request time:

    pub struct AppState {
        bel_storage: Arc<dyn BELStorage>,
        command_tx: mpsc::Sender<Command>,
        last_request_time: Arc<AtomicU64>, // epoch millis
        shutdown_tx: broadcast::Sender<()>,
    }
    
  2. Add Tower middleware to update timestamp:

    // Attached to the router with axum::middleware::from_fn_with_state(state, ...)
    async fn update_last_request_time(
        State(state): State<AppState>,
        req: Request,
        next: Next,
    ) -> Response {
        state.last_request_time.store(
            SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64,
            Ordering::Relaxed,
        );
        next.run(req).await
    }
    
  3. Background idle checker task:

    tokio::spawn(async move {
        let idle_timeout = Duration::from_secs(3 * 60 * 60); // 3 hours
    
        loop {
            tokio::time::sleep(Duration::from_secs(60)).await;
    
            let last_request = state.last_request_time.load(Ordering::Relaxed);
            let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64;
    
            if now - last_request > idle_timeout.as_millis() as u64 {
                eprintln!("Server idle for {} hours, shutting down",
                    idle_timeout.as_secs() / 3600);
                shutdown_tx.send(()).unwrap();
                break;
            }
        }
    });
    
  4. Graceful shutdown handling:

    let app = create_router(state);
    let listener = tokio::net::TcpListener::bind(addr).await?;
    axum::serve(listener, app)
        .with_graceful_shutdown(async move {
            shutdown_rx.recv().await.ok();
        })
        .await?;
    
  5. Cleanup on shutdown:

    • Orchestrator: finish current step, don't start new one
    • HTTP server: stop accepting new connections, finish in-flight requests
    • Log: "Shutdown complete"
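
One way to wire the orchestrator side, assuming the shutdown signal is also surfaced to the orchestrator thread as a shared AtomicBool (names are illustrative):

    pub fn run(&mut self) {
        // Finish the step currently in flight, but never start another
        // one once the shutdown flag has been set.
        while !self.shutdown.load(Ordering::Relaxed) {
            self.step();
        }
        eprintln!("Orchestrator: shutdown complete");
    }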

Files Modified:

  • databuild/http_server.rs - add idle tracking, shutdown logic
  • databuild/orchestrator.rs - accept shutdown signal, check before each step

Acceptance Criteria:

  • Server shuts down after configured idle timeout
  • In-flight requests complete successfully during shutdown
  • Shutdown is logged clearly

Phase 6: Testing & Polish

Goal: End-to-end testing and production readiness

Tasks:

  1. Integration tests:

    #[test]
    fn test_server_lifecycle() {
        // Start server
        let (handle, port) = start_test_server();
    
        // Make requests
        let wants = reqwest::blocking::get(
            &format!("http://localhost:{}/api/wants", port)
        ).unwrap().json::<ListWantsResponse>().unwrap();
    
        // Stop server
        handle.shutdown();
    }
    
  2. Error handling improvements (see the error type sketch after this list):

    • Proper HTTP status codes (400, 404, 500)
    • Structured error responses:
      {"error": "Want not found", "want_id": "abc123"}
      
    • Add tracing crate for structured logging
  3. Add CORS middleware for web app:

    let cors = CorsLayer::new()
        .allow_origin("http://localhost:3000".parse::<HeaderValue>().unwrap())
        .allow_methods([Method::GET, Method::POST, Method::DELETE]);
    
    app.layer(cors)
    
  4. Health check endpoint:

    async fn health() -> &'static str {
        "OK"
    }
    
  5. Optional: Metrics endpoint (prometheus format):

    async fn metrics() -> String {
        format!(
            "# HELP databuild_wants_total Total number of wants\n\
             databuild_wants_total {}\n\
             # HELP databuild_job_runs_total Total number of job runs\n\
             databuild_job_runs_total {}\n",
            want_count, job_run_count
        )
    }
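
For task 2, one common Axum pattern is a small error type that owns both the status code and the structured JSON body (this assumes serde_json, which the JSON error shape above already implies):

    use axum::{http::StatusCode, response::{IntoResponse, Response}, Json};
    use serde_json::json;

    pub enum ApiError {
        NotFound(String),   // e.g. unknown want_id -> 404
        BadRequest(String), // malformed request -> 400
        Internal(String),   // everything else -> 500
    }

    impl IntoResponse for ApiError {
        fn into_response(self) -> Response {
            let (status, message) = match self {
                ApiError::NotFound(m) => (StatusCode::NOT_FOUND, m),
                ApiError::BadRequest(m) => (StatusCode::BAD_REQUEST, m),
                ApiError::Internal(m) => (StatusCode::INTERNAL_SERVER_ERROR, m),
            };
            (status, Json(json!({ "error": message }))).into_response()
        }
    }

Handlers can then return Result<Json<T>, ApiError> and get consistent status codes and bodies for free.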
    

Files Created:

  • databuild/tests/http_integration_test.rs - integration tests

Files Modified:

  • databuild/http_server.rs - add CORS, health, metrics, better errors
  • MODULE.bazel - add tracing dependency

Acceptance Criteria:

  • All endpoints have proper error handling
  • CORS works for web app development
  • Health check returns 200 OK
  • Integration tests pass

Future Enhancements (Not in Initial Plan)

Workspace Auto-Discovery

  • Walk up directory tree looking for .databuild/ marker
  • Store server metadata in .databuild/server.json:
    {
      "pid": 12345,
      "port": 54321,
      "started_at": "2025-01-22T10:30:00Z",
      "workspace_root": "/Users/stuart/Projects/databuild"
    }
    
  • CLI auto-starts server if not running
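
A sketch of the directory walk, using only the standard library:

    use std::path::{Path, PathBuf};

    /// Walk up from `start` until a directory containing `.databuild/` is found.
    fn find_workspace_root(start: &Path) -> Option<PathBuf> {
        let mut dir = start.to_path_buf();
        loop {
            if dir.join(".databuild").is_dir() {
                return Some(dir);
            }
            if !dir.pop() {
                return None;
            }
        }
    }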

Log Streaming (SSE)

  • Implement GET /api/job_runs/:id/logs/stdout?follow=true
  • Use Server-Sent Events for streaming
  • Integrate with FileLogStore from logging.md plan
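
A rough shape for the SSE handler, assuming a futures (or tokio-stream) dependency and some log source that can yield stdout lines; neither exists yet, so this is only a placeholder:

    use std::convert::Infallible;
    use axum::{extract::Path, response::sse::{Event, KeepAlive, Sse}};
    use futures::stream::{self, Stream};

    async fn stream_stdout(
        Path(job_run_id): Path<String>,
    ) -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
        // Placeholder: a real implementation would tail FileLogStore for job_run_id
        let lines = vec![format!("logs for {} not implemented yet", job_run_id)];
        let stream = stream::iter(lines.into_iter().map(|line| Ok(Event::default().data(line))));
        Sse::new(stream).keep_alive(KeepAlive::default())
    }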

State Caching

  • Cache reconstructed BuildState for faster reads
  • Invalidate cache when new events arrive
  • Use tokio::sync::RwLock<Option<(u64, BuildState)>> where u64 is latest_event_id
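
A sketch of the cached read path under those assumptions (also assumes BuildState implements Clone and that AppState gains a state_cache field of that type):

    async fn load_build_state(state: &AppState) -> BuildState {
        let latest = state.bel_storage.latest_event_id();

        // Fast path: the cache is still current
        if let Some((cached_id, cached)) = state.state_cache.read().await.as_ref() {
            if *cached_id == latest {
                return cached.clone();
            }
        }

        // Slow path: rebuild from the event log and refresh the cache
        let events = state.bel_storage.list_events(0, usize::MAX);
        let rebuilt = BuildState::from_events(&events);
        *state.state_cache.write().await = Some((latest, rebuilt.clone()));
        rebuilt
    }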

gRPC Support (If Needed)

  • Add Tonic alongside Axum
  • Share same orchestrator/command channel
  • Useful for language-agnostic clients

Dependencies Summary

New dependencies to add to MODULE.bazel:

# Async runtime
crate.spec(package = "tokio", features = ["full"], version = "1.0")

# Web framework
crate.spec(package = "axum", version = "0.7")
crate.spec(package = "tower", version = "0.4")
crate.spec(package = "tower-http", features = ["trace", "cors"], version = "0.5")

# CLI
crate.spec(package = "clap", features = ["derive"], version = "4.0")

# HTTP client for CLI
crate.spec(package = "reqwest", features = ["blocking", "json"], version = "0.11")

# Logging
crate.spec(package = "tracing", version = "0.1")
crate.spec(package = "tracing-subscriber", version = "0.3")

Estimated Timeline

  • Phase 1: 2-3 hours (thread-safe BEL storage)
  • Phase 2: 4-6 hours (HTTP server with Axum)
  • Phase 3: 3-4 hours (basic CLI)
  • Phase 4: 3-4 hours (orchestrator integration)
  • Phase 5: 2-3 hours (idle shutdown)
  • Phase 6: 4-6 hours (testing and polish)

Total: ~18-26 hours for complete implementation


Design Rationale

Why Event Log Separation?

Alternatives Considered:

  1. Shared State with RwLock: Orchestrator holds write lock during step(), blocking all reads
  2. Actor Model: Extra overhead from message passing for all operations

Why Event Log Separation Wins:

  • Orchestrator stays completely synchronous (no refactoring)
  • Reads don't block writes (eventual consistency acceptable for build system)
  • Natural fit with event sourcing architecture
  • Can cache reconstructed state for even better read performance

Why Not gRPC?

  • User requirement: "JSON is a must"
  • REST is more debuggable (curl, browser dev tools)
  • gRPC adds complexity without clear benefit
  • Can add gRPC later if needed (both can coexist)

Why Axum Over Actix?

  • Better compile-time type safety (extractors)
  • Cleaner middleware composition (Tower)
  • Native async/await (Actix uses actor model internally)
  • More ergonomic for this use case

Why Per-Workspace Server?

  • Isolation: builds in different projects don't interfere
  • Simpler: no need to route requests by workspace
  • Matches Bazel's model (users already understand it)
  • Easier to reason about resource usage