# Web Server Implementation Plan

## Architecture Summary

**Concurrency Model: Event Log Separation**
- Orchestrator runs synchronously in a dedicated thread and owns the BEL exclusively
- Web server reads from shared BEL storage and sends write commands via a channel
- No locks on the hot path; the orchestrator stays single-threaded
- Eventual consistency for reads (acceptable since builds take time anyway)

**Daemon Model:**
- Server binary started manually from the workspace root (for now)
- Server tracks the last request time and shuts down after an idle timeout (default: 3 hours)
- HTTP REST on a localhost random port
- Future: CLI can auto-discover/start the server

---

## Thread Model

```
Main Process
├─ HTTP Server (tokio multi-threaded runtime)
│  ├─ Request handlers (async, read from BEL storage)
│  └─ Command sender (send writes to orchestrator)
└─ Orchestrator Thread (std::thread, synchronous)
   ├─ Receives commands via mpsc channel
   ├─ Owns BEL (exclusive mutable access)
   └─ Runs existing step() loop
```

**Read Path (Low Latency):**
1. HTTP request → Axum handler
2. Read events from shared BEL storage (no lock contention)
3. Reconstruct BuildState from events (can cache this)
4. Return response

**Write Path (Strong Consistency):**
1. HTTP request → Axum handler
2. Send command via channel to orchestrator
3. Orchestrator processes command in its thread
4. Reply sent back via oneshot channel
5. Return response

**Why This Works:**
- Orchestrator remains completely synchronous (no refactoring needed)
- Reads scale horizontally (multiple handlers, no locks)
- Writes are serialized through the orchestrator (consistent with the current model)
- Event sourcing means reads can be eventually consistent

---

## Phase 1: Foundation - Make BEL Storage Thread-Safe

**Goal:** Allow BEL storage to be safely shared between the orchestrator and the web server

**Tasks:**
1. Add `Send + Sync` bounds to the `BELStorage` trait (see the sketch at the end of this phase)
2. Wrap `SqliteBELStorage::connection` in `Arc<Mutex<Connection>>` or use an r2d2 pool
3. Add read-only methods to `BELStorage`:
   - `list_events(offset: usize, limit: usize) -> Vec<DataBuildEvent>`
   - `get_event(event_id: u64) -> Option<DataBuildEvent>`
   - `latest_event_id() -> u64`
4. Add a builder method to reconstruct BuildState from events:
   - `BuildState::from_events(events: &[DataBuildEvent]) -> Self`

**Files Modified:**
- `databuild/build_event_log.rs` - update trait and storage impls
- `databuild/build_state.rs` - add `from_events()` builder

**Acceptance Criteria:**
- `BELStorage` trait has `Send + Sync` bounds
- Can clone `Arc<dyn BELStorage>` and use it from multiple threads
- Can reconstruct BuildState from events without mutating storage
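A minimal sketch of the Phase 1 surface, using the signatures listed above; the `DataBuildEvent`/`BuildState` stubs stand in for the real types in `build_event_log.rs` and `build_state.rs`, and the `apply` helper and `SharedBEL` alias are illustrative assumptions:

```rust
use std::sync::Arc;

// Stub for the real event type in build_event_log.rs.
#[derive(Clone, Debug)]
pub struct DataBuildEvent {
    pub event_id: u64,
    pub kind: String, // placeholder for the real event variants
}

// Send + Sync bounds let an Arc<dyn BELStorage> be shared between the
// orchestrator thread and the tokio worker threads.
pub trait BELStorage: Send + Sync {
    fn list_events(&self, offset: usize, limit: usize) -> Vec<DataBuildEvent>;
    fn get_event(&self, event_id: u64) -> Option<DataBuildEvent>;
    fn latest_event_id(&self) -> u64;
}

// Shared read handle used by HTTP handlers.
pub type SharedBEL = Arc<dyn BELStorage>;

#[derive(Default)]
pub struct BuildState {
    // ... existing fields ...
}

impl BuildState {
    // Pure fold over the event log; never mutates storage.
    pub fn from_events(events: &[DataBuildEvent]) -> Self {
        let mut state = BuildState::default();
        for event in events {
            state.apply(event);
        }
        state
    }

    fn apply(&mut self, _event: &DataBuildEvent) {
        // delegate to the existing per-event handlers
    }
}
```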
---

## Phase 2: Web Server - HTTP API with Axum

**Goal:** HTTP server serving read/write APIs

**Tasks:**
1. Add dependencies to MODULE.bazel:
   ```python
   crate.spec(package = "tokio", features = ["full"], version = "1.0")
   crate.spec(package = "axum", version = "0.7")
   crate.spec(package = "tower", version = "0.4")
   crate.spec(package = "tower-http", features = ["trace", "cors"], version = "0.5")
   ```
2. Create `databuild/http_server.rs` module with:
   - `AppState` struct holding:
     - `bel_storage: Arc<dyn BELStorage>` - shared read access
     - `command_tx: mpsc::Sender<Command>` - channel to orchestrator
     - `last_request_time: Arc<AtomicU64>` - for idle tracking
   - Axum router with all endpoints
   - Handler functions delegating to existing `api_handle_*` methods
3. API Endpoints:
   ```
   GET    /health                        → health check
   GET    /api/wants                     → list_wants
   POST   /api/wants                     → create_want
   GET    /api/wants/:id                 → get_want
   DELETE /api/wants/:id                 → cancel_want
   GET    /api/partitions                → list_partitions
   GET    /api/job_runs                  → list_job_runs
   GET    /api/job_runs/:id/logs/stdout  → stream_logs (stub)
   ```
4. Handler pattern (reads):
   ```rust
   // ListWantsParams / ListWantsResponse: request/response types (names illustrative)
   async fn list_wants(
       State(state): State<AppState>,
       Query(params): Query<ListWantsParams>,
   ) -> Json<ListWantsResponse> {
       // Read events from storage
       let events = state.bel_storage.list_events(0, 10000);
       // Reconstruct state
       let build_state = BuildState::from_events(&events);
       // Use existing API method
       Json(build_state.list_wants(&params.into()))
   }
   ```
5. Handler pattern (writes):
   ```rust
   async fn create_want(
       State(state): State<AppState>,
       Json(req): Json<CreateWantRequest>,
   ) -> Result<Json<CreateWantResponse>, StatusCode> {
       // Send command to orchestrator
       let (reply_tx, reply_rx) = oneshot::channel();
       state.command_tx
           .send(Command::CreateWant(req, reply_tx))
           .await
           .map_err(|_| StatusCode::SERVICE_UNAVAILABLE)?;
       // Wait for orchestrator reply
       let response = reply_rx.await.map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;
       Ok(Json(response))
   }
   ```

**Files Created:**
- `databuild/http_server.rs` - new module

**Files Modified:**
- `databuild/lib.rs` - add `pub mod http_server;`
- `MODULE.bazel` - add dependencies

**Acceptance Criteria:**
- Server starts on a localhost random port, prints "Listening on http://127.0.0.1:XXXXX"
- All read endpoints return correct JSON responses
- Write endpoints return stub responses (Phase 4 will connect them to the orchestrator)

---

## Phase 3: CLI - HTTP Client

**Goal:** CLI that sends HTTP requests to the running server

**Tasks:**
1. Add dependencies to MODULE.bazel:
   ```python
   crate.spec(package = "clap", features = ["derive"], version = "4.0")
   crate.spec(package = "reqwest", features = ["blocking", "json"], version = "0.11")
   ```
2. Create `databuild/bin/databuild.rs` main binary:
   ```rust
   #[derive(Parser)]
   #[command(name = "databuild")]
   enum Cli {
       /// Start the databuild server
       Serve(ServeArgs),
       /// Create a want for partitions
       Build(BuildArgs),
       /// Want operations
       Want(WantCommand),
       /// Stream job run logs
       Logs(LogsArgs),
   }

   #[derive(Args)]
   struct ServeArgs {
       #[arg(long, default_value = "8080")]
       port: u16,
   }

   #[derive(Subcommand)]
   enum WantCommand {
       Create(CreateWantArgs),
       List,
       Get { want_id: String },
       Cancel { want_id: String },
   }
   ```
3. Server address discovery:
   - For now: hardcode `http://localhost:8080` or accept a `--server-url` flag
   - Future: read from a `.databuild/server.json` file
4. HTTP client implementation:
   ```rust
   // ListWantsResponse / Want: response types shared with the server (names illustrative)
   fn list_wants(server_url: &str) -> Result<Vec<Want>, Box<dyn std::error::Error>> {
       let client = reqwest::blocking::Client::new();
       let resp = client
           .get(format!("{}/api/wants", server_url))
           .send()?
           .json::<ListWantsResponse>()?;
       Ok(resp.data)
   }
   ```
5. Commands (a dispatch sketch follows this phase):
   - `databuild serve --port 8080` - Start server (blocks)
   - `databuild build part1 part2` - Create want for partitions
   - `databuild want list` - List all wants
   - `databuild want get <want_id>` - Get specific want
   - `databuild want cancel <want_id>` - Cancel want
   - `databuild logs <job_run_id>` - Stream logs (stub)

**Files Created:**
- `databuild/bin/databuild.rs` - new CLI binary

**Files Modified:**
- `databuild/BUILD.bazel` - add `rust_binary` target for the databuild CLI

**Acceptance Criteria:**
- Can run `databuild serve` to start the server
- Can run `databuild want list` in another terminal and see wants
- Commands print pretty JSON or formatted tables
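To tie the subcommands to the HTTP client functions, `main()` might dispatch roughly like this. This is a sketch: `run_serve`, `create_want`, `get_want`, `cancel_want`, and `print_json` are hypothetical helpers alongside `list_wants` above, and the `partitions`/`job_run_id` fields on `BuildArgs`/`CreateWantArgs`/`LogsArgs` are assumed:

```rust
use clap::Parser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Default server URL for now; later this can come from .databuild/server.json
    let server_url = "http://localhost:8080";

    match Cli::parse() {
        // Start the server in the foreground (blocks until shutdown)
        Cli::Serve(args) => run_serve(args.port)?,
        // Create a want for the given partitions
        Cli::Build(args) => print_json(&create_want(server_url, &args.partitions)?)?,
        Cli::Want(cmd) => match cmd {
            WantCommand::Create(args) => print_json(&create_want(server_url, &args.partitions)?)?,
            WantCommand::List => print_json(&list_wants(server_url)?)?,
            WantCommand::Get { want_id } => print_json(&get_want(server_url, &want_id)?)?,
            WantCommand::Cancel { want_id } => print_json(&cancel_want(server_url, &want_id)?)?,
        },
        // Log streaming is a stub until the SSE work lands
        Cli::Logs(args) => eprintln!("log streaming not implemented yet ({})", args.job_run_id),
    }
    Ok(())
}
```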
---

## Phase 4: Orchestrator Integration - Command Channel

**Goal:** Connect orchestrator to web server via message passing

**Tasks:**
1. Create `databuild/commands.rs` with a command enum:
   ```rust
   pub enum Command {
       CreateWant(CreateWantRequest, oneshot::Sender<CreateWantResponse>),
       CancelWant(CancelWantRequest, oneshot::Sender<CancelWantResponse>),
       // Only write operations need commands
   }
   ```
2. Update `Orchestrator`:
   - Add a `command_rx: mpsc::Receiver<Command>` field
   - In the `step()` method, before polling:
   ```rust
   // Process all pending commands
   while let Ok(cmd) = self.command_rx.try_recv() {
       match cmd {
           Command::CreateWant(req, reply) => {
               let resp = self.bel.api_handle_want_create(req);
               let _ = reply.send(resp); // Ignore send errors
           }
           // ... other commands
       }
   }
   ```
3. Create a server startup function in `http_server.rs`:
   ```rust
   pub fn start_server(
       bel_storage: Arc<dyn BELStorage>,
       port: u16,
   ) -> (JoinHandle<()>, mpsc::Sender<Command>) {
       let (cmd_tx, cmd_rx) = mpsc::channel(100);

       // Spawn orchestrator in a background thread
       let orch_bel = bel_storage.clone();
       std::thread::spawn(move || {
           let mut orch = Orchestrator::new_with_commands(orch_bel, cmd_rx);
           orch.join().unwrap();
       });

       // Run the HTTP server on its own thread with a dedicated tokio runtime,
       // so the runtime stays alive for the lifetime of the server
       let http_cmd_tx = cmd_tx.clone();
       let http_handle = std::thread::spawn(move || {
           let runtime = tokio::runtime::Runtime::new().unwrap();
           runtime.block_on(async move {
               let app_state = AppState {
                   bel_storage,
                   command_tx: http_cmd_tx,
                   last_request_time: Arc::new(AtomicU64::new(0)),
               };
               let app = create_router(app_state);
               let addr = SocketAddr::from(([127, 0, 0, 1], port));
               // axum 0.7 style: bind a TcpListener and serve the router
               let listener = tokio::net::TcpListener::bind(addr).await.unwrap();
               axum::serve(listener, app).await.unwrap();
           });
       });

       (http_handle, cmd_tx)
   }
   ```
4. Update the `databuild serve` command to use `start_server()` (see the sketch after this phase)

**Files Created:**
- `databuild/commands.rs` - new module

**Files Modified:**
- `databuild/orchestrator.rs` - accept command channel, process in `step()`
- `databuild/http_server.rs` - send commands for writes
- `databuild/bin/databuild.rs` - use `start_server()` in the `serve` command

**Acceptance Criteria:**
- Creating a want via HTTP actually creates it in BuildState
- Orchestrator processes commands without blocking its main loop
- Can observe wants being scheduled into job runs
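As a sketch of task 4, the CLI's `serve` handler might wire `start_server()` together like this; `run_serve` is the hypothetical helper referenced in the Phase 3 dispatch sketch, and `open_bel_storage()` stands in for however the BEL is constructed today:

```rust
use std::sync::Arc;

// Hypothetical `databuild serve` handler.
fn run_serve(port: u16) -> Result<(), Box<dyn std::error::Error>> {
    let bel_storage: Arc<dyn BELStorage> = open_bel_storage()?;

    // start_server() spawns the orchestrator thread and the HTTP server thread;
    // the CLI process then blocks until the server thread exits.
    let (http_handle, _cmd_tx) = start_server(bel_storage, port);
    println!("Listening on http://127.0.0.1:{}", port);

    http_handle
        .join()
        .map_err(|_| "HTTP server thread panicked")?;
    Ok(())
}
```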
---

## Phase 5: Daemon Lifecycle - Auto-Shutdown

**Goal:** Server shuts down gracefully after an idle timeout

**Tasks:**
1. Update AppState to track the last request time:
   ```rust
   pub struct AppState {
       bel_storage: Arc<dyn BELStorage>,
       command_tx: mpsc::Sender<Command>,
       last_request_time: Arc<AtomicU64>, // epoch millis
       shutdown_tx: broadcast::Sender<()>,
   }
   ```
2. Add Tower middleware to update the timestamp:
   ```rust
   async fn update_last_request_time(
       State(state): State<AppState>,
       req: Request,
       next: Next,
   ) -> Response {
       state.last_request_time.store(
           SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64,
           Ordering::Relaxed,
       );
       next.run(req).await
   }
   ```
3. Background idle checker task:
   ```rust
   tokio::spawn(async move {
       let idle_timeout = Duration::from_secs(3 * 60 * 60); // 3 hours
       loop {
           tokio::time::sleep(Duration::from_secs(60)).await;
           let last_request = state.last_request_time.load(Ordering::Relaxed);
           let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64;
           if now - last_request > idle_timeout.as_millis() as u64 {
               eprintln!("Server idle for {} hours, shutting down", idle_timeout.as_secs() / 3600);
               shutdown_tx.send(()).unwrap();
               break;
           }
       }
   });
   ```
4. Graceful shutdown handling:
   ```rust
   let app = create_router(state);
   let listener = tokio::net::TcpListener::bind(addr).await?;
   axum::serve(listener, app)
       .with_graceful_shutdown(async move {
           shutdown_rx.recv().await.ok();
       })
       .await?;
   ```
5. Cleanup on shutdown:
   - Orchestrator: finish the current step, don't start a new one
   - HTTP server: stop accepting new connections, finish in-flight requests
   - Log: "Shutdown complete"

**Files Modified:**
- `databuild/http_server.rs` - add idle tracking, shutdown logic
- `databuild/orchestrator.rs` - accept shutdown signal, check before each step

**Acceptance Criteria:**
- Server shuts down after the configured idle timeout
- In-flight requests complete successfully during shutdown
- Shutdown is logged clearly

---

## Phase 6: Testing & Polish

**Goal:** End-to-end testing and production readiness

**Tasks:**
1. Integration tests:
   ```rust
   #[test]
   fn test_server_lifecycle() {
       // Start server
       let (handle, port) = start_test_server();

       // Make requests
       let wants = reqwest::blocking::get(
           format!("http://localhost:{}/api/wants", port)
       ).unwrap().json::<ListWantsResponse>().unwrap();

       // Stop server
       handle.shutdown();
   }
   ```
2. Error handling improvements:
   - Proper HTTP status codes (400, 404, 500)
   - Structured error responses:
     ```json
     {"error": "Want not found", "want_id": "abc123"}
     ```
   - Add `tracing` crate for structured logging
3. Add CORS middleware for the web app:
   ```rust
   let cors = CorsLayer::new()
       .allow_origin("http://localhost:3000".parse::<HeaderValue>().unwrap())
       .allow_methods([Method::GET, Method::POST, Method::DELETE]);
   app.layer(cors)
   ```
4. Health check endpoint:
   ```rust
   async fn health() -> &'static str {
       "OK"
   }
   ```
5. Optional: Metrics endpoint (Prometheus format):
   ```rust
   async fn metrics() -> String {
       format!(
           "# HELP databuild_wants_total Total number of wants\n\
            databuild_wants_total {}\n\
            # HELP databuild_job_runs_total Total number of job runs\n\
            databuild_job_runs_total {}\n",
           want_count, job_run_count
       )
   }
   ```

**Files Created:**
- `databuild/tests/http_integration_test.rs` - integration tests

**Files Modified:**
- `databuild/http_server.rs` - add CORS, health, metrics, better errors
- `MODULE.bazel` - add `tracing` dependency

**Acceptance Criteria:**
- All endpoints have proper error handling
- CORS works for web app development
- Health check returns 200 OK
- Integration tests pass

---

## Future Enhancements (Not in Initial Plan)

### Workspace Auto-Discovery
- Walk up the directory tree looking for a `.databuild/` marker
- Store server metadata in `.databuild/server.json`:
  ```json
  {
    "pid": 12345,
    "port": 54321,
    "started_at": "2025-01-22T10:30:00Z",
    "workspace_root": "/Users/stuart/Projects/databuild"
  }
  ```
- CLI auto-starts the server if it is not running

### Log Streaming (SSE)
- Implement `GET /api/job_runs/:id/logs/stdout?follow=true`
- Use Server-Sent Events for streaming
- Integrate with FileLogStore from the logging.md plan

### State Caching
- Cache the reconstructed BuildState for faster reads (see the sketch below)
- Invalidate the cache when new events arrive
- Use `tokio::sync::RwLock<(u64, BuildState)>` where the u64 is latest_event_id
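A possible shape for that cache, assuming the Phase 1 `BELStorage` methods and a `Default` impl on `BuildState`; wrapping the snapshot in an `Arc` is a small tweak on the `RwLock<(u64, BuildState)>` idea so readers get cheap clones:

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

// Cached BuildState snapshot plus the event id it was built from.
#[derive(Clone)]
pub struct StateCache {
    inner: Arc<RwLock<(u64, Arc<BuildState>)>>,
}

impl StateCache {
    pub fn new() -> Self {
        StateCache {
            inner: Arc::new(RwLock::new((0, Arc::new(BuildState::default())))),
        }
    }

    // Return the cached snapshot if it is still current, otherwise rebuild it
    // from the event log and publish the new snapshot.
    pub async fn read(&self, bel: &dyn BELStorage) -> Arc<BuildState> {
        let latest = bel.latest_event_id();
        {
            let guard = self.inner.read().await;
            if guard.0 == latest {
                return guard.1.clone();
            }
        }
        // Cache miss: rebuild outside the read lock, then store the result.
        let events = bel.list_events(0, usize::MAX);
        let state = Arc::new(BuildState::from_events(&events));
        let mut guard = self.inner.write().await;
        *guard = (latest, state.clone());
        state
    }
}
```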
"tracing-subscriber", version = "0.3") ``` --- ## Estimated Timeline - **Phase 1:** 2-3 hours (thread-safe BEL storage) - **Phase 2:** 4-6 hours (HTTP server with Axum) - **Phase 3:** 3-4 hours (basic CLI) - **Phase 4:** 3-4 hours (orchestrator integration) - **Phase 5:** 2-3 hours (idle shutdown) - **Phase 6:** 4-6 hours (testing and polish) **Total:** ~18-26 hours for complete implementation --- ## Design Rationale ### Why Event Log Separation? **Alternatives Considered:** 1. **Shared State with RwLock**: Orchestrator holds write lock during `step()`, blocking all reads 2. **Actor Model**: Extra overhead from message passing for all operations **Why Event Log Separation Wins:** - Orchestrator stays completely synchronous (no refactoring) - Reads don't block writes (eventual consistency acceptable for build system) - Natural fit with event sourcing architecture - Can cache reconstructed state for even better read performance ### Why Not gRPC? - User requirement: "JSON is a must" - REST is more debuggable (curl, browser dev tools) - gRPC adds complexity without clear benefit - Can add gRPC later if needed (both can coexist) ### Why Axum Over Actix? - Better compile-time type safety (extractors) - Cleaner middleware composition (Tower) - Native async/await (Actix uses actor model internally) - More ergonomic for this use case ### Why Per-Workspace Server? - Isolation: builds in different projects don't interfere - Simpler: no need to route requests by workspace - Matches Bazel's model (users already understand it) - Easier to reason about resource usage