# Web Server Implementation Plan

## Architecture Summary
Concurrency Model: Event Log Separation
- Orchestrator runs synchronously in dedicated thread, owns BEL exclusively
- Web server reads from shared BEL storage, sends write commands via channel
- No locks on hot path, orchestrator stays single-threaded
- Eventual consistency for reads (acceptable since builds take time anyway)
Daemon Model:
- Server binary started manually from workspace root (for now)
- Server tracks last request time, shuts down after idle timeout (default: 3 hours)
- HTTP REST on localhost random port
- Future: CLI can auto-discover/start server
### Thread Model

```
Main Process
├─ HTTP Server (tokio multi-threaded runtime)
│  ├─ Request handlers (async, read from BEL storage)
│  └─ Command sender (send writes to orchestrator)
└─ Orchestrator Thread (std::thread, synchronous)
   ├─ Receives commands via mpsc channel
   ├─ Owns BEL (exclusive mutable access)
   └─ Runs existing step() loop
```
Read Path (Low Latency):
- HTTP request → Axum handler
- Read events from shared BEL storage (no lock contention)
- Reconstruct BuildState from events (can cache this)
- Return response
Write Path (Strong Consistency):
- HTTP request → Axum handler
- Send command via channel to orchestrator
- Orchestrator processes command in its thread
- Reply sent back via oneshot channel
- Return response
Why This Works:
- Orchestrator remains completely synchronous (no refactoring needed)
- Reads scale horizontally (multiple handlers, no locks)
- Writes are serialized through orchestrator (consistent with current model)
- Event sourcing means reads can be eventually consistent
## Phase 1: Foundation - Make BEL Storage Thread-Safe
Goal: Allow BEL storage to be safely shared between orchestrator and web server
Tasks:
- Add `Send + Sync` bounds to the `BELStorage` trait
- Wrap `SqliteBELStorage::connection` in `Arc<Mutex<Connection>>` or use an r2d2 pool
- Add read-only methods to `BELStorage`:
  - `list_events(offset: usize, limit: usize) -> Vec<DataBuildEvent>`
  - `get_event(event_id: u64) -> Option<DataBuildEvent>`
  - `latest_event_id() -> u64`
- Add a builder method to reconstruct `BuildState` from events (see the sketch after this list):
  - `BuildState::from_events(events: &[DataBuildEvent]) -> Self`
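A minimal sketch of what these additions could look like; the existing trait members, `BuildState::default()`, and the `apply()` reducer are assumptions, not existing code:

```rust
// Read-only additions; `Send + Sync` lets an `Arc<dyn BELStorage>` be shared
// between the orchestrator thread and async HTTP handlers. The existing
// write-side methods of the trait are omitted here.
pub trait BELStorage: Send + Sync {
    fn list_events(&self, offset: usize, limit: usize) -> Vec<DataBuildEvent>;
    fn get_event(&self, event_id: u64) -> Option<DataBuildEvent>;
    fn latest_event_id(&self) -> u64;
}

impl BuildState {
    /// Rebuild state by folding over the event log; `apply()` stands in for
    /// whatever per-event reducer the orchestrator already uses.
    pub fn from_events(events: &[DataBuildEvent]) -> Self {
        let mut state = BuildState::default();
        for event in events {
            state.apply(event);
        }
        state
    }
}
```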
Files Modified:
- `databuild/build_event_log.rs` - update trait and storage impls
- `databuild/build_state.rs` - add `from_events()` builder
Acceptance Criteria:
- `BELStorage` trait has `Send + Sync` bounds
- Can clone `Arc<SqliteBELStorage>` and use it from multiple threads
- Can reconstruct `BuildState` from events without mutating storage
## Phase 2: Web Server - HTTP API with Axum
Goal: HTTP server serving read/write APIs
Tasks:
- Add dependencies to MODULE.bazel:

  ```starlark
  crate.spec(package = "tokio", features = ["full"], version = "1.0")
  crate.spec(package = "axum", version = "0.7")
  crate.spec(package = "tower", version = "0.4")
  crate.spec(package = "tower-http", features = ["trace", "cors"], version = "0.5")
  ```

- Create a `databuild/http_server.rs` module with:
  - `AppState` struct holding:
    - `bel_storage: Arc<dyn BELStorage>` - shared read access
    - `command_tx: mpsc::Sender<Command>` - channel to orchestrator
    - `last_request_time: Arc<AtomicU64>` - for idle tracking
  - Axum router with all endpoints (see the router sketch after this list)
  - Handler functions delegating to existing `api_handle_*` methods

- API Endpoints:

  ```
  GET    /health                        → health check
  GET    /api/wants                     → list_wants
  POST   /api/wants                     → create_want
  GET    /api/wants/:id                 → get_want
  DELETE /api/wants/:id                 → cancel_want
  GET    /api/partitions                → list_partitions
  GET    /api/job_runs                  → list_job_runs
  GET    /api/job_runs/:id/logs/stdout  → stream_logs (stub)
  ```

- Handler pattern (reads):

  ```rust
  async fn list_wants(
      State(state): State<AppState>,
      Query(params): Query<ListWantsParams>,
  ) -> Json<ListWantsResponse> {
      // Read events from storage
      let events = state.bel_storage.list_events(0, 10000);
      // Reconstruct state
      let build_state = BuildState::from_events(&events);
      // Use existing API method
      Json(build_state.list_wants(&params.into()))
  }
  ```

- Handler pattern (writes):

  ```rust
  async fn create_want(
      State(state): State<AppState>,
      Json(req): Json<CreateWantRequest>,
  ) -> Result<Json<CreateWantResponse>, StatusCode> {
      // Send command to orchestrator
      let (reply_tx, reply_rx) = oneshot::channel();
      state
          .command_tx
          .send(Command::CreateWant(req, reply_tx))
          .await
          .map_err(|_| StatusCode::SERVICE_UNAVAILABLE)?;
      // Wait for the orchestrator's reply
      let response = reply_rx.await.map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;
      Ok(Json(response))
  }
  ```
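If the router above is wired with plain axum routing, it might look roughly like the sketch below; the handler names simply mirror the endpoint list, and `AppState: Clone` is assumed:

```rust
use axum::{routing::get, Router};

// Sketch only: wire the endpoint list to handlers. All handler names here
// are assumptions taken from the endpoint table above.
fn create_router(state: AppState) -> Router {
    Router::new()
        .route("/health", get(health))
        .route("/api/wants", get(list_wants).post(create_want))
        .route("/api/wants/:id", get(get_want).delete(cancel_want))
        .route("/api/partitions", get(list_partitions))
        .route("/api/job_runs", get(list_job_runs))
        .route("/api/job_runs/:id/logs/stdout", get(stream_logs))
        .with_state(state)
}
```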
Files Created:
- `databuild/http_server.rs` - new module
Files Modified:
- `databuild/lib.rs` - add `pub mod http_server;`
- `MODULE.bazel` - add dependencies
Acceptance Criteria:
- Server starts on localhost random port, prints "Listening on http://127.0.0.1:XXXXX"
- All read endpoints return correct JSON responses
- Write endpoints return stub responses (Phase 4 will connect to orchestrator)
## Phase 3: CLI - HTTP Client
Goal: CLI that sends HTTP requests to a running server
Tasks:
- Add dependencies to MODULE.bazel:

  ```starlark
  crate.spec(package = "clap", features = ["derive"], version = "4.0")
  crate.spec(package = "reqwest", features = ["blocking", "json"], version = "0.11")
  ```

- Create `databuild/bin/databuild.rs` main binary (a dispatch sketch follows this list):

  ```rust
  #[derive(Parser)]
  #[command(name = "databuild")]
  enum Cli {
      /// Start the databuild server
      Serve(ServeArgs),
      /// Create a want for partitions
      Build(BuildArgs),
      /// Want operations
      #[command(subcommand)]
      Want(WantCommand),
      /// Stream job run logs
      Logs(LogsArgs),
  }

  #[derive(Args)]
  struct ServeArgs {
      #[arg(long, default_value = "8080")]
      port: u16,
  }

  #[derive(Subcommand)]
  enum WantCommand {
      Create(CreateWantArgs),
      List,
      Get { want_id: String },
      Cancel { want_id: String },
  }
  ```

- Server address discovery:
  - For now: hardcode `http://localhost:8080` or accept a `--server-url` flag
  - Future: read from a `.databuild/server.json` file

- HTTP client implementation:

  ```rust
  fn list_wants(server_url: &str) -> Result<Vec<WantDetail>> {
      let client = reqwest::blocking::Client::new();
      let resp = client
          .get(&format!("{}/api/wants", server_url))
          .send()?
          .json::<ListWantsResponse>()?;
      Ok(resp.data)
  }
  ```

- Commands:
  - `databuild serve --port 8080` - start server (blocks)
  - `databuild build part1 part2` - create want for partitions
  - `databuild want list` - list all wants
  - `databuild want get <id>` - get a specific want
  - `databuild want cancel <id>` - cancel a want
  - `databuild logs <job_run_id>` - stream logs (stub)
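A hedged sketch of how the entrypoint might dispatch these commands; `serve`, `create_want`, `handle_want`, `stream_logs`, and the argument field names are all placeholders, not existing code:

```rust
use clap::Parser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cli = Cli::parse();
    // For now the server address is hardcoded, matching the discovery note above.
    let server_url = "http://localhost:8080";
    match cli {
        Cli::Serve(args) => serve(args.port)?,
        Cli::Build(args) => println!("{:#?}", create_want(server_url, &args.partitions)?),
        Cli::Want(WantCommand::List) => println!("{:#?}", list_wants(server_url)?),
        Cli::Want(cmd) => handle_want(server_url, cmd)?,
        Cli::Logs(args) => stream_logs(server_url, &args.job_run_id)?,
    }
    Ok(())
}
```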
Files Created:
- `databuild/bin/databuild.rs` - new CLI binary
Files Modified:
- `databuild/BUILD.bazel` - add `rust_binary` target for the databuild CLI
Acceptance Criteria:
- Can run `databuild serve` to start the server
- Can run `databuild want list` in another terminal and see wants
- Commands print pretty JSON or formatted tables
## Phase 4: Orchestrator Integration - Command Channel
Goal: Connect orchestrator to web server via message passing
Tasks:
- Create `databuild/commands.rs` with a command enum:

  ```rust
  pub enum Command {
      CreateWant(CreateWantRequest, oneshot::Sender<CreateWantResponse>),
      CancelWant(CancelWantRequest, oneshot::Sender<CancelWantResponse>),
      // Only write operations need commands
  }
  ```

- Update `Orchestrator`:
  - Add a `command_rx: mpsc::Receiver<Command>` field
  - In the `step()` method, before polling:

    ```rust
    // Process all pending commands
    while let Ok(cmd) = self.command_rx.try_recv() {
        match cmd {
            Command::CreateWant(req, reply) => {
                let resp = self.bel.api_handle_want_create(req);
                let _ = reply.send(resp); // Ignore send errors
            }
            // ... other commands
        }
    }
    ```

- Create a server startup function in `http_server.rs`:

  ```rust
  pub fn start_server(
      bel_storage: Arc<dyn BELStorage>,
      port: u16,
  ) -> (JoinHandle<()>, mpsc::Sender<Command>) {
      let (cmd_tx, cmd_rx) = mpsc::channel(100);

      // Spawn orchestrator in a background thread
      let orch_bel = bel_storage.clone();
      let orch_handle = std::thread::spawn(move || {
          let mut orch = Orchestrator::new_with_commands(orch_bel, cmd_rx);
          orch.join().unwrap();
      });

      // Start HTTP server in a tokio runtime
      let runtime = tokio::runtime::Runtime::new().unwrap();
      let command_tx = cmd_tx.clone();
      let http_handle = runtime.spawn(async move {
          let app_state = AppState {
              bel_storage,
              command_tx,
              last_request_time: Arc::new(AtomicU64::new(0)),
          };
          let app = create_router(app_state);
          let addr = SocketAddr::from(([127, 0, 0, 1], port));
          // axum 0.7 style: bind a tokio listener and serve the router directly
          let listener = tokio::net::TcpListener::bind(addr).await.unwrap();
          axum::serve(listener, app).await.unwrap();
      });

      (http_handle, cmd_tx)
  }
  ```

- Update the `databuild serve` command to use `start_server()` (see the wiring sketch after this list)
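A minimal sketch of the `serve` wiring, assuming `start_server()` above; the `SqliteBELStorage::open()` constructor, the database path, and the blocking strategy are placeholders:

```rust
// Hypothetical wiring for `databuild serve`; names and paths are assumptions.
fn serve(port: u16) -> Result<(), Box<dyn std::error::Error>> {
    let storage: Arc<dyn BELStorage> =
        Arc::new(SqliteBELStorage::open(".databuild/bel.sqlite")?);
    let (_http_handle, _cmd_tx) = start_server(storage, port);
    println!("Listening on http://127.0.0.1:{port}");

    // Keep the process alive; graceful shutdown arrives in Phase 5.
    loop {
        std::thread::park();
    }
}
```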
Files Created:
- `databuild/commands.rs` - new module
Files Modified:
- `databuild/orchestrator.rs` - accept command channel, process in `step()`
- `databuild/http_server.rs` - send commands for writes
- `databuild/bin/databuild.rs` - use `start_server()` in `serve` command
Acceptance Criteria:
- Creating a want via HTTP actually creates it in BuildState
- Orchestrator processes commands without blocking its main loop
- Can observe wants being scheduled into job runs
## Phase 5: Daemon Lifecycle - Auto-Shutdown
Goal: Server shuts down gracefully after idle timeout
Tasks:
- Update AppState to track the last request time:

  ```rust
  pub struct AppState {
      bel_storage: Arc<dyn BELStorage>,
      command_tx: mpsc::Sender<Command>,
      last_request_time: Arc<AtomicU64>, // epoch millis
      shutdown_tx: broadcast::Sender<()>,
  }
  ```

- Add Tower middleware to update the timestamp:

  ```rust
  async fn update_last_request_time(
      State(state): State<AppState>,
      req: Request,
      next: Next,
  ) -> Response {
      state.last_request_time.store(
          SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64,
          Ordering::Relaxed,
      );
      next.run(req).await
  }
  ```

- Background idle checker task:

  ```rust
  tokio::spawn(async move {
      let idle_timeout = Duration::from_secs(3 * 60 * 60); // 3 hours
      loop {
          tokio::time::sleep(Duration::from_secs(60)).await;
          let last_request = state.last_request_time.load(Ordering::Relaxed);
          let now = SystemTime::now()
              .duration_since(UNIX_EPOCH)
              .unwrap()
              .as_millis() as u64;
          if now - last_request > idle_timeout.as_millis() as u64 {
              eprintln!(
                  "Server idle for {} hours, shutting down",
                  idle_timeout.as_secs() / 3600
              );
              shutdown_tx.send(()).unwrap();
              break;
          }
      }
  });
  ```

- Graceful shutdown handling:

  ```rust
  let app = create_router(state);
  let listener = tokio::net::TcpListener::bind(addr).await?;
  axum::serve(listener, app)
      .with_graceful_shutdown(async move {
          shutdown_rx.recv().await.ok();
      })
      .await?;
  ```

- Cleanup on shutdown (see the orchestrator-side sketch after this list):
  - Orchestrator: finish the current step, don't start a new one
  - HTTP server: stop accepting new connections, finish in-flight requests
  - Log: "Shutdown complete"
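One way the orchestrator side of this could look, sketched under the assumption that it holds a `broadcast::Receiver<()>`; the field and method names here are placeholders:

```rust
impl Orchestrator {
    /// Sketch of a shutdown-aware main loop; `shutdown_rx` and the polling
    /// cadence are assumptions, not existing code.
    pub fn run(&mut self) {
        loop {
            // Non-blocking check before each step: finish the current step,
            // but never start a new one after shutdown has been requested.
            if self.shutdown_rx.try_recv().is_ok() {
                eprintln!("Shutdown complete");
                break;
            }
            self.step();
            std::thread::sleep(std::time::Duration::from_millis(100));
        }
    }
}
```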
Files Modified:
- `databuild/http_server.rs` - add idle tracking, shutdown logic
- `databuild/orchestrator.rs` - accept shutdown signal, check before each step
Acceptance Criteria:
- Server shuts down after configured idle timeout
- In-flight requests complete successfully during shutdown
- Shutdown is logged clearly
## Phase 6: Testing & Polish
Goal: End-to-end testing and production readiness
Tasks:
- Integration tests:

  ```rust
  #[test]
  fn test_server_lifecycle() {
      // Start server
      let (handle, port) = start_test_server();

      // Make requests
      let wants = reqwest::blocking::get(
          &format!("http://localhost:{}/api/wants", port)
      ).unwrap().json::<ListWantsResponse>().unwrap();

      // Stop server
      handle.shutdown();
  }
  ```

- Error handling improvements (see the error-type sketch after this list):
  - Proper HTTP status codes (400, 404, 500)
  - Structured error responses: `{"error": "Want not found", "want_id": "abc123"}`
  - Add the `tracing` crate for structured logging

- Add CORS middleware for the web app:

  ```rust
  let cors = CorsLayer::new()
      .allow_origin("http://localhost:3000".parse::<HeaderValue>().unwrap())
      .allow_methods([Method::GET, Method::POST, Method::DELETE]);

  app.layer(cors)
  ```

- Health check endpoint:

  ```rust
  async fn health() -> &'static str {
      "OK"
  }
  ```

- Optional: metrics endpoint (Prometheus format):

  ```rust
  async fn metrics() -> String {
      format!(
          "# HELP databuild_wants_total Total number of wants\n\
           databuild_wants_total {}\n\
           # HELP databuild_job_runs_total Total number of job runs\n\
           databuild_job_runs_total {}\n",
          want_count, job_run_count
      )
  }
  ```
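A sketch of what the structured errors could look like as an axum 0.7 `IntoResponse` type; the variant names and the use of `serde_json` (not in the dependency list) are assumptions:

```rust
use axum::{http::StatusCode, response::{IntoResponse, Response}, Json};
use serde_json::json;

// One possible error type for the API; variants are placeholders.
enum ApiError {
    WantNotFound(String),
    BadRequest(String),
    Internal(String),
}

impl IntoResponse for ApiError {
    fn into_response(self) -> Response {
        let (status, body) = match self {
            ApiError::WantNotFound(want_id) => (
                StatusCode::NOT_FOUND,
                json!({ "error": "Want not found", "want_id": want_id }),
            ),
            ApiError::BadRequest(msg) => (StatusCode::BAD_REQUEST, json!({ "error": msg })),
            ApiError::Internal(msg) => (StatusCode::INTERNAL_SERVER_ERROR, json!({ "error": msg })),
        };
        (status, Json(body)).into_response()
    }
}

// Handlers can then return `Result<Json<T>, ApiError>` and use `?`.
```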
Files Created:
- `databuild/tests/http_integration_test.rs` - integration tests
Files Modified:
- `databuild/http_server.rs` - add CORS, health, metrics, better errors
- `MODULE.bazel` - add `tracing` dependency
Acceptance Criteria:
- All endpoints have proper error handling
- CORS works for web app development
- Health check returns 200 OK
- Integration tests pass
## Future Enhancements (Not in Initial Plan)

### Workspace Auto-Discovery
- Walk up the directory tree looking for a `.databuild/` marker (see the sketch below)
- Store server metadata in `.databuild/server.json`:

  ```json
  {
    "pid": 12345,
    "port": 54321,
    "started_at": "2025-01-22T10:30:00Z",
    "workspace_root": "/Users/stuart/Projects/databuild"
  }
  ```

- CLI auto-starts the server if it is not running
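A sketch of the directory walk, using only the standard library; the function name is a placeholder:

```rust
use std::path::{Path, PathBuf};

// Walk up from the starting directory until a `.databuild/` marker is found.
// Returns the workspace root, or None if the filesystem root is reached first.
fn find_workspace_root(start: &Path) -> Option<PathBuf> {
    let mut dir = start.canonicalize().ok()?;
    loop {
        if dir.join(".databuild").is_dir() {
            return Some(dir);
        }
        if !dir.pop() {
            return None;
        }
    }
}
```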
### Log Streaming (SSE)
- Implement `GET /api/job_runs/:id/logs/stdout?follow=true`
- Use Server-Sent Events for streaming
- Integrate with FileLogStore from the logging.md plan
### State Caching
- Cache the reconstructed BuildState for faster reads (see the sketch below)
- Invalidate the cache when new events arrive
- Use `tokio::sync::RwLock<Option<(u64, BuildState)>>`, where the `u64` is `latest_event_id`
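A sketch of the cached read path, assuming the Phase 1 storage methods exist and that `BuildState` is `Clone`:

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

// (latest_event_id, state) pair; the type alias and helper are assumptions.
type StateCache = Arc<RwLock<Option<(u64, BuildState)>>>;

async fn current_state(storage: &Arc<dyn BELStorage>, cache: &StateCache) -> BuildState {
    let latest = storage.latest_event_id();

    // Fast path: serve the cached state if no new events have arrived.
    if let Some((cached_id, state)) = cache.read().await.as_ref() {
        if *cached_id == latest {
            return state.clone();
        }
    }

    // Slow path: rebuild from the event log and refresh the cache.
    // Paging is omitted in this sketch.
    let events = storage.list_events(0, usize::MAX);
    let state = BuildState::from_events(&events);
    *cache.write().await = Some((latest, state.clone()));
    state
}
```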
### gRPC Support (If Needed)
- Add Tonic alongside Axum
- Share same orchestrator/command channel
- Useful for language-agnostic clients
## Dependencies Summary

New dependencies to add to MODULE.bazel:

```starlark
# Async runtime
crate.spec(package = "tokio", features = ["full"], version = "1.0")

# Web framework
crate.spec(package = "axum", version = "0.7")
crate.spec(package = "tower", version = "0.4")
crate.spec(package = "tower-http", features = ["trace", "cors"], version = "0.5")

# CLI
crate.spec(package = "clap", features = ["derive"], version = "4.0")

# HTTP client for CLI
crate.spec(package = "reqwest", features = ["blocking", "json"], version = "0.11")

# Logging
crate.spec(package = "tracing", version = "0.1")
crate.spec(package = "tracing-subscriber", version = "0.3")
```
## Estimated Timeline
- Phase 1: 2-3 hours (thread-safe BEL storage)
- Phase 2: 4-6 hours (HTTP server with Axum)
- Phase 3: 3-4 hours (basic CLI)
- Phase 4: 3-4 hours (orchestrator integration)
- Phase 5: 2-3 hours (idle shutdown)
- Phase 6: 4-6 hours (testing and polish)
Total: ~18-26 hours for complete implementation
## Design Rationale

### Why Event Log Separation?
Alternatives Considered:
- Shared State with RwLock: orchestrator holds the write lock during `step()`, blocking all reads
- Actor Model: extra overhead from message passing for all operations
Why Event Log Separation Wins:
- Orchestrator stays completely synchronous (no refactoring)
- Reads don't block writes (eventual consistency acceptable for build system)
- Natural fit with event sourcing architecture
- Can cache reconstructed state for even better read performance
### Why Not gRPC?
- User requirement: "JSON is a must"
- REST is more debuggable (curl, browser dev tools)
- gRPC adds complexity without clear benefit
- Can add gRPC later if needed (both can coexist)
### Why Axum Over Actix?
- Better compile-time type safety (extractors)
- Cleaner middleware composition (Tower)
- Native async/await (Actix uses actor model internally)
- More ergonomic for this use case
### Why Per-Workspace Server?
- Isolation: builds in different projects don't interfere
- Simpler: no need to route requests by workspace
- Matches Bazel's model (users already understand it)
- Easier to reason about resource usage