# Web Server Implementation Plan
## Architecture Summary
**Concurrency Model: Event Log Separation**
- Orchestrator runs synchronously in dedicated thread, owns BEL exclusively
- Web server reads from shared BEL storage, sends write commands via channel
- No locks on hot path, orchestrator stays single-threaded
- Eventual consistency for reads (acceptable since builds take time anyway)
**Daemon Model:**
- Server binary started manually from workspace root (for now)
- Server tracks last request time, shuts down after idle timeout (default: 3 hours)
- HTTP REST on localhost random port
- Future: CLI can auto-discover/start server
---
## Thread Model
```
Main Process
├─ HTTP Server (tokio multi-threaded runtime)
│  ├─ Request handlers (async, read from BEL storage)
│  └─ Command sender (send writes to orchestrator)
└─ Orchestrator Thread (std::thread, synchronous)
   ├─ Receives commands via mpsc channel
   ├─ Owns BEL (exclusive mutable access)
   └─ Runs existing step() loop
```
**Read Path (Low Latency):**
1. HTTP request → Axum handler
2. Read events from shared BEL storage (no lock contention)
3. Reconstruct BuildState from events (can cache this)
4. Return response
**Write Path (Strong Consistency):**
1. HTTP request → Axum handler
2. Send command via channel to orchestrator
3. Orchestrator processes command in its thread
4. Reply sent back via oneshot channel
5. Return response
**Why This Works:**
- Orchestrator remains completely synchronous (no refactoring needed)
- Reads scale horizontally (multiple handlers, no locks)
- Writes are serialized through orchestrator (consistent with current model)
- Event sourcing means reads can be eventually consistent
---
## Phase 1: Foundation - Make BEL Storage Thread-Safe
**Goal:** Allow BEL storage to be safely shared between orchestrator and web server
**Tasks:**
1. Add `Send + Sync` bounds to `BELStorage` trait
2. Wrap `SqliteBELStorage::connection` in `Arc<Mutex<Connection>>` or use r2d2 pool
3. Add read-only methods to BELStorage:
- `list_events(offset: usize, limit: usize) -> Vec<DataBuildEvent>`
- `get_event(event_id: u64) -> Option<DataBuildEvent>`
- `latest_event_id() -> u64`
4. Add builder method to reconstruct BuildState from events:
- `BuildState::from_events(events: &[DataBuildEvent]) -> Self`
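A minimal sketch of the Phase 1 surface. The write-side trait methods are elided, and `BuildState::default()` plus a per-event `apply` step are assumptions about existing code, not confirmed APIs:
```rust
pub trait BELStorage: Send + Sync {
    fn list_events(&self, offset: usize, limit: usize) -> Vec<DataBuildEvent>;
    fn get_event(&self, event_id: u64) -> Option<DataBuildEvent>;
    fn latest_event_id(&self) -> u64;
    // ... existing append/write methods unchanged ...
}

impl BuildState {
    /// Rebuild state by replaying events in order; never mutates storage.
    pub fn from_events(events: &[DataBuildEvent]) -> Self {
        let mut state = BuildState::default(); // assumes a Default impl exists
        for event in events {
            state.apply(event); // assumes an existing per-event apply step
        }
        state
    }
}
```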
**Files Modified:**
- `databuild/build_event_log.rs` - update trait and storage impls
- `databuild/build_state.rs` - add `from_events()` builder
**Acceptance Criteria:**
- `BELStorage` trait has `Send + Sync` bounds
- Can clone `Arc<SqliteBELStorage>` and use from multiple threads
- Can reconstruct BuildState from events without mutating storage
---
## Phase 2: Web Server - HTTP API with Axum
**Goal:** HTTP server serving read/write APIs
**Tasks:**
1. Add dependencies to MODULE.bazel:
```python
crate.spec(package = "tokio", features = ["full"], version = "1.0")
crate.spec(package = "axum", version = "0.7")
crate.spec(package = "tower", version = "0.4")
crate.spec(package = "tower-http", features = ["trace", "cors"], version = "0.5")
```
2. Create `databuild/http_server.rs` module with:
- `AppState` struct holding:
- `bel_storage: Arc<dyn BELStorage>` - shared read access
- `command_tx: mpsc::Sender<Command>` - channel to orchestrator
- `last_request_time: Arc<AtomicU64>` - for idle tracking
- Axum router with all endpoints
- Handler functions delegating to existing `api_handle_*` methods
3. API Endpoints:
```
GET /health → health check
GET /api/wants → list_wants
POST /api/wants → create_want
GET /api/wants/:id → get_want
DELETE /api/wants/:id → cancel_want
GET /api/partitions → list_partitions
GET /api/job_runs → list_job_runs
GET /api/job_runs/:id/logs/stdout → stream_logs (stub)
```
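A sketch of the corresponding router construction with axum 0.7, assuming the handler names in the table above and that `AppState` derives `Clone` (all of its fields are cheaply clonable, so the `State` extractor works):
```rust
use axum::{routing::get, Router};

fn create_router(state: AppState) -> Router {
    Router::new()
        .route("/health", get(health))
        .route("/api/wants", get(list_wants).post(create_want))
        .route("/api/wants/:id", get(get_want).delete(cancel_want))
        .route("/api/partitions", get(list_partitions))
        .route("/api/job_runs", get(list_job_runs))
        .route("/api/job_runs/:id/logs/stdout", get(stream_logs))
        .with_state(state)
}
```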
4. Handler pattern (reads):
```rust
async fn list_wants(
    State(state): State<AppState>,
    Query(params): Query<ListWantsParams>,
) -> Json<ListWantsResponse> {
    // Read events from shared storage (list_events returns a Vec per Phase 1)
    let events = state.bel_storage.list_events(0, 10000);
    // Reconstruct state
    let build_state = BuildState::from_events(&events);
    // Use existing API method
    Json(build_state.list_wants(&params.into()))
}
```
5. Handler pattern (writes):
```rust
async fn create_want(
    State(state): State<AppState>,
    Json(req): Json<CreateWantRequest>,
) -> Result<Json<CreateWantResponse>, StatusCode> {
    // Send command to orchestrator
    let (reply_tx, reply_rx) = oneshot::channel();
    state
        .command_tx
        .send(Command::CreateWant(req, reply_tx))
        .await
        .map_err(|_| StatusCode::SERVICE_UNAVAILABLE)?;
    // Wait for orchestrator reply
    let response = reply_rx.await.map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;
    Ok(Json(response))
}
```
**Files Created:**
- `databuild/http_server.rs` - new module
**Files Modified:**
- `databuild/lib.rs` - add `pub mod http_server;`
- `MODULE.bazel` - add dependencies
**Acceptance Criteria:**
- Server starts on localhost random port, prints "Listening on http://127.0.0.1:XXXXX"
- All read endpoints return correct JSON responses
- Write endpoints return stub responses (Phase 4 will connect to orchestrator)
---
## Phase 3: CLI - HTTP Client
**Goal:** CLI that sends HTTP requests to running server
**Tasks:**
1. Add dependencies to MODULE.bazel:
```python
crate.spec(package = "clap", features = ["derive"], version = "4.0")
crate.spec(package = "reqwest", features = ["blocking", "json"], version = "0.11")
```
2. Create `databuild/bin/databuild.rs` main binary:
```rust
use clap::{Args, Parser, Subcommand};

#[derive(Parser)]
#[command(name = "databuild")]
enum Cli {
    /// Start the databuild server
    Serve(ServeArgs),
    /// Create a want for partitions
    Build(BuildArgs),
    /// Want operations
    #[command(subcommand)]
    Want(WantCommand),
    /// Stream job run logs
    Logs(LogsArgs),
}

#[derive(Args)]
struct ServeArgs {
    #[arg(long, default_value = "8080")]
    port: u16,
}

#[derive(Subcommand)]
enum WantCommand {
    Create(CreateWantArgs),
    List,
    Get { want_id: String },
    Cancel { want_id: String },
}
```
3. Server address discovery:
- For now: hardcode `http://localhost:8080` or accept `--server-url` flag
- Future: read from `.databuild/server.json` file
4. HTTP client implementation:
```rust
fn list_wants(server_url: &str) -> Result<Vec<WantDetail>> {
    let client = reqwest::blocking::Client::new();
    let resp = client
        .get(&format!("{}/api/wants", server_url))
        .send()?
        .json::<ListWantsResponse>()?;
    Ok(resp.data)
}
```
5. Commands:
- `databuild serve --port 8080` - Start server (blocks)
- `databuild build part1 part2` - Create want for partitions
- `databuild want list` - List all wants
- `databuild want get <id>` - Get specific want
- `databuild want cancel <id>` - Cancel want
- `databuild logs <job_run_id>` - Stream logs (stub)
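A hedged sketch of how `main()` could dispatch these commands to the HTTP client helpers, using the same `Result` alias as the helper above; the `Debug` print is a placeholder for real table/JSON formatting:
```rust
use clap::Parser;

fn main() -> Result<()> {
    // Hardcoded for now, per the discovery note above; later read from --server-url
    let server_url = "http://localhost:8080";
    match Cli::parse() {
        Cli::Want(WantCommand::List) => {
            // Calls the blocking reqwest helper sketched above
            for want in list_wants(server_url)? {
                println!("{:?}", want); // assumes WantDetail derives Debug
            }
        }
        // Serve, Build, the remaining Want subcommands, and Logs follow the same shape
        _ => todo!(),
    }
    Ok(())
}
```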
**Files Created:**
- `databuild/bin/databuild.rs` - new CLI binary
**Files Modified:**
- `databuild/BUILD.bazel` - add `rust_binary` target for databuild CLI
**Acceptance Criteria:**
- Can run `databuild serve` to start server
- Can run `databuild want list` in another terminal and see wants
- Commands print pretty JSON or formatted tables
---
## Phase 4: Orchestrator Integration - Command Channel
**Goal:** Connect orchestrator to web server via message passing
**Tasks:**
1. Create `databuild/commands.rs` with command enum:
```rust
pub enum Command {
    CreateWant(CreateWantRequest, oneshot::Sender<CreateWantResponse>),
    CancelWant(CancelWantRequest, oneshot::Sender<CancelWantResponse>),
    // Only write operations need commands
}
```
2. Update `Orchestrator`:
- Add `command_rx: mpsc::Receiver<Command>` field
- In `step()` method, before polling:
```rust
// Process all pending commands
while let Ok(cmd) = self.command_rx.try_recv() {
    match cmd {
        Command::CreateWant(req, reply) => {
            let resp = self.bel.api_handle_want_create(req);
            let _ = reply.send(resp); // Ignore send errors
        }
        // ... other commands
    }
}
```
3. Create server startup function in `http_server.rs`:
```rust
pub fn start_server(
    bel_storage: Arc<dyn BELStorage>,
    port: u16,
) -> (std::thread::JoinHandle<()>, mpsc::Sender<Command>) {
    let (cmd_tx, cmd_rx) = mpsc::channel(100);
    // Spawn orchestrator in background thread
    let orch_bel = bel_storage.clone();
    let orch_handle = std::thread::spawn(move || {
        let mut orch = Orchestrator::new_with_commands(orch_bel, cmd_rx);
        orch.join().unwrap();
    });
    // Run the HTTP server on a dedicated thread that drives a tokio runtime
    // (dropping the runtime inside this function would kill the server task)
    let command_tx = cmd_tx.clone();
    let http_handle = std::thread::spawn(move || {
        let runtime = tokio::runtime::Runtime::new().unwrap();
        runtime.block_on(async move {
            let app_state = AppState {
                bel_storage,
                command_tx,
                last_request_time: Arc::new(AtomicU64::new(0)),
            };
            let app = create_router(app_state);
            let addr = SocketAddr::from(([127, 0, 0, 1], port));
            // axum 0.7: bind a TcpListener and hand it to axum::serve
            let listener = tokio::net::TcpListener::bind(addr).await.unwrap();
            axum::serve(listener, app).await.unwrap();
        });
    });
    (http_handle, cmd_tx)
}
```
4. Update `databuild serve` command to use `start_server()`
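A sketch of the updated `serve` command; `SqliteBELStorage::open` and the `.databuild/bel.sqlite` path are hypothetical stand-ins for however storage is actually constructed today:
```rust
fn serve(args: ServeArgs) -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical constructor and path; substitute the real storage setup
    let storage: Arc<dyn BELStorage> = Arc::new(SqliteBELStorage::open(".databuild/bel.sqlite")?);
    let (http_handle, _cmd_tx) = start_server(storage, args.port);
    println!("Listening on http://127.0.0.1:{}", args.port);
    // Block until the HTTP server thread exits (e.g. idle shutdown in Phase 5)
    http_handle.join().unwrap();
    Ok(())
}
```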
**Files Created:**
- `databuild/commands.rs` - new module
**Files Modified:**
- `databuild/orchestrator.rs` - accept command channel, process in `step()`
- `databuild/http_server.rs` - send commands for writes
- `databuild/bin/databuild.rs` - use `start_server()` in `serve` command
**Acceptance Criteria:**
- Creating a want via HTTP actually creates it in BuildState
- Orchestrator processes commands without blocking its main loop
- Can observe wants being scheduled into job runs
---
## Phase 5: Daemon Lifecycle - Auto-Shutdown
**Goal:** Server shuts down gracefully after idle timeout
**Tasks:**
1. Update AppState to track last request time:
```rust
pub struct AppState {
    bel_storage: Arc<dyn BELStorage>,
    command_tx: mpsc::Sender<Command>,
    last_request_time: Arc<AtomicU64>, // epoch millis
    shutdown_tx: broadcast::Sender<()>,
}
```
2. Add Tower middleware to update timestamp:
```rust
async fn update_last_request_time(
    State(state): State<AppState>,
    req: Request,
    next: Next,
) -> Response {
    state.last_request_time.store(
        SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64,
        Ordering::Relaxed,
    );
    next.run(req).await
}
```
3. Background idle checker task:
```rust
tokio::spawn(async move {
    let idle_timeout = Duration::from_secs(3 * 60 * 60); // 3 hours
    loop {
        tokio::time::sleep(Duration::from_secs(60)).await;
        let last_request = state.last_request_time.load(Ordering::Relaxed);
        let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64;
        if now - last_request > idle_timeout.as_millis() as u64 {
            eprintln!(
                "Server idle for {} hours, shutting down",
                idle_timeout.as_secs() / 3600
            );
            let _ = shutdown_tx.send(()); // Ignore send errors
            break;
        }
    }
});
```
4. Graceful shutdown handling:
```rust
let mut shutdown_rx = state.shutdown_tx.subscribe();
let app = create_router(state);
let listener = tokio::net::TcpListener::bind(addr).await?;
axum::serve(listener, app)
    .with_graceful_shutdown(async move {
        shutdown_rx.recv().await.ok();
    })
    .await?;
```
5. Cleanup on shutdown:
- Orchestrator: finish current step, don't start new one
- HTTP server: stop accepting new connections, finish in-flight requests
- Log: "Shutdown complete"
**Files Modified:**
- `databuild/http_server.rs` - add idle tracking, shutdown logic
- `databuild/orchestrator.rs` - accept shutdown signal, check before each step
**Acceptance Criteria:**
- Server shuts down after configured idle timeout
- In-flight requests complete successfully during shutdown
- Shutdown is logged clearly
---
## Phase 6: Testing & Polish
**Goal:** End-to-end testing and production readiness
**Tasks:**
1. Integration tests:
```rust
#[test]
fn test_server_lifecycle() {
    // Start server
    let (handle, port) = start_test_server();
    // Make requests
    let wants = reqwest::blocking::get(&format!("http://localhost:{}/api/wants", port))
        .unwrap()
        .json::<ListWantsResponse>()
        .unwrap();
    // Stop server
    handle.shutdown();
}
```
2. Error handling improvements:
- Proper HTTP status codes (400, 404, 500)
- Structured error responses:
```json
{"error": "Want not found", "want_id": "abc123"}
```
- Add `tracing` crate for structured logging
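One way to get both the status codes and the structured bodies is a shared error type with an `IntoResponse` impl; a sketch, assuming `serde_json` is added as a direct dependency:
```rust
use axum::{http::StatusCode, response::{IntoResponse, Response}, Json};
use serde_json::json;

// Hypothetical error type; variants mirror the responses described above
pub enum ApiError {
    NotFound(String),
    BadRequest(String),
    Internal(String),
}

impl IntoResponse for ApiError {
    fn into_response(self) -> Response {
        let (status, msg) = match self {
            ApiError::NotFound(m) => (StatusCode::NOT_FOUND, m),
            ApiError::BadRequest(m) => (StatusCode::BAD_REQUEST, m),
            ApiError::Internal(m) => (StatusCode::INTERNAL_SERVER_ERROR, m),
        };
        (status, Json(json!({ "error": msg }))).into_response()
    }
}
```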
3. Add CORS middleware for web app:
```rust
let cors = CorsLayer::new()
    .allow_origin("http://localhost:3000".parse::<HeaderValue>().unwrap())
    .allow_methods([Method::GET, Method::POST, Method::DELETE]);
app.layer(cors)
```
4. Health check endpoint:
```rust
async fn health() -> &'static str {
    "OK"
}
```
5. Optional: Metrics endpoint (prometheus format):
```rust
async fn metrics() -> String {
    format!(
        "# HELP databuild_wants_total Total number of wants\n\
         databuild_wants_total {}\n\
         # HELP databuild_job_runs_total Total number of job runs\n\
         databuild_job_runs_total {}\n",
        want_count, job_run_count
    )
}
```
**Files Created:**
- `databuild/tests/http_integration_test.rs` - integration tests
**Files Modified:**
- `databuild/http_server.rs` - add CORS, health, metrics, better errors
- `MODULE.bazel` - add `tracing` dependency
**Acceptance Criteria:**
- All endpoints have proper error handling
- CORS works for web app development
- Health check returns 200 OK
- Integration tests pass
---
## Future Enhancements (Not in Initial Plan)
### Workspace Auto-Discovery
- Walk up directory tree looking for `.databuild/` marker
- Store server metadata in `.databuild/server.json`:
```json
{
  "pid": 12345,
  "port": 54321,
  "started_at": "2025-01-22T10:30:00Z",
  "workspace_root": "/Users/stuart/Projects/databuild"
}
```
- CLI auto-starts server if not running
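A sketch of the directory walk, assuming the `.databuild/` marker convention described above:
```rust
use std::path::{Path, PathBuf};

/// Walk up from `start` until a directory containing `.databuild/` is found.
fn find_workspace_root(start: &Path) -> Option<PathBuf> {
    let mut dir = Some(start);
    while let Some(d) = dir {
        if d.join(".databuild").is_dir() {
            return Some(d.to_path_buf());
        }
        dir = d.parent();
    }
    None
}
```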
### Log Streaming (SSE)
- Implement `GET /api/job_runs/:id/logs/stdout?follow=true`
- Use Server-Sent Events for streaming
- Integrate with FileLogStore from logging.md plan
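A sketch of the SSE handler shape with axum 0.7; the `futures` crate is an extra assumption, and the real handler would pull lines from the FileLogStore rather than a fixed vector:
```rust
use std::convert::Infallible;
use axum::response::sse::{Event, KeepAlive, Sse};
use futures::stream::{self, Stream};

async fn stream_stdout() -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    // Placeholder lines; a real handler would tail the job run's stdout and
    // emit each new line as it is written
    let lines = vec!["starting job".to_string(), "job finished".to_string()];
    let stream = stream::iter(lines.into_iter().map(|l| Ok(Event::default().data(l))));
    Sse::new(stream).keep_alive(KeepAlive::default())
}
```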
### State Caching
- Cache reconstructed BuildState for faster reads
- Invalidate cache when new events arrive
- Use `tokio::sync::RwLock<Option<(u64, BuildState)>>` where u64 is latest_event_id
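A sketch of the cached read path using the Phase 1 storage methods; wrapping `BuildState` in an `Arc` (a small deviation from the tuple above) keeps clones cheap:
```rust
use std::sync::Arc;
use tokio::sync::RwLock;

type StateCache = Arc<RwLock<Option<(u64, Arc<BuildState>)>>>;

async fn cached_state(storage: &dyn BELStorage, cache: &StateCache) -> Arc<BuildState> {
    let latest = storage.latest_event_id();
    // Fast path: cache is present and still at the latest event id
    if let Some((cached_id, state)) = cache.read().await.as_ref() {
        if *cached_id == latest {
            return state.clone();
        }
    }
    // Stale or empty: replay the log and refresh the cache
    let events = storage.list_events(0, usize::MAX);
    let state = Arc::new(BuildState::from_events(&events));
    *cache.write().await = Some((latest, state.clone()));
    state
}
```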
### gRPC Support (If Needed)
- Add Tonic alongside Axum
- Share same orchestrator/command channel
- Useful for language-agnostic clients
---
## Dependencies Summary
New dependencies to add to `MODULE.bazel`:
```python
# Async runtime
crate.spec(package = "tokio", features = ["full"], version = "1.0")
# Web framework
crate.spec(package = "axum", version = "0.7")
crate.spec(package = "tower", version = "0.4")
crate.spec(package = "tower-http", features = ["trace", "cors"], version = "0.5")
# CLI
crate.spec(package = "clap", features = ["derive"], version = "4.0")
# HTTP client for CLI
crate.spec(package = "reqwest", features = ["blocking", "json"], version = "0.11")
# Logging
crate.spec(package = "tracing", version = "0.1")
crate.spec(package = "tracing-subscriber", version = "0.3")
```
---
## Estimated Timeline
- **Phase 1:** 2-3 hours (thread-safe BEL storage)
- **Phase 2:** 4-6 hours (HTTP server with Axum)
- **Phase 3:** 3-4 hours (basic CLI)
- **Phase 4:** 3-4 hours (orchestrator integration)
- **Phase 5:** 2-3 hours (idle shutdown)
- **Phase 6:** 4-6 hours (testing and polish)
**Total:** ~18-26 hours for complete implementation
---
## Design Rationale
### Why Event Log Separation?
**Alternatives Considered:**
1. **Shared State with RwLock**: Orchestrator holds write lock during `step()`, blocking all reads
2. **Actor Model**: Extra overhead from message passing for all operations
**Why Event Log Separation Wins:**
- Orchestrator stays completely synchronous (no refactoring)
- Reads don't block writes (eventual consistency acceptable for build system)
- Natural fit with event sourcing architecture
- Can cache reconstructed state for even better read performance
### Why Not gRPC?
- User requirement: "JSON is a must"
- REST is more debuggable (curl, browser dev tools)
- gRPC adds complexity without clear benefit
- Can add gRPC later if needed (both can coexist)
### Why Axum Over Actix?
- Better compile-time type safety (extractors)
- Cleaner middleware composition (Tower)
- Native async/await throughout (Actix grew out of an actor framework)
- More ergonomic for this use case
### Why Per-Workspace Server?
- Isolation: builds in different projects don't interfere
- Simpler: no need to route requests by workspace
- Matches Bazel's model (users already understand it)
- Easier to reason about resource usage