# Web Server Implementation Plan
## Architecture Summary
**Concurrency Model: Event Log Separation**
- Orchestrator runs synchronously in dedicated thread, owns BEL exclusively
- Web server reads from shared BEL storage, sends write commands via channel
- No locks on hot path, orchestrator stays single-threaded
- Eventual consistency for reads (acceptable since builds take time anyway)
**Daemon Model:**
- Server binary started manually from workspace root (for now)
- Server tracks last request time, shuts down after idle timeout (default: 3 hours)
- HTTP REST on localhost random port
- Future: CLI can auto-discover/start server
---
## Thread Model
```
Main Process
├─ HTTP Server (tokio multi-threaded runtime)
│  ├─ Request handlers (async, read from BEL storage)
│  └─ Command sender (send writes to orchestrator)
└─ Orchestrator Thread (std::thread, synchronous)
   ├─ Receives commands via mpsc channel
   ├─ Owns BEL (exclusive mutable access)
   └─ Runs existing step() loop
```
**Read Path (Low Latency):**
1. HTTP request → Axum handler
2. Read events from shared BEL storage (no lock contention)
3. Reconstruct BuildState from events (can cache this)
4. Return response
**Write Path (Strong Consistency):**
1. HTTP request → Axum handler
2. Send command via channel to orchestrator
3. Orchestrator processes command in its thread
4. Reply sent back via oneshot channel
5. Return response
**Why This Works:**
- Orchestrator remains completely synchronous (no refactoring needed)
- Reads scale horizontally (multiple handlers, no locks)
- Writes are serialized through orchestrator (consistent with current model)
- Event sourcing means reads can be eventually consistent
---
## Phase 1: Foundation - Make BEL Storage Thread-Safe
**Goal:** Allow BEL storage to be safely shared between orchestrator and web server
**Tasks:**
1. Add `Send + Sync` bounds to `BELStorage` trait
2. Wrap `SqliteBELStorage::connection` in `Arc<Mutex<Connection>>` or use r2d2 pool
3. Add read-only methods to BELStorage:
- `list_events(offset: usize, limit: usize) -> Vec<DataBuildEvent>`
- `get_event(event_id: u64) -> Option<DataBuildEvent>`
- `latest_event_id() -> u64`
4. Add builder method to reconstruct BuildState from events:
- `BuildState::from_events(events: &[DataBuildEvent]) -> Self`
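A minimal sketch of the Phase 1 surface. The write-side trait methods are elided, and `BuildState::default()` plus a per-event `apply` step are assumptions about existing code, not confirmed APIs:
```rust
pub trait BELStorage: Send + Sync {
    fn list_events(&self, offset: usize, limit: usize) -> Vec<DataBuildEvent>;
    fn get_event(&self, event_id: u64) -> Option<DataBuildEvent>;
    fn latest_event_id(&self) -> u64;
    // ... existing append/write methods unchanged ...
}

impl BuildState {
    /// Rebuild state by replaying events in order; never mutates storage.
    pub fn from_events(events: &[DataBuildEvent]) -> Self {
        let mut state = BuildState::default(); // assumes a Default impl exists
        for event in events {
            state.apply(event); // assumes an existing per-event apply step
        }
        state
    }
}
```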
**Files Modified:**
- `databuild/build_event_log.rs` - update trait and storage impls
- `databuild/build_state.rs` - add `from_events()` builder
**Acceptance Criteria:**
- `BELStorage` trait has `Send + Sync` bounds
- Can clone `Arc<SqliteBELStorage>` and use from multiple threads
- Can reconstruct BuildState from events without mutating storage
---
## Phase 2: Web Server - HTTP API with Axum
**Goal:** HTTP server serving read/write APIs
**Tasks:**
1. Add dependencies to MODULE.bazel:
```python
crate.spec(package = "tokio", features = ["full"], version = "1.0")
crate.spec(package = "axum", version = "0.7")
crate.spec(package = "tower", version = "0.4")
crate.spec(package = "tower-http", features = ["trace", "cors"], version = "0.5")
```
2. Create `databuild/http_server.rs` module with:
- `AppState` struct holding:
- `bel_storage: Arc<dyn BELStorage>` - shared read access
- `command_tx: mpsc::Sender<Command>` - channel to orchestrator
- `last_request_time: Arc<AtomicU64>` - for idle tracking
- Axum router with all endpoints
- Handler functions delegating to existing `api_handle_*` methods
3. API Endpoints:
```
GET /health → health check
GET /api/wants → list_wants
POST /api/wants → create_want
GET /api/wants/:id → get_want
DELETE /api/wants/:id → cancel_want
GET /api/partitions → list_partitions
GET /api/job_runs → list_job_runs
GET /api/job_runs/:id/logs/stdout → stream_logs (stub)
```
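A sketch of the corresponding router construction with axum 0.7, assuming the handler names in the table above and that `AppState` derives `Clone` (all of its fields are cheaply clonable, so the `State` extractor works):
```rust
use axum::{routing::get, Router};

fn create_router(state: AppState) -> Router {
    Router::new()
        .route("/health", get(health))
        .route("/api/wants", get(list_wants).post(create_want))
        .route("/api/wants/:id", get(get_want).delete(cancel_want))
        .route("/api/partitions", get(list_partitions))
        .route("/api/job_runs", get(list_job_runs))
        .route("/api/job_runs/:id/logs/stdout", get(stream_logs))
        .with_state(state)
}
```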
4. Handler pattern (reads):
```rust
async fn list_wants(
    State(state): State<AppState>,
    Query(params): Query<ListWantsParams>,
) -> Json<ListWantsResponse> {
    // Read events from shared storage (list_events returns a Vec per Phase 1)
    let events = state.bel_storage.list_events(0, 10000);
    // Reconstruct state
    let build_state = BuildState::from_events(&events);
    // Use existing API method
    Json(build_state.list_wants(&params.into()))
}
```
5. Handler pattern (writes):
```rust
async fn create_want(
    State(state): State<AppState>,
    Json(req): Json<CreateWantRequest>,
) -> Result<Json<CreateWantResponse>, StatusCode> {
    // Send command to orchestrator
    let (reply_tx, reply_rx) = oneshot::channel();
    state
        .command_tx
        .send(Command::CreateWant(req, reply_tx))
        .await
        .map_err(|_| StatusCode::SERVICE_UNAVAILABLE)?;
    // Wait for orchestrator reply
    let response = reply_rx.await.map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;
    Ok(Json(response))
}
```
**Files Created:**
- `databuild/http_server.rs` - new module
**Files Modified:**
- `databuild/lib.rs` - add `pub mod http_server;`
- `MODULE.bazel` - add dependencies
**Acceptance Criteria:**
- Server starts on localhost random port, prints "Listening on http://127.0.0.1:XXXXX"
- All read endpoints return correct JSON responses
- Write endpoints return stub responses (Phase 4 will connect to orchestrator)
---
## Phase 3: CLI - HTTP Client
**Goal:** CLI that sends HTTP requests to running server
**Tasks:**
1. Add dependencies to MODULE.bazel:
```python
crate.spec(package = "clap", features = ["derive"], version = "4.0")
crate.spec(package = "reqwest", features = ["blocking", "json"], version = "0.11")
```
2. Create `databuild/bin/databuild.rs` main binary:
```rust
use clap::{Args, Parser, Subcommand};

#[derive(Parser)]
#[command(name = "databuild")]
enum Cli {
    /// Start the databuild server
    Serve(ServeArgs),
    /// Create a want for partitions
    Build(BuildArgs),
    /// Want operations
    #[command(subcommand)]
    Want(WantCommand),
    /// Stream job run logs
    Logs(LogsArgs),
}

#[derive(Args)]
struct ServeArgs {
    #[arg(long, default_value = "8080")]
    port: u16,
}

#[derive(Subcommand)]
enum WantCommand {
    Create(CreateWantArgs),
    List,
    Get { want_id: String },
    Cancel { want_id: String },
}
```
3. Server address discovery:
- For now: hardcode `http://localhost:8080` or accept `--server-url` flag
- Future: read from `.databuild/server.json` file
4. HTTP client implementation:
```rust
fn list_wants(server_url: &str) -> Result<Vec<WantDetail>> {
    let client = reqwest::blocking::Client::new();
    let resp = client
        .get(&format!("{}/api/wants", server_url))
        .send()?
        .json::<ListWantsResponse>()?;
    Ok(resp.data)
}
```
5. Commands:
- `databuild serve --port 8080` - Start server (blocks)
- `databuild build part1 part2` - Create want for partitions
- `databuild want list` - List all wants
- `databuild want get <id>` - Get specific want
- `databuild want cancel <id>` - Cancel want
- `databuild logs <job_run_id>` - Stream logs (stub)
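A hedged sketch of how `main()` could dispatch these commands to the HTTP client helpers, using the same `Result` alias as the helper above; the `Debug` print is a placeholder for real table/JSON formatting:
```rust
use clap::Parser;

fn main() -> Result<()> {
    // Hardcoded for now, per the discovery note above; later read from --server-url
    let server_url = "http://localhost:8080";
    match Cli::parse() {
        Cli::Want(WantCommand::List) => {
            // Calls the blocking reqwest helper sketched above
            for want in list_wants(server_url)? {
                println!("{:?}", want); // assumes WantDetail derives Debug
            }
        }
        // Serve, Build, the remaining Want subcommands, and Logs follow the same shape
        _ => todo!(),
    }
    Ok(())
}
```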
**Files Created:**
- `databuild/bin/databuild.rs` - new CLI binary
**Files Modified:**
- `databuild/BUILD.bazel` - add `rust_binary` target for databuild CLI
**Acceptance Criteria:**
- Can run `databuild serve` to start server
- Can run `databuild want list` in another terminal and see wants
- Commands print pretty JSON or formatted tables
---
## Phase 4: Orchestrator Integration - Command Channel
**Goal:** Connect orchestrator to web server via message passing
**Tasks:**
1. Create `databuild/commands.rs` with command enum:
```rust
pub enum Command {
    CreateWant(CreateWantRequest, oneshot::Sender<CreateWantResponse>),
    CancelWant(CancelWantRequest, oneshot::Sender<CancelWantResponse>),
    // Only write operations need commands
}
```
2. Update `Orchestrator`:
- Add `command_rx: mpsc::Receiver<Command>` field
- In `step()` method, before polling:
```rust
// Process all pending commands
while let Ok(cmd) = self.command_rx.try_recv() {
    match cmd {
        Command::CreateWant(req, reply) => {
            let resp = self.bel.api_handle_want_create(req);
            let _ = reply.send(resp); // Ignore send errors
        }
        // ... other commands
    }
}
```
3. Create server startup function in `http_server.rs`:
```rust
pub fn start_server(
    bel_storage: Arc<dyn BELStorage>,
    port: u16,
) -> (std::thread::JoinHandle<()>, mpsc::Sender<Command>) {
    let (cmd_tx, cmd_rx) = mpsc::channel(100);
    // Spawn orchestrator in background thread
    let orch_bel = bel_storage.clone();
    let orch_handle = std::thread::spawn(move || {
        let mut orch = Orchestrator::new_with_commands(orch_bel, cmd_rx);
        orch.join().unwrap();
    });
    // Run the HTTP server on a dedicated thread that drives a tokio runtime
    // (dropping the runtime inside this function would kill the server task)
    let command_tx = cmd_tx.clone();
    let http_handle = std::thread::spawn(move || {
        let runtime = tokio::runtime::Runtime::new().unwrap();
        runtime.block_on(async move {
            let app_state = AppState {
                bel_storage,
                command_tx,
                last_request_time: Arc::new(AtomicU64::new(0)),
            };
            let app = create_router(app_state);
            let addr = SocketAddr::from(([127, 0, 0, 1], port));
            // axum 0.7: bind a TcpListener and hand it to axum::serve
            let listener = tokio::net::TcpListener::bind(addr).await.unwrap();
            axum::serve(listener, app).await.unwrap();
        });
    });
    (http_handle, cmd_tx)
}
```
4. Update `databuild serve` command to use `start_server()`
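A sketch of the updated `serve` command; `SqliteBELStorage::open` and the `.databuild/bel.sqlite` path are hypothetical stand-ins for however storage is actually constructed today:
```rust
fn serve(args: ServeArgs) -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical constructor and path; substitute the real storage setup
    let storage: Arc<dyn BELStorage> = Arc::new(SqliteBELStorage::open(".databuild/bel.sqlite")?);
    let (http_handle, _cmd_tx) = start_server(storage, args.port);
    println!("Listening on http://127.0.0.1:{}", args.port);
    // Block until the HTTP server thread exits (e.g. idle shutdown in Phase 5)
    http_handle.join().unwrap();
    Ok(())
}
```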
**Files Created:**
- `databuild/commands.rs` - new module
**Files Modified:**
- `databuild/orchestrator.rs` - accept command channel, process in `step()`
- `databuild/http_server.rs` - send commands for writes
- `databuild/bin/databuild.rs` - use `start_server()` in `serve` command
**Acceptance Criteria:**
- Creating a want via HTTP actually creates it in BuildState
- Orchestrator processes commands without blocking its main loop
- Can observe wants being scheduled into job runs
---
## Phase 5: Daemon Lifecycle - Auto-Shutdown
**Goal:** Server shuts down gracefully after idle timeout
**Tasks:**
1. Update AppState to track last request time:
```rust
pub struct AppState {
    bel_storage: Arc<dyn BELStorage>,
    command_tx: mpsc::Sender<Command>,
    last_request_time: Arc<AtomicU64>, // epoch millis
    shutdown_tx: broadcast::Sender<()>,
}
```
2. Add Tower middleware to update timestamp:
```rust
async fn update_last_request_time(
    State(state): State<AppState>,
    req: Request,
    next: Next,
) -> Response {
    state.last_request_time.store(
        SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64,
        Ordering::Relaxed,
    );
    next.run(req).await
}
```
3. Background idle checker task:
```rust
tokio::spawn(async move {
    let idle_timeout = Duration::from_secs(3 * 60 * 60); // 3 hours
    loop {
        tokio::time::sleep(Duration::from_secs(60)).await;
        let last_request = state.last_request_time.load(Ordering::Relaxed);
        let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64;
        if now - last_request > idle_timeout.as_millis() as u64 {
            eprintln!(
                "Server idle for {} hours, shutting down",
                idle_timeout.as_secs() / 3600
            );
            let _ = shutdown_tx.send(()); // Ignore send errors
            break;
        }
    }
});
```
4. Graceful shutdown handling:
```rust
let mut shutdown_rx = state.shutdown_tx.subscribe();
let app = create_router(state);
let listener = tokio::net::TcpListener::bind(addr).await?;
axum::serve(listener, app)
    .with_graceful_shutdown(async move {
        shutdown_rx.recv().await.ok();
    })
    .await?;
```
5. Cleanup on shutdown:
- Orchestrator: finish current step, don't start new one
- HTTP server: stop accepting new connections, finish in-flight requests
- Log: "Shutdown complete"
**Files Modified:**
- `databuild/http_server.rs` - add idle tracking, shutdown logic
- `databuild/orchestrator.rs` - accept shutdown signal, check before each step
**Acceptance Criteria:**
- Server shuts down after configured idle timeout
- In-flight requests complete successfully during shutdown
- Shutdown is logged clearly
---
## Phase 6: Testing & Polish
**Goal:** End-to-end testing and production readiness
**Tasks:**
1. Integration tests:
```rust
#[test]
fn test_server_lifecycle() {
    // Start server
    let (handle, port) = start_test_server();
    // Make requests
    let wants = reqwest::blocking::get(&format!("http://localhost:{}/api/wants", port))
        .unwrap()
        .json::<ListWantsResponse>()
        .unwrap();
    // Stop server
    handle.shutdown();
}
```
2. Error handling improvements:
- Proper HTTP status codes (400, 404, 500)
- Structured error responses:
```json
{"error": "Want not found", "want_id": "abc123"}
```
- Add `tracing` crate for structured logging
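One way to get both the status codes and the structured bodies is a shared error type with an `IntoResponse` impl; a sketch, assuming `serde_json` is added as a direct dependency:
```rust
use axum::{http::StatusCode, response::{IntoResponse, Response}, Json};
use serde_json::json;

// Hypothetical error type; variants mirror the responses described above
pub enum ApiError {
    NotFound(String),
    BadRequest(String),
    Internal(String),
}

impl IntoResponse for ApiError {
    fn into_response(self) -> Response {
        let (status, msg) = match self {
            ApiError::NotFound(m) => (StatusCode::NOT_FOUND, m),
            ApiError::BadRequest(m) => (StatusCode::BAD_REQUEST, m),
            ApiError::Internal(m) => (StatusCode::INTERNAL_SERVER_ERROR, m),
        };
        (status, Json(json!({ "error": msg }))).into_response()
    }
}
```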
3. Add CORS middleware for web app:
```rust
let cors = CorsLayer::new()
    .allow_origin("http://localhost:3000".parse::<HeaderValue>().unwrap())
    .allow_methods([Method::GET, Method::POST, Method::DELETE]);
app.layer(cors)
```
4. Health check endpoint:
```rust
async fn health() -> &'static str {
    "OK"
}
```
5. Optional: Metrics endpoint (prometheus format):
```rust
async fn metrics() -> String {
    format!(
        "# HELP databuild_wants_total Total number of wants\n\
         databuild_wants_total {}\n\
         # HELP databuild_job_runs_total Total number of job runs\n\
         databuild_job_runs_total {}\n",
        want_count, job_run_count
    )
}
```
**Files Created:**
- `databuild/tests/http_integration_test.rs` - integration tests
**Files Modified:**
- `databuild/http_server.rs` - add CORS, health, metrics, better errors
- `MODULE.bazel` - add `tracing` dependency
**Acceptance Criteria:**
- All endpoints have proper error handling
- CORS works for web app development
- Health check returns 200 OK
- Integration tests pass
---
## Future Enhancements (Not in Initial Plan)
### Workspace Auto-Discovery
- Walk up directory tree looking for `.databuild/` marker
- Store server metadata in `.databuild/server.json`:
```json
{
  "pid": 12345,
  "port": 54321,
  "started_at": "2025-01-22T10:30:00Z",
  "workspace_root": "/Users/stuart/Projects/databuild"
}
```
- CLI auto-starts server if not running
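A sketch of the directory walk, assuming the `.databuild/` marker convention described above:
```rust
use std::path::{Path, PathBuf};

/// Walk up from `start` until a directory containing `.databuild/` is found.
fn find_workspace_root(start: &Path) -> Option<PathBuf> {
    let mut dir = Some(start);
    while let Some(d) = dir {
        if d.join(".databuild").is_dir() {
            return Some(d.to_path_buf());
        }
        dir = d.parent();
    }
    None
}
```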
### Log Streaming (SSE)
- Implement `GET /api/job_runs/:id/logs/stdout?follow=true`
- Use Server-Sent Events for streaming
- Integrate with FileLogStore from logging.md plan
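A sketch of the SSE handler shape with axum 0.7; the `futures` crate is an extra assumption, and the real handler would pull lines from the FileLogStore rather than a fixed vector:
```rust
use std::convert::Infallible;
use axum::response::sse::{Event, KeepAlive, Sse};
use futures::stream::{self, Stream};

async fn stream_stdout() -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    // Placeholder lines; a real handler would tail the job run's stdout and
    // emit each new line as it is written
    let lines = vec!["starting job".to_string(), "job finished".to_string()];
    let stream = stream::iter(lines.into_iter().map(|l| Ok(Event::default().data(l))));
    Sse::new(stream).keep_alive(KeepAlive::default())
}
```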
### State Caching
- Cache reconstructed BuildState for faster reads
- Invalidate cache when new events arrive
- Use `tokio::sync::RwLock<Option<(u64, BuildState)>>` where u64 is latest_event_id
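A sketch of the cached read path using the Phase 1 storage methods; wrapping `BuildState` in an `Arc` (a small deviation from the tuple above) keeps clones cheap:
```rust
use std::sync::Arc;
use tokio::sync::RwLock;

type StateCache = Arc<RwLock<Option<(u64, Arc<BuildState>)>>>;

async fn cached_state(storage: &dyn BELStorage, cache: &StateCache) -> Arc<BuildState> {
    let latest = storage.latest_event_id();
    // Fast path: cache is present and still at the latest event id
    if let Some((cached_id, state)) = cache.read().await.as_ref() {
        if *cached_id == latest {
            return state.clone();
        }
    }
    // Stale or empty: replay the log and refresh the cache
    let events = storage.list_events(0, usize::MAX);
    let state = Arc::new(BuildState::from_events(&events));
    *cache.write().await = Some((latest, state.clone()));
    state
}
```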
### gRPC Support (If Needed)
- Add Tonic alongside Axum
- Share same orchestrator/command channel
- Useful for language-agnostic clients
---
## Dependencies Summary
New dependencies to add to `MODULE.bazel`:
```python
# Async runtime
crate.spec(package = "tokio", features = ["full"], version = "1.0")
# Web framework
crate.spec(package = "axum", version = "0.7")
crate.spec(package = "tower", version = "0.4")
crate.spec(package = "tower-http", features = ["trace", "cors"], version = "0.5")
# CLI
crate.spec(package = "clap", features = ["derive"], version = "4.0")
# HTTP client for CLI
crate.spec(package = "reqwest", features = ["blocking", "json"], version = "0.11")
# Logging
crate.spec(package = "tracing", version = "0.1")
crate.spec(package = "tracing-subscriber", version = "0.3")
```
---
## Estimated Timeline
- **Phase 1:** 2-3 hours (thread-safe BEL storage)
- **Phase 2:** 4-6 hours (HTTP server with Axum)
- **Phase 3:** 3-4 hours (basic CLI)
- **Phase 4:** 3-4 hours (orchestrator integration)
- **Phase 5:** 2-3 hours (idle shutdown)
- **Phase 6:** 4-6 hours (testing and polish)
**Total:** ~18-26 hours for complete implementation
---
## Design Rationale
### Why Event Log Separation?
**Alternatives Considered:**
1. **Shared State with RwLock**: Orchestrator holds write lock during `step()`, blocking all reads
2. **Actor Model**: Extra overhead from message passing for all operations
**Why Event Log Separation Wins:**
- Orchestrator stays completely synchronous (no refactoring)
- Reads don't block writes (eventual consistency acceptable for build system)
- Natural fit with event sourcing architecture
- Can cache reconstructed state for even better read performance
### Why Not gRPC?
- User requirement: "JSON is a must"
- REST is more debuggable (curl, browser dev tools)
- gRPC adds complexity without clear benefit
- Can add gRPC later if needed (both can coexist)
### Why Axum Over Actix?
- Better compile-time type safety (extractors)
- Cleaner middleware composition (Tower)
- Native async/await throughout (Actix grew out of an actor framework)
- More ergonomic for this use case
### Why Per-Workspace Server?
- Isolation: builds in different projects don't interfere
- Simpler: no need to route requests by workspace
- Matches Bazel's model (users already understand it)
- Easier to reason about resource usage