# Web Server Implementation Plan

## Architecture Summary

**Concurrency Model: Event Log Separation**

- Orchestrator runs synchronously in a dedicated thread and owns the BEL (build event log) exclusively
- Web server reads from shared BEL storage and sends write commands over a channel
- No locks on the hot path; the orchestrator stays single-threaded
- Reads are eventually consistent (acceptable, since builds take time anyway)

**Daemon Model:**

- Server binary is started manually from the workspace root (for now)
- Server tracks the last request time and shuts down after an idle timeout (default: 3 hours)
- HTTP REST on a random localhost port
- Future: the CLI can auto-discover/start the server

---

## Thread Model

```
Main Process
├─ HTTP Server (tokio multi-threaded runtime)
│  ├─ Request handlers (async, read from BEL storage)
│  └─ Command sender (send writes to orchestrator)
└─ Orchestrator Thread (std::thread, synchronous)
   ├─ Receives commands via mpsc channel
   ├─ Owns BEL (exclusive mutable access)
   └─ Runs existing step() loop
```

**Read Path (Low Latency):**

1. HTTP request → Axum handler
2. Read events from shared BEL storage (no lock contention)
3. Reconstruct BuildState from the events (this can be cached)
4. Return the response

**Write Path (Strong Consistency):**

1. HTTP request → Axum handler
2. Send a command to the orchestrator over the channel
3. Orchestrator processes the command on its own thread
4. Orchestrator sends the reply back over a oneshot channel
5. Return the response

**Why This Works:**

- Orchestrator remains completely synchronous (no refactoring needed)
- Reads scale horizontally (multiple handlers, no locks)
- Writes are serialized through the orchestrator (consistent with the current model)
- Event sourcing means reads can be eventually consistent

---

## Phase 1: Foundation - Make BEL Storage Thread-Safe

**Goal:** Allow BEL storage to be safely shared between the orchestrator and the web server.

**Tasks:**

1. Add `Send + Sync` bounds to the `BELStorage` trait
2. Wrap `SqliteBELStorage::connection` in `Arc<Mutex<Connection>>` or use an r2d2 connection pool
3. Add read-only methods to `BELStorage`:
   - `list_events(offset: usize, limit: usize) -> Vec<DataBuildEvent>`
   - `get_event(event_id: u64) -> Option<DataBuildEvent>`
   - `latest_event_id() -> u64`
4. Add a builder method that reconstructs `BuildState` from events (see the sketch after this list):
   - `BuildState::from_events(events: &[DataBuildEvent]) -> Self`
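
A minimal sketch of the resulting trait and builder, assuming the existing `DataBuildEvent` and `BuildState` types; the per-event `apply` helper is hypothetical:

```rust
/// Shared, thread-safe read access to the build event log.
pub trait BELStorage: Send + Sync {
    fn list_events(&self, offset: usize, limit: usize) -> Vec<DataBuildEvent>;
    fn get_event(&self, event_id: u64) -> Option<DataBuildEvent>;
    fn latest_event_id(&self) -> u64;
}

impl BuildState {
    /// Rebuild state by folding over the event log; never mutates storage.
    pub fn from_events(events: &[DataBuildEvent]) -> Self {
        let mut state = BuildState::default();
        for event in events {
            state.apply(event); // hypothetical per-event reducer
        }
        state
    }
}
```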

**Files Modified:**

- `databuild/build_event_log.rs` - update trait and storage impls
- `databuild/build_state.rs` - add `from_events()` builder

**Acceptance Criteria:**

- `BELStorage` trait has `Send + Sync` bounds
- Can clone `Arc<SqliteBELStorage>` and use it from multiple threads
- Can reconstruct `BuildState` from events without mutating storage

---

## Phase 2: Web Server - HTTP API with Axum

**Goal:** HTTP server serving the read/write APIs.

**Tasks:**

1. Add dependencies to MODULE.bazel:

   ```python
   crate.spec(package = "tokio", features = ["full"], version = "1.0")
   crate.spec(package = "axum", version = "0.7")
   crate.spec(package = "tower", version = "0.4")
   crate.spec(package = "tower-http", features = ["trace", "cors"], version = "0.5")
   ```

2. Create a `databuild/http_server.rs` module with:
   - An `AppState` struct holding:
     - `bel_storage: Arc<dyn BELStorage>` - shared read access
     - `command_tx: mpsc::Sender<Command>` - channel to the orchestrator
     - `last_request_time: Arc<AtomicU64>` - for idle tracking
   - An Axum router with all endpoints (see the router sketch under item 3)
   - Handler functions delegating to the existing `api_handle_*` methods

3. API Endpoints:

   ```
   GET    /health                        → health check
   GET    /api/wants                     → list_wants
   POST   /api/wants                     → create_want
   GET    /api/wants/:id                 → get_want
   DELETE /api/wants/:id                 → cancel_want
   GET    /api/partitions                → list_partitions
   GET    /api/job_runs                  → list_job_runs
   GET    /api/job_runs/:id/logs/stdout  → stream_logs (stub)
   ```
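
   A minimal router sketch wiring these routes to the handlers, assuming axum 0.7 and the handler names listed above:

   ```rust
   use axum::{routing::get, Router};

   fn create_router(state: AppState) -> Router {
       Router::new()
           .route("/health", get(health))
           .route("/api/wants", get(list_wants).post(create_want))
           .route("/api/wants/:id", get(get_want).delete(cancel_want))
           .route("/api/partitions", get(list_partitions))
           .route("/api/job_runs", get(list_job_runs))
           .route("/api/job_runs/:id/logs/stdout", get(stream_logs))
           .with_state(state)
   }
   ```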

4. Handler pattern (reads):

   ```rust
   async fn list_wants(
       State(state): State<AppState>,
       Query(params): Query<ListWantsParams>,
   ) -> Json<ListWantsResponse> {
       // Read events from storage (no locks on the orchestrator's path)
       let events = state.bel_storage.list_events(0, 10_000);

       // Reconstruct state from the event log
       let build_state = BuildState::from_events(&events);

       // Delegate to the existing API method
       Json(build_state.list_wants(&params.into()))
   }
   ```

5. Handler pattern (writes):

   ```rust
   async fn create_want(
       State(state): State<AppState>,
       Json(req): Json<CreateWantRequest>,
   ) -> Result<Json<CreateWantResponse>, StatusCode> {
       // Send the command to the orchestrator
       let (reply_tx, reply_rx) = oneshot::channel();
       state
           .command_tx
           .send(Command::CreateWant(req, reply_tx))
           .await
           .map_err(|_| StatusCode::SERVICE_UNAVAILABLE)?;

       // Wait for the orchestrator's reply
       let response = reply_rx.await.map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;
       Ok(Json(response))
   }
   ```

**Files Created:**

- `databuild/http_server.rs` - new module

**Files Modified:**

- `databuild/lib.rs` - add `pub mod http_server;`
- `MODULE.bazel` - add dependencies

**Acceptance Criteria:**

- Server starts on a random localhost port and prints "Listening on http://127.0.0.1:XXXXX"
- All read endpoints return correct JSON responses
- Write endpoints return stub responses (Phase 4 connects them to the orchestrator)

---

## Phase 3: CLI - HTTP Client

**Goal:** CLI that sends HTTP requests to a running server.

**Tasks:**

1. Add dependencies to MODULE.bazel:

   ```python
   crate.spec(package = "clap", features = ["derive"], version = "4.0")
   crate.spec(package = "reqwest", features = ["blocking", "json"], version = "0.11")
   ```

2. Create the `databuild/bin/databuild.rs` main binary:

   ```rust
   use clap::{Args, Parser, Subcommand};

   #[derive(Parser)]
   #[command(name = "databuild")]
   enum Cli {
       /// Start the databuild server
       Serve(ServeArgs),

       /// Create a want for partitions
       Build(BuildArgs),

       /// Want operations
       #[command(subcommand)]
       Want(WantCommand),

       /// Stream job run logs
       Logs(LogsArgs),
   }

   #[derive(Args)]
   struct ServeArgs {
       #[arg(long, default_value = "8080")]
       port: u16,
   }

   #[derive(Subcommand)]
   enum WantCommand {
       Create(CreateWantArgs),
       List,
       Get { want_id: String },
       Cancel { want_id: String },
   }
   ```

3. Server address discovery:
   - For now: hardcode `http://localhost:8080` or accept a `--server-url` flag
   - Future: read from a `.databuild/server.json` file

4. HTTP client implementation:

   ```rust
   fn list_wants(server_url: &str) -> Result<Vec<WantDetail>, reqwest::Error> {
       let client = reqwest::blocking::Client::new();
       let resp = client
           .get(format!("{}/api/wants", server_url))
           .send()?
           .json::<ListWantsResponse>()?;
       Ok(resp.data) // assumes the response wraps its payload in a `data` field
   }
   ```

5. Commands (see the dispatch sketch below):
   - `databuild serve --port 8080` - start the server (blocks)
   - `databuild build part1 part2` - create a want for the given partitions
   - `databuild want list` - list all wants
   - `databuild want get <id>` - get a specific want
   - `databuild want cancel <id>` - cancel a want
   - `databuild logs <job_run_id>` - stream logs (stub)
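
A hypothetical `main` showing how the subcommands might dispatch to the client functions; `run_server` (wrapping Phase 4's `start_server`) and the `Debug` bound on `WantDetail` are assumptions:

```rust
fn main() -> Result<(), reqwest::Error> {
    let cli = Cli::parse();
    let server_url = "http://localhost:8080"; // or from a --server-url flag
    match cli {
        Cli::Serve(args) => run_server(args.port), // blocks until shutdown
        Cli::Want(WantCommand::List) => {
            for want in list_wants(server_url)? {
                println!("{:?}", want); // assumes WantDetail: Debug
            }
        }
        // ... remaining subcommands follow the same pattern
        _ => todo!(),
    }
    Ok(())
}
```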

**Files Created:**

- `databuild/bin/databuild.rs` - new CLI binary

**Files Modified:**

- `databuild/BUILD.bazel` - add a `rust_binary` target for the databuild CLI

**Acceptance Criteria:**

- Can run `databuild serve` to start the server
- Can run `databuild want list` in another terminal and see wants
- Commands print pretty JSON or formatted tables

---

## Phase 4: Orchestrator Integration - Command Channel

**Goal:** Connect the orchestrator to the web server via message passing.

**Tasks:**

1. Create `databuild/commands.rs` with a command enum:

   ```rust
   pub enum Command {
       CreateWant(CreateWantRequest, oneshot::Sender<CreateWantResponse>),
       CancelWant(CancelWantRequest, oneshot::Sender<CancelWantResponse>),
       // Only write operations need commands; reads go straight to BEL storage
   }
   ```

2. Update `Orchestrator`:
   - Add a `command_rx: mpsc::Receiver<Command>` field
   - In the `step()` method, before polling:

   ```rust
   // Drain all pending commands without blocking the loop
   while let Ok(cmd) = self.command_rx.try_recv() {
       match cmd {
           Command::CreateWant(req, reply) => {
               let resp = self.bel.api_handle_want_create(req);
               let _ = reply.send(resp); // ignore send errors (client went away)
           }
           // ... other commands
       }
   }
   ```

3. Create a server startup function in `http_server.rs`:

   ```rust
   pub fn start_server(
       bel_storage: Arc<dyn BELStorage>,
       port: u16,
   ) -> (std::thread::JoinHandle<()>, mpsc::Sender<Command>) {
       let (cmd_tx, cmd_rx) = mpsc::channel(100);

       // Spawn the orchestrator in a background thread
       let orch_bel = bel_storage.clone();
       std::thread::spawn(move || {
           let mut orch = Orchestrator::new_with_commands(orch_bel, cmd_rx);
           orch.join().unwrap();
       });

       // Run the HTTP server on a thread that owns its tokio runtime
       // (a runtime created and dropped inside this function would kill
       // the server as soon as the function returned)
       let http_cmd_tx = cmd_tx.clone();
       let http_handle = std::thread::spawn(move || {
           let runtime = tokio::runtime::Runtime::new().unwrap();
           runtime.block_on(async move {
               let app_state = AppState {
                   bel_storage,
                   command_tx: http_cmd_tx,
                   last_request_time: Arc::new(AtomicU64::new(0)),
               };

               let app = create_router(app_state);
               let addr = SocketAddr::from(([127, 0, 0, 1], port));
               // axum 0.7: bind a TcpListener and serve the router
               let listener = tokio::net::TcpListener::bind(addr).await.unwrap();
               axum::serve(listener, app).await.unwrap();
           });
       });

       (http_handle, cmd_tx)
   }
   ```

4. Update the `databuild serve` command to use `start_server()` (see the sketch below)
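
A sketch of the `serve` wiring, assuming the Phase 3 `ServeArgs`; `SqliteBELStorage::open` and its path are hypothetical:

```rust
fn run_server(args: ServeArgs) {
    let storage: Arc<dyn BELStorage> =
        Arc::new(SqliteBELStorage::open(".databuild/bel.sqlite").unwrap());
    let (http_handle, _cmd_tx) = start_server(storage, args.port);
    println!("Listening on http://127.0.0.1:{}", args.port);
    http_handle.join().unwrap(); // block until shutdown
}
```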

**Files Created:**

- `databuild/commands.rs` - new module

**Files Modified:**

- `databuild/orchestrator.rs` - accept the command channel, process it in `step()`
- `databuild/http_server.rs` - send commands for writes
- `databuild/bin/databuild.rs` - use `start_server()` in the `serve` command

**Acceptance Criteria:**

- Creating a want via HTTP actually creates it in BuildState
- Orchestrator processes commands without blocking its main loop
- Can observe wants being scheduled into job runs

---

## Phase 5: Daemon Lifecycle - Auto-Shutdown

**Goal:** Server shuts down gracefully after an idle timeout.

**Tasks:**

1. Update `AppState` to track the last request time and carry a shutdown channel:

   ```rust
   pub struct AppState {
       bel_storage: Arc<dyn BELStorage>,
       command_tx: mpsc::Sender<Command>,
       last_request_time: Arc<AtomicU64>, // epoch millis
       shutdown_tx: broadcast::Sender<()>,
   }
   ```

2. Add middleware (via `axum::middleware::from_fn_with_state`) to update the timestamp on every request:

   ```rust
   // axum 0.7 middleware signature: Request and Next are no longer generic
   async fn update_last_request_time(
       State(state): State<AppState>,
       req: Request,
       next: Next,
   ) -> Response {
       state.last_request_time.store(
           SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64,
           Ordering::Relaxed,
       );
       next.run(req).await
   }
   ```

3. Background idle checker task:

   ```rust
   tokio::spawn(async move {
       let idle_timeout = Duration::from_secs(3 * 60 * 60); // 3 hours

       loop {
           tokio::time::sleep(Duration::from_secs(60)).await;

           let last_request = state.last_request_time.load(Ordering::Relaxed);
           let now = SystemTime::now()
               .duration_since(UNIX_EPOCH)
               .unwrap()
               .as_millis() as u64;

           if now - last_request > idle_timeout.as_millis() as u64 {
               eprintln!(
                   "Server idle for {} hours, shutting down",
                   idle_timeout.as_secs() / 3600
               );
               shutdown_tx.send(()).ok();
               break;
           }
       }
   });
   ```

4. Graceful shutdown handling:

   ```rust
   let app = create_router(state);
   let listener = tokio::net::TcpListener::bind(addr).await?;
   // axum 0.7: graceful shutdown is configured on axum::serve
   axum::serve(listener, app)
       .with_graceful_shutdown(async move {
           shutdown_rx.recv().await.ok();
       })
       .await?;
   ```

5. Cleanup on shutdown (a sketch of the orchestrator side follows this list):
   - Orchestrator: finish the current step, don't start a new one
   - HTTP server: stop accepting new connections, finish in-flight requests
   - Log: "Shutdown complete"
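
One way the orchestrator might honor the signal, sketched with a shared `AtomicBool` set by the idle checker (an alternative to subscribing to the broadcast channel):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

fn run_orchestrator(mut orch: Orchestrator, shutdown: Arc<AtomicBool>) {
    // Check before each step; a step already in progress runs to completion
    while !shutdown.load(Ordering::Relaxed) {
        orch.step();
    }
    eprintln!("Shutdown complete");
}
```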

**Files Modified:**

- `databuild/http_server.rs` - add idle tracking and shutdown logic
- `databuild/orchestrator.rs` - accept the shutdown signal, check it before each step

**Acceptance Criteria:**

- Server shuts down after the configured idle timeout
- In-flight requests complete successfully during shutdown
- Shutdown is logged clearly

---

## Phase 6: Testing & Polish

**Goal:** End-to-end testing and production readiness.

**Tasks:**

1. Integration tests:

   ```rust
   #[test]
   fn test_server_lifecycle() {
       // Start the server
       let (handle, port) = start_test_server();

       // Make requests
       let wants = reqwest::blocking::get(format!("http://localhost:{}/api/wants", port))
           .unwrap()
           .json::<ListWantsResponse>()
           .unwrap();

       // Stop the server
       handle.shutdown();
   }
   ```

2. Error handling improvements (see the error-type sketch after this list):
   - Proper HTTP status codes (400, 404, 500)
   - Structured error responses:

     ```json
     {"error": "Want not found", "want_id": "abc123"}
     ```

   - Add the `tracing` crate for structured logging

3. Add CORS middleware for the web app:

   ```rust
   let cors = CorsLayer::new()
       .allow_origin("http://localhost:3000".parse::<HeaderValue>().unwrap())
       .allow_methods([Method::GET, Method::POST, Method::DELETE]);

   let app = app.layer(cors);
   ```

4. Health check endpoint:

   ```rust
   async fn health() -> &'static str {
       "OK"
   }
   ```

5. Optional: metrics endpoint (Prometheus text format):

   ```rust
   async fn metrics() -> String {
       format!(
           "# HELP databuild_wants_total Total number of wants\n\
            databuild_wants_total {}\n\
            # HELP databuild_job_runs_total Total number of job runs\n\
            databuild_job_runs_total {}\n",
           want_count, job_run_count
       )
   }
   ```
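
A minimal error type for item 2, mapping domain errors to status codes and JSON bodies; assumes `serde_json` is already a dependency, and the variants are illustrative:

```rust
use axum::{http::StatusCode, response::{IntoResponse, Response}, Json};
use serde_json::json;

enum ApiError {
    NotFound(String),   // e.g. unknown want_id
    BadRequest(String),
    Internal(String),
}

impl IntoResponse for ApiError {
    fn into_response(self) -> Response {
        let (status, msg) = match self {
            ApiError::NotFound(m) => (StatusCode::NOT_FOUND, m),
            ApiError::BadRequest(m) => (StatusCode::BAD_REQUEST, m),
            ApiError::Internal(m) => (StatusCode::INTERNAL_SERVER_ERROR, m),
        };
        (status, Json(json!({ "error": msg }))).into_response()
    }
}
```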

**Files Created:**

- `databuild/tests/http_integration_test.rs` - integration tests

**Files Modified:**

- `databuild/http_server.rs` - add CORS, health, metrics, better errors
- `MODULE.bazel` - add `tracing` dependency

**Acceptance Criteria:**

- All endpoints have proper error handling
- CORS works for web app development
- Health check returns 200 OK
- Integration tests pass

---

## Future Enhancements (Not in Initial Plan)

### Workspace Auto-Discovery

- Walk up the directory tree looking for a `.databuild/` marker (see the sketch below)
- Store server metadata in `.databuild/server.json`:

  ```json
  {
    "pid": 12345,
    "port": 54321,
    "started_at": "2025-01-22T10:30:00Z",
    "workspace_root": "/Users/stuart/Projects/databuild"
  }
  ```

- CLI auto-starts the server if it is not running
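
A small sketch of the walk-up discovery, using only the standard library:

```rust
use std::path::{Path, PathBuf};

// Walk from `start` up to the filesystem root, returning the first
// directory that contains a `.databuild/` marker.
fn find_workspace_root(start: &Path) -> Option<PathBuf> {
    start
        .ancestors()
        .find(|dir| dir.join(".databuild").is_dir())
        .map(Path::to_path_buf)
}
```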

### Log Streaming (SSE)

- Implement `GET /api/job_runs/:id/logs/stdout?follow=true`
- Use Server-Sent Events for streaming
- Integrate with the FileLogStore from the logging.md plan

### State Caching

- Cache the reconstructed BuildState for faster reads (see the sketch below)
- Invalidate the cache when new events arrive
- Use `tokio::sync::RwLock<Option<(u64, BuildState)>>`, where the `u64` is `latest_event_id`
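
A hypothetical read-through cache keyed by the latest event id; assumes `BuildState: Clone`:

```rust
use tokio::sync::RwLock;

async fn cached_state(
    storage: &dyn BELStorage,
    cache: &RwLock<Option<(u64, BuildState)>>,
) -> BuildState {
    let latest = storage.latest_event_id();
    // Fast path: the cache is fresh if no events arrived since it was built
    if let Some((cached_id, state)) = cache.read().await.as_ref() {
        if *cached_id == latest {
            return state.clone();
        }
    }
    // Slow path: rebuild from the full event log and refresh the cache
    let events = storage.list_events(0, usize::MAX);
    let state = BuildState::from_events(&events);
    *cache.write().await = Some((latest, state.clone()));
    state
}
```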

### gRPC Support (If Needed)

- Add Tonic alongside Axum
- Share the same orchestrator/command channel
- Useful for language-agnostic clients

---

## Dependencies Summary

New dependencies to add to `MODULE.bazel`:

```python
# Async runtime
crate.spec(package = "tokio", features = ["full"], version = "1.0")

# Web framework
crate.spec(package = "axum", version = "0.7")
crate.spec(package = "tower", version = "0.4")
crate.spec(package = "tower-http", features = ["trace", "cors"], version = "0.5")

# CLI
crate.spec(package = "clap", features = ["derive"], version = "4.0")

# HTTP client for CLI
crate.spec(package = "reqwest", features = ["blocking", "json"], version = "0.11")

# Logging
crate.spec(package = "tracing", version = "0.1")
crate.spec(package = "tracing-subscriber", version = "0.3")
```

---

## Estimated Timeline

- **Phase 1:** 2-3 hours (thread-safe BEL storage)
- **Phase 2:** 4-6 hours (HTTP server with Axum)
- **Phase 3:** 3-4 hours (basic CLI)
- **Phase 4:** 3-4 hours (orchestrator integration)
- **Phase 5:** 2-3 hours (idle shutdown)
- **Phase 6:** 4-6 hours (testing and polish)

**Total:** ~18-26 hours for the complete implementation

---

## Design Rationale

### Why Event Log Separation?

**Alternatives Considered:**

1. **Shared state with RwLock**: the orchestrator would hold the write lock for the duration of `step()`, blocking all reads
2. **Actor model**: message-passing overhead for every operation, including reads

**Why Event Log Separation Wins:**

- Orchestrator stays completely synchronous (no refactoring)
- Reads don't block writes (eventual consistency is acceptable for a build system)
- Natural fit with the event-sourcing architecture
- Reconstructed state can be cached for even better read performance

### Why Not gRPC?

- User requirement: "JSON is a must"
- REST is more debuggable (curl, browser dev tools)
- gRPC adds complexity without a clear benefit
- gRPC can be added later if needed (the two can coexist)

### Why Axum Over Actix?

- Better compile-time type safety (extractors)
- Cleaner middleware composition (Tower)
- Native async/await (Actix uses an actor model internally)
- More ergonomic for this use case

### Why a Per-Workspace Server?

- Isolation: builds in different projects don't interfere
- Simpler: no need to route requests by workspace
- Matches Bazel's model (users already understand it)
- Easier to reason about resource usage