Web Server Implementation Plan

Architecture Summary

Concurrency Model: Event Log Separation

  • Orchestrator runs synchronously in a dedicated thread, owns the BEL (build event log) exclusively
  • Web server reads from shared BEL storage, sends write commands via channel
  • No locks on hot path, orchestrator stays single-threaded
  • Eventual consistency for reads (acceptable since builds take time anyway)

Daemon Model:

  • Server binary started manually from workspace root (for now)
  • Server tracks last request time, shuts down after idle timeout (default: 3 hours)
  • HTTP REST API served on localhost on a random port
  • Future: CLI can auto-discover/start server

Thread Model

Main Process
├─ HTTP Server (tokio multi-threaded runtime)
│  ├─ Request handlers (async, read from BEL storage)
│  └─ Command sender (send writes to orchestrator)
└─ Orchestrator Thread (std::thread, synchronous)
   ├─ Receives commands via mpsc channel
   ├─ Owns BEL (exclusive mutable access)
   └─ Runs existing step() loop

Read Path (Low Latency):

  1. HTTP request → Axum handler
  2. Read events from shared BEL storage (no lock contention)
  3. Reconstruct BuildState from events (can cache this)
  4. Return response

Write Path (Strong Consistency):

  1. HTTP request → Axum handler
  2. Send command via channel to orchestrator
  3. Orchestrator processes command in its thread
  4. Reply sent back via oneshot channel
  5. Return response

Why This Works:

  • Orchestrator remains completely synchronous (no refactoring needed)
  • Reads scale horizontally (multiple handlers, no locks)
  • Writes are serialized through orchestrator (consistent with current model)
  • Event sourcing means reads can be eventually consistent

Phase 1: Foundation - Make BEL Storage Thread-Safe

Goal: Allow BEL storage to be safely shared between orchestrator and web server

Tasks:

  1. Add Send + Sync bounds to BELStorage trait
  2. Wrap SqliteBELStorage::connection in Arc<Mutex<Connection>> or use r2d2 pool
  3. Add read-only methods to BELStorage:
    • list_events(offset: usize, limit: usize) -> Vec<DataBuildEvent>
    • get_event(event_id: u64) -> Option<DataBuildEvent>
    • latest_event_id() -> u64
  4. Add builder method to reconstruct BuildState from events:
    • BuildState::from_events(events: &[DataBuildEvent]) -> Self
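
A rough sketch of the shape tasks 3-4 are aiming for; the trait keeps its existing write methods, BuildState is assumed to have a Default (or equivalent empty constructor), and apply_event stands in for however BuildState currently folds in events:

    pub trait BELStorage: Send + Sync {
        // Existing append/write methods stay as-is...

        // Read-only accessors used by the web server
        fn list_events(&self, offset: usize, limit: usize) -> Vec<DataBuildEvent>;
        fn get_event(&self, event_id: u64) -> Option<DataBuildEvent>;
        fn latest_event_id(&self) -> u64;
    }

    impl BuildState {
        /// Rebuild state by folding every event in order; never mutates storage.
        pub fn from_events(events: &[DataBuildEvent]) -> Self {
            let mut state = BuildState::default();
            for event in events {
                state.apply_event(event); // placeholder for the existing reducer
            }
            state
        }
    }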

Files Modified:

  • databuild/build_event_log.rs - update trait and storage impls
  • databuild/build_state.rs - add from_events() builder

Acceptance Criteria:

  • BELStorage trait has Send + Sync bounds
  • Can clone Arc<SqliteBELStorage> and use from multiple threads
  • Can reconstruct BuildState from events without mutating storage

Phase 2: Web Server - HTTP API with Axum

Goal: HTTP server serving read/write APIs

Tasks:

  1. Add dependencies to MODULE.bazel:

    crate.spec(package = "tokio", features = ["full"], version = "1.0")
    crate.spec(package = "axum", version = "0.7")
    crate.spec(package = "tower", version = "0.4")
    crate.spec(package = "tower-http", features = ["trace", "cors"], version = "0.5")
    
  2. Create databuild/http_server.rs module with:

    • AppState struct holding:
      • bel_storage: Arc<dyn BELStorage> - shared read access
      • command_tx: mpsc::Sender<Command> - channel to orchestrator
      • last_request_time: Arc<AtomicU64> - for idle tracking
    • Axum router with all endpoints
    • Handler functions delegating to existing api_handle_* methods
  3. API Endpoints (a router sketch mapping these to handlers follows this task list):

    GET  /health                       → health check
    GET  /api/wants                    → list_wants
    POST /api/wants                    → create_want
    GET  /api/wants/:id                → get_want
    DELETE /api/wants/:id              → cancel_want
    GET  /api/partitions               → list_partitions
    GET  /api/job_runs                 → list_job_runs
    GET  /api/job_runs/:id/logs/stdout → stream_logs (stub)
    
  4. Handler pattern (reads):

    async fn list_wants(
        State(state): State<AppState>,
        Query(params): Query<ListWantsParams>,
    ) -> Json<ListWantsResponse> {
        // Read events from shared storage (no lock contention with the orchestrator)
        let events = state.bel_storage.list_events(0, 10000);

        // Reconstruct state from the event log
        let build_state = BuildState::from_events(&events);

        // Delegate to the existing API method
        Json(build_state.list_wants(&params.into()))
    }
    
  5. Handler pattern (writes):

    async fn create_want(
        State(state): State<AppState>,
        Json(req): Json<CreateWantRequest>,
    ) -> Result<Json<CreateWantResponse>, StatusCode> {
        // Send command to the orchestrator thread
        let (reply_tx, reply_rx) = oneshot::channel();
        state.command_tx
            .send(Command::CreateWant(req, reply_tx))
            .await
            .map_err(|_| StatusCode::SERVICE_UNAVAILABLE)?;

        // Wait for the orchestrator's reply
        let response = reply_rx.await.map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;
        Ok(Json(response))
    }
    

Files Created:

  • databuild/http_server.rs - new module

Files Modified:

  • databuild/lib.rs - add pub mod http_server;
  • MODULE.bazel - add dependencies

Acceptance Criteria:

  • Server starts on localhost random port, prints "Listening on http://127.0.0.1:XXXXX"
  • All read endpoints return correct JSON responses
  • Write endpoints return stub responses (Phase 4 will connect to orchestrator)

Phase 3: CLI - HTTP Client

Goal: CLI that sends HTTP requests to running server

Tasks:

  1. Add dependencies to MODULE.bazel:

    crate.spec(package = "clap", features = ["derive"], version = "4.0")
    crate.spec(package = "reqwest", features = ["blocking", "json"], version = "0.11")
    
  2. Create databuild/bin/databuild.rs main binary:

    #[derive(Parser)]
    #[command(name = "databuild")]
    enum Cli {
        /// Start the databuild server
        Serve(ServeArgs),
    
        /// Create a want for partitions
        Build(BuildArgs),
    
    /// Want operations
    #[command(subcommand)]
    Want(WantCommand),
    
        /// Stream job run logs
        Logs(LogsArgs),
    }
    
    #[derive(Args)]
    struct ServeArgs {
        #[arg(long, default_value = "8080")]
        port: u16,
    }
    
    #[derive(Subcommand)]
    enum WantCommand {
        Create(CreateWantArgs),
        List,
        Get { want_id: String },
        Cancel { want_id: String },
    }
    
  3. Server address discovery:

    • For now: hardcode http://localhost:8080 or accept --server-url flag
    • Future: read from .databuild/server.json file
  4. HTTP client implementation:

    fn list_wants(server_url: &str) -> Result<Vec<WantDetail>> {
        let client = reqwest::blocking::Client::new();
        let resp = client.get(&format!("{}/api/wants", server_url))
            .send()?
            .json::<ListWantsResponse>()?;
        Ok(resp.data)
    }
    
  5. Commands:

    • databuild serve --port 8080 - Start server (blocks)
    • databuild build part1 part2 - Create want for partitions
    • databuild want list - List all wants
    • databuild want get <id> - Get specific want
    • databuild want cancel <id> - Cancel want
    • databuild logs <job_run_id> - Stream logs (stub)
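
A sketch of the write side of the client, mirroring list_wants in task 4 (the CreateWantRequest field name is illustrative, not checked against the real struct):

    fn create_want(server_url: &str, partitions: Vec<String>) -> Result<CreateWantResponse> {
        let client = reqwest::blocking::Client::new();
        let req = CreateWantRequest { partitions }; // field name is a placeholder
        let resp = client
            .post(format!("{}/api/wants", server_url))
            .json(&req)
            .send()?
            .error_for_status()?
            .json::<CreateWantResponse>()?;
        Ok(resp)
    }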

Files Created:

  • databuild/bin/databuild.rs - new CLI binary

Files Modified:

  • databuild/BUILD.bazel - add rust_binary target for databuild CLI

Acceptance Criteria:

  • Can run databuild serve to start server
  • Can run databuild want list in another terminal and see wants
  • Commands print pretty JSON or formatted tables

Phase 4: Orchestrator Integration - Command Channel

Goal: Connect orchestrator to web server via message passing

Tasks:

  1. Create databuild/commands.rs with command enum:

    pub enum Command {
        CreateWant(CreateWantRequest, oneshot::Sender<CreateWantResponse>),
        CancelWant(CancelWantRequest, oneshot::Sender<CancelWantResponse>),
        // Only write operations need commands
    }
    
  2. Update Orchestrator:

    • Add command_rx: mpsc::Receiver<Command> field
    • In step() method, before polling:
      // Process all pending commands
      while let Ok(cmd) = self.command_rx.try_recv() {
          match cmd {
              Command::CreateWant(req, reply) => {
                  let resp = self.bel.api_handle_want_create(req);
                  let _ = reply.send(resp); // Ignore send errors
              }
              // ... other commands
          }
      }
      
  3. Create server startup function in http_server.rs:

    pub fn start_server(
        bel_storage: Arc<dyn BELStorage>,
        port: u16,
    ) -> (std::thread::JoinHandle<()>, mpsc::Sender<Command>) {
        let (cmd_tx, cmd_rx) = mpsc::channel(100);

        // Spawn orchestrator in a background thread
        let orch_bel = bel_storage.clone();
        let _orch_handle = std::thread::spawn(move || {
            let mut orch = Orchestrator::new_with_commands(orch_bel, cmd_rx);
            orch.join().unwrap();
        });

        // Run the HTTP server on its own thread so the tokio runtime
        // outlives this function instead of being dropped on return
        let command_tx = cmd_tx.clone();
        let http_handle = std::thread::spawn(move || {
            let runtime = tokio::runtime::Runtime::new().unwrap();
            runtime.block_on(async move {
                let app_state = AppState {
                    bel_storage,
                    command_tx,
                    last_request_time: Arc::new(AtomicU64::new(0)),
                };

                let app = create_router(app_state);
                let addr = SocketAddr::from(([127, 0, 0, 1], port));
                let listener = tokio::net::TcpListener::bind(addr).await.unwrap();
                axum::serve(listener, app).await.unwrap();
            });
        });

        (http_handle, cmd_tx)
    }
    
  4. Update databuild serve command to use start_server()
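
Task 4 could then look roughly like this in the CLI's main, with the SqliteBELStorage constructor and database path as placeholders:

    Cli::Serve(args) => {
        let bel_storage: Arc<dyn BELStorage> =
            Arc::new(SqliteBELStorage::open(".databuild/bel.sqlite")?); // path is illustrative
        let (http_handle, _cmd_tx) = start_server(bel_storage, args.port);

        // Block until the server thread exits (Ctrl-C, or idle shutdown once Phase 5 lands)
        http_handle.join().expect("server thread panicked");
    }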

Files Created:

  • databuild/commands.rs - new module

Files Modified:

  • databuild/orchestrator.rs - accept command channel, process in step()
  • databuild/http_server.rs - send commands for writes
  • databuild/bin/databuild.rs - use start_server() in serve command

Acceptance Criteria:

  • Creating a want via HTTP actually creates it in BuildState
  • Orchestrator processes commands without blocking its main loop
  • Can observe wants being scheduled into job runs

Phase 5: Daemon Lifecycle - Auto-Shutdown

Goal: Server shuts down gracefully after idle timeout

Tasks:

  1. Update AppState to track last request time:

    pub struct AppState {
        bel_storage: Arc<dyn BELStorage>,
        command_tx: mpsc::Sender<Command>,
        last_request_time: Arc<AtomicU64>, // epoch millis
        shutdown_tx: broadcast::Sender<()>,
    }
    
  2. Add Tower middleware to update timestamp:

    // Attached to the router with axum::middleware::from_fn_with_state(state, ...)
    async fn update_last_request_time(
        State(state): State<AppState>,
        req: Request,
        next: Next,
    ) -> Response {
        state.last_request_time.store(
            SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64,
            Ordering::Relaxed,
        );
        next.run(req).await
    }
    
  3. Background idle checker task:

    tokio::spawn(async move {
        let idle_timeout = Duration::from_secs(3 * 60 * 60); // 3 hours
    
        loop {
            tokio::time::sleep(Duration::from_secs(60)).await;
    
            let last_request = state.last_request_time.load(Ordering::Relaxed);
            let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as u64;
    
            if now - last_request > idle_timeout.as_millis() as u64 {
                eprintln!("Server idle for {} hours, shutting down",
                    idle_timeout.as_secs() / 3600);
                shutdown_tx.send(()).unwrap();
                break;
            }
        }
    });
    
  4. Graceful shutdown handling:

    let app = create_router(state);
    let listener = tokio::net::TcpListener::bind(addr).await?;
    axum::serve(listener, app)
        .with_graceful_shutdown(async move {
            shutdown_rx.recv().await.ok();
        })
        .await?;
    
  5. Cleanup on shutdown:

    • Orchestrator: finish current step, don't start new one
    • HTTP server: stop accepting new connections, finish in-flight requests
    • Log: "Shutdown complete"
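
One way to wire the orchestrator side, assuming the shutdown signal is also surfaced to the orchestrator thread as a shared AtomicBool (names are illustrative):

    pub fn run(&mut self) {
        // Finish the step currently in flight, but never start another
        // one once the shutdown flag has been set.
        while !self.shutdown.load(Ordering::Relaxed) {
            self.step();
        }
        eprintln!("Orchestrator: shutdown complete");
    }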

Files Modified:

  • databuild/http_server.rs - add idle tracking, shutdown logic
  • databuild/orchestrator.rs - accept shutdown signal, check before each step

Acceptance Criteria:

  • Server shuts down after configured idle timeout
  • In-flight requests complete successfully during shutdown
  • Shutdown is logged clearly

Phase 6: Testing & Polish

Goal: End-to-end testing and production readiness

Tasks:

  1. Integration tests:

    #[test]
    fn test_server_lifecycle() {
        // Start server
        let (handle, port) = start_test_server();
    
        // Make requests
        let wants = reqwest::blocking::get(
            &format!("http://localhost:{}/api/wants", port)
        ).unwrap().json::<ListWantsResponse>().unwrap();
    
        // Stop server
        handle.shutdown();
    }
    
  2. Error handling improvements (see the error type sketch after this list):

    • Proper HTTP status codes (400, 404, 500)
    • Structured error responses:
      {"error": "Want not found", "want_id": "abc123"}
      
    • Add tracing crate for structured logging
  3. Add CORS middleware for web app:

    let cors = CorsLayer::new()
        .allow_origin("http://localhost:3000".parse::<HeaderValue>().unwrap())
        .allow_methods([Method::GET, Method::POST, Method::DELETE]);
    
    app.layer(cors)
    
  4. Health check endpoint:

    async fn health() -> &'static str {
        "OK"
    }
    
  5. Optional: Metrics endpoint (prometheus format):

    async fn metrics() -> String {
        format!(
            "# HELP databuild_wants_total Total number of wants\n\
             databuild_wants_total {}\n\
             # HELP databuild_job_runs_total Total number of job runs\n\
             databuild_job_runs_total {}\n",
            want_count, job_run_count
        )
    }
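
For task 2, one common Axum pattern is a small error type that owns both the status code and the structured JSON body (this assumes serde_json, which the JSON error shape above already implies):

    use axum::{http::StatusCode, response::{IntoResponse, Response}, Json};
    use serde_json::json;

    pub enum ApiError {
        NotFound(String),   // e.g. unknown want_id -> 404
        BadRequest(String), // malformed request -> 400
        Internal(String),   // everything else -> 500
    }

    impl IntoResponse for ApiError {
        fn into_response(self) -> Response {
            let (status, message) = match self {
                ApiError::NotFound(m) => (StatusCode::NOT_FOUND, m),
                ApiError::BadRequest(m) => (StatusCode::BAD_REQUEST, m),
                ApiError::Internal(m) => (StatusCode::INTERNAL_SERVER_ERROR, m),
            };
            (status, Json(json!({ "error": message }))).into_response()
        }
    }

Handlers can then return Result<Json<T>, ApiError> and get consistent status codes and bodies for free.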
    

Files Created:

  • databuild/tests/http_integration_test.rs - integration tests

Files Modified:

  • databuild/http_server.rs - add CORS, health, metrics, better errors
  • MODULE.bazel - add tracing dependency

Acceptance Criteria:

  • All endpoints have proper error handling
  • CORS works for web app development
  • Health check returns 200 OK
  • Integration tests pass

Future Enhancements (Not in Initial Plan)

Workspace Auto-Discovery

  • Walk up directory tree looking for .databuild/ marker
  • Store server metadata in .databuild/server.json:
    {
      "pid": 12345,
      "port": 54321,
      "started_at": "2025-01-22T10:30:00Z",
      "workspace_root": "/Users/stuart/Projects/databuild"
    }
    
  • CLI auto-starts server if not running
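
A sketch of the directory walk, using only the standard library:

    use std::path::{Path, PathBuf};

    /// Walk up from `start` until a directory containing `.databuild/` is found.
    fn find_workspace_root(start: &Path) -> Option<PathBuf> {
        let mut dir = start.to_path_buf();
        loop {
            if dir.join(".databuild").is_dir() {
                return Some(dir);
            }
            if !dir.pop() {
                return None;
            }
        }
    }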

Log Streaming (SSE)

  • Implement GET /api/job_runs/:id/logs/stdout?follow=true
  • Use Server-Sent Events for streaming
  • Integrate with FileLogStore from logging.md plan
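
A rough shape for the SSE handler, assuming a futures (or tokio-stream) dependency and some log source that can yield stdout lines; neither exists yet, so this is only a placeholder:

    use std::convert::Infallible;
    use axum::{extract::Path, response::sse::{Event, KeepAlive, Sse}};
    use futures::stream::{self, Stream};

    async fn stream_stdout(
        Path(job_run_id): Path<String>,
    ) -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
        // Placeholder: a real implementation would tail FileLogStore for job_run_id
        let lines = vec![format!("logs for {} not implemented yet", job_run_id)];
        let stream = stream::iter(lines.into_iter().map(|line| Ok(Event::default().data(line))));
        Sse::new(stream).keep_alive(KeepAlive::default())
    }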

State Caching

  • Cache reconstructed BuildState for faster reads
  • Invalidate cache when new events arrive
  • Use tokio::sync::RwLock<Option<(u64, BuildState)>> where u64 is latest_event_id
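
A sketch of the cached read path under those assumptions (also assumes BuildState implements Clone and that AppState gains a state_cache field of that type):

    async fn load_build_state(state: &AppState) -> BuildState {
        let latest = state.bel_storage.latest_event_id();

        // Fast path: the cache is still current
        if let Some((cached_id, cached)) = state.state_cache.read().await.as_ref() {
            if *cached_id == latest {
                return cached.clone();
            }
        }

        // Slow path: rebuild from the event log and refresh the cache
        let events = state.bel_storage.list_events(0, usize::MAX);
        let rebuilt = BuildState::from_events(&events);
        *state.state_cache.write().await = Some((latest, rebuilt.clone()));
        rebuilt
    }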

gRPC Support (If Needed)

  • Add Tonic alongside Axum
  • Share same orchestrator/command channel
  • Useful for language-agnostic clients

Dependencies Summary

New dependencies to add to MODULE.bazel:

# Async runtime
crate.spec(package = "tokio", features = ["full"], version = "1.0")

# Web framework
crate.spec(package = "axum", version = "0.7")
crate.spec(package = "tower", version = "0.4")
crate.spec(package = "tower-http", features = ["trace", "cors"], version = "0.5")

# CLI
crate.spec(package = "clap", features = ["derive"], version = "4.0")

# HTTP client for CLI
crate.spec(package = "reqwest", features = ["blocking", "json"], version = "0.11")

# Logging
crate.spec(package = "tracing", version = "0.1")
crate.spec(package = "tracing-subscriber", version = "0.3")

Estimated Timeline

  • Phase 1: 2-3 hours (thread-safe BEL storage)
  • Phase 2: 4-6 hours (HTTP server with Axum)
  • Phase 3: 3-4 hours (basic CLI)
  • Phase 4: 3-4 hours (orchestrator integration)
  • Phase 5: 2-3 hours (idle shutdown)
  • Phase 6: 4-6 hours (testing and polish)

Total: ~18-26 hours for complete implementation


Design Rationale

Why Event Log Separation?

Alternatives Considered:

  1. Shared State with RwLock: Orchestrator holds write lock during step(), blocking all reads
  2. Actor Model: Extra overhead from message passing for all operations

Why Event Log Separation Wins:

  • Orchestrator stays completely synchronous (no refactoring)
  • Reads don't block writes (eventual consistency acceptable for build system)
  • Natural fit with event sourcing architecture
  • Can cache reconstructed state for even better read performance

Why Not gRPC?

  • User requirement: "JSON is a must"
  • REST is more debuggable (curl, browser dev tools)
  • gRPC adds complexity without clear benefit
  • Can add gRPC later if needed (both can coexist)

Why Axum Over Actix?

  • Better compile-time type safety (extractors)
  • Cleaner middleware composition (Tower)
  • Native async/await (Actix uses actor model internally)
  • More ergonomic for this use case

Why Per-Workspace Server?

  • Isolation: builds in different projects don't interfere
  • Simpler: no need to route requests by workspace
  • Matches Bazel's model (users already understand it)
  • Easier to reason about resource usage