# CLI-Server Automation

This document describes how the DataBuild CLI automatically manages the HTTP server lifecycle, so that users get a "magical" experience and never need to start or stop a server by hand.

## Goals

1. **Zero-config startup**: Running `databuild want data/alpha` should "just work" without manual server management
2. **Workspace isolation**: Multiple graphs can run independently with separate servers and databases
3. **Resource efficiency**: Servers shut down automatically after an idle timeout
4. **Transparency**: Users can inspect server state and logs when needed

## Design Overview

### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                         CLI Process                          │
│  databuild want data/alpha                                   │
│                                                              │
│  1. Load config (databuild.json)                            │
│  2. Check .databuild/${graph_label}/server.lock             │
│  3. If not running → daemonize server                       │
│  4. Forward request to http://localhost:${port}/api/wants   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Daemonized Server                         │
│  PID: 12345, Port: 3538                                     │
│                                                              │
│  - Holds file lock on server.lock                           │
│  - Writes logs to server.log                                │
│  - Auto-shutdown after idle_timeout_seconds                 │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              .databuild/${graph_label}/                      │
│                                                              │
│  server.lock   - Lock file + runtime state (JSON)           │
│  bel.sqlite    - Build Event Log database                   │
│  server.log    - Server stdout/stderr                       │
└─────────────────────────────────────────────────────────────┘
```

### Directory Structure

```
project/
├── databuild.json              # User-authored config
├── .databuild/
│   └── ${graph_label}/         # e.g., "podcast_reviews"
│       ├── server.lock         # Runtime state + file lock
│       ├── bel.sqlite          # Build Event Log (SQLite)
│       └── server.log          # Server logs
```

## Configuration

### Extended Config Schema

The `databuild.json` file (or a custom config file) gains two new top-level fields, `graph_label` and `idle_timeout_seconds`:

```json
{
  "graph_label": "podcast_reviews",
  "idle_timeout_seconds": 3600,
  "jobs": [
    {
      "label": "//examples:daily_summaries",
      "entrypoint": "./jobs/daily_summaries.sh",
      "environment": { "OUTPUT_DIR": "/data/output" },
      "partition_patterns": ["daily_summaries/.*"]
    }
  ]
}
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `graph_label` | string | **required** | Unique identifier for this graph, used for `.databuild/${graph_label}/` directory |
| `idle_timeout_seconds` | u64 | 3600 | Server auto-shutdown after this many seconds of inactivity |
| `jobs` | array | [] | Job configurations (existing schema) |
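
As a rough illustration of how the defaults and the required field in the table could be applied, here is a minimal Python sketch. It is not the project's implementation: the class shape, `parse_config`, and `state_dir` are hypothetical names chosen for this example, and the validation rules are assumptions read off the table.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path

DEFAULT_IDLE_TIMEOUT_SECONDS = 3600  # default taken from the table above


@dataclass
class DatabuildConfig:
    graph_label: str
    idle_timeout_seconds: int = DEFAULT_IDLE_TIMEOUT_SECONDS
    jobs: list = field(default_factory=list)

    @property
    def state_dir(self) -> Path:
        # Per-graph directory: .databuild/${graph_label}/
        return Path(".databuild") / self.graph_label


def parse_config(text: str) -> DatabuildConfig:
    """Parse config JSON, applying defaults and rejecting a missing graph_label."""
    raw = json.loads(text)
    label = raw.get("graph_label")
    if not isinstance(label, str) or not label:
        raise ValueError("graph_label is required")
    timeout = raw.get("idle_timeout_seconds", DEFAULT_IDLE_TIMEOUT_SECONDS)
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(timeout, bool) or not isinstance(timeout, int) or timeout < 0:
        raise ValueError("idle_timeout_seconds must be a non-negative integer")
    return DatabuildConfig(label, timeout, raw.get("jobs", []))
```

Failing fast on a missing `graph_label` matters here because every runtime path (lock file, database, log) is derived from it.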

### Runtime State (server.lock)

The `server.lock` file serves dual purposes:
1. **File lock**: Prevents multiple servers for the same graph
2. **Runtime state**: Contains current server information

```json
{
  "pid": 12345,
  "port": 3538,
  "started_at": 1701234567890,
  "config_hash": "sha256:abc123..."
}
```

| Field | Description |
|-------|-------------|
| `pid` | Server process ID |
| `port` | HTTP port the server is listening on |
| `started_at` | Unix timestamp (milliseconds) when server started |
| `config_hash` | Hash of config file contents (for detecting config changes) |
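
The runtime state maps naturally onto a small record type. The following Python sketch shows one way to serialize and parse it; the field names come from the table, while the method names and the `uptime_seconds` helper are hypothetical additions for illustration.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class ServerLock:
    pid: int
    port: int
    started_at: int   # Unix timestamp in milliseconds
    config_hash: str

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "ServerLock":
        raw = json.loads(text)
        return cls(
            pid=int(raw["pid"]),
            port=int(raw["port"]),
            started_at=int(raw["started_at"]),
            config_hash=str(raw["config_hash"]),
        )

    def uptime_seconds(self, now_ms=None) -> float:
        # started_at is in milliseconds, so convert the difference to seconds
        if now_ms is None:
            now_ms = int(time.time() * 1000)
        return (now_ms - self.started_at) / 1000
```

Keeping `started_at` in milliseconds means any uptime display has to divide by 1000 before formatting.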

## CLI Commands

### Existing Commands (Enhanced)

All commands that talk to the server now start it automatically if it is not already running:

```bash
# Creates want, auto-starting server if not running
databuild want data/alpha data/beta

# Lists wants, auto-starting server if not running
databuild wants list

# Lists partitions
databuild partitions list

# Lists job runs
databuild job-runs list
```

### New Commands

```bash
# Explicitly start server (for users who want manual control)
databuild serve
databuild serve --config ./custom-config.json

# Show server status
databuild status

# Graceful shutdown
databuild stop
```

### Command: `databuild status`

Shows current server state:

```
DataBuild Server Status
━━━━━━━━━━━━━━━━━━━━━━━━
Graph:    podcast_reviews
Status:   Running
PID:      12345
Port:     3538
Uptime:   2h 34m
Database: .databuild/podcast_reviews/bel.sqlite

Active Job Runs: 2
Pending Wants:   5
```

### Command: `databuild stop`

Gracefully shuts down the server:

```bash
$ databuild stop
Stopping DataBuild server (PID 12345)...
Server stopped.
```

## Server Lifecycle

### Startup Flow

```
CLI invocation (e.g., databuild want data/alpha)
     │
     ▼
Load databuild.json (or --config path)
     │
     ▼
Extract graph_label from config
     │
     ▼
Ensure .databuild/${graph_label}/ exists
     │
     ▼
Try flock(server.lock, LOCK_EX | LOCK_NB)
     │
     ├─── Lock acquired → Server not running (or crashed)
     │         │
     │         ▼
     │    Find available port (start from 3538, increment if busy)
     │         │
     │         ▼
     │    Daemonize: fork → setsid → fork → redirect I/O
     │         │
     │         ▼
     │    Child: Start server, hold lock, write server.lock JSON
     │    Parent: Wait for health check, then proceed
     │
     └─── Lock blocked → Server already running
               │
               ▼
          Read port from server.lock
               │
               ▼
          Health check (GET /health)
               │
               ├─── Success → Use this server
               │
               └─── Failure → Wait and retry (server starting up)
     │
     ▼
Forward request to http://localhost:${port}/api/...
```
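
The branch point in this flow is the non-blocking lock attempt. A minimal Python sketch of that single step, assuming a POSIX system, is shown below; `try_acquire_server_lock` is a hypothetical helper name, not part of DataBuild.

```python
import fcntl
import os


def try_acquire_server_lock(lock_path: str):
    """Return an fd holding the exclusive lock, or None if a server already holds it."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting for the holder
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

A returned descriptor corresponds to the "Lock acquired" branch (start a server), and `None` corresponds to "Lock blocked" (read the port and run the health check). The lock is released when the last descriptor referring to that open file is closed, which includes process exit.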

### Daemonization

The server daemonizes using the classic double-fork pattern:

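1. **First fork**: The parent returns to the CLI flow, which waits for the health check before forwarding the request
2. **setsid()**: The child becomes a session leader and detaches from the controlling terminal
3. **Second fork**: The session leader exits; the grandchild is not a session leader, so it cannot re-acquire a controlling terminal
4. **Redirect I/O**: stdout/stderr → `server.log`, stdin → `/dev/null`
5. **Write lock file**: PID, port, started_at, config_hash
6. **Start serving**: Hold the file lock for the lifetime of the process (an `flock()` lock taken before forking is inherited through the open file descriptor)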
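
The double-fork sequence can be sketched in Python as follows, assuming a POSIX system. `spawn_daemon` and its `run` callback are hypothetical names for this illustration; `run` stands in for steps 5 and 6 (writing the lock file and serving).

```python
import os
import sys
import traceback


def spawn_daemon(run, log_path: str) -> None:
    """Run `run()` in a detached daemon; the caller returns once it is launched."""
    sys.stdout.flush()
    sys.stderr.flush()           # avoid duplicating buffered output in the children
    pid = os.fork()
    if pid > 0:
        os.waitpid(pid, 0)       # reap the short-lived intermediate child
        return                   # original process (the CLI) carries on
    # Intermediate child: start a new session with no controlling terminal
    os.setsid()
    if os.fork() > 0:
        os._exit(0)              # session leader exits; the grandchild is the daemon
    # Daemon (grandchild): redirect standard streams, then run the server
    status = 0
    try:
        devnull = os.open(os.devnull, os.O_RDONLY)
        log = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        os.dup2(devnull, 0)      # stdin  <- /dev/null
        os.dup2(log, 1)          # stdout -> log file
        os.dup2(log, 2)          # stderr -> log file
        run()
    except BaseException:
        traceback.print_exc()    # written to the log via fd 2
        status = 1
    finally:
        try:
            sys.stdout.flush()
            sys.stderr.flush()
        finally:
            os._exit(status)     # never fall back into the caller's code
```

The daemon always leaves through `os._exit`, so it cannot accidentally continue executing the CLI's code path after `run()` returns.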

### Idle Timeout

The server monitors request activity:

1. Track `last_request_time` (updated on each HTTP request)
2. A background task checks every 60 seconds, so shutdown can lag the timeout by up to a minute
3. If `now - last_request_time > idle_timeout_seconds` → graceful shutdown
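
A minimal Python sketch of this check, assuming a monotonic clock and a thread-based background task, might look like the following. `IdleMonitor` and `watch_for_idle` are hypothetical names; the injectable `clock` exists only to make the logic testable.

```python
import threading
import time


class IdleMonitor:
    def __init__(self, idle_timeout_seconds, clock=time.monotonic):
        self.idle_timeout_seconds = idle_timeout_seconds
        self._clock = clock
        self.last_request_time = clock()

    def record_request(self):
        # Call this from the HTTP handler on every request
        self.last_request_time = self._clock()

    def should_shut_down(self):
        return self._clock() - self.last_request_time > self.idle_timeout_seconds


def watch_for_idle(monitor, shutdown, stop: threading.Event, check_interval=60.0):
    """Poll the monitor until it reports idleness or `stop` is set."""
    # Event.wait returns False on timeout and True once `stop` is set
    while not stop.wait(check_interval):
        if monitor.should_shut_down():
            shutdown()
            return
```

A monotonic clock is used so that wall-clock adjustments cannot trigger or suppress a shutdown.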

### Graceful Shutdown

On shutdown (idle timeout, SIGTERM, or `databuild stop`):

1. Stop accepting new connections
2. Wait for in-flight requests to complete (with timeout)
3. Signal orchestrator to stop
4. Wait for orchestrator thread to finish
5. Release file lock (automatic on process exit)
6. Exit

## Port Selection

When starting a new server:

1. Start with default port 3538
2. Try to bind; if port in use, increment and retry
3. Store selected port in `server.lock`
4. CLI reads port from lock file, not from config

This handles the case where the preferred port is occupied by another process.
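
The bind-and-increment loop can be sketched in Python as below. `bind_first_free_port` is a hypothetical helper, and the `max_attempts` cap is an assumption; the design above does not state an upper bound on retries.

```python
import socket


def bind_first_free_port(start_port=3538, max_attempts=100, host="127.0.0.1"):
    """Return (listening_socket, port) for the first port that binds successfully."""
    for port in range(start_port, start_port + max_attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            sock.close()         # port busy (or otherwise unusable): try the next one
            continue
        sock.listen()
        return sock, port
    raise RuntimeError(f"no free port in {start_port}..{start_port + max_attempts - 1}")
```

Returning the already-bound socket, rather than just a port number, avoids a window in which another process could take the port between the check and the server's own bind.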

## Config Change Detection

The `config_hash` field in `server.lock` enables detecting when the config file has changed since the server started:

1. On CLI invocation, compute hash of current config file
2. Compare with `config_hash` in `server.lock`
3. If different, warn user:
   ```
   Warning: Config has changed since server started.
   Run 'databuild stop && databuild serve' to apply changes.
   ```

We don't auto-restart because that could interrupt in-progress builds.
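
The comparison itself is small. Here is a Python sketch, assuming the hash is SHA-256 over the raw file bytes with the `sha256:` prefix shown in the `server.lock` example; both function names are hypothetical.

```python
import hashlib


def compute_config_hash(config_bytes: bytes) -> str:
    # Hashing raw bytes means even whitespace-only edits count as a change
    return "sha256:" + hashlib.sha256(config_bytes).hexdigest()


def config_change_warning(current_config: bytes, recorded_hash: str):
    """Return the warning text if the config differs from the recorded hash, else None."""
    if compute_config_hash(current_config) == recorded_hash:
        return None
    return (
        "Warning: Config has changed since server started.\n"
        "Run 'databuild stop && databuild serve' to apply changes."
    )
```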

## Error Handling

### Stale Lock File

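If `server.lock` exists but the lock is not held (the previous server crashed):

1. Treat the JSON in `server.lock` as stale and overwrite it in place; unlinking the file instead would let a concurrent CLI create and lock a fresh file while this process still holds a lock on the old one
2. Proceed with normal startup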

### Server Unreachable

If lock is held but health check fails repeatedly:

1. Log warning: "Server appears unresponsive"
2. After N retries, suggest: "Try 'kill -9 ${pid}' and retry"
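
The retry behavior can be sketched in Python as follows. The `/health` endpoint comes from the startup flow; treating HTTP 200 as healthy, the retry count, and the delay are assumptions, and both function names are hypothetical.

```python
import http.client
import time
import urllib.request


def http_health_probe(port, timeout=1.0):
    """Return True if GET /health on localhost answers with HTTP 200."""
    try:
        url = f"http://localhost:{port}/health"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses; refused connections land here
        return False


def wait_for_healthy(port, probe=http_health_probe, retries=10,
                     delay_seconds=0.5, sleep=time.sleep):
    """Probe up to `retries` times, sleeping between attempts; True on first success."""
    for attempt in range(retries):
        if probe(port):
            return True
        if attempt < retries - 1:
            sleep(delay_seconds)
    return False
```

When `wait_for_healthy` returns `False`, the CLI would print the "Server appears unresponsive" warning and the suggested recovery step. The same loop also covers the startup case where the lock is held but the server has not begun listening yet.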

### Port Conflict

If preferred port is in use:

1. Automatically try next port (3539, 3540, ...)
2. Store actual port in `server.lock`
3. CLI reads from lock file, so it always connects to correct port

## Future Considerations

### Multi-Graph Scenarios

The `graph_label`-based directory structure supports multiple graphs in the same workspace. Each graph has independent:
- Server process
- Port allocation
- BEL database
- Idle timeout

### Remote Servers

The current design assumes localhost. Future extensions could support:
- Remote server URLs in config
- SSH tunneling
- Cloud-hosted servers

### Job Re-entrance

Currently, if a server crashes mid-build, job runs are orphaned. Future work:
- Detect orphaned job runs on startup
- Resume or mark as failed
- Track external job processes (e.g., Databricks jobs)

## Implementation Checklist

- [ ] Extend `DatabuildConfig` with `graph_label` and `idle_timeout_seconds`
- [ ] Create `ServerLock` struct for reading/writing lock file
- [ ] Implement file locking with `flock()`
- [ ] Implement daemonization (double-fork pattern)
- [ ] Add auto-start logic to existing CLI commands
- [ ] Add `databuild stop` command
- [ ] Add `databuild status` command
- [ ] Update example configs with `graph_label`
- [ ] Add integration tests for server lifecycle