# CLI-Server Automation
This document describes how the DataBuild CLI automatically manages the HTTP server lifecycle, providing a "magical" experience where users don't need to think about starting or stopping servers.
## Goals
1. **Zero-config startup**: Running `databuild want data/alpha` should "just work" without manual server management
2. **Workspace isolation**: Multiple graphs can run independently with separate servers and databases
3. **Resource efficiency**: Servers shut down automatically after an idle timeout
4. **Transparency**: Users can inspect server state and logs when needed
## Design Overview
### Architecture
```
┌─────────────────────────────────────────────────────────────┐
│  CLI Process                                                │
│  databuild want data/alpha                                  │
│                                                             │
│  1. Load config (databuild.json)                            │
│  2. Check .databuild/${graph_label}/server.lock             │
│  3. If not running → daemonize server                       │
│  4. Forward request to http://localhost:${port}/api/wants   │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│  Daemonized Server                                          │
│  PID: 12345, Port: 8080                                     │
│                                                             │
│  - Holds file lock on server.lock                           │
│  - Writes logs to server.log                                │
│  - Auto-shutdown after idle_timeout_seconds                 │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│  .databuild/${graph_label}/                                 │
│                                                             │
│  server.lock  - Lock file + runtime state (JSON)            │
│  bel.sqlite   - Build Event Log database                    │
│  server.log   - Server stdout/stderr                        │
└─────────────────────────────────────────────────────────────┘
```
### Directory Structure
```
project/
├── databuild.json              # User-authored config
├── .databuild/
│   └── ${graph_label}/         # e.g., "podcast_reviews"
│       ├── server.lock         # Runtime state + file lock
│       ├── bel.sqlite          # Build Event Log (SQLite)
│       └── server.log          # Server logs
```
## Configuration
### Extended Config Schema
The `databuild.json` (or custom config file) is extended with:
```json
{
"graph_label": "podcast_reviews",
"idle_timeout_seconds": 3600,
"jobs": [
{
"label": "//examples:daily_summaries",
"entrypoint": "./jobs/daily_summaries.sh",
"environment": { "OUTPUT_DIR": "/data/output" },
"partition_patterns": ["daily_summaries/.*"]
}
]
}
```
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `graph_label` | string | **required** | Unique identifier for this graph, used for `.databuild/${graph_label}/` directory |
| `idle_timeout_seconds` | u64 | 3600 | Server auto-shutdown after this many seconds of inactivity |
| `jobs` | array | [] | Job configurations (existing schema) |
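
As an illustration of the defaults in the table above, here is a minimal Python sketch of config parsing. The `DatabuildConfig` name comes from the implementation checklist; the `parse_config` helper and its validation rules are assumptions made for this example, not the project's actual code.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DatabuildConfig:
    graph_label: str                      # required, names .databuild/${graph_label}/
    idle_timeout_seconds: int = 3600      # documented default
    jobs: list = field(default_factory=list)


def parse_config(text: str) -> DatabuildConfig:
    """Parse config JSON, applying the documented defaults (hypothetical helper)."""
    raw = json.loads(text)
    label = raw.get("graph_label")
    if not isinstance(label, str) or not label:
        raise ValueError("graph_label is required and must be a non-empty string")
    timeout = raw.get("idle_timeout_seconds", 3600)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, int) or timeout < 0:
        raise ValueError("idle_timeout_seconds must be a non-negative integer")
    return DatabuildConfig(label, timeout, raw.get("jobs", []))
```

Rejecting a missing `graph_label` up front keeps the CLI from creating a `.databuild/` subdirectory with an empty name.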
### Runtime State (server.lock)
The `server.lock` file serves dual purposes:
1. **File lock**: Prevents multiple servers for the same graph
2. **Runtime state**: Contains current server information
```json
{
"pid": 12345,
"port": 8080,
"started_at": 1701234567890,
"config_hash": "sha256:abc123..."
}
```
| Field | Description |
|-------|-------------|
| `pid` | Server process ID |
| `port` | HTTP port the server is listening on |
| `started_at` | Unix timestamp (milliseconds) when server started |
| `config_hash` | Hash of config file contents (for detecting config changes) |
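
The runtime state can be modelled as a small record that round-trips through JSON. The sketch below is in Python for illustration only; the `ServerLock` name follows the implementation checklist, while the `to_json` and `from_json` methods are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ServerLock:
    pid: int
    port: int
    started_at: int      # Unix timestamp in milliseconds
    config_hash: str     # e.g. "sha256:<hex digest>"

    def to_json(self) -> str:
        """Serialize the state to the JSON stored in server.lock."""
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(text: str) -> "ServerLock":
        """Parse server.lock contents; raises KeyError if a field is missing."""
        raw = json.loads(text)
        return ServerLock(
            pid=int(raw["pid"]),
            port=int(raw["port"]),
            started_at=int(raw["started_at"]),
            config_hash=str(raw["config_hash"]),
        )
```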
## CLI Commands
### Existing Commands (Enhanced)
All commands that interact with the server now auto-start it if needed:
```bash
# Creates want, auto-starting server if not running
databuild want data/alpha data/beta
# Lists wants, auto-starting server if not running
databuild wants list
# Lists partitions
databuild partitions list
# Lists job runs
databuild job-runs list
```
### New Commands
```bash
# Explicitly start server (for users who want manual control)
databuild serve
databuild serve --config ./custom-config.json
# Show server status
databuild status
# Graceful shutdown
databuild stop
```
### Command: `databuild status`
Shows current server state:
```
DataBuild Server Status
━━━━━━━━━━━━━━━━━━━━━━━━
Graph: podcast_reviews
Status: Running
PID: 12345
Port: 8080
Uptime: 2h 34m
Database: .databuild/podcast_reviews/bel.sqlite
Active Job Runs: 2
Pending Wants: 5
```
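
The `Uptime` line can be derived from `started_at` in `server.lock`. A minimal Python sketch, assuming the `2h 34m` format shown above; the `format_uptime` helper and the choice to omit a zero hour component are assumptions of this example.

```python
def format_uptime(started_at_ms: int, now_ms: int) -> str:
    """Render elapsed time as 'Xh Ym', or 'Ym' when under an hour."""
    total_minutes = max(0, now_ms - started_at_ms) // 60_000
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m" if hours else f"{minutes}m"
```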
### Command: `databuild stop`
Gracefully shuts down the server:
```bash
$ databuild stop
Stopping DataBuild server (PID 12345)...
Server stopped.
```
## Server Lifecycle
### Startup Flow
```
CLI invocation (e.g., databuild want data/alpha)
        │
        ▼
Load databuild.json (or --config path)
        │
        ▼
Extract graph_label from config
        │
        ▼
Ensure .databuild/${graph_label}/ exists
        │
        ▼
Try flock(server.lock, LOCK_EX | LOCK_NB)
        │
        ├─── Lock acquired → Server not running (or crashed)
        │         │
        │         ▼
        │    Find available port (start from 3538, increment if busy)
        │         │
        │         ▼
        │    Daemonize: fork → setsid → fork → redirect I/O
        │         │
        │         ▼
        │    Child: Start server, hold lock, write server.lock JSON
        │    Parent: Wait for health check, then proceed
        │
        └─── Lock blocked → Server already running
                  │
                  ▼
             Read port from server.lock
                  │
                  ▼
             Health check (GET /health)
                  │
                  ├─── Success → Use this server
                  └─── Failure → Wait and retry (server starting up)

Forward request to http://localhost:${port}/api/...
```
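
The non-blocking lock attempt at the top of this flow can be sketched as follows. This is a Python illustration using the standard `fcntl` module (POSIX only); the `try_acquire_lock` helper is hypothetical.

```python
import fcntl
import os


def try_acquire_lock(path: str):
    """Return an fd holding an exclusive flock on path, or None if it is already held."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the holder.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None  # lock blocked: a server is already running
    return fd  # lock acquired: keep this fd open for as long as the lock is needed
```

An `flock` lock belongs to the open file description, so a descriptor inherited across `fork()` keeps the lock alive in the daemon after the CLI process exits.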
### Daemonization
The server daemonizes using the classic double-fork pattern:
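1. **First fork**: The parent returns to the CLI, which then waits for the health check; the child continues
2. **setsid()**: The child becomes a session leader and detaches from the controlling terminal
3. **Second fork**: The session leader exits; the grandchild is not a session leader, so it can never re-acquire a controlling terminal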
4. **Redirect I/O**: stdout/stderr → `server.log`, stdin → `/dev/null`
5. **Write lock file**: PID, port, started_at, config_hash
6. **Start serving**: Hold file lock for lifetime of process
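
Steps 1 through 4 can be sketched in Python as shown below (POSIX only). The `daemonize` helper and its return convention are assumptions of this example; writing the lock file and serving (steps 5 and 6) are left to the caller.

```python
import os
import sys


def daemonize(log_path: str) -> bool:
    """Double-fork. Returns False in the original process and True in the daemon."""
    # Flush buffered output first so it is not duplicated in the children.
    sys.stdout.flush()
    sys.stderr.flush()

    pid = os.fork()
    if pid > 0:
        os.waitpid(pid, 0)  # reap the short-lived intermediate child
        return False        # original process goes back to the CLI

    os.setsid()             # new session, detached from the terminal
    if os.fork() > 0:
        os._exit(0)         # session leader exits; the grandchild carries on

    # Redirect stdin to /dev/null and stdout/stderr to the log file.
    stdin_fd = os.open(os.devnull, os.O_RDONLY)
    log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    os.dup2(stdin_fd, 0)
    os.dup2(log_fd, 1)
    os.dup2(log_fd, 2)
    for fd in (stdin_fd, log_fd):
        if fd > 2:
            os.close(fd)
    return True
```

The sketch deliberately leaves other open descriptors alone, so a lock descriptor acquired before the fork stays open in the daemon.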
### Idle Timeout
The server monitors request activity:
1. Track `last_request_time` (updated on each HTTP request)
2. Background task checks every 60 seconds
3. If `now - last_request_time > idle_timeout_seconds` → graceful shutdown
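
A minimal Python sketch of this check follows. The `IdleTracker` class, the `watch_idle` loop, and the injectable clock are hypothetical names chosen for the example; only `last_request_time`, the 60-second interval, and the comparison come from the steps above.

```python
import threading
import time


class IdleTracker:
    """Tracks the time of the most recent request."""

    def __init__(self, idle_timeout_seconds: float, clock=time.monotonic):
        self._timeout = idle_timeout_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._last_request_time = clock()

    def touch(self) -> None:
        """Call on every HTTP request."""
        with self._lock:
            self._last_request_time = self._clock()

    def is_idle(self) -> bool:
        with self._lock:
            return self._clock() - self._last_request_time > self._timeout


def watch_idle(tracker, stop_event, on_idle, check_interval=60.0):
    """Background loop: poll until idle, then invoke on_idle once."""
    # Event.wait returns True as soon as stop_event is set, ending the loop early.
    while not stop_event.wait(check_interval):
        if tracker.is_idle():
            on_idle()
            return
```

With a 60-second check interval, shutdown can lag the configured timeout by up to one interval.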
### Graceful Shutdown
On shutdown (idle timeout, SIGTERM, or `databuild stop`):
1. Stop accepting new connections
2. Wait for in-flight requests to complete (with timeout)
3. Signal orchestrator to stop
4. Wait for orchestrator thread to finish
5. Release file lock (automatic on process exit)
6. Exit
## Port Selection
When starting a new server:
1. Start with default port 3538
2. Try to bind; if port in use, increment and retry
3. Store selected port in `server.lock`
4. CLI reads port from lock file, not from config
This handles the case where the preferred port is occupied by another process.
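
The bind-and-increment loop can be sketched in Python like this. The `bind_first_free_port` helper, the `max_tries` cap, and binding to `127.0.0.1` are assumptions of this example. Returning the bound socket, rather than only the port number, avoids a window in which another process could take the port between the check and the server start.

```python
import errno
import socket


def bind_first_free_port(start: int = 3538, max_tries: int = 100,
                         host: str = "127.0.0.1") -> socket.socket:
    """Bind to the first free port at or above start and return the listening socket."""
    for port in range(start, min(start + max_tries, 65536)):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise       # unexpected failure, do not mask it
            continue        # port busy: try the next one
        sock.listen()
        return sock
    raise RuntimeError(f"no free port found starting at {start}")
```

The selected port is then available as `sock.getsockname()[1]` for writing into `server.lock`.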
## Config Change Detection
The `config_hash` field in `server.lock` enables detecting when the config file has changed since the server started:
1. On CLI invocation, compute hash of current config file
2. Compare with `config_hash` in `server.lock`
3. If different, warn user:
```
Warning: Config has changed since server started.
Run 'databuild stop && databuild serve' to apply changes.
```
We don't auto-restart because that could interrupt in-progress builds.
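
A minimal Python sketch of the comparison, assuming the hash is a SHA-256 digest of the raw config file bytes with the `sha256:` prefix shown in the `server.lock` example. The helper names are hypothetical.

```python
import hashlib


def config_hash(config_bytes: bytes) -> str:
    """Hash the raw file contents, formatted like the server.lock example."""
    return "sha256:" + hashlib.sha256(config_bytes).hexdigest()


def config_changed(current_config: bytes, recorded_hash: str) -> bool:
    """True when the config on disk no longer matches what the server started with."""
    return config_hash(current_config) != recorded_hash
```

Hashing raw bytes means that whitespace-only edits also trigger the warning; hashing a normalized form of the parsed config would avoid that at the cost of extra code.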
## Error Handling
### Stale Lock File
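If `server.lock` exists but the lock is not held (the previous server crashed), `flock()` succeeds and the JSON in the file is stale:
1. Discard the stale contents of `server.lock`; keep the file itself, since the lock just acquired is attached to it
2. Proceed with normal startup, which writes fresh state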
### Server Unreachable
If lock is held but health check fails repeatedly:
1. Log warning: "Server appears unresponsive"
2. After N retries, suggest: "Try 'kill -9 ${pid}' and retry"
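
The retry loop around `GET /health` can be sketched in Python as follows. The `wait_for_health` helper, the retry count, the delay, and the one-second request timeout are assumptions of this example; the document leaves N unspecified.

```python
import http.client
import time
import urllib.request

# Bypass any configured HTTP proxy: the server is always on localhost.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_health(port: int, retries: int = 10, delay_seconds: float = 0.2) -> bool:
    """Poll GET /health until it returns 200 or the retries run out."""
    url = f"http://localhost:{port}/health"
    for attempt in range(retries):
        try:
            with _opener.open(url, timeout=1.0) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            pass  # connection refused, timeout, or non-2xx: server not ready yet
        if attempt < retries - 1:
            time.sleep(delay_seconds)
    return False
```

A `False` result is the point at which the CLI would print the "Server appears unresponsive" warning.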
### Port Conflict
If preferred port is in use:
1. Automatically try next port (3539, 3540, ...)
2. Store actual port in `server.lock`
3. CLI reads from lock file, so it always connects to correct port
## Future Considerations
### Multi-Graph Scenarios
The `graph_label`-based directory structure supports multiple graphs in the same workspace. Each graph has its own:
- Server process
- Port allocation
- BEL database
- Idle timeout
### Remote Servers
The current design assumes localhost. Future extensions could support:
- Remote server URLs in config
- SSH tunneling
- Cloud-hosted servers
### Job Re-entrance
Currently, if a server crashes mid-build, job runs are orphaned. Future work:
- Detect orphaned job runs on startup
- Resume or mark as failed
- Track external job processes (e.g., Databricks jobs)
## Implementation Checklist
- [ ] Extend `DatabuildConfig` with `graph_label` and `idle_timeout_seconds`
- [ ] Create `ServerLock` struct for reading/writing lock file
- [ ] Implement file locking with `flock()`
- [ ] Implement daemonization (double-fork pattern)
- [ ] Add auto-start logic to existing CLI commands
- [ ] Add `databuild stop` command
- [ ] Add `databuild status` command
- [ ] Update example configs with `graph_label`
- [ ] Add integration tests for server lifecycle