# CLI-Server Automation

This document describes how the DataBuild CLI automatically manages the HTTP server lifecycle, providing a "magical" experience where users don't need to think about starting or stopping servers.

## Goals

1. **Zero-config startup**: Running `databuild want data/alpha` should "just work" without manual server management
2. **Workspace isolation**: Multiple graphs can run independently with separate servers and databases
3. **Resource efficiency**: Servers auto-shutdown after idle timeout
4. **Transparency**: Users can inspect server state and logs when needed

## Design Overview

### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                         CLI Process                         │
│  databuild want data/alpha                                  │
│                                                             │
│  1. Load config (databuild.json)                            │
│  2. Check .databuild/${graph_label}/server.lock             │
│  3. If not running → daemonize server                       │
│  4. Forward request to http://localhost:${port}/api/wants   │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      Daemonized Server                      │
│  PID: 12345, Port: 8080                                     │
│                                                             │
│  - Holds file lock on server.lock                           │
│  - Writes logs to server.log                                │
│  - Auto-shutdown after idle_timeout_seconds                 │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                 .databuild/${graph_label}/                  │
│                                                             │
│  server.lock  - Lock file + runtime state (JSON)            │
│  bel.sqlite   - Build Event Log database                    │
│  server.log   - Server stdout/stderr                        │
└─────────────────────────────────────────────────────────────┘
```

### Directory Structure

```
project/
├── databuild.json              # User-authored config
└── .databuild/
    └── ${graph_label}/         # e.g., "podcast_reviews"
        ├── server.lock         # Runtime state + file lock
        ├── bel.sqlite          # Build Event Log (SQLite)
        └── server.log          # Server logs
```

## Configuration

### Extended Config Schema

The `databuild.json` (or custom config file) is extended with:

```json
{
  "graph_label": "podcast_reviews",
  "idle_timeout_seconds": 3600,
  "jobs": [
    {
      "label": "//examples:daily_summaries",
      "entrypoint": "./jobs/daily_summaries.sh",
      "environment": { "OUTPUT_DIR": "/data/output" },
      "partition_patterns": ["daily_summaries/.*"]
    }
  ]
}
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `graph_label` | string | **required** | Unique identifier for this graph, used for `.databuild/${graph_label}/` directory |
| `idle_timeout_seconds` | u64 | 3600 | Server auto-shutdown after this many seconds of inactivity |
| `jobs` | array | [] | Job configurations (existing schema) |

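
As a rough illustration of how the defaults in the table could be applied, here is a minimal sketch in Python (used here purely for illustration, not as the implementation language). The helper names `parse_config` and `state_dir` are hypothetical; only the field names and default values come from the table above.

```python
import json
from pathlib import Path

DEFAULT_IDLE_TIMEOUT_SECONDS = 3600  # default from the schema table


def parse_config(text: str) -> dict:
    """Parse databuild.json contents and fill in the documented defaults."""
    raw = json.loads(text)
    label = raw.get("graph_label")
    if not isinstance(label, str) or not label:
        # graph_label has no default, so a missing value is an error.
        raise ValueError("graph_label is required")
    return {
        "graph_label": label,
        "idle_timeout_seconds": int(
            raw.get("idle_timeout_seconds", DEFAULT_IDLE_TIMEOUT_SECONDS)
        ),
        "jobs": raw.get("jobs", []),
    }


def state_dir(project_root: Path, config: dict) -> Path:
    """Directory holding server.lock, bel.sqlite, and server.log for a graph."""
    return project_root / ".databuild" / config["graph_label"]
```

With a config containing only `"graph_label": "podcast_reviews"`, this sketch yields a 3600-second timeout, an empty job list, and a state directory of `.databuild/podcast_reviews` under the project root.
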
### Runtime State (server.lock)

The `server.lock` file serves dual purposes:

1. **File lock**: Prevents multiple servers for the same graph
2. **Runtime state**: Contains current server information

```json
{
  "pid": 12345,
  "port": 8080,
  "started_at": 1701234567890,
  "config_hash": "sha256:abc123..."
}
```

| Field | Description |
|-------|-------------|
| `pid` | Server process ID |
| `port` | HTTP port the server is listening on |
| `started_at` | Unix timestamp (milliseconds) when server started |
| `config_hash` | Hash of config file contents (for detecting config changes) |

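
The runtime-state half of `server.lock` is a small JSON record, so reading and writing it can be sketched as a plain data class. The version below is an illustrative Python sketch, not the actual `ServerLock` implementation; the method names `to_json` and `from_json` are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ServerLock:
    pid: int          # server process ID
    port: int         # HTTP port the server is listening on
    started_at: int   # Unix timestamp in milliseconds
    config_hash: str  # e.g. "sha256:<hex digest>"

    def to_json(self) -> str:
        """Serialize the runtime state as it would appear in server.lock."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "ServerLock":
        """Parse the JSON contents of server.lock."""
        data = json.loads(text)
        return cls(
            pid=int(data["pid"]),
            port=int(data["port"]),
            started_at=int(data["started_at"]),
            config_hash=str(data["config_hash"]),
        )
```

Because the same file also carries the lock, a writer would likely want to truncate and rewrite it through the already-locked file descriptor rather than replace it with a new file, since an `flock()` lock belongs to the open file and not to the path.
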
## CLI Commands

### Existing Commands (Enhanced)

All commands that interact with the server now auto-start if needed:

```bash
# Creates want, auto-starting server if not running
databuild want data/alpha data/beta

# Lists wants, auto-starting server if not running
databuild wants list

# Lists partitions
databuild partitions list

# Lists job runs
databuild job-runs list
```

### New Commands

```bash
# Explicitly start server (for users who want manual control)
databuild serve
databuild serve --config ./custom-config.json

# Show server status
databuild status

# Graceful shutdown
databuild stop
```

### Command: `databuild status`

Shows current server state:

```
DataBuild Server Status
━━━━━━━━━━━━━━━━━━━━━━━━
Graph:        podcast_reviews
Status:       Running
PID:          12345
Port:         8080
Uptime:       2h 34m
Database:     .databuild/podcast_reviews/bel.sqlite

Active Job Runs: 2
Pending Wants:   5
```

### Command: `databuild stop`

Gracefully shuts down the server:

```bash
$ databuild stop
Stopping DataBuild server (PID 12345)...
Server stopped.
```

## Server Lifecycle

### Startup Flow

```
CLI invocation (e.g., databuild want data/alpha)
        │
        ▼
Load databuild.json (or --config path)
        │
        ▼
Extract graph_label from config
        │
        ▼
Ensure .databuild/${graph_label}/ exists
        │
        ▼
Try flock(server.lock, LOCK_EX | LOCK_NB)
        │
        ├─── Lock acquired → Server not running (or crashed)
        │           │
        │           ▼
        │    Find available port (start from 3538, increment if busy)
        │           │
        │           ▼
        │    Daemonize: fork → setsid → fork → redirect I/O
        │           │
        │           ▼
        │    Child: Start server, hold lock, write server.lock JSON
        │    Parent: Wait for health check, then proceed
        │
        └─── Lock blocked → Server already running
                    │
                    ▼
             Read port from server.lock
                    │
                    ▼
             Health check (GET /health)
                    │
                    ├─── Success → Use this server
                    │
                    └─── Failure → Wait and retry (server starting up)
        │
        ▼
Forward request to http://localhost:${port}/api/...
```

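
The branch point in the flow above is the non-blocking `flock()` call. A minimal Python sketch of that step follows (Unix-only, illustration only); the function name `try_acquire_lock` is hypothetical.

```python
import fcntl
import os


def try_acquire_lock(lock_path: str):
    """Try to take an exclusive, non-blocking lock on server.lock.

    Returns an open file descriptor that holds the lock, or None if
    another process (a running server) already holds it.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Lock blocked: a server is already running for this graph.
        os.close(fd)
        return None
    # Lock acquired: no server is running (or the previous one crashed).
    return fd
```

The kernel releases an `flock()` lock when its holder exits, so a crashed server does not leave the lock held and the next CLI invocation simply takes the "lock acquired" branch.
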
### Daemonization

The server daemonizes using the classic double-fork pattern:

1. **First fork**: Parent returns immediately to CLI
2. **setsid()**: Become session leader, detach from terminal
3. **Second fork**: Give up session leadership so the daemon can never re-acquire a controlling terminal
4. **Redirect I/O**: stdout/stderr → `server.log`, stdin → `/dev/null`
5. **Write lock file**: PID, port, started_at, config_hash
6. **Start serving**: Hold file lock for lifetime of process

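
Steps 1 to 4 of the double-fork pattern can be sketched in a few lines of Python (Unix-only, illustration only). The function name `daemonize` and its boolean return convention are assumptions made for this example; writing the lock file and serving requests are left to the caller.

```python
import os


def daemonize(log_path: str) -> bool:
    """Double-fork into the background.

    Returns False in the original (CLI) process and True in the daemon.
    """
    pid = os.fork()                      # first fork
    if pid > 0:
        os.waitpid(pid, 0)               # reap the short-lived intermediate child
        return False

    os.setsid()                          # new session, detached from the terminal
    if os.fork() > 0:                    # second fork
        os._exit(0)                      # intermediate child exits immediately

    # Daemon: stdin from /dev/null, stdout/stderr appended to the log file.
    devnull = os.open(os.devnull, os.O_RDONLY)
    log = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    os.dup2(devnull, 0)
    os.dup2(log, 1)
    os.dup2(log, 2)
    for fd in (devnull, log):
        if fd > 2:                       # do not close a standard stream by accident
            os.close(fd)
    return True
```

A file descriptor that was locked with `flock()` before the first fork stays locked in the daemon, because the lock is attached to the open file that both processes share; the CLI closing its own copy does not release it.
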
### Idle Timeout

The server monitors request activity:

1. Track `last_request_time` (updated on each HTTP request)
2. Background task checks every 60 seconds
3. If `now - last_request_time > idle_timeout_seconds` → graceful shutdown

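
The idle check can be sketched as a small tracker plus a background loop. This is an illustrative Python sketch; `IdleTracker`, `watch_for_idle`, and the injectable `clock` parameter are hypothetical names, while `last_request_time`, `idle_timeout_seconds`, and the 60-second interval come from the steps above.

```python
import threading
import time


class IdleTracker:
    def __init__(self, idle_timeout_seconds: float, clock=time.monotonic):
        self.idle_timeout_seconds = idle_timeout_seconds
        self._clock = clock
        self.last_request_time = clock()

    def touch(self) -> None:
        """Record activity; call this on every HTTP request."""
        self.last_request_time = self._clock()

    def is_idle(self) -> bool:
        """True once now - last_request_time exceeds the timeout."""
        return self._clock() - self.last_request_time > self.idle_timeout_seconds


def watch_for_idle(tracker: IdleTracker, shutdown: threading.Event,
                   check_interval: float = 60.0) -> None:
    """Background loop: set `shutdown` once the server has been idle too long."""
    # Event.wait returns True as soon as shutdown is set for any other reason.
    while not shutdown.wait(check_interval):
        if tracker.is_idle():
            shutdown.set()
```

With a fixed check interval, the actual shutdown can lag the timeout by up to one interval, which is usually acceptable for an hour-long default.
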
### Graceful Shutdown

On shutdown (idle timeout, SIGTERM, or `databuild stop`):

1. Stop accepting new connections
2. Wait for in-flight requests to complete (with timeout)
3. Signal orchestrator to stop
4. Wait for orchestrator thread to finish
5. Release file lock (automatic on process exit)
6. Exit

## Port Selection

When starting a new server:

1. Start with default port 3538
2. Try to bind; if port in use, increment and retry
3. Store selected port in `server.lock`
4. CLI reads port from lock file, not from config

This handles the case where the preferred port is occupied by another process.

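
A minimal Python sketch of the bind-and-increment loop is shown below (illustration only). The function name `bind_first_free_port` and the `max_attempts` cap are assumptions; the document does not specify an upper bound on retries.

```python
import socket


def bind_first_free_port(start: int = 3538, max_attempts: int = 100) -> socket.socket:
    """Bind to the first free localhost port at or above `start`.

    Returns the bound socket, so the chosen port cannot be taken by another
    process between probing and serving. `max_attempts` is an assumed cap.
    """
    stop = min(start + max_attempts, 65536)  # never probe past the valid range
    for port in range(start, stop):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("127.0.0.1", port))
        except OSError:
            sock.close()  # port in use (or not bindable): try the next one
            continue
        return sock
    raise RuntimeError(f"no free port found in range {start}..{stop - 1}")
```

The selected port is available as `sock.getsockname()[1]`, which is the value that would be stored in `server.lock`.
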
## Config Change Detection

The `config_hash` field in `server.lock` enables detecting when the config file has changed since the server started:

1. On CLI invocation, compute hash of current config file
2. Compare with `config_hash` in `server.lock`
3. If different, warn user:

   ```
   Warning: Config has changed since server started.
   Run 'databuild stop && databuild serve' to apply changes.
   ```

We don't auto-restart because that could interrupt in-progress builds.

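
The comparison itself is small. Here is a minimal Python sketch, assuming the hash is SHA-256 over the raw file bytes with a `sha256:` prefix, as the example `server.lock` suggests; the function names are hypothetical.

```python
import hashlib


def config_hash(config_bytes: bytes) -> str:
    """Hash the raw config file contents in the "sha256:<hex>" form."""
    return "sha256:" + hashlib.sha256(config_bytes).hexdigest()


def config_changed(current_config_bytes: bytes, recorded_hash: str) -> bool:
    """True if the config on disk no longer matches what the server started with."""
    return config_hash(current_config_bytes) != recorded_hash
```

Hashing raw bytes means that even whitespace-only edits trigger the warning; hashing a normalized form of the parsed config would avoid that, at the cost of a little more code.
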
## Error Handling

### Stale Lock File

If `server.lock` exists but the lock is not held (process crashed):

1. Delete the stale `server.lock`
2. Proceed with normal startup

### Server Unreachable

If lock is held but health check fails repeatedly:

1. Log warning: "Server appears unresponsive"
2. After N retries, suggest: "Try 'kill -9 ${pid}' and retry"

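
The retry loop behind this check can be sketched as follows in Python (illustration only). The function name `wait_for_health` and the default retry count and delay are assumptions; only the `GET /health` endpoint and the localhost URL come from this document.

```python
import http.client
import time
import urllib.request

# Loopback requests should not go through any proxy configured in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_health(port: int, retries: int = 10, delay_seconds: float = 0.2) -> bool:
    """Poll GET /health until it returns 200 or the retries are exhausted."""
    url = f"http://localhost:{port}/health"
    for attempt in range(retries):
        try:
            with _opener.open(url, timeout=1.0) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, or non-2xx status (URLError and
            # HTTPError are OSError subclasses): the server is not ready.
            pass
        if attempt < retries - 1:
            time.sleep(delay_seconds)
    return False
```

A `False` result is the point at which the CLI would log the "Server appears unresponsive" warning and print the suggested `kill` command using the `pid` from `server.lock`.
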
### Port Conflict

If preferred port is in use:

1. Automatically try next port (3539, 3540, ...)
2. Store actual port in `server.lock`
3. CLI reads from lock file, so it always connects to correct port

## Future Considerations

### Multi-Graph Scenarios

The `graph_label` based directory structure supports multiple graphs in the same workspace. Each graph has independent:

- Server process
- Port allocation
- BEL database
- Idle timeout

### Remote Servers

The current design assumes localhost. Future extensions could support:

- Remote server URLs in config
- SSH tunneling
- Cloud-hosted servers

### Job Re-entrance

Currently, if a server crashes mid-build, job runs are orphaned. Future work:

- Detect orphaned job runs on startup
- Resume or mark as failed
- Track external job processes (e.g., Databricks jobs)

## Implementation Checklist

- [ ] Extend `DatabuildConfig` with `graph_label` and `idle_timeout_seconds`
- [ ] Create `ServerLock` struct for reading/writing lock file
- [ ] Implement file locking with `flock()`
- [ ] Implement daemonization (double-fork pattern)
- [ ] Add auto-start logic to existing CLI commands
- [ ] Add `databuild stop` command
- [ ] Add `databuild status` command
- [ ] Update example configs with `graph_label`
- [ ] Add integration tests for server lifecycle