Stuart Axelbrooke f7c196c9b3 add automated server startup for cli

2025-11-27 14:20:40 +08:00

CLI-Server Automation

This document describes how the DataBuild CLI automatically manages the HTTP server lifecycle, so that users never have to start or stop a server by hand.

Goals

Zero-config startup: Running databuild want data/alpha should "just work" without manual server management
Workspace isolation: Multiple graphs can run independently with separate servers and databases
Resource efficiency: Servers shut down automatically after an idle timeout
Transparency: Users can inspect server state and logs when needed

Design Overview

Architecture

┌─────────────────────────────────────────────────────────────┐
│                         CLI Process                          │
│  databuild want data/alpha                                   │
│                                                              │
│  1. Load config (databuild.json)                            │
│  2. Check .databuild/${graph_label}/server.lock             │
│  3. If not running → daemonize server                       │
│  4. Forward request to http://localhost:${port}/api/wants   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Daemonized Server                         │
│  PID: 12345, Port: 3538                                     │
│                                                              │
│  - Holds file lock on server.lock                           │
│  - Writes logs to server.log                                │
│  - Auto-shutdown after idle_timeout_seconds                 │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              .databuild/${graph_label}/                      │
│                                                              │
│  server.lock   - Lock file + runtime state (JSON)           │
│  bel.sqlite    - Build Event Log database                   │
│  server.log    - Server stdout/stderr                       │
└─────────────────────────────────────────────────────────────┘

Directory Structure

project/
├── databuild.json              # User-authored config
├── .databuild/
│   └── ${graph_label}/         # e.g., "podcast_reviews"
│       ├── server.lock         # Runtime state + file lock
│       ├── bel.sqlite          # Build Event Log (SQLite)
│       └── server.log          # Server logs

Configuration

Extended Config Schema

The databuild.json file (or a custom config file) is extended with two new top-level fields, graph_label and idle_timeout_seconds:

{
  "graph_label": "podcast_reviews",
  "idle_timeout_seconds": 3600,
  "jobs": [
    {
      "label": "//examples:daily_summaries",
      "entrypoint": "./jobs/daily_summaries.sh",
      "environment": { "OUTPUT_DIR": "/data/output" },
      "partition_patterns": ["daily_summaries/.*"]
    }
  ]
}

| Field | Type | Default | Description |
|---|---|---|---|
| `graph_label` | string | required | Unique identifier for this graph; names the `.databuild/${graph_label}/` directory |
| `idle_timeout_seconds` | u64 | 3600 | The server shuts down automatically after this many seconds of inactivity |
| `jobs` | array | [] | Job configurations (existing schema) |
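A minimal sketch of how these fields might be validated and defaulted. It is written in Python purely for illustration; `Config` and `parse_config` are hypothetical names, not part of DataBuild.

```python
import json
from dataclasses import dataclass, field

DEFAULT_IDLE_TIMEOUT_SECONDS = 3600  # default documented for idle_timeout_seconds


@dataclass
class Config:  # hypothetical name, for illustration only
    graph_label: str
    idle_timeout_seconds: int = DEFAULT_IDLE_TIMEOUT_SECONDS
    jobs: list = field(default_factory=list)


def parse_config(text: str) -> Config:
    """Parse config JSON, enforcing the required field and applying defaults."""
    raw = json.loads(text)
    label = raw.get("graph_label")
    if not isinstance(label, str) or not label:
        raise ValueError("config is missing required field 'graph_label'")
    timeout = raw.get("idle_timeout_seconds", DEFAULT_IDLE_TIMEOUT_SECONDS)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, int) or timeout < 0:
        raise ValueError("'idle_timeout_seconds' must be a non-negative integer")
    return Config(graph_label=label, idle_timeout_seconds=timeout,
                  jobs=raw.get("jobs", []))
```

Rejecting a missing graph_label up front matters because every later step (directory creation, locking, logging) derives its path from it.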

Runtime State (server.lock)

The server.lock file serves two purposes:

File lock: Prevents multiple servers for the same graph
Runtime state: Contains current server information

{
  "pid": 12345,
  "port": 3538,
  "started_at": 1701234567890,
  "config_hash": "sha256:abc123..."
}

| Field | Description |
|---|---|
| `pid` | Server process ID |
| `port` | HTTP port the server is listening on |
| `started_at` | Unix timestamp (milliseconds) when server started |
| `config_hash` | Hash of config file contents (for detecting config changes) |
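The runtime state can be modeled as a small record that is serialized to JSON. The sketch below is in Python for illustration only: `ServerLock` follows the struct name used in the implementation checklist, while `write_state` and `read_state` are hypothetical helpers. It assumes the state is written through the same file descriptor that holds the lock, so the locked file is never replaced.

```python
import json
import os
from dataclasses import asdict, dataclass


@dataclass
class ServerLock:
    pid: int
    port: int
    started_at: int   # Unix timestamp in milliseconds
    config_hash: str


def write_state(fd: int, state: ServerLock) -> None:
    """Overwrite the lock file's contents through an already-open descriptor."""
    data = json.dumps(asdict(state)).encode()
    os.ftruncate(fd, 0)      # drop any stale contents from a previous server
    os.pwrite(fd, data, 0)   # write at offset 0 regardless of the fd position


def read_state(path: str) -> ServerLock:
    """Read the runtime state; raises if the file is empty or malformed."""
    with open(path) as f:
        return ServerLock(**json.load(f))
```

A reader may observe an empty file in the short window between truncation and write, so a CLI using this approach would want to treat a parse failure as "server still starting" and retry.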

CLI Commands

Existing Commands (Enhanced)

All commands that interact with the server now start it automatically if it is not already running:

# Creates want, auto-starting server if not running
databuild want data/alpha data/beta

# Lists wants, auto-starting server if not running
databuild wants list

# Lists partitions
databuild partitions list

# Lists job runs
databuild job-runs list

New Commands

# Explicitly start server (for users who want manual control)
databuild serve
databuild serve --config ./custom-config.json

# Show server status
databuild status

# Graceful shutdown
databuild stop

Command: `databuild status`

Shows current server state:

DataBuild Server Status
━━━━━━━━━━━━━━━━━━━━━━━━
Graph:    podcast_reviews
Status:   Running
PID:      12345
Port:     3538
Uptime:   2h 34m
Database: .databuild/podcast_reviews/bel.sqlite

Active Job Runs: 2
Pending Wants:   5

Command: `databuild stop`

Gracefully shuts down the server:

$ databuild stop
Stopping DataBuild server (PID 12345)...
Server stopped.

Server Lifecycle

Startup Flow

CLI invocation (e.g., databuild want data/alpha)
     │
     ▼
Load databuild.json (or --config path)
     │
     ▼
Extract graph_label from config
     │
     ▼
Ensure .databuild/${graph_label}/ exists
     │
     ▼
Try flock(server.lock, LOCK_EX | LOCK_NB)
     │
     ├─── Lock acquired → Server not running (or crashed)
     │         │
     │         ▼
     │    Find available port (start from 3538, increment if busy)
     │         │
     │         ▼
     │    Daemonize: fork → setsid → fork → redirect I/O
     │         │
     │         ▼
     │    Child: Start server, hold lock, write server.lock JSON
     │    Parent: Wait for health check, then proceed
     │
     └─── Lock held by another process (EWOULDBLOCK) → Server already running
               │
               ▼
          Read port from server.lock
               │
               ▼
          Health check (GET /health)
               │
               ├─── Success → Use this server
               │
               └─── Failure → Wait and retry (server starting up)
     │
     ▼
Forward request to http://localhost:${port}/api/...
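The non-blocking lock attempt at the top of this flow can be sketched as follows. This is an illustrative Python version, not DataBuild's implementation, and `try_acquire` is a hypothetical name. With LOCK_NB, flock() does not wait: if another process holds the lock, it fails immediately with EWOULDBLOCK, which Python surfaces as BlockingIOError.

```python
import fcntl
import os


def try_acquire(lock_path: str):
    """Return an open fd holding the exclusive lock, or None if it is already held."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)   # someone else holds the lock: a server is running
        return None
    return fd          # caller must keep this fd open to keep the lock
```

Because a flock() lock belongs to the open file description, a forked daemon that inherits this descriptor continues to hold the lock after the parent closes its own copy.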

Daemonization

The server daemonizes using the classic double-fork pattern:

First fork: Parent returns immediately to CLI
setsid(): Become session leader, detach from terminal
Second fork: The session leader exits, so the daemon can never reacquire a controlling terminal
Redirect I/O: stdout/stderr → server.log, stdin → /dev/null
Write lock file: PID, port, started_at, config_hash
Start serving: Hold file lock for lifetime of process
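The first four steps can be sketched in Python as below (illustration only; `spawn_daemon` and its `run` callback are hypothetical). Writing the lock file and serving requests would happen inside `run`.

```python
import os


def spawn_daemon(run, log_path):
    """Run `run()` in a detached daemon process; the caller returns right away."""
    pid = os.fork()                       # first fork
    if pid > 0:
        os.waitpid(pid, 0)                # reap the short-lived intermediate child
        return
    exit_code = 1
    try:
        os.setsid()                       # new session, detached from the terminal
        if os.fork() > 0:                 # second fork
            os._exit(0)                   # session leader exits immediately
        # Daemon process: redirect standard streams.
        devnull = os.open(os.devnull, os.O_RDONLY)
        log = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        os.dup2(devnull, 0)               # stdin  <- /dev/null
        os.dup2(log, 1)                   # stdout -> server.log
        os.dup2(log, 2)                   # stderr -> server.log
        run()
        exit_code = 0
    finally:
        os._exit(exit_code)               # never fall back into the caller's code
```

The child paths always leave through os._exit() so that they cannot return into the CLI's own call stack or run its cleanup handlers a second time.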

Idle Timeout

The server monitors request activity:

Track last_request_time (updated on each HTTP request)
Background task checks every 60 seconds
If now - last_request_time > idle_timeout_seconds → graceful shutdown
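A minimal sketch of this idle tracking, in Python for illustration. The `IdleMonitor` class, its method names, and the injectable `clock` are assumptions made for the example; only last_request_time, the 60-second check interval, and idle_timeout_seconds come from the design above.

```python
import threading
import time


class IdleMonitor:
    def __init__(self, idle_timeout_seconds, clock=time.monotonic):
        self._timeout = idle_timeout_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._last_request_time = clock()

    def touch(self):
        """Call on every HTTP request."""
        with self._lock:
            self._last_request_time = self._clock()

    def is_idle(self):
        with self._lock:
            return self._clock() - self._last_request_time > self._timeout

    def watch(self, shutdown, stop, check_interval=60.0):
        """Poll until idle, then call shutdown(). `stop` is a threading.Event
        that cancels the watcher when the server exits for another reason."""
        while not stop.wait(check_interval):
            if self.is_idle():
                shutdown()
                return
```

With a 60-second check interval, shutdown can lag the configured timeout by up to a minute. A monotonic clock is used here so that wall-clock adjustments cannot trigger a spurious shutdown.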

Graceful Shutdown

On shutdown (idle timeout, SIGTERM, or databuild stop):

Stop accepting new connections
Wait for in-flight requests to complete (with timeout)
Signal orchestrator to stop
Wait for orchestrator thread to finish
Release file lock (automatic on process exit)
Exit

Port Selection

When starting a new server:

Start with default port 3538
Try to bind; if port in use, increment and retry
Store selected port in server.lock
CLI reads port from lock file, not from config

This handles the case where the preferred port is occupied by another process.
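The bind-and-increment loop might look like the following Python sketch. `bind_first_free` and `max_tries` are hypothetical, and the design does not specify an upper bound on retries. Returning the bound socket, rather than just the port number, avoids a race in which another process takes the port between the check and the server's own bind.

```python
import errno
import socket

DEFAULT_PORT = 3538


def bind_first_free(start=DEFAULT_PORT, max_tries=100, host="127.0.0.1"):
    """Return (socket, port) bound to the first free port at or above `start`."""
    for port in range(start, min(start + max_tries, 65536)):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError as e:
            sock.close()
            if e.errno != errno.EADDRINUSE:
                raise          # e.g. permission denied: not a port conflict
            continue           # port busy: try the next one
        return sock, port
    raise RuntimeError(f"no free port found starting from {start}")
```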

Config Change Detection

The config_hash field in server.lock enables detecting when the config file has changed since the server started:

On CLI invocation, compute hash of current config file
Compare with config_hash in server.lock

If different, warn user:

Warning: Config has changed since server started.
Run 'databuild stop && databuild serve' to apply changes.

We don't auto-restart because that could interrupt in-progress builds.
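As an illustration, the comparison could be implemented like this in Python. The sketch assumes the hash is SHA-256 over the raw file bytes with a "sha256:" prefix, matching the shape of the example value in server.lock; the function names are hypothetical.

```python
import hashlib


def config_hash(path):
    """Hash the config file's raw bytes (assumed scheme: 'sha256:<hex digest>')."""
    with open(path, "rb") as f:
        return "sha256:" + hashlib.sha256(f.read()).hexdigest()


def config_changed(path, recorded_hash):
    """True if the file on disk no longer matches the hash stored at server start."""
    return config_hash(path) != recorded_hash
```

Hashing raw bytes means that whitespace-only edits also count as changes. Hashing a normalized form of the parsed config would avoid that, at the cost of extra complexity.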

Error Handling

Stale Lock File

If server.lock exists but the lock is not held (process crashed):

Delete the stale server.lock
Proceed with normal startup

Server Unreachable

If lock is held but health check fails repeatedly:

Log warning: "Server appears unresponsive"
After N retries, suggest: "Try 'kill -9 ${pid}' and retry"
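A bounded health-check loop could be sketched as follows (Python, for illustration). The /health path comes from the startup flow; `wait_for_health`, the retry count, and the delays are assumptions. Proxies are bypassed explicitly, since the server is local.

```python
import http.client
import time
import urllib.request

# Local requests should never go through an HTTP proxy from the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_health(port, retries=10, delay=0.2, timeout=1.0, host="localhost"):
    """Return True once GET /health answers 200, False after `retries` failures."""
    url = f"http://{host}:{port}/health"
    for attempt in range(retries):
        try:
            with _opener.open(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, or an HTTP error status
            # (URLError and HTTPError are OSError subclasses): not ready yet.
            pass
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

A caller would print the "Server appears unresponsive" warning and the kill suggestion only when this returns False.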

Port Conflict

If preferred port is in use:

Automatically try next port (3539, 3540, ...)
Store actual port in server.lock
CLI reads from lock file, so it always connects to correct port

Future Considerations

Multi-Graph Scenarios

The graph_label based directory structure supports multiple graphs in the same workspace. Each graph has independent:

Server process
Port allocation
BEL database
Idle timeout

Remote Servers

The current design assumes localhost. Future extensions could support:

Remote server URLs in config
SSH tunneling
Cloud-hosted servers

Job Re-entrance

Currently, if a server crashes mid-build, job runs are orphaned. Future work:

Detect orphaned job runs on startup
Resume or mark as failed
Track external job processes (e.g., Databricks jobs)

Implementation Checklist

Extend DatabuildConfig with graph_label and idle_timeout_seconds
Create ServerLock struct for reading/writing lock file
Implement file locking with flock()
Implement daemonization (double-fork pattern)
Add auto-start logic to existing CLI commands
Add databuild stop command
Add databuild status command
Update example configs with graph_label
Add integration tests for server lifecycle
