CLI-Server Automation

This document describes how the DataBuild CLI automatically manages the HTTP server lifecycle, providing a "magical" experience in which users never have to start or stop a server by hand.

Goals

  1. Zero-config startup: Running databuild want data/alpha should "just work" without manual server management
  2. Workspace isolation: Multiple graphs can run independently with separate servers and databases
  3. Resource efficiency: Servers auto-shutdown after idle timeout
  4. Transparency: Users can inspect server state and logs when needed

Design Overview

Architecture

┌─────────────────────────────────────────────────────────────┐
│                         CLI Process                          │
│  databuild want data/alpha                                   │
│                                                              │
│  1. Load config (databuild.json)                            │
│  2. Check .databuild/${graph_label}/server.lock             │
│  3. If not running → daemonize server                       │
│  4. Forward request to http://localhost:${port}/api/wants   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Daemonized Server                         │
│  PID: 12345, Port: 3538                                     │
│                                                              │
│  - Holds file lock on server.lock                           │
│  - Writes logs to server.log                                │
│  - Auto-shutdown after idle_timeout_seconds                 │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              .databuild/${graph_label}/                      │
│                                                              │
│  server.lock   - Lock file + runtime state (JSON)           │
│  bel.sqlite    - Build Event Log database                   │
│  server.log    - Server stdout/stderr                       │
└─────────────────────────────────────────────────────────────┘

Directory Structure

project/
├── databuild.json              # User-authored config
├── .databuild/
│   └── ${graph_label}/         # e.g., "podcast_reviews"
│       ├── server.lock         # Runtime state + file lock
│       ├── bel.sqlite          # Build Event Log (SQLite)
│       └── server.log          # Server logs

Configuration

Extended Config Schema

The databuild.json (or custom config file) is extended with:

{
  "graph_label": "podcast_reviews",
  "idle_timeout_seconds": 3600,
  "jobs": [
    {
      "label": "//examples:daily_summaries",
      "entrypoint": "./jobs/daily_summaries.sh",
      "environment": { "OUTPUT_DIR": "/data/output" },
      "partition_patterns": ["daily_summaries/.*"]
    }
  ]
}
| Field | Type | Default | Description |
|---|---|---|---|
| graph_label | string | required | Unique identifier for this graph, used for .databuild/${graph_label}/ directory |
| idle_timeout_seconds | u64 | 3600 | Server auto-shutdown after this many seconds of inactivity |
| jobs | array | [] | Job configurations (existing schema) |
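
As a rough illustration of how the fields above could be validated and defaulted, here is a minimal Python sketch. Python is used only for illustration and is not assumed to be DataBuild's implementation language. The helper names `parse_config`, `load_config`, and `state_dir` are hypothetical; only `DatabuildConfig`, the field names, and the default of 3600 come from this document.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path

DEFAULT_IDLE_TIMEOUT_SECONDS = 3600  # default stated in the config schema


@dataclass
class DatabuildConfig:
    graph_label: str
    idle_timeout_seconds: int = DEFAULT_IDLE_TIMEOUT_SECONDS
    jobs: list = field(default_factory=list)


def parse_config(raw: dict) -> DatabuildConfig:
    """Validate the top-level fields and fill in defaults (hypothetical helper)."""
    label = raw.get("graph_label")
    if not isinstance(label, str) or not label:
        raise ValueError("graph_label is required and must be a non-empty string")
    timeout = raw.get("idle_timeout_seconds", DEFAULT_IDLE_TIMEOUT_SECONDS)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, int) or timeout < 0:
        raise ValueError("idle_timeout_seconds must be a non-negative integer")
    return DatabuildConfig(label, timeout, list(raw.get("jobs", [])))


def load_config(path: Path) -> DatabuildConfig:
    """Read and parse a config file such as databuild.json."""
    return parse_config(json.loads(path.read_text()))


def state_dir(project_root: Path, config: DatabuildConfig) -> Path:
    """Per-graph runtime directory: .databuild/${graph_label}/"""
    return project_root / ".databuild" / config.graph_label
```

A call such as `state_dir(Path("."), load_config(Path("databuild.json")))` would then yield the directory that holds server.lock, bel.sqlite, and server.log for that graph.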

Runtime State (server.lock)

The server.lock file serves dual purposes:

  1. File lock: Prevents multiple servers for the same graph
  2. Runtime state: Contains current server information

{
  "pid": 12345,
  "port": 3538,
  "started_at": 1701234567890,
  "config_hash": "sha256:abc123..."
}
| Field | Description |
|---|---|
| pid | Server process ID |
| port | HTTP port the server is listening on |
| started_at | Unix timestamp (milliseconds) when server started |
| config_hash | Hash of config file contents (for detecting config changes) |
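
The runtime state could be modeled roughly as follows. This is a Python sketch for illustration only, not the project's actual ServerLock type. The method names `to_json` and `from_json` and the helper `now_millis` are hypothetical; the four fields are the ones listed above.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class ServerLock:
    pid: int
    port: int
    started_at: int   # Unix timestamp in milliseconds
    config_hash: str

    def to_json(self) -> str:
        """Serialize to the JSON layout stored in server.lock."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "ServerLock":
        """Parse server.lock contents; raises KeyError/ValueError on malformed input."""
        data = json.loads(text)
        return cls(
            pid=int(data["pid"]),
            port=int(data["port"]),
            started_at=int(data["started_at"]),
            config_hash=str(data["config_hash"]),
        )


def now_millis() -> int:
    """Current Unix time in milliseconds, suitable for started_at."""
    return time.time_ns() // 1_000_000
```

Because the same file also carries the flock, a writer would probably want to rewrite it in place (truncate, then write) rather than replace it via rename: a rename creates a new inode, and the lock would remain attached to the old one.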

CLI Commands

Existing Commands (Enhanced)

All commands that interact with the server now start it automatically if it is not already running:

# Creates want, auto-starting server if not running
databuild want data/alpha data/beta

# Lists wants, auto-starting server if not running
databuild wants list

# Lists partitions
databuild partitions list

# Lists job runs
databuild job-runs list

New Commands

# Explicitly start server (for users who want manual control)
databuild serve
databuild serve --config ./custom-config.json

# Show server status
databuild status

# Graceful shutdown
databuild stop

Command: databuild status

Shows current server state:

DataBuild Server Status
━━━━━━━━━━━━━━━━━━━━━━━━
Graph:    podcast_reviews
Status:   Running
PID:      12345
Port:     3538
Uptime:   2h 34m
Database: .databuild/podcast_reviews/bel.sqlite

Active Job Runs: 2
Pending Wants:   5
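
The Uptime line can be derived from the started_at value stored in server.lock. A small Python sketch, for illustration only: the function name `format_uptime` is hypothetical, and the format used for uptimes under one hour is an assumption, since the document only shows the "2h 34m" form.

```python
def format_uptime(started_at_ms: int, now_ms: int) -> str:
    """Render elapsed time between two millisecond timestamps as e.g. '2h 34m'."""
    elapsed_seconds = max(0, now_ms - started_at_ms) // 1000
    hours, remainder = divmod(elapsed_seconds, 3600)
    minutes = remainder // 60
    if hours:
        return f"{hours}h {minutes}m"
    return f"{minutes}m"  # assumed format when less than an hour has passed
```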

Command: databuild stop

Gracefully shuts down the server:

$ databuild stop
Stopping DataBuild server (PID 12345)...
Server stopped.

Server Lifecycle

Startup Flow

CLI invocation (e.g., databuild want data/alpha)
     │
     ▼
Load databuild.json (or --config path)
     │
     ▼
Extract graph_label from config
     │
     ▼
Ensure .databuild/${graph_label}/ exists
     │
     ▼
Try flock(server.lock, LOCK_EX | LOCK_NB)
     │
     ├─── Lock acquired → Server not running (or crashed)
     │         │
     │         ▼
     │    Find available port (start from 3538, increment if busy)
     │         │
     │         ▼
     │    Daemonize: fork → setsid → fork → redirect I/O
     │         │
     │         ▼
     │    Child: Start server, hold lock, write server.lock JSON
     │    Parent: Wait for health check, then proceed
     │
     └─── Lock blocked → Server already running
               │
               ▼
          Read port from server.lock
               │
               ▼
          Health check (GET /health)
               │
               ├─── Success → Use this server
               │
               └─── Failure → Wait and retry (server starting up)
     │
     ▼
Forward request to http://localhost:${port}/api/...
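
The non-blocking lock attempt at the top of this flow can be sketched in Python as follows (illustration only; the function name `try_acquire_lock` is hypothetical). It uses `fcntl.flock` with `LOCK_EX | LOCK_NB`, which fails immediately instead of waiting when another process holds the lock.

```python
import fcntl
import os


def try_acquire_lock(lock_path: str):
    """Return an fd holding an exclusive flock, or None if the lock is already held."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # EWOULDBLOCK: some other open file description holds the lock,
        # i.e. a server is already running (or starting up).
        os.close(fd)
        return None
    return fd  # keep this fd open for as long as the lock must be held
```

A `None` result corresponds to the "Lock blocked" branch; a returned descriptor corresponds to "Lock acquired". An flock belongs to the open file description, so a daemonized child that inherits the descriptor across fork keeps the lock, and it is released only once every copy of the descriptor has been closed.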

Daemonization

The server daemonizes using the classic double-fork pattern:

  1. First fork: Parent returns control to the CLI, which then waits for the health check
  2. setsid(): Become session leader, detach from terminal
  3. Second fork: Prevent re-acquiring terminal
  4. Redirect I/O: stdout/stderr → server.log, stdin → /dev/null
  5. Write lock file: PID, port, started_at, config_hash
  6. Start serving: Hold file lock for lifetime of process
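
Steps 1 to 4 can be sketched in Python as shown below. This is an illustration of the double-fork pattern under simplifying assumptions, not DataBuild's implementation: the name `spawn_daemon` and the `run_server` callback are hypothetical, and writing server.lock (step 5) is left to the callback.

```python
import os
import sys
import traceback


def spawn_daemon(run_server, log_path: str) -> None:
    """Run run_server() in a detached grandchild; the caller continues once it is launched."""
    # Flush buffered output so it is not duplicated in the forked children.
    sys.stdout.flush()
    sys.stderr.flush()

    pid = os.fork()
    if pid > 0:
        os.waitpid(pid, 0)  # reap the short-lived intermediate child
        return              # original process (the CLI) carries on

    # Child processes must never return into the caller's code,
    # so every path below ends in os._exit.
    status = 0
    try:
        os.setsid()          # become session leader, detach from the terminal
        if os.fork() > 0:
            os._exit(0)      # intermediate child exits; grandchild is orphaned
        # Grandchild: not a session leader, so it cannot re-acquire a terminal.
        devnull = os.open(os.devnull, os.O_RDONLY)
        log = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        os.dup2(devnull, 0)  # stdin  <- /dev/null
        os.dup2(log, 1)      # stdout -> log file
        os.dup2(log, 2)      # stderr -> log file
        for fd in (devnull, log):
            if fd > 2:
                os.close(fd)
        run_server()
    except Exception:
        traceback.print_exc()
        status = 1
    finally:
        try:
            sys.stdout.flush()
            sys.stderr.flush()
        finally:
            os._exit(status)
```

In this sketch the original process waits only for the intermediate child, which exits almost immediately; confirming that the server is actually up is left to the health check described in the startup flow.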

Idle Timeout

The server monitors request activity:

  1. Track last_request_time (updated on each HTTP request)
  2. Background task checks every 60 seconds
  3. If now - last_request_time > idle_timeout_seconds → graceful shutdown
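
The three steps above might look like this in Python (illustration only; the class name `IdleMonitor` and its methods are hypothetical). The clock is injectable so the timeout logic can be exercised without waiting, and the 60-second check interval is a parameter.

```python
import threading
import time


class IdleMonitor:
    def __init__(self, idle_timeout_seconds: float, clock=time.monotonic):
        self._timeout = idle_timeout_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._last_request_time = clock()

    def touch(self) -> None:
        """Record activity; intended to be called on every HTTP request."""
        with self._lock:
            self._last_request_time = self._clock()

    def is_idle(self) -> bool:
        """True once more than idle_timeout_seconds have passed since the last request."""
        with self._lock:
            return self._clock() - self._last_request_time > self._timeout

    def watch(self, shutdown: threading.Event, check_interval: float = 60.0) -> None:
        """Background loop: set `shutdown` when idle; stop if it is set elsewhere."""
        while not shutdown.wait(check_interval):
            if self.is_idle():
                shutdown.set()
```

A server could run `watch` on a background thread and begin its graceful shutdown when the event is set. A monotonic clock is used so that wall-clock adjustments do not trigger or delay the timeout.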

Graceful Shutdown

On shutdown (idle timeout, SIGTERM, or databuild stop):

  1. Stop accepting new connections
  2. Wait for in-flight requests to complete (with timeout)
  3. Signal orchestrator to stop
  4. Wait for orchestrator thread to finish
  5. Release file lock (automatic on process exit)
  6. Exit

Port Selection

When starting a new server:

  1. Start with default port 3538
  2. Try to bind; if port in use, increment and retry
  3. Store selected port in server.lock
  4. CLI reads port from lock file, not from config

This handles the case where the preferred port is occupied by another process.
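
A minimal Python sketch of this bind-and-increment loop (illustration only; the function name `bind_first_available`, the `max_attempts` cap, and binding to 127.0.0.1 are assumptions not stated in this document):

```python
import socket

DEFAULT_PORT = 3538


def bind_first_available(start_port: int = DEFAULT_PORT, max_attempts: int = 100,
                         host: str = "127.0.0.1"):
    """Return (listening_socket, port) for the first port >= start_port that binds."""
    end_port = min(start_port + max_attempts, 65536)
    for port in range(start_port, end_port):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            sock.close()  # port in use (or not permitted); try the next one
            continue
        sock.listen()
        return sock, port
    raise RuntimeError(f"no free port in range {start_port}-{end_port - 1}")
```

Returning the bound socket rather than only the port number avoids a window in which another process could claim the port between the check and the server's own bind.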

Config Change Detection

The config_hash field in server.lock enables detecting when the config file has changed since the server started:

  1. On CLI invocation, compute hash of current config file
  2. Compare with config_hash in server.lock
  3. If different, warn user:
    Warning: Config has changed since server started.
    Run 'databuild stop && databuild serve' to apply changes.
    

We don't auto-restart because that could interrupt in-progress builds.
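
The comparison could be implemented along these lines. This is a Python sketch for illustration: the function names are hypothetical, and hashing the raw file bytes with SHA-256 and hex-encoding the digest is an assumption based on the "sha256:..." form shown in the server.lock example.

```python
import hashlib
from pathlib import Path


def compute_config_hash(config_path: Path) -> str:
    """Hash the raw config file bytes, formatted as 'sha256:<hex digest>'."""
    digest = hashlib.sha256(config_path.read_bytes()).hexdigest()
    return f"sha256:{digest}"


def config_changed(config_path: Path, recorded_hash: str) -> bool:
    """True if the config on disk no longer matches the hash recorded in server.lock."""
    return compute_config_hash(config_path) != recorded_hash
```

Hashing raw bytes means that whitespace-only edits also count as changes; hashing a normalized form of the parsed config would avoid that, at the cost of extra complexity.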

Error Handling

Stale Lock File

If server.lock exists but the lock is not held (process crashed):

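  1. Discard the stale contents of server.lock (truncate the file in place rather than unlinking it, so the flock acquired during startup stays attached to the file that other CLI processes open)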
  2. Proceed with normal startup

Server Unreachable

If lock is held but health check fails repeatedly:

  1. Log warning: "Server appears unresponsive"
  2. After N retries, suggest: "Try 'kill -9 ${pid}' and retry"
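
The retry behavior here, and the "wait and retry" branch of the startup flow, can be sketched in Python as follows (illustration only). The function names, the retry count, and the delay are assumptions; only the GET /health endpoint comes from this document, and treating HTTP 200 as healthy is also an assumption.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_health_probe(port: int, timeout: float = 1.0) -> bool:
    """True if GET /health on localhost answers with HTTP 200 (assumed success code)."""
    url = f"http://localhost:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_until_healthy(probe, retries: int = 10, delay: float = 0.2,
                       sleep=time.sleep) -> bool:
    """Call probe() up to `retries` times, sleeping `delay` seconds between attempts."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            sleep(delay)
    return False
```

A caller might use `wait_until_healthy(lambda: http_health_probe(port))` and, on a `False` result, print the "Server appears unresponsive" warning along with the suggested kill command.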

Port Conflict

If preferred port is in use:

  1. Automatically try next port (3539, 3540, ...)
  2. Store actual port in server.lock
  3. CLI reads from lock file, so it always connects to correct port

Future Considerations

Multi-Graph Scenarios

The graph_label-based directory structure supports multiple graphs in the same workspace. Each graph has independent:

  • Server process
  • Port allocation
  • BEL database
  • Idle timeout

Remote Servers

The current design assumes localhost. Future extensions could support:

  • Remote server URLs in config
  • SSH tunneling
  • Cloud-hosted servers

Job Re-entrance

Currently, if a server crashes mid-build, job runs are orphaned. Future work:

  • Detect orphaned job runs on startup
  • Resume or mark as failed
  • Track external job processes (e.g., Databricks jobs)

Implementation Checklist

  • Extend DatabuildConfig with graph_label and idle_timeout_seconds
  • Create ServerLock struct for reading/writing lock file
  • Implement file locking with flock()
  • Implement daemonization (double-fork pattern)
  • Add auto-start logic to existing CLI commands
  • Add databuild stop command
  • Add databuild status command
  • Update example configs with graph_label
  • Add integration tests for server lifecycle