
Job Run Logging Strategy

Claude-generated plan for logging.

Philosophy

Job run logs are critical for debugging build failures and understanding system behavior. The logging system must:

  1. Be resource efficient - Not consume unbounded memory in the service process
  2. Persist across restarts - Logs survive service restarts/crashes
  3. Stream in real-time - Enable live tailing for running jobs
  4. Support future log shipping - Abstract design allows later integration with log aggregation systems
  5. Maintain data locality - Keep logs on build machines where jobs execute

File-Based Approach

Directory Structure

/var/log/databuild/
  job_runs/
    {job_run_id}/
      stdout.log      # Job standard output
      stderr.log      # Job standard error
      metadata.json   # Job metadata (timestamps, exit code, building_partitions, etc.)

Write Strategy

  • Streaming writes: Job output written to disk as it's produced (not buffered in memory)
  • Append-only: Log files are append-only for simplicity and crash safety
  • Metadata on completion: Write metadata.json when the job reaches a terminal state
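
A minimal sketch of the completion write; the JobMetadata fields beyond timestamps, exit code, and building_partitions are illustrative assumptions:

use serde::Serialize;
use std::{fs, io, path::Path};

/// Illustrative metadata shape; the real field set is still open.
#[derive(Serialize)]
pub struct JobMetadata {
    pub job_run_id: String,
    pub exit_code: i32,
    pub started_at_ms: u64,
    pub finished_at_ms: u64,
    pub building_partitions: Vec<String>,
}

/// Written exactly once, when the job reaches a terminal state.
pub fn write_metadata(run_dir: &Path, meta: &JobMetadata) -> io::Result<()> {
    let json = serde_json::to_string_pretty(meta)
        .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?;
    fs::write(run_dir.join("metadata.json"), json)
}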

Rotation & Cleanup Policy

A two-pronged approach prevents unbounded disk usage (a cleanup sketch follows the configuration below):

  1. Time-based TTL: Delete logs older than N days (default: 7 days)
  2. Size-based cap: If total log directory exceeds M GB (default: 10 GB), delete oldest logs first

Configuration via environment variables:

  • DATABUILD_LOG_TTL_DAYS (default: 7)
  • DATABUILD_LOG_MAX_SIZE_GB (default: 10)
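
A minimal sketch of one cleanup pass driven by these settings; names and structure are illustrative, and a real implementation would likely run this periodically from a background task:

use std::{fs, path::Path, time::{Duration, SystemTime}};

/// Delete job_run directories older than the TTL, then the oldest remaining
/// directories until the total size fits under the cap.
fn cleanup_job_run_logs(base: &Path, ttl_days: u64, max_size_gb: u64) -> std::io::Result<()> {
    let cutoff = SystemTime::now() - Duration::from_secs(ttl_days * 24 * 3600);
    let cap_bytes = max_size_gb * 1024 * 1024 * 1024;

    // Collect (modified time, total size, path) for each {job_run_id} directory.
    let mut runs = Vec::new();
    for entry in fs::read_dir(base)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if !meta.is_dir() {
            continue;
        }
        let size: u64 = fs::read_dir(entry.path())?
            .filter_map(|f| f.ok()?.metadata().ok())
            .map(|m| m.len())
            .sum();
        runs.push((meta.modified()?, size, entry.path()));
    }

    runs.sort_by_key(|(modified, _, _)| *modified); // oldest first
    let mut total: u64 = runs.iter().map(|(_, size, _)| *size).sum();
    for (modified, size, path) in runs {
        if modified < cutoff || total > cap_bytes {
            fs::remove_dir_all(&path)?;
            total -= size;
        }
    }
    Ok(())
}

Deleting whole {job_run_id} directories keeps stdout.log, stderr.log, and metadata.json together, matching the per-run layout above.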

API Streaming

HTTP Endpoints

GET /api/job_runs/{job_run_id}/logs/stdout
GET /api/job_runs/{job_run_id}/logs/stderr

Streaming Protocol

Use Server-Sent Events (SSE) for real-time log streaming:

  • Efficient for text streams (line-oriented)
  • Native browser support (no WebSocket complexity)
  • Automatic reconnection
  • Works through HTTP/1.1 (no HTTP/2 requirement)

Example Response

event: log
data: Building partition data/alpha...

event: log
data: [INFO] Reading dependencies from upstream

event: complete
data: {"exit_code": 0, "duration_ms": 1234}

Query Parameters

  • ?follow=true - Keep connection open, stream new lines as they're written (like tail -f)
  • ?since=<line_number> - Start from specific line (for reconnection)
  • ?lines=<N> - Return last N lines and close (for quick inspection)
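
A sketch of how these endpoints and parameters could map onto an SSE response, assuming an axum-based HTTP layer (the actual server framework isn't specified here); follow mode and error handling are elided:

use axum::{
    extract::{Path, Query},
    response::sse::{Event, KeepAlive, Sse},
};
use futures_util::{Stream, StreamExt};
use serde::Deserialize;
use std::convert::Infallible;
use tokio::{fs::File, io::{AsyncBufReadExt, BufReader}};
use tokio_stream::wrappers::LinesStream;

#[derive(Deserialize)]
struct LogQuery {
    #[serde(default)]
    since: usize,          // start from line N
    lines: Option<usize>,  // return at most N lines
    // `follow` would switch to a tail-style stream; omitted in this sketch.
}

async fn stdout_logs(
    Path(job_run_id): Path<String>,
    Query(q): Query<LogQuery>,
) -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    let path = format!("/var/log/databuild/job_runs/{job_run_id}/stdout.log");
    let file = File::open(path).await.expect("log file exists"); // real handler: return 404
    let lines = LinesStream::new(BufReader::new(file).lines())
        .filter_map(|line| async move { line.ok() })
        .skip(q.since)
        .take(q.lines.unwrap_or(usize::MAX))
        .map(|line| Ok::<_, Infallible>(Event::default().event("log").data(line)));
    Sse::new(lines).keep_alive(KeepAlive::default())
}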

Abstraction Layer

Define LogStore trait to enable future log shipping without changing core logic:

use std::pin::Pin;
use futures_core::Stream;

/// Abstraction for job run log storage and retrieval
pub trait LogStore: Send + Sync {
    /// Append a line to stdout for the given job run
    fn append_stdout(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Append a line to stderr for the given job run
    fn append_stderr(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Stream stdout lines from a job run. The stream is boxed (rather than
    /// `impl Stream`) so the trait stays object-safe for use as `dyn LogStore`.
    fn stream_stdout(&self, job_run_id: &str, opts: StreamOptions)
        -> Pin<Box<dyn Stream<Item = Result<String, LogError>> + Send + '_>>;

    /// Stream stderr lines from a job run
    fn stream_stderr(&self, job_run_id: &str, opts: StreamOptions)
        -> Pin<Box<dyn Stream<Item = Result<String, LogError>> + Send + '_>>;

    /// Write job metadata on completion
    fn write_metadata(&mut self, job_run_id: &str, metadata: JobMetadata)
        -> Result<(), LogError>;
}

pub struct StreamOptions {
    pub follow: bool,        // Keep streaming new lines
    pub since_line: usize,   // Start from line N
    pub max_lines: Option<usize>,  // Limit to N lines
}

Initial Implementation: FileLogStore

pub struct FileLogStore {
    base_path: PathBuf,  // e.g., /var/log/databuild/job_runs
}

Writes directly to {base_path}/{job_run_id}/stdout.log.
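
A minimal sketch of the append path, assuming LogError implements From<std::io::Error>; append_stdout and append_stderr would delegate to this helper with "stdout.log" and "stderr.log" respectively:

use std::fs::{create_dir_all, OpenOptions};
use std::io::Write;

impl FileLogStore {
    fn append(&self, job_run_id: &str, file_name: &str, line: &str) -> Result<(), LogError> {
        let dir = self.base_path.join(job_run_id);
        create_dir_all(&dir)?; // first line for a run creates its directory
        let mut file = OpenOptions::new()
            .create(true)
            .append(true)      // append-only; existing bytes are never rewritten
            .open(dir.join(file_name))?;
        writeln!(file, "{line}")?; // each line goes straight to the OS, not a memory buffer
        Ok(())
    }
}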

Future Implementations

  • ShippingLogStore: Wraps FileLogStore, ships logs to S3/GCS/CloudWatch in background
  • CompositeLogStore: Writes to multiple stores (local + remote)
  • BufferedLogStore: Batches writes for efficiency

Integration with Job Runner

The SubProcessBackend (in job_run.rs) currently buffers stdout in memory. This needs updating:

Current (in-memory buffering):

pub struct SubProcessRunning {
    pub process: Child,
    pub stdout_buffer: Vec<String>,  // ❌ Unbounded memory
}

Proposed (streaming to disk):

pub struct SubProcessRunning {
    pub process: Child,
    pub log_store: Arc<Mutex<dyn LogStore>>,
    pub job_run_id: String,
}

When polling the job (sketched below):

  1. Read available stdout/stderr from process
  2. Write each line to log_store.append_stdout(job_run_id, line)
  3. Parse for special lines (e.g., DATABUILD_MISSING_DEPS_JSON:...)
  4. Don't keep the full log in memory
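
A rough sketch of that loop, assuming LogError: From<std::io::Error>; the real SubProcessBackend would read incrementally on each poll rather than blocking to end-of-stream as this simplified version does:

use std::io::{BufRead, BufReader};
use std::process::ChildStdout;

fn drain_stdout(
    stdout: ChildStdout,
    log_store: &mut dyn LogStore,
    job_run_id: &str,
) -> Result<(), LogError> {
    for line in BufReader::new(stdout).lines() {
        let line = line?;
        if let Some(_payload) = line.strip_prefix("DATABUILD_MISSING_DEPS_JSON:") {
            // Special lines are still parsed inline; handling elided in this sketch.
        }
        // Every line lands on disk via the LogStore; nothing accumulates in memory.
        log_store.append_stdout(job_run_id, &line)?;
    }
    Ok(())
}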

CLI Integration

The CLI should support log streaming:

# Stream logs for a running or completed job
databuild logs <job_run_id>

# Follow mode (tail -f)
databuild logs <job_run_id> --follow

# Show last N lines
databuild logs <job_run_id> --tail 100

# Show stderr instead of stdout
databuild logs <job_run_id> --stderr

Under the hood, these hit the /api/job_runs/{id}/logs/stdout (or /stderr) endpoint, with ?follow=true for --follow and ?lines=N for --tail.
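
Roughly, the follow path could look like this on the client side, assuming reqwest with its stream feature enabled (the actual CLI HTTP client isn't specified here); a fuller version would handle the complete event and reconnect with ?since=<line>:

use futures_util::StreamExt;

async fn tail_logs(base_url: &str, job_run_id: &str) -> Result<(), Box<dyn std::error::Error>> {
    let url = format!("{base_url}/api/job_runs/{job_run_id}/logs/stdout?follow=true");
    let mut stream = reqwest::get(&url).await?.bytes_stream();
    let mut buf = String::new();
    while let Some(chunk) = stream.next().await {
        buf.push_str(&String::from_utf8_lossy(&chunk?));
        // SSE frames are line-oriented; print the payload of each `data:` line.
        while let Some(pos) = buf.find('\n') {
            let line: String = buf.drain(..=pos).collect();
            if let Some(data) = line.trim_end().strip_prefix("data: ") {
                println!("{data}");
            }
        }
    }
    Ok(())
}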

Web App Integration

The web app can use native EventSource API:

const eventSource = new EventSource(`/api/job_runs/${jobId}/logs/stdout?follow=true`);

eventSource.addEventListener('log', (event) => {
    appendToTerminal(event.data);
});

eventSource.addEventListener('complete', (event) => {
    const metadata = JSON.parse(event.data);
    showExitCode(metadata.exit_code);
    eventSource.close();
});

Future: Log Shipping

When adding log shipping (e.g., to S3, CloudWatch, Datadog):

  1. Create a ShippingLogStore implementation
  2. Run a background task that:
    • Watches for completed jobs
    • Batches log lines
    • Ships to configured destination
    • Deletes local files after successful upload (if configured)
  3. Configure via:
    export DATABUILD_LOG_SHIP_DEST=s3://my-bucket/databuild-logs
    export DATABUILD_LOG_KEEP_LOCAL=false  # Delete after ship
    

The LogStore trait means the core system doesn't change - just swap implementations.
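
A sketch of that swap point at service startup; ShippingLogStore and its constructor here are hypothetical:

fn make_log_store(base_path: std::path::PathBuf) -> Box<dyn LogStore> {
    match std::env::var("DATABUILD_LOG_SHIP_DEST") {
        // Hypothetical future store: wraps FileLogStore and ships in the background.
        Ok(dest) => Box::new(ShippingLogStore::new(FileLogStore { base_path }, dest)),
        // Default today: plain local files.
        Err(_) => Box::new(FileLogStore { base_path }),
    }
}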

Open Questions

  1. Log format: Plain text vs structured (JSON lines)?

    • Plain text is more human-readable
    • Structured is easier to search/analyze
    • Suggestion: Plain text in files, parse to structured for API if needed
  2. Compression: Compress old logs to save space?

    • Could gzip files older than 24 hours
    • Trade-off: disk space vs CPU on access
  3. Indexing: Build an index for fast log search?

    • Simple grep is probably fine initially
    • Could add full-text search later if needed
  4. Multi-machine: How do logs work in distributed builds?

    • Each build machine has its own log directory
    • Central service aggregates via log shipping
    • Need to design this when we tackle distributed execution