
Job Run Logging Strategy

Claude-generated plan for logging.

Philosophy

Job run logs are critical for debugging build failures and understanding system behavior. The logging system must:

  1. Be resource efficient - Not consume unbounded memory in the service process
  2. Persist across restarts - Logs survive service restarts/crashes
  3. Stream in real-time - Enable live tailing for running jobs
  4. Support future log shipping - Abstract design allows later integration with log aggregation systems
  5. Maintain data locality - Keep logs on build machines where jobs execute

File-Based Approach

Directory Structure

/var/log/databuild/
  job_runs/
    {job_run_id}/
      stdout.log      # Job standard output
      stderr.log      # Job standard error
      metadata.json   # Job metadata (timestamps, exit code, building_partitions, etc.)

Write Strategy

  • Streaming writes: Job output written to disk as it's produced (not buffered in memory)
  • Append-only: Log files are append-only for simplicity and crash safety
  • Metadata on completion: Write metadata.json when the job reaches a terminal state
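
A minimal sketch of the completion write; the JobMetadata fields beyond timestamps, exit code, and building_partitions are illustrative assumptions:

use serde::Serialize;
use std::{fs, io, path::Path};

/// Illustrative metadata shape; the real field set is still open.
#[derive(Serialize)]
pub struct JobMetadata {
    pub job_run_id: String,
    pub exit_code: i32,
    pub started_at_ms: u64,
    pub finished_at_ms: u64,
    pub building_partitions: Vec<String>,
}

/// Written exactly once, when the job reaches a terminal state.
pub fn write_metadata(run_dir: &Path, meta: &JobMetadata) -> io::Result<()> {
    let json = serde_json::to_string_pretty(meta)
        .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?;
    fs::write(run_dir.join("metadata.json"), json)
}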

Rotation & Cleanup Policy

A two-pronged approach prevents unbounded disk usage (a cleanup sketch follows the configuration below):

  1. Time-based TTL: Delete logs older than N days (default: 7 days)
  2. Size-based cap: If total log directory exceeds M GB (default: 10 GB), delete oldest logs first

Configuration via environment variables:

  • DATABUILD_LOG_TTL_DAYS (default: 7)
  • DATABUILD_LOG_MAX_SIZE_GB (default: 10)
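
A minimal sketch of one cleanup pass driven by these settings; names and structure are illustrative, and a real implementation would likely run this periodically from a background task:

use std::{fs, path::Path, time::{Duration, SystemTime}};

/// Delete job_run directories older than the TTL, then the oldest remaining
/// directories until the total size fits under the cap.
fn cleanup_job_run_logs(base: &Path, ttl_days: u64, max_size_gb: u64) -> std::io::Result<()> {
    let cutoff = SystemTime::now() - Duration::from_secs(ttl_days * 24 * 3600);
    let cap_bytes = max_size_gb * 1024 * 1024 * 1024;

    // Collect (modified time, total size, path) for each {job_run_id} directory.
    let mut runs = Vec::new();
    for entry in fs::read_dir(base)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if !meta.is_dir() {
            continue;
        }
        let size: u64 = fs::read_dir(entry.path())?
            .filter_map(|f| f.ok()?.metadata().ok())
            .map(|m| m.len())
            .sum();
        runs.push((meta.modified()?, size, entry.path()));
    }

    runs.sort_by_key(|(modified, _, _)| *modified); // oldest first
    let mut total: u64 = runs.iter().map(|(_, size, _)| *size).sum();
    for (modified, size, path) in runs {
        if modified < cutoff || total > cap_bytes {
            fs::remove_dir_all(&path)?;
            total -= size;
        }
    }
    Ok(())
}

Deleting whole {job_run_id} directories keeps stdout.log, stderr.log, and metadata.json together, matching the per-run layout above.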

API Streaming

HTTP Endpoints

GET /api/job_runs/{job_run_id}/logs/stdout
GET /api/job_runs/{job_run_id}/logs/stderr

Streaming Protocol

Use Server-Sent Events (SSE) for real-time log streaming:

  • Efficient for text streams (line-oriented)
  • Native browser support (no WebSocket complexity)
  • Automatic reconnection
  • Works through HTTP/1.1 (no HTTP/2 requirement)

Example Response

event: log
data: Building partition data/alpha...

event: log
data: [INFO] Reading dependencies from upstream

event: complete
data: {"exit_code": 0, "duration_ms": 1234}

Query Parameters

  • ?follow=true - Keep connection open, stream new lines as they're written (like tail -f)
  • ?since=<line_number> - Start from specific line (for reconnection)
  • ?lines=<N> - Return last N lines and close (for quick inspection)
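
A sketch of how these endpoints and parameters could map onto an SSE response, assuming an axum-based HTTP layer (the actual server framework isn't specified here); follow mode and error handling are elided:

use axum::{
    extract::{Path, Query},
    response::sse::{Event, KeepAlive, Sse},
};
use futures_util::{Stream, StreamExt};
use serde::Deserialize;
use std::convert::Infallible;
use tokio::{fs::File, io::{AsyncBufReadExt, BufReader}};
use tokio_stream::wrappers::LinesStream;

#[derive(Deserialize)]
struct LogQuery {
    #[serde(default)]
    since: usize,          // start from line N
    lines: Option<usize>,  // return at most N lines
    // `follow` would switch to a tail-style stream; omitted in this sketch.
}

async fn stdout_logs(
    Path(job_run_id): Path<String>,
    Query(q): Query<LogQuery>,
) -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    let path = format!("/var/log/databuild/job_runs/{job_run_id}/stdout.log");
    let file = File::open(path).await.expect("log file exists"); // real handler: return 404
    let lines = LinesStream::new(BufReader::new(file).lines())
        .filter_map(|line| async move { line.ok() })
        .skip(q.since)
        .take(q.lines.unwrap_or(usize::MAX))
        .map(|line| Ok::<_, Infallible>(Event::default().event("log").data(line)));
    Sse::new(lines).keep_alive(KeepAlive::default())
}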

Abstraction Layer

Define LogStore trait to enable future log shipping without changing core logic:

use std::pin::Pin;
use futures_core::Stream;

/// Abstraction for job run log storage and retrieval
pub trait LogStore: Send + Sync {
    /// Append a line to stdout for the given job run
    fn append_stdout(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Append a line to stderr for the given job run
    fn append_stderr(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Stream stdout lines from a job run. The stream is boxed (rather than
    /// `impl Stream`) so the trait stays object-safe for use as `dyn LogStore`.
    fn stream_stdout(&self, job_run_id: &str, opts: StreamOptions)
        -> Pin<Box<dyn Stream<Item = Result<String, LogError>> + Send + '_>>;

    /// Stream stderr lines from a job run
    fn stream_stderr(&self, job_run_id: &str, opts: StreamOptions)
        -> Pin<Box<dyn Stream<Item = Result<String, LogError>> + Send + '_>>;

    /// Write job metadata on completion
    fn write_metadata(&mut self, job_run_id: &str, metadata: JobMetadata)
        -> Result<(), LogError>;
}

pub struct StreamOptions {
    pub follow: bool,        // Keep streaming new lines
    pub since_line: usize,   // Start from line N
    pub max_lines: Option<usize>,  // Limit to N lines
}

Initial Implementation: FileLogStore

pub struct FileLogStore {
    base_path: PathBuf,  // e.g., /var/log/databuild/job_runs
}

Writes directly to {base_path}/{job_run_id}/stdout.log.
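
A minimal sketch of the append path, assuming LogError implements From<std::io::Error>; append_stdout and append_stderr would delegate to this helper with "stdout.log" and "stderr.log" respectively:

use std::fs::{create_dir_all, OpenOptions};
use std::io::Write;

impl FileLogStore {
    fn append(&self, job_run_id: &str, file_name: &str, line: &str) -> Result<(), LogError> {
        let dir = self.base_path.join(job_run_id);
        create_dir_all(&dir)?; // first line for a run creates its directory
        let mut file = OpenOptions::new()
            .create(true)
            .append(true)      // append-only; existing bytes are never rewritten
            .open(dir.join(file_name))?;
        writeln!(file, "{line}")?; // each line goes straight to the OS, not a memory buffer
        Ok(())
    }
}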

Future Implementations

  • ShippingLogStore: Wraps FileLogStore, ships logs to S3/GCS/CloudWatch in background
  • CompositeLogStore: Writes to multiple stores (local + remote)
  • BufferedLogStore: Batches writes for efficiency

Integration with Job Runner

The SubProcessBackend (in job_run.rs) currently buffers stdout in memory. This needs updating:

Current (in-memory buffering):

pub struct SubProcessRunning {
    pub process: Child,
    pub stdout_buffer: Vec<String>,  // ❌ Unbounded memory
}

Proposed (streaming to disk):

pub struct SubProcessRunning {
    pub process: Child,
    pub log_store: Arc<Mutex<dyn LogStore>>,
    pub job_run_id: String,
}

When polling the job (sketched below):

  1. Read available stdout/stderr from process
  2. Write each line to log_store.append_stdout(job_run_id, line)
  3. Parse for special lines (e.g., DATABUILD_MISSING_DEPS_JSON:...)
  4. Don't keep the full log in memory
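
A rough sketch of that loop, assuming LogError: From<std::io::Error>; the real SubProcessBackend would read incrementally on each poll rather than blocking to end-of-stream as this simplified version does:

use std::io::{BufRead, BufReader};
use std::process::ChildStdout;

fn drain_stdout(
    stdout: ChildStdout,
    log_store: &mut dyn LogStore,
    job_run_id: &str,
) -> Result<(), LogError> {
    for line in BufReader::new(stdout).lines() {
        let line = line?;
        if let Some(_payload) = line.strip_prefix("DATABUILD_MISSING_DEPS_JSON:") {
            // Special lines are still parsed inline; handling elided in this sketch.
        }
        // Every line lands on disk via the LogStore; nothing accumulates in memory.
        log_store.append_stdout(job_run_id, &line)?;
    }
    Ok(())
}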

CLI Integration

The CLI should support log streaming:

# Stream logs for a running or completed job
databuild logs <job_run_id>

# Follow mode (tail -f)
databuild logs <job_run_id> --follow

# Show last N lines
databuild logs <job_run_id> --tail 100

# Show stderr instead of stdout
databuild logs <job_run_id> --stderr

Under the hood, these hit the /api/job_runs/{id}/logs/stdout (or /stderr) endpoint, with ?follow=true for --follow and ?lines=N for --tail.
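
Roughly, the follow path could look like this on the client side, assuming reqwest with its stream feature enabled (the actual CLI HTTP client isn't specified here); a fuller version would handle the complete event and reconnect with ?since=<line>:

use futures_util::StreamExt;

async fn tail_logs(base_url: &str, job_run_id: &str) -> Result<(), Box<dyn std::error::Error>> {
    let url = format!("{base_url}/api/job_runs/{job_run_id}/logs/stdout?follow=true");
    let mut stream = reqwest::get(&url).await?.bytes_stream();
    let mut buf = String::new();
    while let Some(chunk) = stream.next().await {
        buf.push_str(&String::from_utf8_lossy(&chunk?));
        // SSE frames are line-oriented; print the payload of each `data:` line.
        while let Some(pos) = buf.find('\n') {
            let line: String = buf.drain(..=pos).collect();
            if let Some(data) = line.trim_end().strip_prefix("data: ") {
                println!("{data}");
            }
        }
    }
    Ok(())
}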

Web App Integration

The web app can use native EventSource API:

const eventSource = new EventSource(`/api/job_runs/${jobId}/logs/stdout?follow=true`);

eventSource.addEventListener('log', (event) => {
    appendToTerminal(event.data);
});

eventSource.addEventListener('complete', (event) => {
    const metadata = JSON.parse(event.data);
    showExitCode(metadata.exit_code);
    eventSource.close();
});

Future: Log Shipping

When adding log shipping (e.g., to S3, CloudWatch, Datadog):

  1. Create a ShippingLogStore implementation
  2. Run a background task that:
    • Watches for completed jobs
    • Batches log lines
    • Ships to configured destination
    • Deletes local files after successful upload (if configured)
  3. Configure via:
    export DATABUILD_LOG_SHIP_DEST=s3://my-bucket/databuild-logs
    export DATABUILD_LOG_KEEP_LOCAL=false  # Delete after ship
    

The LogStore trait means the core system doesn't change - just swap implementations.
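
A sketch of that swap point at service startup; ShippingLogStore and its constructor here are hypothetical:

fn make_log_store(base_path: std::path::PathBuf) -> Box<dyn LogStore> {
    match std::env::var("DATABUILD_LOG_SHIP_DEST") {
        // Hypothetical future store: wraps FileLogStore and ships in the background.
        Ok(dest) => Box::new(ShippingLogStore::new(FileLogStore { base_path }, dest)),
        // Default today: plain local files.
        Err(_) => Box::new(FileLogStore { base_path }),
    }
}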

Open Questions

  1. Log format: Plain text vs structured (JSON lines)?

    • Plain text is more human-readable
    • Structured is easier to search/analyze
    • Suggestion: Plain text in files, parse to structured for API if needed
  2. Compression: Compress old logs to save space?

    • Could gzip files older than 24 hours
    • Trade-off: disk space vs CPU on access
  3. Indexing: Build an index for fast log search?

    • Simple grep is probably fine initially
    • Could add full-text search later if needed
  4. Multi-machine: How do logs work in distributed builds?

    • Each build machine has its own log directory
    • Central service aggregates via log shipping
    • Need to design this when we tackle distributed execution