# Job Run Logging Strategy

Claude-generated plan for logging.

## Philosophy

Job run logs are critical for debugging build failures and understanding system behavior. The logging system must:

1. **Be resource efficient** - Not consume unbounded memory in the service process
2. **Persist across restarts** - Logs survive service restarts/crashes
3. **Stream in real-time** - Enable live tailing for running jobs
4. **Support future log shipping** - Abstract design allows later integration with log aggregation systems
5. **Maintain data locality** - Keep logs on build machines where jobs execute

## File-Based Approach

### Directory Structure

```
/var/log/databuild/
  job_runs/
    {job_run_id}/
      stdout.log      # Job standard output
      stderr.log      # Job standard error
      metadata.json   # Job metadata (timestamps, exit code, building_partitions, etc.)
```

### Write Strategy

- **Streaming writes**: Job output is written to disk as it's produced (not buffered in memory)
- **Append-only**: Log files are append-only for simplicity and crash safety
- **Metadata on completion**: Write metadata.json when the job reaches a terminal state

### Rotation & Cleanup Policy

Two-pronged approach to prevent unbounded disk usage:

1. **Time-based TTL**: Delete logs older than N days (default: 7 days)
2. **Size-based cap**: If the total log directory exceeds M GB (default: 10 GB), delete the oldest logs first

Configuration via environment variables:

- `DATABUILD_LOG_TTL_DAYS` (default: 7)
- `DATABUILD_LOG_MAX_SIZE_GB` (default: 10)

## API Streaming

### HTTP Endpoints

```
GET /api/job_runs/{job_run_id}/logs/stdout
GET /api/job_runs/{job_run_id}/logs/stderr
```

### Streaming Protocol

Use **Server-Sent Events (SSE)** for real-time log streaming:

- Efficient for text streams (line-oriented)
- Native browser support (no WebSocket complexity)
- Automatic reconnection
- Works over HTTP/1.1 (no HTTP/2 requirement)

### Example Response

```
event: log
data: Building partition data/alpha...

event: log
data: [INFO] Reading dependencies from upstream

event: complete
data: {"exit_code": 0, "duration_ms": 1234}
```

### Query Parameters

- `?follow=true` - Keep the connection open and stream new lines as they're written (like `tail -f`)
- `?since=<line>` - Start from a specific line (for reconnection)
- `?lines=<n>` - Return the last N lines and close (for quick inspection)

## Abstraction Layer

Define a `LogStore` trait to enable future log shipping without changing core logic:

```rust
/// Abstraction for job run log storage and retrieval
pub trait LogStore: Send + Sync {
    /// Append a line to stdout for the given job run
    fn append_stdout(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Append a line to stderr for the given job run
    fn append_stderr(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Stream stdout lines from a job run
    fn stream_stdout(&self, job_run_id: &str, opts: StreamOptions) -> impl Stream<Item = Result<String, LogError>>;

    /// Stream stderr lines from a job run
    fn stream_stderr(&self, job_run_id: &str, opts: StreamOptions) -> impl Stream<Item = Result<String, LogError>>;

    /// Write job metadata on completion
    fn write_metadata(&mut self, job_run_id: &str, metadata: JobMetadata) -> Result<(), LogError>;
}

pub struct StreamOptions {
    pub follow: bool,             // Keep streaming new lines
    pub since_line: usize,        // Start from line N
    pub max_lines: Option<usize>, // Limit to N lines
}
```

### Initial Implementation: `FileLogStore`

```rust
pub struct FileLogStore {
    base_path: PathBuf, // e.g., /var/log/databuild/job_runs
}
```

Writes directly to `{base_path}/{job_run_id}/stdout.log`.
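A minimal sketch of the write path, assuming synchronous `std::fs` appends (the `append_line` helper is illustrative, not part of the trait); a real implementation would likely cache open file handles rather than reopening per line:

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::PathBuf;

pub struct FileLogStore {
    base_path: PathBuf, // as defined above
}

impl FileLogStore {
    /// Append one line to the named log file, creating the
    /// job run directory and file on first write.
    fn append_line(&self, job_run_id: &str, file: &str, line: &str) -> std::io::Result<()> {
        let dir = self.base_path.join(job_run_id);
        fs::create_dir_all(&dir)?;
        let mut f = OpenOptions::new()
            .create(true)
            .append(true) // append-only, per the write strategy above
            .open(dir.join(file))?;
        writeln!(f, "{line}")
    }
}
```

`append_stdout` and `append_stderr` could then delegate to this helper with `"stdout.log"` / `"stderr.log"`, mapping `std::io::Error` into `LogError`.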
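The rotation policy above can run as a periodic sweep over `base_path`. A sketch of the TTL half (the `cleanup_expired` name and `ttl` parameter are illustrative):

```rust
use std::fs;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Delete job run log directories not modified within `ttl`.
/// Illustrative only: a full sweep would also enforce the size cap
/// by sorting runs oldest-first and deleting until under the limit.
fn cleanup_expired(base_path: &Path, ttl: Duration) -> std::io::Result<()> {
    let cutoff = SystemTime::now() - ttl;
    for entry in fs::read_dir(base_path)? {
        let dir = entry?.path();
        if !dir.is_dir() {
            continue;
        }
        // Use the directory's modification time as a proxy for the
        // last write to any log file inside it.
        let modified = fs::metadata(&dir)?.modified()?;
        if modified < cutoff {
            fs::remove_dir_all(&dir)?;
        }
    }
    Ok(())
}
```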
### Future Implementations

- **`ShippingLogStore`**: Wraps `FileLogStore`, ships logs to S3/GCS/CloudWatch in the background
- **`CompositeLogStore`**: Writes to multiple stores (local + remote)
- **`BufferedLogStore`**: Batches writes for efficiency

## Integration with Job Runner

The `SubProcessBackend` (in `job_run.rs`) currently buffers stdout in memory. This needs updating:

### Current (in-memory buffering):

```rust
pub struct SubProcessRunning {
    pub process: Child,
    pub stdout_buffer: Vec<u8>, // ❌ Unbounded memory
}
```

### Proposed (streaming to disk):

```rust
pub struct SubProcessRunning {
    pub process: Child,
    pub log_store: Arc<Mutex<dyn LogStore>>,
    pub job_run_id: String,
}
```

When polling the job:

1. Read available stdout/stderr from the process
2. Write each line via `log_store.append_stdout(job_run_id, line)`
3. Parse for special lines (e.g., `DATABUILD_MISSING_DEPS_JSON:...`)
4. Don't keep the full log in memory

## CLI Integration

The CLI should support log streaming:

```bash
# Stream logs for a running or completed job
databuild logs <job_run_id>

# Follow mode (tail -f)
databuild logs <job_run_id> --follow

# Show last N lines
databuild logs <job_run_id> --tail 100

# Show stderr instead of stdout
databuild logs <job_run_id> --stderr
```

Under the hood, this hits the `/api/job_runs/{id}/logs/stdout?follow=true` endpoint.

## Web App Integration

The web app can use the native EventSource API:

```javascript
const eventSource = new EventSource(`/api/job_runs/${jobId}/logs/stdout?follow=true`);

eventSource.addEventListener('log', (event) => {
  appendToTerminal(event.data);
});

eventSource.addEventListener('complete', (event) => {
  const metadata = JSON.parse(event.data);
  showExitCode(metadata.exit_code);
  eventSource.close();
});
```

## Future: Log Shipping

When adding log shipping (e.g., to S3, CloudWatch, Datadog):

1. Create a `ShippingLogStore` implementation
2. Run a background task that:
   - Watches for completed jobs
   - Batches log lines
   - Ships to the configured destination
   - Deletes local files after successful upload (if configured)
3. Configure via:

```bash
export DATABUILD_LOG_SHIP_DEST=s3://my-bucket/databuild-logs
export DATABUILD_LOG_KEEP_LOCAL=false  # Delete after ship
```

The `LogStore` trait means the core system doesn't change - just swap implementations.

## Open Questions

1. **Log format**: Plain text vs structured (JSON lines)?
   - Plain text is more human-readable
   - Structured is easier to search/analyze
   - Suggestion: Plain text in files, parse to structured for the API if needed (see the sketch below)
2. **Compression**: Compress old logs to save space?
   - Could gzip files older than 24 hours
   - Trade-off: disk space vs CPU on access
3. **Indexing**: Build an index for fast log search?
   - Simple grep is probably fine initially
   - Could add full-text search later if needed
4. **Multi-machine**: How do logs work in distributed builds?
   - Each build machine has its own log directory
   - Central service aggregates via log shipping
   - Need to design this when we tackle distributed execution
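On open question 1, one lightweight option is to keep plain text on disk and wrap lines into structured JSON only at the API boundary. A sketch assuming `serde`/`serde_json`; the `LogEvent` shape and `to_sse_data` helper are illustrative, not part of the plan above:

```rust
use serde::Serialize;

/// Illustrative structured form of one plain-text log line,
/// produced at the API layer rather than stored on disk.
#[derive(Serialize)]
struct LogEvent<'a> {
    job_run_id: &'a str,
    line_no: usize,
    text: &'a str,
}

/// Wrap a raw log line into the JSON payload for one SSE `log` event.
fn to_sse_data(job_run_id: &str, line_no: usize, text: &str) -> serde_json::Result<String> {
    serde_json::to_string(&LogEvent { job_run_id, line_no, text })
}
```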