diff --git a/docs/ideas/logging.md b/docs/ideas/logging.md
new file mode 100644
index 0000000..99f25e6
--- /dev/null
+++ b/docs/ideas/logging.md
@@ -0,0 +1,231 @@

# Job Run Logging Strategy

Claude-generated plan for logging.

## Philosophy

Job run logs are critical for debugging build failures and understanding system behavior. The logging system must:

1. **Be resource efficient** - Not consume unbounded memory in the service process
2. **Persist across restarts** - Logs survive service restarts/crashes
3. **Stream in real time** - Enable live tailing for running jobs
4. **Support future log shipping** - An abstract design allows later integration with log aggregation systems
5. **Maintain data locality** - Keep logs on the build machines where jobs execute

## File-Based Approach

### Directory Structure

```
/var/log/databuild/
  job_runs/
    {job_run_id}/
      stdout.log      # Job standard output
      stderr.log      # Job standard error
      metadata.json   # Job metadata (timestamps, exit code, building_partitions, etc.)
```

### Write Strategy

- **Streaming writes**: Job output is written to disk as it is produced (not buffered in memory)
- **Append-only**: Log files are append-only for simplicity and crash safety
- **Metadata on completion**: Write metadata.json when the job reaches a terminal state

### Rotation & Cleanup Policy

A two-pronged approach prevents unbounded disk usage:

1. **Time-based TTL**: Delete logs older than N days (default: 7 days)
2. **Size-based cap**: If the total log directory exceeds M GB (default: 10 GB), delete the oldest logs first

Configuration via environment variables:

- `DATABUILD_LOG_TTL_DAYS` (default: 7)
- `DATABUILD_LOG_MAX_SIZE_GB` (default: 10)
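As a rough sketch (not part of the design above), the cleanup pass could look like the following, assuming it runs periodically from a background task and that the caller has already translated `DATABUILD_LOG_TTL_DAYS` and `DATABUILD_LOG_MAX_SIZE_GB` into the `ttl_days` and `max_size_bytes` arguments. The function name and the ordering (TTL first, then the size cap) are assumptions here.

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::time::{Duration, SystemTime};

/// Delete job run log directories that are past the TTL, then delete the
/// oldest remaining runs until the total size is under the cap.
/// A real implementation would also skip job runs that are still executing.
fn cleanup(job_runs_dir: &Path, ttl_days: u64, max_size_bytes: u64) -> std::io::Result<()> {
    // Collect (newest mtime, total size, path) for each job run directory.
    let mut runs: Vec<(SystemTime, u64, PathBuf)> = Vec::new();
    for entry in fs::read_dir(job_runs_dir)? {
        let path = entry?.path();
        if !path.is_dir() {
            continue;
        }
        let mut size = 0u64;
        let mut newest = SystemTime::UNIX_EPOCH;
        for file in fs::read_dir(&path)? {
            let meta = file?.metadata()?;
            size += meta.len();
            if let Ok(modified) = meta.modified() {
                newest = newest.max(modified);
            }
        }
        runs.push((newest, size, path));
    }

    // 1. Time-based TTL: drop anything whose newest file is older than ttl_days.
    let cutoff = SystemTime::now() - Duration::from_secs(ttl_days * 24 * 60 * 60);
    runs.retain(|(newest, _, path)| {
        if *newest < cutoff {
            let _ = fs::remove_dir_all(path);
            false
        } else {
            true
        }
    });

    // 2. Size-based cap: delete the oldest runs first until under the cap.
    runs.sort_by_key(|(newest, _, _)| *newest);
    let mut total: u64 = runs.iter().map(|(_, size, _)| *size).sum();
    for (_, size, path) in &runs {
        if total <= max_size_bytes {
            break;
        }
        let _ = fs::remove_dir_all(path);
        total = total.saturating_sub(*size);
    }
    Ok(())
}
```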
## API Streaming

### HTTP Endpoints

```
GET /api/job_runs/{job_run_id}/logs/stdout
GET /api/job_runs/{job_run_id}/logs/stderr
```

### Streaming Protocol

Use **Server-Sent Events (SSE)** for real-time log streaming:

- Efficient for text streams (line-oriented)
- Native browser support (no WebSocket complexity)
- Automatic reconnection
- Works over HTTP/1.1 (no HTTP/2 requirement)

### Example Response

```
event: log
data: Building partition data/alpha...

event: log
data: [INFO] Reading dependencies from upstream

event: complete
data: {"exit_code": 0, "duration_ms": 1234}
```

### Query Parameters

- `?follow=true` - Keep the connection open and stream new lines as they are written (like `tail -f`)
- `?since=<line>` - Start from a specific line (for reconnection)
- `?lines=<n>` - Return the last N lines and close (for quick inspection)

## Abstraction Layer

Define a `LogStore` trait to enable future log shipping without changing core logic:

```rust
/// Abstraction for job run log storage and retrieval
pub trait LogStore: Send + Sync {
    /// Append a line to stdout for the given job run
    fn append_stdout(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Append a line to stderr for the given job run
    fn append_stderr(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Stream stdout lines from a job run
    fn stream_stdout(&self, job_run_id: &str, opts: StreamOptions)
        -> impl Stream<Item = Result<String, LogError>>;

    /// Stream stderr lines from a job run
    fn stream_stderr(&self, job_run_id: &str, opts: StreamOptions)
        -> impl Stream<Item = Result<String, LogError>>;

    /// Write job metadata on completion
    fn write_metadata(&mut self, job_run_id: &str, metadata: JobMetadata)
        -> Result<(), LogError>;
}

pub struct StreamOptions {
    pub follow: bool,             // Keep streaming new lines
    pub since_line: usize,        // Start from line N
    pub max_lines: Option<usize>, // Limit to N lines
}
```

### Initial Implementation: `FileLogStore`

```rust
pub struct FileLogStore {
    base_path: PathBuf, // e.g., /var/log/databuild/job_runs
}
```

Writes directly to `{base_path}/{job_run_id}/stdout.log`.
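To make this concrete, here is a minimal sketch of the write path. It is illustrative only: it assumes the directory layout above, creates the job run directory lazily on first write, and simplifies the trait's signatures (`&self` and `std::io::Result` instead of `&mut self` and `LogError`).

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::PathBuf;

pub struct FileLogStore {
    base_path: PathBuf, // e.g. /var/log/databuild/job_runs
}

impl FileLogStore {
    fn append(&self, job_run_id: &str, file_name: &str, line: &str) -> std::io::Result<()> {
        // Create the job run directory lazily on first write.
        let dir = self.base_path.join(job_run_id);
        fs::create_dir_all(&dir)?;
        let mut file = OpenOptions::new()
            .create(true)
            .append(true)
            .open(dir.join(file_name))?;
        // Append-only, one line per call; a crash loses at most the current line.
        writeln!(file, "{}", line)
    }

    pub fn append_stdout(&self, job_run_id: &str, line: &str) -> std::io::Result<()> {
        self.append(job_run_id, "stdout.log", line)
    }

    pub fn append_stderr(&self, job_run_id: &str, line: &str) -> std::io::Result<()> {
        self.append(job_run_id, "stderr.log", line)
    }
}
```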
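The read path for `follow=true` could look roughly like this, again only a sketch: it tails the file by re-polling after EOF, and the `emit` callback is a hypothetical stand-in for whatever writes SSE events to the response. A real implementation would stop once the job's metadata indicates completion and might use filesystem notifications instead of sleeping.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::Path;
use std::thread::sleep;
use std::time::Duration;

/// Stream lines from a log file, optionally following new output like `tail -f`.
fn stream_lines(
    path: &Path,
    since_line: usize,
    follow: bool,
    mut emit: impl FnMut(&str),
) -> std::io::Result<()> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut line_no = 0usize;
    let mut buf = String::new();
    loop {
        buf.clear();
        if reader.read_line(&mut buf)? == 0 {
            // Reached EOF: either stop, or wait for the job to append more output.
            if !follow {
                return Ok(());
            }
            // A real implementation would also exit once metadata.json exists
            // (terminal state) rather than polling forever.
            sleep(Duration::from_millis(250));
            continue;
        }
        line_no += 1;
        if line_no > since_line {
            emit(buf.trim_end_matches('\n'));
        }
    }
}
```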
### Future Implementations

- **`ShippingLogStore`**: Wraps `FileLogStore`, ships logs to S3/GCS/CloudWatch in the background
- **`CompositeLogStore`**: Writes to multiple stores (local + remote)
- **`BufferedLogStore`**: Batches writes for efficiency

## Integration with Job Runner

The `SubProcessBackend` (in `job_run.rs`) currently buffers stdout in memory. This needs updating:

### Current (in-memory buffering):

```rust
pub struct SubProcessRunning {
    pub process: Child,
    pub stdout_buffer: Vec<u8>, // ❌ Unbounded memory
}
```

### Proposed (streaming to disk):

```rust
pub struct SubProcessRunning {
    pub process: Child,
    pub log_store: Arc<Mutex<dyn LogStore>>,
    pub job_run_id: String,
}
```

When polling the job:

1. Read available stdout/stderr from the process
2. Write each line to `log_store.append_stdout(job_run_id, line)`
3. Parse for special lines (e.g., `DATABUILD_MISSING_DEPS_JSON:...`)
4. Do not keep the full log in memory

## CLI Integration

The CLI should support log streaming:

```bash
# Stream logs for a running or completed job
databuild logs <job_run_id>

# Follow mode (tail -f)
databuild logs <job_run_id> --follow

# Show last N lines
databuild logs <job_run_id> --tail 100

# Show stderr instead of stdout
databuild logs <job_run_id> --stderr
```

Under the hood, this hits the `/api/job_runs/{id}/logs/stdout?follow=true` endpoint.

## Web App Integration

The web app can use the native `EventSource` API:

```javascript
const eventSource = new EventSource(`/api/job_runs/${jobId}/logs/stdout?follow=true`);

eventSource.addEventListener('log', (event) => {
  appendToTerminal(event.data);
});

eventSource.addEventListener('complete', (event) => {
  const metadata = JSON.parse(event.data);
  showExitCode(metadata.exit_code);
  eventSource.close();
});
```

## Future: Log Shipping

When adding log shipping (e.g., to S3, CloudWatch, Datadog):

1. Create a `ShippingLogStore` implementation
2. Run a background task that:
   - Watches for completed jobs
   - Batches log lines
   - Ships them to the configured destination
   - Deletes local files after a successful upload (if configured)
3. Configure via:
   ```bash
   export DATABUILD_LOG_SHIP_DEST=s3://my-bucket/databuild-logs
   export DATABUILD_LOG_KEEP_LOCAL=false  # Delete after ship
   ```

The `LogStore` trait means the core system doesn't change - just swap implementations.

## Open Questions

1. **Log format**: Plain text vs structured (JSON lines)?
   - Plain text is more human-readable
   - Structured is easier to search/analyze
   - Suggestion: plain text in files, parse to structured for the API if needed

2. **Compression**: Compress old logs to save space?
   - Could gzip files older than 24 hours
   - Trade-off: disk space vs CPU on access

3. **Indexing**: Build an index for fast log search?
   - Simple grep is probably fine initially
   - Could add full-text search later if needed

4. **Multi-machine**: How do logs work in distributed builds?
   - Each build machine has its own log directory
   - The central service aggregates via log shipping
   - Need to design this when we tackle distributed execution