# Job Run Logging Strategy

Claude-generated plan for logging.

## Philosophy

Job run logs are critical for debugging build failures and understanding system behavior. The logging system must:

1. **Be resource efficient** - Not consume unbounded memory in the service process
2. **Persist across restarts** - Logs survive service restarts/crashes
3. **Stream in real-time** - Enable live tailing for running jobs
4. **Support future log shipping** - Abstract design allows later integration with log aggregation systems
5. **Maintain data locality** - Keep logs on build machines where jobs execute

## File-Based Approach

### Directory Structure

```
/var/log/databuild/
  job_runs/
    {job_run_id}/
      stdout.log      # Job standard output
      stderr.log      # Job standard error
      metadata.json   # Job metadata (timestamps, exit code, building_partitions, etc.)
```

### Write Strategy

- **Streaming writes**: Job output is written to disk as it's produced (not buffered in memory)
- **Append-only**: Log files are append-only for simplicity and crash safety
- **Metadata on completion**: Write metadata.json when the job reaches a terminal state

### Rotation & Cleanup Policy

Two-pronged approach to prevent unbounded disk usage:

1. **Time-based TTL**: Delete logs older than N days (default: 7 days)
2. **Size-based cap**: If the total log directory exceeds M GB (default: 10 GB), delete the oldest logs first

Configuration via environment variables:

- `DATABUILD_LOG_TTL_DAYS` (default: 7)
- `DATABUILD_LOG_MAX_SIZE_GB` (default: 10)

## API Streaming

### HTTP Endpoints

```
GET /api/job_runs/{job_run_id}/logs/stdout
GET /api/job_runs/{job_run_id}/logs/stderr
```

### Streaming Protocol

Use **Server-Sent Events (SSE)** for real-time log streaming:

- Efficient for text streams (line-oriented)
- Native browser support (no WebSocket complexity)
- Automatic reconnection
- Works over HTTP/1.1 (no HTTP/2 requirement)

### Example Response

```
event: log
data: Building partition data/alpha...

event: log
data: [INFO] Reading dependencies from upstream

event: complete
data: {"exit_code": 0, "duration_ms": 1234}
```

### Query Parameters

- `?follow=true` - Keep the connection open and stream new lines as they're written (like `tail -f`)
- `?since=<line>` - Start from a specific line (for reconnection)
- `?lines=<n>` - Return the last N lines and close (for quick inspection)

## Abstraction Layer

Define a `LogStore` trait to enable future log shipping without changing core logic:

```rust
/// Abstraction for job run log storage and retrieval
pub trait LogStore: Send + Sync {
    /// Append a line to stdout for the given job run
    fn append_stdout(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Append a line to stderr for the given job run
    fn append_stderr(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Stream stdout lines from a job run
    fn stream_stdout(&self, job_run_id: &str, opts: StreamOptions) -> impl Stream<Item = Result<String, LogError>>;

    /// Stream stderr lines from a job run
    fn stream_stderr(&self, job_run_id: &str, opts: StreamOptions) -> impl Stream<Item = Result<String, LogError>>;

    /// Write job metadata on completion
    fn write_metadata(&mut self, job_run_id: &str, metadata: JobMetadata) -> Result<(), LogError>;
}

pub struct StreamOptions {
    pub follow: bool,             // Keep streaming new lines
    pub since_line: usize,        // Start from line N
    pub max_lines: Option<usize>, // Limit to N lines
}
```

### Initial Implementation: `FileLogStore`

```rust
pub struct FileLogStore {
    base_path: PathBuf, // e.g., /var/log/databuild/job_runs
}
```

Writes directly to `{base_path}/{job_run_id}/stdout.log`.
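A minimal sketch of the write path, assuming synchronous `std::fs` appends (the `append_line` helper is illustrative, not part of the trait); a real implementation would likely cache open file handles rather than reopening per line:

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::PathBuf;

pub struct FileLogStore {
    base_path: PathBuf, // as defined above
}

impl FileLogStore {
    /// Append one line to the named log file, creating the
    /// job run directory and file on first write.
    fn append_line(&self, job_run_id: &str, file: &str, line: &str) -> std::io::Result<()> {
        let dir = self.base_path.join(job_run_id);
        fs::create_dir_all(&dir)?;
        let mut f = OpenOptions::new()
            .create(true)
            .append(true) // append-only, per the write strategy above
            .open(dir.join(file))?;
        writeln!(f, "{line}")
    }
}
```

`append_stdout` and `append_stderr` could then delegate to this helper with `"stdout.log"` / `"stderr.log"`, mapping `std::io::Error` into `LogError`.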
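The rotation policy above can run as a periodic sweep over `base_path`. A sketch of the TTL half (the `cleanup_expired` name and `ttl` parameter are illustrative):

```rust
use std::fs;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Delete job run log directories not modified within `ttl`.
/// Illustrative only: a full sweep would also enforce the size cap
/// by sorting runs oldest-first and deleting until under the limit.
fn cleanup_expired(base_path: &Path, ttl: Duration) -> std::io::Result<()> {
    let cutoff = SystemTime::now() - ttl;
    for entry in fs::read_dir(base_path)? {
        let dir = entry?.path();
        if !dir.is_dir() {
            continue;
        }
        // Use the directory's modification time as a proxy for the
        // last write to any log file inside it.
        let modified = fs::metadata(&dir)?.modified()?;
        if modified < cutoff {
            fs::remove_dir_all(&dir)?;
        }
    }
    Ok(())
}
```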
### Future Implementations

- **`ShippingLogStore`**: Wraps `FileLogStore`, ships logs to S3/GCS/CloudWatch in the background
- **`CompositeLogStore`**: Writes to multiple stores (local + remote)
- **`BufferedLogStore`**: Batches writes for efficiency

## Integration with Job Runner

The `SubProcessBackend` (in `job_run.rs`) currently buffers stdout in memory. This needs updating:

### Current (in-memory buffering):

```rust
pub struct SubProcessRunning {
    pub process: Child,
    pub stdout_buffer: Vec<u8>, // ❌ Unbounded memory
}
```

### Proposed (streaming to disk):

```rust
pub struct SubProcessRunning {
    pub process: Child,
    pub log_store: Arc<Mutex<dyn LogStore>>,
    pub job_run_id: String,
}
```

When polling the job:

1. Read available stdout/stderr from the process
2. Write each line via `log_store.append_stdout(job_run_id, line)`
3. Parse for special lines (e.g., `DATABUILD_MISSING_DEPS_JSON:...`)
4. Don't keep the full log in memory

## CLI Integration

The CLI should support log streaming:

```bash
# Stream logs for a running or completed job
databuild logs <job_run_id>

# Follow mode (tail -f)
databuild logs <job_run_id> --follow

# Show last N lines
databuild logs <job_run_id> --tail 100

# Show stderr instead of stdout
databuild logs <job_run_id> --stderr
```

Under the hood, this hits the `/api/job_runs/{id}/logs/stdout?follow=true` endpoint.

## Web App Integration

The web app can use the native EventSource API:

```javascript
const eventSource = new EventSource(`/api/job_runs/${jobId}/logs/stdout?follow=true`);

eventSource.addEventListener('log', (event) => {
  appendToTerminal(event.data);
});

eventSource.addEventListener('complete', (event) => {
  const metadata = JSON.parse(event.data);
  showExitCode(metadata.exit_code);
  eventSource.close();
});
```

## Future: Log Shipping

When adding log shipping (e.g., to S3, CloudWatch, Datadog):

1. Create a `ShippingLogStore` implementation
2. Run a background task that:
   - Watches for completed jobs
   - Batches log lines
   - Ships to the configured destination
   - Deletes local files after successful upload (if configured)
3. Configure via:

```bash
export DATABUILD_LOG_SHIP_DEST=s3://my-bucket/databuild-logs
export DATABUILD_LOG_KEEP_LOCAL=false  # Delete after ship
```

The `LogStore` trait means the core system doesn't change - just swap implementations.

## Open Questions

1. **Log format**: Plain text vs structured (JSON lines)?
   - Plain text is more human-readable
   - Structured is easier to search/analyze
   - Suggestion: Plain text in files, parse to structured for the API if needed (see the sketch below)
2. **Compression**: Compress old logs to save space?
   - Could gzip files older than 24 hours
   - Trade-off: disk space vs CPU on access
3. **Indexing**: Build an index for fast log search?
   - Simple grep is probably fine initially
   - Could add full-text search later if needed
4. **Multi-machine**: How do logs work in distributed builds?
   - Each build machine has its own log directory
   - Central service aggregates via log shipping
   - Need to design this when we tackle distributed execution
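On open question 1, one lightweight option is to keep plain text on disk and wrap lines into structured JSON only at the API boundary. A sketch assuming `serde`/`serde_json`; the `LogEvent` shape and `to_sse_data` helper are illustrative, not part of the plan above:

```rust
use serde::Serialize;

/// Illustrative structured form of one plain-text log line,
/// produced at the API layer rather than stored on disk.
#[derive(Serialize)]
struct LogEvent<'a> {
    job_run_id: &'a str,
    line_no: usize,
    text: &'a str,
}

/// Wrap a raw log line into the JSON payload for one SSE `log` event.
fn to_sse_data(job_run_id: &str, line_no: usize, text: &str) -> serde_json::Result<String> {
    serde_json::to_string(&LogEvent { job_run_id, line_no, text })
}
```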