Commit 2084fadbb6 (parent cf163b294d)
1 changed file with 231 additions and 0 deletions
docs/ideas/logging.md (new file)
# Job Run Logging Strategy
Claude-generated plan for logging.

## Philosophy

Job run logs are critical for debugging build failures and understanding system behavior. The logging system must:

1. **Be resource efficient** - Not consume unbounded memory in the service process
2. **Persist across restarts** - Logs survive service restarts/crashes
3. **Stream in real-time** - Enable live tailing for running jobs
4. **Support future log shipping** - Abstract design allows later integration with log aggregation systems
5. **Maintain data locality** - Keep logs on build machines where jobs execute
## File-Based Approach

### Directory Structure

```
/var/log/databuild/
  job_runs/
    {job_run_id}/
      stdout.log      # Job standard output
      stderr.log      # Job standard error
      metadata.json   # Job metadata (timestamps, exit code, building_partitions, etc.)
```
### Write Strategy

- **Streaming writes**: Job output written to disk as it's produced (not buffered in memory)
- **Append-only**: Log files are append-only for simplicity and crash safety
- **Metadata on completion**: Write metadata.json when job reaches terminal state
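For concreteness, one possible shape for metadata.json, expressed as a serde-serializable struct. The struct name matches the `JobMetadata` parameter used by the `LogStore` trait below, and only the fields this plan mentions (timestamps, exit code, `building_partitions`, plus `duration_ms`/`exit_code` from the SSE `complete` event later on) are grounded in the doc; the exact field names and types are assumptions.

```rust
use serde::Serialize;

/// Hypothetical shape of metadata.json, written once the job reaches a terminal state.
#[derive(Serialize)]
pub struct JobMetadata {
    pub job_run_id: String,
    pub started_at: String,               // e.g. RFC 3339 timestamp
    pub finished_at: String,              // e.g. RFC 3339 timestamp
    pub exit_code: i32,
    pub duration_ms: u64,
    pub building_partitions: Vec<String>,
}

// Writing it could be as simple as:
// fs::write(run_dir.join("metadata.json"), serde_json::to_vec_pretty(&metadata)?)?;
```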
### Rotation & Cleanup Policy

Two-pronged approach to prevent unbounded disk usage:

1. **Time-based TTL**: Delete logs older than N days (default: 7 days)
2. **Size-based cap**: If total log directory exceeds M GB (default: 10 GB), delete oldest logs first

Configuration via environment variables:
- `DATABUILD_LOG_TTL_DAYS` (default: 7)
- `DATABUILD_LOG_MAX_SIZE_GB` (default: 10)
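To make the policy concrete, a minimal sketch of a cleanup pass the service could run periodically. The TTL-then-size ordering and the directory layout come from this doc; the `cleanup_logs` name, using the newest file mtime as a run's age, and the error-handling choices are assumptions. Reading the two environment variables and converting them to a `Duration` and a byte count would happen in the caller.

```rust
use std::{
    fs, io,
    path::{Path, PathBuf},
    time::{Duration, SystemTime},
};

/// Hypothetical cleanup pass over /var/log/databuild/job_runs:
/// apply the time-based TTL first, then the size-based cap.
fn cleanup_logs(base: &Path, ttl: Duration, max_bytes: u64) -> io::Result<()> {
    // Gather (newest mtime, total size, path) for each job_run directory.
    let mut runs: Vec<(SystemTime, u64, PathBuf)> = Vec::new();
    for entry in fs::read_dir(base)? {
        let path = entry?.path();
        if !path.is_dir() {
            continue;
        }
        let (mut size, mut newest) = (0u64, SystemTime::UNIX_EPOCH);
        for file in fs::read_dir(&path)? {
            let meta = file?.metadata()?;
            size += meta.len();
            newest = newest.max(meta.modified()?);
        }
        runs.push((newest, size, path));
    }

    // 1. Time-based TTL: remove runs whose newest file is older than the TTL.
    let now = SystemTime::now();
    runs.retain(|(modified, _, path)| {
        let expired = now
            .duration_since(*modified)
            .map(|age| age > ttl)
            .unwrap_or(false);
        if expired {
            let _ = fs::remove_dir_all(path);
        }
        !expired
    });

    // 2. Size-based cap: delete the oldest remaining runs until under the limit.
    runs.sort_by_key(|(modified, _, _)| *modified); // oldest first
    let mut total: u64 = runs.iter().map(|(_, size, _)| *size).sum();
    for (_, size, path) in &runs {
        if total <= max_bytes {
            break;
        }
        fs::remove_dir_all(path)?;
        total -= *size;
    }
    Ok(())
}
```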
## API Streaming

### HTTP Endpoints

```
GET /api/job_runs/{job_run_id}/logs/stdout
GET /api/job_runs/{job_run_id}/logs/stderr
```

### Streaming Protocol

Use **Server-Sent Events (SSE)** for real-time log streaming:

- Efficient for text streams (line-oriented)
- Native browser support (no WebSocket complexity)
- Automatic reconnection
- Works through HTTP/1.1 (no HTTP/2 requirement)

### Example Response

```
event: log
data: Building partition data/alpha...

event: log
data: [INFO] Reading dependencies from upstream

event: complete
data: {"exit_code": 0, "duration_ms": 1234}
```

### Query Parameters

- `?follow=true` - Keep connection open, stream new lines as they're written (like `tail -f`)
- `?since=<line_number>` - Start from specific line (for reconnection)
- `?lines=<N>` - Return last N lines and close (for quick inspection)
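As a sketch of how the server side could produce this stream, here is roughly what the stdout endpoint might look like with axum's SSE support. The framework choice is not specified in this doc and is an assumption, as are the `LogQuery` struct, the `stdout_logs` handler name, and the `demo_lines` placeholder; in the real service the lines would come from `LogStore::stream_stdout` (defined in the next section).

```rust
use std::convert::Infallible;

use axum::{
    extract::{Path, Query},
    response::sse::{Event, KeepAlive, Sse},
};
use futures::{stream, Stream, StreamExt};
use serde::Deserialize;

/// Hypothetical mirror of the query parameters above.
#[derive(Deserialize)]
struct LogQuery {
    #[serde(default)]
    follow: bool,
    since: Option<usize>,
    lines: Option<usize>,
}

/// GET /api/job_runs/{job_run_id}/logs/stdout
async fn stdout_logs(
    Path(job_run_id): Path<String>,
    Query(_opts): Query<LogQuery>,
) -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    // Placeholder source; a real handler would stream lines from the LogStore.
    let lines = demo_lines(job_run_id);
    let events = lines.map(|line| Ok(Event::default().event("log").data(line)));
    Sse::new(events).keep_alive(KeepAlive::default())
}

/// Stand-in for LogStore::stream_stdout.
fn demo_lines(job_run_id: String) -> impl Stream<Item = String> {
    stream::iter(vec![format!("Building logs for {job_run_id}...")])
}
```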
## Abstraction Layer

Define a `LogStore` trait to enable future log shipping without changing core logic:

```rust
use std::pin::Pin;

use futures::Stream;

/// Abstraction for job run log storage and retrieval.
/// (`LogError` and `JobMetadata` are defined elsewhere.)
pub trait LogStore: Send + Sync {
    /// Append a line to stdout for the given job run
    fn append_stdout(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Append a line to stderr for the given job run
    fn append_stderr(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Stream stdout lines from a job run.
    /// Boxed rather than `impl Trait` so the trait stays object-safe;
    /// it is used as `Arc<Mutex<dyn LogStore>>` further down.
    fn stream_stdout(&self, job_run_id: &str, opts: StreamOptions)
        -> Pin<Box<dyn Stream<Item = Result<String, LogError>> + Send + '_>>;

    /// Stream stderr lines from a job run
    fn stream_stderr(&self, job_run_id: &str, opts: StreamOptions)
        -> Pin<Box<dyn Stream<Item = Result<String, LogError>> + Send + '_>>;

    /// Write job metadata on completion
    fn write_metadata(&mut self, job_run_id: &str, metadata: JobMetadata)
        -> Result<(), LogError>;
}

pub struct StreamOptions {
    pub follow: bool,             // Keep streaming new lines
    pub since_line: usize,        // Start from line N
    pub max_lines: Option<usize>, // Limit to N lines
}
```
### Initial Implementation: `FileLogStore`

```rust
pub struct FileLogStore {
    base_path: PathBuf, // e.g., /var/log/databuild/job_runs
}
```

Writes directly to `{base_path}/{job_run_id}/stdout.log`.
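A minimal sketch of the append path, repeating the struct so the snippet stands alone. The `append_line` helper, the lazy directory creation, and the use of `io::Result` instead of the trait's `LogError` are simplifications, not the actual implementation.

```rust
use std::{
    fs::{self, OpenOptions},
    io::{self, Write},
    path::PathBuf,
};

pub struct FileLogStore {
    base_path: PathBuf, // e.g., /var/log/databuild/job_runs
}

impl FileLogStore {
    /// Append one line to {base_path}/{job_run_id}/{file}.
    /// (append_stderr would differ only in the file name.)
    fn append_line(&self, job_run_id: &str, file: &str, line: &str) -> io::Result<()> {
        let dir = self.base_path.join(job_run_id);
        fs::create_dir_all(&dir)?; // lazily create the per-run directory
        let mut f = OpenOptions::new()
            .create(true)
            .append(true) // append-only, per the write strategy above
            .open(dir.join(file))?;
        writeln!(f, "{line}")?;
        Ok(())
    }

    pub fn append_stdout(&self, job_run_id: &str, line: &str) -> io::Result<()> {
        self.append_line(job_run_id, "stdout.log", line)
    }
}
```

Opening the file for every line keeps the sketch simple; a cached file handle, or the `BufferedLogStore` mentioned below, would batch writes instead.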
### Future Implementations

- **`ShippingLogStore`**: Wraps `FileLogStore`, ships logs to S3/GCS/CloudWatch in background
- **`CompositeLogStore`**: Writes to multiple stores (local + remote)
- **`BufferedLogStore`**: Batches writes for efficiency
## Integration with Job Runner

The `SubProcessBackend` (in `job_run.rs`) currently buffers stdout in memory. This needs updating:

### Current (in-memory buffering):
```rust
pub struct SubProcessRunning {
    pub process: Child,
    pub stdout_buffer: Vec<String>, // ❌ Unbounded memory
}
```

### Proposed (streaming to disk):
```rust
pub struct SubProcessRunning {
    pub process: Child,
    pub log_store: Arc<Mutex<dyn LogStore>>,
    pub job_run_id: String,
}
```

When polling the job:
1. Read available stdout/stderr from process
2. Write each line to `log_store.append_stdout(job_run_id, line)`
3. Parse for special lines (e.g., `DATABUILD_MISSING_DEPS_JSON:...`)
4. Don't keep full log in memory
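To tie the pieces together, a rough sketch of that loop, assuming the `LogStore` trait from the Abstraction Layer section is available. The `drain_stdout` name, the `handle_missing_deps` stub, and reading the pipe to EOF instead of truly polling are all simplifications; this is not the actual `SubProcessBackend` code.

```rust
use std::{
    io::{BufRead, BufReader},
    process::Child,
    sync::{Arc, Mutex},
};

/// Placeholder for the existing special-line handling in job_run.rs.
fn handle_missing_deps(_json: &str) {}

/// Hypothetical helper illustrating steps 1-4 above. For brevity it reads the
/// pipe to EOF rather than polling; the child must have piped stdout.
fn drain_stdout(child: &mut Child, log_store: &Arc<Mutex<dyn LogStore>>, job_run_id: &str) {
    let stdout = child.stdout.take().expect("stdout must be piped");
    for line in BufReader::new(stdout).lines().flatten() {
        // 3. Special control lines are parsed, not just logged.
        if let Some(json) = line.strip_prefix("DATABUILD_MISSING_DEPS_JSON:") {
            handle_missing_deps(json);
        }
        // 2. Every line goes to the log store instead of an in-memory buffer
        //    (error handling elided for the sketch).
        let _ = log_store
            .lock()
            .expect("log store mutex poisoned")
            .append_stdout(job_run_id, &line);
        // 4. `line` is dropped at the end of each iteration; nothing accumulates.
    }
}
```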
## CLI Integration

The CLI should support log streaming:

```bash
# Stream logs for a running or completed job
databuild logs <job_run_id>

# Follow mode (tail -f)
databuild logs <job_run_id> --follow

# Show last N lines
databuild logs <job_run_id> --tail 100

# Show stderr instead of stdout
databuild logs <job_run_id> --stderr
```

Under the hood, this hits the `/api/job_runs/{id}/logs/stdout?follow=true` endpoint.
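A rough sketch of how the `--follow` path could consume that endpoint, assuming a blocking `reqwest` client (this doc doesn't pick an HTTP client, and the "blocking" feature would be needed). The `follow_logs` name and the line-by-line SSE handling are illustrative only; a fuller client would also watch for the `complete` event.

```rust
use std::io::{BufRead, BufReader};

/// Hypothetical implementation of `databuild logs <id> --follow`:
/// read the SSE stream and print each `data:` payload as it arrives.
fn follow_logs(base_url: &str, job_run_id: &str) -> Result<(), Box<dyn std::error::Error>> {
    let url = format!("{base_url}/api/job_runs/{job_run_id}/logs/stdout?follow=true");
    let resp = reqwest::blocking::get(url)?;
    for line in BufReader::new(resp).lines() {
        let line = line?;
        // SSE frames are "event: ..." / "data: ..." pairs separated by blank lines;
        // for plain tailing we only care about the data payloads.
        if let Some(payload) = line.strip_prefix("data: ") {
            println!("{payload}");
        }
    }
    Ok(())
}
```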
## Web App Integration

The web app can use native EventSource API:

```javascript
const eventSource = new EventSource(`/api/job_runs/${jobId}/logs/stdout?follow=true`);

eventSource.addEventListener('log', (event) => {
  appendToTerminal(event.data);
});

eventSource.addEventListener('complete', (event) => {
  const metadata = JSON.parse(event.data);
  showExitCode(metadata.exit_code);
  eventSource.close();
});
```
## Future: Log Shipping

When adding log shipping (e.g., to S3, CloudWatch, Datadog):

1. Create a `ShippingLogStore` implementation
2. Run background task that:
   - Watches for completed jobs
   - Batches log lines
   - Ships to configured destination
   - Deletes local files after successful upload (if configured)
3. Configure via:
   ```bash
   export DATABUILD_LOG_SHIP_DEST=s3://my-bucket/databuild-logs
   export DATABUILD_LOG_KEEP_LOCAL=false  # Delete after ship
   ```

The `LogStore` trait means the core system doesn't change - just swap implementations.
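If it helps to picture the background task, a very rough sketch is below. `ship_directory` is a placeholder for the real S3/GCS/CloudWatch upload, the function names are made up, and treating the presence of metadata.json as the completion signal is an assumption based on the write strategy above.

```rust
use std::{
    fs,
    path::{Path, PathBuf},
    time::Duration,
};

/// Hypothetical background shipper: upload completed runs, optionally delete local copies.
fn shipping_loop(base: &Path, dest: &str, keep_local: bool) {
    loop {
        for run_dir in completed_run_dirs(base) {
            if ship_directory(&run_dir, dest).is_ok() && !keep_local {
                let _ = fs::remove_dir_all(&run_dir);
            }
        }
        std::thread::sleep(Duration::from_secs(60));
    }
}

/// A run counts as "completed" once metadata.json exists (written at terminal state).
fn completed_run_dirs(base: &Path) -> Vec<PathBuf> {
    fs::read_dir(base)
        .into_iter()
        .flatten()
        .flatten()
        .map(|entry| entry.path())
        .filter(|path| path.join("metadata.json").exists())
        .collect()
}

/// Placeholder for the actual upload (S3/GCS/CloudWatch client call).
fn ship_directory(_run_dir: &Path, _dest: &str) -> std::io::Result<()> {
    Ok(())
}
```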
## Open Questions

1. **Log format**: Plain text vs structured (JSON lines)?
   - Plain text is more human-readable
   - Structured is easier to search/analyze
   - Suggestion: Plain text in files, parse to structured for API if needed

2. **Compression**: Compress old logs to save space?
   - Could gzip files older than 24 hours
   - Trade-off: disk space vs CPU on access

3. **Indexing**: Build an index for fast log search?
   - Simple grep is probably fine initially
   - Could add full-text search later if needed

4. **Multi-machine**: How do logs work in distributed builds?
   - Each build machine has its own log directory
   - Central service aggregates via log shipping
   - Need to design this when we tackle distributed execution