add claude logging notes
Author: Stuart Axelbrooke, 2025-11-22 20:44:40 +08:00
Parent: cf163b294d
Commit: 2084fadbb6
New file: docs/ideas/logging.md (+231 lines)

# Job Run Logging Strategy
Claude-generated plan for logging.
## Philosophy
Job run logs are critical for debugging build failures and understanding system behavior. The logging system must:
1. **Be resource efficient** - Not consume unbounded memory in the service process
2. **Persist across restarts** - Logs survive service restarts/crashes
3. **Stream in real-time** - Enable live tailing for running jobs
4. **Support future log shipping** - Abstract design allows later integration with log aggregation systems
5. **Maintain data locality** - Keep logs on build machines where jobs execute
## File-Based Approach
### Directory Structure
```
/var/log/databuild/
job_runs/
{job_run_id}/
stdout.log # Job standard output
stderr.log # Job standard error
metadata.json # Job metadata (timestamps, exit code, building_partitions, etc.)
```
### Write Strategy
- **Streaming writes**: Job output written to disk as it's produced (not buffered in memory)
- **Append-only**: Log files are append-only for simplicity and crash safety
- **Metadata on completion**: Write `metadata.json` when the job reaches a terminal state (a possible shape is sketched below)
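One possible shape for the completion metadata, covering the fields mentioned in this doc (timestamps, exit code, building partitions, duration). Field names and types here are illustrative, not decided, and serde is assumed for writing the JSON:

```rust
use serde::{Deserialize, Serialize};

/// Illustrative shape for metadata.json; written once when the job run
/// reaches a terminal state. Names are placeholders, not a final schema.
#[derive(Serialize, Deserialize)]
pub struct JobMetadata {
    pub job_run_id: String,
    pub started_at: String,               // RFC 3339 timestamp
    pub finished_at: String,              // RFC 3339 timestamp
    pub exit_code: Option<i32>,           // None if the process was killed
    pub duration_ms: u64,
    pub building_partitions: Vec<String>,
}
```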
### Rotation & Cleanup Policy
A two-pronged approach prevents unbounded disk usage (a cleanup sketch follows the configuration below):
1. **Time-based TTL**: Delete logs older than N days (default: 7 days)
2. **Size-based cap**: If total log directory exceeds M GB (default: 10 GB), delete oldest logs first
Configuration via environment variables:
- `DATABUILD_LOG_TTL_DAYS` (default: 7)
- `DATABUILD_LOG_MAX_SIZE_GB` (default: 10)
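A hedged sketch of one cleanup pass using only the standard library; how it is scheduled (e.g., a periodic background task in the service) and how the environment variables are read are left out:

```rust
use std::fs;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Sum the sizes of the files directly inside a job run directory
/// (stdout.log, stderr.log, metadata.json); the layout is flat, so no
/// recursion is needed.
fn dir_size(dir: &Path) -> std::io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        total += entry?.metadata()?.len();
    }
    Ok(total)
}

/// One cleanup pass: delete job run directories older than the TTL, then
/// keep deleting the oldest remaining ones until the size cap is respected.
fn cleanup_logs(base: &Path, ttl_days: u64, max_size_bytes: u64) -> std::io::Result<()> {
    let cutoff = SystemTime::now() - Duration::from_secs(ttl_days * 24 * 60 * 60);
    let mut runs: Vec<(SystemTime, u64, std::path::PathBuf)> = Vec::new();
    for entry in fs::read_dir(base)? {
        let path = entry?.path();
        if !path.is_dir() {
            continue;
        }
        let modified = fs::metadata(&path)?.modified()?;
        runs.push((modified, dir_size(&path)?, path));
    }
    // Time-based TTL pass.
    runs.retain(|(modified, _, path)| {
        if *modified < cutoff {
            let _ = fs::remove_dir_all(path);
            false
        } else {
            true
        }
    });
    // Size-based cap pass: delete oldest first until under the cap.
    runs.sort_by_key(|(modified, _, _)| *modified);
    let mut total: u64 = runs.iter().map(|(_, size, _)| *size).sum();
    for (_, size, path) in runs {
        if total <= max_size_bytes {
            break;
        }
        let _ = fs::remove_dir_all(&path);
        total -= size;
    }
    Ok(())
}
```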
## API Streaming
### HTTP Endpoints
```
GET /api/job_runs/{job_run_id}/logs/stdout
GET /api/job_runs/{job_run_id}/logs/stderr
```
### Streaming Protocol
Use **Server-Sent Events (SSE)** for real-time log streaming:
- Efficient for text streams (line-oriented)
- Native browser support (no WebSocket complexity)
- Automatic reconnection
- Works through HTTP/1.1 (no HTTP/2 requirement)
### Example Response
```
event: log
data: Building partition data/alpha...

event: log
data: [INFO] Reading dependencies from upstream

event: complete
data: {"exit_code": 0, "duration_ms": 1234}
```
### Query Parameters
- `?follow=true` - Keep connection open, stream new lines as they're written (like `tail -f`)
- `?since=<line_number>` - Start from specific line (for reconnection)
- `?lines=<N>` - Return last N lines and close (for quick inspection)
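A hedged sketch of the stdout endpoint, assuming the service uses axum (the web framework is not named in this doc), that application state carries an `Arc<dyn LogStore>`, and that `LogError` implements `Debug`; it builds on the `LogStore` trait defined in the next section:

```rust
use std::{convert::Infallible, sync::Arc};

use axum::{
    extract::{Path, Query, State},
    response::sse::{Event, KeepAlive, Sse},
};
use futures::{Stream, StreamExt};
use serde::Deserialize;

/// Query parameters mirroring ?follow, ?since, and ?lines.
#[derive(Deserialize)]
struct LogQuery {
    #[serde(default)]
    follow: bool,
    #[serde(default)]
    since: usize,
    lines: Option<usize>,
}

async fn stdout_logs(
    Path(job_run_id): Path<String>,
    Query(q): Query<LogQuery>,
    State(store): State<Arc<dyn LogStore>>,
) -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    let opts = StreamOptions {
        follow: q.follow,
        since_line: q.since,
        max_lines: q.lines,
    };
    // Each log line becomes an SSE `log` event; errors become `error` events
    // so the client can surface them without dropping the connection.
    let events = store.stream_stdout(&job_run_id, opts).map(|line| {
        Ok(match line {
            Ok(text) => Event::default().event("log").data(text),
            Err(err) => Event::default().event("error").data(format!("{err:?}")),
        })
    });
    Sse::new(events).keep_alive(KeepAlive::default())
}
```

Emitting the final `complete` event with exit code and duration would require the stream to signal job completion (e.g., a terminating marker from the store), which this sketch glosses over.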
## Abstraction Layer
Define `LogStore` trait to enable future log shipping without changing core logic:
```rust
use std::pin::Pin;
use futures::Stream;

/// Boxed stream alias so the trait stays object-safe and can be used as
/// `dyn LogStore` (see `Arc<Mutex<dyn LogStore>>` below).
pub type LogStream = Pin<Box<dyn Stream<Item = Result<String, LogError>> + Send + 'static>>;

/// Abstraction for job run log storage and retrieval
pub trait LogStore: Send + Sync {
    /// Append a line to stdout for the given job run
    fn append_stdout(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Append a line to stderr for the given job run
    fn append_stderr(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Stream stdout lines from a job run
    fn stream_stdout(&self, job_run_id: &str, opts: StreamOptions) -> LogStream;

    /// Stream stderr lines from a job run
    fn stream_stderr(&self, job_run_id: &str, opts: StreamOptions) -> LogStream;

    /// Write job metadata on completion
    fn write_metadata(&mut self, job_run_id: &str, metadata: JobMetadata) -> Result<(), LogError>;
}

pub struct StreamOptions {
    pub follow: bool,             // Keep streaming new lines
    pub since_line: usize,        // Start from line N
    pub max_lines: Option<usize>, // Limit to N lines
}
```
### Initial Implementation: `FileLogStore`
```rust
pub struct FileLogStore {
base_path: PathBuf, // e.g., /var/log/databuild/job_runs
}
```
Writes directly to `{base_path}/{job_run_id}/stdout.log`.
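A minimal sketch of the append path; the `LogStore` append methods would delegate to something like this and map `std::io::Error` into `LogError`:

```rust
use std::fs::{create_dir_all, OpenOptions};
use std::io::Write;

impl FileLogStore {
    /// Append one line to the named log file (stdout.log or stderr.log) for
    /// this job run, creating the directory and file on first write.
    fn append_line(&self, job_run_id: &str, file_name: &str, line: &str) -> std::io::Result<()> {
        let dir = self.base_path.join(job_run_id);
        create_dir_all(&dir)?;
        let mut file = OpenOptions::new()
            .create(true)
            .append(true)
            .open(dir.join(file_name))?;
        // Opening per line keeps the sketch simple; a production version
        // would likely cache an open handle per running job.
        writeln!(file, "{line}")?;
        Ok(())
    }
}
```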
### Future Implementations
- **`ShippingLogStore`**: Wraps `FileLogStore`, ships logs to S3/GCS/CloudWatch in background
- **`CompositeLogStore`**: Writes to multiple stores (local + remote)
- **`BufferedLogStore`**: Batches writes for efficiency
## Integration with Job Runner
The `SubProcessBackend` (in `job_run.rs`) currently buffers stdout in memory. This needs updating:
### Current (in-memory buffering):
```rust
pub struct SubProcessRunning {
pub process: Child,
pub stdout_buffer: Vec<String>, // ❌ Unbounded memory
}
```
### Proposed (streaming to disk):
```rust
pub struct SubProcessRunning {
pub process: Child,
pub log_store: Arc<Mutex<dyn LogStore>>,
pub job_run_id: String,
}
```
When polling the job (sketched below):
1. Read available stdout/stderr from process
2. Write each line to `log_store.append_stdout(job_run_id, line)`
3. Parse for special lines (e.g., `DATABUILD_MISSING_DEPS_JSON:...`)
4. Don't keep full log in memory
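A hedged sketch of the per-line handling that the poll loop would call for each drained line; `handle_missing_deps` is a hypothetical stand-in for the existing special-line parsing:

```rust
use std::sync::{Arc, Mutex};

/// Handle one line drained from the child's stdout during a poll: parse any
/// control lines, then persist the line instead of pushing it onto an
/// in-memory buffer.
fn on_stdout_line(
    log_store: &Arc<Mutex<dyn LogStore>>,
    job_run_id: &str,
    line: &str,
) -> Result<(), LogError> {
    if let Some(json) = line.strip_prefix("DATABUILD_MISSING_DEPS_JSON:") {
        // Hypothetical hook standing in for the existing parsing logic.
        handle_missing_deps(json);
    }
    log_store
        .lock()
        .expect("log store mutex poisoned")
        .append_stdout(job_run_id, line)
}
```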
## CLI Integration
The CLI should support log streaming:
```bash
# Stream logs for a running or completed job
databuild logs <job_run_id>
# Follow mode (tail -f)
databuild logs <job_run_id> --follow
# Show last N lines
databuild logs <job_run_id> --tail 100
# Show stderr instead of stdout
databuild logs <job_run_id> --stderr
```
Under the hood, this hits the `/api/job_runs/{id}/logs/stdout?follow=true` endpoint.
## Web App Integration
The web app can use native EventSource API:
```javascript
const eventSource = new EventSource(`/api/job_runs/${jobId}/logs/stdout?follow=true`);
eventSource.addEventListener('log', (event) => {
appendToTerminal(event.data);
});
eventSource.addEventListener('complete', (event) => {
const metadata = JSON.parse(event.data);
showExitCode(metadata.exit_code);
eventSource.close();
});
```
## Future: Log Shipping
When adding log shipping (e.g., to S3, CloudWatch, Datadog):
1. Create a `ShippingLogStore` implementation
2. Run background task that:
- Watches for completed jobs
- Batches log lines
- Ships to configured destination
- Deletes local files after successful upload (if configured)
3. Configure via:
```bash
export DATABUILD_LOG_SHIP_DEST=s3://my-bucket/databuild-logs
export DATABUILD_LOG_KEEP_LOCAL=false # Delete after ship
```
The `LogStore` trait means the core system doesn't change; we just swap in a new implementation.
## Open Questions
1. **Log format**: Plain text vs structured (JSON lines)?
- Plain text is more human-readable
- Structured is easier to search/analyze
- Suggestion: Plain text in files, parse to structured for API if needed
2. **Compression**: Compress old logs to save space?
- Could gzip files older than 24 hours
- Trade-off: disk space vs CPU on access
3. **Indexing**: Build an index for fast log search?
- Simple grep is probably fine initially
- Could add full-text search later if needed
4. **Multi-machine**: How do logs work in distributed builds?
- Each build machine has its own log directory
- Central service aggregates via log shipping
- Need to design this when we tackle distributed execution