add claude logging notes
Author: Stuart Axelbrooke, 2025-11-22 20:44:40 +08:00
Parent: cf163b294d
Commit: 2084fadbb6
New file: docs/ideas/logging.md (+231 lines)

# Job Run Logging Strategy
Claude-generated plan for logging.
## Philosophy
Job run logs are critical for debugging build failures and understanding system behavior. The logging system must:
1. **Be resource efficient** - Not consume unbounded memory in the service process
2. **Persist across restarts** - Logs survive service restarts/crashes
3. **Stream in real-time** - Enable live tailing for running jobs
4. **Support future log shipping** - Abstract design allows later integration with log aggregation systems
5. **Maintain data locality** - Keep logs on build machines where jobs execute
## File-Based Approach
### Directory Structure
```
/var/log/databuild/
job_runs/
{job_run_id}/
stdout.log # Job standard output
stderr.log # Job standard error
metadata.json # Job metadata (timestamps, exit code, building_partitions, etc.)
```
### Write Strategy
- **Streaming writes**: Job output written to disk as it's produced (not buffered in memory)
- **Append-only**: Log files are append-only for simplicity and crash safety
- **Metadata on completion**: Write `metadata.json` when the job reaches a terminal state (a possible shape is sketched below)
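One possible shape for the completion metadata, covering the fields mentioned in this doc (timestamps, exit code, building partitions, duration). Field names and types here are illustrative, not decided, and serde is assumed for writing the JSON:

```rust
use serde::{Deserialize, Serialize};

/// Illustrative shape for metadata.json; written once when the job run
/// reaches a terminal state. Names are placeholders, not a final schema.
#[derive(Serialize, Deserialize)]
pub struct JobMetadata {
    pub job_run_id: String,
    pub started_at: String,               // RFC 3339 timestamp
    pub finished_at: String,              // RFC 3339 timestamp
    pub exit_code: Option<i32>,           // None if the process was killed
    pub duration_ms: u64,
    pub building_partitions: Vec<String>,
}
```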
### Rotation & Cleanup Policy
A two-pronged approach prevents unbounded disk usage (a cleanup sketch follows the configuration below):
1. **Time-based TTL**: Delete logs older than N days (default: 7 days)
2. **Size-based cap**: If total log directory exceeds M GB (default: 10 GB), delete oldest logs first
Configuration via environment variables:
- `DATABUILD_LOG_TTL_DAYS` (default: 7)
- `DATABUILD_LOG_MAX_SIZE_GB` (default: 10)
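A hedged sketch of one cleanup pass using only the standard library; how it is scheduled (e.g., a periodic background task in the service) and how the environment variables are read are left out:

```rust
use std::fs;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Sum the sizes of the files directly inside a job run directory
/// (stdout.log, stderr.log, metadata.json); the layout is flat, so no
/// recursion is needed.
fn dir_size(dir: &Path) -> std::io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        total += entry?.metadata()?.len();
    }
    Ok(total)
}

/// One cleanup pass: delete job run directories older than the TTL, then
/// keep deleting the oldest remaining ones until the size cap is respected.
fn cleanup_logs(base: &Path, ttl_days: u64, max_size_bytes: u64) -> std::io::Result<()> {
    let cutoff = SystemTime::now() - Duration::from_secs(ttl_days * 24 * 60 * 60);
    let mut runs: Vec<(SystemTime, u64, std::path::PathBuf)> = Vec::new();
    for entry in fs::read_dir(base)? {
        let path = entry?.path();
        if !path.is_dir() {
            continue;
        }
        let modified = fs::metadata(&path)?.modified()?;
        runs.push((modified, dir_size(&path)?, path));
    }
    // Time-based TTL pass.
    runs.retain(|(modified, _, path)| {
        if *modified < cutoff {
            let _ = fs::remove_dir_all(path);
            false
        } else {
            true
        }
    });
    // Size-based cap pass: delete oldest first until under the cap.
    runs.sort_by_key(|(modified, _, _)| *modified);
    let mut total: u64 = runs.iter().map(|(_, size, _)| *size).sum();
    for (_, size, path) in runs {
        if total <= max_size_bytes {
            break;
        }
        let _ = fs::remove_dir_all(&path);
        total -= size;
    }
    Ok(())
}
```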
## API Streaming
### HTTP Endpoints
```
GET /api/job_runs/{job_run_id}/logs/stdout
GET /api/job_runs/{job_run_id}/logs/stderr
```
### Streaming Protocol
Use **Server-Sent Events (SSE)** for real-time log streaming:
- Efficient for text streams (line-oriented)
- Native browser support (no WebSocket complexity)
- Automatic reconnection
- Works through HTTP/1.1 (no HTTP/2 requirement)
### Example Response
```
event: log
data: Building partition data/alpha...

event: log
data: [INFO] Reading dependencies from upstream

event: complete
data: {"exit_code": 0, "duration_ms": 1234}
```
### Query Parameters
- `?follow=true` - Keep connection open, stream new lines as they're written (like `tail -f`)
- `?since=<line_number>` - Start from specific line (for reconnection)
- `?lines=<N>` - Return last N lines and close (for quick inspection)
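A hedged sketch of the stdout endpoint, assuming the service uses axum (the web framework is not named in this doc), that application state carries an `Arc<dyn LogStore>`, and that `LogError` implements `Debug`; it builds on the `LogStore` trait defined in the next section:

```rust
use std::{convert::Infallible, sync::Arc};

use axum::{
    extract::{Path, Query, State},
    response::sse::{Event, KeepAlive, Sse},
};
use futures::{Stream, StreamExt};
use serde::Deserialize;

/// Query parameters mirroring ?follow, ?since, and ?lines.
#[derive(Deserialize)]
struct LogQuery {
    #[serde(default)]
    follow: bool,
    #[serde(default)]
    since: usize,
    lines: Option<usize>,
}

async fn stdout_logs(
    Path(job_run_id): Path<String>,
    Query(q): Query<LogQuery>,
    State(store): State<Arc<dyn LogStore>>,
) -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    let opts = StreamOptions {
        follow: q.follow,
        since_line: q.since,
        max_lines: q.lines,
    };
    // Each log line becomes an SSE `log` event; errors become `error` events
    // so the client can surface them without dropping the connection.
    let events = store.stream_stdout(&job_run_id, opts).map(|line| {
        Ok(match line {
            Ok(text) => Event::default().event("log").data(text),
            Err(err) => Event::default().event("error").data(format!("{err:?}")),
        })
    });
    Sse::new(events).keep_alive(KeepAlive::default())
}
```

Emitting the final `complete` event with exit code and duration would require the stream to signal job completion (e.g., a terminating marker from the store), which this sketch glosses over.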
## Abstraction Layer
Define `LogStore` trait to enable future log shipping without changing core logic:
```rust
use std::pin::Pin;
use futures::Stream;

/// Boxed stream alias so the trait stays object-safe and can be used as
/// `dyn LogStore` (see `Arc<Mutex<dyn LogStore>>` below).
pub type LogStream = Pin<Box<dyn Stream<Item = Result<String, LogError>> + Send + 'static>>;

/// Abstraction for job run log storage and retrieval
pub trait LogStore: Send + Sync {
    /// Append a line to stdout for the given job run
    fn append_stdout(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Append a line to stderr for the given job run
    fn append_stderr(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Stream stdout lines from a job run
    fn stream_stdout(&self, job_run_id: &str, opts: StreamOptions) -> LogStream;

    /// Stream stderr lines from a job run
    fn stream_stderr(&self, job_run_id: &str, opts: StreamOptions) -> LogStream;

    /// Write job metadata on completion
    fn write_metadata(&mut self, job_run_id: &str, metadata: JobMetadata) -> Result<(), LogError>;
}

pub struct StreamOptions {
    pub follow: bool,             // Keep streaming new lines
    pub since_line: usize,        // Start from line N
    pub max_lines: Option<usize>, // Limit to N lines
}
```
### Initial Implementation: `FileLogStore`
```rust
pub struct FileLogStore {
base_path: PathBuf, // e.g., /var/log/databuild/job_runs
}
```
Writes directly to `{base_path}/{job_run_id}/stdout.log`.
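A minimal sketch of the append path; the `LogStore` append methods would delegate to something like this and map `std::io::Error` into `LogError`:

```rust
use std::fs::{create_dir_all, OpenOptions};
use std::io::Write;

impl FileLogStore {
    /// Append one line to the named log file (stdout.log or stderr.log) for
    /// this job run, creating the directory and file on first write.
    fn append_line(&self, job_run_id: &str, file_name: &str, line: &str) -> std::io::Result<()> {
        let dir = self.base_path.join(job_run_id);
        create_dir_all(&dir)?;
        let mut file = OpenOptions::new()
            .create(true)
            .append(true)
            .open(dir.join(file_name))?;
        // Opening per line keeps the sketch simple; a production version
        // would likely cache an open handle per running job.
        writeln!(file, "{line}")?;
        Ok(())
    }
}
```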
### Future Implementations
- **`ShippingLogStore`**: Wraps `FileLogStore`, ships logs to S3/GCS/CloudWatch in background
- **`CompositeLogStore`**: Writes to multiple stores (local + remote)
- **`BufferedLogStore`**: Batches writes for efficiency
## Integration with Job Runner
The `SubProcessBackend` (in `job_run.rs`) currently buffers stdout in memory. This needs updating:
### Current (in-memory buffering):
```rust
pub struct SubProcessRunning {
pub process: Child,
pub stdout_buffer: Vec<String>, // ❌ Unbounded memory
}
```
### Proposed (streaming to disk):
```rust
pub struct SubProcessRunning {
pub process: Child,
pub log_store: Arc<Mutex<dyn LogStore>>,
pub job_run_id: String,
}
```
When polling the job (sketched below):
1. Read available stdout/stderr from process
2. Write each line to `log_store.append_stdout(job_run_id, line)`
3. Parse for special lines (e.g., `DATABUILD_MISSING_DEPS_JSON:...`)
4. Don't keep full log in memory
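A hedged sketch of the per-line handling that the poll loop would call for each drained line; `handle_missing_deps` is a hypothetical stand-in for the existing special-line parsing:

```rust
use std::sync::{Arc, Mutex};

/// Handle one line drained from the child's stdout during a poll: parse any
/// control lines, then persist the line instead of pushing it onto an
/// in-memory buffer.
fn on_stdout_line(
    log_store: &Arc<Mutex<dyn LogStore>>,
    job_run_id: &str,
    line: &str,
) -> Result<(), LogError> {
    if let Some(json) = line.strip_prefix("DATABUILD_MISSING_DEPS_JSON:") {
        // Hypothetical hook standing in for the existing parsing logic.
        handle_missing_deps(json);
    }
    log_store
        .lock()
        .expect("log store mutex poisoned")
        .append_stdout(job_run_id, line)
}
```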
## CLI Integration
The CLI should support log streaming:
```bash
# Stream logs for a running or completed job
databuild logs <job_run_id>
# Follow mode (tail -f)
databuild logs <job_run_id> --follow
# Show last N lines
databuild logs <job_run_id> --tail 100
# Show stderr instead of stdout
databuild logs <job_run_id> --stderr
```
Under the hood, this hits the `/api/job_runs/{id}/logs/stdout?follow=true` endpoint.
## Web App Integration
The web app can use native EventSource API:
```javascript
const eventSource = new EventSource(`/api/job_runs/${jobId}/logs/stdout?follow=true`);
eventSource.addEventListener('log', (event) => {
appendToTerminal(event.data);
});
eventSource.addEventListener('complete', (event) => {
const metadata = JSON.parse(event.data);
showExitCode(metadata.exit_code);
eventSource.close();
});
```
## Future: Log Shipping
When adding log shipping (e.g., to S3, CloudWatch, Datadog):
1. Create a `ShippingLogStore` implementation
2. Run background task that:
- Watches for completed jobs
- Batches log lines
- Ships to configured destination
- Deletes local files after successful upload (if configured)
3. Configure via:
```bash
export DATABUILD_LOG_SHIP_DEST=s3://my-bucket/databuild-logs
export DATABUILD_LOG_KEEP_LOCAL=false # Delete after ship
```
The `LogStore` trait means the core system doesn't change; we just swap in a new implementation.
## Open Questions
1. **Log format**: Plain text vs structured (JSON lines)?
- Plain text is more human-readable
- Structured is easier to search/analyze
- Suggestion: Plain text in files, parse to structured for API if needed
2. **Compression**: Compress old logs to save space?
- Could gzip files older than 24 hours
- Trade-off: disk space vs CPU on access
3. **Indexing**: Build an index for fast log search?
- Simple grep is probably fine initially
- Could add full-text search later if needed
4. **Multi-machine**: How do logs work in distributed builds?
- Each build machine has its own log directory
- Central service aggregates via log shipping
- Need to design this when we tackle distributed execution