# Job Run Logging Strategy

Claude-generated plan for logging.

## Philosophy
Job run logs are critical for debugging build failures and understanding system behavior. The logging system must:

- **Be resource efficient**: Not consume unbounded memory in the service process
- **Persist across restarts**: Logs survive service restarts/crashes
- **Stream in real time**: Enable live tailing for running jobs
- **Support future log shipping**: An abstract design allows later integration with log aggregation systems
- **Maintain data locality**: Keep logs on the build machines where jobs execute
## File-Based Approach

### Directory Structure
```
/var/log/databuild/
  job_runs/
    {job_run_id}/
      stdout.log       # Job standard output
      stderr.log       # Job standard error
      metadata.json    # Job metadata (timestamps, exit code, building_partitions, etc.)
```
### Write Strategy

- **Streaming writes**: Job output is written to disk as it's produced (not buffered in memory)
- **Append-only**: Log files are append-only for simplicity and crash safety
- **Metadata on completion**: Write `metadata.json` when the job reaches a terminal state
Rotation & Cleanup Policy
Two-pronged approach to prevent unbounded disk usage:
- Time-based TTL: Delete logs older than N days (default: 7 days)
- Size-based cap: If total log directory exceeds M GB (default: 10 GB), delete oldest logs first
Configuration via environment variables:
DATABUILD_LOG_TTL_DAYS(default: 7)DATABUILD_LOG_MAX_SIZE_GB(default: 10)
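The cleanup pass could run as a periodic background task. Below is a minimal sketch of how the two defaults might be enforced; `enforce_log_policy` and `dir_size` are illustrative helpers (not existing functions), sizes are computed naively, and per-directory deletion errors are ignored.

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::time::{Duration, SystemTime};

/// Apply the two-pronged policy to the job_runs directory:
/// 1) delete run directories older than `ttl`, then
/// 2) delete the oldest remaining directories until total size <= `max_bytes`.
fn enforce_log_policy(job_runs_dir: &Path, ttl: Duration, max_bytes: u64) -> std::io::Result<()> {
    let now = SystemTime::now();
    let mut dirs: Vec<(PathBuf, SystemTime, u64)> = Vec::new();
    for entry in fs::read_dir(job_runs_dir)? {
        let entry = entry?;
        let path = entry.path();
        if !path.is_dir() {
            continue;
        }
        let mtime = entry.metadata()?.modified()?;
        dirs.push((path.clone(), mtime, dir_size(&path)?));
    }

    // Pass 1: time-based TTL.
    dirs.retain(|(path, mtime, _)| {
        let expired = now.duration_since(*mtime).map(|age| age > ttl).unwrap_or(false);
        if expired {
            let _ = fs::remove_dir_all(path);
        }
        !expired
    });

    // Pass 2: size-based cap, deleting the oldest runs first.
    dirs.sort_by_key(|(_, mtime, _)| *mtime);
    let mut total: u64 = dirs.iter().map(|(_, _, size)| *size).sum();
    for (path, _, size) in dirs {
        if total <= max_bytes {
            break;
        }
        let _ = fs::remove_dir_all(&path);
        total = total.saturating_sub(size);
    }
    Ok(())
}

/// Sum of the direct children's sizes (stdout.log, stderr.log, metadata.json).
fn dir_size(dir: &Path) -> std::io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        total += entry?.metadata()?.len();
    }
    Ok(total)
}
```

Wiring this to the environment variables is just a matter of reading them and converting days to a `Duration` and GB to bytes.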
## API Streaming

### HTTP Endpoints

```
GET /api/job_runs/{job_run_id}/logs/stdout
GET /api/job_runs/{job_run_id}/logs/stderr
```
### Streaming Protocol
Use Server-Sent Events (SSE) for real-time log streaming:
- Efficient for text streams (line-oriented)
- Native browser support (no WebSocket complexity)
- Automatic reconnection
- Works through HTTP/1.1 (no HTTP/2 requirement)
### Example Response

```
event: log
data: Building partition data/alpha...

event: log
data: [INFO] Reading dependencies from upstream

event: complete
data: {"exit_code": 0, "duration_ms": 1234}
```
### Query Parameters

- `?follow=true` - Keep the connection open and stream new lines as they're written (like `tail -f`)
- `?since=<line_number>` - Start from a specific line (for reconnection)
- `?lines=<N>` - Return the last N lines and close (for quick inspection)
## Abstraction Layer

Define a `LogStore` trait to enable future log shipping without changing core logic:
```rust
use std::pin::Pin;

use futures::stream::Stream;

/// Abstraction for job run log storage and retrieval
pub trait LogStore: Send + Sync {
    /// Append a line to stdout for the given job run
    fn append_stdout(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Append a line to stderr for the given job run
    fn append_stderr(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Stream stdout lines from a job run. Returning a boxed stream (rather than
    /// `impl Stream`) keeps the trait object-safe so it can be held as
    /// `Arc<Mutex<dyn LogStore>>` by the job runner below.
    fn stream_stdout(&self, job_run_id: &str, opts: StreamOptions)
        -> Pin<Box<dyn Stream<Item = Result<String, LogError>> + Send + '_>>;

    /// Stream stderr lines from a job run
    fn stream_stderr(&self, job_run_id: &str, opts: StreamOptions)
        -> Pin<Box<dyn Stream<Item = Result<String, LogError>> + Send + '_>>;

    /// Write job metadata on completion
    fn write_metadata(&mut self, job_run_id: &str, metadata: JobMetadata)
        -> Result<(), LogError>;
}

pub struct StreamOptions {
    pub follow: bool,             // Keep streaming new lines
    pub since_line: usize,        // Start from line N
    pub max_lines: Option<usize>, // Limit to N lines
}
```
### Initial Implementation: FileLogStore

```rust
pub struct FileLogStore {
    base_path: PathBuf, // e.g., /var/log/databuild/job_runs
}
```

Writes directly to `{base_path}/{job_run_id}/stdout.log`.
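A minimal sketch of the append path (the struct is repeated for self-containment; `append_line` is an illustrative helper, and errors are returned as raw `std::io::Error` where the real implementation would map them into `LogError`):

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::PathBuf;

pub struct FileLogStore {
    base_path: PathBuf, // e.g., /var/log/databuild/job_runs
}

impl FileLogStore {
    /// Append one line to `{base_path}/{job_run_id}/{file_name}`, creating the
    /// run directory and file on first write. `LogStore::append_stdout` and
    /// `append_stderr` would delegate here with "stdout.log" / "stderr.log".
    /// Reopening the file per line keeps the sketch simple; a real
    /// implementation would cache open handles.
    fn append_line(&self, job_run_id: &str, file_name: &str, line: &str) -> std::io::Result<()> {
        let dir = self.base_path.join(job_run_id);
        fs::create_dir_all(&dir)?;
        let mut file = OpenOptions::new()
            .create(true)
            .append(true)
            .open(dir.join(file_name))?;
        writeln!(file, "{line}")
    }
}
```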
### Future Implementations

- `ShippingLogStore`: Wraps `FileLogStore`, ships logs to S3/GCS/CloudWatch in the background
- `CompositeLogStore`: Writes to multiple stores (local + remote); see the sketch below
- `BufferedLogStore`: Batches writes for efficiency
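As an illustration of how the trait composes, here is a hedged sketch of `CompositeLogStore`: appends fan out to both stores, reads are served from the local copy. It assumes the boxed-stream trait above and that `JobMetadata` is `Clone`.

```rust
use std::pin::Pin;

use futures::stream::Stream;

/// Mirrors every write to a local and a remote store; reads come from local.
pub struct CompositeLogStore {
    local: Box<dyn LogStore>,
    remote: Box<dyn LogStore>,
}

impl LogStore for CompositeLogStore {
    fn append_stdout(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError> {
        self.local.append_stdout(job_run_id, line)?;
        self.remote.append_stdout(job_run_id, line)
    }

    fn append_stderr(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError> {
        self.local.append_stderr(job_run_id, line)?;
        self.remote.append_stderr(job_run_id, line)
    }

    fn stream_stdout(&self, job_run_id: &str, opts: StreamOptions)
        -> Pin<Box<dyn Stream<Item = Result<String, LogError>> + Send + '_>> {
        self.local.stream_stdout(job_run_id, opts)
    }

    fn stream_stderr(&self, job_run_id: &str, opts: StreamOptions)
        -> Pin<Box<dyn Stream<Item = Result<String, LogError>> + Send + '_>> {
        self.local.stream_stderr(job_run_id, opts)
    }

    fn write_metadata(&mut self, job_run_id: &str, metadata: JobMetadata) -> Result<(), LogError> {
        self.local.write_metadata(job_run_id, metadata.clone())?; // assumes JobMetadata: Clone
        self.remote.write_metadata(job_run_id, metadata)
    }
}
```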
## Integration with Job Runner

The `SubProcessBackend` (in `job_run.rs`) currently buffers stdout in memory. This needs updating:

**Current (in-memory buffering):**

```rust
pub struct SubProcessRunning {
    pub process: Child,
    pub stdout_buffer: Vec<String>, // ❌ Unbounded memory
}
```
**Proposed (streaming to disk):**

```rust
pub struct SubProcessRunning {
    pub process: Child,
    pub log_store: Arc<Mutex<dyn LogStore>>,
    pub job_run_id: String,
}
```
When polling the job:

- Read available stdout/stderr from the process
- Write each line to `log_store.append_stdout(job_run_id, line)`
- Parse for special lines (e.g., `DATABUILD_MISSING_DEPS_JSON:...`)
- Don't keep the full log in memory (see the sketch below)
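One way to get there without buffering (a sketch, not the actual `SubProcessBackend` change): hand the child's stdout to a reader thread that streams lines straight into the store. `spawn_stdout_writer`, `MISSING_DEPS_PREFIX`, and the `on_missing_deps` callback are illustrative names.

```rust
use std::io::{BufRead, BufReader};
use std::process::ChildStdout;
use std::sync::{Arc, Mutex};
use std::thread;

const MISSING_DEPS_PREFIX: &str = "DATABUILD_MISSING_DEPS_JSON:";

/// Stream a child's stdout line by line into the log store on a dedicated thread.
/// Only the current line lives in memory; marker lines are handed to the callback.
fn spawn_stdout_writer(
    stdout: ChildStdout,
    log_store: Arc<Mutex<dyn LogStore>>,
    job_run_id: String,
    mut on_missing_deps: impl FnMut(&str) + Send + 'static,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for line in BufReader::new(stdout).lines() {
            let Ok(line) = line else { break };
            if let Some(payload) = line.strip_prefix(MISSING_DEPS_PREFIX) {
                on_missing_deps(payload);
            }
            // Persist immediately; a write failure shouldn't kill the job itself.
            if log_store.lock().unwrap().append_stdout(&job_run_id, &line).is_err() {
                eprintln!("log write failed for job run {job_run_id}");
            }
        }
    })
}
```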
## CLI Integration
The CLI should support log streaming:
```bash
# Stream logs for a running or completed job
databuild logs <job_run_id>

# Follow mode (tail -f)
databuild logs <job_run_id> --follow

# Show last N lines
databuild logs <job_run_id> --tail 100

# Show stderr instead of stdout
databuild logs <job_run_id> --stderr
```
Under the hood, this hits the `/api/job_runs/{id}/logs/stdout?follow=true` endpoint.
## Web App Integration

The web app can use the native `EventSource` API:
```javascript
const eventSource = new EventSource(`/api/job_runs/${jobId}/logs/stdout?follow=true`);

eventSource.addEventListener('log', (event) => {
  appendToTerminal(event.data);
});

eventSource.addEventListener('complete', (event) => {
  const metadata = JSON.parse(event.data);
  showExitCode(metadata.exit_code);
  eventSource.close();
});
```
## Future: Log Shipping

When adding log shipping (e.g., to S3, CloudWatch, Datadog):

- Create a `ShippingLogStore` implementation
- Run a background task that:
  - Watches for completed jobs
  - Batches log lines
  - Ships to the configured destination
  - Deletes local files after successful upload (if configured)
- Configure via:

```bash
export DATABUILD_LOG_SHIP_DEST=s3://my-bucket/databuild-logs
export DATABUILD_LOG_KEEP_LOCAL=false  # Delete after ship
```

The `LogStore` trait means the core system doesn't change - just swap implementations.
## Open Questions

1. **Log format**: Plain text vs structured (JSON lines)?
   - Plain text is more human-readable
   - Structured is easier to search/analyze
   - Suggestion: plain text in files, parse to structured for the API if needed

2. **Compression**: Compress old logs to save space?
   - Could gzip files older than 24 hours
   - Trade-off: disk space vs CPU on access

3. **Indexing**: Build an index for fast log search?
   - Simple grep is probably fine initially
   - Could add full-text search later if needed

4. **Multi-machine**: How do logs work in distributed builds?
   - Each build machine has its own log directory
   - Central service aggregates via log shipping
   - Need to design this when we tackle distributed execution