Commit 2084fadbb6 (parent cf163b294d)
1 changed file with 231 additions and 0 deletions
docs/ideas/logging.md (new file)
# Job Run Logging Strategy
Claude-generated plan for logging.

## Philosophy

Job run logs are critical for debugging build failures and understanding system behavior. The logging system must:

1. **Be resource efficient** - Not consume unbounded memory in the service process
2. **Persist across restarts** - Logs survive service restarts/crashes
3. **Stream in real-time** - Enable live tailing for running jobs
4. **Support future log shipping** - Abstract design allows later integration with log aggregation systems
5. **Maintain data locality** - Keep logs on build machines where jobs execute
## File-Based Approach

### Directory Structure

```
/var/log/databuild/
  job_runs/
    {job_run_id}/
      stdout.log      # Job standard output
      stderr.log      # Job standard error
      metadata.json   # Job metadata (timestamps, exit code, building_partitions, etc.)
```
### Write Strategy

- **Streaming writes**: Job output written to disk as it's produced (not buffered in memory)
- **Append-only**: Log files are append-only for simplicity and crash safety
- **Metadata on completion**: Write metadata.json when job reaches terminal state
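For concreteness, one possible shape for metadata.json, expressed as a serde-serializable struct. The struct name matches the `JobMetadata` parameter used by the `LogStore` trait below, and only the fields this plan mentions (timestamps, exit code, `building_partitions`, plus `duration_ms`/`exit_code` from the SSE `complete` event later on) are grounded in the doc; the exact field names and types are assumptions.

```rust
use serde::Serialize;

/// Hypothetical shape of metadata.json, written once the job reaches a terminal state.
#[derive(Serialize)]
pub struct JobMetadata {
    pub job_run_id: String,
    pub started_at: String,               // e.g. RFC 3339 timestamp
    pub finished_at: String,              // e.g. RFC 3339 timestamp
    pub exit_code: i32,
    pub duration_ms: u64,
    pub building_partitions: Vec<String>,
}

// Writing it could be as simple as:
// fs::write(run_dir.join("metadata.json"), serde_json::to_vec_pretty(&metadata)?)?;
```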
### Rotation & Cleanup Policy

Two-pronged approach to prevent unbounded disk usage:

1. **Time-based TTL**: Delete logs older than N days (default: 7 days)
2. **Size-based cap**: If total log directory exceeds M GB (default: 10 GB), delete oldest logs first

Configuration via environment variables:
- `DATABUILD_LOG_TTL_DAYS` (default: 7)
- `DATABUILD_LOG_MAX_SIZE_GB` (default: 10)
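To make the policy concrete, a minimal sketch of a cleanup pass the service could run periodically. The TTL-then-size ordering and the directory layout come from this doc; the `cleanup_logs` name, using the newest file mtime as a run's age, and the error-handling choices are assumptions. Reading the two environment variables and converting them to a `Duration` and a byte count would happen in the caller.

```rust
use std::{
    fs, io,
    path::{Path, PathBuf},
    time::{Duration, SystemTime},
};

/// Hypothetical cleanup pass over /var/log/databuild/job_runs:
/// apply the time-based TTL first, then the size-based cap.
fn cleanup_logs(base: &Path, ttl: Duration, max_bytes: u64) -> io::Result<()> {
    // Gather (newest mtime, total size, path) for each job_run directory.
    let mut runs: Vec<(SystemTime, u64, PathBuf)> = Vec::new();
    for entry in fs::read_dir(base)? {
        let path = entry?.path();
        if !path.is_dir() {
            continue;
        }
        let (mut size, mut newest) = (0u64, SystemTime::UNIX_EPOCH);
        for file in fs::read_dir(&path)? {
            let meta = file?.metadata()?;
            size += meta.len();
            newest = newest.max(meta.modified()?);
        }
        runs.push((newest, size, path));
    }

    // 1. Time-based TTL: remove runs whose newest file is older than the TTL.
    let now = SystemTime::now();
    runs.retain(|(modified, _, path)| {
        let expired = now
            .duration_since(*modified)
            .map(|age| age > ttl)
            .unwrap_or(false);
        if expired {
            let _ = fs::remove_dir_all(path);
        }
        !expired
    });

    // 2. Size-based cap: delete the oldest remaining runs until under the limit.
    runs.sort_by_key(|(modified, _, _)| *modified); // oldest first
    let mut total: u64 = runs.iter().map(|(_, size, _)| *size).sum();
    for (_, size, path) in &runs {
        if total <= max_bytes {
            break;
        }
        fs::remove_dir_all(path)?;
        total -= *size;
    }
    Ok(())
}
```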
## API Streaming

### HTTP Endpoints

```
GET /api/job_runs/{job_run_id}/logs/stdout
GET /api/job_runs/{job_run_id}/logs/stderr
```

### Streaming Protocol

Use **Server-Sent Events (SSE)** for real-time log streaming:

- Efficient for text streams (line-oriented)
- Native browser support (no WebSocket complexity)
- Automatic reconnection
- Works through HTTP/1.1 (no HTTP/2 requirement)

### Example Response

```
event: log
data: Building partition data/alpha...

event: log
data: [INFO] Reading dependencies from upstream

event: complete
data: {"exit_code": 0, "duration_ms": 1234}
```

### Query Parameters

- `?follow=true` - Keep connection open, stream new lines as they're written (like `tail -f`)
- `?since=<line_number>` - Start from specific line (for reconnection)
- `?lines=<N>` - Return last N lines and close (for quick inspection)
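As a sketch of how the server side could produce this stream, here is roughly what the stdout endpoint might look like with axum's SSE support. The framework choice is not specified in this doc and is an assumption, as are the `LogQuery` struct, the `stdout_logs` handler name, and the `demo_lines` placeholder; in the real service the lines would come from `LogStore::stream_stdout` (defined in the next section).

```rust
use std::convert::Infallible;

use axum::{
    extract::{Path, Query},
    response::sse::{Event, KeepAlive, Sse},
};
use futures::{stream, Stream, StreamExt};
use serde::Deserialize;

/// Hypothetical mirror of the query parameters above.
#[derive(Deserialize)]
struct LogQuery {
    #[serde(default)]
    follow: bool,
    since: Option<usize>,
    lines: Option<usize>,
}

/// GET /api/job_runs/{job_run_id}/logs/stdout
async fn stdout_logs(
    Path(job_run_id): Path<String>,
    Query(_opts): Query<LogQuery>,
) -> Sse<impl Stream<Item = Result<Event, Infallible>>> {
    // Placeholder source; a real handler would stream lines from the LogStore.
    let lines = demo_lines(job_run_id);
    let events = lines.map(|line| Ok(Event::default().event("log").data(line)));
    Sse::new(events).keep_alive(KeepAlive::default())
}

/// Stand-in for LogStore::stream_stdout.
fn demo_lines(job_run_id: String) -> impl Stream<Item = String> {
    stream::iter(vec![format!("Building logs for {job_run_id}...")])
}
```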
## Abstraction Layer

Define a `LogStore` trait to enable future log shipping without changing core logic:

```rust
use std::pin::Pin;

use futures::Stream;

/// Abstraction for job run log storage and retrieval.
/// (`LogError` and `JobMetadata` are defined elsewhere.)
pub trait LogStore: Send + Sync {
    /// Append a line to stdout for the given job run
    fn append_stdout(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Append a line to stderr for the given job run
    fn append_stderr(&mut self, job_run_id: &str, line: &str) -> Result<(), LogError>;

    /// Stream stdout lines from a job run.
    /// Boxed rather than `impl Trait` so the trait stays object-safe;
    /// it is used as `Arc<Mutex<dyn LogStore>>` further down.
    fn stream_stdout(&self, job_run_id: &str, opts: StreamOptions)
        -> Pin<Box<dyn Stream<Item = Result<String, LogError>> + Send + '_>>;

    /// Stream stderr lines from a job run
    fn stream_stderr(&self, job_run_id: &str, opts: StreamOptions)
        -> Pin<Box<dyn Stream<Item = Result<String, LogError>> + Send + '_>>;

    /// Write job metadata on completion
    fn write_metadata(&mut self, job_run_id: &str, metadata: JobMetadata)
        -> Result<(), LogError>;
}

pub struct StreamOptions {
    pub follow: bool,             // Keep streaming new lines
    pub since_line: usize,        // Start from line N
    pub max_lines: Option<usize>, // Limit to N lines
}
```
### Initial Implementation: `FileLogStore`

```rust
pub struct FileLogStore {
    base_path: PathBuf, // e.g., /var/log/databuild/job_runs
}
```

Writes directly to `{base_path}/{job_run_id}/stdout.log`.
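A minimal sketch of the append path, repeating the struct so the snippet stands alone. The `append_line` helper, the lazy directory creation, and the use of `io::Result` instead of the trait's `LogError` are simplifications, not the actual implementation.

```rust
use std::{
    fs::{self, OpenOptions},
    io::{self, Write},
    path::PathBuf,
};

pub struct FileLogStore {
    base_path: PathBuf, // e.g., /var/log/databuild/job_runs
}

impl FileLogStore {
    /// Append one line to {base_path}/{job_run_id}/{file}.
    /// (append_stderr would differ only in the file name.)
    fn append_line(&self, job_run_id: &str, file: &str, line: &str) -> io::Result<()> {
        let dir = self.base_path.join(job_run_id);
        fs::create_dir_all(&dir)?; // lazily create the per-run directory
        let mut f = OpenOptions::new()
            .create(true)
            .append(true) // append-only, per the write strategy above
            .open(dir.join(file))?;
        writeln!(f, "{line}")?;
        Ok(())
    }

    pub fn append_stdout(&self, job_run_id: &str, line: &str) -> io::Result<()> {
        self.append_line(job_run_id, "stdout.log", line)
    }
}
```

Opening the file for every line keeps the sketch simple; a cached file handle, or the `BufferedLogStore` mentioned below, would batch writes instead.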
### Future Implementations

- **`ShippingLogStore`**: Wraps `FileLogStore`, ships logs to S3/GCS/CloudWatch in background
- **`CompositeLogStore`**: Writes to multiple stores (local + remote)
- **`BufferedLogStore`**: Batches writes for efficiency
## Integration with Job Runner

The `SubProcessBackend` (in `job_run.rs`) currently buffers stdout in memory. This needs updating:

### Current (in-memory buffering):
```rust
pub struct SubProcessRunning {
    pub process: Child,
    pub stdout_buffer: Vec<String>, // ❌ Unbounded memory
}
```

### Proposed (streaming to disk):
```rust
pub struct SubProcessRunning {
    pub process: Child,
    pub log_store: Arc<Mutex<dyn LogStore>>,
    pub job_run_id: String,
}
```

When polling the job:
1. Read available stdout/stderr from process
2. Write each line to `log_store.append_stdout(job_run_id, line)`
3. Parse for special lines (e.g., `DATABUILD_MISSING_DEPS_JSON:...`)
4. Don't keep full log in memory
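To tie the pieces together, a rough sketch of that loop, assuming the `LogStore` trait from the Abstraction Layer section is available. The `drain_stdout` name, the `handle_missing_deps` stub, and reading the pipe to EOF instead of truly polling are all simplifications; this is not the actual `SubProcessBackend` code.

```rust
use std::{
    io::{BufRead, BufReader},
    process::Child,
    sync::{Arc, Mutex},
};

/// Placeholder for the existing special-line handling in job_run.rs.
fn handle_missing_deps(_json: &str) {}

/// Hypothetical helper illustrating steps 1-4 above. For brevity it reads the
/// pipe to EOF rather than polling; the child must have piped stdout.
fn drain_stdout(child: &mut Child, log_store: &Arc<Mutex<dyn LogStore>>, job_run_id: &str) {
    let stdout = child.stdout.take().expect("stdout must be piped");
    for line in BufReader::new(stdout).lines().flatten() {
        // 3. Special control lines are parsed, not just logged.
        if let Some(json) = line.strip_prefix("DATABUILD_MISSING_DEPS_JSON:") {
            handle_missing_deps(json);
        }
        // 2. Every line goes to the log store instead of an in-memory buffer
        //    (error handling elided for the sketch).
        let _ = log_store
            .lock()
            .expect("log store mutex poisoned")
            .append_stdout(job_run_id, &line);
        // 4. `line` is dropped at the end of each iteration; nothing accumulates.
    }
}
```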
## CLI Integration

The CLI should support log streaming:

```bash
# Stream logs for a running or completed job
databuild logs <job_run_id>

# Follow mode (tail -f)
databuild logs <job_run_id> --follow

# Show last N lines
databuild logs <job_run_id> --tail 100

# Show stderr instead of stdout
databuild logs <job_run_id> --stderr
```

Under the hood, this hits the `/api/job_runs/{id}/logs/stdout?follow=true` endpoint.
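A rough sketch of how the `--follow` path could consume that endpoint, assuming a blocking `reqwest` client (this doc doesn't pick an HTTP client, and the "blocking" feature would be needed). The `follow_logs` name and the line-by-line SSE handling are illustrative only; a fuller client would also watch for the `complete` event.

```rust
use std::io::{BufRead, BufReader};

/// Hypothetical implementation of `databuild logs <id> --follow`:
/// read the SSE stream and print each `data:` payload as it arrives.
fn follow_logs(base_url: &str, job_run_id: &str) -> Result<(), Box<dyn std::error::Error>> {
    let url = format!("{base_url}/api/job_runs/{job_run_id}/logs/stdout?follow=true");
    let resp = reqwest::blocking::get(url)?;
    for line in BufReader::new(resp).lines() {
        let line = line?;
        // SSE frames are "event: ..." / "data: ..." pairs separated by blank lines;
        // for plain tailing we only care about the data payloads.
        if let Some(payload) = line.strip_prefix("data: ") {
            println!("{payload}");
        }
    }
    Ok(())
}
```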
## Web App Integration

The web app can use native EventSource API:

```javascript
const eventSource = new EventSource(`/api/job_runs/${jobId}/logs/stdout?follow=true`);

eventSource.addEventListener('log', (event) => {
  appendToTerminal(event.data);
});

eventSource.addEventListener('complete', (event) => {
  const metadata = JSON.parse(event.data);
  showExitCode(metadata.exit_code);
  eventSource.close();
});
```
## Future: Log Shipping

When adding log shipping (e.g., to S3, CloudWatch, Datadog):

1. Create a `ShippingLogStore` implementation
2. Run background task that:
   - Watches for completed jobs
   - Batches log lines
   - Ships to configured destination
   - Deletes local files after successful upload (if configured)
3. Configure via:
   ```bash
   export DATABUILD_LOG_SHIP_DEST=s3://my-bucket/databuild-logs
   export DATABUILD_LOG_KEEP_LOCAL=false  # Delete after ship
   ```

The `LogStore` trait means the core system doesn't change - just swap implementations.
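If it helps to picture the background task, a very rough sketch is below. `ship_directory` is a placeholder for the real S3/GCS/CloudWatch upload, the function names are made up, and treating the presence of metadata.json as the completion signal is an assumption based on the write strategy above.

```rust
use std::{
    fs,
    path::{Path, PathBuf},
    time::Duration,
};

/// Hypothetical background shipper: upload completed runs, optionally delete local copies.
fn shipping_loop(base: &Path, dest: &str, keep_local: bool) {
    loop {
        for run_dir in completed_run_dirs(base) {
            if ship_directory(&run_dir, dest).is_ok() && !keep_local {
                let _ = fs::remove_dir_all(&run_dir);
            }
        }
        std::thread::sleep(Duration::from_secs(60));
    }
}

/// A run counts as "completed" once metadata.json exists (written at terminal state).
fn completed_run_dirs(base: &Path) -> Vec<PathBuf> {
    fs::read_dir(base)
        .into_iter()
        .flatten()
        .flatten()
        .map(|entry| entry.path())
        .filter(|path| path.join("metadata.json").exists())
        .collect()
}

/// Placeholder for the actual upload (S3/GCS/CloudWatch client call).
fn ship_directory(_run_dir: &Path, _dest: &str) -> std::io::Result<()> {
    Ok(())
}
```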
## Open Questions

1. **Log format**: Plain text vs structured (JSON lines)?
   - Plain text is more human-readable
   - Structured is easier to search/analyze
   - Suggestion: Plain text in files, parse to structured for API if needed

2. **Compression**: Compress old logs to save space?
   - Could gzip files older than 24 hours
   - Trade-off: disk space vs CPU on access

3. **Indexing**: Build an index for fast log search?
   - Simple grep is probably fine initially
   - Could add full-text search later if needed

4. **Multi-machine**: How do logs work in distributed builds?
   - Each build machine has its own log directory
   - Central service aggregates via log shipping
   - Need to design this when we tackle distributed execution