# Graph-Side Log Consumption Plan

## Status

- Phase 0: Design [DONE]
- Phase 1: Core Implementation [COMPLETED ✅]
- Phase 2: Advanced Features [FUTURE]

## Phase 1 Implementation Status

### ✅ **Core Components COMPLETED**

1. **JobLogEntry protobuf interface fixed** - Updated `databuild.proto` to use `repeated PartitionRef outputs` instead of a single `string partition_ref`
2. **LogCollector implemented** - Consumes job wrapper stdout, parses structured logs, writes to date-organized JSONL files (`logs/databuild/YYYY-MM-DD/{job_run_id}.jsonl`)
3. **Graph integration completed** - LogCollector integrated into graph execution, with UUID-based job ID coordination between graph and wrapper
4. **Unified Log Access Layer implemented** - Protobuf-based `LogReader` interface ensuring CLI/Service consistency for log retrieval
5. **Centralized metric templates** - All metric definitions centralized in the `databuild/metric_templates.rs` module
6. **MetricsAggregator with cardinality safety** - Prometheus output without partition-reference explosion, using job labels instead
7. **REST API endpoints implemented** - `/api/v1/jobs/{job_run_id}/logs` and `/api/v1/metrics` fully functional
8. **Graph-level job_label enrichment** - Solved the cardinality issue via the LogCollector enrichment pattern, consistent with the design philosophy

### ✅ **Key Architectural Decisions Implemented**

- **Cardinality-safe metrics**: Job labels used instead of high-cardinality partition references in Prometheus output
- **Graph-level enrichment**: LogCollector enriches both WrapperJobEvent and Manifest entries with job_label from graph context
- **JSONL storage**: Date-organized file structure with robust error handling and concurrent-access safety
- **Unified execution paths**: Both CLI and service builds produce identical BEL events and JSONL logs in the same locations
- **Job ID coordination**: UUID-based job run IDs shared between graph execution and job wrapper via an environment variable

### ✅ **All Success Criteria Met**

- ✅ **Reliable Log Capture**: All job wrapper output captured without loss through LogCollector
- ✅ **API Functionality**: REST API retrieves logs by job run ID, with timestamp and log-level filtering
- ✅ **Safe Metrics**: Prometheus endpoint works without cardinality explosion (job labels only, no partition refs)
- ✅ **Correctness**: No duplicated metric templates; all definitions centralized in `metric_templates.rs`
- ✅ **Concurrent Safety**: Multiple jobs write logs simultaneously without corruption via separate JSONL files per job
- ✅ **Simple Testing**: Test suite covers core functionality with minimal brittleness; all tests passing

### 🏗️ **Implementation Files**

- `databuild/databuild.proto` - Updated protobuf interfaces
- `databuild/log_collector.rs` - Core log collection and JSONL writing
- `databuild/log_access.rs` - Unified log reading interface
- `databuild/metric_templates.rs` - Centralized metric definitions
- `databuild/metrics_aggregator.rs` - Cardinality-safe Prometheus output
- `databuild/service/handlers.rs` - REST API endpoint implementations
- `databuild/graph/execute.rs` - Integration point for LogCollector
- `databuild/job/main.rs` - Job wrapper structured log emission

## Required Reading

Before implementing this plan, engineers should thoroughly understand these design documents:

- **[DESIGN.md](../DESIGN.md)** - Overall DataBuild architecture and job execution model
- **[design/core-build.md](../design/core-build.md)** - Core build semantics and job lifecycle state machines
- **[design/build-event-log.md](../design/build-event-log.md)** - Event sourcing model and BEL integration
- **[design/observability.md](../design/observability.md)** - Observability strategy and telemetry requirements
- **[plans/job-wrapper.md](./job-wrapper.md)** - Job wrapper implementation and structured log protocol
- **[databuild.proto](../databuild/databuild.proto)** - System interfaces and data structures

## Overview

This plan describes the graph-side implementation for consuming structured logs emitted by the job wrapper. The job wrapper emits `JobLogEntry` protobuf messages to stdout during job execution. The graph must consume these logs to provide log retrieval by job run ID and expose metrics for Prometheus scraping.

## Key Technical Decisions

### 1. Storage Strategy: On-Disk with BEL Separation
**Decision**: Store structured logs on disk, separate from the Build Event Log (BEL).

**Motivation**:
- Log volumes can be legitimately large and would place undue stress on the BEL-backing datastore
- The BEL is optimized for event-sourcing patterns, not high-volume log queries
- Separate storage allows independent scaling and retention policies

### 2. File Organization: Date-Organized Structure
**Decision**: Store logs in a configurable-base, date-organized directory structure: `$LOGS_BASEPATH/YYYY-MM-DD/{job_run_id}.jsonl`

**Motivation**:
- Enables efficient cleanup by date (future optimization)
- Simplifies manual log management during development
- Facilitates external log collection tools (future)
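
The path derivation is mechanical; a minimal sketch of it follows. The function name `log_file_path` is illustrative, not the actual implementation, and `date` is assumed to arrive pre-formatted as `YYYY-MM-DD`:

```rust
use std::path::PathBuf;

// Build the JSONL path for a job run under the date-organized layout.
// The real collector would derive `date` from the current time at
// file-creation time rather than take it as a parameter.
fn log_file_path(logs_basepath: &str, date: &str, job_run_id: &str) -> PathBuf {
    PathBuf::from(logs_basepath)
        .join(date)
        .join(format!("{job_run_id}.jsonl"))
}

fn main() {
    let path = log_file_path("/logs/databuild", "2025-01-27", "job_run_123abc");
    println!("{}", path.display());
}
```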

### 3. Static Update Period (Phase 1)
**Decision**: Use a fixed refresh interval for log processing. Adaptive batching is a future optimization.

**Motivation**:
- Simplicity for the initial implementation
- Predictable performance characteristics
- Easier to debug and test
- Can be optimized later based on real usage patterns

### 4. Manual Log Cleanup (Phase 1)
**Decision**: No automatic log retention/cleanup in the initial implementation.

**Motivation**:
- We're in an early development phase
- Manual cleanup is acceptable for now
- Avoids complexity in the initial implementation
- Automatic retention can be added as a future optimization

### 5. Unified Telemetry Stream
**Decision**: All `JobLogEntry` messages (logs, metrics, events) flow through the same JSONL files.

**Motivation**:
- Simplicity - a single consumption pipeline
- Temporal consistency - metrics and logs are naturally correlated
- A unified file format reduces complexity

### 6. Cardinality-Safe Prometheus Metrics
**Decision**: Prometheus metrics will NOT include partition references as labels, to avoid cardinality explosion.

**Motivation**:
- Partition labels (date × customer × region × etc.) would create massive cardinality
- Focus on job-level and system-level metrics only
- Use job_id and job_type labels instead of partition-specific labels

### 7. Centralized Metric Templates for Correctness
**Decision**: Define all Prometheus metric names and label templates in a central location to avoid string duplication.

**Motivation**:
- Prevents implicit coupling via duplicated string templates
- Single source of truth for metric definitions
- Easier to maintain consistency across the codebase

### 8. Limited Scope (Phase 1)
**Decision**: Phase 1 focuses on the log retrieval API and Prometheus metrics, excluding web app integration.

**Motivation**:
- Web app integration is part of a bigger update
- Allows a focused implementation of core log consumption
- An API-first approach enables multiple consumers

### 9. Unified Execution Paths
**Decision**: Both CLI and service builds produce identical BEL events and JSONL logs in the same locations.

**Motivation**:
- Building with the CLI and then querying from the service "just works"
- Single source of truth for all build artifacts
- Consistent behavior regardless of execution method
- Simplifies debugging and operational workflows

## Interface Issues to Fix

### JobLogEntry Protobuf Update Required
The current `JobLogEntry` definition needs updates:

**Current (INCORRECT)**:
```proto
message JobLogEntry {
  string partition_ref = 3;  // Single string
  // ...
}
```

**Required (CORRECT)**:
```proto
message JobLogEntry {
  repeated PartitionRef outputs = 3;  // Multiple PartitionRef objects
  // ...
}
```

**Rationale**: Jobs produce multiple partitions, and we should use the proper `PartitionRef` type for consistency with other interfaces.

## Architecture

### Storage Layout
```
/logs/databuild/
├── 2025-01-27/
│   ├── job_run_123abc.jsonl
│   ├── job_run_456def.jsonl
│   └── ...
├── 2025-01-28/
│   └── ...
```

### File Format (JSONL)
Each file contains one JSON object per line, representing a `JobLogEntry`:
```json
{"timestamp":"2025-01-27T10:30:45Z","job_id":"job_run_123abc","outputs":[{"path":"s3://bucket/dataset/date=2025-01-27"}],"sequence_number":1,"content":{"job_event":{"event_type":"task_launched","metadata":{}}}}
{"timestamp":"2025-01-27T10:30:46Z","job_id":"job_run_123abc","outputs":[{"path":"s3://bucket/dataset/date=2025-01-27"}],"sequence_number":2,"content":{"log":{"level":"INFO","message":"Processing started","fields":{"rows":"1000"}}}}
{"timestamp":"2025-01-27T10:30:50Z","job_id":"job_run_123abc","outputs":[{"path":"s3://bucket/dataset/date=2025-01-27"}],"sequence_number":3,"content":{"metric":{"name":"rows_processed","value":1000,"labels":{"stage":"transform"},"unit":"count"}}}
```
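
Because each line is a complete JSON object, consumers can scan a file incrementally without loading it whole. The sketch below shows that property with a dependency-free line splitter and a deliberately naive extraction of the `sequence_number` field; a real reader would deserialize each line with a JSON library, and `extract_sequence_number` is purely illustrative:

```rust
// One JobLogEntry per line: split on newlines, skip blanks.
fn entry_lines(jsonl: &str) -> Vec<&str> {
    jsonl.lines().filter(|l| !l.trim().is_empty()).collect()
}

// Naive field extraction for illustration only; real code would
// parse the whole line as JSON instead of string matching.
fn extract_sequence_number(line: &str) -> Option<u64> {
    let key = "\"sequence_number\":";
    let start = line.find(key)? + key.len();
    let digits: String = line[start..].chars().take_while(|c| c.is_ascii_digit()).collect();
    digits.parse().ok()
}

fn main() {
    let file = "{\"sequence_number\":1,\"content\":{\"log\":{\"level\":\"INFO\"}}}\n{\"sequence_number\":2,\"content\":{\"metric\":{\"name\":\"rows_processed\"}}}\n";
    let lines = entry_lines(file);
    assert_eq!(lines.len(), 2);
    assert_eq!(extract_sequence_number(lines[1]), Some(2));
}
```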

### Consumption Pipeline
```
Job Wrapper (stdout) → Graph Log Collector → JSONL Files
                                                 ↓
                                    Unified Log Access Layer
                                        ↙              ↘
                                 Service API        CLI API
                                      ↓
                      Metrics Aggregator → /api/v1/metrics
```

## Implementation Components

### 1. Log Collector [PHASE 1]
**Responsibility**: Consume job wrapper stdout and write to JSONL files.

```rust
struct LogCollector {
    logs_dir: PathBuf,                    // /logs/databuild
    active_files: HashMap<String, File>,  // job_run_id -> file handle
}

impl LogCollector {
    fn consume_job_output(&mut self, job_run_id: &str, stdout: &mut BufReader<ChildStdout>) -> Result<()>;
    fn write_log_entry(&mut self, job_run_id: &str, entry: &JobLogEntry) -> Result<()>;
    fn ensure_date_directory(&self) -> Result<PathBuf>;
}
```
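
A sketch of the write path behind that interface: create the date directory on first use, then append one serialized line per entry. Appending through a per-job file is what makes concurrent jobs safe, since no two jobs ever share a file. The function body here is illustrative, not the actual `log_collector.rs` code:

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::{Path, PathBuf};

// Append one already-serialized JobLogEntry JSON line to the job's file,
// creating the date directory if it does not yet exist.
fn append_entry(logs_dir: &Path, date: &str, job_run_id: &str, json_line: &str) -> std::io::Result<PathBuf> {
    let dir = logs_dir.join(date);
    fs::create_dir_all(&dir)?;  // plays the role of ensure_date_directory
    let path = dir.join(format!("{job_run_id}.jsonl"));
    let mut file = OpenOptions::new().create(true).append(true).open(&path)?;
    writeln!(file, "{json_line}")?;  // one entry per line
    Ok(path)
}

fn main() -> std::io::Result<()> {
    let base = std::env::temp_dir().join("databuild_logs_demo");
    let _ = fs::remove_dir_all(&base);  // start clean for the demo
    let path = append_entry(&base, "2025-01-27", "job_run_123abc", "{\"sequence_number\":1}")?;
    append_entry(&base, "2025-01-27", "job_run_123abc", "{\"sequence_number\":2}")?;
    assert_eq!(fs::read_to_string(&path)?.lines().count(), 2);
    Ok(())
}
```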

### 2. Unified Log Access Layer [PHASE 1]
**Responsibility**: Provide a common interface for reading logs from JSONL files, used by both service and CLI.

```rust
// Core log access implementation
struct LogReader {
    logs_base_path: PathBuf,
}

impl LogReader {
    fn get_job_logs(&self, request: &JobLogsRequest) -> Result<JobLogsResponse>;
    fn list_available_jobs(&self, date_range: Option<(String, String)>) -> Result<Vec<String>>;
    fn get_job_metrics(&self, job_run_id: &str) -> Result<Vec<MetricPoint>>;
}
```

**Protobuf Interface** (ensures CLI/Service consistency):
```proto
message JobLogsRequest {
  string job_run_id = 1;
  int64 since_timestamp = 2;  // Unix timestamp (nanoseconds)
  int32 min_level = 3;        // LogLevel enum value
  uint32 limit = 4;
}

message JobLogsResponse {
  repeated JobLogEntry entries = 1;
  bool has_more = 2;
}
```

### 3. Metrics Templates [PHASE 1]
**Responsibility**: Centralized metric definitions to avoid string duplication.

```rust
// Central location for all metric definitions
mod metric_templates {
    use std::collections::HashMap;

    pub const JOB_RUNTIME_SECONDS: &str = "databuild_job_runtime_seconds";
    pub const JOB_MEMORY_PEAK_MB: &str = "databuild_job_memory_peak_mb";
    pub const JOB_CPU_TOTAL_SECONDS: &str = "databuild_job_cpu_total_seconds";
    pub const ROWS_PROCESSED_TOTAL: &str = "databuild_rows_processed_total";

    pub fn job_labels(job_run_id: &str, job_type: &str) -> HashMap<String, String> {
        let mut labels = HashMap::new();
        labels.insert("job_run_id".to_string(), job_run_id.to_string());
        labels.insert("job_type".to_string(), job_type.to_string());
        labels
    }
}
```

### 4. Metrics Aggregator [PHASE 1]
**Responsibility**: Process `MetricPoint` messages and expose Prometheus format with safe cardinality.

```rust
struct MetricsAggregator {
    metrics: HashMap<String, MetricFamily>,
}

impl MetricsAggregator {
    fn ingest_metric(&mut self, metric: &MetricPoint, job_run_id: &str, job_type: &str);
    fn generate_prometheus_output(&self) -> String;
}
```

**Safe Prometheus Output** (NO partition labels):
```
# HELP databuild_job_runtime_seconds Job execution time in seconds
# TYPE databuild_job_runtime_seconds gauge
databuild_job_runtime_seconds{job_run_id="job_run_123abc"} 45.2

# HELP databuild_rows_processed_total Total rows processed by job
# TYPE databuild_rows_processed_total counter
databuild_rows_processed_total{job_run_id="job_run_123abc"} 1000
```
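
Producing that output is string assembly with deterministic label order; a minimal sketch for a single metric family (`render_metric` is illustrative, and label-value escaping is omitted):

```rust
use std::collections::BTreeMap;

// Render one metric family in Prometheus exposition format.
// BTreeMap keeps label order deterministic; real label values
// may also need escaping of quotes and backslashes.
fn render_metric(name: &str, help: &str, kind: &str, labels: &BTreeMap<&str, &str>, value: f64) -> String {
    let label_str = labels
        .iter()
        .map(|(k, v)| format!("{k}=\"{v}\""))
        .collect::<Vec<_>>()
        .join(",");
    format!("# HELP {name} {help}\n# TYPE {name} {kind}\n{name}{{{label_str}}} {value}\n")
}

fn main() {
    let labels = BTreeMap::from([("job_run_id", "job_run_123abc")]);
    let out = render_metric(
        "databuild_job_runtime_seconds",
        "Job execution time in seconds",
        "gauge",
        &labels,
        45.2,
    );
    print!("{out}");
}
```

Note how the cardinality decision shows up here: the only labels ever passed in are the job-level ones, never partition references.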

## API Implementation

### REST Endpoints [PHASE 1]

**Get Job Logs**:
```
GET /api/v1/jobs/{job_run_id}/logs?since={timestamp}&level={log_level}
```
Response: An array of `JobLogEntry` objects, with filtering support.
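
The `since` and `level` query parameters map to a simple predicate over stored entries. A sketch with a pared-down entry type (the real handler operates on `JobLogEntry` and the `JobLogsRequest` fields; the assumption that higher `level` values mean more severe is mine, not the spec's):

```rust
// Pared-down stand-in for JobLogEntry, just enough to show the filter.
struct Entry {
    timestamp_ns: i64, // Unix timestamp (nanoseconds), as in JobLogsRequest
    level: i32,        // LogLevel enum value; higher = more severe (assumed)
}

// Keep entries at or after `since_ns`, at or above `min_level`, up to `limit`.
fn filter_logs(entries: &[Entry], since_ns: i64, min_level: i32, limit: usize) -> Vec<&Entry> {
    entries
        .iter()
        .filter(|e| e.timestamp_ns >= since_ns && e.level >= min_level)
        .take(limit)
        .collect()
}

fn main() {
    let entries = vec![
        Entry { timestamp_ns: 100, level: 1 }, // too early
        Entry { timestamp_ns: 200, level: 0 }, // below min level
        Entry { timestamp_ns: 300, level: 2 }, // kept
    ];
    assert_eq!(filter_logs(&entries, 150, 1, 10).len(), 1);
}
```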

**Prometheus Metrics Scraping**:
```
GET /api/v1/metrics
```
Response: All metrics in Prometheus exposition format.

## Configuration

### Environment Variables
```bash
# Log storage configuration
DATABUILD_LOGS_DIR=/logs/databuild        # Log storage directory

# Processing configuration
DATABUILD_LOG_REFRESH_INTERVAL_MS=1000    # Fixed refresh interval (1s)
DATABUILD_LOG_CACHE_SIZE=100              # LRU cache size for job logs
```
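
Resolving these is one lookup per variable with a fallback to the documented default; a sketch (function names are illustrative):

```rust
use std::env;

// Resolve configuration from the environment, falling back to the
// documented defaults when a variable is unset or unparsable.
fn logs_dir() -> String {
    env::var("DATABUILD_LOGS_DIR").unwrap_or_else(|_| "/logs/databuild".to_string())
}

fn refresh_interval_ms() -> u64 {
    env::var("DATABUILD_LOG_REFRESH_INTERVAL_MS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(1000)
}

fn main() {
    println!("logs dir: {}", logs_dir());
    println!("refresh every {} ms", refresh_interval_ms());
}
```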

## Implementation Phases

### Phase 1: Core Implementation [COMPLETED ✅]
**Goal**: Basic log consumption and storage, with a REST API for log retrieval and Prometheus metrics.

**Deliverables** ✅:
- ✅ Fix the `JobLogEntry` protobuf interface (partition_ref → outputs)
- ✅ LogCollector with JSONL file writing and graph-level job_label enrichment
- ✅ LogReader with a unified protobuf interface for CLI/Service consistency
- ✅ REST API endpoints for job logs and Prometheus metrics
- ✅ MetricsAggregator with cardinality-safe output (job labels, not partition refs)
- ✅ Centralized metric templates module

**Success Criteria** ✅:
- ✅ Job logs are captured and stored reliably via the LogCollector integration
- ✅ The REST API can retrieve logs by job run ID and time range, with filtering
- ✅ Prometheus metrics are exposed at the `/api/v1/metrics` endpoint without cardinality issues
- ✅ The system handles concurrent job execution without data corruption (separate JSONL files per job)
- ✅ All metric names/labels are defined in a central location (`metric_templates.rs`)

### Phase 2: Advanced Features [FUTURE]
**Goal**: Performance optimizations and production features.

**Deliverables**:
- Adaptive batching based on system load
- Automatic log retention and cleanup
- Web app integration for log viewing
- Rate limiting for high-volume jobs
- Performance monitoring and alerting

## Testing Strategy

### Core Tests (90% Coverage, Maximum Simplicity) [PHASE 1]

**Unit Tests**:
- JSONL parsing and serialization (basic happy path)
- Metrics aggregation and Prometheus formatting (template correctness)
- API endpoint responses (log retrieval by job_run_id)

**Integration Tests**:
- End-to-end: wrapper stdout → JSONL file → API response
- Concurrent job log collection (2-3 jobs simultaneously)
- Prometheus metrics scraping endpoint

**Key Principle**: Tests should be simple and focus on core workflows. Avoid testing edge cases that may change as requirements evolve.

## Future Extensions

### Performance Optimizations [FUTURE]
- Adaptive refresh intervals based on load
- Log compression for storage efficiency
- Advanced caching strategies

### Production Features [FUTURE]
- Automatic log retention and cleanup policies
- Integration with external log collection tools
- Web app log viewing and search capabilities

### Monitoring Integration [FUTURE]
- Grafana dashboard templates
- Alerting on log system health
- Performance metrics for the log processing pipeline

## Success Criteria

1. **Reliable Log Capture**: All job wrapper output captured without loss
2. **API Functionality**: Can retrieve logs by job run ID and time range
3. **Safe Metrics**: Prometheus endpoint works without cardinality explosion
4. **Correctness**: No duplicated metric templates; centralized definitions
5. **Concurrent Safety**: Multiple jobs can write logs simultaneously without corruption
6. **Simple Testing**: Test suite covers core functionality with minimal brittleness