databuild/plans/graph-side-log-consumption.md
Stuart Axelbrooke cccfbd1133
Add plan for graph-side log consumption
2025-07-28 20:56:07 -07:00


Graph-Side Log Consumption Plan

Status

  • Phase 0: Design [DONE]
  • Phase 1: Core Implementation [FUTURE]
  • Phase 2: Advanced Features [FUTURE]

Required Reading

Before implementing this plan, engineers should thoroughly understand these design documents:

Overview

This plan describes the graph-side implementation for consuming structured logs emitted by the job wrapper. The job wrapper emits JobLogEntry protobuf messages to stdout during job execution. The graph must consume these logs to provide log retrieval by job run ID and expose metrics for Prometheus scraping.

Key Technical Decisions

1. Storage Strategy: On-Disk with BEL Separation

Decision: Store structured logs on disk separate from the Build Event Log (BEL).

Motivation:

  • Log volumes can be legitimately large and would place undue stress on the BEL-backing datastore
  • BEL is optimized for event-sourcing patterns, not high-volume log queries
  • Separate storage allows independent scaling and retention policies

2. File Organization: Date-Organized Structure

Decision: Store logs under a configurable base path in date-organized directories: $LOGS_BASEPATH/YYYY-MM-DD/{job_run_id}.jsonl

Motivation:

  • Enables efficient cleanup by date (future optimization)
  • Simplifies manual log management during development
  • Facilitates external log collection tools (future)
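The path scheme above can be sketched in a few lines; the function name and signature here are illustrative, and the `date` argument is assumed to already be formatted as YYYY-MM-DD by the caller:

```rust
use std::path::PathBuf;

/// Build the JSONL path for a job run under the date-partitioned layout.
/// Hypothetical helper; `date` is the job's start date as YYYY-MM-DD.
fn log_file_path(logs_base: &str, date: &str, job_run_id: &str) -> PathBuf {
    PathBuf::from(logs_base)
        .join(date)
        .join(format!("{job_run_id}.jsonl"))
}
```

Keeping the base path as a plain argument (rather than a global) also makes the layout trivial to exercise in tests.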

3. Static Update Period (Phase 1)

Decision: Use fixed refresh interval for log processing. Adaptive batching is a future optimization.

Motivation:

  • Simplicity for initial implementation
  • Predictable performance characteristics
  • Easier to debug and test
  • Can optimize later based on real usage patterns

4. Manual Log Cleanup (Phase 1)

Decision: No automatic log retention/cleanup in initial implementation.

Motivation:

  • We're in early development phase
  • Manual cleanup acceptable for now
  • Avoids complexity in initial implementation
  • Automatic retention can be added as future optimization

5. Unified Telemetry Stream

Decision: All JobLogEntry messages (logs, metrics, events) flow through the same JSONL files.

Motivation:

  • Simplicity - single consumption pipeline
  • Temporal consistency - metrics and logs naturally correlated
  • Unified file format reduces complexity

6. Cardinality-Safe Prometheus Metrics

Decision: Prometheus metrics will NOT include partition references as labels to avoid cardinality explosion.

Motivation:

  • Partition labels (date × customer × region × etc.) would create massive cardinality
  • Focus on job-level and system-level metrics only
  • Use job_run_id and job_type labels instead of partition-specific labels

7. Centralized Metric Templates for Correctness

Decision: Define all Prometheus metric names and label templates in a central location to avoid string duplication.

Motivation:

  • Prevents implicit coupling via duplicated string templates
  • Single source of truth for metric definitions
  • Easier to maintain consistency across codebase

8. Limited Scope (Phase 1)

Decision: Phase 1 focuses on log retrieval API and Prometheus metrics, excluding web app integration.

Motivation:

  • Web app integration is part of a bigger update
  • Allows focused implementation on core log consumption
  • API-first approach enables multiple consumers

9. Unified Execution Paths

Decision: Both CLI and service builds produce identical BEL events and JSONL logs in the same locations.

Motivation:

  • Building with CLI then querying from service "just works"
  • Single source of truth for all build artifacts
  • Consistent behavior regardless of execution method
  • Simplifies debugging and operational workflows

Interface Issues to Fix

JobLogEntry Protobuf Update Required

The current JobLogEntry definition needs updates:

Current (INCORRECT):

message JobLogEntry {
  string partition_ref = 3;  // Single string
  // ...
}

Required (CORRECT):

message JobLogEntry {
  repeated PartitionRef outputs = 3;  // Multiple PartitionRef objects
  // ...
}

Rationale: Jobs produce multiple partitions, and we should use the proper PartitionRef type for consistency with other interfaces.

Architecture

Storage Layout

/logs/databuild/
├── 2025-01-27/
│   ├── job_run_123abc.jsonl
│   ├── job_run_456def.jsonl
│   └── ...
├── 2025-01-28/
│   └── ...

File Format (JSONL)

Each file contains one JSON object per line, representing a JobLogEntry:

{"timestamp":"2025-01-27T10:30:45Z","job_id":"job_run_123abc","outputs":[{"path":"s3://bucket/dataset/date=2025-01-27"}],"sequence_number":1,"content":{"job_event":{"event_type":"task_launched","metadata":{}}}}
{"timestamp":"2025-01-27T10:30:46Z","job_id":"job_run_123abc","outputs":[{"path":"s3://bucket/dataset/date=2025-01-27"}],"sequence_number":2,"content":{"log":{"level":"INFO","message":"Processing started","fields":{"rows":"1000"}}}}
{"timestamp":"2025-01-27T10:30:50Z","job_id":"job_run_123abc","outputs":[{"path":"s3://bucket/dataset/date=2025-01-27"}],"sequence_number":3,"content":{"metric":{"name":"rows_processed","value":1000,"labels":{"stage":"transform"},"unit":"count"}}}

Consumption Pipeline

Job Wrapper (stdout) → Graph Log Collector → JSONL Files
                                                  ↓
                                    Unified Log Access Layer
                                         ↙            ↘
                                  Service API       CLI API
                                       ↓
                            Metrics Aggregator → /api/v1/metrics

Implementation Components

1. Log Collector [PHASE 1]

Responsibility: Consume job wrapper stdout and write to JSONL files.

struct LogCollector {
    logs_dir: PathBuf,              // /logs/databuild
    active_files: HashMap<String, File>,  // job_run_id -> file handle
}

impl LogCollector {
    fn consume_job_output(&mut self, job_run_id: &str, stdout: &mut BufReader<ChildStdout>) -> Result<()>;
    fn write_log_entry(&mut self, job_run_id: &str, entry: &JobLogEntry) -> Result<()>;
    fn ensure_date_directory(&self) -> Result<PathBuf>;
}
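The heart of consume_job_output is a line-oriented copy loop: each line the wrapper emits is already a serialized JobLogEntry, so the collector can append it verbatim. A minimal sketch, generic over reader and writer for testability (the name copy_job_output is illustrative; per-entry validation, error recovery, and file-handle management are elided):

```rust
use std::io::{BufRead, Write};

/// Copy JSONL lines from the wrapper's stdout into the job's log file.
/// Returns the number of entries written; blank lines are skipped.
fn copy_job_output<R: BufRead, W: Write>(stdout: R, file: &mut W) -> std::io::Result<u64> {
    let mut lines_written = 0u64;
    for line in stdout.lines() {
        let line = line?;
        if line.is_empty() {
            continue; // skip blank lines between entries
        }
        writeln!(file, "{line}")?;
        lines_written += 1;
    }
    file.flush()?;
    Ok(lines_written)
}
```

In production the reader would be the `BufReader<ChildStdout>` from the struct above and the writer an entry in `active_files`.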

2. Unified Log Access Layer [PHASE 1]

Responsibility: Provide common interface for reading logs from JSONL files, used by both service and CLI.

// Core log access implementation
struct LogReader {
    logs_base_path: PathBuf,
}

impl LogReader {
    fn get_job_logs(&self, request: &JobLogsRequest) -> Result<JobLogsResponse>;
    fn list_available_jobs(&self, date_range: Option<(String, String)>) -> Result<Vec<String>>;
    fn get_job_metrics(&self, job_run_id: &str) -> Result<Vec<MetricPoint>>;
}
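Because job run IDs are the file stems in the date-partitioned layout, list_available_jobs reduces to a directory walk. A sketch without the optional date-range filter (std-only; error handling is minimal):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Walk each YYYY-MM-DD directory under the base path and collect the
/// file stems of the .jsonl files, which are the job run IDs.
fn list_available_jobs(logs_base: &Path) -> io::Result<Vec<String>> {
    let mut job_run_ids = Vec::new();
    for date_entry in fs::read_dir(logs_base)? {
        let date_dir = date_entry?.path();
        if !date_dir.is_dir() {
            continue; // ignore stray files at the base level
        }
        for file_entry in fs::read_dir(&date_dir)? {
            let path = file_entry?.path();
            if path.extension().and_then(|e| e.to_str()) == Some("jsonl") {
                if let Some(stem) = path.file_stem().and_then(|s| s.to_str()) {
                    job_run_ids.push(stem.to_string());
                }
            }
        }
    }
    job_run_ids.sort();
    Ok(job_run_ids)
}
```

A date range would simply restrict which YYYY-MM-DD directories the outer loop visits, since the directory names sort lexicographically in date order.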

Protobuf Interface (ensures CLI/Service consistency):

message JobLogsRequest {
  string job_run_id = 1;
  int64 since_timestamp = 2;  // Unix timestamp (nanoseconds)
  int32 min_level = 3;        // LogLevel enum value
  uint32 limit = 4;
}

message JobLogsResponse {
  repeated JobLogEntry entries = 1;
  bool has_more = 2;
}
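Applying these request fields is straightforward once entries are parsed; a sketch of the filter step, where the Entry struct stands in for the generated JobLogEntry type and models only the two fields the filters touch:

```rust
/// Stand-in for the generated JobLogEntry: timestamp in Unix
/// nanoseconds, level as the LogLevel enum's integer value.
struct Entry {
    timestamp_ns: i64,
    level: i32,
}

/// Apply since_timestamp, min_level, and limit from JobLogsRequest;
/// the bool is the has_more flag for JobLogsResponse.
fn filter_entries(
    entries: Vec<Entry>,
    since_ns: i64,
    min_level: i32,
    limit: usize,
) -> (Vec<Entry>, bool) {
    let mut matched: Vec<Entry> = entries
        .into_iter()
        .filter(|e| e.timestamp_ns >= since_ns && e.level >= min_level)
        .collect();
    let has_more = matched.len() > limit;
    matched.truncate(limit);
    (matched, has_more)
}
```

Computing has_more before truncating is what lets clients page through long logs without a separate count query.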

3. Metrics Templates [PHASE 1]

Responsibility: Centralized metric definitions to avoid string duplication.

// Central location for all metric definitions
mod metric_templates {
    pub const JOB_RUNTIME_SECONDS: &str = "databuild_job_runtime_seconds";
    pub const JOB_MEMORY_PEAK_MB: &str = "databuild_job_memory_peak_mb";
    pub const JOB_CPU_TOTAL_SECONDS: &str = "databuild_job_cpu_total_seconds";
    pub const ROWS_PROCESSED_TOTAL: &str = "databuild_rows_processed_total";
    
    pub fn job_labels(job_run_id: &str, job_type: &str) -> HashMap<String, String> {
        let mut labels = HashMap::new();
        labels.insert("job_run_id".to_string(), job_run_id.to_string());
        labels.insert("job_type".to_string(), job_type.to_string());
        labels
    }
}

4. Metrics Aggregator [PHASE 1]

Responsibility: Process MetricPoint messages and expose Prometheus format with safe cardinality.

struct MetricsAggregator {
    metrics: HashMap<String, MetricFamily>,
}

impl MetricsAggregator {
    fn ingest_metric(&mut self, metric: &MetricPoint, job_run_id: &str, job_type: &str);
    fn generate_prometheus_output(&self) -> String;
}

Safe Prometheus Output (job-level labels only, NO partition labels):

# HELP databuild_job_runtime_seconds Job execution time in seconds
# TYPE databuild_job_runtime_seconds gauge
databuild_job_runtime_seconds{job_run_id="job_run_123abc",job_type="transform"} 45.2

# HELP databuild_rows_processed_total Total rows processed by job
# TYPE databuild_rows_processed_total counter
databuild_rows_processed_total{job_run_id="job_run_123abc",job_type="transform"} 1000
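One way generate_prometheus_output might render a single sample in exposition format; render_sample is an illustrative name, and a BTreeMap keeps label order deterministic so output is stable across scrapes:

```rust
use std::collections::BTreeMap;

/// Render one metric in Prometheus exposition format: HELP line,
/// TYPE line, then the sample with sorted labels.
fn render_sample(
    name: &str,
    help: &str,
    kind: &str,
    labels: &BTreeMap<&str, &str>,
    value: f64,
) -> String {
    let label_str = labels
        .iter()
        .map(|(k, v)| format!("{k}=\"{v}\""))
        .collect::<Vec<_>>()
        .join(",");
    format!("# HELP {name} {help}\n# TYPE {name} {kind}\n{name}{{{label_str}}} {value}\n")
}
```

The name and help strings would come from the metric_templates module above, never from ad hoc string literals at call sites.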

API Implementation

REST Endpoints [PHASE 1]

Get Job Logs:

GET /api/v1/jobs/{job_run_id}/logs?since={timestamp}&level={log_level}

Response: JobLogsResponse containing the matching JobLogEntry objects, with filtering support.

Prometheus Metrics Scraping:

GET /api/v1/metrics

Response: All metrics in Prometheus exposition format.

Configuration

Environment Variables

# Log storage configuration
DATABUILD_LOGS_DIR=/logs/databuild                    # Log storage directory

# Processing configuration  
DATABUILD_LOG_REFRESH_INTERVAL_MS=1000                # Fixed refresh interval (1s)
DATABUILD_LOG_CACHE_SIZE=100                          # LRU cache size for job logs
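Loading these with sensible fallbacks is a few lines of std; a sketch where the function names are illustrative and unparseable values fall back to the documented defaults:

```rust
use std::env;

/// Refresh interval in milliseconds; defaults to 1000 (1s) when the
/// variable is unset or unparseable.
fn log_refresh_interval_ms() -> u64 {
    env::var("DATABUILD_LOG_REFRESH_INTERVAL_MS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(1000)
}

/// Log storage directory; defaults to /logs/databuild.
fn logs_dir() -> String {
    env::var("DATABUILD_LOGS_DIR").unwrap_or_else(|_| "/logs/databuild".to_string())
}
```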

Implementation Phases

Phase 1: Core Implementation [FUTURE]

Goal: Basic log consumption and storage with REST API for log retrieval and Prometheus metrics.

Deliverables:

  • Fix JobLogEntry protobuf interface (partition_ref → outputs)
  • LogCollector with JSONL file writing
  • Log processing loop with fixed refresh interval (no adaptive batching)
  • REST API endpoints for job logs and Prometheus metrics
  • MetricsAggregator with cardinality-safe output
  • Centralized metric templates module

Success Criteria:

  • Job logs are captured and stored reliably
  • REST API can retrieve logs by job run ID and time range
  • Prometheus metrics are exposed at /api/v1/metrics endpoint without cardinality issues
  • System handles concurrent job execution without data corruption
  • All metric names/labels are defined in central location

Phase 2: Advanced Features [FUTURE]

Goal: Performance optimizations and production features.

Deliverables:

  • Adaptive batching based on system load
  • Automatic log retention and cleanup
  • Web app integration for log viewing
  • Rate limiting for high-volume jobs
  • Performance monitoring and alerting

Testing Strategy

Core Tests (90% Coverage, Maximum Simplicity) [PHASE 1]

Unit Tests:

  • JSONL parsing and serialization (basic happy path)
  • Metrics aggregation and Prometheus formatting (template correctness)
  • API endpoint responses (log retrieval by job_run_id)

Integration Tests:

  • End-to-end: wrapper stdout → JSONL file → API response
  • Concurrent job log collection (2-3 jobs simultaneously)
  • Prometheus metrics scraping endpoint

Key Principle: Tests should be simple and focus on core workflows. Avoid testing edge cases that may change as requirements evolve.

Future Extensions

Performance Optimizations [FUTURE]

  • Adaptive refresh intervals based on load
  • Log compression for storage efficiency
  • Advanced caching strategies

Production Features [FUTURE]

  • Automatic log retention and cleanup policies
  • Integration with external log collection tools
  • Web app log viewing and search capabilities

Monitoring Integration [FUTURE]

  • Grafana dashboard templates
  • Alerting on log system health
  • Performance metrics for log processing pipeline

Success Criteria

  1. Reliable Log Capture: All job wrapper output captured without loss
  2. API Functionality: Can retrieve logs by job run ID and time range
  3. Safe Metrics: Prometheus endpoint works without cardinality explosion
  4. Correctness: No duplicated metric templates, centralized definitions
  5. Concurrent Safety: Multiple jobs can write logs simultaneously without corruption
  6. Simple Testing: Test suite covers core functionality with minimal brittleness