Add plan for implementing vnext of job wrapper
This commit is contained in:
parent d19c14aac3
commit 0810c82e7d

4 changed files with 348 additions and 30 deletions

@@ -39,6 +39,8 @@ The `databuild_job` rule expects to reference a binary that adheres to the follo

- For the `config` subcommand, it prints the JSON job config to stdout based on the requested partitions, e.g. for a binary `bazel-bin/my_binary`, it prints a valid job config when called like `bazel-bin/my_binary config my_dataset/color=red my_dataset/color=blue`.

- For the `exec` subcommand, it produces the partitions that were requested in the `config` call, when run under the job config that call produced. E.g., if `config` had produced `{..., "args": ["red", "blue"], "env": {"MY_ENV": "foo"}}`, then calling `MY_ENV=foo bazel-bin/my_binary exec red blue` should produce partitions `my_dataset/color=red` and `my_dataset/color=blue`.

Jobs are executed via a wrapper component that provides observability, error handling, and standardized communication with the graph. The wrapper captures all job output as structured logs, enabling comprehensive monitoring without requiring jobs to have network connectivity.
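
Concretely, the config printed for the example above might look like the following sketch. Only `args` and `env` appear in this document; the elided fields are intentionally left unspecified:

```json
{
  "args": ["red", "blue"],
  "env": {"MY_ENV": "foo"}
}
```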

### Graph

The `databuild_graph` rule expects two fields, `jobs` and `lookup`:
@@ -12,8 +12,11 @@ Purpose: Centralize the build logic and semantics in a performant, correct core.

- Graph-based composition is the basis for databuild application [deployment](./deploy-strategies.md)

## Jobs

Jobs are the atomic unit of work in databuild, executed via a Rust-based wrapper that provides:

- Structured logging and telemetry collection
- Platform-agnostic execution across local, container, and cloud environments
- Zero-network-dependency operation via log-based communication
- Standardized error handling and exit code categorization

### `job.config`

Purpose: Enable planning of the execution graph. Executed in-process when possible for speed. For interface details, see
@@ -50,17 +53,18 @@ trait DataBuildJob {

#### `job.exec` State Diagram

```mermaid
flowchart TD
begin((begin)) --> wrapper_validate_config
emit_job_exec_fail --> fail((fail))
wrapper_validate_config -- fail --> emit_config_validate_fail --> emit_job_exec_fail
wrapper_validate_config -- success --> emit_config_validate_success --> wrapper_launch_task
wrapper_launch_task -- fail --> emit_task_launch_fail --> emit_job_exec_fail
wrapper_launch_task -- success --> emit_task_launch_success --> wrapper_monitor_task
wrapper_monitor_task -- heartbeat timer --> emit_heartbeat --> wrapper_monitor_task
wrapper_monitor_task -- job stderr --> emit_log_entry --> wrapper_monitor_task
wrapper_monitor_task -- job stdout --> emit_log_entry --> wrapper_monitor_task
wrapper_monitor_task -- non-zero exit --> emit_task_failed --> emit_job_exec_fail
wrapper_monitor_task -- zero exit --> emit_task_success --> emit_partition_manifest
emit_partition_manifest --> success((success))
```

## Graphs
@@ -1,19 +1,47 @@

# Observability

## Purpose

Provide comprehensive, platform-agnostic observability for DataBuild applications through standardized job wrapper telemetry.

## Architecture

### Wrapper-Based Observability

All observability flows through the job wrapper:

- **Jobs** emit application logs to stdout/stderr
- **Wrapper** captures and enriches with structured metadata
- **Graph** parses structured logs into metrics, events, and monitoring data
- [**BEL**](./build-event-log.md) stores aggregated telemetry for historical analysis

### Communication Protocol

Log-based telemetry using protobuf-defined structured messages:

- LogMessage: Application stdout/stderr with metadata
- MetricPoint: StatsD-style metrics with labels
- JobEvent: State transitions and system events
- PartitionManifest: Job completion with output metadata

## Implementation

### Metrics Collection

- Format: StatsD-like, embedded in structured logs
- Aggregation: Graph components collect and expose via Prometheus
- Storage: Summary metrics stored in BEL for historical analysis
- Scope: Job execution, resource usage, partition metadata

### Logging

- Capture: All job stdout/stderr via wrapper
- Enhancement: Automatic injection of job_id, partition_ref, timestamps
- Format: Structured JSON for consistent parsing
- Retention: Platform-dependent (container logs, cloud logging APIs)

### Monitoring

- Heartbeats: 30-second intervals with resource utilization
- Health: Exit code categorization and failure analysis
- Alerting: Standard Prometheus/alertmanager integration
- Debugging: Complete log trails for job troubleshooting

### Platform Integration

- **Local**: Direct stdout pipe reading
- **Docker**: Container log persistence and `docker logs`
- **Kubernetes**: Pod logs API with configurable retention
- **Cloud**: Platform logging services (CloudWatch, Cloud Logging)

plans/job-wrapper.md (new file, 284 lines)

@@ -0,0 +1,284 @@

# Job Wrapper v2 Plan

## Overview

The job wrapper is a critical component that mediates between DataBuild graphs and job executables, providing observability, error handling, and state management. This plan describes the next-generation job wrapper implementation in Rust.

## Architecture

### Core Design Principles

1. **Single Communication Channel**: Jobs communicate with graphs exclusively through structured logs
2. **Platform Agnostic**: Works identically across local, Docker, K8s, and cloud platforms
3. **Zero Network Requirements**: Jobs don't need to connect to any services
4. **Fail-Safe**: Graceful handling of job crashes and fast completions

### Communication Model

```
Graph → Job: Launch with JobConfig (via CLI args/env)
Job → Graph: Structured logs (stdout)
Graph: Tails logs and interprets them into metrics, events, and manifests
```

## Structured Log Protocol

### Message Format (Protobuf)

```proto
message JobLogEntry {
  string timestamp = 1;
  string job_id = 2;
  string partition_ref = 3;
  uint64 sequence_number = 4; // Monotonic sequence starting from 1

  oneof content {
    LogMessage log = 5;
    MetricPoint metric = 6;
    JobEvent event = 7;
    PartitionManifest manifest = 8;
  }
}

message LogMessage {
  enum LogLevel {
    DEBUG = 0;
    INFO = 1;
    WARN = 2;
    ERROR = 3;
  }
  LogLevel level = 1;
  string message = 2;
  map<string, string> fields = 3;
}

message MetricPoint {
  string name = 1;
  double value = 2;
  map<string, string> labels = 3;
  string unit = 4;
}

message JobEvent {
  string event_type = 1; // "task_launched", "heartbeat", "task_completed", etc.
  google.protobuf.Any details = 2;
  map<string, string> metadata = 3;
}
```
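
Until the protobuf schemas land, Phase 0 uses plain JSON logging. A `JobLogEntry` carrying a `LogMessage` might then be encoded like the following sketch; the field values and the exact JSON mapping are illustrative, not specified here:

```json
{
  "timestamp": "2025-01-27T10:30:45Z",
  "job_id": "job-123",
  "partition_ref": "my_dataset/color=red",
  "sequence_number": 3,
  "content": {
    "log": {
      "level": "INFO",
      "message": "processing partition",
      "fields": {"stage": "transform"}
    }
  }
}
```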

### Log Stream Lifecycle

1. Wrapper emits `job_config_started` event (sequence #1)
2. Wrapper validates configuration
3. Wrapper emits `task_launched` event (sequence #2)
4. Job executes; wrapper captures stdout/stderr (sequence #3+)
5. Wrapper emits periodic `heartbeat` events (every 30s)
6. Wrapper detects job completion
7. Wrapper emits `PartitionManifest` message (final required message, with the highest sequence number)
8. Wrapper exits

The PartitionManifest serves as the implicit end-of-logs marker: the graph knows processing is complete when it sees this message. Sequence numbers enable the graph to detect missing or out-of-order messages and ensure reliable telemetry collection.
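
The graph-side consumer can use the sequence numbers to detect gaps. A minimal sketch, with illustrative names not taken from the plan:

```rust
/// Tracks the last sequence number seen for one job's log stream and
/// reports any gap, which indicates dropped or out-of-order messages.
pub struct SequenceTracker {
    last_seen: u64, // sequence numbers start at 1, so 0 means "none yet"
}

impl SequenceTracker {
    pub fn new() -> Self {
        SequenceTracker { last_seen: 0 }
    }

    /// Observe the next entry's sequence number. Returns Ok(()) if it is
    /// exactly last_seen + 1, otherwise the inclusive range of sequence
    /// numbers that went missing.
    pub fn observe(&mut self, sequence_number: u64) -> Result<(), (u64, u64)> {
        let expected = self.last_seen + 1;
        if sequence_number == expected {
            self.last_seen = sequence_number;
            Ok(())
        } else if sequence_number > expected {
            // Gap: entries expected..=sequence_number-1 never arrived.
            let gap = (expected, sequence_number - 1);
            self.last_seen = sequence_number;
            Err(gap)
        } else {
            // Out-of-order or duplicate entry; report it as a degenerate gap.
            Err((sequence_number, sequence_number))
        }
    }
}
```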

## Wrapper Implementation

### Interfaces

```rust
trait JobWrapper {
    // Config mode - accepts PartitionRef objects
    fn config(outputs: Vec<PartitionRef>) -> Result<JobConfig>;

    // Exec mode - accepts serialized JobConfig
    fn exec(config: JobConfig) -> Result<()>;
}
```
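
The two modes map onto the binary's subcommands. A sketch of the argument dispatch, assuming exec receives the serialized JobConfig as a single argument (an illustrative choice; all names here are hypothetical):

```rust
/// The two wrapper entrypoints, selected by the first CLI argument.
#[derive(Debug, PartialEq)]
pub enum Mode {
    /// `wrapper config <partition_ref>...`
    Config { outputs: Vec<String> },
    /// `wrapper exec <serialized_job_config>`
    Exec { config: String },
}

/// Parse argv (excluding the program name) into a wrapper mode.
pub fn parse_mode(args: &[String]) -> Result<Mode, String> {
    match args.split_first() {
        Some((cmd, rest)) => match cmd.as_str() {
            "config" => Ok(Mode::Config { outputs: rest.to_vec() }),
            "exec" => match rest {
                [config] => Ok(Mode::Exec { config: config.clone() }),
                _ => Err("exec takes exactly one serialized JobConfig".to_string()),
            },
            other => Err(format!("unknown subcommand: {other}")),
        },
        None => Err("expected `config` or `exec` subcommand".to_string()),
    }
}
```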

### Exit Code Standards

Following POSIX conventions and avoiding collisions with standard exit codes:

Reference:
- https://manpages.ubuntu.com/manpages/noble/man3/sysexits.h.3head.html
- https://tldp.org/LDP/abs/html/exitcodes.html

```rust
// Standard POSIX codes we respect:
// 0  - Success
// 1  - General error
// 2  - Misuse of shell builtin
// 64 - Command line usage error (EX_USAGE)
// 65 - Data format error (EX_DATAERR)
// 66 - Cannot open input (EX_NOINPUT)
// 69 - Service unavailable (EX_UNAVAILABLE)
// 70 - Internal software error (EX_SOFTWARE)
// 71 - System error (EX_OSERR)
// 73 - Can't create output file (EX_CANTCREAT)
// 74 - Input/output error (EX_IOERR)
// 75 - Temp failure; retry (EX_TEMPFAIL)
// 77 - Permission denied (EX_NOPERM)
// 78 - Configuration error (EX_CONFIG)

// DataBuild-specific codes (100+ to avoid collisions):
// 100-109 - User-defined permanent failures
// 110-119 - User-defined transient failures
// 120-129 - User-defined resource failures
// 130+    - Other user-defined codes

enum ExitCodeCategory {
    Success,          // 0
    StandardError,    // 1-63 (shell/system)
    PosixError,       // 64-78 (sysexits.h)
    TransientFailure, // 75 (EX_TEMPFAIL) or 110-119
    UserDefined,      // 100+
}
```
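
The categories above overlap (75 falls in both the POSIX and transient ranges). A sketch of how the wrapper might resolve a raw exit code, checking the transient ranges first so retryable failures win; the precedence and the handling of unassigned codes (79-99, negatives) are illustrative choices, not specified by this plan:

```rust
#[derive(Debug, PartialEq)]
pub enum ExitCodeCategory {
    Success,          // 0
    StandardError,    // 1-63 (shell/system)
    PosixError,       // 64-78 (sysexits.h)
    TransientFailure, // 75 (EX_TEMPFAIL) or 110-119
    UserDefined,      // 100+
}

/// Map a raw process exit code to its category. Transient ranges are
/// checked first so 75 (EX_TEMPFAIL) and 110-119 are reported as
/// retryable rather than as generic POSIX/user errors.
pub fn categorize(code: i32) -> ExitCodeCategory {
    match code {
        0 => ExitCodeCategory::Success,
        75 | 110..=119 => ExitCodeCategory::TransientFailure,
        1..=63 => ExitCodeCategory::StandardError,
        64..=78 => ExitCodeCategory::PosixError,
        c if c >= 100 => ExitCodeCategory::UserDefined,
        // 79-99 and negative codes are unassigned; fold into StandardError.
        _ => ExitCodeCategory::StandardError,
    }
}
```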

## Platform-Specific Log Handling

### Local Execution
- Graph spawns wrapper process
- Graph reads from stdout pipe directly
- PartitionManifest indicates completion

### Docker
- Graph runs `docker run` with wrapper as entrypoint
- Graph uses `docker logs -f` to tail output
- Logs persist after container exit

### Kubernetes
- Job pods use wrapper as container entrypoint
- Graph tails logs via K8s API
- Configure `terminationGracePeriodSeconds` for log retention

### Cloud Run / Lambda
- Wrapper logs to platform logging service
- Graph queries logs via platform API
- Natural buffering and persistence

## Observability Features

### Metrics Collection

For metrics, we'll use a simplified StatsD-like format in our structured logs, which the graph can aggregate and expose via Prometheus format:

```json
{
  "timestamp": "2025-01-27T10:30:45Z",
  "content": {
    "metric": {
      "name": "rows_processed",
      "value": 1500000,
      "labels": {
        "partition": "date=2025-01-27",
        "stage": "transform"
      },
      "unit": "count"
    }
  }
}
```

The graph component will:
- Aggregate metrics from job logs
- Expose them in Prometheus format for scraping (when running as a service)
- Store summary metrics in the BEL for historical analysis

For CLI-invoked builds, metrics are still captured in the BEL but not exposed for scraping (which is acceptable since these are typically one-off runs).
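
A sketch of the graph-side rendering step, turning collected metric points into the Prometheus text exposition format; the struct and function names are illustrative:

```rust
use std::collections::BTreeMap;

/// A metric sample parsed out of a job's structured logs.
pub struct MetricPoint {
    pub name: String,
    pub value: f64,
    pub labels: BTreeMap<String, String>, // BTreeMap keeps label order stable
}

/// Render metric points in the Prometheus text exposition format,
/// e.g. `rows_processed{stage="transform"} 1500000`.
pub fn render_prometheus(points: &[MetricPoint]) -> String {
    let mut out = String::new();
    for p in points {
        let labels: Vec<String> = p
            .labels
            .iter()
            .map(|(k, v)| format!("{k}=\"{v}\""))
            .collect();
        if labels.is_empty() {
            out.push_str(&format!("{} {}\n", p.name, p.value));
        } else {
            out.push_str(&format!("{}{{{}}} {}\n", p.name, labels.join(","), p.value));
        }
    }
    out
}
```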

### Heartbeating

Fixed 30-second heartbeat interval (based on Kubernetes best practices):

```json
{
  "timestamp": "2025-01-27T10:30:45Z",
  "content": {
    "event": {
      "event_type": "heartbeat",
      "metadata": {
        "memory_usage_mb": "1024",
        "cpu_usage_percent": "85.2"
      }
    }
  }
}
```

### Log Bandwidth Limits

To prevent log flooding:
- Maximum log rate: 1000 messages/second
- Maximum message size: 1MB
- If limits exceeded: Wrapper emits rate limit warning and drops messages
- Final metrics show dropped message count
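
The rate limit with drop counting can be sketched as a fixed one-second window counter (the windowing strategy here is an illustrative assumption):

```rust
/// Drops messages beyond `max_per_window` in each fixed one-second window
/// and counts how many were dropped, for reporting in final metrics.
pub struct RateLimiter {
    max_per_window: u64,
    window_start_secs: u64, // timestamp of the current window, in seconds
    emitted_in_window: u64,
    pub dropped_total: u64,
}

impl RateLimiter {
    pub fn new(max_per_window: u64) -> Self {
        RateLimiter {
            max_per_window,
            window_start_secs: 0,
            emitted_in_window: 0,
            dropped_total: 0,
        }
    }

    /// Returns true if a message arriving at `now_secs` may be emitted.
    pub fn allow(&mut self, now_secs: u64) -> bool {
        if now_secs != self.window_start_secs {
            // New one-second window: reset the emission counter.
            self.window_start_secs = now_secs;
            self.emitted_in_window = 0;
        }
        if self.emitted_in_window < self.max_per_window {
            self.emitted_in_window += 1;
            true
        } else {
            self.dropped_total += 1;
            false
        }
    }
}
```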

## Testing Strategy

### Unit Tests
- Log parsing and serialization
- Exit code categorization
- Rate limiting behavior
- State machine transitions

### Integration Tests
- Full job execution lifecycle
- Platform-specific log tailing
- Fast job completion handling
- Large log volume handling

### Platform Tests
- Local process execution
- Docker container runs
- Kubernetes job pods
- Cloud Run invocations

### Failure Scenario Tests
- Job crashes (SIGSEGV, SIGKILL)
- Wrapper crashes
- Log tailing interruptions
- Platform-specific failures

## Implementation Phases

### Phase 0: Minimal Bootstrap
Implement the absolute minimum to unblock development and testing:
- Simple JSON-based logging (no protobuf yet)
- Basic wrapper that only handles the happy path
- Support for local execution only
- Minimal log parsing in the graph
- Integration with existing example jobs

This phase delivers a working end-to-end system that can be continuously evolved.

### Phase 1: Core Protocol
- Define protobuf schemas
- Implement structured logger
- Add error handling and exit codes
- Implement heartbeating
- Graph-side log parser improvements

### Phase 2: Platform Support
- Docker integration
- Kubernetes support
- Cloud platform adapters
- Platform-specific testing

### Phase 3: Production Hardening
- Rate limiting
- Error recovery
- Performance optimization
- Monitoring integration

### Phase 4: Advanced Features
- In-process config for library jobs
- Custom metrics backends
- Advanced failure analysis

## Success Criteria

1. **Zero Network Dependencies**: Jobs run without any network access
2. **Platform Parity**: Identical behavior across all execution platforms
3. **Minimal Overhead**: < 100ms wrapper overhead for config, < 1s for exec
4. **Complete Observability**: All job state changes captured in logs
5. **Graceful Failures**: No log data loss, even in crash scenarios

## Next Steps

1. Implement minimal bootstrap wrapper
2. Test with existing example jobs
3. Iterate on log format based on real usage
4. Gradually add features per implementation phases