Observability

Purpose

Provide comprehensive, platform-agnostic observability for DataBuild applications through standardized job wrapper telemetry.

Architecture

Wrapper-Based Observability

All observability flows through the job wrapper:

  • Jobs emit application logs to stdout/stderr
  • Wrapper captures this output and enriches it with structured metadata
  • Graph parses structured logs into metrics, events, and monitoring data
  • BEL stores aggregated telemetry for historical analysis
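
The capture step of this flow can be sketched as follows. This is a minimal illustration, not DataBuild's wrapper: the function names `run_wrapped` and `_pump`, the record fields, and the `sink` callback are all hypothetical, and JSON lines stand in for whatever encoding the real wrapper uses.

```python
import json
import subprocess
import sys
import threading


def _pump(stream, stream_name, job_id, sink, lock):
    """Forward each line of one pipe to the sink as a JSON record."""
    for line in stream:
        record = {
            "job_id": job_id,            # hypothetical field names
            "stream": stream_name,
            "message": line.rstrip("\n"),
        }
        with lock:  # serialize sink calls made from the two reader threads
            sink(json.dumps(record))


def run_wrapped(argv, job_id, sink):
    """Run a job, capture stdout and stderr separately, return its exit code."""
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    lock = threading.Lock()
    threads = [
        threading.Thread(target=_pump, args=(proc.stdout, "stdout", job_id, sink, lock)),
        threading.Thread(target=_pump, args=(proc.stderr, "stderr", job_id, sink, lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # both pipes reach EOF once the job exits
    return proc.wait()
```

Calling `run_wrapped([sys.executable, "-c", "print('hi')"], "job-1", print)` would print one JSON record per captured line. Reading each pipe on its own thread avoids the deadlock that can occur when one pipe fills while the other is being read.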

Communication Protocol

Log-based telemetry using protobuf-defined structured messages:

  • LogMessage: Application stdout/stderr with metadata
  • MetricPoint: StatsD-style metrics with labels
  • JobEvent: State transitions and system events
  • PartitionManifest: Job completion with output metadata
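
To make the four message kinds concrete, here is a rough sketch of a tagged envelope. The real messages are protobuf-defined; Python dataclasses and JSON are used here only as a stand-in, and every field name, along with the `encode` and `decode` helpers, is an assumption rather than part of the actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Union


@dataclass
class LogMessage:
    job_id: str
    stream: str  # "stdout" or "stderr"
    text: str


@dataclass
class MetricPoint:
    job_id: str
    name: str
    value: float
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class JobEvent:
    job_id: str
    event: str  # hypothetical values such as "started" or "failed"


@dataclass
class PartitionManifest:
    job_id: str
    partitions: List[str] = field(default_factory=list)


Message = Union[LogMessage, MetricPoint, JobEvent, PartitionManifest]

# Lookup table from the type tag on the wire back to the class.
_TYPES = {
    cls.__name__: cls
    for cls in (LogMessage, MetricPoint, JobEvent, PartitionManifest)
}


def encode(msg: Message) -> str:
    """Serialize a message as one JSON line tagged with its type name."""
    return json.dumps({"type": type(msg).__name__, "body": asdict(msg)})


def decode(line: str) -> Message:
    """Rebuild the typed message from a line produced by encode()."""
    envelope = json.loads(line)
    return _TYPES[envelope["type"]](**envelope["body"])
```

The type tag lets a consumer dispatch on message kind after reading a single line, which is the property a log-based protocol needs regardless of the concrete encoding.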

Implementation

Metrics Collection

  • Format: StatsD-like metric lines embedded in structured logs
  • Aggregation: Graph components collect and expose via Prometheus
  • Storage: Summary metrics stored in BEL for historical analysis
  • Scope: Job execution, resource usage, partition metadata
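
The path from a StatsD-like line to a Prometheus sample might look like the sketch below. It assumes the line syntax `name:value|type|#key:val,...` (the tag suffix is the DogStatsD extension, chosen here as one plausible way to carry labels) and emits the Prometheus text exposition format. `parse_statsd` and `Aggregator` are hypothetical names, and label-value escaping is omitted for brevity.

```python
from collections import defaultdict


def parse_statsd(line):
    """Parse 'name:value|type|#k:v,k2:v2' into (name, value, type, labels)."""
    head, *rest = line.split("|")
    name, raw_value = head.split(":", 1)
    metric_type = rest[0] if rest else "g"
    labels = {}
    for part in rest[1:]:
        if part.startswith("#"):
            for tag in part[1:].split(","):
                key, _, val = tag.partition(":")
                labels[key] = val
    return name, float(raw_value), metric_type, labels


class Aggregator:
    """Sum counters ('c'); keep the latest value for every other type."""

    def __init__(self):
        self._values = defaultdict(float)

    def observe(self, line):
        name, value, metric_type, labels = parse_statsd(line)
        # Prometheus metric names may not contain dots.
        key = (name.replace(".", "_"), tuple(sorted(labels.items())))
        if metric_type == "c":
            self._values[key] += value
        else:
            self._values[key] = value

    def render(self):
        """Return all samples in the Prometheus text exposition format."""
        lines = []
        for (name, labels), value in sorted(self._values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            suffix = "{" + label_text + "}" if label_text else ""
            lines.append(f"{name}{suffix} {value}")
        return "\n".join(lines) + "\n"
```

Feeding the aggregator two `job.rows_written:...|c` lines with the same labels yields a single summed sample, while repeated gauge lines overwrite one another.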

Logging

  • Capture: All job stdout/stderr via wrapper
  • Enhancement: Automatic injection of job_id, partition_ref, and timestamps
  • Format: Structured JSON for consistent parsing
  • Retention: Platform-dependent (container logs, cloud logging APIs)
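
As an illustration of the enhancement step, the sketch below wraps one raw output line in a JSON record. The field names `job_id` and `partition_ref` come from the list above; the `stream` and `message` fields, the ISO 8601 UTC timestamp format, and the `enrich` function itself are assumptions.

```python
import json
from datetime import datetime, timezone


def enrich(line, stream, job_id, partition_ref, now=None):
    """Wrap one raw output line in a JSON record carrying job metadata."""
    # `now` is injectable so the output can be made deterministic in tests.
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": now.isoformat(),
            "job_id": job_id,
            "partition_ref": partition_ref,
            "stream": stream,
            "message": line.rstrip("\r\n"),
        },
        sort_keys=True,
    )
```

Emitting exactly one JSON object per line keeps the format parseable by line-oriented log collectors on any of the supported platforms.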

Monitoring

  • Heartbeats: Emitted every 30 seconds, carrying resource utilization
  • Health: Exit code categorization and failure analysis
  • Alerting: Standard Prometheus/Alertmanager integration
  • Debugging: Complete log trails for job troubleshooting
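
Heartbeats and exit code categorization could be sketched roughly as below. The category names, the treatment of codes 126 and 127 (a shell convention), and the `Heartbeat` class are all hypothetical; the design only specifies the 30-second interval and that exit codes are categorized. Resource utilization would be gathered inside the `emit` callback and is left out here.

```python
import signal
import threading


def categorize_exit(returncode):
    """Map a subprocess return code to a coarse health category."""
    if returncode == 0:
        return "success"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal as the negated signal number.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = str(-returncode)
        return "signal:" + name
    if returncode in (126, 127):
        # Shell conventions: command not executable / command not found.
        return "launch_failure"
    return "job_failure"


class Heartbeat:
    """Call emit() immediately, then every `interval` seconds until stopped."""

    def __init__(self, emit, interval=30.0):
        self._emit = emit
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while True:
            self._emit()
            # wait() returns True as soon as stop() is called.
            if self._stop.wait(self._interval):
                return

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

Using `Event.wait` as the sleep means `stop()` takes effect immediately instead of after the remainder of a 30-second interval.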

Platform Integration

  • Local: Direct stdout pipe reading
  • Docker: Container log persistence, read back with docker logs
  • Kubernetes: Pod logs API with configurable retention
  • Cloud: Platform logging services (Amazon CloudWatch, Google Cloud Logging)
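
For the two platforms that expose logs through a CLI, the lookup could be as small as the sketch below. `log_command` is a hypothetical helper; `docker logs --follow` and `kubectl logs --follow` are the standard commands for streaming container and pod logs. Local execution reads the pipe directly and cloud platforms go through their logging service APIs, so neither has a command here.

```python
def log_command(platform, target):
    """Return the argv that streams a job's logs on the given platform."""
    if platform == "docker":
        # target: container name or ID
        return ["docker", "logs", "--follow", target]
    if platform == "kubernetes":
        # target: pod name
        return ["kubectl", "logs", "--follow", target]
    raise ValueError(f"no CLI log source for platform {platform!r}")
```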