# Observability
## Purpose
Provide comprehensive, platform-agnostic observability for DataBuild applications through standardized job wrapper
telemetry.
## Architecture
### Wrapper-Based Observability
All observability flows through the job wrapper:
- **Jobs** emit application logs to stdout/stderr
- **Wrapper** captures that output and enriches it with structured metadata
- **Graph** parses structured logs into metrics, events, and monitoring data
- [**BEL**](./build-event-log.md) stores aggregated telemetry for historical analysis
### Communication Protocol
Log-based telemetry using protobuf-defined structured messages:
- LogMessage: Application stdout/stderr with metadata
- MetricPoint: StatsD-style metrics with labels
- JobEvent: State transitions and system events
- PartitionManifest: Job completion with output metadata
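
To make the four message types concrete, here is a minimal sketch of a tagged envelope that carries them on a log line. The protocol itself is protobuf-defined; the Python dataclasses, the field names, and the `type`/`payload` envelope below are stand-in assumptions for illustration, not the actual schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LogMessage:
    job_id: str
    stream: str  # "stdout" or "stderr"
    text: str


@dataclass
class MetricPoint:
    name: str
    value: float
    kind: str  # StatsD-style type, e.g. "c" (counter) or "g" (gauge)
    labels: dict = field(default_factory=dict)


@dataclass
class JobEvent:
    job_id: str
    state: str  # hypothetical state name, e.g. "running"


@dataclass
class PartitionManifest:
    job_id: str
    partitions: list = field(default_factory=list)


# Lookup table from the envelope's type tag to the class that decodes it.
MESSAGE_TYPES = {
    cls.__name__: cls
    for cls in (LogMessage, MetricPoint, JobEvent, PartitionManifest)
}


def encode(message) -> str:
    """Serialize a message as one JSON line tagged with its type name."""
    envelope = {"type": type(message).__name__, "payload": asdict(message)}
    return json.dumps(envelope, sort_keys=True)


def decode(line: str):
    """Rebuild the message object from a tagged JSON line."""
    envelope = json.loads(line)
    cls = MESSAGE_TYPES[envelope["type"]]
    return cls(**envelope["payload"])
```

The type tag lets a consumer dispatch on message kind without inspecting the payload, which is the property the graph-side parser relies on.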
## Implementation
### Metrics Collection
- Format: StatsD-like metrics embedded in structured logs
- Aggregation: Graph components collect metrics and expose them via Prometheus
- Storage: Summary metrics stored in BEL for historical analysis
- Scope: Job execution, resource usage, partition metadata
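
The collect-and-expose step can be sketched as follows. This assumes a `name:value|type|#key:value` line shape (the tag suffix is borrowed from the DogStatsD convention, which this document does not confirm) and renders samples in the Prometheus text exposition style. The `Aggregator` class and metric names are hypothetical.

```python
from collections import defaultdict


def parse_metric(line: str):
    """Parse 'name:value|type|#k:v,k:v' into (name, value, kind, labels)."""
    head, kind, *rest = line.split("|")
    name, raw_value = head.split(":", 1)
    labels = {}
    for part in rest:
        if part.startswith("#"):
            for pair in part[1:].split(","):
                key, _, val = pair.partition(":")
                labels[key] = val
    return name, float(raw_value), kind, labels


class Aggregator:
    """Sums counters and keeps the latest value for every other type."""

    def __init__(self):
        self.samples = defaultdict(float)

    def record(self, line: str) -> None:
        name, value, kind, labels = parse_metric(line)
        # Prometheus metric names may not contain dots, so map them to "_".
        key = (name.replace(".", "_"), tuple(sorted(labels.items())))
        if kind == "c":
            self.samples[key] += value
        else:
            self.samples[key] = value

    def render(self) -> str:
        """Render all samples, one 'name{labels} value' line each."""
        out = []
        for (name, labels), value in sorted(self.samples.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            suffix = "{" + label_text + "}" if label_text else ""
            out.append(f"{name}{suffix} {value}\n")
        return "".join(out)
```

A real deployment would more likely use an existing Prometheus client library; the hand-rolled renderer here only shows the shape of the data.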
### Logging
- Capture: All job stdout/stderr via wrapper
- Enhancement: Automatic injection of `job_id`, `partition_ref`, and timestamps
- Format: Structured JSON for consistent parsing
- Retention: Platform-dependent (container logs, cloud logging APIs)
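
As an illustration of capture and enrichment, the sketch below runs a job command, collects its stdout and stderr, and wraps each line in a JSON record carrying `job_id`, `partition_ref`, and a timestamp. The function names and the `stream` and `message` keys are assumptions. For brevity it collects output after the process exits, whereas a real wrapper would presumably stream lines as they arrive.

```python
import json
import subprocess
import sys
from datetime import datetime, timezone


def enrich_line(text, stream, job_id, partition_ref, now=None):
    """Wrap one raw output line in a structured JSON record."""
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.isoformat(),
        "job_id": job_id,
        "partition_ref": partition_ref,
        "stream": stream,  # assumed key: which pipe the line came from
        "message": text,   # assumed key: the raw line, unmodified
    }
    return json.dumps(record, sort_keys=True)


def run_wrapped(command, job_id, partition_ref):
    """Run the job and return (exit_code, list of enriched JSON lines)."""
    result = subprocess.run(command, capture_output=True, text=True)
    records = []
    for stream, output in (("stdout", result.stdout), ("stderr", result.stderr)):
        for text in output.splitlines():
            records.append(enrich_line(text, stream, job_id, partition_ref))
    return result.returncode, records


if __name__ == "__main__":
    # Hypothetical job: a one-line Python script standing in for a real job.
    code, lines = run_wrapped(
        [sys.executable, "-c", "print('rows written: 10')"],
        job_id="job-1",
        partition_ref="reviews/date=2024-01-01",
    )
    print(code, *lines, sep="\n")
```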
### Monitoring
- Heartbeats: 30-second intervals with resource utilization
- Health: Exit code categorization and failure analysis
- Alerting: Standard Prometheus/Alertmanager integration
- Debugging: Complete log trails for job troubleshooting
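
A rough sketch of the heartbeat and exit-code pieces, under stated assumptions: the category names (`success`, `failure`, `signal`) are hypothetical, and `cpu_seconds` reports the wrapper's own CPU time as a placeholder for whatever resource figures the real heartbeat carries.

```python
import threading
import time

HEARTBEAT_INTERVAL_SECONDS = 30  # interval stated in this section


def categorize_exit(code: int) -> str:
    """Map a process exit code to a coarse category (names are assumptions)."""
    if code == 0:
        return "success"
    if code < 0:
        return "signal"  # Python's subprocess reports death by signal N as -N
    if code > 128:
        return "signal"  # common shell convention: 128 + signal number
    return "failure"


class Heartbeat:
    """Calls emit(record) every `interval` seconds until stopped."""

    def __init__(self, emit, interval=HEARTBEAT_INTERVAL_SECONDS):
        self._emit = emit
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            self._emit({"type": "heartbeat", "cpu_seconds": time.process_time()})

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

Waiting on the event rather than sleeping means `stop()` returns promptly instead of blocking for up to a full interval.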
### Platform Integration
- **Local**: Direct stdout pipe reading
- **Docker**: Container log persistence and `docker logs`
- **Kubernetes**: Pod logs API with configurable retention
- **Cloud**: Platform logging services (CloudWatch, Cloud Logging)
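
One way to picture the platform split: each backend only changes where the line stream comes from, while parsing stays the same. In this sketch the function names are hypothetical, `kubectl logs` stands in for the Pod logs API, and cloud logging services are left out.

```python
import io
import json


def log_source_command(platform, target):
    """Return the CLI command that streams logs for a container or pod."""
    if platform == "docker":
        return ["docker", "logs", "--follow", target]
    if platform == "kubernetes":
        return ["kubectl", "logs", "--follow", target]
    raise ValueError(f"no log command for platform: {platform}")


def read_structured(stream):
    """Yield JSON object records from a line stream, skipping other lines."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain text mixed into the stream is ignored here
        if isinstance(record, dict):
            yield record


if __name__ == "__main__":
    # Local case: any text stream works, e.g. a subprocess stdout pipe.
    sample = io.StringIO('{"job_id": "job-1"}\nplain text line\n')
    print(list(read_structured(sample)))
```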