47 lines
1.8 KiB
Markdown
47 lines
1.8 KiB
Markdown
# Observability
|
|
|
|
## Purpose
|
|
Provide comprehensive, platform-agnostic observability for DataBuild applications through standardized job wrapper
|
|
telemetry.
|
|
|
|
## Architecture
|
|
|
|
### Wrapper-Based Observability
|
|
All observability flows through the job wrapper:
|
|
- **Jobs** emit application logs to stdout/stderr
|
|
- **Wrapper** captures and enriches with structured metadata
|
|
- **Graph** parses structured logs into metrics, events, and monitoring data
|
|
- [**BEL**](./build-event-log.md) stores aggregated telemetry for historical analysis
|
|
|
|
### Communication Protocol
|
|
Log-based telemetry using protobuf-defined structured messages:
|
|
- LogMessage: Application stdout/stderr with metadata
|
|
- MetricPoint: StatsD-style metrics with labels
|
|
- JobEvent: State transitions and system events
|
|
- PartitionManifest: Job completion with output metadata
|
|
|
|
## Implementation
|
|
|
|
### Metrics Collection
|
|
- Format: StatsD-like embedded in structured logs
|
|
- Aggregation: Graph components collect and expose via Prometheus
|
|
- Storage: Summary metrics stored in BEL for historical analysis
|
|
- Scope: Job execution, resource usage, partition metadata
|
|
|
|
### Logging
|
|
- Capture: All job stdout/stderr via wrapper
|
|
- Enhancement: Automatic injection of job_id, partition_ref, timestamps
|
|
- Format: Structured JSON for consistent parsing
|
|
- Retention: Platform-dependent (container logs, cloud logging APIs)
|
|
|
|
### Monitoring
|
|
- Heartbeats: 30-second intervals with resource utilization
|
|
- Health: Exit code categorization and failure analysis
|
|
- Alerting: Standard Prometheus/alertmanager integration
|
|
- Debugging: Complete log trails for job troubleshooting
|
|
|
|
### Platform Integration
|
|
- **Local**: Direct stdout pipe reading
|
|
- **Docker**: Container log persistence and `docker logs`
|
|
- **Kubernetes**: Pod logs API with configurable retention
|
|
- **Cloud**: Platform logging services (CloudWatch, Cloud Logging)
|