Observability
Purpose
Provide comprehensive, platform-agnostic observability for DataBuild applications through standardized job wrapper telemetry.
Architecture
Wrapper-Based Observability
All observability flows through the job wrapper:
- Jobs emit application logs to stdout/stderr
- Wrapper captures that output and enriches it with structured metadata
- Graph parses structured logs into metrics, events, and monitoring data
- BEL stores aggregated telemetry for historical analysis
Communication Protocol
Log-based telemetry using protobuf-defined structured messages:
- LogMessage: Application stdout/stderr with metadata
- MetricPoint: StatsD-style metrics with labels
- JobEvent: State transitions and system events
- PartitionManifest: Job completion with output metadata
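As a rough illustration of the four message kinds, the sketch below models them as Python dataclasses and wraps each in a type-tagged envelope so a reader can route it. This is not the actual protobuf schema: JSON stands in for protobuf to keep the example self-contained, and every field name, plus the encode and decode helpers, is a hypothetical chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


# Field names below are assumptions; only the four message names come from the text.
@dataclass
class LogMessage:
    job_id: str
    stream: str  # "stdout" or "stderr"
    text: str


@dataclass
class MetricPoint:
    name: str
    value: float
    labels: dict = field(default_factory=dict)


@dataclass
class JobEvent:
    job_id: str
    state: str


@dataclass
class PartitionManifest:
    job_id: str
    partition_refs: list = field(default_factory=list)


# Lookup table used to rebuild the right class from the envelope's type tag.
MESSAGE_TYPES = {
    cls.__name__: cls
    for cls in (LogMessage, MetricPoint, JobEvent, PartitionManifest)
}


def encode(msg) -> str:
    """Serialize a message as one log line with a type tag."""
    return json.dumps({"type": type(msg).__name__, "payload": asdict(msg)})


def decode(line: str):
    """Parse one log line back into its message class."""
    envelope = json.loads(line)
    return MESSAGE_TYPES[envelope["type"]](**envelope["payload"])
```

A consumer can then dispatch on the decoded class, for example sending MetricPoint instances to a metrics aggregator and JobEvent instances to an event log.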
Implementation
Metrics Collection
- Format: StatsD-like metric lines embedded in structured logs
- Aggregation: Graph components collect and expose via Prometheus
- Storage: Summary metrics stored in BEL for historical analysis
- Scope: Job execution, resource usage, partition metadata
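One way to picture the path from a StatsD-like line to a Prometheus-scrapable sample is sketched below. The exact wire format is not specified in this document, so the example assumes a `name:value|type|#key:val,...` layout (the tag suffix follows the DogStatsD convention), and both function names are hypothetical. The output line follows the Prometheus text exposition format.

```python
def parse_statsd(line):
    """Parse an assumed 'name:value|type|#k:v,k2:v2' metric line."""
    head, *rest = line.split("|")
    name, value = head.split(":", 1)
    kind = rest[0] if rest else "g"  # default to gauge when no type is given
    labels = {}
    for part in rest[1:]:
        if part.startswith("#"):
            for tag in part[1:].split(","):
                key, _, val = tag.partition(":")
                labels[key] = val
    return {"name": name, "value": float(value), "kind": kind, "labels": labels}


def to_prometheus(metric):
    """Render one parsed metric as a Prometheus text-format sample line."""
    # Prometheus metric names may not contain dots or dashes.
    name = metric["name"].replace(".", "_").replace("-", "_")
    if metric["labels"]:
        pairs = ",".join(
            f'{k}="{v}"' for k, v in sorted(metric["labels"].items())
        )
        return f"{name}{{{pairs}}} {metric['value']}"
    return f"{name} {metric['value']}"
```

For instance, `to_prometheus(parse_statsd("job.rows_written:1500|c|#job_id:abc"))` yields `job_rows_written{job_id="abc"} 1500.0`. A real collector would also aggregate counters across lines and escape label values; both are omitted here.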
Logging
- Capture: All job stdout/stderr via wrapper
- Enhancement: Automatic injection of job_id, partition_ref, timestamps
- Format: Structured JSON for consistent parsing
- Retention: Platform-dependent (container logs, cloud logging APIs)
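The capture and enhancement steps might look roughly like the following. It is a minimal sketch, not the wrapper's real implementation: `enrich` and `run_wrapped` are hypothetical names, the JSON keys other than job_id and partition_ref are assumptions, and stderr is merged into stdout for brevity rather than tagged separately.

```python
import json
import subprocess
import sys
from datetime import datetime, timezone


def enrich(line, job_id, partition_ref):
    """Wrap one raw output line in a structured JSON record."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "job_id": job_id,
        "partition_ref": partition_ref,
        "message": line.rstrip("\n"),
    })


def run_wrapped(cmd, job_id, partition_ref):
    """Run cmd and return (exit_code, list of enriched JSON lines)."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the captured stream
        text=True,
    )
    # Iterating the pipe reads the job's output line by line until EOF.
    records = [enrich(line, job_id, partition_ref) for line in proc.stdout]
    return proc.wait(), records


if __name__ == "__main__":
    code, records = run_wrapped(
        [sys.executable, "-c", "print('hello from job')"],
        job_id="job-1",
        partition_ref="dataset/date=2024-01-01",
    )
    for record in records:
        print(record)
```

Collecting records in a list keeps the example short; a long-running wrapper would more plausibly forward each enriched line as soon as it is read.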
Monitoring
- Heartbeats: 30-second intervals with resource utilization
- Health: Exit code categorization and failure analysis
- Alerting: Standard Prometheus/Alertmanager integration
- Debugging: Complete log trails for job troubleshooting
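The heartbeat and exit-code handling could be sketched as below. Only the 30-second interval comes from the text above; the category names, the heartbeat fields, and all function names are assumptions made for illustration. The signal check relies on two conventions: Python's subprocess module reports a negative return code when a child is killed by a signal, and shells report 128 plus the signal number.

```python
HEARTBEAT_INTERVAL_SECONDS = 30  # interval stated in this document


def categorize_exit(code):
    """Map a process exit code to a coarse, hypothetical category."""
    if code == 0:
        return "success"
    if code < 0 or code > 128:
        # Negative: subprocess convention. Above 128: shell convention.
        return "killed_by_signal"
    return "job_failure"


def heartbeat_due(last_sent, now, interval=HEARTBEAT_INTERVAL_SECONDS):
    """Return True once at least `interval` seconds have elapsed."""
    return now - last_sent >= interval


def make_heartbeat(job_id, now, cpu_percent, memory_mb):
    """Build a heartbeat record; resource values are supplied by the caller."""
    return {
        "type": "heartbeat",
        "job_id": job_id,
        "timestamp": now,
        "cpu_percent": cpu_percent,
        "memory_mb": memory_mb,
    }
```

How the wrapper samples CPU and memory for the child process is platform-specific and left out of the sketch, which is why those values are passed in as parameters.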
Platform Integration
- Local: Direct stdout pipe reading
- Docker: Container log persistence and docker logs
- Kubernetes: Pod logs API with configurable retention
- Cloud: Platform logging services (CloudWatch, Cloud Logging)