Fleet observability
Diagnostics that connect edge, runtime, and cloud: actionable signals, traces, and context to debug distributed automation.
Design intent
Use this lens when adopting Fleet observability: define success criteria, start narrow, and scale through safe, observable rollouts.
- Start with versions to avoid debugging the wrong state
- Classify failures (runtime vs adapter vs network vs config) quickly
- Compare against known-good snapshot/site to isolate environment vs change
What it is
Diagnostics includes runtime health, logs/events, deployment status, and telemetry correlated to versions and devices so issues can be understood quickly.
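As a rough illustration of what “correlated to versions and devices” can look like in practice, the sketch below tags each signal with stable identifiers; the field names and placeholder values are illustrative, not a prescribed schema.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class DiagnosticEvent:
    """One health/log/telemetry signal, tagged so it can be joined with
    the exact deployed version and device that produced it."""
    timestamp: datetime   # when the signal was observed at the edge
    device_id: str        # stable device identity, e.g. "edge-12"
    snapshot_id: str      # snapshot/version deployed when the signal was emitted
    source: str           # "runtime", "adapter", "deployment", ...
    kind: str             # "health", "event", "telemetry", "alert"
    payload: dict         # signal-specific fields (point values, error codes)

# A runtime health signal that can later be correlated by (device_id, snapshot_id).
event = DiagnosticEvent(
    timestamp=datetime.now(timezone.utc),
    device_id="edge-12",
    snapshot_id="snap_example",   # illustrative placeholder ID
    source="runtime",
    kind="health",
    payload={"status": "healthy", "resource": "forte-1"},
)
print(event)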
Architecture at a glance
- Edge emits health/events/telemetry with stable IDs (device/resource/app/deploy)
- Store-and-forward buffers protect evidence during outages (see the buffer sketch after this list)
- Backend correlates timelines to snapshots so debugging becomes “find the diff”
- This is a capability surface concern: visibility is required to scale safely
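A minimal sketch of the store-and-forward idea referenced above, assuming an in-memory queue and a hypothetical publish callback; a real edge agent would persist the buffer to disk.
from collections import deque

class StoreAndForwardBuffer:
    """Keeps unacknowledged events at the edge so evidence survives outages.
    In-memory sketch only; production code would persist to durable storage."""

    def __init__(self, max_events: int = 10_000):
        # Bounded queue: if it overflows, the oldest events are dropped first.
        self._pending = deque(maxlen=max_events)

    def record(self, event: dict) -> None:
        self._pending.append(event)

    def flush(self, publish) -> int:
        """Send pending events oldest-first; `publish` returns True on backend ack."""
        sent = 0
        while self._pending:
            if not publish(self._pending[0]):  # backend unreachable: keep evidence
                break
            self._pending.popleft()
            sent += 1
        return sent

# Usage: record locally during an outage, flush once connectivity returns.
buffer = StoreAndForwardBuffer()
buffer.record({"device_id": "edge-12", "kind": "health", "status": "healthy"})
print(buffer.flush(publish=lambda event: True))  # stand-in for a real uplink call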
Typical workflow
- Define the minimal signal set: deployment state + runtime health + adapter health + telemetry
- Create dashboards that answer: what is running? is it healthy? is it correct?
- Use correlated timelines (events + telemetry + rollout state) to debug
- Codify alerts for restart bursts, buffer saturation, flapping, and drift (a rule sketch follows this list)
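The last workflow step, codifying alerts, could start as simple threshold rules over a per-device observation window. The thresholds and field names below are illustrative only.
# Illustrative thresholds; real values depend on the fleet and its SLOs.
def evaluate_alerts(window: dict) -> list:
    """Turn a summarized observation window for one device into alert names."""
    alerts = []
    if window.get("restarts", 0) >= 3:                # restart burst
        alerts.append("restart_burst")
    if window.get("buffer_fill_ratio", 0.0) >= 0.8:   # store-and-forward near full
        alerts.append("buffer_saturation")
    if window.get("reconnects", 0) >= 5:              # connectivity flapping
        alerts.append("connectivity_flapping")
    if window.get("deployed_snapshot") != window.get("intended_snapshot"):
        alerts.append("version_drift")                # drift from intended version
    return alerts

print(evaluate_alerts({
    "restarts": 4,
    "buffer_fill_ratio": 0.2,
    "reconnects": 1,
    "deployed_snapshot": "snap_a",
    "intended_snapshot": "snap_b",
}))  # -> ['restart_burst', 'version_drift']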
System boundary
Treat Fleet observability as a capability boundary: define what success means, what is configurable per site, and how you will validate behavior under rollout.
How it relates to snapshots
Incidents are often change-induced. If you don’t confirm the deployed snapshot first, you can waste hours debugging the wrong system state.
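A sketch of that first confirmation step, using hypothetical field names for the device’s reported deployment state:
def confirm_deployed_state(device: dict, intended_snapshot: str) -> str:
    """First debugging step: establish which snapshot is actually running
    before reading any other signal."""
    deployed = device.get("deployed_snapshot")
    if deployed is None:
        return "unknown: device has not reported its deployment state"
    if deployed != intended_snapshot:
        return f"drift: device runs {deployed}, intent is {intended_snapshot}"
    if not device.get("rollout_complete", False):
        return f"in progress: {deployed} is still rolling out"
    return f"confirmed: {deployed} fully deployed"

print(confirm_deployed_state(
    {"deployed_snapshot": "snap_old", "rollout_complete": True},
    intended_snapshot="snap_new",
))  # -> drift: device runs snap_old, intent is snap_new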
Example artifact
Event timeline (conceptual)
t=00:00 deploy:start snapshot=snap_... device=edge-12
t=00:12 runtime:healthy resource=forte-1
t=00:20 adapter:connected protocol=modbus endpoint=pump-1
t=01:10 telemetry:ok points=128 gap=0s
t=03:40 alert:staleness point=pump_speed > 5s
What it enables
- Faster troubleshooting with fewer site visits
- Root-cause analysis tied to exact deployed versions
- Safer operations through visibility and alerts
A practical troubleshooting path
- Confirm: which snapshot/version is deployed and whether rollout is complete
- Check: runtime health and recent lifecycle events
- Validate: I/O mapping and data quality for critical points
- Compare: to a known-good site or previous snapshot to isolate the change (the sketch below runs these checks in order)
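One way to encode this path is a triage function that runs the checks in order and stops at the first failure; every field name, range, and verdict string below is illustrative.
def triage(device: dict, reference: dict) -> str:
    """Run the four checks in order and stop at the first one that fails."""
    if device["deployed_snapshot"] != device["intended_snapshot"]:
        return "stop: wrong or incomplete snapshot; fix deployment first"
    if device["runtime_status"] != "healthy":
        return "investigate runtime: review recent lifecycle events"
    for point, value in device["critical_points"].items():
        low, high = reference["expected_ranges"][point]
        if not low <= value <= high:
            return f"data quality issue on {point}: check mapping/scaling"
    if device["config_hash"] != reference["config_hash"]:
        return "configuration differs from known-good reference: diff the change"
    return "no obvious fault: widen the time window and compare timelines"

print(triage(
    device={
        "deployed_snapshot": "snap_a", "intended_snapshot": "snap_a",
        "runtime_status": "healthy",
        "critical_points": {"pump_speed": 4800.0},   # plausibly a scaling error
        "config_hash": "abc123",
    },
    reference={"expected_ranges": {"pump_speed": (0.0, 1800.0)}, "config_hash": "abc123"},
))  # -> data quality issue on pump_speed: check mapping/scaling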
Common failure modes
- Silent misconfiguration (reported as “running”, but with wrong units/scaling)
- Connectivity flapping causing intermittent symptoms
- Telemetry gaps that hide the true timeline (a gap-detection sketch follows this list)
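A sketch of gap detection over one point’s sample timestamps; the maximum acceptable gap is an assumption that depends on the point’s expected update rate.
from datetime import datetime, timedelta

def find_gaps(timestamps: list, max_gap: timedelta) -> list:
    """Return (start, end) pairs where consecutive samples are further apart
    than max_gap: the stretches where the timeline cannot be trusted."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_gap
    ]

samples = [datetime(2024, 1, 1, 0, 0, s) for s in (0, 5, 10, 55)]  # one 45 s hole
print(find_gaps(samples, max_gap=timedelta(seconds=10)))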
What to monitor
Monitor mapping/scaling correctness and drift, then timing pressure and intermittent connectivity; these explain many symptoms that never produce an obvious crash.
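For example, the staleness alert shown in the timeline above can be expressed as a per-point rule; the point names and limits below are illustrative.
from datetime import datetime, timedelta, timezone

# Illustrative per-point limits; real limits come from each point's update rate.
STALENESS_LIMITS = {"pump_speed": timedelta(seconds=5), "tank_level": timedelta(seconds=30)}

def stale_points(last_seen: dict, now: datetime) -> dict:
    """Map point name -> silent duration, for points past their staleness limit."""
    return {
        point: now - seen
        for point, seen in last_seen.items()
        if now - seen > STALENESS_LIMITS.get(point, timedelta(seconds=60))
    }

now = datetime.now(timezone.utc)
print(stale_points({"pump_speed": now - timedelta(seconds=12), "tank_level": now}, now))
# -> {'pump_speed': datetime.timedelta(seconds=12)}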
Acceptance tests
- Verify the deployed snapshot/version matches intent (no drift)
- Run a canary validation: behavior, health, and telemetry align with expectations
- Verify rollback works and restores known-good behavior (see the comparison sketch below)
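A sketch of the rollback check: capture a known-good baseline before the change, then confirm post-rollback metrics return to it within a tolerance. Metric names and the tolerance are assumptions.
def rollback_restores_baseline(baseline: dict, after_rollback: dict, tolerance: float = 0.05) -> bool:
    """Check that key metrics after a rollback are within a relative tolerance
    of the known-good baseline captured before the change."""
    for metric, expected in baseline.items():
        actual = after_rollback.get(metric)
        if actual is None:
            return False                    # metric disappeared entirely
        if expected == 0:
            if abs(actual) > tolerance:     # absolute check for zero baselines
                return False
        elif abs(actual - expected) / abs(expected) > tolerance:
            return False
    return True

baseline = {"pump_speed": 1200.0, "telemetry_rate_hz": 1.0}   # captured pre-change
after = {"pump_speed": 1195.0, "telemetry_rate_hz": 1.0}      # measured post-rollback
print(rollback_restores_baseline(baseline, after))  # -> True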
Deep dive
Practical next steps
How teams typically turn this capability into outcomes.
Checklist
- Always start with “what snapshot is deployed?” and “did it just change?”
- Check runtime health + adapter/connectivity health next
- Validate data quality (mapping/scaling/staleness) on critical points
- Compare against known-good site/snapshot to isolate environment vs change
Common questions
Quick answers that help align engineering and operations.
Why do we start with versions for debugging?
Because incidents are often change-induced. If you don’t confirm the deployed snapshot first, you can waste hours debugging the wrong system state.
What’s the common “healthy but wrong” diagnosis path?
Validate mapping/scaling and drift, then check timing pressure and intermittent connectivity. Those explain many symptoms without obvious crashes.
What evidence should we capture during incidents?
Deployment snapshot ID, first failure timestamp, adapter error categories, buffer depth/telemetry gaps, and any recovery/rollback actions taken.
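One way to make that evidence list concrete is a small record captured at incident time; the field names and example values below are illustrative.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class IncidentEvidence:
    """The minimum evidence worth capturing while an incident is live."""
    snapshot_id: str                                         # exact deployed snapshot
    first_failure_at: datetime                               # first failure timestamp
    adapter_errors: dict = field(default_factory=dict)       # error category -> count
    buffer_depth: int = 0                                    # store-and-forward backlog
    telemetry_gap_seconds: list = field(default_factory=list)  # observed gap lengths
    recovery_actions: list = field(default_factory=list)     # recovery/rollback steps taken

evidence = IncidentEvidence(
    snapshot_id="snap_example",                  # illustrative placeholder ID
    first_failure_at=datetime(2024, 1, 1, 3, 40),
    adapter_errors={"timeout": 12},
    buffer_depth=512,
    telemetry_gap_seconds=[45.0],
    recovery_actions=["rolled back to previous snapshot"],
)
print(evidence)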