Alerting Process Walkthrough
This section gives a walkthrough of the entire process involved in Gyeeta Alerting from the trigger to the resolution of the alert.
For the sake of simplicity, only Realtime Alerts are explained in this section. Most of the concepts will also apply to DB Aggregated Alerts.
Let us assume an Alert Definition has been set to fire a Realtime Alert if any redis, mysql or postgres service has a state of Bad or Severe over 5 consecutive evaluations.
({ name in 'postgres','redis','mysql' } and { state in 'Bad','Severe' })
This alert is for the extsvcstate subsystem, i.e. related to Service State. Gyeeta updates Service States every 5 sec in the madhava Intermediate Servers.
On every 5 sec Service State update, madhava will check if the above filter expression matches any Service State. If a match is found, madhava will check the number of consecutive occurrences of this alert for that specific Service. If it is the fifth occurrence, then as per the Alert Definition, madhava will send a preliminary alert to the Shyama Alertmanager.
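The consecutive-evaluation check described above can be sketched as follows. This is a minimal illustration, not Gyeeta's implementation: the `ConsecutiveTracker` class, the `svcid` key and the dictionary layout of a Service State are assumptions made for the example.

```python
from collections import defaultdict

# All names below are hypothetical; they only mirror the Alert Definition in the text.
WATCHED_NAMES = {"postgres", "redis", "mysql"}
BAD_STATES = {"Bad", "Severe"}
NUM_CHECKS = 5  # consecutive evaluations required before a pre-alert


def matches_filter(svc):
    """Equivalent of ({ name in ... } and { state in ... }) for one Service State."""
    return svc["name"] in WATCHED_NAMES and svc["state"] in BAD_STATES


class ConsecutiveTracker:
    def __init__(self, num_checks=NUM_CHECKS):
        self.num_checks = num_checks
        self.hits = defaultdict(int)  # service id -> consecutive match count

    def on_update(self, svc):
        """Process one 5 sec state update; True means a pre-alert should be sent."""
        svcid = svc["svcid"]
        if not matches_filter(svc):
            self.hits[svcid] = 0  # a non-matching update breaks the streak
            return False
        self.hits[svcid] += 1
        # Fire exactly once, on the update that completes the streak.
        return self.hits[svcid] == self.num_checks
```

Feeding the same Bad state for a redis service five times in a row returns `True` only on the fifth call; a single Good state in between resets the count for that service.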
On receiving this pre-alert, the Alertmanager will check the following conditions, any of which will cancel out (suppress) the alert:
- Check if the Alert is Silenced as per this Alert Definition specific Silence Rules
- Check if All Alerts are disabled which could be because of an Alert Storm
- Check if the Alert is Silenced as per Global Alerts Silencing Rules
- Check if the Alert is Inhibited because of a prior Alert as per Inhibition Rules
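One way to picture these checks is as an ordered chain where the first hit suppresses the alert. The sketch below follows the order of the list above; the `PreAlert` and `SuppressionState` types, and modelling each rule as a predicate function, are assumptions for illustration and do not reflect Shyama's internal data structures.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PreAlert:
    alertname: str
    svcname: str
    state: str


# Hypothetical: a rule is any predicate that returns True when it applies to the alert.
Rule = Callable[[PreAlert], bool]


@dataclass
class SuppressionState:
    definition_silences: List[Rule] = field(default_factory=list)
    all_alerts_disabled: bool = False  # e.g. set during an Alert Storm
    global_silences: List[Rule] = field(default_factory=list)
    inhibitions: List[Rule] = field(default_factory=list)


def suppression_reason(alert: PreAlert, st: SuppressionState) -> Optional[str]:
    """Return why the alert is suppressed, or None if every check passes."""
    if any(rule(alert) for rule in st.definition_silences):
        return "silenced by Alert Definition"
    if st.all_alerts_disabled:
        return "all alerts disabled"
    if any(rule(alert) for rule in st.global_silences):
        return "silenced globally"
    if any(rule(alert) for rule in st.inhibitions):
        return "inhibited by prior alert"
    return None
```

An alert proceeds to the Grouping step only when `suppression_reason` returns `None`.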
If all checks pass, Shyama will check if the Alert Definition has Grouping enabled. If Grouping is enabled, the Alert is pushed to the respective Alert Group. Thereafter, when the Group Wait period completes, the Alert will become Active.
If Grouping is disabled, the Alert becomes Active. The Alertmanager will then send the new Alert to the remote Alert Action Agent. The Alert Action Agent will execute all the Actions (Notifications) set for that alert.
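The Grouping path can be sketched as a buffer that releases its alerts once the Group Wait has elapsed. This is a simplified model under stated assumptions: the `AlertGroup` class is hypothetical, and starting the wait at the first alert pushed to the group is a guess at the timing rather than documented behaviour.

```python
class AlertGroup:
    """Hypothetical buffer holding alerts until the Group Wait period completes."""

    def __init__(self, group_wait_sec):
        self.group_wait_sec = group_wait_sec
        self.pending = []
        self.first_seen = None  # time the first pending alert was pushed

    def add(self, alert, now):
        if self.first_seen is None:
            self.first_seen = now  # assumption: the wait starts with the first alert
        self.pending.append(alert)

    def flush_if_due(self, now):
        """Return the alerts that become Active now, or an empty list if still waiting."""
        if self.first_seen is None or now - self.first_seen < self.group_wait_sec:
            return []
        ready, self.pending = self.pending, []
        self.first_seen = None
        return ready
```

With Grouping disabled, the equivalent of `flush_if_due` would return the alert immediately, which matches the direct transition to Active described above.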
At this point, the Alert is Active and is also pushed to the Shyama server DB. The Alert can also be seen in the Web UI Alerts Dashboard. If the Alert was cancelled out due to any of the checks above, the Alert will neither be visible in the UI nor be stored in the DB.
Let us assume that the Alert Definition has set the Alert to be Repeated if it remains active for over an hour.
Now, on every 5 sec Service State update, the corresponding madhava instance will check the Alert Filter and set the Alert as active if the condition keeps getting hit.
If the Alert condition remains active an hour after the initial alert time, the corresponding madhava instance will send a Repeat alert to shyama.
On receiving this Repeat alert, Shyama will again evaluate all the Checks for cancelling and if the checks pass, will send the Repeat Alert to the remote Alert Agent for the next round of notifications.
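The Repeat timing on the madhava side can be sketched as below. The `RepeatTimer` class is hypothetical, and the behaviour after the first Repeat (here, one further Repeat per elapsed interval) is an assumption; the text above only describes the first Repeat after an hour.

```python
REPEAT_AFTER_SEC = 3600  # "Repeated if it remains active for over an hour"


class RepeatTimer:
    """Hypothetical per-alert timer deciding when a Repeat alert is due."""

    def __init__(self, first_alert_time, repeat_after_sec=REPEAT_AFTER_SEC):
        self.last_sent = first_alert_time
        self.repeat_after_sec = repeat_after_sec

    def on_update(self, still_matching, now):
        """Call on each 5 sec update; True means a Repeat alert should be sent."""
        if not still_matching:
            return False  # resolution is handled separately, not by this timer
        if now - self.last_sent >= self.repeat_after_sec:
            self.last_sent = now  # assumption: the next Repeat is measured from this one
            return True
        return False
```

Each Repeat produced this way would still go through the same suppression checks on Shyama before any notification is sent.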
Now let us assume a Service State update is seen where the Alert filter expression no longer matches for that specific service. The corresponding madhava instance will then mark this alert as Resolved and will notify the Shyama Alertmanager.
On Alert Resolution, Shyama will update the DB and the Web UI will also show the Alert as Resolved. If the Alert Definition specified sending of Notifications on Alert Resolution, then the Alertmanager will send the notification request to the Alert Agent, which will execute the action.
Each Alert Definition is associated with an Alert Expiry Duration (defaults to 10 hours, up to a maximum of 24 hours).
The Alertmanager will periodically check for Alert Expiries and, on an expiry, will mark that alert as Expired and update the DB. The Web UI will show the Alert as Expired. If the Alert Definition specified sending of Notifications on Alert Resolution, then Shyama will send the Expired Alert notification request to the Alert Agent, which will execute the action.
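A periodic expiry sweep of this kind might look like the following sketch. The function names and the dictionary fields are hypothetical, and clamping a requested duration to the 24 hour maximum is an assumption; the real Alertmanager may instead reject an out-of-range value.

```python
DEFAULT_EXPIRY_SEC = 10 * 3600  # default Alert Expiry Duration
MAX_EXPIRY_SEC = 24 * 3600      # maximum allowed Alert Expiry Duration


def effective_expiry_sec(requested=None):
    """Pick the expiry for a definition; clamping to the max is an assumption."""
    if requested is None:
        return DEFAULT_EXPIRY_SEC
    return min(requested, MAX_EXPIRY_SEC)


def expire_alerts(alerts, now):
    """Mark Active alerts past their expiry as Expired and return them."""
    expired = []
    for alert in alerts:
        if alert["status"] != "Active":
            continue  # Resolved or already Expired alerts are left alone
        if now - alert["start"] >= alert["expiry_sec"]:
            alert["status"] = "Expired"
            expired.append(alert)
    return expired
```

The returned list is what a caller would then write to the DB and, if notifications are configured, hand to the Alert Agent.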
If the Alert Definition specifies Manual Alert Resolution instead of the default auto resolution, then the Shyama Alertmanager will not set the Alert as Resolved until a user clicks the Set as Resolved button in the Web UI Alerts Dashboard or invokes the corresponding REST API. Alert Expiry Rules remain active for Manual Resolve Alerts as well.