# Triggers and problems

### Configuration

##### Naming

Trigger names must be prefixed with the LLD object they belong to.

To keep trigger names short, do not use the {HOST.NAME} macro in them.
The host name is already available in the host column.

**Avoid using {ITEM.LASTVALUE} in trigger names**

Do not use {ITEM.LASTVALUE1-9} macros directly in trigger names. These macros are expanded to values at the time the problem name is generated.

Use them in the operational data field (available since Zabbix 4.4) instead.

**Explain the threshold in the event name**

Consider explaining why the trigger fired (the threshold) in parentheses.

Use the event name field for this (supported since Zabbix 5.2) to keep the
trigger name short. The event name, if defined, will be used for generating the problem name.

For example:

-   Trigger name: CPU load is too high
-   Event name: CPU load is too high **(over 1.5)**

Other examples for event names:

|Good|Bad|
|----|---|
|Temperature is too high **(over 35 C for 5m)**<br>MySQL: Refused connections **(max\_connections limit reached)**|Temperature is too high (now: 40)<br>MySQL: Refused connections|
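The naming rules above could be checked automatically before a template is published. The following Python sketch is one hypothetical way to do that; the function name `trigger_name_issues` and the messages are assumptions made for illustration, not part of Zabbix.

```python
import re
from typing import List

# Macros that the guidelines above advise keeping out of trigger names.
HOST_NAME_RE = re.compile(r"\{HOST\.NAME\d?\}")
LAST_VALUE_RE = re.compile(r"\{ITEM\.LASTVALUE\d?\}")


def trigger_name_issues(name: str) -> List[str]:
    """Return a list of naming problems found in a trigger name."""
    issues = []
    if HOST_NAME_RE.search(name):
        issues.append("uses {HOST.NAME}; rely on the host column instead")
    if LAST_VALUE_RE.search(name):
        issues.append("uses {ITEM.LASTVALUE}; move it to operational data")
    return issues


print(trigger_name_issues("CPU load is too high"))  # []
print(trigger_name_issues("{HOST.NAME}: Temperature is too high (now: {ITEM.LASTVALUE1})"))
```

The second call reports both problems, which matches the "Bad" column of the table.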

##### Trigger description

Use this field to:

-   Describe the problem in more detail, without simply copying the text
    from the trigger name.
-   Explain why it is important to check this.
-   Describe the probable root cause of the problem, if possible, and
    which actions should be taken.
-   Provide a reference to the documentation, if any.

##### Expressions

Trigger expressions should be reasonably flap-resistant: rather than
relying on the last value only, check the last 5 or 10 minutes.
On the other hand, do not make the expressions overly complex.
For example, do not use trigger hysteresis unless it really adds
significant value.

Prefer user macros in trigger expressions to allow threshold
tuning.

|Good|Bad|
|----|---|
|`last(/TEMPLATE_NAME/temperature)>{$TEMP.MAX.WARN}`|`last(/TEMPLATE_NAME/temperature)>30`|

Use newlines and spaces to make long trigger expressions more
human-readable.

##### Using time and data suffixes in triggers

Always use time (1m, 5m, 1d...) and size
[suffixes](https://www.zabbix.com/documentation/current/manual/appendix/suffixes)
(1K, 1M, 1G) in trigger expressions, problem names, trigger
descriptions, and operational data to improve readability. Remember that you
can use them in user macros, too.

|Good|Bad|
|----|---|
|`avg(/TEMPLATE_NAME/temperature,10m)>{$TEMP.MAX.WARN}`<br>`avg(/TEMPLATE_NAME/memory.free,10m)<{$MEM_FREE.WARN}` where {$MEM\_FREE.WARN} = 100M|`avg(/TEMPLATE_NAME/temperature,600)>{$TEMP.MAX.WARN}`<br>`avg(/TEMPLATE_NAME/memory.free,600)<{$MEM_FREE.WARN}` where {$MEM\_FREE.WARN} = 104857600|
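To see that the suffixed and raw forms in the table are equivalent, the conversion can be sketched in Python. This is an illustration only: `expand_suffix` is a hypothetical helper, it handles whole numbers only, and it assumes size suffixes are multiples of 1024.

```python
# Multipliers for time suffixes (seconds) and size suffixes (bytes).
TIME_SUFFIXES = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}
SIZE_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def expand_suffix(value: str) -> int:
    """Convert a value such as '10m' or '100M' to its plain integer form."""
    suffix = value[-1]
    # Lowercase 'm' is minutes, uppercase 'M' is a size suffix.
    multiplier = TIME_SUFFIXES.get(suffix) or SIZE_SUFFIXES.get(suffix)
    if multiplier is None:
        return int(value)  # no known suffix, treat as a plain number
    return int(value[:-1]) * multiplier


print(expand_suffix("10m"))   # 600
print(expand_suffix("100M"))  # 104857600
```

The outputs match the raw numbers in the "Bad" column, which shows how much readability the suffixes add.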

##### Severity

Triggers created in the templates are mapped to the standard Zabbix
severity scale. Consider choosing the severity assigned to the trigger
with the following in mind:

|Severity|Description|Examples|Expected reaction type and time (not always true!), given as example only|
|--------|-----------|--------|-------------------------------------------------------------------------|
|*Not classified*|Not used under normal circumstances|<|<|
|*Info*|An event that is not an alarm at all. This information might be helpful later for retrospective analysis or auditing.|Serial number changed, user logged in, etc.|None|
|*Warning*|A minor alarm that could lead to a more serious problem if left unattended.|Free disk space is low, but there is still some room.|React during working hours; no notification is expected.|
|*Average*|Performance alarms: serious performance problems or key service degradation.<br><br>Fault alarms: partial resource failure, or warnings that might lead to a complete device fault if left unattended.|CPU utilization is high, Low memory, High device temperature, Disk health failure in the disk array, Website is slow.|React during working hours; create an issue ticket if the problem persists for hours.|
|*High*|Performance alarms: Key service is not available.<br><br>Fault alarms: The device is not functioning or not available.|No ICMP PING, Website is down.|React outside working hours with a page if services are affected.<br><br>Otherwise, react with a ticket during working hours.|
|*Disaster*|Reserved for alarms indicating blackouts, disasters, global business service faults.<br><br>There should be no triggers with disaster level severity in resource templates.|Riga DC is down, Level core network is down, >50% of users cannot purchase anything from our website.|Always react by paging the responsible person.|

##### Trigger tags

Use tags to logically group triggers using the recommended tagging model.

**Trigger tags**

|Tag|Value|Description|
|----|---|---|
|scope|**performance**<br> **availability** - a monitoring target or its part may become unavailable<br> **capacity** - a monitored resource may be exhausted <br> **notice**<br> **security**<br> **compliance** - reserved for user-defined templates |Specifies the type of a problem.<br>Including at least one tag is mandatory; multiple tags are allowed.|

For example, the trigger *High memory utilization* might contain the following tags:

    scope: capacity; scope: performance
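The tagging rule (at least one `scope` tag, with a value from the table) is easy to validate in a script. A minimal Python sketch follows; the function name `scope_tag_issues` and the representation of tags as (tag, value) pairs are assumptions made for this example.

```python
# Scope values listed in the trigger tags table.
ALLOWED_SCOPES = {
    "performance", "availability", "capacity",
    "notice", "security", "compliance",
}


def scope_tag_issues(tags):
    """Check a list of (tag, value) pairs against the scope tagging rule."""
    scopes = [value for tag, value in tags if tag == "scope"]
    issues = []
    if not scopes:
        issues.append("at least one scope tag is required")
    for value in scopes:
        if value not in ALLOWED_SCOPES:
            issues.append(f"unknown scope value: {value}")
    return issues


print(scope_tag_issues([("scope", "capacity"), ("scope", "performance")]))  # []
print(scope_tag_issues([("scope", "speed")]))  # ['unknown scope value: speed']
```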


##### Trigger macros

For macros used in trigger expressions (thresholds), use this form:

    {$[<NAMESPACE>.]<METRIC_NAME>[.MAX|.MIN][.OK|.WARN|.CRIT]}

Use MAX or MIN when you need to highlight whether it is a high or a low
threshold.

|Good|Bad|
|----|---|
|{$MYSQL.REPLICATION\_LAG.MAX.WARN}<br>{$TEMP.MAX.WARN:"{\#SENSOR}"}<br>{$SERVICE.STATUS.CRIT}<br>{$IF.ERRORS.MAX.WARN}<br>{$DISK.STATUS.OK}<br>{$DISK.STATUS.WARN}<br>{$DISK.STATUS.CRIT}<br>{$MEM\_UTIL.MAX.WARN}<br>{$MEM\_UTIL.MAX.CRIT}|{$DISK\_OK\_STATUS}<br>{$MEMORY\_UTIL\_MAX}|
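One way to catch the "Bad" names from the table is to check that the qualifiers MAX, MIN, OK, WARN and CRIT only appear as dot-separated segments. The Python sketch below does just that; `macro_name_issues` is a hypothetical helper, and it deliberately does not verify the order of the qualifiers.

```python
import re

QUALIFIERS = {"MAX", "MIN", "OK", "WARN", "CRIT"}
# Macro name, optionally followed by a context such as :"{#SENSOR}".
MACRO_RE = re.compile(r"^\{\$([A-Z0-9_.]+)(?::.*)?\}$")


def macro_name_issues(macro):
    """Flag qualifiers that are glued to a name with underscores."""
    match = MACRO_RE.match(macro)
    if not match:
        return ["not a user macro"]
    issues = []
    for segment in match.group(1).split("."):
        if segment in QUALIFIERS:
            continue  # a properly dot-separated qualifier
        for word in segment.split("_"):
            if word in QUALIFIERS:
                issues.append(f"{word} should be a dot-separated suffix")
    return issues


print(macro_name_issues("{$MEM_UTIL.MAX.WARN}"))  # []
print(macro_name_issues("{$MEMORY_UTIL_MAX}"))    # ['MAX should be a dot-separated suffix']
```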


### Use trigger snippets

Check the following trigger snippets library and consider reusing
configuration to avoid reinventing the wheel.

**Case: Something has just been restarted**

Trigger: \<resource\> has just been restarted (uptime < 10m)

|Applicable for|Uptime counters for a device, host, or running software/service|
|--------------|-----------------------------------------------------------------|
|*Name*|\<resource\> has been restarted|
|*Event name*|\<resource\> has been restarted (uptime < 10m)|
|*Description*|\<resource\> uptime is less than 10 minutes|
|*Expression*|last(/TEMPLATE\_NAME/METRIC)<10m|
|*Recovery expression*|\-|
|*Recovery mode*|\-|
|*Manual close*|Yes|
|*Severity*|Warning for the host. Info for all others.|
|*Depends on*|\-|

**Case: Any master item + preprocessing in dependent items**

Trigger: Master item is not responding

\<resource\>: Failed to get items (no data for 30m)

|Applicable for|Any type of items used for bulk data collection|
|--------------|-----------------------------------------------|
|*Expression*|nodata(/TEMPLATE\_NAME/METRIC,30m)=1|
|*Recovery expression*|\-|
|*Recovery mode*|\-|
|*Manual close*|Yes|
|*Severity*|Warning|
|*Depends on*|If present: \<Proc\> is not running|

**Case: HTTP item + regex preprocessing in dependent items**

Trigger: HTTP item is not responding

|Applicable for|HTTP items that provide output for subsequent regex preprocessing;<br>use ‘Headers and Body’ mode in the item|
|--------------|----------------------------------------------------------------|
|*Expression*|find(/TEMPLATE\_NAME/METRIC,"HTTP/1.1 200")=0 or nodata(/TEMPLATE\_NAME/METRIC,30m)=1|
|*Recovery expression*|\-|
|*Recovery mode*|\-|
|*Manual close*|Yes|
|*Severity*|Warning|
|*Depends on*|If present: \<Proc\> is not running|

**Case: \<VALUE\> is too high (over X)/ is too low (under X) for
slow to change values**

For slow-changing values (e.g. temperature), use max() for highs and min() for lows to get an immediate response with delayed (confirmed) recovery.

Trigger: \<VALUE\> is too high (over X)

|Applicable for|High temperature (slow to change)|
|--------------|---------------------------------|
|*Expression*|max(/TEMPLATE\_NAME/METRIC,5m) > X|

Trigger: \<VALUE\> is too low (under X)

|Applicable for|Low temperature (slow to change)|
|--------------|--------------------------------|
|*Expression*|min(/TEMPLATE\_NAME/METRIC,5m) < X|
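The "immediate response, delayed recovery" behaviour of max() can be illustrated with a small Python simulation. This is only a sketch: `max_trigger_states` is a hypothetical function, and the window is counted in samples rather than minutes for simplicity.

```python
def max_trigger_states(values, window, threshold):
    """For each sample, evaluate max(last `window` samples) > threshold."""
    states = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        states.append(max(recent) > threshold)
    return states


# A single reading of 36 crosses the threshold of 35.
temps = [30, 31, 36, 34, 33, 32, 31]
print(max_trigger_states(temps, window=3, threshold=35))
# [False, False, True, True, True, False, False]
```

The trigger fires as soon as the high reading arrives and recovers only after that reading has left the window, so recovery is confirmed by several lower values.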

**Case: \<VALUE\> is too high (over X for 5m)/ is too low (under X
for 5m) for quick-to-change and jumpy values**

For jumpy values, use min() (for high) and max() (for low) to make triggers
more tolerant of spikes/noise.

Trigger: \<VALUE\> is too high (over X for 5m)

|Applicable for|CPU utilization (jumpy), signal strength (jumpy), network utilization|
|--------------|--------------------------------------------------------------------|
|*Expression*|min(/TEMPLATE\_NAME/METRIC,5m) > X|

Trigger: \<VALUE\> is too low (under X for 5m)

|Applicable for|CPU utilization (jumpy), signal strength (jumpy), network utilization|
|--------------|--------------------------------------------------------------------|
|*Expression*|max(/TEMPLATE\_NAME/METRIC,5m) < X|
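Why min() tolerates spikes in a "too high" trigger can be seen by comparing it with a check on the last value only. The Python sketch below uses made-up CPU utilization samples; both function names are hypothetical, and each list stands for the values collected within one 5m window.

```python
def fires_on_last(values, threshold):
    """Equivalent of last() > X: looks at the newest value only."""
    return values[-1] > threshold


def fires_on_min(values, threshold):
    """Equivalent of min(5m) > X: every value in the window must exceed X."""
    return min(values) > threshold


spiky = [40, 45, 95, 42, 97]      # isolated spikes
sustained = [91, 93, 95, 92, 97]  # consistently high

print(fires_on_last(spiky, 90), fires_on_min(spiky, 90))          # True False
print(fires_on_last(sustained, 90), fires_on_min(sustained, 90))  # True True
```

With min(), the problem is raised only when the value stays above the threshold for the whole window, which is what the "(over X for 5m)" part of the name promises.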

**Case: Serial number has changed on the device**

Trigger: Serial number control

|Applicable for|Serial number items|
|--------------|--------------------|
|*Name*|\<resource\> has been replaced|
|*Event name*|\<resource\> has been replaced (new serial number received)|
|*Description*|\<resource\> serial number has changed. Ack to close|
|*Expression*|last(/TEMPLATE\_NAME/METRIC)<>last(/TEMPLATE\_NAME/METRIC,#2) and length(last(/TEMPLATE\_NAME/METRIC))>0|
|*Recovery expression*|\-|
|*Recovery mode*|None|
|*Manual close*|Yes|
|*Severity*|Info|
|*Depends on*|\-|
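The change-detection logic of this expression (the newest value differs from the previous one, and the newest value is not empty) can be sketched in Python as follows. `serial_changed` is a hypothetical name, and the history list is assumed to hold the newest value last.

```python
def serial_changed(history):
    """Return True if the newest value differs from the previous one
    and is not an empty string."""
    if len(history) < 2:
        return False  # nothing to compare with yet
    latest, previous = history[-1], history[-2]
    return latest != previous and len(latest) > 0


print(serial_changed(["SN-001", "SN-002"]))  # True
print(serial_changed(["SN-001", "SN-001"]))  # False
print(serial_changed(["SN-001", ""]))        # False
```

The length check keeps the trigger from firing when the device temporarily returns an empty value.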

**Case: Software version has changed on the device**

Trigger: Version control

|Applicable for|Software version items|
|--------------|----------------------|
|*Name*|\<resource\> version has changed|
|*Event name*|\<resource\> version has changed (new version: {ITEM.VALUE})|
|*Description*|\<resource\> version has changed. Ack to close|
|*Expression*|last(/TEMPLATE\_NAME/METRIC)<>last(/TEMPLATE\_NAME/METRIC,#2) and length(last(/TEMPLATE\_NAME/METRIC))>0|
|*Recovery expression*|\-|
|*Recovery mode*|None|
|*Manual close*|Yes|
|*Severity*|Info|
|*Depends on*|\-|

**Case: Control how much disk space is left**

Trigger: Filesystem space is critically low, with timeleft and a [context](https://www.zabbix.com/documentation/6.2/en/manual/config/macros/user_macros_context) macro

{$VFS.FS.PUSED.MAX.CRIT:"\_\_RESOURCE\_\_"} = 90

|Applicable for|Filesystems|
|--------------|-----------|
|*Name*|Disk space is critically low|
|*Event name*|Disk space is critically low (used > {$VFS.FS.PUSED.MAX.CRIT:"\_\_RESOURCE\_\_"})|
|*Description*|Space used: {ITEM.VALUE3} of {ITEM.VALUE2} ({ITEM.VALUE1}), time left till full: < 24h.<br><br>Two conditions should match: First, space utilization should be above {$VFS.FS.PUSED.MAX.CRIT:"\_\_RESOURCE\_\_"}.<br><br>Second condition should be one of the following:<br>- The disk free space is less than 5G.<br>- The disk will be full in less than 24 hours.|
|*Expression*|last(/TEMPLATE\_NAME/vfs.fs.size[{#FSNAME},pused])>{$VFS.FS.PUSED.MAX.CRIT:"{#FSNAME}"} and ((last(/TEMPLATE\_NAME/vfs.fs.size[{#FSNAME},total])-last(/TEMPLATE\_NAME/vfs.fs.size[{#FSNAME},used]))<{$VFS.FS.FREE.MIN.CRIT:"{#FSNAME}"} or timeleft(/TEMPLATE\_NAME/vfs.fs.size[{#FSNAME},pused],1h,100)<1d)|
|*Recovery expression*|\-|
|*Recovery mode*|None|
|*Manual close*|Yes|
|*Severity*|Average|
|*Depends on*|\-|

Trigger: Filesystem space is low, with timeleft and a [context](https://www.zabbix.com/documentation/6.2/en/manual/config/macros/user_macros_context) macro

{$VFS.FS.PUSED.MAX.WARN:"\_\_RESOURCE\_\_"} = 80

|Applicable for|Filesystems|
|--------------|-----------|
|*Name*|Disk space is low|
|*Event name*|Disk space is low (used > {$VFS.FS.PUSED.MAX.WARN:"\_\_RESOURCE\_\_"})|
|*Description*|Space used: {ITEM.VALUE3} of {ITEM.VALUE2} ({ITEM.VALUE1}), time left till full: < 24h.<br><br>Two conditions should match: First, space utilization should be above {$VFS.FS.PUSED.MAX.WARN:"\_\_RESOURCE\_\_"}.<br><br>Second condition should be one of the following:<br>- The disk free space is less than 10G.<br>- The disk will be full in less than 24 hours.|
|*Expression*|last(/TEMPLATE\_NAME/vfs.fs.size[{#FSNAME},pused])>{$VFS.FS.PUSED.MAX.WARN:"{#FSNAME}"} and ((last(/TEMPLATE\_NAME/vfs.fs.size[{#FSNAME},total])-last(/TEMPLATE\_NAME/vfs.fs.size[{#FSNAME},used]))<{$VFS.FS.FREE.MIN.WARN:"{#FSNAME}"} or timeleft(/TEMPLATE\_NAME/vfs.fs.size[{#FSNAME},pused],1h,100)<1d)|
|*Recovery expression*|\-|
|*Recovery mode*|None|
|*Manual close*|Yes|
|*Severity*|Warning|
|*Depends on*|Disk space is critically low.|
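The "two conditions should match" logic from the descriptions depends on how the second condition is grouped. The Python sketch below mirrors that logic with plain numbers; `disk_space_problem` and its parameters are hypothetical, and the time left is assumed to be given in seconds. In Python, as in Zabbix trigger expressions, `and` binds tighter than `or`, so the grouping has to be explicit.

```python
GIB = 1024 ** 3   # bytes in 1G
DAY = 86400       # seconds in 1d


def disk_space_problem(pused, total, used, time_left, pused_max, free_min):
    """Mirror the two-condition logic from the trigger description."""
    free = total - used
    # First condition AND (either of the second conditions).
    return pused > pused_max and (free < free_min or time_left < DAY)


# A half-empty disk with a short forecast should not fire.
args = dict(pused=50, total=100 * GIB, used=50 * GIB,
            time_left=3600, pused_max=90, free_min=5 * GIB)
print(disk_space_problem(**args))  # False

# Without the grouping, "and" binds first and the forecast alone fires.
print(args["pused"] > 90 and 50 * GIB < 5 * GIB or args["time_left"] < DAY)  # True
```

The second print shows why the free-space and timeleft checks need their own parentheses: otherwise a short forecast on a mostly empty disk would raise the problem.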
