NVIDIA GPU loadable plugin

- Metrics may report errors or be unsupported if your device cannot provide the required information.
- **If the NVML dynamic library, which is installed by default with the NVIDIA driver, is absent, Zabbix Agent 2 with the NVIDIA GPU plugin will not start.**

## Build from Source

The plugin supports building for both Linux and Windows. To avoid errors during cross-compilation, it is recommended to build the plugin directly on the target operating system.

To build the NVIDIA GPU Plugin for Zabbix Agent 2 from source, ensure you have the following prerequisites.

### Prerequisites
- **Go Programming Language**: Version 1.21 or higher.
- **CGO Enabled**: The build process requires `CGO_ENABLED=1` for proper compilation.
- **C Compiler**: A C compiler is required for building with `CGO_ENABLED=1`.
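
The Go version requirement above can be checked before starting a build. The following is a minimal Python sketch, not part of the plugin's build tooling: the helper names `parse_go_version` and `meets_minimum` are hypothetical, and the code only assumes the usual `go version` output format (for example `go version go1.22.3 linux/amd64`).

```python
import re

MIN_GO = (1, 21)  # minimum Go version stated in the prerequisites


def parse_go_version(output: str) -> tuple[int, int]:
    """Extract (major, minor) from the text printed by `go version`."""
    match = re.search(r"\bgo(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognized `go version` output: {output!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(output: str, minimum: tuple[int, int] = MIN_GO) -> bool:
    """Return True if the reported Go version is at least `minimum`."""
    return parse_go_version(output) >= minimum
```

On a build host, the text passed to `meets_minimum` would typically come from running `go version` (for instance through `subprocess.run`); the functions themselves work on plain strings so they can be tried without Go installed.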

## Plugin Setup
The `Plugins.NVIDIA.System.Path` variable must be set in the Zabbix Agent 2 configuration file, specifying the path to the NVIDIA GPU plugin executable. By default, this variable is set in the **plugin** configuration file `nvidia.conf`, which is then included in the **agent** configuration file `zabbix_agent2.conf`.

### Example Setup:
- Add the following option to the **plugin** configuration file:
```text
Plugins.NVIDIA.System.Path=/path/to/executable/nvidia
```
- Include the plugin configuration file in the main Zabbix Agent 2 configuration file using the `Include` directive:
```text
Include=/path/to/config/nvidia.conf
```
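
To double-check the two-file setup described above, the `Key=Value` lines of both files can be inspected with a few lines of Python. This is an illustrative sketch only: `parse_conf` is a hypothetical helper, it handles just blank lines and `#` comments, and the file contents are the placeholder values from the example rather than real paths.

```python
def parse_conf(text: str) -> dict[str, list[str]]:
    """Parse `Key=Value` lines, skipping blank lines and `#` comments.

    Values are collected in lists because a key such as `Include`
    may appear more than once in a configuration file.
    """
    params: dict[str, list[str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params.setdefault(key.strip(), []).append(value.strip())
    return params


# Placeholder contents mirroring the example setup above.
plugin_conf = "Plugins.NVIDIA.System.Path=/path/to/executable/nvidia\n"
agent_conf = "# main agent configuration\nInclude=/path/to/config/nvidia.conf\n"

plugin_params = parse_conf(plugin_conf)
agent_params = parse_conf(agent_conf)

# The plugin file must define the executable path,
# and the agent file must include the plugin file.
path_is_set = "Plugins.NVIDIA.System.Path" in plugin_params
plugin_is_included = "/path/to/config/nvidia.conf" in agent_params.get("Include", [])
```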

## Configuration
To configure the plugin, use the Zabbix Agent 2 configuration file.

- **`Plugins.NVIDIA.Timeout`**: Specifies the maximum time (in seconds) to wait for a server response during connection attempts and subsequent operations in the session. The global item-type timeout or individual item timeout will override this value if greater.
  - **Default**: Equal to the global `Timeout` parameter in the Zabbix Agent 2 configuration file.
  - **Limits**: 1-30 seconds.

# Metric Keys

## General Information
- **`nvml.version`**
  Returns a single value: (string) version of the NVML library.

- **`nvml.system.driver.version`**
  Returns a single value: (string) version of the installed NVIDIA driver.

- **`nvml.device.get`**
  Returns a JSON array, where each element represents a device in the system with the following fields:
  - **`device_uuid`**: Unique identifier for the device.
  - **`device_name`**: Name of the device.

- **`nvml.device.count`**
  Returns a single value: (unsigned int) number of devices.
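
The array returned by `nvml.device.get` supplies the `<deviceUUID>` parameter used by the per-device keys in this document. The sketch below shows one way to turn that array into item keys; the sample payload, UUIDs, and device names are made up for illustration, and `item_keys` is a hypothetical helper.

```python
import json

# Hypothetical sample of what `nvml.device.get` might return.
SAMPLE_DEVICES = json.dumps([
    {"device_uuid": "GPU-00000000-0000-0000-0000-000000000001", "device_name": "Example GPU A"},
    {"device_uuid": "GPU-00000000-0000-0000-0000-000000000002", "device_name": "Example GPU B"},
])


def item_keys(device_get_json: str, metric: str) -> list[str]:
    """Build one `<metric>[<deviceUUID>]` key per discovered device."""
    devices = json.loads(device_get_json)
    return [f"{metric}[{device['device_uuid']}]" for device in devices]


temperature_keys = item_keys(SAMPLE_DEVICES, "nvml.device.temperature")
```

The number of generated keys should match the value reported by `nvml.device.count` on the same host.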

## General Device Metrics
- **`nvml.device.temperature[<deviceUUID>]`**
  Returns a single value: (unsigned int) temperature of the device in Celsius.

- **`nvml.device.serial[<deviceUUID>]`**
  Returns a single value: (string) serial number of the device.

- **`nvml.device.fan.speed.avg[<deviceUUID>]`**
  Returns a single value: (unsigned int) average fan speed as a percentage of maximum speed.

- **`nvml.device.performance.state[<deviceUUID>]`**
  Returns a single value: (unsigned int) performance state of the device (0 = maximum performance, 15 = minimum performance).

- **`nvml.device.energy.consumption[<deviceUUID>]`**
  Returns a single value: (unsigned int) total energy consumption in millijoules (mJ) since the driver was last reloaded.

- **`nvml.device.power.limit[<deviceUUID>]`**
  Returns a single value: (unsigned int) power limit in milliwatts.

- **`nvml.device.power.usage[<deviceUUID>]`**
  Returns a single value: (unsigned int) current power usage in milliwatts.
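
Because the power metrics are reported in milliwatts and the energy counter in millijoules, a unit conversion is usually needed before graphing. The following sketch shows the arithmetic; the function names are hypothetical, and the handling of a decreasing counter is an assumption based on the counter restarting when the driver is reloaded.

```python
def milliwatts_to_watts(milliwatts: int) -> float:
    """Convert a power reading from mW to W."""
    return milliwatts / 1000.0


def average_power_watts(start_mj: int, end_mj: int, interval_s: float) -> float:
    """Average power between two energy counter samples.

    Energy is in millijoules; 1 W = 1 J/s, so W = (delta mJ / 1000) / seconds.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    if end_mj < start_mj:
        # The counter restarts when the driver is reloaded, so a
        # decreasing value cannot be used for a delta.
        raise ValueError("energy counter decreased between samples")
    return (end_mj - start_mj) / 1000.0 / interval_s


def power_limit_percent(usage_mw: int, limit_mw: int) -> float:
    """Current power usage as a percentage of the power limit."""
    return 100.0 * usage_mw / limit_mw
```

For instance, two `nvml.device.energy.consumption` samples of 1,000,000 mJ and 1,600,000 mJ taken 60 seconds apart correspond to an average draw of 10 W.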

## Device Memory Metrics
- **`nvml.device.memory.bar1.get[<deviceUUID>]`**
  Returns a JSON structure with the following fields (in bytes):
  - **`total_memory_bytes`**: Total BAR1 memory available on the GPU.
  - **`free_memory_bytes`**: Available BAR1 memory.
  - **`used_memory_bytes`**: BAR1 memory currently in use.

- **`nvml.device.memory.fb.get[<deviceUUID>]`**
  Returns a JSON structure with the following fields (in bytes):
  - **`total_memory_bytes`**: Total framebuffer memory of the GPU.
  - **`reserved_memory_bytes`**: Memory reserved for internal GPU operations.
  - **`free_memory_bytes`**: Available framebuffer memory.
  - **`used_memory_bytes`**: Memory currently in use (includes reserved memory).

### Notes
- Reserved memory is included in the used memory.
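
Since reserved memory is counted as part of used memory, it can be subtracted when only application allocations are of interest. The sketch below works on a made-up `nvml.device.memory.fb.get` payload; the numbers and the `fb_summary` helper are illustrative assumptions.

```python
import json

# Hypothetical framebuffer payload for an 8 GiB device.
SAMPLE_FB = json.dumps({
    "total_memory_bytes": 8589934592,
    "reserved_memory_bytes": 268435456,
    "free_memory_bytes": 6442450944,
    "used_memory_bytes": 2147483648,
})


def fb_summary(fb_json: str) -> dict[str, float]:
    """Derive a usage percentage and the used memory without the reserved part."""
    fb = json.loads(fb_json)
    used = fb["used_memory_bytes"]
    return {
        "used_percent": 100.0 * used / fb["total_memory_bytes"],
        # Reserved memory is included in used memory, so subtract it here.
        "used_excluding_reserved_bytes": used - fb["reserved_memory_bytes"],
    }


summary = fb_summary(SAMPLE_FB)
```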

## Device ECC Mode
- **`nvml.device.ecc.mode[<deviceUUID>]`**
  Returns a JSON structure with the following fields:
  - **`current`**: The current ECC mode (bool).
  - **`pending`**: The pending ECC mode (bool) to be applied after reboot.
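
A difference between `current` and `pending` means that an ECC mode change is waiting for a reboot. The following sketch checks for that condition; it assumes the two fields are serialized as JSON booleans, and `ecc_change_pending` is a hypothetical helper.

```python
import json


def ecc_change_pending(ecc_json: str) -> bool:
    """Return True if the ECC mode will change after the next reboot."""
    mode = json.loads(ecc_json)
    return mode["current"] != mode["pending"]
```

For example, a payload of `{"current": false, "pending": true}` would indicate that ECC has been enabled but is not active yet.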

## Device ECC Error Metrics
- **`nvml.device.errors.memory[<deviceUUID>]`**
  Returns a JSON structure with the following fields:
  - **`corrected`**: Count of ECC errors that were corrected in memory.
  - **`uncorrected`**: Count of ECC errors that could not be corrected in memory.

- **`nvml.device.errors.register[<deviceUUID>]`**
  Returns a JSON structure with the following fields:
  - **`corrected`**: Count of ECC errors that were corrected in the register file.
  - **`uncorrected`**: Count of ECC errors that could not be corrected in the register file.

## Device PCI Metrics
- **`nvml.device.pci.utilization[<deviceUUID>]`**
  Returns a JSON structure with the following fields:
  - **`tx_rate_kb_s`**: PCI transmit throughput in KB/s.
  - **`rx_rate_kb_s`**: PCI receive throughput in KB/s.

## Device Encoder/Decoder Metrics
- **`nvml.device.encoder.stats.get[<deviceUUID>]`**
  Returns a JSON structure with the following fields:
  - **`session_count`**: Count of active encoder sessions.
  - **`average_fps`**: Average FPS of all active sessions.
  - **`average_latency_ms`**: Encode latency in microseconds.

- **`nvml.device.encoder.utilization[<deviceUUID>]`**
  Returns a single value: (unsigned int) encoder utilization as a percentage.

- **`nvml.device.decoder.utilization[<deviceUUID>]`**
  Returns a single value: (unsigned int) decoder utilization as a percentage.

## Device Frequency Metrics
- **`nvml.device.video.frequency[<deviceUUID>]`**
  Returns a single value: (unsigned int) video clock speed in MHz.

- **`nvml.device.graphics.frequency[<deviceUUID>]`**
  Returns a single value: (unsigned int) graphics clock speed in MHz.

- **`nvml.device.sm.frequency[<deviceUUID>]`**
  Returns a single value: (unsigned int) streaming multiprocessor (SM) clock speed in MHz.

- **`nvml.device.memory.frequency[<deviceUUID>]`**
  Returns a single value: (unsigned int) memory clock speed in MHz.

## Device Utilization Metrics
- **`nvml.device.utilization[<deviceUUID>]`**
  Returns a JSON structure with the following fields:
  - **`device`**: GPU utilization as a percentage.
  - **`memory`**: Memory utilization as a percentage.

## Troubleshooting

The plugin forwards all its logs to Zabbix Agent 2, which writes them to the log location configured for the agent.

For debugging purposes, you can increase the Zabbix Agent 2 log level by either updating the `DebugLevel` parameter in the configuration file or using runtime control with the following command:

```sh
zabbix_agent2 -R log_level_increase