Problem report
Resolution: Fixed
Trivial
7.0.9, 7.2.2, 7.2.3
None
Container images of all flavors from https://github.com/zabbix/zabbix-docker or installation from RPM (RHEL 8 and 9 servers).
Zabbix server with/without proxy involved.
Large corporate network, hosts with dual stack enabled (IPv4/IPv6); however, only IPv4 is actually used. The DNS system replies with NOERROR and an empty answer section to AAAA queries, and replies with valid IPv4 addresses and a TTL set between 10 and 15 minutes (depending on site and type of host) because DHCP is involved.
Prev.Sprint, S25-W8/9
3
Steps to reproduce:
Clean installation of 7.2.x in any form (RPM, or a spun-up container)
Result:
Hosts with passive agent checks go up and down randomly, not collecting items as expected.
Expected:
Stable status of hosts and collection of passive agent items
We made an attempt to upgrade from 6.4 to 7.2 but had to roll back due to this unfortunate experience. We are running sites with ~2K hosts monitored in this way.
We were also able to narrow down the root cause of the problem (or so we believe):
This problem shows up for both passive agent checks as well as for SNMP checks, however for SNMP checks it resolves itself after a period of time as these devices are usually added with static IPv4 assignments.
For DHCP-allocated hosts this problem is really relevant, and we were not able to find a workaround either on the DNS side or by tweaking the OS/container setup.
On the DNS side we observe A and AAAA queries sent in parallel (which is normal) and responses sent back in no specific order. On the Zabbix side we can see loads of log entries like this:
Zabbix agent item "system.cpu.util" on host "****" failed: first network error, wait for 15 seconds
Querying is done over UDP, thus we have no control over the order in which messages are delivered and processed on the Zabbix side.
Using tcpdump we were able to figure out that the failure happens most of the time when AAAA records are delivered before A records. This is where DHCP helped isolate the cause: AAAA responses are delivered instantly, as they are cached with long TTLs (IPv6 addresses do not exist for these hosts), while IPv4 addresses take longer to resolve when the cache is invalidated due to TTL expiration. At that moment it is clearly visible how hosts toggle from online to unavailable. Otherwise it happens fairly randomly, depending on which replies are delivered first.
Changing the responses to AAAA queries to NXDOMAIN reliably brings the host down, which follows the standard (NXDOMAIN states the name does not exist at all). Tweaking the OS/container to disable IPv6 in the kernel has no effect, as libevent's implementation of getaddrinfo clearly disregards this. Adjusting the getaddrinfo-allow-skew option (libevent's invention) in /etc/resolv.conf to tolerate bigger delays between AAAA and A responses does not have an effect either.
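For reference, the resolv.conf tuning attempted looked roughly like this (values illustrative, taken from the logs later in this ticket; the getaddrinfo-allow-skew option is parsed only by libevent's evdns, not by glibc):

nameserver 192.168.0.10
options getaddrinfo-allow-skew:10.1 timeout:30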
On top of that, looking into the async_dns_event code (asyncpoller.c) and adding simple logging guards reveals that libevent always delivers only a single evutil_addrinfo in the ai parameter, as opposed to the potentially expected two: one for AAAA and one for A records.
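A minimal sketch of such a logging guard (the callback shape follows libevent's evdns_getaddrinfo_cb; this is illustrative, not the actual Zabbix code):

#include <stdio.h>
#include <event2/util.h>
#include <event2/dns.h>

/* Count the evutil_addrinfo entries libevent hands back; with both A and
 * AAAA answered one would expect at least two here, yet we always saw 0 or 1. */
static void	dns_result_guard(int result, struct evutil_addrinfo *ai, void *arg)
{
	int	n = 0;

	(void)arg;

	for (struct evutil_addrinfo *p = ai; NULL != p; p = p->ai_next)
		n++;

	fprintf(stderr, "dns guard: result=%d addrinfo entries=%d\n", result, n);
}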
Would it be possible to either have an option to refrain from using libevent's implementation of getaddrinfo (use the OS implementation instead), or to implement resolution using evdns_base_resolve_ipv4/ipv6 in a similar manner to how it is done for reverse lookups (as seen in the async_event implementation in the same source file)?
duplicates: ZBX-25899 Monitoring issues with IPv6 when IPv4 is not available (Closed)
is duplicated by: ZBX-24572 Host agent dont work with DNS name settings (Closed)
part of: ZBXNEXT-1002 dns caching by zabbix daemons (Open)
related to: ZBXNEXT-1275 Use c-ares for DNS resolving (Open)
[ZBX-26014] Zabbix passive agent checks fail randomly due to DNS resolution issues (libevent)
Thank you for your detailed report. Could it be that this is something to be addressed in the libevent bug tracker? Also, if possible, please try ZBX-25899 when available to see if it resolves the issue after all.
Sure, I'll give it a try - we would really like to upgrade to v7. However, I doubt your changes are going to help, as the core problem is still there: you still set up a call to libevent's implementation of getaddrinfo (https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/src/libs/zbxasyncpoller/asyncpoller.c?at=refs%2Fheads%2Ffeature%2FZBX-25899-7.3#286). This function is tricky to implement correctly across different libc/glibc/musl flavors, and libevent has not excelled with its own implementation at all.
I might have overlooked something (hopefully!), but debugging at your async_dns_event (in asyncpoller.c) I have clearly seen that while the DNS server sends 2 responses over UDP, in your event handler you get
- either null (if the empty response arrives first, at https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/src/libs/zbxasyncpoller/asyncpoller.c?at=refs%2Fheads%2Ffeature%2FZBX-25899-7.3#194)
- or exactly one addrinfo (if a non-empty response arrives, at https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/src/libs/zbxasyncpoller/asyncpoller.c?at=refs%2Fheads%2Ffeature%2FZBX-25899-7.3#201). Try inspecting ai->ai_next at this point: it has always been null for me.
I believe your changes now allow better handling of the connection init phase in case the hostname resolves to multiple IP addresses (and therefore you expect multiple addrinfo structures). This is not what happens with libevent's implementation of getaddrinfo (at least in <= 2.1.12) when valid but empty DNS replies arrive first.
Submitting a bug to libevent is not an option: the project seems dead, with the latest stable release being 2.1.12 (2020-07-05).
They show some work on master, but that is not something I could take into my PROD environment. RHEL 9, Debian bookworm and Alpine 3.21 have libevent at 2.1.12-r8 in their repos; RHEL 8 provides us with libevent 2.1.8 and does not even offer 2.1.12.
They do react to some issues, but are nowhere near fixing them and releasing another stable version:
https://github.com/libevent/libevent/issues/1702
https://github.com/libevent/libevent/issues/1041
As far as I can see, the general trend has been to migrate from libevent to c-ares for async DNS resolution, and as a consequence no one on the libevent side has dared to look into issues like this.
Is something like this meant? ZBX-26014-7.3.diff - in theory it can cause problems, as it is a blocking call.
To be honest - I have not yet had enough time to propose a fix suitable from my POV.
Looking into the patch - that is not exactly what I would propose. The patch will not cause problems as such, IMO - you just issue a normal blocking call to the system's implementation of getaddrinfo. What I do not get is why issue evdns_getaddrinfo when the system call returns anything but zero? (In the best case one would get exactly the same result.)
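For clarity, a minimal sketch of the blocking call being discussed (my reading of the patch, with illustrative names; not the patch itself):

#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <string.h>

/* Resolve via the system getaddrinfo(); returns 0 on success. The question
 * above is why evdns_getaddrinfo would still be issued when this fails. */
static int	resolve_blocking(const char *host, struct addrinfo **res)
{
	struct addrinfo	hints;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;	/* both IPv4 and IPv6 */
	hints.ai_socktype = SOCK_STREAM;

	return getaddrinfo(host, NULL, &hints, res);
}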
I definitely support a non-blocking implementation for resolver.
I'd vote for
- either a slightly modified version of evdns_getaddrinfo (https://github.com/libevent/libevent/blob/5df3037d10556bfcb675bc73e516978b75fc7bc7/evdns.c#L4701) where I would not issue the IPv4 and IPv6 requests in parallel, but sequentially. Here it would also be possible to opt out of IPv6 calls entirely, since you have that option in your configure script. I.e. it would be possible to do a lookup in the local hosts file (if that is crucial) at the point where your patch proposes the blocking system call, and if that does not succeed, issue an IPv4 search (using evdns_base_resolve_ipv4) on an event thread with your ordinary callback. Then in your event callback I would assess the result and either fail gracefully or push an IPv6 lookup onto the event queue (in case IPv6 is not opted out and the returned IPv4 result does not prevent continuing the search, e.g. is not an NXDOMAIN). A sketch of this sequential idea follows after this list.
- or I'd go for some other library like c-ares.
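A minimal sketch of the first option, assuming an existing evdns_base (the names resolve_ctx, on_ipv4, on_ipv6 are illustrative, not Zabbix code): the AAAA query is issued only after the A answer has been assessed, so an empty-but-valid AAAA reply can no longer mask a valid A reply.

#include <event2/dns.h>

struct resolve_ctx
{
	struct evdns_base	*dnsbase;
	const char		*name;
};

static void	on_ipv6(int result, char type, int count, int ttl, void *addresses, void *arg)
{
	(void)result; (void)type; (void)count; (void)ttl; (void)addresses; (void)arg;
	/* final step: consume the AAAA records here, or fail gracefully */
}

static void	on_ipv4(int result, char type, int count, int ttl, void *addresses, void *arg)
{
	struct resolve_ctx	*ctx = (struct resolve_ctx *)arg;

	(void)type; (void)ttl; (void)addresses;

	if (DNS_ERR_NONE == result && 0 < count)
		return;	/* got A records: connect over IPv4, skip AAAA entirely */

	if (DNS_ERR_NOTEXIST == result)
		return;	/* NXDOMAIN: the name does not exist, stop searching */

	/* valid but empty answer: only now fall back to an AAAA lookup */
	evdns_base_resolve_ipv6(ctx->dnsbase, ctx->name, 0, on_ipv6, ctx);
}

static void	start_resolve(struct resolve_ctx *ctx)
{
	evdns_base_resolve_ipv4(ctx->dnsbase, ctx->name, 0, on_ipv4, ctx);
}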
All in all, I understand that implementing the entire complexity of getaddrinfo would be too much of a burden.
P.S. I'll try to build and deploy from the tag with the fix for 25899 and let you know if that eventually helps.
Oh, and BTW - looking more carefully into 25899, I must admit the dual-stack situation is more complex: that person had a problem connecting to a valid IPv4 address as long as his Zabbix proxy had only IPv6 connectivity. I do not know enough about Zabbix internals, and whether you have a way to figure out the available and preferred connection type on the server/proxy side. Considering my wall of text in the previous post, the order of resolving IPv4/IPv6 would definitely be impacted by the preference on the Zabbix server/proxy side.
OK, let's wait and see if ZBX-25899 helps; hopefully both addresses are delivered and just the order is sometimes wrong. Also created a branch for 7.2.
It should have returned both:
https://github.com/libevent/libevent/blob/5df3037d10556bfcb675bc73e516978b75fc7bc7/evdns.c#L4615
Please try increasing the Timeout variable in the configuration file; perhaps the timeout is too small and this should be addressed.
I've done a build from feature/ZBX-25899-7.2 and it did not help our case.
I've carved out a bit of the log just to show what happens to one of the hosts. In a small lab environment there are ~15 hosts configured in the same way. All of them exhibit the behavior below.
2025-02-11T11:36:58.431756221-05:00 103:20250211:113658.431 End of zbx_async_check_agent():SUCCEED
2025-02-11T11:36:58.431763498-05:00 103:20250211:113658.431 In zbx_async_check_agent() key:'vm.memory.size[total]' host:'USHAGWHXAAS102' addr:'ushagwhxaas102.domain.net' conn:'unencrypted'
2025-02-11T11:36:58.432088656-05:00 103:20250211:113658.431 In zbx_is_ip4() ip:'ushagwhxaas102.domain.net'
2025-02-11T11:36:58.432088656-05:00 103:20250211:113658.431 End of zbx_is_ip4():FAIL
2025-02-11T11:36:58.432088656-05:00 103:20250211:113658.431 In zbx_is_ip6() ip:'ushagwhxaas102.domain.net'
2025-02-11T11:36:58.432088656-05:00 103:20250211:113658.431 End of zbx_is_ip6():FAIL
2025-02-11T11:36:58.432088656-05:00 103:20250211:113658.431 In zbx_is_ip4() ip:'ushagwhxaas102.domain.net'
2025-02-11T11:36:58.432088656-05:00 103:20250211:113658.431 End of zbx_is_ip4():FAIL
2025-02-11T11:36:58.432088656-05:00 103:20250211:113658.431 End of zbx_async_check_agent():SUCCEED
2025-02-11T11:36:58.432088656-05:00 103:20250211:113658.431 End of async_initiate_queued_checks(): num:2
2025-02-11T11:36:58.432088656-05:00 103:20250211:113658.431 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.432088656-05:00 103:20250211:113658.431 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.432193475-05:00 102:20250211:113658.432 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.432193475-05:00 102:20250211:113658.432 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.432193475-05:00 102:20250211:113658.432 In zbx_dc_config_get_poller_items() poller_type:8
2025-02-11T11:36:58.432193475-05:00 102:20250211:113658.432 End of zbx_dc_config_get_poller_items():0
2025-02-11T11:36:58.432193475-05:00 102:20250211:113658.432 queue processing_num:0 pending:0
2025-02-11T11:36:58.432203892-05:00 103:20250211:113658.432 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.432203892-05:00 103:20250211:113658.432 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.432656387-05:00 104:20250211:113658.432 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.432656387-05:00 104:20250211:113658.432 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.432656387-05:00 100:20250211:113658.432 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.432680194-05:00 100:20250211:113658.432 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.455407921-05:00 103:20250211:113658.455 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.455407921-05:00 103:20250211:113658.455 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.527192723-05:00 103:20250211:113658.527 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.527192723-05:00 103:20250211:113658.527 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.531957445-05:00 103:20250211:113658.531 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.531957445-05:00 103:20250211:113658.531 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.622409553-05:00 103:20250211:113658.622 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.622409553-05:00 103:20250211:113658.622 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.625963108-05:00 103:20250211:113658.625 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.625963108-05:00 103:20250211:113658.625 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.631621224-05:00 103:20250211:113658.631 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.631621224-05:00 103:20250211:113658.631 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.645850208-05:00 103:20250211:113658.645 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.645850208-05:00 103:20250211:113658.645 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.717311797-05:00 103:20250211:113658.717 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.717311797-05:00 103:20250211:113658.717 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.722246865-05:00 103:20250211:113658.722 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.722246865-05:00 103:20250211:113658.722 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.731300938-05:00 103:20250211:113658.731 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.731300938-05:00 103:20250211:113658.731 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.740729554-05:00 103:20250211:113658.740 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.740729554-05:00 103:20250211:113658.740 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.818649015-05:00 103:20250211:113658.818 In async_dns_event() result:-2
2025-02-11T11:36:58.818649015-05:00 103:20250211:113658.818 cannot resolve DNS name: nodename nor servname provided, or not known
2025-02-11T11:36:58.818649015-05:00 103:20250211:113658.818 In async_event()
2025-02-11T11:36:58.818649015-05:00 103:20250211:113658.818 In agent_task_process() step 'init' event:1 itemid:169892 addr:
2025-02-11T11:36:58.818649015-05:00 103:20250211:113658.818 In process_async_result() itemid:169892 key:'vm.memory.size[total]' host:'USHAGWHXAAS102' addr:'ushagwhxaas102.domain.net'
2025-02-11T11:36:58.818649015-05:00 103:20250211:113658.818 finished processing itemid:169892 processing:1
2025-02-11T11:36:58.818649015-05:00 103:20250211:113658.818 End of process_async_result():TIMEOUT_ERROR
2025-02-11T11:36:58.818693712-05:00 103:20250211:113658.818 End of async_event():stop
2025-02-11T11:36:58.818693712-05:00 103:20250211:113658.818 End of async_dns_event()
2025-02-11T11:36:58.818693712-05:00 103:20250211:113658.818 In zbx_ipc_async_socket_recv() timeout:0
2025-02-11T11:36:58.818701361-05:00 103:20250211:113658.818 End of zbx_ipc_async_socket_recv():0
2025-02-11T11:36:58.818715291-05:00 103:20250211:113658.818 interfaces num:1
2025-02-11T11:36:58.818715291-05:00 103:20250211:113658.818 requeue num:1
2025-02-11T11:36:58.818721337-05:00 103:20250211:113658.818 In poller_update_interfaces() num:1
2025-02-11T11:36:58.818721337-05:00 103:20250211:113658.818 In zbx_deactivate_item_interface() interfaceid:9108 itemid:169892 type:0
2025-02-11T11:36:58.818874017-05:00 103:20250211:113658.818 Zabbix agent item "vm.memory.size[total]" on host "USHAGWHXAAS102" failed: another network error, wait for 15 seconds
2025-02-11T11:36:58.818874017-05:00 103:20250211:113658.818 zbx_deactivate_item_interface() errors_from:1739291788 available:1
2025-02-11T11:36:58.818874017-05:00 103:20250211:113658.818 End of zbx_deactivate_item_interface()
2025-02-11T11:36:58.818874017-05:00 103:20250211:113658.818 In zbx_ipc_socket_write()
2025-02-11T11:36:58.818874017-05:00 103:20250211:113658.818 End of zbx_ipc_socket_write():SUCCEED
2025-02-11T11:36:58.818874017-05:00 103:20250211:113658.818 End of poller_update_interfaces()
2025-02-11T11:36:58.818874017-05:00 103:20250211:113658.818 requeue items nextcheck:1739291820
2025-02-11T11:36:58.835777876-05:00 103:20250211:113658.835 In async_dns_event() result:-2
2025-02-11T11:36:58.835777876-05:00 103:20250211:113658.835 cannot resolve DNS name: nodename nor servname provided, or not known
2025-02-11T11:36:58.835777876-05:00 103:20250211:113658.835 In async_event()
When it comes to libevent behavior:
As I mentioned in the opening post, increasing the timeout does not help (and this is a global allowed skew, configured through options in /etc/resolv.conf: https://github.com/libevent/libevent/blob/5df3037d10556bfcb675bc73e516978b75fc7bc7/evdns.c#L4444). With tcpdump I can see both responses (A and AAAA) arriving almost instantly.
However, Timeout in the server/proxy config is set to 30 seconds.
The addrinfo structure returned to the Zabbix callback always contained either null or a pointer to a single structure (ai->ai_next is always null), and an error code of -2 (which is EVUTIL_EAI_NONAME (NXDOMAIN) as translated by evutil_gai_strerror()). There are never 2 answers (regardless of the default or a custom skew value). As far as I can see, EVUTIL_EAI_NODATA never appears among the returned values.
It might be reasonable to take a look into libevent's internals. Sorry for being lazy: could you please point out the proper place in Zabbix where libevent is initialized? I'd consider adding event_set_log_callback and a call to event_enable_debug_logging(EVENT_DBG_ALL).
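For reference, the calls I have in mind (a sketch; where exactly to hook this into Zabbix's startup is exactly my question):

#include <stdio.h>
#include <event2/event.h>

/* Forward libevent's internal messages somewhere visible; in Zabbix this
 * would presumably route to zabbix_log() instead of stderr. */
static void	ev_log_cb(int severity, const char *msg)
{
	fprintf(stderr, "libevent(%d): %s\n", severity, msg);
}

static void	enable_libevent_debug(void)
{
	/* must run before any event_base is created */
	event_set_log_callback(ev_log_cb);
	event_enable_debug_logging(EVENT_DBG_ALL);
}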
Great, thank you for the patches; I'll be back with the investigation outcome.
Is this issue only happening on RHEL? If some more up-to-date distribution is used, or Docker, does the issue not occur?
I have the following environments at my disposal where I have tested and verified this issue:
1) Win11 dev machines with podman. Zabbix container images (alpine, centos, ubuntu - server with pgsql)
2) RHEL 8, Zabbix RPM
3) RHEL 9, Zabbix RPM
4) RHEL 8, crun + crio - Zabbix in custom built container image (just different packaging omitting startup scripts, including other components) (alpine-based, forked off https://github.com/zabbix/zabbix-docker)
Scenario #4 is our target for production use.
All of these scenarios exhibit the same behavior. Switching environments was the first move we naturally took in order to mitigate the problem. Our RHEL installations are patched centrally on a monthly basis, and I have checked that there are no recent patches that would have impacted the behavior (which would be even less of a factor for the containerized setup).
vso - thank you for your support and prompt patches. I feel like I've narrowed it down.
First and foremost: with ZBX-26014-7.3.diff the Zabbix server/proxy works as expected. libevent's resolver never got called, though I did not specifically test a host with an FQDN that would not resolve.
Now to the libevent issue that I believe is causing this unfortunate behavior. With evdns logs enabled I spotted behavior I did not expect (the following log is truncated for readability):
Starting Zabbix Proxy (active) [Proxy-HAG]. Zabbix 7.2.4rc1 (revision 57e8569). Press Ctrl+C to exit.
2:20250212:071916.351 Starting Zabbix Proxy (active) [Proxy-HAG]. Zabbix 7.2.4rc1 (revision 57e8569).
2:20250212:071916.351 **** Enabled features ****
-----
103:20250212:071916.746 EVDNS: [0] Parsing resolv.conf file /etc/resolv.conf
104:20250212:071916.746 EVDNS: [0] Parsing resolv.conf file /etc/resolv.conf
103:20250212:071916.746 EVDNS: [0] Added nameserver 192.168.0.10:53 as 0x7fa7065dd7c0
103:20250212:071916.746 EVDNS: [0] Setting ndots to 1
104:20250212:071916.746 EVDNS: [0] Added nameserver 192.168.0.10:53 as 0x7fa7065dd7c0
104:20250212:071916.746 EVDNS: [0] Setting ndots to 1
104:20250212:071916.746 EVDNS: [0] Setting getaddrinfo-allow-skew to 10.1
104:20250212:071916.746 EVDNS: [0] Setting timeout to 30
103:20250212:071916.746 EVDNS: [0] Setting getaddrinfo-allow-skew to 10.1
103:20250212:071916.746 EVDNS: [0] Setting timeout to 30
-----
103:20250212:071917.546 EVDNS: [0] Sending request for ushagwhxaas101.company.domain on ipv4 as 0x7fa6fb96b800
103:20250212:071917.546 EVDNS: [0] Resolve requested for ushagwhxaas101.company.domain
103:20250212:071917.546 EVDNS: [0] Setting timeout for request 0x7fa6fb968cd0, sent to nameserver 0x7fa7065dd7c0
103:20250212:071917.546 EVDNS: [0] Sending request for ushagwhxaas101.company.domain on ipv6 as 0x7fa6fb96b810
103:20250212:071917.546 EVDNS: [0] Resolve requested for ushagwhxaas101.company.domain
103:20250212:071917.546 EVDNS: [0] Setting timeout for request 0x7fa6fb968e60, sent to nameserver 0x7fa7065dd7c0
103:20250212:071917.546 EVDNS: [0] Search: now trying ushagwhxaas118.company.domain.suffix.search.list (0)
103:20250212:071917.546 EVDNS: [0] Removing timeout for request 0x7fa6fb968820
103:20250212:071917.546 EVDNS: [0] Setting timeout for request 0x7fa6fb95e030, sent to nameserver 0x7fa7065dd7c0
103:20250212:071917.546 EVDNS: [0] Removing timeout for request 0x7fa6fb9689b0
103:20250212:071917.547 EVDNS: [0] Removing timeout for request 0x7fa7064d3e40
103:20250212:071917.547 EVDNS: [0] Search: now trying ushagwhxaas106.company.domain.suffix.search.list (0)
103:20250212:071917.547 EVDNS: [0] Removing timeout for request 0x7fa6fb968b40
103:20250212:071917.547 EVDNS: [0] Setting timeout for request 0x7fa6fb95e1c0, sent to nameserver 0x7fa7065dd7c0
103:20250212:071917.547 EVDNS: [0] Search: now trying ushagwhxaas101.company.domain.suffix.search.list (0)
103:20250212:071917.547 EVDNS: [0] Removing timeout for request 0x7fa6fb968cd0
103:20250212:071917.547 EVDNS: [0] Setting timeout for request 0x7fa6fb95e350, sent to nameserver 0x7fa7065dd7c0
103:20250212:071917.547 EVDNS: [0] Search: now trying ushagwhxaas101.company.domain.suffix.search.list (0)
103:20250212:071917.547 EVDNS: [0] Removing timeout for request 0x7fa6fb968e60
103:20250212:071917.547 EVDNS: [0] Setting timeout for request 0x7fa6fb95e4e0, sent to nameserver 0x7fa7065dd7c0
103:20250212:071917.642 EVDNS: [0] Search: now trying ushagwhxaas101.company.domain.search.list (1)
103:20250212:071917.642 EVDNS: [0] Removing timeout for request 0x7fa6fb95e4e0
103:20250212:071917.642 EVDNS: [0] Setting timeout for request 0x7fa6fb95e670, sent to nameserver 0x7fa7065dd7c0
103:20250212:071917.643 EVDNS: [0] Search: now trying ushagwhxaas106.company.domain.search.list (1)
103:20250212:071917.643 EVDNS: [0] Removing timeout for request 0x7fa6fb95e1c0
103:20250212:071917.643 EVDNS: [0] Setting timeout for request 0x7fa7064d3e50, sent to nameserver 0x7fa7065dd7c0
103:20250212:071917.656 EVDNS: [0] Search: now trying ushagwhxaas118.company.domain.search.list (1)
103:20250212:071917.656 EVDNS: [0] Removing timeout for request 0x7fa6fb95e030
103:20250212:071917.656 EVDNS: [0] Setting timeout for request 0x7fa6fb95e1b0, sent to nameserver 0x7fa7065dd7c0
103:20250212:071917.737 EVDNS: [0] Search: now trying ushagwhxaas101.company.domain.list (2)
103:20250212:071917.737 EVDNS: [0] Removing timeout for request 0x7fa6fb95e670
103:20250212:071917.737 EVDNS: [0] Setting timeout for request 0x7fa6fb95e4d0, sent to nameserver 0x7fa7065dd7c0
103:20250212:071917.739 EVDNS: [0] Search: now trying ushagwhxaas106.company.domain.list (2)
103:20250212:071917.739 EVDNS: [0] Removing timeout for request 0x7fa7064d3e50
103:20250212:071917.739 EVDNS: [0] Setting timeout for request 0x7fa6fb95e020, sent to nameserver 0x7fa7065dd7c0
103:20250212:071917.766 EVDNS: [0] Search: now trying ushagwhxaas118.company.domain.list (2)
103:20250212:071917.766 EVDNS: [0] Removing timeout for request 0x7fa6fb95e1b0
103:20250212:071917.766 EVDNS: [0] Setting timeout for request 0x7fa6fb95e660, sent to nameserver 0x7fa7065dd7c0
103:20250212:071917.768 EVDNS: [0] Search: now trying ushagwhxaas101.company.domain.search.list (1)
103:20250212:071917.768 EVDNS: [0] Removing timeout for request 0x7fa6fb95e350
103:20250212:071917.768 EVDNS: [0] Setting timeout for request 0x7fa7064d3e40, sent to nameserver 0x7fa7065dd7c0
103:20250212:071917.832 EVDNS: [0] Removing timeout for request 0x7fa6fb95e4d0
103:20250212:071917.836 EVDNS: [0] Removing timeout for request 0x7fa6fb95e020
103:20250212:071917.876 EVDNS: [0] Removing timeout for request 0x7fa6fb95e660
103:20250212:071917.878 EVDNS: [0] Search: now trying ushagwhxaas101.company.domain.list (2)
103:20250212:071917.878 EVDNS: [0] Removing timeout for request 0x7fa7064d3e40
103:20250212:071917.878 EVDNS: [0] Setting timeout for request 0x7fa6fb968810, sent to nameserver 0x7fa7065dd7c0
103:20250212:071917.988 EVDNS: [0] Removing timeout for request 0x7fa6fb968810
103:20250212:071917.988 Zabbix agent item "system.cpu.util" on host "USHAGWHXAAS101" failed: first network error, wait for 15 seconds
...
The host exists and is happily resolved by the system resolver (with the patch applied).
There are two things that go wrong here:
- despite the sample host being referenced by FQDN and ndots set to 1, the evdns resolver resorts to looping through the list of search domains defined in /etc/resolv.conf. The root cause is that reply_handle() will always go for a search in the case of forward lookups. That in turn is caused by normal IPv4 and IPv6 lookups not setting DNS_QUERY_NO_SEARCH (we'd still want unqualified names resolved, right? - see the one-line sketch after this list). While this behavior can be tolerated in some scenarios, generally it goes against what is sought. Besides, for AAAA-type queries we get a valid but empty response, which must stop any further requests: otherwise we'll get a record for something we did not ask for - this goes against the RFC and has already been brought to libevent in this issue.
- What goes entirely wrong here is the behavior of evdns_getaddrinfo_gotresolve, the callback for both the IPv4 and IPv6 resolvers, which tries to get hold of both of them. The IPv6 side will misinterpret the NOERROR + 0 answers condition, go searching by appending suffixes, and end up with NXDOMAIN (because the suffixed names do not exist). But due to some weird error-condition evaluation between the parallel requests, it will misinterpret responses and continue the search within subdomains for IPv4 as well. This happens in an unpredictable way, though probably on the same thread (103).
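A one-line sketch of the flag in question (a fragment; dnsbase, on_resolve and ctx are assumed to exist): passing DNS_QUERY_NO_SEARCH suppresses the search-list looping described in the first bullet for names that are already fully qualified.

#include <event2/dns.h>

/* the flag is the point here; everything else is illustrative */
evdns_base_resolve_ipv4(dnsbase, "ushagwhxaas101.company.domain", DNS_QUERY_NO_SEARCH, on_resolve, ctx);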
As you can see from the init section, the timers are rather tolerant: 30 seconds for the total timeout and 10 seconds for the skew. The entire DNS lookup failure fits within less than a second.
P.S. I had to edit my post - color-coding does not work in a code snippet, so all the line highlighting I had applied is lost. Please look at the failing host: USHAGWHXAAS101
There is a sketch of a solution using c-ares; you can try feature/ZBX-26014-7.2 to see if the issue persists or if c-ares helps.
I built it from https://git.zabbix.com/projects/ZBX/repos/zabbix/commits/b2099c773f6bf15cba01e2023d8e146f925a5df3 and it behaves quite stably.
I can see there are more changes incoming.
Let me know when you have something you would like me to test and I'll redeploy it on our side.
Out of curiosity: what is your plan for the c-ares channel: one per server/proxy process, or one per thread?
In this solution it's per process, and each process can do 1000 asynchronous checks or DNS resolutions. It can be used on the Zabbix server or on a Zabbix proxy, and 1000 such processes can be created.
I've built Zabbix from the latest commit on feature/ZBX-26014-7.2 today and monitored the behavior throughout the day.
It seems mostly stable now, with some interesting behavior:
- right after Zabbix startup I can see the majority of hosts go down, and log entries say Zabbix is going to wait 15 seconds before querying again. Subsequent queries successfully return DNS results and the interfaces become available. In the frontend it looks like ~half of the hosts go down after restarting the proxy, and all of them come back within a couple of minutes.
- after all hosts are online, they remain online, with occasional entries in the Zabbix proxy log that a DNS query timed out and a retry will be attempted within 15 seconds. Below is a log extract showing what is happening: presumably within 3 seconds there are 3 requeues with a 1-second timeout.
102:20250214:113411.512 In zbx_async_check_agent() key:'vfs.fs.size[C:,pused]' host:'USHAGWHXAAS123' addr:'ushagwhxaas123....' conn:'unencrypted'
102:20250214:113411.512 In zbx_is_ip4() ip:'ushagwhxaas123....'
... hard to link items in between to this particular request ...
102:20250214:113414.511 In ares_addrinfo_cb() result:12 timeouts:3
102:20250214:113414.511 cannot resolve DNS name: Timeout while contacting DNS servers
102:20250214:113414.511 In async_event()
102:20250214:113414.511 In agent_task_process() step 'init' event:1 itemid:170111 addr:
102:20250214:113414.511 In process_async_result() itemid:170111 key:'vfs.fs.size[C:,pused]' host:'USHAGWHXAAS123' addr:'ushagwhxaas123....'
102:20250214:113414.511 finished processing itemid:170111 processing:0
102:20250214:113414.511 End of process_async_result():TIMEOUT_ERROR
102:20250214:113414.511 End of async_event():stop
102:20250214:113414.511 End of ares_addrinfo_cb()
102:20250214:113414.511 interfaces num:1
102:20250214:113414.511 requeue num:1
102:20250214:113414.511 In poller_update_interfaces() num:1
102:20250214:113414.511 In zbx_deactivate_item_interface() interfaceid:9066 itemid:170111 type:0
104:20250214:113414.511 In zbx_ipc_async_socket_recv() timeout:0
104:20250214:113414.511 End of zbx_ipc_async_socket_recv():0
104:20250214:113414.511 In zbx_dc_config_get_poller_items() poller_type:8
102:20250214:113414.511 Zabbix agent item "vfs.fs.size[C:,pused]" on host "USHAGWHXAAS123" failed: first network error, wait for 15 seconds
..... and then 15 seconds later .....
104:20250214:113429.515 In async_event()
104:20250214:113429.515 In agent_task_process() step 'receive' event:2 itemid:170111 addr:xx.xx.xxx.xxx
104:20250214:113429.515 get value from agent result: '24.745747'
104:20250214:113429.515 In process_async_result() itemid:170111 key:'vfs.fs.size[C:,pused]' host:'USHAGWHXAAS123' addr:'ushagwhxaas123....'
104:20250214:113429.515 In zbx_preprocess_item_value()
104:20250214:113429.515 End of zbx_preprocess_item_value()
104:20250214:113429.515 finished processing itemid:170111 processing:2
104:20250214:113429.515 End of process_async_result():SUCCEED
104:20250214:113429.515 End of async_event():stop
104:20250214:113429.515 In zbx_ipc_socket_write()
104:20250214:113429.515 interfaces num:1
104:20250214:113429.515 requeue num:1
104:20250214:113429.515 In poller_update_interfaces() num:1
104:20250214:113429.515 In zbx_activate_item_interface() interfaceid:9066 itemid:170111 type:0 version:0
104:20250214:113429.515 resuming Zabbix agent checks on host "USHAGWHXAAS123": connection restored
I do understand you went for your own event loop when invoking c-ares, as you already have one based on libevent, hence you implemented the socket state callback. Is there a specific reason I missed for not using ares_timeout(), instead of administratively setting a maximum timeout in the init options? Could the former possibly help mitigate the behavior exposed at (re)start of the Zabbix process, where the very first queries instantly fail, and probably the occasional timeouts as well? It would also benefit from the library's dynamic server timeout calculation.
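A minimal sketch of what I mean by using ares_timeout() (channel and timer_ev are assumed to exist in the poller; this is not Zabbix code):

#include <sys/time.h>
#include <ares.h>
#include <event2/event.h>

/* Ask c-ares how long the event loop may sleep before pending queries must
 * be timed out, instead of using a fixed administrative interval. */
static void	schedule_ares_timeout(ares_channel channel, struct event *timer_ev)
{
	struct timeval	max_tv = {1, 0};	/* upper bound when nothing is pending */
	struct timeval	tv;

	ares_timeout(channel, &max_tv, &tv);	/* earliest pending deadline, capped */
	evtimer_add(timer_ev, &tv);
}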
If this change is accepted, will you also consider adding ares_reinit() to zbx_async_dns_update_host_addresses()?
And a last question that may be relevant within the context of this feature: a single host with multiple Zabbix items seems to trigger a DNS lookup and a TCP connection for every such item whenever a value needs to be collected. Are there any plans to reconsider this behavior in favor of persistent connections (or at least longer-lived ones, closed on some timeout but allowing several items to be collected)? What would be a factor discouraging such an approach: limits on file descriptors available per process?
Please feel free to set your own timeout in init, and more retries, to see if it helps; it should be respected, as timeouts now come only from the c-ares source.
Please check again, OAMike, I believe all questions were addressed, except the last one, which is currently out of scope, but I agree that bulk collection would also be good to have.
vso, I was about to propose a patch to address the timer in the way I explained earlier - but I can see you have already made the corresponding commits on the branch.
Let me test latest changes and come back before the end of today.
Ouch, with the latest changes everything went wrong: all hosts are down and the logs just show that names are not resolved because DNS times out.
What feels weird is that ares_sock_cb() almost never gets called. It feels like ares_process_fd() is being called from async_timeout_timer() with ARES_SOCKET_BAD arguments and simply times out all queries without actually processing incoming data on the sockets. samplelog1.log
Not sure why, the timeout seems to be correct; please compile with "--with-cares".
Also added itemid to zbx_async_check_agent so it's easier to search by itemid where a check started and where it timed out; in your case it seems that 3 seconds have passed.
However, in your case for some reason sock_state_cb was not called.
Having compiled the latest version - it still exhibits the same behavior: no names are getting resolved (or maybe only the ones instantly returned by the local DNS cache). What bothers me is exactly why sock_state_cb() is not getting called properly.
Am I correct that there is a separate timeout event spawned via evtimer_add(poller_config->async_timeout_timer, &tv); which basically keeps running on its own, rescheduling itself via evtimer_add(poller_config->async_timeout_timer, &tv_next);? It appears to call ares_process_fd(poller_config->channel, ARES_SOCKET_BAD, ARES_SOCKET_BAD); on the timer - which is meant to process timeouts.
At the same time, sock_state_cb() upon a read event should spawn a new event with the ares_sock_cb() callback, which in turn should actually process the readable data with ares_process_fd((ares_channel_t *)arg, (events & EV_READ) ? fd : ARES_SOCKET_BAD, (events & EV_WRITE) ? fd : ARES_SOCKET_BAD); That in turn should trigger the callback that parses the response from the DNS server.
I can only guess that the timeout-processing event and the socket-state-callback event are competing for processing sockets via ares_process_fd, and the timeout event handler wins - effectively clearing the socket of read data before this data can actually be read.
What I see in c-ares' basic examples (the Sock State Callback example) is that in their process() function - an infinite loop - they call either ares_process_fd(... ARES_SOCKET_BAD) or ares_process_fd(... actual_fd), but not both. Hence my guess that there might be a race for the same fd.
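In other words, the pattern from the c-ares example looks roughly like this (a sketch, not Zabbix code): each wakeup calls ares_process_fd() exactly once, either for a ready fd or for timeout-only processing, never both on the same wakeup.

#include <ares.h>
#include <event2/event.h>

/* timer expired and no fd was ready: let c-ares expire overdue queries */
static void	on_timer(evutil_socket_t fd, short events, void *arg)
{
	(void)fd; (void)events;
	ares_process_fd((ares_channel)arg, ARES_SOCKET_BAD, ARES_SOCKET_BAD);
}

/* an fd is ready: processing it also covers timeout bookkeeping, so no
 * separate ARES_SOCKET_BAD call is needed on this wakeup */
static void	on_sock(evutil_socket_t fd, short events, void *arg)
{
	ares_process_fd((ares_channel)arg,
			(events & EV_READ) ? fd : ARES_SOCKET_BAD,
			(events & EV_WRITE) ? fd : ARES_SOCKET_BAD);
}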
Logging could be more informative if the Zabbix item id being processed were logged in the timeout event callback and in the socket event callback - is that possible?
Otherwise the log currently looks pretty much the same, not allowing to correlate a timeout event with the actual item processed (or I'm just tired by the end of the day).
They might actually be competing, as the docs say: "ares_timeout should be used to retrieve the desired timeout, and when the timeout expires, the integrator must call ares_process_fds with a NULL events array. (or ares_process_fd with both sockets set to ARES_SOCKET_BAD). There is no need to do this if events are also delivered for any file descriptors as timeout processing will automatically be handled by any call to ares_process_fds or ares_process_fd." Can there be a race condition?
Does it help if the timeout is hardcoded to 1 second again? Or maybe it is due to the reinit that was added? Can you please comment out reinit?
I guess reinit only fires upon /etc/[hosts|resolv.conf] changes - I can't see this function call in the debug logs.
However let me play with timers:
- disable timeout event
- set the timer to 1 sec in init options (however, with 30 seconds in my config this feels weird)
I encountered the same issue when I started testing; adding a small fix seems to have fixed the issue, "--with-ares"
Now I believe I've nailed it down. May I propose the following patch: rearm_timer.diff
It basically does 2 things:
- it initially creates the timeout event with the timeout value from the configuration (instead of the default 1 sec in async_poller_init)
- it rearms the timeout event whenever there is a read/write event on an ares socket, so that there is no accidental race between an actual ares_process_fd_cb event and the timeout event (a sketch of the rearm step follows below)
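The rearm step, roughly (a sketch of the idea in the diff, using the zbx_poller_config_t fields quoted earlier in this thread; not the diff itself):

#include <sys/time.h>
#include <event2/event.h>

/* on any read/write activity on an ares socket, push the timeout event into
 * the future so it cannot race ares_process_fd() on live data */
static void	rearm_timeout_timer(zbx_poller_config_t *poller_config, int timeout_sec)
{
	struct timeval	tv = {timeout_sec, 0};

	evtimer_del(poller_config->async_timeout_timer);	/* cancel pending run */
	evtimer_add(poller_config->async_timeout_timer, &tv);	/* reschedule */
}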
Now it seems to work perfectly for me, with all hosts resolved properly right after proxy restart and no unexpected timeouts occurring.
I also attach a log showing that the timer gets rearmed on fd reads, and that the timeout event properly rearms itself when no DNS resolution is ongoing (items are on a 1-minute poll period in the database): samplelog2.log
There was another fix done; if it is not too much to ask, could you please double-check that the problem is also gone without the suggested diff?
I have tested with this commit before actually coming up with the patch - and it did not work.
The key difference since yesterday that I can see is the persistence flag added when creating the ares_process_fd_cb event in sock_state_cb:
struct event *ev = event_new(poller_config->base, s, events|EV_PERSIST, ares_sock_cb, poller_config->channel);
Is there any other change relevant for this issue I missed?
I'll do a test without the patch anyway and let you know the outcome.
After careful testing I can confirm: it works fine without the patch proposed from my end. A build from the latest commit on the 26014 branch works as expected, and the bug does not show itself. There must be something I did wrong building/testing on my end that made me think the fix was mine: it is actually the event persistence that solves the problem.
Thank you for your efforts - looking forward to seeing this in the nearest release.
Thanks for the feedback. Regarding caching DNS requests, it seems there is the following option:
diff --git a/src/libs/zbxpoller/async_poller.c b/src/libs/zbxpoller/async_poller.c
index ecd4938d60b..dd0f5936963 100644
--- a/src/libs/zbxpoller/async_poller.c
+++ b/src/libs/zbxpoller/async_poller.c
@@ -556,7 +556,7 @@ static void async_poller_dns_init(zbx_poller_config_t *poller_config, zbx_thread
 {
 	char	*timeout;
 #ifdef HAVE_ARES
-	struct ares_options	options;
+	struct ares_options	options = {0};
 	int	optmask, status;
 
 	status = ares_library_init(ARES_LIB_INIT_ALL);
@@ -567,10 +567,11 @@ static void async_poller_dns_init(zbx_poller_config_t *poller_config, zbx_thread
 		exit(EXIT_FAILURE);
 	}
 
-	optmask = ARES_OPT_SOCK_STATE_CB|ARES_OPT_TIMEOUT;
+	optmask = ARES_OPT_SOCK_STATE_CB|ARES_OPT_TIMEOUT|ARES_OPT_QUERY_CACHE;
 	options.sock_state_cb = sock_state_cb;
 	options.sock_state_cb_data = poller_config;
 	options.timeout = poller_args_in->config_comms->config_timeout;
+	options.qcache_max_ttl = 3600;
 
 	status = ares_init_options(&poller_config->channel, &options, optmask);
According to their docs this option is on by default - which I was able to confirm was our case (hosts, once successfully resolved, did not trigger an ares_process_fd_cb call on subsequent requests - just ares_addrinfo_cb). It seems to work just fine and respects the TTLs below 1 hour on our DHCP records.
In our installations we have recursive caching DNS resolvers backing the Zabbix proxies anyway, so as not to flood our main DNS servers.
We are comfortable with the default library caching behavior - however, other Zabbix users may want some control over the cache (though this would also mean a tighter dependency on the c-ares library, whereas I can see you have opted to preserve the evdns code). We defer to you as the decision owners in this matter.
Once again - thank you for working on the fix.
Could you please share the version of c-ares that was used for testing? OAMike
RHEL8:
Installed Packages
Name         : c-ares
Version      : 1.13.0
Release      : 11.el8_10
Architecture : x86_64
Size         : 207 k
Source       : c-ares-1.13.0-11.el8_10.src.rpm
RHEL9:
Installed Packages
Name         : c-ares
Version      : 1.19.1
Release      : 2.el9_4
Architecture : x86_64
Size         : 279 k
Source       : c-ares-1.19.1-2.el9_4.src.rpm
Alpine 3.21 container build:
Installed:         Available:
c-ares-1.34.3-r0 = 1.34.3-r0
Hope this helps
Fixed in:
- pre-7.0.11rc1 a9a141f1a90 12a34cd1297 4944621e923
- pre-7.2.5rc1 502ebe88ec0 60c015d6aed fa94bc97e1b
- pre-7.4.0alpha2 (master) d3f18468d44 3a6ec88a778
Your issue is very similar to the one described in ZBX-25899. I will close your ticket as a duplicate.