Help with troubleshooting pdns_recursor

TL;DR I’m getting DNS resolution timeouts. My problem feels like an issue with my ISP (Spectrum residential), but I’m seeking some guidance, tips and tricks, etc. on how to troubleshoot this. If it is an ISP issue, how do I raise it with them without going round and round with their tier 1 support?

Details
I’m getting intermittent “Internet” issues on my clients (laptops, tablets, etc.) and found pdns_recursor throwing errors on resolve timeouts. My setup looks like client → pihole (192.168.1.5) → vyos → Internet. For VyOS, I’m basically running with the quick start guide’s DNS setup:

me@vyos# show service dns forwarding
 allow-from 192.168.1.0/24
 allow-from 10.64.0.0/16
 cache-size 0
 listen-address 192.168.1.1
 listen-address 10.64.200.1
 listen-address 10.64.20.1
 listen-address 10.64.30.1
 listen-address 10.64.40.1
 listen-address 10.64.60.1
 listen-address 10.64.150.1
 listen-address 10.64.0.1
 listen-address 10.64.50.1

The 10.64 space is a to-be-realized VLAN segmentation. ATM everything is operating off the 192.168.1.0/24 space.

My setup has been operating in its current running config for 2 months without this issue, which further makes me think this is something with Spectrum.

me@vyos# run show system image
The system currently has the following image(s) installed:

   1: 1.4-rolling-202108221610
   2: 1.4-rolling-202107010537 (default boot) (running image)
   3: 1.4-rolling-202105090417

Example errors I’m seeing:

Sep  2 09:59:28 vyos pdns_recursor[2968]: Sending SERVFAIL to 192.168.1.5:39105 during resolve of 'xmpp013.zoom.us' because: Too much time waiting for ns-1772.awsdns-29.co.uk|A, timeouts: 2, throttles: 3, queries: 25, 7742msec
Sep  2 09:59:28 vyos pdns_recursor[2968]: Sending SERVFAIL to 192.168.1.5:39105 during resolve of 'xmpp013.zoom.us' because: Too much time waiting for ns-1772.awsdns-29.co.uk|A, timeouts: 2, throttles: 0, queries: 22, 7754msec
Sep  2 09:59:29 vyos pdns_recursor[2968]: Sending SERVFAIL to 192.168.1.5:41095 during resolve of 'eastus2-prod-2.notifications.teams.microsoft.com' because: Too much time waiting for cloudapp.azure.com|A, timeouts: 3, throttles: 1, queries: 19, 7184msec
Sep  2 09:59:29 vyos pdns_recursor[2968]: Sending SERVFAIL to 192.168.1.5:41095 during resolve of 'eastus2-prod-2.notifications.teams.microsoft.com' because: Too much time waiting for cloudapp.azure.com|A, timeouts: 3, throttles: 1, queries: 19, 7215msec
Sep  2 09:59:29 vyos pdns_recursor[2968]: Sending SERVFAIL to 192.168.1.5:41095 during resolve of 'eastus2-prod-2.notifications.teams.microsoft.com' because: Too much time waiting for cloudapp.azure.com|A, timeouts: 2, throttles: 1, queries: 21, 7166msec
Sep  2 09:59:29 vyos pdns_recursor[2968]: Sending SERVFAIL to 192.168.1.5:41095 during resolve of 'eastus2-prod-2.notifications.teams.microsoft.com' because: Too much time waiting for cloudapp.azure.com|A, timeouts: 3, throttles: 2, queries: 20, 7170msec
Sep  2 09:59:30 vyos pdns_recursor[2968]: Sending SERVFAIL to 192.168.1.5:53402 during resolve of 'ic3.events.data.microsoft.com' because: Too much time waiting for eastus.cloudapp.azure.com|A, timeouts: 3, throttles: 2, queries: 18, 7514msec
Sep  2 09:59:30 vyos pdns_recursor[2968]: Sending SERVFAIL to 192.168.1.5:53402 during resolve of 'ic3.events.data.microsoft.com' because: Too much time waiting for eastus.cloudapp.azure.com|A, timeouts: 2, throttles: 5, queries: 23, 7524msec
Sep  2 09:59:30 vyos pdns_recursor[2968]: Sending SERVFAIL to 192.168.1.5:56607 during resolve of 'ic3.events.data.microsoft.com' because: Too much time waiting for japaneast.cloudapp.azure.com|A, timeouts: 2, throttles: 0, queries: 20, 7529msec
Sep  2 09:59:30 vyos pdns_recursor[2968]: Sending SERVFAIL to 192.168.1.5:56607 during resolve of 'ic3.events.data.microsoft.com' because: Too much time waiting for japaneast.cloudapp.azure.com|A, timeouts: 2, throttles: 3, queries: 23, 7533msec
Sep  2 09:59:30 vyos pdns_recursor[2968]: Sending SERVFAIL to 192.168.1.5:54315 during resolve of 'xmpp003.zoom.us' because: Too much time waiting for ns-888.awsdns-47.net|A, timeouts: 1, throttles: 2, queries: 23, 7159msec

My assumption is that I can’t just send those SERVFAIL messages to Spectrum without offering some additional proof they’re mucking with my DNS queries?
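One way to turn the raw log lines above into something easier to show to support is to tally them. The following is only a rough sketch: the regular expression is written against the exact SERVFAIL line format pasted in this post, and the function name `summarize` and the `SAMPLE` list are made up for illustration. It counts how often each name was being waited on and adds up the reported timeouts.

```python
import re
from collections import Counter

# Matches the "Sending SERVFAIL ..." lines as they appear in this thread.
LINE_RE = re.compile(
    r"Sending SERVFAIL to (?P<client>[\d.]+):\d+ during resolve of "
    r"'(?P<qname>[^']+)' because: Too much time waiting for "
    r"(?P<waiting>[^|]+)\|(?P<qtype>\w+), timeouts: (?P<timeouts>\d+), "
    r"throttles: (?P<throttles>\d+), queries: (?P<queries>\d+), "
    r"(?P<msec>\d+)msec"
)


def summarize(lines):
    """Count SERVFAILs per name being waited on and sum the timeout counters."""
    per_name = Counter()
    total_timeouts = 0
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a SERVFAIL line in the expected format
        per_name[match.group("waiting")] += 1
        total_timeouts += int(match.group("timeouts"))
    return per_name, total_timeouts


# Two lines copied from the log excerpt above (hypothetical sample input).
SAMPLE = [
    "Sep  2 09:59:28 vyos pdns_recursor[2968]: Sending SERVFAIL to 192.168.1.5:39105 during resolve of 'xmpp013.zoom.us' because: Too much time waiting for ns-1772.awsdns-29.co.uk|A, timeouts: 2, throttles: 3, queries: 25, 7742msec",
    "Sep  2 09:59:29 vyos pdns_recursor[2968]: Sending SERVFAIL to 192.168.1.5:41095 during resolve of 'eastus2-prod-2.notifications.teams.microsoft.com' because: Too much time waiting for cloudapp.azure.com|A, timeouts: 3, throttles: 1, queries: 19, 7184msec",
]

if __name__ == "__main__":
    names, timeouts = summarize(SAMPLE)
    for name, count in names.most_common():
        print(f"{count:3d}  {name}")
    print("total timeouts:", timeouts)
```

Feeding it the full syslog instead of `SAMPLE` would show whether the failures cluster on a few authoritative servers or are spread across everything, which is a more useful thing to report than individual lines.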

Spectrum’s not going to care about this if you mention it to them. What upstream DNS is VyOS pointing at? You don’t have to use your ISP’s DNS; in fact I would recommend using a publicly available resolver instead. There used to be a bunch of programs, which I’m sure you could still find, that would benchmark various public resolvers and give a recommendation based on latency and response time.
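If you can’t find one of those benchmark tools, a crude version is easy to sketch. The code below is an assumption-laden illustration, not a replacement for a real benchmark: it hand-builds a single A-record query over UDP and times the reply. The function names `build_query`, `time_query` and `benchmark` are made up here, and the resolver addresses mentioned afterwards are just well-known public ones, not a recommendation from this thread.

```python
import socket
import struct
import time


def build_query(name, qid=0x1234):
    """Build a minimal DNS query packet for an A record (recursion desired)."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def time_query(server, name, timeout=2.0):
    """Return the round-trip time in milliseconds, or None on timeout/error."""
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(packet, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes socket timeouts
            return None
        elapsed_ms = (time.monotonic() - start) * 1000
    # Only accept a reply that echoes our query ID.
    if len(reply) < 2 or reply[:2] != packet[:2]:
        return None
    return elapsed_ms


def benchmark(servers, name="example.com"):
    """Return {server: rtt_ms or None} for a single query to each server."""
    return {server: time_query(server, name) for server in servers}
```

Calling something like `benchmark(["1.1.1.1", "8.8.8.8", "9.9.9.9"])` from a LAN client gives one sample per resolver. A single query says little, so repeat it a few times and at different times of day before drawing conclusions.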

Right, that was my thought too, they won’t care given it is residential.

My understanding from the quick start was that I don’t have to point to a specific upstream resolver. I took this to mean vyos was resolving directly. Maybe this is a miss on my part? Prior to vyos I would have pointed to cloudflare.

For what it’s worth, they don’t care if businesses complain about it either.

By default, I think, it uses the system resolver, which can be viewed by doing cat /etc/resolv.conf. I’d say take your pick of a public resolver (or stand up your own) and set it using:

set service dns forwarding name-server X.X.X.X

or you could set it as vyos’ name server and set the resolver to piggyback off that:

set system name-server X.X.X.X
set service dns forwarding system

Hmm, the operating system’s resolver is pointing to vyos. Feeling like I’ve got some kind of looping going on, even though it has been operating this way for a while… But based on this convo I think I’m going to just configure vyos to point to Cloudflare’s DNS resolvers…

me@vyos# show system name-server
 name-server 192.168.1.1

me@vyos# cat /etc/resolv.conf
### Autogenerated by VyOS ###
### Do not edit, your changes will get overwritten ###


# system
nameserver 192.168.1.1
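The self-reference above can be checked mechanically. This is a small sketch under stated assumptions: it treats the resolv.conf contents and the forwarder’s listen-address list as plain inputs (both copied from this thread), and the function names are invented for illustration. Note that a match is not proof of a loop; it only matters if the forwarder is told to use the system name servers as its upstream.

```python
def nameservers(resolv_conf_text):
    """Extract nameserver addresses from resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def self_referencing(resolv_conf_text, listen_addresses):
    """Return nameservers that are also addresses the forwarder listens on."""
    listening = set(listen_addresses)
    return [ns for ns in nameservers(resolv_conf_text) if ns in listening]


# Inputs taken from the output pasted in this thread.
RESOLV_CONF = """\
### Autogenerated by VyOS ###
### Do not edit, your changes will get overwritten ###

# system
nameserver 192.168.1.1
"""
LISTEN = ["192.168.1.1", "10.64.200.1", "10.64.0.1"]

if __name__ == "__main__":
    print(self_referencing(RESOLV_CONF, LISTEN))
```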

FWIW, this line in the Quick Start was what led me to not forward to a public resolver:

VyOS will serve as a full DNS recursor, replacing the need to utilize Google, Cloudflare, or other public DNS servers (which is good for privacy)

I’ll have to do some digging into the internals of VyOS to see what it uses for upstream by default. I could interpret that sentence a couple of different ways, and I see why you think what you do. If you’re still having DNS timeouts after manually setting a resolver, then there are other things to investigate.

EDIT:
Looks like 1.3rc6 talks directly to the root servers, but my resolv.conf doesn’t point at itself in my test environment. I’ll try 1.4.

EDIT2:
No issues with 1.4-rolling-202109030217, it speaks directly to root as well. I even pointed it at itself by setting the system name-server to its own LAN IP. Perhaps you were hitting a greater outage or it was fixed between now and your last rolling?

Ok, good to know my interpretation was in line, and VyOS does point to the roots. Can you point me to the spot in the codebase that governs this? I want to get more familiar with the codebase and this would save me some hunting…

Yeah, this is where my brain went too, although I assumed it was a Spectrum network issue. Since configuring VyOS to point to Cloudflare I’ve not had any issues. I may pivot back to pointing at the root servers, although I’m less concerned with the privacy angle than the usability one, so I may just leave well enough alone.

Thanks for the help with this! Much appreciated!

The way I learned what VyOS is talking to was by setting up a mock environment in GNS3 and running a Wireshark capture between the VyOS WAN port and the GNS3 cloud/NAT object. I then placed a client machine on the LAN side and did some lookups. The generated config files generally end up in /run, so if you poke around there you can see how things are configured.
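As a lighter-weight check than a packet capture, the generated recursor config can be inspected for forwarding settings. The sketch below assumes you have located the generated PowerDNS recursor.conf somewhere under /run (the exact path is not given in this thread) and pass its text in. It looks for the recursor’s `forward-zones`, `forward-zones-recurse` and `forward-zones-file` settings; the function name is made up for illustration. If none are set, the recursor resolves from the root servers itself.

```python
FORWARD_KEYS = ("forward-zones", "forward-zones-recurse", "forward-zones-file")


def forwarding_settings(conf_text):
    """Return {setting: value} for non-empty forwarding settings in recursor.conf text."""
    found = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        key = key.rstrip("+")  # "setting+=value" appends to an earlier value
        if key in FORWARD_KEYS and value:
            found[key] = value
    return found
```

An empty result suggests full recursion from the roots. A non-empty one only says forwarding is configured somewhere, so read the value (or the file it points to) to see which zones are forwarded and to which addresses.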