PowerDNS Recursor returns SERVFAIL with wan-load-balancer

Ok, I can’t figure this out; hopefully it’s a silly mistake somewhere. I upgraded from 1.1.8 to 1.2.1 (built last week) with an existing wan-load-balancer on 3 interfaces and the DNS forwarder.

I changed the forwarder config from listen-on to listen-address, and now it replies SERVFAIL for some subnets, including the local router itself. I can’t figure out why. I’m thinking it might be to do with the load-balance rules. I just want it to forward queries to Cloudflare etc. (and read specific names from the local hosts file, though I’ve disabled that for now).

When this is run from the router (over SSH):

vyos@gateway-temp# dig google.com

; <<>> DiG 9.9.5-9+deb8u17-Debian <<>> google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 27634
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com.                    IN      A

;; Query time: 4902 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Apr 23 18:05:23 AEST 2019
;; MSG SIZE  rcvd: 39

And from a host in the 10.1.0.0/24 subnet:

root@kcbmain:~# dig google.com

; <<>> DiG 9.10.3-P4-Debian <<>> google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 63623
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com.                    IN      A

;; Query time: 4507 msec
;; SERVER: 10.1.0.2#53(10.1.0.2)
;; WHEN: Tue Apr 23 18:07:02 AEST 2019
;; MSG SIZE  rcvd: 39

And here is the config. eth0 and eth1 are internal subnets, while eth2, eth3 and eth4 are load-balanced to 3 separate ISPs (which seems to work better than in 1.1.8, by the way):

 interfaces {
     ethernet eth0 {
         address 10.0.0.125/24
         duplex auto
         smp-affinity auto
         speed auto
     }
     ethernet eth1 {
         address 10.1.0.2/24
         duplex auto
         smp-affinity auto
         speed auto
     }
     ethernet eth2 {
         address 172.16.2.10/24
         duplex auto
         smp-affinity auto
         speed auto
     }
     ethernet eth3 {
         address 172.16.3.10/24
         duplex auto
         smp-affinity auto
         speed auto
     }
     ethernet eth4 {
         address 172.16.1.10/24
         duplex auto
         smp-affinity auto
         speed auto
     }
     loopback lo {
     }
 }
 load-balancing {
     wan {
         flush-connections
         interface-health eth2 {
             failure-count 3
             nexthop 172.16.2.1
             success-count 1
             test 10 {
                 resp-time 5
                 target 1.1.1.1
                 ttl-limit 1
                 type ping
             }
         }
         interface-health eth3 {
             failure-count 3
             nexthop 172.16.3.1
             success-count 1
             test 10 {
                 resp-time 5
                 target 1.0.0.1
                 ttl-limit 1
                 type ping
             }
         }
         interface-health eth4 {
             failure-count 3
             nexthop 172.16.1.1
             success-count 1
             test 10 {
                 resp-time 5
                 target 8.8.8.8
                 ttl-limit 1
                 type ping
             }
         }
         rule 4 {
             destination {
                 address 100.64.0.0/19
             }
             exclude
             inbound-interface eth0
             protocol all
         }
         rule 5 {
             destination {
                 address 10.1.0.0/24
             }
             exclude
             inbound-interface eth0
             protocol all
         }
         rule 7 {
             destination {
                 address 192.168.100.0/24
             }
             exclude
             inbound-interface eth0
             protocol all
         }
         rule 10 {
             inbound-interface eth1
             interface eth2 {
                 weight 1
             }
             interface eth3 {
                 weight 1
             }
             interface eth4 {
                 weight 1
             }
             protocol all
         }
         rule 20 {
             inbound-interface eth0
             interface eth2 {
                 weight 1
             }
             interface eth3 {
                 weight 1
             }
             interface eth4 {
                 weight 1
             }
             protocol all
         }
         sticky-connections {
             inbound
         }
     }
 }
 protocols {
     static {
         route 0.0.0.0/0 {
             next-hop 172.16.1.1 {
             }
             next-hop 172.16.2.1 {
             }
             next-hop 172.16.3.1 {
             }
         }
         route 100.64.0.0/19 {
             next-hop 10.0.0.99 {
             }
         }
         route 192.168.100.0/24 {
             next-hop 10.0.0.99 {
             }
         }
     }
 }
 service {
     dhcp-server {
         shared-network-name LOCAL {
             subnet 10.0.0.0/24 {
                 default-router 10.0.0.125
                 dns-server 10.0.0.125
                 lease 86400
                 range 0 {
                     start 10.0.0.40
                     stop 10.0.0.55
                 }
             }
         }
     }
     dns {
         forwarding {
             cache-size 1000
             dnssec process-no-validate
             listen-address 10.0.0.125
             listen-address 10.1.0.2
             listen-address 127.0.0.1
             name-server 1.1.1.1
             name-server 1.0.0.1
             name-server 8.8.8.8
         }
     }
     ssh {
         listen-address 10.0.0.125
         port 22
     }
 }
 system {
     config-management {
         commit-revisions 20
     }
     console {
     }
     host-name gateway-temp
     ipv6 {
         disable-forwarding
     }
     ntp {
         server 0.pool.ntp.org {
         }
         server 1.pool.ntp.org {
         }
         server 2.pool.ntp.org {
         }
     }
     syslog {
         global {
             facility all {
                 level notice
             }
             facility protocols {
                 level debug
             }
         }
     }
     time-zone Australia/Brisbane
 }

Help is greatly appreciated.

Thanks

Can you try disabling dnssec and wan-load-balance and test again?

A DNSSEC-compatible DNS service would be 9.9.9.9.
If this doesn’t work with 9.9.9.9, I believe there is something wrong with the PowerDNS settings.
@c-po: how do you disable dnssec in VyOS PowerDNS?

delete service dns forwarding dnssec

I have tried disabling dnssec, but not load-balancing; as this is a production router I don’t want to mess with it too much, at least not at peak times. What is the simplest way to disable load-balancing? Just remove the default route and temporarily point it at, say, one of the gateways?
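
Maybe something like this, as an untested guess at the CLI based on my config above:

configure
delete load-balancing wan rule 10
delete load-balancing wan rule 20
delete protocols static route 0.0.0.0/0 next-hop 172.16.2.1
delete protocols static route 0.0.0.0/0 next-hop 172.16.3.1
commit

(and afterwards put it back with load /config/config.boot and commit).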

Also, I found that the PowerDNS default setting for dnssec is process-no-validate, which also shows as dnssec=process-no-validate in the recursor.conf file. To disable it you need to set it to off explicitly, i.e. set service dns forwarding dnssec off, to get dnssec=off in the config.

I stopped the service and manually ran the recursor in the foreground with tracing enabled.
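
Roughly like this, from memory (the exact service unit name, flags and paths may differ):

sudo systemctl stop pdns-recursor
sudo pdns_recursor --daemon=no --trace=yes --config-dir=/etc/powerdns

Here is a snippet of the trace output: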

recursor-4.1.12.security-status.secpoll.powerdns.com: timeout resolving after 1502.99msec
recursor-4.1.12.security-status.secpoll.powerdns.com: Trying IP 8.8.8.8:53, asking 'recursor-4.1.12.security-status.secpoll.powerdns.com|TXT'
recursor-4.1.12.security-status.secpoll.powerdns.com: timeout resolving after 1502.05msec
recursor-4.1.12.security-status.secpoll.powerdns.com: Failed to resolve via any of the 1 offered NS at level '.'
recursor-4.1.12.security-status.secpoll.powerdns.com: failed (res=-1)
Could not retrieve security status update for '4.1.12' on 'recursor-4.1.12.security-status.secpoll.powerdns.com', RCODE = Server Failure
0 [4/1] question for 'clientservices.googleapis.com|A' from 10.0.0.55
[4] clientservices.googleapis.com: Wants NO DNSSEC processing, auth data in query for A
[4] clientservices.googleapis.com: Looking for CNAME cache hit of 'clientservices.googleapis.com|CNAME'
[4] clientservices.googleapis.com: No CNAME cache hit of 'clientservices.googleapis.com|CNAME' found
[4] clientservices.googleapis.com: No cache hit for 'clientservices.googleapis.com|A', trying to find an appropriate NS record
[4] : got TA for '.'
[4] : setting cut state for . to Secure
[4] clientservices.googleapis.com: initial validation status for clientservices.googleapis.com is Indeterminate
[4] clientservices.googleapis.com: Cache consultations done, have 1 NS to contact
[4] clientservices.googleapis.com: Domain has hardcoded nameservers
[4] clientservices.googleapis.com: Resolved '.' NS (empty) to: 1.1.1.1, 1.0.0.1, 8.8.8.8
[4] clientservices.googleapis.com: Trying IP 1.1.1.1:53, asking 'clientservices.googleapis.com|A'
[4] clientservices.googleapis.com: timeout resolving after 1501.99msec
[4] clientservices.googleapis.com: Trying IP 1.0.0.1:53, asking 'clientservices.googleapis.com|A'
[4] clientservices.googleapis.com: timeout resolving after 1502.01msec
[4] clientservices.googleapis.com: Trying IP 8.8.8.8:53, asking 'clientservices.googleapis.com|A'
[4] clientservices.googleapis.com: timeout resolving after 1502.14msec
[4] clientservices.googleapis.com: Failed to resolve via any of the 1 offered NS at level '.'
[4] clientservices.googleapis.com: failed (res=-1)
0 [4/1] answer to question 'clientservices.googleapis.com|A': 0 answers, 0 additional, took 3 packets, 4506.14 netw ms, 4507.39 tot ms, 0 throttled, 3 timeouts, 0 tcp connections, rcode=2

With dnssec enabled the first [4] line above reads (different request):

[31] clientservices.googleapis.com: Wants DNSSEC processing, auth data in query for A

It’s almost like the router can’t talk to the IPs, but I can ping all of them successfully, both from VyOS and from the hosts. What do you make of the recursor-4.1.12.security-status.secpoll.powerdns.com: failed (res=-1) and related lines?

EDIT: More info: there is no /etc/resolv.conf. If I dig one.one.one.one, for example, I get SERVFAIL with it trying the lookup on the local server at 127.0.0.1.

If I set, say, nameserver 1.1.1.1 in resolv.conf and dig, it queries that nameserver directly and returns the record. If I remove resolv.conf and run nslookup one.one.one.one 1.1.1.1, that returns too. So I don’t know if it’s PowerDNS or my rules blocking incoming data to 127.0.0.1?

FYI, here’s my recursor.conf (as generated by VyOS):

### Autogenerated by dns_forwarding.py ###

# Non-configurable defaults
daemon=yes
threads=1
allow-from=0.0.0.0/0, ::/0
log-common-errors=yes
non-local-bind=yes
query-local-address=0.0.0.0
query-local-address6=::

# cache-size
max-cache-entries=1000

# negative TTL for NXDOMAIN
max-negative-ttl=3600

# ignore-hosts-file
export-etc-hosts=yes

# listen-on
local-address=10.0.0.125,10.1.0.2,127.0.0.1,127.0.1.1

# domain ... server ...

# dnssec
dnssec=off

# name-server
forward-zones-recurse=.=1.1.1.1;1.0.0.1;8.8.8.8
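
Side note: query-local-address=0.0.0.0 above means the recursor lets the kernel pick the source address for its upstream queries. dig’s -b option binds a specific source address, so something like the following should show whether one source address works through the balancer while another doesn’t:

dig @9.9.9.9 -b 172.16.2.10 google.com
dig @9.9.9.9 -b 10.1.0.2 google.com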

EDIT: I tried Quad9 with dnssec on and off, but I’m still receiving SERVFAIL.

Try from VyOS:

$ dig @9.9.9.9 google.com

; <<>> DiG 9.11.3-1ubuntu1.5-Ubuntu <<>> @9.9.9.9 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40792
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com. IN A

;; ANSWER SECTION:
google.com. 38 IN A 216.58.212.238

;; Query time: 107 msec
;; SERVER: 9.9.9.9#53(9.9.9.9)
;; WHEN: Wed Apr 24 23:23:55 IDT 2019
;; MSG SIZE rcvd: 55

It should work for you. If it’s not working directly against 9.9.9.9 or 8.8.8.8 or 1.1.1.1, then it’s probably a network issue.

Correct, specifying a nameserver with dig works. Setting a system name-server (set system name-server 1.1.1.1) also allows dig google.com to work on VyOS without specifying a nameserver, but hosts still cannot resolve via the PowerDNS recursor.

dig @1.1.1.1 google.com also works from a host, whereas dig google.com results in SERVFAIL. That’s why I’m suspecting the load-balancer: it’s as if the traffic is not returning to the local interface that PowerDNS is making its calls from. I tried adding a rule to exclude traffic to the lo interface, but this didn’t work either.

What’s the easiest way to temporarily disable load-balancing without killing it?

EDIT: Interestingly, traceroute -n 1.1.1.1 produces no results from VyOS but does from a host, albeit very slowly and with missing responses. What’s different between 1.1.8 and 1.2.1 load-balancing? I believe 1.1.8 included the change to PowerDNS?

The release notes for 1.2.0 have this: “Operational mode command to restart the dnsmasq service”, so it wasn’t PowerDNS?

I do not know how to disable the balancer, but my assumption is that both dig and PowerDNS on top of VyOS should have the same network access.
I would suggest verifying with tcpdump on a specific interface (for example, the capture commands below), and also using only one DNS server: instead of 8.8.8.8 and 1.1.1.1 and a couple of others, just try with a single one like 8.8.8.8 or 9.9.9.9 or 1.1.1.1 and see what happens.
If it’s dnsmasq… (just noticed) then you should use other DNS software (in my opinion…).
I have used dnsmasq and have seen very weird things.
However, there is one thing I do remember… “all-servers” (How to disable dnsmasq - #9 by elico - Support - NethServer Community, Dnsmasq: remove strict-order option · Issue #5705 · NethServer/dev · GitHub): it seems that many setups do not allow dnsmasq to run queries against all upstream hosts.
If the server doesn’t work very well, then it’s an option to start with.
Check this and see whether it fixes things; if not, we will see what to do next.
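
For example, something like this on each WAN interface (interface names per the config above), watching DNS traffic only:

sudo tcpdump -ni eth2 port 53
sudo tcpdump -ni eth3 port 53
sudo tcpdump -ni eth4 port 53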

Ha, it looks like they forgot to update the docs after changing the DNS software; it is definitely PowerDNS.

tcpdump on the listen-address interface shows the request coming in from the host and the SERVFAIL reply going back. I ran tcpdump on each of the 3 outgoing interfaces but didn’t capture any traffic from my request, only traffic from various hosts going directly to public nameservers:

09:59:07.450520 IP kcbmain.49146 > 10.1.0.2.domain: 3367+ [1au] A? google.com. (39)
09:59:08.957115 IP 10.1.0.2.domain > kcbmain.49146: 3367 ServFail 0/0/1 (39)
snip
10:02:49.964877 IP 172.16.2.10.59872 > 8.8.8.8.domain: 38309+ PTR? 57.231.33.13.in-addr.arpa. (43)
10:02:50.081117 IP 8.8.8.8.domain > 172.16.2.10.59872: 38309 1/0/0 PTR server-13-33-231-57.lax3.r.cloudfront.net. (98)

My last few attempts have used a single name-server (1.1.1.1), and I just tried removing all listen-addresses bar one, but the host still gets the same SERVFAIL reply:

 forwarding {
     cache-size 100
     dnssec off
     listen-address 10.1.0.2
     name-server 9.9.9.9
 }

root@kcbmain:~# dig google.com

; <<>> DiG 9.10.3-P4-Debian <<>> google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 58648
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com.                    IN      A

;; Query time: 0 msec
;; SERVER: 10.1.0.2#53(10.1.0.2)
;; WHEN: Thu Apr 25 09:43:14 AEST 2019
;; MSG SIZE  rcvd: 39

You are not listening on this address per your configuration:

local-address=10.0.0.125,10.1.0.2,127.0.0.1,127.0.1.1

Try on the box:

dig @127.0.0.1 google.com

…Sorry, I didn’t notice the 10.1.0.2, but try to dig against 127.0.0.1 and see if there is a response.
I can try to test it, but tcpdump needs to be on the outer interface, the one that receives the traffic from 1.1.1.1 and 8.8.8.8 or 9.9.9.9.
Hope it helps.

It seems it’s just not returning traffic from the balancer back to pdns.

vyos@gateway-temp# dig @127.0.0.1 google.com

; <<>> DiG 9.9.5-9+deb8u17-Debian <<>> @127.0.0.1 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 62256
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com.                    IN      A

;; Query time: 1503 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Apr 25 17:34:20 AEST 2019
;; MSG SIZE  rcvd: 39

Ok, now this is crazy. I’ve got another VyOS instance running with pdns listening on 10.1.0.3 and forwarding to 3 nameservers (including an internal authoritative NS). I changed our config to forward to just 10.1.0.3, ran a capture on the other VM listening on port 53 at 10.1.0.3, and then ran dig on our box:

vyos@gateway-temp# dig @127.0.0.1 google.com

; <<>> DiG 9.9.5-9+deb8u17-Debian <<>> @127.0.0.1 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 53074
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com.                    IN      A

;; Query time: 1505 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Apr 25 17:38:21 AEST 2019
;; MSG SIZE  rcvd: 39

as we have seen before, BUT this is what was captured on the other VyOS box:

07:38:18.612784 IP 10.1.0.2.31885 > 10.1.0.3.domain: 7357+ [1au] A? google.com. (39)
07:38:20.131825 IP 10.1.0.3.domain > 10.1.0.2.31885: 7357 1/0/1 A 216.58.203.110 (55)

Wah? A successful lookup was performed by 10.1.0.3 and passed back to 10.1.0.2 (which I’ve confirmed it IS listening on), but SERVFAIL was still returned?

EDIT: Ok, I am now getting correct query replies sent back to the load-balanced VyOS using this config:

 forwarding {
     cache-size 100
     dnssec off
     listen-address 10.1.0.2
     listen-address 127.0.0.1
     listen-address 10.0.0.125
     name-server 10.1.0.3
 }

vyos@gateway-temp# dig @127.0.0.1 google.co.uk

; <<>> DiG 9.9.5-9+deb8u17-Debian <<>> @127.0.0.1 google.co.uk
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36156
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.co.uk.                  IN      A

;; ANSWER SECTION:
google.co.uk.           278     IN      A       172.217.25.131

;; Query time: 46 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Apr 25 17:57:41 AEST 2019
;; MSG SIZE  rcvd: 57

It’s not ideal having another VM running just to forward queries, so why is my local pdns not able to look up queries?

Any idea why pdns is refusing to respond to domain queries when forwarding to external DNS servers, where requests (I assume) go out through the load-balanced interfaces? What else can I do to test this?

I had the same problem with DNS resolver 1.1.1.1 configured on one of my routers yesterday, so I decided to remove set service dns forwarding name-server completely, as VyOS will act as a full DNS recursor if no nameserver is given.

Maybe this fixes your problem, too?

Just tried removing name-server, and I’m still receiving SERVFAIL. I do have system name-server set, so it may be that pdns falls back to the system-defined nameserver for query forwarding (maybe not, I don’t know; I’ve just been reading the pdns docs).
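
A quick way to check what the generated config is actually doing (path is the Debian default; VyOS may write it elsewhere):

grep forward-zones /etc/powerdns/recursor.conf

If nothing comes back, pdns is doing full recursion rather than forwarding.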

Unfortunately no fix :confused:

Why do you keep responding with the tcpdump of the internal interfaces???
You should use tcpdump to see all of your external interfaces’ traffic…
I still think that there is something wrong, and you can try to debug it with tcpdump -vvv first.
Later on, if there is no issue with a simple dig, you should fall back to dnsmasq or PowerDNS.
I have not tested it, but I can test it if you really don’t have any other option.
I am still waiting for the VyOS developers to respond.

There is no traffic from my host (or the local VyOS) on the external interfaces for my requests.
The tcpdump I posted above shows my DNS requests going from the pdns server to another box on the LAN, which then forwards the queries directly and replies, yet the pdns server still answered with SERVFAIL.

It appears there might be an issue with the load-balancer and traffic originating from the router itself. I might play with iptables a bit and see what I can turn up, starting with a look at what the balancer actually installs in the mangle table:
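
sudo iptables -t mangle -L PREROUTING -n -v
sudo iptables -t mangle -L OUTPUT -n -v

If the balancer only marks traffic in PREROUTING (my guess), locally originated packets, which traverse OUTPUT instead, would never get marked.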

Thanks for your help.

@jimloko If it’s indeed an issue with the LB setup it should be reproducible but requires a full working lab.
So:

  • three (3) external networks, eth2-eth4
  • two (2) internal networks, eth0-eth1
  • sticky-connections
  • DHCP on eth0
  • DNS

Right?

@jimloko this is the issue inside VyOS…

There is nothing like sticky-connections for outbound traffic.
If there were, the traffic would stick to a specific route in some way.
Currently it seems to me that, with the settings above, it’s doing per-packet load balancing for outbound traffic, which means outbound packets will have some kind of trouble reaching the destination and getting back.
I have seen this in the past on a couple of Linux servers of mine, and eventually I modified iptables manually to make it somehow work.
I am not sure about your setup, but it seems that the wan load balancer is meant for WAN routers, not for a LAN with NAT all over the place towards the Internet.
And… it’s not something with PowerDNS.

Correct.

The 3 172.x networks each have a VDSL router as gateway, and each of those is doing NAT, but that’s upstream from this router, and they are happily routing the internal traffic from the LAN interfaces (eth0 and eth1).

I have suspected it has something to do with the LB rules not playing nice with locally originated traffic, such as pdns’s, so it seems you are right: it is not pdns that has the issue. The SERVFAIL (above) after a successful lookup via the 2nd internal VyOS router is what threw me.

I think I just need to sort out the iptables mangle rules, then it should be able to route the locally originated traffic?
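
Something along these lines is what I have in mind; completely untested, and the mark and table numbers are arbitrary, just to force router-originated DNS out of one WAN:

# mark locally-originated DNS in the OUTPUT chain (the LB presumably only marks in PREROUTING)
sudo iptables -t mangle -A OUTPUT -p udp --dport 53 -j MARK --set-mark 0xc8
# route marked traffic via one WAN gateway through a dedicated table
sudo ip rule add fwmark 0xc8 table 201
sudo ip route add default via 172.16.2.1 dev eth2 table 201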