All the wheels will stop turning, if your strong arm wants them to

Complaints about DNS failures in the xDSL network piled up this morning. The reports gave no specific times, spanned the whole product range, and also affected backend systems. What's going on?

Sniffing revealed that the requests do arrive at the DNS servers, but the servers don't always respond. Restarting the processes helped, but only briefly: after about half a minute, each server stopped sending replies again.

Yet the sniffer showed hectic DNS activity on every server. But why? What are they doing?

The statistics of each server show that the request queue, the list of open requests, is overflowing. Surprisingly, that list contained hundreds of lines like the following:

499 CNAME IN psai9edi.s3.amazonaws.com. 10.412281 iterator wait for 2600:9000:5300:1b00::1
500 CNAME IN psg62lat.s3.amazonaws.com. 41.447997 iterator wait for 2600:9000:5300:1b00::1
501 CNAME IN psgp7262.s3.amazonaws.com. 7.193284 iterator wait for 2600:9000:5300:1b00::1
502 CNAME IN pshziple.s3.amazonaws.com. 44.356077 iterator wait for 2600:9000:5300:1b00::1
503 CNAME IN psigagat.s3.amazonaws.com. 45.621077 iterator wait for 2600:9000:5300:1b00::1
504 CNAME IN pspcvsil.s3.amazonaws.com. 22.752443 iterator wait for 2600:9000:5300:1b00::1
505 CNAME IN pswgyera.s3.amazonaws.com. 52.993825 iterator wait for 2600:9000:5300:1b00::1
506 CNAME IN qa9c96lt.s3.amazonaws.com. 56.530104 iterator wait for 2600:9000:5300:1b00::1
507 CNAME IN qah-zkwp.s3.amazonaws.com. 1.215374 iterator wait for 2600:9000:5300:1b00::1
508 CNAME IN qavtvxcc.s3.amazonaws.com. 37.358889 iterator wait for 2600:9000:5300:1b00::1
509 CNAME IN v118gk69.s3.amazonaws.com. 38.617430 iterator wait for 2600:9000:5300:1b00::1
510 CNAME IN v12ggbel.s3.amazonaws.com. 8.180806 iterator wait for 2600:9000:5300:1b00::1
511 CNAME IN v19pg7un.s3.amazonaws.com. 45.592578 iterator wait for 2600:9000:5300:1b00::1
512 CNAME IN v1h66nam.s3.amazonaws.com. 20.184565 iterator wait for 2600:9000:5300:1b00::1
513 CNAME IN v1i5mjps.s3.amazonaws.com. 2.479473 iterator wait for 2600:9000:5300:1b00::1
514 CNAME IN v1t5buni.s3.amazonaws.com. 13.825713 iterator wait for 2600:9000:5300:1b00::1
515 CNAME IN x8cca8xn.s3.amazonaws.com. 20.251822 iterator wait for 2600:9000:5300:1b00::1
516 CNAME IN x8ln-0rj.s3.amazonaws.com. 50.227529 iterator wait for 2600:9000:5300:1b00::1
517 CNAME IN x8q9jrjv.s3.amazonaws.com. 50.693382 iterator wait for 2600:9000:5300:1b00::1
518 CNAME IN x8x9zq3n.s3.amazonaws.com. 44.069048 iterator wait for 2600:9000:5300:1b00::1
519 CNAME IN xk8t2xuy.s3.amazonaws.com. 52.709106 iterator wait for 2600:9000:5300:1b00::1
520 CNAME IN xkkh6w1b.s3.amazonaws.com. 48.943785 iterator wait for 2600:9000:5300:1b00::1
521 CNAME IN xkl4qcvv.s3.amazonaws.com. 43.927290 iterator wait for 2600:9000:5300:1b00::1
522 CNAME IN xksddwxf.s3.amazonaws.com. 11.812726 iterator wait for 2600:9000:5300:1b00::1
523 CNAME IN xkuf9dcb.s3.amazonaws.com. 49.485060 iterator wait for 2600:9000:5300:1b00::1

What the hell's going on here?
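
With hundreds of such lines, a quick tally helps to see the pattern. The following is a minimal sketch, assuming the six whitespace-separated columns visible in the listing above (index, type, class, name, age in seconds, status text). The function names are made up for this illustration and are not part of any resolver tooling.

```python
from collections import Counter


def parse_request_line(line):
    """Split one request-list line into fields; return None if it does not fit.

    Assumed layout: index, type, class, name, age in seconds, free-text status.
    """
    parts = line.split(None, 5)
    if len(parts) < 6:
        return None
    index, qtype, qclass, qname, age, status = parts
    try:
        return {
            "index": int(index),
            "qtype": qtype,
            "qclass": qclass,
            "qname": qname,
            "age": float(age),
            "status": status.strip(),
        }
    except ValueError:
        return None


def parent_zone(qname, labels=3):
    """Return the last `labels` labels of a name, e.g. 's3.amazonaws.com'."""
    return ".".join(qname.rstrip(".").split(".")[-labels:])


def summarize(lines):
    """Count open requests per parent zone and per status, and find the oldest."""
    by_zone = Counter()
    by_status = Counter()
    oldest = 0.0
    for line in lines:
        entry = parse_request_line(line)
        if entry is None:
            continue  # skip headers or lines in another layout
        by_zone[parent_zone(entry["qname"])] += 1
        by_status[entry["status"]] += 1
        oldest = max(oldest, entry["age"])
    return by_zone, by_status, oldest


# Three lines copied from the listing above.
sample = [
    "499 CNAME IN psai9edi.s3.amazonaws.com. 10.412281 iterator wait for 2600:9000:5300:1b00::1",
    "506 CNAME IN qa9c96lt.s3.amazonaws.com. 56.530104 iterator wait for 2600:9000:5300:1b00::1",
    "507 CNAME IN qah-zkwp.s3.amazonaws.com. 1.215374 iterator wait for 2600:9000:5300:1b00::1",
]

by_zone, by_status, oldest = summarize(sample)
print(by_zone.most_common(1))
print(by_status.most_common(1))
print(oldest)
```

Run over the full dump, such a tally would show at a glance that almost every open request sits below the same zone and waits for the same upstream address.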

External validation of one of these entries reports, among others, the following error messages:

  • amazonaws.com to s3-1-w.amazonaws.com
    The server(s) for the parent zone (amazonaws.com) responded with a referral instead of answering authoritatively for the DS RR type. (205.251.192.27, 205.251.195.199, 2600:9000:5300:1b00::1, 2600:9000:5303:c700::1, UDP_-_EDNS0_4096_D_K)
  • x8x9zq3n.s3.amazonaws.com/A
    No response was received from the server over UDP (tried 12 times). (156.154.64.10, 156.154.65.10, 2001:502:f3ff::10, 2610:a1:1014::10, UDP_-_NOEDNS_)
  • x8x9zq3n.s3.amazonaws.com/AAAA
    No response was received from the server over UDP (tried 12 times). (156.154.64.10, 156.154.65.10, 2001:502:f3ff::10, 2610:a1:1014::10, UDP_-_NOEDNS_)

In other words, resolution of "amazonaws.com", the domain behind Amazon's services, is broken.

A validating resolver loops through several servers, several times, until it finally gives up empty-handed after a long time. Since it has no result to cache, it goes through the same loop again for every request and stays busy.

Each of our resolvers has 700 worker slots per CPU thread. Given the number of customers and the number of requests, there is no free slot available for a new request after 20 to 30 seconds. Over and out.
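
The numbers can be turned around to estimate how many stuck requests arrive per second. This is a back-of-the-envelope sketch, assuming a steady arrival rate and that every stuck request holds its slot for longer than the time it takes to fill the table (the ages of up to 56 seconds in the list above suggest this). The function name is made up for illustration.

```python
def implied_stuck_rate(slots, seconds_until_full):
    """Stuck requests per second needed to fill `slots` in the given time.

    Assumes a steady arrival rate and that no slot is freed in the meantime.
    """
    return slots / seconds_until_full


SLOTS_PER_THREAD = 700  # figure from the text

for seconds in (20, 30):
    rate = implied_stuck_rate(SLOTS_PER_THREAD, seconds)
    print(f"full after {seconds} s -> about {rate:.1f} stuck requests/s per thread")
```

So roughly 23 to 35 unanswerable requests per second and thread would be enough to starve everything else.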

The fix was to redirect the zone "amazonaws.com" to a non-validating resolver and use its results. The situation eased immediately.
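
The idea behind such a per-zone redirect can be sketched as a lookup that picks an upstream by matching the query name against zone names on label boundaries. This is only an illustration of the principle, not the configuration we actually deployed. The addresses come from the documentation range and are hypothetical, as are the function names.

```python
def normalize(name):
    """Lower-case a DNS name and drop the trailing dot."""
    return name.rstrip(".").lower()


def is_in_zone(qname, zone):
    """True if qname equals the zone or lies below it (label-boundary match)."""
    qname, zone = normalize(qname), normalize(zone)
    return qname == zone or qname.endswith("." + zone)


def pick_upstream(qname, overrides, default):
    """Return the upstream for qname; the most specific matching zone wins."""
    matches = [zone for zone in overrides if is_in_zone(qname, zone)]
    if not matches:
        return default
    best = max(matches, key=lambda zone: len(normalize(zone)))
    return overrides[best]


VALIDATING = "192.0.2.53"       # hypothetical default resolver
NON_VALIDATING = "192.0.2.153"  # hypothetical non-validating resolver
OVERRIDES = {"amazonaws.com.": NON_VALIDATING}

print(pick_upstream("x8x9zq3n.s3.amazonaws.com.", OVERRIDES, VALIDATING))
print(pick_upstream("notamazonaws.com.", OVERRIDES, VALIDATING))
```

The second call shows why the match has to respect label boundaries: a plain string suffix test would wrongly send "notamazonaws.com" to the non-validating resolver as well.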

But what really happened?

A quote from a technician more closely involved: "Uh... what they're entering here is invalid, and it won't be funny when the corresponding TTLs expire."
