First results from DNS Dampening

Writing the software for DNS Dampening proceeds only step by step. Testing in production is the only way to determine whether the assumptions led to correct code.

Release often, release early

The first unforeseen results of publishing my patch are:

  • I was pointed to several fundamental errors in my assumptions. I do understand those concerns and agree with many of them. Please allow me to gain my own experience.
  • I was pointed to an awful error in the heap manipulation functions. That saved me days of hunting it down.
  • Kind encouragement (even in this early stage) helped me to continue with this work.
  • I learned more about internal processes in our own company than I ever wished to know.

Thank you very much.

If I had tried to delay the publication, I would have chosen other ways and refused to throw away my code that easily. Probably I would have given up.

First results

After deploying the patch to the affected server, everything looked sound. Outgoing traffic decreased, and I was happy.

Some strange lines scrolled through the logs: the IPv6-enabled resolvers of DTAG, HE, and SixXS went into dampening. But I did not find any detailed information. That's why I extended the patch with some logging:

--- bind-9.9.1-P3/bin/named/query.c     2012-08-24 06:43:09.000000000 +0200
+++ bind-9.9.1-P3-dampening/bin/named/query.c   2012-09-25 22:37:26.000000000 +0
@@ -7146,7 +7148,7 @@
        }

        static inline void
-       log_query(ns_client_t *client, unsigned int flags, unsigned int extflags) {
+       log_query(ns_client_t *client, unsigned int flags, unsigned int extflags, unsigned int penalty) {
                char namebuf[DNS_NAME_FORMATSIZE];
                char typename[DNS_RDATATYPE_FORMATSIZE];
                char classname[DNS_RDATACLASS_FORMATSIZE];
@@ -7165,7 +7167,7 @@
                isc_netaddr_format(&client->destaddr, onbuf, sizeof(onbuf));

                ns_client_log(client, NS_LOGCATEGORY_QUERIES, NS_LOGMODULE_QUERY,
-                             level, "query: %s %s %s %s%s%s%s%s%s (%s)", namebuf,
+                             level, "query: %s %s %s %s%s%s%s%s%s (%s) %u", namebuf,
                              classname, typename, WANTRECURSION(client) ? "+" : "-",
                              (client->signer != NULL) ? "S": "",
                              (client->opt != NULL) ? "E" : "",
@@ -7173,7 +7175,7 @@
                                         "T" : "",
                              ((extflags & DNS_MESSAGEEXTFLAG_DO) != 0) ? "D" : "",
                              ((flags & DNS_MESSAGEFLAG_CD) != 0) ? "C" : "",
-                             onbuf);
+                             onbuf, penalty);
        }

        static inline void
@@ -7228,6 +7230,7 @@
                unsigned int saved_extflags = client->extflags;
                unsigned int saved_flags = client->message->flags;
                isc_boolean_t want_ad;
+               unsigned int penalty;

                CTRACE("ns_query_start");

@@ -7282,6 +7285,14 @@
                }

                /*
+                * Update the penalty and report the current state
+                */
+               if (dampening_query(client, &penalty) == DAMPENING_SUPPRESS) {
+                       query_next(client, DNS_R_DROP);
+                       return;
+               }
+
+               /*
                 * Get the question name.
                 */
                result = dns_message_firstname(message, DNS_SECTION_QUESTION);
@@ -7306,7 +7317,7 @@
                }

                if (ns_g_server->log_queries)
-                       log_query(client, saved_flags, saved_extflags);
+                       log_query(client, saved_flags, saved_extflags, penalty);

                /*
                 * Check for multiple question queries, since edns1 is dead.

Storing the value in dampening_query is a no-brainer.
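The out-parameter pattern used by the patch can be sketched as follows. This is not the code from the patch: `dampening_entry_t`, `dampening_update`, the limit of 40 points, and the one point per query are hypothetical stand-ins chosen for illustration. The only thing taken from the diff above is the shape of the call, a state as return value and the penalty written through a pointer.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real patch works on BIND's ns_client_t. */
typedef enum { DAMPENING_PASS, DAMPENING_SUPPRESS } dampening_state_t;

typedef struct {
    unsigned int penalty;          /* accumulated points for one client address */
} dampening_entry_t;

#define DAMPENING_LIMIT  40u       /* assumed threshold, not from the patch */
#define DAMPENING_POINTS 1u        /* assumed cost of a single query */

/*
 * Add the cost of one query to the entry, report the new penalty through
 * the out-parameter, and tell the caller whether to answer or to drop.
 */
static dampening_state_t
dampening_update(dampening_entry_t *entry, unsigned int *penalty) {
    /* Saturate instead of wrapping around to zero. */
    if (entry->penalty <= UINT_MAX - DAMPENING_POINTS)
        entry->penalty += DAMPENING_POINTS;
    else
        entry->penalty = UINT_MAX;

    /* Written on every path, so the caller never logs an uninitialized value. */
    if (penalty != NULL)
        *penalty = entry->penalty;

    return (entry->penalty > DAMPENING_LIMIT) ? DAMPENING_SUPPRESS
                                              : DAMPENING_PASS;
}
```

Because the caller declares `penalty` without an initializer and hands it to the logging function afterwards, the sketch writes the value before deciding on the return state.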

But the system appears to respond unusually slowly. Bash prompts with the hostname, so let's debug this: strace -f hostname -f. There are long timeouts during name resolution via localhost. So let me extract the penalty over time, sorry Kris.

penalty3

Let's inspect the spikes in the evening:

  • At 22:57 I ran tcpdump with name resolution enabled in order to verify the basic DNS functionality.
  • The wide bars at 22:05 and 23:43 are brute force attacks via SSH. SSH checks reverse and forward DNS for each connection.
  • At midnight the system itself stormed into dampening.

Expanding the bar between 23:02:50 and 23:07:00 reveals the gory details:

penalty3-1

Penalty rises to about 40 points and drops to zero repeatedly. Dampening works. But it looks a bit inverted. Overwriting the data in the last element might be another explanation for this graph. But overall, it's fine for me.

Deeper inspection of the midnight event reveals a bunch of scripts which cross-check the consistency of DNS entries: Do forward and reverse lookups correspond to each other, is the name server listed in the parent zone, and so on? Yes, those tests raise 40000 queries within a few minutes. That's a real problem.

The most obvious solution is to exclude the locally connected networks from dampening. If you really have a spoofing problem with those addresses, you are no better than any of the attackers exploiting missing implementations of BCP38.
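A minimal sketch of such an exemption check, assuming IPv4 addresses in host byte order and a static list of local prefixes. The names `ipv4_prefix_t`, `ipv4_in_prefix`, and `dampening_exempt` are made up for this example, and a real implementation inside BIND would rather reuse its ACL machinery and cover IPv6 as well.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefix type: network address in host byte order plus length. */
typedef struct {
    uint32_t     network;
    unsigned int prefix_len;   /* 0..32 */
} ipv4_prefix_t;

/* Return 1 if addr lies inside the prefix, 0 otherwise. */
static int
ipv4_in_prefix(uint32_t addr, const ipv4_prefix_t *prefix) {
    uint32_t mask;

    if (prefix->prefix_len > 32)
        return 0;                              /* malformed entry never matches */
    /* A shift by 32 bits is undefined in C, so /0 is handled separately. */
    mask = (prefix->prefix_len == 0)
               ? 0
               : (uint32_t)(0xFFFFFFFFu << (32 - prefix->prefix_len));
    return (addr & mask) == (prefix->network & mask);
}

/* Return 1 if addr belongs to any exempt prefix and should skip dampening. */
static int
dampening_exempt(uint32_t addr, const ipv4_prefix_t *list, size_t count) {
    size_t i;

    for (i = 0; i < count; i++) {
        if (ipv4_in_prefix(addr, &list[i]))
            return 1;
    }
    return 0;
}
```

With the documentation prefixes 192.0.2.0/24 and 198.51.100.0/24 in the list, a query from 192.0.2.17 would bypass the penalty accounting, while one from 203.0.113.5 would be counted as usual.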

A completely different client reached over a thousand penalty points within 20 minutes just by sending a few queries. It took a long time staring at the graph until I was able to draw the two enlightening lines. m(

penalty4

Instead of decaying the points (following the green line), the penalty increases exponentially. Really, the code misses the important minus in the exponent. What a braindead error!
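To make the effect of the sign concrete, here is the decay written out under the assumption of a plain exponential decay with a half-life. The symbols are mine and not taken from the patch: p is the penalty, Δt the time since the last update, and T_{1/2} a half-life parameter.

```latex
% Assumed notation: p = penalty, \Delta t = time since the last update,
% T_{1/2} = half-life (hypothetical parameter, not named in the patch).
\begin{align*}
\lambda &= \frac{\ln 2}{T_{1/2}} \\
\text{intended:}\quad p(t_0 + \Delta t) &= p(t_0)\, e^{-\lambda \Delta t} \\
\text{with the bug:}\quad p(t_0 + \Delta t) &= p(t_0)\, e^{+\lambda \Delta t} \\
\text{example } \Delta t = 2\,T_{1/2}:\quad
  e^{-2 \ln 2} &= \tfrac{1}{4}, \qquad e^{+2 \ln 2} = 4
\end{align*}
```

So a client that pauses for two half-lives should come back with a quarter of its points, but with the wrong sign it comes back with four times as many. That matches a client collecting a huge penalty with only a few, widely spaced queries.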

On the other hand, the error had its benefits. I was able to test the system under conditions of very heavy attacks from all sides. Even then a large part of the relevant traffic was processed correctly. Only the fixed resource allocation prevented the system from exploding.

Bugfix in production

The fixed system responds smoothly. And the first statistics are promising:

penalty5

After the initial learning phase, dampening and relearning cause waves. Variations in the attack rates make the waves disappear.

There is a difference between accumulating by address and classifying by query type. The question is whether and how normal queries are suppressed by ANY-type attacks. So let's have a look at penalty over time for different query types.

penalty6-1

None of the regular queries comes from a dampened address! Attackers did not destroy production. For reference, let's include the attacker packets:

penalty6

Attacks are identified within 40 packets and answered with silence. The horizontal lines are characteristic of massively repeated ANY queries with the same ID. But the output rate is now production ready. Welcome back, server.

Real world statistics about query types could be derived easily.

 165976 A
  56926 ANY
  46890 AAAA
  36527 PTR
  11336 MX
   8025 SOA
   4958 DNSKEY
   2974 NS
   2786 SRV
   2293 TXT
    587 A6
    568 SPF
    543 DLV
    170 DS
    105 CNAME
      4 HINFO
      4 AXFR
      3 TLSA
      2 NAPTR
      2 RRSIG
      1 NSEC

I'm thankful for the TLSA queries. Not so funny is any query for A6. DLV queries are quite usual; I'm running such a list. But the ratio of IPv6 to IPv4 is outstanding. A quarter of the clients out there interested in our services have enough IPv6 connectivity to ask for AAAA.
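Counts like the ones in the list above can be tallied straight from the query log. The following is a rough sketch, assuming log lines in the patched format "query: name class type flags (address) penalty". The functions `extract_qtype` and `tally_add` and the sample line in the note below are made up for illustration; this is not how the numbers above were produced.

```c
#include <stdio.h>
#include <string.h>

#define MAX_QTYPES 64

typedef struct {
    char          type[32];
    unsigned long count;
} qtype_count_t;

typedef struct {
    qtype_count_t slot[MAX_QTYPES];
    size_t        used;
} qtype_tally_t;

/*
 * Copy the query type of one log line into qtype.
 * Returns 1 on success, 0 if the line carries no parsable query.
 */
static int
extract_qtype(const char *line, char *qtype, size_t size) {
    char name[1024], qclass[32], type[32];
    const char *p = strstr(line, "query: ");

    if (p == NULL || size == 0)
        return 0;
    /* After "query: " follow name, class and type, separated by blanks. */
    if (sscanf(p + 7, "%1023s %31s %31s", name, qclass, type) != 3)
        return 0;
    if (strlen(type) >= size)
        return 0;
    snprintf(qtype, size, "%s", type);
    return 1;
}

/* Count one occurrence of type; types beyond MAX_QTYPES are ignored. */
static void
tally_add(qtype_tally_t *tally, const char *type) {
    size_t i;

    for (i = 0; i < tally->used; i++) {
        if (strcmp(tally->slot[i].type, type) == 0) {
            tally->slot[i].count++;
            return;
        }
    }
    if (tally->used < MAX_QTYPES) {
        snprintf(tally->slot[tally->used].type,
                 sizeof(tally->slot[tally->used].type), "%s", type);
        tally->slot[tally->used].count = 1;
        tally->used++;
    }
}
```

Fed with an invented line such as "client 192.0.2.1#4711: query: example.org IN ANY +E (192.0.2.53) 12", `extract_qtype` yields "ANY", which `tally_add` then counts.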

Open problems

The system is usable at a very basic level. Next steps include:

  • Can I do better than the simple heap? Would a circular buffer for age management plus a hash table for searching be the better choice?
  • Add configuration, ACLs etc.
  • What can I do to mitigate an attack directed at the well-known resolver IP of a major ISP? Would our zones become unreachable for the customers of this ISP?
  • How do my ideas compare with others? What are the pros and the cons?
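The circular buffer plus hash table idea from the first item might look roughly like this. It is only a sketch under my own assumptions: a fixed number of slots, IPv4 addresses as keys, chained buckets linked by slot index, and recycling of the oldest inserted slot. All names (`damp_table_t`, `damp_get`, and so on) are hypothetical and the sizes are tiny on purpose.

```c
#include <stdint.h>
#include <string.h>

#define DAMP_SLOTS   8            /* fixed allocation, tiny for illustration */
#define DAMP_BUCKETS 16
#define DAMP_NONE    (-1)

typedef struct {
    uint32_t     addr;
    unsigned int penalty;
    int          next;            /* next slot in the same bucket, or DAMP_NONE */
    int          in_use;
} damp_slot_t;

typedef struct {
    damp_slot_t slot[DAMP_SLOTS];
    int         bucket[DAMP_BUCKETS];   /* head slot per hash bucket */
    int         oldest;                 /* ring cursor: slot to recycle next */
} damp_table_t;

static unsigned int
damp_hash(uint32_t addr) {
    /* Multiplicative hash; the constant is an arbitrary odd mixing value. */
    return (unsigned int)(((addr * 2654435761u) >> 16) % DAMP_BUCKETS);
}

static void
damp_init(damp_table_t *t) {
    int i;

    memset(t, 0, sizeof(*t));
    for (i = 0; i < DAMP_BUCKETS; i++)
        t->bucket[i] = DAMP_NONE;
}

/* Hash lookup: walk the bucket chain of addr. */
static damp_slot_t *
damp_find(damp_table_t *t, uint32_t addr) {
    int i = t->bucket[damp_hash(addr)];

    while (i != DAMP_NONE) {
        if (t->slot[i].addr == addr)
            return &t->slot[i];
        i = t->slot[i].next;
    }
    return NULL;
}

/* Remove a slot from its bucket chain before it gets recycled. */
static void
damp_unlink(damp_table_t *t, int victim) {
    int *link = &t->bucket[damp_hash(t->slot[victim].addr)];

    while (*link != DAMP_NONE && *link != victim)
        link = &t->slot[*link].next;
    if (*link == victim)
        *link = t->slot[victim].next;
}

/* Return the entry for addr, recycling the oldest slot if addr is new. */
static damp_slot_t *
damp_get(damp_table_t *t, uint32_t addr) {
    damp_slot_t *s = damp_find(t, addr);
    unsigned int h;
    int victim;

    if (s != NULL)
        return s;
    victim = t->oldest;
    t->oldest = (t->oldest + 1) % DAMP_SLOTS;
    s = &t->slot[victim];
    if (s->in_use)
        damp_unlink(t, victim);
    h = damp_hash(addr);
    s->addr = addr;
    s->penalty = 0;
    s->in_use = 1;
    s->next = t->bucket[h];
    t->bucket[h] = victim;
    return s;
}
```

Note that this recycles entries in order of first appearance, not by lowest penalty as a heap could, so a persistent attacker would eventually be forgotten and relearned. Whether that trade-off is acceptable is exactly the open question.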
