First results from DNS Dampening
Writing the software for DNS Dampening progresses only step by step. Testing in production is the only way to determine whether my assumptions led to correct code.
Release often, release early
The first unforeseen results of publishing my patch are:
- I was pointed to several fundamental errors in my assumptions. I do understand those concerns and agree with many of them. Please allow me to gain my own experience.
- I was pointed to an awful error in the heap manipulation functions. That saved me days of hunting down this error.
- Kind encouragement (even in this early stage) helped me to continue with this work.
- I learned more about internal processes in our own company than I ever wished to know.
Thank you very much.
If I had tried to delay the publication, I would have chosen other ways and refused to throw away my code that easily. I probably would have given up.
First results
After deploying the patch to the affected server, everything looked sound. Outgoing traffic decreased, I was happy.
Some strange lines scrolled through the logs: the IPv6-enabled resolvers of DTAG, HE, and SiXXS went into dampening. But I did not find any detailed information. That's why I extended the patch with some logging:
--- bind-9.9.1-P3/bin/named/query.c 2012-08-24 06:43:09.000000000 +0200
+++ bind-9.9.1-P3-dampening/bin/named/query.c 2012-09-25 22:37:26.000000000 +0
@@ -7146,7 +7148,7 @@
 }
 
 static inline void
-log_query(ns_client_t *client, unsigned int flags, unsigned int extflags) {
+log_query(ns_client_t *client, unsigned int flags, unsigned int extflags, unsigned int penalty) {
         char namebuf[DNS_NAME_FORMATSIZE];
         char typename[DNS_RDATATYPE_FORMATSIZE];
         char classname[DNS_RDATACLASS_FORMATSIZE];
@@ -7165,7 +7167,7 @@
         isc_netaddr_format(&client->destaddr, onbuf, sizeof(onbuf));
 
         ns_client_log(client, NS_LOGCATEGORY_QUERIES, NS_LOGMODULE_QUERY,
-                      level, "query: %s %s %s %s%s%s%s%s%s (%s)", namebuf,
+                      level, "query: %s %s %s %s%s%s%s%s%s (%s) %u", namebuf,
                       classname, typename, WANTRECURSION(client) ? "+" : "-",
                       (client->signer != NULL) ? "S": "",
                       (client->opt != NULL) ? "E" : "",
@@ -7173,7 +7175,7 @@
                       "T" : "",
                       ((extflags & DNS_MESSAGEEXTFLAG_DO) != 0) ? "D" : "",
                       ((flags & DNS_MESSAGEFLAG_CD) != 0) ? "C" : "",
-                      onbuf);
+                      onbuf, penalty);
 }
 
 static inline void
@@ -7228,6 +7230,7 @@
         unsigned int saved_extflags = client->extflags;
         unsigned int saved_flags = client->message->flags;
         isc_boolean_t want_ad;
+        unsigned int penalty;
 
         CTRACE("ns_query_start");
 
@@ -7282,6 +7285,14 @@
         }
 
         /*
+         * Update the penalty and report the current state
+         */
+        if (dampening_query(client, &penalty) == DAMPENING_SUPPRESS) {
+                query_next(client, DNS_R_DROP);
+                return;
+        }
+
+        /*
          * Get the question name.
          */
         result = dns_message_firstname(message, DNS_SECTION_QUESTION);
@@ -7306,7 +7317,7 @@
         }
 
         if (ns_g_server->log_queries)
-                log_query(client, saved_flags, saved_extflags);
+                log_query(client, saved_flags, saved_extflags, penalty);
 
         /*
          * Check for multiple question queries, since edns1 is dead.
Storing the value in dampening_query is a no-brainer.
But the system appears to respond unusually slowly. Bash prompts with the hostname, so let's debug this: strace -f hostname -f. There are heavy timeouts during name resolution via localhost. So let me extract the penalty over time, sorry Kris.
Let's inspect the spikes in the evening:
- At 22:57 I ran a tcpdump with name resolution enabled in order to verify the basic DNS functionality.
- The wide bars at 22:05 and 23:43 are brute force attacks via SSH. SSH checks reverse and forward DNS for each connection.
- At midnight the system storms into dampening.
Expanding the bar between 23:02:50 and 23:07:00 reveals the gory details:
Penalty rises to about 40 points and drops to zero repeatedly. Dampening works. But it looks a bit inverted. Overwriting the data in the last element might be another explanation for this graph. But overall, it's fine for me.
Deeper inspection of the midnight event reveals a bunch of scripts which cross-check the consistency of DNS entries: Do forward and reverse lookups correspond to each other, is the name server listed in the parent zone, and so on? Yes, those tests raise 40000 queries in a few minutes. That's a real problem.
The most obvious solution is to exclude the locally connected networks from dampening. If you really have a spoofing problem with those addresses, you are no better than any of the attackers exploiting missing implementations of BCP38.
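Such an exemption could be checked before any penalty is applied. The following is only a minimal sketch under stated assumptions: the names exempt_net and dampening_exempt are hypothetical and not part of the patch, the check is IPv4-only for brevity, and a real integration would rather use the ACL machinery of the name server.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>

/* One exempt IPv4 prefix, e.g. 192.0.2.0/24 (hypothetical structure). */
struct exempt_net {
	const char *prefix;	/* network address as dotted quad */
	unsigned    bits;	/* prefix length, 0..32 */
};

/* Returns true if addr (dotted quad) lies inside one of the n prefixes. */
static bool
dampening_exempt(const char *addr, const struct exempt_net *nets, size_t n) {
	struct in_addr a, net;

	if (inet_pton(AF_INET, addr, &a) != 1)
		return false;	/* unparsable source: never exempt */
	uint32_t ip = ntohl(a.s_addr);

	for (size_t i = 0; i < n; i++) {
		assert(nets[i].bits <= 32);
		if (inet_pton(AF_INET, nets[i].prefix, &net) != 1)
			continue;	/* skip malformed list entries */
		/* Shifting by 32 is undefined in C, so /0 is handled separately. */
		uint32_t mask = (nets[i].bits == 0)
			? 0 : (uint32_t)(UINT32_MAX << (32 - nets[i].bits));
		if ((ip & mask) == (ntohl(net.s_addr) & mask))
			return true;
	}
	return false;
}
```

A caller would consult dampening_exempt first and skip the penalty update entirely for matching sources, so the local consistency scripts never accumulate points.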
A completely different client reached over a thousand penalty points within 20 minutes by sending only a few queries. It took a long time staring at the graph until I was able to draw the two enlightening lines. m(
Instead of decaying the points (following the green line), the penalty increases exponentially. Really, the code is missing the all-important minus sign in the exponent. What a braindead error!
On the other hand, the error had its benefits. I was able to test the system under conditions of very heavy attacks from all sides. Even then a large part of the relevant traffic was processed correctly. Only the fixed resource allocation prevented the system from exploding.
Bugfix in production
The fixed system responds smoothly. And the first statistics are promising:
After the initial learning phase, dampening and relearning cause waves. Variations in the attack rates make the waves disappear.
There is a difference between accumulating by address and classifying by query type. The question is if and how normal queries are suppressed by ANY-type attacks. So let's have a look at penalty over time for different query types.
None of the regular queries comes from a dampened address! Attackers did not destroy production. For reference, let's include the attacker packets:
Attacks are identified within 40 packets and answered with silence. The horizontal lines are characteristic of massively repeated ANY queries with the same ID. But the output rate is now production ready. Welcome back, server.
Real world statistics about query types could be derived easily.
165976 A
 56926 ANY
 46890 AAAA
 36527 PTR
 11336 MX
  8025 SOA
  4958 DNSKEY
  2974 NS
  2786 SRV
  2293 TXT
   587 A6
   568 SPF
   543 DLV
   170 DS
   105 CNAME
     4 HINFO
     4 AXFR
     3 TLSA
     2 NAPTR
     2 RRSIG
     1 NSEC
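A tally like this can be derived from the query log. The sketch below assumes log lines that contain "query: " followed by name, class and type, which is the field order of the format string in the patch. The structure and function names are hypothetical, and the table is a small fixed-size array for brevity.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TYPES 64

struct type_count {
	char     name[32];
	unsigned count;
};

struct type_stats {
	struct type_count entry[MAX_TYPES];
	size_t            used;
};

/*
 * Extract the query type from one log line and bump its counter.
 * Returns 1 if the line was counted, 0 otherwise.
 */
static int
stats_add_line(struct type_stats *st, const char *line) {
	char name[256], cls[32], type[32];
	const char *p;

	assert(st != NULL && line != NULL);
	p = strstr(line, "query: ");
	if (p == NULL)
		return 0;
	if (sscanf(p + strlen("query: "), "%255s %31s %31s", name, cls, type) != 3)
		return 0;
	for (size_t i = 0; i < st->used; i++) {
		if (strcmp(st->entry[i].name, type) == 0) {
			st->entry[i].count++;
			return 1;
		}
	}
	if (st->used == MAX_TYPES)
		return 0;	/* table full: this sketch drops the line */
	strcpy(st->entry[st->used].name, type);	/* fits: type has at most 31 chars */
	st->entry[st->used].count = 1;
	st->used++;
	return 1;
}

/* Look up the counter for one query type; unknown types count as 0. */
static unsigned
stats_get(const struct type_stats *st, const char *type) {
	for (size_t i = 0; i < st->used; i++)
		if (strcmp(st->entry[i].name, type) == 0)
			return st->entry[i].count;
	return 0;
}
```

Feeding every log line through stats_add_line and printing the entries sorted by count would give a list in the shape shown above.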
I'm thankful for the TLSA queries. Less funny are the queries for A6. DLV queries are quite common; I'm running such a list. But the ratio of IPv6 to IPv4 is outstanding. A quarter of the clients out there interested in our services have enough IPv6 connectivity to ask for AAAA.
Open problems
The system is usable at a very basic level. Next steps include:
- Can I do better than the simple heap? It might be better to use a circular buffer for age management and a hash table for searching.
- Add configuration, ACLs etc.
- What can I do to mitigate an attack against the well-known resolver IP of a major ISP? Would our zones become unreachable for the customers of this ISP?
- How do my ideas compare with others? What are the pros and the cons?
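For the first question, one possible reading of "circular buffer plus hash table" is sketched below. Everything here is an assumption for illustration: the names, the tiny sizes, the IPv4-only key, and the choice to recycle slots by insertion age rather than by last activity. It is not the data structure used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS   4	/* ring capacity, tiny on purpose */
#define BUCKETS 8	/* number of hash chains */
#define NONE    (-1)

struct slot {
	uint32_t addr;		/* key: IPv4 source address */
	unsigned penalty;
	int      next;		/* next slot in the same hash chain, or NONE */
	int      in_use;
};

struct table {
	struct slot slot[SLOTS];
	int         bucket[BUCKETS];	/* head slot of each hash chain */
	int         oldest;		/* ring position recycled next */
};

static unsigned
hash_addr(uint32_t addr) {
	return (unsigned)(((uint32_t)(addr * UINT32_C(2654435761))) >> 16) % BUCKETS;
}

static void
table_init(struct table *t) {
	for (int i = 0; i < BUCKETS; i++)
		t->bucket[i] = NONE;
	for (int i = 0; i < SLOTS; i++) {
		t->slot[i].in_use = 0;
		t->slot[i].next = NONE;
	}
	t->oldest = 0;
}

/* Hash lookup: walk one chain only, independent of the ring order. */
static struct slot *
table_find(struct table *t, uint32_t addr) {
	for (int i = t->bucket[hash_addr(addr)]; i != NONE; i = t->slot[i].next)
		if (t->slot[i].addr == addr)
			return &t->slot[i];
	return NULL;
}

/* Remove slot idx from its hash chain before the slot is recycled. */
static void
table_unlink(struct table *t, int idx) {
	int *link = &t->bucket[hash_addr(t->slot[idx].addr)];

	while (*link != idx) {
		assert(*link != NONE);	/* a used slot is always on its chain */
		link = &t->slot[*link].next;
	}
	*link = t->slot[idx].next;
}

/* Find the entry for addr, or recycle the oldest ring slot for it. */
static struct slot *
table_get(struct table *t, uint32_t addr) {
	struct slot *s = table_find(t, addr);

	if (s != NULL)
		return s;
	int idx = t->oldest;
	t->oldest = (t->oldest + 1) % SLOTS;
	s = &t->slot[idx];
	if (s->in_use)
		table_unlink(t, idx);
	s->addr = addr;
	s->penalty = 0;
	s->in_use = 1;
	unsigned h = hash_addr(addr);
	s->next = t->bucket[h];
	t->bucket[h] = idx;
	return s;
}
```

Lookup cost depends on the chain length, and eviction is a constant-time step around the ring. The price is that a busy attacker can be evicted just because its entry is old, which a heap ordered by penalty or activity would avoid.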