Whenever you offer IPTV streaming to customers, you need to monitor it. Unfortunately this becomes complicated if you cannot control all components. We were urged to measure the quality of the IPTV signal, but only have access to a single switch in the distribution network.

Short Outages

The main problem with IPTV is that customers notice every outage, even if it lasts only a second. And every such outage results in a serious complaint to the service desk.

Therefore the most important task was to inform the service desk about missing multicast packets as soon as possible. An alarm has to be raised a few seconds after the outage; this way the service desk is informed before the customer complains. Given typical customer reaction times, the alarm needs to be delivered within five seconds.

What we did was simple: take an old, low-performance system, join all available streams (radio as well as TV), which does not saturate the Gigabit link, and record per stream when the last packet was seen. If a stream has not received a packet for more than five seconds, report the stream and the corresponding parameters. These reports are injected into the alarm system, which in turn sends SMS, email, or opens a popup window. Finally, the data can be plotted, using logarithmic axes to emphasize the short outages.
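The detection logic described above can be sketched in a few lines (illustrative Python, not the original tool; all names are mine): remember per stream when the last packet arrived, and report every stream that has been silent longer than the five-second limit.

```python
# Sketch of the outage detector: per-stream "last seen" timestamps
# plus a query for streams that have gone silent.

ALARM_AFTER = 5.0  # seconds without a packet before an alarm is raised

last_seen = {}

def packet_received(stream, now):
    """Record that a packet for `stream` arrived at time `now`."""
    last_seen[stream] = now

def silent_streams(now):
    """Return all streams without a packet for more than ALARM_AFTER seconds."""
    return [s for s, t in last_seen.items() if now - t > ALARM_AFTER]
```

In the real setup, `silent_streams` would run periodically and feed its result into the alarm system.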

iptv-missing-channels

As a secondary result, the reporting also tells which streamer is affected and how many streams are still available from this device. If the number of streams drops to zero, urgent replacement of broken hardware can be scheduled. The alarm system can inform a different set of people about this type of event.

Details

Customers do not only complain if the movie freezes or the TV goes black; they even call if the picture or audio is garbled.

That's why our software was extended to examine the MPEG-TS packets themselves. There is a lot of interesting information inside:

Datenstrom liegt von a.b.c.d an.

Statistik über 3 Sekunden:
 Empfangene UDP Pakete      3446   Paket zu kurz                 0
 Verarbeitete MPEG Pakete  24122   Fehlende Sync-Markierung      0

Statistiken pro Datenstrom (MPEG-Programm)
 Programmkennung (PID)      6630 6610 6622 6620 6621 6600    0   32
                            19e6 19d2 19de 19dc 19dd 19c8 0000 0020
 Empfangene Pakete           56721738  865  494  377   29   28   24
 verschluesselte Daten         021366  829  448  331    0    0    0
 Ansetzpunkte, FullFrames      0    5   18   23   23    0    0    0
 priorisierte Daten            0    5    0    0    0    0    0    0
 Signalstoerung am Streamer    0    0    0    0    0    0    0    0
 Taktstoerung am Streamer      0    0    0    0    0    0    0    0
 Sequenzfehler(CC)             0    0    0    0    0    0    0    0

     Abweichungen vom Synchronisationstakt 19d2 (absoluter Jitter)
26%  ||                                                                   
20%  ||                                                                   
14% ||||                                                                  
 8% ||||||                                                                
 2% ||||||                                                                
   +----------------------------------------------------------------------
 ms 0123456789012345678901234567890123456789012345678901234567890123456789
    0         1         2         3         4         5         6         

     Abweichungen der Paketankunftszeiten (relativer Jitter)
66% |                                                                     
52% |                                                                     
37% |                                                                     
22% |                                                                     
 7% ||  ||||                                                              
   +----------------------------------------------------------------------
 ms 0123456789012345678901234567890123456789012345678901234567890123456789
    0         1         2         3         4         5         6         

Kanal ist durch Kunden im Moment abonniert.

Such an HDTV channel has a high data rate and low jitter. A radio stream is quite different:

Datenstrom liegt von a.b.c.e an.

Statistik über 3 Sekunden:
 Empfangene UDP Pakete        43   Paket zu kurz                 0
 Verarbeitete MPEG Pakete    301   Fehlende Sync-Markierung      0

Statistiken pro Datenstrom (MPEG-Programm)
 Programmkennung (PID)       851    0  850
                            0353 0000 0352
 Empfangene Pakete           282   13    6
 verschluesselte Daten         0    0    0
 Ansetzpunkte, FullFrames      0    0    0
 priorisierte Daten            0    0    0
 Signalstoerung am Streamer    0    0    0
 Taktstoerung am Streamer      0    0    0
 Sequenzfehler(CC)             4    0    4

     Abweichungen vom Synchronisationstakt 0353 (absoluter Jitter)
59%                                                                      |
46%                                                                      |
33%                                                                      |
19%                                                                      |
 6% |    |     | |     |     ||   ||             ||   |      || |        |
   +----------------------------------------------------------------------
 ms 0123456789012345678901234567890123456789012345678901234567890123456789
    0         1         2         3         4         5         6         

     Abweichungen der Paketankunftszeiten (relativer Jitter)
15%         |                                                             
11%         |                                                             
 8%       | | |        |                                                  
 5%     | | | |  |  | || |                                                
 1%  | || ||| ||||||||||||  ||  | |                |                      
   +----------------------------------------------------------------------
 ms 0123456789012345678901234567890123456789012345678901234567890123456789
    0         1         2         3         4         5         6         

Kanal ist durch Kunden im Moment nicht abonniert.

MPEG transport streams (each representing a single TV channel) consist of several programs (sub-streams), which are reported separately. Program 0000 contains the index: the information about which program is used for which purpose. But even without decoding the index, it's easy to deduce that the program with the highest bandwidth represents the picture stream. Other programs carry teletext, subtitles, EPG, and one or more audio channels. ZDF-HD is very rich in such programs.
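The per-program counters shown above come straight out of the fixed 4-byte MPEG-TS packet header. A minimal parser sketch (illustrative Python, field layout per ISO/IEC 13818-1; this is not the original software):

```python
def parse_ts_header(packet: bytes):
    """Parse the 4-byte MPEG-TS packet header (ISO/IEC 13818-1).

    Returns the fields the per-program statistics are based on.
    """
    if len(packet) < 4 or packet[0] != 0x47:
        raise ValueError("missing sync byte 0x47")
    return {
        "transport_error":    bool(packet[1] & 0x80),  # set by the streamer on signal trouble
        "payload_unit_start": bool(packet[1] & 0x40),
        "priority":           bool(packet[1] & 0x20),  # "prioritized data" counter
        "pid":                ((packet[1] & 0x1F) << 8) | packet[2],
        "scrambled":          (packet[3] >> 6) & 0x03 != 0,  # encrypted payload
        "continuity_counter": packet[3] & 0x0F,        # 4-bit sequence number (CC)
    }
```

For example, a packet starting with `47 19 d2 10` belongs to PID 0x19d2, the video program from the HDTV dump above.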

Statistics

Plotting this data is an obvious wish, simply to investigate customer complaints post mortem. In order to find out what went wrong last evening, historical data is needed.

Aggregating the data over one-minute intervals limits the amount of recorded data: the database has to store only a single group of measurements per program and minute. And this data can be viewed.
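The per-minute aggregation can be sketched as follows (illustrative Python; the sample format and names are mine): raw samples are collapsed into one row per program and minute.

```python
from collections import defaultdict

def minute_bucket(timestamp):
    """Map a Unix timestamp to the start of its minute."""
    return int(timestamp) // 60 * 60

def aggregate(samples):
    """Sum packet counts per (program, minute) - one DB row per group.

    `samples` is an iterable of (timestamp, program, packet_count).
    """
    totals = defaultdict(int)
    for ts, program, packets in samples:
        totals[(program, minute_bucket(ts))] += packets
    return dict(totals)
```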

The first example shows the bandwidth, computed as (packets × size / time). Please notice how the bandwidth of an HD channel is dominated by the movie/picture stream.

iptv-history-data-rate

Between 17:30 and 20:00 the bandwidth dropped, then returned at half the nominal rate. A reasonable explanation is that the broadcasting organization switched to SDTV material, i.e. coming from a different source.

Some broadcasting organizations, especially the regional "third" channels in the German TV market, copy the whole stream from the ARD for common broadcasts (e.g. Tagesschau). The switch even shows up as a different program ID.

iptv-history-kanaluebernahme

But let's go back to the bandwidth drop. The number of full frames (restart points for the decoder) may reveal whether there is a problem in the local distribution or at the broadcasting organization. The network itself is usually unable to transcode the stream and inject full frames on its own.

iptv-history-fullframes

During this period, the broadcasting organization repeatedly retried to sync its own signal. The level of resync points is obviously lower than in the full HD stream, but the other programs contain much more resync information than normal.

It's possible that the streamer lost the (analog) signal. In this case a special bit is set in the MPEG-TS to indicate such a loss of signal, which allows the decoder to restart from scratch instead of waiting for additional information.

iptv-history-signal-streamer

There was such a loss of (analog) signal at the very beginning of the disruption. But the following effects were solely caused by the received signal itself.

There is another source of trouble that can occur at the broadcasting organization: in order to synchronize the independent programs of the MPEG-TS, a common 27 MHz clock is used. Using this clock, audio and video streams do not diverge. Such a high-rate clock may (according to the standard) also be used to regenerate or calibrate the clock in cheap decoder hardware. But even this clock might change. In this case a special bit is set by the broadcasting organization to inform the decoders.

iptv-history-takt-streamer

Even this bit clearly shows that the disruption was caused by the broadcasting organization. An error in the local distribution infrastructure can be ruled out.

A still open question is how to detect packet loss and packet reordering, the most common problems in IP networks. These are detected by monitoring the sequence numbers (CC) of the MPEG-TS. Unfortunately the sequence number space cycles from 0 to 15, while up to seven MPEG-TS packets are packed into a single UDP packet. This renders the number nearly useless.
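The continuity check itself is simple (illustrative Python sketch, not the original code): the counter is only 4 bits wide, so any step other than +1 modulo 16 is counted as a potential loss or reorder, and a gap of exactly 16 TS packets on one PID would go unnoticed, which is why the counter is of limited use.

```python
def cc_errors(counters):
    """Count continuity-counter errors in a per-PID sequence of CC values.

    The CC is 4 bits (0..15) and wraps around; any step other than
    +1 mod 16 indicates a lost, duplicated, or reordered packet.
    """
    errors = 0
    for prev, cur in zip(counters, counters[1:]):
        if cur != (prev + 1) % 16:
            errors += 1
    return errors
```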

iptv-history-cc-error

The most interesting point in this graph is the small number of CC errors during the disruption. Several data packets were lost or reordered. A possible reason might be that the programs were resynced over and over again by the broadcasting organization itself; the sequence counter can be reset in this case.

Finally, let's take a look at the jitter:

iptv-history-jitter

No more questions.

Summary

It's possible to monitor several hundred IPTV streams without spending lots of money. Just do it.

I have the problem that DSLAMs segment the DSL network, while the Layer 2 cloud regards it as one contiguous network. One proposed solution is to shrink the Layer 2 cloud drastically. I am trying this out.

Starting point

local proxy arp1

The DSLAMs prevent direct communication between CPEs attached to the same DSLAM. A company therefore cannot communicate with its employees in the same village if they want to work from home.

The canonical solution using local-proxy-arp is for the router to forge the ARP replies, which re-enables the communication. Unfortunately this causes massive problems as soon as you want to introduce redundancy or multiple central devices into the network.

Shrinking the netmask

As I have learned, the industry-standard solution is to give the customer device (CPE) an extremely small netmask (/32). The device then believes it is alone on the network.

But how does it reach its gateway? That lies outside the assigned network, after all. What would be needed is the following:

# ip address add a.b.c.d/32 dev eth
# ip route add a.b.c.e/32 dev eth
# ip route add 0.0.0.0/0 via a.b.c.e

But how do you get a CPE to swallow something this sick? After all, the customer bought the router himself, and we have no influence over the box.

All we can do is send a suitable DHCP response. So let's try that (very carefully):

subclass "Dynamisch" 1:00:90:33:1f:19:1a {option subnet-mask 255.255.255.255;}
subclass "Dynamisch" 1:9c:c7:a6:ba:e0:1b {option subnet-mask 255.255.255.255;}

That should do it (pools etc. are configured elsewhere).

And now the test from the router:

cisco#ping 198.51.100.128

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 198.51.100.128, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/5/8 ms

cisco#ping  198.51.100.201

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 198.51.100.201, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 8/12/24 ms

cisco#sh arp | in 198.51.100.128|198.51.100.201
Internet  198.51.100.201          2   9cc7.a6ba.e01b  ARPA   Vlan140
Internet  198.51.100.128          2   0090.331f.191a  ARPA   Vlan140

Tada! That works. And tests on the device itself confirm that Internet access is functional, too.

Other servers

Now for the more complicated part. How do central devices that are not the gateway cope with this setup?

Let's try it:

linux# ip addr add 198.51.100.108/23 dev eth_slot
linux# ping -c2 198.51.100.201
PING 198.51.100.201 (198.51.100.201) 56(84) bytes of data.
64 bytes from 198.51.100.201: icmp_seq=1 ttl=64 time=7.01 ms
64 bytes from 198.51.100.201: icmp_seq=2 ttl=64 time=6.83 ms

--- 198.51.100.201 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1004ms
rtt min/avg/max/mdev = 6.834/6.922/7.011/0.121 ms 

And the corresponding tcpdump says:

13:36:57.930798 00:1b:21:2d:3a:bb > 9c:c7:a6:ba:e0:1b, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.108 > 198.51.100.201: icmp 64: echo request seq 1

13:36:57.937778 9c:c7:a6:ba:e0:1b > 00:1b:21:2d:3a:bb, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.201 > 198.51.100.108: icmp 64: echo reply seq 1

13:36:58.935366 00:1b:21:2d:3a:bb > 9c:c7:a6:ba:e0:1b, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.108 > 198.51.100.201: icmp 64: echo request seq 2

13:36:58.942188 9c:c7:a6:ba:e0:1b > 00:1b:21:2d:3a:bb, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.201 > 198.51.100.108: icmp 64: echo reply seq 2 

Back and forth it goes. The packets come directly from the CPE (a Fritzbox) and talk to the server.

Wait a minute! Why do the packets arrive directly? Shouldn't they go through the router?

Obviously the Fritzbox has learned that it can reach this device directly on its WAN leg and sends the packets directly as well.

More precisely: is that a problem?

If two CPEs are supposed to talk to each other but cannot (DSLAM isolation), then they cannot see and answer each other's ARP requests. So they will never try to communicate directly. Hence no problem.

Speaking of ARP: there were no ARP packets in the tcpdump. So once more, with a cleared ARP cache.

linux# arp -d 198.51.100.201
linux# ping  -c2 198.51.100.201
PING 198.51.100.201 (198.51.100.201) 56(84) bytes of data.
64 bytes from 198.51.100.201: icmp_seq=2 ttl=63 time=6.97 ms

--- 198.51.100.201 ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 999ms
rtt min/avg/max/mdev = 6.970/6.970/6.970/0.000 ms

And the corresponding dump:

13:46:38.389346 00:1b:21:2d:3a:bb > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42:
 arp who-has 198.51.100.201 tell 198.51.100.108

13:46:38.390063 ac:f2:c5:df:30:3f > 00:1b:21:2d:3a:bb, ethertype ARP (0x0806), length 60:
 arp reply 198.51.100.201 is-at 00:07:b4:00:8c:02

13:46:38.390073 00:1b:21:2d:3a:bb > 00:07:b4:00:8c:02, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.108 > 198.51.100.201: icmp 64: echo request seq 1

13:46:38.396544 9c:c7:a6:ba:e0:1b > 00:1b:21:2d:3a:bb, ethertype ARP (0x0806), length 60:
 arp reply 198.51.100.201 is-at 9c:c7:a6:ba:e0:1b

13:46:39.389251 00:1b:21:2d:3a:bb > 00:07:b4:00:8c:02, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.108 > 198.51.100.201: icmp 64: echo request seq 2

13:46:39.396208 ac:f2:c5:df:30:3f > 00:1b:21:2d:3a:bb, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.201 > 198.51.100.108: icmp 64: echo reply seq 2

The first packet went via local-proxy-arp and was not answered.

This could be because the Fritzbox already has an ARP entry, so it discards the incoming packet as spoofed. But this still needs clarification.

Innovaphone

The other test CPE is an Innovaphone.

linux# arp -d 198.51.100.128
linux# ping  -c2 198.51.100.128
PING 198.51.100.128 (198.51.100.128) 56(84) bytes of data.
64 bytes from 198.51.100.128: icmp_seq=2 ttl=127 time=5.86 ms

--- 198.51.100.128 ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 1000ms
rtt min/avg/max/mdev = 5.860/5.860/5.860/0.000 ms

That works, too! But again the first packet is missing. Is it the proxy ARP?

13:56:05.356955 00:1b:21:2d:3a:bb > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42:
 arp who-has 198.51.100.128 tell 198.51.100.108

13:56:05.359382 ac:f2:c5:df:30:3f > 00:1b:21:2d:3a:bb, ethertype ARP (0x0806), length 60:
 arp reply 198.51.100.128 is-at 00:07:b4:00:8c:02

13:56:05.359392 00:1b:21:2d:3a:bb > 00:07:b4:00:8c:02, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.108 > 198.51.100.128: icmp 64: echo request seq 1

13:56:05.362879 00:90:33:1f:19:1a > 00:1b:21:2d:3a:bb, ethertype ARP (0x0806), length 64:
 arp reply 198.51.100.128 is-at 00:90:33:1f:19:1a

13:56:06.356946 00:1b:21:2d:3a:bb > 00:07:b4:00:8c:02, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.108 > 198.51.100.128: icmp 64: echo request seq 2

13:56:06.362793 ac:f2:c5:df:30:3f > 00:1b:21:2d:3a:bb, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.128 > 198.51.100.108: icmp 64: echo reply seq 2 

This time the ping goes entirely through the central local-proxy-arp router.

And another ping test with an existing ARP entry:

linux# ping  -c2 198.51.100.128
PING 198.51.100.128 (198.51.100.128) 56(84) bytes of data.
64 bytes from 198.51.100.128: icmp_seq=1 ttl=127 time=6.02 ms
64 bytes from 198.51.100.128: icmp_seq=2 ttl=127 time=5.84 ms

--- 198.51.100.128 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1004ms
rtt min/avg/max/mdev = 5.841/5.930/6.020/0.118 ms

And the corresponding dump:

14:09:37.939110 00:1b:21:2d:3a:bb > 00:07:b4:00:8c:02, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.108 > 198.51.100.128: icmp 64: echo request seq 1

14:09:37.945105 ac:f2:c5:df:30:3f > 00:1b:21:2d:3a:bb, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.128 > 198.51.100.108: icmp 64: echo reply seq 1

14:09:38.943687 00:1b:21:2d:3a:bb > 00:07:b4:00:8c:02, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.108 > 198.51.100.128: icmp 64: echo request seq 2

14:09:38.949516 ac:f2:c5:df:30:3f > 00:1b:21:2d:3a:bb, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.128 > 198.51.100.108: icmp 64: echo reply seq 2 

The communication goes entirely through the router.

Wait a minute! Why does the Innovaphone not answer ARP? For a central server this means it has to rely on local-proxy-arp.

But that is exactly the feature that is supposed to be switched off! That is the whole point of the exercise.

Surprisingly, an attempt a few minutes later shows a different picture:

14:23:44.753899 00:1b:21:2d:3a:bb > 00:90:33:1f:19:1a, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.108 > 198.51.100.128: icmp 64: echo request seq 1

14:23:44.759817 ac:f2:c5:df:30:3f > 00:1b:21:2d:3a:bb, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.128 > 198.51.100.108: icmp 64: echo reply seq 1

14:23:45.757222 00:1b:21:2d:3a:bb > 00:90:33:1f:19:1a, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.108 > 198.51.100.128: icmp 64: echo request seq 2

14:23:45.762979 ac:f2:c5:df:30:3f > 00:1b:21:2d:3a:bb, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.128 > 198.51.100.108: icmp 64: echo reply seq 2

The packets are sent directly to the Innovaphone but answered via the router. That is completely fine!

But what happened there? I clear the ARP cache on the Linux box again and sniff along:

14:38:22.492139 00:1b:21:2d:3a:bb > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42:
 arp who-has 198.51.100.128 tell 198.51.100.108

14:38:22.498015 00:90:33:1f:19:1a > 00:1b:21:2d:3a:bb, ethertype ARP (0x0806), length 64:
 arp reply 198.51.100.128 is-at 00:90:33:1f:19:1a

14:38:22.498027 00:1b:21:2d:3a:bb > 00:90:33:1f:19:1a, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.108 > 198.51.100.128: icmp 64: echo request seq 1

14:38:22.503751 ac:f2:c5:df:30:3f > 00:1b:21:2d:3a:bb, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.128 > 198.51.100.108: icmp 64: echo reply seq 1

14:38:22.509747 ac:f2:c5:df:30:3f > 00:1b:21:2d:3a:bb, ethertype ARP (0x0806), length 60:
 arp reply 198.51.100.128 is-at 00:07:b4:00:8c:02

14:38:23.497404 00:1b:21:2d:3a:bb > 00:90:33:1f:19:1a, ethertype IPv4 (0x0800), length 98:
 IP 198.51.100.108 > 198.51.100.128: icmp 64: echo request seq 2

This time the Innovaphone seems to behave more cooperatively. So it passed the test as well.

Conclusion

The approach seems to work.

  • A Fritzbox as CPE behaves as expected. It serves other devices in the same Layer 2 network directly as soon as it has had ARP contact.
  • An Innovaphone as CPE behaves entirely correctly. The return path always takes the configured route via the router. In the test it acted up now and then, but there is no fundamental problem.

Both correctly use the router, which lies outside their logical network segment. Other central components are treated in the same way.

So nothing fundamental stands in the way of a gradual rollout.

I have already written quite a bit about DirectAccess, and of course tried it out as well. What I had never looked at, though, was DirectAccess without IPv6 on either the inner or the outer network, i.e. totally legacy. The training course gives me the opportunity, and I don't want to withhold the results.

IPv4 only Lab

After the DirectAccess client has vanished into the wide world and, in utter despair, phoned home, the picture looks like this:

C:\> ipconfig
Ethernet adapter Ethernet 2:
   Connection-specific DNS Suffix  . :
   Link-local IPv6 Address . . . . . : fe80::a19f:cc85:1320:8c4%5
   IPv4 Address. . . . . . . . . . . : 131.107.0.2
   Subnet Mask . . . . . . . . . . . : 255.255.0.0
   Default Gateway . . . . . . . . . :
Tunnel adapter iphttpsinterface:
   Connection-specific DNS Suffix  . :
   IPv6 Address. . . . . . . . . . . : 2002:836b:c8:1000:1934:30a2:9e84:cdf4
   Temporary IPv6 Address. . . . . . : 2002:836b:c8:1000:f037:3b8d:cabc:bfcb
   Link-local IPv6 Address . . . . . : fe80::1934:30a2:9e84:cdf4%24
   Default Gateway . . . . . . . . . :

So there is an interface on the Internet with public addresses but no IPv6 connectivity. And then there is the tunnel home. Since both the client and the DirectAccess server have public IPs, the tunnel was set up with 6to4 addresses.
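That these really are 6to4 addresses can be verified directly: per RFC 3056, a 6to4 address embeds the public IPv4 address in the 32 bits immediately after the 2002::/16 prefix. A small sketch (illustrative Python):

```python
import ipaddress

def ipv4_of_6to4(addr6):
    """Recover the public IPv4 address embedded in a 6to4 address.

    2002:AABB:CCDD::/48 encodes the IPv4 address AA.BB.CC.DD (RFC 3056).
    """
    a = ipaddress.IPv6Address(addr6)
    if a not in ipaddress.IPv6Network("2002::/16"):
        raise ValueError("not a 6to4 address")
    # bits 80..111 of the 128-bit address hold the IPv4 address
    return ipaddress.IPv4Address((int(a) >> 80) & 0xFFFFFFFF)
```

Applied to the tunnel address from the lab, `2002:836b:c8:...` decodes to 131.107.0.200, the public IPv4 address of the DirectAccess server.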

The training material says: Notice the IP address for Tunnel Adapter is IPHTTPSInterface starting with 2002. This is an IP-HTTPS address. (errors reproduced as in the original)

No, Microsoft! It is a 6to4 address. Apparently you really have no practical experience with IPv6.

The corresponding routing table shows:

c:\> route print -6
===========================================================================
Interface List
  5...00 15 5d 64 4d 4d ......Microsoft Hyper-V Network Adapter #2
  1...........................Software Loopback Interface 1
 24...00 00 00 00 00 00 00 e0 iphttpsinterface
===========================================================================
IPv6 Route Table
===========================================================================
Active Routes:
 If Metric Network Destination      Gateway
  1    331 ::1/128                  On-link
 24   4171 2002::/16                fe80::d92f:34d4:3add:e715
 24    331 2002:836b:c8::/48        fe80::d92f:34d4:3add:e715
 24    331 2002:836b:c8::/64        fe80::d92f:34d4:3add:e715
 24    331 2002:836b:c8:1::/64      fe80::d92f:34d4:3add:e715
 24    331 2002:836b:c8:5::/64      fe80::d92f:34d4:3add:e715
 24    331 2002:836b:c8:1000::/64   On-link
 24    331 2002:836b:c8:1000:1934:30a2:9e84:cdf4/128
                                    On-link
 24    331 2002:836b:c8:1000:f037:3b8d:cabc:bfcb/128
                                    On-link
 24    331 fd68:d6bf:56b6:7777::/96 fe80::d92f:34d4:3add:e715
  5    271 fe80::/64                On-link
 24    331 fe80::/64                On-link
 24    331 fe80::1934:30a2:9e84:cdf4/128
                                    On-link
  5    271 fe80::a19f:cc85:1320:8c4/128
                                    On-link
  1    331 ff00::/8                 On-link
  5    271 ff00::/8                 On-link
 24    331 ff00::/8                 On-link
===========================================================================
Persistent Routes:
  None

The routing target is a link-local address, as is proper for IPv6. Excellent! And the address is reachable through the tunnel, too.

C:\>netsh int ipv6 sho nei 24
Internet Address                              Physical Address   Type
--------------------------------------------  -----------------  -----------
2002:836b:c8:1000:d92f:34d4:3add:e715                            Reachable (Router)
fe80::d92f:34d4:3add:e715                                        Reachable (Router)

Playing around on the network with browser and Explorer produces open connections:

C:\> netstat -n
Active Connections
  Proto  Local Address          Foreign Address        State
  TCP    131.107.0.2:49782      131.107.0.200:443      ESTABLISHED
  TCP    [2002:836b:c8:1000:f037:3b8d:cabc:bfcb]:49785  [fd68:d6bf:56b6:7777::ac10:c8]:80  ESTABLISHED
  TCP    [2002:836b:c8:1000:f037:3b8d:cabc:bfcb]:61893  [fd68:d6bf:56b6:7777::ac10:b]:80  ESTABLISHED

So the connections apparently run over IPv6 to a ULA destination (yuck!).

The DirectAccess gateway additionally performs NAT64, i.e. it embeds the IPv4 addresses of the LAN devices into IPv6. This is clearly visible.
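The NAT64 embedding is easy to check by hand: the last 32 bits of the ULA address carry the IPv4 address of the LAN device. A small sketch (illustrative Python; the fd68:… /96 prefix is the one seen in the lab):

```python
import ipaddress

def embedded_ipv4(nat64_addr, prefix="fd68:d6bf:56b6:7777::/96"):
    """Extract the IPv4 address embedded in the low 32 bits of a NAT64 address."""
    addr = ipaddress.IPv6Address(nat64_addr)
    if addr not in ipaddress.IPv6Network(prefix):
        raise ValueError("not inside the NAT64 prefix")
    return ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF)
```

So `fd68:d6bf:56b6:7777::ac10:b` decodes to 172.16.0.11, which matches the A record of lon-svr1 shown further below.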

For that, the gateway obviously has to perform DNS64 as well, rewriting the DNS responses. That is easy to see:

c:\>ipconfig /displaydns
Windows IP Configuration
    directaccess-webprobehost.adatum.com
    ----------------------------------------
    Record Name . . . . . : directaccess-WebProbeHost.Adatum.com
    Record Type . . . . . : 28
    Time To Live  . . . . : 213
    Data Length . . . . . : 16
    Section . . . . . . . : Answer
    AAAA Record . . . . . : fd68:d6bf:56b6:7777::ac10:c8

But why does the client do that? It has a completely different DNS server, doesn't it?

C:\>netsh name show eff
DNS Effective Name Resolution Policy Table Settings

Settings for .Adatum.com
----------------------------------------------------------------------
DirectAccess (Certification Authority)  :
DirectAccess (IPsec)                    : disabled
DirectAccess (DNS Servers)              : 2002:836b:c8:3333::1
DirectAccess (Proxy Settings)           : Bypass Proxy

Settings for DirectAccess-NLS.Adatum.com
----------------------------------------------------------------------
DirectAccess (Certification Authority)  :
DirectAccess (IPsec)                    : disabled
DirectAccess (DNS Servers)              :
DirectAccess (Proxy Settings)           : Use default browser settings

So the client has a name-resolution policy that queries a different DNS server for certain domains. Exactly: the gateway.

C:\> ping lon-svr1.adatum.com
Pinging lon-svr1.adatum.com [fd68:d6bf:56b6:7777::ac10:b] with 32 bytes of data:
Reply from fd68:d6bf:56b6:7777::ac10:b: time=4ms
Reply from fd68:d6bf:56b6:7777::ac10:b: time=10ms
Reply from fd68:d6bf:56b6:7777::ac10:b: time=1ms
Ping statistics for fd68:d6bf:56b6:7777::ac10:b:
    Packets: Sent = 3, Received = 3, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 1ms, Maximum = 10ms, Average = 5ms

However, this only works if the application honours these policy rules:

C:\> nslookup lon-svr1.adatum.com
Server:  UnKnown
Address:  131.107.0.100

DNS request timed out.
    timeout was 2 seconds.
*** Request to UnKnown timed-out

If you ask the right server, you get quite different answers:

C:\> nslookup lon-svr1.adatum.com 2002:836b:c8:3333::1
Server:  UnKnown
Address:  2002:836b:c8:3333::1

Non-authoritative answer:
Name:    lon-svr1.adatum.com
Addresses:  fd68:d6bf:56b6:7777::ac10:b
          172.16.0.11

All of this remained hidden from me all those years because I had rolled out IPv6.

I am not missing anything.

What IPv6 changes

If native IPv6 is used on the outside interface, the 2002: 6to4 addresses disappear immediately. A few remarks on this:

  • That the communication worked in the lab is due to Microsoft operating the 2002:836b:c8:1000::/64 on the interface.
  • To communicate with 6to4 addresses, one has to rely on external 6to4 gateways. This comes into play because the DNS server 2002:836b:c8:3333::1 is queried.
  • Since both hosts completely bypass external IPv6 routing via the IPv4 HTTPS tunnel, they do not need any 6to4 routers on the Internet.
  • 6to4 is used at all only because both servers have public external IPv4 addresses. A device behind NAT will fall back to Teredo.
  • Teredo does not work with all NAT types; in particular there are problems when the public NAT IP is not stable (e.g. with Carrier Grade NAT or Large Scale NAT).
  • With RFC 6540, IPv6 on the outside interfaces is mandatory for ISPs.

If IPv6 is used on the LAN, the fc00::/7 ULA network disappears immediately. Instead, the clients get the same IP they would have on the LAN. This has the following implications:

  • The client is reachable under the same IP internally and externally.
  • Remote maintenance within the company thus becomes independent of the client's location.
  • Since servers and clients use IPv6, DNS64 and NAT64 are no longer needed.
  • Applications can open connections on all required ports in all required directions. There is no need for per-protocol NAT helpers in the DirectAccess server.
  • This way protocols such as telephony, video conferencing, FTP, etc. simply work externally as well.
  • Third-party applications that do their own name resolution work without problems, because only the routing to the DNS server changes. It is no longer necessary to query different DNS servers depending on the domain.

So you should really think hard about whether you want to keep doing without IPv6.

packet-loss-lacp-asymmectric-praxis

packet-loss-bandwidthlimit

packet-loss-netzplan

packet-loss-lacp-asymmectric-theorie

A customer complained about network problems. An SMB/CIFS mount between two of his servers would always break down. Not that there were any noticeable problems, but there were still error messages in the log.

Setup

The customer has a certain number of servers in his Layer2 segment. One of them is a Windows file server for a group of Windows computers. Another is the Linux monitoring system, which regularly tries to access the share.

The whole thing has a longer history, which half a year ago led to the recommendation to abandon SMB1. That part is not relevant here; what matters is what followed the recommendation.
Since the transition, the monitoring Linux box has shown strange entries in the kernel log:

Feb 12 19:07:02 kernel: CIFS VFS: Server a.b.c.d has not responded in 120 seconds. Reconnecting...

Where do these 120 seconds come from? The man page helps:

echo_interval=n

sets the interval at which echo requests are sent to the server on an idling
connection. This setting also affects the time required for a connection to
an unresponsive server to timeout. Here n is the echo interval in seconds.
The reconnection happens at twice the value of the echo_interval set for
an unresponsive server. If this option is not given then the default value of
60 seconds is used.

That explains the 120 seconds: an SMB echo is sent every minute, and if two echoes in a row go unanswered, the connection is considered dead. Therefore the customer (understandably) believes there must be packet loss in the network.

Disproving this assumption took weeks. For this purpose, the traffic was captured on both sides and compared packet by packet.

Unfortunately, this still doesn't solve the real problem.

Analysis

Such a recording is examined again and again in order to spot any unusual phenomenon.

smb-tcp-flow

It's nice to see the keep-alive request/response game play out every minute. Then data is sent.

The communication comes to a sudden halt, and the time-out stated in the log occurs. Between 19:05:01.99... and 19:07:02.25... lie a little more than 120 seconds. So far, that fits.

The transmission error (the one with the TCP keep-alive) is remarkable because it hits the server immediately before the connection is terminated. This must be investigated!

Why are the lines black? Because the keep-alive retransmits exactly one byte that was acknowledged long ago. That makes for a very unusual conversation, and Wireshark colours such sequence errors black.

  • At 19:05:01.949511 128 bytes are received.
  • At 19:05:01.997203 the reception of these 128 bytes is confirmed.
  • At 19:07:01.999481 the last of the 128 bytes is transmitted again (together with the keep-alive flag).

If the acknowledgement of the data reception (ACK) had not arrived, the entire 128 bytes would be sent again. But they are not.

Did a firewall tamper with the sequence numbers along the way? There is no firewall in between; both servers are in the same LAN and see each other directly. Moreover, the other side confirms that the acknowledgement arrived completely.

So why would the kernel resend a single byte? It turns out that the Windows implementation of TCP keepalive retransmits exactly one byte of already-acknowledged data, while the Linux implementation sends the keepalive without payload. This is apparently a Windows workaround for broken middleboxes that drop TCP packets without payload.

So that part is fine. But what is it then?

Maybe there are some delays somewhere? So let's take a look at the round-trip times. Maybe there are outliers.

smb-tcp-rtt

There are no particular irregularities with the remote location. Everything looks nice.

smb-tcp-rtt2

Even the local processing is completely inconspicuous: Everything that the kernel can process directly is done in almost no time. If the application software is involved, it takes longer. One can see very clearly how regular the measurements are.

Nothing is striking here either. But what is it then?

A closer look reveals that an echo request should have been sent at 19:06:00. But that is missing!

Kernel Archaeology

The function cifs_echo_request is responsible for sending the echo requests in the Linux kernel. There it says:

static void
cifs_echo_request(struct work_struct *work)
{
        int rc;
        struct TCP_Server_Info *server = container_of(work,
                                        struct TCP_Server_Info, echo.work);
        unsigned long echo_interval;

        /*
         * If we need to renegotiate, set echo interval to zero to
         * immediately call echo service where we can renegotiate.
         */
        if (server->tcpStatus == CifsNeedNegotiate)
                echo_interval = 0;
        else
                echo_interval = server->echo_interval;

        /*
         * We cannot send an echo if it is disabled.
         * Also, no need to ping if we got a response recently.
         */
        if (server->tcpStatus == CifsNeedReconnect ||
            server->tcpStatus == CifsExiting ||
            server->tcpStatus == CifsNew ||
            (server->ops->can_echo && !server->ops->can_echo(server)) ||
            time_before(jiffies, server->lstrp + echo_interval - HZ))
                goto requeue_echo;

        rc = server->ops->echo ? server->ops->echo(server) : -ENOSYS;
        if (rc)
                cifs_dbg(FYI, "Unable to send echo request to server: %s\n",
                         server->hostname);

requeue_echo:
        queue_delayed_work(cifsiod_wq, &server->echo, server->echo_interval);
}

The code is interesting in several ways:

  • On the one hand, an echo is sent only if there is a real need to do so.
  • On the other hand, this is daisy-chain scheduling: only after the work has been done is the next run planned.

As far as the need is concerned, no echo is sent as long as the last response arrived less than one echo interval ago.

But what counts as a response? In the code, server->lstrp is set to the current time whenever a packet is received, regardless of whether it is an echo response or normal data.

Subtracting HZ (one second) is a countermeasure to prevent the echo response to one's own request from being interpreted as interesting traffic. This hard-coded value thus assumes that the answer from the remote station always arrives within one second.

If such an SMB connection runs over a longer distance or against a slow server, every second echo request is suppressed when no further traffic occurs. This effectively means that SMB only works reliably in the LAN.

Due to the protocol architecture, all requests are processed sequentially. If an echo request is sent while another request is running on the server, the echo request will not be processed until the previous request has been completed. Asynchronous processing was abandoned with the transition from SMB1 to SMB2. Now the server should tell the client that a more complex request is still being processed.

In this case an answer from the server was received at 19:05:01.91... The echo packet would have been due around 19:06:00.90, just under 59 seconds (echo_interval minus HZ) later, so it was suppressed by a margin of barely ten milliseconds. A very precise miss!

The second problem is the scheduling: only when all the work has been done (sending the echo packet) is the next echo scheduled for 60 seconds later. The processing time is missing from this plan, so the echo packets are not sent every 60 seconds, but always a bit later.

The recording shows this very clearly: 19:01:55.05, 19:02:56.49, 19:03:57.94, 19:04:59.37, 19:06:00.90, 19:07:02.27. The intervals are about 61.5 seconds. So 1.5 seconds more than planned. In the last step the interval is shorter, because the echo packet was not sent.

Race Condition

What really happens is quickly explained: the suppression of the planned echo request is faulty.

In detail:

smb-tcp-timing
  • The daisy-chain scheduling creates a gap.
  • If traffic happens to arrive shortly after the last echo, the planned echo is suppressed.
  • Due to the gap, the next echo transmission is delayed until after the time-out, which is hard-coded at twice the echo interval.

It is particularly charming to realize that the final echo request is sent first and only then does the time-out strike: the first action when receiving the echo response is to check for a time-out and terminate the connection. Ironically, the successful echo triggers the abort.

Now the question arises when the problem was introduced and which change is responsible:

  • Patch c740 makes the client permanently send echo requests every 60 seconds when no other traffic occurs. (Jan 11, 2011)
  • Patch fda3 uses these echoes to trigger a time-out after five times the interval (configurable) without traffic. (Jan 20, 2011)
  • Patch 6dae switches from a fixed 60 s times a configurable factor of 5 to a configurable interval (60 s) times a fixed factor of two. (Feb 21, 2012)

The gap and the suppression were introduced on 11 January 2011. However, this had no effect at the time, since the time-out only struck at five times the interval.

With the change of 21 February 2012, the gap takes effect, because the time-out is now hard-coded at twice the interval.

Solutions

There are three options.

First, eliminate the gap by scheduling at fixed times (always adding 60 seconds to the previous scheduled time). Incidentally, it is not enough to move the rescheduling to the beginning of the routine: that only shortens the gap, it does not remove it.

Another possibility is to send the echo requests in any case, regardless of other traffic. If traffic is already flowing, the additional request will not do any harm. There is, however, a potential risk that a server chokes on an echo squeezed in between other requests.

Furthermore, the whole problem can be avoided by waiting at least three times the interval length before declaring the connection dead.

If you want a quick fix, you should choose the third approach: Change a 2 to a 3 in the code:

static bool
server_unresponsive(struct TCP_Server_Info *server)
{
        /*
         * We need to wait 2 echo intervals to make sure we handle such
         * situations right:
         * 1s  client sends a normal SMB request
         * 2s  client gets a response
         * 30s echo workqueue job pops, and decides we got a response recently
         *     and don't need to send another
         * ...
         * 65s kernel_recvmsg times out, and we see that we haven't gotten
         *     a response in >60s.
         */
        if ((server->tcpStatus == CifsGood ||
            server->tcpStatus == CifsNeedNegotiate) &&
            time_after(jiffies, server->lstrp + 2 * server->echo_interval)) {
                cifs_dbg(VFS, "Server %s has not responded in %lu seconds. Reconnecting...\n",
                         server->hostname, (2 * server->echo_interval) / HZ);
                cifs_reconnect(server);
                wake_up(&server->response_q);
                return true;
        }

        return false;
}

All that remains is the question of the TCP keepalive. Windows waits exactly 120 s before sending a keepalive on an idle TCP session. Note that the Windows server sends its TCP keepalive first, because it transmitted the last packet, so its idle timer started even before that packet arrived at the Linux side.

Acknowledgement

The entire analysis was done by my colleague Jens, who simply asked me to write it all up. That is what I did here.

The only thing left to do now is to send a bug report to the right place.