TCP Chorusing

TCP Chorusing in the Windows 9x TCP/IP Stack

A Preliminary Analysis

By Dan Kaminsky

Abstract

02-Feb-1999

Microsoft Windows 95 and 98 clients can bind multiple TCP/IP stacks to the same MAC address, simply by having the protocol added more than once in the Network control panel. This is actually quite useful, except that these stacks can run concurrently on the same IP, even if they receive their IP through BOOTP/DHCP. The effect of the bug is that the number of ACKnowledgement packets sent equals the number of loaded and bound TCP/IP stacks, creating excessive and significant network noise and collisions. At least one Samba 2.0.0beta1 server on an affected subnet can become completely inaccessible when one of these machines starts misbehaving.

Redundant ACKing can be referred to as TCP Chorusing, due to the minor time delays introduced between multiple copies of identical data. The problem is undetectable using the Ping command built into Windows 95 or 98--this is a significant bug in and of itself. Linux's ping is not similarly crippled. NT was not available for testing.

Introduction: Discovery

A word of warning: This is the first edition of this document and there are bound to be errors. My ego isn't so fragile as to be bothered if I made a misstatement of fact when writing this. Just tell me.

My university possesses a generally excellent network, but on occasion certain dorms would grind to a halt for no apparent reason. Seeking answers, I used a Windows-based pinger to see if there were correlations between network downtimes and the presence of specific IPs on a specific subnet. We use essentially static IPs distributed from a DHCP server--a cookie seems to be assigned to a given MAC address on first request for an IP, and all future IPs are given out based on that cookie. Nothing out of the ordinary was found using the Windows pingers, so I decided I'd automate the testing process over time using an excellent Linux tool entitled fping. (In another environment, I might have merely shoved up a sniffer, but the secure hubs and my lack of permission to modify them in any way prevented that possibility.)

Very quickly, I noticed some very strange entries in the fping logs (IPs changed):

    10.0.9.42 : duplicate for [3], 84 bytes, 3.28 ms
    10.0.9.73 : duplicate for [3], 84 bytes, 3.59 ms
    10.0.10.33 : duplicate for [3], 84 bytes, 3.51 ms
    10.0.10.99 : duplicate for [3], 84 bytes, 3.81 ms

I thought there might be a bug in fping, so I pinged the offending machines from Windows 98:

    C:\WINDOWS>ping -f 10.0.9.42

    Pinging 10.0.9.42 with 32 bytes of data:

    Reply from 10.0.9.42: bytes=32 time=4ms TTL=126
    Reply from 10.0.9.42: bytes=32 time=3ms TTL=126
    Reply from 10.0.9.42: bytes=32 time=4ms TTL=126
    Reply from 10.0.9.42: bytes=32 time=3ms TTL=126

    Ping statistics for 10.0.9.42:
        Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
    Approximate round trip times in milli-seconds:
        Minimum = 3ms, Maximum = 4ms, Average = 3ms

Confusingly, everything seemed normal from here. Then I tried the Linux ping command.

    effugas@doxpara:~> ping 10.0.9.42
    PING 10.0.9.42 (10.0.9.42): 56 data bytes
    64 bytes from 10.0.9.42: icmp_seq=0 ttl=127 time=3.5 ms
    64 bytes from 10.0.9.42: icmp_seq=0 ttl=127 time=14.7 ms (DUP!)
    64 bytes from 10.0.9.42: icmp_seq=1 ttl=127 time=6.2 ms
    64 bytes from 10.0.9.42: icmp_seq=1 ttl=127 time=7.5 ms (DUP!)
    64 bytes from 10.0.9.42: icmp_seq=2 ttl=127 time=3.3 ms
    64 bytes from 10.0.9.42: icmp_seq=2 ttl=127 time=3.8 ms (DUP!)
    64 bytes from 10.0.9.42: icmp_seq=3 ttl=127 time=15.0 ms
    64 bytes from 10.0.9.42: icmp_seq=3 ttl=127 time=15.4 ms (DUP!)

    --- 10.0.9.42 ping statistics ---
    4 packets transmitted, 4 packets received, +4 duplicates, 0% packet loss
    round-trip min/avg/max = 3.3/8.6/15.4 ms

This was disturbing, especially since there was a very high correlation between the number of TCP-Chorusing machines on a subnet and that subnet's collision rate and slowness. What was causing this? The first step was to hunt down the machines exhibiting the bug and do a little exploratory surgery. Once I got access to a few of the affected machines, it didn't take much deduction to realize that there were the same number of extra TCP/IP stacks bound to the main adapter as there were extra pings. Still, just because there were machines with extra stacks didn't mean it wasn't the NIC's fault--a bug in the NIC installer could have created the extra TCP/IP entries. And what about wiring? All of these machines were exhibiting these reactions on a rather non-standard "secure hub". Perhaps that was the cause of the stacks reacting so strangely?

Further investigation did shed some light. The automated installation routines are suspect, since they're the routines that most commonly add the stacks. Any card, though, from generic Linksys boards to an Intel 8255x 10/100 board to the entire bevy of PCI and PCMCIA cards that 3Com offers, will exhibit this behavior once additional TCP/IP stacks are added to it. While I have only seen 3Com cards suffering from unintentional TCP/IP Chorusing, this is probably because of the 90%+ market share 3Com enjoys on campus and not because of a flaw in their drivers. It's quite likely that, since students and not staff install network drivers on campus, this is more of a wetware problem--the student does whatever he or she can to "just make it work like the directions say", and if adding TCP/IP multiple times happens to "Just Work", so be it.

Before I could be sure that this was the problem, though, I needed to isolate a computer from the University network. I used my dorm room's internal 100BaseT network to do so. The following tcpdump shows a single character typed from the chorusing machine into the telnet port of the Linux machine:

    11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: P 6:7(1) ack 171 win 7756 (DF)
        [Initial Keypress]
    11:31:02.390000 10.0.6.194.telnet > 10.0.6.195.1043: P 171:172(1) ack 7 win 16352 (DF)
        [Pressed key is echoed from the Linux machine to be displayed on the Windows box.]
    11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: . ack 172 win 7755 (DF)
        [Windows machine acknowledges receipt of data signifying what character it should display.]
    11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: . ack 172 win 7755 (DF)
        [Windows machine again acknowledges receipt. This is the "chorus".]
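Spotting the chorus by eye gets tedious in a longer capture. The following is a minimal Python sketch, not a tested tool, that flags pure ACKs appearing more than once with the same source, destination, and ack number. The function name find_repeated_acks and the regular expression are my own assumptions, fitted only to the tcpdump line layout shown above; other tcpdump versions print differently. Keep in mind that TCP can legitimately repeat an ACK (after packet loss, for example), so a hit is a hint rather than proof.

```python
import re
from collections import Counter

# Assumed layout: "<time> <src> > <dst>: <flags> <rest>", as in the capture above.
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<src>\S+) > (?P<dst>[^:]+): (?P<flags>\S+) (?P<rest>.*)$"
)

def find_repeated_acks(lines):
    """Return {(src, dst, ack_number): count} for pure ACKs seen more than once."""
    seen = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("flags") != ".":
            continue  # skip unparsable lines and segments that carry flags or data
        ack = re.search(r"\back (\d+)", m.group("rest"))
        if ack:
            seen[(m.group("src"), m.group("dst"), int(ack.group(1)))] += 1
    return {key: n for key, n in seen.items() if n > 1}

sample = [
    "11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: P 6:7(1) ack 171 win 7756 (DF)",
    "11:31:02.390000 10.0.6.194.telnet > 10.0.6.195.1043: P 171:172(1) ack 7 win 16352 (DF)",
    "11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: . ack 172 win 7755 (DF)",
    "11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: . ack 172 win 7755 (DF)",
]
print(find_repeated_acks(sample))
```

On the four packets above, only the Windows machine's "ack 172" is reported, with a count of two.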

There's most probably no limit to the number of extra ACKs--if I had ten TCP/IP stacks, I'd have nine duplicate packets, as far as I can tell.
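If that one-reply-per-stack relationship holds, the number of active stacks on a host can be guessed from Linux ping output by counting replies per icmp_seq. The sketch below is a rough illustration under that assumption; estimate_stack_count is a hypothetical helper of mine, and it takes the largest count seen rather than the average so that an occasional missing duplicate does not lower the guess. Duplicate replies can have other causes as well, so treat the result as an estimate.

```python
import re
from collections import Counter

def estimate_stack_count(ping_output):
    """Estimate active stacks as the largest number of replies to a single echo request."""
    replies = Counter(int(seq) for seq in re.findall(r"icmp_seq=(\d+)", ping_output))
    return max(replies.values(), default=0)

# First four reply lines from the Linux ping run shown earlier.
sample = """\
64 bytes from 10.0.9.42: icmp_seq=0 ttl=127 time=3.5 ms
64 bytes from 10.0.9.42: icmp_seq=0 ttl=127 time=14.7 ms (DUP!)
64 bytes from 10.0.9.42: icmp_seq=1 ttl=127 time=6.2 ms
64 bytes from 10.0.9.42: icmp_seq=1 ttl=127 time=7.5 ms (DUP!)
"""
print(estimate_stack_count(sample))  # 2
```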

A final note--I have thus far been able to locate the bug in Windows 98 and Windows 95 OSR2. The original version of Windows 95 was simply unavailable for testing, but I would appreciate an email verifying whether the bug dates back that far.

Impact of Bug

The impact of TCP Chorusing, barring significant installer bugs (possible), is directly related to the extent to which non-technical users have installed network hardware and software. With millions of computers hooked up to college dormitories, it would be quite myopic to dismiss this population as minimal. Still, the relatively small percentage of affected machines on campus here (maybe 1%) suggests the problem isn't too prevalent.

It should be noted, though, that TCP Chorusing can wreak havoc. It only takes one, possibly two TCP Chorusers on the same subnet as my Samba 2.0.0beta1 server to render it inaccessible. Logs appear to show that the Windows machines time out attempting to connect to the machine, but the problem is quite difficult to debug due to the bug's moderately uncommon nature--the server only occasionally becomes disabled by the Chorusers. Attempts to connect to other 9x machines on the affected subnet still remain successful, however. There were no NT machines available to test with, but they'd probably remain intact as long as the overall network remained usable.

However, four TCP Chorusers on a moderately active network will render it unusable. My theory, completely unsubstantiated by observation (I do not possess local sniff capability), is that since additional packets are being sent out by the interface--one per stack per packet to acknowledge--and since these additional packets are sent out nearly simultaneously, they're much more likely to cause collisions than randomly distributed packets. Ethernet handles collisions with a random backoff counted in 512-bit slot times; one slot lasts 51.2 microseconds on a 10 Mbit network, and the backoff range grows with each successive collision. Since the ACK packets are generated simultaneously and pretty much must be delivered (lest a new packet come to replace it), they stream themselves over the line as soon as an incoming packet comes in, possibly colliding with packets from other nodes, possibly even colliding with other incoming packets from the same host.

Empirical evidence shows that a network becomes nearly unusable with four TCP Chorusers on it. Assuming 1.5 duplicate packets per node, each node sends 2.5 ACKs per received packet, so that's 10 ACKs for each packet being sent down the line if all four of them are simultaneously attempting to acknowledge received data--and, yes, these ACKs can collide with each other, leading to a reverberating feedback effect. That significantly overloads the backoff system, and the ether becomes unusable. That's my theory, for now.
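The arithmetic behind that estimate is simple enough to spell out. The Python sketch below only restates the assumptions in the theory above (every choruser ACKs the same packet at once, with an average of 1.5 duplicates each); the function names are mine, and nothing here models real collision behavior. For reference it also computes the length of one 512-bit slot at a given line rate.

```python
def acks_per_packet(chorusers, duplicates_per_node):
    """ACKs emitted when every choruser acknowledges one received packet."""
    # Each node sends its one legitimate ACK plus its duplicates.
    return chorusers * (1 + duplicates_per_node)

def slot_time_us(bit_rate_bps, slot_bits=512):
    """Duration of one backoff slot in microseconds."""
    return slot_bits * 1e6 / bit_rate_bps

print(acks_per_packet(4, 1.5))    # 10.0
print(slot_time_us(10_000_000))   # 51.2
```

Even a short burst of ten near-simultaneous ACKs is large next to a 51.2 microsecond slot, which is the intuition behind the collision theory.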

It's unknown at this time whether or not this bug affects modem users. If it does, it's an extremely significant bug, considering the reduced bandwidth of SLIP/PPP.

Extra TCP/IP stacks listed in the registry but not active under Network Neighborhood do not appear to generate duplicate packets. Within the observed sample set at Santa Clara University, every machine sending out duplicate packets turned out to have multiple active TCP/IP stacks.

One final note--when scanning for affected machines, be sure to deliver at least two or three pings to each client. The bug is somewhat intermittent.

Solutions

Administrators should download and run fping to scan their networks for machines that respond multiple times to pings. It doesn't take much more than removing the excess stacks and rebooting to make a chorusing machine normal again. From the router/hub side, there's not much that can be done since the right MAC is asking for the right IP--multiple times, yes, but still valid. Some extra software on the router end might be able to help. But to really fix this, MS needs to change the behavior of TCP/IP stacks.
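Once fping logs have been collected, picking out the offenders can be automated. Here is a minimal Python sketch that counts "duplicate for" lines per host; it assumes the log lines look exactly like the fping entries quoted earlier in this document, and hosts_with_duplicates is a name I made up. It deliberately does not run fping itself, since the right options depend on the fping version installed.

```python
import re
from collections import Counter

# Assumed layout: "<ip> : duplicate for [<n>], ..." as in the fping log above.
DUP_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s*:\s*duplicate for \[\d+\]")

def hosts_with_duplicates(log_lines):
    """Count 'duplicate' log lines per host; hosts with a count above zero are suspects."""
    counts = Counter()
    for line in log_lines:
        m = DUP_RE.match(line.strip())
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    "10.0.9.42 : duplicate for [3], 84 bytes, 3.28 ms",
    "10.0.9.73 : duplicate for [3], 84 bytes, 3.59 ms",
    "10.0.10.33 : duplicate for [3], 84 bytes, 3.51 ms",
    "10.0.10.99 : duplicate for [3], 84 bytes, 3.81 ms",
]
for host, n in sorted(hosts_with_duplicates(sample).items()):
    print(host, n)
```

Since the bug is somewhat intermittent, a host that shows up even once across several runs is worth a visit.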

Microsoft should not impose a one-TCP/IP-stack-per-device limit--this would reduce much of the effectiveness of Windows as a TCP/IP client. Rather, a simple check should be instituted so that no two TCP/IP stacks will attempt to provide services to the same IP.
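To make the proposed check concrete, here is a toy Python model of it. This is purely illustrative: AddressRegistry, claim, and release are hypothetical names, and nothing here reflects how the Windows networking code is actually organized. The idea is only that a stack must claim an IP before serving it, and a second stack asking for the same IP is refused while still being free to serve a different one.

```python
class AddressRegistry:
    """Toy model: at most one stack instance may serve a given IP address."""

    def __init__(self):
        self._owner = {}  # ip -> id of the stack currently serving it

    def claim(self, ip, stack_id):
        """Return True if stack_id may serve ip, False if another stack already does."""
        owner = self._owner.setdefault(ip, stack_id)
        return owner == stack_id

    def release(self, ip, stack_id):
        """Give up ip, but only if stack_id is the one holding it."""
        if self._owner.get(ip) == stack_id:
            del self._owner[ip]

registry = AddressRegistry()
print(registry.claim("10.0.9.42", "stack-0"))  # True
print(registry.claim("10.0.9.42", "stack-1"))  # False
print(registry.claim("10.0.9.43", "stack-1"))  # True
```

Multiple stacks per adapter remain possible under this scheme; only the duplicate service of one IP is ruled out.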