A word of warning: This is the first edition of this document and there are bound to be errors. My ego isn't so fragile as to be bothered if I made a misstatement of fact when writing this. Just tell me.
My university possesses a generally excellent network, but on occasion certain dorms would grind to a halt for no apparent reason. Seeking answers, I used a Windows-platform pinger to see if there were correlations between network downtimes and the presence of specific IPs on a specific subnet. We use essentially static IPs distributed from a DHCP server--a cookie seems to be assigned to a given MAC address on its first request for an IP, and every later request from that address receives the same IP. Nothing out of the ordinary was found using the Windows pingers, so I decided I'd automate the testing process over time using an excellent Linux tool called fping. (In another environment, I might have merely shoved up a sniffer, but the secure hubs and my lack of permission to modify them in any way prevented that possibility.)
Very quickly, I noticed some very strange entries in the fping logs (IPs changed):
10.0.9.42 : duplicate for [3], 84 bytes, 3.28 ms
10.0.9.73 : duplicate for [3], 84 bytes, 3.59 ms
10.0.10.33 : duplicate for [3], 84 bytes, 3.51 ms
10.0.10.99 : duplicate for [3], 84 bytes, 3.81 ms
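Entries like these are easy to tally per host when the logs run over a long period. The following is a minimal Python sketch, not the tooling I actually used; it assumes the log lines look exactly like the ones above, and the names `DUP_LINE`, `count_fping_duplicates`, and `sample_log` are made up for illustration.

```python
import re
from collections import Counter

# Matches log lines of the form shown above, for example:
#   10.0.9.42 : duplicate for [3], 84 bytes, 3.28 ms
# (Assumption: other fping versions may format this line differently.)
DUP_LINE = re.compile(r"^\s*(\d+\.\d+\.\d+\.\d+)\s*:\s*duplicate for \[(\d+)\]")


def count_fping_duplicates(log_text):
    """Return a Counter mapping host IP -> number of duplicate replies logged."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DUP_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample input, copied from the log excerpt above.
sample_log = """\
10.0.9.42 : duplicate for [3], 84 bytes, 3.28 ms
10.0.9.73 : duplicate for [3], 84 bytes, 3.59 ms
10.0.10.33 : duplicate for [3], 84 bytes, 3.51 ms
10.0.10.99 : duplicate for [3], 84 bytes, 3.81 ms
"""

if __name__ == "__main__":
    for host, dups in sorted(count_fping_duplicates(sample_log).items()):
        print(host, dups)
```

Hosts that keep turning up in this tally across many runs are the ones worth visiting in person.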
I thought there might be a bug in fping, so I pinged the offending machines from Windows 98:
C:\WINDOWS>ping -f 10.0.9.42
Pinging 10.0.9.42 with 32 bytes of data:
Reply from 10.0.9.42: bytes=32 time=4ms TTL=126
Reply from 10.0.9.42: bytes=32 time=3ms TTL=126
Reply from 10.0.9.42: bytes=32 time=4ms TTL=126
Reply from 10.0.9.42: bytes=32 time=3ms TTL=126
Ping statistics for 10.0.9.42:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 3ms, Maximum = 4ms, Average = 3ms
Confusingly, everything seemed normal from here. Then I tried the Linux ping command.
effugas@doxpara:~> ping 10.0.9.42
PING 10.0.9.42 (10.0.9.42): 56 data bytes
64 bytes from 10.0.9.42: icmp_seq=0 ttl=127 time=3.5 ms
64 bytes from 10.0.9.42: icmp_seq=0 ttl=127 time=14.7 ms (DUP!)
64 bytes from 10.0.9.42: icmp_seq=1 ttl=127 time=6.2 ms
64 bytes from 10.0.9.42: icmp_seq=1 ttl=127 time=7.5 ms (DUP!)
64 bytes from 10.0.9.42: icmp_seq=2 ttl=127 time=3.3 ms
64 bytes from 10.0.9.42: icmp_seq=2 ttl=127 time=3.8 ms (DUP!)
64 bytes from 10.0.9.42: icmp_seq=3 ttl=127 time=15.0 ms
64 bytes from 10.0.9.42: icmp_seq=3 ttl=127 time=15.4 ms (DUP!)
--- 10.0.9.42 ping statistics ---
4 packets transmitted, 4 packets received, +4 duplicates, 0% packet loss
round-trip min/avg/max = 3.3/8.6/15.4 ms
This was disturbing, especially since there was a very high correlation between subnets experiencing high collisions and slow networks and the number of TCP-Chorusing machines on that subnet. What was causing this? The first step was to hunt down the machines exhibiting the bug and do a little exploratory surgery. It didn't take much deduction once I got access to a few of the affected machines to realize that there were the same number of extra TCP/IP stacks bound to the main adapter as there were extra pings. Still, just because there were machines with extra stacks didn't mean it wasn't the NIC's fault--a bug in the NIC installer could have created the extra TCP/IP entries. And what about wiring? All of these machines were exhibiting these reactions on a rather non-standard "secure hub". Perhaps that was the cause of the stacks reacting so strangely?
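The relationship between extra pings and extra stacks can be checked mechanically against Linux ping output. Below is a rough Python sketch under one loud assumption taken from the observation above: each bound TCP/IP stack answers every echo request exactly once, so the number of replies per icmp_seq approximates the number of stacks. Duplicates with some other cause would be miscounted. The function names and `sample_ping` are illustrative only.

```python
import re
from collections import Counter

# Pulls the sequence number out of a reply line such as:
#   64 bytes from 10.0.9.42: icmp_seq=0 ttl=127 time=3.5 ms
SEQ = re.compile(r"icmp_seq=(\d+)")


def replies_per_sequence(ping_output):
    """Return a Counter mapping icmp_seq -> number of replies seen for it."""
    counts = Counter()
    for line in ping_output.splitlines():
        if " bytes from " in line:
            match = SEQ.search(line)
            if match:
                counts[int(match.group(1))] += 1
    return counts


def apparent_stack_count(ping_output):
    """Largest number of replies to any one request (0 if there were none).

    Assumption: one reply per bound stack, as observed on the affected machines.
    """
    counts = replies_per_sequence(ping_output)
    return max(counts.values()) if counts else 0


# Hypothetical sample input, copied from the Linux ping session above.
sample_ping = """\
64 bytes from 10.0.9.42: icmp_seq=0 ttl=127 time=3.5 ms
64 bytes from 10.0.9.42: icmp_seq=0 ttl=127 time=14.7 ms (DUP!)
64 bytes from 10.0.9.42: icmp_seq=1 ttl=127 time=6.2 ms
64 bytes from 10.0.9.42: icmp_seq=1 ttl=127 time=7.5 ms (DUP!)
64 bytes from 10.0.9.42: icmp_seq=2 ttl=127 time=3.3 ms
64 bytes from 10.0.9.42: icmp_seq=2 ttl=127 time=3.8 ms (DUP!)
64 bytes from 10.0.9.42: icmp_seq=3 ttl=127 time=15.0 ms
64 bytes from 10.0.9.42: icmp_seq=3 ttl=127 time=15.4 ms (DUP!)
"""

if __name__ == "__main__":
    print("replies per request:", dict(replies_per_sequence(sample_ping)))
    print("apparent stacks:", apparent_stack_count(sample_ping))
```

For the session shown earlier, every request drew two replies, which matches a machine with one extra stack bound to the adapter.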
Further investigation did shed some light. The automated installation routines are suspect, since they're the routines that most commonly add the stacks. All cards, though, from generic Linksys cards to an Intel 8255x 10/100 board to the entire bevy of PCI and PCMCIA cards that 3Com offers, will exhibit this behavior if additional TCP/IP stacks are simply added onto them. While I have only seen 3Com cards suffering from unintentional TCP/IP Chorusing, this is probably because of the 90%+ market share 3Com enjoys on campus and not because of a flaw in their drivers. It's quite likely that, since students and not staff install network drivers on campus, this is more of a wetware problem--the student does whatever he or she can to "just make it work like the directions say", and if adding TCP/IP multiple times happens to "Just Work", so be it.
Before I could be sure that this was the problem, though, I needed to isolate a computer from the University network. I used my dorm room's internal 100BaseT network to do so. The following tcpdump capture shows a single character typed from the chorusing machine into the telnet port of the Linux machine:
11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: P 6:7(1) ack 171 win 7756 (DF)
[Initial Keypress]
11:31:02.390000 10.0.6.194.telnet > 10.0.6.195.1043: P 171:172(1) ack 7 win 16352 (DF)
[Pressed key is echoed from the Linux machine to be displayed on the Windows box.]
11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: . ack 172 win 7755 (DF)
[Windows machine acknowledges receipt of data signifying what character it should display.]
11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: . ack 172 win 7755 (DF)
[Windows machine again acknowledges receipt. This is the "chorus".]
There's most probably no limit to the number of extra ACKs--if I had ten TCP/IP stacks, I'd have nine duplicate packets, as far as I can tell.
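Spotting the chorus in a longer capture by eye gets tedious, so here is a small Python sketch that flags back-to-back identical packets in tcpdump text output. It is only a heuristic and rests on assumptions: each line begins with a timestamp followed by a space, the bracketed annotations have been removed, and any immediately repeated line is treated as a chorused copy (legitimate repeated ACKs on a lossy link would look the same). The names `strip_timestamp`, `consecutive_repeats`, and `sample_capture` are hypothetical.

```python
from itertools import groupby


def strip_timestamp(line):
    """Drop the leading timestamp so only the packet description remains."""
    parts = line.strip().split(" ", 1)
    return parts[1] if len(parts) == 2 else ""


def consecutive_repeats(tcpdump_text):
    """Return (packet, extra_copies) for every run of identical adjacent packets."""
    packets = [strip_timestamp(line) for line in tcpdump_text.splitlines()]
    packets = [packet for packet in packets if packet]
    repeats = []
    for packet, run in groupby(packets):
        copies = sum(1 for _ in run)
        if copies > 1:
            repeats.append((packet, copies - 1))
    return repeats


# Hypothetical sample input: the four packets from the capture above.
sample_capture = """\
11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: P 6:7(1) ack 171 win 7756 (DF)
11:31:02.390000 10.0.6.194.telnet > 10.0.6.195.1043: P 171:172(1) ack 7 win 16352 (DF)
11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: . ack 172 win 7755 (DF)
11:31:02.390000 10.0.6.195.1043 > 10.0.6.194.telnet: . ack 172 win 7755 (DF)
"""

if __name__ == "__main__":
    for packet, extra in consecutive_repeats(sample_capture):
        # If the guess above holds, extra + 1 is the number of bound stacks.
        print(extra, "extra copy/copies of:", packet)
```

If the one-duplicate-per-extra-stack guess holds, a machine with ten stacks would show nine extra copies of each ACK here.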
A final note--I have thus far been able to locate the bug in Windows 98 and Windows 95 OSR2. The original version of Windows 95 was simply unavailable for testing, but I would appreciate an email verifying whether the bug harkens back that far.