When Cisco’s IGMP Snooping stops working for a few seconds

Editor’s note: this post is about the case where multicast traffic is forwarded to every port for a few seconds even though IGMP snooping is otherwise working. It is not about the case where you lose all multicast traffic when you activate IGMP snooping. For that case, remember to activate at least one IGMP querier somewhere on the network, with a corresponding IP address (no IP address, no querier….).
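
As a rough sketch of that querier reminder, a switch-level snooping querier could look like the fragment below. The address 192.0.2.1 is a placeholder, and the exact syntax (global versus per-VLAN) varies between Catalyst platforms and IOS releases, so treat it as an assumption to check against your own documentation:

```
! Sketch only: 192.0.2.1 is a placeholder source address for the queries
ip igmp snooping
ip igmp snooping querier
ip igmp snooping querier address 192.0.2.1
```

On platforms that support it, show ip igmp snooping querier should then list the querier seen on each VLAN.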

Introduction

IGMP snooping is a feature derived from the regular IGMP standard (from version 1 onwards). It works in a mostly independent manner, though, since it only operates at the local switch level. Without IGMP snooping enabled, multicast traffic is treated like a simple broadcast at layer 2, flooding every single switch interface configured on the same layer 2 network (or VLAN).

To avoid this flooding, IGMP snooping was created, and it is used extensively to keep traffic off interfaces that have not requested it. Typical use cases include:

1 – End consumer device interface on IPTV

[Diagram: mcast_1a]

In this case, if IGMP snooping were not enabled on all layer 2 equipment, not only would Client B be flooded with 1Gbps of traffic, but the Wifi network would be completely useless. With IGMP snooping, the home gateway doesn’t have to deal with 1Gbps of multicast traffic (which would completely kill its CPU), and only the absolutely essential traffic traverses the Wifi network.

2 – Backbone multicast traffic

[Diagram: mcast_2a]

In this use case, the biggest problem is that the multicast clients only support 1Gbps of traffic, out of a possible 5Gbps, so IGMP snooping is absolutely essential. With IGMP snooping, the traffic from the Core Switch to each Access Switch drops from 5Gbps to 200Mbps on the top switch and 150Mbps on the bottom switch.

When it doesn’t work

All this works amazingly well, unless you combine IGMP snooping, Cisco switches, Spanning Tree and 1+ Gbps of aggregated multicast traffic.

Symptom: everything is running fine, but whenever any client’s port goes up or down, every single other client goes awry, dropping packets and displaying all kinds of errors. This happens regardless of which access switch the flapping client is connected to; it always affects clients connected to the other access switch.

So, IGMP snooping was working… sometimes. Further analysis included a Wireshark tap on the clients (just try to tap a gigabit interface on a 6-year-old Macbook…..), which worked wonderfully. The tap showed that for a few seconds, ALL multicast groups were being forwarded to all clients! As the sum of the traffic is greater than the interface throughput, there’s no chance this wouldn’t cause errors. So, for some reason, IGMP snooping stops working for a few seconds, and then starts working again.

Now, it took me a few hours going through Cisco’s documentation, support forums and such, and finally reading the IGMP documentation from beginning to end, until I found this (the switches were not Cisco 4500s, but IOS is mostly the same across all Cisco switches):

If a VLAN experiences a spanning-tree topology change, IP multicast traffic floods on all VLAN ports where PortFast is not enabled, as well as on ports with the no igmp snooping tcn flood command configured for a period of TCN query count.

What the hell…? Why on earth would someone flood one interface because some other interface went down!? After digging around, the only explanation I could find was that it works around slow performance on old routers, where a new IGMP join would take around 5 to 10 seconds. If you’re doing redundancy, it would mean that the backup receiver would take 5 to 10 seconds to recover. But this also means that IGMP snooping wouldn’t be necessary at all! So for some weird reason, Cisco enables this by default, screwing everyone who actually needs IGMP snooping…
Anyways, the fix is quite simple:

no ip igmp snooping tcn flood
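
To put that command in context: it is applied per interface, on each port where the flood does damage. A minimal sketch, assuming a hypothetical receiver port GigabitEthernet1/0/10 (interface name and description are made up for illustration), might look like this. The PortFast line is a complementary measure, since a PortFast edge port does not generate a topology change notification when it goes up or down. The last line is optional and only shortens the flood window on ports that keep the default behaviour; check the allowed range and default for your platform.

```
! Sketch only: interface name and description are hypothetical
interface GigabitEthernet1/0/10
 description Multicast receiver (example)
 ! Do not flood multicast out of this port after a spanning-tree topology change
 no ip igmp snooping tcn flood
 ! Client port flaps no longer trigger a TCN
 ! (some newer releases spell this "spanning-tree portfast edge")
 spanning-tree portfast
!
! Optional, global: flood for fewer general queries after a TCN
ip igmp snooping tcn flood query count 1
```

Afterwards, show spanning-tree vlan <id> detail reports the number of topology changes and where the last one came from, which helps to correlate TCNs with the moments clients drop packets.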

Let’s just hope Cisco disables this by default….