[Nfsen-discuss] Netflow TCAM threshold exceeded

Discussion:

Tristan RHODES

2007-12-04 18:19:23 UTC

We are seeing these errors on our Cisco 6500 with Sup720 modules. It
looks like we might be processing more flows than it can handle.

Dec 4 18:10:59: %EARL_NETFLOW-SP-4-TCAM_THRLD: Netflow TCAM threshold
exceeded, TCAM Utilization [93%]

This Cisco document explains that this error will only affect Netflow
accounting information, and not affect packet forwarding.

http://www.cisco.com/warp/public/473/186_ErrormsgIOS_41265.html#prob1a

Anyone run into this? How can we fix it? Here are the netflow commands
on that router.

ip flow-cache timeout active 5
mls ip multicast flow-stat-timer 9
mls flow ip interface-full
no mls flow ipv6
mls nde sender version 5
no mls acl tcam share-global
ip flow-export source Loopback0
ip flow-export version 5
ip flow-export destination 192.168.1.100 5555

Thanks,

Tristan Rhodes

Felix Schueren

2007-12-05 10:31:24 UTC

IIRC, you can't do anything - (sampled) netflow is horribly broken on
the 6500s, where data will be stored in TCAM _before_ sampling, so even
if you just wanted 1/1000 sampled data, you'd still overflow TCAM once
your netflow-observed interfaces pass enough traffic...

someone with more cisco background should clarify that though, that's
just second-hand information.

kind regards,

Felix
--
Felix Schueren, Head of NOC

mailto:***@hosteurope.de

Host Europe GmbH - http://www.hosteurope.de
Welserstrasse 14 - D-51149 Koeln - Germany
Telefon (0800) 4678387 - Telefax (01805) 663233
HRB 28495 Amtsgericht Koeln - UST ID DE187370678
Geschaeftsfuehrer: Uwe Braun - Patrick Pulvermueller -
Mike Read - Stewart Porter

Fuer diese Nachricht gilt: http://www.hosteurope.de/disclaimer.html

Joel Krauska

2007-12-05 17:04:02 UTC

I'd go farther and say Netflow in general on the cisco 6500 is pretty broken.

For example, by default Netflow is turned on for ALL ports.

And sampled Netflow isn't sampled on collection into the TCAM; it's sampled on what gets exported back
to the receiver.
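
To make that point concrete, here is a toy Python model of the behaviour described above. It is only a sketch of the claim in this thread, not Cisco's actual implementation; the function name, the capacity and the sampling rate are made up for illustration. Every new flow tries to take a table entry, and sampling only thins what is exported afterwards.

```python
def simulate(flow_ids, table_capacity, sample_every):
    """Toy model: flows are installed before sampling; sampling only thins the export."""
    table = set()
    learn_failures = 0
    for fid in flow_ids:
        if fid in table:
            continue  # flow already has an entry
        if len(table) < table_capacity:
            table.add(fid)
        else:
            learn_failures += 1  # table full: this flow is never recorded
    # Sampling is applied afterwards, to what gets exported (hypothetical 1-in-N rule).
    exported = sorted(table)[::sample_every]
    return len(table), learn_failures, len(exported)


entries, failures, exported = simulate(range(1500), table_capacity=1000, sample_every=100)
print(entries, failures, exported)
```

In this model the table fills up and flows are lost no matter how aggressive the sampling rate is, which is the complaint being made here.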

I hear that in newer versions of IOS, it will honor your per port configs, and it's not a global,
but I haven't had the opportunity to test that yet.

If you find anything let me know, as I'd love to try it.

Here's how my flow stats are configured on my cat6k to help alleviate the problem, but
my TCAMs are regularly full too.

ip flow-cache timeout inactive 300
ip flow-cache timeout active 5
mls ip multicast flow-stat-timer 9
mls flow ip interface-full
no mls flow ipv6
ip flow ingress
ip flow ingress
ip flow-export source Loopback0
ip flow-export version 5 origin-as
ip flow-export destination x.x.x.x 6000

Best of luck,

Joel

John Fraizer

2007-12-05 17:39:08 UTC

I believe this is also the same on the 7600 platform running SUP720's unless
I'm confusing something. NDE on 6500 and 7600 is pretty worthless in
general without TCP flags anyway.

Mohacsi Janos

2007-12-05 13:37:22 UTC

Try reducing the number of flows:
mls flow ip
or
mls flow ip {destination | source-only}

or maybe more aggressive aging can help.
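
A rough Python sketch of why a coarser flow mask reduces the number of entries: the fewer fields in the key, the more packets collapse into one entry. The key functions below are simplified approximations of flow masks (the interface part of interface-full is left out), and the sample packets are invented for illustration.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "src dst proto sport dport")

# Hypothetical key functions approximating three flow masks.
MASKS = {
    "full": lambda p: (p.src, p.dst, p.proto, p.sport, p.dport),
    "destination-source": lambda p: (p.src, p.dst),
    "destination": lambda p: (p.dst,),
}


def entries_needed(packets, mask):
    """Number of distinct table entries the packets would need under a mask."""
    return len({MASKS[mask](p) for p in packets})


# Two clients opening 50 connections each to the same server.
packets = [
    Packet("10.0.0.1", "192.0.2.10", 6, 40000 + i, 80) for i in range(50)
] + [
    Packet("10.0.0.2", "192.0.2.10", 6, 50000 + i, 443) for i in range(50)
]

counts = {mask: entries_needed(packets, mask) for mask in MASKS}
print(counts)
```

The trade-off is that the coarser masks also throw away the port and protocol detail in the exported records.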

Best Regards,

Janos Mohacsi
Network Engineer, Research Associate, Head of Network Planning and Projects
NIIF/HUNGARNET, HUNGARY
Key 70EF9882: DEC2 C685 1ED4 C95A 145F 4300 6F64 7B00 70EF 9882

-------------------------------------------------------------------------
SF.Net email is sponsored by: The Future of Linux Business White Paper
from Novell.

Jose Manuel Agudo Cuesta

2007-12-05 15:23:13 UTC

Hi,

You can try to use fast aging to reduce the number of flows in TCAM:

mls aging fast [{threshold packet-count} [{time seconds}]]

The default packet-count value is 100 packets and the seconds default is 32
seconds, try with different values to reduce TCAM utilization.
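
As a sketch of what fast aging does, assuming the rule is "purge an entry that has seen fewer than packet-count packets once it is at least that many seconds old" (my reading of the command, not verified against the platform), here it is in Python. The FlowEntry class and the sample table are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FlowEntry:
    name: str
    created_at: float  # seconds
    packets: int


def fast_age(entries, now, threshold=100, time_s=32):
    """Keep entries that are still young or have seen at least `threshold` packets."""
    kept = []
    for e in entries:
        age = now - e.created_at
        if age >= time_s and e.packets < threshold:
            continue  # old enough and low volume: purge
        kept.append(e)
    return kept


table = [
    FlowEntry("dns-lookup", created_at=0, packets=2),
    FlowEntry("bulk-transfer", created_at=0, packets=5000),
    FlowEntry("new-flow", created_at=30, packets=1),
]
survivors = [e.name for e in fast_age(table, now=40)]
print(survivors)
```

Lowering the time or raising the packet threshold purges more of the short-lived flows, which is where most of the table pressure usually comes from.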

Best regards,

José Agudo

Shane Gaumond

2007-12-05 19:38:07 UTC

I run the following on my Cat6K and
I have no problems.

IOS Version 12.2(18)
ip flow-cache timeout inactive 10
ip flow-cache timeout active 1
no ip flow export layer2-switched vlan 1-4094
mls netflow usage notify 80 120
mls flow ip interface-full
no mls flow ipv6
ip flow-export version 9

The timeout values keep my graphs less spiky (less "comb-like").

I think the layer 2 flows are filling up your buffers and causing your
error. Whenever I enable the layer 2 flows I get a whole lot of data,
most of it useless: ARP replies, OSPF, HSRP, etc. I think if you
disable the layer 2 reporting you should be fine. You can still enable
layer 2 for a few vlans if you really need that info, but layer 2 data
for the whole chassis will overrun buffers. I also run Version 9, which
is supposed to be more efficient.

I disagree that netflow is automatic or broken on the Cat6k. I
had to enable "ip flow ingress" on all my vlan interfaces to have the
chassis report flows for the interfaces. Netflow reporting worked
exactly like Cisco said it would.

My only complaint about netflow on the Cat6K is that it will only
report ifindexes of interfaces with IPs. This rules out switchports.
The ifindex reported in flows will always be the ifindex of the vlan
interface and never the physical interface, unless the physical
interface has an IP and "ip flow ingress" configured, which is not
standard.

Shane

Shane Gaumond

2007-12-06 15:35:21 UTC

Joel,
Just to clarify my cat6K is a 6500 series with a sup720.

You got me beat on the traffic stats. We are averaging 150 Kpps daily
and we see spikes at 250 Kpps, nowhere near your numbers. Don't be
fooled: Cisco makes extremely solid hardware and software, but they are
not the fastest switch/router manufacturer out there. Oftentimes
their chassis and modules are NOT non-blocking. What modules do you
have on your box? I would guess that at your traffic rates you're dropping
packets somewhere. I bring this up because I've always wondered if
netflow data includes dropped packets or not (I think it does).

Shane Gaumond

2007-12-06 21:56:36 UTC

Joel,
My cat6k is a 6500 series with sup 720. I admit it's a tad overkill but
we like the feature set.

Joel said: "I find that my tables fill up when I reach about 5 Gigs of
aggregate traffic. (all ports in+out)"

Joel said: "a platform created to pass 400Mpps of forwarding
that can't do 5Mpps of netflow collection has a broken netflow
collection implementation." (how did you get this number 400Mpps? 1.5)

I measure my router/switch performance by throughput in Gb/s. I get 305
Kpps at 2.6 Gb/s; if I extrapolate that pattern out I would get approx
605 Kpps at 5 Gb/s. Your traffic must have a high percentage of small
packets and consist of a large number of connections...
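
A quick back-of-the-envelope check of those numbers in Python, assuming packet rate scales linearly with throughput and taking the 5 Mpps at 5 Gb/s figure from the quotes above at face value. The helper names are made up for this sketch.

```python
def avg_packet_bytes(gbps, kpps):
    """Average packet size implied by a throughput and a packet rate."""
    return (gbps * 1e9 / 8) / (kpps * 1e3)


def scale_kpps(kpps, from_gbps, to_gbps):
    """Linear extrapolation of packet rate to a different throughput."""
    return kpps * to_gbps / from_gbps


observed = avg_packet_bytes(2.6, 305)           # about 1066 bytes per packet
extrapolated = scale_kpps(305, 2.6, 5.0)        # about 587 Kpps at 5 Gb/s
small_packet_case = avg_packet_bytes(5.0, 5000) # 125 bytes per packet at 5 Mpps

print(round(observed), round(extrapolated), round(small_packet_case))
```

Strict linear scaling lands a little under 600 Kpps, in the same range as the rough figure above, and 5 Mpps at 5 Gb/s would mean an average packet of only about 125 bytes, which is what points at lots of small packets.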

If the TCAM didn't have a limit and did not stop processing flows, the
box might DoS itself or start to drop packets. It's about balance. A
large number of smaller packets running through your network would
indicate more connections, and using a flow mask of "mls flow ip
interface-full" would create more flows to keep track of in memory. You
need to lower your flow mask granularity and/or use aggressive ageing
timers. What are your timers at? If you posted them I can't find them.

Have I read your posts correctly? You are seeing fewer than 7Mpps during
your 5 Gig peaks, but you are only receiving 5Mpps of flow data because
of the TCAM table filling up. If that is the case, a timer tweak might
fix that up.

Shane

Post by Shane Gaumond
What modules do you
have on your box?

6748, 6704, Sup720-3bXL
If you're spiking to 250Kpps on a Cat6k, then I think you may have
bought the wrong switch for your application. You're three orders of
magnitude over provisioned.

Post by Shane Gaumond
I would guess that at your traffic rates your dropping
packets somewhere.

My packet rates are fine. I'm not dropping data.
The box forwards packets just fine.
It's the netflow TCAM (statistics gathering for netflow) that overflows.
This is outside of the packet data path.
(that's what this thread is about)
I'm not doing any L2 netflow. I do netflow on my L3 network egress points for
customer traffic evaluation.
In any case, I will say again that a platform created to pass 400Mpps of forwarding
that can't do 5Mpps of netflow collection has a broken netflow collection implementation.
It's like a race car whose speedometer only goes up to 5 mph.
I've found that using DFC cards helps scale the issue.
Each local DFC card has its own netflow processing engine.
(so putting an additional DFC engine on a card with lots of netflow ports
can mitigate/scale/localize the "problem"... -- it's just that dfc cards aren't cheap)
--joel

Johnson, Neil M

2007-12-07 20:36:49 UTC

There are some SNMP OIDs that you can use to monitor Netflow TCAM
utilization and Netflow learn failures. I don't have them handy, but if I
find them I'll post them.

We found that even during the summer break our TCAM tables were often full
and thousands of flows per second were failing to be recorded. We are using
fairly aggressive timers, too. These are sup 720s with PFC 3BXLs and
DFCs. We are implementing optical taps on certain links to address these
issues.

If you are trying to collect flow data from sup720's on busy networks,
please keep in mind you may not be getting as complete a picture as you
think.

-Neil
--
Neil Johnson
Network Engineer
ITS-Telecommunications and Network Services
The University of Iowa
319 384-0938
