100% CPU on Netapp / N Series

I’m pretty partial to the ol’ N Series (IBM’s version of Netapp storage). We use them as our VMware storage over NFS. Easy to set up and manage.

This problem has been annoying me for ages, and I’m excited to finally have an answer to it. Every now and again, users would complain that performance had dropped through the floor. I’ve gotten pretty used to assuming that bad VM performance = bad storage performance, so I jump onto the N Series / filer straight away.

sysstat -x 1

is a good start when investigating – it’ll show you how many NFS operations are passing through the filer, and the CPU, network and disk utilization levels needed to service the requests. The Disk Util column is interesting too – it isn’t an average or anything, it’s the utilization of the busiest single disk in the filer. And since we’ve got 50-ish disks in the filer, one hot disk doesn’t necessarily mean the whole system is struggling.

This is what a normal sysstat output looks like. (At least for me).

netapp> sysstat -x 1
 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 64%  3039     0     0    3039 20249  4826  25087  52299     0     0    12s  93% 100%  :f  24%      0     0     0     0     0     0
 55%  2816     0     0    2816 17085 12570  20919  37107     0     0    12s  85% 100%  :f  17%      0     0     0     0     0     0
 44%  2856     0     0    2856 17580  6341  12312  39792     0     0    12s  82% 100%  :f  51%      0     0     0     0     0     0
 28%  3448     0     0    3448 18033  6292   4988   3980     0     0    12s  82%  13%  :   12%      0     0     0     0     0     0

And this is one when all hell is breaking loose:

 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
100%    26     0     0      26   148     9   4584   7308     0     0     7s 100%  50%  :   39%      0     0     0     0     0     0
100%    66     0     0      66   260    20   4728      0     0     0     7s 100%   0%  -   31%      0     0     0     0     0     0
100%    39     0     0      39   275   211   5071     24     0     0     8s 100%   0%  -   29%      0     0     0     0     0     0
100%   143     0     0     143   633    43   4548      8     0     0     8s 100%   0%  -   30%      0     0     0     0     0     0

See the NFS column? The CPU isn’t busy because it’s servicing NFS requests, and it’s not being overtaxed by the VMs – it’s something internal to the filer. Even the Disk Util isn’t very high. What’s going on?
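That signature – CPU pegged while NFS ops are near zero – is easy enough to spot programmatically. Here’s a minimal sketch in Python that flags it from a raw `sysstat -x` data line; the thresholds are my own guesses for this environment, not anything NetApp recommends.

```python
def looks_like_internal_work(sysstat_line, cpu_threshold=95, ops_threshold=500):
    """Heuristic: a pegged CPU with almost no NFS ops suggests the filer
    is busy with internal work (e.g. space reclaim), not client load.
    Thresholds are guesses; tune them for your own filer."""
    fields = sysstat_line.split()
    cpu_pct = int(fields[0].rstrip('%'))   # first column: CPU, e.g. "100%"
    nfs_ops = int(fields[1])               # second column: NFS ops/s
    return cpu_pct >= cpu_threshold and nfs_ops < ops_threshold

# The "all hell breaking loose" lines trip it; the normal lines don't.
print(looks_like_internal_work("100%    26     0     0      26   148"))   # True
print(looks_like_internal_work(" 64%  3039     0     0    3039 20249"))   # False
```

You could feed this the output of `sysstat -x 1` over SSH and alert when several consecutive samples trip the heuristic.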

In our particular environment, our VMs are really disposable, since the build of WCM that goes into them is obsolete the next day. At any time the filer might be half filled with switched-off VMs. Eventually the filer fills up and you have to delete all the obsolete VMs. This is what caused the high CPU: deleting a bunch of VMs. Each is about 25 GB, and I must have deleted around 200 of them. The delete itself is quite quick, but it spiked the CPU for about an hour. An agonizing hour!

At the very least, it’s great to know what was going on. I’ll open a case with IBM support and report back (if I can). In the meantime, I’d better write a delete queue so I don’t tax the filer as badly.
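A delete queue could be as simple as the sketch below: delete one VM directory at a time and pause between deletions so the filer’s background space-reclaim work can catch up. The pause length and the `shutil.rmtree` approach are assumptions on my part, not anything NetApp prescribes.

```python
import shutil
import time
from pathlib import Path

def drain_delete_queue(vm_dirs, pause_seconds=300):
    """Delete VM directories one at a time, sleeping between each so the
    filer can reclaim space without pegging its CPU for an hour straight.
    pause_seconds=300 is an arbitrary starting point; tune it by watching
    sysstat while the queue drains."""
    deleted = []
    for vm_dir in vm_dirs:
        path = Path(vm_dir)
        if path.is_dir():
            shutil.rmtree(path)        # remove the VM's files from the NFS datastore
            deleted.append(str(path))
            time.sleep(pause_seconds)  # give the space-reclaim job time to catch up
    return deleted
```

Point it at the obsolete VM folders on the NFS datastore and keep `sysstat -x 1` running to see whether the CPU stays under control.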


6 Responses to 100% CPU on Netapp / N Series

  1. Jordan Xu says:

    Very interested to know what is happening there. Seeing the same thing myself for a while now but cannot pinpoint the cause. Please let me know if you’ve found out why deletions are pegging the CPU for you 🙂

  2. Jay says:

We’ve had something very similar happen with our filer – not with NFS but with CIFS. We are waiting on a response from NetApp. Did you ever get any indication of what was going on with your filer? Which firmware version are you running? We are running 8.1.1 in C-Mode on V3240s.

  3. Alexandre Derumier says:

    Hi, same here, 8.1.1 C-mode on fas2240-2.

    Seems to be the space reclaim job, which takes a lot of CPU.
    (When you delete a big file, the space isn’t available right after the delete.)
    I’m new to NetApp, so I don’t know if this is the normal behaviour….

  4. Simon C. says:

    Hi

    Same here, FAS2240-2. When I delete a clone of a LUN file (500 GB, thin provisioned), sysstat shows this for several minutes:

    netapp-x> sysstat -m 1
     ANY  AVG CPU0 CPU1 CPU2 CPU3
     92%  29%  10%   7%  15%  85%
     92%  29%  11%   8%  17%  80%
     88%  28%  10%   7%  15%  82%
     91%  28%  10%   7%  15%  82%
     80%  25%   7%   5%  11%  76%
     99%  40%  21%  17%  28%  95%
     87%  29%  14%   9%  11%  81%
     82%  27%  14%   8%  12%  75%
    netapp-x>

    One core sits at 80–90%, and the high latencies kill NFS performance at this time 🙁
    Data ONTAP 8.1.2, 10k SAS disks, no Flash Cache, aggregate at about 30% usage.
    Has anyone solved this problem?

  5. Simon C. says:

    PS: 7-mode.

  6. E-One says:

    This is BUG 90314
