I was seeing an OutOfMemory error on our Jenkins instance. It’s a good thing really: it means our backend infrastructure is making more nodes than ever before and we are hitting limits in the default config of Jenkins. This error was in the logs. We’re using the IBM JVM (natch).
SEVERE: Failed to execute command UserRequest:hudson.remoting.PingThread$Ping@6e6f6e6f (channel ci-p-94545)
Throwable occurred: java.lang.OutOfMemoryError: Failed to create a thread: retVal -1073741830, errno 11
at java.lang.Thread.startImpl(Native Method)
Sep 5, 2013 3:38:57 AM hudson.remoting.Channel$2 handle
SEVERE: This command is created here
Googling this error leads to a bit of a red herring. Usually when you see an OutOfMemory error that references threads, it’s a native (system) out of memory, because each additional thread that the JVM spawns uses native memory (as opposed to Java heap memory). The counter-intuitive advice in that case is to lower the Java heap size – since a large Java heap crowds out the address space that’s needed for keeping track of new threads. Check out this article on developerWorks for an excellent rundown on a traditional native out of memory and how to fix it.
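If you do hit a genuine native out of memory, the fix on a RHEL-packaged Jenkins would look something like this (the `JENKINS_JAVA_OPTIONS` variable is from the RHEL package’s sysconfig file; the 256m value is just an illustration, not a recommendation):

```
# /etc/sysconfig/jenkins -- shrink the Java heap to leave more native
# address space for thread stacks:
JENKINS_JAVA_OPTIONS="-Xmx256m"
```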
In this case the operating system had plenty of free memory and Jenkins hadn’t been changed from its default heap size (512m on my system), so crowding out of native memory wasn’t the issue. So how many threads were being used anyway? This will give you the answer on Linux:
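One way to count them (my reconstruction – the pgrep pattern is an assumption, substitute your actual Jenkins PID if it differs): on Linux, each thread appears as a directory under /proc/&lt;pid&gt;/task, so counting those directories gives the thread count.

```shell
# Each thread is a directory under /proc/<pid>/task on Linux.
# Falls back to the current shell's PID if no jenkins process is found.
PID=$(pgrep -f jenkins | head -n1)
ls "/proc/${PID:-$$}/task" | wc -l
```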
For me it returned 1024, which seemed like a bit too much of a nice round number. Perhaps it could be a limit of some sort? 🙂 Enter ulimit.
Ulimit has never done me any favours. Its purpose is to stop one user from hogging the resources of a system. It seems to harken back to a different age when there were lots of different users (at a university, say) logging into one large system. In my experience it just stops things from running the way you think they will. All I ever seem to do with it is increase limits. YMMV, rant over.
So usually with ulimit, you’d edit the /etc/security/limits.conf file to raise the hard limit of the resource type:
and then add a line in the user’s login script (.bash_profile) to set the new limit.
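For reference, the two changes looked roughly like this (62835 is the value used later in this post – pick whatever suits your workload; note that on Linux, nproc counts threads as well as processes):

```
# /etc/security/limits.conf -- raise the hard cap for the jenkins user:
jenkins hard nproc 62835

# ~jenkins/.bash_profile -- raise the soft limit up to that cap at login:
ulimit -u 62835
```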
The default Jenkins configuration on RHEL 6 has an application user called jenkins with login disabled – so there is no way to trigger setting the ulimit. What’s the cleanest way to fix this? It seemed a bit crappy to have to enable a login account for this user. Still, I wasn’t sure how else to do it, so I gave the jenkins user a login shell and created a one-line .bash_profile script to set the ulimit. This still didn’t fix the problem (the login shell was never being invoked), so I delved into the /etc/init.d/functions script, which actually starts the jenkins/java process. It ultimately uses the ‘runuser’ command to start Jenkins.
I pulled out the same command to start Jenkins as the jenkins init.d script uses and tested all sorts of ways of setting the ulimit.
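A sketch of what I was testing – the exact runuser invocation is my reconstruction of what /etc/init.d/functions does, and it needs root:

```shell
# Mimic how /etc/init.d/functions launches the service and inspect the
# limit the child process actually gets. Requires root; skipped otherwise.
if [ "$(id -u)" -eq 0 ]; then
  # No login shell: .bash_profile is never sourced, default limit applies.
  runuser -s /bin/bash jenkins -c "ulimit -u"
  # Login shell (-l): the profile runs, so the raised limit shows up.
  runuser -l jenkins -c "ulimit -u"
fi
```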
The non-login invocation returns the default (1024) – so we’re not hitting the .bash_profile script.
But a login-shell invocation returns the correct value (62835).
So then it looked like the answer was editing the /etc/init.d/functions script.
It’s used by all sorts of different services (all of them?) on a Red Hat system, so there must be an easier way. Well, I never knew this, but a soft limit set in limits.conf takes effect for the user without you having to make a change in a login script.
A simple addition to the limits.conf file, and we’re good:
jenkins soft nproc 62835
To verify, running the runuser check now returns 62835. (Don’t ask how I picked that number.)
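You can also verify against the live process: /proc/&lt;pid&gt;/limits shows the limits actually in force, and “Max processes” there is the nproc limit. (As before, the pgrep pattern is an assumption.)

```shell
# "Max processes" in /proc/<pid>/limits is the nproc limit in effect.
# Falls back to the current shell's PID if no jenkins process is found.
PID=$(pgrep -f jenkins | head -n1)
grep 'Max processes' "/proc/${PID:-$$}/limits"
```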
A postscript: I found that I couldn’t make a thread dump happen with Jenkins, because stdout is pointing at /dev/null. If you’re dealing with a different Java application, a thread dump is a good first port of call, because (on the IBM JVM at least) it prints a list of the ulimit settings that are in effect for the process, making the first stage of debugging that much easier. To get a thread dump from a Jenkins process, start it without the --daemon flag and then do a kill -3.
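As a sketch (the war path is from the RHEL package and may differ on your install):

```shell
# Start Jenkins in the foreground (no --daemon) so stdout is reachable,
# then send SIGQUIT to ask the JVM for a thread dump.
WAR=/usr/lib/jenkins/jenkins.war
if [ -f "$WAR" ]; then
  java -jar "$WAR" > /tmp/jenkins-console.log 2>&1 &
  sleep 10
  kill -3 $!     # SIGQUIT; IBM JVMs also write a javacore file
fi
```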
I know that was a bit long, but I thought going through the procedure of figuring out the answer might be interesting.