Profiling a memory leak in a non-redundant, uptime-critical application

We have a major challenge which has been stumping us for months now.

A couple of months ago, we took over the maintenance of a legacy application whose last developer left the company several years ago.

This application needs to be online more or less all the time. It was developed many years ago without staging or test environments, and without a redundant infrastructure setup.

We're dealing with a legacy Java EJB application running on the Payara application server (a GlassFish derivative) on an Ubuntu server.

For the last year or two, it has been necessary to restart Payara approximately once a week, and the Ubuntu server once a month.

This is due to a memory leak which slows down the application over a period of around a week. The GUI becomes almost entirely non-responsive, but a restart of Payara fixes this, at least for a while.

However, after each Payara restart there is still some kind of residual memory use. The baseline memory usage increases, which shortens the time between Payara restarts. About once a month, we therefore do a full Ubuntu reboot, which fixes the issue.

Naturally, we want to find the memory leak, but we are unable to run a profiler on the server because it is resource intensive and would need to run for several days in order to capture the leak.

We have also tried several times to dump the heap using the "gcore" command, but it always results in a segfault, after which we need to reboot the Ubuntu server.

What other options / approaches do we have to figure out which objects in the heap are not being garbage collected?



Solution 1:[1]

I would try to clone the server in some way to another system where you can perform tests without clients being affected. It could even be a system with fewer resources, if you want to trigger a resource-based problem.

To be able to observe the memory leak without having to wait for days, I would create a load test, maybe with Apache JMeter, to simulate a week's worth of accesses within a day or even hours or minutes (I don't know whether the base load is at a level where that is feasible for the server and network infrastructure).

First, you could set up the load test to act as a "regular" mix of requests as seen in the wild. Once you can trigger the loss of responsiveness, you can try to find out whether specific requests are more likely to cause the leak than others. (It could also be that some basic component that is reused in nearly every call contains the leak, in which case you cannot find "the" call with the leak.)

Then you can instrument this test server with a profiler.
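The answer names Apache JMeter for the load test. Purely as an illustration of the idea of compressing a week of traffic into a short run, here is a minimal hand-rolled sketch in Java (11 or later for `HttpClient`). It is a substitute for JMeter, not a replacement for its features. The class name `LoadReplay`, the thread and request counts, and the URL `http://localhost:8080/app/` are assumptions made up for this example; only the JDK classes are real.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: runs one request action many times on a fixed thread pool. */
final class LoadReplay {

    /**
     * Executes {@code request} {@code totalRequests} times using {@code threads} workers.
     * Returns how many invocations threw a RuntimeException.
     */
    static int run(int threads, int totalRequests, Runnable request) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger failures = new AtomicInteger();
        for (int i = 0; i < totalRequests; i++) {
            pool.submit(() -> {
                try {
                    request.run();
                } catch (RuntimeException e) {
                    failures.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            // Wait for the queued requests to finish; the one-hour cap is arbitrary.
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return failures.get();
    }

    public static void main(String[] args) {
        // Assumed URL of the cloned test server; pass the real one as the first argument.
        String url = args.length > 0 ? args[0] : "http://localhost:8080/app/";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest get = HttpRequest.newBuilder(URI.create(url)).GET().build();
        int failed = run(8, 1_000, () -> {
            try {
                HttpResponse<Void> response =
                        client.send(get, HttpResponse.BodyHandlers.discarding());
                if (response.statusCode() >= 500) {
                    throw new IllegalStateException("HTTP " + response.statusCode());
                }
            } catch (IOException | InterruptedException e) {
                throw new IllegalStateException(e);
            }
        });
        System.out.println("failed requests: " + failed);
    }
}
```

Passing the request as a `Runnable` keeps the driver independent of the transport, so the same loop could replay different request types one at a time when narrowing down which call leaks.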

As another approach (which you could pursue in parallel), you can also use a static code inspection tool like SonarQube to analyze the source code for typical patterns of memory leaks.

One other idea comes to mind, but it comes with many preconditions. If you have recorded typical scenarios for the backend calls, have enough development resources, and are dealing with a stateless web application where each call can be inspected more or less individually, then you could try to set up partial integration tests that simulate the incoming web calls with database and file access but, if possible, without the application server, and record the increase in heap usage after each call. Statistically, you might be able to find the "bad" call this way. (This is something I would try only as a very last option.)
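To make "record the increase of the heap usage after each of the calls" concrete, here is a minimal sketch using the standard `java.lang.management` API. The class `HeapDeltaRecorder` and the call names are hypothetical. The measurement relies on `MemoryMXBean.gc()`, which is only a request to the JVM, so individual numbers are noisy and only the trend over many repetitions is meaningful.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: records how much heap stays in use after each named call. */
final class HeapDeltaRecorder {
    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();
    private final Map<String, Long> retainedBytesByCall = new LinkedHashMap<>();

    /** Requests a GC (a hint; the JVM may ignore it) and returns the used heap in bytes. */
    static long usedHeapAfterGc() {
        MEMORY.gc();
        return MEMORY.getHeapMemoryUsage().getUsed();
    }

    /** Runs the call and adds the change in post-GC heap usage to the call's running total. */
    long measure(String callName, Runnable call) {
        long before = usedHeapAfterGc();
        call.run();
        long delta = usedHeapAfterGc() - before;
        retainedBytesByCall.merge(callName, delta, Long::sum);
        return delta;
    }

    /** Accumulated heap growth per call name, in first-seen order. */
    Map<String, Long> totals() {
        return retainedBytesByCall;
    }

    public static void main(String[] args) {
        List<byte[]> leak = new ArrayList<>(); // stands in for a leaking cache
        HeapDeltaRecorder recorder = new HeapDeltaRecorder();
        for (int i = 0; i < 5; i++) {
            // Keeps 1 MiB reachable per call, so its total should keep growing.
            recorder.measure("leakyCall", () -> leak.add(new byte[1 << 20]));
            // Allocates 1 MiB that becomes garbage as soon as the call returns.
            recorder.measure("cleanCall", () -> {
                byte[] tmp = new byte[1 << 20];
                tmp[0] = 1;
            });
        }
        recorder.totals().forEach((name, bytes) ->
                System.out.println(name + " retained about " + bytes + " bytes"));
    }
}
```

In a real integration test, the `Runnable` would wrap one simulated backend call, and a call whose total grows steadily across many repetitions would be the first candidate to inspect.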

Solution 2:[2]

  1. Apart from heap dumps, have you tried any real-time application performance monitoring (APM) tool, such as AppDynamics or an open-source alternative like https://github.com/scouter-project/scouter?
  2. An alternative approach would be to look for known issues in the existing stack, e.g. Payara issues like https://github.com/payara/Payara/issues/4098, or issues in the Ubuntu patch level the application is currently running on.
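If a full APM agent turns out to be too heavy for this server, a much cheaper form of the same idea is to sample the JVM's own memory and GC counters at a long interval and log them. The sketch below uses only the standard `java.lang.management` beans. The class name `MemorySampler` and the output format are assumptions for illustration, and how it would be scheduled inside the application (for example from a timer) is left open.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical helper: formats one line of memory and GC counters for logging. */
final class MemorySampler {

    static String sample() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        StringBuilder line = new StringBuilder();
        line.append("heapUsed=").append(memory.getHeapMemoryUsage().getUsed())
            .append(" nonHeapUsed=").append(memory.getNonHeapMemoryUsage().getUsed());

        // Usage right after the last collection of each pool approximates live data,
        // which is the number that creeps upward when objects are leaking.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if the pool does not support it
            if (afterGc != null) {
                line.append(" pool[").append(pool.getName()).append("].afterGc=")
                    .append(afterGc.getUsed());
            }
        }

        // Rising collection counts and times show the GC working harder over the week.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            line.append(" gc[").append(gc.getName()).append("].count=")
                .append(gc.getCollectionCount())
                .append(" timeMs=").append(gc.getCollectionTime());
        }
        return line.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Three samples one second apart; a real deployment would sample far less often.
        for (int i = 0; i < 3; i++) {
            System.out.println(System.currentTimeMillis() + " " + sample());
            Thread.sleep(1_000);
        }
    }
}
```

Comparing the heap and non-heap figures over a week may also help show whether the growth is inside the Java heap at all, which matters here because some memory stays in use even after Payara is restarted.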

Solution 3:[3]

You can use jmap, a command-line tool bundled with the JDK, to inspect the memory. From the documentation:

jmap prints shared object memory maps or heap memory details of a given process or core file or a remote debug server.

For more information, see the documentation or the Stack Overflow question "How to analyse the heap dump using jmap in java".
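On the command line, a heap dump is typically taken with `jmap -dump:live,format=b,file=heap.hprof <pid>`, and a per-class instance count with `jmap -histo:live <pid>`. For cases where attaching jmap from outside is not practical, the following sketch triggers the same kind of dump from inside the JVM through HotSpot's `HotSpotDiagnosticMXBean`. The class name `HeapDumper` and the file locations are assumptions, and the code would have to run inside the Payara JVM to dump its heap. Note that the JVM pauses while the dump is written, which can take a while on a large heap, so on an uptime-critical server the timing needs care.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: writes an .hprof heap dump of the current JVM (HotSpot only). */
final class HeapDumper {

    /**
     * Dumps the heap to {@code target}, which must not exist yet and should end in ".hprof".
     * With {@code liveOnly} set, only reachable objects are written.
     */
    static Path dump(Path target, boolean liveOnly) {
        try {
            HotSpotDiagnosticMXBean diagnostics =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            diagnostics.dumpHeap(target.toString(), liveOnly);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience variant that dumps into a fresh temporary directory. */
    static Path dumpToTempDir(boolean liveOnly) {
        try {
            Path dir = Files.createTempDirectory("heapdump");
            return dump(dir.resolve("heap.hprof"), liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = dumpToTempDir(true);
        System.out.println("heap dump written to " + file);
    }
}
```

The resulting file can be copied to another machine and opened there with a heap analysis tool, so the expensive analysis does not run on the production server.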

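There is also a tool called jhat, which can be used to analyze Java heap dumps (note that jhat was removed from the JDK in version 9, so it is only available with older JDKs). From the documentation: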

The jhat command parses a java heap dump file and launches a webserver. jhat enables you to browse heap dumps using your favorite webbrowser. jhat supports pre-designed queries (such as 'show all instances of a known class "Foo"') as well as OQL (Object Query Language) - a SQL-like query language to query heap dumps. Help on OQL is available from the OQL help page shown by jhat. With the default port, OQL help is available at http://localhost:7000/oqlhelp/

See the jhat documentation, or "How to analyze the heap dump using jhat".

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 cyberbrain
Solution 2 Roopesh Payyanath
Solution 3