<< Concrete problems when developers opt out of TDD | Home | Testing and concurrency >>

Importance of real time performance monitors

Decent log entries are essential. On our current project we aim to write enough data to allow off-line analysis of performance and usage plus errors. I am also an advocate of more immediate and accessible runtime information. Log analysis is great but sometimes you need empirical data right away. On our current project we use the Java MBean facility. These MBeans can be easily accessed in a graphically rich way using tools like JConsole or VisualVM.

We have a couple of different types of analyzer which we expose through MBeans. One simply records how many times an event has occurred in a short time period. Another calculates an real time average, again across a short time period. For example, we have analyzers which record the length of time it takes to make a call to a particular downstream application. Each duration is recorded and an average over the last ten seconds is reported via the MBean. This calculation has been implemented to make it very efficient from a CPU perspective since 99.999% of the time the average discarded before anybody bothers to look at it. Originally we were only using two or three of these average analyzers in the system. As developers found them useful they were placed around every single external interaction and we suddenly found ourselves with several thousand per application. These used about 25% of the heap and consumed significant CPU resource. The analyzer was then optimized and now consumes negligible resources.

I have been personally a little disappointed that our operations team have not made as much as of this facility as I expected. They are happy with their existing log analysis tools. As a team, we have questioned whether our investment in MBeans is worthwhile. We concluded that it was as even though the Ops team don't use it in Production the development group rely on the data exposed through JMX for trouble shooting, especially in system test, monitoring load tests and as a quick way to gauge the health of Production.

Last week I was reminded again how useful this immediately accessible data was. After a system restart Production was doing 'something funny'. We had ambiguous portents of doom and various excited people considering some fairly drastic remedial action, including switching off a production system which was serving several thousand users. The fear was that something in the affected system might be placing unbearable demands on downstream applications. This seemed unlikely as we have many layers of throttles and queues to prevent just such an occurrence but there was something odd going on. The first port of call for the developers were the log files. With several thousands transactions being performed a second there was a lot of log lines whizzing past. Panic began to creep in as it was impossible to discern what, if anything was going on in the explosion of data. I was able to walk over to my workstation and bring up VisualVM. In about thirty seconds I could see that right at that very moment we were sending a great many messages but well within the tolerances we had load tested against. I was able to use VisualVMs graphing function to track various data and within a minute or so could see that there was an unexpected correlation between two sets of events. (The number of messages sent to mobile phones and the number of identification requests made to a network component were drawing the same shaped graph, with a slight lag between the first and second sets of data and an order of magnitude difference in volume). Again these events were both within tolerances. Yes something unexpected was occurring. No it was not going to kill the system right now. We went to lunch instead of pulling the plug.

The data we collected pointed us in the right direction and we were able to find, again using VisualVM, that a database connection pool had been incorrectly set to a tenth of its intended size. The Ops guys made some tuning changes to the configuration based on what we had discovered. The application stayed up through the peak period.

In summary, log files are essential but there is still a need for real time, pre-processed data available via a easy to access channel. MBeans hit the spot in the Java world. Developers should not be scared of calculating real time statistics, like average durations, on the fly. They do need to make sure that the system does not spend a disproportionate amount of resources monitoring itself rather than delivering its function.




Add a comment Send a TrackBack