How do we do faster Root Cause Analysis (RCA) without Performance Impact?

As we develop enterprise level software products and customers use it, many times we come across situations where in to find the root cause of issues faced by customer using our product is difficult, especially with the default log levels. Increasing the log level for better tracing, typically has some performance impact which may not be acceptable to customer running the product in their production workloads where performance is of utmost importance.

I have been thinking lately about how this problem can be addressed. One of the ideas that came to my mind was to look at what part of logging really has the major performance impact. As we see, making a log entry has 2 parts to it:

1. Creating the log string in the memory.
2. Writing the log string to the log file.

As one would guess so, and can be proved, that writing the log string to the log file could have the maximum performance impact. One of the thoughts that came to my mind was that we really don't need the entire trace of execution for multiple days/months, even if we were to get the traces of the last hour or a few hours of execution before failure, it should most certainly help us arrive at the root cause of the issue. Hence why not focus on maintaining only that much limited trace in memory and keep on adding latest to it, and when the in-memory log queue reaches a maximum size limit, we just remove the log entry that was added first (FIFO logic), and add the new entry to the queue. And once an exception occurs, that may need RCA, we just flush out the in-memory trace to a log file and empty the log queue again.

This approach is good, in the aspect that we are not writing to file for each log entry and in fact not writing for all entries, just writing when an exception occurs and for only a limited set of relevant entries. In fact a sample implementation that i did of this approach showed that it took 1/3rd the time (in nanoseconds) to add a log entry into an in-memory queue versus writing the same log entry in a file, the same was tried for 1000 log entries in a loop to get total time taken. However, we are still doing the operations of creating log string and adding it in in-memory log queue which should have some performance impact for sure.

Is there a better approach that can be taken to reduce the performance impact further while still getting the required traces to do faster root cause analysis? Any ideas or suggestions are welcome.

Pad antar - a step forward...

Thursday, 27 July 2017

How do we do faster Root Cause Analysis (RCA) without Performance Impact?

No comments:

Post a Comment