Oracle performance analysis has come a long way in the last 20 years. First there was the "just add more resources" approach, then came ratio analysis, and finally, wait event analysis emerged. But those approaches were not complete or broad enough, so in 2001 I published a paper called Oracle Response Time Analysis (RTA). The RTA article brought together and detailed the two elements of response time: service time and queue time. One of the more subtle aspects of RTA is understanding the relationship between service time and queue time and how that relates to the operating system.
The response time a user feels is the result of many interrelated systems. The Oracle server is just one of these systems. The time spent in each system can be divided into service time and queue time. We could just add up all the service time and all the queue time to determine the end user's response time. But in reality, sometimes the effort is not worth the results…not to mention extremely difficult to do. But what we can do that's definitely worth the minimal effort is to get service time and queue time from an Oracle perspective. In addition, we can get details about how this relates to the database server's CPU subsystem.
Let's take a look at these basic Oracle-focused response time formulas:
Just one of the many things we can learn from analyzing response time is that service time has a limit but queue is limitlessness. Each CPU subsystem has a maximum fixed amount of CPU power it can provide. If you have a one-CPU machine, that machine can provide a maximum of 60 seconds of CPU each minute. If you have a 10 CPU machine, it can provide a maximum of 600 seconds of CPU each minute. The situation is distinctly different with queue time.
Queue time is not fixed and is limited only by the workload. If the workload is relatively small, queue time may be near zero. But keep increasing the workload and queue time will reach infinity -- it has no limit.
The notion of the limitlessness of queue time gives rise to two concepts that we need to consider. First, if Oracle is consuming all the available CPU, then asking for more CPU will increase service time and can also increase Oracle wait time. The result is an increase in response time. This is bad, very bad. This means our solutions need to carefully weigh how they will affect the CPU subsystem. (This concept is elaborated upon in my new paper entitled, Oracle Wait Interface Deceit.) The second concept is that we now have another way to view a very dynamic system. This not only helps our understanding of the system but it allows us to help others understand a potentially very complex Oracle-based system.
For example, consider the figure below. The data was gathered from a real-life production Oracle system. Each hour, the response time components were gathered and summarized. The queue time was gathered from v$system_event and the service time was gathered from v$sysstat. By looking at this graph, if performance is poor and all non-Oracle server architecture components are shown not to be the bottleneck, the bottleneck would rest on the Oracle server. Armed with the graph below, we could infer there is a serious IO subsystem bottleneck or locking/blocking issues. Perhaps at around 2200 there is a CPU bottleneck, but the rest of the time is definitely an IO bottleneck or locking/blocking issues.
Let's look at this in more detail. It all centers on knowing the percentage of available CPU power Oracle is consuming from the operating system. Look closely at the figure above. Let's start with the CPU subsystem. Since approximately a maximum of 1000 minutes of CPU time was consumed in a single hour, we know there must be at least 17 CPUs. A worst-case scenario would be the CPU resources used equaled the CPU resources available. That combined with the numbers taken from the above graph yields:
CPU available = CPU used X CPUs * 60 minutes/hour = 1000 minutes/hour X CPUs = 1000 minutes/hour / 60 minutes/hour X CPUs = 16.67
So there must be at least 17 CPUs on the database server. If we know from monitoring the operating system that CPU utilization was around 50% during 2200, we can also infer there could be around 34 CPUs (~16.67 X 2).
Now let's put this information together so that it's useful. Notice that around lunchtime, Oracle is not consuming up to the 1000 minutes of CPU each hour. We know the server can supply 1000 minutes of CPU each hour, yet at around lunch time it supplied less than 500 minutes of CPU time each hour, and users are not satisfied with performance. Therefore, we can infer that the queue time (i.e., Oracle wait events) is not CPU related but either IO related or locking/block (e.g., enqueue wait) related. Powerful information!
We know that understanding and applying Oracle response time components increases our performance prowess. But understanding how Oracle service time relates to the operating system CPU takes us to yet another level. Without understanding the percentage of available CPU Oracle is consuming from the operating system, we simply don't know if there is plenty of CPU available or not. Knowing the CPU percentage allows our performance diagnosis to be more accurate which should lead to more powerful solutions. Understanding the limits and limitlessness of time will help not only you but also your peers and the people you need to influence. The better they understand the situation, the more likely they will agree with your proposed course of action. In other words, it makes your performance analysis stronger and it builds trust. And that is something all DBAs can use a little bit more of.