Sampling Performance Data on Oracle

Oracle wait events took years to become widely known for their power in performance tuning. Similarly it has taken years for most DBAs to adopt Active Session History (ASH). ASH represents a revolution in performance tuning data as radical as the introduction of wait events.

Wait events revolutionized tuning way back in Oracle 7 by identifying performance bottlenecks in the database and explicitly quantifying their impact. Still lacking in the wait events interface was the ability to identify the causes of the problems. The solution has come with ASH. ASH clearly identifies the SQL and Sessions at the root of wait bottlenecks.

ASH ironically has be technically feasible by a savvy DBA since Oracle 7 but it took Oracle 10g for anyone to put this cutting edge technology to work, and in no small way, but by actually implanting it in the Oracle kernel and basing Oracle 10g performance tools on ASH.

ASH is founded upon a data gathering technique called “Sampling.” Sampling is a method of collecting data by taking snapshots every second. The snapshots record the state of every active session in the database. For every active session, Oracle records the SQL being executed as well as the state of the session. A session state can be running on the CPU or waiting on any of several hundred wait events. With this simple amount of information, the session + state + sql , we can quickly identify SQL that are CPU intensive, IO intensive or involved in any number of the possible resource contention issues that can arise in an Oracle database.

Looking at the situation more concretely, when there is a slow down on the database we look at our trusted AWR or statspack report for the period of the slowdown. The first step in analyzing the statspack report is to look at the Top 5 Timed Events

The top 5 timed wait events will tell us if any wait event has crept up to cause a bottleneck. If we do find a wait event bottleneck we will want to know who or what is causing the problem in order to solve it. For example if there is a CPU bottleneck, we want to know what SQL statement is hogging the CPU. If there is an IO bottleneck we want to know what SQL statement is stuck on IO and needs tuning. If there is a complex situation like a buffer busy waits or latch contention we want to know which sessions were involved, what the wait event arguments were and what SQL they were executing. Statspack fails to give us the necessary detailed information but ASH does.

Real Life Example

Let’s look at a real life example. This example is one of the harder wait bottleneck to solve, so if you can follow this one then it’s downhill from here. This example is one of the more complex tuning problems and thus the analysis requires a deep understanding of Oracle wait events.

Imagine a situation where users call into the help desk complaining that the application has slowed down. The help desk determines it’s not the application and calls me, the DBA, to fix it. To solve the situation I run the statspack re-ports over the last hour and look at “Top 5 Timed Events”


Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                        % Total
Event                 Waits    Time (s) Call Time
--------------------------- ----------- ---------
buffer busy waits     2,748         250     78.72
CPU time                             32     10.16
free buffer waits     1,588          15      4.63
write complete waits     10           8      2.51
log buffer space        306           5      1.51
-------------------------------


Sure enough there is a bottleneck and the bottleneck is on “buffer busy waits”, but now what do I do to solve it? There are some guesses I can make, and I can even use some data in statspack to help guess, but there is no way to conclusively analyze this wait bottleneck or most wait events with statspack. To solve most wait event bottlenecks we want more detailed information than is available in statspack. What information do we need? We mainly want to know what

  • SQL
  • Sessions
  • Objects
  • wait details

are involved. How do we get that information? We get that information by sampling v$session_wait, i.e. via ASH. We also have to understand how to solve each wait event and specifically in this case how to solve “buffer busy waits.” A complete explanation of buffer busy see the previous blog post on buffer busy waits.

How do we find the objects and SQL involved in the buffer busy waits? The information we want can be found in a table v$session_wait when the problem exists but v$session_wait only represents the current state of the system and if our problem is intermittent or in the past then the information we want from v$session_wait will be lost. V$session_wait’s lack of history is where the new Oracle 10g concept Active Session History (ASH) comes in. ASH at its simplest is just reading v$session_wait (actually v$session which includes v$session_wait starting in 10g) every second and storing the data. Oracle does this automatically in 10g but it has been possible in every version since Oracle 7 by running the following query:
column sid format 999
column username format a10
column serial format 99999
column hash_value format 9999999999
column event format a15
column p1 format 9999999999
column p2 format 99999999
column p3 format 9999
select
   nvl(s.username,s.program) username,
   s.sid sid,
   s.serial# serial,
   s.sql_hash_value hash_value,
   substr(decode(w.wait_time,
          0, w.event,
         'ON CPU'),1,15) event ,
   w.p1  p1,
   w.p2  p2,
   w.p3  p3
from         v$session        s,
    v$session_wait   w
where     w.sid=s.sid
and s.status='ACTIVE'
and s.type='USER';

From the above query we get output like this.

USERNAME   SID   SERIAL   HASH_VALUE EVENT                P1       P2       P3
---------- ----- -------- ---------- -------------------- -------- -------- ----
SYS        64    8717     4116021597 PL/SQL lock timer    300      0        0
SYS        58    19467     961168820 ON CPU               16508152 1        0
STARGUS    71    6251     1311875676 direct path write    201      2155902  127
(CJQ0)     9     1                 0 rdbms ipc message    500      0        0


In the case of the buffer busy wait, I’ve been running the above query and storing it in a table that I call v$ash. I can now read v$ash and look for the detail information I need to solve the buffer busy wait problem. First I want to know what block type the buffer busy waits are on.

SQL> Select count(*), p3
from v$ash
where event = 'buffer busy waits'
group by p3;

COUNT(*)    P3
---------- ----
3423          1

All the waits are on block type 1. What is block type 1? we can check in v$waitstat translate block type 1 into the actual block type:


select rownum n,
 class 
 from v$waitstat;


 N CLASS 
--- ------------------
 1 data block
 2 sort block
 3 save undo block
 4 segment header
 5 save undo header
 6 free list
 7 extent map 
 8 1st level bmb
 9 2nd level bmb
 10 3rd level bmb
 11 bitmap block
 12 bitmap index block
 13 file header block
 14 unused
 15 system undo header
 16 system undo block
 17 undo header
 18 undo block

 Block type 1 is a data block.

When there are buffer busy waits on block type 1 it is typically an insert problem. The problem could be either on index or a table. Depending whether it is a table or an index with the problem we take different actions. Lets find out what object the buffer busy wait is on.

We can find this out by looking at p1 and p2 for the buffer busy waits which represents the file and the block:

select count(*),  p1 filen, p2 blockn , hash_value
from v$ash
where event='buffer busy waits'
group by p1, p2, hash_value;

COUNT(*) FILEN    BLOCKN
---------- -------- --------
1   11       90644
2   11       90651
3   11       98233
1   11       104767
3   11       113291
1   11       119842
1   11       119856
3   11       121632
1   11       126334


I’ll just pick the last block and find out what object it is coming from:

ccolumn segment_name format a30
select owner,
segment_name,
segment_type,
block_id, blocks+block_id
from dba_extents
where file_id = 1
and 126334 between block_id AND block_id + blocks-1;

OWNER      SEGMENT_NAME       SEGMENT_TY   BLOCK_ID BLOCKS+BLOCK_ID
---------- ------------------ ---------- ---------- ---------------
SYSTEM     TOTO1               TABLE          125201          127249

From this result I know it’s a data block of a table and not a header block (because the header block would be the first block in the object). Now the question is what are the sessions doing? I can look at the sql and see what they are executing:

SQL> select count(*),  p1 filen, p2 blockn , hash_value
2  from v$ash
3  where event='buffer busy waits'
4  group by p1, p2, hash_value;

COUNT(*) FILEN    BLOCKN   HASH_VALUE
---------- -------- -------- ----------
3 1        94609     558666863
2 11       81163     558666863
2 11       87123     558666863
1 11       90644     558666863
2 11       90651     558666863
3 11       98233     558666863
1 11       104767    558666863
3 11       113291    558666863
1 11       119842    558666863
1 11       119856    558666863
3 11       121632    558666863
1 11       126334    558666863

I can see that the same SQL is causing all the “buffer busy waits” because the hash value is the same. Now I can find out what that sql statement is by:

select sql_text
from v$sqltext
where hash_value=558666863;

SQL_TEXT
-----------------------------------------------------
INSERT into toto1 values (:b1,lpad('a',1000,'a'))

With this final piece of information it looks like the segment is probably missing free lists or not in an ASSM tablespace. Let’s check these possibilities.


Freelists:

select  FREELISTS, TABLESPACE_NAME
from dba_tables
where table_name='TOTO1' and owner='SCOTT';

FREELISTS TABLESPACE_NAME
---------- ------------------------------
1   USERS


Automatic Segment Space Management:

select tablespace_name, SEGMENT_SPACE_MANAGEMENT
from dba_tablespaces;

TABLESPACE_NAME                SEGMEN
------------------------------ ------
SYSTEM                         MANUAL
UNDOTBS1                       MANUAL
TEMP                           MANUAL
USERS                          MANUAL
TOOLS                          AUTO


We see that that the table is in a table space, USERS, that does not have ASSM active.


In summary, the table TOTO1 has wait contention on writes to blocks and multiple sessions are concurrently inserting into it and it has no free lists nor is it in an Automatic Segment Space Managed (ASSM) table space. This causes all sessions inserting to insert into the same block, the last free one on the one free list and thus fight over exclusive write access to that last block. The solutions is to either put the segment into a ASSM table space and let Oracle handle the contention or put free lists on the segment which is more precise. If I choose free lists, then how many do I need? I need as many as there are concurrent users inserting into the table which I can get with

select count(*), sid, serial  from v$ash where event='buffer busy waits'
group by sid, serial;

COUNT(*) SID   SERIAL
---------- ----- --------
7 12    79
4 14    99
4 19    770
8 20    176

There were 4 sessions inserting during the period of ASH data I looked at, thus in my case I need 4 free lists.

Extended Power of ASH


The above example is simplified. Two other important pieces of data missing in the above example that also need to be tracked are sql_text and sample_time. Sample_time is required to group data by performance bottleneck time periods and sql_text should be collected on an ongoing basis in case the SQL found in v$session_wait gets kicked out of the shared pool. At the end of the day collecting ASH becomes quite complicated with the need to partition and purge data, send data to a different repository to reduce overhead on the target and the need to collect other bits of information like maps of user# to username and object# to object name. To see a package that does much of this, refer to
http://ashmasters.com

 

Even with a good collection running, it takes some nontrivial sql to mine the richness of the data.
Even with a good set of scripts to analyze the data it takes a number of well written sql statements to drill down sufficiently into the problem.

 

Luckily Oracle EM does all of the above for you. Oracle collects sampled data in v$active_session_history, displays is powerfully and graphically in Top Activity screen and provides automated analysis with ADDM.

 

On top of this, the data is displayed in a graphically in Top Activity condenses the equivalent dozens of sql statements that would take time and effort to run and comprehend into visual that can be immediately understood.
 
 
Even in this reduced graphic we can quickly and easily see that there is a performance problem on the machine. In this case there is clearly a CPU bottleneck, because the CPU (in the top chart, the green) is above the maximum CPU on the machine (the dotted red line). This means that the machine is at 100% CPU and there are processes in the run queue. We can also see the chart at the bottom left that one SQL statement is using much more CPU than any other sql statement, so this is the statement to tune. It only takes a second to scan the graphic and see where the problem is and it’s all based on ASH.

 

Summary
ASH is a leap in performance tuning technology. It took a clear vision of the future and a bit of a leap of faith to let go of the compulsive need to have exact measurements and instead embrace sampling a technology that statistically approximates the same value with little loss of accuracy but at the same time brings a great wealth of information necessary for analyzing the performance of a database.

 

ASH can be simulated on any version of Oracle since v7 with the simple act of repeatedly sampling and storing the values from v$session_wait into a historic table. Although the idea is easy, the implementation turns out to be fairly complex. Luckily Oracle implements not only ASH in version 10g but more importantly Oracle implements a diagnostic package that reads the data, analyzes the data and report the findings and solutions automatically so you don’t even need to know all the esoteric wait analysis. The best part of all is Oracle EM exposes all the data and analysis graphically provide an interface that communicates quickly and clearly the date of the system.


Uncategorized

  1. Trackbacks

  2. No trackbacks yet.
  1. Comments

  2. No comments yet.


+ three = 8