Oracle I/O latency monitoring

August 29th, 2013


stopwatchOne thing that I have found sorely missing in the performance pages of Enterprise Manager is latency values for various types of I/O. The performance page or top activity may show high I/O waits but it won’t indicated if the latency of I/O is unusually high or not. Thus I put together a shell script that shows latency for the main I/O waits

  • db file sequential read
  • db file scattered read
  • log file parallel write
  • direct path reads
  • direct path reads temp

Of course it would be nice to add a few others like direct path writes, direct path writes temp and log file sync but there is only so much room in the screen width.

oramon.sh

 

The script is called oramon.sh and is available on github at

https://github.com/khailey/oramon/blob/master/oramon.sh

Example:

$  oramon.sh
Usage: oramon.sh [username] [password] [host] [sid] <port=1521> <runtime=3600>

$ ./oramon.sh system sys 172.16.100.81 vsol
RUN_TIME=-1
COLLECT_LIST=
FAST_SAMPLE=iolatency
TARGET=172.16.100.81:vsol
DEBUG=0
Connected, starting collect at Wed Apr 18 18:41:13 UTC 2012
starting stats collecting

   single block       logfile write       multi block      direct read   direct read temp
   ms      IOP/s        ms    IOP/s       ms    IOP/s       ms    IOP/s       ms    IOP/s
   20.76    27.55    32.55      .71     3.50      .00      .00      .01      .00      .00
     .00      .20               .00               .00               .00               .00
   34.93   369.64   116.79     3.55               .00               .00               .00
   31.43   640.33    92.40     8.33               .00               .00               .00
   39.39   692.33   111.69     8.00               .00               .00               .00

The first line of output is the average since the database started up.
The subsequent lines are the averages since the last line which is 5 seconds by default.
One should be able to see immediately how much activity there is on the database and the latency for the basic types of database I/O.

Reads
Single block reads are the typical I/O from a database which would happen for example when reading a row in a table with indexes in place.
Multi block reads are common as well is which would happen when for example summing the values over all rows in a table.
Direct reads are less common but quite normal and happen almost exclusively for parallel query though may be used for other activities especially in newer version of Oracle such as 11.2. Direct reads are multiblock reads that by pass the Oracle buffer cache. The size varies from a datablock, such as 8k to 1MB.
Direct read temp happens when a sort has overflowed memory limits and been written to disk. Direct reads temp are multiblock reads that by pass the Oracle buffer cache. The size varies from a datablock, such as 8k to 1MB.

Writes
Logfile writes are the only writes that database users wait for in general. Actually users only wait when the commit, which then is a wait for a signal from the log writer that their particular redo data is on disk which could have already happened. Typically the user wait time is a bit slower than the logwrite time but in general it’s close, ie within a few milliseconds. The farther apart the user wait time is from the log write time the more likely there is a CPU, paging or other concurrency problem on the VDB host slowing down the users signalling and wake up time.

Rising profits

oramon.sql : Oracle Latency Query

If for some reason the shell script isn’t able to connect to the database, then the same data can be collected manually by running the SQL query in SQL*Plus by hand.
The following two SQL queries, oramon_setup.sql and oramon.sql are available on github at

https://github.com/khailey/oramon

If you want to see the latencies over periods shorter than 60s, then you have to collect the values of the cumulative counters at time A, then again at time B and take the difference. The following two queries, oramon.sql and oramon_setup.sql, are available on ftp site

Run oramon_setup.sql *once*
  column seq_ms for 9999.99
   column seq_ct for 9999.99
   column lfpw_ms for 9999.99
   column lfpw_ct for 9999.99
   column seq_ms for 9999.99
   column scat_ct for 9999.99
   column dpr_ms for 9999.99
   column dpr_ct for 9999.99
   column dprt_ms for 9999.99
   column dprt_ct for 9999.99
   column prevdprt_ct new_value prevdprt_ct_var
   column prevdprt_tm new_value prevdprt_tm_var
   column prevdpwt_ct new_value prevdpwt_ct_var
   column prevdpwt_tm new_value prevdpwt_tm_var
   column prevdpr_ct new_value prevdpr_ct_var
   column prevdpr_tm new_value prevdpr_tm_var
   column prevdpw_ct new_value prevdpw_ct_var
   column prevdpw_tm new_value prevdpw_tm_var
   column prevseq_ct new_value prevseq_ct_var
   column prevseq_tm new_value prevseq_tm_var
   column prevscat_ct new_value prevscat_ct_var
   column prevscat_tm new_value prevscat_tm_var
   column prevlfpw_ct new_value prevlfpw_ct_var
   column prevlfpw_tm new_value prevlfpw_tm_var
   column prevsec new_value prevsec_var
   select 0 prevsec from dual;
   select 0 prevseq_tm from dual;
   select 0 prevseq_ct from dual;
   select 0 prevscat_ct from dual;
   select 0 prevscat_tm from dual;
   select 0 prevlfpw_ct from dual;
   select 0 prevlfpw_tm from dual;
   select 0 prevdprt_ct from dual;
   select 0 prevdprt_tm from dual;
   select 0 prevdpwt_ct from dual;
   select 0 prevdpwt_tm from dual;
   select 0 prevdpr_ct from dual;
   select 0 prevdpr_tm from dual;
   select 0 prevdpw_ct from dual;
   select 0 prevdpw_tm from dual;
   column prevdprt_ct noprint
   column prevdprt_tm noprint
   column prevdpwt_ct noprint
   column prevdpwt_tm noprint
   column prevdpr_ct noprint
   column prevdpr_tm noprint
   column prevdpw_ct noprint
   column prevdpw_tm noprint
   column prevseq_ct noprint
   column prevseq_tm noprint
   column prevscat_ct noprint
   column prevscat_tm noprint
   column prevlfpw_ct noprint
   column prevlfpw_tm noprint
   column prevsec noprint

Run following query to see the current latency for

  • single block read
  • log file parallel write
  • multi-block read

oramon.sql

select
        round(seqtm/nullif(seqct,0),2) seq_ms,
        round(seqct/nullif(delta,0),2) seq_ct,
        round(lfpwtm/nullif(lfpwct,0),2) lfpw_ms,
        round(lfpwct/nullif(delta,0),2) lfpw_ct,
        round(scattm/nullif(scatct,0),2) scat_ms,
        round(scatct/nullif(delta,0),0) scat_ct,
        round(dprtm/nullif(dprct,0),2) dpr_ms,
        round(dprct/nullif(delta,0),2) dpr_ct,
        round(dprttm/nullif(dprtct,0),2) dprt_ms,
        round(dprtct/nullif(delta,0),2) dprt_ct,
        prevseq_ct, prevscat_ct, prevseq_tm, prevscat_tm, prevsec,prevlfpw_tm,prevlfpw_ct
        , prevdpr_ct, prevdpr_tm , prevdprt_ct, prevdprt_tm , prevdpw_ct, prevdpw_tm
        , prevdpwt_ct, prevdpwt_tm
from
(select
       sum(decode(event,'db file sequential read', round(time_waited_micro/1000) -  &prevseq_tm_var,0)) seqtm,
       sum(decode(event,'db file scattered read',  round(time_waited_micro/1000) - &prevscat_tm_var,0)) scattm,
       sum(decode(event,'log file parallel write',  round(time_waited_micro/1000) - &prevlfpw_tm_var,0)) lfpwtm,
       sum(decode(event,'db file sequential read', round(time_waited_micro/1000) ,0)) prevseq_tm,
       sum(decode(event,'db file scattered read',  round(time_waited_micro/1000) ,0)) prevscat_tm,
       sum(decode(event,'log file parallel write',  round(time_waited_micro/1000) ,0)) prevlfpw_tm,
       sum(decode(event,'db file sequential read', total_waits - &prevseq_ct_var,0)) seqct,
       sum(decode(event,'db file scattered read',  total_waits - &prevscat_ct_var,0)) scatct,
       sum(decode(event,'log file parallel write',  total_waits - &prevlfpw_ct_var,0)) lfpwct,
       sum(decode(event,'db file sequential read', total_waits ,0)) prevseq_ct,
       sum(decode(event,'db file scattered read',  total_waits ,0)) prevscat_ct,
       sum(decode(event,'log file parallel write',  total_waits ,0)) prevlfpw_ct,
       sum(decode(event,'direct path read',  round(time_waited_micro/1000) - &prevdpr_tm_var,0)) dprtm,
       sum(decode(event,'direct path read',  round(time_waited_micro/1000) ,0)) prevdpr_tm,
       sum(decode(event,'direct path read',  total_waits - &prevdpr_ct_var,0)) dprct,
       sum(decode(event,'direct path read',  total_waits ,0)) prevdpr_ct,
       sum(decode(event,'direct path write',  round(time_waited_micro/1000) - &prevdpw_tm_var,0)) dpwtm,
       sum(decode(event,'direct path write',  round(time_waited_micro/1000) ,0)) prevdpw_tm,
       sum(decode(event,'direct path write',  total_waits - &prevdpw_ct_var,0)) dpwct,
       sum(decode(event,'direct path write',  total_waits ,0)) prevdpw_ct,
       sum(decode(event,'direct path write temp',  round(time_waited_micro/1000) - &prevdpwt_tm_var,0)) dpwttm,
       sum(decode(event,'direct path write temp',  round(time_waited_micro/1000) ,0)) prevdpwt_tm,
       sum(decode(event,'direct path write temp',  total_waits - &prevdpwt_ct_var,0)) dpwtct,
       sum(decode(event,'direct path write temp',  total_waits ,0)) prevdpwt_ct,
       sum(decode(event,'direct path read temp',  round(time_waited_micro/1000) - &prevdprt_tm_var,0)) dprttm,
       sum(decode(event,'direct path read temp',  round(time_waited_micro/1000) ,0)) prevdprt_tm,
       sum(decode(event,'direct path read temp',  total_waits - &prevdprt_ct_var,0)) dprtct,
       sum(decode(event,'direct path read temp',  total_waits ,0)) prevdprt_ct,
       to_char(sysdate,'SSSSS')-&prevsec_var delta,
       to_char(sysdate,'SSSSS') prevsec
from
     v$system_event
where
     event in ('db file sequential read',
               'db file scattered read',
               'direct path read temp',
               'direct path write temp',
               'direct path read',
               'direct path write',
               'log file parallel write')
) ;

Output looks like

  SEQ_MS   SEQ_CT  LFPW_MS  LFPW_CT   SEQ_MS  SCAT_CT   DPR_MS   DPR_CT  DPRT_MS  DPRT_CT
-------- -------- -------- -------- -------- -------- -------- -------- -------- --------
  115.71   422.67    76.17    12.00               .00               .00               .00

The first execution of the query is I/O since database startup, so should most likely be ignored.
Subsequent executions are the I/O since the last execution

The columns are

  1. SEQ_MS: single block latency
  2. SEQ_CT: single block reads per second
  3. LFPW_MS: log file parallel write latency
  4. LFPW_CT: log file parallel write count per second
  5. SCAT_MS: multi-block latency
  6. SCAT_CT: multi-block reads per second
  7. DPR_MS: direct path read latency
  8. DPR_CT: direct path read count
  9. DPRT_MS: direct path read temp latency
  10. DPRT_CT: direct path read temp count
Instead of running the query by hand the script “oramon.sh” available at https://github.com/khailey/oramon/blob/master/oramon.sh (see top of page) will collect this info ever 5 seconds in a loop and output to standard out at the UNIX prompt

Simple but only once a minute

NOTE: the following is a simpler query but the data only updates once a minute
select
       n.name event,
       m.wait_count  cnt,
       10*m.time_waited ms,
       nvl(round(10*m.time_waited/nullif(m.wait_count,0),3) ,0) avg_ms
  from v$eventmetric m,
       v$event_name n
  where m.event_id=n.event_id
        and (
              wait_class_id= 1740759767 --  User I/O 
                   or
              wait_class_id= 4108307767 --  System I/O  
             )
        and m.wait_count > 0 ;

.


io, Oracle, performance, wait events
, ,

  1. Trackbacks

  1. Comments

  2. Yousuf
    September 12th, 2013 at 15:21 | #1

    Thanks for sharing the details …

    One questions. Last week my report was took more than 5 hours instead of 3 hour… When I check all inner & outer details with the help of tools & tracing ..etc.. but not found any interesting things even execution plan.. gather stats was also intact properly.

    When we did last check on storage level & found one storage path was dead & all traffic was moving only one path which cause this issue….after fixing again reports are back to normal 3 hours..

    Is there way can we find pro active this type of unexpected issues..

    Please request to shed some light & how to handle this type of issue..

    Thanks

    Yousuf

    Note: I have both good & bad AWR report of this issue.

  3. khailey
    September 12th, 2013 at 16:19 | #2

    @Yousuf: ASH will give you a break down of where the time was spent and why more time was spent on the 5 hour instead of the 3 hour. Once you identify the difference you can set an alert when that type of even occurs

  4. Yousuf
    September 17th, 2013 at 21:01 | #3

    Thanks for your reply…..

    Sure, I will validate the result as suggested with ASH (Active Session History) report..


4 − four =