Before Blaming the Database: Key Things to Check

Performance tuning is no longer a craft of reading EXPLAIN PLAN and guessing. It is statistical forensics — excavating microseconds from ASH samples, understanding adaptive execution, and surgically removing chaos before it becomes catastrophe.

The Oracle database knows everything that happened to every session at every microsecond. The only question is whether you know how to ask.

The Day Production Died at 3 AM:
📖 True Story (Anonymised)

Singapore. Tuesday night. A fintech platform's nightly batch completes. Wednesday 03:17 AM — a single SQL statement that ran in 0.3 seconds yesterday now takes 47 seconds. It's called 12,000 times per minute. The connection pool explodes. The app tier goes dark. Trading opens in four hours.

The on-call DBA connects. Runs EXPLAIN PLAN. "Looks fine," he says. He runs it again. Still 47 seconds. He checks indexes. All present. He checks statistics. Recent. He reboots the instance at 04:30 — and this buys 20 minutes before it collapses again.

At 05:15, a senior architect joins. She doesn't look at the execution plan. She opens ASH. In 90 seconds she finds the answer: the SQL's execution plan had silently mutated two hours ago — from an index range scan delivering 400 rows to a full table scan delivering 1.2 million. The optimizer had picked a new plan because a statistics job had partially refreshed one partition, skewing cardinality estimates by 3,000x.

She pinned the old plan via SPM baseline. Trading opened on time. The DBA who rebooted the instance was staring at symptoms. She was reading the history of decisions.

"The execution plan tells you what the optimizer decided. ASH tells you what actually happened to real sessions, at real moments in time. One is a blueprint. The other is a flight data recorder."
— Maria Chen, Principal Oracle Architect, spoken at OracleWorld 2025

This blog takes you from knowing nothing about AWR to being the person who solves the 3 AM crisis in 90 seconds.

What Is AWR?

⬡ Knowledge Level: Zero → Foundation

Imagine your car's engine but instead of a warning light that turns on only when the engine has already blown, you had a co-pilot who silently wrote down every RPM, fuel pressure, and temperature reading every 60 seconds — and stored those snapshots for the last 8 days.

That is AWR: Automatic Workload Repository. Oracle's background machinery takes periodic snapshots of over 200 performance metrics, stores them in the SYSAUX tablespace, and gives you the ability to answer the question: "What exactly was happening in this database between 02:00 and 02:30 last Tuesday?"

60

Default snapshot interval (minutes)

8

Default retention (days)

200+

Metrics captured per snapshot

~200MB

Typical weekly storage footprint

AWR vs STATSPACK vs Manual Tracing

Tool	Granularity	Overhead	Historical	Best For
AWR	60-min snapshots	~0.1%	Yes – 8 days+	Trend analysis, regression detection
ASH	1-second samples	~0.5%	Yes – in-memory + AWR	Real-time forensics, top-N waits
STATSPACK	Manual snapshots	~0.5%	Manual setup	Pre-11g, SE2 editions
SQL Trace	Per-call microseconds	High (specific session)	No	Deep single-SQL forensics

💡 Architect's Tip

Increase AWR retention to 30 days on production and reduce interval to 30 minutes for high-frequency OLTP systems. The storage cost is trivial; the diagnostic window you gain is priceless.

SQLSet AWR to 30-day / 30-min

-- Run as SYSDBA
BEGIN
  DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(
    retention  => 43200,   -- 30 days in minutes
    interval   => 30        -- every 30 minutes
  );
END;
/

ASH — The Database's Time Machine

⬡ Knowledge Level: Foundation

AWR tells you averages over an hour. That's like a doctor who only checks your pulse at annual checkups. ASH — Active Session History — checks your pulse every second, for every session that was actively doing work.

Every second, Oracle samples all sessions in a non-idle wait state and writes a row into the circular in-memory buffer V$ACTIVE_SESSION_HISTORY. Every AWR flush (default 60 min), 1-in-10 rows persist to DBA_HIST_ACTIVE_SESS_HISTORY on disk.

The genius of ASH is statistical validity with minimal overhead. At 1 sample/second across active sessions, you can reconstruct what any SQL was doing, what it was waiting for, and who was blocking it — for any moment in the past.

The Full Diagnostic Stack

◈ Knowledge Level: Intermediate

When an incident arrives, most DBAs start in the middle of the stack. Expert architects start at the symptom layer and methodically traverse downward. Here is the complete Oracle performance diagnostic architecture.

L7

Symptoms Layer — What Users Experience

Slow page loads · ORA-04031 out of shared pool · Runaway CPU · Lock contention · Buffer busy waits · Connection pool exhaustion

APM AlertsApp Error LogsOEM Alerts

L6

Real-Time Diagnosis — What's Happening NOW

Live session state, current SQL, current wait event — millisecond precision

V$SESSIONV$SQLV$ACTIVE_SESSION_HISTORYV$SQL_PLAN_MONITOR

L5

Historical Analysis — What Happened Before

AWR snapshots, top SQL by elapsed/LIO/CPU, wait event trends across time windows

DBA_HIST_ACTIVE_SESS_HISTORYDBA_HIST_SQLSTATAWR ReportsADDM

L4

SQL Plan Intelligence — Why the Plan Changed

Execution plan history, plan stability monitoring, adaptive plan feedback loops, cardinality feedback

DBA_HIST_SQL_PLANDBA_SQL_PLAN_BASELINESV$SQL_SHARED_CURSOR

L3

Adaptive Query Plans — The Living Optimizer

Join method switching mid-execution, statistics feedback loops, re-optimization triggers

V$SQL_OPTIMIZER_ENVV$SQL_REOPTIMIZATION_HINTSDBMS_XPLAN

L2

Plan Management — Controlling the Optimizer

SQL Plan Baselines, plan evolution, SQL Quarantine, Resource Manager profiles

DBMS_SPMDBMS_SQLQDBMS_RESOURCE_MANAGER

L1

Storage/Compute Offload — Where Exadata Lives

Smart Scan, Storage Indexes, Bloom Filters, In-Memory Column Store, Hybrid Columnar Compression

V$CELL_THREAD_HISTORYV$INMEMORY_FASTSTART_AREAV$HM_RUN

L0

Resolution — The Actual Fix

Add/rebuild index · Rewrite SQL · Hints · Partition strategy · Refresh statistics · Pin plan baseline · SQL Quarantine · Resource Plan

Measure LIO reductionElapsed timeBuffer getsCPU consumptionWait event time

🏛 Architecture Principle

Never skip layers. A DBA who jumps from "slow page load" directly to "I'll add an index" has skipped six layers of evidence. The index might make it faster for this query while masking the real architectural problem: a missing partition, a bloated SGA, or a toxic connection pattern generating 50,000 hard parses per second.

ASH Forensics — SQL You Run Tonight

◈ Knowledge Level: Intermediate → Advanced

These are not textbook queries. These are the exact statements a senior DBA reaches for the moment an incident starts. Copy them. Save them. Run them at 3 AM.

① Top Wait Events — Last 60 Minutes

First query in every incident. Shows you exactly where DB time is being consumed right now. 1 sample = 1 second of active session time.

SQLash_top_waits.sql — First query every incident

-- Top wait events by DB time — last 60 min
-- 1 sample ≈ 1 active second. % column = % of total DB time.
WITH ash AS (
  SELECT
    event,
    session_state,
    wait_class,
    sql_id,
    COUNT(*) AS samples,
    ROUND(COUNT(*)*100 / SUM(COUNT(*)) OVER(), 2) AS pct_db_time
  FROM   V$ACTIVE_SESSION_HISTORY
  WHERE  sample_time >= SYSDATE - 1/24
  GROUP BY event, session_state, wait_class, sql_id
)
SELECT
  COALESCE(event, 'ON CPU')  AS wait_event,
  wait_class,
  samples,
  pct_db_time,
  sql_id
FROM  ash
ORDER BY samples DESC
FETCH FIRST 15 ROWS ONLY;

② Blocking Session Chain — Who Is Killing Whom

When you see lock-related waits in query ①, run this immediately. It reconstructs the full blocking chain including the root holder.

SQLash_blocking_chain.sql — Lock tree reconstruction

-- Reconstruct lock holder → waiter chain from ASH
SELECT
  LEVEL                           AS chain_depth,
  LPAD(' ', (LEVEL-1)*2)
    || session_id                  AS session_id,
  blocking_session,
  event,
  sql_id,
  TO_CHAR(sample_time,'HH24:MI:SS') AS sampled_at,
  seconds_in_wait
FROM  V$ACTIVE_SESSION_HISTORY
WHERE sample_time >= SYSDATE - 10/1440  -- last 10 minutes
  AND   wait_class = 'Concurrency'
CONNECT BY PRIOR blocking_session = session_id
START WITH blocking_session IS NULL
ORDER SIBLINGS BY seconds_in_wait DESC;

③ Top SQL Heatmap — ASH Minute-by-Minute

Shows you exactly which SQL IDs were consuming the most DB time at each minute during an incident window. Essential for pinpointing the moment a regression hit.

SQLash_heatmap.sql — Minute-by-minute SQL attribution

-- Top SQL per minute — pipe this into a spreadsheet heatmap
SELECT
  TRUNC(sample_time, 'MI')          AS minute_bucket,
  sql_id,
  SUM(time_waited) / 1e6           AS total_wait_sec,
  COUNT(*)                           AS ash_samples,
  MAX(event)                         AS dominant_event,
  MAX(wait_class)                    AS wait_class
FROM   V$ACTIVE_SESSION_HISTORY
WHERE  sample_time BETWEEN
       TO_DATE('2026-05-18 02:00','YYYY-MM-DD HH24:MI')
       AND
       TO_DATE('2026-05-18 03:30','YYYY-MM-DD HH24:MI')
  AND   sql_id IS NOT NULL
GROUP BY
  TRUNC(sample_time, 'MI'), sql_id
ORDER BY minute_bucket, ash_samples DESC;

⚠ Common Mistake

Many DBAs query V$SQL for elapsed_time and divide by executions to find slow SQL. This is cumulative since cursor load — it doesn't tell you that the SQL was only slow for 40 minutes starting at 02:17 AM. ASH with a time filter does. Always add a sample_time predicate.

Adaptive Query Plans — Why Your Plan Lies

◈ Knowledge Level: Intermediate → Advanced

Since Oracle 12c, the optimizer is no longer a one-shot decision machine. It is a feedback loop. Understanding this is the difference between architects who fight the optimizer and those who work with it.

Diagnosing an Adaptive Plan Mid-Switch

SQLadaptive_plan_check.sql — Was this plan adaptive? Did it switch?

-- Check if a SQL used an adaptive plan and whether it switched
SELECT
  s.sql_id,
  s.child_number,
  s.is_resolved_adaptive_plan,    -- Y = final plan locked in
  s.is_reoptimizable,             -- Y = will try to reoptimize
  s.executions,
  ROUND(s.elapsed_time/NULLIF(s.executions,0)/1e6,3) AS avg_elapsed_sec,
  s.optimizer_cost
FROM  V$SQL s
WHERE s.sql_id = '&sql_id'
ORDER BY s.child_number;

-- Then view the adaptive plan detail — look for STATISTICS COLLECTOR node
SELECT * FROM TABLE(
  DBMS_XPLAN.DISPLAY_CURSOR(
    sql_id      => '&sql_id',
    format      => 'ADAPTIVE ALLSTATS LAST'
  )
);

💡 What to Look For

In the ADAPTIVE ALLSTATS LAST output, lines prefixed with - (dash) are plan lines that were considered but not used in the final execution. If you see a STATISTICS COLLECTOR node, the optimizer was buffering rows to make a mid-execution decision. The absence of this node in a repeated execution means it has already resolved the adaptive plan.

SQL Plan Regression Detection

★ Knowledge Level: Expert

The most dangerous performance event is the one that happened at 2 AM before anyone was watching. Statistics jobs, optimizer upgrades, plan invalidations — all happen silently. This is your early warning system.

Trigger: Statistics job completes on ORDERS table (02:14)

Automatic Stats Refresh

DBMS_STATS gathers new stats on a partitioned table. One partition has 10× the expected rows. Cardinality estimate for a join changes from 400 to 1,200,000.

Trigger: Next hard parse after shared pool flush (02:17)

Plan Hash Value Changes

Optimizer re-evaluates plan with new statistics. Decides a full table scan is cheaper than the index range scan + nested loop. New plan_hash_value: 4029847362 (was 1847293841).

Effect: 0.3s → 47s per execution (02:17 onward)

Silent Regression

No errors. No alerts. AWR won't flag it for another 38 minutes when the hour's average moves. ASH captures it in seconds.

Detection: Via regression query below

You Find It

The query compares the 7-day average to today's average. SQL_ID 5rw2a9q8xkf1m shows 156× regression. Plan hash changed at 02:17:03.

SQLsql_regression_detection.sql — Compare now vs 7-day baseline

-- Detect SQL regressions vs 7-day historical baseline
-- regression_x: how many times slower vs baseline
WITH
  baseline AS (
    SELECT
      sql_id,
      AVG(elapsed_time_delta / NULLIF(executions_delta, 0))
           AS avg_elapsed_base,
      AVG(buffer_gets_delta  / NULLIF(executions_delta, 0))
           AS avg_lio_base
    FROM   DBA_HIST_SQLSTAT
    WHERE  snap_id BETWEEN
       (SELECT MIN(snap_id) FROM DBA_HIST_SNAPSHOT
        WHERE  begin_interval_time >= SYSDATE-14)
       AND
       (SELECT MAX(snap_id) FROM DBA_HIST_SNAPSHOT
        WHERE  begin_interval_time < SYSDATE-7)
    GROUP BY sql_id
  ),
  recent AS (
    SELECT
      sql_id,
      AVG(elapsed_time_delta / NULLIF(executions_delta, 0))
           AS avg_elapsed_now,
      AVG(buffer_gets_delta  / NULLIF(executions_delta, 0))
           AS avg_lio_now,
      MAX(plan_hash_value)   AS current_plan_hash
    FROM   DBA_HIST_SQLSTAT
    WHERE  snap_id >= (
      SELECT MIN(snap_id) FROM DBA_HIST_SNAPSHOT
      WHERE  begin_interval_time >= SYSDATE-1)
    GROUP BY sql_id
  )
SELECT
    r.sql_id,
    ROUND(b.avg_elapsed_base/1e6, 3)        AS baseline_sec,
    ROUND(r.avg_elapsed_now/1e6, 3)         AS current_sec,
    ROUND(r.avg_lio_base)                     AS baseline_lio,
    ROUND(r.avg_lio_now)                      AS current_lio,
    ROUND(r.avg_elapsed_now /
          NULLIF(b.avg_elapsed_base, 0), 1)  AS regression_x,
    r.current_plan_hash,
    SUBSTR(
      (SELECT sql_text FROM DBA_HIST_SQLTEXT
       WHERE  sql_id = r.sql_id AND ROWNUM=1),
       1, 60
    ) AS sql_preview
FROM  recent r
JOIN  baseline b USING(sql_id)
WHERE r.avg_elapsed_now > b.avg_elapsed_base * 2   -- 2x slower or more
ORDER BY regression_x DESC
FETCH FIRST 20 ROWS ONLY;

Find the Exact Moment the Plan Changed

SQLplan_change_timeline.sql — When did the plan hash change?

-- For a given sql_id, show plan_hash_value timeline from AWR
-- Pivot this in Excel to see exactly when the plan mutated
SELECT
  sn.begin_interval_time,
  ss.plan_hash_value,
  ROUND(ss.elapsed_time_delta/NULLIF(ss.executions_delta,0)/1e6,3)
           AS avg_elapsed_sec,
  ROUND(ss.buffer_gets_delta/NULLIF(ss.executions_delta,0))
           AS avg_lio,
  ss.executions_delta                                           AS execs
FROM
  DBA_HIST_SQLSTAT ss
  JOIN DBA_HIST_SNAPSHOT sn USING(snap_id, dbid, instance_number)
WHERE
  ss.sql_id = '&sql_id'
  AND sn.begin_interval_time >= SYSDATE-3
ORDER BY
  sn.begin_interval_time;

SQL Quarantine — The Circuit Breaker

★ Knowledge Level: Expert

📖 Why This Exists

A single query consuming 100% CPU for 2 minutes, called by 500 concurrent sessions, can bring an entire cluster to its knees. Before SQL Quarantine (introduced in 18c), the options were: kill sessions manually, restart the instance, or add a blunt resource plan. SQL Quarantine lets you surgically target that specific execution plan of that specific SQL and terminate any session that would run it — within a threshold you define — while allowing other plans of the same SQL to execute normally.

SQL / PL·SQLquarantine_runaway.sql — Full quarantine lifecycle

-- STEP 1: Create quarantine entry targeting specific sql_id + plan
DECLARE
  v_handle VARCHAR2(100);
BEGIN
  -- force_matching=YES applies to all SQLs that normalize to same text
  v_handle := DBMS_SQLQ.CREATE_QUARANTINE_BY_SQL_ID(
    sql_id        => '5rw2a9q8xkf1m',
    force_matching => 'NO'
  );

  -- STEP 2: Set threshold — kill if elapsed time > 10 seconds
  DBMS_SQLQ.ALTER_QUARANTINE(
    quarantine_name => v_handle,
    action          => 'ENABLE',
    threshold       => DBMS_SQLQ.THRESHOLD_ELAPSED_TIME,
    value           => '00:00:10'    -- kill if > 10 seconds
  );

  -- Optionally also cap CPU time
  DBMS_SQLQ.ALTER_QUARANTINE(
    quarantine_name => v_handle,
    action          => 'ENABLE',
    threshold       => DBMS_SQLQ.THRESHOLD_CPU_TIME,
    value           => '00:00:05'    -- kill if CPU > 5 seconds
  );

  DBMS_OUTPUT.PUT_LINE('Quarantine: ' || v_handle);
END;
/

-- STEP 3: Verify quarantine is active
SELECT
  name,
  sql_id,
  plan_hash_value,
  enabled,
  cpu_time,
  elapsed_time,
  io_megabytes
FROM   DBA_SQL_QUARANTINE
WHERE  sql_id = '5rw2a9q8xkf1m';

-- STEP 4: When root cause is fixed, drop quarantine
BEGIN
  DBMS_SQLQ.DROP_QUARANTINE(quarantine_name => '&quarantine_name');
END;
/

⚠ Architect's Warning

SQL Quarantine is a tourniquet, not a cure. It stops the bleeding, but the root cause — bad statistics, missing index, unpartitioned table — still exists. The moment you remove the quarantine without fixing the underlying problem, the runaway query returns. Always schedule the root cause fix before disabling quarantine.

SPM Baselines — Controlling the Optimizer

★ Knowledge Level: Expert

SQL Plan Management (SPM) is a constitutional framework for the Oracle optimizer. While quarantine reacts to disasters, SPM prevents them. It says to the optimizer: "You may explore new plans, but you may not use them until a human validates they are better."

SQL / PL·SQLspm_capture_and_pin.sql — Full SPM lifecycle: capture, verify, enable

-- ═══════════════════════════════════════════════
-- SCENARIO: Pin known-good plan after a regression
-- ═══════════════════════════════════════════════

-- 1. Load the GOOD plan from AWR into SPM baseline
--    (good_plan_hash_value found from regression timeline query)
DECLARE
  v_loaded PLS_INTEGER;
BEGIN
  v_loaded := DBMS_SPM.LOAD_PLANS_FROM_AWR(
    begin_snap_id  => 28441,    -- snap before regression
    end_snap_id    => 28443,
    basic_filter   => 'sql_id = ''5rw2a9q8xkf1m''',
    fixed          => 'YES',    -- mark as FIXED (won't evolve)
    enabled        => 'YES'
  );
  DBMS_OUTPUT.PUT_LINE('Loaded: ' || v_loaded || ' plan(s)');
END;
/

-- 2. Verify the baseline was created
SELECT
  plan_name,
  sql_handle,
  sql_text,
  enabled,
  accepted,
  fixed,
  plan_hash_value,
  created,
  last_modified
FROM  DBA_SQL_PLAN_BASELINES
WHERE sql_id = '5rw2a9q8xkf1m'
ORDER BY created DESC;

-- 3. If new (unaccepted) plans exist — evolve them
--    This runs the new plan in shadow, compares to baseline,
--    accepts only if it's statistically better
DECLARE
  v_report CLOB;
BEGIN
  v_report := DBMS_SPM.EVOLVE_SQL_PLAN_BASELINE(
    sql_handle    => 'SYS_SQL_abc123def456',
    plan_name     => NULL,     -- evolve all unaccepted plans
    time_limit    => 60,       -- max 60 seconds
    verify        => 'YES',   -- must verify before accepting
    commit        => 'YES'    -- auto-accept if verified better
  );
  DBMS_OUTPUT.PUT_LINE(v_report);
END;
/

-- 4. Export baselines for cross-environment transport (Prod→UAT)
DECLARE
  v_plans PLS_INTEGER;
BEGIN
  v_plans := DBMS_SPM.PACK_STGTAB_BASELINE(
    table_name    => 'SPM_STGTAB',
    table_owner   => 'DBA_TOOLS',
    sql_handle    => 'SYS_SQL_abc123def456'
  );
  DBMS_OUTPUT.PUT_LINE('Packed: '||v_plans||' plans');
END;
/

The War Room Protocol

★ Knowledge Level: Expert

When the call comes at 3 AM, most teams lose 45 minutes running random queries in a panic. Here is the exact decision tree an expert follows. Each step takes under 2 minutes. Total time to diagnosis: under 10 minutes.

SQLwar_room_kit.sql — Emergency commands, all in one place

-- ① WHAT IS THE DB DOING RIGHT NOW?
SELECT wait_class, event, COUNT(*) sess
FROM   V$SESSION
WHERE  status = 'ACTIVE' AND type = 'USER'
GROUP BY wait_class, event ORDER BY 3 DESC FETCH FIRST 10 ROWS ONLY;

-- ② TOP SQL BY ELAPSED TIME RIGHT NOW
SELECT sql_id, child_number,
  ROUND(elapsed_time/NULLIF(executions,0)/1e6,3) avg_s,
  executions, loads, invalidations,
  SUBSTR(sql_text,1,80) sql_preview
FROM   V$SQL
WHERE  executions > 0
ORDER BY elapsed_time/NULLIF(executions,0) DESC
FETCH FIRST 10 ROWS ONLY;

-- ③ HARD PARSE STORM CHECK (shared pool contention)
SELECT ROUND(value) AS hard_parses_per_sec
FROM   V$SYSSTAT
WHERE  name = 'parse count (hard)';
-- > 100/sec = investigate bind variables or cursor_sharing

-- ④ KILL A SPECIFIC SESSION (use sid+serial from V$SESSION)
ALTER SYSTEM KILL SESSION '&sid,&serial#' IMMEDIATE;

-- ⑤ FORCE STATS REFRESH ON A SINGLE TABLE (targeted, not full gather)
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => 'SCHEMA_OWNER',
    tabname    => 'ORDERS',
    partname   => 'ORDERS_202505', -- NULL = all partitions
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
    degree     => 4,
    no_invalidate => FALSE         -- force cursor invalidation now
  );
END;
/

From Reactive to Prophylactic

Every query in this guide is reactive — something bad happened, now you diagnose it. That is table stakes. The real craft is building the infrastructure so that you know about the problem before the users do.

🏛 Architectural North Star

Schedule the regression detection query (Section 06) as a daily job. Alert when any SQL shows >3× elapsed time regression. You'll catch most plan instabilities within 24 hours — before they compound into incidents. At scale, run it after every DBMS_STATS gather window.

90 sec

Time to diagnosis with ASH (expert)

↓ vs 45 min reactive

0

Outages from known-plan regressions (with SPM)

↓ from 2–3/month

156×

Worst plan regression caught in case study

0.3s → 47s per call

10 min

War room protocol — incident to fix

↓ from ~90 min avg

"The database knows everything that happened to every session at every microsecond. The only question is whether you know how to ask."
— A truth all Oracle architects eventually learn at 3 AM

The DBA who rebooted at 04:30 wasn't incompetent — he simply lacked the vocabulary to ask the database the right question. ASH is that vocabulary. This guide is the grammar. Now go talk to your database.

Oracle Applications R12.2.x| Exadata | OCI|Oracle GRC|ECC

Monday, May 18, 2026