Performance tuning is no longer a craft of reading EXPLAIN PLAN and guessing. It is statistical forensics — excavating microseconds from ASH samples, understanding adaptive execution, and surgically removing chaos before it becomes catastrophe.
The Oracle database knows everything that happened to every session at every microsecond. The only question is whether you know how to ask.
The Day Production Died at 3 AM: 📖 True Story (Anonymised)
Singapore. Tuesday night. A fintech platform's nightly batch completes. Wednesday 03:17 AM — a single SQL statement that ran in 0.3 seconds yesterday now takes 47 seconds. It's called 12,000 times per minute. The connection pool explodes. The app tier goes dark. Trading opens in four hours.
The on-call DBA connects. Runs EXPLAIN PLAN. "Looks fine," he says. He runs it again. Still 47 seconds. He checks indexes. All present. He checks statistics. Recent. He reboots the instance at 04:30 — and this buys 20 minutes before it collapses again.
At 05:15, a senior architect joins. She doesn't look at the execution plan. She opens ASH. In 90 seconds she finds the answer: the SQL's execution plan had silently mutated two hours ago — from an index range scan delivering 400 rows to a full table scan delivering 1.2 million. The optimizer had picked a new plan because a statistics job had partially refreshed one partition, skewing cardinality estimates by 3,000x.
She pinned the old plan via SPM baseline. Trading opened on time. The DBA who rebooted the instance was staring at symptoms. She was reading the history of decisions.
"The execution plan tells you what the optimizer decided. ASH tells you what actually happened to real sessions, at real moments in time. One is a blueprint. The other is a flight data recorder."
— Maria Chen, Principal Oracle Architect, spoken at OracleWorld 2025
This blog takes you from knowing nothing about AWR to being the person who solves the 3 AM crisis in 90 seconds.
01
What Is AWR?
⬡ Knowledge Level: Zero → Foundation
Imagine your car's engine but instead of a warning light that turns on only when the engine has already blown, you had a co-pilot who silently wrote down every RPM, fuel pressure, and temperature reading every 60 seconds — and stored those snapshots for the last 8 days.
That is AWR: Automatic Workload Repository. Oracle's background machinery takes periodic snapshots of over 200 performance metrics, stores them in the SYSAUX tablespace, and gives you the ability to answer the question: "What exactly was happening in this database between 02:00 and 02:30 last Tuesday?"
60
Default snapshot interval (minutes)
8
Default retention (days)
200+
Metrics captured per snapshot
~200MB
Typical weekly storage footprint
AWR vs STATSPACK vs Manual Tracing
Tool
Granularity
Overhead
Historical
Best For
AWR
60-min snapshots
~0.1%
Yes – 8 days+
Trend analysis, regression detection
ASH
1-second samples
~0.5%
Yes – in-memory + AWR
Real-time forensics, top-N waits
STATSPACK
Manual snapshots
~0.5%
Manual setup
Pre-11g, SE2 editions
SQL Trace
Per-call microseconds
High (specific session)
No
Deep single-SQL forensics
💡 Architect's Tip
Increase AWR retention to 30 days on production and reduce interval to 30 minutes for high-frequency OLTP systems. The storage cost is trivial; the diagnostic window you gain is priceless.
SQLSet AWR to 30-day / 30-min
-- Run as SYSDBABEGIN
DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(
retention =>43200, -- 30 days in minutes
interval =>30-- every 30 minutes
);
END;
/
02
ASH — The Database's Time Machine
⬡ Knowledge Level: Foundation
AWR tells you averages over an hour. That's like a doctor who only checks your pulse at annual checkups. ASH — Active Session History — checks your pulse every second, for every session that was actively doing work.
Every second, Oracle samples all sessions in a non-idle wait state and writes a row into the circular in-memory buffer V$ACTIVE_SESSION_HISTORY. Every AWR flush (default 60 min), 1-in-10 rows persist to DBA_HIST_ACTIVE_SESS_HISTORY on disk.
The genius of ASH is statistical validity with minimal overhead. At 1 sample/second across active sessions, you can reconstruct what any SQL was doing, what it was waiting for, and who was blocking it — for any moment in the past.
03
The Full Diagnostic Stack
◈ Knowledge Level: Intermediate
When an incident arrives, most DBAs start in the middle of the stack. Expert architects start at the symptom layer and methodically traverse downward. Here is the complete Oracle performance diagnostic architecture.
L7
Symptoms Layer — What Users Experience
Slow page loads · ORA-04031 out of shared pool · Runaway CPU · Lock contention · Buffer busy waits · Connection pool exhaustion
APM AlertsApp Error LogsOEM Alerts
L6
Real-Time Diagnosis — What's Happening NOW
Live session state, current SQL, current wait event — millisecond precision
Add/rebuild index · Rewrite SQL · Hints · Partition strategy · Refresh statistics · Pin plan baseline · SQL Quarantine · Resource Plan
Measure LIO reductionElapsed timeBuffer getsCPU consumptionWait event time
🏛 Architecture Principle
Never skip layers. A DBA who jumps from "slow page load" directly to "I'll add an index" has skipped six layers of evidence. The index might make it faster for this query while masking the real architectural problem: a missing partition, a bloated SGA, or a toxic connection pattern generating 50,000 hard parses per second.
04
ASH Forensics — SQL You Run Tonight
◈ Knowledge Level: Intermediate → Advanced
These are not textbook queries. These are the exact statements a senior DBA reaches for the moment an incident starts. Copy them. Save them. Run them at 3 AM.
① Top Wait Events — Last 60 Minutes
First query in every incident. Shows you exactly where DB time is being consumed right now. 1 sample = 1 second of active session time.
SQLash_top_waits.sql — First query every incident
-- Top wait events by DB time — last 60 min-- 1 sample ≈ 1 active second. % column = % of total DB time.WITH ash AS (
SELECT
event,
session_state,
wait_class,
sql_id,
COUNT(*) AS samples,
ROUND(COUNT(*)*100/SUM(COUNT(*)) OVER(), 2) AS pct_db_time
FROMV$ACTIVE_SESSION_HISTORYWHERE sample_time >=SYSDATE-1/24GROUP BY event, session_state, wait_class, sql_id
)
SELECTCOALESCE(event, 'ON CPU') AS wait_event,
wait_class,
samples,
pct_db_time,
sql_id
FROM ash
ORDER BY samples DESCFETCH FIRST15ROWS ONLY;
② Blocking Session Chain — Who Is Killing Whom
When you see lock-related waits in query ①, run this immediately. It reconstructs the full blocking chain including the root holder.
SQLash_blocking_chain.sql — Lock tree reconstruction
-- Reconstruct lock holder → waiter chain from ASHSELECTLEVELAS chain_depth,
LPAD(' ', (LEVEL-1)*2)
|| session_id AS session_id,
blocking_session,
event,
sql_id,
TO_CHAR(sample_time,'HH24:MI:SS') AS sampled_at,
seconds_in_wait
FROMV$ACTIVE_SESSION_HISTORYWHERE sample_time >=SYSDATE-10/1440-- last 10 minutesAND wait_class ='Concurrency'CONNECT BY PRIOR blocking_session = session_id
START WITH blocking_session IS NULLORDER SIBLINGS BY seconds_in_wait DESC;
③ Top SQL Heatmap — ASH Minute-by-Minute
Shows you exactly which SQL IDs were consuming the most DB time at each minute during an incident window. Essential for pinpointing the moment a regression hit.
-- Top SQL per minute — pipe this into a spreadsheet heatmapSELECTTRUNC(sample_time, 'MI') AS minute_bucket,
sql_id,
SUM(time_waited) /1e6AS total_wait_sec,
COUNT(*) AS ash_samples,
MAX(event) AS dominant_event,
MAX(wait_class) AS wait_class
FROMV$ACTIVE_SESSION_HISTORYWHERE sample_time BETWEENTO_DATE('2026-05-18 02:00','YYYY-MM-DD HH24:MI')
ANDTO_DATE('2026-05-18 03:30','YYYY-MM-DD HH24:MI')
AND sql_id IS NOT NULLGROUP BYTRUNC(sample_time, 'MI'), sql_id
ORDER BY minute_bucket, ash_samples DESC;
⚠ Common Mistake
Many DBAs query V$SQL for elapsed_time and divide by executions to find slow SQL. This is cumulative since cursor load — it doesn't tell you that the SQL was only slow for 40 minutes starting at 02:17 AM. ASH with a time filter does. Always add a sample_time predicate.
05
Adaptive Query Plans — Why Your Plan Lies
◈ Knowledge Level: Intermediate → Advanced
Since Oracle 12c, the optimizer is no longer a one-shot decision machine. It is a feedback loop. Understanding this is the difference between architects who fight the optimizer and those who work with it.
Diagnosing an Adaptive Plan Mid-Switch
SQLadaptive_plan_check.sql — Was this plan adaptive? Did it switch?
-- Check if a SQL used an adaptive plan and whether it switchedSELECT
s.sql_id,
s.child_number,
s.is_resolved_adaptive_plan, -- Y = final plan locked in
s.is_reoptimizable, -- Y = will try to reoptimize
s.executions,
ROUND(s.elapsed_time/NULLIF(s.executions,0)/1e6,3) AS avg_elapsed_sec,
s.optimizer_cost
FROMV$SQL s
WHERE s.sql_id ='&sql_id'ORDER BY s.child_number;
-- Then view the adaptive plan detail — look for STATISTICS COLLECTOR nodeSELECT*FROMTABLE(
DBMS_XPLAN.DISPLAY_CURSOR(
sql_id =>'&sql_id',
format =>'ADAPTIVE ALLSTATS LAST'
)
);
💡 What to Look For
In the ADAPTIVE ALLSTATS LAST output, lines prefixed with - (dash) are plan lines that were considered but not used in the final execution. If you see a STATISTICS COLLECTOR node, the optimizer was buffering rows to make a mid-execution decision. The absence of this node in a repeated execution means it has already resolved the adaptive plan.
06
SQL Plan Regression Detection
★ Knowledge Level: Expert
The most dangerous performance event is the one that happened at 2 AM before anyone was watching. Statistics jobs, optimizer upgrades, plan invalidations — all happen silently. This is your early warning system.
Trigger: Statistics job completes on ORDERS table (02:14)
Automatic Stats Refresh
DBMS_STATS gathers new stats on a partitioned table. One partition has 10× the expected rows. Cardinality estimate for a join changes from 400 to 1,200,000.
Trigger: Next hard parse after shared pool flush (02:17)
Plan Hash Value Changes
Optimizer re-evaluates plan with new statistics. Decides a full table scan is cheaper than the index range scan + nested loop. New plan_hash_value: 4029847362 (was 1847293841).
Effect: 0.3s → 47s per execution (02:17 onward)
Silent Regression
No errors. No alerts. AWR won't flag it for another 38 minutes when the hour's average moves. ASH captures it in seconds.
Detection: Via regression query below
You Find It
The query compares the 7-day average to today's average. SQL_ID 5rw2a9q8xkf1m shows 156× regression. Plan hash changed at 02:17:03.
SQLsql_regression_detection.sql — Compare now vs 7-day baseline
-- Detect SQL regressions vs 7-day historical baseline-- regression_x: how many times slower vs baselineWITH
baseline AS (
SELECT
sql_id,
AVG(elapsed_time_delta /NULLIF(executions_delta, 0))
AS avg_elapsed_base,
AVG(buffer_gets_delta /NULLIF(executions_delta, 0))
AS avg_lio_base
FROMDBA_HIST_SQLSTATWHERE snap_id BETWEEN
(SELECTMIN(snap_id) FROMDBA_HIST_SNAPSHOTWHERE begin_interval_time >=SYSDATE-14)
AND
(SELECTMAX(snap_id) FROMDBA_HIST_SNAPSHOTWHERE begin_interval_time <SYSDATE-7)
GROUP BY sql_id
),
recent AS (
SELECT
sql_id,
AVG(elapsed_time_delta /NULLIF(executions_delta, 0))
AS avg_elapsed_now,
AVG(buffer_gets_delta /NULLIF(executions_delta, 0))
AS avg_lio_now,
MAX(plan_hash_value) AS current_plan_hash
FROMDBA_HIST_SQLSTATWHERE snap_id >= (
SELECTMIN(snap_id) FROMDBA_HIST_SNAPSHOTWHERE begin_interval_time >=SYSDATE-1)
GROUP BY sql_id
)
SELECT
r.sql_id,
ROUND(b.avg_elapsed_base/1e6, 3) AS baseline_sec,
ROUND(r.avg_elapsed_now/1e6, 3) AS current_sec,
ROUND(r.avg_lio_base) AS baseline_lio,
ROUND(r.avg_lio_now) AS current_lio,
ROUND(r.avg_elapsed_now /NULLIF(b.avg_elapsed_base, 0), 1) AS regression_x,
r.current_plan_hash,
SUBSTR(
(SELECT sql_text FROMDBA_HIST_SQLTEXTWHERE sql_id = r.sql_id AND ROWNUM=1),
1, 60
) AS sql_preview
FROM recent r
JOIN baseline b USING(sql_id)
WHERE r.avg_elapsed_now > b.avg_elapsed_base *2-- 2x slower or moreORDER BY regression_x DESCFETCH FIRST20ROWS ONLY;
Find the Exact Moment the Plan Changed
SQLplan_change_timeline.sql — When did the plan hash change?
-- For a given sql_id, show plan_hash_value timeline from AWR-- Pivot this in Excel to see exactly when the plan mutatedSELECT
sn.begin_interval_time,
ss.plan_hash_value,
ROUND(ss.elapsed_time_delta/NULLIF(ss.executions_delta,0)/1e6,3)
AS avg_elapsed_sec,
ROUND(ss.buffer_gets_delta/NULLIF(ss.executions_delta,0))
AS avg_lio,
ss.executions_delta AS execs
FROMDBA_HIST_SQLSTAT ss
JOINDBA_HIST_SNAPSHOT sn USING(snap_id, dbid, instance_number)
WHERE
ss.sql_id ='&sql_id'AND sn.begin_interval_time >=SYSDATE-3ORDER BY
sn.begin_interval_time;
07
SQL Quarantine — The Circuit Breaker
★ Knowledge Level: Expert
📖 Why This Exists
A single query consuming 100% CPU for 2 minutes, called by 500 concurrent sessions, can bring an entire cluster to its knees. Before SQL Quarantine (introduced in 18c), the options were: kill sessions manually, restart the instance, or add a blunt resource plan. SQL Quarantine lets you surgically target that specific execution plan of that specific SQL and terminate any session that would run it — within a threshold you define — while allowing other plans of the same SQL to execute normally.
SQL / PL·SQLquarantine_runaway.sql — Full quarantine lifecycle
-- STEP 1: Create quarantine entry targeting specific sql_id + planDECLARE
v_handle VARCHAR2(100);
BEGIN-- force_matching=YES applies to all SQLs that normalize to same text
v_handle := DBMS_SQLQ.CREATE_QUARANTINE_BY_SQL_ID(
sql_id =>'5rw2a9q8xkf1m',
force_matching =>'NO'
);
-- STEP 2: Set threshold — kill if elapsed time > 10 seconds
DBMS_SQLQ.ALTER_QUARANTINE(
quarantine_name => v_handle,
action =>'ENABLE',
threshold => DBMS_SQLQ.THRESHOLD_ELAPSED_TIME,
value =>'00:00:10'-- kill if > 10 seconds
);
-- Optionally also cap CPU time
DBMS_SQLQ.ALTER_QUARANTINE(
quarantine_name => v_handle,
action =>'ENABLE',
threshold => DBMS_SQLQ.THRESHOLD_CPU_TIME,
value =>'00:00:05'-- kill if CPU > 5 seconds
);
DBMS_OUTPUT.PUT_LINE('Quarantine: '|| v_handle);
END;
/
-- STEP 3: Verify quarantine is activeSELECT
name,
sql_id,
plan_hash_value,
enabled,
cpu_time,
elapsed_time,
io_megabytes
FROMDBA_SQL_QUARANTINEWHERE sql_id ='5rw2a9q8xkf1m';
-- STEP 4: When root cause is fixed, drop quarantineBEGIN
DBMS_SQLQ.DROP_QUARANTINE(quarantine_name =>'&quarantine_name');
END;
/
⚠ Architect's Warning
SQL Quarantine is a tourniquet, not a cure. It stops the bleeding, but the root cause — bad statistics, missing index, unpartitioned table — still exists. The moment you remove the quarantine without fixing the underlying problem, the runaway query returns. Always schedule the root cause fix before disabling quarantine.
08
SPM Baselines — Controlling the Optimizer
★ Knowledge Level: Expert
SQL Plan Management (SPM) is a constitutional framework for the Oracle optimizer. While quarantine reacts to disasters, SPM prevents them. It says to the optimizer: "You may explore new plans, but you may not use them until a human validates they are better."
SQL / PL·SQLspm_capture_and_pin.sql — Full SPM lifecycle: capture, verify, enable
-- ═══════════════════════════════════════════════-- SCENARIO: Pin known-good plan after a regression-- ═══════════════════════════════════════════════-- 1. Load the GOOD plan from AWR into SPM baseline-- (good_plan_hash_value found from regression timeline query)DECLARE
v_loaded PLS_INTEGER;
BEGIN
v_loaded := DBMS_SPM.LOAD_PLANS_FROM_AWR(
begin_snap_id =>28441, -- snap before regression
end_snap_id =>28443,
basic_filter =>'sql_id = ''5rw2a9q8xkf1m''',
fixed =>'YES', -- mark as FIXED (won't evolve)
enabled =>'YES'
);
DBMS_OUTPUT.PUT_LINE('Loaded: '|| v_loaded ||' plan(s)');
END;
/
-- 2. Verify the baseline was createdSELECT
plan_name,
sql_handle,
sql_text,
enabled,
accepted,
fixed,
plan_hash_value,
created,
last_modified
FROMDBA_SQL_PLAN_BASELINESWHERE sql_id ='5rw2a9q8xkf1m'ORDER BY created DESC;
-- 3. If new (unaccepted) plans exist — evolve them-- This runs the new plan in shadow, compares to baseline,-- accepts only if it's statistically betterDECLARE
v_report CLOB;
BEGIN
v_report := DBMS_SPM.EVOLVE_SQL_PLAN_BASELINE(
sql_handle =>'SYS_SQL_abc123def456',
plan_name =>NULL, -- evolve all unaccepted plans
time_limit =>60, -- max 60 seconds
verify =>'YES', -- must verify before accepting
commit =>'YES'-- auto-accept if verified better
);
DBMS_OUTPUT.PUT_LINE(v_report);
END;
/
-- 4. Export baselines for cross-environment transport (Prod→UAT)DECLARE
v_plans PLS_INTEGER;
BEGIN
v_plans := DBMS_SPM.PACK_STGTAB_BASELINE(
table_name =>'SPM_STGTAB',
table_owner =>'DBA_TOOLS',
sql_handle =>'SYS_SQL_abc123def456'
);
DBMS_OUTPUT.PUT_LINE('Packed: '||v_plans||' plans');
END;
/
09
The War Room Protocol
★ Knowledge Level: Expert
When the call comes at 3 AM, most teams lose 45 minutes running random queries in a panic. Here is the exact decision tree an expert follows. Each step takes under 2 minutes. Total time to diagnosis: under 10 minutes.
SQLwar_room_kit.sql — Emergency commands, all in one place
-- ① WHAT IS THE DB DOING RIGHT NOW?SELECT wait_class, event, COUNT(*) sess
FROMV$SESSIONWHERE status ='ACTIVE'AND type ='USER'GROUP BY wait_class, event ORDER BY3DESCFETCH FIRST10ROWS ONLY;
-- ② TOP SQL BY ELAPSED TIME RIGHT NOWSELECT sql_id, child_number,
ROUND(elapsed_time/NULLIF(executions,0)/1e6,3) avg_s,
executions, loads, invalidations,
SUBSTR(sql_text,1,80) sql_preview
FROMV$SQLWHERE executions >0ORDER BY elapsed_time/NULLIF(executions,0) DESCFETCH FIRST10ROWS ONLY;
-- ③ HARD PARSE STORM CHECK (shared pool contention)SELECTROUND(value) AS hard_parses_per_sec
FROMV$SYSSTATWHERE name ='parse count (hard)';
-- > 100/sec = investigate bind variables or cursor_sharing-- ④ KILL A SPECIFIC SESSION (use sid+serial from V$SESSION)ALTER SYSTEM KILL SESSION'&sid,&serial#' IMMEDIATE;
-- ⑤ FORCE STATS REFRESH ON A SINGLE TABLE (targeted, not full gather)BEGIN
DBMS_STATS.GATHER_TABLE_STATS(
ownname =>'SCHEMA_OWNER',
tabname =>'ORDERS',
partname =>'ORDERS_202505', -- NULL = all partitions
estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
degree =>4,
no_invalidate =>FALSE-- force cursor invalidation now
);
END;
/
10
From Reactive to Prophylactic
Every query in this guide is reactive — something bad happened, now you diagnose it. That is table stakes. The real craft is building the infrastructure so that you know about the problem before the users do.
🏛 Architectural North Star
Schedule the regression detection query (Section 06) as a daily job. Alert when any SQL shows >3× elapsed time regression. You'll catch most plan instabilities within 24 hours — before they compound into incidents. At scale, run it after every DBMS_STATS gather window.
90 sec
Time to diagnosis with ASH (expert)
↓ vs 45 min reactive
0
Outages from known-plan regressions (with SPM)
↓ from 2–3/month
156×
Worst plan regression caught in case study
0.3s → 47s per call
10 min
War room protocol — incident to fix
↓ from ~90 min avg
"The database knows everything that happened to every session at every microsecond. The only question is whether you know how to ask."
— A truth all Oracle architects eventually learn at 3 AM
The DBA who rebooted at 04:30 wasn't incompetent — he simply lacked the vocabulary to ask the database the right question. ASH is that vocabulary. This guide is the grammar. Now go talk to your database.