To better understand dependent/correlated sets, let’s take a brief look at dependent/correlated columns.
Oracle works under the assumption that the data in each column is independent. If an equality predicate on column COL1 leaves 10% of the records, and an equality predicate on column COL2 leaves 20% of the records, then by default the Oracle CBO assumes that predicates on both COL1 and COL2 together would leave 2% (10% x 20%) of the records. If the data in COL1 is correlated with the data in COL2, then equality predicates on both COL1 and COL2 would leave significantly more than 2% of the records. That difference in estimated cardinality can cause serious trouble when the optimizer estimates the subsequent execution steps.
Oracle recognized the problem and came up with a comprehensive solution: extended statistics, introduced in Oracle 11g.
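For completeness, a minimal sketch of that fix for correlated columns (the table name T here is illustrative, not part of the example below):
-- Create a column group on the correlated pair, then re-gather stats so
-- the CBO can use the number of distinct (COL1,COL2) combinations.
select dbms_stats.create_extended_stats(NULL,'T','(COL1,COL2)') from dual;
exec dbms_stats.gather_table_stats(NULL,'T');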
Now to dependent/correlated sets: two sets are dependent/correlated if they contain the same or similar records. Oracle assumes that sets are independent. If a predicate like "COL1 IN (select ... from ... SET1)" leaves 10% of the records, and a similar predicate "COL1 IN (select ... from ... SET2)" leaves 20% of the records, then Oracle assumes that applying both predicates together would leave 2% (10% x 20%) of the records. If SET1 is identical or very similar to SET2, however, the two predicates together would leave significantly more than 2% of the records. Needless to say, the difference in estimated cardinality can cause serious trouble when the optimizer estimates the subsequent execution steps.
The big challenge is that there is nothing out of the box that can help us in the above scenario.
Tables TAB1 and TAB2 provide the correlated sets (SET1 and SET2):
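-- TAB1: 1,000,000 rows; COL1 is unique, COL2 cycles through 0-1023,
-- COL3 alternates between 0 and 1.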
create table tab1 as
with generator as (
select
rownum id
from dual
connect by
rownum <= 4000
)
select
id col1,
mod(id,1024) col2 ,
mod(id,2) col3
from (
select
/*+ no_merge */
rownum id
from
generator,
generator
where
rownum <= 1000000
)
;
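-- TAB2: same shape, but COL2 is offset by 134, so TAB2.COL2 IN (200,...,750)
-- selects exactly the same COL1 values as TAB1.COL2 IN (66,...,616).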
create table tab2 as
with generator as (
select
rownum id
from dual
connect by
rownum <= 4000
)
select
id col1,
mod(id,1024) + 134 col2 ,
mod(id,2) col3
from (
select
/*+ no_merge */
rownum id
from
generator,
generator
where
rownum <= 1000000
)
;
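As a quick sanity check: MOD(id,1024) + 134 maps 66 to 200, 166 to 300, and so on, so the COL1 values that survive the TAB1 filter are exactly the COL1 values that survive the TAB2 filter, and the following query should return no rows:
select col1 from tab1 where col2 in (66,166,316,416,516,616)
minus
select col1 from tab2 where col2 in (200,300,450,550,650,750);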
TAB3 is the table against which we will apply the filters driven by those correlated sets:
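-- TAB3: 16,000,000 rows (a 4000 x 4000 generator cross join; the ROWNUM
-- limit of 1,000,000,000 is never reached).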
create table tab3 as
with generator as (
select
rownum id
from dual
connect by
rownum <= 4000
)
select
id col1
from (
select
/*+ no_merge */
rownum id
from
generator,
generator
where
rownum <= 1000000000
)
;
Gather stats:
exec dbms_stats.gather_table_stats(NULL,'TAB1');
exec dbms_stats.gather_table_stats(NULL,'TAB2');
exec dbms_stats.gather_table_stats(NULL,'TAB3');
The query we are interested in is:
select
t3.col1
from
tab3 t3 ,
tab1 t1 ,
tab2 t2
where
t3.col1 = t1.col1
and t1.col2 in (66,166,316,416,516,616)
and
t3.col1 = t2.col1
and t2.col2 in (200,300,450,550,650,750)
and t1.col3 = t2.col3;
Please note that the sets that come from TAB1 and TAB2 are identical, so the clause
t3.col1 = t1.col1 and t3.col1 = t2.col1
eliminates significantly fewer records than the Oracle CBO estimates.
Due to the dependent/correlated set behavior described above, the CBO estimates that only 17 records will be returned by the query (see the DBMS_XPLAN.DISPLAY_CURSOR output below), even though the query actually returns 5861 records.
---------------------------------------------------------
| Id | Operation |Name |Rows |Bytes |Cost (%CPU)|
---------------------------------------------------------
| 0 | SELECT STATEMENT | | | |8182 (100) |
|* 1 | HASH JOIN | |17 |510 |8182 (4) |
|* 2 | TABLE ACCESS FULL |TAB2 |5859 |70308 |657 (4) |
|* 3 | HASH JOIN | |5872 |103K |7524 (4) |
|* 4 | TABLE ACCESS FULL |TAB1 |5859 |70308 |653 (4) |
| 5 | TABLE ACCESS FULL |TAB3 |16M |91M |6786 (2) |
---------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("T3"."COL1"="T2"."COL1" AND "T1"."COL3"="T2"."COL3")
2 - filter(("T2"."COL2"=200 OR "T2"."COL2"=300 OR "T2"."COL2"=450 OR
"T2"."COL2"=550 OR "T2"."COL2"=650 OR "T2"."COL2"=750))
3 - access("T3"."COL1"="T1"."COL1")
4 - filter(("T1"."COL2"=66 OR "T1"."COL2"=166 OR "T1"."COL2"=316 OR
"T1"."COL2"=416 OR "T1"."COL2"=516 OR "T1"."COL2"=616))
Dynamic sampling, even at the maximum level, actually makes the estimate worse. Delete the statistics and raise the sampling level:
exec dbms_stats.delete_table_stats(NULL,'TAB1');
exec dbms_stats.delete_table_stats(NULL,'TAB2');
exec dbms_stats.delete_table_stats(NULL,'TAB3');
alter session set optimizer_dynamic_sampling=10;
---------------------------------------------------------
| Id | Operation |Name |Rows |Bytes |Cost (%CPU)|
---------------------------------------------------------
| 0 | SELECT STATEMENT | | | |8156 (100) |
|* 1 | HASH JOIN | |1 |91 |8156 (4) |
|* 2 | TABLE ACCESS FULL |TAB2 |5861 |223K |644 (3) |
|* 3 | HASH JOIN | |5861 |297K |7511 (4) |
|* 4 | TABLE ACCESS FULL |TAB1 |5861 |223K |640 (3) |
| 5 | TABLE ACCESS FULL |TAB3 |16M |198M |6786 (2) |
---------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("T3"."COL1"="T2"."COL1" AND "T1"."COL3"="T2"."COL3")
2 - filter(("T2"."COL2"=200 OR "T2"."COL2"=300 OR "T2"."COL2"=450 OR
"T2"."COL2"=550 OR "T2"."COL2"=650 OR "T2"."COL2"=750))
3 - access("T3"."COL1"="T1"."COL1")
4 - filter(("T1"."COL2"=66 OR "T1"."COL2"=166 OR "T1"."COL2"=316 OR
"T1"."COL2"=416 OR "T1"."COL2"=516 OR "T1"."COL2"=616))
Note
-----
- dynamic sampling used for this statement (level=10)
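Before moving on, reset dynamic sampling to its default (level 2 in recent releases; the table statistics are re-gathered below in any case):
alter session set optimizer_dynamic_sampling=2;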
One way to resolve this is to force the execution order so that TAB3 is visited before TAB1 or TAB2, as sketched below.
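A minimal sketch with a LEADING hint (the hint is illustrative; a SQL plan baseline or SQL profile could serve the same purpose):
-- Drive from TAB3 so the misestimated final join cardinality cannot
-- mislead the join-order decisions made earlier in the plan.
select /*+ leading(t3 t1 t2) */
       t3.col1
from
       tab3 t3,
       tab1 t1,
       tab2 t2
where
       t3.col1 = t1.col1
and    t1.col2 in (66,166,316,416,516,616)
and    t3.col1 = t2.col1
and    t2.col2 in (200,300,450,550,650,750)
and    t1.col3 = t2.col3;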
Another solution is to create a virtual column identical to COL1:
alter table tab3 add col1_corr generated always as (col1*1) not null;
Next, gather stats:
exec dbms_stats.gather_table_stats(NULL,'TAB1');
exec dbms_stats.gather_table_stats(NULL,'TAB2');
exec dbms_stats.gather_table_stats(NULL,'TAB3');
Then set the number of distinct values for the new virtual column to 1:
exec dbms_stats.set_column_stats(NULL,'TAB3','COL1_CORR',distcnt=>1);
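You can confirm that the manual setting took effect by querying the dictionary; COL1_CORR should now report NUM_DISTINCT = 1:
-- Verify the manually set NDV on the virtual column
select column_name, num_distinct, virtual_column
from   user_tab_cols
where  table_name = 'TAB3';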
Finally, use the new virtual column (COL1_CORR), instead of COL1, in one of the join predicates:
select
t3.col1
from
tab3 t3 ,
tab1 t1 ,
tab2 t2
where
t3.col1_corr = t1.col1
and t1.col2 in (66,166,316,416,516,616)
and t3.col1 = t2.col1
and t2.col2 in (200,300,450,550,650,750)
and t1.col3 = t2.col3;
Now the CBO expects 2936 records, much better than the original estimate of 17 records.
---------------------------------------------------------
| Id | Operation |Name |Rows |Bytes |Cost (%CPU)|
---------------------------------------------------------
| 0 | SELECT STATEMENT | | | |8182 (100) |
|* 1 | HASH JOIN | |2936 |103K |8182 (4) |
|* 2 | TABLE ACCESS FULL |TAB1 |5859 |70308 |653 (4) |
|* 3 | HASH JOIN | |5872 |137K |7528 (4) |
|* 4 | TABLE ACCESS FULL |TAB2 |5859 |70308 |657 (4) |
| 5 | TABLE ACCESS FULL |TAB3 |16M |183M |6786 (2) |
---------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("COL1"*1="T1"."COL1" AND "T1"."COL3"="T2"."COL3")
2 - filter(("T1"."COL2"=66 OR "T1"."COL2"=166 OR "T1"."COL2"=316 OR
"T1"."COL2"=416 OR "T1"."COL2"=516 OR "T1"."COL2"=616))
3 - access("T3"."COL1"="T2"."COL1")
4 - filter(("T2"."COL2"=200 OR "T2"."COL2"=300 OR "T2"."COL2"=450 OR
"T2"."COL2"=550 OR "T2"."COL2"=650 OR "T2"."COL2"=750))
Setting the number of distinct values for COL1_CORR to 1 forces the CBO to believe that the clause on COL1_CORR will not eliminate any records. That is the only way (for me, anyway) to tell the optimizer that the COL1_CORR clause will not further reduce the number of records.
Interestingly, I get the same result for any value of distcnt …