A SET table forces Teradata to check for duplicate rows every time a row is inserted or updated, which can add a lot of overhead to those operations. Why can't we use MULTISET tables for all the tables we load from other tables in the DWH, and use a GROUP BY to eliminate duplicates when inserting rows?
There's no reason to use a SET (as opposed to MULTISET) table if you don't need the automatic duplicate checking. By the way, if a table has a uniqueness constraint defined on some set of columns (i.e., a primary key or unique index), then there is absolutely no difference between SET and MULTISET, as only the uniqueness constraint will be enforced in either case.
Actually, there IS a difference between SET and MULTISET even if you have a UNIQUE index:

CREATE TABLE T1 (c1 INTEGER NOT NULL, c2 INTEGER) PRIMARY INDEX (c1);
INSERT T1 VALUES (1,1);
INSERT T1 VALUES (1,2);
INSERT T1 VALUES (2,1);

CREATE SET TABLE T2_S (c1 INTEGER NOT NULL) UNIQUE PRIMARY INDEX (c1);
CREATE MULTISET TABLE T2_MS (c1 INTEGER NOT NULL) UNIQUE PRIMARY INDEX (c1);

INSERT T2_S SELECT c1 FROM T1;   /* succeeds - quietly eliminates the duplicate and inserts 2 rows */
INSERT T2_MS SELECT c1 FROM T1;  /* fails with a duplicate key error */
Hi Jim Chapman / Fred Pluebell,
I am concerned about the time taken to load SET vs. MULTISET tables. In my application there are many SET tables created for storing temporary results, and loading these SET tables is taking a lot of time. We are planning to change them to MULTISET and add a GROUP BY over all columns to remove duplicate rows; the table index is a Primary Index. Is this the correct way?
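For illustration, a minimal sketch of that MULTISET-plus-GROUP-BY load pattern, using hypothetical table and column names, could look like this (the GROUP BY over all selected columns removes full-row duplicates before the insert, just as a SET table would):

CREATE MULTISET TABLE stg_orders_ms
( order_id  INTEGER NOT NULL
, cust_id   INTEGER
, order_amt DECIMAL(12,2)
) PRIMARY INDEX (order_id);

INSERT INTO stg_orders_ms (order_id, cust_id, order_amt)
SELECT order_id, cust_id, order_amt
FROM   src_orders
GROUP BY order_id, cust_id, order_amt;  /* full-row duplicates collapse to one row */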
First, I'd suggest you verify that you have chosen a good Primary Index. If the number of rows with any given PI value is small, the SET table duplicate row checking overhead is small. There may be cases in which an extremely non-unique PI is a reasonable choice; in those cases changing to MULTISET is certainly worth considering. Just be sure you aren't covering up the real problem.
(Somewhat of a late comment relative to the prior posts, but the internet is timeless and it will help any googlers).
Can someone quantify what they mean when they say that a SET table has overhead? I like the fact that a SET table discards duplicates for me; I want to know how it scales. If I want to store, say, person ID and SKU pairs for billions of rows, I simply insert into my SET table knowing that it will discard pairs already present. That's better than coding a proper NOT EXISTS or other technique, as I'd have to if it were a MULTISET table. How far can I expect to take this before the overhead is not worth it?
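For reference, the MULTISET alternative being alluded to, sketched here with hypothetical table and column names, would be something along these lines; with a SET table the duplicate row check does the same filtering implicitly:

INSERT INTO person_sku_ms (person_id, sku)
SELECT DISTINCT s.person_id, s.sku
FROM   src_pairs s
WHERE  NOT EXISTS
       ( SELECT 1
         FROM   person_sku_ms t
         WHERE  t.person_id = s.person_id
         AND    t.sku       = s.sku );  /* skip pairs already in the target */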
To insert a new row the system has to do a duplicate row check for all existing rows with the same RowHash value, i.e. all rows with the same PI value or another PI value which happens to hash the same.
And of course it's cumulative: to insert N rows with the same RowHash you'll need (N*(N-1)) / 2 checks:
10 rows -> 45 dup row checks
100 -> 4950
1000 -> 499500
The rule of thumb is:
If the number of rows per value is low (maybe up to a few hundred) and all rows fit on a single datablock it might be acceptable.
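To apply that rule of thumb, a quick check of the rows-per-PI-value distribution (table and PI column names are hypothetical here) might look like:

SELECT person_id            /* the (NU)PI column */
     , COUNT(*) AS rows_per_value
FROM   person_sku
GROUP BY 1
ORDER BY 2 DESC;
/* aggregate this further if you want the average and maximum rows per PI value */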
How does each instance of a dup row check equate to time? That formula is way beyond a linear increase, but I'm not seeing it. I have a x00 million row SET table into which I insert x0 million rows daily, with net new rows of about 5%, and I'm not getting the time degradation that the formula would indicate.
What would you advise as a more efficient technique when I simply want to sweep new pairs into my SET pair table? As it is, I don't have to worry about ETL change-data coding.
Appreciate your feedback.
What's the average number of rows per value for this table?
The formula is only meant to compare the overhead for different numbers of rows per PI: when the count is doubled, the number of dup checks needed to insert all those rows with the same PI is quadrupled.
When the number of rows per PI increases because you insert new rows with an existing PI, you should see a linear increase of CPU usage in DBQL. E.g., existing number of rows per PI value: 50, inserting 1 row per day per PI.
After 50 days the number of rows per PI has doubled; to insert a new row there are now 99 dup checks vs. 49 on the first day, resulting in roughly doubled CPU time per row.
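As an illustration, assuming query logging is enabled and the insert statements can be identified by their text, a query along these lines against the DBQL view can be used to watch that CPU trend over time (the table name in the LIKE predicate is hypothetical):

SELECT CAST(StartTime AS DATE) AS run_date
     , SUM(AMPCPUTime)         AS total_amp_cpu
FROM   DBC.QryLogV
WHERE  QueryText LIKE 'INSERT INTO person_sku%'
GROUP BY 1
ORDER BY 1;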
SET tables will insert rows quickly when the inserts start, but will become much slower as the record count in the table reaches millions. This is because a SET table won't allow an entire row to be duplicated, so as you insert new rows the TD system has to check whether each particular row is already present or not. If the row is already present, it won't insert it, and it won't give any error message either. As this is an automatic process, a lot of resources and processor time are required to do it. Initially it works fast, but as the record count reaches millions it becomes slow.
Possible cause is 'duplicate row check'. This process is likely to slow down as you add more rows to the target table. This is a problem when you have too many rows per PI data value (well, technically per PI rowhash) AND the table is SET AND there are no unique indexes on the target table.
- as the process starts, there are relatively few rows for any given PI
- duplicate row checking happens, but with only a few rows to check it doesn't appear to slow you down
- as you add more data to the target table, the number of rows with the same PI rowhash grows, and you can now 'see' the processing slow down
If the duplicate row check looks like the culprit then a couple of things come to mind (a sketch of both follows below):
- add a unique index
- change the table to MULTISET (providing of course that your data model/processing can handle this)
If the table has a unique index (either UPI or USI) then you will ** not ** have a problem with 'duplicate row checks'.
The only time you have problems with duplicate row checks is when:
- the table is SET, AND
- the table has NO unique indexes, AND
- you have lots of rows with the same PI rowhash value.
If any of the above conditions are NOT met then you will not have a problem with duplicate row checks during insert processing.
Sughesh I S
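As a hedged sketch of those two remedies (object names are hypothetical, and converting to MULTISET generally means rebuilding the table rather than altering it in place):

/* option 1: add a unique secondary index, so the USI rather than a full
   duplicate row check enforces uniqueness for single-row inserts */
CREATE UNIQUE INDEX (order_id, line_no) ON stage_sales;

/* option 2: recreate the table as MULTISET and reload it */
CREATE MULTISET TABLE stage_sales_ms AS stage_sales WITH NO DATA;
INSERT INTO stage_sales_ms SELECT * FROM stage_sales;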
Why do you simply copy answers from a recent thread at www.teradataforum.com?
The first five sentences were written by you, but the rest is from Dave Wellmann.
And there was another post:
A SET/NUPI table plus a USI still does duplicate row checks for INSERT/SELECTs; only single-row inserts benefit from the USI.
You can easily see that when you insert/select the same rows a second time:
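A minimal sketch of that kind of test, with hypothetical names and along the lines of Fred's earlier example, might be:

CREATE TABLE src1 (c1 INTEGER NOT NULL, c2 INTEGER NOT NULL) PRIMARY INDEX (c1);
INSERT src1 VALUES (1, 1);

CREATE SET TABLE t_set_usi (c1 INTEGER NOT NULL, c2 INTEGER NOT NULL) PRIMARY INDEX (c1);
CREATE UNIQUE INDEX (c2) ON t_set_usi;

INSERT t_set_usi SELECT c1, c2 FROM src1;  /* inserts 1 row */
INSERT t_set_usi SELECT c1, c2 FROM src1;
/* completes with 0 rows inserted: the duplicate row check quietly discards the
   row instead of the USI raising a duplicate-key error, and that check is what
   costs CPU as the rows per PI grow */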
A MULTISET table makes the explain plan different from a SET table when a few tables are joined together, and the results return faster. My guess is that Teradata chooses a different explain plan when it sees a MULTISET table, but I do not understand why. Has anyone else had a similar experience?
When statistics exist on the join columns there should be no difference in the plan, and when the plan doesn't change the speed will not change either.
But when stats are missing (on the 'one' side of a one-to-many join) the optimizer assumes more than one row will be joined and so overestimates the number of rows returned by the join step. This is a built-in factor, which seems to vary based on some unknown conditions; let's assume 1.1:
Base table has 1,000,000 rows and all the joins are to the PK of the 2nd table:
1st join: estimated 1,100,000, actual 1,000,000
2nd join: estimated 1,210,000, actual 1,000,000
3rd join: estimated 1,331,000, actual 1,000,000
If the factor is different for MULTISET tables this might explain the different plans, but I never tested it.
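In that situation the straightforward fix, sketched here with hypothetical table and column names, is simply to collect statistics on the join columns so the optimizer knows the 'one' side really is unique:

COLLECT STATISTICS ON dim_customer COLUMN (cust_id);  /* PK side of the join */
COLLECT STATISTICS ON fact_sales   COLUMN (cust_id);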
I am new to a Teradata project.
I want to know what types of errors can come up when data is loading and how to overcome those errors.
Please also give me some tips; I want to become good at Teradata database development. Please help me.
Can anyone please explain?
Thank you, all of you!
Your way of explaining is very good and easy to understand. I am new to a Teradata project;
please help me. Can you send any dummy project to my mail ID, please?
Santhosh.firstname.lastname@example.org - this is my mail ID.
In reference to the third comment posted by Fred,
I created a table T2_S1 similar to the original source table T1.
CREATE VOLATILE SET TABLE T2_S1 (c1 INTEGER NOT NULL, c2 INTEGER) UNIQUE PRIMARY INDEX (c1) ON COMMIT PRESERVE ROWS;
Now, I tried an INSERT/SELECT into this SET table, but it gave an error.
INSERT T2_S1 SELECT c1, c2 FROM T1;  /* expected: succeeds - quietly eliminates the duplicate and inserts 2 rows */
Here it is failing: *** Failure 2801 Duplicate unique prime key error in iTunes_User.T2_MS.
Statement# 1, Info =0
*** Total elapsed time was 1 second.
It seems the INSERT/SELECT into a SET table, with duplicates being discarded, is not working as expected.
Can you please explain the behavior here? Is it that when we select all the columns, i.e. an entire row from the source table, the duplicate row check is done against the target table?
Not exactly. The duplicate row check is being done in both cases, but in your example the entire rows (1,1) and (1,2) are not duplicates, so neither is eliminated. Therefore the uniqueness violation on the PI is reported.
In my example, the rows (1) and (1) were identical so one was quietly discarded, and there was no uniqueness violation.
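For completeness, if the goal in that test had been to load just one row per PI value into the UPI table, one hedged way to do it (picking an arbitrary c2 per c1, purely for illustration) would be:

INSERT T2_S1
SELECT c1, c2
FROM   T1
QUALIFY ROW_NUMBER() OVER (PARTITION BY c1 ORDER BY c2) = 1;  /* keep one row per PI value */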
Thanks Fred. I got it now :)