I am a beginner on Teradata and am involved in an exploration to figure out what it would take to migrate an existing system based on Informatica PC+Oracle to Informatica PC+Teradata. I am trying to understand some guidelines or best practices in determining primary index for tables - I understand this is an optimization exercise to maximize data distribution while trying to use fewer columns, using those that are used for querying often are recommended.
Following is a case using which I would like to get your thoughts so i get a clear picture. Let's say I have a table which has 20 columns, 8 of them together (all varchar/string datatype) can act as a primary key. The data volume can range between 1 million to 1 billion (the table is part of a generic application that can be used with varying data volumes)
The options I have for choosing primary index are:
Option 1: Use a combination of 3 columns (let's say A, B, C) that will always be used to query data but are highly non unique (between 40 to 1000 distinct values)
Pros: will help retrieve data faster
Cons: data storage would be skewed and all AMPs may not be used in some cases - adversely affecting the data retrieval performance
Option 2: Use a combination of 2 columns (let's say D, E - entirely different from earlier option) that are highly unique
Pros: will distribute data evenly
Cons: both columns not used too frequently for querying
Option 3: Use a combination of 5 columns (A, B, C, D, E) - highly unique together
Pros: data distribution will be great
Cons: Only 3 out of 5 columns are frequently used in querying so may not be a big help there (can gather stats or create secondary indexes on the columns, if necessary), also 5 columns will be used to define index (not sure how much impact this has)
Option 4: Use columns D, E (highly unique) as indexes and partition index by columns A, B, C (highly non unique, but always used in querying)
Pros: data distribution will be good, query performance would be aided by the partitions
Cons: most operations will be all-AMP access and primary index will not aid in query performance as such
Option 5: Use all 8 columns as primary index
Pros: will be highly unique, best data distribution
Cons: not all columns will be used for querying and too many columns will be used to define index (again, not sure how much impact this has)
Which of these options would you recommend and why? please feel free to share any additional thoughts you may have around this. Do let me know if you would need any additional information to provide your inputs.
Choosing the Primary Index is one of the most critical part in a Teradata system because it affects both QUERY PERFORMANCE and DATA STORAGE.
In choosing a PI, one should consider 3 things:
1) Data Distribution
2) Access path
In ideal scenarios, defining a PI which gives good Data Distribution and Access Path will really yield great performance. However in cases like yours, the best way to do it is to strike a balance on all 3 things and MOST IMPORTANTLY analyze what kind of table you have:
- If the table contains a large volume of data (a lot larger than normal tables), then the priority should go to data distribution as skewed data will adversely affect the whole system and give you pains on storage. Then you can go for Option 3.
- But in cases where it is a highly accessed table (compared to normal ones), where users want to get data out of the table frequently and as soon as possible, priority should go to Access Path. Then you can go for Option 2.
- But for cases where there's a balance on both Table VOLUME and ACCESSIBILITY, Option 4 will be the best since it would give you highly unique data distribution and still get fast results because of the help of partitions.
At the end of the day, it would all depend on the characteristics of the table and how much you can play around between giving importance on your storage and giving importance to your user getting faster results.
Lastly, monitoring these kind of tables will also be important as there are some cases where these tables are highly-accessed only during a particular period, for example, checking the results of a newly launched promo campaign but after which, may not be that important anymore.
Playing around on changing the primary indexes, maybe just giving more CPU% through Priority Scheduler/TDWM to the users at a particular point in time to yield faster results or creating a separate history table since frequent query access might be just for a particular set of data, are just some of the things which can be done to effectively handle such scenarios.
Hope this helps.