Here is the EXPLAIN, which is straightforward, but sometimes the same query runs in 10-15 minutes and sometimes it takes longer than 4-5 hours for the same data volume.
Explain SELECT COOKIE_ID, CAST(MAX(VISIT_START_DTTM) AS DATE) AS LATEST_VISIT_DATE,
MAX(CASE WHEN CAST(VISIT_START_DTTM AS DATE) BETWEEN '2009-05-01' AND '2010-04-30' THEN 'Y'
ELSE 'N' END) AS ACTIVITY_2yearsbefore,
MAX(CASE WHEN CAST(VISIT_START_DTTM AS DATE) BETWEEN '2010-05-01' AND '2011-04-30' THEN 'Y'
ELSE 'N' END) AS ACTIVITY_prevyear,
MAX(CASE WHEN CAST(VISIT_START_DTTM AS DATE) BETWEEN '2011-05-01' AND '2012-04-30' THEN 'Y'
ELSE 'N' END) AS ACTIVITY_currentyear
FROM CDW.VISIT A
WHERE A.LOCATION_ID = 2
AND CAST(VISIT_START_DTTM AS DATE) <= '2012-04-30'
GROUP BY COOKIE_ID
1) First, we lock CDW_Tables.VISIT in view CDW.VISIT for access.
2) Next, we do an all-AMPs SUM step to aggregate from
CDW_Tables.VISIT in view CDW.VISIT by way of an all-rows scan with
a condition of ("(CDW_Tables.VISIT in view CDW.VISIT.Location_Id =
2) AND ((CAST((CDW_Tables.VISIT in view CDW.VISIT.Visit_Start_Dttm)
AS DATE))<= DATE '2012-04-30')") , grouping by field1 (
CDW_Tables.VISIT.Cookie_Id). Aggregate Intermediate Results are
computed globally, then placed in Spool 4. The input table will
not be cached in memory, but it is eligible for synchronized
scanning. The aggregate spool file will not be cached in memory.
The size of Spool 4 is estimated with no confidence to be
141,283,585 rows (7,911,880,760 bytes). The estimated time for
this step is 2 minutes and 23 seconds.
3) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by way of
an all-rows scan into Spool 2 (group_amps), which is built locally
on the AMPs. The result spool file will not be cached in memory.
The size of Spool 2 is estimated with no confidence to be
141,283,585 rows (6,075,194,155 bytes). The estimated time for
this step is 6.21 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 2 are sent back to the user as the result of
statement 1. The total estimated time is 2 minutes and 29 seconds.
The UPI on the visit table is on VISIT_ID; I'm not sure why it would perform so differently under different CPU loads.
Any ways to tune this query? The table does not have any PPIs, and one would not make any difference to this query, since CAST(VISIT_START_DTTM AS DATE) <= '2012-04-30' actually pulls all the data: the MIN(VISIT_START_DTTM) in the table itself is 2009-05-01.
Are these queries logged in DBQL?
If yes, check whether the resource consumption is really different. You might be facing system workload issues.
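A minimal sketch of such a check, assuming DBQL logging is enabled and you can read DBC.DBQLogTbl (the LIKE filter and the 7-day window are illustrative; narrow them to your actual statement):

```sql
-- Compare resource usage across runs of the same statement (sketch)
SELECT StartTime,
       FirstRespTime,
       AMPCPUTime,        -- total CPU seconds summed over all AMPs
       TotalIOCount,      -- logical I/Os
       MaxAMPCPUTime,     -- hottest AMP; compare against the per-AMP average to gauge skew
       SpoolUsage
FROM   DBC.DBQLogTbl
WHERE  QueryText LIKE '%CDW.VISIT%'
AND    StartTime >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
ORDER BY StartTime;
```

If AMPCPUTime and TotalIOCount are roughly the same across the fast and slow runs, the variance is coming from the system workload, not from the query itself.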
The beginning of all sciences is the astonishment that things are as they are. - Aristotle
Yes, they are in DBQL. What do I really need to look at to understand resource consumption? Do I need to look at HotAmpCPU, CPU skew, I/O skew, etc.? Appreciate your response.
Check the AMPCPUTime. Has a database patch upgrade been done recently?
No upgrade as far as I know. I looked at DBQLogTbl and the AMP CPU Time was 18,367. Is this too high?
Let me know your thoughts on this!
Looks like it could be a statistics problem to me ("no confidence" everywhere in the plan). Collect the necessary stats, rerun the query, then see if your plan is much better.
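A sketch of the stats matching the plan's filter and grouping columns, assuming you can collect directly on the base table CDW_Tables.VISIT referenced in the EXPLAIN:

```sql
-- Stats on the columns the plan filters and groups by (sketch)
COLLECT STATISTICS ON CDW_Tables.VISIT COLUMN (LOCATION_ID);
COLLECT STATISTICS ON CDW_Tables.VISIT COLUMN (VISIT_START_DTTM);
COLLECT STATISTICS ON CDW_Tables.VISIT COLUMN (COOKIE_ID);
```

With these in place, the spool-size estimates should move from "no confidence" to at least "low/high confidence", which mainly helps the optimizer size spool and workload estimates correctly.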
you posted only one query log entry.
My understanding is that the different runs differ in execution/elapsed times. Do they differ in AMPCPUTime, TotalIOCount, etc.?
How could this plan be improved? It's doing a plain aggregation, no joins, etc. Stats might give better estimates but will not change the plan.
I would say it is just generally a good idea to collect them anyway. Whether it improves the query... who knows... pretty low-hanging fruit, though.
It is evident from the plan that the query is unable to use the PI of the table and is pulling the entire table into a global spool, meaning the data is first pulled into spool and then processed. I would try the following options:
(a) See if COOKIE_ID can be made the PI of the table without skewing it too much. This makes all aggregation happen locally on each AMP and run very fast even with high data volumes, owing to MPP. This may be the only real option to try; the rest are all guesses with a very slim chance of working.
(b) If the PI cannot be changed, a PPI with date ranges may work here if it is aligned with your CASE-statement filters, although I have a strong feeling it won't.
(c) Lastly, try converting your CASE statements into a small control table that stores the boundary values for each case and is joined with your master table. With correct stats the optimizer should replicate this table across all AMPs and partially restrict the data that goes into the spool. Again, the issue here is that if the filter conditions pull out the entire table anyway, this would make no difference.
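Options (a) and (b) could be sketched as DDL like the following. This is only an illustration: the table names VISIT_NUPI and VISIT_PPI are made up, and the copy would need the full column list, volume checks, and a cut-over plan in practice.

```sql
-- (a) Copy of the table with COOKIE_ID as a non-unique PI, so that
--     GROUP BY COOKIE_ID aggregates locally on each AMP (sketch).
CREATE TABLE CDW_Tables.VISIT_NUPI AS
( SELECT * FROM CDW_Tables.VISIT )
WITH DATA
PRIMARY INDEX (COOKIE_ID);

-- (b) Copy with a date-range PPI aligned to the yearly CASE buckets (sketch).
CREATE TABLE CDW_Tables.VISIT_PPI AS
( SELECT * FROM CDW_Tables.VISIT )
WITH NO DATA
PRIMARY INDEX (VISIT_ID)
PARTITION BY RANGE_N(CAST(VISIT_START_DTTM AS DATE)
                     BETWEEN DATE '2009-05-01' AND DATE '2012-04-30'
                     EACH INTERVAL '1' YEAR,
                     NO RANGE, UNKNOWN);
```

For (b) to help, the predicates would have to hit the partitioning expression directly; as noted above, a filter that selects the whole date range gains nothing from partition elimination.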
I'm curious to hear more tuning options from others based on only the explain plan.
I have re-written the query below. I added numbered comments that refer to footnotes explaining my reasoning.
select
  b.COOKIE_ID,
  max( b.VISIT_DATE ) as latest_visit_date, -- 7, 8
  max(case when b.VISIT_DATE between '2009-05-01' and '2010-04-30' -- 7
      then 'Y' else 'N' end) as ACTIVITY_2yearsbefore,
  max(case when b.VISIT_DATE between '2010-05-01' and '2011-04-30' -- 7
      then 'Y' else 'N' end) as ACTIVITY_prevyear,
  max(case when b.VISIT_DATE between '2011-05-01' and '2012-04-30' -- 7
      then 'Y' else 'N' end) as ACTIVITY_currentyear
from -- 1
( select
    a.COOKIE_ID,
    cast( a.VISIT_START_DTTM as date ) as VISIT_DATE
  from CDW.VISIT as a
  where -- 2
    a.LOCATION_ID = 2 and
    VISIT_DATE < date '2012-05-01' -- 3, 4, 5, 6
) as b
group by b.COOKIE_ID
1. Use a derived table to convert VISIT_START_DTTM to DATE format only once.
2. Put as many of the WHERE conditions inside the derived table as possible.
3. Use the alias name for the new date column wherever possible. This avoids potential additional, unnecessary re-casting of the same column and makes the query easier to read.
4. This might not be an issue with Teradata, but it is best practice for Oracle: use a single comparison when possible rather than a combination, because a combo like "<=" forces the engine to re-write the whole query as "<" OR "=", which is like running two separate UNION queries.
5. Here, because the "=" was dropped, the date constant had to be adjusted by one day.
6. It's best practice to convert constants explicitly to the DATE data type.
7. After being converted from a DATETIME column to a DATE column once, it can be referred to by its alias name - again, avoids potential re-casting and makes the query easier to read.
8. The original query actually treats VISIT_START_DTTM in two different ways, and I'm not sure whether that was on purpose or by accident. The original 2nd column FIRST took the maximum DATETIME value and THEN did the CAST; all the remaining original columns FIRST cast from DATETIME to DATE and THEN performed the MAX aggregation. I re-wrote the query to handle them all the same way; if that doesn't accomplish what you want, feel free to re-write.
Some remarks on the comments:
1. ok, but this only avoids cut-and-pasting the same calculation several times; it will not influence the optimizer. Unless you do a DISTINCT/GROUP BY/OLAP within the DT, the optimizer will remove it.
2. The optimizer will automatically push conditions into the derived table; again, this should not change the plan.
3. Re-using the alias is only convenient for the programmer; the resolver will replace it with the original calculation before it's passed to the optimizer.
4. I doubt this is true in Oracle (maybe it was decades ago); it's definitely not true in TD. Simply compare the Explains.
6. ok, but then you should do it within the CASEs, too :-)
And best practice would be to completely avoid the cast to DATE, as it's not needed at all (and as a side effect, existing statistics will be usable again). Instead of casting the column, change the literals, e.g.
VISIT_START_DTTM < TIMESTAMP '2012-05-01 00:00:00'
VISIT_START_DTTM < CAST(DATE '2012-05-01' AS TIMESTAMP)
The optimizer will treat both exactly the same.
I misstated #8. I should have written: "The original 2nd column was created by first taking the MAX and then doing the CAST, while the other columns first did the CAST and then took the MAX value."