I have a Neptune Gremlin query that should order vertices by the number of times they've been saved by other users in descending order. It works perfectly for vertices where the property value is > 0, but for some reason puts the vertices where the property is equal to zero at the top.
When adding the vertex, the property is created without quotes (so not as a string), and I am able to sum on the property when I increment it in other scenarios, so the values should all be numbers. Ordering in ascending order works as expected too (zero values come first and then the ordering is correct).
Has anyone seen this before or knows why it might be happening? I don't want to have to pre-filter out zero values.
The relevant part of my query is below; it behaves the same way (incorrect ordering), while the full query also returns some extra data that isn't relevant to this question. I have attached an image of the full query I'm using with the results it gives:

g.V().hasLabel('trip').order().by('numSaves', desc)

[Image: Query and results]
I was able to reproduce the issue thanks to the very helpful additional information. The issue seems to be related to the sum step. In the near term, the workaround of using fold().unfold() will work, as it causes a different code path to be taken through the query engine (a sketch of it follows the sack example below). I will update this answer when more information is available.

Another workaround that worked for me is to use a sack to do the "add one". It's not a very elegant query, but it does seem to avoid the ordering problem:
g.V("some-id").
property(single, "numSaves",
sack(assign).by('numSaves').
sack(sum).by(constant(1)).sack())
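For reference, the fold().unfold() workaround mentioned above, applied to the query from the question, would look something like this (a sketch: folding the traversal into a list and unfolding it again is what forces the alternate code path):

g.V().hasLabel('trip').fold().unfold().order().by('numSaves', desc)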
UPDATED July 29th 2021:
An Amazon Neptune update (1.0.5.0) was just released that contains a fix for this issue.
I've never dealt with a Teradata database before, and I need to find out the data types for a table in Teradata. I tried the queries below, but none of them worked:
describe table tablename;
show create table tablename;
help table tablename;
When I ran show table I realized that it is a view, so I tried:

show view viewname;
help view viewname;

None of the above queries gave me the data types for the view. I googled around and tried what I found, but nothing worked.
Please help me with a query to find out the data types in Teradata.
Thanks in advance
HELP COLUMN mytable.*; resolves views and returns the actual data types.

The description of the two-character ColumnType codes can be found in the Data Dictionary manual.
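For example, a minimal usage sketch, assuming the view is named myview:

HELP COLUMN myview.*;

The ColumnType values come back as those two-character codes (e.g. CV for VARCHAR, I for INTEGER, DA for DATE), which is why the Data Dictionary manual is needed to decode the less obvious ones.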
What others have already said is great if you need the whole table. However, if you just need one specific field and want the result to be readable (without having to dig around in the Data Dictionary manual), I sometimes find it simpler to do something like:
SELECT DISTINCT TYPE(Fieldname)
FROM mytable;
For many reasons, this isn't the solution you want to use every time, but I think it is worth adding to what others have already said.
I am facing a performance issue in one of my stored procedures.
Following is the pseudo-code:
CREATE OR REPLACE PROCEDURE SP_GET_EMPLOYEEDETAILS(
    P_EMP_ID IN NUMBER,
    CUR_OUT  OUT SYS_REFCURSOR)
IS
BEGIN
    OPEN CUR_OUT FOR
        SELECT EMP_NAME, EMAIL, DOB
        FROM T_EMPLOYEES
        WHERE EMP_ID = P_EMP_ID;
END;
The above stored procedure takes around 20 seconds to return the result set with, say, P_EMP_ID = 100.
However, if I hard-code employee ID as 100 in the stored procedure, the stored procedure returns the result set in 40 milliseconds.
So, the same stored procedure behaves differently for the same parameter value when the value is hard-coded instead of reading the parameter value.
The table T_EMPLOYEES has around 1 million records and there is an index on the EMP_ID column.
I would appreciate any help on how I can improve the performance of this stored procedure, or on what the problem might be.
This may be an issue with skewed data distribution and/or incomplete histograms and/or bad system tuning.
The fast version of the query is probably using an index. The slow version is probably doing a full-table-scan.
In order to know which to do, Oracle has to have an idea of the cardinality of the data (in your case, how many results will be returned). If it thinks a lot of results will be returned, it will go straight ahead and do a full-table-scan as it is not worth the overhead of using an index. If it thinks few results will be returned it will use an index to avoid scanning the whole table.
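To see which plan Oracle is actually choosing, here is a minimal sketch using the standard EXPLAIN PLAN / DBMS_XPLAN facilities (the bind name :p_emp_id is illustrative; run it once with the bind and once with the literal 100 and compare the two plans):

EXPLAIN PLAN FOR
SELECT EMP_NAME, EMAIL, DOB
FROM T_EMPLOYEES
WHERE EMP_ID = :p_emp_id;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);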
The issues are:
If using a literal value, Oracle knows exactly where to look in the histogram to see how many results would be returned. If using a bind variable, it is more complicated. Certainly, on Oracle 10 it didn't handle this well and just took a guess at the cardinality. On Oracle 11, I am not sure as it can do something called "bind variable peeking" - see SQL Plan Management.
Even if it does know the actual value, if your histogram is not up to date, it will get the wrong values (see the statistics-gathering sketch at the end of this answer).
Even if it works out an accurate guess as to how many results will be returned, you are still dependent on the Oracle system parameters being correct.
For this last point: basically, Oracle has some parameters that tell it how fast it thinks a full-table scan is versus an index look-up. If these are not correct, it may do an FTS even when it is a lot slower. See Burleson.
My experience is that Oracle tends to flip to doing FTS way too early. Ideally, as the result set grows in size there should be a smooth transition in performance at the point where it goes from using an index to using an FTS, but in practice the systems seem to be set up to favour bulk work.
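If stale statistics turn out to be the culprit (the histogram point above), regathering them with a histogram on EMP_ID is the usual fix. A sketch, assuming you run it as the owning schema:

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,                            -- assumes the owning schema
    tabname    => 'T_EMPLOYEES',
    method_opt => 'FOR COLUMNS EMP_ID SIZE AUTO',  -- histogram on EMP_ID
    cascade    => TRUE);                           -- refresh index stats too
END;
/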
I have the following problem:

I have to correct a report in AX 4.0, which was created with the standard report framework of AX, and I don't want to rework the whole report for this.

How can I set up the datasources so that records from Table_A which have the same value in Field_A as other records from Table_A have in Field_B are both no longer displayed?

It's driving me crazy right now, because I haven't found any solution for this, even though it doesn't seem that complicated.
You should work on the Query associated with the report's datasource, but it's difficult to give a more precise tip without knowing more about the problem.
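That said, one possible shape for it (a hedged X++ sketch, not a tested solution: the NotExists self-join is my assumption about how to express the rule, and as written it removes only one side of each match; a second NotExists datasource with the link reversed would remove the other side):

// In the report's init(), before the query runs: exclude Table_A records
// whose Field_A value occurs as Field_B on another Table_A record.
Query                query;
QueryBuildDataSource qbdsA;
QueryBuildDataSource qbdsB;

query = element.query();
qbdsA = query.dataSourceTable(tablenum(Table_A));
qbdsB = qbdsA.addDataSource(tablenum(Table_A));
qbdsB.joinMode(JoinMode::NoExistsJoin);
qbdsB.addLink(fieldnum(Table_A, Field_A), fieldnum(Table_A, Field_B));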
I have a page with 4 tabs displaying 4 different reports based on different tables.

I obtain the row count of each table using a select count(*) from <table> query and display the number of rows available in each table on its tab. As a result, each page postback causes 5 count(*) queries to be executed (4 to get the tab counts and 1 for pagination) plus 1 query to fetch the report content.

Now my question is: are count(*) queries really expensive? Should I keep the row counts (at least those displayed on the tabs) in the page's view state instead of querying for them multiple times?
How expensive are COUNT(*) queries?
In general, the cost of COUNT(*) is proportional to the number of records satisfying the query conditions, plus the time required to prepare those records (which depends on the underlying query complexity).
In simple cases where you're dealing with a single table, there are often specific optimisations in place to make such an operation cheap. For example, COUNT(*) without WHERE conditions on a single MyISAM table in MySQL is instantaneous, because the row count is stored in the table metadata.
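To illustrate (a sketch; the table and column names are made up):

-- Instantaneous on MyISAM: the row count is read from table metadata.
SELECT COUNT(*) FROM my_myisam_table;

-- Not instantaneous: the WHERE clause forces the rows to actually be counted.
SELECT COUNT(*) FROM my_myisam_table WHERE created_at > '2010-01-01';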
Now let's consider two queries:
SELECT COUNT(*)
FROM largeTableA a
Since every record satisfies the query, the COUNT(*) cost is proportional to the number of records in the table, i.e., proportional to what it returns (assuming the engine actually needs to visit the rows and no metadata optimisation applies).
SELECT COUNT(*)
FROM largeTableA a
JOIN largeTableB b
ON a.id = b.id
In this case, the engine will most probably use a HASH JOIN, and the execution plan will look something like this:

1. Build a hash table on the smaller of the two tables.
2. Scan the larger table, looking up each record in the hash table.
3. Count the matches as it goes.
In this case, the COUNT(*) overhead (step 3) will be negligible and the query time will be completely defined by steps 1 and 2, that is building the hash table and looking it up. For such a query, the time will be O(a + b): it does not really depend on the number of matches.
However, if there are indexes on both a.id and b.id, a MERGE JOIN may be chosen instead, and the COUNT(*) time will again be proportional to the number of matches, since an index seek is performed after each match.
You need to attach SQL Profiler or an app-level profiler like L2SProf and look at the real query costs in your context before:

guessing what the problem is and trying to determine the likely benefits of a potential solution

allowing others to guess for you on the interwebs - there's lots of uncited misinformation out there, including in this thread (but not in this post :P)
When you've done that, it'll be clear what the best approach is - i.e., whether the SELECT COUNT is dominating things or not, etc.
And having done that, you'll also know whether any changes you choose to make have had a positive or a negative impact.
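If you don't have Profiler to hand, a lighter-weight alternative for SQL Server (a sketch of a different technique than the profilers named above: it prints per-query CPU time and I/O in the SSMS Messages tab):

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Table1 stands in for one of your four report tables.
SELECT COUNT(*) FROM Table1;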
As others have said, COUNT(*) always physically counts rows, so if you can do that once and cache the result, that's certainly preferable.
If you benchmark and determine that the cost is negligible, you don't (currently) have a problem.
If it turns out to be too expensive for your scenario, you could make your pagination "fuzzy", as in "Showing 1 to 500 of approx. 30,000", by using

SELECT rows FROM sysindexes WHERE id = OBJECT_ID('sometable') AND indid < 2

which will return an approximation of the number of rows (it's approximate because it's not updated until a CHECKPOINT).
If the page gets slow, one thing you can look at is minimizing the number of database roundtrips, if at all possible. Even if your COUNT(*) queries are O(1), if you're doing enough of them, that could certainly slow things down.
Instead of setting up and executing 5 separate queries one at a time, run the SELECT statements in a single batch and process the 5 results at once.
I.e., if you're using ADO.NET, do something like this (error checking omitted for brevity; non-looped/non-dynamic for clarity):
string sql = "SELECT COUNT(*) FROM Table1; SELECT COUNT(*) FROM Table2;";
SqlCommand cmd = new SqlCommand(sql, connection);
SqlDataReader dr = cmd.ExecuteReader();
// Defaults to first result set
dr.Read();
int table1Count = (int)dr[0];
// Move to second result set
dr.NextResult();
dr.Read();
int table2Count = (int)dr[0];
If you're using an ORM of some sort, such as NHibernate, there should be a way to enable automatic query batching.
COUNT(*) can be particularly expensive, as it may result in loading (and paging through) an entire table when you may only need a count on a primary key (in some implementations this is optimised).

From the sound of it, you are causing a table-load operation each time, which is slow, but unless it is running noticeably slowly or causing some other problem, don't optimise: premature and unnecessary optimisation can cause a great deal of trouble!

A count on an indexed primary key will be much faster, but given the cost of maintaining the index this may provide no net benefit.
All I/O is expensive and if you can accomplish the task without it, you should. But if it's needed, I wouldn't worry about it.
You mention storing the counts in view state - certainly an option, as long as the behavior of the code is acceptable when that count is wrong because the underlying records have been deleted or added.
This depends on what you are doing with the data in this table. If it changes very often and you need all the counts every time, you could create a trigger that maintains another table containing only the counts for this table. If you need to show the data separately, maybe you could just execute "select count(*)..." for one particular table. This just came to my mind instantly, but there are other ways to speed this up, I'm sure. Cache data, maybe? :)
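As a sketch of that trigger idea (T-SQL; all names here are made up, and you would need one trigger per counted table, plus a seed row per table):

CREATE TABLE RowCounts (TableName sysname PRIMARY KEY, Cnt int NOT NULL);

CREATE TRIGGER trg_Table1_RowCount ON Table1
AFTER INSERT, DELETE
AS
BEGIN
    -- Adjust the stored count by the net number of rows inserted/deleted.
    UPDATE RowCounts
    SET Cnt = Cnt + (SELECT COUNT(*) FROM inserted)
                  - (SELECT COUNT(*) FROM deleted)
    WHERE TableName = 'Table1';
END;

Each tab count then becomes a cheap single-row lookup: SELECT Cnt FROM RowCounts WHERE TableName = 'Table1'.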