I’ve been struggling with this one for while, would appreciate some advice. A bit context about the source and destination tables involved in the update/merge query I'm using:
SRC Table
Format_Code ACR
----------------------------
BAD 5
SAD 7
MAD 2
SRC is created via a select statement joining on two tables; select statement looks something like this:
Select distinct a.Format_Code, b.ACR from Formats a
Inner join Codes b on lower(a.Format_Name) = lower(b.Format_Name)
I’m attempting to use SRC to update a destination table (DES), joining/matching on Format_Code as follows:
Merge Into Inventory DES
Using
(
Select distinct a.Format_Code, b.ACR from Formats a
Inner join Codes b on lower(a.Format_Name) = lower(b.Format_Name)
) SRC
On DES.Format_Code = SRC.Format_Code
When Matched Then Update set DES.ACR = SRC.ACR
I get the following error (because of the duplicates in the destination table I think) but not sure how to ignore/bypass them. SRC contains no duplicates but DES does have duplicate Format_Code. During the update I’d like to either update just one instance of the duplicate rows or ignore duplicates entirely (small number of duplicates so I can update manually if necessary)
SQL Error: ORA-30926: unable to get a stable set of rows in the source tables
30926. 00000 - "unable to get a stable set of rows in the source tables"
*Cause: A stable set of rows could not be got because of large dml
activity or a non-deterministic where clause.
*Action: Remove any non-deterministic where clauses and reissue the dml.
It’s my first time posting, so apologies if I’ve made any rookie mistakes
... because of large dml activity ...
Try to COMMIT, and then run your MERGE. Any improvement?
Related
There are three tables mentioned below, I eventually want to bring in a field from Table3 to Table1 (but the only way to join these two tables is via a common field present in Table2)
Table 1: Application Insights-30 days data (datasize ~4,000,000)
Table 2: Kusto based table (datasize: 1,080,153)
Table 3: Kusto based table (datasize: 38,815,878)
I was not able to join the tables directly, So, I used various filter conditions, distinct operators and split the month data to 4 weeks and then used union to join all 3 tables and got the resultant table.
However, now I am unable to perform any operations on the resultant table (even |count doesn't work)
I get the following error
Query execution has exceeded the allowed limits (80DA0003):
Any help in handling such cases would be helpful
Please check article how to control the query limits:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/concepts/querylimits#limit-on-result-set-size-result-truncation
When using the join operator, make sure that the table with fewer rows is the first one (left-most in query).
See more Best Practices here.
My Teradata query creates a volatile that is used to join to existing views. When linking query to excel the following error pops up: "Teradata: [Teradata Database] [3932] Only an ET or null statement is legal after a DDL Statement". Is there a workaround for this for someone that does not have write permissions in teradata to create a real view or table? I want to avoid linking to Teradata in SQL and running an open query to pull in the data needed.
This is for Excel 2016 64bit and using Teradata version 15.10.1.12
Normally this error will occur if you are using ANSI mode or have issued a BT (Begin Transaction) in BTET mode.
Here are a few workarounds to try:
Issue an ET; statement (commit) after the create volatile table statement. If you are using ANSI mode, use COMMIT; instead of ET;. If you are unsure, try each one in turn. Only one will be valid but both do the same thing. Make sure your Volatile table includes ON COMMIT PRESERVE ROWS
Try using BT ET mode (a.k.a. Teradata mode) when establishing the session. I do not remember where but there will be a setting in the ODBC configuration for this.
Try using a Global Temporary table. These work similarly to Volatile tables except you define them once and the definition sticks around. That is, you can create it in, say BTEQ, or SQL assistant etc. The definition is common to all users and sessions (i.e. your Excel session), but the content is transient and unique to each session (like a volatile table).
Move the select part of your insert into the volatile table into the query that selects the data from the volatile table. See simple example below.
If you do not have create Global Temporary table permissions, ask your DBA.
Here is a simple example to illustrate point 4.
Current:
create volatile table tmp (id Integer)
ON COMMIT PRESERVE ROWS;
insert into tmp
select customer_number
from customer
where X = Y and yr = 2019
;
select a,b,c
from another_tbl A join TMP T ON
A.id = T.id
;
Becomes:
select a,b,c
from another_tbl A join (
select customer_number
from customer
where X = Y and yr = 2019
) AS T
ON
A.id = T.id
;
Or better yet, just Join your tables directly.
Note The first sequence (create table, Insert into and select) is a three statement series. This will return 3 "result sets". The first two will be row counts the last will be the actual data. Most programs (including I think Excel) can not process multiple result set responses. This is one of the reasons it is difficult to use Teradata Macros with client tools like Excel.
The latter solution (a single select) avoids this potential problem.
So this is essentially a follow-up question on Finding duplicate records.
We perform data imports from text files everyday and we ended up importing 10163 records spread across 182 files twice. On running the query mentioned above to find duplicates, the total count of records we got is 10174, which is 11 records more than what are contained in the files. I assumed about the posibility of 2 records that are exactly the same and are valid ones being accounted for as well in the query. So I thought it would be best to use a timestamp field and simply find all the records that ran today (and hence ended up adding duplicate rows). I used ORA_ROWSCN using the following query:
select count(*) from my_table
where TRUNC(SCN_TO_TIMESTAMP(ORA_ROWSCN)) = '01-MAR-2012'
;
However, the count is still more i.e. 10168. Now, I am pretty sure that the total lines in the file is 10163 by running the following command in the folder that contains all the files. wc -l *.txt.
Is it possible to find out which rows are actually inserted twice?
By default, ORA_ROWSCN is stored at the block level, not at the row level. It is only stored at the row level if the table was originally built with ROWDEPENDENCIES enabled. Assuming that you can fit many rows of your table in a single block and that you're not using the APPEND hint to insert the new data above the existing high water mark of the table, you are likely inserting new data into blocks that already have some existing data in them. By default, that is going to change the ORA_ROWSCN of every row in the block causing your query to count more rows than were actually inserted.
Since ORA_ROWSCN is only guaranteed to be an upper-bound on the last time there was DML on a row, it would be much more common to determine how many rows were inserted today by adding a CREATE_DATE column to the table that defaults to SYSDATE or to rely on SQL%ROWCOUNT after your INSERT ran (assuming, of course, that you are using a single INSERT statement to insert all the rows).
Generally, using the ORA_ROWSCN and the SCN_TO_TIMESTAMP function is going to be a problematic way to identify when a row was inserted even if the table is built with ROWDEPENDENCIES. ORA_ROWSCN returns an Oracle SCN which is a System Change Number. This is a unique identifier for a particular change (i.e. a transaction). As such, there is no direct link between a SCN and a time-- my database might be generating SCN's a million times more quickly than yours and my SCN 1 may be years different from your SCN 1. The Oracle background process SMON maintains a table that maps SCN values to approximate timestamps but it only maintains that data for a limited period of time-- otherwise, your database would end up with a multi-billion row table that was just storing SCN to timestamp mappings. If the row was inserted more than, say, a week ago (and the exact limit depends on the database and database version), SCN_TO_TIMESTAMP won't be able to convert the SCN to a timestamp and will return an error.
I have a very simple small database, 2 of tables are:
Node (Node_ID, Node_name, Node_Date) : Node_ID is primary key
Citation (Origin_Id, Target_Id) : PRIMARY KEY (Origin_Id, Target_Id) each is FK in Node
Now I write a query that first find all citations that their Origin_Id has a specific date and then I want to know what are the target dates of these records.
I'm using sqlite in python the Node table has 3000 record and Citation has 9000 records,
and my query is like this in a function:
def cited_years_list(self, date):
c=self.cur
try:
c.execute("""select n.Node_Date,count(*) from Node n INNER JOIN
(select c.Origin_Id AS Origin_Id, c.Target_Id AS Target_Id, n.Node_Date AS
Date from CITATION c INNER JOIN NODE n ON c.Origin_Id=n.Node_Id where
CAST(n.Node_Date as INT)={0}) VW ON VW.Target_Id=n.Node_Id
GROUP BY n.Node_Date;""".format(date))
cited_years=c.fetchall()
self.conn.commit()
print('Cited Years are : \n ',str(cited_years))
except Exception as e:
print('Cited Years retrival failed ',e)
return cited_years
Then I call this function for some specific years, But it's crazy slowwwwwwwww :( (around 1 min for a specific year)
Although my query works fine, it is slow. would you please give me a suggestion to make it faster? I'd appreciate any idea about optimizing this query :)
I also should mention that I have indices on Origin_Id and Target_Id, so the inner join should be pretty fast, but it's not!!!
If this script runs over a period of time, you may consider loading the database into memory. Since you seem to be coding in python, there is a connection function called connection.backup that can backup an entire database into memory. Since memory is much faster than disk, this should increase speed. Of course, this doesn''t do anything to optimize the statement itself, since I don't have enough of the code to evaluate what it is you are doing with the code.
Instead of COUNT(*) use MAX(n.Node_Date)
SQLite doesn't keep a counter on number of tables like mysql does but instead it scans all your rows everytime you call COUNT meaning extremely slow.. yet you can use MAX() to fix that problem.
I've got a database table with a very large amount of rows. This table represents messages that are logged by a system. Each message has a message type and this is stored it it's own field in the table. I'm writing a website for querying this message log. If I want to search by message type then ideally I would want to have a drop down box listing the message types that have come up in the database. Message types may change over time so I can't hard code the types into the drop down. I'll have to do some sort of lookup. Iterating over the entire table contents to find unique message values is obviously very stupid however being stupid in the database field I'm here asking for a better way. Perhaps a separate lookup table which the database occasionally updates listing just the unique message types that I can populate my drop down from would be a better idea.
Any suggestions would be much appreciated.
The platform I'm using is ASP.NET MVC and SQL Server 2005
A separate lookup table with the id of the message type stored in your log. This will reduce the size and increase the efficiency of the log. Also it would Normalize your data.
Yep, I would definitely go with the separate lookup table. You can then populate it using something like:
INSERT TypeLookup (Type)
SELECT DISTINCT Type
FROM BigMassiveTable
You could then run a top-up job periodically to pull in new types from your main table that don't already exist in the lookup table.
SELECT DISTINCT message_type
FROM message_log
is the most straightforward but not very efficient way.
If you have a list of types that can possibly appear in the log, use this:
SELECT message_type
FROM message_types mt
WHERE message_type IN
(
SELECT message_type
FROM message_log
)
This will be more efficient if message_log.message_type is indexed.
If you don't have this table but want to create one, and message_log.message_type is indexed, use a recursive CTE to emulate loose index scan:
WITH rows (message_type) AS
(
SELECT MIN(message_type) AS mm
FROM message_log
UNION ALL
SELECT message_type
FROM (
SELECT mn.message_type, ROW_NUMBER() OVER (ORDER BY mn.message_type) AS rn
FROM rows r
JOIN message_type mn
ON mn.message_type > r.message_type
WHERE r.message_type IS NOT NULL
) q
WHERE rn = 1
)
SELECT message_type
FROM rows r
OPTION (MAXRECURSION 0)
I just wanted to state the obvious: normalize the data.
message_types
message_type | message_type_name
messages
message_id | message_type | message_type_name
Then you can just do without any cached DISTINCT:
For your dropdown
SELECT * FROM message_types
For your retrieval
SELECT * FROM messages WHERE message_type = ?
SELECT m.*, mt.message_type_name FROM messages AS m
JOIN message_types AS mt
ON ( m.message_type = mt.message_type)
I'm not sure why you would want a cached DISTINCT which you'll have to update, when you can slightly tweak the schema and have one with RI.
Create an index on the message type:
CREATE INDEX IX_Messages_MessageType ON Messages (MessageType)
Then to get a list of unique Message Types, you run:
SELECT DISTINCT MessageType
FROM Messages
ORDER BY MessageType
Because the index is physically sorted in order of MessageType SQL Server can very quickly, and efficiently, scan through the index, picking up a list of unique message types.
It is not bad performing - it's what SQL Server is good at.
Admittedly, you can save some space by having a "message types" table. And if you only display a few messages at a time: then the bookmark lookup, as it joins back to the MessageTypes table, won't be a problem. But if you start displaying hundreds or thousands of messages at a time, then the join back to MessageTypes can get pretty expensive, and needless, and it will be faster to have the MessageType stored with the message.
But i would have no problem with creating an index on the MessageType column, and selecting distinct. SQL Server loves that sort of thing. But if you're finding it to be a real load on your server, once you're getting dozens of hits a second, then follow the other suggestion and cache them in memory.
My personal solution would be:
create the index
select distinct
and if i still had problems
cache in memory that expires after 30 seconds
As for the normalized/denormalized issue. Normalizing saves space, at the cost of CPU when joins are constantly performed. But the logical point of denoralization is to avoid duplicate data, which can lead to inconsistent data.
Are you planning on changing the text of a message type, which if you stored with the messages you would have to update all rows?
Or is there something to be said for the fact that at the time of the message the message type was "Client response requested"?
Have you considered an indexed view? Its result set is materialized and persists in storage so that the overhead of the lookup is separated from the rest of whatever you're trying to do.
SQL Server takes care of automagically updating the view when there is a data change which in its opinion would change the contents of the view, so in this respect it's less flexible than Oracle materialized.
The MessageType should be a Foreign Key in the main table to a definition table containing the message type codes and descriptions. This will greatly increase your lookup performance.
Something like
DECLARE #MessageTypes TABLE(
MessageTypeCode VARCHAR(10),
MessageTypeDesciption VARCHAR(100)
)
DECLARE #Messages TABLE(
MessageTypeCode VARCHAR(10),
MessageValue VARCHAR(MAX),
MessageLogDate DATETIME,
AdditionalNotes VARCHAR(MAX)
)
From this design, your lookup should only query MessageTypes
As others have said, create a separate table of message types. When you add a record to the message table, check if the message type already exists in the table. If not, add it. In either case, then post the identifier from the message type table into the message table. This should give you normalized data. Yes, it's a little extra time when you add a record, but should be more efficient on retrieval.
If there are a lot more adds then reads and if the "message type" is short, an entirely different approach would be to still create the separate message type table, but don't reference it when doing adds, and only update it lazily, on demand.
Namely, (a) Include a time-stamp in each message record. (b) Keep a list of the message types found as of the last time you checked. (c) Each time you check, search for any new message types added since the last time, as in:
create table temp_new_types as
(select distinct message_type
from message
where timestamp>last_type_check
);
insert into message_type_list (message_type)
select message_type
from temp_new_types
where message_type not in (select message_type from message_type_list);
drop table temp_new_types;
Then store the timestamp of this check somewhere so you can use it the next time around.
The answer is to use 'DISTINCT' and each best solution is different for different sizes of table. Thousands of rows, millions, billions ? more ? This are very different best solutions.