Efficiently finding unique values in a database table

I've got a database table with a very large number of rows. This table represents messages that are logged by a system. Each message has a message type, which is stored in its own field in the table. I'm writing a website for querying this message log. If I want to search by message type, then ideally I would have a drop-down box listing the message types that have come up in the database. Message types may change over time, so I can't hard-code the types into the drop-down; I'll have to do some sort of lookup. Iterating over the entire table contents to find the unique message types is obviously very inefficient, but since I'm no expert in the database field I'm here asking for a better way. Perhaps a separate lookup table, which the database occasionally updates and which lists just the unique message types, would be a better source for populating my drop-down.
Any suggestions would be much appreciated.
The platform I'm using is ASP.NET MVC and SQL Server 2005

Use a separate lookup table for the message types and store only the id of the message type in your log. This will reduce the size of the log and make it more efficient. It would also normalize your data.

Yep, I would definitely go with the separate lookup table. You can then populate it using something like:
INSERT TypeLookup (Type)
SELECT DISTINCT Type
FROM BigMassiveTable
You could then run a top-up job periodically to pull in new types from your main table that don't already exist in the lookup table.
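A minimal sketch of the populate-then-top-up idea, written against SQLite through Python's sqlite3 module so it can be run anywhere; it is not SQL Server code. The table and column names follow the statement above, while the Body column, the sample rows and the top_up function name are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the real database; names follow the answer above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BigMassiveTable (Id INTEGER PRIMARY KEY, Type TEXT, Body TEXT);
    CREATE TABLE TypeLookup (Type TEXT PRIMARY KEY);
""")


def top_up(conn):
    """Copy into TypeLookup every type that is not there yet."""
    conn.execute("""
        INSERT INTO TypeLookup (Type)
        SELECT DISTINCT b.Type
        FROM BigMassiveTable AS b
        WHERE b.Type IS NOT NULL
          AND NOT EXISTS (SELECT 1 FROM TypeLookup AS t WHERE t.Type = b.Type)
    """)
    conn.commit()


def lookup_types(conn):
    """Read the (small) lookup table, e.g. to fill the drop-down."""
    return [row[0] for row in conn.execute("SELECT Type FROM TypeLookup ORDER BY Type")]


conn.executemany("INSERT INTO BigMassiveTable (Type, Body) VALUES (?, ?)",
                 [("INFO", "a"), ("ERROR", "b"), ("INFO", "c")])
top_up(conn)                      # initial population
after_first = lookup_types(conn)

conn.execute("INSERT INTO BigMassiveTable (Type, Body) VALUES ('WARN', 'd')")
top_up(conn)                      # periodic job: only the new type is added
after_second = lookup_types(conn)
```

Running the same function for both the initial load and the periodic job keeps it idempotent, so it does no harm if the job fires more often than needed.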

SELECT DISTINCT message_type
FROM message_log
is the most straightforward but not very efficient way.
If you have a list of types that can possibly appear in the log, use this:
SELECT message_type
FROM message_types mt
WHERE message_type IN
(
SELECT message_type
FROM message_log
)
This will be more efficient if message_log.message_type is indexed.
If you don't have this table but want to create one, and message_log.message_type is indexed, use a recursive CTE to emulate a loose index scan:
WITH rows (message_type) AS
(
SELECT MIN(message_type) AS mm
FROM message_log
UNION ALL
SELECT message_type
FROM (
SELECT mn.message_type, ROW_NUMBER() OVER (ORDER BY mn.message_type) AS rn
FROM rows r
JOIN message_log mn
ON mn.message_type > r.message_type
WHERE r.message_type IS NOT NULL
) q
WHERE rn = 1
)
SELECT message_type
FROM rows r
OPTION (MAXRECURSION 0)
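The same loose-index-scan idea can be tried out in SQLite via Python's sqlite3 module. This is only a sketch of the technique, not the SQL Server query above: the recursive step here uses a correlated subquery with LIMIT 1 instead of ROW_NUMBER(), and the index name and sample data are made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE message_log (id INTEGER PRIMARY KEY, message_type TEXT);
    CREATE INDEX ix_message_log_type ON message_log (message_type);
""")
conn.executemany("INSERT INTO message_log (message_type) VALUES (?)",
                 [("warn",), ("info",), ("error",), ("info",), ("warn",)])

# Start at the smallest type, then repeatedly jump to the next larger one.
# Each step is a single index seek, so the work grows with the number of
# distinct types rather than with the number of rows.
LOOSE_SCAN = """
WITH RECURSIVE types (message_type) AS (
    SELECT MIN(message_type) FROM message_log
    UNION ALL
    SELECT (SELECT l.message_type
            FROM message_log AS l
            WHERE l.message_type > t.message_type
            ORDER BY l.message_type
            LIMIT 1)
    FROM types AS t
    WHERE t.message_type IS NOT NULL
)
SELECT message_type FROM types WHERE message_type IS NOT NULL
"""

distinct_types = [row[0] for row in conn.execute(LOOSE_SCAN)]
```

The recursion ends when the subquery finds no larger value and yields NULL; that final NULL row is filtered out by the outer SELECT.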

I just wanted to state the obvious: normalize the data.
message_types
message_type | message_type_name
messages
message_id | message_type | message_type_name
Then you can just do without any cached DISTINCT:
For your dropdown
SELECT * FROM message_types
For your retrieval
SELECT * FROM messages WHERE message_type = ?
SELECT m.*, mt.message_type_name FROM messages AS m
JOIN message_types AS mt
ON ( m.message_type = mt.message_type)
I'm not sure why you would want a cached DISTINCT which you'll have to update, when you can slightly tweak the schema and have a list that is kept correct by referential integrity (RI).

Create an index on the message type:
CREATE INDEX IX_Messages_MessageType ON Messages (MessageType)
Then to get a list of unique Message Types, you run:
SELECT DISTINCT MessageType
FROM Messages
ORDER BY MessageType
Because the index is physically sorted in order of MessageType, SQL Server can very quickly and efficiently scan through the index, picking up a list of unique message types.
This does not perform badly; it's what SQL Server is good at.
Admittedly, you can save some space by having a "message types" table. And if you only display a few messages at a time, then the bookmark lookup, as it joins back to the MessageTypes table, won't be a problem. But if you start displaying hundreds or thousands of messages at a time, then the join back to MessageTypes can get pretty expensive, and needlessly so, and it will be faster to have the MessageType stored with the message.
But I would have no problem with creating an index on the MessageType column and selecting distinct. SQL Server loves that sort of thing. If you find it to be a real load on your server once you're getting dozens of hits a second, then follow the other suggestion and cache the list in memory.
My personal solution would be:
create the index
select distinct
and if I still had problems
cache in memory that expires after 30 seconds
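The three steps above (index, SELECT DISTINCT, short-lived cache) could be sketched like this. It uses Python and an in-memory SQLite database purely as a stand-in for the ASP.NET / SQL Server setup in the question; the CachedTypeList class, its injectable clock and the sample rows are all hypothetical names invented for the sketch.

```python
import sqlite3
import time


class CachedTypeList:
    """Keeps the distinct message types in memory for ttl_seconds."""

    def __init__(self, conn, ttl_seconds=30.0, clock=time.monotonic):
        self._conn = conn
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be exercised in a test
        self._expires_at = None
        self._types = []

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            rows = self._conn.execute(
                "SELECT DISTINCT MessageType FROM Messages ORDER BY MessageType")
            self._types = [r[0] for r in rows]
            self._expires_at = now + self._ttl
        return list(self._types)   # copy, so callers cannot alter the cache


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Messages (Id INTEGER PRIMARY KEY, MessageType TEXT)")
conn.execute("CREATE INDEX IX_Messages_MessageType ON Messages (MessageType)")
conn.executemany("INSERT INTO Messages (MessageType) VALUES (?)",
                 [("Info",), ("Error",), ("Info",)])

fake_now = [0.0]                   # fake clock, in seconds
cache = CachedTypeList(conn, ttl_seconds=30.0, clock=lambda: fake_now[0])

first = cache.get()                # runs the query
conn.execute("INSERT INTO Messages (MessageType) VALUES ('Warning')")
fake_now[0] = 10.0
still_cached = cache.get()         # within 30 s: served from memory
fake_now[0] = 31.0
refreshed = cache.get()            # expired: the query runs again
```

The trade-off is that a brand-new message type can take up to the expiry time to show up in the drop-down.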
As for the normalized/denormalized issue: normalizing saves space, at the cost of CPU when joins are constantly performed. But the logical point of normalization is to avoid duplicate data, which can lead to inconsistent data.
Are you planning on changing the text of a message type, which if you stored with the messages you would have to update all rows?
Or is there something to be said for the fact that at the time of the message the message type was "Client response requested"?

Have you considered an indexed view? Its result set is materialized and persists in storage so that the overhead of the lookup is separated from the rest of whatever you're trying to do.
SQL Server takes care of automagically updating the view when there is a data change which in its opinion would change the contents of the view, so in this respect it's less flexible than Oracle materialized views.

The MessageType should be a Foreign Key in the main table to a definition table containing the message type codes and descriptions. This will greatly increase your lookup performance.
Something like
DECLARE @MessageTypes TABLE(
MessageTypeCode VARCHAR(10),
MessageTypeDescription VARCHAR(100)
)
DECLARE @Messages TABLE(
MessageTypeCode VARCHAR(10),
MessageValue VARCHAR(MAX),
MessageLogDate DATETIME,
AdditionalNotes VARCHAR(MAX)
)
From this design, your lookup should only query MessageTypes

As others have said, create a separate table of message types. When you add a record to the message table, check if the message type already exists in the table. If not, add it. In either case, then post the identifier from the message type table into the message table. This should give you normalized data. Yes, it's a little extra time when you add a record, but should be more efficient on retrieval.
If there are a lot more adds than reads and if the "message type" is short, an entirely different approach would be to still create the separate message type table, but not reference it when doing adds, and only update it lazily, on demand.
Namely, (a) Include a time-stamp in each message record. (b) Keep a list of the message types found as of the last time you checked. (c) Each time you check, search for any new message types added since the last time, as in:
create table temp_new_types as
(select distinct message_type
from message
where timestamp>last_type_check
);
insert into message_type_list (message_type)
select message_type
from temp_new_types
where message_type not in (select message_type from message_type_list);
drop table temp_new_types;
Then store the timestamp of this check somewhere so you can use it the next time around.
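Here is a rough, runnable sketch of steps (a) to (c), again using SQLite through Python's sqlite3 rather than any particular server. The one-row type_check_state table that stores the last check time, the logged_at column name and the integer timestamps are assumptions for the demo, and INSERT OR IGNORE is SQLite-specific shorthand for the NOT IN filter shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE message (id INTEGER PRIMARY KEY, message_type TEXT, logged_at INTEGER);
    CREATE TABLE message_type_list (message_type TEXT PRIMARY KEY);
    -- hypothetical one-row table remembering when we last looked
    CREATE TABLE type_check_state (last_type_check INTEGER NOT NULL);
    INSERT INTO type_check_state VALUES (0);
""")


def refresh_types(conn, now):
    """Scan only messages logged since the last check, then move the marker."""
    (last_check,) = conn.execute(
        "SELECT last_type_check FROM type_check_state").fetchone()
    with conn:  # insert and marker update commit together
        conn.execute("""
            INSERT OR IGNORE INTO message_type_list (message_type)
            SELECT DISTINCT message_type
            FROM message
            WHERE logged_at > ? AND logged_at <= ?
              AND message_type IS NOT NULL
        """, (last_check, now))
        conn.execute("UPDATE type_check_state SET last_type_check = ?", (now,))


def known_types(conn):
    return [r[0] for r in conn.execute(
        "SELECT message_type FROM message_type_list ORDER BY message_type")]


conn.executemany("INSERT INTO message (message_type, logged_at) VALUES (?, ?)",
                 [("info", 100), ("error", 110), ("info", 120)])
refresh_types(conn, now=150)
after_first = known_types(conn)

conn.executemany("INSERT INTO message (message_type, logged_at) VALUES (?, ?)",
                 [("audit", 200), ("info", 210)])
refresh_types(conn, now=250)       # only rows with logged_at in (150, 250] are scanned
after_second = known_types(conn)
```

An index on the timestamp column is what makes the "since the last check" scan cheap; without one the filter still has to read the whole table.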

The answer is to use 'DISTINCT', but the best solution differs with the size of the table. Thousands of rows? Millions? Billions? More? Each of these calls for a very different solution.

Related

Efficient insertion of row and foreign table row if it does not exist

Similar to this question and this solution for PostgreSQL (in particular "INSERT missing FK rows at the same time"):
Suppose I am making an address book with a "Groups" table and a "Contact" table. When I create a new Contact, I may want to place them into a Group at the same time. So I could do:
INSERT INTO Contact VALUES (
'Bob',
(SELECT group_id FROM Groups WHERE name = 'Friends')
)
But what if the "Friends" Group doesn't exist yet? Can we insert this new Group efficiently?
The obvious thing is to do a SELECT to test if the Group exists already; if not do an INSERT. Then do an INSERT into Contacts with the sub-SELECT above.
Or I can constrain Group.name to be UNIQUE, do an INSERT OR IGNORE, then INSERT into Contacts with the sub-SELECT.
I can also keep my own cache of which Groups exist, but that seems like I'm duplicating functionality of the database in the first place.
My guess is that there is no way to do this in one query, since INSERT does not return anything and cannot be used in a subquery. Is that intuition correct? What is the best practice here?
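For reference, the UNIQUE plus INSERT OR IGNORE option mentioned above could look roughly like this in SQLite, driven from Python's sqlite3. The column layout and the add_contact helper are assumptions made for the sketch; it is two statements inside one transaction, not a single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Groups" is double-quoted because GROUPS is a keyword in newer SQLite versions.
conn.executescript("""
    CREATE TABLE "Groups" (group_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE Contact (name TEXT NOT NULL,
                          group_id INTEGER REFERENCES "Groups" (group_id));
""")


def add_contact(conn, contact_name, group_name):
    """Create the group if it is missing, then insert the contact pointing at it."""
    with conn:  # both statements commit together or roll back together
        conn.execute('INSERT OR IGNORE INTO "Groups" (name) VALUES (?)',
                     (group_name,))
        conn.execute(
            'INSERT INTO Contact (name, group_id) '
            'VALUES (?, (SELECT group_id FROM "Groups" WHERE name = ?))',
            (contact_name, group_name))


add_contact(conn, "Bob", "Friends")      # creates the group
add_contact(conn, "Alice", "Friends")    # reuses it

rows = conn.execute("""
    SELECT c.name, g.name
    FROM Contact AS c JOIN "Groups" AS g ON g.group_id = c.group_id
    ORDER BY c.name
""").fetchall()
group_count = conn.execute('SELECT COUNT(*) FROM "Groups"').fetchone()[0]
```

The UNIQUE constraint is what lets the database, rather than an application-side cache, decide whether the group already exists.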

My guess is that there is no way to do this in one query, since INSERT does not return anything and cannot be used in a subquery. Is that intuition correct?
You could use a Trigger and a little modification of the tables and then you could do it with a single query.
For example, consider the following.
Purely for convenience of producing the demo:-
DROP TRIGGER IF EXISTS add_group_if_not_exists;
DROP TABLE IF EXISTS contact;
DROP TABLE IF EXISTS groups;
One-time setup SQL :-
CREATE TABLE IF NOT EXISTS groups (id INTEGER PRIMARY KEY, group_name TEXT UNIQUE);
INSERT INTO groups VALUES(-1,'NOTASSIGNED');
CREATE TABLE IF NOT EXISTS contact (id INTEGER PRIMARY KEY, contact TEXT, group_to_use TEXT, group_reference INTEGER DEFAULT -1 REFERENCES groups(id));
CREATE TRIGGER IF NOT EXISTS add_group_if_not_exists
AFTER INSERT ON contact
BEGIN
INSERT OR IGNORE INTO groups (group_name) VALUES(new.group_to_use);
UPDATE contact SET group_reference = (SELECT id FROM groups WHERE group_name = new.group_to_use), group_to_use = NULL WHERE id = new.id;
END;
SQL that would be used on an ongoing basis :-
INSERT INTO contact (contact,group_to_use) VALUES
('Fred','Friends'),
('Mary','Family'),
('Ivan','Enemies'),
('Sue','Work colleagues'),
('Arthur','Fellow Rulers'),
('Amy','Work colleagues'),
('Henry','Fellow Rulers'),
('Canute','Fellow Ruler')
;
The number of values and the actual values would vary.
SQL Just for demonstration of the result
SELECT * FROM groups;
SELECT contact,group_name FROM contact JOIN groups ON group_reference = groups.id;
Results
This results in :-
1) The groups (noting that the group "NOTASSIGNED" is intrinsic to the working of the above and hence added initially) :-
You have to be careful regarding mistakes like 'Fellow Ruler' instead of 'Fellow Rulers'.
-1 is used because it would not normally be a value that is generated automatically.
2) The contacts with the respective group :-
Efficient insertion
That could likely be debated from here to eternity so I leave it for the fence sitters/destroyers to decide :). However, some considerations:-
It works and appears to do what is wanted.
It's a little wasteful due to the additional wasted column.
It tries to minimise the waste by clearing the column after use (the trigger above sets it to NULL, which may be more efficient than an empty string, but for some can be confusing)
There will obviously be an overhead BUT in comparison to the alternatives probably negligible (perhaps important if you were extracting every Facebook user) but if it's user input driven likely irrelevant.
What is the best practice here?
Fences again. :)
Note: Hopefully obvious, but the DROP statements are purely for convenience, and all other SQL up until the INSERT is run once to set up the tables and trigger in preparation for the single INSERT that adds a group if necessary.

Use ConditionExpression to limit insert when ID doesn't exist in other table

Simple thing. Table A has a HashKey id and an additional hash index on column ex_id, which is a kind of foreign key into table B.
When inserting new data into table A, I would like to raise an exception whenever the value in column ex_id doesn't have a corresponding entry in table B.
I thought that ConditionExpression is the way to go, but can't make it work - probably missing something obvious. Tried to use contains()...
Any ideas?

As far as I know, this is not possible on the DynamoDB side because there are no relationships between tables.
What you can do is add a check at the application level that throws an exception before inserting the value into table A. (Query table B for that id; if it is found, insert, otherwise throw an exception.)

DynamoDB does not natively support any kind of foreign key; everything works on a per-table, per-key basis. DynamoDB's approach is to handle such logic at the client level. For example, see the dynamodb transactions client. This library allows you to perform transactions across tables which either all succeed or all roll back.
For your case, I would first make a getItem request to table B (use a consistent read); if the item exists, then write to table A.
Then I would enable streams on table A and write a lambda function to check if any data violations get written to the table.

SQLite data retrieve with select taking too long

I have created a table with SQLite for my Corona/Lua app. It's a hashtable with roughly 700 000 values. The table has two columns, which are the hashcode (a string) and the value (another string). During the program I need to get data several times by providing the hashcode.
I'm using something like this code to get the data:
for p in db:nrows([[SELECT * FROM test WHERE id=']].."hashcode"..[[';]]) do
print(p)
-- p = returned value --
end
This statement, though, takes far too much time to run.
thanks,
Edit:
Success!
The mistake was with the primary key. I set the hashcode as the primary key as shown below, and the retrieval time went back to normal:
CREATE TABLE IF NOT EXISTS test (id STRING PRIMARY KEY , array);
I also prepared the statements in advance as you said:
stmt = db:prepare("SELECT * FROM test WHERE id = ?;")
[...]
stmt:bind(1,s)
for p in stmt:nrows() do
The only problem was that the db file size, which was around 18 MB, went up to 29.5 MB.

You should create the table with id as a unique primary key; this will automatically make an index.
create table if not exists test
(
id text primary key,
val text
);
You should not construct statements using string concatenation; this is a security issue so avoid getting in this habit. Also, you should prepare statements in advance, at program initialization, and run the prepared statements.
Something like this... initially:
hashcode_query_stmt = db:prepare("SELECT * FROM test WHERE id = ?;")
then for each use:
hashcode_query_stmt:bind_values(hashcode)
for p in hashcode_query_stmt:urows() do ... end
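To show the bound-parameter idea in a form that can be run without Corona, here is a small sketch using Python's built-in sqlite3 module rather than the Lua binding. The sample rows and the lookup helper are made up; the table layout follows the CREATE TABLE above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS test (id TEXT PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO test (id, val) VALUES (?, ?)",
                 [("a1b2", "first"), ("c3d4", "second")])

# One fixed SQL string; the hashcode is always passed as a bound value,
# never pasted into the statement text.
LOOKUP_SQL = "SELECT val FROM test WHERE id = ?"


def lookup(conn, hashcode):
    row = conn.execute(LOOKUP_SQL, (hashcode,)).fetchone()
    return None if row is None else row[0]


found = lookup(conn, "c3d4")
missing = lookup(conn, "nope")
# Input that would change the meaning of a concatenated statement is
# treated as an ordinary (non-matching) key here.
injected = lookup(conn, "x' OR '1'='1")
```

Because the SQL text never changes, the compiled statement can be reused from call to call, which is the same benefit the prepare-once advice above is after.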

Ensure that there is an index on the id/hashcode column. Without one, such queries will be slow, slow, slow. This index should probably be unique.
If only selecting the value/hashcode (SELECT value FROM ..), it may be beneficial to have a covering index over (id, value) as that can avoid additional seeking to the row data (see SQLite Query Planning). Try it with and without such a covering index.
Also, it may be worthwhile to employ caching if the same hashcodes are queried multiple times.
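One way to "try it with and without" is to ask SQLite for its query plan before and after adding the index. The sketch below does that from Python's sqlite3; the index name test_id_val and the plan helper are assumptions, and the table is deliberately created without a primary key so the difference is visible. The exact wording of the plan text varies between SQLite versions.

```python
import sqlite3


def plan(conn, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id TEXT, val TEXT)")   # no key, no index yet
query = "SELECT val FROM test WHERE id = ?"

before = plan(conn, query, ("abc",))    # full table scan expected

# Index over (id, val): the lookup column first, then the selected column,
# so the query can be answered from the index alone.
conn.execute("CREATE INDEX test_id_val ON test (id, val)")
after = plan(conn, query, ("abc",))

print(before)
print(after)
```

A covering index stores the value a second time, so the faster lookup is paid for with a larger database file.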

As already stated, make sure you have an index on id.
If you can't change the table schema now, you can add an index ad hoc:
CREATE INDEX test_id ON test (id);
About hashes: if you are computing hashes in your software to speed up searches, don't!
SQLite will use your supplied hashes as any regular string/blob. Also, RDBMS are optimized for efficient searching, which may be greatly improved with indexes.
Unless you're hashing to save space, you are wasting processor time computing hashes in your application.

Updating multiple related tables in SQLite

Just some background, sorry so long winded.
I'm using the System.Data.SQLite ADO.net adapter to create a local sqlite database and this will be the only process hitting the database, so I don't need to worry about concurrency.
I'm building the database from various sources and don't want to build this all in memory using datasets or dataadapters or anything like that. I want to do this using SQL (DbCommands). I'm not very good with SQL and a complete noob in SQLite. I'm basically using SQLite as a local database / save file structure.
The database has a lot of related tables and the data has nothing to do with People or Regions or Districts, but to use a simple analogy, imagine:
Region table with auto increment RegionID, RegionName column and various optional columns.
District table with auto increment DistrictID, DistrictName, RegionId, and various optional columns
Person table with auto increment PersonID, PersonName, DistrictID, and various optional columns
So I get some data representing RegionName, DistrictName,PersonName, and other Person related data. The Region, District and/or Person may or may not be created at this point.
Once again, not being the greatest with this, my thoughts would be something like:
Check to see if Region exists and if so get the RegionID
else create it and get RegionID
Check to see if District exists and if so get the DistrictID
else create it adding in RegionID from above and get DistrictID
Check to see if Person exists and if so get the PersonID
else create it adding in DistrictID from above and get PersonID
Update Person with rest of data.
In MS SQL Server I would create a stored procedure to handle all this.
Only way I can see to do this with sqlite is a lot of commands. So I'm sure I'm not getting this. I've spent hours looking around on various sites but just don't feel like I'm going down the right road. Any suggestions would be greatly appreciated.

Use last_insert_rowid() in conjunction with INSERT OR REPLACE. Something like:
INSERT OR REPLACE INTO Region (RegionName)
VALUES (:Region );
INSERT OR REPLACE INTO District(DistrictName, RegionID )
VALUES (:District , last_insert_rowid());
INSERT OR REPLACE INTO Person(PersonName, DistrictID )
VALUES (:Person , last_insert_rowid());
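One thing to watch with INSERT OR REPLACE is that, on a conflict, SQLite deletes the existing row and inserts a new one, so an auto-assigned id changes and child rows end up pointing at the old value. A more conservative variant is INSERT OR IGNORE followed by a SELECT of the id. The sketch below shows that pattern with Python's sqlite3 rather than System.Data.SQLite; the UNIQUE constraints, the get_or_create helper and the sample names are assumptions, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Region   (RegionID INTEGER PRIMARY KEY,
                           RegionName TEXT NOT NULL UNIQUE);
    CREATE TABLE District (DistrictID INTEGER PRIMARY KEY,
                           DistrictName TEXT NOT NULL,
                           RegionID INTEGER NOT NULL REFERENCES Region (RegionID),
                           UNIQUE (DistrictName, RegionID));
    CREATE TABLE Person   (PersonID INTEGER PRIMARY KEY,
                           PersonName TEXT NOT NULL,
                           DistrictID INTEGER NOT NULL REFERENCES District (DistrictID),
                           UNIQUE (PersonName, DistrictID));
""")


def get_or_create(conn, table, id_col, values):
    """Insert the row if its unique key is new; return its id either way.

    table, id_col and the keys of values are fixed strings from this
    script, never user input, so formatting them into the SQL is safe here.
    """
    cols = ", ".join(values)
    marks = ", ".join("?" for _ in values)
    where = " AND ".join(f"{c} = ?" for c in values)
    args = tuple(values.values())
    conn.execute(f"INSERT OR IGNORE INTO {table} ({cols}) VALUES ({marks})", args)
    return conn.execute(
        f"SELECT {id_col} FROM {table} WHERE {where}", args).fetchone()[0]


def upsert_person(conn, region, district, person):
    with conn:  # one transaction for the whole chain
        region_id = get_or_create(conn, "Region", "RegionID",
                                  {"RegionName": region})
        district_id = get_or_create(conn, "District", "DistrictID",
                                    {"DistrictName": district, "RegionID": region_id})
        return get_or_create(conn, "Person", "PersonID",
                             {"PersonName": person, "DistrictID": district_id})


p1 = upsert_person(conn, "North", "Harbour", "Ann")
p2 = upsert_person(conn, "North", "Harbour", "Ben")
p1_again = upsert_person(conn, "North", "Harbour", "Ann")   # nothing new is created
counts = tuple(conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
               for t in ("Region", "District", "Person"))
```

The remaining optional Person columns can then be filled with an ordinary UPDATE keyed on the returned PersonID.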

Hierarchical Database Select / Insert Statement (SQL Server)

I have recently stumbled upon a problem with selecting relationship details from one table and inserting them into another table. I hope someone can help.
I have a table structure as follows, e.g.:
ID (PK)  Name       ParentID
1        Myname     0
2        nametwo    1
3        namethree  2
This is the table I need to select from to get all the relationship data. As there could be an unlimited number of sub-links, is there a function I can create to do the loop?
Then, once I have all the data, I need to insert it into another table, and the IDs will have to change because they must go in order (e.g. I cannot have ID "2" be a child of ID 3). I am hoping I can use the same function for the inserting as for the selecting.

If you are using SQL Server 2005 or above, you may use recursive queries to get your information. Here is an example:
With tree (id, Name, ParentID, [level])
As (
Select id, Name, ParentID, 1
From [myTable]
Where ParentID = 0
Union All
Select child.id
,child.Name
,child.ParentID
,parent.[level] + 1 As [level]
From [myTable] As [child]
Inner Join [tree] As [parent]
On [child].ParentID = [parent].id)
Select * From [tree];
This query will return the row requested by the first portion (Where ParentID = 0) and all sub-rows recursively. Does this help you?
I'm not sure I understand what you want to have happen with your insert. Can you provide more information in terms of the expected result when you are done?
Good luck!

For the retrieval part, you can take a look at Common Table Expression. This feature can provide recursive operation using SQL.
For the insertion part, you can use the CTE above to regenerate the ID, and insert accordingly.
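As a rough illustration of both parts, the sketch below walks the tree with a recursive CTE and copies it into a second table so that every parent receives a smaller new ID than its children. It runs on SQLite through Python's sqlite3, not SQL Server; newTable, the deliberately out-of-order sample IDs and the new_id_for mapping are invented for the example, and ParentID = 0 is taken to mean "no parent" as in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE myTable  (ID INTEGER PRIMARY KEY, Name TEXT, ParentID INTEGER);
    CREATE TABLE newTable (ID INTEGER PRIMARY KEY, Name TEXT, ParentID INTEGER);
    -- source IDs are deliberately out of order: a child (3) precedes its parent (7)
    INSERT INTO myTable VALUES (7, 'Myname', 0), (3, 'nametwo', 7), (5, 'namethree', 3);
""")

# Walk from the roots downwards; ordering by level guarantees that a
# parent is always returned before any of its children.
WALK = """
WITH RECURSIVE tree (ID, Name, ParentID, level) AS (
    SELECT ID, Name, ParentID, 1 FROM myTable WHERE ParentID = 0
    UNION ALL
    SELECT child.ID, child.Name, child.ParentID, parent.level + 1
    FROM myTable AS child
    JOIN tree AS parent ON child.ParentID = parent.ID
)
SELECT ID, Name, ParentID FROM tree ORDER BY level, ID
"""

new_id_for = {0: 0}        # old ID -> new ID; 0 stays the "no parent" marker
with conn:
    for old_id, name, old_parent in conn.execute(WALK).fetchall():
        cur = conn.execute(
            "INSERT INTO newTable (Name, ParentID) VALUES (?, ?)",
            (name, new_id_for[old_parent]))
        new_id_for[old_id] = cur.lastrowid   # remember the freshly assigned ID

copied = conn.execute("SELECT ID, Name, ParentID FROM newTable ORDER BY ID").fetchall()
```

Because rows are inserted level by level, the parent's new ID is always known by the time a child is inserted.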

I hope this URL helps: Self-Joins in SQL

This is the problem of finding the transitive closure of a graph in SQL. SQL does not support this directly, which leaves you with three common strategies:
use a vendor specific SQL extension
store the Materialized Path from the root to the given node in each row
store the Nested Sets, that is the interval covered by the subtree rooted at a given node when nodes are labeled depth first
The first option is straightforward, and if you don't need database portability is probably the best. The second and third options have the advantage of being plain SQL, but require maintaining some de-normalized state. Updating a table that uses materialized paths is simple, but for fast queries your database must support indexes for prefix queries on string values. Nested sets avoid needing any string indexing features, but can require updating a lot of rows as you insert or remove nodes.
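To make the materialized path option a bit more concrete, here is a small sketch on SQLite via Python's sqlite3. The node table, the "/1/2/3/" path format with a trailing slash, and the descendants helper are all assumptions chosen for the illustration; other encodings work just as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT,
                       path TEXT NOT NULL UNIQUE)   -- ids from the root down to this row
""")
conn.executemany("INSERT INTO node (id, name, path) VALUES (?, ?, ?)", [
    (1,  "Myname",    "/1/"),
    (2,  "nametwo",   "/1/2/"),
    (3,  "namethree", "/1/2/3/"),
    (12, "unrelated", "/12/"),
])


def descendants(conn, node_id):
    """All rows below node_id: their path starts with the node's own path."""
    (path,) = conn.execute(
        "SELECT path FROM node WHERE id = ?", (node_id,)).fetchone()
    rows = conn.execute(
        "SELECT name FROM node WHERE path LIKE ? AND id <> ? ORDER BY path",
        (path + "%", node_id))
    return [r[0] for r in rows]


under_root = descendants(conn, 1)
under_leaf = descendants(conn, 3)
```

The trailing slash matters: without it, the prefix for node 1 would also match the unrelated node 12.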
If you're fine with always using MSSQL, I'd use the vendor specific option Adrian mentioned.
