Dynamic query and caching - PL/SQL

I have two problem sets. Ideally I am looking for a solution that combines both.
Problem 1: I have a table of, let's say, 20 rows (table 1). I am reading 150,000 rows from another table (table 2). For each row read from table 2, I have to match it against a specific row of table 1 (not the whole row, just a few columns, e.g. table2.col1 = table1.col1 AND table2.col2 = table1.col2). Is there a way I can cache table 1 so that I don't have to query it again and again?
Problem 2: I want to generate the query string dynamically, i.e., if parameter 2 is null then don't put it in the WHERE clause. The only option left seems to be EXECUTE IMMEDIATE, which will be very slow.
So what I am asking is: how can I have a dynamic query and still compare against table 1? Any ideas?

For problem 1, as mentioned in the comments, let the database handle it. That's what it does really well. If the table is hit often, its blocks should remain in the database buffer cache, provided the cache is sized appropriately. Part of DBA tuning would be to identify appropriate sizing, pin tables into the "keep" pool, etc., but this probably isn't something you need to worry about.
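If you ever do go down the DBA route, pinning the small lookup table into the keep pool is a one-line change. A sketch only: the table name is a placeholder, and the KEEP pool must first be sized via the DB_KEEP_CACHE_SIZE parameter:
-- Assign the small lookup table to the KEEP buffer pool so its blocks
-- tend to stay cached (requires the KEEP pool to be configured).
ALTER TABLE table1 STORAGE (BUFFER_POOL KEEP);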
If the desire is just to simplify writing the queries rather than performance, then views or stored procs can simplify the repetitive use of the join.
For problem 2, a query in a format like this might work for you:
SELECT id, val
FROM myTable
WHERE filter = COALESCE(v_filter, filter)
If the input parameter v_filter is null, then just automatically match the existing column. This assumes the existing filter column itself is never null (since you can't use = for null comparisons). Also, it assumes that there are other indexed portions in the WHERE clause since a function like COALESCE isn't going to be able to take advantage of an index.
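If the filter column can itself be NULL, a common alternative (just a sketch, reusing v_filter and the column names from the example above) is an explicit OR predicate; since the column is not wrapped in a function, an index on filter stays usable for the non-NULL case:
-- Match everything when the bind variable is NULL, otherwise filter on it.
SELECT id, val
FROM myTable
WHERE (v_filter IS NULL OR filter = v_filter)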

For problem 1 you just join the tables. If there is an equijoin and one table is quite small and the other large then you're likely to get a hash join. This is effectively a caching mechanism, and the total cost of reading the tables and performing the join is only very slightly higher than that of reading the tables (as long as the hash table fits in memory).
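For illustration, a plain join along the lines of the columns mentioned in the question (table, column, and alias names are placeholders) is usually all that's needed; the optimizer builds the hash table over the small table once and probes it for each of the 150,000 rows:
-- Sketch only: table1/table2 and col1/col2 are the placeholder names from the question.
SELECT t2.*, t1.*
FROM table2 t2
JOIN table1 t1
  ON t1.col1 = t2.col1
 AND t1.col2 = t2.col2;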
It does not make a difference if the query is constructed and run through execute immediate -- the RDBMS hash join will still act as an effective cache.

Related

Azure Data Explorer - update a record

I am new to Azure Data Explorer and I am wondering how you can update a record in Azure Data Explorer using the Microsoft .NET SDK in C#.
The Microsoft documentation is really poor.
Can we update a record, or can we only replace a row?
You can use soft-delete to delete the original record, and then append/ingest the updated record.
Please note that this won't be atomic, meaning that if someone queries the table between the soft-delete and the append operations, they will see neither the old record nor the updated record.
There is no record "update" mechanism in Azure Data Explorer; even the soft delete removes the row and appends a replacement. This is useful for one-off scenarios and may not be worth implementing in another language, since it should not be used frequently. As the soft delete documentation says, if you plan to update data often, materialize may be a better option.
Materialize is a bit more work and more abstract; it is generally worth the effort if you have a very large table that relies on metadata such as ingestion_time to make sense of records.
For smaller tables (say, less than a gigabyte) I recommend the simple approach of replacing the table with an updated version of itself (just make sure that if you do rely on fields like ingestion_time, you update the schema and extend that data as a regular column for later use).
You will need to query the entire table, implement logic to isolate only the row(s) of interest (while retaining all others), and perform an extend to modify the value. Then replace (do not append) the entire table.
For example:
.set-or-replace MyTable1 <|
MyTable1
| extend IncorrectColumn = iif(IncorrectColumn == "incorrectValue", "CorrectValue", IncorrectColumn)
Alternatively, you can have the unchanged rows and the updated rows in two tabular results and perform a union on them to form the final table.
.set-or-replace MyTable1 <|
let updatedRows =
MyTable1
| where Column1 = "IncorrectValue"
| extend Column1 = "CorrectValue";
let nonUpdatedRows =
MyTable1
| where Column1 = "CorrectValue";
updatedRows
| union nonUpdatedRows
I prefer to write to a temp table, double-check the data quality, then replace the final table. This is particularly useful if you're working in batches and want to minimize the risk of data loss if there's a failure halfway through the batches.

SQLite: re-arrange physical position of rows inside file

My problem is that my queries are too slow.
I have a fairly large sqlite database. The table is:
CREATE TABLE results (
    timestamp TEXT,
    name TEXT,
    result float
);
(I know that timestamps as TEXT is not optimal, but please ignore that for the purposes of this question. I'll have to fix that when I have the time)
"name" is a category. This calculation holds the results of a calculation that has to be done at each timestamp for all "name"s. So the inserts are done at equal-timestamps, but the querys will be done at equal-names (i.e. I want given a name, get its time series), like:
SELECT timestamp,result WHERE name='some_name';
Now, the way I'm doing things now is to have no indexes, calculate all results, then create an index on name CREATE INDEX index_name ON results (name). The reasoning is that I don't need the index when I'm inserting, but having the index will make querys on the index really fast.
But it's not. The database is fairly large. It has about half a million timestamps, and for each timestamp I have about 1000 names.
I suspect, although I'm not sure, that the reason it's slow is that even though I've indexed the names, they're still scattered all around the physical disk. Something like:
timestamp1,name1,result
timestamp1,name2,result
timestamp1,name3,result
...
timestamp1,name999,result
timestamp1,name1000,result
timestamp2,name1,result
timestamp2,name2,result
etc...
I'm sure this is slower to query with NAME='some_name' than if the rows were physically ordered as:
timestamp1,name1,result
timestamp2,name1,result
timestamp3,name1,result
...
timestamp499997,name1000,result
timestamp499998,name1000,result
timestamp499999,name1000,result
timestamp500000,name1000,result
etc...
So, how do I tell SQLite that the order in which I'd like the rows in disk isn't the one they were written in?
UPDATE: I'm further convinced that the slowness in doing a select with such an index comes exclusively from non-contiguous disk access. Doing SELECT * FROM results WHERE name=<something_that_doesnt_exist> immediately returns zero results. This suggests that it's not finding the names that's slow, it's actually reading them from the disk.
Normal SQLite tables have, as a primary key, a 64-bit integer (known as rowid, among a few other aliases). That determines the order in which rows are stored in a B*-tree (which puts all actual data in leaf-node pages). You can change this with a WITHOUT ROWID table, but that requires an explicit primary key, which is then used to place rows in a B-tree. So if every row's (name, timestamp) pair is unique, that's a possibility that will keep all rows with the same name on a smaller set of pages instead of scattered all over.
You'd want the composite PK to be in that order if you're searching for a particular name most of the time, so something like:
CREATE TABLE results (
    timestamp TEXT
    , name TEXT
    , result REAL
    , PRIMARY KEY (name, timestamp)
) WITHOUT ROWID;
(And of course not bothering with a second index on name.) The tradeoff is that inserts are likely to be slower as the chances of needing to split a page in the B-tree go up.
Some pragmas worth looking into to tune things:
cache_size
mmap_size
optimize (after creating your index; also consider building SQLite with SQLITE_ENABLE_STAT4)
Since you don't have an INTEGER PRIMARY KEY, consider running VACUUM after deleting a lot of rows, if you ever do that.
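A rough sketch of how these might be applied (the numbers are placeholders to tune for your hardware; they are issued per connection):
PRAGMA cache_size = -200000;   -- roughly 200 MB of page cache (negative value = KiB)
PRAGMA mmap_size = 268435456;  -- allow up to 256 MB of memory-mapped I/O
PRAGMA optimize;               -- run after bulk loading and creating the index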

SQLite optimization on inner join with table values around 18K

I have two tables, tool and tool_attribute.
tool has 12 columns and tool_attribute has 5.
Information I need from the tables:
tool - refid, serial, type, id
tool_attribute - key, value, id (there will be multiple entries per tool id)
Right now I have around 18,264 rows in tool and 255,696 in tool_attribute.
Current query:
select
    tool.refid,
    tool.serial,
    tool_attribute.value,
    tool.type
from tool
inner join tool_attribute
    on tool.id = tool_attribute.id
where
    (tool_attribute.value LIKE '%t00%' or
     tool.serial LIKE '%t00%')
group by tool.refid
order by tool.serial asc;
This takes around 750 ms, which is quite fast, but I want to make it much faster. I run this code on a low-memory Windows 6.0 device, so it takes too much time.
Is there any way I could make it faster?
You can try adding indices to the columns involved in the join:
CREATE INDEX idx_tool ON tool (id);
CREATE INDEX idx_tool_attr ON tool_attribute (id);
The LIKE conditions in your WHERE clause will, I think, prevent any index on the columns involved from being used. The reason is that a LIKE pattern of the form '%something' defeats a B-tree search, which matches from left to right on a prefix. If you could rephrase your WHERE logic as something like LIKE 'something%', then an index could be used there as well.
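As a sketch, assuming a prefix search is acceptable for your use case (it may not be), the rewritten condition together with an index on the searched column would look like this. Note that SQLite only uses an index for LIKE under certain conditions, e.g. the column must have text affinity and the case sensitivity must match the index collation:
CREATE INDEX idx_tool_attr_value ON tool_attribute (value);
-- A pattern anchored at the start can be satisfied via the index:
SELECT id, value
FROM tool_attribute
WHERE value LIKE 't00%';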

Multiple inserts using 'User Defined Table Type' (asp.net sql-server)

In my ASP.NET web app I have a DataTable filled with data to insert into tblChildren.
The DataTable is passed to a stored procedure.
In the SP I need to read each row (i.e., loop through the DataTable), change a couple of columns in it (in accordance with the relevant data in tblParents) and only then insert the row into tblChildren.
SqlBulkCopy wouldn't do, and I don't think a TVP will do either (not sure... not too familiar with it yet).
Of course I could iterate through the DataTable rows in the app and send each one separately to the SP, but that would mean hundreds of round trips to SQL Server.
I came across two possibilities that might achieve this: (1) a temp table,
(2) a cursor.
The first is quite messy and the second, as I understand it, is not recommended.
Any guidance would be much appreciated.
EDIT :
I tried the approach of user-defined Table Type.
That works because I populate the Table Type (TT_Children) with values in the TT_Child_Family_Id column.
In real life, though, I will not know these values, and I would need to loop through @my_TT_Children and for each row get the value from tblFamilies, something like this:
SELECT Family_Id FROM tblFamilies WHERE Family_Name = TT_Child_Last_Name
(assuming there is always an equivalent for TT_Child_Last_Name in tblFamilies.Family_Name)
So my question is: how do I loop through the table type and, for each row, look up a value in a different table?
EDIT 2 (the solution) :
As in Amir's perfect answer, the stored procedure should look like this :
ALTER PROCEDURE [dbo].[usp_Z_Insert_Children]
    @my_TT_Children TT_Children READONLY
AS
BEGIN
    INSERT INTO tblChildren (Child_FirstName,
                             Child_LastName,
                             Child_Family_ID)
    SELECT Cld.TT_Child_FirstName,
           Cld.TT_Child_LastName,
           Fml.Family_Id
    FROM @my_TT_Children Cld
    INNER JOIN tblFamilies Fml
        ON Cld.TT_Child_LastName = Fml.Family_Name
END
Notes by Amir: the column Family_Name in tblFamilies must be unique and preferably indexed.
(Also, I noticed that if TT_Child_LastName does not have a match in tblFamilies, the row will not be inserted and I'll never know about it. That means I have to check somehow whether all rows were successfully processed.)
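One way to surface those silently skipped rows (a sketch only, reusing the names from the procedure above) is an anti-join against tblFamilies:
-- Rows in the table-valued parameter whose last name has no match in tblFamilies;
-- these would be skipped by the INNER JOIN insert above.
SELECT Cld.TT_Child_FirstName, Cld.TT_Child_LastName
FROM @my_TT_Children Cld
LEFT JOIN tblFamilies Fml ON Cld.TT_Child_LastName = Fml.Family_Name
WHERE Fml.Family_Id IS NULL;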
You can join tblFamilies into the insert in the procedure and take the value from there. Much more efficient than looping through.
Or create a cursor and do one child at a time.
1) Make sure there is only one occurrence of FamilyName in tblFamilies.
2) Make sure that if tblFamilies is a large table, then the FamilyName column is indexed.
INSERT INTO tblChildren (Child_FirstName, Child_LastName, Child_Family_ID)
SELECT Cld.TT_Child_FirstName,
       Cld.TT_Child_LastName,
       Fml.FamilyID
FROM @my_TT_Children Cld
INNER JOIN tblFamilies Fml ON Cld.TT_Child_LastName = Fml.FamilyName
But be aware that if tblFamilies has more than one entry per Family_Name, this will duplicate the data. In that case you will need to add more restrictions in the WHERE clause.
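For completeness, a minimal sketch of the user-defined table type the procedure expects (only the column names come from the code above; the data types are assumptions to adjust to match tblChildren):
-- Hypothetical definition of the table type used above.
CREATE TYPE TT_Children AS TABLE (
    TT_Child_FirstName NVARCHAR(100),
    TT_Child_LastName  NVARCHAR(100),
    TT_Child_Family_Id INT NULL
);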

SQLite - Get a specific row index for a Sorted/Filtered Query

I'm creating a caching system to take data from an SQLite database table using a sorted/filtered query and display it. The tables I'm pulling from can be potentially very large and, of course, I need to minimize impact on memory by only retaining a maximum number of rows in memory at any given time. This is easily done by using LIMIT and OFFSET to load only the records I need and update the cache as needed. Implementing this is trivial. The problem I'm having is determining where the insertion index is for a new record inserted into a particular query so I can update my UI appropriately. Is there an easy way to do this? So far the ideas I've had are:
Dump the entire cache, re-count the Query results (there's no guarantee the new row will be included), refresh the cache and refresh the entire UI. I hope it's obvious why that's not really desirable.
Use my own algorithm to determine whether the new row is included in the current query, whether it is included in the current cached results, and at what index it should be inserted if it falls within the currently cached scope. The biggest downfall of this approach is its complexity and the risk that my own sorting/filtering algorithm won't match SQLite's.
Of course, what I want is to be able to ask SQLite: Given 'Query A' what is the index of 'Row B', without loading the entire query results. However, so far I haven't been able to find a way to do this.
I don't think it matters but this is all occurring on an iOS device, using the objective-c programming language.
More Info
The query and subsequent cache are based on user input. Essentially the user can re-sort and filter (or search) to alter the results they're seeing. My reluctance to simply recreate the cache on insertions (and edits, actually) comes from wanting to provide a 'smoother' UI experience.
I should point out that I'm leaning toward option "2" at the moment. I played around with creating my own caching/indexing system by loading all the records in a table and performing the sort/filter in memory using my own algorithms. So much of the code needed to determine whether and/or where a particular record is in the cache is already there, so I'm slightly predisposed to use it. The danger lies in having a cache that doesn't match the underlying query. If I include a record in the cache that the query wouldn't return, I'll be in trouble and probably crash.
You don't need record numbers.
Save the values of the ordered field in the first and last records of the LIMITed query result.
Then you can use these to check whether the new record falls into this range.
In other words, assuming that you order by the Name field, and that the original query was this:
SELECT Name, ...
FROM mytab
WHERE some_conditions
ORDER BY Name
LIMIT x OFFSET y
then try to get at the new record with a similar query:
SELECT 1
FROM mytab
WHERE some_conditions
AND PrimaryKey = LastInsertedValue
AND Name BETWEEN CachedMin AND CachedMax
Similarly, to find out before (or after) which record the new record was inserted, start directly after the inserted record and use a limit of one, like this:
SELECT Name
FROM mytab
WHERE some_conditions
AND Name > MyInsertedName
AND Name BETWEEN CachedMin AND CachedMax
ORDER BY Name
LIMIT 1
This doesn't give you a number; you still have to check where the returned Name is in your cache.
Typically you'd expect a cache to be invalidated if there were underlying data changes. I think dropping it and starting over will be your simplest, maintainable solution. I would recommend it unless you have a very good reason.
You could write another query that just returned the row count (example below) to see if your cache should be invalidated. That would save recreating the cache when it did not change.
SELECT name,address FROM people WHERE area_code=970;
SELECT COUNT(rowid) FROM people WHERE area_code=970;
The information you'd need from sqlite to know when your cache was invalidated would require some rather intimate knowledge of how the query and/or index was working. I would say that is fairly high coupling.
Otherwise, you'd want to know where it was inserted with regards to the sorting. You would probably key each page on the sorted field. Delete anything greater than the insert/delete field. Any time you change the sorting you'd drop everything.
Something like the code below would be a start if you were using C++. I realize you aren't doing C++, but hopefully it is evident what I'm trying to do.
#include <set>
#include <string>
#include <vector>

struct Person {
    std::string name;
    std::string addr;
};

// One cached "page" of query results, keyed on the sorted field (name).
struct Page {
    std::string key;
    std::vector<Person> persons;

    struct Less {
        bool operator()(const Page &lhs, const Page &rhs) const {
            return lhs.key.compare(rhs.key) < 0;
        }
    };
};

bool sql_insert(const Person &person);  // performs the actual SQL INSERT

typedef std::set<Page, Page::Less> pages_t;
pages_t pages;

void insert(const Person &person) {
    if (sql_insert(person)) {
        // Find the first page whose key is not less than the new name...
        pages_t::iterator drop_cache_start =
            pages.lower_bound(Page{person.name, {}});
        // ...then drop this page and everything after it.
        pages.erase(drop_cache_start, pages.end());
    }
}
You'd have to do some wrangling to get different data types of key to work nicely, but it's possible.
Theoretically you could just leave the pages out of it and only use the objects themselves. The database would no longer "own" the data though. If you only fill pages from the database, then you'll have less data consistency worries.
This may be a bit off topic, you aren't re-implementing views are you? It doesn't cache per se, but it isn't clear if that is a requirement of your project.
The solution I came up with is not exactly simple, but it's currently working well. I realized that the index of a record in a query's result set is also the count of all the records that precede it. What I needed to do was 'convert' all the ORDER BY terms in the query into a series of WHERE conditions that return only the preceding records, and then take a count of those records. It's trickier than it sounds (or maybe not... it sounds tricky). The biggest issue I had was making sure the query was, in fact, sorted in a way I could predict. This meant the order parameters needed to include a column with unique values. So, whenever a user sorts on a column, I append another order parameter on a unique column (I used a "Modified Date Stamp") to break ties.
Creating the WHERE portion of the statement requires more than just tacking on a bunch of ANDs. It's easier to demonstrate. Say you have 3 Order columns: "LastName" ASC, "FirstName" DESC, and "Modified Stamp" ASC (the tie breaker). The WHERE statement would have to look something like this ('?' = record value):
WHERE
"LastName" < ? OR
("LastName" = ? AND "FirstName" > ?) OR
("LastName" = ? AND "FirstName" = ? AND "Modified Stamp" < ?)
Each set of WHERE conditions grouped together by parentheses is a tie breaker. If, in fact, the record values of "LastName" are equal, we must then look at "FirstName", and finally "Modified Stamp". Obviously, this statement can get really long if you're sorting by a lot of order parameters.
There's still one problem with the above solution. Comparisons involving NULL values never evaluate to true, and yet when sorting, SQLite puts NULL values first. Therefore, to deal with NULL values appropriately you have to add another layer of complication. First, every equality operation, =, must be replaced by IS. Second, every < operation must be wrapped with an OR ... IS NULL so that NULL values are included on the < side. This turns the above condition into:
WHERE
("LastName" < ? OR "LastName" IS NULL) OR
("LastName" IS ? AND "FirstName" > ?) OR
("LastName" IS ? AND "FirstName" IS ? AND ("Modified Stamp" < ? OR "Modified Stamp" IS NULL))
I then take a count of the rows (a COUNT of the RowID) matching the above WHERE clause.
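Putting it together, the index of the newly inserted record is then just a count over that predicate. A sketch (the table name is a placeholder, the column names follow the example above, the bind values come from the inserted row, and any filter conditions from the original query would be ANDed in as well):
-- The number of rows sorting before the inserted row is its index in the full result set.
SELECT COUNT(rowid)
FROM mytable
WHERE ("LastName" < ? OR "LastName" IS NULL) OR
      ("LastName" IS ? AND "FirstName" > ?) OR
      ("LastName" IS ? AND "FirstName" IS ? AND
       ("Modified Stamp" < ? OR "Modified Stamp" IS NULL));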
It turned out easy enough for me to do mostly because I had already constructed a set of objects to represent various aspects of my SQL Statement which could be assembled to generate the statement. I can't even imagine trying to manipulate a SQL statement like this any other way.
So far, I've tested using this on several iOS devices with up to 10,000 records in a table and I've had no noticeable performance issues. Of course, it's designed for single record edits/insertions so I don't really need it to be super fast/efficient.
