How Optimize sql query make it faster - sqlite

I have a very simple small database, 2 of tables are:
Node (Node_ID, Node_name, Node_Date) : Node_ID is primary key
Citation (Origin_Id, Target_Id) : PRIMARY KEY (Origin_Id, Target_Id) each is FK in Node
Now I write a query that first find all citations that their Origin_Id has a specific date and then I want to know what are the target dates of these records.
I'm using sqlite in python the Node table has 3000 record and Citation has 9000 records,
and my query is like this in a function:
def cited_years_list(self, date):
c=self.cur
try:
c.execute("""select n.Node_Date,count(*) from Node n INNER JOIN
(select c.Origin_Id AS Origin_Id, c.Target_Id AS Target_Id, n.Node_Date AS
Date from CITATION c INNER JOIN NODE n ON c.Origin_Id=n.Node_Id where
CAST(n.Node_Date as INT)={0}) VW ON VW.Target_Id=n.Node_Id
GROUP BY n.Node_Date;""".format(date))
cited_years=c.fetchall()
self.conn.commit()
print('Cited Years are : \n ',str(cited_years))
except Exception as e:
print('Cited Years retrival failed ',e)
return cited_years
Then I call this function for some specific years, But it's crazy slowwwwwwwww :( (around 1 min for a specific year)
Although my query works fine, it is slow. would you please give me a suggestion to make it faster? I'd appreciate any idea about optimizing this query :)
I also should mention that I have indices on Origin_Id and Target_Id, so the inner join should be pretty fast, but it's not!!!

If this script runs over a period of time, you may consider loading the database into memory. Since you seem to be coding in python, there is a connection function called connection.backup that can backup an entire database into memory. Since memory is much faster than disk, this should increase speed. Of course, this doesn''t do anything to optimize the statement itself, since I don't have enough of the code to evaluate what it is you are doing with the code.

Instead of COUNT(*) use MAX(n.Node_Date)
SQLite doesn't keep a counter on number of tables like mysql does but instead it scans all your rows everytime you call COUNT meaning extremely slow.. yet you can use MAX() to fix that problem.

Related

How to put a part of a code as a string in table to use it in a procedure?

I'm trying to resolve below issue:
I need to prepare table that consists 3 columns:
user_id,
month
value.
Each from over 200 users has got different values of parameters that determine expected value which are: LOB, CHANNEL, SUBSIDIARY. So I decided to store it in table ASYSTENT_GOALS_SET. But I wanted to avoid multiplying rows and thought it would be nice to put all conditions as a part of the code that I would use in "where" clause further in procedure.
So, as an example - instead of multiple rows:
I created such entry:
So far I created testing table ASYSTENT_TEST (where I collect month and value for certain user). I wrote a piece of procedure where I used BULK COLLECT.
declare
type test_row is record
(
month NUMBER,
value NUMBER
);
type test_tab is table of test_row;
BULK_COLLECTOR test_tab;
p_lob varchar2(10) :='GOSP';
p_sub varchar2(14);
p_ch varchar2(10) :='BR';
begin
select subsidiary into p_sub from ASYSTENT_GOALS_SET where user_id='40001001';
execute immediate 'select mc, sum(ppln_wartosc) plan from prod_nonlife.mis_report_plans
where report_id = (select to_number(value) from prod_nonlife.view_parameters where view_name=''MIS'' and parameter_name=''MAX_REPORT_ID'')
and year=2017
and month between 7 and 9
and ppln_jsta_symbol in (:subsidiary)
and dcs_group in (:lob)
and kanal in (:channel)
group by month order by month' bulk collect into BULK_COLLECTOR
using p_sub,p_lob,p_ch;
forall x in BULK_COLLECTOR.first..BULK_COLLECTOR.last insert into ASYSTENT_TEST values BULK_COLLECTOR(x);
end;
So now when in table ASYSTENT_GOALS_SET column SUBSIDIARY (varchar) consists string 12_00_00 (which is code of one of subsidiary) everything works fine. But the problem is when user works in two subsidiaries, let say 12_00_00 and 13_00_00. I have no clue how to write it down. Should SUBSIDIARY column consist:
'12_00_00','13_00_00'
or
"12_00_00","13_00_00"
or maybe
12_00_00','13_00_00
I have tried a lot of options after digging on topics like "Deling with single/escaping/double qoutes".
Maybe I should change something in execute immediate as well?
Or maybe my approach to that issue is completely wrong from the very beginning (hopefully not :) ).
I would be grateful for support.
I didn't create the table function described here but that article inspired me to go back to try regexp_substr function again.
I changed: ppln_jsta_symbol in (:subsidiary) to
ppln_jsta_symbol in (select regexp_substr((select subsidiary from ASYSTENT_GOALS_SET where user_id=''fake_num''),''[^,]+'', 1, level) from dual
connect by regexp_substr((select subsidiary from ASYSTENT_GOALS_SET where user_id=''fake_num''), ''[^,]+'', 1, level) is not null) Now it works like a charm! Thank you #Dessma very much for your time and suggestion!
"I wanted to avoid multiplying rows and thought it would be nice to put all conditions as a part of the code that I would use in 'where' clause further in procedure"
This seems a misguided requirement. You shouldn't worry about number of rows: databases are optimized for storing and retrieving rows.
What they are not good at is dealing with "multi-value" columns. As your own solution proves, it is not nice, it is very far from nice, in fact it is a total pain in the neck. From now on, every time anybody needs to work with subsidiary they will have to invoke a function. Adding, changing or removing a user's subsidiary is much harder than it ought to be. Also there is no chance of enforcing data integrity i.e. validating that a subsidiary is valid against a reference table.
Maybe none of this matters to you. But there are very good reasons why Codd mandated "no repeating groups" as a criterion of First Normal Form, the foundation step of building a sound data model.
The correct solution, industry best practice for almost forty years, would be to recognise that SUBSIDIARY exists at a different granularity to CHANNEL and so should be stored in a separate table.

How to insert large number of nodes into Neo4J

I need to insert about 1 million of nodes in Neo4j. I need to specify that each node is unique, so every time I insert a node it has to be checked that there's not the same node yet. Also the relationships must be unique.
I'm using Python and Cypher:
uq = 'CREATE CONSTRAINT ON (a:ipNode8) ASSERT a.ip IS UNIQUE'
...
queryProbe = 'MERGE (a:ipNode8 {ip:"' + prev + '"})'
...
queryUpdateRelationship= 'MATCH (a:ipNode8 {ip:"' + prev + '"}),(b:ipNode8 {ip:"' + next + '"}) MERGE (a)-[:precede]->(b)'
The problem is that after putting 40-50K nodes into Neo4j , the insertion speed slows down quickly and I can not to put anything else.
Your question is quite open ended. In addition to #InverseFalcon's recommendations, here are some other things you can investigate to speed things up.
Read the Performance Tuning documentation, and follow the recommendations. In particular, you might be running into memory-related issues, so the Memory Tuning section may be very helpful.
Your Cypher query(ies) can probably be sped up. For instance, if it makes sense, you can try something like the following. The data parameter is expected to be a list of objects having the format {a: 123, b: 234}. You can make the list as long as appropriate (e.g., 20K) to avoid running out of memory on the server while it processes the list within a single transaction. (This query assumes that you also want to create b if it does not exist.)
UNWIND {data} AS d
MERGE (a:ipNode8 {ip: d.a})
MERGE (b:ipNode8 {ip: d.b})
MERGE (a)-[:precede]->(b)
There are also periodic execution APOC procedures that you might be able to use.
For mass inserts like this, it's best to use LOAD CSV with periodic commit or the import tool.
I believe it's also best practice to use a parameterized query instead of appending values into a string.
Also, you created a unique property constraint on :ipNode8, but not :ipNode, which is the first one you MERGE. Seems like you'll need a unique constraint for that one too.

Why does adding this line to a Select slow down SQL drastically?

I have a fairly complicated query that uses a bunch of tables. It usually takes ~17 seconds to return all records (though it is usually filtered). One of the fields it returns is CAT.OLD_CODE. When I replace that field with NVL(CAT.OLD_CODE,(CASE WHEN ISS.WHDEF_CODE = 'SCRAP' THEN 'SCRAP' ELSE 'QH/PH/CUST' END)) the query takes forever (I haven't let sql developer run it long enough to get just the first 50 rows yet). ISS.WHDEF_CODE is already in the select statement, and both CAT and ISS are outer joined to the main body of the query. I'm using the command in a crystal report, and currently doing the filtering there, but I'm just curious why the CASE statement takes so long (just using NVL has no appreciable impact on performance). I've noticed this before when using string operations in select statements, but this seems to just be a simple comparison.

Sqlite slow but barely using machine ressources

I have a 500MB sqlite database of about 5 million rows with the following schema:
CREATE TABLE my_table (
id1 VARCHAR(12) NOT NULL,
id2 VARCHAR(3) NOT NULL,
date DATE NOT NULL,
val1 NUMERIC,
val2 NUMERIC,
val2 NUMERIC,
val4 NUMERIC,
val5 INTEGER,
PRIMARY KEY (id1, id2, date)
);
I am trying to run:
SELECT count(ROWID) FROM my_table
The query has now been running for several minutes which seems excessive to me. I am aware that sqlite is not optimized for count(*)-type queries.
I could accept this if at least my machine appeared to be hard at work. However, my CPU load hovers somewhere around 0-1%. "Disk Delta Total Bytes" in Process Explorer is about 500.000.
Any idea if this can be sped up?
You should have an index for any fields you query on like this. create index tags_index on tags(tag);. Then, I am sure definitely the query will be faster. Secondly, try to normalize your table and have a test (without having an index). Compare the results.
In most cases, count(*) would be faster than count(rowid).
If you have a (non-partial) index, computing the row count can be done faster with that because less data needs do be loaded from disk.
In this case, the primary key constraint already has created such an index.
I would try to look at my disk IO if I were you. I guess they are quite high. Considering the size of your database some data must be on the disk which makes it the bottleneck.
Two ideas from my rudimentary knowledge of SQLite.
Idea 1: If memory is not a problem in your case and your application is launched once and run several queries, I would try to increase the amount of cache used (there's a cache_size pragma available). After a few googling I found this link about SQLite tweaking: http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html
Idea 2: I would try to have an autoincremented primary key (on a single column) and try to tweak my request using SELECT COUNT(DISTINCT row_id) FROM my_table; . This could force the counting to be only run on what's contained in the index.

Why does this query timeout? V2

This question is a followup to This Question
The solution, clearing the execution plan cache seemed to work at the time, but i've been running into the same problem over and over again, and clearing the cache no longer seems to help. There must be a deeper problem here.
I've discovered that if I remove the .Distinct() from the query, it returns rows (with duplicates) in about 2 seconds. However, with the .Distinct() it takes upwards of 4 minutes to complete. There are a lot of rows in the tables, and some of the where clause fields do not have indexes. However, the number of records returned is fairly small (a few dozen at most).
The confusing part about it is that if I get the SQL generated by the Linq query, via Linqpad, then execute that code as SQL or in SQL Management Studio (including the DISTINCT) it executes in about 3 seconds.
What is the difference between the Linq query and the executed SQL?
I have a short term workaround, and that's to return the set without .Distinct() as a List, then using .Distinct on the list, this takes about 2 seconds. However, I don't like doing SQL Server work on the web server.
I want to understand WHY the Distinct is 2 orders of magnitude slower in Linq, but not SQL.
UPDATE:
When executing the code via Linq, the sql profiler shows this code, which is basically identical query.
sp_executesql N'SELECT DISTINCT [t5].[AccountGroupID], [t5].[AccountGroup]
AS [AccountGroup1]
FROM [dbo].[TransmittalDetail] AS [t0]
INNER JOIN [dbo].[TransmittalHeader] AS [t1] ON [t1].[TransmittalHeaderID] =
[t0].[TransmittalHeaderID]
INNER JOIN [dbo].[LineItem] AS [t2] ON [t2].[LineItemID] = [t0].[LineItemID]
LEFT OUTER JOIN [dbo].[AccountType] AS [t3] ON [t3].[AccountTypeID] =
[t2].[AccountTypeID]
LEFT OUTER JOIN [dbo].[AccountCategory] AS [t4] ON [t4].[AccountCategoryID] =
[t3].[AccountCategoryID]
LEFT OUTER JOIN [dbo].[AccountGroup] AS [t5] ON [t5].[AccountGroupID] =
[t4].[AccountGroupID]
LEFT OUTER JOIN [dbo].[AccountSummary] AS [t6] ON [t6].[AccountSummaryID] =
[t5].[AccountSummaryID]
WHERE ([t1].[TransmittalEntityID] = #p0) AND ([t1].[DateRangeBeginTimeID] = #p1) AND
([t1].[ScenarioID] = #p2) AND ([t6].[AccountSummaryID] = #p3)',N'#p0 int,#p1 int,
#p2 int,#p3 int',#p0=196,#p1=20100101,#p2=2,#p3=0
UPDATE:
The only difference between the queries is that Linq executes it with sp_executesql and SSMS does not, otherwise the query is identical.
UPDATE:
I have tried various Transaction Isolation levels to no avail. I've also set ARITHABORT to try to force a recompile when it executes, and no difference.
The bad plan is most likely the result of parameter sniffing: http://blogs.msdn.com/b/queryoptteam/archive/2006/03/31/565991.aspx
Unfortunately there is not really any good universal way (that I know of) to avoid that with L2S. context.ExecuteCommand("sp_recompile ...") would be an ugly but possible workaround if the query is not executed very frequently.
Changing the query around slightly to force a recompile might be another one.
Moving parts (or all) of the query into a view*, function*, or stored procedure* DB-side would be yet another workaround.
 * = where you can use local params (func/proc) or optimizer hints (all three) to force a 'good' plan
Btw, have you tried to update statistics for the tables involved? SQL Server's auto update statistics doesn't always do the job, so unless you have a scheduled job to do that it might be worth considering scripting and scheduling update statistics... ...tweaking up and down the sample size as needed can also help.
There may be ways to solve the issue by adding* (or dropping*) the right indexes on the tables involved, but without knowing the underlying db schema, table size, data distribution etc that is a bit difficult to give any more specific advice on...
 * = Missing and/or overlapping/redundant indexes can both lead to bad execution plans.
The SQL that Linqpad gives you may not be exactly what is being sent to the DB.
Here's what I would suggest:
Run SQL Profiler against the DB while you execute the query. Find the statement which corresponds to your query
Paste the whole statment into SSMS, and enable the "Show Actual Execution Plan" option.
Post the resulting plan here for people to dissect.
Key things to look for:
Table Scans, which usually imply that an index is missing
Wide arrows in the graphical plan, indicating lots of intermediary rows being processed.
If you're using SQL 2008, viewing the plan will often tell you if there are any indexes missing which should be added to speed up the query.
Also, are you executing against a DB which is under load from other users?
At first glance there's a lot of joins, but I can only see one thing to reduce the number right away w/out having the schema in front of me...it doesn't look like you need AccountSummary.
[t6].[AccountSummaryID] = #p3
could be
[t5].[AccountSummaryID] = #p3
Return values are from the [t5] table. [t6] is only used filter on that one parameter which looks like it is the Foreign Key from t5 to t6, so it is present in [t5]. Therefore, you can remove the join to [t6] altogether. Or am I missing something?
Are you sure you want to use LEFT OUTER JOIN here? This query looks like it should probably be using INNER JOINs, especially because you are taking the columns that are potentially NULL and then doing a distinct on it.
Check that you have the same Transaction Isolation level between your SSMS session and your application. That's the biggest culprit I've seen for large performance discrepancies between identical queries.
Also, there are different connection properties in use when you work through SSMS than when executing the query from your application or from LinqPad. Do some checks into the Connection properties of your SSMS connection and the connection from your application and you should see the differences. All other things being equal, that could be the difference. Keep in mind that you are executing the query through two different applications that can have two different configurations and could even be using two different database drivers. If the queries are the same then that would be only differences I can see.
On a side note if you are hand-crafting the SQL, you may try moving the conditions from the WHERE clause into the appropriate JOIN clauses. This actually changes how SQL Server executes the query and can produce a more efficient execution plan. I've seen cases where moving the filters from the WHERE clause into the JOINs caused SQL Server to filter the table earlier in the execution plan and significantly changed the execution time.

Resources