Well, I know. 'Fast from SSMS, slow from app' sounds very familiar to some of you.
One can start thinking about parameter sniffing or connection settings. But I guess that's not the case for me.
So that's the query:
SELECT [ST].*, [STL].*
FROM [WP_CashCenter_StockTransaction] AS [ST]
LEFT JOIN [WP_CashCenter_StockTransactionLine] AS [STL] ON ([STL].[StockTransaction_id] = [ST].[id])
WHERE
([ST].[Type] IN (0, 1, 10, 9)
AND ([STL].[Direction] IN (0, 1) OR [STL].[id] IS NULL)
AND [ST].[Status] IN (0,1)
AND ( ([STL].[StockContainer_id] = 300000742600 OR [STL].[id] IS NULL) AND [ST].[StockContainerID] = 300000742600))
I'll post links to images of execution plans, if you don't mind (pls tell in comments), cause there's gonna be many of them.
Execution plan that I get from SSMS: http://i.imgur.com/DjTypV2.png (runs fraction of a sec)
Execution plan, that's used for query when it's executed from app: http://i.imgur.com/Ra45CAo.png (runs ~3sec)
So, for some reason, SQL Server makes wrong estimates and prefers a table scan in the second case.
The query is built dynamically and a new plan is generated for each new value of StockContainerID (no parameters).
Well, okay, so I gave up trying to figure out the problem and just used the FORCESEEK hint:
SELECT [ST].*, [STL].*
FROM [WP_CashCenter_StockTransaction] AS [ST] WITH(FORCESEEK)
LEFT JOIN [WP_CashCenter_StockTransactionLine] AS [STL] WITH(FORCESEEK) ON ([STL].[StockTransaction_id] = [ST].[id])
Now, the execution plans seem to be identical:
http://i.imgur.com/orq7Pmx.png (executed from app). But it still takes ~3secs.
Take a look at this:
http://i.imgur.com/ZFhyWYc.png (SSMS, 1 number of executions, 1 estimated rows)
http://i.imgur.com/bIeTE13.png (app, 1 number of executions, 655k estimated rows)
http://i.imgur.com/FERBNQR.png (SSMS, 1 number of executions, 1 estimated rows)
http://i.imgur.com/pm2k8CS.png (app, 655k number of executions, 1 estimated rows)
You should have noticed that the second plan uses parallelism. I don't know if that can be a reason for the problem (I think no).
I have a unidirectional graph.
The structure is as follows:
There are about 20,000 nodes in the graph.
I make the simplest request: MATCH (b1)-[:NEXT_BAR*10]->(b2) RETURN b1.id, b2.id LIMIT 5
The request is processed quickly.
But if I increase the number of relationships, the query takes much longer to process. In other words, the speed depends on the number of relationships.
This request takes longer than 5 minutes to complete: MATCH (b1)-[:NEXT_BAR*10000]->(b2) RETURN b1.id, b2.id LIMIT 5
This is still a simplified version. The request can have more than two nodes and the number of relationships can still be a range.
How can I optimize a query with a large number of relationships?
Perhaps there are other graph DBMS where there is no such problem?
Variable-length relationship queries have exponential time and memory complexity.
If R is the average number of suitable relationships per node, and D is the depth of the search, then the complexity is O(R ** D). This complexity will exist in any DBMS.
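To make the O(R ** D) growth concrete, here is a minimal Python sketch that counts paths by exhaustive expansion. It assumes an idealized graph in which every node has exactly `branching` outgoing relationships; the function name and parameters are illustrative and not part of any database API.

```python
def count_paths(branching: int, depth: int) -> int:
    """Count paths of exactly `depth` hops by expanding every branch."""
    if depth == 0:
        return 1  # the empty path that ends at the current node
    # Assumption: each node has exactly `branching` outgoing relationships.
    return sum(count_paths(branching, depth - 1) for _ in range(branching))


if __name__ == "__main__":
    for depth in (1, 2, 4, 8):
        print(depth, count_paths(3, depth))
```

The count equals `branching ** depth`, so every extra hop multiplies the work by the branching factor.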
The theory is simple here, but there are a couple of intricacies in the query execution.
-[:NEXT_BAR*10000]- matches a path that is precisely 10000 edges long, so the query engine spends some time finding these paths. Another thing to mention is that in (b1)-[...]->(b2), b1 and b2 are not specific, which means that the query engine has to scan all nodes. If there is a limit, the scan should return only a limited number of items. The whole execution also depends on the efficiency of the variable-length path implementation.
Some of the following might help:
Is it feasible to start from a specific node?
If there are branches, the only hope is aggressive filtering because of exponential complexity (as cybersam well explained).
Use a smaller number in the variable expand, or a range, e.g., [:NEXT_BAR*..10000]. In this case, the query engine will match any path up to 10000 edges long (different semantics, but maybe applicable).
* means the DFS type of execution. On the other hand, BFS might be the right approach. Memgraph (DISCLAIMER: I'm the co-founder and CTO) also supports BFS type of execution with filtering lambda.
Here is a Python script I've used to generate and import data into Memgraph. By using small nodes_no you can quickly notice the execution patterns.
import mgclient

# Make a connection to the database.
connection = mgclient.connect(
    host='127.0.0.1',
    port=7687,
    sslmode=mgclient.MG_SSLMODE_REQUIRE)
connection.autocommit = True
cursor = connection.cursor()

# Clean and setup database instance.
cursor.execute("""MATCH (n) DETACH DELETE n;""")
cursor.execute("""CREATE INDEX ON :Node(id);""")

# Import dataset.
nodes_no = 10

# Create nodes.
for identifier in range(0, nodes_no):
    cursor.execute("""CREATE (:Node {id: "%s"});""" % identifier)

# Create edges: link each node to its successor to form a chain.
for identifier in range(1, nodes_no):
    cursor.execute("""
        MATCH (start_node:Node {id: "%s"})
        MATCH (end_node:Node {id: "%s"})
        CREATE (start_node)-[:NEXT_BAR]->(end_node);
    """ % (identifier - 1, identifier))
InnoDB organizes its data in B+ trees. The height of the tree affects the number of I/O operations, which may be one of the main reasons a DB slows down.
So my question is how to predict or calculate the height of the B+ tree (e.g. based on the number of pages, which can be calculated from row size, page size, and row count), and thus decide whether or not to partition the data across different masters.
https://www.percona.com/blog/2009/04/28/the_depth_of_a_b_tree/
Let N be the number of rows in the table.
Let B be the number of keys that fit in one B-tree node.
The depth of the tree is (log N) / (log B).
From the blog:
Let’s put some numbers in there. Say you have a billion rows, and you can currently fit 64 keys in a node. Then the depth of the tree is (log 10^9) / log 64 ≈ 30/6 = 5. Now you rebuild the tree with keys half the size and you get (log 10^9) / log 128 ≈ 30/7 = 4.3. Assuming the top 3 levels of the tree are in memory, then you go from 2 disk seeks on average to 1.3 disk seeks on average, for a 35% speedup.
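The (log N) / (log B) formula is easy to check numerically. The following is a small Python sketch; the function name `btree_depth` is just an illustrative choice.

```python
import math


def btree_depth(rows: int, keys_per_node: int) -> float:
    """Approximate B-tree depth as log(N) / log(B)."""
    return math.log(rows) / math.log(keys_per_node)


if __name__ == "__main__":
    # A billion rows with 64 and then 128 keys per node, as in the quote.
    for keys in (64, 128):
        print(keys, round(btree_depth(10**9, keys), 1))
```

The base of the logarithm does not matter because it cancels in the ratio.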
I would also add that usually you don't have to optimize for I/O cost, because the data you use frequently should be in the InnoDB buffer pool, therefore it won't incur any I/O cost to read it. You should size your buffer pool sufficiently to make this true for most reads.
Simpler computation
The quick and dirty answer is log base 100, rounded up. That is, each node in the BTree has about 100 child nodes. In some circles, this is called fanout.
1K rows: 2 levels
1M rows: 3 levels
billion: 5 levels
trillion: 6 levels
These numbers work for "average" rows or indexes. Of course, you could have extremes of about 2 or 1000 for the fanout.
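The "log base 100, rounded up" rule can be sketched in a few lines of Python. Integer multiplication is used instead of a floating-point logarithm to avoid rounding surprises at exact powers of the fanout; the function name and the default fanout of 100 are assumptions taken from the rule of thumb above.

```python
def btree_levels(rows: int, fanout: int = 100) -> int:
    """Smallest number of levels whose capacity (fanout ** levels) covers `rows`."""
    levels = 0
    capacity = 1
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return max(levels, 1)  # even a tiny table has a root level


if __name__ == "__main__":
    for rows in (10**3, 10**6, 10**9, 10**12):
        print(rows, btree_levels(rows))
```

Passing a different `fanout` (say 2 or 1000) shows how far the extremes move the answer.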
Exact depth
You can find the actual depth from some information_schema:
For Oracle's MySQL:
$where = "WHERE ( ( database_name = ? AND table_name = ? )
OR ( database_name = LOWER(?) AND table_name = LOWER(?) ) )";
$sql = "SELECT last_update,
n_rows,
'Data & PK' AS 'Type',
clustered_index_size * 16384 AS Bytes,
ROUND(clustered_index_size * 16384 / n_rows) AS 'Bytes/row',
clustered_index_size AS Pages,
ROUND(n_rows / clustered_index_size) AS 'Rows/page'
FROM mysql.innodb_table_stats
$where
UNION
SELECT last_update,
n_rows,
'Secondary Indexes' AS 'BTrees',
sum_of_other_index_sizes * 16384 AS Bytes,
ROUND(sum_of_other_index_sizes * 16384 / n_rows) AS 'Bytes/row',
sum_of_other_index_sizes AS Pages,
ROUND(n_rows / sum_of_other_index_sizes) AS 'Rows/page'
FROM mysql.innodb_table_stats
$where
AND sum_of_other_index_sizes > 0
";
For Percona:
/* to enable stats:
percona < 5.5: set global userstat_running = 1;
5.5: set global userstat = 1; */
$sql = "SELECT i.INDEX_NAME as Index_Name,
IF(ROWS_READ IS NULL, 'Unused',
IF(ROWS_READ > 2e9, 'Overflow', ROWS_READ)) as Rows_Read
FROM (
SELECT DISTINCT TABLE_SCHEMA, TABLE_NAME, INDEX_NAME
FROM information_schema.STATISTICS
) i
LEFT JOIN information_schema.INDEX_STATISTICS s
ON i.TABLE_SCHEMA = s.TABLE_SCHEMA
AND i.TABLE_NAME = s.TABLE_NAME
AND i.INDEX_NAME = s.INDEX_NAME
WHERE i.TABLE_SCHEMA = ?
AND i.TABLE_NAME = ?
ORDER BY IF(i.INDEX_NAME = 'PRIMARY', 0, 1), i.INDEX_NAME";
(Those give more than just the depth.)
PRIMARY refers to the data's BTree. Names like "n_diff_pfx03" refer to the 3rd level of the BTree; the largest such number for a table indicates the total depth.
Row width
As for estimating the width of a row, see Bill's answer. Here's another approach:
Look up the size of each column (INT=4 bytes, use averages for VARs)
Sum those.
Multiply by between 2 and 3 (to allow for overhead of InnoDB)
Divide into 16K to get the average number of rows per leaf node.
Non-leaf nodes, plus index leaf nodes, are trickier because you need to understand exactly what represents a "row" in such nodes.
(Hence, my simplistic "100 rows per node".)
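The row-width steps above can be sketched in Python as follows. The column sizes in the example are hypothetical (an INT, a BIGINT, and a VARCHAR averaging 40 bytes), and the 2.5 overhead factor is just the midpoint of the "between 2 and 3" range.

```python
PAGE_SIZE = 16 * 1024  # default InnoDB page size in bytes


def rows_per_leaf_page(column_bytes, overhead_factor=2.5):
    """Estimate how many rows fit in one 16K leaf page."""
    raw_row = sum(column_bytes)                 # sum the column sizes
    estimated_row = raw_row * overhead_factor   # allow for InnoDB overhead
    return int(PAGE_SIZE // estimated_row)      # divide into 16K


if __name__ == "__main__":
    # Hypothetical table: INT (4) + BIGINT (8) + VARCHAR averaging 40 bytes.
    print(rows_per_leaf_page([4, 8, 40]))
```

For this made-up row the estimate lands near the simplistic "100 rows per node" figure.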
But who cares?
Here's another simplification that seems to work quite well. Since disk hits are the biggest performance item in queries, you need to "count the disk hits" as the first order of judging the performance of a query.
But look at the caching of blocks in the buffer_pool. A parent node is 100 times as likely to be recently touched as the child node.
So, the simplification is to "assume" that all non-leaf nodes are cached and all leaf nodes need to be fetched from disk. Hence the depth is not nearly as important as how many leaf node blocks are touched. This shoots down your "35% speedup" -- Sure 35% speedup for CPU, but virtually no speedup for I/O. And I/O is the important component.
Note that if you are fetching the latest 20 rows of a table that is chronologically stored, they will be found in the last 1 (or maybe 2) blocks. If they are stored by a UUID, it is more likely to take 20 blocks -- many more disk hits, hence much slower.
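Here is a rough Python sketch of "counting the disk hits" for those two cases. It models a row only by its position in primary-key order and assumes 100 rows per leaf block; the table size and the random sample are made up for illustration.

```python
import random

ROWS_PER_BLOCK = 100  # the rough "100 rows per node" assumption


def leaf_blocks_touched(row_positions, rows_per_block=ROWS_PER_BLOCK):
    """Count distinct leaf blocks holding the given rows (positions in PK order)."""
    return len({position // rows_per_block for position in row_positions})


if __name__ == "__main__":
    total_rows = 1_000_000
    # Chronological key: the latest 20 rows sit next to each other.
    latest = range(total_rows - 20, total_rows)
    # UUID-like key: the latest 20 rows land at effectively random positions.
    scattered = random.Random(0).sample(range(total_rows), 20)
    print("chronological:", leaf_blocks_touched(latest))
    print("scattered:    ", leaf_blocks_touched(scattered))
```

Adjacent rows share a block, while scattered rows mostly need one block each.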
Secondary Indexes
The PRIMARY KEY is clustered with the data. That implies that a lookup by the PK needs to drill down one BTree. But a secondary index is implemented by a second BTree -- drill down that to find the PK, then drill down via the PK. When "counting the disk hits", you need to consider both BTrees. And consider the randomness (eg, for UUIDs) or not (date-ordered).
Writes
1. Find the block (possibly cached)
2. Update it
3. If necessary, deal with a block split
4. Flag the block as "dirty" in the buffer_pool
5. Eventually write it back to disk.
Step 1 may involve a read I/O; step 5 may involve a write I/O -- but you are not waiting for it to finish.
Index updates
UNIQUE indexes must be checked before finishing an INSERT. This involves a potentially-cached read I/O.
For a non-unique index, an entry in the "Change buffer" is made. (This lives in the buffer_pool.) Eventually that is merged with the appropriate block on disk. That is, no waiting for I/O when INSERTing a row (at least not waiting to update non-unique indexes).
Corollary: UNIQUE indexes are more costly. But is there really any need for more than 2 such indexes (including the PK)?
I have a user account on a super computer where jobs are handled with slurm.
I would like to know the total amount of CPU hours that I have consumed on this super computer. I think that's an understandable question, because there is only a limited number of CPU hours available per project. I'm surprised that an answer is not easy to find.
I know that there are all these commands like sacct, sreport, sshare, etc... but it seems that there is no simple command that displays the used CPU hours.
Can someone help me out?
As others have commented, sacct should give you that information. You will need to look at the man page to get information for past jobs. You can specify a --starttime and --endtime to restrict your query to match your allocation as it ends/renews. The -l option should get you more information than you need, so you can get a smaller set of fields by specifying what you need with --format.
In your instance, the correct answer is to ask the administrators. You have been given an allocation of time to draw from. They likely have a system that will show you your balance and you can reconcile your balance against the output of sacct. Also, if the system you are using has different node types such as high memory, GPU, MIC, or old, they will likely charge you differently for those resources.
You can get an overview of the used CPU hours with the following:
sacct -SYYYY-mm-dd -u username -ojobid,start,end,alloccpu,cputime | column -t
You can calculate the total accounting units (SBU in our system) from CPUTime, which is the elapsed wall-clock time multiplied by the number of allocated CPUs (AllocCPUS).
An example:
JobID        NodeList        State      Start               End                 AllocCPUS  CPUTime
------------ --------------- ---------- ------------------- ------------------- ---------- ----------
6328552      tcn[595-604]    CANCELLED+ 2019-05-21T14:07:57 2019-05-23T16:48:15 240        506-17:12:00
6328552.bat+ tcn595          CANCELLED  2019-05-21T14:07:57 2019-05-23T16:48:16 24         50-16:07:36
6328552.0    tcn[595-604]    FAILED     2019-05-21T14:10:37 2019-05-23T16:48:18 240        506-06:44:00
6332520      tcn[384,386,45+ COMPLETED  2019-05-23T16:06:04 2019-05-24T00:26:36 72         25-00:38:24
6332520.bat+ tcn384          COMPLETED  2019-05-23T16:06:04 2019-05-24T00:26:36 24         8-08:12:48
6332520.0    tcn[384,386,45+ COMPLETED  2019-05-23T16:06:09 2019-05-24T00:26:33 60         20-20:24:00
6332530      tcn[37,41,44,4+ FAILED     2019-05-23T17:11:31 2019-05-25T09:13:34 240        400-08:12:00
6332530.bat+ tcn37           FAILED     2019-05-23T17:11:31 2019-05-25T09:13:34 24         40-00:49:12
6332530.0    tcn[37,41,44,4+ CANCELLED+ 2019-05-23T17:11:35 2019-05-25T09:13:34 240        400-07:56:00
The fields are listed in the man page. They can be given as -oOPTION (in lower case) or in proper POSIX notation, --format='Option,AnotherOption...' (a list is in the man page).
So far so good. But there is a big caveat here:
What you see here is perfect to get an idea of what you have run or what to expect in terms of CPU hours. But this will not necessarily reflect your real budget status, as in many cases each node / partition may have an extra parameter, the weight, which is a parameter set for accounting purposes and not part of SLURM. For instance, the GPU nodes may have a weight value of x3, which means that each GPU hour is measured as 3 SBU instead of 1 for budgetary purposes. What I mean to say is that you can use sacct to gain insight on the CPU times but this will not necessarily reflect how many SBU credits you still have.
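To turn sacct output like the table above into a single number of raw CPU hours, one option is a small Python sketch such as the following. It assumes whitespace-separated lines with JobID as the first field and CPUTime as the last, and it skips job steps (JobIDs containing a dot) on the assumption that their time is already included in the parent job's line. The function names are illustrative, and any site-specific weights are not applied.

```python
def cputime_to_hours(cputime: str) -> float:
    """Convert a sacct CPUTime string ([DD-]HH:MM:SS) to hours."""
    days = 0
    if "-" in cputime:
        day_part, cputime = cputime.split("-", 1)
        days = int(day_part)
    parts = [int(part) for part in cputime.split(":")]
    while len(parts) < 3:          # tolerate a short MM:SS form
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return total_seconds / 3600


def total_cpu_hours(lines) -> float:
    """Sum CPUTime over job lines, ignoring headers and job steps."""
    total = 0.0
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[0][0].isdigit():
            continue               # header, separator, or blank line
        if "." in fields[0]:
            continue               # job step such as 6328552.0 (assumed already counted)
        total += cputime_to_hours(fields[-1])
    return total


if __name__ == "__main__":
    import sys
    # Example: sacct ... | python this_script.py
    print(f"{total_cpu_hours(sys.stdin):.2f} CPU hours")
```

The result is raw CPU hours only; as noted above, the budget your site charges may differ.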
I need to write a program, which calculates product of product in range:
I have written the following code:
mult(N,N,R,R).
mult(N,Nt,R,Rt):-N1=Nt+1,R1=Rt*(1/(log(Nt))),mult(N,N1,R,R1).
This should implement a basic product from Nt to N of 1/ln(j). As far as I understand, it has to stop when Nt and N are equal. However, I can't get it working because of the following error:
?- mult(10,2,R,1), write(R).
ERROR: Out of global stack
Is there any other way to implement a loop without using the default libraries of SWI-Prolog?
Your program never terminates! To see this consider the following failure-slice of your program:
mult(N,N,R,R) :- false.
mult(N,Nt,R,Rt):-
N1=Nt+1,
R1=Rt*(1/(log(Nt))),
mult(N,N1,R,R1), false.
This new program never terminates, and thus the original program doesn't terminate either. To see that this never terminates, consider the two (=)/2 goals. In the first, the new variable N1 is unified with something. This will always succeed. Similarly, the second goal will always succeed. There will never be a possibility for failure prior to the recursive goal. And thus, this program never terminates.
You need to add some goal, or replace existing goals, in the visible part. Maybe add
N > Nt.
Further, it might be a good idea to replace the two (=)/2 goals by (is)/2. But this is not required for termination, strictly speaking.
Out of global stack means you entered a too-long chain of recursion, possibly an infinite one.
The problem stems from using = instead of is in your assignments.
mult(N,N,R,R).
mult(N,Nt,R,Rt):-N1 is Nt+1, R1 is Rt*(1/(log(Nt))), mult(N,N1,R,R1).
You might want to insert a cut in your first clause to avoid going on after getting an answer.
If you have a graphical debugger (like the one in SWI), try setting 'trace' and 'debug' on and running. You'll soon realize that performing N1 = Nt+1 with Nt given as 2 yields the term 2+1. Since 2+1+1+1+(...) will never unify with 10, that's the problem right there.
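As a cross-check on the value the version with is should return, here is a small Python sketch that follows the same recursion as a loop. Like those clauses, it multiplies by 1/ln(j) for j from Nt up to N-1 and stops once the counter equals N; the guard against Nt > N is an addition of this sketch.

```python
import math


def mult(n: int, nt: int, acc: float = 1.0) -> float:
    """Product of 1/ln(j) for j = nt .. n-1, stopping when nt reaches n."""
    if nt > n:
        raise ValueError("nt must not exceed n, or the counter never reaches n")
    while nt != n:                 # same stop condition as the first clause
        acc *= 1 / math.log(nt)    # R1 is Rt*(1/(log(Nt)))
        nt += 1                    # N1 is Nt+1
    return acc


if __name__ == "__main__":
    print(mult(10, 2))             # counterpart of ?- mult(10,2,R,1).
```

If 1/ln(N) is meant to be part of the product as well, the stop condition needs to move one step further.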
When profiling R code with Rprof-type functions we get the time spent in function alone and the time spent in function and callees. However, as far as I know we don't get the number of times a given function was evaluated.
For example, assume I want to compare two integration functions:
integrate_1(myfunc, from = -Inf, to = Inf)
integrate_2(myfunc, from = -Inf, to = Inf)
I could easily see how much time each function takes and where this time was spent, but I don't know how to check how many times myfunc had to be evaluated in each of the integrate functions.
Thanks,
One way of implementing Joran's counter method is to use the trace function.
For example, first we set the counter to zero. (Assigned in the global environment, for convenience.)
count <- 0
Then set up the trace. Here we set it on the identity function (that just returns the value that you input to it).
trace("identity", quote(count <<- count + 1), print = FALSE)
Now whenever identity is called, the value of count is incremented. print = FALSE just stops a message being printed to the console when the function is called.
Let's call the function a few times and inspect the count:
for(i in seq_len(123)) identity(1)
count
## [1] 123
Rprof works by sampling the call stack on a timer. It does not count calls.
It records the sampled call stacks in a file, and though it does not record line numbers where calls occur, those samples are still useful for seeing what causes time to be spent.
For example, if you happen to look at M random samples, and you see a pattern like A calling B calling C on N of them, then you know the program spends roughly fraction N/M of its time doing that (assuming N > 1).
If you see such a thing, and you can think of a way to avoid even part of it, you will save a substantial fraction of the total time.
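The N/M estimate can be illustrated with a few lines of Python. The stack samples below are made up, and the helper name is hypothetical; the sketch only assumes that each sample is a sequence of function names from outermost to innermost call.

```python
def fraction_with_pattern(samples, pattern):
    """Fraction of stack samples that contain `pattern` as a contiguous call chain."""
    pattern = tuple(pattern)
    width = len(pattern)

    def contains(stack):
        stack = tuple(stack)
        return any(stack[i:i + width] == pattern
                   for i in range(len(stack) - width + 1))

    hits = sum(1 for stack in samples if contains(stack))
    return hits / len(samples)


if __name__ == "__main__":
    # Hypothetical samples: M = 5 stacks, N = 3 of them show A -> B -> C.
    samples = [
        ("main", "A", "B", "C"),
        ("main", "A", "B", "C", "D"),
        ("main", "X"),
        ("main", "A", "B"),
        ("main", "A", "B", "C"),
    ]
    print(fraction_with_pattern(samples, ("A", "B", "C")))
```

With only a handful of samples the fraction is a rough estimate, which is enough to spot large time sinks.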
Rprof comes with a summarization tool that gives you the kind of numbers you mentioned, but I don't find those numbers useful anyway.
I would much rather get a real sense of what's happening.