Should I use WITH instead of a JOIN on a table with a lot of data? - mariadb

I have a MariaDB table which contains a lot of metadata and is very big in terms of bytes.
I have columns A and B in that table, along with other columns.
I would like to join that table with another table (stuff) in order to get column C from it.
So I have something like:
SELECT metadata.A, metadata.B, stuff.C
FROM metadata
JOIN stuff ON metadata.D = stuff.D
This query sometimes takes a very long time. I suspect (AFAIK, please correct me if I'm wrong) that this is because the JOIN stores the result of the join in some side table, and because the metadata table is very big it has to copy a lot of data even though I don't use it. So I thought about optimizing it with WITH as follows:
WITH m as (SELECT A,B,D FROM metadata),
s as (SELECT C,D FROM stuff)
SELECT * FROM m JOIN s ON m.D = s.D;
The execution plan is the same (using EXPLAIN), but I think it will be faster since the side tables that will be created by WITH (again AFAIK, WITH also creates side tables, please correct me if I'm wrong) will be smaller and only contain the needed data.
Is my logic correct? Is there some way I can test that in MariaDB?

More likely, there is some form of cache speeding up one query or the other.
The Query cache is usually recognizable by a query time that is only about 1ms. It can be turned off via SELECT SQL_NO_CACHE ... to get a timing to compare against.
The other likely cache is the buffer_pool. Data is read from disk into the buffer_pool unless it is already there. The simple workaround for strange timings is to run the query twice and take the second 'time'.
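If you want to rule the query cache out while comparing the two forms, a minimal sketch using the table and column names from the question:
-- Run each variant twice and compare the second timing, so the buffer_pool is warm.
-- SQL_NO_CACHE bypasses the query cache if it is enabled.
SELECT SQL_NO_CACHE metadata.A, metadata.B, stuff.C
FROM metadata
JOIN stuff ON metadata.D = stuff.D;
-- Repeat the timing with the WITH variant and compare.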
Your hypothesis that WITH creates 'small' temp tables falls apart because the work needed to read the original tables is the same with or without WITH.
Please provide SHOW CREATE TABLE for the two tables. There are a couple of datatype issues that may be involved -- big TEXTs or BLOBs.
The newly-added WITH opens up the possibility of recursive CTEs (and other things). And it provides a way to materialize a temp table that is used more than once. Neither of those applies in your query, so I would not expect any performance improvement.
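For contrast, a minimal sketch of the case the last point has in mind, where the CTE is referenced more than once so materializing it can actually pay off (assuming D is numeric; the filter on B is made up for illustration):
WITH m AS (SELECT A, B, D FROM metadata WHERE B > 100)
SELECT cur.D, cur.A, prev.A AS previous_A
FROM m AS cur
JOIN m AS prev ON prev.D = cur.D - 1;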

Related

websql performance, can we shard tables

I am using websql to store data in a phonegap application. One of the tables has a lot of data, say from 2000 to 10000 rows. So when I read from this table, which is just a simple select statement, it is very slow. I then debugged and found that as the size of the table increases the performance decreases exponentially. I read somewhere that to get performance you have to divide the table into smaller chunks; is that possible, and how?
One idea is to look for something to group the rows by and consider breaking into separate tables based on some common category - instead of a shared table for everything.
I would also consider fine tuning the queries to make sure they are optimal for the given table.
Make sure you're not just running a simple Select query without a where clause to limit the result set.
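A minimal sketch of what that looks like in the SQLite dialect WebSQL uses, with hypothetical table and column names (items, category) and a page size of 50:
CREATE INDEX IF NOT EXISTS idx_items_category ON items (category);
SELECT id, name, category
FROM items
WHERE category = ?      -- bind the category from JavaScript
ORDER BY id
LIMIT 50 OFFSET 0;      -- page through results instead of reading everything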

sqlite3 insert into dynamic table

I am using sqlite3 (maybe sqlite4 in the future) and I need something like dynamic tables.
I have many tables with the same format: values_2012_12_27, values_2012_12_28, ... (the number of tables is dynamic) and I want to dynamically select the table that receives some data.
I am using sqlite3_prepare with INSERT INTO ? VALUES(?,?,?). Of course this fails to compile (syntax error near "?"). Is there a nice and simple way to do this in SQLite?
Thanks
Using SQL parameters is not possible for identifiers such as table or column names.
If you don't want to keep so many prepared statements around, just prepare them on the fly whenever you need one.
If your database were properly normalized, you would have a single big values table with an extra date column.
This organization is usually to be preferred, unless you have measured both and found that the better performance (if it actually exists) outweighs the overhead of managing multiple tables.
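A sketch of that normalized layout, with hypothetical table and column names; one prepared INSERT then works for every day because the date is just another bound parameter:
CREATE TABLE IF NOT EXISTS values_all (
    value_date TEXT NOT NULL,   -- e.g. '2012-12-27'
    v1 REAL,
    v2 REAL,
    v3 REAL
);
CREATE INDEX IF NOT EXISTS idx_values_all_date ON values_all (value_date);
INSERT INTO values_all (value_date, v1, v2, v3) VALUES (?, ?, ?, ?);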

LINQ to entities performance regarding where clause

Let's say I have a table in a database with 10K records. I don't need to actually use those 10K records anymore, but I still need to keep them in the database. That very table is now going to be used to store new data, so there are going to be more records coming in on top of the 10K records already present in the table. As opposed to the "old" 10K records, I do need to work with the newly inserted data. Right now I'm doing this to get the data I need:
List<Stuff> l = (from x in db.Table
                 where x.id > id
                 select x).ToList();
My question now is: how does the where clause in LINQ (or in SQL in general) work under the covers? Is the ENTIRE table going to be searched until (x.id > id) is true? Because let's say the table grows from 10K records to 20K: it'd be a little silly to look through the entire 20K records if I know that I only have to start looking from a certain point.
I've had performance problems (not dramatic, but bad enough to be agitated by it) with this while using LINQ to Entities, which I don't quite understand, because it should be no problem at all for a modern computer to sift through a mere 20K records. I've been advised to use a stored procedure instead of a LINQ query, but I don't know whether or not this will boost performance.
Any feedback will be appreciated.
It's going to behave just like a similarly worded SQL query would. The question is whether the overhead you're experiencing is happening in the query or in the conversion of the query to a list. The query itself as you've written should equate literally to:
Select ID, Column1, Column2, Column3, ... , Column(n+1)
From db.Table
Where ID > id
This query should be fairly fast depending on the nature of the data. The query itself will not be executed until it is acted upon, however. In this case, you're converting it to a list, which is the equivalent of acting upon it. I can't find the comment someone made to me about this practice, but I've found it to be quite helpful in keeping performance clean. Unless you have some very specific need, you should leave your queries as IQueryable. Converting them to lists doubles the effort, because first the query must be executed and then the result set must be converted into an appropriate IEnumerable (List in this case).
So you have two potential bottlenecks. The simple query could be taking a long time to query a massive collection of data, or the number of records could be bottlenecking at the point where the List is created. Another possibility is the nature of ID in this case. If it is numeric, that will save you some time. If it's performing a text-based search then it's going to be heavier.
To answer your specific question: yes, it's going to search every record in the database and return all of the records that match the expression. Edit: if the database has a proper index on the column in question, it will not search EVERY record but rather will use the index to perform the search (from a comment by Pleun).
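If that index is missing, adding one on the filtered column is what turns the scan into a seek. A sketch, assuming SQL Server and the hypothetical table/column names used above (if id is already the clustered primary key, the seek happens without any extra work):
CREATE INDEX IX_Table_id ON dbo.[Table] (id);
-- With the index in place, "WHERE id > @id" is satisfied by an index seek
-- starting at @id instead of a scan of all 20K rows.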
As for using a stored procedure: the idea that it will fix performance by itself is a load of hogwash, but it's a perfectly acceptable alternative. I have several programs that routinely run similar queries against a database with over 40 million records, and the only performance issue I've run into so far has been CPU usage when multiple users are performing rapid-fire queries. To solve your specific issue, I'd recommend that you tune it a little in SQL Server Management Studio until the query you want returns to your interface with an acceptable speed. Then you can convert that query into a compatible LINQ statement. As long as you leave it as an IQueryable it should exhibit similar results.

Is count(*) really expensive?

I have a page where I have 4 tabs displaying 4 different reports based off different tables.
I obtain the row count of each table using a select count(*) from <table> query and display number of rows available in each table on the tabs. As a result, each page postback causes 5 count(*) queries to be executed (4 to get counts and 1 for pagination) and 1 query for getting the report content.
Now my question is: are count(*) queries really expensive -- should I keep the row counts (at least those that are displayed on the tab) in the view state of page instead of querying multiple times?
How expensive are COUNT(*) queries ?
In general, the cost of COUNT(*) is proportional to the number of records satisfying the query conditions plus the time required to prepare these records (which depends on the underlying query complexity).
In simple cases where you're dealing with a single table, there are often specific optimisations in place to make such an operation cheap. For example, doing COUNT(*) without WHERE conditions from a single MyISAM table in MySQL - this is instantaneous as it is stored in metadata.
For example, let's consider two queries:
SELECT COUNT(*)
FROM largeTableA a
Since every record satisfies the query, the COUNT(*) cost is proportional to the number of records in the table, i.e., proportional to what it returns (assuming it needs to visit the rows and there isn't a specific optimisation in place to handle it).
SELECT COUNT(*)
FROM largeTableA a
JOIN largeTableB b
ON a.id = b.id
In this case, the engine will most probably use HASH JOIN and the execution plan will be something like this:
Build a hash table on the smaller of the tables
Scan the larger table, looking up each record in the hash table
Count the matches as it goes.
In this case, the COUNT(*) overhead (step 3) will be negligible and the query time will be completely defined by steps 1 and 2, that is building the hash table and looking it up. For such a query, the time will be O(a + b): it does not really depend on the number of matches.
However, if there are indexes on both a.id and b.id, the MERGE JOIN may be chosen and the COUNT(*) time will be proportional to the number of matches again, since an index seek will be performed after each match.
You need to attach SQL Profiler or an app level profiler like L2SProf and look at the real query costs in your context before:
guessing what the problem is and trying to determine the likely benefits of a potential solution
allowing others to guess for you on da interwebs - there's lots of misinformation without citations about, including in this thread (but not in this post :P)
When you've done that, it'll be clear what the best approach is - i.e., whether the SELECT COUNT is dominating things or not, etc.
And having done that, you'll also know whether any changes you choose to do have had a positive or a negative impact.
As others have said, COUNT(*) generally has to physically count rows (outside special cases such as an unfiltered count on a MyISAM table), so if you can do that once and cache the result, that's certainly preferable.
If you benchmark and determine that the cost is negligible, you don't (currently) have a problem.
If it turns out to be too expensive for your scenario, you could make your pagination 'fuzzy', as in "Showing 1 to 500 of approx 30,000", by using
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('sometable') AND indid < 2
which will return an approximation of the number of rows (it's approximate because it's not updated until a CHECKPOINT).
If the page gets slow, one thing you can look at is minimizing the number of database roundtrips, if at all possible. Even if your COUNT(*) queries are O(1), if you're doing enough of them, that could certainly slow things down.
Instead of setting up and executing 5 separate queries one at a time, run the SELECT statements in a single batch and process the 5 results at once.
I.e., if you're using ADO.NET, do something like this (error checking omitted for brevity; non-looped/non-dynamic for clarity):
string sql = "SELECT COUNT(*) FROM Table1; SELECT COUNT(*) FROM Table2;";
SqlCommand cmd = new SqlCommand(sql, connection);
SqlDataReader dr = cmd.ExecuteReader();
// Defaults to first result set
dr.Read();
int table1Count = (int)dr[0];
// Move to second result set
dr.NextResult();
dr.Read();
int table2Count = (int)dr[0];
If you're using an ORM of some sort, such as NHibernate, there should be a way to enable automatic query batching.
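If you would rather keep a single result set than walk through multiple ones, the counts can also be folded into one row. A sketch using the same hypothetical table names:
SELECT
    (SELECT COUNT(*) FROM Table1) AS Table1Count,
    (SELECT COUNT(*) FROM Table2) AS Table2Count;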
COUNT(*) can be particularly expensive as it may result in loading (and paging) an entire table, where you may only need a count on a primary key (in some implementations it is optimised).
From the sound of it, you are causing a table load operation each time, which is slow, but unless it is running noticeably slowly, or causing some sort of problem, don't optimise: premature and unnecessary optimisation can cause a great deal of trouble!
A count on an indexed primary key will be much faster, but with the costs of having an index this may provide no benefit.
All I/O is expensive and if you can accomplish the task without it, you should. But if it's needed, I wouldn't worry about it.
You mention storing the counts in view state, certainly an option, as long as the behavior of the code is acceptable when that count is wrong because the underlying records are gone or have been added to.
This depends on what you are doing with the data in this table. If the rows change very often and you need all the counts every time, maybe you could make a trigger that fills another table consisting only of counts from this table. If you need to show this data separately, maybe you could just execute "select count(*)..." for only one particular table. This just came to my mind instantly, but there are other ways to speed this up, I'm sure. Cache the data, maybe? :)
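A sketch of that trigger idea in T-SQL, with hypothetical names (Table1, TableCounts); the counts table is seeded once and then kept up to date incrementally:
CREATE TABLE TableCounts (TableName sysname PRIMARY KEY, RowCnt bigint NOT NULL);
INSERT INTO TableCounts (TableName, RowCnt)
SELECT 'Table1', COUNT(*) FROM Table1;   -- seed once
GO
CREATE TRIGGER trg_Table1_Count ON Table1
AFTER INSERT, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- adjust the stored count by however many rows were added or removed
    UPDATE TableCounts
    SET RowCnt = RowCnt + (SELECT COUNT(*) FROM inserted)
                        - (SELECT COUNT(*) FROM deleted)
    WHERE TableName = 'Table1';
END;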

SQL Server 2005 - Select From Multiple DB's & Compile Results as Single Query

Ok, the basic situation: Due to a few mixed up starts, a project ends up with not one, but three separate databases, each containing a portion of the overall project data. All three databases are the same, it's just that, say 10% of the project was run into the first, then a new DB was made due to a code update and 15% of the project was run into the new one, then another code change required another new database for the rest of the project. Again, the pertinent tables are all exactly the same across all three databases.
Now, assume I wanted to take all three of those databases - bearing in mind that they can't just be combined into a single database due to primary key issues and so on - and run a single query that would look through all three of them, select a given set of data from each, then compile those three sets into one single result and return it to the reporting page I'm working on.
For reference, at its endpoint the data is output to an ASP.Net/VB.Net backed page, specifically a Gridview object. It doesn't need to be edited, fortunately, just displayed.
What would be the best way to approach this mess? I'm thinking that creating a temporary table would be my best bet, but honestly I'm stepping into a portion of SQL that I'm not familiar with here, and would appreciate any guidance somebody more experienced might have.
I'd say your best bet is to suck it up and combine the databases, even if it is a major pain to combine the primary keys. It may be a major pain now, but it is going to be 10x as painful over the life of the project.
You can do a union across multiple databases as Scott has pointed out, but you are in for a world of trouble as the application gets more complex. For example, even if you circumvent the technical limitations by having multiple tables/databases for the same entity, having duplicates in the PK for a logical entity is a world of trouble.
Implement the workaround solution if you must, but I guarantee you will hate yourself for it later.
Why not just use 3-part naming on the tables and union them all together?
select db1.dbo.Table1.Field1,
db1.dbo.Table1.Field2
from db1.dbo.Table1
UNION
select db2.dbo.Table1.Field1,
db2.dbo.Table1.Field2
from db2.dbo.Table1
UNION
select db3.dbo.Table1.Field1,
db3.dbo.Table1.Field2
from db3.dbo.Table1
-- where ...
-- order by ...
You should create what is called a Partitioned View for each of your tables of interest. These views do a union of the underlying base tables and, if needed, add a synthetic column to make the rows unique:
CREATE VIEW vTableXDB
AS
SELECT 'DB1' as db_key, *
FROM DB1.dbo.table
UNION ALL
SELECT 'DB2' as db_key, *
FROM DB2.dbo.table
UNION ALL
SELECT 'DB3' as db_key, *
FROM DB3.dbo.table;
You create one such view for each table and then design your reports on these views, not on the base tables. You must add the db_key to your join conditions. The query optimizer has some understanding of partitioned views and might be able to create plans that do the right thing and avoid joins that span multiple databases, but that is not guaranteed. If things go haywire and the optimizer does not recognize the partitioning, resulting in very bad execution times, you may have to move the db_key into the tables themselves and add some artificial check constraints on the base tables so that the optimizer can understand the partitioning (see the article I linked for details).
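As a sketch of what joining two such views looks like (vTableXDB from above plus a hypothetical vTableYDB with made-up column names), note the db_key in the join condition:
SELECT x.db_key, x.Id, y.SomeColumn
FROM vTableXDB AS x
JOIN vTableYDB AS y
  ON y.db_key = x.db_key
 AND y.TableXId = x.Id;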
You can actually join tables in different databases. For databases on the same server, the syntax just changes from "tablename.columnName" to "Database.Owner.tablename.columnName". If the databases live on different servers, you need a linked server and four-part names ("Server.Database.Owner.tablename.columnName"), and you will need to run some stored procedures as an admin to allow this connectivity. It's also pretty slow, but the effort to get it working is low.
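A sketch of the cross-server variant, with hypothetical server and database names (on the same server you can skip the linked-server setup and just use three-part names as in the other answers):
EXEC sp_addlinkedserver
    @server = N'OtherServer',
    @srvproduct = N'',
    @provider = N'SQLNCLI',
    @datasrc = N'otherserver.example.com';
SELECT t.Field1, t.Field2
FROM OtherServer.db2.dbo.Table1 AS t;   -- four-part name: server.database.schema.table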
If you have time to do it right look at data warehouse concepts. That's basically a temp table that collects the data you need to report on.
Building on Scott Ivey's excellent example above,
Use table name aliasing to simplify your code
Use UNION ALL instead of UNION assuming that your data is unique between the three databases
Code:
select
d1t1.Field1,
d1t1.Field2
from db1.dbo.Table1 AS d1t1
UNION ALL
select
d2t1.Field1,
d2t1.Field2
from db2.dbo.Table1 AS d2t1
UNION ALL
select
d3t1.Field1,
d3t1.Field2
from db3.dbo.Table1 AS d3t1
-- where ...
-- order by ...
