Is there any such thing as "too many indexes" when it comes to speed in SQLite3? - sqlite

I wanted to improve the performance on my SQLite3 database. I went with the most extreme course of action first (just to see what would happen) and added an index to every column of every table in the database.
The database size more than doubled, and to my surprise, performance dropped drastically. Where I had previously gotten 4000 selects per second I now get ~50 selects per second.
This question is not specifically about my case. My question is: can adding indexes decrease SELECT performance in SQLite3? I'm asking because I want to know whether my problem is that I added too many indexes, or whether I've made a mistake somewhere that is causing the slowdown.
To be more specific about my case: the database increased from 140 MB to 280 MB and I have an SSD.

There are several mechanisms by which additional indexes could cause a slowdown:
Most optimization decisions are designed for the worst case – when you're accessing data that is too large to fit into any cache and has to be loaded from disk.
If the data itself fits into the caches, but the various indexes used by your queries push the entire working set beyond what fits, you will get more swapping.
SELECT queries simply ignore any indexes that are not useful to them, so unused indexes do not slow reads directly.
However, INSERT/UPDATE/DELETE statements must update all indexes of the changed table, so every additional index will slow down such changes.
Use EXPLAIN QUERY PLAN to check which indexes are actually used by a query.
Read Query Planning and The SQLite Query Planner to understand how indexes can be used.
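For example, here is a minimal sketch of checking a query with EXPLAIN QUERY PLAN (the table, columns, and indexes are invented purely for illustration):
-- Hypothetical schema, purely for illustration.
CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT, price REAL);
CREATE INDEX idx_items_category ON items(category);
CREATE INDEX idx_items_price ON items(price);
-- Ask SQLite which index, if any, it plans to use for this query.
EXPLAIN QUERY PLAN
SELECT * FROM items WHERE category = 'books' AND price < 10;
The plan output shows either a SEARCH using one of the indexes or a SCAN of the whole table; any index that never shows up in your query plans is pure overhead.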

Related

Partitioning vs extra database

Where I work, we have a dilemma. We are using a database (MariaDB 10) that has one table which is growing very large (107.4 GiB as I write this, so 1.181 million rows). This of course affects the performance of the system.
A coworker and I had a discussion; he suggested using partitions on that table. This would likely improve performance, but it does not reduce the size of the DB.
I, however, have previously been working on a cron job that moves data older than 2 years from that table to an exact copy of the database in another location.
I feel that this is the more effective approach. I expect that it will not only improve performance (except while the cron job is running), but I know it will also reduce the size of the table.
We don't expect that our customers are interested in this old data anyway.
The question is: what would you choose? I prefer my option because the old data is not used anyway and it keeps the main DB a lot cleaner; my coworker prefers his solution because it means less load at all times and customers can still access the old data.
I have read some of the pros of partitioning but haven't yet found a comparison between partitioning and moving old data to another database/place.
The table in question is used by several queries. This is the most important insert:
INSERT INTO ".$defaultDataTable." (
sensor_data_type_id,
sequence_number,
value,
flag,
datetime
) VALUES (
'".Database::esc($sdtid)."',
'".Database::esc($valueSequence)."',
'".Database::esc($value)."',
'".Database::esc($valueSensorDataFlagsExtended)."',
'".Database::esc($valueDateTime)."'
);
The data is selected on several pages of the application; one example is the following.
SELECT
    ws_sensor_data_type.sensor_data_type_id as sensor_data_type_id,
    ws_sensor_data_type.name as sensor_data_type_name,
    ws_sensor_data_type.equation_id as equation_id,
    ws_sensor.name as sensor_name,
    ws_equation.description as data_type_name,
    ws_basestation.network_id as network_id,
    ws_basestation.name as basestation_name,
    ws_basestation.worldwide_id as worldwide_id,
    ws_client.name as client_name,
    ws_sensor.device_type_id as device_type,
    ws_sensor.device_id as device_id
FROM
    ws_sensor_data_type,
    ws_sensor,
    ws_basestation,
    ws_client_basestation,
    ws_client,
    ws_equation
WHERE ws_sensor.sensor_id = ws_sensor_data_type.sensor_id
  AND ws_sensor.basestation_id = ws_basestation.basestation_id
  AND ws_basestation.basestation_id = ws_client_basestation.basestation_id
  AND ws_client_basestation.client_id = ws_client.client_id
  AND ws_sensor_data_type.equation_id = ws_equation.equation_id
  AND ws_sensor_data_type.sensor_data_type_id = '".Database::esc($sdtid)."'
");
In this example, the data, along with some other information, is selected to create a .CSV export file.
The CREATE TABLE statement will follow, as I am currently creating a copy of the development DB to test partitioning on.
We do not use UUIDs, so that should not be a problem.
It depends.
Partitioning does not inherently improve performance; only a very limited number of use cases show any performance improvement.
If you are only fetching "recent" rows from the table and you have adequate indexing, then "neither" is the answer -- your million rows could grow to a billion without any performance degradation.
If you are using UUIDs, you are doomed. Performance declines terribly once the data is too big to be cached.
You have done some "hand waving". So have I. If you want to continue this discussion, please provide more specifics: CREATE TABLE, sample queries, the proposed partition mechanism, the proposed mechanism for accessing 'old' data, etc.
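To make "adequate indexing" and a possible partition mechanism concrete, here is a minimal sketch. The table name sensor_data and the column list are assumptions based on the INSERT above, not the actual schema, so treat this as illustration only:
-- Hypothetical table name; the real name is held in $defaultDataTable.
-- A composite index that serves "rows for one sensor data type in a time window":
ALTER TABLE sensor_data
    ADD INDEX idx_sdt_datetime (sensor_data_type_id, datetime);
-- One possible MariaDB partitioning scheme, by year of the datetime column.
-- Note: MariaDB requires the partitioning column to be part of every unique key,
-- so the primary key may need to include datetime for this to be accepted.
ALTER TABLE sensor_data
    PARTITION BY RANGE (YEAR(datetime)) (
        PARTITION p2015 VALUES LESS THAN (2016),
        PARTITION p2016 VALUES LESS THAN (2017),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );
Whether the partitions actually help depends on queries being able to prune by datetime; the composite index alone often delivers most of the benefit.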

How can I speed up search/browse/filter with 10 M products?

Background:
I'm using SQL Server 2008 and ASP.NET 4 on Windows 2008
I have one table with about 10 million rows of products that I make available online for users to browse -- not search. Each of the 10 million products has extra attributes -- like categories -- that I keep in lookup tables -- there are three or four lookup tables.
Problem
When someone browses and starts using filters (shipping location, price, quality, brand), I need to join the tables, apply all the filters, and return the results. It's very slow and I want to make it faster. Sometimes users will apply a very broad filter, resulting in 800,000 results, and though I only return the first 10 of those for browsing, I still need to run the query for the full 800,000.
What I've Tried Already
I've joined all the information from the various tables into one physical table and then created a covering index for the table.
The queries are much faster, but there is a good bit of maintenance I have to do on the table behind the scenes, with jobs that make sure that if something goes out of stock it is removed within a reasonable time frame (5 minutes or so).
I don't use materialized/indexed views b/c I've got aggregates in the results which SQL Server doesn't seem to like.
Question
How can I speed up browse results beyond the indexing and table optimization that I've already done? I'm not doing any full-text searches -- I'm filtering with exact parameters.
Possible Solutions I've Thought Of
Large caching solution -- AppFabric or Memcached. I know next to nothing about these and don't know whether they are appropriate.
Small caching solution -- Maybe leveraging ASP.NET caching -- but every person is going to apply different filters so I'm not sure how much this will give me.
SSDs -- as a larger-scale solution I've thought about getting SSDs but that will be down the road
CDN -- I don't think a CDN will help b/c the bottleneck here is my database's search capabilities, not the bandwidth/distance to the requester.
I had a similar problem with a complex join query causing horrible response times. I was able to solve it using Lucene.NET, a .NET implementation of the Lucene search index. Basically, you build indexes on data fields (your categories) and can then search via those categories and return thousands of rows very quickly. It takes the join operation out of the equation because the index already knows which records fit your criteria.
The following is a very good article on Lucene.NET. I highly recommend it. It took a search result that was taking 20 seconds using standard joins and reduced the response time to less than a second.
http://www.codeproject.com/Articles/29755/Introducing-Lucene-Net
Also, feel free to ping me if you have specific Lucene.NET implementation questions. I just got through a lot of research/learning in order to implement it properly on my site, so if you have specific questions on how to make it work I may be able to help with that as well.
"I perform the full query b/c I need to populate the new filters and
the number of results along with the search results. For example, if
someeone filters on category of "Shoes", and location of TX, some of
the other filters are going to be restricted based on the previous
filter."
Try executing two queries: One to count all results and one to select the top N. Maybe your bottleneck is copying 800,000 rows to the client. Doing two queries would fix this at the cost of an additional query. The cost is likely to be less than 2x though due to optimizations for few rows and for count-only queries.
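A minimal sketch of that two-query pattern in T-SQL (table, column, and parameter names are invented for illustration):
-- 1) Count only: the server can often answer this from an index without
--    materializing and shipping 800,000 rows to the client.
SELECT COUNT(*)
FROM dbo.Products p
WHERE p.Category = @category AND p.ShipLocation = @location;
-- 2) Fetch just the first page for display.
SELECT TOP (10) p.ProductId, p.Name, p.Price
FROM dbo.Products p
WHERE p.Category = @category AND p.ShipLocation = @location
ORDER BY p.Price;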

Cache results from sql database, or query each time?

I'm generating pages based on an sql query.
This is the query:
CREATE PROCEDURE sp_searchUsersByFirstLetter
    @searchQuery nvarchar(1)
AS
BEGIN
    SET NOCOUNT ON;
    SELECT UserName
    FROM Users JOIN aspnet_Users asp ON Users.UserId = asp.UserId
    WHERE (LoweredUserName LIKE @searchQuery + '%')
END
I can call this procedure for each letter in the alphabet, and get all the users that start with that letter. Then, I put these users into a list on one of my pages.
My question is this: would it be better to cache the list of users to my webserver, rather than query the database each time? Like this:
HttpRuntime.Cache.Insert("users", listOfUsersReturnedFromQuery, null, DateTime.Now.AddHours(1), System.Web.Caching.Cache.NoSlidingExpiration);
It's OK for me if the list of users is an hour out of date. Will this be more efficient than querying the database each time?
Using a cache is best reserved for situations where your query meets the following constraints:
The data is not time critical, i.e. make sure a cache hit won't break your application by causing your code to miss a recent update of the data.
The data isn't sequenced, i.e. A, B, C, D, E are cached, F is inserted by another user, your user inserts G and hits the cache, resulting in ABCDEG instead of ABCDEFG.
The data doesn't change much.
The data is queried and re-used frequently.
Size isn't really a factor unless it's going to really tax your RAM.
I have found that one of the best tables to cache is a settings table, where the data is practically static, gets queried on nearly every page request, and changes don't have to be immediate.
The best thing for you to do would be to test which queries are performed most, then select those that tax the database server the most. Out of those, cache anything you can afford to. You should also look at tweaking maximum cached-object ages. If you're performing a query 100 times a second, you can cut that rate by 99% simply by caching the result for 1 second, which negates the update-delay problem for most practical situations.
If you have several web servers, in-memory caching isn't as attractive, because it takes memory in each server and in each w3wp worker process on every server.
It will also be hard to keep the cached data consistent.
I would advise choosing between:
a basic output cache (assuming you are using MVC, this is zero effort and a good improvement)
a DB cache using a smaller pre-calculated table that maps the input string to the 10 possible results (see the sketch below)
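A rough sketch of that second option (table and column names are invented; the refresh schedule and exact shape are up to you):
-- Hypothetical pre-calculated lookup table, rebuilt on a schedule (e.g. hourly).
CREATE TABLE UsersByFirstLetter (
    FirstLetter nvarchar(1) NOT NULL,
    UserName    nvarchar(256) NOT NULL
);
CREATE CLUSTERED INDEX IX_UsersByFirstLetter ON UsersByFirstLetter (FirstLetter);
-- Refresh job: repopulate from the live tables.
TRUNCATE TABLE UsersByFirstLetter;
INSERT INTO UsersByFirstLetter (FirstLetter, UserName)
SELECT LEFT(asp.LoweredUserName, 1), u.UserName
FROM Users u JOIN aspnet_Users asp ON u.UserId = asp.UserId;
The stored procedure above would then read from UsersByFirstLetter instead of joining the live tables on every request.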
It really depends. Do you bottleneck at your database server (I would hope that answer is no)? If you are hitting the database 26 times, that is nothing compared to what typically happens. You should be considering caching data in a Dataset or some other offline model if you are hitting the database hundreds of thousands of times.
So I would say, no. You should be fine with your round trips to the database.
But there is no replacement for testing. That'll tell you for sure.
Considering that each DB call is expensive in terms of network and DB load, I would prefer to avoid such extra operations and cache items even if they are only requested a few times per hour.
The only opposite case I see is when the amount of user data, in terms of memory consumption, runs to many megabytes.
Caching data and reading it back is fastest, but it also depends on the data size... if there is a large amount of data, it will cause performance issues.
So it really depends on your requirements.
I would suggest using paging, or a mixed mode: load half of the users into the cache and load the rest on demand.

Strategy for handling variable time queries?

I have a typical scenario that I'm struggling with from a performance standpoint. The user selects a value from a dropdown and clicks a button. A stored procedure takes that value as an input parameter, executes, and returns the results to a grid. For just one of the values ('All'), the query runs for roughly 2.5 minutes. For the rest of the values the query runs in less than 1 ms.
Obviously, having the user wait for 2.5 minutes just isn't going to fly. So, what are some typical strategies to handle this?
Some of my own thoughts:
New table that stores the information for the 'All' value and is generated nightly
Cache the data on the caching server
Any help is appreciated.
Thanks!
Update
A little bit more info:
The SP returns two result sets. The first is a GROUP BY rollup summary and the second is the first result set disaggregated (roughly 80,000 rows).
I would first check whether you have the proper indexes in place. Using the Query Analyzer and the Database Engine Tuning Advisor is a simple and often effective way of seeing which indexes might help.
If you still have performance problems after creating the appropriate indexes you might then look at adding tables/views to speed things up. If your query does a lot of joins you might consider creating an indexed view that allows you to do a select with no joins on the denormalized data. Since indexed views are persisted you can see big gains from their use.
You can read up on indexed views here:
http://msdn.microsoft.com/en-us/library/dd171921%28v=sql.100%29.aspx
and read about the database tuning adviser here:
http://msdn.microsoft.com/en-us/library/ms166575.aspx
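A minimal sketch of what an indexed view looks like (schema and column names are invented; the real view would be built over the tables your SP joins):
-- Indexed views require SCHEMABINDING and two-part object names; the unique
-- clustered index is what persists the view's data.
CREATE VIEW dbo.vw_OrderSummary
WITH SCHEMABINDING
AS
SELECT o.OrderId, o.CustomerId, c.Region, o.OrderTotal
FROM dbo.Orders o
JOIN dbo.Customers c ON c.CustomerId = o.CustomerId;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vw_OrderSummary
    ON dbo.vw_OrderSummary (OrderId);
Queries can then read the denormalized rows directly from the view instead of repeating the joins.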
Also, how many records does "All" return? I have seen people get hung up on the "All" scenario before, but if it returns 1 million records or something then the data is not usable to a person anyways...
Caching data is a good thing, but.... if the SP is inherently flawed, then you might want to actually fix it instead of trying to bandage it with caching.
You might also want to (since you didn't mention here) look at the number of rows "All" returns compared to the other selections and think about your indexes.
Also, in your SP, does the "All" value cause it to run a different set of T-SQL (maybe via a CASE or an IF), or is it running the same code just with a different WHERE?
It might simply be that "ALL" just returns A LOT of records. You may want to implement paging and partial dataset return using ajax... (kinda like return the first 1000 records early so that it can be displayed and also show a throbber on the screen while the rest of the dataset is returned)
These are all options... if the number of records really isn't that different between ALL and the others, then it probably has something to do with the query/index/program flow.

How many rows can an SQLite table hold before queries become time consuming

I'm setting up a simple SQLite database to hold sensor readings. The tables will look something like this:
sensors
- id (pk)
- name
- description
- units
sensor_readings
- id (pk)
- sensor_id (fk to sensors)
- value (actual sensor value stored here)
- time (date/time the sensor sample was taken)
The application will be capturing about 100,000 sensor readings per month from about 30 different sensors, and I'd like to keep all sensor readings in the DB as long as possible.
Most queries will be in the form
SELECT * FROM sensor_readings WHERE sensor_id = x AND time > y AND time < z
This query will usually return about 100-1000 results.
So the question is: how big can the sensor_readings table get before the above query becomes too time consuming (more than a couple of seconds on a standard PC)?
I know that one fix might be to create a separate sensor_readings table for each sensor, but I'd like to avoid this if it is unnecessary. Are there any other ways to optimize this DB schema?
If you're going to be using time in the queries, it's worthwhile adding an index to it. That would be the only optimization I would suggest based on your information.
100,000 insertions per month equates to about 2.3 per minute so another index won't be too onerous and it will speed up your queries. I'm assuming that's 100,000 insertions across all 30 sensors, not 100,000 for each sensor but, even if I'm mistaken, 70 insertions per minute should still be okay.
If performance does become an issue, you have the option to offload older data to a historical table (say, sensor_readings_old) and only do your queries on the non-historical table (sensor_readings).
Then you at least have all the data available without affecting the normal queries. If you really want to get at the older data, you can do so but you'll be aware that the queries for that may take a while longer.
Are you setting indexes properly? Besides that and reading http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html, the only answer is 'you'll have to measure it yourself' - especially since this is heavily dependent on the hardware, on whether you're using an in-memory database or one on disk, and on whether you wrap inserts in transactions.
That being said, I've hit noticeable delays after a couple of tens of thousands of rows, but that was absolutely non-optimized - from reading a bit I get the impression that there are people with hundreds of thousands of rows, with proper indexes etc., who have no problems at all.
SQLite now supports R-tree indexes ( http://www.sqlite.org/rtree.html ), ideal if you intend to do a lot of time range queries.
Tom
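For completeness, here is a rough sketch of what using the R-tree module might look like for this schema (names are invented; whether it beats a plain composite index for one-dimensional time ranges is something you would have to measure):
-- Hypothetical side table using SQLite's R-tree module (1 dimension = id + min/max).
CREATE VIRTUAL TABLE sensor_reading_times USING rtree(
    id,       -- same id as sensor_readings.id
    t_min,    -- start of the interval covered by this reading
    t_max     -- end of the interval (equal to t_min for point samples)
);
-- Range query: let the R-tree narrow the candidate set, then join back.
-- R-tree coordinates are stored with limited precision, so re-check the real
-- time column on the joined rows.
SELECT r.*
FROM sensor_reading_times t
JOIN sensor_readings r ON r.id = t.id
WHERE t.t_min >= 1000 AND t.t_max <= 2000
  AND r.time >= 1000 AND r.time <= 2000;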
I know I am coming to this late, but I thought this might be helpful for anyone that comes looking at this question later:
SQLite tends to be relatively fast on reading as long as it is only serving a single application/user at a time. Concurrency and blocking can become issues with multiple users or applications accessing it at a single time and more robust databases like MS SQL Server tend to work better in a high concurrency environment.
As others have said, I would definitely index the table if you are concerned about the speed of read queries. For your particular case, I would probably create one index that included both id and time.
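A sketch of that composite index, assuming "id" here means the sensor_id column used in the WHERE clause:
-- Matches: SELECT * FROM sensor_readings WHERE sensor_id = x AND time > y AND time < z
-- Equality column first, range column second.
CREATE INDEX idx_sensor_readings_sensor_time
    ON sensor_readings (sensor_id, time);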
You may also want to pay attention to the write speed. Insertion can be fast, but commits are slow, so you probably want to batch many insertions together into one transaction before hitting commit. This is discussed here: http://www.sqlite.org/faq.html#q19
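For example, a batch of inserts wrapped in a single transaction (values are placeholders, and id is assumed to be assigned automatically):
-- Without an explicit transaction every INSERT is its own commit (one sync each);
-- wrapping a batch in one transaction amortizes that cost.
BEGIN TRANSACTION;
INSERT INTO sensor_readings (sensor_id, value, time) VALUES (1, 20.5, '2009-06-01 12:00:00');
INSERT INTO sensor_readings (sensor_id, value, time) VALUES (1, 20.7, '2009-06-01 12:01:00');
-- ... many more rows ...
COMMIT;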
