I am running a web application with a backend SQLite database that solely performs read operations. Users connect to the database, search for entries via a SELECT command, and view the results in a browser. However, the SELECT is quite time-consuming because it involves character pattern matching across several million table rows. (The size of the returned result set is quite small.)
Different users will generally do the exact same search, so if I can cache the results of the SELECT the first time it runs, the next user to search the database (concurrently, or more likely a few days later) can get the results back quickly.
How can I do this in SQLite? Is there a pragma I need to use? I hear that SQLite has an automatic caching feature, but this does not seem to help. Note that I'm using a hosting service, so I cannot rebuild SQLite in any way.
Any help would be much appreciated.
I would use an external caching solution like memcached or APC/Zend Cache (on PHP).
Then you have much more control over your cache (what to store, lifetimes, clearing the cache completely...)
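For example, here is a minimal sketch of result caching with APCu in PHP; the helper name, key scheme, and TTL are only illustrative, and it assumes the APCu extension is enabled on your host:

<?php
// Hypothetical helper: return the rows for a query from the cache, or run the
// query once and cache the rows. Assumes APCu is enabled and $pdo is an open
// PDO connection to the SQLite file.
function cachedQuery(PDO $pdo, string $sql, array $params = [], int $ttl = 86400): array
{
    $key = 'q_' . md5($sql . serialize($params)); // cache key derived from the query
    $rows = apcu_fetch($key, $hit);
    if ($hit) {
        return $rows;                             // served from the cache, no DB work
    }
    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
    apcu_store($key, $rows, $ttl);                // keep for a day; tune the TTL as needed
    return $rows;
}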
Are the users using different connections to the SQLite DB? Try using the PRAGMA command to increase cache size.
PRAGMA cache_size;
PRAGMA cache_size = Number-of-pages;
Query or change the suggested maximum number of database disk pages that SQLite will hold in memory at once per open database file.
From the documentation: http://www.sqlite.org/pragma.html#pragma_cache_size
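Keep in mind that this page cache belongs to the connection, so the pragma has to be issued each time a connection is opened. A minimal PDO sketch (the database path and page count are only illustrative):

<?php
// Open the SQLite file and raise the page cache for this connection.
// cache_size is per open database connection, so set it on every connect.
$pdo = new PDO('sqlite:/path/to/app.db');              // illustrative path
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('PRAGMA cache_size = 10000;');              // hold roughly 10,000 pages in memory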
After posting the above question, I found one simple solution that seems effective and doesn't require any changes to the cPanel hosting service that I use.
Instead of caching the SQL results, I simply cache the entire webpage generated by the PHP script. Users making the exact same search are then given the cached page, bypassing the database completely.
I got the basic idea here.
The advantage is that quite a complex result set involving several different database calls can all be cached in very little space. Then, using PHP, it's a trivial task to delete the cached webpages after a certain time.
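A minimal sketch of that page-caching approach, assuming the search URL fully identifies the query (the cache directory, expiry time, and file naming are illustrative, and there is no locking here):

<?php
// Serve a cached copy of the page if one exists and is still fresh;
// otherwise build the page as usual and save the output for the next visitor.
$cacheDir  = __DIR__ . '/cache';                                  // illustrative location
$cacheFile = $cacheDir . '/' . md5($_SERVER['REQUEST_URI']) . '.html';
$maxAge    = 60 * 60 * 24 * 3;                                    // expire after three days

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
    readfile($cacheFile);                                         // cache hit: no database work at all
    exit;
}

ob_start();                                                       // cache miss: capture everything we output
// ... run the slow SELECT(s) and render the results page here ...
$html = ob_get_contents();
if (!is_dir($cacheDir)) {
    mkdir($cacheDir, 0755, true);
}
file_put_contents($cacheFile, $html);                             // store for the next identical search
ob_end_flush();                                                   // and still send the page to this user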
I hope this helps others working on similar problems.
I have a very similar issue which I attempted to optimise by hashing queries, and storing the hash/result in a CACHE table.
But there are gotchas - for example, the commands:
SELECT * FROM myTable WHERE (col1 BETWEEN 1000 AND 2000) AND (col2='StackOverflow');
SELECT * FROM myTable WHERE (col2='StackOverflow') AND (col1 BETWEEN 1000 AND 2000);
Should give exactly the same answer, but MD5'ing those two strings will give different results. The same goes for:
SELECT col1,col2 FROM myTable;
SELECT col2,col1 FROM myTable;
If you can standardise how the queries come into the database (sorting the SELECTed columns, WHERE filters, grouping, etc. - all in your PHP/JavaScript), then you can reliably quick-hash the queries, run them, and store the result in your database. I would recommend storing it as JSON if you're not going to return whole pages like the OP, because that saves having to convert it later, and JSON is pretty much always gzipped before sending it to the user (if you're AJAXing it to the user's screen).
To really speed this up, your CACHE table should declare the Hash column as UNIQUE so that it is automatically indexed.
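A rough PHP sketch of that flow, assuming the query string has already been normalised as described above (the table and column names are illustrative, e.g. CREATE TABLE CACHE (Hash TEXT UNIQUE, Result TEXT)):

<?php
// Look up a normalised query in the CACHE table by hash; on a miss, run the
// query, store the JSON-encoded rows, and return them. $pdo is an open PDO
// connection to the SQLite database.
function cachedResult(PDO $pdo, string $normalisedSql): string
{
    $hash = md5($normalisedSql);

    $stmt = $pdo->prepare('SELECT Result FROM CACHE WHERE Hash = ?');
    $stmt->execute([$hash]);
    $cached = $stmt->fetchColumn();
    if ($cached !== false) {
        return $cached;                          // cache hit: JSON ready to send to the client
    }

    $rows = $pdo->query($normalisedSql)->fetchAll(PDO::FETCH_ASSOC);
    $json = json_encode($rows);

    $ins = $pdo->prepare('INSERT OR REPLACE INTO CACHE (Hash, Result) VALUES (?, ?)');
    $ins->execute([$hash, $json]);
    return $json;
}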
If you are using Node.js, then I'd go one step further and say you can read this CACHE table from disk on startup, store all the hashes in a Set object, and then, when your server gets a query, just do allHashes.has(hashedSortedQuery). Since that operation happens in memory, it is essentially instantaneous. If it returns true, run SELECT Result FROM CACHE WHERE Hash='hashedSortedQuery'; otherwise, run the actual query.
One final note - from my experience doing the same thing, updating my SQLite version made a HUGE improvement in speed, and I only mention it because on shared hosting servers the latest version isn't always installed. Here's a comparison I did today between the global SQLite version installed on my server and the latest SQLite version compiled from source: http://pastebin.com/hpWu3UCk
Related
AFAIK, Memcached does not support synchronization with the database (at least not with SQL Server or Oracle). We are planning to use Memcached (it is free) with our OLTP database.
In some business processes we do heavy validations that require a lot of data from the database. We cannot keep a static copy of this data, as we don't know whether it has been modified, so we fetch it every time, which slows the process down.
One possible solution could be:
Write triggers on the database to create/update prefixed-postfixed (table-PK1-PK2-PK3-column) files whenever records change
Monitor these file changes using FileSystemWatcher and expire the key (table-PK1-PK2-PK3-column) so that updated data is fetched
Problem: There would be around 100,000 users using any combination of data for 10 hours. So we would end up with a lot of files, e.g. categ1-subcateg5-subcateg-78-data100, categ1-subcateg5-subcateg-78-data250, categ2-subcateg5-subcateg-78-data100, categ1-subcateg5-subcateg-33-data100, etc.
I am expecting at least 5 million files. Now it looks like a pathetic solution :(
Other possibilities are:
Call a web service asynchronously from the trigger, passing the key to be expired
Call an exe from the trigger without waiting for it to finish, and have that exe expire the key. (I have had some success with this approach on SQL Server, using xp_cmdshell to call an exe; calling an exe from an Oracle trigger looks a bit more difficult.)
Still sounds pathetic, doesn't it?
Any intelligent suggestions, please?
It's not clear (to me) whether the use of Memcached is mandatory or not. I would personally avoid it and instead use SqlDependency and OracleDependency. Both allow you to pass a DB command and get notified when the data that the command would return changes.
If Memcached is mandatory, you can still use these two classes to trigger the invalidation.
MS SQL Server has "Change Tracking" features that may be of use to you. You enable the database for change tracking and configure which tables you wish to track. SQL Server then creates change records on every update, insert, and delete on a table, and then lets you query for changes to records that have been made since the last time you checked. This is very useful for syncing changes and is more efficient than using triggers. It's also easier to manage than making your own tracking tables. This has been a feature since SQL Server 2008.
How to: Use SQL Server Change Tracking
Change Tracking only captures the primary keys of the tables and lets you query which fields might have been modified. Then you can query the tables, joining on those keys, to get the current data. If you want it to capture the data as well, you can use Change Data Capture, but it requires more overhead and at least SQL Server 2008 Enterprise edition.
Change Data Capture
I have no experience with Oracle, but I believe it has some tracking functionality as well. This article might get you started:
20 Using Oracle Streams to Record Table Changes
I have developed a CRM for my company. Next I would like to take that system and make it available for others to use in a hosted format, very much like salesforce.com. The question is what type of database structure I would use. I see two options:
Option 1. Each time a company signs up, I clone the master database for them.
The disadvantage of this is that I could end up with thousands of databases. That's a lot of databases to back up every night. My CRM uses cron jobs for maintenance, and those jobs would have to run on all databases.
Each time I upgrade the system with a new feature and need to add a new column to the database, I will have to add that column to thousands of databases.
Option 2. Use only one database.
At the beginning of EVERY table add "CompanyID". In EVERY SQL statement add "and companyid={companyid}".
The advantage of this method is the simplicity of only one database. Just one database to back up each night. Just one database to update when needed.
The disadvantage is that if I get 1,000 companies signing up and each wants to store data on 100,000 leads, that's 100,000,000 rows in the lead table, which worries me.
Does anyone know how the online hosted CRMs like salesforce.com do this?
Thanks
Wouldn't you clone a table structure style to each new database ID, with all sheets archived in the master base, and the indexed client clone hash-verified to access a specific sheet, run through a host machine at the front end of the master system, then directing requests as the primary role? Internal access is batched to read/write slave systems in groups. Obviously set RAID configurations to copy in real time and on a schedule. Balance requests for load against the speed of system resources. That way you separate the security flaws from the UI and the connection to the retention schema. It seems like simplified structures and reduced policy requests cut down request RSS in the query processing. Or, simply, a man-in-the-middle approach from the inside out.
1) Splitting your data into smaller groups in a safe, unthinking way (such as one database per grouping) is almost always best if you want to scale. In this case, unless for some reason you want to query between companies, keeping them in separate databases is best.
2) If you are updating all of the databases by hand, you are doing something wrong if you want to scale. You'd want to automate the process.
3) Ultimately, salesforce.com uses this as the basis of their DB infrastructure:
http://blog.database.com/blog/2011/08/30/database-com-is-open-for-business/
I currently have a problem where the performance of my database is impacted by several million rows of updates that run (they take more or less 3 days, so we usually run them over a weekend).
However, since the site is live, search performance is impacted. A 3-second query to pull 1.3 million records and page through them sometimes exceeds SQL Server's default timeout values. This obviously creates a user experience no one wants (or can afford) to have happen.
My question now: if I set up replication from the Master to a Slave on the same server, would I be able to point the website to the Slave and avoid that performance impact? Or would it just duplicate the same problem, since the Master will push any updates through to the Slave in any case?
I don't think replication is going to help you here; it is only going to make things on the source system worse, IMHO.
Is it possible you can make a static copy of the data for the users running queries while the updates are going on? For reporting solutions that don't need to be up-to-the-minute, I've done this in several cases using two schemas - one that holds the static versions of the tables for querying, and one where the work is being done; when the work is done, switch. I go into this methodology in a little more detail here: What is the best way to refresh a rollup table under load?
Perhaps another thought is to make your updates more efficient, such that they don't take 3 days? Do you only do this on long weekends?
The question is, what is the nature of your site? Do users use it just to "Search", or does it do CRUD operations? If it is just for "Search" and report generation, then I agree with @Aaron. You can have a separate database just for reporting purposes; you can even use Log Shipping to automatically update your reporting database at a very brief interval.
Is it possible that users can change data at the same time as records are being updated by your update process? In that case, you will have to update your primary database using your update job, and then update the primary database again with the changes users made through the slave database.
The benefits of using SQLite storage
for the template cache are faster read
and write operations when the number
of cache elements is important.
I've never used it yet, but how can using SQLite be faster than the plain file system?
IMO the overhead (initiating a connection) will make it slower.
BTW, can someone provide a demo of how to use SQLite?
There is no real notion of "initiating a connection": an SQLite database is stored as a single file in the local filesystem, so there is nothing like a network connection.
I suppose using an SQLite database can be seen as fast because there is only one file (the database), and not one file per template -- and each access to a file costs some resources; the operating system might be able to cache accesses to one big file more efficiently than several accesses to several distinct small files.
About a "demo how to use SQLite", it kind of depends on the language you'll be using, but you can start by taking a look at the SQLite documentation, and the API that's available in your programming language ; accessing an SQLite DB is not that hard : basically, you have to :
"Connect" to the DB -- i.e. open the file
Issue some SQL queries
Close the connection
It's not much different than with any other DB engine: the biggest difference is that there is no need to set up any DB server.
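For instance, in PHP (via PDO) those three steps look roughly like this; the file path and queries are only placeholders:

<?php
// 1. "Connect" to the DB -- i.e. open the file (it is created if it doesn't exist)
$db = new PDO('sqlite:/path/to/templates.db');        // placeholder path
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// 2. Issue some SQL queries
$db->exec('CREATE TABLE IF NOT EXISTS cache (name TEXT PRIMARY KEY, body TEXT)');
$stmt = $db->prepare('SELECT body FROM cache WHERE name = ?');
$stmt->execute(['homepage']);
$body = $stmt->fetchColumn();

// 3. Close the connection
$db = null;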
The benefits of SQLite over a standard file system would lie in its caching mechanism. SQLite stores data in pages and caches pages in memory. Repeated calls for data that are on pages already in memory will skip a call out to the file system.
There is some overhead in using SQLite though. When you connect to a SQLite database the engine reads and parses the schema. On our system, that takes 30ms (although it's usually less than 1ms for smaller schemas--we have just under a hundred tables and hundreds of triggers and indexes).
We are developing an ASP.NET HR Application that will make thousands of calls per user session to relatively static database tables (e.g. tax rates). The user cannot change this information, and changes made at the corporate office will happen ~once per day at most (and do not need to be immediately refreshed in the application).
About 2/3 of all database calls are to these static tables, so I am considering just moving them into a set of static objects that are loaded during application initialization and then refreshed every 24 hours (if the app has not restarted during that time). Total in-memory size would be about 5MB.
Am I making a mistake? What are the pitfalls to this approach?
From the info you present, it looks like you definitely should cache this data -- rarely changing and so often accessed. "Static" objects may be inappropriate, though: why not just access the DB whenever the cached data is, say, more than N hours old?
You can vary N at will, even if you don't need special freshness -- even hitting the DB 4 times or so per day will be much better than "thousands [of times] per user session"!
It may be best to keep, along with the DB info, a timestamp or datetime recording when it was last updated. This way, the check for "is my cache still fresh" is typically very lightweight: just get that "latest update" value and compare it with the latest update on which you rebuilt the local cache. Kind of like an HTTP "if modified since" caching strategy, except you'd be implementing most of it DB-client-side ;-).
If you decide to cache the data (vs. making a database call each time), use the ASP.NET Cache instead of statics. The ASP.NET Cache provides functionality for expiry, handles multiple concurrent requests, and can even invalidate the cache automatically using the query notification features of SQL 2005+.
If you use statics, you'll probably end up implementing those things anyway.
There are no drawbacks to using the ASP.NET Cache for this. In fact, it's designed for caching data too (see the SqlCacheDependency class http://msdn.microsoft.com/en-us/library/system.web.caching.sqlcachedependency.aspx).
With caching, a DBMS is plenty efficient with static data anyway, especially with only 5 MB of it.
True, but the point here is to avoid the database round trip altogether.
ASP.NET Cache is the right tool for this job.
You didn't state how you will be able to find the matching data for a user. If it is as simple as finding a foreign key in the cached set, then you don't have to worry.
If you implement some kind of filtering/sorting/paging or, worse, searching, then you might at some point miss the querying capabilities of SQL.
ORMs often have their own querying, and LINQ makes things easy too, but it is still not SQL.
(Try grouping by 2 columns.)
Sometimes it is a good approach to have the DB return only the keys of a result set and use the Cache to fill in the complete set.
Think: Premature Optimization. You'll still need to deal with the data as tables eventually anyway, and you'd be leaving an "unusual design pattern".
With even default caching, a DBMS is plenty efficient with static data anyway, especially with only 5 MB of it. And the DBMS partitioning you're describing is often described as an antipattern. One example: multiple identical databases for multiple clients. There are other questions here on SO about this pattern. I understand there are security issues, but doing it this way creates other security issues. I've recently seen this same concept in a medical billing database (even more highly sensitive) that ultimately had to be refactored into a single database.
If you do this, then I suggest you at least wait until you know it's solving a real problem, and then test to measure how much difference it makes. There are lots of opportunities here for Unintended Consequences.