How to defragment/consolidate partially filled sqlite pages in order to reclaim space? - sqlite

I am designing a little server application with local data store and sqlite3 seems to be the way to manage the persistent data. But I am worried about malicious users who know the internal logic and might trick the server into creating (and subsequent deleting) lots of records, in a way where a few valid records remain in each data page. The database size might explode quite soon.
Following the documentation and recommendations like https://blogs.gnome.org/jnelson/2015/01/06/sqlite-vacuum-and-auto_vacuum/, it seems that even auto_vacuum=incremental would not help me in this scenario, because it is only effective for released pages, not for used pages with internal gaps (i.e. fragmentation).
Is there a good way to tell sqlite to consolidate such data on-the-fly?
VACUUM operation is not an option due to long-living global DB lock.

SQLite will automatically merge an almost-empty page with its neighbors, which helps reduce the kind of fragmentation you describe.
From an email by D. Richard Hipp on the SQLite mailing list:
Once a sufficient number of rows are removed from a page, and the free space on that page gets to be a substantial fraction of the total space for the page, then the page is merged with adjacent pages, freeing up a whole page for reuse. But as doing this reorganization is expensive, it is deferred until a lot of free space accumulates on the page. (The exact thresholds for when a rebalance occurs are written down some place, but they do not come immediately to my mind, as the whole mechanism just works and we haven't touched it in about 15 years.)
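Incremental vacuum still has a role alongside that automatic merging: once SQLite has rebalanced pages and moved the fully freed ones onto the freelist, PRAGMA incremental_vacuum can return them to the filesystem without the long global lock of a full VACUUM. A minimal sketch using Python's built-in sqlite3 module (the file name and the threshold are made-up values for illustration):

```python
import sqlite3

conn = sqlite3.connect("app.db")  # hypothetical database file

# auto_vacuum must be enabled before the first table is created, or the
# file must be rebuilt once with VACUUM for the setting to take effect.
conn.execute("PRAGMA auto_vacuum = INCREMENTAL")

# Pages freed by SQLite's automatic page merging accumulate on the freelist.
free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]

if free_pages > 100:  # arbitrary threshold; tune for your workload
    # Release up to 100 freelist pages back to the OS. This is a short
    # operation, unlike VACUUM, which rewrites the whole database.
    conn.execute("PRAGMA incremental_vacuum(100)")

conn.commit()
conn.close()
```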

Related

A fast way to store views on a page and when to save in database

I want to implement a views counter like most forums, YouTube, and several others have. So every time a user reads an article, that view is stored and remembered. I also want to know who looked at the article.
My question is: how do you implement this efficiently? What is the best practice?
One way would be to call a stored procedure for every view, but that would result in a lot of unneeded calls to the database.
Another way would be to store this to some global application object, and then store in DB every 5 minutes or so (and can you even do that in a good way?)
What's the best way to do this?
Database operations are surprisingly cheap and really are not worth worrying about. In the event that a DB operation is even marginally expensive, you can always delegate the blocking operation to a new thread, thus freeing up your page-generation thread (you can trivially do this for UPDATE and INSERT operations that return nothing from the database - they are inconsequential).
Sprocs aren't really in-fashion right now - the performance advantage they might have had from pre-computed execution plans is almost eliminated because modern servers cache plans from all previous queries, and for trivial SELECT, INSERT, and UPDATEs you begin to suffer from increased code complexity. There's nothing wrong with inline SQL commands now.
Anyway, back on-topic and in summary: your assumptions are wrong. There is nothing wrong with running UPDATE Pages SET ViewCount = ViewCount + 1 WHERE PageId = #pageId on every page view. There is also nothing wrong with doing this either: INSERT INTO UserPageviews (UserId, PageId, DateTime) VALUES ( #userId, #pageId, NOW() ). Both operations are very cheap and will execute in under 2-3 milliseconds even on an old and aged database server.
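For illustration, here is a minimal sketch of those two statements as parameterized queries. The table and column names follow the examples above; Python's sqlite3 module merely stands in for whatever data-access layer the application actually uses.

```python
import sqlite3

def record_page_view(conn: sqlite3.Connection, page_id: int, user_id: int) -> None:
    # Bump the per-page counter.
    conn.execute(
        "UPDATE Pages SET ViewCount = ViewCount + 1 WHERE PageId = ?",
        (page_id,),
    )
    # Remember who viewed the page and when (UTC, via SQLite's datetime()).
    conn.execute(
        "INSERT INTO UserPageviews (UserId, PageId, DateTime) "
        "VALUES (?, ?, datetime('now'))",
        (user_id, page_id),
    )
    conn.commit()
```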
Another way would be to store this to some global application object, and then store in DB every 5 minutes or so (and can you even do that in a good way?)
This method is very prone to data loss unless you use a durable queueing mechanism (like MSMQ). Unless you anticipate massive traffic, I wouldn't even think about this approach.
Writes of this nature are inexpensive and hundreds of operations per second are not a big deal. I recently built a comment/rating framework that achieves throughput of 3000+ complete transactions per second just on my local all-in-one workstation. This included processing the request, validation, and creating multiple records within a transaction.
As a note, you should take steps to ensure that your statistics data isn't vulnerable to artificial inflation/manipulation. This part of the process will probably be more complex than the view tracking itself. For example, a user should not be able to sit and hold down the F5 key and inflate the number of views on their video. Nor should these values be manipulable by HTTP (e.g. creating a small script to send an AJAX request over and over).
This suggests that each INSERT would be preceded by a SELECT to ensure that the same user ID or IP hadn't already been recorded in some period of time. Of course, this isn't foolproof (unless you invest a great deal of effort), but it errs on the side of conservatism which is usually a good approach.
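As a sketch of that guard, building on the record_page_view example above (the one-hour window and the table names are assumptions, not something prescribed by the original answer):

```python
def record_unique_view(conn: sqlite3.Connection, page_id: int, user_id: int) -> bool:
    # Only count the view if this user hasn't been recorded for this page
    # within the last hour.
    already_seen = conn.execute(
        """
        SELECT 1 FROM UserPageviews
        WHERE UserId = ? AND PageId = ?
          AND DateTime > datetime('now', '-1 hour')
        LIMIT 1
        """,
        (user_id, page_id),
    ).fetchone()
    if already_seen:
        return False
    record_page_view(conn, page_id, user_id)
    return True
```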
One way would be to call a stored procedure for every view, but that would result in a lot of unneeded calls to the database.
I regularly have to remind myself (and other developers) to not fear the database. People (me included) sometimes go to great lengths to avoid a few simple database calls. Keep your tables narrow and well-indexed, and operations like this are faster than you might think.

Better performance to Query the DB or Cache small result sets?

Say I need to populate 4 or 5 dropdowns w/ items from a database. Each drop down will have < 15 items in it. These items almost never change.
Now I could query the DB each time the page is accessed or I could grab the values from a custom class that would check to see if they already exist in ASP.Net's cache and only if they don't query the DB to update the cache.
It's trivial for me to write, but I'm unsure whether the performance would be better or not. I think it would be (although likely nothing huge).
What do you think?
When dealing with performance issues you should always:
Do things the simplest way first (avoid premature optimisation)
Performance test your code against set performance goals (e.g. 200ms response time under a load of N concurrent users)
Then, IF your code doesn't perform, profile it to determine what is slow, and profile your proposed performance fixes to accurately measure what the real-world change will be.
Having said that: yes, what you are suggesting seems sensible (you would usually expect an in-memory cache to be quicker than a database). However, it also depends on what data is being returned, what the memory load of your application is, how expensive the query is, what the query parameters are, and so on.
You should performance test your changes before and after to determine the actual effect of your changes (including things like memory load), and you should only really be doing things like this once you have identified that these dropdowns are the cause of an unacceptable performance problem.
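To make the pattern concrete, here is a minimal sketch of check-the-cache-then-query (written in Python only for brevity; ASP.NET's built-in cache plays this role in the question, and load_from_db is a hypothetical callable that runs the actual query):

```python
import time

_cache: dict[str, tuple[float, list]] = {}

def get_dropdown_items(key: str, load_from_db, ttl_seconds: int = 600) -> list:
    """Return cached items if fresh; otherwise query once and remember them."""
    now = time.monotonic()
    cached = _cache.get(key)
    if cached is not None and now - cached[0] < ttl_seconds:
        return cached[1]            # cache hit: no database round trip
    items = load_from_db(key)       # cache miss: one query, then cache it
    _cache[key] = (now, items)
    return items
```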
That's what the System.Web.Helpers.WebCache class exists for.
I/O is usually more expensive than memory operations (by orders of magnitude). Especially if your database is on another machine, you would also be consuming network resources, and it will definitely be faster to just use the cache.
But indeed, optimize at the end, once you have really identified it as a performance bottleneck by measuring.
Quick answer to your question:
Use the built in .Net cache.
Additional points to ponder:
Preferably, retrieve all master data in a single database call (think stored procedure and dataset), though I do not advocate the use of stored procs in all scenarios.
As you rightly said, ensure that your data access layer checks the cache before making a round trip to the database.
Also, as your drop-down values do not change very often, remember to set a long expiry duration.
Finally, based on your page design, you could also look at fragment caching (partial page caching via user controls), which could give you bigger benefits since then you access neither the data cache nor the database.
Performance:
Again, the performance depends more on the application's overall load than on the direct round trips for fetching the master data. Put simply: as Thomas suggested, use the cache class!

Oracle sequence cache aging too often

My ASP.NET application uses some sequences to generate table primary keys. DB administrators have set the cache size to 20. Now the application is under test and a few records are added daily (say 4 for each user test session).
I've found that new test-session records always use a new portion of the cache, as if the previous day's cached numbers had expired, losing tens of keys every day. I'd like to understand whether it's due to some mistake I might have made in my application (disposing of TableAdapters or whatever) or whether it's the usual behaviour. Are there programming best practices to take into account when handling Oracle sequences?
Since the application will not have to bear a heavy workload (say 20-40 new records a day), I was thinking it might make sense to set a smaller cache size, or none at all.
Does resizing the sequence cache imply resetting its current value?
Thank you in advance for any hints.
The answer from Justin Cave in this thread might be interesting for you:
http://forums.oracle.com/forums/thread.jspa?threadID=640623
In a nutshell: if the sequence is not accessed frequently enough but you have a lot of "traffic" in the library cache, then the sequence might be aged out and removed from the cache. In that case the pre-allocated values are lost.
If that happens very frequently to you, it seems that your sequence is not used very often.
I guess that reducing the cache size (or completely disabling it) will not have a noticeable impact on performance in your case (especially given your figure of 20-40 new records a day).
Oracle Sequences are not gap-free. Reducing the Cache size will reduce the gaps... but you will still have gaps.
The sequence is not associated with the table by the database, but by your code (via NEXTVAL on the insert, whether through a trigger, plain SQL, or a package API). On that note, you may use the same sequence over multiple tables (it is not like SQL Server's IDENTITY, which is tied to the column/table), so changing the sequence will have no impact on the indexes.
You would just need to make sure that if you drop the sequence and recreate it, you 'reseed' it to one past the current value (e.g. create sequence seqer start with 125 nocache;).
However:
If your application requires a gap-free set of numbers, then you cannot use Oracle sequences. You must serialize activities in the database using your own developed code.
but be forewarned, you may increase disk IO and possible transaction locking if you choose not to use sequences.
The sequence generator is useful in multiuser environments for generating unique numbers without the overhead of disk I/O or transaction locking.
To reiterate a_horse_with_no_name's comments: what is the issue with gaps in the IDs?
Edit
Also have a look at the caching logic described here:
http://download.oracle.com/docs/cd/E11882_01/server.112/e17120/views002.htm#i1007824
If you are using the sequence for PKs and not to enforce some application logic then you shouldn't worry about gaps. However, if there is some application logic tied to sequential sequence values, you will have holes if you use sequence caching and do not have a busy system. Sequence cache values can be aged out of the library cache.
You say that your system is not very busy, in this case alter your sequence to no cache. You are in a position of taking a negligible performance hit to fix a logic issue so you might as well.
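If you do decide to go that way, the change itself is a one-line DDL statement. A hedged sketch using the cx_Oracle driver (connection details and the sequence name are hypothetical; ALTER SEQUENCE keeps the current value, so existing keys are unaffected):

```python
import cx_Oracle  # or its successor, python-oracledb

# Hypothetical connection details.
conn = cx_Oracle.connect(user="app", password="secret", dsn="dbhost/orclpdb1")
cur = conn.cursor()

# Drop the cache on a low-traffic sequence so pre-allocated values can no
# longer be aged out of the library cache and lost.
cur.execute("ALTER SEQUENCE leads_seq NOCACHE")

cur.close()
conn.close()
```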
As people mentioned: Gaps shouldn't be a problem, so if you are requiring no gaps you are doing something wrong. (But I don't think this is what you want).
Reducing the cache should reduce the number of gaps, but it will also decrease the performance of the sequence, especially under concurrent access (which shouldn't be a problem in your use case).
Changing the sequence using the alter sequence statement (http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_2011.htm) should not reset the current/next val of the sequence.

ASP.NET/SQL 2008 Performance issue

We've developed a system with a search screen that looks a little something like this:
[screenshot of the search screen omitted; source: nsourceservices.com]
As you can see, there is some fairly serious search functionality. You can use any combination of statuses, channels, languages, campaign types, and then narrow it down by name and so on as well.
Then, once you've searched and the leads pop up at the bottom, you can sort the headers.
The query uses ROWNUM to do a paging scheme, so we only return something like 70 rows at a time.
The Problem
Even though we're only returning 70 rows, an awful lot of IO and sorting is going on. This makes sense of course.
This has always caused some minor spikes to the Disk Queue. It started slowing down more when we hit 3 million leads, and now that we're getting closer to 5, the Disk Queue pegs for up to a second or two straight sometimes.
That would actually still be workable, but this system has another area with a time-sensitive process, let's say for simplicity that it's a web service, that needs to serve up responses very quickly or it will cause a timeout on the other end. The Disk Queue spikes are causing that part to bog down, which is causing timeouts downstream. The end result is actually dropped phone calls in our automated VoiceXML-based IVR, and that's very bad for us.
What We've Tried
We've tried:
Maintenance tasks that reduce the number of leads in the system to the bare minimum.
Added the obvious indexes to help.
Ran the index tuning wizard in profiler and applied most of its suggestions. One of them was going to more or less reproduce the entire table inside an index so I tweaked it by hand to do a bit less than that.
Added more RAM to the server. It was a little low, but now it always has something like 8 gigs idle, and SQL Server is configured to use no more than 8 gigs; however, it never uses more than 2 or 3. I found that odd. Why isn't it just putting the whole table in RAM? It's only 5 million leads and there's plenty of room.
Pored over query execution plans. I can see that at this point the indexes seem to be mostly doing their job -- about 90% of the work is happening during the sorting stage.
Considered partitioning the Leads table out to a different physical drive, but we don't have the resources for that, and it seems like it shouldn't be necessary.
In Closing...
Part of me feels like the server should be able to handle this. Five million records is not so many given the power of that server, which is a decent quad core with 16 gigs of ram. However, I can see how the sorting part is causing millions of rows to be touched just to return a handful.
So what have you done in situations like this? My instinct is that we should maybe slash some functionality, but if there's a way to keep this intact that will save me a war with the business unit.
Thanks in advance!
Database bottlenecks can frequently be improved by improving your SQL queries. Without knowing what those look like, consider creating an operational data store or a data warehouse that you populate on a scheduled basis.
Sometimes flattening out your complex relational databases is the way to go. It can make queries run significantly faster, and make it a lot easier to optimize your queries, since the model is very flat. That may also make it easier to determine if you need to scale your database server up or out. A capacity and growth analysis may help to make that call.
Transactional/highly normalized databases are not usually as scalable as an ODS or data warehouse.
Edit: Your ORM may have optimizations as well that it may support, that may be worth looking into, rather than just looking into how to optimize the queries that it's sending to your database. Perhaps bypassing your ORM altogether for the reports could be one way to have full control over your queries in order to gain better performance.
Consider how your ORM is creating the queries.
If you're having poor search performance perhaps you could try using stored procedures to return your results and, if necessary, multiple stored procedures specifically tailored to which search criteria are in use.
Determine which ad-hoc queries will most likely be run, or limit the search criteria with stored procedures (see the paging sketch after this list). Can you summarize data? Treat this app like a data warehouse.
Create indexes on each column involved in the search to avoid table scans.
Create fragments on expressions.
Periodically reorganize the data and update statistics as more leads are loaded.
Put the temporary files created by queries (result sets) on a ramdisk.
Consider migrating to a high-performance RDBMS engine such as Informix OnLine.
Initiate another thread to start displaying N rows from the result set while the query continues to execute.
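For reference, a hedged sketch of the ROW_NUMBER()-style paging the question describes, keeping the sort on an indexed column so the server does not have to sort the whole table for every page. The connection string, table, and column names are assumptions; pyodbc stands in for the application's actual data access.

```python
import pyodbc

def fetch_leads_page(conn_str: str, page: int, page_size: int = 70):
    # Number the rows by an indexed sort key, then pull back one page.
    sql = """
        SELECT LeadId, Name, Status, CreatedDate
        FROM (
            SELECT LeadId, Name, Status, CreatedDate,
                   ROW_NUMBER() OVER (ORDER BY CreatedDate DESC, LeadId) AS rn
            FROM Leads
        ) AS numbered
        WHERE rn BETWEEN ? AND ?
        ORDER BY rn
    """
    first = (page - 1) * page_size + 1
    last = page * page_size
    conn = pyodbc.connect(conn_str)
    try:
        return conn.cursor().execute(sql, first, last).fetchall()
    finally:
        conn.close()
```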

One massive instance of an app, or many medium-sized ones?

A web application we wrote intended for one customer is going to be product-ized and sold to dozens of companies, and we will be doing the hosting.
I could use some guidance about the pros and cons of rolling out a separate instance for each customer versus going with a single (or very small number of) multi-tenant instances.
At first, as we ramp up, I will have to roll out a separate instance of the application for each new customer (they will come online one at a time) because it's the only immediate option. I imagine this won't scale very well as far as maintenance goes - rolling out changes will become very tedious and possibly error-prone once there are more than 4 or 5 instances out there, unless we automate that somehow.
Also, the single-instance philosophy seems like it might lead to a bunch of forks if people need customizations. And it would be nice to avoid that.
So what has your experience been with this?
Bonus question #1: What's the performance difference between 10 SQL Servers with 2m records each versus one huge one with 20m? Let's say they are all in one table and we're mainly doing inserts and selects on single records. Sometimes the selects are on an indexed varchar(12) or date field.
Bonus Question #2: I imagine that to avoid forking, we would have to make the customizations configurable, or build a plug-in architecture. However, that might increase the cost of doing customizations, and I don't want to be one of those shops that takes a week to resize a textbox, and I don't want to over-invest in infrastructure. Any thoughts on that?
Scale Details
Each customer will have a decent amount of data -- up to a few million records.
There will be a very small number of concurrent users, only a few per customer, plus a handful of internal reps on our end.
It's unclear whether each customer will require customizations, but I would say some of them probably will, and maybe some of those changes will be things that other customers will not want to see.
When faced with a similar challenge, here's what we did:
We have one code base with multiple SQL Servers. We maintain multiple IIS servers with copies of the same code base, and we are free to move clients from SQL Server to SQL Server to maximize performance.
If a customer has the $ for it, we will install them on their own server and maintain a separate IIS server for them. This accommodates the largest customers, who pay much more money every month (10-fold more). We do not, however, give them a separate code base. If they need a mod, we make it visible on a per-client basis (see #3).
Custom programming usually results in a configurable option. Even the people who pay us to have their own server get the same version of the code. Sometimes it's as simple as a clause in the code that says "if the customer is ourbigcustomer, then turn on this option". Yes, that's kludgy hard-coding, but if the customer has enough money, that is fine with me.
I didn't quite get from your question whether you wanted to mix different customers' data into one big database. Our rule is that we never do that (never ever). It is one of the wisest choices we ever made: it makes data manipulation much less risky and data restores easier. (A rough sketch of this per-customer setup follows this answer.)
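A rough sketch of that arrangement, reduced to its two moving parts: a per-customer connection string (so data is never mixed) and per-customer feature toggles instead of code forks. All names and connection strings here are hypothetical.

```python
# One database per customer; the biggest customers get their own server.
TENANT_DATABASES = {
    "acme":    "Server=sql01;Database=app_acme;Trusted_Connection=yes;",
    "bigcorp": "Server=sql03;Database=app_bigcorp;Trusted_Connection=yes;",
}

# Paid customizations are configuration, not separate code bases.
TENANT_FEATURES = {
    "acme":    set(),
    "bigcorp": {"custom_export", "sso"},
}

def connection_string_for(tenant: str) -> str:
    return TENANT_DATABASES[tenant]

def feature_enabled(tenant: str, feature: str) -> bool:
    return feature in TENANT_FEATURES.get(tenant, set())
```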
I don't see a good reason for either of your two options. I think the real answer lies somewhere in the middle: having multiple instances, each hosting multiple clients.
This adds another layer of automation processing, but it means you can keep the hosting cheap (you won't need to go out and buy a Cray any time soon) and (hopefully) this sort of mentality means you could do failover backups fairly easily.
But let's not get ahead of ourselves... We're talking about a webapp, right? Get your database(s) and ASP.NET on different machines. Cluster your databases and you'll have a much happier time playing around with various front-end scenarios. You'll also be able to upscale whichever area runs out of puff first.
By the sounds of it, you'll end up with one clustered database spread over half a dozen (if not a full dozen) database machines and only a couple of front-end boxes.
As for customisations, you've nailed it. You either provide a completely database-hosted set of editable templates, or you have to customise whole instances. I'm all for the first. It's a lot of work (without much immediate return) but it's well worth it, as you should only need to change the core code when you do upgrades (and you will!). Hunting through a hundred customers' custom instances to make sure they upgrade safely will kill a developer! Templates are the answer. At the very least, you could allow custom CSS without much pain (but they'd need somebody who knew their stuff).
Edit: I've seen a couple of posts going for the all-in-one method. Splitting the instances over multiple machines insulates you from a couple of things:
If you introduce a bug not caught in testing, only a few clients are affected at once
Hardware fails. Having one mega-server fall over will annoy a lot of people at once. Having a failover mega-server is massively expensive. Having a spare failover box per three or four running servers is much cheaper and annoys fewer people.
Performance can be balanced between boxes on a client-by-client basis, so you can put a few light-use clients with a heavy client, or just fill a box with a few medium-use clients, etc.
On the same idea, usage spikes or other slowdowns only affect clients on the same box. Of course this doesn't mean the same for the database, but you can split that up into a cluster of clusters when you get there.
The big advantage of individual instances will be scaling out as each customer's demand increases. For example, if you're running on a single server and one customer suddenly needs more performance, you're stuffed. But if they're all individual, then moving that customer to a shiny new server is relatively easy.
The big disadvantage will be in managing the instances all individually. (regardless of whether they're all running on the same server or not).
Regardless, you should only ever have one instance of the codebase, and customisation should all be controlled through plugins and configuration. The front end should naturally be separate from content. Although the cost of making a change may be higher, the benefit in terms of features you can offer your other customers (which will just be customisations you've been asked to do) will pay off, I'm sure. And that's to say nothing of how much easier it'll be to manage a single codebase, as opposed to several.
I would strongly advise going with the single instance hosted by your company. This has the following advantages:
You have physical access to all code and databases to make changes and updates.
You control the quality of the hardware it is running on.
When you fix a bug in common code, you have fixed it once for all customers.
You can refactor the application design to better support customer-specific code and avoid forking.
As the number of customers grows, you can scale up and scale out your servers to meet performance/responsiveness requirements.
Your application code and databases cannot be tampered with by "inquisitive" customers.
I would have to say it is almost more important where your application is running as opposed to how many separate instances there are of it.
Sure, maintaining multiple separate instances is not ideal due to the support/maintenance overhead, but if these apps are all on servers you control, life is much easier than needing remote/physical access to different customers' networks and servers.
Joel Spolsky also talks about exactly this on StackOverflow podcast 67.
One thing Joel has learned from selling Fogbugz: software designed to be installed on a server in-house at a customer's site, under full control of that customer, is almost never worth the hassle.
20 million records is, relatively speaking, not a huge SQL Server database. A single well-provisioned SQL Server could handle this size comfortably. More important, however, is the number of concurrent accesses to the database. You say that there will be only a few users per customer, so this is unlikely to hit you until the level of concurrency grows.
All of the above are good points, but you are missing two key questions: what price point is the service offered at, and how many customers (order of magnitude) will you ultimately have to support (i.e. market size)? In 3 years will you have a maximum of 10 customers each paying you $500,000 per year, or 500 customers each paying you $10,000 per year? For a small set of high-paying premium customers the advantages of individual deployments are clear, whereas lower prices and larger customer bases demand a shared solution (a la Oli's comment). Or go with a cloud platform, although I've only read the hype and tinkered rather than deployed that in the field.
Bonus Question 1: table layout, indexing, the number of reads/writes, and the efficiency and complexity of stored procedures (you are using procs or at least prepared statements, right?) all matter a heck of a lot more than the number of physical records in the database, up to a point. Beyond that you will likely find yourself needing to provide individual SQL Server instances for each customer or for a pool of customers, once again depending on some of the questions I raised above.
Bonus Question 2: Putting the time into your design for templating and a plugin architecture is essential in this situation and you need to do it sooner rather than later. Once you're in the grind of customizing code for paying customers you will likely not have the time to do it right. This point cannot be stressed enough. Templates and admin tools that give you quick and deep access to data-driven changes in your product will save you a lot of time down the road. As your company / group expands you can then add less technical staff that can be "product experts" who can perform 90% of customizations and maintenance, freeing up your core to continue development or move on to other projects. Finally, don't neglect your data tier in this planning process. Having a core data tier of (almost) immutable stored procs and tables is very important, with custom tables and stored procs clearly demarcated using a good naming convention.
Good luck, feel free to provide more details if you'd like more specific suggestions.
Based on some of the advice received here, we did end up implementing a monolithic multi-tenant version of our application.
I'm glad we did. By the time it was done, we had 3 or 4 forks of the code base (mainly custom skins and things we didn't have n-level support for, but also some actual features), and it was only getting crazier.
We got the multi-tenant version up and successfully folded everything in. There ended up being a lot to think about and a lot to keep track of, but our customers never even knew they had been moved to a new system.
I will say that the actual customer migration was a bit of a bear. I thought at first that we would be able to do it by hand in the backend, but I ended up having to write some fairly involved scripts to get the job done. There were just too many identity columns, and it's not like you can just turn off constraints temporarily when you're importing into a live production system.
