offset/limit performance optimization - asp.net

I have a table with structure like :
Id (serial int) (index on this)
Post (text)
...
CreationDate (DateTime) (Desc index on this)
I need to implement pagination. My simple query looks like :
SELECT Id, Post, etc FROM Posts ORDER BY CreationDate desc OFFSET x LIMIT 15
When there are few records (below 1 million) performance is somewhat bearable, but when the table grows there is a noticeable difference.
Setting aside the fact that it is good to configure DB settings like cache size, work memory, cost, shared memory, etc., what can be done to improve the performance, and what are the best practices for pagination with Postgres? There is something similar asked here, but I am not sure whether it can be applied in my case too.
Since my Id is auto-incremented (so predictable), one of the other options I was thinking of is to have something like this:
SELECT Id, Post...FROM Posts WHERE Id > x and Id < y
But this seems to complicate things: I have to get the count of records all the time, and besides, it is not guaranteed that I will always get 15 records (for example, if one of the posts has been deleted, the Ids are no longer in a "straight" sequence).
I was thinking about a CURSOR too, but if I am not mistaken a CURSOR will keep the connection open, which is not acceptable in my case.

Pagination is hard; the RDBMS model isn't well suited to large numbers of short-lived queries with stateful scrolling. As you noted, resource use tends to be too high.
You have these options:
LIMIT and OFFSET
Using a cursor
Copying the results to a temporary table or into memcached or similar, then reading it from there
x > id and LIMIT
Of these, I prefer x > id with a LIMIT. Just remember the last ID you saw and ask for the next one. If you have a monotonically increasing sequence this will be simple, reliable, and, for simple queries, efficient.
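A minimal sketch of that in C# with Npgsql (table and column names come from the question; the driver choice, method name, and page size parameter are illustrative, and it assumes the serial Id increases together with CreationDate so ordering by Id stands in for ordering by CreationDate):
using System.Collections.Generic;
using System.Threading.Tasks;
using Npgsql;

async Task<List<(int Id, string Post)>> GetNextPageAsync(
    NpgsqlConnection conn, int lastSeenId, int pageSize = 15)
{
    const string sql = @"
        SELECT Id, Post
        FROM Posts
        WHERE Id < @lastSeenId   -- everything older than the last row already shown
        ORDER BY Id DESC         -- newest first
        LIMIT @pageSize";

    var page = new List<(int, string)>();
    using var cmd = new NpgsqlCommand(sql, conn);
    cmd.Parameters.AddWithValue("lastSeenId", lastSeenId);
    cmd.Parameters.AddWithValue("pageSize", pageSize);

    using var reader = await cmd.ExecuteReaderAsync();
    while (await reader.ReadAsync())
        page.Add((reader.GetInt32(0), reader.GetString(1)));

    // Remember the Id of the last item in this page and pass it back
    // as lastSeenId for the next request.
    return page;
}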

Storing and querying for announcements between two datetimes

Background
I have to design a table to store announcements in DynamoDB. Each announcement has the following structure:
{
"announcementId": "(For the frontend to identify an announcement to the backend)",
"author": "(id of author)",
"displayStartDatetime": "",
"displayEndDatetime": "",
"title": "",
"description": "",
"image": "(A url to an image)",
"link": "(A single url to another page)"
}
As we are still designing the table, alterations to the structure are permitted. In particular, announcementId, displayStartDatetime and displayEndDatetime can be changed.
The main access pattern is to find the current announcements. Users have a webpage on which they can see all current announcements and their details.
Every announcement has a date for when to start showing it (displayStartDatetime) and when to stop showing it (displayEndDatetime). The announcement should still be kept in the table after the current datetime passes displayEndDatetime, for reference for admins.
The start and end datetime are precise to the minute.
Problem
Ideally, I would like a way to query the table for all the current announcements in one query.
However, I have come to the conclusion that it is impossible to fuse two datetimes in one sort key because it is impossible to order two pieces of data of equal importance (e.g. storing the timestamps as a string will mean one will be more important/greater than the other).
Hence, as a compromise, I would like to sort the table values by displayEndDatetime so that I can filter out past announcements. This is because, as time goes on, there will be more past announcements than future announcements, so it will be more beneficial to optimise that.
Compromised Solution
Currently, my (not very good) solutions are:
Use one "hot" partition key and use the displayEndDatetime as the sort key.
This allows me to filter out past announcements, but it also means that all the data is in a single partition. I could run a scheduled job every now and then to move the past announcements to different, spread-out partitions.
Scan through the table
I believe Scan will look at every item in the table before it performs any filtering. This solution doesn't seem as good as 1. but it would be the simplest to implement and it would allow me to keep announcementId as the partition key.
Scan a GSI of the table
Since Scan will look through every item, it may be more efficient to create a GSI (announcementId (PK), displayEndDatetime (SK)) and scan through that to retrieve all the announcementIds which have not passed. After that, another request could be made to get all the announcements.
Question
What is the most optimised solution for storing all announcements and then finding current announcements when using DynamoDB?
Although I have listed a few possible solutions for sorting the displayEndDatetime, the main point is still finding announcements between the start and end datetime.
Edit
Here are the answers to #tugberk's questions on the background:
What is the rate of writes you anticipate receiving (i.e. peak writes per second you need to handle)?
I am uncertain of how the admins will use this system; announcements can be very regular (about 3/day) or very infrequent (about 3/month).
How much new data do you anticipate storing daily, and how do you think this will grow?
As mentioned above, this could be about 3 announcements a day or 3 a month. This is likely to remain the same for as long as I should be concerned about.
What is the rate of reads (e.g. peak reads per second)?
I would expect the peak reads per second to be around 500-1000 reads/s. This number is expected to grow as there are more users.
How many announcements a user can see at a time (i.e. what's the avg/max number of announcements that will be visible at any point in time)? Practically thinking, this shouldn't be more than a few (e.g. 10-20 at most).
I would expect the maximum number of viewable announcements to be up to 30-40. This is because there could be multiple long-running announcements along with short-term announcements. On average, I would expect about 5-10 announcements.
What is the data inconsistency gap you are happy to have here (i.e. do you need seconds level precision, or would you be happy to have ~1min delay on displaying and hiding announcements)?
I think the speed with which an announcement starts showing is important, especially if the admins decide that this is a good platform for urgent announcements (likely urgent to the minute). However, when it stops showing is less important; to avoid confusing the users, the announcement should stop displaying at most 4 hours after it has passed its display end datetime.
This type of question is always hard to answer here, as there are so many assumptions in the answer and it's really hard to have all the facts. But I will try to give you some ideas, which may help you think about your data storage choice as well as give you further options.
I know what I am doing, and really need to use DynamoDB
Edited this answer based on the OP's answers to my original questions.
As you really need to use DynamoDB for this for internal reasons, I think it's more suitable to store the data in two DynamoDB tables, one serving reads and one serving writes, as nearly all access patterns I can think of will hit multiple partitions if you have one table. You can get away with a GSI, but it's not too straightforward how to do it, and I am not sure whether there is any advantage to doing it that way.
The core thing you need to optimize for is the reads, as you mentioned they can go up to 2K rps, which is big enough to make this the part you optimize your architecture for. Based on your assumption of having 3 announcements a day, the writes are nothing to worry about.
General idea is this:
I would consider using one DynamoDB table to handle writes, where you can configure the author identifier as the partition key and the announcement identifier as the sort key (making your primary key the combination of both). This will allow you to query all the announcements for a given author easily.
I would also have a second DynamoDB table to handle reads, where you only store active announcements, which your application can query and retrieve in full with a Scan (i.e. O(N)). That is not a concern since, as you mentioned, there will only be 30-40 active announcements at any point in time; even if this were 500, you would still be OK with this structure. In terms of partition and sort key, I would just have an active boolean field as the partition key, which you always set to true, have the announcement id as the sort key, and make the combination of both the primary key. If you care about the ordering of these announcements, you can adjust the sort key accordingly, but make sure it's unique (i.e. consider concatenating the announcement identifier, e.g. {displayBeginDatetime-in-yyyyMMddHHmmss-format}-{announcementId}). This way you guarantee that you will only hit one partition. However, you can actually simplify this and have the announcement identifier as the partition key and primary key, as I am nearly sure that DynamoDB will store all your data in one partition since it's going to be so small. Better to confirm this, though, as I am not 100% sure. The point here is that you are much better off ensuring you hit one partition with this query.
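As a hedged illustration of that single-partition read path with the AWS SDK for .NET: the table name, the sortKey attribute, and storing active as the string "true" (DynamoDB key attributes can only be string, number, or binary) are my own assumptions, not part of the design above.
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

async Task<List<Dictionary<string, AttributeValue>>> GetActiveAnnouncementsAsync(
    IAmazonDynamoDB client)
{
    var request = new QueryRequest
    {
        TableName = "Announcements-Read",          // assumed read-table name
        // Every active announcement sits under the same partition key value,
        // so this Query hits exactly one partition.
        KeyConditionExpression = "active = :active",
        ExpressionAttributeValues = new Dictionary<string, AttributeValue>
        {
            [":active"] = new AttributeValue { S = "true" }
        },
        // sortKey is assumed to be "{displayStartDatetime:yyyyMMddHHmmss}-{announcementId}",
        // so results come back ordered by display start time.
        ScanIndexForward = true
    };

    var response = await client.QueryAsync(request);
    return response.Items; // expected to be only a few dozen items
}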
Here is how this may work, where there are some edge cases I am overlooking:
Record the write inside the first DynamoDB table for an announcement. When an announcement is written, configure displayEndDatetime as the TTL of that row, with the assumption that you don't need this record in this table once the announcement expires.
Have a job running every N minutes (one or more, depending on the data inconsistency gap you can handle), which will Scan the entire first DynamoDB table across partitions (do it in a paginated way) and decide which announcements are currently visible. Then write the data into the second DynamoDB table, which handles the reads, in the structure we established above, so that your consumer can read from it without worrying about any filtering, as the data is already filtered (e.g. all the announcements here are visible ones). Note that Scan is fine here as you are running this once every N minutes, with the assumption that you are OK with at least 1 minute + processing time of data inconsistency gap. I would suggest running this every 10 minutes or so if you don't have strong data consistency requirements.
On the read storage system, also configure displayEndDatetime as the TTL for the row so that it gets automatically deleted.
Configure DynamoDB Streams on the first DynamoDB table (which has 24-hour retention and an exactly-once delivery guarantee) and have a Lambda consumer of this stream to handle item deletions (which will happen when the TTL kicks in for a particular row), keeping a record of these announcements somewhere else for longer retention and exposing them through a different access pattern (e.g. show all the announcements per author so that they can re-enable old announcements), as you mentioned in your question. You can configure Lambda event sourcing with DynamoDB Streams, which will allow you to handle failures with retries, etc. Make sure that the logic in these Lambdas is idempotent so that you can retry safely.
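A hedged sketch of the periodic job described above, using the AWS SDK for .NET; the table names, attribute names, and the yyyyMMddHHmmss datetime format are assumptions for illustration only, not part of the original design:
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

async Task RefreshReadTableAsync(IAmazonDynamoDB client)
{
    string now = DateTime.UtcNow.ToString("yyyyMMddHHmmss");
    Dictionary<string, AttributeValue> startKey = null;

    do
    {
        // Paginated Scan across the writes table; acceptable here because it
        // only runs once every N minutes, not per user request.
        var scan = new ScanRequest
        {
            TableName = "Announcements-Writes",   // assumed writes-table name
            FilterExpression = "displayStartDatetime <= :now AND displayEndDatetime >= :now",
            ExpressionAttributeValues = new Dictionary<string, AttributeValue>
            {
                [":now"] = new AttributeValue { S = now }
            },
            ExclusiveStartKey = startKey
        };

        var page = await client.ScanAsync(scan);

        foreach (var item in page.Items)
        {
            // Copy each currently visible announcement into the read table,
            // under the single "active" partition described earlier.
            await client.PutItemAsync(new PutItemRequest
            {
                TableName = "Announcements-Read",
                Item = new Dictionary<string, AttributeValue>(item)
                {
                    ["active"] = new AttributeValue { S = "true" },
                    ["sortKey"] = new AttributeValue
                    {
                        S = $"{item["displayStartDatetime"].S}-{item["announcementId"].S}"
                    }
                }
            });
        }

        startKey = page.LastEvaluatedKey;
    } while (startKey != null && startKey.Count > 0);
}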
The sections below are from my original answer; they are still relevant to anyone trying to achieve the same thing, so I will leave them here, but they are less relevant now that the OP needs to use DynamoDB.
Why DynamoDB?
First of all, I would question why you need DynamoDB for this, as it seems your requirements are more read-heavy than write-heavy, and write-heavy workloads are where I think DynamoDB shines the most due to its out-of-the-box partitioned nature.
The questions below would help you understand whether you really need DynamoDB for this, or whether you can get away with a more flexible data storage system:
what is the rate of writes you anticipate receiving (i.e. peak writes per second you need to handle)?
how much new data do you anticipate storing daily, and how do you think this will grow?
what is the rate of reads (e.g. peak reads per second)?
How many announcements a user can see at a time (i.e. what's the avg/max number of announcements that will be visible at any point in time)? Practically thinking, this shouldn't be more than a few (e.g. 10-20 at most). This will help you understand whether you will be OK pulling all the visible announcements in one go, or whether you need a pagination system.
What is the data inconsistency gap you are happy to have here (i.e. do you need seconds level precision, or would you be happy to have ~1min delay on displaying and hiding announcements)?
Actually, I don't need DynamoDB
Based on my assumptions about your consumption and admin needs for this use case, I believe you don't need DynamoDB for this, assuming you don't have a high number of writes (which might be wrong); if these assumptions are correct, the above is a super over-engineered solution for you. If they are correct, I think you are better off using PostgreSQL for this, which gives you the ability to change your access pattern with further indexing as you see fit, and for your current access pattern you can have a range query over the start and end times.

Get last N records in a DynamoDB table

Is there any way to get the last N records from a DynamoDB table? The range key I have is the timestamp, so I could use ScanIndexForward to order items chronologically.
But in order to query I need to have a hashKey condition, which I don't want to filter by. Any thoughts?
DynamoDB is not designed to work this way. The items are distributed according to a hash on the HashKey in such a way that the order is not predictable.
Your options include:
grouping the items under a single hash key (not recommended: you would overload a few servers with your data, and Amazon cannot guarantee your read/write capacity)
scanning the whole table and keeping the N most recent items (something like for (item in items) { if (item is newer than the oldest accumulated item) accumulate item; })
partitioning your table into multiple tables (i.e., instead of a table called Events, create one called Events20130705 for today's events, Events20130706 for tomorrow's events), and scanning just like the previous option -- this way your scans are smaller
You could also maybe change your data model. For example, you could have one versioned entry that keeps references to the N most recent items. Or you could have something like a single counter that you'd increment, and update N other entries under hash keys such as recent-K, where K is your counter mod N.
Maybe you could even use another tool for this job. For instance, you could have a Redis server do this. Without knowing your use case in much more detail, it is hard to make a precise suggestion -- how scalable should this be? How reliable should it be? How much maintenance are you willing to perform? How much are you willing to pay for it?
It's usually better to embrace the limitation, know your constraints and be creative.
I'm not sure this is still relevant. I'm fairly sure you can use ScanIndexForward along with a rangeKey to get the latest value.
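For what it's worth, here is a hedged sketch of that Query with the AWS SDK for .NET. It assumes the "single hash key" layout from the first option above (one fixed partition key value grouping the items, with the timestamp as the range key); the table name, key names, and grouping value are made up for illustration.
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

async Task<List<Dictionary<string, AttributeValue>>> GetLastNAsync(
    IAmazonDynamoDB client, string tableName, int n)
{
    var request = new QueryRequest
    {
        TableName = tableName,
        KeyConditionExpression = "partitionKey = :pk",    // hypothetical fixed hash key
        ExpressionAttributeValues = new Dictionary<string, AttributeValue>
        {
            [":pk"] = new AttributeValue { S = "events" }  // the single grouping value
        },
        ScanIndexForward = false, // newest first (descending by the range key)
        Limit = n                 // only the last N items
    };

    var response = await client.QueryAsync(request);
    return response.Items;
}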

LINQ to entities performance regarding where clause

Let's say I have a table in a database with 10K records. I don't need to actually use those 10K records anymore, but I still need to keep them in the database. That very table is now going to be used to store new data, so there are going to be more records coming in on top of the 10K already present in the table. As opposed to the "old" 10K records, I do need to work with the newly inserted data. Right now I'm doing this to get the data I need:
List<Stuff> l = (from x in db.Table
where x.id > id
select x).ToList();
My question now is: how does the where clause in LINQ (or in SQL in general) work under the covers? Is the ENTIRE table going to be searched until (x.id > id) is true? Because let's say the table increases from 10K records to 20K. It'd be a little silly to look through the entire 20K records if I know that I only have to start looking from a certain point.
I've had performance problems (not dramatic, but bad enough to be agitated by it) with this while using LINQ to Entities, which I kind of don't understand, because it should be no problem at all for a modern computer to sift through a mere 20K records. I've been advised to use a stored procedure instead of a LINQ query, but I don't know whether or not this will boost performance.
Any feedback will be appreciated.
It's going to behave just like a similarly worded SQL query would. The question is whether the overhead you're experiencing is happening in the query or in the conversion of the query to a list. The query itself as you've written should equate literally to:
Select ID, Column1, Column2, Column3, ... , Column(n+1)
From db.Table
Where ID > id
This query should be fairly fast depending on the nature of the data. The query itself will not be executed until it is acted upon, however. In this case, you're converting it to a list, which is the equivalent of acting upon it. I can't find the comment someone made to me about this practice, but I've found it to be quite helpful in keeping performance clean. Unless you have some very specific need, you should leave your queries as IQueryable. Converting them to lists doubles the effort, because first the query must be executed and then the result set must be converted into an appropriate IEnumerable (a List in this case).
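As a small illustration of the IQueryable point (the type and member names Stuff, db.Table, and id come from the question; everything else is hypothetical):
using System.Linq;

// The filter is only composed here; no SQL has been sent to the database yet.
IQueryable<Stuff> newRecords = db.Table.Where(x => x.id > id);

// Still deferred; we can keep composing, e.g. ordering and limiting.
IQueryable<Stuff> firstPage = newRecords.OrderBy(x => x.id).Take(100);

// The query runs here, once, and only the rows that are needed are materialized.
foreach (Stuff s in firstPage)
{
    Process(s); // hypothetical consumer of each record
}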
So you have two potential bottlenecks. The simple query could be taking a long time to query a massive collection of data, or the number of records could be bottlenecking at the point where the List is created. Another possibility is the nature of ID in this case: if it is numeric, that will save you some time; if it's performing a text-based search, then it's going to be heavier.
To answer your specific question: yes, it's going to search every record in the database and return all of the records that match the expression. Edit: if the database has a proper index on the column in question, it will not search EVERY record but rather will use the index to perform the search (from a comment by #Pleun).
As for the idea that a stored procedure will automatically boost performance, that's a load of hogwash, but a stored procedure is a perfectly acceptable alternative. I have several programs that routinely run similar queries against a database with over 40 million records, and the only performance issue I've run into so far has been CPU usage when multiple users are performing rapid-fire queries. To solve your specific issue, I'd recommend that you tune it a little in SQL Management Studio until the query you want returns to your interface with an acceptable speed. Then you can convert that query into a compatible LINQ statement. As long as you leave it as an IQueryable it should exhibit similar results.

Reindexing a large SQL Server database to Lucene

We have a web service method which accepts some data and puts it in a Lucene index. We use it to index new and updated entries from our asp.net web app.
These entries are stored in a large SQL Server table (20M rows and growing), and I need a way to reindex the whole table in case the current index gets deleted or corrupted. I'm not sure what the optimal way to retrieve chunks of data from a large table is. Currently, we use the fact that the table has a PK which is auto-increment, so we get chunks of 1000 rows until it starts to return nothing. Kind of like (in pseudo language):
i = 0
while (true)
{
SELECT col1, col2, col3 FROM mytable WHERE pk between i and i + 1000
.... if result is empty 20 times in a row, break ....
.... otherwise send result to web service to reindex ....
i = i + 1000
}
This way, we don't need to SELECT COUNT(*), which would be a big performance killer, and we just move up the PK values until we stop getting any results. This has its con: if we have a hole greater than 20,000 values somewhere in the table, it will stop indexing assuming it reached the end, but that's a tradeoff we have to live with for now.
Can anyone suggest a more efficient way of getting data from a table to index? I would assume we are not the first ones facing this problem - search engines are widely used nowadays :)
For what we do with Lucene, we rarely need to reindex everything. I can't remember coming across any case when the whole index was corrupted (Lucene is actually quite safe/good at this), but there have been many times when individual items needed to be reindexed for one reason or another. I'd say the most frequent reindexing patterns are:
reindex items by given id (or set of ids)
reindex items by given period of time
The latter, of course, requires a separate db index on the relevant date field(s), which could be a bit costly for 20M+ records, but we decided to go for it (our biggest deployment had up to 10M records), as disk space is cheap these days anyway.
EDIT: added a few explanations as per the question author's comment.
If the source data structure changes, requiring reindexing of all records, our approach is to roll out new code which ensures all new data is correct (basically, it forms the correct Lucene Document from that moment on). Then we can reindex things in batches (either manually or automatically) by providing the relevant period ranges. This, to a certain extent, also applies to Lucene version changes.
Why is a COUNT(*) a performance killer? What about MAX(id)? I'm thinking that an index would provide the information needed for those queries. You do have an index on your primary key, right?
I actually just figured it out - I can use IDENT_CURRENT(table_name) to get the last generated id, and use that instead of MAX() or Count() - this method should blow the other two away :)
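Putting that together, a hedged ADO.NET sketch of the chunked loop with IDENT_CURRENT as the upper bound might look like this; the table and column names are the placeholders from the question, and the delegate for handing a chunk to the web service is invented for illustration:
using System;
using System.Data.SqlClient;

void ReindexAll(string connectionString, Action<SqlDataReader> sendChunkToIndexer)
{
    const int chunkSize = 1000;

    using var conn = new SqlConnection(connectionString);
    conn.Open();

    // Upper bound: the last identity value generated for mytable, so we can
    // stop at a known point instead of probing for 20 empty chunks in a row.
    long maxId;
    using (var maxCmd = new SqlCommand("SELECT CAST(IDENT_CURRENT('mytable') AS bigint)", conn))
        maxId = (long)maxCmd.ExecuteScalar();

    for (long i = 0; i <= maxId; i += chunkSize)
    {
        // Half-open range avoids overlapping boundary rows between chunks.
        using var cmd = new SqlCommand(
            "SELECT col1, col2, col3 FROM mytable WHERE pk >= @from AND pk < @to", conn);
        cmd.Parameters.AddWithValue("@from", i);
        cmd.Parameters.AddWithValue("@to", i + chunkSize);

        using var reader = cmd.ExecuteReader();
        if (reader.HasRows)
            sendChunkToIndexer(reader); // hand the chunk to the web service / indexer
    }
}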

Is count(*) really expensive?

I have a page where I have 4 tabs displaying 4 different reports based off different tables.
I obtain the row count of each table using a select count(*) from <table> query and display the number of rows available in each table on the tabs. As a result, each page postback causes 5 count(*) queries to be executed (4 to get counts and 1 for pagination) and 1 query for getting the report content.
Now my question is: are count(*) queries really expensive -- should I keep the row counts (at least those that are displayed on the tabs) in the view state of the page instead of querying multiple times?
How expensive are COUNT(*) queries?
In general, the cost of COUNT(*) is proportional to the number of records satisfying the query conditions plus the time required to prepare these records (which depends on the underlying query complexity).
In simple cases where you're dealing with a single table, there are often specific optimisations in place to make such an operation cheap. For example, doing COUNT(*) without WHERE conditions from a single MyISAM table in MySQL - this is instantaneous as it is stored in metadata.
For example, let's consider two queries:
SELECT COUNT(*)
FROM largeTableA a
Since every record satisfies the query, the COUNT(*) cost is proportional to the number of records in the table (i.e., proportional to what it returns), assuming it needs to visit the rows and there isn't a specific optimisation in place to handle it.
SELECT COUNT(*)
FROM largeTableA a
JOIN largeTableB b
ON a.id = b.id
In this case, the engine will most probably use HASH JOIN and the execution plan will be something like this:
Build a hash table on the smaller of the tables
Scan the larger table, looking up each record in the hash table
Count the matches as it goes.
In this case, the COUNT(*) overhead (step 3) will be negligible and the query time will be completely defined by steps 1 and 2, that is building the hash table and looking it up. For such a query, the time will be O(a + b): it does not really depend on the number of matches.
However, if there are indexes on both a.id and b.id, the MERGE JOIN may be chosen and the COUNT(*) time will be proportional to the number of matches again, since an index seek will be performed after each match.
You need to attach SQL Profiler or an app level profiler like L2SProf and look at the real query costs in your context before:
guessing what the problem is and trying to determine the likely benefits of a potential solution
allowing others to guess for you on da interwebs - there's lots of misinformation without citations about, including in this thread (but not in this post :P)
When you've done that, it'll be clear what the best approach is - i.e., whether the SELECT COUNT is dominating things or not, etc.
And having done that, you'll also know whether any changes you choose to do have had a positive or a negative impact.
As others have said, COUNT(*) always physically counts rows, so if you can do that once and cache the result, that's certainly preferable.
If you benchmark and determine that the cost is negligible, you don't (currently) have a problem.
If it turns out to be too expensive for your scenario you could make your pagination 'fuzzy' as in "Showing 1 to 500 of approx 30,000" by using
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('sometable') AND indid < 2
which will return an approximation of the number of rows (it's approximate because it's not updated until a CHECKPOINT).
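If you go that route, a minimal ADO.NET sketch of running that approximate-count query could look like this (it assumes SQL Server and an already-open SqlConnection; the method name is made up):
using System.Data.SqlClient;

// Returns the approximate row count from sysindexes, as in the query above.
int GetApproximateRowCount(SqlConnection connection, string tableName)
{
    const string sql =
        "SELECT rows FROM sysindexes WHERE id = OBJECT_ID(@table) AND indid < 2";

    using var cmd = new SqlCommand(sql, connection);
    cmd.Parameters.AddWithValue("@table", tableName);
    return (int)cmd.ExecuteScalar();
}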
If the page gets slow, one thing you can look at is minimizing the number of database roundtrips, if at all possible. Even if your COUNT(*) queries are O(1), if you're doing enough of them, that could certainly slow things down.
Instead of setting up and executing 5 separate queries one at a time, run the SELECT statements in a single batch and process the 5 results at once.
I.e., if you're using ADO.NET, do something like this (error checking omitted for brevity; non-looped/non-dynamic for clarity):
string sql = "SELECT COUNT(*) FROM Table1; SELECT COUNT(*) FROM Table2;";
SqlCommand cmd = new SqlCommand(sql, connection);
SqlDataReader dr = cmd.ExecuteReader();
// Defaults to first result set
dr.Read();
int table1Count = (int)dr[0];
// Move to second result set
dr.NextResult();
dr.Read();
int table2Count = (int)dr[0];
If you're using an ORM of some sort, such as NHibernate, there should be a way to enable automatic query batching.
COUNT(*) can be particularly expensive, as it may result in loading (and paging) an entire table, where you may only need a count on a primary key (in some implementations it is optimised).
From the sound of it, you are causing a table load operation each time, which is slow, but unless it is running noticeably slowly, or causing some sort of problem, don't optimise: premature and unnecessary optimisation can cause a great deal of trouble!
A count on an indexed primary key will be much faster, but given the cost of having an index, this may provide no benefit.
All I/O is expensive and if you can accomplish the task without it, you should. But if it's needed, I wouldn't worry about it.
You mention storing the counts in view state, which is certainly an option, as long as the behavior of the code is acceptable when that count is wrong because the underlying records have been deleted or added.
This depends on what you are doing with the data in this table. If it changes very often and you need all of it every time, maybe you could create a trigger that fills another table consisting only of the counts from this table. If you need to show this data separately, maybe you could just execute "select count(*)..." for only one particular table. This just came to my mind instantly, but there are other ways to speed this up, I'm sure. Cache the data, maybe? :)
