I have a typical scenario that I'm struggling with from a performance standpoint. The user selects a value from a dropdown and clicks a button. A stored procedure takes that value as an input parameter, executes, and returns the results to a grid. For just one of the values ('All'), the query runs for roughly 2.5 minutes. For the rest of the values the query runs less than 1ms.
Obviously, having the user wait for 2.5 minutes just isn't going to fly. So, what are some typical strategies to handle this?
Some of my own thoughts:
New table that stores the information for the 'All' value and is generated nightly
Cache the data on the caching server
Any help is appreciated.
Thanks!
Update
A little bit more info:
sp returns two result sets. The first is a group by rollup summary and the second is the first result set, disaggregated (roughly 80,000 rows).
I would first look at if your have the proper indexes in place. Using the Query Analyzer and the Database Tuning Assistant is a simple and often effective way of seeing what indexes might help.
If you still have performance problems after creating the appropriate indexes you might then look at adding tables/views to speed things up. If your query does a lot of joins you might consider creating an indexed view that allows you to do a select with no joins on the denormalized data. Since indexed views are persisted you can see big gains from their use.
You can read up on indexed views here:
http://msdn.microsoft.com/en-us/library/dd171921%28v=sql.100%29.aspx
and read about the database tuning adviser here:
http://msdn.microsoft.com/en-us/library/ms166575.aspx
Also, how many records does "All" return? I have seen people get hung up on the "All" scenario before, but if it returns 1 million records or something then the data is not usable to a person anyways...
Caching data is a good thing, but.... if the SP is inherently flawed, then you might want to actually fix it instead of trying to bandage it with caching.
You might also want to (since you didn't mention here) look at the number of rows "All" returns compared to the other selections and think about your indexes.
Also in your SP does the "All" cause it to run a different sets of tsql as in maybe a case or an if... or is it running the same code just with a different "WHERE"?
It might simply be that "ALL" just returns A LOT of records. You may want to implement paging and partial dataset return using ajax... (kinda like return the first 1000 records early so that it can be displayed and also show a throbber on the screen while the rest of the dataset is returned)
These are all options... if the number of records really isnt that different between ALL and the others... then it probably has something to do with the query/index/program flow.
Related
Background:
I'm using SQL Server 2008 and ASP.NET 4 on Windows 2008
I have one table with about 10 million rows of products that I make available online for users to browse -- not search. Each of the 10 million products have extra attributes -- like categories -- that I keep in lookup tables -- there are three or four lookup tables.
Problem
When someone browses and starts using filters (shipping location, price, quality, brand), I need to join the tables, apply all the filters, and return the results. It's very slow and I want to make it faster. Sometimes users will apply a very broad filter, resulting in 800,000 results, and though I only return the first 10 of those for browsing, I still need to run the query for the full 800,000.
What I've Tried Already
I've joined all the information from the various tables into one physical table and then created a covering index for the table.
The queries are much faster, but there is a good bit of maintenance I have to do on the table behind the scenes with jobs to make sure if something goes out of stock I take it out within a reasonable time frame (5 mins or so).
I don't use materialized/indexed views b/c I've got aggregates in the results which SQL Server doesn't seem to like.
Question
How can I speed up browse results beyond the indexing and table optimization that I've already done? I'm not doing any full-text searches -- I'm filtering with exact parameters.
Possible Solutions I've Thought Of
Large caching solution -- AppFabric or MemCached. I'm know next to nothign about these and don't know they are appropriate.
Small caching solution -- Maybe leveraging ASP.NET caching -- but every person is going to apply different filters so I'm not sure how much this will give me.
SSDs -- as a larger-scale solution I've thought about getting SSDs but that will be down the road
CDN -- I don't think a CDN will help b/c the bottleneck here is my database's search capabilities, not the bandwidth/distance to the requester.
I had a similar problem with a complex join query causing horrible response times. I was able to solve it via using Lucene.NET. It's a .NET implementation of the Lucene search index. Basically, you build indexes on data fields (your categories) and then you can search via those categories and return thousands of rows very quickly. Basically, it takes the join operation out of the equation because it already knows, via the indexes, which records fit your criteria.
The following is a very good article on Lucene.NET. I highly recommend it. It took a search result that was taking 20 seconds using standard joins and reduced the response time to less than a second.
http://www.codeproject.com/Articles/29755/Introducing-Lucene-Net
Also, feel free to ping me if you have specific Lucene.NET implmenetation questions. I just got through a lot of research/learning in order to implement it properly on my site, so if you have specific questions on how to make it work I may be able to help with that as well.
"I perform the full query b/c I need to populate the new filters and
the number of results along with the search results. For example, if
someeone filters on category of "Shoes", and location of TX, some of
the other filters are going to be restricted based on the previous
filter."
Try executing two queries: One to count all results and one to select the top N. Maybe your bottleneck is copying 800,000 rows to the client. Doing two queries would fix this at the cost of an additional query. The cost is likely to be less than 2x though due to optimizations for few rows and for count-only queries.
Let's say i have a table in a database with 10k records. I dont need to actually use those 10k records anymore, but i still need to keep them in the database. That very table is now going to be used to store new data. So there's gonna be more records coming on top of the 10K records already present in the table. As opposed to the "old" 10K records, i do need to work with the newly inserted data. Right now im doing this to get the data i need:
List<Stuff> l = (from x in db.Table
where x.id > id
select x).ToList();
My question now is: how does the where clause in LINQ (or in SQL in general) work under the covers? Is the ENTIRE table going to be searched until (x.id > id) is true? Because let's say the table will increase from 10k records to 20K. It'd be a little silly to look through the entire 20 k records, if i know that i only have to start looking from a certain point.
I've had performance problems (not dramatic, but bad enough to be agitated by it) with this while using LINQ to entities, which i kinda don't understand because it should be no problem at all for a modern computer to sift through a mere 20 k records. I've been advised to use a stored procedure instead of a LINQ query, but i dont know whether or not this will boost performance?
Any feedback will be appreciated.
It's going to behave just like a similarly worded SQL query would. The question is whether the overhead you're experiencing is happening in the query or in the conversion of the query to a list. The query itself as you've written should equate literally to:
Select ID, Column1, Column2, Column3, ... , Column(n+1)
From db.Table
Where ID > id
This query should be fairly fast depending on the nature of the data. The query itself will not be executed until it is acted upon, however. In this case, you're converting it to a list, which is the equivalent of acting upon it. I can't find the comment someone made to me about this practice, but I've found it too be quite helpful in keeping performance clean. Unless you have some very specific need, you should leave your queries as IQueryable. Converting them to lists doubles the effort because first the query must be executed and then the result set must be converted into an appropriate IEnumerable (List in this case).
So you have 2 potential bottlenecks. The simple query could be taking a long time to query a massive collection of data, or the number of records could be bottenecking at the poing where the List is created. Another possibility is the nature of ID in this case. If it is numeric, that will save you some time. If it's performing a text-based search then it's going to be heavier.
To answer your specific question, yes, it's going to search every record in the database and return all of the records that match the expression. Edit: If the database has a proper index on the column in question, it will not search EVERY record but rather will use the index to perform the search. From comment from #Pleun.
As for using a stored procedure, that's a load of hogwash, but it's a perfectly acceptable alternative. I have several programs that routinely run similar queries against a database with over 40 million records, and the only performance issue I've run into so far has been CPU usage when multiple users are performing rapid firing queries. To solve your specific issue, I'd recommend that you tune it a little in SQL Management Studio until the query you want returns to your interface with an acceptable speed. Then you can convert that query into a compatible Linq statement. As long as you leave it as an IQueryable it should exhibit similar results.
I am working on a large project where I have to present efficient way for a user to enter data into a form.
Three of the fields of that form require a value from a subset of a common data source (SQL Table). I used JQuery and JQuery UI to build an autocomplete, which posts to a generic HttpHandler.
Internally the handler uses Linq-to-sql to grab the data required from that specific table. The table has about 10 different columns, and the linq expression uses the SqlMethods.Like() to match the single search term on each of those 10 fields.
The problem is that that table contains some 20K rows. The autocomplete works flawlessly, accept the sheer volume of data introduces deleays, in the vicinity of 6 seconds or so (when debugging on my local machine) before it shows up.
The JqueryUI autocomplete has 0 delay, queries on the 3 key, and the result of the post is made in a Facebook style multi-row selectable options. (I almost had to rewrite the autocomplete plugin...).
So the problem is data vs. speed. Any thoughts on how to speed this up? The only two thoughts I had were to cache the data (How/Where?); or use straight up sql data reader for data access?
Any ideas would be greatly appreciated!
Thanks,
<bleepzter/>
I would look at only returning the first X number of rows using the .Take(10) linq method. That should translate into a sensbile sql call, which will put much less load on your database. As the user types they will find less and less matches, so they will only see that data they require.
I'm normally reckon 10 items is enough for the user to understand what is going on and still get to the data they need quickly (see the amazon.com search bar for an example).
Obviously if you can sort the data in a meaningful fashion then the 10 results will be much more likely to give the user what they are after quickly.
Returning the top N results is a good idea for sure. We found (querying a potential list of 270K) that returning the top 30 is a better bet for the user finding what they're looking for, but that COMPLETELY depends on the data you are querying.
Also, you REALLY should drop the delay to something sensible like 100-300 ms. When you set delay to ZERO, once you hit the 3-character trigger, effectively EVERY. SINGLE. KEY. STROKE. is sent as a new query to your server. This could easily have the unintended and unwelcome effect of slowing down the response even MORE.
I'm using ASP.net and an SQL database. I have a blog like system where a number of comments are made against a post and I want to display the number of those comments next to the post. To get that number I could either hold it in the post record and add/subtrack when a comment is added or deleted or I could use the SQL to calculate the number of comments using a query each time a user hits the page. The latter seems to be a bad idea as its going to hit my SQL database harder however holding the number against the record feels like it could be error prone. What do you think is best coding practice in this case?
Always start with a normalized database (your second option). Only denormalize if you have an absolute necessity for performance reasons. Designing it in the denormalized way (which is error-prone as you guessed) is premature optimization. With proper indexes it should be fine calculating the number on the fly.
I think the SQL statement should be fine. The other is duplication of data you already have. A count query should be quick.
Don't optimize prematurely. Use the simple solution and pagefault in optimizations only when they're needed.
I would query the database each time you want the information. I would revisit it later if you find that performance is lacking (optimize later). For the traffic most blog type applications will get, that should be sufficient.
Perhaps get the count back as part of the main thread query so as to limit the number of hits on the actual DB from the webserver. But I would always query the actual count and not try and keep it in a field, data will eventually get out of sync as that is reality.
To increase performance, you could keep a flag in the main table to indicate if the item has any comments but only use this as a 'hint' as to whether or not to perform an additional query to count and retrieve comments at a later time.
Imagine a photo gallery that returns 50 photos to rotate through. Each photo could have its own comments.
The initial page load would return a list of photos plus a flag indicating if a photo has comments.
When a photo is displayed, if the comments flag is set to True, your app would make an ajax request to count and fetch the comments for that photo.
If only 3 out of the 50 photos have comments, you just saved yourself 47 additional requests!
This does denormalize the data, but on a limited level.
Creating hints can really help improve performance for very busy sites.
Depending on how your data model looks...Don't add the total post count to the main thread record, it is error prone, you should calculate the comment count when needed based on the thread ID, IMHO
Caching the pages and updating that cache as comments are added/removed would be a good option a long with the SQL count query if you are that worried about the number of queries happening against the db..
I usually use an indexed view for this kind of thing. This allows you to denormalize the data for quick retrieval, but there is no way for it to get out of sync. Folks will also not be confused and think the view is the master of the data. I have mostly used the standard sku of SS2K5, so I have to specify the (noexpand) hint to get it to actually use the index on the view (enterprise will do it automatically). So for standard sku, I always create a wrapper view that everyone hits so I know the hint is always in place.
Coding this on the web page, so hopefully no syntax errors ;)
create view postCount__
as
select
threadId
,postCount=count_big(*)
from thread
group by threadId
go
create unique clustered index postCount__xpk_threadid on postCount__(threadId)
go
create view postCount
as
select
threadId
,postCount=cast(postCount as int)
from postCount__ with (noexpand)
go
So I use a nomenclature on the actual indexed view to let everyone know not to query it directly. Instead they look for the associated wrapper view that enforces the noexpand hint. Using an indexed view forces you to do count_big, so I often cast down to int in the wrapper view to be able to keep our asp.net code lazily using 32 bit ints. It would be better to omit the cast, but it hasn't been of any significant impact for me.
EDIT - I can tell you that forum software always denormalizes the post count to the thread table. It kills the DB to continually count the post count on every page view if you have an active forum. I love that mssql has indexed views so you can define the denormalization declaratively rather than maintain it yourself.
Is it quicker to make one trip to the database and bring back 3000+ plus rows, then manipulate them in .net & LINQ or quicker to make 6 calls bringing back a couple of 100 rows at a time?
It will entirely depend on the speed of the database, the network bandwidth and latency, the speed of the .NET machine, the actual queries etc.
In other words, we can't give you a truthful general answer. I know which sounds easier to code :)
Unfortunately this is the kind of thing which you can't easily test usefully without having an exact replica of the production environment - most test environments are somewhat different to the production environment, which could seriously change the results.
Is this for one user, or will many users be querying the data? The single database call will scale better under load.
Speed is only one consideration among many.
How flexible is your code? How easy is it to revise and extend when the requirements change? How easy is it for another person to read and maintain your code? How portable is your code? what if you change to a diferent DBMS, or a different progamming language? Are any of these considerations important in your case?
Having said that, go for the single round trip if all other things are equal or unimportant.
You mentioned that the single round trip might result in reading data you don't need. If all the data you need can be described in a single result table, then it should be possible to devise a query that will get that result. That result table might deliver some result data in more than one row, if the query denormalizes the data. In that case, you might gain some speed by obtaining the data in several result tables, and composing the result yourself.
You haven't given enough information to know how much programming effort it will be to compose a single query or to compose the data returned by 6 queries.
As others have said, it depends.
If you know which 6 SQL statements you're going to execute beforehand, you can bundle them into one call to the database, and return multiple result sets using ADO or ADO.NET.
http://support.microsoft.com/kb/311274
the problem I have here is that I need it all, i just need it displayed separately...
The answer to your question is 1 query for 3000 rows is better than 6 queries for 500 rows. (given that you are bringing all 3000 rows back regardless)
However, there's no way you're going (to want) to display 3000 rows at a time, is there? In all likelihood, irrespective of using Linq, you're going to want to run aggregating queries and get the database to do the work for you. You should hopefully be able to construct the SQL (or Linq query) to perform all required logic in one shot.
Without knowing what you're doing, it's hard to be more specific.
* If you absolutely, positively need to bring back all the rows, then investigate the ToLookup() method for your linq IQueryable< T >. It's very handy for grouping results in non-standard ways.
Oh, and I highly recommend LINQPad (free) for trying out queries with Linq. It has loads of examples, and it also shows you the sql and lambda forms so you can familiarize yourself with Linq<->lambda form<->Sql.
Well, the answer is always "it depends". Do you want to optimize on the database load or on the application load?
My general answer in this case would be to use as specific queries as possible at the database level, therefore using 6 calls.
Thx
I was kind of thinking "ball park", but it sounds as though its a choice thing...the difference is likely small.
I was thinking that getting all the data and manipulating in .net would be the best - I have nothing concrete to base this on (hence the question), I just tend to feel that calls to the DB are expensive and if I know i need all the data...get it in one hit?!?
Part of the problem is that you have not provided sufficient information to give you a precise answer. Obviously, available resources need to be considered.
If you pull 3000 rows infrequently, it might work for you in the short term. However, if there are say 10,000 people that execute the same query (ignoring cache effects), this could become a problem for both the app and db.
Now in the case of something like pagination, it makes sense to pull in just what you need. But that would be a general rule to try to only pull what is necessary. It's much more elegant to use a scalpel instead of a broadsword. =)
If you are talking about a query that has already been run by SQL (so optimized by SQL Server), working with LINQ or a SqlDataReader might actually have the same performance.
The only difference will be "how hard will it be to maintain your code?"
LINQ doesn't query anything to the database until you ask for the result with ".ToList()" or ".ToArray()" or even ".Count()". LINQ is dynamically building your query so it is exactly the same as having a SqlDataReader but with runtime verification.
Rather than speculating, why don't you try both and measure the results?
It depends
1) if your connector implementation precaches a lot of objects AND you have big rows (for example blobs, contry polygons etc.) you have a problem, you have to download a LOT of data. I've optimalized once a code that had this problem and it was just downloading some megs of garbage all the time via localhost, and my software runs now 10 times faster because i removed the precaching by an option
2) If your rows are small and you have a good chance that you need to read through all the 3000, you're better going on a big resultset
3) If you don't use prepared statements, all queries have to be parsed! Big resultset might be better.
Hope it helped
I always stick to the rule of "bring in what I need" and nothing more...the problem I have here is that I need it all, I just need it displayed separately.
So say...
I have a table with userid and typeid. I want to display all records with a userid, and display on the page in grids say separated by typeid.
At the moment I call sproc that does "select field1, field2 from tab where userid=1",
then on the page set the datasource of a grid to from t in tab where typeid=2 select t;
Rather than calling a different sproc "select field1, field2 from tab where userid=1 and typeid=2" 6 times.
??