Fastest alternative to Datatable.Select() to narrow cached data? - asp.net

Stack Overflowers:
I have a search function on my company's website (based on .NET 2.0) that allows you to narrow the product catalog using up to 9 different fields. Right now, after you make your selections on the frontend, I build a dynamic query and hit the database (SQL Server) to get the resulting list of item numbers.
I would like to move away from hitting the database every time and do all of this in memory for faster results. Basically it's a 3500 - 4500 row "table" with 10 columns: the item number (which could be a primary key) and the 9 attribute fields (which have values repeated across many, many rows). There can be any number of different searches across the 9 columns to get the items you want:
Column A = 'foo' AND Column D = 'bar'
Column B = 'foo' AND Column C = 'bar' AND Column I = 'me'
Column H = 'foo'
etc...
Based on my research, the .Select() method seems to be the slowest way to perform the search, but it also stands out as the quickest and easiest way to express the narrowing searches that produce the list of item numbers:
MyDataTable.Select("Column B = 'foo' AND Column E = 'bar' AND Column I = 'me'")
In my specific case, what method do you suggest as an alternative that gives the same narrowing functionality with better performance, rather than settling for DataTable.Select()?

Your best alternative is to let your database do what it's best at: querying and filtering data.
Caching DataTables (especially ones with 3500-4500 rows) is a bad idea for web applications. Calling Select() on a DataTable doesn't reduce the number of rows in the DataTable - it returns a new array of references to the matching rows, which means you'll still have the original 4000 rows sitting in the cache. Better to have nothing at all in the cache, and just get the rows you need when the user requests them.
DataTables (and DataSets) are best used with fat clients (usually Windows applications) that need to work with in-memory copies of database data while in a disconnected state.

DataTables are not built for optimal querying; I wouldn't recommend going down this route unless you really have a documented performance problem that you're certain would be improved by doing so.
If your dynamic queries are slow, it's probably because you haven't indexed your table properly in your database. Databases are designed to query data optimally, so my hunch is that a little work on the database side of things should get you where you need to go.
If you really need to query ADO.NET DataTables, make sure you read Scaling ADO.Net DataTables thoroughly. It covers things you can do to speed them up and gives you some benchmarks so you can see the difference.
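If you do end up keeping the table in memory despite the advice above, one technique worth knowing is to put a DataView over the table with a Sort on the columns you filter by and use FindRows() instead of Select(); the sorted view builds an index once, so repeated exact-match lookups avoid a full scan. A rough, .NET 2.0-compatible sketch (requires System.Data; the column and value names are just the placeholders from the question, and this only helps when the set of filter columns is fixed up front - for arbitrary combinations you're back to Select() or RowFilter):
// The sorted DataView maintains an index over these columns.
DataView view = new DataView(myDataTable);
view.Sort = "ColumnB, ColumnE, ColumnI";

// FindRows matches the supplied values against the sort-key columns, in order.
DataRowView[] matches = view.FindRows(new object[] { "foo", "bar", "me" });

foreach (DataRowView row in matches)
{
    string itemNumber = (string)row["ItemNumber"];   // hypothetical key column
    // collect the matching item numbers...
}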

Related

LINQ to entities performance regarding where clause

Let's say I have a table in a database with 10K records. I don't need to actually use those 10K records anymore, but I still need to keep them in the database. That very table is now going to be used to store new data, so there are going to be more records coming in on top of the 10K already present in the table. As opposed to the "old" 10K records, I do need to work with the newly inserted data. Right now I'm doing this to get the data I need:
List<Stuff> l = (from x in db.Table
                 where x.id > id
                 select x).ToList();
My question now is: how does the where clause in LINQ (or in SQL in general) work under the covers? Is the ENTIRE table going to be searched until (x.id > id) is true? Because let's say the table grows from 10K records to 20K. It'd be a little silly to look through the entire 20K records if I know that I only have to start looking from a certain point.
I've had performance problems (not dramatic, but bad enough to be agitated by them) with this while using LINQ to Entities, which I kind of don't understand, because it should be no problem at all for a modern computer to sift through a mere 20K records. I've been advised to use a stored procedure instead of a LINQ query, but I don't know whether or not this will boost performance.
Any feedback will be appreciated.
It's going to behave just like a similarly worded SQL query would. The question is whether the overhead you're experiencing is happening in the query or in the conversion of the query to a list. The query itself as you've written should equate literally to:
Select ID, Column1, Column2, Column3, ... , Column(n+1)
From db.Table
Where ID > id
This query should be fairly fast depending on the nature of the data. The query itself will not be executed until it is acted upon, however. In this case, you're converting it to a list, which is the equivalent of acting upon it. I can't find the comment someone made to me about this practice, but I've found it to be quite helpful in keeping performance clean. Unless you have some very specific need, you should leave your queries as IQueryable. Converting them to lists doubles the effort, because first the query must be executed and then the result set must be converted into an appropriate IEnumerable (a List in this case).
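For illustration, roughly what that looks like in code (using System.Linq; db is assumed to be your LINQ to SQL / Entity Framework context from the question, and the extra Name filter is made up):
// Keep the result as IQueryable so later filters compose into the same SQL statement.
IQueryable<Stuff> newRows = db.Table.Where(x => x.id > id);

// Still no database call; this just builds the expression tree further.
IQueryable<Stuff> narrowed = newRows.Where(x => x.Name.StartsWith("A"));   // hypothetical extra filter

// The query is sent to the server only here, when the results are enumerated
// (ToList() would have the same effect - just do it once, at the end).
foreach (Stuff s in narrowed)
{
    // work with each row as it streams back
}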
So you have 2 potential bottlenecks. The simple query could be taking a long time to run against a massive collection of data, or the number of records could be bottlenecking at the point where the List is created. Another possibility is the nature of ID in this case. If it is numeric, that will save you some time. If the comparison is text-based, it's going to be heavier.
To answer your specific question: yes, it's going to search every record in the database and return all of the records that match the expression. Edit: if the database has a proper index on the column in question, it will not search EVERY record but rather will use the index to perform the search (from a comment by @Pleun).
As for a stored procedure being inherently faster, that's a load of hogwash, but it's a perfectly acceptable alternative. I have several programs that routinely run similar queries against a database with over 40 million records, and the only performance issue I've run into so far has been CPU usage when multiple users are performing rapid-fire queries. To solve your specific issue, I'd recommend that you tune it a little in SQL Management Studio until the query you want returns to your interface with an acceptable speed. Then you can convert that query into a compatible LINQ statement. As long as you leave it as an IQueryable it should exhibit similar results.

Strategy for handling variable time queries?

I have a typical scenario that I'm struggling with from a performance standpoint. The user selects a value from a dropdown and clicks a button. A stored procedure takes that value as an input parameter, executes, and returns the results to a grid. For just one of the values ('All'), the query runs for roughly 2.5 minutes. For the rest of the values, the query runs in less than 1 ms.
Obviously, having the user wait for 2.5 minutes just isn't going to fly. So, what are some typical strategies to handle this?
Some of my own thoughts:
New table that stores the information for the 'All' value and is generated nightly
Cache the data on the caching server
Any help is appreciated.
Thanks!
Update
A little bit more info:
The SP returns two result sets. The first is a GROUP BY ROLLUP summary, and the second is the first result set disaggregated (roughly 80,000 rows).
I would first check whether you have the proper indexes in place. Using the Query Analyzer and the Database Tuning Advisor is a simple and often effective way of seeing what indexes might help.
If you still have performance problems after creating the appropriate indexes, you might then look at adding tables/views to speed things up. If your query does a lot of joins, you might consider creating an indexed view that allows you to do a select with no joins on the denormalized data. Since indexed views are persisted, you can see big gains from their use.
You can read up on indexed views here:
http://msdn.microsoft.com/en-us/library/dd171921%28v=sql.100%29.aspx
and read about the database tuning adviser here:
http://msdn.microsoft.com/en-us/library/ms166575.aspx
Also, how many records does "All" return? I have seen people get hung up on the "All" scenario before, but if it returns 1 million records or something then the data is not usable to a person anyways...
Caching data is a good thing, but.... if the SP is inherently flawed, then you might want to actually fix it instead of trying to bandage it with caching.
You might also want to (since you didn't mention here) look at the number of rows "All" returns compared to the other selections and think about your indexes.
Also, in your SP, does "All" cause it to run a different set of T-SQL (as in a CASE or an IF), or is it running the same code just with a different WHERE?
It might simply be that "All" just returns A LOT of records. You may want to implement paging and partial dataset return using AJAX... (kind of like returning the first 1000 records early so they can be displayed, and showing a throbber on the screen while the rest of the dataset is returned).
These are all options... if the number of records really isn't that different between "All" and the others, then it probably has something to do with the query/index/program flow.
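If you go the paging-and-partial-return route and the detail data is reachable through LINQ rather than only through the SP, a minimal sketch of the server side looks something like this (pageIndex, pageSize, the rows source, the Row type and its Id column are all illustrative, not from the question):
// Each AJAX call asks for the next chunk; the grid fills in as chunks arrive.
int pageSize = 1000;
List<Row> chunk = rows                      // IQueryable<Row> over the disaggregated data (assumed)
    .OrderBy(r => r.Id)                     // paging needs a stable sort; Id is a placeholder key
    .Skip(pageIndex * pageSize)             // rows the client already has
    .Take(pageSize)                         // next chunk only
    .ToList();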

Autocomplete optimization for large data sets

I am working on a large project where I have to present efficient way for a user to enter data into a form.
Three of the fields of that form require a value from a subset of a common data source (SQL Table). I used JQuery and JQuery UI to build an autocomplete, which posts to a generic HttpHandler.
Internally the handler uses Linq-to-sql to grab the data required from that specific table. The table has about 10 different columns, and the linq expression uses the SqlMethods.Like() to match the single search term on each of those 10 fields.
The problem is that the table contains some 20K rows. The autocomplete works flawlessly, except that the sheer volume of data introduces delays, in the vicinity of 6 seconds or so (when debugging on my local machine), before the results show up.
The jQuery UI autocomplete has its delay set to 0, fires the query once 3 characters have been typed, and the results of the post are presented as Facebook-style, multi-row selectable options. (I almost had to rewrite the autocomplete plugin...)
So the problem is data vs. speed. Any thoughts on how to speed this up? The only two thoughts I had were to cache the data (how/where?) or to use a straight-up SQL data reader for data access.
Any ideas would be greatly appreciated!
Thanks,
<bleepzter/>
I would look at only returning the first X rows using the .Take(10) LINQ method. That should translate into a sensible SQL call, which will put much less load on your database. As the user types, there will be fewer and fewer matches, so they will only see the data they require.
I normally reckon 10 items is enough for the user to understand what is going on and still get to the data they need quickly (see the amazon.com search bar for an example).
Obviously if you can sort the data in a meaningful fashion then the 10 results will be much more likely to give the user what they are after quickly.
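A minimal sketch of that inside the handler, assuming a LINQ to SQL DataContext named db with a Products table, the posted search string in term, and matching on a single Name column for brevity (SqlMethods lives in System.Data.Linq.SqlClient):
// A prefix match ('term%') is far cheaper for SQL Server than '%term%',
// and Take(10) translates into SELECT TOP (10), so only a handful of rows come back.
List<string> matches = (from p in db.Products
                        where SqlMethods.Like(p.Name, term + "%")
                        orderby p.Name
                        select p.Name)
                       .Take(10)
                       .ToList();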
Returning the top N results is a good idea for sure. We found (querying a potential list of 270K) that returning the top 30 is a better bet for the user finding what they're looking for, but that COMPLETELY depends on the data you are querying.
Also, you REALLY should drop the delay to something sensible like 100-300 ms. When you set delay to ZERO, once you hit the 3-character trigger, effectively EVERY. SINGLE. KEY. STROKE. is sent as a new query to your server. This could easily have the unintended and unwelcome effect of slowing down the response even MORE.

.NET Object Design

I have a series of objects I have created:
Item
Order
Song
etc.
Each object has a reasonable number of properties, and I use a datareader where I pass it "SELECT * FROM [objectname]", and then I fill a collection of objects and return the collection. This works as: GetOrdersCollection(), GetSongsCollection(), etc.
I understand SELECT * to be a performance problem, and additionally, sometimes I prefer to include additional columns in the select statement which do not exist in the object, and have those all returned as well.
So my question is, what is the best way to approach this problem?
1. Should I create a new object for every query type?
2. I tried performing a check to see if a column is in the datareader before storing it, but this presents perf. issues. Is there a negligible-perf. way to avoid IndexOutOfRange?
3. Should I just use a DataTable and read right from the table?
I understand SELECT * to be a performance problem...
It's not a performance problem if there are only a few columns, or you need all of the columns anyway.
1. Should I create a new object for every query type?
You should create a new object for each table, and a new method for each query type.
2. I tried performing a check to see if a column is in the datareader before storing it, but this presents perf. issues. Is there a negligible-perf. way to avoid IndexOutOfRange?
If you are referring to your fields by name rather than index, there shouldn't be any IndexOutOfRange problems as long as the column is actually in the result set. If you are referring to your fields by index, you can loop through them while your index is less than the reader's field count, and there shouldn't be any IndexOutOfRange problems.
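To make that concrete, here's a small sketch of both styles against an open data reader (the column name and mapping are placeholders):
while (reader.Read())
{
    // By name: works as long as the column exists in this result set.
    string customer = reader["CustomerName"] as string;

    // By index: staying below FieldCount means the index can never run off the end.
    for (int i = 0; i < reader.FieldCount; i++)
    {
        string columnName = reader.GetName(i);
        object value = reader.IsDBNull(i) ? null : reader.GetValue(i);
        // map (columnName, value) onto the object being populated
    }
}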
3. Should I just use a DataTable and read right from the table?
That's a perfectly good approach to start out with. Consider spending some time learning a simple ORM, as others have suggested. SubSonic is a good "first" ORM.
Performance-wise, reading from a forward-only data structure like a DataReader is going to net you the best performance and resource conservation.
On the other hand, the cost of populating objects (like an O/RM does) can be negligible so long as you are not returning more than a handful of objects.
Your first step should be to profile your database and ensure that you have proper indexes. Write some tests to see where your largest time expense is in the process and optimize the target areas that cost you the most.
Are there any reasons you can't use a simple ORM generator like SubSonic? This will allow you to very easily access these types of collections, and they'll be strongly typed. You also won't have to worry about the SQL since the queries will be built by SubSonic.

Add new columns in asp .net application

I am facing this question in a new little project:
The system to be built will allow the user to add new columns to a table in the system, and then the user will be able to maintain the data. I think there are two ways to implement this:
1) Create a few tables, including a "columns" table with "columnName", "columnValue", "datatype", etc. to store the column definitions, another table "XXColumn" to store the values of the columns (entered by the user), and use a stored procedure to query/update the column data.
2) Create the column in the table schema when the user enters a new column; maintaining the table data then works just as normal.
Which way do you guys reckon? Or any new suggestions?
Some additional info: the data volume is small, and I need to create reports.
Any good recommendations would require a much better understanding of your requirements, but here are some comments on the options you mentioned, as well as some additional thoughts.
1) Entity-Attribute-Value (EAV) Design: This is the option you describe where you have a table that has columns for ColumnName, Type and Value. This option has the advantage of being able to accommodate unlimited new columns easily, but I have found it to be painful when the time comes to retrieve meaningful data back out. For example, say you have rows in this EAV table for {Color, varchar}{Red, Green, Blue} and {Size, varchar}{Small, Medium, Large}. If you want to find all the small green items, you need something like this (untested SQL of course):
SELECT *
FROM ITEMS
WHERE ITEMID IN (SELECT ITEMID
                 FROM ITEM_ATTRIBUTES ATT INNER JOIN ITEM_VALUES VLS
                      ON ATT.AttributeID = VLS.AttributeID
                 WHERE ATT.ColumnName = 'Color' AND VLS.Value = 'Green')
  AND ITEMID IN (SELECT ITEMID
                 FROM ITEM_ATTRIBUTES ATT INNER JOIN ITEM_VALUES VLS
                      ON ATT.AttributeID = VLS.AttributeID
                 WHERE ATT.ColumnName = 'Size' AND VLS.Value = 'Small')
Contrast this with having actual columns on the items table for color and size:
SELECT *
FROM ITEMS
WHERE COLOR = 'Green' AND SIZE = 'Small'
In addition, you will have a difficult time maintaining data integrity, if that is important for this app (and it is almost always important, even when you are told otherwise). In the example above, you will need to implement extra logic if "Color" should be limited to Blue, Green, and Red. Also, you will need to implement even more logic if certain colors only come in certain sizes (example - blue items are only available in small and medium)
2) User-Defined Columns: Just giving the user the ability to add additional columns to the table has the advantage of making data retrieval simpler, but all the data integrity issues remain. Also, your app usually requires extra logic to deal with the unknown columns.
3) Pre-Existing Custom Columns: I have worked with a few apps, such as CRMs, that provide a dozen or more columns already in place for user definition. Basically, the designers put in columns like "Text1","Text2","Text3","Number1","Number2", etc. The users then provide header and description information for these columns, and that is what the app uses for display purposes. This model has the advantage of easy data retrieval, as well as a pre-defined DB schema which should simplify app logic. Data integrity issues remain, however. The other obvious downside is that you will run out of pre-defined columns, which is what you are usually trying to avoid with this type of solution.
As with most design issues, there are tradeoffs to each solution. My experience has been that while many users/clients say they want solutions like these, in reality they are simply trying to ensure they don't get trapped with an app that can't grow with their needs. I have found that there are actually very few places where a design like this is needed. I can almost always create a design that addresses the expansion desires of the client without putting them into the role of database designer.
"The system to be built will allow user to add new columns to a table in the system..."
Really - that's the user story? Sounds like you've already made up your mind on the solution, to me.
Whether it's a good idea or not to allow a user to extend schemas is pretty context-dependent. I'd have little problem with it in an admin-like, limited-use setting, but it'd be a horribly bad idea in a MySpace-type setting. I suspect your situation lies somewhere between those two extremes.
Extending the schema would lead to much more efficient queries - as you could add indexes and such - but it does impose some relational rules on your users. Also, the extension would (probably) lock the entire table, and concurrent edits would need to be dealt with.
If this is centrally hosted by you, I would suggest NOT allowing user-input data to change the schema of the database (i.e. drive the creation of new tables).
Rather, you may want to look into using XML fields in SQL Server to store variable field names and data, or a more generic table structure... this technique works pretty well as long as we're not talking about crazy amounts of data...
Is it possible you're looking at your solution sideways? It sounds like you need a mapping table (sort of like your #1). You have a table, say "objects" for example, a table called "properties" which holds what you're calling columns and then a table that holds the values, so it just has object_id, property_id, value.
To put it in a smarter way than I said it, take a look at the Entity-Attribute-Value model.
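As a rough sketch of that mapping-table shape in code (all of these names are made up, and in practice the two collections would be tables in the database rather than in-memory lists):
using System.Collections.Generic;

class Property      { public int PropertyId; public string Name; }
class PropertyValue { public int ObjectId;   public int PropertyId; public string Value; }

static class Eav
{
    // Reassemble one object's user-defined "columns" into name/value pairs for display or reporting.
    public static Dictionary<string, string> GetProperties(
        int objectId, IList<Property> properties, IList<PropertyValue> values)
    {
        Dictionary<int, string> names = new Dictionary<int, string>();
        foreach (Property p in properties)
            names[p.PropertyId] = p.Name;

        Dictionary<string, string> result = new Dictionary<string, string>();
        foreach (PropertyValue v in values)
            if (v.ObjectId == objectId && names.ContainsKey(v.PropertyId))
                result[names[v.PropertyId]] = v.Value;

        return result;
    }
}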
