Slow making many aggregate queries to a very large SQL Server table - asp.net

I have a custom log/transaction table that tracks my users' every action within the web application; it currently has millions of records and grows by the minute. In my application I need to implement some way of precalculating a user's activities/actions in SQL to determine whether other features/actions are available to the user within the application. For one example, before a page loads, I need to check whether the user has viewed a page X number of times:
(SELECT COUNT(*) FROM MyLog WHERE UserID = xxx and PageID = 123)
I am making several similar aggregate queries with joins to check other conditions, and the performance is poor. These checks occur on every page request, and the application can receive hundreds of requests per minute.
I'm looking for any ideas to improve the application performance through sql and/or application code.
This is a .NET 2.0 app using SQL Server 2008.
Much thanks in advance!

The easiest way is to store the counts in a table by themselves. Then, when adding records (hopefully through a stored procedure), you can simply increment the affected row in your aggregate table. If you are really worried about the counts getting out of whack, you can put a trigger on the detail table to update the aggregate table; however, I don't like triggers because they have very little visibility.
Also, how up to date do these counts need to be? Can this be something that can be stored into a table once a day?
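Either way, here is a minimal sketch of the increment-on-write approach, assuming a hypothetical dbo.PageViewCounts summary table keyed on the UserID and PageID columns from the question:
-- Hypothetical summary table: one row per user/page combination.
CREATE TABLE dbo.PageViewCounts
(
    UserID    int NOT NULL,
    PageID    int NOT NULL,
    ViewCount int NOT NULL CONSTRAINT DF_PageViewCounts_ViewCount DEFAULT (0),
    CONSTRAINT PK_PageViewCounts PRIMARY KEY CLUSTERED (UserID, PageID)
);
GO

-- Called from the same proc that inserts the MyLog row.
CREATE PROCEDURE dbo.IncrementPageViewCount
    @UserID int,
    @PageID int
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE dbo.PageViewCounts
    SET ViewCount = ViewCount + 1
    WHERE UserID = @UserID AND PageID = @PageID;

    -- First visit: nothing to update yet, so insert the row.
    IF @@ROWCOUNT = 0
        INSERT INTO dbo.PageViewCounts (UserID, PageID, ViewCount)
        VALUES (@UserID, @PageID, 1);
END
The per-page check then becomes a single primary-key lookup instead of an aggregate over the log table. (Two concurrent first visits can still race on the insert; guarding that with a transaction and lock hints is left out of the sketch.)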

Querying a log table like this may be more trouble than it is worth.
As an alternative, I would suggest using something like memcache to store the value as needed. As long as you update the cache on each hit, it will be much faster than querying a large database table. Memcache has a built-in increment operation that handles this kind of thing.
This way you only need to query the db on the first visit.
Another alternative is to use a precomputed table, updating it as needed.
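If the data has to stay in SQL Server, one way to have the engine maintain such a precomputed aggregate for you is an indexed view over the log table. This is only a sketch using the MyLog/UserID/PageID names from the question, not something the poster described:
-- An indexed view must be schema-bound and must use COUNT_BIG.
CREATE VIEW dbo.vw_PageViewCounts
WITH SCHEMABINDING
AS
SELECT UserID, PageID, COUNT_BIG(*) AS ViewCount
FROM dbo.MyLog
GROUP BY UserID, PageID;
GO

-- The unique clustered index materializes the aggregate, and SQL Server
-- keeps it up to date as rows are inserted into MyLog.
CREATE UNIQUE CLUSTERED INDEX IX_vw_PageViewCounts
ON dbo.vw_PageViewCounts (UserID, PageID);
The trade-off is that every insert into MyLog now also maintains the index, and on non-Enterprise editions queries need the WITH (NOEXPAND) hint to read the view's index directly.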

Have you indexed MyLog on UserID and PageID? If not, that should give you some huge gains.
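Something along these lines ought to do it; the index name is made up, and the column order assumes you always filter on both UserID and PageID as in the example query:
-- Lets the COUNT(*) check be answered by an index seek
-- instead of a scan of the log table.
CREATE NONCLUSTERED INDEX IX_MyLog_UserID_PageID
ON dbo.MyLog (UserID, PageID);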

Todd, this is a tough one because of the number of operations you are performing.
Have you checked your indexes on that database?
Here's a stored procedure you can execute to at least help find missing indexes. I can't remember where I found this, but it helped me:
CREATE PROCEDURE [dbo].[SQLMissingIndexes]
@DBNAME varchar(100) = NULL
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
SELECT
migs.avg_total_user_cost * (migs.avg_user_impact / 100.0)
* (migs.user_seeks + migs.user_scans) AS improvement_measure,
'CREATE INDEX [missing_index_'
+ CONVERT (varchar, mig.index_group_handle)
+ '_' + CONVERT (varchar, mid.index_handle)
+ '_' + LEFT (PARSENAME(mid.statement, 1), 32) + ']'
+ ' ON ' + mid.statement
+ ' (' + ISNULL (mid.equality_columns,'')
+ CASE WHEN mid.equality_columns IS NOT NULL
AND mid.inequality_columns IS NOT NULL THEN ',' ELSE '' END
+ ISNULL (mid.inequality_columns, '')
+ ')'
+ ISNULL (' INCLUDE (' + mid.included_columns + ')', '') AS create_index_statement,
migs.*,
mid.database_id,
mid.[object_id]
FROM
sys.dm_db_missing_index_groups mig
INNER JOIN
sys.dm_db_missing_index_group_stats migs
ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details mid
ON mig.index_handle = mid.index_handle
WHERE
migs.avg_total_user_cost
* (migs.avg_user_impact / 100.0)
* (migs.user_seeks + migs.user_scans) > 10
AND
(@DBNAME = db_name(mid.database_id) OR @DBNAME IS NULL)
ORDER BY
migs.avg_total_user_cost
* migs.avg_user_impact
* (migs.user_seeks + migs.user_scans) DESC
END
I modified it a bit to accept a db name. If you don't provide a db name, it will run against all databases and suggest which fields need indexing.
To run it use:
exec DatabaseName.dbo.SQLMissingIndexes 'MyDatabaseName'
I usually put reusable SQL (sproc) code in a separate database called DBA, so that from any database I can say:
exec DBA.dbo.SQLMissingIndexes
As an example.
Edit
Just remembered the source, Bart Duncan.
Here is a direct link http://blogs.msdn.com/b/bartd/archive/2007/07/19/are-you-using-sql-s-missing-index-dmvs.aspx
But remember I did modify it to accept a single db name.

We had the same problem beginning several years ago and moved from SQL Server to OLAP cubes; when that stopped working recently we moved again, to Hadoop and some other components.
OLTP (Online Transaction Processing) databases, of which SQL Server is one, are not very good at OLAP (Online Analytical Processing). This is what OLAP cubes are for.
OLTP provides good throughput when you're writing and reading many individual rows. It fails, as you just found, when doing many aggregate queries that require scanning lots of rows. Since SQL Server stores each record as a contiguous block on disk, scanning many rows means many disk fetches. The cache saves you for a while, as long as your table is small, but once you get to tables with millions of rows the problem becomes evident.
Frankly, OLAP isn't that scalable either, and at some point (tens of millions of new records per day) you're going to have to move to a more distributed solution - either paid (Vertica, Greenplum) or free (HBase, Hypertable).
If neither is an option (e.g. no time or no budget), then for now you can alleviate your pain somewhat by spending more on hardware. You need very fast IO (fast disks, RAID) and as much RAM as you can get.

Related

asp.net - How to clean up after an SQL Injection Attack?

I have several old sites that have just been taken offline by my hosting company, apparently due to a SQL injection attack. I looked inside my database and yes I was hacked. *oops*
My database has been filled with script tags that have been appended to my original data (at least my original data is still there so that was nice of them).
I have been looking though my old code and have seen a few unsanitised input locations, so obviously I will go through this thoroughly and check for more.
I'm also downloading the hacked site to compare it to the version I uploaded years ago (using some kind of file-comparison program); this should let me see if they have tried to add a backdoor.
My questions are…
1) Is there a way I can strip out all the appended script tags from my database, since they are all exactly the same?
2) Is there anything else I should be aware of or have overlooked?
I would just like to point out that no sensitive material is stored on these old sites, so it's no big deal; I would just like to get them back up and running again.
I am brushing up on my security knowledge and will shortly delete all the files on the host, change all the passwords and upload the improved (and less hacker-friendly) site.
Thanks...
Specifically answering your script tag replacement issue, I can't see this being anything other than a manual task.
I'm sure you've considered this, but a simple replace statement on a field ought to get this stuff out:
update MyTable
set field = replace(field, 'unwanted', '')
where field like '%unwanted%'
If there are many tables and fields, then I'm sure you could conjure up some sort of automation using the SQL data dictionary. Something like the following:
DECLARE @ColName varchar(255), @TableName varchar(255), @sSQL varchar(1000)
DECLARE colcur CURSOR for
SELECT name, object_name(id)
FROM syscolumns
WHERE name = 'Moniker'
OPEN ColCur
FETCH NEXT FROM ColCur
INTO @ColName, @TableName
WHILE @@FETCH_STATUS = 0
BEGIN
Set @sSQL = 'update ' + @TableName + ' set ' + @ColName + ' = replace(' + @ColName + ', ''unwanted'', '''') where ' + @ColName + ' like ''%unwanted%'''
exec(@sSQL)
select @ColName, @TableName
FETCH NEXT FROM ColCur
INTO @ColName, @TableName
END
CLOSE ColCur
DEALLOCATE ColCur
I guess these would be some steps in an ideal scenario:
Keep your site offline. Maybe you'd like to display a "Down for maintenance" message rather than a 404.
Make a backup of the hacked database, you may want to analyse it later
Make sure that you fix code pieces vulnerable for SQL Injections. I'd recommend doing this in a team, to be more thorough.
Restore your database from a backup
Upload the (hopefully) fixed homepage
Contact your lawyer, because you may well have leaked customer data.
With your lawyer you would discuss the next legal steps.
As you mentioned, no sensitive material was stored on the hacked site, which probably means you can skip steps 6 and 7.
This is an ideal time to use your backup if you have one, because you don't know exactly how your data was corrupted. If you don't have a backup, then this should be a lesson to take backups in the future and to protect yourself against such attacks. Also, without a backup you will need to write something that cleans up your data, and even that doesn't guarantee no junk will remain.
First, protect against SQL injection; then restore the data from a recent backup.

Poor SP performance from ASP.NET

I have a stored procedure that handles sorting, filtering and paging (using Row_Number) and some funky trickery :) The SP is running against a table with ~140k rows.
The whole thing works great and for at least the first few dozen pages is super quick.
However, if I try to navigate to higher pages (e.g. head to the last page of 10k) the whole thing comes to a grinding halt and results in a SQL timeout error.
If I run the same query, using the same parms, inside a Management Studio query window, the response is instant irrespective of the page number I pass in.
At the moment it's test code that simply binds to an ASP:DataGrid in .NET 3.5.
The SP looks like this:
BEGIN
WITH Keys
AS (
SELECT
TOP (@PageNumber * @PageSize) ROW_NUMBER() OVER (ORDER BY JobNumber DESC) as rn
,P1.jobNumber
,P1.CustID
,P1.DateIn
,P1.DateDue
,P1.DateOut
FROM vw_Jobs_List P1
WHERE
(@CustomerID = 0 OR CustID = @CustomerID) AND
(JobNumber LIKE '%'+@FilterExpression+'%'
OR OrderNumber LIKE '%'+@FilterExpression+'%'
OR [Description] LIKE '%'+@FilterExpression+'%'
OR Client LIKE '%'+#FilterExpression+'%')
ORDER BY P1.JobNumber DESC ),SelectedKeys
AS (
SELECT
TOP (@PageSize) SK.rn
,SK.JobNumber
,SK.CustID
,SK.DateIn
,SK.DateDue
,SK.DateOut
FROM Keys SK
WHERE SK.rn > ((@PageNumber-1) * @PageSize)
ORDER BY SK.JobNumber DESC)
SELECT
SK.rn
,J.JobNumber
,J.Description
,J.Client
,SK.CustID
,OrderNumber
,CAST(DateAdd(d, -2, CAST(isnull(SK.DateIn,0) AS DateTime)) AS nvarchar) AS DateIn
,CAST(DateAdd(d, -2, CAST(isnull(SK.DateDue,0) AS DateTime)) AS nvarchar) AS DateDue
,CAST(DateAdd(d, -2, CAST(isnull(SK.DateOut,0) AS DateTime)) AS nvarchar) AS DateOut
,Del_Method
,Ticket#
,InvoiceEmailed
,InvoicePrinted
,InvoiceExported
,InvoiceComplete
,JobStatus
FROM SelectedKeys SK
JOIN vw_Jobs_List J ON j.JobNumber=SK.JobNumber
ORDER BY SK.JobNumber DESC
END
And it's called via
sp_jobs (PageNumber,PageSize,FilterExpression,OrderBy,CustomerID)
e.g.
sp_Jobs '13702','10','','JobNumberDESC','0'
Can anyone shed any light on what might cause the dramatic difference in performance between the Management Studio query window and an ASP.NET page filling a DataSet?
Check out the "WITH RECOMPILE" option
http://www.techrepublic.com/article/understanding-sql-servers-with-recompile-option/5662581
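A minimal sketch of the statement-level form, which recompiles just the paging SELECT rather than the whole procedure (the procedure-level form is WITH RECOMPILE in the CREATE PROCEDURE header):
-- Inside the proc, append the hint to the end of the final SELECT so a
-- fresh plan is built for whatever @PageNumber/@PageSize combination arrives.
SELECT
    SK.rn,
    J.JobNumber
    -- ...remaining columns as in the original procedure
FROM SelectedKeys SK
JOIN vw_Jobs_List J ON J.JobNumber = SK.JobNumber
ORDER BY SK.JobNumber DESC
OPTION (RECOMPILE);
The trade-off is a compile on every execution, which is usually acceptable for a paging query but worth keeping in mind on a busy server.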
I have run into similar problems where the execution plan for a stored procedure works great for a while, but then a new plan gets generated because the options changed. So it will be "optimized" for one case and then do table scans for another. Here is what I have tried in the past:
Re-execute the stored procedure to calculate a new execution plan and then keep an eye on it.
Break the stored procedure up into separate stored procedures, one per option, so that each can be optimized; the overall stored procedure then simply calls each "optimized" stored procedure.
Bring the records into an object and then perform all of the "funky trickery" in code; that also gives you the option to cache the results.
Obviously options #2 and #3 are better than option #1. Honestly, I'm finding option #3 to be the best bet in most cases.
I just thought of another option, #4: instead of performing your "inner selects" in one query, you could put the results of the inner selects into temporary tables and then JOIN on those results. A rough sketch of this follows. I would still push for option #3 if possible, but I understand that sometimes you just need to keep working the stored procedure until it "works".
Good luck.

Why does this query timeout? V2

This question is a followup to This Question
The solution, clearing the execution plan cache, seemed to work at the time, but I've been running into the same problem over and over again, and clearing the cache no longer seems to help. There must be a deeper problem here.
I've discovered that if I remove the .Distinct() from the query, it returns rows (with duplicates) in about 2 seconds. However, with the .Distinct() it takes upwards of 4 minutes to complete. There are a lot of rows in the tables, and some of the where clause fields do not have indexes. However, the number of records returned is fairly small (a few dozen at most).
The confusing part is that if I take the SQL generated by the Linq query, via Linqpad, and execute that SQL in SQL Server Management Studio (including the DISTINCT), it executes in about 3 seconds.
What is the difference between the Linq query and the executed SQL?
I have a short-term workaround, which is to return the set without .Distinct() as a List and then use .Distinct() on the list; this takes about 2 seconds. However, I don't like doing SQL Server's work on the web server.
I want to understand WHY the Distinct is 2 orders of magnitude slower in Linq, but not SQL.
UPDATE:
When executing the code via Linq, the SQL profiler shows this code, which is basically an identical query:
sp_executesql N'SELECT DISTINCT [t5].[AccountGroupID], [t5].[AccountGroup]
AS [AccountGroup1]
FROM [dbo].[TransmittalDetail] AS [t0]
INNER JOIN [dbo].[TransmittalHeader] AS [t1] ON [t1].[TransmittalHeaderID] =
[t0].[TransmittalHeaderID]
INNER JOIN [dbo].[LineItem] AS [t2] ON [t2].[LineItemID] = [t0].[LineItemID]
LEFT OUTER JOIN [dbo].[AccountType] AS [t3] ON [t3].[AccountTypeID] =
[t2].[AccountTypeID]
LEFT OUTER JOIN [dbo].[AccountCategory] AS [t4] ON [t4].[AccountCategoryID] =
[t3].[AccountCategoryID]
LEFT OUTER JOIN [dbo].[AccountGroup] AS [t5] ON [t5].[AccountGroupID] =
[t4].[AccountGroupID]
LEFT OUTER JOIN [dbo].[AccountSummary] AS [t6] ON [t6].[AccountSummaryID] =
[t5].[AccountSummaryID]
WHERE ([t1].[TransmittalEntityID] = @p0) AND ([t1].[DateRangeBeginTimeID] = @p1) AND
([t1].[ScenarioID] = @p2) AND ([t6].[AccountSummaryID] = @p3)',N'@p0 int,@p1 int,
@p2 int,@p3 int',@p0=196,@p1=20100101,@p2=2,@p3=0
UPDATE:
The only difference between the queries is that Linq executes it with sp_executesql and SSMS does not, otherwise the query is identical.
UPDATE:
I have tried various transaction isolation levels to no avail. I've also set ARITHABORT to try to force a recompile when it executes, and it made no difference.
The bad plan is most likely the result of parameter sniffing: http://blogs.msdn.com/b/queryoptteam/archive/2006/03/31/565991.aspx
Unfortunately there is not really any good universal way (that I know of) to avoid that with L2S. context.ExecuteCommand("sp_recompile ...") would be an ugly but possible workaround if the query is not executed very frequently.
Changing the query around slightly to force a recompile might be another one.
Moving parts (or all) of the query into a view*, function*, or stored procedure* DB-side would be yet another workaround.
 * = where you can use local params (func/proc) or optimizer hints (all three) to force a 'good' plan
Btw, have you tried updating statistics for the tables involved? SQL Server's auto-update statistics doesn't always do the job, so unless you have a scheduled job for it, it might be worth scripting and scheduling UPDATE STATISTICS; tweaking the sample size up and down as needed can also help.
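For example, something like this against the tables in the profiled query (FULLSCAN is the slow-but-thorough option; sp_updatestats is the coarse catch-all):
-- Refresh statistics on the tables involved in the slow query.
UPDATE STATISTICS dbo.TransmittalDetail WITH FULLSCAN;
UPDATE STATISTICS dbo.TransmittalHeader WITH FULLSCAN;

-- Or refresh every table in the database with the default sampling.
EXEC sp_updatestats;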
There may be ways to solve the issue by adding* (or dropping*) the right indexes on the tables involved, but without knowing the underlying db schema, table size, data distribution etc that is a bit difficult to give any more specific advice on...
 * = Missing and/or overlapping/redundant indexes can both lead to bad execution plans.
The SQL that Linqpad gives you may not be exactly what is being sent to the DB.
Here's what I would suggest:
Run SQL Profiler against the DB while you execute the query. Find the statement which corresponds to your query
Paste the whole statement into SSMS, and enable the "Include Actual Execution Plan" option.
Post the resulting plan here for people to dissect.
Key things to look for:
Table Scans, which usually imply that an index is missing
Wide arrows in the graphical plan, indicating lots of intermediary rows being processed.
If you're using SQL 2008, viewing the plan will often tell you if there are any indexes missing which should be added to speed up the query.
Also, are you executing against a DB which is under load from other users?
At first glance there are a lot of joins, but I can only see one thing to reduce the number right away without having the schema in front of me: it doesn't look like you need AccountSummary.
[t6].[AccountSummaryID] = @p3
could be
[t5].[AccountSummaryID] = @p3
The return values come from the [t5] table. [t6] is only used to filter on that one parameter, which looks like the foreign key from [t5] to [t6] and is therefore already present in [t5]. So you can remove the join to [t6] altogether. Or am I missing something?
Are you sure you want to use LEFT OUTER JOIN here? This query looks like it should probably be using INNER JOINs, especially because you are taking the columns that are potentially NULL and then doing a distinct on it.
Check that you have the same Transaction Isolation level between your SSMS session and your application. That's the biggest culprit I've seen for large performance discrepancies between identical queries.
Also, there are different connection properties in use when you work through SSMS than when the query is executed from your application or from LinqPad. Check the connection properties of your SSMS session against those of your application's connection and you should see the differences. All other things being equal, that could be the difference. Keep in mind that you are executing the query through two different applications that can have two different configurations and could even be using two different database drivers. If the queries are the same, those are the only differences I can see.
On a side note if you are hand-crafting the SQL, you may try moving the conditions from the WHERE clause into the appropriate JOIN clauses. This actually changes how SQL Server executes the query and can produce a more efficient execution plan. I've seen cases where moving the filters from the WHERE clause into the JOINs caused SQL Server to filter the table earlier in the execution plan and significantly changed the execution time.
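A generic illustration of that rewrite, using a couple of the tables from the profiled query above (for an INNER JOIN the two forms return the same rows, so only the plan can change; for OUTER JOINs moving a predicate into the ON clause changes the results, so it is not a blind find-and-replace):
-- Filter expressed in the WHERE clause:
SELECT h.TransmittalHeaderID, d.LineItemID
FROM dbo.TransmittalHeader AS h
JOIN dbo.TransmittalDetail AS d ON d.TransmittalHeaderID = h.TransmittalHeaderID
WHERE h.ScenarioID = @p2;

-- Same filter expressed in the JOIN condition; the optimizer may now
-- filter TransmittalHeader before the join instead of after it.
SELECT h.TransmittalHeaderID, d.LineItemID
FROM dbo.TransmittalHeader AS h
JOIN dbo.TransmittalDetail AS d
  ON d.TransmittalHeaderID = h.TransmittalHeaderID
 AND h.ScenarioID = @p2;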

SQL Server 2005 - Pass In Name of Table to be Queried via Parameter

Here's the situation. Due to the design of the database I have to work with, I need to write a stored procedure in such a way that I can pass in the name of the table to be queried against, if at all possible. The program in question does its processing by jobs, and each job gets its own table created in the database, i.e. table-jobid1, table-jobid2, table-jobid3, etc. Unfortunately, there's nothing I can do about this design; I'm stuck with it.
However, now, I need to do data mining against these individualized tables. I'd like to avoid doing the SQL in the code files at all costs if possible. Ideally, I'd like to have a stored procedure similar to:
SELECT *
FROM @TableName AS tbl
WHERE @Filter
Is this even possible in SQL Server 2005? Any help or suggestions would be greatly appreciated. Alternate ways to keep the SQL out of the code behind would be welcome too, if this isn't possible.
Thanks for your time.
The best solution I can think of is to build your SQL in the stored proc, such as:
DECLARE @query nvarchar(max)
SET @query = 'SELECT * FROM ' + @TableName + ' AS tbl WHERE ' + @Filter
EXEC(@query)
not an ideal solution probably, but it works.
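If you go this route it may be worth building the statement a little more defensively. This is just a sketch on top of the @TableName/@Filter parameters from the question; QUOTENAME protects the table-name part (and handles the hyphenated table-jobid names), but the filter string would still need to be validated or parameterized separately:
DECLARE @query nvarchar(max);

-- QUOTENAME wraps the identifier in [brackets] so only a real table
-- name can end up in the FROM clause.
SET @query = N'SELECT * FROM ' + QUOTENAME(@TableName) + N' AS tbl WHERE ' + @Filter;

EXEC sp_executesql @query;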
The best answer I can think of is to build a view that unions all the tables together, with an id column in the view telling you where the data in the view came from. Then you can simply pass that id into a stored proc which will go against the view. This is assuming that the tables you are looking at all have identical schema.
example:
create view test1 as
select *, 'tbl1' as src
from [job-1]
union all
select *, 'tbl2' as src
from [job-2]
union all
select *, 'tbl3' as src
from [job-3]
Now you can select * from test1 where src = 'tbl3' and you will only get records from the table job-3
This would be a meaningless stored proc. Select from some table using some parameters? You are basically defining the entire query again in whatever you are using to call this proc, so you may as well generate the SQL yourself.
The only reason I would write a dynamic SQL-building proc is if you want to do something that you can change without redeploying your codebase.
But in this case you are just SELECT *'ing. You can't define the columns, WHERE clause, or ORDER BY differently, since you are trying to use it for multiple tables, so there is no meaningful change you could make to it.
In short: it's not even worth doing. Just write your table-specific sprocs, or build your SQL in strings in your code (but make sure it's parameterized).

Sql Processing vs. ASP.NET Runtime processing

I know in general it is a good practice to move as much processing as possible from Sql Server to the application (in my case ASP.NET). However, what if the processing at the application level means passing 30+ extra parameters to SQL Server? In that case, is it worth moving the processing to SQL Server?
Here's the specific dilemma I am facing - which procedure will offer better performance overall?
CREATE PROCEDURE MyProc1
@id int
AS BEGIN
UPDATE MyTable
SET somevalue1 = somevalue1 + 1,
somevalue2 = somevalue2 + 1,
somevalue3 = somevalue3 + 1,
...
somevalueN = somevalueN + 1
WHERE id = @id
END
Or
CREATE PROCEDURE MyProc2
@id int,
@somevalue1 int,
@somevalue2 int,
@somevalue3 int,
...
@somevalueN int
AS BEGIN
UPDATE MyTable
SET somevalue1 = @somevalue1,
somevalue2 = @somevalue2,
somevalue3 = @somevalue3,
...
somevalueN = @somevalueN
WHERE id = @id
END
I am using managed hosting, but I guess it is valid to assume that SQL Server and the ASP.NET runtime reside on the same machine, so the transfer of data between the two would probably be fast/negligible (or is it?).
The 30 parameters are basically totalNumberOfRatings values for different items. Whenever a user of my web app gives a new rating to item N, totalNumberOfRatingsItemN is incremented by 1. In most cases a rating will be given to several items (but not necessarily all of them), so totalNumberOfRatings is not the same for different items.
I'll make a wild guess that you've read that SQL Server is not the appropriate place to do math. I think SQL is quite appropriate for doing a handful of arithmetic operations and aggregate operations, like SUM. Only measuring both implementations in realistic load scenarios can say for sure.
What SQL isn't appropriate for are things like multiple regression and matrix inversion.
Moving the incrementing to the application tier sounds to me like a micro-optimization that is unlikely to pay for itself. Implement the logic to increment the values in whichever tier keeps the code readable and maintainable.
in general it is a good practice to move as much processing as possible from Sql Server to the application
Says who? SQL Server is good at some things, not so good at others. Use your judgement.
and by "application", do you mean
the web page
an application-service layer
an enterprise service bus component
a web service
something else?
There are many choices...
As to your specific question, incrementing 30 fields seems ludicrous. Passing 30 parms seems excessive. Some context is in order...
Is performance that critical in this instance? If it's not, I would personally go for readability and maintainability, which would mean implementing it following the same standards as the rest of the system.
I recommend normalizing your table to eliminate the 30+ columns of ratings. Instead, consider having a table with a single rating column and having one row for each different item, like so:
CREATE TABLE ItemRatings
(
myTableId INT NOT NULL, -- Foreign key
itemNumber INT NOT NULL,
itemRating INT
);
You could then increment a bunch of ratings at once with a query like
UPDATE ItemRatings
SET itemRating = itemRating + 1
WHERE myTableId = @id
AND itemNumber IN (@n1, @n2, @n3, ...)
Something like that, anyways. I'm not exactly clear on how your tables function since you anonymized all the names, but hopefully you get the gist of what I'm saying.
That said, if you don't want to or can't normalize your table this way, I agree with the other answers. It's six of one, half a dozen of the other. Personally, I'd choose the first way and let the database do all the heavy lifting. Of course, as in all things, if performance is ultra-critical then the real answer is not to guess; try both ways and get out your stopwatch!
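If it does come down to the stopwatch, SQL Server will give you the numbers directly; a quick way to compare the two procs from a Management Studio window (the parameter values below are just placeholders):
-- Report CPU/elapsed time and logical reads per statement.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

EXEC dbo.MyProc1 @id = 1;
-- EXEC dbo.MyProc2 @id = 1, @somevalue1 = ... (full parameter list)

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;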
