I have been looking for some advice for a while on how to handle a project I am working on, but to no avail. I am pretty much on my fourth iteration of improving an "application" I am working on; the first two times were in Excel, the third time in Access, and now in Visual Studio. The field is manufacturing.
The basic idea is I am taking read-only data from a massive Sybase server, filtering it and creating much smaller tables in Access daily (using delete and append queries) and then doing a bunch of stuff. More specifically, I use a series of queries to either combine data from multiple tables or group data in specific ways (aggregate functions), and then I place this data into a table (so I can sort and manipulate data using DAO.Recordset and run multiple custom algorithms). This process is then repeated multiple times throughout the database until a set of relevant tables is created.
Many times I will create a field in a query with a placeholder value such as 1.1, so that when I append it to a table I can store information from the algorithms in that field. So as the process continues, the number of fields in the tables changes.
The overall application consists of 4 "back-end" databases linked together on a shared drive, with various output (either front-end access applications or Excel).
So my question is: is this essentially how many data-driven applications that solve problems work? Each back-end database is updated with fresh data daily; updating takes around 10 seconds each for three of them and 2 minutes for the fourth.
Project objectives: I want to move (and am moving) to SQL Server soon. The front end will be a web application (I know basic web development and like the administration flexibility), and Visual Studio will be the IDE, with C#/.NET.
Should these algorithms be run "inside the database," or using a series of C# functions on each server request? I know you're not supposed to store data in a database unless it is an actual data point, and in Access I have many columns that just hold calculations from algorithms in VBA.
The truth is, I have seen multiple professional Access applications, and have never seen one that has the complexity or does even close to what mine does (for better or worse). But I know some professional software applications are 1000 times better than mine.
So please, please, please give me a suggestion of some sort. I have been completely on my own and need some guidance on how to approach this project the right way.
If you are going to SQL Server, or any other full client-server DBMS for that matter, the trick (generally) is to do as much on the server as possible.
Depends on how you've written the code really. In general the optimisations for a desktop are the inverse of those for a server.
For instance, say you have a Find Customer facility.
In a desktop you'd get the entire table and then use say Locate to find the record by name, post/zip code etc. Because effectively your application is both server and client.
In a Client Server set up, you pass customer Name etc to the DBMS, and let it find the customer(s) that matched and pass only those back.
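As a rough sketch of that client-server lookup: the search terms go to the database as parameters and only the matching rows come back. This uses Python's built-in sqlite3 purely as a stand-in for a real server; the customers table, its columns and the find_customers helper are made up for illustration.

```python
import sqlite3


def make_customer_db():
    """Build a throwaway in-memory database with a few customers (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, postcode TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers (name, postcode) VALUES (?, ?)",
        [
            ("Smith & Sons", "AB1 2CD"),
            ("Smithson Ltd", "ZZ9 9ZZ"),
            ("Jones Plc", "AB1 2CD"),
        ],
    )
    conn.commit()
    return conn


def find_customers(conn, name_prefix, postcode=None):
    """Return only the matching customers; the filtering happens in the database."""
    sql = "SELECT id, name, postcode FROM customers WHERE name LIKE ?"
    params = [name_prefix + "%"]
    if postcode is not None:
        # Optional extra criterion, still passed as a parameter.
        sql += " AND postcode = ?"
        params.append(postcode)
    sql += " ORDER BY id"
    return conn.execute(sql, params).fetchall()
```

The application never sees the rows that did not match, which is the whole point: the table can grow without the client having to pull more data.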
So in your situation, forgetting the web application bit, you've got to look at what your application does and ask: can I write this in SQL?
So if you had
// get orders
foreach (Order order in clientOrders)
{
    // use the loop variable "order", not the type name "Order"
    if (order.Discount > 0)
    {
        order.Value = order.ItemCount * order.ItemPrice * order.Discount;
    }
}
// save orders
you'd replace that with a query that did
Update Orders Set Value = ItemCount * ItemPrice * Discount
Where ClientID = @ClientID and Discount > 0
Let the server do the work on the server instead of pulling and pushing loads of data into and out of an application.
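A minimal runnable version of the same idea, using Python's built-in sqlite3 as a stand-in for SQL Server. The orders table and the apply_discounts helper are assumptions for the sketch; the UPDATE itself mirrors the one above.

```python
import sqlite3


def make_orders_db():
    """Build a throwaway in-memory orders table (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, client_id INTEGER, "
        "item_count INTEGER, item_price REAL, discount REAL, value REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (client_id, item_count, item_price, discount, value) "
        "VALUES (?, ?, ?, ?, ?)",
        [(1, 2, 10.0, 0.5, None), (1, 3, 4.0, 0.0, None), (2, 1, 8.0, 0.5, None)],
    )
    conn.commit()
    return conn


def apply_discounts(conn, client_id):
    """Recalculate order values for one client with a single set-based UPDATE."""
    with conn:  # commits on success, rolls back if the statement fails
        cur = conn.execute(
            "UPDATE orders SET value = item_count * item_price * discount "
            "WHERE client_id = ? AND discount > 0",
            (client_id,),
        )
    return cur.rowcount  # number of rows the database changed
```

One statement crosses the wire and no order rows are fetched into the application at all, however many orders the client has.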
If I were you though, I'd do either the SQL Server piece or the web server piece, not both at the same time. In terms of client-server there's a lot of overlap. Neither one precludes the other, but a lot of times you'll be able to use either to solve the same problem in slightly different ways.
As more details emerge, it appears one piece of your application involves storing 15K rows in your Access db file(s) so that you may later perform computations on those data.
However, it's not clear why you feel those data must be stored in Access to perform the computations.
Ideally, we would create a query to ask the server to perform those calculations. If that's not possible with your server's capabilities, or so computationally intensive as to place an unacceptable processing load on the server, you still should not need to download all the raw data to Access in order to use it for your computations. Instead, you could open a recordset populated by a query on the server, move through the recordset rows to perform your computation and store only the results in your Access table (via a second recordset).
Public Sub next_level_outline()
    Dim db As DAO.Database
    Dim rsLocal As DAO.Recordset
    Dim rsServer As DAO.Recordset
    Dim varLastValue As Variant
    Set db = CurrentDb
    ' Local table that receives only the computed results.
    Set rsLocal = db.OpenRecordset("AccessTable", dbOpenTable, dbAppendOnly)
    ' Read-only snapshot of the rows returned by the query on the server.
    Set rsServer = db.OpenRecordset("ServerQuery", dbOpenSnapshot)
    Do While Not rsServer.EOF
        rsLocal.AddNew
        ' varLastValue is still Empty on the first pass through the loop.
        rsLocal!computed_field = YourAlgorithm(varLastValue)
        rsLocal.Update
        ' Remember this row's value for the next iteration.
        varLastValue = rsServer!indicator_field.Value
        rsServer.MoveNext
    Loop
    rsLocal.Close
    Set rsLocal = Nothing
    rsServer.Close
    Set rsServer = Nothing
    Set db = Nothing
End Sub
That is only a crude outline. Much depends on the nature of YourAlgorithm(). From a comment, I gathered it has something to do with a previous row ... so I included varLastValue as a placeholder.
Part of your approach was to filter 2 million source rows to the 15K rows which apply to your selected factory. Do that with a WHERE clause in ServerQuery:
WHERE factory_id = 'foo'
If the row ordering is important for YourAlgorithm(), include an ORDER BY clause in ServerQuery.
The driver for this suggestion is to avoid redundantly storing data in Access. And, if you can't eliminate the redundancy completely, at least limit the extent of it.
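Here is the same outline as a small runnable sketch, with Python's sqlite3 standing in for both the server and the Access file. Everything named here (the readings and results tables, change_from_previous as a stand-in for YourAlgorithm()) is hypothetical; the real algorithm may need more than the previous row's value.

```python
import sqlite3


def make_demo_connections():
    """Return (source, local) in-memory databases standing in for server and Access."""
    source = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE readings (reading_id INTEGER PRIMARY KEY, factory_id TEXT, indicator REAL)"
    )
    source.executemany(
        "INSERT INTO readings (reading_id, factory_id, indicator) VALUES (?, ?, ?)",
        [(1, "foo", 10.0), (2, "bar", 99.0), (3, "foo", 12.5), (4, "foo", 11.5)],
    )
    source.commit()
    local = sqlite3.connect(":memory:")
    local.execute("CREATE TABLE results (reading_id INTEGER PRIMARY KEY, computed REAL)")
    return source, local


def change_from_previous(current, previous):
    """Hypothetical stand-in for YourAlgorithm(): difference from the prior row."""
    return 0.0 if previous is None else current - previous


def compute_and_store(source, local, factory_id, algorithm):
    """Stream one factory's rows in order and keep only the computed results."""
    rows = source.execute(
        "SELECT reading_id, indicator FROM readings "
        "WHERE factory_id = ? ORDER BY reading_id",
        (factory_id,),
    )
    previous = None  # there is no previous row for the first reading
    with local:  # one transaction for all the result rows
        for reading_id, indicator in rows:
            local.execute(
                "INSERT INTO results (reading_id, computed) VALUES (?, ?)",
                (reading_id, algorithm(indicator, previous)),
            )
            previous = indicator
```

The raw rows are read once and discarded; only the computed column ends up in local storage, which is the redundancy reduction described above.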
You may then find you can consolidate the Access storage into a single db file rather than four. The single db file could simplify other aspects of your application and should also offer improved performance.
I think you should make certain you've thoroughly addressed this issue before you move on to the next stage of your application's evolution. I don't believe this challenge will become any easier in ASP.NET.
The application you describe appears to be an example of "ETL" - extract, transform, load.
It was one of the first projects I ever worked on as a professional programmer - and it's distinctly non-trivial. There are a bunch of tools you can use to help with this process (including one from Microsoft), but they are aimed mostly at populating a data warehouse - it's not clear that's what you're building, so that may not be hugely useful. Nevertheless, read through the Wikipedia article, and perhaps look at some of the ETL tools to get some ideas.
If you go your own way, I'd suggest writing a Windows service to automatically run your ETL process. I assume you run the import on some kind of trigger (nightly, hourly, when the manufacturing system sends you a message, or whatever); write your Windows service to poll for this trigger.
I'd then execute whatever database commands from the service you need to move the data around, run your algorithms etc; pay attention to error handling and logging (services don't have a user interface, so you have to write errors to the system log and make sure someone is paying attention). Consider wrapping your database code in stored procedures - it makes them easier to invoke from the service.
It sounds like this is a fairly complex app; pay attention to code quality and consider unit tests (though it's hard to unit test database code). Buy "Code Complete" by Steve McConnell and read it cover to cover if you're not a professional coder.
Related
I want to implement a views counter like most forums, Youtube and several others have. So every time a user reads an article, that is stored and remembered. I also want to know who looked at the article.
My question is: How do you implement this efficiently? What is the best practice?
One way would be to call a stored procedure for every view, but that would result in a lot of unneeded calls to the database.
Another way would be to store this to some global application object, and then store in DB every 5 minutes or so (and can you even do that in a good way?)
What's the best way to do this?
Database operations are surprisingly cheap and really are not worth worrying about. In the event that a DB operation was even marginally expensive then you can always delegate the blocking operation to a new thread thus freeing-up your page-generation thread (you can trivially do this for UPDATE and INSERT operations that return nothing from the database - they are inconsequential).
Sprocs aren't really in-fashion right now - the performance advantage they might have had from pre-computed execution plans is almost eliminated because modern servers cache plans from all previous queries, and for trivial SELECT, INSERT, and UPDATEs you begin to suffer from increased code complexity. There's nothing wrong with inline SQL commands now.
Anyway, back on-topic and in summary: your assumptions are wrong. There is nothing wrong with running UPDATE Pages SET ViewCount = ViewCount + 1 WHERE PageId = @pageId on every page-view. There is also nothing wrong with doing this either: INSERT INTO UserPageviews (UserId, PageId, DateTime) VALUES ( @userId, @pageId, NOW() ). Both operations are very cheap and will execute in under 2-3 milliseconds on even an old and aged database server.
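For what it's worth, here is a small sketch of running both statements on every view, wrapped in one transaction so the counter and the log stay in step. It uses Python's sqlite3 as a stand-in database; the table layout and the record_view helper are assumptions, not anything from the question.

```python
import sqlite3
from datetime import datetime, timezone


def make_views_db():
    """Build a throwaway in-memory database with one page (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages (page_id INTEGER PRIMARY KEY, "
        "view_count INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE user_pageviews (user_id INTEGER, page_id INTEGER, viewed_at TEXT)"
    )
    conn.execute("INSERT INTO pages (page_id) VALUES (1)")
    conn.commit()
    return conn


def record_view(conn, user_id, page_id):
    """Bump the page counter and log who viewed it, in one transaction."""
    viewed_at = datetime.now(timezone.utc).isoformat()
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE pages SET view_count = view_count + 1 WHERE page_id = ?",
            (page_id,),
        )
        conn.execute(
            "INSERT INTO user_pageviews (user_id, page_id, viewed_at) VALUES (?, ?, ?)",
            (user_id, page_id, viewed_at),
        )
```

Calling record_view(conn, user_id, page_id) from the page handler is all there is to it; the counter gives the cheap total and the log answers "who looked at this".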
"Another way would be to store this to some global application object, and then store in DB every 5 minutes or so (and can you even do that in a good way?)"
This method is very prone to data loss unless you use a durable queueing mechanism (like MSMQ). Unless you anticipate massive traffic, I wouldn't even think about this approach.
Writes of this nature are inexpensive and hundreds of operations per second are not a big deal. I recently built a comment/rating framework that achieves throughput of 3000+ complete transactions per second just on my local all-in-one workstation. This included processing the request, validation, and creating multiple records within a transaction.
As a note, you should take steps to ensure that your statistics data isn't vulnerable to artificial inflation/manipulation. This part of the process will probably be more complex than the view tracking itself. For example, a user should not be able to sit and hold down the F5 key and inflate the number of views on their video. Nor should these values be manipulable by HTTP (e.g. creating a small script to send an AJAX request over and over).
This suggests that each INSERT would be preceded by a SELECT to ensure that the same user ID or IP hadn't already been recorded in some period of time. Of course, this isn't foolproof (unless you invest a great deal of effort), but it errs on the side of conservatism which is usually a good approach.
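A rough sketch of that SELECT-then-INSERT check, again with Python's sqlite3 as a stand-in. The 30-minute window, the table layout and the record_unique_view helper are all assumptions; timestamps are plain Unix seconds passed in by the caller so the logic is easy to follow.

```python
import sqlite3

WINDOW_SECONDS = 30 * 60  # assumed cool-down; pick whatever suits the site


def make_dedup_db():
    """Build a throwaway in-memory database with one page (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages (page_id INTEGER PRIMARY KEY, "
        "view_count INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE user_pageviews (user_id INTEGER, page_id INTEGER, viewed_at INTEGER)"
    )
    conn.execute("INSERT INTO pages (page_id) VALUES (1)")
    conn.commit()
    return conn


def record_unique_view(conn, user_id, page_id, now):
    """Count a view only if this user has no view of the page inside the window.

    `now` is a Unix timestamp in seconds. Returns True if the view was counted.
    """
    with conn:
        recent = conn.execute(
            "SELECT 1 FROM user_pageviews "
            "WHERE user_id = ? AND page_id = ? AND viewed_at > ? LIMIT 1",
            (user_id, page_id, now - WINDOW_SECONDS),
        ).fetchone()
        if recent is not None:
            return False  # same user, same page, too soon: ignore it
        conn.execute(
            "INSERT INTO user_pageviews (user_id, page_id, viewed_at) VALUES (?, ?, ?)",
            (user_id, page_id, now),
        )
        conn.execute(
            "UPDATE pages SET view_count = view_count + 1 WHERE page_id = ?",
            (page_id,),
        )
    return True
```

Two near-simultaneous requests from the same user could still both pass the SELECT on a busy server, so treat this as the conservative first line of defence rather than a guarantee.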
"One way would be to call a stored procedure for every view, but that would result in a lot of unneeded calls to the database."
I regularly have to remind myself (and other developers) to not fear the database. People (me included) sometimes go to great lengths to avoid a few simple database calls. Keep your tables narrow and well-indexed, and operations like this are faster than you might think.
We've developed a system with a search screen that looks a little something like this:
[Screenshot of the search screen not available; source: nsourceservices.com]
As you can see, there is some fairly serious search functionality. You can use any combination of statuses, channels, languages, campaign types, and then narrow it down by name and so on as well.
Then, once you've searched and the leads pop up at the bottom, you can sort the headers.
The query uses ROWNUM to do a paging scheme, so we only return something like 70 rows at a time.
The Problem
Even though we're only returning 70 rows, an awful lot of IO and sorting is going on. This makes sense of course.
This has always caused some minor spikes to the Disk Queue. It started slowing down more when we hit 3 million leads, and now that we're getting closer to 5, the Disk Queue pegs for up to a second or two straight sometimes.
That would actually still be workable, but this system has another area with a time-sensitive process (let's say for simplicity that it's a web service) that needs to serve up responses very quickly or it will cause a timeout on the other end. The Disk Queue spikes are causing that part to bog down, which is causing timeouts downstream. The end result is actually dropped phone calls in our automated VoiceXML-based IVR, and that's very bad for us.
What We've Tried
We've tried:
Maintenance tasks that reduce the number of leads in the system to the bare minimum.
Added the obvious indexes to help.
Ran the index tuning wizard in profiler and applied most of its suggestions. One of them was going to more or less reproduce the entire table inside an index so I tweaked it by hand to do a bit less than that.
Added more RAM to the server. It was a little low but now it always has something like 8 gigs idle, and the SQL server is configured to use no more than 8 gigs, however it never uses more than 2 or 3. I found that odd. Why isn't it just putting the whole table in RAM? It's only 5 million leads and there's plenty of room.
Pored over query execution plans. I can see that at this point the indexes seem to be mostly doing their job -- about 90% of the work is happening during the sorting stage.
Considered partitioning the Leads table out to a different physical drive, but we don't have the resources for that, and it seems like it shouldn't be necessary.
In Closing...
Part of me feels like the server should be able to handle this. Five million records is not so many given the power of that server, which is a decent quad core with 16 gigs of ram. However, I can see how the sorting part is causing millions of rows to be touched just to return a handful.
So what have you done in situations like this? My instinct is that we should maybe slash some functionality, but if there's a way to keep this intact that will save me a war with the business unit.
Thanks in advance!
Database bottlenecks can frequently be improved by improving your SQL queries. Without knowing what those look like, consider creating an operational data store or a data warehouse that you populate on a scheduled basis.
Sometimes flattening out your complex relational databases is the way to go. It can make queries run significantly faster, and make it a lot easier to optimize your queries, since the model is very flat. That may also make it easier to determine if you need to scale your database server up or out. A capacity and growth analysis may help to make that call.
Transactional/highly normalized databases are not usually as scalable as an ODS or data warehouse.
Edit: Your ORM may support optimizations of its own that are worth looking into, rather than only optimizing the queries it sends to your database. Bypassing your ORM altogether for the reports could be one way to get full control over your queries and gain better performance.
Consider how your ORM is creating the queries.
If you're having poor search performance perhaps you could try using stored procedures to return your results and, if necessary, multiple stored procedures specifically tailored to which search criteria are in use.
Determine which ad-hoc queries will most likely be run, or limit the search criteria with stored procedures. Can you summarize data? Treat this app like a data warehouse.
Create indexes on each column involved in the search to avoid table scans.
Create fragments on expressions.
Periodically reorg the data and update statistics as more leads are loaded.
Put the temporary files created by queries (result sets) on a RAM disk.
Consider migrating to a high-performance RDBMS engine like Informix OnLine.
Initiate another thread to start displaying N rows from the result set while the query continues to execute.
Hi guys,
I've developed a web application using ASP.NET and SQL Server 2005 for an attendance management system. As you would know, attendance activities are carried out daily. Inserting records one by one is a bad idea, I know. My questions are:
Is SqlBulkCopy the only option for me when using SQL Server, as I want to insert 100 records on a click event (i.e., inserting attendance for a class which contains 100 students)?
Should I insert the attendance for each class one at a time?
Unless you have a particularly huge number of attendance records you're adding each day, the best way to do it is with insert statements (I don't know why exactly you've got it into your head that this is a bad idea, our databases frequently handle tens of millions of rows being added throughout the day).
If your attendance records are more than that, you're on a winner, getting that many people to attend whatever functions or courses you're running :-)
Bulk copies and imports are generally meant for transferring sizable quantities of data, and I mean sizable as in the entire contents of a database to a disaster recovery site (and other things like that). I've never seen it used in the wild as a way to get small-size data into a database.
Update 1:
I'm guessing based on the comments that you're actually entering the attendance records one by one into your web app and 1,500 is taking too long.
If that's the case, it's not the database slowing you down, nor the web app. It's how fast you can type.
The solution to that problem (if indeed it is the problem) is to provide a bulk import functionality into your web application (or database directly if you wish but you're better off in my opinion having the application do all the work).
This is of course assuming that the data you're entering can be accessed electronically. If all you're getting is pieces of paper with attendance details, you're probably out of luck (OCR solutions notwithstanding), although if you could get multiple people doing it concurrently, you may have some chance of getting it done in a timely manner. Hiring 1,500 people to do one each should knock it over in about five minutes :-)
You can add functionality to your web application to accept the file containing attendance details and process each entry, inserting a row into your database for each. This will be much faster than manually entering the information.
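As an illustration only, here is roughly what "accept the file and insert a row for each entry" could look like, sketched in Python with sqlite3 as a stand-in database. The CSV column names, the attendance table and the import_attendance helper are all assumed; a real upload handler would also validate each row.

```python
import csv
import io
import sqlite3


def make_attendance_db():
    """Build a throwaway in-memory attendance table (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE attendance (student_id INTEGER, class_date TEXT, status TEXT)"
    )
    return conn


def import_attendance(conn, csv_text):
    """Insert one attendance row per CSV line, all in a single transaction."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (int(record["student_id"]), record["class_date"], record["status"])
        for record in reader
    ]
    with conn:  # a bad row rolls back the whole file rather than half-loading it
        conn.executemany(
            "INSERT INTO attendance (student_id, class_date, status) VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)
```

These are still ordinary INSERT statements, just driven by a file instead of by someone typing; for a class of 100 that is well within what plain inserts handle.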
Update 2:
Based on your latest information that it's taking too long to process the data after starting it from the web application: I'm not sure how much data you have, but 100 records should basically take no time at all.
Where the bottleneck is I can't say, but you should be investigating that.
I know in the past we've had long-running operations from a web UI where we didn't want to hold up the user. There are numerous solutions for that, two of which we implemented:
take the operation off-line (i.e., run it in the background on the server), giving the user an ID to check on the status from another page.
same thing but notify user with email once it's finished.
This allowed them to continue their work asynchronously.
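The first option (run it in the background and hand back an ID to check) can be sketched like this in Python. The start_job, job_status and wait_for_job names are invented for the example, and the status lives in process memory, which is an assumption that would not survive an application restart; a real system would more likely keep job status in a database table.

```python
import threading
import uuid

_jobs = {}     # job_id -> {"status": ..., "result": ...}
_threads = {}  # job_id -> Thread, kept so callers can wait if they want to
_lock = threading.Lock()


def start_job(work, *args):
    """Run work(*args) on a background thread; return an ID the user can poll."""
    job_id = uuid.uuid4().hex

    def runner():
        try:
            outcome = {"status": "done", "result": work(*args)}
        except Exception as exc:  # report the failure instead of losing it
            outcome = {"status": "failed", "result": str(exc)}
        with _lock:
            _jobs[job_id] = outcome

    thread = threading.Thread(target=runner, daemon=True)
    with _lock:
        _jobs[job_id] = {"status": "running", "result": None}
        _threads[job_id] = thread
    thread.start()
    return job_id


def job_status(job_id):
    """Return (status, result) for a known job, or None for an unknown ID."""
    with _lock:
        job = _jobs.get(job_id)
    return None if job is None else (job["status"], job["result"])


def wait_for_job(job_id, timeout=None):
    """Block until the job finishes (handy in tests; a status page would poll)."""
    with _lock:
        thread = _threads.get(job_id)
    if thread is not None:
        thread.join(timeout)
```

The click handler would call start_job(...) and return the ID straight away, and a separate status page would call job_status(job_id). The email variant is the same shape, with the notification sent at the end of runner().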
Ah, with your update I believe the problem is that you need to add a bunch of records after some click, but it takes too long.
I suggest one thing that won't help you immediately:
Reconsider your design slightly, as this doesn't seem particularly great (from a DB point of view). But that's just a general guess; I could be wrong.
The more helpful suggestion is:
Do this offline (via a windows service, or similar)
If it's taking too long, you want to do it asynchronously and then inform the user later that the operation has completed. They probably don't even need to be around; you just don't let them use whatever functions need the data until it's completed. Hope that idea makes sense.
The fastest general way is to use ExecuteNonQuery.
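// requires: using System.Data.Common;
internal static void FastInsertMany(DbConnection cnn)
{
    using (DbTransaction dbTrans = cnn.BeginTransaction())
    {
        using (DbCommand cmd = cnn.CreateCommand())
        {
            // Some providers (e.g. SqlClient) require the command to be enlisted explicitly.
            cmd.Transaction = dbTrans;
            // "?" is a positional placeholder; SqlClient needs a named parameter instead.
            cmd.CommandText = "INSERT INTO TestCase(MyValue) VALUES(?)";
            DbParameter Field1 = cmd.CreateParameter();
            cmd.Parameters.Add(Field1);
            // Reuse one command and one parameter; only the value changes per row.
            for (int n = 0; n < 100000; n++)
            {
                Field1.Value = n + 100000;
                cmd.ExecuteNonQuery();
            }
        }
        // A single commit for the whole batch is what makes this fast.
        dbTrans.Commit();
    }
}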
Even on a slow computer this should take far less than a second for 1500 inserts.
[reference]
Using ASP.NET and Windows Stack.
Purpose:
I've got a website that takes in over 1GB of data about every 6 months. So as you can tell, my database can become huge.
Problem:
Most hosting providers only offer databases in 1GB increments. This means that every time I go over another 1GB, I will need to create another database. I have absolutely no experience with this type of setup and I'm looking for some advice on what to do.
Wondering:
Do I move the membership stuff over to a separate database? This still won't solve much because of the size of the other data I have.
Do I archive data into another database? If I do, how do I allow users to access it?
If I split the data between two databases, do I name the tables the same?
I query all my data with LINQ. So establishing a few different connections wouldn't be a horrible thing.
Is there a hosting provider that anyone knows of that can scale their databases?
I just want to know what to do? How can I solve this dilemma? I don't have the advertising dollars coming in to spend more than $50 a month so far...
While http://www.ultimahosts.net/windows/vps/ seems to offer the best solution for the best price, they still split the databases up. So where do I go from here?
Again, I am a total amateur when it comes to multiple databases. I've only used one at a time.
I'd be genuinely surprised if they actually impose a hard 1GB per DB limit and create a new one for each additional GB, but on the assumption that that actually is the case -
Designate a particular database as your master database. This is the only one your app will directly connect to.
Create a clone of all the tables you'll need in your second (and third, fourth etc) databases.
Within your master database, create a view that does a UNION on the tables as a cross-DB query - SELECT * FROM Master..TableName UNION SELECT * FROM DB2..TableName UNION SELECT * FROM DB3..TableName
For writing, you'll need to use sprocs to locate the relevant records and update them, but you shouldn't have a major problem there. In principle you could extend the view above to return which DB the record was in if you wanted.
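To make the idea concrete, here is a small sketch using Python's sqlite3, where ATTACH DATABASE plays the role of the second database. It deliberately differs from the view above in two ways: it uses UNION ALL rather than UNION (so no de-duplication pass), and it adds a source_db column so writes can be routed. The events table, the helper names and the assumption that IDs are unique across databases are all mine.

```python
import sqlite3


def open_split_databases():
    """Open a 'master' database, attach a second one, and build the union view."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS db2")
    for schema in ("main", "db2"):
        conn.execute(
            f"CREATE TABLE {schema}.events (id INTEGER PRIMARY KEY, payload TEXT)"
        )
    # SQLite only lets TEMP views span attached databases.
    conn.execute(
        "CREATE TEMP VIEW all_events AS "
        "SELECT id, payload, 'main' AS source_db FROM main.events "
        "UNION ALL "
        "SELECT id, payload, 'db2' AS source_db FROM db2.events"
    )
    # Demo rows: one in each database.
    conn.execute("INSERT INTO main.events (id, payload) VALUES (1, 'recent')")
    conn.execute("INSERT INTO db2.events (id, payload) VALUES (2, 'archived')")
    conn.commit()
    return conn


def update_payload(conn, event_id, payload):
    """Find which database holds the row, then update it there."""
    row = conn.execute(
        "SELECT source_db FROM all_events WHERE id = ?", (event_id,)
    ).fetchone()
    if row is None:
        return False
    source_db = row[0]  # only ever 'main' or 'db2', both hard-coded in the view
    with conn:
        conn.execute(
            f"UPDATE {source_db}.events SET payload = ? WHERE id = ?",
            (payload, event_id),
        )
    return True
```

Reads go through all_events without caring where a row lives; update_payload is the stand-in for the sproc that locates the record first and then writes to the right database.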
Answering this question is very hard, for it requires knowing at least some basic facts about the data model, the way the data is queried, etc. Also, as suggested by rexem, a better understanding of the use model may allow using normalization to limit the growth (and, I'd add, may also allow introducing compression, if applicable).
I'm more puzzled at the general approach and business model (and I do understand the need to keep cost down with a startup application based on ad revenues). Wouldn't you be able to contract an amount that will fit your need for the next 6 months, then, when you start outgrowing this space, purchase additional storage (for an extra 6 months or a year; by then you may be "rich")? That may not even require anything on your end (depending on the way the hosting service manages racks etc.), or at worst may require you to copy the old database to the new (bigger) storage.
In this fashion, you wouldn't need to split the database in any artificial fashion, and hence focus on customer-oriented features, rather than optimizing queries that need to compile info from multiple servers.
I believe the solution is much simpler than that: even if your provider manages databases in 1 GB units, that does not mean you will have N databases of 1 GB each. It means that once you reach 1 GB, the database can be increased to 2 GB, 3 GB and so on...
Regards
Massimo
You would have multiple questions to answer:
It seems the current hosting provider cannot be very reliable if it is the way you say: they create a new database every time the initial one gets bigger than 1GB. This sounds strange... at least they should increase the storage for the current DB and let you know that you'll be charged more. Find other hosting solutions with better options.
Is there any information in your current DB that could be archived? That's a very important question, since you may be carrying "useless" data that could be archived into separate databases and queried only for special requests. As other colleagues have told you already, that is difficult for us to evaluate since we do not know the data model.
Can you split the data model into two totally different stores and replicate only the common information between them? You could use SQL Server Replication (http://technet.microsoft.com/en-us/library/ms151198.aspx) to maintain the same membership information between the databases.
If the data model cannot be split, then I do not see any practical reason to have multiple databases - just find a bigger storage solution.
You may want to look for a better hosting provider.
Even SQL Express supports a 4GB database, and it's free. Some hosts don't like using SQL Express in a shared environment, but disk space is so cheap these days that finding a plan that starts at or grows in chunks of more than 1GB should be pretty easy.
You should go for a Windows VPS solution. Most of the Windows VPS providers will offer SQL 2008 Web Edition that can support up to 10 GB of database space ...
Context
My current project is a large-ish public site (2 million pageviews per day) running a mixture of ASP Classic and ASP.NET with a SQL Server 2005 back-end. We're heavy on reads, with occasional writes and virtually no updates/deletes. Our pages typically concern a single 'master' object with a stack of dependent (detail) objects.
I like the idea of returning all the data required for a page in a single proc (and absolutely no unnecessary data). True, this requires a dedicated proc for such pages, but some pages receive double-digit percentages of our overall site traffic so it's worth the time/maintenance hit. We typically only consume multiple recordsets from our .NET code, using System.Data.SqlClient.SqlDataReader and its NextResult method. Oh, yeah, I'm not doing any updates/inserts in these procs either (except to table variables).
The question
SQL Server (2005) procs which return multiple recordsets are working well (in prod) for us so far, but I am a little worried that multi-recordset procs are my new favourite hammer that I'm hitting every problem (nail) with. Are there any multi-recordset SQL Server proc gotchas I should know about? Anything that's going to make me wish I hadn't used them? Specifically, anything about it affecting connection pooling, memory utilization etc.
Here's a few gotchas for multiple-recordset stored procs:
They make it more difficult to reuse code. If you're doing several queries, odds are you'd be able to reuse one of those queries on another page.
They make it more difficult to unit test. Every time you make a change to one of the queries, you have to test all of the results. If something changed, you have to dig through to see which query failed the unit test.
They make it more difficult to tune performance later. If another DBA comes in behind you to help performance improve, they have to do more slicing and dicing to figure out where the problems are coming from. Then, combine this with the code reuse problem - if they optimize one query, that query might be used in several different stored procs, and then they have to go fix all of them - which makes for more unit testing again.
They make error handling much more difficult. Four of the queries in the stored proc might succeed, and the fifth fails. You have to plan for that.
They can increase locking problems and incur load in TempDB. If your stored procs are designed in a way that needs repeatable reads, then the more queries you stuff into a stored proc, the longer it's going to take to run, and the longer it's going to take to return those results back to your app server. That increased time means higher contention for locks, and the more SQL Server has to store in TempDB for row versioning. You mentioned that you're heavy on reads, so this particular issue shouldn't be too bad for you, but you want to be aware of it before you reuse this hammer on a write-intensive app.
I think multi-recordset stored procedures are great in some cases, and it sounds like yours may be one of them.
The bigger your site gets (the more traffic), the more that 'extra' bit of performance is going to matter. If you can combine 2-3-4 calls (and possibly new connections) to the database into one, you could be cutting down your database hits by 4-6-8 million per day, which is substantial.
I use them sparingly, but when I have, I have never had a problem.
I would recommend having one stored procedure invoke several inner stored procedures that each return one result set.
create proc foo
as
execute foobar --returns one result
execute barfoo --returns one result
execute bar --returns one result
That way, when requirements change and you only need the 3rd and 5th result sets, you have an easy way to invoke them without adding new stored procedures and regenerating your data access layer. My current app returns all reference tables (e.g. the US states table) whether I want them or not. Worst is when you need to get a reference table and the only access is via a stored procedure that also runs an expensive query as one of its six result sets.