Process huge data and write to file for client to download - asp.net

I have a situation that requires selecting data from a database (MS SQL Server), processing it a little, and then providing it to the client to download (an export function). The problem is that the table contains a HUGE number of records to select (~1 billion).
Currently, I have two issues:
Selecting the data: SQL Server responds too slowly and usually hangs.
I used a StringBuilder to hold the file content before writing it out for the client to download, so with huge data IIS runs into memory problems and hangs.
I have been trying some solutions to deal with these, and I would appreciate any advice on a good way to handle huge data in this case.
My thoughts on a solution for each:
1st issue: select the data in pages, around 10,000 records at a time.
2nd issue: create a temporary file on the server, write the data to that file, and send it to the client once all the data is in the file (a rough sketch of both ideas follows below).
In your professional experience, are these solutions possible?
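A rough sketch of the two ideas combined - keyset paging with a SqlDataReader, appending each page to a temporary file on the server, then handing the finished file to the client with TransmitFile. The table name, columns, and connection string are placeholders, and the paging key is assumed to be an indexed bigint Id column:

// Sketch only: dbo.Records (Id, Name, Amount) and the connection handling are invented.
using System;
using System.Data.SqlClient;
using System.IO;
using System.Web;

public static class HugeExport
{
    public static void Export(HttpResponse response, string connectionString)
    {
        string tempPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".csv");
        long lastId = 0;
        const int pageSize = 10000;

        using (var connection = new SqlConnection(connectionString))
        using (var writer = new StreamWriter(tempPath))
        {
            connection.Open();
            while (true)
            {
                using (var command = new SqlCommand(
                    @"SELECT TOP (@pageSize) Id, Name, Amount
                      FROM dbo.Records
                      WHERE Id > @lastId
                      ORDER BY Id", connection))
                {
                    command.Parameters.AddWithValue("@pageSize", pageSize);
                    command.Parameters.AddWithValue("@lastId", lastId);

                    int rowsInPage = 0;
                    using (var reader = command.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            lastId = reader.GetInt64(0);
                            writer.WriteLine("{0},{1},{2}",
                                lastId, reader.GetString(1), reader.GetDecimal(2));
                            rowsInPage++;
                        }
                    }

                    if (rowsInPage < pageSize)
                        break;   // last page reached
                }
            }
        }

        // Stream the finished file to the client without building it in memory.
        response.ContentType = "text/csv";
        response.AddHeader("Content-Disposition", "attachment; filename=export.csv");
        response.TransmitFile(tempPath);
    }
}

This keeps the working set small on both the SQL side and the IIS side: each page is read once, written to disk, and discarded before the next page is fetched.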

Related

Flushing a Gembox Spreadsheet object to a stream while reading from a DataReader

Here's the scenario:
I have around 400k records in a SQL Server 2008 R2 database that I want to export to an XLSX spreadsheet.
The application is an ASP.NET 4.0 web application
I tried using a DataTable with ReportViewer but the w3wp process memory usage skyrockets due to the entire DataTable being read into memory.
I thought that Gembox Spreadsheet would handle that scenario a little better, guessing that I could use a DataReader instead of the DataTable and just write a new row to the Excel workbook sheet and flush it over the HTTP stream.
But I can't find that function anywhere in Gembox Spreadsheet. Has anyone achieved anything similar, either with Gembox or any other 3rd party component?
I am only guessing here, but you should be able to do it using a SqlDataReader. A DataReader is different from a DataSet: the former retrieves one record at a time and requires a live connection to the database, so it should not create memory problems; the latter retrieves the whole table at once, which can make memory usage skyrocket. Here is a good article from Microsoft on the difference between a DataReader and a DataSet.
Also note that Excel has limitations of its own. Excel 2007 is pretty good, by the way: it can handle about 1,000k rows by 16k columns. Older versions of Excel are definitely more limited. Excel 2007 Limitation
I am also thinking that if Excel can handle such a large file, your program should be able to handle 400k records. I know that is a lot of data, but the operating system usually takes care of memory management unless you are doing something that is just plain wrong.
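A minimal sketch of that SqlDataReader approach, streaming rows straight to the HTTP response as CSV and flushing every few thousand rows. CSV stands in here because the streaming GemBox call is exactly the part the question could not find; the table and column names are placeholders:

// Sketch: stream rows with a SqlDataReader and flush periodically so w3wp memory stays flat.
// dbo.Orders and its columns are placeholders, not from the original question.
using System.Data.SqlClient;
using System.Web;

public static class CsvExport
{
    public static void StreamCsv(HttpResponse response, string connectionString)
    {
        response.ContentType = "text/csv";
        response.AddHeader("Content-Disposition", "attachment; filename=export.csv");
        response.BufferOutput = false;   // do not accumulate the whole file in memory

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT Id, CustomerName, Total FROM dbo.Orders", connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                response.Write("Id,CustomerName,Total\r\n");
                int rowCount = 0;
                while (reader.Read())
                {
                    response.Write(string.Format("{0},{1},{2}\r\n",
                        reader.GetInt32(0), reader.GetString(1), reader.GetDecimal(2)));
                    if (++rowCount % 5000 == 0)
                        response.Flush();   // push what we have down to the client
                }
            }
        }
        response.Flush();
    }
}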

send large data using webservice

I need to send large data using a web service; the size would be between 300 MB and 700 MB. The web service generates the data from a SQL database and sends it to the client in the form of a DataSet with around 20 to 25 tables.
I tried the solution from the article "How to: Enable a Web Service to Send and Receive Large Amounts of Data" and the Microsoft WSE 3.0 sample, but it mostly gives me a "System.OutOfMemoryException".
I think the problem is that the web service buffers the data in memory on the server and it exceeds the limit.
I have thought of two alternatives:
(1) Send the DataTables one by one, but sometimes a single DataTable can hold around 100 MB to 150 MB of data.
(2) Write a file on the server and transfer it using HttpWebRequest (FTP would be possible, but an FTP server is not accessible currently).
Can anyone suggest a workaround for this problem using a web service?
Thanks,
A DataSet loads all the data into memory, so it is not suited to transferring such huge amounts of data. DataSets also carry a lot of extra information when they are serialized.
If you know the structure of the tables you need to transfer, creating a set of serializable objects and sending an array of those would reduce your data payload significantly.
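A minimal sketch of that idea; the type name, fields, and web method signature are invented for illustration, not taken from the question:

// Sketch: a lean serializable DTO sent as an array instead of a whole DataSet.
using System;
using System.Web.Services;

[Serializable]
public class RecordDto
{
    public int Id;
    public string Name;
    public DateTime CreatedOn;
    public decimal Amount;
}

public class ExportService : WebService
{
    // Paging the result keeps any single response well under the memory limit.
    [WebMethod]
    public RecordDto[] GetRecords(int pageIndex, int pageSize)
    {
        // ...read rows with a SqlDataReader and map them into RecordDto instances...
        return new RecordDto[0];   // placeholder body
    }
}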
If you must use a DataSet, take a look at enabling binary remoting:
// Switch the DataSet to binary serialization to shrink the serialized payload.
BinaryFormatter bf = new BinaryFormatter();
myDataSet.RemotingFormat = SerializationFormat.Binary;
bf.Serialize(outputStream, myDataSet);   // outputStream: whatever Stream the result is written to
After reducing your data payload by such means, it would be best to write the files to a publicly accessible location on your HTTP server. Hosting them over HTTP lets clients download the files far more easily than FTP, and you can control access to those HTTP folders with appropriate permissions for the users.

Exporting Large Amounts of Data from SQL Server using ASP.Net

At the company I work for, we are building a data warehousing application that will be a web-based front end for a lot of the queries we run (using ASP.NET).
Some of these queries will bring back over a million records (and by year end maybe around 2 million records);
most will return thousands.
What is a good way to architect the application so that you can browse to the query you want, export it, and have a CSV file generated from the requested data?
I was thinking of a web-based interface that calls BCP to generate a file, shows you when the report has been created so that it can be downloaded, and expires the file within 24 hours of creation (a rough sketch of the BCP call is below).
Any Ideas?
Sas
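For the BCP route specifically, something along these lines could run on the server. The server name, database, query, and output path are placeholders, and -T assumes the account running it has a trusted connection to the reporting server:

// Sketch: shell out to bcp to dump a query to a CSV file on the server.
// Warehouse.dbo.SalesFact, REPORTSRV, and the output path are placeholders.
using System.Diagnostics;

public static class BcpExport
{
    public static void RunBcpExport(string outputPath)
    {
        string arguments = string.Format(
            "\"SELECT * FROM Warehouse.dbo.SalesFact\" queryout \"{0}\" -c -t, -S REPORTSRV -T",
            outputPath);

        var startInfo = new ProcessStartInfo("bcp", arguments)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();   // in practice, run this from a background job, not a request thread
        }
    }
}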
I architected something like this in a former life. Essentially a C# command-line app that ran queries against our reporting server.
Some key points:
It didn't matter how long they took because it was in the background - when a file was ready, it would show up in the UI and the user could download it. They weren't waiting in real time.
It didn't matter how inefficient the query was - both because of the point above, and because the reports were geared to the previous day. We ran them against a reporting copy of production (kept on a delay via log shipping), not against production itself.
We didn't set any expiration on the files because the user could request a report on a Friday and not expect to look at it until Monday. We'd rather have a file sitting around on the disk than run the report again (file server disk space is relatively cheap). We let them delete reports once they were done with them, and they would do so on their own to prevent clutter in the UI.
We used C# and a DataReader rather than bulk export methods because of various requirements for the data. We needed to provide the flexibility to include column headers or not, to quote data or not, to deal with thousand separators and decimal points supporting various cultures, to apply different line endings (\r\n, \n, \r), different extensions, Unicode / non-Unicode file formats, and different delimiters (comma, tab, pipe, etc). There were also different requirements for different customers and even for different reports - some wanted an e-mail notification as soon as a report was finished, others wanted a copy of the report sent to some FTP server, etc.
There are probably a lot of requirements you haven't thought of yet but I hope that gives you a start.
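A stripped-down sketch of that DataReader-based export, with a few of the per-customer options mentioned above (header row, delimiter, line ending, encoding). The method name and parameters are invented, and real code would also handle quoting, culture-specific formatting, and notifications:

// Sketch: DataReader-based export; the query, options, and file handling are simplified placeholders.
using System.Data.SqlClient;
using System.IO;
using System.Text;

public static class ReportExporter
{
    public static void ExportQuery(string connectionString, string query, string outputPath,
                                   bool includeHeaders, string delimiter, string lineEnding,
                                   Encoding encoding)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(query, connection))
        using (var writer = new StreamWriter(outputPath, false, encoding))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                if (includeHeaders)
                {
                    var headers = new string[reader.FieldCount];
                    for (int i = 0; i < reader.FieldCount; i++)
                        headers[i] = reader.GetName(i);
                    writer.Write(string.Join(delimiter, headers) + lineEnding);
                }

                var values = new object[reader.FieldCount];
                while (reader.Read())
                {
                    reader.GetValues(values);   // one row at a time, never a full DataTable
                    writer.Write(string.Join(delimiter, values) + lineEnding);
                }
            }
        }
    }
}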

Im building a salesforce.com type of CRM - what is the right database architecture?

I have developed a CRM for my company. Next, I would like to take that system and make it available for others to use in a hosted format, very much like salesforce.com. The question is what type of database structure I should use. I see two options:
Option 1. Each time a company signs up, I clone the master database for them.
The disadvantage of this is that I could end up with thousands of databases. That's a lot of databases to back up every night. My CRM uses cron jobs for maintenance, and those jobs would have to run on all the databases.
Each time I upgrade the system with a new feature, and need to add a new column to the database, I will have to add that column to thousands of databases.
Option 2. Use only one database.
Add a "CompanyID" column at the beginning of EVERY table, and add "AND CompanyID = {companyid}" to EVERY SQL statement (a minimal sketch of this is below).
The advantage of this method is the simplicity of only one database. Just one database to backup each night. Just one database to update when needed.
The disadvantage is that if I get 1,000 companies signing up, and each wants to store data on 100,000 leads, that is 100,000,000 rows in the leads table, which worries me.
Does anyone know how the online hosted CRMs like salesforce.com do this?
Thanks
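A minimal sketch of what option 2 looks like per query. The table and column names are invented, and it is shown with C#/SQL Server purely for illustration:

// Sketch: in the shared-database model, every query carries the tenant filter.
using System.Data.SqlClient;

public static class LeadQueries
{
    public static SqlCommand GetLeadsForCompany(SqlConnection connection, int companyId)
    {
        var command = new SqlCommand(
            "SELECT Id, Name, Email FROM Leads WHERE CompanyID = @companyId", connection);
        command.Parameters.AddWithValue("@companyId", companyId);   // parameterised, never string-concatenated
        return command;
    }
}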
Wouldn't you clone a table structure for each new database ID, with all sheets archived in the master base and indexed? Each client clone is hash-verified to access its specific sheet, run through a host machine at the front end of the master system, whose primary role is directing requests. Internal access is batched out to read/write slave systems in groups. Obviously, set RAID configurations to copy in real time and on a schedule, and balance requests for load against the speed of the system resources. That way you separate the security flaws from the UI and the connection to the retention schema. It seems like simplified structures and reduced policy requests cut down the request overhead in query processing - or simply a man-in-the-middle approach from the inside out.
1) Splitting your data into smaller groups in a safe, unthinking way (such as one database per grouping) is almost always best if you want to scale. In this case, unless for some reason you want to query between companies, keeping them in separate databases is best.
2) If you are updating all of the databases by hand, you are doing something wrong if you want to scale. You'd want to automate the process.
3) Ultimately, salesforce.com uses this as the basis of their DB infrastructure:
http://blog.database.com/blog/2011/08/30/database-com-is-open-for-business/

How do I get SQLite to cache results of a select command

I am running a web application with a back-end SQLite database that solely performs read operations. Users connect to the database, search for entries via a SELECT command, and view the results in a browser. But the SELECT is quite time-consuming because it involves character pattern matching across several million table rows. (The results table itself is quite small.)
Different users will generally do the exact same search, so if I can cache the results of the SELECT the first time, the next user to search the database (concurrently, or more likely a few days later) can get the results back quickly.
How can I do this in SQLite? Is there a pragma I need to use? I hear that SQLite has an automatic caching feature, but this does not seem to help. Note that I'm using a hosting service, so I cannot rebuild SQLite in any way.
Any help would be much appreciated.
I would use an external caching solution like memcached or APC/Zend Cache (on PHP). Then you have much more control over your cache (what to store, lifetimes, clearing the cache completely, ...).
Are the users using different connections to the SQLite DB? Try using the PRAGMA command to increase cache size.
PRAGMA cache_size;
PRAGMA cache_size = Number-of-pages;
Query or change the suggested maximum number of database disk pages that SQLite will hold in memory at once per open database file.
From documentation: http://www.sqlite.org/pragma.html#pragma_cache_size
After posting the above question, I found one simple solution that seems effective and doesn't require any changes to the cPanel hosting service that I use.
Instead of caching the SQL results, I simply cache the entire web page generated by the PHP script. Users making the exact same search are then given the cached page, bypassing the database completely.
I got the basic idea here.
The advantage is that quite a complex result set, involving several different database calls, can be cached in very little space. Then, using PHP, it's a trivial task to delete the cached web pages after a certain time.
I hope this helps others working on similar problems.
I have a very similar issue which I attempted to optimise by hashing queries, and storing the hash/result in a CACHE table.
But there are gotchas - for example, the commands:
SELECT * FROM myTable WHERE (col1 BETWEEN 1000 AND 2000) AND (col2='StackOverflow');
SELECT * FROM myTable WHERE (col2='StackOverflow') AND (col1 BETWEEN 1000 AND 2000);
should give exactly the same answer, but MD5'ing those two strings will give different results. The same goes for:
SELECT col1,col2 FROM myTable;
SELECT col2,col1 FROM myTable;
If you can standardise how the queries come into the database (sorting the SELECTed columns, WHERE filters, grouping, etc. - all in your PHP/JavaScript), then you can reliably quick-hash the queries, run them, and store the result in your database. I would recommend storing it as JSON if you're not going to return whole pages like the OP, because that saves having to convert it later, and JSON is pretty much always gzipped before sending it to the user (if you're AJAXing it to the user's screen).
To really speed this up, the Hash column in your CACHE table should be declared UNIQUE so it is automatically indexed.
If you are using Node.js, then I'd go one step further and say you can read this CACHE table from disk on startup, store all the hashes in a Set object, and then, when your server gets a query, just do allHashes.has(hashedSortedQuery). Since that operation happens in memory, it is essentially instantaneous. If true, SELECT Result FROM CACHE WHERE Hash='hashedSortedQuery'; else, run the actual query.
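The idea itself is language-agnostic; here is a rough sketch in C# with System.Data.SQLite. The CACHE table with Hash and Result columns is as described above, while the method shape, the up-front normalisation, and the runQueryAsJson delegate are assumptions:

// Sketch: hash a canonicalised query, look it up in the CACHE table, and only run the
// expensive query on a miss. The caller is assumed to have already sorted/normalised
// the query text; runQueryAsJson is a stand-in for the real query + JSON serialisation.
using System;
using System.Data.SQLite;
using System.Security.Cryptography;
using System.Text;

public static class QueryCache
{
    public static string GetCachedResult(SQLiteConnection connection, string canonicalQuery,
                                         Func<string, string> runQueryAsJson)
    {
        string hash;
        using (var md5 = MD5.Create())
            hash = BitConverter.ToString(md5.ComputeHash(Encoding.UTF8.GetBytes(canonicalQuery)))
                               .Replace("-", "");

        using (var lookup = new SQLiteCommand(
            "SELECT Result FROM CACHE WHERE Hash = @hash", connection))
        {
            lookup.Parameters.AddWithValue("@hash", hash);
            var cached = lookup.ExecuteScalar() as string;
            if (cached != null)
                return cached;                 // cache hit: no pattern matching needed
        }

        string resultJson = runQueryAsJson(canonicalQuery);

        using (var insert = new SQLiteCommand(
            "INSERT INTO CACHE (Hash, Result) VALUES (@hash, @result)", connection))
        {
            insert.Parameters.AddWithValue("@hash", hash);
            insert.Parameters.AddWithValue("@result", resultJson);
            insert.ExecuteNonQuery();
        }

        return resultJson;
    }
}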
One final note - in my experience doing the same thing, updating my SQLite version brought a HUGE improvement in speed, and I only mention it because on shared hosting servers the latest version isn't always installed. Here's a comparison I did today between the global SQLite version installed on my server and the latest SQLite version compiled from source: http://pastebin.com/hpWu3UCk
