Strip HTML - whose job is it? - ASP.NET

In the DB I have (have = existing situation; the DB already has the data):
a long HTML string (nvarchar(max))
The user has:
an ASP.NET .aspx page
Mission:
extract one specific sentence from the whole HTML string and show it to the user.
Options:
1) Do the strip job in SQL and transfer only the result to the client.
Pros: the network is not loaded with the whole HTML, only the result.
Cons: SQL Server works harder.
2) Send the whole HTML over the wire to the web server and do the job there.
Pros: SQL doesn't work too hard; IIS does the job with ready-made tools (HTML Agility Pack).
Cons: the wire is loaded with unwanted data.
Which approach is the right one?
P.S.
Let's assume that the hardware is excellent.

If you need to retrieve this specific piece of data repeatedly you should store it separately - in another column on the same table, for example.
When saving the HTML, extract this piece of information for storage (also on update, to ensure that the information stays in sync).
I would suggest using an HTML parser like the HTML Agility Pack to parse and query the HTML, and doing it in the ASP.NET code that saves to the DB rather than in the DB itself (string manipulation does not have great support in databases).
This has the benefit of only retrieving the data when you need it and processing only when required.
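A rough sketch of that extraction step with the HTML Agility Pack (the XPath selector and class/variable names are assumptions about where the sentence lives in your markup, not from the question):

using HtmlAgilityPack;

public static class HtmlSummary
{
    // Returns the one sentence to show to the user; the XPath is an assumption.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//p[@id='summary']");
        return node != null ? node.InnerText.Trim() : string.Empty;
    }
}

You would call something like HtmlSummary.Extract when inserting or updating the row and store the result in its own column next to the full HTML.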

It's always good to go for lossless data storage, because it could be useful in the future. I think the second option is good: save the HTML in the DB and use the Agility Pack to parse the data. As you have mentioned that the hardware is excellent, the Agility Pack should do its work easily, because it works on the principles of an XML document, which is quite fast for traversing nodes.
Regards.

I would also go with HTML Agility Pack option only.
That said, if the HTML stored in the DB is well formed and you are using SQL Server 2005+ then try out this to see if it helps - http://vadivel.blogspot.com/2011/10/strip-html-using-udf-in-sql-server-2005.html


Acunetix Webscan

I am scanning my web application, which I have built in ASP.NET. The scanner is injecting junk data into the system, trying to do blind SQL injection, but I am using SQL stored procedures with parameterized queries, which blocks the blind SQL injection. These junk entries, however, are stored in the system as normal text. I am sanitizing the inputs so they don't accept ' and other SQL-related characters. Now my questions are:
1) Are these junk entries any threat to the system?
2) Do I really need to sanitize the input if I am already using parameterized queries with stored procedures?
3) The scanner is not able to enter information into the system if you don't create a login sequence - is that a good thing?
If there are any other precautions I should take, please let me know.
Thanks
As you correctly mentioned, the 'junk' entries in your database are form submissions that Acunetix is submitting when testing for SQL injection, XSS and other vulnerabilities.
To answer your questions specifically:
1) No, this junk data is just an artifact of the scanner submitting forms. You might want to consider applying stricter validation on these forms though -- remember, if a scanner can input a bunch of bogus data, an automated script (or a real user for that matter) can also insert a bunch of bogus data.
Some ideas for better validation could include restricting the kind of input based on what data should be allowed in a particular field. For example, if a user is expected to input a telephone number, then there is no point allowing the user to enter alpha characters (numbers, spaces, dashes, parentheses and a plus sign should be enough for a phone number).
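As a small sketch of that kind of server-side whitelist check (the field name and the exact pattern are assumptions; tighten them to your own rules):

using System.Text.RegularExpressions;

string phoneInput = txtPhone.Text; // txtPhone is your TextBox (assumption)

// Allow only digits, spaces, dashes, parentheses and a leading plus sign.
bool looksLikePhone = Regex.IsMatch(phoneInput ?? "", @"^\+?[0-9\s\-\(\)]{6,20}$");
if (!looksLikePhone)
{
    // Reject the submission instead of storing junk in the database.
}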
Alternatively, you may also consider using a CAPTCHA for some forms. Too many CAPTCHAs may adversely affect the user experience, so be cautious where, when and how often you make use of them.
2) If you are talking about SQL injection, no, you shouldn't need to do anything else. Parameterized queries are the proper way to avoid SQLi. However, be careful of Cross-site Scripting (XSS). Filtering characters like <>'" is not the way to go when dealing with XSS.
In order to deal with XSS, the best approach (most of the time) is to exercise Context-dependent Outbound Encoding, which basically boils down to: use the proper encoding based on which XSS context you're in, and encode when data is printed onto the page (i.e. do not encode when saving data to the database, encode when you are writing that data to the page). To read more about this, this is the easiest and most complete source I've come across -- http://excess-xss.com/#xss-prevention
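A small illustration of encoding at output time per context in ASP.NET (the control and variable names are placeholders, not from the question):

using System.Web;

// HTML body context - e.g. writing a comment into a Literal control
litComment.Text = HttpUtility.HtmlEncode(userValue);

// HTML attribute context - e.g. a value="" attribute
string attrSafe = HttpUtility.HtmlAttributeEncode(userValue);

// JavaScript string context - e.g. var name = '...';
string jsSafe = HttpUtility.JavaScriptStringEncode(userValue);

The point is that the data stays raw in the database and only gets encoded at the moment it is written to the page.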
3) A login sequence is Acunetix's way of authenticating into your application. Without it, the scanner can not scan the internals of your app. So unless you have forms (perhaps on the customer-facing portion of your site) the scanner is not going to be able to insert any data -- Yes, this is generally a good thing :)

Best practice for saving application-level data to database (ASP.NET)

We have a very large HTML form (> 100 fields) that updates a SQL Server database with user-entered data. It will take the user a long time to fill out the form, but every piece of information they submit is very valuable to the business process. Even if the user gives up on the form, we want to retain everything they have entered.
We plan to attach an onblur event to each field and use jQuery/AJAX to post each piece of data back to the application server immediately. That part is pretty straightforward. The question we have is when and how to best save this application-level information to the database. Again, our priority is data retention as opposed to performance but we also want to do this as efficiently as possible.
Options as I see it are:
Have the web service immediately post each piece of data to the database server.
Store the information in a custom class on the application server, then periodically call an update method to post new data to the database.
Store the information in view or session state, then run a routine to post this information to the database server.
Something else that we haven't thought of.
Option 1 seems the most obviously failsafe, but also the most resource intensive. Option 2 seems the most elegant, but can we be absolutely certain that the custom class instance can't be destroyed without first running its update method?
Thanks for your help!
IMHO, I'd cut the form up into sections (if possible). Since this is ASP.NET, if you are using Web Forms then look into using the Wizard control (cut the form up into logical steps).
You can do the same without the Wizard control, but still cut the process up into logical steps on the client side. You can probably do this in pure JavaScript, but it would likely be easier with a framework (jQuery, Knockout, etc.) - the concept remains the same: cut the form entry process up into sections (aka "steps"), e.g. using display toggles, a div per "step", etc.
"Retain everything even if abandoned later": this assumes the steps are "hierarchical", with the "most critical" inputs at the beginning. That makes the "steps" approach even more important - each step is a "logical group" of inputs you really want, so you can save that step's data to the DB in whatever fashion you deem appropriate (e.g. Ajax or an ASP.NET postback).
Hth...
I would package everything up as XML or in a DataSet (.GetXml()) and pass the XML to a stored procedure....
How to pass XML from C# to a stored procedure in SQL Server 2008?
And maybe put the call on a background thread.
http://code.msdn.microsoft.com/CSASPNETBackgroundWorker-dda8d7b6
The XML will be faster than inserting the values row by row (RBAR).
You can save just the XML, or shred it into relational table(s).
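A minimal sketch of that idea (the stored procedure name, parameter name and connection string are assumptions):

using System.Data;
using System.Data.SqlClient;

string xml = formDataSet.GetXml(); // DataSet holding the captured form fields

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.SaveFormData", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.Add("@FormXml", SqlDbType.Xml).Value = xml;
    conn.Open();
    cmd.ExecuteNonQuery(); // the procedure can store the XML as-is or shred it into tables
}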

Cache data until changed

I have a legacy website that needs a little optimization because of poor performance. It is an ASP.NET shopping website with LINQ to SQL as the data layer and the MVP pattern for the UI.
The most costly entities in the DB are the product and category tables, which have a one-to-many relationship. These two entities don't change regularly, only when a user in the admin group decides to add a product or category, etc. I was wondering how costly it is to create and fetch everything from these two entities on every request - if only I had a way to keep my data alive…
First I thought: let's use AJAX for data retrieval, so I create only those entities that I need to query or bind to - but wait, how can I do that without creating a new DataContext instance?
On the other hand, caching the whole DataContext is considered a bad decision because of the memory cost. So what would be the best option here? How can I improve things?
UPDATE
1) Doing what #HatSoft suggested.
Cons: those approaches will not help your code, only the database. Besides this, there might be memory issues, since we're putting data in memory instead of rendered HTML; however, this might be the best option with regard to de-coupling.
2) Using output caching, we have this code in an HTTP handler registered for the *.aspx wildcard:
string pagePath = Context.Request.Url.AbsolutePath;
object cacheKey = Context.Application[pagePath];   // per-page marker kept in application state
if (cacheKey == null)
    return; // application restarted/first run, so cache the stuff
else
    HttpResponse.RemoveOutputCacheItem(pagePath);   // evict this page's cached output
Cons: now we have to link the pagePath to each database entity that the page uses, but if I do so then I'm coupling things instead of de-coupling them. This approach also runs into a little hard-coding.
3) Another solution would be output caching in post-cache substitution mode instead of control-cache mode: using the Substitution control and setting the OutputCache Duration to 86400, so the page will be re-created every 24 hours.
Cons: hard-coding user controls to produce the HTML output for the Substitution control dynamically.
So what do you suggest?
I would suggest you look into the SqlDependency class; please read this article: http://www.asp.net/web-forms/tutorials/data-access/caching-data/using-sql-cache-dependencies-cs
Also, I would suggest looking into loading data into the cache at application startup, if it suits your application. Please see a good example here: http://www.asp.net/web-forms/tutorials/data-access/caching-data/caching-data-at-application-startup-cs
With LINQ to SQL you can use LinqToCache, which offers a SqlDependency-powered cache for your LINQ queries. It turns the IQueryable<Products> into an IEnumerable<Products> and enumerates from memory after the first access (the first iteration of the underlying IQueryable). Based on SqlDependency data-change notifications, it invalidates the list, and subsequent accesses will query the DB again and cache the result.
My recommendation would be to cache the Products list and the Categories in memory, since they change seldom and I expect them to be of a fairly constrained size.
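If you don't want to take a dependency on LinqToCache, a rough equivalent using the built-in ASP.NET cache with SqlCacheDependency could look like this (the DataContext name, cache key and the "ShopDB" entry in web.config are assumptions, and the Products table must be enabled for notifications with aspnet_regsql):

using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.Caching;

var products = HttpRuntime.Cache["AllProducts"] as List<Product>;
if (products == null)
{
    using (var db = new ShopDataContext())
    {
        products = db.Products.ToList(); // materialize once per cache miss
    }
    HttpRuntime.Cache.Insert(
        "AllProducts",
        products,
        new SqlCacheDependency("ShopDB", "Products"), // invalidated when the table changes
        Cache.NoAbsoluteExpiration,
        Cache.NoSlidingExpiration);
}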

Generate and serve a file on the server on demand. Best way to do it without consuming too many resources?

My application has to export the result of a stored procedure to .csv format. Basically, the client performs a query and can see the results in a paged grid; if it contains what he wants, he clicks an "Export to CSV" button and downloads the whole thing.
The server will have to run a stored procedure that will return the full result without paging, create the file and return it to the user.
The result file could be very large, so I'm wondering what the best way is to create this file on the server on demand and serve it to the client without blowing up the server's memory or resources.
The easiest way: Call the stored procedure with LINQ, create a stream, iterate over the result collection and create a line in the file per collection item.
Problem 1: Does deferred execution apply to LINQ to stored procedures as well? (I mean, will .NET try to create a collection with all the items of the result set in memory, or will it give me the results item by item if I iterate instead of calling .ToArray()?)
Problem 2: Is that stream kept in RAM until I call .Dispose()/.Close()?
The not-so-easy way: Call the stored procedure with an IDataReader and, for each row, write directly to the HTTP response stream. It looks like a good approach: as long as I write to the response as I read, memory is not blown up.
Is it really worth it?
I hope I have explained myself correctly.
Thanks in advance.
Writing to a stream is the way to go, as it will roughly consume no more than the current "record" and its associated memory. That stream can be a FileStream (if you create a file), the ASP.NET response stream (if you write directly to the web), or any other useful stream.
The advantage of creating a file (using the FileStream) is being able to cache the data to serve the same request over and over. Depending on your needs, this can be a real benefit. You must come up with an intelligent algorithm to determine the file path & name from the input; this will be the cache key. Once you have a file, you can use the TransmitFile API, which leverages the Windows kernel cache and is in general very efficient. You can also play with HTTP client caching (headers like If-Modified-Since, etc.), so the next time the client requests the same information you may return a Not Modified (HTTP 304 status code) response. The disadvantage of using cache files is that you will need to manage these files, disk space, expiration, etc.
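To illustrate the cached-file idea (a sketch only; cacheFilePath and the way you derive it from the query input are assumptions):

using System;
using System.IO;
using System.Web;

DateTime lastWrite = File.GetLastWriteTimeUtc(cacheFilePath);
string header = Request.Headers["If-Modified-Since"];
DateTime since;

if (DateTime.TryParse(header, out since) && lastWrite <= since.ToUniversalTime().AddSeconds(1))
{
    Response.StatusCode = 304;          // Not Modified: the client can reuse its copy
    Response.SuppressContent = true;
}
else
{
    Response.ContentType = "text/csv";
    Response.Cache.SetLastModified(lastWrite);
    Response.TransmitFile(cacheFilePath); // sent by IIS without loading the file into managed memory
}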
Now, LINQ or IDataReader should not change much in terms of performance or memory consumption, provided you don't use LINQ methods that materialize the whole data set (exhaust the stream) or a big part of it. That means you will need to avoid ToArray(), ToList() and other methods like them, and concentrate only on "streamed" methods (enumeration, skips, while loops, etc.).
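The streamed write itself, with an IDataReader, might look roughly like this (the procedure name and column layout are assumptions; a real CSV writer should also escape commas and quotes):

using System.Data;
using System.Data.SqlClient;
using System.Web;

Response.ContentType = "text/csv";
Response.AddHeader("Content-Disposition", "attachment; filename=export.csv");

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.GetFullResult", conn) { CommandType = CommandType.StoredProcedure })
{
    conn.Open();
    using (SqlDataReader reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // Only the current row is ever held in memory.
            Response.Write(reader.GetInt32(0) + "," + reader.GetString(1) + "\r\n");
        }
    }
}
Response.Flush();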
I know I'm late to the game here, but how many records are we talking about, theoretically? I saw 5000 being thrown around, and if it's around there, that shouldn't be a problem for your server.
Answering the easiest way:
It does unless you specify otherwise (you disable lazy loading).
Not sure I get what you're asking here. Are you referring to a StreamReader you'd be using for creating the file, or the DataContext you are using to call the SP? I believe the DataContext will clean up for you after you're done (it's always good practice to close it anyway). A StreamReader or the like will need its Dispose method run to be removed from memory.
That being said, when dealing with file exports I've had success in the past building the table (CSV) programmatically (via iteration), then sending the structured data as an HTTP response with the content type specified in the header - the not-so-easy way, as you so eloquently stated :). Here's a question that asks how to do that with CSV:
Response Content type as CSV
"The server will have to run a stored procedure that will return the full result without paging..."
Perhaps not, but I believe you'll need Silverlight...
You can set up a web service or controller that allows you to retrieve data "by the page" (much like calling a 'paging' service using GridView or another repeater). You can make async calls from Silverlight to get each "page" of data until completed, then use the SaveFileDialog to save it to the hard disk.
Hope this helps.
Example 1 | Example 2
What you're talking about isn't really deferred execution, but limiting the results of a query. When you say objectCollection.Take(10), the SQL that is generated when you iterate the enumerable only takes the top 10 results of that query.
That being said, a stored procedure will return whatever results you are passing back, whether it's 5 or 5000 rows of data. Performing a .Take() on the results won't limit what the database returns.
Because of this, my recommendation (if possible for your scenario), is to add paging parameters to your stored procedure (page number, page size). This way, you will only be returning the results you plan to consume. Then when you want the full list for your CSV, you can either pass a large page size, or have NULL values mean "Select all".
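A short sketch of calling such a procedure (the parameter names and the NULL-means-all convention are assumptions):

using System;
using System.Data;
using System.Data.SqlClient;

// Pass int values for the grid; pass null for both to get the full result for the CSV export.
int? pageNumber = 1, pageSize = 50;

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("dbo.SearchOrders", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.Add("@PageNumber", SqlDbType.Int).Value = (object)pageNumber ?? DBNull.Value;
    cmd.Parameters.Add("@PageSize", SqlDbType.Int).Value = (object)pageSize ?? DBNull.Value;
    conn.Open();
    // execute and read as usual...
}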

Simple problem storing strings into a database

For my website I've just implemented TinyMCE (just a word processor). Everything works fine except when I try to store the string input in a SQL Server database. I want to store the string without the HTML tags making me exceed the 8000-character length limit (the HTML tags take up most of that space). My question is: is there a solution so I can store my document with the HTML tags without shortening it? Thanks
Some ideas I've had, but I'm not sure if they'll work:
Create an if statement that checks the length; if > 8000, then split the string apart and insert it into separate fields.
Maybe there is a compression feature which I'm unaware of?
Paul
Can you store it as a BLOB, or possibly even use FILESTREAM? I know BLOBs have a size limit of 2 GB and are probably less than ideal, depending on the average size of the file you expect, because of the hit to the log file. FILESTREAM was added in SQL Server 2008 to handle large files by writing them directly to the filesystem; it is enabled by setting an attribute on the varbinary type.
