Parsing PlainText Emails from HTML Content (ASP.NET) - asp.net

Right, in short we basically already have a system in place where the HTML content for emails is generated. It's not perfect, but it works.
From this, we need to be able to derive a plaintext alternative for the email. I was thinking of instantly jumping on and creating a RegEx to strip the <*> tags from the message - but then I realised this would be no good because we do need some of the formatting information (paragraphs, line breaks, images etc).
NOTE: I am OK with actually sending the mail and setting up alternative views etc, this is only about getting plaintext from HTML.
So, I am pondering some ideas. Will post one as an answer to see what you guys think, but thought I would open it up to the floor. :)
If you need any more clarification then please shout.
Many thanks,
Rob

My Solution
OK, so here it is! I thought up a solution to my problem and it works like a charm!
Now, here are some of the goals I wanted to set out:
All the content for the emails should remain in the ASPX pages (as the HTML content currently does).
I didn't want the client code to do anything more other than say "SendMail("PageX.aspx")".
I didn't want to write too much code.
I wanted to keep the code as semantically correct as possible (no REALLY crazy-ass hacks!).
The Process
So, this is what I ended up doing:
Go to the master page for the email messages. Create an ASP.NET MultiView Control. This control would have two views - HTML and PlainText.
Within each view, I added content placeholders for the actual content.
I then grabbed all the existing ASPX code (such as header and footer) and stuck it in the HTML View. All of it, DocType and everything. This does cause VS to whinge a little bit. Ignore It.
I then of course added new content to the PlainText view to best replicate the HTML view in a PlainText environment.
I then added some code to the Master Page_Load, checking for the QueryString parameter "type" which could be either "html" or "text". It falls over to "text" if none present. Dependant on the value, it switches the view.
I then go to the content pages and add new placeholders for the PlainText equivalents and add text as required.
To make my life easier, I then overloaded my SendMail method to get the response for the required page, passing "type=html" and "type=text" and creating AlternateView's as appropriate.
In Summary
So, in short:
The Views seperate the actual "views" of the content (HTML and Text).
A master page auto switches the view based on a QueryString.
Content pages are responsible for how their views look.
Job done!
If any of this is unclear then please shout. I would like to create blog post on this at some point in more detail.

My Idea
Create a page based on the HTML content and traverse the control tree. You can then pick the text from the controls and handle different controls as required (e.g. use ALT text for images, "_____" for HR etc).

You could ensure the HTML mail is in XHTML format so you can parse it easily using the standard XML tools, then create your own DOM serialiser that outputs plain text. It'd still be a lot of work to cover general XHTML, but for a limited subset you plan to use in e-mail it could work.
Alternatively, if you don't mind shelling out to another program, you could just use the -dump switch to the lynx web browser.

Related

Webscraping a tricky asp.net page

The overall goal is to perform a search on the following webpage http://www.cma-cgm.com/eBusiness/Tracking/Default.aspx with a container value of CMAU1173561. I have tried two approaches, the php extension cURL and python's mechanized. The php approached involves a performing a POST submit using the input fields found on the page (NOTE: These are really ugly on the asp.net page). The returned page does not contain any of the search results. The second approaches involves using python's mechanize module. In this approach I load the page, select the form, then change the text field ctl00$ContentPlaceBody$TextSearch to the container value. When I load the response again no search results.
I am at a really dead end. Any help would be appreciate because as it stands my next step is to become a asp.net expertm which i perfer not to.
The source of that page is pretty scary (giant viewstate, tables all over the place, inline CSS, styles that look like they were copied from Word).
Regardless...an ASP.Net form still passes the same raw data to the server as any other form (though it is abstracted to the developer).
It's very possible that you are missing the cookies which go along with the request. If the search page (or any piece of the site) uses session state, the ASP.Net session cookie must be included in the request. You will be able to tell it from its name (contains "asp.net" and "session").
I assume that you have used a tool like Firebug or Chrome to view the complete outgoing request when the page is submitted. From my quick test, it looks like the request may be performed with a GET, not a POST. I submitted a form, looked at the request, and pasted the URL into a new browser window.
Example: http://www.cma-cgm.com/eBusiness/Tracking/Default.aspx?ContNum=CMAU1173561&T=57201202648
This may be all you need to do.

tinymce with asp.net, ValidateRequest=false in page, is it dangerous?

I am using tinymce editor in asp.net page. It was configured fine but when I tried to write soem text in editor it raised error "A potentially dangerous Request.Form value was detected from the client with timymce" I searched and came to know it was bascailly scan of input message form script and sql injection attaks.
To remove this error I put ValidateRequest=fasle in page heade in aspx page. Now I am sure input is not beign validated but is it unsecue now ?
Please guide me what type of threat it has now and what safty measure I can take to prevent it. The editor is being used for compose and store emails. I just read on some sites that client side script attaks are possible from input. Please guide and help.
I believe this answer is along the lines of what you are looking for.
Basically, you have to make sure you html encode/decode all the input fields where applicable. In reality, you cannot completely avoid it, unless you disable the validation. But if you are, make sure you take steps to avoid direct use of the input.

Using Javascript to get around SEO concerns

I would like to know at which stage is it okay to start manipulating HTML elements/content using Javascript so as not to impair SEO?
I have read somewhere that HTML content that is hidden using the CSS property display:none is often penalized by Google crawlers, with good reason from what I'm led to believe...I ask this as I intend to have some div panels that are initially hidden, but shown once the user clicks on an appropriate link. My intention is therefore not to hide content from users entirely - just intially to give them a better user experience - I'm afraid Google may not see it that way!
My reason for doing this is to prevent the split second (or in some cases, a full 2 seconds) of ghastly unstyled html elements (positioning), before my Javascript comes in to position, hide and neaten everything up. So adding the display:none at the forefront, and then using Javascript to toggle visibility would have been ideal, but is apparently a no-no with Google Search Engine bot.
Do you experts have any advice? Thank you!
google can now crawl AJAX sites using a simple URL substitution trick; you might be able to take advantage of this to let googlebot see a plain html version of the page for indexing instead of your load-optimized page; see http://code.google.com/web/ajaxcrawling/docs/getting-started.html
If the content in question exists on the page in the html, and is accessible to the user by the time the page finishes loading initially, then you are okay. You want to make sure google can lead a user to your page and see the content in question without requiring further interaction. Adding new content to the html after the initial load (i.e. content from the server), can be problematic for SEO. However if all content is in the html by the end of the page load, then you shouldn't get docked. Keep in mind, good SEO strategy dictates using standard methods of usability so the web crawler can access your content.
Also, each page should follow a content theme. Example: Don't abuse users by hiding five different unrelated blocks of content "medical devices, kazoos, best diners, motorcycles, toxic waste" on one page. Theoretically you could take all of your site's content and lay it out on one page using javascript and 'display:none' waiting for an 'onClick', but that smells like spam.
EDIT, additional info as pertaining to the original question:
The search engine friendly way to display content dynamically is to load it, then hide it from the user.

Printing receipt ASP.NET

I'm currently making a project where I need to print out a receipt on a receipt printer.
At the moment i'm using the CSS mechanism media=screen , media=print to indicate what to print.
Problem is of course the header and footer which can't be removed, as it is client browser specific.
So i'm wondering if anyone has another suggestion on how to do the printing. Preferbly without using too much IO.
Generally speaking, if you need precise control, your best bet is to have a pdf, or other doocument format that is generated from the server, for your printing. (if the machines printing receipts are controlled, and have word, than .doc (html with an output type) is the easiest method. There are a number of third party controls for generating PDF from server-side code as well. Hope this leads you in a usable direction, since you didn't specify if you controlled the client machines in use.
One benefit to PDF is you can use it as a hard archive, as well as being able to email receipts as an attachment.
The header and footer information (Assuming you're talking about the URL showing up at the bottom of a page) is client-side and there is nothing you can do to change that from server side.
If all of your printing is going to be done from inside your company, you could have a group policy created for Internet Explorer printing to remove these company-wide. You could also just have instructions on your page on how to change these setting manually.
Another option is to print with a 3rd party application, such as PDF, or print it directly from the server if that option is available to you.
Do you mean the page header and footer?
If that's the case, wrap the header and footer in IDs and create CSS tags to target them and give them a much simpler styling, or you can use the CSS element display:none to remove them altogether in the print css.
You could load the content you want to print into an iframe, focus on it, and print that. That way you'll have exact control over what appears on the receipt.
It'd take a small bit of javascript, but I've had success doing similiar things when I wrote a custom contract printer.
It's not an ASP solution, but may help:
http://code.google.com/p/jzebra
It's a java plugin that can bypass the header and footer stuff.

Best way to create complex html email message with asp.net, how?

After user places an order I have to send detailed email message containing order details, instructions for payment, some additional text etc.
I want to create nicely formatted HTML email message.
By now I found two options:
manually creating piece by piece, string by string, which is too cumbersome,
creating actual aspx page and binding data, then rendering that page as html and sending as body of email.
This second option is more visual and easier to implement except:
I do not know how to actually load and render page, I know how to do it with ascx
This seems to much of overhead to instantiate page and render it
How to load page and render it? Do you have any other ideas or suggestions for this task?
Well, IMO, your basic problem amounts to "How do I convert an ASPX resource into an HTML string to pass to the MailMessage Body property ?"
You could do that simply by using a WebRequest to the ASPX URL in question and read that response into a Stream. Then simply read the stream into a string and your primary problem is solved.
Edit: Here's an article that illustrates this concept.
Personally, I'd want to use a template, either in a database, or as a file that gets loaded. This template would have most of the content for the email in HTML, with tokens that I can replace with the content.
ex.
<b>Receipt for order # [[ordernum]]</b>
That way I could use simple string replacement to place the dymanic content into the email, without having to build the whole email every time it needs to be sent.
In a similar situation I store a template email message in my database so that the people who use our software can modify the message. It is created (by the user) using the online HTML editor control from Telerik. Within this message, I support several "mailmerge" type fields that all have the pattern {FirstName}, {LastName}, etc.
When it is time to send the message, I pull the formatted text from the database, use string replace to fill in any slots in the template, and then send it. I guess the key is that I know the message is HTML formatted because the Telerik control helps ensure that it is so. However, there is no reason why you couldn't create your HTML and then just save it for later use.
The .aspx page route? I just wouldn't do it. It is way overkill and doesn't offer you any advantages.
I'll use a template like Jay mentioned.
Below resource might turn out useful for you.
http://dotnettricks.com/blogs/roberhinojosablog/archive/2006/05/12/57.aspx
Try using a template stored in a .NET string resource file. Down the line this will make localization a lot easier too.

Resources