I need to read data from an online database that the UN displays through an aspx page. I've done HTML parsing before, but always by manipulating query-string values. In this case, the site uses ASP.NET postbacks: you click a value in box one, box two appears, you click a value in box two, and then you click a button to get your results.
Does anybody know how I could automate that process?
Thanks,
Mike
You may still only need to send one request, but that one request can be rather complicated. ASP.NET is notoriously difficult (though not impossible) to screen scrape. Between event validation and the ViewState, it's tricky to get your requests just right. The simplest way to do it is often to use a sniffer tool like Fiddler to see exactly what the HTTP request looks like, and then just mimic that request.
If you do still need to send two requests, it's because the first request also places some state in a session somewhere, and that means whatever you use to send those requests needs to be able to send them with the same session. This often means supporting cookies.
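As a rough sketch of what "mimicking the request" involves, the snippet below pulls the hidden state fields out of a page and echoes them back in a POST body. The markup, the `ddlCountry` control name and the field values are made up for illustration; a real page should be inspected with a sniffer first, and this is not guaranteed to satisfy event validation on any particular site.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collects the name/value pair of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_body(page_html, event_target, extra_fields):
    """Return a urlencoded body that sends the page's hidden state back."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    form = dict(parser.fields)            # __VIEWSTATE, __EVENTVALIDATION, ...
    form["__EVENTTARGET"] = event_target  # control that "caused" the postback
    form.setdefault("__EVENTARGUMENT", "")
    form.update(extra_fields)             # the values a user would have picked
    return urlencode(form).encode("ascii")


# Hypothetical markup standing in for the response to the first GET.
SAMPLE_PAGE = """
<form method="post" action="Default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="AbC+9/==" />
  <select name="ddlCountry"><option value="FR">France</option></select>
</form>
"""

body = build_postback_body(SAMPLE_PAGE, "ddlCountry", {"ddlCountry": "FR"})
print(body.decode("ascii"))
```

To keep the session across two requests, the body could be sent through an opener built with `urllib.request.build_opener(urllib.request.HTTPCookieProcessor())`, so that cookies set by the first response are returned with the second.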
WatiN would be my first choice. You would code the selecting and clicking, then parse the HTML afterward.
I'd look at HtmlAgilityPack with the FormProcessor addon.
Related
This is a clarification question: I'm studying for MCTS 70-515 and in my training kit it states that Hidden Fields: "users can view or modify data stored in hidden fields"
Now I'm aware that users can view the source of the page, which would display the hidden field data. But I'm curious about the modification part. How would a user modify a hidden field's data, and how would it affect the site? Even if they modify the data via View Source, they can't save the page and then post the data back to the server.
What am I missing that the author is assuming I know?
OK, well, all the answers said the same thing (at this time). I guess if the author had said "savvy" user, that might have tipped me off. I guess I've always assumed that users wouldn't know of Firebug or any other tool that can manipulate the page after it has been displayed to the user.
Thank you for all your answers. I appreciate it!
The hidden field is just a key-value pair, and it is serialized and sent to the server as a key-value pair just like any other form element. There are a number of ways to modify hidden fields: one is to use Firebug or some other "developer console" in the browser, another is to manually write the request and send it to the server.
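To make that concrete, here is a small sketch of two form bodies, one as a browser would build it and one written by hand. The field names and values are invented for illustration; the point is only that nothing in the serialized body marks `unit_price` as having come from a hidden input.

```python
from urllib.parse import parse_qs, urlencode

# What a browser would send for a form with one visible and one hidden input.
# Field names and values are made up for illustration.
honest = urlencode({"quantity": "2", "unit_price": "19.99"})

# A hand-written request: same field names, a different "hidden" value.
forged = urlencode({"quantity": "2", "unit_price": "0.01"})


def server_view(body):
    """Decode a form body roughly the way a server-side form collection would."""
    return {key: values[0] for key, values in parse_qs(body).items()}


print(server_view(honest))
print(server_view(forged))
```

Both bodies decode to the same set of keys, so the server has no structural way to tell them apart.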
In addition to using a debugging tool such as Firebug, the user could change the value of a hidden field indirectly through other interactions, with JavaScript making the change for them. Normally, the user would be unaware of the technical detail of what they are doing; they neither know nor care that a hidden field got changed.
Other tools, such as Fiddler, may intercept the web request and change the value of the hidden (or any) field as it is being transferred to the server on a postback.
It is possible to change the value of a hidden field on the server during a postback, as you know, or on the client using JavaScript.
Example using jQuery: http://jsfiddle.net/JGsQ5/
Once the page has been loaded by the browser, it is stored in the DOM (http://en.wikipedia.org/wiki/Document_Object_Model), which is what JavaScript manipulates and what the browser uses to build the HTTP request that is sent back to the server as a postback.
Easy: open up a program like Firebug and change the element value. Remember, markup is client side, so the server is trusting the client to send back the right data -- however, this is easily circumvented.
It is best to store data that is essential to the security of your application in the session, where the data remains on the server side and is tied to the client. ASP.NET can make use of hashes to prevent the unauthorized modification of fields, among other things.
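The hashing idea can be illustrated with a keyed hash (HMAC) attached to the value before it goes into the hidden field. This is a generic sketch and not ASP.NET's own implementation; `SECRET_KEY`, `sign` and `verify` are names chosen for the example.

```python
import hashlib
import hmac

SECRET_KEY = b"server-side-secret"  # hypothetical; never sent to the client


def sign(value: str) -> str:
    """Return 'value|mac', suitable for placing in a hidden field."""
    mac = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{value}|{mac}"


def verify(signed: str):
    """Return the original value if the MAC matches, otherwise None."""
    value, _, mac = signed.rpartition("|")
    expected = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched
    if hmac.compare_digest(mac.encode("utf-8"), expected.encode("ascii")):
        return value
    return None


token = sign("user_id=42")
tampered = token.replace("42", "43", 1)  # what an edit in a developer console might do
print(verify(token), verify(tampered))
```

The client can still read the value, and can replay an old signed value unchanged; the signature only shows that the value was not altered.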
Is it possible to have a control that would proxy requests to another domain/Web site, including postback?
In this control, you would specify the URL you wanted to execute, and whenever the control executed, it would make a GET request to this other URL, and render the HTML return. (This part is not hard.)
However, when the page is posting back, it would make a POST request, with all of its postback variables intact, to this other page.
I'm really looking for a blind proxy: some control that will take the incoming request, throw it at another URL, and render the results. The other page would really have no idea it wasn't interacting with a human.
I want to think I could develop this, but I can't be the first person who wants to do it, so there has to be some reason why Google isn't revealing the solution to me. I suspect I'm going to run into the same Big Problem that anyone else with this idea has run into.
I'm not exactly sure what the value of this is, which is probably why you haven't found a solution yet.
However, it seems to me that there are two possible solutions.
When the page is rendered have the control modify the form action to point elsewhere; or,
on post back, have the control execute a web request to the alternate URL with the post variables and decide what to do with the results at that time.
In the end, this never had much of a chance of working. I experimented with it for a while, but postback requires intimate knowledge of the control tree, and there's no way to apply a postback from the calling page to the other page and have it overlay correctly, because the control trees of the two pages are totally different.
Now, if you wanted to write the backend app as a more traditional Web app (even something not in ASP.Net), it might work. During postback, you could iterate the Request.Form values and send them back, and just have your backend app prepared to accept those incoming values and deal with them, but this wouldn't be a traditional postback.
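A minimal sketch of that forwarding step, assuming the incoming form values are already available as a dictionary. `BACKEND_URL` and `build_forward_request` are hypothetical names, and dropping the `__`-prefixed fields is a design choice of this example, made because the backend would have no use for another page's ViewState.

```python
from urllib.parse import urlencode
from urllib.request import Request

BACKEND_URL = "http://backend.example/handler"  # hypothetical backend endpoint


def build_forward_request(incoming_form, backend_url=BACKEND_URL):
    """Build (but do not send) a POST that relays plain form values."""
    # Skip ASP.NET's own bookkeeping fields (__VIEWSTATE, __EVENTTARGET, ...).
    payload = {k: v for k, v in incoming_form.items() if not k.startswith("__")}
    data = urlencode(payload).encode("utf-8")
    return Request(
        backend_url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


req = build_forward_request(
    {"__VIEWSTATE": "abc", "__EVENTTARGET": "", "name": "Mike", "step": "2"}
)
print(req.get_method(), req.full_url, req.data)
```

Passing `req` to `urllib.request.urlopen` would perform the call; the backend then sees an ordinary form post rather than a postback.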
I have a webservice that I need called, the result of which determines whether or not the user is allowed to submit the form.
Since this call is from javascript and not from code behind is it not reliable? Is there any way the user can get around the check -- by either going in with firebug and enabling the submit button, somehow making the method give a different result than was actually returned by the webservice, any other ways of being able to get around it?
Basically, is there any way to call a webservice from javascript and have its result determine whether or not a form can be submitted, and actually prevent the user from submitting the form at all? -- whether or not they have firebug, etc...
No, not possible.
Just to name a few possible reasons:
what if javascript is disabled?
what if the user submits the raw POST (using libcurl, for example)?
what if the browser that the user is using interprets javascript in a way different from your expectations (think portable devices)?
Javascript validation is there for your users' convenience only and should never ever be used as a means of providing security.
You can never prevent the user from making an HTTP request that mimics submission of the form. While disabling the form via Javascript prevents submission for 95% of the users who both have Javascript enabled and don't want to circumvent your access control, anyone who understands HTTP can make the call and you are correct in showing that anyone with Firebug can do it in a matter of seconds.
Javascript isn't reliable for preventing anything. It shouldn't be seen as a security wall, as it's too easily disassembled with things like Firebug, the IE Developer Toolbar, and many other browser toys.
Even if you could prevent them from submitting your form on your page, nothing stops them from creating a brand new form on their own page and pointing it toward the action of your form. Thus they're removing themselves from your "secure" environment, and instead choosing to play in their own.
Your suspicion is correct; the user can easily get around any possible Javascript validation.
You will need to use server-side code.
No, it is not reliable. Try disabling Javascript in your browser to see for yourself how easily you can get around it.
The user could simply disable javascript in their browser, or use something like NoScript. The best you could do is to try setting the form action itself in the return from the Ajax request, that way the form, as loaded, won't submit (except to itself). This will probably stop casual users but would be no impediment to a slightly more determined (or just bored and tech savvy) user. You will need to check on the server side whatever you do.
In general, no. You can make the form hard to submit without going through Javascript. Make the submit button not an actual submit button (<input type="submit">), but a pushbutton (<input type="button">) that submits the form in its onClick handler.
As everyone else said, no you can't do it. The only real solution is to have the web service return some dynamic value which the Javascript inserts in a hidden form input. Then whatever server-side code processes the form submission should reject the request if that value is not present.
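One way to read that suggestion is as a one-time token: the web service hands out a random value only when the check passes, and the form handler accepts a submission only if it carries a value that was actually issued. The sketch below uses an in-memory set as a stand-in for session or database storage; the function names and the `approval_token` field are invented for the example.

```python
import secrets

_issued_tokens = set()  # in-memory stand-in for session or database storage


def web_service_check(user_is_allowed: bool):
    """Return a fresh token when the check passes, otherwise None."""
    if not user_is_allowed:
        return None
    token = secrets.token_urlsafe(32)
    _issued_tokens.add(token)
    return token


def handle_form_submit(form):
    """Server-side handler: accept only submissions carrying an issued token."""
    token = form.get("approval_token", "")
    if token not in _issued_tokens:
        return "rejected"
    _issued_tokens.discard(token)  # one use only, so a replay is rejected
    return "accepted"


issued = web_service_check(True)
print(handle_form_submit({"approval_token": issued}))    # first use
print(handle_form_submit({"approval_token": issued}))    # replay
print(handle_form_submit({"approval_token": "made-up"}))
```

The decision still happens on the server; the Javascript merely carries the token from one server response to the next request.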
I have a wizard style interface where I need to collect data from users. I've been asked by my managers that the information is to be collected in a step by step type process.
I've decided to have a page.aspx with each step of the process as a separate user control. step1.ascx step2.ascx etc...
The way it works now is that when the initial GET request comes in, I render the entire page (which sits inside of a master page) and step1.ascx. When the next POST request comes in for step 2 (using query string step=2), I render only step2.ascx to the browser by overriding the Render(HtmlTextWriter) method and use the jQuery html() method to replace the contents of a div.
The problem with this whole approach, besides being hacky (in my opinion) is that it's impossible to update viewstate as this is usually handled server side.
My workaround is to store the contents of step1.ascx into temporary session storage so if the user decides to click the Back button to go back one step, I can spit out the values that were stored for it previously.
I feel I'm putting my techy hat on here in wanting to try the latest Javascript craze, as using jQuery with .NET has taken a lot of hack-like approaches and reverse engineering to get right. Would it be easier to simply use an UpdatePanel and be done with it, or is there a site with a comprehensive resource on using jQuery to do everything in ASP.NET?
Thanks for taking the time to read this.
Another approach that might be easier to work with is to load the entire form with the initial GET request, and then hide all sections except the first one. You then use jQuery to hide and show different parts of the form, and when the final section is shown the entire form is posted in one POST to the server. That way you can handle the input on the server just as if the data entry was done in one step by the user, and still get the step-by-step experience on the client side.
You could just place all your user controls one after another, make only the current step's control visible, and turn on the other controls when appropriate. There is no need to mess with overriding Render(). This way the user controls' viewstate will be managed by the server, and you can focus on any step validation logic.
Using an UpdatePanel to contain the steps would give the ajax experience and still let you validate each step. If you are OK with validating multiple steps at once, Tomas Lycken's suggestion (hide/show with jQuery) would give a fast step-by-step experience.
Did you look into using the ASP.NET Wizard control? It's a bit of a challenge to customize the UI, but otherwise it's worked well for me in similar scenarios.
What is updated when an Update is triggered? What goes to the server? What comes back?
I was under the impression that only the content of the panel was transmitted to the server and back (without touching anything in the page outside the panel), but I'm experiencing strange results, probably because I don't really understand how it works exactly.
Can someone provide an easy explanation as to how exactly it works?
What is generated is a form submit through AJAX, which means essentially XML HTTP in the browser. When it hits the server, the server sees it as an AJAX call and it routes the Request to the correct method.
As for precisely what is sent, it is anything that the form submit should send, which can very well include information outside of the UpdatePanel. The server then figures out what to work with and sends back a Response.
This is all well and good as theory, but you are dealing with problems not theory. What strangeness are you experiencing? If you can post, we can focus on the particulars of the problem.
The post that goes to the server contains pretty much all the information of the post, including the viewstate. The difference is on what is actually returned back to the browser.
To process the request, the full page is instantiated. If anything is updated outside the update panel, you can get some ugly errors.
Update 1: this is different from other ajax approaches, which send only the bit of info needed and don't use viewstate, e.g. the AutoCompleteExtender of the Ajax Control Toolkit. Look for JSON, ajax requests, and other related info.
It might work for you, but you are right to want to understand what is going on; that way you know when it is appropriate to use other solutions instead.
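To see what comes back, it can help to pick apart a partial-postback response captured with a sniffer. The sketch below assumes the response is a run of `length|type|id|content|` records, which is how such responses commonly appear in captured traffic but is not something stated in this thread; the sample string is hand-made, and `parse_delta` is a name chosen for the example.

```python
def parse_delta(text):
    """Split 'length|type|id|content|' records into (type, id, content) tuples."""
    records = []
    pos = 0
    while pos < len(text):
        bar = text.index("|", pos)
        length = int(text[pos:bar])       # number of characters in the content
        pos = bar + 1
        bar = text.index("|", pos)
        kind = text[pos:bar]              # e.g. updatePanel, hiddenField
        pos = bar + 1
        bar = text.index("|", pos)
        ident = text[pos:bar]             # e.g. the panel's id or the field name
        pos = bar + 1
        content = text[pos:pos + length]  # may itself contain '|'
        pos += length + 1                 # skip the content and its trailing '|'
        records.append((kind, ident, content))
    return records


# Hand-made sample, assuming the record layout described above.
SAMPLE = "11|updatePanel|UpdatePanel1|<b>Step</b>|8|hiddenField|__VIEWSTATE|dDwtNDU2|"

for kind, ident, content in parse_delta(SAMPLE):
    print(kind, ident, content)
```

Under that assumption, the panel's new markup and a refreshed `__VIEWSTATE` travel back together, which fits the description above: the whole form goes up, and only the changed pieces plus updated hidden state come down.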