I'm trying to scrape some pages on a website that uses ASPX forms. The forms involve adding details of people by updating the server (one person at a time) and then proceeding to a results page that shows information regarding the specified people. There are 5 steps to the process:
Hit the login page (the site is HTTPS) by sending a POST request with my credentials. The response will contain cookies that will be used to validate all subsequent requests.
Hit the search criteria page by sending a GET request (no parameters). The only purpose of this is to discover the __VIEWSTATE and __EVENTVALIDATION tokens in the HTML response to be used in the next step.
Update the server with a person. This involves hitting the same webpage in step 2 but using a POST request with form parameters that correspond to the form controls on the page for adding person details and their values. The form parameters will include the __VIEWSTATE and __EVENTVALIDATION tokens gained from the previous step. The server response will include a new __VIEWSTATE and __EVENTVALIDATION. This step can be repeated using the new __VIEWSTATE and __EVENTVALIDATION, or can proceed to the next step.
Signal to the server that all people have been added. This involves hitting the same page as the previous 2 steps by sending a POST request with form parameters that correspond to the form controls on the page for signalling that all people have been added. The server response will simply be 25|pageRedirect||/path/to/results.aspx|.
Hit the search results page specified in the redirect response from the previous step by sending a GET request (no parameters - cookies are enough). The server response will be the HTML that I need to scrape.
If I follow the process manually with any browser, filling in the form controls and clicking the buttons etc. (testing with just one person) I get to the results page and the results are fine. If I do this programmatically from an application running on my machine, then ultimately the search results HTML is wrong (the page returns valid HTML, but there are no results compared with the browser version and some null values were there should not be).
I've run this using a Java application with Apache HttpClient handling the requests. I've also tried it using a Ruby script with Mechanize handling the requests. I've setup a proxy server using Charles to intercept and examine all 5 HTTPS requests. Using Charles, I've scrutinized the raw requests (headers and body) and made comparisons between requests made using a browser and requests made using the application(s). They are all identical (except for the VIEWSTATE / EVENTVALIDATION values and session cookie values, which I would expect to differ).
A few additional points about the programmatic attempts:
The login step returns successful data, and the cookies are valid (otherwise the subsequent requests would all fail)
Updating the server with a person (step 3) returns successful responses, in that they are the same as would be returned from interaction using a browser. I can only assume this must mean the server is updating successfully with the person added.
A custom header is being added to requests in step 3 X-MicrosoftAjax: Delta=true (just like the browser requests are doing)
I don't own or have access to the server I'm scraping
Given that my application requests are identical to the browser requests that succeed, it baffles me that the server is treating them differently somehow. I can't help but feel that this is an ASP.net issue with forms that I'm overlooking. I'd appreciate any help.
Update:
I went over the raw requests again a bit more methodically, and it turns out I was missing something in the form parameters of the requests. Unfortunately, I don't think it will be of much use to anyone else, because it would seem to be specific to this particular ASP servers logic.
The POST request that notifies the server that all people have been added (step 4) requires two form parameters specifying the county and address of the last person that was added to the search. I was including these form parameters in my request, but the values were empty strings. I figured the browser request was just snagging these values because when the user hits the Continue button on the form, those controls would have the values of the last person added. I figured they wouldn't matter and forgot about them, but I was wrong.
It's a peculiar issue that I should have caught the first time. I can't complain though, I am scraping a site after all.
Review Charles logs again. It is possible that the search results and other content may be coming over via Ajax, and that your Java/Ruby apps are not actually doing all of the requests/responses that happen with the browser. Look for any POST or GET requests in between the requests you are already duplicating. If search results are populated via Javascript your client app may not be able to handle this?
I'm trying to manipulate a .net ASP form on a site that's using AJAX Control Toolkit. The site is only accessible to valid logins, and I do have a valid account. It consists of a search page with a form. Each time a submit button is clicked on the form, the server is updated using the values of some text fields on the form, and then the VIEWSTATE and EVENTVALIDATION tokens will be updated based on the response from the server, ready for the next request.
I'm using HttpClient in Java to do this. I suspect there's something I'm not doing correctly with regard to interacting with ASPX forms in general.
When I hit the main search page for the first time (cookies are validating my login with the server), I get the HTML for the search page back. I extract the VIEWSTATE and EVENTVALIDATION tokens for the next request. I've examined the exact form fields and their values that need to be sent to the server in a POST by looking at the Chrome debugger utility after making a request on the site manually. I've replicated them exactly as they should be, inserting the VIEWSTATE and EVENTVALIDATION appropriately.
But the response I get back from the server is not what it should be. What I get back is just the same HTML for the main search page that I get the first time I hit the webpage. The form data I'm using looks like this:
ctl00$ScriptManager1:ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$acceptButton
ctl00_ContentPlaceHolder1_TabContainer1_ClientState:{"ActiveTabIndex":0,"TabState":[true,true]}
__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:<token extracted from first page hit>
__VIEWSTATEENCRYPTED:
__EVENTVALIDATION:<token extracted from first page hit>
ctl00$ContentPlaceHolder1$LabelFee:0
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$RadioButtonList1:Person
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$snameText:aSurname
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$HiddenField1:
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$fnameText:aFirstname
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$dayFromTextBox:01
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$monthFromTextBox:January
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$yearFromTextBox:2001
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$dayToTextBox:01
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$monthToTextBox:January
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$yearToTextBox:2008
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$DropDownList1:aCity
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$PropText:
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel2$RefText:
__ASYNCPOST:true
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$acceptButton:Accept
I've also tried replicating the headers that the Chrome debugger shows, so my request is including the same Content-Type, Host, Origin, Referer, User-Agent (for my browser) and every other header, including this header X-MicrosoftAjax: Delta=true.
I know there's a lot of moving parts here, but I intentionally haven't mentioned how I'm actually making the POST request with the HttpClient lib because I'd don't want to complicate the question anymore or alienate anyone who doesn't know Java but knows ASP. I'd like to know if there's an ASP issue I'm not addressing, but I can post the Java code is necessary.
Edit:
I've checked the debugging info that HttpClient is outputting just before sending the request, and the form data is being added properly as multi-part form data. The headers are all there too.
This answer is a long shot, but I've seen weirder things.
You mention this header:
X-MicrosoftAjax: Delta=true
I did some deep googling and found that this is often shown as all lower case in dumps of Ajax and UpdatePanel POST requests:
x-microsoftajax: Delta=true
See here and here.
Could it be as simple as not casing the header correctly?
I eventually got this working. The problem was not specific to ASP in general, it was actually a problem with how Java (specifically HttpClient) was sending the request. I was using HttpClient to compile the request using multi-part form, but after using Fiddler to analyse and compare the requests (see the edited part of this question for more details on that) sent from both my application and the actual webpage, my app request was structured very differently.
The real website request had the form options embedded in the request body in what looked like a URL encoded query string. My request was a series of entries in the request body where each option was wrapped in the Content-Type and Content-Disposition headers. The requests succeeded after changing the POST to add the parameters like:
request.setEntity(new UrlEncodedFormEntity(paramList));
What is the difference between the following:
Server.transfer?
Response.Redirect?
Postbackurl?
When should I decide to use which?
Server.Transfer tells ASP.NET to redirect processing to another page within the same application. This happens completely server side. This is more "efficient" as it happens on the server side but there are some limitations with this method. The link below describes some of these.
Response.Redirect actually sends a HTTP 302 status code back to client in the response with a different location. The client is then responsible for following the new location. There is another round trip happening here.
PostBackUrl is not a "transfer method" but rather an property that tells the browser which URL to post the form to. By default the form will post back to itself on the server.
Here's a good link: http://haacked.com/archive/2004/10/06/responseredirectverseservertransfer.aspx
Server.Transer() works server-side. It will reply to the client with a different page than the client requested. If the client refreshes (F5), he will refresh the original page.
Response.Redirect() replies to the client that it should go to a different page. This requires an additional roundtrip, but the client will know about the redirect, so F5 will request the destination page.
PostbackUrl is a property telling an ASP control where to go when clicked on the client. This does not require an additional round trip while keeping the client informed. If you can use this method, it's generally preferable to the other choices.
Server.Transfer:
Transfers request from one page to other on server.
e.g. Browser request for /page1.aspx
Request comes on page1 where you do Server.Transfer("/page2.aspx") so request transfers to page2 And page2 returns in response but browser's address bar remains showing URL of /page1.aspx
Response.Redirect
This statements tells browser to request for next page. In this case browser's address bar also changes and shows new page URL
PostBackUrl
You can mention it on Buttons or Link buttons. This will submit the form to the page provided. It is similar to the:
<form method="post" action="/page2.aspx">
So I read an article on Response.Redirect that say's "its like a suggestion" as compared to Server.Transfer which happens whether the client wants to or not. What does this mean? Is there some kind of event that we can offer a user to say "well nope nm I don't want to redirect to that page"?
Server.Transfer will execute the new page immediately and send its result to the client.
The client stays at the original URL doesn't see anything unusual.
Response.Redirect sends a special code, called an HTTP 302, that tells the client to send a new request to a different URL.
Response.Redirect
Response.Redirect will navigate to the url specified.
From MSDN
Any response body content such as displayed HTML text or Response.Write text in the page indicated by the original URL is ignored. However, this method does send other HTTP headers set by this page indicated by the original URL to the client. An automatic response body containing the redirect URL as a link is generated. The Redirect method sends the following explicit header, where URL is the value passed to the method, as shown in the following code:
Server.Transfer
Server.Transfer will execute the url passed in but maintain the url transfered from.
From MSDN
When you use the Transfer method, the state information for all the built-in objects are included in the transfer. This means that any variables or objects that have been assigned a value in session or application scope are maintained. In addition, all of the current contents for the Request collections are available to the .asp file that is receiving the transfer.
Here's an article discussing and comparing the two.
It's "like a suggestion" in the sense that when the client requests a url that's redirected, the server responds with "that object has moved...find it here." When you do Server.Transfer, you immediately respond with the output of the page to which you're transferring.
There's nothing in either process you could hook into to give the user any choice...you'd have to respond with a 200 and implement the choice to the user through logic.
Response.Redirect involves a "roundtrip" to the server whereas Server.Transfer conserves server resources by "avoiding the roundtrip". It just changes the focus of the webserver to a different page and transfers the page processing to a different page.
What is difference between Server.Transfer and Response.Redirect?
What are advantages and disadvantages of each?
When is one appropriate over the other?
When is one not appropriate?
Response.Redirect simply sends a message (HTTP 302) down to the browser.
Server.Transfer happens without the browser knowing anything, the browser request a page, but the server returns the content of another.
Response.Redirect() will send you to a new page, update the address bar and add it to the Browser History. On your browser you can click back.
Server.Transfer() does not change the address bar. You cannot hit back.
I use Server.Transfer() when I don't want the user to see where I am going. Sometimes on a "loading" type page.
Otherwise I'll always use Response.Redirect().
To be Short: Response.Redirect simply tells the browser to visit another page. Server.Transfer helps reduce server requests, keeps the URL the same and, with a little bug-bashing, allows you to transfer the query string and form variables.
Something I found and agree with (source):
Server.Transfer is similar in that it sends the user to another page
with a statement such as Server.Transfer("WebForm2.aspx"). However,
the statement has a number of distinct advantages and disadvantages.
Firstly, transferring to another page using Server.Transfer
conserves server resources. Instead of telling the browser to
redirect, it simply changes the "focus" on the Web server and
transfers the request. This means you don't get quite as many HTTP
requests coming through, which therefore eases the pressure on your
Web server and makes your applications run faster.
But watch out: because the "transfer" process can work on only those
sites running on the server; you can't use Server.Transfer to send
the user to an external site. Only Response.Redirect can do that.
Secondly, Server.Transfer maintains the original URL in the browser.
This can really help streamline data entry techniques, although it may
make for confusion when debugging.
That's not all: The Server.Transfer method also has a second
parameter—"preserveForm". If you set this to True, using a statement
such as Server.Transfer("WebForm2.aspx", True), the existing query
string and any form variables will still be available to the page you
are transferring to.
For example, if your WebForm1.aspx has a TextBox control called
TextBox1 and you transferred to WebForm2.aspx with the preserveForm
parameter set to True, you'd be able to retrieve the value of the
original page TextBox control by referencing
Request.Form("TextBox1").
Response.Redirect() should be used when:
we want to redirect the request to some plain HTML pages on our server or to some other web server
we don't care about causing additional roundtrips to the server on each request
we do not need to preserve Query String and Form Variables from the original request
we want our users to be able to see the new redirected URL where he is redirected in his browser (and be able to bookmark it if its necessary)
Server.Transfer() should be used when:
we want to transfer current page request to another .aspx page on the same server
we want to preserve server resources and avoid the unnecessary roundtrips to the server
we want to preserve Query String and Form Variables (optionally)
we don't need to show the real URL where we redirected the request in the users Web Browser
Response.Redirect redirects page to another page after first page arrives to client. So client knows the redirection.
Server.Transfer quits current execution of the page. Client does not know the redirection. It allows you to transfer the query string and form variables.
So it depends to your needs to choose which is better.
"response.redirect" and "server.transfer" helps to transfer user from one page to other page while the page is executing. But the way they do this transfer / redirect is very different.
In case you are visual guy and would like see demonstration rather than theory I would suggest to see the below facebook video which explains the difference in a more demonstrative way.
https://www.facebook.com/photo.php?v=762186150488997
The main difference between them is who does the transfer. In "response.redirect" the transfer is done by the browser while in "server.transfer" it’s done by the server. Let us try to understand this statement in a more detail manner.
In "Server.Transfer" following is the sequence of how transfer happens:-
1.User sends a request to an ASP.NET page. In the below figure the request is sent to "WebForm1" and we would like to navigate to "Webform2".
2.Server starts executing "Webform1" and the life cycle of the page starts. But before the complete life cycle of the page is completed “Server.transfer” happens to "WebForm2".
3."Webform2" page object is created, full page life cycle is executed and output HTML response is then sent to the browser.
While in "Response.Redirect" following is the sequence of events for navigation:-
1.Client (browser) sends a request to a page. In the below figure the request is sent to "WebForm1" and we would like to navigate to "Webform2".
2.Life cycle of "Webform1" starts executing. But in between of the life cycle "Response.Redirect" happens.
3.Now rather than server doing a redirect , he sends a HTTP 302 command to the browser. This command tells the browser that he has to initiate a GET request to "Webform2.aspx" page.
4.Browser interprets the 302 command and sends a GET request for "Webform2.aspx".
In other words "Server.Transfer" is executed by the server while "Response.Redirect" is executed by thr browser. "Response.Redirect" needs to two requests to do a redirect of the page.
So when to use "Server.Transfer" and when to use "Response.Redirect" ?
Use "Server.Transfer" when you want to navigate pages which reside on the same server, use "Response.Redirect" when you want to navigate between pages which resides on different server and domain.
Below is a summary table of which chalks out differences and in which scenario to use.
The beauty of Server.Transfer is what you can do with it:
TextBox myTxt = (TextBox)this.Page.PreviousPage.FindControl("TextBoxID");
You can get anything from your previous page using the above method as long as you use Server.Transfer but not Response.Redirect
In addition to ScarletGarden's comment, you also need to consider the impact of search engines and your redirect. Has this page moved permanently? Temporarily? It makes a difference.
see: Response.Redirect vs. "301 Moved Permanently":
We've all used Response.Redirect at
one time or another. It's the quick
and easy way to get visitors pointed
in the right direction if they somehow
end up in the wrong place. But did you
know that Response.Redirect sends an
HTTP response status code of "302
Found" when you might really want to
send "301 Moved Permanently"?
The distinction seems small, but in
certain cases it can actually make a
big difference. For example, if you
use a "301 Moved Permanently" response
code, most search engines will remove
the outdated link from their index and
replace it with the new one. If you
use "302 Found", they'll continue
returning to the old page...
There are many differences as specified above. Apart from above all, there is one more difference. Response.Redirect() can be used to redirect user to any page which is not part of the application but Server.Transfer() can only be used to redirect user within the application.
//This will work.
Response.Redirect("http://www.google.com");
//This will not work.
Server.Transfer("http://www.google.com");
Transfer is entirely server-side. Client address bar stays constant. Some complexity about the transfer of context between requests. Flushing and restarting page handlers can be expensive so do your transfer early in the pipeline e.g. in an HttpModule during BeginRequest. Read the MSDN docs carefully, and test and understand the new values of HttpContext.Request - especially in Postback scenarios. We usually use Server.Transfer for error scenarios.
Redirect terminates the request with a 302 status and client-side roundtrip response with and internally eats an exception (minor server perf hit - depends how many you do a day) Client then navigates to new address. Browser address bar & history updates etc. Client pays the cost of an extra roundtrip - cost varies depending on latency. In our business we redirect a lot we wrote our own module to avoid the exception cost.
Response.Redirect is more costly since it adds an extra trip to the server to figure out where to go.
Server.Transfer is more efficient however it can be a little mis-leading to the user since the Url doesn't physically change.
In my experience, the difference in performance has not been significant enough to use the latter approach
Server.Transfer doesn't change the URL in the client browser, so effectively the browser does not know you changed to another server-side handler. Response.Redirect tells the browser to move to a different page, so the url in the titlebar changes.
Server.Transfer is slightly faster since it avoids one roundtrip to the server, but the non-change of url may be either good or bad for you, depending on what you're trying to do.
Response.Redirect: tells the browser that the requested page can be found at a new location. The browser then initiates another request to the new page loading its contents in the browser. This results in two requests by the browser.
Server.Transfer: It transfers execution from the first page to the second page on the server. As far as the browser client is concerned, it made one request and the initial page is the one responding with content.
The benefit of this approach is one less round trip to the server from the client browser. Also, any posted form variables and query string parameters are available to the second page as well.
Just more details about Transfer(), it's actually is Server.Execute() + Response.End(), its source code is below (from Mono/.net 4.0):
public void Transfer (string path, bool preserveForm)
{
this.Execute (path, null, preserveForm, true);
this.context.Response.End ();
}
and for Execute(), what it is to run is the handler of the given path, see
ASP.NET does not verify that the current user is authorized to view the resource delivered by the Execute method. Although the ASP.NET authorization and authentication logic runs before the original resource handler is called, ASP.NET directly calls the handler indicated by the Execute method and does not rerun authentication and authorization logic for the new resource. If your application's security policy requires clients to have appropriate authorization to access the resource, the application should force reauthorization or provide a custom access-control mechanism.
You can force reauthorization by using the Redirect method instead of the Execute method. Redirect performs a client-side redirect in which the browser requests the new resource. Because this redirect is a new request entering the system, it is subjected to all the authentication and authorization logic of both Internet Information Services (IIS) and ASP.NET security policy.
-from MSDN
Response.Redirect involves an extra round trip and updates the address bar.
Server.Transfer does not cause the address bar to change, the server responds to the request with content from another page
e.g.
Response.Redirect:-
On the client the browser requests a page http://InitiallyRequestedPage.aspx
On the server responds to the request with 302 passing the redirect address http://AnotherPage.aspx.
On the client the browser makes a second request to the address http://AnotherPage.aspx.
On the server responds with content from http://AnotherPage.aspx
Server.Transfer:-
On the client browser requests a page http://InitiallyRequestedPage.aspx
On the server Server.Transfer to http://AnotherPage.aspx
On the server the response is made to the request for http://InitiallyRequestedPage.aspx passing back content from http://AnotherPage.aspx
Response.Redirect
Pros:-
RESTful - It changes the address bar, the address can be used to record changes of state inbetween requests.
Cons:-
Slow - There is an extra round-trip between the client and server. This can be expensive when there is substantial latency between the client and the server.
Server.Transfer
Pros:-
Quick.
Cons:-
State lost - If you're using Server.Transfer to change the state of the application in response to post backs, if the page is then reloaded that state will be lost, as the address bar will be the same as it was on the first request.
Response.Redirect
Response.Redirect() will send you to a new page, update the address bar and add it to the Browser History. On your browser you can click back.
It redirects the request to some plain HTML pages on our server or to some other web server.
It causes additional roundtrips to the server on each request.
It doesn’t preserve Query String and Form Variables from the original request.
It enables to see the new redirected URL where it is redirected in the browser (and be able to bookmark it if it’s necessary).
Response. Redirect simply sends a message down to the (HTTP 302) browser.
Server.Transfer
Server.Transfer() does not change the address bar, we cannot hit back.One should use Server.Transfer() when he/she doesn’t want the user to see where he is going. Sometime on a "loading" type page.
It transfers current page request to another .aspx page on the same server.
It preserves server resources and avoids the unnecessary roundtrips to the server.
It preserves Query String and Form Variables (optionally).
It doesn’t show the real URL where it redirects the request in the users Web Browser.
Server.Transfer happens without the browser knowing anything, the browser request a page, but the server returns the content of another.