I'm trying to scrape some pages on a website that uses ASPX forms. The forms involve adding details of people by updating the server (one person at a time) and then proceeding to a results page that shows information regarding the specified people. There are 5 steps to the process:
1. Hit the login page (the site is HTTPS) by sending a POST request with my credentials. The response will contain cookies that will be used to validate all subsequent requests.
2. Hit the search criteria page by sending a GET request (no parameters). The only purpose of this is to discover the __VIEWSTATE and __EVENTVALIDATION tokens in the HTML response, to be used in the next step.
3. Update the server with a person. This involves hitting the same webpage as in step 2, but using a POST request with form parameters that correspond to the form controls on the page for adding person details, and their values. The form parameters include the __VIEWSTATE and __EVENTVALIDATION tokens obtained in the previous step. The server response includes a new __VIEWSTATE and __EVENTVALIDATION. This step can be repeated using the new __VIEWSTATE and __EVENTVALIDATION, or I can proceed to the next step.
4. Signal to the server that all people have been added. This involves hitting the same page as in the previous two steps, sending a POST request with form parameters that correspond to the form controls for signalling that all people have been added. The server response is simply 25|pageRedirect||/path/to/results.aspx|.
5. Hit the search results page specified in the redirect response from the previous step by sending a GET request (no parameters - the cookies are enough). The server response is the HTML that I need to scrape.
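To make the flow concrete, here it is as a rough sketch. My actual attempts use Java's HttpClient and Ruby's Mechanize; this is just an illustrative Python requests version with placeholder URLs, control names and token parsing (and note that step 3's real response is an ASP.NET AJAX delta, so token extraction there would have to parse that pipe-delimited payload instead of hidden inputs):

import re
import requests

session = requests.Session()  # carries the session cookies across all five steps

# Step 1: log in over HTTPS (placeholder URL and field names)
session.post("https://example.com/Login.aspx",
             data={"username": "me", "password": "secret"})

def extract_tokens(html):
    # pull the current __VIEWSTATE / __EVENTVALIDATION values out of a full-page response
    viewstate = re.search(r'id="__VIEWSTATE" value="([^"]*)"', html).group(1)
    eventvalidation = re.search(r'id="__EVENTVALIDATION" value="([^"]*)"', html).group(1)
    return viewstate, eventvalidation

# Step 2: GET the search criteria page purely to discover the tokens
resp = session.get("https://example.com/Search.aspx")
viewstate, eventvalidation = extract_tokens(resp.text)

# Step 3: POST one person; repeat with the refreshed tokens for each further person
resp = session.post("https://example.com/Search.aspx",
                    headers={"X-MicrosoftAjax": "Delta=true"},
                    data={"__VIEWSTATE": viewstate,
                          "__EVENTVALIDATION": eventvalidation,
                          "snameText": "aSurname",        # placeholder control names
                          "fnameText": "aFirstname"})
# NOTE: the async response is a pipe-delimited delta, so real token extraction here
# would parse that payload rather than hidden <input> fields
viewstate, eventvalidation = extract_tokens(resp.text)

# Step 4: signal that all people have been added;
# the body comes back as 25|pageRedirect||/path/to/results.aspx|
resp = session.post("https://example.com/Search.aspx",
                    data={"__VIEWSTATE": viewstate,
                          "__EVENTVALIDATION": eventvalidation,
                          "continueButton": "Continue"})  # placeholder control name
results_path = resp.text.split("|")[3]

# Step 5: GET the results page named in the redirect payload and scrape it
results_html = session.get("https://example.com" + results_path).text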
If I follow the process manually in any browser, filling in the form controls and clicking the buttons etc. (testing with just one person), I get to the results page and the results are fine. If I do this programmatically from an application running on my machine, then ultimately the search results HTML is wrong (the page returns valid HTML, but there are no results compared with the browser version, and there are null values where there should not be).
I've run this using a Java application with Apache HttpClient handling the requests. I've also tried it using a Ruby script with Mechanize handling the requests. I've set up a proxy server using Charles to intercept and examine all 5 HTTPS requests. Using Charles, I've scrutinized the raw requests (headers and body) and made comparisons between requests made using a browser and requests made using the application(s). They are all identical (except for the VIEWSTATE / EVENTVALIDATION values and session cookie values, which I would expect to differ).
A few additional points about the programmatic attempts:
The login step returns successful data, and the cookies are valid (otherwise the subsequent requests would all fail)
Updating the server with a person (step 3) returns successful responses, in that they are the same as would be returned from interaction using a browser. I can only assume this must mean the server is updating successfully with the person added.
A custom header, X-MicrosoftAjax: Delta=true, is being added to the requests in step 3 (just as the browser requests do)
I don't own or have access to the server I'm scraping
Given that my application requests are identical to the browser requests that succeed, it baffles me that the server is treating them differently somehow. I can't help but feel that this is an ASP.NET forms issue that I'm overlooking. I'd appreciate any help.
Update:
I went over the raw requests again a bit more methodically, and it turns out I was missing something in the form parameters of the requests. Unfortunately, I don't think it will be of much use to anyone else, because it seems to be specific to this particular ASP server's logic.
The POST request that notifies the server that all people have been added (step 4) requires two form parameters specifying the county and address of the last person that was added to the search. I was including these form parameters in my request, but the values were empty strings. I figured the browser request was just snagging these values because when the user hits the Continue button on the form, those controls would have the values of the last person added. I figured they wouldn't matter and forgot about them, but I was wrong.
It's a peculiar issue that I should have caught the first time. I can't complain though, I am scraping a site after all.
Review the Charles logs again. It is possible that the search results and other content are coming over via Ajax, and that your Java/Ruby apps are not actually making all of the requests/responses that happen with the browser. Look for any POST or GET requests in between the requests you are already duplicating. If the search results are populated via JavaScript, your client app may not be able to handle this.
I would like to try sending requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different ways, but still failed.
All other websites are OK.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
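For example, a quick sketch of that workflow (httpbin just echoes back what it received, so you can diff it against the browser capture):

import requests

# httpbin echoes the request back, so you can see exactly what was sent
resp = requests.get("https://httpbin.org/anything",
                    params={"q": "test"},
                    headers={"User-Agent": "my-script/0.1"})
echo = resp.json()
print(echo["headers"])  # the headers as the server saw them
print(echo["args"])     # the query-string parameters as the server saw them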
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supply credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case the site is filtering on the user agent; it looks like they are blacklisting Python. Setting the User-Agent header to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
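A minimal sketch of that last option (assuming requests-html is installed; it downloads its own Chromium build the first time you render):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://rent.591.com.tw", headers={"User-Agent": "Custom"})
r.html.render()                               # runs the page's JavaScript in headless Chromium
print(r.html.find("title", first=True).text)  # now reflects the script-rendered DOM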
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
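A typical shape for that, as a sketch (the form URL and the token field name here are hypothetical; frameworks use different names such as __RequestVerificationToken or csrfmiddlewaretoken):

import requests
from bs4 import BeautifulSoup

with requests.Session() as s:                      # the Session keeps cookies between the two requests
    form_page = s.get("https://example.com/login")
    token = BeautifulSoup(form_page.text, "html.parser").find(
        "input", {"name": "csrf_token"})["value"]  # hypothetical hidden-field name
    s.post("https://example.com/login",
           data={"username": "me", "password": "secret", "csrf_token": token})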
Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they would rather have you use. Check for either, and take into consideration that you may be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to the fact that the website address had recently changed, and I was provided with the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)
Is there a way to observe an HTTP request in the browser, save that request (header data and parameters), and simulate the same request in code?
What I want is to "simulate" a browser in my project, to get the same response back as if the user were using a normal browser.
I don't know exactly how to ask the question correctly, but what I want is to simulate the authentication on some websites and scrape the same data as when I am in the browser.
What I wanted was to crawl a website, which is secured with authentication, using simple HTTP requests and building the request headers in my code. And it was not only about sending a POST request with name + password, but also some other hidden parameters which are first generated when a user visits the website - on the client side with JavaScript.
Maybe it is possible to understand the algorithm behind generating those hidden parameters, but it can take a long time because of the complexity.
The best way to crawl a website in an automated way without caring about the correct headers is to use a "headless" browser, which is nothing other than a normal browser without a GUI. You can control it in your code. A list of those headless browsers can be found here.
So no need for observing and recording the request and simulating it in code - just use a headless browser.
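As an illustration, driving headless Chrome from Python with Selenium looks roughly like this (a sketch: it assumes selenium and a matching ChromeDriver are installed, and the URL and field names are placeholders):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")              # a normal browser, just without a GUI
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/login")
driver.find_element(By.NAME, "username").send_keys("me")
driver.find_element(By.NAME, "password").send_keys("secret")
driver.find_element(By.NAME, "submit").click()  # the browser runs the client-side JavaScript itself

html = driver.page_source                       # the fully rendered, authenticated page
driver.quit()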
I'm trying to manipulate a .net ASP form on a site that's using AJAX Control Toolkit. The site is only accessible to valid logins, and I do have a valid account. It consists of a search page with a form. Each time a submit button is clicked on the form, the server is updated using the values of some text fields on the form, and then the VIEWSTATE and EVENTVALIDATION tokens will be updated based on the response from the server, ready for the next request.
I'm using HttpClient in Java to do this. I suspect there's something I'm not doing correctly with regard to interacting with ASPX forms in general.
When I hit the main search page for the first time (cookies are validating my login with the server), I get the HTML for the search page back. I extract the VIEWSTATE and EVENTVALIDATION tokens for the next request. I've examined the exact form fields and their values that need to be sent to the server in a POST by looking at the Chrome debugger utility after making a request on the site manually. I've replicated them exactly as they should be, inserting the VIEWSTATE and EVENTVALIDATION appropriately.
But the response I get back from the server is not what it should be. What I get back is just the same HTML for the main search page that I get the first time I hit the webpage. The form data I'm using looks like this:
ctl00$ScriptManager1:ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$acceptButton
ctl00_ContentPlaceHolder1_TabContainer1_ClientState:{"ActiveTabIndex":0,"TabState":[true,true]}
__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:<token extracted from first page hit>
__VIEWSTATEENCRYPTED:
__EVENTVALIDATION:<token extracted from first page hit>
ctl00$ContentPlaceHolder1$LabelFee:0
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$RadioButtonList1:Person
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$snameText:aSurname
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$HiddenField1:
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$fnameText:aFirstname
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$dayFromTextBox:01
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$monthFromTextBox:January
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$yearFromTextBox:2001
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$dayToTextBox:01
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$monthToTextBox:January
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$yearToTextBox:2008
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$DropDownList1:aCity
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$PropText:
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel2$RefText:
__ASYNCPOST:true
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$acceptButton:Accept
I've also tried replicating the headers that the Chrome debugger shows, so my request is including the same Content-Type, Host, Origin, Referer, User-Agent (for my browser) and every other header, including this header X-MicrosoftAjax: Delta=true.
I know there's a lot of moving parts here, but I intentionally haven't mentioned how I'm actually making the POST request with the HttpClient lib, because I don't want to complicate the question any more or alienate anyone who doesn't know Java but knows ASP. I'd like to know if there's an ASP issue I'm not addressing, but I can post the Java code if necessary.
Edit:
I've checked the debugging info that HttpClient is outputting just before sending the request, and the form data is being added properly as multi-part form data. The headers are all there too.
This answer is a long shot, but I've seen weirder things.
You mention this header:
X-MicrosoftAjax: Delta=true
I did some deep googling and found that this is often shown as all lower case in dumps of Ajax and UpdatePanel POST requests:
x-microsoftajax: Delta=true
See here and here.
Could it be as simple as not casing the header correctly?
I eventually got this working. The problem was not specific to ASP in general; it was actually a problem with how Java (specifically HttpClient) was sending the request. I was using HttpClient to compile the request as a multi-part form, but after using Fiddler to analyse and compare the requests (see the edited part of this question for more details on that) sent from both my application and the actual webpage, my app's request was structured very differently.
The real website request had the form options embedded in the request body in what looked like a URL encoded query string. My request was a series of entries in the request body where each option was wrapped in the Content-Type and Content-Disposition headers. The requests succeeded after changing the POST to add the parameters like:
request.setEntity(new UrlEncodedFormEntity(paramList));
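To illustrate the difference in a language-neutral way (the actual fix above is the Java HttpClient line; this is just a Python requests sketch with a placeholder URL and fields): sending the fields as form data produces the URL-encoded body the ASP.NET page expects, while a multipart body gives each field its own Content-Disposition part, which is the shape the failing requests had.

import requests

url = "https://example.com/Search.aspx"                 # placeholder
form = {"__EVENTTARGET": "", "snameText": "aSurname"}   # placeholder fields

# application/x-www-form-urlencoded body - the shape the ASP.NET page accepted
requests.post(url, data=form)

# multipart/form-data body, one Content-Disposition part per field -
# the shape the original multi-part request had
requests.post(url, files={name: (None, value) for name, value in form.items()})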
I am trying to simulate a form submission on an ASPX.NET site.
The flow of the website when accessed in a browser is as follows:
1) In a browser the user visits http://mysite.com/ which is configured with Basic Authentication
2) Upon correct credentials, the user is shown a form with one input text box and a button (URL stays http://mysite.com/ but the form being served is Default.aspx)
3) User enters some text and presses submit...
4) The page reloads... URL is still http://mysite.com/... but there is a timer which triggers after 10 secs and downloads a file from http://mysite.com/Downloader
I am trying to simulate this flow in my program using HTTPClient.
1) Do a GET on http://mysite.com
2) Extract hidden form fields __EVENTVALIDATION and __VIEWSTATE
3) Create a POST request with the above two and other form fields, and POST it to http://mysite.com. This RESULTS in an Invalid Viewstate exception.
How do I achieve this in HTTPClient?
The usual way to do this is as follows: First, record the HTTP traffic using WireShark or Fiddler while you are using the website from the browser. Second, analyze the packet trace in detail, and collect every HTTP header and every HTTP payload from every GET and POST message sent by the browser. Third, try to send the same messages from your code. After sending an HTTP request, you will have to analyze the response of the server, and extract all pieces of data you need to insert into the next request. Don't forget to set the referer field, for example. Add each request to your code one by one, and record the traffic when you run the code. If you assemble your HTTP requests correctly, then your request packets should look like the requests of the browser.
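As a sketch of how the replayed requests might look once you have the trace (shown with Python requests for brevity, since the exact Java HttpClient calls vary by version; the URL is from the question, the field names are placeholders):

import re
import requests

s = requests.Session()
s.auth = ("user", "pass")                      # the site's Basic Authentication credentials

# GET the form and pull out the hidden fields seen in the trace
page = s.get("http://mysite.com/").text
viewstate = re.search(r'id="__VIEWSTATE" value="([^"]*)"', page).group(1)
eventvalidation = re.search(r'id="__EVENTVALIDATION" value="([^"]*)"', page).group(1)

# POST the form back with the same headers and fields the browser sent
s.post("http://mysite.com/",
       headers={"Referer": "http://mysite.com/"},
       data={"__VIEWSTATE": viewstate,
             "__EVENTVALIDATION": eventvalidation,
             "inputTextBox": "some text",      # placeholder control names
             "submitButton": "Submit"})

# the page's 10-second timer would then fetch the file from /Downloader
download = s.get("http://mysite.com/Downloader")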
I'm in the same scenario: I have to create a POST request to an external ASPX page.
I have captured the traffic using Fiddler and tried to simulate the call using an online POST request tool like https://www.codepunker.com.
I have not been able to recreate the request...
In my opinion (and this requires time) we have to:
Create a basic webrequest to the source form
Collect all the form elements with value
Create a POST request submitting all the elements including VIEWSTATE
NOTE: it may be that you need to use a web client that accepts cookies; check:
Accept Cookies in WebClient?
Good luck
What is difference between Server.Transfer and Response.Redirect?
What are advantages and disadvantages of each?
When is one appropriate over the other?
When is one not appropriate?
Response.Redirect simply sends a message (HTTP 302) down to the browser.
Server.Transfer happens without the browser knowing anything: the browser requests a page, but the server returns the content of another.
Response.Redirect() will send you to a new page, update the address bar and add it to the Browser History. On your browser you can click back.
Server.Transfer() does not change the address bar. You cannot hit back.
I use Server.Transfer() when I don't want the user to see where I am going. Sometimes on a "loading" type page.
Otherwise I'll always use Response.Redirect().
To be Short: Response.Redirect simply tells the browser to visit another page. Server.Transfer helps reduce server requests, keeps the URL the same and, with a little bug-bashing, allows you to transfer the query string and form variables.
Something I found and agree with (source):
Server.Transfer is similar in that it sends the user to another page
with a statement such as Server.Transfer("WebForm2.aspx"). However,
the statement has a number of distinct advantages and disadvantages.
Firstly, transferring to another page using Server.Transfer
conserves server resources. Instead of telling the browser to
redirect, it simply changes the "focus" on the Web server and
transfers the request. This means you don't get quite as many HTTP
requests coming through, which therefore eases the pressure on your
Web server and makes your applications run faster.
But watch out: because the "transfer" process can work on only those
sites running on the server; you can't use Server.Transfer to send
the user to an external site. Only Response.Redirect can do that.
Secondly, Server.Transfer maintains the original URL in the browser.
This can really help streamline data entry techniques, although it may
make for confusion when debugging.
That's not all: The Server.Transfer method also has a second
parameter—"preserveForm". If you set this to True, using a statement
such as Server.Transfer("WebForm2.aspx", True), the existing query
string and any form variables will still be available to the page you
are transferring to.
For example, if your WebForm1.aspx has a TextBox control called
TextBox1 and you transferred to WebForm2.aspx with the preserveForm
parameter set to True, you'd be able to retrieve the value of the
original page TextBox control by referencing
Request.Form("TextBox1").
Response.Redirect() should be used when:
we want to redirect the request to some plain HTML pages on our server or to some other web server
we don't care about causing additional roundtrips to the server on each request
we do not need to preserve Query String and Form Variables from the original request
we want our users to be able to see the new redirected URL in their browser (and be able to bookmark it if necessary)
Server.Transfer() should be used when:
we want to transfer current page request to another .aspx page on the same server
we want to preserve server resources and avoid the unnecessary roundtrips to the server
we want to preserve Query String and Form Variables (optionally)
we don't need to show the real URL to which we redirected the request in the user's web browser
Response.Redirect redirects the page to another page after the first page arrives at the client, so the client knows about the redirection.
Server.Transfer quits the current execution of the page. The client does not know about the redirection. It allows you to transfer the query string and form variables.
So it depends on your needs which is better.
"response.redirect" and "server.transfer" helps to transfer user from one page to other page while the page is executing. But the way they do this transfer / redirect is very different.
In case you are visual guy and would like see demonstration rather than theory I would suggest to see the below facebook video which explains the difference in a more demonstrative way.
https://www.facebook.com/photo.php?v=762186150488997
The main difference between them is who does the transfer. In "Response.Redirect" the transfer is done by the browser, while in "Server.Transfer" it is done by the server. Let us try to understand this statement in more detail.
In "Server.Transfer" following is the sequence of how transfer happens:-
1.User sends a request to an ASP.NET page. In the below figure the request is sent to "WebForm1" and we would like to navigate to "Webform2".
2.Server starts executing "Webform1" and the life cycle of the page starts. But before the complete life cycle of the page is completed “Server.transfer” happens to "WebForm2".
3."Webform2" page object is created, full page life cycle is executed and output HTML response is then sent to the browser.
While in "Response.Redirect" following is the sequence of events for navigation:-
1.Client (browser) sends a request to a page. In the below figure the request is sent to "WebForm1" and we would like to navigate to "Webform2".
2.Life cycle of "Webform1" starts executing. But in between of the life cycle "Response.Redirect" happens.
3.Now rather than server doing a redirect , he sends a HTTP 302 command to the browser. This command tells the browser that he has to initiate a GET request to "Webform2.aspx" page.
4.Browser interprets the 302 command and sends a GET request for "Webform2.aspx".
In other words "Server.Transfer" is executed by the server while "Response.Redirect" is executed by thr browser. "Response.Redirect" needs to two requests to do a redirect of the page.
So when to use "Server.Transfer" and when to use "Response.Redirect" ?
Use "Server.Transfer" when you want to navigate pages which reside on the same server, use "Response.Redirect" when you want to navigate between pages which resides on different server and domain.
The beauty of Server.Transfer is what you can do with it:
TextBox myTxt = (TextBox)this.Page.PreviousPage.FindControl("TextBoxID");
You can get anything from your previous page using the above method, as long as you used Server.Transfer and not Response.Redirect.
In addition to ScarletGarden's comment, you also need to consider the impact of search engines and your redirect. Has this page moved permanently? Temporarily? It makes a difference.
see: Response.Redirect vs. "301 Moved Permanently":
We've all used Response.Redirect at
one time or another. It's the quick
and easy way to get visitors pointed
in the right direction if they somehow
end up in the wrong place. But did you
know that Response.Redirect sends an
HTTP response status code of "302
Found" when you might really want to
send "301 Moved Permanently"?
The distinction seems small, but in
certain cases it can actually make a
big difference. For example, if you
use a "301 Moved Permanently" response
code, most search engines will remove
the outdated link from their index and
replace it with the new one. If you
use "302 Found", they'll continue
returning to the old page...
There are many differences, as specified above. Apart from all of the above, there is one more difference: Response.Redirect() can be used to redirect the user to any page which is not part of the application, but Server.Transfer() can only be used to redirect the user within the application.
//This will work.
Response.Redirect("http://www.google.com");
//This will not work.
Server.Transfer("http://www.google.com");
Transfer is entirely server-side. Client address bar stays constant. Some complexity about the transfer of context between requests. Flushing and restarting page handlers can be expensive so do your transfer early in the pipeline e.g. in an HttpModule during BeginRequest. Read the MSDN docs carefully, and test and understand the new values of HttpContext.Request - especially in Postback scenarios. We usually use Server.Transfer for error scenarios.
Redirect terminates the request with a 302 status and a client-side roundtrip, internally eating an exception (a minor server perf hit - depends how many you do a day). The client then navigates to the new address. The browser address bar & history update, etc. The client pays the cost of an extra roundtrip - the cost varies depending on latency. In our business we redirect a lot, so we wrote our own module to avoid the exception cost.
Response.Redirect is more costly since it adds an extra trip to the server to figure out where to go.
Server.Transfer is more efficient; however, it can be a little misleading to the user since the URL doesn't change.
In my experience, the difference in performance has not been significant enough to use the latter approach.
Server.Transfer doesn't change the URL in the client browser, so effectively the browser does not know you changed to another server-side handler. Response.Redirect tells the browser to move to a different page, so the url in the titlebar changes.
Server.Transfer is slightly faster since it avoids one roundtrip to the server, but the non-change of url may be either good or bad for you, depending on what you're trying to do.
Response.Redirect: tells the browser that the requested page can be found at a new location. The browser then initiates another request to the new page loading its contents in the browser. This results in two requests by the browser.
Server.Transfer: It transfers execution from the first page to the second page on the server. As far as the browser client is concerned, it made one request and the initial page is the one responding with content.
The benefit of this approach is one less round trip to the server from the client browser. Also, any posted form variables and query string parameters are available to the second page as well.
Just more details about Transfer(): it actually is Server.Execute() + Response.End(). Its source code is below (from Mono/.NET 4.0):
public void Transfer (string path, bool preserveForm)
{
    this.Execute (path, null, preserveForm, true);
    this.context.Response.End ();
}
and as for Execute(), what it runs is the handler of the given path; see:
ASP.NET does not verify that the current user is authorized to view the resource delivered by the Execute method. Although the ASP.NET authorization and authentication logic runs before the original resource handler is called, ASP.NET directly calls the handler indicated by the Execute method and does not rerun authentication and authorization logic for the new resource. If your application's security policy requires clients to have appropriate authorization to access the resource, the application should force reauthorization or provide a custom access-control mechanism.
You can force reauthorization by using the Redirect method instead of the Execute method. Redirect performs a client-side redirect in which the browser requests the new resource. Because this redirect is a new request entering the system, it is subjected to all the authentication and authorization logic of both Internet Information Services (IIS) and ASP.NET security policy.
-from MSDN
Response.Redirect involves an extra round trip and updates the address bar.
Server.Transfer does not cause the address bar to change, the server responds to the request with content from another page
e.g.
Response.Redirect:
On the client, the browser requests a page, http://InitiallyRequestedPage.aspx.
The server responds to the request with a 302, passing the redirect address http://AnotherPage.aspx.
On the client, the browser makes a second request to the address http://AnotherPage.aspx.
The server responds with content from http://AnotherPage.aspx.
Server.Transfer:
On the client, the browser requests a page, http://InitiallyRequestedPage.aspx.
On the server, Server.Transfer to http://AnotherPage.aspx happens.
On the server, the response is made to the request for http://InitiallyRequestedPage.aspx, passing back content from http://AnotherPage.aspx.
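Seen from a plain HTTP client, the two flows above are easy to tell apart (a sketch with the placeholder URLs used above): a Response.Redirect surfaces as a 302 plus a second request, while a Server.Transfer is a single 200 whose body already comes from the other page.

import requests

# Response.Redirect: the 302 and its Location header are visible to the client,
# and a second request is needed to fetch the final content
r = requests.get("http://example.com/InitiallyRequestedPage.aspx", allow_redirects=False)
print(r.status_code, r.headers.get("Location"))   # e.g. 302 /AnotherPage.aspx

# Server.Transfer: one request, one 200; the URL never changes,
# but the body is already the output of the page the server transferred to
r = requests.get("http://example.com/InitiallyRequestedPage.aspx")
print(r.status_code, r.url)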
Response.Redirect
Pros:-
RESTful - it changes the address bar, and the address can be used to record changes of state in between requests.
Cons:-
Slow - There is an extra round-trip between the client and server. This can be expensive when there is substantial latency between the client and the server.
Server.Transfer
Pros:-
Quick.
Cons:-
State lost - if you're using Server.Transfer to change the state of the application in response to postbacks and the page is then reloaded, that state will be lost, as the address bar will be the same as it was on the first request.
Response.Redirect
Response.Redirect() will send you to a new page, update the address bar and add it to the Browser History. On your browser you can click back.
It redirects the request to some plain HTML pages on our server or to some other web server.
It causes additional roundtrips to the server on each request.
It doesn’t preserve Query String and Form Variables from the original request.
It enables to see the new redirected URL where it is redirected in the browser (and be able to bookmark it if it’s necessary).
Response.Redirect simply sends a message (HTTP 302) down to the browser.
Server.Transfer
Server.Transfer() does not change the address bar, and we cannot hit back. One should use Server.Transfer() when they don't want the user to see where they are going, for example on a "loading" type page.
It transfers current page request to another .aspx page on the same server.
It preserves server resources and avoids the unnecessary roundtrips to the server.
It preserves Query String and Form Variables (optionally).
It doesn’t show the real URL where it redirects the request in the users Web Browser.