Where should HTTP headers be set?

In a web application using an MVC layout, should HTTP Headers be set in the controller or the view? My thoughts:
Controller: Setting the header here seems appropriate, as this is part of taking a request, and setting necessary variables to handle it on the server side.
View: An HTTP header is really just a few lines of text above the rest of the content being served up, and that text is arguably the view.
I wouldn't gasp to see headers set in either location. What is the best practice?

The view’s responsibility is anything that is sent to the user. The format of the content doesn’t matter. The view doesn’t know how that content will be parsed – in a web browser, a console, Lynx …
An example: you want to debug your AJAX requests and send data about the inner processes to the browser. You don’t want to mangle that information into your DOM, so you use HTTP headers instead. These headers are meant to be viewed in the browser’s debugger. The view in your application just doesn’t know if you are actually looking at its output.
Basic rule: whenever you send even a single byte to the user, use the view.
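A minimal sketch of that rule in Python, with no particular framework in mind (every name here is illustrative, not a real API):
import json

def controller(request):
    # The controller interprets the request and gathers the data ...
    data = {'user': request.get('user', 'anonymous'), 'trace': 'handled-by=controller'}
    return render_view(data)

def render_view(data):
    # ... but the view owns every byte sent back, headers included.
    headers = {
        'Content-Type': 'application/json',
        'X-Debug-Info': data['trace'],   # debug output via a header, as in the AJAX example
    }
    body = json.dumps({'user': data['user']})
    return headers, body

headers, body = controller({'user': 'alice'})
print(headers)
print(body)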

head request returns different content-type

I would like to send requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different approaches, but they all failed.
All other websites are fine, though.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests at an http://httpbin.org endpoint, have it record the request, and then experiment.
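For example, httpbin.org/anything echoes a request back, so you can compare what your client actually sent against the known-working request:
import requests

# The response body describes the request exactly as the server saw it.
resp = requests.get('https://httpbin.org/anything', headers={'User-Agent': 'Custom'})
print(resp.json()['headers'])   # every header requests actually sent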
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supply credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent. It looks like they are blacklisting Python; setting the User-Agent to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
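If cookies are the problem instead, a requests.Session() captures and resends them automatically. A rough sketch, with a hypothetical login URL and form field names:
import requests

session = requests.Session()
# Log in first; any Set-Cookie headers in the response are stored on the session.
session.post('https://example.com/login',
             data={'username': 'me', 'password': 'secret'})
# Subsequent requests send the stored cookies automatically.
response = session.get('https://example.com/members-only')
print(response.status_code)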
Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
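With requests-html that looks roughly like this (it downloads a Chromium build on first use; the User-Agent override mirrors the fix above):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
r.html.render()   # runs the page's scripts in headless Chromium
print(r.html.find('title', first=True).text)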
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1; take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they would rather have you use. Check for either, and bear in mind that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath") as file:              # open for reading; 'w' would truncate the file
    links = file.read().splitlines()        # splitlines() drops the trailing \n
for link in links:
    response = requests.get(link.strip())   # strip() also removes any stray \r
In my case this was due to the fact that the website address had recently changed, and I had been given the old address. At least this changed the status code from 404 to 500, which, I think, is progress :)

How to manipulate a .NET ASPX form programmatically?

I'm trying to manipulate a .net ASP form on a site that's using AJAX Control Toolkit. The site is only accessible to valid logins, and I do have a valid account. It consists of a search page with a form. Each time a submit button is clicked on the form, the server is updated using the values of some text fields on the form, and then the VIEWSTATE and EVENTVALIDATION tokens will be updated based on the response from the server, ready for the next request.
I'm using HttpClient in Java to do this. I suspect there's something I'm not doing correctly with regard to interacting with ASPX forms in general.
When I hit the main search page for the first time (cookies are validating my login with the server), I get the HTML for the search page back. I extract the VIEWSTATE and EVENTVALIDATION tokens for the next request. I've examined the exact form fields and their values that need to be sent to the server in a POST by looking at the Chrome debugger utility after making a request on the site manually. I've replicated them exactly as they should be, inserting the VIEWSTATE and EVENTVALIDATION appropriately.
But the response I get back from the server is not what it should be. What I get back is just the same HTML for the main search page that I get the first time I hit the webpage. The form data I'm using looks like this:
ctl00$ScriptManager1:ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$acceptButton
ctl00_ContentPlaceHolder1_TabContainer1_ClientState:{"ActiveTabIndex":0,"TabState":[true,true]}
__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:<token extracted from first page hit>
__VIEWSTATEENCRYPTED:
__EVENTVALIDATION:<token extracted from first page hit>
ctl00$ContentPlaceHolder1$LabelFee:0
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$RadioButtonList1:Person
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$snameText:aSurname
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$HiddenField1:
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$fnameText:aFirstname
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$dayFromTextBox:01
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$monthFromTextBox:January
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$yearFromTextBox:2001
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$dayToTextBox:01
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$monthToTextBox:January
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$yearToTextBox:2008
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$DropDownList1:aCity
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$PropText:
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel2$RefText:
__ASYNCPOST:true
ctl00$ContentPlaceHolder1$TabContainer1$TabPanel1$acceptButton:Accept
I've also tried replicating the headers that the Chrome debugger shows, so my request is including the same Content-Type, Host, Origin, Referer, User-Agent (for my browser) and every other header, including this header X-MicrosoftAjax: Delta=true.
I know there's a lot of moving parts here, but I intentionally haven't mentioned how I'm actually making the POST request with the HttpClient lib, because I don't want to complicate the question any more or alienate anyone who doesn't know Java but knows ASP. I'd like to know if there's an ASP issue I'm not addressing, but I can post the Java code if necessary.
Edit:
I've checked the debugging info that HttpClient is outputting just before sending the request, and the form data is being added properly as multi-part form data. The headers are all there too.
This answer is a long shot, but I've seen weirder things.
You mention this header:
X-MicrosoftAjax: Delta=true
I did some deep googling and found that this is often shown as all lower case in dumps of Ajax and UpdatePanel POST requests:
x-microsoftajax: Delta=true
See here and here.
Could it be as simple as not casing the header correctly?
I eventually got this working. The problem was not ASP-specific; it was actually a problem with how Java (specifically HttpClient) was sending the request. I was using HttpClient to build the request as a multipart form, but after using Fiddler to analyse and compare the requests (see the edited part of this question for more details) sent from both my application and the actual webpage, my app's request was structured very differently.
The real website request had the form options embedded in the request body in what looked like a URL encoded query string. My request was a series of entries in the request body where each option was wrapped in the Content-Type and Content-Disposition headers. The requests succeeded after changing the POST to add the parameters like:
request.setEntity(new UrlEncodedFormEntity(paramList));
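For comparison, the same flow sketched in Python with requests; the URL is hypothetical and BeautifulSoup is assumed for pulling out the hidden fields. The key point carries over from the fix above: pass the fields via data= so they go out URL-encoded, not as multipart form data.
import requests
from bs4 import BeautifulSoup

session = requests.Session()                             # keeps the login cookies
page = session.get('https://example.com/Search.aspx')    # hypothetical ASPX page
soup = BeautifulSoup(page.text, 'html.parser')

def hidden(name):
    # ASP.NET stores its state in hidden <input> fields on the page.
    field = soup.find('input', {'name': name})
    return field['value'] if field and field.has_attr('value') else ''

payload = {
    '__VIEWSTATE': hidden('__VIEWSTATE'),
    '__EVENTVALIDATION': hidden('__EVENTVALIDATION'),
    # ... plus the visible form fields and the button that was "clicked" ...
}
# data= sends application/x-www-form-urlencoded, matching what the browser does;
# files= would produce the multipart body that failed in the Java version.
response = session.post('https://example.com/Search.aspx', data=payload)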

How does Backbone send a PUT and PATCH request to the server

Regarding this question, many documents have stated that sending a PUT request directly via a form in the browser is impossible for security reasons.
However, what I am seeing in Backbone is that it can still send a direct PUT request via the browser without a workaround like adding a hidden form field.
This is confusing to me. Is there anything I'm missing here?
A form can only send a GET or a POST request, as set in the method attribute.
However, Backbone delegates its requests to jQuery.ajax by default (or whatever you want via Backbone.ajax) which itself wraps XMLHttpRequest, an object that can send PUT/DELETE/PATCH requests.
From https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest
XMLHttpRequest is a JavaScript object that was designed by Microsoft
and adopted by Mozilla, Apple, and Google. It's now being standardized
in the W3C. It provides an easy way to retrieve data from a URL
without having to do a full page refresh. A Web page can update just a
part of the page without disrupting what the user is doing.
XMLHttpRequest is used heavily in AJAX programming.
many documents have stated that sending a PUT request directly via browser is impossible due to security reason
Citation please.
Backbone sends a PUT just like it sends any other request, with jQuery,
Backbone.ajax({
    type: 'PUT',
    ...
});
It is just some server-side languages, like PHP, that have problems receiving a PUT request.
The hidden form field is used when posting from a <form>; Backbone uses JavaScript.
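At the HTTP level there is nothing special about PUT; any non-form client can send one. For example, with Python's requests (httpbin.org/put simply echoes the request back):
import requests

resp = requests.put('https://httpbin.org/put', json={'name': 'value'})
print(resp.json()['json'])   # -> {'name': 'value'}, echoed back by the server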

Was there ever a proposal to include the URL fragment into the HTTP request?

In the current HTTP spec, the URL fragment (the part of the URL including and following the #) is not sent to the server in any way. However with the increased spread of AJAX, which uses the fragment to maintain some form of state, there are a lot of situations where it would be useful for the server to have knowledge of the URL fragment at request time.
For example, if you go to http://facebook.com, then click a user name in your stream, the URL will become http://facebook.com/#!/username - to allow FB to update your page without reloading all of its bootstrap JS and HTML. However, if you were to reload this with your browser, the server would have no way of seeing the "#!/username" part of the URL, and therefore could not pre-render the content for you. This forces your browser to make an extra request once the client-side JavaScript has loaded and parsed the fragment.
I am wondering if there have been any efforts or proposals towards creating a standard mechanism to achieve this.
For example, there could be a standard HTTP header, which would be sent with the value of the URL fragment - any server which cared about such things could then have access to it.
It seems like this would be a very useful thing for the web-application community as a whole, so I am surprised to not have heard anything proposed. Perhaps I missed it though.
Imho, the fragment identifier really is not a good place to store state; it was designed for something else.
That being said, http://www.jenitennison.com/blog/node/154 has a good discussion of the whole subject.
I found this proposal by Google to make Ajax pages crawlable, but it addresses a more constrained set of use cases. Specifically, it creates a way to replace the URL fragment with a URL parameter to obtain the same HTML output from the server as would be generated by a client visiting the equivalent URL with the fragment. However, such URLs are useless for actually running the Ajax apps, since they would necessitate a page reload every time.
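The rewrite in that scheme is roughly the following (a sketch covering only the simple case of a URL with no existing query string):
def escaped_fragment(url):
    # Google's (now retired) AJAX crawling scheme turned the invisible '#!'
    # fragment into a query parameter the server could actually see.
    return url.replace('#!', '?_escaped_fragment_=')

print(escaped_fragment('http://example.com/#!/username'))
# -> http://example.com/?_escaped_fragment_=/username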
Webkit Bug 24175 - URL Redirect Loses Fragment refers to Handling of fragment identifiers in redirected URLs which may be of interest.
A suggestion for a future version of HTTP may be to add an (optional)
Fragment header to the request, which holds the fragment identifier.
Even simpler may be to allow an HTTP request to contain a fragment
identifier.
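For http://example.com/#!/username, a request under that suggestion might hypothetically look like this (no such header exists in any published HTTP version):
GET / HTTP/1.1
Host: example.com
Fragment: !/username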

Is it correct to say that a web browser always knows when a web page is completely loaded?

A browser sends a GET request for a static web page to a server. The server sends back an HTTP OK response with the HTML page in the HTTP body. By looking at the Content-Length field, or at the terminating chunk or whatever delimiter the transfer encoding uses, the browser can know whether it has received the web page and, subsequently, all its embedded objects (images etc.). Is it correct to say that in this case the browser always knows when a web page has completely loaded, and that it will see no further network traffic?
Now if the page is dynamic (let's say Facebook or Gmail), where you might receive notifications or parts of the page get updated using AJAX or JavaScript running in the background, the browser should also know when the page has loaded. But what if the server is pushing some updates to the client? Is it possible in this scenario for the browser to know when it has received the full update?
So, is there any scenario in which a browser doesn't know when it has fully received the data (static or dynamic) it has requested from a web server or push-based updates the server is forwarding to it?
For the static case, I can only imagine one scenario: when Content-Length is not set. The server is not required to send it.
Potentially, of course, in a page containing scripts, one could also have other scenarios where the script loads bits and pieces one by one with delays (including the AJAX scenario you mentioned). This way the browser would not know in advance either. In such a case it would know "for the moment" that the page has loaded completely, but the next action from the script would invalidate that assertion again.
You do not need AJAX to get into a situation where not all elements on the page are loaded even after the page itself has loaded. A little JavaScript is all you need:
<img id="dyn_image" src="/not_clicked.gif">
<input type="button" onclick="document.getElementById('dyn_image').src='/clicked.gif'">
There are cases where the server uses some kind of push technology, for example Comet. In this case a request (generally an Ajax request) is sent without receiving any response (obviously no HTTP headers either), but leaving the TCP connection open. This may take a long time, yet it may still be considered a sub-case of Ajax calls.
The other case is HTML5's WebSocket technology. In a WebSocket the server side can push data to the client side without explicit request from the client side.
These two can be combined, so the answer to your question is: yes, there are cases where you cannot tell whether the network traffic is over. What is common to all of them is that the client side must leave a channel open to the server.
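To make the WebSocket case concrete, here is a minimal client sketch in Python; the endpoint is hypothetical and the third-party websockets package is assumed:
import asyncio
import websockets

async def listen():
    # Once connected, the server may push a message at any moment; the
    # client has no way to know in advance when (or if) the stream ends.
    async with websockets.connect('wss://example.com/feed') as ws:
        async for message in ws:
            print(message)

asyncio.run(listen())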
