I'm trying to understand how an IIS server handles different browsers in the header of an HTTP request.
The situation is that I have some load tests set up that fire off HTTP requests to an IIS server, constructing them and sending them over the wire. My code allows me to specify the browser in the header, but I'm not sure what that would actually change.
So what does IIS do with that particular information in the header?
As far as i am aware IIS doesn't actually do anything with the header.
You can create rules to explicitly handle a type of browser, this is pretty useful if you block traffic from countries but you still want to allow bots for example.
Its useful to also have this information in Log Files too
Related
I would like to try send requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I knew this is a common problem and tried different way but still failed.
but all of other website is ok.
any suggestion?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplied credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent, it looks like they are blacklisting Python, setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1®ion=1, take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, they probably are either trying to enforce terms of service that prohibit scraping, or because they have an API they rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to fact that the website address was recently changed, and I was provided the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)
I'm doing some work with this right now and I have to say, it makes no sense at all to me! Basically, I have some CDN server which provides css, images ect for a site. For whatever reason, in order for my browser to stop blocking those resources with a CORS error, I had to have that server (the CDN) add the Access-Control-Allow-Origin header. But as far as I can tell that does absolutely nothing to increase security. Shouldn't the page I request which references those cross-domain resources be telling the browser it's safe to get stuff from the other domain? If that were a malicious domain wouldn't it just have the Access-Control-Allow-Origin set to * so that sites load their malicious responses (you don't have to answer that because obviously they would)?
So can someone explain how this mechanism/feature provides security? As far as I can tell the implementors fucked up and it actually does nothing. The header should be required from the page which references/requests cross-domain resources rather than from that domain being requested.
To be clear; if I request a page at domain A it would make sense for the response to include the Access-Control-Allow-Origin header white listing resources from domain B (Access-Control-Allow-Origin:.B.com), however it makes no sense at all for domain B to effectively white list itself by providing the header; Access-Control-Allow-Origin: which is how this is currently implemented. Can anyone clarify what the benefit of this feature is?
If I have a protected resource hosted on site A, but also control sites B, C, and D, I may want to use that resource on all of my sites but still prevent anyone else from using that resource on theirs. So I instruct my site A to send Access-Control-Allow-Origin: B, C, D along with all of its responses. It's up to the web browser itself to honor this and not serve the response to the underlying Javascript or whatever initiated the request if it didn't come from an allowed origin. Error handlers will be invoked instead. So it's really not for your security as much as it's an honor-system (all major browsers do this) access control method for servers.
Primarily Access-Control-Allow-Origin is about protecting data from leaking from one server (lets call it privateHomeServer.com) to another server (lets call it evil.com) via an unsuspecting user's web browser.
Consider this scenario:
You are on your home network browsing the web when you accidentally stumble onto evil.com. This web page contains malicious javascript that tries to look for web servers on your local home network and then sends their content back to evil.com. It does this by trying to open XMLHttpRequests on all local IP addresses (eg. 192.168.1.1, 192.168.1.2, .. 192.168.1.255) until it finds a web server.
If you are using an old web browser that isn't Access-Control-Allow-Origin aware or you have set Access-Control-Allow-Origin * on your privateHomeServer then your browser would happily retrieve the data from your privateHomeServer (which presumably you didn't bother passwording as it was safely behind your home firewall) and then handing that data to the malicious javascript which can then send the information on to the evil.com server.
On the other hand using an Access-Control-Allow-Origin aware browser and default web configuration on privateHomeServer (ie. not sending Access-Control-Allow-Origin *) your web browser would block the malicious javascript from seeing any data retrieved from privateHomeServer. So this way you are protected from such attacks unless you go out of your way to change the default configuration on your server.
Regarding the question:
Shouldn't the page I request which references those cross-domain
resources be telling the browser it's safe to get stuff from the other
domain?
The fact that your page contains code that is attempting to get resources from a particular server is implicitly telling the web browser that you believe the resources are safe to fetch. It wouldn't make sense to need to repeat this again elsewhere.
CORS makes only sense for Mashup content provider and nothing more.
Example: You are a provider of a embedded maps mashup service which requires a registration. Now you want to make sure that your ajax mashup map will only work for your registered users on their domains. Other domains should be excluded. Only for this reason CORS makes sense.
Another example: Someone misuse CORS for a REST-Service. The clever developer set up a ajax proxy and et voilĂ you can access from every domain on that service.
Such a ajax proxy would make no sense for a mashup, on the other way the CORS makes no sense for REST-Services, because you could bypass the restriction with a simple http-client.
I have difficulties in seeing the point of the Access-Control-Allow-Origin http header.
I thought that if a client (browser) gets a "no" from a server once, than it will not send any further requests. But chrome and firefox keep sending requests.
Could anyone tell me a real life example where a header like this makes sense?
thanks!
The Access-Control-Allow-Origin header should contain a list of origins which are "allowed" to access the resource.
Thus, determining which domains can make requests to your server for resources.
For example, sending back a header of Access-Control-Allow-Origin: * would allow all sites to access the requested resource.
On the other hand, sending back Access-Control-Allow-Origin: http://foo.example.com will allow access only to http://foo.example.com.
There's some more information on this over at the Mozilla Developer Site
For example
Let's suppose we have a URL on our own domain that returns a JSON collection of Music Albums by Artist. It might look like this:
http://ourdomain.com/GetAlbumsByArtist/MichaelJackson.json
We might use some AJAX on our website to get this JSON data and then display it on our website.
But what if someone from another site wishes to use our JSON object for themselves? Perhaps we have another website http://subdomain.ourdomain.com which we own and would like to use our feed from ourdomain.com.
Traditionally we can't make cross-domain requests for this data.
By specifying other domains that are allowed access to our resource, we now open the doors to cross-domain requests.
CORS implements a two-part security view of cross-origin. The problem it is trying to solve is that there are many servers sitting out there on the public internet written by people who either (a) assumed that no browser would ever allow a cross-origin request, or (b) didn't think about it at all.
So, some people want to permit cross-origin communications, but the browser-builders do not feel that they can just unlock browsers and suddenly leave all these websites exposed. To avoid this, they invented a two-part structure. Before a browser will permit a cross-origin interaction with a server, that server has to specifically indicate that it is willing to allow cross-origin access. In the simple cases, that's Access-Control-Allow-Origin. In more complex cases, it's the full preflight mechanism.
It's still true that servers have to implement appropriate resource access control on their resources. CORS is just there to allow the server to indicate to browsers that it is aware of all the issues.
I want to change first line of the HTTP header of my request, modifying the method and/or URL.
The (excellent) Tamperdata firefox plugin allows a developer to modify the headers of a request, but not the URL itself. This latter part is what I want to be able to do.
So something like...
GET http://foo.com/?foo=foo HTTP/1.1
... could become ...
GET http://bar.com/?bar=bar HTTP/1.1
For context, I need to tamper with (make correct) an erroneous request from Flash, to see if an error can be corrected by fixing the url.
Any ideas? Sounds like something that may need to be done on a proxy level. In which case, suggestions?
Check out Charles Proxy (multiplatform) and/or Fiddler2 (Windows only) for more client-side solutions - both of these run as a proxy and can modify requests before they get sent out to the server.
If you have access to the webserver and it's running Apache, you can set up some rewrite rules that will modify the URL before it gets processed by the main HTTP engine.
For those coming to this page from a search engine, I would also recommend the Burp Proxy suite: http://www.portswigger.net/burp/proxy.html
Although more specifically targeted towards security testing, it's still an invaluable tool.
If you're trying to intercept the HTTP packets and modify them on the way out, then Tamperdata may be route you want to take.
However, if you want minute control over these things, you'd be much better off simulating the entire browser session using a utility such as curl
Curl: http://curl.haxx.se/
As of current, are there still any methods to spoof HTTP referer?
Yes.
The HTTP_REFERER is data passed by the client. Any data passed by the client can be spoofed/forged. This includes HTTP_USER_AGENT.
If you wrote the web browser, you're setting and sending the HTTP Referrer and User-Agent headers on the GET, POST, etc.
You can also use middleware such as a web proxy to alter these. Fiddler lets you control these values.
If you want to redirect a visitor to another website and set their browser's referrer to any value you desire, you'll need to develop a web browser-plugin or some other type of application that runs on their computer. Otherwise, you cannot set the referrer on the visitor's browser. It will show the page from your site that linked to it.
What might be a valid solution in your case would be for you to load the third party page on the visitor's behalf, using whatever referrer is necessary, then display the page to the user from your server.
Yes, the HTTP referer header can be spoofed.
A common way to play with HTTP headers is to use a tool like cURL:
Sending headers using cURL:
How to send a header using a HTTP request through a curl call?
or
The cURL docs:
http://curl.haxx.se/docs/
Yes of course. Browser can avoid to send it, and it can be also "spoofed". There's an addon for firefox (I haven't tried it myself) and likely you can use also something like privoxy (but it is harder to make it dynamically changing). Using other tools like wget, is as easy as setting the proper option.