When I make a request to a server can I see all the requests made by that server to another server? - http

I need to know which requests a webpage sends. Basically the site i call, calls another service/api/url whatever and receives the data (probably within javascript) and show me this. Can i see all the calls it make?
Edit: concrete example:
From this site (http://www.flickriver.com/lenses/nikon/) you can choose a lens, at that moment, the page sends a request to flickr, and get all the data. But in chrome developer tools i could not see this request.
Here is a screenshot of get requests. I have looked through them but could not see any request to flickr.
The first is request to the page. And the sixth one is the picture request already, where it requests the picture by its id. So in between other 4 requests should contain a request to the external source which gives the picture id in return or do i miss sth?
And what if the backend makes this request? Do i still need to see this request in developer tools?

No, of course you cannot see the calls made by some server to another server. Why would you expect to be able to do that? Those calls have nothing to do with the browser. The browser knows nothing about those requests. The browser knows only about requests that it itself initiated. Devtools can only report on requests made by the browser. If in fact there were some way to spy on the requests made by a server to another server, it would be gaping security hole.

Related

head request returns different content-type [duplicate]

I would like to try send requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I knew this is a common problem and tried different way but still failed.
but all of other website is ok.
any suggestion?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplied credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent, it looks like they are blacklisting Python, setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1, take that into account if you are trying to scrape data from this site.
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, they probably are either trying to enforce terms of service that prohibit scraping, or because they have an API they rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to fact that the website address was recently changed, and I was provided the old website address. At least this changed the status code from 404 to 500, which, I think, is progress :)

get know if request is subrequest of opening site or independent request to site

Is there any way to recognize (by process http packet or filtering tcp connections) does several requests belong to one opening url or another?
Try to explain in more detail.
When we open any page in browser it also initializes different requests to download images, resources, scripts. I d like to get know that some scope of requests was invoked by opening site (call it main site).
I can get referer property but in that case how to distinguish request to resorce from request to different site link on which was clicked on main site. In both cases referer will be the same.
I suspect that this problem could not be resolved, but I hope that I'm mistaken. Or you can offer some workaround.
If you are in control of the site, set a cookie or a URL parameter and check if it exists in subsequent requests.

How can a HTTP request be instantiated with a different host?

Strange one here folks.
I'm studiying a web application's inner workings using Fiddler and have become a bit stumped. I'm requesting /account via the browser and Fiddler shows in the "Host" column website.local as expected since this is the URL the browser is pointing at.
Immediately after this a second request is made, however this time the host is services.website.com. I also cannot find any script in /account that makes this request.
So how can the Host change? Where is the response being sent to? Where can this be getting called from?
I'd expect that the server is sending a redirect to services.website.com: Fiddler isn't showing any redirects?
It depends on what really is in the first response that you receive. When you see a second request in Fiddler, does the page change too (in the browser)?
It will help a great deal if you could share the part of the Fiddler trace.

Is it correct to say that a web browser always knows when a web page is completely loaded?

A browser sends a GET request for a static web page to a server. The server sends back HTTP OK response with the HTML page in the HTTP body. Looking at the Content-Length field or looking for the terminating chunk or some other delimiter for some other encoding the browser can know if it has received the web page and subsequently all its embedded objects (images etc.). Is it correct to say that in this case the browser always knows when a web page has completely loaded and that it will see no further network traffic?
Now if the page is dynamic (lets say facebook or gmail), where you might receive notifications or parts of the page gets updated using AJAX or javascript running in the background, here also the browser should know when the page has loaded. What if the server is pushing some updates to the client. Is it possible in this scenario for the browser to know when it has received the full update?
So, is there any scenario in which a browser doesn't know when it has fully received the data (static or dynamic) it has requested from a web server or push-based updates the server is forwarding to it?
I can only imagine (for the static case) the one scenario when Content-Length is not set. It's not mandatory to send it for the server.
Potentially, of course, in a page containing scripts, one could also have other scenarios where the script loads bits and pieces one by one with delays (including the AJAX scenario you mentioned). This way the browser would not know in advance either. In such a case it would know "for the moment" that the page has loaded completely, but the next action from the script would invalidate that assertion again.
You do not need AJAX to get in a situation where not all elements in the page are loaded even after the page itself has been loaded. A little javascript is all that you need (been a while since I last worked with JS, there might be some syntax errors)
<img id="dyn_image" src="/not_clicked.gif">
<input type="button" onclick="javascrit:document.get("dyn_image").src="/clicked.gif">
There are cases when the server uses some kind of push technology, for example Comets. In this case a request (generally Ajax request) is sent, without receiving any response (obvoiusly no HTTP headers as well), but leaving the TCP connection open. This may take long time, but still may be considered as a sub-case of Ajax calls.
The other case is HTML5's WebSocket technology. In a WebSocket the server side can push data to the client side without explicit request from the client side.
These two can be combined, so the answer to your question is: yes, there can be cases when you cannot predict that the network traffic is over or not. The common (in all cases) is that the client side must leave a channel open to the server.

Is it safe to redirect to the same URL?

I have URLs of the form http://domain/image/⟨uuid⟩/42x42/some_name.png. The Web server (nginx) is configured to look for a file /some/path/image/⟨uuid⟩/thumbnail_42x42.png, and if it does not exist, it sends the URL to the backend (Django via mod_wsgi) which then generates the thumbnail. Then the backend emits a 302 redirect to exactly the same URL that was requested by the client, with the idea that upon this second request the server will notice the thumbnail file and send it directly.
The question is, will this work with all the browsers? So far testing has shown no problems, but can I be sure all the user agents will interpret this as intended?
Update: Let me clarify the intent. Currently this works as follows:
The client requests a thumbnail of an image.
The server sees the file does not exist, so it forwards the request to the backend.
The backend creates the thumbnail and returns 302.
The backend releases all the resources, letting the server share the newly generated file to current and subsequent clients.
Having the backend serve the newly created image is worse for two reasons:
Two ways of serving the same data must be created;
The server is much better at serving static content. What if the client has an extremely slow link? The backend is not particularly fast nor memory-efficient, and keeping it in memory while spoon-feeding the client can be wasteful.
So I keep the backend working for the minimum amount of time.
Update²: I’d really appreciate some RFC references or opinions of someone with experience with lots of browsers. All those affirmative answers are pleasant but they look somewhat groundless.
If it doesn't, the client's broken. Most clients will follow redirect loops until a maximum value. So yes, it should be fine until your backend doesn't generate the thumbnail for any reason.
You could instead change URLs to be http://domain/djangoapp/generate_thumbnail and that'll return the thumbnail and the proper content-type and so on
Yes, it's fine to re-direct to the same URI as you were at previously.

Resources