head request returns different content-type [duplicate] - python-requests

I am trying to send requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different approaches, but I still fail.
All other websites work fine.
Any suggestions?

Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to a http://httpbin.org endpoint, have it record the request, and then experiment.
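For example, a quick sketch: the /headers endpoint on httpbin simply echoes back the headers it received, which shows exactly what requests sends by default.

import requests

# httpbin echoes the request headers back as JSON
response = requests.get('https://httpbin.org/headers')
print(response.json()['headers'])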
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host: this must be set to the hostname you are contacting, so that the server can properly serve multiple sites from the same address. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supplied credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
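For example, a hypothetical sketch of that process: start from the defaults and copy headers from the working (browser) request one at a time until the response matches. The header values below are purely illustrative.

import requests

# Start with the User-Agent, then add more browser-like headers only if needed
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
}
response = requests.get('https://rent.591.com.tw', headers=headers)
print(response.status_code)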
In this case, the site is filtering on the user agent; it looks like they are blacklisting Python, and setting it to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
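A minimal sketch of the requests-html route (the first call to render() downloads a Chromium build, and the CSS selector here is just an illustration):

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://rent.591.com.tw')
response.html.render()  # runs the page's scripts in a headless Chromium
print(response.html.find('title', first=True).text)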
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1, take that into account if you are trying to scrape data from this site.
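A hypothetical sketch of calling that endpoint directly; the endpoint may also require cookies or tokens that the page's scripts normally set, so treat this only as a starting point:

import requests

session = requests.Session()
session.headers['User-Agent'] = 'Custom'
session.get('https://rent.591.com.tw')  # load the main page first to pick up any cookies
params = {'is_new_list': 1, 'type': 1, 'kind': 0, 'searchtype': 1, 'region': 1}
response = session.get('https://rent.591.com.tw/home/search/rsList', params=params)
print(response.status_code)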
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
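A hypothetical sketch of that GET-before-POST pattern, using BeautifulSoup to pull a token out of the form; the URL, the field names and the token name are all made up:

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # keeps cookies across both requests
page = session.get('https://example.com/login')
soup = BeautifulSoup(page.text, 'html.parser')
token = soup.find('input', {'name': 'csrf_token'})['value']  # hypothetical hidden field
response = session.post(
    'https://example.com/login',
    data={'csrf_token': token, 'username': 'me', 'password': 'secret'},
)
print(response.status_code)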
Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.

One thing to note: I was using requests.get() to do some webscraping off of links I was reading from a file. What I didn't realise was that the links had a newline character (\n) when I read each line from the file.
If you're getting multiple links from a file instead of a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)

In my case this was due to the fact that the website address had recently changed, and I had been given the old address. At least this changed the status code from 404 to 500, which, I think, is progress :)

Related

JMeter: The purpose of User-defined Cookies for requests

I'm just getting started with JMeter, and I'm trying to understand more about user-defined cookies.
What purpose do they fulfill when you add them with hard-coded values, for instance if you define a cookie called A and give it a value B for a certain domain in your HTTP sampler?
I'd be grateful for any information!
There could be several possible reasons:
Replay user session without having to re-login (i.e. session hijacking) so you would be able to debug your test by running a single request rather than the whole sequence (open login page, login, navigate somewhere, do something, etc.)
The value doesn't have to be hard-coded, it may come from correlation or calculation by the JSR223 Pre-Processor
Negative test scenarios (providing invalid cookie value to check for anticipated error)
You name it
Some web sites execute JavaScript on page load which adds, removes or updates HTTP cookie(s).
The set() method of the cookies API sets a cookie containing the specified cookie data. This method is equivalent to issuing an HTTP Set-Cookie header during a request to a given URL.
JMeter isn't executing Javascript
JMeter is not a browser, it works at protocol level. As far as web-services and remote services are concerned, JMeter looks like a browser (or rather, multiple browsers); however JMeter does not perform all the actions supported by browsers. In particular, JMeter does not execute the Javascript found in HTML pages.
So in JMeter you can manually set the HTTP cookies you expect the site's JavaScript to create.

CDN: Forward to a different resource instead of redirect

I need to serve different resources (especially images) for the same URLs depending on complex logic based on different factors (cookie, IP, time, random). I want to take advantage of CDNs (caching, availability, proximity). So, I want the CDN to make a call to my server in order to decide which resource to serve for any given request. It is very important not to use redirects, so the user will never see a 30X status code.
For clarification:
User makes a request to http://resources.mydomain.com/img/a.jpg, whose domain is behind the CDN
CDN makes a call to my server, sending the requested URL, cookies and user IP
My server returns the name of the real resource to serve (http://hidden.mydomain.com/img/a-version3.jpg)
CDN requests that image if not in cache
CDN responds to user request sending a-version3.jpg data, but without any redirect
Is it possible using any current commercial solution?
Yes, I think this has been supported by CDNetworks for a long time.
It is called "Origin Logic Control" now. You can check the description from http://www.cdnetworks.com/wp-content/uploads/2013/08/CDNetworks-ContentAccel-DS-EN2.pdf:
Allows a customer’s domain to require checking with the origin on every request.
You can return a special HTTP header (or special HTTP body, I am not sure now) to tell CDNetworks to return the resource directly (using the cached version if available), rather than a 30x status code.
You can enable Redirect Chasing to get what you are looking for. Alternatively, look at the Akamai blog post on Edge Redirect for a faster option.

What happens when a HTTP request uses different browser headers?

I'm trying to understand how an IIS server handles different browsers in the header of an HTTP request.
The situation is that I have some load tests set up that fire off HTTP requests to an IIS server, constructing them and sending them over the wire. My code allows me to specify the browser in the header, but I'm not sure what that would actually change.
So what does IIS do with that particular information in the header?
As far as I am aware, IIS doesn't actually do anything with the header.
You can create rules to explicitly handle a particular type of browser; this is pretty useful if, for example, you block traffic from certain countries but still want to allow bots.
It's also useful to have this information in the log files.
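If you want to see for yourself whether anything changes, one simple sketch is to fire the same request with different User-Agent values and compare the responses (the URL and agent strings below are placeholders):

import requests

url = 'https://example.com/'
for agent in ('Mozilla/5.0 (Windows NT 10.0)', 'Googlebot/2.1', 'LoadTest/1.0'):
    response = requests.get(url, headers={'User-Agent': agent})
    # With no rewrite or blocking rules configured, only the server's logs should differ
    print(agent, response.status_code, len(response.content))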

How to know whether a request is a subrequest of opening a site or an independent request to the site

Is there any way to recognize (by processing HTTP packets or filtering TCP connections) whether several requests belong to the opening of one URL or another?
Let me try to explain in more detail.
When we open any page in a browser, it also initiates other requests to download images, resources and scripts. I'd like to know which set of requests was triggered by opening a site (call it the main site).
I can read the Referer header, but then how do I distinguish a request for a resource from a request to a different site whose link was clicked on the main site? In both cases the Referer will be the same.
I suspect that this problem cannot be solved, but I hope I'm mistaken. Or perhaps you can offer some workaround.
If you are in control of the site, set a cookie or a URL parameter and check if it exists in subsequent requests.
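If you do control the site, a minimal sketch of the URL-parameter idea (Flask is used purely as an illustration; the framework, routes and parameter name are assumptions):

from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def main_page():
    # Tag every subresource URL emitted by the main page with a marker parameter
    return '<img src="/img/logo.png?from=main">'

@app.route('/img/<name>')
def image(name):
    # Requests carrying the marker were triggered by rendering the main page;
    # an independent request for the same image would normally not include it
    is_subrequest = request.args.get('from') == 'main'
    return f'{name}: subrequest of the main page: {is_subrequest}'

if __name__ == '__main__':
    app.run()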

Are there any safe assumptions to make about the availability of a URL?

I am trying to determine if there is a way to check the availability of a potentially large list of URLs (> 1,000,000) without having to send a GET request to every single one.
Is it safe to assume that if http://www.example.com is inaccessible (as in unable to connect to the server, or the DNS request for the domain fails), or I get a 4XX or 5XX response, then anything from that domain will also be inaccessible (e.g. http://www.example.com/some/path/to/a/resource/named/whatever.jpg)? Would a 302 response (say for whatever.jpg) be enough to invalidate the first assumption? I imagine subdomains should be considered distinct, as http://subdomain.example.com and http://www.example.com may not resolve to the same IP?
I seem to be able to think of a counter example for each shortcut I come up with. Should I just bite the bullet and send out GET requests to every URL?
Unfortunately, no, you cannot infer anything from 4xx or 5xx or any other codes.
Those codes are for individual pages, not for the server. It's quite possible that one page is down and another is up, or one has a 500 server-side error and another doesn't.
What you can do is use HEAD instead of GET. That retrieves the response headers for the page but not the page content. This saves time server-side (because it doesn't have to render the page) and for yourself (because you don't have to buffer and then discard content).
Also I suggest you use keep-alive to accelerate responses from the same server. Many HTTP client libraries will do this for you.
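A minimal sketch combining both suggestions (a Session for keep-alive, HEAD to skip the body); the URLs are placeholders:

import requests

urls = ['https://example.com/a', 'https://example.com/b']
with requests.Session() as session:  # the Session reuses connections (keep-alive)
    for url in urls:
        try:
            response = session.head(url, allow_redirects=True, timeout=10)
            print(url, response.status_code)
        except requests.RequestException as error:
            print(url, 'failed:', error)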
A failed DNS lookup for a host (e.g. www.example.com) should be enough to invalidate all URLs for that host. Subdomains or other hosts would have to be checked separately though.
A 4xx code might tell you that a particular page isn't available, but you couldn't make any assumptions about other pages from that.
A 5xx code really won't tell you anything. For example, it could be that the page is there, but the server is just too busy at the moment. If you try it again later it might work fine.
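A hypothetical sketch of using the DNS shortcut mentioned above while still checking everything else individually:

import socket
from urllib.parse import urlsplit

urls = ['http://www.example.com/a.jpg', 'http://subdomain.example.com/b.jpg']
resolvable = {}
for url in urls:
    host = urlsplit(url).hostname
    if host not in resolvable:
        try:
            socket.gethostbyname(host)  # one lookup per unique host
            resolvable[host] = True
        except socket.gaierror:
            resolvable[host] = False
    if not resolvable[host]:
        print('skipping (DNS failed):', url)
    # URLs on hosts that do resolve still need their own HEAD/GET check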
The only assumption you should make about the availability of a URL is that "getting a URL can and will fail".
It's not safe to assume that a subdomain request will fail when a parent one does, namely because in between your two requests your network connection can go up, down or generally misbehave. It's also possible for the domains to be changed between requests.
Ignoring all internet connection issues, you are still dealing with a live web site that can and will change constantly. What is true now might not be true in 5 minutes, when they decide to alter their page structure or change the way they display a particular page. Your best bet is to assume any GET can fail.
This may seem like an extreme view point. But these events will happen. How you handle them will determine the robustness of your program.
First, don't assume anything based on a single page failing. I have seen many cases where IIS will continue to serve static content but not be able to serve any dynamic content.
You have to treat each host name as unique; you cannot assume subdomain.example.com and example.com point to the same IP. Even if they do, there is no guarantee they are the same site: IIS again has host headers that allow you to run multiple sites using a single IP address.
If the connection to the server actually fails, then there's no reason to check URLs on that server. Otherwise, you can't assume anything.
In addition to what everyone else is saying, use HEAD requests instead of GET requests. They function the same, but the response doesn't contain the message body, so you save everyone some bandwidth.
