Are there any safe assumptions to make about the availability of a URL?

I am trying to determine if there is a way to check the availability of a potentially large list of URLs (> 1,000,000) without having to send a GET request to every single one.
Is it safe to assume that if http://www.example.com is inaccessible (as in unable to connect to the server, or the DNS request for the domain fails), or I get a 4XX or 5XX response, then anything from that domain will also be inaccessible (e.g. http://www.example.com/some/path/to/a/resource/named/whatever.jpg)? Would a 302 response (say, for whatever.jpg) be enough to invalidate the first assumption? I imagine subdomains should be treated as distinct, since http://subdomain.example.com and http://www.example.com may not resolve to the same IP?
I seem to be able to think of a counterexample for each shortcut I come up with. Should I just bite the bullet and send GET requests to every URL?

Unfortunately, no, you cannot infer anything from 4xx, 5xx, or any other status codes.
Those codes apply to individual pages, not to the server. It's quite possible for one page to be down while another is up, or for one to have a 500 server-side error while another doesn't.
What you can do is use HEAD instead of GET. A HEAD request retrieves the response headers for the page but not the page content. This can save time server-side (the server may not have to render the page) and definitely saves time for you (you don't have to buffer and then discard the content).
Also, I suggest you use HTTP keep-alive to speed up successive requests to the same server. Many HTTP client libraries will do this for you.
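A minimal sketch of that combination in Python, assuming the requests library (the URLs are placeholders):
import requests

urls = ["http://www.example.com/a.jpg", "http://www.example.com/b.jpg"]  # placeholders

# A Session reuses TCP connections (HTTP keep-alive) across requests to
# the same host, and session.head() fetches only the status line and headers.
with requests.Session() as session:
    for url in urls:
        try:
            response = session.head(url, timeout=10, allow_redirects=True)
            print(url, response.status_code)
        except requests.RequestException as exc:
            print(url, "failed:", exc)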

A failed DNS lookup for a host (e.g. www.example.com) should be enough to invalidate all URLs for that host. Subdomains or other hosts would have to be checked separately, though.
A 4xx code might tell you that a particular page isn't available, but you couldn't make any assumptions about other pages from that.
A 5xx code really won't tell you anything. For example, it could be that the page is there, but the server is just too busy at the moment. If you try it again later it might work fine.
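The DNS pre-filter suggested above could look roughly like this in Python (a sketch using only the standard library; the URLs are placeholders):
import socket
from urllib.parse import urlparse

def host_resolves(host):
    # A failed lookup invalidates every URL for this host.
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

urls = ["http://www.example.com/a", "http://nonexistent.invalid/b"]  # placeholders
checkable = [u for u in urls if host_resolves(urlparse(u).hostname)]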

The only assumption you should make about the availability of a URL is that "getting a URL can and will fail".
It's not safe to assume that a subdomain request will fail just because a request to the parent domain does, if only because in between your two requests your network connection can go up, down, or generally misbehave. It's also possible for the domains to change between requests.
Ignoring all internet connection issues, you are still dealing with a live website that can and will change constantly. What is true now might not be true in five minutes, when they decide to alter their page structure or change the way they display a particular page. Your best bet is to assume any GET can fail.
This may seem like an extreme viewpoint, but these events will happen. How you handle them will determine the robustness of your program.
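One way to act on "any GET can fail" is to wrap every request in error handling and a retry, along these lines (a hedged sketch; the function name and retry policy are my own, not from the answer):
import time
import requests

def fetch(url, attempts=3, delay=2.0):
    # Treat every request as something that can fail: retry a few
    # times with a growing delay, then give up and let the caller decide.
    for attempt in range(attempts):
        try:
            return requests.get(url, timeout=10)
        except requests.RequestException:
            if attempt == attempts - 1:
                raise
            time.sleep(delay * (attempt + 1))  # simple linear backoff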

First, don't assume anything based on a single page failing. I have seen many cases where IIS will continue to serve static content but be unable to serve any dynamic content.
You have to treat each host name as unique; you cannot assume subdomain.example.com and example.com point to the same IP. And even if they do, there is no guarantee they are the same site: IIS, again, has host headers that allow you to run multiple sites on a single IP address.

If the connection to the server actually fails, then there's no reason to check URLs on that server. Otherwise, you can't assume anything.

In addition to what everyone else is saying, use HEAD requests instead of GET requests. They function the same, but the response doesn't contain the message body, so you save everyone some bandwidth.

head request returns different content-type

I am trying to send requests.get to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem and have tried different approaches, but still fail.
All other websites are fine.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to an http://httpbin.org endpoint, have it record the request, and then experiment.
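For example (my illustration, not part of the original answer), httpbin.org/headers echoes back the request headers it received, so you can see exactly what requests sent:
import requests

response = requests.get("https://httpbin.org/headers")
print(response.json())  # the headers as the server saw them, including the default User-Agent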
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host; this must be set to the hostname you are contacting, so that it can properly multi-host different sites. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supply credentials the same way the browser did).
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent; it looks like they are blacklisting Python. Setting the User-Agent header to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only an HTTP client; a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1, take that into account if you are trying to scrape data from this site.
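For instance, here is a sketch of hitting that endpoint directly with the same User-Agent workaround (the site may require additional headers or cookies, and the response format can change without notice):
import requests

url = ("https://rent.591.com.tw/home/search/rsList"
       "?is_new_list=1&type=1&kind=0&searchtype=1&region=1")
response = requests.get(url, headers={"User-Agent": "Custom"})
print(response.status_code)  # inspect response.json() if the call succeeds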
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, it is probably either trying to enforce terms of service that prohibit scraping, or it has an API it would rather have you use. Check for either, and take into consideration that you might be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some web scraping of links I was reading from a file. What I didn't realise was that each link had a trailing newline character (\n) when I read it from the file.
If you're getting multiple links from a file instead of from a Python data type like a string, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used:
with open("filepath", "r") as file:  # "r", not "w": we are reading the file
    links = file.read().splitlines()  # splitlines() drops the trailing \r and \n

for link in links:
    response = requests.get(link)
In my case this was due to the fact that the website address had recently changed, and I had been given the old address. At least this changed the status code from 404 to 500, which, I think, is progress :)

Server Log Showing Many 'Unhandled Exceptions' From URL with &hash=

I've noticed a large increase in the number of events logged daily that have &hash= in the URL. The requested URL is the same every time, but the number that follows &hash= is always different.
I have no idea what the purpose of the &hash= parameter is, so I'm unsure if these attempts are malicious or something else. Can anyone provide insight as to what is being attempted with the requested URL? I have copied in one from a recent log below.
https://www.movinglabor.com:443/moving-services/moving-labor/move-furniture/&du=https:/www.movinglabor.com/moving-services/moving-labor/move.../&hash=AFD3C9508211E3F234B4A265B3EF7E3F
I have been seeing the same thing in IIS on Windows Server 2012 R2. They were mostly HEAD requests. I did see a few other, more obvious attack attempts from the same IP address, so I'm assuming the du/hash thing is also intended to be malicious.
Here's an example of another attempt, which also tries some URL encoding to bypass filters:
part_id=D8DD67F9S8DF79S8D7F9D9D%5C&du=https://www.examplesite.com/page..asp%5C?part...%5C&hash=DA54E35B7D77F7137E|-|0|404_Not_Found
So you may want to look through your IIS logs to see if they are trying other things.
In the end I simply created a blocking rule for it using the URL Rewrite extension for IIS.
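For reference, such a blocking rule in web.config could look roughly like this (an illustrative sketch: the rule name and patterns are my own, and it assumes the URL Rewrite module is installed):
<system.webServer>
  <rewrite>
    <rules>
      <!-- Abort requests that carry the du/hash probe pattern, whether
           it appears in the path or in the query string. -->
      <rule name="BlockDuHashProbes" stopProcessing="true">
        <match url=".*" />
        <conditions logicalGrouping="MatchAny">
          <add input="{URL}" pattern="&amp;hash=" />
          <add input="{QUERY_STRING}" pattern="(^|&amp;)hash=" />
        </conditions>
        <action type="AbortRequest" />
      </rule>
    </rules>
  </rewrite>
</system.webServer>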

How do I generate a 403 error when someone tries to access a particular page

I may be barking up completely the wrong tree here, but what I would like to do is protect my .js pages by having them return a 403 Forbidden HTTP status if someone tries to access them directly via HTTP. I use them to support my index.html page but would like them to remain hidden.
The helpdesk guys at my ISP basically say they don't know if it's possible, but that it may be something you could do with a web.config file (which is not something I have used before).
Any help at all would be gratefully received; I am a bit out of my comfort zone with this one.
I would like to […] protect my .js pages by having them return a 403 Forbidden http error status page if someone tries to access them directly via http.
Please note that when you include some resource, for example a script via the <script> tag in HTML or an image via the <img> tag, the browser does nothing other than simply run another HTTP request to get that resource. The whole communication already happens over HTTP.
While a browser may include additional details in its HTTP request when requesting additional resources, like the Referer header, it definitely is not required to do so. So if you check for the Referer header, be advised that you may lock out other valid clients which do not send it in their requests.
Also note that this will not give you any protection whatsoever. One can simply construct HTTP headers when requesting things, so "faking" requests your server would allow (because it thinks they are correct) is not a problem at all. And even without that, every resource you tell the client to use to make your website work will be downloaded by the client. After that, the client can do whatever it wants with it: cache it on the hard disk, or let the user look at it without having to run another request.
So if you want to do this to protect your code, just forget about it, and make things easier for everyone by not adding an ineffective layer of protection. Code you put on the web can be made difficult to read, but if you want the user to see the end result, you are also giving out your code in the same step.
In PHP you can do this with:
header("HTTP/1.0 403 Forbidden"); // must be called before any output is sent
exit; // stop the script so the file content is never served

How to know whether a request is a subrequest of opening a site or an independent request to the site

Is there any way to recognize (by processing HTTP packets or filtering TCP connections) whether several requests belong to one page opening or another?
Let me try to explain in more detail.
When we open any page in a browser, it also initiates different requests to download images, resources, and scripts. I'd like to know which set of requests was invoked by opening the site (call it the main site).
I can read the Referer header, but in that case how do I distinguish a request for a resource from a request to a different page whose link was clicked on the main site? In both cases the Referer will be the same.
I suspect that this problem cannot be solved, but I hope that I'm mistaken. Or perhaps you can offer some workaround.
If you are in control of the site, set a cookie or a URL parameter and check if it exists in subsequent requests.
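A toy sketch of the URL-parameter variant, assuming a Flask application (all names are hypothetical): the main page embeds a per-page-load token in its resource URLs, and subrequests carry it back:
import uuid
from flask import Flask, request

app = Flask(__name__)
issued = set()  # per-page-load tokens; a real app would expire these

@app.route("/")
def main_page():
    token = uuid.uuid4().hex
    issued.add(token)
    # Resource URLs embed the token, so subrequests can be correlated
    # with the page load that triggered them.
    return f'<img src="/logo.png?pageload={token}">'

@app.route("/logo.png")
def logo():
    kind = "subrequest" if request.args.get("pageload") in issued else "independent"
    return kind  # in a real app, serve the image and log the classification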

ASP.Net Relative Redirects and Resource Paths

We are working on the conversion of an ASP site to ASP.Net and are running into problems with redirects and resource locations. Our issues are coming from a peculiarity of our set-up. Our site can be accessed in two ways:
Directly by URL: http://www.mysite.com - in this case everything works fine
Via a proxy server with a URL like: http://www.proxy.com/mysite_proxy/proxy/
In #2 "mysite_proxy" is a mapping on proxy.com that directs the request behind the scenes to www.mysite.com, "proxy" is a virtual sub-website that just redirects the request to the root of www.mysite.com. It essntially is meant to give us a convenient way of knowing if a request is hitting the site from the proxy or not.
We are running into two problems with this setup:
Using Response.Redirect either with the "~" or a plain relative path (Default.aspx) generates a 302 response with a location of "/proxy/rest_of_the_path.aspx". This causes the browser to request http://www.proxy.com/proxy/rest_of_the_path.aspx, which doesn't exist and doesn't even hit our server, so we couldn't do an after-the-fact rewrite.
Using "~"-based URLs in our pages for links, images, style sheets, etc. creates the same kind of path: "/proxy/path_to_resources.css". We could probably solve some of these by using relative paths for all these resources, though that would be a lot of work, and it would do nothing to address similar resource links generated by the framework and third-party components.
Ideally I want to find a global fix that will make these problems transparent to the developers working on the site. I have a few ideas at this point:
Getting rid of the proxy. It is not really needed and is there for administrative rather than technical reasons. This is the easiest to accomplish technically and the hardest to accomplish in the real world.
Handing the problem off to the group that runs the proxy and saying it is their problem to fix.
Using a Response filter to modify the raw HTML before it is sent to the client. I know this could fix my resource links, but I am not certain about the headers (I need to test it out), and there would be a performance hit from having to parse every response looking for and rewriting URLs.
All of these solutions have big negatives in my mind and I was hoping someone might have another idea. So any thoughts?
Aside: there are a lot of posts up already that deal with the reverse of this issue (I have a relative URL, how do I make it absolute?), but I didn't come across anything that fit the bill for the other direction.
As a fix, I'd go with a small detection routine in Global.asax:Session_Start (since I imagine the proxy doesn't actually start another application instance), set a session variable with the correct path, and use it instead of "~".
In case a different application instance is used, use Application_Start instead of Session_Start and a static global variable instead of a session variable.
