When requesting http://example.com (using something like cURL), is there anyway to determine what the actual page server side is? Is it /index.php /index.html /index.asp?
This is a completely client side question.
Definitely no guaranteed way, there may not even be a default page at all on the server. Although there's usually some sort of page, script, or template associated with a given URL, it can be buried under several layers framework that make it not useful information anyway. You might be able to glean some extra info from the http response headers. But that's pretty much all you get on the client side.
Related
So I have an Nginx server set up which is supposed to redirect all http to https (and non-www to www) using 4 server blocks.
The issue is that any 404 or non existent http URL first get a 301 redirect to what could have been an https version if it hypothetically existed (hence creating an extra URL and redirect).
See example:
1) http://example.com/thisurldoesntexit
301 Redirect
2) https://example.com/thisurldoesntexit
404
3) https://example.com/notfound
Is there a way to redirect user directly to a https 404 (URL 3)?
First of all, as already been pointed out, doing a 301 redirect from a non-existent page to a single /notfound moniker, is a really bad practice, and is likely against the RFCs.
What if the user simply mistyped a single character of a long URL? Modern browsers make it non-trivial to go back to what has been typed in order to correct it. The user would have to decide whether your site is worth a retyping from scratch, or whether your competitor might have thought of a better experience.
What if the user simply followed a broken link, which is broken in a very obvious way, and could be easily fixed? E.g., http://www.example.org/www.example.com/page, where an absolute URL was mistyped by the creator to be a relative one, or maybe a URI like /page.html., with an extra dot in the end. Likewise, you'll be totally confusing the user with what's going on, and offering a terrible user experience, where if left alone, the URL could easily have been corrected promptly.
But, more importantly, what real problem are you actually trying to solve?!
For better or worse, it's a pretty common practice to indiscriminately redirect from http to https scheme, without an account of whether a given page may or may not exist. In fact, if you employ HSTS, then content served over http effectively becomes meaningless; the browser with a policy would never even be requesting anything over http from there on out.
Undoubtedly, in order to know whether or not a given page exists, you must consult with the backend. As such, you might as well do the redirect from http to https from within your backend; but it'll likely tie up your valuable server resources for little to no extra benefit.
Moreover, the presence or absence of the page may be dictated by the contents of the cookies. As such, if you require that your backend must discern whether a page does or does not exist for an http request, then you'll effectively be leaking private information that was meant to be protected by https in the first place. (In turn, if your site has no such private information, then maybe you shouldn't be using https in the first place.)
So, overall, the whole approach is just a REALLY, REALLY bad idea!
Consider instead:
Do NOT do a 301 redirect from all non-existent pages to a single /notfound page. Very bad practice, very bad UX.
It is totally OK to do an indiscriminate redirect from http to https, without accounting for whether or not the page exists. In fact, it's not only okay, but it's the way God intended, because an adversary should not be capable of discerning whether or not a given page exists for an https-based site, so, if you do find and implement a solution for your "problem", then you'll effectively create a security vulnerability and a data leak.
Use https://www.drupal.org/project/fast_404 module for serving 404 pages directly without much overload.
I'd suggest redirecting to a 404 page is a poor choice, and you should instead serve the 404 on the incorrect URL.
My reasons for stating this are:
By redirecting away from the page, you are issuing headers that implicitly say "The content does not exist on this URL, but it does over here". I'm not sure how the various search engines would react to being redirected to a 404
I can speak from my own experience as a user when I say that having the URL change on me when I've mis-typed by a single character can be very frustrating. I then need to spend the time to type out the entire URL again.
You can avoid having logic in your .htaccess file or whatever to judge a page as a 404. This will greatly simplify your initial logic (which by-the-by gets computed on every single page load) - and will remove far more redirects than just the odd one of http://badurl to https://badurl to https://404
I may be barking up completely the wrong tree here but what I would like to do is protect my .js pages by having them return a 403 Forbidden http error status page if someone tries to access them directly via http. I use them to support my index.html page but would like for them to remain hidden.
The helpdesk guys at my ISP basically say they don't know if it's possible but it may be something you could do with a web.config file (which is not something I have used before).
Any help at all would be gratefully received - I am a bit out of my comfort zone with this one
I would like to […] protect my .js pages by having them return a 403 Forbidden http error status page if someone tries to access them directly via http.
Please note that if you include some resource, for example a script via the <script>-tag in HTML or an image via the <img>-tag, the browser does nothing else than simply run another HTTP request to get that resource. The whole communication already happens over HTTP.
While a browser may include additional details in its HTTP request when requesting additional resources, like the Referer-header, it definitely is not required to do so. So if you look out for the Referer-header, be advised that you may lock out other valid clients which do not send the Referer-header in their requests.
Also note that this will not give you any protection whatsoever. One can simply construct HTTP headers when requesting things, so “faking” requests your server would allow (because it thinks they are correct) is not a problem at all. And even without that; every resource you tell the client to use to make your website work will be downloaded by the client. And after that, the client can do whatever he wants with it. It can cache them on the hard disk, or allow the user to quickly look at it without having to run another request.
So if you want to do this for protecting your code, then just forget about it, and make it easier for everyone by just not adding a non-optimal protection. Code you put on the web can be made difficult to read, but if you want the user to see the end result, then you also give out your code in the same step.
In php you can do this with:
header("HTTP/1.0 403 Forbidden");
We are working on the conversion of an ASP site to ASP.Net and are running into problems with redirects and resource locations. Our issues are coming from a peculiarity of our set-up. Our site can be accessed in two ways:
Directly by URL: http://www.mysite.com - in this case everything works fine
Via a proxy server with a URL like: http://www.proxy.com/mysite_proxy/proxy/
In #2 "mysite_proxy" is a mapping on proxy.com that directs the request behind the scenes to www.mysite.com, "proxy" is a virtual sub-website that just redirects the request to the root of www.mysite.com. It essntially is meant to give us a convenient way of knowing if a request is hitting the site from the proxy or not.
We are running into two problems with this setup:
Using Response.Redirect either with the "~" or a plain relative path (Default.aspx) generates a 302 response with a location of "/proxy/rest_of_the_path.aspx." This causes the browser to request http://www.proxy.com/proxy/rest_of_the_path.aspx which isn't anything and doesn't even hit our server so we couldn't do an after the fact re-write.
Using "~" based URLs in our pages for links, images, style-sheets, etc. creates the same kind of path: "/proxy/path_to_resources.css." We could probably solve some of these by using relative paths for all these resources though that would be a lot of work and it would do nothing to address similar resource links generated by the framework and 3rd party components.
Ideally I want to find a global fix that will make these problems transparent to the developers working on the site. I have a few ideas at this point:
Getting rid of the proxy, it is not really needed and is there for administrative and not technical reasons. Easiest to accomplish technically, the hardest to accomplish in the real world.
Hand the problem off to the group that runs the proxy and say it is their problem they need to fix it.
Use a Response filter to modify the raw html before it is sent to the client. I know this could fix my resource links, but I am not certain about the headers (need to test it out) and there would be a performance hit to having to parse every response looking for and re-writing urls.
All of these solutions have big negatives in my mind and I was hoping someone might have another idea. So any thoughts?
Aside: there are a lot of posts up already that deal with the reverse of this issue: I have a relative URL, how do I may it absolute, but I didn't come across anything that fit the bill for the other direction.
As a fix, I'd go with a small detection routine at Global.asax:Session_Start (since i imagine that the proxy doesn't actually starts another application instance), set a session variable with the correct path, and use it instead of '~'.
In the case a different application instance is used, then use Application_Start instead of Session_Start and a static Global variable instead of a Session variable.
In the current HTTP spec, the URL fragment (the part of the URL including and following the #) is not sent to the server in any way. However with the increased spread of AJAX, which uses the fragment to maintain some form of state, there are a lot of situations where it would be useful for the server to have knowledge of the URL fragment at request time.
For example, if you go to http://facebook.com, then click a user name in your stream, the URL will become http://faceboook.com/#!/username - to allow FB to update your page without reloading all of its bootstrap JS and HTML. However, if you were to reload this with your browser, the server would have no way of seeing the "#/!username" part of the URL, and therefore could not pre-render the content for you. This forces your browser to make an extra request once the client Javascript has loaded and parsed the fragment.
I am wondering if there have been any efforts or proposals towards creating a standard mechanism to achieve this.
For example, there could be a standard HTTP header, which would be sent with the value of the URL fragment - any server which cared about such things could then have access to it.
It seems like this would be a very useful thing for the web-application community as a whole, so I am surprised to not have heard anything proposed. Perhaps I missed it though.
Imho, the fragment identifier really is not a good place to store the state, it has been designed for something else.
That being said, http://www.jenitennison.com/blog/node/154 has a good discussion of the whole subject.
I found this proposal by Google to make Ajax pages crawlable, but it addresses a more constrained set of use cases. Specifically, it creates a way to replace the URL fragment with a URL parameter to obtain the same HTML output from the server as would be generated by a client visiting the equivalent URL with the fragment. However, such URLs are useless for actually running the Ajax apps, since they would necessitate a page reload every time.
Webkit Bug 24175 - URL Redirect Loses Fragment refers to Handling of fragment identifiers in redirected URLs which may be of interest.
A suggestion for a future version of HTTP may be to add an (optional)
Fragment header to the request, which holds the fragment identifier.
Even simpler may be to allow an HTTP request to contain a fragment
identifier.
I am trying to determine if there is a way to check the availability of a potentially large list of urls (> 1000000) without having to send a GET request to every single one.
Is it safe to assume that if http://www.example.com is inaccessible (as in unable to connect to server or the DNS request for the domain fails), or I get a 4XX or 5XX response, then anything from that domain will also be inaccessible (e.g. http://www.example.com/some/path/to/a/resource/named/whatever.jpg)? Would a 302 response (say for whatever.jpg) be enough to invalidate the first assumption? I imagine sub domains should be considered distinct as http://subdomain.example.com and http://www.example.com may not direct to the same ip?
I seem to be able to think of a counter example for each shortcut I come up with. Should I just bite the bullet and send out GET requests to every URL?
Unfortunately, no you cannot infer anything from 4xx or 5xx or any other codes.
Those codes are for individual pages, not for the server. It's quite possible that one page is down and another is up, or one has a 500 server-side error and another doesn't.
What you can do is use HEAD instead of GET. That retrieves the MIME header for the page but not the page content. This saves time server-side (because it doesn't have to render the page) and for yourself (because you don't have to buffer and then discard content).
Also I suggest you use keep-alive to accelerate responses from the same server. Many HTTP client libraries will do this for you.
A failed DNS lookup for a host (e.g. www.example.com) should be enough to invalidate all URLs for that host. Subdomains or other hosts would have to be checked separately though.
A 4xx code might tell you that a particular page isn't available, but you couldn't make any assumptions about other pages from that.
A 5xx code really won't tell you anything. For example, it could be that the page is there, but the server is just too busy at the moment. If you try it again later it might work fine.
The only assumption you should make about the availability of an URL is that "Getting an URL can and will fail".
It's not safe to assume that a sub domain request will fail when a parent one does. Namely because inbetween your two requests your network connection can go up, down or generally misbehave. It's also possible for the domains to be changed in between requests.
Ignoring all internet connection issues. You are still dealing with a live web site that can and will change constantly. What is true now might not be true in 5 minutes when they decide to alter their page structure or change the way the display a particular page. Your best bet is to assume any get will fail.
This may seem like an extreme view point. But these events will happen. How you handle them will determine the robustness of your program.
First don't assume anything based on a single page failing. I have seen many cases where IIS will continue to serve static content but not be able to serve any dynamic content.
You have to treat each host name as unique you cannot assume subdomain.example.com and example.com point to the same IP. Or even if they do there is no guarentee that are the same site. IIS again has host headers that allows you to run multiple sites using a single IP Address.
If the connection to the server actually fails, then there's no reason to check URLs on that server. Otherwise, you can't assume anything.
In addition to what everyone else is saying, use HEAD requests instead of GET requests. They function the same, but the response doesn't contain the message body, so you save everyone some bandwidth.