Robot request to an ASP.Net app

Is there a way to determine if an HTTP request to an ASP.Net application is made from a browser or from a robot/crawler? I need to differentiate between these two kinds of requests.
Thanks!

No, there isn't. There is no foolproof way to determine what originated a request - all HTTP headers can be spoofed.
Some crawlers (GoogleBot and such) do advertise themselves, but that doesn't mean a person browsing can't pretend to be GoogleBot.
The best strategy is to look for the well-known bots (by User-Agent header and possibly by known IP address) and assume those are crawlers.
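For example, a minimal sketch of that strategy in ASP.NET (the bot list and helper name here are illustrative, not exhaustive):

using System;
using System.Linq;
using System.Web;

public static class BotDetector
{
    // Illustrative substrings of well-known crawler User-Agents; extend as needed.
    private static readonly string[] KnownBots =
        { "googlebot", "bingbot", "slurp", "duckduckbot", "baiduspider" };

    public static bool LooksLikeKnownBot(HttpRequest request)
    {
        string ua = request.UserAgent;
        if (string.IsNullOrEmpty(ua))
            return true; // real browsers always send a User-Agent; many scripts do not
        ua = ua.ToLowerInvariant();
        return KnownBots.Any(bot => ua.Contains(bot));
    }
}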

Well... only if the robot wants to be recognized as a robot, because it can easily pretend to be a web browser.
Personally, I would use this list as a starting point: http://www.robotstxt.org/db.html

Have a look at Request.Browser.Crawler, but that only works for some crawlers.
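For what it's worth, that check is a one-liner inside a page or handler; it consults ASP.NET's browser capability files, so it only recognizes crawlers those files list:

// Inside a Page, where Request is the current System.Web.HttpRequest:
bool isCrawler = Request.Browser.Crawler; // true only for crawlers known to the .browser capability files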

Related

Difference Between // and http://

I know that HTTP is hyper text transfer protocol, and I know that's how (along with HTTPS) one accesses a website. However, what does just a // do? For instance, to access Google's copy of jQuery, one would use the url //ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js, as opposed to http://....
What exactly is the difference? What does just // indicate?
Thanks.
Using // means: use whatever protocol (i.e. http vs. https) the user is currently on for that resource.
That way you don't have to manage http: vs. https: yourself, and you avoid potential mixed-content browser security warnings. It is good practice to stick with this approach.
For example: if your user is accessing http://yourdomain/, that script file would automatically be treated as http://ajax.googleapis.com/...
If your current request is http, then
//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js
will be treated as
http://ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js
and if your current request is https, the same reference will be treated as
https://ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js
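This scheme-relative resolution also works outside the browser. As a small illustration, .NET's Uri class applies the same RFC 3986 reference-resolution rules (the base page URL below is made up):

using System;

class Demo
{
    static void Main()
    {
        // Resolve a protocol-relative reference against an https base page.
        var basePage = new Uri("https://yourdomain.example/order/checkout");
        var resolved = new Uri(basePage, "//ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js");
        Console.WriteLine(resolved.AbsoluteUri);
        // prints https://ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js
    }
}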

Inspect how requests routed through a proxy look to their destination

My web app makes requests to third-party servers, and we sometimes route them through proxies. I'd like to be able to "see what they see" -- see what the request looks like once it's been routed through the proxy.
Specifically, I'm interested in how much identifying information about the source (my web app) is left in the request once it reaches the destination, having been routed through the proxy.
Does anyone know an easy way to do this? Maybe a web service that will just echo back all the information about the incoming request in the outgoing response?
Not a full answer, but maybe you can try:
http://www.cantoni.org/2012/01/08/simple-webservice-echo-test
And the other two sites mentioned there:
http://respondto.it/
http://requestb.in/
You can set up a URL there to send your requests to and see if the info provided helps you.
I'm just stating this as an idea that came to me. You could try sending requests to your own URL, which you control (i.e. a resource in your own web application). That way, you can use your debugging infrastructure or other facilities (basically anything you want) to inspect the request that's coming into your application. It seems to me this might be the most powerful / easiest way to do this. It won't let you test the URL you were trying to test, but in terms of proxy visibility, it might be what you need.
Good luck!
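As a concrete sketch of that "echo it back yourself" idea, assuming classic ASP.NET (the handler name is mine; wire it up as an .ashx or in web.config):

using System.Web;

public class EchoHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        // Dump the request line and every header exactly as this app received them.
        context.Response.ContentType = "text/plain";
        context.Response.Write(context.Request.HttpMethod + " " + context.Request.RawUrl + "\r\n");
        foreach (string name in context.Request.Headers)
            context.Response.Write(name + ": " + context.Request.Headers[name] + "\r\n");
    }

    public bool IsReusable { get { return true; } }
}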
If the proxy supports the TRACE method and the Max-Forwards header you can use that. Not all do, however.
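For reference, here is roughly what that looks like on the wire (host name made up). Each proxy that forwards a TRACE decrements Max-Forwards, and the hop that receives it at 0 must answer itself, echoing the request it received in a message/http body - so sending Max-Forwards: 1 through a single proxy makes the origin show you the request exactly as it arrived after the proxy:

TRACE /some/path HTTP/1.1
Host: destination.example
Max-Forwards: 1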

HTTP Response before Request

My question might sound stupid, but I just wanted to be sure:
Is it possible to send an HTTP response before having the request for that resource?
Say for example you have an HTML page index.html that only shows a picture called img.jpg.
Now, if your server knows that a visitor will request the HTML file and then the jpg image every time:
Would it be possible for the server to send the image just after the HTML file to save time?
I know that HTTP is a synchronous protocol, so in theory it should not work, but I just wanted someone to confirm it (or not).
A recent post by Jacques Mattheij, referencing your very question, claims that although HTTP was designed as a synchronous protocol, the implementations are not. In practice the browser (he doesn't specify which exactly) accepts answers to requests that have not been sent yet.
On the other hand, if you are looking for something less hacky, you could have a look at:
push techniques that allow the server to send content to the browser. The modern implementation that replaces long-polling/Comet "hacks" is websockets. You may want to have a look at socket.io as well.
Alternatively, you may want to have a look at client-side routing. Some implementations combine this with caching techniques (as in derby.js, I believe).
If someone requests /index.html and you send two responses (one for /index.html and the other for /img.jpg), how do you know the recipient will get the two responses and know what to do with them before the second request goes in?
The problem is not really with the sending. The problem is with the receiver possibly getting unexpected data.
One other issue is that you're denying the client the ability to use HTTP caching tools like If-Modified-Since and If-None-Match (i.e. the client might not want /img.jpg to be sent because it already has a cached copy).
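To see what's being given up, the normal revalidation exchange looks like this (the ETag value is made up); a pushed /img.jpg would bypass the client's chance to get a cheap 304 instead of the full body:

GET /img.jpg HTTP/1.1
Host: example.com
If-None-Match: "abc123"

HTTP/1.1 304 Not Modified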
That said, you can approximate the server-push benefits by using Comet techniques. But that is much more involved than simply anticipating incoming HTTP requests.
You'll get a better result by caching resources effectively, i.e. setting proper cache headers and configuring your web server for caching. You can also inline images using base64 encoding, if that's a specific concern.
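As a rough sketch of the cache-header route in ASP.NET (the seven-day lifetime and ETag value are arbitrary illustrations):

// In the page or handler serving the image:
Response.Cache.SetCacheability(HttpCacheability.Public); // shared caches may store it
Response.Cache.SetMaxAge(TimeSpan.FromDays(7));          // Cache-Control: max-age of one week
Response.Cache.SetETag("\"img-v1\"");                    // lets clients revalidate with If-None-Match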
You can also look at long-polling JavaScript solutions.
You're looking for server push: it isn't available in HTTP. Protocols like SPDY have it, but you're out of luck if you're restricted to HTTP.
I don't think it is possible to mix the .html and the image in the same HTTP response. As for sending the image data 'immediately', right after the first request - there is a concept of 'static resources' which could be of help (but it will require the client to create a new request for a specific resource).
There are a couple of interesting things mentioned in the article.
No it is not possible.
The first line of the request holds the resource being requested so you wouldn't know what to respond with unless you examined the bytes (at least one line's worth) of the request first.
No. HTTP is defined as a request/response protocol. One request: one response. Anything else is not HTTP, it is something else, and you would have to specify it properly and implement it completely at both ends.

Is there any downside for using a leading double slash to inherit the protocol in a URL? i.e. src="//domain.example"

I have a stylesheet that loads images from an external domain and I need it to load from https:// from secure order pages and http:// from other pages, based on the current URL. I found that starting the URL with a double slash inherits the current protocol. Do all browsers support this technique?
HTML ex:
<img src="//cdn.domain.example/logo.png" />
CSS ex:
.class { background: url(//cdn.domain.example/logo.png); }
If the browser supports RFC 1808 Section 4, RFC 2396 Section 5.2, or RFC 3986 Section 5.2, then it will indeed use the page URL's scheme for references that begin with "//".
When used on a link or @import, IE7/IE8 will download the file twice, per http://paulirish.com/2010/the-protocol-relative-url/
Update from 2014:
Now that SSL is encouraged for everyone and doesn’t have performance concerns, this technique is now an anti-pattern. If the asset you need is available on SSL, then always use the https:// asset.
One downside occurs if your URLs are viewed outside the context of a web page. For example, an email message sitting in an email client (say, Outlook) effectively has no URL, and when you're viewing a message containing a protocol-relative URL, there is no obvious protocol context at all (the message itself is independent of the protocol used to fetch it, whether it's POP3, IMAP, Exchange, uucp or whatever) so the URL has no protocol to be relative to. I've not investigated compatibility with email clients to see what they do when presented with a missing protocol handler - I'm guessing that most will take a guess at http. Apple Mail refuses to let you enter a URL without a protocol. It's analogous to the way that relative URLs do not work in email because of a similarly missing context.
Similar problems could occur in other non-HTTP contexts such as in tweets, SMS messages, Word documents etc.
The more general explanation is that anonymous protocol URLs cannot work in isolation; there must be a relevant context. In a typical web page it's thus fine to pull in a script library that way, but any external links should always specify a protocol. I did try one simple test: //stackoverflow.com maps to file:///stackoverflow.com in all browsers I tried it in, so they really don't work by themselves.
The reason could be to provide portable web pages. If the outer page is not transported encrypted (http), why should the linked scripts be encrypted? That seems an unnecessary performance loss. If the outer page is transported encrypted (https), then the linked content should be encrypted too; if the page is encrypted but the linked content is not, IE issues a Mixed Content warning. The reason is that an attacker could manipulate the scripts in transit. See http://ie.microsoft.com/testdrive/Browser/MixedContent/Default.html?o=1 for a longer discussion.
The HTTPS Everywhere campaign from the EFF suggests using https whenever possible. We have the server capacity these days to always serve web pages encrypted.
Just for completeness. This was mentioned in another thread:
The "two forward slashes" are a common shorthand for "whatever protocol is being used right now"
if (plain http environment) {
use 'http://example.com/my-resource.js'
} else {
use 'https://example.com/my-resource.js'
}
Please check the full thread.
It seems to be a pretty common technique now. There is no downside; it only helps to unify the protocol for all assets on the page, so it should be used wherever possible.

What HTTP signatures are encountered from Google web crawling robots?

With all HTTP data available, what 'signs' can you look for to recognize Google's search engine robots?
How to verify googlebot - the official method.
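In outline, the official method is: reverse-DNS the requesting IP, check that the resulting host is under googlebot.com or google.com, then forward-resolve that host and confirm it maps back to the same IP. A minimal sketch (error handling omitted; the class and method names are mine):

using System;
using System.Linq;
using System.Net;

public static class GooglebotVerifier
{
    public static bool IsRealGooglebot(string ipString)
    {
        IPAddress ip = IPAddress.Parse(ipString);

        // Reverse lookup: what host name does this IP resolve to?
        string host = Dns.GetHostEntry(ip).HostName;
        if (!host.EndsWith(".googlebot.com", StringComparison.OrdinalIgnoreCase) &&
            !host.EndsWith(".google.com", StringComparison.OrdinalIgnoreCase))
            return false;

        // Forward lookup: confirm the host really maps back to the same IP.
        return Dns.GetHostEntry(host).AddressList.Contains(ip);
    }
}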
As far as I know, Google's crawlers have the user-agent set to "Googlebot".
Other search engine providers typically stick to a recognisable name in the user-agent; there are various lists of well-known agents, such as that on http://www.jafsoft.com/searchengines/webbots.html.
The User-Agent header should be enough to detect the Google bot
Check out the user-agents.org website to get a list of known search engine bots.
By the way, if you want to be sure it's a true Googlebot from Google, you can check the host for the IP address, which is always
c[nn].googlebot.com
Where [nn] is a number.
Well, I'm not so sure how maintainable it is to be doing reverse DNS lookups for IP addresses. I would only do this if you were concerned about someone spoofing Google's user agent strings, which is highly unlikely. The reverse lookup can be spoofed itself too, as the article points out.
You're best off just matching their known user agents:
// requires: using System.Text.RegularExpressions;
Regex.IsMatch(ua, @"googlebot|mediapartners-google|adsbot-google", RegexOptions.IgnoreCase);
