I want to prevent automated html scraping from one of our sites while not affecting legitimate spidering (googlebot, etc.). Is there something that already exists to accomplish this? Am I even using the correct terminology?
EDIT: I'm mainly looking to prevent people who would be doing this maliciously, i.e. they aren't going to abide by robots.txt.
EDIT 2: What about limiting by rate of use, i.e. requiring a CAPTCHA to continue browsing if automation is detected and the traffic isn't from a legitimate (Google, Yahoo, MSN, etc.) IP?
This is difficult if not impossible to accomplish. Many "rogue" spiders/crawlers do not identify themselves via the user agent string, so it is difficult to identify them. You can try to block them via their IP address, but it is difficult to keep up with adding new IP addresses to your block list. It is also possible to block legitimate users if IP addresses are used since proxies make many different clients appear as a single IP address.
The problem with using robots.txt in this situation is that the spider can just choose to ignore it.
EDIT: Rate limiting is a possibility, but it suffers from some of the same problems of identifying (and keeping track of) "good" and "bad" user agents/IPs. In a system we wrote to do some internal page view/session counting, we eliminate sessions based on page view rate, but we also don't worry about eliminating "good" spiders since we don't want them counted in the data either. We don't do anything about preventing any client from actually viewing the pages.
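If you do experiment with rate-based detection, the core of it can be quite small. Here is a rough in-memory sketch in Perl (this is only an illustration, not the internal counting system mentioned above; the window and threshold values are arbitrary):
use strict;
use warnings;

my %hits;            # client IP => list of recent request timestamps
my $window   = 60;   # look-back window in seconds
my $max_hits = 120;  # requests allowed inside the window

# Call from your request handler with the client IP; returns true when the
# client exceeds the rate and should be challenged (e.g. with a CAPTCHA).
sub too_fast {
    my ($ip) = @_;
    my $now = time;
    push @{ $hits{$ip} }, $now;
    @{ $hits{$ip} } = grep { $_ > $now - $window } @{ $hits{$ip} };
    return scalar(@{ $hits{$ip} }) > $max_hits;
}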
One approach is to set up an HTTP tar pit: embed a link that will only be visible to automated crawlers. The link should go to a page stuffed with random text and links back into the pit under varying names (/tarpit/foo.html, /tarpit/bar.html, /tarpit/baz.html), with the script at /tarpit/ handling every request with a 200 result.
To keep the good guys out of the pit, generate a 302 redirect to your home page if the user agent is google or yahoo.
It isn't perfect, but it will at least slow down the naive ones.
EDIT: As suggested by Constantin, you could mark the tar pit as off-limits in robots.txt. The good guys, who use web spiders that honor this protocol, will stay out of the tar pit. This would probably remove the need to generate redirects for known good agents.
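A rough CGI sketch of such a /tarpit/ handler in Perl (the crawler names, link counts, and text sizes are only illustrative):
#!/usr/bin/perl
use strict;
use warnings;
use CGI;

my $q  = CGI->new;
my $ua = $q->user_agent || '';

# Send known-good crawlers back to the home page (or rely on robots.txt instead).
if ($ua =~ /googlebot|bingbot|slurp/i) {
    print $q->redirect(-uri => '/', -status => 302);
    exit;
}

# Everyone else gets a 200 full of junk text and more links into the pit.
print $q->header(-type => 'text/html', -status => '200 OK');
print "<html><body>\n";
print join(' ', map { sprintf '%08x', int rand 0xffffffff } 1 .. 200), "\n";
for (1 .. 10) {
    my $page = sprintf 'page%04d.html', int rand 10000;
    print qq{<a href="/tarpit/$page">$page</a>\n};
}
print "</body></html>\n";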
If you want to protect yourself from a generic crawler, use a honeypot.
See, for example, http://www.sqlite.org/cvstrac/honeypot. A good spider will not open this page because the site's robots.txt disallows it explicitly. A human may open it, but is not supposed to click the "i am a spider" link. A bad spider will certainly follow both links and so will betray its true identity.
If the crawler is created specifically for your site, you can (in theory) create a moving honeypot.
I agree with the honeypot approach generally. However, I put the ONLY link to the honeypot page/resource on a page that is itself blocked by "/robots.txt", and block the honeypot the same way. This way, the malicious robot has to violate the "disallow" rule(s) TWICE to ban itself. A typical user manually following an unclickable link is likely to do this only once and may never find the page containing the honeypot URL.
The honeypot resource logs the offending IP address of the malicious client into a file, which is used as an IP ban list elsewhere in the web server configuration. This way, once listed, the web server blocks all further access by that client IP address until the list is cleared. Others may prefer some sort of automatic expiration, but I believe in manual removal only from a ban list.
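A rough sketch of such a honeypot handler in Perl (the ban-list path is an assumption; point it wherever your web server or firewall script reads its ban list from):
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);

my $ip       = $ENV{REMOTE_ADDR} || 'unknown';
my $ban_list = '/var/www/banned_ips.txt';    # assumed location of the ban list

# Append the offending IP under an exclusive lock so concurrent hits don't interleave.
open my $fh, '>>', $ban_list or die "cannot open $ban_list: $!";
flock $fh, LOCK_EX;
print {$fh} "$ip\n";
close $fh;

# Give the client nothing useful back.
print "Content-type: text/plain\n\nForbidden.\n";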
Aside: I also do the same thing with spam and my mail server: Sites which send me spam as their first message get banned from sending any further messages until I clear the log file. Although I implement these ban lists at the application level, I also have firewall level dynamic ban lists. My mail and web servers also share banned IP information between them. For an unsophisticated spammer, I figured that the same IP address may host both a malicious spider and a spam spewer. Of course, that was pre-BotNet, but I never removed it.
robots.txt only works if the spider honors it. You can create an HttpModule to filter out spiders that you don't want crawling your site.
You should do what good firewalls do when they detect malicious use - let them keep going but don't give them anything else. If you start throwing 403 or 404 they'll know something is wrong. If you return random data they'll go about their business.
For detecting malicious use, try adding a trap link on your search results page (or whatever page they are using as your site map) and hiding it with CSS. You do need to check whether a client is claiming to be a valid bot and let it through, though. You can store its IP for future use and a quick ARIN WHOIS lookup.
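One way to check a claimed bot (not spelled out above, so treat it as a suggestion) is forward-confirmed reverse DNS: resolve the IP to a host name, check that the name belongs to the search engine's domain, then resolve the name back and make sure it maps to the original IP. A rough Perl sketch:
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa AF_INET);

# True if $ip reverse-resolves to a googlebot.com/google.com host
# that also forward-resolves back to the same address.
sub is_real_googlebot {
    my ($ip) = @_;
    my $name = gethostbyaddr(inet_aton($ip), AF_INET) or return 0;
    return 0 unless $name =~ /\.(googlebot|google)\.com$/i;
    my @host  = gethostbyname($name) or return 0;
    my @addrs = map { inet_ntoa($_) } @host[4 .. $#host];
    return scalar grep { $_ eq $ip } @addrs;
}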
1. Install iptables and tcpdump (on Linux).
2. Detect and authorize good traffic, for example Googlebot.
In Perl:
$auth="no";
$host=`host $ip`;
if ($host=~/.googlebot.com\.$/){$auth="si";}
if ($host=~/.google.com\.$/){$auth="si";}
if ($host=~/.yandex.com\.$/){$auth="si";}
if ($host=~/.aspiegel.com\.$/){$auth="si";}
if ($host=~/.msn.com\.$/){$auth="si";}
Note: a reverse lookup of a Googlebot address looks like this: 55.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-55.googlebot.com.
3. Create a scheduled job or service that captures traffic, counts packets per host, and inserts the counts into a database; alternatively, add a SQL query to your site that records each client IP in order to count its traffic.
For example, in Perl:
$ip="$ENV{'REMOTE_ADDR'}";
use DBI;
if ($ip !~/^66\.249\./){
my $dbh = DBI->connect('DBI:mysql:database:localhost','user','password') or die print "non connesso:";
my $sth = $dbh->prepare("UPDATE `ip` SET totale=totale+1, oggi=oggi+1, dataUltimo=NOW() WHERE ip ='$ip'");
$sth ->execute;
$rv = $sth->rows;
if ($rv < 1){
my $sth = $dbh->prepare("INSERT INTO `ip` VALUES (NULL, '$ip', 'host', '1', '1', 'no', 'no', 'no', NOW(), 'inelenco.com', oggi+1)");
$sth ->execute;
}
$dbh->disconnect();
}
Or capture the traffic to a service with tcpdump, for example in Perl:
$porta = 80;          # port to watch (adjust as needed)
$tout  = 10;          # capture window in seconds
$totpk = 3000;        # maximum number of packets to capture
$tr = `timeout $tout tcpdump port $porta -nn -c $totpk`;
@trSplit = split(/\n/, $tr);
undef %conta;
foreach $trSplit (@trSplit){
    if ($trSplit =~ /IP (.+?)\.(.+?)\.(.+?)\.(.+?)\.(.+?) > (.+?)\.(.*?)\.(.+?)\.(.+?)\.(.+?): Flags/){
        $ipA = "$1.$2.$3.$4";
        $ipB = "$6.$7.$8.$9";
        if ($ipA eq "<SERVER_IP>"){ $ipA = "127.0.0.1"; }
        if ($ipB eq "<SERVER_IP>"){ $ipB = "127.0.0.1"; }
        $conta{$ipA}++;
        $conta{$ipB}++;
    }
}
4. Block a host if its traffic exceeds $max_traffic.
For example, in Perl:
foreach $ip (keys %conta){
    if ($conta{$ip} > $max_traffic){ block($ip); }
}
sub block{
    my $ipX = shift;
    if ($ipX =~ /:/){                        # IPv6 addresses contain colons
        $tr  = `ip6tables -A INPUT -s $ipX -j DROP`;
        $tr .= `ip6tables -A OUTPUT -s $ipX -j DROP`;
        print "IPv6 $ipX blocked\n";
        print $tr."\n";
    }
    else{
        $tr  = `iptables -A INPUT -s $ipX -j DROP`;
        $tr .= `iptables -A OUTPUT -s $ipX -j DROP`;
        print "IPv4 $ipX blocked\n";
        print $tr."\n";
    }
}
Another method is to read the web server's traffic logs.
On Linux, for example, /var/log/apache2/*error.log contains all request errors and /var/log/apache2/*access.log contains all web traffic.
Create a small script (Bash or Perl) that reads the log and blocks bad spiders.
The same idea works for other kinds of attacks: for example, to block SSH brute-force attempts, read the SSH error log and block each offending IP with iptables -A INPUT -s $ip -j DROP.
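A rough Perl version of that idea (the log path, threshold, and whitelisted range are illustrative; it reuses the iptables blocking from step 4):
use strict;
use warnings;

my %count;
my $threshold = 5000;    # requests per log file before an IP gets blocked (arbitrary)

open my $log, '<', '/var/log/apache2/access.log' or die "cannot open log: $!";
while (my $line = <$log>) {
    my ($ip) = $line =~ /^(\S+)/;    # first field of the common/combined log format
    $count{$ip}++ if defined $ip;
}
close $log;

foreach my $ip (sort keys %count) {
    next if $ip =~ /^66\.249\./;     # skip the Googlebot range, as above
    if ($count{$ip} > $threshold) {
        system('iptables', '-A', 'INPUT', '-s', $ip, '-j', 'DROP');
        print "blocked $ip ($count{$ip} requests)\n";
    }
}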
Related
Last week I started quite a fuss in my Computer Networks class over the need for a mandatory Host clause in the header of HTTP 1.1 GET messages.
The reason I'm given, whether written on the Web or shouted at me by my classmates, is always the same: the need to support virtual hosting. However, and I'll try to be as clear as possible, this does not appear to make sense.
I understand that in order to allow two domains to be hosted in a single machine (and by consequence, share the same IP address), there has to exist a way of differentiating both domain names.
What I don't understand is why it isn't possible to achieve this without a Host clause (HTTP 1.0 style) by using an absolute URL (e.g. GET http://www.example.org/index.html) instead of a relative one (e.g. GET /index.html).
When the HTTP message got to the server, it (the server) would redirect the message to the appropriate host, not by looking at the Host clause but, instead, by looking at the hostname in the URL present in the message's request line.
I would be very grateful if any of you hardcore hackers could help me understand what exactly I am missing here.
This was discussed in this thread:
modest suggestions for HTTP/2.0 with their rationale.
Add a header to the client request that indicates the hostname and
port of the URL which the client is accessing.
Rationale: One of the most requested features from commercial server
maintainers is the ability to run a single server on a single port
and have it respond with different top level pages depending on the
hostname in the URL.
Making an absolute request URI required (because there's no way for the client to know beforehand whether the server hosts one or more sites) was suggested:
Re the first proposal, to incorporate the hostname somewhere. This
would be cleanest put into the URL itself :-
GET http://hostname/fred http/2.0
This is the syntax for proxy redirects.
To which this argument was made:
Since there will be a mix of clients, some supporting host name reporting
and some not, it just doesn't matter how this info gets to the server.
Since it doesn't matter, the easier to implement solution is a new HTTP
request header field. It allows all clients and servers to operate as they
do now with NO code changes. Clients and servers that actually need host
name information can have tiny mods made to send the extra header field
containing the URL and process it.
[...]
All I'm suggesting is that there is a better way to
implement the delivery of host name info to the server that doesn't involve
hacking the request syntax and can be backwards compatible with ALL clients
and servers.
Feel free to read on to discover the final decision yourself. But be warned, it's easy to get lost in there.
The reason for adding support for specifying a host in an HTTP request was the limited supply of IP addresses (which was not an issue yet when HTTP 1.0 came out).
If your question is "why specify the host in a Host header as opposed to on the Request-Line", the answer is the need for interoperability between HTTP/1.0 and 1.1.
If the question is "why is the Host header mandatory", this has to do with the desire to speed up the transition away from assigned IP addresses.
Here's some background on the Internet address conservation with respect to HTTP/1.1.
The reason for the 'Host' header is to make explicit which host this request refers to. Without 'Host', the server must know ahead of time that it is supposed to route 'http://joesdogs.com/' to Joe's Dogs while it is supposed to route 'http://joscats.com/' to Jo's Cats even though they are on the same webserver. (What if a server has 2 names, like 'joscats.com' and 'joescats.com' that should refer to the same website?)
Having an explicit 'Host' header makes these kinds of decisions much easier to program.
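To make that concrete, here is a minimal sketch in Perl (192.0.2.10 is a placeholder documentation address) of two HTTP/1.1 requests sent to the same server IP; only the Host header differs, and that is what lets the server pick the right virtual host:
use strict;
use warnings;
use IO::Socket::INET;

for my $site ('joesdogs.com', 'joscats.com') {
    my $sock = IO::Socket::INET->new(
        PeerAddr => '192.0.2.10',    # the one shared IP (placeholder address)
        PeerPort => 80,
        Proto    => 'tcp',
    ) or die "connect failed: $!";
    print $sock "GET / HTTP/1.1\r\nHost: $site\r\nConnection: close\r\n\r\n";
    print while <$sock>;             # each response comes from a different virtual host
    close $sock;
}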
I've just got my hands on a Raspberry Pi and I've set it up to act as the DNS and DHCP server on my home network. This means that all network requests go through it before they are released into the wild... Which offers me a great opportunity to use tcpdump and see what is happening on my network!
I am playing around with the tcpdump arguments to create the perfect network spy. The idea is to capture HTTP GET requests.
This is what I have so far and it's pretty good:
tcpdump -i eth0 'tcp[((tcp[12:1] & 0xf0)>> 2):4] = 0x47455420' -A
The -i eth0 tells it which interface to listen to
The bit in quotes is a nifty bit of hex matching to detect a GET request
The -A means "print the ASCII contents of this packet"
This fires every time anything on my network sends a GET request, which is great. My question, finally, is how can I filter out boring requests like images, JavaScript, favicons etc?
Is this even possible with tcpdump or do I need to move onto something more comprehensive like tshark?
Thanks for any help!
DISCLAIMER: Currently the only person on my network is me... This is not malicious, it's a technical challenge!
Grep is your friend :-) tcpdump ... | grep -vE "^GET +(/.*\.js|/favicon\.ico|.*\.png|.*\.jpg|.*\.gif|...) +HTTP" will hide things like GET /blah/blah/blah.js HTTP/1.0, GET /favicon.ico HTTP/1.0, GET /blah/blah/blah.png HTTP/1.0, etc.
Getting this error message in the browser:
Attention!!!
The transfer attempted appeared to contain a data leak!
URL=http://test-login.becreview.com/domain/User_Edit.aspx?UserID=b5d77644-b10e-44e0-a007-3b9a5e0f4fff
I've seen this before but I'm not sure what causes it. It doesn't look like a browser error or an asp.net error. Could it be some sort of proxy error? What causes it?
That domain is internal so you won't be able to go to it. Also the page has almost no styling. An h1 for "Attention!!!" and the other two lines are wrapped in p tags if that helps any.
For anyone else investigating this message, it appears to be a Fortinet firewall's default network data-leak prevention message.
It doesn't look like an ASP.NET error that I've ever seen.
If you think it might be a proxy message you should reconfigure your browser so it does not use a proxy server, or try to access the same URL from a machine that has direct access to the web server (and doesn't use the same proxy).
This is generated from an inline IPS sensor (usually an appliance or a VM) that is also configured to scan traffic for sensitive data (CC info, SSNs etc). Generally speaking, the end user cannot detect or bypass this proxy as it is deployed to be transparent. It is likely also inspecting all SSL traffic. In simple terms, it is performing a MITM attack because your organizational policy has specified that all traffic to and from your network be inspected.
I'm pretty sure I remember reading about this (but cannot find the links anymore): on some ISPs (including at least one big ISP in the U.S.) it is possible for a user's GET and POST requests to appear to come from different IPs.
(note that this is totally programming related, and I'll give an example below)
I'm not talking about having your IP address dynamically change between two requests.
I'm talking about this:
IP 1: 123.45.67.89
IP 2: 101.22.33.44
The same user makes a GET, then a POST, then a GET again, then a POST again and the servers see this:
- GET from IP 1
- POST from IP 2
- GET from IP 1
- POST from IP 2
So although it's the same user, the webserver sees different IPs for the GET and the POSTs.
Surely, given that HTTP is a stateless protocol, this is perfectly legit, right?
I'd like to find that explanation again as to how/why certain ISPs have their networks configured such that this may happen.
I'm asking because someone asked me to implement the following IP filter, and I'm pretty sure it is fundamentally broken code (wreaking havoc for the users of at least one major American ISP).
Here's a Java servlet filter that is supposed to protect against some attacks. The reasoning is that:
"For any session filter checks that IP address in the request is the same that was used when session was created. So in this case session ID could not be stolen for forming fake sessions."
http://www.servletsuite.com/servlets/protectsessionsflt.htm
However I'm pretty sure this is inherently broken because there are ISPs where you may see GET and POST coming from different IPs.
Some ISPs (or university networks) operate transparent proxies which relay the request from the outgoing node that is under the least network load.
It would also be possible to configure this on a local machine to use the NIC with the lowest load which could, again, result in this situation.
You are correct that this is a valid state for HTTP and, although it should occur relatively infrequently, this is why validation of a user based on IP is not an appropriate determinant of identity.
For a web server to be seeing this implies that the end user is behind some kind of proxy/gateway. As you say it's perfectly valid given that HTTP is stateless, but I imagine would be unusual. As far as I am aware most ISPs assign home users a real, non-translated IP (albeit usually dynamic).
Of course, for corporate/institutional networks they could be doing anything. Load balancing could mean that requests come from different IPs, and maybe sometimes request types get farmed out to different gateways (although I'd be interested to know why, given that N_GET >> N_POST).
In my application, I have to send notification e-mails from time to time. In order to send mail (over SMTP), I have to get the MX server of that particular domain (domain part of e-mail address). This is not a Unix application but an Embedded one.
What I do goes like this ::
1 - Send a DNS query (MX type) containing the domain to the current DNS
2 - If the response contains the MX answer, return success from this function
3 - Read the first NS record and copy its IP address to the current DNS, go to 1
This may loop a few times and this is expected but what I do not expect is that the response contains NS records of servers named like ns1.blahblah.com but not their IP addresses. In this case, I have to send another query to find the IP of this NS. I have seen this for only 1 e-mail address (1 domain), the other addresses worked without any problem.
Is this normal behaviour? IMHO, it is a misconfiguration in the DNS records. Any thoughts?
Thanks in advance...
The authority section in the message, as well as the additional section, are optional. I.e., the name servers and their IPs don't have to be in the response to the MX query. It is up to the DNS server to decide whether to send that extra information, even when it already has the data.
You are stuck having to query for the MX and then query for the IP of the mail server.
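As a sketch of that two-step flow, assuming a full resolver library is available (your embedded stack will need its own equivalent, so treat this purely as an illustration), in Perl with Net::DNS:
use strict;
use warnings;
use Net::DNS;

my $domain = 'example.org';                     # domain part of the address
my $res    = Net::DNS::Resolver->new;           # system resolver, recursion desired by default

# Step 1: ask for the MX records.
my $mx_reply = $res->query($domain, 'MX')
    or die 'MX lookup failed: ' . $res->errorstring . "\n";

foreach my $mx (grep { $_->type eq 'MX' } $mx_reply->answer) {
    my $exchange = $mx->exchange;               # name of the mail host

    # Step 2: resolve the mail host to an address if it wasn't in the additional section.
    my $a_reply = $res->query($exchange, 'A') or next;
    foreach my $a (grep { $_->type eq 'A' } $a_reply->answer) {
        printf "%d %s %s\n", $mx->preference, $exchange, $a->address;
    }
}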
Short answer to your question: RFC 1035 says,
NS records cause both the usual additional section processing to locate
a type A record, and, when used in a referral, a special search of the
zone in which they reside for glue information.
...the additional records section contains RRs
which relate to the query, but are not strictly answers for the
question.
...When composing a response, RRs which are to be inserted in the
additional section, but duplicate RRs in the answer or authority
sections, may be omitted from the additional section.
So the bottom line, in my opinion, is that yes, if the response does not contain the A record matching the NS record in some section, something is likely misconfigured somewhere. But, as the old dodge goes, "be liberal in what you accept"; if you are going to make the queries, you will need to handle situations like this. DNS is awash in these kinds of problems.
The longer answer requires a question: how are you getting the original DNS server where you are starting the MX lookup?
What you are doing is a non-recursive query: if the first server you query does not know the answer, it points you at another server that is "closer" in the DNS hierarchy to the domain you are looking for, and you have to make the subsequent queries to find the MX record. If you are starting your query at one of the root servers, I think you will have to follow the NS pointers yourself like you are.
However, if the starting DNS server is configured in your application (i.e. a manual configuration item or via DHCP), then you should be able to make a recursive request, using the Recursion Desired flag, which pushes the repeated lookups off onto the configured DNS server. In that case you would just get the MX record value in your first response. On the other hand, recursive queries are optional, and your local DNS server may not support them (which would be bizarre since, historically, many client libraries have relied on recursive lookups).
In any case, I would personally like to thank you for looking up MX records. I have had to deal with systems that wanted to send mail but could not do the DNS lookups, and the number and variety of bizarre and unpleasant hacks they have used has left me with emotional scars.
It could be that the domain simply does not have an MX record. I completely take out the MX entry for my unused/parked domains; it saves my mail server a lot of grief (spam).
There really is no need to go past step 2. If the system (or ISP) resolver returned no MX entry, it's because it already did the extra steps and found nothing. Or, possibly, the system host resolver is too slow (i.e. from an ISP).
Still, I think it's appropriate to just bail out if either happened, as it's clearly a DNS or ISP issue, not a problem with the function. Just tell the user that you could not resolve an MX record for the domain, and let them investigate it on their end.
Also, is it feasible to make the resolvers configurable in the application itself, so users could get around a flaky NS?