Multiple requests to server question - http

I have a DB with user accounts information.
I've scheduled a CRON job which updates the DB with every new user data it fetches from their accounts.
I was thinking that this may cause a problem since all requests are coming from the same IP address and the server may block requests from that IP address.
Is this the case?
If so, how do I avoid being banned? should I be using a proxy?
Thanks

You get banned for suspicious (or malicious) activity.
If you are running a normal business application inside a normal company intranet you are unlikely to get banned.
Since you have access to user accounts information, you already have a lot of access to the system. The best thing to do is to ask your systems administrator, since he/she defines what constitutes suspicious/malicious activity. The systems administrator might also want to help you ensure that your database is at least as secure as the original information.
should I be using a proxy?
A proxy might disguise what you are doing - but you are still doing it. So this isn't the most ethical way of solving the problem.

Is the cron job that fetches data from this "database" on the same server? Are you fetching data for a user from a remote server using screen scraping or something?
If this is the case, you may want to set up a few different cron jobs and do it in batches. That way you reduce the amount of load on the remote server and lower the chance of wherever you are getting this data from, blocking your access.
Edit
Okay, so if you have not got permission to do scraping, obviously you are going to want to do it responsibly (no matter the site). Try gather as much data as you can from as little requests as possible, and spread them out over the course of the whole day, or even during times that a likely to be low load. I wouldn't try and use a proxy, that wouldn't really help the remote server, but it would be a pain in the ass to you.
I'm no iPhone programmer, and this might not be possible, but you could try have the individual iPhones grab the data so all the source traffic isn't from the same IP. Just an idea, otherwise just try to be a bit discrete.
Here are some tips from Jeff regarding the scraping of Stack Overflow, but I'd imagine that the rules are similar for any site.
Use GZIP requests. This is important! For example, one scraper used 120 megabytes of bandwidth in only 3,310 hits which is substantial. With basic gzip support (baked into HTTP since the 90s, and universally supported) it would have been 20 megabytes or less.
Identify yourself. Add something useful to the user-agent (ideally, a link to an URL, or something informational) so we can see your bot as something other than "generic unknown anonymous scraper."
Use the right formats. Don't scrape HTML when there is a JSON or RSS feed you could use instead. Heck, why scrape at all when you can download our cc-wiki data dump??
Be considerate. Pulling data more than every 15 minutes is questionable. If you need something more timely than that ... why not ask permission first, and make your case as to why this is a benefit to the SO community and should be allowed? Our email is linked at the bottom of every single page on every SO family site. We don't bite... hard.
Yes, you want an API. We get it. Don't rage against the machine by doing naughty things until we build it. It's in the queue.

Related

Calculate the number of visits based on downloaded GB

I have a website hosted in firebase that totally went viral for a day. Since I wasn't expecting that, I didn't install any analytics tool. However, I would like to know the number of visits or downloads. The only metric I have available is the GB Downloaded: 686,8GB. But I am confused because if I open the website with the console of Chrome, I get two different metrics about the size of the page: 319KB transferred and 1.2MB resources. Furthermore, not all of those things are transferred from firebase but from other CDN as you can see in the screenshots. What is the proper way of calculating the visits I had?
Transferred metric is how much bandwidth was used after compression was applied.
Resources metric is how much disk space those resources use before they are compressed (for transfer).
True analytics requires an understanding how what is on the web. There are three classifications:
Humans, composed of flesh and blood and overwhelmingly (though not absolutely) use web browsers.
Spiders (or search engines) that request pages with the notion that they obey robots.txt and will list your website in their websites for relevant search queries.
Rejects (basically spammers and the unknowns) which include (though are far from limited to) content/email scrapers, brute-force password guessers, vulnerability scanners and POST spammers.
With this clarification in place what you're asking in effect is, "How many human visitors am I receiving?" The easiest way to obtain that information is to:
Determine what user agent requests are human (not easy, behavior based).
Determine the length of time a single visit from a human should count as.
Assign human visitors a session.
I presume you understand what a cookie is and how it differs from a session cookie. Obviously when you sign in to a website you are assigned a session. If that session cookie is not sent to the server on a page request you will in effect be signed out. You can make session cookies last for a long time and it will come down to factors such as convenience for the visitor and if you directly count those sessions or use it in conjunction with something else.
Now your next thought likely is, "But how do I count downloads?" Thankfully you mention PHP in your website so I can thankfully give you some code that should make sense to you. If you just link directly to the file you'd be stuck with (at best) counting clicks via a click event on the anchor element though if the download gets canceled because it was a mistake or something else makes it more subjective than my suggestion. Granted my suggestion can still be subjective (e.g. they decide they actually don't want to download and cancel before the completion) and of course if they use the download is another aspect to consider. That being said if you want the server to give you a download count you'd want to do the following:
You'll may want to use Apache rewrite (or whatever the other HTTP server equivalents are) so that PHP handles the download.
You'll may need to ensure Apache has the proper handling for PHP (e.g. AddType application/x-httpd-php5 .exe .msi .dmg) so your server knows to let PHP run on the request file.
You'll want to use PHP's file_exists() with an absolute file path on the server for the sake of security.
You'll want to ensure that you set the correct mime for the file via PHP's header() as you should expect browsers to be horrible at guessing.
You absolutely need to use die() or exit() to avoid Gecko (Firefox) bugs if your software leaks even whitespace as the browser would interpret it as part of the file likely causing corruption.
Here is the code for PHP itself:
$p = explode('/',strrev($_SERVER['REQUEST_URI']));
$file = strrev($p[0]);
header('HTTP/1.1 200');
header('Content-Type: '.$mime);
echo file_get_contents($path_absolute.$file);
die();
For counting downloads if you want to get a little fancy you could create a couple of database tables. One for the files (download_files) and the second table for requests (download_requests). Throw in basic SQL queries and you're collecting data. Record IPv6 (Storing IPv6 Addresses in MySQL) and you'll be able to discern from a query how many unique downloads you have.
Back to human visitors: it takes a very thorough study to understand the differences between humans and bots. Things like Captcha are garbage and are utterly annoying. You can get a rough start by requiring a cookie to be sent back on requests though not all bots are ludicrously stupid. I hope this at least gets you on the right path.

how to prevent vulnerability scanning

I have a web site that reports about each non-expected server side error on my email.
Quite often (once each 1-2 weeks) somebody launches automated tools that bombard the web site with a ton of different URLs:
sometimes they (hackers?) think my site has inside phpmyadmin hosted and they try to access vulnerable (i believe) php-pages...
sometimes they are trying to access pages that are really absent but belongs to popular CMSs
last time they tried to inject wrong ViewState...
It is clearly not search engine spiders as 100% of requests that generated errors are requests to invalid pages.
Right now they didn't do too much harm, the only one is that I need to delete a ton of server error emails (200-300)... But at some point they could probably find something.
I'm really tired of that and looking for the solution that will block such 'spiders'.
Is there anything ready to use? Any tool, dlls, etc... Or I should implement something myself?
In the 2nd case: could you please recommend the approach to implement? Should I limit amount of requests from IP per second (let's say not more than 5 requests per second and not more then 20 per minute)?
P.S. Right now my web site is written using ASP.NET 4.0.
Such bots are not likely to find any vulnerabilities in your system, if you just keep the server and software updated. They are generally just looking for low hanging fruit, i.e. systems that are not updated to fix known vulnerabilities.
You could make a bot trap to minimise such traffic. As soon as someone tries to access one of those non-existant pages that you know of, you could stop all requests from that IP address with the same browser string, for a while.
There are a couple of things what you can consider...
You can use one of the available Web Application Firewalls. It usually has set of rules and analytic engine that determine suspicious activities and react accordingly. For example in you case it can automatically block attempts to scan you site as it recognize it as a attack pattern.
More simple (but not 100% solution) approach is check referer url (referer url description in wiki) and if request was originating not from one of you page you rejected it (you probably should create httpmodule for that purpose).
And of cause you want to be sure that you site address all known security issues from OWASP TOP 10 list (OWASP TOP 10). You can find very comprehensive description how to do it for asp.net here (owasp top 10 for .net book in pdf), i also recommend to read the blog of the author of the aforementioned book: http://www.troyhunt.com/
Theres nothing you can do (reliabily) to prevent vulernability scanning, the only thing to do really is to make sure you are on top of any vulnerabilities and prevent vulernability exploitation.
If youre site is only used by a select few and in constant locations you could maybe use an IP restriction

Monitoring ASP.NET and SQL Server for Security

What is the best (or any good) way to monitor an ASP.NET application to ensure that it is secure and to quickly detect intrusion? How do we know for sure that, as of right now, our application is entirely uncompromised?
We are about to launch an ASP.NET 4 web application, with the data stored on SQL Server. The web server runs in IIS on a Windows Server 2008 instance, and the database server runs on SQL Server 2008 on a separate Win 2008 instance.
We have reviewed Microsoft's security recommendations, and I think our application is very secure. We have implemented "defense in depth" and considered a range of attack vectors.
So we "feel" confident, but have no real visibility yet into the security of our system. How can we know immediately if someone has penetrated? How can we know if a package of some kind has been deposited on one of our servers? How can we know if a data leak is in progress?
What are some concepts, tools, best practices, etc.?
Thanks in advance,
Brian
Additional Thoughts 4/22/11
Chris, thanks for the very helpful personal observations and tips below.
What is a good, comprehensive approach to monitoring current application activity for security? Beyond constant vigilance in applying best practices, patches, etc., I want to know exactly what is going on inside my system right now. I want to be able to observe and analyze its activity in a way that clearly shows me which traffic is suspect and which is not. Finally, I want this information to be totally accurate and easy to digest.
How do we efficiently get close to that? Wouldn't a good solution include monitoring logins, database activity, ASP.NET activity, etc. in addition to packets on the wire? What are some examples of how to assume a strong security posture?
Brian
The term you are looking for is Intrusion Detection System (IDS). There is a related term called Intrusion Prevention System (IPS).
IDS's monitor traffic coming into your servers at the IP level and will send alerts based on sophisticated analysis of the traffic.
IPS's are the next generation of IDS which actually attempt to block certain activities.
There are many commercial and open source systems available including Snort, SourceFire, Endace, and others.
In short, you should look at adding one of these systems to your mix for real time monitoring and potentially blocking of hazardous activities.
I wanted to add a bit more information here as the comments area is just a bit small.
The main thing you need to understand are the types of attacks you will see. These are going to range from relatively unsophisticated automated scripts on up to highly sophisticated targeted attacks. They will also hit everything they can see from the web site itself to IIS, .Net, Mail server, SQL (if accessible), right down to your firewall and other exposed machines/services. A wholistic approach is the only way to really monitor what's going on.
Generally speaking, a new site/company is going to be hit with the automated scripts within a few minutes (I'd say 30 at most) of going live. Which is the number one reason new installations of MS Windows keep the network severely locked down during installation. Heck, I've seen machines nailed within 30 seconds of being turned on for the first time.
The approach hackers/worms take is to constantly scan wide ranges of IP addresses, this is followed up with machine fingerprinting for those that respond. Based on the profile they will send certain types of attacks your way. In some cases the profiling step is skipped and they attack certain ports regardless of response. Port 1443 (SQL) is a common one.
Although the most common form of attack, the automated ones are by far the easiest to deal with. Shutting down unused ports, turning off ICMP (ping response), and having a decent firewall in place will keep most of the scanners away.
For the scripted attacks, make sure you aren't exposing commonly installed packages like PhpMyAdmin, IIS's web admin tools, or even Remote Desktop outside of your firewall. Also, get rid of any accounts named "admin", "administrator", "guest", "sa", "dbo", etc Finally make sure your passwords AREN'T allowed to be someones name and are definitely NOT the default one that shipped with a product.
Along these lines make sure your database server is NOT directly accessible outside the firewall. If for some reason you have to have direct access then at the very least change the port # it responds to and enforce encryption.
Once all of this is properly done and secured the only services that are exposed should be the web ones (port 80 / 443). The items that can still be exploited are bugs in IIS, .Net, or your web application.
For IIS and .net you MUST install the windows updates from MS pretty much as soon as they are released. MS has been extremely good about pushing quality updates for windows, IIS, and .Net. Further a large majority of the updates are for vulnerabilities already being exploited in the wild. Our servers have been set to auto install updates as soon as they are available and we have never been burned on this (going back to at least when server 2003 was released).
Also you need to stay on top of the updates to your firewall. It wasn't that long ago that one of Cisco's firewalls had a bug where it could be overwhelmed. Unfortunately it let all traffic pass through when this happened. Although fixed pretty quickly, people were still being hammered over a year later because admins failed to keep up with the IOS patches. Same issue with windows updates. A lot of people have been hacked simply because they failed to apply updates that would have prevented it.
The more targeted attacks are a little harder to deal with. A fair number of hackers are going after custom web applications. Things like posting to contact us and login forms. The posts might include JavaScript that, once viewed by an administrator, could cause credentials to be transferred out or might lead to installing key loggers or Trojans on the recipients computers.
The problem here is that you could be compromised without even knowing it. Defenses include making sure HTML and JavaScript can't be submitted through your site; having rock solid (and constantly updated) spam and virus checks at the mail server, etc. Basically, you need to look at every possible way an external entity could send something to you and do something about it. A lot of Fortune 500 companies keep getting hit with things like this... Google included.
Hope the above helps someone. If so and it leads to a more secure environment then I'll be a happy guy. Unfortunately most companies don't monitor traffic so they have no idea just how much time is spent by their machines fending off this garbage.
I can say some thinks - but I will glad to hear more ideas.
How can we know immediately if someone has penetrated?
This is not so easy and in my opinion, ** an idea is to make some traps** inside your backoffice , together with monitor for double logins from different ips.
a trap can be anything you can think of, for example a non real page that say "create new administrator", or "change administrator password", on backoffice, and there anyone can gets in and try to make a new administrator is for sure a penetrator - of course this trap must be known only on you, or else there is no meaning for that.
For more security, any change to administrators must need a second password, and if some one try to make a real change on administrators account, or try to add any new administrator, and fails on this second password must be consider as a penetrator.
way to monitor an ASP.NET application
I think that any tool that monitor the pages for some text change, can help on that. For example this Network Monitor can monitor for specific text on you page and alert you, or take some actions if this text not found, that means some one change the page.
So you can add some special hiden text, and if you not found, then you can know for sure that some one change the core of your page, and probably is change files.
How can we know if a package of some kind has been deposited on one of our servers
This can be any aspx page loaded on your server and act like a file browser. For this not happens I suggest to add web.config files to the directories that used for uploading data, and on this web.config do not allow anything to run.
<configuration>
<system.web>
<authorization>
<deny users="*" />
</authorization>
</system.web>
</configuration>
I have not tried it yet, but Lenny Zeltser directed me to OSSEC, which is a host-based intrusion detection system that continuously monitors an entire server to detect any suspicious activity. This looks like exactly what I want!
I will add more information once I have a chance to fully test it.
OSSEC can be found at http://www.ossec.net/

Determining Website Capacity

A client of mine has a website and they need to determine how 'scalable' the site currently is. What I mean by this is the number of users browsing around the site concurrently.
It's a custom e-commerce app in .net, not written by myself and the code is... well lets just say, a bit dubious.
A much bigger company is looking to buy them / throw funding their way but they need some form of metrics to show how much load it can take before it falls apart. This big company has the ability to 'turn on the taps' to a huge user base - and obviously doesn't want to do that if the site is going to fall over with a sneeze of traffic.
What is a good metric to provide here? And how can I obtain it?
Edit: Question revised
I always use Apache's "ab" tool: link text
Run it from a different machine, preferably a BSD or Linux machine with no firewall rules that will limit the performance of the tool. Because otherwise the result might not be as reliable. If you use a Windows machine, make sure you're using one that isn't limiting the number of active TCP connections.
When using "ab", the number you're looking for it "Requests per second". Experiment with the concurrency switch to see how many concurrent users you can handle before you're getting a lot of errors, or when the requests per seconds is dropping rapidly.
When you are noticing the webserver is having serious issues you should restart the webserver, and let it rest for a while before continuing the test.
You'd be better off with a hosted load test, as this might give you more insight on realworld scenario's (something like http://www.scl.com/software-quality/hosted-load-test, no experience with them though).
Furthermore: scalability is as far as I know, not how many concurrent users can be served, but the way how easy it is to serve more when the site grows bigger (by adding extra servers etc, how easy is it for the website to scale up, does the codebase allow to use unlimited number of servers, etc.)
Well, I suppose it'll depend on what the client cares about.
Do they care about how many users to can access the site at once? Report on that, but running simultaneous requests from another server until it dies, then get the number.
Do they care about something else?
For me, when someone says they want it to 'scale', it really means they have no idea what they want. So try and talk to them, and get specific details of what, exactly, they want to see 'scaling', and then, once you find the areas to analyse, you can do so trivially, and attempt to improve them.

How can I implement an IRC Server with 'owned' nicknames?

Recently, I've been reading up on the IRC protocol (RFCs 1459, 2810-2813), and I was thinking of implementing my own server.
I'm not necessarily looking into adhering religiously to the IRC protocol (I'm doing this for fun, after all), but one of the things I do like about it is that a network can consist of multiple servers transparently.
There are a number of things I don't like about the protocol or the IRC specification. The first is that nicknames aren't owned. While services like NickServ exist, they're not part of the official protocol. On the other hand, implementing something like NickServ properly kind of defeats the purpose of distribution (i.e. there'd be one place where NickServ is running, and one data store for it).
I was hoping there'd be a way to manage nicknames on a per-server basis. The problem with this is that if you have two servers that have some registered nicknames, and they then link up, you can have collisions.
Is there a way to avoid this, without using one central data store? That is: is it possible to keep the servers loosely connected (such that they each exist as an independent entity, but can also connect to one another) and maintain uniqueness amongst nicknames?
I realize this question is vague, but I can't think of a better way of wording it. I'm looking more for suggestions than I am for actual yes/no answers. So if anyone has any ideas as to how to accomplish nickname uniqueness in a network while still maintaining server independence, I'd be interested in hearing it. Note that adhering strictly to the IRC protocol isn't at all necessary; I've got no problem changing things to suit my purposes. :)
There's a simple solution if you don't care about strictly implementing an IRC server, but rather implementing a distributed message system that's like IRC, but not exactly IRC.
The simple solution is to use nicknames in the form "nick#host", much like email. So instead of merely being "mipadi", my nickname could be "mipadi#free-memorys-server.net". So I register with just your server, but when your server links up with others to form another a big ole' chat network, you can easily union all the usernames together. There might be a "mipadi" on otherserver.net, but then our nicknames become "mipadi#free-memorys-server.net" and "mipadi#otherserver.net", and everything is cool.
Of course, this deviates a good deal from IRC. :)
They have to be aware of each other. If not, you cannot prevent the sharing of nicknames. If they are, you simply need to transfer updates on the back-end. To prevent simultaneous registrations, you need a transaction system that blocks, requests permission from all other servers, and responds.
To prevent simultaneous registrations during outages, you have no choice but to timestamp the registration, and remove all but the last (or a random for truly simultaneous) registered copy of the nick.
It's not very pretty considering these servers aren't initially merged in the first place.
You could still implement nick ownership without a central instance, if your server instances trust each other.
When a user registers a nick, it is registered with the current server he's connected with
When a server receives a registration that it didn't know of, it forwards that information to all other servers that don't know it yet (might need a smart algorithm to avoid spamming the network)
When a server re-connects to another server then it tries to synchronize the list of registered nicks and which server handles which nick
If there is a collision during that sync, then the older registration is used, and the newer one marked as invalid
If you can't trust your servers, then it'll get a lot harder, as a servers could easily claim every username and even claim the oldest registration for each one.
Since you are trying to come up with something new, the idea that springs to mind, is simply including something unique about the server as part of the nick name when communicating outside of the server. So if you want to message a user on a different server you might have something like user#server
If you don't need them to be completely separate you might want to consider creating some kind of multiple-master replicated database of accounts. Where each server stores a complete copy of the account database, and each server can create new accounts which will be replicated to other servers as possible. You'll probably still have to deal with collisions on occasion though.
While services like NickServ exist, they're not part of the official protocol.
Services are not part of the official protocol because they've nothing to do with the protocol. They're bots with permissions. There's no reason why you couldn't have one running on each server but it does make them harder to maintain.
If you were to go down that path, I would probably suggest the commonly used "multiple master" database replication technique. If one receives a write (in your case, a new user is created or updated, etc) it sends the data to all the other nodes. You'll have to be careful though. If one node is offline when the others get an update, it will need to know to resync on reconnection.
Another technique would be as above but in reverse. Data is only exchanged between nodes when it's needed. Eg if a user tries to log in on a node where there's no data for it, it'll query the others and issue a move order to get all the data to that one node. This is potentially less painful than the replication version but there could be severe problems in netsplits if somebody signs up on a node disconnected from the pack for a duplicate nick.
One technique to nullify the problems of netsplits would be to make chat nodes and their bots netsplit-aware. When they're split, they probably shouldn't allow any write actions... But this could impact on your network if you're splitting lots.
You've also got to ask how secure this might or might not be. IRC network nodes are distributed for performance but they're not "secure". Because of this, service bots are usually run centrally to keep ultimate control over their running. If you distributed the bots and remote node got hacked, they'd potentially have access to the whole user database (depending on the model).

Resources