How can I configure Munin to give me a total of all my cloud servers? - networking

I have a dozen load balanced cloud servers all monitored by Munin.
I can track each one individually just fine. But I'm wondering if I can somehow bundle them up to see just how much collective CPU usage (for example) there is among the cloud cluster as a whole.
How can I do this?
The munin.conf file makes it easy enough to handle this for subdomains, but I'm not sure how to configure this for simple web nodes. Assume my web nodes are named, web_node_1 - web_node_10.
My conf looks something like this right now:
[web_node_1]
address 10.1.1.1
use_node_name yes
...
[web_node_10]
address 10.1.1.10
use_node_name yes
Your help is much appreciated.

You can achieve this with sum and stack.
I've just had to do the same thing, and I found this article pretty helpful.
Essentially you want to do something like the following:
[web_nodes;Aggregated]
update no
cpu_aggregate.update no
cpu_aggregate.graph_args --base 1000 -r --lower-limit 0 --upper-limit 200
cpu_aggregate.graph_category system
cpu_aggregate.graph_title Aggregated CPU usage
cpu_aggregate.graph_vlabel %
cpu_aggregate.graph_order system user nice idle
cpu_aggregate.graph_period second
cpu_aggregate.user.label user
cpu_aggregate.nice.label nice
cpu_aggregate.system.label system
cpu_aggregate.idle.label idle
cpu_aggregate.user.sum web_node_1:cpu.user web_node_2:cpu.user
cpu_aggregate.nice.sum web_node_1:cpu.nice web_node_2:cpu.nice
cpu_aggregate.system.sum web_node_1:cpu.nice web_node_2:cpu.system
cpu_aggregate.idle.sum web_node_1:cpu.nice web_node_2:cpu.idle
There are a few other things to tweak the graph to give it the same scale, min/max, etc as the main plugin, those can be copied from the "cpu" plugin file. The key thing here is the last four lines - that's where the summing of values from other graphs comes in.

Related

Copying 100 GB with continues change of file between datacenters with R-sync is good idea?

I have a datacenter A which has 100GB of the file changing every millisecond. I need to copy and place the file in Datacenter B. In case of failure on Datacenter A, I need to utilize the file in B. As the file is changing every millisecond does r-sync can handle it at 250 miles far datacenter? Is there any possibility of getting the corropted file? As it is continuously updating when we call this as a finished file in datacenter B ?
rsync is a relatively straightforward file copying tool with some very advanced features. This would work great for files and directory structures where change is less frequent.
If a single file with 100GB of data is changing every millisecond, that would be a potential data change rate of 100TB per second. In reality I would expect the change rate to be much smaller.
Although it is possible to resume data transfer and potentially partially reuse existing data, rsync is not made for continuous replication at that interval. rsync works on a file level and is not as commonly used as a block-level replication tool. However there is an --inplace option. This may be able to provide you the kind of file synchronization you are looking for. https://superuser.com/questions/576035/does-rsync-inplace-write-to-the-entire-file-or-just-to-the-parts-that-need-to
When it comes to distance, the 250 miles may result in at least 2ms of additional latency, if accounting for the speed of light, which is not all that much. In reality this would be more due to cabling, routers and switches.
rsync by itself is probably not the right solution. This question seems to be more about physics, link speed and business requirements than anything else. It would be good to know the exact change rate, and to know if you're allowed to have gaps in your restore points. This level of reliability may require a more sophisticated solution like log shipping, storage snapshots, storage replication or some form of distributed storage on the back end.
No, rsync is probably not the right way to keep the data in sync based on your description.
100Gb of data is of no use to anybody without without the means to maintain it and extract information. That implies structured elements such as records and indexes. Rsync knows nothing about this structure therefore cannot ensure that writes to the file will transition from one valid state to another. It certainly cannot guarantee any sort of consistency if the file will be concurrently updated at either end and copied via rsync
Rsync might be the right solution, but it is impossible to tell from what you have said here.
If you are talking about provisioning real time replication of a database for failover purposes, then the best method is to use transaction replication at the DBMS tier. Failing that, consider something like drbd for block replication but bear in mind you will have to apply database crash recovery on the replicated copy before it will be usable at the remote end.

Graphite + Collectd - How to plot memory used percentage for each host?

I have graphite+collectd setup to collect system related metrics. This question concerns with the memory plugin for collectd.
My infra has this format for collecting memory usage data using collectd:
<cluster>.<host>.memory.memory-{buffered,cached,free,used}
I want to plot the percentage of memory used for each host.
So basically, I have to do something like this:
divideSeries(sumSeriesWithWildCards(*.*.memory.memory-{buffered,cached,free},1),sumSeriesWithWildCards(*.*.memory.memory-{buffered,cached,free,used}),1)
But I am not able to do this, as divideSeries wants the divisor metric to return only one metric.
I basically want a single target to monitor all hosts in a cluster.
How can I do this?
try this one:
asPercent(host.memory.memory-used, sumSeries(host.memory.memory-{used,free,cached,buffered}))
you'll get a graph of memory usage in percents for one hosts. Unfortunately I wasn't able to make it work with wildcards (multiple hosts).
Try this for multiple nodes with regex.
alias(asPercent(sumSeries(collectd.nodexx*_internal_cloudapp_net.memory.memory.used), sumSeries(collectd.nodexx*_internal_cloudapp_net.memory.memory.{used,free,cached,buffered})),"Memory Used")
alias(asPercent(sumSeries(collectd.nodexx*_internal_cloudapp_net.memory.memory.{cached,buffered}), sumSeries(collectd.nodexx*_internal_cloudapp_net.memory.memory.{used,free,cached,buffered})),"Memory Cached")
alias(asPercent(sumSeries(collectd.nodexx*_internal_cloudapp_net.memory.memory.free), sumSeries(collectd.nodexx*_internal_cloudapp_net.memory.memory.{used,free,cached,buffered})),"Memory Free")

How can I find the average number of concurrent users for IIS to simulate during a load/performance test?

I'm using JMeter for load testing. I'm going through and exercise of finding the max number of concurrent threads (users) that our webserver can handle by simply increasing the # of threads in my distributed JMeter test case, and firing off the test.
Then -- it struck me, that while the MAX number may be useful, the REAL number of users that my website actually handles on average is the number I need to make the test fruitful.
Here are a few pieces of information about our setup:
This is a mixed .NET/Classic ASP site. Upon login, a browser session (with timeout) is created in both for the users.
Each session times out after 60 minutes.
Is there a way using this information, IIS logs, performance counters, and/or some calculation that will help me determine the average # of concurrent users we handle on our production site?
You might use logparser with the QUANTIZE function to determine the peak number of requests over a suitable interval.
For a 10 second window, it would be something like:
logparser "select quantize(to_localtime(to_timestamp(date,time)), 10) as Qnt,
count(*) as Hits from yourLogFile.log group by Qnt order by Hits desc"
The reported counts won't be exactly the same as threads or users, but they should help get you pointed in the right direction.
The best way to do exact counts is probably with performance counters, but I'm not sure any of the standard ones works like you would want -- you'd probably need to create a custom counter.
I can see a couple options here.
Use Performance Monitor to get the current numbers or have it log all day and get an average. ASP.NET has a Requests Current counter. According to this page Classic ASP also has a Requests current, but I've never used it myself.
Run the IIS logs through Log Parser to get the total number of requests and how long each took. I'm thinking that if you know how many requests come in each hour and how long each took, you can get an average of how many were running concurrently.
Also, keep in mind that concurrent users isn't quite the same as concurrent threads on the server. For one, multiple threads will be active per user while content like images is being downloaded. And after that the user will be on the page for a few minutes while the server is idle.
My suggestion is that you define the stop conditions first, such as
Maximum CPU utilization
Maximum memory usage
Maximum response time for requests
Other key parameters you like
It is really subjective to choose the parameters and I personally cannot provide much experience on that.
Secondly you can see whether performance counters or IIS logs can map to the parameters. Then you set up proper mappings.
Thirdly you can start testing by simulating N users (threads) and see whether the stop conditions hit. If not hit, you can go to a higher number. If hit, you can use a smaller number. Recursively you will find a rough number.
However, that never means your web site in real world can take so many users. No simulation so far can cover all the edge cases.

Asp.net guaranteed response time

Does anybody have any hints as to how to approach writing an ASP.net app that needs to have a guaranteed response time?
When under high load that would normally cause us to exceed our desired response time, we want to throw out an appropriate number of requests, so that the rest of the requests can return before the max response time. Throwing out requests based on exceeding a fixed req/s is not viable, as there are other external factors that will control response time that cause the max rps we can safely support to fiarly drastically drift and fluctuate over time.
Its ok if a few requests take a little too long, but we'd like the great majority of them to meet the required response time window. We want to "throw out" the minimal or near minimal number of requests so that we can process the rest of the requests in the allotted response time.
It should account for ASP.Net queuing time, ideally the network request time but that is less important.
We'd also love to do adaptive work, like make a db call if we have plenty of time, but do some computations if we're shorter on time.
Thanks!
SLAs with a guaranteed response time require a bit of work.
First off you need to spend a lot of time profiling your application. You want to understand exactly how it behaves under various load scenarios: light, medium, heavy, crushing.. When doing this profiling step it is going to be critical that it's done on the exact same hardware / software configuration that production uses. Results from one set of hardware have no bearing on results from an even slightly different set of hardware. This isn't just about the servers either; I'm talking routers, switches, cable lengths, hard drives (make/model), everything. Even BIOS revisions on the machines, RAID controllers and any other device in the loop.
While profiling make sure the types of work loads represent an actual slice of what you are going to see. Obviously there are certain load mixes which will execute faster than others.
I'm not entirely sure what you mean by "throw out an appropriate number of requests". That sounds like you want to drop those requests... which sounds wrong on a number of levels. Doing this usually kills an SLA as being an "outage".
Next, you are going to have to actively monitor your servers for load. If load levels get within a certain percentage of your max then you need to add more hardware to increase capacity.
Another thing, monitoring result times internally is only part of it. You'll need to monitor them from various external locations as well depending on where your clients are.
And that's just about your application. There are other forces at work such as your connection to the Internet. You will need multiple providers with active failover in case one goes down... Or, if possible, go with a solid cloud provider.
Yes, in the last mvcConf one of the speakers compares the performance of various view engines for ASP.NET MVC. I think it was Steven Smith's presentation that did the comparison, but I'm not 100% sure.
You have to keep in mind, however, that ASP.NET will really only play a very minor role in the performance of your app; DB is likely to be your biggest bottle neck.
Hope the video helps.

What is the most accurate method of estimating peak bandwidth requirement for a web application?

I am working on a client proposal and they will need to upgrade their network infrastructure to support hosting an ASP.NET application. Essentially, I need to estimate peak usage for a system with a known quantity of users (currently 250). A simple answer like "you'll need a dedicated T1 line" would probably suffice, but I'd like to have data to back it up.
Another question referenced NetLimiter, which looks pretty slick for getting a sense of what's being used.
My general thought is that I'll fire the web app up and use the system like I would anticipate it be used at the customer, really at a leisurely pace, over a certain time span, and then multiply the bandwidth usage by the number of users and divide by the time.
This doesn't seem very scientific. It may be good enough for a proposal, but I'd like to see if there's a better way.
I know there are load tools available for testing web application performance, but it seems like these would not accurately simulate peak user load for bandwidth testing purposes (too much at once).
The platform is Windows/ASP.NET and the application is hosted within SharePoint (MOSS 2007).
In lieu of a good reporting tool for bandwidth usage, you can always do a rough guesstimate.
N = Number of page views in busiest hour
P = Average Page size
(N * P) /3600) = Average traffic per second.
The server itself will have a lot more internal traffic for probably db server/NAS/etc. But outward facing that should give you a very rough idea on utilization. Obviously you will need to far surpass the above value as you never want to be 100% utilized, and to allow for other traffic.
I would also not suggest using an arbitrary number like 250 users. Use the heaviest production day/hour as a reference. Double and triple if you like, but that will give you the expected distribution of user behavior if you have good log files/user auditing. It will help make your guesstimate more accurate.
As another commenter pointed out, a data center is a good idea, when redundancy and bandwidth availability become are a concern. Your needs may vary, but do not dismiss the suggestion lightly.
There are several additional questions that need to be asked here.
Is it 250 total users, or 250 concurrent users? If concurrent, is that 250 peak, or 250 typically? If it's 250 total users, are they all expected to use it at the same time (eg, an intranet site, where people must use it as part of their job), or is it more of a community site where they may or may not use it? I assume the way you've worded this that it is 250 total users, but that still doesn't tell enough about the site to make an estimate.
If it's a community or "normal" internet site, it will also depend on the usage - eg, are people really going to be using this intensely, or is it something that some users will simply log into once, and then forget? This can be a tough question from your perspective, since you will want to assume the former, but if you spend a lot of money on network infrastructure and no one ends up using it, it can be a very bad thing.
What is the site doing? At the low end of the spectrum, there is a "typical" web application, where you have reasonable size (say, 1-2k) pages and a handful of images. A bit more intense is a site that has a lot of media - eg, flickr style image browsing. At the upper end is a site with a lot of downloads - streaming movies, or just large files or datasets being downloaded.
This is getting a bit outside the threshold of your question, but another thing to look at is the future of the site: is the usage going to possibly double in the next year, or month? Be wary of locking into a long term contract with something like a T1 or fiber connection, without having some way to upgrade.
Another question is reliability - do you need redundancy in connections? It can cost a lot up front, but there are ways to do multi-homed connections where you can balance access across a couple of links, and then just use one (albeit with reduced capacity) in the event of failure.
Another option to consider, which effectively lets you completely avoid this entire question, is to just host the application in a datacenter. You pay a relatively low monthly fee (low compared to the cost of a dedicated high-quality connection), and you get as much bandwidth as you need (eg, most hosting plans will give you something like 500GB transfer a month, to start with - and some will just give you unlimited). The datacenter is also going to be more reliable than anything you can build (short of your own 6+ figure datacenter) because they have redundant internet, power backup, redundant cooling, fire protection, physical security.. and they have people that manage all of this for you, so you never have to deal with it.

Resources