Example disk space, CPU, and/or memory monitoring configuration for Graphite

We are looking for a simple monitoring tool for basic stuff like disk space, CPU, folder sizes, memory usage.
Graphite looks promising. For a demo I want to create some example data to put into Graphite for one or more such metrics.
What is the best way to approach this? I have Graphite running in a Docker container. How do I configure Graphite and send some test data to it? For example, for:
daily disk space metrics
daily folder sizes
hourly CPU
hourly memory

Graphite is not a data collector. From graphiteapp.org:
What Graphite is and is not.
Graphite does two things:
Store numeric time-series data
Render graphs of this data on demand
Graphite is not a collection agent, but it offers the simplest path for getting your measurements into a time-series database. Feeding your metrics into Graphite couldn't be any easier.
You will need a collector; collectd and Telegraf seem to be popular choices at the moment, but there are many others, see the list of collectors. Disclosure: I contributed to both projects, so I might be biased.
Your intervals are very long; Graphite is usually used with much smaller intervals, 10s to 1m. I don't see why it wouldn't work with intervals of hours, though. Make sure to configure your storage-schemas.conf accordingly; the default setting is 1m.
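For a quick demo you do not even need a collector: Carbon accepts datapoints over its plaintext protocol (one line per datapoint: metric path, value, Unix timestamp), by default on TCP port 2003. Here is a minimal sketch in Python, assuming the container's port 2003 is published on localhost; the metric names are just illustrative:

    import socket
    import time

    # Assumed endpoint: the Graphite container's Carbon plaintext receiver,
    # published on localhost:2003 (adjust to however you mapped the port).
    CARBON_HOST = "localhost"
    CARBON_PORT = 2003

    def send_metric(path, value, timestamp=None):
        """Send one datapoint as a Carbon plaintext line: '<path> <value> <timestamp>\n'."""
        ts = int(timestamp if timestamp is not None else time.time())
        line = f"{path} {value} {ts}\n"
        with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
            sock.sendall(line.encode("ascii"))

    # Illustrative test datapoints for the metrics listed in the question.
    send_metric("demo.disk.root.free_bytes", 52_428_800_000)    # daily disk space
    send_metric("demo.folders.var_log.size_bytes", 1_234_567)   # daily folder size
    send_metric("demo.cpu.usage_percent", 23.5)                 # hourly CPU
    send_metric("demo.memory.used_bytes", 3_221_225_472)        # hourly memory

For the hourly/daily metrics you could, for example, add a section like [demo] with pattern = ^demo\. and retentions = 1h:30d,1d:5y to storage-schemas.conf before sending any data, since Whisper files are created with whatever schema matches at the time of the first write.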

Related

Capacity planning for service oriented architecture?

I have a collection of SOA components that can handle a series of business processes. For example, one SOA component imports user data and another runs analytics on it.
I'm familiar with business process modeling for manufacturing, i.e. calculating WIP, throughput, cycle times, utilization etc. for each process. Little's Law, theory of constraints, etc.
Can I apply this approach to capacity planning for my SOA architecture, or is there a more rigorous / more widely accepted approach?
This is a bit of a broad question. Here are some guidelines, but there is no single perfect answer.
What you are looking for is Business Activity Monitoring used together with performance metrics reported from your servers.
BAM (Business Activity Monitoring) will allow you to measure how many orders per second you are processing, how many sales you have made today, and so on. You then also monitor and collect information such as CPU usage, network bandwidth, disk I/O performance, memory usage and other technical performance metrics. On Windows you can use performance counters for this; in the Linux world there are various tools and techniques you can use.
Using the number of orders placed, you can then look at the performance statistics of the systems used by the order-placing software to give you some indication of what is happening.
For example: we process 10 orders a second on average, using roughly 8 GB of RAM on the ESB server where the orders service is hosted. We are seeing an average increase of 25% per month in the orders coming through. We have noticed several alerts about swapping to disk when orders are at their peak. To ensure that we can keep up with the demand we will need to double the memory on the server every 4 months. Thus in a year we will need another 3 × 8 GB = 24 GB of memory, for a total of 32 GB. Now you can decide on the implementation: do you create a cluster of 4 machines with 8 GB of RAM each, or do you load balance?
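To make the arithmetic behind this kind of estimate explicit, here is a small sketch using the illustrative figures above (8 GB baseline, 25% monthly growth). Note that a compound projection grows much faster than the simple "add 8 GB every 4 months" rule of thumb, so it is worth being clear about which model you are assuming:

    # Rough capacity projection using the illustrative figures above.
    baseline_gb = 8.0        # current RAM used by the orders service
    monthly_growth = 0.25    # 25% growth per month in orders coming through

    # Compound projection: if memory tracks order volume directly.
    for month in (3, 6, 9, 12):
        projected_gb = baseline_gb * (1 + monthly_growth) ** month
        print(f"after {month:2d} months: ~{projected_gb:.0f} GB")

    # Linear rule of thumb from the example: add another 8 GB every 4 months,
    # i.e. 3 * 8 GB = 24 GB extra over a year, 32 GB total.
    print(f"linear estimate after 12 months: {baseline_gb + 3 * 8:.0f} GB")

The compound projection lands at about 116 GB after a year versus 32 GB for the linear estimate, which is exactly why you want real measurements feeding the projection rather than a single rule of thumb.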
Using this information you can start to get a good idea of where your limits are and what you need to budget for in the future.
Go look at some BAM tools and some monitoring tools and see what suits you.

How to maximize download throughput with multi-threading

I am using multi-threading to speed up the process of downloading a bunch of files from the Web.
How might I determine how many threads I should use to maximize or nearly maximize the total download throughput?
PS:
I am using my own laptop and the bandwidth is 1Mb.
The data I want is the webpage source code of coursera.com
There are many more factors than just the number of threads if you want to speed up downloading files from the network. Actually, I don't believe you will achieve a speed-up unless there are limitations you haven't described (such as a maximum bandwidth per connection on the server side, a multi-link client where you can use different links to download different data, downloading different parts from different servers, or similar).
Under usual conditions, using multiple threads to download something will slow the process down. You will need to maintain several connections and somehow synchronise the data (except if you download, say, different files at the same time).
I would say that in "ordinary" conditions the much bigger limitation is your bandwidth, so using more threads will not make downloading faster; you will just be sharing your total bandwidth across many connections.
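If you do want to measure it empirically, a simple approach is to fetch the same set of URLs with different thread-pool sizes and compare the achieved throughput. A rough sketch, assuming plain HTTP(S) fetches of page sources; the URL list is a placeholder:

    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    # Placeholder URL list; substitute the pages you actually want to fetch.
    URLS = ["https://example.com/"] * 20

    def fetch(url):
        with urllib.request.urlopen(url, timeout=30) as resp:
            return len(resp.read())

    def throughput(num_threads):
        """Download all URLs with the given pool size and return bytes per second."""
        start = time.time()
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            total_bytes = sum(pool.map(fetch, URLS))
        return total_bytes / (time.time() - start)

    for n in (1, 2, 4, 8):
        print(f"{n} thread(s): ~{throughput(n) / 1024:.1f} KiB/s")

On a 1 Mb link you will most likely see the curve flatten after one or two threads, which is exactly the bandwidth limit described above.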

scaling statsd with multiple servers

I am laying out an architecture where we will be using statsd and Graphite. I understand how Graphite works and how a single statsd server could communicate with it. I am wondering how the architecture and setup would work for scaling out statsd servers. Would you have multiple node statsd servers and then one central statsd server pushing to Graphite? I couldn't seem to find anything about scaling out statsd, and any ideas on how to have multiple statsd servers would be appreciated.
I'm dealing with the same problem right now. Doing naive load-balancing between multiple statsds obviously doesn't work because keys with the same name would end up in different statsds and would thus be aggregated incorrectly.
But there are a couple of options for using statsd in an environment that needs to scale:
use client-side sampling for counter metrics, as described in the statsd documentation (i.e. instead of sending every event to statsd, send only every 10th event and make statsd multiply it by 10). The downside is that you need to manually set an appropriate sampling rate for each of your metrics. If you sample too few values, your results will be inaccurate. If you sample too much, you'll kill your (single) statsd instance.
build a custom load-balancer that shards by metric name to different statsds, thus circumventing the problem of broken aggregation. Each of those could write directly to Graphite.
build a statsd client that counts events locally and only sends them in aggregate to statsd. This greatly reduces the traffic going to statsd and also makes it constant (as long as you don't add more servers). As long as the period with which you send the data to statsd is much smaller than statsd's own flush period, you should also get similarly accurate results.
A variation of the last point, which I have implemented with great success in production: use a first layer of multiple (in my case local) statsds, which in turn all aggregate into one central statsd, which then talks to Graphite. The first layer of statsds would need to have a much smaller flush time than the second. To do this, you will need a statsd-to-statsd backend. Since I faced exactly this problem, I wrote one that tries to be as network-efficient as possible: https://github.com/juliusv/ne-statsd-backend
As it is, statsd was unfortunately not designed to scale in a manageable way (no, I don't see adjusting sampling rates manually as "manageable"). But the workarounds above should help if you are stuck with it.
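To illustrate the first option (client-side sampling for counters), a sampled counter is sent with an @rate suffix so statsd can scale the count back up at flush time. A minimal sketch of the statsd line protocol over UDP; the host, port and metric name are assumptions:

    import random
    import socket

    STATSD_ADDR = ("localhost", 8125)   # assumed statsd host and port
    SAMPLE_RATE = 0.1                   # send only ~10% of events

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(metric, sample_rate=SAMPLE_RATE):
        """Sampled counter: the '|@rate' suffix tells statsd to scale the count back up."""
        if random.random() < sample_rate:
            payload = f"{metric}:1|c|@{sample_rate}"
            sock.sendto(payload.encode("ascii"), STATSD_ADDR)

    # Example: count 1000 events while sending only ~100 UDP packets.
    for _ in range(1000):
        incr("myapp.events.processed")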
Most of the implementations I saw use per-server metrics, like: <env>.applications.<app>.<server>.<metric>
With this approach you can have local statsd instances on each box, do the UDP work locally, and let statsd publish its aggregates to Graphite.
If you don't really need per-server metrics, you have two choices:
Combine related metrics in the visualization layer (e.g. by configuring graphiti to do so)
Use carbon aggregation to take care of that
If you have access to a hardware load balancer like an F5 BigIP (I'd imagine there are OSS software implementations that do this) and happen to have each host's hostname in your metrics (i.e. you're counting things like "appname.servername.foo.bar.baz" and aggregating them at the Graphite level), you can use source-address-affinity load balancing: it sends all traffic from one source address to the same destination node (within a reasonable timeout). So, as long as each metric name comes from only one source host, this will achieve the desired result.

Performance test reporting with accurate TPS

I need to complete some performance tests on SOA appliances with a side cache.
I have developed a simple application to generate SOAP/HTTP traffic, but now I need some way of monitoring the application's E2E performance.
One vital metric I require is an accurate figure for transactions per second, as well as E2E response time.
I've used SoapUI and LoadUI but just do not believe the reported TPS figures, as they seem very high, e.g. > 1300 TPS.
Can anyone recommend a method to measure TPS that is "foolproof"?
I'd suggest cross checking SoapUI's numbers against the logs from your server (count the number of lines with the same second), or cross check this way:
Time the test run yourself.
Verify that the number of transactions SoapUI cites is accurate (from the logs or another measure on the server itself).
Divide the transaction count by the number of seconds.
In the past I've done this and found SoapUI to be pretty reliable.
One thing to keep in mind, in terms of whether your numbers are as good as they can be, is whether you might need to run SoapUI from more than one machine simultaneously. I suggest monitoring the CPU, memory, bandwidth, etc. on the SoapUI machine. If any of these get rather high, run the test on two machines simultaneously with very close to the same start and stop times, and then you can safely add the two TPS numbers.
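For the log-based cross-check, the idea is just to bucket log lines by second. A rough sketch in Python, assuming an access log whose lines carry a second-resolution timestamp; the file name and timestamp pattern are placeholders you would adapt to your own log format:

    import re
    from collections import Counter

    # Placeholder pattern for timestamps like "12/Mar/2024:10:15:42"; adapt to your log format.
    TS_PATTERN = re.compile(r"\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}")

    per_second = Counter()
    with open("access.log") as log:          # assumed log file name
        for line in log:
            match = TS_PATTERN.search(line)
            if match:
                per_second[match.group()] += 1

    if per_second:
        total = sum(per_second.values())
        print(f"total transactions: {total}")
        print(f"average TPS over {len(per_second)} active seconds: {total / len(per_second):.1f}")
        print(f"peak TPS: {max(per_second.values())}")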

Client and Cache configuration for Oracle coherence

I have a specific scenario in which we want to use Coherence as a distributed cache, which I will describe here.
I have 20+ standalone processes which are going to put data into the cache continuously. Their frequencies differ, though that's not a concern.
And 2 processes which will be reading data from those caches.
I don't need any underlying DB beyond what Coherence provides. Data will be written to the cache and read from the cache.
I have a 4-node cluster at my disposal (cost constraint), the Coherence cluster will be on different boxes (infra constraint), and both the cache-populating processes and the reading processes will be on different machines.
The daily peak memory size of the cache will hover around 6 GB max, the minimum being 2 GB.
The cache will hold daily data only, and I will have separate archiving processes simultaneously archiving it; the point is that for now the cache will stay at this size. Let's say I am going to keep the date out of the key equation.
I would, though, like to explore whether I can store more on those 4 nodes. Right now it's simple serialization; I could explore other binary formats. Or should I definitely do so at this cache size?
My read and write operations are fairly spread out over the day, meaning reads and writes keep happening from those 2 reading clients and 20+ writing clients; it's not as if one of them dominates. There is, though, a startup batch process in each of the background processes which pushes more to the cache than the continuous pushing afterwards, but the continuous pushing still pushes a fair amount of data.
Now my questions regarding the points above (and some confusion as well):
The biggest one: somebody told me that I can only have a limited number of connections depending on the nodes we have bought, so if it's 4, you should ideally have at most 4 connections and should therefore develop a gatekeeper kind of application and so on, even if we use TCP Extend. From my reading so far I don't think so. Is that right? The point is I don't want to go that way if it really is not a constraint.
In other words, is there a limit on connections through the Proxy Service depending on the number of nodes in the cluster?
Somewhat related to the above: at most, I am going to pay some performance penalty when pushing to the cache if I go the Extend way, right?
Partitioned cache vs. near cache, given that both read time and having the most up-to-date cache are extremely critical (the most important question I have).
I really want to see the benefit that can be obtained from going to POF instead of, say, standard serialization/Externalizable/protobuf. Can Coherence support protobuf out of the box? (Maybe for later on.)
There's no technical limitation to the number of connections a Coherence Extend proxy can support except normal network and hardware resource constraints. You will have to ask an Oracle sales person if there are licensing limitations.
There is some performance impact from using a proxy because you are adding an additional network hop (client to proxy to cluster). If you use POF serialization then the proxy does not have to serialize/deserialize values. It can just pass the object through in its serialized form. In most applications the performance impact of using a proxy is tiny because Coherence is highly optimized for network speed. You are not required to use a proxy unless your clients are .NET or C++, but there are advantages of isolating client performance from impacting the cache.
Near cache will improve retrieval performance dramatically if there are a number of frequently retrieved items for a client, since they will be found in-process.
POF offers performance improvements based on faster serialization/deserialization and more compact storage. It is always best to try with test data based on your real production data and measure the difference yourself. Coherence does not support protobuf out of the box.
