How do I tune Graphite to not drop data points?

I have a graphite instance which is taking in a little over 100k metrics per minute. Some servers have gaps in their metrics, others do not. Individual servers have some metrics with a lot of gaps, others with few or no gaps.
The gaps seem to occur on a very rhythmic basis. For example, sending CPU metrics at one per second, it would send 7, drop 3, send 7, drop 4, back and forth. When I sent them in at one every 10 seconds, it would drop one roughly every 3 minutes (18 data points).
Because the rhythm varies from metric to metric, server to server, I find it hard to blame the carbon-cache for the missing metrics. If the data were coming in too fast, why wouldn't the dropped metrics be spread evenly among all metrics? Besides, the computer isn't really even breaking a sweat, and the disk isn't all that busy, either.
My questions are:
How do I prove that the diamond collectors are (or are not) sending all the data points?
How do I get carbon to tell me that it has dropped data points (if it did)?
What do I need to measure on the graphite server to know when it is unable to take any more metrics?
UPDATE:
I learned that to prove the diamond collector is working correctly, I can run it in the foreground by typing
diamond -f -l
I also discovered that there were errors that were not making it into the logs. When I disabled the offending collectors, the gaps were reduced if not eliminated.
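One way to check the second and third questions (a rough sketch, not a drop-in script; the graphite-web URL/port and timeout are assumptions for a local install): carbon-cache reports its own counters back into Graphite under carbon.agents.<host>-<instance>.*, so you can compare what it says it received with what it committed to disk.

import requests

GRAPHITE_URL = "http://localhost:8080"   # assumption: local graphite-web

def fetch(target, minutes=30):
    # Returns the render-API JSON for one target over the last N minutes.
    resp = requests.get(
        f"{GRAPHITE_URL}/render",
        params={"target": target, "from": f"-{minutes}min", "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

received  = fetch("sumSeries(carbon.agents.*.metricsReceived)")
committed = fetch("sumSeries(carbon.agents.*.committedPoints)")
queues    = fetch("maxSeries(carbon.agents.*.cache.queues)")

# If metricsReceived matches what the collectors claim to send, but
# committedPoints lags and cache.queues keeps growing, carbon is falling
# behind (look at MAX_CACHE_SIZE / MAX_UPDATES_PER_SECOND in carbon.conf).
for name, series in (("received", received), ("committed", committed), ("queues", queues)):
    points = [v for v, _ in series[0]["datapoints"] if v is not None] if series else []
    print(name, "last:", points[-1] if points else "no data")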

Related

Limitations of using sequential IDs in Cloud Firestore

I read in a stackoverflow post that (link here)
By using predictable (e.g. sequential) IDs for documents, you increase the chance you'll hit hotspots in the backend infrastructure. This decreases the scalability of the write operations.
I would like it if someone could explain in more detail the limitations that can occur when using sequential or user-provided IDs.
Cloud Firestore scales horizontally by allocating key ranges to machines. As load increases beyond a certain threshold on a single machine, it will split the range being served by it and assign it to 2 machines.
Let's say you are just starting to write to Cloud Firestore, which means a single server is currently handling the entire range.
When you are writing new documents with random IDs and we split the range into 2, each machine will end up with roughly the same load. As load increases, we continue to split into more machines, with each one getting roughly the same load. This scales well.
When you are writing new documents with sequential IDs, if you exceed the write rate a single machine can handle, the system will try to split the range into 2. Unfortunately, one half will get no load, and the other half the full load! This doesn't scale well, as you can never get more than a single machine to handle your write load.
In the case where a single machine is running more load than it can optimally handle, we call this "hot spotting". Sequential IDs mean we cannot scale to handle more load. Incidentally, this same concept applies to index entries too, which is why we warn against sequential index values such as timestamps of the current time as well.
So, how much is too much load? We generally say 500 writes/second is what a single machine will handle, although this will naturally vary depending on a lot of factors, such as how big a document you are writing, number of transactions, etc.
With this in mind, you can see that smaller more consistent workloads aren't a problem, but if you want something that scales based on traffic, sequential document ids or index values will naturally limit you to what a single machine in the database can keep up with.
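To make the difference concrete, here is a minimal sketch using the Python client (google-cloud-firestore); the collection name and payloads are just placeholders.

from google.cloud import firestore

db = firestore.Client()

# Scales well: document() with no argument picks a random auto-generated ID,
# so consecutive writes land in different parts of the key range.
def write_random_id(payload):
    db.collection("events").document().set(payload)

# Hotspot risk: sequential IDs (counters, timestamps) keep every new write at
# the "end" of the key range, which a single backend server has to absorb.
def write_sequential_id(counter, payload):
    db.collection("events").document(f"{counter:012d}").set(payload)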

Predicting/calculating congestion in telecom network

I have an application installed on my phone which provides the following details every minute: bandwidth, packet loss, signal strength, and RTT to google.com.
I am trying to predict congestion based on these 4 attributes, but somehow the result doesn't look accurate to me; previously I only used bandwidth.
I want to predict congestion at any point more accurately, and would appreciate any recommendations.
I think you are saying you are trying to measure network 'responsiveness' and, from these measurements, get a sense of how congested the network is. You also mention you want to predict, which I guess means you want to make an estimate of the future 'responsiveness' based on your measurements and observations.
The items you are measuring look sensible, although you may want to include jitter if you are interested in VoIP or other real time streamed media.
The issue you have is that there are many variables which can affect your measurements, for example:
congestion in the radio cell you are in at the time
congestion in the backhaul network
delays in the server you are using to measure the RTT
congestion or faults with the particular APN your mobile is using to access data services
network faults
As some of these can occur irregularly but have a large impact, it is quite hard to build up an accurate view of the overall network 'responsiveness' with a single handset. For example, your local cell may be busy or have a problem while other users of Google.com in other cells have a perfectly good response, or Google.com may be busy or delayed while other users in your cell accessing a different server again have a perfectly good response.
It would likely be useful for you to look at some of the generally available web speedtest applications to see the type of information they provide - they have the advantage of being able to gather results from many thousands of users, and also generally have access to the servers to understand any issues on that side.
Depending on what you are trying to achieve, it may be that measurements from one of the general speedtest services, combined with your own measurements, will give you enough data to draw some meaningful conclusions.
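If it helps as a starting point, here is a rough sketch of deriving jitter from consecutive RTT samples and combining the four measurements into one "congestion" score; the weights and thresholds are illustrative guesses and would need calibrating against your own data.

from statistics import mean

def jitter(rtts_ms):
    # Mean absolute difference between consecutive RTT samples.
    return mean(abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])) if len(rtts_ms) > 1 else 0.0

def congestion_score(bandwidth_mbps, packet_loss_pct, signal_dbm, rtt_ms, baseline_bw=10.0):
    # Each term is scaled to roughly 0..1; higher means "looks more congested".
    bw_term     = max(0.0, 1.0 - bandwidth_mbps / baseline_bw)
    loss_term   = min(1.0, packet_loss_pct / 5.0)                 # 5% loss treated as severe
    rtt_term    = min(1.0, rtt_ms / 500.0)                        # 500 ms treated as severe
    signal_term = min(1.0, max(0.0, (-signal_dbm - 70) / 40.0))   # -70 dBm good, -110 bad
    return 0.35 * bw_term + 0.30 * loss_term + 0.25 * rtt_term + 0.10 * signal_term

print(congestion_score(bandwidth_mbps=3.2, packet_loss_pct=1.5, signal_dbm=-95, rtt_ms=180))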

divide workload on different hardware using MPI

I have a small network with computers of different hardware. Is it possible to optimize workload division between these machines using MPI, i.e. give nodes with more RAM and a better CPU more data to compute, minimizing the waiting time between different nodes before the final reduction?
Thanks!
In my program data are divided into equal-sized batches. Each node in the network will process some of them. The result of each batch will be summed up after all batches are processed.
Can you divide the work into more batches than there are processes? If so, change your program so that instead of each process receiving one batch, the master keeps sending batches to whichever node is available, for as long as there are unassigned batches. It should be a fairly easy modification, and it will make faster nodes process more data, leading to a lower overall completion time. There are further enhancements you can make, e.g. once all batches have been assigned and a fast node is available, you could take an already assigned batch away from a slow node and reassign it to said fast node. But these may not be worth the extra effort.
If you absolutely have to work with as many batches as you have nodes, then you'll have to find some way of deciding which nodes are fast and which ones are slow. Perhaps the most robust way of doing this is to assign small, equally sized test batches to each process, and have them time their own solutions. The master can then divide the real data into appropriately sized batches for each node. The biggest downside to this approach is that if the initial speed measurement is inaccurate, then your efforts at load balancing may end up doing more harm than good. Also, depending on the exact data and algorithm you're working with, runtimes with small data sets may not be indicative of runtimes with large data sets.
Yet another way would be to take thorough measurements of each node's speed (i.e. multiple runs with large data sets) in advance, and have the master balance batch sizes according to this precompiled information. The obvious complication here is that you'll somehow have to keep this registry up to date and available.
All in all, I would recommend the very first approach: divide the work into many smaller chunks, and assign chunks to whichever node is available at the moment.
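A minimal sketch of that first approach with mpi4py might look like the following; the dummy batches and process_batch() are placeholders for your real data and computation.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
TAG_WORK, TAG_STOP = 1, 2

def process_batch(batch):
    return sum(batch)          # placeholder for the real per-batch computation

if rank == 0:
    batches = [list(range(i, i + 100)) for i in range(0, 10_000, 100)]  # dummy data
    status, total, next_batch = MPI.Status(), 0, 0
    active = size - 1
    while active > 0:
        result = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == TAG_WORK:
            total += result                  # partial result from a finished batch
        if next_batch < len(batches):
            comm.send(batches[next_batch], dest=status.Get_source(), tag=TAG_WORK)
            next_batch += 1
        else:
            comm.send(None, dest=status.Get_source(), tag=TAG_STOP)
            active -= 1
    print("total:", total)
else:
    comm.send(None, dest=0, tag=0)           # announce "I'm ready for work"
    while True:
        batch = comm.recv(source=0, tag=MPI.ANY_TAG)
        if batch is None:
            break
        comm.send(process_batch(batch), dest=0, tag=TAG_WORK)

Faster nodes come back for new batches sooner, so they automatically end up processing more of them.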

Time synchronization of views generated by different instances of the game engine

I'm using the open source Torque 3D game engine for an aviation simulator project.
I need to generate a single image from several IG (image generator) PCs. Each IG display has its own view camera with a certain angle offset and gets the current position from the server via LAN.
I've already setup multi IG system.
The network connection is robust (latency is less than 1 ms).
Frame rate is good as well - about 70 FPS on each IG.
However, while moving, the whole picture looks broken because some IGs update their views faster than others.
I'm looking for a solution that will make the IGs update simultaneously, maybe some kind of precise time synchronization algorithm that makes different PCs connected via LAN act as one.
I had a much simpler problem, but my approach might help you.
You've got to run clocks on all your machines with, say, a 15 millisecond tick. Each image needs to be generated correctly for a specific tick and marked with its tick ID time. The display machine can check its own clock, determine the specific tick number (time) for which it should display, grab the images for that specific time, and display them.
(To have the right mindset to think about this, imagine your network is really bad and think about one IG delivering 1000 images ahead of the current display tick while another is 5 ticks behind. Write for this sort of system and the results will look really good on the one you have.)
Ideally you want your display running a bit behind the IGs so you always have a full set of images for the current tick. I had a client-server setup and slowed the display (client) timer down if it came close to missing updates and sped it up if it was getting too far behind. You have to synchronize all your IG machines, so it might be better to have the master clock on the display and have it send messages to speed up any IG machine that's getting behind. (You may not have the variable network delays I had, but it's best to plan for them.)
The key is that each image must be made at a particular time, that the display include only images for the time being displayed, and that the composite images appear right when they should (every 15 milliseconds, on the millisecond). Also, do not depend on your network or even your machines to do anything in a timely manner. Use feedback to keep everything synched.
Addition On Feedback:
Say the last image for the frame at time T arrives 5ms after time T by the display computer's time (real time). If you display the frame for time T at T plus 10 ms, no one will notice the lag and you'll have plenty of time to assemble the images. Using a constant (10 ms) delay might work for you, especially if you make it big enough. It may be the way to go if you always run with the exact same network.
But you are depending on all your IG machines being precisely synchronized for real time, taking no more than a certain amount of time to produce their image, and delivering their image to the display machine all in predictable lengths of time.
What I'd suggest is to have your display machine determine the delay based on the time stamps on the images it receives. It would want to increase the delay if it isn't getting the images it needs in time, and decrease it if all the IGs are running several images ahead of what the display needs. (You might want to ignore the occasional really late image. You have to decide which is more annoying: images that are out of date, a display that is running noticeably behind time, or a display that noticeably speeds up and slows down.)
In my original answer I was suggesting some kind of feedback from the display to keep the IG machines running on time, but that may be overkill: your computer's clocks are probably good enough for that.
Very generally, when any two processes have to coordinate over time, it's best if they talk to each other to stay in step (feedback) rather than each stick to a carefully timed schedule.
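As a toy sketch of that display-side feedback (not tied to any particular engine; the tick length and adjustment steps are made-up numbers):

TICK_MS = 15

class DisplayClock:
    def __init__(self, initial_delay_ms=30):
        self.delay_ms = initial_delay_ms

    def tick_to_show(self, now_ms):
        # Which tick should be composited right now, given our deliberate lag.
        return (now_ms - self.delay_ms) // TICK_MS

    def adjust(self, latest_tick_per_ig, shown_tick):
        # How far ahead of the displayed tick is the slowest IG?
        slowest_lead = min(latest_tick_per_ig.values()) - shown_tick
        if slowest_lead < 1:          # about to miss images: fall further behind
            self.delay_ms += 2
        elif slowest_lead > 4:        # everyone is far ahead: creep forward again
            self.delay_ms = max(TICK_MS, self.delay_ms - 1)

# Usage: each frame, compute shown = clock.tick_to_show(now), composite the
# images tagged with that tick, then call clock.adjust({ig: newest_tick}, shown).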

How to improve the speed of MPI_scatter/MPI_gather?

I found that the time used for MPI_scatter/MPI_gather continuously increased (roughly linearly) as the number of workers increased, especially when the workers are spread across different nodes.
I thought that MPI_scatter/MPI_gather is a parallel operation, and I wonder what leads to the increase described above. Is there any trick to make it faster, especially for workers distributed across compute nodes?
The root rank has to push a fixed amount of data to the other ranks. As long as all ranks reside on the same compute node, the process is limited by the memory bandwidth available. Once more nodes become involved, the network bandwidth, usually much lower than the memory bandwidth, becomes the limiting factor.
Also, the time to send a message roughly divides into two parts: the initial latency (network setup and MPI protocol handshake) and the time it takes to physically transfer the actual data bits. As the amount of data is fixed, the total physical transfer time remains the same (as long as the transport type, and therefore the bandwidth, stays the same), but more setup/latency overhead is added with each new rank that data is scattered to or gathered from, hence the linear increase in the time it takes to complete the operation.
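As a back-of-the-envelope illustration of that model: with a per-message setup cost alpha and a per-byte cost beta, a scatter performed by the root as P-1 point-to-point sends costs roughly (P-1)*alpha + beta*total_bytes, which grows linearly in P while the data term stays fixed. The alpha/beta values in this sketch are assumed, plausible orders of magnitude, not measurements.

ALPHA_S = 2e-5           # assumed ~20 microseconds of setup per message
BETA_S_PER_BYTE = 1e-10  # assumed ~10 GB/s effective bandwidth

def scatter_time(num_ranks, total_bytes):
    # Root sends num_ranks - 1 messages; the total bytes moved stay the same.
    return (num_ranks - 1) * ALPHA_S + BETA_S_PER_BYTE * total_bytes

for p in (2, 8, 32, 128):
    print(p, "ranks:", f"{scatter_time(p, 1024 * 1024) * 1e3:.3f} ms")
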
How an MPI_Scatter/Gather will work varies between implementations. Some MPI implementations may choose to use a series of MPI_Send as an underlying mechanism.
The parameters that may affect how MPI_Scatter works are:
1. Number of processes
2. Size of data
3. Interconnect
For example, an implementation may avoid using a broadcast when a very small number of ranks are sending/receiving very large amounts of data.
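If you want to see the effect directly, here is a small mpi4py sketch that times a scatter of a fixed total payload; launch it with different rank counts (mpirun -n 2/4/8 ...) and compare. The payload size is arbitrary and assumed to be evenly divisible by the rank count.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

TOTAL = 8 * 1024 * 1024                    # total doubles scattered from the root
chunk = TOTAL // size                      # assumes size divides TOTAL evenly
sendbuf = np.arange(TOTAL, dtype="d") if rank == 0 else None
recvbuf = np.empty(chunk, dtype="d")

comm.Barrier()
t0 = MPI.Wtime()
comm.Scatter(sendbuf, recvbuf, root=0)
comm.Barrier()
if rank == 0:
    print(f"{size} ranks: scatter took {(MPI.Wtime() - t0) * 1e3:.2f} ms")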
