Making Carbon in Graphite accept all data, no matter what - graphite

The Carbon listener in Graphite has been designed and tuned to make it somewhat predictable in its load on your server, to avoid flooding the server itself with IO wait or skyrocketing the system load overall. It will drop incoming data if necessary, putting server load as the priority. After all, for the typical data being stored, it's no big deal.
I appreciate all that. However, I am trying to prime a large backlog of data into graphite, from a different source, instead of pumping in live data as it happens. I have a reliable data source from a third party that comes to me in bulk, once/day.
So in this case, I don't want any data values dropped on the floor. I don't really care how long the data import takes. I just want to disable all the safety mechanisms, let carbon do its thing, and know ALL my data has made it in.
I'm searching the docs and finding all kinds of advice on tuning the parameters of carbon_cache in carbon.conf, but I can't find this. It is starting to sound more like art than science. Any help appreciated.

The first thing, of course, is to receive data through the TCP listener (line receiver) instead of UDP, to avoid losing incoming points.
There are several settings in Graphite that throttle parts of the pipeline, though it is not always clear what Graphite does when a threshold is reached. You'll have to test and/or read the carbon code.
You'll probably want to tune:
MAX_UPDATES_PER_SECOND = 500 (max number of disk updates in a second)
MAX_CREATES_PER_MINUTE = 50 (max number of metric creation per minute)
For the cache, USE_FLOW_CONTROL = True and MAX_CACHE_SIZE = inf (inf is a good value so revert to this if you changed it)
If you use a relay and/or aggregator, MAX_QUEUE_SIZE = 10000 and USE_FLOW_CONTROL = True are important.
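As a rough illustration of feeding a backlog through the TCP line receiver, here is a minimal Python sketch. It assumes carbon's plaintext protocol (`metric value timestamp` per line) on the default line receiver port 2003; the host, batch size, and pause are made-up values, and the pause is only a precaution on the sender side, not something carbon requires.

```python
import socket
import time


def format_line(metric, value, timestamp):
    """Build one plaintext-protocol line: 'metric value timestamp\\n'."""
    return f"{metric} {value} {int(timestamp)}\n"


def send_backlog(points, host="localhost", port=2003, batch=500, pause_s=0.1):
    """Send (metric, value, timestamp) tuples over one TCP connection.

    host, port, batch and pause_s are assumptions for illustration;
    adjust them to your own carbon.conf and server.
    """
    sent = 0
    with socket.create_connection((host, port)) as sock:
        for start in range(0, len(points), batch):
            chunk = points[start:start + batch]
            payload = "".join(format_line(*point) for point in chunk)
            sock.sendall(payload.encode("ascii"))
            sent += len(chunk)
            time.sleep(pause_s)  # crude pacing between batches
    return sent
```

A call such as `send_backlog([("servers.web1.load", 0.42, 1700000000)])` would send one point. Whether points are still dropped further down the pipeline depends on the settings above.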

I set this property to "inf":
MAX_CREATES_PER_MINUTE = inf
and make sure that this is infinite too:
MAX_CACHE_SIZE = inf
During the bulk load, I monitor /opt/graphite/storage/log/carbon-cache/carbon-cache-a/creates.log to make sure that the whisper DBs are being created.
To make sure, you can run the load a second time and there should be no further creations.
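Another way to double-check after the load is to look for the whisper files themselves. The sketch below assumes the default layout, where each dot in the metric name becomes a directory level under the whisper storage root and the file ends in `.wsp`; the root path used here is an assumption based on the log path above.

```python
import os

# Assumed default whisper root; adjust if your install differs.
WHISPER_ROOT = "/opt/graphite/storage/whisper"


def whisper_path(metric, root=WHISPER_ROOT):
    """Map 'a.b.c' to '<root>/a/b/c.wsp'."""
    return os.path.join(root, *metric.split(".")) + ".wsp"


def missing_metrics(metrics, root=WHISPER_ROOT):
    """Return the metric names that have no whisper file yet."""
    return [m for m in metrics if not os.path.isfile(whisper_path(m, root))]
```

An empty list from `missing_metrics(...)` only shows that the files were created, not that every individual data point was written.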

Related

Trading off between User Bandwidth and Download Interval

I am designing a non-commercial open source client app which needs to download exactly 100 KB of data from a server at a regular interval and show an alert in the client app based on changes in the data. Now I need to trade off between the user's bandwidth and the download interval.
Analysis,
If I set the interval = 1 hour. That means within 1 month app will download 30*24*100KB = 72MB.
If I set the interval = 30 mins. That means within 1 month app will download 30*48*100KB = 144MB.
And so on.
So far I am considering only the file size, whereas in practice some portion of the bandwidth will be used for control flow in addition to the data flow. For downloading a file of exactly 100 KB from the server, how much control-flow overhead should I consider in my analysis for TCP communication? Is there any guideline, reference, or research on that topic?
For example, if 10 KB is used for control flow per download, the total monthly usage at the 30-minute interval will include 30*48*10KB = 14.4MB of extra data, which needs to be identified in my analysis.
Note: (1) I am limited to analysing only the client app part. (2) No changes can be made on the server side at the moment (i.e. switching from pull-based to push-based, a partial data change API, etc. cannot be applied). (3) I am limited to downloading the file using TCP. (4) Although that much granularity is not often considered in practice, let's assume that for my case the analysis needs to be granular enough that I need to know the data vs. control bandwidth ratio.
If you are asking only for the TCP/IP part, the payload/PDU ratio is 1460/1500 for IPv4 and 1440/1500 for IPv6, assuming an MTU of 1500 bytes (sources: this already mentioned discussion, this other discussion, this other article).
I also found this really nice page that allows you to see all the header sizes for an arbitrary protocol stack and this academic paper.
However besides the protocol headers, there are more effects that reduce the bandwidth:
TCP will send additional messages, e.g. for performing a handshake when establishing the connection,
Retransmission of data may occur,
Actual frame sizes are negotiated on the lower communication layers, so TCP segments might be smaller than assumed.
In summary, this is not easy to answer precisely, because there are influences in the transmission process that are beyond your control.
Have you considered measuring the actual amount of data needed to transmit one (or more) 100KB chunk(s) of payload, rather than performing a theoretical analysis?
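For a lower bound from the header ratios alone, the arithmetic can be sketched in Python. This assumes full-size segments with a 1500-byte MTU, 40 bytes of TCP/IPv4 headers per segment (60 for IPv6), and 100 KB taken as 100,000 bytes. It deliberately ignores the handshake, ACKs, retransmissions, and link-layer framing, so real usage will be higher.

```python
import math


def wire_bytes(payload_bytes, mtu=1500, header_bytes=40):
    """Payload plus per-segment TCP/IP headers (40 = IPv4, 60 = IPv6)."""
    mss = mtu - header_bytes  # payload bytes that fit in one segment
    segments = math.ceil(payload_bytes / mss)
    return payload_bytes + segments * header_bytes


def monthly_bytes(interval_minutes, payload_bytes=100_000, days=30, **kwargs):
    """Bytes per month for one download every interval_minutes."""
    downloads = days * 24 * 60 // interval_minutes
    return downloads * wire_bytes(payload_bytes, **kwargs)


if __name__ == "__main__":
    for minutes in (60, 30):
        print(minutes, "min:", monthly_bytes(minutes) / 1e6, "MB (IPv4 headers only)")
```

Under these assumptions the header overhead stays below 3% of the payload for IPv4, which is why measuring real traffic is the more informative route.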

Best approach for transferring large data chunks over BLE

I'm new to BLE and hope you will be able to point me towards the right implementation approach.
I'm working on an application in which the (battery-operated) peripheral device continuously aggregates sensor readings.
On the mobile application side there will be a "sync" button. Upon button press, I would like to transfer all the sensor readings that were accumulated in the peripheral to the mobile application.
The maximal duration between syncs can be several days, hence the accumulated data can reach a size of 20 Kbytes.
Now, I'm wondering what will be the best approach to perform the data transfer from the peripheral to the central application.
I thought about creating an array of characteristics where each characteristic will contain a fixed amount of samples (e.g. representing 1hour of readings).
Then, upon sync, I will:
Read the characteristics count (how many 1hours cells).
Then read the characteristics (1hour cells) one by one.
However, I have no idea if this is a valid approach ?
I'm not sure if this is the most "power efficient" way that I can use.
I'm not sure if Characteristic READ is the way to go, or maybe I need to use indication instead.
Any help here will be highly appreciated :)
Thanks in advance, Moti.
I would simply use notifications.
Use one characteristic which you write something to in order to trigger the transfer start.
Then have another characteristic which you simply stream data over by sending 20 bytes at a time. Most SDKs for BLE system-on-a-chips have some way to control the flow of data so you don't send too fast. Normally by having a callback triggered when it is ready to take the next notification.
In order to know the size of the data being sent, you can for example let the first notification contain the size, and rest of them the data.
This is the most time- and power-efficient way, since many notifications can be sent per connection interval, whereas reads normally require two round trips each. Don't use indications either, since they also require basically two round trips per indication. They're also quite useless anyway.
You could possibly also increase the speed by some percent by exchanging a larger MTU (which lowers the L2CAP/ATT header overhead).
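The framing described above (size first, then the data in 20-byte notifications) can be sketched in plain Python, independent of any BLE SDK. The 4-byte little-endian size header is an assumption made for illustration; any agreed encoding would do.

```python
import struct

CHUNK = 20  # payload bytes per notification, as suggested above


def to_notifications(data, chunk=CHUNK):
    """Yield the size header, then the data in chunk-sized pieces."""
    yield struct.pack("<I", len(data))  # assumed 4-byte little-endian size
    for start in range(0, len(data), chunk):
        yield data[start:start + chunk]


def from_notifications(notifications):
    """Reassemble the payload on the central side."""
    it = iter(notifications)
    (size,) = struct.unpack("<I", next(it))
    buf = bytearray()
    for part in it:
        buf.extend(part)
        if len(buf) >= size:
            break
    if len(buf) != size:
        raise ValueError("received size does not match the announced size")
    return bytes(buf)
```

On a real peripheral, each yielded value would be handed to the SDK's notify call once its flow-control callback reports that it is ready for the next one.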

Is it possible to use the TWS/IBpy interface to collect and analyze tick data?

While searching for a template to test a paper trading strategy, I stumbled on IBPy. I have gone through the initial set-up and can connect and receive updates from the server. What I would like to do is:
a) Gather ticks from 1..n symbols when new prices (bid/asks) are published
b) Store these temporarily in a vector (I guess with vector.append((bid,ask)))
c) Once the vector reaches its computational max (I need 30 seconds or a certain number of ticks) I will compute some values on vector[] and decide whether an entry is appropriate
d) If not pop(0) and keep collecting
e) exit on a stoploss or trailing profit
My questions are:
i) I have read that updates arrive every 250 ms. That is fine for my analytics, but can the program/system keep up? Different symbols update at different times, so even if symbolA only updates every 250 ms, with 10 symbols the updates may be very frequent
ii) When I stop to make a calculation, haven't I lost updates?
If there is skeleton code for this, it would be great to mess around with it
Thanks for listening!
If you need to handle hundreds of stock symbols, you should have multiple (at least 2) threads. One thread pulls the incoming data from the socket, sorts the messages by message type and pushes the data to queues. The other threads wait for their respective queues to get some data and then process it.
The idea is that the dispatcher thread ensures that all incoming data gets pulled from the socket as fast as possible.
Generally your PC will be able to handle anything IB will be willing to send you. If your processing does not take too much time - no locks, calls to sleep(), file operations - you can do everything in a single thread.
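A minimal sketch of that dispatcher/worker split, using only the Python standard library. It does not use the IBPy API: the `ticks` iterable stands in for whatever the socket-reading thread receives, and the window length and the mid-price average are hypothetical placeholders for the real analytics.

```python
import queue
import threading
from collections import defaultdict, deque

WINDOW_TICKS = 5  # hypothetical window length per symbol


def worker(tick_queue, windows, results):
    """Consume ticks from the queue and keep a rolling window per symbol."""
    while True:
        item = tick_queue.get()
        if item is None:  # sentinel: no more ticks
            break
        symbol, bid, ask = item
        window = windows[symbol]
        window.append((bid, ask))  # deque(maxlen=...) drops the oldest tick
        if len(window) == window.maxlen:
            mid = sum((b + a) / 2 for b, a in window) / len(window)
            results.append((symbol, mid))


def run(ticks):
    """Feed ticks to a worker thread; the caller plays the dispatcher role."""
    tick_queue = queue.Queue()
    windows = defaultdict(lambda: deque(maxlen=WINDOW_TICKS))
    results = []
    thread = threading.Thread(target=worker, args=(tick_queue, windows, results))
    thread.start()
    for tick in ticks:  # stands in for pulling messages off the socket
        tick_queue.put(tick)
    tick_queue.put(None)
    thread.join()
    return results
```

Because ticks wait in the queue while the worker computes, nothing is lost during a calculation as long as the worker keeps up on average.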

Asp.net guaranteed response time

Does anybody have any hints as to how to approach writing an ASP.net app that needs to have a guaranteed response time?
When under high load that would normally cause us to exceed our desired response time, we want to throw out an appropriate number of requests, so that the rest of the requests can return before the max response time. Throwing out requests based on exceeding a fixed req/s is not viable, as there are other external factors that control response time and cause the max rps we can safely support to drift and fluctuate fairly drastically over time.
It's OK if a few requests take a little too long, but we'd like the great majority of them to meet the required response time window. We want to "throw out" the minimal or near minimal number of requests so that we can process the rest of the requests in the allotted response time.
It should account for ASP.Net queuing time, ideally the network request time but that is less important.
We'd also love to do adaptive work, like make a db call if we have plenty of time, but do some computations if we're shorter on time.
Thanks!
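One possible reading of this requirement, sketched in Python rather than ASP.NET: drop a request when its time already spent queued plus a moving estimate of the service time would exceed the deadline. The class name, the smoothing factor, and the use of an exponentially weighted average are all assumptions for illustration, not an established ASP.NET mechanism.

```python
class DeadlineShedder:
    """Decide whether to drop a request based on a service-time estimate."""

    def __init__(self, max_response_s, alpha=0.2):
        self.max_response_s = max_response_s
        self.alpha = alpha  # weight given to the newest sample
        self.est_service_s = None  # no estimate until the first sample

    def record(self, service_s):
        """Update the moving estimate with one observed service time."""
        if self.est_service_s is None:
            self.est_service_s = service_s
        else:
            self.est_service_s = (
                (1 - self.alpha) * self.est_service_s + self.alpha * service_s
            )

    def should_drop(self, queued_s):
        """True if queue time plus estimated service time misses the deadline."""
        estimate = self.est_service_s or 0.0
        return queued_s + estimate > self.max_response_s
```

Because the estimate follows observed service times, the drop threshold drifts with the external factors instead of relying on a fixed req/s limit. The same estimate could also drive the adaptive choice between a db call and a cheaper computation.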
SLAs with a guaranteed response time require a bit of work.
First off you need to spend a lot of time profiling your application. You want to understand exactly how it behaves under various load scenarios: light, medium, heavy, crushing. When doing this profiling step it is critical that it's done on the exact same hardware / software configuration that production uses. Results from one set of hardware have no bearing on results from an even slightly different set of hardware. This isn't just about the servers either; I'm talking routers, switches, cable lengths, hard drives (make/model), everything. Even BIOS revisions on the machines, RAID controllers and any other device in the loop.
While profiling make sure the types of work loads represent an actual slice of what you are going to see. Obviously there are certain load mixes which will execute faster than others.
I'm not entirely sure what you mean by "throw out an appropriate number of requests". That sounds like you want to drop those requests... which sounds wrong on a number of levels. Doing this usually kills an SLA as being an "outage".
Next, you are going to have to actively monitor your servers for load. If load levels get within a certain percentage of your max then you need to add more hardware to increase capacity.
Another thing, monitoring result times internally is only part of it. You'll need to monitor them from various external locations as well depending on where your clients are.
And that's just about your application. There are other forces at work such as your connection to the Internet. You will need multiple providers with active failover in case one goes down... Or, if possible, go with a solid cloud provider.
Yes, in the last mvcConf one of the speakers compares the performance of various view engines for ASP.NET MVC. I think it was Steven Smith's presentation that did the comparison, but I'm not 100% sure.
You have to keep in mind, however, that ASP.NET will really only play a very minor role in the performance of your app; the DB is likely to be your biggest bottleneck.
Hope the video helps.

Most bandwidth efficient unidirectional synchronise (server to multiple clients)

What is the most bandwidth efficient way to unidirectionally synchronise a list of data from one server to many clients?
I have a sizeable chunk of data (perhaps 20,000 records of 50 bytes each) which I need to periodically synchronise to a series of clients over the Internet (perhaps 10,000 clients). Records may be added, removed or updated only at the server end.
Something similar to bittorrent? Or even using bittorrent. Or maybe invent a wrapper around bittorrent.
(Assuming you pay for bandwidth on your server and not the others ...)
Ok, so we've got some detail now - perhaps 10 GB of total (uncompressed) data, every 3 days, so that's 100 GB per month.
That's actually not really a sizeable chunk of data these days. Whose bandwidth are you trying to save - yours, or your clients'?
Does the data perhaps compress very readily? For raw binary data it's not uncommon to achieve 50% compression, and if the data happens to have a lot of repeated patterns within it then 80%+ is possible.
That said, if you really do need a system that can just transfer the changes, my thoughts are:
make sure you've got a well defined primary key field - use that as your key to identify each record
record a timestamp for each record to say when it last changed
have each client tell you the timestamp of the last change it knows of, so you can calculate the deltas
ensure that full downloads are possible too, in case clients get out of sync
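The steps above can be sketched in Python with in-memory dictionaries. The record layout `(changed_at, payload)` is an assumption, and so is representing a removed record as a tombstone with `payload=None`, which the list above does not cover but which a timestamp-based delta needs in order to propagate deletions.

```python
def changes_since(records, since):
    """Server side: records changed after the client's last known timestamp.

    records maps primary key -> (changed_at, payload); payload None = removed.
    """
    return {key: rec for key, rec in records.items() if rec[0] > since}


def apply_changes(local, changes):
    """Client side: apply a delta and return the newest timestamp seen."""
    for key, (changed_at, payload) in changes.items():
        if payload is None:
            local.pop(key, None)  # tombstone: drop the record locally
        else:
            local[key] = (changed_at, payload)
    return max((rec[0] for rec in changes.values()), default=None)


def full_download(records):
    """Fallback for clients that are out of sync: all live records."""
    return {key: rec for key, rec in records.items() if rec[1] is not None}
```

A client would remember the value returned by `apply_changes` and send it as `since` on its next sync.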
