I want to ask about the Graphite carbon daemons.
https://graphite.readthedocs.org/en/latest/carbon-daemons.html
I would like to ask: while running carbon-relay.py, should I also run carbon-cache.py, or is the relay alone enough?
Regards
Murtaza
Carbon relay is used when you set up a cluster of Graphite instances; a carbon cache, on the other hand, does not need a cluster.
Regarding carbon cache: write operations are expensive, so Graphite keeps collected data in a cache. The Graphite webapp reads from that cache, cluster or not, so it can display the most recent data recorded into Graphite regardless of whether it has been written to disk yet.
Hope this answers your question.
Carbon-relay only forwards data to one or more destinations, so it is needed only if you want to fork data to several points. Example schemes can be:
save locally and resend to another node (cache or temporary-storage and relay)
resend all data into multiple remote daemons (multiple remote storages)
save all data in multiple local daemons (parallel storage & redundancy)
save different data sets in multiple local daemons (performance)
... other cases ...
So:
if you need to store data locally, you have to use carbon-cache;
if you need to fork the data flow on the node, you have to use carbon-relay, either in front of or instead of carbon-cache.
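For the first scheme (store locally and resend to another node), a minimal carbon.conf sketch could look like the following. Ports are the stock Graphite defaults; 192.0.2.10 is just a placeholder for the remote node, and clients would send to the relay's line port (2013) rather than the cache's (2003).

    [cache]
    LINE_RECEIVER_PORT = 2003
    PICKLE_RECEIVER_PORT = 2004

    [relay]
    LINE_RECEIVER_PORT = 2013
    PICKLE_RECEIVER_PORT = 2014
    RELAY_METHOD = consistent-hashing
    # with two destinations and a replication factor of 2, every metric
    # goes both to the local cache and to the remote node
    REPLICATION_FACTOR = 2
    DESTINATIONS = 127.0.0.1:2004:a, 192.0.2.10:2004:a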
I have data that needs to be downloaded to a local server every 24 hours. For high availability we provisioned 2 servers to avoid failures and losing data.
My question is: What is the best approach to use the 2 servers?
My ideas are:
-Download on one server, and only if that download fails for any reason, continue the download on the other server.
-Download on both servers at the same time every day.
Any advice?
In terms of your high-level approach, break it down into manageable chunks, i.e. reliable data acquisition and highly available data dissemination. I would start with the second part first, because that's the state you want to get to.
Highly available data dissemination
Working backwards (i.e. this is the second part of your problem), when offering highly-available data to consumers you have two options:
Active-Passive
Active-Active
Active-Active means you have at least two nodes servicing requests for the data, with some kind of Load Balancer (LB) in front, which allocates the requests. Depending on the technology you are using there may be existing components/solutions in that tech stack, or reference models describing potential solutions.
Active-Passive means you have one node taking all the traffic, and when that becomes unavailable requests are directed at the stand-by / passive node.
The passive node can be "hot", ready to go, or "cold", meaning it's not fully operational but is relatively fast and easy to stand up and start taking traffic.
In both cases, and if you have only 2 nodes, you ideally want both the nodes to be capable of handling the entire load. That's obvious for Active-Passive, but it also applies to active-active, so that if one goes down the other will successfully handle all requests.
In both cases you need some kind of network component that routes the traffic. Ideally it will be able to operate autonomously (it will have to if you want active-active load sharing), but you could have a manual / alert-based process for switching from active to passive. Much of this will depend on your non-functional requirements.
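Purely as an illustration (HAProxy is just one common open-source load balancer, picked here as an example), an active-passive pair could be sketched like this; dropping the backup keyword turns it into active-active load sharing. The addresses and the /health endpoint are assumptions.

    frontend data_frontend
        bind *:80
        default_backend data_servers

    backend data_servers
        balance roundrobin
        option httpchk GET /health                 # assumes the nodes expose a health check
        server node_a 10.0.0.1:8080 check
        server node_b 10.0.0.2:8080 check backup   # passive node, only used if node_a fails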
Reliable data acquisition
Having figured out how you will disseminate the data, you know where you need to get it to.
E.g. if active-active, you need to get it to both at the same time (I don't know what tolerances you can have), since you want them to serve the same consistent data. One option to get around that issue is this:
Have the LB route all traffic to node A.
Node B performs the download.
The LB is informed that Node B successfully got the new data and is ready to serve it. LB then switches the traffic flow to just Node B.
Node A gets the updated data (perhaps from Node B, so the data is guaranteed to be the same).
The LB is informed that Node A successfully got the new data and is ready to serve it. LB then allows the traffic flow to Nodes A & B.
This pattern would also work for active-passive:
Node A is the active node, B is the passive node.
Node B downloads the new data, and is ready to serve it.
Node A gets updated with the new data (probably from node B), to ensure consistency.
Node A serves the new data.
You get the data on the passive node first so that if node A went down, node B would already have the new data. Admittedly the time-window for that to happen should be quite small.
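A rough sketch of that active-passive refresh as a script; every hostname, URL and path here is hypothetical, and in practice steps 2 and 3 might just as well be rsync jobs or calls to your load balancer's API.

    import subprocess

    ACTIVE = "node-a.internal"     # hypothetical hostnames
    PASSIVE = "node-b.internal"
    SOURCE = "https://provider.example.com/export.csv"   # placeholder download URL

    def run(host, cmd):
        # run a command on a node over ssh, raising if it fails
        subprocess.run(["ssh", host, cmd], check=True)

    # 1. The passive node downloads the new data set first.
    run(PASSIVE, f"curl -fsS -o /data/latest.csv {SOURCE}")

    # 2. The active node copies the data from the passive node, so both hold the same bytes.
    run(ACTIVE, f"rsync -a {PASSIVE}:/data/latest.csv /data/latest.csv")

    # 3. The active node switches to serving the new file (atomic symlink swap).
    run(ACTIVE, "ln -sfn /data/latest.csv /data/current.csv")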
I have a Java EE application that resides on multiple servers over multiple sites.
Each instance of the application produces logs locally.
The Java EE application also communicates with IBM Mainframe CICS applications via SOAP/HTTP.
These CICS applications execute in multiple CICS regions over multiple mainframe LPARS over multiple sites.
Like the Java EE application, the CICS applications produce logs locally.
Attempting to troubleshoot issues is extremely time consuming. This entails support staff having to manually log onto UNIX servers and/or mainframe LPARs, tracking down all related logs for a particular issue.
One solution we are looking at is to create a single point that collects all distributed logs from both UNIX and Mainframe.
Another area we are looking at is whether or not it's possible to drive client traffic to designated Java EE servers and IBM mainframe LPARs, right down to a particular application server node and a single IBM CICS region.
We would only want to do this for "synthetic" client calls, e.g. calls generated by our support staff, not "real" customer traffic.
Is this possible?
So for example say we had 10 UNIX servers distributed over two geographical sites as follows:-
Geo One: UNIX_1, UNIX_3, UNIX_5, UNIX_7, UNIX_9
Geo Two: UNIX_2, UNIX_4, UNIX_6, UNIX_8, UNIX_0
Four IBM mainframe LPARs over two geographical sites as follows:-
Geo One: lpar_a, lpar_c
Geo Two: lpar_b, lpar_d
each LPAR has 8 CICS regions:
cicsa_1, cicsa_2... cicsa_8
cicsb_1, cicsb_2... cicsb_8
cicsc_1, cicsc_2... cicsc_8
cicsd_1, cicsd_2... cicsd_8
we would want to target a single route for our synthetic traffic of
unix_5 > lpar_b > cicsb_6
This way we would know where to look for the log output on all platforms.
UPDATE - 0001
By "synthetic traffic" I mean that our support staff would make client calls to our back end API's instead of "Real" front end users.
If our support staff could specify the exact route these synthetic calls traversed, they would know exactly which log files to search at each step.
These log files are very large, tens of MB each, and there are many of them.
For example, one of our applications runs on 64 physical UNIX servers, split across 2 geographical locations. Each UNIX server hosts multiple application server nodes, each node produces multiple log files, and each of these log files is 10 MB+. The log files roll over, so log output can be lost very quickly.
"One solution we are looking at is to create a single point that collects all distributed logs from both UNIX and Mainframe."
I believe collecting all logs into a single point is the way to go. When the log files roll over, perhaps you could SFTP them to your single point as part of that rolling over process. Or use NFS mounts to copy them.
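As a hedged sketch of the SFTP idea, a rollover hook on each server could push the just-rolled file to the collection point with something like this (paramiko-based; the host, user and paths are placeholders):

    import os
    import paramiko

    LOG_HOST = "logcollector.example.com"   # placeholder central collection server
    LOG_USER = "logship"

    def ship_rolled_log(local_path):
        """SFTP a rolled-over log file to the central collection point."""
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(LOG_HOST, username=LOG_USER, key_filename="/home/app/.ssh/id_rsa")
        try:
            sftp = client.open_sftp()
            # prefix with the local hostname so files from different servers don't collide
            remote_name = "/logs/incoming/%s_%s" % (os.uname().nodename, os.path.basename(local_path))
            sftp.put(local_path, remote_name)
            sftp.close()
        finally:
            client.close()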
I think you can make your synthetic traffic solution work, but I'm not sure what it accomplishes.
You could have your Java applications send to a synthetic URL, which is mapped by DNS to a single CICS region containing a synthetic WEBSERVICE definition, synthetic PIPELINE definition, and a synthetic URIMAP definition which in turn maps to a synthetic transaction which is defined to run locally. The local part of the definition should keep it from being routed to another CICS region in the CICSPlex.
In order to get the synthetic URIMAP you would have to run your WSDL through the IBM tooling (DFHWS2LS or DFHLS2WS) with a URI control card indicating your synthetic URL. You would also use the TRANSACTION control card to point to your synthetic transaction defined to run locally.
I think this is seriously twisting the CICS definitions such that it barely resembles your non-synthetic environment - and that's provided it would work at all, I am not a CICS Systems Programmer and yours might read this and conclude my sanity is in question. Your auditors, on the other hand, may simply ask for my head on a platter.
All of the extra definitions are needed (IMHO) to defeat the function of the CICSPlex, which is to load balance incoming requests, sending them to the CICS region that is best able to service them. You need some requests to go to a specific region, short-circuiting all the load balancing being done for you.
What is the right approach to use to configure OpenSplice DDS to support 100,000 or more nodes?
Can I use a hierarchical naming scheme for partition names, so "headquarters.city.location_guid_xxx" would prevent packets from leaving a location, and "company.city*" would allow samples to align across a city, and so on? Or would all the nodes know about all these partitions just in case they wanted to publish to them?
The durability services will choose a master when they come up. If one durability service is running on a Raspberry Pi in a remote location over a 3G link, what is to prevent it from trying to become the master for "headquarters" and crashing?
I am experimenting with durability settings such that on a remote node I use location_guid_xxx, but for the "headquarters" cloud server I use "Headquarters".
On the remote client I might do this:
<Merge scope="Headquarters" type="Ignore"/>
<Merge scope="location_guid_xxx" type="Merge"/>
so a location won't be master for the universe, but can a durability service within a location still be master for that location?
If I have 100,000 locations does this mean I have to have all of them listed in the "Merge scope" in the ospl.xml file located at headquarters? I would think this alone might limit the size of the network I can handle.
I am assuming that this product will handle this sort of Internet of Things scenario. Has anyone else tried it?
Considering the scale of your system, I think you should seriously consider the use of Vortex Cloud (see these slides: http://slidesha.re/1qMVPrq). Vortex Cloud will allow you to scale your system better as well as deal with NATs/firewalls. Besides that, you'll be able to use TCP/IP to communicate from your Raspberry Pi to the cloud instance, thus avoiding any problems related to NATs/firewalls.
Before getting to your durability question, there is something else I'd like to point out. If you try to build a flat system with 100K nodes, you'll generate quite a bit of discovery information. Besides generating traffic, this will take up memory in your end applications. If you use Vortex Cloud instead, we play tricks to limit the discovery information. To give you an example, if you have a data writer matching 100K data readers, with Vortex Cloud the data writer would only match one endpoint, reducing the discovery information by a factor of 100K!
Finally, concerning your durability question, you could configure some durability service as alignee only. In that case they will never become master.
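Something along these lines in ospl.xml (only a sketch; do check the exact attribute names against your OpenSplice version's documentation) would make a remote node an alignee that can never become master:

    <DurabilityService name="durability">
      <NameSpaces>
        <NameSpace name="localData">
          <Partition>location_guid_xxx</Partition>
        </NameSpace>
        <!-- aligner="false" makes this durability service an alignee only -->
        <Policy nameSpace="localData" durability="Durable" alignee="Initial" aligner="false">
          <Merge scope="location_guid_xxx" type="Merge"/>
        </Policy>
      </NameSpaces>
    </DurabilityService>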
HTH.
A+
Is it possible to setup carbon relay to forward to a cache and an aggregator and then have the aggregator send to the same cache?
I am trying to store aggregate data for long term storage and machine specific data for short term storage. From what I can tell documentation wise it is possible to do this with two different caches, but from an administration standpoint using a single cache would simplify things.
It is indeed possible to do this.
Set up the carbon relay to send to both the carbon cache and the carbon aggregator. Set up the aggregator to send to the same cache the relay does. If everything is configured properly, the aggregated stats will appear in the cache. I have all these services set up to run on different ports, with statsd as a proxy, so I was able to make all of these changes, start up the relay and aggregation daemons, and then change the port statsd was sending to, all with only minimal impact.
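Using the stock default ports, the wiring looks roughly like the sketch below (a sketch, not a complete config; clients, or statsd in my case, send to the relay on port 2013):

    # carbon.conf
    [cache]
    LINE_RECEIVER_PORT = 2003
    PICKLE_RECEIVER_PORT = 2004

    [relay]
    LINE_RECEIVER_PORT = 2013
    RELAY_METHOD = rules
    # must list every destination used in relay-rules.conf
    DESTINATIONS = 127.0.0.1:2004:a, 127.0.0.1:2024:b

    [aggregator]
    LINE_RECEIVER_PORT = 2023
    PICKLE_RECEIVER_PORT = 2024
    # aggregated series go back into the same cache
    DESTINATIONS = 127.0.0.1:2004:a

    # relay-rules.conf: the default rule fans every metric out to both
    # the cache and the aggregator
    [default]
    default = true
    destinations = 127.0.0.1:2004:a, 127.0.0.1:2024:b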
I am laying out an architecture where we will be using statsd and graphite. I understand how graphite works and how a single statsd server could communicate with it. I am wondering how the architecture and set up would work for scaling out statsd servers. Would you have multiple node statsd servers and then one central statsd server pushing to graphite? I couldn't seem to find anything about scaling out statsd and any ideas of how to have multiple statsd servers would be appreciated.
I'm dealing with the same problem right now. Doing naive load-balancing between multiple statsds obviously doesn't work because keys with the same name would end up in different statsds and would thus be aggregated incorrectly.
But there are a couple of options for using statsd in an environment that needs to scale:
use client-side sampling for counter metrics, as described in the statsd documentation (i.e. instead of sending every event to statsd, send only every 10th event and have statsd multiply it by 10; a sketch follows at the end of this answer). The downside is that you need to manually set an appropriate sampling rate for each of your metrics. If you sample too few values, your results will be inaccurate; if you sample too much, you'll kill your (single) statsd instance.
build a custom load-balancer that shards by metric name to different statsds, thus circumventing the problem of broken aggregation. Each of those could write directly to Graphite.
build a statsd client that counts events locally and only sends them in aggregate to statsd. This greatly reduces the traffic going to statsd and also makes it constant (as long as you don't add more servers). As long as the period with which you send the data to statsd is much smaller than statsd's own flush period, you should also get similarly accurate results.
variation of the last point that I have implemented with great success in production: use a first layer of multiple (in my case local) statsds, which in turn all aggregate into one central statsd, which then talks to Graphite. The first layer of statsds would need to have a much smaller flush time than the second. To do this, you will need a statsd-to-statsd backend. Since I faced exactly this problem, I wrote one that tries to be as network-efficient as possible: https://github.com/juliusv/ne-statsd-backend
As it is, statsd was unfortunately not designed to scale in a manageable way (no, I don't see adjusting sampling rates manually as "manageable"). But the workarounds above should help if you are stuck with it.
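As a small illustration of the first workaround, client-side sampling with the Python statsd client looks roughly like this (assuming that particular client; most statsd clients expose an equivalent sample-rate parameter):

    from statsd import StatsClient

    statsd = StatsClient("localhost", 8125)   # host/port of the single statsd instance

    # Send only ~10% of the increments over the wire; statsd sees the @0.1
    # sample rate on each packet and scales the counter back up at flush time.
    statsd.incr("frontend.requests", rate=0.1)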
Most of the implementations I saw use per server metrics, like: <env>.applications.<app>.<server>.<metric>
With this approach you can have local statsd instances on each box, do the UDP work locally, and let statsd publish its aggregates to graphite.
If you don't really need per-server metrics, you have two choices:
Combine related metrics in the visualization layer (e.g. by configuring Graphiti to do so)
Use carbon aggregation to take care of that
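For the carbon aggregation option, a minimal aggregation-rules.conf sketch, assuming the <env>.applications.<app>.<server>.<metric> layout above and a hypothetical requests counter:

    # aggregation-rules.conf: output_template (frequency) = method input_pattern
    <env>.applications.<app>.all.requests (60) = sum <env>.applications.<app>.*.requests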
If you have access to a hardware load balancer like an F5 BigIP (I'd imagine there are OSS software implementations that do this too), and you happen to have each host's hostname in your metrics (i.e. you're counting things like "appname.servername.foo.bar.baz" and aggregating them at the Graphite level), you can use source-address affinity load balancing: it sends all traffic from one source address to the same destination node (within a reasonable timeout). So, as long as each metric name comes from only one source host, this will achieve the desired result.