Carbon Relay not working correctly - Graphite

I've set up two physical servers.
Statsite (an alternative to StatsD) sits in front of a Graphite stack (Carbon and the Graphite webapp).
Metrics arrive at Statsite correctly via UDP, and Statsite flushes them every 10 seconds to the carbon relay (TCP port 2013).
On the Carbon server, three carbon-cache instances (a, b and c) run behind one carbon-relay using consistent-hashing.
carbon.conf contains three cache sections ([cache:a], [cache:b], [cache:c]), each listening on its own ports, and the [relay] section lists those three instances in its DESTINATIONS key. I start each carbon-cache via the Python script with --instance=a|b|c, and the carbon-relay via its own script as well. The relay log even shows that all three cache instances are connected.
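Roughly, the carbon.conf layout looks like this (the ports below are placeholders, not the real ones):

[cache:a]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7102

[cache:b]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
CACHE_QUERY_PORT = 7202

[cache:c]
LINE_RECEIVER_PORT = 2303
PICKLE_RECEIVER_PORT = 2304
CACHE_QUERY_PORT = 7302

[relay]
LINE_RECEIVER_PORT = 2013
RELAY_METHOD = consistent-hashing
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b, 127.0.0.1:2304:c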
But in the Graphite webapp, under carbon.agents.XXXXX-[a|b|c].metricsCount, all three instances show the same count rate as each other.
What I'm missing is the metrics folder for the carbon relay itself, i.e. carbon.relay.XXXX.metricsCount.
Am I doing everything right?

Sounds like you are missing a CARBON_METRIC_INTERVAL setting in the [relay] section of your carbon.conf.
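Something along these lines (the interval value is just an example):

[relay]
# ... existing relay settings ...
# publish the relay's own self-metrics (they show up under carbon.relays.<hostname>.*)
CARBON_METRIC_PREFIX = carbon
CARBON_METRIC_INTERVAL = 60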

Related

Network Partition in RabbitMQ

I am trying to analyze how the RabbitMQ partition handling strategies (pause_minority, pause_if_all_down, autoheal) work. I reproduced a network partition in a three-node cluster on GCP, but I was unable to conclude which node will be stopped when there is network jitter between them.
I used PerfTest to create a production-like load and iptables rules to create a partition between two nodes.
I created ten queues with a replication factor of 2 (a master and one slave) and used min_master for uniform distribution of the queues.
Publishing rate: 1000/sec (100/sec per queue)
Consumption rate: 1000/sec (100/sec per queue)
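The partition itself was introduced with iptables rules along these lines (the node IP is a placeholder):

# on node A: drop all traffic to/from node B (10.0.0.2 is a placeholder)
sudo iptables -A INPUT -s 10.0.0.2 -j DROP
sudo iptables -A OUTPUT -d 10.0.0.2 -j DROP

# heal the partition again by deleting the rules
sudo iptables -D INPUT -s 10.0.0.2 -j DROP
sudo iptables -D OUTPUT -d 10.0.0.2 -j DROP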
Please see the test results here for pause_minority.
Explanation: taking the first row, 8, 9, and 6 are the connections (before the partition) on nodes A, B, and C respectively. I blocked traffic between node A and node B, and the result was that node A and node B stopped running and their connections were transferred to node C.
I got different results for rows 2 and 3 (please see the linked image).
Please see the test results here for pause_if_all_down.
Note: 0 connections means no publishing and no consumption.
Please see the test results here for autoheal.
For pause_minority I read this article, in which the author explains the master/slave architecture, but I was unable to get the results described in the blog.
I am also attaching a link to the Google Sheet where I have shared my test results in detail.
Other articles that I have read are listed below:
https://www.rabbitmq.com/partitions.html
https://docs.vmware.com/en/VMware-Tanzu-RabbitMQ-for-Kubernetes/1.2/tanzu-rmq/GUID-partitions.html
Can anyone explain how the partition handling strategies decide which node will be stopped in the case of a partition? Is the decision based on the number of queues, the number of connections, or something else?
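For reference, the strategy is selected per node in rabbitmq.conf; the node name below is a placeholder:

# pick one of the strategies
cluster_partition_handling = pause_minority
# cluster_partition_handling = autoheal
# pause_if_all_down also needs the nodes to check and a recovery action:
# cluster_partition_handling = pause_if_all_down
# cluster_partition_handling.pause_if_all_down.recover = ignore
# cluster_partition_handling.pause_if_all_down.nodes.1 = rabbit@node-c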

Random delay in port binding for distributed_virtual_router and router_centralized_snat in OpenStack Neutron

I have created a private network called "Private_Network" in the range 192.168.220.0/24, plus a virtual router called "Virtual_Router" inside OpenStack which is connected to the external network. Then I connect the default gateway of "Private_Network", i.e. 192.168.220.1, to "Virtual_Router" so that all VMs connected to "Private_Network" can access the Internet via SNAT.
I use VXLAN as the overlay network and a flat provider network.
When I connect "Private_Network" to "Virtual_Router", two ports are created immediately: router_interface_distributed with the IP address 192.168.220.1 and router_centralized_snat with the IP address 192.168.220.45. However, both ports stay in the DOWN state for a long and random time, e.g. 2 hours, 45 minutes, or 20 minutes. I should mention that, rarely, the ports come UP as soon as (in less than a minute) I connect "Private_Network" to "Virtual_Router".
Please see the two ports just after creation; both are in the DOWN state.
I have searched a lot to find the main reason behind this issue. I am not convinced the server configuration is wrong, because in a few cases the two ports come up right after I connect "Private_Network" to "Virtual_Router". So I looked at the log files and noticed there are three main phases each port has to pass to reach the UP state: DHCP, port binding, and L2 provisioning. I changed the log level to DEBUG and investigated the log files in detail.
I ran the following process several times:
create a brand-new project in Horizon.
create a brand-new virtual network (called "Private_Network") in the range 192.168.220.0/24.
create a brand-new virtual router (called "Virtual_Router") connected to the external network.
connect port 192.168.220.1 (the default gateway) of "Private_Network" to "Virtual_Router".
cat /var/log/neutron/* | grep snat_port
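To see the current status and binding details of the two router ports, commands along these lines can also be used (the port ID is a placeholder):

# list the ports attached to the router and their status
openstack port list --router Virtual_Router -c ID -c "Fixed IP Addresses" -c Status

# inspect the binding details of a single port
openstack port show <port-id> -c status -c binding_vif_type -c binding_host_id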
In the more than ten cases I experimented with, Neutron got stuck at either the "port binding" phase or the "L2 provisioning" phase.
When it gets stuck at "port binding", that phase takes a random amount of time to finish, e.g. 45, 20, or 10 minutes; once "port binding" is done, "L2 provisioning" completes in less than a minute and the port state changes to UP.
However, when the "L2 provisioning" phase gets stuck, the previous two phases finish in less than a minute, but "L2 provisioning" stays stuck for hours. It is confusing to me why there is so much delay in getting the ports UP.
I would appreciate it if anybody could help me resolve this issue.

How to send 50,000 HTTP requests in a few seconds?

I want to create a load test for a feature of my app. The app uses Google App Engine and a VM. Users send HTTP requests to the App Engine service, and it is realistic for it to receive thousands of requests within a few seconds. So I want to create a load test where I send 20,000-50,000 requests in a timeframe of 1-10 seconds.
How would you solve this problem?
I started by trying Google Cloud Tasks, because it seemed perfect for this: you schedule HTTP requests for a specific point in time. The docs say there is a limit of 500 tasks per second per queue, and if you need more tasks per second you can split the tasks across multiple queues. I did this, but Google Cloud Tasks does not execute all the scheduled tasks at the given time; one queue needs 2-5 minutes to execute 500 requests that are all scheduled for the same second.
I also tried a TypeScript script that fires asynchronous node-fetch requests, but 5,000 requests take 77 seconds on my MacBook.
I don't think you can get 50,000 HTTP requests "in a few seconds" out of "your MacBook"; it's better to consider a dedicated load testing tool (which can be deployed onto a GCP virtual machine in order to minimize network latency and traffic costs).
The tool choice is up to you: either the machine type needs to be powerful enough to produce 50k requests "in a few seconds" from a single virtual machine, or the tool needs to support a clustered mode so you can kick off several machines that send the requests together at the same moment in time.
Given you mention TypeScript, you might want to try the k6 tool (it doesn't scale, though) or check out Open Source Load Testing Tools: Which One Should You Use? to see what the other options are; none of them provides a JavaScript API, but several don't require any programming knowledge at all.
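For illustration, a minimal k6 script looks roughly like this (the URL and the load shape are placeholders), run with: k6 run script.js

import http from 'k6/http';

// placeholder load shape: 500 virtual users for 10 seconds
export const options = {
  vus: 500,
  duration: '10s',
};

export default function () {
  // placeholder URL for the App Engine endpoint under test
  http.get('https://your-app.appspot.com/endpoint');
}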
A tool you could consider using is siege.
It is Linux-based, and running it inside GCP avoids any additional cost from testing from a system outside of GCP.
You could deploy siege on a relatively large machine or a few machines inside GCP.
It is fairly simple to set up, but since you mention that you need 20-50k requests in a span of a few seconds: siege by default only allows 255 concurrent connections. You can raise this limit, though, so it can fit your needs.
You would need to experiment with how many connections a machine can establish, since each machine has a certain limit based on CPU, memory, and the number of network sockets. You could keep increasing the -c number until the machine reports something like "Error: system resources exhausted". Experiment with what your virtual machine on GCP can handle.
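For example, a run along these lines (the URL is a placeholder; raise the limit directive in ~/.siege/siege.conf first if you go above 255 concurrent users):

# 500 concurrent users, benchmark mode (no delay between requests), for 10 seconds
siege -c 500 -b -t 10S https://your-app.appspot.com/endpoint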

JMS Connection Latency

I am examining an application in which a JBoss application server communicates with satellite JBoss application servers (1 main server, hundreds of satellites).
In the Windows Resource Monitor I can view the connections and see the latency per satellite: most are sub-second, but I see 10 over 1 second, of which 4 are over 2 seconds and 1 is over 4 seconds. This is a moment-in-time view, so as connections expire and are rebuilt on demand, the picture can shift. I observe that the same pair of systems has a ping latency matching what is shown in the connection list, so I suspect it is connection related (a slow pipe, congestion, or anything in the line between points A and B).
My question is what the target latency should be, keeping in mind that the satellites connect over VPN from various field sites. I use 2 seconds as the dividing line for when I ask the network team to investigate. I'd like to survey what rule of thumb others use to decide when the latency of a transient connection is starting to peak: is it over a second?

Carbon relay to aggregator and shared cache configuration

Is it possible to set up carbon-relay to forward to both a cache and an aggregator, and then have the aggregator send to the same cache?
I am trying to store aggregated data for long-term storage and machine-specific data for short-term storage. From what I can tell from the documentation, it is possible to do this with two different caches, but from an administration standpoint a single cache would simplify things.
It is indeed possible to do this.
Set up the carbon-relay to send to both the carbon-cache and the carbon-aggregator, and set up the aggregator to send to the same cache the relay uses. The aggregated stats will appear in the cache if everything is configured properly. I have all these services running on different ports with StatsD as a proxy, so I was able to make all of these changes, start up the relay and aggregator daemons, and then change the port StatsD was sending to, all with only minimal impact.
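A minimal sketch of the relevant carbon.conf sections (ports are examples, and a rules-based relay whose default rule in relay-rules.conf lists both destinations is assumed):

[cache]
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_PORT = 2004

[relay]
LINE_RECEIVER_PORT = 2013
RELAY_METHOD = rules
# all destinations the relay may send to: the cache and the aggregator (pickle ports)
DESTINATIONS = 127.0.0.1:2004, 127.0.0.1:2024

[aggregator]
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_PORT = 2024
# aggregated metrics (per aggregation-rules.conf) go to the same cache
DESTINATIONS = 127.0.0.1:2004

# relay-rules.conf: the default rule forwards every metric to both destinations
# [default]
# default = true
# destinations = 127.0.0.1:2004, 127.0.0.1:2024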
