I have a Graphite relay and webapp installed on one server, which is supposed to communicate with 4 carbon-caches (and their respective webapps) on 4 other servers. I've validated that the relay is working by observing that different whisper files are being updated on the different carbon-cache servers.
However, the webapp only shows metrics that are stored on the first carbon-cache server in the list, and I'm not sure what else to look at.
The webapps on the carbon-cache servers are set up to listen on port 81, and I have the following in local_settings.py on the relay server (the one I'm pointing my browser at):
CLUSTER_SERVERS = ["graphite-storage1.mydomain.com:81", "graphite-storage2.mydomain.com:81", "graphite-storage3.mydomain.com:81", "graphite-storage4.mydomain.com:81", ]
However, at one point I did have all metrics on all servers: I migrated from a single instance to this federated cluster, and I've since removed the whisper files that weren't active on each carbon-cache server. I've restarted all carbon-caches, the carbon-relay, and the webapp server several times. Is the metrics-->carbon-cache mapping getting cached somewhere? Have I missed a setting somewhere?
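One place the mapping can effectively get cached: if MEMCACHE_HOSTS is set in local_settings.py, graphite-web stores metric-find and render results in memcached, and those entries survive webapp and carbon restarts until they expire or memcached is flushed. A sketch of the relevant settings (the local memcached address here is an assumption):

# alongside CLUSTER_SERVERS in local_settings.py on the webapp server
MEMCACHE_HOSTS = ["127.0.0.1:11211"]  # flush memcached to drop stale find results
DEFAULT_CACHE_DURATION = 60           # cache TTL in seconds

Also worth checking: each storage server's webapp should generally serve only its own local whisper data, i.e. it should not list the other storage servers in its own CLUSTER_SERVERS, or federated finds can shadow one another.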
We are running a Next.js server as a service on a Kubernetes cluster, with a minimum of two replicas. So in a normal situation, we have these:
our-nextjs-server-prod-bd7c6dc4c-2dlqg 1/1 Running 0 18h
our-nextjs-server-prod-bd7c6dc4c-7dkbp 1/1 Running 0 18h
When the first server is hit for a page it hasn't cached yet, it will render it, store it in the host node's volume, and serve it from there on subsequent calls. Now, if the second server is hit for the same page but is hosted on a different node, it will, as I understand it, have to re-generate the page, since the page doesn't exist on its own node's volume.
Is there a way to have multiple Next.js pods on different nodes utilize a common resource to cache pages? A common volume, or an external resource like Redis, perhaps? Is there a best practice around that requirement?
For the moment, let's disregard the CDN in front of the Next.js service caching the results for a certain TTL. We need those Next.js pods hit frequently so that they can ping the application server for changed properties that will trigger a re-build of the page.
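For illustration, the behavior being asked about is essentially cache-aside against a store every pod can reach (Next.js also supports plugging in a custom cache handler for its incremental cache, which is the idiomatic place to wire this in). A minimal sketch of the pattern, in Python for brevity, where the Redis address, key scheme, and TTL are all assumptions:

import redis  # assumption: one Redis instance reachable from every pod

r = redis.Redis(host="redis.internal", port=6379)  # hypothetical address

def get_page(path, render, ttl=300):
    # Cache-aside: whichever pod rendered the page once shares it with all pods.
    cached = r.get(path)
    if cached is not None:
        return cached.decode()
    html = render(path)        # the expensive per-page build
    r.set(path, html, ex=ttl)  # now every node can serve it from the shared cache
    return html

A shared volume (e.g. an NFS-backed PersistentVolume mounted by all pods) is the other common route, at the cost of dealing with file access across nodes.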
I am having very strange network problems. I am on a domain where a few servers are located on a different subnet. I can ping these servers, look them up in DNS, and Remote Desktop to them by IP address. However, I cannot find them when using:
net view \\server
or when I try to access them via Windows Explorer.
The person next to me, who has an identical machine and is on the same subnet, has no problems; as a matter of fact, I am the only one in a 50-person company having this problem!
This wouldn't be so much of a problem except for the fact that my machine cannot use web services located on these servers, via either HTTP or NET.TCP.
After trying everything I could find on the internet and some more (added a new network card, reset policies, etc.), I finally used Wireshark to see what is going on. When doing net view \\server, I noticed that the server never responds to the "Session Setup Request", although it does respond to the "Negotiate Protocol Request". So what could possibly cause the server to never respond to the Session Setup Request?
Here is the server-side capture (not the same session):
OK, I found out what this was by comparing my TCP/IP registry settings (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters) with a machine that worked. What I noticed is that I had the following two entries:
EnablePMTUBHDetect 0
EnablePMTUDiscovery 1
but the other machine didn't. By deleting these entries, everything started working!
This is very strange, however, because these happen to be the default values for these registry keys, so I do not understand why having the entries present caused such a problem.
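For anyone else who hits this: the entries can be backed up and removed from an elevated command prompt with the standard reg.exe commands (a reboot may be needed for Tcpip parameter changes to take effect):

reg export "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" tcpip-backup.reg
reg delete "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v EnablePMTUBHDetect /f
reg delete "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v EnablePMTUDiscovery /f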
I am new to the Graphite monitoring tool and have a question about this setup. I have two servers; one is treated as the host server (with Graphite, collectd, StatsD, and Grafana installed), and Grafana displays all of its metrics. On the second server I have installed Graphite and collectd. Now I need to send the second server's collectd information to the first (host) server, and those metrics need to be displayed on the web using Grafana...
Could you please suggest a plugin, or any other way, to set up this configuration?
Thanks.
You don't actually need Graphite on the second host; you can just configure collectd on that host to write to Graphite (actually the carbon ingest API) on the first host.
https://collectd.org/documentation/manpages/collectd.conf.5.shtml#plugin_write_graphite
If you do want to have graphite on both servers for some reason, you can use multiple Node entries in your collectd config to have it send metrics to both graphite instances.
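As a sketch, the collectd.conf on the second host would look something like this (hostnames are examples; the full option list is in the manpage linked above). The second Node block is only needed if you keep Graphite on both servers:

LoadPlugin write_graphite
<Plugin write_graphite>
  <Node "primary">
    Host "graphite-host-1.example.com"  # the first (host) server
    Port "2003"                         # carbon's plaintext ingest port
    Protocol "tcp"
    Prefix "collectd."
  </Node>
  <Node "secondary">
    Host "graphite-host-2.example.com"  # optional: the local Graphite instance
    Port "2003"
    Protocol "tcp"
    Prefix "collectd."
  </Node>
</Plugin>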
We have recently deployed a new ASP.NET site to our web servers to replace our old Classic ASP site (both servers are Windows 2008 R2 running IIS 7.5). They are hosted behind a load balancer.
This one .NET WebForms application is used by approximately 30 clients, each with their own URL (client1.mysite.biz, client2.mysite.biz, etc.).
Our original plan was to deploy our new application into 3 "WebSites", each with their own app pool, and BIND the clients to the relevant WebSite.
When binding, we bound each URL to both HTTP and HTTPS (we have certificates for each of the sites).
INITIAL PROBLEM:
We noticed that after we bound more than half the sites and tested, we were suddenly greeted with "Service Unavailable. Service is Temporarily Unavailable" (NO NUMBER, just the words) every time. We unbound everything and tried again (meticulously testing each time we bound a site). Each time, after binding a certain number of sites, the same thing happened.
We ran out of downtime and went to Plan B: we put the whole thing in the "Default Website" as a virtual directory with no bindings (this is how the Classic ASP site was set up).
OUR PROBLEM NOW:
Occasionally we get the same dreaded white screen with "Service Unavailable. Service is Temporarily Unavailable" (NO NUMBER, just the words).
It seems to happen randomly (not load- or time-dependent, as far as we can tell). If the request uses AJAX, it is simply caught in the "Error" portion of the AJAX code, but I believe it is the same problem. The error occurs INSTANTLY when it does happen. If the user attempts to repeat the action that caused the problem, everything is fine (they are not logged out, and they proceed on their way).
However, this is happening MULTIPLE times a day, and it's across ALL of our sites (not just this new one).
One more item of great importance: this appears to be happening to ALL of our sites (virtual directories and custom WebSites on BOTH of our web servers). That seems to rule out a "bad" server (both are in the cloud, did I mention?), and it also "seems" to rule out app pool settings, but what do I know?
About our IIS servers: we have multiple application pools running multiple different instances of websites (different code). Some are testing sites; some use Classic ASP and others ASP.NET.
What we've tried: we scoured the web looking for answers and have edited our machine.config file to increase all manner of things such as threads, max connections, etc. We've edited our app pool settings, increasing our queue length and turning on ALL the logs.
Has anyone seen anything like this before? My theory is that it has something to do with the bindings, and that the frequency of the error increases with each binding I add, but that is difficult to test when it happens on my production servers only.
We have finally solved this problem. As mentioned previously, we noticed that the IIS logs contained an sc-win32-status 64 error when (and only when) we experienced the Service Unavailable problem in the browser while our site was using the load balancer.
To help look into this further, we did a network capture of the traffic on the Load Balancer while testing. We reproduced the random Service Unavailable problem, saw the associated win32-status 64 error in the IIS logs, and identified the specific packet of traffic on the network capture for this event.
Using Wireshark, we followed the TCP stream and noticed that the TCP connection was reset by the Load Balancer immediately after this packet. We reproduced the problem three times and every time there was a TCP reset immediately afterwards.
Walking backwards through the TCP stream, we noticed in all three instances a packet for HTTP/1.1 200 (application/octet-stream), and prior to that a request to download a document (i.e. a .pdf, .xlsx, or .docx) from one of our sites. The server that contains all our documents is not a web server and does not have the IIS role active. The document server has no way to define the content/media type for the document being downloaded, hence the generic application/octet-stream in the network capture. The load balancer treated the request for a document as potentially malicious and decided to reset the TCP connection when another request was made. To fix the problem, we added a content-type library function to our application, using this post as a guide. Sorted!
In Summary:
A document was requested from our document server via our web application
The document was sent back to the user with a generic content type = application/octet-stream
The Load Balancer flagged this activity as potentially malicious
Another request within this TCP connection was made
The Load Balancer reset the TCP connection
This resulted in a Service Unavailable
Lesson Learned:
Always define your content/media types if you are serving content from a non-web server, or from a web server running an IIS version less than 7 (heaven forbid).
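The fix itself is small in any language: a lookup from file extension to the proper media type, falling back to application/octet-stream only as a last resort. A sketch of the idea (shown in Python for illustration; our actual fix was the equivalent function inside the ASP.NET application, per the post we followed):

import mimetypes
import os

# Explicit entries for the document formats that bit us; stdlib tables
# don't always include the Office types.
EXTRA_TYPES = {
    ".pdf": "application/pdf",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}

def content_type_for(filename):
    # Return a proper Content-Type for a document download.
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTRA_TYPES:
        return EXTRA_TYPES[ext]
    guessed, _ = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"  # last resort only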
A UC certificate was originally meant for Microsoft Exchange, but it can also be used to cover multiple domains. We use one, and it covers about 60+ domains (actually 4 or 5 domains with lots of subdomains). We also apply the certificate to a load balancer and two web servers, and we have multiple sites. As far as I can tell, the certificates operate as expected: you can view it from any of the 60+ domains. One odd thing about our setup is that in the IIS UI you can't bind the same certificate to more than one site, so we had to use the appcmd command-line interface to bind multiple sites to the same certificate.
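For anyone who needs that workaround, the bindings can be added with appcmd along these lines (site names and host headers are examples; because every site shares the same *:443 endpoint, the certificate bound to that endpoint is served for all of them):

appcmd set site /site.name:"Site1" /+bindings.[protocol='https',bindingInformation='*:443:client1.mysite.biz']
appcmd set site /site.name:"Site2" /+bindings.[protocol='https',bindingInformation='*:443:client2.mysite.biz']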
After looking more closely at our IIS logs, it appears that there is indeed something that coincides with this behavior. We get an error of 200 0 64, which is sc-win32-status 64: "the specified network name is no longer available".
Our 2 IIS servers are hosted in the cloud on SunGard, and we are using a load balancer that they set up for us. Our theory was that the load balancer was "losing" the user's proper session ID when this 64 error occurred and then had no idea where the session was supposed to be.
We ran some controlled tests. One group we took OFF the load balancer and sent directly to one of the servers; another group used the load balancer but made sure to connect to the same server. Both teams tried to reproduce the error (which is to say, we clicked a popup on the site over and over).
The results were interesting. The group that was NOT on the load balancer NEVER received the "Service Unavailable" error, BUT the logs indicated they had hit the 64 error 45 times. The group that WAS on the load balancer was able to produce the "Service Unavailable" message twice, and the logs confirmed exactly 2 instances of the 64 error coinciding with the exact moments the errors were observed.
So what does this mean?
1.) The load balancer has some setting ("sticky sessions"?) that isn't keeping sessions pinned correctly (but we can't find the right settings, and it's not even our load balancer, it's SunGard's). Does anyone have any advice on these settings for ASP.NET?
2.) 64 errors are just a part of web life? We gave more CPU power to one of our virtual IIS servers and received fewer 64 errors. This is all I can come up with. We've sunk too much time and money into trying to solve this, but it appears that I at least have the option of taking people off the load balancer and routing them directly to one server or the other, and in addition I can beef up the servers to handle more traffic and reduce the 64 errors.
I've found out that Instagram shares its technology implementation with other developers through its blog. They have some great solutions for the problems they run into. One of those solutions is an Elastic Load Balancer on Amazon with 3 nginx instances behind it. What is the task of those nginx servers? And what is the task of the Elastic Load Balancer, and what is the relation between them?
Disclaimer: I am no expert on this in any way and am in the process of learning about AWS ecosystem myself.
The ELB (Elastic Load Balancer) has no functionality of its own except receiving requests and routing them to the right server. The servers can run Nginx, IIS, Apache, lighttpd, you name it.
I will give you a real use case.
I had one Nginx server running one WordPress blog. This server was, like I said, powered by Nginx serving static content and "upstreaming" .php requests to php-fpm running on the same server. Everything was going fine until one day this blog was featured on a TV show. I had a ton of users, and the server could not keep up with that much traffic.
My first reaction was to use the AMI (Amazon Machine Image) to spin up a copy of my server on a more powerful instance like m1.heavy. The problem was that I knew traffic would keep increasing over the next couple of days, so soon I would have to spin up an even more powerful machine, which would mean more downtime and trouble.
Instead, I launched an ELB (Elastic Load Balancer) and updated my DNS to point website traffic to the ELB instead of directly to the server. The user doesn't know the server IP or anything; they only see the ELB, and everything else goes on inside Amazon's cloud.
The ELB decides which server the traffic goes to. You can have an ELB with only one server behind it (if your traffic is low at the moment), or hundreds. Servers can be created and added to the server array (server group) at any time, or you can configure Auto Scaling to spawn new servers and add them to the ELB server group automatically, using the Amazon command-line tools.
Amazon CloudWatch (another product and an important part of the AWS ecosystem) is always watching your servers' health. It knows when the servers are becoming too loaded, and it is the agent that gives the order to spawn another server (using your AMI). When the servers are no longer under heavy load, they are automatically destroyed (or stopped, I don't recall).
This way I was able to serve all users at all times. When the load was light, I had the ELB and only one Nginx server; when the load was high, I let it decide how many servers I needed (according to server load). Minimal downtime. Of course, you can set limits on how many servers you can afford at the same time, and so on, so you don't get billed for more than you can pay.
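To make the "servers can be added and removed at any time" part concrete, here is a minimal sketch using the current Python SDK (boto3) against a classic ELB; the region, load balancer name, and instance ID are placeholders:

import boto3

elb = boto3.client("elb", region_name="us-east-1")  # classic ELB API

# Put a freshly launched instance into rotation behind the ELB.
elb.register_instances_with_load_balancer(
    LoadBalancerName="my-elb",                # hypothetical ELB name
    Instances=[{"InstanceId": "i-0abc123"}],  # hypothetical instance
)

# See which instances the ELB currently considers healthy.
health = elb.describe_instance_health(LoadBalancerName="my-elb")
for state in health["InstanceStates"]:
    print(state["InstanceId"], state["State"])  # InService / OutOfService

# Take an instance out of rotation before stopping it.
elb.deregister_instances_from_load_balancer(
    LoadBalancerName="my-elb",
    Instances=[{"InstanceId": "i-0abc123"}],
)

Auto Scaling does the same register/deregister dance for you automatically when it spawns or retires instances.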
You see, the Instagram guys said the following: "we used to run 2 Nginx machines and DNS Round-Robin between them". This is inefficient, IMO, compared to an ELB. DNS round-robin is DNS routing each request to a different server: the first goes to server one, the second goes to server two, and so on.
The ELB actually watches the servers' HEALTH (CPU usage, network usage) and decides which server the traffic goes to based on that. Do you see the difference?
And they say: "The downside of this approach is the time it takes for DNS to update in case one of the machines needs to get decommissioned."
DNS round-robin is a form of load balancing. But if one server goes kaput and you need to update DNS to remove this server from the server group, you will have downtime (DNS takes time to update across the whole world), and some users will get routed to the bad server. With an ELB this is automatic: if a server is in bad health, it does not receive any more traffic (unless, of course, the whole group of servers is in bad health and you do not have any kind of auto-scaling set up).
And now the guys at Instagram: "Recently, we moved to using Amazon's Elastic Load Balancer, with 3 NGINX instances behind it that can be swapped in and out (and are automatically taken out of rotation if they fail a health check)."
The scenario I illustrated is fictional. It is actually more complex than that, but nothing that cannot be solved. For instance, if users upload pictures to your application, how can you keep consistency between all the machines in the server group? You would need to store the images on an external service like Amazon S3. In another post on Instagram Engineering: "The photos themselves go straight to Amazon S3, which currently stores several terabytes of photo data for us." If they have 3 Nginx servers behind the load balancer and all servers serve HTML pages on which the links for images point to S3, there is no problem. If the images were stored locally on each instance, there would be no way to make this work.
All servers behind the ELB would also need an external database. For that, Amazon has RDS: all machines can point to the same database, and data consistency is guaranteed.
In the image above, you can see an RDS "Read Replica": that is RDS's way of load balancing. I don't know much about that at this time, sorry.
Try and read this: http://awsadvent.tumblr.com/post/38043683444/using-elb-and-auto-scaling
Can you please point the blog entry out?
Load balancers balance load. They monitor the web servers' health (response time, etc.) and distribute the load between the web servers. In more complex implementations, it is possible to have new servers spawn automatically if there is a traffic spike. Of course, you need to make sure there is consistency between the servers; they CAN share the same databases, for instance.
So I believe the load balancer gets hit and decides which server to route the traffic to, according to server health.
Nginx is a web server that is extremely good at serving a lot of static content to simultaneous users.
Requests for dynamic pages can be offloaded to a different server using CGI, or the same servers that run Nginx can also run php-fpm.
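As a sketch of that Nginx-plus-php-fpm split (the document root and the php-fpm address are assumptions):

server {
    listen 80;
    root /var/www/blog;

    # Nginx serves static files directly.
    location / {
        try_files $uri $uri/ /index.php?$args;
    }

    # .php requests are "upstreamed" to php-fpm.
    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass 127.0.0.1:9000;  # php-fpm listening locally
    }
}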
A lot of possibilities. I am on my cell phone right now; tomorrow I can write a little more.
Best regards.
I am aware that I am late to the party, but I think the use of NGINX instances behind the ELB in the Instagram blog post is to provide a highly available load balancer, as described here.
The NGINX instances do not seem to be used as web servers in the blog post.
For that role they mention:
Next up comes the application servers that handle our requests. We run Django on Amazon High-CPU Extra-Large machines
So the ELB is used just as a replacement for their older solution of DNS Round-Robin between NGINX instances, which did not provide high availability.