Recently, I have an issue with my website from around 6pm gmt and 4am gmt.
I thought at first it was due to visitors at high traffic time but from 0am to 4am I have less visitors than times where the problem doesn't occur. My server gets high CPU usage. Here's the top command screenshot.
I have a dedicated server with 8 cores and 8 GB memory.
You should provide more infos about your issues. That's just a print screen and it's pretty hard to deduct anything based only on that! Check your apache server logs for that specific cPanel user (/usr/local/apache/domlogs/cpanel_user/domain.tld - that's the apache access log file for that account), run as suggested by others SHOW PROCESSLIST in mysql or install a tool like mytop or mtop
Related
We have a prod server running on IIS 10 (Windows Server 2016). The strange thing is, that the response time (even to access the main page) of an HTTP Request becomes slower after some time.
First assumption was that the ASP.NET Server / WCF Services have a problem, but even if we do an IISRESET /Stop - wait 10 secs - IISRESET /Start does not solve the problem.
Only a full restart of the server solves the problem.
The small red spikes in the picture shows our tries with IISRESET.
Do you have any suggestions on this?
The solution to this problem can be solved by improving the configuration of the server, or setting idle-timeout.
Please check the CPU and memory usage. If you find that the website has high CPU, please use the DebugDiag Collection to collect relevant information.
I am running a LAMP Stack on a google cloud customized compute engine primarily to host wordpress websites running woocommerce stores.
Following are server specs:
RAM: 5GB, Cores: 1, Space: 30GB, OS: CentOS7, Maria DB Version: 5.5.64, PHP Version: 7.3
Currently facing extreme ttfb Values over 10-20 secs even with very low traffic. Have done the following optimisations for improving the timing but it doesn't seem to improve it. The site has close to 1500 products.
Wordpress caching using hummingbird and auto optimize (minify, GZIP compression etc..) custom .htaccess with header expires, APCU PHP cache, cloudflare CDN, compressed images.
Optimized mariadb with optimum memory allocation, allocated optimum memory to apache and PHP as well.
Tried adding more cores and increase memory of compute engine in vain.
Disabling theme and template has little to no effect.
All the above optimizations has had little effect on the ttfb timings, is this a server/network related issue on my google cloud compute instance ?
Pls check the ttfb values below, test link:
TTFB Test Results
Thanks in advance !
I think you can measure the repose times. Try to measure the time spent waiting for the initial response by going to your browser and clicking "F12" >> "Network" tab and then search for your website using the browser in the same window.
You will get the response times by each process to connect to your website. If you click a specific process and then select the timing you will be able to see the TTFB and with that try to catch where is taking more time.
I believe this is more related with your installations than with the server itself.
If you want to test your server connection you could try to avoid the app side and use a trace or iperf to test your TCP connections times to your server from your local computer (to the external IP), this will only work if you have ICMP traffic allowed.
And the last thing is the same than John mentioned above, check if you're server is not swaping memory or even try to monitor the CPU and mem in use while you run the ttbf test, that will give you an idea if the problem is with the server or with the website and its configuration.
Additionally here are some recommendations to reduce ttbf (https://wp-rocket.me/blog/how-to-reduce-ttfb-wordpress-site/). Hoping it can help some how with this.
I am running a single high-visited website on a high-end Centos 7 VPS (16 vCore / 128 GB of RAM) running Plesk Onyx on
Centos 7 / MariaDB 10.1 / PHP-FPM 5.6 setup.
Everything is usually smooth and fast, but it happened twice in a year that the website went down with the message "Too Many Connections" from MariaDB.
Being in a hurry to restore website I launched a " service mariadb restart " without actually launching a SHOW PROCESSLIST.
I checked mariadb logs and web server logs afterwards and I haven't find anything useful to troubleshoot the issue.
Note that when it happened first time, I raised the max_connections value to 300 in my.cnf and constantly monitored the "max_used_connections" variabile seeing that value never went over 50 so I guessed it happened because of some DDOS attack or malicious attempt.
Questions :
Any advice on how to troubleshoot this ?
How can I be alerted if the max_used_connections value is approaching the max_connections value ? Any tool ?
I am using external pingdom service to check website uptime but it didn't detect this kind of problem (the web response is 200 OK) and also a netdata instance on the server (https://netdata.io/) that didn't help...
Troubleshoot it by turning on the slowlog, preferably with a low value for long_query_time (such as "1"). Probably some naughty query will show up there.
Yes, do SHOW FULL PROCESSLIST next time. (Note "FULL".) Instead of restarting mysqld, look for the offending query. It will have one of the highest values in Time and it probably won't be in Sleep mode. It may be something potentially long like ALTER or a dump. Killing that one process will probably uncork the problem, and the problem will vanish in, perhaps, seconds.
Deleting a file that is "open" by a process (such as mysqld) will not help -- disk space is not recycled until all processes have closed the file. Killing the process closes any open files. Some logs are can be handled with FLUSH LOGS; -- this should be harmless, though it may not help.
If your tables are MyISAM, switching to InnoDB will avoid many cases of table locks (if that is what you are experiencing).
What is the value of innodb_buffer_pool_size? For that sized RAM, about 80G is reasonable.
There might be some clues in the GLOBAL STATUS; see http://mysql.rjweb.org/doc.php/mysql_analysis#tuning for analyzing it. (Caution: It will be useless immediately after a reboot.)
Server Specks
Microsoft Windows Server 2003 Enterprise Edition SP2
IIS 6
.net4
Intel(R) Xeon(R) CPU
X5680 # 3.33GHz, 2.00GB of RAM
Physical Address Extension
I am having trouble finding the cause of our server's random downtime. Our clients inform us that their website goes down for hours at a time. Sometimes users are able to log in however the site is extremely slow/unstable and unusable. Sometimes users are not able to log in at all. When users are able to log in not all images are displayed (they get the image not found image).
We upgraded their website from .net1 to .net4 because we thought the cause of their downtime and random user log out was due to them running their website on .net1. The website was running fine with no issues for a few months.
The first time the server started to go down after that was due to the drive with which the website resided on running out of disk space. There was 40GB partitioned to this drive and 20GB was added. This didn't resolve the issue for very long.
The second time the server would randomly go down, I noticed in the Event viewer, that the web worker associated with the app pool used by the website would periodically require to be recylcled. That is, in the Security tab of the Event Viewer I would periodically see an event with ID 1074 reading 'A worker process with process id of '1540' serving application pool 'Net4' has requested a recycle because the worker process reached its allowed processing time limit.'. I then went into this app pool's properties and saw that the app pool would be recycled every 29 hours, which is the default. I modified this to have the app pool recycle every day at 3:00am. Since that we have not seen this event in the Event Viewer. We were able to catch the website during one of its downtimes before this was changed and recycled the app pool manually. This resolved the issue in this one instance.
This did not permanently fix the issue however, as we are still receiving emails from our client informing us that the website is down for hours at a time.
I then set up a performance monitor counter log. We have managed to monitor the server's performance during many of these downtimes. It does not appear to be a problem with memory as there is plenty of space on the drive. It does not appear to be a memory leak or related to excessive paging as there are no running processes which take up an excessive amount of % Processor Time and the Pages/Second Memory counter does not peak at an excessive amount during most of the downtime (I'll explain why excessive paging occurs later). The total IO Data Bytes/sec and IO Other Data Bytes/sec Process counter does not appear to be usually high or low during downtime. The total Thread Count and Handle Count Process counter do not exhibit any abnormal spikes or drops during this time. The total thread count, at a given time, seems to be between 600 and 900, give or take. The total handle count, at a given time, seems to be between 15,000 and 23,00, give or take. The % Time in Jit .NET CLR jit counter for instance w3wp is 0 for about half of the time and will randomly peak at almost 100 the other half, most of the time peaking for just a moment but rarely peaking for about 10 minutes, unrelated to downtime.
There are random times throughout the day where the process dsmcsvc takes up most, if not all, of the % Processor Time. This is a process run by the Symantec Antivirus software. When this process takes up the % Processor Time there is a corresponding event in the Event Viewer signifying that a new virus definition file has been uploaded that is, an Application event with ID 7 'New virus definition file loaded. Version: #version number#'. When this event occurs, the Pages/Sec counter spikes. Sometimes it spikes to only 200-300 but will at times peak over 10,000. This event seems to be completely unrelated to website downtime. I have researched the Symantec Antivirus software and found that there is a known memory leak in old versions of this software. I have found that this software is known to cause high memory usage when the link to a process called NavLogon.exe is broken/does not exist. This process does not appear to exist on the server so I currently have no way of restoring the link to it. I also found that this software uses Crypt32.dll and that old versions of Crypt32.dll have a known memory leak. The Crypt32.dll which exists on the server was last updated in 2007.
The Performance Monitor log monitors the total Sessions Active ASP.Net Applications counter. During downtime, the total number of sessions does not exhibit any abnormal behavior, there are a normal amount of active sessions during this time. Active sessions at a given time can be between 0 and 200. I was informed that the time when the most users are active is during 1st shift, however during about 10pm and 2am every day, this number peaks.
The site runs JavaScript client side, and Visual Basic.net server side. All users have about 10-15 session variables almost all of the time.
When the site goes down there are no events which seem to correspond to its downtime in the Event Viewer.
I also have set up a W3C Extended Log File Format log for this site. During downtime there seems be an excessive amount of GET requests for a Telerik.RadUploadProgressHandler.ashx.
I have seriously run out of ideas at this point and have extensively searched the web for solutions and come up empty. Any feedback as to why this may be occurring would be great.
It does not appear to be a problem with memory as there is plenty of space on the drive.
Really? Memory and hard drive space are two completely different things. 2GB of RAM was okay a decade ago, when that server was new, but today it's laughably small.
But don't bother upgrading or adding RAM. This server is old enough, the problem is probably just that the hardware is reaching the end of it's useful life. Additionally, the operating system is also nearing it's end of life. Server 2003 is scheduled for end of life on July 14, 2015. After that date, there will be no new patches of any kind produced for Server 2003... not even critical security patches. That will make Server 2003 completely unsuitable as a web server.
This seems like a good time to execute a transition to a completely new server.
Last night one of the websites (.NET 4.0 forms) hosted on my Win 2008 R2 (IIS 7.5) Server started to time out throwing the following error for all connected users.
TYPE System.Web.HttpException
MESSAGE Request timed out.
DETAIL System.Web.HttpException (0x80004005): Request timed out.
The outage was confined to just one website within IIS, the others continued to work fine.
Unfortunately I was unable to identify why the website was timing out. Here are the steps I took:
First thing I did was look at the task manager which revealed normal CPU and memory usage. Network activity was also moderate.
I then opened IIS to look at the live connections under 'Worker Processes'. There were about 60 live connections, so it didn't look like anything DDoS related.
Checked database connectivity (hosted on a separate server), all fine!
I then reset the website on IIS. That didn't work
I tried to then do a complete iisreset...still no luck :(
In the end (and under some duress) the only thing I could think to do to resolve this was to restart the server.
Restarting the server worked but I am nervous not knowing why this happened in the first place. Can anyone recommend any checks that I failed to carryout? Is there an official checklist for working through these sorts of IIS problems? I have reviewed the IIS logs but don't see anything unusual on the run up to the outage.
Any pointers or links to useful resources to help me understand and mitigate against this in future will be much appreciated.
EDIT
The only time I logged into the server that day was to add an additional web handler component (for remote deploy) to IIS Web Deploy. I'm doubtful this caused the outage as the server worked for for 6 hours after.
Because iisreset didn't helped and you had to restart whole machine, I would suspect it was a global resources shortage and mostly used website (or most resource consuming) was impacted. It could be because of not available RAM, network connections congestion due to some malfunctioning calls (for example a lot of CLOSE_WAIT sockets exhausting connections pool, we've seen that in production because of malfunction of external service). It could be also one specific client problem, which was disconnected after machine restart so eventually the problem disappeared.
I would start from:
Historical analysis
review Event Viewer to see any errors/warnings from that period of time,
although you have already looked into IIS logs, I would do it once again with help of Log Parser Lizard to make some statistics like number of request per client, network bandwith per client, average response time per client and so on.
Monitoring
continuously monitor Performance Counters:
\Processor(_Total_)\% Processor Time,
\.NET CLR Exceptions(_Global_)\# of Exceps Thrown / sec,
\Memory\Available MBytes,
\Web Service(Default Web Site)\Current Connections (per each your site name),
\ASP.NET v4.0.30319\Request Wait Time,
\ASP.NET v4.0.30319\Requests Current,
\ASP.NET v4.0.30319\Request Queued,
\Process(XXX)\Working Set,
\Process(XXX)\% Processor Time (XXX per each w3wp process),
\Network Interface(XXX)\Bytes total / sec
run Performance Analysis of Logs (PAL) Tool during time of failure to make a very detailed analysis of performance counters data,
run netstat -ano to analyze network traffic (or TCPView tool even better)
If all this will not lead you to any conclusion, create a Debug Diagnostic rule to create a memory dump of the process for long running requests and analyze it with WinDbg and PSSCor extension for .NET debugging.