Why is TTFB 10x Nginx total request time? - nginx

In an effort to reduce 'Initial Server Response Time' and get a better Google PageSpeed Insights score, I've been trying to optimize the response time of a 4.5 kB request, which has around 270ms TTFB and 0.71ms content download (measured using dev tools).
The app is hosted on a Linode in India which is physically near. I turned on logs on Nginx as I was suspecting something was wrong with it but it shows a total response time of 25ms.
Given that Nginx defines the total response time as 'Full request time, starting when NGINX reads the first byte from the client and ending when NGINX sends the last byte of the response body', I expected that ultimately the user would get the response in a little more than 25ms but never 10x that.
Any ideas what I could be missing here? What else can I look at?
UPDATE: I migrated my Linode from Mumbai to Singapore and the results are far better now: TTFB dropped from 270ms to ~100ms. Lesson learned: even though India is close, Singapore's fast internet speed makes it a more suitable place to host my app.

From nginx logging docs
$request_time – Full request time, starting when NGINX reads the first
byte from the client and ending when NGINX sends the last byte of the
response body
Note the phrase "NGINX sends the last byte": it means NGINX has handed the last byte to the underlying OS. The TCP socket buffers may still be holding those bytes while the OS works on delivering them to the client.
Here is an analysis of this scenario.
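The point about socket buffers can be seen in a small Python sketch. It is only an illustration under assumptions: it uses a local socket pair rather than NGINX, and the function name send_then_read and the artificial reader delay are made up here. The send call returns as soon as the kernel has accepted the bytes, before the peer has read anything, which is roughly the moment a server-side timer would stop.

```python
import socket
import time


def send_then_read(payload_size=4500, reader_delay_s=0.2):
    """Return (seconds spent in sendall, seconds until the peer had all bytes)."""
    sender, receiver = socket.socketpair()
    try:
        start = time.monotonic()
        # sendall() returns once the kernel has buffered the payload,
        # not when the peer has actually read it.
        sender.sendall(b"x" * payload_size)
        send_elapsed = time.monotonic() - start

        # Stand-in for a slow path to the client: the peer reads later.
        time.sleep(reader_delay_s)
        received = 0
        while received < payload_size:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            received += len(chunk)
        total_elapsed = time.monotonic() - start
        return send_elapsed, total_elapsed
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    sent, total = send_then_read()
    print(f"sender done after {sent * 1000:.2f} ms, peer done after {total * 1000:.2f} ms")
```

The sender's own measurement is much smaller than the time at which the peer actually holds the data, which mirrors the gap between $request_time and what the browser reports.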
Nginx does not care about the RTT (Round Trip Time) between the client and the server. That's an OS/client problem.
Pinging the server from the client gives you an idea of the order of magnitude of the response time. If the ping time is greater than nginx's $request_time, the time seen by the client can't be expected to be close to $request_time.
ping -c3 -s 1450 www.kernel.org
PING ord.git.kernel.org (147.75.58.133) 1450(1478) bytes of data.
1458 bytes from ord1.git.kernel.org (147.75.58.133): icmp_seq=1 ttl=48 time=191 ms
1458 bytes from ord1.git.kernel.org (147.75.58.133): icmp_seq=2 ttl=48 time=192 ms
1458 bytes from ord1.git.kernel.org (147.75.58.133): icmp_seq=3 ttl=48 time=198 ms
--- ord.git.kernel.org ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 191.155/194.026/198.468/3.205 ms
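As a ballpark, a 4.5 kB response needs about 3 TCP packets of ~1.5 kB each. Those packets are usually sent back to back, so the floor for the response itself is roughly one ping time on top of $request_time; on a new connection the TCP (and TLS) handshakes add further round trips, so a total of around 3 times the ping time is plausible.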
On a Linux box the maximum transmission unit (MTU) of an Ethernet interface is typically 1500 bytes:
ip addr | grep 'eth0: .*mtu'
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
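The ballpark above can be written down as a short Python sketch. The function names are hypothetical, 4.5 kB is taken as 4500 bytes, the 25 ms server time comes from the question's log, and the number of round trips is an assumed input rather than something measured.

```python
import math


def packets_needed(response_bytes, packet_bytes=1500):
    """Number of packets needed to carry the response."""
    return math.ceil(response_bytes / packet_bytes)


def estimated_total_ms(ping_ms, request_time_ms, round_trips):
    """Rough client-side time: network round trips plus server time."""
    return round_trips * ping_ms + request_time_ms


if __name__ == "__main__":
    # 4.5 kB response in ~1.5 kB packets
    print(packets_needed(4500))
    # Average ping from the output above, 25 ms server time, 3 round trips (assumed)
    print(estimated_total_ms(194.026, 25, 3))
```

With a 1500-byte MTU the usable TCP payload is closer to 1460 bytes, so packets_needed(4500, 1460) gives 4 rather than 3. Either way the network term dominates the 25 ms that Nginx reports.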
DNS resolution might have an influence.

Related

Troubleshoot RServe config option keep.alive

I am using RServe 1.7.3 on a headless RHEL 7.9 VM. On the client, I am using RserveCLI2.
On long-running jobs, the TCP/IP connection is blocked by a firewall after 2 hours.
I came across the keep.alive configuration option, that is available since RServe 1.7.2 (RServe News/Changelog).
The specs read:
added support for keep.alive configuration option - it is global to
all servers and if enabled the client sockets are instructed to keep
the connection alive by periodic messages.
I added the following to /etc/Rserv.conf:
keep.alive enable
but this does not prevent the connection from being blocked.
Unfortunately, I cannot run a network monitoring tool, like Wireshark, to monitor the traffic between client and server.
How could I troubleshoot this?
Some specific questions I have:
Is the path of the config file indeed /etc/Rserv.conf, as specified in Documentation for Rserve? Notice that it does not have a final e, like Rserve.
Does this behaviour depend on the RServe client in use, or is this completely handled at the socket level?
Can I inspect the runtime settings of RServe, to see if keep.alive is enabled?
We got this to work.
To summarize, we adjusted some kernel settings to make sure keep-alive packets are sent at shorter intervals, to prevent the connection from being deemed dead by network components.
This is how and why.
The keep.alive enable setting is in fact an instruction to the socket layer to periodically emit keep-alive packets from server to client. The client is expected to return an ACK on these packets. The behaviour is governed by three kernel-level settings, as explained in TCP Keepalive HOWTO - Using TCP keepalive under Linux:
tcp_keepalive_time (defaults to 7200 seconds)
tcp_keepalive_intvl (defaults to 75 seconds)
tcp_keepalive_probes (defaults to 9 times)
tcp_keepalive_time is the idle time, counted from the last data packet, before the first keep-alive packet is sent. tcp_keepalive_intvl is the wait time between subsequent packets, and tcp_keepalive_probes is the number of consecutive unacknowledged packets after which the system decides the connection is dead.
So, the first keep-alive packet was only sent after 2 hours. By that time, some network component had already decided the connection was dead, so the keep-alive packet never made it to the client and no ACK was ever sent.
We lowered both tcp_keepalive_time and tcp_keepalive_intvl to 600 seconds.
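The effect of these three settings can be worked out with a few lines of Python. This is a sketch with a hypothetical function name; it assumes an otherwise idle connection and, for the lowered settings, that tcp_keepalive_probes was left at its default of 9, which the text does not state.

```python
def keepalive_timeline(time_s, intvl_s, probes):
    """Return (seconds of idle time before the first probe,
    seconds of idle time before an unresponsive peer is declared dead)."""
    first_probe_s = time_s
    declared_dead_s = time_s + intvl_s * probes
    return first_probe_s, declared_dead_s


if __name__ == "__main__":
    # Kernel defaults listed above
    print(keepalive_timeline(7200, 75, 9))
    # Lowered time and interval, probes assumed unchanged
    print(keepalive_timeline(600, 600, 9))
```

With the defaults the first probe only goes out after 7200 seconds, which is the 2-hour mark at which the connection was already being dropped. With the lowered values a probe crosses the network every 600 seconds.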
With tcpdump -i [interface] port 6311 we were able to monitor the keep-alive packets.
15:40:11.225941 IP <server>.6311 > <some node>.<port>: Flags [.], ack 1576, win 237, length 0
15:40:11.226196 IP <some node>.<port> > <server>.6311: Flags [.], ack 401, win 511, length 0
This continues until the results are sent back and the connection is closed. At least, it did in my test, which ran for 12 hours.
So, we use keep-alive here not to check for dead peers, but to prevent disconnection due to network inactivity, as is discussed in TCP Keepalive HOWTO - 2.2. Why use TCP keepalive?. In that scenario, you want to use low values for keep-alive time and interval.
Note that these are kernel level settings, and thus are applied system-wide. We use a dedicated server, so this is no issue for us, but may be in other cases.
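Where system-wide settings are a concern, Linux also allows the same three values to be overridden per socket. The sketch below is not what Rserve does and was not part of the setup described here; it assumes you control the application code that owns the socket, and the function name enable_keepalive is made up for illustration.

```python
import socket


def enable_keepalive(sock, idle_s=600, interval_s=600, probes=9):
    """Turn on TCP keep-alive for one socket and, where supported,
    override the kernel-wide timing for that socket only."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket options exist on Linux; other platforms may lack them.
    names = ("TCP_KEEPIDLE", "TCP_KEEPINTVL", "TCP_KEEPCNT")
    if all(hasattr(socket, name) for name in names):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    s.close()
```

This leaves the kernel defaults untouched for every other program on the machine.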
Finally, for completeness, I'll answer my own three questions.
The path of the configuration file is /etc/Rserv.conf, as was confirmed by changing another setting (remote enable to remote disable).
This is handled at the socket level.
I am not sure, but tcpdump shows that Rserve emits keep-alive packets, which is a more useful way to inspect what's happening.

Traceroute average latency

While using the UNIX traceroute command, in order to calculate the average latency for each hop (from one hop to the next, e.g. hop 8 to 9), what procedure should we follow?
8 146.97.33.6 2.150 ms 2.159 ms 2.133 ms
9 146.97.33.61 1.580 ms 1.543 ms 1.552 ms
10 146.97.35.170 1.544 ms 1.535 ms 1.526 ms
I am aware that, for instance, the average latency for hop 9 is about 1.558 ms:
(1.580 ms + 1.543 ms + 1.552 ms) / 3
However, is this the average time it takes from the local host to that particular hop, or is it the time it takes for data packets to travel from previous hop to the particular hop?
The latency is the round-trip latency from the originating host to the hop where it times out and back to the originating host, but it includes the time it takes for the timeout hop to get around to generating an ICMP message back to the originating host.
The primary purpose of a router is to route packets as fast as it can. Generating ICMP messages is a much lower priority for the router. If the router is busy routing packets, it will get around to generating the ICMP message when it has some spare time.
That is why you can see the times for some intermediate hops to be much longer than it is for the full path.
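The three lines from the question show this effect. The Python sketch below (the function name is hypothetical, and the parsing only handles lines in exactly this layout) averages the three samples per hop; every value is a round trip measured from the local host, not from the previous hop.

```python
import re
from statistics import mean

TRACEROUTE_LINES = [
    "8 146.97.33.6 2.150 ms 2.159 ms 2.133 ms",
    "9 146.97.33.61 1.580 ms 1.543 ms 1.552 ms",
    "10 146.97.35.170 1.544 ms 1.535 ms 1.526 ms",
]


def mean_rtt_per_hop(lines):
    """Map hop number -> mean round-trip time in ms, as seen from the local host."""
    result = {}
    for line in lines:
        hop = int(line.split()[0])
        # Collect every number that is directly followed by " ms".
        samples = [float(value) for value in re.findall(r"([\d.]+) ms", line)]
        result[hop] = mean(samples)
    return result


if __name__ == "__main__":
    for hop, rtt in mean_rtt_per_hop(TRACEROUTE_LINES).items():
        print(f"hop {hop}: {rtt:.4f} ms")
```

Hop 8 averages higher than hop 9 even though it is nearer, so subtracting one hop's mean from the next does not give a meaningful per-link latency.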
According to Wikipedia it looks like it is the former
the route is recorded as the round-trip times of the packets received
from each successive host (remote node) in the route (path); the sum
of the mean times in each hop is a measure of the total time spent to
establish the connection.
The answer is "from the local host to that particular hop"

Slow first load for each browser or after a while

I have a low-traffic ASP.NET MVC 4 application that runs on Windows Server 2008 / IIS 7.5.
My problem is that:
The first request is slow (around 15 sec). Subsequent requests are fine.
Another request after about 2 minutes without any requests always gets a slow response (around 15 sec).
After the first request, if I make another request from a new browser, it again takes 15 sec.
A scenario to address the problem (the document size is 24 KB):
Time: 16:26 - Using Chrome - First Request takes 15 sec. Subsequent requests are fine.
Time: 16:27 - Using Firefox - First Request takes 15 sec. Subsequent requests are fine.
Time: 16:30 - Using IE 11 - First Request takes 15 sec. Subsequent requests are fine.
Here are screenshots of the Developer Tools Network tab.
And also Fiddler time output:
Request Count: 1
Bytes Sent: 380 (headers:380; body:0)
Bytes Received: 7,217 (headers:409; body:6,808)
ACTUAL PERFORMANCE
--------------
ClientConnected: 22:41:26.377
ClientBeginRequest: 22:41:26.378
GotRequestHeaders: 22:41:26.378
ClientDoneRequest: 22:41:26.378
Determine Gateway: 0ms
DNS Lookup: 0ms
TCP/IP Connect: 28ms
HTTPS Handshake: 0ms
ServerConnected: 22:41:26.407
FiddlerBeginRequest: 22:41:26.407
ServerGotRequest: 22:41:26.407
ServerBeginResponse: 22:41:41.496
GotResponseHeaders: 22:41:41.496
ServerDoneResponse: 22:41:41.503
ClientBeginResponse: 22:41:41.503
ClientDoneResponse: 22:41:41.504
Overall Elapsed: 00:00:15.1258651
It shows a 15 sec delay between ServerGotRequest and ServerBeginResponse.
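The gap can be computed directly from the timestamps in the Fiddler output. This is a small Python sketch with a hypothetical helper name; it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def seconds_between(start, end, fmt="%H:%M:%S.%f"):
    """Elapsed seconds between two Fiddler-style timestamps on the same day."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


if __name__ == "__main__":
    # ServerGotRequest -> ServerBeginResponse
    print(seconds_between("22:41:26.407", "22:41:41.496"))
    # ServerBeginResponse -> ClientDoneResponse
    print(seconds_between("22:41:41.496", "22:41:41.504"))
```

Nearly all of the 15 seconds passes between the request reaching the server and the first response byte coming back; the transfer itself takes a few milliseconds.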
This issue only occurs at my home. There is no problem at work or on my friends' computers (which use other ISPs). My internet speed is fine too, so I tested the connection to my website:
output for ping neshoonak.ir
Reply from 94.232.172.248: bytes=32 time=67ms TTL=122
Reply from 94.232.172.248: bytes=32 time=56ms TTL=122
Reply from 94.232.172.248: bytes=32 time=63ms TTL=122
output for ping 8.8.8.8
Reply from 8.8.8.8: bytes=32 time=134ms TTL=47
Reply from 8.8.8.8: bytes=32 time=171ms TTL=47
Reply from 8.8.8.8: bytes=32 time=132ms TTL=47
I tested some sites hosted in the same data center as mine and found that all of them have the same problem (again, there is no problem at work at all).
My hosting reseller says there are 3 data centers and that I have the problem with 2 of them (only from home). He proposed moving my website to the third data center. But the problem may also occur for my site's visitors, and I don't want to solve it just for me.
Please help!
There are two possibilities that come to mind:
Caching: there are some resources that are not cached and thus they need to be loaded and that takes a while. How do you identify if this is the case? Install Fiddler and open the page. You will see the http response codes. How to fix? Cache :)
App pool: does not seem to be your issue but I want to mention it for other readers. This happened to me in an app for a Microsoft portal. They wanted near instant loading and it worked fine... Sometimes. I debugged and profiled the code a few times until it hit me. The problem was that the app pool was recycled and it needed to start and load everything. How did I fix it? I set up a cron job every 5 minutes to open one page and this kept the app loaded in memory, thus getting near instant responses.
Hope this helps!
One good test is to set up a monitoring service like www.site24x7.com and have it ping your site from multiple locations and you can see response time there.

Why does HAProxy recommend setting timeouts that are multiples of 3 seconds?

From the HAProxy documentation on client timeouts:
It is a good practice to cover one or several TCP packet losses by
specifying timeouts that are slightly above multiples of 3 seconds
(eg: 4 or 5 seconds).
That seems like an arbitrary number. What is the significance of the 3 second figure?
It appears this is the default TCP retransmission timeout. From this Microsoft KB article:
TCP starts a re-transmission timer when each outbound segment is
handed down to IP. If no acknowledgment has been received for the data
in a given segment before the timer expires, then the segment is
retransmitted, up to the TcpMaxDataRetransmissions times. The default
value for this parameter is 5.
The re-transmission timer is initialized to 3 seconds when a TCP
connection is established; however it is adjusted "on the fly" to
match the characteristics of the connection using Smoothed Round Trip
Time (SRTT) calculations as described in RFC793. The timer for a given
segment is doubled after each re-transmission of that segment. Using
this algorithm, TCP tunes itself to the "normal" delay of a
connection. TCP connections over high-delay links will take much
longer to time out than those over low- delay links.
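The doubling described in the quote can be tabulated with a short Python sketch. The function name is hypothetical, and the sketch assumes the timer stays at its initial 3 seconds, whereas the quoted text says it is adjusted to the connection's measured round-trip time.

```python
def retransmission_times(initial_rto_s=3.0, max_retransmissions=5):
    """Seconds after the original send at which each retransmission fires,
    with the timer doubling after every retransmission."""
    times = []
    elapsed = 0.0
    rto = initial_rto_s
    for _ in range(max_retransmissions):
        elapsed += rto
        times.append(elapsed)
        rto *= 2
    return times


if __name__ == "__main__":
    print(retransmission_times())
```

Under these assumptions the first retransmission happens 3 seconds after the original segment and the second at 9 seconds, so a timeout of 4 or 5 seconds leaves room for one lost packet to be resent and answered.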

Relation between HTTP Keep Alive duration and TCP timeout duration

I am trying to understand the relation between TCP/IP and HTTP timeout values. Are these two timeout values different or the same? Most Web servers allow users to set the HTTP Keep Alive timeout value through some configuration. How is this value used by the Web servers? Is this value just set on the underlying TCP/IP socket, i.e. are the HTTP Keep Alive timeout and the TCP/IP Keep Alive timeout the same? Or are they treated differently?
My understanding is (maybe incorrect):
The Web server uses the default timeout on the underlying TCP socket (i.e. indefinite) regardless of the configured HTTP Keep Alive timeout and creates a Worker thread that counts down the specified HTTP timeout interval. When the Worker thread hits zero, it closes the connection.
EDIT:
My question is about the relation or difference between the two timeout durations, i.e. what will happen when the HTTP keep-alive timeout duration and the timeout on the socket (SO_TIMEOUT) which the Web server uses are different? Should I even worry about whether these two are the same?
An open TCP socket does not require any communication whatsoever between the two parties (let's call them Alice and Bob) unless actual data is being sent. If Alice has received acknowledgments for all the data she's sent to Bob, there's no way she can distinguish among the following cases:
Bob has been unplugged, or is otherwise inaccessible to Alice.
Bob has been rebooted, or otherwise forgotten about the open TCP socket he'd established with Alice.
Bob is connected to Alice, and knows he has an open connection, but doesn't have anything he wants to say.
If Alice hasn't heard from Bob in awhile and wants to distinguish among the above conditions, she can resend her last byte of data, wrapped in a suitable TCP frame to be recognizable as a retransmission, essentially pretending she hasn't heard the acknowledgment. If Bob is unplugged, she'll hear nothing back, even if she repeatedly sends the packet over a period of many seconds. If Bob has rebooted or forgotten the connection, he will immediately respond saying the connection is invalid. If Bob is happy with the connection and simply has nothing to say, he'll respond with an acknowledgment of the retransmission.
The Timeout indicates how long Alice is willing to wait for a response when she sends a packet which demands a reply. The Keepalive time indicates how much time she should allow to lapse before she retransmits her last bit of data and demands an acknowledgment. If Bob goes missing, the sum of the Keepalive and Timeout values will indicate the worst-case time between Alice receiving her last bit of data and her deciding that Bob is dead.
They're two separate mechanisms; the name is a coincidence.
HTTP keep-alive (also known as persistent connections) is keeping the TCP socket open so that another request can be made without setting up a new connection.
TCP keep-alive is a periodic check to make sure that the connection is still up and functioning. It's often used to assure that a NAT box (e.g., a DSL router) doesn't "forget" the mapping between an internal and external ip/port.
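To make the distinction concrete, here is a minimal Python sketch of an HTTP keep-alive timeout as an application-level idle timer. It is an illustration only, not how Apache or any particular server implements it, and the function name is made up. TCP keep-alive, by contrast, is switched on with the SO_KEEPALIVE socket option and carried out by the kernel.

```python
import socket


def wait_for_next_request(conn, keep_alive_timeout_s):
    """Wait for more bytes on an idle persistent connection.

    Returns the received bytes, or None if the idle timer expires first.
    """
    conn.settimeout(keep_alive_timeout_s)
    try:
        data = conn.recv(4096)
    except socket.timeout:
        # Idle for too long: a server would close the connection at this point.
        return None
    # An empty bytes object means the client closed the connection itself.
    return data


if __name__ == "__main__":
    client, server_side = socket.socketpair()
    print(wait_for_next_request(server_side, 0.1))   # nothing sent yet
    client.sendall(b"GET / HTTP/1.1\r\n\r\n")
    print(wait_for_next_request(server_side, 1.0))   # request arrives in time
    client.close()
    server_side.close()
```

The timer here only decides how long the server keeps an idle connection open for another request; it sends nothing on the wire and does not probe whether the peer is still reachable.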
KeepAliveTimeout Directive
Description: Amount of time the server will wait for subsequent requests on a persistent connection
Syntax: KeepAliveTimeout seconds
Default: KeepAliveTimeout 15
Context: server config, virtual host
Status: Core
Module: core
The number of seconds Apache will wait for a subsequent request before
closing the connection. Once a request has been received, the timeout
value specified by the Timeout directive applies.
Setting KeepAliveTimeout to a high value may cause performance
problems in heavily loaded servers. The higher the timeout, the more
server processes will be kept occupied waiting on connections with
idle clients.
In a name-based virtual host context, the value of the first defined
virtual host (the default host) in a set of NameVirtualHost will be
used. The other values will be ignored.
TimeOut Directive
Description: Amount of time the server will wait for certain events before failing a request
Syntax: TimeOut seconds
Default: TimeOut 300
Context: server config, virtual host
Status: Core
Module: core
The TimeOut directive currently defines the amount of time Apache will
wait for three things:
The total amount of time it takes to receive a GET request.
The amount of time between receipt of TCP packets on a POST or PUT request.
The amount of time between ACKs on transmissions of TCP packets in responses.
We plan on making these separately configurable at some point down the
road. The timer used to default to 1200 before 1.2, but has been
lowered to 300 which is still far more than necessary in most
situations. It is not set any lower by default because there may still
be odd places in the code where the timer is not reset when a packet
is sent.

Resources