Scrapy - Set TCP Connect Timeout - web-scraping

I'm trying to scrape a website with Scrapy. The website is extremely slow at times and can take 15-20 seconds to respond to the first request in a browser. Sometimes, when I try to crawl it with Scrapy, I keep getting a TCP timeout error, even though the website opens just fine in my browser. Here's the message:
2017-09-05 17:34:41 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.hosane.com/result/specialList> (failed 16 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
I have even overridden the USER_AGENT setting for testing.
I don't think the DOWNLOAD_TIMEOUT setting applies here: it defaults to 180 seconds, yet Scrapy gives up with a TCP timeout error in well under 20-30 seconds.
Any idea what is causing this issue? Is there a way to set TCP timeout in Scrapy?
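For reference, these overrides normally live in the project's settings.py; a minimal sketch (the values are illustrative, not the actual project settings) might look like this:

# settings.py -- illustrative values only
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # overridden for testing

# Scrapy's default download timeout; the TCP error appears long before this elapses.
DOWNLOAD_TIMEOUT = 180

# Hypothetical: a raised retry count (default is 2) would be consistent with
# the "failed 16 times" in the log above (1 attempt + 15 retries).
RETRY_TIMES = 15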

A TCP connection timeout can happen before Scrapy's DOWNLOAD_TIMEOUT because the initial TCP connect timeout is determined by the OS, usually in terms of TCP SYN packet retransmissions.
By default on my Linux box, I have 6 retransmissions:
cat /proc/sys/net/ipv4/tcp_syn_retries
6
which, in practice (for Scrapy too), means 0 + 1 + 2 + 4 + 8 + 16 + 32 (+64) = 127 seconds before receiving a twisted.internet.error.TCPTimedOutError: TCP connection timed out: 110: Connection timed out. from Twisted. (That's the initial attempt, then exponential backoff between retransmissions, and no reply after the 6th retry.)
If I set /proc/sys/net/ipv4/tcp_syn_retries to 8 for example, I can verify that I receive this instead:
User timeout caused connection failure: Getting http://www.hosane.com/result/specialList took longer than 180.0 seconds.
That's because 0+1+2+4+8+16+32+64+128(+256) > 180.
10060: A connection attempt failed... seems to be a Windows socket error code. If you want the TCP connection timeout to be at least as long as DOWNLOAD_TIMEOUT, you'll need to increase the TCP SYN retry count. (I don't know how to do that on your system, but Google is your friend.)
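To make the arithmetic concrete, here is a small Python sketch (assuming Linux, where the value lives in /proc) that derives the effective connect timeout from tcp_syn_retries:

# Effective TCP connect timeout implied by tcp_syn_retries on Linux.
def syn_connect_timeout(syn_retries, initial_rto=1):
    # Initial SYN, then exponential backoff between retransmissions; the
    # attempt is only abandoned after the final retransmission's own timeout.
    total, rto = 0, initial_rto
    for _ in range(syn_retries + 1):
        total += rto
        rto *= 2
    return total

with open("/proc/sys/net/ipv4/tcp_syn_retries") as f:  # Linux only
    retries = int(f.read())

print(retries, "retries ->", syn_connect_timeout(retries), "seconds")
# 6 retries -> 127 seconds; 8 retries -> 511 seconds (> DOWNLOAD_TIMEOUT of 180)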

Related

Troubleshoot RServe config option keep.alive

I am using RServe 1.7.3 on a headless RHEL 7.9 VM. On the client, I am using RserveCLI2.
On long-running jobs, the TCP/IP connection gets blocked by a firewall after 2 hours.
I came across the keep.alive configuration option, which is available since RServe 1.7.2 (RServe News/Changelog).
The specs read:
added support for keep.alive configuration option - it is global to
all servers and if enabled the client sockets are instructed to keep
the connection alive by periodic messages.
I added the following to /etc/Rserv.conf:
keep.alive enable
but this does not prevent the connection from being blocked.
Unfortunately, I cannot run a network monitoring tool, like Wireshark, to monitor the traffic between client and server.
How could I troubleshoot this?
Some specific questions I have:
Is the path of the config file indeed /etc/Rserv.conf, as specified in the Documentation for Rserve? Note that it does not have a final e, unlike Rserve.
Does this behaviour depend on the RServe client in use, or is this completely handled at the socket level?
Can I inspect the runtime settings of RServe, to see if keep.alive is enabled?
We got this to work.
To summarize, we adjusted some kernel settings to make sure keep-alive packets are sent at shorter intervals, to prevent the connection from being deemed dead by network components.
This is how and why.
The keep.alive enable setting is in fact an instruction to the socket layer to periodically emit keep-alive packets from server to client. The client is expected to return an ACK on these packets. The behaviour is governed by three kernel-level settings, as explained in TCP Keepalive HOWTO - Using TCP keepalive under Linux:
tcp_keepalive_time (defaults to 7200 seconds)
tcp_keepalive_intvl (defaults to 75 seconds)
tcp_keepalive_probes (defaults to 9 times)
tcp_keepalive_time is the delay before the first keep-alive packet is sent after the TCP/IP connection is established. tcp_keepalive_intvl is the wait time between subsequent packets, and tcp_keepalive_probes is the number of consecutive unacknowledged packets after which the system decides the connection is dead.
So, the first keep-alive packet was only sent after 2 hours. By that time, some network component had already decided the connection was dead, so the keep-alive packet never made it to the client and no ACK was ever sent.
We lowered both tcp_keepalive_time and tcp_keepalive_intvl to 600 seconds.
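Lowering these values is usually done with sysctl; for illustration, an equivalent Python sketch that writes the values used above directly to /proc (Linux only, must run as root, and does not persist across reboots):

# Lower the kernel keep-alive settings to 600 seconds (run as root).
for name in ("tcp_keepalive_time", "tcp_keepalive_intvl"):
    with open("/proc/sys/net/ipv4/" + name, "w") as f:
        f.write("600")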
With tcpdump -i [interface] port 6311 we were able to monitor the keep-alive packets.
15:40:11.225941 IP <server>.6311 > <some node>.<port>: Flags [.], ack 1576, win 237, length 0
15:40:11.226196 IP <some node>.<port> > <server>.6311: Flags [.], ack 401, win 511, length 0
This continues until the results are sent back and the connection is closed. I tested this for a duration of 12 hours.
So, we use keep-alive here not to check for dead peers, but to prevent disconnection due to network inactivity, as is discussed in TCP Keepalive HOWTO - 2.2. Why use TCP keepalive?. In that scenario, you want to use low values for keep-alive time and interval.
Note that these are kernel-level settings, and thus apply system-wide. We use a dedicated server, so this is not an issue for us, but it may be in other cases.
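As far as I know, Rserve only exposes the global keep.alive switch, but for completeness: the same three parameters can also be overridden per socket by the application that owns the socket, which avoids touching the system-wide defaults. A Python sketch of those socket options (purely illustrative, not something Rserve does itself):

import socket

# Per-socket keep-alive overrides on Linux (illustrative only).
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)      # enable keep-alive
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 600)   # cf. tcp_keepalive_time
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 600)  # cf. tcp_keepalive_intvl
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 9)      # cf. tcp_keepalive_probes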
Finally, for completeness, I'll answer my own three questions.
The path of the configuration file is /etc/Rserv.conf, as was confirmed by changing another setting (remote enable to remote disable).
This is handled at the socket level.
I am not sure, but using tcpdump shows that Rserve emits keep-alive packets, which is a more useful way to inspect what's happening.

Can a gRPC client connect timeout be set independent of reconnect backoff settings?

We'd like to configure our gRPC client to reconnect very quickly after a connection is lost. (I believe the default behavior is to attempt to reconnect after 20 seconds, backing off to 120 seconds between attempts.) After a review of available settings, we tried setting grpc.initial_reconnect_backoff_ms and grpc.min_reconnect_backoff_ms to 200. While that results in quick reconnects when a connection is lost, we sometimes see calls (from tests) fail with GRPC::Internal: 13:Completed without a response. Looking at logging from a tcp reverse proxy sitting between client and server, I see a connection lasting for just over 200ms, then a second connection lasting for longer. So it looks like the reconnect times are effectively serving as timeouts on connection attempts.
Is it possible to configure a gRPC client so that it will begin attempting a reconnect very quickly after a connection is lost, but allow creation of that connection to take longer than the reconnect time?
If it matters, this is a Ruby client.
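For reference (the client here is Ruby, but these channel arguments are handled by gRPC core and look much the same in any language), here is a minimal Python sketch of the settings described above; the target address is a placeholder:

import grpc

# Aggressive reconnect backoff as described in the question; whether these
# values behave as intended is the subject of the answer below.
options = [
    ("grpc.initial_reconnect_backoff_ms", 200),
    ("grpc.min_reconnect_backoff_ms", 200),
    ("grpc.max_reconnect_backoff_ms", 120000),  # default upper bound (~120 s)
]
channel = grpc.insecure_channel("server.example.com:50051", options=options)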
The initial backoff is supposed to be 1 second.
You're experiencing a bug where the minimum connection timeout acts as both the timeout and the backoff (so the 1 s initial backoff is ignored). Both your initial problem and the failed workaround are caused by the same bug.
(The bug was noticed a month ago, but an issue wasn't filed due to a mixup with a second bug. Your question here let me notice the missing issue.)

Why did retransmission occur 16 seconds later?

I use JMeter to test a Tomcat web application running behind an F5 load balancer.
From the pcap file I captured on the JMeter node, I see that there is a duplicate ACK, and then 16 seconds later the client retransmits the packet and does nothing for the next 20 seconds.
As Tomcat's default socket timeout is 20 seconds, the client received an HTTP error.
My question is:
What could make the client's TCP stack retransmit after 16 seconds and then do nothing for the next 20 seconds? Is it too busy? Is there a way to find out the reason?

After HTTP Keep Alive timeout new HTTP request arrives

Our application is an ExtJS 3.4 based application. We frequently get a "Communication Failure" error in the UI. We have the application deployed on several domains, and on some domains we get this error very frequently.
Without HTTP Keep-Alive we do not get that error.
But with keep-alive timeouts of 1 second and 5 seconds we get it quite frequently.
What we observed in Wireshark was that, due to high RTT (round-trip time), the requests were taking more time than expected.
There was an inconsistency in the packet flow. With a keep-alive timeout of 5 seconds, the scenario was:
When a request is successfully served, the server returns 200 OK with a Keep-Alive timeout parameter of 5 seconds (the server telling the client that it will wait 5 seconds before closing this connection).
As soon as the 5 seconds elapse, the server sends a FIN packet (to close the connection) to the client, which is the browser in our case.
Here is the catch: due to the high RTT, the client's ACK for that FIN takes a long time to reach the server.
The server has initiated the close, but before the connection is actually closed the client sends a new HTTP request (e.g. an ExampleABC.do request), before the server receives the ACK for the FIN.
Because of this, the server is not able to handle that request, since it has already initiated the connection close.
Setting the keep-alive timeout to 1 second was an attempt to reduce the time the server waits before closing the connection: after 1 second the connection is closed and a fresh connection is set up for the next request, to avoid an unwanted request arriving just after the 5-second mark.
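To make the race concrete, here is a minimal Python sketch of a client reusing a kept-alive connection just after the advertised timeout; the host, path, and 5-second value are placeholders for illustration:

import http.client
import time

conn = http.client.HTTPConnection("app.example.com", 80)

# First request on a fresh connection; the server advertises e.g. "timeout=5".
conn.request("GET", "/ExampleABC.do")
resp = conn.getresponse()
resp.read()
print(resp.status, resp.getheader("Keep-Alive"))

time.sleep(5.2)  # wait just past the advertised keep-alive timeout

try:
    # The server has sent (or is about to send) its FIN; this request can
    # race with the server-side close and fail.
    conn.request("GET", "/ExampleABC.do")
    print(conn.getresponse().status)
except (http.client.RemoteDisconnected, ConnectionResetError, BrokenPipeError) as e:
    print("request raced with the server-side close:", e)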
Thanks in advance
This is my first post, please correct me if needed.
Sorry for bad English :)
(Screenshot of the communication failure error omitted.)
We solved this issue by synchronizing the browser and server timeouts.
The fix was to make sure the server's keep-alive timeout and the browser's timeout coincide, so that the TCP connection is dropped completely at the same moment on both sides.

SignalR not recognizing some transport disconnects

I have noticed that SignalR does not recognize disconnection events at the transport layer in some circumstances. If the transport is disconnected gracefully, either by dropping my VPN connection to the server or by issuing an ipconfig /release, the appropriate events are fired. If I am connected to the VPN and I either unplug the network cable or turn my wireless off, SignalR does not recognize we are disconnected until the VPN connection times out. That is mostly acceptable because it does eventually realize it. However, if I am not using a VPN and I simply unplug the network cable or turn off the wireless, SignalR never recognizes that a disconnect has occurred. This is using SignalR 2.0.1 over WebSockets in a non-CORS environment. If I enable logging in the hub I see no events logged in the console after waiting for over 5 minutes. One thing that bothers me that I do see in the log during start up is:
SignalR: Now monitoring keep alive with a warning timeout of 1600000 and a connection lost timeout of 2400000.
Is this my problem? I have tried manually setting my GlobalHost.Configuration.DisconnectTimeout to 30 seconds but it does not change any behavior in this scenario, or alter that log statement. What else might I be overlooking here?
Edit: I noticed in fiddler that my negotiate response has a disconnect timeout of 3600 and a keepalive of 2400 and that trywebsockets is false. This particular server is 2008 R2, which I do not believe supports Web Sockets. Does that mean long polling would be used? I don't see any long polling requests going out in fiddler or the console.
The timeout settings were my problem. I'm not sure why I originally did not see a change in the logging output when I adjusted the setting, but I went back to a sample app and saw things change there and now things are behaving properly.
It should be noted that the default settings for SignalR produce the following statement in logging output and that this is measured in milliseconds.
SignalR: Now monitoring keep alive with a warning timeout of 13333.333333333332 and a connection lost timeout of 20000
This is obvious when you read the information under Transport Disconnect Scenarios on the following page, which says: "The default keepalive timeout warning period is 2/3 of the keepalive timeout. The keepalive timeout is 20 seconds, so the warning occurs at about 13 seconds."
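The relationship between the two log lines above is exactly that 2/3 ratio; a quick check in Python using the values from the logs:

# The keep-alive warning timeout is 2/3 of the connection lost timeout (ms).
def warning_timeout(connection_lost_ms):
    return connection_lost_ms * 2 / 3

print(warning_timeout(20000))    # 13333.33... -> the default-settings log line
print(warning_timeout(2400000))  # 1600000.0   -> the 2400-second negotiate value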
