Asterisk chan_sip locking randomly

chan_sip locks up and causes registration issues with Asterisk realtime MySQL, displaying the following errors:
chan_sip.c:3821 retrans_pkt: Retransmission timeout reached on transmission for seqno 1 (Critical Response)
chan_sip.c:3821 retrans_pkt: Retransmission timeout reached on transmission for seqno 106
My registrations to my SIP providers then time out with the following errors:
chan_sip.c:13661 sip_reg_timeout: -- Registration for xxx timed out, trying again
This is not a connectivity issue, as restarting Asterisk re-registers to the SIP providers immediately.
The connections eventually re-establish and registrations succeed until the above happens again at a random interval. The server is a dedicated, non-NATted server.
What could be the issue?

If you are on Debian or Ubuntu, turn off res_timing_pthread.so and use res_timing_dahdi.so instead.
That fixes this issue.
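For reference, a minimal sketch of the relevant modules.conf entries, assuming modules are otherwise autoloaded and DAHDI is installed:

; /etc/asterisk/modules.conf
[modules]
autoload=yes
noload => res_timing_pthread.so   ; disable the pthread timing source
load => res_timing_dahdi.so       ; use the DAHDI timer instead

Restart Asterisk (or unload/load the modules from the CLI) after making the change.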

Related

Why can't I connect more than 8000 clients to MQTT brokers via HAProxy?

I am trying to establish 10k client connections (potentially 100k) with my 2 MQTT brokers using HAProxy as a load balancer.
I have a working simulator (using the Java Paho library) that can simulate 10k clients. On the same machine I run 2 MQTT brokers in Docker. For the LB I'm using another machine running a virtual image of Ubuntu 16.04.
When I connect directly to an MQTT broker, those connections are established without a problem; however, when I use HAProxy I only get around 8.8k connections, while the rest throw: Error at client{insert number here}: Connection lost (32109) - java.net.SocketException: Connection reset. When I connect the simulator directly to a broker (same machine), about 20k TCP connections open, but when I use the load balancer only 17k do. This leaves me thinking that the LB is causing the problem.
It is important to add that whenever I run the simulator I'm unable to use the browser (it cannot connect to the internet). I haven't tested whether this is browser-only, but could that mean that I actually run out of ports or something similar, and the real issue here is not in the LB?
Here is my HAProxy configuration:
global
    log /dev/log local0
    log /dev/log local1 notice
    maxconn 500000
    ulimit-n 500000
    maxpipes 500000

defaults
    log global
    mode http
    timeout connect 3h
    timeout client 3h
    timeout server 3h

listen mqtt
    bind *:8080
    mode tcp
    option tcplog
    option clitcpka
    balance leastconn
    server broker_1 address:1883 check
    server broker_2 address:1884 check

listen stats
    bind 0.0.0.0:1936
    mode http
    stats enable
    stats hide-version
    stats realm Haproxy\ Statistics
    stats uri /
This is what the MQTT broker shows for every successful/unsuccessful connection:
...
//Successful connection
1613382861: New connection from xxx:32850 on port 1883.
1613382861: New client connected from xxx:60974 as 356 (p2, c1, k1200, u'admin').
...
//Unsuccessful connection
1613382699: New connection from xxx:42861 on port 1883.
1613382699: Client <unknown> closed its connection.
...
And this is what ulimit -a shows on the LB machine.
core file size (blocks) (-c) 0
data seg size (kb) (-d) unlimited
scheduling priority (-e) 0
file size (blocks) (-f) unlimited
pending signals (-i) 102355
max locked memory (kb) (-l) 82000
max memory size (kb) (-m) unlimited
open files (-n) 500000
POSIX message queues (bytes) (-q) 819200
real-time priority (-r) 0
stack size (kb) (-s) 8192
cpu time (seconds) (-t) unlimited
max user processes (-u) 500000
virtual memory (kb) (-v) unlimited
file locks (-x) unlimited
Note: The LB process has the same limits.
I followed various tutorials and increased the open file limit as well as the port limit, TCP header size, etc. The number of connected users increased from 2.8k to about 8.5-9k (which is still way lower than the 300k the author of the tutorial had). The ss -s command shows about 17,000 TCP and inet connections.
Any pointers would greatly help!
Thanks!
You can't do a normal LB of MQTT traffic, as you can't "pin" the connection based on the MQTT topic. If you send a SUBSCRIBE to Broker1 for topic "test/blatt/#", but the next client PUBLISHes "test/blatt/foo" to Broker2, then if the two brokers are not bridged, your first subscriber will never get that message.
If your clients are terminating the TCP connection sometime after the CONNECT, or HAProxy is round-robining the packets between the two brokers, you will get errors like this. You need to somehow persist the connections, and I don't know how you do that with HAProxy. Non-free LBs like A10 Thunder or F5 LTM can persist TCP connections... but you still need the MQTT brokers bridged for it all to work.
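To make the bridging point concrete, here is a minimal sketch using the Python paho-mqtt 1.x client (the question's simulator uses the Java Paho library); the broker addresses, ports, and client IDs are placeholders:

import time
import paho.mqtt.client as mqtt

# Hypothetical broker addresses; in the question both brokers sit behind HAProxy.
BROKER_1 = ("broker1.example.com", 1883)
BROKER_2 = ("broker2.example.com", 1884)

def on_message(client, userdata, msg):
    print("subscriber got:", msg.topic, msg.payload)

# Subscriber lands on broker 1 and subscribes to the wildcard topic.
sub = mqtt.Client(client_id="sub-1")
sub.on_message = on_message
sub.connect(*BROKER_1)
sub.subscribe("test/blatt/#")
sub.loop_start()

# Publisher lands on broker 2 (e.g. via leastconn balancing) and publishes.
pub = mqtt.Client(client_id="pub-1")
pub.connect(*BROKER_2)
pub.loop_start()
pub.publish("test/blatt/foo", "hello")

time.sleep(2)
# Unless the two brokers are bridged, the subscriber never sees "test/blatt/foo".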
Turns out I was running out of resources on my computer.
I moved the simulator to another machine and managed to get 15k connections running. Due to resource limits I can't get more than that. The computer running the server side uses 20/32 GB of RAM, and the computer running the simulator used 32/32 GB for approximately 15k devices. Now I see why running both on the same computer is not an option.

Scrapy - Set TCP Connect Timeout

I'm trying to scrape a website via Scrapy. However, the website is extremely slow at times and takes almost 15-20 seconds to respond to the first request in a browser. Sometimes, when I try to crawl the website using Scrapy, I keep getting a TCP timeout error, even though the website opens just fine in my browser. Here's the message:
2017-09-05 17:34:41 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.hosane.com/result/specialList> (failed 16 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
I have even overridden the USER_AGENT setting for testing.
I don't think the DOWNLOAD_TIMEOUT setting works in this case, since it defaults to 180 seconds, and Scrapy doesn't even take 20-30 seconds before giving a TCP timeout error.
Any idea what is causing this issue? Is there a way to set TCP timeout in Scrapy?
A TCP connection timed out error can happen before the Scrapy-specified DOWNLOAD_TIMEOUT because the actual initial TCP connect timeout is defined by the OS, usually in terms of TCP SYN packet retransmissions.
By default on my Linux box, I have 6 retransmissions:
cat /proc/sys/net/ipv4/tcp_syn_retries
6
which, in practice, for Scrapy too, means 0 + 1 + 2 + 4 + 8 + 16 + 32 (+64) = 127 seconds before receiving a twisted.internet.error.TCPTimedOutError: TCP connection timed out: 110: Connection timed out. from Twisted. (That's the initial attempt, then exponential backoff between retransmissions, and no reply received after the 6th retransmission.)
If I set /proc/sys/net/ipv4/tcp_syn_retries to 8 for example, I can verify that I receive this instead:
User timeout caused connection failure: Getting http://www.hosane.com/result/specialList took longer than 180.0 seconds.
That's because 0+1+2+4+8+16+32+64+128(+256) > 180.
10060: A connection attempt failed... seems to be a Windows socket error code. If you want the TCP connection timeout to be at least as long as DOWNLOAD_TIMEOUT, you'll need to change the TCP SYN retry count. (I don't know how to do that on your system, but Google is your friend.)
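As a rough illustration of the arithmetic above, a small Python sketch assuming the usual 1-second initial SYN retransmission timeout that doubles on each retry:

# Total time spent waiting for a SYN/ACK before the OS gives up,
# assuming a 1 s initial retransmission timeout that doubles each retry.
def syn_timeout_seconds(syn_retries: int) -> int:
    return sum(2 ** i for i in range(syn_retries + 1))

print(syn_timeout_seconds(6))  # 127: TCPTimedOutError fires before DOWNLOAD_TIMEOUT (180 s)
print(syn_timeout_seconds(8))  # 511: DOWNLOAD_TIMEOUT (180 s) fires first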

What happens to a waiting WebSocket connection on a TCP level when server is busy (blocked)

I am load testing my WebSocket Tornado server, running on Ubuntu Server 14.04.
I am playing with a big client machine loading 60,000 users, 150 a second (that's what my small server can comfortably take). The client is a RedHat machine. When a load test suite finishes, I have to wait a few seconds before I can rerun it.
Within these few seconds, my WebSocket server is handling the closing of the 60,000 connections. I can see it in my Graphite dashboard (the server logs every connect and disconnect there).
I am also logging relevant outputs of the netstat -s and ss -s commands to my Graphite dashboard. When the test suite finishes, I can immediately see the TCP established count dropping from 60,000 to ~0. Other socket states (closed, timewait, synrecv, orphaned) remain constant and very low. My client's sockets go to timewait for a short period and then this number goes to 0 too. When I immediately rerun the suite, and all the TCP sockets on both ends are free but the server has not finished processing the previous closing batch yet, I see no changes at the TCP socket level until the server is finished processing and starts accepting new connections again.
My question is: where is the information about the sockets waiting to be established stored (RedHat and Ubuntu)? No counter/queue length that I am tracking shows this.
Thanks in advance.
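For context, a minimal sketch of the kind of monitoring described above: polling ss -s and pushing the established-connection count to Graphite over its plaintext protocol (the Graphite host, port, and metric name are placeholders):

import re
import socket
import subprocess
import time

GRAPHITE = ("graphite.example.com", 2003)  # hypothetical Graphite host

def established_count() -> int:
    # 'ss -s' prints a line like:
    # TCP:   61234 (estab 60000, closed 1200, orphaned 0, synrecv 0, timewait 34/0), ports 0
    out = subprocess.check_output(["ss", "-s"], text=True)
    match = re.search(r"estab (\d+)", out)
    return int(match.group(1)) if match else 0

while True:
    line = f"loadtest.tcp.estab {established_count()} {int(time.time())}\n"
    with socket.create_connection(GRAPHITE) as graphite:
        graphite.sendall(line.encode())
    time.sleep(5)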

SignalR not recognizing some transport disconnects

I have noticed that SignalR does not recognize disconnection events at the transport layer in some circumstances. If the transport is disconnected gracefully, either by dropping my VPN connection to the server or by issuing an ipconfig /release, the appropriate events are fired. If I am connected to the VPN and I either unplug the network cable or turn my wireless off, SignalR does not recognize we are disconnected until the VPN connection times out. That is mostly acceptable because it does eventually realize it. However, if I am not using a VPN and I simply unplug the network cable or turn off the wireless, SignalR never recognizes that a disconnect has occurred. This is using SignalR 2.0.1 over WebSockets in a non-CORS environment. If I enable logging in the hub, I see no events logged in the console after waiting for over 5 minutes. One thing that bothers me that I do see in the log during startup is:
SignalR: Now monitoring keep alive with a warning timeout of 1600000 and a connection lost timeout of 2400000.
Is this my problem? I have tried manually setting my GlobalHost.Configuration.DisconnectTimeout to 30 seconds but it does not change any behavior in this scenario, or alter that log statement. What else might I be overlooking here?
Edit: I noticed in Fiddler that my negotiate response has a disconnect timeout of 3600 and a keepalive of 2400, and that trywebsockets is false. This particular server is 2008 R2, which I do not believe supports WebSockets. Does that mean long polling would be used? I don't see any long-polling requests going out in Fiddler or the console.
The timeout settings were my problem. I'm not sure why I originally did not see a change in the logging output when I adjusted the setting, but I went back to a sample app and saw things change there and now things are behaving properly.
It should be noted that the default settings for SignalR produce the following statement in the logging output, and that these values are measured in milliseconds.
SignalR: Now monitoring keep alive with a warning timeout of 13333.333333333332 and a connection lost timeout of 20000
This is obvious when you read the information under Transport Disconnect Scenarios on the following page, which says: the default keepalive timeout warning period is 2/3 of the keepalive timeout. The keepalive timeout is 20 seconds, so the warning occurs at about 13 seconds.
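A small sketch of the relationship described above, using only the numbers from the two log lines (the warning fires at 2/3 of the connection-lost timeout):

# Warning threshold is 2/3 of the connection-lost (keepalive) timeout, in milliseconds.
def keepalive_warning_ms(connection_lost_timeout_ms: float) -> float:
    return connection_lost_timeout_ms * 2 / 3

print(keepalive_warning_ms(20000))    # ~13333.33 ms, the default log line above
print(keepalive_warning_ms(2400000))  # 1600000 ms, the misconfigured log line in the question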

TCP Socket Closed

I always thought that if you didn't implement a heartbeat, there was no way to know if one side of a TCP connection died unexpectedly. If the process was just killed on one side and didn't exit gracefully, there was no way for the socket to send a FIN or let the other side know that it was closed.
(See some of the comments here for example http://www.perlmonks.org/?node_id=566568 )
But there is a stock order server that I connect to that has a new "cancel all orders on disconnect" feature that cancels live orders if the client disconnects. It works even when I kill the process on my end, and there is definitely no heartbeat from my app to it.
So how is it able to detect when I've killed the process? My app is running on Windows Server 2003 and the order server is on SUSE Linux Enterprise Server 10. Does Windows detect that the process associated with the socket is no longer alive and send the FIN?
When a process exits - for whatever reason - the OS will close the TCP connections it had open.
There are numerous other ways a TCP connection can go dead undetected:
someone yanks out a network cable in between.
the computer at the other end gets nuked.
a NAT gateway in between silently drops the connection.
the OS at the other end crashes hard.
the FIN packets get lost.
By enabling TCP keepalive, you'll detect it eventually, at least within a couple of hours.
It could be using TCP keepalive to check for dead peers:
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html
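For illustration, a minimal sketch of enabling TCP keepalive on a client socket in Python on Linux; the server address and timing values are placeholders:

import socket

# Hypothetical server address.
HOST, PORT = "order-server.example.com", 9000

sock = socket.create_connection((HOST, PORT))

# Turn on TCP keepalive for this connection.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific tuning (illustrative values): start probing after 60 s of idle,
# probe every 10 s, and give up after 5 failed probes.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)

# With these settings, a dead peer (crashed host, yanked cable, NAT drop, etc.)
# is detected roughly 60 + 5 * 10 = 110 seconds after the link goes silent.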
As far as I know, the OS detects the process termination and closes all the file descriptors/sockets/handles the process was using. So there isn't a difference between "killing" the application and "gracefully terminating" it. Of course, the kernel itself must be running (PC turned on, wire connected...), but it's the OS's job to send the FIN and so on.
Also, if a host becomes unreachable (turned off, disconnected...), an intermediate gateway (or the client itself) may detect the event (e.g. loss of carrier, DHCP lease not renewed...) and reply to the packets sent to the dead host with an ICMP error (host/network unreachable). This causes the peer's TCP connection to die, but it happens only if the client has some packet to send to the host.
