ipvsadm -L -n suddenly showing no active connections - squid

I have a very odd problem in a proxy cluster of four Squid proxies:
One of the machine is the master. The mater is running ldirectord which is checking the availability of all four machines, distributing new client connections.
All over a sudden, after years of operation I'm encountering this problem:
1) The machine serving the master role is not being assigned new connections, old connections are served until a new proxy is assigned to the clients.
2) The other machines are still processing requests, taking over the clients from the master (so far, so good)
3) "ipvsadm -L -n" shows ever-decreasing ActiveConn and InActConn values.
Once I migrate the master role to another machine, "ipvsadm -L -n" is showing lots of active and inactive connections, until after about an hour the same thing happens on the new master.
Datapoint: This happened again this afternoon, and now "ipvsadm -L -n" shows:
TCP 141.42.1.215:8080 wlc persistent 1800
-> 141.42.1.216:8080 Route 1 98 0
-> 141.42.1.217:8080 Route 1 135 0
-> 141.42.1.218:8080 Route 1 1 0
-> 141.42.1.219:8080 Route 1 2 0
No change in the numbers quite some time now.
Some more stats (ipvsadm -L --stats -n):
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Conns InPkts OutPkts InBytes OutBytes
-> RemoteAddress:Port
TCP 141.42.1.215:8080 1990351 87945600 0 13781M 0
-> 141.42.1.216:8080 561980 21850870 0 2828M 0
-> 141.42.1.217:8080 467499 23407969 0 3960M 0
-> 141.42.1.218:8080 439794 19364749 0 2659M 0
-> 141.42.1.219:8080 521378 23340673 0 4335M 0
Value for "Conns" is constant now for all realservers and the virtual server now. Traffic is still flowing (InPkts increasing).
I examined the output of "ipvsadm -L -n -c" and found:
25 FIN_WAIT
534 NONE
977 ESTABLISHED
Then I waited a minute and got:
21 FIN_WAIT
515 NONE
939 ESTABLISHED

It turns out that a local bird installation was injecting router for the IP of the virtual server and thus taking precedence over ARP.

Related

Are TCP/UDP IP packets with a source port below 1024 possible

I am analyzing some events against dns servers running unbound. In the course of this investigation I am running into traffic involving queries to the dns servers that are reported as having in some cases a source port between 1 and 1024. As far as my knowledge goes these are reserved for services so there should never be traffic originating / initiated from those to a server.
Since I also know this is a practice, not a law, that evolved over time, I know there is no technical limitation to put any number in the source port field of a packet. So my conclusion would be that these queries were generated by some tool in which the source port is filled with a random value (the frequency is about evenly divided over 0-65535, except for a peak around 32768) and that this is a deliberate attack.
Can someone confirm/deny the source port theory and vindicate my conclusion or declare me a total idiot and explain why?
Thanks in advance.
Edit 1: adding more precise info to settle some disputes below that arose due to my incomplete reporting.
It's definitely not a port scan. It was traffic arriving on port 53 UDP and unbound accepted it apparently as an (almost) valid dns query, while generating the following error messages for each packet:
notice: remote address is <ipaddress> port <sourceport>
notice: sendmsg failed: Invalid argument
$ cat raw_daemonlog.txt | egrep -c 'notice: remote address is'
256497
$ cat raw_daemonlog.txt | egrep 'notice: remote address is' | awk '{printf("%s\n",$NF)}' | sort -n | uniq -c > sourceportswithfrequency.txt
$ cat sourceportswithfrequency.txt | wc -l
56438
So 256497 messages, 56438 unique source ports used
$ cat sourceportswithfrequency.txt | head
5 4
3 5
5 6
So the lowest source port seen was 4 which was used 5 times
$ cat sourceportswithfrequency.txt | tail
8 65524
2 65525
14 65526
1 65527
2 65528
4 65529
3 65530
3 65531
3 65532
4 65534
So the highest source port seen was 65534 and it was used 4 times.
$ cat sourceportswithfrequency.txt | sort -n | tail -n 25
55 32786
58 35850
60 32781
61 32785
66 32788
68 32793
71 32784
73 32783
88 32780
90 32791
91 32778
116 2050
123 32779
125 37637
129 7077
138 32774
160 32777
160 57349
162 32776
169 32775
349 32772
361 32773
465 32769
798 32771
1833 32768
So the peak around 32768 is real.
My original question still stands: does this traffic pattern suggest an attack or is there an logical explanation for, for instance, the traffic with source ports < 1024?
As far as my knowledge goes these are reserved for services so there should never be traffic originating / initiated from those to a server.
It doesn't matter what the source port number is, as long as it's between 1 and 65,535. It's not like a source port of 53 means that there is a DNS server listening on the source machine.
The source port is just there to allow multiple connections / in-flight datagrams from one machine to another machine on the same destination port.
See also Wiki: Ephemeral port:
The Internet Assigned Numbers Authority (IANA) suggests the range 49152 to 65535 [...] for dynamic or private ports.[1]
That sounds like a port scan.
There are 65536 distinct and usable port numbers. (ibid.)
FYI: The TCP and UDP port 32768 is registered and used by IBM FileNet TMS.

eucalyptus-node-keygen.service is showing failed on node controller

i have installed node controller on centos 7. And I am running command systemctl and it is showing that eucalyptus-node service is active and running but eucalyptus-node-keygen.service is failed. How do I fix this issue?
The eucalyptus-node-keygen.service generates keys that are used with instance migration. The service runs conditionally to generate the keys when required, if keys are present then they do not need to be generated.
# systemctl cat eucalyptus-node-keygen.service | grep Condition
ConditionPathExists=|!/etc/pki/libvirt/servercert.pem
#
# stat -t /etc/pki/libvirt/servercert.pem
/etc/pki/libvirt/servercert.pem 1298 8 81a4 0 0 fd00 833392 1 0 0 1582596904 1582596904 1582596904 0 4096 system_u:object_r:cert_t:s0
so typically this service will show "start condition failed" which is not an error, and no action is required.

Phusion Passenger Standalone seems to be on but nothing appears in browser

I ssh to the dev box where I am suppose to setup Redmine. Or rather, downgrade Redmine. In January I was asked to upgrade Redmine from 1.2 to 2.2. But the plugins we wanted did not work with 2.2. So now I'm being asked to setup Redmine 1.3.3. We figure we can upgrade from 1.2 to 1.3.3.
In January I had trouble getting Passenger to work with Nginx. This was on a CentOS box. I tried several installs of Nginx. I'm left with different error logs:
This:
whereis nginx.conf
gives me:
nginx: /etc/nginx
but I don't think that is in use.
This:
find / -name error.log
gives me:
/opt/nginx/logs/error.log
/var/log/nginx/error.log
When I tried to start Passenger again I was told something was already running on port 80. But if I did "passenger stop" I was told that passenger was not running.
So I did:
passenger start -p 81
If I run netstat I see something is listening on port 81:
netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 localhost:81 localhost:42967 ESTABLISHED
tcp 0 0 10.0.1.253:ssh 10.0.1.91:51874 ESTABLISHED
tcp 0 0 10.0.1.253:ssh 10.0.1.91:62993 ESTABLISHED
tcp 0 0 10.0.1.253:ssh 10.0.1.91:62905 ESTABLISHED
tcp 0 0 10.0.1.253:ssh 10.0.1.91:50886 ESTABLISHED
tcp 0 0 localhost:81 localhost:42966 TIME_WAIT
tcp 0 0 10.0.1.253:ssh 10.0.1.91:62992 ESTABLISHED
tcp 0 0 localhost:42967 localhost:81 ESTABLISHED
but if I point my browser here:
http: // 10.0.1.253:81 /
(StackOverFlow does not want me to publish the IP address, so I have to malform it. There is no harm here as it is an internal IP that no one outside my company could reach.)
In Google all I get is "Oops! Google Chrome could not connect to 10.0.1.253:81".
I started Phusion Passenger at the command line, and the output is verbose, and I expect to see any error messages in the terminal. But I'm not seeing anything. It's as if my browser request is not being heard, even though netstat seems to indicate the app is listening on port 81.
A lot of other things could be wrong with this app (I still need to reverse migrate the database schema) but I'm not seeing any of the error messages that I expect to see. Actually, I'm not seeing any error messages, which is very odd.
UPDATE:
If I do this:
ps aux | grep nginx
I get:
root 20643 0.0 0.0 103244 832 pts/8 S+ 17:17 0:00 grep nginx
root 23968 0.0 0.0 29920 740 ? Ss Feb13 0:00 nginx: master process /var/lib/passenger-standalone/3.0.19-x86_64-ruby1.9.3-linux-gcc4.4.6-1002/nginx-1.2.6/sbin/nginx -c /tmp/passenger-standalone.23917/config -p /tmp/passenger-standalone.23917/
nobody 23969 0.0 0.0 30588 2276 ? S Feb13 0:34 nginx: worker process
I tried to cat the file /tmp/passenger-standalone.23917/config but it does not seem to exist.
I also killed every session of "screen" and every terminal window where Phusion Passenger might be running, but clearly, looking at ps aux, it looks like something is running.
Could the Nginx be running even if the Passenger is killed?
This:
ps aux | grep phusion
brings back nothing
and this:
ps aux | grep passenger
Only brings back the line with nginx.
If I do this:
service nginx stop
I get:
nginx: unrecognized service
and:
service nginx start
gives me:
nginx: unrecognized service
This is a CentOS machine, so if I had Nginx installed normally, this would work.
The answer is here - Issue Uploading Files from Rails app hosted on Elastic Beanstalk
You probably have /etc/cron.daily/tmpwatch removing the /tmp/passenger-standalone* files every day, and causing you all this grief.

Scaling nginx with static files -- non-Persistent requests kill req/s

Working on a project where we need to server a small static xml file ~40k / s.
All incoming requests are sent to the server from HAProxy. However, none of the requests will be persistent.
The issue is that when benchmarking with non-Persistent requests, the nginx instance caps out at 19 114 req/s. When persistent connections are enabled, performance increases by nearly an order of magnitude, to 168 867 req/s. The results are similar with G-wan.
When benchmarking non-persistent requests, CPU usage is minimal.
What can I do to increase performance with non-persistent connections and nginx?
[root#spare01 lighttpd-weighttp-c24b505]# ./weighttp -n 1000000 -c 100 -t 16 "http://192.168.1.40/feed.txt"
finished in 52 sec, 315 millisec and 603 microsec, 19114 req/s, 5413 kbyte/s
requests: 1000000 total, 1000000 started, 1000000 done, 1000000 succeeded, 0 failed, 0 errored
status codes: 1000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 290000000 bytes total, 231000000 bytes http, 59000000 bytes data
[root#spare01 lighttpd-weighttp-c24b505]# ./weighttp -n 1000000 -c 100 -t 16 -k "http://192.168.1.40/feed.txt"
finished in 5 sec, 921 millisec and 791 microsec, 168867 req/s, 48640 kbyte/s
requests: 1000000 total, 1000000 started, 1000000 done, 1000000 succeeded, 0 failed, 0 errored
status codes: 1000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 294950245 bytes total, 235950245 bytes http, 59000000 bytes data
Your 2 tests are similar (except HTTP Keep-Alives):
./weighttp -n 1000000 -c 100 -t 16 "http://192.168.1.40/feed.txt"
./weighttp -n 1000000 -c 100 -t 16 -k "http://192.168.1.40/feed.txt"
And the one with HTTP Keep-Alives is 10x faster:
finished in 52 sec, 19114 req/s, 5413 kbyte/s
finished in 5 sec, 168867 req/s, 48640 kbyte/s
First, HTTP Keep-Alives (persistant connections) make HTTP requests run faster because:
Without HTTP Keep-Alives, the client must establish a new CONNECTION for EACH request (this is slow because of the TCP handshake).
With HTTP Keep-Alives, the client can send all requests at once (using the SAME CONNECTION). This is faster because there are less things to do.
Second, you say that the static file XML size is "small".
Is "small" nearer to 1 KB or 1 MB? We don't know. But that makes a huge difference in terms of available options to speedup things.
Huge files are usually served through sendfile() because it works in the kernel, freeing the usermode server from the burden of reading from disk and buffering.
Small files can use more flexible options available for application developers in usermode, but here also, file size matters (bytes and kilobytes are different animals).
Third, you are using 16 threads with your test. Are you really enjoying 16 PHYSICAL CPU Cores on BOTH the client and the server machines?
If that's not the case, then you are simply slowing-down the test to the point that you are no longer testing the web servers.
As you see, many factors have an influence on performance. And there are more with OS tuning (the TCP stack options, available file handles, system buffers, etc.).
To get the most of a system, you need to examinate all those parameters, and pick the best for your particular exercise.

How to kill tcp connection with CLOSE_WAIT state

I found a problem on my web application, hibernate connections does not close properly.
but given the complexity of the web application, it takes at least 15 - 30 days.
In the meantime I wanted to manually close the connection.
In that way I can close this connection without restart tomcat?
There is a command that I can use for kill this pool of connection?
I have found an error in hibernate configuration, to solve
#netstat -anp |grep 3306 |grep CLOSE_WAIT
tcp 1 0 ::ffff:172.18.11.4:50750 ::ffff:172.18.11.8:3306 CLOSE_WAIT 4203/java
tcp 1 0 ::ffff:172.18.11.4:36192 ::ffff:172.18.11.8:3306 CLOSE_WAIT 4203/java
tcp 1 0 ::ffff:172.18.11.4:36215 ::ffff:172.18.11.8:3306 CLOSE_WAIT 4203/java
tcp 1 0 ::ffff:172.18.11.4:36211 ::ffff:172.18.11.8:3306 CLOSE_WAIT 4203/java
tcp 1 0 ::ffff:172.18.11.4:57820 ::ffff:172.18.11.8:3306 CLOSE_WAIT 4203/java
tcp 1 0 ::ffff:172.18.11.4:36213 ::ffff:172.18.11.8:3306 CLOSE_WAIT 4203/java
tcp 1 0 ::ffff:172.18.11.4:36159 ::ffff:172.18.11.8:3306 CLOSE_WAIT 4203/java
etc....
CentOS 6.0 running Tomcat 5.5 and Mysql 5.5.
Always call socket.close(). See also how to close JDBC resources properly every time.
If you can't fix the server, add the following lines
to /etc/init.d/inetinit
/usr/sbin/ndd -set /dev/tcp tcp_close_wait_interval 1500
/usr/sbin/ndd -set /dev/tcp tcp_keepalive_interval 1500
and reboot. According to http://www.experts-exchange.com/OS/Unix/Solaris/Q_20568402.html
Alternatively, on Linux, try tcpkill (part of dsniff) or cutter.
There was some other question about this I can't find now. But you can try killcx and cutter. I can't find a link to cutter but it's found in debian repos. Make sure to pick the tcp connection killer and not the unit testing framework with same name.
update: there seems to be a windows version wkillcx
update2: thanks Xiong Chiamiov for cutter link

Resources