Network connection reset after intensive operation - networking

I am doing a measurement project where I send and receive data from numerous devices on my network. The send/receive is fast and intensive: there is almost no pause, just a continuous flow of data, though the data to/from each device is quite small, on the order of a couple of bytes each. For some reason, I am experiencing a reset of my entire Ethernet connection in which my internet connection goes down and I lose connection to all my devices as well. I have never experienced such a situation and am wondering: what are some common situations that might lead to resets like this?

Actually, it turns out this had to do with the way I constructed my thread: I was creating a new socket continuously without discarding the previous one. Stupid, right? Well, I fixed the code and the Ethernet no longer crashes.
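For anyone hitting the same thing, here is a minimal sketch of the leaky pattern and the fix, assuming a plain java.net TCP client; the host, port, and payload bytes are placeholders, not anything from the original code:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

public class DeviceReader {

    // Buggy pattern: a new Socket every iteration, never closed.
    // Each leaked socket holds a file descriptor and an ephemeral port,
    // and eventually the OS or NIC driver runs out of resources.
    static void leakyLoop(String host, int port) throws IOException {
        while (true) {
            Socket s = new Socket(host, port);           // never closed
            s.getOutputStream().write(new byte[] {0x01, 0x02});
            // ... read the reply, move on, leak the socket ...
        }
    }

    // Fixed pattern: open the socket once and reuse it, wrapped in
    // try-with-resources so it is always closed when the loop exits.
    static void fixedLoop(String host, int port) throws IOException {
        try (Socket s = new Socket(host, port)) {
            OutputStream out = s.getOutputStream();
            while (true) {
                out.write(new byte[] {0x01, 0x02});      // a couple of bytes per device
                out.flush();
                // ... read the small reply from s.getInputStream() ...
            }
        } // socket is closed here even if an exception is thrown
    }
}
```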

Related

Multiple IoT devices communicating with a server asynchronously via TCP

I want multiple IoT devices (say 50) communicating with a server directly and asynchronously via TCP. Assume all of them have a heartbeat pulse every 30 seconds and may drop off and reconnect at variable times.
Can anyone advise me on the best way to make sure no data is dropped or blocked when multiple devices are communicating simultaneously?
TCP by itself ensures no data loss during communication between a client and a server. It does that through sequence numbers and ACK messages.
Technically, before the actual data transfer happens, a TCP connection is created between the client (which can be an IoT device or any other device) and the server. Then the data is split into multiple packets and sent over the network through that connection. All TCP mechanisms like flow control, error detection, congestion control, and many others take place once the data starts to flow.
The wiki page for TCP is a pretty good start if you want to learn more about how it works.
Apart from that, as long as your server has enough capacity to support the flow of requests coming from the devices, then everything should work (at least in theory).
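To make "enough capacity" concrete, here is a minimal sketch of a thread-per-connection TCP server in Java that would comfortably handle ~50 devices sending small heartbeats. The port number and the newline-terminated framing are assumptions for the sake of the example, not anything the question prescribes:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class HeartbeatServer {

    public static void main(String[] args) throws Exception {
        int port = 9000;                                  // assumed port
        ExecutorService pool = Executors.newCachedThreadPool();

        try (ServerSocket server = new ServerSocket(port)) {
            while (true) {
                Socket device = server.accept();          // one connection per device
                pool.submit(() -> handle(device));        // each device gets its own thread
            }
        }
    }

    // Reads newline-terminated heartbeat messages until the device drops off.
    // TCP guarantees ordered, uncorrupted delivery on this connection; the
    // application only has to cope with the connection going away.
    private static void handle(Socket device) {
        try (Socket s = device;
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream(), StandardCharsets.US_ASCII))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(s.getRemoteSocketAddress() + " -> " + line);
            }
        } catch (Exception e) {
            // A dropped device shows up here; it will simply reconnect later.
        }
    }
}
```

With only 50 mostly-idle connections, a plain thread-per-connection design like this is usually simpler and just as effective as an NIO/event-loop server.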
I don't think you are asking the right question. There is no way to make sure that no data is dropped or blocked. Networks do not always work (that is why the word work is in network, to convince you otherwise).
The right question is: how do I make my distributed system as available and reliable as possible? The answer involves viewing interruption and congestion as part of normal operation, and building your software appropriately.
There is a timeless USENIX/ACM paper from the late '70s/early '80s that invigorated the notion that end-to-end protocols are much more effective than over-featured middle-to-middle protocols; most guarantees of middle-to-middle amount to best effort. If you rely upon those guarantees, you are bound to fail. Sorry, I cannot find the reference right now (it is most likely "End-to-End Arguments in System Design" by Saltzer, Reed, and Clark), but it is widely cited.
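In practice, "interruption as part of normal operation" mostly means the device side reconnects on its own. Here is a minimal sketch of a reconnect loop with exponential backoff; the 30-second heartbeat interval comes from the question, while the host, port, and backoff limits are assumptions:

```java
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class HeartbeatClient {

    public static void main(String[] args) throws InterruptedException {
        String host = "server.example.com";   // assumed server address
        int port = 9000;                      // assumed port
        long backoffMs = 1_000;               // start with a 1 s retry delay

        while (true) {
            try (Socket s = new Socket(host, port)) {
                backoffMs = 1_000;            // connected: reset the backoff
                OutputStream out = s.getOutputStream();
                while (true) {
                    out.write("heartbeat\n".getBytes(StandardCharsets.US_ASCII));
                    out.flush();
                    Thread.sleep(30_000);     // 30-second heartbeat from the question
                }
            } catch (Exception e) {
                // Connection refused, reset, or timed out: treat it as normal.
                // Wait, then try again, doubling the delay up to 60 s.
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 60_000);
            }
        }
    }
}
```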

RS232 Alicat and LabVIEW communication drop

At the moment I have a problem I cannot pin down. Seemingly at random, my communication with my RS232 Alicat device will get held up. It will get held up somewhere in the read or write process and be unable to complete it. Upon closing the VI I get a "Resetting VI" error in LabVIEW 2020. I am using 7 of the 9 RS232 ports. My questions are:
How do I fix this problem so that I do not get a communication drop OR (more likely)
How do I code the system so that I can catch and recover from this problem, or reset the connection? Something like a VISA read/write timeout? I'm open to ideas on how to move past the block.
Here is what I have gathered about the problem:
Windows 10, I’ve tested everything on multiple computers. It happens no matter what.
It happens at random. It might happen twice within 20 minutes or not for a couple of hours.
I have never experienced the error when probing the line. I don’t know if that is a clue, or if that speaks to the randomness of the problem
Baud rate = 9600. Prior to this I was running at 19,200 and experienced equivalent issues. The manufacturer recommended lowering the baud rate to reduce noise. I have also isolated the cable from other parts of the hardware. At this point noise on the connection is not an issue, but I am still experiencing the error.
My buffer size is 1000 bytes.
My termination character is \r. I cannot imagine a scenario where it fails to read a termination character due to the size of my buffer.
I'm querying it every 50ms. Far below the threshold of a standard timeout. Too much?
What I am currently testing.
Due to how my code block is set up, I cannot yet confirm whether it is getting locked up on the read block, the write block, or both. I'm making only minor modifications to see if I can isolate the problem.
Attached is a slimmed-down version of my code that I isolated the error to.
I have experienced similar problems with some RS232 devices from different suppliers. The (quite bad) solution was to connect and disconnect for each communication command. The question would be what sample rate you need.
Another idea is to replace that device with an Ethernet device. If I am not mistaken, Alicat supplies those with Modbus TCP.
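If you do go the Modbus TCP route, the protocol is simple enough to exercise from any language without a vendor driver. Here is a minimal sketch of a "read holding registers" request over a plain socket; the IP address, unit ID, and register numbers are placeholders, and the real ones would have to come from the Alicat documentation:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;

public class ModbusTcpRead {

    public static void main(String[] args) throws Exception {
        String deviceIp = "192.168.1.50";  // hypothetical device address
        int unitId = 1;                    // hypothetical Modbus unit id
        int startRegister = 0;             // hypothetical register; see the device manual
        int count = 2;                     // read two 16-bit registers

        try (Socket s = new Socket(deviceIp, 502)) {       // 502 is the standard Modbus TCP port
            DataOutputStream out = new DataOutputStream(s.getOutputStream());
            DataInputStream in = new DataInputStream(s.getInputStream());

            // MBAP header: transaction id, protocol id (0), remaining length (6), unit id
            out.writeShort(1);
            out.writeShort(0);
            out.writeShort(6);
            out.writeByte(unitId);
            // PDU: function 0x03 (read holding registers), start address, register count
            out.writeByte(0x03);
            out.writeShort(startRegister);
            out.writeShort(count);
            out.flush();

            // Response: MBAP header (7 bytes), function code, byte count, register data
            in.skipBytes(7);
            int function = in.readUnsignedByte();
            int byteCount = in.readUnsignedByte();
            if (function != 0x03) {
                System.out.println("Modbus exception, code " + byteCount);
                return;
            }
            for (int i = 0; i < byteCount / 2; i++) {
                System.out.println("register " + (startRegister + i) + " = " + in.readUnsignedShort());
            }
        }
    }
}
```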
The issue turned out to be specific to Windows/my laptop. There is a USB setting that disables inactive USB ports after a certain amount of time. The setting to disable the timeout was unavailable through the Control Panel on my laptop, though it was available on my coworker's. I had to use PowerShell commands to change the setting.

RxAndroidBle multiple connections

Background:
I am using the RxAndroidBle library and have a requirement to quickly (as possible) connect to multiple devices at a time and start communicating. I used the RxBluetoothKit for iOS, and have started to use RxAndroidBle on my Pixel 2. This had worked as expected and I could establish connections to 6-8 devices, as required, in a few hundred milliseconds. However, broadening my testing to phones such as the Samsung S8 and Nexus 6P, it seems that establishing a single connection can now take upwards of 5-6 seconds instead of 50-60 millis. I will assume for the moment that that disparity is within the vendor-specific BT implementations. Ultimately, this means that connecting to, e.g., 5 devices now takes 30 seconds instead of < 1 second.
Question:
From what I understand from the documentation and other questions, RxAndroidBle queues all scanning, connecting, and communication requests and executes them serially to be safe and to maintain stability across the variety of Bluetooth implementations in the Android ecosystem. However, is there currently a way to execute the requests (namely, connecting) in parallel, accepting this risk, and potentially cut my total time to establish multiple connections down to whichever device takes the longest to connect?
And side question: are there any ideas to diagnose what could possibly be taking 5 seconds to establish a connection with a device? Or do we simply need to accept that some phones will take that long in some instances?
However, is there currently a way to execute the requests (namely, connecting) in parallel to accept this risk and potentially cut my total time to establish multiple connections down to whichever device takes the longest to connect?
Yes. You may try to establish connections using autoConnect=true, which prevents locking the queue for longer than a few milliseconds. The last connection should be started with autoConnect=false to kick off a scan. Some stack implementations handle this quite well, but your mileage may vary.
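A sketch of what that could look like with the RxJava2 flavour of the library (package com.polidea.rxandroidble2); the exact establishConnection overload and package names depend on your RxAndroidBle version, so treat this as an outline rather than a drop-in:

```java
import android.content.Context;

import com.polidea.rxandroidble2.RxBleClient;
import com.polidea.rxandroidble2.RxBleConnection;
import com.polidea.rxandroidble2.RxBleDevice;

import java.util.ArrayList;
import java.util.List;

import io.reactivex.Observable;
import io.reactivex.disposables.Disposable;

public class MultiConnect {

    // autoConnect=true for all but the last device (those requests do not hold
    // RxAndroidBle's internal queue for long), autoConnect=false for the last
    // one to kick off the scan/connect cycle.
    static Disposable connectAll(Context context, List<String> macAddresses) {
        RxBleClient client = RxBleClient.create(context);
        List<Observable<RxBleConnection>> connections = new ArrayList<>();

        for (int i = 0; i < macAddresses.size(); i++) {
            RxBleDevice device = client.getBleDevice(macAddresses.get(i));
            boolean autoConnect = i < macAddresses.size() - 1; // last device: autoConnect=false
            connections.add(device.establishConnection(autoConnect));
        }

        // merge() subscribes to all connection observables in parallel.
        return Observable.merge(connections)
                .subscribe(
                        connection -> { /* start communicating on this connection */ },
                        throwable -> { /* a single failed connection terminates the merge; handle/retry here */ });
    }
}
```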
And side question: are there any ideas to diagnose what could possibly be taking 5 seconds to establish a connection with a device?
You can check the Bluetooth HCI snoop log. Also you may try using a BLE sniffer to check what is actually happening "on-air" (e.g. an nRF51 Development Kit).
Or do we simply need to accept that some phones will take that long in some instances?
This is also an option since usually there is little one can do about connecting time. From my experience BLE stack/firmware implementations are wildly different from each other.

Controlling TCP connection at packet level

I'm having an issue with some embedded mobile devices that have a buggy TCP stack. We're trying to update these devices but the firmware download fails, unless the mobile connection is very very good. Since it's an EDGE connection, it's usually bad.
Part of the problem is that the devices need quite a bit of time to write the data to storage. This is probably what leads to packet loss, but the connection never recovers.
I'm thinking that if I could control the connection at TCP level, I might be able to get around this problem. We tried changing the congestion control and it doesn't help, but we're still looking into that.
In the meantime I'd like to look into this option. Is there any way to do it, without writing my own TCP stack / kernel module?
I didn't find any way to do this, so in the end I set up a new server and recompiled the Linux kernel with a modified TCP_RTO_MAX value (5 s instead of 120 s). This seems to have solved my problem. My guess is that the network was not the actual issue, but rather the devices taking too long to store the data. This is a very specific case, and this solution wouldn't help in any situation where the network connection is actually slow.

Why does Chrome Timeline show more time than the server?

We are trying to optimize our ASP.NET MVC app and see a big time difference between our server-side logs and the client-side delay.
When I refresh the page in Chrome, the Timeline shows 4.47s:
As I understand from the picture, the time for server side code execution should be 3.34s, but in our server logs we have the following:
Begin Request 15:41:52.421
End Request 15:41:53.218
Pre Send Request Headers 15:41:53.218
Pre Send Request Content 15:41:53.218
So, according to server side logs code execution took only 797ms in total.
It does not happen all the time, and very often the Chrome Timeline shows times very close to the server logs. But sometimes we have this couple-of-seconds delay.
Where could this delay come from?
There is a lot of stuff that can affect the time sporadically to such an extent, even though an addition of almost three seconds is somewhat excessive for this scenario. Since you don't mention much about how your network is set up, what operating system you use, etc.,
I'll try to sum up a list of what comes to my mind when dealing with this sort of delay, sorted by probability.
The main problem here is the Waiting part of the total time; that is where you should concentrate your detective talent.
Please note that the answer is very general since the question says virtually nothing about configuration of the server, client computer or the network (if any) between them. Since you say the delay is not present all the time, there are one or more moving targets you need to aim at.
Antivirus
If you have an internet shield or similarly named component, it is not uncommon for the antivirus to seemingly randomly delay some connections while leaving others virtually untouched. To the browser this is transparent (it's just a delay, whatever may have caused it), hence the Waiting.
Network issue
Especially if you are connected through a wireless network or a poorly configured wired network, a few seconds of delay may occur even though the label on the network device says TurboSpeed™.
Server side issue
Server may be overloaded with previous requests in a manner not covered by your in-application timer, since there are many steps the server performs before and after your script is executed.
Client OS issue
Just like the antivirus, the OS can delay your packets virtually randomly for various reasons.
When hunting down such an issue, I would recommend trying to perform the query on the server itself and comparing the resulting times, trying as many combinations of network setup and operating systems as possible, preferring well-planned network environments to those with many unknown or external factors (read: wireless), and making use of some packet-sniffing software (like Wireshark) to check whether the browser is lying. And that would be just the start of it :)
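One cheap way to do the "compare resulting times" step is to time the same request from the server itself and from the client with a tiny standalone program, bypassing the browser entirely. A minimal sketch using Java's built-in HTTP client; the URL is a placeholder for your MVC action:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RequestTimer {

    public static void main(String[] args) throws Exception {
        String url = args.length > 0 ? args[0] : "http://localhost/your-mvc-action"; // placeholder URL
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();

        // Run the request a few times; compare these wall-clock numbers
        // (measured on the server and on the client) with the server-side log timestamps.
        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("status=" + response.statusCode() + " took " + elapsedMs + " ms");
        }
    }
}
```

If the times measured on the server itself match the server logs but the client-side times do not, the extra delay is somewhere on the path between the two (antivirus, network, or client OS), not in your application code.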
