best practices or solutions for writing resilient programs over slow, unstable networks - networking

A program is using a slow, unstable network: frequent timeouts, slow connections, etc.
The program uses a few REST APIs and even SSH. The previous developer solved timeout problems by checking for the error message and running the same instruction again until it worked. However, sometimes the network connection simply goes dark for a few hours and we have to wait.
We cannot always keep the program alive and waiting (to save power), so we end up serializing its state and having another program watch for network activity, wake it up, and resume it from the serialized state.
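Simplified, the pattern we have ended up with looks roughly like this (a minimal Python sketch; the function names, checkpoint file, and give-up policy here are illustrative, not our real code):

    import json
    import time

    CHECKPOINT = "state.json"   # illustrative name, not our real file

    def save_state(state):
        """Serialize progress so the watcher process can relaunch and resume us."""
        with open(CHECKPOINT, "w") as f:
            json.dump(state, f)

    def retry_until_it_works(operation, state, max_attempts=30, delay=10.0):
        """Re-run a flaky network call until it succeeds; if the network seems
        to have gone dark, checkpoint the state and let the process exit."""
        for _ in range(max_attempts):
            try:
                return operation()
            except (TimeoutError, ConnectionError):
                time.sleep(delay)
        save_state(state)
        raise SystemExit("network appears to be down; state checkpointed")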
The code quality is becoming more and more hackish due to these workarounds.
Are there any best practices or solutions when it comes to writing programs for unstable networks? I'm wondering if someone has already solved this problem, perhaps in a library or a book you could recommend?
Thank you
PS: I have no control over the network infrastructure.

Related

How to avoid crashing my user's router?

It appears that cheap consumer routers are fairly easy to crash: hanging around in various backup/sync software forums, I see this mentioned from time to time. Developers seem to be putting a fair amount of effort into making sure they don't crash the routers.
What are the "do"s and "don't"s for my network-heavy application to ensure that it doesn't cause issues with badly designed routers? Especially one that intends to connect to a number of peers?
IMO, trying to work around bad hardware is a road to nowhere, because every router fails in its own remarkable way :).
What you can do in a network-heavy application is assume that the network is not a stable medium (routers can crash, etc.) and design the application's network operations accordingly.
For instance, provide reconnect logic, connection timeouts, and some sort of state caching to let users keep working with the app even when network connectivity is gone.
Concerning faulty routers - they usually crash because of a great number of simultaneous connections (e.g. downloading via BitTorrent or another P2P protocol), so keeping the number of connections to a minimum can help.
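As a rough illustration of that reconnect logic, the connection timeouts, and the connection cap (a minimal Python sketch with made-up limits, not a drop-in implementation):

    import socket
    import time

    CONNECT_TIMEOUT = 10      # seconds; don't hang forever on a dead router
    MAX_CONNECTIONS = 4       # stay gentle on cheap consumer routers

    _open_sockets = []

    def connect_with_retry(host, port, retries=5, delay=2.0):
        """Open a TCP connection with a timeout, retrying with a growing delay,
        while keeping the number of simultaneous connections small."""
        if len(_open_sockets) >= MAX_CONNECTIONS:
            raise RuntimeError("connection budget exhausted; reuse an existing socket")
        for attempt in range(retries):
            try:
                sock = socket.create_connection((host, port), timeout=CONNECT_TIMEOUT)
                _open_sockets.append(sock)
                return sock
            except OSError:
                if attempt == retries - 1:
                    raise
                time.sleep(delay * (attempt + 1))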

Distributed network monitoring - how to tell if the monitored resource has fallen over, or the monitor itself is at fault

I'm building a system for monitoring several large web sites (resources), using distributed web services controlled by a central controller.
I'm coming to a specific part of the design - the actual reporting of resources that are thought to have fallen over.
My problem is that there is always the chance that the monitor itself is at fault, or has lost its network connection to a resource while the resource is actually fine. I don't want to report issues that are not really there.
My plan at the moment is that when a monitor encounters a problem, it asks all the other monitors to check the resource, and then decides whether the resource has really fallen over based on the collective results.
I'm sure there's someone out there with more experience of this type of programming than myself.
Is there a common solution to this type of problem? Is my solution a decent way of looking at this?
Your solution is one of the only pragmatic ones.
There is nothing new under the sun. The IETF Routing Information Protocol wasn't the first attempt at addressing this problem, but it is well documented and works.
Note well, that there is no optimal (or perfect) solution to the class of problems which you are facing, the best you can do with in-band monitoring is make good guesses about where the fault is. In systems that need a very high degree of accuracy of fault information (e.g. the public switched telephone network) a parallel out-of-band monitoring network is established which itself must necessarily be monitored by humans.
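To make the collective-check idea concrete, here is a minimal sketch in Python; the peer-monitor object and its check method are entirely hypothetical:

    def resource_is_down(resource, local_check, peer_monitors, quorum=0.5):
        """Report the resource as down only if a majority of the monitors that
        could actually be reached agree that it is unreachable."""
        if local_check(resource):
            return False                    # we can still see it; nothing to report
        votes_down, votes_total = 1, 1      # count our own failed check
        for peer in peer_monitors:
            try:
                reachable = peer.check(resource)   # hypothetical remote call
            except ConnectionError:
                continue                    # can't reach the peer: its vote doesn't count
            votes_total += 1
            if not reachable:
                votes_down += 1
        return votes_down / votes_total > quorum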
Quis custodiet ipsos custodes? (Who will watch the watchers?) -- Juvenal, "Satires"

prove network is truly unavailable

I have an old-school FoxPro web app that I am trying to help limp along while I rewrite the system. Multiple times every day, I get the following error message: "The specified network name is no longer available."
Does anyone have any suggestions for how to troubleshoot this - perhaps to prove to my IT guys that there really is a network issue? I have theories, but I have no idea how to prove anything; it always comes back to "FoxPro sucks, rewrite it now."
I'll take any help or tools, and will answer any questions that may clarify this for you.
Thanks
We have a very large multi-user VFP application on hundreds of sites. Occasionally you get this sort of problem. It is almost always down to environmental issues.
Had one just recently where a client had two machines continually crashing out of the VFP application. Network IT guys swearing up and down that it's not their problem. But what's this in the System Log of both machines? Why, it's the Broadcom NIC reporting a network link loss detected at the same times the application crashed.
Check if the client and server NICs in your situation can report this.
You could consider writing a small program that pings the network resource periodically. It might just look for a file; if the network is failing and the program cannot find the file, it emails the folks in charge of the network (and yourself). This should be an independent app, and ideally not written in FoxPro, so you can independently prove that the problem is not the application or the language/tool it was written in.
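A minimal version of such a probe might look like this (Python; the UNC path, mail server, and addresses are placeholders you'd replace):

    import os
    import smtplib
    import time
    from email.message import EmailMessage

    SHARE_FILE = r"\\server\share\heartbeat.txt"     # placeholder UNC path
    SMTP_HOST = "mail.example.com"                   # placeholder mail server
    RECIPIENTS = ["it@example.com", "me@example.com"]

    def alert(text):
        """Email the network folks (and yourself) that the share is unreachable."""
        msg = EmailMessage()
        msg["Subject"] = "Network share unreachable"
        msg["From"] = "netprobe@example.com"
        msg["To"] = ", ".join(RECIPIENTS)
        msg.set_content(text)
        with smtplib.SMTP(SMTP_HOST) as smtp:
            smtp.send_message(msg)

    while True:
        if not os.path.exists(SHARE_FILE):
            alert(f"Could not see {SHARE_FILE} at {time.ctime()}")
        time.sleep(60)   # probe once a minute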
I have seen this when networks have bad wiring, a bad port on the switch/hub, a failing NIC in the mix, and sometimes when the network is just flooded with requests from workstations.
You also did not mention whether this is a wireless connection. I am hoping not, but I have seen wireless (especially slower wireless) hubs fail under network overload, giving slow and unreliable performance, especially compared to a wired network.
Rick Schummer
In addition to the comments about the IP address: is the network controller set to an energy-efficient mode, so that it turns itself off when not actively in use?

Application-level checksumming, as the TCP checksum might be too weak?

This paper ("When the CRC and TCP checksum disagree") suggests that, since the TCP checksum algorithm is rather weak, an undetected error occurs in somewhere between one in 16 million and one in 10 billion TCP packets.
Are there any application developers out there who protect the data against such kind of errors by adding checksums at the application level?
Are there any patterns available to protect against such errors while doing EJB remote method invocations (Java EE 5)? Or does Java already checksum serialized objects automatically (in addition to the underlying network protocol)?
Enterprise software has long been running on computers that do not only memory ECC but also error checking within the CPU, at the registers and so on (SPARC and others). Bit errors in storage systems (hard drives, cables, ...) can be caught by using Solaris ZFS.
I was never afraid of network bit errors because of TCP - until I saw that article.
It might not be that much work to implement application-level checksumming for a small number of client/server remote interfaces. But what about distributed enterprise software that runs on many machines in a single datacenter? There can be a really huge number of remote interfaces.
Is every Enterprise Software vendor like SAP, Oracle and others just ignoring this kind of problem? What about banks? What about stock exchange software?
Follow-up: Thank you very much for all your answers! So it seems that it is pretty uncommon to check for undetected network data corruption - but such errors do seem to exist.
Couldn't I solve this problem simply by configuring the Java EE application servers (or EJB deployment descriptors) to use RMI over TLS, with TLS configured to use MD5 or SHA-1, and by configuring the Java SE clients to do the same? Would this be a way to get reliable, transparent checksumming (albeit overkill), so that I would not have to implement it at the application level? Or am I completely confused, network-stack-wise?
I am convinced that every application that cares about data integrity should use a secure hash. Most, however, do not. People simply ignore the problem.
Although I have frequently seen data corruption over the years - even that which gets by checksums - the most memorable in fact involved a stock trading system. A bad router was corrupting data such that it usually got past the TCP checksum. It was flipping the same bit off and on. And of course, no one is alerted for the packets that in fact failed the TCP checksum. The application had no additional checks for data integrity.
The messages were things like stock orders and trades. The consequences of corrupting the data are as serious as it sounds.
Luckily, the corruption caused the messages to be invalid enough to result in the trading system completely crashing. The consequences of some lost business were nowhere near as severe as the potential consequences of executing bogus transactions.
We identified the problem with luck - someone's SSH session between two of the servers involved failed with a strange error message. Obviously SSH must ensure data integrity.
After this incident, the company did nothing to mitigate the risk of data corruption while in flight or in storage. The same code remains in production, and in fact additional code has gone into production that assumes the environment around it will never corrupt data.
This actually is the correct decision for all the individuals involved. A developer who prevents a problem that was caused by some other part of the system (e.g. bad memory, bad hard drive controller, bad router) is not likely to gain anything. The extra code creates the risk of adding a bug, or being blamed for a bug that isn't actually related. If a problem does occur later, it will be someone else's fault.
For management, it's like spending time on security. The odds of an incident are low, but the "wasted" effort is visible. For example, notice how end-to-end data integrity checking has been compared to premature optimization already here.
As for things changing since that paper was written - all that has changed is that we have higher data rates, more complex systems, and faster CPUs that make a cryptographic hash less costly. More chances for corruption, and less cost to prevent it.
The real issue is whether it is better in your environment to detect/prevent problems or to ignore them. Remember that by detecting a problem, it may become your responsibility. And if you spend time preventing problems that management does not recognize is a problem, it can make you look like you are wasting time.
I've worked on trading systems for IBs, and I can assure you there is no extra checksumming going on - most apps use naked sockets. Given the current problems in the financial sector, I think bad TCP/IP checksums should be the least of your worries.
Well, that paper is from 2000, so it's from a LONG time ago (man, am I old), and on a pretty limited set of traces. So take their figures with a huge grain of salt. That said, it would be interesting to see if this is still the case. However, I suspect things have changed, though some classes of errors may still well exist, such as hardware faults.
More useful than checksums, if you really need the extra application-level assurance, would be a SHA-N hash of the data (or MD5, etc.).
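As a sketch of that idea (shown in Python rather than Java, just to illustrate the shape; the fixed 32-byte digest prefix is an arbitrary framing choice):

    import hashlib

    def wrap(payload: bytes) -> bytes:
        """Prepend a SHA-256 digest so the receiver can verify the payload."""
        return hashlib.sha256(payload).digest() + payload

    def unwrap(message: bytes) -> bytes:
        """Verify and strip the digest; raise if the data was corrupted in flight."""
        digest, payload = message[:32], message[32:]
        if hashlib.sha256(payload).digest() != digest:
            raise ValueError("application-level checksum mismatch")
        return payload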

Serial Comms dies in WinXP

A bit of history: we have an application which was originally written many years ago (1998 is the first date in PVCS, but the app is about 5 years older than that, as it was originally a DOS program). This application communicates with a piece of hardware over serial. When we got to Windows XP we started receiving reports of the app dying after running for a short time. It seems that the serial comms just 'died' and the app was left in a stuck state. The only way to recover from this situation was to restart the application.
The only information I can find regarding this problem is a claim that the Windows message system would miss the notification that data had been received, the buffer would fill, and the system would get stuck. This snippet of information was left in an old Word document, but there's no evidence to back it up. It also mentions that this is only prevalent at high baud rates (115200+).
The solution was to provide customers with USB->Serial converters along with the hardware.
Today: we are working on a new version of the hardware that will run across a network as well as serial ports. To allow me to work on the network code without the actual hardware, we are using a VSCOM NetCom113 device. It also installs a virtual COM port on the user's (i.e. my) machine.
Now that I have the network code integrated with the app, it appears that the NetCom device exhibits the same behaviour as a physical COM port. This is undesirable, as I need the app to run longer than ~30 seconds.
Google turns up nothing matching the problem we experience.
I was wondering:
Has anyone experienced this before? If so what did you do to fix/workaround the problem?
Does anyone have any suggestions as to whether the original author of the document is correct and what I can do to test the theory?
Unfortunately I can't post code, as the serial code is tightly coupled with the rest of the system, but I can answer questions about it.
Updates:
The code is written using the Win32 comm routines - so I am using CreateFile and ReadFile. There are also judicious calls to GetOverlappedResult.
It's not hanging per se; it's just that the comms stop. You can access the menus and click the buttons, but nothing can interact with the connected hardware. Using RealTerm you can see that no data is coming in or going out.
I think the reference to the Windows message system means the problem is internal to Windows: data has arrived, but the kernel has missed it and thus not told the rest of the system about it.
Flow control is not used.
Writing a 'simple' test is difficult due to the fact that the code is tightly coupled and the underlying protocol is quite complex; it would require a lot of work.
Are you using DOS-style serial code, or the Win32 CreateFile approach?
If the former, be very suspicious: if at all possible I'd convert to the latter.
If the latter, do you know on what kind of system call it's hanging? Are you in a blocking read call? or an overlapped I/O call? or waiting on an event? (I'm not sure I have enough experience to help, but those are the kinds of questions that come to mind)
You might also check into the queue size, which you can set with the SetupComm function.
I don't buy the "Windows Message system" stuff -- it sounds fishy; you can write good Win32 serial i/o code that never uses Windows messages.
edit: does your overlapped I/O use events? I seem to remember something about auto-reset events occasionally missing their trigger... check your overlapped I/O calls very carefully to see whether you're handling the possible outcomes properly. Perhaps there's a way to make your code more robust by automatically cancelling the overlapped I/O and restarting another read. (I assume the problem is in the read half, not the write half?)
edit 2: A suggestion: assuming the win32 side has missed a byte or packet, and your devices are in deadlock because they're both expecting each other to respond to something, can you tweak the other side of the serial I/O to regularly send some type of "ping" packet with an incrementing counter? (and log the ping packets on the PC side; that way you can see whether you've missed any)
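The PC-side logging can be trivial; for example, a sketch in Python with pyserial (the port name, baud rate, and "PING <counter>" packet format are all assumptions):

    import serial   # pyserial

    PORT, BAUD = "COM3", 115200            # assumed settings
    ser = serial.Serial(PORT, BAUD, timeout=5)

    last = None
    while True:
        line = ser.readline().strip()       # assume pings arrive newline-terminated
        if not line.startswith(b"PING "):
            continue                        # ignore normal protocol traffic
        counter = int(line.split()[1])
        if last is not None and counter != last + 1:
            print(f"gap detected: expected {last + 1}, got {counter}")
        last = counter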
Are you sure you have your flow control set up correctly? DTR, RTS, etc...
-Adam
I have written apps that use USB / Bluetooth serial ports and have never had an issue. With Bluetooth I have seen sustained bit rates of 800,000 bps for long periods of time. Most people don't properly implement the port.
My serial port
Not sure if this is a possibility for you, but if you could re-write the code using C#.NET you'd have access to the SerialPort class there. It might remedy your problem. I know a lot of legacy code based around the Win32 API for hardware I/O ports tended to fail in XP due to timing (had a small bit of experience with MIDI).
In addition, I don't know if you can use the Win32 method of Serial Port access in Vista, so that might shut out future MS OSes from being able to use your code.
