Watchdog monitoring for multiple microcontrollers

I am using three microcontrollers on one board: a main micro, a gateway micro, and a safety micro. The names indicate their roles.
All three have an internal watchdog, but I also need external supervision so that buggy timer code cannot nullify the effect of the internal watchdog. To keep the BOM cost low, I can use only one external watchdog.
I propose the following strategy:
Main microcontroller: internal watchdog plus the external watchdog.
Safety microcontroller: internal watchdog plus monitoring over SPI by the main microcontroller.
Gateway microcontroller: internal watchdog plus monitoring over SPI by the main microcontroller.
One issue with this: EMI or noise on the SPI lines could corrupt the SPI traffic and cause a false reset from the main micro.
Has anybody faced a similar challenge? Any suggestions?
Many thanks for your help!

Without knowing the specifics of your application, it is not possible to give you a definitive answer. The way you would normally solve this sort of problem is to do a failure mode and effects analysis (FMEA). Essentially you list all the parts of your system and then brainstorm all the failure modes you think could happen. EMC would be one of them. You then estimate the probability that each failure mode will occur and assign a severity to it in the event that it does occur. Multiplying the two lets you identify the areas that carry the greatest impact and need extra protection. When every failure mode has a probability x severity value below a threshold set by your application, you will have a 'valid' solution.
Not doing a thorough analysis like this means you may very well put all your effort into defending the front door while leaving the back door unlocked.
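As a rough illustration of the scoring step described above, the sketch below ranks a few failure modes by probability times severity and flags those at or above a threshold. The 1 to 10 scales, the example failure modes, their scores, and the threshold of 15 are all made-up assumptions, not values from any standard or from the asker's system.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    probability: int  # assumed scale: 1 (rare) .. 10 (frequent)
    severity: int     # assumed scale: 1 (negligible) .. 10 (catastrophic)

    @property
    def risk(self) -> int:
        # Risk value as described in the answer: probability x severity.
        return self.probability * self.severity


def needs_mitigation(modes, threshold):
    """Return the failure modes at or above the threshold, worst first."""
    flagged = [m for m in modes if m.risk >= threshold]
    return sorted(flagged, key=lambda m: m.risk, reverse=True)


# Hypothetical entries for a three-micro board; the scores are invented.
modes = [
    FailureMode("SPI frame corrupted by EMI", probability=4, severity=6),
    FailureMode("Main micro hangs, external watchdog resets it", probability=2, severity=3),
    FailureMode("Main micro hangs while supervising safety micro", probability=2, severity=9),
]

flagged = needs_mitigation(modes, threshold=15)
for mode in flagged:
    print(f"{mode.risk:3d}  {mode.name}")
```

The design is considered acceptable under this scheme only once `needs_mitigation` returns an empty list for the threshold your application sets.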

Related

Is it possible to verify the received slave data in the I2C protocol?

I recently came across a question about the I2C protocol.
Usually we read data from I2C slave devices and use it for further calculations on the master side.
But can I be sure that the data I received is the data I wanted, or could it have been corrupted in transit on the bus?
Is there any way to check this in the I2C protocol?
In many cases, no, you can't be certain from software. You have to verify the design by hardware (with an oscilloscope) and then make sure nothing changes.
However, some slave devices provide checksums or check bits on certain transactions that can be used to detect some faults.
Reading a known value from the ID register at startup is usually good enough for most applications.
Remember that if your data is getting corrupted then maybe your addresses are too, and that will cause ACKs to not be returned when you expect them.
Many things you can do from software will not detect an error instantly though, so you need to think about how important or harmful it could be if an incorrect value is used. There are various standards that help you plan for this, such as ISO 14971 "Application of risk management to medical devices".
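For slave devices that do append a checksum, the master-side check might look like the sketch below. It assumes a hypothetical device that returns two data bytes followed by a CRC-8 byte, computed with polynomial 0x31 and initial value 0xFF. These parameters are an assumption for illustration; the polynomial, initial value, and frame layout vary by device, so take them from the datasheet of the part you actually use.

```python
def crc8(data: bytes, poly: int = 0x31, init: int = 0xFF) -> int:
    """Bitwise CRC-8, MSB first, no reflection, no final XOR."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def frame_is_valid(frame: bytes) -> bool:
    """Check a frame laid out as [data..., crc] (assumed layout)."""
    if len(frame) < 2:
        return False
    payload, received_crc = frame[:-1], frame[-1]
    return crc8(payload) == received_crc


good = bytes([0xBE, 0xEF, 0x92])  # payload 0xBEEF with its matching CRC
bad = bytes([0xBE, 0xEE, 0x92])   # same frame with one bit flipped in transit

print(frame_is_valid(good))
print(frame_is_valid(bad))
```

A failed check tells you the frame is unusable, not what the correct value was, so the usual response is to discard the reading and retry the transaction.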

RS232 Alicat and Labview communication drop

At the moment I have a problem I cannot pin down. Seemingly at random my communication with my RS232 Alicat Device will get held up. It will get held up somewhere in the read or write process and be unable to complete it. Upon closing the VI I will get a "Resetting VI" error in Labview 2020. I am using 7 of the 9 RS232 ports. My question is:
How do I fix this problem so that I do not get a communication drop OR (more likely)
How do I code the system so that I can catch and move past this problem, or reset the connection? Something like a VISA read/write timeout? I am open to ideas on how to move past the block.
Here is what I have gathered about the problem:
Windows 10, I’ve tested everything on multiple computers. It happens no matter what.
It happens at random. It might happen twice within 20 minutes or not for a couple of hours.
I have never experienced the error when probing the line. I don’t know if that is a clue, or if that speaks to the randomness of the problem
Baud Rate = 9600, Prior to this I was running at 19,200 and experienced equivalent issues. The manufacturer recommended lowering the baud rate to reduce noise. I have also isolated the cable from other parts of the hardware. At this point noise on the connection is not an issue, but I am still experiencing the error.
My buffer size is 1000 bytes.
My termination character is \r. Given the size of my buffer, I cannot imagine a scenario where it fails to read a termination character.
I'm querying it every 50 ms, far below the threshold of a standard timeout. Is that too often?
What I am currently testing:
Because of how my code block is set up, I cannot yet confirm whether it is getting locked up on the read block, the write block, or both. I'm trying to narrow the problem down with only minor modifications.
Attached is a slimmed-down version of my code that I isolated the error to.
I have experienced similar problems with some RS232 devices from different suppliers. The (quite bad) solution was to connect and disconnect for each communication command. The question would be what sample rate you need.
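The connect-and-disconnect-per-command workaround, combined with the read timeout the question asks about, could be sketched roughly as follows. This is plain Python rather than LabVIEW, and it is written against a hypothetical port object with `write`, `read`, and `close` methods, where `read` returns whatever bytes are available, or `b""` if none arrive within a short time. The names `query_once`, `query_with_retry`, and `CommTimeout` are invented for this example.

```python
import time


class CommTimeout(Exception):
    """Raised when no complete reply arrives before the deadline."""


def query_once(open_port, command, timeout_s, terminator=b"\r"):
    """Open the port, send one command, read one terminated reply, close."""
    port = open_port()
    try:
        port.write(command)
        deadline = time.monotonic() + timeout_s
        reply = bytearray()
        while time.monotonic() < deadline:
            reply += port.read()  # assumed to return b"" when nothing arrived
            if reply.endswith(terminator):
                return bytes(reply)
        raise CommTimeout(f"no terminated reply within {timeout_s} s")
    finally:
        port.close()  # always release the port, even after a stall


def query_with_retry(open_port, command, timeout_s=0.5, attempts=3):
    """Retry a stalled query on a freshly opened connection."""
    for _ in range(attempts):
        try:
            return query_once(open_port, command, timeout_s)
        except CommTimeout:
            continue  # the next attempt reopens the port from scratch
    raise CommTimeout(f"device did not answer after {attempts} attempts")
```

With pyserial, `open_port` could be something like `lambda: serial.Serial("COM3", 9600, timeout=0.05)`, where the port name is a placeholder. Reopening the port on every query costs time, so whether this fits depends on the sample rate you need.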
Another idea is to replace that device with an Ethernet device. If I am not mistaken, Alicat supplies those with Modbus (TCP).
The issue turned out to be specific to Windows on my laptop. There is a USB setting that disables inactive USB devices after a certain amount of time. The option to disable that timeout was unavailable through the Control Panel on my laptop, though it was available on my coworker's. I had to use PowerShell commands to change the setting.

GSM network overhead issue

I am trying to learn about GSM network issues. I would like to know more about the "trade-off between network overhead and call setup time": why does network overhead occur, and how does it affect call setup time?
This seems an odd question (I am guessing it is from a course?).
All networks will have an overhead to set up a call, so maybe this is referring to the extra work required to 'find' the terminating mobile device.
This requires a query to the GSM 'database' associated with the terminating subscriber, the HLR (Home Location Register).
Call setup is generally prioritised over other traffic, including paging in the terminating cell to tell the terminating device there is a call; other than that there is not a lot of specific overhead. In high-congestion situations the terminating cell may not be able to page the device, which can mean the call cannot be set up. Maybe this is what the question was referring to.

Arduino Fail-Safe Mechanism

Suppose I am developing a fail-safe mechanism for an Arduino (or any other microcontroller). In other words, a secondary microcontroller or a separate board should take over when the primary controller fails.
Two possible mechanisms are as follows.
Method 1 - Client Server Mechanism
There are 2 identical systems which are powered separately.
The secondary system sends a request periodically and the primary system replies.
If the primary system fails to reply (several times), the secondary system takes charge.
Method 2 - Heart Beat Mechanism
There are 2 identical systems which are powered separately.
The primary system sends a periodic heartbeat message.
If the heartbeat is present, the secondary node knows that the primary node is up.
When there is no heartbeat, the primary node is assumed to be dead and the secondary node takes control.
Do you guys know any better mechanism to implement this?
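The heartbeat scheme in Method 2 can be sketched from the secondary node's point of view as below. It tolerates a few missed beats before declaring the primary dead, in the spirit of the "several times" rule from Method 1. The class name, the 1-second period, and the limit of three missed beats are illustrative assumptions.

```python
class HeartbeatMonitor:
    """Runs on the secondary node and tracks beats from the primary."""

    def __init__(self, period_s, max_missed, start):
        self.period_s = period_s      # expected interval between beats
        self.max_missed = max_missed  # missed beats tolerated before takeover
        self.last_beat = start        # time of the most recent beat

    def beat(self, now):
        """Call whenever a heartbeat message arrives from the primary."""
        self.last_beat = now

    def missed_beats(self, now):
        """Number of whole heartbeat periods elapsed since the last beat."""
        return int((now - self.last_beat) // self.period_s)

    def primary_alive(self, now):
        return self.missed_beats(now) < self.max_missed


monitor = HeartbeatMonitor(period_s=1.0, max_missed=3, start=0.0)
monitor.beat(now=1.0)
print(monitor.primary_alive(now=2.5))  # one period missed: still alive
print(monitor.primary_alive(now=4.5))  # three periods missed: take over
```

Timestamps are passed in explicitly so the logic can be exercised without real delays; on a microcontroller they would come from a millisecond tick counter.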
Typically in commercial embedded systems, a watchdog timer is used to reset the processor if it stops responding: the firmware must periodically "kick the dog", and if it fails to do so the watchdog forces a reset. All AVR microcontrollers (and many if not most other brands as well) have an internal watchdog timer. A design with an independent, external watchdog timer is typically more robust and reliable, though.
For systems that require an even higher degree of fault tolerance, for instance aerospace applications, triple redundant or triple modular redundant architectures are used.
In a triple redundant system, three identical processing components perform the same task at the same time. The result is then sent to a voting circuit or what John von Neumann called a "majority organ" (Section 4.2.2). The output of the voting circuit is the majority opinion of the three processing components.
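The voting step can be sketched in a few lines. The first function votes on whole values and refuses to answer when all three disagree; the second votes bit by bit, which is how a hardware voter is usually built from AND and OR gates. Both function names are illustrative.

```python
from collections import Counter


def majority_vote(a, b, c):
    """Return the value at least two of the three inputs agree on."""
    value, count = Counter((a, b, c)).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three inputs differ")
    return value


def bitwise_majority(a: int, b: int, c: int) -> int:
    """Per-bit majority: each output bit is set if two or more inputs set it."""
    return (a & b) | (a & c) | (b & c)


# One of the three components has failed and reports a wrong value.
print(majority_vote(21, 21, 99))
print(bin(bitwise_majority(0b1010, 0b1010, 0b0110)))
```

Value-level voting can report "no majority" as a detectable fault, whereas the bitwise form always produces an output even when all three inputs differ.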
This allows for one of the processing components to fail without affecting the operation of the system. However, if the voting circuit fails, then the whole system fails as well. A triple modular redundant system does away with this single point of failure by implementing three voting circuits as well.
Eventually, though, the three outputs will need to be combined into one result again, which leads to a single point of failure, even if that point of failure is the human looking at three gauges that each monitor the same temperature.
What you need to determine is just how fault-tolerant you need your system to be and what kind of mean time between failures (MTBF) your system can handle. Then design your redundancy system around that.

Serial Comms dies in WinXP

A bit of history: We have an application, which was originally written many years ago (1998 is the first date in PVCS but the app is about 5 years older than that as it originally was a DOS program). This application communicates with a piece of hardware via serial. When we got to Windows XP we started receiving reports of the app dying after a short time of running. It seems that the serial comms just 'died' and the app was left in a stuck state. The only way to recover from this situation was to restart the application.
The only information I can find on this problem suggests that the Windows message system would miss the notification that data had been received, the buffer would fill, and the system would get stuck. This snippet was left in an old Word document, but there is no evidence to back it up. The document also mentions that the problem is only prevalent at high baud rates (115200+).
The solution was to provide customers with USB->Serial converters along with the hardware.
Today: We are working on a new version of the hardware that will run across a network as well as over serial ports. To let me work on the network code without the actual hardware, we are using a VSCOM NetCom113 device. It also installs a virtual COM port on the user's (i.e. my) machine.
Now I have got the network code integrated with the app, it appears that the NetCom device exhibits the same behaviour as a physical commport. This is undesirable as I need the app to run longer than ~30 seconds.
Google turns up no reports of the problem we are experiencing.
I was wondering:
Has anyone experienced this before? If so what did you do to fix/workaround the problem?
Does anyone have any suggestions as to whether the original author of the document is correct and what I can do to test the theory?
Unfortunately I can't post code, as the serial code is tightly coupled with the rest of the system, but I can answer questions about it.
Updates:
The code is written using Win32 Comm routines - so I am using CreateFile, ReadFile. There's also judicious calls to GetOverlappedResult.
It's not hanging per se, it's just that the comms stops. You can access the menus, click the buttons, but nothing can interact with the connected hardware. Using realterm you can see that no data is coming in or going out.
I think the reference to the Windows message system means that the problem is internal to Windows: data has arrived, but the kernel has missed it and thus not told the rest of the system about it.
Flow control is not used.
Writing a 'simple' test is difficult because the code is tightly coupled and the underlying protocol is quite complex, so it would require a lot of work.
Are you using DOS-style serial code, or the Win32 CreateFile approach?
If the former, be very suspicious: if at all possible I'd convert to the latter.
If the latter, do you know on what kind of system call it's hanging? Are you in a blocking read call? or an overlapped I/O call? or waiting on an event? (I'm not sure I have enough experience to help, but those are the kinds of questions that come to mind)
You might also check into the queue size, which you can set with the SetupComm function.
I don't buy the "Windows Message system" stuff -- it sounds fishy; you can write good Win32 serial i/o code that never uses Windows messages.
edit: does your Overlapped I/O use events? I seem to remember something about auto-reset events occasionally missing their trigger... check your overlapped I/O calls very carefully to see whether you're handling the possible outcomes properly. Perhaps there's a way to make your code more robust by automatically cancelling the overlapped i/o and restarting another read. (I assume the problem is in the read half, not the write half?)
edit 2: A suggestion: assuming the win32 side has missed a byte or packet, and your devices are in deadlock because they're both expecting each other to respond to something, can you tweak the other side of the serial I/O to regularly send some type of "ping" packet with an incrementing counter? (and log the ping packets on the PC side; that way you can see whether you've missed any)
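The logging side of that ping idea might look like the following sketch, which scans the counters received on the PC and reports any gaps. It assumes a hypothetical one-byte counter that wraps from 255 back to 0; the function name and the counter width are assumptions.

```python
def find_missed_pings(received, modulo=256):
    """Return (previous, current, lost) for every gap in the counter sequence.

    Assumes the sender increments the counter by one per ping and wraps
    at `modulo`. A repeated counter value would show up as modulo - 1
    lost pings, so duplicates need separate handling if they can occur.
    """
    gaps = []
    for previous, current in zip(received, received[1:]):
        lost = (current - previous - 1) % modulo
        if lost:
            gaps.append((previous, current, lost))
    return gaps


# Counter wraps cleanly from 255 to 0, then pings 2 and 3 never arrive.
log = [253, 254, 255, 0, 1, 4]
print(find_missed_pings(log))
```

If the log shows gaps shortly before the link goes quiet, that points at lost data on the receive side rather than at the device falling silent.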
Are you sure you have your flow control set up correctly? DTR, RTS, etc...
-Adam
I have written apps that use USB / Bluetooth serial ports and have never had an issue. With Bluetooth I have seen sustained bit rates of 800,000 bps for long periods of time. Most people don't implement the port properly.
My serial port
Not sure if this is a possibility for you, but if you could re-write the code using C#.NET you'd have access to the SerialPort class there. It might remedy your problem. I know a lot of legacy code based around the Win32 API for hardware I/O ports tended to fail in XP due to timing (had a small bit of experience with MIDI).
In addition, I don't know if you can use the Win32 method of Serial Port access in Vista, so that might shut out future MS OSes from being able to use your code.

Resources