How to get visibility into the completion queue on a C++ gRPC server

Note: Help with the immediate problem would be great, but mostly I'm looking for advice on troubleshooting gRPC timing issues in general (this isn't my first such issue).
I am adding a new server streaming service to a C++ module which has an existing server streaming service, and the two appear to be conflicting. Specifically, the completion queue Next() call on the server is crashing intermittently after the C# client calls Cancel() on the cancellation token for one of the services. This doesn't happen if I run each service independently.
On the client, I get this at the response stream MoveNext() call:
System.InvalidOperationException
HResult=0x80131509
Message=Shutdown has already been called
Source=Grpc.Core
StackTrace:
at Grpc.Core.Internal.CompletionQueueSafeHandle.BeginOp()
at Grpc.Core.Internal.CallSafeHandle.StartReceiveMessage(IReceivedMessageCallback callback)
at Grpc.Core.Internal.AsyncCallBase`2.ReadMessageInternalAsync()
at Grpc.Core.Internal.ClientResponseStream`2.<MoveNext>d__5.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at MyModule.Connection.<DoSubscriptionReceives>d__7.MoveNext() in C:\snip\Connection.cs:line 67
On the server, I get this at the completion queue Next() call:
Exception thrown: read access violation.
core_cq_tag-> was 0xDDDDDDDD.
The stack trace:
MyModule.exe!grpc_impl::CompletionQueue::AsyncNextInternal(void * * tag, bool * ok, gpr_timespec deadline) Line 59 C++
> MyModule.exe!grpc_impl::CompletionQueue::Next(void * * tag, bool * ok) Line 176 C++
...snip...
It appears something is being added to the queue after shutdown. The difficulty is I have little visibility into what is being added into the queue and in what order.
I'm trying to write a server-side interceptor to log all requests & responses, but there seems to be no documentation. So far, poking through the API hasn't gotten me very far. Is there any documentation available on wiring up an interceptor in C++? Or, are there other approaches for troubleshooting timing conflicts between services?
Windows 11, Grpc.Core 1.27
What I've tried:
I first played with the GRPC_TRACE & GRPC_VERBOSITY environment variables. I was able to get some unhelpful output from the client, but nothing from the server. Of course, there's been lots of debugging, stripping the client & server down to barebones, disabling keep alives, ensuring we aren't using deadlines, having the services share a cancellation token, etc.
Update: I have found that the crash only happens when the client is run from an NUnit test. In that environment, the completion queue is getting more hits on Next(), but I'm still trying to figure out where they are coming from.

Is 1.27 the version you are using? That is quite old, and there may have been fixes since then.
For the C++ server interception API, I think you would find this test very useful: https://github.com/grpc/grpc/blob/0f2a0f5fc9b9e9b9c98d227d16575d106f1e8d43/test/cpp/end2end/server_interceptors_end2end_test.cc#L48
One suggestion is to run the code under the sanitizers (https://github.com/google/sanitizers) to make sure there isn't a heap-use-after-free type bug.
I would also check for API misuse issues. (If you had posted the code, I could have taken a look to see if anything seems off.)
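As an illustration of the kind of API misuse worth checking: in the asynchronous C++ API, an object whose tag has been handed to gRPC generally needs to stay alive until that tag has come back from Next(), including after a cancellation. The 0xDDDDDDDD value in the question is the pattern the MSVC debug heap writes over freed memory, which is consistent with something being destroyed while an operation was still pending. Below is a minimal sketch of a bookkeeping helper that logs every operation as it is started and completed, and reports whether the owner of a tag still has operations in flight. OpTracker and its method names are hypothetical and not part of gRPC; the sketch uses only the standard library and assumes you call it by hand around your own Write/Finish/Next calls.

```cpp
#include <cstddef>
#include <iostream>
#include <map>
#include <mutex>
#include <string>

// Hypothetical helper, not part of gRPC: counts in-flight operations per tag
// and logs each start/completion so the ordering is visible in the log.
class OpTracker {
public:
    // Call just before starting an async operation that uses `tag`.
    void started(const void* tag, const std::string& what) {
        std::lock_guard<std::mutex> lock(mu_);
        const std::size_t now = ++pending_[tag];
        std::clog << "start    " << what << " tag=" << tag
                  << " pending=" << now << '\n';
    }

    // Call as soon as `tag` comes back from the completion queue, before
    // dereferencing it. Returns false if nothing was pending for this tag.
    bool completed(const void* tag, bool ok) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = pending_.find(tag);
        if (it == pending_.end()) {
            std::clog << "UNEXPECTED tag=" << tag << " ok=" << ok << '\n';
            return false;
        }
        const std::size_t left = --it->second;
        if (left == 0) pending_.erase(it);
        std::clog << "complete tag=" << tag << " ok=" << ok
                  << " pending=" << left << '\n';
        return true;
    }

    // True when no operation is in flight for `tag`, i.e. the object that
    // owns the tag can be destroyed without the queue seeing it again.
    bool safe_to_delete(const void* tag) const {
        std::lock_guard<std::mutex> lock(mu_);
        return pending_.count(tag) == 0;
    }

private:
    mutable std::mutex mu_;
    std::map<const void*, std::size_t> pending_;
};
```

The intended use is to call started() before each asynchronous call on a stream, completed() right after Next() hands a tag back, and to delete the per-call object only once safe_to_delete() returns true. If the crash happens inside Next() itself, the last few "start" lines in the log show which operations were still outstanding at that point.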


Timing issue in a C++/WinRT BLE connection attempt?

I am using C++/WinRT UWP to discover and connect to Bluetooth Low Energy devices. I am using the advertisement watcher to look for advertisements from devices I can support. This works.
Then I pick one to connect to. The connection procedure is a little weird to my way of thinking, but according to the Microsoft docs one calls FromBluetoothAddressAsync() with the BluetoothAddress and two things happen: one gets the BluetoothLEDevice AND a connection attempt is made. One needs to register a handler for the connection status changed event, BUT you can't do that until you get the BluetoothLEDevice.
Is there a timing issue causing the exception? Has the connection already happened BEFORE I get the BluetoothLEDevice object? Below is the code, and below that is the log:
void BtleHandler::connectToDevice(BluetoothLEAdvertisementReceivedEventArgs eventArgs)
{
    OutputDebugStringA("Connect to device called\n");
    // My God this async stuff took me a while to figure out! See https://msdn.microsoft.com/en-us/magazine/mt846728.aspx
    IAsyncOperation<Windows::Devices::Bluetooth::BluetoothLEDevice> async = // assuming the address type is how I am to behave ..
        BluetoothLEDevice::FromBluetoothAddressAsync(eventArgs.BluetoothAddress(), BluetoothAddressType::Random);
    bluetoothLEDevice = async.get();
    OutputDebugStringA("BluetoothLEDevice returned\n");
    bluetoothLEDevice.ConnectionStatusChanged({ this, &BtleHandler::onConnectionStatusChanged });
    // This method not only gives you the device but it also initiates a connection
}
The above code generates the following log:
New advertisment/scanResponse with UUID 00001809-0000-1000-8000-00805F9B34FB
New ad/scanResponse with name Philips ear thermometer and UUID 00001809-0000-1000-8000-00805F9B34FB
Connect to device called
ERROR here--> onecoreuap\drivers\wdm\bluetooth\user\winrt\common\bluetoothutilities.cpp(509)\Windows.Devices.Bluetooth.dll!03BEFDD6: (caller: 03BFB977) ReturnHr(1) tid(144) 80070490 Element not found.
ERROR here--> onecoreuap\drivers\wdm\bluetooth\user\winrt\device\bluetoothledevice.cpp(428)\Windows.Devices.Bluetooth.dll!03BFB9B7: (caller: 03BFAF01) ReturnHr(2) tid(144) 80070490 Element not found.
BluetoothLEDevice returned
Exception thrown at 0x0F5CDF2F (WindowsBluetoothAdapter.dll) in BtleScannerTest.exe: 0xC0000005: Access violation reading location 0x00000000.
It sure looks like there is a timing issue. But if it is, I have no idea how to resolve it. I cannot register for the event if I don't have a BluetoothLEDevice object! I cannot figure out a way to get the BluetoothLEDevice object without invoking a connection.
================================ UPDATE =============================
Changed the methods to IAsyncAction and used co_await as suggested by @IInspectable. No difference. The problem is clearly that the registered handler is out of scope or something is wrong with it. I tried get_strong() instead of 'this' in the registration, but the compiler would not accept it (it said identifier 'get_strong()' is undefined). However, if I comment out the registration, no exception is thrown, but I still get these log messages:
onecoreuap\drivers\wdm\bluetooth\user\winrt\common\bluetoothutilities.cpp(509)\Windows.Devices.Bluetooth.dll!0F27FDD6: (caller: 0F28B977) ReturnHr(3) tid(253c) 80070490 Element not found.
onecoreuap\drivers\wdm\bluetooth\user\winrt\device\bluetoothledevice.cpp(428)\Windows.Devices.Bluetooth.dll!0F28B9B7: (caller: 0F28AF01) ReturnHr(4) tid(253c) 80070490 Element not found.
But the program continues to run and I continue to discover and connect. But since I can't get the connection event, it is kind of useless at this stage.
I hate my answer. But after asynching and co-routining everything under the sun, the problem is unsolvable by me:
This method
bluetoothLEDevice = co_await BluetoothLEDevice::FromBluetoothAddressAsync(eventArgs.BluetoothAddress(), BluetoothAddressType::Random);
returns NULL. That should not happen and there is not much I can do about it. I read that as a broken BLE API.
A BTLE Central should be able to do as follows:
- Discover a device; if new, then:
    - If the user selects connect, connect to the device
    - perform service discovery
    - read/write/enable characteristics as needed
    - handle indications/notifications
- If at any time the peripheral sends a security request or insufficient authentication error, start pairing
    - repeat the action that caused the insufficient authentication.
- On disconnect, save the paired and bonded state if the device is pairable.
- On rediscovery of the device, if unpaired (not a pairable device)
    - repeat above
- If paired and bonded
    - start encryption
    - work with the device; no need to re-enable or do service discovery
========================= MORE INFO ===================================
This is what the log shows when the method is called
Connect to device called
onecoreuap\drivers\wdm\bluetooth\user\winrt\common\bluetoothutilities.cpp(509)\Windows.Devices.Bluetooth.dll!0496FDD6: (caller: 0497B977) ReturnHr(1) tid(3b1c) 80070490 Element not found.
onecoreuap\drivers\wdm\bluetooth\user\winrt\device\bluetoothledevice.cpp(428)\Windows.Devices.Bluetooth.dll!0497B9B7: (caller: 0497AF01) ReturnHr(2) tid(3b1c) 80070490 Element not found.
BluetoothLEDevice returned is NULL. Can't register
Since the BluetoothLEDevice is NULL, I do not attempt to register.
================= MORE INFO ===================
I should also add that taking an over-the-air sniff reveals that there is never a connection event. Though the method is supposed to initiate a connection as well as return the BluetoothLEDevice object, it ends up doing neither. My guess is that the method requires more pre-use setup of the system that only the DeviceWatcher does. The AdvertisementWatcher probably does not.
In BLE you always have to wait for every operation to complete.
I am not an expert in C++, but in C# the async connection procedure returns a bool if it was successful.
In C++ the IAsyncOperation does not have a return type, so there is no way to know if the connection procedure was successful or completed.
You will have to await the IAsyncOperation and make sure that you have a BluetoothLEDevice object, before you attach the event handler.
For how to await an IAsyncOperation, see the question "How to wait for an IAsyncAction?".
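A minimal sketch of that ordering: wait for the lookup to finish, check the result, and only then attach the handler. FakeDevice, find_device and connect_and_register are hypothetical stand-ins written with the standard library only; they are not WinRT types. In the real code the find_device line would be the co_await (or .get()) on FromBluetoothAddressAsync, whose result the question's log shows can come back null.

```cpp
#include <functional>
#include <iostream>
#include <memory>
#include <string>

// Hypothetical stand-in for BluetoothLEDevice; not the WinRT type.
struct FakeDevice {
    std::function<void(const std::string&)> on_status_changed;
};

// Hypothetical stand-in for the completed device lookup. Returns a null
// pointer when the lookup fails, mirroring the null result in the question.
std::shared_ptr<FakeDevice> find_device(bool found) {
    if (found) {
        return std::make_shared<FakeDevice>();
    }
    return nullptr;
}

// Registers the status handler only if the lookup produced a device.
// Returns the device, or a null pointer if there was nothing to register on.
std::shared_ptr<FakeDevice> connect_and_register(bool found) {
    std::shared_ptr<FakeDevice> device = find_device(found);
    if (!device) {
        std::clog << "device is null; skipping handler registration\n";
        return nullptr;
    }
    device->on_status_changed = [](const std::string& status) {
        std::clog << "connection status: " << status << '\n';
    };
    return device;
}
```

With this shape, a failed lookup is reported and returned to the caller instead of being dereferenced, which is the access violation seen in the first log.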

gRPC Java - Listening for successful calls on server

I am looking for a way to capture when a call is completed on the server and no errors were thrown.
I understand that SimpleForwardingServerCallListener exists; however, onComplete is also called when an exception is thrown.
My use case is for transaction management.
Yes, it currently only triggers onComplete(). There is a bug filed for that. If it's fixed, you can probably get onCancel() instead.
For now, you can wrap the ServerCall with SimpleForwardingServerCall in your ServerInterceptor, and override close(). If the RPC ends successfully, OK will be passed to close().

Why do I get exception "The execution of the InstancePersistenceCommand named LoadWorkflowByInstanceKey was interrupted by an error"

After doing some refactoring to my WF4 service, I got this exception when calling some of the operations:
The execution of the InstancePersistenceCommand named {urn:schemas-microsoft-com:System.Activities.Persistence/command}LoadWorkflowByInstanceKey was interrupted by an error.
My xamlx file contains a few receive/sendreplytoreceive pairs, as shown below. The exception sometimes happens on receive2, sometimes receive3.
receive1 (no correlation, cancreateinstance=true)
send reply to receive (initializes content correlation on generated ID)
receive2 (correlates on ID, cancreateinstance=false)
send reply to receive
receive 3 (correlates on ID, cancreateinstance=false)
send reply to receive
After doing a lot of debugging and making sure all correlations were set up right, the exception disappeared for new instances of the workflow.
What does the exception mean, why did it show up, and why did it disappear all of a sudden? Is it a code/xamlx issue or something with the infrastructure (AppFabric/SQL)?
I'm hosting the WF service with IIS/AppFabric, using AppFabric's SQL persistence.
According to this support note, this error can be the result of a race condition between the Receive and a Delay activity expiring. Is this possible in your workflow?
I kind of figured mine out... apparently if you point your persistence store at a SQL Server version earlier than 2012, you get the error, so all I had to do was put my persistence store on SQL Server 2012.
When I had this problem it turned out to be a mistake in my connection string when instantiating the persistence store object.
SqlWorkflowInstanceStore store = new SqlWorkflowInstanceStore(connStr);
I realise this is an old question, but fixing the connection string got rid of my error while running store.Execute(), so I thought I'd share!

Servlet initialization failure in WebSphere 6.0

I have many servlets in a web application; for some strange reason, exactly one of them always fails in initialization with the following error trace:
00000045 ServletWrappe E SRVE0100E: Did not realize init() exception thrown by servlet MyServletX: java.lang.NullPointerException
at com.ibm.ws.webcontainer.WebAppPmiListener.onServletStartInit(WebAppPmiListener.java:120)
at com.ibm.ws.webcontainer.webapp.FireOnServletStartInit.fireEvent(WebAppEventSource.java:237)
at com.ibm.ws.webcontainer.util.EventListeners.fireEvent(EventListeners.java:48)
at com.ibm.ws.webcontainer.webapp.WebAppEventSource.onServletStartInit(WebAppEventSource.java:105)
at com.ibm.ws.webcontainer.servlet.ServletWrapper.init(ServletWrapper.java:261)
at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:444)
at com.ibm.ws.webcontainer.webapp.WebApp.handleRequest(WebApp.java:2841)
at com.ibm.ws.webcontainer.webapp.WebGroup.handleRequest(WebGroup.java:220)
at com.ibm.ws.webcontainer.VirtualHost.handleRequest(VirtualHost.java:204)
at com.ibm.ws.webcontainer.WebContainer.handleRequest(WebContainer.java:1681)
at com.ibm.ws.webcontainer.channel.WCChannelLink.ready(WCChannelLink.java:77)
at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.handleDiscrimination(HttpInboundLink.java:421)
at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.handleNewInformation(HttpInboundLink.java:367)
at com.ibm.ws.http.channel.inbound.impl.HttpICLReadCallback.complete(HttpICLReadCallback.java:94)
at com.ibm.ws.tcp.channel.impl.WorkQueueManager.requestComplete(WorkQueueManager.java:548)
at com.ibm.ws.tcp.channel.impl.WorkQueueManager.attemptIO(WorkQueueManager.java:601)
at com.ibm.ws.tcp.channel.impl.WorkQueueManager.workerRun(WorkQueueManager.java:934)
at com.ibm.ws.tcp.channel.impl.WorkQueueManager$Worker.run(WorkQueueManager.java:1021)
at com.ibm.ws.util.ThreadPool$Worker.run(ThreadPool.java:1332)
I could not figure out whether there is anything out of the ordinary about this servlet. There is no init() method in this servlet and it extends HttpServlet. Any idea what the reason could be? I am using WebSphere server 6.0.x. How can I get more debugging information in this case?
Well, I still don't know the cause of the above error, but this is how it strangely started working: i) re-applied the fixes recommended by IBM for my WAS version (in particular the IBM JDK upgrade related fix patches), ii) created a new server profile, iii) installed the web application to the new profile, and it started working.
I don't think this is a product issue.
To debug this problem, I would suggest placing a simple servlet (a kind of Hello World) in the application, deploying it to the server, and seeing what happens.
Initialization does not necessarily mean the init() method alone.
If you have a static block in your servlet, or any variables that are initialized, they are all part of the initialization activity.
Look at the FFDC logs that were generated when this error occurred; they should provide you with clues.
As bkail mentioned, also ensure that you have the latest fix packs, just to eliminate known problems with the product.
If the Hello World servlet works, I suggest you post the servlet code here along with the SystemOut and SystemErr logs that correspond to this issue and the relevant FFDC logs, and I am sure most of us will be able to help you out with this.
HTH
Manglu

IIS7/Win7 - App Pool is failing suddenly

After nearly 5 months with this configuration I am now getting a series of:
"A process serving application pool 'Classic .NET AppPool' suffered a fatal communication error with the Windows Process Activation Service. The process id was '1640'."
This leads to:
Application pool 'Classic .NET AppPool' is being automatically disabled due to a series of failures in the process(es) serving that application pool.
I cannot, for the life of me, figure out what changed to start causing this nor can I figure out how to possibly dig in deeper to find out what is causing it to fail.
I recently (2 weeks ago) started adding Entity Framework to my solution. Right before this happened I did get an "out of stack space" error due to a reported self-referenced call. I cannot find any calls like that in the code I wrote and suspect EF may have added a join in my simple (3 table) model that is wrong.
Any ideas on where to start looking? What would cause the AppPool to fail?
TIA
NOTE:
An unhandled exception of type 'System.StackOverflowException' occurred in mscorlib.dll
I have an outside object that calls this method to get a single record:
public static AutoNegotiationDetails GetAutoNegotiationByCompany(Guid companyId)
{
    return RivWorks.Controller.Negotiation.GetAutoNegotiationByCompany(companyId);
}
That method calls into:
internal static AutoNegotiationDetails GetAutoNegotiationByCompany(Guid companyId)
{
    var autoNeg = from a in _dbRiv.AutoNegotiationDetails where a.CompanyId == companyId select a;
    var ret = autoNeg.FirstOrDefault();
    return ret;
}
In stepping through it I can set a break point inside the first method, step into the second method, see the record populated, return to the first method then finally exit the method. At that point my IDE locks up for a few seconds until I get the StackOverflow error.
For a more accurate picture of the whole system:
Running WebOrb30 on the IIS machine.
In the VS IDE -> Attach to Process (INETINFO.exe)
Log into WebOrb30 -> Management Console -> Drill down to service entry point -> Enter CompanyID into input box -> Click Invoke
Hit break point in VS IDE -> (See above)
NOTE:
Looks like it may be caused by another issue in EF. See C# - Entity Framework - An unhandled exception of type 'System.StackOverflowException' occurred in mscorlib.dll for further clarification.
It may be that you have two apps/sites using one app pool, but the apps/sites are running different .NET versions.
This might not be the case, but it's the only similar recurring problem I've ever had with IIS.
Because of a bug in my Entity Framework model I was getting a cyclic call into one of my relationships. This was causing a stack overflow, which was reported to WebOrb as a general error; WebOrb would then halt, causing the App Pool to crash. (I still don't quite understand all the specifics.) When I rebuilt my EF model without the relationships, the behavior went away. (sigh)
EF will be another question (or series of questions).
