We're using SignalR for real-time message pushing, and I've read and come to realize that SignalR is not reliable, nor does it claim to be.
But nowhere I've looked does it state where in the pipeline a message can actually be lost. WebSockets run over TCP, and TCP can (assuming no lost connection) guarantee delivery. So what step in the process, from the message arriving on a socket to it being handled by the client, is the "unreliable" part? I.e., where do I need to put my own reliability layer?
I am developing a cloud-based back-end HTTP service that will be exposed for integration with some on-prem systems. The client systems are custom-made by external vendors; they are back-end systems with their own databases. These systems are deployed at our clients' companies; we don't have access to them and don't control them. We provide vendors with our API specifications and they implement the client code.
The data format my service exchanges with clients is based on XML and follows a certain standard. Vendors implement their client systems in different programming languages, and new vendors will appear over time. I want as many clients as possible to be able to work with my service.
Most of my service API is REST-like: it receives HTTP requests, processes them, and sends back HTTP responses.
Additionally, my service accumulates some data state changes and needs to regularly push this data to the client systems. Because of the limitations below, this use case does not seem to fit the traditional client-server HTTP request-response model.
Due to the nature of the business, the client systems cannot afford to have their own HTTP API endpoints open, so my service can't establish an outbound HTTP connection to them to deliver data state notifications. I.e., WebHooks are not an option.
At the same time, my service stakeholders need a recorded acknowledgment that data state notifications were accepted by the client system, so fire-and-forget systems like Amazon SNS don't seem to apply.
I was considering a few approaches to this problem, but I'm not sure whether I'm missing some simple options or some technologies that already address the problem. Hence this question.
The question text was updated: options moved to my own answer.
Related questions and resources
REST API with active push notifications from server to client
Is ReST over websockets possible?
Can we use Web-Sockets for Communication between Microservices?
What is difference between grpc and websocket? Which one is more suitable for bidirectional streaming connection?
https://www.smashingmagazine.com/2018/02/sse-websockets-data-flow-http2/
I eventually found answers to my question, both myself and with some help from my team. For people like me who come here asking "how do I arrange notification delivery from my service to its clients", here's an overview of the available options.
WebHooks
This is when the client opens an endpoint itself. The service calls the client's endpoint whenever it has a notification to deliver. This way the client also acts as a service, so the client and the service swap roles during notification delivery.
With WebHooks the client must be able to open an endpoint at a well-known address. This is complicated if the client's software works behind NAT or a firewall, or if the client is a browser or a mobile application.
The service needs to be prepared for the client's WebHook endpoints not always being online and not always being healthy.
Another issue is flow control: special measures should be taken in the service not to overwhelm the client with a high volume of connections, requests and/or data.
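For illustration, here is a minimal sketch of a client-side WebHook endpoint, assuming an Express server, a hypothetical /notifications path, and the XML payload format mentioned above; the real contract would come from the service's API specification.

```typescript
// Minimal WebHook receiver sketch (the endpoint path and processing
// function are assumptions). The client system exposes this endpoint
// and the service POSTs notifications to it.
import express from "express";

const app = express();
app.use(express.text({ type: "application/xml" })); // payload is XML per the question

app.post("/notifications", (req, res) => {
  try {
    processNotification(req.body);  // client-specific processing
    res.status(200).end();          // 200 = the recorded acknowledgment the service wants
  } catch (err) {
    res.status(500).end();          // tells the service to retry later
  }
});

function processNotification(xml: string): void {
  // hypothetical: parse the XML and store it in the client's own database
}

app.listen(8080);
```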
Polling
In this case the client is still the client and the service is still the service, unlike with WebHooks. The service offers an endpoint where the client can continuously request new notifications. The advantage of this option is that it does not change the connection direction or the request-response direction, so it works well with HTTP-based services.
The caveat is that the polling API needs reasonably rich semantics to be reliable if loss of notifications is not acceptable. Good examples are Google Pub/Sub pull and Amazon SQS.
Here are a few considerations (a client-side sketch follows the list):
Receiving and deleting a notification should be separate operations. Otherwise, if the service deletes the notification just before handing it to the client and the client fails to process it, the notification is lost forever. When deletion is a separate operation, the client is forced to delete explicitly, which normally happens after successful processing.
If the client has received a notification but not yet deleted it, it may be undesirable to let the same notification be processed by another actor (perhaps a concurrent process of the same client). Therefore the notification should be hidden from subsequent receives once it has been received.
If the client fails to delete the notification within a reasonable time because of an error, network loss or a process crash, the service has to make the notification visible for receiving again. This is the retry mechanism that allows the notification to ultimately be processed.
If the service has no notifications to deliver, it should block the client's call for some time instead of returning an empty response immediately. Otherwise, if the client polls in a loop and the response comes back immediately, the loop iterations become short and clients make excessive requests to the service, increasing network traffic, parsing load and request counts. A nice-to-have feature is for the service to unblock and respond to the client as soon as a notification appears for delivery. This is often called "long polling".
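Here is a client-side sketch of the considerations above, assuming hypothetical receive and delete endpoints; real APIs such as Amazon SQS or Google Pub/Sub pull follow the same shape.

```typescript
// Long-polling loop sketch with separate receive and delete operations.
// The endpoint paths and response shape are assumptions, not a real API.
interface Notification { id: string; body: string; }

async function pollForever(baseUrl: string): Promise<void> {
  while (true) {
    // Long poll: the service holds this request open until a
    // notification appears or a server-side wait timeout elapses.
    const res = await fetch(`${baseUrl}/notifications?waitSeconds=20`);
    if (res.status === 204) continue; // empty long poll, poll again
    const notifications: Notification[] = await res.json();

    for (const n of notifications) {
      await process(n); // client-side processing
      // Delete only after successful processing; if we crash before this
      // line, the notification becomes visible again and is redelivered.
      await fetch(`${baseUrl}/notifications/${n.id}`, { method: "DELETE" });
    }
  }
}

async function process(n: Notification): Promise<void> {
  // application-specific handling
}
```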
HTTP Server-sent Events
With HTTP server-sent events the client opens an HTTP connection and sends a request to the service; the service can then send multiple events (notifications) instead of a single response. The connection is long-lived and the service can send events as soon as they are ready.
The downside is that the communication is one-way: the client has no way to inform the service whether it successfully processed an event. Because this feedback is absent, it may be difficult for the service to control the rate of events to avoid overwhelming the client.
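A minimal browser-side sketch using the standard EventSource API follows; the URLs are assumptions, and the acknowledgment call illustrates that any feedback has to travel out-of-band over a separate HTTP request.

```typescript
// Server-sent events sketch using the standard EventSource API.
// The URLs below are assumptions for illustration.
const source = new EventSource("https://service.example.com/events");

source.onmessage = (event: MessageEvent) => {
  handleNotification(event.data);
  // Hypothetical acknowledgment channel: SSE itself is one-way, so a
  // confirmation must go back as a plain HTTP request.
  fetch("https://service.example.com/acks", { method: "POST", body: event.data });
};

source.onerror = () => {
  // EventSource reconnects automatically; events sent while disconnected
  // are lost unless the service replays them (e.g., via Last-Event-ID).
};

function handleNotification(data: string): void {
  // application-specific handling
}
```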
WebSockets
WebSockets were created to enable arbitrary two-way communication, so this is a viable option for the service to send notifications to the client. The client can also send processing confirmations back to the service.
WebSockets have been around for a while and should be supported by many frameworks and languages. A WebSocket connection begins as an HTTP/1.1 connection, so WebSockets over HTTPS should be supported by many load balancers and reverse proxies.
WebSockets are often used with browser and mobile clients and more rarely in service-to-service communication.
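A minimal client sketch using the standard WebSocket API; the URL and message shapes are assumptions. Note how the acknowledgment travels back on the same connection, which SSE cannot offer.

```typescript
// Two-way WebSocket sketch: the service pushes notifications and the
// client sends explicit acknowledgments back on the same connection.
const ws = new WebSocket("wss://service.example.com/notifications");

ws.onmessage = (event: MessageEvent) => {
  const notification = JSON.parse(event.data as string);
  handleNotification(notification);
  // Processing confirmation goes back over the same socket.
  ws.send(JSON.stringify({ type: "ack", id: notification.id }));
};

ws.onclose = () => {
  // Reconnection is the application's job; missed notifications must be
  // replayed by the service (e.g., keyed off the last acknowledged id).
};

function handleNotification(n: { id: string }): void {
  // application-specific handling
}
```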
gRPC
gRPC is similar to WebSockets in that it enables arbitrary two-way communication. The advantage of gRPC is that it is centered around protocol and message format definition files. These files are used for code generation, which is essential for client and service developers.
gRPC is used for service-to-service communication, and it is also supported for browser clients with grpc-web.
gRPC is supported in multiple popular programming languages and platforms, yet the support is narrower than for plain HTTP.
gRPC works on top of HTTP/2, which might cause difficulties with reverse proxies and load balancers around things like TLS termination.
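For a flavor of what two-way notification delivery looks like with gRPC, here is a sketch using @grpc/grpc-js with dynamic proto loading; the proto file, package name, service name and message fields are all assumptions for illustration.

```typescript
// Bidirectional-streaming sketch with @grpc/grpc-js and @grpc/proto-loader.
// Assumed proto (notifications.proto, package "example"):
//   service Notifications {
//     rpc Subscribe (stream Ack) returns (stream Notification);
//   }
import * as grpc from "@grpc/grpc-js";
import * as protoLoader from "@grpc/proto-loader";

const def = protoLoader.loadSync("notifications.proto");
const pkg = grpc.loadPackageDefinition(def) as any;

const client = new pkg.example.Notifications(
  "service.example.com:443",
  grpc.credentials.createSsl()
);

const call = client.Subscribe(); // duplex stream
call.on("data", (notification: { id: string }) => {
  // process the notification, then confirm on the same stream
  call.write({ id: notification.id });
});
call.on("error", (err: Error) => {
  // reconnect/retry logic goes here
});
```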
Message queue (PubSub)
Finally, the service and the client can use a message queue as the delivery mechanism for notifications. The service puts notifications on the queue and the client receives them from the queue. A queue can be provided by one of many systems such as RabbitMQ, Kafka, Celery, Google Pub/Sub, Amazon SQS, etc. There is a wide choice of queuing systems with different properties, and choosing one is a challenge of its own. The queue can also be emulated, for example by using a database.
It has to be decided between the service and the client who owns the queue, i.e. who pays for it. Either way, the queuing system and the queue should be available whenever the service needs to push notifications to it, otherwise notifications will be lost (unless the service buffers them internally, with another queue).
Queues are typically used for service-to-service communication, but some technologies also allow browsers as clients.
It is worth noting that an "implicit" internal queue might be used on the service side in the other options listed above. One reason is to prevent loss of notifications when there is no client available to receive them. There are many other good reasons, such as letting clients handle notifications at their own pace, maximizing processing throughput, and handling spiky traffic with fixed capacity.
In this option the queue is used "explicitly" as the delivery mechanism: the service does not put any other mechanism (an HTTP, gRPC or WebSocket endpoint) in front of the queue and lets the client receive notifications from the queue directly.
Message passing is a popular way to organize microservice communication.
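As a concrete example of explicit queue delivery, here is a consumer sketch using the AWS SDK v3 SQS client; the queue URL is an assumption. Note how it embodies the same receive/delete and visibility semantics discussed under Polling.

```typescript
// Explicit-queue consumer sketch: receive with a visibility timeout,
// process, then delete. The queue URL is a placeholder.
import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: "us-east-1" });
const QueueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/notifications";

async function consume(): Promise<void> {
  while (true) {
    const { Messages } = await sqs.send(
      new ReceiveMessageCommand({
        QueueUrl,
        MaxNumberOfMessages: 10,
        WaitTimeSeconds: 20,   // long polling
        VisibilityTimeout: 60, // hide from other consumers while processing
      })
    );
    for (const m of Messages ?? []) {
      await process(m.Body!);
      // Delete only after successful processing (receive and delete
      // are separate operations, as discussed under Polling).
      await sqs.send(
        new DeleteMessageCommand({ QueueUrl, ReceiptHandle: m.ReceiptHandle! })
      );
    }
  }
}

async function process(body: string): Promise<void> {
  // application-specific handling
}
```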
Common considerations
With all options it has to be decided whether the loss of notifications is tolerable for the service, the client and the business. Some simpler technical choices are possible if it is acceptable to lose notifications due to processing errors, unavailability, etc.
It is valuable to have monitoring for client processing errors on the service side. This way service owners know which clients are failing without having to ask them.
If a queue is used (implicitly or explicitly), it is valuable to monitor the length of the queue and the age of the oldest notification. This lets service owners judge how stale the client's data may be.
If delivery is organized so that a notification is deleted only after successful processing by the client, the same notification can get stuck in an infinite receive loop when the client repeatedly fails to process it. Such a notification is sometimes called a "poison message". Poison messages should be removed by the service or the queuing system to prevent clients from being stuck in an infinite loop. A common practice is to move poison messages to a special place, sometimes called a "dead letter queue", for later human intervention.
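A tiny sketch of the poison-message policy described above, with an illustrative in-memory queue; real queuing systems implement the same idea as configuration (e.g., SQS redrive policies with a maximum receive count).

```typescript
// Poison-message handling sketch: after a maximum number of failed
// receives, park the notification on a dead letter queue instead of
// redelivering it forever. Types and queues are illustrative assumptions.
interface Notification { id: string; body: string; receiveCount: number; }

const MAX_RECEIVES = 5;

function redeliver(
  n: Notification,
  mainQueue: Notification[],
  deadLetterQueue: Notification[]
): void {
  n.receiveCount += 1;
  if (n.receiveCount >= MAX_RECEIVES) {
    // Poison message: park it for later human intervention so clients
    // don't spin in an infinite receive loop.
    deadLetterQueue.push(n);
  } else {
    mainQueue.push(n); // make it visible for receiving again (retry)
  }
}
```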
One alternative to WebSockets for the problem of server→client notifications with acks from the client seems to be gRPC.
It supports two-way communication between server and client in bidirectional streaming mode.
It works on top of HTTP/2. In our case, operating over HTTP ports is essential.
There are client and server code generators for multiple popular languages and platforms. A nice thing is that I can share the protocol definition file with vendors and be sure that my service and their clients talk the same language.
Drawbacks:
Not as many languages and platforms are supported compared to HTTP. Alternative C from the question will be more accessible if based on HTTP/1.1. WebSockets have also been around longer, and I would expect broader adoption than for gRPC.
Not all gRPC implementations currently seem to support XML as a data format, according to the FAQ. To transport XML, my service and its clients will have to transfer the XML message as a byte array inside a gRPC protobuf message (see the sketch after this list).
With gRPC, TLS termination cannot be done on a general-purpose HTTP/1.1 load balancer. An application-layer, HTTP/2-aware reverse proxy (load balancer) such as Traefik is required.
There are approaches like this and this to allow HTTP/1.1-compatible protocols, but they have their own restrictions, like a limited choice of available clients or necessary client customizations.
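To illustrate the XML-as-bytes workaround from the second drawback: the XmlEnvelope message below is an assumption, and the encoding uses the standard TextEncoder/TextDecoder APIs.

```typescript
// Sketch of transporting XML as an opaque byte array inside a protobuf
// message. Assumed proto:
//   message XmlEnvelope {
//     bytes xml_payload = 1;
//   }
// Both sides parse the payload with their own XML tooling.
const xml = `<notification id="42"><state>shipped</state></notification>`;

// Encode to bytes before putting it in the protobuf bytes field...
const payload: Uint8Array = new TextEncoder().encode(xml);

// ...and decode on the receiving side.
const decoded: string = new TextDecoder("utf-8").decode(payload);
```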
I read the gRPC Core concepts, architecture and lifecycle, but it doesn't go into the depth I'd like to see. There is the RPC call, the gRPC channel, the gRPC connection (not described in the article) and the HTTP/2 connection (not described in the article).
I'm interested in knowing how these come together. For example, what happens to the channel when an RPC throws an exception? What happens to the gRPC connection when the channel is closed? When is the channel closed? When is the gRPC connection closed? Heartbeats? What if the deadline is exceeded?
Can anyone answer these questions, or point me to resources that can?
The connection is not a gRPC concept. It is not part of the normal API and is an implementation detail. This should be seen as fairly normal, like HTTP libraries providing details about HTTP exchanges but not exposing connections.
It is best to view RPCs and connections as two mostly-separate systems.
The only real guarantee is that "connections are managed by channels," for varying definitions of "managed." You must shut down channels when no longer used if you want connections and other resources to be freed. Other details are either an implementation detail or an advanced API detail.
There is no "gRPC connection." A "gRPC connection" would just be a standard "HTTP/2 connection." Except that is even an implementation detail of the transport in many gRPC implementations. That allows having alternative "connection" types like "inprocess" or QUIC (via Cronet, where there is not a classic "connection" at all).
It is the channel's job to hold all the connections and reconnect as necessary. It delegates part of that responsibility to load balancers and the load balancing APIs do have a concept of connections (subchannels). By not exposing connections to the application, load balancers have a lot of freedom to operate.
I'll note that gRPC C-core based implementations share connections across channels.
What happens to the channel when an RPC throws an exception?
The channel and connection are not impacted by a failed RPC. Note that connection-level failures typically cause RPCs to fail, but things like retries can allow the RPC to be re-sent on a new connection.
What happens to the gRPC connection when the channel is closed?
The connections are closed, eventually. Channel shutdown isn't instantaneous because existing RPCs can continue, and connection shutdown isn't instantaneous either. But once all RPCs complete, the connections are closed. (Although C-core won't shut down a connection until no channels are using it.)
When is the channel closed?
Only when the user closes it.
When is the gRPC connection closed?
Lots of times. The client may close a connection when it is no longer needed. For example, let's say the server IP address changes and the client needs to connect to 1.1.1.2 instead of 1.1.1.1. A new connection will be created and new RPCs will go to the new IP address. The client may also close connections it thinks are dead (e.g., via keepalive timeouts).
Servers have a lot of say in when to close connections. They may close them simply because they are old, or because they have been idle, or because the server is overloaded. But those are simply use cases; the server can shut down a connection at will.
What if the deadline is exceeded?
The deadline only applies to RPCs and doesn't impact the channel or a connection.
I was actually waiting for Eric to answer this as he is the expert in this!
I also have been playing with gRPC for a while now, and I would like to add a few things here for beginners. Anyone more experienced, please feel free to edit!
A channel is an abstraction over a long-lived connection! The client application creates a channel on start-up. The channel can be reused/shared among multiple threads; it is thread safe. One channel is enough (for most use cases) for multiple threads and for multiplexing concurrent requests. It is the channel's responsibility to close / reconnect / keep the connection alive, etc. We as users do not have to worry about this in general. The client application can close the channel any time it wants. Channel creation is an expensive process, so we would not open/close one for every RPC.
When you use a gRPC load balancer/name resolver for a domain name and the name resolver resolves the domain to multiple IP addresses, the channel creates multiple subchannels, where each subchannel is an abstraction over a connection to one server. So a channel can also represent multiple connections!
Adding some points to note from Eric's comment.
adding the default load balancer still only creates (approximately) one connection if the name resolver returns multiple addresses, as the default is pick_first. But if you change the load balancer to round_robin or virtually any other policy, then yes, there will be multiple connections in a channel. Even if a name resolver returns one address, the load balancer is free to create multiple connections (e.g., for higher throughput), but that's not common today
An underlying connection can be closed at any time for any reason. For example, the remote server is shutting down gracefully for scheduled maintenance, or a connection has been idle for a long time. In that case the server can send a GOAWAY signal to the client, and the client might disconnect and reconnect to some other server. Or the server might crash due to an OOM error, in which case the channel detects the connection failure and retries the connection against some other server, etc.
A channel can keep sending PING frames to the server to keep the connection alive. These are all configurable via the channel builder.
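To make the keepalive point concrete, here is a configuration sketch using @grpc/grpc-js channel options (the Node counterpart of the Java channel builder); the option keys are real grpc-js channel options, while the target address is an assumption.

```typescript
// Keepalive configuration sketch for a Node gRPC client.
import * as grpc from "@grpc/grpc-js";

const channelOptions: grpc.ChannelOptions = {
  // Send an HTTP/2 PING every 30s to keep the connection alive...
  "grpc.keepalive_time_ms": 30_000,
  // ...and consider the connection dead if no reply arrives within 10s.
  "grpc.keepalive_timeout_ms": 10_000,
  // Allow pings even when there are no active RPCs.
  "grpc.keepalive_permit_without_calls": 1,
};

// Create the client once at start-up and share it: the underlying channel
// is safe to share and multiplexes concurrent RPCs over HTTP/2.
const client = new grpc.Client(
  "service.example.com:443",
  grpc.credentials.createSsl(),
  channelOptions
);

// On application shutdown, close the channel to free connections:
// client.close();
```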
With this information in mind, if we look at your questions:
What happens to the channel when an RPC throws an exception?
Nothing happens to the channel. An unhandled exception on the server might fail the RPC on the client side, but the channel is still usable for any RPC calls.
What happens to the gRPC connection when the channel is closed?
The channel is an abstraction over the connection, so the connection will be closed. (Again, there is no "gRPC connection" as such, as Eric mentioned; it would be an HTTP/2 connection.)
When is the channel closed?
Any time you want. But normally when the application shuts down.
When is the gRPC connection closed?
It is not our problem; the channel takes care of this.
Heart beats?
The channel sends PING frames periodically to keep the connection alive.
What if the deadline is exceeded?
It is something like a timeout on the client side. When the deadline is exceeded, the client might cancel the request. Once again, nothing happens to the channel. (It might trigger an exception on the server side, which I have noticed a few times: "Received DATA frame for an unknown stream", https://github.com/grpc/grpc-java/issues/3548. It seems to have been fixed now.)
How reliable is SignalR Backplane with regard to whether all messages will reach all subscribed nodes? Is it using a reliable protocol underneath, or is there a chance that a message can get lost?
Obviously it can happen that (for example) due to some network issue one node is down for some time. When it becomes reachable again, SignalR Backplane will deliver all intermediate messages. This is at least what I understand from davidfowl:
[...] This is VERY important! SignalR is NOT reliable messaging, it's a connection abstraction. We may buffer messages for longpolling but you cannot rely on the messages being there for ever. If you have important messages you need to persist, then persist them.
But how long is "forever" in this context? Can it be quantified/configured?
Are there other scenarios to consider if a reliable system is to be built on top of SignalR Backplane?
From what I have read, a SignalR client should not miss any messages from the server while it is connected. This does not seem to be the case when using long polling.
I have a straightforward hub-based application using SignalR 1.1.2. When using SSE, if the network cable is unplugged and plugged back in within the timeout period, both the client and the server are notified that a reconnect has occurred and, as far as I can tell, no messages are missed. When using long polling, this seems to happen:
When the connection is created ($.connection.hub.start()) the OnConnected method is called in the hub and the client goes into connected state.
If I then unplug the network cable and pop it back in quickly, there is no call to OnDisconnected or OnConnected. No messages are missed. Any messages waiting on the server are subsequently sent to the client. OK so far.
If I unplug the network cable and let the long poll expire, I get a call to OnDisconnected. There is no state change on the client.
If I plug the network cable back in the client starts receiving messages again. There has been no notification on the client that it has been disconnected, but the client has missed some messages. There is no call to OnReconnected or OnConnected on the server.
Is this a bug? The behaviour seems very different between SSE and long polling.
Is there a recommended strategy to ensure that the client does not miss messages in this scenario? I could keep track of connection IDs on the server and send periodic pings from the client: if I get a ping after an OnDisconnected, I could send a message telling the client to resync, but this doesn't seem like the right thing to do.
Any suggestions?
WebSockets, Server-Sent Events, and Forever Frame all utilize a client-side keep-alive which is used to ensure client connectivity. However, Long Polling does not utilize the client-side keep-alive feature, due to technical limitations, and has no guarantee of connectivity for events such as pulling the network cable out.
When I say no guarantee, I'm simply stating that the Long Polling transport can no longer be ensured by SignalR; it instead relies on the browser to trigger the correct events on Long Polling's AJAX connection (to which SignalR can then respond).
Keep in mind though, if the client does happen to regain connectivity with the server after pulling out the network cable, it will receive any messages that it missed during its downtime. So messages are not missed, they're just delayed.
Lastly, in the case that the server does not see the client for an extended period of time, the OnDisconnected event WILL be triggered. For this to happen in a situation such as pulling the network cable out, the server will first time out the current connection's request and then time out the connection itself. This means that you can still rely on the OnDisconnected event; it may just be delayed based on network conditions.
Soooo what you're seeing is 100% by design =)
Hope this helps!
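To translate this into client code: a common pattern with the SignalR 1.x jQuery client is to restart the connection after a disconnect and resynchronize state on reconnect. In this sketch, resyncFromServer is a hypothetical application hook.

```typescript
// Client-side sketch for the SignalR 1.x jQuery client: restart after a
// disconnect and resynchronize on reconnect, since Long Polling cannot
// guarantee the client noticed every gap.
declare const $: any; // jQuery with the SignalR 1.x client plugin loaded

$.connection.hub.disconnected(() => {
  // Wait a bit before restarting so we don't hammer the server.
  setTimeout(() => $.connection.hub.start(), 5000);
});

$.connection.hub.reconnected(() => {
  // Messages may have been missed while the poll was broken:
  // re-fetch authoritative state from the server.
  resyncFromServer();
});

function resyncFromServer(): void {
  // hypothetical: pull current server data or missed messages
}
```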
Say my network connection drops for a few seconds and I miss some SignalR server-pushed messages.
When I regain network connectivity, are the messages I missed lost? Or does SignalR handle them and push them out when I reconnect?
If it can't handle missed messages, what is the recommended approach for ensuring consistency?
Periodically (every 2-3 minutes) poll to check the server data?
Somehow detect loss of network on the client side and make an AJAX call to get the data when the network is restored?
Something else?
Here are a couple of thoughts:
If you aren't sending a lot of messages per second, consider sending no data in the messages themselves. Instead, each message is just a "ping" telling the clients to go get the server data when they can. Combine that with a periodic poll, as you said, and you can be assured that you won't miss messages; they just might be delayed.
If you are sending a lot of messages quickly, how about adding a sequential ID to each one? Think of a SQL identity column. Your clients would need to keep track of the most recent ID received. After a network reconnect, the client can ask for all messages since [last ID]. If a message arrives whose ID is not contiguous with the most recently received one, you know that there was a disconnect and can ask the server for the missing information (see the sketch below).
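Here is a sketch of that sequential-ID scheme; the message shape and the resend request are assumptions.

```typescript
// Sequential-ID gap detection sketch: track the last ID seen and ask the
// server for the gap when messages arrive out of order.
let lastId = 0;

function onMessage(msg: { id: number; data: string }): void {
  if (msg.id > lastId + 1) {
    // Gap detected: messages lastId+1 .. msg.id-1 were missed during a
    // disconnect. Ask the server to resend them.
    requestRange(lastId + 1, msg.id - 1);
  }
  lastId = msg.id;
  handle(msg.data);
}

function requestRange(from: number, to: number): void {
  // hypothetical: an AJAX call or hub method asking the server to
  // resend messages with IDs in [from, to]
}

function handle(data: string): void {
  // application-specific handling
}
```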