Is it possible to set up a nats-streaming-server cluster with a put-get latency of < 1 ms?
I created a 3-node cluster (all nodes residing on the same server), using file storage, e.g.:
# NATS specific configuration
port: 4222
cluster {
  listen: 127.0.0.1:6222
  routes: ["nats://127.0.0.1:6223", "nats://127.0.0.1:6224"]
}

# NATS Streaming specific configuration
streaming {
  id: test-cluster
  store: file
  dir: /srv/nats/store_a
  cluster {
    node_id: "a"
    peers: ["b", "c"]
  }
}
It is taking 2-3 ms per message from async publish to the subscription callback.
Are there any other ways to speed it up while keeping file storage?
Thanks.
I did the same on a MacBook Pro and the latency is around 767 microseconds, if you exclude the first message, during which the channel is being created.
You could get even better results if all streaming servers were to connect to a single central NATS Server, since that would remove the hop between NATS Servers.
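Something like this, for example (a sketch from memory with placeholder file names; double-check the exact flag names against nats-streaming-server --help):

# One plain NATS Server that all nodes share
nats-server -p 4222

# Each streaming node connects to it instead of embedding and clustering its own NATS Server
nats-streaming-server -ns nats://127.0.0.1:4222 -c streaming_a.conf
nats-streaming-server -ns nats://127.0.0.1:4222 -c streaming_b.conf
nats-streaming-server -ns nats://127.0.0.1:4222 -c streaming_c.conf

Each streaming_X.conf would keep only the streaming { ... } block from your config (id, store, dir and the streaming cluster section) and drop the NATS cluster { ... } routes.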
Question
What can cause tasks to be queued in the ThreadPool while there are plenty of threads still available in the pool?
Explanation
Our actual code is too big to post, but here is the best approximation:
long running loop
{
    create Task 1
    {
        HTTP Post request (async)
        Wait
    }
    create Task 2
    {
        HTTP Post request (async)
        Wait
    }
    Wait for Tasks 1 & 2
}
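A hypothetical C# rendering of that pseudocode (placeholder URLs and payloads, not the actual code) could look like this:

using System;
using System.Net.Http;
using System.Threading.Tasks;

class Worker
{
    // Clients are created once, before the loop starts (see the HttpClient notes below).
    private static readonly HttpClient client1 = new HttpClient();
    private static readonly HttpClient client2 = new HttpClient();

    public static async Task RunLoopAsync()
    {
        while (true)
        {
            // Each iteration fires two independent POSTs and waits for both to finish.
            Task task1 = Task.Run(async () =>
            {
                using var response = await client1.PostAsync(
                    "http://1.2.3.4/targetfortask1", new StringContent("payload1"));
                response.EnsureSuccessStatusCode();
            });

            Task task2 = Task.Run(async () =>
            {
                using var response = await client2.PostAsync(
                    "http://1.2.3.4/targetfortask2", new StringContent("payload2"));
                response.EnsureSuccessStatusCode();
            });

            await Task.WhenAll(task1, task2);
        }
    }
}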
The issue is that these HTTP requests, which usually take 110-120 ms, sometimes take up to 800-1100 ms.
Before you ask:
Verified no delays on the server side
Verified no delays on the network layer (tcpdump + Wireshark). When such delays do occur we see pauses between requests; the TCP-level turnaround itself fits within 100 ms
Important info:
We run it on Linux.
This happens only when we run the service in a container on k8s or Docker.
If we move it outside the container, it works just fine.
How do we know it's not ThreadPool starvation?
We have added logging of the values returned by ThreadPool.GetAvailableThreads, and we see values of 32k and 4k for available threads.
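That check boils down to something like this (a minimal sketch):

using System;
using System.Threading;

// Reports the difference between the pool's maximum thread counts and the number
// of threads currently active; large values mean the pool is nowhere near exhausted.
ThreadPool.GetAvailableThreads(out int workerThreads, out int completionPortThreads);
Console.WriteLine($"Available threads: worker={workerThreads} iocp={completionPortThreads}");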
How do we know the tasks are queued?
We run the dotnet-counters tool and see queue sizes of up to 5 within the same second the issue occurs.
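For reference, the numbers come from monitoring the System.Runtime counters (which include the thread pool queue length), roughly like this, with <pid> being our process id:

dotnet-counters monitor --process-id <pid> System.Runtime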
Side notes:
We control the network; we are 99.999% sure it's not the network (because you can never be sure...)
The process is not CPU throttled.
The process usually has 25-30 threads in total at any given time.
When running on k8s/Docker we tried both container and host networking - no change.
HttpClient notes:
We are using this HTTP client: https://learn.microsoft.com/en-us/dotnet/api/system.net.http.httpclient?view=net-6.0
Client instances are created before we launch the loop.
These are HTTP, not HTTPS, requests.
URLs are always the same per task; the server is given as an IP, like this: http://1.2.3.4/targetfortaskX
Generally, using tcpdump and Wireshark we observe two TCP streams being opened and living through the whole execution, and all requests are assigned to one of these two streams with keep-alive. So there are no delays from DNS, TCP SYN, or source-port exhaustion.
I have a question regarding Corda 3.1 and using the network map for the purpose of seeing whether a node is up - is it generally a good idea to use it for that?
Based on these notes (https://docs.corda.net/network-map.html#http-network-map-protocol), since the node polls the network map participants' data (in case our cached data has expired), it should technically be possible to do that.
Do you see any drawbacks to implementing it in that way?
If the node is configured with the compatibilityZoneURL config then it first uploads its own signed NodeInfo to the server (and each time it changes on startup) and then proceeds to download the entire network map. The network map consists of a list of NodeInfo hashes. The node periodically polls for the network map (based on the HTTP cache expiry header) and any new entries are downloaded and cached. Entries which no longer exist are deleted from the node’s cache.
It is not a good idea to use the network map as a liveness service.
The network does have an event horizon parameter. If the node is offline for longer than the length of time specified by the event horizon parameter, it is ejected from the network map. However, the event horizon would usually be days (e.g. 30 days).
Instead, you can just ping the node's P2P port using a tool like Telnet. If you run telnet <node host> <P2P port> and the node is up, you'll see something like:
Trying ::1...
Connected to localhost.
Escape character is '^]'.
If the node is down, you'll see something like:
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
telnet: Unable to connect to remote host
Alternatively, if you want to check liveness automatically from within a flow, you can define a subflow like the one below. This flow will return a boolean indicating whether a given party on the network is offline.
import co.paralleluniverse.fibers.Suspendable
import net.corda.core.flows.FlowLogic
import net.corda.core.flows.InitiatingFlow
import net.corda.core.identity.CordaX500Name
import java.io.IOException
import java.net.InetSocketAddress
import java.net.Socket

@InitiatingFlow
class IsLiveFlow(val otherPartyName: CordaX500Name) : FlowLogic<Boolean>() {
    @Suspendable
    override fun call(): Boolean {
        // Look up the counterparty's advertised P2P address in the network map cache.
        val otherPartyInfo = serviceHub.networkMapCache.getNodeByLegalName(otherPartyName)!!
        val otherPartyP2PAddress = otherPartyInfo.addresses.single()
        return try {
            // Attempt a plain TCP connection to the P2P port with a 1-second timeout.
            Socket().use { socket ->
                socket.connect(InetSocketAddress(otherPartyP2PAddress.host, otherPartyP2PAddress.port), 1000)
                true
            }
        } catch (e: IOException) {
            false
        }
    }
}
We are building an orchestrator within a microservice architecture. We chose WebSockets as the RPC protocol, to set up a streaming pipeline that can be scaled by a WebSocket-capable server like Kestrel. This orchestrator will primarily be running on Linux servers (dockerized).
For admin and monitoring purposes, we plan to use http://dotnetify.net/ to build a reactive Web Admin portal (which could show the number of calculations and clients in semi-realtime, with push notifications).
DotNetify uses SignalR, and we can't use the SignalR layer on top of WebSockets. We require minimal overhead on top of the TCP protocol. WebSocket in itself is a beautiful standard, and lightweight enough, but SignalR adds support for things we don't really need for this (LAN, microservices). We did consider WAMP, but in the proof of concept we will use a plain and simple custom handshake within the WebSocket bus. Another reason: our main backend is IBM AIX, and the RDBMS process engine is a commercial prebuilt binary, so it's very cumbersome (near impossible) to implement the SignalR protocol over there. But we don't have to, because we don't want to.
A possible solution for having [A] "pure" and [B] "SignalR" WebSocket servers within one process is starting multiple Kestrels. I tried this (on Windows and Ubuntu) and it seems to run without problems. I simply used a Task.Run() array, followed by Task.WaitAll(backgroundTasks): one Kestrel with SignalR, one without, running on separate ports.
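Roughly like this (a sketch with hypothetical Startup class names and placeholder ports, not the exact code):

using System.Linq;
using System.Threading.Tasks;
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Hosting;

public static class Program
{
    public static void Main()
    {
        var hosts = new[]
        {
            // Kestrel #1: SignalR / DotNetify admin endpoint
            WebHost.CreateDefaultBuilder()
                .UseKestrel(options => options.ListenAnyIP(5000))
                .UseStartup<SignalRStartup>()
                .Build(),
            // Kestrel #2: plain WebSocket RPC endpoint, no SignalR in the pipeline
            WebHost.CreateDefaultBuilder()
                .UseKestrel(options => options.ListenAnyIP(5001))
                .UseStartup<RawWebSocketStartup>()
                .Build()
        };

        var backgroundTasks = hosts.Select(h => Task.Run(() => h.Run())).ToArray();
        Task.WaitAll(backgroundTasks);
    }
}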
Note: I could not find a proper way to use multiple ports in one Kestrel while excluding SignalR from one of the ports.
My question is: although this seems to run just fine, can anybody confirm that this is safe, especially with libuv and OS signal handling?
You can use SignalR as normal, and just listen for WebSocket connections on a specific path for talking with your AIX (and other back-end) boxes. Just do something like this (taken from the Microsoft Docs):
app.Use(async (context, next) =>
{
    if (context.Request.Path == "/ws")
    {
        if (context.WebSockets.IsWebSocketRequest)
        {
            // Accept the raw WebSocket and hand it to your own handler
            // (Echo is the handler from the docs sample).
            WebSocket webSocket = await context.WebSockets.AcceptWebSocketAsync();
            await Echo(context, webSocket);
        }
        else
        {
            context.Response.StatusCode = 400;
        }
    }
    else
    {
        // Not a raw-WebSocket request; let the rest of the pipeline
        // (including SignalR) handle it.
        await next();
    }
});
I don't see any reason why you would need to start two Kestrel instances. Obviously replace the /ws portion of the path above with whatever endpoint you want to use for hooking up your WebSockets for your backend service.
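The Echo method referenced above isn't shown in the snippet; a minimal version, adapted from the same docs sample (it needs the System.Net.WebSockets, System.Threading and System.Threading.Tasks namespaces), looks roughly like this:

private static async Task Echo(HttpContext context, WebSocket webSocket)
{
    var buffer = new byte[1024 * 4];
    // Keep receiving frames and echoing them back until the client initiates a close.
    var result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
    while (!result.CloseStatus.HasValue)
    {
        await webSocket.SendAsync(new ArraySegment<byte>(buffer, 0, result.Count),
            result.MessageType, true, CancellationToken.None);
        result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
    }
    await webSocket.CloseAsync(result.CloseStatus.Value, result.CloseStatusDescription, CancellationToken.None);
}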
I have 2 systems: system 1 is running Akka and HAProxy, and system 2 is running REST components that make requests to Akka.
Akka runs on port 4241 on system 1. System 2 is able to connect to system 1 when there is no HAProxy. After I installed HAProxy on system 1, requests from system 2 to system 1 error out with the logs below:
ERROR[deal-akka.actor.default-dispatcher-18] EndpointWriter - dropping
message [class akka.actor.ActorSelectionMessage] for non-local
recipient [Actor[akka.tcp://akkaSystemName#Server1IP:42431/]] arriving at
[akka.tcp://akkaSystemName#Server1IP:42431] inbound addresses are
[akka.tcp://akkaSystemName#Server1IP:4241]
HAProxy runs on 42431.
The HAProxy configuration is the following:
listen akka_tcp :42431
    mode tcp
    option tcplog
    balance leastconn
    server test1 Server1IP:4241 check
    server test2 Server1IP:4241 check
The Akka configuration is this:
actor {
  provider = "akka.remote.RemoteActorRefProvider"
}
remote {
  netty.tcp {
    hostname = "Server1IP"
    port = 4241
    transport-protocol = tcp
    # Sets the send buffer size of the Sockets,
    # set to 0b for platform default
    send-buffer-size = 52428800b
    # Sets the receive buffer size of the Sockets,
    # set to 0b for platform default
    receive-buffer-size = 52428800b
    maximum-frame-size = 52428800b
  }
}
Any suggestion would be appreciated.
Updated answer:
Akka Remoting is probably not supposed to work behind a load balancer. Look at this part of its documentation:
Akka Remoting is a communication module for connecting actor systems
in a peer-to-peer fashion, and it is the foundation for Akka
Clustering. The design of remoting is driven by two (related) design
decisions:
1. Communication between involved systems is symmetric: if a system A can connect to a system B then system B must also be able to connect to system A independently.
2. The role of the communicating systems are symmetric in regards to connection patterns: there is no system that only accepts connections, and there is no system that only initiates connections.
The consequence of these decisions is that it is not possible to
safely create pure client-server setups with predefined roles
(violates assumption 2) and using setups involving Network Address
Translation or Load Balancers (violates assumption 1).
For client-server setups it is better to use HTTP or Akka I/O.
For your case it seems reasonable to use Akka HTTP or Akka I/O on system 1 to accept and answer requests from system 2.
Old answer:
You have to set the bind-port property in the Akka configuration. Here is the quote from the Akka documentation:
# Use this setting to bind a network interface to a different port
# than remoting protocol expects messages at. This may be used
# when running akka nodes in a separated networks (under NATs or docker containers).
# Use 0 if you want a random available port. Examples:
#
# akka.remote.netty.tcp.port = 2552
# akka.remote.netty.tcp.bind-port = 2553
# Network interface will be bound to the 2553 port, but remoting protocol will
# expect messages sent to port 2552.
For your ports it would look like this:
port = 42431
bind-port = 4241
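Placed in the netty.tcp block from your config, that would be something like:

remote {
  netty.tcp {
    hostname = "Server1IP"
    # The port remote systems address (HAProxy's front-end port)
    port = 42431
    # The port this node actually binds to locally (HAProxy's back-end target)
    bind-port = 4241
  }
}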
I'm playing with these basic TCP test scripts and would like to know: how do I get the IP address of clients connecting to the server?
Any ideas? I tried to probe a client sub-port on the server side, but it doesn't show the remote IP.
Can someone give me hints on gathering that information? I know how it works in Rebol 2, but I'm not familiar with the Rebol 3 port model.
You can obtain that information by calling QUERY on the client port!, which will return an object with remote-ip and remote-port fields.
Here's a simple example illustrating this: a small service that listens for connections on port 9090 and prints the address of each client connecting to it:
rebol []

awake-server: func [event /local client info] [
    if event/type = 'accept [
        ; a new client connected: taking FIRST on the listen port yields the client port!
        client: first event/port
        ; QUERY returns an object with remote-ip and remote-port fields
        info: query client
        print ajoin ["Client connected: " info/remote-ip ":" info/remote-port]
        close client
    ]
]

serve: func [endpoint /local listen-port] [
    listen-port: open endpoint
    listen-port/awake: :awake-server
    wait listen-port
]

serve tcp://:9090
The system/standard/net-info object includes two values - local-ip and remote-ip. I'm not sure whether they get set, though.
Give system/standard/net-info/remote-ip a try, and if it contains none, I would suggest submitting a bug report.