If StackOverflow is the wrong Exchange for this question, please help direct me to the correct one.
Short Version
What is the best design for a networking application in which one user transmits a constant, high-bandwidth stream of data to many other addresses? The solution must not require the uploader to duplicate the packets for each recipient and preferably will not transmit to users that have not been accepted by the transmitter.
Long Version
A friend and I have written an application that enables someone to transmit data in real time to one or more recipients that he wants to receive the data. I have designed the high-level application protocol to use UDP and to encode the data so that each packet can be lost without hurting the use of the rest. This solution requires managing sockets with each user and sending each packet to every user.
The problem here is that the stream can be very high bandwidth. The user can modify the settings for how high quality the data he is sending should be, and can end up sending 6 Mbps to each user. It is unfeasible to expect a user to pay his ISP enough to be allowed to upload such a stream to the preferred minimum of four other users at a time.
We need a way for the transmitter to send a packet exactly once and have each user receive a copy.
We have looked at multicasting. It may be what we need to use in the end, but we are concerned about the fact that anyone can join any group. It would be preferable to not allow users we do not want to see the data to not be allowed to join in. There is also the problem that if multiple transmitters happen to use the same group, viewers may find that they are receiving multiple streams' worth of data when they only want one.
My searching has revealed something IBM published over a decade ago called Explicit Multicast (Xcast) that looks perfect, but I have yet to find any information to determine whether this technology is commonly supported. Also, I have not yet seen whether it supports datagrams.
Does anyone know the best way to design an application that meets our needs?
Please keep in mind that we have no funds to support our project. Solutions need to be free.
Edit
In the summary above, I hinted at but failed to explicitly state that this is for a real-time application. The motivating drive behind the application is to keep the clients/recipients as close together in time as is possible. If packets are lost or arrive too late to be used in keeping the server and clients in phase, they need to be disregarded. That is why I designed the application protocol on top of UDP with independent data in each packet. Even if a client receives only one packet out of 300 for a given time step, it will use what it did get.
I think that I_am_Helpful's recommendation may be a good step in the right direction (or possibly the destination). I need to do some experimentation to determine whether using a system like Spread will work. However, I do not think I can budget more than additional 17 ms in transmission time.
If you can think of a system that enables sending unreliable datagrams to a specific group of users (like Spread) for a real-time application (unlike Spread, see p. 3), please let me know about it.
We need a way for the transmitter to send a packet exactly once and
have each user receive a copy.
In my limited knowledge, I would say that Reliable Multicasting appears to be one of the viable option for broadcasting in the group. I would like to mention that there are some of the possible Java API's* which could help you achieving the same :
JGroups Java API
The Spread Toolkit -> Spread consists of a library that user applications are linked with, a binary daemon which runs on each computer that is part of the processor group, and various utility and demonstration programs.
Appia
*NOTE : I have never worked with these API's.
It would be preferable to not allow users we do not want to see the
data to not be allowed to join in.
They do provide this feature, e.g., Spread supports thousands of groups with different sets of members. It also provides a range of reliability, ordering and stability guarantees for messages. JGroups can be used to create groups of processes whose members can send messages to each other. It also has facilities like group creation and deletion(Group members can be spread across LANs or WANs).
There is also the problem that if multiple transmitters happen to use
the same group, viewers may find that they are receiving multiple
streams' worth of data when they only want one.
When you could easily create multiple groups in the same network(using Spread,etc.), then, I believe that would no longer be an issue. It is your responsibility to declassify users into different groups.
I hope the given information helps. Good LUCK.
Via multicast you achieve exactly you want: sending each packet once, but authentication seems to be a concern for you.
One possible solution could be simetric cryptography, where the same key is used to encrypt and decrypt. Via TCP your clients connect to a server and fetch the multicast IP Address of the transmission and its associated key, then they join the multicast group and decrypt the incomming transmission.
If you accept a more flexible solution, you could have a server which sends a transmission in real time to a set of distributed servers. Your clients connect to one of these distributed servers via unicast, and after authentication is done, they are inluded in a list of receivers. Each distributed server sends each new transmission packet to each registered client via UDP. in ordinary situations your clients would have the same experience as if it was delivered in a multicast group, but the servers will spend far more bandwidth. Multiple transmission at a time will be allowed, so it could be good for you, and you can have more control, as clients can send signals to the servers, like PAUSE, and etc.
Related
Consider a user who is using a service (say an app backend) and routing their connection through an intermediary proxy and/or vpn. Specifically let’s assume the user is in Shanghai-China, the proxy is in the Dallas-Texas and the backend is on AWS. In theory, compared to a user who actually lives in Dallas-Texas (on the same network) the Shanghai-China user will have additional latency in sending/receiving events due to the Asia<-> USA trip.
Questions:
Are there known/published methodologies for seeing this additional latency and thereby identifying imposters from far away? The simplest I can think of is grouping by isp providers and then looking for outliers in latency.
Are there additional ways to honeypot such users? I’m not a network export but I think various sorts of media (eg video streaming) get different treatment on these networks so I’m wondering if it is possible to send additional event data to honeypot more precise latency anomalies.
Assumptions:
We can assume that we have plenty of user data, from each network provider. We also have streams of event data that includes client and server side timestamps for sending and receiving data.
I’m strictly interested in identifying users who are very far away from the IP source. I am NOT interested in methodologies that strictly try to classify an IP as a VPN (eg what Maxmind does in the above link).
Since you have stated you are not interested in the VPN or Relay identification aspect of it, but only latency detection for "far away" users, I will offer some ideas:
Run your own HTTP(S) measurement server that all clients must run an HTTP ping against. HTTP round-trips time in milliseconds will be acceptable for broad-strokes classification of a user's "distance" from your server - assuming that the initial TCP handshake has been completed by the client, all intermediaries, and your measurement server ("pre-warmed" connection).
Use an IP geolocation API. These will give you the country code (almost always accurate), and approximate latitude, longitude (broadly accurate) that you may use to calculate the distance from your server. This of course assumes that the public IP of the client is visible to you, and not completely obfuscated by the intermediaries.
I want to fetch the data of a stock. Since the data changes very fast, is there any way to pull the data like 50-100 times a second from trading websites?
And can we implement that using a raspberry Pi 4 8gig model.
RasPi4 should be more than adequate for this task. Both the ethernet and WiFi hardware is capable of connections at these speeds. (Unless you’re running a bunch of other stuff on it.) Consider where your bottlenecks may be, likely ISP or other network traffic). Consider avoiding WiFi in favor of cat5e or cat6. Consider hanging this device off your router (edge) to keep lan traffic lower and consider QOS settings if you think this traffic may compete with other lan traffic.
This appears to be a general question with no specific platform in mind. For stocks, there are lots of platforms to choose from.
APIs for trading platforms often include a method to open a stream. Instead of a full TCP conversation for each price check, a stream tells the server to just keep on sending data. There are timeout mechanisms of course, but it is good to close that stream gracefully (It’s polite since you’re consuming server resources at a different scale. I’ve seen some financial APIs monitor and throttle stream subscribers who leave sessions open.).
For some APIs/languages you can find solid classes already built on GitHub. Although, if simply pulling and reading a stream then the platform API doc code snippets should be enough to get you going.
Be sure to find out what other overhead may be implicated. For example, if an account or API key is needed to open a stream then either a session must be opened first or the creds must be passed with the stream being opened. The API docs will say. If you’re new to this sort of thing, just be a detective and try to infer what is needed. API docs usually try to be precise and technically correct with the absolute minimum word count.
Simply checking the steam should be easy. Depending on how that steam can be handled by your code/script, it may be harder to perform logic on the stream while it is being updated. That’s usually a thread issue or a variable scope issue depending on the script/code. For what you’re doing I would consider Python or PowerShell depending on your skill-set and other design parameters.
I want multiple IoT devices (Say 50) communicating to a server directly asynchronously via TCP. Assume all of them have a heartbeat pulse every 30 seconds and may drop off and reconnect at variable times.
Can anyone advice me the best way to make sure no data is dropped or blocked when multiple devices are communicating simultaneously?
TCP by itself ensures no data loss during the communication between a client and a server. It does that by the use of sequence numbers and ACK messages.
Technically, before the actual data transfer happens, a TCP connection is created between the client (which can be an IoT device, or any other device) and the server. Then, the data is split into multiple packets and sent over the network through that connection. All TCP-related mechanisms like flow-control, error-detection, congestion-detection, and many others, take place once the data starts to flow.
The wiki page for TCP is a pretty good start if you want to learn more about how it works.
Apart from that, as long as your server has enough capacity to support the flow of requests coming from the devices, then everything should work (at least in theory).
I don't think you are asking the right question. There is no way to make sure that no data is dropped or blocked. Networks do not always work (that is why the word work is in network, to convince you otherwise ).
The right question is: how do I make my distributed system as available and reliable as possible? The answer involves viewing interruption and congestion as part of the normal operation, and build your software appropriately.
There is a timeless usenix/acm/? paper from the late 70s early 80s that invigorated the notion that end-to-end protocols are much more effective then over-featured middle to middle protocols; and most guarantees of middle to middle amount to best effort. If you rely upon those guarantees, you are bound to fail. Sorry, cannot find the reference right now, but it is widely cited.
I want to create a proxy server which routes incoming packets from REQ type sockets to one of the REP sockets on one of the computers in a cluster. I have been reading the guide and I think the proper structure is a combination of ROUTER and DEALER on the proxy server. Where the ROUTER passes messages to the dealer to be distributed. However, I cannot figure out how to create this connection scheme. Is this the correct architecture? If so how to I bind a dealer to multiple addresses. The flow I envision is like this REQ->ROUTER|DEALER->[REP, REP, ...] where only one REP socket would handle a single request.
NB: forget about packets -- think in terms of "Behaviour", that's the key
ZeroMQ is rather an abstract layer for certain communication-behavioral patterns, so while terms alike socket do sound similar to what one has read/used previously, the ZeroMQ-world is by far different from many points of view.
This very formalism allows ZeroMQ Formal-Communication-Patterns to grow in scale, to get assembled in higher-order-patterns ( for load-balancing, for fault-tolerance, for performance-scaling ). Mastering this style of thinkign, you forget about packets, thread-sync-issues, I/O-polling and focus on your higher-abstraction-based design -- on Behaviour -- rather than on underlying details. This makes your design both free from re-inventing wheel & very powerful, as you re-use a highly professional tools right for your problem-domain tasks.
DEALER->[REP,REP,...] Segment
That said, your DEALER-node ( in fact a ZMQsocket-access-node, having The Behaviour called a "DEALER" to resemble it's queue/buffering-style, it's round-robin dispatcher, it's send-out&expect-answer-in model ) may .bind() to multiple localhost address:port-s and these "service-points" may also operate over different TransportClass-es -- one working over tcp://, another over inproc://, if that makes sense for your Design Architecture -- ZeroMQ empowers you to use this transparently abstracted from all the "awfull&dangerous" lower level gritty-nitties.
ZeroMQ also allows to reverse .connect() / .bind()
In principle, where helpfull, one may reverse the .bind() and .connect() from DEALER to a known target address of the respective REP entity.
You leave a couple details out that are important to determining the correct architecture.
When you say "from REQ type sockets to one of the REP sockets on one of the computers in a cluster", how do you determine which computer gets the message? Is it addressed to a specific computer? Does a computer announce its availability before it can receive a message? Does each message just get passed to the next one in line in a round-robin fashion? (if it's not the last one, you probably don't want a DEALER socket)
When you say "how do I bind a dealer to multiple addresses", it's not clear what you mean by "addresses"... Do you mean to say that the proxy has a unique IP address that it uses to communicate with each computer in the cluster? Or are you just wondering how to manage the connection to multiple different peers with the same socket? The former is a special case, the latter is simple.
I'm going to work with the following assumptions:
You want a worker computer from the cluster to announce its availability for work before it receives any work, and any computer in the cluster can handle any job. A faster worker, or a worker working on a smaller job, will not have to wait behind some slow worker to finish their job and get a new job first.
The proxy/broker uses a single ip interface to communicate with all workers.
If those are true, then what you want will be closer to this:
REQ->ROUTER|ROUTER->[REQ, REQ, ...]
A worker will create a request to the backend router socket to announce its availability, and await a reply with work. Once it is finished, it will create a new request with the finished work, which again announces its availability. The other half of the pattern you've already worked out.
This is the Simple Pirate Pattern from the ZMQ guide. It's a good place to start, but it's not very robust. This is in the Reliable Request-Reply Patterns section of the guide, and I suggest you read or reread that section carefully as it will guide you well. In particular, they keep refining this pattern into more and more reliable implementations and wind up with the Majordomo pattern, which is very robust and fault tolerant. You should see if you need all the features that provides or if you can scale it back a little. Either way, you should learn and understand what these patterns are doing and why before you make the choice to do something different.
Are there any libraries which put a reliability layer on top of UDP broadcast?
I need to broadcast large amounts of data to a large number of machines as quickly as possible, and generally it seems like such a problem must have already been solved many times over, but I wasn't able to find anything except for the Spread toolkit, which has a somewhat viral license (you have to mention it in all materials advertising the end product, which I'm not sure our customer will be willing to do).
I was already going to write such a thing myself (because it would be extremely fun to do!) but decided to ask first.
I looked also at UDT (http://udt.sourceforge.net) but it does not seem to provide a broadcast operation.
PS I'm looking at something as lightweight as a library - no infrastructure changes.
How about UDP multicast? Have a look at the PGM protocol for which there are several commercial and open source implementations.
Disclaimer: I'm the author of OpenPGM, an open source implementation of said protocol.
Though some research has been done on reliable UDP multicasting, I haven't yet used anything like that. You should take into consideration that this might not be as trivial as it first sounds.
If you don't have a list of nodes in the target network you have no idea when and to whom to resend, even if active nodes receiving your messages can acknowledge them. Sending to a large number of nodes, expecting acks from all of them might also cause congestion problems in the network.
I'd suggest to rethink the network architecture of your application, e.g. using some kind of centralized solution, where you submit updates to a server, and it sends this message to all connected clients. Or, if the original sender node's address is known a priori, then just let clients connect to it, and let the sender push updates via these connections.
Have a look around the IETF site for RFCs on Reliable Multicast. There is an entire working group on this. Several protocols have been developed for different purposes. Also have a look around Oracle/Sun for the Java Reliable Multicast Service project (JRMS). It was a research project of Sun, never supported, but it did contain Java bindings for the TRAM and LRMS protocols.