A good setup for distributed monitoring and tracking latency/drops in a network - networking

I want to preface this by saying that I've never taken a networking class, but I'm learning on the job. I have a pretty basic grasp of things like TCP/IP networking, so if you think this will hinder my attempt at this, let me know.
The task I have at hand is this: I have an OpenStack network with a bunch of nodes that can communicate with each other, all running CentOS virtual machines (just for simplicity's sake) with applications running on top of them. The task is basically to find a way to monitor the ping of every node and, whenever a latency problem is detected, send some kind of message (probably over HTTP) that reports what happened. The logic of checking for the actual latency problems isn't what I'm struggling with; it's the best structure for completing this task.
I'm thinking of using Nagios and setting up a distributed monitoring system. Basically my plan is to install Nagios on each node after writing my plugin (unless one already exists), and once a node is set up it would simply ping everything else in the network, while the other nodes start pinging it once they detect that it has joined the network. I'm not sure how scalable this is: if the number of nodes increases a lot, would having every node ping every other node actually be a good thing? Could it end up putting a lot of stress on the network?
Is this a bad idea? I know a more efficient solution would be one where every node is still checked, but not every node has to be contacted by every other node. Visualizing it as a graph with a few points, it would be an undirected graph with just one path connecting each point (something like a spanning tree) rather than every possible pair of points having an edge between them. But I don't know if this is the level I should be thinking about it at or not.
In short, what I'm asking is: how would one go about setting up a ping monitoring system between a bunch of OpenStack nodes?
Let me know if this question makes sense. Thanks.

Still not entirely sure what you're trying to accomplish with this setup, but the Nagios setup you're describing sounds messy and likely won't cover what you need. I'd look at building Packetbeat into the provisioning of each of your hosts and then shipping that data off to Elasticsearch. That way you can watch your actual application-level traffic and response times. https://www.elastic.co/products/beats/packetbeat
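For reference, here is a minimal sketch (in Python, not an actual Nagios plugin) of the ping-and-report check the question describes; the peer list, report URL, and latency threshold are placeholder assumptions:

# Ping each peer and POST a report when latency exceeds a threshold.
# PEERS, REPORT_URL and THRESHOLD_MS are hypothetical placeholders.
import json
import re
import subprocess
import urllib.request

PEERS = ["10.0.0.11", "10.0.0.12"]            # assumed peer addresses
REPORT_URL = "http://monitor.example/report"  # assumed HTTP collector
THRESHOLD_MS = 100.0

def ping_ms(host):
    """Return round-trip time in ms, or None if the host did not answer."""
    out = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                         capture_output=True, text=True)
    match = re.search(r"time=([\d.]+) ms", out.stdout)
    return float(match.group(1)) if match else None

for peer in PEERS:
    rtt = ping_ms(peer)
    if rtt is None or rtt > THRESHOLD_MS:
        body = json.dumps({"peer": peer, "rtt_ms": rtt}).encode()
        req = urllib.request.Request(REPORT_URL, data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)  # report the drop or high latency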

Related

Connecting a Rime network to the Internet

I have been playing with Contiki for some time now and have tried out various examples and written my own, for both the simulation environment and the real hardware. I have only been experimenting with self-contained networks that, for example, measure the temperature difference between two nodes and then communicate that data to other devices (PCs) over a plain-text RS232 link, blink LEDs, and other simple stuff.
Now I want to make a more complicated system where, instead of just forwarding the data in plain text to be read on a console, I would forward it to an application that would in turn post it to some sort of web service and, vice versa, receive data from the web service to be delivered to nodes on the network. There are quite a lot of examples and tutorials describing this kind of setup, but all of them (as far as I am aware) focus on the IP(v6) stack and SLIP to achieve it. The problem with this is that I have a really lousy (hardware) programmer and uploading a 50 kB image takes about 1.5 minutes, so the development cycle is pure hell. I am also out of luck with simulation, since my platform is not really supported at the moment.
That's why I decided to try out the Rime stack: the image size is a third of the IPv6 one and the development cycle is somewhat acceptable now (I really should get a decent JTAG programmer...). Meanwhile, I am having a bit of trouble wrapping my head around this new setup with a different network stack, on which there is very little information around. Although it is pretty easy to understand by itself, I am not sure how I would go about connecting a Rime network to an IP network, whether that is even possible, or whether it was advised/intended by its designers.
I have some ideas in my head, ranging from ad hoc communication over a serial link between a server application running on the PC and the collector node, to a real Rime border router that is certainly outside of my league, for now.
How would you go about it? For my simple experimental case it would certainly work to just have a collector node that gathers the data from the Rime network and sends the aggregated data over a serial connection to a custom application that does the rest of the magic, but I wouldn't want to be the guy who reinvented the wheel, and I am quite sure that Rime wasn't designed to be used in a vacuum, so there must be at least an advised way of doing this?
Rime is a really simple stack (by simple I mean it offers little functionality), but it is quite a bit quicker for simple tasks.
You need to run the Rime stack on your gateway as well, so that your board and the gateway can communicate with the same stack. Now the data arrives on your gateway, and the gateway can forward it over IP to whoever you want.
If you want more technical detail, then edit your question with more specific technical context.
By the way, a JTAG programmer is a must-have (for industrial applications).
Edit: Another solution is to simply broadcast your data from your board to your gateway. The gateway then takes the data and interprets it. The downside of this method is that you have to somehow make sure that your gateway only interprets data from your board (and not from other boards).
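For the serial-link approach, a rough sketch of the gateway side (assuming Python with pyserial on the gateway machine): read the collector node's output from a serial port and forward it over IP. The device path, node-id prefix, and destination URL are assumptions, and the Rime/Contiki code running on the nodes themselves is not shown.

# Read lines that the collector node writes to its serial port and forward
# them over IP. PORT, MY_BOARD_PREFIX and DEST_URL are assumptions.
import urllib.request

import serial  # pyserial

PORT = "/dev/ttyUSB0"        # assumed serial device of the collector node
MY_BOARD_PREFIX = "node-42"  # assumed id so we ignore other boards' data
DEST_URL = "http://example.org/ingest"  # assumed web service endpoint

with serial.Serial(PORT, 115200, timeout=5) as link:
    while True:
        line = link.readline().decode(errors="replace").strip()
        if not line.startswith(MY_BOARD_PREFIX):
            continue  # drop data that came from other boards
        req = urllib.request.Request(DEST_URL, data=line.encode(),
                                     headers={"Content-Type": "text/plain"})
        urllib.request.urlopen(req)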

What is ZeroMQ's underlying design architecture?

I am comparatively new to ZeroMQ and would like some suggestions regarding its internal architecture.
I am planning to use ZeroMQ as the messaging framework for my work. The basic idea of what I want to achieve is to be able to dynamically scale the infrastructure based on the load and the computational capacity required to meet particular workflow deadlines.
So, if there is a need to add more nodes, the application spawns new nodes and the messaging framework should be able to incorporate the changes as well. I should also be able to point to where the additional computation should occur, or understand how the framework dynamically adds the new nodes (if any). An event on a particular node decides the subsequent actions to be performed on other nodes. Here is the scenario, or the stack, that I am thinking of, but I wanted to know if it makes sense:
User applications
ZeroMQ messaging
Squid (content-based routing)
Overlay
Physical Substrate
I am a bit skeptical about the above stack, as I believe ZeroMQ provides most of this functionality itself, thereby making things simpler.
A few points about my stack:
The physical substrate is the total set of nodes that are available for computation or as data sources.
The overlay is a logical network that is built dynamically on top of the physical network, based on the closest nodes available for a particular workflow; i.e. if two nodes exchange data frequently, then those two nodes are placed logically close to one another. Is a separate overlay like Chord required when we use ZeroMQ?
Squid is basically used for content-based routing. Is Squid required when we use ZeroMQ?
ZeroMQ messaging is for the communication between the different nodes of an application.
Basically, what I wanted to know is whether the above stack can be made simpler, given that ZeroMQ has richer functionality. If so, can someone point me in the right direction or share their thoughts? I am going through the ZeroMQ documentation, but I am finding it a bit difficult to understand the intrinsic design of ZeroMQ. Please help.
Thanks
There's so much specific to your use case here that it's almost impossible to give any definite answers. ZeroMQ is not a direct replacement for the concepts you've built into your architecture; however, it may meet the goals you're aiming for, depending on how you use it.
My suggestion would be to put your current architecture aside and start trying to build up a new one with ZMQ as its core, and see where you run into limitations that are solved by the other parts of your stack.
As for the "intrinsic design" of ZMQ, here's the basics that you need to understand as a starting point:
A ZMQ socket handles connection details for you, including managing network hiccups - but this has limits that you'll need to know
There are different kinds of ZMQ sockets, and they have opinions about how you use them. Some of them communicate asynchronously, some of them are strictly synchronous, some are one way, some are bi-directional.
If a connection between two sockets is severed (e.g. one node goes down, there is a network failure - something more than a momentary hiccup), it's your job to recognize that and re-establish that connection
There is no built-in brokering or topology; you have to design and build all of that yourself.
... ultimately, ZMQ provides a toolset for you to build a messaging framework; it does not provide a fully realized messaging framework out of the box. So, yes, it has the power to replace some of the other tools you're currently using, but you'll have to build that yourself.
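To make the socket-types point concrete, here is a minimal PUB/SUB sketch using the Python pyzmq binding; the port and topic prefix are arbitrary, and reconnection handling and topology are still left to you, as noted above.

# Minimal PUB/SUB illustration with pyzmq; port and topic are arbitrary.
# Run publisher() in one process and subscriber() in another.
import time
import zmq

def publisher():
    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)
    pub.bind("tcp://*:5556")
    while True:
        pub.send_string("load node-7 0.82")  # "topic payload" convention
        time.sleep(1)

def subscriber():
    ctx = zmq.Context()
    sub = ctx.socket(zmq.SUB)
    sub.connect("tcp://localhost:5556")
    sub.setsockopt_string(zmq.SUBSCRIBE, "load")  # filter on topic prefix
    while True:
        print(sub.recv_string())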

How do I configure OpenSplice DDS for 100,000 nodes?

What is the right approach to use to configure OpenSplice DDS to support 100,000 or more nodes?
Can I use a hierarchical naming scheme for partition names, so "headquarters.city.location_guid_xxx" would prevent packets from leaving a location, and "company.city*" would allow samples to align across a city, and so on? Or would all the nodes know about all these partitions just in case they wanted to publish to them?
The durability services will choose a master when they come up. If one durability service is running on a Raspberry Pi in a remote location over a 3G link, what is to prevent it from trying to become the master for "headquarters" and crashing?
I am experimenting with durability settings such that on a remote node I use the location_guid_xxx scope, but for the "headquarters" cloud server I use a "Headquarters" scope.
On the remote client I might do this:
<Merge scope="Headquarters" type="Ignore"/>
<Merge scope="location_guid_xxx" type="Merge"/>
so a location won't be master for the universe, but can a durability service within a location still be master for that location?
If I have 100,000 locations does this mean I have to have all of them listed in the "Merge scope" in the ospl.xml file located at headquarters? I would think this alone might limit the size of the network I can handle.
I am assuming that this product will handle this sort of Internet of Things scenario. Has anyone else tried it?
Considering the scale of your system, I think you should seriously consider using Vortex Cloud (see these slides: http://slidesha.re/1qMVPrq). Vortex Cloud will allow you to scale your system better as well as deal with NATs/firewalls. Besides that, you'll be able to use TCP/IP to communicate from your Raspberry Pi to the cloud instance, thus avoiding any problems related to NATs/firewalls.
Before getting to your durability question, there is something else I'd like to point out. If you try to build a flat system with 100K nodes, you'll generate quite a bit of discovery information. Besides generating some traffic, this will take up memory in your end applications. If you use Vortex Cloud instead, we play tricks to limit the discovery information. To give you an example, if you have a data-writer matching 100K data-readers, when using Vortex Cloud the data-writer would only match one endpoint, thus reducing the discovery information by a factor of 100K.
Finally, concerning your durability question, you could configure some durability services as alignee only. In that case they will never become master.
HTH.
A+

Service Bus (Maybe?) for web farm

Part of this question is I'm not even sure what exactly I'll need to ask, so I'll start with the situation and work out from there.
One project I'm working on involves the use of COMET via the aspComet library. The use case of the program is somewhat of a collaborative slideshow. One person runs the bulk of it, with one or more participants able to perform certain actions. Low latency between when an action is performed and when it shows up on the other participants' screens is important.
Previously, it was just running on one server. Now, we're wanting to scale it out a bit, more for reliability than performance reasons. So, we have some boxes out in Rackspace's cloud and all that fun stuff.
I knew from the start of this that I was going to need to make some changes to the way the COMET stuff works since different people in the same "show" might be on different servers, and I have no way of knowing what "show" they belong to until after they have already arrived on the site.
I initially tackled this using the WCF Mesh provider, which wasn't well documented to start with. Now I'm running into issues where messages dispatched to it sometimes get lost or delayed (I'm not 100% sure what is going on there), which screws up the long poll for COMET and breaks things in rather strange ways (clicking a button may trigger an event, or it may hang for 10 seconds {the long poll duration} and not actually do anything).
More research leads me to believe one of the .Net service bus providers may do what I need. However, I can't find examples that would cover what I need:
No single point of failure (outside of a database)
No hardcoding of peers.
Near-realtime (no polling, event based would be best)
My ideal solution would involve that when a server comes up, it lets other servers know of its existence (Even if it's just a row in a table somewhere), and they can start sending broadcast messages among each other, with each server being both a publisher and subscriber. This is what I somewhat had in the WCF Mesh provider, but I'm not overly confident in that code.
Can anyone point me in the right direction with this? Even proper terms to look for in the docs for service bus providers would be good at this point. Or are service buses not what I want? At this point I would settle for setting up a Jabber server on each web server and use that, if it could fit within my constraints.
I can't speak a ton to NServiceBus, but I expect the answers will be similar.
Single point of failure: MSMQ can use multicasting, which means each endpoint will broadcast its existence and no DB table is needed. RabbitMQ uses an exchange-to-queue binding process, which means that as long as the Rabbit instance or cluster is up, messages still exist. RabbitMQ can be clustered; MSMQ cannot. *Note: you might have issues with multicasting on Rackspace; I have no idea how they handle it. If so, you'll have to fall back on the runtime services for MSMQ (not RabbitMQ), which would create a single point of failure because everyone has a single point to coordinate control messages through.
Hard coding of peers: discussed above a bit; MSMQ's multicast handles it. With Rabbit it can also be done: just bind queues to an exchange you want to listen to. MassTransit takes care of this for you.
Near-realtime: both of these use messaging, which is near real time. There's no polling in your message consumer code.
I think a service bus seems like a reasonable solution for what you're trying to do. More details would likely be needed, but the general messaging approach is correct. There are other, more lightweight messaging libraries if you decide you just want something on top of RabbitMQ and configure Rabbit to handle most of the work.
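As an illustration of the exchange-to-queue binding mentioned above, here is a minimal fanout sketch using the Python pika client; the broker host and exchange name are assumptions, and a bus library like MassTransit or NServiceBus would wrap this kind of plumbing for you on .NET.

# Fanout-exchange sketch with pika (RabbitMQ): every server binds its own
# queue to a shared exchange, so each one both publishes and receives
# broadcasts without hard-coding peers. Host and exchange name are assumed.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit.internal"))
channel = conn.channel()
channel.exchange_declare(exchange="show-events", exchange_type="fanout")

# Each web server gets its own exclusive queue bound to the exchange.
queue = channel.queue_declare(queue="", exclusive=True).method.queue
channel.queue_bind(exchange="show-events", queue=queue)

def on_message(ch, method, properties, body):
    print("broadcast received:", body)

channel.basic_consume(queue=queue, on_message_callback=on_message, auto_ack=True)

# Publishing a broadcast (from this or any other server):
channel.basic_publish(exchange="show-events", routing_key="", body=b"slide:12")

channel.start_consuming()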
To get started with MassTransit, we have documentation up: http://readthedocs.org/projects/masstransit/ and mailing list http://groups.google.com/group/masstransit-discuss. Join the mailing list if you have future questions and someone will try and help you out.

Peer-to-peer replication of a sqlite database

I am looking for a way to replicate a small and simple relational database (like SQLite) across peers. This should work in an environment with unstable network connections, hence the need for each peer to have a full copy of the database. This should allow a peer to continue working off-line in the event of network failure.
To keep things simple, replication should only have to support the replication of addition of data, i.e. only INSERTs, not DELETEs or UPDATEs.
Does anyone know of a good - and ideally cross-platform - technology or method of creating such a system? I am currently looking at JXTA and JXSE, but I am put off by its complexity and the apparent lack of life in its community after the takeover of Sun by Oracle.
Thanks!
Frans
rqlite uses the Raft consensus algorithm, so it should be fairly resilient to unstable network connections.
Also, it seems to be possible to configure rqlite to accept reads even in the case of a network failure.
A similar project, dqlite, exists as a library available in various languages, but it seems less explicit about what happens in the event of a network failure.
You may want to explore JGroups for the communication layer if you don't like JXTA. For the replication, I think you will have to implement your own code.
I am working on something similar (though the code is far from ready). I'll describe a little about my intended approach, but whether that is suitable for you depends on some key design points you'd need to consider. I am not aware of any ready-built projects that will do this, unfortunately.
In particular we'd need to know what language you wish to use, or which languages you'd rather avoid.
Also, consider how you intend to do peer discovery - can you set up trust between node pairs manually, or do you want them to auto-discover?
Presumably all peers may insert data?
If you are able to use PHP, and are happy manually peering node pairs, then my approach may be of interest. Set up an ORM such as Doctrine, Propel or NotORM, and get each node to regularly sync with an internet time source. For each new row in a db, grab the data (either in an array or ORM object), serialise it, and push it out to all nodes that you have a trust relationship with. Where a push fails, keep a note of this and retry at periodic intervals (potentially giving up after a remote node fails to answer a large number of retries).
Pushes can either be kicked off by your application that creates the row, or can be called by whatever scheduler is available on each machine. A push message can be XML, or for simplicity can be just a POST message containing the new row and whatever metadata (e.g. timestamp of save, so as to resolve INSERT order from several nodes).
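The answer above is phrased in PHP/ORM terms, but the push loop itself is simple; here is a rough sketch using Python's standard sqlite3 module, where the table name, peer URLs, and bookkeeping table are assumptions and retries on failure are only hinted at.

# Push new INSERTs to trusted peers. Table name, peer URLs and the
# last-pushed bookkeeping are assumptions; run this periodically.
import json
import sqlite3
import urllib.request

DB_PATH = "data.db"
PEERS = ["http://peer-a.local/replicate", "http://peer-b.local/replicate"]

db = sqlite3.connect(DB_PATH)
db.row_factory = sqlite3.Row

# Remember the highest rowid we have already pushed (kept in a side table).
db.execute("CREATE TABLE IF NOT EXISTS push_state (last_rowid INTEGER)")
last = db.execute("SELECT last_rowid FROM push_state").fetchone()
last_rowid = last["last_rowid"] if last else 0

rows = db.execute(
    "SELECT rowid, * FROM measurements WHERE rowid > ?", (last_rowid,)
).fetchall()

for row in rows:
    payload = json.dumps(dict(row)).encode()
    for peer in PEERS:
        try:
            req = urllib.request.Request(
                peer, data=payload, headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req, timeout=10)
        except OSError:
            pass  # in a real system: record the failure and retry later
    last_rowid = row["rowid"]

db.execute("DELETE FROM push_state")
db.execute("INSERT INTO push_state (last_rowid) VALUES (?)", (last_rowid,))
db.commit()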
If your nodes do not have static IP addresses, they could be registered with a dynamic DNS addressing service so as to allow each node to stay in touch with peers even if their IP changes. You might also consider adding a message signing system, to ensure that messages between nodes are genuine.
