Virtual Nodes in Dynamo - amazon-dynamodb

Virtual Nodes in Dynamo - amazon-dynamodb

Recently I have read the paper of Dynamo, the key/value storage system of Amazon. The Dynamo uses consistent hashing algorithm as the partition algorithm. To solve the challenge of load balance and heterogeneous, it applies the "virtual node" mechanism. Here is my question:
It is described that "The number of virtual nodes that a node is
responsible can decided based on its capacity", but what capacity it
is? Is it the calculation capacity, network bandwidth, or the disk
volume?
What is the technology to partition a node to "virtual nodes"? Is a virtual node just a process? Or maybe using docker or virtual machine?

Without going into specifics, for #1 the answer would be: all of the above. The capacity may be determined empirically for different node types after running some load testing and noting the results. A similar process to what you would use to determine the capacity of a web server.
And for your second question, the paper just says that you should think of nodes from a logical stand point. In order to satisfy #1, nodes in the ring are designated such that one or multiple nodes would hash to the same physical hardware. So a virtual node is just a logical mapping. It is just one more layer of abstraction on top of the physical layer. If you are familiar with file systems, think of a virtual node like an iNode vs. a disk cylinder (a comparison perhaps slightly dated)

Related

Mariadb galera cluster and cap theorem

Where does mariadb galera cluster lies according to cap theorem CP or AP based on a brief explanation how it works.

Consistency -- For handling the "critical read" problem, Galera needs a little help. See http://mysql.rjweb.org/doc.php/galera#critical_reads
Otherwise, one can state that Galera survives "any" single-point-of-failure.
Galera is normally deployed in 3 nodes, one in each of 3 geographic locations. That means that no single machine failure, data center failure, earthquake, tornado, network outage, etc, can take out more than one node at a time. The other two nodes (whichever two survive and still talk to each other) will declare that they "have a quorum" and continue to accept writes and deliver reads. Further, "split brain" is not 'possible'; this is what keeps any attempt at dual-master, even with monitoring from surviving any SPOF.
If the third node or the network is repaired, the Cluster goes about patching up the data as needed, so that the 3 nodes again have identical data.
Granted, this is not quite the same as the definition of CAP, but is is a reasonable goal for a computer cluster.
How it works (in a tiny nutshell)... Each node talks to each other node. It does this only during the COMMIT of a transaction. (Hence, it is reasonably efficient even when spread across a WAN, as needed to survive natural disasters.) The COMMIT says to the other nodes "I am about to do this write; is it OK?" Without actually doing the write, they check Galera's magic sauce to see if it will succeed. Once everyone says "yes", the COMMIT returns success to the client. (That gives you a hint of the "critical" read issue.)

Is whether sharing the secondary storage one difference between clusters and grids?

Is whether sharing the secondary storage one difference between clusters and grids?
I.e.
In a cluster, do all the computers share the disks so the cpus see a single distributed file system?
In a grid, do all the computers not share the disks, so the cpus don't see each other's disk and file system?

Yes, the concept of storage is the main point of difference between cluster systems and grid systems. (The below explanation is copied from "Comparison of Grid Computing vs. Cluster Computing")
Grid systems are loosely coupled(decentralised), whereas cluster systems are tightly coupled.
Grid systems have "distributed job management and scheduling policies", whereas cluster systems are "centralized job management & scheduling system".
The big difference is that a cluster is homogenous while grids are heterogeneous.
The computers that are part of a grid can run different operating systems and have different hardware whereas the cluster computers all have the same hardware and OS. A grid can make use of spare computing power on a desktop computer while the machines in a cluster are dedicated to work as a single unit and nothing else. Grid are inherently distributed by its nature over a LAN, metropolitan or WAN. On the other hand, the computers in the cluster are normally contained in a single location or complex.
Another difference lies in the way resources are handled.
In case of Cluster, the whole system (all nodes) behave like a single system view and resources are managed by centralized resource manager. In case of Grid, every node is autonomous i.e. it has its own resource manager and behaves like an independent entity.
Question - Is whether sharing the secondary storage one difference between
clusters and grids?
As Wikipedia page on Computer cluster states :-
"In most circumstances, all of the nodes use the same hardware[2] and the same operating system, although in some setups (i.e. using Open Source Cluster Application Resources (OSCAR)), different operating systems can be used on each computer, and/or different hardware."

How do I configure OpenSplice DDS for 100,000 nodes?

What is the right approach to use to configure OpenSplice DDS to support 100,000 or more nodes?
Can I use a hierarchical naming scheme for partition names, so "headquarters.city.location_guid_xxx" would prevent packets from leaving a location, and "company.city*" would allow samples to align across a city, and so on? Or would all the nodes know about all these partitions just in case they wanted to publish to them?
The durability services will choose a master when it comes up. If one durability service is running on a Raspberry Pi in a remote location running over a 3G link what is to prevent it from trying becoming the master for "headquarters" and crashing?
I am experimenting with durability settings in a remote node such that I use location_guid_xxx but for the "headquarters" cloud server I use a Headquarters
On the remote client I might to do this:
<Merge scope="Headquarters" type="Ignore"/>
<Merge scope="location_guid_xxx" type="Merge"/>
so a location won't be master for the universe, but can a durability service within a location still be master for that location?
If I have 100,000 locations does this mean I have to have all of them listed in the "Merge scope" in the ospl.xml file located at headquarters? I would think this alone might limit the size of the network I can handle.
I am assuming that this product will handle this sort of Internet of Things scenario. Has anyone else tried it?

Considering the scale of your system I think you should seriously consider the use of Vortex-Cloud (see these slides http://slidesha.re/1qMVPrq). Vortex Cloud will allow you to better scale your system as well as deal with NAT/Firewall. Beside that, you'll be able to use TCP/IP to communicate from your Raspberry Pi to the cloud instance thus avoiding any problem related to NATs/Firewalls.
Before getting to your durability question, there is something else I'd like to point out. If you try to build a flat system with 100K nodes you'll generate quite a bit of discovery information. Beside generating some traffic, this will be taking memory on your end applications. If you use Vortex-Cloud, instead, we play tricks to limit the discovery information. To give you an example, if you have a data-write matching 100K data reader, when using Vortex-Cloud the data-writer would only match on end-point and thus reducing the discovery information by 100K times!!!
Finally, concerning your durability question, you could configure some durability service as alignee only. In that case they will never become master.
HTH.
A+

Autodiscovery in P2P Applications

I want to create a P2P application on the internet. What is the best or if none exist a good enough way to do auto-discovery of other nodes in a decentralized network?

Grothoff and GauthierDickey from the GNUnet project (an anonymous censorship-resistant file-sharing network) researched on the question of bootstrapping a p2p network without any central hostlist.
They found that for the Gnutella (Limewire) network a random ip search needed on average 2500 connection attempts to find a peer.
In the paper they proposed a method which reduced the required connection attempts to 817 for Gnutella and 51 for the E2DK network.
Achieved was this through creating a statistical profile of p2p users for every DNS organization, this small (around 100kb) discovery database has to be created in advance and shipped with the p2p client.

This is the holy grail of P2P. There isn't a magic solution really - there's no way a node can discover other nodes without a good known point to act as a reference (well, you can do so on a LAN by using broadcasting, but not on the internet). P2P filesharing tends to work by having known websites distributing 'start points' for discovery, and then further discovery (I would expect) can come from asking nodes what other nodes they know about.
A good place to start on research would be Distributed Hash Tables.
As for security, that topic will be in the literature somewhere, I should think - again I would recommend Wikipedia. Non-existent ones are trivially dealt with: if you can't contact an IP/port, don't keep it on your list, and if a node regularly provides non-existent pointers, consider de-prioritising it or removing it from your list entirely.
For evil nodes, it depends on your use case, but let's say you are doing file sharing. If you request a section of a file, check with several nodes what the file section's hash should be, and then request by hash. If the evil node gives you a chunk that has a different hash, then you can again de-prioritise or forget that node.
Distributed processing systems work a little differently: they tend to ask several unrelated nodes to perform the same work, and then they use a voting system (probably using hashing again) to determine whether evilness is at hand. If a node provides consistently bad results, the administrator is contacted or the IP is removed from the known nodes list.

ok, for two peers to find each other they both have to know a common, lets say, mediator to exchange IPs once. You can use anything for this kind of the first handshake whilst being able to WRITE and READ from that "channel". i.e: DNS (your well known domains), e-Mail, IRC, Twitter, Facebook, dropbox, etc.

cluster vs Grid vs Cloud

There are two questions:
1) What is the difference between cluster and Grid
2) What is the Cloud
I am not looking for conceptual definitions,
I found a lot of that by googling but the problem is I still do not get it.
so I believe the answer I seek is different. From what I could re-search online I start to think that
many article writers who is trying to explain this either do not understand this deep enough themselves
or not able to explain their knowledge for an average guy like myself (which is common issue with very technical people).
Just to let you know my level: I am a computer programmer, .NET and LAMP, I can do basic admin on both
Linux flavors and Windows, I have hands on experience with Hyper-V and now researching Xen and XCP
to setup a test cloud based on two computers for learning purposes.
Below info you do not have to read, it is just my current understanding of cluster,grid and cloud it
just to support my two questions because I thought it would help to understand
what kind of mess is in my head right now and what answers I am looking for.
Thank you.
Two computers used for reference in my statements are "A" and "B"
specs for A: 2 core intel cpu, 8GB memory , 500gb disk
specs for "B": 2 core intel cpu, 8GB memory , 500gb disk,
Now I would like to look at A and B roles from Cluster, Grid and from Cloud angle.
Common definitions between Grid and Cloud
1) cluster or Grid are 2 or more computers hooked up together, on hardware level
they are hooked up though network cards and on a software level
it is using some kind of program implementing message passing interface
to make it possible to send commands between nodes.
2) cluster or Grid do NOT combine CPU power or memory between nodes, meaning
that in this simulation a FireFox browser running on A still has only one 2 cores cpu,
8GB memory and 500gb available.
Differences between Grid and Cloud:
1) Cluster only provides fail over part, if A node breaks while FireFox is running
the cluster software will re-start FireFox process on node B.
2) Grid however is able to run a software in parallel on multiple nodes at the same time
provided that software is coded with MPI in mind. It can also lunch any software on any node
on demand (even if it is not written for MPI)
3) Grid is also able to combine different type of
nodes, Linux Server, Windows XP, Xbox and Playstation into one Grid.
Cloud definition:
1) Cloud is not a technical term at all, it is just a short convenient word to describe
a computer of unlimited resources, it can aslo be called a Supercomputer, a Beast, an Ocean or Universe but someone
said "Cloud" first and here we are.
2) Cloud can be based on Grids or on Clusters
3) From technical point of view Cloud is a software to combine hardware resources into one,
meaning that if I install Cloud software on Grid or Cluster then it will combine A and B
and I will get one Cloud like this: 4 core CPU, 16gb memory and 1000gb disk.
edited: 2013.04.02
item 3) was a complete nonsense, cloud will NOT combine resources from many nodes into one huge resource, so in this case there will be no 4 core CPU, 16gb memory and 1000gb cloud.

Grid computing is designed to parcel out large workloads to many participating grid members--through software on each member which is expecting to hear that request for computation or for data, and to reply with it's small piece of the overall puzzle. Applications must be written specifically for this approach to problem-solving. It can be heterogeneous because it's not the OS that matters but the software waiting to hear problem-solving requests.
The expectation of a cluster is that it can run the same executable image across any member node--any node can execute that code--which is what drives its requirement for homogeneity. You can write cluster-aware code which distributes workload throughout the cluster, but again you have to write your code to be cluster aware in order to take advantage of more than the redundancy features of a cluster. As most application vendors do not write cluster-aware code, the simple redundancy feature is all that's commonly used in cluster deployments, but that does not limit the architecture. Clusters can and do share their resources, and can collaborate on tasks simultaneously.
Cloud, as it's commonly defined is neither of these, precisely, but it doesn't preclude them, either. Cloud computing assumes the ability to deploy an application without advanced knowledge of it's underlying operating system, or even control of that operating system, coupled with the ability to expand or reduce the processing and memory footprint available to that application without having to destroy and recreate that environment--all done with enough isolation that the application won't know or be able to know what other applications might be installed or running on it's shared infrastructure, unless that access is approved-of by both application managers.

I would like to answer my question before this is closed as a duplicate because I believe it can be very frustrating to find correct info in regards to clusters,grids and clouds and I think this post can save time for many. If someone wants to challenge it please do so, otherwise I will mark it as answer in 1 week.
1) There are many differences and there are none, it really depends on the technical context but
generally you can connect several nodes and call it a Grid or you can call it Cluster. I would say Grid is a Cluster with extended capabilities, such as ability to connect heterogeneous nodes. Both Grid and Cluster will serve as scale-out platform equally good. From Network Engineer and Programmer perspective the difference in implementation or coding will be pretty big if Gird connects heterogeneous nodes.
2) Now the first question was actually a prelude for second one and I believe it is best answered by
Matt Joyce in this post:
https://stackoverflow.com/a/15286488/2230126

I'll take a crack at it. I have been collecting and saving my notes, scripts, and programs since the year 2002 A.D. This is a chop and paste of my statements over the years. Here is a brain friendly memorization list:
The grid is the hardware and hardware specifications.
a. You plug into the router or switch and setup IP addresses and top-level domains over the internet (which is also known as ICANN).
b. This is like OSI level 1, 2, and 3.
The cluster is the kernel (software ring 0 or 1 if its a virtual type thing going on).
a. The kernel is configured (compiled) to run a network stack that can handle sessions, permission, and account authentication.
b. You set up port to port communications usually over TCP/IP (like in the OSI model).
c. You setup iptables, pf, arp, and other OS level applications or shared objects.
d. You can setup ssh, kerberos, ldap, or some other PKI-database and protocol-socket combo.
e. This is like OSI level 4, 5, and 6.
The cloud is user-space applications.
a. The application processes talk to other application-processes within the cluster.
b. You setup process level permissions (via files, cgroups, and/or user-groups).
c. You setup mysql, redis, riak, Message Brokers, hadoop, apache, nginx, cron, java, haskell, erlang, and etcetera.
d. This is like OSI level 7.
The cloud floats over the cluster that grows from the grid. And actually visually think, cloud in the air, cluster in tree, and grid on the ground. Most of us creative types (which make all these technologies) are visual thinkers that can back it up with mathematical data and code. So always see if you can answer the riddle and correlate technological facsimiles to our physical realm here on Earth.
Intro
Grid, Cluster, and Cloud are three different words that mark their specific time in history. Their definitions have intersecting traits and they are modernly interchangeable. You just need to know when to apply the correct or associated word. For example, I was talking to some older M.D.s (medical doctors) and they wanted to know what the cloud was. So I told them that the cloud was a computer cluster that you rent over the internet. And Bingo, they got the idea within 10 seconds.
I will use a little bit of history in chronological prose.
Grid
The term grid is first used to represent one resource that is repeated across terrestrial landscape or space. The term is frequently used during the distribution of telegraphs where repeaters had to be placed on poles every N radii (plural for radius) to amplify the signal. Another example is the electrical grid that Thomas Edison and Nikola Tesla competitively started spreading around the Earth. Computers got really popular and they soon were expanded across The Grid to replace human telegraph (and telephone) operators.
The Grid is now a bunch of computers that can connect and terminate communication channels. The Grid is an infrastructure of computers that function for one goal which is the run assembly (or binary) code.
Cluster
Farseeing the power of computers and actually witnessing computers win wars (Turing's machine), DARPA (or ARPA which is the U.S.A. Military) stepped in.
DARPA started commissioning universities and colleges to utilize the Grid for multi-plexing communication methods (that use baud and protocols). Universities and colleges started making protocols to separate the different tasks that they wanted to carry out over the Grid and target the computers. That started the modern internet. In-house testing clusters were established in laboratories to simulate the grid. Clusters are great for orchestration. A job can be sub-divided over all or some of the slaves within a cluster. The military utilized the college and university's findings and applied the SOFTWARE to the Grid. There were some gotchas with clusters:
Must be same (or near same) hardware
Must have same operating system
The rules were strict because all the instruction-sets had to be the same passing over the CPUs. Clusters usually had a master and slave type relationship. A Cluster usually ran one unic (or unix) job at a time. Clusters had job-schedulers. Then clusters got more complex because hardware manufacturers started making parallel chip architectures (on top of the Von Neumann arch).
Clusters become more powerful. The Clusters inherited more complexity and people were doing more creative things. Cluster could now do different jobs, tasks, processes, asynchronously processes, synchronized processes, and many more interesting things. One box (or computer node) could run more jobs. Now the Grid could be used for multiple purpose. The rate of software updates on clusters was faster than the actual grid. Clusters were deployed locally on campuses. Clusters started superseding the grid because you could directly produce a public facing stack that out-performed the (national) grid.
My Experience
I went to college during the late 1990s and 2000s and cluster was the word for a physical laboratory of multiple computers working as one virtual computer. Clusters were used for testing. Once your software worked on the cluster, then you could mv (move) it to the production grade Grid. Then I witness network worms and computer viruses control zombie computers. These swarm of zombies could be used as one gigantic virtual cluster used to run commands. Well programmers started DIY (do it yourself) protocols and software like bit-torrent and Napster.
So leaping forward into the future, testing cluster softwares are starting to be replaced by Solaris jails, FreeBSD jails, Linux containers, QEMU, hyper-visors, VMWare, VirtualBox, Vagrant, and Docker.
Cloud
Cloud is a marketing term used to umbrella the hardware of different grids and the software of those clusters. Cloud is one big ubiquitous word used to advertise, promote, and profess all that cluster technology for monetary gains. Cloud is also an effort to wrap all those technologies under one singular word. The Cloud allows multi-tenanted processes to share a gigantic grid. The Cloud maximizes efficiency by sub-dividing the electricity, CPU, RAM, DISK, Electricity, and broadband which gets shared and paid for by consumers. A side effect is that those consumer subscriptions and/or pay-rates started producing profit. The Cloud also allows multiple users to install multiple operating systems that run multiple processes all in the software. So now we have acronyms like IaaS, PaaS, and SasS. The Cloud can replace the start-up cost that was once so darn difficult to fund and bootstrap. The Cloud is a great solution for mock testing your software and building a consumer base for your business.
From another perspective, the Cloud triggers the brain of non-programmers to think a certain way. For example, the human resource department can comprehend and isolate what is presented in-front of them.
So if you got the money, then you can purchase your share of the cloud experience and have easy support along with it. But if you have the skill-set, the time, the quick know-how, and the ability to install your own servers at co-locations, then do that because it is cheaper over the long run.
That is my narrative on the Grid vs Cluster vs Cloud.

I think this link well compared the Cluster and Grid.
As I know, there are some exceptions in the case of Clusters. YARN (Yahoo!) tries to handle mutli-tenancy and distributed scheduling. Also Corona (Facebook) has distributed scheduling.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex