How does the Kademlia protocol guarantee that peers form a connected graph?

Nodes: Clients on the DHT network.
Peers: Clients trying to download a specific resource.
Suppose that the DHT network is a connected graph, but NO node can access ALL other nodes (an assumption contrary to the common belief that the Internet, which the DHT network overlays, is fully connected).
Is the peer network, which overlays the DHT network, still a connected graph? Why?

Kademlia is an abstract algorithm that assumes spherical cows in a vacuum. The only failure modes the paper discusses are churn and temporary graph partitions. Asymmetric reachability is not considered.
Kademlia as implemented in the real world makes no guarantees. Everything is done on a best-effort probabilities-are-good-enough basis.
The main concern in the real world is not a situation where one interconnected cluster A cannot talk to another interconnected cluster B. NATs and firewalls do not introduce such clusters on any considerable scale. They create a set of second-class citizens which are not consistently reachable by anyone - absent NAT traversal measures - and thus can only connect to the first-class citizens, i.e. the nodes where anyone can talk to anyone else. Of course a few edge cases exist, but they're largely irrelevant.
Anyway, since you're not even asking about kademlia but about bittorrent - which is not really an overlay over kademlia but a separate network that simply bootstraps its contact information from kademlia - things get even more complicated. Bittorrent can be implemented over two different transport mechanisms, TCP and µTP, and clients may support different levels of NAT traversal capabilities for TCP, µTP and Kademlia-via-UDP.
Contact information for bittorrent is generally stored on several reachable kademlia nodes, since announcing clients - quite obviously - cannot reach unreachable nodes for the purpose of storage. It is also stored with redundancy, which ensures a high likelihood that the contact information can be retrieved by anyone else.
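As a rough sketch of that storage step (the routing-table entries and send_store RPC below are hypothetical stand-ins, not the actual BEP 5 wire format): contact info is pushed redundantly to the k nodes whose IDs are closest to the infohash, and only nodes that have already answered us are candidates, so storage naturally lands on reachable nodes.

    K = 8  # replication factor; BEP 5 implementations typically use k = 8

    def xor_distance(node_id: bytes, target: bytes) -> int:
        """Kademlia distance metric: XOR of the two IDs, compared as an integer."""
        return int.from_bytes(node_id, "big") ^ int.from_bytes(target, "big")

    def announce(infohash: bytes, responsive_nodes: list, my_contact: tuple) -> list:
        """Store our (ip, port) on the K responsive nodes closest to the infohash.
        `responsive_nodes` only contains nodes that answered a query, so unreachable
        nodes never receive (or hold) the stored contact information."""
        closest = sorted(responsive_nodes,
                         key=lambda n: xor_distance(n.node_id, infohash))[:K]
        for node in closest:
            node.send_store(infohash, my_contact)  # redundant copies on K nodes
        return closest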
Based on that contact information bittorrent clients can then attempt to connect to each other. As long as there are some reachable bittorrent clients in a swarm they will be able to establish direct connections, and they may additionally attempt NAT traversal measures between non-reachable nodes. Again, there are no guarantees, so small swarms may fail under some circumstances, but once a swarm becomes large enough the probabilities tip overwhelmingly in favor of the graph becoming connected.
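To see why swarm size matters, here is a deliberately simplified model (my assumption, not something from the Kademlia paper): each of n peers is independently reachable with probability p, all reachable peers can interconnect, and every unreachable peer can still dial out to a reachable one, so the swarm graph is connected as soon as at least one reachable peer exists.

    def p_connected(n: int, p: float) -> float:
        # probability that at least one of n peers is reachable
        return 1 - (1 - p) ** n

    for n in (2, 5, 20, 100):
        print(n, p_connected(n, 0.5))
    # 2 -> 0.75, 5 -> 0.96875, 20 -> ~0.999999, 100 -> ~1.0

Even with only half the peers reachable, a swarm of 20 is almost certainly connected under this model; real swarms are messier, but the trend is the same.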
A small additional concern is IPv4 vs. IPv6. Generally IPv6 provides better connectivity (if firewalls don't get in the way) but not all clients implement the ipv6 extensions equally well, thus possibly preventing a few v6-edges from forming when they would in principle provide superior connectivity between the same nodes.
Note that the ipv4 and ipv6 DHTs are in theory independent DHT networks; they just happen to have significant overlap. How to coordinate multiple independent networks is basically outside the scope of kademlia.

Related

Use a DHT for a gossip protocol?

I've been digging into DHTs, and especially kademlia, for some time now.
I'm trying to implement a p2p network on top of a Kademlia DHT. I want to be able to gossip a message to the whole network.
From my research, gossip protocols are used for that, but it seems odd to add another completely new protocol to spread messages when I already use the DHT to store peers.
Is there a gossip protocol that works over or with a DHT topology like Kademlia?
How concerned are you about efficiency? As a lower bound someone has to send a packet to all N nodes in the network to propagate an update to all nodes.
The most naive approach is to simply forward every message to all entries in your routing table. This will not do since it obviously leads to forwarding storms.
The second most naive approach is to only forward updates, i.e. data that is newer than what you already have. This results in roughly N * log(N) traffic.
If all your nodes are trusted and you don't care about the last quantum of efficiency you can already stop here.
If nodes are not trusted you will need a mechanism to limit who can send updates and to verify packets.
If you also care about efficiency you can add randomized backoff before forwarding and track which routing table entry already has which version, to prune unnecessary forwarding attempts.
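A minimal sketch of the "forward only newer data" approach with that version tracking and backoff (the routing_table contents and the peer.send() call are hypothetical stand-ins for a real DHT implementation's machinery):

    import random
    import time

    class GossipNode:
        def __init__(self, routing_table):
            self.routing_table = routing_table  # neighbours taken from the DHT routing table
            self.current = {}                   # key -> (version, payload) we currently hold
            self.peer_has = {}                  # (peer, key) -> newest version we know they hold

        def on_message(self, sender, key, version, payload):
            self.peer_has[(sender, key)] = max(version, self.peer_has.get((sender, key), -1))
            if version <= self.current.get(key, (-1, None))[0]:
                return                          # stale or duplicate: stop the forwarding storm here
            self.current[key] = (version, payload)
            time.sleep(random.uniform(0.0, 0.5))  # randomized backoff before forwarding
            for peer in self.routing_table:
                if self.peer_has.get((peer, key), -1) < version:
                    peer.send(key, version, payload)  # only to entries that lack this version
                    self.peer_has[(peer, key)] = version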
If you don't want to gossip with the whole network but only a subset thereof you can implement subnetworks which interested nodes can join, i.e. subscribe to. Bittorrent Enhancement Proposal 50 describes such an approach.

How does the packet find the shortest path in a computer network?

Does it use some kind of Dijkstra? Then what are the weights?
I was reading about computer networks and thought about it.
Thanks for your help.
Routing algorithms are used by individual routers to determine the direction in which packets are sent. They can be simple, like RIP, which looks at advertised path costs received from other routers, or more complex, like OSPF, in which a model of the network is built within each router and Dijkstra's algorithm is run to determine the best path. The trade-off between them is the complexity of management and the resource consumption on each router.
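For the link-state case, a rough sketch of the computation each OSPF-style router performs over its link-state database (the topology dict below is made up for illustration; in OSPF the weights are interface costs, typically derived from link bandwidth):

    import heapq

    def dijkstra_next_hops(graph, source):
        """For every reachable destination, return (total cost, first hop to use).
        `graph` maps router -> {neighbour: cost}."""
        dist = {source: 0}
        first_hop = {}
        heap = [(0, source, None)]
        while heap:
            cost, node, hop = heapq.heappop(heap)
            if cost > dist.get(node, float("inf")):
                continue                      # stale queue entry
            if hop is not None:
                first_hop[node] = hop
            for neighbour, weight in graph[node].items():
                new_cost = cost + weight
                if new_cost < dist.get(neighbour, float("inf")):
                    dist[neighbour] = new_cost
                    # directly attached neighbours become their own first hop;
                    # everything further away inherits its predecessor's first hop
                    heapq.heappush(heap, (new_cost, neighbour, hop if hop else neighbour))
        return {dest: (dist[dest], first_hop[dest]) for dest in first_hop}

    topology = {
        "A": {"B": 1, "C": 4},
        "B": {"A": 1, "C": 2, "D": 5},
        "C": {"A": 4, "B": 2, "D": 1},
        "D": {"B": 5, "C": 1},
    }
    print(dijkstra_next_hops(topology, "A"))
    # {'B': (1, 'B'), 'C': (3, 'B'), 'D': (4, 'B')}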
Software Defined Networks are exploring options where a central controller makes all of the forwarding decisions and distributes them to the network elements that are handling the traffic. Technologies like OpenFlow are in use in large production networks today.

Layer dependence in networks

I have a question about network layers:
As we all know, in a layered architecture the N+2 layer should only depend on the N+1 layer and know nothing about the N layer. For example, in a typical application the web layer should depend only on the business logic layer, not on the data access layer.
When it comes to computer networks, things seem to be different. In the application layer, a program has to know not only the transport layer (TCP port) but also the network layer (IP address).
This confuses me. What do you think about it?
Thanks for your help.
Generally you are right. Unfortunately, the borders between layers in networks are kinda blurry, not just because we have a standard that is not used (OSI) and a de facto standard that does not enforce the idea you mentioned, but also because protocols are often not strictly bound to one layer and can do stuff on more than one of them. A good amount of protocols were developed before the OSI model was standardized, and afterwards it was too late to make radical changes. So there are protocols that are considered to sit between two layers (or on both), like MPLS and ARP, and protocols that are based on another protocol on the same layer, like OSPF, which runs on top of IP even though both are considered to be on L3.
What you mentioned is another example. The reason is that addressing is not done on the uppermost layer (the application layer) but on the network layer (for the host/network adapter) and the transport layer (for the process/application). So you need to know the IP address and port number (and actually a protocol) to be able to address the remote application. That's where network sockets come in, as a gateway (or API) between the application and the network. So, even if you are technically correct about defying the principle of the layered model, you are not really doing anything on L3 or L4 (but you can ;) ). You don't need to fragment packets, handle retransmissions or worry about error correction etc.; you are just passing down the required addressing information when creating a socket.
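For example (a sketch using Python's standard socket module; the host and request are just illustrative): the application only hands the stack a host name that gets resolved to an IP address (the L3 identifier) and a port number (the L4 identifier for the remote process); everything below that - segmentation, retransmission, checksums - is handled by the TCP/IP stack.

    import socket

    # All we supply is addressing: host -> IP address (network layer),
    # port -> remote process (transport layer). The stack does the rest.
    with socket.create_connection(("example.com", 80), timeout=5) as sock:
        sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
        print(sock.recv(4096)[:80])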
TCP/IP is more oriented towards the feasibility of implementation, whereas OSI is more concerned about the standard than about implementing it. This has its good and bad sides. The ability to freely implement the protocol can be an advantage if you use that power well, and since you are not strictly bound to some specification you can do some things more efficiently... or fail epically. The drawbacks of mixing 'responsibilities' are obvious; a great example is a protocol like H.323, which embeds IP addresses inside the user's payload, so if you want to do NAT, for example, you need to inspect the payload, change the IP addresses, recalculate checksums, and so on, instead of just handling the translation on the network layer.
Why are things still like this? Probably because there is no easy way to change any of it: the sheer number of devices, protocols and applications that would need to be updated means change takes a lot of time. Just look at the speed of IPv6 adoption, even though it has been around for more than 15 years.

Autodiscovery in P2P Applications

I want to create a P2P application on the internet. What is the best (or, if none exists, a good enough) way to do auto-discovery of other nodes in a decentralized network?
Grothoff and GauthierDickey from the GNUnet project (an anonymous censorship-resistant file-sharing network) researched the question of bootstrapping a p2p network without any central hostlist.
They found that for the Gnutella (LimeWire) network a random IP search needed on average 2500 connection attempts to find a peer.
In the paper they proposed a method which reduced the required connection attempts to 817 for Gnutella and 51 for the ED2K network.
This was achieved by creating a statistical profile of p2p users for every DNS organization; this small (around 100 KB) discovery database has to be created in advance and shipped with the p2p client.
This is the holy grail of P2P. There isn't really a magic solution - there's no way a node can discover other nodes without a well-known point to act as a reference (well, you can do so on a LAN by using broadcasting, but not on the internet). P2P filesharing tends to work by having known websites distribute 'start points' for discovery, and then further discovery (I would expect) can come from asking nodes what other nodes they know about.
A good place to start on research would be Distributed Hash Tables.
As for security, that topic will be in the literature somewhere, I should think - again I would recommend Wikipedia. Non-existent nodes are trivially dealt with: if you can't contact an IP/port, don't keep it on your list, and if a node regularly provides non-existent pointers, consider de-prioritising it or removing it from your list entirely.
For evil nodes, it depends on your use case, but let's say you are doing file sharing. If you request a section of a file, check with several nodes what the file section's hash should be, and then request by hash. If the evil node gives you a chunk that has a different hash, then you can again de-prioritise or forget that node.
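A sketch of that idea (get_chunk_hash / get_chunk / deprioritise are hypothetical stand-ins for whatever your peer API looks like): take the majority answer for the expected hash, then accept only data that matches it and de-prioritise nodes that serve something else.

    import hashlib
    from collections import Counter

    def fetch_chunk_verified(chunk_index, nodes):
        # Ask several nodes what the chunk's hash should be; the majority answer wins.
        claimed = Counter(node.get_chunk_hash(chunk_index) for node in nodes)
        expected_hash, _ = claimed.most_common(1)[0]

        for node in nodes:
            data = node.get_chunk(chunk_index)
            if hashlib.sha1(data).hexdigest() == expected_hash:
                return data               # matching chunk from an apparently honest node
            node.deprioritise()           # evil or broken node: push it down the list
        raise IOError("no node returned a chunk matching the agreed hash")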
Distributed processing systems work a little differently: they tend to ask several unrelated nodes to perform the same work, and then they use a voting system (probably using hashing again) to determine whether evilness is at hand. If a node provides consistently bad results, the administrator is contacted or the IP is removed from the known nodes list.
OK, for two peers to find each other they both have to know a common, let's say, mediator to exchange IPs once. You can use anything for this kind of first handshake, as long as you are able to WRITE to and READ from that "channel", e.g. DNS (your well-known domains), e-mail, IRC, Twitter, Facebook, Dropbox, etc.
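A minimal sketch of such a first handshake, assuming the "channel" is simply a plain-text list of ip:port lines at a hypothetical URL (a real application could equally use DNS TXT records, an IRC channel or a shared pastebin):

    import urllib.request

    RENDEZVOUS_URL = "https://example.com/swarm-bootstrap.txt"  # hypothetical channel

    def read_known_peers():
        # READ side of the channel: fetch whatever addresses earlier peers published.
        with urllib.request.urlopen(RENDEZVOUS_URL, timeout=10) as resp:
            lines = resp.read().decode().splitlines()
        peers = []
        for line in lines:
            host, _, port = line.strip().partition(":")
            if host and port.isdigit():
                peers.append((host, int(port)))
        return peers

    # The WRITE side would publish your own ip:port through the same channel
    # (an HTTP POST, an e-mail, an IRC message...) so the next peer can find you.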

Practical implications of OSI vs TCP/IP networking

I'm supposed to be setting up a 'geolocation-based', IPv6, wireless mesh network to run on Google Android.
I found what seems to be a good app to support the meshing:
http://www.open-mesh.net/wiki/batman-adv
"Batman-advanced is a new approach to
wireless networking which does no
longer operate on the IP basis. Unlike
B.A.T.M.A.N, which exchanges
information using UDP packets and sets
routing tables, batman-advanced
operates on ISO/OSI Layer 2 only and
uses and routes (or better: bridges)
Ethernet Frames. It emulates a virtual
network switch of all nodes
participating. Therefore all nodes
appear to be link local, thus all
higher operating protocols won't be
affected by any changes within the
network. You can run almost any
protocol above B.A.T.M.A.N. Advanced,
prominent examples are: IPv4, IPv6,
DHCP, IPX."
But other members of my team have said it's a no-go because it operates on OSI, rather than TCP/IP. This was the first I'd heard of OSI, and I'm wondering how much of a problem this is. What are the implications for mesh network apps that can be developed on top of it? Considering that Android is relatively new, we don't need to worry too much about compatibility with existing apps, so does it matter?
I haven't spent a lot of time working with networks, so please put it in noobman's terms.
"You can run almost any protocol above B.A.T.M.A.N. Advanced, prominent examples are: IPv4, IPv6, DHCP, IPX."
"But other members in my team has said it's a no-go because it operates on OSI, rather than TCP/IP. "
The other members in your team are confused by the buzzword-fest in BATMAN.
The "IP" of TCP/IP is IPv4 (or IPv6). So BATMAN supports TCP/IP directly and completely.
There's no conflict of any kind. Just confusion.
They're probably referring to the OSI model, which is a commonly-used way of distinguishing between network layers. I'm not sure it's a useful way of looking at things, but it's taught in every networking course on the planet.
OSI level 2 is the data link layer, which operates immediately above the actual physical layer. Basically, it's in charge of flow control, error detection, and possibly error correction. The data link layer is strictly "single hop": it's only concerned with point-to-point data transfers, not with multi-hop transfers or routing.
If they're actually referring to the OSI networking protocols themselves, run screaming as fast as you can. OSI was notoriously hard to implement, and I've never heard of an actual working installation. See the Wikipedia article for the gory details.
The OSI model and the OSI protocols are different.
The OSI model is a way of breaking things down: physical, link, network, transport, session, presentation, application. OSI protocols are protocol implementations that map directly to those layers in the model.
The model is a way of looking at things. It mostly makes sense, but it breaks down at the higher levels. For example: what does a presentation layer really do?
During the '90s, OSI was (in some circles) thought to be the future, but was actually the downfall of some companies, and wasted the resources of many others. For example, DECnet Phase V was Digital's insanely complex implementation of an OSI stack that met government OSI requirements, but was run over by the TCP/IP steamroller.
The test is: What are the bytes on the wire? In this case it is UDP over IP, not the OSI equivalent, which was CLNP.
Having said all that, if it is a layer two protocol, it will probably have scalability problems because it is a layer two protocol. Fine for a small number of nodes, but if you're trying to get scale, you need a better solution.
"ISO/OSI Layer 2" does not mean the OSI protocols. It refers to the "Seven Layer" model of network stacks. It means the Data Link layer.
The layers are: Physical, Data Link, Network, Transport, Session, Presentation, Application.
OSI is a model, not a protocol like IP and TCP. What your team seems to be saying is that the mesh won't be using IP. I suspect they are wrong, as the text you have quoted states that the BATMAN protocol is capable of supporting IPv4 & IPv6, and if that is the case you'd need a very strong reason to use anything else.
