Using Chord to implement a torrent network - hash table

From my understanding, in order to implement a torrent network, you have to maintain a map from the hash value of each file to the list of peers in the network that have it.
However, I think there are a few problems with this:
The list will change constantly, since it must be updated whenever a node joins or leaves.
The list might become very long, yet there can only be one version of the list in the DHT network (since it's a hash table).

You have to maintain a map from the hash value of each file to the list of peers in the network that have it.
That is correct, and it is the purpose of the DHT. The DHT keeps track of who is on the network and who has a certain piece of information.
The list will change constantly, since it must be updated whenever a node joins or leaves.
This is also correct, but BitTorrent DHTs are built with this in mind, with protocols for joining and for removing offline peers.
The list might become very long, yet there can only be one version of the list in the DHT network (since it's a hash table).
That's the point of a DHT: it's distributed, so no single node needs to keep the entire table. Each node only needs a way to find the information it is looking for across the network.
To join a network you need a way to find some initial peers and to query the DHT. Usually this is done with bootstrap nodes. Once you've contacted a node, you can share the information you have, discover other nodes, and get the information they have.
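A minimal sketch of that join procedure in Python, assuming a hypothetical query_peer(addr) RPC that asks a peer for the addresses it knows (the real call depends on the concrete DHT protocol):

    import random

    def bootstrap(bootstrap_nodes, query_peer, target_size=20):
        """Discover peers starting from well-known bootstrap nodes."""
        known = set(bootstrap_nodes)
        frontier = list(bootstrap_nodes)
        while frontier and len(known) < target_size:
            # Pick a random known-but-unvisited peer and ask it for its neighbors.
            peer = frontier.pop(random.randrange(len(frontier)))
            for addr in query_peer(peer):
                if addr not in known:
                    known.add(addr)
                    frontier.append(addr)
        return known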
Each DHT has its own algorithms; for Chord, the details are in the original paper: http://nms.lcs.mit.edu/papers/chord.pdf

Related

P2P Network Bootstrapping

For P2P networks, I know that some networks have initial bootstrap nodes. However, one would assume that with all new nodes learning of peers from said bootstrap nodes, the network would have difficulty adding new peers and would end up with many cliques - unbalanced, for lack of a better word.
Are there any methods to prevent this from occurring? I'm aware that some DHTs structure their routing tables to be a bit less susceptible to this, but I would think the problem would still hold.
To clarify, I'm asking about what sort of peer mixing algorithms exist/are commonly used for peer to peer networks.
However, one would assume that with all new nodes learning of peers from said bootstrap nodes, the network would have difficulty adding new peers and would end up with many cliques - unbalanced, for lack of a better word.
If the bootstrap nodes were the only source of peers and no further mixing occurred, that might be an issue. But in practice bootstrap nodes only exist for bootstrapping (possibly only ever once), and then other peer discovery mechanisms take over.
Natural mixing caused by connection churn should be enough to randomize graphs over time, but proactive measures such as a globally agreed-on mixing algorithm to drop certain neighbors in favor of others can speed up that process.
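For illustration, one round of such a proactive mixing step could look like the sketch below; this is a toy, not a step any particular network prescribes:

    import random

    def mix_neighbors(neighbors, known_peers, max_neighbors=8):
        """Swap one random neighbor for a random known peer, so the overlay
        drifts away from its initial bootstrap-centered shape over time.
        `neighbors` is a set, `known_peers` any iterable of peer addresses."""
        candidates = [p for p in known_peers if p not in neighbors]
        if neighbors and candidates:
            neighbors.remove(random.choice(list(neighbors)))
            neighbors.add(candidates.pop(random.randrange(len(candidates))))
        # Top the neighbor set back up if we are below the target degree.
        while len(neighbors) < max_neighbors and candidates:
            neighbors.add(candidates.pop())
        return neighbors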
I'm aware that some DHTs structure their routing tables to be a bit less susceptible to this, but I would think the problem would still hold.
The local buckets in Kademlia should provide an exhaustive view of the neighborhood, the medium-distance buckets will cover different parts of the keyspace for different nodes, and the farthest buckets will preferentially contain long-lived nodes, which should have a good view of the network.
This doesn't leave much room for clique formation.
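For concreteness, this is roughly how Kademlia's XOR metric assigns a peer to a bucket (a simplified sketch with plain integer IDs; real implementations use 160-bit IDs and cap each bucket at k entries):

    def xor_distance(a, b):
        """Kademlia's distance metric: the XOR of two node IDs, read as an integer."""
        return a ^ b

    def bucket_index(own_id, other_id):
        """Index of the k-bucket `other_id` falls into: the position of the
        highest differing bit. Bucket i covers distances [2**i, 2**(i+1)),
        so low buckets are the exhaustively-known local neighborhood and
        the top bucket spans half the keyspace."""
        return xor_distance(own_id, other_id).bit_length() - 1

    # e.g. bucket_index(0b1010, 0b1000) == 1: the IDs differ in bit 1.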

Options to achieve consensus in an immutable distributed hash table

I'm implementing a completely decentralized database. Anyone at any moment can upload any type of data to it. One good solution that fits this problem is an immutable distributed hash table. Values are keyed by their hash. Immutability ensures this map always remains valid, simplifies data integrity checking, and avoids synchronization.
To provide some data retrieval facilities, a tag-based classification will be implemented. Any key (associated with a single unique value) can be tagged with an arbitrary tag (an arbitrary sequence of bytes). To keep things simple I want to use the same distributed hash table to store this tag-hash index.
To implement this database I need some way to maintain a decentralized consensus of what is the actual and valid tag-hash index. Immutability forces me to use some kind of linked data structure. How can I find the root? How to synchronize entry additions? How to make sure there is a single shared root for everybody?
In a distributed hash table, the nodes can be structured in a ring, where each node knows about at least one other node in the ring (to keep it connected). To make the ring more fault-tolerant, make sure that each node knows about more than one other node, so that the ring stays connected if some node crashes; in DHT terminology, this is called a "successor list". When the nodes are structured in a ring with unique IDs and some stabilization protocol, you can do key lookups by routing through the ring to find the node responsible for a certain key.
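A toy sketch of such a ring, assuming the usual Chord rule that a key belongs to the first node ID at or after it on the circle; the successor list is what keeps the ring connected through crashes:

    class RingNode:
        """A ring node stores the keys between its predecessor's ID
        (exclusive) and its own ID (inclusive)."""

        def __init__(self, node_id, ring_bits=16):
            self.id = node_id
            self.ring_size = 2 ** ring_bits
            self.successors = []   # successor list; first entry is the immediate successor
            self.storage = {}

    def responsible_node(sorted_node_ids, key_id):
        """First node ID at or after key_id on the ring, wrapping around."""
        for node_id in sorted_node_ids:
            if node_id >= key_id:
                return node_id
        return sorted_node_ids[0]   # wrapped past the highest ID

    # responsible_node([5, 20, 87], 42) == 87
    # responsible_node([5, 20, 87], 90) == 5   (wrap-around)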
How to synchronize entry additions?
If you don't want replication, a weak version of decentralized consensus is enough: each node has a unique ID and knows about the ring structure. This can be achieved with a periodic stabilization protocol, as in Chord: http://nms.lcs.mit.edu/papers/chord.pdf
In the stabilization protocol, each node periodically communicates with its successor to check whether it is still the true successor, or whether a new node has joined in between or the successor has crashed and the ring must be updated. Since no replication is used, consistent insertions only require that the ring is stable, so that peers can route each insertion to the correct node, which inserts it into its storage. Each item is held by a single node in a DHT without replication.
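A sketch of one stabilization round in the spirit of the Chord paper; get_predecessor and notify stand in for the two RPCs the protocol defines:

    def between(x, a, b):
        """True if x lies on the ring arc from a to b (both exclusive), wrapping."""
        return a < x < b if a < b else (x > a or x < b)

    def stabilize(node, get_predecessor, notify):
        """Run periodically by every node to repair the ring."""
        succ = node.successors[0]
        p = get_predecessor(succ)
        # A node joined between us and our successor: adopt it as successor.
        if p is not None and between(p.id, node.id, succ.id):
            node.successors.insert(0, p)
        # Tell our (possibly new) successor that we may be its predecessor.
        notify(node.successors[0], node)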
This stabilization procedure gives you a very good probability that the ring stays stable and that inconsistency is minimized, but it cannot guarantee strong consistency: there may be gaps where the ring is temporarily unstable while nodes join or leave. During these periods of inconsistency, data loss, duplication, overwrites, etc. can happen.
If your application requires strong consistency, a DHT is not the best architecture; it is very complex to implement that kind of consistency in a DHT. First of all you need replication, and you also need to add a lot of acknowledgements and synchronization to the stabilization protocol, for instance running a two-phase commit (2PC) or Paxos protocol for each insertion to ensure that every replica got the new value.
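To illustrate the extra machinery, a bare-bones two-phase commit over a replica set might look like the sketch below; the prepare/commit/abort calls are assumed RPCs, not part of Chord or any particular DHT:

    def two_phase_commit(replicas, key, value):
        """Commit an insertion only if every replica votes yes in phase 1."""
        # Phase 1: ask every replica to lock and validate the write.
        votes = [replica.prepare(key, value) for replica in replicas]
        if all(votes):
            # Phase 2a: unanimous agreement, make the write durable everywhere.
            for replica in replicas:
                replica.commit(key, value)
            return True
        # Phase 2b: at least one replica refused, roll everything back.
        for replica in replicas:
            replica.abort(key)
        return False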
How can I find the root?
How to make sure there is a single shared root for everybody?
Typically, DHTs are associated with some (centralized) lookup service that contains the IPs/IDs of nodes, and new nodes register with this service. The service can then also ensure that each new node gets a unique ID. Since it only manages IDs and simple lookups, it is not under high load or at much risk of crashing, so it is "OK" to have it centralized without hurting fault tolerance. Of course, you could distribute the lookup service as well, synchronizing its replicas with a consensus protocol like Paxos.

What's the difference between distributed hashtable technology and the bitcoin blockchain?

This question could go into a bitcoin forum, but I am trying to understand it from a programming point of view.
There are technologies used for distributed storage, like distributed hashtables (say Kademlia or similar). How is the bitcoin blockchain different from distributed hashtables? Or does distributed hashtable technology perhaps underpin the bitcoin blockchain? Or why is the bitcoin blockchain hailed as such a breakthrough compared to DHTs?
Distributed Hash Table
A DHT is simply a key-value store distributed across a number of nodes in a network. The keys are distributed among nodes with a deterministic algorithm; each node is responsible for a portion of the hash table.
A routing algorithm makes it possible to perform lookups in the hash table without knowing every node of the network.
For example, in the Chord DHT (a relatively simple DHT implementation) each node is assigned an identifier and is responsible for the keys that are closest to its identifier.
Imagine there are 4 nodes with the identifiers 2a6c, 7811, a20f, and e9c3. The data with identifier 2c92 will be stored on node 2a6c.
Imagine now that you only know node 7811 and you are looking for the data with identifier eabc. You ask node 7811 for the data eabc; 7811 doesn't have it, so it asks node e9c3, which sends it to node 7811, which sends it back to you.
A clever algorithm allows data to be found in O(log N) hops, without any node storing the entire routing table of the network (the addresses of all nodes). Basically, you ask the closest node you know of to the data identifier, which itself asks the closest node it knows, and so on, reducing the size of the jump at each step.
A DHT is very scalable, because the data are uniformly distributed among nodes and lookup time generally grows in O(log N).
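A small sketch that reproduces the example above under this answer's "numerically closest identifier" rule (note that real Chord assigns each key to its successor rather than to its closest node, but the routing intuition is the same):

    def closest_node(node_ids, key, ring_bits=16):
        """Node whose ID minimizes the circular distance to the key."""
        ring = 1 << ring_bits   # 16-bit IDs, matching the 4-hex-digit example

        def circular_distance(a, b):
            d = abs(a - b)
            return min(d, ring - d)

        return min(node_ids, key=lambda n: circular_distance(n, key))

    nodes = [0x2a6c, 0x7811, 0xa20f, 0xe9c3]
    assert closest_node(nodes, 0x2c92) == 0x2a6c   # the 2c92 example above
    assert closest_node(nodes, 0xeabc) == 0xe9c3   # the eabc example above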
Blockchain
A blockchain is also a distributed data structure, but its purpose is completely different.
Think of it as a history, or a ledger. The purpose is to store a continuously growing list of records without the possibility of tampering and revision.
It is mainly used in the bitcoin currency system for keeping track of transactions. Its tamper-proof property lets everybody know the exact balance of an account by knowing its transaction history.
In a blockchain, each node of the network stores the full data, so it is absolutely not the same idea as a DHT, in which the data are divided among nodes. Every new entry in the blockchain must be validated by a process called mining, whose details are out of the scope of this answer, but which ensures consensus on the data.
The two structures are both distributed data structures, but they serve different purposes. A DHT aims to provide an efficient (in terms of lookup time and storage footprint) structure to divide data across a network, while a blockchain aims to provide a tamper-proof data structure.
In computing, a hash table (hash map) is a data structure which implements an associative array abstract data type, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.
whereas a blockchain is
a digital ledger in which transactions made in bitcoin or another cryptocurrency are recorded chronologically and publicly.

Graphs and nodes: how to determine the connection weight and impact

I have a network of objects that are connected to each other.
Each object, as usual, can have classes and structures, etc.
I am giving you an example because it is difficult to explain otherwise.
A structure propagates from one object to another. For example, object-1 can be an FTP server that sends a file with a specific structure to a neighbor node, object-2. This connection is thus related to a protocol (FTP) and a specific file structure.
Among other things, what I am trying to determine is the following:
1. How can I measure the weight of this connection? I mean, what characteristics should I choose and use, and how can I measure them?
2. In the case where nodes are connected to other nodes, any change of a property, e.g. a file structure, impacts the operation of the next connected node. In what way can I determine the diffusion of a change within the network and its impact per node?
Thanks a lot
George

Consistent hashing: where is the data structure of the ring kept

We have N cache-nodes with basic consistent-hashing in a ring.
Questions:
Is the data structure of this ring stored:
On each of these nodes?
Partly on each node with its ranges?
On a separate machine as a load balancer?
What happens to the ring when other nodes join it?
Thanks a lot.
I have found an answer to question No. 1.
Answer 1:
All the approaches are written in my blog:
http://ivoroshilin.com/2013/07/15/distributed-caching-under-consistent-hashing/
There are a few options on where to keep the ring's data structure (a minimal sketch of such a ring follows the list):
Central point of coordination: A dedicated machine keeps the ring and works as a central load balancer which routes requests to the appropriate nodes.
Pros: Very simple implementation. A good fit for a non-dynamic system with a small number of nodes and/or little data.
Cons: The big drawbacks of this approach are scalability and reliability. Stable distributed systems don't have a single point of failure.
No central point of coordination – full duplication: Each node keeps a full copy of the ring. Applicable for stable networks. This option is used, e.g., in Amazon Dynamo.
Pros: Queries are routed in one hop, directly to the appropriate cache server.
Cons: A server joining or leaving the ring requires notifying/updating all cache servers in the ring.
No central point of coordination – partial duplication: Each node keeps a partial copy of the ring. This option is a direct implementation of the Chord algorithm. In DHT terms, each cache machine has a predecessor and a successor; when receiving a query, it checks whether it has the key. If it does not, a mapping function is used to determine which of its neighbors (successor or predecessor) has the least distance to that key, and the query is forwarded to that neighbor. The process continues until the current cache machine finds the key and sends it back.
Pros: For highly dynamic membership changes, the previous option is not a fit due to the heavy overhead of gossiping among nodes, so this option is the choice in that case.
Cons: No direct routing of messages. The complexity of routing a message to the destination node in the ring is O(log N).
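To make the ring concrete, here is a minimal consistent-hashing ring in the full-duplication style of option 2, where every client holds the entire sorted ring and lookups take one hop; MD5 is just an example hash function, and real deployments usually also add virtual nodes for smoother balancing:

    import bisect
    import hashlib

    class ConsistentHashRing:
        def __init__(self, nodes=()):
            self._ring = []   # sorted list of (hash, node) pairs
            for node in nodes:
                self.add(node)

        @staticmethod
        def _hash(value):
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def add(self, node):
            bisect.insort(self._ring, (self._hash(node), node))

        def remove(self, node):
            self._ring.remove((self._hash(node), node))

        def lookup(self, key):
            """First node clockwise from the key's hash, wrapping around."""
            i = bisect.bisect(self._ring, (self._hash(key),))
            return self._ring[i % len(self._ring)][1]

    ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
    print(ring.lookup("user:42"))   # routes to one of the three nodes
    ring.add("cache-d")             # only keys near cache-d's hash move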
