How does a TLB differ from a hash table?

I'm a noob, sorry to say. I understand both what a hash table is and what a translation lookaside buffer (TLB) is, but it seems like they work on similar principles. What am I missing here?

A hash table uses a hash function that maps a large space of original values (keys) to a smaller space of resulting values (indices). The idea is that since the original space is normally not used completely, it is more economical to work in the smaller mapped space, e.g. for table lookup. Because more than one original value can be mapped to the same resulting value, collisions can arise, and they require some form of collision resolution, which is not a problem in practice.
A TLB, on the other hand, is used to speed up the translation of a large virtual address space into a smaller physical address space. To do so, it caches recently used translations, i.e. it maps a virtual page number to the corresponding physical frame number. There are many ways to organize it. The simplest (for explanation) is a direct-mapped cache: it uses the lower bits of the virtual page number as the index into the cache memory, and stores there the remaining upper bits (the tag) together with the corresponding physical frame number. One could thus consider this a hash table whose hash function maps all virtual page numbers with the same lower bits to the same cache slot. However, conflicts are here (in the direct-mapped cache) resolved by replacing the old entry with the new one. If the cache organization is more complex, e.g. a multi-way set-associative cache, conflicts are handled by distributing new entries among the available ways of the matching set.
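To make the analogy concrete, here is a minimal C++ sketch of a direct-mapped TLB viewed as a hash table whose hash function is simply "take the low bits of the virtual page number". The sizes (64 entries, 6 index bits) are illustrative and not taken from any real processor; a set-associative TLB would instead keep a small group of entries per slot and pick a victim within that group.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

// Toy direct-mapped TLB: the "hash function" is taking the low bits of the
// virtual page number (VPN) as the index. All sizes are illustrative.
struct TlbEntry {
    bool          valid = false;
    std::uint64_t tag   = 0;   // upper bits of the virtual page number
    std::uint64_t frame = 0;   // physical frame number
};

class DirectMappedTlb {
    static constexpr std::size_t   kEntries   = 64;          // power of two
    static constexpr std::size_t   kIndexBits = 6;           // log2(kEntries)
    static constexpr std::uint64_t kIndexMask = kEntries - 1;
    std::array<TlbEntry, kEntries> entries_{};

public:
    // Returns the physical frame number on a hit, nothing on a miss.
    std::optional<std::uint64_t> lookup(std::uint64_t vpn) const {
        const TlbEntry& e = entries_[vpn & kIndexMask];      // "hash" = low bits
        if (e.valid && e.tag == (vpn >> kIndexBits))          // compare upper bits
            return e.frame;
        return std::nullopt;                                  // TLB miss
    }

    // Conflict resolution: simply overwrite whatever occupied the slot.
    void insert(std::uint64_t vpn, std::uint64_t frame) {
        entries_[vpn & kIndexMask] = {true, vpn >> kIndexBits, frame};
    }
};
```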
So you are right: at this level there is no essential difference between hash tables and TLBs; a TLB is essentially a hardware hash table with a very simple hash function and eviction as its conflict resolution.

Related

How do top-level name servers handle a huge map?

Typically, top-level-domain name servers, like the "com" name server, need a map which gives out the IP address of the name server for different domain names like "google", "yahoo", "facebook", etc.
I imagine this would have a very large number of key-value pairs.
How is this huge map handled? Is it an unordered map, an ordered map, or some other "special" implementation?
Most of the major nameservers are open source, so you could study their sources:
bind
nsd
knot
yadifa
geodns
But it is of course far more complicated than just a "map".
Even if you start with very old documents, like RFC 1035 that defines the protocol, there are few details about implementation, as expected:
While name server implementations are free to use any internal data structures they choose, the suggested structure consists of three major parts:
A "catalog" data structure which lists the zones available to this server, and a "pointer" to the zone data structure. The main purpose of this structure is to find the nearest ancestor zone, if any, for arriving standard queries.
Separate data structures for each of the zones held by the name server.
A data structure for cached data (or perhaps separate caches for different classes).
(and read the following sentences about various optimizations)
First, the task is different for an authoritative or a recursive nameserver.
Some authoritative ones, for example, let you "compile" a zone into some internal format before loading it; see zonec in nsd for an example.
You also need to remember that this data is dynamic: it can be remotely updated incrementally by DNS UPDATE messages, and in the presence of DNSSEC, the RRSIGs may get dynamically computed or at least need to change from time to time.
Hence, a simple key-value store is probably not enough for all those needs. But note that several nameservers allow different "backends", so that the data can be pulled from other sources, with or without some constraints, like an SQL database or even a program creating the DNS response on the fly when the DNS query comes in.
For example, from memory, bind internally uses a "red-black binary tree". See the Wikipedia explanation at https://en.wikipedia.org/wiki/Red%E2%80%93black_tree; in short:
A red–black tree is a kind of self-balancing binary search tree in computer science. Each node of the binary tree has an extra bit, and that bit is often interpreted as the color (red or black) of the node. These color bits are used to ensure the tree remains approximately balanced during insertions and deletions.
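As a rough illustration (not bind's actual code): std::map in most C++ standard libraries is implemented as a red-black tree, so a toy name-to-records index built on it gets the same kind of ordered, O(log n) lookups. The record contents below are made up for the example.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // std::map is typically a red-black tree under the hood.
    std::map<std::string, std::vector<std::string>> zone{
        {"afnic.com.",   {"NS ns1.nic.fr.", "NS ns2.nic.fr."}},
        {"example.com.", {"A 93.184.216.34"}},   // illustrative record
    };

    auto it = zone.find("afnic.com.");           // O(log n) tree lookup
    if (it != zone.end())
        for (const auto& rr : it->second)
            std::cout << it->first << " " << rr << "\n";
}
```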
Side note about "need to have a map which gives out IP address of the name server", which is not 100% exact: the registry's authoritative nameservers will mostly have NS records, associating domain names with other authoritative nameservers (a delegation), and will have some A and AAAA records, called glue records, in that case.
Some requests to them may not get you any IP addresses at all, see:
$ dig @a.gtld-servers.net NS afnic.com +noall +ans +auth
; <<>> DiG 9.12.0 <<>> @a.gtld-servers.net NS afnic.com +noall +ans +auth
; (1 server found)
;; global options: +cmd
afnic.com. 172800 IN NS ns1.nic.fr.
afnic.com. 172800 IN NS ns3.nic.fr.
afnic.com. 172800 IN NS ns2.nic.fr.
(no IP addresses whatsoever, because the nameservers are all out of zone, that is, "out-of-bailiwick" to use the precise technical term)

Options to achieve consensus in an immutable distributed hash table

I'm implementing a completely decentralized database. Anyone at any moment can upload any type of data to it. One good solution that fits this problem is an immutable distributed hash table. Values are keyed by their hash. Immutability ensures this map always remains valid, simplifies data integrity checking, and avoids synchronization.
To provide some data retrieval facilities, a tag-based classification will be implemented. Any key (associated with a single unique value) can be tagged with an arbitrary tag (an arbitrary sequence of bytes). To keep things simple I want to use the same distributed hash table to store this tag-hash index.
To implement this database I need some way to maintain a decentralized consensus of what is the actual and valid tag-hash index. Immutability forces me to use some kind of linked data structure. How can I find the root? How to synchronize entry additions? How to make sure there is a single shared root for everybody?
In a distributed hash table you can have the nodes structured in a ring, where each node in the ring knows about at least one other node in the ring (to keep it connected). To make the ring more fault-tolerant, make sure that each node has knowledge of more than one other node in the ring, so that it can still reconnect if some node crashes. In DHT terminology, this is called a "successor list". When the nodes are structured in the ring with unique IDs and some stabilization protocol, you can do key lookups by routing through the ring to find the node responsible for a certain key.
How to synchronize entry additions?
If you don't want replication, a weak version of decentralized consensus is enough: each node has its unique ID and knows about the ring structure. This can be achieved by a periodic stabilization protocol, as in Chord: http://nms.lcs.mit.edu/papers/chord.pdf
The stabilization protocol has each node communicating with its successor periodically, to see whether it is still the true successor in the ring, whether a new node has joined in between, or whether the successor has crashed and the ring must be updated. Since no replication is used, consistent insertions only require that the ring is stable, so that peers can route an insertion to the correct node, which then inserts it into its storage. Each item is held by only a single node in a DHT without replication.
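As a rough sketch, simplified from the Chord paper: the field names, the successor-list length, and the RPC stubs below are made up for illustration, not taken from any real implementation.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified Chord-style node descriptor.
struct NodeInfo {
    std::uint64_t id;        // position on the identifier ring
    std::string   address;   // how to reach the node, e.g. "host:port"
};

// True if x lies on the ring strictly between 'from' and 'to' (clockwise),
// handling wrap-around at the top of the identifier space.
bool between(std::uint64_t x, std::uint64_t from, std::uint64_t to) {
    return from < to ? (x > from && x < to) : (x > from || x < to);
}

struct RingNode {
    NodeInfo self;
    std::vector<NodeInfo> successors;   // e.g. the next 3 nodes clockwise

    // Stubs standing in for real network calls to other nodes.
    NodeInfo askPredecessorOf(const NodeInfo& n) { return n; /* RPC in reality */ }
    void notify(const NodeInfo&, const NodeInfo&) { /* RPC in reality */ }

    // Periodic stabilization: if a new node has joined between us and our
    // current successor, adopt it as the new successor, then notify it.
    void stabilize() {
        if (successors.empty()) return;
        NodeInfo x = askPredecessorOf(successors.front());
        if (between(x.id, self.id, successors.front().id))
            successors.front() = x;
        notify(successors.front(), self);
    }
};
```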
This stabilization procedure gives you a very good probability that the ring stays stable and that inconsistency is minimized, but it cannot guarantee strong consistency: there may be gaps where the ring is temporarily unstable while nodes join or leave. During those inconsistency periods, data loss, duplication, overwrites, etc. could happen.
If your application requires strong consistency, a DHT is not the best architecture; it would be very complex to implement that kind of consistency in a DHT. First of all you would need replication, and you would also need to add a lot of acknowledgements and synchrony to the stabilization protocol, for instance using a 2PC or Paxos protocol for each insertion to ensure that every replica receives the new value.
How can I find the root?
How to make sure there is a single shared root for everybody?
Typically DHTs are associated with some (centralized) lookup service that contains the IPs/IDs of nodes, and new nodes register with that service. This service can then also ensure that each new node gets a unique ID. Since the service only manages IDs and simple lookups, it is not under high load or at much risk of crashing, so it is "OK" to have it centralized without hurting fault tolerance. Of course, you could also distribute the lookup service and synchronize its replicas with a consensus protocol like Paxos.

Does a Tcl nested dictionary use references to references, and avoid capacity issues?

According to the thread:
TCL max size of array
Tcl cannot have more than 128M list/dictionary elements. However, one could have a nested dictionary whose total number of values (across the different levels) exceeds that number.
Now, does the nested dictionary use references, by design? That would mean that as long as no single level of the dictionary tree has more than 128M elements, you should be fine. Is that true?
Thanks.
The current limitation is that no individual memory object (C struct or array) can be larger than 2GB, and it's because the high-performance memory allocator (and a few other key APIs) uses a signed 32-bit integer for the size of the memory chunk to allocate.
This wasn't a significant limitation on a 32-bit machine, where the OS itself would usually restrict you at about the time when you started to near that limit.
However, on a 64-bit machine it's possible to address much more, while at the same time the size of pointers is doubled; e.g., 2GB of space means about 256 million elements at most for a flat list, since each element needs at least one pointer to hold the reference to the value inside it. In addition, the reference counter system might well hit a limit in such a scheme, though that wouldn't be the problem here.
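A quick back-of-the-envelope check of that figure, assuming 8-byte pointers; the real per-element overhead in Tcl is larger, so the practical limit is lower still:

```cpp
#include <cstdint>
#include <iostream>

int main() {
    const std::uint64_t object_cap   = 2ull * 1024 * 1024 * 1024; // 2 GB per memory object
    const std::uint64_t pointer_size = 8;                         // one 64-bit pointer per element
    // 2^31 / 2^3 = 2^28 = 268,435,456, i.e. roughly 256 million slots.
    std::cout << object_cap / pointer_size << " pointer-sized slots per flat list\n";
}
```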
If you create a nested structure, the total number of leaf memory objects that can be held within it can be much larger, but you need to take great care to never get the string serialisation of the list or dictionary since that would probably hit the 2GB hard limit. If you're really handling very large numbers of values, you might want to consider using a database like SQLite as storage instead as that can be transparently backed by disk.
Fixing the problem is messy because it impacts a lot of the API and ABI, and creates a lot of wreckage in the process (plus a few subtle bugs if not done carefully, IIRC). We'll fix it in Tcl 9.0.

What are static pointers in RAM and how can they exist?

I've been studying C++ Game Hacking via tutorials for a week or two and I almost get it all.
However, there's one thing that keeps bothering me over and over again.
To customize a value (e.g. the player's health) we must search for its memory address with Cheat Engine (or a similar tool) and set the value to something else.
These memory addresses are obviously different every time we start the program, because it won't always use the same spot in RAM.
To solve this problem, people try to find a static pointer to the memory address which contains the value. How are the pointers static? How can they reserve a static address in RAM?
Actually, it's not the pointer to the game variable that is static, but the offset of the variable's address relative to the address of some other piece of data.
If the game you want to "hack" always stores its data in the same, fixed structure, then it's possible to find these offsets. Once you know the offsets, the only thing you need to do when the game starts is to find the base address the offsets are relative to, instead of performing multiple scans, one for each variable.
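For example, with a hypothetical layout (not taken from any real game): if the player data always lives in the same structure, the field offsets are fixed at compile time even though the structure's address changes between runs.

```cpp
#include <cstddef>
#include <cstdint>

// Made-up player layout: the 0x3C bytes of padding stand in for fields we
// don't care about; only the offset of 'health' matters for the example.
struct Player {
    std::uint8_t padding[0x3C];
    std::int32_t health;        // always 0x3C bytes into the structure
};

static_assert(offsetof(Player, health) == 0x3C,
              "the offset is fixed by the layout, not by where it is loaded");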
Edit:
Additionally, programs are very likely to be given the same virtual address space every time you run them, so in practice it looks like the variables with static offsets have the same addresses every time you run the program.
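Putting it together, here is a minimal sketch of resolving such a chain of offsets from inside the target process; the base address and the offsets are hypothetical values you would have found with a tool like Cheat Engine.

```cpp
#include <cstdint>
#include <initializer_list>

// Walks a pointer chain: start at a base address, dereference, add the next
// offset, and repeat. All numbers below are made up for illustration.
std::uintptr_t resolvePointerChain(std::uintptr_t base,
                                   std::initializer_list<std::uintptr_t> offsets) {
    std::uintptr_t address = base;
    for (std::uintptr_t offset : offsets)
        address = *reinterpret_cast<std::uintptr_t*>(address) + offset;
    return address;   // final address of the value (e.g. the player's health)
}

// Usage sketch (from inside the game's process, or adapted to ReadProcessMemory):
//   int* health = reinterpret_cast<int*>(
//       resolvePointerChain(moduleBase, {0x10F4F4, 0x3C, 0x14}));
//   *health = 999;
```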

How to store a huge hash table in RAM and share it between different applications?

The data contains information like billions of ID-score pairs. To quickly access this paired information, I plan to use a hash-table container, since its search time complexity is O(1). Considering that the raw data is around 80G, I don't want to load it into RAM every time I need to run a search application. What I want to do is to generate the hash table once, then store it in RAM with a persistence of filesystem lifetime (the expense of RAM is not a criterion), and search it from different applications.
Based on my limited understanding, I could use "Memory Mapped Files" (boost C++ libraries). But I have questions:
1) Is it possible to keep the hash-table data structure when writing it to the mapped file?
2) How much time will it cost to map the existing file into RAM?
Any answers/comments/suggestions are most welcome!
Thanks,
1) Yes. The file is just bytes, just like memory.
2) Creating the mapping will be effectively instantaneous. Note that you won't be able to map all of it contiguously at once except on a 64-bit OS. Of course, if the file cache can't hold the portion of the map you're using, it will have to be read from disk.
How big are the IDs? How big are the pairs? How much locality of reference do you have? (Are there heavily-used pairs and lightly-used pairs?) How often will you be searching for pairs that aren't present? Is the data read-mostly? There may be better ways to do it. I'd strongly suggest starting with a broader question to make sure you're not stuck on a sub-optimal path.
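If you do go the memory-mapped route with Boost, one way to keep the hash table itself inside the mapped file is Boost.Interprocess, whose allocators use offset pointers that stay valid when different processes map the file. The sketch below is a hedged example under that assumption; the file name "scores.bin", the object name "id_scores", the 1 GiB size, and the key/value types are all placeholders.

```cpp
#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <boost/unordered_map.hpp>
#include <cstdint>
#include <functional>
#include <iostream>
#include <utility>

namespace bip = boost::interprocess;

// Placeholder types: the real layout depends on your data. Neither key nor
// value may contain raw pointers into the process's own heap.
using Key   = std::uint64_t;   // the ID
using Value = double;          // the score
using SegmentManager = bip::managed_mapped_file::segment_manager;
using PairAllocator  = bip::allocator<std::pair<const Key, Value>, SegmentManager>;
using SharedMap      = boost::unordered_map<Key, Value, std::hash<Key>,
                                            std::equal_to<Key>, PairAllocator>;

int main() {
    // Create or open the backing file; size it generously for your data set.
    bip::managed_mapped_file file(bip::open_or_create, "scores.bin", 1ull << 30);

    // Every process that maps the same file gets a handle to the same table.
    PairAllocator alloc(file.get_segment_manager());
    SharedMap* map = file.find_or_construct<SharedMap>("id_scores")(
        16u, std::hash<Key>(), std::equal_to<Key>(), alloc);

    (*map)[42] = 3.14;                   // writer side
    std::cout << map->at(42) << "\n";    // another process can read it back
}
```

Note that concurrent writers would still need their own synchronization (e.g. an interprocess mutex); the sketch only shows how the table can live inside the mapped file.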
