What is the maximum number of collisions that may result from hashing n keys? - hashtable

Would it be n, or n-1? In my head I believe it should be n-1, because the first item you insert has nothing to collide with, and every later item can possibly collide with the items before it but not with itself. Am I correct in thinking this way?

If you start from an empty table, it's quite difficult to explain how the first item you put in could collide with something else. That said, if you consider a collision list (the items having the same hash) you see that it contains n items that all collide with each other. Thus you indeed have n-1 collisions while inserting the items, but n colliding items, for none of them is the "right" one.
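As a quick sanity check, here's a minimal sketch (a toy chained table with a deliberately constant hash function, not your actual setup) where inserting n keys produces exactly n-1 collisions:

    #include <iostream>
    #include <vector>

    // Count collisions for a toy chained hash table: a "collision" is an
    // insertion that lands in a bucket already holding at least one key.
    // With a constant hash (worst case), n keys produce exactly n - 1 collisions.
    int main() {
        const int n = 10;
        auto hash = [](int /*key*/) { return 0u; };     // worst-case hash: everything maps to bucket 0

        std::vector<std::vector<int>> buckets(8);
        int collisions = 0;
        for (int key = 0; key < n; ++key) {
            auto& bucket = buckets[hash(key) % buckets.size()];
            if (!bucket.empty()) ++collisions;          // the first key in a bucket never collides
            bucket.push_back(key);
        }
        std::cout << "keys: " << n << ", collisions: " << collisions << '\n';   // prints 9
    }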

Related

vector<vector> as a quick-traversal 2d data structure

I'm currently considering the implementation of a 2D data structure to allow me to store and draw objects in correct Z-Order (GDI+, entities are drawn in call order). The requirements are loosely:
Ability to add new objects to the top of any depth index
Ability to remove arbitrary object
(Ability to move object to the top of new depth index, accomplished by 2 points above)
Fast in-order and reverse-order traversal
As the main requirement is speed of traversal across the full data set, the first thing that came to mind was an array-like structure, e.g. vector. It also easily allows pushing new objects (removing objects, not so much). This works perfectly fine for our requirements, as it just so happens that the bulk of drawable entities don't change, and the ones that do sit at the top end of the order.
However it got me thinking of the implications for more dynamic requirements:
A vector will resize itself as required -> as the 'depth' vectors would need to be maintained contiguously in memory (top-level vector enforces it), this could lead to some pretty expensive vector resizes. Worst case all vectors need to be moved to new memory location, average case requiring all vectors up the chain to be moved.
Vectors will often hold a buffer at the end for adding new objects -> traversal could still easily force a cache miss while jumping between 'depth' vectors, rendering the top-level vector's contiguous memory less beneficial
Could someone confirm that these observations are indeed correct, making a vector a mostly very expensive structure for storing larger dynamic data sets?
From my thoughts above, I end up deducing that while traversing the whole dataset, specifically jumping between different vectors in the top-level vector, you might as well use any other data structure with inferior traversal complexity, or similar random access complexity (linked_list; map). Traversal would effectively be the same, as we might as well assume the cache misses will happen anyway, and we save ourselves a lot of bother by not keeping the depth vectors contiguously in memory.
Would that indeed be a good solution? If I'm not mistaken, on a 1D problem space this would come down to what's more important: traversal or addition/removal, i.e. vector or linked list. On a 2D space I'm not so sure it is so black and white.
I'm wondering what sort of application requires good traversal across a 2D space, without compromising data addition/removal, and what sort of data structures are used there.
P.S. I just noticed I'm completely ignoring space-complexity, so might as well keep on ignoring it (unless you feel like adding more insight :D)
Your first assumption is somewhat incorrect.
Instead of thinking of a vector as the blob of memory itself, think of it as a pointer to an automatically managed blob of memory plus some metadata to keep track of it. The vector object itself has a fixed size; the memory it manages doesn't. (See this example, and note that the size of the vector object is constant: https://ideone.com/3mwjRz)
A vector of vectors can be thought of as an array of pointers. Resizing what the pointers point to doesn't mean you need to resize the array that contains them. The promise of items being contiguous still holds: the parent array has all of the pointers adjacent to each other and each pointer points to a contiguous chunk of memory. However, it's not guaranteed that the end of arr[0][N-1] is adjacent to the beginning of arr[1][0]. (To this end, your second point is correct.)
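To see this concretely, here's a small sketch along the lines of the ideone example above (the exact sizeof value is implementation-dependent): the outer vector's array of handles stays put even when an inner vector reallocates:

    #include <iostream>
    #include <vector>

    // A vector object is a small fixed-size handle (pointer + size + capacity);
    // resizing an inner vector reallocates its own buffer but does not move the
    // outer vector's array of handles.
    int main() {
        std::cout << "sizeof(std::vector<int>) = " << sizeof(std::vector<int>) << '\n';

        std::vector<std::vector<int>> grid(3);
        const auto* outer_before = grid.data();        // address of the array of handles
        const int*  inner_before = grid[1].data();     // address of one inner buffer (may be null)

        grid[1].resize(1000);                          // force the inner vector to reallocate

        std::cout << std::boolalpha
                  << "outer moved: " << (grid.data() != outer_before) << '\n'      // false
                  << "inner moved: " << (grid[1].data() != inner_before) << '\n';  // true (typically)
    }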
I suspect a linked list would be more appropriate, as you will always be traversing the whole list (vectors shine at random access). Linked list inserts and removals are very cheap, and traversal isn't that different from a vector traversal. You should probably consider a doubly linked list, since you want to traverse it in both directions.
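For illustration, here's a rough sketch of that idea with one std::list per depth layer (the Entity type and the layer count are made up for the example): insertion at the top of a layer and removal via a stored iterator are both O(1), and traversal works in either direction.

    #include <iostream>
    #include <list>
    #include <string>

    struct Entity { std::string name; };

    int main() {
        std::list<std::list<Entity>> layers(3);        // depth 0..2, back-most first

        auto& top_layer = layers.back();
        top_layer.push_back({"cursor"});               // add to the top of a depth index
        layers.front().push_back({"background"});

        auto it = top_layer.begin();                   // keep the iterator to remove later in O(1)
        top_layer.erase(it);

        // In-order draw: back-most layer first, front-most last.
        for (const auto& layer : layers)
            for (const auto& e : layer)
                std::cout << e.name << '\n';
    }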

A* Pathfinding - Updating Parent

I've read A* Pathfinding for Beginners and looked at several source code implementations in C++ and other languages. I understand most of what is happening, with the exception of one possible issue I think I have found, and none of the tutorials/implementations I've found cover this.
When you get to this part:
If an adjacent square is already on the open list [...], if the G cost
of the new path is lower, change the parent of the adjacent square to
the selected square. Finally, recalculate both the F and G scores of
that square.
Changing the G score of a square should also change the G score of every child, right? That is, every square that already has this square as the parent, should get a new G score now also. So, shouldn't you find every child (and child of child) square in the open list and recalculate the G values? That will also change the F value, so if using a sorted list/queue, that also means a bunch of resorting.
Is this just not an actual problem, not worth the extra CPU for the extra calculations, and that is why the implementations I've seen just ignore this issue (do not update children)?
It depends on your heuristic.
For correctness, the basic A* algorithm requires that you have an admissible heuristic, that is, one that never overestimates the minimum cost of moving from a node to the goal. However, a search using an admissible heuristic may not always find the shortest path to intermediate nodes along the way. If that's the case with your heuristic, you might later find a shorter path to a node you've already visited and need to expand that node's children again. In this situation, you shouldn't use a closed list, as you need to be able to revisit nodes multiple times if you keep finding shorter routes.
However, if you use a consistent heuristic (meaning that the estimated cost of a node is never more than the estimated cost to one of its neighbors, plus the cost of moving from the node to that neighbor), you will only ever visit a node by the shortest path to it. That means that you can use a closed list and never revisit a node once you've expanded its children.
All consistent heuristics are admissible, but not all admissible heuristics are consistent. Most admissible heuristics are also consistent though, so you'll often see descriptions and example code for A* that assume the heuristic is consistent, even when they don't say so explicitly (or only mention admissibility).
On the page you link to, the algorithm uses a closed list, so it requires a consistent heuristic to be guaranteed of finding an optimal path. However, the heuristic it uses (Manhattan distance) is not consistent (or admissible for that matter) given the way it handles diagonal moves. So while it might find the shortest path, it could also find some other path and incorrectly believe it is the shortest one. A more appropriate heuristic (Euclidean distance, for example) would be both admissible and consistent, and you'd be sure of not running into trouble.
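For illustration, here's a sketch of the two heuristics for a grid where a straight step costs 10 and a diagonal step costs 14, as in that tutorial. The scaled Manhattan distance can overestimate on diagonals, while the octile distance below should be both admissible and consistent for those costs:

    #include <cstdlib>
    #include <algorithm>

    // dx, dy: offsets from the current square to the goal.
    int manhattan(int dx, int dy) {
        // 10 per straight step; overestimates when diagonal shortcuts (cost 14) exist.
        return 10 * (std::abs(dx) + std::abs(dy));
    }

    int octile(int dx, int dy) {
        int adx = std::abs(dx), ady = std::abs(dy);
        // Move diagonally as far as possible (cost 14 each), then straight (cost 10 each).
        return 14 * std::min(adx, ady) + 10 * (std::max(adx, ady) - std::min(adx, ady));
    }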
#eselk: Since the square whose parent and G-score are to be updated is still on the open list, it has not been expanded yet, and therefore none of its children are on the open list. So the question of updating the G-scores of children (and their children in turn) does not arise. Please let me know if this is not clear.

Why is removing a node from a doubly-linked list faster than removing a node from a singly-linked list?

I was curious why deleting a node from a doubly linked list is faster than deleting one from a singly linked list. According to my lecture, it takes O(1) for a doubly linked list compared to O(n) for a singly linked list. According to my thought process, I thought they should both be O(n), since you may have to traverse across all the elements, so it depends on the size.
I understand it's going to be associated with the fact that each node has a previous pointer as well as a next pointer; I just can't understand how it would be a constant-time operation in the sense of O(1).
This partially depends on how you're interpreting the setup. Here are two different versions.
Version 1: Let's suppose that you want to delete a linked list node containing a specific value x from a singly or doubly-linked list, but you don't know where in the list it is. In that case, you would have to traverse the list, starting at the beginning, until you found the node to remove. In both a singly- and doubly-linked list, you can then remove it in O(1) time, so the overall runtime is O(n). That said, it's harder to do the remove step in the singly-linked list, since you need to update a pointer in the preceding cell (which isn't pointed at by the cell to remove), so you need to store two pointers as you do this.
Version 2: Now let's suppose you're given a pointer to the cell to remove and need to remove it. In a doubly-linked list, you can do this by using the next and previous pointers to identify the two cells around the cell to remove and then rewiring them to splice the cell out of the list. This takes time O(1). But what about a singly-linked list? To remove this cell from the list, you have to change the next pointer of the cell that appears before the cell to remove so that it no longer points to the cell to remove. Unfortunately, you don't have a pointer to that cell, since the list is only singly-linked. Therefore, you have to start at the beginning of the list, walk downwards across the nodes, and find the node that comes right before the one to remove. This takes time O(n), so the runtime for the remove step is O(n) in the worst case, rather than O(1). (That said, if you know two pointers - the cell you want to delete and the cell right before it, then you can remove the cell in O(1) time since you don't have to scan the list to find the preceding cell.)
In short: if you know the cell to remove in advance, the doubly-linked list lets you remove it in time O(1) while a singly-linked list would require time O(n). If you don't know the cell in advance, then it's O(n) in both cases.
Hope this helps!
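If it helps, here's a bare-bones sketch of the "Version 2" removal in both list flavours (the node types are minimal and made up for the example):

    // Doubly linked: O(1) -- both neighbours are reachable from the node itself.
    struct DNode { int value; DNode* prev; DNode* next; };
    // Singly linked: only the next neighbour is reachable.
    struct SNode { int value; SNode* next; };

    void remove(DNode*& head, DNode* node) {
        if (node->prev) node->prev->next = node->next; else head = node->next;
        if (node->next) node->next->prev = node->prev;
        delete node;
    }

    void remove(SNode*& head, SNode* node) {
        if (head == node) { head = node->next; delete node; return; }
        SNode* prev = head;
        while (prev->next != node) prev = prev->next;   // the O(n) scan for the predecessor
        prev->next = node->next;
        delete node;
    }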
The list does not have to be traversed in order to connect the previous node to the following node in a doubly linked list. You simply point
curr.Prev.Next = curr.Next and
curr.Next.Prev = curr.Prev.
In a singly linked list, you have to traverse the list to find the previous node. That traversal can take O(n).
An alternative approach seems to be to use double pointers as outlined in this excellent resource: https://github.com/mkirchner/linked-list-good-taste. This means you do not need to keep track of a current and previous pointer as you only use a single pointer to pointer that can modify in place directly. Please let me know if this is inaccurate as I just learnt this.
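For reference, a rough sketch of that pointer-to-pointer idiom (using a minimal, made-up Node type) might look like this: instead of tracking a separate "previous" node, you walk a pointer to the link that points at the current node, so the unlink is a single in-place assignment.

    struct Node { int value; Node* next; };

    void remove_value(Node** head, int value) {
        Node** link = head;                       // points at head, then at each node's next field
        while (*link && (*link)->value != value)
            link = &(*link)->next;
        if (*link) {
            Node* doomed = *link;
            *link = doomed->next;                 // unlink without naming the previous node
            delete doomed;
        }
    }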

Time complexity to fill hash table?

This is a homework question, but I think there's something missing from it. It asks:
Provide a sequence of m keys to fill a hash table implemented with linear probing, such that the time to fill it is minimum.
And then
Provide another sequence of m keys, but such that the time to fill it is maximum. Repeat these two questions if the hash table implements quadratic probing.
I can only assume that the hash table has size m, both because it's the only number given and because we have been using that letter to address a hash table size before when describing the load factor. But I can't think of any sequence to do the first without knowing the hash function that hashes the sequence into the table.
If it is a bad hash function, such that, for instance, it hashes every entry to the same index, then both the minimum and maximum time to fill it will take O(n) time, regardless of what the sequence looks like. And in the average case, where I assume the hash function is OK, how am I supposed to know how long it will take for that hash function to fill the table?
Aren't these questions linked to the hash function stronger than they are to the sequence that is hashed?
As for the second question, I can assume that, regardless of the hash function, a sequence of size m with the same key repeated m-times will provide the maximum time, because it will cause linear probing from the second entry on. I think that will take O(n) time. Is that correct?
Well, the idea behind these questions is to test your understanding of probing styles. For linear probing, if a collision occurs, you simply test the next cell. And it goes on like this until you find an available cell to store your data.
Your hash table doesn't need to be exactly size m, but it needs to be at least size m.
The first question is asking: if you have a perfect hash function, what is the complexity of populating the table? A perfect hash function places each element without collision, so each of the m insertions takes O(1) time and the total complexity is O(m).
The second question is asking about the case where hash(X) = cell(0) for every key, so each element probes forward until the first empty cell (just past the currently populated run).
For the first element, you probe once -> 1 probe
For the second element, you probe twice -> 2 probes
For the m-th element, you probe m times -> m probes
Overall, with m elements, that's 1 + 2 + ... + m = m(m+1)/2 probes -> O(m^2).
For quadratic probing, you have the same strategy. The minimum case is the same, but the maximum case will be around O(m log m). (I didn't work it out; that's just my educated guess.)
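As a rough illustration of the best and worst sequences (using a made-up hash(k) = k % m and a table of exactly m cells), something like this counts the probes:

    #include <iostream>
    #include <optional>
    #include <vector>

    // Toy open-addressing table with linear probing; returns the total number
    // of probes made while inserting all the keys.
    int fill_and_count(const std::vector<int>& keys) {
        const int m = static_cast<int>(keys.size());
        std::vector<std::optional<int>> table(m);
        int probes = 0;
        for (int k : keys) {
            int i = k % m;
            while (table[i].has_value()) { ++probes; i = (i + 1) % m; }  // linear probing
            ++probes;                                                    // the probe that found the empty cell
            table[i] = k;
        }
        return probes;
    }

    int main() {
        const int m = 8;
        std::vector<int> best, worst;
        for (int i = 0; i < m; ++i) {
            best.push_back(i);       // 0,1,2,...  -> every key lands in an empty cell: m probes
            worst.push_back(i * m);  // 0,m,2m,... -> all hash to cell 0: m(m+1)/2 probes
        }
        std::cout << fill_and_count(best) << ' ' << fill_and_count(worst) << '\n';  // prints 8 36
    }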
This question doesn't sound terribly concerned with the hash function, but it would be nice to have. You seem to pretty much get it, though. It sounds to me like the question is more concerned with "do you know what a worst-case list of keys would be?" than "do you know how to exploit bad hash functions?"
Obviously, if you come up with a sequence where all the entries hash to different locations, then you have O(1) insertions for O(m) time in total.
For what you are saying about hashing all the keys to the same location, each insertion should take O(n) if that's what you are suggesting. However, that's not the total time for inserting all the elements. Also, you might want to consider not literally using the same key over and over but rather using keys that would produce the same location in the table. I think, by convention, inserting the same key should cause a replacement, though I'm not 100% sure.
I'll apologize in advance if I gave too much information or left anything unclear. This question seems pretty cut-and-dried save the part about not actually knowing the hash function, and it was kind of hard to really say much without answering the whole question.

What is a destructive update?

I see a lot of functional programming related topics mention destructive updates. I understand that it is something similar to mutation, so I understand the update part. But what is the destructive part? Or am I just over-thinking it?
You're probably overthinking it a bit. Mutability is all there is to it; the only thing being "destroyed" is the previous value of whatever you mutated.
Say you're using some sort of search tree to store values, and you want to insert a new one. After finding the location where the new value goes, you have two options:
With an immutable tree, you construct new nodes along the path from the new value's location up to the root. Subtrees not along the path are reused in the new tree, and if you still have a reference to the original tree's root you can use both, with the common subtrees shared between them. This economizes on space with no extra effort if you have lots of slightly-different copies floating around, and of course you have all the usual benefits of immutable data structures.
With a mutable tree, you attach the new value where it belongs and that's that; nothing else has to be changed. This is almost always faster, and economizes on memory allocation if you only ever have one copy around, but anything that had a reference to the "old" tree now has a reference to the new one. The original has been destroyed; it's gone forever. If you need to keep the original around, you have to go to the expense of creating an entirely new copy of the whole thing before changing it.
If "destruction" seems an unnecessarily harsh way to describe a simple in-place update, then you've probably not spent as much time as I have debugging code in order to figure out where on Earth some value is being changed behind your back.
Imperative programming languages allow variables to be redefined, e.g.
x = 1
x = 2
So x first has the value 1 and then, later, the value 2. The second operation is a destructive update, because x loses its initial definition as being equal to 1.
This is not how definition is handled in common mathematics. Once defined, a variable keeps its value.
The above, seen as a system of equations, would allow us to subtract the first equation from the second, which would give
x - x = 2 - 1 <=> 0 = 1
which is a false statement. It is assumed that once introduced, x is the same.
A familiar statement like
x = x + 1
would lead to the same conclusion.
Functional languages handle variables the way mathematics does: once they are defined, it is not possible to reassign them. The above statement would turn into
x2 = x + 1
and we would have no for or while loop, but rather recursion or some higher-order function.
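As a rough illustration (written in C++ rather than a functional language, just to show the shape), the same computation in both styles:

    // The imperative loop keeps destructively updating `total` (and `i`);
    // the recursive version only ever introduces new bindings.
    int sum_imperative(int n) {
        int total = 0;
        for (int i = 1; i <= n; ++i)
            total = total + i;        // destructive update of total
        return total;
    }

    int sum_recursive(int n) {
        if (n == 0) return 0;         // no variable is ever reassigned here
        return n + sum_recursive(n - 1);
    }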
