Do Bit vectors store memory locations? - hashtable

Hello Computer Science World,
I am trying to answer this question:
A bit vector is simply an array of bits (0s and 1s). A bit vector of length m takes much less space than an array of m pointers. Describe how to use a bit vector to represent a dynamic set of distinct elements with no satellite data. Dictionary operations should run in O(1) time.
My thinking is that a bit vector can be used to store the memory locations of the elements: since we are assuming that no two elements have the same key, we can use a hash function to store the memory location and access it in O(1) time.
Do bit vectors store memory locations?
If not, can someone guide me to the promised land.
Thanks
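For reference, a minimal sketch of the direct-addressing idea the exercise hints at, assuming keys are integers in {0, ..., m-1}: the key itself is the index into the bit vector, so no memory locations are stored at all (the class and names below are illustrative, not from the thread):

#include <cstddef>
#include <vector>

// The bit vector itself is the set: bit k is 1 exactly when key k is a member.
class BitVectorSet {
public:
    explicit BitVectorSet(std::size_t m) : bits_(m, false) {}
    void insert(std::size_t k) { bits_[k] = true; }           // O(1)
    void remove(std::size_t k) { bits_[k] = false; }          // O(1)
    bool contains(std::size_t k) const { return bits_[k]; }   // O(1)
private:
    std::vector<bool> bits_;  // one bit per possible key, no satellite data
};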

Related

What is the most efficient and portable way to define an order on pointers?

I have a slice that contains pointers to values. In a performance-critical part of my program, I'm adding or removing values from this slice. For the moment, inserting a value is just an append (O(1) complexity), and removal consists of searching the slice for the corresponding pointer value, from 0 to n-1, until the pointer is found (O(n)). To improve performance, I'd like to keep the values in the slice sorted, so that searching can be done with binary search (O(log n)).
But how can I compare pointer values? Pointer arithmetic is forbidden in Go, so AFAIK to compare pointer values p1 and p2 I have to use the unsafe package and do something like
uintptr(unsafe.Pointer(p1)) < uintptr(unsafe.Pointer(p2))
Now, I'm not comfortable using unsafe, at least because of its name. So, is that method correct? Is it portable? Are there potential pitfalls? Is there a better way to define an order on pointer values? I know I could use maps, but maps are slow as hell.
As said by others, don't do this. Performance can't be so critical that you need to resort to pointer arithmetic in Go.
Pointers are comparable; see Spec: Comparison operators:
Pointer values are comparable. Two pointer values are equal if they point to the same variable or if both have value nil. Pointers to distinct zero-size variables may or may not be equal.
Just use a map with the pointers as keys. Simple as that. Yes, indexing maps is slower than indexing slices, but then again, if you wanted to keep your slice sorted and perform binary searches on it, the performance gap shrinks: the (hash) map implementation gives you O(1) lookup, while binary search is only O(log n). With a big data set, the map might even be faster than searching the slice.
If you anticipate a large number of pointers in the map, pre-allocate a big one with make(), passing an estimated upper size; until your map exceeds this size, no reallocation will occur.
m := make(map[*mytype]struct{}, 1<<20) // Allocate map for 1 million entries
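For completeness, a small sketch (mine, not from the original answer) of the set operations on such a map, each O(1) on average; mytype is just a placeholder type:

package main

import "fmt"

type mytype struct{ v int }

func main() {
	set := make(map[*mytype]struct{}, 1<<10) // pre-sized set of pointers

	p := &mytype{v: 42}
	set[p] = struct{}{} // insert
	_, ok := set[p]     // lookup
	delete(set, p)      // remove

	fmt.Println("was present:", ok)
}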

Bit Array with Find Max

So bit arrays and hash tables don't seem to inherently allow for a find-max type operation, but there are ways around it. I'm wondering if there's a way using the bit array alone without extra variables, pointers, or manipulating the start/end of the array, in some scenarios. For example...
I have integers {1,...,n} and an n-bit bit array. To keep a subset of the integers, I use the integer itself as the key in the bit array and set the bit to 1 if it is in the subset, or 0 if it is not.
For example, for integers {1,2,3,4} and subset {1,3}, the bit array would look like {1,0,1,0}.
It seems like there's no way to do this without somehow moving the bits around, which leads me to believe the O(1) dream is dead and perhaps the bit array won't work. Is something like this possible in O(log n)?
Thanks
Finding the highest set bit on a bit array of length n is O(n). If you need better, then you'll need to choose another data structure, or keep a high-water mark along with your bitmap.
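A quick sketch of both options mentioned in the answer (my own illustration, with assumed names): a linear scan from the top of the bit array, and a high-water mark kept alongside the bitmap so find-max stays O(1) until the current maximum is removed:

#include <cstddef>
#include <vector>

// Option 1: scan from the top of the bit array -- O(n). Returns -1 if empty.
long findMax(const std::vector<bool>& bits) {
    for (std::size_t i = bits.size(); i-- > 0; )
        if (bits[i]) return static_cast<long>(i);
    return -1;
}

// Option 2: keep a high-water mark alongside the bitmap.
struct BitArrayWithMax {
    std::vector<bool> bits;
    long maxSet = -1;  // highest key currently set, -1 if none

    explicit BitArrayWithMax(std::size_t n) : bits(n, false) {}

    void insert(std::size_t k) {             // O(1)
        bits[k] = true;
        if (static_cast<long>(k) > maxSet) maxSet = static_cast<long>(k);
    }
    void remove(std::size_t k) {             // O(1) unless k was the maximum
        bits[k] = false;
        while (maxSet >= 0 && !bits[static_cast<std::size_t>(maxSet)]) --maxSet;
    }
    long findMax() const { return maxSet; }  // O(1)
};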

What is Vector data structure

I know Vector in C++ and Java; it's like a dynamic array, but I can't find any general definition of the Vector data structure. So what is a Vector? Is Vector a general data structure (like array, stack, queue, tree, ...) or is it just a data type that depends on the language?
The word "vector" as applied to computer science/programming is borrowed from math, which can make the use confusing (even your question could be on multiple subjects).
The simplest example of vectors in math is the number line, used to teach elementary math (especially to help visualize negative numbers, subtraction of negative numbers, addition of negative numbers, etc).
A vector is a distance and direction from a point. This is why it can confuse the discussion, because a vector data structure COULD be three components, X, Y, Z, in a structure used in 3D graphics engines, or a 2D one (just X, Y). In that context, the subtraction of two such points results in a vector - the vector describes how far and in what direction to travel from one of the source operands to the other.
This applies to storage, like stl vectors or Java vectors, in that storage is represented as a distance from an address (where a memory address is similar to a point in space, or on a number line).
The concept is related to arrays, because arrays could be the storage allocated for a vector, but I submit that the vector is a larger concept than the array. A vector must include the concept of distance from a starting point, and if you think of the beginning of an array as the starting point, the distance to the end of the array is its size.
So, the data structure representing a vector must include the size, whereas an array doesn't have storage to include the size; it's assumed by the way it's allocated. That is to say, if you dynamically allocate an array, there is no data structure storing the size of that array; the programmer is assumed to know that size, or to store it in some integer or long.
The vector data structure (say, the design of a vector class) DOES need to store the size, so at a minimum, there would be a starting point (the base of an array, or some address in memory) and a distance from that point indicating size.
That's really "RAM" oriented in description, though, because there's one more point not yet described which must be part of the data describing the vector - the notion of element size. If a vector represents bytes, and memory storage is typically measured in bytes, an address and a distance (or size) would represent a vector of bytes, but nothing else - and that's very machine-level thinking. A higher-level notion, that of some structure, has its own size - say, the size of a float or double, or of a structure or class in C++. Whatever the element size is, storing N of them requires that the vector data structure have some knowledge of WHAT it's storing, and how large that thing is. This is why you'd think in terms of "a vector of strings" or "a vector of points". A vector must also store an element size.
So, a basic vector data structure must have:
An address (the starting point)
An element size (each thing it stores is X bytes long)
A number of elements stored (how many elements times element size is 'minimum' storage size).
One important "assumption" made in this simple 3-item list of entries in the vector data structure is that the address is allocated memory, which must be freed at some point, and which must be guarded against access beyond the end of the vector.
That means there's something missing. In order to make a vector class work, there is a recognizable difference between the number of ITEMS stored in the vector, and the amount of memory ALLOCATED for that storage. Typically, as you might realize from the use of vector from the STL, it may "know" it has room to store 10 items, but currently only has 2 of them.
So, a working vector class would ALSO have to store the amount of memory allocation. This would be how it could dynamically extend itself - it would now have sufficient information to expand storage automatically.
Thinking through just how you would make a vector class operate gives you the structure of data required to operate a vector class.
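A minimal sketch of that structure in C++ (my illustration, not from the answer) - a starting address, a count of stored elements, and the allocated capacity, with the element size carried by the template parameter:

#include <cstddef>
#include <memory>

template <typename T>
class SimpleVector {
public:
    void push_back(const T& value) {
        if (size_ == capacity_) grow();   // out of room: reallocate and copy
        data_[size_++] = value;
    }
    T& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return size_; }
    std::size_t capacity() const { return capacity_; }

private:
    void grow() {
        std::size_t newCap = capacity_ ? capacity_ * 2 : 4;  // over-allocate
        std::unique_ptr<T[]> newData(new T[newCap]);
        for (std::size_t i = 0; i < size_; ++i)              // copy old elements
            newData[i] = data_[i];
        data_ = std::move(newData);                          // old storage freed here
        capacity_ = newCap;
    }

    std::unique_ptr<T[]> data_;   // starting address of the storage
    std::size_t size_ = 0;        // number of elements stored
    std::size_t capacity_ = 0;    // number of elements allocated for
};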
It's an array with dynamically allocated space; every time you exceed this space, a new block of memory is allocated, the old array is copied into the new one, and the old one is then freed.
Moreover, a vector usually allocates more memory than it currently needs, so it does not have to copy all the data every time a new element is added.
It may seem that lists are then much better, but that's not necessarily so. If you do not change your vector's size often, the CPU cache works much better with vectors than with lists, because vectors are contiguous in memory. The disadvantage shows up when you have a large vector that you need to expand: then you have to accept copying a large amount of data to another place in memory.
What's more, you can add new data at the end or at the front of the vector. Because vectors are array-like, every time you want to add an element at the beginning of the vector, the whole array has to be copied. Adding elements at the end of a vector is far more efficient. There is no such issue with linked lists.
A vector gives random access to its internally kept data, while lists, queues, and stacks do not.
Vectors are the same as dynamic arrays, with the ability to resize themselves automatically when an element is inserted or deleted. Vector elements are placed in contiguous storage so that they can be accessed and traversed using iterators. In vectors, data is inserted at the end.

Using 2d array vs array of derived type in Fortran 90

Assuming you want a list of arrays, each having the same size, is it better performance-wise to use a 2D array:
integer, allocatable :: data(:,:)
or an array of derived types:
type test
   integer, allocatable :: content(:)
end type
type(test), allocatable :: data(:)
Of course, for arrays of different sizes, we don't have a choice. But how is the memory managed in the two cases? Also, is one of them better coding practice?
Choose the implementation which minimises the conceptual distance that your mind has to leap between the problem in your head and the solution in your code. The force of this approach increases with age, both the age of your code (good conceptual design is a solid foundation for future development) and your own age (the less effort understanding your code demands the longer you'll remain mentally competent enough to understand it).
As to the non-opinion-determined part of your question concerning the way that the memory is managed ... My naive expectation is that most compilers will, under most circumstances, allocate contiguous memory for the first of your outlines, and may not for the second. But I don't care enough about this to check, and I do not think that you should either. I don't, by this, suggest that you should not be interested in what is going on under the hood, but rather that you should be more concerned with the matters referred to in the first paragraph.
In general, you want to use the simplest data structure that suits your problem. If a 2d rectangular array meets your needs - and for a huge number of scientific computing problems, problems for which Fortran is a good choice, it does - then that's the choice you want.
The 2d array will be contiguous in memory, which will normally make accessing it faster both due to caching and one fewer level of indirection; the 2d array will also allow you to do things like data = data * 2 or data = 0. which the array-of-array approach doesn't [Edited to add: though as IanH points out in comments you can create a defined type and defined operations on those types to allow this]. Those advantages are great enough that even when you have "ragged arrays", if the range of expected row lengths isn't that large, implementing it as a rectangular 2d array is sometimes a choice worth considering.

HDF5: writing strings, structs, and vectors (oh, my)

I have a struct defined as such:
#include <string>
#include <vector>

typedef struct {
    std::string mName;
    std::vector<int> mParts;
} AGroup;
I'm storing instances of this struct in a vector. I need to write this to an HDF5 file. I guess I could loop through each instance to find the longest mName and the longest mParts, create a new, fixed-length array to hold the information, and then write that array to the file.
Is that the best way to do it? It seems overly complex just to write some data.
Variable-length arrays and strings introduce overhead. But they also make sense, as they reflect your data structure more accurately. You could go with a compound datatype made of a variable-length C string and a variable-length array of integers.
If your strings and vectors all have sizes close to the same upper bound, save yourself some time and trouble and use fixed-length strings and arrays.
It all depends on the size of your dataset. In terms of disk space, there is a tradeoff between the overhead of variable-length elements and the wasted space in fixed-size elements, which is hard to estimate without trying. If your dataset is small, do whatever is more convenient for you. If it is large, choose what you favor most: avoiding wasted space, the semantics of your data, ease of programming, etc., and optimize according to that criterion.
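As a rough sketch of the fixed-length route (my own, with assumed bounds of 64 characters and 32 parts, and a hypothetical AGroupFixed mirror of the struct), the compound datatype could be built with the HDF5 C API along these lines:

#include <cstddef>
#include "hdf5.h"

// Hypothetical fixed-length mirror of AGroup; 64 and 32 are assumed upper bounds.
struct AGroupFixed {
    char mName[64];
    int  mParts[32];
    int  mPartCount;   // how many entries of mParts are actually in use
};

hid_t makeAGroupType() {
    hid_t strType = H5Tcopy(H5T_C_S1);       // fixed-length string of 64 bytes
    H5Tset_size(strType, 64);

    hsize_t dims[1] = {32};                  // fixed-length array of 32 ints
    hid_t arrType = H5Tarray_create2(H5T_NATIVE_INT, 1, dims);

    // Compound type matching the in-memory layout of AGroupFixed.
    hid_t cmpType = H5Tcreate(H5T_COMPOUND, sizeof(AGroupFixed));
    H5Tinsert(cmpType, "mName",      HOFFSET(AGroupFixed, mName),      strType);
    H5Tinsert(cmpType, "mParts",     HOFFSET(AGroupFixed, mParts),     arrType);
    H5Tinsert(cmpType, "mPartCount", HOFFSET(AGroupFixed, mPartCount), H5T_NATIVE_INT);

    H5Tclose(strType);
    H5Tclose(arrType);
    return cmpType;
}

The variable-length alternative mentioned above would instead use H5Tset_size(strType, H5T_VARIABLE) for the string and H5Tvlen_create(H5T_NATIVE_INT) for the integer array, at the cost of the extra overhead discussed.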
