The complexity of Scheme vectors - vector

The 6.3.6 Vectors section in the Scheme R5RS standard states the following about vectors:
Vectors are heterogenous structures whose elements are indexed by integers. A vector typically occupies less space than a list of the same length, and the average time required to access a randomly chosen element is typically less for the vector than for the list.
This description of vectors is a bit diffuse.
I'd like to know what this actually means in terms of the vector-ref and list-ref operations and their complexity. Both procedures returns the k-th element of a vector and a list. Is the vector operation O(1) and is the list operation O(n)? How are vectors different than lists? Where can I find more information about this?
Right now I'm using association lists as a data structure for storing key/value pairs for easy lookup. If the keys are integers it would perhaps be better to use vectors to store the values.

The very specific details of vector-ref and list-ref are implementation-dependent, meaning: each Scheme interpreter can implement the specification as it sees fit, so an answer for your question can not be generalized to all interpreters conforming to R5RS, it depends on the actual interpreter you're using.
But yes, in any decent implementation is a safe bet to assume that the vector-ref operation is O(1), and that the list-ref operation is probably O(n). Why? because a vector, under the hood, should be implemented using a data structure native to the implementation language, that allows O(1) access to an element given its index (say, a primitive array) - therefore making the implementation of vector-ref straightforward. Whereas lists in Lisp are created by linking cons cells, and finding an element at any given index entails traversing all the elements before it in the list - hence O(n) complexity.
As a side note - yes, using vectors would be a faster alternative than using association lists of key/value pairs, as long as the keys are integers and the number of elements to be indexed is known beforehand (a Scheme vector can not grow its size after its creation). For the general case (keys other than integers, variable size) check if your interpreter supports hash tables, or use an external library that provides them (say, SRFI 69).

A list is constructed from cons cells. From the R5RS list section:
The objects in the car fields of successive pairs of a list are the elements of the list. For example, a two-element list is a pair whose car is the first element and whose cdr is a pair whose car is the second element and whose cdr is the empty list. The length of a list is the number of elements, which is the same as the number of pairs.
For example, the list (a b c) is equivalent to the following series of pairs: (a . (b . (c . ())))
And could be represented in memory by the following "nodes":
[p] --> [p] --> [p] --> null
| | |
|==> a |==> b |==> c
With each node [] containing a pointer p to the value (it's car), and another pointer to the next element (it's cdr).
This allows the list to grow to an unlimited length, but requires a ref operation to start at the front of the list and traverse k elements in order to find the requested one. As you stated, this is O(n).
By contrast, a vector is basically an array of values which could be internally represented as an array of pointers. For example, the vector #(a b c) might be represented as:
[p p p]
| | |
| | |==> c
| |
| |==> b
|
|==> a
Where the array [] contains a series of three pointers, and each pointer is assigned to a value in the vector. So internally you could reference the third element of the vector v using the notation v[3]. Since you do not need to traverse the previous elements, vector-ref is an O(1) operation.
The main disadvantage is that vectors are of fixed size, so if you need to add more elements than the vector can hold, you have to allocate a new vector and copy the old elements to this new vector. This can potentially be an expensive operation if your application does this on a regular basis.
There are many resources online - this article on Scheme Data Structures goes into more detail and provides some examples, although it is much more focused on lists.
All that said, if your keys are (or can become) integers and you either have a fixed number of elements or can manage with a reasonable amount of vector reallocations - for example, you load the vector at startup and then perform mostly reads - a vector may be an attractive alternative to an association list.

Related

What is Vector data structure

I know Vector in C++ and Java, it's like dynamic Array, but I can't find any general definition of Vector data structure. So what is Vector? Is Vector a general data structure(like arrray, stack, queue, tree,...) or it just a data type depending on language?
The word "vector" as applied to computer science/programming is borrowed from math, which can make the use confusing (even your question could be on multiple subjects).
The simplest example of vectors in math is the number line, used to teach elementary math (especially to help visualize negative numbers, subtraction of negative numbers, addition of negative numbers, etc).
The vector is a distance and direction from a point. This is why it can confuse the discussion, because a vector data structure COULD be three points, X,Y,Z, in a structure used in 3D graphics engines, or a 2D point (just X,Y). In that context, the subtraction of two such points results in a vector - the vector describes how far and in what direction to travel from one of the source operands to the other.
This applies to storage, like stl vectors or Java vectors, in that storage is represented as a distance from an address (where a memory address is similar to a point in space, or on a number line).
The concept is related to arrays, because arrays could be the storage allocated for a vector, but I submit that the vector is a larger concept than the array. A vector must include the concept of distance from a starting point, and if you think of the beginning of an array as the starting point, the distance to the end of the array is it's size.
So, the data structure representing a vector must include the size, whereas an array doesn't have storage to include the size, it's assumed by the way it's allocated. That is to say, if you dynamically allocate an array, there is no data structure storing the size of that array, the programmer must assume to know that size, or store it in a some integer or long.
The vector data structure (say, the design of a vector class) DOES need to store the size, so at a minimum, there would be a starting point (the base of an array, or some address in memory) and a distance from that point indicating size.
That's really "RAM" oriented, though, in description, because there's one more point not yet described which must be part of the data describing the vector - the notion of element size. If a vector represents bytes, and memory storage is typically measured in bytes, an address and a distance (or size) would represent a vector of bytes, but nothing else - and that's a very machine level thinking. A higher thought, that of some structure, has it's own size - say, the size of a float or double, or of a structure or class in C++. Whatever the element size is, the memory required to store N of them requires that the vector data structure have some knowledge of WHAT it's storing, and how large that thing is. This is why you'd think in terms of "a vector of strings" or "a vector of points". A vector must also store an element size.
So, a basic vector data structure must have:
An address (the starting point)
An element size (each thing it stores is X bytes long)
A number of elements stored (how many elements times element size is 'minimum' storage size).
One important "assumption" made in this simple 3 item list of entries in the vector data structure is that the address is allocated memory, which must be freed at some point, and is to be guarded against access beyond the end of the vector.
That means there's something missing. In order to make a vector class work, there is a recognizable difference between the number of ITEMS stored in the vector, and the amount of memory ALLOCATED for that storage. Typically, as you might realize from the use of vector from the STL, it may "know" it has room to store 10 items, but currently only has 2 of them.
So, a working vector class would ALSO have to store the amount of memory allocation. This would be how it could dynamically extend itself - it would now have sufficient information to expand storage automatically.
Thinking through just how you would make a vector class operate gives you the structure of data required to operate a vector class.
It's an array with dynamically allocated space, everytime you exceed this space new place in memory is allocated and old array is copied to the new one. Old one is freed then.
Moreover, vector usually allocates more memory, than it needs to, so it does not have to copy all the data, when new element is added.
It may seem, that lists then are much much better, but it's not necessarily so. If you do not change your vector often (in terms of size), then computer's cache memory functions much better with vectors, than lists, because they are continuus in memory space. Disadvantage is when you have large vector, that you need to expand. Then you have to agree to copy large amount of data to another space in memory.
What's more. You can add new data to the end and to the front of the vector. Because Vector's are array-like, then every time you want to add element to the beginning of the vector all the array has to be copied. Adding elements to the end of vector is far more efficient. There's no such an issue with linked lists.
Vector gives random access to it's internal kept data, while lists,queues,stacks do not.
Vectors are the same as dynamic arrays with the ability to resize
itself automatically when an element is inserted or deleted.
Vector elements are placed in contiguous storage so that they can be
accessed and traversed using iterators.
In vectors, data is inserted at the end.

How do you prove termination of a recursive list length?

Suppose we have a list:
List = nil | Cons(car cdr:List).
Note that I am talking about modifiable lists!
And a trivial recursive length function:
recursive Length(List l) = match l with
| nil => 0
| Cons(car cdr) => 1 + Length cdr
end.
Naturally, it terminates only when the list is non-circular:
inductive NonCircular(List l) = {
empty: NonCircular(nil) |
\forall head, tail: NonCircular(tail) => NonCircular (Cons(head tail))
}
Note that this predicate, being implemented as a recursive function, also does not terminate on a circular list.
Usually I see proofs of list traversal termination that use list length as a bounded decreasing factor. They suppose that Length is non-negative. But, as I see it, this fact (Length l >= 0) follows from the termination of Length on the first place.
How do you prove, that the Length terminates and is non-negative on NonCircular (or an equivalent, better defined predicate) lists?
Am I missing an important concept here?
Unless the length function has cycle detection there is no guarantee it will halt!
For a singly linked list one uses the Tortoise and hare algorithm to determine the length where there is a chance there might be circles in the cdr.
It's just two cursors, the tortoise starts at first element and the hare starts at the second. Tortoise moves one pointer at a time while the hare moves two (if it can). The hare will eventually either be the same as the tortoise, which indicates a cycle, or it will terminate knowing the length is 2*steps or 2*steps+1.
Compared to finding cycles in a tree this is very cheap and performs just as well on terminating lists as a function that does not have cycle detection.
The definition of List that you have on top doesn't seem to permit circular lists. Each call to the "constructor" Cons will create a new pointer, and you aren't allowed to modify the pointer later to create the circularity.
You need a more sophisticated definition of List if you want to handle circularity. You probably need to define a Cell containing data value and an address, and a Node which contains a Cell and an address pointing to the previous node, and then you'll need to define the dereferencing operator to go back from addresses to Cells. You can also try to define non-circular on this object.
My gut feeling is that you will also need to define an injective function from the "simple" list definition you have above to the sophisticated one that I've outlined and then finally you'll be able to prove your result.
One other thing, the definition of NonCircular doesn't need to terminate. It isn't a program, it is a proof. If it holds, then you can examine the proof to see why it holds and use this in other proofs.
Edit: Thanks to Necto for pointing out I'm wrong.

Count negative numbers in list using list comprehension

Working through the first edition of "Introduction to Functional Programming", by Bird & Wadler, which uses a theoretical lazy language with Haskell-ish syntax.
Exercise 3.2.3 asks:
Using a list comprehension, define a function for counting the number
of negative numbers in a list
Now, at this point we're still scratching the surface of lists. I would assume the intention is that only concepts that have been introduced at that point should be used, and the following have not been introduced yet:
A function for computing list length
List indexing
Pattern matching i.e. f (x:xs) = ...
Infinite lists
All the functions and operators that act on lists - with one exception - e.g. ++, head, tail, map, filter, zip, foldr, etc
What tools are available?
A maximum function that returns the maximal element of a numeric list
List comprehensions, with possibly multiple generator expressions and predicates
The notion that the output of the comprehension need not depend on the generator expression, implying the generator expression can be used for controlling the size of the generated list
Finite arithmetic sequence lists i.e. [a..b] or [a, a + step..b]
I'll admit, I'm stumped. Obviously one can extract the negative numbers from the original list fairly easily with a comprehension, but how does one then count them, with no notion of length or indexing?
The availability of the maximum function would suggest the end game is to construct a list whose maximal element is the number of negative numbers, with the final result of the function being the application of maximum to said list.
I'm either missing something blindingly obvious, or a smart trick, with a horrible feeling it may be the former. Tell me SO, how do you solve this?
My old -- and very yellowed copy of the first edition has a note attached to Exercise 3.2.3: "This question needs # (length), which appears only later". The moral of the story is to be more careful when setting exercises. I am currently finishing a third edition, which contains answers to every question.
By the way, did you answer Exercise 1.2.1 which asks for you to write down all the ways that
square (square (3 + 7)) can be reduced to normal form. It turns out that there are 547 ways!
I think you may be assuming too many restrictions - taking the length of the filtered list seems like the blindingly obvious solution to me.
An couple of alternatives but both involve using some other function that you say wasn't introduced:
sum [1 | x <- xs, x < 0]
maximum (0:[index | (index, ()) <- zip [1..] [() | x <- xs, x < 0]])

Choosing unique items from a list, using recursion

As follow up to yesterday's question Erlang: choosing unique items from a list, using recursion
In Erlang, say I wanted choose all unique items from a given list, e.g.
List = [foo, bar, buzz, foo].
and I had used your code examples resulting in
NewList = [bar, buzz].
How would I further manipulate NewList in Erlang?
For example, say I not only wanted to choose all unique items from List, but also count the total number of characters of all resulting items from NewList?
In functional programming we have patterns that occur so frequently they deserve their own names and support functions. Two of the most widely used ones are map and fold (sometimes reduce). These two form basic building blocks for list manipulation, often obviating the need to write dedicated recursive functions.
Map
The map function iterates over a list in order, generating a new list where each element is the result of applying a function to the corresponding element in the original list. Here's how a typical map might be implemented:
map(Fun, [H|T]) -> % recursive case
[Fun(H)|map(Fun, T)];
map(_Fun, []) -> % base case
[].
This is a perfect introductory example to recursive functions; roughly speaking, the function clauses are either recursive cases (result in a call to iself with a smaller problem instance) or base cases (no recursive calls made).
So how do you use map? Notice that the first argument, Fun, is supposed to be a function. In Erlang, it's possible to declare anonymous functions (sometimes called lambdas) inline. For example, to square each number in a list, generating a list of squares:
map(fun(X) -> X*X end, [1,2,3]). % => [1,4,9]
This is an example of Higher-order programming.
Note that map is part of the Erlang standard library as lists:map/2.
Fold
Whereas map creates a 1:1 element mapping between one list and another, the purpose of fold is to apply some function to each element of a list while accumulating a single result, such as a sum. The right fold (it helps to think of it as "going to the right") might look like so:
foldr(Fun, Acc, [H|T]) -> % recursive case
foldr(Fun, Fun(H, Acc), T);
foldr(_Fun, Acc, []) -> % base case
Acc.
Using this function, we can sum the elements of a list:
foldr(fun(X, Sum) -> Sum + X, 0, [1,2,3,4,5]). %% => 15
Note that foldr and foldl are both part of the Erlang standard library, in the lists module.
While it may not be immediately obvious, a very large class of common list-manipulation problems can be solved using map and fold alone.
Thinking recursively
Writing recursive algorithms might seem daunting at first, but as you get used to it, it turns out to be quite natural. When encountering a problem, you should identify two things:
How can I decompose the problem into smaller instances? In order for recursion to be useful, the recursive call must take a smaller problem as its argument, or the function will never terminate.
What's the base case, i.e. the termination criterion?
As for 1), consider the problem of counting the elements of a list. How could this possibly be decomposed into smaller subproblems? Well, think of it this way: Given a non-empty list whose first element (head) is X and whose remainder (tail) is Y, its length is 1 + the length of Y. Since Y is smaller than the list [X|Y], we've successfully reduced the problem.
Continuing the list example, when do we stop? Well, eventually, the tail will be empty. We fall back to the base case, which is the definition that the length of the empty list is zero. You'll find that writing function clauses for the various cases is very much like writing definitions for a dictionary:
%% Definition:
%% The length of a list whose head is H and whose tail is T is
%% 1 + the length of T.
length([H|T]) ->
1 + length(T);
%% Definition: The length of the empty list ([]) is zero.
length([]) ->
0.
You could use a fold to recurse over the resulting list. For simplicity I turned your atoms into strings (you could do this with list_to_atom/1):
1> NewList = ["bar", "buzz"].
["bar","buzz"]
2> L = lists:foldl(fun (W, Acc) -> [{W, length(W)}|Acc] end, [], NewList).
[{"buzz",4},{"bar",3}]
This returns a proplist you can access like so:
3> proplists:get_value("buzz", L).
4
If you want to build the recursion yourself for didactic purposes instead of using lists:
count_char_in_list([], Count) ->
Count;
count_char_in_list([Head | Tail], Count) ->
count_char_in_list(Tail, Count + length(Head)). % a string is just a list of numbers
And then:
1> test:count_char_in_list(["bar", "buzz"], 0).
7

What is an m-ary vector?

I am watching a lecture on threading and they use the term m-ary vector as follows:
"Let [X] represent an m-ary vector of non-negative integers"
What is this? Is the arity the length? I presume a vector is merely a sequential data structure like an array? Why would the letter m be used - I have only ever seen n-ary previously.
Is the arity the length?
Yes.
I presume a vector is merely a sequential data structure like an array?
Yes.
Why would the letter m be used - I have only ever seen n-ary previously.
There are twenty-six latin letters that could be used. If -- later -- they are going to talk about two different length vectors, they're going to need to different letters.

Resources