Is there an intersection function for vectors?

Very often one finds statements that lists have a performance disadvantage compared to vectors, because of consing and additional GC steps, and that many functions work on generic sequences, accepting both lists and vectors.
But some functions like intersection expect two lists. Is there a library providing an alternative for vectors?
I started with something like this, but have the feeling that there should be a more mature solution.
(defun vec-intersec (vec-1 vec-2 &aux (result (make-array 0 :adjustable t :fill-pointer 0)))
  "A simple implementation of intersection for vectors instead of lists."
  (loop :for v1 :across vec-1
        :if (find v1 vec-2 :test #'equal)
          :do (vector-push-extend v1 result))
  result)

It always depends on the size of your collection and what you want to do with it.
Below about 20 to 50 elements, lists are often perfectly OK even for random access (if you're not in a tight inner loop, or consing a lot).
If you already have vectors, it might be most convenient to sort one of them so that you can do a binary search instead of a naïve linear one. If that is not enough and your collections are bigger, putting the elements into a hash table (as keys, with an appropriate :test) gives you faster (amortized) lookup.
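For illustration, here is a minimal sketch of the hash-table approach; the function name and the :test default are my own choices, not a library API:

(defun vec-intersect-hashed (vec-1 vec-2 &key (test #'equal))
  "Intersection of two vectors, using a hash table for amortized O(1) lookup."
  ;; NB: MAKE-HASH-TABLE only accepts EQ, EQL, EQUAL or EQUALP as :TEST.
  (let ((seen (make-hash-table :test test))
        (result (make-array 0 :adjustable t :fill-pointer 0)))
    ;; Enter every element of the second vector as a key.
    (loop :for v :across vec-2
          :do (setf (gethash v seen) t))
    ;; Keep the elements of the first vector that occur in the table.
    (loop :for v :across vec-1
          :when (gethash v seen)
            :do (vector-push-extend v result))
    result))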
This should take you quite far. If you identify an issue that cannot be solved in such a simple way, you might want to look into FSet or CL-Containers, which support more advanced data structures.

Why does apply throw a CONTROL-STACK-EXHAUSTED-ERROR on a large list?

(apply #'+ (loop for i from 1 to x collect 1))
works if x has value 253391, but fails with a (SB-KERNEL::CONTROL-STACK-EXHAUSTED-ERROR) on 253392*. This is orders of magnitude smaller than call-arguments-limit**.
Is recursion exhausting the stack? If so, is it in apply? Why hasn't it been optimized out?
*Also interesting, (apply #'max (loop for i from 1 to 253391 collect 1)) throws the error, but 253390 is fine.
**call-arguments-limit evaluates to 4611686018427387903 (with the help of format's ~R, it turns out this is four quintillion six hundred eleven quadrillion six hundred eighty-six trillion eighteen billion four hundred twenty-seven million three hundred eighty-seven thousand nine hundred three)
Parameters that can be passed to a function in SBCL
You don't pass parameters. You pass arguments.
(defun foo (x y) (list x y))
x and y are parameters of the function foo.
(foo 20 22)
20 and 22 are arguments in a call of the function foo.
See the variables call-arguments-limit and lambda-parameters-limit.
SBCL and call-arguments-limit
If a function can't handle anywhere near the claimed number of arguments, then this looks like a bug in SBCL. You might want to report it; maybe the value of call-arguments-limit needs to be changed.
Testing
APPLY is one way to test it.
Another:
(eval (append '(drop-params)
              (loop for i from 1 to 2533911 collect 1)))
One can also use FUNCALL with a number of arguments spread out.
Why does a limit exist?
The Common Lisp standard was written to allow efficient implementations on a variety of different computers. It was thought that some machine-level function-calling conventions support only a limited number of arguments. The standard says the number of supported arguments can be as low as 50, and some implementations do have a relatively low limit.
Thus apply in Common Lisp is not a tool for list processing, but for calling functions with computed argument lists.
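For example, a toy illustration of such a computed call (make-point is a hypothetical function, not part of any library):

;; APPLY is for calling a function whose argument list is computed at
;; run time and known to be small:
(defun make-point (x y) (list :x x :y y))

(let ((args (list 3 4)))        ; a computed argument list
  (apply #'make-point args))    ; => (:X 3 :Y 4)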
For list and vector processing, use REDUCE instead of APPLY
If we want to sum all numbers in a list, replace
(apply #'+ list) ; don't use this
with
(reduce #'+ list) ; can handle arbitrarily long lists
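For instance, the failing case from the question goes through with REDUCE (assuming the same SBCL session and default stack size):

;; (apply #'+ (loop for i from 1 to 253392 collect 1)) exhausts the
;; control stack, but REDUCE walks the list iteratively:
(reduce #'+ (loop for i from 1 to 253392 collect 1)) ; => 253392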
Recursion
The question suggests that apply is a non-optimized recursive function.
I cannot see why the function APPLY should use recursion.
For example, consider
(apply #'+ '(1 2 3 4 5))
The repeated summing of the arguments is done by the function +, not by apply.
This is different from
(reduce #'+ '(1 2 3 4 5))
where the repeated call of the function + with two arguments is done by reduce.
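To make the contrast concrete:

;; APPLY makes a single call with five arguments:
;;   (apply #'+ '(1 2 3 4 5))   ==  (+ 1 2 3 4 5)              ; => 15
;; REDUCE makes four two-argument calls:
;;   (reduce #'+ '(1 2 3 4 5))  ==  (+ (+ (+ (+ 1 2) 3) 4) 5)  ; => 15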
What's causing the stack exhaustion?
Though recursion is often a likely culprit for stack exhaustion, that is not true in this case. According to the SBCL Internals Manual:
In full call, the arguments are passed creating a partial frame on the stack top and storing stack arguments into that frame.
Each generated list element is stored in the new stack frame, quickly exhausting the stack. Presumably, starting SBCL with a larger --control-stack-size would raise this practical limit on the number of arguments that can be passed in a function call.
Why is call-arguments-limit so much larger than the practical limit?
An SBCL mailing list response to someone with a similar problem explains why the practical limit of the stack size isn't reflected in call-arguments-limit:
The condition you're seeing here is not due to a fundamental implementation limitation, but rather because of the particular choice of stack size that someone chose -- if the stack size were bigger, this call of yours would not error. [...] Consider that, given our strategy of passing excess arguments on the stack, that the actual maximum number of arguments passable at any given time depends on the program state, and isn't in fact a constant.
The spec says that call-arguments-limit must be a constant, so SBCL seems to have defined it as most-positive-fixnum.
There are a couple of bug reports discussing the issue, and a TODO in the source suggesting that at least one contributor feels it should be reduced to a less absurd value:
;; TODO: Reducing CALL-ARGUMENTS-LIMIT to something reasonable to
;; allow DWORD ops without it looking like a bug would make sense.
;; With a stack size of about 2MB, the limit is absurd anyway.
SBCL's particular way of implementing call-arguments-limit might have room for improvement and could lead to unexpected behaviour, but it does follow the ANSI spec.
The practical limit varies depending on the space remaining on the stack, so defining call-arguments-limit according to this value would not obey the spec requirement for a constant value.

Checking the reverse of a list is the same as the list unchanged?

I am learning Scheme, and one of the things I have to do is use recursion to figure out whether a list is reflective, i.e. whether the list looks the same when it is reversed. I have to do it primitively, so I can't use the built-in reverse for lists, and I must use recursion, which is obvious. The problem is that in Scheme it's very hard to access or shorten a list using only the basic things we have learned, since lists are like linked lists. I also want to do it without indexing.
With that said I have a few ideas and was wondering If any of these sufficient and do you think I could actually do it better with the basics of scheme.
1. Reverse the list using recursion (my implementation) and compare the original to the reversed list.
2. Compare the first and last elements by recursing down the rest of the list to find the last element and comparing it with the first; keep track of how many times I have recursed, then recurse one time less to reach the second-to-last element and compare it with the second element, and so on. (This is very complicated; I tried it and failed, but I want to see whether you would have done the same.)
3. Shorten the list by trimming off the first and last element each time and compare. I'm not sure if this can be done with the basics of Scheme.
Any suggestions or hints are welcome. I'm very new to Scheme. Thanks for reading; I know it's long.
Checking if a list is a palindrome without reversing it is one of the examples of a technique explained in "There and Back Again" by Danvy and Goldberg. They do it in (ceil (/ (length lst) 2)) recursive calls, but there's a simpler version that does it in (length lst) calls.
Here's the skeleton of the solution:
(define (pal? orig-lst)
  ;; pal* : list -> list/#f
  ;; If lst contains the first N elements of orig-lst in reverse order,
  ;; where N = (length lst), returns the Nth tail of orig-lst.
  ;; Otherwise, returns #f to indicate orig-lst cannot be a palindrome.
  (define (pal* lst)
    ....)
  .... (pal* orig-lst) ....)
This sounds like it might be homework, so I don't want to fill in all of the blanks.
I agree that #1 seems the way to go here. It's simple, and I can't imagine it failing. Maybe I don't have a strong enough imagination. :)
The other options you're considering seem awkward because we're talking about linked lists, which directly support sequential access but not random access. As you note, "indexing" into a linked list is verboten; it ends up marching dutifully down the list structure. Because of this, the other options are certainly doable, but they are expensive.
That expense isn't because we're in Scheme: it's because we're dealing with linked lists. Just to make sure it's clear: Scheme has vectors (arrays) which do support fast random-access. Testing palindrome-ness on a vector is as easy as you'd expect:
#lang racket
;; In Professional-level Racket
(define (palindrome? vec)
  (for/and ([i (in-range 0 (vector-length vec))]
            [j (in-range (sub1 (vector-length vec)) -1 -1)])
    (equal? (vector-ref vec i)
            (vector-ref vec j))))

;; Examples
(palindrome? (vector "b" "a" "n" "a" "n" "a"))
(palindrome? (vector "a" "b" "b" "a"))
So one point of the problem you're tackling, I think, is to show that the data structure you choose (the representation of the problem) can have a strong effect on the solution.
(Aside: #2 is certainly doable, though it goes against the grain of the singly-linked list data structure. Approach #3 requires a radical change to the representation: at first glance, I think you'd need mutable, doubly-linked lists to do the solution any justice, since you need to be able to march backward.)
To answer dyoo's question: you don't have to know that you're at the halfway point; you just have to make a certain comparison. If that comparison succeeds, then (a) the list must read the same reversed and (b) you must be at the midway point.
So that more efficient solution is there, if you can just reach for it...

Why is foldl defined in a strange way in Racket?

In Haskell, like in many other functional languages, the function foldl is defined such that, for example, foldl (-) 0 [1,2,3,4] = -10.
This is OK, because foldl (-) 0 [1,2,3,4] is, by definition, ((((0 - 1) - 2) - 3) - 4).
But, in Racket, (foldl - 0 '(1 2 3 4)) is 2, because Racket "intelligently" calculates like this: (4 - (3 - (2 - (1 - 0)))), which indeed is 2.
Of course, if we define an auxiliary function flip, like this:
(define (flip bin-fn)
  (lambda (x y)
    (bin-fn y x)))
then we can achieve in Racket the same behavior as in Haskell: instead of (foldl - 0 '(1 2 3 4)) we write (foldl (flip -) 0 '(1 2 3 4)).
The question is: why is foldl in Racket defined in such an odd (nonstandard and nonintuitive) way, differently from every other language?
The Haskell definition is not uniform. In Racket, the function passed to both folds takes its inputs in the same order, and therefore you can just replace foldl by foldr and get the same result. If you do that with the Haskell version you'd (usually) get a different result, and you can see this in the different types of the two.
(In fact, I think that in order to do a proper comparison you should avoid these toy numeric examples where both of the type variables are integers.)
This has the nice byproduct that you're encouraged to choose between foldl and foldr according to their semantic differences. My guess is that with Haskell's order you're likely to choose according to the operation. You have a good example of this: you've used foldl because you want to subtract each number, and that's such an "obvious" choice that it's easy to overlook the fact that foldl is usually a bad choice in a lazy language.
Another difference is that the Haskell version is more limited than the Racket version in the usual way: it operates on exactly one input list, whereas Racket's can accept any number of lists. This makes it all the more important to have a uniform argument order for the input function.
Finally, it is wrong to assume that Racket diverged from "many other functional languages", since folding is far from a new trick, and Racket has roots that are far older than Haskell (or these other languages). The question could therefore go the other way: why is Haskell's foldl defined in a strange way? (And no, (-) is not a good excuse.)
Historical update:
Since this seems to bother people again and again, I did a little bit of legwork. This is not definitive in any way, just my second-hand guessing. Feel free to edit this if you know more, or even better, email the relevant people and ask. Specifically, I don't know the dates where these decisions were made, so the following list is in rough order.
First there was Lisp, with no mention of "fold"ing of any kind. Instead, Lisp has reduce, which is very non-uniform, especially if you consider its type. For example, :from-end is a keyword argument that determines whether it's a left or a right scan, and it uses different accumulator functions, which means that the accumulator type depends on that keyword. This is in addition to other hacks: usually the first value is taken from the list (unless you specify an :initial-value), and if you don't specify an :initial-value and the list is empty, it will actually apply the function to zero arguments to get a result.
All of this means that reduce is usually used for what its name suggests: reducing a list of values into a single value, where the two types are usually the same. The conclusion here is that it serves a purpose similar to folding, but it's not nearly as useful as the generic list-iteration construct that you get with folding. I'm guessing this means there's no strong relation between reduce and the later fold operations.
The first relevant language that follows Lisp and has a proper fold is ML. The choice made there, as noted in the Standard ML answer below, was to go with the uniform-types version (i.e., what Racket uses).
The next reference is Bird & Wadler's Introduction to Functional Programming (1988), which uses different types (as in Haskell). However, they note in the appendix that Miranda has the same type (as in Racket).
Miranda later switched the argument order (i.e., moved from the Racket order to the Haskell one). Specifically, that text says:
WARNING - this definition of foldl differs from that in older versions of Miranda. The one here is the same as that in Bird and Wadler (1988). The old definition had the two args of `op' reversed.
Haskell took a lot of stuff from Miranda, including the different types. (But of course I don't know the dates so maybe the Miranda change was due to Haskell.) In any case, it's clear at this point that there was no consensus, hence the reversed question above holds.
OCaml went with the Haskell direction and uses different types.
I'm guessing that "How to Design Programs" (aka HtDP) was written at roughly the same period, and it chose the same type. There is, however, no motivation or explanation; in fact, after the relevant exercise, it's simply mentioned as one of the built-in functions.
Racket's implementation of the fold operations was, of course, the "built-ins" that are mentioned here.
Then came SRFI-1, and the choice was to use the same-type version (as Racket). This decision was questioned by John David Stone, who points at a comment in the SRFI that says:
Note: MIT Scheme and Haskell flip F's arg order for their reduce and fold functions.
Olin later addressed this; all he said was:
Good point, but I want consistency between the two functions.
state-value first: srfi-1, SML
state-value last: Haskell
Note in particular his use of "state-value", which suggests a view where consistent types are possibly a more important point than operator order.
"differently than in any other language"
As a counter-example, Standard ML (ML is a very old and influential functional language) has a foldl that also works this way: http://www.standardml.org/Basis/list.html#SIG:LIST.foldl:VAL
Racket's foldl and foldr (and also SRFI-1's fold and fold-right) have the property that
(foldr cons null lst) = lst
(foldl cons null lst) = (reverse lst)
I speculate the argument order was chosen for that reason.
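These identities are easy to check at a Racket REPL:

> (foldr cons null '(1 2 3))
'(1 2 3)
> (foldl cons null '(1 2 3))
'(3 2 1)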
From the Racket documentation, the description of foldl:
(foldl proc init lst ...+) → any/c
Two points of interest for your question are mentioned:
the input lsts are traversed from left to right
And
foldl processes the lsts in constant space
I'm going to speculate on what the implementation might look like, with a single list for simplicity's sake:
(define (my-foldl proc init lst)
  (define (iter lst acc)
    (if (null? lst)
        acc
        (iter (cdr lst) (proc (car lst) acc))))
  (iter lst init))
As you can see, the requirements of left-to-right traversal and constant space are met (notice the tail recursion in iter), but the order of the arguments for proc was never specified in the description. Hence, the result of calling the above code would be:
(my-foldl - 0 '(1 2 3 4))
> 2
If we had specified the order of the arguments for proc in this way:
(proc acc (car lst))
Then the result would be:
(my-foldl - 0 '(1 2 3 4))
> -10
My point is, the documentation for foldl doesn't specify the order of the arguments passed to proc; it only has to guarantee that constant space is used and that the elements in the list are traversed from left to right.
As a side note, you can get the desired evaluation order for your expression by simply writing this:
(- 0 1 2 3 4)
> -10

Put an element to the tail of a collection

I find myself doing a lot of:
(concat coll [e]) where coll is a collection and e a single element.
Is there a function for doing this in Clojure? I know conj does the job best for vectors, but I don't know up front which kind of coll will be used. It could be a vector, a list or a sorted-set, for example.
Some types of collections can add cheaply to the front (lists, seqs), while others can add cheaply to the back (vectors, queues, kinda-sorta lazy-seqs). Rather than using concat, if possible you should arrange to be working with one of those types (vector is most common) and just conj to it: (conj [1 2 3] 4) yields [1 2 3 4], while (conj '(1 2 3) 4) yields (4 1 2 3).
concat does not add an element to the tail of a collection, nor does it concatenate two collections.
concat returns a seq made of the concatenation of two other seqs. The original types of the collections behind those seqs are lost in concat's return type.
Now, Clojure collections have different properties one must know about in order to write efficient code; that's why there isn't a universal function in core for concatenating collections of any kind.
Lists and vectors, on the other hand, do have "natural insertion positions", which conj knows about, and it does what is right for each kind of collection.
This is a very small addendum to amalloy's answer, addressing the OP's request for a function that always adds to the tail of whatever kind of collection, as an alternative to (concat coll [x]). Just create a vector version of the original collection:
(defn conj*
  [s x]
  (conj (vec s) x))
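For example:

(conj* '(1 2 3) 4) ; => [1 2 3 4]
(conj* [1 2 3] 4)  ; => [1 2 3 4]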
Caveats:
If you started with a lazy sequence, you've now destroyed the laziness, i.e. the output is not lazy. This may be either a good thing or a bad thing, depending on your needs.
There's some cost to creating the vector. If you need to call this function a lot, and you find (e.g. by benchmarking with Criterium) that this cost is significant for your purposes, then follow the other answers' advice and try to use vectors in the first place.
To distill the best of what amalloy and Laurent Petit have already said: use the conj function.
One of the great abstractions that Clojure provides is the Sequence API, which includes the conj function. If at all possible, your code should be as collection-type agnostic as it can be, instead using the seq API to handle operations on collections and picking a particular collection type only when you need to be specific.
If vectors are a good match, then yes, conj will add items onto the end. If you use lists instead, then conj will add things to the front of your collection. But if you then use the standard seq API functions for pulling items from the "top" of a collection (the back of a vector, the front of a list), it doesn't matter which implementation you use: each operation will work at the position with the best performance, so adding and removing items will behave consistently.
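For instance, peek and pop work at each collection's natural end:

(peek [1 2 3])  ; => 3  (vectors: the back)
(peek '(1 2 3)) ; => 1  (lists: the front)
(pop [1 2 3])   ; => [1 2]
(pop '(1 2 3))  ; => (2 3)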
If you are working with lazy sequences, you can also use lazy-cat:
(take 5 (lazy-cat (range) [1])) ; (0 1 2 3 4)
Or you could make it a utility method:
(defn append [coll & items] (lazy-cat coll items))
Then use it like this:
(take 5 (append (range) 1)) ; (0 1 2 3 4)

Practical use of fold/reduce in functional languages

Fold (aka reduce) is considered a very important higher-order function. Map can be expressed in terms of fold (see here). But it sounds more academic than practical to me. A typical use would be to get the sum, or product, or maximum of numbers, but these functions usually accept any number of arguments. So why write (fold + 0 '(2 3 5)) when (+ 2 3 5) works fine? My question is: in what situations is it easiest or most natural to use fold?
The point of fold is that it's more abstract. It's not that you can do things that you couldn't before, it's that you can do them more easily.
Using a fold, you can generalize any function that is defined on two elements so that it applies to an arbitrary number of elements. This is a win because it's usually much easier to write, test, maintain and modify a single function that takes two arguments than one that walks over a whole list. And it's always easier to write, test, maintain, etc. one simple function instead of two with similar-but-not-quite functionality.
Since fold (and, for that matter, map, filter, and friends) has well-defined behaviour, it's often much easier to understand code using these functions than explicit recursion.
Basically, once you have the one version, you get the other "for free". Ultimately, you end up doing less work to get the same result.
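As a toy illustration (in Racket; longer is my own example, not a standard function):

;; A function defined on two elements...
(define (longer a b)
  (if (> (string-length a) (string-length b)) a b))

;; ...generalized to any number of elements by folding:
(foldl longer "" '("ab" "abcd" "abc")) ; => "abcd"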
Here are a few simple examples where reduce works really well.
Find the sum of the maximum values of each sub-list
Clojure:
user=> (def x '((1 2 3) (4 5) (0 9 1)))
#'user/x
user=> (reduce #(+ %1 (apply max %2)) 0 x)
17
Racket:
> (define x '((1 2 3) (4 5) (0 9 1)))
> (foldl (lambda (a b) (+ b (apply max a))) 0 x)
17
Construct a map from a list
Clojure:
user=> (def y '(("dog" "bark") ("cat" "meow") ("pig" "oink")))
#'user/y
user=> (def z (reduce #(assoc %1 (first %2) (second %2)) {} y))
#'user/z
user=> (z "pig")
"oink"
For a more complicated Clojure example featuring reduce, check out my solution to Project Euler problems 18 & 67.
See also: reduce vs. apply
In Common Lisp, functions do not accept an unlimited number of arguments.
There is a constant defined in every Common Lisp implementation, CALL-ARGUMENTS-LIMIT, which must be 50 or larger.
This means that portable code can only count on being able to pass 50 arguments to a function; the limit could be exactly 50.
This limit exists to allow compilers to use optimized calling schemes, without having to provide for the general case where an unlimited number of arguments can be passed.
Thus, to portably process large lists or vectors (more than 50 elements) in Common Lisp code, it is necessary to use iteration constructs, reduce, map, and similar. In particular, don't use (apply '+ large-list); use (reduce '+ large-list) instead.
Code using fold is usually awkward to read. That's why people prefer map, filter, exists, sum, and so on, when available. These days I'm primarily writing compilers and interpreters; here are some of the ways I use fold:
Compute the set of free variables for a function, expression, or type
Add a function's parameters to the symbol table, e.g., for type checking
Accumulate the collection of all sensible error messages generated from a sequence of definitions
Add all the predefined classes to a Smalltalk interpreter at boot time
What all these uses have in common is that they're accumulating information about a sequence into some kind of set or dictionary. Eminently practical.
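As a sketch of the first item (in Racket, with hypothetical names; free-vars for a single expression is assumed to exist):

;; Accumulate the free variables of a list of expressions into one set.
(define (free-vars-of-all exprs)
  (foldl (lambda (e acc) (set-union acc (free-vars e)))
         (set)
         exprs))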
Your example (+ 2 3 5) only works because you know the number of arguments beforehand. Folds work on lists whose size can vary.
fold/reduce is the general version of the "cdr-ing down a list" pattern. Each algorithm that processes every element of a sequence in order, computing some return value from them, can be expressed with it. It's basically the functional version of the foreach loop.
Here's an example that nobody else mentioned yet.
By using a function with a small, well-defined interface like "fold", you can replace that implementation without breaking the programs that use it. You could, for example, make a distributed version that runs on thousands of PCs, so a sorting algorithm that used it would become a distributed sort, and so on. Your programs become more robust, simpler, and faster.
Your example is a trivial one: + already takes any number of arguments, runs quickly in little memory, and has already been written and debugged by whoever wrote your compiler. Those properties are not often true of algorithms I need to run.
