Why exactly is filtering using a set more performant than filtering using a vector? - vector

After some researching, I was recently able to dramatically improve the performance of some code by using a set to compare rather than a vector. Here is a simple example of the initial code:
(def target-ids ["a" "b" "c"])
(def maps-to-search-through
[{"id": "a" "value": "example"}
{"id": "e" "value": "example-2"}])
(filter (fn [i] (some #(= (:id i) %) target-ids)) maps-to-search-through)
And here is the optimised code:
(def target-ids #{"a" "b" "c"})
(def maps-to-search-through
[{"id": "a" "value": "example"}
{"id": "e" "value": "example-2"}])
(filter (comp target-ids :id) maps-to-search-through)
For reference, target-ids and maps-to-search-through are both generated dynamically, and can contain thousands of values each -- although maps-to-search-through will always be at least 5x larger than target-ids.
All advice and documentation I found online suggested this improvement, specifically using a set instead of a vector, would be significantly faster, but didn't elaborate on why that is. I understand that in the initial case, filter is doing a lot of work - iterating through both vectors on every step. But I don't understand how that isn't the case in the improved code.
Can anyone help explain?

Sets are data structures that are designed to only contain unique values. Also you can use them as functions to check whether a given value is a member of the very set - just as you use your target-ids set. It basically boils down to a call of Set.contains on JVM side which uses some clever hash-based logic.
Your first solution loops through the vector using some, so it's similar to a nested for loop which is obviously slower.

Related

Delete an entry from a collection in Clojure

I'm new to Clojure and I'm wondering how I remove an element from a collection.
Say I have:
(def example ["a" "b" "c"])
I want to be able to remove say "b" and when I call
(println example)
and have it return a collection with only "a" and "c"
I know using (remove (partial = "b") example)
will return what I want but then how do i update the example variable with this?
Thanks!
(filter (fn [x] (not (= x "b"))) example)
Will get you '("a" "c"). Couple of points:
You shouldn't be thinking in terms of mutation. The whole point of using functional programming in general and clojure with it's persistent data structures in particular is to avoid the problems associated with mutability.
If you do really, really need something to be mutable you can use atoms, but if you're not sure you need it to be mutable, you don't.
First, check if you really need mutation in the first place. Clojure is designed around working with immutable data - there's a chance that what you ultimately want to do can be achieved without changing values in place.
If you need mutable data, you can create an atom for that - changing a car's value is generally a bad practice.
(def example (atom ["a" "b" "c"]))
(println #example) ;; ["a" "b" "c"]
(swap! example #(remove (partial = "b") %))
(println #example) ;; ["a" "c"]
Clojure's default data structures are immutable. Hence you cannot change a vector in place, but you can create a new one with the desired elements.
In a function context, this is how you could use remove:
(defn my-func [col]
(let [without-b (remove #(= "b" %) col)]
(println without-b)
; do something else w/ without-b
))
...
=> (my-func ["a" "b" "c"])
(a c)
This is the idiomatic Clojure way to work with collections, i.e., you create a new collection from an old one. This doesn't have "significant" memory or performance implications, as the data structures are implemented on a tree-based data structure, Tries, you can learn more about this here:
https://hypirion.com/musings/understanding-persistent-vector-pt-1

idiomatic way to use clojure to get set of unique words in vector of strings

I'm a complete newbie to clojure, so please forgive the stupidity below... but I'm trying to split a vector of strings on spaces, and then get all the unique strings from the whole resulting vector of vectors in a single sequence (where I'm not picky about the type of sequence). Here's the code I tried.
(require '[clojure.string :as str])
(require '[clojure.set :as set])
(def documents ["this is a cat" "this is a dog" "woof and a meow"])
(apply set/union (map #(str/split % #" ") documents))
I would have expected this to return a set of unique words, i.e.,
#{"woof" "and" "a" "meow" "this" "is" "cat" "dog"}
but instead it returns a vector of non-unique words, i.e.,
["woof" "and" "a" "meow" "this" "is" "a" "cat" "this" "is" "a" "dog"]
Ultimately, I just wrapped that in a set call, i.e.,
(set (apply set/union (map #(str/split % #" ") documents)))
and got what I wanted:
#{"dog" "this" "is" "a" "woof" "and" "meow" "cat"}
but I don't quite understand why that should be the case. According to the docs the union function returns a set. So why'd I get a vector?
Second question: an alternative approach is just
(distinct (apply concat (map #(str/split % #" ") documents)))
which also returns what I want, albeit in list form rather than in set form. But some of the discussion on this prior SO suggests that concat is unusually slow, perhaps slower than set operations (?).
Is that right... and is there any other reason to prefer one to the other approach (or some third approach)?
I don't really care whether I get a vector or a set coming out the other end, but will ultimately care about performance considerations. I'm trying to learn Clojure by actually producing something that will be useful for my text-mining habit, and so ultimately this bit of code will be part of workflow to handle large amounts of text data efficiently... the time for getting it right, performance-wise and just general not-being-stupid-wise, is now.
Thanks!
clojure.set/union operates on sets but you gave it sequences instead (the result of str/split is a sequence of strings).
(set (mapcat #(str/split % #" ") documents)) should give you what you need.
mapcat will do a lazy "map and concatenate" operation. set will convert that sequence into set, discarding duplicates as it goes.

What's the difference between a sequence and a collection in Clojure

I am a Java programmer and am new to Clojure. From different places, I saw sequence and collection are used in different cases. However, I have no idea what the exact difference is between them.
For some examples:
1) In Clojure's documentation for Sequence:
The Seq interface
(first coll)
Returns the first item in the collection.
Calls seq on its argument. If coll is nil, returns nil.
(rest coll)
Returns a sequence of the items after the first. Calls seq on its argument.
If there are no more items, returns a logical sequence for which seq returns nil.
(cons item seq)
Returns a new seq where item is the first element and seq is the rest.
As you can see, when describing the Seq interface, the first two functions (first/rest) use coll which seems to indicate this is a collection while the cons function use seq which seems to indicate this is a sequence.
2) There are functions called coll? and seq? that can be used to test if a value is a collection or a sequence. It is clearly collection and sequence are different.
3) In Clojure's documentation about 'Collections', it is said:
Because collections support the seq function, all of the sequence
functions can be used with any collection
Does this mean all collections are sequences?
(coll? [1 2 3]) ; => true
(seq? [1 2 3]) ; => false
The code above tells me it is not such case because [1 2 3] is a collection but is not a sequence.
I think this is a pretty basic question for Clojure but I am not able to find a place explaining this clearly what their difference is and which one should I use in different cases. Any comment is appreciated.
Any object supporting the core first and rest functions is a sequence.
Many objects satisfy this interface and every Clojure collection provides at least one kind of seq object for walking through its contents using the seq function.
So:
user> (seq [1 2 3])
(1 2 3)
And you can create a sequence object from a map too
user> (seq {:a 1 :b 2})
([:a 1] [:b 2])
That's why you can use filter, map, for, etc. on maps sets and so on.
So you can treat many collection-like objects as sequences.
That's also why many sequence handling functions such as filter call seq on the input:
(defn filter
"Returns a lazy sequence of the items in coll for which
(pred item) returns true. pred must be free of side-effects."
{:added "1.0"
:static true}
([pred coll]
(lazy-seq
(when-let [s (seq coll)]
If you call (filter pred 5)
Don't know how to create ISeq from: java.lang.Long
RT.java:505 clojure.lang.RT.seqFrom
RT.java:486 clojure.lang.RT.seq
core.clj:133 clojure.core/seq
core.clj:2523 clojure.core/filter[fn]
You see that seq call is the is this object a sequence validation.
Most of this stuff is in Joy of Clojure chapter 5 if you want to go deeper.
Here are few points that will help understand the difference between collection and sequence.
"Collection" and "Sequence" are abstractions, not a property that can be determined from a given value.
Collections are bags of values.
Sequence is a data structure (subset of collection) that is expected to be accessed in a sequential (linear) manner.
The figure below best describes the relation between them:
You can read more about it here.
Every sequence is a collection, but not every collection is a sequence.
The seq function makes it possible to convert a collection into a sequence. E.g. for a map you get a list of its entries. That list of entries is different from the map itself, though.
In Clojure for the brave and true the author sums it up in a really understandable way:
The collection abstraction is closely related to the sequence
abstraction. All of Clojure's core data structures — vectors, maps,
lists and sets — take part in both abstractions.
The abstractions differ in that the sequence abstraction is "about"
operating on members individually while the collection abstraction is
"about" the data structure as a whole. For example, the collection
functions count, empty?, and every? aren't about any individual
element; they're about the whole.
I have just been through Chapter 5 - "Collection Types" of "The Joy of Clojure", which is a bit confusing (i.e. the next version of that book needs a review). In Chapter 5, on page 86, there is a table which I am not fully happy with:
So here's my take (fully updated after coming back to this after a month of reflection).
collection
It's a "thing", a collection of other things.
This is based on the function coll?.
The function coll? can be used to test for this.
Conversely, anything for which coll? returns true is a collection.
The coll? docstring says:
Returns true if x implements IPersistentCollection
Things that are collections as grouped into three separate classes. Things in different classes are never equal.
Maps Test using (map? foo)
Map (two actual implementations with slightly differing behaviours)
Sorted map. Note: (sequential? (sorted-map :a 1) ;=> false
Sets Test using (set? foo)
Set
Sorted set. Note: (sequential? (sorted-set :a :b)) ;=> false
Sequential collections Test using (sequential? foo)
List
Vector
Queue
Seq: (sequential? (seq [1 2 3])) ;=> true
Lazy-Seq: (sequential? (lazy-seq (seq [1 2 3]))) ;=> true
The Java interop stuff is outside of this:
(coll? (to-array [1 2 3])) ;=> false
(map? (doto (new java.util.HashMap) (.put "a" 1) (.put "b" 2))) ;=> false
sequential collection (a "chain")
It's a "thing", a collection holding other things according to a specific, stable ordering.
This is based on the function sequential?.
The function sequential? can be used to test for this.
Conversely, anything for which sequential? returns true is a sequential collection.
The sequential? docstring says:
Returns true if coll implements Sequential
Note: "sequential" is an adjective! In "The Joy of Clojure", the adjective is used as a noun and this is really, really, really confusing:
"Clojure classifies each collection data type into one of three
logical categories or partitions: sequentials, maps, and sets."
Instead of "sequential" one should use a "sequential thing" or a "sequential collection" (as used above). On the other hand, in mathematics the following words already exist: "chain", "totally ordered set", "simply ordered set", "linearly ordered set". "chain" sounds excellent but no-one uses that word. Shame!
"Joy of Clojure" also has this to say:
Beware type-based predicates!
Clojure includes a few predicates with names like the words just
defined. Although they’re not frequently used, it seems worth
mentioning that they may not mean exactly what the definitions here
might suggest. For example, every object for which sequential? returns
true is a sequential collection, but it returns false for some that
are also sequential [better: "that can be considered sequential
collections"]. This is because of implementation details that may be
improved in a future version of Clojure [and maybe this has already been
done?]
sequence (also "sequence abstraction")
This is more a concept than a thing: a series of values (thus ordered) which may or may not exist yet (i.e. a stream). If you say that a thing is a sequence, is that thing also necessarily a Clojure collection, even a sequential collection? I suppose so.
That sequential collection may have been completely computed and be completely available. Or it may be a "machine" to generate values on need (by computation - likely in a "pure" fashion - or by querying external "impure", "oracular" sources: keyboard, databases)
seq
This is a thing: something that can be processed by the functions
first, rest, next, cons (and possibly others?), i.e. something that obeys the protocol clojure.lang.ISeq (which is about the same concept as "providing an implementation for an interface" in Java), i.e. the system has registered function implementations for a pair (thing, function-name) [I sure hope I get this right...]
This is based on the function seq?.
The function seq? can be used to test for this
Conversely, a seq is anything for which seq? returns true.
Docstring for seq?:
Return true if x implements ISeq
Docstring for first:
Returns the first item in the collection. Calls seq on its argument.
If coll is nil, returns nil.
Docstring for rest:
Returns a possibly empty seq of the items after the first. Calls seq
on its argument.
Docstring for next:
Returns a seq of the items after the first. Calls seq on its argument.
If there are no more items, returns nil.
You call next on the seq to generate the next element and a new seq. Repeat until nil is obtained.
Joy of Clojure calls this a "simple API for navigating collections" and says "a seq is any object that implements the seq API" - which is correct if "the API" is the ensemble of the "thing" (of a certain type) and the functions which work on that thing. It depends on suitable shift in the concept of API.
A note on the special case of the empty seq:
(def empty-seq (rest (seq [:x])))
(type? empty-seq) ;=> clojure.lang.PersistentList$EmptyList
(nil? empty-seq) ;=> false ... empty seq is not nil
(some? empty-seq) ;=> true ("true if x is not nil, false otherwise.")
(first empty-seq) ;=> nil ... first of empty seq is nil ("does not exist"); beware confusing this with a nil in a nonempty list!
(next empty-seq) ;=> nil ... "next" of empty seq is nil
(rest empty-seq) ;=> () ... "rest" of empty seq is the empty seq
(type (rest empty-seq)) ;=> clojure.lang.PersistentList$EmptyList
(seq? (rest empty-seq)) ;=> true
(= (rest empty-seq) empty-seq) ;=> true
(count empty-seq) ;=> 0
(empty? empty-seq) ;=> true
Addenda
The function seq
If you apply the function seq to a thing for which that makes sense (generally a sequential collection), you get a seq representing/generating the members of that collection.
The docstring says:
Returns a seq on the collection. If the collection is empty, returns
nil. (seq nil) returns nil. seq also works on Strings, native Java
arrays (of reference types) and any objects that implement Iterable.
Note that seqs cache values, thus seq should not be used on any
Iterable whose iterator repeatedly returns the same mutable object.
After applying seq, you may get objects of various actual classes:
clojure.lang.Cons - try (class (seq (map #(* % 2) '( 1 2 3))))
clojure.lang.PersistentList
clojure.lang.APersistentMap$KeySeq
clojure.lang.PersistentList$EmptyList
clojure.lang.PersistentHashMap$NodeSeq
clojure.lang.PersistentQueue$Seq
clojure.lang.PersistentVector$ChunkedSeq
If you apply seq to a sequence, the actual class of the thing returned may be different from the actual class of the thing passed in. It will still be a sequence.
What the "elements" in the sequence are depends. For example, for maps, they are key-value pairs which look like 2-element vector (but their actual class is not really a vector).
The function lazy-seq
Creates a thing to generate more things lazily (a suspended machine, a suspended stream, a thunk)
The docstring says:
Takes a body of expressions that returns an ISeq or nil, and yields a
Seqable object that will invoke the body only the first time seq is
called, and will cache the result and return it on all subsequent seq
calls. See also - realized?"
A note on "functions" and "things" ... and "objects"
In the Clojure Universe, I like to talk about "functions" and "things", but not about "objects", which is a term heavily laden with Java-ness and other badness. Mention of objects feels like shards poking up from the underlying Java universe.
What is the difference between function and thing?
It's fluid! Some stuff is pure function, some stuff is pure thing, some is in between (can be used as function and has attributes of a thing)
In particular, Clojure allows contexts where one considers keywords (things) as functions (to look up values in maps) or where one interpretes maps (things) as functions, or shorthand for functions (which take a key and return the value associated to that key in the map)
Evidently, functions are things as they are "first-class citizens".
It's also contextual! In some contexts, a function becomes a thing, or a thing becomes a function.
There are nasty mentions of objects ... these are shards poking up from the underlying Java universe.
For presentation purposes, a diagram of Collections
For seq?:
Return true if x implements ISeq
For coll?:
Returns true if x implements IPersistentCollection
And I found ISeq interface extends from IPersistentCollection in Clojure source code, so as Rörd said, every sequences is a collection.

Checking the reverse of a list is the same as the list unchanged?

I am learning scheme and one of the things I have to do is recursion to figure out if the list is reflective, i.e. the list looks the same when it is reversed. I have to do it primitively so I can't use the reverse method for lists. I must also use recursion which is obvious. The problem is in scheme it's very hard to access the list or shorten the list using the very basic stuff that we have learned, since these are kinda of like linkedlists. I also want to do without using indexing.
With that said I have a few ideas and was wondering If any of these sufficient and do you think I could actually do it better with the basics of scheme.
Reverse the list using recursion (my implementation) and compare the original and this rev. list.
Compare the first and last element by recursing on the rest of the list to find the last element and compare to first. Keeping track of how many times I have recursed and then do it one less for the second last element of the list, to compare to the second element of the list. (This is very complicated to do as I've tried it and wound up failing but I want to see you guys would've done the same)
Shorten the list to trim off the first and last element each time and compare. I'm not sure if this can be done using the basics of scheme.
Your suggestions or hints or anything. I'm very new to scheme. Thanks for reading. I know it's long.
Checking if a list is a palindrome without reversing it is one of the examples of a technique explained in "There and Back Again" by Danvy and Goldberg. They do it in (ceil (/ (length lst) 2)) recursive calls, but there's a simpler version that does it in (length lst) calls.
Here's the skeleton of the solution:
(define (pal? orig-lst)
;; pal* : list -> list/#f
;; If lst contains the first N elements of orig-lst in reverse order,
;; where N = (length lst), returns the Nth tail of orig-lst.
;; Otherwise, returns #f to indicate orig-lst cannot be a palindrome.
(define (pal* lst)
....)
.... (pal* orig-lst) ....)
This sounds like it might be homework, so I don't want to fill in all of the blanks.
I agree that #1 seems the way to go here. It's simple, and I can't imagine it failing. Maybe I don't have a strong enough imagination. :)
The other options you're considering seem awkward because we're talking about linked lists, which directly support sequential access but not random access. As you note, "indexing" into a linked list is verboten. It ends up marching dutifully down the list structure. Because of this, the other options are certainly doable, but they are expensive.
That expense isn't because we're in Scheme: it's because we're dealing with linked lists. Just to make sure it's clear: Scheme has vectors (arrays) which do support fast random-access. Testing palindrome-ness on a vector is as easy as you'd expect:
#lang racket
;; In Professional-level Racket
(define (palindrome? vec)
(for/and ([i (in-range 0 (vector-length vec))]
[j (in-range (sub1 (vector-length vec)) -1 -1)])
(equal? (vector-ref vec i)
(vector-ref vec j))))
;; Examples
(palindrome? (vector "b" "a" "n" "a" "n" "a"))
(palindrome? (vector "a" "b" "b" "a"))
so one point of the problem you're tackling, I think, is to show that the data structure you choose---the representation of the problem---can have a strong effect on the problem solution.
(Aside: #2 is certainly doable, though it goes against the grain of the single-linked list data structure. The approach to #3 requires a radical change to the representation: from a first glance, I think you'd need mutable, double-linked lists to do the solution any justice, since you need to be able march backward.)
To answer dyoo's question, you don't have to know that you're at the halfway point; you just have to make a certain comparison. If that comparison works, then a) the string must be reversible and b) you must be at the midway point.
So that more efficient solution is there, if you can just reach for it...

Mapping over sequence with a constant

If I need to provide a constant value to a function which I am mapping to the items of a sequence, is there a better way than what I'm doing at present:
(map my-function my-sequence (cycle [my-constant-value]))
where my-constant-value is a constant in the sense that it's going to be the same for the mappings over my-sequence, although it may be itself a result of some function further out. I get the feeling that later I'll look at what I'm asking here and think it's a silly question because if I structured my code differently it wouldn't be a problem, but well there it is!
In your case I would use an anonymous function:
(map #(my-function % my-constant-value) my-sequence)
Using a partially applied function is another option, but it doesn't make much sense in this particular scenario:
(map (partial my-function my-constant-value) my-sequence)
You would (maybe?) need to redefine my-function to take the constant value as the first argument, and you don't have any need to accept a variable number of arguments so using partial doesn't buy you anything.
I'd tend to use partial or an anonymous function as dbyrne suggests, but another tool to be aware of is repeat, which returns an infinite sequence of whatever value you want:
(map + (range 4) (repeat 10))
=> (10 11 12 13)
Yet another way that I find sometimes more readable than map is the for list comprehension macro:
(for [x my-sequence]
(my-function x my-constant-value))
yep :) a little gem from the "other useful functions" section of the api constantly
(map my-function my-sequence (constantly my-constant-value))
the pattern of (map compines-data something-new a-constant) is rather common in idomatic clojure. its relativly fast also with chunked sequences and such.
EDIT: this answer is wrong, but constantly and the rest of the "other useful functions" api are so cool i would like to leave the reference here to them anyway.

Resources