Does OCaml garbage collect intermediate results? - functional-programming

Suppose I have a function:
let rec f x =
  let a = (* build a huge list, possibly calling f on a smaller input *) in
  let b = (* build a huge list, possibly calling f on a smaller input too *) in
  List.append a b
My question is: after f returns, are a and b freed? My program is eating up memory and I suspect this might be the cause.

After f returns its value, the variables a and b can't be accessed. So the value of a will be garbage collected eventually. For the most obvious implementation of List.append, the value of b forms part of the result of the function. So it won't be garbage collected until this result is no longer accessible.
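For illustration, here is a sketch of the textbook append (the real List.append may differ in details, but it shares the second list in the same way): the spine of the first list is copied, so the original a becomes garbage once f returns, while b is shared as the tail of the result and stays live as long as the result does.
let rec append l1 l2 =
  match l1 with
  | [] -> l2                           (* the tail of the result is l2 itself *)
  | x :: rest -> x :: append rest l2   (* the cells of l1 are copied *)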
If you have a memory leak you need to look in places that are accessible for the whole computation. One example would be a large global table.

Yes, it will be freed. OCaml's garbage collector doesn't normally leak: if a value is no longer reachable it will be collected, and its memory will be reused whenever possible.
Note that if you're working with really big data structures, a list may not be the best choice; in particular, concatenating lists is quite inefficient, since the first list has to be copied.

Related

Why don't Julia closures copy arrays?

Just discovered a nasty bug in my program based on the fact that Julia does not copy arrays when defining a closure. This makes continuation programming hard. What was the motivation for this design choice?
Any suggestions for decoupling the state of my closure from the program state?
As an example
l = [2 1; 0 0];
f = x -> l[2,2];
Then f(1) = 0 but if you change l[2,2] = 1, then f(1) = 1.
Your assumption that this is a "closure" does not hold. l is not a "closed" variable in the context of the anonymous function at that point. It is simply a reference to a variable inherited from 'external' scope (since it has not been redefined locally inside the anonymous function).
Here's an example of a true closure:
f = let l = [2 1; 0 0]
    x -> l[2,2]
end
The variable l is now local to the let block and not present at global scope. f still has access to it, even though it has technically gone out of scope. This is what a closure means.
Since l has gone out of scope, it is no longer accessible except through f, which is a closure with access to it as a closed variable.
PS. I'm going to go out on a limb here and assume that what you're expecting is MATLAB-like behaviour. The big difference with MATLAB is that when you define an anonymous function handle there, it captures the current state of the workspace by copying all the variables it uses and making them part of the function 'object'. You can confirm this by using the functions command. MATLAB doesn't have references in the same way as Julia. This is a strength of Julia, not a weakness, as it allows the user to make use of optimizations that avoid reallocation of memory, which are harder to achieve in MATLAB*.
* though in fairness, MATLAB shines in other ways, by attempting to optimise this for you
EDIT: Liso pointed out a very important pitfall in the comments. Assume l already exists in the global workspace, and we type
let l=l
While this is perfectly valid syntax, and it does make l a local variable of the let block, that local l is still initialised simply as a reference to the same array as the global l. Therefore any in-place modification of the global l's array will still be visible through the closure, which is not what you want. In this case, you should mimic the MATLAB behaviour by making a copy (or a deep copy, depending on your use case), so that the local variable is truly independent of anything else once it goes out of scope and becomes 'closed', i.e.
let l = deepcopy(l)
Also, for completeness, when one makes a closure in Julia, it is worth pointing out how this is implemented under the hood: the resulting f is simply a callable object, containing a field for each 'closed' variable it needs to be aware of; you could even access this as f.l.
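Putting the pieces together, here is a minimal sketch of the decoupled closure (assuming l is an ordinary numeric array, so a plain copy would also do):
l = [2 1; 0 0]

f = let l = deepcopy(l)   # capture an independent copy, not a reference to the global array
    x -> l[2, 2]
end

f(1)          # 0
l[2, 2] = 1   # mutate the global l
f(1)          # still 0: the closure holds its own copy
f.l           # the captured array, stored as a field of the callable object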

How can I retrieve an object by id in Julia

In Julia, say I have an object_id for a variable but have forgotten its name, how can I retrieve the object using the id?
I.e. I want the inverse of some_id = object_id(some_object).
As @DanGetz says in the comments, object_id is a hash function and is designed not to be invertible. @phg is also correct that ObjectIdDict is intended precisely for this purpose (it is documented, although not discussed much in the manual):
ObjectIdDict([itr])
ObjectIdDict() constructs a hash table where the keys are (always)
object identities. Unlike Dict it is not parameterized on its key and
value type and thus its eltype is always Pair{Any,Any}.
See Dict for further help.
In other words, it hashes objects by === using object_id as a hash function. If you have an ObjectIdDict and you use the objects you encounter as the keys into it, then you can keep them around and recover those objects later by taking them out of the ObjectIdDict.
However, it sounds like you want to do this without the explicit ObjectIdDict just by asking which object ever created has a given object_id. If so, consider this thought experiment: if every object were always recoverable from its object_id, then the system could never discard any object, since it would always be possible for a program to ask for that object by ID. So you would never be able to collect any garbage, and the memory usage of every program would rapidly expand to use all of your RAM and disk space. This is equivalent to having a single global ObjectIdDict which you put every object ever created into. So inverting the object_id function that way would require never deallocating any objects, which means you'd need unbounded memory.
Even if we had infinite memory, there are deeper problems. What does it mean for an object to exist? In the presence of an optimizing compiler, this question doesn't have a clear-cut answer. It is often the case that an object appears, from the programmer's perspective, to be created and operated on, but in reality – i.e. from the hardware's perspective – it is never created. Consider this function which constructs a complex number and then uses it for a simple computation:
julia> function foo(y::Real)
           z = Complex(0,y)
           w = 2z*im
           return real(w)
       end
foo (generic function with 1 method)
julia> foo(123)
-246
From the programmer's perspective, this constructs the complex number z and then constructs 2z, then 2z*im, and finally constructs real(2z*im) and returns that value. So all of those values should be inserted into the "Great ObjectIdDict in the Sky". But are they really constructed? Here's the LLVM code for this function applied to an Int:
julia> @code_llvm foo(123)
define i64 @julia_foo_60833(i64) #0 !dbg !5 {
top:
  %1 = shl i64 %0, 1
  %2 = sub i64 0, %1
  ret i64 %2
}
No Complex values are constructed at all! Instead, all of the work is inlined and optimized away rather than actually being done. The whole computation boils down to just doubling the argument (by shifting it left one bit) and negating it (by subtracting it from zero). This optimization can be done first and foremost because the intermediate steps have no observable side effects. The compiler knows that there's no way to tell the difference between actually constructing complex values and operating on them and just doing a couple of integer ops – as long as the end result is always the same. Implicit in the idea of a "Great ObjectIdDict in the Sky" is the assumption that all objects that seem to be constructed actually are constructed and inserted into a large, permanent data structure – which is a massive side effect. So not only is recovering objects from their IDs incompatible with garbage collection, it's also incompatible with almost every conceivable program optimization.
The only other way one could conceive of inverting object_id would be to compute its inverse image on demand instead of saving objects as they are created. That would solve both the memory and optimization problems. Of course, it isn't possible since there are infinitely many possible objects but only a finite number of object IDs. You are vanishingly unlikely to actually encounter two objects with the same ID in a program, but the finiteness of the ID space means that inverting the hash function is impossible in principle since the preimage of each ID value contains an infinite number of potential objects.
I've probably refuted the possibility of an inverse object_id function far more thoroughly than necessary, but it led to some interesting thought experiments, and I hope it's been helpful – or at least thought provoking. The practical answer is that there is no way to get around explicitly stashing every object you might want to get back later in an ObjectIdDict.
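In code, that explicit stashing might look like the following sketch (registry and remember are just illustrative names, not part of any API; keeping every object in the dict is exactly what keeps it from being collected):
const registry = Dict{UInt, Any}()

remember(obj) = (registry[object_id(obj)] = obj; obj)

v = remember([1, 2, 3])
id = object_id(v)
# ... later, with only the id in hand ...
registry[id] === v   # true, because the registry kept v alive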

Calculating Big-O time and space complexity for functional languages

I'm thinking of using OCaml for technical interviews in the future. However, I'm not sure how to calculate time and space complexity for functional languages. What are the runtimes of basic higher-order functions like map, reduce, and filter, and how do I calculate runtime and space complexity in general?
The time complexity of persistent recursive implementations is easy to infer directly from the implementation. In this case, the recursive definition maps directly to the recurrence relation. Consider the List.map function as it is implemented in the Standard Library:
let rec map f = function
  | [] -> []
  | a :: l -> f a :: map f l
The recurrence is map(N) = 1 + map(N-1), thus the time complexity is O(N).
Space complexity is not always as obvious, as it requires an understanding of tail calls and an eye for allocations. The general rule in OCaml is that native integers, characters, and constructors without arguments do not allocate heap memory; everything else is allocated in the heap and boxed. All non-tail calls create a stack frame and thus consume stack space. In our case, the stack complexity of map is O(N), as it makes N non-tail calls. The heap complexity is also O(N), as the :: constructor is applied N times.
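For contrast, here is a sketch of a tail-recursive map (not the Standard Library version): the accumulator makes every recursive call a tail call, so the stack usage drops to O(1), at the price of an extra List.rev pass, while the heap complexity stays O(N).
let map_tail f xs =
  let rec go acc = function
    | [] -> List.rev acc
    | x :: rest -> go (f x :: acc) rest
  in
  go [] xs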
Another place where space is consumed is closures. If a function has at least one free variable (i.e., a variable that is not bound to a function parameter and is not in the global scope), then a functional object called a closure is created, containing a pointer to the code and a pointer to each free variable (also called a captured variable).
For example, consider the following function:
let rec rsum = function
  | [] -> 0
  | x :: xs ->
    (* the anonymous function captures the free variable x, so a closure is allocated at each step *)
    List.fold_left (fun acc y -> acc + x + y) 0 xs + rsum xs
For each element of the list, this function computes the sum of that element with each of the consecutive elements, and adds it to the recursive result. The naive implementation above is O(N) in the stack (each step makes two non-tail calls) and O(N) in the heap, as each step constructs a new closure (unless the compiler is clever enough to optimize it away). Finally, it is O(N^2) in the time domain (rsum(N) = (N-1) + rsum(N-1)).
However, this raises a question: should we take into account the garbage produced by a computation, i.e. values that were allocated during the computation but are no longer referenced by it, or values that are referenced only during a single step, as in this case? It depends on the model of computation you choose. If you assume a reference-counting GC, then the example above is O(1) in the heap size.
Hope this gives some insight. Feel free to ask if something is not clear.

What's the fastest way to cast Vector{T} (T<:A) to Vector{A}?

Suppose A is an abstract type and I have a function f{T<:A}(x::Vector{T}). So x could be of type Vector{A} or Vector{B} where B <: A. In the middle of the function I would like to cast x to Vector{A} so it can be consumed by another function that requires that signature.
What's the best way to do that? At the moment I am doing x = collect(A, x). Is there a way to avoid copying if possible?
If at all possible, I'd just change your second function's definition to be parametric like f. Enforcing this kind of container structure in method signatures is a common performance trap that doesn't gain you any functionality… and just makes the methods much harder to use.
That said, the best way to do this kind of conversion where you don't care if the output aliases the input is with convert(Vector{A}, x). This will be a no-op if x already isa Vector{A}, but otherwise it'll be just like collect. That's as good as it gets.
Here's why: two containers of types Vector{A} and Vector{B} cannot share the same memory if A !== B since it'd be possible to corrupt the data in the Vector{B} by assigning a non-B element to the array through the Vector{A}.
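A small sketch of that behaviour (A and B here are just illustrative stand-ins for your types):
abstract type A end
struct B <: A
    n::Int
end

xs = B[B(1), B(2)]           # a Vector{B}
ys = convert(Vector{A}, xs)  # copies: the element type has to widen, so no aliasing
zs = convert(Vector{A}, ys)  # no-op: ys already isa Vector{A}
zs === ys                    # true
ys === xs                    # false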

Operations on Clojure collections

I am quite new to Clojure, although I am familiar with functional languages, mainly Scala.
I am trying to figure out what is the idiomatic way to operate on collections in Clojure. I am particularly confused by the behaviour of functions such as map.
In Scala, great care is taken to ensure that map always returns a collection of the same type as the original collection, as long as this makes sense:
List(1, 2, 3) map (2 *) == List(2, 4, 6)
Set(1, 2, 3) map (2 *) == Set(2, 4, 6)
Vector(1, 2, 3) map (2 *) == Vector(2, 4, 6)
Instead, in Clojure, as far as I understand, most operations such as map or filter are lazy, even when invoked on eager data structures. This has the weird result of making
(map #(* 2 %) [1 2 3])
a lazy sequence instead of a vector.
While I prefer, in general, lazy operations, I find the above confusing. In fact, vectors guarantee certain performance characteristics that lists do not.
Say I use the result from above and append on its end. If I understand correctly, the result is not evaluated until I try to append on it, then it is evaluated and I get a list instead of a vector; so I have to traverse it to append on the end. Of course I could turn it into a vector afterwards, but this gets messy and can be overlooked.
If I understand correctly, map is polymorphic and it would not be a problem to implement it so that it returns a vector on vectors, a list on lists, a stream on streams (this time with lazy semantics), and so on. I think I am missing something about the basic design of Clojure and its idioms.
What is the reason that basic operations on Clojure data structures do not preserve the structure?
In Clojure many functions are based on the Seq abstraction.
The benefit of this approach is that you don't have to write a function for every different collection type: as long as your collection can be viewed as a sequence (something with a head and possibly a tail), you can use it with all of the seq functions. Functions that take seqs and output seqs are much more composable, and thus more reusable, than functions that are limited to a certain collection type. When writing your own function on seqs you don't need to handle special cases like "if the user gives me a vector, I have to return a vector". Your function will fit into the seq pipeline just as well as any other seq function.
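For example, the same seq functions work unchanged on vectors, lists, sets, and even strings:
(map inc [1 2 3])    ;=> (2 3 4)
(map inc '(1 2 3))   ;=> (2 3 4)
(map inc #{1 2 3})   ;=> e.g. (2 4 3), set order is unspecified
(map int "abc")      ;=> (97 98 99)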
The fact that map returns a lazy seq is a design choice: in Clojure laziness is the default for many of these functional constructions. If you want other behaviour, like parallelism without intermediate collections, take a look at the reducers library: http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html
As far as performance goes, map always has to apply a function n times on a collection, from the first to the last element, so its performance will always be O(n) or worse. In this case vector or list makes no difference. The possible benefit that laziness would give you is when you would only consume the first part of the list. If you must append something to the end of map's output, a vector is indeed more efficient. You can use mapv (added in Clojure 1.4) in this case: it takes in a collection and will output a vector. I would say, only worry about these performance optimizations if you have a very good reason for it. Most of the time it's not worth it.
Read more about the seq abstraction here: http://clojure.org/sequences
Another vector-returning higher-order function that was added in Clojure 1.4 is filterv.
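A short sketch of the vector-returning variants mentioned above, next to the lazy default:
(map #(* 2 %) [1 2 3])     ;=> (2 4 6), a lazy seq
(mapv #(* 2 %) [1 2 3])    ;=> [2 4 6], eager, returns a vector
(filterv even? [1 2 3 4])  ;=> [2 4]
;; or pour any seq back into the collection type you want:
(into [] (map #(* 2 %) '(1 2 3)))  ;=> [2 4 6]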
