functional programming: immutable data structure efficiency - functional-programming

I don't understand, how FP compilers make the code dealing with immutable data structures fast, not blow up stack, etc.
For example, insert operation in tree, it has to copy the whole tree before adding the new node and return the copied tree, versus the imperative couterpart that only needs to add a pointer to the new node. If the insert operation is run millions times, it would take a load of memory, and copying will be slower and slower when the tree is bigger. How do FP compilers actually optimize this ?

You don't have to copy the whole tree to make a change; you can share most of the structure. See e.g. the diagrams in this blog, or this talk by Rich Hickey on Clojure (see discussion of hash tries about halfway through).

The compiler won't really optimize this, it is something you have to program for specifically when coding. Techniques for doing this are explained in the excellent Purely Functional Data Structures (book, thesis).

Take a look at the Zipper data structure.

If the garbage collector is doing its job, the older copies of the data structure will be reclaimed when they're no longer being used.

Related

What is the need for immutable/persistent data structures in erlang

Each Erlang process maintains its own private address space. All communication happens via copying without sharing (except big binaries). If each process is processing one message at a time with no concurrent access over its objects, I don't see why do we need immutable/persistent data structures.
Erlang was initially implemented in Prolog, which doesn't really use mutable data structures either (though some dialects do). So it started off without them. This makes runtime implementation simpler and faster (garbage collection in particular).
So adding mutable data structures would require a lot of effort, could introduce bugs, and Erlang programmers are nearly by definition at least willing to live without them.
Many actually consider their absence to be a positive good: less concern about object identity, no need for defensive copying because you don't know whether some other piece of code is going to modify the data you passed (or might be changed later to modify it), etc.
This absence does mean that Erlang is pretty unusable in some domains (e.g. high performance scientific computing), at least as the main language. But again, this means that nobody in these domains is going to use Erlang in the first place and so there's no particular incentive to make it usable at the cost of making existing users unhappy.
I remember seeing a mailing list post by Joe Armstrong quite a long time ago (which I couldn't find with a quick search now) saying that he initially planned to add mutable variables when he'd need them... except he never quite did, and performance was good enough for everything he was using Erlang for.
It is indeed the case that in Erlang immutability does not solve any "shared state" problems, as immutable data are "process local".
From the functional programming language perspective, however, immutability offers a number of benefits, summarized adequately in this Quora answer:
The simplest definition of functional programming is that it’s a programming
paradigm where you are transforming immutable data with functions.
The definition uses functions in the mathematical sense, where it’s
something that takes an input, and produces an output.
OO + mutability tends to violate that definition because when you want
to change a piece of data it generally will not return the output, it
will likely return void or unit, and that when you call a method on
the object the object itself isn’t input for the function.
As far as what advantages the paradigm has, composability, thread
safety, being able to track what went wrong where better, the ability
to sort of separate the data from the actual computation on it being
done, etc.
how would this work?
factorial(1) -> 1;
factorial(X) ->
X*factorial(X-1).
if you run factorial(4), a single process will be running the same function. Each time the function will have it's own value of X, if the value of X was in the scope of the process and not the function recursive functions wouldn't work. So first we need to understand scope. If you want to say that you don't see why data needs to be immutable within the scope of a single function/block you would have a point, but it would be a headache to think about where data is immutable and where it isn't.

iterative or recursive to implement a binary search tree?

I'm learning Data Structures & Algorithms now.
My lecture notes have an implementation of a binary search tree which is implemented using a recursive method. That is an elegant way, but my question is in real life code, should I implement a binary search tree recursively, will it generate a lot of calling stack if the tree has large height/depth number.
I understand that recursion is a key concept to understand lots of data structure concepts, but would you choose to use recursion in real life code?
A tree is recursive by nature. Each node of a tree represents a subtree, and each child of each note represents a subtree of that subtree, so recursion is the best bet, especially in practice where other people people might have to edit and maintain your code.
Now, IF depth becomes a problem for your call stack, than I'm afraid that there are deeper problems with your data structure (either it's monstrously huge, or it's very unbalanced)
"I understand that recursive is a key concept to understand lots of
data structure, but will you choose to use recursive in real life
code?"
After first learning about recursion I felt the same way. However, having been working in the Software industry for over a year now, I can say that I have used the concept of recursion to solve several problems. There are often times that recursion is cleaner, easier to understand/read, and just downright better. And to emphasize a point in the previous answer, a tree is a recursive data structure. IMO, there is no other way to traverse a BST :)
Many times, the compiler can optimize your code, to avoid creating a new stack frame for each recursive call (look up tail recursion, for example). Of course, it all depends on the algorithm and on your data structure. If the tree is reasonably balanced, I don't think a recursive algorithm should cause any problems.
its true that recursion is intutive and elegent and it produces code that is clear and concise. its also correct that some methods such as quick sort, DFS etc. are really hard to implement iterativelly.
but in practice recursive implementations are almost always going to be slow when compared to iterative counterparts because of all the function calls (To really understand the performance hit I suggest you learn how much book keeping stuff assembler has to do for a single function call).
the optimizations that we talk about are not applicable to every recursive method in general and manny compilers and interpreters dont even support them.
so in summary if you are writing something which is performance critical such as a data strucute then stay away from recursion (or use it if you are sure that your compiler/interpreter got you covered)
PS: CLRS (introduction to algorithms, page 290, last line) suggests that iterative search procedure for a BST is faster compared to recursive one.

What can pointers do that are otherwise impossible to implement?

So I'm having a hard time grasping the idea behind pointers and all that memory allocation.
I'm thinking nowadays with computer as powerful as they are right now why do we have to use pointers at all?
Isn't there always a workaround to do things without the help of pointers?
Pointers are an indirection: instead of working with the data itself, you are working with (something) that points to the data. Depending on the semantics of the language, this allows many things: cheaply switch to another instance of data (by setting the pointer to point to another instance), passing pointers allows access to the original data without having to make (a possibly expensive) copy, etc.
Memory allocation is related to pointers, but separate: you can have pointers without allocating memory. The reason you need pointers for memory allocation is that the actual address the allocated block of memory resides is not known at compile time, so you can only access it via a level of indirection (i.e. pointers) -- the compiler statically allocates space for the pointer that will point to the dynamically allocated memory.
Pointers are incredibly powerful. Just because computers have a faster processing time nowdays, doesn't mean that's any reason to abandon something as essential as pointers. Passing around giant chunks of memory on the stack is inefficient at best, catastrophic at worst. With pointers, you only need to maintain a reference to where the data resides, rather than duplicating huge chunks of memory each time you call a function.
Also, if you're copying all the data every time, how do you modify the original data? Aside from returning the copy of the structure in every call that touches it.
I remember reading somewhere that Dijkstra was assessing a student for a programming course; this student was quite intelligent but s/he wasn't able to solve the problem because there was sort of a mental block.
All the code was sort of ok, but what was needed was simply to use the expression
a[a[i+1]] = j;
and even if being so close to the solution still the goal seemed to be miles away.
Languages "without pointers" already exist... e.g. BASIC. Without explicit pointers, that is. But the indirection idea... the idea that you can have data to mean just where to find other data is central to programming.
The very idea of array is about being able to use computed values to find other values.
Trying to hide this idea is an horrible plan. According to Dijkstra anyone that has been exposed to the BASIC language has already received such a mental mutilation that is impossible to recover as a good programmer (and probably the absence of explicit indirection was one of the problems).
I think he was exaggerating.
Just a bit.

Best Practices for cache locality in Multicore Parallelism in F#

I'm studying multicore parallelism in F#. I have to admit that immutability really helps to write correct parallel implementation. However, it's hard to achieve good speedup and good scalability when the number of cores grows. For example, my experience with Quick Sort algorithm is that many attempts to implement parallel Quick Sort in a purely functional way and using List or Array as the representation are failed. Profiling those implementations shows that the number of cache misses increases significantly compared to those of sequential versions. However, if one implements parallel Quick Sort using mutation inside arrays, a good speedup could be obtained. Therefore, I think mutation might be a good practice for optimizing multicore parallelism.
I believe that cache locality is a big obstacle for multicore parallelism in a functional language. Functional programming involves in creating many short-lived objects; destruction of those objects may destroy coherence property of CPU caches. I have seen many suggestions how to improve cache locality in imperative languages, for example, here and here. But it's not clear to me how they would be done in functional programming, especially with recursive data structures such as trees, etc, which appear quite often.
Are there any techniques to improve cache locality in an impure functional language (specifically F#)? Any advices or code examples are more than welcome.
As far as I can make out, the key to cache locality (multithreaded or otherwise) is
Keep work units in a contiguous block of RAM that will fit into the cache
To this end ;
Avoid objects where possible
Objects are allocated on the heap, and might be sprayed all over the place, depending on heap fragmentation, etc.
You have essentially zero control over the memory placement of objects, to the extent that the GC might move them at any time.
Use arrays. Arrays are interpreted by most compilers as a contiguous block of memory.
Other collection datatypes might distribute things all over the place - linked lists, for example, are composed of pointers.
Use arrays of primitive types. Object types are allocated on the heap, so an array of objects is just an array of pointers to objects that may be distributed all over the heap.
Use arrays of structs, if you can't use primitives. Structs have their fields arranged sequentially in memory, and are treated as primitives by the .NET compilers.
Work out the size of the cache on the machine you'll be executing it on
CPUs have different size L2 caches
It might be prudent to design your code to scale with different cache sizes
Or more simply, write code that will fit inside the lowest common cache size your code will be running on
Work out what needs to sit close to each datum
In practice, you're not going to fit your whole working set into the L2 cache
Examine (or redesign) your algorithms so that the data structures you are using hold data that's needed "next" close to data that was previously needed.
In practice this means that you may end up using data structures that are not theoretically perfect examples of computer science - but that's all right, computers aren't theoretically perfect examples of computer science either.
A good academic paper on the subject is Cache-Efficient String Sorting Using Copying
Allowing mutability within functions in F# is a blessing, but it should only be used when optimizing code. Purely-functional style often yields more intuitive implementation, and hence is preferred.
Here's what a quick search returned: Parallel Quicksort in Haskell. Let's keep the discussion about performance focused on performance. Choose a processor, then bench it with a specific algorithm.
To answer your question without specifics, I'd say that Clojure's approach to implementing STM could be a lesson in general case on how to decouple paths of execution on multicore processors and improve cache locality. But it's only effective when number of reads outweigh number of writes.
I am no parallelism expert, but here is my advice anyway.
I would expect that a locally mutable approach where each core is allocated an area of memory which is both read and written will always beat a pure approach.
Try to formulate your algorithm so that it works sequentially on a contiguous area of memory. This means that if you are working with graphs, it may be worth "flattening" nodes into arrays and replace references by indices before processing. Regardless of cache locality issues, this is always a good optimisation technique in .NET, as it helps keep garbage collection out of the way.
A great approach is to split the work into smaller sections and iterate over each section on each core.
One option I would start with is to look for cache locality improvements on a single core before going parallel, it should be simply a matter of subdividing the work again for each core. For example if you are doing matrix calculations with large matrices then you could split up the calculations into smaller sections.
Heres a great example of that: Cache Locality For Performance
There were some great sections in Tomas Petricek's book Real Work functional programming, check out Chapter 14 Writing Parallel Functional Programs, you might find Parallel processing of a binary tree of particular interest.
To write scalable Apps cache locality is paramount to your application speed. The principles are well explain by Scott Meyers talk. Immutability does not play well with cache locality since you create new objects in memory which forces the CPU to reload the data from the new object again.
As in the talk is noted even on modern CPUs the L1 cache has only 32 KB size which is shared for code and data between all cores. If you go multi threaded you should try to consume as little memory as possible (goodbye immutabilty) to stay in the fastest cache. The L2 cache is about 4-8 MB which is much bigger but still tiny compared to the data you are trying to sort.
If you manage to write an application which consumes as little memory as possible (data cache locality) you can get speedups of 20 or more. But if you manage this for 1 core it might be very well be that scaling to more cores will hurt performance since all cores are competing for the same L2 cache.
To get most out of it the C++ guys use PGA (Profile Guided Optimizations) which allows them to profile their application which is used as input data for the compiler to emit better optimized code for the specific use case.
You can get better to certain extent in a managed code but since so many factors influence your cache locality it is not likely that you will ever see a speedup of 20 in the real world due to total cache locality. This remains the regime of C++ and compilers which use profiling data.
You may get some ideas from these:
Cache-Oblivious http://supertech.csail.mit.edu/cacheObliviousBTree.html Cache-Oblivious Search Trees Project
DSapce#MIT Cache coherence strategies in a many-core processor http://dspace.mit.edu/handle/1721.1/61276
describes the revolutionary idea of cache oblivious algorithms via the elegant and efficient implementation of a matrix multiply in F#.

Big datastructures in functional programming

I'm newbie in Functional Programming.
I have a huge neural network with thousands of neurons and every connection between neurons has its weight. I have to update these weights very often, several thousand times per learning session.
Is FP still applicable here? I mean in fp we can't modify variables and only able to return new variables not changing previous values. Does this mean I have to recreate whole network on every weight update?
Is FP still applicable here?
You can certainly write this in a functional style with decent asymptotic algorithmic efficiency but you are not likely to get with 10× the performance of a decent imperative solution because purely functional programming makes it difficult to use CPU caches effectively.
I mean in fp we can't modify variables and only able to return new variables not changing previous values. Does this mean I have to recreate whole network on every weight update?
No, for two reasons:
Purely functional data structures can be updated efficiently because they decompose large structures (e.g. a hash table) into many small recursively-defined structures (e.g. a balanced binary tree). When you update a single node within an immutable tree, you copy data from every node in the path from the root to the destination but refer back to all other branches by reference safe in the knowledge that they cannot be changed under you because they are immutable. So you only do O(log n) work instead of O(n) work.
Purely functional data structures usually offer functions like map that allow every element to be updated in the same way and avoid rebalancing by copying the structure of the source tree. So the time for n updates is O(n) instead of O(n log n).
So you should be able to achieve similar or even equal asymptotic time complexity but, in absolute terms, you will be using several times as much space and time as an imperative solution. I described these basics in detail in my book Visual F# 2010 for Technical Computing and I wrote the article Artificial Intelligence: Neural Networks (8th May 2010) for the OCaml Journal.
Look into Haskell arrays which include mutable variants in a monad.
You should not need to recreate the entire network every time a weight update occurs. Presumably, your neurons are modeled as individual objects - this means that to "update" an individual neuron, you would actually be creating a new neuron with the updated weight. Then this neuron would be inserted into the network in place of the old one, which would in turn be free for reclamation by the garbage collector.
I do not agree with the idea of using mutable state. Functional languages know that they are functional, so they make optimizations for functional programming. If a functional language really is the best tool for the job, then take advantage of its benefits.
If you structure your data in such a way that you can use a persistent data structure to model your neural network, functional updates to the neural network will be cheap (at least compared to copying the whole thing).
If it is still not fast enough, your language may allow other techniques (such as careful use of mutation) to speed it up; for example, if you were using Clojure, you could use transients to some additional speed.

Resources