The difference between MapReduce and the map-reduce combination in functional programming

I read about MapReduce at http://en.wikipedia.org/wiki/MapReduce and understood the example of how to get the count of a "word" in many "documents". However, I did not understand the following line:
Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values. This behavior is different from the functional programming map and reduce combination, which accepts a list of arbitrary values and returns one single value that combines all the values returned by map.
Can someone elaborate on the difference again (the MapReduce framework vs. the map and reduce combination)? In particular, what does reduce do in functional programming?
Thanks a great deal.

The main difference would be that MapReduce is apparently patentable. (Couldn't help myself, sorry...)
On a more serious note, the MapReduce paper, as I remember it, describes a methodology of performing calculations in a massively parallelised fashion. This methodology builds upon the map / reduce construct which was well known for years before, but goes beyond into such matters as distributing the data etc. Also, some constraints are imposed on the structure of data being operated upon and returned by the functions used in the map-like and reduce-like parts of the computation (the thing about data coming in lists of key/value pairs), so you could say that MapReduce is a massive-parallelism-friendly specialisation of the map & reduce combination.
As for the Wikipedia comment on the function being mapped in the functional programming's map / reduce construct producing one value per input... Well, sure it does, but here there are no constraints at all on the type of said value. In particular, it could be a complex data structure like perhaps a list of things to which you would again apply a map / reduce transformation. Going back to the "counting words" example, you could very well have a function which, for a given portion of text, produces a data structure mapping words to occurrence counts, map that over your documents (or chunks of documents, as the case may be) and reduce the results.
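To make that concrete, here is a tiny Python sketch of the same idea (using collections.Counter as the "data structure mapping words to occurrence counts"; this is my own illustration, not code from the Clojure article mentioned below):

from collections import Counter
from functools import reduce

def count_words(text):
    return Counter(text.split())   # map: one complex value per document

documents = ["a rose is a rose", "a cat"]
total = reduce(lambda acc, counts: acc + counts, map(count_words, documents))
# total == Counter({'a': 3, 'rose': 2, 'is': 1, 'cat': 1})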
In fact, that's exactly what happens in this article by Phil Hagelberg. It's a fun and supremely short example of a MapReduce-word-counting-like computation implemented in Clojure with map and something equivalent to reduce (the (apply + (merge-with ...)) bit -- merge-with is implemented in terms of reduce in clojure.core). The only difference between this and the Wikipedia example is that the objects being counted are URLs instead of arbitrary words -- other than that, you've got a counting words algorithm implemented with map and reduce, MapReduce-style, right there. The reason why it might not fully qualify as being an instance of MapReduce is that there's no complex distribution of workloads involved. It's all happening on a single box... albeit on all the CPUs the box provides.
For in-depth treatment of the reduce function -- also known as fold -- see Graham Hutton's A tutorial on the universality and expressiveness of fold. It's Haskell based, but should be readable even if you don't know the language, as long as you're willing to look up a Haskell thing or two as you go... Things like ++ = list concatenation, no deep Haskell magic.

Using the word count example, the original functional map() would take a set of documents, optionally distribute subsets of that set, and for each document emit a single value representing the number of words (or a particular word's occurrences) in the document. A functional reduce() would then add up those per-document counts into one global total. So you get a single total count (either of all words or of a particular word).
In MapReduce, the map would emit a (word, count) pair for each word in each document. A MapReduce reduce() would then add up the counts of each word across all documents without mixing them into a single pile. So you get a list of words paired with their total counts.
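A rough single-process Python sketch of that flow (the mr_map / shuffle / mr_reduce names are made up for illustration and are not part of any MapReduce framework):

from collections import defaultdict

def mr_map(doc):
    for word in doc.split():
        yield word, 1                   # emit one (word, count) pair per word

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)       # group all counts by word
    return groups

def mr_reduce(word, counts):
    return word, sum(counts)            # one reducer call per distinct word

documents = ["a rose is a rose", "a cat"]
pairs = (pair for doc in documents for pair in mr_map(doc))
counts = dict(mr_reduce(w, cs) for w, cs in shuffle(pairs).items())
# counts == {'a': 3, 'rose': 2, 'is': 1, 'cat': 1}

Unlike the functional reduce above, the reducer here runs once per key, so the per-word structure is preserved instead of being collapsed into a single value.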

MapReduce is a framework built around splitting a computation into parallelizable mappers and reducers. It builds on the familiar idiom of map and reduce - if you can structure your tasks such that they can be performed by independent mappers and reducers, then you can write it in a way which takes advantage of a MapReduce framework.
Imagine a Python interpreter which recognized tasks which could be computed independently, and farmed them out to mapper or reducer nodes. If you wrote
reduce(lambda x, y: x+y, map(int, ['1', '2', '3']))
or
sum([int(x) for x in ['1', '2', '3']])
you would be using functional map and reduce methods in a MapReduce framework. With current MapReduce frameworks, there's a lot more plumbing involved, but it's the same concept.

Related

Why after pressing semicolon program is back in deep recursion?

I'm trying to understand the semicolon functionality.
I have this code:
del(X, [X|Rest], Rest).
del(X, [Y|Tail], [Y|Rest]) :-
    del(X, Tail, Rest).

permutation([], []).
permutation(L, [X|P]) :- del(X, L, L1), permutation(L1, P).
It's a simple predicate that shows all permutations of a given list.
I used the built-in graphical debugger in SWI-Prolog because I wanted to understand how it works, and I understand it for the first solution, which returns the list given as the argument. Here is the diagram which I made for better understanding.
But I don't get it for the other solutions. When I press the semicolon it doesn't start at the place where it ended; instead it starts in some deep recursion where L=[] (like in step 9). I don't get it: didn't the recursion end earlier? It had to come back out of the recursion to return the answer, and after the semicolon it's deep in the recursion again.
Could someone clarify that to me? Thanks in advance.
One analogy that I find useful in demystifying Prolog is that Backtracking is like Nested Loops, and when the innermost loop's variables' values are all found, the looping is suspended, the vars' values are reported, and then the looping is resumed.
As an example, let's write down a simple generate-and-test program to find all pairs of natural numbers above 0 that sum up to a prime number. Let's assume is_prime/1 is already given to us.
We write this in Prolog as
above(0, N), between(1, N, M), Sum is M+N, is_prime(Sum).
We write this in an imperative pseudocode as
for N from 1 step 1:
    for M from 1 step 1 until N:
        Sum := M+N
        if is_prime(Sum):
            report_to_user_and_ask(Sum)
Now when report_to_user_and_ask is called, it prints Sum out and asks the user whether to abort or to continue. The loops are not exited, on the contrary, they are just suspended. Thus all the loop variables values that got us this far -- and there may be more tests up the loops chain that sometimes succeed and sometimes fail -- are preserved, i.e. the computation state is preserved, and the computation is ready to be resumed from that point, if the user presses ;.
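To make the analogy concrete, here is a rough Python rendering of those suspended nested loops as a generator (is_prime is assumed to be given, as in the Prolog version; this only illustrates the looping, not Prolog's actual machinery):

from itertools import count

def prime_sums(is_prime):
    # the two nested loops mirror the two generators in the Prolog query
    for n in count(1):
        for m in range(1, n + 1):
            s = m + n
            if is_prime(s):
                # yield suspends the loops with all their state intact;
                # asking for the next value (like pressing ;) resumes them
                yield m, n, s

gen = prime_sums(lambda k: k > 1 and all(k % d for d in range(2, k)))
print(next(gen))   # (1, 1, 2)
print(next(gen))   # (1, 2, 3) -- the loops resume right where they were suspended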
I first saw this in Peter Norvig's AI book's implementation of Prolog in Common Lisp. He used mapping (Common Lisp's mapcan which is concatMap in Haskell or flatMap in many other languages) as a looping construct though, and it took me years to see that nested loops is what it is really all about.
Conjunction of goals is expressed as the nesting of the loops; disjunction of goals is expressed as the alternatives to loop through.
A further twist is that the nested loops' structure isn't fixed from the outset. It is fluid: the nested loops of a given loop can be created depending on the current state of that loop, i.e. depending on the current alternative being explored there; the loops are written as we go. In (most of the) languages where such dynamic creation of nested loops is impossible, it can be encoded with nested recursion / function invocation inside the loops. (Here's one example, with some pseudocode.)
If we keep all such loops (created for each of the alternatives) in memory even after they are finished with, what we get is the AND-OR tree (mentioned in the other answer) thus being created while the search space is being explored and the solutions are found.
(non-coincidentally this fluidity is also the essence of "monad"; nondeterminism is modeled by the list monad; and the essential operation of the list monad is the flatMap operation which we saw above. With fluid structure of loops it is "Monad"; with fixed structure it is "Applicative Functor"; simple loops with no structure (no nesting at all): simply "Functor" (the concepts used in Haskell and the like). Also helps to demystify those.)
So, the proper slogan could be Backtracking is like Nested Loops, either fixed, known from the outset, or dynamically-created as we go. It's a bit longer though. :)
Here's also a Prolog example, which "as if creates the code to be run first (N nested loops for a given value of N), and then runs it." (There's even a whole dedicated tag for it on SO, too, it turns out, recursive-backtracking.)
And here's one in Scheme ("creates nested loops with the solution being accessible in the innermost loop's body"), and a C++ example ("create n nested loops at run-time, in effect enumerating the binary encoding of 2n, and print the sums out from the innermost loop").
There is a big difference between recursion in functional/imperative programming languages and Prolog (and it really became clear to me only in the last 2 weeks or so):
In functional/imperative programming, you recurse down a call chain, then come back up, unwinding the stack, then output the result. It's over.
In Prolog, you recurse down an AND-OR tree (really, alternating AND and OR nodes), selecting a predicate to call on an OR node (the "choicepoint"), from left to right, and calling every predicate in turn on an AND node, also from left to right. An acceptable tree has exactly one predicate returning TRUE under each OR node, and all predicates returning TRUE under each AND node. Once an acceptable tree has been constructed, by the very search procedure, we are (i.e. the "search cursor" is) on its rightmost, bottommost node.
Success in constructing an acceptable tree also means a solution to the query entered at the Prolog Toplevel (the REPL) has been found: The variable values are output, but the tree is kept (unless there are no choicepoints).
And this is also important: all variables are global in the sense that if a variable X has been passed all the way down the call chain from predicate to predicate to the rightmost bottommost node, then constrained at the last possible moment by unifying it with 2 for example, X = 2, then the Prolog Toplevel is aware of that without further ado: nothing needs to be passed back up the call chain.
If you now press ;, search doesn't restart at the top of the tree, but at the bottom, i.e. at the current cursor position: the nearest parent OR node is asked for more solutions. This may result in much search until a new acceptable tree has been constructed, and we arrive at a new rightmost, bottommost node. The new variable values are output and you may again enter ;.
This process cycles until no acceptable tree can be constructed any longer, upon which false is output.
Note that having this AND-OR tree as an inspectable and modifiable data structure at runtime allows some magical tricks to be deployed.
There is bound to be a lot of power in debugging tools which record this tree to help the user who gets the dreaded sphinxian false from a Prolog program that is supposed to work. There are now Time Traveling Debuggers for functional and imperative languages, after all...

Julia: What are the right data structures for graph traversal?

I'm writing a bunch of recursive graph algorithms where graph nodes have parents, children, and a number of other properties. The algorithms can also create nodes dynamically, and make use of recursive functions.
What are the right data structures to use in this case? In C++ I would've implemented this via pointers (i.e. each node has a vector<Node*> parents, vector<Node*> children), but I'm not sure if Julia pointers are the right tool for that, or if there's something else ... ?
In Julia, the state of the art in this regard is the LightGraphs.jl library.
It uses adjacency lists for graph representation and assumes that the data for nodes is kept outside the graph (for example in Vectors indexed by node identifiers) rather than inside the graph.
This approach is generally the most efficient and most convenient (operating on Array indices rather than references).
LightGraphs.jl provides implementations of several typical graph algorithms and is usually the way to go when doing computation on graphs.
However, LightGraphs.jl's approach might be less convenient in scenarios where you are continuously adding and destroying many nodes of the graph at the same time.
Now, regarding an equivalent of the C++ approach you have proposed, it can be accomplished as:
struct MyNode{T}
    data::T
    children::Vector{MyNode}
    parents::Vector{MyNode}
    MyNode(data::T, children=MyNode[], parents=MyNode[]) where T = new{T}(data, children, parents)
end
And this API can be used as:
node1 = MyNode(nothing)
push!(node1.parents, MyNode("hello2"))
Finally, since LightGraphs.jl is a Julia standard, it is usually worth providing some bridging implementation so that your API is able to use LightGraphs.jl functions.
For an illustration of how this can be done, have a look at the SimpleHypergraphs.jl library.
EDIT:
Normally, for efficiency reasons, you will want the data field to be homogeneous across the graph; in that case this is better:
struct MyNode{T}
    data::T
    children::Vector{MyNode{T}}
    parents::Vector{MyNode{T}}
    MyNode(data::T, children=MyNode{T}[], parents=MyNode{T}[]) where T = new{T}(data, children, parents)
end

Where does the word "flatMap" originate from?

Nowadays flatMap is the most widely used name for the corresponding operation on monad-like objects.
But I can't find where it first appeared or what popularized it.
The oldest appearance I know about is in Scala.
In Haskell it is called bind.
In category theory Greek notation is used.
Partial answer, which hopefully provides some useful "seed nodes" to start a more thorough search. My best guess:
1958 for map used for list processing,
1988 for flatten used in context of monads,
2004 for flatMap used as important method backing for-comprehensions in Scala.
The function / method name flatMap seems to be a portmanteau word composed from flatten and map. This makes sense, because whenever M is some monad, A,B some types, and a: M[A], f: A => M[B] a value and a function, then the implementations of map, flatMap and flatten should satisfy
a.flatMap(f) = a.map(f).flatten
(in Scala-syntax).
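The same law can be checked on the list monad in plain Python (flatten and flat_map here are hypothetical helpers written only for this illustration, not library functions):

def flatten(xss):
    return [x for xs in xss for x in xs]

def flat_map(f, xs):
    return [y for x in xs for y in f(x)]

f = lambda n: [n, n * 10]
assert flat_map(f, [1, 2]) == flatten(map(f, [1, 2])) == [1, 10, 2, 20]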
Let's first consider both components, map and flatten, separately.
Map
The map-function seems to have been used to map over lists since time immemorial. My best guess would be that it came from Lisp (around 1958), and then spread to all other languages that had anything resembling higher-order functions.
Flatten
Given how many things are represented by lists in Lisp, I assume that flatten has also been used there for list processing.
The usage of flatten in the context of monads must be much more recent, because monads themselves were introduced into programming quite a bit later. If we are looking for the usage of the word "flatten" in the context of monadic computations, we probably should at least check the papers by Eugenio Moggi. Indeed, in "Computational Lambda-Calculus and Monads" from 1988, he uses the formulation:
Remark 2.2: Intuitively eta_A: A -> TA gives the inclusion of values into computations, while mu_A: T^2 A -> TA flatten a computation of a computation into a computation.
(typesetting changed by me, emphasis mine, text in italic as in original). I think it's interesting that Moggi talks about flattening computations, and not just lists.
Math notation / "Greek"
On the Greek used in mathematical notation: in category theory, the more common way to introduce monads is through the natural transformations that correspond to pure and flatten; the morphisms corresponding to flatMap are deemphasized. However, nobody calls it "flatten". For example, Mac Lane calls the natural transformation corresponding to the method pure "unit" (not to be confused with the method unit), and flatten is usually called "multiplication", in analogy with monoids. One might investigate further whether it was different when the "triple"-terminology was more prevalent.
flatMap
To find the origins of the flatMap portmanteau word, I'd propose to start with the most prominent popularizer today, and then try to backtrack from there. Apparently, flatMap is a Scala meme, so it seems reasonable to start from Scala. One might check the standard libraries (especially the List data structure) of the usual suspects: the languages that influenced Scala. These "roots" are named in Chapter 1, section 1.4 in Odersky's "Programming in Scala":
C, C++ and C# are probably not where it came from.
In Java it was the other way around: the flatMap came from Scala into version 1.8 of Java.
I can't say anything about Smalltalk.
Ruby definitely has flat_map on Enumerable, but I don't know anything about Ruby, and I don't want to dig into the source code to find out when it was introduced.
Algol and Simula: definitely not.
Strangely enough ML (SML) seems to get by without flatMap, it only has concat (which is essentially the same as flatten). OCaml's lists also seem to have flatten, but no flatMap.
As you've already mentioned, Haskell had all this long ago, but in Haskell it is called bind and written as an operator
Erlang has flatmap on lists, but I'm not sure whether this is the origin, or whether it was introduced later. The problem with Erlang is that it is from 1986; back then there was no GitHub.
I can't say anything about Iswim, Beta and gbeta.
I think it would be fair to say that flatMap has been popularized by Scala, for two reasons:
flatMap took a prominent role in the design of Scala's collection library, and a few years later it turned out to generalize nicely to huge distributed collections (Apache Spark and similar tools)
flatMap became the favorite toy of everyone who decided to do functional programming on the JVM properly (Scalaz and libraries inspired by Scalaz, like Scala Cats)
To sum it up: the "flatten" terminology has been used in the context of monads since the very beginning. Later, it was combined with map into flatMap, and popularized by Scala, or more specifically by frameworks such as Apache Spark and Scalaz.
flatmap was introduced in Section 2.2.3 Sequences as Conventional Interfaces in "Structure and Interpretation of Computer Programs" as
(define (flatmap proc seq)
  (accumulate append nil (map proc seq)))
The first edition of the book appeared in 1985.

"Adding" a value to a tuple?

I am attempting to represent dice rolls in Julia. I am generating all the rolls of n dice, each with sides faces, using
sort(collect(product(repeated(1:sides, n)...)), by=sum)
This produces something like:
[(1,1),(2,1),(1,2),(3,1),(2,2),(1,3),(4,1),(3,2),(2,3),(1,4) … (6,3),(5,4),(4,5),(3,6),(6,4),(5,5),(4,6),(6,5),(5,6),(6,6)]
I then want to be able to reasonably modify those tuples to represent things like dropping the lowest value in the roll or adding a constant number, etc., e.g., converting (2,5) into (10,2,5) or (5,).
Does Julia provide nice functions to easily modify (not necessarily in-place) n-tuples or will it be simpler to move to a different structure to represent the rolls?
Thanks.
Tuples are immutable, so you can't modify them in-place. There is very good support for other mutable data structures, so there aren't many methods that take a tuple and return a new, slightly modified copy. One way to do this is by splatting a section of the old tuple into a new tuple, so, for example, to create a new tuple like an existing tuple t but with the first element set to 5, you would write: tuple(5, t[2:end]...). But that's awkward, and there are much better solutions.
As spencerlyon2 suggests in his comment, a one dimensional Array{Int,1} is a great place to start. You can take a look at the Data Structures manual page to get an idea of the kinds of operations you can use; one-dimensional Arrays are iterable, indexable, and support the dequeue interface.
Depending upon how important performance is and how much work you're doing, it may be worthwhile to create your own data structure. You'll be able to add your own, specific methods (e.g., reroll!) for that type. And by taking advantage of some of the domain restrictions (e.g., if you only ever want to have a limited number of dice rolls), you may be able to beat the performance of the general Array.
You can construct a new tuple based on spreading or slicing another:
julia> b = (2,5)
(2, 5)
julia> (10, b...)
(10, 2, 5)
julia> b[2:end]
(5,)

Pure functional bottom up tree algorithm

Say I wanted to write an algorithm working on an immutable tree data structure that has a list of leaves as its input. It needs to return a new tree with changes made to the old tree going upwards from those leaves.
My problem is that there seems to be no way to do this purely functionally without reconstructing the entire tree, checking at each leaf whether it is in the list, because you always need to return a complete new tree as the result of an operation and you can't mutate the existing tree.
Is this a basic problem in functional programming that only can be avoided by using a better suited algorithm or am I missing something?
Edit: Not only do I want to avoid recreating the entire tree, the functional algorithm should also have the same time complexity as the mutating variant.
The most promising thing I have seen so far (and admittedly that "so far" is not very long...) is the Zipper data structure: it basically keeps a separate structure, a reversed path from the node back to the root, and does local edits on this separate structure.
It can do multiple local edits, most of which are constant time, and write them back to the tree (reconstructing the path to the root, whose nodes are the only ones that need to change) all in one go.
The Zipper is a standard library in Clojure (see the heading Zippers - Functional Tree Editing).
And there's the original paper by Huet with an implementation in OCaml.
Disclaimer: I have been programming for a long time, but only started functional programming a couple of weeks ago, and had never even heard of the problem of functional editing of trees until last week, so there may very well be other solutions I'm unaware of.
Still, it looks like the Zipper does most of what one could wish for. If there are other alternatives at O(log n) or below, I'd like to hear them.
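For a feel of how a zipper works, here is a very small binary-tree zipper sketch in Python (all names are hypothetical; this mirrors neither Clojure's zippers nor Huet's OCaml API, it only illustrates the focus-plus-crumbs idea):

from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

@dataclass(frozen=True)
class Crumb:
    side: str                  # "L" if we went left, "R" if we went right
    value: int                 # the parent's value
    other: Optional[Node]      # the sibling subtree we did not descend into

def down_left(focus, crumbs):
    return focus.left, (Crumb("L", focus.value, focus.right),) + crumbs

def down_right(focus, crumbs):
    return focus.right, (Crumb("R", focus.value, focus.left),) + crumbs

def set_value(focus, new_value):
    return replace(focus, value=new_value)          # local O(1) edit at the focus

def up(focus, crumbs):
    c, rest = crumbs[0], crumbs[1:]
    parent = Node(c.value, focus, c.other) if c.side == "L" else Node(c.value, c.other, focus)
    return parent, rest

def to_root(focus, crumbs):
    while crumbs:
        focus, crumbs = up(focus, crumbs)           # rebuilds only the nodes on the path to the root
    return focus

# descend, edit locally, then write the path back in one go:
tree = Node(1, Node(2, Node(4), Node(5)), Node(3))
focus, crumbs = down_left(tree, ())
focus, crumbs = down_right(focus, crumbs)
new_tree = to_root(set_value(focus, 50), crumbs)
# new_tree replaces the 5 with 50; the untouched subtree Node(3) is shared, not copied.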
You may enjoy reading
http://lorgonblog.spaces.live.com/blog/cns!701679AD17B6D310!248.entry
This depends on your functional programming language. For instance in Haskell, which is a lazy functional programming language, results are calculated at the last moment, when they are actually needed.
In your example the assumption is that because your function creates a new tree, the whole tree must be processed, whereas in reality the function is just passed on to the next function and only executed when necessary.
A good example of lazy evaluation is the sieve of Eratosthenes in Haskell, which produces the prime numbers by eliminating the multiples of the current number from the list of numbers. Note that the list of numbers is infinite. Taken from here:
primes :: [Integer]
primes = sieve [2..]
  where
    sieve (p:xs) = p : sieve [x | x <- xs, x `mod` p > 0]
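For comparison, a loose Python analogue of the same on-demand behaviour can be written with generators (Python is not lazy by default, so this only approximates what Haskell does for free; the sieve function here is my own sketch):

from itertools import count

def sieve(numbers):
    p = next(numbers)
    yield p
    # the filtered stream is itself lazy; nothing is computed until requested
    yield from sieve(n for n in numbers if n % p > 0)

primes = sieve(count(2))
print([next(primes) for _ in range(5)])   # [2, 3, 5, 7, 11]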
I recently wrote an algorithm that does exactly what you described - https://medium.com/hibob-engineering/from-list-to-immutable-hierarchy-tree-with-scala-c9e16a63cb89
It works in 2 phases (a rough sketch follows after the caveats below):
Sorts the list of nodes by their depth in the hierarchy
Constructs the tree from the bottom up
Some caveats:
No node mutation; the result is an immutable tree
The complexity is O(n)
Ignores cyclic referencing in the incoming list
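A rough Python sketch of those two phases (the (id, parent_id) record format and all helper names are invented for this illustration; the linked article does it in Scala):

from collections import defaultdict

def build_tree(records):
    parent_of = dict(records)                 # node id -> parent id (None for the root)
    children_of = defaultdict(list)
    for node_id, parent in records:
        if parent is not None:
            children_of[parent].append(node_id)

    # Phase 1: sort node ids by depth, deepest first.
    depth_cache = {}
    def depth(node_id):
        if node_id not in depth_cache:
            parent = parent_of[node_id]
            depth_cache[node_id] = 0 if parent is None else 1 + depth(parent)
        return depth_cache[node_id]
    ids = sorted(parent_of, key=depth, reverse=True)

    # Phase 2: build immutable (id, children) tuples bottom up; every child
    # is already built by the time its parent is reached, so nothing mutates.
    built = {}
    for node_id in ids:
        built[node_id] = (node_id, tuple(built[child] for child in children_of[node_id]))

    return next(built[i] for i, p in records if p is None)

# build_tree([("a", None), ("b", "a"), ("c", "a"), ("d", "b")])
# -> ('a', (('b', (('d', ()),)), ('c', ())))

Note that the explicit sort makes this sketch O(n log n); the linked implementation is O(n), as noted above, and also addresses the other caveats.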
