Big datastructures in functional programming - functional-programming

I'm newbie in Functional Programming.
I have a huge neural network with thousands of neurons and every connection between neurons has its weight. I have to update these weights very often, several thousand times per learning session.
Is FP still applicable here? I mean in fp we can't modify variables and only able to return new variables not changing previous values. Does this mean I have to recreate whole network on every weight update?

Is FP still applicable here?
You can certainly write this in a functional style with decent asymptotic algorithmic efficiency but you are not likely to get with 10× the performance of a decent imperative solution because purely functional programming makes it difficult to use CPU caches effectively.
I mean in fp we can't modify variables and only able to return new variables not changing previous values. Does this mean I have to recreate whole network on every weight update?
No, for two reasons:
Purely functional data structures can be updated efficiently because they decompose large structures (e.g. a hash table) into many small recursively-defined structures (e.g. a balanced binary tree). When you update a single node within an immutable tree, you copy data from every node in the path from the root to the destination but refer back to all other branches by reference safe in the knowledge that they cannot be changed under you because they are immutable. So you only do O(log n) work instead of O(n) work.
Purely functional data structures usually offer functions like map that allow every element to be updated in the same way and avoid rebalancing by copying the structure of the source tree. So the time for n updates is O(n) instead of O(n log n).
So you should be able to achieve similar or even equal asymptotic time complexity but, in absolute terms, you will be using several times as much space and time as an imperative solution. I described these basics in detail in my book Visual F# 2010 for Technical Computing and I wrote the article Artificial Intelligence: Neural Networks (8th May 2010) for the OCaml Journal.

Look into Haskell arrays which include mutable variants in a monad.

You should not need to recreate the entire network every time a weight update occurs. Presumably, your neurons are modeled as individual objects - this means that to "update" an individual neuron, you would actually be creating a new neuron with the updated weight. Then this neuron would be inserted into the network in place of the old one, which would in turn be free for reclamation by the garbage collector.
I do not agree with the idea of using mutable state. Functional languages know that they are functional, so they make optimizations for functional programming. If a functional language really is the best tool for the job, then take advantage of its benefits.

If you structure your data in such a way that you can use a persistent data structure to model your neural network, functional updates to the neural network will be cheap (at least compared to copying the whole thing).
If it is still not fast enough, your language may allow other techniques (such as careful use of mutation) to speed it up; for example, if you were using Clojure, you could use transients to some additional speed.

Related

What is the need for immutable/persistent data structures in erlang

Each Erlang process maintains its own private address space. All communication happens via copying without sharing (except big binaries). If each process is processing one message at a time with no concurrent access over its objects, I don't see why do we need immutable/persistent data structures.
Erlang was initially implemented in Prolog, which doesn't really use mutable data structures either (though some dialects do). So it started off without them. This makes runtime implementation simpler and faster (garbage collection in particular).
So adding mutable data structures would require a lot of effort, could introduce bugs, and Erlang programmers are nearly by definition at least willing to live without them.
Many actually consider their absence to be a positive good: less concern about object identity, no need for defensive copying because you don't know whether some other piece of code is going to modify the data you passed (or might be changed later to modify it), etc.
This absence does mean that Erlang is pretty unusable in some domains (e.g. high performance scientific computing), at least as the main language. But again, this means that nobody in these domains is going to use Erlang in the first place and so there's no particular incentive to make it usable at the cost of making existing users unhappy.
I remember seeing a mailing list post by Joe Armstrong quite a long time ago (which I couldn't find with a quick search now) saying that he initially planned to add mutable variables when he'd need them... except he never quite did, and performance was good enough for everything he was using Erlang for.
It is indeed the case that in Erlang immutability does not solve any "shared state" problems, as immutable data are "process local".
From the functional programming language perspective, however, immutability offers a number of benefits, summarized adequately in this Quora answer:
The simplest definition of functional programming is that it’s a programming
paradigm where you are transforming immutable data with functions.
The definition uses functions in the mathematical sense, where it’s
something that takes an input, and produces an output.
OO + mutability tends to violate that definition because when you want
to change a piece of data it generally will not return the output, it
will likely return void or unit, and that when you call a method on
the object the object itself isn’t input for the function.
As far as what advantages the paradigm has, composability, thread
safety, being able to track what went wrong where better, the ability
to sort of separate the data from the actual computation on it being
done, etc.
how would this work?
factorial(1) -> 1;
factorial(X) ->
X*factorial(X-1).
if you run factorial(4), a single process will be running the same function. Each time the function will have it's own value of X, if the value of X was in the scope of the process and not the function recursive functions wouldn't work. So first we need to understand scope. If you want to say that you don't see why data needs to be immutable within the scope of a single function/block you would have a point, but it would be a headache to think about where data is immutable and where it isn't.

What is the network analog of a recursive function?

This is an ambitious question from a Wolfram Science Conference: Is there such a thing as a network analog of a recursive function? Maybe a kind of iterative "map-reduce" pattern? If we add interaction to iteration, things become complicated: continuous iteration of a large number of interacting entities can produce very complex results. It would be nice to have a way of seeing the consequences of the myriad interactions that define a complex system. Can we find a counterpart of a recursive function in an iterative network of connected nodes which contain nested propagation loops?
One of the basic patterns of distributed computation is Map-Reduce: it can be found in Cellular Automata (CA) and Neural Networks (NN). Neurons in NN collect informations through their synapses (reduce) and send it to other neurons (map). Cells in CA act similar, they gather informations from their neighbors (reduce), apply a transition rule (reduce), and offer the result to their neighbors again. Thus >if< there is a network analog of a recursive function, then Map-Reduce is certainly an important part of it. What kind of iterative "map-reduce" patterns exist? Do certain kinds of "map-reduce" patterns result in certain kinds of streams or even vortices or whirls? Can we formulate a calculus for map-reduce patterns?
I'll take a stab at the question about recursion in neural networks, but I really don't see how map-reduce plays into this at all. I get that neural network can perform distributed computation and then reduce it to a more local representation, but the term map-reduce is a very specific brand of this distributed/local piping, mainly associated with google and Hadoop.
Anyways, the simple answer to your question is that there isn't a general method for recursion in neural networks; in fact, the very related simpler problem of implementing general purpose role-value bindings in neural networks is currently still an open question.
The general principle of why things like role-binding and recursion in neural networks (ANNs) are so hard is that ANNs are very interdependent by nature; indeed that is where most of their computational power is derived from. Whereas function calls and variable bindings are both very delineated operations; what they include is an all-or-nothing affair, and that discreteness is a valuable property in many cases. So implementing one inside the other without sacrificing any computational power is very tricky indeed.
Here is a small sampling of papers that try there hand at partial solutions. Lucky for you, a great many people find this problem very interesting!
Visual Segmentation and the Dynamic Binding Problem: Improving the Robustness of an Artificial Neural Network Plankton Classifier (1993)
A Solution to the Binding Problem for Compositional Connectionism
A (Somewhat) New Solution to the Binding Problem

Best Practices for cache locality in Multicore Parallelism in F#

I'm studying multicore parallelism in F#. I have to admit that immutability really helps to write correct parallel implementation. However, it's hard to achieve good speedup and good scalability when the number of cores grows. For example, my experience with Quick Sort algorithm is that many attempts to implement parallel Quick Sort in a purely functional way and using List or Array as the representation are failed. Profiling those implementations shows that the number of cache misses increases significantly compared to those of sequential versions. However, if one implements parallel Quick Sort using mutation inside arrays, a good speedup could be obtained. Therefore, I think mutation might be a good practice for optimizing multicore parallelism.
I believe that cache locality is a big obstacle for multicore parallelism in a functional language. Functional programming involves in creating many short-lived objects; destruction of those objects may destroy coherence property of CPU caches. I have seen many suggestions how to improve cache locality in imperative languages, for example, here and here. But it's not clear to me how they would be done in functional programming, especially with recursive data structures such as trees, etc, which appear quite often.
Are there any techniques to improve cache locality in an impure functional language (specifically F#)? Any advices or code examples are more than welcome.
As far as I can make out, the key to cache locality (multithreaded or otherwise) is
Keep work units in a contiguous block of RAM that will fit into the cache
To this end ;
Avoid objects where possible
Objects are allocated on the heap, and might be sprayed all over the place, depending on heap fragmentation, etc.
You have essentially zero control over the memory placement of objects, to the extent that the GC might move them at any time.
Use arrays. Arrays are interpreted by most compilers as a contiguous block of memory.
Other collection datatypes might distribute things all over the place - linked lists, for example, are composed of pointers.
Use arrays of primitive types. Object types are allocated on the heap, so an array of objects is just an array of pointers to objects that may be distributed all over the heap.
Use arrays of structs, if you can't use primitives. Structs have their fields arranged sequentially in memory, and are treated as primitives by the .NET compilers.
Work out the size of the cache on the machine you'll be executing it on
CPUs have different size L2 caches
It might be prudent to design your code to scale with different cache sizes
Or more simply, write code that will fit inside the lowest common cache size your code will be running on
Work out what needs to sit close to each datum
In practice, you're not going to fit your whole working set into the L2 cache
Examine (or redesign) your algorithms so that the data structures you are using hold data that's needed "next" close to data that was previously needed.
In practice this means that you may end up using data structures that are not theoretically perfect examples of computer science - but that's all right, computers aren't theoretically perfect examples of computer science either.
A good academic paper on the subject is Cache-Efficient String Sorting Using Copying
Allowing mutability within functions in F# is a blessing, but it should only be used when optimizing code. Purely-functional style often yields more intuitive implementation, and hence is preferred.
Here's what a quick search returned: Parallel Quicksort in Haskell. Let's keep the discussion about performance focused on performance. Choose a processor, then bench it with a specific algorithm.
To answer your question without specifics, I'd say that Clojure's approach to implementing STM could be a lesson in general case on how to decouple paths of execution on multicore processors and improve cache locality. But it's only effective when number of reads outweigh number of writes.
I am no parallelism expert, but here is my advice anyway.
I would expect that a locally mutable approach where each core is allocated an area of memory which is both read and written will always beat a pure approach.
Try to formulate your algorithm so that it works sequentially on a contiguous area of memory. This means that if you are working with graphs, it may be worth "flattening" nodes into arrays and replace references by indices before processing. Regardless of cache locality issues, this is always a good optimisation technique in .NET, as it helps keep garbage collection out of the way.
A great approach is to split the work into smaller sections and iterate over each section on each core.
One option I would start with is to look for cache locality improvements on a single core before going parallel, it should be simply a matter of subdividing the work again for each core. For example if you are doing matrix calculations with large matrices then you could split up the calculations into smaller sections.
Heres a great example of that: Cache Locality For Performance
There were some great sections in Tomas Petricek's book Real Work functional programming, check out Chapter 14 Writing Parallel Functional Programs, you might find Parallel processing of a binary tree of particular interest.
To write scalable Apps cache locality is paramount to your application speed. The principles are well explain by Scott Meyers talk. Immutability does not play well with cache locality since you create new objects in memory which forces the CPU to reload the data from the new object again.
As in the talk is noted even on modern CPUs the L1 cache has only 32 KB size which is shared for code and data between all cores. If you go multi threaded you should try to consume as little memory as possible (goodbye immutabilty) to stay in the fastest cache. The L2 cache is about 4-8 MB which is much bigger but still tiny compared to the data you are trying to sort.
If you manage to write an application which consumes as little memory as possible (data cache locality) you can get speedups of 20 or more. But if you manage this for 1 core it might be very well be that scaling to more cores will hurt performance since all cores are competing for the same L2 cache.
To get most out of it the C++ guys use PGA (Profile Guided Optimizations) which allows them to profile their application which is used as input data for the compiler to emit better optimized code for the specific use case.
You can get better to certain extent in a managed code but since so many factors influence your cache locality it is not likely that you will ever see a speedup of 20 in the real world due to total cache locality. This remains the regime of C++ and compilers which use profiling data.
You may get some ideas from these:
Cache-Oblivious http://supertech.csail.mit.edu/cacheObliviousBTree.html Cache-Oblivious Search Trees Project
DSapce#MIT Cache coherence strategies in a many-core processor http://dspace.mit.edu/handle/1721.1/61276
describes the revolutionary idea of cache oblivious algorithms via the elegant and efficient implementation of a matrix multiply in F#.

query language for graph sets: data modeling question

Suppose I have a set of directed graphs. I need to query those graphs. I would like to get a feeling for my best choice for the graph modeling task. So far I have these options, but please don't hesitate to suggest others:
Proprietary implementation (matrix)
and graph traversal algorithms.
RDBM and SQL option (too space consuming)
RDF and SPARQL option (too slow)
What would you guys suggest? Regards.
EDIT: Just to answer Mad's questions:
Each one is relatively small, no more than 200 vertices, 400 edges. However, there are hundreds of them.
Frequency of querying: hard to say, it's an experimental system.
Speed: not real time, but practical, say 4-5 seconds tops.
You didn't give us enough information to respond with a well thought out answer. For example: what size are these graphs? With what frequencies do you expect to query these graphs? Do you need real-time response to these queries? More information on what your application is for, what is your purpose, will be helpful.
Anyway, to counter the usual responses that suppose SQL-based DBMSes are unable to handle graphs structures effectively, I will give some references:
Graph Transformation in Relational Databases (.pdf), by G. Varro, K. Friedl, D. Varro, presented at International Workshop on Graph-Based Tools (GraBaTs) 2004;
5 Conclusion and Future Work
In the paper, we proposed a new graph transformation engine based on off-the-shelf
relational databases. After sketching the main concepts of our approach, we carried
out several test cases to evaluate our prototype implementation by comparing it to
the transformation engines of the AGG [5] and PROGRES [18] tools.
The main conclusion that can be drawn from our experiments is that relational
databases provide a promising candidate as an implementation framework for graph
transformation engines. We call attention to the fact that our promising experimental
results were obtained using a worst-case assessment method i.e. by recalculating
the views of the next rule to be applied from scratch which is still highly inefficient,
especially, for model transformations with a large number of independent matches
of the same rule. ...
They used PostgreSQL as DBMS, which is probably not particularly good at this kind of applications. You can try LucidDB and see if it is better, as I suspect.
Incremental SQL Queries (more than one paper here, you should concentrate on " Maintaining Transitive Closure of Graphs in SQL "): "
.. we showed that transitive closure, alternating paths, same generation, and other recursive queries, can be maintained in SQL if some auxiliary relations are allowed. In fact, they can all be maintained using at most auxiliary relations of arity 2. ..
Incremental Maintenance of Shortest Distance and Transitive Closure in First Order Logic and SQL.
Edit: you give more details so... I think the best way is to experiment a little with both a main-memory dedicated graph library and with a DBMS-based solution, then evaluate carefully pros and cons of both solutions.
For example: a DBMS need to be installed (if you don't use an "embeddable" DBMS like SQLite), only you know if/where your application needs to be deployed and what your users are. On the other hand, a DBMS gives you immediate benefits, like persistence (I don't know what support graph libraries gives for persisting their graphs), transactions management and countless other. Are these relevant for your application? Again, only you know.
The first option you mentioned seems best. If your graph won't have many edges (|E|=O(|V|)) then you might earn better complexity of time and space using Dictionary:
var graph = new Dictionary<Vertex, HashSet<Vertex>>();
An interesting graph library is QuickGraph. Never used it but it seems promising :)
I wrote and designed quite a few graph algorithms for various programming contests and in production code. And I noticed that every time I need one, I have to develop it from scratch, assembling together concepts from graph theory (BFS, DFS, topological sorting etc).
Perhaps a lack of experience is a reason, but it seems to me that there's still no reasonable general-purpose query language to solve graph problems. Pick a couple of general-purpose graph libraries and solve your particular task in a programming (not query!) language. That will give you best performance and space consumption, but will also require understanding of graph theory basic concepts and of their limitations.
And the last one: do not use SQL for graphs.

Advantages of stateless programming?

I've recently been learning about functional programming (specifically Haskell, but I've gone through tutorials on Lisp and Erlang as well). While I found the concepts very enlightening, I still don't see the practical side of the "no side effects" concept. What are the practical advantages of it? I'm trying to think in the functional mindset, but there are some situations that just seem overly complex without the ability to save state in an easy way (I don't consider Haskell's monads 'easy').
Is it worth continuing to learn Haskell (or another purely functional language) in-depth? Is functional or stateless programming actually more productive than procedural? Is it likely that I will continue to use Haskell or another functional language later, or should I learn it only for the understanding?
I care less about performance than productivity. So I'm mainly asking if I will be more productive in a functional language than a procedural/object-oriented/whatever.
Read Functional Programming in a Nutshell.
There are lots of advantages to stateless programming, not least of which is dramatically multithreaded and concurrent code. To put it bluntly, mutable state is enemy of multithreaded code. If values are immutable by default, programmers don't need to worry about one thread mutating the value of shared state between two threads, so it eliminates a whole class of multithreading bugs related to race conditions. Since there are no race conditions, there's no reason to use locks either, so immutability eliminates another whole class of bugs related to deadlocks as well.
That's the big reason why functional programming matters, and probably the best one for jumping on the functional programming train. There are also lots of other benefits, including simplified debugging (i.e. functions are pure and do not mutate state in other parts of an application), more terse and expressive code, less boilerplate code compared to languages which are heavily dependent on design patterns, and the compiler can more aggressively optimize your code.
The more pieces of your program are stateless, the more ways there are to put pieces together without having anything break. The power of the stateless paradigm lies not in statelessness (or purity) per se, but the ability it gives you to write powerful, reusable functions and combine them.
You can find a good tutorial with lots of examples in John Hughes's paper Why Functional Programming Matters (PDF).
You will be gobs more productive, especially if you pick a functional language that also has algebraic data types and pattern matching (Caml, SML, Haskell).
Many of the other answers have focused on the performance (parallelism) side of functional programming, which I believe is very important. However, you did specifically ask about productivity, as in, can you program the same thing faster in a functional paradigm than in an imperative paradigm.
I actually find (from personal experience) that programming in F# matches the way I think better, and so it's easier. I think that's the biggest difference. I've programmed in both F# and C#, and there's a lot less "fighting the language" in F#, which I love. You don't have to think about the details in F#. Here's a few examples of what I've found I really enjoy.
For example, even though F# is statically typed (all types are resolved at compile time), the type inference figures out what types you have, so you don't have to say it. And if it can't figure it out, it automatically makes your function/class/whatever generic. So you never have to write any generic whatever, it's all automatic. I find that means I'm spending more time thinking about the problem and less how to implement it. In fact, whenever I come back to C#, I find I really miss this type inference, you never realise how distracting it is until you don't need to do it anymore.
Also in F#, instead of writing loops, you call functions. It's a subtle change, but significant, because you don't have to think about the loop construct anymore. For example, here's a piece of code which would go through and match something (I can't remember what, it's from a project Euler puzzle):
let matchingFactors =
factors
|> Seq.filter (fun x -> largestPalindrome % x = 0)
|> Seq.map (fun x -> (x, largestPalindrome / x))
I realise that doing a filter then a map (that's a conversion of each element) in C# would be quite simple, but you have to think at a lower level. Particularly, you'd have to write the loop itself, and have your own explicit if statement, and those kinds of things. Since learning F#, I've realised I've found it easier to code in the functional way, where if you want to filter, you write "filter", and if you want to map, you write "map", instead of implementing each of the details.
I also love the |> operator, which I think separates F# from ocaml, and possibly other functional languages. It's the pipe operator, it lets you "pipe" the output of one expression into the input of another expression. It makes the code follow how I think more. Like in the code snippet above, that's saying, "take the factors sequence, filter it, then map it." It's a very high level of thinking, which you don't get in an imperative programming language because you're so busy writing the loop and if statements. It's the one thing I miss the most whenever I go into another language.
So just in general, even though I can program in both C# and F#, I find it easier to use F# because you can think at a higher level. I would argue that because the smaller details are removed from functional programming (in F# at least), that I am more productive.
Edit: I saw in one of the comments that you asked for an example of "state" in a functional programming language. F# can be written imperatively, so here's a direct example of how you can have mutable state in F#:
let mutable x = 5
for i in 1..10 do
x <- x + i
Consider all the difficult bugs you've spent a long time debugging.
Now, how many of those bugs were due to "unintended interactions" between two separate components of a program? (Nearly all threading bugs have this form: races involving writing shared data, deadlocks, ... Additionally, it is common to find libraries that have some unexpected effect on global state, or read/write the registry/environment, etc.) I would posit that at least 1 in 3 'hard bugs' fall into this category.
Now if you switch to stateless/immutable/pure programming, all those bugs go away. You are presented with some new challenges instead (e.g. when you do want different modules to interact with the environment), but in a language like Haskell, those interactions get explicitly reified into the type system, which means you can just look at the type of a function and reason about the type of interactions it can have with the rest of the program.
That's the big win from 'immutability' IMO. In an ideal world, we'd all design terrific APIs and even when things were mutable, effects would be local and well-documented and 'unexpected' interactions would be kept to a minimum. In the real world, there are lots of APIs that interact with global state in myriad ways, and these are the source of the most pernicious bugs. Aspiring to statelessness is aspiring to be rid of unintended/implicit/behind-the-scenes interactions among components.
One advantage of stateless functions is that they permit precalculation or caching of the function's return values. Even some C compilers allow you to explicitly mark functions as stateless to improve their optimisability. As many others have noted, stateless functions are much easier to parallelise.
But efficiency is not the only concern. A pure function is easier to test and debug since anything that affects it is explicitly stated. And when programming in a functional language, one gets in the habit of making as few functions "dirty" (with I/O, etc.) as possible. Separating out the stateful stuff this way is a good way to design programs, even in not-so-functional languages.
Functional languages can take a while to "get", and it's difficult to explain to someone who hasn't gone through that process. But most people who persist long enough finally realise that the fuss is worth it, even if they don't end up using functional languages much.
Without state, it is very easy to automatically parallelize your code (as CPUs are made with more and more cores this is very important).
Stateless web applications are essential when you start having higher traffic.
There could be plenty of user data that you don't want to store on the client side for security reasons for example. In this case you need to store it server-side. You could use the web applications default session but if you have more than one instance of the application you will need to make sure that each user is always directed to the same instance.
Load balancers often have the ability to have 'sticky sessions' where the load balancer some how knows which server to send the users request to. This is not ideal though, for example it means every time you restart your web application, all connected users will lose their session.
A better approach is to store the session behind the web servers in some sort of data store, these days there are loads of great nosql products available for this (redis, mongo, elasticsearch, memcached). This way the web servers are stateless but you still have state server-side and the availability of this state can be managed by choosing the right datastore setup. These data stores usually have great redundancy so it should almost always be possible to make changes to your web application and even the data store without impacting the users.
My understanding is that FP also has a huge impact on testing. Not having a mutable state will often force you to supply more data to a function than you would have to for a class. There's tradeoffs, but think about how easy it would be to test a function that is "incrementNumberByN" rather than a "Counter" class.
Object
describe("counter", () => {
it("should increment the count by one when 'increment' invoked without
argument", () => {
const counter = new Counter(0)
counter.increment()
expect(counter.count).toBe(1)
})
it("should increment the count by n when 'increment' invoked with
argument", () => {
const counter = new Counter(0)
counter.increment(2)
expect(counter.count).toBe(2)
})
})
functional
describe("incrementNumberBy(startingNumber, increment)", () => {
it("should increment by 1 if n not supplied"){
expect(incrementNumberBy(0)).toBe(1)
}
it("should increment by 1 if n = 1 supplied"){
expect(countBy(0, 1)).toBe(1)
}
})
Since the function has no state and the data going in is more explicit, there are fewer things to focus on when you are trying to figure out why a test might be failing. On the tests for the counter we had to do
const counter = new Counter(0)
counter.increment()
expect(counter.count).toBe(1)
Both of the first two lines contribute to the value of counter.count. In a simple example like this 1 vs 2 lines of potentially problematic code isn't a big deal, but when you deal with a more complex object you might be adding a ton of complexity to your testing as well.
In contrast, when you write a project in a functional language, it nudges you towards keeping fancy algorithms dependent on the data flowing in and out of a particular function, rather than being dependent on the state of your system.
Another way of looking at it would be illustrating the mindset for testing a system in each paradigm.
For Functional Programming: Make sure function A works for given inputs, you make sure function B works with given inputs, make sure C works with given inputs.
For OOP: Make sure Object A's method works given an input argument of X after doing Y and Z to the state of the object. Make sure Object B's method works given an input argument of X after doing W and Y to the state of the object.
The advantages of stateless programming coincide with those goto-free programming, only more so.
Though many descriptions of functional programming emphasize the lack of mutation, the lack of mutation also goes hand in hand with the lack of unconditional control transfers, such as loops. In functional programming languages, recursion, in particularly tail recursion, replaces looping. Recursion eliminates both the unconditional control construct and the mutation of variables in the same stroke. The recursive call binds argument values to parameters, rather than assigning values.
To understand why this is advantageous, rather than turning to functional programming literature, we can consult the 1968 paper by Dijkstra, "Go To Statement Considered Harmful":
"The unbridled use of the go to statement has an immediate consequence that it becomes terribly hard to find a meaningful set of coordinates in which to describe the process progress."
Dijkstra's observations, however still apply to structured programs which avoid go to, because statements like while, if and whatnot are just window dressing on go to! Without using go to, we can still find it impossible to find the coordinates in which to describe the process progress. Dijkstra neglected to observe that bridled go to still has all the same issues.
What this means is that at any given point in the execution of the program, it is not clear how we got there. When we run into a bug, we have to use backwards reasoning: how did we end up in this state? How did we branch into this point of the code? Often it is hard to follow: the trail goes back a few steps and then runs cold due to a vastness of possibilities.
Functional programming gives us the absolute coordinates. We can rely on analytical tools like mathematical induction to understand how the program arrived into a certain situation.
For example, to convince ourselves that a recursive function is correct, we can just verify its base cases, and then understand and check its inductive hypothesis.
If the logic is written as a loop with mutating variables, we need a more complicated set of tools: breaking down the logic into steps with pre- and post-conditions, which we rewrite in terms mathematics that refers to the prior and current values of variables and such. Yes, if the program uses only certain control structures, avoiding go to, then the analysis is somewhat easier. The tools are tailored to the structures: we have a recipe for how we analyze the correctness of an if, while, and other structures.
However, by contrast, in a functional program there is no prior value of any variable to reason about; that whole class of problem has gone away.
Haskel and Prolog are good examples of languages which may be implemented as stateless programming languages. But unfortunately they are not so far. Both Prolog and Haskel have imperative implementations currently. See some SMT's, seem closer to stateless coding.
This is why you are having hard time seeing any benefits from these programing languages. Due to imperative implementations we have no performance and stability benefits. So the lack of stateless languages infrastructure is the main reason you feel no any stateless programming language due to its absence.
These are some benefits of pure stateless:
Task description is the program (compact code)
Stability due to absense of state-dependant bugs (the most of bugs)
Cachable results (a set of inputs always cause same set of outputs)
Distributable computations
Rebaseable to quantum computations
Thin code for multiple overlapping clauses
Allows differentiable programming optimizations
Consistently applying code changes (adding logic breaks nothing written)
Optimized combinatorics (no need to bruteforce enumerations)
Stateless coding is about concentrating on relations between data which then used for computing by deducing it. Basically this is the next level of programming abstraction. It is much closer to native language then any imperative programming languages because it allow describing relations instead of state change sequences.

Resources