Could you implement async-await by memcopying stack frames rather than creating state machines?

Could you implement async-await by memcopying stack frames rather than creating state machines? - asynchronous

I am trying to understand all the low-level stuff Compilers / Interpreters / the Kernel do for you (because I'm yet another person who thinks they could design a language that's better than most others)
One of the many things that sparked my curiosity is Async-Await.
I've checked the under-the-hood implementation for a couple languages, including C# (the compiler generates the state machine from sugar code) and Rust (where the state machine has to be implemented manually from the Future trait), and they all implement Async-Await using state machines.
I've not found anything useful by googling ("async copy stack frame" and variations) or in the "Similar questions" section.
To me, this method seems rather complicated and overhead-heavy;
Could you not implement Async-Await by simply memcopying the stack frames of async calls to/from heap?
I'm aware that it is architecturally impossible for some languages (I thank the CLR can't do it, so C# can't either).
Am I missing something that makes this logically impossible? I would expect less complicated code and a performance boost from doing it that way, am I mistaken? I suppose when you have a deep stack hierarchy after a async call (eg. a recursive async function) the amount of data you would have to memcopy is rather large, but there are probably ways to work around that.
If this is possible, then why isn't it done anywhere?

Yes, an alternative to converting code into state machines is copying stacks around. This is the way that the go language does it now, and the way that Java will do it when Project Loom is released.
It's not an easy thing to do for real-world languages.
It doesn't work for C and C++, for example, because those languages let you make pointers to things on the stack. Those pointers can be used by other threads, so you can't move the stack away, and even if you could, you would have to copy it back into exactly the same place.
For the same reason, it doesn't work when your program calls out to the OS or native code and gets called back in the same thread, because there's a portion of the stack you don't control. In Java, project Loom's 'virtual threads' will not release the thread as long as there's native code on the stack.
Even in situations where you can move the stack, it requires dedicated support in the runtime environment. The stack can't just be copied into a byte array. It has to be copied off in a representation that allows the garbage collector to recognize all the pointers in it. If C# were to adopt this technique, for example, it would require significant extensions to the common language runtime, whereas implementing state machines can be accomplished entirely within the C# compiler.

I would first like to begin by saying that this answer is only meant to serve as a starting point to go in the actual direction of your exploration. This includes various pointers and building up on the work of various other authors
I've checked the under-the-hood implementation for a couple languages, including C# (the compiler generates the state machine from sugar code) and Rust (where the state machine has to be implemented manually from the Future trait), and they all implement Async-Await using state machines
You understood correctly that the Async/Await implementation for C# and Rust use state machines. Let us understand now as to why are those implementations chosen.
To put the general structure of stack frames in very simple terms, whatever we put inside a stack frame are temporary allocations which are not going to outlive the method which resulted in the addition of that stack frame (including, but not limited to local variables). It also contains the information of the continuation, ie. the address of the code that needs to be executed next (in other words, the control has to return to), within the context of the recently called method. If this is a case of synchronous execution, the methods are executed one after the other. In other words, the caller method is suspended until the called method finishes execution. This, from a stack perspective fits in intuitively. If we are done with the execution of a called method, the control is returned to the caller and the stack frame can be popped off. It is also cheap and efficient from a perspective of the hardware that is running this code as well (hardware is optimised for programming with stacks).
In the case of asynchronous code, the continuation of a method might have to trigger several other methods that might get called from within the continuation of callers. Take a look at this answer, where Eric Lippert outlines the entirety of how the stack works for an asynchronous flow. The problem with asynchronous flow is that, the method calls do not exactly form a stack and trying to handle them like pure stacks may get extremely complicated. As Eric says in the answer, that is why C# uses graph of heap-allocated tasks and delegates that represents a workflow.
However, if you consider languages like Go, the asynchrony is handled in a different way altogether. We have something called Goroutines and here is no need for await statements in Go. Each of these Goroutines are started on their own threads that are lightweight (each of them have their own stacks, which defaults to 8KB in size) and the synchronization between each of them is achieved through communication through channels. These lightweight threads are capable of waiting asynchronously for any read operation to be performed on the channel and suspend themselves. The earlier implementation in Go is done using the SplitStacks technique. This implementation had its own problems as listed out here and replaced by Contigious Stacks. The article also talks about the newer implementation.
One important thing to note here is that it is not just the complexity involved in handling the continuation between the tasks that contribute to the approach chosen to implement Async/Await, there are other factors like Garbage Collection that play a role. GC process should be as performant as possible. If we move stacks around, GC becomes inefficient because accessing an object then would require thread synchronization.
Could you not implement Async-Await by simply memcopying the stack frames of async calls to/from heap?
In short, you can. As this answer states here, Chicken Scheme uses a something similar to what you are exploring. It begins by allocating everything on the stack and move the stack values to heap when it becomes too large for the GC activities (Chicken Scheme uses Generational GC). However, there are certain caveats with this kind of implementation. Take a look at this FAQ of Chicken Scheme. There is also lot of academic research in this area (linked in the answer referred to in the beginning of the paragraph, which I shall summarise under further readings) that you may want to look at.
Further Reading
Continuation Passing Style
call-with-current-continuation
The classic SICP book
This answer (contains few links to academic research in this area)
TLDR
The decision of which approach to be taken is subjective to factors that affect the overall usability and performance of the language. State Machines are not the only way to implement the Async/Await functionality as done in C# and Rust. Few languages like Go implement a Contigious Stack approach coordinated over channels for asynchronous operations. Chicken Scheme allocates everything on the stack and moves the recent stack value to heap in case it becomes heavy for its GC algorithm's performance. Moving stacks around has its own set of implications that affect garbage collection negatively. Going through the research done in this space will help you understand the advancements and rationale behind each of the approaches. At the same time, you should also give a thought to how you are planning on designing/implementing the other parts of your language for it be anywhere close to be usable in terms of performance and overall usability.
PS: Given the length of this answer, will be happy to correct any inconsistencies that may have crept in.

I have been looking into various strategies for doing this myseøf, because I naturally thi k I can design a language better than anybody else - same as you. I just want to emphasize that when I say better, I actually mean better as in tastes better for my liking, and not objectively better.
I have come to a few different approaches, and to summarize: It really depends on many other design choices you have made in the language.
It is all about compromises; each approach has advantages and disadvantages.
It feels like the compiler design community are still very focused on garbage collection and minimizing memory waste, and perhaps there is room for some innovation for more lazy and less purist language designers given the vast resources available to modern computers?
How about not having a call stack at all?
It is possible to implement a language without using a call stack.
Pass continuations. The function currently running is responsible for keeping and resuming the state of the caller. Async/await and generators come naturally.
Preallocated static memory addresses for all local variables in all declared functions in the entire program. This approach causes other problems, of course.
If this is your design, then asymc functions seem trivial
Tree shaped stack
With a tree shaped stack, you can keep all stack frames until the function is completely done. It does not matter if you allow progress on any ancestor stack frame, as long as you let the async frame live on until it is no longer needed.
Linear stack
How about serializing the function state? It seems like a variant of continuations.
Independent stack frames on the heap
Simply treat invocations like you treat other pointers to any value on the heap.
All of the above are trivialized approaches, but one thing they have in common related to your question:
Just find a way to store any locals needed to resume the function. And don't forget to store the program counter in the stack frame as well.

Related

What is the need for immutable/persistent data structures in erlang

Each Erlang process maintains its own private address space. All communication happens via copying without sharing (except big binaries). If each process is processing one message at a time with no concurrent access over its objects, I don't see why do we need immutable/persistent data structures.

Erlang was initially implemented in Prolog, which doesn't really use mutable data structures either (though some dialects do). So it started off without them. This makes runtime implementation simpler and faster (garbage collection in particular).
So adding mutable data structures would require a lot of effort, could introduce bugs, and Erlang programmers are nearly by definition at least willing to live without them.
Many actually consider their absence to be a positive good: less concern about object identity, no need for defensive copying because you don't know whether some other piece of code is going to modify the data you passed (or might be changed later to modify it), etc.
This absence does mean that Erlang is pretty unusable in some domains (e.g. high performance scientific computing), at least as the main language. But again, this means that nobody in these domains is going to use Erlang in the first place and so there's no particular incentive to make it usable at the cost of making existing users unhappy.
I remember seeing a mailing list post by Joe Armstrong quite a long time ago (which I couldn't find with a quick search now) saying that he initially planned to add mutable variables when he'd need them... except he never quite did, and performance was good enough for everything he was using Erlang for.

It is indeed the case that in Erlang immutability does not solve any "shared state" problems, as immutable data are "process local".
From the functional programming language perspective, however, immutability offers a number of benefits, summarized adequately in this Quora answer:
The simplest definition of functional programming is that it’s a programming
paradigm where you are transforming immutable data with functions.
The definition uses functions in the mathematical sense, where it’s
something that takes an input, and produces an output.
OO + mutability tends to violate that definition because when you want
to change a piece of data it generally will not return the output, it
will likely return void or unit, and that when you call a method on
the object the object itself isn’t input for the function.
As far as what advantages the paradigm has, composability, thread
safety, being able to track what went wrong where better, the ability
to sort of separate the data from the actual computation on it being
done, etc.

how would this work?
factorial(1) -> 1;
factorial(X) ->
X*factorial(X-1).
if you run factorial(4), a single process will be running the same function. Each time the function will have it's own value of X, if the value of X was in the scope of the process and not the function recursive functions wouldn't work. So first we need to understand scope. If you want to say that you don't see why data needs to be immutable within the scope of a single function/block you would have a point, but it would be a headache to think about where data is immutable and where it isn't.

2 questions at the end of a functional programming course

Here seems to be the two biggest things I can take from the How to Design Programs (simplified Racket) course I just finished, straight from the lecture notes of the course:
1) Tail call optimization, and the lack thereof in non-functional languages:
Sadly, most other languages do not support TAIL CALL
OPTIMIZATION. Put another way, they do build up a stack
even for tail calls.
Tail call optimization was invented in the mid 70s, long
after the main elements of most languages were developed.
Because they do not have tail call optimization, these
languages provide a fixed set of LOOPING CONSTRUCTS that
make it possible to traverse arbitrary sized data.
a) What are the equivalents to this type of optimization in procedural languages that don't feature it?
b) Do using those equivalents mean we avoid building up a stack in similar situations in languages that don't have it?
2) Mutation and multicore processors
This mechanism is fundamental in almost any other language you
program in. We have delayed introducing it until now for
several reasons:
despite being fundamental, it is surprisingly complex
overuse of it leads to programs that are not amenable
to parallelization (running on multiple processors).
Since multi-core computers are now common, the ability
to use mutation only when needed is becoming more and
more important
overuse of mutation can also make it difficult to
understand programs, and difficult to test them well
But mutable variables are important, and learning this mechanism
will give you more preparation to work with Java, Python and many
other languages. Even in such languages, you want to use a style
called "mostly functional programming".
I learned some Java, Python and C++ before taking this course, so came to take mutation for granted. Now that has been all thrown in the air by the above statement. My questions are:
a) where could I find more detailed information regarding what is suggested in the 2nd bullet, and what to do about it, and
b) what kind of patterns would emerge from a "mostly functional programming" style, as opposed to a more careless style I probably would have had had I continued on with those other languages instead of taking this course?

As Leppie points out, looping constructs manage to recover the space savings of proper tail calling, for the particular kinds of loops that they support. The only problem with looping constructs is that the ones you have are never enough, unless you just hurl the ball into the user's court and force them to model the stack explicitly.
To take an example, suppose you're traversing a binary tree using a loop. It works... but you need to explicitly keep track of the "ones to come back to." A recursive traversal in a tail-calling language allows you to have your cake and eat it too, by not wasting space when not required, and not forcing you to keep track of the stack yourself.
Your question on parallelism and concurrency is much more wide-open, and the best pointers are probably to areas of research, rather than existing solutions. I think that most would agree that there's a crisis going on in the computing world; how do we adapt our mutation-heavy programming skills to the new multi-core world?
Simply switching to a functional paradigm isn't a silver bullet here, either; we still don't know how to write high-level code and generate blazing fast non-mutating run-concurrently code. Lots of folks are working on this, though!

To expand on the "mutability makes parallelism hard" concept, when you have multiple cores going, you have to use synchronisation if you want to modify something from one core and have it be seen consistently by all the other cores.
Getting synchronisation right is hard. If you over-synchronise, you have deadlocks, slow (serial rather than parallel) performance, etc. If you under-synchronise, you have partially-observed changes (where another core sees only a portion of the changes you made from a different core), leaving your objects observed in an invalid "halfway changed" state.
It is for that reason that many functional programming languages encourage a message-queue concept instead of a shared state concept. In that case, the only shared state is the message queue, and managing synchronisation in a message queue is a solved problem.

a) What are the equivalents to this type of optimization in procedural languages that don't feature it? b) Do using those equivalents mean we avoid building up a stack in similar situations in languages that don't have it?
Well, the significance of a tail call is that it can evaluate another function without adding to the call stack, so anything that builds up the stack can't really be called an equivalent.
A tail call behaves essentially like a jump to the new code, using the language trappings of a function call and all the appropriate detail management. So in languages without this optimization, you'd use a jump within a single function. Loops, conditional blocks, or even arbitrary goto statements if nothing else works.
a) where could I find more detailed information regarding what is suggested in the 2nd bullet, and what to do about it
The second bullet sounds like an oversimplification. There are many ways to make parallelization more difficult than it needs to be, and overuse of mutation is just one.
However, note that parallelization (splitting a task into pieces that can be done simultaneously) is not entirely the same thing as concurrency (having multiple tasks executed simultaneously that may interact), though there's certainly overlap. Avoiding mutation is incredibly helpful in writing concurrent programs, since immutable data avoids a lot of race conditions and resource contention that would otherwise be possible.
b) what kind of patterns would emerge from a "mostly functional programming" style, as opposed to a more careless style I probably would have had had I continued on with those other languages instead of taking this course?
Have you looked at Haskell or Clojure? Both are heavily inclined to a very functional style emphasizing controlled mutation. Haskell is more rigorous about it but has a lot of tools for working with limited forms of mutability, while Clojure is a bit more informal and might be more familiar to you since it's another Lisp dialect.

How to use non-blocking or asynchronous IO with Boost Spirit?

Does Spirit provide any capabilities for working with non-blocking IO?
To provide a more concrete example: I'd like to use Boost's Spirit parsing framework to parse data coming in from a network socket that's been placed in non-blocking mode. If the data is not completely available, I'd like to be able to use that thread to perform other work instead of blocking.
The trivial answer is to simply read all the data before invoking Spirit, but potentially gigabytes of data would need to be received and parsed from the socket.
It seems like that in order to support non-blocking I/O while parsing, Spirit would need some ability to partially parse the data and be able to pause and save its parse state when no more data is available. Additionally, it would need to be able to resume parsing from the saved parse state when data does become available. Or maybe I'm making this too complicated?

TODO Will post a example for a simple single-threaded 'event-based' parsing model. This is largely trivial but might just be what you need.
For anything less trivial, please heed to following considerations/hints/tips:
How would you be consuming the result? You wouldn't have the synthesized attributes any earlier anyway, or are you intending to use semantic actions on the fly?
That doesn't usually work well due to backtracking. The caveats could be worked around by careful and judicious use of qi::hold, qi::locals and putting semantic actions with side-effects only at stations that will never be backtracked. In other words:
this is bound to be very errorprone
this naturally applies to a limited set of grammars only (those grammars with rich contextual information will not lend themselves well for this treatment).
Now, everything can be forced, of course, but in general, experienced programmers should have learned to avoid swimming upstream.
Now, if you still want to do this:
You should be able to get spirit library thread safe / reentrant by defining BOOST_SPIRIT_THREADSAFE and linking to libboost_thread. Note this makes the gobals used by Spirit threadsafe (at the cost of fine grained locking) but not your parsers: you can't share your own parsers/rules/sub grammars/expressions across threads. In fact, you can only share you own (Phoenix/Fusion) functors iff they are threadsafe, and any other extensions defined outside the core Spirit library should be audited for thread-safety.
If you manage the above, I think by far the best approach would seem to
use boost::spirit::istream_iterator (or, for binary/raw character streams I'd prefer to define a similar boost::spirit::istreambuf_iterator using the boost::spirit::multi_pass<> template class) to consume the input. Note that depending on your grammar, quite a bit of memory could be used for buffering and the performance is suboptimal
run the parser on it's own thread (or logical thread, e.g. Boost Asio 'strands' or its famous 'stackless coprocedures')
use coarse-grained semantic actions like shown above to pass messages to another logical thread that does the actual processing.
Some more loose pointers:
you can easily 'fuse' some functions to handle lazy evaluation of your semantic action handlers using BOOST_FUSION_ADAPT_FUNCTION and friends; This reduces the amount of cruft you have to write to get simple things working like normal C++ overload resolution in semantic actions - especially when you're not using C++0X and BOOST_RESULT_OF_USE_DECLTYPE
Because you will want to avoid semantic actions with side-effects, you should probably look at Inherited Attributes and qi::locals<> to coordinate state across rules in 'pure functional fashion'.

Immutable game object, basic functional programming question

I'm in the process of trying to 'learn more of' and 'learn lessons from' functional programming and the idea of immutability being good for concurrency, etc.
As a thought exercise I imagined a simple game where Mario-esq type character can run and jump around with enemies that shoot at him...
Then I tried to imagine this being written functionally using immutable objects.
This raised some questions that puzzled me (being an Imperative OO programmer).
1) If my little guy at position x10,y100 moves right 1 unit do I just re-instantiate him using his old values with a +1 to his x position (e.g x11,y100)?
2) (If my first assumption is correct)
If my input thread moves little guy right 1 unit and my enemy AI thread shoots little guy and enemy-ai-thread resolves before input-thread then my guy will loose health, then upon input thread resolving, gain it back and move right ...
Does this mean I can't fire-&-forget my threads even with immutability?
Do I need to send my threads off to do their thing then new()up little guy synchronously when I have the results of both threaded operations? or is there a simple 'functional' solution?
This is a slightly different threading problem than I face on a day to day basis.
Usually I have to decide if I care about what order threads resolve in or not. Where as in the above case I technically don't care if he takes damage or moves first. But I do care if race conditions during instantiation cause one threads data to be totally lost.
3) (Again if my first assumption is correct) Does constantly instantiating new instances of an object (e.g Mario guy) have a horrible overhead that makes it a very serious/important design decision ?
EDIT
Sorry for this additional edit, I wasn't what good practice is on here about follow up questions...
4) If immutability is something I should strive for and even jump though hoops of instantiating new versions of objects that have changed...And If I instantiate my guy every time he moves (only with a different position) don't I have exactly the same problems as I would if he was mutable? in as much that something that referenced him at one point in time is actually looking at old values?.. The more I dig into this the more my head's spinning as generating new versions of the same thing with differing values just seems like mutability, via hack. :¬?
I guess my question is: How should this work? and how is it beneficial over just mutating his position?
for(ever)//simplified game-loop update or "tick" method
{
if(Keyboard.IsDown(Key.Right)
guy = new Guy(guy){location = new Point(guy.Location.x +1, guy.Location.y)};
}
Also confusing is: The above code means that guy is mutable!(even if his properties are not)
4.5) Is that at all possible with a totally immutable guy?
Thanks,
J.

A couple comments on your points:
1) Yes, maybe. To reduce overhead, a practical design will probably end up sharing a lot of state between these instances. For example, perhaps your little guy has an "Equipment" structure which is also immutable. The new copy and the old copy can reference the same "equipment" structure safely, since it's immutable; so you only have to copy a reference, not the whole thing. This is an common advantage you only get thanks to immutability -- if "equipment" was mutable, you couldn't share the reference, since if it changed, your "old" version would change too.
2) In a game, the most practical solution to this issue would probably be to have a global "clock" and have this sort of processing happen once, at a clock tick. Note that your exact scenario would still be a problem if you didn't write it in a functional style: Suppose H0 is the health at time T. If you passed H0 to a function which made a decision about health at time T, you took damage at time T+1, and then the function returned at time T+5, it might have made the wrong decision based on your current health.
3) In a language that encourages functional programming, object instantiation is often made as cheap as possible. I know that on the JVM, creating small objects on the heap is so fast that it's rarely a performance consideration in any practical situation at all, and in C# I've never encountered a situation where it was a concern either.

If my little guy at position
x10,y100 moves right 1 unit do I just
re-instantiate him using his old
values with a +1 to his x position
(e.g x11,y100)?
Well, not necessarily. You could instantiate the guy once, and change its position during play. You may model this with agents. The guy is an agent, so is the AI, so is the render thread, so is the user.
When the AI shoots the guy, it sends it a message, when the user presses an arrow key that sends another message and so on.
let guyAgent (guy, position, health) =
let messages = receiveMessages()
let (newPosition, newHealth) = process(messages)
sendMessage(renderer, (guy, newPosition, newHealth))
guyAgent (guy, newPosition, newHealth)
"Everything" is immutable now (actually, under the hood the agent's dipatch queue does have some mutable state probably).
If immutability is something I
should strive for and even jump though
hoops of instantiating new versions of
objects that have changed...And If I
instantiate my guy every time he moves
(only with a different position) don't
I have exactly the same problems as I
would if he was mutable?
Well, yes. Looping with mutable values and recurring with immutable ones is equivalent.
Edit:
For agents, the wiki is always helpful.
Luca Bolognese has an F# implementation of agents.
This book (called by some The Intelligent Agent Book), though targeting the AI applications (instead of having a SW engineering point of view) is excellent.

If everything in the global system state, outside the current stack frame, is immutable, unless gives another thread a reference to something on the stack (VERY DANGEROUS) there won't be any way for a threads to do anything to affect each other. You could fire and forget, or simply not bother firing in the first place, and the effect would be the same.
Assuming there are some parts of the global state that are mutable, one useful pattern is:
Do
Latch a mutable reference to an immutable object
Generate a new object based upon the latched reference
Loop While CompareExchange fails.
The compare exchange should update the mutable reference to the new one if it still points to old one. This avoids the overhead of locking if there is no concurrent access, but may perform worse than locking if many threads are try to update the same object and generating a new instance from the latched one is slow. One advantage of this approach is that there is no danger of deadlock, though in some situations liveLock could occur.

Another functional approach to this sort of problem is to take a step back and separate out the idea of state from the idea of your little guy.
Your state will include your little guy's position, as well as the position of your baddy and it's shot, and then you have some functions that take some or all of the state and do things like generating the next state and drawing the screen.
The timing issues you're talking about when things you want to parallelize depend on each other are real problems that won't magically go away, although the solutions may be more or less convenient in different languages.
Several suggestions have already been made, and there are a variety of concurrency solutions. The central clock and agents would work, as would Software Transactional Memory, Mutexes or CSP (go style channels), and probably others. The best approach is going to depend on the specifics of the problem, and to a certain extent on personal taste.
As for the head-spinning, try not to get too caught up in whether a thing is changing or not. The point of immutability is not that things don't change, it's that you can create pure functions so that your program is easier to reason about.
For example, an OO program might have a drawing function that iterates over all the objects in a scene, and asks them all to draw themselves, where a functional program might have a function that takes a state and draws a frame.
The end result would be the same scene, but the way the logic and the state is organised is very different.
I, for one, find that it's much easier to work on when you have all the data over here, in one big input lump, and all the drawing logic there, encapsulated in some functions. There are some pretty clear architectural wins too - serialization, testing, and swapping out front ends gets a lot easier with this sort of structure.

Not everything in your program should be immutable. A player's position is something you would expect to be mutable. His name, maybe not.
Immutability is good, but you should perhaps rethink your approach to use more concurrent solutions than simple "immutable"ize everything. Consider this
Thread AI gets copy of your position
You move three units to the left.
AI shoots you based on your old position, and hits... shouldn't happen!
Also, most gaming is done in "game ticks" - there's not much multithreading going on!

Advantages of stateless programming?

I've recently been learning about functional programming (specifically Haskell, but I've gone through tutorials on Lisp and Erlang as well). While I found the concepts very enlightening, I still don't see the practical side of the "no side effects" concept. What are the practical advantages of it? I'm trying to think in the functional mindset, but there are some situations that just seem overly complex without the ability to save state in an easy way (I don't consider Haskell's monads 'easy').
Is it worth continuing to learn Haskell (or another purely functional language) in-depth? Is functional or stateless programming actually more productive than procedural? Is it likely that I will continue to use Haskell or another functional language later, or should I learn it only for the understanding?
I care less about performance than productivity. So I'm mainly asking if I will be more productive in a functional language than a procedural/object-oriented/whatever.

Read Functional Programming in a Nutshell.
There are lots of advantages to stateless programming, not least of which is dramatically multithreaded and concurrent code. To put it bluntly, mutable state is enemy of multithreaded code. If values are immutable by default, programmers don't need to worry about one thread mutating the value of shared state between two threads, so it eliminates a whole class of multithreading bugs related to race conditions. Since there are no race conditions, there's no reason to use locks either, so immutability eliminates another whole class of bugs related to deadlocks as well.
That's the big reason why functional programming matters, and probably the best one for jumping on the functional programming train. There are also lots of other benefits, including simplified debugging (i.e. functions are pure and do not mutate state in other parts of an application), more terse and expressive code, less boilerplate code compared to languages which are heavily dependent on design patterns, and the compiler can more aggressively optimize your code.

The more pieces of your program are stateless, the more ways there are to put pieces together without having anything break. The power of the stateless paradigm lies not in statelessness (or purity) per se, but the ability it gives you to write powerful, reusable functions and combine them.
You can find a good tutorial with lots of examples in John Hughes's paper Why Functional Programming Matters (PDF).
You will be gobs more productive, especially if you pick a functional language that also has algebraic data types and pattern matching (Caml, SML, Haskell).

Many of the other answers have focused on the performance (parallelism) side of functional programming, which I believe is very important. However, you did specifically ask about productivity, as in, can you program the same thing faster in a functional paradigm than in an imperative paradigm.
I actually find (from personal experience) that programming in F# matches the way I think better, and so it's easier. I think that's the biggest difference. I've programmed in both F# and C#, and there's a lot less "fighting the language" in F#, which I love. You don't have to think about the details in F#. Here's a few examples of what I've found I really enjoy.
For example, even though F# is statically typed (all types are resolved at compile time), the type inference figures out what types you have, so you don't have to say it. And if it can't figure it out, it automatically makes your function/class/whatever generic. So you never have to write any generic whatever, it's all automatic. I find that means I'm spending more time thinking about the problem and less how to implement it. In fact, whenever I come back to C#, I find I really miss this type inference, you never realise how distracting it is until you don't need to do it anymore.
Also in F#, instead of writing loops, you call functions. It's a subtle change, but significant, because you don't have to think about the loop construct anymore. For example, here's a piece of code which would go through and match something (I can't remember what, it's from a project Euler puzzle):
let matchingFactors =
factors
|> Seq.filter (fun x -> largestPalindrome % x = 0)
|> Seq.map (fun x -> (x, largestPalindrome / x))
I realise that doing a filter then a map (that's a conversion of each element) in C# would be quite simple, but you have to think at a lower level. Particularly, you'd have to write the loop itself, and have your own explicit if statement, and those kinds of things. Since learning F#, I've realised I've found it easier to code in the functional way, where if you want to filter, you write "filter", and if you want to map, you write "map", instead of implementing each of the details.
I also love the |> operator, which I think separates F# from ocaml, and possibly other functional languages. It's the pipe operator, it lets you "pipe" the output of one expression into the input of another expression. It makes the code follow how I think more. Like in the code snippet above, that's saying, "take the factors sequence, filter it, then map it." It's a very high level of thinking, which you don't get in an imperative programming language because you're so busy writing the loop and if statements. It's the one thing I miss the most whenever I go into another language.
So just in general, even though I can program in both C# and F#, I find it easier to use F# because you can think at a higher level. I would argue that because the smaller details are removed from functional programming (in F# at least), that I am more productive.
Edit: I saw in one of the comments that you asked for an example of "state" in a functional programming language. F# can be written imperatively, so here's a direct example of how you can have mutable state in F#:
let mutable x = 5
for i in 1..10 do
x <- x + i

Consider all the difficult bugs you've spent a long time debugging.
Now, how many of those bugs were due to "unintended interactions" between two separate components of a program? (Nearly all threading bugs have this form: races involving writing shared data, deadlocks, ... Additionally, it is common to find libraries that have some unexpected effect on global state, or read/write the registry/environment, etc.) I would posit that at least 1 in 3 'hard bugs' fall into this category.
Now if you switch to stateless/immutable/pure programming, all those bugs go away. You are presented with some new challenges instead (e.g. when you do want different modules to interact with the environment), but in a language like Haskell, those interactions get explicitly reified into the type system, which means you can just look at the type of a function and reason about the type of interactions it can have with the rest of the program.
That's the big win from 'immutability' IMO. In an ideal world, we'd all design terrific APIs and even when things were mutable, effects would be local and well-documented and 'unexpected' interactions would be kept to a minimum. In the real world, there are lots of APIs that interact with global state in myriad ways, and these are the source of the most pernicious bugs. Aspiring to statelessness is aspiring to be rid of unintended/implicit/behind-the-scenes interactions among components.

One advantage of stateless functions is that they permit precalculation or caching of the function's return values. Even some C compilers allow you to explicitly mark functions as stateless to improve their optimisability. As many others have noted, stateless functions are much easier to parallelise.
But efficiency is not the only concern. A pure function is easier to test and debug since anything that affects it is explicitly stated. And when programming in a functional language, one gets in the habit of making as few functions "dirty" (with I/O, etc.) as possible. Separating out the stateful stuff this way is a good way to design programs, even in not-so-functional languages.
Functional languages can take a while to "get", and it's difficult to explain to someone who hasn't gone through that process. But most people who persist long enough finally realise that the fuss is worth it, even if they don't end up using functional languages much.

Without state, it is very easy to automatically parallelize your code (as CPUs are made with more and more cores this is very important).

Stateless web applications are essential when you start having higher traffic.
There could be plenty of user data that you don't want to store on the client side for security reasons for example. In this case you need to store it server-side. You could use the web applications default session but if you have more than one instance of the application you will need to make sure that each user is always directed to the same instance.
Load balancers often have the ability to have 'sticky sessions' where the load balancer some how knows which server to send the users request to. This is not ideal though, for example it means every time you restart your web application, all connected users will lose their session.
A better approach is to store the session behind the web servers in some sort of data store, these days there are loads of great nosql products available for this (redis, mongo, elasticsearch, memcached). This way the web servers are stateless but you still have state server-side and the availability of this state can be managed by choosing the right datastore setup. These data stores usually have great redundancy so it should almost always be possible to make changes to your web application and even the data store without impacting the users.

My understanding is that FP also has a huge impact on testing. Not having a mutable state will often force you to supply more data to a function than you would have to for a class. There's tradeoffs, but think about how easy it would be to test a function that is "incrementNumberByN" rather than a "Counter" class.
Object
describe("counter", () => {
it("should increment the count by one when 'increment' invoked without
argument", () => {
const counter = new Counter(0)
counter.increment()
expect(counter.count).toBe(1)
})
it("should increment the count by n when 'increment' invoked with
argument", () => {
const counter = new Counter(0)
counter.increment(2)
expect(counter.count).toBe(2)
})
})
functional
describe("incrementNumberBy(startingNumber, increment)", () => {
it("should increment by 1 if n not supplied"){
expect(incrementNumberBy(0)).toBe(1)
}
it("should increment by 1 if n = 1 supplied"){
expect(countBy(0, 1)).toBe(1)
}
})
Since the function has no state and the data going in is more explicit, there are fewer things to focus on when you are trying to figure out why a test might be failing. On the tests for the counter we had to do
const counter = new Counter(0)
counter.increment()
expect(counter.count).toBe(1)
Both of the first two lines contribute to the value of counter.count. In a simple example like this 1 vs 2 lines of potentially problematic code isn't a big deal, but when you deal with a more complex object you might be adding a ton of complexity to your testing as well.
In contrast, when you write a project in a functional language, it nudges you towards keeping fancy algorithms dependent on the data flowing in and out of a particular function, rather than being dependent on the state of your system.
Another way of looking at it would be illustrating the mindset for testing a system in each paradigm.
For Functional Programming: Make sure function A works for given inputs, you make sure function B works with given inputs, make sure C works with given inputs.
For OOP: Make sure Object A's method works given an input argument of X after doing Y and Z to the state of the object. Make sure Object B's method works given an input argument of X after doing W and Y to the state of the object.

The advantages of stateless programming coincide with those goto-free programming, only more so.
Though many descriptions of functional programming emphasize the lack of mutation, the lack of mutation also goes hand in hand with the lack of unconditional control transfers, such as loops. In functional programming languages, recursion, in particularly tail recursion, replaces looping. Recursion eliminates both the unconditional control construct and the mutation of variables in the same stroke. The recursive call binds argument values to parameters, rather than assigning values.
To understand why this is advantageous, rather than turning to functional programming literature, we can consult the 1968 paper by Dijkstra, "Go To Statement Considered Harmful":
"The unbridled use of the go to statement has an immediate consequence that it becomes terribly hard to find a meaningful set of coordinates in which to describe the process progress."
Dijkstra's observations, however still apply to structured programs which avoid go to, because statements like while, if and whatnot are just window dressing on go to! Without using go to, we can still find it impossible to find the coordinates in which to describe the process progress. Dijkstra neglected to observe that bridled go to still has all the same issues.
What this means is that at any given point in the execution of the program, it is not clear how we got there. When we run into a bug, we have to use backwards reasoning: how did we end up in this state? How did we branch into this point of the code? Often it is hard to follow: the trail goes back a few steps and then runs cold due to a vastness of possibilities.
Functional programming gives us the absolute coordinates. We can rely on analytical tools like mathematical induction to understand how the program arrived into a certain situation.
For example, to convince ourselves that a recursive function is correct, we can just verify its base cases, and then understand and check its inductive hypothesis.
If the logic is written as a loop with mutating variables, we need a more complicated set of tools: breaking down the logic into steps with pre- and post-conditions, which we rewrite in terms mathematics that refers to the prior and current values of variables and such. Yes, if the program uses only certain control structures, avoiding go to, then the analysis is somewhat easier. The tools are tailored to the structures: we have a recipe for how we analyze the correctness of an if, while, and other structures.
However, by contrast, in a functional program there is no prior value of any variable to reason about; that whole class of problem has gone away.

Haskel and Prolog are good examples of languages which may be implemented as stateless programming languages. But unfortunately they are not so far. Both Prolog and Haskel have imperative implementations currently. See some SMT's, seem closer to stateless coding.
This is why you are having hard time seeing any benefits from these programing languages. Due to imperative implementations we have no performance and stability benefits. So the lack of stateless languages infrastructure is the main reason you feel no any stateless programming language due to its absence.
These are some benefits of pure stateless:
Task description is the program (compact code)
Stability due to absense of state-dependant bugs (the most of bugs)
Cachable results (a set of inputs always cause same set of outputs)
Distributable computations
Rebaseable to quantum computations
Thin code for multiple overlapping clauses
Allows differentiable programming optimizations
Consistently applying code changes (adding logic breaks nothing written)
Optimized combinatorics (no need to bruteforce enumerations)
Stateless coding is about concentrating on relations between data which then used for computing by deducing it. Basically this is the next level of programming abstraction. It is much closer to native language then any imperative programming languages because it allow describing relations instead of state change sequences.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex