Won't an app based on immutable data structures run out of memory? - out-of-memory

Never mind redux or something - I am solely asking about Immutable.JS, Ramda, etc.
If new versions of a data structure is created by structural sharing, that means that every new version needs to have a pointer to the previous version in order for it to be able to share anything. That again means that older versions of a structure cannot be garbage collected, meaning again, that in an app where you have state, this state will use a monotonically increasing amount of memory. If this is the case, then that data structure will at some point have used all available memory, if it keeps getting modified.
Am I missing something here? I can see that for many (most) use cases on the web (in a browser), this won't be a problem, as you are probably just changing a tiny part of the structure each time, and you will probably leave the page or reload it way before you use all memory, but for long running processes this should pose a problem. Right? Riiight?

If new versions of a data structure is created by structural sharing, that means that every new version needs to have a pointer to the previous version in order for it to be able to share anything.
This is not correct in general. The new version would have pointers to subparts of the previous version. So it shares a fraction (
often almost all) of the data of the older version.
For example, Ocaml's maps (represented and implemented by some self-balancing binary search tree variant of red-black trees) are immutable: see documentation of Map. But if you add (or remove) some binding to (from) a map, you get a new map sharing most (but not all) of its internal nodes with the old one.
So the garbage collector would eventually "delete" those old internal nodes which are not relevant to the current "state".
BTW web programming (and web navigation) is related to continuations and continuation-passing style. See e.g. Byrd's Web Programming with Continuations and several papers by C.Queinnec.
Read also more about monads in functional programming.

Related

Is Ada.Containers.Functional_Maps usable in Ada2012?

The information about Ada.Containers.Functional_Maps in the GNAT documentation is quite—let's say—abstruse.
First, it says this:
…these containers can still be used safely.
In the second paragraph, it seems to me that you cannot free the memory allocated for those objects once the program exits the context where they are created. I am understanding that you could run into a memory leak. Am I right?
They are also memory consuming, as the allocated memory is not reclaimed when the container is no longer referenced.
Read the next two sentences in the doc:
Thus, they should in general be used in ghost code and annotations, so that they can be removed from the final executable. The specification of this unit is compatible with SPARK 2014.
Because the specification of Ada.Containers.Functional_Maps is compatible with SPARK, it may help to examine it in the context of related SPARK Libraries with regard to proof, testing and annotation. In particular,
The functional maps, sets and vectors are unbounded collections of indefinite elements that are neither controlled nor limited. While they are inefficient with regard to memory, they are simple, immutable and useful "to model user defined data structures."
The functional containers can be used in Ghost Code, "parts of the code that are only meant for specification and verification", as suggested here. This related example illustrates a ghost function.
it seems to me that you cannot free the memory allocated for those
objects once the program exits the context where they are created. I
am understanding that you could run into a memory leak. Am I right?
There are some things that you can do in Ada to manage memory, I would be surprised if (for example) the usage of an instance inside a declare-block were not cleaned-up on the block's exit. — This is, in fact, how some surprisingly robust applications can get away without "dynamically-allocated" memory/values (it's actually heap-allocated, but that's pedantic).
This sort of granular control is really nice, as you can constrain things/usages to specific points. Combined with Ada's good facilities for presenting interfaces, this means that changing some structure to another can be less-painful than it otherwise might be.
As an example of the above, I had a nested key-value map (a JSON object) that was being used to pass parameters around; the method for doing this changed and so I had a string of values (with common-rooted keys) coming in and a procedure that took JSON as input. Obviously what was needed was a "keys&values-to-JSON function, so inside the function I used the multiway-tree container where the leafs represented values and the internal-nodes the keys, the second step was to traverse the tree and create the JSON-object as needed - simple recursion and data-structure selection used to address the problem of adapting the textual key-value pairs of these nested parameters to JSON. — And because the usage of multi-way trees was exclusive to this function, I can be confident that the memory used by the intermediate tree-object I used is released on the function's exit.

Could you implement async-await by memcopying stack frames rather than creating state machines?

I am trying to understand all the low-level stuff Compilers / Interpreters / the Kernel do for you (because I'm yet another person who thinks they could design a language that's better than most others)
One of the many things that sparked my curiosity is Async-Await.
I've checked the under-the-hood implementation for a couple languages, including C# (the compiler generates the state machine from sugar code) and Rust (where the state machine has to be implemented manually from the Future trait), and they all implement Async-Await using state machines.
I've not found anything useful by googling ("async copy stack frame" and variations) or in the "Similar questions" section.
To me, this method seems rather complicated and overhead-heavy;
Could you not implement Async-Await by simply memcopying the stack frames of async calls to/from heap?
I'm aware that it is architecturally impossible for some languages (I thank the CLR can't do it, so C# can't either).
Am I missing something that makes this logically impossible? I would expect less complicated code and a performance boost from doing it that way, am I mistaken? I suppose when you have a deep stack hierarchy after a async call (eg. a recursive async function) the amount of data you would have to memcopy is rather large, but there are probably ways to work around that.
If this is possible, then why isn't it done anywhere?
Yes, an alternative to converting code into state machines is copying stacks around. This is the way that the go language does it now, and the way that Java will do it when Project Loom is released.
It's not an easy thing to do for real-world languages.
It doesn't work for C and C++, for example, because those languages let you make pointers to things on the stack. Those pointers can be used by other threads, so you can't move the stack away, and even if you could, you would have to copy it back into exactly the same place.
For the same reason, it doesn't work when your program calls out to the OS or native code and gets called back in the same thread, because there's a portion of the stack you don't control. In Java, project Loom's 'virtual threads' will not release the thread as long as there's native code on the stack.
Even in situations where you can move the stack, it requires dedicated support in the runtime environment. The stack can't just be copied into a byte array. It has to be copied off in a representation that allows the garbage collector to recognize all the pointers in it. If C# were to adopt this technique, for example, it would require significant extensions to the common language runtime, whereas implementing state machines can be accomplished entirely within the C# compiler.
I would first like to begin by saying that this answer is only meant to serve as a starting point to go in the actual direction of your exploration. This includes various pointers and building up on the work of various other authors
I've checked the under-the-hood implementation for a couple languages, including C# (the compiler generates the state machine from sugar code) and Rust (where the state machine has to be implemented manually from the Future trait), and they all implement Async-Await using state machines
You understood correctly that the Async/Await implementation for C# and Rust use state machines. Let us understand now as to why are those implementations chosen.
To put the general structure of stack frames in very simple terms, whatever we put inside a stack frame are temporary allocations which are not going to outlive the method which resulted in the addition of that stack frame (including, but not limited to local variables). It also contains the information of the continuation, ie. the address of the code that needs to be executed next (in other words, the control has to return to), within the context of the recently called method. If this is a case of synchronous execution, the methods are executed one after the other. In other words, the caller method is suspended until the called method finishes execution. This, from a stack perspective fits in intuitively. If we are done with the execution of a called method, the control is returned to the caller and the stack frame can be popped off. It is also cheap and efficient from a perspective of the hardware that is running this code as well (hardware is optimised for programming with stacks).
In the case of asynchronous code, the continuation of a method might have to trigger several other methods that might get called from within the continuation of callers. Take a look at this answer, where Eric Lippert outlines the entirety of how the stack works for an asynchronous flow. The problem with asynchronous flow is that, the method calls do not exactly form a stack and trying to handle them like pure stacks may get extremely complicated. As Eric says in the answer, that is why C# uses graph of heap-allocated tasks and delegates that represents a workflow.
However, if you consider languages like Go, the asynchrony is handled in a different way altogether. We have something called Goroutines and here is no need for await statements in Go. Each of these Goroutines are started on their own threads that are lightweight (each of them have their own stacks, which defaults to 8KB in size) and the synchronization between each of them is achieved through communication through channels. These lightweight threads are capable of waiting asynchronously for any read operation to be performed on the channel and suspend themselves. The earlier implementation in Go is done using the SplitStacks technique. This implementation had its own problems as listed out here and replaced by Contigious Stacks. The article also talks about the newer implementation.
One important thing to note here is that it is not just the complexity involved in handling the continuation between the tasks that contribute to the approach chosen to implement Async/Await, there are other factors like Garbage Collection that play a role. GC process should be as performant as possible. If we move stacks around, GC becomes inefficient because accessing an object then would require thread synchronization.
Could you not implement Async-Await by simply memcopying the stack frames of async calls to/from heap?
In short, you can. As this answer states here, Chicken Scheme uses a something similar to what you are exploring. It begins by allocating everything on the stack and move the stack values to heap when it becomes too large for the GC activities (Chicken Scheme uses Generational GC). However, there are certain caveats with this kind of implementation. Take a look at this FAQ of Chicken Scheme. There is also lot of academic research in this area (linked in the answer referred to in the beginning of the paragraph, which I shall summarise under further readings) that you may want to look at.
Further Reading
Continuation Passing Style
call-with-current-continuation
The classic SICP book
This answer (contains few links to academic research in this area)
TLDR
The decision of which approach to be taken is subjective to factors that affect the overall usability and performance of the language. State Machines are not the only way to implement the Async/Await functionality as done in C# and Rust. Few languages like Go implement a Contigious Stack approach coordinated over channels for asynchronous operations. Chicken Scheme allocates everything on the stack and moves the recent stack value to heap in case it becomes heavy for its GC algorithm's performance. Moving stacks around has its own set of implications that affect garbage collection negatively. Going through the research done in this space will help you understand the advancements and rationale behind each of the approaches. At the same time, you should also give a thought to how you are planning on designing/implementing the other parts of your language for it be anywhere close to be usable in terms of performance and overall usability.
PS: Given the length of this answer, will be happy to correct any inconsistencies that may have crept in.
I have been looking into various strategies for doing this myseøf, because I naturally thi k I can design a language better than anybody else - same as you. I just want to emphasize that when I say better, I actually mean better as in tastes better for my liking, and not objectively better.
I have come to a few different approaches, and to summarize: It really depends on many other design choices you have made in the language.
It is all about compromises; each approach has advantages and disadvantages.
It feels like the compiler design community are still very focused on garbage collection and minimizing memory waste, and perhaps there is room for some innovation for more lazy and less purist language designers given the vast resources available to modern computers?
How about not having a call stack at all?
It is possible to implement a language without using a call stack.
Pass continuations. The function currently running is responsible for keeping and resuming the state of the caller. Async/await and generators come naturally.
Preallocated static memory addresses for all local variables in all declared functions in the entire program. This approach causes other problems, of course.
If this is your design, then asymc functions seem trivial
Tree shaped stack
With a tree shaped stack, you can keep all stack frames until the function is completely done. It does not matter if you allow progress on any ancestor stack frame, as long as you let the async frame live on until it is no longer needed.
Linear stack
How about serializing the function state? It seems like a variant of continuations.
Independent stack frames on the heap
Simply treat invocations like you treat other pointers to any value on the heap.
All of the above are trivialized approaches, but one thing they have in common related to your question:
Just find a way to store any locals needed to resume the function. And don't forget to store the program counter in the stack frame as well.

Shouldn't a Redux app with immutable data run out of memory after a while?

I come from a background in embedded systems where you're really careful about memory management. With Redux especially its concept of immutability.
So let's say I'm modifying a member of an array. I have to create a new array that links to all original members plus the modified item.
I understand why using Immutability improves the speed but my question is since we essentially never remove the old copies of the objects and create new ones, Redux still keeps a reference to the old objects because of time traveling features.
Most machines these days have quite a lot of memory, but shouldn't at least in theory the Redux app crash because the tab/process runs out of memory? After a long use maybe?
No. First, Redux itself doesn't keep around old data - that's something that the Redux DevTools addon does. Second, I believe the DevTools addon has limits on how many actions it will track. Third, Javascript is a garbage-collected language, so items that are no longer referenced will be cleaned up. (Hand-waving a bit there, but that generally covers things.)
Immutable.js leverages the idea of Structural Sharing while creating new copies of collections. It implements persistent data structures and internally uses concepts like tries to implement structural sharing. So, if you created a list with 10 items, adding a new item to it will not create a new list.
Persistent data structures provide the benefits of immutability while
maintaining high performance reads and writes and present a familiar
API.
Immutable.js data structures are highly efficient on modern JavaScript VMs by
using structural sharing via hash maps tries and vector tries as
popularized by Clojure and Scala, minimizing the need to copy or cache
data.
I suggest you watch this awesome talk by Lee Bryon, Immutable.js creator

Language without explicit memory alloc/dealloc AND without garbage collection

I was wondering if it is possible to create a programming language without explicit memory allocation/deallocation (like C, C++ ...) AND without garbage collection (like Java, C#...) by doing a full analysis at the end of each scope?
The obvious problem is that this would take some time at the end of each scope, but I was wondering if it has become feasible with all the processing power and multiple cores in current CPU's. Do such languages exist already?
I also was wondering if a variant of C++ where smart pointers are the only pointers that can be used, would be exactly such a language (or am I missing some problems with that?).
Edit:
Well after some more research apparently it's this: http://en.wikipedia.org/wiki/Reference_counting
I was wondering why this isn't more popular. The disadvantages listed there don't seem quite serious, the overhead should be that large according to me. A (non-interpreted, properly written from the ground up) language with C family syntax with reference counting seems like a good idea to me.
The biggest problem with reference counting is that it is not a complete solution and is not capable of collecting a cyclic structure. The overhead is incurred every time you set a reference; for many kinds of problems this adds up quickly and can be worse than just waiting for a GC later. (Modern GC is quite advanced and awesome - don't count it down like that!!!)
What you are talking about is nothing special, and it shows up all the time. The C or C++ variant you are looking for is just plain regular C or C++.
For example write your program normally, but constrain yourself not to use any dynamic memory allocation (no new, delete, malloc, or free, or any of their friends, and make sure your libraries do the same), then you have that kind of system. You figure out in advance how much memory you need for everything you could do, and declare that memory statically (either function level static variables, or global variables). The compiler takes care of all the accounting the normal way, nothing special happens at the end of each scope, and no extra computation is necessary.
You can even configure your runtime environment to have a statically allocated stack space (this one isn't really under the compiler's control, more linker and operating system environment). Just figure out how deep your function call chain goes, and how much memory it uses (with a profiler or similar tool), an set it in your link options.
Without dynamic memory allocation (and thus no deallocation through either garbage collection or explicit management), you are limited to the memory you declared when you wrote the program. But that's ok, many programs don't need dynamic memory, and are already written that way. The real need for this shows up in embedded and real-time systems when you absolutely, positively need to know exactly how long an operation will take, how much memory (and other resources) it will use, and that the running time and the use of those resources can't ever change.
The great thing about C and C++ is that the language requires so little from the environment, and gives you the tools to do so much, that smart pointers or statically allocated memory, or even some special scheme that you dream up can be implemented. Requiring the use them, and the constraints you put on yourself just becomes a policy decision. You can enforce that policy with code auditing (use scripts to scan the source or object files and don't permit linking to the dynamic memory libraries)

Abstraction or not?

The other day i stumbled onto a rather old usenet post by Linus Torwalds. It is the infamous "You are full of bull****" post when he defends his choice of using plain C for Git over something more modern.
In particular this post made me think about the enormous amount of abstraction layers that accumulate one over the other where I work. Mine is a Windows .Net environment. I must say that I like C# and the .Net environment, it really makes most things easy.
Now, I come from a very different background made of Unix technologies like C and a plethora or scripting languages; to me, also, OOP is just one, and not always the best, programming paradigm.. I often struggle (in a working kind of way, of course!) with my colleagues (one in particular), because they appear to be of the "any problem can be solved with an additional level of abstraction" church, while I'm more of the "keeping it simple" school. I think that there is a very different mental approach to the problems that maybe comes from the exposure to different cultures.
As a very simple example, for the first project I did here I needed some configuration for an application. I made a 10 rows class to load and parse a txt file to be located in the program's root dir containing colon separated key / value pairs, one per row. It worked.
In the end, to standardize the approach to the configuration problem, we now have a library to be located on every machine running each configured program that calls a service that, at startup, loads up an xml that contains the references to other xmls, one per application, that contain the configurations themselves.
Now, it is extensible and made up of fancy reusable abstractions, providers and all, but I still think that, if we one day really happen to reuse part of it, with the time taken to make it up, we can make the needed code from start or copy / past the old code and modify it.
What are your thoughts about it? Can you point out some interesting reference dealing with the problem?
Thanks
Abstraction makes it easier to construct software and understand how it is put together, but it complicates fully understanding certain issues around performance and security, because the abstraction layers introduce certain kinds of complexity.
Torvalds' position is not absurd, but he is an extremist.
Simple answer: programming languages provide data structures and ways to combine them. Use these directly at first, do not abstract. If you find you have representation invariants to maintain that are at a high risk of being broken due to a large number of usage sites possibly outside your control, then consider abstraction.
To implement this, first provide functions and convert the call sites to use them without hiding the representation. Hide the data representation only when you're satisfied your functional representation is sufficient. Make sure at this time to document the invariant being protected.
An "extreme programming" version of this: do not abstract until you have test cases that break your program. If you think the invariant can be breached, write the case that breaks it first.
Here's a similar question: https://stackoverflow.com/questions/1992279/abstraction-in-todays-languages-excited-or-sad.
I agree with #Steve Emmerson - 'Coders at Work' would give you some excellent perspective on this issue.

Resources