Related
Raku provides many types that are immutable and thus cannot be modified after they are created. Until I started looking into this area recently, my understanding was that these Types were not persistent data structures – that is, unlike the core types in Clojure or Haskell, my belief was that Raku's immutable types did not take advantage of structural sharing to allow for inexpensive copies. I thought that a statement like my List $new = (|$old-list, 42); literally copied the values in $old-list, without the data-sharing features of persistent data structures.
That description of my understanding is in the past tense, however, due to the following code:
my Array $a = do {
    $_ = [rand xx 10_000_000];
    say "Initialized an Array in $((now - ENTER now).round: .001) seconds"; $_}
my List $l = do {
    $_ = |(rand xx 10_000_000);
    say "Initialized the List in $((now - ENTER now).round: .001) seconds"; $_}
do { $a.push: rand;
     say "Pushed the element to the Array in $((now - ENTER now).round: .000001) seconds" }
do { my $nl = (|$l, rand);
     say "Appended an element to the List in $((now - ENTER now).round: .000001) seconds" }
do { my @na = |$l;
     say "Copied List \$l into a new Array in $((now - ENTER now).round: .001) seconds" }
which produced this output in one run:
Initialized an Array in 5.938 seconds
Initialized the List in 5.639 seconds
Pushed the element to the Array in 0.000109 seconds
Appended an element to the List in 0.000109 seconds
Copied List $l into a new Array in 11.495 seconds
That is, creating a new List with the old values + one more is just as fast as pushing to a mutable Array, and dramatically faster than copying the List into a new Array – exactly the performance characteristics that you'd expect to see from a persistent List (copying to an Array is still slow because it can't take advantage of structural sharing without breaking the immutability of the List). The fast construction of $nl from $l is not due to either List being lazy; neither is.
All of the above leads me to believe that Lists in Rakudo actually are persistent data structures, with all the performance benefits that implies. That leaves me with several questions:
Am I right about Lists being persistent data structures?
Are all other immutable Types also persistent data structures? Or are any?
Is any of this part of Raku, or just an implementation choice Rakudo has made?
Are any of these performance characteristics documented/guaranteed anywhere?
I have to say, I am both extremely impressed and more than a bit baffled to discover evidence that at least some of Raku(do)'s types are persistent. It's the sort of feature that other languages list as a key selling point or that leads to the creation of libraries with 30k+ stars on GitHub. Have we really had it in Raku without even mentioning it?
I remember implementing these semantics, and I certainly don't recall thinking about them giving rise to a persistent data structure at the time - although it does seem fair to attach that label to the result!
I don't think you'll find anywhere that explicitly spells out this exact behavior; however, the most natural implementation of the things the language requires leads to it quite directly. Taking the ingredients:
The infix:<,> operator is the List constructor in Raku
When a List is created, it is non-committal with regards to laziness and flattening (these arise from how we use the List, which we don't - in general - know at the point of its construction)
When we write (|$x, 1), the prefix:<|> operator constructs a Slip, which is a kind of List that should melt into its surrounding List. Thus what infix:<,> sees is a Slip and an Int.
Making the Slip melt into the result List immediately would mean making a commitment about eagerness, which List construction alone should not do. Thus the Slip and everything after it is placed into the lazily evaluated ("non-reified") portion of the List.
The last of these is what gives rise to the observed persistent-data-structure-style behavior.
I expect it would be possible to have an implementation that inspects the Slip and chooses to eagerly copy things that are known not to be lazy, and still be in compliance with the specification test suite. That would change the time complexity of your example. If you want to be defensive against that, then:
do { my $nl = (|$l.lazy, rand);
say "Appended an element to the List in $((now - ENTER now).round: .000001) seconds" }
should be sufficient to force the issue even if the implementation changed.
Other cases that immediately come to mind that relate to persistent data structures, or at least tail sharing:
The MoarVM implementation of strings, which is behind str and thus Str, implements string concatenation by creating a new string that refers to the two that are being concatenated instead of copying the data in the two strings (and does similar tricks for substr and repetition). This is strictly an optimization, not a language requirement, and in some delicate cases (the last grapheme of one string and the first grapheme of the next will form a single grapheme in the resulting string), it gives up and takes the copying path.
Outside of the core, modules like Concurrent::Stack, Concurrent::Queue, and Concurrent::Trie use tail sharing as a technique to implement relatively efficient lock-free data structures.
I am faced with a need to send my data in parts, and at the same time I am expected to provide a sha256 for my WHOLE data.
Something like this:
cat large_file | chunker | receiver
where receiver is an application that is expected to receive the data, possibly in chunks, each carrying the sha256 of its payload in a header followed by the payload itself. After collecting all chunks, it is supposed to store the whole transmitted data along with the sha256 of all the data (that sha256 will be used only to re-hash and confirm the integrity of the data).
Of course, the simplest thing would be for the receiver to generate the sha256 from the whole streamed data, but I was wondering if there is a simpler way: collecting the hashes of all the chunks and combining them to generate one final hash that would be the same as the hash calculated over all the data.
In other words - and I copy this from the title - I wonder if there is a function F that would receive a list of hashes of chunks of data and then generate a final hash equal to the hash calculated over all the data.
And again, in other words, having this formula:
F(sha256(data[0]), sha256(data[1]), ... sha256(data[N])) = sha256(data[0..N])
What would be the function F?
Would it be a universal function, or is there no such thing given the way hashing is calculated?
I suspect there is no such function, or that this is too complicated a question to answer.
AFAIK there are still no known collisions for SHA-256, but I bet that once one is found, i.e. someone finds two messages m1 and m2 such that SHA-256(m1) = SHA-256(m2), then for almost any prefix a the hashes SHA-256(a || m1) and SHA-256(a || m2) will be different. In other words, the function you ask for is actually not a function at all (it would have different outputs for the same inputs). To put it another way, SHA-2 is susceptible to length-extension attacks but AFAIK not to prefixing attacks. And even if this really were a function, it is not enough for such a function to exist; you also need it to be fast to compute, and I believe there is no such fast-to-compute function.
On the other hand SHA-256 works by splitting the original message into 512-bit chunks and processing them using a well defined process (which is based on the state from all the previous chunks) so theoretically you can modify some implementation of SHA-256 to compute two hashes at the same time (by applying the same logic to different initial states):
Hash of your application-defined chunk (using the standard initial state)
Hash of all chunks up to this point (using the output state from the previous application of this step as the initial state).
This probably will be slightly faster than doing those things independently, but I don't know whether it will be enough faster to justify such a custom implementation.
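For the practical receiver side, here is a minimal Python sketch (hashlib only; the (chunk-hash, payload) framing and the receive helper are assumptions for illustration, not part of the question's protocol). It shows the straightforward approach: keep one incremental SHA-256 context for the whole stream while verifying each chunk's hash as it arrives, so no combining function F is ever needed:

```python
import hashlib

def receive(chunks):
    """chunks: iterable of (chunk_sha256_hex, payload_bytes) pairs,
    a stand-in for whatever framing the chunker actually uses."""
    whole = hashlib.sha256()          # running hash of ALL data seen so far
    payload = []
    for expected_hex, data in chunks:
        if hashlib.sha256(data).hexdigest() != expected_hex:
            raise ValueError("chunk hash mismatch")
        whole.update(data)            # feed the same bytes into the stream hash
        payload.append(data)
    return b"".join(payload), whole.hexdigest()

# Example: the stream hash equals sha256 of the concatenated data.
parts = [b"hello ", b"world"]
framed = [(hashlib.sha256(p).hexdigest(), p) for p in parts]
data, digest = receive(framed)
assert digest == hashlib.sha256(b"hello world").hexdigest()
```

The running context holds only SHA-256's internal state, so the whole-stream digest costs no extra buffering beyond what the receiver already stores.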
The CLtL2 reference clearly distinguishes between nondestructive and destructive common-lisp operations. But, within the destructive camp, it seems a little less clear in marking the difference between those which simply return the result, and those which additionally modify a place (given as argument) to contain the result. The usual convention of annexing "f" to such place modifying operations (eg, setf, incf, alexandria:deletef) is somewhat sporadic, and also applies to many place accessors (eg, aref, getf). In an ideal functional programming style (based only on returned values) such confusion is probably not an issue, but it seems like it could lead to programming errors in some practical applications that do use place modification. Since different implementations can handle the place results differently, couldn't portability be affected? It even seems difficult to test a particular implementation's approach.
To better understand the above distinction, I've divided the destructive common-lisp sequence operations into two categories corresponding to "argument returning" and "operation returning". Could someone validate or invalidate these categories for me? I'm assuming these categories could apply to other kinds of destructive operations (for lists, hash-tables, arrays, numbers, etc) too.
Argument returning: fill, replace, map-into
Operation returning: delete, delete-if, delete-if-not, delete-duplicates, nsubstitute, nsubstitute-if, nsubstitute-not-if, nreverse, sort, stable-sort, merge
But, within the destructive camp, it seems a little less clear in marking the difference between those which simply return the result.
There is no easy syntactic marker of which operations are destructive and which are not, even though there are useful conventions like the n prefix. Remember that CL is a standard inspired by different Lisps, which does not help enforce a consistent terminology.
The usual convention of annexing "f" to such place modifying operations (eg, setf, incf, alexandria:deletef) is somewhat sporadic, and also applies to many place accessors (eg, aref, getf).
All setf expanders should end with f, but not everything that ends with f is a setf expander. For example, aref takes its name from array and reference and isn't a macro.
... but it seems like it could lead to programming errors in some practical applications that do use place modification.
Most data is mutable; once you code in CL with that in mind, you take care not to modify data you did not create yourself. As for using a destructive operation in place of a non-destructive one inadvertently, I don't know: I guess it can happen with sort or delete, maybe the first few times you use them. In my mind delete is stronger, more destructive, than simply remove, but maybe that's because I already know the difference.
Since different implementations can handle the place results differently, couldn't portability be affected?
If you want portability, you follow the specification, which does not offer much guarantee w.r.t. which destructive operations are applied. Take for example DELETE (emphasis mine):
Sequence may be destroyed and used to construct the result; however, the result might or might not be identical to sequence.
It is wrong to assume anything about how the list is being modified, or even if it is being modified. You could actually implement delete as an alias of remove in a minimal implementation. In all cases, you use the return value of your function (both delete and remove have the same signature).
Categories
I've divided the destructive common-lisp sequence operations into two categories corresponding to "argument returning" and "operation returning".
It is not clear at all what those categories are supposed to represent. Are these the definitions you have in mind?
an argument returning operation is one which returns one of its arguments as a return value, possibly modified.
an operation returning operation is one where the result is based on one of its arguments, and might be identical to that argument, but need not be.
The definition of operation returning is quite vague and encompasses both destructive and non-destructive operations. I would classify cons as such because it does not return one of its arguments; OTOH, it is a purely functional operation.
I don't really get what those categories offer in addition to destructive or non-destructive.
Setf composition gotcha
Suppose you write a function (remote host key) which gets a value from a remote key/value datastore. Suppose also that you define (setf remote) so that it updates the remote value.
You might expect (setf (first (remote host key)) value) to:
Fetch a list from host, indexed by key,
Replace its first element by value,
Push the changes back to the remote host.
However, step 3 generally does not happen: the local list is modified in place (this is the most efficient alternative, but it makes setf expansions somewhat lazy about updates). You could, though, define a new set of macros with DEFINE-SETF-EXPANDER such that the whole round-trip is always performed.
Let me try to address your question by introducing some concepts.
I hope it helps you to consolidate your knowledge and to find your remaining answers about this subject.
The first concept is that of non-destructive versus destructive behavior.
A function that is non-destructive won't change the data passed to it.
A function that is destructive may change the data passed to it.
You can apply the (non-)destructive nature to something other than a single function. For instance, if a function stores the data passed to it somewhere, say in an object's slot, then the destructiveness depends on that object's behavior, its other operations, events, etc.
The convention for functions that immediately modify their arguments is (usually) to prefix the name with n.
The convention doesn't work the other way around: there are many functions that start with n and are not destructive (e.g. not/null, nth, ninth, notany, notevery, numberp, etc.). There are also notable exceptions, such as delete, merge, sort and stable-sort, which are destructive but don't start with n. The only way to naturally grasp them is with time and experience. In the meantime, always refer to the HyperSpec whenever you see a function you don't know yet.
Moreover, you usually need to store the result of some destructive functions, such as delete and sort, because they may choose to skip the head of the list or to not be destructive at all. delete may actually return nil, the empty list, which is not possible to obtain from a modified cons.
The second concept is that of generalized reference.
A generalized reference is anything that can hold data, such as a variable, the car and cdr of a cons, the element locations of an array or hash table, the slots of an object, etc.
For each container data structure, you need to know the specific modifying function. However, for some generalized references, there might not be a function to modify it, such as a local variable, in which case there are still special forms to modify it.
As such, in order to modify any generalized reference, you need to know its modifying form.
Another concept closely related to generalized references is the place. A form that identifies a generalized reference is called a place. Or in other words, a place is the written way (form) that represents a generalized reference.
For each kind of place, you have a reader form and a writer form.
Some of these forms are documented, such as using the symbol of a variable to read it and setq a variable to write to it, or car/cdr to read from and rplaca/rplacd to write to a cons. Others are only documented to be accessors, such as aref to read from arrays; its writer form is not actually documented.
To get these forms, you have get-setf-expansion. You actually also get a set of variables and their initializing forms (to be bound as if by let*) that will be used by the reader form and/or the writer form, and a set of variables (to be bound to the new values) that will be used by the writer form.
If you've used Lisp before, you've probably used setf. setf is a macro that generates code that runs within the scope (environment) of its expansion.
Essentially, it behaves as if by using get-setf-expansion, generating a let* form for the variables and initializing forms, generating extra bindings for the writer variables with the result of the value(s) form and invoking the writer form within all this environment.
For instance, let's define a my-setf1 macro which takes only a single place and a single newvalue form:
(defmacro my-setf1 (place newvalue &environment env)
  (multiple-value-bind (vars vals store-vars writer-form reader-form)
      (get-setf-expansion place env)
    `(let* (,@(mapcar #'(lambda (var val)
                          `(,var ,val))
                      vars vals))
       ;; In case some vars are used only by reader-form
       (declare (ignorable ,@vars))
       (multiple-value-bind (,@store-vars)
           ,newvalue
         ,writer-form
         ;; Uncomment the next line to mitigate buggy writer-forms
         ;;(values ,@store-vars)
         ))))
You could then define my-setf as:
(defmacro my-setf (&rest pairs)
  `(progn
     ,@(loop
         for (place newvalue) on pairs by #'cddr
         collect `(my-setf1 ,place ,newvalue))))
There is a convention for such macros, which is to suffix with f, such as setf itself, psetf, shiftf, rotatef, incf, decf, getf and remf.
Again, the convention doesn't work the other way around, there are operators that end with f, such as aref, svref and find-if, which are functions, and if, which is a conditional execution special operator. And yet again, there are notable exceptions, such as push, pushnew, pop, ldb, mask-field, assert and check-type.
Depending on your point-of-view, many more operators are implicitly destructive, even if not effectively tagged as such.
For instance, every defining operator (e.g. the macros defun, defpackage, defclass, defgeneric, defmethod, the function load) changes either the global environment or a temporary one, such as the compilation environment.
Others, like compile-file, compile and eval, depend on the forms they'll execute. For compile-file, it also depends on how much it isolates the compilation environment from the startup environment.
Other operators, like makunbound, fmakunbound, intern, export, shadow, use-package, rename-package, adjust-array, vector-push, vector-push-extend, vector-pop, remhash, clrhash, shared-initialize, change-class, slot-makunbound, add-method and remove-method, are (more or less) clearly intended to have side-effects.
And it's this last concept that can be the widest. Usually, a side-effect is regarded as any observable variation in one environment. As such, functions that don't change data are usually considered free of side-effects.
However, this is ill-defined. You may consider that all code execution implies side-effects, depending on what you define to be your environment or on what you can measure (e.g. consumed quotas, CPU time and real time, used memory, GC overhead, resource contention, system temperature, energy consumption, battery drain).
NOTE: None of the example lists are exhaustive.
Consider the following struct:
type Queue struct {
    Elements []int
}
What would be the difference between:
func NewQueue() Queue {
    queue := Queue{}
    return queue
}
and
func NewQueue() *Queue {
    queue := &Queue{}
    return queue
}
To me they seem practically the same (and in fact trying it with some enqueuing and dequeuing yields the same results), but I still see both usages in the wild, so perhaps one is preferable.
It's possible to return a value and then have the caller call methods that have a pointer receiver. However, if the caller is always going to want to use pointers, because the object is big or because methods need to modify it in place, you might as well return a pointer. Pointers vs. values is a common question in Go, and there's an answer trying to break down when to use one or the other.
In the specific case of a slice-backed Queue type, it's pretty small and fast to copy as a value, but if you want to be able to copy it around and have everyone see the same data whichever copy is accessed, you're going to need to use a pointer, because a slice is really a little struct of start pointer, length, and capacity, and those change when you reslice or grow it. If this is a surprise, the Go blog posts on the mechanics of append and slice usage and internals could be useful reading.
If your queue isn't for sharing or passing around but for using locally in a single function, you could provide an append-style interface where operations return a modified queue, but at that point maybe you just want to use slice tricks directly.
(If your queue is meant to be used concurrently, think hard about using a buffered channel. It might not be exactly what you're imagining, but a lot of the tricky bits have already been figured out for you by the implementers.)
Also--if Queue is really a slice with methods added, you can make it type Queue []int.
What are they and what are they good for?
I do not have a CS degree and my background is VB6 -> ASP -> ASP.NET/C#. Can anyone explain it in a clear and concise manner?
Imagine if every single line in your program was a separate function. Each accepts, as a parameter, the next line/function to execute.
Using this model, you can "pause" execution at any line and continue it later. You can also do inventive things like temporarily hop up the execution stack to retrieve a value, or save the current execution state to a database to retrieve later.
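As a loose illustration, here is that model sketched in Python (the functions line1/line2/line3 are hypothetical "lines" of a program, not any particular language's continuation API), with each step receiving the rest of the program as a parameter:

```python
# Each "line" of the program is a function that takes the rest of the
# program (its continuation) as a parameter.
def line1(next_step):
    x = 2 + 3
    next_step(x)

def line2(x, next_step):
    y = x * 10
    next_step(y)

def line3(y):
    print("result:", y)

# Run the whole program by wiring the continuations together.
line1(lambda x: line2(x, line3))

# Because "the rest of the program" is just a value, we can stop partway:
saved = []
line1(saved.append)       # "pause" after line1: record the value it would hand on
line2(saved[0], line3)    # ...and "resume" later by supplying the rest of the program
```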
You probably understand them better than you think you did.
Exceptions are an example of "upward-only" continuations. They allow code deep down the stack to call up to an exception handler to indicate a problem.
Python example:
class SomeException(Exception):
    pass

def broken_function():
    raise SomeException()  # go back up the stack
    # stuff that won't be evaluated

try:
    broken_function()
except SomeException:
    # jump to here
    pass
Generators are examples of "downward-only" continuations. They allow code to reenter a loop, for example, to create new values.
Python example:
def sequence_generator(i=1):
    while True:
        yield i  # "return" this value, and come back here for the next
        i = i + 1

g = sequence_generator()
while True:
    print(next(g))
In both cases, these had to be added to the language specifically whereas in a language with continuations, the programmer can create these things where they're not available.
A heads-up: this example is neither concise nor exceptionally clear. It is a demonstration of a powerful application of continuations. As a VB/ASP/C# programmer, you may not be familiar with the concept of a system stack or saving state, so the goal of this answer is demonstration rather than explanation.
Continuations are extremely versatile and are a way to save execution state and resume it later. Here is a small example of a cooperative multithreading environment using continuations in Scheme:
(Assume that the operations enqueue and dequeue work as expected on a global queue not defined here)
(define (fork)
  (display "forking\n")
  (call-with-current-continuation
    (lambda (cc)
      (enqueue (lambda ()
                 (cc #f)))
      (cc #t))))

(define (context-switch)
  (display "context switching\n")
  (call-with-current-continuation
    (lambda (cc)
      (enqueue
        (lambda ()
          (cc 'nothing)))
      ((dequeue)))))

(define (end-process)
  (display "ending process\n")
  (let ((proc (dequeue)))
    (if (eq? proc 'queue-empty)
        (display "all processes terminated\n")
        (proc))))
This provides three verbs that a function can use - fork, context-switch, and end-process. The fork operation forks the thread and returns #t in one instance and #f in another. The context-switch operation switches between threads, and end-process terminates a thread.
Here's an example of their use:
(define (test-cs)
  (display "entering test\n")
  (cond
    ((fork) (cond
              ((fork) (display "process 1\n")
                      (context-switch)
                      (display "process 1 again\n"))
              (else (display "process 2\n")
                    (end-process)
                    (display "you shouldn't see this (2)"))))
    (else (cond ((fork) (display "process 3\n")
                        (display "process 3 again\n")
                        (context-switch))
                (else (display "process 4\n")))))
  (context-switch)
  (display "ending process\n")
  (end-process)
  (display "process ended (should only see this once)\n"))
The output should be
entering test
forking
forking
process 1
context switching
forking
process 3
process 3 again
context switching
process 2
ending process
process 1 again
context switching
process 4
context switching
context switching
ending process
ending process
ending process
ending process
ending process
ending process
all processes terminated
process ended (should only see this once)
Those that have studied forking and threading in a class are often given examples similar to this. The purpose of this post is to demonstrate that with continuations you can achieve similar results within a single thread by saving and restoring its state - its continuation - manually.
P.S. - I think I remember something similar to this in On Lisp, so if you'd like to see professional code you should check the book out.
One way to think of a continuation is as a processor stack. When you "call-with-current-continuation c" it calls your function "c", and the parameter passed to "c" is your current stack with all your automatic variables on it (represented as yet another function, call it "k"). Meanwhile the processor starts off creating a new stack. When you call "k" it executes a "return from subroutine" (RTS) instruction on the original stack, jumping you back in to the context of the original "call-with-current-continuation" ("call-cc" from now on) and allowing your program to continue as before. If you passed a parameter to "k" then this becomes the return value of the "call-cc".
From the point of view of your original stack, the "call-cc" looks like a normal function call. From the point of view of "c", your original stack looks like a function that never returns.
There is an old joke about a mathematician who captured a lion in a cage by climbing into the cage, locking it, and declaring himself to be outside the cage while everything else (including the lion) was inside it. Continuations are a bit like the cage, and "c" is a bit like the mathematician. Your main program thinks that "c" is inside it, while "c" believes that your main program is inside "k".
You can create arbitrary flow-of-control structures using continuations. For instance you can create a threading library. "yield" uses "call-cc" to put the current continuation on a queue and then jumps in to the one on the head of the queue. A semaphore also has its own queue of suspended continuations, and a thread is rescheduled by taking it off the semaphore queue and putting it on to the main queue.
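Here is a rough sketch of that scheduling idea in Python, using generators as the suspendable computations and a deque as the run queue. The scheduler and worker names are made up for illustration; this is an analogue of the queue-and-resume pattern, not the call/cc mechanism itself:

```python
from collections import deque

def scheduler(*tasks):
    """Round-robin scheduler: each task is a generator that yields to pause."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)            # resume the task until its next yield
            queue.append(task)    # it yielded: put it back on the run queue
        except StopIteration:
            pass                  # the task finished

def worker(name, steps):
    for i in range(steps):
        print(f"{name} step {i}")
        yield                     # "yield the processor": hand control back

scheduler(worker("A", 3), worker("B", 2))
```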
Basically, a continuation is the ability for a function to stop execution and then pick back up where it left off at a later point in time. In C#, you can do this using the yield keyword. I can go into more detail if you wish, but you wanted a concise explanation. ;-)
I'm still getting "used" to continuations, but one way to think about them that I find useful is as abstractions of the Program Counter (PC) concept. A PC "points" to the next instruction to execute in memory, but of course that instruction (and pretty much every instruction) points, implicitly or explicitly, to the instruction that follows, as well as to whatever instructions should service interrupts. (Even a NOOP instruction implicitly does a JUMP to the next instruction in memory. But if an interrupt occurs, that'll usually involve a JUMP to some other instruction in memory.)
Each potentially "live" point in a program in memory to which control might jump at any given point is, in a sense, an active continuation. Other points that can be reached are potentially active continuations, but, more to the point, they are continuations that are potentially "calculated" (dynamically, perhaps) as a result of reaching one or more of the currently active continuations.
This seems a bit out of place in traditional introductions to continuations, in which all pending threads of execution are explicitly represented as continuations into static code; but it takes into account the fact that, on general-purpose computers, the PC points to an instruction sequence that might potentially change the contents of memory representing a portion of that instruction sequence, thus essentially creating a new (or modified, if you will) continuation on the fly, one that doesn't really exist as of the activations of continuations preceding that creation/modification.
So continuation can be viewed as a high-level model of the PC, which is why it conceptually subsumes ordinary procedure call/return (just as ancient iron did procedure call/return via low-level JUMP, aka GOTO, instructions plus recording of the PC on call and restoring of it on return) as well as exceptions, threads, coroutines, etc.
So just as the PC points to computations to happen in "the future", a continuation does the same thing, but at a higher, more-abstract level. The PC implicitly refers to memory plus all the memory locations and registers "bound" to whatever values, while a continuation represents the future via the language-appropriate abstractions.
Of course, while there might typically be just one PC per computer (core processor), there are in fact many "active" PC-ish entities, as alluded to above. The interrupt vector contains a bunch, the stack a bunch more, certain registers might contain some, etc. They are "activated" when their values are loaded into the hardware PC, but continuations are abstractions of the concept, not PCs or their precise equivalent (there's no inherent concept of a "master" continuation, though we often think and code in those terms to keep things fairly simple).
In essence, a continuation is a representation of "what to do next when invoked", and, as such, can be (and, in some languages and in continuation-passing-style programs, often is) a first-class object that is instantiated, passed around, and discarded just like most any other data type, and much like how a classic computer treats memory locations vis-a-vis the PC -- as nearly interchangeable with ordinary integers.
In C#, you have access to two continuations. One, accessed through return, lets a method continue from where it was called. The other, accessed through throw, lets a method continue at the nearest matching catch.
Some languages let you treat these statements as first-class values, so you can assign them and pass them around in variables. What this means is that you can stash the value of return or of throw and call them later when you're really ready to return or throw.
Continuation callback = return;
callMeLater(callback);
This can be handy in lots of situations. One example is like the one above, where you want to pause the work you're doing and resume it later when something happens (like getting a web request, or something).
I'm using them in a couple of projects I'm working on. In one, I'm using them so I can suspend the program while I'm waiting for IO over the network, then resume it later. In the other, I'm writing a programming language where I give the user access to continuations-as-values so they can write return and throw for themselves - or any other control flow, like while loops - without me needing to do it for them.
Think of threads. A thread can be run, and you can get the result of its computation. A continuation is a thread that you can copy, so you can run the same computation twice.
Continuations have had renewed interest with web programming because they nicely mirror the pause/resume character of web requests. A server can construct a continuation representing a user's session and resume it if and when the user continues the session.