Destructive place-modifying operators - common-lisp

The CLtL2 reference clearly distinguishes between nondestructive and destructive common-lisp operations. But, within the destructive camp, it seems a little less clear in marking the difference between those which simply return the result, and those which additionally modify a place (given as argument) to contain the result. The usual convention of annexing "f" to such place modifying operations (eg, setf, incf, alexandria:deletef) is somewhat sporadic, and also applies to many place accessors (eg, aref, getf). In an ideal functional programming style (based only on returned values) such confusion is probably not an issue, but it seems like it could lead to programming errors in some practical applications that do use place modification. Since different implementations can handle the place results differently, couldn't portability be affected? It even seems difficult to test a particular implementation's approach.
To better understand the above distinction, I've divided the destructive common-lisp sequence operations into two categories corresponding to "argument returning" and "operation returning". Could someone validate or invalidate these categories for me? I'm assuming these categories could apply to other kinds of destructive operations (for lists, hash-tables, arrays, numbers, etc) too.
Argument returning: fill, replace, map-into
Operation returning: delete, delete-if, delete-if-not, delete-duplicates, nsubstitute, nsubstitute-if, nsubstitute-not-if, nreverse, sort, stable-sort, merge

But, within the destructive camp, it seems a little less clear in marking the difference between those which simply return the result.
There is no easy syntactic marker for whether an operation is destructive, even though there are useful conventions like the n prefix. Remember that CL is a standard inspired by different Lisps, which does not help in enforcing a consistent terminology.
The usual convention of annexing "f" to such place modifying operations (eg, setf, incf, alexandria:deletef) is somewhat sporadic, and also applies to many place accessors (eg, aref, getf).
All setf expanders should end with f, but not everything that ends with f is a setf expander. For example, aref takes its name from array and reference and isn't a macro.
... but it seems like it could lead to programming errors in some practical applications that do use place modification.
Most data is mutable (see comments); once you code in CL with that in mind, you take care not to modify data you did not create yourself. As for using a destructive operation in place of a non-destructive one inadvertently, I don't know: I guess it can happen, with sort or delete, maybe the first times you use them. In my mind delete is stronger, more destructive than simply remove, but maybe that's because I already know the difference.
Since different implementations can handle the place results differently, couldn't portability be affected?
If you want portability, you follow the specification, which does not offer much guarantee w.r.t. which destructive operations are applied. Take for example DELETE (emphasis mine):
Sequence may be destroyed and used to construct the result; however, the result might or might not be identical to sequence.
It is wrong to assume anything about how the list is being modified, or even if it is being modified. You could actually implement delete as an alias of remove in a minimal implementation. In all cases, you use the return value of your function (both delete and remove have the same signature).
Categories
I've divided the destructive common-lisp sequence operations into two categories corresponding to "argument returning" and "operation returning".
It is not clear at all what those categories are supposed to represent. Are these the definitions you have in mind?
an argument returning operation is one which returns one of its arguments as a return value, possibly modified.
an operation returning operation is one where the result is based on one of its arguments, and might be identical to that argument, but need not be.
The definition of operation returning is quite vague and encompasses both destructive and non-destructive operations. I would classify cons as such because it does not return one of its arguments; OTOH, it is a purely functional operation.
I don't really get what those categories offer in addition to destructive or non-destructive.
Setf composition gotcha
Suppose you write a function (remote host key) which gets a value from a remote key/value datastore. Suppose also that you define (setf remote) so that it updates the remote value.
You might expect (setf (first (remote host key)) value) to:
1. Fetch a list from host, indexed by key,
2. Replace its first element by value,
3. Push the changes back to the remote host.
However, step 3 generally does not happen: the local list is modified in place (this is the most efficient alternative, but it makes setf expansions somewhat lazy about updates). You could, though, define a new set of macros with DEFINE-SETF-EXPANDER such that the whole round trip is always performed.

Let me try to address your question by introducing some concepts.
I hope it helps you to consolidate your knowledge and to find your remaining answers about this subject.
The first concept is that of non-destructive versus destructive behavior.
A function that is non-destructive won't change the data passed to it.
A function that is destructive may change the data passed to it.
You can apply the (non-)destructive nature to something other than a single function. For instance, if a function stores the data passed to it somewhere, say in a object's slot, then the destructiveness depends on that object's behavior, its other operations, events, etc.
The convention for functions that immediately modify their arguments is to (usually) prefix with n.
The convention doesn't work the other way around: there are many functions that start with n but are not destructive (e.g. not/null, nth, ninth, notany, notevery, numberp etc.). There are also notable exceptions, destructive functions without the prefix, such as delete, merge, sort and stable-sort. The only way to naturally grasp them is with time/experience. For instance, always refer to the HyperSpec whenever you see a function you don't know yet.
Moreover, you usually need to store the result of some destructive functions, such as delete and sort, because they may choose to skip the head of the list or to not be destructive at all. delete may actually return nil, the empty list, which is not possible to obtain from a modified cons.
The second concept is that of generalized reference.
A generalized reference is anything that can hold data, such as a variable, the car and cdr of a cons, the element locations of an array or hash table, the slots of an object, etc.
For each container data structure, you need to know the specific modifying function. However, for some generalized references, there might not be a function to modify it, such as a local variable, in which case there are still special forms to modify it.
As such, in order to modify any generalized reference, you need to know its modifying form.
Another concept closely related to generalized references is the place. A form that identifies a generalized reference is called a place. Or in other words, a place is the written way (form) that represents a generalized reference.
For each kind of place, you have a reader form and a writer form.
Some of these forms are documented, such as using the symbol of a variable to read it and setq a variable to write to it, or car/cdr to read from and rplaca/rplacd to write to a cons. Others are only documented to be accessors, such as aref to read from arrays; its writer form is not actually documented.
To get these forms, you have get-setf-expansion. You actually also get a set of variables and their initializing forms (to be bound as if by let*) that will be used by the reader form and/or the writer form, and a set of variables (to be bound to the new values) that will be used by the writer form.
If you've used Lisp before, you've probably used setf. setf is a macro that generates code that runs within the scope (environment) of its expansion.
Essentially, it behaves as if by using get-setf-expansion, generating a let* form for the variables and initializing forms, generating extra bindings for the writer variables with the result of the value(s) form and invoking the writer form within all this environment.
For instance, let's define a my-setf1 macro which takes only a single place and a single newvalue form:
(defmacro my-setf1 (place newvalue &environment env)
  (multiple-value-bind (vars vals store-vars writer-form reader-form)
      (get-setf-expansion place env)
    ;; Writing to the place does not need the reader form
    (declare (ignore reader-form))
    `(let* (,@(mapcar #'(lambda (var val)
                          `(,var ,val))
                      vars vals))
       ;; In case some vars are used only by reader-form
       (declare (ignorable ,@vars))
       (multiple-value-bind (,@store-vars)
           ,newvalue
         ,writer-form
         ;; Uncomment the next line to mitigate buggy writer-forms
         ;;(values ,@store-vars)
         ))))
You could then define my-setf as:
(defmacro my-setf (&rest pairs)
  `(progn
     ,@(loop
         for (place newvalue) on pairs by #'cddr
         collect `(my-setf1 ,place ,newvalue))))
There is a convention for such macros, which is to suffix with f, such as setf itself, psetf, shiftf, rotatef, incf, decf and remf.
Again, the convention doesn't work the other way around: there are operators that end with f, such as aref, svref, getf and find-if, which are functions, and if, which is a conditional execution special operator. And yet again, there are notable exceptions, such as push, pushnew, pop, ldb, mask-field, assert and check-type.
Depending on your point-of-view, many more operators are implicitly destructive, even if not effectively tagged as such.
For instance, every defining operator (e.g. the macros defun, defpackage, defclass, defgeneric, defmethod, the function load) changes either the global environment or a temporary one, such as the compilation environment.
Others, like compile-file, compile and eval, depend on the forms they'll execute. For compile-file, it also depends on how much it isolates the compilation environment from the startup environment.
Other operators, like makunbound, fmakunbound, intern, export, shadow, use-package, rename-package, adjust-array, vector-push, vector-push-extend, vector-pop, remhash, clrhash, shared-initialize, change-class, slot-makunbound, add-method and remove-method, are (more or less) clearly intended to have side-effects.
And it's this last concept that can be the widest. Usually, a side-effect is regarded as any observable variation in one environment. As such, functions that don't change data are usually considered free of side-effects.
However, this is ill-defined. You may consider that all code execution implies side-effects, depending on what you define to be your environment or on what you can measure (e.g. consumed quotas, CPU time and real time, used memory, GC overhead, resource contention, system temperature, energy consumption, battery drain).
NOTE: None of the example lists are exhaustive.

Related

What persistent data structures does Raku/Rakudo include?

Raku provides many types that are immutable and thus cannot be modified after they are created. Until I started looking into this area recently, my understanding was that these Types were not persistent data structures – that is, unlike the core types in Clojure or Haskell, my belief was that Raku's immutable types did not take advantage of structural sharing to allow for inexpensive copies. I thought that a statement like my List $new = (|$old-list, 42); literally copied the values in $old-list, without the data-sharing features of persistent data structures.
That description of my understanding is in the past tense, however, due to the following code:
my Array $a = do {
$_ = [rand xx 10_000_000];
say "Initialized an Array in $((now - ENTER now).round: .001) seconds"; $_}
my List $l = do {
$_ = |(rand xx 10_000_000);
say "Initialized the List in $((now - ENTER now).round: .001) seconds"; $_}
do { $a.push: rand;
say "Pushed the element to the Array in $((now - ENTER now).round: .000001) seconds" }
do { my $nl = (|$l, rand);
say "Appended an element to the List in $((now - ENTER now).round: .000001) seconds" }
do { my @na = |$l;
say "Copied List \$l into a new Array in $((now - ENTER now).round: .001) seconds" }
which produced this output in one run:
Initialized an Array in 5.938 seconds
Initialized the List in 5.639 seconds
Pushed the element to the Array in 0.000109 seconds
Appended an element to the List in 0.000109 seconds
Copied List $l into a new Array in 11.495 seconds
That is, creating a new List with the old values + one more is just as fast as pushing to a mutable Array, and dramatically faster than copying the List into a new Array – exactly the performance characteristics that you'd expect to see from a persistent List (copying to an Array is still slow because it can't take advantage of structural sharing without breaking the immutability of the List). The fast copying of $l into $nl is not due to either List being lazy; neither is.
All of the above leads me to believe that Lists in Rakudo actually are persistent data structures, with all the performance benefits that implies. That leaves me with several questions:
Am I right about Lists being persistent data structures?
Are all other immutable Types also persistent data structures? Or are any?
Is any of this part of Raku, or just an implementation choice Rakudo has made?
Are any of these performance characteristics documented/guaranteed anywhere?
I have to say, I am both extremely impressed and more than a bit baffled to discover evidence that at least some of Raku(do)'s types are persistent. It's the sort of feature that other languages list as a key selling point or that leads to the creation of libraries with 30k+ stars on GitHub. Have we really had it in Raku without even mentioning it?
I remember implementing these semantics, and I certainly don't recall thinking about them giving rise to a persistent data structure at the time - although it does seem fair to attach that label to the result!
I don't think you'll find anywhere that explicitly spells out this exact behavior; however, the most natural implementation of things that are required by the language quite naturally leads to it. Taking the ingredients:
The infix:<,> operator is the List constructor in Raku
When a List is created, it is non-committal with regards to laziness and flattening (these arise from how we use the List, which we don't - in general - know at the point of its construction)
When we write (|$x, 1), the prefix:<|> operator constructs a Slip, which is a kind of List that should melt into its surrounding List. Thus what infix:<,> sees is a Slip and an Int.
Making the Slip melt into the result List immediately would mean making a commitment about eagerness, which List construction alone should not do. Thus the Slip and everything after it is placed into the lazily evaluated ("non-reified") portion of the List.
The last of these is what gives rise to the observed persistent data structure style behavior.
I expect it would be possible to have an implementation that inspects the Slip and chooses to eagerly copy things that are known not to be lazy, and still be in compliance with the specification test suite. That would change the time complexity of your example. If you want to be defensive against that, then:
do { my $nl = (|$l.lazy, rand);
say "Appended an element to the List in $((now - ENTER now).round: .000001) seconds" }
Should be sufficient to force the issue even if the implementation changed.
Of other cases that immediately come to mind that are related to persistent data structures or at least tail sharing:
The MoarVM implementation of strings, which is behind str and thus Str, implements string concatenation by creating a new string that refers to the two that are being concatenated instead of copying the data in the two strings (and does similar tricks for substr and repetition). This is strictly an optimization, not a language requirement, and in some delicate cases (the last grapheme of one string and the first grapheme of the next will form a single grapheme in the resulting string), it gives up and takes the copying path.
Outside of the core, modules like Concurrent::Stack, Concurrent::Queue, and Concurrent::Trie use tail sharing as a technique to implement relatively efficient lock-free data structures.

Nested Predicates In Prolog

I am trying to write a predicate that 'exists' inside the 'scope' of another predicate. The reason I need this is that both predicates make use of the same very large parameters/arrays, and the predicate I want to nest recurses on itself many times, so I want to avoid copying the same parameters. So, is there any way I can do this in SWI-Prolog?
Thanks in advance.
You don't need to. You have to realize that all the terms "named" by Prolog variable names are already global, although inaccessible when the clause doesn't have a name referencing them (and names are always local to a clause). That "very large array" is on the heap. Just pass the name to it to any other predicate at ~0 cost.
As Paulo Moura says.
Suppose you have:
foo(BigArray) :- do_things(BigArray),do_more_things(BigArray).
Suppose do_things/1 either just prints the element at position 0 if it is an instantiated term, or sets it to bar if it is a fresh term:
do_things(BigArray) :- nth0(0,BigArray,Elem),nonvar(Elem),!,write(Elem).
do_things(BigArray) :- nth0(0,BigArray,Elem),var(Elem),!,Elem=bar.
If there was a fresh term at position 0, then, on return to foo/1, the atom bar at position 0 is visible to the caller and to do_more_things/1 because the list designated by BigArray is a "global term".
Some precision on your other question on whether to use "global variables":
SWI-Prolog also has "Global Variables", which are apparently similar to the GNU Prolog "Global Variables":
Global Variables
We read:
Global variables are associations between names (atoms) and terms.
They differ in various ways from storing information using assert/1 or
recorda/3.
...which means that their purpose is similar to the purpose of assert/1 and recorda/3: Storing state that survives query termination at the Prolog toplevel - similar to how program clauses of a program are stored.
I would say, use those only if absolutely needed.
Also read the intro: Database, where we find:
The recorded database is not part of the ISO standard but fairly
widely supported, notably in implementations building on the 'Edinburgh
tradition'. There are few reasons to use this database in SWI-Prolog
due to the good performance of dynamic predicates.

What is the complexity of std::vector<T>::clear() when T is a primitive type?

I understand that the complexity of the clear() operation is linear in the size of the container, because the destructors must be called. But what about primitive types (and POD)? It seems the best thing to do would be to set the vector size to 0, so that the complexity is constant.
If this is possible, is it also possible for std::unordered_map?
It seems the best thing to do would be to set the vector size to 0, so that the complexity is constant.
In general, the complexity of resizing a vector to zero is linear in the number of elements currently stored in the vector. Therefore, setting vector's size to zero offers no advantage over calling clear() - the two are essentially the same.
However, at least one implementation (libstdc++, source in bits/stl_vector.h) gives you an O(1) complexity for primitive types by employing template specialization.
The implementation of clear() navigates its way to the std::_Destroy(from, to) function in bits/stl_construct.h, which performs a non-trivial compile-time optimization: it declares an auxiliary template class _Destroy_aux with a template parameter of type bool. The primary template covers the false case and an explicit specialization covers true. Both define a single static function called __destroy. In case the template parameter is true, the function body is empty; in case the parameter is false, the body contains a loop invoking T's destructor by calling std::_Destroy(ptr).
The trick comes on line 136:
std::_Destroy_aux<__has_trivial_destructor(_Value_type)>::
__destroy(__first, __last);
The auxiliary class is instantiated based on the result of the __has_trivial_destructor check. The checker returns true for built-in types, and false for types with a non-trivial destructor. As a result, the call to __destroy becomes a no-op for int, double, and other POD types.
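To make the dispatch concrete, here is a minimal, self-contained sketch of the same idea in standard C++. It is not the libstdc++ source: DestroyAux, destroy_range and the two demo functions are hypothetical names, std::is_trivially_destructible stands in for the __has_trivial_destructor builtin, and the returned count of destructor calls exists only to make the two paths observable.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>
#include <type_traits>

// Primary template: the general (non-trivial) case.
// Visits every element and runs its destructor, so it is O(n).
template <bool Trivial>
struct DestroyAux {
    template <typename T>
    static std::size_t destroy(T* first, T* last) {
        std::size_t calls = 0;
        for (; first != last; ++first) {
            first->~T();
            ++calls;
        }
        return calls;  // number of destructors actually run
    }
};

// Explicit specialization for trivially destructible types:
// there is nothing to do, so the body is empty and the cost is O(1).
template <>
struct DestroyAux<true> {
    template <typename T>
    static std::size_t destroy(T*, T*) { return 0; }
};

// Picks one of the two implementations at compile time.
template <typename T>
std::size_t destroy_range(T* first, T* last) {
    assert(first <= last);
    return DestroyAux<std::is_trivially_destructible<T>::value>::destroy(first, last);
}

// Demo: int is trivially destructible, so no element is visited.
inline std::size_t demo_ints() {
    int values[4] = {1, 2, 3, 4};
    return destroy_range(values, values + 4);
}

// Demo: two std::string objects are built in raw storage and then
// destroyed exactly once each through the general path.
inline std::size_t demo_strings() {
    alignas(std::string) unsigned char storage[2 * sizeof(std::string)];
    std::string* s = reinterpret_cast<std::string*>(storage);
    new (s) std::string("a");
    new (s + 1) std::string("b");
    return destroy_range(s, s + 2);
}
```

In this sketch demo_ints() reports zero destructor calls and demo_strings() reports two, which mirrors the difference between clearing a vector of int and a vector of std::string.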
The std::unordered_map is different from the vector in that it may need to delete structures that represent "hash buckets" of POD objects, as opposed to deleting objects themselves*. The optimization of clear to O(1) is possible, but it is heavily dependent on the implementation, so I would not count on it.
* The exact answer depends on the implementation: hash tables implementing collision resolution based on open addressing (linear probing, quadratic probing, etc.) may be able to delete all buckets in O(1); implementations based on separate chaining would have to delete buckets one-by-one, though.
gcc's version of std::_Destroy, which is what is eventually used by clear(), tries to template on whether the type has a trivial destructor, so in that case the complexity is constant even without an optimisation pass. However I don't know how well the template works.

Priority Queue in R for OPTICS implementation

I need to construct a priority queue in R where I will put the ordered seed objects (or the indices of the objects) for the OPTICS clustering algorithm.
One possibility is to implement it as a heap with the array representation, pass the heap array in each insert and decrease-key call, return the changed array and reassign it in the calling function. In that case, the reassignment will make the performance very poor: every time an insert or decrease-key operation is executed, the entire array needs to be copied twice, once for the call and once more for returning and reassigning.
Another possibility is to code the heap operations inline instead of calling functions. This will result in code repetition and cumbersome code.
Is there any pointer-like access as we have in C?
Can I declare user-defined functions in S3 or S4 classes in R? In that case I think the call to these functions still requires the same reassignment after returning (unlike C++/Java classes, which operate on the object itself; am I right?).
Is there any built-in way to insert and extract an object in a queue in O(log(n)) time in R?
Is there any other way to achieve the goal, that is, to maintain priority-based insertion and removal of the seeds depending on the reachability distance of an object in the OPTICS algorithm, other than explicitly sorting after each insertion?
R5 classes define mutable objects and are very similar to Java classes: they should allow you to avoid the copies when the object is modified.
Note that you do not just need a priority queue.
It actually needs to support efficient updates, too. A simple heap is not sufficient, you need to synchronize a hashmap to find objects efficiently for updating their values. Then you need to repair the heap at the changed position.
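As a rough illustration of a queue with efficient updates, here is a small sketch in C++ rather than R. It deliberately swaps the heap-plus-hashmap design for an ordered set plus a hashmap, which is simpler to get right and still gives O(log n) insert, update and extract-min. The class name SeedQueue and its methods are hypothetical, and ids are assumed to be plain integers with the reachability distance as the key.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <unordered_map>
#include <utility>

// Hypothetical sketch: min-priority queue over object ids whose keys
// (reachability distances) can be changed after insertion.
class SeedQueue {
public:
    // Insert a new id, or replace the key of an id already queued.
    // The hashmap finds the old key, so the stale entry can be erased
    // from the ordered set in O(log n).
    void upsert(int id, double reach) {
        auto it = key_of_.find(id);
        if (it != key_of_.end()) {
            ordered_.erase(std::make_pair(it->second, id));
            it->second = reach;
        } else {
            key_of_.emplace(id, reach);
        }
        ordered_.emplace(reach, id);
    }

    // Remove and return the id with the smallest key.
    // Ties are broken by the smaller id. Precondition: !empty().
    int pop_min() {
        assert(!ordered_.empty());
        auto first = ordered_.begin();
        int id = first->second;
        key_of_.erase(id);
        ordered_.erase(first);
        return id;
    }

    bool empty() const { return ordered_.empty(); }
    std::size_t size() const { return ordered_.size(); }

private:
    std::set<std::pair<double, int>> ordered_;  // (key, id), smallest first
    std::unordered_map<int, double> key_of_;    // id -> current key
};
```

Calling upsert on an id that is already queued changes its position instead of adding a duplicate, which is the update that OPTICS needs when a shorter reachability distance is found for a seed.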

Are higher order functions on collections guaranteed to be executed sequentially?

In another question, a user suggested writing code like this:
def list = ['a', 'b', 'c', 'd']
def i = 0;
assert list.collect { i++ } == [0, 1, 2, 3]
Such code is, in other languages, considered bad practice because the closure passed to collect changes the state of its context (here it changes the value of i). In other words, the closure has side-effects.
Such higher order functions should be able to run the closure in parallel, and assemble it in a new list again. If the processing in the closure are long, CPU intensive operations, it may be worth executing them in separate threads. It would be easy to change collect to use an ExecutorCompletionService to achieve that, but it would break the above code.
Another example of a problem is if, for some reason, collect traverses the collection in, say, reverse order, in which case the result would be [3, 2, 1, 0]. Note that in this case, the list has not been reversed: 0 is really the result of applying the closure to 'd'!
Interestingly, these functions are documented with "Iterates through this collection" in Collection's JavaDoc, which suggests the iteration is sequential.
Does the Groovy specification explicitly define the order of execution in higher order functions like collect or each? Is the above code broken, or is it OK?
I don't like explicit external variables being relied upon in my closures for the reasons you give above.
Indeed, the fewer variables I have to define, the happier I am ;-)
For the possibly parallel things as well, always code with a view to wrapping it with some level of GPars loveliness should it prove too much for a single thread to handle. For this, as you say, you want as little mutability as possible and to try and completely avoid side-effects (such as the external counter pattern above)
As for the question itself, if we take collect as an example function, and examine the source code, we can see that given an Object (Collection and Map are done in a similar way with slight differences as to how the Iterator is referenced) it iterates along InvokerHelper.asIterator(self), adding the result of each closure call to the resultant list.
InvokerHelper.asIterator (again source is here) basically calls the iterator() method on the Object passed in.
So for Lists, etc it will iterate down the objects in the order defined by the iterator.
It is therefore possible to compose your own class which follows the Iterable interface design (doesn't need to implement Iterable though, thanks to duck-typing), and define how the collection will be iterated.
I think by asking about the Groovy specification though, this answer might not be what you want, but I don't think there is an answer. Groovy has never really had a 'complete' specification (indeed this is a point about Groovy that some people dislike).
I think keeping the functions passed to collect or findAll side-effect free is a good idea in general, not only for keeping the complexity low but making the code more parallel-friendly in case parallel execution is needed in the future.
But in the case of each there is not much point in keeping the function side-effect free, as it wouldn't do anything (in fact the sole purpose of this method is to act as a for-each loop). The Groovy documentation has some examples of using each (and its variants, eachWithIndex and reverseEach) that require an execution order to be defined.
Now, from a pragmatic point of view, I think it can sometimes be OK to use functions with some side effects in methods like collect. For example, to transform a list into [index, value] pairs, a transpose and a range can be used:
def list = ['a', 'b', 'c']
def enumerated = [0..<list.size(), list].transpose()
assert enumerated == [[0,'a'], [1,'b'], [2,'c']]
Or even an inject
def enumerated = list.inject([]) { acc, val -> acc << [acc.size(), val] }
But a collect and a counter does the trick too and I think the result is the most readable:
def n = 0, enumerated = list.collect{ [n++, it] }
Now, this example wouldn't make sense if Groovy provided a collect and similar methods with an index-value-param function (see Jira issue), but it kinda shows that sometimes practicality beats purity IMO :)

Resources