Priority Queue in R for OPTICS implementation

I need to construct a priority queue in R where I will put the ordered seed objects (or the indices of the objects) for the OPTICS clustering algorithm.
One possibility is to implement a heap with the array representation, pass the heap array into each insert and decrease-key call, and return the changed array and reassign it in the calling function. In that case the reassignment will make performance very poor: every time an insert or decrease-key operation is executed, the entire array needs to be copied twice, once when calling and once more when returning and reassigning.
Another possibility is to inline the heap operations instead of calling functions. This results in code repetition and cumbersome code.
Is there any pointer-like access in R, as in C?
Can I declare user-defined functions in S3 or S4 classes in R? In that case, I think calls to these functions still require the same reassignment after returning (unlike C++/Java, where a method operates on the object itself; am I right?).
Is there any built-in way to insert and extract an object from a queue in O(log n) time in R?
Is there any other way to achieve the goal, that is, to maintain priority-based insertion and removal of seeds depending on an object's reachability distance in the OPTICS algorithm, other than explicitly sorting after each insertion?

R5 classes define mutable objects and are very similar to Java classes:
they should allow you to avoid the copies when the object is modified.

Note that you do not just need a priority queue.
It actually needs to support efficient updates, too. A simple heap is not sufficient: you need a hashmap kept in sync with the heap, so that objects can be found efficiently when their values change. Then you need to repair the heap at the changed position.

Why are iterations over maps random?

From the Golang source code, they seem to follow a fairly standard hash-table implementation (i.e., an array of buckets). Based on this, iteration should be deterministic for an unchanged map (iterate the bucket array in order, then iterate within each bucket in order). Why do they make the iteration random?
TL;DR: They intentionally made it random starting with Go 1 so that developers do not rely on a specific iteration order, which may change from release to release, from platform to platform, or even during a single run of an app when the map's internals change to accommodate more elements.
The Go Blog: Go maps in action: Iteration order:
When iterating over a map with a range loop, the iteration order is not specified and is not guaranteed to be the same from one iteration to the next. Since the release of Go 1.0, the runtime has randomized map iteration order. Programmers had begun to rely on the stable iteration order of early versions of Go, which varied between implementations, leading to portability bugs. If you require a stable iteration order you must maintain a separate data structure that specifies that order.
Also Go 1 Release Notes: Iterating in maps:
The old language specification did not define the order of iteration for maps, and in practice it differed across hardware platforms. This caused tests that iterated over maps to be fragile and non-portable, with the unpleasant property that a test might always pass on one machine but break on another.
In Go 1, the order in which elements are visited when iterating over a map using a for range statement is defined to be unpredictable, even if the same loop is run multiple times with the same map. Code should not assume that the elements are visited in any particular order.
This change means that code that depends on iteration order is very likely to break early and be fixed long before it becomes a problem. Just as important, it allows the map implementation to ensure better map balancing even when programs are using range loops to select an element from a map.
Notable exceptions
Please note that the "random" order applies when ranging over the map using for range.
For reproducible output (for easy testing and the other conveniences it brings), the standard library sorts map keys in numerous places:
1. encoding/json
The json package marshals maps using sorted keys. Quoting from json.Marshal():
Map values encode as JSON objects. The map's key type must either be a string, an integer type, or implement encoding.TextMarshaler. The map keys are sorted and used as JSON object keys by applying the following rules, subject to the UTF-8 coercion described for string values above:
keys of any string type are used directly
encoding.TextMarshalers are marshaled
integer keys are converted to strings
2. fmt package
Starting with Go 1.12 the fmt package prints maps using sorted keys. Quoting from the release notes:
Maps are now printed in key-sorted order to ease testing. The ordering rules are:
When applicable, nil compares low
ints, floats, and strings order by <
NaN compares less than non-NaN floats
bool compares false before true
Complex compares real, then imaginary
Pointers compare by machine address
Channel values compare by machine address
Structs compare each field in turn
Arrays compare each element in turn
Interface values compare first by reflect.Type describing the concrete type and then by concrete value as described in the previous rules.
3. Go templates
The {{range}} action of the text/template and html/template packages also visits elements in sorted key order. Quoting from the package doc of text/template:
{{range pipeline}} T1 {{end}}
The value of the pipeline must be an array, slice, map, or channel.
If the value of the pipeline has length zero, nothing is output;
otherwise, dot is set to the successive elements of the array,
slice, or map and T1 is executed. If the value is a map and the
keys are of basic type with a defined order, the elements will be
visited in sorted key order.
This is important for security, among other things.
There are lots of resources talking about this online -- see this post for example

Destructive place-modifying operators

The CLtL2 reference clearly distinguishes between nondestructive and destructive common-lisp operations. But, within the destructive camp, it seems a little less clear in marking the difference between those which simply return the result, and those which additionally modify a place (given as argument) to contain the result. The usual convention of annexing "f" to such place modifying operations (eg, setf, incf, alexandria:deletef) is somewhat sporadic, and also applies to many place accessors (eg, aref, getf). In an ideal functional programming style (based only on returned values) such confusion is probably not an issue, but it seems like it could lead to programming errors in some practical applications that do use place modification. Since different implementations can handle the place results differently, couldn't portability be affected? It even seems difficult to test a particular implementation's approach.
To better understand the above distinction, I've divided the destructive common-lisp sequence operations into two categories corresponding to "argument returning" and "operation returning". Could someone validate or invalidate these categories for me? I'm assuming these categories could apply to other kinds of destructive operations (for lists, hash-tables, arrays, numbers, etc) too.
Argument returning: fill, replace, map-into
Operation returning: delete, delete-if, delete-if-not, delete-duplicates, nsubstitute, nsubstitute-if, nsubstitute-not-if, nreverse, sort, stable-sort, merge
But, within the destructive camp, it seems a little less clear in marking the difference between those which simply return the result.
There is no easy syntactic marker for which operations are destructive, even though there are useful conventions like the n prefix. Remember that CL is a standard inspired by different Lisps, which does not help in enforcing a consistent terminology.
The usual convention of annexing "f" to such place modifying operations (eg, setf, incf, alexandria:deletef) is somewhat sporadic, and also applies to many place accessors (eg, aref, getf).
All setf expanders should end with f, but not everything that ends with f is a setf expander. For example, aref takes its name from array and reference and isn't a macro.
... but it seems like it could lead to programming errors in some practical applications that do use place modification.
Most data is mutable (see comments); once you code in CL with that in mind, you take care not to modify data you did not create yourself. As for using a destructive operation in place of a non-destructive one inadvertently, I don't know: I guess it can happen, with sort or delete, maybe the first times you use them. In my mind delete is stronger, more destructive than simply remove, but maybe that's because I already know the difference.
Since different implementations can handle the place results differently, couldn't portability be affected?
If you want portability, you follow the specification, which does not offer much guarantee w.r.t. which destructive operations are applied. Take for example DELETE (emphasis mine):
Sequence may be destroyed and used to construct the result; however, the result might or might not be identical to sequence.
It is wrong to assume anything about how the list is being modified, or even if it is being modified. You could actually implement delete as an alias of remove in a minimal implementation. In all cases, you use the return value of your function (both delete and remove have the same signature).
Categories
I've divided the destructive common-lisp sequence operations into two categories corresponding to "argument returning" and "operation returning".
It is not clear at all what those categories are supposed to represent. Are these the definitions you have in mind?
an argument returning operation is one which returns one of its argument as a return value, possibly modified.
an operation returning operation is one where the result is based on one of its argument, and might be identical to that argument, but needs not be.
The definition of operation returning is quite vague and encompasses both destructive and non-destructive operations. I would classify cons as such because it does not return one of its arguments; OTOH, it is a purely functional operation.
I don't really get what those categories offer in addition to destructive or non-destructive.
Setf composition gotcha
Suppose you write a function (remote host key) which gets a value from a remote key/value datastore. Suppose also that you define (setf remote) so that it updates the remote value.
You might expect (setf (first (remote host key)) value) to:
Fetch a list from host, indexed by key,
Replace its first element by value,
Push the changes back to the remote host.
However, step 3 generally does not happen: the local list is modified in place (this is the most efficient alternative, but it makes setf expansions somewhat lazy about updates). You could, though, use DEFINE-SETF-EXPANDER to define a new set of macros such that the whole round-trip is always performed.
Let me try to address your question by introducing some concepts.
I hope it helps you to consolidate your knowledge and to find your remaining answers about this subject.
The first concept is that of non-destructive versus destructive behavior.
A function that is non-destructive won't change the data passed to it.
A function that is destructive may change the data passed to it.
You can apply the (non-)destructive nature to something other than a single function. For instance, if a function stores the data passed to it somewhere, say in a object's slot, then the destructiveness depends on that object's behavior, its other operations, events, etc.
The convention for functions that immediately modify their arguments is to (usually) prefix them with n.
The convention doesn't work the other way around: there are many non-destructive functions that start with n (e.g. not/null, nth, ninth, notany, notevery, numberp, etc.). There are also notable exceptions, such as delete, merge, sort and stable-sort, which are destructive despite lacking the prefix. The only way to naturally grasp them is with time and experience; in the meantime, always refer to the HyperSpec whenever you see a function you don't know yet.
Moreover, you usually need to store the result of some destructive functions, such as delete and sort, because they may choose to skip the head of the list or not to be destructive at all. delete may even return nil, the empty list, which is not possible to obtain by modifying a cons.
The second concept is that of generalized reference.
A generalized reference is anything that can hold data, such as a variable, the car and cdr of a cons, the element locations of an array or hash table, the slots of an object, etc.
For each container data structure, you need to know the specific modifying function. However, for some generalized references, there might not be a function to modify it, such as a local variable, in which case there are still special forms to modify it.
As such, in order to modify any generalized reference, you need to know its modifying form.
Another concept closely related to generalized references is the place. A form that identifies a generalized reference is called a place. Or in other words, a place is the written way (form) that represents a generalized reference.
For each kind of place, you have a reader form and a writer form.
Some of these forms are documented, such as using the symbol of a variable to read it and setq a variable to write to it, or car/cdr to read from and rplaca/rplacd to write to a cons. Others are only documented to be accessors, such as aref to read from arrays; its writer form is not actually documented.
To get these forms, you have get-setf-expansion. It actually also returns a set of variables and their initializing forms (to be bound as if through let*) that may be used by the reader form and/or the writer form, and a set of variables (to be bound to the new values) that will be used by the writer form.
If you've used Lisp before, you've probably used setf. setf is a macro that generates code that runs within the scope (environment) of its expansion.
Essentially, it behaves as if by using get-setf-expansion, generating a let* form from the variables and initializing forms, generating extra bindings of the writer variables to the result of the value(s) form, and invoking the writer form within all this environment.
For instance, let's define a my-setf1 macro which takes only a single place and a single newvalue form:
(defmacro my-setf1 (place newvalue &environment env)
  (multiple-value-bind (vars vals store-vars writer-form reader-form)
      (get-setf-expansion place env)
    `(let* (,@(mapcar #'(lambda (var val)
                          `(,var ,val))
                      vars vals))
       ;; In case some vars are used only by reader-form
       (declare (ignorable ,@vars))
       (multiple-value-bind (,@store-vars)
           ,newvalue
         ,writer-form
         ;; Uncomment the next line to mitigate buggy writer-forms
         ;;(values ,@store-vars)
         ))))
You could then define my-setf as:
(defmacro my-setf (&rest pairs)
  `(progn
     ,@(loop
         for (place newvalue) on pairs by #'cddr
         collect `(my-setf1 ,place ,newvalue))))
There is a convention for such macros, which is to suffix with f, such as setf itself, psetf, shiftf, rotatef, incf, decf, getf and remf.
Again, the convention doesn't work the other way around, there are operators that end with f, such as aref, svref and find-if, which are functions, and if, which is a conditional execution special operator. And yet again, there are notable exceptions, such as push, pushnew, pop, ldb, mask-field, assert and check-type.
Depending on your point-of-view, many more operators are implicitly destructive, even if not effectively tagged as such.
For instance, every defining operator (e.g. the macros defun, defpackage, defclass, defgeneric, defmethod, the function load) changes either the global environment or a temporary one, such as the compilation environment.
Others, like compile-file, compile and eval, depend on the forms they'll execute. For compile-file, it also depends on how much it isolates the compilation environment from the startup environment.
Other operators, like makunbound, fmakunbound, intern, export, shadow, use-package, rename-package, adjust-array, vector-push, vector-push-extend, vector-pop, remhash, clrhash, shared-initialize, change-class, slot-makunbound, add-method and remove-method, are (more or less) clearly intended to have side-effects.
And it's this last concept that can be the widest. Usually, a side-effect is regarded as any observable variation in one environment. As such, functions that don't change data are usually considered free of side-effects.
However, this is ill-defined. You may consider that all code execution implies side-effects, depending on what you define to be your environment or on what you can measure (e.g. consumed quotas, CPU time and real time, used memory, GC overhead, resource contention, system temperature, energy consumption, battery drain).
NOTE: None of the example lists are exhaustive.

Setting parameters efficiently in S4 objects

I am writing a simulation model in R to track the behavior of a set of interacting agents. For my own sanity, I give each agent its own S4 object, in which I store its trajectory and other parameters. Currently, I pass an object to a function, do some operations, and pass the object back. For example,
#Create a new class and a sample object
setClass("example", slots = list(N="numeric"), prototype = list(N=0))
agentA<-new("example")
#Define a function to change the value in the N slot
add_one <- function(agent){
  agent@N <- agent@N + 1
  agent
}
#Call the function.
agentA <- add_one(agentA)
I know this works and it's really important to me that the structure is modular and easy to debug, but I'm wondering about the overhead of passing the agent object back and forth. Most of the objects contain arrays with a few thousand numbers, and they will get passed back and forth thousands of times. Is there a more efficient way to do it, or is this close enough to best practice?
I'm not really clear on how many times the object gets copied, versus when only a pointer is passed.

What's the most memory/processor efficient way to pass structs as parameters to functions for modification?

Are using pointers the most memory/processor efficient way to pass structs for modification like below or is there a better way?
Basically there are two ways to pass parameters to a function or method: by value and by address (pointer).
Passing a parameter by value makes a copy of the passed value, so if you modify it, you modify the copy and not the original value. So if you want to modify the original value, that leaves you only the pass-by-address option.
Notes:
Note that you could also pass by value, return the modified copy, and assign the returned value back to the variable, but obviously this is less efficient, especially if the struct is large (contains many fields).
In rare cases there might be more efficient ways to pass a value for modification, but I would rather name these cases "denote" rather than pass. Let's assume you have a global variable, being a slice of structs. In this case you could just pass the index of the value in the slice you want to modify. And your function could just do the modification of the field of the element denoted by the passed index value. If you just want to modify 1 field, this may be faster and on 32-bit architecture the size of the index value may be smaller than the pointer, and this way you could spare the address taking and dereferencing operations (needs benchmark). But the usability of the function would drop drastically, so I don't recommend this.
That leaves passing by pointer the optimal way in case you need to modify it.
In Go, using pointers to pass structs for modification is the idiomatic way to do it. There are no reference variables as there are in other languages. So passing a pointer is the most efficient way to do it now and since it is idiomatic, will most likely be so in future versions of Go as well.

What is the complexity of std::vector<T>::clear() when T is a primitive type?

I understand that the complexity of the clear() operation is linear in the size of the container, because the destructors must be called. But what about primitive types (and POD)? It seems the best thing to do would be to set the vector size to 0, so that the complexity is constant.
If this is possible, is it also possible for std::unordered_map?
It seems the best thing to do would be to set the vector size to 0, so that the complexity is constant.
In general, the complexity of resizing a vector to zero is linear in the number of elements currently stored in the vector. Therefore, setting vector's size to zero offers no advantage over calling clear() - the two are essentially the same.
However, at least one implementation (libstdc++, source in bits/stl_vector.h) gives you an O(1) complexity for primitive types by employing partial template specialization.
The implementation of clear() navigates its way to the std::_Destroy(from, to) function in bits/stl_construct.h, which performs a non-trivial compile-time optimization: it declares an auxiliary template class _Destroy_aux with the template parameter of type bool. The class has a partial specialization for true and an explicit specialization for false. Both specializations define a single static function called __destroy. In case the template parameter is true, the function body is empty; in case the parameter is false, the body contains a loop invoking T's destructor by calling std::_Destroy(ptr).
The trick comes on line 136:
std::_Destroy_aux<__has_trivial_destructor(_Value_type)>::
__destroy(__first, __last);
The auxiliary class is instantiated based on the result of the __has_trivial_destructor check. The checker returns true for built-in types, and false for types with non-trivial destructor. As the result, the call to __destroy becomes a no-op for int, double, and other POD types.
The std::unordered_map is different from the vector in that it may need to delete structures that represent "hash buckets" of POD objects, as opposed to deleting objects themselves*. The optimization of clear to O(1) is possible, but it is heavily dependent on the implementation, so I would not count on it.
* The exact answer depends on the implementation: hash tables implementing collision resolution based on open addressing (linear probing, quadratic probing, etc.) may be able to delete all buckets in O(1); implementations based on separate chaining would have to delete buckets one-by-one, though.
gcc's version of std::_Destroy, which is what is eventually used by clear(), tries to template on whether the type has a trivial destructor, so in that case the complexity is constant even without an optimisation pass. However, I don't know how well the template works.
