How to pass a pointer to a subarray with BLAS function?

I am using Julia v0.3.5, which comes with WinPython 3.4.2.5 build 4. I am new to Julia. I am testing how fast Julia is compared to using SciPy's BLAS wrapper for ddot(), which has the following arguments: x, y, n, offx, incx, offy, incy. Julia's OpenBLAS library does not have the offset arguments, so I am trying to figure out how to emulate them while maximizing speed. I am passing 100MB subarrays of a 1GB array (vector) multiple times, so I don't want Julia to create a copy of each subarray, which would reduce the speed. Python's SciPy function is taking a couple of hours to execute, so I would like to optimize Julia's speed. I have been reading about how Julia 0.4 will offer array views that avoid the unnecessary copy, but I am unclear about how Julia 0.3.5 handles this.
So far, I learned using the REPL that the BLAS dot() function conflicts with the method in linalg/matmul.jl. Therefore, I learned to access it this way:
import Base.LinAlg.BLAS
methods(Base.LinAlg.BLAS.dot)
From the method display, I see that I can pass pointers to x and y subarrays and thus avoid a copy. For example:
x = [1., 2., 3.]
y = [4., 5., 6.]
Base.LinAlg.BLAS.dot(2, pointer(x), 1, pointer(y), 1)
However, when I add an integer offset to a pointer (to access a subarray), REPL crashes.
How can I pass a pointer to a subarray or a subarray to Base.LinAlg.BLAS.dot without the slowdown of a copy of that subarray?
Anything else I missed?

It segfaults because pointer arithmetic doesn't work the way you probably think it does (i.e. the C way). pointer(x)+1 is one byte after pointer(x), but you probably want pointer(x)+8, e.g.
Base.LinAlg.BLAS.dot(2, pointer(x)+1*sizeof(Float64), 1, pointer(y)+1*sizeof(Float64), 1)
or, more user-friendly and recommended:
Base.LinAlg.dot(x,2:3,y,2:3)
which is defined in Julia's linalg/matmul.jl.
I'd say using pointers like that in Julia is really not recommended, but I imagine if you are doing this at all then it is a special circumstance.
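For concreteness, here is a minimal sketch (Julia 0.3-era API; the helper name ddot_sub is illustrative, not a library function) that emulates SciPy's offx/offy arguments by scaling element offsets to byte offsets:
import Base.LinAlg.BLAS

# Pointer arithmetic is in bytes, so scale element offsets by sizeof(Float64).
# Note: x and y must stay alive while their pointers are in use.
function ddot_sub(x::Vector{Float64}, offx::Int, y::Vector{Float64}, offy::Int, n::Int)
    BLAS.dot(n, pointer(x) + offx*sizeof(Float64), 1,
                pointer(y) + offy*sizeof(Float64), 1)
end

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
ddot_sub(x, 1, y, 1, 2)  # dot(x[2:3], y[2:3]) == 2*5 + 3*6 == 28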

Related

Rf_allocVector only allocates and does not zero out memory

The original motivation behind this is that I have a dynamically sized array of floats that I want to pass to R through Rcpp without incurring either the cost of zeroing it out or the cost of a deep copy.
Originally I had thought that there might be some way to take a heap-allocated array, make it known to R's GC system and then wrap it with other data to create an "Rcpp::NumericVector", but it seems that that's not possible - or at least not doable with my current knowledge.
However, and correct me if I'm wrong, it looks like simply constructing a NumericVector with a size N and then using it as an N-sized allocation will call R.h's Rf_allocVector, which itself does not zero out the allocated array - I tested it with a small C program that gets dyn.load()ed into R, and I see garbage values. I also took a peek at the assembly and there doesn't seem to be any zeroing out.
Can anyone confirm this or offer any alternate solution?
Welcome to StackOverflow.
You tagged this rcpp, but Rf_allocVector is a function from the C API of R -- whereas the Rcpp API offers you constructors which do in fact set the memory to zero:
> Rcpp::cppFunction("NumericVector goodVec(int n) { return NumericVector(n); }")
> sum(goodVec(1e7))
[1] 0
>
This creates a dynamically allocated vector using R's memory functions. The vector is indistinguishable from R's own, and it has its memory set to zero because we use R_Calloc, which is documented in Writing R Extensions as setting the memory to zero. (We may also use memcpy() explicitly; you can check the sources.)
So in short, you have just confused yourself over what the C API of R and the Rcpp API each offer, and which is easiest to use when. Keep reading documentation, running and writing examples, and studying existing code. It's all out there!

How to speed up writing to a matrix in a reference class in R

Here is a piece of R code that writes to each element of a matrix in a reference class. It runs incredibly slowly, and I’m wondering if I’ve missed a simple trick that will speed this up.
nx = 2000
ny = 10
ref_matrix <- setRefClass(
  "ref_matrix",
  fields = list(data = "matrix")
)
out <- ref_matrix(data = matrix(0.0, nx, ny))
# tracemem(out$data)
for (iy in 1:ny) {
  for (ix in 1:nx) {
    out$data[ix, iy] <- ix + iy
  }
}
It seems that each write to an element of the matrix triggers a check that involves a copy of the entire matrix. (Uncommenting the tracemem() call shows this.) Now, I’ve found a discussion that seems to confirm this:
https://r-devel.r-project.narkive.com/8KtYICjV/rd-copy-on-assignment-to-large-field-of-reference-class
and this also seems to be covered by Speeding up field access in R reference classes
but in both of those cases the behaviour can be bypassed by not declaring a class for the field. This works for the example in the first link, which uses a 1D vector, b, that can just be set as b <<- 1:10000. But I’ve not found an equivalent way of creating a 2D array without using an explicit “matrix” instance.
Am I just missing something simple, or is this actually not possible?
Let me add a couple of things. First, I’m very new to R, so could easily have missed something. Second, I’m really just curious about the way reference classes work in this case and whether there’s a simple way to use them efficiently; I’m not looking for a really fast way to set the elements of a matrix - I can do that by not having the matrix in a reference class at all, and if I really care about speed I can write a C routine to do it and call it from R.
Here’s some background that might explain why I’m interested in this, which you’re welcome to ignore.
I got here by wanting to see how different languages, and even different compiler options and different ways of coding the same operation, compared for efficiency when accessing 2D rectangular arrays. I’ve been playing with a test program that creates two 2D arrays of the same size, and calls a subroutine that sets the first to the elements of the second plus their index values. (Almost any operation would do, but this one isn’t completely trivial to optimise.) I have this in a number of languages now, C, C++, Julia, Tcl, Fortran, Swift, etc., even hand-coded assembler (spoiler alert: assembler isn’t worth the effort any more) and thought I’d try R. The obvious implementation in R passes the two arrays to a subroutine that does the work, but because R doesn’t normally pass by reference, that routine has to make a copy of the modified array and return that as the function value. I thought using a reference class would avoid the relatively minor overhead of that copy, so I tried that and was surprised to discover that, far from speeding things up, it slowed them down enormously.
Use outer:
out$data <- outer(1:nx, 1:ny, `+`)
Also, don't use reference classes (or R6 classes) unless you actually need reference semantics. KISS and all that.

The best way to write code in Julia working on GPUs via ArrayFire

In Julia, I have seen that to accelerate and optimize code that works on a matrix, it is better to, e.g.:
-work by columns instead of by rows, because of the way Julia stores matrices (column-major).
-in loops, use the @inbounds and @simd macros.
-any other function, macro, or method you could recommend is welcome :D
But it seems that the above tips do not help when I use the ArrayFire package with a matrix stored on the GPU: similar code on the CPU and GPU does not seem to favor the GPU, which runs much slower in some cases. I don't think it should be like that; I suspect the problem is in the way the code is written. Any help will be welcome.
GPU computing should be done on optimized GPU kernels as much as possible. Indexing a GPU array is a small kernel that copies one value back to the CPU. This is really really bad for performance, so you should almost never index a GPUArray unless you have to (this is true for any implementation! It's just a hardware problem!)
Thus, instead of writing looping code for GPUs, you should write broadcasting ("vectorized") code. With the v0.6 broadcast changes, broadcasted operations are nearly as efficient as loops anyway (unless you hit this bug), so there's no reason to avoid them in generic code. However, there are cases where broadcasting is faster than looping, and GPUs are one big case.
Let me explain a little bit why. When you write the code:
@. A = B*C + D*E
it lowers to
A .= B.*C .+ D.*E
which then lowers to:
broadcast!((b,c,d,e)->b*c + d*e,A,B,C,D,E)
Notice that in there you have a fused anonymous function for the entire broadcast. For GPUArrays, this is then overridden so that a single GPU kernel is automatically created that performs this fused operation element-wise. Thus only one GPU kernel is required to do the whole operation! Notice that this is even more efficient than the R/Python/MATLAB way of doing GPU computing, since those vectorized forms have temporaries and would require 4 kernels here, whereas this has no temporary arrays and is a single kernel, which is pretty much exactly how you'd write it if you were writing the kernel yourself. So if you exploit broadcast, then your GPU calculations will be fast.
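To make that concrete, here is a hedged sketch assuming the ArrayFire.jl package (the array names and sizes are arbitrary):
using ArrayFire

B = AFArray(rand(Float32, 1000, 1000)); C = AFArray(rand(Float32, 1000, 1000))
D = AFArray(rand(Float32, 1000, 1000)); E = AFArray(rand(Float32, 1000, 1000))

# Good: the dotted expression fuses into a single elementwise GPU kernel.
A = B .* C .+ D .* E

# Bad: scalar indexing launches a tiny kernel and copies one value back
# to the CPU on every access, so never loop over A[i] like this:
# s = 0.0f0
# for i in 1:length(A)
#     s += A[i]
# end

Ahost = Array(A)  # transfer the result back to the CPU once, at the end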

How can I retrieve an object by id in Julia

In Julia, say I have an object_id for a variable but have forgotten its name, how can I retrieve the object using the id?
I.e. I want the inverse of some_id = object_id(some_object).
As @DanGetz says in the comments, object_id is a hash function and is designed not to be invertible. @phg is also correct that ObjectIdDict is intended precisely for this purpose (it is documented although not discussed much in the manual):
ObjectIdDict([itr])
ObjectIdDict() constructs a hash table where the keys are (always)
object identities. Unlike Dict it is not parameterized on its key and
value type and thus its eltype is always Pair{Any,Any}.
See Dict for further help.
In other words, it hashes objects by === using object_id as a hash function. If you have an ObjectIdDict and you use the objects you encounter as the keys into it, then you can keep them around and recover those objects later by taking them out of the ObjectIdDict.
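A minimal sketch of that pattern (the names seen, by_id, and remember! are illustrative, not part of any API):
seen = ObjectIdDict()     # keyed by object identity (===)
by_id = Dict{UInt,Any}()  # object_id(obj) => obj

function remember!(obj)
    seen[obj] = true                 # holding a reference keeps obj alive for the GC
    by_id[object_id(obj)] = obj
    return object_id(obj)
end

v = [1, 2, 3]
id = remember!(v)
by_id[id] === v  # true: the original object, not a copy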
However, it sounds like you want to do this without the explicit ObjectIdDict just by asking which object ever created has a given object_id. If so, consider this thought experiment: if every object were always recoverable from its object_id, then the system could never discard any object, since it would always be possible for a program to ask for that object by ID. So you would never be able to collect any garbage, and the memory usage of every program would rapidly expand to use all of your RAM and disk space. This is equivalent to having a single global ObjectIdDict which you put every object ever created into. So inverting the object_id function that way would require never deallocating any objects, which means you'd need unbounded memory.
Even if we had infinite memory, there are deeper problems. What does it mean for an object to exist? In the presence of an optimizing compiler, this question doesn't have a clear-cut answer. It is often the case that an object appears, from the programmer's perspective, to be created and operated on, but in reality – i.e. from the hardware's perspective – it is never created. Consider this function which constructs a complex number and then uses it for a simple computation:
julia> function foo(y::Real)
           z = Complex(0,y)
           w = 2z*im
           return real(w)
       end
foo (generic function with 1 method)
julia> foo(123)
-246
From the programmer's perspective, this constructs the complex number z and then constructs 2z, then 2z*im, and finally constructs real(2z*im) and returns that value. So all of those values should be inserted into the "Great ObjectIdDict in the Sky". But are they really constructed? Here's the LLVM code for this function applied to an Int:
julia> @code_llvm foo(123)

define i64 @julia_foo_60833(i64) #0 !dbg !5 {
top:
  %1 = shl i64 %0, 1
  %2 = sub i64 0, %1
  ret i64 %2
}
No Complex values are constructed at all! Instead, all of the work is inlined and eliminated instead of actually being done. The whole computation boils down to just doubling the argument (by shifting it left one bit) and negating it (by subtracting it from zero). This optimization can be done first and foremost because the intermediate steps have no observable side effects. The compiler knows that there's no way to tell the difference between actually constructing complex values and operating on them and just doing a couple of integer ops – as long as the end result is always the same. Implicit in the idea of a "Great ObjectIdDict in the Sky" is the assumption that all objects that seem to be constructed actually are constructed and inserted into a large, permanent data structure – which is a massive side effect. So not only is recovering objects from their IDs incompatible with garbage collection, it's also incompatible with almost every conceivable program optimization.
The only other way one could conceive of inverting object_id would be to compute its inverse image on demand instead of saving objects as they are created. That would solve both the memory and optimization problems. Of course, it isn't possible since there are infinitely many possible objects but only a finite number of object IDs. You are vanishingly unlikely to actually encounter two objects with the same ID in a program, but the finiteness of the ID space means that inverting the hash function is impossible in principle since the preimage of each ID value contains an infinite number of potential objects.
I've probably refuted the possibility of an inverse object_id function far more thoroughly than necessary, but it led to some interesting thought experiments, and I hope it's been helpful – or at least thought provoking. The practical answer is that there is no way to get around explicitly stashing every object you might want to get back later in an ObjectIdDict.

fast apply_along_axis equivalent in Julia

Is there an equivalent to numpy's apply_along_axis() (or R's apply()) in Julia? I've got a 3D array and I would like to apply a custom function to each pair of coordinates of dimensions 1 and 2. The results should be in a 2D array.
Obviously, I could do two nested for loops iterating over the first and second dimension and then reshape, but I'm worried about performance.
This example produces the output I desire (I am aware this is slightly pointless for sum(); it's just a dummy here):
test = reshape(collect(1:250), 5, 10, 5)
a = []
for i in 1:5
    for j in 1:10
        push!(a, sum(test[i,j,:]))
    end
end
println(reshape(a, 5, 10))
Any suggestions for a faster version?
Cheers
Julia has the mapslices function which should do exactly what you want. But keep in mind that Julia is different from other languages you might know: library functions are not necessarily faster than your own code, because they may be written to a level of generality higher than what you actually need, and in Julia loops are fast. So it's quite likely that just writing out the loops will be faster.
That said, a couple of tips:
Read the performance tips section of the manual. From that you'd learn to put everything in a function, and to not use untyped arrays like a = [].
The slice or sub function can avoid making a copy of the data; see the sketch below.
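A hedged sketch of both tips on the question's array (Julia 0.x-era syntax; slice and squeeze were renamed in later versions):
test = reshape(collect(1:250), 5, 10, 5)

# mapslices applies sum along dimension 3; the result is a 5x10x1 array.
b = mapslices(sum, test, 3)

# Loop version: everything in a function, a typed output array, and slice()
# to avoid copying each test[i,j,:].
function rowcol_sums(A)
    out = zeros(Int, size(A, 1), size(A, 2))
    for j in 1:size(A, 2), i in 1:size(A, 1)
        out[i, j] = sum(slice(A, i, j, :))
    end
    return out
end

rowcol_sums(test) == squeeze(b, 3)  # true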
How about
f = sum # your function here
Int[f(test[i, j, :]) for i in 1:5, j in 1:10]
The last line is a two-dimensional array comprehension.
The Int in front is to guarantee the type of the elements; this should not be necessary if the comprehension is inside a function.
Note that you should (almost) never use untyped (Any) arrays, like your a = [], since this will be slow. You can write a = Int[] instead to create an empty array of Ints.
EDIT: Note that in Julia, loops are fast. The need for creating functions like that in Python and R comes from the inherent slowness of loops in those languages. In Julia it's much more common to just write out the loop.
