I wish to use indexing to populate a pre-initialized matrix with the results of my array for-loop output:
A = Float64.(reshape(1.0:81.0,9,9))
# initialize output
B = zeros(Int64, 2, 9)
j = 1
for j in 1:size(A,2) # loop cols
    out = [sum(A[:,j]), j]
    out = reshape(out, 2, 1) # make column to append
    # append out to B
    global B = hcat(B, out) # this grows...
end
I initialized B = zeros(Int64, 2, 9), the same dims as the expected output of the sum operation.
In my real-world example I am iterating through j columns and i rows, so the output is an array. Rather than using hcat() to append the array to my output, can I do it with indexing?
In the above it uses hcat(), which appends to the existing B so that it grows. I have since tried initializing with 2 rows and 0 cols so that hcat() builds to the correct output dims:
B = zeros(Int64, 2, 0)
I doubt that hcat() will be memory efficient (excuse the global, for example's sake). If I can't do it with indexing, I can populate it with another inner loop at my [i,j]. But perhaps someone has a way I can append an array as a column to an existing pre-initialized output?
The recommendation is to pre-allocate B and fill it in place. I wrap the code in a function, as that simplifies benchmarking:
function f2()
    A = reshape(1:81, 9, 9)
    B = zeros(Int64, 2, 9 + size(A,2)) # keep the 9 zero columns so the output matches f1
    for j in 1:size(A,2) # loop cols
        B[:, j + 9] .= (sum(view(A, :, j)), j)
    end
    B
end
Your old code is:
function f1()
    A = Float64.(reshape(1.0:81.0, 9, 9))
    B = zeros(Int64, 2, 9)
    j = 1
    for j in 1:size(A,2) # loop cols
        out = [sum(A[:,j]), j]
        out = reshape(out, 2, 1) # make column to append
        # append out to B
        B = hcat(B, out)
    end
    B
end
And here is a comparison:
julia> @btime f1()
8.567 μs (83 allocations: 7.72 KiB)
2×18 Array{Float64,2}:
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 45.0 126.0 207.0 288.0 369.0 450.0 531.0 612.0 693.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
julia> @btime f2()
73.662 ns (1 allocation: 368 bytes)
2×18 Array{Int64,2}:
0 0 0 0 0 0 0 0 0 45 126 207 288 369 450 531 612 693
0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9
And you can see that the difference is very significant.
Some more minor comments to your original code:
there is no need to call Float64. on reshape(1.0:81.0,9,9); the object already holds Float64 values
in your code there was an inconsistency: initially B held Int64 and A held Float64. I have made this consistent (I chose Int64, but you could equally well use Float64)
sum(A[:,j]) unnecessarily allocates a new object; it is faster to use a view
you did not have to call reshape(out,2,1) on out before hcat, as vectors are already treated as columnar objects
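A minimal sketch of the last two points (the variable names are illustrative):

```julia
A = reshape(1.0:81.0, 9, 9)

# summing over a view avoids allocating a temporary copy of the column
s = sum(view(A, :, 1))          # 45.0

# hcat already treats a plain Vector as a column; no reshape needed
B = zeros(2, 0)
B = hcat(B, [s, 1.0])           # B is now a 2x1 matrix
```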
I have a function for dynamic scheduling and I can use it for simple array and functions, for example I can use it for this code:
scheduling:
@everywhere function pmap(f, lst)
    np = nprocs()
    n = length(lst)
    results = Vector{Any}(n)
    i = 1
    nextidx() = (idx = i; i += 1; idx)
    @sync begin
        for p = 1:np
            if p != myid() || np == 1
                @async begin    # @async, not @sync: the feeder tasks must run concurrently
                    while true
                        idx = nextidx()
                        if idx > n
                            break
                        end
                        results[idx] = remotecall_fetch(f, p, lst[idx])
                    end
                end
            end
        end
    end
    results
end
function:
@everywhere f(x)=x+1
f (generic function with 1 method)
array:
julia> arrays=SharedArray{Float64}(10)
10-element SharedArray{Float64,1}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
julia> arrays=[1 2 3 4 5 6 7 8 9 10]
1×10 Array{Int64,2}:
1 2 3 4 5 6 7 8 9 10
result:
@everywhere function fsum(x)
    x + 1
end
pmap(fsum,arrays)
10-element Array{Any,1}:
2
3
4
5
6
7
8
9
10
11
My question is: if I had this function and these arrays, how should I use the scheduling function?
function:
@everywhere f(x,y)=x.+y
julia> x=SharedArray{Float64}(2,2)
2×2 SharedArray{Float64,2}:
0.0 0.0
0.0 0.0
julia> y=SharedArray{Float64}(2,2)
2×2 SharedArray{Float64,2}:
0.0 0.0
0.0 0.0
julia> x=[1 2;3 4]
2×2 Array{Int64,2}:
1 2
3 4
julia> y=[6 7;8 9]
2×2 Array{Int64,2}:
6 7
8 9
I wanted to call it by pmap(f,x,y) but I got this error:
ERROR: MethodError: no method matching pmap(::#f, ::Array{Int64,2}, ::Array{Int64,2})
You may have intended to import Base.pmap
Closest candidates are:
pmap(::Any, ::Any) at REPL[1]:2
And I have another question too: how can we be sure our problem is running in different processes? How can we monitor it?
The built-in pmap maps over multiple collections in parallel, and so this works:
f(x,y) = x+y; pmap(f,1:5,6:10)
You probably re-defined pmap with the single-collection version from the OP, which cannot accept two collections and thus fails. You do not need to write your own here: if you just use the built-in version it will work.
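As for the monitoring question: a simple way to see which process computed each element is to return myid() alongside the result. A sketch (assumes workers have been added with addprocs; on recent Julia these functions live in the Distributed standard library):

```julia
using Distributed
addprocs(2)

@everywhere g(x, y) = (myid(), x + y)

# each tuple's first entry is the id of the worker that computed it
pmap(g, 1:5, 6:10)
```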
Is there a function to create lagged variables in Julia without resorting any packages?
Specifically, I want to emulate the R's embed function in Julia.
> embed(1:8, 3)
[,1] [,2] [,3]
[1,] 3 2 1
[2,] 4 3 2
[3,] 5 4 3
[4,] 6 5 4
[5,] 7 6 5
[6,] 8 7 6
After a couple of hours of browsing the Julia manual, I gave up looking for a suitable function in Julia. This ugly function (by R standards) is what I have so far. Is there a built-in function, or any room for improvement?
julia> function embed(x, k)
           n = length(x)
           m = zeros(n - k + 1, k)
           for i in 1:k
               m[:, i] = x[(k-i+1):(n-i+1)]
           end
           return m
       end
embed (generic function with 1 method)
julia> embed(1:8,3)
6x3 Array{Float64,2}:
3.0 2.0 1.0
4.0 3.0 2.0
5.0 4.0 3.0
6.0 5.0 4.0
7.0 6.0 5.0
8.0 7.0 6.0
You can swap zeros for cell to skip the zero initialization. You can also do
embed(x,k) = hcat([x[i+k-1:-1:i] for i in 1:length(x)-k+1]...)'
Explanation:
Create the reversed slices using x[i+k-1:-1:i] inside the comprehension
Take that list of slices and make it the arguments of hcat by splatting with ...
Concatenate the slices (passed as arguments)
Transpose the result using '
EDIT: Assuming length(x) ⋙ k, you can also use:
embed(x,k) = hcat([x[k-i+1:length(x)-i+1] for i in 1:k]...)
which gives the same results, but iterates less and thus does fewer allocations.
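If allocations matter, a pre-allocating variant in the spirit of the original loop is also possible. This is a sketch (embed_prealloc is an illustrative name; the undef constructor is current-Julia syntax):

```julia
function embed_prealloc(x, k)
    n = length(x)
    m = Matrix{eltype(x)}(undef, n - k + 1, k)   # skip the zero fill
    for i in 1:k
        # copy each lagged slice into its column without a temporary
        m[:, i] .= @view x[(k - i + 1):(n - i + 1)]
    end
    return m
end

embed_prealloc(1:8, 3)   # same values as R's embed(1:8, 3), with Int eltype
```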
I have a data frame in R that contains 2 columns named x and y (co-ordinates). The data frame represents a journey with each line representing the position at the next point in time.
x y seconds
1 0.0 0.0 0
2 -5.8 -8.5 1
3 -11.6 -18.2 2
4 -16.9 -30.1 3
5 -22.8 -40.8 4
6 -29.0 -51.6 5
I need to break the journey up into segments where each segment starts once the distance from the start of the previous segment crosses a certain threshold (e.g. 200).
I have recently switched from using SAS to R, and this is the first time I've come across anything I can do easily in SAS but can't even think of the way to approach the problem in R.
I've posted the SAS code I would use below to do the same job. It creates a new column called segment.
%let cutoff=200;
data segments;
  set journey;
  retain segment distance x_start y_start;
  if _n_=1 then do;
    x_start=x;
    y_start=y;
    segment=1;
    distance=0;
  end;
  distance + sqrt((x-x_start)**2+(y-y_start)**2);
  if distance>&cutoff then do;
    x_start=x;
    y_start=y;
    segment+1;
    distance=0;
  end;
  keep x y seconds segment;
run;
Edit: Example output
If the cutoff were 200 then an example of required output would look something like...
x y seconds segment
1 0.0 0.0 0 1
2 40.0 30.0 1 1
3 80.0 60.0 2 1
4 120.0 90.0 3 1
5 160.0 120.0 4 2
6 120.0 150.0 5 2
7 80.0 180.0 6 2
8 40.0 210.0 7 2
9 0.0 240.0 8 3
If your data set is dd, something like
cutoff <- 200
origin <- dd[1,c("x","y")]
cur.seg <- 1
dd$segment <- NA
for (i in 1:nrow(dd)) {
  dist <- sqrt(sum((dd[i,c("x","y")]-origin)^2))
  if (dist>cutoff) {
    cur.seg <- cur.seg+1
    origin <- dd[i,c("x","y")]
  }
  dd$segment[i] <- cur.seg
}
should work. There are some refinements: it might be more efficient to compute the distances from the current origin to all rows, then use which(dist>cutoff)[1] to jump to the first row that goes beyond the cutoff. It would also be interesting to come up with a completely vectorized solution, but this should be OK. How big is your data set?
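That which(dist > cutoff)[1] refinement might be sketched as follows (segment_journey is an illustrative name; assumes dd has numeric x and y columns). It labels each crossing row as the start of the next segment, matching the loop above:

```r
segment_journey <- function(dd, cutoff = 200) {
  dd$segment <- NA
  start <- 1
  seg <- 1
  while (start <= nrow(dd)) {
    # distances of every row from the current segment origin
    d <- sqrt((dd$x - dd$x[start])^2 + (dd$y - dd$y[start])^2)
    cand <- which(d > cutoff)
    cand <- cand[cand > start]     # only rows after the origin count
    last <- if (length(cand) == 0) nrow(dd) else cand[1] - 1
    dd$segment[start:last] <- seg  # label the whole segment at once
    seg <- seg + 1
    start <- last + 1
  }
  dd
}
```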
I am creating a distance matrix using the data from a data frame in R.
My data frame has the temperature of 2244 locations:
plot temperature
A 12
B 12.5
C 15
... ...
I would like to create a matrix that shows the temperature difference between each pair of locations:
. A B C
A 0 0.5 3
B 0.5 0 2.5
C 3 2.5 0
This is what I have come up with in R:
temp_data #my data frame with the two columns: location and temperature
temp_dist<-matrix(data=NA, nrow=length(temp_data[,1]), ncol=length(temp_data[,1]))
temp_dist<-as.data.frame(temp_dist)
names(temp_dist)<-as.factor(temp_data[,1]) #the locations are numbers in my data
rownames(temp_dist)<-as.factor(temp_data[,1])
for (i in 1:2244)
{
  for (j in 1:2244)
  {
    temp_dist[i,j] <- abs(temp_data[i,2]-temp_data[j,2])
  }
}
I have tried the code with a small sample with:
for (i in 1:10)
and it works fine.
My problem is that the computer has been running now for two full days and it hasn't finished.
I was wondering if there is a way of doing this more quickly. I am aware that nested loops take a long time, and since I am trying to fill in a matrix of more than 5 million cells it makes sense that it takes so long. But I am hoping there is a formula that gets the same result faster, as I have to do the same with precipitation and other variables.
I have also read about dist, but I am unsure if with the data frame I have I can use that formula.
I would very much appreciate your collaboration.
Many thanks.
Are you perhaps just looking for the following?
out <- dist(temp_data$temperature, upper=TRUE, diag=TRUE)
out
# 1 2 3
# 1 0.0 0.5 3.0
# 2 0.5 0.0 2.5
# 3 3.0 2.5 0.0
If you want different row/column names, it seems you have to convert this to a matrix first:
out_mat <- as.matrix(out)
dimnames(out_mat) <- list(temp_data$plot, temp_data$plot)
out_mat
# A B C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0
Or just as an alternative from the toolbox:
m <- with(temp_data, abs(outer(temperature, temperature, "-")))
dimnames(m) <- list(temp_data$plot, temp_data$plot)
m
#     A   B   C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0
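A rough note on scale: for 2244 locations a full double matrix takes about 2244^2 * 8 bytes, around 40 MB, while the dist object stores only one triangle, roughly half that. A quick sketch to compare (with stand-in data):

```r
n <- 2244
temperature <- runif(n, 10, 30)        # stand-in temperatures

d <- dist(temperature)                 # stores n*(n-1)/2 values
m <- abs(outer(temperature, temperature, "-"))  # full n x n matrix

print(object.size(d))                  # roughly half the size of m
print(object.size(m))
```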
I need to convert an R data.frame object into a SpatialPointsDataFrame object in order to run spatial statistics functions on the data. However, for some reason converting a data.frame into a SpatialPointsDataFrame gives unexpected behavior when running certain functions on the converted object.
In this example I try to run the head() function on the resulting SpatialPointsDataFrame.
Why does the function head() fail on some SpatialPointsDataFrame objects?
Here is the code to reproduce the behavior.
Example 1, no error:
#beginning of r code
#load S Classes and Methods for Spatial Data package "sp"
library(sp)
#Load an example dataset that contains geographic coordinates
data(meuse)
#check the structure of the data, it is a data.frame
str(meuse)
#>'data.frame': 155 obs. of 14 variables: ...
#with coordinates x,y
#Convert the data into a SpatialPointsDataFrame, by function coordinates()
coordinates(meuse) <- c("x", "y")
#check structure, seems ok
str(meuse)
#Check first rows of the data
head(meuse)
#It worked!
#Now create a small own dataset
testgeo <- as.data.frame(cbind(1:10,1:10,1:10))
#set colnames
colnames(testgeo) <- c("x", "y", "myvariable")
#convert to SpatialPointsDataFrame
coordinates(testgeo) <- c("x", "y")
#Seems ok
str(testgeo)
#But try running for instance head()
head(testgeo)
#Resulting output: Error in `[.data.frame`(x#data, i, j, ..., drop = FALSE) :
#undefined columns selected
#end of example code
There is some difference between the two example datasets that I do not understand, and str() does not seem to reveal it.
Why does the function head() fail on the dataset testgeo?
And why does head() work when I add more columns? 10 seems to be the limit:
testgeo <- as.data.frame(cbind(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10))
coordinates(testgeo) <- c("V1", "V2")
head(testgeo)
There is no specific head method for SpatialPoints/PolygonsDataFrames, so when you call head(testgeo) or head(meuse) it falls through to the default method:
> getAnywhere("head.default")
A single object matching ‘head.default’ was found
It was found in the following places
registered S3 method for head from namespace utils
namespace:utils
with value
function (x, n = 6L, ...)
{
    stopifnot(length(n) == 1L)
    n <- if (n < 0L)
        max(length(x) + n, 0L)
    else min(n, length(x))
    x[seq_len(n)]
}
<bytecode: 0x97dee18>
<environment: namespace:utils>
What this does is return x[seq_len(n)], but for these spatial classes, square-bracket indexing with a single index selects columns:
> meuse[1]
coordinates cadmium
1 (181072, 333611) 11.7
2 (181025, 333558) 8.6
3 (181165, 333537) 6.5
4 (181298, 333484) 2.6
5 (181307, 333330) 2.8
6 (181390, 333260) 3.0
7 (181165, 333370) 3.2
8 (181027, 333363) 2.8
9 (181060, 333231) 2.4
10 (181232, 333168) 1.6
> meuse[2]
coordinates copper
1 (181072, 333611) 85
2 (181025, 333558) 81
3 (181165, 333537) 68
4 (181298, 333484) 81
5 (181307, 333330) 48
6 (181390, 333260) 61
7 (181165, 333370) 31
8 (181027, 333363) 29
9 (181060, 333231) 37
10 (181232, 333168) 24
So when you do head(meuse) it tries to get meuse[1] to meuse[6], which exist because meuse has lots of columns.
But testgeo doesn't. So it fails.
The real fix might be to write a head.SpatialPointsDataFrame that goes:
> head.SpatialPointsDataFrame = function(x,n=6,...){x[1:n,]}
so that:
> head(meuse)
coordinates cadmium copper lead zinc elev dist om ffreq soil
1 (181072, 333611) 11.7 85 299 1022 7.909 0.00135803 13.6 1 1
2 (181025, 333558) 8.6 81 277 1141 6.983 0.01222430 14.0 1 1
3 (181165, 333537) 6.5 68 199 640 7.800 0.10302900 13.0 1 1
4 (181298, 333484) 2.6 81 116 257 7.655 0.19009400 8.0 1 2
5 (181307, 333330) 2.8 48 117 269 7.480 0.27709000 8.7 1 2
6 (181390, 333260) 3.0 61 137 281 7.791 0.36406700 7.8 1 2
lime landuse dist.m
1 1 Ah 50
2 1 Ah 30
3 1 Ah 150
4 0 Ga 270
5 0 Ah 380
6 0 Ga 470
> head(testgeo)
coordinates myvariable
1 (1, 1) 1
2 (2, 2) 2
3 (3, 3) 3
4 (4, 4) 4
5 (5, 5) 5
6 (6, 6) 6
The underlying problem here is that the spatial classes don't inherit from data.frame, so they don't behave like data frames.
head(meuse) didn't give you the first few rows of the dataset meuse but its first few columns (6, plus the coordinates column).
Your dataset testgeo has only 1 data column, so head(testgeo) fails. However, head(testgeo, 1) works:
head(testgeo,1)
coordinates myvariable
1 (1, 1) 1
2 (2, 2) 2
3 (3, 3) 3
4 (4, 4) 4
5 (5, 5) 5
6 (6, 6) 6
7 (7, 7) 7
8 (8, 8) 8
9 (9, 9) 9
10 (10, 10) 10
Why columns are selected instead of rows is unknown to me, but if you want to see the first few rows of testgeo you can use the more traditional:
testgeo[1:5, ]
coordinates myvariable
1 (1, 1) 1
2 (2, 2) 2
3 (3, 3) 3
4 (4, 4) 4
5 (5, 5) 5
sp now has a head method for all Spatial objects, implemented as
> sp:::head.Spatial
function (x, n = 6L, ...)
{
    ix <- sign(n) * seq(abs(n))
    x[ix, , drop = FALSE]
}
Note that it also takes care of negative n.