How to create lagged variables in Julia?

Is there a function to create lagged variables in Julia without resorting to any packages?
Specifically, I want to emulate R's embed function in Julia.
> embed(1:8, 3)
[,1] [,2] [,3]
[1,] 3 2 1
[2,] 4 3 2
[3,] 5 4 3
[4,] 6 5 4
[5,] 7 6 5
[6,] 8 7 6
After a couple of hours of browsing the Julia manual, I gave up looking for a suitable function. This function (ugly by R standards) is what I have so far. Is there a built-in function, or any room for improvement?
julia> function embed(x, k)
           n = length(x)
           m = zeros(n - k + 1, k)
           for i in 1:k
               m[:, i] = x[(k-i+1):(n-i+1)]
           end
           return m
       end
embed (generic function with 1 method)
julia> embed(1:8,3)
6x3 Array{Float64,2}:
3.0 2.0 1.0
4.0 3.0 2.0
5.0 4.0 3.0
6.0 5.0 4.0
7.0 6.0 5.0
8.0 7.0 6.0

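You can replace zeros with cell to skip the initialization (cell has since been removed from Julia; in 1.x an uninitialized matrix is created with Matrix{eltype(x)}(undef, n - k + 1, k)). You can also do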
embed(x,k) = hcat([x[i+k-1:-1:i] for i in 1:length(x)-k+1]...)'
Explanation
Create the reversed strides by indexing with x[i+k-1:-1:i] inside a for comprehension
Take that list of items, and make it the arguments of hcat by using ...
Concatenate the strides (passed as arguments)
Transpose the result using '
EDIT: Assuming length(x) ⋙ k, you can also use:
embed(x,k) = hcat([x[k-i+1:length(x)-i+1] for i in 1:k]...)
This gives the same result but iterates fewer times (k instead of length(x)-k+1) and therefore makes fewer allocations.
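As a further variation on the two one-liners above, the matrix can also be filled by a two-dimensional comprehension, which avoids both the splat and the transpose. This is only a sketch assuming Julia 1.x; the name embed_lagged is made up for illustration and is not a built-in.

```julia
# Hypothetical helper (not part of Base): row i holds x[i+k-1], ..., x[i],
# which matches the layout produced by R's embed(x, k).
embed_lagged(x, k) = [x[i + k - j] for i in 1:length(x)-k+1, j in 1:k]

m = embed_lagged(1:8, 3)
println(size(m))   # dimensions are (length(x) - k + 1, k)
```

Because the comprehension keeps the element type of x, an integer range yields an integer matrix rather than the Float64 matrix that zeros produces.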

Related

Dynamic scheduling in Julia

I have a function for dynamic scheduling, and I can use it with simple arrays and functions. For example, it works with this code:
scheduling:
@everywhere function pmap(f,lst)
    np=nprocs()
    n=length(lst)
    results=Vector{Any}(n)
    i=1
    nextidx()=(idx=i;i+=1;idx)
    @sync begin
        for p=1:np
            if p != myid() || np==1
                @sync begin
                    while true
                        idx=nextidx()
                        if idx > n
                            break
                        end
                        results[idx]= remotecall_fetch(f,p,lst[idx])
                    end
                end
            end
        end
    end
    results
end
function:
@everywhere f(x)=x+1
f (generic function with 1 method)
array:
julia> arrays=SharedArray{Float64}(10)
10-element SharedArray{Float64,1}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
julia> arrays=[1 2 3 4 5 6 7 8 9 10]
1×10 Array{Int64,2}:
1 2 3 4 5 6 7 8 9 10
result:
@everywhere function fsum(x)
    x+1
end
pmap(fsum,arrays)
10-element Array{Any,1}:
2
3
4
5
6
7
8
9
10
11
My question is: if I have the following function and arrays, how should I call the scheduling function?
function:
@everywhere f(x,y)=x.+y
julia> x=SharedArray{Float64}(2,2)
2×2 SharedArray{Float64,2}:
0.0 0.0
0.0 0.0
julia> y=SharedArray{Float64}(2,2)
2×2 SharedArray{Float64,2}:
0.0 0.0
0.0 0.0
julia> x=[1 2;3 4]
2×2 Array{Int64,2}:
1 2
3 4
julia> y=[6 7;8 9]
2×2 Array{Int64,2}:
6 7
8 9
I wanted to call it by pmap(f,x,y) but I got this error:
ERROR: MethodError: no method matching pmap(::#f, ::Array{Int64,2}, ::Array{Int64,2})
You may have intended to import Base.pmap
Closest candidates are:
pmap(::Any, ::Any) at REPL[1]:2
I have another question too: how can we be sure that our problem is running in different processes? How can we monitor it?
The built-in pmap splats its arguments (it accepts any number of collections), so this works:
f(x,y) = x+y; pmap(f,1:5,6:10)
You probably redefined pmap with the code from your question, which takes a single list and does not splat its arguments, so the three-argument call fails. You do not need to write your own here: the built-in version works as is.
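For the second question (checking that the work really runs in other processes), one option is to have each call report myid(). The following is a rough sketch assuming Julia 1.x, where pmap lives in the Distributed standard library; the two local workers and the variable names results and ran_on are assumptions made for illustration.

```julia
using Distributed

# Assumption: start two local workers if none are running yet.
nprocs() == 1 && addprocs(2)
@everywhere using Distributed

# Define the two-argument function on every process.
@everywhere f(x, y) = x + y

# The built-in pmap walks both collections in lockstep, like map(f, xs, ys).
results = pmap(f, 1:5, 6:10)

# Record which process handled each item to see how the work was spread out.
ran_on = pmap(_ -> myid(), 1:8)

println(results)
println(unique(ran_on))
```

If the printed ids are all greater than 1, the calls were executed on worker processes rather than on the master.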

Parse 2D-Array in Julia

In Julia I can create 2D-arrays with
[1 2 3 4 ; 5 6 7 8]
2×4 Array{Int64,2}:
1 2 3 4
5 6 7 8
The problem is that I need to parse a 2D array supplied as an argument to a script, that is, as a String.
For example
$ julia script.jl "[1 2 3 4 ; 5 6 7 8]"
and in the script something like:
c = parse.(ARGS[1])
and c should be a 2×4 array.
I am flexible regarding the format of the input String.
The use case is that I want to call an optimization problem implemented in Julia + JuMP from within Java.
Check out the readdlm function, which will allow you to parse the text received from ARGS as an array:
using DelimitedFiles
a = readdlm(IOBuffer(ARGS[1]),',',';')
display(a)
Running:
$ julia argscript.jl "1,2,3,4;5,6,7,8"
2×4 Array{Float64,2}:
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
You can force the array element type in the script:
a = readdlm(IOBuffer(ARGS[1]),',',Int,';')
You could even enforce the matrix dimensions by passing two more arguments:
using DelimitedFiles
n = parse(Int,ARGS[1])
m = parse(Int,ARGS[2])
a = readdlm(IOBuffer(ARGS[3]),',',Int,';',dims=(n,m))
Running:
$ julia argscript.jl 2 3 "3,2,1;2,6,8"
2×3 Array{Int64,2}:
3 2 1
2 6 8
$ julia argscript.jl 2 4 "3,2,1;2,6,8"
ERROR: LoadError: at row 2, column 1 : ErrorException("missing value at row 1 column 4"))
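If the bracketed format from the question ("[1 2 3 4 ; 5 6 7 8]") has to be kept, it can also be parsed by hand with Base string functions. This is only a sketch assuming Julia 1.x; parse_matrix is a made-up helper name, and it assumes rows are separated by ';' and entries by whitespace.

```julia
# Hypothetical helper (not a library function): parses "[a b ; c d]" into a matrix.
function parse_matrix(s::AbstractString, ::Type{T}=Int) where {T}
    # Drop the surrounding brackets and any padding spaces.
    body = strip(s, ['[', ']', ' '])
    # One vector per row; split(r) splits on whitespace and drops empty pieces.
    rows = [parse.(T, split(r)) for r in split(body, ';')]
    # hcat stacks the rows as columns, so flip the result back into row layout.
    return permutedims(reduce(hcat, rows))
end

c = parse_matrix("[1 2 3 4 ; 5 6 7 8]")
println(size(c))
```

In a script, the literal string would be replaced by ARGS[1]. Rows of unequal length make hcat throw a DimensionMismatch, which acts as a basic input check.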

pLepage function in R

Here is a self-defined function for computing the Lepage D statistic. It returns a result different from the D statistic generated by NSM3::pLepage():
LepageD <- function(x, y){
  m=length(x); n=length(y); N=m+n
  z=sort(c(x,y),index=TRUE)
  rz=seq(1,(N-1)/2); rz=c(rz,(N+1)/2,rev(rz))
  r=rz[sort(z$ix,index=TRUE)$ix]
  C=sum(r[12:21])
  rk=rank(c(x,y))
  W=sum(rk[12:21])
  Wstar=(W-n*(N+1)/2)/sqrt(m*n*(N+1)/12)
  Cstar=(C-n*((N+1)^2)/(4*N))/sqrt(m*n*(N+1)*(3+N^2)/(48*(N^2)))
  D=Wstar^2+Cstar^2
  D
}
> LepageD(1:10, 2:12)
[1] 1.09216
> pLepage(1:10, 2:12)$obs.stat
[1] 1.112263
My function also cannot handle the case where x and y have the same sample size.
> LepageD(1:10, 2:11)
[1] NA
I'm confused about where I went wrong.
As far as I can tell, the problem lies around this line:
r=rz[sort(z$ix,index=TRUE)$ix]
The error occurs here because z (in the test case that returns NA) has 20 elements.
So, sort(z$ix,index=TRUE)$ix produces output as:
1 2 4 6 8 10 12 14 16 18 3 5 7 9 11 13 15 17 19 20
Also, the length of vector rz is 19 (and not 20).
Content of rz vector:
[1] 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.5 9.0 8.0 7.0 6.0 5.0
[16] 4.0 3.0 2.0 1.0
So, when we try to access the 20th element of vector rz, it produces NA.
As you haven't passed na.rm = TRUE to sum, the values of C and W become NA.
This in turn makes Wstar, Cstar and ultimately D NA.

R: Mean of subvectors based on repeats in another vector

I am trying to build two subvectors of equal length from two vectors of equal length.
Values in first vector are ordered as follows:
a<-c(9,9,9,8,8,7,6,5,5,5)
The second vector is random, but let's take
b<-c(1,2,3,4,5,6,7,8,9,10)
The first subvector is simple: it is just the vector a without repeats
f(a)<-c(9,8,7,6,5)
The second subvector should be made as follows:
For a single value in vector a (no repeats in a), g(b) takes the value of b at the corresponding position. For repeats in a, the g(b) value should be the mean of the values from the corresponding part of b. So:
g(b)<-c(mean(c(1,2,3)), mean(c(4,5)), 6, 7, mean(c(8,9,10)))
I have no idea where to start. Thx for help!
tapply is the function you want. See ?tapply to see how it works. Here:
res<-tapply(b,a,mean)
# 5 6 7 8 9
#9.0 7.0 6.0 4.5 2.0
If you want to preserve the order:
tapply(b,a,mean)[as.character(unique(a))]
# 9 8 7 6 5
#2.0 4.5 6.0 7.0 9.0
As you can see, it gives the unique values of a and for each of them, the desired function (in this case mean(b)) is evaluated.
We can also use ave
unique(ave(b, a))
#[1] 2.0 4.5 6.0 7.0 9.0
Or another option would be to convert 'a' to a factor with the levels specified
tapply(b, factor(a, levels=unique(a)), FUN=mean)
# 9 8 7 6 5
#2.0 4.5 6.0 7.0 9.0
You can do it this way:
uniqueA <- a[!duplicated(a)] # or simply unique(a) but I'm not sure about order preservation
uniqueB <- as.numeric(by(b,match(a,uniqueA),mean))
> uniqueA
[1] 9 8 7 6 5
> uniqueB
[1] 2.0 4.5 6.0 7.0 9.0

R: Improvement of loop to create distance matrix from data frame

I am creating a distance matrix using the data from a data frame in R.
My data frame has the temperature of 2244 locations:
plot temperature
A 12
B 12.5
C 15
... ...
I would like to create a matrix that shows the temperature difference between each pair of locations:
. A B C
A 0 0.5 3
B 0.5 0 2.5
C 3 2.5 0
This is what I have come up with in R:
temp_data #my data frame with the two columns: location and temperature
temp_dist<-matrix(data=NA, nrow=length(temp_data[,1]), ncol=length(temp_data[,1]))
temp_dist<-as.data.frame(temp_dist)
names(temp_dist)<-as.factor(temp_data[,1]) #the locations are numbers in my data
rownames(temp_dist)<-as.factor(temp_data[,1])
for (i in 1:2244)
{
  for (j in 1:2244)
  {
    temp_dist[i,j]<-abs(temp_data[i,2]-temp_data[j,2])
  }
}
I have tried the code with a small sample with:
for (i in 1:10)
and it works fine.
My problem is that the computer has been running now for two full days and it hasn't finished.
I was wondering whether there is a quicker way to do this. I am aware that nested loops take a long time, and since I am filling a matrix of more than 5 million cells it makes sense that it is slow. Still, I am hoping there is a function that gives the same result faster, as I have to do the same with precipitation and other variables.
I have also read about dist, but I am unsure whether I can use it with the data frame I have.
I would very much appreciate your help.
Many thanks.
Are you perhaps just looking for the following?
out <- dist(temp_data$temperature, upper=TRUE, diag=TRUE)
out
# 1 2 3
# 1 0.0 0.5 3.0
# 2 0.5 0.0 2.5
# 3 3.0 2.5 0.0
If you want different row/column names, it seems you have to convert this to a matrix first:
out_mat <- as.matrix(out)
dimnames(out_mat) <- list(temp_data$plot, temp_data$plot)
out_mat
# A B C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0
Or just as an alternative from the toolbox:
m <- with(temp_data, abs(outer(temperature, temperature, "-")))
dimnames(m) <- list(temp_data$plot, temp_data$plot)
m
# a b c
# a 0.0 0.5 3.0
# b 0.5 0.0 2.5
# c 3.0 2.5 0.0
