Seeking an lapply-like function for a list of lists - r

I have a list in R which looks something like this:
b0 <- 5; b1 <- 2
f <- function(x) b0 + b1 * x
Nsim <- 100
my.list <- vector("list", Nsim)
for (i in 1:Nsim) {
  x <- rep(0, 1000)
  y <- x
  y[1] <- f(x[1])
  for (j in 2:1000) {
    x[j] <- x[j-1] + rnorm(1, 0, 0.1)
    y[j] <- f(x[j])  # was `y[j] < f(x[j])`, a comparison rather than an assignment
  }
  my.list[[i]]$x <- x
  my.list[[i]]$y <- y
}
In reality, f is the result of an optimisation routine, x tracks the input value over time, and y holds the function values that are generated. So in essence I have Nsim time series. I want to plot metrics of these time series over time by averaging over the index i: for instance, the average performance of the algorithm over time.
At the moment I'm doing this with a bespoke function for each metric I want to calculate (e.g. mean squared error of x from the true value of x, another for generating error bars, and so on). I want to use something like lapply to average over i so I can visualise how x and y evolve over time, but lapply alone doesn't do the right thing.
What I want to output is a pointwise summary of the results. As an analogy, if my.list[[i]]$x were instead stored as a matrix, I could take colMeans() to see the average value of x over "time".
Is there a function/package which is good for working with lists of lists?

At least for what has been presented, there is no real reason to use a list of lists. The x's are all the same and equal to 1, 2, 3, ..., so this could be represented by a matrix with the x component implicit or held in the row names, or we could represent it as a ts object or a zoo object. In the last two cases, if X is the object, then time(X) is the common x.
mat <- sapply(my.list, "[[", "y")  # 1000 x Nsim matrix; column i holds my.list[[i]]$y
ts(mat)
library(zoo); zoo(mat)
Alternately, get rid of my.list and construct one of these directly in the code.
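
For the pointwise summary asked about, time runs down the rows of mat, so the colMeans() analogy becomes rowMeans(). A minimal sketch, assuming the error bars are wanted as pointwise standard deviations (that choice is mine, not the poster's):
mat <- sapply(my.list, "[[", "y")  # rows = time, columns = simulations
avg <- rowMeans(mat)               # pointwise average over the Nsim runs
sdv <- apply(mat, 1, sd)           # pointwise spread for error bars
plot(avg, type = "l")
lines(avg + 2 * sdv, lty = 2)
lines(avg - 2 * sdv, lty = 2)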

Related

Calculate Errors using loop function in R

I have two data matrices, both with the same dimensions. I want to extract the same series of columns from each, take each series as a vector, and then calculate different errors, for example mean absolute error (mae), mean percentage error (mape), and root mean square error (rmse). My data matrices are quite large, so I try to explain with an example and calculate these errors manually as:
mat1 <- matrix(6:75, ncol = 10, byrow = TRUE)
mat2 <- matrix(30:99, ncol = 10, byrow = TRUE)
mat1_seri1 <- as.vector(mat1[, c(1 + (0:4)*2)])
mat1_seri2 <- as.vector(mat1[, c(2 + (0:4)*2)])
mat2_seri1 <- as.vector(mat2[, c(1 + (0:4)*2)])  # was mat1 in the original, a typo
mat2_seri2 <- as.vector(mat2[, c(2 + (0:4)*2)])  # was mat1 in the original, a typo
mae1 <- mean(abs(mat1_seri1 - mat2_seri1))
mae2 <- mean(abs(mat1_seri2 - mat2_seri2))
For mape:
mape1 <- mean(abs(mat1_seri1 - mat2_seri1) / mat1_seri1) * 100
mape2 <- mean(abs(mat1_seri2 - mat2_seri2) / mat1_seri2) * 100
Similarly, I calculate rmse from its formula. Since my data matrices are large, doing this manually is quite time-consuming. Is it possible to do this with a loop that outputs the error terms (mae, mape, rmse) for each series separately?
I'm not sure if this is what you are looking for, but here is a function that could automate the process; maybe there is also a better way:
fn <- function(m1, m2) {
  stopifnot(dim(m1) == dim(m2))
  odd <- (1:ncol(m1)) %% 2 != 0  # TRUE for columns 1, 3, 5, ...
  mat1_seri1 <- as.vector(m1[, odd])
  mat1_seri2 <- as.vector(m1[, !odd])
  mat2_seri1 <- as.vector(m2[, odd])
  mat2_seri2 <- as.vector(m2[, !odd])
  mae1 <- mean(abs(mat1_seri1 - mat2_seri1))
  mae2 <- mean(abs(mat1_seri2 - mat2_seri2))
  mape1 <- mean(abs(mat1_seri1 - mat2_seri1) / mat1_seri1) * 100
  mape2 <- mean(abs(mat1_seri2 - mat2_seri2) / mat1_seri2) * 100
  setNames(as.data.frame(matrix(c(mae1, mae2, mape1, mape2), ncol = 4)),
           c("mae1", "mae2", "mape1", "mape2"))
}
fn(mat1, mat2)
#   mae1 mae2    mape1    mape2
# 1   24   24 92.62581 86.89572
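The question also asks for rmse; a minimal extension of the same idea (these rmse lines are my addition, not part of the answer above):
rmse1 <- sqrt(mean((mat1_seri1 - mat2_seri1)^2))
rmse2 <- sqrt(mean((mat1_seri2 - mat2_seri2)^2))
Adding these inside fn, together with the matching column names in setNames, returns all three error terms for each series.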

Generate numbers with specific correlation [with only positive values in the output]

I want to obtain a dataframe with simulated values which have a specific correlation to each other.
I need to use this function, but in the returned output there are also negative values, which do not have meaning for my purposes:
COR <- function(n, xmean, xsd, ymean, ysd, correlation) {
  x <- rnorm(n)
  y <- rnorm(n)
  z <- correlation * scale(x)[, 1] +
    sqrt(1 - correlation^2) * scale(resid(lm(y ~ x)))[, 1]
  xresult <- xmean + xsd * scale(x)[, 1]
  yresult <- ymean + ysd * z
  data.frame(x = xresult, y = yresult)
}
Please note that my question starts from this previous post (currently closed):
another similar discussion
Is there a method to exclude from the final output all the rows that have at least one negative value? (In other words, x and y must always be positive.)
I spent many hours without any concrete result.
Filtering rows which have at least one negative value can be done with the apply function, e.g.
df <- COR(100, 1, 1, 1, 1, 0.8)  # the function defined above (the original answer called it simcor)
filter <- apply(df, 1, function(x) sum(x < 0) > 0)
df <- df[!filter, ]
plot(df)
First, I create a data frame df from your function. Then I apply the function sum(x < 0) > 0 row-wise to the data frame (the second argument of apply, 1, says to go along the first dimension, i.e. the rows). This creates a logical vector that is TRUE for every row with at least one negative value. Subsetting the data frame with its negation (!filter) leaves you with the rows that have no negative values.
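As an aside (this shorthand is mine, not part of the answer), the same logical vector can be built without an anonymous function:
filter <- rowSums(df < 0) > 0  # TRUE for every row containing a negative value
rowSums operates on the whole logical matrix df < 0 at once, which is typically faster than apply for large data.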
UPDATE:
It seems the package VineCopula offers functions to create distributions with a given correlation. I have not dived deeply enough into the math to fully grasp how copulas (i.e. multivariate probability distributions) work, but using this package you can at least create, e.g., two samples coupled by a Gaussian copula.
library(VineCopula)
BC <- BiCop(family = 1, par = 0.9)   # family 1 = Gaussian copula; par = dependence parameter
sim <- BiCopSim(N = 1000, obj = BC)  # N x 2 matrix with uniform margins on [0, 1]
cor(sim[, 1], sim[, 2])
plot(sim)
You might then be able to scale the resulting matrix to achieve a certain standard deviation.
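Since BiCopSim returns margins that are uniform on [0, 1], the simulated values are already non-negative; a hedged sketch of mapping them onto a strictly positive distribution while keeping the dependence structure (the lognormal margins and their parameters are my assumption):
x <- qlnorm(sim[, 1], meanlog = 0, sdlog = 0.5)  # positive margin for column 1
y <- qlnorm(sim[, 2], meanlog = 0, sdlog = 0.5)  # positive margin for column 2
cor(x, y)
Transforming the margins this way preserves the rank correlation, though the Pearson correlation will generally differ from the copula parameter.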

Matrix computation with for loop

I am a newcomer to R, having migrated from GAUSS because of license verification issues.
I want to speed up the following code, which creates an n-by-k matrix A. Given the n-by-1 vector x and the vectors of parameters mu and sig (both k-dimensional), A is created as A[i,j] = dnorm(x[i], mu[j], sig[j]). The code below works fine for small sizes (n = 40, k = 4), but slows down significantly when n is around 10^6 and k is about n^(1/3).
I am running a simulation experiment to verify bootstrap validity, so I need to recompute the matrix A (number of simulations) x (number of bootstrap replications) times, and it becomes time-consuming since I want to experiment with many different values of n and k. I have vectorised the code as much as I could (thanks to the vector argument of dnorm), but can it be sped up further?
Thanks in advance for any help.
x <- rnorm(40)
mu <- c(-1, 0, 4, 5)
sig <- c(2^2, 0.5^2, 2^2, 3^2)
n <- length(x)
k <- length(mu)
A <- matrix(NA, n, k)
for (j in 1:k) {
  A[, j] <- dnorm(x, mu[j], sig[j])
}
Your method can be put into a function like this:
A.fill <- function(x, mu, sig) {
  k <- length(mu)
  n <- length(x)
  A <- matrix(NA, n, k)
  for (j in 1:k) A[, j] <- dnorm(x, mu[j], sig[j])
  A
}
and it's clear that you are filling the matrix A column by column.
R stores the entries of a matrix columnwise (just like Fortran).
This means that the matrix can be filled with a single call of dnorm using suitable repetitions of x, mu, and sig. The vector z will hold the columns of the desired matrix stacked on top of one another, and the matrix to be returned can then be formed from that vector just by specifying the number of rows and columns. See the following function:
B.fill <- function(x, mu, sig) {
  k <- length(mu)
  n <- length(x)
  z <- dnorm(rep(x, times = k), rep(mu, each = n), rep(sig, each = n))
  B <- matrix(z, nrow = n, ncol = k)
  B
}
Let's make an example with your data and test this as follows:
N <- 40
set.seed(11)
x <- rnorm(N)
mu <- c(-1,0,4,5)
sig <- c(2^2,0.5^2,2^2,3^2)
A <- A.fill(x,mu,sig)
B <- B.fill(x,mu,sig)
all.equal(A,B)
# [1] TRUE
I'm assuming that n is an integer multiple of k.
Addition
As noted in the comments, B.fill is quite slow for large values of n. The reason lies in the rep(..., each = ...) construct. So, is there a way to speed up A.fill? I tested this function:
C.fill <- function(x, mu, sig) {
  k <- length(mu)
  sapply(1:k, function(j) dnorm(x, mu[j], sig[j]))  # simplifies to an n x k matrix
}
This function is about 20% faster than A.fill.
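A sketch of how such a timing claim can be checked (the sizes are my choice; results vary by machine):
N <- 1e5; k <- 50
x <- rnorm(N)
mu <- rnorm(k)
sig <- runif(k, 0.5, 3)
system.time(A.fill(x, mu, sig))
system.time(B.fill(x, mu, sig))
system.time(C.fill(x, mu, sig))
For more reliable numbers, a package such as microbenchmark repeats each call many times.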

compute samples variance without loops

Here is what I want to do: I have a time-series data frame with, say, 100 time series of length 600, each in one column of the data frame. I want to pick 10 of the time series at random and assign them random weights that sum to one. Using those, I want to compute the variance of the sum of the 10 weighted time-series variables (i.e. a convex combination).
The df is in the form
v1, v2, v3, ..., v100
 1,  5,  6, ...,    9
 2,  4,  6, ...,   10
 3,  5,  8, ...,    6
 2,  2,  8, ...,    2
etc.
I can compute it inside a loop, but R is vector-oriented and the loop below is not efficient:
ntrials <- 10000
ts.sd <- NULL
for (x in 1:ntrials) {  # the original had a stray extra ")" after 1:ntrials
  temp <- t(weights[, x]) %*% cov(df[, samples[, x]]) %*% weights[, x]
  ts.sd <- cbind(ts.sd, temp)
}
Not sure what type of "random" you want for your weights... so I'll draw uniform weights and scale them so that they sum to one:
x <- as.data.frame(matrix(sample(1:20, 100 * 600, replace = TRUE), ncol = 100))
myfun <- function(inc, DF = x) {  # `inc` is just the iteration index and is unused
  w <- runif(10)
  w <- w / sum(w)                 # normalise so the weights sum to one
  t(w) %*% cov(DF[, sample(seq_along(DF), 10)]) %*% w
}
lapply(1:ntrials, myfun)
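Since each call returns a single number (strictly, a 1x1 matrix), sapply yields a plain numeric vector directly; this variant is mine, not part of the answer:
ts.var <- sapply(1:ntrials, myfun)
Note that the quantity computed is a variance (the weighted quadratic form of the covariance matrix); take sqrt() if a standard deviation is wanted.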
However, this isn't really avoiding loops per se, since lapply is just an efficient looping construct. That said, for loops in R aren't inherently bad or inefficient. Growing a data structure, as you're doing with cbind, however, is. But in this case, since you're only growing it by appending a single element each time, it really won't change things much. The "correct" version is to pre-allocate your vector ts.sd using ntrials:
ts.sd <- vector(mode = "numeric", length = ntrials)
Then, in your loop, assign into it by index:
for (x in 1:ntrials) {
  temp <- t(weights[, x]) %*% cov(df[, samples[, x]]) %*% weights[, x]
  ts.sd[x] <- temp  # the original wrote ts.sd[i], but the loop variable is x
}
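For completeness, both versions of the loop assume pre-drawn samples (column indices) and weights matrices; a hedged sketch of how they might be constructed (their layout is my assumption, inferred from the uses samples[, x] and weights[, x]):
samples <- replicate(ntrials, sample(ncol(df), 10))  # 10 x ntrials matrix of column indices
weights <- apply(matrix(runif(10 * ntrials), nrow = 10), 2,
                 function(w) w / sum(w))              # 10 x ntrials; each column sums to one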

R: looping to search for max of non-monotonic function

Refer to the R code below. The function (someRfunction) operates on a vector and returns a scalar value. The data are pairs (x,y), where x and y are vectors of length n, which may be large.
I want to find the value x* such that the result of someRfunction on the subset of y where x > x* is maximized. The function operates on y values and is non-monotonic in x*. I need to evaluate it for every candidate x* (i.e. each element of x). Speed is not an issue for a single run, but the code will be executed many times in a simulation. Is there any way to make this code more efficient/faster?
### x and y are vectors of length n
### sort x and y so that they are ordered by descending x
xord <- x[order(-x)]
yord <- y[order(-x)]
maxf <- -Inf  # was -99999; -Inf is a safer sentinel
maxcut <- NA
for (i in 1:n) {
  ### yi is the subvector of y corresponding to y[x > x{i}],
  ### where x{i} is the (n-i+1)th order statistic of x
  yi <- yord[seq_len(i - 1)]  # seq_len() avoids 1:0 giving c(1, 0) when i == 1
  fxi <- someRfunction(yi)
  if (fxi > maxf) {
    maxf <- fxi
    maxcut <- xord[i]
  }
}
Thanks.
Edit: let someRfunction(yi)=t.test(yi)$statistic.
If you can say anything more about the function, particularly whether it is smooth and whether its gradient can be determined, you will get a better answer. At the moment the only increase in speed will be modest: pre-specify a vector to hold the results, omit the if-max clause, and then use which.max() on the vector. You might also want to look at the function optimx in the package optimx.
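A minimal sketch of that suggestion, assuming (per the edit above) that someRfunction(yi) = t.test(yi)$statistic, which needs at least two observations, so the scan starts at i = 3:
xord <- x[order(-x)]
yord <- y[order(-x)]
fvals <- rep(NA_real_, n)  # pre-allocated results vector
for (i in 3:n) fvals[i] <- t.test(yord[1:(i - 1)])$statistic
maxcut <- xord[which.max(fvals)]  # which.max skips the NA entries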
