Speed up sample function in R - r

In a quite big data frame, I have to pick up some random rows to execute a function. In my example, the first function I use is the variance and then a function closed to the real one I use in my script, called after f. I do not detail the purpose of f but it deals with truncated Gaussian distribution and maximum-likelihood estimation.
My problem is that my code is way too slow with the second function and I suppose a bit of optimization of the for loop or the sample function could help me.
Here is the code :
df <- as.data.frame(matrix(0,2e+6,2))
df$V1 <- runif(nrow(df),0,1)
df$V2 <- sample(c(1:10),nrow(df), replace=TRUE)
nb.perm <- 100 # number of permutations
res <- c()
for(i in 1:nb.perm) res <- rbind(res,tapply(df[sample(1:nrow(df)),"V1"],df$V2,var))
library(truncnorm)
f <- function(d) # d is a vector
{
f2 <- function(x) -sum(log(dtruncnorm(d, a=0, b=1, mean = x[1], sd = x[2])))
res <- optim(par=c(mean(d),sd(d)),fn=f2)
if(res$convergence!=0) warning("Optimization has not converged")
return(list(res1=res$par[1],res2=res$par[2]^2))
}
for(i in 1:nb.perm) res2 <- rbind(res,tapply(df[sample(1:nrow(df)),"V1"],df$V2,function(x) f(x)$res2))
I hope I am clear enough.

Related

Is it possible to use vector math in R for a summation involving intervals?

Title's a little rough, open to suggestions to improve.
I'm trying to calculate time-average covariances for a 500 length vector.
This is the equation we're using
The result I'm hoping for is a vector with an entry for k from 0 to 500 (0 would just be the variance of the whole set).
I've started with something like this, but I know I'll need to reference the gap (i) in the first mean comparison as well:
x <- rnorm(500)
xMean <-mean(x)
i <- seq(1, 500)
dfGam <- data.frame(i)
dfGam$gamma <- (1/(500-dfGam$i))*(sum((x-xMean)*(x[-dfGam$i]-xMean)))
Is it possible to do this using vector math or will I need to use some sort of for loop?
Here's the for loop that I've come up with for the solution:
gamma_func <- function(input_vec) {
output_vec <- c()
input_mean <- mean(input_vec)
iter <- seq(1, length(input_vec)-1)
for(val in iter){
iter2 <- seq((val+1), length(input_vec))
gamma_sum <- 0
for(val2 in iter2){
gamma_sum <- gamma_sum + (input_vec[val2]-input_mean)*(input_vec[val2-val]-input_mean)
}
output_vec[val] <- (1/length(iter2))*gamma_sum
}
return(output_vec)
}
Thanks
Using data.table, mostly for the shift function to make x_{t - k}, you can do this:
library(data.table)
gammabar <- function(k, x){
xbar <- mean(x)
n <- length(x)
df <- data.table(xt = x, xtk = shift(x, k))[!is.na(xtk)]
df[, sum((xt - xbar)*(xtk - xbar))/n]
}
gammabar(k = 10, x)
# [1] -0.1553118
The filter [!is.na(xtk)] starts the sum at t = k + 1, because xtk will be NA for the first k indices due to being shifted by k.
Reproducible x
x <- c(0.376972124936433, 0.301548373935665, -1.0980231706536, -1.13040590360378,
-2.79653431987176, 0.720573498411587, 0.93912102300901, -0.229377746707471,
1.75913134696347, 0.117366786802848, -0.853122822287008, 0.909259181618213,
1.19637295955276, -0.371583903741348, -0.123260233287436, 1.80004311672545,
1.70399587729432, -3.03876460529759, -2.28897494991878, 0.0583034949929225,
2.17436525195634, 1.09818265352131, 0.318220322390854, -0.0731475581637693,
0.834268741278827, 0.198750636733429, 1.29784138432631, 0.936718306241348,
-0.147433193833294, 0.110431994640128, -0.812504663900505, -0.743702167768748,
1.09534507180741, 2.43537370755095, 0.38811846676708, 0.290627670295127,
-0.285598287083935, 0.0760147178373681, -0.560298603759627, 0.447188372143361,
0.908501134499943, -0.505059597708343, -0.301004012157305, -0.726035976548133,
-1.18007702699501, 0.253074712637114, -0.370711296884049, 0.0221795637601637,
0.660044122429767, 0.48879363533552)

How to cycle through a list of functions in a for loop

I am fairly new to programming in R, so I apologize if this question is too basic. I am trying to study the properties of OLS with error terms created by three different processes (i.e., normal1, normal2, and chi-square). I include these in a list, 'fun_list'.
I would like to iterate through 1,000 (iter) regressions, each with sample size 500 (n). I would like to save all 1,000 X 500 observations in a dataset (big_data) as well as the regression results (reg_results).
At the end of the program, I would like 1,000 regressions for each of the three processes (for a total of 3,000 regressions). I have set up nested loops for the three functions on one level and the 1,000 iterations on a different (sub-) level. I am having trouble getting the program to loop through the three different functions. I am not sure how to call out each element of the list in this embedded loop. Any help would be greatly appreciated!
library(psych)
library(arm)
library(dplyr)
library(fBasics)
library(sjstats)
#set sample size and number of iterations
set.seed(12345)
n <- 500
iter <- 1000
#setting empty vectors. Probably a better way to do this. :)
bn <- rep(NA,iter)
sen <- rep(NA,iter)
#these are the three functions I want to use to generate en,
#which is the error term below. I want one loop for each of the three.
# I can get f1, f2 and f3 to work independently, but I can't get the list
#to work to cycle through all three.
f1 <- function (n) {rnorm(n, 0, 2)}
f2 <- function (n) {rnorm(n, 0, 10)}
f3 <- function (n) {rchisq(n, 2)}
fun_list <- list(f1, f2, f3)
#following line starting point for saving all iterations in one big
#dataset
datalist = list()
#if I remove the following line (for (j ....)), I can get this to work by
#referencing each function independently (i.e., using 'en <- f1(n)').
for (j in fun_list) {
for (s in 1:iter) {
# en <- f1(n)
en <- fun_list[[1]]
x <- rnorm(n, 0, .5)
yn <- .3*x + en
#this is the part that saves the data#
dat <- data.frame(yn, x, en)
dat$s <- s
datalist[[s]] <- dat
#### run model for normal data and save parameters###
lm1n <- lm(yn ~ x)
int.hatn <- coef (lm1n)[1]
b.hatn <- coef (lm1n)[2]
se.hatn <- se.coef (lm1n) [2]
##save them for each iteration
bn[s] = b.hatn
sen[s] = se.hatn
}
}
reg_results<- tibble(bn, sen)
big_data = do.call(rbind,datalist)
When using the loop, I get the following error:
Error in 0.3 * x + en : non-numeric argument to binary operator
I am assuming this is because I do not fully understand how to call out each of the three functions in the list.
Here is a complete solution which wraps the multiple points discussed in the comments:
library(psych)
library(arm)
library(dplyr)
library(fBasics)
library(sjstats)
#set sample size and number of iterations
set.seed(12345)
n <- 500
iter <- 1000
#setting empty vectors. Probably a better way to do this. :)
bn <- c()
sen <- c()
#these are the three functions I want to use to generate en,
#which is the error term below. I want one loop for each of the three.
# I can get f1, f2 and f3 to work independently, but I can't get the list
#to work to cycle through all three.
f1 <- function (n) {rnorm(n, 0, 2)}
f2 <- function (n) {rnorm(n, 0, 10)}
f3 <- function (n) {rchisq(n, 2)}
fun_list <- list(f1, f2, f3)
#following line starting point for saving all iterations in one big
#dataset
datalist = list()
#if I remove the following line (for (j ....)), I can get this to work by
#referencing each function independently (i.e., using 'en <- f1(n)').
for (j in c(1:length(fun_list))) {
en <- fun_list[[j]]
for (s in 1:iter) {
x <- rnorm(n, 0, .5)
random_part <- en(n)
yn <- .3*x + random_part
#this is the part that saves the data#
dat <- data.frame(yn, x, random_part)
dat$s <- s
datalist[[s]] <- dat
#### run model for normal data and save parameters###
lm1n <- lm(yn ~ x)
int.hatn <- coef(lm1n)[1]
b.hatn <- coef(lm1n)[2]
se.hatn <- se.coef(lm1n)[2]
##save them for each iteration
bn = c(bn,b.hatn)
sen = c(sen,se.hatn)
}
}
reg_results<- tibble(bn, sen)
big_data = do.call(rbind,datalist)

R: Stuck on a "simple" problem: calculating total sum of squares in a n*m matrix

Given a data matrix with n rows and m columns, I would like to calculate the total sum of squares in R.
For this I've tried a loop that iterates through the rows of each column and saves the results in a vector. These are then added to the "TSS" vector where each value is the SS of one column. The sum of this vector should be the TSS.
set.seed(2020)
m <- matrix(c(sample(1:100, 80)), nrow = 40, ncol = 2)
tss <- c()
for(j in 1:ncol(m)){
tssVec <- c()
for(i in 1:nrow(m)){
b <- sum(((m[i,]) - mean(m[,j]))^2)
tssVec <- c(tssVec, b)
}
tss <- c(tss, sum(tssVec))
}
sum(tss)
The output is equal to 136705.6. This is not feasible at all. As a novice coder, I am unfortunately stuck.
Any help is appreciated!
There are many methods to evaluate the TSS, of course they will give you the same result. I would do something like:
Method 1 that implies the use of ANOVA:
n <- as.data.frame(m)
mylm <- lm(n$V1 ~ n$V2)
SSTotal <-sum(anova(mylm)[,2])
Method 2:
SSTotal <- var( m[,1] ) * (nrow(m)-1)

Downsample matrix in R?

My question is about how to improve the performance of function that downsamples from the columns of a matrix without replacement (a.k.a. "rarefication" of a matrix... I know there has been mention of this here, but I could not find a clear answer that a) does what I need; b) does it quickly).
Here is my function:
downsampled <- function(data,samplerate=0.8) {
data.test <- apply(data,2,function(q) {
names(q) <- rownames(data)
samplepool <- character()
for (i in names(q)) {
samplepool <- append(samplepool,rep(i,times=q[i]))
}
sampled <- sample(samplepool,size=samplerate*length(samplepool),replace = F)
tab <- table(sampled)
mat <- match(names(tab),names(q))
toret=numeric(length <- length(q))
names(toret) <- names(q)
toret[mat] <- tab
return(toret)
})
return(data.test)
}
I need to be downsampling matrices with millions of entries. I find this is quite slow (here I'm using a 1000x1000 matrix, which is about 20-100x smaller than my typical data size):
mat <- matrix(sample(0:40,1000*1000,replace=T),ncol=1000,nrow=1000)
colnames(mat) <- paste0("C",1:1000)
rownames(mat) <- paste0("R",1:1000)
system.time(matd <- downsampled(mat,0.8))
## user system elapsed
## 69.322 21.791 92.512
Is there a faster/easier way to perform this operation that I haven't thought of?
I think you can make this dramatically faster. If I understand what you are trying to do correctly, you want to down-sample each cell of the matrix, such that if samplerate = 0.5 and the cell of the matrix is mat[i,j] = 5, then you want to sample up to 5 things where each thing has a 0.5 chance of being sampled.
To speed things up, rather than doing all these operations on columns of the matrix, you can just loop through each cell of the matrix, draw n things from that cell by using runif (e.g., if mat[i,j] = 5, you can generate 5 random numbers between 0 and 1, and then add up the number of values that are < samplerate), and finally add the number of things to a new matrix. I think this effectively achieves the same down-sampling scheme, but much more efficiently (both in terms of running time and lines of code).
# Sample matrix
set.seed(23)
n <- 1000
mat <- matrix(sample(0:10,n*n,replace=T),ncol=n,nrow=n)
colnames(mat) <- paste0("C",1:n)
rownames(mat) <- paste0("R",1:n)
# Old function
downsampled<-function(data,samplerate=0.8) {
data.test<-apply(data,2,function(q){
names(q)<-rownames(data)
samplepool<-character()
for (i in names(q)) {
samplepool=append(samplepool,rep(i,times=q[i]))
}
sampled=sample(samplepool,size=samplerate*length(samplepool),replace = F)
tab=table(sampled)
mat=match(names(tab),names(q))
toret=numeric(length = length(q))
names(toret)<-names(q)
toret[mat]<-tab
return(toret)
})
return(data.test)
}
# New function
downsampled2 <- function(mat, samplerate=0.8) {
new <- matrix(0, nrow(mat), ncol(mat))
colnames(new) <- colnames(mat)
rownames(new) <- rownames(mat)
for (i in 1:nrow(mat)) {
for (j in 1:ncol(mat)) {
new[i,j] <- sum(runif(mat[i,j], 0, 1) < samplerate)
}
}
return(new)
}
# Compare times
system.time(downsampled(mat,0.8))
## user system elapsed
## 26.840 3.249 29.902
system.time(downsampled2(mat,0.8))
## user system elapsed
## 4.704 0.247 4.918
Using an example 1000 X 1000 matrix, the new function I provided runs about 6 times faster.
One source of savings would be to remove the for loop that appends samplepool using rep. Here is a reproducible example:
myRows <- 1:5
names(myRows) <- letters[1:5]
# get the repeated values for sampling
samplepool <- rep(names(myRows), myRows)
Within your function, this would be
samplepool <- rep(names(q), q)

Iterate a function R code

I have a function myF(g,m,alpha,gam,theta,beta). Which returns three estimates of parameters. I want to iterate this function for (i in 1:10). How can i do this it in R?
myF <- function(g,m,alpha,gam,theta,beta){
dat <- sim.data(g,m,alpha,gam,theta,beta)
time <- dat$times
delta <- dat$cens
i <- dat$group
X1<-dat$cov #cov~rbinom
n <- length(levels(as.factor(i)))
di <- aggregate(delta,by=list(i),FUN=sum)[,2]
D <- sum(di)
loglik <- function(par){
.........................................
return(-lik)
}
initial=c(0.5,0.5,-0.5,0.5)
maxF <- nlm(loglik, initial)
return(c(theta=exp(maxF$estimate[2]),beta1=maxF$estimate[3],alpha=exp(maxF$estimate[2])))
}
This can easily be done using replicate:
replicate(10, myF(g,m,alpha,gam,theta,beta))
This will create a 3*10 matrix of the parameter estimates, where each column is the result of a separate iteration.

Resources