Programming 50000 regressions in R using parallel programming - r

I have the following homework problem, which I have finished but seems to take an exceptionally long time to complete:
Assume that Y , X1, · · · , X1000 are all normal random variables with mean 0 and standard deviation 1, and they are independent with each other. Generate 30 samples of Y, X1, ···, X1000. Now repeat the following 50000 times: Randomly pickup ten variables from X1, . . ., X1000, run a linear regression of Y on these ten variables and record the R2. Compute the maximum value of the 50000 R2’s.
My code works for 8000 regressions (1000 regressions on each core of my MacBook Pro), but can't seem to finish for 50000 regressions (6250 on each core). Here is my code:
library(snow)
cl <- makeCluster(8, type = "SOCK")
invisible(clusterEvalQ(cl, reg_cluster <- function(rep, samples, n) {
X <- list()
R <- rep(0, rep)
for (k in 1:rep) {
Y <- rnorm(samples)
for (j in 1:n) {
X[[j]] <- rnorm(samples)
}
X_1 <- sample(X, 10, replace = FALSE)
X_1_unlist <- unlist(X_1)
X.1 <- matrix(X_1_unlist[1:30], ncol = 1)
X.2 <- matrix(X_1_unlist[31:60], ncol = 1)
X.3 <- matrix(X_1_unlist[61:90], ncol = 1)
X.4 <- matrix(X_1_unlist[91:120], ncol = 1)
X.5 <- matrix(X_1_unlist[121:150], ncol = 1)
X.6 <- matrix(X_1_unlist[151:180], ncol = 1)
X.7 <- matrix(X_1_unlist[181:210], ncol = 1)
X.8 <- matrix(X_1_unlist[211:240], ncol = 1)
X.9 <- matrix(X_1_unlist[241:270], ncol = 1)
X.10 <- matrix(X_1_unlist[271:300], ncol = 1)
X_data <- cbind(X.1, X.2, X.3, X.4, X.5, X.6, X.7, X.8, X.9, X.10)
X_data <- as.data.frame (X_data)
names(X_data) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10")
attach(X_data)
reg <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10)
R[k] <- summary(reg)$r.squared
}
return(max(R))
}))
results <- clusterEvalQ(cl, reg_cluster(1000, 30, 1000))
results <-clusterEvalQ(cl, reg_cluster(6250, 30, 1000))
stopCluster(cl)
max_results <- c(results[[1]], results[[2]], results[[3]], results[[4]],
results[[5]], results[[6]], results[[7]], results[[8]])
max(max_results)
Something else should be noted here. Each time I run a new regression, the Y and all the X's are generated again. No random variables carry over from one regression to the next.
So my question is, how can I make this run faster?
Also, can anyone tell me why it finished after 12 minutes for 8000 regressions, but still has not finished, after 2.5 hours, for 50000 regressions?
Edit: The following procedure has been confirmed by the professor:
1) Generate 30 random standard normal variables of each of Y, X1, ..., X1000. I would have a total of 30 random normal variables for Y, and a total of 30 x 1,000 = 30,000 random normal variables for all the X's (30 for each one)
2) Randomly select ten of the 1000 choices for X (for example X726, X325, X722, X410, X46, X635, X822, X518, X773, X187)
3) Run a linear regression Y ~ 10 X's using the lm function in R. The Y would have 30 observations, while each X would also have 30 observations. Essentially we'd try to be fitting Y = B0 + B1 * X1 + B2 * X2 + ... + B10 * X10, where each of the X's represents one of the randomly selected in part 2.
4) Record the R2 value in a vector
5) Repeat steps 1-4 50,000 times
6) Find the maximum R2 of the 50,000 recorded

Here's an alternative code that seems to solve your problem.
ns <- 30
rvals <- replicate(50000, {
y <- rnorm(ns)
xvals <- replicate(1000, rnorm(ns))
selecteds <- xvals[,sample(1:1000, 10)]
df <- data.frame(y = y, selecteds)
summary(lm(paste("y ~", paste0("X", 1:10, collapse = "+")), data = df))$r.squared
})
I'm not very experienced with clustering, but here are a few reasons why your code might be too slow:
You have nested for loops to create X, and I used replicate, which could be slightly faster than using a list.
You're growing an empty list, X, and that's very bad. (Check The R Inferno - Circle 2)
You unlisted several list elements just to make them 1-column matrices, then bound them all together and finally named the columns. Though these steps seem necessary, I think doing them one at a time is probably slow. The column names, for example, are automatically set to X1:X10 by data.frame.
Using attach isn't necessary and probably slows things down: each call adds another copy of the data frame to the search path, and your loop never detaches them.
If you open/close too many clusters, that consumes a lot of processing and can make things slower than non-parallel code. That doesn't seem to be the case here, though.
As a final note, just make sure I'm doing the same as you, since the problem was still a bit confusing for me.
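If the formula and data frame overhead still matters, here is a minimal sketch of one way to trim it. It swaps lm for stats::lm.fit and computes R2 by hand, so it is not the lm call the assignment asks for, just a faster equivalent for checking results. The helper name one_r2 and its arguments are made up for illustration, and the 200 replications are only a small test run.

```r
# Hypothetical helper: one replication of steps 1-4 from the question.
one_r2 <- function(ns = 30, n_vars = 1000, n_pick = 10) {
  y <- rnorm(ns)
  # all candidate predictors at once, one column per X
  x <- matrix(rnorm(ns * n_vars), nrow = ns)
  # keep ten randomly chosen columns
  sel <- x[, sample.int(n_vars, n_pick), drop = FALSE]
  # least-squares fit with an explicit intercept column
  fit <- lm.fit(cbind(1, sel), y)
  rss <- sum(fit$residuals^2)
  tss <- sum((y - mean(y))^2)
  1 - rss / tss
}

set.seed(1)
r2 <- vapply(seq_len(200), function(i) one_r2(), numeric(1))
max(r2)
```

Raising 200 to 50000 gives the full run; the same function could also be handed to the workers of a cluster if one core is still too slow.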

Related

Extract co-linear columns name - R

Based on the answer for this question and its script, how can I print to the console the co-linear columns names?
Script:
library(corrplot)
library(caret)
x <- seq(0, 100, 1)
# colinear with x
y <- x + 2.3
# almost colinear with x / some small gaussian noise
z <- x + rnorm(mean = 0, sd = 5, n = 101)
# uncorrrelated gaussian
w <- rnorm(mean = 0, sd = 1, n = 101)
a <- z + seq(101, 201, 1)/.33 + rnorm(mean = 0, sd = 5, n = 101)
b <- a -2.3
# this frame is made to exemplify the procedure
df <- data.frame(x = x, y = y, z = z, w = w, a=a, b=b)
corrplot(cor(df))
#drop perfectly multicollinear variables
constant<-rep(1,nrow(df))
tmp<-lm(constant ~ ., data=df)
to_keep<-tmp$coefficients[!is.na(tmp$coefficients)]
to_keep<-names(to_keep[-which(names(to_keep) == "(Intercept)")])
df_result<-df[to_keep]
corrplot(cor(df_result))
You want the variables not included in to_keep. Based off how to_keep is defined, you can write to_drop <- tmp$coefficients[is.na(tmp$coefficients)] to get the coefficients with NA values (meaning there are no estimates for the corresponding variables because they are collinear with others). Then, to print the names of those coefficients, you can simply do print(names(to_drop)).
However, keep in mind that: 1. this will only drop perfectly collinear variables, and in a hacky way, and 2. the way this method decides which variables out of a set of perfectly collinear variables to drop is rather arbitrary (it will depend on the order of the variables in your data).
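A small sketch of the idea, using a made-up three-column data frame rather than the one from the question (the columns x, y, w here are assumptions for illustration):

```r
set.seed(1)
x <- seq(0, 100, 1)
# y is an exact linear function of x; w is unrelated noise
df <- data.frame(x = x, y = x + 2.3, w = rnorm(101))

constant <- rep(1, nrow(df))
tmp <- lm(constant ~ ., data = df)

# coefficients that lm could not estimate are reported as NA
to_drop <- names(tmp$coefficients)[is.na(tmp$coefficients)]
print(to_drop)
```

Here y comes after x in the data frame, so y is the one reported; swapping the column order would report x instead, which is the arbitrariness mentioned above.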

R/Python - For Loop Statement for a Trigonometric Formula

I am working on an estimation module, where we are computing seasonality variations and forecasting. Previously, we were using fixed 5-order sinusoidal functions for estimation. The formula was as follows
doy_seasonality = exp(z[1]*sin(2*pi*doy/365)+z[2]*cos(2*pi*doy/365)
+z[3]*sin(4*pi*doy/365)+ z[4]*cos(4*pi*doy/365)
+z[5]*sin(6*pi*doy/365)+ z[6]*cos(6*pi*doy/365)
+z[7]*sin(8*pi*doy/365)+ z[8]*cos(8*pi*doy/365)
+ z[9]*sin(10*pi*doy/365)+ z[10]*cos(10*pi*doy/365))
Now, we have tried some modifications in our model. Using Fast Fourier Transform, we are able to generate the orders for trigonometric functions automatically.
For example, on my current dataset, I have the following array of orders.
order_FFT = [2, 6, 10, 24], such that
order_FFT[0] = 2
order_FFT[1] = 6
order_FFT[2] = 10
order_FFT[3] = 24
There are 4 orders here. With another dataset, there could be more or fewer orders. Therefore, I need to define a for loop so that the formula adapts to the number of orders.
With my current dataset and corresponding orders_FFT array, the for loop should execute the following formula:
doy_seasonality = exp(z[1]*sin(2*pi*doy/365)+z[2]*cos(2*pi*doy/365)
+z[3]*sin(6*pi*doy/365)+ z[4]*cos(6*pi*doy/365)
+z[5]*sin(10*pi*doy/365)+ z[6]*cos(10*pi*doy/365)
+z[7]*sin(24*pi*doy/365)+ z[8]*cos(24*pi*doy/365))
which basically means
doy_seasonality = exp(z[1]*sin(order_FFT[0]*pi*doy/365)+z[2]*cos(order_FFT[0]*pi*doy/365)
+z[3]*sin(order_FFT[1]*pi*doy/365)+ z[4]*cos(order_FFT[1]*pi*doy/365)
+z[5]*sin(order_FFT[2]*pi*doy/365)+ z[6]*cos(order_FFT[2]*pi*doy/365)
+z[7]*sin(order_FFT[3]*pi*doy/365)+ z[8]*cos(order_FFT[3]*pi*doy/365))
I am at a loss trying to figure out a for loop code for this. Sorry that I am not able to show my own efforts here.
I would not use a loop. Here is an R approach:
#Some test data
set.seed(42)
z <- rnorm(8)
doy <- 1:365
order_FFT <- c(2, 6, 10, 24)
#separate coefficients for sin and cos in two rows:
z <- matrix(z, nrow = 2)
#calculate the sines and cosines:
sins <- outer(doy, order_FFT, function(x, y) sin(x * pi * y / 365))
cosins <- outer(doy, order_FFT, function(x, y) cos(x * pi * y / 365))
#use matrix products to multiply and sum
doy_seasonality2 <- c(exp(sins %*% z[1,] + cosins %*% z[2,]))
Does it produce the same result?
doy_seasonality = exp(z[1]*sin(2*pi*doy/365)+z[2]*cos(2*pi*doy/365)
+z[3]*sin(6*pi*doy/365)+ z[4]*cos(6*pi*doy/365)
+z[5]*sin(10*pi*doy/365)+ z[6]*cos(10*pi*doy/365)
+z[7]*sin(24*pi*doy/365)+ z[8]*cos(24*pi*doy/365))
all.equal(doy_seasonality, doy_seasonality2)
#[1] TRUE
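If you do want the explicit for loop you asked about, a minimal sketch could look like this. It assumes z holds the coefficients as a plain vector in sin, cos, sin, cos order, as in your formula; the names lin and doy_seasonality_loop are made up.

```r
set.seed(42)
z <- rnorm(8)                      # test coefficients, as a plain vector
doy <- 1:365
order_FFT <- c(2, 6, 10, 24)

lin <- numeric(length(doy))        # running sum inside exp()
for (i in seq_along(order_FFT)) {
  angle <- order_FFT[i] * pi * doy / 365
  # z[2*i - 1] goes with sin, z[2*i] with cos
  lin <- lin + z[2 * i - 1] * sin(angle) + z[2 * i] * cos(angle)
}
doy_seasonality_loop <- exp(lin)
```

The loop length follows length(order_FFT), so z needs exactly twice as many entries as there are orders.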

How to extract the coefficients of various n-th degree polynomial models and store them into a 1D array?

I have a collection of sixth-degree polynomial regression models from which I want to gather only the coefficients.
I have a large dataset that contains 3 columns: the first one is an arbitrary parameter that acts as a flag, the second is the input, and third is the output.
I subsetted my dataset according to my parameters, so I have 10 smaller datasets. My models arose from these subsets.
As an example:
#-----"Dummy" Dataset-----
a = seq(1:100) #act as input
b = a + rnorm(n = 100, mean = 0, sd = 20) #act as output
df = as.data.frame(cbind(a,b))
colnames(df) = c("input", "output")
#
#-----Subsets-----
df_1_XlessThen50 = subset(df, input < 50) #example of subsetting. In this
#case I used the input values themselves as the threshold
#for subsetting, just for simplicity.
#In reality, I use the first column of my dataframe (parameter).
df_2_XmoreThen50 = subset(df, input >= 50) #second subset. In other words,
#for every parameter, I will divide that subset
#into two smaller ones.
#
#-----Models-----
model_3_ab.1 = lm(output ~ poly(input, 6, raw = T), data = df_1_XlessThen50)
model_3_ab.2 = lm(output ~ poly(input, 6, raw = T), data = df_2_XmoreThen50)
My models' names follow a pattern: "model" + parameter + "_ab." + id number.
I should clarify that the "id number" indicates which of the two models for every parameter I will consider. (These smaller datasets within every parameter are the results of subsetting according to a pre-determined threshold.)
What I have now is a collection of models like these two above for every parameter in my dataset. I have 10 parameters, hence, 20 models.
I want to gather only the coefficients of every model and store them into a matrix or dataframe. To achieve that, I tried:
parameter = c(2,4,6,7,9,11,33,35,37,50)
myData = array()
for (i in parameter){ #Loop over all parameters
for (j in 1:2){ #Loop over the pair of models for each parameter
for ( k in 1:6){ #Loop over my model's coefficient
aux = paste("model",i,"ab.",j, sep = "")
aux = get(aux)
myData[i,j,k] = aux$coefficients[k]
}
}
}
However, I keep getting the same error:
Error in myData[i, j, k] = aux$coefficients[k] :
incorrect number of subscripts
With this error, I can't advance into my goal, which is to write a .txt with one single column formatted as such:
A(2,1,1) = first order coefficient for the first model related to parameter 2
B(2,2,1) = second order coefficient for the first model related to parameter 2
C(2,3,1)
...
G(2,7,1)
A(2,1,2)
where in (M, N, O): M is the parameter, N is the coefficient of the N-th degree (N = 7 is the intercept), and O is either 1 or 2, respectively the first or second model in each pair of models for every parameter.
It'd be nice to get help/guidance for the whole problem, but I'll already be grateful if I can get past the part where I want to store my coefficients in a matrix using for-loops. Thanks
Here is what I mean:
set.seed(42)
a1 = seq(1:100) #act as input
a2 <- runif(100)
b = a1 + a2 + rnorm(n = 100, mean = 0, sd = 20) #act as output
df = data.frame(input1 = a1,
input2 = a2,
output = b)
df$flag <- a1 <= 50
library(reshape2)
df <- melt(df, id.vars = c("output", "flag"))
library(lme4)
df$flag_par <- interaction(df$flag, df$variable)
fits <- lmList(output ~ poly(value, 2, raw = TRUE) | flag_par, data = df)
coef(fits)
# (Intercept) poly(value, 2, raw = TRUE)1 poly(value, 2, raw = TRUE)2
#FALSE.input1 125.957730 -2.434849 0.022137337
#TRUE.input1 2.842223 1.216113 -0.006686362
#FALSE.input2 68.807752 -7.429319 26.486493218
#TRUE.input2 31.791633 -18.595105 16.608600876
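On the error itself: array() called with no dim argument creates a one-dimensional array of length 1, so indexing it with three subscripts fails with "incorrect number of subscripts". If you prefer to keep separate lm fits, a minimal sketch is to put them in a named list and let sapply collect the coefficients. The list names low and high and the dummy data are assumptions for illustration.

```r
set.seed(42)
df <- data.frame(input = 1:100)
df$output <- df$input + rnorm(100, mean = 0, sd = 20)

# keep the fitted models in a list instead of separately named variables
models <- list(
  low  = lm(output ~ poly(input, 6, raw = TRUE), data = subset(df, input < 50)),
  high = lm(output ~ poly(input, 6, raw = TRUE), data = subset(df, input >= 50))
)

# one column per model, one row per coefficient (intercept first)
coef_mat <- sapply(models, coef)
# flatten column by column into a single vector
coef_vec <- as.vector(coef_mat)
```

coef_vec can then be written as a single column with write.table or writeLines. Raw sixth-degree polynomials are badly conditioned, so some entries may come back as NA.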

R: simulating 2-level model

I am trying to simulate unequal sample sizes in a multilevel model. I have four groups, with sample sizes of 100, 200, 300, and 400, respectively.
So, the total sample size is 1000. The variables w, u0, and u1 are at level 2; x and r0 are at level 1. y is the outcome.
nSubWithinGroup <- c(100,200,300,400)###the sample size in each group
nGroup <-4 ## 4 groups
gamma00 <- 1
gamma01 <- 1 ## b0 = gamma00+gamma01*w+u0
gamma10 <- 1 ## b1 = gamma10+gamma11*w+u1
gamma11 <- 1
dataLevel1 <- mat.or.vec(sum(nSubWithinGroup),4)
colnames(dataLevel1) <- c("Group","X","W","Y")
rowIndex <- 0
for (group in 1:nGroup) {
u0 <- rnorm(1,mean=0,sd=1)
u1 <- rnorm(1,mean=0,sd=1)
w <- rnorm(1,mean=0,sd=1)
for(i in 1:length(nSubWithinGroup)){
for (j in 1:nSubWithinGroup[i]){
r0 <- rnorm(1,mean=0,sd=1)
x <- rnorm(1,mean=0,sd=1)
y <- (gamma00+gamma01*w+u0)+(gamma10+gamma11*w+u1)*x+r0
rowIndex <- rowIndex + 1
dataLevel1[rowIndex,] <- c(group,x,w,y)
}
}
}
I ran the code, and the "Group" column only contains the value 1, never 2, 3, or 4. It also throws an error:
"Error in [<-(*tmp*, rowIndex, , value = c(2, -1.94476463667851, -0.153516782293473, :
subscript out of bounds"
Your original issue was a bit hard to find with all the for loops, but you were looping twice over your grouping level (once in 1:nGroup and then again in 1:length(nSubWithinGroup)). The inner loops alone already produce 1000 rows, so the four passes of the outer loop try to write 4 x 1000 = 4000 rows into a matrix that only has 1000, and thus your error. (If you want to check, run your loop without assigning to dataLevel1 and see what value rowIndex has at the end.)
However, generating data like this in R can be notoriously slow, and every function you use with n=1 can just as easily be used to generate nTotal numbers. I have rewritten your code into something that's (hopefully) more readable, but also more vectorized.
#set seed; you can never reproduce your result if you don't do this
set.seed(289457)
#set constants
gamma00 <- 1
gamma01 <- 1 ## b0 = gamma00+gamma01*w+u0
gamma10 <- 1 ## b1 = gamma10+gamma11*w+u1
gamma11 <- 1
#set size parameters
nSubWithinGroup <- c(100,200,300,400)###the sample size in each group
nGroup <-4
nTotal <- sum(nSubWithinGroup)
#simulate group-level data
level2_data <- data.frame(group=1:nGroup,
size=nSubWithinGroup, #not really necessary here, but I like to have everything documented/accessible
u0 = rnorm(nGroup,mean=0,sd=1),
u1 = rnorm(nGroup,mean=0,sd=1),
w = rnorm(nGroup,mean=0,sd=1)
)
#simulate individual_level data (in the example code, x and r0 were generated in the same way for each individual)
level1_data <- data.frame(id=1:nTotal,
group=rep(1:nGroup, nSubWithinGroup),
r0 = rnorm(nTotal,mean=0,sd=1),
x = rnorm(nTotal, mean=0,sd=1)
)
#several possibilities here, you can merge the two dataframes together or reference the level2data when calculating the outcome
#merging generates more data, but is also readable
combined_data <- merge(level1_data,level2_data,by="group",all.x=T)
#calculate outcome. This can be shortened for instance by calculating some linear parts before
#merging but wanted to stay as close to original code as possible.
combined_data$y <- (gamma00+gamma01*combined_data$w+combined_data$u0)+
(gamma10+gamma11*combined_data$w+combined_data$u1)*combined_data$x+combined_data$r0
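The other possibility mentioned in the comments, referencing the level-2 values directly instead of merging, could be sketched like this. It is a standalone variant with its own variable names (group, sim), not a continuation of the code above, and the gamma constants are simply repeated from the question.

```r
set.seed(289457)
gamma00 <- 1; gamma01 <- 1; gamma10 <- 1; gamma11 <- 1
nSubWithinGroup <- c(100, 200, 300, 400)
nGroup <- length(nSubWithinGroup)
nTotal <- sum(nSubWithinGroup)

# group membership for every individual: 100 ones, 200 twos, ...
group <- rep(seq_len(nGroup), times = nSubWithinGroup)

# level-2 effects, one value per group
u0 <- rnorm(nGroup)
u1 <- rnorm(nGroup)
w  <- rnorm(nGroup)

# level-1 values, one per individual
x  <- rnorm(nTotal)
r0 <- rnorm(nTotal)

# indexing by group expands each level-2 value to its individuals
y <- (gamma00 + gamma01 * w[group] + u0[group]) +
     (gamma10 + gamma11 * w[group] + u1[group]) * x + r0

sim <- data.frame(group = group, x = x, w = w[group], y = y)
table(sim$group)
```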

lm models over all possible pairwise combinations of the columns of two matrices

I am working through a problem at the moment in R and have got stuck. I have searched around on various help lists for assistance but could not find anything - but apologies if I have missed something. A dummy example of my problem is below. I will continue to work on it, but any help would be greatly appreciated.
Thanks in advance for your time.
I have a matrix of response variables:
p<-matrix(c(rnorm(120,1),
rnorm(120,1),
rnorm(120,1)),
120,3)
and two matrices of covariates:
g<-matrix(c(rep(1:3, each=40),
rep(3:1, each=40),
rep(1:3, 40)),
120,3)
m<-matrix(c(rep(1:2, 60),
rep(2:1, 60),
rep(1:2, each=60)),
120,3)
For all combinations of the columns of the covariate matrices g and m I want to run these two models:
test <- function(uniq_m, uniq_g, p = p) {
full <- lm(p ~ factor(uniq_m) * factor(uniq_g))
null <- lm(p ~ factor(uniq_m) + factor(uniq_g))
return(list('f'=full, 'n'=null))
}
So I want to test for an interaction between column 1 of m and column 1 of g, then column 2 of m and column 1 of g, then column 2 of m and column 2 of g...and so forth across all possible pairwise interactions. The response variable is the same each time and is a matrix containing multiple columns.
So far, I can do this for a single combination of columns:
test_1 <- test(m[ ,1], g[ ,1], p)
And I can also run the model over all columns of m and one column of g:
test_2 <- apply(m, 2, function(uniq_m) {
test(uniq_m, g[ ,1], p = p)
})
I can then get the F statistics for each response variable of each model:
sapply(summary(test_2[[1]]$f), function(x) x$fstatistic)
sapply(summary(test_2[[1]]$n), function(x) x$fstatistic)
And I can compare models for each response variable using an F-test:
d1<-colSums(matrix(residuals(test_2[[1]]$n),nrow(g),ncol(p))^2)
d2<-colSums(matrix(residuals(test_2[[1]]$f),nrow(g),ncol(p))^2)
F<-((d1-d2) / (d2/114))
My question is how do I run the lm models over all combinations of columns from the m and the g matrix, and get the F-statistics?
While this is a dummy example, the real analysis will have a response matrix that is 700 x 8000, and the covariate matrices will be 700 x 4000 and 700 x 100 so I need something that is as fast as possible.
Hopefully this helps, this is some code a friend of mine shared with me. It may not be exactly what you need but might set you off in the right direction (though given this is 9 months later than you asked it, it may be of no use to you specifically!):
#### this first function models the correlation and fixes the text size based on the strength of the correlation
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
##### this function places a histogram of your data on the diagonal
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...)
}
### read in Fishers famous iris dataset for our example
data(iris)
head(iris)
library(corrgram)
##corrgram also gives you some nice panel options to use in pairs, but you don't necessarily need them
##e.g. panel.ellipse, panel.pie, panel.conf
library(asbio)
##asbio offers more panel options, such as a linear regression (panel.lm) etc
### run pairs() on your data
### set upper panel to panel.cor (the function we just wrote), and diagonal to panel.hist
### do what you like for the lower panel; adding a smoother line isn't very informative
pairs(~ Sepal.Length + Sepal.Width + Petal.Length, data=iris, lower.panel=panel.lm, upper.panel=panel.cor, diag.panel = panel.hist, main="pair plots of variables")
All credit to James Keating.
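Coming back to the original question of running both models over every pair of columns, one possible sketch uses expand.grid to enumerate the pairs and mapply to fit them. It computes the usual nested-model F statistic, dividing the numerator by the difference in residual degrees of freedom, which differs slightly from the formula in the question. The names pairs_idx and f_stats are made up, and this is unlikely to be fast enough as-is for 4000 x 100 column pairs.

```r
set.seed(1)
p <- matrix(rnorm(360, 1), 120, 3)
g <- matrix(c(rep(1:3, each = 40), rep(3:1, each = 40), rep(1:3, 40)), 120, 3)
m <- matrix(c(rep(1:2, 60), rep(2:1, 60), rep(1:2, each = 60)), 120, 3)

# every (column of m, column of g) pair; m_col varies fastest
pairs_idx <- expand.grid(m_col = seq_len(ncol(m)), g_col = seq_len(ncol(g)))

f_stats <- t(mapply(function(i, j) {
  fm <- factor(m[, i])
  fg <- factor(g[, j])
  full <- lm(p ~ fm * fg)   # with interaction
  null <- lm(p ~ fm + fg)   # main effects only
  rss_full <- colSums(residuals(full)^2)
  rss_null <- colSums(residuals(null)^2)
  df_diff <- null$df.residual - full$df.residual
  # one F statistic per response column
  ((rss_null - rss_full) / df_diff) / (rss_full / full$df.residual)
}, pairs_idx$m_col, pairs_idx$g_col))

dim(f_stats)   # one row per pair, one column per response
```

Pairs whose factor combinations leave empty cells (here, column 3 of m with the first two columns of g) have no interaction degrees of freedom to test, so their rows come out non-finite and should be filtered before use.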
