Writing a loop for randomly selecting rows of a matrix and doing a linear regression on data from rows and storing in a matrix - r

I need to write a program that does the following in R:
I have a data set (42 rows, 2 columns) of y variables and x variables.
I want to randomly select 12 rows from this matrix and record the coefficients (slope and intercept) of a linear regression of the randomly generated matrix. I would also like to write a loop for this so I can repeat this 1000 times, so I can then have a matrix with 1000 rows and 2 columns filled in with the slopes and intercepts of the 1000 randomly selected sets of 12 rows from my data set.
I am able to get this far, but I do not know how to incorporate a loop into the code or how to store the coefficients in a matrix.
#Box.Z and Box.DC.gm are columns of data used to generate my initial matrix of data
A <- matrix(c(Box.Z, Box.DC.gm), nrow=42)
B <- A[sample(42, 12), ]
C <- lm(B[,2] ~ B[,1])
D <- matrix(c(coefficients(C)), ncol =2)

Something like this maybe:
#set.seed(23)
A <- matrix(runif(84),ncol=2)
randco <- function(A) {
B <- A[sample(42,12),]
lm(B[,2] ~ B[,1])$coefficients
}
t(replicate(10,randco(A)))
# (Intercept) B[, 1]
# [1,] 0.6018459 -0.1643174222
# [2,] 0.4411607 0.0005322798
#...
# [9,] 0.3201649 0.4848679516
#[10,] 0.5413830 0.1850853748
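To get the 1000 x 2 matrix the question asks for, the same idea can be run with 1000 replications. The following is only a sketch: the data are simulated with runif as a stand-in for the real columns, the argument n and the column labels "intercept" and "slope" are names chosen here for illustration, and nrow(A) replaces the hard-coded 42.

```r
set.seed(23)
# Simulated stand-in for matrix(c(Box.Z, Box.DC.gm), nrow = 42)
A <- matrix(runif(84), ncol = 2)

# Fit y ~ x on n randomly chosen rows and return (intercept, slope)
randco <- function(A, n = 12) {
  B <- A[sample(nrow(A), n), ]
  coef(lm(B[, 2] ~ B[, 1]))
}

# replicate() returns one column per run, so transpose to one row per run
coefs <- t(replicate(1000, randco(A)))
colnames(coefs) <- c("intercept", "slope")
dim(coefs)
```

Each row of coefs then holds the coefficients from one random subset of 12 rows.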

Related

Pearson coefficient per rows on large matrices

I'm currently working with a large matrix (4 cols and around 8000 rows).
I want to perform a correlation analysis using Pearson's correlation coefficient between the different rows composing this matrix.
I would like to proceed the following way:
Find Pearson's correlation coefficient between row 1 and row 2. Then between rows 1 and 3... and so on with the rest of the rows.
Then find Pearson's correlation coefficient between row 2 and row 3. Then between rows 2 and 4... and so on with the rest of the rows. Note I won't find the coefficient with row 1 again...
For those coefficients being higher or lower than 0.7 or -0.7 respectively, I would like to list on a separate file the row names corresponding to those coefficients, plus the coefficient. E.g.:
row 230 - row 5812 - 0.76
I wrote the following code for this purpose. Unfortunately, it takes far too long to run (I estimated almost a week :( ).
for (i in 1:7999) {
  print("Analyzing row:")
  print(i)
  for (j in (i+1):8000) {
    value <- cor(alpha1k[i,], alpha1k[j,], use = "everything", method = "pearson")
    if (value > 0.7 | value < (-0.7)) {
      aristi <- c(row.names(alpha1k)[i], row.names(alpha1k)[j], value)
      arist1p <- rbind(arist1p, aristi)
    }
  }
}
So my question is whether there is any way I could do this faster. I read about running these calculations in parallel, but I have no clue how to make that work. I hope I made myself clear enough; thank you in advance!
As Roland pointed out, you can use the matrix version of cor to simplify your task. Just transpose your matrix to get a "row" comparison.
mydf <- data.frame(a = c(1,2,3,1,2,3,1,2,3,4), b = rep(5,2,10), c = c(1:10))
cor_mat <- cor(t(mydf)) # correlation of your transposed matrix
idx <- which((abs(cor_mat) > 0.7), arr.ind = T) # get relevant indexes in a matrix form
cbind(idx, cor_mat[idx]) # combine coordinates and the correlation
Note that use = "everything" and method = "pearson" are the defaults for cor, so there is no need to specify them.
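The index matrix above also contains the diagonal and both (i, j) and (j, i), whereas the question wants each pair only once. One possible way to restrict the result to unique pairs and report row names is sketched below. The matrix m and the column names row_a, row_b and r are made up for illustration. For 8000 rows the full correlation matrix holds 64 million doubles (roughly 512 MB), so this assumes it fits in memory.

```r
set.seed(1)
# Simulated stand-in for alpha1k: 10 named rows, 4 columns
m <- matrix(rnorm(40), nrow = 10,
            dimnames = list(paste0("row", 1:10), NULL))

cor_mat <- cor(t(m))                  # row-by-row correlations
cor_mat[!upper.tri(cor_mat)] <- NA    # drop the diagonal and the duplicate lower triangle

# which() skips NA, so only pairs with i < j survive
idx <- which(abs(cor_mat) > 0.7, arr.ind = TRUE)
res <- data.frame(row_a = rownames(cor_mat)[idx[, "row"]],
                  row_b = colnames(cor_mat)[idx[, "col"]],
                  r     = cor_mat[idx])
res
```

The data frame res could then be written to a file with write.table or write.csv.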

R / Rolling Regression with extended Data Frame

Hello, I'm currently working on a regression analysis with the following code:
for (i in 1:ncol(Ret1)){
r2.out[i]=summary(lm(Ret1[,1]~Ret1[,i]))$r.squared
}
r2.out
This code runs a simple OLS regression of the first column of the data frame on each column in turn and returns the R^2 of these regressions. At the moment each regression uses all data points in a column. What I need instead is for the code to use a rolling window of data points, so that it calculates the R^2 for a rolling window of 30 days over the entire time frame. The output should be a matrix with the R^2 per rolling window for each (1,i) pair.
The following code does the rolling regression part but does not run the regression for each (1,i) pair.
dolm <- function(x) summary(lm(Ret1[,1]~Ret1[,i]))$r.squared
rollapplyr(Ret1, 30, dolm, by.column = FALSE)
I really appreciate any help you can provide.
Using the built-in anscombe data frame we regress the y1 column against x1 and then x2, etc. We use a width of 3 here for purposes of illustration.
xnames should be set to the names of the x variables. In the anscombe data set the column names that begin with x are the x variables. As another example, if all the columns are x variables except the first then xnames <- names(DF)[-1] could be used.
We define an R squared function, rsq which takes the indexes to use, ix and the x variable name xname. We then sapply over the xnames and for each one rollapply over the indices 1:n.
library(zoo)
xnames <- grep("x", names(anscombe), value = TRUE)
n <- nrow(anscombe)
w <- 3
rsq <- function(ix, xname) summary(lm(y1 ~., anscombe[c("y1", xname)], subset = ix))$r.sq
sapply(xnames, function(xname) rollapply(1:n, w, rsq, xname = xname ))
giving the following result of dimensions n - w + 1 by length(xnames):
x1 x2 x3 x4
[1,] 2.285384e-01 2.285384e-01 2.285384e-01 0.0000000
[2,] 3.591782e-05 3.591782e-05 3.591782e-05 0.0000000
[3,] 9.841920e-01 9.841920e-01 9.841920e-01 0.0000000
[4,] 5.857410e-01 5.857410e-01 5.857410e-01 0.0000000
[5,] 9.351609e-01 9.351609e-01 9.351609e-01 0.0000000
[6,] 8.760332e-01 8.760332e-01 8.760332e-01 0.7724447
[7,] 9.494869e-01 9.494869e-01 9.494869e-01 0.7015512
[8,] 9.107256e-01 9.107256e-01 9.107256e-01 0.3192194
[9,] 8.385510e-01 8.385510e-01 8.385510e-01 0.0000000
Variations
1) It would also be possible to reverse the order of the rollapply and sapply replacing the last line of code with:
rollapply(1:n, 3, function(ix) sapply(xnames, rsq, ix = ix))
2) Another variation is to replace the definition of rsq and the sapply/rollapply line with the following single statement. It may be a bit harder to read so you may prefer the first solution but it does entail one simplification -- namely, xname need no longer be an explicit argument of the inner anonymous function (which takes the place of rsq above):
sapply(xnames, function(xname) rollapply(1:n, 3, function(ix)
summary(lm(y1 ~., anscombe[c("y1", xname)], subset = ix))$r.sq))
Update: Have fixed line which is now n <- nrow(anscombe)
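For a data frame shaped like the one in the question, the same idea can also be written without zoo. This is only a base R sketch under assumptions: Ret1 is simulated here, the window w is 30 rows, and column 1 is treated as the response with every other column as a regressor in turn.

```r
set.seed(42)
# Simulated stand-in for the asker's Ret1: 100 observations, 4 columns
Ret1 <- as.data.frame(matrix(rnorm(100 * 4), ncol = 4))

w <- 30                                   # window width
starts <- seq_len(nrow(Ret1) - w + 1)     # first row of each window

# One column per regressor, one row per window
roll_rsq <- sapply(names(Ret1)[-1], function(xname) {
  sapply(starts, function(s) {
    ix <- s:(s + w - 1)
    summary(lm(Ret1[ix, 1] ~ Ret1[ix, xname]))$r.squared
  })
})
dim(roll_rsq)
```

The result has nrow(Ret1) - w + 1 rows, matching the dimensions of the rollapply output above.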

How to generate matrix with certain rank in R

Does anyone know how to generate matrix with certain rank in R?
I ultimately want to create data matrix Y = X + E
where rank(X)=k and E~i.i.d.N(0,sigma^2).
The easiest option is the identity matrix, which always has full rank. So, for example, use:
k <- 10
mymatrix <- diag(k)
Here, the number of rows and the number of columns both equal the rank you specify.
I suppose you want to mimic a regression model, so you might want to have more rows ('observations') than columns ('variables'). The following code allows you to specify both:
k <- 5 # rank of your matrix
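nobs <- 10 # number of rows in X
X <- rbind(diag(k), matrix(0, nrow = nobs - k, ncol = k))
# one independent draw per entry; rnorm(nobs) alone would be recycled across the columns
y <- X + matrix(rnorm(nobs * k), nrow = nobs)
Note that X, and almost surely y as well, now has full column rank, so there is no multicollinearity in this 'model'.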
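If the rank k should be smaller than the number of columns, as in the question's Y = X + E, one common construction is to multiply an n x k matrix by a k x p matrix. The sketch below swaps in that technique rather than the identity-matrix approach above. The dimensions n and p and the value of sigma are arbitrary illustrative choices, and with Gaussian factors the product has rank k with probability one.

```r
set.seed(1)
n <- 10      # rows (observations)
p <- 6       # columns (variables)
k <- 3       # target rank, k <= min(n, p)
sigma <- 0.1 # noise standard deviation

# Product of an n x k and a k x p Gaussian matrix: rank at most k
X <- matrix(rnorm(n * k), n, k) %*% matrix(rnorm(k * p), k, p)
qr(X)$rank   # numerical rank, expected to equal k

# i.i.d. N(0, sigma^2) noise, one draw per entry
E <- matrix(rnorm(n * p, sd = sigma), n, p)
Y <- X + E
```

Adding E generally makes Y full rank, which is the usual setting for low-rank-plus-noise models.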

Create a matrix out of remaining data from a random row selection of a matrix, and use data to calculate RMSE in R

I have a matrix[A] with 42 rows and 2 columns. I then have a function that selects randomly 12 of these rows, does a linear regression of the randomly selected matrix and outputs the coefficients (slope and intercept) of the linear regression.
In R, I want to then get the other 30 rows from the original matrix that were not selected in my random function, and then use that data with my newly calculated coefficients, to generate a point (y-value). So I will have 30 y-values, and then from there I would like to calculate the RMSE (http://upload.wikimedia.org/math/e/f/b/efb7882a7dbfa5fe48d771565d2675f3.png) using the new y-values, and 1 of the columns in my new 30 row matrix.
The code below is what I currently have right now:
#Calibration Equation 1 (TC OFF)
A <- matrix(c(Box.CR, Box.DC.ww), nrow=42)
randco <- function(A) {
B<- A[sample(42,12),]
lm(B[,2] ~ B[,1])$coefficients
}
Z <- t(replicate(10000, randco(A)))
arows <- apply(A, 1, paste, collapse="_")
brows <- apply(B, 1, paste, collapse="_")
A[-match(brows, arows), ]
Alternative method, converting the matrix to a data.table
(not recommended if your sole purpose is what's described above)
library(data.table)
A <- as.data.table(A)
B <- A[sample(nrow(A), 12)]
setkey(A)
setkey(B)
A[!B]
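Matching rows by their pasted contents can be avoided by keeping the sampled row indices, since A[-idx, ] is then the remaining 30 rows. The sketch below is one possible way to combine that with the RMSE calculation. The data are simulated, and the function name holdout_rmse and its argument n_train are hypothetical names introduced here.

```r
set.seed(1)
# Simulated stand-in for matrix(c(Box.CR, Box.DC.ww), nrow = 42)
A <- cbind(x = runif(42), y = runif(42))

holdout_rmse <- function(A, n_train = 12) {
  idx   <- sample(nrow(A), n_train)       # rows used for the fit
  train <- A[idx, , drop = FALSE]
  test  <- A[-idx, , drop = FALSE]        # the rows that were not selected
  co    <- coef(lm(train[, 2] ~ train[, 1]))
  pred  <- co[1] + co[2] * test[, 1]      # predicted y for the held-out rows
  rmse  <- sqrt(mean((test[, 2] - pred)^2))
  c(intercept = unname(co[1]), slope = unname(co[2]), rmse = rmse)
}

# One row per repetition: intercept, slope, RMSE on the held-out rows
res <- t(replicate(1000, holdout_rmse(A)))
head(res)
```

Because the fit and the held-out rows come from the same call, each RMSE is guaranteed to correspond to the coefficients in the same row.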

Random sampling of two vectors, finding mean of sample, then making a matrix in R?

Thanks for your time!
My data frame is simple. Two columns: the first has genotype (1-39) and second has trait values (numerical, continuous). I would like to choose 8 genotypes and calculate the mean and stdev of the associated trait values.
In the end I would like to sample 8 genotypes 10,000 times and for each sample I would like to have the stdev and mean of the associated trait values. Ideally this would be in a matrix where each row represented a sample, 8 columns for each genotype, and 2 final columns for stdev and mean of the trait values associated with those genotypes. This could be oriented the other way too.
How do you sample from two different columns in a data frame so that both values show up in your new sample? i.e genotypes and trait values with mean and stdev calculated
How do you get this sample into a matrix as I've described above?
How do you repeat the process 10,000 times?
Thanks again!
This would return a single sample of all rows whose genotype is in a random sample of 8 genotypes:
dat[ dat$genotype %in% sample(1:39, 8), ]
The replicate function is designed to repeat a random process. For example, repeat 3 times, each time getting the sd of "trait" from a sample of 2 genotypes:
dat <- data.frame(genotype=sample(1:5, 25,replace=TRUE), trait=rnorm(25) )
replicate ( 3, sd(dat[ dat$genotype %in% sample(1:5, 2), "trait" ]) )
[1] 0.7231686 0.9225318 0.9225318
This records the sample ids with the means and sd values:
replicate ( 3, {c( samps =sample(1:5, 2),
sds=sd(dat[ dat$genotype %in% samps, "trait" ]) ,
means = mean(dat[ dat$genotype %in% samps, "trait" ]) )} )
[,1] [,2] [,3]
samps1 1.0000000 1.0000000 5.0000000
samps2 5.0000000 3.0000000 1.0000000
sds 0.8673977 0.8673977 0.8673977
means 0.2835325 0.2835325 0.2835325
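Note that in the last snippet `samps = sample(1:5, 2)` inside c() only names the elements; it does not create a variable, so `%in% samps` uses whatever samps already exists in the workspace. That would explain why the sds and means rows are identical in every column. A sketch that assigns the sample first and scales up to the question's setting (8 of 39 genotypes, 10,000 repetitions) might look like this. The data are simulated with one row per genotype, and one_draw and the geno1..geno8 column labels are hypothetical names.

```r
set.seed(7)
# Simulated stand-in: one trait value per genotype 1..39
dat <- data.frame(genotype = 1:39, trait = rnorm(39))

one_draw <- function(dat, n_geno = 8) {
  samps <- sort(sample(unique(dat$genotype), n_geno))  # assign first, then reuse
  vals  <- dat$trait[dat$genotype %in% samps]
  c(samps, mean = mean(vals), sd = sd(vals))
}

# One row per sample: 8 genotype ids, then mean and sd of their trait values
res <- t(replicate(10000, one_draw(dat)))
colnames(res)[1:8] <- paste0("geno", 1:8)
head(res, 3)
```

Transposing gives the layout described in the question, with each row representing one sample.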
