Create a matrix from the rows left out of a random row selection of a matrix, and use that data to calculate RMSE in R

I have a matrix A with 42 rows and 2 columns. I have a function that randomly selects 12 of these rows, does a linear regression on the randomly selected rows, and outputs the coefficients (slope and intercept) of the regression.
In R, I then want to get the other 30 rows from the original matrix that were not selected by my random function, and use that data with my newly calculated coefficients to generate a predicted point (y-value) for each row. I will then have 30 predicted y-values, from which I would like to calculate the RMSE, RMSE = sqrt((1/n) * sum((yhat_i - y_i)^2)) (http://upload.wikimedia.org/math/e/f/b/efb7882a7dbfa5fe48d771565d2675f3.png), comparing the predictions against the observed column of my new 30-row matrix.
The code below is what I currently have:
#Calibration Equation 1 (TC OFF)
#Box.CR and Box.DC.ww are the two data columns
A <- matrix(c(Box.CR, Box.DC.ww), nrow = 42)
randco <- function(A) {
  B <- A[sample(42, 12), ]             # randomly pick 12 of the 42 rows
  lm(B[, 2] ~ B[, 1])$coefficients     # intercept and slope of the fit
}
Z <- t(replicate(10000, randco(A)))    # 10000 x 2 matrix of coefficients

# Recover the unselected rows: collapse each row to a string key
# and drop the rows of A whose key also appears in B
arows <- apply(A, 1, paste, collapse = "_")
brows <- apply(B, 1, paste, collapse = "_")
A[-match(brows, arows), ]              # the 30 rows that were not sampled
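From there, the RMSE step is short. Below is a minimal sketch, reusing a single 12-row sample B and the exclusion trick above (the names test, co, pred, and rmse are illustrative):
test <- A[-match(brows, arows), ]            # the 30 held-out rows
co <- lm(B[, 2] ~ B[, 1])$coefficients       # intercept and slope from the 12-row fit
pred <- co[1] + co[2] * test[, 1]            # predicted y-value for each held-out row
rmse <- sqrt(mean((pred - test[, 2])^2))     # root-mean-square error of the predictions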
Alternative method, converting the matrix to a data.table
(not recommended if your sole purpose is what's described above)
library(data.table)
A <- as.data.table(A)
B <- A[sample(nrow(A), 12)]   # random 12-row sample
setkey(A)
setkey(B)
A[!B]                         # anti-join: the rows of A that are not in B
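A quick check that the anti-join leaves the expected number of rows (assuming the 42 rows are all distinct):
nrow(A[!B])   # 30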

Related

How do we do a linear regression of each column of a dataset against one vector, column by column?

I have a 6x1 vector and a 6x1000 matrix. I want to do a linear regression on each column of the matrix using the same vector. I need this algorithm to go through my matrix and extract the R^2 value from each correlation/regression. Can anyone help with this? Thanks!
You need to provide data, e.g.
set.seed(42)
x <- matrix(rnorm(6), 6, 1)
y <- matrix(rnorm(6*10), 6, 10) # Just 10 columns to demonstrate
The first solution uses x as the independent variable and each column of y as the dependent variable in a linear regression, then extracts the R^2 value. The second solution simply computes the squared correlation coefficient, which is the same thing for a bivariate regression:
corrs <- sapply(1:10, function(i) summary(lm(y[, i]~x))$r.squared)
# [1] 0.039143014 0.003056088 0.897015721 0.282917356 0.019288198 0.001808288 0.055232746 0.276741234 0.008821625 0.073663713
sapply(1:10, function(i) cor(x, y[, i])^2)
# [1] 0.039143014 0.003056088 0.897015721 0.282917356 0.019288198 0.001808288 0.055232746 0.276741234 0.008821625 0.073663713
names(corrs) <- 1:10 # Label the columns
corrs[which(corrs > .25)]
# 3 4 8
# 0.8970157 0.2829174 0.2767412
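Since cor accepts matrix arguments and correlates their columns, the sapply can also be replaced by a single vectorized call, which matters for the full 6x1000 case. A sketch (corrs2 is an illustrative name):
corrs2 <- as.vector(cor(x, y)^2)   # cor(x, y) is a 1 x ncol(y) matrix of correlations
names(corrs2) <- seq_len(ncol(y))
corrs2[corrs2 > .25]               # selects the same columns: 3, 4 and 8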

Pearson correlation coefficient between rows of a large matrix

I'm currently working with a large matrix (4 cols and around 8000 rows).
I want to perform a correlation analysis using Pearson's correlation coefficient between the different rows composing this matrix.
I would like to proceed in the following way:
Find Pearson's correlation coefficient between row 1 and row 2. Then between rows 1 and 3... and so on with the rest of the rows.
Then find Pearson's correlation coefficient between row 2 and row 3. Then between rows 2 and 4... and so on with the rest of the rows. Note I won't find the coefficient with row 1 again...
For coefficients greater than 0.7 or less than -0.7, I would like to list, in a separate file, the row names corresponding to those coefficients, plus the coefficient itself. E.g.:
row 230 - row 5812 - 0.76
I wrote the following code for this aim. Unfortunately, it takes too long to run (I estimated almost a week :( ).
for (i in 1:7999) {
  print("Analyzing row:")
  print(i)
  for (j in (i+1):8000) {
    value <- cor(alpha1k[i, ], alpha1k[j, ], use = "everything", method = "pearson")
    if (value > 0.7 | value < (-0.7)) {
      aristi <- c(row.names(alpha1k)[i], row.names(alpha1k)[j], value)
      arist1p <- rbind(arist1p, aristi)   # arist1p must be initialized (e.g. arist1p <- NULL) before the loop
    }
  }
}
My question, then, is whether there is any way to do this faster. I read about running these calculations in parallel, but I have no clue how to make that work. I hope I made myself clear enough; thank you in advance!
As Roland pointed out, you can use the matrix version of cor to simplify your task. Just transpose your matrix to get a "row" comparison.
mydf <- data.frame(a = c(1,2,3,1,2,3,1,2,3,4), b = rep(5, 2, 10), c = 1:10)
cor_mat <- cor(t(mydf))                           # correlation of your transposed matrix: row vs row
idx <- which(abs(cor_mat) > 0.7, arr.ind = TRUE)  # get relevant indexes in matrix form
cbind(idx, cor_mat[idx])                          # combine coordinates and the correlation
Note that the parameters use = "everything" and method = "pearson" are cor's defaults, so there is no need to specify them.
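One caveat: which, as used above, also returns the diagonal (every row correlates perfectly with itself) and both orderings of each pair. Here is a sketch that keeps each pair once and writes the row names plus the coefficient to a separate file, as the question asks (the file name is illustrative):
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA     # blank out the diagonal and one triangle
idx <- which(abs(cor_mat) > 0.7, arr.ind = TRUE)   # each qualifying pair now appears once
out <- data.frame(row1 = rownames(cor_mat)[idx[, 1]],
                  row2 = colnames(cor_mat)[idx[, 2]],
                  cor  = cor_mat[idx])
write.csv(out, "high_correlations.csv", row.names = FALSE)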

Using a loop to create matrices in R

I'm trying to do a leave-one-out cross-validation on a relatively small dataset (n = 22, p = 17) on a linear regression made from the LARS algorithm. Essentially I need to create n matrices of standardized data (each column consists of entries centered by the mean and standardized by the SD of the column).
I've never used lists before, but would be open to making lists as long as columns of the different matrices can be manipulated/standardized.
Here's what I tried in R:
for (i in 1:n) {
  # creates an (n-1) x p matrix, rebuilt on every pass
  x.standardized.i <- matrix(data = NA, nrow = n - 1, ncol = p)
  for (j in 1:p) {
    # standardizes the p variables with the ith row missing (i increments from 1 to n)
    x.standardized.i[, j] <- (x[-i, j] - mean(x[-i, j])) / sd(x[-i, j])
  }
}
I'm not sure if I can share the data, since it's related to grades from a class, but when I run the code the loop finishes and all I'm left with is the final standardized matrix (the one with the last row missing) in x.standardized.i, because each iteration overwrites the previous one.
You can do this quite simply with sapply and scale:
# Create dummy data
m <- matrix(runif(200), ncol=10)
# Leave each row out in turn, and scale each column
A <- sapply(seq_len(nrow(m)), function(i) scale(m[-i, ]), simplify='array')
By default, scale centres each column on its mean, and divides by its sd.
For the example above, you'll end up with an array with 19 rows, 10 columns and 20 slices.
To access particular slices (i.e. cross-validation training folds), you can subset like this:
A[,, 1] # all rows, all cols, first slice
A[,, 10] # all rows, all cols, tenth slice
To confirm that columns are centred on their mean and standardised by one sd:
apply(A, c(2, 3), mean)   # all zero, up to floating-point error
apply(A, c(2, 3), sd)     # all one
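If you would rather have a list of matrices, as the question mentions, lapply produces the same folds (a sketch; folds is an illustrative name):
folds <- lapply(seq_len(nrow(m)), function(i) scale(m[-i, ]))
folds[[1]]   # the training fold with row 1 left out: 19 x 10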

Writing a loop to randomly select rows of a matrix, run a linear regression on each selection, and store the coefficients in a matrix

I need to write a program that does the following in R:
I have a data set (42 rows, 2 columns) of y variables and x variables.
I want to randomly select 12 rows from this matrix and record the coefficients (slope and intercept) of a linear regression on the randomly selected rows. I would also like to write a loop for this so I can repeat it 1000 times, giving a matrix with 1000 rows and 2 columns filled with the slopes and intercepts of the 1000 randomly selected sets of 12 rows from my data set.
I am able to get this far, but I do not know how to incorporate a loop into the code, or how to store the coefficients in a matrix.
#Box.Z and Box.DC.gm are columns of data used to generate my initial matrix of data
A <- matrix(c(Box.Z, Box.DC.gm), nrow = 42)
B <- A[sample(42, 12), ]
C <- lm(B[, 2] ~ B[, 1])
D <- matrix(coefficients(C), ncol = 2)
Something like this maybe:
#set.seed(23)
A <- matrix(runif(84), ncol = 2)   # dummy 42 x 2 data
randco <- function(A) {
  B <- A[sample(42, 12), ]
  lm(B[, 2] ~ B[, 1])$coefficients
}
t(replicate(10, randco(A)))
#      (Intercept)        B[, 1]
# [1,]   0.6018459 -0.1643174222
# [2,]   0.4411607  0.0005322798
# ...
# [9,]   0.3201649  0.4848679516
#[10,]   0.5413830  0.1850853748
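For the 1000 repetitions the question asks for, only the replicate count changes (Z is an illustrative name):
Z <- t(replicate(1000, randco(A)))   # 1000 x 2 matrix of intercepts and slopes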

Computing linear regressions for every possible permutation of matrix columns

I have a (k x n) matrix. I initially managed to linearly regress (using the lm function) column 1 against each of the other columns, extracting only the coefficients.
fore.choose <- matrix(0, 1, NCOL(assets))
for (i in seq(1, NCOL(assets), 1)) {
  abc <- lm(assets[, 1] ~ assets[, i])$coefficients
  fore.choose[1, i] <- abc[2:length(abc)]
}
The coefficients are placed in the fore.choose matrix.
What I now need to do is regress column 2 against each of the other columns, then column 3, and so on, again extracting only the coefficients.
The output will be a square matrix of OLS univariate coefficients. Kind of similar to a correlation matrix, but it is the beta coefficients I am interested in.
fore.choose <- matrix(0, 1, NCOL(assets))
will initially need to become
fore.choose <- matrix(0, NCOL(assets), NCOL(assets))
I'd just compute the coefficients directly from the correlation matrix, using the fact that the slope of the regression of y on x is beta = cor(x, y) * sd(y) / sd(x); entry [i, j] below is then the slope of column i regressed on column j. Like this:
# set up some sample data
set.seed(1)
d <- matrix(rnorm(50), ncol=5)
# get the coefficients
s <- apply(d, 2, sd)
cor(d)*outer(s, s, "/")
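A quick sanity check of the closed form against lm for one pair of columns:
lm(d[, 1] ~ d[, 2])$coefficients[2]             # slope from lm
cor(d[, 1], d[, 2]) * sd(d[, 1]) / sd(d[, 2])   # the same value from the formula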
You could also use lsfit to get the coefficients of one term on all the others at once, so only one loop is needed:
sapply(1:ncol(d), function(i) {
  coef(lsfit(d[, i], d))[2, ]   # slopes of every column regressed on d[, i]
})
I'm sure there must be a more elegant way than nested loops, but here is one:
fore.choose <- matrix(NA, NCOL(assets), NCOL(assets))
abc <- NULL
for (i in seq_len(ncol(assets))) {    # loop over "dependent" columns
  for (j in seq_len(ncol(assets))) {  # loop over "independent" columns
    abc <- lm(assets[, i] ~ assets[, j])$coefficients
    fore.choose[i, j] <- abc[-1]
  }
}
