How do we do linear regression on a dataset and regress it to one vector column by column? - r

I have a 6x1 vector and a 6x1000 matrix. I want to do linear regression on each column using the same vector. I need this algorithm to go through my matrix and extract the R^2 value from each correlation/regression. Can anyone help with this? Thanks!

You need to provide data, e.g.
set.seed(42)
x <- matrix(rnorm(6), 6, 1)
y <- matrix(rnorm(6*10), 6, 10) # Just 10 columns to demonstrate
The first solution fits a linear regression with x as the independent variable and each column of y as the dependent variable, then extracts the R^2 value. The second solution just computes the squared correlation coefficient, since this equals R^2 for a bivariate regression:
corrs <- sapply(1:10, function(i) summary(lm(y[, i]~x))$r.squared)
# [1] 0.039143014 0.003056088 0.897015721 0.282917356 0.019288198 0.001808288 0.055232746 0.276741234 0.008821625 0.073663713
sapply(1:10, function(i) cor(x, y[, i])^2)
# [1] 0.039143014 0.003056088 0.897015721 0.282917356 0.019288198 0.001808288 0.055232746 0.276741234 0.008821625 0.073663713
names(corrs) <- 1:10 # Label the columns
corrs[which(corrs > .25)]
# 3 4 8
# 0.8970157 0.2829174 0.2767412
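Since cor() accepts matrices, the loop can also be collapsed into a single call. This is just a sketch using the same x and y as above; for the real 6x1000 matrix nothing changes except ncol(y):
r2 <- drop(cor(x, y)^2) # cor(x, y) returns a 1 x ncol(y) matrix of correlations; square it and drop to a vector
names(r2) <- seq_len(ncol(y)) # label by column index
r2[r2 > .25] # same filtering step as above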

Related

How to perform a multivariate linear regression when y is an indicator matrix in r?

This is the first time I am posting a question, so I hope it is not confusing. Thanks very much for your time.
I am working on a zipcode dataset, which can be downloaded here: http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.train.gz
http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/zip.test.gz
In general, my goal is to fit a principal component regression model with the top 3 PCs on the training dataset, restricted to observations whose response variable is the handwritten digit 2, 3, 5, or 8, and then predict using the test data. My main problem is that after performing PCA on the X matrix, I am not sure I did the regression part correctly. I have turned the response variable into a 2487x4 indicator matrix, and want to fit a multivariate linear regression model. But the prediction results are not binary indicators, so I am confused about how to map the predictions back to the original response variable, i.e., which observations are predicted as 2, 3, 5, or 8. Or did I do the regression part totally wrong? Here is my code:
First of all, I built the subset whose response variables are equal to 2, 3, 5, or 8:
zip_train <- read.table(gzfile("zip.train.gz"))
zip_test <- read.table(gzfile("zip.test.gz"))
train <- data.frame(zip_train)
train_sub <- train[which(train$V1 == 2 | train$V1 == 3 | train$V1 == 5 | train$V1 == 8),]
test <- data.frame(zip_test)
test_sub <- test[which(test$V1 == 2 | test$V1 == 3 | test$V1 == 5 | test$V1 == 8),]
xtrain <- train_sub[,-1]
xtest <- test_sub[,-1]
ytrain <- train_sub$V1
ytest <- test_sub$V1
Second, I centered the X matrix, and calculated the top 3 principal components by using svd:
cxtrain <- scale(xtrain)
svd.xtrain <- svd(cxtrain)
cxtest <- scale(xtest)
svd.xtest <- svd(cxtest)
utrain.r3 <- svd.xtrain$u[,c(1:3)] # this is the u_r
vtrain.r3 <- svd.xtrain$v[,c(1:3)] # this is the v_r
dtrain.r3 <- svd.xtrain$d[c(1:3)]
Dtrain.r3 <- diag(x=dtrain.r3,ncol=3,nrow=3) # create the diagonal matrix D with r=3
ztrain.r3 <- cxtrain %*% vtrain.r3 # this is the scores, the new components
utest.r3 <- svd.xtest$u[,c(1:3)]
vtest.r3 <- svd.xtest$v[,c(1:3)]
dtest.r3 <- svd.xtest$d[c(1:3)]
Dtest.r3 <- diag(x=dtest.r3,ncol=3,nrow=3)
ztest.r3 <- cxtest %*% vtest.r3
Third, and this is the part I am not sure I did correctly: I turned the response variables into an indicator matrix and performed a multivariate linear regression like this:
ytrain.ind <-cbind(I(ytrain==2)*1,I(ytrain==3)*1,I(ytrain==5)*1,I(ytrain==8)*1)
ytest.ind <- cbind(I(ytest==2)*1,I(ytest==3)*1,I(ytest==5)*1,I(ytest==8)*1)
mydata <- data.frame(cbind(ztrain.r3,ytrain.ind))
model_train <- lm(cbind(X4,X5,X6,X7)~X1+X2+X3,data=mydata)
new <- data.frame(ztest.r3)
pred <- predict(model_train,newdata=new)
However, pred is not an indicator matrix, so I am lost as to how to interpret the predictions back to digits and compare them with the real test data in order to calculate the prediction error.
I finally figured out how to perform multivariate linear regression with a categorical y. First we turn y into an indicator matrix, so the 0s and 1s in this matrix can be interpreted as probabilities. We then regress y on x to build a linear model, and finally use this linear model to predict with the test set of x. The result is a matrix with the same dimensions as our indicator matrix, and all the entries should be interpreted as probabilities too, although they can be larger than 1 or smaller than 0 (that's what confused me before). So we find the maximum entry per row to see which predicted y has the highest probability, and that y is our final prediction. In this way we convert the continuous numbers back into categories, and can then make a table to compare against the test set of y. So I updated my previous code as below.
First of all, I built the subset whose response variables are equal to 2, 3, 5, or 8 (the code remains the same as in my question):
zip_train <- read.table(gzfile("zip.train.gz"))
zip_test <- read.table(gzfile("zip.test.gz"))
train <- data.frame(zip_train)
train_sub <- train[which(train$V1 == 2 | train$V1 == 3 | train$V1 == 5 | train$V1 == 8),]
test <- data.frame(zip_test)
test_sub <- test[which(test$V1 == 2 | test$V1 == 3 | test$V1 == 5 | test$V1 == 8),]
xtrain <- train_sub[,-1]
xtest <- test_sub[,-1]
ytrain <- train_sub$V1
ytest <- test_sub$V1
Second, I centered the X matrix and calculated the top 3 principal components using eigen(). I updated this part of the code because in my previous code I standardized x instead of centering it, which led to a wrong computation of the covariance matrix of x and of the eigenvectors of cov(x).
cxtrain <- scale(xtrain, center = TRUE, scale = FALSE)
eigenxtrain <- eigen(t(cxtrain) %*% cxtrain / (nrow(cxtrain) -1)) # same as eigen(cov(xtrain)), because x has already been centered
cxtest <- scale(xtest, center = TRUE, scale = FALSE)
eigenxtest <- eigen(t(cxtest) %*% cxtest/ (nrow(cxtest) -1))
r=3 # set r=3 to keep the top 3 principal components
vtrain <- eigenxtrain$vectors[,c(1:r)]
ztrain <- scale(xtrain) %*% vtrain # these are the scores, the new components
vtest <- eigenxtrain$vectors[,c(1:r)]
ztest <- scale(xtest) %*% vtest
Third, I turned the response variables into an indicator matrix and performed a multivariate linear regression on the training set, and then used this linear model to predict:
ytrain.ind <- cbind(I(ytrain==2)*1,I(ytrain==3)*1,I(ytrain==5)*1,I(ytrain==8)*1)
ytest.ind <- cbind(I(ytest==2)*1,I(ytest==3)*1,I(ytest==5)*1,I(ytest==8)*1)
mydata <- data.frame(cbind(ztrain,ytrain.ind))
model_train <- lm(cbind(X4,X5,X6,X7)~X1+X2+X3,data=mydata)
new <- data.frame(ztest)
pred<- predict(model_train,newdata=new)
pred is a matrix whose entries are all probabilities, so we need to convert it back into a list of categorical y values.
pred.ind <- matrix(0, nrow=690, ncol=4) # build a matrix with the same dimensions as pred, all entries 0
for (i in 1:690){
  j=which.max(pred[i,]) # j is the column number of the highest probability in row i
  pred.ind[i,j]=1 # set 1 in the column with the highest probability per row; this turns pred back into an indicator matrix
}
pred.col1=as.matrix(pred.ind[,1]*2) # first column are those predicted as digit 2
pred.col2=as.matrix(pred.ind[,2]*3)
pred.col3=as.matrix(pred.ind[,3]*5)
pred.col4=as.matrix(pred.ind[,4]*8)
pred.col5 <- cbind(pred.col1,pred.col2,pred.col3,pred.col4)
pred.list <- NULL
for (i in 1:690){
  pred.list[i]=max(pred.col5[i,])
} # In this way, we could finally get a list with categorical y
tt=table(pred.list,ytest)
err=(sum(tt)-sum(diag(tt)))/sum(tt) # error rate was 0.3289855
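As a side note, the conversion from pred back to digits can be written more compactly with which.max() applied row-wise. This is only a sketch equivalent to the loop above, assuming the four columns of pred correspond to the digits 2, 3, 5, and 8 in that order:
digits <- c(2, 3, 5, 8)
pred.list2 <- digits[apply(pred, 1, which.max)] # pick the digit whose column has the largest predicted value in each row
tt2 <- table(pred.list2, ytest)
err2 <- (sum(tt2) - sum(diag(tt2))) / sum(tt2) # should reproduce the error rate from the loop above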
For the third part, we could instead perform a multinomial logistic regression. In that case we don't need to convert y into an indicator matrix; we just treat it as a factor. So the code looks as below:
library(nnet)
trainmodel <- data.frame(cbind(ztrain, ytrain))
mul <- multinom(factor(ytrain) ~., data=trainmodel)
new <- as.matrix(ztest)
colnames(new) <- colnames(trainmodel)[1:r]
predict<- predict(mul,new)
tt=table(predict,ytest)
err=(sum(tt)-sum(diag(tt)))/sum(tt) # error rate was 0.2627907
So it showed that the logistic model does perform better than the linear model.

Saving output from for-loop to 3D array in R

I am working in R to save outputs from a 'for' loop into a 3D array. I have been unable to adapt a similar example answered here for my purposes, so I'd like to share a different example.
I have a mostly-completed "for" loop that generates slopes and intercepts from a linear model for N iterations, with each iteration using a new set of y-values drawn from a random t-distribution (rt).
The desired output is a 3D array with two slices, here named "out2". One slice is named "Intercept" and the other is "Slope". Each column in both slices holds the results from models generated with a different number of degrees of freedom (dfs).
set.seed(14)
x <- sample(0:50, 15) # Generate x-values for simulation
true.a <- 1.5 # Intercept for linear relationship
true.m <- 5 # Slope for linear relationship
dfs <- c(1,2,3,4,6,8,10,15,20,25) # Degrees of freedom
N <- 1000 # Reps in for-loop
out2 <- array(NA, dim=c(N, length(dfs), 2))
dimnames(out2) <- list(NULL, dfs, c("Intercept", "Slope"))
for(j in 1:length(dfs)) {
  df.tdist <- dfs[j]
  for(i in 1:N) {
    y <- true.a + true.m * x + 25*rt(15, df.tdist)
    fit <- lm(y ~ x)
    out2[ ] <- ?????????????
    # The output array 'out2' will consist of two "slices", one with intercepts
    # and one with slopes. The length of each slice is 1000 rows, and the
    # width of each slice is 10 columns.
  }
}
Thanks greatly in advance for your feedback.
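A minimal sketch of one way the missing assignment inside the inner loop could look, assuming exactly the 1000 x 10 x 2 layout described above (row i of column j in each named slice receives the coefficients from coef(fit)):
out2[i, j, "Intercept"] <- coef(fit)[1] # first coefficient of lm(y ~ x) is the intercept
out2[i, j, "Slope"] <- coef(fit)[2] # second coefficient is the slope on x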

Create a matrix out of remaining data from a random row selection of a matrix, and use data to calculate RMSE in R

I have a matrix A with 42 rows and 2 columns. I then have a function that randomly selects 12 of these rows, runs a linear regression on the randomly selected rows, and outputs the coefficients (slope and intercept) of that regression.
In R, I then want to get the other 30 rows from the original matrix that were not selected by my random function, and use that data together with my newly calculated coefficients to generate predicted y-values. So I will have 30 y-values, and from there I would like to calculate the RMSE (http://upload.wikimedia.org/math/e/f/b/efb7882a7dbfa5fe48d771565d2675f3.png) using the new y-values and one of the columns in my new 30-row matrix.
The code below is what I currently have right now:
#Calibration Equation 1 (TC OFF)
A <- matrix(c(Box.CR, Box.DC.ww), nrow=42)
randco <- function(A) {
  B <- A[sample(42,12),]
  lm(B[,2] ~ B[,1])$coefficients
}
Z <- t(replicate(10000, randco(A)))
arows <- apply(A, 1, paste, collapse="_")
brows <- apply(B, 1, paste, collapse="_")
A[-match(brows, arows), ]
Alternative method, converting matrix to data.table
(not recommended if your sole purpose is what's described above)
library(data.table)
A <- as.data.table(A)
B <- A[sample(nrow(A), 12)]
setkey(A)
setkey(B)
A[!B]
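Neither snippet above computes the RMSE itself, so here is a minimal sketch of the full split/predict/RMSE step described in the question. The names idx, B, C, y.hat and rmse are only illustrative, and A is assumed to still be the 42 x 2 matrix defined earlier (not the data.table version):
idx <- sample(nrow(A), 12) # rows used for the calibration regression
B <- A[idx, ] # the 12 selected rows
C <- A[-idx, ] # the remaining 30 rows
co <- lm(B[, 2] ~ B[, 1])$coefficients # same model form as randco()
y.hat <- co[1] + co[2] * C[, 1] # predicted y for the 30 held-out rows
rmse <- sqrt(mean((C[, 2] - y.hat)^2)) # RMSE against the observed second column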

How can I perform a pairwise t.test in R across multiple independent vectors?

TL;DR edition
I have vectors X1,X2,X3,...Xn. I want to test to see whether the average value for any one vector is significantly different than the average value for any other vector, for every possible combination of vectors. I am seeking a better way to do this in R than running n^2 individual t.tests.
Full Story
I have a data frame full of census data for a particular CSA. Each row contains observations for each variable (column) for a particular census tract.
What I need to do is compare means for the same variable across census tracts in different MSAs. In other words, I want to factor my data.frame according to the MSA designation variable (which is one of the columns) and then compare the differences in the means for another variable of interest pairwise across each newly-factored MSA. This is essentially doing pairwise t.tests across each ensuing vector, but I wish to do this in a more elegant way than writing t.test(MSAx, MSAy) over and over again. How can I do this?
The advantage of my method below over the one proposed by @ashkan is that mine removes duplicates (i.e. either X1 vs X2 or X2 vs X1 will appear in the results, not both).
# Generate dummy data
df <- data.frame(matrix(rnorm(100), ncol = 10))
colnames(df) <- paste0("X", 1:10)
# Create combinations of the variables
combinations <- combn(colnames(df),2, simplify = FALSE)
# Do the t.test
results <- lapply(seq_along(combinations), function(n) {
  df <- df[, colnames(df) %in% unlist(combinations[n])]
  result <- t.test(df[, 1], df[, 2])
  return(result)
})
# Rename list for legibility
names(results) <- paste(matrix(unlist(combinations), ncol = 2, byrow = TRUE)[,1], matrix(unlist(combinations), ncol = 2, byrow = TRUE)[,2], sep = " vs. ")
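A small follow-up sketch, assuming the results list built above: since each element is an htest object, the p-values can be pulled out and filtered in one step.
p.values <- sapply(results, function(r) r$p.value) # named vector, one p-value per pair
p.values[p.values < 0.05] # pairs whose means differ at the 5% level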
Just use pairwise.t.test; here is an example:
x1 <- rnorm(50)
x2 <- rnorm(30, mean=0.2)
x3 <- rnorm(100,mean=0.1)
x4 <- rnorm(100,mean=0.4)
x <- data.frame(data=c(x1,x2,x3,x4),
key=c(
rep("x1", length(x1)),
rep("x2", length(x2)),
rep("x3", length(x3)),
rep("x4", length(x4))) )
pairwise.t.test(x$data,
x$key,
pool.sd=FALSE)
# Pairwise comparisons using t tests with non-pooled SD
#
# data: x$data and x$key
#
# x1 x2 x3
# x2 0.7395 - -
# x3 0.9633 0.9633 -
# x4 0.0067 0.9633 0.0121
#
# P value adjustment method: holm
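The correction is controlled by the p.adjust.method argument of pairwise.t.test, so switching to, say, a Bonferroni correction or to raw p-values is a one-argument change:
pairwise.t.test(x$data, x$key, pool.sd=FALSE, p.adjust.method="bonferroni") # Bonferroni-adjusted p-values
pairwise.t.test(x$data, x$key, pool.sd=FALSE, p.adjust.method="none") # unadjusted p-values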
If you have a data.frame and you wish to independently perform T-tests between each column of the data.frame, you can use a double apply loop:
apply(MSA, 2, function(x1) {
  apply(MSA, 2, function(x2) {
    t.test(x1, x2)
  })
})
A good visualization to accompany such a brute force approach would be a forest plot:
cis <- apply(MSA, 2, function(x) mean(x) + c(-1, 1) * 1.96 * sd(x) / sqrt(length(x))) # 95% CI of each column mean
plot.new()
plot.window(xlim=c(1, ncol(cis)), ylim=range(cis))
segments(1:ncol(cis), cis[1, ], 1:ncol(cis), cis[2, ])
axis(1, at=1:ncol(cis), labels=colnames(MSA))
axis(2)
box()
abline(h=mean(MSA), lty='dashed')
title('Forest plot of 95% confidence intervals of MSA')
In addition to the response from quarzgar, there is another method to perform pairwise t-tests across multiple factors in R. Basically it is a trick for the two (or more) factors: create a combination of the factor levels and run the post-hoc test on that.
Example with a 2x2 classical design:
df <- data.frame(Id=c(rep(1:100,2),rep(101:200,2)),
dv=c(rnorm(100,10,5),rnorm(100,20,7),rnorm(100,11,5),rnorm(100,12,6)),
Group=c(rep("Experimental",200),rep("Control",200)),
Condition=rep(c(rep("Pre",100),rep("Post",100)),2))
#ANOVA
summary(aov(dv~Group*Condition+Error(Id/Condition),data = df))
#post-hoc across all factors
df$posthoclevels <- paste(df$Group,df$Condition) #factor combination
pairwise.t.test(df$dv,df$posthoclevels)
# Pairwise comparisons using t tests with pooled SD
#
# data: df$dv and df$posthoclevels
#
# Control Post Control Pre Experimental Post
# Control Pre 0.60 - -
# Experimental Post <2e-16 <2e-16 -
# Experimental Pre 0.26 0.47 <2e-16
#
# P value adjustment method: holm

Computing linear regressions for every possible permutation of matrix columns

I have a (k x n) matrix. Initially I managed to linearly regress (using the lm function) column 1 on each and every other column and to extract only the coefficients.
fore.choose <- matrix(0, 1, NCOL(assets))
for(i in seq(1, NCOL(assets), 1))
{
abc <- lm(assets[,1]~assets[,i])$coefficients
fore.choose[1,i] <- abc[2:length(abc)]
}
The coefficients are placed in the fore.choose matrix.
What I now need to do is to linearly regress column 2 with each and every other column, and then column 3 and so on and so forth and extract only the coefficients.
The output will be a square matrix of OLS univariate coefficients. Kind of similar to a correlation matrix, but it is the beta coefficients I am interested in.
fore.choose <- matrix(0, 1, NCOL(assets))
will initially need to become
fore.choose <- matrix(0, NCOL(assets), NCOL(assets))
I'd just compute the coefficients directly from the correlation matrix, using the fact that the slope from regressing x on y is beta = cor(x,y)*sd(x)/sd(y), like this:
# set up some sample data
set.seed(1)
d <- matrix(rnorm(50), ncol=5)
# get the coefficients
s <- apply(d, 2, sd)
cor(d)*outer(s, s, "/")
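A quick sanity check (just a sketch on the simulated d above): the [1, 2] entry of that matrix should match the slope from lm() when column 1 is regressed on column 2.
betas <- cor(d) * outer(s, s, "/") # same computation as above, stored
betas[1, 2] # slope of d[,1] ~ d[,2] via the correlation shortcut
coef(lm(d[, 1] ~ d[, 2]))[2] # should agree with the lm() slope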
You could also use lsfit to get the coefficients of one term on all the others at once and then only have one loop to do:
sapply(1:ncol(d), function(i) {
  coef(lsfit(d[,i], d))[2,]
})
I'm sure there must be a more elegant way than nested loops.
fore.choose <- matrix(NA, NCOL(assets), NCOL(assets))
abc <- NULL
for(i in seq_len(ncol(assets))){ # loop over "dependent" columns
  for(j in seq_len(ncol(assets))){ # loop over "independent" columns
    abc <- lm(assets[,i]~assets[,j])$coefficients
    fore.choose[i,j] <- abc[-1]
  }
}
