How to extract residuals from regressed data in R?

This might be a very easy question, but I need some help with R.
I have expression data on which I have run a linear regression to correct for covariates, and I would like to write the residuals out to a file.
The following is the loop I have:
for (i in 1:n) {
  geneProbe <- z.na[,i]
  lm1 <- lm(geneProbe ~ phenotype + covariate1 + covariate2 + covariate3)
  write.table(lm1$residuals, file="residuals.txt", sep="\t")
}
Of course, when I do the following
write.table(lm1$residuals, file="residuals.txt", sep="\t")
the file is overwritten on every iteration, so I end up with the residuals from only one fit, as follows (residuals.txt):
Res1
-0.00224226
0.005144119
0.011142788
1.90E-05
-0.003698019
I would rather have them for all values of i, i.e. all iterations, in a single file as follows (residuals.txt). In other words, every iteration should add a column:
Res1 Res2 Res3 Res4
-0.00224226 0.009583449 0.000538104 0.012497267
0.005144119 0.015632242 -0.000104554 -0.009199898
0.011142788 -0.012912383 -0.004363051 -0.010270967
1.90E-05 -0.038716093 0.004149837 0.011071139
-0.003698019 0.015219847 -0.002486236 -0.009230721

Save the residuals in a matrix or data frame within your loop and then write the table when finished, e.g.:
resdat <- matrix(NA, nrow(z.na), n)  # one row per observation, one column per probe
for (i in 1:n) {
  geneProbe <- z.na[,i]
  lm1 <- lm(geneProbe ~ phenotype + covariate1 + covariate2 + covariate3)
  resdat[,i] <- lm1$residuals
}
write.table(resdat, file="residuals.txt", sep="\t")
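A loop-free alternative with sapply is sketched below (an illustration only: it assumes the same z.na, phenotype, and covariate objects as above, and that no fit drops observations due to NAs, so all residual vectors have the same length):
resdat <- sapply(seq_len(n), function(i) {
  # fit one model per probe and collect its residuals as a column
  resid(lm(z.na[, i] ~ phenotype + covariate1 + covariate2 + covariate3))
})
colnames(resdat) <- paste0("Res", seq_len(n))
write.table(resdat, file = "residuals.txt", sep = "\t", row.names = FALSE)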

Related

Dummies not included in summary

I want to create a function which will perform panel regression with 3-level dummies included.
Let's consider a within model with time effects:
library(plm)
fit_panel_lr <- function(y, x) {
  x[, length(x) + 1] <- y
  # adding dummies
  mtx <- matrix(0, nrow = nrow(x), ncol = 3)
  mtx[cbind(seq_len(nrow(mtx)), 1 + (as.integer(unlist(x[, 2])) - min(as.integer(unlist(x[, 2])))) %% 3)] <- 1
  colnames(mtx) <- paste0("dummy_", 1:3)
  # converting to pdata.frame and adding dummy variables
  x <- pdata.frame(x)
  x <- cbind(x, mtx)
  # performing panel regression
  varnames <- names(x)[3:(length(x))]
  varnames <- varnames[!(varnames == names(y))]
  form <- paste0(varnames, collapse = "+")
  x_copy <- data.frame(x)
  form <- as.formula(paste0(names(y), "~", form, '-1'))
  params <- list(
    formula = form, data = x_copy, model = "within",
    effect = "time"
  )
  pglm_env <- list2env(params, envir = new.env())
  model_plm <- do.call("plm", params, envir = pglm_env)
  model_plm
}
However, if I use the data:
data("EmplUK", package="plm")
dep_var<-EmplUK['capital']
df1<-EmplUK[-6]
then in the output I get:
> fit_panel_lr(dep_var, df1)
Model Formula: capital ~ sector + emp + wage + output + dummy_1 + dummy_2 +
dummy_3 - 1
<environment: 0x000001ff7d92a3c8>
Coefficients:
sector emp wage output
-0.055179 0.328922 0.102250 -0.002912
How come the dummies appear in the formula but not among the coefficients? Is there a rational explanation, or did I do something wrong?
One reason you do not see the dummies in the output is that they are linearly dependent on the other regressors after the fixed-effect (time) transformation. They are dropped, so only what is estimable is estimated and reported.
Find below some (not readily executable) code picking up your example from above:
dat <- cbind(EmplUK, mtx) # mtx being the dummy matrix constructed in your question's code for this data set
pdat <- pdata.frame(dat)
rhs <- paste(c("emp", "wage", "output", "dummy_1", "dummy_2", "dummy_3"), collapse = "+")
form <- paste("capital ~" , rhs)
form <- formula(form)
mod <- plm(form, data = pdat, model = "within", effect = "time")
detect.lindep(mod$model) # before FE time transformation (original data) -> nothing offending
detect.lindep(model.matrix(mod)) # after FE time transformation -> dummies are offending
The help page for detect.lindep (?detect.lindep, included in package plm) has some more nice examples of linear dependence before and after the FE transformation.
A suggestion:
For constructing the dummy variables, I suggest using R's factor with three levels rather than building the dummy matrix yourself. Using a factor is typically more convenient and less error prone. It is converted to binary (treatment-style) dummies by the usual estimation functions via the model.frame/model.matrix framework.
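A minimal sketch of the factor approach (an illustration only: it assumes, mirroring the question's code, a 3-level grouping derived from the year column; in a within model with time effects such a factor is constant within each period and will still be dropped as collinear):
library(plm)
data("EmplUK", package = "plm")
# build a 3-level factor instead of a hand-made dummy matrix
EmplUK$grp <- factor(1 + (as.integer(EmplUK$year) - min(as.integer(EmplUK$year))) %% 3,
                     labels = paste0("level_", 1:3))
pdat <- pdata.frame(EmplUK)
mod <- plm(capital ~ emp + wage + output + grp,
           data = pdat, model = "within", effect = "time")
summary(mod)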

Is there a function for substituting (or removing) explanatory variables in a linear model (lm)?

I have a linear model with many explanatory (independent) variables,
model <- lm(y ~ x1 + x2 + x3 + ... + x100)
some of which are linearly dependent on each other (multicollinearity).
I want the code to find the name of the explanatory variable with the highest VIF (x2, for example), delete it from the formula, and then rerun lm with the new formula:
model <- lm(y ~ x1 + x3 + ... + x100)
I have already worked out how to retrieve the name of the explanatory variable with the highest VIF:
max_vif <- function(x) {
  vifac <- data.frame(vif(x))  # vif() here is assumed to come from the car package
  nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
  return(nameofmax)
}
But I still don't understand how to search the needed explaining variable, delete it from the formula and run the function again.
We can use the update function and paste in the variable that needs to be removed. We first fit a model, then use update to change that model's formula. The update formula can be expressed as a character string, which lets you concatenate the general formula .~. with whatever variable(s) you'd like removed (using the minus sign -).
Here is an example:
fit1 <- lm(wt ~ mpg + cyl + am, data = mtcars)
coef(fit1)
# (Intercept) mpg cyl am
# 4.83597190 -0.09470611 0.08015745 -0.52182463
rm_var <- "am"
fit2 <- update(fit1, paste0(".~. - ", rm_var))
coef(fit2)
# (Intercept) mpg cyl
# 5.07595833 -0.11908115 0.08625557
Using max_vif we can wrap this into a function:
rm_max_vif <- function(x){
  # find variable(s) needing to be removed
  rm_var <- max_vif(x)
  # concatenate with "-" to remove variable(s) from the formula
  rm_var <- paste(paste0("-", rm_var), collapse = " ")
  # update the model
  update(x, paste0(".~.", rm_var))
}
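A quick usage sketch (hypothetical example on mtcars; it assumes the car package is loaded so that vif() is available to max_vif):
library(car)
fit <- lm(wt ~ mpg + cyl + disp + hp, data = mtcars)
fit_reduced <- rm_max_vif(fit)  # refits the model without the regressor that has the largest VIF
coef(fit_reduced)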
Problem solved!
I created a list containing all the variables for the lm model (the elements need to be named, or Price should be a data frame, so that the formula below can find them):
Price <- list(y,x1,...,x100)
Then I used a different way of specifying the lm model:
model <- lm(y ~ ., data = Price)
So we can just delete the variable with the highest VIF from the Price list.
With the function I already came up with, the code is:
Price <- list(y,x1,x2,...,x100)
model <- lm(y ~ ., data = Price)
max_vif <- function(x) { # function returning the name of the variable with the highest VIF
  vifac <- data.frame(vif(x))
  nameofmax <- rownames(which(vifac == max(vifac), arr.ind = TRUE))
  return(nameofmax)
}
m <- max_vif(model)
n <- max(data.frame(vif(model)))
while (n >= 5) { # drop the variable with the highest VIF, one at a time, until no VIF is 5 or higher
  Price[[m]] <- NULL
  model <- lm(y ~ ., data = Price)
  m <- max_vif(model)
  n <- max(data.frame(vif(model)))
}

Writing function to identify confounding variables

I am trying to model an outcome as a function of several exposures, adjusting the models for any covariates that may be confounders (≥ 10% change in the outcome coefficient when added to the model). I am looking at many covariates as potential confounders, so I have created a dataframe with all of them and am using lapply (the outcome and exposures are in a separate dataframe which has already been attached). To make sorting through all my outputs easier, I have tried to write a function which only displays the output if the covariate is a confounder. The exposures and the number of them differ in each model, so I find myself having to write code like the below each time I run my analyses, but I know there must be an easier way. Would there be a function I could write to just lapply with, using the model without confounders and the Covariates dataframe as arguments? Thanks!
lapply(Covariates, function(x) {
  model <- summary(lm(Outcome ~ Exposure1 + Exposure2 + ... + x))
  if ((model$coefficients[2, 1] - summary(lm(Outcome ~ Exposure))$coefficients[2, 1])/
      model$coefficients[2, 1] >= .1)
    return(model)
})
I have written a function to solve this problem!
confounder <- function(model) {
  model.sum <- summary(model)
  model.b <- model.sum$coefficients[2, 1]
  oldmodel <- update(model, . ~ . - x)
  oldmodel.sum <- summary(oldmodel)
  oldmodel.b <- oldmodel.sum$coefficients[2, 1]
  model.frame <- tidy(model.sum)            # tidy() from the broom package
  model.sub <- subset(model.frame, term == "x")
  model.sub.b <- model.sub[, 5]             # the p-value column of the tidy output
  if ((model.b - oldmodel.b) / model.b >= .1 |
      model.sub.b < .05)
    return(model.sum)
}
I then lapply this function to the model:
lapply(Covariates, function(x) {
  confounder(lm(Outcome ~ Exposure1 + Exposure2 + ... + x))
})

Looping regression commands

I'm going to simplify my problem as much as I can, to show that I'm not just throwing my assignment at you. I really want to learn how to get a loop to work with regressions.
Suppose I want to run two OLS regressions, but I don't want to type the same OLS command twice or add another series of commands to my script. This is because (a) I actually have far more than two regressions and (b) I want to code this as efficiently as I can (I have tried copying and pasting the same OLS commands). Also, I'm not just running a simple OLS: I also apply a HAC estimator depending on the serial correlation and heteroskedasticity tests.
The code I have come up with so far is:
Packages
if (!require("lmtest")) install.packages("lmtest")
library("lmtest")
if (!require("sandwich")) install.packages("sandwich")
library("sandwich")
Data
data<-read.csv(file.choose())
x1<-data$x1
x2<-data$x2
x3<-data$x3
x4<-data$x4
x5<-data$x5
x6<-data$x6
x7<-data$x7
y1<-data$y1
Regressions
reg1<-(y1 ~ x1 + x2 + x3 + x4)
reg2<-(y1 ~ x2 + x4 + x6 + x7)
p<-0.05
Loop
for (i in 1:2) {
  #OLS#
  ols[i]<-lm(reg[i])
  #Breusch-Pagan Test#
  bptest(ols[i],varformula = NULL, studentize = TRUE)
  bpp<-bptest(ols[i])$p.value
  if(bpp>p) hs<-F else hs<-T
  #Breusch-Godfrey Serial Correlation Test#
  bgtest(ols[i],order=2,order.by=NULL,type=c("Chisq"))
  bgp<-bgtest(ols[i])$p.value
  if(bgp>p) sc<-F else sc<-T
  #HAC Estimator#
  HAC<-vcovHAC(ols[i],order.by=NULL,prewhite=FALSE,adjust=TRUE,diagnostics=FALSE,sandwich = TRUE,ar.method = "ols")
  if (sc==T|hs==T) coeftest(ols[i],vcov.=HAC) else ols[i]
  if (sc==T|hs==T) write.csv(coeftest(ols[i],vcov.=HAC),file="ols[i]HAC.csv") else write.csv(summary(ols[i])$coefficient,file="ols1.csv")
}
When I run this I get
Error in stats::model.frame(formula = reg[i], drop.unused.levels = TRUE) : object 'reg' not found
I have also tried the above code with
for (i in reg[1]:reg[2]) {
}
but it only returned
Error: object 'reg' not found.
Where did I go wrong?
This is too long for a comment, so I'm posting it as a partial answer.
The only difference between your regressions seems to be the formula, and you are asking for a way to make your code more efficient. One way is to store the formulas in a list and then loop over that list with lapply. For instance:
reg <- list(
reg1 = as.formula(y1 ~ x1 + x2 + x3 + x4),
reg2 = as.formula(y1 ~ x2 + x4 + x6 + x7)
)
ols <- lapply(reg, function(x) lm(x, data=data))
Here, ols is a list of two elements, each of which is a fitted regression corresponding to one formula in the list. You can use the same principle for other functions, for instance:
bgtests <- lapply(ols, function(x)
bgtest(x,order=2,order.by=NULL,type=c("Chisq")))
This runs bgtest on each regression stored in ols. In a similar fashion, you can write it up so that it executes your heteroskedasticity corrections etc. The important point is this: you supply a list to lapply, and each element of that list is passed on to the function that you provide. The output of lapply is then a list with the output of that function.
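For illustration, here is a minimal sketch of that heteroskedasticity/serial-correlation step in the same style (it assumes the ols list from above, the p threshold defined in the question, and that lmtest and sandwich are loaded; the HAC settings are only placeholders):
hac_results <- lapply(ols, function(m) {
  hs <- bptest(m)$p.value <= p                  # heteroskedasticity detected?
  sc <- bgtest(m, order = 2)$p.value <= p       # serial correlation detected?
  if (hs || sc) {
    coeftest(m, vcov. = vcovHAC(m, prewhite = FALSE, adjust = TRUE))
  } else {
    summary(m)$coefficients
  }
})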
In case you don't want to use lapply, and to address your actual question: the problem in your code is that there is no object called reg. Subsetting a non-existing object such as reg[1] hence does not work. If you execute the first lines of my code above, reg[1] and reg[2] become defined, so your loop should work.
The 'get' function is what you want, in conjunction with 'paste'. Below I fit two regressions using the cars data in R, then write a loop that extracts their coefficients. The 'get' function finds the object whose name matches the string you supply.
> (reg1 <- lm(dist ~ speed, data = cars))
Call:
lm(formula = dist ~ speed, data = cars)
Coefficients:
(Intercept) speed
-17.579 3.932
> (reg2 <- lm(dist ~ 1 + I(speed^2), data = cars))
Call:
lm(formula = dist ~ 1 + I(speed^2), data = cars)
Coefficients:
(Intercept) I(speed^2)
8.860 0.129
> coeff <- matrix(0, nrow = 2, ncol = 2)
> for (i in 1:2)
+ {
+
+ # Main step
+ model <- get(paste("reg", i, sep = ""))
+ coeff[i,] <- coefficients(model)
+ }
> coeff
[,1] [,2]
[1,] -17.579095 3.9324088
[2,] 8.860049 0.1289687

How to map predicted values to unique id in dataset?

I have written this R code to reproduce the problem. Here, I have created a unique "ID" column, and I am not sure how to add the predicted column back to the test dataset, mapped to the respective IDs. Please guide me on the right way to do this.
#Code
library(C50)
data(churn)
data=rbind(churnTest,churnTrain)
data$ID<-seq.int(nrow(data)) #adding unique id column
rm(churnTrain)
rm(churnTest)
set.seed(1223)
ind <- sample(2,nrow(data),replace = TRUE, prob = c(0.7,0.3))
train <- data[ind==1,1:21]
test <- data[ind==2, 1:21]
xtrain <- train[,-20]
ytrain <- train$churn
xtest <- test[,-20]
ytest<- test$churn
x <- cbind(xtrain,ytrain)
## C50 Model
c50Model <- C5.0(churn ~
state +
account_length +
area_code +
international_plan +
voice_mail_plan +
number_vmail_messages +
total_day_minutes +
total_day_calls +
total_day_charge +
total_eve_minutes +
total_eve_calls +
total_eve_charge +
total_night_minutes +
total_night_calls +
total_night_charge +
total_intl_minutes +
total_intl_calls +
total_intl_charge +
number_customer_service_calls,data=train, trials=10)
# Evaluate Model
c50Result <- predict(c50Model, xtest)
table(c50Result, ytest)
#adding prediction to test data
testnew = cbind(xtest,c50Result)
#OR predict directly
xtest$churn = predict(c50Model, xtest)
I’d use match(dataID, predictedID) to match the ID columns between the two data sets.
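A minimal sketch of that idea (an illustration only; it assumes the predictions from above are kept next to xtest$ID in a small lookup data frame called pred_df):
pred_df <- data.frame(ID = xtest$ID, c50Result = c50Result)
# look up each row's ID in the prediction table; rows without a prediction get NA
data$predicted_churn <- pred_df$c50Result[match(data$ID, pred_df$ID)]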
In reply to your comment:
If you want to add predicted values to the original dataframe, both ways of merging the data and the predictions are correct and produce identical results. The only thing is, I would use
xtest$churn_hut <- predict(c50Model, xtest)
instead of
xtest$churn <- predict(c50Model, xtest)
because otherwise you are replacing the original churn (as in data$churn) with whatever the model predicted, so you can't compare the two.
