Displaying RMSE in summary when running multiple univariate linear regressions - r

I wrote a function to run univariate linear regressions for multiple variables at a time. However, I noticed that the RMSE is missing from the summary table. How do I also display the RMSE for each of these regressions?
Here is my code and here is what my output looks like:
my.data <- read.csv("filename.csv", header=TRUE)
variables <-names(my.data[1:30])
my.list <- lapply(variables, function(var){formula <- as.formula(paste("gene ~", var))
res.linear <- lm(formula, data = my.data)
summary(res.linear)
})
lapply(my.list, coefficients)
[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.367075060 4.46417498 5.2343547 3.017975e-06
variable1 0.008312962 0.04747918 0.1750865 8.616917e-01
[[2]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.347246142 1.48314397 3.605345 0.0006984638
variable2 0.008342116 0.01577413 0.528848 0.5991611451

We can extract the residuals from the summary output, take the mean of their squares, take the square root, and cbind the result with the extracted coefficients:
my.list <- lapply(variables, function(var){
formula <- as.formula(paste("gene ~", var))
res.linear <- lm(formula, data = my.data)
smry <- summary(res.linear)
RMSE <- sqrt(mean(smry$residuals^2))
cbind(coef(smry), RMSE = RMSE)
})
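As a quick check on what this RMSE measures, here is a minimal sketch on the built-in mtcars data (the dataset and the variable names are only illustrative, not from the question). It assumes an ordinary unweighted lm fit and compares the RMSE with the residual standard error that summary() already reports as sigma; the two differ only in the divisor (n versus the residual degrees of freedom).

```r
# Illustrative fit on a built-in dataset (not the asker's data)
fit  <- lm(mpg ~ wt, data = mtcars)
smry <- summary(fit)

# RMSE as computed above: root of the mean squared residual (divides by n)
rmse <- sqrt(mean(smry$residuals^2))

# Residual standard error reported by summary(): divides by n - p instead
rse <- smry$sigma

n <- nobs(fit)          # number of observations
p <- length(coef(fit))  # number of estimated coefficients

# The two quantities are linked by the ratio of the divisors
rse_from_rmse <- rmse * sqrt(n / (n - p))
```

If the residual standard error is enough for your purposes, smry$sigma can be bound to the coefficient table in the same way as RMSE.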

Related

How to find intercept coefficients after setting restrictions with linearHypothesis

So I built my logistic regression model using glm(). When I display the summary I get this with values for each variable.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
After setting a restriction with linearHypothesis, I only get results for the chi-squared statistic, the residual df, and the df. Is there a way to see the coefficients of the model with the restriction applied?
library(car)
library(tidyverse)
library(dplyr)
logitmodel <- glm(x~ratio+x+y, family=binomial(link="logit"))
nullhypothesis <- "x=0"
restrictedmodel <- linearHypothesis(logitmodel, nullhypothesis)
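linearHypothesis returns a test table rather than a refitted model, so it has no restricted coefficients to show. For a simple exclusion restriction (one coefficient set to zero), one possible approach is to refit the model without that term. The sketch below uses simulated data and hypothetical variable names (y, ratio, z); it is an illustration of the idea under those assumptions, not the asker's model, and the comparison it prints is a likelihood-ratio test rather than the Wald test that linearHypothesis computes.

```r
set.seed(42)

# Simulated data; all names here are hypothetical
n   <- 200
dat <- data.frame(ratio = rnorm(n), z = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.5 * dat$ratio))

# Unrestricted logistic regression
full <- glm(y ~ ratio + z, data = dat, family = binomial(link = "logit"))

# Impose the restriction "coefficient of z = 0" by dropping z and refitting
restricted <- update(full, . ~ . - z)

# Coefficients of the restricted model
coef(restricted)

# Likelihood-ratio comparison of the restricted and unrestricted fits
anova(restricted, full, test = "Chisq")
```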

R: Why does lapply() double up my results?

I'm writing a function for getting diagnostics and test error from a series of linear regression models.
My input is a list of lists. Each list carries the information for its own model.
model.1 <- list("medv","~.","Boston_Ready")
names(model.1) <- c("response", "input","dataset")
model.2 <- list("medv","~lstat","Boston_Ready")
names(model.2) <- c("response", "input","dataset")
models <- list(model.1,model.2)
My function calculates regression diagnostics when given one list that has the dataframe, response variable and inputs.
TestError <- function(model){
library('boot')
df <- get(model$dataset)
formula <- paste(model$response,model$input)
response <- model$response
##Diagnostics
fit <- lm(formula,data=df)
fit_summ <- summary(fit)
F_Stat <- fit_summ$fstatistic[1]
Adj_R_Sq <- fit_summ$adj.r.squared
RSS <- with(fit_summ, df[2] * sigma^2)
AIC <- AIC(fit)
BIC <- BIC(fit)
##Cross-Validation
#5-fold cross validation
glm.fit <- glm(formula, data=df)
cv.err <- cv.glm(df, glm.fit, K=5)
Five.Fold_MSE <- cv.err$delta[1]
#10-fold cross validation
glm.fit <- glm(formula, data=df)
cv.err <- cv.glm(df, glm.fit, K=10)
Ten.Fold_MSE <- cv.err$delta[1]
#LOOCV
glm.fit <- glm(formula, data=df)
cv.err <- cv.glm(df, glm.fit)
LOOCV_MSE <- cv.err$delta[1]
#Output
label <- c("lm","formula =",paste(model$response,model$input), "data= ",model$dataset)
print(paste(label))
Results <- (c(LOOCV_MSE,Five.Fold_MSE,Ten.Fold_MSE,F_Stat,Adj_R_Sq, RSS, AIC, BIC))
names(Results) <- c("LOOCV MSE", "5-Fold MSE", "10-Fold MSE","F-Stat","Adjusted R^2","RSS","AIC","BIC")
print(Results)
}
For some reason, the output appears twice:
> lapply(models,TestError)
[1] "lm" "formula =" "medv ~." "data= " "Boston_Ready"
LOOCV MSE 5-Fold MSE 10-Fold MSE F-Stat Adjusted R^2 RSS AIC BIC
0.3250332 0.3288020 0.3251508 114.3744328 0.6918372 152.5405737 853.2181335 903.9365735
[1] "lm" "formula =" "medv ~lstat" "data= " "Boston_Ready"
LOOCV MSE 5-Fold MSE 10-Fold MSE F-Stat Adjusted R^2 RSS AIC BIC
0.4597660 0.4622565 0.4593045 601.6178711 0.5432418 230.2061197 1043.4596316 1056.1392416
[[1]]
LOOCV MSE 5-Fold MSE 10-Fold MSE F-Stat Adjusted R^2 RSS AIC BIC
0.3250332 0.3288020 0.3251508 114.3744328 0.6918372 152.5405737 853.2181335 903.9365735
[[2]]
LOOCV MSE 5-Fold MSE 10-Fold MSE F-Stat Adjusted R^2 RSS AIC BIC
0.4597660 0.4622565 0.4593045 601.6178711 0.5432418 230.2061197 1043.4596316 1056.1392416
Is that due to a quirk with lapply()?
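It is not a quirk of lapply(). The last statement in your function is print(Results), which prints the vector and also returns it invisibly, so it becomes the function's return value. lapply() collects those return values into a list, and because the call is made at the top level without being assigned, R then prints that list as well. Assign the result, or wrap the call in invisible(), to see the output only once.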
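A minimal sketch of the same effect with a toy function (the names noisy and out are made up for illustration): print() hands its argument back invisibly, so a function that ends in print() still returns a value, and lapply() gathers those values into a list.

```r
# Toy function that ends in print(), like TestError above
noisy <- function(x) {
  res <- x^2
  print(res)  # prints res, and print() also returns res invisibly
}

# Typing lapply(1:2, noisy) at the console would show the printed values
# and then the auto-printed list. Either of the following avoids the
# second copy:

out <- lapply(1:2, noisy)      # assignment suppresses auto-printing
invisible(lapply(1:2, noisy))  # or discard the returned list explicitly
```

Another option is to drop the print() calls from the function, return the results, and let the caller decide what to print.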

Extracting Coefficients, Std Errors, R2 etc from multiple regressions

I have the following regression model;
models <- lapply(1:25, function(x) lm(Y_df[,x] ~ X1))
Which runs 25 regressions on 25 columns in the Y_df dataframe.
One of the outputs can be shown as;
models[15] # Gives me the coefficients for model 15
Call:
lm(formula = Y_df[, x] ~ X1)
Coefficients:
(Intercept) X1
0.1296812 1.0585835
I can store these in a separate df. The problem I am running into concerns the Std. Error, R2, residuals, etc.
I would like to store these also into a separate dataframe.
I can run individual regressions and extract the summaries as a normal R regression output would look like.
ls_1 <- summary(models[[1]])
ls_1
ls_1$sigma
However I am hoping to take the values directly from the line of code which runs the 25 regressions.
This code works
> (models[[15]]$coefficients)
(Intercept) X1
-0.3643446787 1.0789369642
However, this code does not:
> (models[[15]]$sigma)
NULL
I have tried a variety of different combinations to try and extract these results with no luck.
The following did exactly what I wanted for the coefficients. I had hoped I could simply replace coef with something like Std Error or R2, but that does not work.
models <- lapply(1:25, function(x) lm(Y_df[,x] ~ X1))
# extract just coefficients
coefficients <- sapply(models, coef)
Ideally I would like to store the Std Error from the above model
If a model is named mod, you can get to all of the residuals in the same way as the coefficients:
mod$residuals
There are also functions that extract the coefficients and residuals:
coef(mod)
resid(mod)
You can extract the other outputs via summary():
summary(mod)$coef[,"Std. Error"] # standard errors
summary(mod)$r.squared # r squared
summary(mod)$adj.r.squared # adjusted r squared
So you can either create a list containing each of these results for each model:
outputList <- lapply(models, function(mod){
coefs <- coef(mod)
stdErr <- summary(mod)$coef[,"Std. Error"]
rsq <- summary(mod)$r.squared
rsq_adj <- summary(mod)$adj.r.squared
rsd <- resid(mod)
list(coefs = coefs,
stdErr = stdErr,
rsq = rsq,
rsq_adj = rsq_adj,
rsd = rsd)
})
You can then get to the rsq for the first model via outputList[[1]]$rsq, for example (or via outputList$mod1$rsq if the list of models is named).
Or you can create separate dataframes for each:
library(tidyverse)
# coefficients
coefs <- lapply(models, coef) %>%
do.call(rbind, .) %>%
as.data.frame() %>% # convert from matrix to dataframe
rownames_to_column("model") # add original model name as a column in the dataframe
# standard errors
stdErr <- lapply(models, function(mod){
summary(mod)$coef[,"Std. Error"]
}) %>%
do.call(rbind, .) %>%
as.data.frame() %>%
rownames_to_column("model")
# r squareds
rsq <- sapply(models, function(mod){
summary(mod)$r.squared
}) %>%
as.data.frame() %>%
rownames_to_column("model")
# adjusted r squareds
rsq_adj <- sapply(models, function(mod){
summary(mod)$adj.r.squared
})%>%
as.data.frame() %>%
rownames_to_column("model")
# residuals
rsd <- lapply(models, resid) %>%
do.call(rbind, .) %>%
as.data.frame() %>%
rownames_to_column("model")
It is worth noting that if you're in RStudio and you assign the summary to something (i.e. temp <- summary(mod)), you can type the name of the object followed by "$", and a dropdown appears listing all the components that can be extracted from the summary.
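If one table is more convenient than several, the same pieces can be gathered into a single data frame with base R only. The following is a rough sketch on the built-in mtcars data; the predictors, the response, and the column names of the result are chosen purely for illustration and are not from the question.

```r
# Fit one univariate model per predictor (illustrative variables)
predictors <- c("wt", "hp", "disp")
models <- lapply(predictors, function(v) {
  lm(reformulate(v, response = "mpg"), data = mtcars)
})
names(models) <- predictors

# One row per model: slope, its standard error and p-value, R^2, sigma
stats <- do.call(rbind, lapply(names(models), function(nm) {
  s  <- summary(models[[nm]])
  ct <- coef(s)  # coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
  data.frame(
    model     = nm,
    slope     = ct[2, "Estimate"],
    std_error = ct[2, "Std. Error"],
    p_value   = ct[2, "Pr(>|t|)"],
    r_squared = s$r.squared,
    sigma     = s$sigma
  )
}))

stats
```

Row 2 of the coefficient table is the slope here because each model has exactly one predictor; models with more terms would need one row per term instead.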

lm in for loop no p-values stored? (in R)

I'm trying to fit a regression model to explain donations in dictator games. As I have a lot of variables, I want to automate the process with a 'for' loop. For now I begin with univariate models.
When I print/summarise the fits fit[1:24], only the intercepts and coefficients are displayed. It seems like the p-values are not stored?
predictor<-0
fit<-0
dictatorgame<-mydata$dictatorgame
sumres<-0
pVal<-0
for(i in 1:24) #24 predictor variables stored in column 1-24 in mydata
{
predictor<-mydata[i]
unlist(predictor)
fit[i]<-lm(dictatorgame~unlist(predictor))
}
I tried two different solutions I found here on SO, both of them seeming to think that the objects are atomic:
sumres[i]=summary(fit[i])
pf(sumres[i]$fstatistic[1L], sumres[i]$fstatistic[2L],sumres[i]$fstatistic[3L], lower.tail = FALSE)
and
pVal[i] <- (fit[i])$coefficients[,4]
but I always end up with the error message "$ operator is invalid for atomic vectors".
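In your loop, fit starts out as a numeric vector, and fit[i] <- lm(...) keeps only the first component of each model (the coefficients), which is why nothing else is stored. Store the models in a list and assign with [[i]] instead. I generated some data to run multiple regressions; the first three elements of the output list are shown at the end. Is this what you want?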
dependent <- rnorm(1000)
independent <- matrix(rnorm(10*1000), ncol = 10)
result <- list()
for (i in 1:10){
result[[i]] <- lm(dependent ~ independent[ ,i])
}
lapply(result, function(x) summary(x)$coefficients )
[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02890665 0.03167108 -0.9127144 0.3616132
independent[, i] -0.04605868 0.03138201 -1.4676776 0.1425069
[[2]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03142412 0.03161656 -0.9939134 0.3205060
independent[, i] -0.03874678 0.03251463 -1.1916723 0.2336731
[[3]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03208370 0.03162904 -1.0143749 0.3106497
independent[, i] 0.02089094 0.03189098 0.6550737 0.5125713
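To pull out just the p-values once the models are stored in a list, something along these lines should work. This is a sketch on simulated data like the answer's; the seed and the helper names slope_p and model_p are arbitrary. The second extraction uses the pf() call on fstatistic that the question attempted.

```r
set.seed(123)
dependent   <- rnorm(1000)
independent <- matrix(rnorm(10 * 1000), ncol = 10)

# Store each fitted model as one list element
result <- vector("list", 10)
for (i in seq_len(10)) {
  x <- independent[, i]
  result[[i]] <- lm(dependent ~ x)
}

# p-value of the slope from each coefficient table
slope_p <- sapply(result, function(m) {
  summary(m)$coefficients["x", "Pr(>|t|)"]
})

# p-value of the overall F-test for each model
model_p <- sapply(result, function(m) {
  f <- summary(m)$fstatistic
  unname(pf(f[1L], f[2L], f[3L], lower.tail = FALSE))
})
```

With a single predictor the two vectors agree, since the F statistic is the square of the slope's t statistic.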

Using loops to do Chi-Square Test in R

I am new to R. I found the following code for doing univariate logistic regression for a set of variables. What I would like to do is run a chi-square test for a list of variables against the dependent variable, similar to the logistic regression code below. I found a couple of approaches that involve creating all possible combinations of the variables, but I can't get them to work. Ideally, I want one of the variables (X) to stay the same in every test.
Chi Square Analysis using for loop in R
lapply(c("age","sex","race","service","cancer",
"renal","inf","cpr","sys","heart","prevad",
"type","frac","po2","ph","pco2","bic","cre","loc"),
function(var) {
formula <- as.formula(paste("status ~", var))
res.logist <- glm(formula, data = icu, family = binomial)
summary(res.logist)
})
Are you sure that the strings in the vector you lapply over are in the column names of the icu dataset?
It works for me when I download the icu data:
system("wget http://course1.winona.edu/bdeppa/Biostatistics/Data%20Sets/ICU.TXT")
icu <- read.table('ICU.TXT', header=TRUE)
and change status to STA, which is a column in icu. Here is an example for some of your variables:
my.list <- lapply(c("Age","Sex","Race","Ser","Can"),
function(var) {
formula <- as.formula(paste("STA ~", var))
res.logist <- glm(formula, data = icu, family = binomial)
summary(res.logist)
})
This gives me a list with summary.glm objects. Example:
lapply(my.list, coefficients)
[[1]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.05851323 0.69608124 -4.393903 1.113337e-05
Age 0.02754261 0.01056416 2.607174 9.129303e-03
[[2]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.4271164 0.2273030 -6.2784758 3.419081e-10
Sex 0.1053605 0.3617088 0.2912855 7.708330e-01
[[3]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0500583 0.4983146 -2.1072198 0.03509853
Race -0.2913384 0.4108026 -0.7091933 0.47820450
[[4]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9465961 0.2310559 -4.096827 0.0000418852
Ser -0.9469461 0.3681954 -2.571858 0.0101154495
[[5]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.386294e+00 0.1863390 -7.439638e+00 1.009615e-13
Can 7.523358e-16 0.5892555 1.276756e-15 1.000000e+00
If you want to do a chi-square test:
my.list <- lapply(c("Age","Sex","Race","Ser","Can"),function(var)chisq.test(icu$STA, icu[,var]))
or a chi-square test for all combinations of variables:
my.list.all <- apply(combn(colnames(icu), 2), 2, function(x)chisq.test(icu[,x[1]], icu[,x[2]]))
Does this work?
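To collect the p-values from such a list in one step, here is a small self-contained sketch. It uses simulated categorical data with made-up column names (status, sex, race, service) rather than the ICU file, so the numbers themselves mean nothing; only the pattern carries over.

```r
set.seed(1)
n <- 500

# Simulated categorical data; column names are hypothetical
dat <- data.frame(
  status  = factor(sample(c("alive", "dead"), n, replace = TRUE)),
  sex     = factor(sample(c("f", "m"), n, replace = TRUE)),
  race    = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  service = factor(sample(c("medical", "surgical"), n, replace = TRUE))
)

# Test every other column against the fixed variable "status"
predictors <- setdiff(names(dat), "status")
tests <- lapply(predictors, function(var) chisq.test(dat$status, dat[[var]]))
names(tests) <- predictors

# One p-value per predictor, as a named vector
p_values <- sapply(tests, function(tst) tst$p.value)
p_values
```

Note that chisq.test() treats its inputs as categorical, so a continuous column such as age would usually need to be binned first.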
