I am new to R. I found the following code for running univariate logistic regressions over a set of variables. What I would like to do is run a chi-square test for each variable in a list against the dependent variable, similar to the logistic regression code below. I found a couple of approaches that involve creating all possible combinations of the variables, but I can't get them to work. Ideally, I want one of the variables (X) to stay the same in every test.
Chi Square Analysis using for loop in R
lapply(c("age","sex","race","service","cancer",
"renal","inf","cpr","sys","heart","prevad",
"type","frac","po2","ph","pco2","bic","cre","loc"),
function(var) {
formula <- as.formula(paste("status ~", var))
res.logist <- glm(formula, data = icu, family = binomial)
summary(res.logist)
})
Are you sure that the strings in the vector you lapply over are in the column names of the icu dataset?
It works for me when I download the icu data:
system("wget http://course1.winona.edu/bdeppa/Biostatistics/Data%20Sets/ICU.TXT")
icu <- read.table('ICU.TXT', header=TRUE)
and change status to STA, which is a column in icu. Here is an example for some of your variables:
my.list <- lapply(c("Age","Sex","Race","Ser","Can"),
function(var) {
formula <- as.formula(paste("STA ~", var))
res.logist <- glm(formula, data = icu, family = binomial)
summary(res.logist)
})
This gives me a list with summary.glm objects. Example:
lapply(my.list, coefficients)
[[1]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.05851323 0.69608124 -4.393903 1.113337e-05
Age 0.02754261 0.01056416 2.607174 9.129303e-03
[[2]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.4271164 0.2273030 -6.2784758 3.419081e-10
Sex 0.1053605 0.3617088 0.2912855 7.708330e-01
[[3]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0500583 0.4983146 -2.1072198 0.03509853
Race -0.2913384 0.4108026 -0.7091933 0.47820450
[[4]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9465961 0.2310559 -4.096827 0.0000418852
Ser -0.9469461 0.3681954 -2.571858 0.0101154495
[[5]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.386294e+00 0.1863390 -7.439638e+00 1.009615e-13
Can 7.523358e-16 0.5892555 1.276756e-15 1.000000e+00
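If only the predictor row of each fit is of interest, the per-variable summaries can be stacked into one table. The following is a minimal sketch, assuming simulated data: the data frame toy and its columns are hypothetical stand-ins for icu, and reformulate() is used in place of as.formula(paste(...)) purely as a matter of taste.

```r
set.seed(11)
# Hypothetical stand-in for the icu data (not the real data set)
toy <- data.frame(
  STA = rbinom(150, 1, 0.3),
  Age = round(runif(150, 20, 90)),
  Sex = rbinom(150, 1, 0.4)
)
vars <- c("Age", "Sex")

# One univariate logistic regression per variable; keep only the predictor's row
slopes <- do.call(rbind, lapply(vars, function(v) {
  fit <- glm(reformulate(v, response = "STA"), data = toy, family = binomial)
  coef(summary(fit))[v, , drop = FALSE]
}))
slopes
```

Each row of slopes is then one predictor, with its estimate, standard error, z value and p-value side by side.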
If you want to do a chi-square test:
my.list <- lapply(c("Age","Sex","Race","Ser","Can"), function(var) chisq.test(icu$STA, icu[, var]))
or a chi-square test for all combinations of variables:
my.list.all <- apply(combn(colnames(icu), 2), 2, function(x) chisq.test(icu[, x[1]], icu[, x[2]]))
Does this work?
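To keep one variable fixed and pull out just the p-values, something along these lines may help. It is only a sketch on simulated data: toy and its columns are hypothetical, and the predictors are all categorical, which is what a chi-square test of independence expects.

```r
set.seed(1)
# Hypothetical stand-in for the icu data: binary outcome, categorical predictors
toy <- data.frame(
  STA  = rbinom(200, 1, 0.3),
  Sex  = rbinom(200, 1, 0.5),
  Race = sample(1:3, 200, replace = TRUE),
  Ser  = rbinom(200, 1, 0.5)
)
vars <- c("Sex", "Race", "Ser")

# The outcome STA stays fixed; each predictor is cross-tabulated against it
tests <- lapply(vars, function(v) chisq.test(table(toy$STA, toy[[v]])))
names(tests) <- vars

# One p-value per predictor, as a named numeric vector
p.values <- sapply(tests, function(tst) tst$p.value)
p.values
```

Naming the list by variable makes it easier to tell afterwards which test belongs to which predictor.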
Related
I wrote a function to run univariate linear regressions for multiple variables at a time. However, in the summary table, I noticed that the RMSE is missing. How do I also display the RMSE to each of these regressions?
Here is my code and here is what my output looks like:
my.data <- read.csv("filename.csv", header=TRUE)
variables <- names(my.data[1:30])
my.list <- lapply(variables, function(var) {
formula <- as.formula(paste("gene ~", var))
res.linear <- lm(formula, data = my.data)
summary(res.linear)
})
lapply(my.list, coefficients)
[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.367075060 4.46417498 5.2343547 3.017975e-06
variable1 0.008312962 0.04747918 0.1750865 8.616917e-01
[[2]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.347246142 1.48314397 3.605345 0.0006984638
variable2 0.008342116 0.01577413 0.528848 0.5991611451
We can extract the residuals from the summary output, square them, take the mean, and then take the square root. The result can be cbind-ed to the extracted coefficients:
my.list <- lapply(variables, function(var){
formula <- as.formula(paste("gene ~", var))
res.linear <- lm(formula, data = my.data)
smry <- summary(res.linear)
RMSE <- sqrt(mean(smry$residuals^2))
cbind(coef(smry), RMSE = RMSE)
})
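Note that this RMSE is not the same number as the "Residual standard error" printed by summary(): the latter divides the residual sum of squares by the residual degrees of freedom rather than by n. A small sketch on simulated data (the data frame d and its columns are made up for illustration) shows how the two relate:

```r
set.seed(42)
# Simulated data, purely for illustration
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)

rmse      <- sqrt(mean(residuals(fit)^2))  # divides by n
sigma_hat <- summary(fit)$sigma            # divides by residual df (n - 2 here)

# The two differ only by the factor sqrt(n / df)
n  <- nrow(d)
df <- df.residual(fit)
all.equal(sigma_hat, rmse * sqrt(n / df))
```

Which of the two to report depends on the convention you want to follow; if the residual standard error is what you need, smry$sigma can be bound to the table instead.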
So I built my logistic regression model using glm(). When I display the summary I get this with values for each variable.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Afterwards I set a restriction using linearHypothesis(), but I only get results for the chi-squared statistic, the residual df, and the df. Is there a way to see the coefficients of the model with the restriction imposed?
library(car)
library(tidyverse)
library(dplyr)
logitmodel <- glm(x~ratio+x+y, family=binomial(link="logit"))
nullhypothesis <- "x=0"
restrictedmodel <- linearHypothesis(logitmodel, nullhypothesis)
I'm trying to fit a regression model to explain donations in dictator games. As I have a lot of variables, I want to automate the process with a 'for' loop. For now I begin with univariate models.
When I print/summarise the fits fit[1:24], only the intercepts and coefficients are displayed. It seems like the p-values are not stored?
predictor<-0
fit<-0
dictatorgame<-mydata$dictatorgame
sumres<-0
pVal<-0
for(i in 1:24) #24 predictor variables stored in column 1-24 in mydata
{
predictor<-mydata[i]
unlist(predictor)
fit[i]<-lm(dictatorgame~unlist(predictor))
}
I tried two different solutions I found here on SO, both of them seeming to think that the objects are atomic:
sumres[i]=summary(fit[i])
pf(sumres[i]$fstatistic[1L], sumres[i]$fstatistic[2L],sumres[i]$fstatistic[3L], lower.tail = FALSE)
and
pVal[i] <- (fit[i])$coefficients[,4]
but always end up getting error messages $ operator is invalid for atomic vectors.
I generated some data to perform multiple regressions. At the end you can find the first three elements of the output list. Is it what you want?
dependent <- rnorm(1000)
independent <- matrix(rnorm(10*1000), ncol = 10)
result <- list()
for (i in 1:10){
result[[i]] <- lm(dependent ~ independent[ ,i])
}
lapply(result, function(x) summary(x)$coefficients )
[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02890665 0.03167108 -0.9127144 0.3616132
independent[, i] -0.04605868 0.03138201 -1.4676776 0.1425069
[[2]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03142412 0.03161656 -0.9939134 0.3205060
independent[, i] -0.03874678 0.03251463 -1.1916723 0.2336731
[[3]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03208370 0.03162904 -1.0143749 0.3106497
independent[, i] 0.02089094 0.03189098 0.6550737 0.5125713
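The key difference from the loop in the question is the list with double brackets: result[[i]] stores the whole lm object, whereas fit[i] <- lm(...) on a numeric vector most likely keeps only the first component (the coefficients), which would explain the "atomic vector" errors. As a sketch on smaller simulated data, the slope p-values can then be pulled out in one step:

```r
set.seed(7)
dependent   <- rnorm(100)
independent <- matrix(rnorm(100 * 3), ncol = 3)

# Store the complete model objects in a list
fits <- lapply(seq_len(ncol(independent)), function(i) {
  lm(dependent ~ independent[, i])
})

# Row 2 of each coefficient table is the predictor; pick its p-value by column name
slope.p <- sapply(fits, function(f) coef(summary(f))[2, "Pr(>|t|)"])
slope.p
```

Indexing the column by name ("Pr(>|t|)") rather than by position makes the intent a little clearer when reading the code later.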
I want to grab the Standard Error column when I do summary on a linear regression model. The output is below:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.436954 0.616937 -13.676 < 2e-16 ***
x1 -0.138902 0.024247 -5.729 1.01e-08 ***
x2 0.005978 0.009142 0.654 0.51316
...
I just want the Std. Error column values stored into a vector. How would I go about doing so? I tried model$coefficients[,2] but that keeps giving me extra values. If anyone could help that would be great.
Say fit is the linear model, then summary(fit)$coefficients[,2] has the standard errors. Type ?summary.lm.
fit <- lm(y~x, myData)
summary(fit)$coefficients[,1] # the coefficients
summary(fit)$coefficients[,2] # the std. error in the coefficients
summary(fit)$coefficients[,3] # the t-values
summary(fit)$coefficients[,4] # the p-values
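The columns can also be selected by name, and the standard errors can be recomputed from the variance-covariance matrix as a cross-check. A minimal sketch, assuming a simulated myData with columns x and y (both made up here):

```r
set.seed(3)
# Simulated data standing in for myData
myData <- data.frame(x = rnorm(30))
myData$y <- 2 * myData$x + rnorm(30)

fit <- lm(y ~ x, data = myData)

# Standard errors by column name instead of column number
se.by.name <- coef(summary(fit))[, "Std. Error"]

# The same values from the variance-covariance matrix of the estimates
se.from.vcov <- sqrt(diag(vcov(fit)))

all.equal(se.by.name, se.from.vcov)
```

Both results are named numeric vectors with one entry per coefficient.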
In R, when using lm(), if I set na.action = na.pass inside the call to lm(), then in the summary table there is an NA for any coefficient that cannot be estimated (because of missing cells in this case).
If, however, I extract just the coefficients from the summary object, using either summary(myModel)$coefficients or coef(summary(myModel)), then the NA's are omitted.
I want the NA's to be included when I extract the coefficients the same way that they are included when I print the summary. Is there a way to do this?
Setting options(na.action = na.pass) does not seem to help.
Here is an example:
> set.seed(534)
> myGroup1 <- factor(c("a","a","a","a","b","b"))
> myGroup2 <- factor(c("first","second","first","second","first","first"))
> myDepVar <- rnorm(6, 0, 1)
> myModel <- lm(myDepVar ~ myGroup1 + myGroup2 + myGroup1:myGroup2)
> summary(myModel)
Call:
lm(formula = myDepVar ~ myGroup1 + myGroup2 + myGroup1:myGroup2)
Residuals:
1 2 3 4 5 6
-0.05813 0.55323 0.05813 -0.55323 -0.12192 0.12192
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.15150 0.23249 -0.652 0.561
myGroup11 0.03927 0.23249 0.169 0.877
myGroup21 -0.37273 0.23249 -1.603 0.207
myGroup11:myGroup21 NA NA NA NA
Residual standard error: 0.465 on 3 degrees of freedom
Multiple R-squared: 0.5605, Adjusted R-squared: 0.2675
F-statistic: 1.913 on 2 and 3 DF, p-value: 0.2914
> coef(summary(myModel))
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.15149826 0.2324894 -0.6516352 0.5611052
myGroup11 0.03926774 0.2324894 0.1689012 0.8766203
myGroup21 -0.37273117 0.2324894 -1.6032180 0.2072173
> summary(myModel)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.15149826 0.2324894 -0.6516352 0.5611052
myGroup11 0.03926774 0.2324894 0.1689012 0.8766203
myGroup21 -0.37273117 0.2324894 -1.6032180 0.2072173
Why don't you just extract the coefficients from the fitted model:
> coef(myModel)
(Intercept) myGroup1b
-0.48496169 -0.07853547
myGroup2second myGroup1b:myGroup2second
0.74546233 NA
That seems the easiest option.
na.action has nothing to do with this. Note that you didn't pass na.action = na.pass in your example.
na.action is a global option for handling NA in the data passed to a model fit, usually in conjunction with a formula; it is also the name of a function, na.action(). R builds up the so-called model frame from the data argument and the symbolic representation of the model expressed in the formula. At this point any NA is detected, and the default for na.action is to use na.omit(), which removes NA from the data by dropping every sample that has an NA in any variable. There are alternatives, most usefully na.exclude(), which removes NA during fitting but adds NA back in the correct places in the fitted values, residuals etc. Read ?na.omit and ?na.action for more, plus ?options for the global setting.
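The difference between na.omit and na.exclude is easiest to see on a tiny example. This sketch uses a made-up data frame d with one missing response:

```r
# Made-up data with a missing response in row 2
d <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                y = c(1.1, NA, 2.9, 4.2, 5.1, 5.8))

fit.omit <- lm(y ~ x, data = d, na.action = na.omit)
fit.excl <- lm(y ~ x, data = d, na.action = na.exclude)

length(residuals(fit.omit))  # 5: the incomplete row is simply gone
length(residuals(fit.excl))  # 6: padded back to the original length
residuals(fit.excl)[2]       # NA in the position of the dropped row

# The estimated coefficients are the same either way
all.equal(coef(fit.omit), coef(fit.excl))
```

Either way, na.action only concerns missing values in the data; it does not affect coefficients that are NA because of singularities.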
The documentation of summary.lm says 'Aliased coefficients are omitted in the return object but restored by the print method'. There seems to be no parameter to control this omission. There is another workaround besides using coef(myModel) as suggested by @Gavin Simpson. You can create a matrix of NA rows for the aliased coefficients
smry <- summary(myModel)
nr <- sum(smry$aliased) # number of aliased (non-estimable) coefficients
nc <- ncol(smry$coefficients)
rnames <- names(which(smry$aliased))
cnames <- colnames(smry$coefficients)
mat_na <- matrix(data = NA, nrow = nr, ncol = nc,
dimnames = list(rnames, cnames))
and then rbind the two matrices:
mat_coef <- rbind(summary(myModel)$coefficients,mat_na)
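With rbind the NA rows always end up at the bottom. If the original coefficient order matters, one possible variation (a sketch, not the only way) is to allocate the full table first and fill in the estimable rows, using the aliased component of the summary. The data are the ones from the question:

```r
set.seed(534)
myGroup1 <- factor(c("a", "a", "a", "a", "b", "b"))
myGroup2 <- factor(c("first", "second", "first", "second", "first", "first"))
myDepVar <- rnorm(6, 0, 1)
myModel  <- lm(myDepVar ~ myGroup1 + myGroup2 + myGroup1:myGroup2)

smry <- summary(myModel)

# Full-size table of NA, one row per coefficient including aliased ones
full <- matrix(NA_real_,
               nrow = length(smry$aliased),
               ncol = ncol(coef(smry)),
               dimnames = list(names(smry$aliased), colnames(coef(smry))))

# Fill in the rows that could be estimated; aliased rows stay NA
full[!smry$aliased, ] <- coef(smry)
full
```

The row names of full then match names(coef(myModel)), so the table lines up with the printed summary.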
You can also just transform the summary fit table into a data frame (where the variables that are NA are lost):
fit <- as.data.frame(summary(fit)$coefficients)
And then extract the coefficients by name:
fit["age", "Pr(>|z|)"]
If "age" has been dropped, you'll get an NA when trying to extract the p-value for age from the data frame.