I'm trying to fit regression models to explain donations in dictator games. Since I have a lot of variables, I want to automate the process with a 'for' loop. For now I am starting with univariate models.
When I print/summarise the fits with fit[1:24], only the intercepts and slope coefficients are displayed. It seems the p-values are not stored?
predictor <- 0
fit <- 0
dictatorgame <- mydata$dictatorgame
sumres <- 0
pVal <- 0
for (i in 1:24) {  # 24 predictor variables stored in columns 1-24 of mydata
  predictor <- mydata[i]
  unlist(predictor)
  fit[i] <- lm(dictatorgame ~ unlist(predictor))
}
I tried two different solutions I found here on SO, both of which treat the fits as list-like objects:
sumres[i] <- summary(fit[i])
pf(sumres[i]$fstatistic[1L], sumres[i]$fstatistic[2L], sumres[i]$fstatistic[3L], lower.tail = FALSE)
and
pVal[i] <- fit[i]$coefficients[, 4]
but I always end up getting the error message "$ operator is invalid for atomic vectors".
I generated some data and performed multiple univariate regressions. At the end you can find the first three elements of the output list. Is this what you want?
dependent <- rnorm(1000)
independent <- matrix(rnorm(10*1000), ncol = 10)
result <- list()
for (i in 1:10) {
  result[[i]] <- lm(dependent ~ independent[, i])
}
lapply(result, function(x) summary(x)$coefficients )
[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02890665 0.03167108 -0.9127144 0.3616132
independent[, i] -0.04605868 0.03138201 -1.4676776 0.1425069
[[2]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03142412 0.03161656 -0.9939134 0.3205060
independent[, i] -0.03874678 0.03251463 -1.1916723 0.2336731
[[3]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03208370 0.03162904 -1.0143749 0.3106497
independent[, i] 0.02089094 0.03189098 0.6550737 0.5125713
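If only the p-values are needed, they can be pulled straight out of the same kind of list. A self-contained sketch along the lines of the simulated data above (names like fits and pvals are illustrative, not from the original post):

```r
# Simulate a response and 10 candidate predictors
set.seed(1)
dependent <- rnorm(100)
independent <- matrix(rnorm(10 * 100), ncol = 100 / 10)

# Store each univariate fit in a list (not an atomic vector)
fits <- lapply(seq_len(ncol(independent)),
               function(i) lm(dependent ~ independent[, i]))

# Row 2, column 4 of coef(summary(...)) is the slope's Pr(>|t|)
pvals <- sapply(fits, function(f) coef(summary(f))[2, 4])
pvals
```

Storing the fits with [[i]] in a list is the key step; subsetting a list with [[ ]] keeps each element a full lm object, which is why the $ operator then works.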
Related
I wrote a function to run univariate linear regressions for multiple variables at a time. However, I noticed that the RMSE is missing from the summary table. How do I also display the RMSE for each of these regressions?
Here is my code and here is what my output looks like:
my.data <- read.csv("filename.csv", header = TRUE)
variables <- names(my.data[1:30])
my.list <- lapply(variables, function(var){
  formula <- as.formula(paste("gene ~", var))
  res.linear <- lm(formula, data = my.data)
  summary(res.linear)
})
lapply(my.list, coefficients)
[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.367075060 4.46417498 5.2343547 3.017975e-06
variable1 0.008312962 0.04747918 0.1750865 8.616917e-01
[[2]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.347246142 1.48314397 3.605345 0.0006984638
variable2 0.008342116 0.01577413 0.528848 0.5991611451
We can extract the residuals from the summary output, square them, take the mean, take the square root, and cbind the resulting RMSE with the extracted coefficients:
my.list <- lapply(variables, function(var){
formula <- as.formula(paste("gene ~", var))
res.linear <- lm(formula, data = my.data)
smry <- summary(res.linear)
RMSE <- sqrt(mean(smry$residuals^2))
cbind(coef(smry), RMSE = RMSE)
})
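One point worth knowing: the RMSE computed this way divides by n, whereas the "Residual standard error" that summary() reports (available as smry$sigma) divides by the residual degrees of freedom, so the two differ slightly. A small sketch on the built-in mtcars data, just to illustrate the relationship:

```r
# Residual standard error (divides by n - p) vs. plain RMSE (divides by n)
fit  <- lm(mpg ~ wt, data = mtcars)
smry <- summary(fit)

rmse  <- sqrt(mean(smry$residuals^2))   # divides by n
sigma <- smry$sigma                     # divides by n - p

n <- nrow(mtcars)
p <- length(coef(fit))
all.equal(sigma, sqrt(sum(smry$residuals^2) / (n - p)))  # TRUE
```

For large n the difference is negligible, but it explains why cbind-ing the RMSE will not exactly reproduce the residual standard error from summary().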
I have an lm object and I would like to bootstrap only its standard errors. In practice I want to use only part of the sample (with replacement) at each replication and get a distribution of standard errors. Then, if possible, I would like to display the summary of the original linear regression but with the bootstrapped standard errors and the corresponding p-values (in other words, same beta coefficients but different standard errors).
Edited: In summary, I want to "modify" my lm object so that it keeps the beta coefficients from the original fit on the original data, but uses the bootstrapped standard errors (and associated t-stats and p-values) obtained by running this regression many times on different subsamples (with replacement).
So my lm object looks like
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.812793 0.095282 40.016 < 2e-16 ***
x -0.904729 0.284243 -3.183 0.00147 **
z 0.599258 0.009593 62.466 < 2e-16 ***
x*z 0.091511 0.029704 3.081 0.00208 **
but the associated standard errors are wrong, and I would like to estimate them by replicating this linear regression 1000 times on different subsamples (with replacement).
Is there a way to do this? Can anyone help me?
Thank you for your time.
Marco
What you ask can be done along the lines of the code below.
Since you have not posted an example dataset or the model to fit, I will use the built-in dataset mtcars and a simple formula with two continuous predictors.
library(boot)
boot_function <- function(data, indices, formula){
  d <- data[indices, ]
  obj <- lm(formula, d)
  coefs <- summary(obj)$coefficients
  coefs[, "Std. Error"]
}
set.seed(8527)
fmla <- as.formula("mpg ~ hp * cyl")
seboot <- boot(mtcars, boot_function, R = 1000, formula = fmla)
colMeans(seboot$t)
##[1] 6.511530646 0.068694001 1.000101450 0.008804784
I believe that it is possible to use the code above for most needs with numeric response and predictors.
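To get the full "original coefficients with bootstrapped standard errors" table the question asks for, the mean bootstrapped SEs can be combined with the original fit's coefficients to recompute t statistics and p-values. A sketch under the same mtcars setup (the variable names and the choice of colMeans as the SE summary are assumptions, not from the original post):

```r
library(boot)

# Bootstrap the per-coefficient standard errors of an lm fit
boot_function <- function(data, indices, formula){
  d <- data[indices, ]
  coef(summary(lm(formula, d)))[, "Std. Error"]
}

set.seed(8527)
fmla   <- mpg ~ hp * cyl
seboot <- boot(mtcars, boot_function, R = 200, formula = fmla)

# Keep the original betas, swap in the bootstrapped SEs,
# then recompute t statistics and two-sided p-values
fit     <- lm(fmla, data = mtcars)
beta    <- coef(fit)
se_boot <- colMeans(seboot$t)
tval    <- beta / se_boot
pval    <- 2 * pt(-abs(tval), df = fit$df.residual)

cbind(Estimate = beta, `Boot SE` = se_boot,
      `t value` = tval, `Pr(>|t|)` = pval)
```

Note that a more conventional alternative is to bootstrap the coefficients themselves and use the standard deviation of the replicates as the SE; the sketch above follows the question's framing of resampling the SE estimates.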
I wrote this code in R:
library(boot)
bs <- function(formula, data, indices) {
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(coef(fit))
}
results <- boot(data=z, statistic=bs,
R=1000, formula=z[,1]~z[,2])
I'm trying to do random-x resampling, using as data a data frame that contains my response and my predictor. However, my results come back with zero bias and zero standard error.
Bootstrap Statistics :
original bias std. error
t1* 83.5466254 0 0
t2* -0.6360426 0 0
Can anyone spot the problem?
Your formula is incorrect. When you use z[,1]~z[,2] you are literally specifying a formula that has the first column of z as the response and the second column of z as the independent variable. Note that z never changes; it's the data= parameter that's changing. Furthermore, the formula syntax does not work with positional indexes like that. You need to use variable names. Here's some sample data
z <- data.frame(a=runif(50), b=runif(50))
Note how this doesn't work
results <- boot(data=z, statistic=bs,
R=10, formula=z[,1]~z[,2])
results
# Bootstrap Statistics :
# original bias std. error
# t1* 0.45221233 0 0
# t2* 0.08818014 0 0
it's just returning the same values over and over again, which are the same as when you use the full data set
lm(a~b, z)
# Coefficients:
# (Intercept) b
# 0.45221 0.08818
What you want is
results <- boot(data=z, statistic=bs,
R=10, formula=a~b)
results
# Bootstrap Statistics :
# original bias std. error
# t1* 0.45221233 0.01024794 0.08853861
# t2* 0.08818014 -0.01546608 0.16376128
This allows the boot function to pass in a different dataset each time; since the formula no longer refers literally to vectors from the z data frame, you'll get updated values.
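When the variable names are only known as strings (as in the looping questions above), reformulate() builds such a data-independent formula without pasting strings together. A minimal sketch, with hypothetical data:

```r
# reformulate() assembles a formula from character vectors, so the
# formula never hard-codes a particular data frame
z <- data.frame(a = runif(50), b = runif(50))

fmla <- reformulate(termlabels = "b", response = "a")  # a ~ b
fit  <- lm(fmla, data = z)
coef(fit)
```

Because the formula only names the columns, the same fmla works unchanged inside boot() or lapply() over resampled or split data frames.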
I am new to R. I found the following code for running univariate logistic regressions for a set of variables. What I would like to do is run a chi-square test for a list of variables against the dependent variable, similar to the logistic regression code below. I found a couple of approaches that involve creating all possible combinations of the variables, but I can't get them to work. Ideally, I want one of the variables (X) to stay the same in every test.
Chi Square Analysis using for loop in R
lapply(c("age","sex","race","service","cancer",
"renal","inf","cpr","sys","heart","prevad",
"type","frac","po2","ph","pco2","bic","cre","loc"),
function(var) {
formula <- as.formula(paste("status ~", var))
res.logist <- glm(formula, data = icu, family = binomial)
summary(res.logist)
})
Are you sure that the strings in the vector you lapply over are in the column names of the icu dataset?
It works for me when I download the icu data:
system("wget http://course1.winona.edu/bdeppa/Biostatistics/Data%20Sets/ICU.TXT")
icu <- read.table('ICU.TXT', header=TRUE)
and change status to STA, which is a column in icu. Here's an example for some of your variables:
my.list <- lapply(c("Age","Sex","Race","Ser","Can"),
function(var) {
formula <- as.formula(paste("STA ~", var))
res.logist <- glm(formula, data = icu, family = binomial)
summary(res.logist)
})
This gives me a list with summary.glm objects. Example:
lapply(my.list, coefficients)
[[1]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.05851323 0.69608124 -4.393903 1.113337e-05
Age 0.02754261 0.01056416 2.607174 9.129303e-03
[[2]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.4271164 0.2273030 -6.2784758 3.419081e-10
Sex 0.1053605 0.3617088 0.2912855 7.708330e-01
[[3]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0500583 0.4983146 -2.1072198 0.03509853
Race -0.2913384 0.4108026 -0.7091933 0.47820450
[[4]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9465961 0.2310559 -4.096827 0.0000418852
Ser -0.9469461 0.3681954 -2.571858 0.0101154495
[[5]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.386294e+00 0.1863390 -7.439638e+00 1.009615e-13
Can 7.523358e-16 0.5892555 1.276756e-15 1.000000e+00
If you want to do a chi-square test:
my.list <- lapply(c("Age","Sex","Race","Ser","Can"),function(var)chisq.test(icu$STA, icu[,var]))
or a chi-square test for all combinations of variables:
my.list.all <- apply(combn(colnames(icu), 2), 2, function(x)chisq.test(icu[,x[1]], icu[,x[2]]))
Does this work?
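To tabulate the results, each test's p-value can be pulled out of that list. A self-contained sketch, with simulated data standing in for the icu columns (the column names and distributions here are made up for illustration):

```r
# Hypothetical stand-in for the icu data
set.seed(1)
icu <- data.frame(STA  = rbinom(200, 1, 0.3),
                  Sex  = rbinom(200, 1, 0.5),
                  Race = sample(1:3, 200, replace = TRUE))

vars    <- c("Sex", "Race")
my.list <- lapply(vars, function(var) chisq.test(icu$STA, icu[[var]]))

# One row per variable, with the test's p-value
data.frame(variable = vars,
           p.value  = sapply(my.list, function(x) x$p.value))
```

Keeping the dependent variable fixed and lapply-ing over the other names is exactly the pattern the question asks for, just with chisq.test in place of glm.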
I have an R dataframe, strongly simplified as:
id <- rep(1:2, c(6,8))
correct <- sample(0:1,14,TRUE)
phase <- c(rep("discr",3),rep("rev",3), rep("discr",4),rep("rev",4))
dat <- data.frame(id,correct,phase)
with id as my subjects (in reality I have a lot more than 2), correct = responses coded as incorrect (0) or correct (1), and the phases Discrimination and Reversal (within-subjects factor).
I want to perform a logistic regression in the form of
glm(correct~phase, dat, family="binomial")
later possibly adding additional predictors.
However, since I have a varying amount of data for each subject, I would like to run glm() separately for each subject and later compare the coefficients with an ANOVA for group effects.
I would like to do this in a for loop in the form of
for(i in seq_along(dat$id)){
my_glm[i] <- glm(correct~list,dat[dat$id==i,],family="binomial")
}
but keep receiving the error message
>Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels.
I have checked my data and there is no factor which contains only one level. All subjects gave at least one incorrect and one correct response, and all took part in Discrimination and Reversal. The function works outside the loop when I specify a particular subject.
Here's an R Base solution:
> lapply(split(dat, dat$id), function(x) coef(summary(glm(correct~phase,family="binomial",data=x))))
$`1`
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.931472e-01 1.224745 -5.659524e-01 0.5714261
phaserev -3.845925e-16 1.732050 -2.220446e-16 1.0000000
$`2`
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.356998e-16 1.000000 3.356998e-16 1.000000
phaserev 1.098612e+00 1.527524 7.192109e-01 0.472011
You are currently trying to fit a glm for each row of id; I think you want a glm for each id separately. Personally, I would go with something like:
library(plyr)
ddply(dat, .(id), function(x){
  cf <- coef(summary(glm(correct ~ phase, family = "binomial", data = x)))
  c(cf[1], cf[2])  # intercept and slope estimates
})
# id V1 V2
#1 1 -0.6931472 1.386294e+00
#2 2 1.0986123 -6.345448e-16
# here V1 is intercept and V2 is the estimate
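The same per-subject coefficient table can also be built with base R alone, which keeps the coefficient names instead of V1/V2. A sketch on simulated data shaped like the question's (the seed and resulting values are arbitrary):

```r
# Simulated data shaped like the question's: 2 subjects, binary responses,
# a within-subject phase factor
set.seed(42)
dat <- data.frame(
  id      = rep(1:2, c(6, 8)),
  correct = sample(0:1, 14, TRUE),
  phase   = c(rep("discr", 3), rep("rev", 3), rep("discr", 4), rep("rev", 4))
)

# One glm per subject; rbind the named coefficient vectors into a matrix
coefs <- do.call(rbind, lapply(split(dat, dat$id), function(x)
  coef(glm(correct ~ phase, family = "binomial", data = x))))
coefs
```

The rows of coefs are labelled by subject id and the columns carry the usual "(Intercept)" and "phaserev" names, which makes the later group-level ANOVA step easier to set up.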