Bootstrap code not reporting bias or standard error - r

I wrote this code in R:
library(boot)
bs <- function(formula, data, indices) {
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(coef(fit))
}
results <- boot(data=z, statistic=bs,
R=1000, formula=z[,1]~z[,2])
I'm trying to do random-x resampling, using as data a data frame that contains my response and my predictor. However, my results come back with zero bias and zero standard error:
Bootstrap Statistics :
original bias std. error
t1* 83.5466254 0 0
t2* -0.6360426 0 0
Can anyone spot the problem?

Your formula is incorrect. When you use z[,1]~z[,2] you are literally specifying a formula that has the first column of z as the response and the second column of z as the independent variable. Note that z never changes; it's the data= parameter that changes on each resample. Furthermore, positional indexing like that bypasses the data= argument entirely, so you need to use variable names. Here's some sample data:
z <- data.frame(a=runif(50), b=runif(50))
Note how this doesn't work
results <- boot(data=z, statistic=bs,
R=10, formula=z[,1]~z[,2])
results
# Bootstrap Statistics :
# original bias std. error
# t1* 0.45221233 0 0
# t2* 0.08818014 0 0
It just returns the same values over and over again, and they are the same as when you use the full data set:
lm(a~b, z)
# Coefficients:
# (Intercept) b
# 0.45221 0.08818
What you want is
results <- boot(data=z, statistic=bs,
R=10, formula=a~b)
results
# Bootstrap Statistics :
# original bias std. error
# t1* 0.45221233 0.01024794 0.08853861
# t2* 0.08818014 -0.01546608 0.16376128
This allows the boot function to pass in a different data set each time. Because the formula no longer contains literal vectors that refer specifically to the z data.frame, a and b are looked up in the resampled data and you get updated values.
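A minimal sketch of the difference, assuming simulated data and a small R to keep it quick (the seed, the data, and the names bad/good are illustrative, not from the question). Counting the distinct replicates per coefficient shows that the positional formula never sees the resampled rows:

```r
library(boot)

set.seed(1)
z <- data.frame(a = runif(50), b = runif(50))

# Statistic: refit the model on the rows boot selected
bs <- function(formula, data, indices) {
  d <- data[indices, ]
  coef(lm(formula, data = d))
}

# z[, 1] and z[, 2] are not columns of d, so lm() finds the full z instead
bad  <- boot(data = z, statistic = bs, R = 20, formula = z[, 1] ~ z[, 2])
# a and b are columns of d, so each replicate uses the resampled rows
good <- boot(data = z, statistic = bs, R = 20, formula = a ~ b)

# Number of distinct bootstrap replicates per coefficient
n_distinct_bad  <- apply(bad$t,  2, function(v) length(unique(v)))
n_distinct_good <- apply(good$t, 2, function(v) length(unique(v)))
```

If the column names are not known in advance, one option is to build the formula from them, for example reformulate(names(z)[2], response = names(z)[1]).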

Related

Residual standard error: NaN on 0 degrees of freedom getting this error while creating a linear model

I am creating a linear model from a data frame in which column 6 depends on column 1 to 5. Although the code executes properly, when I print the summary of the linear model I get the following.
Call:
lm(formula = AAPL[, 6] ~ AAPL[, 1] + AAPL[, 2], data = AAPL[,
c(1, 2)], subset = 1)
Residuals:
ALL 1 residuals are 0: no residual degrees of freedom!
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.104 NA NA NA
AAPL[, 1] NA NA NA NA
AAPL[, 2] NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
The code I am using :-
lm <- lm(train[,6] ~ train[,2]+train[,3]+train[,4]+train[,5]+train[,1] , 1 , data=train)
PS: If I remove the part data=train then this works in terminal but not when executed from file.
The next line of code, which is almost the same except for one parameter, runs perfectly. The next line is:
lm2 <- lm(train[,6] ~ train[,2]+train[,3]+train[,4]+train[,5]+train[,1] , 5)
tl;dr you are (unintentionally?) specifying that the model should use only the first observation. Let's look at what's here ...
lm <- lm(train[,6] ~ train[,2]+train[,3]+train[,4]+train[,5]+train[,1] ,
1 , data=train)
The first argument is the formula. That's fine, although (1) it's clearer to use variable names rather than column indices and (2) if you are using all the other variables in the data set as predictors, you can use the shortcut y ~ . (where y is the name of the response variable).
What does the second argument mean? R matches arguments by name and then by position. The second and third arguments to lm() (see ?lm) are data and subset. Since you have supplied data by name (data=train) and haven't named the second argument, R assigns the unnamed value to the next unmatched argument, subset. Let's see what ?lm says about the subset argument:
subset: an optional vector specifying a subset of observations to be
used in the fitting process.
That means that R will take the value 1 as a "vector specifying a subset of observations", i.e. it will take only the first row of the training data set.
Since you are using only one observation to fit the model, lm() can estimate only an intercept, not any other parameters.
By the way, it's generally not recommended to use the names of built-in R functions (lm) as variable names. It works most of the time, but when it doesn't work the resulting error messages are very confusing.
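The argument matching described above can be checked on a small example. This is a sketch with made-up data (the data frame train here has just two hypothetical columns x and y), comparing the stray unnamed 1 with a plain call:

```r
set.seed(1)
train <- data.frame(x = rnorm(20), y = rnorm(20))

# data is matched by name, so the unnamed 1 falls through to subset
fit_one <- lm(y ~ x, 1, data = train)
# Without the stray argument, all rows are used
fit_all <- lm(y ~ x, data = train)

n_one <- nobs(fit_one)   # observations actually used in each fit
n_all <- nobs(fit_all)

# y ~ . uses every other column of the data frame as a predictor
fit_dot <- lm(y ~ ., data = train)
```

With a single observation the slope cannot be estimated, so coef(fit_one) contains NA for x, which matches the "not defined because of singularities" note in the summary.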

lm in for loop no p-values stored? (in R)

I'm trying to fit a regression model to explain donations in dictator games. As I have a lot of variables, I want to automate the process with a 'for' loop. For now I begin with univariate models.
When I print/summarise the fits fit[1:24], only the intercepts and coefficients are displayed. It seems like the p-values are not stored?
predictor<-0
fit<-0
dictatorgame<-mydata$dictatorgame
sumres<-0
pVal<-0
for(i in 1:24) #24 predictor variables stored in column 1-24 in mydata
{
predictor<-mydata[i]
unlist(predictor)
fit[i]<-lm(dictatorgame~unlist(predictor))
}
I tried two different solutions I found here on SO, both of them seeming to think that the objects are atomic:
sumres[i]=summary(fit[i])
pf(sumres[i]$fstatistic[1L], sumres[i]$fstatistic[2L],sumres[i]$fstatistic[3L], lower.tail = FALSE)
and
pVal[i] <- (fit[i])$coefficients[,4]
but I always end up with the error message "$ operator is invalid for atomic vectors".
I generated some data to perform multiple regressions. At the end you can find the first three elements of the output list. Is it what you want?
dependent <- rnorm(1000)
independent <- matrix(rnorm(10*1000), ncol = 10)
result <- list()
for (i in 1:10){
result[[i]] <- lm(dependent ~ independent[ ,i])
}
lapply(result, function(x) summary(x)$coefficients )
[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02890665 0.03167108 -0.9127144 0.3616132
independent[, i] -0.04605868 0.03138201 -1.4676776 0.1425069
[[2]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03142412 0.03161656 -0.9939134 0.3205060
independent[, i] -0.03874678 0.03251463 -1.1916723 0.2336731
[[3]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03208370 0.03162904 -1.0143749 0.3106497
independent[, i] 0.02089094 0.03189098 0.6550737 0.5125713
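The p-values themselves can be pulled out of the same list. The sketch below assumes the likely cause of the original problem is that fit started as a numeric vector and was filled with fit[i], which does not keep the whole model object; storing each model with [[i]] in a list does. The data and the name p_slope are illustrative:

```r
set.seed(1)
dependent   <- rnorm(100)
independent <- matrix(rnorm(100 * 3), ncol = 3)

# Pre-allocate a list; each element will hold a complete lm object
fits <- vector("list", ncol(independent))
for (i in seq_along(fits)) {
  fits[[i]] <- lm(dependent ~ independent[, i])
}

# Row 2 of the coefficient table is the slope; take its p-value column
p_slope <- sapply(fits, function(f) coef(summary(f))[2, "Pr(>|t|)"])
```

p_slope is then a plain numeric vector with one p-value per predictor, which is easier to filter or sort than the printed summaries.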

Using loops to do Chi-Square Test in R

I am new to R. I found the following code for doing univariate logistic regression for a set of variables. What I would like to do is run a chi-square test for a list of variables against the dependent variable, similar to the logistic regression code below. I found a couple of approaches that involve creating all possible combinations of the variables, but I can't get them to work. Ideally, I want one of the variables (X) to stay the same.
Chi Square Analysis using for loop in R
lapply(c("age","sex","race","service","cancer",
"renal","inf","cpr","sys","heart","prevad",
"type","frac","po2","ph","pco2","bic","cre","loc"),
function(var) {
formula <- as.formula(paste("status ~", var))
res.logist <- glm(formula, data = icu, family = binomial)
summary(res.logist)
})
Are you sure that the strings in the vector you lapply over are in the column names of the icu dataset?
It works for me when I download the icu data:
system("wget http://course1.winona.edu/bdeppa/Biostatistics/Data%20Sets/ICU.TXT")
icu <- read.table('ICU.TXT', header=TRUE)
and change status to STA, which is a column in icu. Here is an example for some of your variables:
my.list <- lapply(c("Age","Sex","Race","Ser","Can"),
function(var) {
formula <- as.formula(paste("STA ~", var))
res.logist <- glm(formula, data = icu, family = binomial)
summary(res.logist)
})
This gives me a list with summary.glm objects. Example:
lapply(my.list, coefficients)
[[1]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.05851323 0.69608124 -4.393903 1.113337e-05
Age 0.02754261 0.01056416 2.607174 9.129303e-03
[[2]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.4271164 0.2273030 -6.2784758 3.419081e-10
Sex 0.1053605 0.3617088 0.2912855 7.708330e-01
[[3]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0500583 0.4983146 -2.1072198 0.03509853
Race -0.2913384 0.4108026 -0.7091933 0.47820450
[[4]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9465961 0.2310559 -4.096827 0.0000418852
Ser -0.9469461 0.3681954 -2.571858 0.0101154495
[[5]]
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.386294e+00 0.1863390 -7.439638e+00 1.009615e-13
Can 7.523358e-16 0.5892555 1.276756e-15 1.000000e+00
If you want to do a chi-square test:
my.list <- lapply(c("Age","Sex","Race","Ser","Can"),function(var)chisq.test(icu$STA, icu[,var]))
or a chi-square test for all combinations of variables:
my.list.all <- apply(combn(colnames(icu), 2), 2, function(x)chisq.test(icu[,x[1]], icu[,x[2]]))
Does this work?
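To collect just the p-values with one variable held fixed, something like the following sketch may help. It uses a simulated data frame (toy, with hypothetical 0/1 columns named after a few ICU variables) because the real file may not be available, and it only includes categorical predictors, since the chi-square test is meant for count tables:

```r
set.seed(1)
n <- 200
toy <- data.frame(
  STA = rbinom(n, 1, 0.3),   # fixed outcome variable
  Sex = rbinom(n, 1, 0.5),
  Ser = rbinom(n, 1, 0.5),
  Can = rbinom(n, 1, 0.2)
)

vars <- c("Sex", "Ser", "Can")

# One chi-square test of STA against each categorical predictor
tests <- lapply(vars, function(v) chisq.test(table(toy$STA, toy[[v]])))
names(tests) <- vars

# Named vector of p-values, one per predictor
p_values <- sapply(tests, function(t) t$p.value)
```

Naming the list by variable keeps the results traceable once there are more than a handful of tests.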

Multiple glm in for loop

I have an R dataframe, strongly simplified as:
id <- rep(1:2, c(6,8))
correct <- sample(0:1,14,TRUE)
phase <- c(rep("discr",3),rep("rev",3), rep("discr",4),rep("rev",4))
dat <- data.frame(id,correct,phase)
with id as my subjects (in reality I have a lot more than 2), correct = responses coded as incorrect (0) or correct (1), and the phases Discrimination and Reversal (within-subjects factor).
I want to perform a logistic regression in the form of
glm(correct~phase, dat, family="binomial")
later possibly adding additional predictors.
However, since I have a varying amount of data for each subject, I would like to perform glm() separately for each subject and later compare the coefficients with ANOVA for group effects.
I would like to do this in a for loop in the form of
for(i in seq_along(dat$id)){
my_glm[i] <- glm(correct~list,dat[dat$id==i,],family="binomial")
}
but keep receiving the error message
>Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels.
I have checked my data and there is no factor which contains only one level. All subjects gave at least one incorrect and one correct response, and all took part in Discrimination and Reversal. The function works outside the loop when I specify a particular subject.
Here's a base R solution:
> lapply(split(dat, dat$id), function(x) coef(summary(glm(correct~phase,family="binomial",data=x))))
$`1`
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.931472e-01 1.224745 -5.659524e-01 0.5714261
phaserev -3.845925e-16 1.732050 -2.220446e-16 1.0000000
$`2`
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.356998e-16 1.000000 3.356998e-16 1.000000
phaserev 1.098612e+00 1.527524 7.192109e-01 0.472011
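You are currently trying to fit a glm for each row of dat: seq_along(dat$id) runs from 1 to the number of rows, so for every i that is not an actual id the subset dat[dat$id==i,] is empty.
I think you want a glm for each id separately. Personally, I would go with something like: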
library(plyr)
ddply(dat, .(id), function (x){
intercept <- coef(summary(glm(correct~phase,family="binomial",data=x)))[1]
slope <- coef(summary(glm(correct~phase,family="binomial",data=x)))[2]
c(intercept,slope)
})
# id V1 V2
#1 1 -0.6931472 1.386294e+00
#2 2 1.0986123 -6.345448e-16
# here V1 is the intercept and V2 is the slope estimate for phase
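For the later comparison across subjects, it can be convenient to keep the fitted models and build one coefficient table from them. This is a sketch in base R; the responses are fixed by hand here (not sampled as in the question) so that the result is reproducible:

```r
id      <- rep(1:2, c(6, 8))
correct <- c(0, 1, 0,  1, 1, 0,   0, 1, 1, 0,  1, 1, 0, 1)
phase   <- c(rep("discr", 3), rep("rev", 3), rep("discr", 4), rep("rev", 4))
dat     <- data.frame(id, correct, phase)

# One model per subject: split() yields one data frame per id
fits <- lapply(split(dat, dat$id), function(x) {
  glm(correct ~ phase, family = binomial, data = x)
})

# One row per subject, one column per coefficient
coefs <- t(sapply(fits, coef))
```

If a for loop is preferred, iterating over unique(dat$id) instead of seq_along(dat$id) avoids the empty subsets that trigger the contrasts error.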

Different NA actions for coefficients and summary of linear model in R

In R, when using lm(), if I set na.action = na.pass inside the call to lm(), then in the summary table there is an NA for any coefficient that cannot be estimated (because of missing cells in this case).
If, however, I extract just the coefficients from the summary object, using either summary(myModel)$coefficients or coef(summary(myModel)), then the NA's are omitted.
I want the NA's to be included when I extract the coefficients the same way that they are included when I print the summary. Is there a way to do this?
Setting options(na.action = na.pass) does not seem to help.
Here is an example:
> set.seed(534)
> myGroup1 <- factor(c("a","a","a","a","b","b"))
> myGroup2 <- factor(c("first","second","first","second","first","first"))
> myDepVar <- rnorm(6, 0, 1)
> myModel <- lm(myDepVar ~ myGroup1 + myGroup2 + myGroup1:myGroup2)
> summary(myModel)
Call:
lm(formula = myDepVar ~ myGroup1 + myGroup2 + myGroup1:myGroup2)
Residuals:
1 2 3 4 5 6
-0.05813 0.55323 0.05813 -0.55323 -0.12192 0.12192
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.15150 0.23249 -0.652 0.561
myGroup11 0.03927 0.23249 0.169 0.877
myGroup21 -0.37273 0.23249 -1.603 0.207
myGroup11:myGroup21 NA NA NA NA
Residual standard error: 0.465 on 3 degrees of freedom
Multiple R-squared: 0.5605, Adjusted R-squared: 0.2675
F-statistic: 1.913 on 2 and 3 DF, p-value: 0.2914
> coef(summary(myModel))
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.15149826 0.2324894 -0.6516352 0.5611052
myGroup11 0.03926774 0.2324894 0.1689012 0.8766203
myGroup21 -0.37273117 0.2324894 -1.6032180 0.2072173
> summary(myModel)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.15149826 0.2324894 -0.6516352 0.5611052
myGroup11 0.03926774 0.2324894 0.1689012 0.8766203
myGroup21 -0.37273117 0.2324894 -1.6032180 0.2072173
Why don't you just extract the coefficients from the fitted model:
> coef(myModel)
(Intercept) myGroup1b
-0.48496169 -0.07853547
myGroup2second myGroup1b:myGroup2second
0.74546233 NA
That seems the easiest option.
na.action has nothing to do with this. Note that you didn't pass na.action = na.pass in your example.
na.action is a global option for handling NA in the data passed to a model fit, usually in conjunction with a formula; it is also the name of a function, na.action(). R builds up the so-called model frame from the data argument and the symbolic representation of the model expressed in the formula. At this point, any NA would be detected, and the default option for na.action is to use na.omit() to remove the NA from the data by dropping samples with NA for any variable. There are alternatives, most usefully na.exclude(), which would remove NA during fitting but add back NA in the correct places in the fitted values, residuals etc. Read ?na.omit, ?na.action and ?options for more on this.
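The difference between na.omit and na.exclude is easiest to see on the residuals. A small sketch with a hypothetical five-row data frame that has one missing response:

```r
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(1.1, NA, 2.9, 4.2, 5.1))

# na.omit drops row 2 before fitting and forgets about it
fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
# na.exclude also drops row 2 for fitting, but pads results afterwards
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

res_omit <- residuals(fit_omit)   # one value per row actually used
res_excl <- residuals(fit_excl)   # one value per original row, NA for row 2
```

Both fits give the same coefficients; only the shape of the extracted residuals and fitted values differs. Neither setting affects coefficients that are NA because of singularities.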
The documentation of summary.lm says 'Aliased coefficients are omitted in the returned object but restored by the print method'. It seems there is no parameter to control this omission. There is another workaround besides using coef(myModel) as suggested by @Gavin Simpson: you can create a matrix of NAs for the aliased coefficients
nr <- num_regressors - nrow(summary(myModel)$coefficients) ##num_regressors shall be defined previously
nc <- 4
rnames <- names(which(summary(myModel)$aliased))
cnames <- colnames(summary(myModel)$coefficients)
mat_na <- matrix(data = NA,nrow = nr,ncol = nc,
dimnames = list(rnames,cnames))
and then rbind the two matrices:
mat_coef <- rbind(summary(myModel)$coefficients,mat_na)
You can also just transform the summary fit table into a data frame (where the variables that are NA are lost):
fit <- as.data.frame(summary(fit)$coefficients)
And then extract the coefficients by name:
fit["age", "Pr(>|z|)"]
If "age" has been dropped, you'll get an NA when trying to extract the p-value for age from the data frame.
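A variant of the matrix workaround that does not need num_regressors is sketched below. It assumes the aliased component of the summary object (a named logical vector, one entry per coefficient) and fills an all-NA matrix with the rows that were estimated, so the original coefficient order is kept. The name full is illustrative:

```r
set.seed(534)
myGroup1 <- factor(c("a", "a", "a", "a", "b", "b"))
myGroup2 <- factor(c("first", "second", "first", "second", "first", "first"))
myDepVar <- rnorm(6, 0, 1)
myModel  <- lm(myDepVar ~ myGroup1 + myGroup2 + myGroup1:myGroup2)

sm <- summary(myModel)

# All-NA matrix with one row per coefficient, aliased ones included
full <- matrix(NA_real_,
               nrow = length(sm$aliased),
               ncol = ncol(sm$coefficients),
               dimnames = list(names(sm$aliased), colnames(sm$coefficients)))

# Copy in the rows that summary() did estimate
full[rownames(sm$coefficients), ] <- sm$coefficients
```

Indexing full by coefficient name then returns NA for an aliased term instead of failing, which is the behaviour asked for in the question.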

Resources