I would like to compute cluster robust standard errors using a mids class object. This arises from multiple imputation of missing values in a column of my original data. A minimal example is below.
library(mice)
y <- c(1,0,0,1,1,1,1,0)
x <- c(26, 34, 55, 15, 31, 47, 97, 12)
z <- c(2, NA, 0, NA, 3, 7, 7, 5)
mydata <- data.frame(y, x, z)
tempData <- mice(mydata, m=5, maxit=5, method='pmm', seed=500)
class(tempData)
# [1] "mids"
modelFit <- with(tempData,lm(y ~ x + z))
summary(modelFit)
At this point I would like to get the cluster robust standard errors. Unfortunately, miceadds::lm.cluster does not accept "mids" class objects.
The function lm.cluster in miceadds is intended for regular data frames. An example of an application to multiply imputed data is given in the documentation.
Below is a version adapted to your question. I used the first variable as a cluster indicator because your example didn't have one.
library(mice)
library(miceadds)
id <- c(1,0,0,1,1,1,1,0)
y <- c(26,34,55,15,31,47,97,12)
x <- c(2,NA,0,NA,3,7,7,5)
dat <- data.frame(id,y,x)
imp <- mice(dat, m=5, maxit=5, method='pmm', seed=500)
implist <- lapply(1:5, function(i) complete(imp,i))
mod <- lapply(implist, function(i){
  lm.cluster(i, formula=y~x, cluster=i$id)
})
# extract parameters and covariance matrices
betas <- lapply(mod, coef)
vars <- lapply(mod, vcov)
# pool
summary(pool_mi(qhat=betas, u=vars))
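A side note: depending on your mice version, complete() can return all imputed data sets as a list in one call, replacing the lapply() above (an assumption; check ?complete for your installed version):
# mice >= 3.0 (assumption): action = "all" returns the list of completed data sets
implist <- complete(imp, action = "all")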
I always use a for loop to create a matrix of linear and logistic regression output, but I am having difficulty doing so with qgcomp(). If anyone has experience or advice, I would greatly appreciate it. I prefer this approach over functions since all it takes is one click. The output matrix contains only NA, and even when I remove the tryCatch it gives me one of two errors: "Error: $ operator is invalid for atomic vectors" or "Error in x[good, , drop = FALSE] : (subscript) logical subscript too long". I have tried every possible placement of the quotation marks within the model statement, and also the typical cbind() trick for multivariate regressions.
Here is an example:
library(qgcomp)  # for qgcomp.noboot
data(mtcars)
IVs <- c("mpg", "cyl", "vs")
mixture <- c("disp", "wt")
DVs <- c("drat", "qsec")
matrix <- matrix(nrow = 2, ncol = 5)
colnames(matrix) <- c("Psi_beta", "Psi.LCL", "Psi.UCL", "Pos.psi.weights", "Neg.psi.weights")
rownames(matrix) <- DVs
for(i in 1:2){
  tryCatch({
    mod1 <- qgcomp.noboot(paste0(DVs[i], "~mpg", "+factor(cyl)", "+factor(vs)",
                                 "+disp", "+wt"),
                          dat=mtcars,
                          expnms=mixture,
                          q=4,
                          family=gaussian())
    coef <- as.data.frame(summary(mod1))
    pospsi <- as.data.frame(mod1$pos.psi)
    negpsi <- as.data.frame(mod1$neg.psi)
    matrix[i,1] <- round(coef[2,1], 3)   # extract overall mixture estimate
    matrix[i,2] <- round(coef[2,3], 3)   # extract LCL for mixture
    matrix[i,3] <- round(coef[2,4], 3)   # extract UCL for mixture
    matrix[i,4] <- round(pospsi[1], 3)   # extract sum of positive weights for mixture
    matrix[i,5] <- round(negpsi[1], 3)   # extract sum of negative weights for mixture
  }, error=function(e){})
}
It's hard to tell what is happening when I can't see what mix.df looks like. Given the error you are facing, it is likely that you are trying to access a vector using $ notation. As far as I can tell, qgcomp.noboot returns a list, so that is not likely to be the source of your error. Therefore, I think it might have something to do with mix.df.
Edit: I've reworked the code and I think it does what you want. Take a look and let me know.
library(qgcomp)  # for qgcomp.noboot
data(mtcars)
IVs <- c("mpg", "cyl", "vs")
mixture <- c("disp", "wt")
DVs <- c("drat", "qsec")
matrix <- matrix(nrow = 2, ncol = 5)
colnames(matrix) <- c("Psi_beta", "Psi.LCL", "Psi.UCL", "Pos.psi.weights", "Neg.psi.weights")
rownames(matrix) <- DVs
for(i in 1:2){
  # Need to convert the string to a formula because qgcomp doesn't do that for us
  mod1.formula <- as.formula(paste0(DVs[i], "~mpg", "+factor(cyl)", "+factor(vs)",
                                    "+disp", "+wt"))
  # Run qgcomp as usual, with the previous formula
  mod1 <- qgcomp.noboot(mod1.formula,
                        dat=mtcars,
                        expnms=mixture,
                        q=4,
                        family=gaussian())
  coef <- as.data.frame(summary(mod1))
  pospsi <- as.data.frame(mod1$pos.psi)
  negpsi <- as.data.frame(mod1$neg.psi)
  matrix[i,1] <- round(coef[2,1], 3)     # extract overall mixture estimate
  matrix[i,2] <- round(coef[2,3], 3)     # extract LCL for mixture
  matrix[i,3] <- round(coef[2,4], 3)     # extract UCL for mixture
  # I changed the way you access pospsi and negpsi because
  # they are data frames and need [row, col] indexing
  matrix[i,4] <- round(pospsi[1,1], 3)   # extract sum of positive weights for mixture
  matrix[i,5] <- round(negpsi[1,1], 3)   # extract sum of negative weights for mixture
}
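As a side note, base R's reformulate() builds the same formula a little more readably inside the loop (an equivalent sketch, not required for the fix):
# Equivalent to the paste0()/as.formula() line above
mod1.formula <- reformulate(c("mpg", "factor(cyl)", "factor(vs)", "disp", "wt"),
                            response = DVs[i])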
I am trying to create a loop that performs lm on each row of a data frame (the response) against a predictor vector that sits outside the data frame. The values and length of the predictor vary depending on which category (df$Group) the row belongs to.
I would appreciate help creating a loop that performs lm for each row and saves the coefficients to a single vector or data frame. Also, what changes should I make to the code to run the loop through several data frames with the same structure and save the coefficients to a single data frame?
Below is what I tried; it does not save the coefficients.
df <- data.frame(ID=c(1:10), Group=c("A","A","A","A","B","B","B","B","B","B"),
                 T1=rnorm(10, mean=1, sd=1), T2=rnorm(10, mean=2, sd=1),
                 T3=seq(40,58,by=2), T4=seq(10,28,by=2))
dfList <- list(df, df1, df2)  # df1, df2: further data frames with the same structure
A <- df$ID
B <- df$Group
C <- numeric(length=length(A))
x1 <- c(1:4)
x2 <- c(2:4)
for (i in length(A)){  # note: this loops over a single value, i = 10
  if(B[i] == "A"){C[i] <- apply(df[,c(3:6)], 1, function(y) lm(y~x1)$coefficients[2])[i]}
  if(B[i] == "B"){C[i] <- apply(df[,c(4:6)], 1, function(y) lm(y~x2)$coefficients[2])[i]}
}
Appreciate any help!
Thank you!
Are you trying to do something like this?
list_df <- split(df, df$Group)
A_coeff <- apply(list_df[[1]][, 3:6], 1, function(y) lm(y~x1)$coefficients[2])
B_coeff <- apply(list_df[[2]][, 4:6], 1, function(y) lm(y~x2)$coefficients[2])
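For your second question (several data frames with the same structure), here is a minimal sketch, assuming they are collected in a list as in your dfList and that x1 and x2 are as defined above; it stacks the per-row slopes from every data frame into one result:
# Per-row slopes for each data frame in dfList, stacked into one data.frame
slopes_one_df <- function(d) {
  a <- apply(d[, 3:6], 1, function(y) lm(y ~ x1)$coefficients[2])  # Group A uses T1-T4
  b <- apply(d[, 4:6], 1, function(y) lm(y ~ x2)$coefficients[2])  # Group B uses T2-T4
  data.frame(ID = d$ID, Group = d$Group, slope = ifelse(d$Group == "A", a, b))
}
all_slopes <- do.call(rbind, lapply(dfList, slopes_one_df))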
The goal is to make a single ordered logistic regression using multiple data frames with the same structure.
In the example below I joined two data frames to be able to do a single ordered logistic regression. However, I would like a solution that automates the process, since there are 417 data frames.
library(dplyr)
library(MASS)
#dataframe 1
party <- factor(rep(c("Rep","Dem"), c(407, 428)),
levels=c("Rep","Dem"))
rpi <- c(30, 46, 148, 84, 99) # cell counts
dpi <- c(80, 81, 171, 41, 55) # cell counts
ideology <- c("Very Liberal","Slightly Liberal","Moderate","Slightly Conservative","Very Conservative")
pol.ideology <- factor(c(rep(ideology, rpi),
rep(ideology, dpi)), levels = ideology)
data1 <- data.frame(party,pol.ideology)
#dataframe 2
party <- factor(rep(c("Rep","Dem"), c(410, 430)),
levels=c("Rep","Dem"))
rpi2 <- c(26, 50, 140, 95, 99) # cell counts
dpi2 <- c(75, 86, 141, 61, 67) # cell counts
ideology2 <- c("Very Liberal","Slightly Liberal","Moderate","Slightly Conservative","Very Conservative")
pol.ideology <- factor(c(rep(ideology2, rpi2),
rep(ideology2, dpi2)), levels = ideology2)
data2 <- data.frame(party,pol.ideology)
nrow(data1)
nrow(data2)
## Joining dataframes "manually"
dat <- bind_rows(data1,data2)
table(dat)
nrow(dat)
# fit proportional odds model
pom <- polr(pol.ideology ~ party, data=dat)
summary(pom)
So I tried a loop in an attempt to do a single regression with data1 and data2. I was not successful, as the regression considers only data2.
## LOOP
data = ls(pattern="data")
for(i in 1:length(ls(pattern="data"))){
  pom <- polr(pol.ideology ~ party, data=i)
}
summary(pom)
Could someone point me to a solution? I recently started studying loops in R. Sorry if the question is silly.
The best option would be to create a list of data frames (assuming all data frames have a similar structure and all exist as individual objects in your environment with names like data1, data2, data3...).
library(tidyverse)
df_list <- map(paste0("data",1:3), ~eval(sym(.x)))
Once you have the list created, you can easily do whatever you want.
Create regressions for the individual data frames:
result_list <- map(df_list, ~polr(pol.ideology ~ party, data=.x))
Or, if you want one regression on the combined data, first combine all the data frames together:
big_df <- df_list %>% bind_rows()
Then run your regression on the combined data frame, big_df:
big_df_result <- polr(pol.ideology ~ party, data= big_df)
The i in your loop is not the data; it's the index into the character vector you created outside the loop. But be careful assigning to the name data: that object's name itself matches the pattern, so a later ls(pattern="data") would pick it up too. Best to rename it. Your other problem is that you never save the results of each model, only the last one. So use a list to save them all and then lapply to extract the summaries.
library(MASS) # For ordered logistic regression (polr)
## LOOP
my_list = ls(pattern="data")
pom <- list() # initialise your pom
for(i in 1:length(my_list)){  # or seq_along(my_list)
  pom[[i]] <- polr(pol.ideology ~ party, data=get(my_list[i]))
}
lapply(pom, summary)
You could also do this without a loop using lapply.
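For example (a sketch, assuming the only objects whose names match the pattern are your data frames):
# Fetch all matching objects into a list and fit polr to each
pom <- lapply(mget(ls(pattern = "data")), function(d) polr(pol.ideology ~ party, data = d))
lapply(pom, summary)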
I've spent days searching for the optimal models which would fulfill all of the standard OLS assumptions (normal distribution, homoscedasticity, no multicollinearity) in R, but with 12 variables it's impossible to find the optimal variable combination by hand. So I was trying to create a script which would automate this process.
Here is the sample code for the calculations:
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
x4 <- runif(100, 0, 10)
x5 <- runif(100, 0, 10)
df <- data.frame(x1, x2, x3, x4, x5)
library(lmtest)
library(car)
model <- lm(x1~x2+x3+x4+x5, data = df)
# check for normal distribution (Shapiro-Wilk-Test)
rs_sd <- rstandard(model)
shapiro.test(rs_sd)
# check for heteroskedasticity (Breusch-Pagan-Test)
bptest(model)
# check for multicollinearity
vif(model)
#-------------------------------------------------------------------------------
# models without outliers
# identify outliers: Cook's distance > 4/(n-k-1) --> outlier
cooks <- round(cooks.distance(model), digits = 4)
df_no_out <- cbind(df, cooks)
df_no_out <- subset(df_no_out, cooks < 4/(100-4-1))
model_no_out <- lm(x1~x2+x3+x4+x5, data = df_no_out)
# check for normal distribution
rs_sd_no_out<- rstandard(model_no_out)
shapiro.test(rs_sd_no_out)
# check for heteroskedasticity
bptest(model_no_out)
# check for multicollinearity
vif(model_no_out)
What I have in mind is to loop through all of the variable combinations and get the p-values of shapiro.test() and bptest(), or the VIF values, for every model created, so I can compare the significance values and the multicollinearity respectively. (In my dataset multicollinearity shouldn't be a problem anyway, and since the VIF check produces one value per variable it would be harder to implement; the p-values for shapiro.test() and bptest() would suffice.)
I've tried to write several scripts which would automate the process, but without success (unfortunately I'm not a programmer). I know there are already some threads dealing with this problem:
How to run lm models using all possible combinations of several variables and a factor
Finding the best combination of variables for high R-squared values
but I haven't found a script which would also calculate just the p-values.
The tests for models without outliers are especially important, because after removing the outliers the OLS assumptions are fulfilled in many cases.
I would really appreciate any suggestions or help with this.
You are scratching the surface of what is now referred to as statistical learning. The introductory text is "An Introduction to Statistical Learning: with Applications in R" and the graduate-level text is "The Elements of Statistical Learning".
To do what you need, use the regsubsets() function from the leaps package. However, if you read at least chapter 6 of the introductory book, you will learn about cross-validation and the bootstrap, which are the modern ways of doing model selection.
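For example, a minimal regsubsets() call on the df from your question might look like this (a sketch; treating x1 as the response is an assumption carried over from your code):
library(leaps)
# Best subset of x2..x5 at each model size, scored here by adjusted R^2
subs <- regsubsets(x1 ~ ., data = df, nvmax = 4)
summary(subs)$adjr2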
The following automates the model fitting and the tests you ran afterwards.
There is one function that fits all possible models. Then a series of calls to the *apply functions gets the values you want.
library(lmtest)
library(car)
fitAllModels <- function(data, resp, regr){
  f <- function(M){
    apply(M, 2, function(x){
      fmla <- paste(resp, paste(x, collapse = "+"), sep = "~")
      fmla <- as.formula(fmla)
      lm(fmla, data = data)
    })
  }
  regr <- names(data)[names(data) %in% regr]
  regr_list <- lapply(seq_along(regr), function(n) combn(regr, n))
  models_list <- lapply(regr_list, f)
  unlist(models_list, recursive = FALSE)
}
Now the data.
# Make up a data.frame to test the function above.
# Don't forget to set the RNG seed to make the
# results reproducible
set.seed(7646)
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
x4 <- runif(100, 0, 10)
x5 <- runif(100, 0, 10)
df <- data.frame(x1, x2, x3, x4, x5)
First fit all models with "x1" as response and the other variables as possible regressors. The function can be called with one response and any number of possible regressors you want.
fit_list <- fitAllModels(df, "x1", names(df)[-1])
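With one response and four candidate regressors, this produces every non-empty subset of the regressors:
length(fit_list)
# 15 models: choose(4,1) + choose(4,2) + choose(4,3) + choose(4,4)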
And now the sequence of tests.
# Normality test, standardized residuals
rs_sd_list <- lapply(fit_list, rstandard)
sw_list <- lapply(rs_sd_list, shapiro.test)
sw_pvalues <- sapply(sw_list, '[[', 'p.value')
# check for heteroskedasticity (Breusch-Pagan-Test)
bp_list <- lapply(fit_list, bptest)
bp_pvalues <- sapply(bp_list, '[[', 'p.value')
# check for multicollinearity,
# only models with 2 or more regressors
vif_values <- lapply(fit_list, function(fit){
  regr <- attr(terms(fit), "term.labels")
  if(length(regr) < 2) NA else vif(fit)
})
A note on the Cook's distance: in your code you subset the original data.frame, producing a new one without the outliers. This duplicates data. I have opted for a list of the row indices only. If you prefer the duplicated data.frames, uncomment the marked line in the anonymous function below and comment out the last one.
# models without outliers
# identify outliers: Cook's distance > 4/(n - k - 1) --> outlier
df_no_out_list <- lapply(fit_list, function(fit){
  cooks <- cooks.distance(fit)
  regr <- attr(terms(fit), "term.labels")
  k <- length(regr)
  inx <- cooks < 4/(nrow(df) - k - 1)
  #df[inx, ]
  which(inx)
})
# This tells how many rows have the df's without outliers
sapply(df_no_out_list, NROW)
# A data.frame without outliers. This one is the one
# for model number 8.
# The two code lines could become a one-liner.
i <- df_no_out_list[[8]]
df[i, ]
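If you also want the tests re-run on the outlier-free data, here is a sketch along the same lines, refitting each model on its retained rows:
# Refit every model on its rows without outliers and repeat both tests
fit_no_out_list <- Map(function(fit, inx) lm(formula(fit), data = df[inx, ]),
                       fit_list, df_no_out_list)
sw_pvalues_no_out <- sapply(fit_no_out_list, function(f) shapiro.test(rstandard(f))$p.value)
bp_pvalues_no_out <- sapply(fit_no_out_list, function(f) bptest(f)$p.value)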
I am trying to do a trending analysis of reliability data. A typical case would be to determine if a 10-year trend exists in the demand rate for a specified system at specified plants.
I am trying to generate a test case but am a bit confused about how to structure the data. The trend years range from 2004 to 2013. In my test case I have, for each year, 10 systems for which demands have been counted. I am using normally distributed demand counts with different means and variances each year. Of course real data will likely not have the same system count each year, and the demand counts are not necessarily normally distributed.
The following R code produces a data frame (df1) that seems reasonable to me:
yr <- 2004:2013
y2004 <- rnorm(10, 10, 3)
y2005 <- rnorm(10, 11, 2)
y2006 <- rnorm(10, 12, 1)
y2007 <- rnorm(10, 13, 5)
y2008 <- rnorm(10, 14, 3)
y2009 <- rnorm(10, 15, 4)
y2010 <- rnorm(10, 16, 1)
y2011 <- rnorm(10, 17, 2)
y2012 <- rnorm(10, 18, 4)
y2013 <- rnorm(10, 19, 1)
df1 <- data.frame(yr, y2004, y2005, y2006, y2007, y2008, y2009, y2010, y2011, y2012, y2013)
df2 <- data.frame(x = rep(0.0, 100), y = rep(0.0, 100))
k <- 1
for (i in 1:10) {
  for (j in 1:10) {
    df2$x[k] <- df1$yr[i]
    df2$y[k] <- df1[j, i+1]
    k <- k + 1
  }
}
boxplot(y ~ x, df2)
Anyway, my first problem is that the construction of df2 seems unnecessary given I already have the data in df1; it's just that the call to lmer seems to require the organization of df2. My call to lmer looks like the following:
library(lme4)
fit <- lmer(y ~ x + (1|x), data=df2)
So is there a way to use lmer without the construction of df2, using df1 directly? Or is there a better way to structure the data entirely?
My second problem is I am not really sure how to use lmer to do what I want to do. Basically I am looking to pool the count data for each year and fit the mean demand count each year with a straight line. The best fit should consider the variance in the data in each pooled year group. Am I going about it correctly?
Nearly all plotting and modeling functions in R require data in the "long" format (i.e. df2). So if anything, I would skip the construction of df1. If you want to generate df2 more directly, you could do:
df2 <- do.call("rbind.data.frame", Map(cbind,
         y = Map(function(n, m, s) rnorm(n, m, s), 10, 10:19, c(3,2,1,5,3,4,1,2,4,1)),
         x = 2004:2013))
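With df2 in the long format, the lmer call from your question then runs directly (a sketch; whether a random intercept per year is the right model for your trend question is a separate issue):
library(lme4)
# fixed linear trend in year plus a random intercept for each year
fit <- lmer(y ~ x + (1 | x), data = df2)
summary(fit)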