Comparing nested models with NAs in R

I am trying to compare nested regression models using the anova() function in R, but am running into problems because the level 1 and level 2 models differ in the number of observations due to missing cases. Here is a simple example:
# Create dataframe with multiple predictors with different number of NAs
dep <- c(45,46,45,48,49)
basevar <- c(10,12,10,16,17)
pred1 <- c(NA,20,NA,19,21)
dat <- data.frame(dep,basevar,pred1)
# Define level 1 of the nested models
basemodel <- lm(dep ~ basevar, data = dat)
# Add level 2
model1 <- lm(dep ~ basevar + pred1, data = dat)
# Compare the models (uh oh!)
anova(basemodel, model1)
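This call stops with the error:
Error in anova.lmlist(object, ...) : models were not all fitted to the same size of dataset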
I have seen two suggestions for similar problems, but both are problematic.
Suggestion 1: Impute the missing data. The problem with this is that the missing cases in my data were removed because they were outliers, and thus are not "missing at random," and imputing may overfit the data.
Suggestion 2: Make a separate data frame containing only the complete cases for the variable with missing cases, and use that for regressions. This is also problematic if you are creating multiple nested models sharing the same level 1 variable, but in which the level 2 variables differ in the number of missing cases. Here is an example of this:
# Create a new predictor variable with a different number of NAs from pred1
pred2 <- c(23,21,NA,10,11)
dat <- cbind(dat,pred2)
# Create dataframe containing only completed cases of pred1
nonadat1 <- subset(dat, subset = !is.na(pred1))
# Do the same for pred2
nonadat2 <- subset(dat, subset = !is.na(pred2))
# Define level 1 of the nested models within dataframe of pred1 complete cases
basemodel1 <- lm(dep ~ basevar, data = nonadat1)
# Check values of the model
summary(basemodel1)
# Add level 2
model1 <- lm(dep ~ basevar + pred1, data = nonadat1)
# Compare the models (yay it runs!)
anova(basemodel1, model1)
# Define level 1 of the nested models within dataframe of pred2 complete cases
basemodel2 <- lm(dep ~ basevar, data = nonadat2)
# Values are different from those in basemodel1
summary(basemodel2)
# Add level 2
model2 <- lm(dep ~ basevar + pred2, data = nonadat2)
# Compare the models
anova(basemodel2, model2)
As you can see, creating individual data frames creates differences at level 1 of the nested models, which makes interpretation problematic.
Does anyone know how I can compare these nested models while circumventing these problems?

Could this work? See here for more information. It doesn't exactly deal with the fact that models are fitted on different datasets, but it does allow for a comparison.
# Likelihood-ratio test computed from the two log-likelihoods
A <- logLik(basemodel)
B <- logLik(model1)
(teststat <- -2 * (as.numeric(A) - as.numeric(B)))
(p.val <- pchisq(teststat, df = 1, lower.tail = FALSE))
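Another workaround worth sketching (my suggestion, not from the thread): restrict every model to the cases that are complete on all predictors, so the level 1 model is identical across comparisons. With the five-row toy data this leaves too few observations to be meaningful, but on real data it keeps the baselines comparable.
# Fit all models on the cases complete for every predictor
dat_cc <- na.omit(dat)
base_cc <- lm(dep ~ basevar, data = dat_cc)
m1_cc <- lm(dep ~ basevar + pred1, data = dat_cc)
m2_cc <- lm(dep ~ basevar + pred2, data = dat_cc)
anova(base_cc, m1_cc)
anova(base_cc, m2_cc)
The trade-off is that this discards cases that are complete for one level 2 predictor but not another, which may be costly if missingness differs a lot between predictors.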

Related

How to get X & Y rows to match?

I'm working on some new code and need a little help with ridge regularized regression. I'm trying to build a predictive model, but first I need the x and y matrix rows to match.
I found something similar with a Google search, but their data is randomly generated and not provided like mine is. My data is a large dataset with over 500,000 observations and 670 variables.
library(rsample)
library(glmnet)
library(dplyr)
library(ggplot2)
# Create training (70%) and test (30%) sets
# Use set.seed for reproducibility
set.seed(123)
alumni_split<-initial_split(alumni, prop=.7, strata = "Id.Number")
alumni_train<-training(alumni_split)
alumni_test<-testing(alumni_split)
#----
# Create training and testing feature model matrices and response vectors
# We use model.matrix(...)[, -1] to discard the intercept
alumni_train_x <- model.matrix(Id.Number ~ ., alumni_train)[, -1]
alumni_test_x <- model.matrix(Id.Number ~ ., alumni_test)[, -1]
alumni_train_y <- log(alumni_train$Id.Number)
alumni_test_y <- log(alumni_test$Id.Number)
# What is the dimension of your feature matrix?
dim(alumni_train_x)
#---- [HERE]
# Apply Ridge regression to alumni data
alumni_ridge <- glmnet(alumni_train_x, alumni_train_y, alpha = 0)
The error message (with code):
alumni_ridge <- glmnet(alumni_train_x, alumni_train_y, alpha = 0)
Error in glmnet(alumni_train_x, alumni_train_y, alpha = 0) :
number of observations in y (329870) not equal to the number of rows of x (294648)
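A likely cause (a guess, since the alumni data isn't shown): model.matrix() silently drops rows containing NAs, while log(alumni_train$Id.Number) keeps every row, so x ends up shorter than y. A minimal sketch of a fix is to build both x and y from the same model frame, so they drop the same rows:
mf <- model.frame(Id.Number ~ ., data = alumni_train) # drops NA rows once
alumni_train_x <- model.matrix(Id.Number ~ ., mf)[, -1]
alumni_train_y <- log(model.response(mf))
stopifnot(nrow(alumni_train_x) == length(alumni_train_y)) # rows now match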

predict multivariate model using rfsrc

I use the package "randomForestSRC" and I want to predict two variables at the same time. I have no issue predicting either of these variables alone, but can't get the combination to work. Here's an MWE:
library(randomForestSRC)
# create dataset, only continuous variables
dat <- mtcars[,c("drat", "wt", "qsec")]
dat$drat_wt <- dat$drat*dat$wt
dat$drat_qsec <- dat$drat*dat$qsec
train <- sample(nrow(dat), round(nrow(dat)*0.8))
# train the algorithms with only one outcome variable
test_1 <- rfsrc(drat_wt ~ . ,data = dat[train,],
importance = 'permute')
test_2 <- rfsrc(drat_qsec ~ . ,data = dat[train,],
importance = 'permute')
# train the random forest with multiple outcome variables
test_3 <- rfsrc(cbind(drat_qsec,drat_wt) ~ . ,data = dat[train,],
importance = 'permute')
test_4 <- rfsrc(Multivar(drat_qsec,drat_wt) ~ . ,data = dat[train,],
importance = 'permute')
# predict all the random forest data
pred_1 <- predict(test_1, dat[-train,], na.action= "na.impute")
pred_2 <- predict(test_2, dat[-train,], na.action= "na.impute")
pred_3 <- predict(test_3, dat[-train,], na.action= "na.impute")
pred_4 <- predict(test_4, dat[-train,], na.action= "na.impute")
Now, for the random forests with only one outcome (pred_1 and pred_2) there is an element pred_1$predicted containing the predicted values, but there is no such element for the other random forests.
Bonus: Is there any difference between the test_3 and test_4 formulations? I couldn't find anything about that in the documentation.
To answer your first question, the outputs pred_3 and pred_4 contain a list element, regrOutput, that is not present in the pred_1 and pred_2 objects. If you look inside this element, you will find the names of the two outcome (or target, or dependent, or ...) variables. For example,
names(pred_4$regrOutput)
[1] "drat_qsec" "drat_wt"
Now, inside each of these objects, you find the elements that are missing from names(pred_1). For example,
names(pred_4$regrOutput$drat_qsec)
[1] "predicted" "predicted.oob" "quantile" "quantile.oob" "err.rate"
and
setdiff(names(pred_1), names(pred_3))
[1] "predicted" "predicted.oob" "quantile" "quantile.oob" "err.rate"
I am not sure about the second question, but the objects have the same structure. One way to explore this is to look through the objects.
For example, set the seed for each forest and then look at the returned objects.
set.seed(1234)
test_3 <- rfsrc(cbind(drat_qsec,drat_wt) ~ . ,data = dat[train,],
importance = 'permute')
set.seed(1234)
test_4 <- rfsrc(Multivar(drat_qsec,drat_wt) ~ . ,data = dat[train,],
importance = 'permute')
Since the first element is the call, which we know is different, drop it and check for identity.
identical(test_3[-1], test_4[-1])
Now, look at each element in the returned objects.
mapply(identical, test_3, test_4)
Zero in on those that are not identical:
which(!mapply(identical, test_3, test_4))
call forest regrOutput
1 19 36
Then further into the object...
which(!mapply(identical, test_3$forest, test_4$forest))
nativeArray
2
and so on.
However, what might matter most is: do these produce the same result? Let's rerun the prediction and take a look at the predicted values.
pred_3 <- predict(test_3, dat[-train,], na.action= "na.impute")
pred_4 <- predict(test_4, dat[-train,], na.action= "na.impute")
Now, take a look:
pred_3$regrOutput$drat_qsec$predicted
[1] 68.04419 55.95063 55.02067 56.38382 56.38382 75.33735
pred_4$regrOutput$drat_qsec$predicted
[1] 68.04419 55.95063 55.02067 56.38382 56.38382 75.33735
They look the same to me.
identical(pred_3$regrOutput$drat_qsec$predicted,
pred_4$regrOutput$drat_qsec$predicted)
[1] FALSE
They aren't identical, though. Let's use the recommended all.equal function, which compares floating point values up to a tolerance.
all.equal(pred_3$regrOutput$drat_qsec$predicted,
pred_4$regrOutput$drat_qsec$predicted)
[1] TRUE
Ah, it's probably just a matter of numerical precision.
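For reference, identical() demands bit-for-bit equality, while all.equal() tolerates tiny numeric differences (the default tolerance is about 1.5e-8):
identical(1, 1 + 1e-10) # FALSE: the bit patterns differ
all.equal(1, 1 + 1e-10) # TRUE: the difference is below the tolerance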

Stargazer one line per data set

I am running regressions using various subsets of a data set and a number of dependent variables.
An example using attitude data:
library(stargazer)
#REGRESSIONS USING DATASET 1
linear1.1 <- lm(rating ~ complaints, data = attitude) #dependent 1
linear1.2 <- lm(privileges ~ complaints, data = attitude) #dependent 2
#REGRESSIONS USING DATASET 2
linear2.1 <- lm(rating ~ complaints, data = attitude[1:15,]) #dependent 1
linear2.2 <- lm(privileges ~ complaints, data = attitude[1:15,]) #dependent 2
As you can see, both dependent variables rating and privileges are used in regressions for both subsets of the data. Using a standard stargazer approach produces the following table:
stargazer::stargazer(linear1.1,linear1.2,linear2.1,linear2.2,
omit.stat = "all",
keep = "complaints")
Each column represents one of the regression models. However, I'd like to have each column represent one dependent variable, and each subset of the data represent one row. (I have produced such a table by hand; it is not shown here.) Does anyone know whether it's possible to achieve this using stargazer? I have a lot of regression subsets and dependent variables, so a highly automatic solution would be appreciated. Thanks!
I just wonder if this little modification of the approach from "Exporting output of custom multiple regressions from R to Latex" will suit you:
library(stargazer)
library(broom)
## generate dummy data
set.seed(123)
x <- runif(1000)
z <- x^0.5
y <- x + z + rnorm(1000, sd=.05)
model1 <- lm(y ~ x)
model2 <- lm(y ~ z)
## transform model summaries into dataframes
tidy(model1) -> model1_tidy
tidy(model2) -> model2_tidy
output <- rbind(model1_tidy,model2_tidy)
stargazer(output, type='text', summary=FALSE)
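If you want exactly one row per data subset and one column per dependent variable, a more direct sketch (my own extension, not from the linked answer) is to assemble the coefficient matrix yourself and pass it to stargazer with summary = FALSE. Note this carries only the point estimates; standard errors would need the same treatment.
models <- list("Full data" = list(linear1.1, linear1.2),
               "Rows 1-15" = list(linear2.1, linear2.2))
# one row per subset, one column per dependent variable
tab <- t(sapply(models, function(ms) sapply(ms, function(m) coef(m)["complaints"])))
colnames(tab) <- c("rating", "privileges")
stargazer(as.data.frame(tab), type = "text", summary = FALSE)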

Compare lm models from subset of data or find inflection point to remove lag time

I am running lots of linear models on data from different experiments where we want to exclude a lag time from the start of the experiment. This lag time may vary between runs and is very obvious in the example plot below. Is there a robust way to exclude the lag time automatically? In my example below it would be where x1 < 1.
I thought the way to do it would be to produce linear models, gradually removing points from the start of the data, and compare the models, but I don't know the best way to compare models fitted on subsets of the data.
df <- data.frame (x1 = c(0.7,1.7,2.8,3.7,4.9,6.0,6.7,7.7,8.7,9.7,10.7,12.0,13.1),
y1 = c(22.7,50.7,103.2,143.4,175.2,216.8,234.1,246.6,256.0,266.2,276.0,287.6,295.5))
plot(1/df$x1, log(df$y1), type = "l")
summary(lm(log(y1) ~ I(1/x1), data = df))
summary(lm(log(y1) ~ I(1/x1), data = df[df$x1 > 1,]))
summary(lm(log(y1) ~ I(1/x1), data = df[df$x1 > 2,]))
i. Generate a list of 14 data.frames, where v runs from 0 to the maximum integer value of df$x1
library(dplyr)
all.dat <- lapply(max(df$x1) %>% seq(from =0, to=.), function(v) df[df$x1 > v,])
ii. Generate a list of lm models from the 14 data.frames
lm.form <- as.formula("log(y1) ~ I(1/x1)")
all.lm <- lapply(all.dat, function(x)lm(lm.form, data=x))
iii. View the summaries of all 14 lm models
lapply(all.lm, summary)
sapply(all.lm, function(x)summary(x)$r.sq) #extract r.sq value for all models
[1] 0.9074019 0.9960153 0.9957543 0.9903552 0.9783031 0.9937000 0.9899247 0.9915223 0.9982270 0.9997441 0.9998207 1.0000000 0.0000000 0.0000000
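One possible heuristic for choosing the cutoff automatically (my sketch, not part of the answer above): require a minimum number of points so the degenerate fits at the end are excluded, then take the first cutoff whose r-squared clears a threshold. Both thresholds here are assumptions to tune.
r.sq <- sapply(all.lm, function(x) summary(x)$r.sq)
n.obs <- sapply(all.dat, nrow)
candidates <- which(n.obs >= 5 & r.sq > 0.99) # assumed thresholds
v.best <- min(candidates) - 1 # list element i corresponds to v = i - 1
v.best # 1 here, i.e. drop rows with x1 <= 1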

caret dummy-vars exclude target

How can I use dummy vars in caret without destroying my target variable?
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
dummies <- dummyVars( Purchase ~ ., data = data)
data2 <- predict(dummies, newdata = data)
split_factor = 0.5
n_samples = nrow(data2)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
train <- data2[train_idx, ]
test <- data2[-train_idx, ]
modelFit<- train(Purchase~ ., method='lda',preProcess=c('scale', 'center'), data=train)
This will fail, as the Purchase variable is missing. If I instead replace it beforehand with data$Purchase <- ifelse(data$Purchase == "CH", 1, 0), caret complains that this is no longer a classification but a regression problem.
The example code seems to have a few issues, indicated in the comments below. To answer your questions:
The result of ifelse is an integer vector, not a factor, so the train function defaults to regression.
Passing the dummyVars output directly to train is done by using train(x = ..., y = ..., ...) instead of a formula.
To avoid these problems, check the class of your objects carefully.
Be aware that the preProcess option in train() will apply the preprocessing to all numeric variables, including the dummies. Option 2 below avoids this by standardizing the data before calling train().
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
# Make sure that all variables that should be a factor are defined as such
newFactorIndex <- c("StoreID","SpecialCH","SpecialMM","STORE")
data[, newFactorIndex] <- lapply(data[,newFactorIndex], factor)
library(caret)
# See help for dummyVars. The function does not take a dependent variable and predict will give an error
# I don't include the target variable here, so predicting dummies on new data will drop unknown columns
# including the target variable
dummies <- dummyVars(~., data = data[,-1])
# I don't change the data yet to apply standardization to the numeric variables,
# before turning the categorical variables into dummies
split_factor = 0.5
n_samples = nrow(data)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
# Option 1 (as asked): Specify independent and dependent variables separately
# Note that dummy variables will be standardized by preProcess as per the original code
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to data frame
data2 <- data.frame(predict(dummies, newdata = data))
modelFit<- train(y = data[train_idx, "Purchase"], x = data2[train_idx,], method='lda',preProcess=c('scale', 'center'))
# Option 2: Append dependent variable to the independent variables (needs to be a data frame to allow factor and numeric)
# Note that I also shift the preprocessing away from train() to
# avoid standardizing the dummy variables
train <- data[train_idx, ]
test <- data[-train_idx, ]
preprocessor <- preProcess(train[!sapply(train, is.factor)], method = c('center',"scale"))
train <- predict(preprocessor, train)
test <- predict(preprocessor, test)
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to data frame
train <- data.frame(predict(dummies, newdata = train))
test <- data.frame(predict(dummies, newdata = test))
# Reattach the target variable to the training data that has been
# dropped by predict(dummies,...)
train$Purchase <- data$Purchase[train_idx]
modelFit<- train(Purchase ~., data = train, method='lda')
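A brief usage check (my addition): the test set prepared above can be scored directly to gauge out-of-sample accuracy.
pred <- predict(modelFit, newdata = test)
confusionMatrix(pred, data$Purchase[-train_idx])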
