omitting columns from data - r

The data set has 252 observations and 18 variables. I needed a test sample with every tenth observation and a training sample with the remaining data so I created two separate datasets:
id <- seq(1, nrow(fat), by=10)
test <- fat[id,]
train <- fat[-id,]
I can do a linear regression using all predictors except brozek and density:
model2 <- lm(siri ~ .-brozek -density, train)
I need to fit a principal component regression (PCR) model:
fatpca <- prcomp(fat[-id,])
but this still includes the variables brozek and density.
How do I exclude these variables to do a PCR model?

To remove a couple of variables you have a few choices:
trainsub <- subset(train,select=-c(brozek,density))
or
trainsub <- train[, !colnames(train) %in% c("brozek","density")]
or
trainsub <- dplyr::select(train,-c(brozek,density))
You can also use a formula interface with prcomp, i.e.
prcomp(~ . -brozek - density, data=train)
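A minimal sketch of the whole workflow, assuming the built-in mtcars data as a stand-in for fat: mpg plays the role of siri, and vs and am play the roles of brozek and density. The response is dropped before the PCA as well, and the choice of three components is arbitrary.

```r
# Stand-in data: mtcars replaces fat; mpg ~ siri, vs/am ~ brozek/density (assumptions)
id <- seq(1, nrow(mtcars), by = 10)   # every tenth row goes to the test sample
test <- mtcars[id, ]
train <- mtcars[-id, ]                # negative index keeps the remaining rows

# Drop the response and the two unwanted columns before the PCA
drop_cols <- c("mpg", "vs", "am")
trainsub <- train[, !colnames(train) %in% drop_cols]

# Two equivalent ways to run the PCA on the reduced set of columns
pca_sub <- prcomp(trainsub, scale. = TRUE)
pca_formula <- prcomp(~ . - mpg - vs - am, data = train, scale. = TRUE)

# PCR: regress the response on the first k component scores
k <- 3
pcr_data <- data.frame(mpg = train$mpg, pca_sub$x[, 1:k])
pcr_fit <- lm(mpg ~ ., data = pcr_data)

# Project the test rows onto the same components, then predict
test_scores <- predict(pca_sub, newdata = test[, colnames(trainsub)])[, 1:k, drop = FALSE]
pcr_pred <- predict(pcr_fit, newdata = data.frame(test_scores))
```

The pls package also offers a dedicated pcr() function, which may be more convenient than doing the two steps by hand.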


Comparing nested models with NAs in R

I am trying to compare nested regression models using the anova() function in R, but am running into problems because the level 1 and level 2 models differ in the number of observations due to missing cases. Here is a simple example:
# Create dataframe with multiple predictors with different number of NAs
dep <- c(45,46,45,48,49)
basevar <- c(10,12,10,16,17)
pred1 <- c(NA,20,NA,19,21)
dat <- data.frame(dep,basevar,pred1)
# Define level 1 of the nested models
basemodel <- lm(dep ~ basevar, data = dat)
# Add level 2
model1 <- lm(dep ~ basevar + pred1, data = dat)
# Compare the models (uh oh!)
anova(basemodel, model1)
I have seen 2 suggestions to similar problems, but both are problematic.
Suggestion 1: Impute the missing data. The problem with this is that the missing cases in my data were removed because they were outliers, and thus are not "missing at random," and imputing may overfit the data.
Suggestion 2: Make a separate data frame containing only the complete cases for the variable with missing cases, and use that for regressions. This is also problematic if you are creating multiple nested models sharing the same level 1 variable, but in which the level 2 variables differ in the number of missing cases. Here is an example of this:
# Create a new predictor variable with a different number of NAs from pred1
pred2 <- c(23,21,NA,10,11)
dat <- cbind(dat,pred2)
# Create dataframe containing only completed cases of pred1
nonadat1 <- subset(dat, subset = !is.na(pred1))
# Do the same for pred2
nonadat2 <- subset(dat, subset = !is.na(pred2))
# Define level 1 of the nested models within dataframe of pred1 complete cases
basemodel1 <- lm(dep ~ basevar, data = nonadat1)
# Check values of the model
summary(basemodel1)
# Add level 2
model1 <- lm(dep ~ basevar + pred1, data = nonadat1)
# Compare the models (yay it runs!)
anova(basemodel1, model1)
# Define level 1 of the nested models within dataframe of pred2 complete cases
basemodel2 <- lm(dep ~ basevar, data = nonadat2)
# Values are different from those in basemodel1
summary(basemodel2)
# Add level 2
model2 <- lm(dep ~ basevar + pred2, data = nonadat2)
# Compare the models
anova(basemodel2, model2)
As you can see, creating individual data frames creates differences at level 1 of the nested models, which makes interpretation problematic.
Does anyone know how I can compare these nested models while circumventing these problems?
Could this work? It is a likelihood-ratio test computed from the two log-likelihoods. It doesn't exactly deal with the fact that the models are fitted on different datasets, but it does allow for a comparison.
A<-logLik(basemodel)
B<-logLik(model1)
(teststat <- -2 * (as.numeric(A)-as.numeric(B)))
(p.val <- pchisq(teststat, df = 1, lower.tail = FALSE))
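One way to keep level 1 and level 2 on identical rows, without building separate data frames by hand, is to pass lm() a subset= mask computed from the variables of the larger model. The sketch below uses made-up data (the toy data frame and its values are hypothetical, not the question's), and it does not settle whether dropping the outlier rows is appropriate in the first place.

```r
# Hypothetical toy data with a few missing values in the level 2 predictor
set.seed(1)
n <- 30
toy <- data.frame(basevar = rnorm(n), pred1 = rnorm(n))
toy$dep <- 1 + 2 * toy$basevar + 0.5 * toy$pred1 + rnorm(n)
toy$pred1[c(3, 7, 12)] <- NA

# Level 2 model: lm() silently drops the rows with NA in pred1
full <- lm(dep ~ basevar + pred1, data = toy)

# Level 1 model refitted on exactly the rows the level 2 model used
keep <- complete.cases(toy[, all.vars(formula(full))])
base <- lm(dep ~ basevar, data = toy, subset = keep)

# Both models now share the same observations, so anova() runs
cmp <- anova(base, full)

# Likelihood-ratio version of the same comparison (one extra parameter)
lr <- as.numeric(2 * (logLik(full) - logLik(base)))
p_lr <- pchisq(lr, df = 1, lower.tail = FALSE)
```

For a second predictor with a different missingness pattern, the same two lines (keep and base) would be repeated with that model's variables, so the level 1 fit still changes between comparisons; that is inherent to comparing on complete cases.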

use cox model to estimate survival

I first establish a cox model in R:
test1<- test[1:20,]
model.1 <- coxph(Surv(test1$days,test1$status==1) ~ test1$MTT+test1$ADC,data=test1)
When I tried to predict the next patient's survival like this:
covs1 <- data.frame(test[21,]$MTT,test[21,]$ADC)
summary(survfit(model.1, newdata= covs1, type ="aalen"))
it gave me too many survival results, with the warning:
"'newdata' had 1 row but variables found have 20 rows"
FYI, there are 20 events and the output contains 20 survival results.
The data frame supplied as the basis for a prediction must have the same column names as the variables on the RHS of the model formula. I don't think yours will qualify unless you do something like this:
test1<- test[1:20,]
model.1 <- coxph( Surv(days, status==1) ~ MTT + ADC, data=test1)
covs1 <- test[21, c("MTT", "ADC")]
# then do your prediction
You should not use $ to supply arguments to Surv. It is important that the model be constructed in the environment of the dataframe.
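A runnable sketch of the same pattern, assuming the lung data shipped with the survival package as a stand-in for test (age and ph.ecog stand in for MTT and ADC; the 100-row training split is arbitrary):

```r
library(survival)

# Stand-in data: keep rows complete for the variables used below
vars <- c("time", "status", "age", "ph.ecog")
d <- lung[complete.cases(lung[, vars]), ]
train <- d[1:100, ]

# Bare column names in the formula; in lung, status == 2 marks a death
fit <- coxph(Surv(time, status == 2) ~ age + ph.ecog, data = train)

# One new patient, with columns named exactly as in the formula
newpat <- d[101, c("age", "ph.ecog")]

# One survival curve for that patient, evaluated at the training event times
sf <- survfit(fit, newdata = newpat)
summary(sf)
```

Because the names in newpat match the formula, survfit() returns a single curve rather than falling back to the training rows.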

Generating predictive simulations from a multilevel model with random intercepts

I am struggling to understand how, in R, to generate predictive simulations for new data using a multilevel linear regression model with a single set of random intercepts. Following the example on pp. 146-147 of this text, I can execute this task for a simple linear model with no random effects. What I can't wrap my head around is how to extend the set-up to accommodate random intercepts for a factor added to that model.
I'll use iris and some fake data to show where I'm getting stuck. I'll start with a simple linear model:
mod0 <- lm(Sepal.Length ~ Sepal.Width, data = iris)
Now let's use that model to generate 1,000 predictive simulations for 250 new cases. I'll start by making up those cases:
set.seed(20912)
fakeiris <- data.frame(Sepal.Length = rnorm(250, mean(iris$Sepal.Length), sd(iris$Sepal.Length)),
Sepal.Width = rnorm(250, mean(iris$Sepal.Width), sd(iris$Sepal.Width)),
Species = sample(as.character(unique(iris$Species)), 250, replace = TRUE),
stringsAsFactors=FALSE)
Following the example in the aforementioned text, here's what I do to get 1,000 predictive simulations for each of those 250 new cases:
library(arm)
n.sims = 1000 # set number of simulations
n.tilde = nrow(fakeiris) # set number of cases to simulate
X.tilde <- cbind(rep(1, n.tilde), fakeiris[,"Sepal.Width"]) # create matrix of predictors describing those cases; need column of 1s to multiply by intercept
sim.fakeiris <- sim(mod0, n.sims) # draw the simulated coefficients
y.tilde <- array(NA, c(n.sims, n.tilde)) # build an array to hold results
for (s in 1:n.sims) { y.tilde[s,] <- rnorm(n.tilde, X.tilde %*% sim.fakeiris@coef[s,], sim.fakeiris@sigma[s]) } # use matrix multiplication to fill that array
That works fine, and now we can do things like colMeans(y.tilde) to inspect the central tendencies of those simulations, and cor(colMeans(y.tilde), fakeiris$Sepal.Length) to compare them to the (fake) observed values of Sepal.Length.
Now let's try an extension of that simple model in which we assume that the intercept varies across groups of observations --- here, species. I'll use lmer() from the lme4 package to estimate a simple multilevel/hierarchical model that matches that description:
library(lme4)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
Okay, that works, but now what? I run:
sim.fakeiris.lmer <- sim(mod1, n.sims)
When I use str() to inspect the result, I see that it is an object of class sim.merMod with three components:
@fixef, a 1,000 x 2 matrix with simulated coefficients for the fixed effects (the intercept and Sepal.Width)
@ranef, a 1,000 x 3 matrix with simulated coefficients for the random effects (the three species)
@sigma, a vector of length 1,000 containing the sigmas associated with each of those simulations
I can't wrap my head around how to extend the matrix construction and multiplication used for the simple linear model to this situation, which adds another dimension. I looked in the text, but I could only find an example (pp. 272-275) for a single case in a single group (here, species). The real-world task I'm aiming to perform involves running simulations like these for 256 new cases (pro football games) evenly distributed across 32 groups (home teams). I'd greatly appreciate any assistance you can offer.
Addendum. Stupidly, I hadn't looked at the details on simulate.merMod() in lme4 before posting this. I have now. It seems like it should do the trick, but when I run simulate(mod0, nsim = 1000, newdata = fakeiris), the result has only 150 rows. The values look sensible, but there are 250 rows (cases) in fakeiris. Where is that 150 coming from?
One possibility is to use the predictInterval function from the merTools package. The package is about to be submitted to CRAN, but the current development release is available for download from GitHub:
install.packages("devtools")
devtools::install_github("jknowles/merTools")
To get the median and a 95% credible interval of 100 simulations:
library(lme4)
library(merTools)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
out <- predictInterval(mod1, newdata=fakeiris, level=0.95,
n.sims=100, stat="median")
By default, predictInterval includes the residual variation, but you can
turn that feature off with:
out2 <- predictInterval(mod1, newdata=fakeiris, level=0.95,
n.sims=100, stat="median",
include.resid.var=FALSE)
Hope this helps!
This might help: it doesn't use sim(), but instead uses mvrnorm() to draw new coefficients from the sampling distribution of the fixed-effect parameters, then uses a bit of internal machinery (setBeta0) to reassign the internal values of the fixed-effect coefficients. The internal values of the random-effect coefficients are automatically resampled by simulate.merMod using the default argument re.form=NA. However, the residual variance is not resampled: it is held fixed across the simulations, which isn't 100% realistic.
In your use case, you would specify newdata=fakeiris.
library(lme4)
mod1 <- lmer(Sepal.Length ~ Sepal.Width + (1 | Species), data = iris)
simfun <- function(object,n=1,newdata=NULL,...) {
v <- vcov(object)
b <- fixef(object)
betapars <- MASS::mvrnorm(n,mu=b,Sigma=v)
npred <- if (is.null(newdata)) {
length(predict(object))
} else nrow(newdata)
res <- matrix(NA,npred,n)
for (i in 1:n) {
object@pp$setBeta0(betapars[i,])
res[,i] <- simulate(object,newdata=newdata,...)[[1]]
}
return(res)
}
ss <- simfun(mod1,100)
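For the hand-rolled matrix approach from the question, the extra step is to look up each new case's group and add that group's simulated intercept to the fixed-effect part. The sketch below shows only that indexing. The three draw objects are hypothetical stand-ins generated with rnorm(), not output from a fitted model; in practice they would be extracted from the sim() result, and their exact shape there is an assumption to check with str().

```r
set.seed(42)
n.sims <- 1000
groups <- levels(iris$Species)

# Stand-ins for simulated draws (hypothetical values, NOT from a fitted model)
fixef_draws <- cbind(rnorm(n.sims, 3.4, 0.1), rnorm(n.sims, 0.8, 0.05))  # n.sims x 2
ranef_draws <- matrix(rnorm(n.sims * length(groups), 0, 0.5),
                      nrow = n.sims, dimnames = list(NULL, groups))      # n.sims x 3
sigma_draws <- abs(rnorm(n.sims, 0.44, 0.02))                            # length n.sims

# New cases: one predictor plus the group each case belongs to
new_cases <- data.frame(Sepal.Width = rnorm(250, 3, 0.4),
                        Species = sample(groups, 250, replace = TRUE))
n.tilde <- nrow(new_cases)
X.tilde <- cbind(1, new_cases$Sepal.Width)   # column of 1s for the intercept
g <- match(new_cases$Species, groups)        # ranef column for each case

y.tilde <- matrix(NA_real_, n.sims, n.tilde)
for (s in seq_len(n.sims)) {
  # fixed part plus the group-specific intercept shift for draw s
  mu <- drop(X.tilde %*% fixef_draws[s, ]) + ranef_draws[s, g]
  y.tilde[s, ] <- rnorm(n.tilde, mu, sigma_draws[s])
}
```

The same lookup would carry over to 256 games across 32 home teams: g would hold a team index for each game, and ranef_draws would have 32 columns.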

'predict' gives different results than using manually the coefficients from 'summary'

Let me state my confusion with the help of an example,
#making datasets
x1<-iris[,1]
x2<-iris[,2]
x3<-iris[,3]
x4<-iris[,4]
dat<-data.frame(x1,x2,x3)
dat2<-dat[1:120,]
dat3<-dat[121:150,]
#Using a linear model to fit x4 using x1, x2 and x3 where training set is first 120 obs.
model<-lm(x4[1:120]~x1[1:120]+x2[1:120]+x3[1:120])
#Using the coefficients' values from summary(model), prediction is done for the next 30 obs.
-.17947-.18538*x1[121:150]+.18243*x2[121:150]+.49998*x3[121:150]
#Same prediction is done using the function "predict"
predict(model,dat3)
My confusion is: the two sets of predictions for the last 30 values differ, maybe only slightly, but they do differ. Why is that? Shouldn't they be exactly the same?
The difference is really small, and I think is just due to the accuracy of the coefficients you are using (e.g. the real value of the intercept is -0.17947075338464965610... not simply -.17947).
In fact, if you take the coefficient values from the model object and apply the formula, the result is equal to predict:
intercept <- model$coefficients[1]
x1Coeff <- model$coefficients[2]
x2Coeff <- model$coefficients[3]
x3Coeff <- model$coefficients[4]
intercept + x1Coeff*x1[121:150] + x2Coeff*x2[121:150] + x3Coeff*x3[121:150]
You can clean your code a bit. To create your training and test datasets you can use the following code:
# create training and test datasets
train.df <- iris[1:120, 1:4]
test.df <- iris[-(1:120), 1:4]
# fit a linear model to predict Petal.Width using all predictors
fit <- lm(Petal.Width ~ ., data = train.df)
summary(fit)
# predict Petal.Width in the test set using the linear model
predictions <- predict(fit, test.df)
# create a function mse() to calculate the Mean Squared Error
mse <- function(predictions, obs) {
sum((obs - predictions) ^ 2) / length(predictions)
}
# measure the quality of fit
mse(predictions, test.df$Petal.Width)
Your predictions differ because predict() uses the full-precision coefficients, whereas your "manual" calculation uses only five decimal places. The summary() function doesn't display the complete value of your coefficients but rounds them to make the output more readable.
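A short check of that explanation, assuming the same iris split as above: the manual calculation matches predict() when the full coefficients are used, and drifts slightly when they are rounded to five decimals.

```r
train.df <- iris[1:120, 1:4]
test.df <- iris[121:150, 1:4]
fit <- lm(Petal.Width ~ ., data = train.df)

# Design matrix for the test rows (intercept column plus the three predictors)
X <- model.matrix(delete.response(terms(fit)), test.df)

# Manual predictions with full-precision and with rounded coefficients
manual_full <- drop(X %*% coef(fit))
manual_rounded <- drop(X %*% round(coef(fit), 5))

pred <- predict(fit, test.df)

# TRUE: full-precision manual calculation agrees with predict()
all.equal(unname(manual_full), unname(pred))

# Small but non-zero: the discrepancy introduced purely by rounding
max(abs(manual_rounded - pred))
```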

Can I do predict.glmnet on test data with different number of predictor variables?

I used glmnet to build a predictive model on a training set with ~200 predictors and 100 samples, for a binomial regression/classification problem.
I selected the best model (16 predictors) that gave me the max AUC. I have an independent test set with only those variables (16 predictors) which made it into the final model from the training set.
Is there any way to use the predict.glmnet based on the optimal model from the training set with new test set which has data for only those variables that made it into the final model from the training set?
glmnet requires the exact same number/names of variables from the training dataset to be in the validation/test set. For example:
library(caret)
library(glmnet)
df <- ... # a dataframe with 200 variables, some of which you want to predict on
# & some of which you don't care about.
# Variable 13 ('Response.Variable') is the dependent variable.
# Variables 1-12 & 14-113 are the predictor variables
# All training/testing & validation datasets are derived from this single df.
# Split dataframe into training & testing sets
inTrain <- createDataPartition(df$Response.Variable, p = .75, list = FALSE)
Train <- df[ inTrain, ] # Training dataset for all model development
Test <- df[ -inTrain, ] # Final sample for model validation
# Run logistic regression , using only specified predictor variables
logCV <- cv.glmnet(x = data.matrix(Train[, c(1:12,14:113)]), y = Train[,13],
family = 'binomial', type.measure = 'auc')
# Test model over final test set, using specified predictor variables
# Create field in dataset that contains predicted values
Test$prob <- predict(logCV,type="response", newx = data.matrix(Test[,
c(1:12,14:113) ]), s = 'lambda.min')
For a completely new set of data, you could constrain the new df to the necessary variables using some variant of the following method:
new.df <- ... # new df w/ 1,000 variables, which include all predictor variables used
# in developing the model
# Create object with requisite predictor variable names that we specified in the model
predictvars <- c('PredictorVar1', 'PredictorVar2', 'PredictorVar3',
... 'PredictorVarK')
new.df$prob <- predict(logCV,type="response", newx = data.matrix(new.df[names(new.df)
%in% predictvars ]), s = 'lambda.min')
# the above method limits the new df of 1,000 variables to
# whatever the requisite variable names or indices go into the
# model.
Additionally, glmnet only deals with matrices. This is probably why you're getting the error you post in the comment to your question. Some users (myself included) have found that as.matrix() doesn't resolve the issue; data.matrix() seems to work though (hence why it's in the above code). This issue is addressed in a thread or two on SO.
I assume that all variables in the new dataset to be predicted also need to be formatted the same as they were in the dataset used for model development. I usually pull all of my data from the same source so I haven't encountered what glmnet will do in cases where formatting is different.
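If the test set really contains only the selected predictors, one possible workaround is to rebuild a matrix with every training column, in the training order, and fill the absent columns with zeros. This relies on the assumption that those absent predictors have coefficients of exactly zero at the chosen lambda, so their values cannot affect the prediction. The helper align_newx() and the column names below are hypothetical, written in base R for illustration.

```r
# Hypothetical helper: pad newdata out to the training columns, zero-filling gaps
align_newx <- function(newdata, train_names) {
  newx <- matrix(0, nrow = nrow(newdata), ncol = length(train_names),
                 dimnames = list(NULL, train_names))
  shared <- intersect(train_names, colnames(newdata))
  newx[, shared] <- data.matrix(newdata[, shared, drop = FALSE])
  newx
}

# Toy illustration: six training predictors, of which only V2 and V5 were selected
train_names <- paste0("V", 1:6)
test_selected <- data.frame(V2 = c(1.5, 2.0), V5 = c(0.3, 0.7))
newx <- align_newx(test_selected, train_names)

# newx now has the shape predict() expects, e.g. (not run here):
# predict(logCV, newx = newx, s = "lambda.min", type = "response")
```

Before relying on this, it is worth confirming from coef(logCV, s = "lambda.min") that every zero-filled column does have a zero coefficient.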
