I have a dataframe data with more than 50 variables and I am trying to do a PCA in R using the caret package.
library(caret)
library(e1071)
trans <- preProcess(data,method=c("YeoJohnson", "center","scale", "pca"))
If I understand this code correctly, it applies a Yeo-Johnson transformation (because data has zeros in it), standardises the data, and then applies PCA (by default, the function keeps only the PCs needed to explain at least 95% of the variability in the data).
However, when I use the prcomp command,
model<-prcomp(data,scale=TRUE)
I can get more output, such as printing the summary or drawing a scree plot with plot(model, type = "l"), which I am not able to do with trans. Does anyone know if there are any functions in the caret package that produce the same output as prcomp?
You can access the principal components themselves with the predict function.
df <- predict(trans, data)
summary(df)
You won't have exactly the same output as with prcomp: while caret uses prcomp(), it discards the original prcomp class object and does not return it.
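If you need the prcomp-style summary or scree plot, one workaround is to run prcomp() yourself alongside preProcess. The sketch below is a minimal base-R illustration, not caret's internals. It uses the built-in iris measurements as stand-in data (an assumption; substitute your own data frame), skips the Yeo-Johnson step, and reproduces the 95% cut-off by hand.

```r
# Stand-in data (assumption): the four numeric columns of the built-in iris set
X <- iris[, 1:4]

# Centre and scale, then run PCA, keeping the full prcomp object
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components reaching the 95% threshold
n_comp <- which(cum_var >= 0.95)[1]

# The usual prcomp output is available because the object was kept
print(summary(pca))
print(n_comp)
```

With the object kept, screeplot(pca, type = "lines") or plot(pca, type = "l") draws the scree plot as usual.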
I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps I have developed so far, using a fictitious example based on the pima data from the faraway package. Step 4 is where my issue occurs.
#-----------activate packages and download data-------------##
library(faraway)
library(mice)
library(margins)
data(pima)
Apply multiple imputation by chained equations using the mice package. For the sake of the example, I first randomly assigned missing values to the pima dataset using the ampute function from the same package. Twenty imputed datasets were generated by setting the "m" argument to 20.
#-------------------assign missing values to data-----------------#
result<-ampute(pima)
result<-result$amp
#-------------------multiple imputation by chained equation--------#
#generate 20 imputed datasets
newresult<-mice(result,m=20)
Run a logistic regression on each of the 20 imputed datasets. Inspecting convergence and the distributions of the original and imputed data is skipped for the sake of the example. The "test" variable is set as the binary dependent variable.
#run a logistic regression on each of the 20 imputed datasets
model<-with(newresult,glm(test~pregnant+glucose+diastolic+triceps+age+bmi,family = binomial(link="logit")))
Combine the regression estimations from the 20 imputation models to create a single pooled imputation model.
#pooled regressions
summary(pool(model))
Generate predictions from the pooled imputation model using the prediction function from the margins package. This function can generate predicted values with a variable fixed at a specific level (for factors) or value (for continuous variables). In this example, I could choose to generate new predicted probabilities, i.e. P(Y=1), while setting the pregnant variable (number of pregnancies) to 3. In other words, it would give me the distribution of the outcome in the counterfactual situation where all observations have this variable set to 3. Normally, I would just pass my model to the x argument of the prediction function (as below), but in the case of a pooled imputation model from mice, the object has class mipo, not glm.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
Unrecognized variable name in 'at': (1) <empty>p<empty>r<empty>e<empty>g<empty>n<empty>a<empty>n<empty>t<empty
I thought of two solutions:
a) changing the class object to make it fit prediction()'s requirements
b) extracting pooled imputation regression parameters and reconstruct it in a list that would fit prediction()'s requirements
However, I'm not sure how to achieve this and would appreciate any advice that could help me get closer to obtaining predictions from a pooled imputation model in R.
You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)
In addition to @Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(marginaleffects)
# fit the model on one completed data set and predict at pregnant = 3
fit_reg <- function(dat) {
  mod <- glm(test~pregnant+glucose+diastolic+
               triceps+age+bmi,
             data = dat, family = binomial)
  out <- predictions(mod, newdata = datagrid(pregnant=3))
  return(out)
}
# mice() takes its random seed through the seed argument
dat_mice <- mice(pima, m = 20, printFlag = FALSE, seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
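Option (b) from the question, rebuilding predictions from the pooled coefficients, can also be sketched in base R. This is only an illustration under stated assumptions: the data are simulated, the "imputed" data sets are plain resamples standing in for complete(newresult, "all"), and the model is cut down to two predictors. It yields point predictions only; it does not carry the between-imputation uncertainty into an interval.

```r
set.seed(1)

# Simulated stand-in data (assumption; not the pima data)
n <- 200
dat <- data.frame(pregnant = rpois(n, 3), glucose = rnorm(n, 120, 30))
dat$test <- rbinom(n, 1, plogis(-5 + 0.1 * dat$pregnant + 0.03 * dat$glucose))

# Stand-in for the list of imputed data sets: five bootstrap resamples
imputed <- lapply(seq_len(5), function(i) dat[sample(n, replace = TRUE), ])

# Fit the same logistic regression on each data set
fits <- lapply(imputed, function(d) {
  glm(test ~ pregnant + glucose, family = binomial, data = d)
})

# Rubin's rules point estimate: the mean of the per-data-set coefficients
pooled_beta <- rowMeans(sapply(fits, coef))

# Counterfactual data: every observation set to pregnant = 3
cf <- transform(dat, pregnant = 3)
X <- model.matrix(~ pregnant + glucose, data = cf)

# Predicted probabilities P(Y = 1) from the pooled coefficients
p_hat <- plogis(drop(X %*% pooled_beta))
print(mean(p_hat))
```

An alternative is "predict then combine": compute the predictions within each imputed data set and average those instead, which is closer to what the marginaleffects code above does.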
Is it possible to calculate prediction intervals from a tidymodels stacked model?
Working through the example from the stacks package here yields the stacked frog model (which can be downloaded here for reprex) and the testing data:
library(tidymodels)
library(stacks)
data("tree_frogs")
tree_frogs <- tree_frogs %>%
filter(!is.na(latency)) %>%
select(-c(clutch, hatched))
set.seed(1)
tree_frogs_split <- initial_split(tree_frogs)
tree_frogs_train <- training(tree_frogs_split)
tree_frogs_test <- testing(tree_frogs_split)
I tried to run something like this:
pi <- predict(tree_frogs_model_st, tree_frogs_test, type = "pred_int")
but this gives an error:
Error in UseMethod("stack_predict") : no applicable method for 'stack_predict' applied to an object of class "NULL"
Reading the documentation of stacks() I also tried passing "pred_int" in the opts list:
pi <- predict(tree_frogs_model_st, tree_frogs_test, opts = list(type = "pred_int"))
but this just gives: opts is only used with type = raw and was ignored.
For reference, I am trying to do something similar to what is done in Ch. 19 of the Tidy Modeling with R book:
lm_fit <- fit(lm_wflow, data = Chicago_train)
predict(lm_fit, Chicago_test, type = "pred_int")
which seems to work fine for a single model fit like lm_fit, but apparently not for a stacked model?
Am I missing something? Is it not possible to calculate prediction intervals for stacked models for some reason?
This is very difficult to do.
Even if glmnet produced a prediction interval, it would be a significant underestimate since it doesn’t know anything about the error in each of the ensemble members.
We would have to get the standard error of prediction from all of the models to compute it for the stacking model. A lot of these models don’t/can’t generate that standard error.
The alternative is to use bootstrapping to get the interval, but you would have to bootstrap each model a large number of times to get the overall prediction interval.
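The bootstrap idea can be sketched in base R. This is a rough illustration, not something stacks provides: fit_ensemble is a hypothetical stand-in that fits a single lm where a real workflow would refit every ensemble member and the stacking model, the data are simulated, and the interval is a simple percentile interval with a resampled residual added to each prediction.

```r
set.seed(42)

# Simulated training data and new points to predict (assumptions)
n <- 200
train <- data.frame(x = runif(n, 0, 10))
train$y <- 2 + 0.5 * train$x + rnorm(n)
new_data <- data.frame(x = c(2, 5, 8))

# Hypothetical stand-in for refitting the whole ensemble on a resample
fit_ensemble <- function(d) lm(y ~ x, data = d)

B <- 500  # number of bootstrap refits
boot_preds <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  fit <- fit_ensemble(train[idx, ])
  # Add a resampled residual so the spread reflects a new observation,
  # not only the uncertainty in the fitted mean
  predict(fit, new_data) + sample(residuals(fit), nrow(new_data), replace = TRUE)
})

# 90% percentile interval for each new point (rows = points)
pred_int <- t(apply(boot_preds, 1, quantile, probs = c(0.05, 0.95)))
print(pred_int)
```

Even in this toy case the model is refit 500 times, which is why the approach gets expensive once each refit means tuning and stacking several models.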
I am using the function plot_model from the sjPlot package to generate a confidence interval plot for a fixed effect linear model (felm). For the individual base model, I encountered no issues, and generated the confidence interval plot effectively. Now, however, I am attempting to do this for multiple similarly-constructed models, but cannot make it work.
An individual model's code is as simple as the following:
plot <- plot_model(model1, show.values = TRUE)
Using the lapply function I generated the multiple models, which are now stored in a list. However, I cannot find a way to pass this list (or its individual models) to the plot_model (or plot_models) function. I have received various errors, including this one:
Warning: Could not access model information.
Error in if (fam.info$is_linear) transform <- NULL else transform <- "exp" : argument is of length zero
Is there a way to place multiple similar models from a list into the plot_model function, so that the resulting confidence interval plots can be readily compared?
Update: with sample code and error:
plots <- plot_models(modellist[[1]], modellist[[2]], show.values = TRUE)
Warning: Could not access model information.
Error: Sorry, `model_parameters()` failed with the following error (possible class 'numeric' not supported): $ operator is invalid for atomic vectors
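One way to sidestep the plotting function while debugging is to collect the estimates and confidence intervals from the list yourself. The sketch below is only an illustration under assumptions: it uses lm fits on the built-in mtcars data as stand-ins for the felm models, and it relies on coef() and confint() working for the model class, which I have not verified for felm objects.

```r
# Stand-in model list (assumption): two lm fits built with lapply
modellist <- lapply(c("wt", "hp"), function(v) {
  lm(reformulate(c(v, "cyl"), response = "mpg"), data = mtcars)
})
names(modellist) <- c("model_wt", "model_hp")

# One row per model and term: estimate plus 95% confidence limits
ci_table <- do.call(rbind, lapply(names(modellist), function(nm) {
  m <- modellist[[nm]]
  ci <- confint(m)
  data.frame(
    model = nm,
    term = names(coef(m)),
    estimate = unname(coef(m)),
    lower = ci[, 1],
    upper = ci[, 2],
    row.names = NULL
  )
}))
print(ci_table)
```

A table in this shape can be passed to any plotting tool to draw the intervals side by side, grouped by model.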
I'm doing some regression using the geepack package and want to use multiple imputation to deal with missing values. The pool() command in mi doesn't work for my GEE, so I have to export (is that right?) so that I can use the data in geepack.
The complete() function produces each iteration, but not the pooled estimates.
Is there a way to produce a data frame with the pooled estimates?
The complete function in the mi package produces a list of m data.frames. You can call gee on each element of that list for the data argument and then use Rubin's rules to obtain pooled estimates.
There are a couple of packages that implement Rubin's rules in R (e.g., mi, mice, mitools, and mitml). The problem is that these implementations require that the functions for fitting statistical models have working methods for coef() and vcov() defined.
The geeglm() function, however, does not define vcov(), and standard implementations will not work. To remedy that situation, it is easiest to just define the missing method for the GEE. Below is an example using the mitml package and one of the example data sets provided with geepack.
library(geepack)
library(mitml)
# example data
data(dietox)
# example imputation
fml <- Feed + Weight ~ 1 + Time + (1|Pig)
imp <- panImpute(data=dietox, formula=fml, n.burn=5000, n.iter=500)
implist <- mitmlComplete(imp, "all")
# fit GEE
fit <- with(implist, geeglm(Weight ~ 1 + Time + Feed, id=Pig))
# define missing vcov() function for geeglm-objects
vcov.geeglm <- function(object, ...) summary(object)$cov.scaled
# combine estimates using Rubin's rules
testEstimates(fit)
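For reference, Rubin's rules for a single coefficient can be written out in a few lines of base R. This is a minimal sketch, not what mitml does internally: rubin_pool is a hypothetical helper, and the estimates and standard errors passed to it are made-up numbers. The pooled estimate is the mean of the per-imputation estimates, and the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```r
# Hypothetical helper: pool one coefficient across m imputed data sets
rubin_pool <- function(est, se) {
  m <- length(est)
  q_bar <- mean(est)             # pooled point estimate
  w <- mean(se^2)                # within-imputation variance
  b <- var(est)                  # between-imputation variance
  t_var <- w + (1 + 1 / m) * b   # total variance
  c(estimate = q_bar, se = sqrt(t_var))
}

# Made-up estimates and standard errors from three imputed data sets
res <- rubin_pool(est = c(1.0, 1.2, 0.8), se = c(0.1, 0.1, 0.1))
print(res)
```

The pooled standard error is larger than any single-imputation standard error because the between-imputation spread is added to it.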
I was wondering if it is possible to predict with the plm function from the plm package in R for a new dataset of predictor variables. I have created a model object using:
model <- plm(formula, data, index, model = 'pooling')
Now I'm hoping to predict a dependent variable from a new dataset which has not been used in the estimation of the model. I can do it through using the coefficients from the model object like this:
col_idx <- c(...)
df <- cbind(rep(1, nrow(df)), df[(1:ncol(df))[-col_idx]])
fitted_values <- as.matrix(df) %*% as.matrix(model$coefficients)
Here I first define in col_idx the index columns used in the model and the columns dropped due to collinearity, and then construct a data matrix to be multiplied by the model's coefficients. However, dropping columns manually makes errors much more likely.
A function designed to do this would make the code a lot more readable, I guess. I have also found the pmodel.response() function, but I can only get it to work for the dataset that was used to estimate the model object.
Any help would be appreciated!
I wrote a function (predict.out.plm) to do out of sample predictions after estimating First Differences or Fixed Effects models with plm.
The function is posted here:
https://stackoverflow.com/a/44185441/2409896
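For the pooling case specifically, the manual column bookkeeping from the question can be avoided by building the design matrix from the model's own terms and matching columns to coefficients by name. The sketch below is only an illustration: it uses lm on the built-in mtcars data as a stand-in for a pooling model (which is ordinary least squares), and it assumes the coefficient names match the design-matrix column names, which I have not checked against plm objects.

```r
# Stand-in for the pooling model (assumption): OLS fit on part of mtcars
fit <- lm(mpg ~ wt + hp, data = mtcars[1:20, ])
new_df <- mtcars[21:32, ]

# Design matrix for the new data, built from the model's own terms
X_new <- model.matrix(delete.response(terms(fit)), data = new_df)

# Keep only estimated coefficients and match columns by name, so any
# column dropped for collinearity is skipped automatically
beta <- coef(fit)
keep <- !is.na(beta)
fitted_values <- drop(X_new[, names(beta)[keep], drop = FALSE] %*% beta[keep])

# Check against the built-in prediction for this stand-in model
print(all.equal(unname(fitted_values), unname(predict(fit, new_df))))
```

Matching by name rather than by position is the design choice that removes the need for col_idx.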