How to use missForest package in R for test data? - r

We can basically use missForest package for imputing missing values in R(for both categorical and numeric).But this approach requires a complete response variable for training the forest. So,how to impute missing values in the test data set using this missForest package ,because we do not have any response variable in the test data set?

You can just use missForest. No need for the response variable. See code below.
library(missForest)
# remove response variable
my_iris <- iris[, -5]
## Artificially produce missing values using the 'prodNA' function:
set.seed(81)
iris.mis <- prodNA(my_iris, noNA = 0.2)
#impute
iris.imp <- missForest(iris.mis, verbose = TRUE)
#out of bag error
iris.imp$OOBerror
# not available if there is no response variable
iris.imp$error
# Imputed matrix
iris.imp$ximp

Related

Fastshap summary plot - Error: can't combine <double> and <factor<919a3>>

I'm trying to get a summary plot using fastshap explain function as in the code below.
p_function_G<- function(object, newdata)
caret::predict.train(object,
newdata =
newdata,
type = "prob")[,"AntiSocial"] # select G class
# Calculate the Shapley values
#
# boostFit: is a caret model using catboost algorithm
# trainset: is the dataset used for bulding the caret model.
# The dataset contains 4 categories W,G,R,GM
# corresponding to 4 diferent animal behaviors
library(caret)
shap_values_G <- fastshap::explain(xgb_fit,
X = game_train,
pred_wrapper =
p_function_G,
nsim = 50,
newdata= game_train[which(game_test=="AntiSocial"),])
)
However I'm getting error
Error in 'stop_vctrs()':
can't combine latitude and gender <factor<919a3>>
What's the way out?
I see that you are adapting code from Julia Silge's Predict ratings for board games Tutorial. The original code used SHAPforxgboost for generating SHAP values, but you're using the fastshap package.
Because Shapley explanations are only recently starting to gain traction, there aren't very many standard data formats. fastshap does not like tidyverse tibbles, it only takes matrices or matrix-likes.
The error occurs because, by default, fastshap attempts to convert the tibble to a matrix. But this fails, because matrices can only have one type (f.x. either double or factor, not both).
I also ran into a similar issue and found that you can solve this by passing the X parameter as a data.frame. I don't have access to your full code but you could you try replacing the shap_values_G code-block as so:
shap_values_G <- fastshap::explain(xgb_fit,
X = game_train,
pred_wrapper =
p_function_G,
nsim = 50,
newdata= as.data.frame(game_train[which(game_test=="AntiSocial"),]))
)
Wrap newdata with as.data.frame. This converts the tibble to a dataframe and so shouldn't upset fastshap.

Error in data frame undefined columns after imputing in R

I'm working with imputation with some data in R. I found a code online to perform imputation and then modeling the imputed data and the original data. The code is this:
# Using airquality dataset
data <- airquality
data[4:10,3] <- rep(NA,7)
data[1:5,4] <- NA
# Removing categorical variables
data <- airquality[-c(5,6)]
summary(data)
# Impute missing data using mice
library(mice)
tempData <- mice(data,m=5,maxit=50,meth='pmm',seed=500)
summary(tempData)
# Get completed datasets (observed and imputed)
completedData <- complete(tempData,1)
summary(completedData)
# Plots
# Density plot original vs imputed dataset
densityplot(tempData)
This is my syntax:
library(readr)
input_preg<- read_csv("datasurvey.csv")
summary(input_preg)
imput<- input_preg
#Imputation
library(mice)
temporal <- mice(imput,m=5,maxit=50,meth='pmm',seed=500)
#example imputed
temporal$imp$`52bcalif`
#I selected a dataset for imputation
completos<-complete(temporal,1)
#Ploting
densityplot(temporal)
So i'm doing almost exactly what the code indicates and when I'm doing the densityplot it doesnt work stating:
Error in `[.data.frame`(r, , xvar) : undefined columns selected
But with the original code, it has no problems to do the densityplot. So I dont know if it is because of the large number of imputations or that original data had 4 variables and I have 29.
Change the name of that column,temporal$imp$52bcalif, I think the mistake is there. You used a number. I tested myself.

compare R packages missForest and Hmisc performance

I am trying to compare the performance of 2 R packages, missForest and Hmisc performace in dealing with missing value, when there are more than 50% missing values.
I got testing data in this way:
data("iris")
library(missForest)
iris.mis <- prodNA(iris, noNA = 0.6)
summary(iris.mis)
mis1 <- iris.mis
mis2 <- iris.mis
In missForest, it has mixError() method which allows you to compare the imputation accuracy with the original data.
# using missForest
missForest_imputed <- missForest(mis1, ntree = 100)
missForest_error <- mixError(missForest_imputed$ximp, mis1, iris)
dim(missForest_imputed$ximp)
missForest_error
Hmisc does not have mixError() method, I am using its powerful aregImpute() to do the imputation, like this:
# using Hmisc
library(Hmisc)
hmisc_imputed <- aregImpute(~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width + Species,
data = mis2, n.impute = 1)
I was hoping to convert the imputed results into a format like missForest_imputed$ximp, so that I can use mixError() method. The problem is, in aregImpute(), no matter I tried n.impute = 1 or n.impute = 5, I cannot have 150 values for each feature like the original data iris... And the number of values in each feature is also different....
So, is there any way to compare the performance of missForest and Hmisc in dealing with missing values?
Part 1
Hmisc::aregImpute returns the imputed values. For your object named hmisc_imputed, they can be found in hmisc_imputed$imputed. However, the imputed object is a list for each dimension.
If you wish to recreate the equivalent of missForest_imputed$ximp, you have to do it yourself. To do so, we can use the fact that:
all.equal(as.integer(attr(xx$Sepal.Length, "dimnames")[[1]]), which(is.na(iris.mis$Sepal.Length))) ## returns true
Which I do here:
check_missing <- function(x, hmisc) {
return(all.equal(which(is.na(x)), as.integer(attr(hmisc, "dimnames")[[1]])))
}
get_level_text <- function(val, lvls) {
return(lvls[val])
}
convert <- function(miss_dat, hmisc) {
m_p <- ncol(miss_dat)
h_p <- length(hmisc)
if (m_p != h_p) stop("miss_dat and hmisc must have the same number of variables")
# assume matches for all if 1 matches
if (!check_missing(miss_dat[[1]], hmisc[[1]]))
stop("missing data an imputed data do not match")
for (i in 1:m_p) {
i_factor <- is.factor(miss_dat[[i]])
if (!i_factor) {miss_dat[[i]][which(is.na(miss_dat[[i]]))] <- hmisc[[i]]}
else {
levels_i <- levels(miss_dat[[i]])
miss_dat[[i]] <- as.character(miss_dat[[i]])
miss_dat[[i]][which(is.na(miss_dat[[i]]))] <- sapply(hmisc[[i]], get_level_text, lvls= levels_i)
miss_dat[[i]] <- factor(miss_dat[[i]])
}
}
return(miss_dat)
}
iris.mis2 <- convert(iris.mis, hmisc_imputed$imputed)
Part 2
mixError uses RMSE to calculate error-rates, ?mixError:
Value imputation error. In case of continuous variables only this is the normalized root mean squared error (NRMSE, see 'help(missForest)' for further details). In case of categorical variables onlty this is the proportion of falsely classified entries (PFC). In case of mixed-type variables both error measures are supplied.
To do this on your object from "Part 1" [iris.mis2], you just need to use the nrmse function, which is provided in library(missForest).

Perform operation on each imputed dataset in R's MICE

How can I perform an operation (like subsetting or adding a calculated column) on each imputed dataset in an object of class mids from R's package mice? I would like the result to still be a mids object.
Edit: Example
library(mice)
data(nhanes)
# create imputed datasets
imput = mice(nhanes)
The imputed datasets are stored as a list of lists
imput$imp
where there are rows only for the observations with imputation for the given variable.
The original (incomplete) dataset is stored here:
imput$data
For example, how would I create a new variable calculated as chl/2 in each of the imputed datasets, yielding a new mids object?
This can be done easily as follows -
Use complete() to convert a mids object to a long-format data.frame:
long1 <- complete(midsobj1, action='long', include=TRUE)
Perform whatever manipulations needed:
long1$new.var <- long1$chl/2
long2 <- subset(long1, age >= 5)
use as.mids() to convert back manipulated data to mids object:
midsobj2 <- as.mids(long2)
Now you can use midsobj2 as required. Note that the include=TRUE (used to include the original data with missing values) is needed for as.mids() to compress the long-formatted data properly. Note that prior to mice v2.25 there was a bug in the as.mids() function (see this post https://stats.stackexchange.com/a/158327/69413)
EDIT: According to this answer https://stackoverflow.com/a/34859264/4269699 (from what is essentially a duplicate question) you can also edit the mids object directly by accessing $data and $imp. So for example
midsobj2<-midsobj1
midsobj2$data$new.var <- midsobj2$data$chl/2
midsobj2$imp$new.var <- midsobj2$imp$chl/2
You will run into trouble though if you want to subset $imp or if you want to use $call, so I wouldn't recommend this solution in general.
Another option is to calculate the variables before the imputation and place restrictions on them.
library(mice)
# Create the additional variable - this will have missing
nhanes$extra <- nhanes$chl / 2
# Change the method of imputation for extra, so that it always equals chl/2
# Change the predictor matrix so only chl predicts extra
ini <- mice(nhanes, max = 0, print = FALSE)
meth <- ini$meth
meth["extra"] <- "~I(chl / 2)"
pred <- ini$pred # extra isn't used to predict
pred["extra", "chl"] <- 1
# Imputations
imput <- mice(nhanes, seed = 1, pred = pred, meth = meth, print = FALSE)
There are examples in mice: Multivariate Imputation by Chained Equations in R.
There is an overload of with that can help you here
with(imput, chl/2)
the documentation is given at ?with.mids
There's a function for this in the basecamb package:
library(basecamb)
apply_function_to_imputed_data(mids_object, function)

Predict function from Caret package give an Error

I am doing just a regular logistic regression using the caret package in R. I have a binomial response variable coded 1 or 0 that is called a SALES_FLAG and 140 numeric response variables that I used dummyVars function in R to transform to dummy variables.
data <- dummyVars(~., data = data_2, fullRank=TRUE,sep="_",levelsOnly = FALSE )
dummies<-(predict(data, data_2))
model_data<- as.data.frame(dummies)
This gives me a data frame to work with. All of the variables are numeric. Next I split into training and testing:
trainIndex <- createDataPartition(model_data$SALE_FLAG, p = .80,list = FALSE)
train <- model_data[ trainIndex,]
test <- model_data[-trainIndex,]
Time to train my model using the train function:
model <- train(SALE_FLAG~. data=train,method = "glm")
Everything runs nice and I get a model. But when I run the predict function it does not give me what I need:
predict(model, newdata =test,type="prob")
and I get an ERROR:
Error in dimnames(out)[[2]] <- modelFit$obsLevels :
length of 'dimnames' [2] not equal to array extent
On the other hand when I replace "prob" with "raw" for type inside of the predict function I get prediction but I need probabilities so I can code them into binary variable given my threshold.
Not sure why this happens. I did the same thing without using the caret package and it worked how it should:
model2 <- glm(SALE_FLAG ~ ., family = binomial(logit), data = train)
predict(model2, newdata =test, type="response")
I spend some time looking at this but not sure what is going on and it seems very weird to me. I have tried many variations of the train function meaning I didn't use the formula and used X and Y. I used method = 'bayesglm' as well to check and id gave me the same error. I hope someone can help me out. I don't need to use it since the train function to get what I need but caret package is a good package with lots of tools and I would like to be able to figure this out.
Show us str(train) and str(test). I suspect the outcome variable is numeric, which makes train think that you are doing regression. That should also be apparent from printing model. Make it a factor if you want to do classification.
Max

Resources