Imputation of missing data: mice gives erratic results in R - r

I am running some simple code that uses the mice function from the mice package to impute missing data.
The code runs without a problem on the airquality dataset (base R), but when I run the same code on another base R dataset, mtcars, I get an error ("undefined columns selected").
The code as text is the following:
library(dplyr)
library(mice)
data = airquality
data[4:10,3] = rep(NA,7)
data[1:5,4] = NA
summary(data)
tempData = mice(data,m=5,maxit=50,meth='pmm',seed=500)
data(mtcars)
mtcars[mtcars$am == 1, "am"] = NA
data1 = mtcars[, c(2:11)]
summary(data1)
tempData = mice(data1,m=5,maxit=50,meth='pmm',seed=144)
I am confused. Why does the same code work in the former case but not in the latter?
Your advice will be appreciated.
Edit
Indeed, after I installed the latest version of mice from CRAN, the code ran without a problem.

Related

Error "t.haven_labelled()` not supported" when trying to substitute NA with mice package

Total R noob here, trying to figure out how to use the mice package to account for NAs in my dataset.
This is my code so far (I left out the unimportant parts, such as trimming the data set down to the relevant variables, recoding, etc.):
install.packages("haven")
install.packages("survey")
library(haven)
library(data.table)
library(survey)
library(car)
dat <- read_dta("ZA5270_v2-0-0.dta")
dat_wght <- svydesign(ids= ~1, data=dat, weights =~wghtpew)
install.packages("mice")
library(mice)
dat_wght[["variables"]]$sex = as.factor(dat_wght[["variables"]]$sex)
dat_imp <- mice(dat_wght[["variables"]], m=5, maxit=10)
The error message I get is:
iter imp variable
1 1 px03Error in `t()`:
! `t.haven_labelled()` not supported.
I already did some research, and apparently it has to do with value labels, since columns read in by the haven package carry the haven_labelled class. I already tried to remove all value labels with sapply(dat_wght[["variables"]], haven::zap_labels), but the error still occurs (the same happens when I try remove_val_labels()). Does anyone know how to solve this problem?
I'm really grateful for every single piece of advice :) Thanks in advance!
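One thing that may be worth checking here: zap_labels() returns a modified copy, so calling it inside sapply() without assigning the result leaves the original columns labelled. The sketch below uses a small made-up data frame (toy and its columns are hypothetical, not taken from the .dta file) to show the labels being stripped and the result being assigned back. Whether this alone removes the mice error in the original data is an assumption.

```r
library(haven)

# Hypothetical toy data standing in for the imported .dta file
toy <- data.frame(id = 1:4)
toy$sex <- labelled(c(1, 2, NA, 1), labels = c(male = 1, female = 2))

# zap_labels() has a data frame method and returns a copy without labels;
# the result must be assigned, otherwise nothing changes
toy_plain <- zap_labels(toy)

# Convert to a factor only after the haven_labelled class is gone
toy_plain$sex <- as.factor(toy_plain$sex)

class(toy_plain$sex)
```

With the survey design from the question, the same idea would be to strip the labels from dat before calling svydesign(), or to assign the zapped data frame back to dat_wght[["variables"]].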

R difference between class and DMwR package knn functions?

I was working on a project in R and ran into an issue when fitting a KNN model to some data: I get different results from knn in the class package and kNN in the DMwR package. I tried using the Weekly data from the ISLR package and saw the same kind of discrepancy. The confusion matrices for the two fits differ significantly, as does a straight comparison between the predictions.
I am not sure why these two functions are returning different results. Maybe someone can review my sample code and let me know what is going on.
library(ISLR)
WTrain <- subset(Weekly, Year <= 2008)
WTest <- subset(Weekly, Year >= 2009)
library(caret)
library(class)
fitClass <- knn(train = data.matrix(WTrain$Lag2), test = data.matrix(WTest$Lag2), cl=WTrain$Direction, k=5)
confusionMatrix(data = fitClass, reference = WTest$Direction)
library(DMwR)
fitDMwR <- kNN(Direction~Lag2,train = WTrain, test = WTest, norm=FALSE, k=5)
confusionMatrix(table(fitDMwR == 'Down', WTest$Direction =='Down'))
results <- cbind(fitClass,fitDMwR)
head(results)

Correct use of R naive_bayes() and predict()

I am trying to run a simple naive Bayes model (trying to redo what I have seen in the DataCamp course).
I am using the R naivebayes package.
The training dataset is where9am which looks like this:
My first problem is the following... when I have several predictions in a dataframe thursday9am...
... and I use the following code:
locmodel <- naive_bayes(location ~ daytype, data = where9am)
my_pred <- predict(locmodel, thursday9am)
I get a series of <NA> while it works well with the correct prediction if the thursday9am dataframe only contains a single observation.
The second problem is the following: when I use the following code to get the associated probabilities...
locmodel <- naive_bayes(location ~ daytype, data = where9am, type = c("class", "prob"))
predict(locmodel, thursday9am , type = "prob")
... even if I have only one observation in thursday9am, I get a series of <NaN>.
I am not sure what I am doing wrong.

Error in data frame undefined columns after imputing in R

I'm working on imputation with some data in R. I found some code online that performs imputation and then models both the imputed data and the original data. The code is this:
# Using airquality dataset
data <- airquality
data[4:10,3] <- rep(NA,7)
data[1:5,4] <- NA
# Removing categorical variables
data <- data[-c(5,6)]
summary(data)
# Impute missing data using mice
library(mice)
tempData <- mice(data,m=5,maxit=50,meth='pmm',seed=500)
summary(tempData)
# Get completed datasets (observed and imputed)
completedData <- complete(tempData,1)
summary(completedData)
# Plots
# Density plot original vs imputed dataset
densityplot(tempData)
This is my syntax:
library(readr)
input_preg<- read_csv("datasurvey.csv")
summary(input_preg)
imput<- input_preg
#Imputation
library(mice)
temporal <- mice(imput,m=5,maxit=50,meth='pmm',seed=500)
#example imputed
temporal$imp$`52bcalif`
#I selected a dataset for imputation
completos<-complete(temporal,1)
#Ploting
densityplot(temporal)
So I'm doing almost exactly what the original code does, but when I call densityplot it doesn't work and reports:
Error in `[.data.frame`(r, , xvar) : undefined columns selected
With the original code, densityplot runs without problems. So I don't know whether it is because of the large number of imputations or because the original data had 4 variables and I have 29.
Change the name of that column, temporal$imp$`52bcalif`. I think the mistake is there: the name starts with a number, so it is not a syntactically valid R name. I tested it myself.
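As a minimal sketch of that renaming step, base R's make.names() can turn every column name into a syntactically valid one before the data goes into mice(). The data frame below is a made-up stand-in for datasurvey.csv; only the column name 52bcalif comes from the question, and edad is a hypothetical second column.

```r
# Hypothetical stand-in for the data read from datasurvey.csv
imput <- data.frame(`52bcalif` = c(1, NA, 3),
                    edad = c(20, 31, NA),
                    check.names = FALSE)

# make.names() prefixes names that start with a digit with "X"
names(imput) <- make.names(names(imput), unique = TRUE)

names(imput)
```

After this, the column would be accessed as temporal$imp$X52bcalif rather than with backticks.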

svytable and svychisq not recognizing homemade function variables, running RStudio Version 0.99.467 running R 3.2.3 Windows 64 bit

I am trying to build a function in R that will let me generate weighted tables of named variables in a data frame using the R survey package by Thomas Lumley. I think I am running into errors because the svytable and svychisq functions in the survey package do not recognize the arguments of my homemade function as the names of the variables in my data frame that I want analyzed.
I had tried to utilize the example given here to develop a solution:
lapply with anonymous function call to svytable results in object 'x' not found
I also took a look at this post when trying to troubleshoot svychisq: Error using dynamic variable specification in R survey function svychisq()
None of the suggestions from these posts have worked for me to date. I would very much appreciate it if people could look through my code and suggest what I could try in order to get a function that prints (1) a weighted table of totals and (2) the results of running a chi-square analysis on that weighted table.
Here is my data/code to date:
#data to make the code reproducible
independent <- c(2,1,1,1,1,1,2,2,1,1,2,2,2,2,1,1,1,2,2,2,2,2,2,1,1,1,2,1,2,1,1,1,2,2,1,1,2,1,1,2,1,2,1,1,1,1,1,1,1,2,1,2,1,2,2,2,2,1,1,1,1,2,2,1,1,2,2,1,1,1,1,2,1,2,1,1,1,1,2,1,2,1,2,2,1,1,1,2,2,1,2,2,2,1,2,1,2,2,2,2,2,1,1,2,2,1,2,1,2,2,2,1,1,2,1,1,2,1,1,1,2,1,2,1,1,2,1,2,2,2,1,1,1,1,1,1,2,2,1,1,1,2,2,1,1,2,2,2,2,1,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,2,2,1,2,2,1,1,2,2,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,1,2,2,1,2,1,1,2,1)
dependent <- c(1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,1,2,1,1,1,2,1,98,1,1,1,1,1,1,2,1,1,1,1,1,1,2,1,1,1,1,1,2,2,1,1,1,1,3,1,1,2,1,2,1,2,1,1,1,1,1,1,2,1,1,1,1,1,2,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,4,1,1,1,5,1,2,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,2,1,1,5,1,2,1,1,1,2,1,1,1,1,2,1,1,1,2,1,1,1,2,1,4,3,1,3,1,1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,2,1,1,2,1,2,2,1,2,1)
weight <- c(2937,9389,7918,3851,8560,7433,6577,4507,5601,1358,7274,4507,5160,2429,618,4938,3734,5177,3683,3988,3534,2364,1979,5879,5131,6119,4321,4425,1302,3563,5133,10947,2183,3988,6607,5637,4507,5598,3086,6341,2553,6040,15050,809,5056,5709,2557,3734,1136,2596,7029,4507,2099,1296,4103,4030,4944,3940,4358,5598,1794,4636,6130,6119,431,4944,8361,5732,2602,6117,3719,2727,1406,6023,15050,3870,6808,3957,4507,3827,3658,7600,4244,4507,1850,4358,3770,4103,4927,4801,4012,2364,5547,4381,4103,4444,1427,5575,4656,3708,2865,4447,5709,477,2617,6232,3160,2745,3103,1604,2925,3950,1679,1341,7420,7600,1655,6307,4041,3882,5058,6412,4332,3415,7913,4330,5604,4027,305,4933,4635,1224,4464,8988,4664,1143,882,4061,6142,5763,9811,882,580,780,6973,1752,4263,3809,3440,4447,4740,443,3438,4076,3948,3438,2429,3582,4397,4283,4399,2028,758,15944,3593,4430,4159,4292,5976,3619,18834,2301,4485,942,5861,4284,2102,2102,8503,942,4747,8503,4358,4348,1794,4783,1899,373,5811,2183,5575,4702,4466,4653,3216,4553,7523,2606,2396,3216)
psu <- c(108,167,224,185,167,151,294,187,274,111,161,187,286,179,228,145,108,248,240,267,253,109,179,240,228,172,278,165,129,153,271,161,141,267,243,278,187,168,108,250,179,194,210,278,114,240,189,108,287,189,154,187,141,251,108,108,187,142,145,168,143,253,172,172,178,187,165,187,267,172,177,165,167,186,210,107,168,108,187,167,186,273,187,187,179,145,174,108,115,115,215,109,253,120,108,240,123,150,115,115,278,174,240,142,109,159,187,185,176,140,220,166,129,129,240,273,111,161,141,141,121,115,120,175,168,472,472,142,142,145,147,145,150,132,186,166,166,169,161,169,163,166,166,166,163,179,178,176,177,174,180,178,171,180,171,171,179,173,192,192,157,154,151,160,154,142,186,186,163,171,160,141,142,143,143,148,141,141,143,143,150,143,145,147,143,141,141,142,148,141,150,141,148,148,136,134,140,136,136,136)
#make the data frame for reproducible example
dat <- as.data.frame(cbind(independent,dependent,weight,psu))
dat$independent <- factor(dat$independent, levels=c("1", "2"))
dat$dependent <- factor(dat$dependent, levels = c("1", "2", "3", "4", "5", "98"))
str(dat)
#Function to print svytables + chi square statistic
library(survey) #opens the survey package in R
x <- independent
y <- dependent
cstbl <- function(x,y) {
dat_w <- svydesign(id=~psu, weights=~weight, data=dat)
tbl <- svytable(bquote(~.(as.name(y)+as.name(x))), dat_w) #should make table
chi <- svychisq(as.formula(paste("~", x, "+", y)), dat_w) #should give test stat
final <- list(tbl, chi)
return(final) #should return table + chi square test stat
}
cstbl(x,y)
This throws the error: "Error in as.name(y) + as.name(x) :
non-numeric argument to binary operator"
I get the same error if I replace the third line of the function with
tbl <- svytable(~y+x, dat_w)
Your guidance and patience are much appreciated, as I am new to programming.
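For what it is worth, the error comes from .(as.name(y)+as.name(x)): bquote evaluates everything inside .(), so R tries to add two symbols. Also, x <- independent stores the data vector, not the variable name. A minimal sketch of one possible fix is to pass the names as character strings and build the formula with reformulate(). The toy data below is randomly generated as a hypothetical stand-in for dat, and the extra data argument is my own addition rather than part of the original function.

```r
library(survey)

# Hypothetical toy data with the same column names as `dat`
set.seed(1)
toy <- data.frame(
  independent = factor(sample(1:2, 200, replace = TRUE)),
  dependent   = factor(sample(1:3, 200, replace = TRUE)),
  weight      = runif(200, 500, 5000),
  psu         = sample(1:40, 200, replace = TRUE)
)

# x and y are variable NAMES given as strings, not the vectors themselves
cstbl <- function(x, y, data) {
  design <- svydesign(id = ~psu, weights = ~weight, data = data)
  f <- reformulate(c(y, x))          # builds ~dependent + independent
  list(table = svytable(f, design),  # weighted table of totals
       chisq = svychisq(f, design))  # chi-square test on that table
}

res <- cstbl("independent", "dependent", toy)
res$table
res$chisq
```

as.formula(paste("~", y, "+", x)) builds the same formula; reformulate() just avoids the string pasting.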
