Error with svydesign using imputed data sets - r

I am analyzing an imputed dataset using svydesign but I am getting an error. Below is the code:
library(mitools)
library(survey)
data(nhanes)
nhanes$hyp <- as.factor(nhanes$hyp)
imp <- mice(nhanes,method=c("polyreg","pmm","logreg","pmm"), seed = 23109)
des<-svydesign(id=~1, strat=~age, data=imputationList(imp))
Error in as.data.frame.default(data, optional = TRUE) : cannot coerce class "call" to a data.frame
I am following the tutorial from this page:
http://r-survey.r-forge.r-project.org/survey/svymi.html
How do I modify the code for it to work?
EDIT:
I changed data=imputationList(imp) to data=complete(imp,1) and was able to make the code work. However, this is not efficient, since I have to do this for all my imputed sets. Is there something wrong with using imputationList?

mice() returns the imputation results as a single object, whereas imputationList() requires a list of all five data.frames with the imputed values filled in. You need to use mice::complete() to construct those five completed data.frame objects first:
library(mitools)
library(survey)
library(mice)
data(nhanes)
nhanes$hyp <- as.factor(nhanes$hyp)
imp <- mice(nhanes,method=c("polyreg","pmm","logreg","pmm"), seed = 23109)
imp_list <- lapply( 1:5 , function( n ) complete( imp , action = n ) )
des<-svydesign(id=~1, strat=~age, data=imputationList(imp_list))
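Once the design holds all completed data sets, the analysis can be run on each one and then combined. The following is a minimal sketch, assuming the mice, mitools and survey packages are installed. It uses the default imputation methods for brevity, and svymean(~bmi) is only a stand-in for whatever analysis is actually needed:

```r
library(mice)
library(mitools)
library(survey)

# Load the mice copy of nhanes explicitly: survey ships a different
# dataset with the same name.
data(nhanes, package = "mice")
nhanes$hyp <- as.factor(nhanes$hyp)

# Default imputation methods are used here (an assumption for brevity).
imp <- mice(nhanes, m = 5, seed = 23109, printFlag = FALSE)

# One completed data.frame per imputation, without hard-coding the count.
imp_list <- lapply(seq_len(imp$m), function(i) mice::complete(imp, action = i))

# svydesign() warns that no weights were supplied, as in the original code.
des <- svydesign(ids = ~1, strata = ~age, data = imputationList(imp_list))

# Run the analysis on every imputed design, then combine the results.
res <- with(des, svymean(~bmi))
summary(MIcombine(res))
```

Using seq_len(imp$m) instead of 1:5 keeps the code correct if the number of imputations changes.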

Related

Change class length in Knn function in R

I am trying to run the following code to run a Knn algorithm on a data set. My code is below:
# random number that is 90% of the total number of rows in dataset
ran <- sample(1:nrow(Knn_data), 0.9*nrow(Knn_data))
# the normalization function created
nor <- function(x) { (x-min(x))/(max(x)-min(x))}
#run normalization function on predictors
Knn_data_norm <- as.data.frame(lapply(Knn_data[,c(1,2,3,4,5,6,7)], nor))
summary(Knn_data_norm)
# extract training set
Knn_train <- Knn_data_norm[ran,]
# extract testing set
Knn_test <- Knn_data_norm[-ran,]
# extract 8th column of train dataset because it will be used as 'cl' argument in knn function
Knn_target_category <- Knn_data[ran,8]
# extract 8th column of test dataset to measure the accuracy
Knn_test_category <- Knn_data[-ran,8]
library(class)
#run knn function
pr <- knn(Knn_train, Knn_test, cl=Knn_target_category, k=3)
I keep getting the following error:
Error in knn(Knn_train, Knn_test, cl = Knn_target_category, k = 3) : 'train' and 'class' have different lengths
I am not sure how to change the code to correct this error.

How to use kNN function in R

I need to run kNN on my data set (I tried to use dput() to show the data set, but it doesn't come out in the format that summary() gives, so I am unsure how to share it).
I have used the following code:
library(caret)
library(class)
set.seed(123)
ind <- createDataPartition(user_col$Nscore, p=0.7,list=FALSE)
training_data <- user_col[1:942,,1 ]
testing_data <- user_col[943:1884,,1 ]
model <- knn(training_data, testing_data, training_data,k=1)
predictions <- as.factor(model)
confusionMatrix(predictions, testing_data[,5])
It stops running at model<- ..... with this error
Error in knn(training_data, testing_data, training_data, k = 1) : 'train' and 'class' have different lengths
I have looked in the environment and both training_data and testing_data are the same sizes so not sure where the error is.
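One likely reading of the error, offered as a sketch rather than a confirmed diagnosis: knn() expects cl to hold one class label per training row, but in the code above the third argument is the whole training_data data frame, and length() of a data frame is its number of columns. The example below uses the built-in iris data as a stand-in, since user_col is not available:

```r
library(class)  # recommended package that ships with R

set.seed(123)
idx <- sample(nrow(iris), 100)

train_x <- iris[idx, 1:4]       # predictors only
test_x  <- iris[-idx, 1:4]
train_y <- iris$Species[idx]    # one label per training row

# length() of a data frame counts columns, not rows, so passing a
# data frame as cl fails the length check inside knn().
length(train_x)
nrow(train_x)
length(train_y)

pred <- knn(train_x, test_x, cl = train_y, k = 3)
table(pred, iris$Species[-idx])
```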

Multiple Imputed datasets - pooling results

I have a dataset containing missing values. I have imputed this dataset, as follows:
library(mice)
id <- c(1,2,3,4,5,6,7,8,9,10)
group <- c(0,1,1,0,1,1,0,1,0,1)
measure_1 <- c(60,80,90,54,60,61,77,67,88,90)
measure_2 <- c(55,NA,88,55,70,62,78,66,65,92)
measure_3 <- c(58,88,85,56,68,62,89,62,70,99)
measure_4 <- c(64,80,78,92,NA,NA,87,65,67,96)
measure_5 <- c(64,85,80,65,74,69,90,65,70,99)
measure_6 <- c(70,NA,80,55,73,64,91,65,91,89)
dat <- data.frame(id, group, measure_1, measure_2, measure_3, measure_4, measure_5, measure_6)
dat$group <- as.factor(dat$group)
imp_anova <- mice(dat, maxit = 0)
meth <- imp_anova$method
pred <- imp_anova$predictorMatrix
imp_anova <- mice(dat, method = meth, predictorMatrix = pred, seed = 2018,
maxit = 10, m = 5)
This creates five imputed datasets. Then, I created the complete datasets (example dataset 1):
impute_1 <- mice::complete(imp_anova, 1) # complete set 1
And then I performed the desired analysis:
library(reshape)
library(reshape2)
datLong <- melt(impute_1, id = c("id", "group"), measure.vars = c("measure_1", "measure_2", "measure_3", "measure_4", "measure_5", "measure_6"))
colnames(datLong) <- c("ID", "Gender", "Time", "Value")
table(datLong$Time) # To check if correct
datLong$ID <- as.factor(datLong$ID)
library(ez)
model_mixed_1 <- ezANOVA(data = datLong,
dv = Value,
wid = ID,
within = Time,
between = Gender,
detailed = TRUE,
type = 3,
return_aov = TRUE)
I did this for all the five datasets, resulting in five models:
model_mixed_1
model_mixed_2
model_mixed_3
model_mixed_4
model_mixed_5
Now I want to combine the results of these models to generate a single result.
I have asked a similar question before, but there I focused on the models. Here I just want to ask how I can simply combine the five models. I hope someone can help me!
You have understood the basic multiple imputation process correctly. It goes like this:
1. First you create your m imputed datasets (the mice() function).
2. Then you run your analysis on each of these datasets (the with() function).
3. Finally you combine these results (the pool() function).
This process is often misunderstood: people frequently assume you have to merge the m imputed datasets into one dataset, which is wrong.
Now you have to follow these steps within the mice framework. So far you have only done this up to step 1.
Here is an excerpt from the mice help:
The pool() function combines the estimates from m repeated complete data analyses. The typical sequence of steps to do a multiple imputation analysis is:
Impute the missing data by the mice function, resulting in a multiple imputed data set (class mids);
Fit the model of interest (scientific model) on each imputed data set by the with() function, resulting an object of class mira;
Pool the estimates from each model into a single set of estimates and standard errors, resulting is an object of class mipo;
Optionally, compare pooled estimates from different scientific models by the pool.compare() function.
In code, this can look like this, for example:
library(mice)
imp <- mice(nhanes, maxit = 2, m = 5)
fit <- with(data = imp, expr = lm(bmi ~ age + hyp + chl))
summary(pool(fit))
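To make the pooling step less of a black box, here is a small base R sketch of Rubin's rules for a single coefficient. The estimates and standard errors are made-up numbers for illustration, not output from the models above, and the degrees-of-freedom adjustment that pool() also reports is left out:

```r
# Hypothetical results for one coefficient from m = 5 imputed datasets.
est <- c(2.10, 1.95, 2.30, 2.05, 2.20)  # point estimates
se  <- c(0.50, 0.48, 0.55, 0.52, 0.49)  # standard errors

m     <- length(est)
q_bar <- mean(est)             # pooled point estimate
w     <- mean(se^2)            # within-imputation variance
b     <- var(est)              # between-imputation variance
t_var <- w + (1 + 1 / m) * b   # total variance

pooled_se <- sqrt(t_var)
c(estimate = q_bar, se = pooled_se)
```

The pooled standard error is larger than the average within-imputation standard error whenever the estimates differ across imputations, which is how the uncertainty due to the missing data enters the final result.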

Error in panel regression in case of different independent variable r

I am trying to run a Fama-MacBeth regression with the following code:
require(foreign)
require(plm)
require(lmtest)
fpmg <- pmg(return~max_1,df_all_11, index=c("yearmonth","firms" ))
Fama<-fpmg
coeftest(Fama)
It works when I regress using the independent variable named 'max_1'. However, when I switch to another independent variable named 'ivol_1', I get an error. The code is:
require(foreign)
require(plm)
require(lmtest)
fpmg <- pmg(return~ivol_1,df_all_11, index=c("yearmonth","firms" ))
Fama<-fpmg
coeftest(Fama)
The error message is:
Error in pmg(return ~ ivol_1, df_all_11, index = c("yearmonth", "firms")) :
Insufficient number of time periods
or sometimes:
Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data, :
object is not a matrix
For your convenience, I am sharing my data with you. The data link is
data frame
I am wondering why this happens for a different variable in the same data frame. I would be grateful if you could help me solve this problem.
This problem can be solved with the mice() function:
library(mice)
library(dplyr)
require(foreign)
require(plm)
require(lmtest)
df_all_11<-read.csv("df_all_11.csv.part",sep = ",",header = TRUE,stringsAsFactors = FALSE)
x<-data.frame(ivol_1=df_all_11$ivol_1,month=df_all_11$Month)
imputed_Data <- mice(x, m=3, maxit =5, method = 'pmm', seed = 500)
completeData <- complete(imputed_Data, 3)
df_all_11<-mutate(df_all_11,ivol_1=completeData$ivol_1)
fpmg2 <- pmg(return~ivol_1,df_all_11, index=c("yearmonth","firms"))
coeftest(fpmg2)
The problem occurs because the variable ivol_1 has a lot of NA values, so you should impute the NAs first and then run the pmg function.
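Before imputing, it can help to confirm how many values are actually missing and how they are spread over the panel. The sketch below uses a small made-up data frame with the same column names as the question, since the real df_all_11 is not reproduced here:

```r
# Hypothetical panel standing in for df_all_11.
df <- data.frame(
  yearmonth = rep(c(201001, 201002, 201003), each = 3),
  firms     = rep(c("A", "B", "C"), times = 3),
  max_1     = c(0.1, 0.2, 0.3, 0.2, 0.1, 0.4, 0.3, 0.2, 0.1),
  ivol_1    = c(NA, 0.5, NA, NA, NA, 0.7, 0.6, NA, NA)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(df))
na_per_column

# Number of usable ivol_1 observations per firm and per month.
obs_per_firm  <- tapply(!is.na(df$ivol_1), df$firms, sum)
obs_per_month <- tapply(!is.na(df$ivol_1), df$yearmonth, sum)
obs_per_firm
obs_per_month
```

If some groups are left with very few usable observations, that would be consistent with the "Insufficient number of time periods" message.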

R: variable has different number of levels in the node and in the data

I want to use bnlearn for a classification task with Naive Bayes algorithm.
I use this data set for my tests, where 3 variables are continuous (V2, V4, V10) and the others are discrete. As far as I know, bnlearn cannot work with continuous variables, so they need to be converted to factors or discretized. For now I want to convert all the features into factors. However, I ran into some problems. Here is some sample code:
dataSet <- read.csv("creditcard_german.csv", header=FALSE)
# ... split into trainSet and testSet ...
trainSet[] <- lapply(trainSet, as.factor)
testSet[] <- lapply(testSet, as.factor)
# V25 is the class variable
bn = naive.bayes(trainSet, training = "V25")
fitted = bn.fit(bn, trainSet, method = "bayes")
pred = predict(fitted , testSet)
...
For this code I get an error message when calling predict():
'V1' has different number of levels in the node and in the data.
And when I remove V1 from the training set, I get the same error for the V2 variable. However, the error disappears when I factorize first with dataSet[] <- lapply(dataSet, as.factor) and only then split the data into training and test sets.
So what is an elegant solution for this? In real-world applications the test and training sets can come from different sources. Any ideas?
The issue appears to be caused by the fact that my train and test datasets had different factor levels. I solved this issue by using the rbind command to combine the two different dataframes (train and test), applying as.factor to get the full set of factors for the complete dataset, and then slicing the factorized dataframe back into separate train and test datasets.
train <- read.csv("train.csv", header=FALSE)
test <- read.csv("test.csv", header=FALSE)
len_train = dim(train)[1]
len_test = dim(test)[1]
complete <- rbind(train, test)
complete[] <- lapply(complete, as.factor)
train = complete[1:len_train, ]
l = len_train+1
lf = len_train + len_test
test = complete[l:lf, ]
bn = naive.bayes(train, training = "V25")
fitted = bn.fit(bn, train, method = "bayes")
pred = predict(fitted , test)
I hope this can be helpful.
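When the two sets cannot be combined first, another option is to reuse the training levels when converting the test set. This is a minimal base R sketch with made-up data, not part of the original answer. It assumes both data frames share the same column names, and any test value that never occurred in training becomes NA:

```r
# Hypothetical train and test sets with the same columns.
train <- data.frame(V1 = c("a", "b", "c", "a"),
                    V25 = c("1", "2", "1", "2"),
                    stringsAsFactors = FALSE)
test  <- data.frame(V1 = c("a", "b"),
                    V25 = c("1", "2"),
                    stringsAsFactors = FALSE)

train[] <- lapply(train, as.factor)

# Convert each test column using the levels of the matching train column.
test[] <- Map(function(col, ref) factor(col, levels = levels(ref)),
              test, train[names(test)])

# Both sets now carry identical level sets, column by column.
levels(train$V1)
levels(test$V1)
```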
