Error in data frame undefined columns after imputing in R - r

I'm imputing missing data in R. I found some code online that performs the imputation and then models both the imputed data and the original data. The code is this:
# Using airquality dataset
data <- airquality
data[4:10,3] <- rep(NA,7)
data[1:5,4] <- NA
# Removing categorical variables (Month, Day) from the modified copy
data <- data[-c(5,6)]
summary(data)
# Impute missing data using mice
library(mice)
tempData <- mice(data,m=5,maxit=50,meth='pmm',seed=500)
summary(tempData)
# Get completed datasets (observed and imputed)
completedData <- complete(tempData,1)
summary(completedData)
# Plots
# Density plot original vs imputed dataset
densityplot(tempData)
This is my code:
library(readr)
input_preg<- read_csv("datasurvey.csv")
summary(input_preg)
imput<- input_preg
#Imputation
library(mice)
temporal <- mice(imput,m=5,maxit=50,meth='pmm',seed=500)
# Example: imputed values for one variable
temporal$imp$`52bcalif`
# Select the first completed dataset
completos<-complete(temporal,1)
# Plotting
densityplot(temporal)
So I'm doing almost exactly what the original code does, but when I call densityplot it fails with:
Error in `[.data.frame`(r, , xvar) : undefined columns selected
With the original code, densityplot runs without problems. I don't know whether the cause is the large number of imputations, or that the original data had 4 variables and mine has 29.

Change the name of that column (temporal$imp$`52bcalif`); I think the mistake is there. The name starts with a digit, so it is not a syntactic R name. I tested this myself.
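As a rough sketch of that fix, the snippet below finds column names R would need backticks for and rewrites them before imputation. The data frame `survey` and its values are made up for illustration; only the column name `52bcalif` comes from the question.

```r
# Hypothetical stand-in for the survey data; only "52bcalif" is from the question.
survey <- data.frame(a = c(1, NA, 3), b = c(2, 5, NA))
names(survey) <- c("52bcalif", "score")

# Flag names that are not syntactic (they differ from what make.names() returns).
bad <- names(survey) != make.names(names(survey))
names(survey)[bad]

# Replace them with syntactic, unique names before calling mice().
names(survey) <- make.names(names(survey), unique = TRUE)
names(survey)
```

make.names() prefixes names that start with a digit with "X", so the renamed column can be used in formulas without quoting.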

Related

R Caret knnImpute for partially NA rows

I'm trying to run some code to preprocess my data for machine learning in Caret. One step I'm having a lot of trouble with is KNN imputation. When I run the following block of code:
library(caret)
traindf <- data.frame(matrix( rnorm(7*7,mean=0,sd=1), nrow=7, ncol=7))
testdf <- data.frame(matrix( rnorm(7*7,mean=0,sd=1), nrow=7, ncol=7))
for(i in 1:7){
traindf[i,i] <- NA #generates NA's in every row
}
impute_model <- preProcess(traindf, method = c('knnImpute')) #this line is problematic
imputed_train <- predict(impute_model, traindf)
imputed_test <- predict(impute_model, testdf)
I get an error:
Error in RANN::nn2(old[, non_missing_cols, drop = FALSE], new[, non_missing_cols, :
Cannot find more nearest neighbours than there are points
From some research, I believe this is due to the fact that the kNN imputation implementation Caret uses discards rows with any NA's. In my dataset, NA's are scattered throughout such that this would result in all rows being discarded for imputation purposes. Instead I would like to keep these partially NA rows and still use them for imputation.
I know of one package that does this: https://www.rdocumentation.org/packages/impute/versions/1.46.0/topics/impute.knn. However, it doesn't provide a predict method, so I can't easily use it to impute the test set as in the example above.
Does anyone have suggestions on how I can get this partial-NA KNN imputation working with Caret?

How to use missForest package in R for test data?

We can use the missForest package to impute missing values in R (for both categorical and numeric variables). But this approach requires a complete response variable for training the forest. So how can I impute missing values in the test data set with missForest, given that the test data set has no response variable?
You can just use missForest. No need for the response variable. See code below.
library(missForest)
# remove response variable
my_iris <- iris[, -5]
## Artificially produce missing values using the 'prodNA' function:
set.seed(81)
iris.mis <- prodNA(my_iris, noNA = 0.2)
#impute
iris.imp <- missForest(iris.mis, verbose = TRUE)
# out-of-bag imputation error estimate
iris.imp$OOBerror
# true imputation error; only returned when xtrue is supplied to missForest()
iris.imp$error
# Imputed matrix
iris.imp$ximp

Getting a constant answer while finding patterns with neuralnet package

I'm trying to find patterns in a large dataset using the neuralnet package.
My data file looks something like this (30,204,447 rows) :
id.company,EPS.or.Sales,FQ.or.FY,fiscal,date,value
000001,EPS,FY,2001,20020201,-5.520000
000001,SAL,FQ,2000,20020401,70.300003
000001,SAL,FY,2001,20020325,49.200001
000002,EPS,FQ,2008,20071009,-4.000000
000002,SAL,FY,2008,20071009,1.400000
I have split this initial file into four new files for annual/quarterly sales/EPS and it is on those files that I want to use neural networks to see if I can use the variables id.company, fiscal and date in the case below to predict the annual sales results.
To do so, I have written the following code:
library(neuralnet)
dataset <- read.table("fy_sal_data.txt",header=T, sep="\t") # my file doesn't actually use commas as separators
#extract training set and testing set
trainset <- dataset[1:1000, ]
testset <- dataset[1001:2000, ]
#building the NN
ann <- neuralnet(value ~ id.company + fiscal + date, trainset, hidden = 3,
lifesign="minimal", threshold=0.01)
#testing the output
temp_test <- subset(testset, select=c("id.company", "fiscal", "date"))
ann.results <- compute(ann, temp_test)
#display the results
cleanoutput <- cbind(testset$value, as.data.frame(ann.results$net.result))
colnames(cleanoutput) <- c("Expected Output", "NN Output")
head(cleanoutput, 30)
Now my problem is that the compute function returns a constant answer no matter the inputs of the testing set.
Expected Output NN Output
1001 2006.500000 1417.796651
1002 2009.000000 1417.796651
1003 2006.500000 1417.796651
1004 2002.500000 1417.796651
I am very new to R and its neural network packages, but I have read online that such results can be caused by either:
an insufficient number of training examples (here I'm using a thousand, but I've also tried a million rows and the results were the same, only it took 4 hours to train)
or an error in the formula.
I am sure I'm doing something wrong but I can't seem to figure out what.

Perform operation on each imputed dataset in R's MICE

How can I perform an operation (like subsetting or adding a calculated column) on each imputed dataset in an object of class mids from R's package mice? I would like the result to still be a mids object.
Edit: Example
library(mice)
data(nhanes)
# create imputed datasets
imput = mice(nhanes)
The imputed datasets are stored as a list of lists
imput$imp
where there are rows only for the observations with imputation for the given variable.
The original (incomplete) dataset is stored here:
imput$data
For example, how would I create a new variable calculated as chl/2 in each of the imputed datasets, yielding a new mids object?
This can be done easily as follows -
Use complete() to convert a mids object to a long-format data.frame:
long1 <- complete(midsobj1, action='long', include=TRUE)
Perform whatever manipulations needed:
long1$new.var <- long1$chl/2
long2 <- subset(long1, age >= 5)
Use as.mids() to convert the manipulated data back to a mids object:
midsobj2 <- as.mids(long2)
Now you can use midsobj2 as required. Note that include=TRUE (which includes the original data with missing values) is needed for as.mids() to compress the long-format data properly. Also note that prior to mice v2.25 there was a bug in the as.mids() function (see this post: https://stats.stackexchange.com/a/158327/69413).
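Put together, the round trip might look like the sketch below. It uses the nhanes data from the question; the object names `imp`, `long`, `imp2` and the column name `half_chl` are chosen here for illustration.

```r
library(mice)

# Impute the nhanes example data (seed fixed only for reproducibility).
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

# Stack the original data (.imp == 0) and the 5 completed datasets.
long <- complete(imp, action = "long", include = TRUE)

# Add the calculated column to every stacked dataset at once.
long$half_chl <- long$chl / 2

# Convert back; as.mids() relies on the .imp and .id columns that complete() added.
imp2 <- as.mids(long)
class(imp2)
head(complete(imp2, 1))
```

Because the original rows are included, half_chl is missing wherever chl was missing, and as.mids() stores the per-imputation values for those rows in imp2$imp.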
EDIT: According to this answer https://stackoverflow.com/a/34859264/4269699 (from what is essentially a duplicate question) you can also edit the mids object directly by accessing $data and $imp. So for example
midsobj2<-midsobj1
midsobj2$data$new.var <- midsobj2$data$chl/2
midsobj2$imp$new.var <- midsobj2$imp$chl/2
You will run into trouble though if you want to subset $imp or if you want to use $call, so I wouldn't recommend this solution in general.
Another option is to calculate the variables before the imputation and place restrictions on them.
library(mice)
# Create the additional variable - this will have missing values
nhanes$extra <- nhanes$chl / 2
# Change the method of imputation for extra, so that it always equals chl/2
# Change the predictor matrix so only chl predicts extra
ini <- mice(nhanes, maxit = 0, printFlag = FALSE)
meth <- ini$method
meth["extra"] <- "~I(chl / 2)"
pred <- ini$predictorMatrix # extra isn't used to predict
pred["extra", "chl"] <- 1
# Imputations
imput <- mice(nhanes, seed = 1, predictorMatrix = pred, method = meth, printFlag = FALSE)
There are examples in mice: Multivariate Imputation by Chained Equations in R.
mice provides a with() method for mids objects that can help here:
with(imput, chl/2)
The documentation is at ?with.mids.
There's a function for this in the basecamb package:
library(basecamb)
apply_function_to_imputed_data(mids_object, my_function) # my_function: placeholder for the function to apply

Forecasting with `tslm` returning dimension error

I'm having a problem similar to the one the askers of the questions below had with the linear model predict function, but I am trying to use the "time series linear model" function from Rob Hyndman's forecast package.
Predict.lm in R fails to recognize newdata
predict.lm with newdata
totalConv <- ts(varData[,43])
metaSearch <- ts(varData[,45])
PPCBrand <- ts(varData[,38])
PPCGeneric <- ts(varData[,34])
PPCLocation <- ts(varData[,35])
brandDisplay <- ts(varData[,29])
standardDisplay <- ts(varData[,3])
TV <- ts(varData[,2])
richMedia <- ts(varData[,46])
df.HA <- data.frame(totalConv, metaSearch,
PPCBrand, PPCGeneric, PPCLocation,
brandDisplay, standardDisplay,
TV, richMedia)
As you can see I've tried to avoid the names issues by creating a data frame of the time series objects.
However, I then fit a tslm object (time series linear model) as follows -
fit1 <- tslm(totalConv ~ metaSearch
+ PPCBrand + PPCGeneric + PPCLocation
+ brandDisplay + standardDisplay
+ TV + richMedia, data = df.HA
)
Despite having created a data frame and named all the objects properly I get the same dimension error as these other users have experienced.
Error in forecast.lm(fit1) : Variables not found in newdata
In addition: Warning messages:
1: 'newdata' had 10 rows but variables found have 696 rows
2: 'newdata' had 10 rows but variables found have 696 rows
The model frame seems to give sensible names to all of the variables, so I don't know what is wrong with the forecast function:
names(model.frame(fit1))
[1] "totalConv" "metaSearch" "PPCBrand" "PPCGeneric" "PPCLocation" "brandDisplay"
[7] "standardDisplay" "TV" "richMedia"
Can anyone suggest any other improvements to my model specification that might help the forecast function to run?
EDIT 1: Ok, just so there's a working example, I've used the data given in Irsal's answer to this question (converting to time series objects) and then fitted the tslm. I get the same error (different dimensions obviously):-
Is there an easy way to revert a forecast back into a time series for plotting?
I'm really confused about what I'm doing wrong; my code looks identical to that used in all of the examples on this.
data <- c(11,53,50,53,57,69,70,65,64,66,66,64,61,65,69,61,67,71,74,71,77,75,85,88,95,
93,96,89,95,98,110,134,127,132,107,94,79,72,68,72,70,66,62,62,60,59,61,67,
74,87,112,134,51,50,38,40,44,54,52,51,48,50,49,49,48,57,52,53,50,50,55,50,
55,60,65,67,75,66,65,65,69,72,93,137,125,110,93,72,61,55,51,52,50,46,46,45,
48,44,45,53,55,65,89,112,38,7,39,35,37,41,51,53,57,52,57,51,52,49,48,48,51,
54,48,50,50,53,56,64,71,74,66,69,71,75,84,93,107,111,112,90,75,62,53,51,52,
51,49,48,49,52,50,50,59,58,69,95,148,49,83,40,40,40,53,57,54,52,56,53,55,
55,51,54,45,49,46,52,49,50,57,58,63,73,66,63,72,72,71,77,105,97,104,85,73,
66,55,52,50,52,48,48,46,48,53,49,58,56,72,84,124,76,4,40,39,36,38,48,55,49,
51,48,46,46,47,44,44,45,43,48,46,45,50,50,56,62,53,62,63)
data2 <- c(rnorm(237))
library(forecast)
nData <- ts(data)
nData2 <- ts(data2)
dat.ts <- tslm(nData~nData2)
forecast(dat.ts)
Error in forecast.lm(dat.ts) : Variables not found in newdata
In addition: Warning messages:
1: 'newdata' had 10 rows but variables found have 237 rows
2: 'newdata' had 10 rows but variables found have 237 rows
EDIT 2: Same error even if I combine both series into a data frame.
nData.df <- data.frame(nData, nData2)
dat.ts <- tslm(nData~nData2, data = nData.df)
forecast(dat.ts)
tslm fits a linear regression model. You need to provide the future values of the explanatory variables if you want to forecast. These should be provided via the newdata argument of forecast.lm.
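A minimal sketch of what that looks like, using simulated data rather than the series from the question. The names `dat`, `fit`, `future` and the random "future" regressor values are assumptions made for illustration; in practice the future values must come from a forecast or scenario for the explanatory variable.

```r
library(forecast)

set.seed(42)
n <- 60
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3)

# Keep response and regressor in one multivariate ts so the model
# refers to the regressor by the plain column name "x".
dat <- ts(cbind(y = y, x = x))
fit <- tslm(y ~ x, data = dat)

# Assumed future values of the regressor (made up here); the column
# name must match the name used in the model formula.
future <- data.frame(x = rnorm(10))

# One forecast is produced per row of newdata.
fc <- forecast(fit, newdata = future)
length(fc$mean)
```

The forecast horizon is set by the number of rows in newdata, so no separate h argument is needed here.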
