Not losing observations when faced with missing data - r

I have a dataset to which I've fitted a linear model, and I've tried to use the step function on that model. I get the error message "number of rows in use has changed: remove missing values?".
I noticed that a few of the observations (not many) in my dataset have NA values for one variable. I've seen similar questions that suggest using na.omit(), but when I do this I lose those observations. I want to keep them, however, because they contain useful information for the other variables. Is there a way to use step without losing the observations?

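step uses the nobs function to check that the number of observations is unchanged after each refit, and you can call nobs yourself to see how many rows a given model actually uses. Its use.fallback argument only controls whether nobs may fall back to guessing that count; it does not fill in missing values. The R documentation for step recommends removing the missing values before running step.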
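As a minimal sketch of that recommendation, assuming you know in advance which variables step may consider: dropping only the rows that are incomplete in those variables (rather than calling na.omit() on the whole data frame) keeps every observation the candidate models can actually use. The built-in airquality data and the variable list below are stand-ins for your own.

```r
# airquality ships with R and has NAs in Ozone and Solar.R
data(airquality)

# Variables that step() is allowed to consider (an assumption for this sketch)
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# Keep rows that are complete in these variables only;
# NAs in any other column would not cost us a row
aq <- airquality[complete.cases(airquality[, vars]), vars]

full <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq)
nobs(full)                    # rows used by the full model

# Every submodel is now fitted to the same rows, so the row-count check passes
sel <- step(full, trace = 0)
nobs(sel) == nobs(full)
```

Rows that are incomplete in a candidate variable are still lost this way; keeping those as well requires imputation.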

I would discourage you from simply omitting the observations if the values really are missing. You can use multiple imputation via Amelia to impute the data so that you have a full dataset.
See here: https://cran.r-project.org/web/packages/Amelia/Amelia.pdf
I would also recommend reviewing the book "Statistical Analysis with Missing Data" by R.J.A. Little and D.B. Rubin.

Related

How to code a predictor in logistic regression when some values are purposefully unknown

I decided to post my question here because, strictly speaking, it has to do with coding.
The problem is as follows. In a psychological experiment involving two conditions, an independent variable made up of numeric values was present in one condition but not in the other. Accordingly, in one condition the variable in question provided relevant information and ranged between 0 and 20. In the other condition participants were simply not provided with such information.
When binding the data together, I coded the variable as NA in the second condition, where participants were not provided with this information. However, when I run my logistic model, setting na.action = na.omit causes the model to fail.
In principle, the NAs in my data are not missing values; in accordance with the experimental design, they are meant to reflect the absence of this information within one of the conditions.
Therefore, it seems to me that multivariate imputation, as could be implemented with mice or other packages, is not the correct course of action. In fact, if I wanted, I could simply retrieve the values of interest, but including them in the data would be improper because, as already mentioned, participants were kept from knowing them.
Is there any strategy to code such unknown values and cope with this problem?
Any help would be much appreciated. Thank you very much!

R (GLM) Split continuous variate with missing values into variate and a factor for missing values

I was wondering if anyone knew of a way to split a continuous variate with missing values into a continuous variate and a factor for the missing level. Essentially, I want the GLM to fit the variable without taking the NAs into account and to fit a separate parameter for the NA level.
I tried doing this using interactions, but of course this introduces an alias into the model.
It would be easier to understand your question if you included a simple code example of what you are trying to accomplish.
From what I understand, you want to break the effect of one variable into two different variables: one for the valid data and one for the NAs.
However, dropping NAs usually means giving up the whole observation (row), so you would not be able to use the remaining information for that observation. There would then be no point in adding information about the presence of NAs to the observation, because it will be discarded anyway.
Now, I don't know the specifics of your data and application, but if you insist on introducing a new parameter to indicate rows with NAs, what you could do is impute the missing values and then add your "NA parameter". This way you would keep all the observations and still provide the model with additional information about the presence of missing values.
Then again, if you actually do the imputation, you should also ask whether the "NA parameter" is still justified; again, that depends on your specific problem.
But imputation is a sensitive design decision. You should be careful about introducing additional information into your data. The same can be said about the "NA parameter", unless you know the underlying reason for the missing values.
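A minimal sketch of the "fill in, then add an indicator" idea, using a constant fill value instead of a proper imputation (often called the missing-indicator method). The data are simulated and the names x_filled and x_missing are hypothetical. Because the indicator absorbs the level of the filled-in rows, the slope of x_filled is determined by the observed rows only, and the fill value (0 here) is arbitrary.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 * x))
x[sample(n, 40)] <- NA                 # 40 values are unknown

d <- data.frame(y = y, x = x)

# Factor marking the rows where x is unknown
d$x_missing <- factor(is.na(d$x), levels = c(FALSE, TRUE),
                      labels = c("observed", "missing"))

# Copy of x with an arbitrary constant in place of the NAs
d$x_filled <- ifelse(is.na(d$x), 0, d$x)

# All 200 rows are used; the "missing" level gets its own parameter
fit <- glm(y ~ x_filled + x_missing, family = binomial, data = d)
nobs(fit)
coef(fit)
```

Whether the extra parameter is meaningful depends on why the values are absent; when they are absent by design, it simply estimates the level of the rows without the information.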

Princomp error in R : covariance matrix is not non-negative definite

I have this script, which does a simple PCA on a number of variables and at the end attaches two coordinates and two other columns (presence, NZ_Field) to the output file. I have done this many times before, but now it's giving me this error:
covariance matrix is not non-negative definite
I understand that it means there are negative eigenvalues. I looked at similar posts, which suggest using na.omit, but it didn't work.
I have uploaded the "biodata.Rdata" file here:
https://www.dropbox.com/s/1ex2z72lilxe16l/biodata.rdata?dl=0
I am pretty sure it is not because of missing values in the data, because I have used the same data with different "presence" and "NZ_Field" columns.
Any help is highly appreciated.
load("biodata.rdata")
#save data separately
coords=biodata[,1:2]
biovars=biodata[,3:21]
presence=biodata[,22]
NZ_Field=biodata[,23]
#Do PCA
bpc=princomp(biovars ,cor=TRUE)
#re-attach data with auxiliary data..coordinates, presence and NZ location data
PCresults=cbind(coords, bpc$scores[,1:3], presence, NZ_Field)
write.table(PCresults,file= "hlb_pca_all.txt", sep= ",",row.names=FALSE)
This does appear to be an issue with missing data, and there are a few ways to deal with it. One way is to do listwise deletion on the data manually before running the PCA, which in your case would be:
biovars<-biovars[complete.cases(biovars),]
The other option is to use another package. psych seems to work well here: you can use principal(biovars), and while the output is a bit different, it does work using pairwise deletion. So it basically comes down to whether you want to use pairwise or listwise deletion.
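One caveat with listwise deletion in this script: if only biovars is filtered, the scores no longer have the same number of rows as coords, presence and NZ_Field, so the later cbind() will not line up. A minimal sketch that filters the whole data frame first, using simulated data with a hypothetical layout (two coordinate columns, five variables, one auxiliary column) in place of biodata:

```r
set.seed(42)
# Simulated stand-in for biodata: columns 1-2 coordinates, 3-7 variables
biodata <- data.frame(x = runif(50), y = runif(50),
                      matrix(rnorm(50 * 5), ncol = 5),
                      presence = rbinom(50, 1, 0.5))
biodata[c(3, 17), 4] <- NA             # two rows with a missing value

# Listwise deletion on the full data frame, based on the PCA variables only
keep  <- complete.cases(biodata[, 3:7])
clean <- biodata[keep, ]

bpc <- princomp(clean[, 3:7], cor = TRUE)

# Scores and auxiliary columns now have matching rows
PCresults <- cbind(clean[, 1:2], bpc$scores[, 1:3], presence = clean$presence)
dim(PCresults)
```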

Multiple Imputation on New/Predictor Data

Can someone please help me understand how to handle missing values in new/unseen data? I have investigated a handful of multiple imputation packages in R, and all seem to impute only the training and test sets (at the same time). How do you then handle new, unlabeled data so that it is treated in the same way as the training/test data? Basically, I want to use multiple imputation for missing values in the training/test set and then apply the same model/method to new predictor data. Based on my research on multiple imputation (I am not an expert), this does not seem feasible with MI. However, with caret, for example, you can easily apply the same model that was used in training/test to new data. Any help would be greatly appreciated. Thanks.
** Edit
Basically, my data set contains many missing values. Deletion is not an option, as it would discard most of my train/test set. Up to this point, I have encoded categorical variables and removed near-zero-variance and highly correlated variables. After this preprocessing, I was able to easily apply the mice package for imputation:
m=mice(sg.enc)
At this point, I could use the pool command to apply the model against the imputed data sets. That works fine. However, I know that future data will have missing values and would like to somehow apply this MI incrementally?
It does not do multiple imputation, but the yaImpute package has a predict() function to impute values for new data. I ran a test using training data (that included NAs) to create a "yai" object, then used that object via predict() to impute values in a new testing data set. Unlike caret's preProcess(), yaImpute supports factor variables (at least for imputing values for them) in its knn algorithm. I have not yet tested whether factors can be part of the "predictors" for the missing target variables. yaImpute also supports imputation methods other than knn.
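The general pattern here (learn the imputation from the training data, then apply it unchanged to new data) can be illustrated in base R. This sketch deliberately swaps in single median imputation, which is neither multiple imputation nor what yaImpute does; fit_imputer and apply_imputer are hypothetical helper names, and all columns are assumed to be numeric.

```r
# Learn one fill value per column from the training data only
fit_imputer <- function(train) {
  vapply(train, median, numeric(1), na.rm = TRUE)
}

# Apply the stored training medians to any data with the same columns
apply_imputer <- function(newdata, medians) {
  for (v in names(medians)) {
    newdata[[v]][is.na(newdata[[v]])] <- medians[[v]]
  }
  newdata
}

train <- data.frame(a = c(1, 2, NA, 4), b = c(10, NA, 30, 50))
new   <- data.frame(a = c(NA, 5),       b = c(20, NA))

med     <- fit_imputer(train)          # a = 2, b = 30
new_imp <- apply_imputer(new, med)
new_imp
```

The new data never influence the fill values, so predictions for future rows are produced the same way as for the training and test sets.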

Error thrown while imputing values using regression FNN package in R

I am trying to impute missing values using regression. I have searched thoroughly online, and it hasn't been of much help. I read the FNN package documentation for the knn.reg function and find it difficult to interpret. I have a column of missing values in the test data which I want to predict using my training data, and I have code like this:
regress<-knn.reg(data.train[data.train[,4]==1,][c(1,2,3)],test=data.test[c(1,2,3)],data.test[c(2)],5)
But I get the following error: Error in get.knnx(train, test, k, algorithm) : Data include NAs. The column which contains missing values is column 2. When I exclude the column which has NA values, i.e.
regress<-knn.reg(data.train[data.train[,4]==1,][c(1,2,3)],test=data.test[c(1,3)],data.test[c(2)],5)
I get the error: Error in get.knnx(train, test, k, algorithm) : Number of columns must be same!. Please help!
You might want to consider the mice package (and read part of the paper).
Using the standard settings, which have proven to be a good starting point:
library(mice)
mi <- mice(dataset)
mi.reg <- with(data = mi, expr = glm(y ~ x + z))
Here, simply calling mice() on your data fills in each NA value, by default producing five imputed data sets. Finer tuning is of course possible (and needed if it takes too long to converge, or if you have reason to believe the imputations are not accurate). Many different types of imputation are possible and are listed on page 16.
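If you would rather stay with knn.reg, the two errors in the question suggest how the arguments need to line up: train and test must contain the same predictor columns without NAs, and y must be the known values of the target column taken from the training rows, not from the test data. A minimal sketch, assuming the FNN package is installed and using simulated stand-ins for data.train and data.test (column 2 is the one to impute, column 4 the filter flag):

```r
library(FNN)  # assumes the FNN package is installed

set.seed(1)
a <- rnorm(100)
c3 <- rnorm(100)
# Simulated training data: column 2 (b) is fully observed here
data.train <- data.frame(a = a, b = 2 * a - c3 + rnorm(100, sd = 0.1),
                         c = c3, flag = 1)
# Simulated test data: column 2 is entirely missing
data.test <- data.frame(a = rnorm(10), b = NA_real_, c = rnorm(10))

# Training rows that pass the filter and have a known target value
train_ok <- data.train[data.train[, 4] == 1 & !is.na(data.train[, 2]), ]

# Predictors are columns 1 and 3 in both sets; y comes from the training rows
fit <- knn.reg(train = train_ok[, c(1, 3)],
               test  = data.test[, c(1, 3)],
               y     = train_ok[, 2],
               k     = 5)

data.test[, 2] <- fit$pred             # fill in the imputed values
```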
