I am trying to impute missing values using regression. I have searched thoroughly online without much help, and I find the FNN package documentation for the knn.reg function difficult to interpret. I have a column of missing values in my test data which I want to predict using my training data, and my code looks like this:
regress<-knn.reg(data.train[data.train[,4]==1,][c(1,2,3)],test=data.test[c(1,2,3)],data.test[c(2)],5)
But I get the following error: "Error in get.knnx(train, test, k, algorithm) : Data include NAs". The column which contains the missing values is column 2. When I exclude the column which has the NA values, i.e.
regress<-knn.reg(data.train[data.train[,4]==1,][c(1,2,3)],test=data.test[c(1,3)],data.test[c(2)],5)
I get the error: "Error in get.knnx(train, test, k, algorithm) : Number of columns must be same!". Please help!
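For reference, a rough sketch of how knn.reg's arguments are meant to line up: train and test must contain the same predictor columns, and y is the training response, not a column of the test data. The column indices below simply mirror the ones in the question and are assumptions about the actual data layout.

library(FNN)

# predictors: columns 1 and 3 in both train and test;
# response: the observed values of column 2 in the training rows
# (column 2 is the one that is missing in the test data)
train.x <- data.train[data.train[, 4] == 1, c(1, 3)]
train.y <- data.train[data.train[, 4] == 1, 2]
test.x  <- data.test[, c(1, 3)]

regress <- knn.reg(train = train.x, test = test.x, y = train.y, k = 5)
regress$pred  # predicted values for the missing column in the test data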
You might want to consider the mice package (and read at least part of its accompanying paper).
Using the standard settings, which have proven to be a good starting point:
library(mice)
# impute the missing values (5 imputed datasets by default)
mi <- mice(dataset)
# fit the regression model on each imputed dataset
mi.reg <- with(data = mi, expr = glm(y ~ x + z))
Here, simply calling mice() on your data fills in every NA value. Finer tuning is of course possible (and needed if convergence takes too long, or if you have reason to believe the imputations are not accurate). The many different imputation methods available are listed on page 16 of the paper.
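If you just want a single completed data frame with the NAs filled in, or the pooled estimates across imputations, the follow-up steps look roughly like this (continuing the example above):

# extract the first completed dataset, with the NAs filled in
completed <- complete(mi, 1)

# or pool the regression results across all imputed datasets
summary(pool(mi.reg))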
Related
I have a dataset to which I've fitted a linear model, and I've tried to use the step function on this model. I get an error message saying "number of rows in use has changed: remove missing values?".
I noticed that a few of the observations (not many) in my dataset had NA values for one variable. I've seen similar questions which suggest using na.omit(), but when I do this I lose the observations. I want to keep the observations however, because they contain useful information for the other variables. Is there a way to use step and avoid losing the observations?
You can call the nobs function to check that the number of observations is unchanged (its use.fallback argument lets it fall back to guessing the count when it cannot be extracted from the fit). The R documentation, however, recommends omitting the missing data before running step.
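A minimal sketch of that check; the model formula and the name mydata are placeholders:

# fit the model, then verify that no rows were silently dropped
fit <- lm(y ~ x1 + x2, data = mydata)
nobs(fit) == nrow(mydata)   # should be TRUE; see ?nobs for use.fallback

# only then run stepwise selection
step(fit)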
I would discourage you from simply omitting the missing values if they are genuinely missing. You can use multiple imputation via Amelia to impute the data so that you have a full dataset.
see here: https://cran.r-project.org/web/packages/Amelia/Amelia.pdf
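A minimal sketch of what that might look like; the data frame name mydata is a placeholder, and Amelia returns several completed copies of the data:

library(Amelia)

# create 5 imputed datasets
a.out <- amelia(mydata, m = 5)

# each element of a.out$imputations is a completed copy of the data
head(a.out$imputations[[1]])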
I would also recommend reviewing the book "Statistical Analysis with Missing Data" by R. Little and D. B. Rubin.
Just starting out with ERGM so apologies if the following question is not logical. I have tried to search on this site, and statnet_help, with no luck.
I was wondering whether the ergm() function in statnet can now cope with missing data on node attributes? I have coded the missing values as NA in R, but running the following ergm model resulted in an error.
> m2 <- ergm(d1~edges + nodecov('wellbeing'))
> Error in ergm.getglobalstats(nw, model, response = response) :
> NA/NaN/Inf in foreign function call (arg 13)
The attribute variable in question is continuous.
Many thanks,
S
I don't think it is possible to have NAs in edge/node covariates. It is not very clear how they should be treated anyway. Depending on how interested you are in tracing the role of the nodes with missing data, you might try one of the following (a sketch follows the list):
Imputing NAs with some sensible values (even a mean)
Adding a binary covariate equal to 1 for NA and 0 otherwise, and using it in nodecov (and perhaps some other effects) to check whether there is any evidence that these nodes play a special role in the network structure.
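A rough sketch of both ideas combined, assuming d1 is a network object with a numeric vertex attribute 'wellbeing' as in the question:

library(ergm)

wb <- d1 %v% "wellbeing"          # pull the vertex attribute
miss <- is.na(wb)

# (1) impute NAs with the observed mean
wb[miss] <- mean(wb, na.rm = TRUE)
d1 %v% "wellbeing" <- wb

# (2) add an indicator for the originally missing values
d1 %v% "wellbeing.missing" <- as.numeric(miss)

m2 <- ergm(d1 ~ edges + nodecov("wellbeing") + nodecov("wellbeing.missing"))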
I'm trying to use models from the bnlearn package in R for classifier predictions, but in some datasets some of the variable values (levels) are rarely seen, which means that the test data partition may not have every level of a variable represented.
When using predict() with the bn model on this type of dataset, an error message similar to the following is returned:
In check.data(data) : variable V3 has levels that are not observed in the data.
I would like to reset the levels in the model similar to the method here:
Error in bn.fit predict function in bnlear R
but I don't have access to the original data, just the model.
So, how do I get the number of levels from the bn data structure to set the number of levels in the data set to be predicted?
It turns out the question is asking the wrong thing. After quite a bit of poring over the code, the problem lies in a function, check.data, which is used to verify the data for both the learning and the predicting phases; for prediction this check is nonsensical. The real fix is to modify bnlearn to eliminate this bug.
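That said, if you only want to align the new data's factor levels with those stored in the fitted model (as the question originally asked), the levels of a discrete node are recoverable from the dimnames of its conditional probability table. A rough sketch, assuming a discrete bn.fit object called fitted and new data in newdata:

# levels of node V3 as stored in the fitted model's CPT
model.levels <- dimnames(fitted$V3$prob)[[1]]

# recode the corresponding column of the new data to use exactly those levels
newdata$V3 <- factor(newdata$V3, levels = model.levels)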
I have this script which does a simple PCA analysis on a number of variables and at the end attaches two coordinates and two other columns (presence, NZ_Field) to the output file. I have done this many times before but now it's giving me this error:
covariance matrix is not non-negative definite
I understand that it means there are negative eigenvalues. I looked at similar posts which suggest using na.omit, but it didn't work.
I have uploaded the "biodata.Rdata" file here:
https://www.dropbox.com/s/1ex2z72lilxe16l/biodata.rdata?dl=0
I am pretty sure it is not because of missing values in the data, because I have used the same data with different "presence" and "NZ_Field" columns.
Any help is highly appreciated.
load("biodata.rdata")
#save data separately
coords=biodata[,1:2]
biovars=biodata[,3:21]
presence=biodata[,22]
NZ_Field=biodata[,23]
#Do PCA
bpc=princomp(biovars ,cor=TRUE)
#re-attach data with auxiliary data..coordinates, presence and NZ location data
PCresults=cbind(coords, bpc$scores[,1:3], presence, NZ_Field)
write.table(PCresults,file= "hlb_pca_all.txt", sep= ",",row.names=FALSE)
This does appear to be an issue with missing data, so there are a few ways to deal with it. One way is to manually do listwise deletion on the data before running the PCA, which in your case would be:
biovars <- biovars[complete.cases(biovars), ]
The other option is to use another package; specifically, psych seems to work well here, and you can use principal(biovars). While the output is a bit different, it works using pairwise deletion, so basically it comes down to whether you want to use pairwise or listwise deletion. Thanks!
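A quick sketch of the psych route, keeping the same three components as the princomp call above (the nfactors and rotate settings are assumptions about what you want):

library(psych)

# pairwise deletion is used when building the correlation matrix from raw data;
# rows with NAs will get NA scores unless missing = TRUE is set
ppc <- principal(biovars, nfactors = 3, rotate = "none")
ppc$scores[, 1:3]   # component scores, analogous to bpc$scores above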
Can someone please help me understand how to handle missing values in new/unseen data? I have investigated a handful of multiple imputation packages in R, and all of them seem to impute only the training and test sets (at the same time). How do you then handle new, unlabeled data and impute it in the same way you did for train/test? Basically, I want to use multiple imputation for the missing values in the training/test set and then apply the same model/method to new data used only for prediction. Based on my research of multiple imputation (I am not an expert), it does not seem feasible to do this with MI. However, using caret, for example, you can easily apply the same preprocessing used on training/test data to new data. Any help would be greatly appreciated. Thanks.
** Edit
Basically, my dataset contains many missing values. Deletion is not an option, as it would discard most of my train/test set. Up to this point, I have encoded the categorical variables and removed near-zero-variance and highly correlated variables. After this preprocessing, I was able to easily apply the mice package for imputation:
m <- mice(sg.enc)
At this point, I could use the pool command to fit and pool models across the imputed datasets. That works fine. However, I know that future data will also have missing values; can I somehow apply this MI incrementally?
It does not do multiple imputation, but the yaImpute package has a predict() function to impute values for new data. I ran a test using training data (which included NAs) to create a "yai" object, then used that object via predict() to impute values in a new test data set. Unlike caret's preProcess(), yaImpute supports factor variables (at least as imputation targets) in its kNN algorithm. I have not yet tested whether factors can also serve as predictors for the missing target variables. yaImpute also supports other imputation methods besides kNN.
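A rough sketch of the workflow described above; the column names, the method, and k are assumptions, not part of the original answer:

library(yaImpute)

# 'train' has predictor columns x1, x2 and a 'target' column with NAs to impute;
# rows with observed targets act as references, the rest as targets
fit <- yai(x = train[, c("x1", "x2")],
           y = train[, "target", drop = FALSE],
           method = "euclidean", k = 5)

# impute the target for new, unseen rows that have the same predictor columns
imputed.new <- predict(fit, newdata = newdata)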