I am trying to impute missing data before modeling, for example with a random forest.
I have categorical and continuous features. I would like to use the kNN function (VIM package) to impute my data, but I can't use it inside caret's preProcess function, and the knn imputation built into preProcess does not handle mixed data.
How can I impute mixed data in the preProcess function?
As of right now, preProcess will only impute continuous predictors (categorical predictors can be put in that form via dummy variables).
You could write your own custom method to use that function for pre-processing if you like. This example might help.
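One simple alternative, sketched here under assumptions, is to run VIM's kNN on the data frame before it reaches caret at all. The data set (iris with artificially removed values) and k = 5 are illustrative choices, not part of the original answer. Note that imputing before resampling means the imputation is not repeated inside each resample, so performance estimates may be slightly optimistic.

```r
# Sketch: impute mixed-type data with VIM::kNN before modelling.
# Assumes the VIM package is installed; iris is only a stand-in data set.
library(VIM)

set.seed(1)
df <- iris
df$Sepal.Length[sample(nrow(df), 10)] <- NA  # missing continuous values
df$Species[sample(nrow(df), 10)] <- NA       # missing categorical values

# imp_var = FALSE suppresses the extra indicator columns that kNN adds by default
imputed <- kNN(df, k = 5, imp_var = FALSE)

# 'imputed' now has the same columns as 'df' and can be passed to caret::train()
```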
Once I get through the work for my day job, improving preProcess is the next major task for package development.
Thanks,
Max
Related
I need to call a forecast model from R within AnyLogic, and return the resulting outputs in R. It is a specific time series model that I have built in R, and just copying the coefficients to AnyLogic is not efficient. I have seen a couple of older posts on similar questions, but I am not sure I can follow them. Any advice would be greatly appreciated.
I have a regression forecast model that uses predictors to provide a forecast along with prediction intervals. I need these outputs to be updated for different values of the predictors and then used in AnyLogic.
I am working with a dataset that is zero-inflated, and I found that the glmmadmb function allows me to fit a GLMM to it. But I realized that my data are not linearly related to the environmental variables I want to test, which is why I am looking for a GAMM instead of a GLMM.
Does anyone know which function can replace glmmadmb for a GAMM?
I am trying to fit a transfer function model using R in order to apply the fitted model to a validation set of data, because SPSS doesn't allow me to (or I don't know how to) compute point forecasts the way the Arima() function from the forecast package does. SPSS does let me apply the model, but it does not use the dependent variable's lagged values, which is why I am trying R.
Does anyone know how I could get that type of "updated" or validation forecast using the arimax() function? I am not looking for the following type of predictions:
predict(vixari011, n.ahead=12)
But rather these:
Arima(test$VIX, model = vixari)
From what I have been reading, there is no prediction function for the arimax() function. Any ideas about how I could forecast to evaluate point-by-point performance? I can only think of computing them manually in a spreadsheet...
I had the same problem. I know this post is old, but this may help someone.
I used this and it worked just fine:
forecast(fitted(arimax_ts_model), h=11)
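For the one-step "validation" forecasts asked about in the question, a rough base R sketch is to refit the same model on the validation data with every coefficient held fixed at its training value, then recover the one-step forecasts from the residuals. This is essentially what Arima(test, model = fit) does. The sketch uses stats::arima with xreg (regression with AR errors) as a stand-in, not a true transfer function from arimax(), and the simulated series and model order are assumptions made purely for illustration.

```r
# Sketch: one-step-ahead forecasts on a validation set with fixed coefficients.
# Simulated data; the AR(1)-plus-regressor model is an illustrative assumption.
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 2 + 0.5 * x + as.numeric(arima.sim(list(ar = 0.6), n = n))

train <- 1:150
test  <- 151:200

# Estimate the coefficients on the training period only
fit <- arima(y[train], order = c(1, 0, 0), xreg = x[train])

# Re-run the model on the validation period without re-estimating anything
refit <- arima(y[test], order = c(1, 0, 0), xreg = x[test],
               fixed = coef(fit), transform.pars = FALSE)

# Residuals are one-step forecast errors, so forecast = actual - residual
one_step <- y[test] - as.numeric(residuals(refit))
rmse <- sqrt(mean((y[test] - one_step)^2))
```

Each element of one_step uses the actual lagged values of the validation series, which is the point-by-point behaviour the question describes.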
Can someone please help me understand how to handle missing values in new/unseen data? I have investigated a handful of multiple imputation packages in R and all seem only to impute the training and test set (at the same time). How do you then handle new unlabeled data to estimate in the same way that you did train/test? Basically, I want to use multiple imputation for missing values in the training/test set and also the same model/method for predictor data. Based on my research of multiple imputation (not an expert), it does not seem feasible to do this with MI? However, for example, using caret, you can easily use the same model that was used in training/test for new data. Any help would be greatly appreciated. Thanks.
** Edit
Basically, my data set contains many missing values. Deletion is not an option, as it would discard most of my train/test set. Up to this point, I have encoded categorical variables and removed near-zero-variance and highly correlated variables. After this preprocessing, I was able to easily apply the mice package for imputation:
m=mice(sg.enc)
At this point, I could use the pool command to apply the model against the imputed data sets. That works fine. However, I know that future data will have missing values and would like to somehow apply this MI incrementally?
It does not offer multiple imputation, but the yaImpute package has a predict() function to impute values for new data. I ran a test using training data (that included NAs) to create a "yai" object, then used that object via predict() to impute values in a new testing data set. Unlike caret's preProcess(), yaImpute supports factor variables (at least for imputing values for them) in its knn algorithm. I have not yet tested whether factors can be part of the "predictors" for the missing target variables. yaImpute also supports imputation methods other than knn.
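The general pattern of learning the imputation on training data and reusing it on new data can be sketched in base R. This is deliberately much simpler than either mice or yaImpute: it is single imputation with the training median for numeric columns and the training mode for factors. The helper names fit_imputer and apply_imputer are hypothetical, and the sketch assumes new data uses the same factor levels as the training data.

```r
# Sketch: learn fill-in values on training data, reuse them on new data.
# fit_imputer() and apply_imputer() are hypothetical helpers, not package functions.
fit_imputer <- function(train) {
  lapply(train, function(col) {
    if (is.numeric(col)) {
      median(col, na.rm = TRUE)          # numeric: training median
    } else {
      names(which.max(table(col)))       # factor/character: most frequent value
    }
  })
}

apply_imputer <- function(imputer, newdata) {
  for (nm in names(imputer)) {
    missing <- is.na(newdata[[nm]])
    newdata[[nm]][missing] <- imputer[[nm]]
  }
  newdata
}

train <- data.frame(age = c(20, NA, 40, 60),
                    colour = factor(c("red", "blue", "red", NA)))
new <- data.frame(age = c(NA, 33),
                  colour = factor(c(NA, "blue"), levels = levels(train$colour)))

imputer <- fit_imputer(train)             # stored alongside the fitted model
new_filled <- apply_imputer(imputer, new) # applied to unseen data later
```

Because the imputer is an ordinary R object, it can be saved with the model and applied to each new batch of data without touching the training set again.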
I'm using randomForest in order to find out the most significant variables. I was expecting some output that defines the accuracy of the model and also ranks the variables based on their importance. But I am a bit confused now. I tried randomForest and then ran importance() to extract the importance of variables.
But then I saw another command, rfcv (Random Forest Cross-Validation for feature selection), which I suppose should be the most appropriate for this purpose. My question is: how do I get the list of the most important variables? How do I see the output after running it? Which command should I use?
Another thing: What is the difference between randomForest and predict.randomForest?
I am not very familiar with randomForest and R, so any help would be appreciated.
Thank you in advance!
After you have built a randomForest model, you use predict.randomForest to apply that model to new data. For example, build a random forest with the training data, then run your validation data through that model with predict.randomForest.
As for rfcv, there is an option, recursive, which controls (from the help):
"whether variable importance is (re-)assessed at each step of variable reduction"
It's all in the help file.
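Putting the pieces together, a minimal sketch might look like the following. It assumes the randomForest package is installed and uses iris purely as a stand-in data set. Note that rfcv() reports cross-validated error for decreasing numbers of predictors rather than a list of variable names, so the ranking itself comes from importance().

```r
# Sketch: fit, rank variables, predict on held-out data, and run rfcv.
# Assumes the randomForest package; iris and the 100/50 split are illustrative.
library(randomForest)

set.seed(123)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
valid <- iris[-idx, ]

# importance = TRUE is needed for the permutation-based importance measure
fit <- randomForest(Species ~ ., data = train, importance = TRUE)
print(fit)  # out-of-bag error estimate and confusion matrix

# Rank variables by mean decrease in accuracy (type = 1)
imp    <- importance(fit, type = 1)
ranked <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
print(ranked)

# predict() dispatches to predict.randomForest for new data
pred     <- predict(fit, newdata = valid)
accuracy <- mean(pred == valid$Species)

# Cross-validated error as the number of predictors is reduced
cv <- rfcv(train[, 1:4], train$Species, cv.fold = 5)
print(cv$error.cv)
```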