Background: I'm creating a recipe to clean and transform time-series data that will be used by multiple models. One of the steps in the recipe is to remove correlated predictors using the step_corr() function.
However, due to the nature of the data set, some variables can be constant across the entire training set when cross-validating with a rolling window, which causes the step_corr() function to throw a warning.
Problem Statement: In such cases, is it possible to exclude such variables from the correlation step? Or perhaps remove the variable entirely?
P.S. I know I can easily ignore the warning and proceed. But I'm looking for a cleaner approach / best practice advice.
There are two steps for you to consider:
step_zv() will remove variables that all have the same value (zero variance)
step_nzv() will remove variables that almost all have the same value (highly sparse and unbalanced)
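For instance, placing step_zv() before step_corr() drops the constant columns before the correlation filter ever sees them. A minimal sketch, assuming a training set train_data with outcome value (names are illustrative, not from the original post):

library(recipes)

rec <- recipe(value ~ ., data = train_data) %>%
  step_zv(all_numeric_predictors()) %>%                 # drop constant columns first
  step_nzv(all_numeric_predictors()) %>%                # optionally drop near-constant ones too
  step_corr(all_numeric_predictors(), threshold = 0.9)  # then filter correlated predictors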
Related
I wanted to try the xgboost global model from: https://business-science.github.io/modeltime/articles/modeling-panel-data.html
On a smaller scale it works fine (like the wmt data: 7 departments, 7 ids), but what if I would like to run it on 200,000 time series (ids)? That means step_dummy creates another 200k columns, and my PC can't handle it (it can't handle even 14k ids).
I tried removing step_dummy, but then I end up with xgboost forecasting the same values for all ids.
My question is: How can I forecast 200k time series with a global xgboost model and get proper values for each one of the 200k ids?
Or is it necessary to include step_dummy in order to create proper forecasts for all ids?
P.S. The code is the same as the one in the link; the only difference is that my dataset has 50 monthly observations for each id.
For this model, the data must be given to xgboost in the format of a sparse matrix. That means that there should not be any non-numeric columns in the data prior to the conversion (which tidymodels does under the hood at the last minute).
The traditional method for converting a qualitative predictor into a quantitative one is to use dummy variables. There are a lot of other choices though. You can use an effect encoding, feature hashing, or others too.
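For example, an effect encoding replaces the id column with a single numeric column instead of 200k dummies. A hedged sketch using step_lencode_glm() from the embed package (the data and column names are assumptions, not from the original post):

library(recipes)
library(embed)  # provides effect-encoding steps for high-cardinality factors

rec <- recipe(value ~ ., data = train_data) %>%
  # replace the 200k-level id factor with one numeric effect estimate per level
  step_lencode_glm(id, outcome = vars(value))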
I think there is no single proper answer to the question of how to forecast 200k time series. Global models are the way to go here, but you need to experiment to find out which series do not belong inside the global forecast model.
There will be a threshold, determined mostly by the length of the series, for what you put inside the global model.
Keep in mind that you can use several global models with different feature recipes.
If you want to avoid the step_dummy() function, use lightgbm via the bonsai package, which is considerably faster and more accurate.
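A minimal model specification along those lines, assuming the rest of the workflow from the linked article stays the same (this is a sketch, not the article's exact code):

library(bonsai)   # registers the lightgbm engine for parsnip
library(parsnip)

spec <- boost_tree() %>%
  set_engine("lightgbm") %>%  # lightgbm handles factor ids natively, so no step_dummy
  set_mode("regression")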
I have a dataset where I've fitted a linear model and tried to use the step function on it. I get an error message saying "number of rows in use has changed: remove missing values?".
I noticed that a few of the observations (not many) in my dataset had NA values for one variable. I've seen similar questions which suggest using na.omit(), but when I do this I lose the observations. I want to keep the observations however, because they contain useful information for the other variables. Is there a way to use step and avoid losing the observations?
You can call the nobs function to check that the number of observations is unchanged, and use its use.fallback argument to guess the count when it cannot be determined directly. The R documentation, however, recommends omitting the relevant data before running step.
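A minimal sketch of that workflow, assuming a data frame df with response y (names are illustrative):

fit <- lm(y ~ ., data = na.omit(df))  # fit on complete cases so the row count is fixed
nobs(fit, use.fallback = TRUE)        # check the number of observations actually used
step(fit)                             # stepwise selection now runs on a constant sample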
I would discourage you from simply omitting the missing values if they are indeed really missing. You can use multiple imputation via Amelia to impute the data such that you have a full dataset.
See here: https://cran.r-project.org/web/packages/Amelia/Amelia.pdf
I would also recommend reviewing the book "Statistical Analysis With Missing Data" by R. Little and D. B. Rubin.
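A minimal sketch of the Amelia workflow, assuming a data frame df with missing values (m = 5, the package default, sets the number of imputed datasets):

library(Amelia)

a.out <- amelia(df, m = 5)   # produce five completed datasets
# each completed dataset is available as a.out$imputations[[i]]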
I was wondering if anyone knew of a way to split a continuous variable with missing values into a continuous variable and a factor for the missing level. Essentially, I want the GLM to fit the variable without taking the NAs into account and just fit a separate parameter for the NA level.
I tried doing this using interactions but of course this introduces an alias into the model.
It would be easier to understand your question if you included a simple example code of what you are trying to accomplish.
From what I understand, you want to break the effect of one variable into two different variables: one for the valid data and one for the NAs.
However, dropping NAs usually means giving up the whole observation (row), so you wouldn't be able to use the remaining information for that observation (and there would be no point in adding information about the presence of NAs to an observation that will be discarded anyway).
Now, I don't know the specifics of your data and application, but if you insist on introducing a new parameter to indicate rows having NAs, what you could do is impute the missing values and then add your "NA parameter". This way you would keep all the observations and still provide the model with additional information about the presence of missing values.
Then again, if you actually do the imputation, you should also ask whether the "NA parameter" is still justified; again, that depends on your specific problem.
But imputation is a sensitive design decision. You should be careful about introducing additional information into your data. The same can be said about the "NA parameter", unless you know the underlying reason for the missing values.
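For what it's worth, a hedged sketch of the indicator approach described above, assuming a data frame df with response y and a continuous predictor x containing NAs (all names are illustrative):

df$x_missing <- factor(is.na(df$x))            # factor flagging the NA rows
df$x_filled  <- ifelse(is.na(df$x), 0, df$x)   # constant fill keeps the rows in the model
# x_filled carries the slope for the observed values; x_missing fits a
# separate parameter for the NA level (the 0 fill is arbitrary because the
# indicator absorbs it), avoiding the aliasing seen with the interaction approach
fit <- glm(y ~ x_filled + x_missing, data = df)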
I have this script which does a simple PCA analysis on a number of variables and at the end attaches two coordinates and two other columns (presence, NZ_Field) to the output file. I have done this many times before, but now it's giving me this error:
covariance matrix is not non-negative definite
I understand that it means there are negative eigenvalues. I looked at similar posts which suggest using na.omit, but it didn't work.
I have uploaded the "biodata.Rdata" file here:
https://www.dropbox.com/s/1ex2z72lilxe16l/biodata.rdata?dl=0
I am pretty sure it is not because of missing values in the data, because I have used the same data with different "presence" and "NZ_Field" columns.
Any help is highly appreciated.
load("biodata.rdata")
#save data separately
coords=biodata[,1:2]
biovars=biodata[,3:21]
presence=biodata[,22]
NZ_Field=biodata[,23]
#Do PCA
bpc=princomp(biovars ,cor=TRUE)
#re-attach data with auxiliary data..coordinates, presence and NZ location data
PCresults=cbind(coords, bpc$scores[,1:3], presence, NZ_Field)
write.table(PCresults,file= "hlb_pca_all.txt", sep= ",",row.names=FALSE)
This does appear to be an issue with missing data, so there are a few ways to deal with it. One way is to manually do listwise deletion on the data before running the PCA, which in your case would be:
biovars <- biovars[complete.cases(biovars), ]
The other option is to use another package; specifically, psych seems to work well here and you can use principal(biovars). While the output is a bit different, it does work using pairwise deletion, so basically it comes down to whether you want to use pairwise or listwise deletion.
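A short sketch of the psych route (the number of components is illustrative):

library(psych)

# principal() builds the correlation matrix with pairwise deletion,
# so incomplete rows are not dropped wholesale
pb <- principal(biovars, nfactors = 3, rotate = "none")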
Can someone please help me understand how to handle missing values in new/unseen data? I have investigated a handful of multiple imputation packages in R, and all of them seem to impute only the training and test sets (at the same time). How do you then handle new unlabeled data so that it is imputed the same way as the train/test data? Basically, I want to use multiple imputation for missing values in the training/test set and apply the same model/method to new predictor data. Based on my research (I'm not an expert), this does not seem feasible with MI. With caret, by contrast, you can easily apply the same model used in training/testing to new data. Any help would be greatly appreciated. Thanks.
Edit:
Basically, my data set contains many missing values. Deletion is not an option, as it would discard most of my train/test set. Up to this point, I have encoded categorical variables and removed near-zero-variance and highly correlated variables. After this preprocessing, I was able to easily apply the mice package for imputation:
library(mice)
m <- mice(sg.enc)
At this point, I could use the pool command to apply the model against the imputed data sets. That works fine. However, I know that future data will have missing values, and I would like to somehow apply this MI incrementally.
It does not offer multiple imputation, but the yaImpute package has a predict() function to impute values for new data. I ran a test using training data (that included NAs) to create a "yai" object, then used that object via predict() to impute values in a new testing data set. Unlike caret's preProcess(), yaImpute supports factor variables (at least for imputing values for them) in its knn algorithm. I have not yet tested whether factors can be part of the "predictors" for the missing target variables. yaImpute also supports other imputation methods besides knn.
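A hedged sketch of that workflow, assuming a training frame train with predictor columns pred1/pred2 and a target column with NAs to impute in new_data (all names are illustrative, and mahalanobis is just one of the methods yaImpute offers):

library(yaImpute)

fit <- yai(x = train[, c("pred1", "pred2")],
           y = train[, "target", drop = FALSE],
           method = "mahalanobis")            # knn variant; others are available

imputed <- predict(fit, newdata = new_data)   # impute the target for unseen rows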