Why is data pre-processed with caret encoded as non-tabular objects - r

I am learning how to use the R caret package, and I am wondering why there are than many functions that encode output data as objects that are non directly usable for training or regression.
For example, for preprocessing, the dummyVars functions returns an object of class "dummyVars". And similarly, the preProcess function returns an object of class "preProcess". These are non-usable by caret::train, and one has to work it out first with stats::predict like:
caret::dummyVars(Y ~ ., data = mydata) %>%
stats::predict(newdata = mydata)
Is there a reason for that? Why? What are the benefits?

After becoming more familiar with the package, I think I can provide an answer to this question and explain the advantages of such approach.
The caret package is intended to be used in a context in which one would not typically use just one predictive model to fit the data, but many.
Thus, objects such as preProcess are useful because they simply provide rules to process data than later can be passed to as many models as required. This saves coding and avoids errors (e.g. copy-pasting ones), because the same preProcess object can be used for all subsequent models in the train function by means of the trControl argument.
One must note also that these objects do not save entire "pre-processed" datasets (e.g. dummyVars, but just rules to be used during training, or during pre-processing. This also helps saving memory in a context where one might tend to accumulate many temporary variables and dataframes.

Related

Can I use xgboost global model properly, if I skip step_dummy(all_nominal_predictors(), one_hot = TRUE)?

I wanted to try xgboost global model from: https://business-science.github.io/modeltime/articles/modeling-panel-data.html
On smaller scale it works fine( Like wmt data-7 departments,7ids), but what if I would like to run it on 200 000 time series (ids)? It means step dummy creates another 200k columns & pc can't handle it.(pc can't handle even 14k ids)
I tried to remove step_dummy, but then I end up with xgboost forecasting same values for all ids.
My question is: How can I forecast 200k time series with global xgboost model and be able to forecast proper values for each one of the 200k ids.
Or is it necessary to put there step_ dummy in oder to create proper FC for all ids?
Ps:code should be the same as one in the link. Only in my dataset there are 50 monthly observations for each id.
For this model, the data must be given to xgboost in the format of a sparse matrix. That means that there should not be any non-numeric columns in the data prior to the conversion (with tidymodels does under the hood at the last minute).
The traditional method for converting a qualitative predictor into a quantitative one is to use dummy variables. There are a lot of other choices though. You can use an effect encoding, feature hashing, or others too.
I think that there is no proper answer to the question "how it would be possible to forecast 200k ts" properly. Global Models are the way to go here, but you need to experiment to find out, which models do not belong inside the global forecast model.
There will be a threshold, determined mostly by the length of the series, that you put inside the global model.
Keep in mind to use several global models, with different feature recipes.
If you want to avoid step_dummy function, use lightgbm from the bonsai package, which is considerably faster and more accurate.

How can I get My.stepwise.glm to return the model outside the console?

I asked this question on RCommunity but haven't had anyone bite... so I'm here!
My current project involves me predicting whether some trees will survive given future climate change scenarios. Against better judgement (like using Maxent) I've decided to pursue this with a GLM, which requires presence and absence data. Everytime I generate my absence data (as I was only given presence data) using randomPoints from dismo, the resulting GLM model has different significant variables. I found a package called My.stepwise that has a My.stepwise.glm function (here: My.stepwise.glm: Stepwise Variable Selection Procedure for Generalized Linear... in My.stepwise: Stepwise Variable Selection Procedures for Regression Analysis) , and this goes through a forward/backward selection process to find the best variables and returns a model ready for you.
My problem is that I don't want to run My.stepwise.glm just once and use the model it spits out for me. I'd like to run it roughly 100 times with different pseudo-absence data and see which variables it returns, then take the most frequent variables and move forward with building my model using those. The issue is that the My.stepwise.glm function ends by 'print(summary(initial.model))' and I would like to be able to access the output similar to how step() returns a list, where you can then say 'step$coefficients' and have the function coefficients return as numerics. Can anyone help me with this?

R: Finding help for classes created by packages

It often seems to be the case that R packages contain multiple functions that create an object of some class, specified by the package, with generic or non-generic methods that apply to all objects of that class. Although it is generally easy to find out about the functions in a package, I have not found any equally straightforward way to find a precise description of the class itself for S3 classes. I think this is at least partly intentional. Class definitions may be regarded as the sort of internal workings that, on one hand, the user should not have to think about, and on the other, may be changeable by the package creator, who wants people not to rely on them.
However, I find that I sometimes want to create additional objects of the same class that work with the package functions that are methods for that class. And it is not always easy to deduce what features an object must have in order to be usable by package functions that do various things to objects of that class, especially as instances created by different functions may or may not all have exactly the same structure.
The example with which I am currently wrestling are forecast objects created by various functions of the forecast package. The forecast package provides a large number of functions that take forecast objects as inputs. This blog post by Rob Hyndman describes a function to do cross validation and requires an object of class forecast as an argument The tsCV function documentation says it takes a "forecastFunction" as an argument, which must return an object of class forecast and have a univariate time series as its first object (of forecasts, one assumes) and have an argument h giving the horizon. Well, that sounds easy enough. But then in Hyndman’s associated textbook, section 3.6, we are told that forecast objects contain information about the forecasting method, the data, the point forecasts, prediction intervals, residuals, and fitted values. That’s a lot of things, and I am not sure if they are all mandatory or if some are optional, or required only if you intend to use certain methods. And I don’t know anything about mandatory internal structure of the class.
Finally, I particularly want to know if the new fable package, intended as a forecast package replacement, uses the same forecast class mechanism and require the same internal structure., or if not, how they are different. I have not been able to find, in fpp3 or elsewhere, anything that either describes a change or contains a comparable description of objects of class forecast.
I’m going to be embarrassed if there is some simple function,
you_should_know_this_dummy(package = “forecast”, class = “forecast”),
that returns a detailed description of the class. But I have looked for such a function every way I could think of and not found it.
O.K., my bad. I was trying so hard to find a way of locating the help file for the class description (which I don't think exists) that I overlooked the existence of a pretty good description of the class forecast under the function forecast() in the manual for the package forecast. Here it is:
An object of class "forecast" is a list usually containing at least the following elements:
model A list containing information about the fitted model
method The name of the forecasting method as a character string
mean Point forecasts as a time series
lower Lower limits for prediction intervals
upper Upper limits for prediction intervals
level The confidence values associated with the prediction intervals
x The original time series (either object itself or the time series used to create the model stored as object).
residuals Residuals from the fitted model. For models with additive errors, the residuals will be x minus the fitted values.
fitted Fitted values (one-step forecasts)
This still leaves some questions unanswered, like the format for the model information argument model, and for the x argument with multivariate models. But I am hoping that these are similar to those handed to or returned by, e.g., lm(). I think this gives me enough to get started and to hope for informative errors.
I still don't know if the fable package also uses objects of class forecast. The forecast package documents the forecast() function as a generic. The fable package does not document the generic, though it has a very similar list of functions that look like methods, e.g., forecast.whatever. If I figure out the answer, I'll post it here.
I am also looking for a number of other package that provide time series forecast of particular types. I'm hoping that they provide output similar enough that I can use the forecast/fable functions for display, cross-validation, and so forth. We'll see.

mlr classification training with rpart does not complete

I have a classification task that I managed to train with mlr package using LDA ("classif.lda") in a few seconds. However when I trained it using "classif.rpart" the training never ended.
Is there any different setup to be done for the different methods?
My training data here if needed to replicate the problem. I tried to train it simply with
pred.bin.task <- makeClassifTask(id="CountyCrime", data=dftrain, target="count.bins")
train("classif.rpart", pred.bin.task)
In general, you don't need to change anything about the setup when switching learners -- one of the main points of mlr is to make this easy! This does not mean that it'll always work though, as different learning methods do different things under the hood.
It looks like in this particular case the model simply takes a long time to train, so you probably didn't wait long enough for it to complete. You have quite a large data frame.
Looking at your data, you seem to have an interval of values in count.bins. This is treated as a factor by R (i.e. intervals are only the same if the string matches completely), which is probably not what you want here. You could encode start and end as separate (numerical) features.

caret package: Is it possible to implement my own bootstrapping method?

I am using caret package for R to select variables for my model. When using rfe command, one should pass rfeControl object, which has a method parameter. Options for this parameter are boot, cv, LOOCV and LGOCV. Since I am dealing with time series data I need to use special bootstrapping/cross-validation techniques as normal ones do not apply for time series data (otherwise distributions get corrupted etc.).
My question is how would I plug-in my own implementation of bootstrapping but still use caret rfe method, which has every other thing I need.
There isn't an easy way. If you study the code for rfe.default() you will note that in cases where method = "boot" the createResample() function is used. This is the function that generates the bootstrap samples. Similar functions are used for the other CV methods.
There is a hard way; overtake the create*() function that is most appropriate; say you want to do a block bootstrap or ME bootstrap, take over the createResample() function and use method = "boot", or if you want a special form of CV, use method = "cv" and take over createFolds().
You will need to write your own create*() function and replace the one in the caret NAMESPACE with your version. Not easy but eminently doable. Say you write your own createResample() function; first you need to note that this function creates n = times bootstrap samples returning this in a matrix with times columns and as many rows as your have samples. You need to write a custom createResample() function that returns the same object but which performs the time series bootstrapping you want to employ.
Once you have written that function you then need to get it into the caret namespace so that it is used by functions in the caret package. For this you use assignInNamespace(). Say your new bootstrapping function is called createMyResample() and it is your workspace, to insert this into the caret namespace do:
assignInNamespace("createResample", createMyResample, ns = "caret")
Sorry I can't be more specific but you don't say how you want the bootstrap/CV to be performed nor what R code you want to use to do the resampling. If you provide further details on how you would do the resampling I will take another look and see if I can help you write your create*() function.
Failing all of this, contact Max Kuhn, the author and maintainer of caret; he may be able to advice further or at least you can suggest this feature as a wish-list for a future version.

Resources