Store regression models in dataframe - r

I conduct a large number of regression analyses using ols and cph (different models, sensitivity analyses, etc.), which takes around two hours on my computer. Therefore, I would like to save these models so that I don't have to re-run the same analyses every time I want to work with them. The models all have very structured names, so I can create a list of names as follows:
model.names <- ls()[grep("^im", ls())]
But how can I use this to save those models? Could they be placed into a data frame?

I think you are looking for save()
save writes an external representation of R objects to the specified file. The objects can be read back from the file at a later date by using the function load or attach (or data in some cases).
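For example, a minimal sketch of that save/load round trip (assuming the model objects all have names starting with "im"; a list, not a data frame, is the natural container for model objects):

# Collect the names of all objects starting with "im"
model.names <- ls(pattern = "^im")

# Save them all to one file under their original names
save(list = model.names, file = "models.RData")

# In a later session, restore them:
load("models.RData")

# Alternatively, gather them into a named list and use saveRDS/readRDS:
models <- mget(model.names)
saveRDS(models, "models.rds")
models <- readRDS("models.rds")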

Related

Can I use xgboost global model properly, if I skip step_dummy(all_nominal_predictors(), one_hot = TRUE)?

I wanted to try xgboost global model from: https://business-science.github.io/modeltime/articles/modeling-panel-data.html
On a smaller scale it works fine (like the wmt data: 7 departments, 7 ids), but what if I would like to run it on 200,000 time series (ids)? step_dummy would then create another 200k columns, and my PC can't handle that (it can't even handle 14k ids).
I tried removing step_dummy, but then xgboost forecasts the same values for all ids.
My question is: how can I forecast 200k time series with a global xgboost model and get proper values for each one of the 200k ids?
Or is it necessary to include step_dummy in order to create proper forecasts for all ids?
PS: the code should be the same as the one in the link; only in my dataset there are 50 monthly observations for each id.
For this model, the data must be given to xgboost in the format of a sparse matrix. That means that there should not be any non-numeric columns in the data prior to the conversion (which tidymodels does under the hood at the last minute).
The traditional method for converting a qualitative predictor into a quantitative one is to use dummy variables. There are a lot of other choices though. You can use an effect encoding, feature hashing, or others too.
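For instance, a minimal sketch of feature hashing with the textrecipes package (the id and value column names are placeholders for your own data, and num_terms is a tuning choice; this keeps the column count fixed instead of growing with the number of ids):

library(recipes)
library(textrecipes)

# Hash the high-cardinality id into a fixed number of indicator
# columns instead of 200k dummy columns:
rec <- recipe(value ~ ., data = train_data) %>%
  step_dummy_hash(id, num_terms = 64)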
I think there is no single proper answer to the question of how to forecast 200k time series. Global models are the way to go here, but you need to experiment to find out which series do not belong inside the global forecast model.
There will be a threshold, determined mostly by the length of the series, for how many (and which) series you can put inside one global model.
Keep in mind that you may need several global models, with different feature recipes.
If you want to avoid the step_dummy function, use lightgbm from the bonsai package, which is considerably faster and more accurate.
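A minimal sketch of that route (bonsai registers the lightgbm engine for parsnip; lightgbm can handle categorical predictors natively, which is why step_dummy can be skipped):

library(parsnip)
library(bonsai)   # registers the "lightgbm" engine

spec <- boost_tree() %>%
  set_engine("lightgbm") %>%
  set_mode("regression")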

Is there an R function by which I can extract GAM summary output as a table to be presented in a document?

I have run a GAM and obtained a summary output. I want to extract the output of summary into a table, so that I can use it in a document.
Let's say your gam model is called gamfit.
You can use l <- as.list(summary(gamfit)) to store this in a list.
Then access the parts with l[[1]], l[[2]], ...
The problem with this is that, depending on how many variables are used in the analysis, the data you want may move around and the list items aren't named. If the model isn't going to change, then it's fine.
Also note that all the information about the model is stored in gamfit.
Explore this in the Rstudio Environment window or at the command line with names(gamfit), followed by gamfit$name_of_interest.
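A minimal sketch, assuming the GAM was fit with mgcv (the mtcars model below is a toy example; broom is an alternative that returns data frames directly):

library(mgcv)
library(broom)

gamfit <- gam(mpg ~ s(hp) + wt, data = mtcars)

s <- summary(gamfit)
s$p.table   # parametric coefficients as a matrix
s$s.table   # smooth-term statistics as a matrix

tidy(gamfit)                     # smooth terms as a data frame
tidy(gamfit, parametric = TRUE)  # parametric terms as a data frame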

How to save my trained Random Forest model and apply it to test data files one by one?

This is a long shot and more of a code-design question from a rookie like me, but I think it has real value for real-world applications.
The core questions are:
Can I save a trained ML model, such as Random Forest (RF), in R and call/use it later without the need to reload all the data used for training it?
When, in real life, I have a massive folder of hundreds or thousands of data files to be tested, can I load that saved model in R and have it read the unknown files one by one (so I am not limited by RAM size), perform regression/classification analysis on each file as it is read in, and store ALL the output together in one file?
For example,
If I have 100,000 csv files of data in a folder, I want to use 30% of them as the training set and the rest as the test set for a Random Forest (RF) classification.
I can select the files of interest and call them "control files". Then I use fread(), randomly sample 50% of the data in those files, call the caret or randomForest library, and train my "model":
model <- train(x = x, y = y, method = "rf")
Now can I save the model somewhere? So I don't have to load all the control files each time I want to use the model?
Then I want to apply this model to all the remaining csv files in the folder, and I want it to read those csv files one by one when applying the model, instead of reading them all in at once, due to RAM constraints.
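A minimal sketch of both steps (the rf_model.rds and data_folder names are placeholders; it assumes the test files share the columns the model was trained on):

library(data.table)

saveRDS(model, "rf_model.rds")        # persist the fitted model

# Later, in a fresh session, no training data needed:
model <- readRDS("rf_model.rds")

test_files <- list.files("data_folder", pattern = "\\.csv$", full.names = TRUE)
results <- vector("list", length(test_files))
for (i in seq_along(test_files)) {
  dt <- fread(test_files[i])          # read one file at a time (RAM-friendly)
  results[[i]] <- data.table(file = test_files[i],
                             pred = predict(model, newdata = dt))
  rm(dt)                              # free memory before the next file
}
fwrite(rbindlist(results), "all_predictions.csv")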

How to extract variables from tab_model in R to create new data frame?

[Image: example output of tab_model]
I have created a table from tab_model that includes multiple models and wish to extract all p-values and Estimates/Odds Ratios to create a data frame that includes them. The output of tab_model is an HTML file, and I am unable to find a function to pull this information from it. Any ideas on how I could do this?
For example, I want to retrieve all p-values and Estimates for the variable 'age' in all of my models (only 3 in the example image, but I have hundreds).
You should get these values from the regression models themselves, instead of outputting them to an HTML table and then extracting them from it.
Without further knowledge of your process and data it is difficult to provide a more concrete answer.
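One hedged sketch, assuming the models are lm/glm fits kept in a named list (the list and the "age" term are placeholders):

library(broom)
library(dplyr)
library(purrr)

models <- list(m1 = m1, m2 = m2, m3 = m3)   # your fitted models, named

age_results <- imap_dfr(models, function(m, name) {
  tidy(m, exponentiate = TRUE) %>%  # odds ratios for logistic glm; drop for lm
    mutate(model = name)
}) %>%
  filter(term == "age") %>%
  select(model, estimate, p.value)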

Generating variable names for dataframes based on the loop number in a loop in R

I am working on developing and optimizing a linear model using the lm() function and subsequently the step() function for optimization. I have added a variable to my dataframe using a random generator of 0s and 1s (50% chance each). I use this variable to subset the dataframe into a training set and a validation set; if a record is not assigned to the training set, it is assigned to the validation set. Using these subsets, I am able to estimate how good the fit of the model is (by using the predict function on the records in the validation set and comparing the predictions to the original values). I am interested in the coefficients of the optimized model and in the results of the KS test between the distributions of the predicted and actual results.
All of my code was working fine, but when I wanted to test whether my model is sensitive to the subset that I chose I ran into some problems. To do this I wanted to create a for (i in 1:10) loop, each time using a different random subset. This turned out to be quite a challenge for me (I have never used a for loop in R before).
Here's the problem (well actually there are many problems, but here is one of them):
I would like to have separate dataframes for each run in the loop with a unique name (for example: Run1, Run2, Run3). I have been able to create a variable with different strings using paste("Run", 1:10, sep=""), but that just gives me a vector of strings. How do I use these strings as names for my (subsetted) dataframes?
Another problem that I expect to encounter:
Subsequently I want to use the fitted coefficients for each run and export these to Excel. By using coef() I have been able to retrieve the coefficients; however, the number of coefficients included in the model may change per simulation run because of the optimization algorithm. This will almost certainly give me some trouble with pasting them into the same dataframe. Any thoughts on that?
Thanks for helping me out.
For your first question:
You can create the strings as before, using
df.names <- paste("Run", 1:10, sep = "")
Then, create your for loop and do the following to give the data frames the names you want:
for (i in 1:10) {
  d.frame <- data.frame()  # create your data frame here
  assign(df.names[i], d.frame)
}
Now you will end up with ten data frames with ten different names.
For your second question about the coefficients:
As far as I can tell, these don't naturally fit into your data frame structure. You should consider using lists, as they allow different classes - in other words, for each run, create a list containing a data frame and a numeric vector with your coefficients.
Don't create objects with numbers in their names and then try to access them in a loop later using get, paste, and assign. The right way to do this is to store your elements in an R list object.
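A minimal sketch of that list-based approach (df, y, and the 50% split are placeholders for your own data and setup):

runs <- vector("list", 10)
for (i in 1:10) {
  idx <- sample(nrow(df), floor(nrow(df) / 2))          # random training split
  fit <- step(lm(y ~ ., data = df[idx, ]), trace = 0)   # optimized model
  runs[[i]] <- list(train = df[idx, ], coefs = coef(fit))
}

# Coefficients per run as one data frame, NA where a term was dropped:
coef_df <- dplyr::bind_rows(lapply(runs, function(r)
  as.data.frame(as.list(r$coefs), check.names = FALSE)))
write.csv(coef_df, "coefficients.csv")

Because dplyr::bind_rows fills columns absent in a given run with NA, this also handles the varying number of coefficients that step() produces.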
