Generating variable names for dataframes based on the loop number in a loop in R - r

I am working on developing and optimizing a linear model using the lm() function and subsequently the step() function for optimization. I have added a variable to my dataframe by using a random generator of 0s and 1s (50% chance each). I use this variable to subset the dataframe into a training set and a validation set If a record is not assigned to the training set it is assigned to the validation set. By using these subsets I am able to estimate how good the fit of the model is (by using the predict function for the records in the validation set and comparing them to the original values). I am interested in the coefficients of the optimized model and in the results of the KS-test between the distributions of the predicted and actual results.
All of my code was working fine, but when I wanted to test whether my model is sensitive to the subset that I chose I ran into some problems. To do this I wanted to create a for (i in 1:10) loop, each time using a different random subset. This turned out to be quite a challenge for me (I have never used a for loop in R before).
Here's the problem (well actually there are many problems, but here is one of them):
I would like to have separate dataframes for each run in the loop with a unique name (for example: Run1, Run2, Run3). I have been able to create a variable with different strings using paste(("Run",1:10,sep=""), but that just gives you a list of strings. How do I use these strings as names for my (subsetted) dataframes?
Another problem that I expect to encounter:
Subsequently I want to use the fitted coefficients for each run and export these to Excel. By using coef(function) I have been able to retrieve the coefficients, however the number of coefficients included in the model may change per simulation run because of the optimization algorithm. This will almost certainly give me some trouble with pasting them into the same dataframe, any thoughts on that?
Thanks for helping me out.

For your first question:
You can create the strings as before, using
df.names <- paste(("Run",1:10,sep="")
Then, create your for loop and do the following to give the data frames the names you want:
for (i in 1:10){
d.frame <- # create your data frame here
assign(df.name[i], d.frame)
}
Now you will end up with ten data frames with ten different names.
For your second question about the coefficients:
As far as I can tell, these don't naturally fit into your data frame structure. You should consider using lists, as they allow different classes - in other words, for each run, create a list containing a data frame and a numeric vector with your coefficients.

Don't create objects with numbers in their names, and then try and access them in a loop later, using get and paste and assign. The right way to do this is to store your elements in an R list object.

Related

Can I use xgboost global model properly, if I skip step_dummy(all_nominal_predictors(), one_hot = TRUE)?

I wanted to try xgboost global model from: https://business-science.github.io/modeltime/articles/modeling-panel-data.html
On smaller scale it works fine( Like wmt data-7 departments,7ids), but what if I would like to run it on 200 000 time series (ids)? It means step dummy creates another 200k columns & pc can't handle it.(pc can't handle even 14k ids)
I tried to remove step_dummy, but then I end up with xgboost forecasting same values for all ids.
My question is: How can I forecast 200k time series with global xgboost model and be able to forecast proper values for each one of the 200k ids.
Or is it necessary to put there step_ dummy in oder to create proper FC for all ids?
Ps:code should be the same as one in the link. Only in my dataset there are 50 monthly observations for each id.
For this model, the data must be given to xgboost in the format of a sparse matrix. That means that there should not be any non-numeric columns in the data prior to the conversion (with tidymodels does under the hood at the last minute).
The traditional method for converting a qualitative predictor into a quantitative one is to use dummy variables. There are a lot of other choices though. You can use an effect encoding, feature hashing, or others too.
I think that there is no proper answer to the question "how it would be possible to forecast 200k ts" properly. Global Models are the way to go here, but you need to experiment to find out, which models do not belong inside the global forecast model.
There will be a threshold, determined mostly by the length of the series, that you put inside the global model.
Keep in mind to use several global models, with different feature recipes.
If you want to avoid step_dummy function, use lightgbm from the bonsai package, which is considerably faster and more accurate.

What data structure in R is suitable to store models?

Say I'm training models:
lm.1, lm.2, ... , lm.100
I would like to refer back to these later in my code for various purposes: say to inspect coefficients, run test data against them, etc.
What data structure should I use to store them in?
A list is what I've been using but lists for some reason seem a bit unwieldy of all the data structures in R
It is certainly a list.
A complete lm object is a list; the only thing in R that can hold a list is just a list. We have no other option.
In some cases, even if the return of a model is just a vector, we can not guarantee the resulting vectors are of equal length for all models we try, so we still have to use a list.

Applying univariate coxph function to multiple covariates (columns) at once

First, I gathered from this link Applying a function to multiple columns that using the "function" function would perhaps do what I'm looking for. However, I have not been able to make the leap from thinking about it in the way presented to making it actually work in my situation (or really even knowing where to start). I'm a beginner in R so I apologize in advance if this is a really "newb" question. My data is a data frame that consists of an event variable (tumor recurrence) and a time variable (followup time/time to recurrence) as well as recurrence risk factors (t-stage, tumor size,age at dx, etc.). Some risk factors are categorical and some are continuous. I have been running my univariate analysis by hand, one at a time like this example univariateageatdx<-coxph(survobj~agedx), and then collecting the data. This gets very tedious for multiple factors and doing it for a few different recurrence types. I figured there must be a way to code such that I could basically have one line of code that had the coxph equation and then applied it to all of my variables of interest and spit out a result that had the univariate analysis results for each factor. I tried using cbind to bind variables (i.e x<-cbind("agedx","tumor size") then running cox coxph(recurrencesurvobj~x) but this of course just did the multivariate analysis on these variables and didn't split them out as true univariate analyses.
I also tried the following code based on a similar problem that I found on a different site, but it gave the error shown and I don't know quite what to make of it. Is this on the right track?
f <- as.formula(paste('regionalsurvobj ~', paste(colnames(nodcistradmasvssubcutmasR)[6-9], collapse='+')))
I then ran it has coxph(f)
Gave me the results of a multivariate cox analysis.
Thanks!
**edit: I just fixed the error, I needed to use the column numbers I suppose not the names. Changes are reflected in the code above. However, it still runs the variables selected as a multivariate analysis and not as the true univariate analysis...
If you want to go the formula-route (which in your case with multiple outcomes and multiple variables might be the most practical way to go about it) you need to create a formula per model you want to fit. I've split the steps here a bit (making formulas, making models and extracting data), they can off course be combined this allows you to inspect all your models.
#example using transplant data from survival package
#make new event-variable: death or no death
#to have dichot outcome
transplant$death <- transplant$event=="death"
#making formulas
univ_formulas <- sapply(c("age","sex","abo"),function(x)as.formula(paste('Surv(futime,death)~',x))
)
#making a list of models
univ_models <- lapply(univ_formulas, function(x){coxph(x,data=transplant)})
#extract data (here I've gone for HR and confint)
univ_results <- lapply(univ_models,function(x){return(exp(cbind(coef(x),confint(x))))})

Trying to apply formula to each column in R, how to feed data to formula?

So I'm trying to apply an exponential smoothing model to each column in a data frame called 'cities'. I have used apply to identify the data frame, go by columns, and I thought to run the model. However, when I try to do so, it tells me that I need to specify data for the exponential smoothing model...I thought I already had by putting it in the apply loop.
apply(x=cities,2,FUN=HoltWinters(x=x,gamma=FALSE))
Also, eventually I'd like to predict the next 4 periods using the HW model developed using forecast.predict. Do I need to use a different loop or can I combine it all in this one?
FUN takes a function, but you're trying to give it the output of a function.
Try this:
apply(cities, 2, FUN=function(x) HoltWinters(x=x,gamma=FALSE))

function for removing nonsignificant variables at one step in R

I am trying to automate logistic regression in R.
Basically, my source code will generate a new equation everyday as the input data is updated,
(Variables, data format etc are same) and print out te significant variables with corresponding coefficients.
When I use step function, sometimes the resulting coefficients are not significant. Therefore, I want to update my set of coefficients and get rid of all the ones that are not significant enough.
Is there a function or automated way of doing it?
If not, the only way I can think of is writing a script on another language that takes the coefficients and corresponding P value and checking significance, and rerunning R accordingly. But even for that, do you know how can I get only P values and coefficients of variables. I can either print whole summary of regression result with "summary" function. I can't reach only P values.
Thank you very much
It's a bit hard for me without sample code and data, but you can subset based on variable values like this,
newdata <- data[ which(data$p.value < 0.5), ]
You can inspect your R object using str, see ?str to figure out how to select whatever you want to use in your subset $p.value or $residuals.
If this doesn't answer your question try submitting some sample code and data.
Best,
Eric

Resources