Recombining data frames in R without using row.names

I start with a data.frame (or a data_frame) containing my dependent Y variable for analysis, my independent X variables, and some "Z" variables -- extra columns that I don't need for my modeling exercise.
What I would like to do is:
Create an analysis data set without the Z variables;
Break this data set into random training and test sets;
Find my best model;
Predict on both the training and test sets using this model;
Recombine the training and test sets by rows; and finally
Recombine these data with the Z variables, by column.
It's the last step, of course, that presents the problem -- how do I make sure that the rows in the recombined training and test sets match the rows in the original data set? We might try to use the row.names attribute from the original set, but I agree with Hadley that this is an error-prone kludge (my words, not his) -- why have a special column that's treated differently from all other data columns?
One alternative is to create an ID column that uniquely identifies each row, and then keep this column around when dividing into the train and test sets (but excluding it from all modeling formulas, of course). This seems clumsy as well, and would make all my formulas harder to read.
This must be a solved problem -- could people tell me how they deal with this? Especially using the plyr/dplyr/tidyr package framework?
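One way to make the ID-column alternative workable without cluttering the model formulas is to add the key, leave it out of the formulas, and join on it at the very end. A minimal dplyr sketch (the .row_id name, the 70/30 split, and selecting the Z variables via starts_with("Z") are all illustrative assumptions):

library(dplyr)

# Tag each row with a unique key before anything is split off
df <- df %>% mutate(.row_id = row_number())

# Analysis set without the Z variables (illustrative selection)
analysis <- df %>% select(-starts_with("Z"))

train <- analysis %>% sample_frac(0.7)
test  <- analysis %>% anti_join(train, by = ".row_id")

# ... fit the model on train (leave .row_id out of the formula), predict on both ...

# Recombine the rows, then reattach the Z variables by joining on the key
recombined <- bind_rows(train, test) %>%
  left_join(df %>% select(.row_id, starts_with("Z")), by = ".row_id") %>%
  arrange(.row_id)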

Partition data while preserving groups with caret

Apologies for the cross-stack post; I wasn't sure whether this was more appropriate for Stack Overflow or for Cross Validated. I initially posted on the latter, but realized this might be the better place.
So, I have a dataset with many rows of individuals, each with a unique individual ID.
For each individual, there is also a column indicating which household that person belongs to, given as a unique householdID.
Finally, there is a Target variable for each row, which is what I will be trying to predict. There are also columns with various other features.
My question is: as household membership is important, is there a way to partition the data into train and test sets such that all the people belonging to the same household are kept together and not randomly split over both sets (i.e., any given householdID should not appear in both sets)? And is it also possible to split the households over both train and test sets while keeping the Target variable balanced?
Using the createDataPartition function in caret, I've managed to get a balanced Target in both train and test when I set y = Target, and I've managed to separate the households cleanly over both sets when I set y = unique(householdID), but I can't figure out whether there's a way to get both of these results at the same time.
I'm pretty flat out of ideas, so any suggestions would be most welcome!
Thanks!
groupKFold is the way to go. But instead of using data$Target you need to split on data$householdID (or whatever your household ID column is named). This ensures that all members of a group end up in the same fold.
After this you can use the folds in trainControl to model on data$Target.
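A minimal sketch of that wiring (the model method and formula are illustrative; the point is passing the groupKFold result to trainControl's index argument):

library(caret)

# Build folds so that every row sharing a householdID lands in the same fold
folds <- groupKFold(data$householdID, k = 5)

ctrl <- trainControl(method = "cv", index = folds)

# Illustrative model call; swap in your own method and predictors
fit <- train(Target ~ ., data = data, method = "glm", trControl = ctrl)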

Adding multiple random forest models into a single data frame or data table in R

I am training multiple 'treebag' models in R. I loop through a data set, where in each iteration I define a specific subset based on a feature in the set and train on that subset. I could save each result to disk, but I was hoping to save all the models in a single data frame or data table. I am not sure if this is at all possible. The data frame/table could have columns of various classes (numeric and character); however, I would like to add a completed model as well.
To start, is it even possible to assign multiple models to a single column, where each model is assigned to a different row in a data frame or data table?
Any ideas on how this could work are greatly appreciated.
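It is possible: data frames (and data.tables) support list-columns, and a list-column can hold arbitrary objects, including fitted models. A minimal sketch using lm() for brevity (the same pattern holds for caret 'treebag' fits; the iris split and the subset_id name are illustrative):

library(tibble)

# Fit one model per subset (here: split iris by Species, purely for illustration)
models <- lapply(split(iris, iris$Species), function(d) {
  lm(Sepal.Length ~ Sepal.Width, data = d)
})

# A list-column stores one complete model object per row
model_tbl <- tibble(
  subset_id = names(models),
  model     = models
)

# Retrieve an individual model later:
model_tbl$model[[1]]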

Applying univariate coxph function to multiple covariates (columns) at once

First, I gathered from this link Applying a function to multiple columns that using the "function" function would perhaps do what I'm looking for. However, I have not been able to make the leap from thinking about it in the way presented there to making it actually work in my situation (or really even knowing where to start). I'm a beginner in R, so I apologize in advance if this is a really "newb" question.
My data is a data frame that consists of an event variable (tumor recurrence) and a time variable (follow-up time / time to recurrence), as well as recurrence risk factors (t-stage, tumor size, age at dx, etc.). Some risk factors are categorical and some are continuous.
I have been running my univariate analyses by hand, one at a time, like this example: univariateageatdx <- coxph(survobj ~ agedx), and then collecting the results. This gets very tedious for multiple factors, and more so when doing it for a few different recurrence types. I figured there must be a way to write one line of code that takes the coxph equation, applies it to all of my variables of interest, and spits out the univariate analysis results for each factor.
I tried using cbind to bind variables (i.e., x <- cbind("agedx", "tumor size") and then running coxph(recurrencesurvobj ~ x)), but this of course just did the multivariate analysis on these variables and didn't split them out as true univariate analyses.
I also tried the following code, based on a similar problem that I found on a different site, but it gave an error and I don't know quite what to make of it. Is this on the right track?
f <- as.formula(paste('regionalsurvobj ~', paste(colnames(nodcistradmasvssubcutmasR)[6:9], collapse=' + ')))
I then ran it as coxph(f).
This gave me the results of a multivariate Cox analysis.
Thanks!
**Edit: I just fixed the error; I needed to use the column numbers, I suppose, not the names. The changes are reflected in the code above. However, it still runs the selected variables as a multivariate analysis and not as true univariate analyses...
If you want to go the formula route (which in your case, with multiple outcomes and multiple variables, might be the most practical way to go about it), you need to create one formula per model you want to fit. I've split the steps here a bit (making formulas, making models, and extracting data); they can of course be combined, but this way you can inspect all your models.
#example using the transplant data from the survival package
library(survival)

#make a new event variable (death or no death) to have a dichotomous outcome
transplant$death <- transplant$event == "death"

#make one formula per covariate
univ_formulas <- sapply(c("age", "sex", "abo"),
                        function(x) as.formula(paste('Surv(futime, death) ~', x)))

#fit a list of univariate models
univ_models <- lapply(univ_formulas, function(x) coxph(x, data = transplant))

#extract data (here I've gone for HR and confint)
univ_results <- lapply(univ_models, function(x) exp(cbind(coef(x), confint(x))))
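If you then want all the results in a single table, the list can be stacked (a small follow-up sketch):

#combine the per-variable results into one matrix for inspection or export
do.call(rbind, univ_results)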

Running regression tree on large dataset in R

I am working with a dataset of roughly 1.5 million observations. I am finding that running a regression tree (I am using the mob()* function from the party package) on more than a small subset of my data takes extremely long (I can't run it on a subset of more than 50k obs).
I can think of two main problems that are slowing down the calculation
The splits are being calculated at each step using the whole dataset. I would be happy with results that chose the variable to split on at each node based on a random subset of the data, as long as it continues to replenish the size of the sample at each subnode in the tree.
The operation is not being parallelized. It seems to me that as soon as the tree has made its first split, it ought to be able to use two processors, so that by the time there are 16 splits each of the processors in my machine would be in use. In practice it seems like only one is being used.
Does anyone have suggestions on either alternative tree implementations that work better for large datasets or for things I could change to make the calculation go faster**?
* I am using mob(), since I want to fit a linear regression at the bottom of each node, to split up the data based on their response to the treatment variable.
** One thing that seems to be slowing down the calculation a lot is that I have a factor variable with 16 types. Calculating which subset of the variable to split on seems to take much longer than other splits (since there are so many different ways to group them). This variable is one that we believe to be important, so I am reluctant to drop it altogether. Is there a recommended way to group the types into a smaller number of values before putting it into the tree model?
My response comes from a class I took that used these slides (see slide 20).
The statement there is that there is no easy way to deal with categorical predictors with a large number of categories. Also, I know that decision trees and random forests will automatically prefer to split on categorical predictors with a large number of categories.
A few recommended solutions:
Bin your categorical predictor into fewer bins (that are still meaningful to you).
Order the predictor according to means (slide 20). This is my professor's recommendation, and in R it would mean using an ordered factor.
You also need to be careful about the influence of this categorical predictor. For example, one thing you can do with the randomForest package is set the mtry parameter to a lower number: this controls the number of variables the algorithm looks through for each split, so when it's set lower your categorical predictor will appear in fewer splits relative to the rest of the variables. This speeds up estimation, and the decorrelation that the randomForest method provides helps ensure you don't overfit to your categorical variable (see the sketch after this list).
Finally, I'd recommend looking at the MARS or PRIM methods. My professor has some slides on those here. I know that PRIM is known for having low computational requirements.
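A hedged sketch of two of these ideas, reordering a high-cardinality factor's levels by the mean response and lowering mtry (df, y, and big_factor are illustrative names):

library(randomForest)

# Reorder the factor's levels by the mean of the response and mark it ordered,
# so tree splits treat it as ordinal rather than trying every level grouping
lvl_means <- tapply(df$y, df$big_factor, mean)
df$big_factor <- factor(df$big_factor,
                        levels = names(sort(lvl_means)),
                        ordered = TRUE)

# A lower mtry makes the high-cardinality factor a candidate in fewer splits
fit <- randomForest(y ~ ., data = df, mtry = 2, ntree = 200)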

Generating variable names for dataframes based on the loop number in a loop in R

I am working on developing and optimizing a linear model using the lm() function and subsequently the step() function for optimization. I have added a variable to my dataframe by using a random generator of 0s and 1s (50% chance each). I use this variable to subset the dataframe into a training set and a validation set; if a record is not assigned to the training set, it is assigned to the validation set. By using these subsets I am able to estimate how good the fit of the model is (by using the predict function for the records in the validation set and comparing the predictions to the original values). I am interested in the coefficients of the optimized model and in the results of the KS test between the distributions of the predicted and actual results.
All of my code was working fine, but when I wanted to test whether my model is sensitive to the subset that I chose I ran into some problems. To do this I wanted to create a for (i in 1:10) loop, each time using a different random subset. This turned out to be quite a challenge for me (I have never used a for loop in R before).
Here's the problem (well actually there are many problems, but here is one of them):
I would like to have separate dataframes for each run in the loop, with a unique name (for example: Run1, Run2, Run3). I have been able to create the different strings using paste("Run", 1:10, sep=""), but that just gives me a character vector of strings. How do I use these strings as names for my (subsetted) dataframes?
Another problem that I expect to encounter:
Subsequently I want to take the fitted coefficients for each run and export them to Excel. By using coef() I have been able to retrieve the coefficients; however, the number of coefficients included in the model may change per simulation run because of the optimization algorithm. This will almost certainly give me some trouble when pasting them into the same dataframe. Any thoughts on that?
Thanks for helping me out.
For your first question:
You can create the strings as before, using
df.names <- paste("Run", 1:10, sep = "")
Then, create your for loop and do the following to give the data frames the names you want:
for (i in 1:10){
  d.frame <- ...  # create your data frame for run i here (placeholder)
  assign(df.names[i], d.frame)
}
Now you will end up with ten data frames with ten different names.
For your second question about the coefficients:
As far as I can tell, these don't naturally fit into your data frame structure. You should consider using lists, as they allow different classes - in other words, for each run, create a list containing a data frame and a numeric vector with your coefficients.
Don't create objects with numbers in their names and then try to access them in a loop later using get, paste, and assign. The right way to do this is to store your elements in an R list object.
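A minimal sketch of the list-based approach, assuming a data frame mydata with response y (the 50/50 split and the step() call mirror the setup described in the question):

runs <- vector("list", 10)

for (i in seq_along(runs)) {
  in_train <- runif(nrow(mydata)) < 0.5   # fresh random 50/50 split each run
  train <- mydata[in_train, ]
  valid <- mydata[!in_train, ]
  fit <- step(lm(y ~ ., data = train), trace = 0)
  runs[[i]] <- list(train = train,
                    valid = valid,
                    coefs = coef(fit))    # coefficient sets may differ in length
}

# Everything for run 3, no get()/paste() gymnastics needed:
runs[[3]]$coefs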
