How to take a stratified sample of a data set, fit models with caret, and repeat in R?

I have a data set from which I would like to draw a stratified sample, build statistical models using the caret package, and then generate predictions.
The problem I am finding is that in different iterations of the stratified data set I get significantly different results (this may be in part due to the relatively small data sample M=1000).
What I want to be able to do is:
1. Generate the stratified data sample
2. Create the machine learning model
3. Repeat 1000 times & take the average model output
I hope that by repeating the steps on the variations of the stratified data set, I am able to avoid the subtle changes in the predictions generated due to a smaller data sample.
For example, it may look something like this in R:
Original.Dataset = data.frame(A)
Stratified.Dataset = stratified(Original.Dataset, group = x)
Model = train(Stratified.Dataset.....other model inputs)
Repeat process with new stratified data set based on the original data and average out.
Thank you in advance for any help, or package suggestions that might be useful. Is it possible to stratify the sample in caret or simulate in caret?

First of all, welcome to SO.
It is hard to understand exactly what you are asking; your question is very broad.
If you need input on statistics, I suggest asking more clearly defined questions on Cross Validated.
Q&A for people interested in statistics, machine learning, data analysis, data mining, and data visualization.
The problem I am finding is that in different iterations of the
stratified data set I get significantly different results (this may be
in part due to the relatively small data sample M=1000).
I assume you are referring to different iterations of your model. How much the results vary depends on how large your groups are. For example, if you divide a data set of 1000 samples into groups of 10 samples, your model will very likely be unstable and give different results in each iteration. The variation can also arise because your model depends on some randomness; the smaller your data set (and the more groups you have), the larger the variation will be. See here or here for more information on cross validation, stability and bootstrap aggregating.
Generate the stratified data sample
How to generate it: the dplyr package is excellent at grouping data by different variables. You might also want to use the split function found in the base package. See here for more information. You could also use the built-in methods of the caret package, found here.
How to know how to split it: that depends very much on the question you would like to answer. Most likely you will want to balance some variables across the samples, e.g. gender and age when creating a model for predicting disease. See here for more info.
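As a rough illustration of the split-based approach, the sketch below draws the same fraction of rows from each stratum using only base R. The data frame, the column names and the helper stratified_sample are made up for the example; caret's createDataPartition is designed for this kind of split and could be used instead.

```r
# Toy data: a grouping variable and a numeric outcome (both invented).
set.seed(1)
dat <- data.frame(
  gender = rep(c("f", "m"), times = c(60, 40)),
  y      = rnorm(100)
)

# Hypothetical helper: sample the same fraction of rows within each stratum.
stratified_sample <- function(data, strata, frac) {
  # Row indices grouped by the levels of the stratification variable
  idx_by_group <- split(seq_len(nrow(data)), data[[strata]])
  picked <- lapply(idx_by_group, function(idx) {
    idx[sample.int(length(idx), size = round(frac * length(idx)))]
  })
  data[sort(unlist(picked, use.names = FALSE)), , drop = FALSE]
}

train <- stratified_sample(dat, "gender", frac = 0.5)
table(train$gender)  # keeps the 60/40 proportion of the full data
```

Sampling within each group separately is what keeps the group proportions identical between the sample and the original data.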
If you have e.g. replicated observations and want to create unique subsets containing different combinations of the replicates, each with its own measurements, you have to use other methods. If the replicates share a common identifier (here sample_names), you could do something like this to select all samples but with different combinations of the replicates:
tg <- data.frame(sample_names = rep(1:5, each = 2))
set.seed(10)
tg$values <- rnorm(10)
# build 100 partitions, each picking one of the two replicates per sample
partition <- lapply(1:100, function(z) {
  set.seed(z)
  sapply(unique(tg$sample_names), function(x) {
    which(x == tg$sample_names)[sample(1:2, 1)]
  })
})
# the first partition of your data to train a model
tg[partition[[1]], ]
Create the machine learning model
If you want to use caret, you could go to the caret webpage and see all the available models. Which type of model is appropriate depends on your research question and your data. I would therefore recommend taking some online machine learning courses, for instance the Stanford University course given by Andrew Ng (I have taken it myself), to get more familiar with the major algorithms. If you are already familiar with the algorithms, just search the list of available models.
Repeat 1000 times & take the average model output
You can repeat your model 1000 times with different seeds (see set.seed) and with different training methods, e.g. cross validation or bootstrap aggregation. There are a lot of different training parameters in the caret package:
The function trainControl generates parameters that further control
how models are created, with possible values:
method: The resampling method: "boot", "cv", "LOOCV", "LGOCV",
"repeatedcv", "timeslice", "none" and "oob"
For more information on the methods, see here.
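The loop the question describes (stratified sample, fit, repeat, average) can be sketched in base R as follows. This is only an illustration under assumptions: the simulated data, the 70% sampling fraction and the use of lm in place of a caret::train call are all choices made for the example, and n_rep is kept at 200 to run quickly where the question asks for 1000.

```r
set.seed(42)
n <- 1000
# Simulated stand-in for the original data set (all columns invented)
dat <- data.frame(
  group = sample(c("a", "b", "c"), n, replace = TRUE, prob = c(0.5, 0.3, 0.2)),
  x     = rnorm(n)
)
dat$y <- 2 + 3 * dat$x + rnorm(n)

new_data <- data.frame(x = c(-1, 0, 1))  # points to predict
n_rep <- 200

preds <- sapply(seq_len(n_rep), function(i) {
  # Stratified sample: 70% of the rows within each group
  idx <- unlist(lapply(split(seq_len(n), dat$group), function(g) {
    g[sample.int(length(g), size = floor(0.7 * length(g)))]
  }))
  fit <- lm(y ~ x, data = dat[idx, ])  # stand-in for the real model
  predict(fit, newdata = new_data)
})

avg_pred <- rowMeans(preds)      # average prediction per new point
sd_pred  <- apply(preds, 1, sd)  # spread across the repeated samples
```

The spread in sd_pred gives a direct view of how much the predictions move between stratified samples, which is the instability the question is worried about.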

Related

Imputing missing data for MLM in R

Maybe someone can help me with this question. I conducted a follow-up study and now have to deal with missing data. I am considering how best to impute the missing data for MLM in R (e.g. participants completed the follow-up 2 survey but not the follow-up 1 survey, so I am missing L1 predictors for my longitudinal analysis).
I read about Multiple Imputation of multilevel data using the pan package (Schafer & Yucel, 2002) and came across the following code:
imp <- panImpute(data, formula = fml, n.burn = 1000, n.iter = 100, m = 5)
Yet, I have trouble understanding it completely. Is there maybe another way to impute missing data in R? Or could somebody illustrate the imputation process in a bit more detail? That would be great! Do I have to conduct the imputation for every model I build in my MLM? (e.g. when I compare whether a random intercept versus a random intercept and random slope model fits my data better, do I have to use the imputation code for every model, or do I use it at the beginning of all my calculations?)
Thank you in advance
Is there maybe another way to impute missing data in R?
There are other packages. mice is the one that I normally use, and it does support multilevel data.
Do I have to conduct the imputation for every model I build in my MLM? (e.g. when I compare whether a random intercept versus a random intercept and random slope model fits my data better, do I have to use the imputation code for every model, or do I use it at the beginning of all my calculations?)
You have to specify the imputation model. Basically that means you have to tell the software which variables are predicted by which other variables. Since you are comparing models with the same fixed effects, and only changing the random effects (in particular comparing models with and without random slopes), the imputation model should be the same in both cases. So the workflow is:
1. perform the imputations;
2. run the model on all the imputed datasets;
3. pool the results (typically using Rubin's rules).
So you will need to do this twice, to end up with 2 sets of pooled results - one for each model. The software should provide functionality for doing all of this.
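To make the pooling step concrete, here is a minimal base R sketch of Rubin's rules for a single coefficient. The helper pool_rubin and the numbers fed to it are invented for illustration; in practice the imputation software's own pooling function would do this for every coefficient at once.

```r
# Hypothetical helper: pool one coefficient across m imputed-data fits.
pool_rubin <- function(estimates, variances) {
  m     <- length(estimates)
  q_bar <- mean(estimates)          # pooled point estimate
  u_bar <- mean(variances)          # within-imputation variance
  b     <- var(estimates)           # between-imputation variance
  total <- u_bar + (1 + 1 / m) * b  # total variance
  c(estimate = q_bar, se = sqrt(total))
}

# Made-up estimates and squared standard errors from m = 5 model fits
est <- c(0.52, 0.48, 0.55, 0.50, 0.45)
se2 <- c(0.010, 0.012, 0.011, 0.009, 0.010)

pooled <- pool_rubin(est, se2)
pooled
```

The pooled standard error is larger than the average of the individual ones because the between-imputation variance carries the extra uncertainty caused by the missing data.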
Having said all of that, I would advise against choosing your model based on fit statistics and instead use expert knowledge. If you have strong theoretical reasons for expecting slopes to vary by group, then include random slopes. If not, then don't include them.

R H2O - Cross-validation with stratified sampling and non i.i.d. rows

I'm using H2O to analyse a dataset but I'm not sure how to correctly perform cross-validation on it. I have an unbalanced dataset, so I would like to perform stratified cross-validation (where the output variable is used to balance the groups in each partition).
However, on top of that, many of my rows are repeats (a way of implementing weights without actually having weights). Independently of the source of this problem, I have seen before that, in some cases, you can do cross-validation where some rows have to be kept together. This seems to be the purpose of fold_column. However, is it not possible to do both at the same time?
If there is no H2O solution, how can I compute the folds a priori and use them in H2O?
Based on H2O-3 docs this can't be done:
Note that all three options are only suitable for datasets that are i.i.d. If the dataset requires custom grouping to perform meaningful cross-validation, then a fold_column should be created and provided instead.
One quick idea is using weights_column instead of duplicating rows. Then both balance_classes and weights_column are available together as parameters in GBM, DRF, Deep Learning, GLM, Naïve-Bayes, and AutoML.
Otherwise, I suggest the following workflow, performed in R or H2O on your data, to achieve both fold assignment and consistency of duplicates between folds:
take the original dataset (no repeats in the data yet)
divide it into 2 sets based on the outcome field (the one that is unbalanced): one for positive and one for negative (if it's multinomial then have as many sets as there are outcomes)
divide each set into N folds by assigning a new foldId column in both sets independently: this accomplishes stratified folds
combine (rbind) both sets back together
apply the row duplication process that implements weights (which will now preserve your fold assignments automatically).
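The workflow above could look roughly like this in base R. The toy data, the reps column (how many copies each row should get) and the fold count are assumptions made for the sketch; the resulting foldId column is what would then be supplied as the fold_column.

```r
set.seed(7)
# Original data, one row per unique observation (all columns invented);
# `reps` is the number of copies each row gets in the duplication step.
dat <- data.frame(
  id      = 1:100,
  outcome = rep(c(1, 0), times = c(20, 80)),
  reps    = sample(1:3, 100, replace = TRUE)
)
n_folds <- 5

# Assign folds within each outcome class independently (stratified folds)
dat$foldId <- NA_integer_
for (cls in unique(dat$outcome)) {
  rows <- which(dat$outcome == cls)
  dat$foldId[rows] <- sample(rep_len(seq_len(n_folds), length(rows)))
}

# Duplicate rows only afterwards, so every copy inherits its fold
expanded <- dat[rep(seq_len(nrow(dat)), times = dat$reps), ]

table(dat$outcome, dat$foldId)  # each class is spread evenly over the folds
```

Because the fold is assigned before the duplication, all copies of a row necessarily end up in the same fold.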

What exactly does complete in mice do?

I am researching how to use multiple imputation results. The following is my understanding; please let me know if there are mistakes.
Suppose you have a data set with missing values, and you want to conduct a regression analysis. You may perform multiple imputation with m = 5, and for each imputed data set (5 imputed data sets now) you run a regression analysis, then "pool" the coefficient estimates from these m = 5 models via Rubin's rules (or use the pool() function in the mice package).
My question is that, in mice you have a function complete(), and the manual says you can extract completed data set by using complete(object).
But if I use mice for m = 5 times, does it still make sense to use complete()? Which imputation results will complete() get for me?
Also, does it make sense if I only use mice with m = 1? Thank you.
You probably overlooked that mice::complete() uses action=1 by default, which "returns the first imputed data set" (see ?mice::complete); on its own, that single data set is of little use.
You should definitely use action="long" to account for the "multiplicity" of the multiple imputation!
No, it makes no sense at all to use m=1 (apart from debugging), because every imputation is based on a random process and you have to pool the results (using any method whatsoever) to account for the variation. Often m>20 is recommended.
Basically, multiple imputation works as follows:
1. Create m imputation processes with a random component, to obtain m slightly different imputed data sets.
2. Analyze each imputed data set to get slightly different parameter estimates.
3. Combine results, calculating the variation in parameter estimates.
(Also see multiple-imputation-in-a-nutshell for a brief overview.)
When you use mice, you get an object that is not the imputed data set. You cannot perform operations on it directly without using the special functions in mice. If you want to extract the actual imputed datasets, you use complete, the output of which is a data.frame with one row per individual per imputation (if using the "long" format). If you are doing any analysis with your imputed data that cannot be performed within mice, you need to create this dataset first.
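To show what working with the "long" layout looks like, the sketch below builds such a data frame by hand and analyses each imputation separately. It does not call mice: the imputation step is a crude stand-in (random draws from the observed values) used only to obtain m different completed data sets, and the data are simulated. The .imp and .id column names follow the convention of the long format.

```r
set.seed(3)
# Simulated data with 10 missing values in x
dat <- data.frame(x = rnorm(50))
dat$y <- 1 + 2 * dat$x + rnorm(50)
dat$x[sample(50, 10)] <- NA

m <- 5
# Stack m completed data sets: one row per individual per imputation
long <- do.call(rbind, lapply(seq_len(m), function(i) {
  filled <- dat
  miss <- is.na(filled$x)
  # Crude stand-in for a real imputation model: draw from observed values
  filled$x[miss] <- sample(filled$x[!miss], sum(miss), replace = TRUE)
  data.frame(.imp = i, .id = seq_len(nrow(filled)), filled)
}))

# Analyse each imputed data set separately
fits   <- lapply(split(long, long$.imp), function(d) lm(y ~ x, data = d))
slopes <- sapply(fits, function(f) coef(f)[["x"]])
slopes  # one slightly different estimate per imputation
```

The m estimates differ because each imputation filled the gaps differently; that variation is what the pooling step has to account for.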

Is there a way to input a covariance matrix (or something like that) into lme4 in R?

I have a very large data set that I extract from a data warehouse. To download the data set to the box where I want to run lme4 takes a long time. I would like to know if I could process the data into a covariance matrix, download that data (which is much smaller), and use that as the data input to lme4. I have done something similar to this for multiple regression models using SAS, and am hoping I can create this type of input for lme4.
Thanks.
I don't know of any way to use the observed covariance matrix to fit an lmer model. But if the goal is to reduce data set size in order to speed up analysis, there may be simpler approaches. For example, if you don't need the conditional modes of the random effects, and you have a very large sample size, then you might try fitting the model to progressively larger subsets of the data until the estimates of the fixed effects and the covariance matrix of the random effects 'stabilize'. This approach has worked well in my experience, and has been discussed by others:
http://andrewgelman.com/2012/04/hierarchicalmultilevel-modeling-with-big-data/
Here's another quotation:
"Related to the “multiple model” approach are simple approximations that speed the computations. Computers are getting faster and faster—but models are getting more and more complicated! And so these general tricks might remain important. A simple and general trick is to break the data into subsets and analyze each subset separately. For example, break the 85 counties of radon data randomly into three sets of 30, 30, and 25 counties, and analyze each set separately." Gelman and Hill (2007), p.547.

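The idea of fitting progressively larger subsets until the estimates settle can be sketched as follows. This is a simplified illustration with simulated data, and an ordinary lm fit stands in for the mixed model; with lme4 one would track the fixed effects and the random-effect covariance estimates in the same way.

```r
set.seed(11)
n <- 20000
# Simulated stand-in for the large data set
big <- data.frame(x = rnorm(n))
big$y <- 0.5 + 1.5 * big$x + rnorm(n)

sizes    <- c(500, 1000, 2000, 4000, 8000)
shuffled <- sample(n)  # random row order, so each subset is a random sample

# Fit the model on nested subsets of increasing size
slopes <- sapply(sizes, function(s) {
  sub <- big[shuffled[seq_len(s)], ]
  coef(lm(y ~ x, data = sub))[["x"]]
})

# Change in the estimate from one subset size to the next
data.frame(size = sizes, slope = slopes, change = c(NA, abs(diff(slopes))))
```

Once the change column stays below a tolerance that matters for the analysis, downloading and fitting more data adds little.
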
R knn large dataset

I'm trying to use knn in R (I have tried several packages: knnflex, class) to predict the probability of default based on 8 variables. The dataset is about 100k lines of 8 columns, but my machine seems to be having difficulty with a sample of 10k lines. Any suggestions for doing knn on a dataset > 50 lines (i.e. iris)?
EDIT:
To clarify there are a couple issues.
1) The examples in the class and knnflex packages are a bit unclear and I was curious if there was some implementation similar to the randomForest package where you give it the variable you want to predict and the data you want to use to train the model:
RF <- randomForest(x, y, ntree, type,...)
then turn around and use the model to predict data using the test data set:
pred <- predict(RF, testData)
2) I don't really understand why knn wants training AND test data for building the model. From what I can tell, the package creates a matrix of roughly nrow(trainingData)^2 elements, which also seems to be an upper limit on the size of the predicted data. I created a model using 5000 rows (above that number I got memory allocation errors) and was unable to predict test sets > 5000 rows. Thus I would need to either:
a) find a way to use > 5000 lines in a training set
or
b) find a way to use the model on the full 100k lines.
The reason that knn (in class) asks for both the training and test data is that if it didn't, the "model" it would return would simply be the training data itself.
The training data is the model.
To make predictions, knn calculates the distance between a test observation and each training observation (although I suppose there are some fancy versions for insanely large data sets that don't check every distance). So until you have test observations, there isn't really a model to build.
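A hand-written version makes this visible. The function below is only a sketch, not the implementation used in class; knn_predict and its arguments are names invented for the example. Note that there is no training step at all: the training data is simply passed along to the prediction.

```r
# Hypothetical minimal k-nearest-neighbour classifier (Euclidean distance).
knn_predict <- function(train_x, train_y, test_x, k = 3) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(row) {
    # Distance from this test observation to every training observation
    d <- sqrt(colSums((t(train_x) - row)^2))
    nearest <- order(d)[seq_len(k)]
    # Majority vote among the k nearest training labels
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  })
}

set.seed(5)
test_rows <- sample(nrow(iris), 30)
pred <- knn_predict(iris[-test_rows, 1:4], iris$Species[-test_rows],
                    iris[test_rows, 1:4], k = 5)
mean(pred == iris$Species[test_rows])  # share of correct predictions
```

Because each test row is handled on its own, memory use here grows with the size of the training set rather than with training size times test size, so a large test set can be predicted in batches.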
The ipred package provides functions that appear structured as you describe, but if you look at them, you'll see that there is basically nothing happening in the "training" function. All the work is in the "predict" function. And those are really intended as wrappers to be used for error estimation using cross validation.
As far as limitations on the number of cases, that will be dependent on how much physical memory you have. If you're getting memory allocation errors, then you either need to reduce your RAM usage elsewhere (close apps, etc), buy more RAM, buy a new computer, etc.
The knn function in class runs fine for me with training and test data sets of 10k rows or more, although I have 8 GB of RAM. Also, I suspect that knn in class will be faster than in knnflex, but I haven't done extensive testing.

Resources