predict on test data in random forest survival analysis - r

I'm using predict() on a real-data cohort with a model trained using the randomForestSRC package in R. The real-data cohort has missing values, whereas the training set that produced the model does not.
pred_cohort <- predict(model, cohort, na.action = "na.impute")
Since the cohort is a small set (only 8 observations), the factors have fewer levels than in the training set that produced the model. I noticed by coincidence that if I set the levels of the real-data cohort to the levels stored in the model (see code example below), I get different predictions than if I do not. Why is that?
levels(cohort$var1) <- levels(model$xvar$var1)
I also noticed that the imputed values for the missing cells differ depending on whether I force the levels of the real data to be the levels of the model (as in the code above) or leave the levels as they are.
The question is: is this a bug? If not, which approach is preferable, and why?
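For what it's worth, a safer pattern than assigning with levels() directly is to rebuild each factor in the new data with the training levels before predicting. A minimal sketch, assuming the model and cohort objects from the question (an rfsrc model stores its training predictors in model$xvar; any cohort value not seen in training becomes NA and is then handled by na.impute):
# Align every factor in the cohort with the levels the model was trained on
for (v in names(model$xvar)) {
  if (is.factor(cohort[[v]])) {
    cohort[[v]] <- factor(cohort[[v]], levels = levels(model$xvar[[v]]))
  }
}
pred_cohort <- predict(model, cohort, na.action = "na.impute")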

Related

How to handle missing values (NA's) in a column in lmer

I would like to use na.pass for na.action when working with lmer. Some columns of the data set contain NA values in some observations. I just want to control for the variables that contain the NAs. It is very important that the size of the data set stays the same after controlling for these fixed effects.
I think I have to work with na.action in lmer(). I am using the following model:
baseline_model_0 <- lmer(formula = log_life_time_income_child ~ nationality_dummy +
  sex_dummy + region_dummy + political_position_dummy + (1|Family), data = baseline_df)
This fails with:
Error in qr.default(X, tol = tol, LAPACK = FALSE) :
NA/NaN/Inf in foreign function call (arg 1)
My data: as you can see below, there are quite a lot of NAs in all the control variables, so "throwing away" all of these observations is not an option!
One example:
nat_dummy
1 : 335
2 : 19
NA's: 252
My questions:
1.) How can I include all of my control variables (expressed in multiple columns) in the model without kicking out observations (expressed in rows)?
2.) How does lmer handle the missing values in all the columns?
To answer your second question: lmer typically uses maximum likelihood, which will estimate missing values of the dependent variable but kick out rows with missing values in your predictors. To avoid this, as others have suggested, you can use multiple imputation instead. I demonstrate below an example with the airquality dataset native to R, since you don't have your data included in your question. First, load the necessary libraries: lmerTest for fitting the regression, mice for imputation and broom.mixed for summarizing the results.
#### Load Libraries ####
library(lmerTest)
library(mice)
library(broom.mixed)
We can inspect the missing patterns with the next code:
#### Missing Patterns ####
md.pattern(airquality)
Which gives us this nice plot of all the missing data patterns. For example, you may notice that we have two observations that are missing both Ozone and Solar.R.
To fill in the gaps, we can impute the data with 5 imputations (the default, so you don't have to include the m = 5 part, but I specify it explicitly for your understanding).
#### Impute ####
imp <- mice(airquality, m = 5)
After that, you run your model on the imputations like below. The with() function takes your imputed data and runs the regression model on each imputation. This model is a bit off and comes back singular, but I just use it because it's the quickest dataset with missing values that I could remember.
#### Fit With Imputations ####
fit <- with(imp,
lmer(Solar.R ~ Ozone + (1|Month)))
From there you can pool and summarize your results like so:
#### Pool and Summarise ####
pool <- pool(fit)
summary(pool)
Obviously, with the model being singular, this would be meaningless, but with a properly fit model your summary should look something like this:
term estimate std.error statistic df p.value
1 (Intercept) 151.9805678 12.1533295 12.505262 138.8303 0.000000000
2 Ozone 0.8051218 0.2190679 3.675216 135.4051 0.000341446
As Ben already mentioned, you also need to determine why your data are missing. If there are non-random reasons for the missingness, it requires some consideration, as it can bias your imputations/model. I really recommend the mice vignettes here as a gentle introduction to the topic:
https://www.gerkovink.com/miceVignettes/
Edit
You asked in the comments about adding in random-effects estimates. I'm not sure why this isn't already ported into the respective packages, but the mitml package can help fill that gap. Here is the code:
#### Load Library and Get All Estimates ####
library(mitml)
testEstimates(as.mitml.result(fit),
extra.pars = T)
Which gives you both fixed and random effects for imputed lmer objects:
Call:
testEstimates(model = as.mitml.result(fit), extra.pars = T)
Final parameter estimates and inferences obtained from 5 imputed data sets.
Estimate Std.Error t.value df P(>|t|) RIV FMI
(Intercept) 146.575 14.528 10.089 68.161 0.000 0.320 0.264
Ozone 0.921 0.254 3.630 90.569 0.000 0.266 0.227
Estimate
Intercept~~Intercept|Month 112.587
Residual~~Residual 7274.260
ICC|Month 0.015
Unadjusted hypothesis test as appropriate in larger samples.
And if you just want to pull the random effects, you can use testEstimates(as.mitml.result(fit), extra.pars = T)$extra.pars instead, which gives you just the random effects:
Estimate
Intercept~~Intercept|Month 1.125872e+02
Residual~~Residual 7.274260e+03
ICC|Month 1.522285e-02
Unfortunately there is no easy answer to your question; using na.pass doesn't do anything smart, it just lets the NA values go forward into the mixed-model machinery, where (as you have seen) they screw things up.
For most analysis types, in order to deal with missing values you need to use some form of imputation (using a model of some kind to fill in plausible values). If you only care about prediction without confidence intervals, you can use some simple single imputation method such as replacing NA values with means. If you want to do inference (compute p-values/confidence intervals), you need multiple imputation, i.e. generating multiple data sets with imputed values drawn differently in each one, fitting the model to each data set, then pooling estimates and confidence intervals appropriately across the fits.
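To make the contrast concrete, here is a minimal sketch of the single-imputation route (column means), which is only defensible for pure prediction; df is a placeholder for your data frame:
# Replace NAs in each numeric column by that column's mean (single imputation)
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
df_imputed <- as.data.frame(lapply(df, function(col) {
  if (is.numeric(col)) impute_mean(col) else col
}))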
mice is the standard/state-of-the-art R package for multiple imputation: there is an example of its use with lmer here.
There are a bunch of questions you should ask, and understand the answers to, before you embark on any kind of analysis with missing data:
what kind of missingness do I have ("completely at random" [MCAR], "at random" [MAR], "not at random" [MNAR])? Can my missing-data strategy lead to bias if the data are missing not-at-random?
have I explored the pattern of missingness? Are there subsets of rows/columns that I can drop without much loss of information? (e.g. if some columns or rows have mostly missing information, imputation won't help very much)
mice has a variety of imputation methods to choose from. It won't hurt to try out the default methods when you're getting started (as in @ShawnHemelstrand's answer), but before you go too far you should at least make sure you understand which methods mice is using on your data, and that the defaults make sense for your case.
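For instance, using the imp object from the earlier answer, you can inspect the choices mice made before trusting them:
# Which univariate imputation method mice picked for each column
imp$method
# Which variables it will use to impute which (rows = imputation targets)
imp$predictorMatrix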
I would strongly recommend the relevant chapter of Frank Harrell's Regression Modeling Strategies, if you can get ahold of a copy.

How to make predictions using an LDA (Linear discriminant analysis) model in R

As the title suggests, I am trying to make predictions using an LDA model in R. I have two sets of data: the first is a series of entries associated with 16 predictor variables and 1 outcome variable (the outcome variable is a "group" that I have assigned to each entry myself); the second also consists of entries associated with the same 16 predictor variables, but with no outcome variable. What I would like to do is predict the group membership of the entries in the second set of data.
So far I've successfully managed to create an LDA model by separating the first dataset into a "training set" and a "test set". However, now that I have the model, I don't know how to go about predicting the group membership of the entries in my second data set.
Thanks for the help! Please let me know if any more information is required, this is my first post on stack overflow so I am still learning the ropes.
Short example based on An Introduction to Statistical Learning, chapter 4. Say you have fitted a model lda_model on a training_set, with dependent variable Group, which you aim to predict, and predictors Predictor1 and Predictor2:
library(MASS)
lda_model <- lda(Group ~ Predictor1 + Predictor2, data = training_set)
You can then make predictions with the lda_model using the predict function on the testing_set
lda_predictions <- predict(lda_model, testing_set)
lda_predictions then holds, in $posterior, the posterior probability that each observation belongs to Group j.
You could then apply a threshold of, for instance (but not limited to), 50% probability. E.g.
sum(lda_predictions$posterior[, 7] >= .5)
returns the number of observations for which the probability of belonging to the 7th group is larger than 50%.
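Note also that if you just want hard class assignments rather than thresholded posteriors, predict() already returns them (using the objects above):
# The class with the highest posterior probability for each observation
head(lda_predictions$class)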

Regression model fails with factors that have a large number of levels

I have mixed data (both quantitative and categorical) predicting a quantitative variable, and I converted the categorical data into factors before feeding it into a glm model in R. Most of my categorical variables have more than 150 levels, and when I feed them to the glm model it fails with memory issues because of these many-level factors. We could put a threshold and accept only the variables with up to a certain number of levels, but I need to include these high-cardinality factors in the model. Is there any methodology to address this issue?
Edit: The dataset has 120000 rows and 50 columns. When the data is expanded with model.matrix there are 4772 columns.
If you have a lot of data, the easiest thing to do is sample from your matrix/data frame, then run the regression.
Given sampling theory, we know that the standard error of a proportion p is equal to sqrt(p(1-p)/n). So if you have 150 levels, assuming that observations are evenly distributed across levels, we would want to be able to detect proportions as small as .005 or so in your data set. So if we take a 10,000-row sample, the standard error of one of those factor-level proportions is roughly:
sqrt((.005*.995)/10000) = 0.0007053368
That's really not all that much additional variance added to your regression estimate. Especially when you are doing exploratory analysis, sampling the rows of your data, say a 12,000-row sample, should still give you plenty of data to estimate quantities while making estimation feasible. Reducing your rows by a factor of 10 should also help R do the estimation without running out of memory. Win-win.
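A minimal sketch of the sampling approach, where dat and the outcome y are placeholders for your data:
set.seed(1)                           # reproducible sample
idx <- sample(nrow(dat), 12000)       # keep roughly 10% of the 120,000 rows
fit <- glm(y ~ ., data = dat[idx, ])  # gaussian family by default for a quantitative outcome
summary(fit)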

Forecast future values for a time series using support vector machine

I am using support vector regression in R to forecast future values of a univariate time series. Splitting the historical data into train and test sets, I fit a model using the svm() function on the training data and then use the predict() command with the test data to predict values for the test set, from which we can compute prediction errors. I wonder what happens next: we have a model, and by checking it on the test data we see that the model performs well. How can I use this model to predict future values beyond the historical data? Generally speaking, we use the predict function in R and give it a forecast horizon (h = 12) to predict 12 future values. From what I've seen, the predict() command for an SVM has no such argument and instead needs a data set of predictors. How should I build such a data set for future time points that are not in our historical data?
Thanks
Just a stab in the dark... SVM on its own is a tool for classification (specifically supervised classification) rather than forecasting. I am guessing you are trying to predict stock values, no? How about classifying your existing data, using a window of your choice, say 100 values at a time, as noise (N), up (U), big up (UU), down (D), or big down (DD)? That way, as new data come in, you slide your classification frame and it tells you whether the upcoming trend is N, U, UU, D, or DD.
What you can do is build a data frame whose columns are the actual stock price and its n lagged values, and use it as your train/test set (the actual value is the output and the previous values are the explanatory variables). With this method you can make a one-step (one day, or whatever the granularity is) forecast into the future, and then feed that prediction back in to make the next one, and so on.
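A minimal sketch of that lagged-feature approach, assuming a numeric vector y holding the historical series and the e1071 package (the lag count and horizon are arbitrary choices):
library(e1071)

n_lags <- 3                            # how many past values to use as predictors
mat <- embed(y, n_lags + 1)            # column 1 is y_t, the rest are y_{t-1}, ..., y_{t-n_lags}
train <- data.frame(target = mat[, 1], mat[, -1])
fit <- svm(target ~ ., data = train)   # support vector regression on the lag features

# Recursive 12-step-ahead forecast: predict one step, slide the lag window, repeat
h <- 12
lags <- rev(tail(y, n_lags))           # most recent value first, matching column order above
preds <- numeric(h)
for (i in seq_len(h)) {
  newdata <- as.data.frame(t(lags))
  names(newdata) <- names(train)[-1]
  preds[i] <- predict(fit, newdata)
  lags <- c(preds[i], lags[-n_lags])   # new prediction becomes the most recent lag
}
preds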

Logistic Regression Model & Multicollinearity of Categorical Variables in R

I have a training dataset that has 3233 rows and 62 columns. The dependent variable is Happy (train$Happy), which is binary. The other 61 columns are categorical independent variables.
I've created a logistic regression model as follows:
logModel <- glm(Happy ~ ., data = train, family = binomial)
However, I want to reduce the number of independent variables that go into the model, perhaps down to 20 or so. I would like to start by getting rid of collinear categorical variables.
Can someone shed some light on how to determine which categorical variables are collinear, and what threshold I should use when removing a variable from a model?
Thank you!
If your variables were numeric, the obvious solution would be penalized logistic regression (lasso); in R it is implemented in glmnet.
With categorical variables, the problem is much more difficult.
I was in a similar situation, and I used the importance plot from the randomForest package to reduce the number of variables.
This would not help you find collinearity, but only rank the variables by importance.
You have only 60 variables, and maybe you have knowledge of the field, so you can try adding to your model some variables that make sense to you (like z = x1 - x3, if you think the difference x1 - x3 is important) and then rank them according to a random forest model.
You could use Cramer's V, or the related phi or contingency coefficient (see a great paper at http://www.harding.edu/sbreezeel/460%20files/statbook/chapter15.pdf), to measure collinearity among categorical variables. If two or more categorical variables have a Cramer's V value close to 1, it means they're highly "correlated" and you may not need to keep all of them in your logistic regression model.
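If it helps, here is a minimal sketch of pairwise Cramer's V computed from a chi-squared statistic; train, var1 and var2 are placeholders for your data frame and two of its factor columns:
# Cramer's V for two categorical variables: sqrt(chi^2 / (n * (min(r, c) - 1)))
cramers_v <- function(x, y) {
  tbl <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tbl, correct = FALSE)$statistic)
  n <- sum(tbl)
  k <- min(nrow(tbl), ncol(tbl))
  as.numeric(sqrt(chi2 / (n * (k - 1))))
}
cramers_v(train$var1, train$var2)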
