Intuitive problem with multiple imputation - r

I've read a lot about multiple imputation, and I find the explanations of the method on the internet not very comprehensive. I have some doubts about it and would be glad if you could help me with them.
Let's take the following code:
library(missForest)
library(mice)
# Take the iris data
data <- iris
# Randomly set 10% of the values to NA
iris.mis <- prodNA(iris, noNA = 0.1)
# Run multiple imputation
imputed_Data <- mice(iris.mis, m = 5, maxit = 50, method = 'pmm', seed = 500)
where:
m - the number of imputations made per missing observation (5 is typical; it generates 5 data sets with imputed/original values)
maxit - the number of iterations
method - as far as I understand, we use probable means
seed - the seed for the random number generator
As far as I understand, method = 'pmm' averages the results,
but I'm not able to understand what exactly happens when running that function. Can you please explain to me the algorithm we are dealing with? What exactly are m and maxit responsible for?

As jay.sf pointed out, PMM is predictive mean matching, an algorithm that tries to predict a "likely mean value" for the imputed metric based on the existing data in the other columns. This is similar to other advanced imputation methods, which do not simply take one constant value for each imputation (as, e.g., mean imputation does). This results in a different imputed value per case and has some inherent randomness/error.
To improve on that error and reduce variance, the imputation isn't done just once but multiple times.
m is the number of imputed variants that are produced, e.g. with m = 5 you get five imputed data sets with slight variations in their imputations.
maxit is the maximum number of iterations the algorithm is allowed to use to learn and improve, similar to the epochs/iterations of an ML model.
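In practice you then work with all m completed data sets rather than just one. A minimal sketch of the usual workflow with the imputed_Data object from above: inspect one completed data set, fit the same model on each of the five, and pool the estimates (Rubin's rules); the regression formula here is only an illustration.
library(mice)
# Look at the first of the m = 5 completed data sets
head(complete(imputed_Data, 1))
# Fit the same model on each of the m completed data sets ...
fits <- with(imputed_Data, lm(Sepal.Length ~ Petal.Length + Sepal.Width))
# ... and pool the m sets of estimates into one result
summary(pool(fits))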

Related

Cross-Validation in R using vtreat Package

Currently learning about cross validation through a course on DataCamp. They start the process by creating an n-fold cross validation plan. This is done with the kWayCrossValidation() function from the vtreat package. They call it as follows:
splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)
Then, they suggest running a for loop as follows:
dframe$pred.cv <- 0
# k is the number of folds
# splitPlan is the cross validation plan
for(i in 1:k) {
# Get the ith split
split <- splitPlan[[i]]
# Build a model on the training data
# from this split
# (lm, in this case)
model <- lm(fmla, data = dframe[split$train,])
# make predictions on the
# application data from this split
dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app,])
}
This results in a new column in the data frame with the predictions, per the last line of the above chunk of code.
My doubt is thus whether the predicted values in the data frame will in fact be averages over the 3 folds, or whether they will just be those of the 3rd run of the for loop.
Am I missing a detail here, or is this exactly what this code is doing, which would then defeat the purpose of 3-fold (or any-fold) cross-validation, as it would simply output the results of the last iteration? Shouldn't we be looking to output the average of all the folds, as laid out in the splitPlan?
Thank you.
I see there is some confusion about the scope of K-fold cross-validation. The idea is not to average predictions over different folds, but rather to average some measure of the prediction error, in order to estimate the test error.
First of all, as you are new on SO, note that you should always provide some data to work with. As your question is not data-contingent in this case, I just simulated some. Still, it is good practice that helps us help you.
Check the following code, which slightly modifies what you have provided in the post:
library(vtreat)
# Simulating data.
set.seed(1986)
X = matrix(rnorm(2000, 0, 1), nrow = 1000, ncol = 2)
epsilon = matrix(rnorm(1000, 0, 0.01), nrow = 1000)
y = X[, 1] + X[, 2] + epsilon
dta = data.frame(X, y, pred.cv = NA)
# Folds.
nRows = dim(dta)[1]
nSplits = 3
splitPlan = kWayCrossValidation(nRows, nSplits)
# Fitting model on all folds but i-th.
for(i in 1:nSplits)
{
# Get the i-th split.
split = splitPlan[[i]]
# Build a model on the training data from this split.
model = lm(y ~ ., data = dta[split$train, -4])
# Make predictions on the application data from this split.
dta$pred.cv[split$app] = predict(model, newdata = dta[split$app, -4])
}
# Now compute an estimate of the test error using pred.cv.
mean((dta$y - dta$pred.cv)^2)
What the for loop does is fit a linear model on all folds but the i-th (i.e., on dta[split$train, -4]), and then use the fitted model to make predictions on the i-th fold (i.e., dta[split$app, -4]). At least, I am assuming that split$train and split$app serve those roles, as the documentation is really lacking (which usually is a bad sign). Notice that I am removing the 4th column (dta$pred.cv), as it just pre-allocates memory to store all the predictions (it is not a feature!).
At each iteration we are not filling the whole of dta$pred.cv, but only a subset of it (corresponding to the rows of the i-th fold, stored each time in split$app). Thus, at the end that column stores the out-of-fold predictions from all K iterations.
This is where the real rationale for cross-validation comes in. Let me introduce the concepts of training, validation, and test set. In data analysis, the ideal is to have such a huge data set that we can divide it into three subsamples. The first can then be used to train the algorithms (fitting models), the second to validate the models (tuning them), and the third to choose the best model in terms of some performance measure (usually the mean squared error, or MSE, for regression).
However, we often do not have that many data points (especially if you are an economist). Thus, we seek an estimator of the test MSE, so that the need for splitting the data disappears. This is what K-fold cross-validation does: in turn, each fold is treated as the test set and the union of all the others as the training set. Then we make predictions as in your code (in the loop) and save them. What you are missing is the last line in the code I provided: the average of the squared errors across folds. That provides an estimate of the test MSE, and we choose the model yielding the lowest value.
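For completeness, here is a minimal sketch that computes the MSE fold by fold and then averages it, reusing dta, splitPlan and nSplits from the code above; up to differences in fold sizes it gives essentially the same estimate as the single line at the end of that code.
fold_mse = sapply(seq_len(nSplits), function(i)
{
  split = splitPlan[[i]]
  # Fit on all folds but the i-th, then predict on the i-th fold only.
  model = lm(y ~ ., data = dta[split$train, -4])
  preds = predict(model, newdata = dta[split$app, -4])
  mean((dta$y[split$app] - preds)^2)
})
# Average the per-fold MSEs to estimate the test MSE.
mean(fold_mse)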
That being said, I had never heard of the vtreat package before. If you are into data analysis, I suggest having a look at the tidyverse and caret packages. As far as I know (and from what I see here on SO), they are widely used and very well documented. They may be worth learning.

survfit for stratified cox-model

I have a stratified Cox model and want predicted survival curves for certain profiles, based on that model.
Now, because I'm working with a large dataset with a lot of strata, I want predictions for very specific strata only, to save time and memory.
The help-page of survfit.coxph states: ... If newdata does contain strata variables, then the result will contain one curve per row of newdata, based on the indicated stratum of the original model.
When I run the code below, where newdata does contain the stratum variable, I still get predictions for both strata, which contradicts the help page.
library(survival)
df <- data.frame(X1 = runif(200),
                 X2 = sample(c("A", "B"), 200, replace = TRUE),
                 Ev = sample(c(0, 1), 200, replace = TRUE),
                 Time = rexp(200))
testfit <- coxph(Surv(Time, Ev) ~ X1 + strata(X2), df)
out <- survfit(testfit, newdata = data.frame(X1 = 0.6, X2 = "A"))
Is there anything I fail to see or understand here?
I'm not sure if this is a bug or a feature in survival:::survfit.coxph. It looks like the intended behaviour in the code is that only requested strata are returned. In the function:
strata(X2) is evaluated in an environment containing newdata, and the result, A, is returned.
The full curve is then created.
There is then some logic to split the curve into strata, but only if result$surv is a matrix.
In your example it is not a matrix. I can't find any documentation on the expected usage of this, if it's not a bug. Perhaps it would be worth dropping the author/maintainer a note.
maintainer("survival")
# [1] "Terry M Therneau <xxxxxxxx.xxxxx#xxxx.xxx>"
Some comments that may be helpful:
My example was not big enough (and I seem not to have read the related GitHub post very carefully, but I only did so after posting my question here): if newdata has at least two rows (and, of course, the strata variable), predictions are returned only for the requested strata.
There is an inefficiency inside survfit.coxph, where the baseline hazard is calculated for every stratum in the original dataset, not only for the requested strata (see my contribution to the same GitHub post). However, that doesn't seem to be a big issue: a test on a dataset with roughly half a million observations, 50% events and 1000 strata takes less than a minute.
The problem is memory allocation somewhere during the calculations (in the above example, things collapse once I want predictions for 100 observations, 1 stratum each, while the final output of the predictions for 80 is only a few MB).
My work-around:
Select all observations you want predictions for
use lp <- predict(..., type='lp') to get the linear predictor for all these observations
use survfit only on the first observation: survfit(fit, newdata = expand_grid(newdf, strat = strata_list))
Store the resulting survival estimates in a data.frame (or not, that's up to you)
To calculate predicted survival for the other observations, use the PH assumption (see the formula below). This way the overhead of survfit.coxph is invoked only once, and if you only need survival at a few time points (e.g. 5- and 10-year survival), you can reduce the computation time even more.
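The formula referred to above is the proportional-hazards relation S(t | x_i) = S(t | x_ref)^exp(lp_i - lp_ref), which holds within a stratum. Below is a minimal sketch of how it can be used, shown without strata for simplicity and reusing df from the example above; fit_ph, newdf and surv_all are illustrative names only.
library(survival)
fit_ph <- coxph(Surv(Time, Ev) ~ X1, data = df)               # unstratified, for illustration only
newdf <- data.frame(X1 = c(0.2, 0.6, 0.9))                    # profiles we want curves for
lp <- predict(fit_ph, newdata = newdf, type = "lp")           # linear predictors
base <- survfit(fit_ph, newdata = newdf[1, , drop = FALSE])   # call survfit only once
# PH relation: S(t | x_i) = S(t | x_1)^exp(lp_i - lp_1)
surv_all <- sapply(exp(lp - lp[1]), function(r) base$surv ^ r)
# surv_all has one column per row of newdf, evaluated at the times in base$time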

How to obtain Brier Score in Random Forest in R?

I am having trouble getting the Brier score for my machine learning predictive models. The outcome "y" is categorical (1 or 0). The predictors are a mix of continuous and categorical variables.
I have created four models with different predictors; I will call them "model_1" to "model_4" here (apart from the predictors, the other parameters are the same). Example code for my model is:
library(randomForestSRC)
Model_1 <- rfsrc(y ~ ., data = TrainTest, ntree = 1000,
                 mtry = 30, nodesize = 1, nsplit = 1,
                 na.action = "na.impute", nimpute = 3, seed = 10,
                 importance = TRUE)
When I run the "Model_1" function in R, I got the results:
My question is: how can I get the predicted probability for those 412 people? And how do I find the observed probability for each person? Do I need to calculate it by hand? I found the function BrierScore() in the DescTools package,
but when I tried BrierScore(Model_1), it gave me no results.
Code I added:
library(scoring)
library(DescTools)
BrierScore(Raw_SB)
class(TrainTest$VL_supress03)
TrainTest$VL_supress03_nu<-as.numeric(as.character(TrainTest$VL_supress03))
class(TrainTest$VL_supress03_nu)
prediction_Raw_SB = predict(Raw_SB, TrainTest)
BrierScore(prediction_Raw_SB, as.numeric(TrainTest$VL_supress03) - 1)
BrierScore(prediction_Raw_SB, as.numeric(as.character(TrainTest$VL_supress03)) - 1)
BrierScore(prediction_Raw_SB, TrainTest$VL_supress03_nu - 1)
I tried several variants of the code and got many error messages.
One assumption I am making about your approach is that you want to compute the Brier score on the data you trained your model on (which is usually not the correct approach; google "train-test split" if you need more info there).
In general, therefore, you should reflect on whether your approach is correct here.
The BrierScore method in DescTools only has a defined method for glm models; otherwise, it expects as input a vector of predicted probabilities and a vector of true values (see ?BrierScore).
What you would need to do though is to predict on your data using:
prediction = predict(model_1, TrainTest, na.action="na.impute")
and then compute the brier score using
BrierScore(as.numeric(TrainTest$y) - 1, prediction$predicted[, 1L])
(Note that we transform TrainTest$y into a numeric vector of 0s and 1s in order to compute the Brier score.)
Note: the randomForestSRC package also prints a normalized Brier score when you call print(prediction).
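As a sanity check, the Brier score can also be computed by hand as the mean squared difference between the predicted probability and the 0/1 outcome. A minimal sketch, reusing prediction and TrainTest from above and assuming the event class is labelled "1" (adjust the column name to your factor levels):
y_num <- as.numeric(TrainTest$y) - 1        # recode the factor outcome to 0/1
p_hat <- prediction$predicted[, "1"]        # predicted probability of class "1"
mean((p_hat - y_num)^2)                     # Brier score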
In general, using one of the available workbenches for machine learning in R (mlr3, tidymodels, caret) might simplify this for you and prevent a lot of errors of this kind. This is really good practice, especially if you are less experienced in ML.
See e.g. this chapter in the mlr3 book for more information.
For reference, here is some very similar code using the mlr3 package, automatically also taking care of train-test splits.
data(breast, package = "randomForestSRC") # with target variable "status"
library(mlr3)
library(mlr3extralearners)
task = TaskClassif$new(id = "breast", backend = breast, target = "status")
algo = lrn("classif.rfsrc", na.action = "na.impute", predict_type = "prob")
resample(task, algo, rsmp("holdout", ratio = 0.8))$score(msr("classif.bbrier"))

No convergence for hard competitive learning clustering (flexclust package)

I am applying the functions from the flexclust package for hard competitive learning clustering, and I am having trouble with convergence.
I am using this algorithm because I was looking for a method to perform weighted clustering, giving different weights to groups of variables. I chose hard competitive learning based on a response to a previous question (Weighted Kmeans R).
I am trying to find the optimal number of clusters, and to do so I am using the function stepFlexclust with the following code:
new("flexclustControl") ## check the default values
fc_control <- new("flexclustControl")
fc_control#iter.max <- 500 ### 500 iterations
fc_control#verbose <- 1 # this will set the verbose to TRUE
fc_control#tolerance <- 0.01
### I want to give more weight to the first 24 variables of the dataframe
my_weights <- rep(c(1, 0.064), c(24, 31))
set.seed(1908)
hardcl <- stepFlexclust(x=df, k=c(7:20), nrep=100, verbose=TRUE,
FUN = cclust, dist = "euclidean", method = "hardcl", weights=my_weights, #Parameters for hard competitive learning
control = fc_control,
multicore=TRUE)
However, the algorithm does not converge, even with 500 iterations. I would appreciate any suggestions. Should I increase the number of iterations? Is this an indication that something else is going wrong, or did I make a mistake with the R commands?
Thanks in advance.
Two things that answer my question (as well as a comment on weighting variables in k-means, or rather, in hard competitive learning):
The weights are for observations (= rows of x), not variables (= columns of x), so using hardcl's weights for weighting variables is wrong (a sketch of weighting variables instead follows below).
In hardcl or neural gas you need many more iterations than in standard k-means: one k-means iteration uses the complete data set to update the centroids, whereas hard competitive learning uses only a single observation per iteration. So, compared to k-means, multiply the number of iterations by your sample size.
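Following up on the first point: a common way to weight variables rather than observations is to rescale the columns before clustering, since multiplying each column by the square root of its weight turns squared Euclidean distances into weighted ones. A minimal sketch, assuming df is a purely numeric data frame with 55 columns as in the question:
library(flexclust)
my_weights <- rep(c(1, 0.064), c(24, 31))                       # one weight per variable, as in the question
df_weighted <- sweep(as.matrix(df), 2, sqrt(my_weights), `*`)   # rescale the columns
set.seed(1908)
hardcl_w <- stepFlexclust(df_weighted, k = 7:20, nrep = 100,
                          FUN = cclust, dist = "euclidean", method = "hardcl")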

Successive training in neuralnet

I have a huge trainData set and I want to draw random subsets from it (let's say 1000 times) and use them to train the neural network object successively. Is it possible to do this with the neuralnet R package? What I am thinking of is something like:
library(neuralnet)
for (i in 1:1000) {
classA <- 2000
classB <- 2000
dataB <- trainData[sample(which(trainData$class == "B"), classB, replace = TRUE), ] # draw 2000 samples from class B
dataU <- trainData[sample(which(trainData$class == "A"), classA, replace = TRUE), ] # draw 2000 samples from class A
subset <- rbind(dataB, dataU) # bind them to make a subset
and then feed this subset of the actual trainData to train the neuralnet object again and again, like:
nn <- neuralnet(formula, data=subset, hidden=c(3,5), linear.output = F, stepmax = 2147483647) #use that subset for training the neural network
}
My question is: will this neuralnet object named nn be trained in every iteration of the loop, and will I get a fully trained neural network object when the loop finishes? Secondly, what is the effect of non-convergence in cases where neuralnet is unable to converge for a particular subset? Will it affect the prediction results?
The shortest answer - No
More nuanced answer - Sort of ...
Why? - Because the neuralnet::neuralnet function is not designed to return the weights if the threshold is not reached within stepmax. However, if the threshold is reached, the resulting object will contain the final weights. These weights can then be fed to the neuralnet function as the startweights argument, allowing for successive learning. Your call would look like the following:
# nn.prior = previously run neuralnet object
nn <- neuralnet(formula, data=subset, hidden=c(3,5), linear.output = F, stepmax = 2147483647, startweights = nn.prior$weights)
However, I initially answered 'No' because choosing a threshold that extracts a suitable amount of information from a subset while also making sure it 'converges' before stepmax would likely be a guessing game and not very objective.
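Put together, the successive-training idea might look like the following sketch, reusing trainData, formula and the sampling from the question; prev_weights is an illustrative name, and the is.null check simply keeps the previous weights whenever a subset fails to reach the threshold, per the caveat above.
library(neuralnet)
prev_weights <- NULL                      # the first iteration starts from random weights
for (i in 1:1000) {
  dataB <- trainData[sample(which(trainData$class == "B"), 2000, replace = TRUE), ]
  dataA <- trainData[sample(which(trainData$class == "A"), 2000, replace = TRUE), ]
  subset_i <- rbind(dataB, dataA)
  fit <- neuralnet(formula, data = subset_i, hidden = c(3, 5), linear.output = FALSE,
                   stepmax = 1e6, startweights = prev_weights)
  # If this subset did not converge, the returned object may carry no usable weights;
  # in that case keep the previous weights for the next iteration.
  if (!is.null(fit$weights)) prev_weights <- fit$weights
}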
You have essentially four options I can think of:
Find another package that allows for this explicitly
Get the neuralnet source code and modify it to return the weights even when 'convergence' (i.e. reaching the threshold) isn't achieved.
Take a suitably sized random subset, build your model on that alone, and test its performance. (This is actually quite common practice, AFAIK.)
Take all your subsets, build a model on each and look into combining them as an 'ensemble' model.
I would recommend using k-fold cross-validation to train many nets, using library(e1071) and the tune function.
