corr.bias parameter in Random forest regression model in R - r

I'm using the regression model of random forest in R and I found the parameter corr.bias which according to the manual is "experimental", my data is nonlinear and I just wonder if setting this parameter to true can enhance the results, plus I don't know exactly how it works for nonlinear data, so I really appreciate if someone can explain to me how this correction bias works in the random forest package and if it can enhance my regression model or not.

The short answer is that it performs a simple correction based on a linear regression on the actual and fitted values.
From regrf.c:
/* Do simple linear regression of y on yhat for bias correction. */
if (*biasCorr) simpleLinReg(nsample, yptr, y, coef, &errb, nout);
and the first few lines of that function are simply:
void simpleLinReg(int nsample, double *x, double *y, double *coef,
double *mse, int *hasPred) {
/* Compute simple linear regression of y on x, returning the coefficients,
the average squared residual, and the predicted values (overwriting y). */
So when you fit a regression random forest with corr.bias = TRUE the model object returned will contain a coef element which will simply be the two coefficients from the linear regression.
Then when you call predict.randomForest this happens:
## Apply bias correction if needed.
yhat <- rep(NA, length(rn))
names(yhat) <- rn
if (!is.null(object$coefs)) {
yhat[keep] <- object$coefs[1] + object$coefs[2] * ans$ypred
}
The non-linear nature of your data probably isn't necessarily relevant, but the bias correction may be very poor if the relationship between the fitted and actual values is very far from linear.
You can always fit the model and then plot the fitted vs actual values yourself and see whether a correction based on a linear regression would help or not.

Related

95% CI for the ICC in linear mixed effects model (multilevel model, hierarchical model)

I fitted a linear mixed effect model to predict the math score as the outcome, x= participant factor (nominal or ordinal) as the fixed effect, Schl is the random effect. Then I compared it with the simple linear regression model using compare_performance, and while the output gives the ICC, I was not sure how to calculate the 95% for it? (for coefficients I used confintconfint and it did the job)
lm1<- lm(math~ gender, data= df)
lme1<- lmer(math~gender+(1|schl), data=df)
compare_performance(lm1,lme1)
the ICC was 0.15
From this gist from Peter Dahlgren, taken in turn from a CrossValidated answer by #Ashe, here is the crux:
calc.icc <- function(y) {
sumy <- summary(y)
(sumy$varcor$id[1]) / (sumy$varcor$id[1] + sumy$sigma^2)
}
boot.icc <- bootMer(mymod, calc.icc, nsim=1000)
#Draw from the bootstrap distribution the usual 95% upper and lower confidence limits
quantile(boot.icc$t, c(0.025, 0.975))
You can (and should) check that this calc.icc() function gives the same results as your compare_performance() function. Since this uses parametric bootstrapping, you can substitute any ICC function you like as it long takes a fitted model as input and returns the ICC as a single numeric value. (Also, because it uses PB, it will be slow; there are potentially faster approximate methods, but PB is reliable and easy to program.)

R: fit mixed effect model

Suppose we have the following linear mixed effects model:
How do we fit this nested model in R?
For now, I tried two things:
Rewrite the model as:
Then using lmer function in lme4 package to fit the mixed effect model and put Xi as both random and fixed effect covariate as:
lmer(y ~ X-1+(0+X|subject))
But when I pass the result to BIC and do the model selection, it always picks the simplest model, which is not correct.
I tried to regress y_i on X_i first and treat X_i as the fixed effect, then I will get the estimate of the slope, i.e. phi_i vector. Then see phi_i as the new observations and regress on C_i again to get the beta. But it seems not correct since we do not know C_i in the real problem and it looks like C_i and beta jointly decide the coefficients.
So, are there other ways to fit this kind of model in R and where are my mistakes?
Thanks for any help!

using lambda.min to extrace coefficients from model trained with glmnet

I am using glmnet to train the logistic regression model and then try to obtain the coefficients with the specific lambda. I used the simple example here:
load("BinomialExample.RData")
fit = glmnet(x, y, family = "binomial")
coef(fit, s = c(0.05,0.01))
I have checked the values of fit$lambda, however, I could not find the specific values of 0.05 or 0.01 in fit$lambda. So how could coef return the coefficients with a lambda not in the fit$lambda vector.
This is explained in the help for coef.glmnet, specifically the exact argument:
exact
This argument is relevant only when predictions are made at values of s (lambda) different from those used in the fitting of the original model. If exact=FALSE (default), then the predict function uses linear interpolation to make predictions for values of s (lambda) that do not coincide with those used in the fitting algorithm. While this is often a good approximation, it can sometimes be a bit coarse. With exact=TRUE, these different values of s are merged (and sorted) with object$lambda, and the model is refit before predictions are made.

R: Linear regression model does not work very well

I'm using R to fit a linear regression model and then I use this model to predict values but it does not predict very well boundary values. Do you know how to fix it?
ZLFPS is:
ZLFPS<-c(27.06,25.31,24.1,23.34,22.35,21.66,21.23,21.02,20.77,20.11,20.07,19.7,19.64,19.08,18.77,18.44,18.24,18.02,17.61,17.58,16.98,19.43,18.29,17.35,16.57,15.98,15.5,15.33,14.87,14.84,14.46,14.25,14.17,14.09,13.82,13.77,13.76,13.71,13.35,13.34,13.14,13.05,25.11,23.49,22.51,21.53,20.53,19.61,19.17,18.72,18.08,17.95,17.77,17.74,17.7,17.62,17.45,17.17,17.06,16.9,16.68,16.65,16.25,19.49,18.17,17.17,16.35,15.68,15.07,14.53,14.01,13.6,13.18,13.11,12.97,12.96,12.95,12.94,12.9,12.84,12.83,12.79,12.7,12.68,27.41,25.39,23.98,22.71,21.39,20.76,19.74,19.49,19.12,18.67,18.35,18.15,17.84,17.67,17.65,17.48,17.44,17.05,16.72,16.46,16.13,23.07,21.33,20.09,18.96,17.74,17.16,16.43,15.78,15.27,15.06,14.75,14.69,14.69,14.6,14.55,14.53,14.5,14.25,14.23,14.07,14.05,29.89,27.18,25.75,24.23,23.23,21.94,21.32,20.69,20.35,19.62,19.49,19.45,19,18.86,18.82,18.19,18.06,17.93,17.56,17.48,17.11,23.66,21.65,19.99,18.52,17.22,16.29,15.53,14.95,14.32,14.04,13.85,13.82,13.72,13.64,13.5,13.5,13.43,13.39,13.28,13.25,13.21,26.32,24.97,23.27,22.86,21.12,20.74,20.4,19.93,19.71,19.35,19.25,18.99,18.99,18.88,18.84,18.53,18.29,18.27,17.93,17.79,17.34,20.83,19.76,18.62,17.38,16.66,15.79,15.51,15.11,14.84,14.69,14.64,14.55,14.44,14.29,14.23,14.19,14.17,14.03,13.91,13.8,13.58,32.91,30.21,28.17,25.99,24.38,23.23,22.55,20.74,20.35,19.75,19.28,19.15,18.25,18.2,18.12,17.89,17.68,17.33,17.23,17.07,16.78,25.9,23.56,21.39,20.11,18.66,17.3,16.76,16.07,15.52,15.07,14.6,14.29,14.12,13.95,13.89,13.66,13.63,13.42,13.28,13.27,13.13,24.21,22.89,21.17,20.06,19.1,18.44,17.68,17.18,16.74,16.07,15.93,15.5,15.41,15.11,14.84,14.74,14.68,14.37,14.29,14.29,14.27,18.97,17.59,16.05,15.49,14.51,13.91,13.45,12.81,12.6,12,11.98,11.6,11.42,11.33,11.27,11.13,11.12,11.11,10.92,10.87,10.87,28.61,26.4,24.22,23.04,21.8,20.71,20.47,19.76,19.38,19.18,18.55,17.99,17.95,17.74,17.62,17.47,17.25,16.63,16.54,16.39,16.12,21.98,20.32,19.49,18.2,17.1,16.47,15.87,15.37,14.89,14.52,14.37,13.96,13.95,13.72,13.54,13.41,13.39,13.24,13.07,12.96,12.95,27.6,25.68,24.56,23.52,22.41,21.69,20.88,20.35,20.26,19.66,19.19,19.13,19.11,18.89,18.53,18.13,17.67,17.3,17.26,17.26,16.71,19.13,17.76,17.01,16.18,15.43,14.8,14.42,14,13.8,13.67,13.33,13.23,12.86,12.85,12.82,12.75,12.61,12.59,12.59,12.45,12.32)
QPZL<-c(36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,36,35,34,33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16)
ZLDBFSAO<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
My model is:
fit32=lm(log(ZLFPS) ~ poly(QPZL,2,raw=T) + ZLDBFSAO)
results3 <- coef(summary(fit32))
first3<-as.numeric(results3[1])
second3<-as.numeric(results3[2])
third3<-as.numeric(results3[3])
fourth3<-as.numeric(results3[4])
fifth3<-as.numeric(results3[5])
#inverse model used for prediction of FPS
f1 <- function(x) {first3 +second3*x +third3*x^2 + fourth3*1}
You can see my dataset here. This dataset contains the values that I have to predict. The FPS variation per QP is heterogenous. See dataset. I added a new column.
The fitted dataset is a different one.
To test the model just write exp(f1(selected_QP)) where selected QP varies from 16 to 36. See the given dataset for QP values and the FPS value that the model should predict.
You can run the model online here.
When I'm using QP values in the middle, let's say between 23 and 32 the model predicts the FPS value pretty well. Otherwise, the prediction has big error value.
Regarding the linear regression model I should use Weighted Least Squares as a Solution to Heteroskedasticity of the fitted dataset. For references, see here, here and here.
fit32=lm(log(ZLFPS) ~ poly(QPZL,2,raw=T) + ZLDBFSAO, weights=1/(1+0.5*QPZL^2))
The other code remains the same. This model gives me lower prediction error than the previous.

Determining the values of beta in a logistic regression equation

I read about logistic regression on Wikipedia and it talks of the equation where z, the output depends on the values of beta1, beta2, beta3, and so on which are termed as regression coefficients along with beta0, which is the intercept.
How do you determine the values of these coefficients and intercept given a sample set where we know the output probability and the values of the input dependent factors?
Use R. Here's a video: www.youtube.com/watch?v=Yv05RjKpEKY

Resources