Add predictors one by one in random forest - R

Once more I need your help to solve a syntax problem, and I thank you for it.
I have a dataset that looks like this:
y <- rnorm(1000)
x1 <- rnorm(1000) + 0.2 * y
x2 <- rnorm(1000) + 0.2 * x1 + 0.1 * y
x3 <- rnorm(1000) - 0.1 * x1 + 0.3 * x2 - 0.3 * y
data <- data.frame(y, x1, x2, x3)
head(data)
I need a loop that runs a random forest starting with one predictor and then adds the remaining predictors one at a time, like this:
randomForest(y ~ x1, data= data)
randomForest(y ~ x1 + x2, data= data)
randomForest(y ~ x1 + x2 + x3, data=data) etc...
Would you be kind enough to help me? Thank you in advance!

You can build the formula, and use as.formula()
library(randomForest)

lapply(1:3, \(i) {
  formula <- as.formula(paste0("y ~ ", paste0("x", 1:i, collapse = " + ")))
  randomForest(formula, data = data)
})
A more general approach, useful when the predictors are not consistently named or when you don't want to hard-code how many there are, is to obtain a character vector of the predictor names, say with colnames(), and adjust the loop slightly:
predictors <- colnames(data[, -1])

lapply(seq_along(predictors), \(i) {
  formula <- as.formula(paste0("y ~ ", paste0(predictors[1:i], collapse = " + ")))
  randomForest(formula, data = data)
})
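If the goal is then to compare the successive fits, a small follow-up sketch (not part of the original answer; it relies on the fact that a regression randomForest object stores its per-tree out-of-bag error in $mse) is to keep the list returned by lapply() and pull out the final out-of-bag MSE of each forest:
fits <- lapply(seq_along(predictors), \(i) {
  f <- as.formula(paste0("y ~ ", paste0(predictors[1:i], collapse = " + ")))
  randomForest(f, data = data)
})

# the last element of $mse is the out-of-bag MSE of the complete forest
sapply(fits, \(fit) tail(fit$mse, 1))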

Related

How to validate and train a multivariate multiple regression model with multiple responses?

I would like to train my Multivariate Multiple Regression (MMR) model and validate its accuracy.
For this MMR I have used four dependent variables (y1, y2, y3, y4) and two independent variables (x1 and x2).
Firstly, I worked with the R function lm(), such as:
modelo_MMR <- lm(cbind(y1, y2, y3, y4) ~ x1 + x2, data = my_data)
I have checked that the previous function could be divided such as:
m1 <- lm(y1 ~ x1 + x2, data = my_data)
m2 <- lm(y2 ~ x1 + x2, data = my_data)
m3 <- lm(y3 ~ x1 + x2, data = my_data)
m4 <- lm(y4 ~ x1 + x2, data = my_data)
To predict y1, y2, y3, and y4 from x1 and x2 values, I have used the predict() function and obtained the same results using:
nd <- data.frame(x1 = 9, x2 = 40)
p_MMR <- predict(modelo_MMR, nd)
as using:
p_m1 <- predict(m1, nd)
p_m2 <- predict(m2, nd)
p_m3 <- predict(m3, nd)
p_m4 <- predict(m4, nd)
When I use lm(cbind(y1, y2, y3, y4) ~ x1 + x2, data = my_data) in some scripts to validate the model, I get several errors, because those scripts usually expect lm(), glm(), etc. with one response variable instead of four… I was thinking about validating each multiple linear regression (m1, m2, m3 and m4) separately, using the same training set and test set for cross-validation. I was also thinking about using different machine learning models for each single-response regression instead of MMR, because I have not yet found information about how to train an MMR model using Naive Bayes, random forest, k-NN, etc.
Could anyone suggest what I could do to train an MMR model and validate its accuracy?
Thanks
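A minimal sketch of the per-response validation idea described above, assuming my_data contains y1..y4, x1 and x2 (the 70/30 split and the RMSE metric are illustrative choices, not part of the original post):
set.seed(1)
n <- nrow(my_data)
train_idx <- sample(n, size = round(0.7 * n))  # illustrative 70/30 split
train <- my_data[train_idx, ]
test  <- my_data[-train_idx, ]

responses <- c("y1", "y2", "y3", "y4")

# fit one multiple regression per response on the training set
# and compute the test-set RMSE for each response separately
rmse <- sapply(responses, function(resp) {
  f <- reformulate(c("x1", "x2"), response = resp)
  m <- lm(f, data = train)
  sqrt(mean((test[[resp]] - predict(m, newdata = test))^2))
})
rmse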

Regression Output from probitmfx

I am trying to produce a nice regression table of marginal effects and p-values from the probitmfx function, where the p-values are reported under the marginal effect of each covariate. An example picture of what I'd like it to look like is here: Similar Output from Stata.
I tried the stargazer function, as suggested here, but this does not seem to work if I don't have an OLS / probit model.
data_T1 <- read_dta("xxx")
# specification (1)
T1_1 <- probitmfx(y ~ x1 + x2 + x3, data = data_T1)
# specification (2)
T1_2 <- probitmfx(y ~ x1 + x2 + x3 + x4 + x5, data = data_T1)
# this is what I tried, but it does not work
table1 <- stargazer(coef = list(T1_1$mfxest[, 1], T1_2$mfxest[, 1]),
                    p = list(T1_1$mfxest[, 4], T1_2$mfxest[, 4]), type = "text")
Any suggestions on how I can design such a table in R?
You can probably use the parameters package to produce a beautiful table:
Code:
library(mfx)
library(parameters)
# simulate some data
set.seed(12345)
n <- 1000
x <- rnorm(n)
# binary outcome
y <- ifelse(pnorm(1 + 0.5 * x + rnorm(n)) > 0.5, 1, 0)
data <- data.frame(y, x)
mod <- probitmfx(formula = y ~ x, data = data)
print_html(model_parameters(mod))
The result is an HTML table that can be used in R Markdown.
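If both specifications need to appear side by side, compare_parameters() from the same parameters package may also be worth trying; whether it accepts probitmfx objects in your installed version is an assumption to check:
# assumes T1_1 and T1_2 are the probitmfx fits from the question
print_html(compare_parameters(T1_1, T1_2))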

Simulate data from regression model with exact parameters in R

How can I simulate data so that the coefficients recovered by lm are determined to be particular pre-determined values and have normally distributed residuals? For example, could I generate data so that lm(y ~ 1 + x) will yield (Intercept) = 1.500 and x = 4.000? I would like the solution to be versatile enough to work for multiple regression with continuous x (e.g., lm(y ~ 1 + x1 + x2)) but there are bonus points if it works for interactions as well (lm(y ~ 1 + x1 + x2 + x1*x2)). Also, it should work for small N (e.g., N < 200).
I know how to simulate random data which is generated by these parameters (see e.g. here), but that randomness carries over to variation in the estimated coefficients, e.g., Intercept = 1.488 and x = 4.067.
Related: it is possible to generate data that yields pre-determined correlation coefficients (see here and here), so I'm asking whether the same can be done for multiple regression.
One approach is to use perfectly symmetrical noise. The noise cancels itself out, so the estimated parameters are exactly the input parameters, yet the residuals appear normally distributed.
x <- 1:100
y <- cbind(1, x) %*% c(1.5, 4)   # exact linear predictor: 1.5 + 4 * x
eps <- rnorm(100)
x <- c(x, x)                     # duplicate the design ...
y <- c(y + eps, y - eps)         # ... adding the noise once with each sign
fit <- lm(y ~ x)
coef(fit)
# (Intercept)           x
#         1.5         4.0
plot(fit)
The residuals are normally distributed, but exhibit an abnormally perfect symmetry!
EDIT by OP: I wrote up general-purpose code exploiting the symmetrical-residuals trick. It scales well to more complex models. This example also shows that it works for categorical predictors and interaction effects.
library(dplyr)

# Data and residuals
df = tibble(
  # Predictors
  x1 = 1:100,                    # continuous
  x2 = rep(c(0, 1), each = 50),  # dummy-coded categorical
  # Generate y from the model, including the interaction term
  y_model = 1.5 + 4 * x1 - 2.1 * x2 + 8.76543 * x1 * x2,
  noise = rnorm(100)             # residuals
)

# Do the symmetrical-residuals trick.
# This is copy-and-paste ready, no matter the model complexity.
df = bind_rows(
  df %>% mutate(y = y_model + noise),
  df %>% mutate(y = y_model - noise)  # mirrored
)

# Check that it works
fit <- lm(y ~ x1 + x2 + x1*x2, df)
coef(fit)
# (Intercept)          x1          x2       x1:x2
#     1.50000     4.00000    -2.10000     8.76543
You could do rejection sampling:
set.seed(42)
tol <- 1e-8
x <- 1:100
continue <- TRUE
while (continue) {
  y <- cbind(1, x) %*% c(1.5, 4) + rnorm(length(x))
  if (sum((coef(lm(y ~ x)) - c(1.5, 4))^2) < tol) continue <- FALSE
}
coef(lm(y ~ x))
# (Intercept)           x
#    1.500013    4.000023
Obviously, this is a brute-force approach, and the smaller the tolerance and the more complex the model, the longer it will take. A more efficient approach should be possible by providing the residuals as input and then using some matrix algebra to calculate the y values. But that's more of a maths question ...
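For what it's worth, one way to make that residuals-as-input idea concrete (a sketch, under the assumption that projected noise is acceptable in place of i.i.d. noise) is to project random noise off the column space of the design matrix; the residuals are then exactly orthogonal to the predictors, so lm() recovers the coefficients exactly:
set.seed(42)
x <- 1:100
X <- cbind(1, x)            # design matrix with intercept
eps <- rnorm(100)
eps <- resid(lm(eps ~ x))   # drop the part of the noise lying in the column space of X
y <- X %*% c(1.5, 4) + eps  # exact coefficients plus orthogonal residuals
coef(lm(y ~ x))
# (Intercept)           x
#         1.5         4.0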

Remove dependent variable from formula for model.matrix

I'm just learning how to deal with model.matrix. For example, to create out-of-sample predictions I extract the formula from my model; say it's a linear model.
Calling formula(mymodel) extracts something like this:
form <- formula(y ~ x1 + x2 * x3)
Now, to create predictions I need a model.matrix without my y. I could type that by hand:
X <- model.matrix(~ x1 + x2 * x3, data=out.of.sample.data)
However, is there a way, using for example update, to get rid of the left-hand side of my formula?
Thanks!
It can be done with update by setting the response variable to NULL:
form <- formula(y ~ x1 + x2 * x3)
newform <- update(form, NULL ~ .)
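The trimmed formula can then be passed straight to model.matrix() for the new data (a short illustration; out.of.sample.data is the asker's hypothetical data frame and must contain x1, x2 and x3):
# newform is ~ x1 + x2 * x3, so no y column is needed in the new data
X <- model.matrix(newform, data = out.of.sample.data)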
This is how I usually do it; I'm not aware of a built-in function for this.
df <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
mymodel <- lm(y ~ x1 + x2 + x3, df)

# as.character() turns the formula into c("~", "y", "x1 + x2 + x3");
# splitting on "~" and taking the third element keeps only the right-hand side
form_vars_only <-
  formula(paste("~", strsplit(as.character(formula(mymodel)), "~")[[3]]))

Subset of predictors using coefplot()

I'd like to make a plot of coefficients using coefplot() that only includes a subset of the predictors I'm using. For example, if you have the code
y1 <- rnorm(1000,50,23)
x1 <- rnorm(1000,50,2)
x2 <- rbinom(1000,1,prob=0.63)
x3 <- rpois(1000, 2)
fit1 <- lm(y1 ~ x1 + x2 + x3)
and then ran
coefplot(fit1)
it would give you a plot displaying the coefficients of the intercept, x1, x2 and x3. How can I modify this so I only get the coefficients for, say, x1 and x2?
You can use the predictors argument, and it will only plot the coefficients you need:
library(coefplot)
coefplot(fit1, predictors=c('x1','x2'))
Output:
