Regression Output from probitmfx - r

I am trying to produce a nice regression table of marginal effects and p-values from the probitmfx function, where the p-values are reported under the marginal effect of each covariate. A picture example of what I'd like it to look like is here: Similar Output from Stata.
I tried the stargazer function, as suggested here, but it does not seem to work when I don't have an OLS / probit model object.
library(haven)  # read_dta()
library(mfx)    # probitmfx()

data_T1 <- read_dta("xxx")
#specification (1)
T1_1 <- probitmfx(y ~ x1 + x2 + x3, data=data_T1)
#specification (2)
T1_2 <- probitmfx(y ~ x1 + x2 + x3 + x4 + x5, data=data_T1)
#this is what I tried but does not work
table1 <- stargazer(coef = list(T1_1$mfxest[, 1], T1_2$mfxest[, 1]),
                    p = list(T1_1$mfxest[, 4], T1_2$mfxest[, 4]), type = "text")
Any suggestions on how I can design such a table in R?

You can probably use the parameters package to produce a beautiful table:
Code:
library(mfx)
library(parameters)
# simulate some data
set.seed(12345)
n <- 1000
x <- rnorm(n)
# binary outcome
y <- ifelse(pnorm(1 + 0.5 * x + rnorm(n)) > 0.5, 1, 0)
data <- data.frame(y, x)
mod <- probitmfx(formula = y ~ x, data = data)
print_html(model_parameters(mod))
HTML table to be used in Rmarkdown:
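If you also need a plain-text or markdown version of the same table (for example to check it in the console or to knit to a non-HTML format), the other print methods in parameters should cover that; a minimal sketch, reusing the mod object from the code above:
# markdown table, e.g. for R Markdown output
print_md(model_parameters(mod))
# plain-text table in the console
print(model_parameters(mod))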

Related

Interaction plot

I am interested in representing an interaction effect among continuous variables in which the effect of one variable (X1) on Y depends on another variable (X2).
I have code similar to this:
X1 <- rnorm(1000,0,1)
X2 <- rnorm(1000,0,1)
error <- rnorm(1000,0,0.5)
intercept <- 5
coef_1 <- 0.5
coef_2 <- 1
coef_3 <- -1.5
Y <- intercept + coef_1*X1 + coef_2*X2 + coef_3*X1*X2 + error
data <- data.frame(Y=Y,X1=X1,X2=X2)
fit <- lm(Y ~ X1*X2,data=data)
summary(fit)
There are multiple ways to represent the interaction effect, such as 3D plots or interaction_plot(). However, I would like an output similar to this:
I had in mind to do it with the segments() function, though any other advice would be helpful.
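Not an answer from the thread, just a sketch of one way to build such a plot in base R: compute predicted Y over a grid of X1 at a few fixed values of X2 (here mean and mean ± 1 SD, an arbitrary illustrative choice) with predict(), then draw the lines yourself (segments() would work the same way for piecewise ranges):
# simple-slopes style plot: effect of X1 on Y at low / mean / high X2
x1_grid <- seq(min(data$X1), max(data$X1), length.out = 100)
x2_vals <- mean(data$X2) + c(-1, 0, 1) * sd(data$X2)

plot(data$X1, data$Y, col = "grey80", pch = 16, xlab = "X1", ylab = "Y")
for (i in seq_along(x2_vals)) {
  newdat <- data.frame(X1 = x1_grid, X2 = x2_vals[i])
  lines(x1_grid, predict(fit, newdata = newdat), lty = i, lwd = 2)
}
legend("topright", lty = 1:3, lwd = 2,
       legend = c("X2 = mean - 1 SD", "X2 = mean", "X2 = mean + 1 SD"))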

Add predictors one by one in random forest

Once more I will need your help to solve a syntax problem, and I thank you for it.
So I have a dataset that looks like this:
y <- rnorm(1000)
x1 <- rnorm(1000) + 0.2 * y
x2 <- rnorm(1000) + 0.2 * x1 + 0.1 * y
x3 <- rnorm(1000) - 0.1 * x1 + 0.3 * x2 - 0.3 * y
data <- data.frame(y, x1, x2, x3)
head(data)
I need a loop to run a random forest starting with one predictor and adding all the predictors one by one each time, like that:
randomForest(y ~ x1, data= data)
randomForest(y ~ x1 + x2, data= data)
randomForest(y ~ x1 + x2 + x3, data=data) etc...
Would you be kind enough to help me? Thank you in advance!
You can build the formula as a string and convert it with as.formula():
library(randomForest)

lapply(1:3, \(i) {
  formula <- as.formula(paste0("y ~ ", paste0("x", 1:i, collapse = " + ")))
  randomForest(formula, data = data)
})
A more general approach, for example if the predictors are not consistently named or you do not want to hard-code how many there are, is to obtain a character vector of the predictor names, say with colnames(), and adjust the loop slightly:
predictors <- colnames(data[, -1])

lapply(seq_along(predictors), \(i) {
  formula <- as.formula(paste0("y ~ ", paste0(predictors[1:i], collapse = " + ")))
  randomForest(formula, data = data)
})
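Since lapply() returns the fitted forests as a list, it is easy to compare the nested models afterwards. A small usage sketch (my addition, assuming the loop above is stored in fits and relying on the mse component that randomForest reports for regression):
fits <- lapply(seq_along(predictors), \(i) {
  f <- as.formula(paste0("y ~ ", paste0(predictors[1:i], collapse = " + ")))
  randomForest(f, data = data)
})

# out-of-bag MSE of each nested model (mse is per-tree; take the final value)
sapply(fits, \(fit) tail(fit$mse, 1))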

Simulate data from regression model with exact parameters in R

How can I simulate data so that the coefficients recovered by lm are determined to be particular pre-determined values and have normally distributed residuals? For example, could I generate data so that lm(y ~ 1 + x) will yield (Intercept) = 1.500 and x = 4.000? I would like the solution to be versatile enough to work for multiple regression with continuous x (e.g., lm(y ~ 1 + x1 + x2)) but there are bonus points if it works for interactions as well (lm(y ~ 1 + x1 + x2 + x1*x2)). Also, it should work for small N (e.g., N < 200).
I know how to simulate random data which is generated by these parameters (see e.g. here), but that randomness carries over to variation in the estimated coefficients, e.g., Intercept = 1.488 and x = 4.067.
Related: It is possible to generate data that yields pre-determined correlation coefficients (see here and here). So I'm asking whether the same can be done for multiple regression.
One approach is to use perfectly symmetrical noise. The noise cancels itself out, so the estimated parameters are exactly the input parameters, yet the residuals appear normally distributed.
x <- 1:100
y <- cbind(1,x) %*% c(1.5, 4)
eps <- rnorm(100)
x <- c(x, x)
y <- c(y + eps, y - eps)
fit <- lm(y ~ x)
coef(fit)
# (Intercept)           x
#         1.5         4.0
plot(fit)
Residuals are normally distributed...
... but exhibit an abnormally perfect symmetry!
EDIT by OP: I wrote up general-purpose code exploiting the symmetrical-residuals trick. It scales well to more complex models. This example also shows that it works for categorical predictors and interaction effects.
library(dplyr)

# Data and residuals
df <- tibble(
  # Predictors
  x1 = 1:100,                    # continuous
  x2 = rep(c(0, 1), each = 50),  # dummy-coded categorical
  # Generate y from the model, including the interaction term
  y_model = 1.5 + 4 * x1 - 2.1 * x2 + 8.76543 * x1 * x2,
  noise = rnorm(100)             # residuals
)

# Do the symmetrical-residuals trick.
# This is copy-and-paste ready, no matter the model complexity.
df <- bind_rows(
  df %>% mutate(y = y_model + noise),
  df %>% mutate(y = y_model - noise)  # mirrored
)

# Check that it works
fit <- lm(y ~ x1 + x2 + x1 * x2, df)
coef(fit)
# (Intercept)          x1          x2       x1:x2
#     1.50000     4.00000    -2.10000     8.76543
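As an extra sanity check (my addition, not part of the original edit), you can confirm that the recovered coefficients match the inputs up to floating-point error and eyeball the residual distribution:
all.equal(unname(coef(fit)), c(1.5, 4, -2.1, 8.76543))  # should be TRUE
hist(residuals(fit), breaks = 20, main = "Residuals of the mirrored fit")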
You could do rejection sampling:
set.seed(42)
tol <- 1e-8
x <- 1:100
continue <- TRUE
while (continue) {
  y <- cbind(1, x) %*% c(1.5, 4) + rnorm(length(x))
  if (sum((coef(lm(y ~ x)) - c(1.5, 4))^2) < tol) continue <- FALSE
}
coef(lm(y ~ x))
# (Intercept)           x
#    1.500013    4.000023
Obviously, this is a brute-force approach, and the smaller the tolerance and the more complex the model, the longer it will take. A more efficient approach should be possible by providing the residuals as input and then employing some matrix algebra to calculate the y values. But that's more of a maths question ...
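That matrix-algebra route can be sketched as follows (my sketch, not the answerer's code): draw normal noise, project it onto the orthogonal complement of the design matrix's column space, and add the projected noise to X %*% beta. Because the adjusted residuals are orthogonal to every column of X, lm() recovers the coefficients exactly (up to floating-point error); the residuals still look normal, though they are no longer exactly i.i.d.
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
X <- cbind(1, x1, x2)      # design matrix, intercept included
beta <- c(1.5, 4, -2.1)    # target coefficients

eps <- rnorm(n)            # raw noise
# residual-maker matrix M = I - X (X'X)^{-1} X'
M <- diag(n) - X %*% solve(crossprod(X), t(X))
y <- drop(X %*% beta + M %*% eps)  # residuals orthogonal to X by construction

coef(lm(y ~ x1 + x2))
# (Intercept)          x1          x2
#         1.5         4.0        -2.1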

Implement Logistic Regression

I am applying multiple ML algorithms to this dataset. I tried logistic regression and plotted the predictions, but the result seems completely off, since the plot only shows data points from one class. Here are the data and what I attempted:
set.seed(10)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- ifelse(x1 ^ 2 - x2 ^ 2 > 0, 1, 0)
dat <- data.frame(x1, x2, y)
#Logistic Regression
fit.glm <- glm(y ~ x1 + x2, data = dat, family = "binomial")
y.hat.3 <- predict(fit.glm,dat)
plot(x1,x2,col = c("red","blue")[y.hat.3 + 1])
predict() returns log-odds (the linear predictor) for a logistic regression by default. To get predicted classes, use type = "response" to get predicted probabilities, then apply a decision rule such as p > 0.5 to turn them into classes:
y.hat.3 <- predict(fit.glm, dat, type = "response") > 0.5
plot(x1, x2, col = c("red", "blue")[y.hat.3 + 1])
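As an extra check (my addition, not part of the original answer), a confusion table makes the model's limitation visible: a logistic regression that is linear in x1 and x2 cannot recover the quadratic boundary x1^2 - x2^2 > 0, so accuracy stays near chance even after the plotting issue is fixed:
pred.class <- as.integer(predict(fit.glm, dat, type = "response") > 0.5)
table(predicted = pred.class, actual = dat$y)
mean(pred.class == dat$y)  # roughly 0.5 for this decision boundary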

Partial regression plot of results from the np package

I ran a nonparametric regression using the np package (npreg) and am trying to plot the results for the variable of interest, x1, holding all other variables at their means/modes.
library("np")
y <- rnorm(100)
x1 <- rnorm(100,10,30)
x2 <- rbinom(100,1,0.5)
x3 <- rbinom(100,1,0.5)
model.np <- npreg(y ~ x1 + x2 + x3)
plot(model.np)
The plots are exactly what I want but I cannot figure out how to generate them separately "by hand". In particular, I only want the first (of the three) output plots.
Apparently, a detailed answer can be found in the help file for the npplot routine, with plot.behavior being the crucial argument.
For my example, plotting only the x1-graph could be done via:
# recompute the plot data; bootstrap errors must be requested for se() to return them
nlmodel.plot <- plot(model.np, plot.errors.method = "bootstrap", plot.behavior = "data")

y.eval <- fitted(nlmodel.plot$r1)     # fitted values of the partial regression on x1
y.se <- se(nlmodel.plot$r1)           # grab the SEs from the bootstrap
y.lower.ci <- y.eval + y.se[, 1]      # lower CI bound
y.upper.ci <- y.eval + y.se[, 2]      # upper CI bound
x1.eval <- nlmodel.plot$r1$eval[, 1]  # x1 values stored in the plot data

plot(x1, y)
lines(x1.eval, y.eval)
lines(x1.eval, y.lower.ci, lty = 3)
lines(x1.eval, y.upper.ci, lty = 3)
