Implement Logistic Regression - r

I am applying multiple ML algorithm to this dataset so I tried logistic regression and I plotted the predictions and it seems completely off since the plot only shows data points from one class. Here is the data and what I attempted
set.seed(10)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- ifelse(x1 ^ 2 - x2 ^ 2 > 0, 1, 0)
dat <- data.frame(x1, x2, y)
#Logistic Regression
fit.glm <- glm(y ~ x1 + x2, data = dat, family = "binomial")
y.hat.3 <- predict(fit.glm,dat)
plot(x1,x2,col = c("red","blue")[y.hat.3 + 1])

predict returns log-odds for a logistic regression by default. To get predicted classes, use type = "resp" to get predicted probabilities and then use a decision rule like p > 0.5 to turn them into classes:
y.hat.3 <- predict(fit.glm,dat, type = "resp") > 0.5
plot(x1,x2,col = c("red","blue")[y.hat.3 + 1])

Related

Regression Output from probitmfx

I am trying to produce a nice regression table for marginal effects & p-values from the probitmfx function, where p-values are reported under the marginal effect per covariate. An picture example of what I'd like it to look like is here Similar Output from Stata.
I tried the stargazer function, as suggested here but this does not seem to work if I don't have an OLS / probit.
data_T1 <- read_dta("xxx")
#specification (1)
T1_1 <- probitmfx(y ~ x1 + x2 + x3, data=data_T1)
#specification (1)
T1_2 <- probitmfx(y ~ x1 + x2 + x3 + x4 + x5, data=data_T1)
#this is what I tried but does not work
table1 <- stargazer(coef=list(T1_1$mfxest[,1], T1_2$mfxest[,1]),
p=list(T1_2$mfxest[,4],T1_2$mfxest[,4]), type="text")
Any suggestions how I can design such a table in R?
You can probably use parameters package to produce a beautiful table:
Code:
library(mfx)
library(parameters)
# simulate some data
set.seed(12345)
n <- 1000
x <- rnorm(n)
# binary outcome
y <- ifelse(pnorm(1 + 0.5 * x + rnorm(n)) > 0.5, 1, 0)
data <- data.frame(y, x)
mod <- probitmfx(formula = y ~ x, data = data)
print_html(model_parameters(mod))
HTML table to be used in Rmarkdown:

Simulate data from regression model with exact parameters in R

How can I simulate data so that the coefficients recovered by lm are determined to be particular pre-determined values and have normally distributed residuals? For example, could I generate data so that lm(y ~ 1 + x) will yield (Intercept) = 1.500 and x = 4.000? I would like the solution to be versatile enough to work for multiple regression with continuous x (e.g., lm(y ~ 1 + x1 + x2)) but there are bonus points if it works for interactions as well (lm(y ~ 1 + x1 + x2 + x1*x2)). Also, it should work for small N (e.g., N < 200).
I know how to simulate random data which is generated by these parameters (see e.g. here), but that randomness carries over to variation in the estimated coefficients, e.g., Intercept = 1.488 and x = 4.067.
Related: It is possible to generate data that yields pre-determined correlation coefficients (see here and here). So I'm asking if this can be done for multiple regression?
One approach is to use a perfectly symmetrical noise. The noise cancels itself so the estimated parameters are exactly the input parameters, yet the residuals appear normally distributed.
x <- 1:100
y <- cbind(1,x) %*% c(1.5, 4)
eps <- rnorm(100)
x <- c(x, x)
y <- c(y + eps, y - eps)
fit <- lm(y ~ x)
# (Intercept) x
# 1.5 4.0
plot(fit)
Residuals are normally distributed...
... but exhibit an anormally perfect symmetry!
EDIT by OP: I wrote up a general-purpose code exploiting the symmetrical-residuals trick. It scales well with more complex models. This example also shows that it works for categorical predictors and interaction effects.
library(dplyr)
# Data and residuals
df = tibble(
# Predictors
x1 = 1:100, # Continuous
x2 = rep(c(0, 1), each=50), # Dummy-coded categorical
# Generate y from model, including interaction term
y_model = 1.5 + 4 * x1 - 2.1 * x2 + 8.76543 * x1 * x2,
noise = rnorm(100) # Residuals
)
# Do the symmetrical-residuals trick
# This is copy-and-paste ready, no matter model complexity.
df = bind_rows(
df %>% mutate(y = y_model + noise),
df %>% mutate(y = y_model - noise) # Mirrored
)
# Check that it works
fit <- lm(y ~ x1 + x2 + x1*x2, df)
coef(fit)
# (Intercept) x1 x2 x1:x2
# 1.50000 4.00000 -2.10000 8.76543
You could do rejection sampling:
set.seed(42)
tol <- 1e-8
x <- 1:100
continue <- TRUE
while(continue) {
y <- cbind(1,x) %*% c(1.5, 4) + rnorm(length(x))
if (sum((coef(lm(y ~ x)) - c(1.5, 4))^2) < tol) continue <- FALSE
}
coef(lm(y ~ x))
#(Intercept) x
# 1.500013 4.000023
Obviously, this is a brute-force approach and the smaller the tolerance and the more complex the model, the longer this will take. A more efficient approach should be possible by providing residuals as input and then employing some matrix algebra to calculate y values. But that's more of a maths question ...

R : Plotting Prediction Results for a multiple regression

I want to observe the effect of a treatment variable on my outcome Y. I did a multiple regression: fit <- lm (Y ~ x1 + x2 + x3). x1 is the treatment variable and x2, x3 are the control variables. I used the predict function holding x2 and x3 to their means. I plotted this predict function.
Now I would like to add a line to my plot similar to a simple regression abline but I do not know how to do this.
I think I have to use line(x,y) where y = predict and x is a sequence of values for my variable x1. But R tells me the lengths of y and x differ.
I think you are looking for termplot:
## simulate some data
set.seed(0)
x1 <- runif(100)
x2 <- runif(100)
x3 <- runif(100)
y <- cbind(1,x1,x2,x3) %*% runif(4) + rnorm(100, sd = 0.1)
## fit a model
fit <- lm(y ~ x1 + x2 + x3)
termplot(fit, se = TRUE, terms = "x1")
termplot uses predict.lm(, type = "terms") for term-wise prediction. If a model has intercept (like above), predict.lm will centre each term (What does predict.glm(, type=“terms”) actually do?). In this way, each terms is predicted to be 0 at the mean of the covariate, and the standard error at the mean is 0 (hence the confidence interval intersects the line at the mean).

Removing additive effect in Multiple Regression in R

I have this data set that I will used for my model
set.seed(123)
x <- rnorm(100)
DF <- data.frame(x = x,
y = 4 + (1.5*x) + rnorm(100, sd = 2),
b = as.factor(round(abs(DF$x/3))),
c = as.factor(round(abs(DF$y/3)))
)
I was assigned to create a multiplicative model for them with a based 5 like this equation:
y=5*b(i)*c(i)
but the best that I can do is this one:
m1 <- lm(y ~ b*c, data = DF)
summary(m1)
This model is okay but I do want to remove the additive effect and just get the multiplicative model and I also replace the intercept with 5 and create difference coefficient for the first level of b and c.
Is there a way in R to do this task?
To fit the model without a constant use lm(y~b*c -1,...). Setting a fixed constant can be done by specifying the offset and not fitting the constant or by subtracting the known constant from the dependent variable and fitting a model with no constant.
set.seed(123)
x <- rnorm(100)
DF <- as.data.frame(cbind(x))
DF$y = 4 + (1.5*x) + rnorm(100, sd = 2)
DF$b = round(abs(DF$x/3))
DF$c = round(abs(DF$y/3))
DF$bc = DF$b*DF$c
m1 <- lm(y~ b*c, data=DF) # model w/ a constant
m2 <- lm(y~ b*c - 1, data=DF) # model w/o a constant
m3 <- lm(y~ b*c -1 + offset(rep(5,nrow(DF))), data=DF) # model w/ a constant of 5
m4 <- lm(y-5~ b*c -1, data=DF) # subtracting fixed constant from y's

Adding error variance to output of predict()

I am attempting to take a linear model fitted to empirical data, eg:
set.seed(1)
x <- seq(from = 0, to = 1, by = .01)
y <- x + .25*rnorm(101)
model <- (lm(y ~ x))
summary(model)
# R^2 is .6208
Now, what I would like to do is use the predict function (or something similar) to create, from x, a vector y of predicted values that shares the error of the original relationship between x and y. Using predict alone gives perfectly fitted values, so R^2 is 1 e.g:
y2 <- predict(model)
summary(lm(y2 ~ x))
# R^2 is 1
I know that I can use predict(model, se.fit = TRUE) to get the standard errors of the prediction, but I haven't found an option to incorporate those into the prediction itself, nor do I know exactly how to incorporate these standard errors into the predicted values to give the correct amount of error.
Hopefully someone here can point me in the right direction!
How about simulate(model) ?
set.seed(1)
x <- seq(from = 0, to = 1, by = .01)
y <- x + .25*rnorm(101)
model <- (lm(y ~ x))
y2 <- predict(model)
y3 <- simulate(model)
matplot(x,cbind(y,y2,y3),pch=1,col=1:3)
If you need to do it it by hand you could use
y4 <- rnorm(nobs(model),mean=predict(model),
sd=summary(model)$sigma)

Resources