project a linear regression hyper plane to a 2d plot (abline-like) - r

I have this code
factors<-read.csv("India_Factors.csv",header=TRUE)
marketfactor<-factors[,4]
sizefactor<-factors[,5]
valuefactor<-factors[,6]
library(tseries) # provides get.hist.quote()
dati<-get.hist.quote("SI", quote = "AdjClose", compression = "m")
returns<-diff(dati)
regression<-lm(returns ~ marketfactor + sizefactor + valuefactor,na.action=na.omit)
that does multilinear regression.
I want to plot the returns against a single factor on a 2D plane (which is trivial, of course), with the projection of the linear regression hyperplane for that factor superimposed. To be clearer, the result should look like this: wolfram demonstrations (see the snapshots).
Any help will be greatly appreciated.
Thank you for your time and have a nice weekend.
Giorgio.

The points in my comment notwithstanding, here is the canonical way to generate output from a fitted model in R for combinations of predictors. It really isn't clear what the plots you want are showing, but the ones that make sense to me are partial plots, where one variable is varied over its range whilst the others are held at some common value. Here I use the sample mean when holding a variable constant.
First some dummy data, with only two covariates, but this extends to any number
set.seed(1)
dat <- data.frame(y = rnorm(100))
dat <- transform(dat,
x1 = 0.2 + (0.4 * y) + rnorm(100),
x2 = 2.4 + (2.3 * y) + rnorm(100))
Fit the regression model
mod <- lm(y ~ x1 + x2, data = dat)
Next some data values to predict at using the model. You could do all variables in a single prediction and then subset the resulting object to plot only the relevant rows. Alternatively, more clearly (though more verbose), you can deal with each variable separately. Below I create two data frames, one per covariate in the model. In a data frame I generate 100 values over the range of the covariate being varied, and repeat the mean value of the other covariate(s).
pdatx1 <- with(dat, data.frame(x1 = seq(min(x1), max(x1), length = 100),
x2 = rep(mean(x2), 100)))
pdatx2 <- with(dat, data.frame(x1 = rep(mean(x1), 100),
x2 = seq(min(x2), max(x2), length = 100)))
In the linear regression with straight lines, you really don't need 100 values --- the two end points of the range of the covariate will do. However for models where the fitted function is not linear you need to predict at more locations.
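As a quick check of the end-point remark, the sketch below rebuilds the dummy data, predicts at only the two ends of the range of x1, and compares the straight line through those two predictions with predictions on a dense grid. The names ends, grid and max_gap are illustrative and not part of the original answer.

```r
# Rebuild the dummy data and model so the snippet runs on its own
set.seed(1)
dat <- data.frame(y = rnorm(100))
dat <- transform(dat,
                 x1 = 0.2 + (0.4 * y) + rnorm(100),
                 x2 = 2.4 + (2.3 * y) + rnorm(100))
mod <- lm(y ~ x1 + x2, data = dat)

# Predict at the two end points of x1 only, holding x2 at its mean
ends <- with(dat, data.frame(x1 = range(x1), x2 = mean(x2)))
ends$yhat <- predict(mod, ends)

# Predict on a dense grid for comparison
grid <- with(dat, data.frame(x1 = seq(min(x1), max(x1), length.out = 100),
                             x2 = mean(x2)))
grid$yhat <- predict(mod, grid)

# Interpolate linearly between the end points and compare with the grid
line_from_ends <- approx(ends$x1, ends$yhat, xout = grid$x1)$y
max_gap <- max(abs(line_from_ends - grid$yhat))
max_gap
```

For a model with smooth or polynomial terms the same comparison would show a clear gap, which is why the dense grid is the safer default.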
Next, use the model to predict at these data points
pdatx1 <- transform(pdatx1, yhat = predict(mod, pdatx1))
pdatx2 <- transform(pdatx2, yhat = predict(mod, pdatx2))
Now we are ready to draw the partial plots. First compute a range for the y axis. Again, this is mostly redundant here, but if you are adding confidence intervals you will need to include their values in the call below:
ylim <- range(pdatx1$yhat, pdatx2$yhat, dat$y)
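If confidence intervals are wanted, one way to get them (a sketch, not part of the original answer) is the interval argument of predict() for lm fits, which returns columns fit, lwr and upr. The example below does this for x1 only and folds the interval limits into the y-axis range; the 95% level is an arbitrary choice.

```r
# Rebuild the dummy data and model so the snippet runs on its own
set.seed(1)
dat <- data.frame(y = rnorm(100))
dat <- transform(dat,
                 x1 = 0.2 + (0.4 * y) + rnorm(100),
                 x2 = 2.4 + (2.3 * y) + rnorm(100))
mod <- lm(y ~ x1 + x2, data = dat)

# Vary x1 over its range, hold x2 at its mean
pdatx1 <- with(dat, data.frame(x1 = seq(min(x1), max(x1), length.out = 100),
                               x2 = mean(x2)))

# predict() returns a matrix with columns fit, lwr, upr
ci <- predict(mod, pdatx1, interval = "confidence", level = 0.95)
pdatx1 <- cbind(pdatx1, ci)

# Include the interval limits when computing the axis range
ylim <- range(pdatx1$lwr, pdatx1$upr, dat$y)

plot(y ~ x1, data = dat, ylim = ylim)
lines(fit ~ x1, data = pdatx1, col = "red", lwd = 2)
lines(lwr ~ x1, data = pdatx1, col = "red", lty = 2)
lines(upr ~ x1, data = pdatx1, col = "red", lty = 2)
```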
To plot (here putting two figures on the same plot device) we can use the following code
layout(matrix(1:2, ncol = 2))
plot(y ~ x1, data = dat, ylim = ylim)
lines(yhat ~ x1, data = pdatx1, col = "red", lwd = 2)
plot(y ~ x2, data = dat, ylim = ylim)
lines(yhat ~ x2, data = pdatx2, col = "red", lwd = 2)
layout(1)
Which produces

Related

Constrained Spline Function in r

Hope all is well.
I have been exploring a few options for constraining a spline function so that it not only stays positive but also stays above the lowest value of y in the data frame. I assume there is a penalized spline function out there whose shape can be readily adapted, though I have not found one or tried one yet. I have also tried nls with an exponential decay function, which works; however, the last estimated point is much higher than desired (I would like the curve to pass through, or be closer to, the final value of y). See the code below for the options I have tried. The ultimate goal, however, is to fit a spline that passes through all points and never drops below the lowest value of y, while acknowledging that there are only 5 data points. Thanks in advance for the help.
library(tidyverse)
library(broom)
library(gnm)
library(cobs)
library(zoo)
DF <- data.frame(x = seq(1,5,1),y=c(26419753,9511111,3566667,57993,52194))
t=1:5
# option 1a and 1b: preferred method which is fitting a spline function
mod1a <- splinefun(DF$x,DF$y)
curve(mod1a, 1,5)
pred_interval_mod1a <- seq(1,5,length = 40)
mod1a(pred_interval_mod1a) # has that dip to negative near the end which should remain larger than y = 52,194
mod1b <- cobs(x= DF$x,y = DF$y,pointwise=rbind(c(0,52194,-1),c(0,26419753,1)))
pred_interval_mod1b <- seq(1,5,length = 40)
predict(mod1b, pred_interval_mod1b)
# option 2: NLS for exponential decay with starting values
mod2 <- nls(y ~ SSasymp(t, yf, y0, log_alpha), data = DF)
qplot(t, y, data = augment(mod2)) + geom_line(aes(y = .fitted))
# option 3: similar NLS premise but with lower defined
mod3 <- nls(y ~ yf + (y0 - yf) * exp(-alpha * t), data = DF,
start = list(y0 = 26419753, yf = 52194, alpha = 1),
lower= c(-Inf,52194,-Inf),algorithm="port")
# option 4: similar to 2 and 3
a=log(52194)
mod4 <- gnm(y ~ Exp(1 + t) -1, verbose = FALSE, constrain="Exp(.+x).Intercept",
constrainTo=a, start=c(a,-0.05), data=DF)
mod4_df <- data.frame(t = seq(1,5,by=1))
mod4_pred <- predict(mod4,newdata=mod4_df)
mod4_pred
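One possible direction for the stated goal, offered only as a sketch and not as an answer from the original thread: base R's splinefun() accepts method = "hyman", which produces a monotone cubic interpolant when the data are monotone. Because the five y values here decrease strictly, such a curve passes through every point and cannot drop below the final value. The names mono_fit, x_grid and y_grid are illustrative.

```r
DF <- data.frame(x = 1:5, y = c(26419753, 9511111, 3566667, 57993, 52194))

# Monotone cubic interpolation (Hyman filtering); requires monotone y
mono_fit <- splinefun(DF$x, DF$y, method = "hyman")

# Evaluate on a fine grid and inspect the range of the fitted curve
x_grid <- seq(1, 5, length.out = 200)
y_grid <- mono_fit(x_grid)
range(y_grid)

# Visual check: the curve should pass through all five points
curve(mono_fit, 1, 5)
points(DF$x, DF$y)
```

This only applies while the data remain monotone; for non-monotone data a lower bound would need a different kind of constraint.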

How to predict gam model with random effect in R?

I am trying to predict from a GAM with a random effect in order to produce a 3D surface plot with plot_ly.
Here is my code:
library(mgcv) # provides gam()
x <- runif(100)
y <- runif(100)
z <- x^2 + y + rnorm(100)
r <- rep(1,times=100) # random effect
r[51:100] <- 2 # replace 1 into 2, making two groups
df <- data.frame(x, y, z, r)
gam_fit <- gam(z ~ s(x) + s(y) + s(r,bs="re"), data = df) # fit
#create matrix data for `add_surface` function in `plot_ly`
newx <- seq(0, 1, len=20)
newy <- seq(0, 1, len=30)
newxy <- expand.grid(x = newx, y = newy)
z <- matrix(predict(gam_fit, newdata = newxy), 20, 30) # predict data as matrix
However, the last line results in an error:
Error in model.frame.default(ff, data = newdata, na.action = na.act) :
variable lengths differ (found for 'r')
In addition: Warning message:
In predict.gam(gam_fit, newdata = newxy) :
not all required variables have been supplied in newdata!
Thanks to the previous answer, I am sure that the code above works without the random effect, as in here.
How can I predict gam models with random effect?
Assuming you want the surface conditional upon the random effects (but not for a specific level of the random effect), there are two ways.
The first is to provide a level for the random effect but exclude that term from the predicted values using the exclude argument to predict.gam(). The second is to again use exclude but this time to not provide any data for the random effect and instead stop predict.gam() from checking the newdata using the argument newdata.guaranteed = TRUE.
Option 1:
newxy1 <- with(df, expand.grid(x = newx, y = newy, r = 2))
z1 <- predict(gam_fit, newdata = newxy1, exclude = 's(r)')
z1 <- matrix(z1, 20, 30)
Option 2:
z2 <- predict(gam_fit, newdata = newxy, exclude = 's(r)',
newdata.guaranteed=TRUE)
z2 <- matrix(z2, 20, 30)
These produce the same result:
> all.equal(z1, z2)
[1] TRUE
A couple of notes:
Which you use will depend on how complex the rest of your model is. I would generally use the first option as it provides an extra check against me doing something stupid when creating the data. But in this instance, with a simple model and set of covariates, it seems safe enough to trust that newdata is OK.
Your example uses a random slope (was that intended?), not a random intercept as r is not a factor. If your real example uses a factor random effect then you'll need to be a little more careful when creating the newdata as you need to get the levels of the factor right. For example:
expand.grid(x = newx, y = newy,
r = with(df, factor(2, levels = levels(r))))
should get the right set-up for a factor r.
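To make the factor case concrete, here is a self-contained sketch with r stored as a factor, so that s(r, bs = "re") is a random intercept. The simulated data (four groups, the group offsets, the seed) are made up for illustration. With the term excluded, the level supplied in newdata should not affect the predictions, which the last lines check.

```r
library(mgcv)

set.seed(42)
# Simulated data with four groups; r is a factor, giving a random intercept
df <- data.frame(x = runif(200), y = runif(200),
                 r = factor(rep(1:4, each = 50)))
group_shift <- c(-1, -0.3, 0.3, 1)  # made-up group offsets
df$z <- df$x^2 + df$y + group_shift[as.integer(df$r)] + rnorm(200, sd = 0.5)

gam_fit <- gam(z ~ s(x) + s(y) + s(r, bs = "re"), data = df, method = "REML")

newx <- seq(0, 1, length.out = 20)
newy <- seq(0, 1, length.out = 30)

# Build newdata with a valid factor level, then exclude the random effect
make_surface <- function(level) {
  nd <- expand.grid(x = newx, y = newy,
                    r = factor(level, levels = levels(df$r)))
  matrix(predict(gam_fit, newdata = nd, exclude = "s(r)"), 20, 30)
}

z_a <- make_surface("1")
z_b <- make_surface("4")
all.equal(z_a, z_b)  # the chosen level is irrelevant once s(r) is excluded
```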

Logistic Regression's ROC Goes Abnormal

Currently, I'm learning about logistic regression and LDA (Linear Discriminant Analysis) classification. I'm trying to generate the data differently to learn logistic regression and LDA behavior.
Here is the data visualization of 2-dimensional predictors with class plotted as color:
Here is my code:
library(ggplot2)
library(MASS)
library(pROC) # provides roc() and auc()
set.seed(1)
a <- mvrnorm(n = 1000, mu = c(0,0), Sigma = matrix(c(0.4,0,0,0.4), nrow = 2, ncol = 2))
b <- mvrnorm(n = 1000, mu = c(0,0), Sigma = matrix(c(10,0,0,10), nrow = 2, ncol =2 ))
#I want to make sure b1 separated from a
b1 <- b[sqrt(b[,1]^2 + b[,2]^2) > 4,]
df <- as.data.frame(rbind(a,b1))
names(df) <- c('x','y')
labelA <- rep('A', nrow(a))
labelB <- rep('B', nrow(b1))
#Put the label column to the data frame
df$labs <- c(labelA,labelB)
ggplot(df, aes(x = x, y = y, col = labs)) + geom_point()
prd <- glm(as.factor(labs) ~ x + y, family = binomial('probit'), data = df)
prd_score <- predict(prd, type = 'response')
plot(roc(df$labs,prd_score))
auc(roc(df$labs,prd_score))
And this is the roc curve plot
It's really frustrating because I couldn't find any mistake in my code that generates this kind of problem. Can anyone help me to point out any mistake in my code that generates this weird kind of ROC or any explanation on why the ROC could become weird like that?
NB: Please assume that the generated data set above is the training data and I want to predict the training data again.
There is no mistake in your code.
Your dataset is a typical example of classes that cannot be separated with a linear combination of the features. Therefore linear classification methods such as logistic regression or LDA won't help you here. This is why your ROC curve looks "weird", but it's totally normal and is only telling you that your model fails to separate the data.
You need to investigate non-linear classification techniques. Given the radial distribution of the data, I can imagine that support vector machines (SVM) with a radial basis kernel could do the trick.
require(e1071)
# We need a numeric label for SVM regression
labelA <- rep(0, nrow(a))
labelB <- rep(1, nrow(b1))
df$labsNum <- c(labelA,labelB)
# We create a radial basis model
svm_prd <- svm(labsNum ~ x + y, data = df, kernel = "radial", type = "eps-regression")
svm_score <- predict(svm_prd)
# Evaluate the SVM scores, not the scores from the earlier glm fit
plot(roc(df$labs,svm_score))
auc(roc(df$labs,svm_score))
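A complementary illustration, not from the original answer: the classes become linearly separable once the squared radius is available as a feature, so even a logistic regression separates them. The sketch below regenerates similar data with rnorm() instead of mvrnorm(), and computes AUC with a small rank-based helper (auc_rank, a hypothetical name) so that no extra packages are needed.

```r
set.seed(1)
# Same construction as the question, using independent normals
a  <- matrix(rnorm(2000, sd = sqrt(0.4)), ncol = 2)
b  <- matrix(rnorm(2000, sd = sqrt(10)),  ncol = 2)
b1 <- b[sqrt(b[, 1]^2 + b[, 2]^2) > 4, ]

df <- data.frame(rbind(a, b1))
names(df) <- c("x", "y")
df$label <- rep(c(0, 1), c(nrow(a), nrow(b1)))  # 0 = class A, 1 = class B
df$r2 <- df$x^2 + df$y^2                        # squared distance from origin

# Rank-based (Mann-Whitney) AUC
auc_rank <- function(score, label) {
  pos <- label == 1
  n1 <- sum(pos)
  n0 <- sum(!pos)
  (sum(rank(score)[pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

fit_linear <- glm(label ~ x + y, family = binomial, data = df)
# r2 separates the classes perfectly, so glm() warns about fitted
# probabilities of 0 or 1; the warnings are suppressed here on purpose
fit_radial <- suppressWarnings(glm(label ~ r2, family = binomial, data = df))

auc_linear <- auc_rank(predict(fit_linear), df$label)
auc_radial <- auc_rank(predict(fit_radial), df$label)
c(linear = auc_linear, radial = auc_radial)
```

The linear fit should land near chance level, while the fit on r2 should rank essentially every B point above every A point.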

Plotting interaction effects in Bayesian models (using rstanarm)

I'm trying to show how the effect of one variable changes with the values of another variable in a Bayesian linear model in rstanarm. I am able to fit the model and take draws from the posterior to look at the estimates for each parameter, but it's not clear how to plot the effect of one variable in the interaction as the other changes, along with the associated uncertainty (i.e. a marginal effects plot). Below is my attempt:
library(rstanarm)
# Set Seed
set.seed(1)
# Generate fake data
w1 <- rbeta(n = 50, shape1 = 2, shape2 = 1.5)
w2 <- rbeta(n = 50, shape1 = 3, shape2 = 2.5)
dat <- data.frame(y = log(w1 / (1-w1)),
x = log(w2 / (1-w2)),
z = seq(1:50))
# Fit linear regression without an intercept:
m1 <- rstanarm::stan_glm(y ~ 0 + x*z,
data = dat,
family = gaussian(),
algorithm = "sampling",
chains = 4,
seed = 123)
# Create data sets with low values and high values of one of the predictors
dat_lowx <- dat
dat_lowx$x <- 0
dat_highx <- dat
dat_highx$x <- 5
out_low <- rstanarm::posterior_predict(object = m1,
newdata = dat_lowx)
out_high <- rstanarm::posterior_predict(object = m1,
newdata = dat_highx)
# Calculate differences in posterior predictions
mfx <- out_high - out_low
# Somehow get the coefficients for the other predictor?
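In this (linear, Gaussian, identity link, no intercept) case,
mu = beta_x * x + beta_z * z + beta_xz * x * z
= (beta_x + beta_xz * z) * x + beta_z * z
= (beta_z + beta_xz * x) * z + beta_x * x
so the marginal effect of x is d mu / d x = beta_x + beta_xz * z, and the marginal effect of z is d mu / d z = beta_z + beta_xz * x.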
So, to plot the marginal effect of x or z, you just need an appropriate range of each and the posterior distribution of the coefficients, which you can obtain via
post <- as.data.frame(m1)
Then
dmu_dx <- post[ , 1] + post[ , 3] %*% t(sort(dat$z))
dmu_dz <- post[ , 2] + post[ , 3] %*% t(sort(dat$x))
You can then estimate a marginal effect at each observed value of the other variable with something like the code below, which calculates the effect of x on mu at each observed z and the effect of z on mu at each observed x.
colnames(dmu_dx) <- sort(dat$z)
colnames(dmu_dz) <- round(sort(dat$x), digits = 1)
bayesplot::mcmc_intervals(dmu_dz)
bayesplot::mcmc_intervals(dmu_dx)
Note that the column names are simply the observations in this case.
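If a table of summaries is preferred over an interval plot, the draws of the marginal effect can be reduced to quantiles per moderator value. The sketch below uses placeholder draws in place of as.data.frame(m1) so that it runs without fitting a Stan model; the numbers in post are invented purely for illustration, and the 90% interval is an arbitrary choice.

```r
set.seed(1)
n_draws <- 1000

# Placeholder for as.data.frame(m1): made-up draws, NOT real posterior output
post <- data.frame(beta_x  = rnorm(n_draws, 0.5, 0.1),
                   beta_z  = rnorm(n_draws, 0.01, 0.005),
                   beta_xz = rnorm(n_draws, -0.02, 0.004))

z_grid <- 1:50

# d mu / d x = beta_x + beta_xz * z
# rows are draws, columns are values of z
dmu_dx <- post$beta_x + outer(post$beta_xz, z_grid)

# Median and 90% interval of the marginal effect at each z
q <- t(apply(dmu_dx, 2, quantile, probs = c(0.05, 0.5, 0.95)))
effect_summary <- data.frame(z = z_grid,
                             lower = q[, 1], median = q[, 2], upper = q[, 3])
head(effect_summary)
```

With a real fit, replacing post by the first three columns of as.data.frame(m1) should give the corresponding summary for the model in the question.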
You could also use either the ggeffects-package, especially for marginal effects; or the sjPlot-package for marginal effects and other plot types (for marginal effects, sjPlot simply wraps the functions from ggeffects).
To plot marginal effects of interactions, use sjPlot::plot_model() with type = "int". Use mdrt.values to define which values to plot for continuous moderator variables, and use ppd to let prediction be based on either the posterior distribution of the linear predictor or draws from posterior predictive distribution.
library(sjPlot)
plot_model(m1, type = "int", terms = c("x", "z"), mdrt.values = "meansd")
plot_model(m1, type = "int", terms = c("x", "z"), mdrt.values = "meansd", ppd = TRUE)
or to plot marginal effects at other specific values, use type = "pred" and specify the values in the terms-argument:
plot_model(m1, type = "pred", terms = c("x", "z [10, 20, 30, 40]"))
# same as:
library(ggeffects)
dat <- ggpredict(m1, terms = c("x", "z [10, 20, 30, 40]"))
plot(dat)
There are more options, and also different ways of customizing the plot appearance. See related help files and package vignettes.

R: Formula with multiple Conditions and Categorized Surface Plot

I want to make 3D plots for linear regression models in R: I wish to display the surface of the regression plane of a linear model.
I have 2 continuous variables (say AGE, HEIGHT) and 2 factors (SEX, ALLERGIC). I want to display the predicted values of the LM w.r.t. the 2 continuous variables conditioned on the specified levels of each factor, e.g.
ILLNESS = AGE|{SEX==MALE + ALLERGIC==YES} + HEIGHT|{SEX==MALE + ALLERGIC==YES} +
AGE|{SEX==MALE + ALLERGIC==YES}*HEIGHT|{SEX==MALE + ALLERGIC==YES}
This is the outcome I have in mind:
First question: Is there a function that makes this easy?
Second question: If not, how can I write formulas in which I condition on more than one factor level?
First, let's make some sample input data to have something to test with.
set.seed(15)
dd <- data.frame(
sex = sample(c("M","F"), 200, replace=TRUE),
allergic = sample(c("YES","NO"), 200, replace=TRUE),
age = runif(200, 18,65),
height = rnorm(200, 6, 2)
)
expit <- function(x) exp(x)/(exp(x)+1)
dd <- transform(dd,
illness=expit(-1+(sex=="M")*.8-0.025*age*ifelse(sex=="M",-1,1)+.16*height*ifelse(allergic=="YES",-1,1)+rnorm(200))>.5
)
Now we define the set of values we want to predict over
gg<-expand.grid(sex=c("M","F"), allergic=c("YES","NO"))
vv<-expand.grid(age=18:65, height=3:9)
and then we fit a model, and use the predict function to calculate the response for each point on the surface we wish to plot.
mm <- glm(illness~sex+allergic+age+height, dd, family=binomial)
pd<-do.call(rbind, Map(function(sex, allergic) {
nd <- cbind(vv, sex=sex, allergic=allergic)
cbind(nd, pred=predict(mm, nd, type="response"))
}, sex=gg$sex, allergic=gg$allergic))
Finally, we can use lattice to plot the data
library(lattice)
wireframe(pred~age+height|sex+allergic, pd, drape=TRUE)
which gives us
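A variation on the approach above, given as a sketch: since the question's formula includes an age by height interaction, the model below uses age * height, and the whole prediction grid (continuous variables crossed with every combination of factor levels) is built in a single expand.grid() call instead of Map(). The simulated data and coefficients are invented for illustration.

```r
library(lattice)

set.seed(15)
dd <- data.frame(
  sex      = sample(c("M", "F"), 200, replace = TRUE),
  allergic = sample(c("YES", "NO"), 200, replace = TRUE),
  age      = runif(200, 18, 65),
  height   = rnorm(200, 6, 2)
)
# Made-up data-generating process for a binary outcome
dd$illness <- rbinom(200, 1,
                     plogis(-1 + 0.8 * (dd$sex == "M") + 0.02 * dd$age -
                              0.1 * dd$height))

# Logistic model with an age-by-height interaction
mm <- glm(illness ~ sex + allergic + age * height, data = dd, family = binomial)

# One grid covering both continuous variables and all factor combinations
nd <- expand.grid(age = 18:65, height = 3:9,
                  sex = c("M", "F"), allergic = c("YES", "NO"),
                  stringsAsFactors = FALSE)
nd$pred <- predict(mm, nd, type = "response")

# One panel per sex/allergic combination; print(p) draws the figure
p <- wireframe(pred ~ age + height | sex + allergic, data = nd, drape = TRUE)
```

Conditioning on more than one factor is handled by the `| sex + allergic` part of the plotting formula, so the model formula itself does not need any special syntax.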
