lines() function in base R plot giving several lines instead of smooth line - r

I am attempting to add a smoother to a plot of a regression model I have. I was just using base R to plot my X and Y vectors and add a smoother using plot() and then lines(). I've done this before, and it worked, but today I am given a plot with multiple lines connecting the points as opposed to one smooth line through all the data. I can't figure out what is different about this piece of code I have written, so I am hoping someone here could help me identify the issue.
Here is my code. I am using data I randomly generated to practice something else:
X and random variable vectors to create 'Y':
X <- rnorm(100, mean = 10, sd = 1)
epsilon <- rnorm(100, 0, 1)
Y:
b0 <- 0.27
b1 <- 0.49
b2 <- 0.62
b3 <- 0.8
Y <- b0 + b1*X + b2*2^2 + b3*X^3 + epsilon
Creating df and reg model/Yhat:
df = data.frame(Y,X,epsilon)
reg <- lm(Y ~ I(X^3), data = df)
Yhat <- fitted.values(reg)
cbind(df, Yhat) -> df
plot:
plot(X, Y)
lines(X, Yhat, col = "blue", lwd = 0.5)

For this to work, the X values have to be sorted and the Y values sorted according to their corresponding X values:
X <- rnorm(100, mean = 10, sd = 1)
epsilon <- rnorm(100, 0, 1)
b0 <- 0.27
b1 <- 0.49
b2 <- 0.62
b3 <- 0.8
Y <- b0 + b1*X + b2*2^2 + b3*X^3 + epsilon
df = data.frame(Y,X,epsilon)
reg <- lm(Y ~ I(X^3), data = df)
Yhat <- fitted.values(reg)
cbind(df, Yhat) -> df
plot(X, Y)
lines(X[order(X)], Yhat[order(X)], col = "blue", lwd = 0.5)

Related

Confidence intervals from mocel coefficients vs whole model

I'm trying to demonstrate that there is an important difference between two ways of making linear model predictions. The first way, which my heart tells me is more correct, uses predict.lm which as I understand preserves the correlations between coefficients. The second approach tries to use the parameters independently.
Is this the correct way to show the difference? The two approaches seem somewhat close.
Also, is the StdErr of the coefficients the same as the standard deviation of their distributions? Or have I misunderstood what the model table is saying.
Below is a quick reprex to show what I mean:
# fake dataset
xs <- runif(200, min = -1, max = 1)
true_inter <- -1.3
true_slope <- 3.1
ybar <- true_inter + true_slope*xs
ys <- rnorm(200, ybar, sd = 1)
model <- lm(ys~xs)
# predictions
coef_sterr <- summary(model)$coefficients
inters <- rnorm(500, mean = coef_sterr[1,1], sd = coef_sterr[1,2])
slopes <- rnorm(500, mean = coef_sterr[2,1], sd = coef_sterr[2,2])
newx <- seq(from = -1, to= 1, length.out = 20)
avg_predictions <- cbind(1, newx) %*% rbind(inters, slopes)
conf_predictions <- apply(avg_predictions, 1, quantile, probs = c(.25, .975), simplify = TRUE)
# from confint
conf_interval <- predict(model, newdata=data.frame(xs = newx),
interval="confidence",
level = 0.95)
# plot to visualize
plot(ys~xs)
# averages are exactly the same
abline(model)
abline(a = coef(model)[1], b = coef(model)[2], col = "red")
# from predict, using parameter covariance
matlines(newx, conf_interval[,2:3], col = "blue", lty=1, lwd = 3)
# from simulated lines, ignoring parameter covariance
matlines(newx, t(conf_predictions), col = "orange", lty = 1, lwd = 2)
Created on 2022-04-05 by the reprex package (v2.0.1)
In this case, they would be close because there is very little correlation between the model parameters, so drawing them from two independent normals versus a multivariate normal is not that different:
set.seed(519)
xs <- runif(200, min = -1, max = 1)
true_inter <- -1.3
true_slope <- 3.1
ybar <- true_inter + true_slope*xs
ys <- rnorm(200, ybar, sd = 1)
model <- lm(ys~xs)
cov2cor(vcov(model))
# (Intercept) xs
# (Intercept) 1.00000000 -0.08054106
# xs -0.08054106 1.00000000
Also, it is probably worth calculating both of the intervals the same way, though it shouldn't make that much difference. That said, 500 observations may not be enough to get reliable estimates of the 2.5th and 97.5th percentiles of the distribution. Let's consider a slightly more complex example. Here, the two X variables are correlated - the correlation of the parameters derives in part from the correlation of the columns of the design matrix, X.
set.seed(519)
X <- MASS::mvrnorm(200, c(0,0), matrix(c(1,.65,.65,1), ncol=2))
b <- c(-1.3, 3.1, 2.5)
ytrue <- cbind(1,X) %*% b
y <- ytrue + rnorm(200, 0, .5*sd(ytrue))
dat <- data.frame(y=y, x1=X[,1], x2=X[,2])
model <- lm(y ~ x1 + x2, data=dat)
cov2cor(vcov(model))
# (Intercept) x1 x2
# (Intercept) 1.00000000 0.02417386 -0.01515887
# x1 0.02417386 1.00000000 -0.73228003
# x2 -0.01515887 -0.73228003 1.00000000
In this example, the coefficients for x1 and x2 are correlated around -0.73. As you'll see, this still doesn't result in a huge difference. Let's calculate the relevant statistics.
First, we draw B1 using the multivariate method that you rightly suspect is correct. Then, we'll draw B2 from a bunch of independent normals (actually, I'm using a multivariate normal with a diagonal variance-covariance matrix, which is the same thing).
b_est <- coef(model)
v <- vcov(model)
B1 <- MASS::mvrnorm(2500, b_est, v, empirical=TRUE)
B2 <- MASS::mvrnorm(2500, b_est, diag(diag(v)), empirical = TRUE)
Now, let's make a hypothetical X matrix and generate the relevant predictions:
hypX <- data.frame(x1=seq(-3,3, length=50),
x2 = mean(dat$x2))
yhat1 <- as.matrix(cbind(1, hypX)) %*% t(B1)
yhat2 <- as.matrix(cbind(1, hypX)) %*% t(B2)
Then we can calculate confidence intervals, etc...
yh1_ci <- t(apply(yhat1, 1, function(x)unname(quantile(x, c(.025,.975)))))
yh2_ci <- t(apply(yhat2, 1, function(x)unname(quantile(x, c(.025,.975)))))
yh1_ci <- as.data.frame(yh1_ci)
yh2_ci <- as.data.frame(yh2_ci)
names(yh1_ci) <- names(yh2_ci) <- c("lwr", "upr")
yh1_ci$fit <- c(as.matrix(cbind(1, hypX)) %*% b_est)
yh2_ci$fit <- c(as.matrix(cbind(1, hypX)) %*% b_est)
yh1_ci$method <- factor(1, c(1,2), labels=c("Multivariate", "Independent"))
yh2_ci$method <- factor(2, c(1,2), labels=c("Multivariate", "Independent"))
yh1_ci$x1 <- hypX[,1]
yh2_ci$x1 <- hypX[,1]
yh <- rbind(yh1_ci, yh2_ci)
We could then plot the two confidence intervals as you did.
ggplot(yh, aes(x=x1, y=fit, ymin=lwr, ymax=upr, fill=method)) +
geom_ribbon(colour="transparent", alpha=.25) +
geom_line() +
theme_classic()
Perhaps a better visual would be to compare the widths of the intervals.
w1 <- yh1_ci$upr - yh1_ci$lwr
w2 <- yh2_ci$upr - yh2_ci$lwr
ggplot() +
geom_point(aes(x=hypX[,1], y=w2-w1)) +
theme_classic() +
labs(x="x1", y="Width (Independent) - Width (Multivariate)")
This shows that for small values of x1, the independent confidence intervals are wider than the multivariate ones. For values of x1 above 0, it's a more mixed bag.
This tells you that there is some difference, but you don't need the simulation to know which one is 'right'. That's because the prediction is a linear combination of constants and random variables.
In this case, the b terms are the random variables and the x values are the constants. We know that the variance of a linear combination can be calculated this way:
All that is to say that your intuition is correct.

I would like to visualize the third order interaction fitted with thin-plate regression splines

I am a beginner of R so it may be a simple question.
I am now trying to fit a 4-dimensional point using thin-plate regression splines. One variable is a target variable and three variables are an explanatory variable.
I made a model with third order interaction and fitted the data to this.
library(mgcv)
dat <- read.csv('../data//data.csv')
model <- gam(Y ~ s(x1, x2, x3), data=dat)
By giving x3, I want to visualize a three-dimensional graph of spline curve or estimated contour plot, but how do I do it?
It will be very helpful if you can answer.
Thanks.
This is the sample data.
n = 100
x1 <- runif(n, min = 0, max = 100)
x2 <- runif(n, min = 0, max = 100)
x3 <- runif(n, min = 0, max = 100)
Y = numeric(n)
for(i in 1:n){
Y[i] <- x1[i]**0.5*x2[i]**2*x3[i]/10000
}
dat = data.frame(Y=Y, x1=x1, x2=x2, x3=x3)
I do thin-plane regression spline using this dat.
model <- gam(Y ~ s(x1, x2, x3, k= 50), data=dat)
Then, I would like to obtain a fitting curve of three-dimensional thin-plane regression spline or contour plot estimated by regression spline when x3 = 25, for example.
To make a contour plot, you can use contour(x, y, z, ...). z is your data Matrix (in your case, Y[x1,x2, ], x and y are index vectors from 0 to 1 with a length of nrow(Y[x1,x2, ]) and ncol(Y[x1,x2, ]).
You should be able to use it similar to:
contour( x = seq(0, 1, length.out = length(x1)), y = seq(0, 1, length.out = length(x2)), z = Y[x1,x2, ] )
I found a solution with reference to the answer of d0d0.
n=100
const=25
x = y = seq(0, n, 1)
f = function(x,y){
dtmp <- data.frame(x1=(x), x2=(y), x3=(const))
pred <- predict.gam(model, dtmp)
}
z = outer(x, y, f)
contour(x,y,z)

Reproduce Fisher linear discriminant figure

Many books illustrate the idea of Fisher linear discriminant analysis using the following figure (this particular is from Pattern Recognition and Machine Learning, p. 188)
I wonder how to reproduce this figure in R (or in any other language). Pasted below is my initial effort in R. I simulate two groups of data and draw linear discriminant using abline() function. Any suggestions are welcome.
set.seed(2014)
library(MASS)
library(DiscriMiner) # For scatter matrices
# Simulate bivariate normal distribution with 2 classes
mu1 <- c(2, -4)
mu2 <- c(2, 6)
rho <- 0.8
s1 <- 1
s2 <- 3
Sigma <- matrix(c(s1^2, rho * s1 * s2, rho * s1 * s2, s2^2), byrow = TRUE, nrow = 2)
n <- 50
X1 <- mvrnorm(n, mu = mu1, Sigma = Sigma)
X2 <- mvrnorm(n, mu = mu2, Sigma = Sigma)
y <- rep(c(0, 1), each = n)
X <- rbind(x1 = X1, x2 = X2)
X <- scale(X)
# Scatter matrices
B <- betweenCov(variables = X, group = y)
W <- withinCov(variables = X, group = y)
# Eigenvectors
ev <- eigen(solve(W) %*% B)$vectors
slope <- - ev[1,1] / ev[2,1]
intercept <- ev[2,1]
par(pty = "s")
plot(X, col = y + 1, pch = 16)
abline(a = slope, b = intercept, lwd = 2, lty = 2)
MY (UNFINISHED) WORK
I pasted my current solution below. The main question is how to rotate (and move) the density plot according to decision boundary. Any suggestions are still welcome.
require(ggplot2)
library(grid)
library(MASS)
# Simulation parameters
mu1 <- c(5, -9)
mu2 <- c(4, 9)
rho <- 0.5
s1 <- 1
s2 <- 3
Sigma <- matrix(c(s1^2, rho * s1 * s2, rho * s1 * s2, s2^2), byrow = TRUE, nrow = 2)
n <- 50
# Multivariate normal sampling
X1 <- mvrnorm(n, mu = mu1, Sigma = Sigma)
X2 <- mvrnorm(n, mu = mu2, Sigma = Sigma)
# Combine into data frame
y <- rep(c(0, 1), each = n)
X <- rbind(x1 = X1, x2 = X2)
X <- scale(X)
X <- data.frame(X, class = y)
# Apply lda()
m1 <- lda(class ~ X1 + X2, data = X)
m1.pred <- predict(m1)
# Compute intercept and slope for abline
gmean <- m1$prior %*% m1$means
const <- as.numeric(gmean %*% m1$scaling)
z <- as.matrix(X[, 1:2]) %*% m1$scaling - const
slope <- - m1$scaling[1] / m1$scaling[2]
intercept <- const / m1$scaling[2]
# Projected values
LD <- data.frame(predict(m1)$x, class = y)
# Scatterplot
p1 <- ggplot(X, aes(X1, X2, color=as.factor(class))) +
geom_point() +
theme_bw() +
theme(legend.position = "none") +
scale_x_continuous(limits=c(-5, 5)) +
scale_y_continuous(limits=c(-5, 5)) +
geom_abline(intecept = intercept, slope = slope)
# Density plot
p2 <- ggplot(LD, aes(x = LD1)) +
geom_density(aes(fill = as.factor(class), y = ..scaled..)) +
theme_bw() +
theme(legend.position = "none")
grid.newpage()
print(p1)
vp <- viewport(width = .7, height = 0.6, x = 0.5, y = 0.3, just = c("centre"))
pushViewport(vp)
print(p2, vp = vp)
Basically you need to project the data along the direction of the classifier, plot a histogram for each class, and then rotate the histogram so its x axis is parallel to the classifier. Some trial-and-error with scaling the histogram is needed in order to get a nice result. Here's an example of how to do it in Matlab, for the naive classifier (difference of class' means). For the Fisher classifier it is of course similar, you just use a different classifier w. I changed the parameters from your code so the plot is more similar to the one you gave.
rng('default')
n = 1000;
mu1 = [1,3]';
mu2 = [4,1]';
rho = 0.3;
s1 = .8;
s2 = .5;
Sigma = [s1^2,rho*s1*s1;rho*s1*s1, s2^2];
X1 = mvnrnd(mu1,Sigma,n);
X2 = mvnrnd(mu2,Sigma,n);
X = [X1; X2];
Y = [zeros(n,1);ones(n,1)];
scatter(X1(:,1), X1(:,2), [], 'b' );
hold on
scatter(X2(:,1), X2(:,2), [], 'r' );
axis equal
m1 = mean(X(1:n,:))';
m2 = mean(X(n+1:end,:))';
plot(m1(1),m1(2),'bx','markersize',18)
plot(m2(1),m2(2),'rx','markersize',18)
plot([m1(1),m2(1)], [m1(2),m2(2)],'g')
%% classifier taking only means into account
w = m2 - m1;
w = w / norm(w);
% project data onto w
X1_projected = X1 * w;
X2_projected = X2 * w;
% plot histogram and rotate it
angle = 180/pi * atan(w(2)/w(1));
[hy1, hx1] = hist(X1_projected);
[hy2, hx2] = hist(X2_projected);
hy1 = hy1 / sum(hy1); % normalize
hy2 = hy2 / sum(hy2); % normalize
scale = 4; % set manually
h1 = bar(hx1, scale*hy1,'b');
h2 = bar(hx2, scale*hy2,'r');
set([h1, h2],'ShowBaseLine','off')
% rotate around the origin
rotate(get(h1,'children'),[0,0,1], angle, [0,0,0])
rotate(get(h2,'children'),[0,0,1], angle, [0,0,0])

Drawing a regression surface with an interaction in a 3D figure in R

Using car::scatter3d(), I am trying to create a 3D figure with a regression surface indicating an interaction between a categorical and a continuous variable. Partly following the code here, I obtained a figure below.
The figure is obviously wrong in that the regression surface does not reach one of the values of the categorical variable. The problem perhaps lies in the use of the rgl::persp3d() (the last block of the code below), but I have not been able to identify what exactly I'm doing wrongly. Could someone let me know what I'm missing and how to fix the problem?
library(rgl)
library(car)
n <- 100
set.seed(1)
x <- runif(n, 0, 10)
set.seed(1)
z <- sample(c(0, 1), n, replace = TRUE)
set.seed(1)
y <- 0.5 * x + 0.1 * z + 0.3 * x * z + rnorm(n, sd = 1.5)
d <- data.frame(x, z, y)
scatter3d(y ~ x + z, data = d,
xlab = "continuous", zlab = "categorical", ylab = "outcome",
residuals = FALSE, surface = FALSE
)
d2 <- d
d2$x <- d$x / (max(d$x) - min(d$x))
d2$y <- d$y / (max(d$y) - min(d$y))
mod <- lm(y ~ x * z, data = d2)
grd <- expand.grid(x = unique(d2$x), z = unique(d2$z))
grd$pred <- predict(mod, newdata = grd)
grd <- grd[order(grd$z, grd$x), ]
# The problem is likely to lie somewhere below.
persp3d(x = unique(grd$x), y = unique(grd$z),
z = matrix(grd$pred, length(unique(grd$z)), length(unique(grd$x))),
alpha = 0.5,
col = "blue",
add = TRUE,
xlab = "", ylab = "", zlab = ""
)
I prefer sticking to car::scatter3d() in drawing the original graph because I already made several figures with car::scatter3d() and want to make this figure consistent with them as well.

R: Determine the threshold that maximally separates two groups based on a continuous variable?

Say I have 200 subjects, 100 in group A and 100 in group B, and for each I measure some continuous parameter.
require(ggplot2)
set.seed(100)
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep('A', 100), rep('B', 100))
data <- data.frame(value, group)
ggplot(data = data, aes(x = value)) +
geom_bar(aes(color = group))
I would like to determine the value (Threshold? Breakpoint?) that maximizes separation and minimizes misclassification between the groups. Does such a function exist in R?
I've tried searching along the lines of "r breakpoint maximal separation between groups," and "r threshold minimize misclassification," but my google-foo seems to be off today.
EDIT:
Responding to #Thomas's comment, I have tried to fit the data using logistic regression and then solve for the threshold, but I haven't gotten very far.
lr <- glm(group~value)
coef(lr)
# (Intercept) value
# 1.1857435 -0.0911762
So Bo = 1.1857435 and B1 = -0.0911762
From Wikipedia, I see that F(x) = 1/(1+e^-(Bo + B1x)), and solving for x:
x = (ln(F(x) / (1 - F(x))) - Bo)/B1
But trying this in R, I get an obviously incorrect answer:
(log(0.5/(1 - 0.5)) - 1.1857435)/-0.0911762 # 13.00497
A simple approach is to write a function that calculates the accuracy given a threshold:
accuracy = Vectorize(function(th) mean(c("A", "B")[(value > th) + 1] == group))
Then find the maximum using optimize:
optimize(accuracy, c(min(value), max(value)), maximum=TRUE)
# $maximum
# [1] 8.050888
#
# $objective
# [1] 0.86
I've gotten the answer I need thanks to help from #Thomas and #BenBolker.
Summary
The problem with my attempt at solving it through logistic regression was that I hadn't specified family = binomial
The dose.p() function in MASS will do the work for me given a glm fit
Code
# Include libraries
require(ggplot2)
require(MASS)
# Set seed
set.seed(100)
# Put together some dummy data
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep(0, 100), rep(1, 100))
data <- data.frame(value, group)
# Plot the distribution -- visually
# The answer appears to be b/t 7 and 8
ggplot(data = data, aes(x = value)) +
geom_bar(aes(color = group))
# Fit a glm model, specifying the binomial distribution
my.glm <- glm(group~value, data = data, family = binomial)
b0 <- coef(my.glm)[[1]]
b1 <- coef(my.glm)[[2]]
# See what the probability function looks like
lr <- function(x, b0, b1) {
prob <- 1 / (1 + exp(-1*(b0 + b1*x)))
return(prob)
}
# The line appears to cross 0.5 just above 7.5
x <- -0:12
y <- lr(x, b0, b1)
lr.val <- data.frame(x, y)
ggplot(lr.val, aes(x = x, y = y)) +
geom_line()
# The inverse of this function computes the threshold for a given probability
inv.lr <- function(p, b0, b1) {
x <- (log(p / (1 - p)) - b0)/b1
return(x)
}
# With the betas from this function, we get 7.686814
inv.lr(0.5, b0, b1)
# Or, feeding the glm model into dose.p from MASS, we get the same answer
dose.p(my.glm, p = 0.5)
Thanks, everyone, for your help!

Resources