Extracting x-axis intercept from a linear fit in R

I have some data generated using the following lines of code,
x <- c(1:10)
y <- x^3
z <- y-20
s <- z/3
t <- s*6
q <- s*y
x1 <- cbind(x,y,z,s,t,q)
x1 <- data.frame(x1)
I would like to plot x versus y, s, and t, so I melt the data frame x1 first,
library(reshape2)
xm <- melt(x1, id = names(x1)[1], measure = names(x1)[c(2, 4, 5)], variable.name = "cols")
Then I plot them along with their linear fits using the following code,
library(ggplot2)
plt <- ggplot(xm, aes(x = x, y = value, color = cols)) +
geom_point(size = 3) +
labs(x = "x", y = "y") +
geom_smooth(method = "lm", se = FALSE)
plt
The plot which is generated is shown below,
Now I would like to interpolate the x-intercept of the linear fit, i.e. the point in the plot where the y-axis value is 0.
The following lines of code, as shown here, extract the slope and y-intercept.
fits <- by(xm[-2], xm$cols, function(i) coef(lm(value ~ x, i)))
data.frame(cols = names(fits), do.call(rbind, fits))
Is there any way I can extract the x-intercept other than manually calculating it from the slope and y-intercept?
Thanks for the help!

If you don't want to calculate this yourself, you could use inverse prediction as implemented for calibrations in package chemCal:
library(chemCal)
res <- by(xm[-2], xm$cols, function(i) inverse.predict(lm(value ~ x, i), 0)$Prediction)
res[1:3]
#xm$cols
#y s t
#2.629981 2.819734 2.819734
Edit:
Maybe you prefer this:
library(plyr)
res <- ddply(xm, .(cols),
function(i) data.frame(xinter=inverse.predict(lm(value ~ x, i), 0)$Prediction))
# cols xinter
# 1 y 2.629981
# 2 s 2.819734
# 3 t 2.819734

I don't think you can avoid computing the linear equation, though of course you don't have to do it by hand (unless you want to). For example:
by(xm[-2], xm$cols, function(i) {
  fit <- lm(value ~ x, i)
  print(fit)
  solve(coef(fit)[-1], -coef(fit)[1])
})
Call:
lm(formula = value ~ x, data = i)
Coefficients:
(Intercept) x
-277.2 105.4
Call:
lm(formula = value ~ x, data = i)
Coefficients:
(Intercept) x
-99.07 35.13
Call:
lm(formula = value ~ x, data = i)
Coefficients:
(Intercept) x
-594.4 210.8
xm$cols: y
[1] 2.629981
-----------------------------------------------------------------------------------------------------------------
xm$cols: s
[1] 2.819734
-----------------------------------------------------------------------------------------------------------------
xm$cols: t
[1] 2.819734
What was solved is basically -277.2 + 105.4*x = 0 for x: 105.4*x = 277.2 (the solve() call), so x = 2.629981. It seems your lines 's' and 't' intersect the y = 0 axis at the same spot. If I understood correctly, your problem isn't extrapolation, since your x-range covers the intercept, but interpolation.
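As a quick sanity check of that arithmetic, dividing the rounded printed coefficients reproduces the result:
277.2 / 105.4
# [1] 2.629981, matching the solve() result up to coefficient rounding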
P.S. I think your code was missing: library(reshape2)
EDIT:
result <- c(by(xm[-2], xm$cols, function(i) {
  fit <- lm(value ~ x, i)
  print(fit)
  solve(coef(fit)[-1], -coef(fit)[1])
}))
print(result)
> print(result)
y s t
2.629981 2.819734 2.819734

I found a way to calculate the x-intercept. First, create a data frame with the y-intercept and slope values,
par <- data.frame(cols = names(fits), do.call(rbind, fits))
Then rename the column headers to accurately denote the values,
colnames(par)[2] <- "y_intercept"
colnames(par)[3] <- "slope"
# Calculate the x-intercept by using the formula -(y_intercept)/slope
x_incpt <- -par[2]/par[3]
colnames(x_incpt) <- "x_intercept"
This gives the following result,
x_intercept
y 2.629981
s 2.819734
t 2.819734
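If you want this as a reusable helper rather than data-frame arithmetic, here is a minimal sketch (the name x_intercept and the split()/sapply() wrapper are mine, not part of the code above):
x_intercept <- function(fit) {
  cf <- coef(fit)
  -cf[[1]] / cf[[2]]  # -(y_intercept)/slope
}
sapply(split(xm[-2], xm$cols), function(d) x_intercept(lm(value ~ x, d)))
#        y        s        t
# 2.629981 2.819734 2.819734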

Related

How to set all coefficients to one in model?

To fix a certain coefficient in a regression to one, we can use the offset function.
I want to set all coefficients to 1.
Let's take this example:
set.seed(42)
y <- rnorm(100)
df <- data.frame("Uni" = runif(100), "Exp" = rexp(100), "Wei" = rweibull(100, 1))
lm(y ~ offset(Uni) + offset(Exp) + offset(Wei), data = df)
Call:
lm(formula = y ~ offset(Uni) + offset(Exp) + offset(Wei), data = df)
Coefficients:
(Intercept)
-2.712
This code works; however, what if I have a huge number of variables, e.g. 800, and I want to do this for all of them? Writing out all their names would not be efficient. Is there any solution that lets us do this in a cleverer way?
I think I found one solution if we do it this way:
set.seed(42)
# Assign everything to one data frame
df <- data.frame("Dep" = rnorm(100), "Uni" = runif(100),
"Exp" = rexp(100), "Wei" = rweibull(100, 1))
varnames <- names(df)[-1]
# Create formula for the sake of model creation
form <- paste0("offset(", varnames, ")", collapse = "+")
form <- as.formula(paste0(names(df)[1], "~", form))
lm(form, data = df)
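For reference, printing form shows the constructed formula (given the df above, whose first column is Dep):
form
## Dep ~ offset(Uni) + offset(Exp) + offset(Wei)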
1) terms/update The following one-liner will produce the indicated formula.
update(formula(terms(y ~ ., data = df)), ~ offset(.))
## y ~ offset(Uni + Exp + Wei)
2) reformulate/sprintf another approach is:
reformulate(sprintf("offset(%s)", names(df)), "y")
## y ~ offset(Dep) + offset(Uni) + offset(Exp) + offset(Wei)
3) rowSums Another approach is to simply sum each row:
lm(y ~ offset(rowSums(df)))
4) lm.fit We could use lm.fit in which case we don't need a formula:
lm.fit(cbind(y^0), y, offset = rowSums(df))
5) mean If you only need the coefficient then it is just:
mean(y - rowSums(df))
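As a quick cross-check (a sketch, assuming df holds only the predictor columns): the intercept from approach 3) must equal the value from approach 5), because regressing y minus the offset on an intercept alone simply estimates the mean of that difference.
coef(lm(y ~ offset(rowSums(df))))  # intercept from 3)
mean(y - rowSums(df))              # 5), same value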

geom_smooth() with median instead of mean

I am building a plot with ggplot. I have data where y is mostly independent of X, but I randomly have a few extreme values of Y at low values of X. Like this:
set.seed(1)
X <- rnorm(500, mean=5)
y <- rnorm(500)
y[X < 3] <- sample(c(0, 1000), size = length(y[X < 3]), prob = c(0.9, 0.1),
                   replace = TRUE)
I want to make the point that the MEDIAN y-value is still constant over X values. I can see that this is basically true here:
mean(y[X < 3])
median(y[X < 3])
If I make a geom_smooth() plot, it uses the mean, and is very affected by the outliers:
ggplot(data=NULL, aes(x=X, y=y)) + geom_smooth()
I have a few potential fixes. For example, I could first use group_by/summarize to make a dataset of binned medians and then plot that. I would rather NOT do this because in my real data I have a lot of facetting and grouping variables, and it would be a lot to keep track of (non-ideal). A log plot definitely looks better, but log does not have a nice interpretation in my application (the median does have a nice interpretation):
ggplot(data=NULL, aes(x=X, y=y)) + geom_smooth() +
scale_y_log10()
Finally, I know about geom_quantile, but I think I'm using it wrong. Is there a way to add an error bar? Also, this geom_quantile plot looks way too smooth, and I don't understand why it is sloping down. Am I using it wrong?
ggplot(data=NULL, aes(x=X, y=y)) +
geom_quantile(quantiles=c(0.5))
I realize that this problem probably has a LOT of workarounds, but if possible I would love to use geom_smooth and just provide an argument that tells it to use a median. I want geom_smooth so that the side-by-side comparison is consistent: I want to put the mean and median geom_smooths next to each other to show "hey look, the super strong pattern between Y and X is driven by a few large outliers; if we look only at the median, the pattern disappears".
Thanks!!
You can create your own method to use in geom_smooth. All you need is a function that produces an object on which the predict generic works, taking a data frame with a column called x and translating it into appropriate values of y.
As an example, let's create a simple model that interpolates along a running median. We wrap it in its own class and give it its own predict method:
rolling_median <- function(formula, data, n_roll = 11, ...) {
  x <- data$x[order(data$x)]  # sort the data by x
  y <- data$y[order(data$x)]
  y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)  # running median over n_roll points
  structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
Now we can use our method in geom_smooth:
ggplot(data = NULL, aes(x = X, y = y)) +
geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now of course this doesn't look very "flat", but it is far flatter than the line calculated by the loess method of the standard geom_smooth():
ggplot(data = NULL, aes(x = X, y = y)) +
geom_smooth(formula = y ~ x, color = "red", se = FALSE) +
geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now, I understand that this is not the same thing as "regressing on the median", so you may wish to explore different methods, but if you want to get geom_smooth to plot them, this is how you can go about it. Note that if you want standard errors, you will need to have your predict function return a list with members called fit and se.fit, as in the sketch below.
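A hedged sketch of such a predict method (replacing the simpler one above; the constant, MAD-based standard error is purely illustrative, not a principled uncertainty estimate):
predict.rollmed <- function(mod, newdata, se.fit = FALSE, ...) {
  fit <- setNames(mod$f(newdata$x), newdata$x)
  if (!se.fit) return(fit)
  # illustrative only: a constant spread estimate from the raw y values
  se <- rep(mad(mod$y, na.rm = TRUE) / sqrt(sum(!is.na(mod$y))), length(fit))
  list(fit = fit, se.fit = se)
}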
Here's a modification of @Allan's answer that uses a fixed x window rather than a fixed number of points. This is useful for irregular time series and for series with multiple observations at the same time (x value). It uses a loop, so it's not very efficient and will be slow for larger data sets.
# running median with time window
library(dplyr)
library(ggplot2)
library(zoo)
# some irregular and skewed data
set.seed(1)
x <- seq(2000, 2020, length.out = 400) # normal time series, gives same result for both methods
x <- sort(rep(runif(40, min = 2000, max = 2020), 10)) # irregular and repeated time series
y <- exp(runif(length(x), min = -1, max = 3))
data <- data.frame(x = x, y = y)
# ggplot(data) + geom_point(aes(x = x, y = y))
# 2 year window
xwindow <- 2
nwindow <- xwindow * length(x) / 20 - 1
# rolling median
rolling_median <- function(formula, data, n_roll = 11, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
# rolling time window median
rolling_median2 <- function(formula, data, xwindow = 2, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
ys <- rep(NA, length(x)) # for the smoothed y values
xs <- setdiff(unique(x), NA) # the unique x values
i <- 1 # for testing
for (i in seq_along(xs)){
j <- xs[i] - xwindow/2 < x & x < xs[i] + xwindow/2 # x points in this window
ys[x == xs[i]] <- median(y[j], na.rm = TRUE) # y median over this window
}
y <- ys
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed2")
}
predict.rollmed2 <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
# plot smooth
ggplot(data) +
geom_point(aes(x = x, y = y)) +
geom_smooth(aes(x = x, y = y, colour = "nwindow"), formula = y ~ x, method = "rolling_median", se = FALSE, method.args = list(n_roll = nwindow)) +
geom_smooth(aes(x = x, y = y, colour = "xwindow"), formula = y ~ x, method = "rolling_median2", se = FALSE, method.args = list(xwindow = xwindow))
Created on 2022-01-05 by the reprex package (v2.0.1)

R convert regression model fit to a function

I want to quickly extract the fit of a regression model to a function.
So I want to get from:
# generate some random data
set.seed(123)
x <- rnorm(n = 100, mean = 10, sd = 4)
z <- rnorm(n = 100, mean = -8, sd = 3)
y <- 9 * x - 10 * x ^ 2 + 5 * z + 10 + rnorm(n = 100, 0, 30)
df <- data.frame(x, y, z)
plot(df$x,df$y)
model1 <- lm(formula = y ~ x + I(x^2) + z, data = df)
summary(model1)
to a model_function(x) that describes the fitted values for me.
Of course I could do this by hand, like this:
model_function <- function(x, z, model) {
  fit <- coefficients(model)["(Intercept)"] +
    coefficients(model)["x"] * x +
    coefficients(model)["I(x^2)"] * x^2 +
    coefficients(model)["z"] * z
  return(fit)
}
fit <- model_function(df$x,df$z, model1)
which I can compare to the actual fitted values; up to rounding error, it matches perfectly.
all(round(as.numeric(model1$fitted.values),5) == round(fit,5))
But of course this is not a universal solution (e.g. with more variables, etc.).
So to be clear:
Is there an easy way to extract the fitted values relationship as a function with the coefficients that were just estimated?
Note: I know of course about predict and the ability to generate fitted values from new data, but I'm really looking for that underlying function. Maybe that's possible through predict?
Grateful for any help!
If you want an actual function you can do something like this:
get_func <- function(mod) {
vars <- as.list(attr(mod$terms, "variables"))[-(1:2)]
funcs <- lapply(vars, function(x) list(quote(`*`), 1, x))
terms <- mapply(function(x, y) {x[[2]] <- y; as.call(x)}, funcs, mod$coefficients[-1],
SIMPLIFY = FALSE)
terms <- c(as.numeric(mod$coefficients[1]), terms)
body <- Reduce(function(a, b) as.call(list(quote(`+`), a, b)), terms)
vars <- setNames(lapply(seq_along(vars), function(x) NULL), sapply(vars, as.character))
f <- as.function(c(do.call(alist, vars), body))
formals(f) <- formals(f)[!grepl("\\(", names(formals(f)))]
f
}
Which allows:
my_func <- get_func(model1)
my_func
#> function (x = NULL, z = NULL)
#> 48.6991866925322 + 3.31343108778127 * x + -9.77589420188036 * I(x^2) + 5.38229596972984 * z
<environment: 0x00000285a1982b48>
and
my_func(x = 1:10, z = 3)
#> [1] 58.38361 32.36936 -13.19668 -78.31451 -162.98413 -267.20553
#> [7] -390.97872 -534.30371 -697.18048 -879.60903
and
plot(1:10, my_func(x = 1:10, z = 3), type = "b")
At the moment this would not work with interaction terms, etc., but it should work for most simple linear models.
Any of these give the fitted values:
fitted(model1)
predict(model1)
model.matrix(model1) %*% coef(model1)
y - resid(model1)
X <- model.matrix(model1); X %*% qr.solve(X, y)
X <- cbind(1, x, x^2, z); X %*% qr.solve(X, y)
Any of these give the predicted values for any particular x and z:
cbind(1, x, x^2, z) %*% coef(model1)
predict(model1, list(x = x, z = z))
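If the goal is just a reusable function rather than an explicit symbolic expression, a minimal sketch (make_model_function is my own name, and it hardcodes the x and z arguments of this particular model) is a closure around predict():
make_model_function <- function(model) {
  function(x, z) predict(model, newdata = data.frame(x = x, z = z))
}
model_fn <- make_model_function(model1)
head(model_fn(df$x, df$z))  # should match head(fitted(model1))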

Difference between two geom_smooth() lines

I made a plot of my data and now I would like to have the difference in y for every x that was estimated by geom_smooth(). There is a similar question which unfortunately has no answer. For example, how to get the differences for the following plot (data below):
EDIT
Two suggestions were made but I still don't know how to calculate the differences.
First suggestion was to access the data from the ggplot object. I did so with
pb <- ggplot_build(p)
pb[["data"]][[1]]
That approach kind of works, but the data doesn't use the same x values for the groups. For example, the first x value of the first group is -3.21318853, but there is no x of -3.21318853 for the second group; hence, I cannot calculate the difference in y at -3.21318853 between both groups.
Second suggestion was to see what formula is used in geom_smooth(). The package description says that "loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs")". My N is more than 60,000; hence, gam is used by default. I am not familiar with gam; can anyone provide a short answer on how to calculate the difference between the two lines, considering what was just described?
R Code
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
ggplot(df, aes(x, y, col= g)) + geom_smooth() + geom_point(alpha= .1) # make a plot
Hi and welcome to Stack Overflow.
The first suggestion is good. To make the x-sequences match, you can interpolate the values in between using the approx function (in stats).
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
p <- ggplot(df, aes(x, y, col= g)) + geom_smooth() + geom_point(alpha= .1) # make a plot
pb <- ggplot_build(p) # Get computed data
data.of.g1 <- pb[['data']][[1]][pb[['data']][[1]]$group == 1, ] # Extract info for group 1
data.of.g2 <- pb[['data']][[1]][pb[['data']][[1]]$group == 2, ] # Extract info for group 2
xlimit.inf <- max(min(data.of.g1$x), min(data.of.g2$x)) # Get the minimum X the two smoothed data have in common
xlimit.sup <- min(max(data.of.g1$x), max(data.of.g2$x)) # Get the maximum X
xseq <- seq(xlimit.inf, xlimit.sup, 0.01) # Sequence of X value (you can use bigger/smaller step size)
# Based on data from group 1 and group 2, interpolates linearly for all the values in `xseq`
y.g1 <- approx(x = data.of.g1$x, y = data.of.g1$y, xout = xseq)
y.g2 <- approx(x = data.of.g2$x, y = data.of.g2$y, xout = xseq)
difference <- data.frame(x = xseq, dy = abs(y.g1$y - y.g2$y)) # Compute the difference
ggplot(difference, aes(x = x, y = dy)) + geom_line() # Make the plot
Output:
As I mentioned in the comments above, you really are better off doing this outside of ggplot, and instead doing it with a full model of the two smooths from which you can compute uncertainties on the difference, etc.
This is basically a short version of a blog post that I wrote a year or so back.
OP's example data
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
Start by fitting the model for the example data:
library("mgcv")
m <- gam(y ~ g + s(x, by = g), data = df, method = "REML")
Here I'm fitting a GAM with a factor-smooth interaction (the by bit). For this model we also need to include g as a parametric effect: the group-specific smooths are both centred about 0, so the group means have to enter through the parametric part of the model.
Next we need a grid of data along the x variable at which we will estimate the difference between the two estimated smooths:
pdat <- with(df, expand.grid(x = seq(min(x), max(x), length = 200),
g = c(0,1)))
pdat <- transform(pdat, g = factor(g))
Then we use this prediction data to generate the Xp matrix, a matrix that maps values of the covariates to values of the basis expansion for the smooths; we can manipulate this matrix to get the difference smooth that we want:
xp <- predict(m, newdata = pdat, type = "lpmatrix")
Next, some code to identify which rows and columns in xp belong to the smooths for the respective levels of g. As there are only two levels and only a single smooth term in the model, this is entirely trivial, but for more complex models it is needed, and it is important to get the smooth component names right for the grepl() bits to work.
## which cols of xp relate to splines of interest?
c1 <- grepl('g0', colnames(xp))
c2 <- grepl('g1', colnames(xp))
## which rows of xp relate to sites of interest?
r1 <- with(pdat, g == 0)
r2 <- with(pdat, g == 1)
Now we can difference the rows of xp for the pair of levels we are comparing:
## difference rows of xp for data from comparison
X <- xp[r1, ] - xp[r2, ]
As we focus on the difference, we need to zero out all the columns not associated with the selected pair of smooths, which includes any parametric terms.
## zero out cols of X related to splines for the other levels
X[, ! (c1 | c2)] <- 0
## zero out the parametric cols
X[, !grepl('^s\\(', colnames(xp))] <- 0
(In this example, these two lines do exactly the same thing, but in more complex examples both are needed.)
Now we have a matrix X which contains the difference between the two basis expansions for the pair of smooths we're interested in; to get this in terms of fitted values of the response y, we multiply this matrix by the vector of coefficients:
## difference between smooths
dif <- X %*% coef(m)
Now dif contains the difference between the two smooths.
We can use X again, together with the covariance matrix of the model coefficients, to compute the standard error of this difference, and thence a 95% (in this case) confidence interval for the estimated difference.
## se of difference
se <- sqrt(rowSums((X %*% vcov(m)) * X))
## confidence interval on difference
crit <- qt(.975, df.residual(m))
upr <- dif + (crit * se)
lwr <- dif - (crit * se)
Note that here, with the vcov() call, we're using the empirical Bayesian covariance matrix, but not the one corrected for having chosen the smoothness parameters. The function I show shortly allows you to account for this additional uncertainty via the argument unconditional = TRUE.
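For example, the standard-error line above becomes (a one-line variant using mgcv's vcov.gam argument):
se <- sqrt(rowSums((X %*% vcov(m, unconditional = TRUE)) * X))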
Finally we gather the results and plot:
res <- data.frame(x = with(df, seq(min(x), max(x), length = 200)),
dif = dif, upr = upr, lwr = lwr)
ggplot(res, aes(x = x, y = dif)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = x), alpha = 0.2) +
geom_line()
This produces
This is consistent with an assessment showing that the model with the group-level smooths doesn't provide a substantially better fit than a model with different group means but only a single common smoother in x:
r$> m0 <- gam(y ~ g + s(x), data = df, method = "REML")
r$> AIC(m0, m)
df AIC
m0 9.68355 30277.93
m 14.70675 30285.02
r$> anova(m0, m, test = 'F')
Analysis of Deviance Table
Model 1: y ~ g + s(x)
Model 2: y ~ g + s(x, by = g)
Resid. Df Resid. Dev Df Deviance F Pr(>F)
1 4990.1 124372
2 4983.9 124298 6.1762 73.591 0.4781 0.8301
Wrapping up
The blog post I mentioned has a function which wraps the steps above into a simple function, smooth_diff():
smooth_diff <- function(model, newdata, f1, f2, var, alpha = 0.05,
unconditional = FALSE) {
xp <- predict(model, newdata = newdata, type = 'lpmatrix')
c1 <- grepl(f1, colnames(xp))
c2 <- grepl(f2, colnames(xp))
r1 <- newdata[[var]] == f1
r2 <- newdata[[var]] == f2
## difference rows of xp for data from comparison
X <- xp[r1, ] - xp[r2, ]
## zero out cols of X related to splines for the other levels
X[, ! (c1 | c2)] <- 0
## zero out the parametric cols
X[, !grepl('^s\\(', colnames(xp))] <- 0
dif <- X %*% coef(model)
se <- sqrt(rowSums((X %*% vcov(model, unconditional = unconditional)) * X))
crit <- qt(alpha/2, df.residual(model), lower.tail = FALSE)
upr <- dif + (crit * se)
lwr <- dif - (crit * se)
data.frame(pair = paste(f1, f2, sep = '-'),
diff = dif,
se = se,
upper = upr,
lower = lwr)
}
Using this function we can repeat the entire analysis and plot the difference with:
out <- smooth_diff(m, pdat, '0', '1', 'g')
out <- cbind(x = with(df, seq(min(x), max(x), length = 200)),
out)
ggplot(out, aes(x = x, y = diff)) +
geom_ribbon(aes(ymin = lower, ymax = upper, x = x), alpha = 0.2) +
geom_line()
I won't show the plot here as it is identical to that shown above except for the axis labels.

How to get a scatter plot of mixture data with different shape and colour for each distribution?

I am running a simulation of mixture data. My actual function is more complicated than a Gaussian distribution, so here I have simplified my question to the Gaussian case. That is, suppose I simulate mixture data like this:
N=2000
U=runif(N, min=0,max=1)
X = matrix(NA, nrow=N, ncol=2)
for (i in 1:N){
if(U[i] < 0.7){
X[i,] <- rnorm(1,0.5,1)
} else {
X[i,] <- rnorm(1,3,5)
}
}
How can I make a scatter plot with a different colour and shape (type of plot point) for each cluster or distribution? I would like to do this manually, since my function is hard and complex. I tried plot(X[,1], X[,2], col=c("red","blue")) but it does not work.
I think this is what you want. Note that I had to do a bit of guesswork here to figure out what was going on, because your example code seems to have an error in it: you weren't generating different x1 and x2 values in each row:
N=2000
U=runif(N, min=0,max=1)
X = matrix(NA, nrow = N, ncol=2)
for (i in 1:N){
if(U[i] < 0.7){
# You had rnorm(n=1, ...) which gives 2 identical values in each row
# Change that to 2 and you get different X1 and X2 values
X[i,] <- rnorm(2, 0.5, 1)
} else {
X[i,] <- rnorm(2, 3, 5)
}
}
df = data.frame(
source = ifelse(U < 0.7, "dist1", "dist2"),
x = X[, 1],
y = X[, 2]
)
library(ggplot2)
ggplot(df, aes(x = x, y = y, colour = source, shape = source)) +
geom_point()
Result:
Here's what I got, but I'm not sure if this is what you are looking for; the locations of the observations for both clusters are exactly the same.
library(tidyverse)
df <- data.frame(X = X, U = U)
df <- gather(df, key = cluster, value = X, -U)
ggplot(df, aes(x = X, y = U, colour = cluster)) + geom_point() + facet_wrap(~cluster)
EDIT: I don't seem to be understanding what you are looking to map onto a scatter plot, so I'll indicate how you need to shape your data in order to create a chart like the one above with the proper X and Y coordinates:
head(df)
U cluster X
1 0.98345408 X.1 2.3296047
2 0.33939935 X.1 -0.6042917
3 0.66715421 X.1 -2.2673422
4 0.06093674 X.1 2.4007376
5 0.48162959 X.1 -2.3118850
6 0.50780007 X.1 -0.7307929
So you want one variable for the Y coordinate (I'm using variable U here), one variable for the X coordinate (using X here), and a 3rd variable that indicates whether the observation belongs to cluster 1 or cluster 2 (variable cluster here).
