Defining a line between two points on a scatterplot in ggplot - r

Apologies if this question is a duplicate; looking up "drawing a line between two points" on Stack Overflow gave me some ideas, but I'm not sure how to apply them to my specific problem.
Let's say that my data take the shape of:
x <- runif(n = 10)
y <- runif(n = 10)
graphData <- data.frame(x, y)
graphData
           x          y
1  0.3328235 0.30122890
2  0.4886130 0.06072057
3  0.9544738 0.94772694
4  0.4829024 0.72059627
5  0.8903502 0.14229430
6  0.9144382 0.54928466
7  0.6087350 0.95409124
8  0.4106898 0.58548335
9  0.1470947 0.40451028
10 0.9352998 0.64789348
Then I do a scatterplot of those data:
library(ggplot2)
p <- ggplot(graphData, aes(x = x, y = y)) +
  geom_point()
p
What I want is to draw exactly one line, connecting the point with the highest y-value to the point with the highest x-value. (The example makes it look like those could be the same point, but in my real-life data the odds of that happening are infinitesimally small.)
I'm not just drawing the line on a plot; I will also need to provide that line as a formula, to be used in a separate analysis. Thoughts?

I would try:
library(dplyr)  # for filter() and the %>% pipe

p <- ggplot(graphData, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(data = . %>% filter(x == max(x) | y == max(y)), method = lm)
p
and then get the formula of the line by fitting the same two points with lm():
lm(y ~ x, data = graphData %>% filter(x == max(x) | y == max(y)))
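If the downstream analysis needs the line as explicit numbers rather than a model object, the intercept and slope can be read off that fit with coef(). A small follow-up sketch (not part of the original answer; endpoints and line_fun are just illustrative names, dplyr attached as above):
endpoints <- graphData %>% filter(x == max(x) | y == max(y))
fit <- lm(y ~ x, data = endpoints)
coef(fit)                                                 # intercept and slope of the connecting line
line_fun <- function(x) coef(fit)[1] + coef(fit)[2] * x   # y = intercept + slope * x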

Related

Remove data to the left and right of local minima

I have a lot of measurements where I get data that looks something like this:
# Generate example data
x <- 1:100
y <- 100*(1-exp(-0.3*x))
x2 <- 101:200
y2 <- rev(y)
df <- data.frame("x" = c(x, x2),
                 "y" = c(y, y2))
df$x <- df$x + 50
rm(x, x2, y, y2)
x <- 1:50
y <- 25.91818
x2 <- 251:300
y2 <- 25.91818
df2 <- data.frame("x" = c(x, x2),
                  "y" = c(y, y2))
rm(x, x2, y, y2)
df <- rbind(df, df2)
rm(df2)
If I plot this I can see that there are left-most and right-most local minima.
library(ggplot2)
p <- ggplot(df, aes(x, y)) +
  geom_line() +
  geom_point(data = data.frame("x" = c(50, 250), "y" = c(25.91818, 25.91818)),
             mapping = aes(x, y), colour = "red") +
  scale_y_continuous(limits = c(0, 101))
p + annotate("text", label = "minimum 1", x = 50, y = 20) +
  annotate("text", label = "minimum 2", x = 250, y = 20)
What I would like to do is trim the data that lie to the left of minimum 1 and to the right of minimum 2. It's not super straightforward, as there may also be local minima between those two points, because the real data don't look this ideal. I would also need to apply this process to many, many samples, but I think that part may be trivial because I could use e.g. dplyr and group_by().
I had some luck plotting the local minima using the ggpmisc package, but I'm not sure how I can use that to actually subset my data. Just for clarity I included the code to do so below, and with the real data it looks a little better:
library(ggpmisc)
p2 <- ggplot(df, aes(x, y)) +
  geom_line() +
  ggpmisc::stat_peaks(col = "red", span = 3)
p2
I hope this is clear and I'm happy to clarify any questions. Thank you in advance.
You could do this using the following steps:
Sort your data according to its x co-ordinates
On your sorted data, find the diff of the y co-ordinates, which will be 0 (or close to 0) for the flat sections at either end (as well as any flat sections in between)
Starting from the left, find the first point where the diff is not zero (or at least is above a minimal threshold). Store this index as a variable called left
Starting from the right, find the first point where the diff is not zero (or at least is above a minimal threshold). Store this index as a variable called right
Subset your data frame so it only contains the data between rows left:right
So, in your example we would have:
# Define a minimal threshold above which we are no longer on the flat minimum line
minimal_change <- 1e-6
df <- df[order(df$x), ]                                             # Step 1
left <- which(diff(df$y) > minimal_change)[1]                       # Steps 2-3
right <- nrow(df) - which(diff(rev(df$y)) > minimal_change)[1] + 1  # Step 4
df <- df[left:right, ]                                              # Step 5
Now we can plot the result:
ggplot(df, aes(x, y)) +
  geom_line() +
  geom_point(data = data.frame("x" = c(50, 250), "y" = c(25.91818, 25.91818)),
             mapping = aes(x, y), colour = "red") +
  scale_y_continuous(limits = c(0, 101)) +
  scale_x_continuous(limits = c(0, 300))
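Since the question also mentions applying this to many samples with dplyr and group_by(), here is a hedged sketch of how the steps above could be wrapped in a function and run per group. df_all and sample_id are hypothetical names for a combined data set and its grouping column:
library(dplyr)

# Same logic as above, wrapped in a function (same threshold as before)
trim_flat_ends <- function(d, minimal_change = 1e-6) {
  d <- d[order(d$x), ]
  left <- which(diff(d$y) > minimal_change)[1]
  right <- nrow(d) - which(diff(rev(d$y)) > minimal_change)[1] + 1
  d[left:right, ]
}

# Hypothetical usage: one measurement series per value of sample_id
trimmed <- df_all %>%
  group_by(sample_id) %>%
  group_modify(~ trim_flat_ends(.x)) %>%
  ungroup()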

geom_smooth() with median instead of mean

I am building a plot with ggplot. I have data where y is mostly independent of X, but I randomly have a few extreme values of Y at low values of X. Like this:
set.seed(1)
X <- rnorm(500, mean=5)
y <- rnorm(500)
y[X < 3] <- sample(c(0, 1000), size = length(y[X < 3]), prob = c(0.9, 0.1),
                   replace = TRUE)
I want to make the point that the MEDIAN y-value is still constant over X values. I can see that this is basically true here:
mean(y[X < 3])
median(y[X < 3])
If I make a geom_smooth() plot, it smooths the mean and is heavily affected by the outliers:
ggplot(data=NULL, aes(x=X, y=y)) + geom_smooth()
I have a few potential fixes. For example, I could first use group_by/summarize to make a dataset of binned medians and then plot that. I would rather NOT do this because in my real data I have a lot of faceting and grouping variables, and it would be a lot to keep track of (non-ideal). A log plot definitely looks better, but log does not have a nice interpretation in my application (the median does):
ggplot(data = NULL, aes(x = X, y = y)) +
  geom_smooth() +
  scale_y_log10()
Finally, I know about geom_quantile, but I think I'm using it wrong. Is there a way to add an error bar? Also, this geom_quantile plot looks way too smooth, and I don't understand why it is sloping down. Am I using it wrong?
ggplot(data = NULL, aes(x = X, y = y)) +
  geom_quantile(quantiles = c(0.5))
I realize that this problem probably has a LOT of workarounds, but if possible I would love to use geom_smooth and just provide an argument that tells it to use a median. I want geom_smooth for a side-by-side comparison with consistency. I want to put the mean and median geom_smooths side-by-side to show "hey look, super strong pattern between Y and X is driven by a few large outliers, if we look only at median the pattern disappears".
Thanks!!
You can create your own method to use in geom_smooth, as long as you have a fitting function that returns an object for which the predict generic works: given a data frame with a column called x, it should translate it into appropriate values of y.
As an example, let's create a simple model that interpolates along a running median. We wrap it in its own class and give it its own predict method:
rolling_median <- function(formula, data, n_roll = 11, ...) {
  x <- data$x[order(data$x)]
  y <- data$y[order(data$x)]
  y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
  structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}

predict.rollmed <- function(mod, newdata, ...) {
  setNames(mod$f(newdata$x), newdata$x)
}
Now we can use our method in geom_smooth:
ggplot(data = NULL, aes(x = X, y = y)) +
  geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now of course this doesn't look very "flat", but it is way flatter than the line calculated by the loess method of the standard geom_smooth():
ggplot(data = NULL, aes(x = X, y = y)) +
  geom_smooth(formula = y ~ x, color = "red", se = FALSE) +
  geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now, I understand that this is not the same thing as "regressing on the median", so you may wish to explore different methods, but if you want to get geom_smooth to plot them, this is how you can go about it. Note that if you want standard errors, you will need to have your predict function return a list with members called fit and se.fit.
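A hedged sketch of what such a predict method could look like (my addition, not from the original answer): I use a rolling MAD purely as an illustrative spread estimate, and in the ggplot2 versions I have looked at, the fit member is expected to carry fit/lwr/upr columns the way predict.lm() returns them with a confidence interval, so treat the exact structure as version-dependent:
rolling_median_se <- function(formula, data, n_roll = 11, ...) {
  x <- data$x[order(data$x)]
  y <- data$y[order(data$x)]
  med <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
  # Rough, illustrative spread estimate: a rolling MAD scaled like a standard error
  se <- zoo::rollapply(y, n_roll, stats::mad, fill = NA) / sqrt(n_roll)
  structure(list(f = approxfun(x, med), s = approxfun(x, se)), class = "rollmed_se")
}

predict.rollmed_se <- function(mod, newdata, ...) {
  fit <- mod$f(newdata$x)
  se <- mod$s(newdata$x)
  # fit as a three-column matrix (fit, lwr, upr) plus se.fit, mirroring predict.lm()
  list(fit = cbind(fit = fit, lwr = fit - 2 * se, upr = fit + 2 * se), se.fit = se)
}

# Hypothetical usage, with the ribbon built from the columns above:
# ggplot(data = NULL, aes(x = X, y = y)) +
#   geom_smooth(formula = y ~ x, method = "rolling_median_se")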
Here's a modification of @Allan's answer that uses a fixed x window rather than a fixed number of points. This is useful for irregular time series and for series with multiple observations at the same time (x value). It uses a loop, so it's not very efficient and will be slow for larger data sets.
# running median with time window
library(dplyr)
library(ggplot2)
library(zoo)
# some irregular and skewed data
set.seed(1)
x <- seq(2000, 2020, length.out = 400) # normal time series, gives same result for both methods
x <- sort(rep(runif(40, min = 2000, max = 2020), 10)) # irregular and repeated time series
y <- exp(runif(length(x), min = -1, max = 3))
data <- data.frame(x = x, y = y)
# ggplot(data) + geom_point(aes(x = x, y = y))
# 2 year window
xwindow <- 2
nwindow <- xwindow * length(x) / 20 - 1
# rolling median
rolling_median <- function(formula, data, n_roll = 11, ...) {
  x <- data$x[order(data$x)]
  y <- data$y[order(data$x)]
  y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
  structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}

predict.rollmed <- function(mod, newdata, ...) {
  setNames(mod$f(newdata$x), newdata$x)
}
# rolling time window median
rolling_median2 <- function(formula, data, xwindow = 2, ...) {
  x <- data$x[order(data$x)]
  y <- data$y[order(data$x)]
  ys <- rep(NA, length(x))     # for the smoothed y values
  xs <- setdiff(unique(x), NA) # the unique x values
  i <- 1                       # for testing
  for (i in seq_along(xs)) {
    j <- xs[i] - xwindow / 2 < x & x < xs[i] + xwindow / 2 # x points in this window
    ys[x == xs[i]] <- median(y[j], na.rm = TRUE)           # y median over this window
  }
  y <- ys
  structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed2")
}

predict.rollmed2 <- function(mod, newdata, ...) {
  setNames(mod$f(newdata$x), newdata$x)
}
# plot smooth
ggplot(data) +
  geom_point(aes(x = x, y = y)) +
  geom_smooth(aes(x = x, y = y, colour = "nwindow"), formula = y ~ x,
              method = "rolling_median", se = FALSE, method.args = list(n_roll = nwindow)) +
  geom_smooth(aes(x = x, y = y, colour = "xwindow"), formula = y ~ x,
              method = "rolling_median2", se = FALSE, method.args = list(xwindow = xwindow))
Created on 2022-01-05 by the reprex package (v2.0.1)

Make ggplot with regression line and normal distribution overlay

I am trying to make a plot to show the intuition behind logistic (or probit) regression. How would I make a plot that looks something like this in ggplot?
(Wolf & Best, The Sage Handbook of Regression Analysis and Causal Inference, 2015, p. 155)
Actually, what I would rather do is have one single normal distribution displayed along the y axis, with mean 0 and a specific variance, so that I can draw horizontal lines going from the linear predictor to the y axis and the sideways normal distribution. Something like this:
What this is supposed to show (assuming I haven't misunderstood something) is . I haven't had much success so far...
library(ggplot2)
x <- seq(1, 11, 1)
y <- x*0.5
x <- x - mean(x)
y <- y - mean(y)
df <- data.frame(x, y)
# Probability density function of the logistic distribution
pdfDeltaFun <- function(x) {
  prob <- exp(x) / (1 + exp(x))^2
  return(prob)
}

# Tried switching the x and y to be able to turn the
# distribution overlay 90 degrees with coord_flip()
ggplot(df, aes(x = y, y = x)) +
  geom_point() +
  geom_line() +
  stat_function(fun = pdfDeltaFun) +
  coord_flip()
I think this comes pretty close to the first illustration you give. If this is a thing you don't need to repeat many times, it is probably best to compute the density curves prior to plotting and use a separate dataframe to plot them.
library(ggplot2)
x <- seq(1, 11, 1)
y <- x*0.5
x <- x - mean(x)
y <- y - mean(y)
df <- data.frame(x, y)
# For every row in `df`, compute a rotated normal density centered at `y` and shifted by `x`
curves <- lapply(seq_len(NROW(df)), function(i) {
  mu <- df$y[i]
  range <- mu + c(-3, 3)
  seq <- seq(range[1], range[2], length.out = 100)
  data.frame(
    x = -1 * dnorm(seq, mean = mu) + df$x[i],
    y = seq,
    grp = i
  )
})
# Combine above densities in one data.frame
curves <- do.call(rbind, curves)

ggplot(df, aes(x, y)) +
  geom_point() +
  geom_line() +
  # The path draws the curve
  geom_path(data = curves, aes(group = grp)) +
  # The polygon does the shading. We can use `oob_squish()` to set a range.
  geom_polygon(data = curves, aes(y = scales::oob_squish(y, c(0, Inf)), group = grp))
The second illustration is pretty close to your code. I swapped your density function for the standard normal density function and added some extra parameters to stat_function():
library(ggplot2)
x <- seq(1, 11, 1)
y <- x*0.5
x <- x - mean(x)
y <- y - mean(y)
df <- data.frame(x, y)
ggplot(df, aes(x, y)) +
  geom_point() +
  geom_line() +
  stat_function(fun = dnorm,
                aes(x = after_stat(-y * 4 - 5), y = after_stat(x)),
                xlim = range(df$y)) +
  # We fill with a polygon, squishing the y-range
  stat_function(fun = dnorm, geom = "polygon",
                aes(x = after_stat(-y * 4 - 5),
                    y = after_stat(scales::oob_squish(x, c(-Inf, -1)))),
                xlim = range(df$y))

How to get a scatter plot of mixture data with different shape and colour for each distribution?

I am running a simulation of mixture data. My actual density is more complicated than a Gaussian, so here I have simplified my question to the Gaussian case. That is, suppose I simulate mixture data like this:
N = 2000
U = runif(N, min = 0, max = 1)
X = matrix(NA, nrow = N, ncol = 2)
for (i in 1:N) {
  if (U[i] < 0.7) {
    X[i, ] <- rnorm(1, 0.5, 1)
  } else {
    X[i, ] <- rnorm(1, 3, 5)
  }
}
How can I make a scatter plot with a different colour and shape (type of plotting point) for each cluster or distribution? I would like to do this manually, since my real function is hard and complex. I tried plot(X[,1], X[,2], col = c("red", "blue")) but it does not work.
I think this is what you want. Note that I had to do a bit of guesswork here to figure out what was going on, because your example code seems to have an error in it: you weren't generating different X1 and X2 values in each row:
N = 2000
U = runif(N, min = 0, max = 1)
X = matrix(NA, nrow = N, ncol = 2)
for (i in 1:N) {
  if (U[i] < 0.7) {
    # You had rnorm(n = 1, ...), which gives 2 identical values in each row.
    # Change that to 2 and you get different X1 and X2 values.
    X[i, ] <- rnorm(2, 0.5, 1)
  } else {
    X[i, ] <- rnorm(2, 3, 5)
  }
}

df = data.frame(
  source = ifelse(U < 0.7, "dist1", "dist2"),
  x = X[, 1],
  y = X[, 2]
)

library(ggplot2)
ggplot(df, aes(x = x, y = y, colour = source, shape = source)) +
  geom_point()
Result:
Here's what I got, but I'm not sure if this is what you are looking for - the locations of the observations for both clusters are exactly the same.
library(tidyverse)
df <- data.frame(X = X, U = U)
df <- gather(df, key = cluster, value = X, -U)
ggplot(df, aes(x = X, y = U, colour = cluster)) + geom_point() + facet_wrap(~cluster)
EDIT: I don't seem to be understanding what you are looking to map onto a scatter plot, so I'll indicate how you need to shape your data in order to create a chart like the above with the proper X and Y coordinates:
head(df)
           U cluster          X
1 0.98345408     X.1  2.3296047
2 0.33939935     X.1 -0.6042917
3 0.66715421     X.1 -2.2673422
4 0.06093674     X.1  2.4007376
5 0.48162959     X.1 -2.3118850
6 0.50780007     X.1 -0.7307929
So you want one variable for the Y coordinate (I'm using variable U here), one variable for the X coordinate (using X here), and a 3rd variable that indicates whether the observation belongs to cluster 1 or cluster 2 (variable cluster here).
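For concreteness, here is a small sketch (my addition, not part of the original answer) of one way to build a data frame in that shape from the simulated objects, labelling cluster membership with the same U < 0.7 rule used in the simulation; plot_df is just an illustrative name:
library(ggplot2)

plot_df <- data.frame(
  X = X[, 1],                                         # x coordinate
  U = U,                                              # y coordinate
  cluster = ifelse(U < 0.7, "cluster 1", "cluster 2") # which distribution the row came from
)
ggplot(plot_df, aes(x = X, y = U, colour = cluster, shape = cluster)) +
  geom_point()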

Different behaviour lm in stat_smooth

In this question someone asked if it is possible to change the colour in a ggplot2 plot depending on a linear regression line.
The proposed solution worked: the points have different colours above and below the regression line.
library(ggplot2)
set.seed(2015)
df <- data.frame(x = rnorm(100),
                 y = rnorm(100))
# Fit linear regression
l = lm(y ~ x, data = df)
# Make new group variable based on residuals
df$group = NA
df$group[which(l$residuals >= 0)] = "above"
df$group[which(l$residuals < 0)] = "below"
# Make the plot
ggplot(df, aes(x, y)) +
  geom_point(aes(colour = group)) +
  geom_smooth(method = "lm", formula = y ~ x)
But I would like to do the regression for y - 1, as asked in this question.
# Fit linear regression
l = lm(y - 1 ~ x, data = df)
# Make new group variable based on residuals
df$group = NA
df$group[which(l$residuals >= 0)] = "above"
df$group[which(l$residuals < 0)] = "below"
# Make the plot
ggplot(df, aes(x, y)) +
  geom_point(aes(colour = group)) +
  geom_smooth(method = "lm", formula = y - 1 ~ x)
This is not what I expected. It looks to me that stat_smooth did what I expected; lm, however, gives the same result for y ~ x and y - 1 ~ x.
What am I missing here?
If you want to color points based on where they lie relative to the line, you can try comparing the actual value to the predicted value rather than using the residual:
df$group = NA
df$group[df$y>predict(l)] = "above"
df$group[df$y<predict(l)] = "below"
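A minimal sketch (my addition) putting that together with the plot from the question, so the colouring is judged against the same shifted line that stat_smooth draws for formula = y - 1 ~ x:
l <- lm(y - 1 ~ x, data = df)
df$group <- ifelse(df$y > predict(l), "above", "below")  # above/below the drawn line

ggplot(df, aes(x, y)) +
  geom_point(aes(colour = group)) +
  geom_smooth(method = "lm", formula = y - 1 ~ x)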
