How to make a ternary contour plot representing a response surface in R

I want to make a triangle plot that represents the response surface for all possible combinations of the factors X, Y and Z, with the gradient area inside the triangle expressing the predicted distribution of the response variable Gi.
# Here are the data:
X <- rep(c(45,40,55,40,43,50,43,50,43,48), each = 3)
Y <- rep(c(15, 12,22,14,14,19,12,17,17,12 ), each = 3)
Z <- rep(c(15,22,12,12,19,14,14,17,12,17), each = 3)
Gi <- c(353,381,320,312,335,265,394,350,374,320,299,316,300,304,295,360,331,395,
351,280,342,299,303,279,374,364,419,306,290,315)
Ft <- data.frame (X, Y, Z)
# Fitted model:
require (compositions) # package "compositions"
model = lm(Gi ~ ilr (Ft) + I (ilr (Ft)^2) + I (ilr (Ft)^3) )
# Generate random compositional data of the factors X, Y, and Z
library(tmvtnorm)
library(dplyr) # for %>% and rename() used below
corMat <- var(Ft)
dt3 <- rtmvnorm (n=500, mean = c(45.2, 15.4, 15.4), sigma = corMat, lower = c(10,5,5), upper = c(80,60,60))
# Predict Gi using the model
pGi <- predict (model, list (Ft = dt3) )
pdt <- cbind (dt3, pGi) %>% as.data.frame() %>%
rename (X = V1, Y = V2, Z = V3)
With the model and predicted data, is it possible to show the estimated pGi as a gradient surface inside the triangle, to get output like the enclosed example? I have tried with ggtern below, but the resulting ternary plot is not what I want.
library(ggtern)
ggtern(data = pdt, aes(x = X, y = Y, z = Z, value = pGi)) +
stat_interpolate_tern(geom="polygon",
formula = value ~ x+y,
method = lm,
aes(fill = ..level..), expand = 1) +
scale_fill_gradient(low="green", high="blue") +
theme_gray () +
theme ( tern.axis.arrow.show = T)

Related

geom_smooth() with median instead of mean

I am building a plot with ggplot. I have data where y is mostly independent of X, but I randomly have a few extreme values of Y at low values of X. Like this:
set.seed(1)
X <- rnorm(500, mean=5)
y <- rnorm(500)
y[X < 3] <- sample(c(0, 1000), size=length(y[X < 3]),prob=c(0.9, 0.1),
replace=TRUE)
I want to make the point that the MEDIAN y-value is still constant over X values. I can see that this is basically true here:
mean(y[X < 3])
median(y[X < 3])
If I make a geom_smooth() plot, it does mean, and is very affected by outliers:
ggplot(data=NULL, aes(x=X, y=y)) + geom_smooth()
I have a few potential fixes. For example, I could first use group_by/summarize to make a dataset of binned medians and then plot that. I would rather NOT do this because in my real data I have a lot of faceting and grouping variables, and it would be a lot to keep track of (non-ideal). A log plot definitely looks better, but log does not have a nice interpretation in my application (median does have a nice interpretation):
ggplot(data=NULL, aes(x=X, y=y)) + geom_smooth() +
scale_y_log10()
Finally, I know about geom_quantile but I think I'm using it wrong. Is there a way to add an error bar? Also, this geom_quantile plot looks way too smooth, and I don't understand why it is sloping down. Am I using it wrong?
ggplot(data=NULL, aes(x=X, y=y)) +
geom_quantile(quantiles=c(0.5))
I realize that this problem probably has a LOT of workarounds, but if possible I would love to use geom_smooth and just provide an argument that tells it to use a median. I want geom_smooth for a side-by-side comparison with consistency. I want to put the mean and median geom_smooths side-by-side to show "hey look, super strong pattern between Y and X is driven by a few large outliers, if we look only at median the pattern disappears".
Thanks!!
You can create your own method to use in geom_smooth, as long as you have a function that produces an object on which the predict generic works: it must take a data frame with a column called x and translate it into appropriate values of y.
As an example, let's create a simple model that interpolates along a running median. We wrap it in its own class and give it its own predict method:
rolling_median <- function(formula, data, n_roll = 11, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
Now we can use our method in geom_smooth:
ggplot(data = NULL, aes(x = X, y = y)) +
geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now of course, this doesn't look very "flat", but it is way flatter than the line calculated by the loess method of the standard geom_smooth() :
ggplot(data = NULL, aes(x = X, y = y)) +
geom_smooth(formula = y ~ x, color = "red", se = FALSE) +
geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now, I understand that this is not the same thing as "regressing on the median", so you may wish to explore different methods, but if you want to get geom_smooth to plot them, this is how you can go about it. Note that if you want standard errors, you will need to have your predict function return a list with members called fit and se.fit.
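As a rough sketch of that last point, the variant below returns fit and se.fit from its predict method. It assumes geom_smooth calls predict() with se.fit = TRUE and a level, and reads fit as a three-column matrix (estimate, lower, upper); that is how I understand ggplot2's default handling, so check it against your installed version. The names window_median_se and predict.winmed are made up for this example, and the standard error uses the normal-theory approximation 1.2533 * sd / sqrt(n), which is only a crude guide for skewed data.

```r
# Windowed median with a rough standard error (illustrative only).
# Assumption: se(median) ~ 1.2533 * sd / sqrt(n), a normal-theory shortcut.
window_median_se <- function(formula, data, xwindow = 1, ...) {
  ord <- order(data$x)
  x <- data$x[ord]
  y <- data$y[ord]
  xs <- unique(x)
  med <- se <- numeric(length(xs))
  for (i in seq_along(xs)) {
    in_win <- abs(x - xs[i]) < xwindow / 2   # points inside this x window
    n <- sum(in_win)
    med[i] <- median(y[in_win])
    se[i] <- if (n > 1) 1.2533 * sd(y[in_win]) / sqrt(n) else NA_real_
  }
  structure(list(f = approxfun(xs, med), f_se = approxfun(xs, se)),
            class = "winmed")
}

predict.winmed <- function(object, newdata, se.fit = FALSE, level = 0.95, ...) {
  fit <- object$f(newdata$x)
  if (!se.fit) return(fit)
  se <- object$f_se(newdata$x)
  z <- qnorm(1 - (1 - level) / 2)            # normal quantile for the band
  list(fit = cbind(fit = fit, lwr = fit - z * se, upr = fit + z * se),
       se.fit = se)
}

# Intended use (not run here):
# ggplot(d, aes(x, y)) +
#   geom_smooth(method = "window_median_se", formula = y ~ x,
#               method.args = list(xwindow = 1))
```

The band is a plain normal interval around the windowed median; a bootstrap within each window would be a more defensible choice for heavy-tailed data.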
Here's a modification of @Allan's answer that uses a fixed x window rather than a fixed number of points. This is useful for irregular time series and series with multiple observations at the same time (x value). It uses a loop, so it's not very efficient and will be slow for larger data sets.
# running median with time window
library(dplyr)
library(ggplot2)
library(zoo)
# some irregular and skewed data
set.seed(1)
x <- seq(2000, 2020, length.out = 400) # normal time series, gives same result for both methods
x <- sort(rep(runif(40, min = 2000, max = 2020), 10)) # irregular and repeated time series
y <- exp(runif(length(x), min = -1, max = 3))
data <- data.frame(x = x, y = y)
# ggplot(data) + geom_point(aes(x = x, y = y))
# 2 year window
xwindow <- 2
nwindow <- xwindow * length(x) / 20 - 1
# rolling median
rolling_median <- function(formula, data, n_roll = 11, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
# rolling time window median
rolling_median2 <- function(formula, data, xwindow = 2, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
ys <- rep(NA, length(x)) # for the smoothed y values
xs <- setdiff(unique(x), NA) # the unique x values
i <- 1 # for testing
for (i in seq_along(xs)){
j <- xs[i] - xwindow/2 < x & x < xs[i] + xwindow/2 # x points in this window
ys[x == xs[i]] <- median(y[j], na.rm = TRUE) # y median over this window
}
y <- ys
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed2")
}
predict.rollmed2 <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
# plot smooth
ggplot(data) +
geom_point(aes(x = x, y = y)) +
geom_smooth(aes(x = x, y = y, colour = "nwindow"), formula = y ~ x, method = "rolling_median", se = FALSE, method.args = list(n_roll = nwindow)) +
geom_smooth(aes(x = x, y = y, colour = "xwindow"), formula = y ~ x, method = "rolling_median2", se = FALSE, method.args = list(xwindow = xwindow))
Created on 2022-01-05 by the reprex package (v2.0.1)

Standalone legend in ggpairs

How can I include a legend inside one of the empty panels of the following matrix plot?
I have color coded different regression lines in the plots. I need a legend based on color.
I believe this answer comes closest to answering my question, yet I do not know exactly how to modify my code to get a legend based on color for the different regression lines.
As for the background of the code, I am trying to study different robust and non-robust regression methods applied to multivariate data with and without outliers.
library(ggplot2)
library(GGally)
library(MASS)
library(robustbase)
## Just create data -- you can safely SKIP this function.
##
## Take in number of input variables (k), vector of ranges of k inputs
## ranges = c(min1, max1, min2, max2, ...) (must have 2k elements),
## parameters to create data (must be consistent with the number of
## input variables plus one), parameters are vector of linear
## coefficients (b) and random seed (seed), number of observations
## (n), vector of outliers (outliers)
##
## Return uncontaminated dataframe and contaminated dataframe
create_data <- function(k, ranges, b, seed = 6, n,
outliers = NULL) {
x <- NULL # x: matrix of input variables
for (i in 1:k) {
set.seed(seed^i)
## x <- cbind(x, runif(n, ranges[2*i-1], ranges[2*i]))
x <- cbind(x, rnorm(n, ranges[2*i-1], ranges[2*i]))
}
set.seed(seed - 2)
x_aug = cbind(rep(1, n), x)
y <- x_aug %*% b
y_mean = mean(y)
e <- rnorm(n, 0, 0.20 * y_mean) # rnorm x
y <- y + e
df <- data.frame(x = x, y = y)
len <- length(outliers)
n_rows <- len %/% (k+1)
if (!is.null(outliers)) {
outliers <- matrix(outliers, n_rows, k+1, byrow = TRUE)
df_contamin <- data.frame(x = rbind(x, outliers[,1:k]), y = c(y, outliers[,k+1]))
} else {
df_contamin <- df
}
dat <- list(df, df_contamin)
}
# plot different regression models (some are robust) for two types of
# data (one is contaminated with outliers)
plot_models <- function(data, mapping, data2) {
cb_palette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
## 1.grey, 2.light orange, 3.light blue, 4.green, 5.yellow, 6.blue, 7.red, 8.purple
plt <- ggplot(data = data, mapping = mapping) +
geom_point() +
theme_bw() +
geom_smooth(method = lm, formula = y ~ x, data = data2, color = cb_palette[3], se = FALSE) +
geom_smooth(method = lm, formula = y ~ x, color = cb_palette[7], se = FALSE) +
geom_smooth(method = rlm, formula = y ~ x, color = cb_palette[4], se = FALSE) +
geom_smooth(method = lmrob, formula = y ~ x, color = cb_palette[1], se = FALSE)
plt
}
# trim the upper and right panels of plots
trim_gg <- function(gg) {
n <- gg$nrow
gg$nrow <- gg$ncol <- n-1
v <- 1:n^2
gg$plots <- gg$plots[v > n & v%%n != 0]
gg$xAxisLabels <- gg$xAxisLabels[-n]
gg$yAxisLabels <- gg$yAxisLabels[-1]
gg
}
dat <- create_data(3, c(1, 10, 1, 10, 1, 10), c(5, 8, 6, 7), 6, 20, c(30, 30, 50, 400))
df <- dat[[1]]
df_contamin <- dat[[2]]
## Note that plot_models is called here
g <- ggpairs(df_contamin, columns = 1:4, lower = list(continuous = wrap(plot_models, data2 = df)), diag = list(continuous = "blankDiag"), upper = list(continuous = "blank")) #, legend = lgd)
gr <- trim_gg(g)
print(gr)
Created on 2019-10-09 by the reprex package (v0.3.0)
Sorry for the long code, but most probably only the plot_models function and the line where ggpairs is called need to be modified.
I want to get a legend in the blank upper half of the plots. It may be done by somehow tweaking the plot_models function, setting the mapping in ggpairs to color using ggplot2::aes_string, and using getPlot and putPlot of the GGally package. But I can't wrap my head around how to do it exactly.

How to fit exponential regression in R? (a.k.a. changing the base of the power)

I am fitting exponential regressions in R.
Actually I want to compare y = exp(ax+b) with y = 5^(ax+b).
# data
set.seed(1)
y <- c(3.5, 2.9, 2.97,4.58,6.18,7.11,9.50,9.81,10.17,10.53,
12.33,14.14,18, 22, 25, 39, 40, 55, 69, 72) + rnorm(20, 10, 1)
x <- 1:length(y)
df = data.frame(x = x, y = y)
predata = data.frame(x = 1:20)
# plot
plot(df, ylim = c(0,100), xlim = c(0,40))
# simple linear regression
fit_sr = lm(y~x, data = df)
pre_sr = predict(fit_sr, newdata = predata,
interval ='confidence',
level = 0.90)
lines(pre_sr[,1], col = "red")
# exponential regression 1
fit_er1 = lm(log(y, base = exp(1))~x, data = df)
pre_er1 = predict(fit_er1, newdata = predata,
interval ='confidence',
level = 0.90)
pre_er1 = exp(1)^pre_er1 # correctness
lines(pre_er1[,1], col = "dark green")
# exponential regression 2
fit_er2 = lm(log(y, base = 5) ~ x, data = df)
pre_er2 = predict(fit_er2, newdata = predata,
interval ='confidence',
level = 0.90)
pre_er2 = 5^pre_er2 # correctness
lines(pre_er2[,1], col = "blue")
I expect something like this (plot1), but exponential regressions 1 and 2 are exactly the same (plot2).
plot1
plot2
The two regressions should be different because the Y values are different.
Also, I am looking for how to make y = exp(ax+b) + c fitting in R.
Your code is correct, your theory is where the problem is. The models should be the same.
The easiest way is to think on the log scale, as you've done in your code. Starting with y = exp(ax + b) we get log(y) = ax + b, a linear model with log(y) as the response. With y = 5^(cx + d), we get log(y) = (cx + d) * log(5) = (c*log(5)) * x + (d*log(5)), also a linear model with log(y) as the response. The model fit and predictions will not be any different with a different base; you can transform the base-5 coefficients into the base-e coefficients by multiplying them by log(5): a = c*log(5) and b = d*log(5).
It's a bit like wanting to compare the linear models y = ax + b where x is measured in meters vs y = ax + b where x is measured in centimeters. The coefficients will change to accommodate the scale, but the fit isn't any different.
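A quick numerical check of this, as a minimal sketch on simulated data (the data and the names fit_e and fit_5 are made up here, not taken from the question): the base-e coefficients equal the base-5 coefficients times log(5), and the back-transformed fitted values agree.

```r
set.seed(1)
x <- 1:20
y <- exp(0.15 * x + 1) * exp(rnorm(20, sd = 0.1))  # simulated response

fit_e <- lm(log(y) ~ x)             # natural log response
fit_5 <- lm(log(y, base = 5) ~ x)   # base-5 log response

# Base-e coefficients are the base-5 coefficients scaled by log(5)
coef_match <- all.equal(unname(coef(fit_e)), unname(coef(fit_5)) * log(5))

# Back-transformed fitted values are the same curve
fitted_match <- all.equal(exp(fitted(fit_e)), 5^fitted(fit_5))

c(coef_match = isTRUE(coef_match), fitted_match = isTRUE(fitted_match))
```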
The first part is already answered by @gregor; the second part, "...I am looking for how to make y = exp(ax+b) + c fitting in R", can be done with nls:
fit_er3 <- nls(y ~ exp(a*x+b) + c, data = df, start=list(a=1,b=0,c=0))
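nls can be sensitive to starting values, so one option is to derive them from a log-linear fit instead of guessing. The sketch below does that on simulated data (df_sim, c0 and the true parameters are assumptions for illustration, not the question's data); taking c0 as 90% of the smallest y is just a crude way to keep y - c0 positive.

```r
set.seed(1)
x <- 1:20
y <- exp(0.2 * x + 0.5) + 10 + rnorm(20, sd = 0.3)  # simulated: a = 0.2, b = 0.5, c = 10
df_sim <- data.frame(x = x, y = y)

c0 <- 0.9 * min(df_sim$y)                  # crude guess for the offset c
lin <- lm(log(y - c0) ~ x, data = df_sim)  # log-linear fit gives rough a and b
start <- list(a = unname(coef(lin)[2]),
              b = unname(coef(lin)[1]),
              c = c0)

fit_sim <- nls(y ~ exp(a * x + b) + c, data = df_sim, start = start)
coef(fit_sim)
```

If nls still reports a singular gradient on real data, trying a different offset guess or algorithm = "port" with bounds are the usual next steps.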

How do I propagate the error of a linear regression when projecting from Y to X?

I'm trying to figure out how to propagate errors in the following case.
I am calibrating a machine with a couple of standards (a, b, c) with
accepted values x. My machine measures y for these standards, with a
certain error (standard deviation of 1 in this example).
Then I measure replicates of a sample, yielding ynew. Now I want to
convert these values to the accepted measurement scale (the x-axis).
To do this, I can of course do some linear algebra and convert the slope and
intercept that I got from my standard measurements to a reversed slope and
intercept as follows
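Writing the fitted calibration line as y = a + b x (my notation; it matches the with_itc_slp line in the code further down), the reversed relation is presumably:

```latex
x = -\frac{a}{b} + \frac{1}{b}\, y
```

so the reversed intercept is -a/b and the reversed slope is 1/b.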
This works nicely to convert the input values, but how do I get proper estimates of the errors?
In R, I've tried the following:
library(broom) # for tidy lm
library(ggplot2) # for plotting
library(dplyr) # to allow piping
# find confidence value
cv <- function(x, level = 95) {
qt(1 - ((100 - level) / 100) / 2, df = length(x) - 1) * sd(x) / sqrt(length(x))
}
# find confidence interval
ci <- function(x, level = 95) {
xbar <- mean(x)
xci <- cv(x, level = level)
c(fit = xbar, lwr = xbar - xci, upr = xbar + xci)
}
set.seed(1337)
# create fake data
dat <- data.frame(id = rep(letters[1:3], 20),
x = rep(c(1, 7, 10), 20)) %>%
mutate(y = rnorm(n(), -20 + 1.25 * x, 1))
# generate linear model
mod <- lm(y ~ x, dat)
# tidy
mod_aug <- augment(mod)
# these are the new samples that my machine measures
ynew <- rnorm(10, max(dat$y) + 3)
# predict new x-value based on y-value that is outside of range
## predict(mod, newdata = data.frame(y = ynew), interval = "predict")
# Error in eval(predvars, data, env) : object 'x' not found
# or tidy
## augment(mod, newdata = data.frame(y = ynew))
# 50 row df that doesn't make sense
# found this function that should do the job, but it doesn't extrapolate
## approx(x = mod$fitted.values, y = dat$x, xout = ynew)$y
# [1] NA NA NA NA NA NA NA NA NA NA
# this one from Hmisc does allow for extrapolation
with_approx <- Hmisc::approxExtrap(x = mod_aug$.fitted, y = mod_aug$x, xout = ynew)$y
# but in case of lm, isn't using the slope and intercept of a model okay too?
with_itc_slp <- (- coef(mod)[1] / coef(mod)[2]) + (1 / coef(mod)[2] * ynew)
# this would be the 95% prediction interval of the model at the average
# sample position. Could also use "confidence" but this is more correct?
avg_prediction <- predict(mod,
newdata = data.frame(x = mean(with_itc_slp)),
interval = "prediction")
# plot it
ggplot(dat, aes(x = x, y = y, col = id)) +
geom_point() +
geom_hline(yintercept = ynew, col = "gray") +
geom_smooth(aes(group = 1), method = "lm", se = F, fullrange = T,
col = "lightblue") +
geom_smooth(aes(group = 1), method = "lm") +
# 95% CI of the new sample
annotate("pointrange", x = 1, y = mean(ynew),
ymin = ci(ynew)[2], ymax = ci(ynew)[3], col = "green") +
# 95% prediction interval of the linear model at the average transformed
# x-position
annotate("pointrange", x = mean(with_approx), y = mean(ynew),
ymin = avg_prediction[2], ymax = avg_prediction[3], col = "green") +
# transformed using approx
annotate("point", x = with_approx, y = ynew, size = 3, col = "blue",
shape = 1) +
# transformed using intercept and slope
annotate("point", x = with_itc_slp, y = ynew, size = 3, col = "red",
shape = 2) +
# it's pretty
coord_fixed()
resulting in this plot:
Now how do I go from these 95% CIs in the y-direction to transformed sample
x-values with a confidence interval in the x-direction?
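One possible approach, sketched here under stated assumptions rather than as a definitive answer, is first-order (delta-method) error propagation: combine the variance of the sample mean of ynew with the covariance matrix of the fitted intercept and slope. It assumes the new measurements are independent of the calibration data and that a normal approximation is acceptable; the true x of 12 and all variable names below are made up for the simulation.

```r
set.seed(1337)
# Simulated calibration data, same design as in the question
cal <- data.frame(x = rep(c(1, 7, 10), 20))
cal$y <- rnorm(nrow(cal), -20 + 1.25 * cal$x, 1)
mod <- lm(y ~ x, data = cal)

# Replicates of a sample whose (normally unknown) true x is 12
ynew <- rnorm(10, mean = -20 + 1.25 * 12, sd = 1)

b <- coef(mod)    # b[1] = intercept, b[2] = slope
V <- vcov(mod)    # covariance matrix of (intercept, slope)
ybar <- mean(ynew)
var_ybar <- var(ynew) / length(ynew)

# Point estimate on the x scale
x0 <- unname((ybar - b[1]) / b[2])

# First-order variance of x0 = (ybar - b0) / b1
var_x0 <- (var_ybar + V[1, 1] + x0^2 * V[2, 2] + 2 * x0 * V[1, 2]) / b[2]^2
se_x0 <- sqrt(unname(var_x0))

# Approximate 95% confidence interval in the x direction
ci_x0 <- x0 + c(-1, 1) * qnorm(0.975) * se_x0
c(fit = x0, lwr = ci_x0[1], upr = ci_x0[2])
```

The approximation degrades when the slope is poorly determined relative to its size; Fieller-type intervals are the classical alternative in that situation.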

2 polynomial regressions in a ggplot() graph

This is my Dataset:
As you can see, there are two quantitative variables (X, Y) and 1 categorical variable (molar, with two factors: M1, M2).
I would like to represent in one single graph two polynomial regressions and their respective prediction intervals: one for the M1 factor and one for the M2 factor. Each polynomial regression has its own degree (M1 is a 4 degree polynomial regression, and M2 is a 6 degree).
I want to use ggplot() function (which is in package ggplot2 in R). I have actually performed this figure but with all data merged (I mean, with no distinction between factors). This is the code I used:
# Fit a linear model
m <- lm(Y ~ X+I(X^2)+I(X^3)+I(X^4), data = Dataset)
# cbind the predictions to Dataset
mpi <- cbind(Dataset, predict(m, interval = "prediction"))
ggplot(mpi, aes(x = X)) +
geom_ribbon(aes(ymin = lwr, ymax = upr),
fill = "blue", alpha = 0.2) +
geom_point(aes(y = Y)) +
geom_line(aes(y = fit), colour = "blue", size = 1)
With this result:
So, I would like to have two polynomial regressions of different degree (one for M1 and one for M2), each with its own prediction interval. What would the exact code be?
UPDATE - New code! I run this code with no success:
M1=subset(Dataset,Dataset$molar=="M1",select=X:Y)
M2=subset(Dataset,Dataset$molar=="M2",select=X:Y)
M1.R <- lm(Y ~ X +I(X^2)+I(X^3)+I(X^4),
data=subset(Dataset,Dataset$molar=="M1",select=X:Y))
M2.R <- lm(Y ~ X +I(X^2)+I(X^3)+I(X^4),
data=subset(Dataset,Dataset$molar=="M2",select=X:Y))
newdf <- data.frame(x = seq(0, 1, c(408,663)))
M1.P <- cbind(data=subset(Dataset,Dataset$molar=="M1",select=X:Y), predict(M1.R, interval = "prediction"))
M2.P <- cbind(data=subset(Dataset,Dataset$molar=="M2",select=X:Y), predict(M2.R, interval = "prediction"))
p = cbind(as.data.frame(rbind(M1.P, M2.P)), f = factor(rep(1:2, c(408,663)), x = rep(newdf$x, 2))
mdf = with(Dataset, data.frame(x = rep(x, 2), y = c(subset(Dataset,Dataset$molar=="M1",select=Y), subset(Dataset,Dataset$molar=="M2",select=Y),
f = factor(rep(1:2, c(408,663))))
ggplot(mdf, aes(x = x, y = y, colour = f)) + geom_point() +
geom_ribbon(data = p, aes(x = x, ymin = lwr, ymax = upr,
fill = f, y = NULL, colour = NULL),
alpha = 0.2) +
geom_line(data = p, aes(x = x, y = fit))
These are the messages I get now:
[98] WARNING: Warning in if (n < 0L) stop("wrong sign in 'by' argument") :
the condition has length > 1 and only the first element will be used
Warning in if (n > .Machine$integer.max) stop("'by' argument is much too small") :
the condition has length > 1 and only the first element will be used
Warning in 0L:n :
numerical expression has 2 elements: only the first used
Warning in if (by > 0) pmin(x, to) else pmax(x, to) :
the condition has length > 1 and only the first element will be used
[99] WARNING: Warning in predict.lm(M1.R, interval = "prediction") :
predictions on current data refer to _future_ responses
[100] WARNING: Warning in predict.lm(M2.R, interval = "prediction") :
predictions on current data refer to _future_ responses
[101] ERROR: <text>
I think I am closer but still can't see it. Help!
Here is one way. If you have more than two models/levels in the factor you should look into code that will work over the levels of the factor and fit the models that way.
Anyway, first some dummy data:
set.seed(100)
x <- runif(100)
y1 <- 2 + (0.3 * x) + (2.4 * x^2) + (-2.5 * x^3) + (3.4 * x^4) + rnorm(100)
y2 <- -1 + (0.3 * x) + (2.4 * x^2) + (-2.5 * x^3) + (3.4 * x^4) +
(-0.3 * x^5) + (2.4 * x^6) + rnorm(100)
df <- data.frame(x, y1, y2)
Fit our two models:
m1 <- lm(y1 ~ poly(x, 4), data = df)
m2 <- lm(y2 ~ poly(x, 6), data = df)
Now predict at some new locations x and stick the predictions together with x and f, a factor indexing the model, into a tidy format:
newdf <- data.frame(x = seq(0, 1, length = 100))
p1 <- predict(m1, newdata = newdf, interval = "prediction")
p2 <- predict(m2, newdata = newdf, interval = "prediction")
p <- cbind(as.data.frame(rbind(p1, p2)), f = factor(rep(1:2, each = 100)),
x = rep(newdf$x, 2))
Melt the original data into tidy form
mdf <- with(df, data.frame(x = rep(x, 2), y = c(y1, y2),
f = factor(rep(1:2, each = 100))))
Draw the plot, using colour to distinguish the models/data
ggplot(mdf, aes(x = x, y = y, colour = f)) +
geom_point() +
geom_ribbon(data = p, aes(x = x, ymin = lwr, ymax = upr,
fill = f, y = NULL, colour = NULL),
alpha = 0.2) +
geom_line(data = p, aes(x = x, y = fit))
This gets us
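For more than two levels, the same idea can be looped over the factor. The following is a minimal sketch on simulated data; the data frame dat, the degrees lookup and the helper fit_one are hypothetical names, with columns X, Y and molar chosen to mirror the question.

```r
set.seed(100)
dat <- data.frame(X = runif(200),
                  molar = rep(c("M1", "M2"), each = 100))
dat$Y <- ifelse(dat$molar == "M1", 2, -1) + 3 * dat$X^2 + rnorm(200, sd = 0.5)

degrees <- c(M1 = 4, M2 = 6)   # hypothetical polynomial degree per level

# Fit one level and predict on a grid spanning that level's X range
fit_one <- function(d, lev) {
  form <- as.formula(sprintf("Y ~ poly(X, %d)", degrees[[lev]]))
  m <- lm(form, data = d)
  grid <- data.frame(X = seq(min(d$X), max(d$X), length.out = 100))
  cbind(grid, predict(m, newdata = grid, interval = "prediction"), molar = lev)
}

pieces <- split(dat, dat$molar)
pred <- do.call(rbind, Map(fit_one, pieces, names(pieces)))

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  plt <- ggplot(dat, aes(x = X, y = Y, colour = molar)) +
    geom_point() +
    geom_ribbon(data = pred, aes(x = X, ymin = lwr, ymax = upr, fill = molar),
                inherit.aes = FALSE, alpha = 0.2) +
    geom_line(data = pred, aes(y = fit))
}
```

Adding a level then only means adding an entry to degrees.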
