Related
I am building a plot with ggplot. I have data where y is mostly independent of X, but I randomly have a few extreme values of Y at low values of X. Like this:
set.seed(1)
X <- rnorm(500, mean=5)
y <- rnorm(500)
y[X < 3] <- sample(c(0, 1000), size=length(y[X < 3]),prob=c(0.9, 0.1),
replace=TRUE)
I want to make the point that the MEDIAN y-value is still constant over X values. I can see that this is basically true here:
mean(y[X < 3])
median(y[X < 3])
If I make a geom_smooth() plot, it does mean, and is very affected by outliers:
ggplot(data=NULL, aes(x=X, y=y)) + geom_smooth()
I have a few potential fixes. For example, I could first use group_by/summarize to make a dataset of binned medians and then plot that. I would rather NOT do this because in my real data I have a lot of facetting and grouping variables, and it would be a lot to keep track of (non-ideal). A lot plot definitely looks better, but log does not have nice interpretation in my application (median does have nice interpretation)
ggplot(data=NULL, aes(x=X, y=y)) + geom_smooth() +
scale_y_log10()
Finally, I know about geom_quantile but I think I'm using it wrong. Is there a way to add an error bar? Also- this geom_quantile plot looks way too smooth, and I don't understand why it is sloping down. Am I using it wrong?
ggplot(data=NULL, aes(x=X, y=y)) +
geom_quantile(quantiles=c(0.5))
I realize that this problem probably has a LOT of workarounds, but if possible I would love to use geom_smooth and just provide an argument that tells it to use a median. I want geom_smooth for a side-by-side comparison with consistency. I want to put the mean and median geom_smooths side-by-side to show "hey look, super strong pattern between Y and X is driven by a few large outliers, if we look only at median the pattern disappears".
Thanks!!
You can create your own method to use in geom_smooth. As long as you have a function that produces an object on which the predict generic works to take a data frame with a column called x and translate into appropriate values of y.
As an example, let's create a simple model that interpolates along a running median. We wrap it in its own class and give it its own predict method:
rolling_median <- function(formula, data, n_roll = 11, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
Now we can use our method in geom_smooth:
ggplot(data = NULL, aes(x = X, y = y)) +
geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now of course, this doesn't look very "flat", but it is way flatter than the line calculated by the loess method of the standard geom_smooth() :
ggplot(data = NULL, aes(x = X, y = y)) +
geom_smooth(formula = y ~ x, color = "red", se = FALSE) +
geom_smooth(formula = y ~ x, method = "rolling_median", se = FALSE)
Now, I understand that this is not the same thing as "regressing on the median", so you may wish to explore different methods, but if you want to get geom_smooth to plot them, this is how you can go about it. Note that if you want standard errors, you will need to have your predict function return a list with members called fit and se.fit
Here's a modification of #Allan's answer that uses a fixed x window rather than a fixed number of points. This is useful for irregular time series and series with multiple observations at the same time (x value). It uses a loop so it's not very efficient and will be slow for larger data sets.
# running median with time window
library(dplyr)
library(ggplot2)
library(zoo)
# some irregular and skewed data
set.seed(1)
x <- seq(2000, 2020, length.out = 400) # normal time series, gives same result for both methods
x <- sort(rep(runif(40, min = 2000, max = 2020), 10)) # irregular and repeated time series
y <- exp(runif(length(x), min = -1, max = 3))
data <- data.frame(x = x, y = y)
# ggplot(data) + geom_point(aes(x = x, y = y))
# 2 year window
xwindow <- 2
nwindow <- xwindow * length(x) / 20 - 1
# rolling median
rolling_median <- function(formula, data, n_roll = 11, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
y <- zoo::rollmedian(y, n_roll, na.pad = TRUE)
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed")
}
predict.rollmed <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
# rolling time window median
rolling_median2 <- function(formula, data, xwindow = 2, ...) {
x <- data$x[order(data$x)]
y <- data$y[order(data$x)]
ys <- rep(NA, length(x)) # for the smoothed y values
xs <- setdiff(unique(x), NA) # the unique x values
i <- 1 # for testing
for (i in seq_along(xs)){
j <- xs[i] - xwindow/2 < x & x < xs[i] + xwindow/2 # x points in this window
ys[x == xs[i]] <- median(y[j], na.rm = TRUE) # y median over this window
}
y <- ys
structure(list(x = x, y = y, f = approxfun(x, y)), class = "rollmed2")
}
predict.rollmed2 <- function(mod, newdata, ...) {
setNames(mod$f(newdata$x), newdata$x)
}
# plot smooth
ggplot(data) +
geom_point(aes(x = x, y = y)) +
geom_smooth(aes(x = x, y = y, colour = "nwindow"), formula = y ~ x, method = "rolling_median", se = FALSE, method.args = list(n_roll = nwindow)) +
geom_smooth(aes(x = x, y = y, colour = "xwindow"), formula = y ~ x, method = "rolling_median2", se = FALSE, method.args = list(xwindow = xwindow))
Created on 2022-01-05 by the reprex package (v2.0.1)
I have got two vectors and a 2D-matrix, from which I want to create a 3D surface plot. I already have split my data into X and Y (vectors (time "t" and wavelength "w") and Z (matrix; absorbance "NIR" at time and wavelength) with the same number of rows/columns respectively:
t = matrix(1:456, ncol= 1)
w = matrix(1350:1650, nrow = 1)
NIR = as.matrix(read.table("NIR_alle_pur.txt", header = TRUE, dec =","))
colnames(NIR) = c(paste0("NIR.", 1350:1650))
dim(NIR)
# [1] 456 301
dput(NIR_example)
structure(c(60771.93, 57230.56, 56235.96, 41617.47, 41709.93,
57466.6, 59916.97, 63376.4, 41966.73, 41254.34, 65535, 61468.76,
65535, 41238.03, 42530.97, 56936.03, 65009.4, 65535, 40375.5,
41021.6, 62757, 65455.44, 63795.6, 41349.6, 41178.2), .Dim = c(5L,
5L), .Dimnames = list(NULL, c("NIR.Spectrum_1350.0000000", "NIR.Spectrum_1351.0000000",
"NIR.Spectrum_1352.0000000", "NIR.Spectrum_1353.0000000", "NIR.Spectrum_1354.0000000"
)))
I tried to insert those into the rgl.surface function, but I get the following error message:
Error in rgl.surface(x, y, z, coords = 1:3) : Bad dimension for rows
I've also tried to plot them with plotly, but my success was equally low.
Can someone give me an input how I can get my spectral data to look like the last ones (multiple surfaces) on this site, individually? I'll try the overlay of the surfaces with plotlylater on!
I am happy for every extra input and information on my level!
Thank you!
After looking at the source code, I'd guess the problem is that you stored your x and y vectors as matrices. If they are matrices, they need to be identical in shape to z.
As I mentioned in a comment, you should avoid using rgl.surface (and the other rgl.* functions in most cases), and use surface3d instead, or persp3d if you want axes.
The *3d functions are higher level functions that act more like other R functions, and they will lead to fewer problems in the long run.
You haven't posted any data, so I'll post a completely artificial example. Let's suppose z = x^2 + y^2 + a, where a is a different constant for each surface. Then you can plot it like this:
x <- seq(-2, 2, length = 7)
y <- seq(-3, 3, length = 5) # I've chosen different ranges
# and lengths just to illustrate.
z <- outer(x, y, function(x, y) x^2 + y^2)
colours <- heat.colors(100)
minval <- min(z)
maxval <- max(z) + 10
col <- colours[(z - minval)/(maxval - minval)*99 + 1]
persp3d(x, y, z, col = col) # get axes the first time
z <- outer(x, y, function(x, y) x^2 + y^2 + 5)
col <- colours[(z - minval)/(maxval - minval)*99 + 1]
surface3d(x, y, z, col = col)
z <- outer(x, y, function(x, y) x^2 + y^2 + 10)
col <- colours[(z - minval)/(maxval - minval)*99 + 1]
surface3d(x, y, z, col = col)
aspect3d(1, 1, 1) # Make axes all equal
That produces this plot:
I have some data generated using the following lines of code,
x <- c(1:10)
y <- x^3
z <- y-20
s <- z/3
t <- s*6
q <- s*y
x1 <- cbind(x,y,z,s,t,q)
x1 <- data.frame(x1)
I would like to plot x versus y,s, and t so I melt the data frame x1 first,
library(reshape2)
xm <- melt(x1, id=names(x1)[1], measure=names(x1)[c(2, 4, 5)], variable = "cols"`)
Then I plot them along with their linear fits using the following code,
library(ggplot2)
plt <- ggplot(xm, aes(x = x, y = value, color = cols)) +
geom_point(size = 3) +
labs(x = "x", y = "y") +
geom_smooth(method = "lm", se = FALSE)
plt
The plot which is generated is shown below,
Now I would liked to interpolate the x-intercept of the linear fit. The point in the plot where y axis value is 0.
The following lines of code as shown here, extracts the slope and y-intercept.
fits <- by(xm[-2], xm$cols, function(i) coef(lm(value ~ x, i)))
data.frame(cols = names(fits), do.call(rbind, fits))
Is there any way how I can extract the x-intercept other than manually calculating from the slope and y-intercept?
Thanks for the help!
You could do inverse prediction as implemented in package chemCal for calibrations if you don't want to calculate this yourself:
library(chemCal)
res <- by(xm[-2], xm$cols, function(i) inverse.predict(lm(value ~ x, i), 0)$Prediction)
res[1:3]
#xm$cols
#y s t
#2.629981 2.819734 2.819734
Edit:
Maybe you prefer this:
library(plyr)
res <- ddply(xm, .(cols),
function(i) data.frame(xinter=inverse.predict(lm(value ~ x, i), 0)$Prediction))
# cols xinter
# 1 y 2.629981
# 2 s 2.819734
# 3 t 2.819734
I don't think you can avoid computing the linear equation, though of course you don't have to do it by hand (unless you want to). For example:
by(xm[-2], xm$cols, function(i) {
fit <- lm(value~x, i); print(fit); solve(coef(fit)[-1], -coef(fit)[1] )}
)
Call:
lm(formula = value ~ x, data = i)
Coefficients:
(Intercept) x
-277.2 105.4
Call:
lm(formula = value ~ x, data = i)
Coefficients:
(Intercept) x
-99.07 35.13
Call:
lm(formula = value ~ x, data = i)
Coefficients:
(Intercept) x
-594.4 210.8
xm$cols: y
[1] 2.629981
-----------------------------------------------------------------------------------------------------------------
xm$cols: s
[1] 2.819734
-----------------------------------------------------------------------------------------------------------------
xm$cols: t
[1] 2.819734
What was solved is basically -277.2 + 105.4*x = 0 for x -> 105.4*x = 277.2 (the solve-function call) -> x = 2.629981. Seems your lines 's' and 't' intersect the y=0 axis at the same spot. If I understood correctly, your problem isn't extrapolation since your x-range covers the intercept but instead interpolation.
Ps. I think your code was missing: require("reshape")
EDIT:
result <- c(by(xm[-2], xm$cols, function(i) { fit <- lm(value~x, i); print(fit); solve(coef(fit)[-1], -coef(fit)[1] )} )); print(result)
> print(result)
y s t
2.629981 2.819734 2.819734
I found a way to calculate the x-intercept, first create a data frame with the y-intercept and slope values,
par <- data.frame(cols = names(fits), do.call(rbind, fits))
Then rename column header names to accurately denote the values,
colnames(par)[2] <- "y_intercept"
colnames(par)[3] <- "slope"
# Calculate the x-intercept by using the formula -(y_intercept)/slope
x_incpt <- -par[2]/par[3]
colnames(x_incpt) <- "x_intercept"
Which gives the following result,
x_intercept
y 2.629981
s 2.819734
t 2.819734
I'm looking for a way to plot a nonlinear regression line on a data set where every value in my vector y is being stored multiple times, so I tried to use something like:
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(1,4,9,15,25,9,36,25,36,25)
reg4 <- lm( x ~ y + I(y^2) )
plot(x ~ y)
lines(y, predict(reg4), type="l", col="red", lwd=1)
this gives http://i.imgur.com/qSEVNdT.png
So my question is, is there a way to, let's say, use some sort of mean value for each y entry? Or well just make it a 'continous' line instead of something that branches of into multiple lines/returns to a lower y value at the points where there are multiple 'entries'.
In these cases, it is best to predict from the model over the range of the covariate. You do this for say 50 or 100 locations equally spaced over the range of x. Increasing or decreasing the number of locations to predict at as needed - more complex responses will need more locations etc. Doing this also solves the spaghetti plot issue as the newdata supplied will be in the order of x
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(1,4,9,15,25,9,36,25,36,25)
reg4 <- lm( x ~ y + I(y^2) )
## predictions
pred <- data.frame(y = seq(min(y), max(y), length = 100))
pred <- transform(pred, x = predict(reg4, newdata = pred))
## plot
plot(x ~ y)
lines(x ~ y, data = pred, type = "l", col = "red", lwd = 1)
The problem does not come from the ties in the data:
for a given value of y, there is only one forecast.
The problem is that the points are not sorted,
so that when you join them, you end up with a tangle of lines.
You can use order to reorder the points.
plot(
x ~ y,
xlab = "y", ylab = "x" # Confusing...
)
i <- order(y)
lines( y[i], predict(reg4)[i] )
imagine I have a 3 columns matrix
x, y, z
where z is a function of x and y.
I know how to plot a "scatter plot" of these points with
plot3d(x,y,z)
But if I want a surface instead I must use other commands such as surface3d
The problem is that it doesn't accept the same inputs as plot3d
it seems to need a matrix with
(nÂș elements of z) = (n of elements of x) * (n of elements of x)
How can I get this matrix?
I've tried with the command interp, as I do when I need to use contour plots.
How can I plot a surface directly from x,y,z without calculating this matrix?
If I had too many points this matrix would be too big.
cheers
If your x and y coords are not on a grid then you need to interpolate your x,y,z surface onto one. You can do this with kriging using any of the geostatistics packages (geoR, gstat, others) or simpler techniques such as inverse distance weighting.
I'm guessing the 'interp' function you mention is from the akima package. Note that the output matrix is independent of the size of your input points. You could have 10000 points in your input and interpolate that onto a 10x10 grid if you wanted. By default akima::interp does it onto a 40x40 grid:
require(akima)
require(rgl)
x = runif(1000)
y = runif(1000)
z = rnorm(1000)
s = interp(x,y,z)
> dim(s$z)
[1] 40 40
surface3d(s$x,s$y,s$z)
That'll look spiky and rubbish because its random data. Hopefully your data isnt!
You can use the function outer() to generate it.
Have a look at the demo for the function persp(), which is a base graphics function to draw perspective plots for surfaces.
Here is their first example:
x <- seq(-10, 10, length.out = 50)
y <- x
rotsinc <- function(x,y) {
sinc <- function(x) { y <- sin(x)/x ; y[is.na(y)] <- 1; y }
10 * sinc( sqrt(x^2+y^2) )
}
z <- outer(x, y, rotsinc)
persp(x, y, z)
The same applies to surface3d():
require(rgl)
surface3d(x, y, z)
You could look at using Lattice. In this example I have defined a grid over which I want to plot z~x,y. It looks something like this. Note that most of the code is just building a 3D shape that I plot using the wireframe function.
The variables "b" and "s" could be x or y.
require(lattice)
# begin generating my 3D shape
b <- seq(from=0, to=20,by=0.5)
s <- seq(from=0, to=20,by=0.5)
payoff <- expand.grid(b=b,s=s)
payoff$payoff <- payoff$b - payoff$s
payoff$payoff[payoff$payoff < -1] <- -1
# end generating my 3D shape
wireframe(payoff ~ s * b, payoff, shade = TRUE, aspect = c(1, 1),
light.source = c(10,10,10), main = "Study 1",
scales = list(z.ticks=5,arrows=FALSE, col="black", font=10, tck=0.5),
screen = list(z = 40, x = -75, y = 0))
rgl is great, but takes a bit of experimentation to get the axes right.
If you have a lot of points, why not take a random sample from them, and then plot the resulting surface. You can add several surfaces all based on samples from the same data to see if the process of sampling is horribly affecting your data.
So, here is a pretty horrible function but it does what I think you want it to do (but without the sampling). Given a matrix (x, y, z) where z is the heights it will plot both the points and also a surface. Limitations are that there can only be one z for each (x,y) pair. So planes which loop back over themselves will cause problems.
The plot_points = T will plot the individual points from which the surface is made - this is useful to check that the surface and the points actually meet up. The plot_contour = T will plot a 2d contour plot below the 3d visualization. Set colour to rainbow to give pretty colours, anything else will set it to grey, but then you can alter the function to give a custom palette. This does the trick for me anyway, but I'm sure that it can be tidied up and optimized. The verbose = T prints out a lot of output which I use to debug the function as and when it breaks.
plot_rgl_model_a <- function(fdata, plot_contour = T, plot_points = T,
verbose = F, colour = "rainbow", smoother = F){
## takes a model in long form, in the format
## 1st column x
## 2nd is y,
## 3rd is z (height)
## and draws an rgl model
## includes a contour plot below and plots the points in blue
## if these are set to TRUE
# note that x has to be ascending, followed by y
if (verbose) print(head(fdata))
fdata <- fdata[order(fdata[, 1], fdata[, 2]), ]
if (verbose) print(head(fdata))
##
require(reshape2)
require(rgl)
orig_names <- colnames(fdata)
colnames(fdata) <- c("x", "y", "z")
fdata <- as.data.frame(fdata)
## work out the min and max of x,y,z
xlimits <- c(min(fdata$x, na.rm = T), max(fdata$x, na.rm = T))
ylimits <- c(min(fdata$y, na.rm = T), max(fdata$y, na.rm = T))
zlimits <- c(min(fdata$z, na.rm = T), max(fdata$z, na.rm = T))
l <- list (x = xlimits, y = ylimits, z = zlimits)
xyz <- do.call(expand.grid, l)
if (verbose) print(xyz)
x_boundaries <- xyz$x
if (verbose) print(class(xyz$x))
y_boundaries <- xyz$y
if (verbose) print(class(xyz$y))
z_boundaries <- xyz$z
if (verbose) print(class(xyz$z))
if (verbose) print(paste(x_boundaries, y_boundaries, z_boundaries, sep = ";"))
# now turn fdata into a wide format for use with the rgl.surface
fdata[, 2] <- as.character(fdata[, 2])
fdata[, 3] <- as.character(fdata[, 3])
#if (verbose) print(class(fdata[, 2]))
wide_form <- dcast(fdata, y ~ x, value_var = "z")
if (verbose) print(head(wide_form))
wide_form_values <- as.matrix(wide_form[, 2:ncol(wide_form)])
if (verbose) print(wide_form_values)
x_values <- as.numeric(colnames(wide_form[2:ncol(wide_form)]))
y_values <- as.numeric(wide_form[, 1])
if (verbose) print(x_values)
if (verbose) print(y_values)
wide_form_values <- wide_form_values[order(y_values), order(x_values)]
wide_form_values <- as.numeric(wide_form_values)
x_values <- x_values[order(x_values)]
y_values <- y_values[order(y_values)]
if (verbose) print(x_values)
if (verbose) print(y_values)
if (verbose) print(dim(wide_form_values))
if (verbose) print(length(x_values))
if (verbose) print(length(y_values))
zlim <- range(wide_form_values)
if (verbose) print(zlim)
zlen <- zlim[2] - zlim[1] + 1
if (verbose) print(zlen)
if (colour == "rainbow"){
colourut <- rainbow(zlen, alpha = 0)
if (verbose) print(colourut)
col <- colourut[ wide_form_values - zlim[1] + 1]
# if (verbose) print(col)
} else {
col <- "grey"
if (verbose) print(table(col2))
}
open3d()
plot3d(x_boundaries, y_boundaries, z_boundaries,
box = T, col = "black", xlab = orig_names[1],
ylab = orig_names[2], zlab = orig_names[3])
rgl.surface(z = x_values, ## these are all different because
x = y_values, ## of the confusing way that
y = wide_form_values, ## rgl.surface works! - y is the height!
coords = c(2,3,1),
color = col,
alpha = 1.0,
lit = F,
smooth = smoother)
if (plot_points){
# plot points in red just to be on the safe side!
points3d(fdata, col = "blue")
}
if (plot_contour){
# plot the plane underneath
flat_matrix <- wide_form_values
if (verbose) print(flat_matrix)
y_intercept <- (zlim[2] - zlim[1]) * (-2/3) # put the flat matrix 1/2 the distance below the lower height
flat_matrix[which(flat_matrix != y_intercept)] <- y_intercept
if (verbose) print(flat_matrix)
rgl.surface(z = x_values, ## these are all different because
x = y_values, ## of the confusing way that
y = flat_matrix, ## rgl.surface works! - y is the height!
coords = c(2,3,1),
color = col,
alpha = 1.0,
smooth = smoother)
}
}
The add_rgl_model does the same job without the options, but overlays a surface onto the existing 3dplot.
add_rgl_model <- function(fdata){
## takes a model in long form, in the format
## 1st column x
## 2nd is y,
## 3rd is z (height)
## and draws an rgl model
##
# note that x has to be ascending, followed by y
print(head(fdata))
fdata <- fdata[order(fdata[, 1], fdata[, 2]), ]
print(head(fdata))
##
require(reshape2)
require(rgl)
orig_names <- colnames(fdata)
#print(head(fdata))
colnames(fdata) <- c("x", "y", "z")
fdata <- as.data.frame(fdata)
## work out the min and max of x,y,z
xlimits <- c(min(fdata$x, na.rm = T), max(fdata$x, na.rm = T))
ylimits <- c(min(fdata$y, na.rm = T), max(fdata$y, na.rm = T))
zlimits <- c(min(fdata$z, na.rm = T), max(fdata$z, na.rm = T))
l <- list (x = xlimits, y = ylimits, z = zlimits)
xyz <- do.call(expand.grid, l)
#print(xyz)
x_boundaries <- xyz$x
#print(class(xyz$x))
y_boundaries <- xyz$y
#print(class(xyz$y))
z_boundaries <- xyz$z
#print(class(xyz$z))
# now turn fdata into a wide format for use with the rgl.surface
fdata[, 2] <- as.character(fdata[, 2])
fdata[, 3] <- as.character(fdata[, 3])
#print(class(fdata[, 2]))
wide_form <- dcast(fdata, y ~ x, value_var = "z")
print(head(wide_form))
wide_form_values <- as.matrix(wide_form[, 2:ncol(wide_form)])
x_values <- as.numeric(colnames(wide_form[2:ncol(wide_form)]))
y_values <- as.numeric(wide_form[, 1])
print(x_values)
print(y_values)
wide_form_values <- wide_form_values[order(y_values), order(x_values)]
x_values <- x_values[order(x_values)]
y_values <- y_values[order(y_values)]
print(x_values)
print(y_values)
print(dim(wide_form_values))
print(length(x_values))
print(length(y_values))
rgl.surface(z = x_values, ## these are all different because
x = y_values, ## of the confusing way that
y = wide_form_values, ## rgl.surface works!
coords = c(2,3,1),
alpha = .8)
# plot points in red just to be on the safe side!
points3d(fdata, col = "red")
}
So my approach would be to, try to do it with all your data (I easily plot surfaces generated from ~15k points). If that doesn't work, take several smaller samples and plot them all at once using these functions.
Maybe is late now but following Spacedman, did you try duplicate="strip" or any other option?
x=runif(1000)
y=runif(1000)
z=rnorm(1000)
s=interp(x,y,z,duplicate="strip")
surface3d(s$x,s$y,s$z,color="blue")
points3d(s)