Is it possible to draw an abline of a fit only in a certain range of x-values?
I have a dataset with a linear fit of a subset of that dataset:
# The dataset:
daten <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
# make a linear fit for the datapoints 3, 4, 5
daten_fit <- lm(formula = y~x, data = daten, subset = 3:5)
When I plot the data and draw a regression line:
plot (y ~ x, data = daten)
abline(reg = daten_fit)
The line is drawn for the full range of x-values in the original data. But, I want to draw the regression line only for the subset of data that was used for curve fitting. There were 2 ideas that came to my mind:
Draw a second line that is thicker, but is only shown in the range 3:5. I checked the parameters for abline, lines and segments but I could not find anything
Add small ticks to the respective positions, that are perpendicular to the abline. I have now idea how I could do this. this would be the nicer way of course.
Do you have any idea for a solution?
The answer is No, it is not possible to get abline() to draw the fitted line on only one part of the plot region where the model was fitted. This is because it uses only the model coefficients to draw the line, not predictions from the model. If you look closely, you'll see that the line draw actually extends outside the plot region, covering the plot frame where it exists the region.
The simplest solution to such problems is to predict from the model for the regions you want.
# The dataset:
daten <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
# make a linear fit for the datapoints 3, 4, 5
mod <- lm(y~x, data = daten, subset = 3:5)
First, we get the range of x values we want to differentiate:
xr <- with(daten, range(x[3:5]))
then we predict for a set of evenly spaced points on this range using the model:
pred <- data.frame(x = seq(from = xr[1], to = xr[2], length = 50))
pred <- transform(pred, yhat = predict(mod, newdata = pred))
Now plot the data and the model using abline():
plot(y ~ x, data = daten)
abline(mod)
then add in the region you want to emphasise:
lines(yhat ~ x, data = pred, col = "red", lwd = 2)
Which gives us this plot:
If you have a model that is more complex than that which can be handled by abline(), then we take a slightly different strategy, predicting over the range of the available, plotted data to draw the line, and then pick out the interval we want to highlight. The following code does that:
## range of all `x` data
xr2 <- with(daten, range(x))
## same as before
pred <- data.frame(x = seq(from = xr2[1], to = xr2[2], length = 100))
pred <- transform(pred, yhat = predict(mod, newdata = pred))
## plot the data and the fitted model line
plot(y ~ x, data = daten)
lines(yhat ~ x, data = pred)
## add emphasis to the interval used in fitting
with(pred, lines(yhat ~ x, data = pred, subset = x >= xr[1] & x <= xr[2],
lwd = 2, col = "red"))
What we do here is use the subset argument to pick out the values from the predictions that are in the interval used in fitting, the vector we pass to subset is a logical vector of TRUE and FALSE values indicating which data are in the region of interest and lines() only plots a line along those data.
R> head(with(pred, x >= xr[1] & x <= xr[2]))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
One might wonder why I have done predictions over 50 or 100 evenly spaced values of the predictor variable when we could, in this case, just have done a prediction for the start and the end of the data or region of interest and join the two points? Well, not all modelling exercises are that simple - you double log model from a previous question is a case in point - and the generic solution I outline above will work in all cases whereas simply joining two predictions won't.
#Andrie has furnished you with a solution to Idea 2.
One way would be to use colours to distinguish between points that are fitted and those that aren't:
daten_fit <- lm(formula = y~x, data = daten[3:5, ])
plot(y ~ x, data = daten)
points(y ~ x, data = daten[3:5, ], col="red")
abline(reg=daten_fit, col="red")
The second way is to plot the tick marks on the x-axis. These ticks are called rugs, and can be drawn using the rug function. But first you have to calculate the range:
#points(y ~ x, data = daten[3:5, ], col="red")
abline(reg=daten_fit, col="red")
rug(range(daten[3:5, 1]), lwd=3, col="red")
This is a somewhat basic plotting question -- use the ylim=c(low, high) option with suitable options for low and high.
You may want to read then An Introduction to R manual that came with your R version, and the other fine contributed documentation on the CRAN site.
Related
I am trying to export the underlying data from a LOESS plot (blue line)
I found this post on the subject and was able to get it to export like the post says:
Can I export the result from a loess regression out of R?
However, as the last comment from the poster in that post says, I am not getting the results for my LOESS line. Does anyone have any insights on how to get it to export properly?
Thanks!
Code for my export is here:
#loess object
CL111_loess <- loess(dur_cleaned~TS_LightOn, data = CL111)
#get SE
CL111_predict <- predict(CL111_loess, se=T)
CL111_ouput <- data.frame("fitted" = CL111_predict$fit, "SE"=CL111_predict$se.fit)
write.csv(CL111_ouput, "CL111_output.csv")
Data for the original plot is here:
Code for my original plot is here:
{r}
#individual plot
ggplot(data = CL111) + geom_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned), method = "lm", se = FALSE, colour = "Green") +
labs(x = "TS Light On (Seconsd)", y = "TS Response Time (Seconds)", title = "Layout 1, Condition AO, INS High") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_smooth(mapping = aes(x = TS_LightOn, y = dur_cleaned), method = "loess", se = TRUE) + xlim(0, 400) + ylim (0, 1.0)
#find coefficients for best fit line
lm(CL111_LM$dur_cleaned ~ CL111_LM$TS_LightOn)
You can get this information via ggplot_build().
If your plot is saved as gg1, run ggplot_build(gg1); then you have to examine the data object (which is a list of data for different layers) and try to figure out which layer you need (in this case, I looked for which data layer included a colour column that matched the smooth line ...
bb <- ggplot_build(gg1)
## extract the right component, just the x/y coordinates
out <- bb$data[[2]][,c("x","y")]
## check
plot(y~x, data = out)
You can do whatever you want with this output now (write.csv(), save(), saveRDS() ...)
I agree that there is something weird/that I don't understand about the way that ggplot2 is setting up the loess fit. You do have to do predict() with the right newdata (e.g. a data frame with a single column TS_LightOn that ranges from 0 to 400) - otherwise you get predictions of the points in your data set, which may not be properly spaced/in the right order - but that doesn't resolve the difference for me.
To complement #ben-bolker, I have just written a small function that may be useful for retrieving the internal dataset created by ggplot for a geom_smooth call. It takes the resultant ggplot as input and returns the smoothed data. The problem it solves is that, as Ben observed, internally ggplot creates a smoothed fit with predicted data on random intervals, different from the interval used for the input data. This function will get you back the ggplot fit data with an interval based on integer and equally spaced values. That function uses a loess fit on the already smoothed data, using a small value of span (0.1), that is adjusted upward on-the-fly to cope with small numbers of values.
This is useful if you used geom_smooth with a method that is not 'loess' or using 'NULL' and you cannot easily build a model that replicates what geom_smooth is doing internally.
The function separates different series on the same plot as well as series located on different facets. It also returns the 'ymin' and 'ymax' values.
Note that this function uses an interval based on integer values of x. You can modify this if you need an interval based on equally-spaced values of x, but not integral. In that case, pass your x interval of choice in the xInterval parameter, or tweak the line:
outOne <- data.frame(x=c(min(trunc(sub$x)):max(trunc(sub$x)))).
get_geom_smooth_dataFromPlot <- function (a_ggplot, xInterval=NULL) {
#internal ggplot values read in ggTable
ggTable <- ggplot_build(a_ggplot)$data[[1]]
#facet panels
panels <- as.numeric(names(table(ggTable$PANEL)))
nPanel <- length(panels)
onePanel <- (nPanel==1)
#number of series in each plot
groups <- as.numeric(names(table(ggTable$group)))
nGroup <- length(groups)
oneGroup <- (nGroup==1)
out <- data.frame()
#are there 'ymin' and 'ymax' values?
SE_data <- "ymin" %in% colnames(ggTable)
for (pan in (1:nPanel)) {
for (grp in (1:nGroup)) {
sub <- subset(ggTable, (PANEL==panels[pan])&(group==groups[grp]))
#no group series for this facet panel?
if (dim(sub)[1] == 0) next
if (is.null(xInterval)) {
outOne <- data.frame(x=c(min(trunc(sub$x)):max(trunc(sub$x))))
} else {
outOne <- data.frame(x=xInterval)
}
nObs <- dim(outOne)[1]
#hack to avoid problems with a small range for the x interval
# when there are more than 90 x values
# we use a span of 0.1, but
# we adjust on-the-fly up to a span of 0.5
# for 10 values of the x interval
cSpan <- max (0.1, 0.5 * 10 / (nObs-(nObs-10)/2))
if (!onePanel) outOne$panel <- pan
if (!oneGroup) outOne$group <- grp
mod <- loess(y~x, data=sub, span=cSpan)
outOne$y <- predict(mod, outOne$x, se=FALSE)
if (SE_data) {
mod <- loess(ymin~x, data=sub, span=cSpan)
outOne$ymin <- predict(mod, outOne$x, se=FALSE)
mod <- loess(ymax~x, data=sub, span=cSpan)
outOne$ymax <- predict(mod, outOne$x, se=FALSE)
}
out <- rbind(out, outOne)
}
}
return (out)
}
I am trying to plot a regression model for a data set with measurements for "mue" (friction coefficient for breaking train), speed of train and temp of train. I built a simple regression model using lm so I could test plotting with plot_ly. The plot attached shows the blue markers of the original data and the surface plotted doesn't look right. It should look more like a regression surface... I also plotted in 2D to make sure the regression actually works and it does. I've posted the code below and am wondering if anyone here has any advice. Been trying everything I can find online and none of it seems to be working. I think the issue might have to do with building a grid? I've tried that a few times, but I always get error messages for vectors not matching up, etc. I'd be happy to post that as well if needed. Thank you!
3D Regression Plot
2D Regression Plot
set.seed(123) # randum number generator
training.samples <- avg.frame$avg.mue %>%
createDataPartition(p = 0.8, list = FALSE) # pick 80 percent of data
train.data <- avg.frame[training.samples, ] # 80 percent is training data
test.data <- avg.frame[-training.samples, ] # 20 percent is test data
model_2 <- lm(avg.mue ~ avg.speed + avg.temp, data = train.data)
vals <- predict(model_2, train.data)
avg.mue <- matrix(vals, nrow = length(test.data$avg.speed), ncol = length(test.data$avg.temp))
plane <- avg.mue
p <- plot_ly(data = train.data, z = ~avg.mue, x = ~avg.speed, y = ~avg.temp, opacity = 0.6) %>%
add_markers()
p %>% add_surface(z = ~plane, x = ~avg.speed, y = ~avg.temp, showscale = FALSE) %>%
layout(showlegend = FALSE)
I am working with ggplot to plot bivariate data in groups along with standard ellipses of these data using a separate set of tools. These return n=100 x,y coordinates that define each ellipse, and then for each group, I would like to plot about 10-25 ellipses.
Conceptually, how can this be achieved? I can plot a single ellipse easily using geom_polygon, but I am confused how to get the data organized to make it work so multiple ellipses are plotted and guides (color, fills, linetypes, etc.) are applied per group.
In the traditional R plotting, I could just keep adding lines using a for loop.
Thanks!
UPDATE: Here is a CSV containing 100 coordinates for a single ellipse.
Data
Let's say I have three groups of bivariate data to which the ellipse fitting has been applied: Green, Red, Blue. For each group, I'd like to plot several ellipses.
I don't know how I would organize the data in such a way to work in the long format prefered by ggplot and preserve the group affiliations. Would a list work?
UPDATE2:
Here is a csv of raw x and y data organized into two groups: river and lake
Data
The data plot like this:
test.data <- read.csv("ellipse_test_data.csv")
ggplot(test.data) +
geom_point(aes(x, y, color = group)) +
theme_classic()
I am using a package called SIBER, which will fit Bayesian ellipses to the data for comparing groups by ellipse area, etc. The output of the following creates a list with number of elements = number of groups of data, and each element contains a 6 x n (n=number of draws) for each fitted ellipse - first four columns are a covariance matrix Sigma in vector format and the last two are the bivariate means:
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^5 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 100 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.test <- siberMVN(siber.test, parms, priors)
First few rows of the first element in the list:
$`1.river`
Sigma2[1,1] Sigma2[2,1] Sigma2[1,2] Sigma2[2,2] mu[1] mu[2]
[1,] 1.2882740 2.407070e-01 2.407070e-01 1.922637 -15.52846 12.14774
[2,] 1.0677979 -3.997169e-02 -3.997169e-02 2.448872 -15.49182 12.37709
[3,] 1.1440816 7.257331e-01 7.257331e-01 4.040416 -15.30151 12.14947
I would like to be able to extract a random number of these ellipses and plot them with ggplot using alpha transparency.
The package SIBER has a function (addEllipse) to convert the '6 x n' entries to a set number of x and y points that define an ellipse, but I don't know how to organize that output for ggplot. I thought there might be an elegant way to do with all internally with ggplot.
The ideal output would be something like this, but in ggplot so the ellipses could match the aesthetics of the levels of data:
some code to do this on the bundled demo dataset from SIBER.
In this example we try to create some plots of the multiple samples of the posterior ellipses using ggplot2.
library(SIBER)
library(ggplot2)
library(dplyr)
library(ellipse)
Fit a basic SIBER model to the example data bundled with the package.
# load in the included demonstration dataset
data("demo.siber.data")
#
# create the siber object
siber.example <- createSiberObject(demo.siber.data)
# Calculate summary statistics for each group: TA, SEA and SEAc
group.ML <- groupMetricsML(siber.example)
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^4 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 10 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.posterior <- siberMVN(siber.example, parms, priors)
# The posterior estimates of the ellipses for each group can be used to
# calculate the SEA.B for each group.
SEA.B <- siberEllipses(ellipses.posterior)
siberDensityPlot(SEA.B, xticklabels = colnames(group.ML),
xlab = c("Community | Group"),
ylab = expression("Standard Ellipse Area " ('\u2030' ^2) ),
bty = "L",
las = 1,
main = "SIBER ellipses on each group"
)
Now we want to create some plots of some sample ellipses from these distributions. We need to create a data.frame object of all the ellipses for each group. In this exmaple we simply take the frist 10 posterior draws assuming them to be independent of one another, but you could take a random sample if you prefer.
# how many of the posterior draws do you want?
n.posts <- 10
# decide how big an ellipse you want to draw
p.ell <- 0.95
# for a standard ellipse use
# p.ell <- pchisq(1,2)
# a list to store the results
all_ellipses <- list()
# loop over groups
for (i in 1:length(ellipses.posterior)){
# a dummy variable to build in the loop
ell <- NULL
post.id <- NULL
for ( j in 1:n.posts){
# covariance matrix
Sigma <- matrix(ellipses.posterior[[i]][j,1:4], 2, 2)
# mean
mu <- ellipses.posterior[[i]][j,5:6]
# ellipse points
out <- ellipse::ellipse(Sigma, centre = mu , level = p.ell)
ell <- rbind(ell, out)
post.id <- c(post.id, rep(j, nrow(out)))
}
ell <- as.data.frame(ell)
ell$rep <- post.id
all_ellipses[[i]] <- ell
}
ellipse_df <- bind_rows(all_ellipses, .id = "id")
# now we need the group and community names
# extract them from the ellipses.posterior list
group_comm_names <- names(ellipses.posterior)[as.numeric(ellipse_df$id)]
# split them and conver to a matrix, NB byrow = T
split_group_comm <- matrix(unlist(strsplit(group_comm_names, "[.]")),
nrow(ellipse_df), 2, byrow = TRUE)
ellipse_df$community <- split_group_comm[,1]
ellipse_df$group <- split_group_comm[,2]
ellipse_df <- dplyr::rename(ellipse_df, iso1 = x, iso2 = y)
Now to create the plots. First plot all the raw data as we want.
first.plot <- ggplot(data = demo.siber.data, aes(iso1, iso2)) +
geom_point(aes(color = factor(group):factor(community)), size = 2)+
ylab(expression(paste(delta^{15}, "N (\u2030)")))+
xlab(expression(paste(delta^{13}, "C (\u2030)"))) +
theme(text = element_text(size=15))
print(first.plot)
Now we can try to add the posterior ellipses on top and facet by group
second.plot <- first.plot + facet_wrap(~factor(group):factor(community))
print(second.plot)
# rename columns of ellipse_df to match the aesthetics
third.plot <- second.plot +
geom_polygon(data = ellipse_df,
mapping = aes(iso1, iso2,
group = rep,
color = factor(group):factor(community),
fill = NULL),
fill = NA,
alpha = 0.2)
print(third.plot)
Facet-wrapped plot of sample of posterior ellipses by group
I have two 3D plots, one was made using fitted values using library p3d (left side). In the second one I used predict and then interp command from package akima and plotted using persp (right side). The images are not showing the same angle, but it is the best way I found to show that one is flat and the other one has a curve.
I would like to know why the graph showing fitted values has a curve and the other one with predicted values doesn't have it.
First graph code:
library(p3d)
Init3d(family="serif", cex = 1)
Plot3d( TCL ~ reLDM+yr, nest6)
Axes3d()
fit = lm( TCL ~ reLDM+yr+I(yr^2)+I(reLDM*yr)+I(reLDM*yr^2), nest6)
Fit3d( fit )
Second graph code:
library(akima)
x <- nest6$reLDM
y <- nest6$yr
y2 <- y^2
z <- nest6$TCL
m <- lm(z ~ x*y+y2+x:y2)
i <- 25
xtemp <- seq(min(x),max(x),length.out=i)
xrange <- rep(xtemp,times=i)
ytemp <- seq(min(y),max(y),length.out=i)
yrange <- rep(ytemp,each=i)
y2temp <- seq(min(y2),max(y2),length.out=i)
y2range <- rep(y2temp,each=i)
newdata <- data.frame(x=xrange,y=yrange,y2=y2range)
zhat <- predict(m,newdata=newdata)
xyz <- interp(xrange,yrange,zhat)
jet.colors <- colorRampPalette( c("yellow", "red", "blue") )
nbcol <- 500
color <- jet.colors(nbcol)
nrz <- length(xyz[[1]])
ncz <- length(xyz[[2]])
z<-xyz[[3]]
zfacet <- z[-1, -1] + z[-1, -ncz] + z[-nrz, -1] + z[-nrz, -ncz]
facetcol <- cut(zfacet, nbcol)
quartz()
persp(xyz,xlab="x",ylab="y",zlab="z", cex.lab = 1,cex.axis = 1,
theta = 35, phi = 50,col=color[facetcol], border="grey40", ticktype = "detailed", zlim=c(1,7))
You will find the dataset in this link, it big so it was not possible to post it here: https://www.dropbox.com/s/czdascoq02alm46/TCL16_26.csv?dl=0
My model was done using lm(), I read in a post that there is no difference between fitted and predict functions in a simple linear regression model. However, in akima I am using the interp command wich I understand it estimates values between two known data points (basically fills the missing data gaps).
Another difference I found is that predicted values plot uses new data from the maximum and minimum range of values in the original data set. For fitted values there is one value per observation.
I have problems explaining this to my supervisor, he thinks it is not enough reason. What would be a better explanation why is the curve missing in the second graph?
Replace
y2temp <- seq(min(y2),max(y2),length.out=i)
with
y2temp <- ytemp^2
You will get a similar curve.
Using simulated data:
How do I plot the equivalent of contour (base R) with ggplot2? Below is an example with linear discriminant function analysis:
require(MASS)
iris.lda<-lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
datPred<-data.frame(Species=predict(iris.lda)$class,predict(iris.lda)$x) #create data.frame
#Base R plot
eqscplot(datPred[,2],datPred[,3],pch=as.double(datPred[,1]),col=as.double(datPred[,1])+1)
#Create decision boundaries
iris.lda2 <- lda(datPred[,2:3], datPred[,1])
x <- seq(min(datPred[,2]), max(datPred[,2]), length.out=30)
y <- seq(min(datPred[,3]), max(datPred[,3]), length.out=30)
Xcon <- matrix(c(rep(x,length(y)),
rep(y, rep(length(x), length(y)))),,2) #Set all possible pairs of x and y on a grid
iris.pr1 <- predict(iris.lda2, Xcon)$post[, c("setosa","versicolor")] %*% c(1,1) #posterior probabilities of a point belonging to each class
contour(x, y, matrix(iris.pr1, length(x), length(y)),
levels=0.5, add=T, lty=3,method="simple") #Plot contour lines in the base R plot
iris.pr2 <- predict(iris.lda2, Xcon)$post[, c("virginica","setosa")] %*% c(1,1)
contour(x, y, matrix(iris.pr2, length(x), length(y)),
levels=0.5, add=T, lty=3,method="simple")
#Eqivalent plot with ggplot2 but without decision boundaries
ggplot(datPred, aes(x=LD1, y=LD2, col=Species) ) +
geom_point(size = 3, aes(pch = Species))
It is not possible to use a matrix when plotting contour lines with ggplot. The matrix can be rearranged to a data-frame using melt. In the data-frame below the probability values from iris.pr1 are displayed in the first column along with the x and y coordinates in the following two columns. The x and y coordinates form a grid of 30 x 30 points.
df <- transform(melt(matrix(iris.pr1, length(x), length(y))), x=x[X1], y=y[X2])[,-c(1,2)]
I would like to plot the coordinates (preferably connected by a smoothed curve) where the posterior probabilities are 0.5 (i.e. the decision boundaries).
You can use geom_contour in ggplot to achieve a similar effect. As you correctly assumed, you do have to transform your data. I ended up just doing
pr<-data.frame(x=rep(x, length(y)), y=rep(y, each=length(x)),
z1=as.vector(iris.pr1), z2=as.vector(iris.pr2))
And then you can pass that data.frame to the geom_contour and specify you want the breaks at 0.5 with
ggplot(datPred, aes(x=LD1, y=LD2) ) +
geom_point(size = 3, aes(pch = Species, col=Species)) +
geom_contour(data=pr, aes(x=x, y=y, z=z1), breaks=c(0,.5)) +
geom_contour(data=pr, aes(x=x, y=y, z=z2), breaks=c(0,.5))
and that gives
The partimat function in the klaR library does what you want for observed predictors, but if you want the same for the LDA projections, you can build a data frame augmenting the original with the LD1...LDk projections, then call partimat with formula Group~LD1+...+LDk, method='lda' - then you see the "LD-plane" that you intended to see, nicely partitioned for you. This seemed easier to me, at least to explain to students newer to R, since I'm just reusing a function already provided in a way in which it wasn't quite intended.