ggplot() for multiple datasets on a linear regression - r

This is the dataset:
heartData <- structure(list(id = 1:6, biking = c(30.80124571, 65.12921517,
1.959664531, 44.80019562, 69.42845368, 54.40362555), smoking = c(10.89660802,
2.219563176, 17.58833051, 2.802558875, 15.9745046, 29.33317552
), heart.disease = c(11.76942278, 2.854081478, 17.17780348, 6.816646909,
4.062223522, 9.550045997)), row.names = c(NA, 6L), class = "data.frame")
Here I have used multiple linear regression as model.
model.1 <- lm( heart.disease ~ biking + smoking, data = heartData)
plotting.data is a synthesized data set for which I want to check the confidence interval as well as the prediction interval.
plotting.data <- expand.grid(
biking = seq(min(heartData$biking), max(heartData$biking), length.out = 5),
smoking = c(mean(heartData$smoking)))
plotting.data$predicted.y <- predict(model.1, newdata = plotting.data, interval = 'confidence')
plotting.data$smoking <- round(plotting.data$smoking, digits = 2)
plotting.data$smoking <- as.factor(plotting.data$smoking)
After running the block of code above, I can see that plotting.data has been created with 5 columns. However, when I run
colnames(plotting.data)
I get only 3 column names. plotting.data$predicted.y is a single column, and I can't access or rename plotting.data$predicted.y[,"fit"], plotting.data$predicted.y[,"upr"] or plotting.data$predicted.y[,"lwr"].
To plot results
heart.plot <- ggplot(data = heartData, aes(x = biking, y = heart.disease)) + geom_point() +
geom_line(data = plotting.data, aes(x = biking, y = predicted.y[,"fit"], color = "red"), size = 1.25) +
geom_ribbon(data = plotting.data, aes(ymin = predicted.y[,"lwr"], ymax = predicted.y[,"upr"]), alpha = 0.1)
heart.plot
I get the error:
Error in FUN(X[[i]], ...) : object 'heart.disease' not found
I don't know why I'm getting this error. From my own trial and error, I know that the following part of the code is causing it. However, I don't know how to write it in a better way.
geom_ribbon(data = plotting.data, aes(ymin = predicted.y[,"lwr"], ymax = predicted.y[,"upr"]), alpha = 0.1)

This happens because aesthetics named in the aes() call inside ggplot() are inherited by every layer, so each data set passed to a later geom is expected to contain those variables. Here geom_ribbon() inherits y = heart.disease, which does not exist in plotting.data. If you want to use multiple data sets that don't share the same variables, give each geom its own aes() call instead:
ggplot() +
geom_point(data = heartData, aes(x = biking, y = heart.disease)) +
geom_line(data = plotting.data, aes(x = biking, y = predicted.y[,"fit"]), color = "red", size = 1.25) +
geom_ribbon(data = plotting.data, aes(x = biking, ymin = predicted.y[,"lwr"], ymax = predicted.y[,"upr"]), alpha = 0.1)
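On the side question about column names: predict(..., interval = "confidence") returns a matrix with columns fit, lwr and upr, and assigning it to plotting.data$predicted.y stores the whole matrix as a single column. A minimal sketch, assuming the same heartData and model.1 as in the question, that binds the matrix with cbind() so that fit, lwr and upr become ordinary columns:

```r
library(ggplot2)

heartData <- data.frame(
  biking = c(30.80124571, 65.12921517, 1.959664531,
             44.80019562, 69.42845368, 54.40362555),
  smoking = c(10.89660802, 2.219563176, 17.58833051,
              2.802558875, 15.9745046, 29.33317552),
  heart.disease = c(11.76942278, 2.854081478, 17.17780348,
                    6.816646909, 4.062223522, 9.550045997)
)
model.1 <- lm(heart.disease ~ biking + smoking, data = heartData)

plotting.data <- expand.grid(
  biking = seq(min(heartData$biking), max(heartData$biking), length.out = 5),
  smoking = mean(heartData$smoking)
)

# predict() returns a 5 x 3 matrix (fit, lwr, upr);
# cbind() spreads it into three ordinary data frame columns.
plotting.data <- cbind(
  plotting.data,
  predict(model.1, newdata = plotting.data, interval = "confidence")
)
colnames(plotting.data)

# The columns can now be referenced by name in each layer's own aes()
heart.plot <- ggplot() +
  geom_point(data = heartData, aes(x = biking, y = heart.disease)) +
  geom_ribbon(data = plotting.data,
              aes(x = biking, ymin = lwr, ymax = upr), alpha = 0.1) +
  geom_line(data = plotting.data, aes(x = biking, y = fit), color = "red")
```

With real columns, the usual tools such as names()<- or dplyr::rename() work on fit, lwr and upr directly.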

Related

Different objects are not showing up on my ggplot2

I'm studying the returns to college admission for marginal students, and I'm trying to make a ggplot2 plot of the following data: the average salaries of students who did or did not finish their master's in medicine, and the average 'GPA' (foreign equivalent) distance to the 'acceptance score':
SalaryAfter <- c(287.780,305.181,323.468,339.082,344.738,370.475,373.257,
372.682,388.939,386.994)
DistanceGrades <- c("<=-1.0","[-0.9,-0.5]","[-0.4,-0.3]","-0,2","-0.1",
"0.0","0.1","[0.2,0.3]","[0.4,0.5]",">=0.5")
I have to do a Regression Discontinuity Design (RDD). To run the regression, as far as I understand it, I have to convert DistanceGrades to numeric, so I just created a variable z
z <- -5:4
where 0 is the cutoff (i.e. 0 corresponds to "0.0" in DistanceGrades).
I then make a dataframe
df <- data.frame(z,SalaryAfter)
Now my attempt to create the plot gets a bit messy (I use the package 'fpp3', but I suppose that only ggplot2 and maybe dplyr are actually involved)
df %>%
select(z, SalaryAfter) %>%
mutate(D = as.factor(ifelse(z >= -0.1, 1, 0))) %>%
ggplot(aes(x = z, y = SalaryAfter, color = D)) +
geom_point(stat = "identity") +
geom_smooth(method = "lm") +
geom_vline(xintercept = 0) +
theme(panel.grid = element_line(color = "white",
size = 0.75,
linetype = 1)) +
xlim(-6,5) +
xlab("Distance to acceptance score") +
labs(title = "Figur 1.1", subtitle = "Salary for every distance to the acceptance score")
Which plots:
What I'm trying to do is, first, split the data with a dummy variable: D=1 if z>0 and D=0 if z<0. Then I plot it with a linear regression and a vertical line at z=0. Lastly, I write the title and subtitle. Now I have two problems:
1. The x axis is displaying -5, -2.5, ... but I would like it to show all the integers; the non-integer values have no relation to the z variable, which is discrete. I have tried to fix this with several different methods, but none of them have worked. I can't remember everything I have tried (theme(panel.grid...), scale_x_discrete and many more), but the outcomes have all been pretty similar: the x-axis is completely removed so that there are no numbers, and sometimes even the axis title is removed.
2. I would like the regression channel for the first part of the data to extend to z=0.
When I try to solve both of these problems I again get similar results. Most of the things I try do not produce an error message when I run the code, but they either do nothing to my plot or they remove some of the existing elements, which leaves me full of questions. I suppose that the problem is caused by some of the elements not working together, but I have no idea.
Try this:
library(tidyverse)
SalaryAfter <- c(287.780,305.181,323.468,339.082,344.738,370.475,373.257,
372.682,388.939,386.994)
DistanceGrades <- c("<=-1.0","[-0.9,-0.5]","[-0.4,-0.3]","-0,2","-0.1",
"0.0","0.1","[0.2,0.3]","[0.4,0.5]",">=0.5")
z <- -5:4
df <- data.frame(z,SalaryAfter) %>%
select(z, SalaryAfter) %>%
mutate(D = as.factor(ifelse(z >= -0.1, 1, 0)))
# Fit a lm model for the left part of the panel
fit_data <- lm(SalaryAfter~z, data = filter(df, z <= -0.1)) %>%
predict(., newdata = data.frame(z = seq(-5, 0, 0.1)), interval = "confidence") %>%
as.data.frame() %>%
mutate(z = seq(-5, 0, 0.1), D = factor(0, levels = c(0, 1)))
# Plot
ggplot(mapping = aes(color = D)) +
geom_ribbon(data = filter(fit_data, z <= 0 & -1 <= z),
aes(x = z, ymin = lwr, ymax = upr),
fill = "grey70", color = "transparent", alpha = 0.5) +
geom_line(data = fit_data, aes(x = z, y = fit), size = 1) +
geom_point(data = df, aes(x = z, y = SalaryAfter), stat = "identity") +
geom_smooth(data = df, aes(x = z, y = SalaryAfter), method = "lm") +
geom_vline(xintercept = 0) +
theme(panel.grid = element_line(color = "white",
size = 0.75,
linetype = 1)) +
scale_x_continuous(limits = c(-6, 5), breaks = -6:5) +
xlab("Distance to acceptance score") +
labs(title = "Figure 1.1", subtitle = "Salary for every distance to the acceptance score")
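The part of the code above that addresses the first problem is the breaks argument of scale_x_continuous(). z is numeric, so it belongs on a continuous scale; scale_x_discrete() is meant for factor or character data. A minimal sketch isolating that step, reusing the z and SalaryAfter values from the question (the name int_breaks is only illustrative):

```r
library(ggplot2)

df <- data.frame(
  z = -5:4,
  SalaryAfter = c(287.780, 305.181, 323.468, 339.082, 344.738,
                  370.475, 373.257, 372.682, 388.939, 386.994)
)

# One break per integer value of z
int_breaks <- seq(min(df$z), max(df$z), by = 1)

p <- ggplot(df, aes(x = z, y = SalaryAfter)) +
  geom_point() +
  geom_vline(xintercept = 0) +
  scale_x_continuous(breaks = int_breaks, limits = c(-6, 5)) +
  xlab("Distance to acceptance score")
```

Setting limits inside scale_x_continuous() rather than adding a separate xlim() keeps a single x scale on the plot, so the breaks are not overwritten by a later scale.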

Using geom_polygon or geom_rect to plot multiple x and y errors

I have a dataset in which each point's error is described by x1 (age min), x2 (age max), y1 (height min) and y2 (height max), which together make a trapezium shape like the one in this plot.
I want to do the same and plot these as errors and then have the gaussian process mean and error from a different model showing. To plot the errors as trapezium shapes I think I can do this using geom_polygon but I can't work out how to get the polygons to plot. It looks like you have to manually specify all of the coordinates see https://ggplot2.tidyverse.org/reference/geom_polygon.html . This seems extremely time-consuming to do for over 20 data points. Does anyone know of a more concise way to do this?
N.B. I have flipped the coordinates for the plot - this can be a bit confusing
Thanks,
library(ggplot2)
library(tidypaleo) # provides geom_lineh()
### Create graph
ggplot(WAPRSL, aes(x =RSLc, y = Age))+
labs(x = "RSL (m)",y="Age (AD)")+
theme_classic()+
geom_lineh(data = WAPRSLgp, aes(x=mean,y=Age),col="#227988")+
coord_flip()+
geom_ribbon(data = WAPRSLgp, aes(x=mean, xmax=mean+error, xmin=mean-error), fill="#227988",alpha=.5)+
geom_ribbon(data = WAPRSLgp, aes(x=mean, xmax=mean+error*2, xmin=mean-error*2), fill="#227988",alpha=.7)+
geom_polygon(data=WAPRSL, aes(c(x1,x2,x2,x1),c(y1,y1,y2,y2))) ### something like this?
current plot without polygons
Data
### WAPRSL data
WAPRSL <- structure(list(depths = c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5), RSL = c(0.162319931,
0.170053941, 0.166157744, 0.268604159, 0.173369111, 0.207652794
), RSLerror = c(0.084355046, 0.084524909, 0.084307832, 0.084389419,
0.0838797, 0.083901714), Age = c(2017.393323, 2015.935137, 2013.065412,
2008.534508, 2004.853771, 2001.797776), Ageerror = c(0.183297248,
0.303357588, 0.566892665, 1.183257304, 2.427930603, 2.481236284
), RSLc = c(0.162319931, 0.16973314, 0.165205604, 0.26665522,
0.17061041, 0.204221774), y1 = c(2017.210026, 2015.631779, 2012.498519,
2007.351251, 2002.42584, 1999.31654), y2 = c(2017.57662, 2016.238495,
2013.632305, 2009.717765, 2007.281702, 2004.279012), x1 = c(0.162360256,
0.169799879, 0.16533032, 0.266915536, 0.171144554, 0.204767646
), x2 = c(0.162279606, 0.169666401, 0.165080887, 0.266394903,
0.170076265, 0.203675902)), row.names = c(NA, 6L), class = "data.frame")
### WAPRSLgp data
WAPRSLgp <- structure(list(Age = 1832:1837, mean = c(-0.098482271, -0.09855201,
-0.098622714, -0.098572523, -0.098894533, -0.099396926), error = c(0.054412551,
0.053483911, 0.052543897, 0.051595228, 0.05064071, 0.049683294
), min = c(-0.152894822, -0.152035921, -0.151166611, -0.150167751,
-0.149535243, -0.14908022), max = c(-0.04406972, -0.045068098,
-0.046078817, -0.046977296, -0.048253822, -0.049713632)), row.names = c(NA,
6L), class = "data.frame")
Your x1, x2, y1 and y2 points describe a perfect rectangle. Hence, the easiest thing is to simply use geom_rect(). I've commented out some lines since the WAPRSLgp data seems to describe a different part of the x-axis. The examples assume the WAPRSL data is in the global environment.
library(ggplot2)
ggplot(WAPRSL, aes(x =RSLc, y = Age))+
labs(x = "RSL (m)",y="Age (AD)")+
theme_classic()+
# geom_line(data = WAPRSLgp, aes(x=mean,y=Age),col="#227988", orientation = "x")+
coord_flip()+
# geom_ribbon(data = WAPRSLgp, aes(x=mean, xmax=mean+error, xmin=mean-error), fill="#227988",alpha=.5)+
# geom_ribbon(data = WAPRSLgp, aes(x=mean, xmax=mean+error*2, xmin=mean-error*2), fill="#227988",alpha=.7) +
geom_rect(aes(xmin = x1, xmax = x2, ymin = y1, ymax = y2),
fill = "transparent", colour = "black")
However, if you insist on polygons, you'd need to reshape your data a bit.
WAPRSL$id <- seq_len(nrow(WAPRSL))
poly <- tidyr::pivot_longer(WAPRSL, y1:y2, names_to = "y_var", values_to = "y_val")
poly <- tidyr::pivot_longer(poly, x2:x1, names_to = "x_var", values_to = "x_val")
# Correct for the order
poly <- poly[(poly$id - 1) * 4 + rep(c(1,2,4,3), max(poly$id)), ]
ggplot(WAPRSL, aes(x =RSLc, y = Age))+
labs(x = "RSL (m)",y="Age (AD)")+
theme_classic()+
# geom_line(data = WAPRSLgp, aes(x=mean,y=Age),col="#227988", orientation = "x")+
coord_flip()+
# geom_ribbon(data = WAPRSLgp, aes(x=mean, xmax=mean+error, xmin=mean-error), fill="#227988",alpha=.5)+
# geom_ribbon(data = WAPRSLgp, aes(x=mean, xmax=mean+error*2, xmin=mean-error*2), fill="#227988",alpha=.7) +
geom_polygon(
data = poly,
aes(x = x_val, y = y_val, group = id),
fill = NA, colour = "black"
)
Created on 2021-07-07 by the reprex package (v1.0.0)
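As an alternative to the two pivot_longer() calls, the corner coordinates can be laid out directly in base R, following the c(x1,x2,x2,x1), c(y1,y1,y2,y2) pattern from the question. This is only a sketch: corners_long() is a hypothetical helper, not a ggplot2 function, and the small rects data frame reuses the first three rows of the x1, x2, y1 and y2 columns of WAPRSL.

```r
library(ggplot2)

rects <- data.frame(
  x1 = c(0.162360256, 0.169799879, 0.16533032),
  x2 = c(0.162279606, 0.169666401, 0.165080887),
  y1 = c(2017.210026, 2015.631779, 2012.498519),
  y2 = c(2017.57662, 2016.238495, 2013.632305)
)

# Hypothetical helper: expand each row into its four corners, in drawing
# order (x1,y1) -> (x2,y1) -> (x2,y2) -> (x1,y2), tagged with a polygon id.
corners_long <- function(d) {
  data.frame(
    id = rep(seq_len(nrow(d)), each = 4),
    x = as.vector(rbind(d$x1, d$x2, d$x2, d$x1)),
    y = as.vector(rbind(d$y1, d$y1, d$y2, d$y2))
  )
}

poly <- corners_long(rects)

p <- ggplot(poly, aes(x = x, y = y, group = id)) +
  geom_polygon(fill = NA, colour = "black") +
  coord_flip()
```

If the shapes were genuine trapezia, the same layout would apply with four distinct corner columns per row in place of the repeated x1/x2 and y1/y2 values.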

ggplot: Extend regression line to predicted value with different linetype

Is there a simple way to extend a dotted line from the end of a solid regression line to a predicted value?
Below is my basic attempt at it:
x = rnorm(10)
y = 5 + x + rnorm(10,0,0.4)
my_lm <- lm(y~x)
summary(my_lm)
my_intercept <- my_lm$coef[1]
my_slope <- my_lm$coef[2]
my_pred = predict(my_lm,data.frame(x = (max(x)+1)))
ggdf <- data.frame( x = c(x,max(x)+1), y = c(y,my_pred), obs_Or_Pred = c(rep("Obs",10),"Pred") )
ggplot(ggdf, aes(x = x, y = y, group = obs_Or_Pred ) ) +
geom_point( size = 3, aes(colour = obs_Or_Pred) ) +
geom_abline( intercept = my_intercept, slope = my_slope, aes( linetype = obs_Or_Pred ) )
This doesn't give the output I'd hoped to see. I've looked at some other answers on SO and haven't seen anything simple. The best I've come up with is:
ggdf2 <- data.frame( x = c(x,max(x),max(x)+1), y = c(y,my_intercept+max(x)*my_slope,my_pred), obs_Or_Pred = c(rep("Obs",10),"Pred","Pred"), show_Data_Point = c(rep(TRUE,10),FALSE,TRUE) )
ggplot(ggdf2, aes(x = x, y = y, group = obs_Or_Pred ) ) +
geom_point( data = ggdf2[ggdf2[,"show_Data_Point"],] ,size = 3, aes(colour = obs_Or_Pred) ) +
geom_smooth( method = "lm", se=F, aes(colour = obs_Or_Pred, linetype=obs_Or_Pred) )
This gives output which is correct, but I have had to include an extra column specifying whether or not I want to show the data points. If I don't, I end up with the second of these two plots, which has an extra point at the end of the fitted regression line:
Is there a simpler way to tell ggplot to predict a single point out from the linear model and draw a dashed line to it?
You can plot the points using only your actual data and build a prediction data frame to add the lines. Note that max(x) appears twice so that it can be an endpoint of both the Obs line and the Pred line. We also use a shape aesthetic so that we can remove the point marker that would otherwise appear in the legend key for Pred.
# Build prediction data frame
pred_x = c(min(x),rep(max(x),2),max(x)+1)
pred_lines = data.frame(x=pred_x,
y=predict(my_lm, data.frame(x=pred_x)),
obs_Or_Pred=rep(c("Obs","Pred"), each=2))
ggplot(pred_lines, aes(x, y, colour=obs_Or_Pred, shape=obs_Or_Pred, linetype=obs_Or_Pred)) +
geom_point(data=data.frame(x,y, obs_Or_Pred="Obs"), size=3) +
geom_line(size=1) +
scale_shape_manual(values=c(16,NA)) +
theme_bw()
Semi-ugly: You can use scale_x_continuous(limits = ...) to set the range of x values used for prediction. Plot the predicted line first with fullrange = TRUE, then add the 'observed' line on top. Note that the overplotting isn't rendered perfectly, and you may want to increase the size of the observed line slightly.
d <- data.frame(x, y) # assumption: d holds the observed x and y from the question
ggplot(d, aes(x, y)) +
geom_point(aes(color = "obs")) +
geom_smooth(aes(color = "pred", linetype = "pred"), se = FALSE, method = "lm",
fullrange = TRUE) +
geom_smooth(aes(color = "obs", linetype = "obs"), size = 1.05, se = FALSE, method = "lm") +
scale_linetype_discrete(name = "obs_or_pred") +
scale_color_discrete(name = "obs_or_pred") +
scale_x_continuous(limits = c(NA, max(x) + 1))
However, I tend to agree with Gregor: "ggplot is a plotting package, not a modeling package".
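In that spirit, a further option is to do the modelling outside ggplot2 and draw the two pieces of the line as plain segments. The sketch below assumes the same simulated x, y and lm() fit as in the question (the seed and the name ends are illustrative) and uses annotate(), so it produces no legend:

```r
library(ggplot2)

set.seed(1)
x <- rnorm(10)
y <- 5 + x + rnorm(10, 0, 0.4)
my_lm <- lm(y ~ x)

# Three points on the fitted line: start of the data, end of the data,
# and the extrapolated point one unit beyond max(x)
ends <- data.frame(x = c(min(x), max(x), max(x) + 1))
ends$y <- unname(predict(my_lm, newdata = ends))

p <- ggplot(data.frame(x, y), aes(x, y)) +
  geom_point(size = 3) +
  # solid segment over the observed range
  annotate("segment", x = ends$x[1], y = ends$y[1],
           xend = ends$x[2], yend = ends$y[2]) +
  # dashed segment from the end of the data to the prediction
  annotate("segment", x = ends$x[2], y = ends$y[2],
           xend = ends$x[3], yend = ends$y[3], linetype = "dashed") +
  annotate("point", x = ends$x[3], y = ends$y[3], size = 3, colour = "red")
```

If a legend is needed, the data-frame approach with a mapped linetype shown earlier is the better fit.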

R ggplot2::geom_density with a constant variable

I recently came across a problem with ggplot2::geom_density that I am not able to solve. I am trying to visualise the density of some variable and compare it to a constant. To plot the density, I am using ggplot2::geom_density. The variable for which I am plotting the density, however, happens to be a constant (this time):
df <- data.frame(matrix(1,ncol = 1, nrow = 100))
colnames(df) <- "dummy"
dfV <- data.frame(matrix(5,ncol = 1, nrow = 1))
colnames(dfV) <- "latent"
ggplot() +
geom_density(data = df, aes(x = dummy, colour = 's'),
fill = '#FF6666', alpha = 0.2, position = "identity") +
geom_vline(data = dfV, aes(xintercept = latent, color = 'ls'), size = 2)
This is OK and something I would expect. But, when I shift this distribution to the far right, I get a plot like this:
df <- data.frame(matrix(71,ncol = 1, nrow = 100))
colnames(df) <- "dummy"
dfV <- data.frame(matrix(75,ncol = 1, nrow = 1))
colnames(dfV) <- "latent"
ggplot() +
geom_density(data = df, aes(x = dummy, colour = 's'),
fill = '#FF6666', alpha = 0.2, position = "identity") +
geom_vline(data = dfV, aes(xintercept = latent, color = 'ls'), size = 2)
which probably means that the kernel estimation is still taking 0 as the centre of the distribution (right?).
Is there any way to circumvent this? I would like to see a plot like the one above, only with the centre of the kernel density at 71 and the vline at 75.
Thanks
Well I am not sure what the code does, but I suspect the geom_density primitive was not designed for a case where the values are all the same, and it is making some assumptions about the distribution that are not what you expect. Here is some code and a plot that sheds some light:
# Generate 10 data sets with 100 constant values from 0 to 90
# and then merge them into a single dataframe
dfs <- list()
for (i in 1:10){
v <- 10*(i-1)
dfs[[i]] <- data.frame(dummy=rep(v,100),facet=v)
}
df <- do.call(rbind,dfs)
# facet plot them
ggplot() +
geom_density(data = df, aes(x = dummy, colour = 's'),
fill = '#FF6666', alpha = 0.5, position = "identity") +
facet_wrap( ~ facet,ncol=5 )
Yielding:
So it is not doing what you thought it was, but it is also probably not doing what you want. You could of course make it "translation-invariant" (almost) by adding some noise like this for example:
set.seed(1234)
noise <- rnorm(100, 0, 1e-3)
dfs <- list()
for (i in 1:10){
v <- 10*(i-1)
dfs[[i]] <- data.frame(dummy=rep(v,100)+noise,facet=v)
}
df <- do.call(rbind,dfs)
ggplot() +
geom_density(data = df, aes(x = dummy, colour = 's'),
fill = '#FF6666', alpha = 0.5, position = "identity") +
facet_wrap( ~ facet,ncol=5 )
Yielding:
Note that there is apparently a random component to the geom_density function, and I can't see how to set the seed before each instance, so the estimated density is a bit different each time.
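A possible explanation for the plot in the question, offered as a sketch rather than a definitive diagnosis: geom_density() relies on stats::density(), whose default bandwidth rule is bw.nrd0(). When the data have zero spread, bw.nrd0() falls back to abs(x[1]) as its scale (and to 1 if that is also zero), so a constant of 71 gets a bandwidth 71 times wider than a constant of 1. The density is still centred on 71; it is just very flat. Passing a fixed bw to geom_density() removes the dependence on location. The value reused below (the bandwidth for the constant 1) is only an illustrative choice.

```r
library(ggplot2)

# Default bandwidths for two constant samples of size 100
bw_1 <- bw.nrd0(rep(1, 100))
bw_71 <- bw.nrd0(rep(71, 100))
c(bw_1, bw_71)

df <- data.frame(dummy = rep(71, 100))
dfV <- data.frame(latent = 75)

# Fixing bw makes the curve around 71 look like the curve around 1
p <- ggplot() +
  geom_density(data = df, aes(x = dummy, colour = "s"), bw = bw_1,
               fill = "#FF6666", alpha = 0.2, position = "identity") +
  geom_vline(data = dfV, aes(xintercept = latent, colour = "ls"))
```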

Plotting points and lines separately in R with ggplot

I'm trying to plot 2 sets of data points and a single line in R using ggplot.
The issue I'm having is with the legend.
As can be seen in the attached image, the legend applies the lines to all 3 data sets even though only one of them is plotted with a line.
I have melted the data into one long frame, but this still requires me to filter the data sets for each individual call to geom_line() and geom_path().
I want to graph the melted data, plotting a line based on one data set, and points on the remaining two, with a complete legend.
Here is the sample script I wrote to produce the plot:
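# Assumption: melt() comes from reshape2, filter() from dplyr, viewport() from grid
library(ggplot2)
library(reshape2)
library(dplyr)
library(grid)
xseq <- 1:100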
x <- rnorm(n = 100, mean = 0.5, sd = 2)
x2 <- rnorm(n = 100, mean = 1, sd = 0.5)
x.lm <- lm(formula = x ~ xseq)
x.fit <- predict(x.lm, newdata = data.frame(xseq = 1:100), type = "response", se.fit = TRUE)
my_data <- data.frame(x = xseq, ypoints = x, ylines = x.fit$fit, ypoints2 = x2)
## Now try and plot it
melted_data <- melt(data = my_data, id.vars = "x")
p <- ggplot(data = melted_data, aes(x = x, y = value, color = variable, shape = variable, linetype = variable)) +
geom_point(data = filter(melted_data, variable == "ypoints")) +
geom_point(data = filter(melted_data, variable == "ypoints2")) +
geom_path(data = filter(melted_data, variable == "ylines"))
pushViewport(viewport(layout = grid.layout(1, 1))) # One on top of the other
print(p, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
You can set them manually like this:
We set linetype = "solid" for the first item and "blank" for others (no line).
Similarly for first item we set no shape (NA) and for others we will set whatever shape we need (I just put 7 and 8 there for an example). See e.g. http://www.r-bloggers.com/how-to-remember-point-shape-codes-in-r/ to help you to choose correct shapes for your needs.
If you are happy with dots then you can use my_shapes = c(NA,16,16) and scale_shape_manual(...) is not needed.
my_shapes = c(NA,7,8)
ggplot(data = melted_data, aes(x = x, y = value, color=variable, shape=variable )) +
geom_path(data = filter(melted_data, variable == "ylines") ) +
geom_point(data = filter(melted_data, variable %in% c("ypoints", "ypoints2"))) +
scale_colour_manual(values = c("red", "green", "blue"),
guide = guide_legend(override.aes = list(
linetype = c("solid", "blank","blank"),
shape = my_shapes))) +
scale_shape_manual(values = my_shapes)
But I am very curious whether there is a more automated way. Hopefully someone can post a better answer.
This post relied quite heavily on this answer: ggplot2: Different legend symbols for points and lines
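One caveat with positional vectors such as c(NA, 7, 8): they are matched to the legend keys in the order of the factor levels of variable, which, as far as I can tell, follows the column order of my_data when melt() is used. A sketch that uses named vectors instead, so each series is addressed by name; the long data frame is built with base R here, and the names shapes, linetypes and key_order are illustrative:

```r
library(ggplot2)

set.seed(42)
xseq <- 1:100
x <- rnorm(100, mean = 0.5, sd = 2)
x2 <- rnorm(100, mean = 1, sd = 0.5)
fit <- unname(fitted(lm(x ~ xseq)))

# Long format; the factor levels determine the order of the legend keys
long <- data.frame(
  x = rep(xseq, 3),
  value = c(x, fit, x2),
  variable = factor(rep(c("ypoints", "ylines", "ypoints2"), each = 100),
                    levels = c("ypoints", "ylines", "ypoints2"))
)

# Named by series, so the level order no longer matters
shapes <- c(ypoints = 16, ylines = NA, ypoints2 = 17)
linetypes <- c(ypoints = "blank", ylines = "solid", ypoints2 = "blank")
key_order <- levels(long$variable)

p <- ggplot(long, aes(x = x, y = value, colour = variable, shape = variable)) +
  geom_point(data = subset(long, variable != "ylines")) +
  geom_path(data = subset(long, variable == "ylines")) +
  scale_shape_manual(values = shapes) +
  guides(colour = guide_legend(override.aes = list(
    shape = unname(shapes[key_order]),       # reordered to match the keys
    linetype = unname(linetypes[key_order])
  )))
```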

Resources