I am trying to add lines for confidence intervals in R but lines() isn't working. In the following code b is a dataframe, 100 observations of 2 variables 'pred' and 'se'.
plot(c(1:300),b$pred,type="l",lwd=1.5)
lines(c(1:300),b$pred+2*b$se,type="l",lty=2,col='red')
The first line is working but the second is not. I have tried it with and without the x values (plot works with or without, lines works for neither). I can get lines to work for different dataframes, but not this one.
It seems very fragile to me to use 1:300 when also referencing b: it might work when b has 300 rows, but any other time it will either fail, complain with warnings, or recycle silently and show a misleading or meaningless plot. In general, avoid hard-coded numbers when working programmatically like this; seq_len(nrow(b)) is a safer choice than 1:300.
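As a small illustration of why seq_len(nrow(b)) is the safer index, here is a sketch on a made-up data frame (the values are hypothetical, since the question's b is not shown). It also covers the zero-row edge case, where 1:nrow() counts backwards.

```r
# Hypothetical stand-in for `b` (the question's data is not shown)
b <- data.frame(pred = c(1.2, 1.5, 1.1), se = c(0.1, 0.2, 0.1))

idx <- seq_len(nrow(b))        # always has exactly one index per row
length(idx) == nrow(b)

# Edge case: with zero rows, 1:nrow() counts down from 1 to 0
empty <- b[0, ]
seq_len(nrow(empty))           # integer(0)
1:nrow(empty)                  # c(1L, 0L): two bogus indices
```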
The bounds (x/y limits) for the plot are fixed by the first plot command. After that, in base R graphics, no other plotting command will alter the limits. This means it is highly likely that all of pred+2*se are greater than max(pred): R does draw the line, but it falls outside the plotting region and is clipped, so nothing visible appears.
For this, you need to set the limits up front, perhaps:
ylims <- with(b, range(c(pred, pred+2*se), na.rm = TRUE))
plot(seq_len(nrow(b)), b$pred, type="l", lwd=1.5, ylim=ylims)
lines(seq_len(nrow(b)), b$pred+2*b$se, type="l", lty=2, col='red')
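To check the fixed-limits behaviour for yourself, one option is to compare par("usr") (the extent of the plotting region) before and after calling lines(). This is only a sketch on simulated data; the data frame and the seed are my own assumptions.

```r
# Simulated stand-in for `b`; the real data is not shown in the question
set.seed(1)
b <- data.frame(pred = cumsum(rnorm(100)), se = runif(100, 0.5, 1))
idx <- seq_len(nrow(b))

plot(idx, b$pred, type = "l", lwd = 1.5)
usr_before <- par("usr")     # c(x1, x2, y1, y2) of the plotting region

lines(idx, b$pred + 2 * b$se, lty = 2, col = "red")
usr_after <- par("usr")

# lines() adds to the existing plot without rescaling it
identical(usr_before, usr_after)
```

Anything in the added line that lies outside those limits is simply not shown.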
That should address your question. Continue reading if you want to consider migration to ggplot2 ... not a one-for-one migration, not trivial, and perhaps premature at this point, but still something to think about.
While the above should fix the problem you cited, you might also consider migrating to ggplot2: it allows many other things (too many to discuss here), including updating the x/y limits with every "layer" you add to it. For instance, I believe the following will work:
library(ggplot2)
ggplot(b, aes(x = seq_along(pred), y = pred)) +
  geom_line(linewidth = 1.5) + # this is doing what your first 'plot' is doing
  geom_line(aes(y = pred + 2*se), linetype = 2, color = "red") # your call to lines (lty=2 becomes linetype = 2)
(Notice no need to handle the x/y limits manually, ggplot2 figures it out for you with each layer added.)
I'm going to infer that you'll want to add a pred - 2*se as well, in which case it'll be another call to geom_line, as in
ggplot(b, aes(x = seq_along(pred), y = pred)) +
  geom_line(linewidth = 1.5) +
  geom_line(aes(y = pred + 2*se), linetype = 2, color = "red") +
  geom_line(aes(y = pred - 2*se), linetype = 2, color = "blue")
Note that ggplot2 would actually prefer that you handle this with "long" data ... in that case, we can do something like below:
library(dplyr)
library(tidyr) # pivot_longer
b %>%
  select(pred, se) %>%            # b has no x column yet; it is created below
  mutate(
    x = row_number(),
    sehigh = pred + 2*se,
    selow = pred - 2*se
  ) %>%
  select(-se) %>%                 # drop se so it is not pivoted into its own line
  pivot_longer(-x, names_to = "type", values_to = "val") %>%
  ggplot(aes(x, val, group = type, color = type)) +
  geom_line() +
  scale_color_manual(values = c(pred = "black", sehigh = "red", selow = "blue"))
In this case, only one call to geom_line, and ggplot will handle colors automatically (based on the new categorical variable type that we created in a previous step).
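To make the "long" shape concrete, here is a small sketch on made-up numbers (b_demo and its values are hypothetical). It shows what the data looks like just before it reaches ggplot: one row per combination of x and series, with the series name in type.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for `b`
b_demo <- data.frame(pred = c(10, 11, 12), se = c(0.5, 0.4, 0.6))

long <- b_demo %>%
  mutate(x = row_number(),
         sehigh = pred + 2 * se,
         selow  = pred - 2 * se) %>%
  select(-se) %>%                      # keep only the series that should be drawn
  pivot_longer(-x, names_to = "type", values_to = "val")

long   # 3 rows of b_demo times 3 series gives 9 rows
```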
Related
Hi I am trying to code for a scatter plot for three variables in R:
Race= [0,1]
YOI= [90,92,94]
ASB_mean = [1.56, 1.59, 1.74]
Antisocial <- read.csv(file = 'Antisocial.csv')
Table_1 <- ddply(Antisocial, "YOI", summarise, ASB_mean = mean(ASB))
Table_1
Race <- unique(Antisocial$Race)
Race
ggplot(data = Table_1, aes(x = YOI, y = ASB_mean, group_by(Race))) +
  geom_point(colour = "Black", size = 2) +
  geom_line(data = Table_1, aes(YOI, ASB_mean), colour = "orange", size = 1)
Image of plot: https://drive.google.com/file/d/1E-ePt9DZJaEr49m8fguHVS0thlVIodu9/view?usp=sharing
Data file: https://drive.google.com/file/d/1UeVTJ1M_eKQDNtvyUHRB77VDpSF1ASli/view?usp=sharing
Can someone help me understand where I am making mistake? I want to plot mean ASB vs YOI grouped by Race. Thanks.
I am not sure what your desired output is. If I understood your question correctly, I think you want something like this.
library(tidyverse) # dplyr, ggplot2, and forcats (for as_factor)
g_Antisocial <- Antisocial %>%
  group_by(Race) %>%
  summarise(ASB = mean(ASB),
            YOI = mean(YOI))
Antisocial %>%
  ggplot(aes(x = YOI, y = ASB, color = as_factor(Race), shape = as_factor(Race))) +
  geom_point(alpha = .4) +
  geom_point(data = g_Antisocial, size = 4) +
  theme_bw() +
  guides(color = guide_legend("Race"), shape = guide_legend("Race"))
and this is the output:
@Maninder: there are a few things you need to look at.
First of all: The grammar of graphics of ggplot() works with layers. You can add layers with different data (frames) for the different geoms you want to plot.
The reason your code is not working is that you mix up the layer calls and do not clearly specify which data belongs to the scatter plot and which to the line.
(I) Use ggplot() + geom_point() for a scatter plot
The ultimate first layer is: ggplot(). Think of this as your drawing canvas.
You then speak about adding a scatter plot layer, but you actually do not do it.
For example:
# plotting Antisocial data set
ggplot() +
  geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race)))
will plot your Antisocial data set using the scatter layer, i.e. geom_point().
Note that I converted Race to a factor to get a categorical colour scheme; otherwise you might end up with a continuous palette.
(II) line plot
By analogy with the above, the line plot would be:
# plotting Table_1
ggplot() +
  geom_line(data = Table_1, aes(x = YOI, y = ASB_mean))
I will skip showing the plot of the line.
(III) combining different layers
# putting both together
ggplot() +
  geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race))) +
  geom_line(data = Table_1, aes(x = YOI, y = ASB_mean)) +
  ## this is to set the legend title and have a nice(r) name in your colour legend
  labs(colour = "Race")
This yields:
That should explain how ggplot layering works. Keep an eye on the datasets and geoms that you want to use. Before working with inheritance in aes, I recommend keeping the data= and aes() call in the geom_xxxx. This avoids confusion.
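To make the difference between inherited and per-layer settings concrete, here is a minimal sketch using the built-in mtcars data (my substitution; it has nothing to do with the Antisocial data). Both objects describe the same plot.

```r
library(ggplot2)

# Inherited: data and aes() are set once in ggplot() and reused by every layer
p_inherit <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_line()

# Explicit: each layer names its own data and aes()
p_explicit <- ggplot() +
  geom_point(data = mtcars, aes(x = wt, y = mpg)) +
  geom_line(data = mtcars, aes(x = wt, y = mpg))
```

The explicit form is more typing, but it makes it obvious which data frame each geom draws from, which matters once the layers use different data.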
You may want to try geom_jitter() instead of geom_point() to get a better presentation of your dataset. The "few" points plotted are the result of many data points sharing the same position (and so being overplotted).
Moving away from plotting to your question "I want to plot mean ASB vs YOI grouped by Race."
I know too little about your research to fully understand what you mean by that.
I take it that the mean ASB you calculated over the whole population is your reference (aka your Table_1), and you would like to see how the Race groups feature vs this population mean.
One option is to group your race data points and show them as boxplots for each YOI.
This might be what you want. The boxplot gives you the median and quartiles, and you can compare this per group against the calculated ASB mean.
For presentation purposes, I highlighted the line by increasing its size and changing its linetype. You can play around with the colours, etc. to get the aesthetics you are aiming for.
Please note that for the grouped boxplot you also have to treat your integer variable YOI as categorical, so I coerced it into a factor. Boxplots use fill for the body (colour sets only the outline). In this setup, you also need to supply a group value to geom_line() (I just assigned it to 1, but that is arbitrary; in other contexts you can assign another variable here).
ggplot() +
  geom_boxplot(data = Antisocial, aes(x = as.factor(YOI), y = ASB, fill = as.factor(Race))) +
  geom_line(data = Table_1, aes(x = as.factor(YOI), y = ASB_mean, group = 1),
            size = 2, linetype = "dashed") +
  labs(x = "YOI", fill = "Race")
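The group = 1 point can be seen in isolation with a tiny sketch. The data frame below is a hypothetical stand-in for Table_1, filled with the three means quoted in the question. With a factor on the x axis, each level would otherwise form its own one-point group and no line would be drawn.

```r
library(ggplot2)

# Hypothetical stand-in for Table_1, using the means quoted in the question
means <- data.frame(YOI = c(90, 92, 94), ASB_mean = c(1.56, 1.59, 1.74))

# group = 1 puts all three factor levels into a single group, so one line joins them
p <- ggplot(means, aes(x = as.factor(YOI), y = ASB_mean, group = 1)) +
  geom_line() +
  geom_point() +
  labs(x = "YOI")
p
```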
Hope this gets you going!
I want to output two plots in a grid using the same function but with different input for x. I am using ggplot2 with stat_function as per this post and I have combined the two plots as per this post and this post.
f01 <- function(x) {1 - abs(x)}
ggplot() +
  stat_function(data = data.frame(x=c(-1, 1)), aes(x = x, color = "red"), fun = f01) +
  stat_function(data = data.frame(x=c(-2, 2)), aes(x = x, color = "black"), fun = f01)
With the following outputs:
Plot:
Message:
`mapping` is not used by stat_function()
`data` is not used by stat_function()
`mapping` is not used by stat_function()
`data` is not used by stat_function()
I don't understand why stat_function() won't use either of the arguments. I would expect two graphs, one with x between -1 and 1 and the second with x between -2 and 2. Furthermore, it uses the colors as labels, and I don't understand why. I must be missing something obvious.
The issue is that according to the docs the data argument is
Ignored by stat_function(), do not use.
Hence, at least in the second call to stat_function the data is ignored.
Second, the docs also say:
The function is called with a grid of evenly spaced values along the x axis, and the results are drawn (by default) with a line.
Therefore both functions are plotted over the same range of x values.
If you simply want to draw functions, this can be achieved without data and mappings like so:
library(ggplot2)
f01 <- function(x) {1 - abs(x)}
ggplot() +
  stat_function(color = "black", fun = f01, xlim = c(-2, 2)) +
  stat_function(color = "red", fun = f01, xlim = c(-1, 1))
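If you would rather control the evaluation grid yourself, another option is to evaluate f01 on each range by hand and draw the result with geom_line(). This is only a sketch: make_curve, curves, the range labels, and the grid size of 101 are my own illustrative choices, not something from the question.

```r
library(ggplot2)

f01 <- function(x) {1 - abs(x)}

# Evaluate the function on a given range; 101 grid points is an arbitrary choice
make_curve <- function(lim, label) {
  x <- seq(lim[1], lim[2], length.out = 101)
  data.frame(x = x, y = f01(x), range = label)
}

curves <- rbind(make_curve(c(-1, 1), "-1 to 1"),
                make_curve(c(-2, 2), "-2 to 2"))

# colour is mapped to a real column here, so the legend labels are meaningful
p <- ggplot(curves, aes(x = x, y = y, colour = range)) +
  geom_line() +
  scale_colour_manual(values = c("-1 to 1" = "red", "-2 to 2" = "black"))
p
```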
To be honest, I'm not really sure what happens here with ggplot and its inner workings. It seems that the functions are always applied to the complete range, here -2 to 2. Also, there is an issue on github regarding a wrong error message for stat_function.
However, you can use the xlim argument for your stat_function to limit the range on which a function is drawn. Also, if you don't specify the colour argument by a variable, but by a manual label, you need to tell which colours should be used for which label with scale_colour_manual (easiest with a named vector). I also adjusted the line width to show the function better:
library(ggplot2)
f01 <- function(x) {1 - abs(x)}
cols <- c("red" = "red", "black" = "black")
ggplot() +
  stat_function(data = data.frame(x=c(-1, 1)), aes(x = x, colour = "red"), fun = f01, size = 1.5, xlim = c(-1, 1)) +
  stat_function(data = data.frame(x=c(-2, 2)), aes(x = x, colour = "black"), fun = f01) +
  scale_colour_manual(values = cols)
I need to create "two plots" in "one plot" with ggplot. I managed to do it with base R as follows:
x=rnorm(10)
y=rnorm(10)*20+100
plot(1:10,rev(sort(x)),cex=2,col='red',ylim=c(0,2.2))
segments(x0=1:10, x1=1:10, y0=1.8,y1=1.8+y/max(y)*.2,lwd=3,col='dodgerblue')
However, I am struggling with ggplot, how can it be done?
Here's one possible translation of that code.
ggplot(data.frame(idx=seq_along(x), x,y)) +
  geom_point(aes(idx, rev(sort(x))), col="red") +
  geom_segment(aes(x=idx, xend=idx, y=1.8, yend=1.8+y/max(y)*.2), color="dodgerblue")
In general with ggplot2, you can add multiple views of data to a plot by adding additional layers (geoms).
My solution is similar to @MrFlick's.
I would always recommend having a plot data frame and referring to the variables from there as you can more easily relate variables to plot aesthetics.
library(tidyverse)
plot_df <- data.frame(x, y) %>%
  arrange(-x) %>%
  mutate(id = 1:10)
ggplot(plot_df) +
  geom_point(aes(id, x), color = "red", pch = 1, size = 5) +
  geom_segment(aes(x = id, xend = id, y = 1.8, yend = 1.8+y/max(y)*.2),
               lwd = 2, color = 'dodgerblue') +
  scale_y_continuous(limits = c(0,2.2)) +
  theme_light()
Ultimately, ggplot builds the final plot by stacking layers (in this case, the points and the segments) on top of one another.
If you'd like to learn more, check out the ggplot cheat sheet and read more on the ideas behind ggplot: https://ggplot2.tidyverse.org/
I'm three weeks into my R class (please be patient with me even if it seems obvious where I went wrong!), and I am struggling with a homework problem that uses the R ggplot2 library. Using the built-in diamonds data frame, the problem is to make a scatter plot with a regression line for log(carat) and log(price), but plotting only the Fair and Ideal cut diamonds.
This is what the plot is supposed to look like
A quick background, the 3 variables in question here are carat (num), cut (Fair, Good, Very Good, Premium, Ideal), and price (int).
I start with the following code:
set.seed(123)
d <- ggplot(diamonds[sample(nrow(diamonds),5000),]) #this was provided to us in the homework
d + geom_point(aes(x = log(carat), y = log(price), colour = cut)) +
  labs(title = 'Regression line for Fair and Ideal Cut Diamonds') +
  stat_smooth(aes(x = log(carat), y = log(price), colour = cut), method = "gam")
Here's what I got
Now, I know this is incorrect, because "colour = cut" shows ALL the cuts, but I only want "Fair" and "Ideal". The professor hinted that we should try diamonds$cut%in%c(...), and so I tried it in many different ways. One of the latest (wrong) code is:
d + geom_point(aes(x = log(carat), y = log(price), colour = diamonds[diamonds$cut%in%c("Fair","Ideal")]), alpha = 0.5) +
labs(title = 'Regression line for Fair and Ideal Cut Diamonds') +
stat_smooth(aes(x = log(carat), y = log(price), colour = diamonds[diamonds$cut%in%c("Fair","Ideal")]), method = "gam")
I continue to get error messages regardless of where I tried to subset the diamonds$cut (e.g., Length of logical index vector for '[' must equal number of columns, Aesthetics must be either length 1 or the same as the data (5000):colour).
How do I extract just the Fair and Ideal cut to make this graph?
Any help is appreciated!
One way is to subset the data before passing it to ggplot(), rather than trying to filter the cut column where it is mapped in aes(colour = cut); I'm not sure the latter can be done. The plot may not look exactly like the one in your post, if that matters at this point. Hopefully this helps.
library(ggplot2)
set.seed(123)
z <- diamonds[sample(nrow(diamonds),5000),]
z <- z[z$cut %in% c("Fair", "Ideal"),]
d <- ggplot(data = z) +
  geom_point(aes(x = log(carat), y = log(price), colour = cut), alpha = 0.5) +
  labs(title = 'Regression line for Fair and Ideal Cut Diamonds') +
  stat_smooth(aes(x = log(carat), y = log(price), colour = cut), method = "gam")
d
Created on 2019-03-21 by the reprex package (v0.2.1)
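For context on the errors in the question: a logical vector built with %in% has one value per row, so it belongs before the comma inside [ ]. Without the comma, [ treats the index as a column selector, which is where the "must equal number of columns" message comes from. A minimal sketch, using the built-in iris data as a stand-in (iris and the chosen species are my substitution, not part of the homework):

```r
keep <- iris$Species %in% c("setosa", "virginica")   # one TRUE/FALSE per row

length(keep) == nrow(iris)   # the mask is row-length, so it indexes rows
rows <- iris[keep, ]         # note the comma: matching rows, all columns
nrow(rows)

# Without the comma, `[` treats the index as a column selector and fails
res <- tryCatch(iris[keep], error = function(e) "error")
res
```

The same row-length vector cannot be used as a colour aesthetic on its own either; the subsetting has to happen on the data before it is plotted.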
Use subset() to subset the data. To match your target graph more closely, change the method to 'auto' in stat_smooth() so the line follows the data points.
The chart will not be identical on every run, because the rows are sampled at random.
library(ggplot2)
df <- diamonds[sample(nrow(diamonds), 50000), ]
df_fair_ideal <- subset(df, cut %in% c("Fair", "Ideal"))
ggplot(df_fair_ideal, aes(x = log(carat), y = log(price), color = cut)) +
  labs(title = 'Regression line for Fair and Ideal Cut Diamonds') +
  geom_point(alpha = 0.5) +  # alpha belongs in the geom; ggplot() itself ignores it
  xlim(min(log(df_fair_ideal$carat)), max(log(df_fair_ideal$carat))) +
  stat_smooth(method = "auto", se = TRUE)
I would like to take a ggplot scatterplot and overlay on top of it the mean of the y-variable within evenly-spaced bins on the x-axis.
So far what I have is this:
library(tidyverse)
data(midwest)
ggplot(arrange(midwest,percollege),aes(x=percollege,y=percbelowpoverty))+
  geom_point()+
  stat_summary_bin(aes(x=percollege,y=percbelowpoverty),
                   bins=10,fun.y='mean',geom='point',col='red')
Which produces
which is basically perfect except instead of red points I would like horizontal red lines that extend from the beginning of the bin to the end of the bin.
I can sort of mimic what I want with
library(tidyverse)
data(midwest)
ggplot(arrange(midwest,percollege),aes(x=percollege,y=percbelowpoverty))+
  geom_point()+
  stat_summary_bin(aes(x=percollege,y=percbelowpoverty),
                   bins=10,fun.y='mean',geom='point',col='red',shape="-",size=50)
which gives
Which is kinda what I want, except:
1. I have to manually set the size every time I make a new graph like this
2. Uh, ew.
Another approach I've tried is with geom='bar',fill=NA, which seems promising if I can somehow get it to only show the top bar without the sides or bottom of the bar.
Any tips for this? I've had little luck with setting the geom to pointrange or linerange or line (the first two I've yet to get to work, and the last just connects each point with non-horizontal lines). Kind of surprised this isn't default behavior for stat_summary_bin to be honest!
Thanks!
This should work. I think the rownames_to_column line may not be necessary; the modify_if call is needed because the bin bounds extracted from the cut() labels are strings rather than numeric values.
midwest_sum <- midwest %>%
  mutate(coll_bins = cut(percollege, breaks = 10)) %>%
  group_by(coll_bins) %>%
  summarise(bin_mean = mean(percbelowpoverty)) %>%
  rownames_to_column(var = "bin_num") %>%
  tidyr::extract(coll_bins, c("min", "max"), "\\((.*),(.*)]") %>%
  modify_if(is.character, as.numeric)
ggplot()+
  geom_point(data = midwest, aes(x=percollege,y=percbelowpoverty)) +
  geom_errorbarh(data = midwest_sum, aes(xmin = min, xmax = max, y = bin_mean),
                 col = "red", size = 1)
Hope this helps!
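A variation on the same idea that avoids parsing the bin labels back into numbers is to compute the bin edges up front and look them up by bin index. This is only a sketch on simulated data; the data frame d, the seed, and the column names are my own assumptions.

```r
library(ggplot2)

# Simulated stand-in data; any numeric x/y pair works
set.seed(42)
d <- data.frame(x = runif(200, 0, 50))
d$y <- 30 - 0.4 * d$x + rnorm(200, sd = 3)

n_bins <- 10
breaks <- seq(min(d$x), max(d$x), length.out = n_bins + 1)   # equal-width bin edges
d$bin <- cut(d$x, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Mean of y per bin, with the numeric edges looked up by bin index
segs <- aggregate(y ~ bin, data = d, FUN = mean)
segs$xmin <- breaks[segs$bin]
segs$xmax <- breaks[segs$bin + 1]

ggplot() +
  geom_point(data = d, aes(x = x, y = y)) +
  geom_segment(data = segs, aes(x = xmin, xend = xmax, y = y, yend = y),
               colour = "red")
```

Because labels = FALSE makes cut() return integer bin codes, the edges stay numeric throughout and no regular expression is needed.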
I wouldn't often call this desired default behaviour; leaving out the sides of the bins necessarily makes it confusing where the bin boundaries actually are for points far above or below the bin means.
Anyway, here's a first attempt. We can calculate the bin boundaries based on some input parameter and then use geom_segment to draw them on the graph. geom_segment needs start and end coordinates, so bin_boundaries calculates the means of the y variable and the bounds of the bins for the x variable, and returns a call to geom_segment. This means we can simply add the output of our function to our ggplot call and it works as expected. Note the use of passing through ... so we can still use the geom parameters.
You can probably modify to use other bin width and dodge parameters instead of calculating from the bounds of your x variable, haven't thought too carefully about that. Note that the lines look different from your use of stat_summary_bin because they are centered differently and so use different points in each calculation. You might also consider a version that uses geom_step which would connect the ends of each horizontal line.
library(tidyverse)
bin_boundaries <- function(tbl, n_bins, x_var, y_var, ...) {
  x_var <- enquo(x_var)
  y_var <- enquo(y_var)
  bin_bounds <- seq(
    from = min(pull(tbl, !!x_var)),
    to = max(pull(tbl, !!x_var)),
    length.out = n_bins + 1)
  bounds_tbl <- tbl %>%
    mutate(bin_group = ntile(!!x_var, n_bins)) %>%
    group_by(bin_group) %>%
    summarise(!!y_var := mean(!!y_var)) %>%
    mutate(bin_start = bin_bounds[1:n_bins], bin_end = bin_bounds[2:(n_bins + 1)])
  geom_segment(
    data = bounds_tbl,
    mapping = aes(
      x = bin_start, y = !!y_var,
      xend = bin_end, yend = !!y_var
    ),
    ...
  )
}
ggplot(midwest) +
  geom_point(aes(x = percollege, y = percbelowpoverty)) +
  bin_boundaries(midwest, 10, percollege, percbelowpoverty, colour = "red", size = 1)
Created on 2019-02-07 by the reprex package (v0.2.1)