I want to plot a probability density function (pdf) over a certain interval. The x-axis has to be longer than that interval because the plot will contain several pdfs.
But as the x-axis grows, the pdf is drawn over the whole axis as well, although the code specifies a certain interval.
Code:
p1 <- ggplot() +
stat_function(data=data.frame(x=c(2,30)),aes(x),fun = dnorm, n = 101,
args= list(mean=5,sd=1),color="black")+
xlim(-5,80)+
scale_y_continuous(breaks = NULL)
The pdf in p1 is drawn up to x=80, but the x values in the code are a vector that only goes up to x=30.
How can I prevent the pdf from producing values up to 80, or how does the code have to look so that the distribution stops at x=30?
We can construct a data frame on the fly containing only the x-values you want plotted, using dplyr::data_frame (deprecated in current releases; tibble::tibble() is the drop-in replacement). I added another p.d.f. to demonstrate that the presentation you want will work.
library(dplyr)
# to use data_frame
ggplot() +
geom_line(data=data_frame(x=seq(2,30, 0.25), y = dnorm(x, mean = 5, sd = 1)),
aes(x, y), color = "black") +
geom_line(data=data_frame(x=seq(40,70, 0.25), y = dnorm(x, mean = 60, sd = 5)),
aes(x, y), color = "red") +
xlim(-5,80)+
scale_y_continuous(breaks = NULL)
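As an alternative to precomputing the values, here is a minimal sketch that keeps stat_function() and restricts each curve with its xlim argument. It assumes ggplot2 3.3.0 or later (where stat_function() works without a data argument); the name p_xlim is only illustrative.

```r
library(ggplot2)

# xlim inside stat_function() restricts where the function is evaluated;
# the xlim() scale further down still sets the visible axis range.
p_xlim <- ggplot() +
  stat_function(fun = dnorm, args = list(mean = 5, sd = 1),
                xlim = c(2, 30), n = 101, colour = "black") +
  stat_function(fun = dnorm, args = list(mean = 60, sd = 5),
                xlim = c(40, 70), n = 101, colour = "red") +
  xlim(-5, 80) +
  scale_y_continuous(breaks = NULL)

# Inspect the range of x values actually computed for each curve
range(layer_data(p_xlim, 1)$x)
range(layer_data(p_xlim, 2)$x)
```

Printing p_xlim draws both curves only over their own intervals while the axis still runs from -5 to 80.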
Update
You can also label them like so, by moving the color= inside the aes(...):
ggplot() +
geom_line(data=data_frame(x=seq(2,30, 0.25), y = dnorm(x, mean = 5, sd = 1)),
aes(x, y, color = "mean:5, sd:1")) +
geom_line(data=data_frame(x=seq(40,70, 0.25), y = dnorm(x, mean = 60, sd = 5)),
aes(x, y, color = "mean:60, sd:5")) +
xlim(-5,80)+
scale_y_continuous(breaks = NULL) +
scale_color_manual(values = c("mean:5, sd:1" = "black",
"mean:60, sd:5" = "red"))
Update2
Your density function works for me.
dTDF<-function(x,g,a,b,k){
exp(-exp(-(x/a)+((a*k)/(g-x))-b))*(exp(-(x/a)+((a*k)/(g-x))-b))*((1/a)-((a*k)/((g-x)^2)))
}
df1 <- data_frame(x=seq(2500,11300, 100),
y = dTDF(x,g=11263,a=1185, b=-4, k=-0.5))
df2 <- data_frame(x=seq(7000,14300, 100),
y = dTDF(x,g=15263,a=1105, b=-10, k=-0.5))
ggplot() +
geom_line(data = df1, aes(x, y))+
geom_line(data = df2, aes(x, y)) +
xlim(1000, 15000)
For each treatment tmt, I want to plot the means using stat_summary in ggplot2 with a different colour and size. I find that multiple means are being plotted over the existing points, and I am not sure how to fix it.
df <- data.frame(x = rnorm(12, 4,1), y = rnorm(12, 6,4), tmt = rep(c("A","B","C"), each = 4))
ggplot(aes(x = x, y = y, fill = tmt), data = df) +
geom_point(shape=21, size=5, alpha = 0.6) +
scale_fill_manual(values=c("pink","blue", "purple")) +
stat_summary(aes(fill = tmt), fun = 'mean', geom = 'point', size = 5) +
scale_fill_manual(values=c("pink","blue", "purple"))
Plot without the last two lines of code
Plot with the entire code
Using stat_summary you compute the mean of y for each pair of x and tmt. Because every x value here is unique, each "mean" is just the original point drawn again. If you want the mean of x and the mean of y per tmt, I would suggest computing the means manually outside of ggplot and using a second geom_point to plot them. In my code below I increased the size and used squares (shape 22) for the means:
df <- data.frame(x = rnorm(12, 4,1), y = rnorm(12, 6,4), tmt = rep(c("A","B","C"), each = 4))
library(ggplot2)
library(dplyr)
df_mean <- df |>
group_by(tmt) |>
summarise(across(c(x, y), mean))
ggplot(aes(x = x, y = y, fill = tmt), data = df) +
geom_point(shape=21, size=5, alpha = 0.6) +
geom_point(data = df_mean, shape=22, size=8, alpha = 0.6) +
scale_fill_manual(values=c("pink","blue", "purple"))
I am trying to plot observations and their grouped regression lines with ggplot as follows:
ggplot(df, aes(x = cabpol.e, y = pred.vote_share, color = coalshare)) +
geom_point() +
scale_color_gradient2(midpoint = 50, low="blue", mid="green", high="red") +
geom_smooth(aes(x = cabpol.e, y = pred.vote_share, group=coalshare1, fill = coalshare1), se = FALSE, method='lm') +
scale_fill_manual(values = c(Junior="blue", Medium="green", Senior="red"))
The problem is that the lines from geom_smooth are all the same color. I tried using scale_fill_manual so that there aren't two different color scales, manually specifying which color corresponds to each group, but all the lines still appear blue. How can I make each line a different color?
As requested, here is a set of replicable data with the same problem:
set.seed(1000)
dff <- data.frame(x=rnorm(100, 0, 1),
y=rnorm(100, 1, 2),
z=seq(1, 100, 1),
g=rep(c("A", "B"), 50))
ggplot(dff, aes(x = x, y = y, color = z, group = g, fill = g)) +
geom_point() +
scale_color_gradient2(midpoint = 50, low="blue", high="red") +
geom_smooth(se = FALSE, method='lm')
My solution to this problem would be to create multiple geom_smooth calls, and each time subset the data for the desired factor level. This way you are able to pass a different color to each call of geom_smooth. As long as you do not have many factors, this solution is not terribly inefficient.
library(ggplot2)
library(dplyr)  # for filter()
dff <- data.frame(x=rnorm(100, 0, 1),
y=rnorm(100, 1, 2),
z=seq(1, 100, 1),
g=rep(c("A", "B"), 50))
ggplot(dff, aes(x = x, y = y,
color = z,
group = g)) +
geom_point() +
scale_color_gradient2(midpoint = 50, low="blue", high="red") +
geom_smooth(
aes(x = x, y =y),
color = "red",
method = "lm",
data = filter(dff, g == "A"),
se = FALSE
) +
geom_smooth(
aes(x = x, y =y),
color = "blue",
method = "lm",
data = filter(dff, g == "B"),
se = FALSE
)
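If there are more than a couple of groups, the repeated geom_smooth calls above can be generated instead of typed. The following is a minimal sketch of that idea, relying on the fact that a list of layers can be added to a ggplot; the names line_cols, smooth_layers and p_lines are illustrative, and the colour choices are arbitrary.

```r
library(ggplot2)

set.seed(1000)
dff <- data.frame(x = rnorm(100, 0, 1),
                  y = rnorm(100, 1, 2),
                  z = seq(1, 100, 1),
                  g = rep(c("A", "B"), 50))

# One fixed line colour per factor level (hypothetical choice)
line_cols <- c(A = "red", B = "blue")

# Build one geom_smooth layer per level, each on its own subset of the data
smooth_layers <- lapply(names(line_cols), function(lv) {
  geom_smooth(data = dff[dff$g == lv, ],
              aes(x = x, y = y),
              colour = line_cols[[lv]],
              method = "lm", formula = y ~ x,
              se = FALSE, inherit.aes = FALSE)
})

# Points keep the continuous colour scale; the lines use the fixed colours
p_lines <- ggplot(dff, aes(x = x, y = y, colour = z)) +
  geom_point() +
  scale_colour_gradient2(midpoint = 50, low = "blue", high = "red") +
  smooth_layers
```

Because the line colours are set outside aes(), they do not compete with the continuous colour scale used for the points, but they also do not appear in the legend.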
Group trends between the x and y variables can be plotted by using different data frames for geom_line (with predicted values) and geom_point (with raw data). Make sure that color is mapped to the same variable in the ggplot() call, and then group by that same variable in geom_line.
p2 <- ggplot(NULL, aes(x = cabpol.e, y = vote_share, color = coalshare)) +
geom_line(data = preds, aes(group = coalshare, color = coalshare), size = 1) +
geom_point(data = df, aes(x = cabpol.e, y = vote_share)) +
scale_color_gradient2(name = "Share of Seats\nin Coalition (%)",
midpoint = 50, low="blue", mid = "green", high="red") +
xlab("Ideological Differences on State/Market") +
ylab("Vote Share (%)") +
ggtitle("Vote Share Won by Coalition Parties in Next Election")
Is there a way to place horizontal lines with the group means on a plot without creating the summary data set ahead of time? I know this works, but I feel there must be a way to do this with just ggplot2.
library(dplyr)
library(ggplot2)
library(cowplot)  # provides background_grid()
X <- data_frame(
x = rep(1:5, 3),
y = c(rnorm(5, 5, 0.5),
rnorm(5, 3, 0.3),
rnorm(5, 4, 0.7)),
grp = rep(LETTERS[1:3], each = 5))
X.mean <- X %>%
group_by(grp) %>%
summarize(y = mean(y))
X %>%
ggplot(aes(x = x, y = y, color = grp)) +
geom_point(shape = 19) +
geom_hline(data = X.mean, aes(group = grp, yintercept = y, color = grp)) +
background_grid()
Expanding on my comment:
ggplot(X, aes(x = x, y = y, color = grp)) +
geom_point(shape = 19) +
stat_smooth(method="lm", formula=y~1, se=FALSE)+
theme_bw()
So this applies a linear model with only the constant term, which returns the mean. Credit to this answer for the basic idea.
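To see why the intercept-only model draws the group mean, here is a small base R check. It is a sketch on simulated data; the variable names are illustrative and not taken from the question.

```r
set.seed(42)
y   <- c(rnorm(5, 5, 0.5), rnorm(5, 3, 0.3), rnorm(5, 4, 0.7))
grp <- rep(LETTERS[1:3], each = 5)

# Plain group means
group_means <- tapply(y, grp, mean)

# Intercept of lm(y ~ 1) fitted separately within each group,
# which is what stat_smooth(method = "lm", formula = y ~ 1) fits per colour group
intercepts <- sapply(split(y, grp), function(v) unname(coef(lm(v ~ 1))))

# The two agree up to floating point error
all.equal(as.numeric(group_means), as.numeric(intercepts))
```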
Edit: Response to OP's very clever suggestion.
It looks like you can use quantile regression to generate the medians!
library(quantreg)
ggplot(X, aes(x = x, y = y, color = grp)) +
geom_point(shape = 19) +
stat_smooth(method="rq", formula=y~1, se=FALSE)+
theme_bw()
The basic requirement for stat_smooth(method=..., ...) is that the method returns an object for which there is a predict(...) method. So here rq(...) returns an rq object and there is a predict.rq(...) method. You can get into trouble using se=TRUE sometimes as not all predict methods return standard errors of the estimates.
Before explaining details, here is my data:
set.seed (1234)
datas <- data.frame (Indv = 1:20, Xvar = rnorm (20, 50, 10),
Yvar = rnorm (20, 30,5), Yvar1 = rnorm (20, 10, 2),
Yvar2 = rnorm (20, 5, 1), Yvar3 = rnorm (20, 100, 20),
Yvar4 = rnorm (20, 15, 3))
I want to prepare a graph (a metroglyph), which is essentially a point plot of Xvar and Yvar in which each point has spikes (lines) originating from it, with lengths scaled to the remaining variables (Yvar1, Yvar2, Yvar3, Yvar4).
The spikes should be ordered and preferably color coded.
require(ggplot2)
ggplot(datas, aes(x=Xvar, y=Yvar)) +
geom_point(shape=1, size = 10) + theme_bw()
Here is one possible approach that may be helpful to you. It uses stat_spoke() from ggplot2 (since deprecated in favour of geom_spoke(), which takes the same angle and radius aesthetics). Each of your y-variables is mapped to the spoke radius in 4 separate calls to stat_spoke.
plot_1 = ggplot(datas, aes(x=Xvar, y=Yvar)) +
stat_spoke(aes(angle=(1/8)*pi, radius=Yvar1), colour="#E41A1C",size=1) +
stat_spoke(aes(angle=(3/8)*pi, radius=Yvar2), colour="#377EB8",size=1) +
stat_spoke(aes(angle=(5/8)*pi, radius=Yvar3), colour="#4DAF4A",size=1) +
stat_spoke(aes(angle=(7/8)*pi, radius=Yvar4), colour="#984EA3",size=1) +
geom_point(shape=1, size = 10)
ggsave("plot_1.png", plot_1)
Depending on your data and specific needs, it may make sense to transform the variables so they fit better on the plot.
normalize = function(x) {
new_x = (x - mean(x)) / sd(x)
new_x = new_x + abs(min(new_x))
return(new_x)
}
plot_2 = ggplot(datas, aes(x=Xvar, y=Yvar)) +
stat_spoke(aes(angle=(1/8)*pi, radius=normalize(Yvar1)), colour="#E41A1C", size=1) +
stat_spoke(aes(angle=(3/8)*pi, radius=normalize(Yvar2)), colour="#377EB8", size=1) +
stat_spoke(aes(angle=(5/8)*pi, radius=normalize(Yvar3)), colour="#4DAF4A", size=1) +
stat_spoke(aes(angle=(7/8)*pi, radius=normalize(Yvar4)), colour="#984EA3", size=1) +
geom_point(shape=1, size = 10)
ggsave("plot_2.png", plot_2)
Important caveat: For the same spoke radius value, the magnitude of the plotted line will be greater if the line is more vertical, and less if the line is more horizontal. This is because the range of x is around twice the range of y for your data set. The plotted angles also become distorted as the x-to-y axis ratio changes. Adding coord_equal(ratio=1) solves this issue, but may introduce other problems.
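The distortion can be illustrated numerically. A spoke ends at (x + radius * cos(angle), y + radius * sin(angle)) in data units, so two spokes with the same radius have the same length in data units but not necessarily on screen. The sketch below assumes, purely for illustration, a square panel whose x-axis spans 40 data units and whose y-axis spans 20; the helper spoke_end and those spans are hypothetical.

```r
# End point of a spoke in data units
spoke_end <- function(x, y, angle, radius) {
  data.frame(xend = x + radius * cos(angle),
             yend = y + radius * sin(angle))
}

# Two spokes of equal radius: one fairly flat, one fairly steep
ends <- spoke_end(x = 50, y = 30, angle = c(pi / 8, 3 * pi / 8), radius = 1)

# Length in data units is the same for both
data_len <- sqrt((ends$xend - 50)^2 + (ends$yend - 30)^2)

# Assumed axis spans of a square panel (hypothetical values)
x_span <- 40
y_span <- 20

# Length as a fraction of the panel side: the steeper spoke comes out longer
screen_len <- sqrt(((ends$xend - 50) / x_span)^2 + ((ends$yend - 30) / y_span)^2)
screen_len
```

With coord_equal(ratio=1) one data unit takes the same space on both axes, so the two on-screen lengths would match again.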
Edit: Plotting without a loop
This was fun and educational to figure out. Possibly it would have been more time-efficient to type the repetitive code! If anyone can offer advice to improve this code, please comment.
library(reshape2)
dat2 = melt(datas, id.vars=c("Indv", "Xvar", "Yvar"),
variable.name="spoke_var", value.name="spoke_value")
# Apply normalization in a loop. Can plyr do this more gracefully?
for (var_name in levels(dat2$spoke_var)) {
select_rows = dat2$spoke_var == var_name
norm_dat = normalize(dat2[select_rows, "spoke_value"])
dat2[select_rows, "spoke_value"] = norm_dat
}
# Pick an angle for each Yvar, then add angle column to dat2.
tmp = data.frame(spoke_var=unique(dat2$spoke_var))
tmp$spoke_angle = seq(from=pi/8, by=pi/4, length.out=nrow(tmp))
dat2 = merge(dat2, tmp)
plot_4 = ggplot(dat2, aes(x=Xvar, y=Yvar)) +
stat_spoke(data=dat2, size=1,
aes(colour=spoke_var, angle=spoke_angle, radius=spoke_value)) +
geom_point(data=datas, aes(x=Xvar, y=Yvar), shape=1, size=7) +
coord_equal(ratio=1) +
scale_colour_brewer(palette="Set1")
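On the question in the code comment about normalizing more gracefully: one option that needs no extra package is base R's ave(), which applies a function within groups and returns the result in the original row order. This is a minimal sketch on a small long-format data frame built by hand (the name long and the column spoke_norm are illustrative); normalize is the same function as defined above.

```r
normalize <- function(x) {
  new_x <- (x - mean(x)) / sd(x)
  new_x + abs(min(new_x))
}

set.seed(1234)
# Long-format stand-in for dat2: one row per individual and spoke variable
long <- data.frame(
  spoke_var   = rep(c("Yvar1", "Yvar2"), each = 20),
  spoke_value = c(rnorm(20, 10, 2), rnorm(20, 5, 1))
)

# Normalize spoke_value separately within each level of spoke_var
long$spoke_norm <- ave(long$spoke_value, long$spoke_var, FUN = normalize)

# Each group now has its smallest value shifted to zero
tapply(long$spoke_norm, long$spoke_var, min)
```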
Here is a more manual approach:
set.seed (1234)
datas <- data.frame (Indv = 1:20, Xvar = rnorm (20, 50, 10),
Yvar = rnorm (20, 30,5), Yvar1 = rnorm (20, 10, 2),
Yvar2 = rnorm (20, 5, 1), Yvar3 = rnorm (20, 100, 20),
Yvar4 = rnorm (20, 15, 3))
datas$SYvar1 <- 2 + scale (datas$Yvar1)
datas$SYvar2 <- 2 + scale (datas$Yvar2)
datas$SYvar3 <- 2 + scale (datas$Yvar3)
datas$SYvar4 <- 2 + scale (datas$Yvar4)
require(ggplot2)
p <- ggplot(datas, aes(x=Xvar, y=Yvar)) +
geom_point(size = 10, pch = 19, col = "yellow2")
p + geom_segment(aes(x = Xvar, y = Yvar, xend = Xvar + SYvar1,
yend = Yvar), col = "red4", size = 1) +
geom_segment(aes(x = Xvar, y = Yvar, xend = Xvar,
yend = Yvar + SYvar2), col = "green4", size = 1) +
geom_segment(aes(x = Xvar, y = Yvar, xend = Xvar-2.5,
yend = Yvar + SYvar3), col = "darkblue", size = 1) +
geom_segment(aes(x = Xvar, y = Yvar, xend =
Xvar - SYvar4, yend = Yvar ), col = "red", size = 1) +
theme_bw()
I've been trying to superimpose a normal curve over my histogram with ggplot2.
My code:
data <- read.csv (path...)
ggplot(data, aes(V2)) +
geom_histogram(alpha=0.3, fill='white', colour='black', binwidth=.04)
I tried several things:
+ stat_function(fun=dnorm)
....didn't change anything
+ stat_density(geom = "line", colour = "red")
...gave me a straight red line on the x-axis.
+ geom_density()
doesn't work for me because I want to keep my frequency values on the y-axis, and want no density values.
Any suggestions?
Solution found!
+geom_density(aes(y=0.045*..count..), colour="black", adjust=4)
Think I got it:
library(ggplot2)
set.seed(1)
df <- data.frame(PF = 10*rnorm(1000))
ggplot(df, aes(x = PF)) +
geom_histogram(aes(y =..density..),
breaks = seq(-50, 50, by = 10),
colour = "black",
fill = "white") +
stat_function(fun = dnorm, args = list(mean = mean(df$PF), sd = sd(df$PF)))
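For readers on a recent ggplot2, the same plot can be written with after_stat() instead of the ..density.. notation. This is a sketch assuming ggplot2 3.3.0 or later, where after_stat() is available; the name p_after_stat is illustrative.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(PF = 10 * rnorm(1000))

p_after_stat <- ggplot(df, aes(x = PF)) +
  # after_stat(density) puts the histogram on the density scale
  geom_histogram(aes(y = after_stat(density)),
                 breaks = seq(-50, 50, by = 10),
                 colour = "black",
                 fill = "white") +
  # Normal curve with the sample mean and standard deviation
  stat_function(fun = dnorm,
                args = list(mean = mean(df$PF), sd = sd(df$PF)))

# The bars of a density histogram cover a total area of 1
hist_data <- layer_data(p_after_stat, 1)
sum(hist_data$density * (hist_data$xmax - hist_data$xmin))
```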
This has been answered here and partially here.
The area under a density curve equals 1, and the area under the histogram equals the width of the bars times the sum of their heights, i.e. the binwidth times the total number of non-missing observations. To fit both on the same graph, one or the other needs to be rescaled so that their areas match.
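The area argument can be checked numerically in base R. The sketch below uses simulated values and illustrative variable names; it bins the data with hist() and integrates the rescaled normal curve over ten standard deviations either side of the mean.

```r
set.seed(1)
value <- c(rnorm(200, 20, 5), rnorm(150, 25, 10))
bw    <- 2
n_obs <- sum(!is.na(value))

# Bin the data with breaks that are multiples of the binwidth
breaks <- seq(floor(min(value) / bw) * bw, ceiling(max(value) / bw) * bw, by = bw)
h <- hist(value, breaks = breaks, plot = FALSE)

# Area of a count histogram: sum of bar heights times bar width
hist_area <- sum(h$counts) * bw

# Area under the normal density after multiplying by binwidth * n
m <- mean(value)
s <- sd(value)
curve_area <- integrate(function(x) dnorm(x, m, s) * bw * n_obs,
                        lower = m - 10 * s, upper = m + 10 * s)$value

c(histogram = hist_area, scaled_curve = curve_area, bw_times_n = bw * n_obs)
```

All three numbers agree (the integral only up to numerical error), which is why multiplying the density by bw * n_obs puts the curve on the count scale.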
If you want the y-axis to have frequency counts, there are a number of options:
First simulate some data.
library(ggplot2)
set.seed(1)
dat_hist <- data.frame(
group = c(rep("A", 200), rep("B",150)),
value = c(rnorm(200, 20, 5), rnorm(150,25,10)))
# Set desired binwidth and number of non-missing obs
bw = 2
n_obs = sum(!is.na(dat_hist$value))
Option 1: Plot both histogram and density curve as density and then rescale the y axis
This is perhaps the easiest approach for a single histogram.
Using the approach suggested by Carlos, plot both histogram and density curve as density
g <- ggplot(dat_hist, aes(value)) +
geom_histogram(aes(y = ..density..), binwidth = bw, colour = "black") +
stat_function(fun = dnorm, args = list(mean = mean(dat_hist$value), sd = sd(dat_hist$value)))
And then rescale the y axis.
ybreaks = seq(0,50,5)
## On primary axis
g + scale_y_continuous("Counts", breaks = round(ybreaks / (bw * n_obs),3), labels = ybreaks)
## Or on secondary axis
g + scale_y_continuous("Density", sec.axis = sec_axis(
trans = ~ . * bw * n_obs, name = "Counts", breaks = ybreaks))
Option 2: Rescale the density curve using stat_function
With code tidied as per PatrickT's answer.
ggplot(dat_hist, aes(value)) +
geom_histogram(colour = "black", binwidth = bw) +
stat_function(fun = function(x)
dnorm(x, mean = mean(dat_hist$value), sd = sd(dat_hist$value)) * bw * n_obs)
Option 3: Create an external dataset and plot using geom_line.
Unlike the above options, this one works with facets. (EDITED to provide dplyr rather than plyr based solution). Note, the summarised dataset is being used as the primary, and the raw passed in for the histogram only.
library(tidyverse)
dat_hist %>%
group_by(group) %>%
nest(data = c(value)) %>%
mutate(y = map(data, ~ dnorm(
.$value, mean = mean(.$value), sd = sd(.$value)
) * bw * sum(!is.na(.$value)))) %>%
unnest(c(data,y)) %>%
ggplot(aes(x = value)) +
geom_histogram(data = dat_hist, binwidth = bw, colour = "black") +
geom_line(aes(y = y)) +
facet_wrap(~ group)
Option 4: Create external functions to edit the data on the fly
A bit over the top perhaps, but might be useful for someone?
## Function to create scaled dnorm data along full x axis range
dnorm_scaled <- function(data, x = NULL, binwidth = 1, xlim = NULL) {
.x <- na.omit(data[,x])
if(is.null(xlim))
xlim = c(min(.x), max(.x))
x_range = seq(xlim[1], xlim[2], length.out = 101)
setNames(
data.frame(
x = x_range,
y = dnorm(x_range, mean = mean(.x), sd = sd(.x)) * length(.x) * binwidth),
c(x, "y"))
}
## Function to apply over groups
dnorm_scaled_group <- function(data, x = NULL, group = NULL, binwidth = NULL, xlim = NULL) {
dat_hists <- lapply(
split(data, data[, group]), dnorm_scaled,
x = x, binwidth = binwidth, xlim = xlim)
for(g in names(dat_hists))
dat_hists[[g]][, "group"] <- g
setNames(do.call(rbind, dat_hists), c(x, "y", group))
}
## Single histogram
ggplot(dat_hist, aes(value)) +
geom_histogram(binwidth = bw, colour = "black") +
geom_line(data = ~ dnorm_scaled(., "value", binwidth = bw),
aes(y = y))
## With a single faceting variable
ggplot(dat_hist, aes(value)) +
geom_histogram(binwidth = 2, colour = "black") +
geom_line(data = ~ dnorm_scaled_group(
., x = "value", group = "group", binwidth = 2, xlim = c(0,50)),
aes(y = y)) +
facet_wrap(~ group)
This is an extended comment on JWilliman's answer. I found J's answer very useful. While playing around I discovered a way to simplify the code. I'm not saying it is a better way, but I thought I would mention it.
Note that JWilliman's answer provides the count on the y-axis and a "hack" to scale the corresponding density normal approximation (which otherwise would cover a total area of 1 and have therefore a much lower peak).
Main point of this comment: simpler syntax inside stat_function, by passing the needed parameters to the aesthetics function, e.g.
aes(x = x, mean = 0, sd = 1, binwidth = 0.3, n = 1000)
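This avoids having to pass args = to stat_function and is therefore more user-friendly. Okay, it's not very different, but hopefully someone will find it interesting. (One caution: the anonymous function given to stat_function looks up mean, sd, n and binwidth in the environment where it was defined, so the code below relies on those variables existing there rather than on the extra aes() entries.)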
# parameters that will be passed to ``stat_function``
n = 1000
mean = 0
sd = 1
binwidth = 0.3 # passed to geom_histogram and stat_function
set.seed(1)
df <- data.frame(x = rnorm(n, mean, sd))
ggplot(df, aes(x = x, mean = mean, sd = sd, binwidth = binwidth, n = n)) +
theme_bw() +
geom_histogram(binwidth = binwidth,
colour = "white", fill = "cornflowerblue", size = 0.1) +
stat_function(fun = function(x) dnorm(x, mean = mean, sd = sd) * n * binwidth,
color = "darkred", size = 1)
This code should do it:
set.seed(1)
z <- rnorm(1000)
qplot(z, geom = "blank") +
geom_histogram(aes(y = ..density..)) +
stat_density(geom = "line", aes(colour = "bla")) +
stat_function(fun = dnorm, aes(x = z, colour = "blabla")) +
scale_colour_manual(name = "", values = c("red", "green"),
breaks = c("bla", "blabla"),
labels = c("kernel_est", "norm_curv")) +
theme(legend.position = "bottom", legend.direction = "horizontal")
Note: I used qplot but you can use the more versatile ggplot (qplot() is deprecated as of ggplot2 3.4.0).
Here's a tidyverse informed version:
Setup
library(tidyverse)
Some data
d <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/speed_gender_height.csv")
Preparing data
We'll use a "total" histogram for the whole sample. To that end, we need to remove the grouping information from the data.
d2 <-
d |>
select(-gender)
Here's a data set with summary data:
d_summary <-
d %>%
group_by(gender) %>%
summarise(height_m = mean(height, na.rm = TRUE),
height_sd = sd(height, na.rm = TRUE))
d_summary
Plot it
d %>%
ggplot() +
aes() +
geom_histogram(aes(y = ..density.., x = height, fill = gender)) +
facet_wrap(~ gender) +
geom_histogram(data = d2, aes(y = ..density.., x = height),
alpha = .5) +
stat_function(data = d_summary %>% filter(gender == "female"),
fun = dnorm,
#color = "red",
args = list(mean = filter(d_summary,
gender == "female")$height_m,
sd = filter(d_summary,
gender == "female")$height_sd)) +
stat_function(data = d_summary %>% filter(gender == "male"),
fun = dnorm,
#color = "red",
args = list(mean = filter(d_summary,
gender == "male")$height_m,
sd = filter(d_summary,
gender == "male")$height_sd)) +
theme(legend.position = "none",
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
labs(title = "Facetted histograms with overlaid normal curves",
caption = "The grey histograms show the whole distribution over both groups, i.e. females and males") +
scale_fill_brewer(type = "qual", palette = "Set1")