Cannot overlay multiple stat_function with ggplot2 - r

I have a table with a binning variable VAR2_BY_NS_BIN and an x-y data pair (MP_BIN, CORRECT_PROP). I want to plot the data points for each bin, and also draw a different line for each bin using stat_function, taking a different reference value each time inside a for loop.
library(data.table)
library(ggplot2)
test_tab <- data.table(VAR2_BY_NS_BIN=c(0.0005478, 0.0005478, 0.002266, 0.002266, 0.006783, 0.006783, 0.020709, 0.020709, 0.142961, 0.142961),
MP_BIN=rep(c(0.505, 0.995), 5),
CORRECT_PROP=c(0.5082, 0.7496, 0.5024, 0.8627, 0.4878, 0.9368, 0.4979, 0.9826, 0.4811, 0.9989))
VAR2_BIN <- sort(unique(test_tab$VAR2_BY_NS_BIN)) #get unique bin values
LEN_VAR2_BIN <- length(VAR2_BIN) #get number of bins
col_base <- c("#FF0000", "#BB0033", "#880088", "#3300BB", "#0000FF") #mark bins with different colours
p <- ggplot(data = test_tab)
for (i in 1:LEN_VAR2_BIN) {
p <- p + geom_point(data = test_tab[test_tab$VAR2_BY_NS_BIN==VAR2_BIN[i],],
aes(x = MP_BIN, y = CORRECT_PROP),
col = col_base[i],
alpha = 0.5) +
stat_function(fun = function(t) {VAR2_BIN[i]*(t-0.5)+0.5}, col = col_base[i])
}
p <- p + xlab("MP") + ylab("Observed proportion")
print(p)
The above code (a reproducible example), however, always returns a plot with only the last stat_function line drawn (which is the 5th line in the above case).
The following code (without using the for loop) works, but I in fact have a large number of bins so it is not very feasible...
p <- p + stat_function(fun = function(t) {VAR2_BIN[1]*(t-0.5)+0.5}, col = col_base[1])
p <- p + stat_function(fun = function(t) {VAR2_BIN[2]*(t-0.5)+0.5}, col = col_base[2])
p <- p + stat_function(fun = function(t) {VAR2_BIN[3]*(t-0.5)+0.5}, col = col_base[3])
p <- p + stat_function(fun = function(t) {VAR2_BIN[4]*(t-0.5)+0.5}, col = col_base[4])
p <- p + stat_function(fun = function(t) {VAR2_BIN[5]*(t-0.5)+0.5}, col = col_base[5])
Thanks in advance!

You don't need a for loop or stat_function. To plot the points, just map MP_BIN and CORRECT_PROP to x and y and the points can be plotted with a single call to geom_point. For the lines, you can create the necessary values on the fly (as done in the code below) and plot those with geom_line.
library(tidyverse)
ggplot(test_tab %>% mutate(model=VAR2_BY_NS_BIN*(MP_BIN - 0.5) + 0.5),
aes(x=MP_BIN, colour=factor(VAR2_BY_NS_BIN))) +
geom_point(aes(y=CORRECT_PROP)) +
geom_line(aes(y=model)) +
labs(colour="VAR2_BY_NS_BIN") +
guides(colour=guide_legend(reverse=TRUE))
As for the problem you were having with the for loop: the function you pass to stat_function is not called until the plot is printed, and only then does it look up the loop variable (i). By that time the loop has finished and i is 5, so all five layers draw the same fifth line and that's the only line you see. You can find several questions related to this issue on Stack Overflow.
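If you do want to keep stat_function, one way around the late lookup of i is to build each line function in its own environment. The following is a minimal sketch, not the only fix: make_line is a hypothetical helper name, the bin values and colours are copied from the question, and it relies on ggplot2 accepting a list of layers on the right-hand side of +.

```r
library(ggplot2)

# Bin values and colours copied from the question
VAR2_BIN <- c(0.0005478, 0.002266, 0.006783, 0.020709, 0.142961)
col_base <- c("#FF0000", "#BB0033", "#880088", "#3300BB", "#0000FF")

# Hypothetical helper: each call gets its own environment, so `slope` is
# fixed when the function is created instead of being looked up at print time.
make_line <- function(slope) {
  force(slope)
  function(t) slope * (t - 0.5) + 0.5
}
line_funs <- lapply(VAR2_BIN, make_line)

# One stat_function layer per bin, each with its own function and colour
layers <- Map(function(f, col) stat_function(fun = f, colour = col),
              line_funs, col_base)

# A list of layers can be added to a plot in one go
p <- ggplot(data.frame(x = c(0.5, 1)), aes(x)) +
  layers +
  xlab("MP") + ylab("Observed proportion")
print(p)
```

The same idea works inside a for loop if the body is wrapped in local(), but lapply tends to be easier to read.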

Related

Why does R behave differently when parsing parameters of plotting?

I am attempting to plot multiple time series variables on a single line chart using ggplot. I am using a data.frame which contains n time series variables, and a column of time periods. Essentially, I want to loop through the data.frame, and add exactly n geom_line layers to a single chart.
Initially I tried using the following code, where:
df = data.frame containing n time series variables, and 1 column of time periods
wid = n (number of time series variables)
p <- ggplot() +
scale_color_manual(values=c(colours[1:wid]))
for (i in 1:wid) {
p <- p + geom_line(aes(x=df$Time, y=df[,i], color=var.lab[i]))
}
ggplotly(p)
However, this only produces a plot of the final time series variable in the data.frame. I then investigated further, and found that the following two sets of code produce completely different results:
p <- ggplot() +
scale_color_manual(values=c(colours[1:wid]))
i = 1
p = p + geom_line(aes(x=df$Time, y=df[,i], color=var.lab[i]))
i = 2
p = p + geom_line(aes(x=df$Time, y=df[,i], color=var.lab[i]))
i = 3
p = p + geom_line(aes(x=df$Time, y=df[,i], color=var.lab[i]))
ggplotly(p)
Plot produced by code above
p <- ggplot() +
scale_color_manual(values=c(colours[1:wid]))
p = p + geom_line(aes(x=df$Time, y=df[,1], color=var.lab[1]))
p = p + geom_line(aes(x=df$Time, y=df[,2], color=var.lab[2]))
p = p + geom_line(aes(x=df$Time, y=df[,3], color=var.lab[3]))
ggplotly(p)
Plot produced by code above
In my mind, these two sets of code are identical, so could anyone explain why they produce such different results?
I know this could probably be done quite easily using autoplot, but I am more interested in the behavior of these two snippets of code.
What you're trying to do, adding one layer per column in a loop, is a 'hack' way of plotting multiple lines, and it's not ideal in ggplot terms. To do it successfully, I'd use aes_string. But it's a hack.
df <- data.frame(Time = 1:20,
Var1 = rnorm(20),
Var2 = rnorm(20, mean = 0.5),
Var3 = rnorm(20, mean = 0.8))
vars <- paste0("Var", 1:3)
col_vec <- RColorBrewer::brewer.pal(3, "Accent")
library(ggplot2)
p <- ggplot(df, aes(Time))
for (i in 1:length(vars)) {
p <- p + geom_line(aes_string(y = vars[i]), color = col_vec[i], lwd = 1)
}
p + labs(y = "value")
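In recent ggplot2 releases aes_string() is deprecated, so here is a hedged sketch of the same hack written with aes() and the .data pronoun instead. It assumes ggplot2 3.0.0 or later; the colours are arbitrary stand-ins for the Brewer palette, and the layers are built with lapply so that each mapping sees its own column name.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(Time = 1:20,
                 Var1 = rnorm(20),
                 Var2 = rnorm(20, mean = 0.5),
                 Var3 = rnorm(20, mean = 0.8))
vars <- paste0("Var", 1:3)
col_vec <- c("darkgreen", "purple", "orange")  # arbitrary colours (assumption)

# Each call to the anonymous function has its own `v`, so the mapping
# created by aes() refers to the right column when the plot is built.
line_layers <- lapply(seq_along(vars), function(i) {
  v <- vars[i]
  geom_line(aes(y = .data[[v]]), colour = col_vec[i])
})

p <- ggplot(df, aes(Time)) + line_layers + labs(y = "value")
print(p)
```

Writing aes(y = .data[[vars[i]]]) directly inside a for loop would bring back the original problem, because vars[i] would only be evaluated when the plot is drawn.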
How to do it properly
To make this plot properly, you need to pivot the data first, so that each aesthetic (aes) is mapped to a variable in your data frame. That means we need a single variable in our data frame to map to color. Hence, we use pivot_longer and plot again:
library(tidyr)
df_melt <- pivot_longer(df, cols = Var1:Var3, names_to = "var")
ggplot(df_melt, aes(Time, value, color = var)) +
geom_line(lwd = 1) +
scale_color_manual(values = col_vec)
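As for why the two snippets in the question behave differently: aes() stores the expression it is given rather than its value, and the expression is only evaluated when the plot is drawn. The small sketch below illustrates this with a toy data frame (the names df and mapping are only for illustration); it uses rlang::eval_tidy, which is available wherever ggplot2 is installed.

```r
library(ggplot2)

df <- data.frame(Time = 1:3, A = c(1, 2, 3), B = c(10, 20, 30))

i <- 2
mapping <- aes(y = df[, i])   # stores the expression df[, i], not column 2

i <- 3
# Evaluating the stored expression now uses the current value of i
y_now <- rlang::eval_tidy(mapping$y)
print(y_now)                  # 10 20 30, i.e. column 3, not column 2
```

In the first snippet every layer holds the same expression df[, i] and all of them see i = 3 at draw time; in the second snippet the indices are literal constants, so each layer keeps its own column.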

How to plot two distribution curves in a faceted way in R / ggplot2?

I have two probability distribution curves, a Gamma and a standardized Normal, that I need to compare:
library(ggplot2)
pgammaX <- function(x) pgamma(x, shape = 64.57849, scale = 0.08854802)
f <- ggplot(data.frame(x=c(-4, 9)), aes(x)) + stat_function(fun=pgammaX)
f + stat_function(fun = pnorm)
The output is like this
However I need to have the two curves separated by means of the faceting mechanism provided by ggplot2, sharing the Y axis, in a way like shown below:
I know how to do the faceting if the depicted graphics come from data (i.e., from a data.frame), but I don't understand how to do it in a case like this, when the graphics are generated on the fly by functions. Do you have any idea on this?
You can generate the data ahead of time, similar to what stat_function does internally, something like:
x <- seq(-4, 9, 0.1)
# Stack both curves in one data frame; g labels the curve and drives the facets
dat <- data.frame(p = c(pnorm(x), pgammaX(x)), g = rep(c("Normal", "Gamma"), each = length(x)), x = rep(x, 2))
ggplot(dat) + geom_line(aes(x, p, group = g)) + facet_grid(~g)
The issue with using facet_wrap here is that a single stat_function is designed to be applied to every panel of the faceting variable, and you don't have a faceting variable.
I would instead plot them separately and use grid.arrange to combine them.
f1 <- ggplot(data.frame(x=c(-4, 9)), aes(x)) + stat_function(fun = pgammaX) + ggtitle("Gamma") + theme(plot.title = element_text(hjust = 0.5))
f2 <- ggplot(data.frame(x=c(-4, 9)), aes(x)) + stat_function(fun = pnorm) + ggtitle("Norm") + theme(plot.title = element_text(hjust = 0.5))
library(gridExtra)
grid.arrange(f1, f2, ncol=2)
Otherwise create the data frame with y values from both pgammaX and pnorm and categorize them under a faceting variable.
Finally I got the answer. First, I need to have two data sets and attach each function to each data set, as follows:
library(ggplot2)
pgammaX <- function(x) pgamma(x, shape = 64.57849, scale = 0.08854802)
a <- data.frame(x=c(3,9), category="Gamma")
b <- data.frame(x=c(-4,4), category="Normal")
f <- ggplot(a, aes(x)) + stat_function(fun=pgammaX) + stat_function(data = b, mapping = aes(x), fun = pnorm)
Then, using facet_wrap(), I separate them into two panels according to the category assigned to each data set, with a free x scale (scales = "free_x").
f + facet_wrap("category", scales = "free_x")
The result is shown below:

Conditional colouring of a geom_smooth

I'm analyzing a series that varies around zero. And to see where there are parts of the series with a tendency to be mostly positive or mostly negative I'm plotting a geom_smooth. I was wondering if it is possible to have the color of the smooth line be dependent on whether or not it is above or below 0. Below is some code that produces a graph much like what I am trying to create.
set.seed(5)
r <- runif(22, max = 5, min = -5)
t <- rep(-5:5,2)
df <- data.frame(r+t,1:22)
colnames(df) <- c("x1","x2")
ggplot(df, aes(x = x2, y = x1)) + geom_hline(yintercept = 0) + geom_line() + geom_smooth()
I considered calculating the smoothed values, adding them to the df and then using a scale_color_gradient, but I was wondering if there is a way to achieve this in ggplot directly.
You may use the n argument in geom_smooth to increase "number of points to evaluate smoother at" in order to create some more y values close to zero. Then use ggplot_build to grab the smoothed values from the ggplot object. These values are used in a geom_line, which is added on top of the original plot. Last we overplot the y = 0 values with the geom_hline.
# basic plot with a larger number of smoothed values
p <- ggplot(df, aes(x = x2, y = x1)) +
geom_line() +
geom_smooth(linetype = "blank", n = 10000)
# grab smoothed values
df2 <- ggplot_build(p)$data[[2]][ , c("x", "y")]  # layer 2 is the geom_smooth layer
# add smoothed values with conditional color
p +
geom_line(data = df2, aes(x = x, y = y, color = y > 0)) +
geom_hline(yintercept = 0)
Something like this:
# loess data
res <- loess.smooth(df$x2, df$x1)
res <- data.frame(do.call(cbind, res))
res$posY <- ifelse(res$y >= 0, res$y, NA)
res$negY <- ifelse(res$y < 0, res$y, NA)
# plot
ggplot(df, aes(x = x2, y = x1)) +
geom_hline(yintercept = 0) +
geom_line() +
geom_line(data=res, aes(x = x, y = posY, col = "green")) +
geom_line(data=res, aes(x = x, y = negY, col = "red")) +
scale_color_identity()
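A variation on the same idea, sketched under the assumption that a plain loess() fit with default settings is close enough to what geom_smooth would draw: evaluate the fit on a fine grid, label each grid point by sign, and draw one continuous line (group = 1) whose colour follows that label. The name smooth_df is only for illustration.

```r
library(ggplot2)

# Data from the question
set.seed(5)
r <- runif(22, max = 5, min = -5)
t <- rep(-5:5, 2)
df <- data.frame(x1 = r + t, x2 = 1:22)

# Fit the smoother directly and evaluate it on a fine grid
fit <- loess(x1 ~ x2, data = df)
smooth_df <- data.frame(x2 = seq(min(df$x2), max(df$x2), length.out = 500))
smooth_df$y <- predict(fit, newdata = smooth_df)
smooth_df$sign <- ifelse(smooth_df$y > 0, "positive", "negative")

# group = 1 keeps the smooth as a single path; colour changes per segment
p <- ggplot(df, aes(x = x2, y = x1)) +
  geom_hline(yintercept = 0) +
  geom_line() +
  geom_line(data = smooth_df, aes(y = y, colour = sign, group = 1)) +
  scale_colour_manual(values = c(positive = "green", negative = "red"))
print(p)
```

Each segment takes the colour of its starting point, so with a fine grid the colour change lands very close to the zero crossing.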

Saving ggplot to a list then applying to grid.arrange geom_line from last plot populates all previous plots

I am very new to R and ggplot2. I am trying to create a grid of plots of correlations as well as their trailing max and min values using a for loop. The plots are then saved as PDFs to a directory. When they are saved, the blue lines (min and max) are plotted correctly. However, when I then use do.call(grid.arrange, t), or any other call to the plots in the list, I do not get the correct blue lines: the last plot's blue lines appear on all of the plots.
I don't understand how this can plot and save the PDF correctly but not store the ggplot object correctly in the list t, or whether there is some confusion in the rendering with do.call(grid.arrange, t). How can the original (black) line plot correctly while the geom_line additions do not? I am really confused.
If someone could kindly help me check this code and find out how to plot all lines correctly and then place them in a grid, that would be great.
Reproducible code below, using random data:
require(TTR)
require(ggplot2)
library(gridExtra)
set.seed(12345)
filelocation = "c:/"
values <- as.data.frame(matrix( rnorm(5*500,mean=0,sd=3), 500, 5))
t <- list()
rollLength = 25
for( i in 1:(ncol(values)))
{
p <- ggplot(data=values, aes(x = index(values)) )
p <- p + geom_line(data=values, aes_string(y = colnames(values)[i]))
p <- p + geom_line(data = values, aes(x = index(values), y = runMax(values[,i], n = rollLength) ), colour = "blue", linetype = "longdash" )
p <- p + geom_line(data = values, aes(x = index(values), y = runMin(values[,i], n = rollLength) ), colour = "blue", linetype = "longdash" )
p <- p + ggtitle(colnames(values)[i]) + xlab("Date") + ylab("Pearson Correlation")
print(p)
ggsave( file = paste(colnames(values)[i],".pdf",sep = "") , path = filelocation)
assign(paste("p", i, sep = ""), p)
t[[i]] <- p
}
do.call(grid.arrange,t)
Hmm, this isn't exactly what you want I think, but close, and less code
require(TTR)
require(ggplot2)
set.seed(12345)
values <- as.data.frame(matrix( rnorm(5*500,mean=0,sd=3), 500, 5))
rollLength = 25
library(reshape2)
dfmelt <- melt(values)
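# compute the rolling max/min and the row index within each variable,
# so the windows do not run across the boundary between two series
dfmelt$max <- ave(dfmelt$value, dfmelt$variable, FUN = function(v) runMax(v, n = rollLength))
dfmelt$min <- ave(dfmelt$value, dfmelt$variable, FUN = function(v) runMin(v, n = rollLength))
dfmelt$row <- ave(seq_along(dfmelt$value), dfmelt$variable, FUN = seq_along)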
ggplot(dfmelt, aes(x = row, y = value)) +
geom_line() +
geom_line(aes(x = row, y = max), data=dfmelt, colour = "blue",
linetype = "longdash") +
geom_line(aes(x = row, y = min), data=dfmelt, colour = "blue",
linetype = "longdash") +
facet_wrap(~ variable, scales="free")
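If you would rather keep one ggplot object per column in a list, the blue lines go wrong because aes(y = runMax(values[,i], ...)) is only evaluated when a plot is drawn, by which time i has its final value; the black line is fine because aes_string() is given the column name immediately. A hedged sketch of one fix is to compute the rolling values into a per-plot data frame first. To keep the example self-contained it uses roll_stat, a hypothetical base R helper, in place of runMax and runMin.

```r
library(ggplot2)
library(gridExtra)

set.seed(12345)
values <- as.data.frame(matrix(rnorm(5 * 500, mean = 0, sd = 3), 500, 5))
roll_length <- 25

# Hypothetical helper: trailing-window statistic, NA until the window is full
roll_stat <- function(x, n, f) {
  out <- rep(NA_real_, length(x))
  if (length(x) < n) return(out)
  for (k in n:length(x)) out[k] <- f(x[(k - n + 1):k])
  out
}

# Each plot carries its own data frame, so nothing depends on a loop variable
plots <- lapply(names(values), function(nm) {
  d <- data.frame(idx   = seq_len(nrow(values)),
                  value = values[[nm]],
                  hi    = roll_stat(values[[nm]], roll_length, max),
                  lo    = roll_stat(values[[nm]], roll_length, min))
  ggplot(d, aes(x = idx)) +
    geom_line(aes(y = value)) +
    geom_line(aes(y = hi), colour = "blue", linetype = "longdash", na.rm = TRUE) +
    geom_line(aes(y = lo), colour = "blue", linetype = "longdash", na.rm = TRUE) +
    ggtitle(nm) + xlab("Date") + ylab("Pearson Correlation")
})

grid.arrange(grobs = plots, ncol = 2)
```

Saving each plot with ggsave can be added inside the lapply body if the PDFs are still needed.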

ggplot2 lm lines with categorical variables

I have some data from an R course. The professor was adding each line more or less manually using base graphics. I'd like to do it using ggplot2.
So far I've created a facet'd plot in ggplot with scatter plots of hunger in different regions and also separately fitted a model to the data. The specific model has interaction terms between the x variable in the plot and the group/colour variable.
What I want to do now is plot the lines resulting for that model one per panel. I could do this by using geom_abline and defining the slope and the intercept as the sum of 2 of the coefficients (as the categorical variables for group have 0/1 values and in each panel only some values are multiplied by 1) - but this seems not easy.
I tried the same equation I used in lm in stat_smooth with no luck, I get an error.
Ideally, I'd think one can put the equation somehow into the stat_smooth and have ggplot do all the work. How would one go about it?
download.file("https://sparkpublic.s3.amazonaws.com/dataanalysis/hunger.csv",
"hunger.csv", method = "curl")
hunger <- read.csv("hunger.csv")
hunger <- hunger[hunger$Sex!="Both sexes",]
hunger_small <- hunger[hunger$WHO.region!="WHO Non Members",c(5,6,8)]
q<- qplot(x = Year, y = Numeric, data = hunger_small,
color = WHO.region) + theme(legend.position = "bottom")
q <- q + facet_grid(.~WHO.region)+guides(col=guide_legend(nrow=2))
q
# I could add the standard lm line from stat_smooth, but I don't want that
# q <- q + geom_smooth(method="lm",se=F)
#I want to add the line(s) from the lm fit below, it is really one line per panel
lmRegion <- lm(hunger$Numeric ~ hunger$Year + hunger$WHO.region +
hunger$Year *hunger$WHO.region)
# I also used a loop to do it, as below, but all in one panel;
# I am not able to do that with facets.
# I used a function I found to get the colors
ggplotColours <- function(n=6, h=c(0, 360) +15) {
if ((diff(h)%%360) < 1) h[2] <- h[2] - 360/n
hcl(h = (seq(h[1], h[2], length = n)), c = 100, l = 65)
}
n <- length(levels(hunger_small$WHO.region))
q <- qplot(x = Year, y = Numeric, data = hunger_small,
color = WHO.region) + theme(legend.position = "bottom")
q <- q + geom_abline(intercept = lmRegion$coefficients[1],
slope = lmRegion$coefficients[2], color = ggplotColours(n=n)[1])
for (i in 2:n) {
q <- q + geom_abline(intercept = lmRegion$coefficients[1] +
lmRegion$coefficients[1+i], slope = lmRegion$coefficients[2] +
lmRegion$coefficients[7+i], color = ggplotColours(n=n)[i])
}
If you have a single categorical variable, geom_point() will not work well, but geom_boxplot() will:
ggplot(hunger, aes(x = Sex, y = Numeric)) + geom_boxplot() + labs(x = "Sex") + geom_smooth(method = "lm", se = FALSE, col = "blue")
Susy
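For the question as asked (one fitted line per facet from a model with a Year by region interaction), a common approach is to add the model's predictions to the data and draw them with geom_line. The sketch below uses a small synthetic data frame, toy, as a stand-in for hunger_small: the column names follow the question, but the values and region labels are made up.

```r
library(ggplot2)

set.seed(42)
# Synthetic stand-in for hunger_small (values are invented for illustration)
toy <- data.frame(Year = rep(1990:2010, times = 3),
                  WHO.region = rep(c("Africa", "Americas", "Europe"), each = 21))
base_level <- c(Africa = 25, Americas = 8, Europe = 3)
trend <- c(Africa = -0.4, Americas = -0.2, Europe = -0.05)
toy$Numeric <- unname(base_level[toy$WHO.region] +
                        trend[toy$WHO.region] * (toy$Year - 1990)) +
  rnorm(nrow(toy))

# Fit with data = so that predict() lines up with the rows of the data frame
fit <- lm(Numeric ~ Year * WHO.region, data = toy)
toy$fit <- as.numeric(predict(fit))

# Points plus the model's own line in every panel
q <- ggplot(toy, aes(x = Year, y = Numeric, colour = WHO.region)) +
  geom_point() +
  geom_line(aes(y = fit)) +
  facet_grid(. ~ WHO.region) +
  theme(legend.position = "bottom")
print(q)
```

With a full interaction between Year and a single factor, the fitted lines coincide with separate per-region regressions, so geom_smooth(method = "lm", se = FALSE) would draw the same lines here; the predict() route becomes necessary once the model contains terms that the plot does not.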
