R ggplot on-the-fly calculation by grouping variable

I have often wondered if you can get ggplot to do on-the-fly calculations by the facet groups of the plot in a similar way that they would be done using dplyr::group_by. So in the example below is it possible to calculate the cumsum for each different category, rather than the overall cumsum without altering df first?
library(ggplot2)
df <- data.frame(X = rep(1:20,2), Y = runif(40), category = rep(c("A","B"), each = 20))
ggplot(df, aes(x = X, y = cumsum(Y), colour = category))+geom_line()
I can obviously do an easy workaround using dplyr. However, as I do this frequently, I was keen to know if there is a way to avoid specifying the grouping variables multiple times (here in group_by and in aes(colour = …)).
Working alternative, but not what I'm asking for in this case
library(dplyr)
library(ggplot2)
df %>% group_by(category) %>% mutate(Ysum = cumsum(Y)) %>%
ggplot(aes(x = X, y = Ysum, colour = category))+geom_line()
Edit: (In response to the #42- comment) I am mainly asking out of curiosity whether this is possible, not because the alternative doesn't work. I also think my code would be neater if I am making a number of plots that sum (or apply similar calculations to) different variables based on different columns or in different datasets, rather than continuously having to group, mutate, then plot. I could write a function to do it for me, but I thought it might be inbuilt functionality that I am missing (the ggplot help doesn't go into the real details).

I have added stat_apply_group() and stat_apply_panel() to the development version of my package 'ggpmisc'. It will take some time before this update makes it to CRAN as the previous update has just been accepted.
For the time being 'ggpmisc' should be installed from Bitbucket for the new stats to be available.
devtools::install_bitbucket("aphalo/ggpmisc", ref = "no-debug")
Then this solves the question:
library(ggplot2)
library(ggpmisc)
set.seed(123456)
df <- data.frame(X = rep(1:20,2),
Y = runif(40),
category = rep(c("A","B"), each = 20))
ggplot(df, aes(x = X, y = Y, colour = category)) +
stat_apply_group(.fun.y = cumsum)
Applying cumsum() within the ggplot code instead of using a 'dplyr' "pipe" as in the second example saves us from having to specify the grouping twice.
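If installing the development version is not an option, one possible workaround (a sketch, not the package author's method) is base R's ave(), which applies a function within groups and returns the result in the original row order, so it can be called directly inside aes(). The grouping column is still named twice, so this only removes the separate group_by/mutate step. The plot is guarded so the snippet also runs where 'ggplot2' is not installed.

```r
# Sketch: per-group cumulative sum computed inline with base R's ave().
set.seed(123456)
df <- data.frame(X = rep(1:20, 2),
                 Y = runif(40),
                 category = rep(c("A", "B"), each = 20))

# ave() returns one value per row, computed separately within each category.
grouped_cumsum <- ave(df$Y, df$category, FUN = cumsum)

# Plot only if ggplot2 is available; aes() evaluates ave() on df's columns.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = X,
                      y = ave(Y, category, FUN = cumsum),
                      colour = category)) +
    geom_line()
  print(p)
}
```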

Related

e_facet using grouped data in echarts4r question

I really like the possibilities this package offers and would like to use it in a Shiny app. However, I am struggling to recreate a ggplot plot in echarts4r.
library(tidyverse)
library(echarts4r)
data = tibble(time = factor(sort(rep(c(4,8,24), 30)), levels = c(4,8,24)),
dose = factor(rep(c(1,2,3), 30), levels = c(1,2,3)),
id = rep(sort(rep(LETTERS[1:10], 3)),3),
y = rnorm(n = 90, mean = 5, sd = 3))
This is the plot I am aiming to recreate:
ggplot(data = data, mapping = aes(x = time, y = y, group = id)) +
geom_point() +
geom_line() +
facet_wrap(~dose)
The problem I am having is reproducing ggplot's group = id in echarts4r. I am aiming to use e_facet on data grouped with group_by(), but I cannot (or don't know how to) add a second grouping that connects the dots the way geom_line() does.
data %>%
group_by(dose) %>%
e_charts(time) %>%
e_line(y) %>%
e_facet(rows = 1, cols = 3)
You can do this with echarts4r.
There are two methods that I know of that work, one uses e_list. I think that method would make this more complicated than it needs to be, though.
It might be useful to know that e_facet, e_arrange, and e_grid all fall under echarts grid functionality, much as everything ggplot2 draws is built on R's grid package.
I used group_split from dplyr and imap from purrr to create the faceted graph. You'll notice that I didn't use e_facet due to its constraints.
group_split is interchangeable with base R's split and either could have been used.
I used imap so I could map over the groups and have the benefit of using an index. If you're familiar with the use of enumerate in a Python for statement or a forEach in Javascript, this sort of works the same way. In the map call, j is a data frame; k is an index value. I appended the additional arguments needed for e_arrange, then made the plot.
library(tidyverse) # has both dplyr and purrrrrr (how many r's?)
library(echarts4r)
data %>% group_split(dose) %>%
imap(function(j, k) {
j %>% group_by(id) %>%
e_charts(time, name = paste0("chart_", k)) %>%
e_line(y, name = paste0("Dose ", k)) %>%
e_color(color = "black")
}) %>% append(c(rows = 1, cols = 3)) %>%
do.call(e_arrange, .)
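To make the role of j and k in the imap call concrete, here is a minimal base R sketch of the same "data frame plus index" pattern. It uses split() and Map() as stand-ins for group_split() and imap(), and builds text labels instead of charts, so it does not need echarts4r; the label format is made up for illustration.

```r
# Sketch: what imap(function(j, k) ...) does, written with base R only.
set.seed(1)
data <- data.frame(
  time = factor(sort(rep(c(4, 8, 24), 30)), levels = c(4, 8, 24)),
  dose = factor(rep(c(1, 2, 3), 30), levels = c(1, 2, 3)),
  id   = rep(sort(rep(LETTERS[1:10], 3)), 3),
  y    = rnorm(n = 90, mean = 5, sd = 3)
)

# One data frame per dose; unname() mimics the unnamed list from group_split().
groups <- unname(split(data, data$dose))

# Map() walks over the groups and their positions in parallel:
# j is the data frame for one dose, k is its index (1, 2, 3).
labels <- Map(function(j, k) {
  paste0("chart_", k, ": dose ", unique(j$dose), ", ", nrow(j), " rows")
}, groups, seq_along(groups))

print(unlist(labels))
```

In the answer's code, the same k is what makes each chart name ("chart_1", "chart_2", ...) and series name unique.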

plotting thousands of lines with ggplot2 and melt

I want to create a plot overlaying 1000 simulations of an MA(2) process (1 plot, 1000 lines). I can't seem to get ggplot to plot more than one of these 1000 series. There are many posts about this problem on here, and it seems that for most people the problem is solved by using the melt function in the reshape2 library; however, this does not solve my problem. I am unsure what my problem is, other than that other datasets/examples from here seem to work when plotting multiple lines, so I am wondering if it is the data my function is generating. Sorry if this is a simple fix, I just can't seem to find an answer.
#create function to simulate MA(2) process
sim<-function(n,sigma,mu,theta) {
epsilon<-rnorm(n,0,sigma)
z<-matrix(nrow=n,ncol=1001)
z<-replicate(1000,
z[,1]<-as.numeric(mu+epsilon[1:n]+theta*epsilon[2:n-1]) )
}
#run simulation, add time vector
z <-sim(23,0.5,0.61,0.95)
time<-matrix(seq(1,23))
z<-data.frame(time,z)
#collapse data
df <- data.frame(melt(zz, id.vars = 'time', variable.name = 'series'))
df[["series"]] <- gsub("X", "series", df[["series"]])
#attempt to plot using ggplot2's 'group' and 'color'
ggplot(data=df, aes(x=time,y=value, group=series)) +
geom_line(aes(color = series)) +
theme(legend.position='none')
It's not clear to me how you want to show 1000 lines in a single plot. This sounds like a bad idea. Furthermore, the code you're using to simulate random data from an MA(2) process doesn't work; why not use arima.sim, which is there for exactly that purpose?
Here is an example plotting 10 random time series with data generated from an MA(2) process.
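library(tidyverse) # provides %>%, mutate, gather and ggplot
set.seed(2017)
# replicate() returns a 100 x 10 matrix, one simulated series per column
replicate(10, arima.sim(model = list(ma = c(-.7, .1)), n = 100)) %>%
as.data.frame() %>% # columns are named V1 ... V10
mutate(time = 1:n()) %>%
gather(key, val, -time) %>%
ggplot(aes(time, val, group = key)) +
geom_line(alpha = 0.4)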
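As for why the question's sim() appears to give a single line: epsilon is drawn once, outside replicate(), so every replicate reuses the same noise and the series are likely identical and drawn on top of each other. Also, 2:n-1 is parsed as (2:n)-1, and the later melt() call refers to zz rather than z. The following is a minimal base R sketch of a simulator that draws fresh noise on each call. The function name sim_ma and the second coefficient in theta are illustrative assumptions, not taken from the original post.

```r
# Sketch: simulate many independent MA(q) series in base R.
# theta = c(0.95, 0.3) is an illustrative choice, not from the original post.
sim_ma <- function(n, sigma, mu, theta) {
  q <- length(theta)
  eps <- rnorm(n + q, mean = 0, sd = sigma)  # fresh noise on every call
  # One-sided filter: x_t = eps_t + theta_1 * eps_{t-1} + ... + theta_q * eps_{t-q}
  x <- stats::filter(eps, c(1, theta), sides = 1)
  mu + as.numeric(x)[-seq_len(q)]            # drop the q leading NAs
}

set.seed(2017)
n_time <- 23
n_series <- 1000
z <- replicate(n_series,
               sim_ma(n_time, sigma = 0.5, mu = 0.61, theta = c(0.95, 0.3)))

# Long format: one row per (time, series) pair, which is what ggplot2 expects.
long <- data.frame(time   = rep(seq_len(n_time), times = n_series),
                   series = rep(seq_len(n_series), each = n_time),
                   value  = as.vector(z))

# Plot only if ggplot2 is available; a low alpha keeps 1000 lines readable.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(time, value, group = series)) +
    geom_line(alpha = 0.05)
  print(p)
}
```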

(Re)name factor levels (or include variable name) in ggplot2 facet_ call

One pattern I do a lot is to facet plots on cuts of numeric values. facet_wrap in ggplot2 doesn't allow you to call a function from within, so you have to create a temporary factor variable. This is okay using mutate from dplyr. The advantage of this is that you can play around doing EDA and varying the number of quantiles, or changing to set cut points etc. and view the changes in one line. The downside is that the facets are only labelled by the factor level; you have to know, for example, that it's a temperature. This isn't too bad for yourself, but even I get confused if I'm doing a facet_grid on two such variables and have to remember which is which. So, it's really nice to be able to relabel the facets by including a meaningful name.
The key point of this problem is that the levels will change as you change the number of quantiles etc.; you don't know what they are in advance. You could use the base levels() function, but that means augmenting the data frame with the cut variable, then calling levels(), then passing this augmented data frame to ggplot().
So, using plyr::mapvalues, we can wrap all this into a dplyr::mutate, but the required arguments for mapvalues() makes it quite clunky. Having to retype "Temp.f" many times is not very "dplyr"!
Is there a neater way of renaming such factor levels "on the fly"? I hope this description is clear enough and the code example below helps.
library(ggplot2)
library(plyr)
library(dplyr)
library(Hmisc)
df <- data.frame(Temp = seq(-100, 100, length.out = 1000), y = rnorm(1000))
# facet_wrap doesn't allow functions so have to create new, temporary factor
# variable Temp.f
ggplot(df %>% mutate(Temp.f = cut2(Temp, g = 4))) + geom_histogram(aes(x = y)) + facet_wrap(~Temp.f)
# fine, but facet headers aren't very clear,
# we want to highlight that they are temperature
ggplot(df %>% mutate(Temp.f = paste0("Temp: ", cut2(Temp, g = 4)))) + geom_histogram(aes(x = y)) + facet_wrap(~Temp.f)
# use of paste0 is undesirable because it creates a character vector and
# facet_wrap then recodes the levels in the wrong numerical order
# This has the desired effect, but is very long!
ggplot(df %>% mutate(Temp.f = cut2(Temp, g = 4), Temp.f = mapvalues(Temp.f, levels(Temp.f), paste0("Temp: ", levels(Temp.f))))) + geom_histogram(aes(x = y)) + facet_wrap(~Temp.f)
I think you can do this from within facet_wrap using a custom labeller function, like so:
myLabeller <- function(x){
lapply(x,function(y){
paste("Temp:", y)
})
}
ggplot(df %>% mutate(Temp.f = cut2(Temp, g = 4))) +
geom_histogram(aes(x = y)) +
facet_wrap(~Temp.f
, labeller = myLabeller)
That labeller is clunky, but it is at least an example. You could write one for each variable that you are going to use (e.g. tempLabeller, yLabeller, etc).
A slight tweak makes this even better: it automatically uses the name of the thing you are facetting on:
betterLabeller <- function(x){
lapply(names(x),function(y){
paste0(y,": ", x[[y]])
})
}
ggplot(df %>% mutate(Temp.f = cut2(Temp, g = 4))) +
geom_histogram(aes(x = y)) +
facet_wrap(~Temp.f
, labeller = betterLabeller)
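To see why this works, it can help to call the labeller by hand. A labeller function is passed a data frame of labels with one column per facetting variable and returns a list of character vectors. The sketch below feeds betterLabeller a hand-made data frame; the label values are made up for illustration and are not the exact strings cut2 produces.

```r
# Sketch: calling a labeller by hand to see its input and output.
betterLabeller <- function(x) {
  lapply(names(x), function(y) {
    paste0(y, ": ", x[[y]])
  })
}

# Hypothetical label data frame: one column per facetting variable.
facet_labels <- data.frame(
  Temp.f = c("[-100,-50)", "[-50,0)", "[0,50)", "[50,100]"),
  stringsAsFactors = FALSE
)

out <- betterLabeller(facet_labels)
print(out[[1]])
```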
Okay, with thanks to Mark Peterson for pointing me towards the labeller argument/function, the exact answer I'm happy with is:
ggplot(df %>% mutate(Temp.f = cut2(Temp, g = 4))) + geom_histogram(aes(x = y)) + facet_wrap(~Temp.f, labeller = labeller(Temp.f = label_both))
I'm a fan of lazy and "label_both" means I can simply create a meaningful temporary (or overwrite the original) variable column and both the name and the value are given. Rolling your own labeller function is more powerful, but using label_both is a good, easy option.
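If you do want to rename the levels in the data rather than in the labeller, a small helper can keep the level order intact, which is what the paste0 attempt loses. This is only a sketch: relabel() is a hypothetical helper, and base R's cut() at the quartiles stands in for Hmisc::cut2(Temp, g = 4) so the snippet has no extra dependencies.

```r
# Sketch: prefix factor levels while keeping their order.
# relabel() is a hypothetical helper, not part of any package.
relabel <- function(f, prefix) {
  f <- as.factor(f)
  levels(f) <- paste0(prefix, levels(f))  # renames levels, order unchanged
  f
}

set.seed(42)
df <- data.frame(Temp = seq(-100, 100, length.out = 1000), y = rnorm(1000))

# Base R stand-in for Hmisc::cut2(Temp, g = 4): cut at the quartiles.
breaks <- quantile(df$Temp, probs = seq(0, 1, by = 0.25))
temp_cut <- cut(df$Temp, breaks = breaks, include.lowest = TRUE)
df$Temp.f <- relabel(temp_cut, "Temp: ")

print(levels(df$Temp.f))
```

Inside a pipeline this would read mutate(Temp.f = relabel(cut2(Temp, g = 4), "Temp: ")), so the variable name is typed only once.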

visualizing statistical test results with ggplot2

I would like to get my statistical test results integrated to my plot. Example of my script with dummy variables (dummy data below generated after first post):
library(dplyr)
library(ggplot2)
cases <- rep(1:5,times=10)
var1 <- rep(11:15,times=10)
outcome <- rep(c(1,1,1,2,2),times=10)
maindata <- data.frame(cases,var1,outcome)
df1 <- maindata %>%
group_by(cases) %>%
select(cases,var1,outcome) %>%
summarise(var1 = max(var1, na.rm = TRUE), outcome=mean(outcome, na.rm =TRUE))
wilcox.test(df1$var1[df1$outcome<=1], df1$var1[df1$outcome>1])
ggplot(df1, aes(x = as.factor(outcome), y = as.numeric(var1), fill=outcome)) + geom_boxplot()
With these everything works just fine, but I can't find a way to integrate my wilcox.test results into my plot automatically (of course I can use annotate() and write the results manually, but that's not what I'm after).
My script produces two boxplots with the max value of var1 on the y-axis, grouped by outcome on the x-axis (only two different values for outcome). I would like to add my wilcox.test results to that boxplot; all other relevant data is present. I tried to find a way in forums and help files but couldn't (at least with ggplot2).
I'm new to R and trying to learn through ggplot2 and dplyr, which I see as the most intuitive packages for manipulation and visualization. I don't know if they are optimal for the solution I'm after, so feel free to suggest solutions from alternative packages as well...
I think this figure shows what you want. I also added some parts to the code because you're new to ggplot2. Take or leave them, but they are things I do to make publication-quality figures:
wtOut = wilcox.test(df1$var1[df1$outcome<=1], df1$var1[df1$outcome>1])
exampleOut <- ggplot(df1,
aes(x = as.factor(outcome), y = as.numeric(var1), fill=outcome)) +
geom_boxplot() +
scale_fill_gradient(name = paste0("P-value: ",
signif(wtOut$p.value, 3), "\nOutcome")) +
ylab("Variable 1") + xlab("Outcome") + theme_bw()
ggsave('exampleOut.jpg', exampleOut, width = 6, height = 4)
If you want to include the p-value as its own legend, it looks like it is some work, but doable.
Or, if you want, just throw signif(wtOut$p.value, 3) into annotate(...). You'll just need to come up with rules for where to place it.
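A minimal sketch of the annotate() route follows. The data frame is typed in directly and matches what the summarise() step yields for the dummy data. The placement rule (centred between the two boxes, at the top of the data) and the label wording are assumptions; adjust them to your plot.

```r
# Sketch: put the Wilcoxon p-value on the plot with annotate().
df1 <- data.frame(cases = 1:5,
                  var1 = 11:15,
                  outcome = c(1, 1, 1, 2, 2))

wtOut <- wilcox.test(df1$var1[df1$outcome <= 1], df1$var1[df1$outcome > 1])
p_label <- paste0("Wilcoxon p = ", signif(wtOut$p.value, 3))

# Plot only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df1, aes(x = as.factor(outcome), y = var1)) +
    geom_boxplot() +
    # Assumed placement: x = 1.5 sits between the two discrete positions,
    # y is the largest observed value.
    annotate("text", x = 1.5, y = max(df1$var1), label = p_label, vjust = 1) +
    xlab("Outcome") + ylab("Variable 1")
  print(p)
}
```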

dataframe2delta: how to plot a delta function directly from the dataframe using ggplot2

I looked for an answer throughout the former threads, but with no luck.
I was wondering if it could be possible, given a data frame having a structure similar to this one
df <- data.frame(x = rep(1:100, times = 2 ),
y = c(rnorm(100), rnorm(100, 10)),
group = rep(c("a", "b"), each = 100))
to plot directly the difference, between the observations of the two groups, instead of plotting the two samples using different colours, which is what I'm able to do so far using ggplot2. Of course I know I could do that using the base plotting system by simply using
plot(df[df$group == "a",]$y - df[df$group == "b",]$y)
but doing so I waste all the cool features of ggplot2.
Thanks in advance!
EB
You could try something like this:
library(reshape2)
library(ggplot2)
df <- dcast(df, x~group, value.var='y')
df$dif = df$a-df$b
ggplot(df, aes(x, dif)) + geom_line()
Or if you use data.table here is how to do it:
library(data.table)
dt=data.table(df)
dt<-dcast.data.table(dt, x~group, value.var='y')
dt[,dif:=a-b]
ggplot(dt, aes(x, dif)) + geom_line()
How does this look?
Another possibility using dplyr is the following:
ggplot(df %>% group_by(x) %>% summarise(delta = diff(y)),
aes(x = x, y = delta)) + geom_line()
In this case you can avoid the dcast by using the function diff, which assumes a fixed order of the groups within each x (it returns the second value minus the first, here b - a); otherwise you need to sort by the factor or apply dcast to your data frame. I am quite sure that you can do something very similar using data.table.
It's not completely solved, but it looks close to what I meant:
qplot( x = x,
y = diff,
data = dcast( data = df,
value.var = "y",
formula = x ~ "diff",
fun.aggregate = function( x ) x[1] - x[2] ) )
It's quite tricky and strongly depends on what you have in your group variable, but works.
An alternative was to mutate the output of dcast, but in my case the group column was filled with TRUE and FALSE values. Thus, using mutate to obtain diff=TRUE-FALSE returned a column of 1s, which is not very useful.
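For completeness, here is a base R sketch that does not depend on row order or on the names of the group levels becoming column names: the two groups are matched on x with merge() and subtracted explicitly. The suffixes and the column name dif are illustrative choices.

```r
# Sketch: difference between two groups, matched on x with base R's merge().
set.seed(1)
df <- data.frame(x = rep(1:100, times = 2),
                 y = c(rnorm(100), rnorm(100, 10)),
                 group = rep(c("a", "b"), each = 100))

a <- df[df$group == "a", c("x", "y")]
b <- df[df$group == "b", c("x", "y")]

# merge() pairs rows by x, so the result is correct even if df is shuffled.
delta <- merge(a, b, by = "x", suffixes = c(".a", ".b"))
delta$dif <- delta$y.a - delta$y.b

# Plot only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(delta, aes(x, dif)) + geom_line()
  print(p)
}
```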
