ggplot only printing the last graph - r

Thank you for this wonderful community. I am trying to create a loop to make lots of graphs. However, only the last graph is being shown.
Here is my dataset. I am essentially subsetting by gene and then trying to create graphs.
Here is my code.
data_before_heart <- subset(data_before, Organ == 0)
uniq <- unique(unlist(data_before_heart$Gene))
for (i in 1:length(uniq)){
data_1 <- subset(data_before_heart, Gene == uniq[i])
print(ggplot(data_1, aes(x=Drug, y=Expression)) +
geom_bar(stat="identity"))
}
Unfortunately this only generates the last graph. (there are 6 genes from 0, 1, 2, 3, 4, 5)

You code works fine for me with 'fake' data and generates the expected number of bar plots without any issues:
library(ggplot2)
library(dplyr)
data_before_heart <- data.frame(
Gene = as.vector(sapply(seq(1:10), function(x) x*rep(1, 5))) - 1,
Drug = c(0, seq(1:50) %% 5)[1:50],
Expression = runif(n = 50, min = 0, max = 1),
stringsAsFactors = FALSE
)
#data_before_heart <- subset(data_before, Organ == 0)
uniq <- unique(unlist(data_before_heart$Gene))
for (i in 1:length(uniq)){
data_1 <- subset(data_before_heart, Gene == uniq[i])
print(ggplot(data_1, aes(x=Drug, y=Expression)) +
geom_bar(stat="identity"))
}
Perhaps your data set has issues? Are there any error messages?
facet_wrap provides an elegant alternative:
data_before_heart %>%
ggplot(aes(x = Drug, y = Expression)) +
geom_bar(stat = "identity") +
facet_wrap(~ Gene, nrow = 5, ncol = 2, scales = "fixed")

Related

Set / Link point and shape options for variables in ggplot2

I would like to link variables I have in a dataframe i.e. ('prop1', 'prop2', 'prop3') to specific colours and shapes in the plot. However, I also want to exclude data (using dplyr::filter) to customise the plot display WITHOUT changing the points and shapes used for a specific variable. A minimal example is given below.
library(ggplot2)
library(dplyr)
library(magrittr)
obj <- c("cmpd 1","cmpd 1","cmpd 1","cmpd 2","cmpd 2")
x <- c(1, 2, 4, 7, 3)
var <- c("prop1","prop2","prop3","prop2","prop3")
y <- c(1, 2, 3, 2.5, 4)
col <- c("#E69F00","#9E0142","#56B4E9","#9E0142","#56B4E9")
shp <- c(0,1,2,1,2)
df2 <- cbind.data.frame(obj,x,var,y,col,shp)
plot <- ggplot(data = df2 %>%
filter(obj %in% c(
"cmpd 1",
"cmpd 2"
)),
aes(x = x,
y = y,
colour = as.factor(var),
shape = as.factor(var))) +
geom_point(size=2) +
#scale_shape_manual(values=shp) +
#scale_color_manual(values=col) +
facet_grid(.~obj)
plot
However, when I redact cmpd1 (just hashing in code) the colour and shape of prop2 and prop3 for cmpd2 change (please see plot2).
To this end, I tried adding in scale_shape_manual and scale_color_manual to the code (currently hashed) and linked these to specific vars (col and shp) in the dataframe (df2), but the same problem arises that both the shape and color of these variables changes when excluding one of the conditions?
Any and all help appreciated.
Try something like this:
library(tidyverse)
obj <- c("cmpd 1","cmpd 1","cmpd 1","cmpd 2","cmpd 2")
x <- c(1, 2, 4, 7, 3)
var <- c("prop1","prop2","prop3","prop2","prop3")
y <- c(1, 2, 3, 2.5, 4)
df2 <- cbind.data.frame(obj,x,var,y)
col <- c("prop1" = "#E69F00",
"prop2" = "#9E0142",
"prop3" = "#56B4E9")
shp <- c("prop1" = 0,
"prop2" = 1,
"prop3" = 2)
plot <- ggplot(data = df2 %>%
filter(obj %in% c(
"cmpd 1",
"cmpd 2"
)),
aes(x = x,
y = y,
colour = var,
shape = var)) +
geom_point(size=2) +
scale_shape_manual(values=shp) +
scale_color_manual(values=col) +
facet_grid(.~obj)
plot

Changing the colorbar labels in a plotly heatmap (R)

I am working with survey data where a person responds from 'not' to 'absolutely' in response to being asked whether one variable affects another. The responses are mapped to numbers, and a mean response is calculated.
I am presenting these mean values in a heatmap (x and y are lists with the same variable names). I would like the colour of squares in the heatmap to reflect the numerical mean, but I would like the labels on the colorbar to reflect the actual response text (e.g. 'not', 'lowly, 'moderately', 'highly', 'very highly') and to limit the tick marks to positions 0,1,2,3,4.
I'm not sure this can be done with plotly. I was able to do it with ggiraph, but this is going into Shiny and ggiraph has its own issues there - in plotly I have more control over display size and I couldn't get ggiraph to render large enough.
Minimal code is below, and so is the output.
library(plotly)
library(tidyr)
M <- matrix(c(NA, 1,3, 2, NA, 4, 3, 0, NA), nrow = 3, ncol = 3)
names_M <- c('var1', 'var2', 'var3')
val_to_char <- function(x) {
if(is.na(x)) {return(x)}
else if(x < 0.5) {return('not')}
else if(x < 1.5) {return('lowly')}
else if(x < 2.5) {return('moderately')}
else if(x < 3.5) {return('highly')}
else {return('very high')}
}
labels <- apply(M, c(1,2), val_to_char)
fig <- plot_ly()
fig <- fig %>%
add_trace(
type = 'heatmap',
x = names_M, y = names_M, z = M, text = labels,
hovertemplate = '<extra></extra> Row: %{y}</br></br>Col: %{x}</br>Avg response: %{text}'
)
fig
You could use ggplot and ggplotly for more flexibility. But you first need to tranform your data into long data.frame:
M <- as.data.frame.matrix(matrix(c(NA, 1,3, 2, NA, 4, 3, 0, NA), nrow = 3, ncol = 3))
names(M) <- c('var1', 'var2', 'var3')
p <- M %>%
pivot_longer(.,paste0("var",1:3),names_to = "x",values_to = "z") %>%
mutate(y = rep(paste0("var",1:3),each = 3)) %>%
ggplot(aes(x,y,fill = z))+
geom_tile() +
scale_fill_continuous(breaks = 0:4,labels = c("not","lowly","moderatly","highly","very high"))
ggplotly(p)
The function scale_fill_continuous allows you to specify manually your labels for different breaks. You have a lot of possibilities to tweak your graph with ggplot

Reordering a factor based on a summary statistic of a subset of the data

I'm trying to reorder a factor from a subset of my data frame, defined by another factor using forcats::fct_reorder().
Consider the following data frame df:
set.seed(12)
df <- data.frame(fct1 = as.factor(rep(c("A", "B", 'C'), each = 200)),
fct2 = as.factor(rep(c("j", "k"), each = 100)),
val = c(rnorm(100, 2), # A - j
rnorm(100, 1), # A - k
rnorm(100, 1), # B - j
rnorm(100, 6), # B - k
rnorm(100, 8), # C - j
rnorm(100, 4)))# C - k
I want to plot facetted group densities using the ggridges package. For example:
ggplot(data = df, aes(y = fct2, x = val)) +
stat_density_ridges(geom = "density_ridges_gradient",
calc_ecdf = T,
quantile_fun = median,
quantile_lines = T) +
facet_wrap(~fct1, ncol = 1)
I would now like to order fct1 by the median (default in fct_reorder()) of the values of the upper density in each facet, i.e. where fct2 == "k". The goal in this example would therefore be that the facets appear in the order B - C - A.
This seems very similar to this question here, with the difference that I do not want to summarize the data first because I need the raw data to plot the densities.
I've tried to adapt the code in the answer of the linked question:
df <- df %>% mutate(fct1 = forcats::fct_reorder(fct1, filter(., fct2 == 'k') %>% pull(val)))
But it returns the following error:
Error in forcats::fct_reorder(fct1, filter(., fct2 == "k") %>% pull(val)) :
length(f) == length(.x) is not TRUE
It's obvious that they are not the same length, but I don't quite get why this error is necessary. My guess is that it's generally not guaranteed that all levels of fct1 are present in the subset, which would certainly be problematic. Yet, this isn't the case in my example. Is there a way to work around this error or am I doing something wrong more generally?
I'm aware that I can work around this with a couple of lines of extra code, e.g. create a helper variable of the subsetted data, reorder that and then take the level order to my factor in the original data set. I would still like a prettier solution, because I regularly face that very same task.
You can do this with a little helper function:
f <- function(i) -median(df$val[df$fct2 == "k" & df$fct1 == df$fct1[i]])
Which allows you to reorder like this:
df$fct1 <- forcats::fct_reorder(df$fct1, sapply(seq(nrow(df)), f))
Which gives you this plot:
ggplot(data = df, aes(y = fct2, x = val)) +
stat_density_ridges(geom = "density_ridges_gradient",
calc_ecdf = T,
quantile_fun = median,
quantile_lines = T) +
facet_wrap(~fct1, ncol = 1)

Apply MASS::fitdistr to multiple data by a factor

My question is at the end in bold.
I know how to fit the beta distribution to some data. For instance:
library(Lahman)
library(dplyr)
# clean up the data and calculate batting averages by playerID
batting_by_decade <- Batting %>%
filter(AB > 0) %>%
group_by(playerID, Decade = round(yearID - 5, -1)) %>%
summarize(H = sum(H), AB = sum(AB)) %>%
ungroup() %>%
filter(AB > 500) %>%
mutate(average = H / AB)
# fit the beta distribution
library(MASS)
m <- MASS::fitdistr(batting_by_decade$average, dbeta,
start = list(shape1 = 1, shape2 = 10))
alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]
# plot the histogram of data and the beta distribution
ggplot(career_filtered) +
geom_histogram(aes(average, y = ..density..), binwidth = .005) +
stat_function(fun = function(x) dbeta(x, alpha0, beta0), color = "red",
size = 1) +
xlab("Batting average")
Which yields:
Now I want to calculate different beta parameters alpha0 and beta0 for each batting_by_decade$Decade column of the data so I end up with 15 parameter sets, and 15 beta distributions that I can fit to this ggplot of batting averages faceted by Decade:
batting_by_decade %>%
ggplot() +
geom_histogram(aes(x=average)) +
facet_wrap(~ Decade)
I can hard code this by filtering for each decade, and passing that decade's worth of data into the fidistr function, repeating this for all decades, but is there a way of calculating all beta parameters per decade quickly and reproducibly, perhaps with one of the apply functions?
You can leverage summarise together with two custom functions for this:
getAlphaEstimate = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate[1]}
getBetaEstimate = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate[2]}
batting_by_decade %>%
group_by(Decade) %>%
summarise(alpha = getAlphaEstimate(average),
beta = getBetaEstimate(average)) -> decadeParameters
However, you will not be able to plot it with stat_summary according to Hadley's post here: https://stackoverflow.com/a/1379074/3124909
Here's an example of how you'd go from generating dummy data all the way through to plotting.
temp.df <- data_frame(yr = 10*187:190,
al = rnorm(length(yr), mean = 4, sd = 2),
be = rnorm(length(yr), mean = 10, sd = 2)) %>%
group_by(yr, al, be) %>%
do(data_frame(dats = rbeta(100, .$al, .$be)))
First I made up some scale parameters for four years, grouped by each combination, and then used do to create a dataframe with 100 samples from each distribution. Aside from knowing the "true" parameters, this dataframe should look a lot like your original data: a vector of samples with an associated year.
temp.ests <- temp.df %>%
group_by(yr, al, be) %>%
summarise(ests = list(MASS::fitdistr(dats, dbeta, start = list(shape1 = 1, shape2 = 1))$estimate)) %>%
unnest %>%
mutate(param = rep(letters[1:2], length(ests)/2)) %>%
spread(key = param, value = ests)
This is the bulk of your question here, very much solved the way you solved it. If you step through this snippet line by line, you'll see you have a dataframe with a column of type list, containing <dbl [2]> in each row. When you unnest() it splits those two numbers into separate rows, so then we identify them by adding a column that goes "a, b, a, b, ..." and spread them back apart to get two columns with one row for each year. Here you can also see how closely fitdistr matched the true population we sampled from, looking at a vs al and b vs be.
temp.curves <- temp.ests %>%
group_by(yr, al, be, a, b) %>%
do(data_frame(prop = 1:99/100,
trueden = dbeta(prop, .$al, .$be),
estden = dbeta(prop, .$a, .$b)))
Now we turn that process inside out to generate the data to plot the curves. For each row, we use do to make a dataframe with a sequence of values prop, and calculate the beta density at each value for both the true population parameters and our estimated sample parameters.
ggplot() +
geom_histogram(data = temp.df, aes(dats, y = ..density..), colour = "black", fill = "white") +
geom_line(data = temp.curves, aes(prop, trueden, color = "population"), size = 1) +
geom_line(data = temp.curves, aes(prop, estden, color = "sample"), size = 1) +
geom_text(data = temp.ests,
aes(1, 2, label = paste("hat(alpha)==", round(a, 2))),
parse = T, hjust = 1) +
geom_text(data = temp.ests,
aes(1, 1, label = paste("hat(beta)==", round(b, 2))),
parse = T, hjust = 1) +
facet_wrap(~yr)
Finally we put it together, plotting a histogram of our sample data. Then a line from our curve data for the true density. Then a line from our curve data for our estimated density. Then some labels from our parameter estimate data to show the sample parameters, and facets by year.
This is an apply solution, but I prefer #CMichael's dplyr solution.
calc_beta <- function(decade){
dummy <- batting_by_decade %>%
dplyr::filter(Decade == decade) %>%
dplyr::select(average)
m <- fitdistr(dummy$average, dbeta, start = list(shape1 = 1, shape2 = 10))
alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]
return(c(alpha0,beta0))
}
decade <- seq(1870, 2010, by =10)
params <- sapply(decade, calc_beta)
colnames(params) <- decade
Re: #CMichael's comment about avoiding a double fitdistr, we could rewrite the function to getAlphaBeta.
getAlphaBeta = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate}
batting_by_decade %>%
group_by(Decade) %>%
summarise(params = list(getAlphaBeta(average))) -> decadeParameters
decadeParameters$params[1] # it works!
Now we just need to unlist the second column in a nice way....

Using loop for gpplot2 cause issue on displaying the legends

I have an R code that creates a linear regression. I am having some problems with the legends in a graph. I would like to use the dates specified in the trendDateRange as the legend with different colors. Since these dates are in YYYY-MM-DD format. I only need the YYYY-MM. So for example, the trendDateRage1 = c("2015-01-01", "2015-12-31") and I want to display "2015-01 - 2015-12" as a legend with a any colour. When I run this in a for loop, it's only displaying 1 legend which uses the last trendDateRange i.e trendDateRange3 which displays "2013-01 - 2013-12". It does not display the legend for the other 2 dates. I do not have any problem with graphs although they're using the same colour. I would like to see different colours for each legend even though they have different line types.
If I run the code below showing individual graphs, it's working with the proper legend. I get the legend for each graph.
Month_Names <- c("2010-11","2010-12",
"2011-01","2011-02","2011-03","2011-04","2011-05","2011-06","2011-07","2011-08","2011-09","2011-10","2011-11","2011-12",
"2012-01","2012-02","2012-03","2012-04","2012-05","2012-06","2012-07","2012-08","2012-09","2012-10","2012-11","2012-12",
"2013-01","2013-02","2013-03","2013-04","2013-05","2013-06","2013-07","2013-08","2013-09","2013-10","2013-11","2013-12",
"2014-01","2014-02","2014-03","2014-04","2014-05","2014-06","2014-07","2014-08","2014-09","2014-10","2014-11","2014-12",
"2015-01","2015-02","2015-03","2015-04","2015-05","2015-06","2015-07","2015-08","2015-09","2015-10","2015-11","2015-12",
"2016-01","2016-02","2016-03","2016-04","2016-05","2016-06","2016-07","2016-08","2016-09","2016-10","2016-11","2016-12",
"2017-01")
Actual_volume <- c(54447,57156,
52033,49547,58718,53109,56488,60095,54683,60863,56692,55283,55504,56633,
53267,52587,54680,55569,60013,56985,59709,61281,54188,59832,56489,55819,
59295,52692,56663,59698,61232,57694,63111,60473,58984,64050,54957,63238,
59460,54430,58901,61088,60496,62984,66895,62720,65591,67815,58289,72002,
61054,60329,69283,68002,63196,72267,71058,69539,71379,70925,68704,76956,
65863,70494,77348,70214,74770,77480,69721,83034,76761,77927,79768,81836,
75381)
df_data <- data.frame(Month_Names, Actual_volume)
trendDateRange1 <- c("2010-11-01", "2017-01-31")
trendDateRange2 <- c("2012-01-01", "2012-12-31")
trendDateRange3 <- c("2013-01-01", "2013-12-31")
numoftrends <- 3
list_of_df <- list()
list_of_df<- lapply(1:numoftrends, function(j) {
trend.period <- get(paste0("trendDateRange", j))
trend1 <- substr(trend.period[1], 1, 7)
trend2 <- substr(trend.period[2], 1, 7)
TRx <- subset(df_data, as.character(Month_Names) >= trend1 &
as.character(Month_Names) <= trend2)
})
i = 1
trend.period <- get(paste0("trendDateRange", i))
trend1 <- substr(trend.period[1], 1, 7)
trend2 <- substr(trend.period[2], 1, 7)
Trend.dates <- paste0(trend1, '-' ,trend2)
plot = ggplot() +
geom_line(data = list_of_df[[i]],
aes(x = Month_Names, y = Actual_volume, group = 1 , colour = Trend.dates),
lty = i + 1)
print(ggplotly(plot))
i = 2
trend.period <- get(paste0("trendDateRange", i))
trend1 <- substr(trend.period[1], 1, 7)
trend2 <- substr(trend.period[2], 1, 7)
Trend.dates <- paste0(trend1, '-' ,trend2)
plot = ggplot() +
geom_line(data = list_of_df[[i]],
aes(x=Month_Names, y = Actual_volume, group = 1 , colour = Trend.dates),
lty = i + 1)
print(ggplotly(plot))
i = 3
trend.period <- get(paste0("trendDateRange", i))
trend1 <- substr(trend.period[1], 1, 7)
trend2 <- substr(trend.period[2], 1, 7)
Trend.dates <- paste0(trend1, '-' ,trend2)
plot = ggplot() +
geom_line(data = list_of_df[[i]],
aes(x = Month_Names, y = Actual_volume, group = 1 , colour = Trend.dates),
lty = i+1)
print(ggplotly(plot))
But when I put this in the loop to make it one graph with each legend it does not work
plot = ggplot()
for (i in seq_along(list_of_df)) {
trend.period = get(paste0("trendDateRange", i))
trend1 = substr(trend.period[1], 1, 7)
trend2 = substr(trend.period[2], 1, 7)
Trend.dates = paste0(trend1, '-' ,trend2)
plot = plot + geom_line(aes(x = Month_Names, y = Actual_volume, group = 1 , colour = Trend.dates),
data = list_of_df[[i]], lty = i + 1)
}
print(ggplotly(plot))
You'll have a much easier time working with ggplot2 if you combine the three datasets into one with an aesthetic that separates them, rather than adding them together in a for loop.
There are a number of ways you could do this, but here's an example using the dplyr and tidyr packages. It would replace everything after your df_data <- line.
library(ggplot2)
library(dplyr)
library(tidyr)
trends <- data_frame(Start = c("2010-11", "2012-01", "2013-01"),
End = c("2017-01", "2012-12", "2013-12"))
combined_data <- df_data %>%
crossing(trends) %>%
mutate(Month_Names = as.character(Month_Names),
TrendName = paste(Start, End, sep = "-")) %>%
filter(Month_Names >= Start,
Month_Names <= End)
# rotated x-axes to make plot slightly more readable
ggplot(combined_data, aes(Month_Names, y = Actual_volume,
group = TrendName,
color = TrendName)) +
geom_line() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
If you combine your list into a data.frame with an ID representing which element the observation came from and parse the dates, getting a decent plot is pretty simple:
library(dplyr)
library(ggplot2)
list_of_df %>%
bind_rows(.id = 'id') %>%
mutate(date = as.Date(paste0(Month_Names, '-01'))) %>%
ggplot(aes(date, Actual_volume, color = id)) +
geom_line()
or without dplyr,
df <- do.call(rbind,
Map(function(df, i){df$id <- i; df},
df = list_of_df,
i = as.character(seq_along(list_of_df))))
df$date <- as.Date(paste0(df$Month_Names, '-01'))
ggplot(df, aes(date, Actual_volume, color = id)) + geom_line()
which returns the same thing.
If you'd like more descriptive group labels, set the names of the list elements or define id as a string pasted together from the formatted minimums and maximums of the parsed dates.
Here is a solution using ggplotly.
nrows <- unlist(lapply(list_of_df,nrow))
df <- data.frame(do.call(rbind,list_of_df), Grp = factor(rep(1:3, nrows)))
plot <- ggplot(aes(x=Month_Names, y=Actual_volume, group = Grp,
colour=Grp), data=df) + geom_line()
print(ggplotly(plot))

Resources