How to combine geom_bar for three dataframe? - r

Suppose I have:
a = data.frame(a = sample(1:10, 20, replace = T))
b = data.frame(b = sample(1:11, 19, replace = T))
c = data.frame(c = sample(1:9, 21, replace = T))
a.a = ggplot(data = a, aes(a)) + geom_bar()
b.b = ggplot(data = b, aes(b)) + geom_bar()
c.c = ggplot(data = c, aes(c)) + geom_bar()
How can I combine a.a, b.b and c.c into one plot?
I have tried
d = ggplot() +
geom_bar(data = a.a, aes(a)) +
geom_bar(data = b.b, aes(b)) +
geom_bar(data = c.c, aes(c))
d
But it doesn't work...

Combine them into a single "long" data frame that has a grouping column marking which data frame each row came from.
library(reshape2)
library(dplyr)
library(ggplot2)
# Individual data frames
a = data.frame(a = sample(1:10, 20, replace = T))
b = data.frame(b = sample(1:11, 19, replace = T))
c = data.frame(c = sample(1:9, 21, replace = T))
Combine the data frames in "long" format. The data frames have different numbers of rows, so the new grouping variable (called data_source below) must repeat each data frame's name once for every row of that data frame. The rep function takes care of this. One way is rep(c("a","b","c"), times=c(nrow(a), nrow(b), nrow(c))); I use sapply below because it seemed cleaner (though perhaps more opaque).
df = data.frame(value =c(a$a,b$b,c$c),
data_source=rep(c("a","b","c"), times=sapply(list(a,b,c), nrow)))
# Pre-summarise counts in order to add zero counts for empty categories
df.summary = df %>% group_by(data_source, value) %>%
tally %>%
dcast(data_source ~ value, value.var="n", fill=0) %>%
melt(id.var="data_source", variable.name="value", value.name="n")
ggplot(df.summary, aes(value, n, fill=data_source)) +
geom_bar(stat="identity", position="dodge", colour="grey20", lwd=0.3)
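As a possible alternative to the dcast/melt round trip, base R's table() already reports a zero for every empty combination of its factors. A minimal sketch, assuming the same a, b and c data frames (the set.seed call and the name counts are added here for illustration only):

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
a <- data.frame(a = sample(1:10, 20, replace = TRUE))
b <- data.frame(b = sample(1:11, 19, replace = TRUE))
c <- data.frame(c = sample(1:9, 21, replace = TRUE))

df <- data.frame(value = c(a$a, b$b, c$c),
                 data_source = rep(c("a", "b", "c"),
                                   times = c(nrow(a), nrow(b), nrow(c))))

# table() cross-tabulates source by value and keeps empty cells as 0;
# as.data.frame() turns the table back into long format
counts <- as.data.frame(table(data_source = df$data_source, value = df$value),
                        responseName = "n")

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(counts, aes(value, n, fill = data_source)) +
    geom_bar(stat = "identity", position = "dodge")
  print(p)
}
```

The value column comes back as a factor whose levels are in numeric order, so the x axis needs no further conversion.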
If we didn't have some categories with zero counts (for example, data frames b and c have no values equal to 10), then we could just do this:
ggplot(df, aes(factor(value), fill=data_source)) +
geom_bar(position="dodge", colour="grey20", lwd=0.3)
But then note how ggplot expands the remaining bars when one or two data frames don't contain a given value:
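If the widening of the remaining bars is unwanted, one option is position_dodge(preserve = "single"), which, in ggplot2 versions that provide the preserve argument, keeps every bar at the width of a single dodged bar. A small sketch on made-up data (the values below are illustrative, not the sampled data above):

```r
library(ggplot2)

# Made-up data: source "b" has no 3s and source "c" has no 1s
df <- data.frame(
  value = c(1, 1, 2, 3,  1, 2, 2,  2, 3, 3),
  data_source = rep(c("a", "b", "c"), times = c(4, 3, 3))
)

# preserve = "single" keeps bar widths equal even where a source is missing
p <- ggplot(df, aes(factor(value), fill = data_source)) +
  geom_bar(position = position_dodge(preserve = "single"),
           colour = "grey20")
print(p)
```

The bars for a missing source are simply left out, so the groups are no longer centred on their tick; pre-summarising with explicit zero counts avoids that.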

Related

Extrapolate dataset with limited data points and add all values to new dataset

I have a dataset with very limited data points.
x<- c(4, 8, 13, 24)
y<- c(40, 37, 28, 20)
df<- data.frame(x,y)
Now I want to extrapolate this data, creating a dataset where the value of y will be given for every value (no decimals) of x between 1-100. x and y have a linear relationship.
Secondly, could this be done for multiple dataframes by using something like a loop?
Thank you!
This is a short snippet that does this:
library(tidyverse)
linear_xy <- lm(y ~ x, data = df)
# df <- broom::augment(linear_xy, newdata = complete(df, x = 1:100)) # one way
df <- df %>% # another way
complete(x = 1:100) %>%
mutate(.fitted = predict(linear_xy, newdata = .))
ggplot(df, aes(x, y)) +
geom_line(aes(y = .fitted)) +
geom_point() +
ggpubr::theme_pubr()
This requires that you have the packages {tidyverse}, {broom}, and {ggpubr} installed.
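If installing those packages is not an option, the same fit-and-predict step can be sketched in base R. This is only an illustration of the idea; the names extrapolated and y_fitted are made up here:

```r
x <- c(4, 8, 13, 24)
y <- c(40, 37, 28, 20)
df <- data.frame(x, y)

# Fit a straight line through the four observed points
linear_xy <- lm(y ~ x, data = df)

# One row per integer x in 1..100, with the value predicted by the line
extrapolated <- data.frame(x = 1:100)
extrapolated$y_fitted <- predict(linear_xy, newdata = extrapolated)

# Attach the observed y where it exists (NA elsewhere)
extrapolated <- merge(extrapolated, df, by = "x", all.x = TRUE)
head(extrapolated)
```

Note that the observed x values only reach 24, so most of the 1 to 100 range is extrapolation; with these numbers the fitted line crosses zero somewhere past x = 40.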
Second part
Assuming we want to do this with multiple data-frames, we have to restructure things a bit.
x <- c(4, 8, 13, 24)
y <- c(40, 37, 28, 20)
df <- tibble(x, y)
I don't have multiple data-frames (or tibbles), so I'll make this the primary one and make up a function (a factory) that yields data-frames that differ a bit from the df above.
df_factory <- . %>%
mutate(x_new = x + sample.int(100, size = n()),
x = if_else(x_new >= 100, x, x_new),
y_new = y + rnorm(n(), mean = median(y), sd = sd(y)),
y = y_new,
y_new = NULL,
x_new = NULL)
Thus df_factory is a function of one argument, which must be a data-frame that has columns x and y:
df1 <- df_factory(df)
df2 <- df_factory(df)
df3 <- df_factory(df)
all_dfs <- list(df1, df2, df3)
all_dfs <- bind_rows(all_dfs, .id = "df_id")
Here we ensure that the relation to the original data-frame is preserved in the all_dfs data-frame via the new variable df_id.
Next we want to:
- Collapse the variables of each individual data-frame into a list-column named data.
- For each row (see rowwise):
  - fit an "interpolating" linear model (a single line, not a piece-wise one);
  - predict from each of these linear_xy models (which are also stored in a list-column).
- Unnest it all back, so it can be fed into ggplot as one contiguous data-frame.
all_dfs %>%
nest(data = c(x,y)) %>%
rowwise() %>%
mutate(linear_xy = list(lm(y ~ x, data = data)),
augment = list(broom::augment(linear_xy,
newdata = complete(data, x = 1:100)))) %>%
ungroup() %>%
select(-data, -linear_xy) %>%
unnest(augment) ->
all_dfs_predictions
Note: the -> at the end assigns the result of the pipe to all_dfs_predictions.
The group aesthetic tells ggplot to treat the rows as separate series according to their df_id. For fun, color and fill also depend on df_id. I could have chosen something else for the color aesthetic, such as "original df" vs. "others", or a certain threshold to distinguish them. The group aesthetic would still tell ggplot to keep the series separate.
ggplot(all_dfs_predictions, aes(x, y, group = df_id, color = df_id, fill = df_id)) +
geom_line(aes(y = .fitted)) +
geom_point() +
lims(x = c(1,100)) +
ggpubr::theme_pubr()
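For the "something like a loop" part of the question, a plain lapply over a list of data frames is another possible route. The sketch below uses base R only; the three data frames, the helper extrapolate_one and the column name df_id are hypothetical stand-ins:

```r
# Hypothetical inputs: a named list of data frames, each with columns x and y
dfs <- list(
  first  = data.frame(x = c(4, 8, 13, 24),  y = c(40, 37, 28, 20)),
  second = data.frame(x = c(2, 10, 30, 55), y = c(12, 15, 24, 33)),
  third  = data.frame(x = c(5, 20, 40, 80), y = c(90, 70, 52, 10))
)

# Fit one linear model per data frame and predict on x = 1..100
extrapolate_one <- function(d, x_new = 1:100) {
  fit <- lm(y ~ x, data = d)
  data.frame(x = x_new,
             .fitted = predict(fit, newdata = data.frame(x = x_new)))
}

predictions <- lapply(dfs, extrapolate_one)

# Stack the results, keeping the list name as an id column
all_predictions <- do.call(rbind, lapply(names(predictions), function(id) {
  cbind(df_id = id, predictions[[id]])
}))
```

The stacked all_predictions frame has the same shape as all_dfs_predictions (minus the observed y), so it could be passed to the same geom_line(aes(y = .fitted)) call with group = df_id.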

How to plot a(n unknown) number of data series as geom_line in same chart

My first Q here, so please go lightly if I'm out of step anywhere.
I'm trying to code R to produce a single chart that contains a number of data series lines. The number of data series may vary but will be provided in the data frame. I have tried to rearrange another thread's content to print the geom_line, but without success.
The logic is:
#desire to replace loop of 1:5 with ncol(df)
print(ggplot(df,aes(x=time))
for (i in 1:5) {
print (+ geom_line(aes(y=df[,i]))
}
#functioning geom point loops ggplot production:
for (i in 1:5) {
print(ggplot(df,aes(x=time,y=df[,i]))+geom_point())
}
#functioning multi-line ggplot where n is explicit:
ggplot(data=df, aes(x=time), group=1) +
geom_line(aes(y=df$`3`))+
geom_line(aes(y=df$`4`))
The functioning example code produces n number of point charts, 5 in this case. I would like just one chart to contain n line series.
This may be similar to How to plot n dimensional matrix? for which there are currently no relevant answers
Any contributions much appreciated, thanks
You can use gather from the tidyverse to do that.
As you didn't supply sample data, I used mtcars.
I created two data frames, one with 3 columns besides mpg and one with 9. In each of them I plotted all of the variables against mpg.
library(tidyverse)
df3Columns <- mtcars[, 1:4]
df9Columns <- mtcars[, 1:10]
df3Columns %>%
gather(var, value, -mpg) %>%
ggplot(aes(mpg, value, group = var, color = var)) +
geom_line()
df9Columns %>%
gather(var, value, -mpg) %>%
ggplot(aes(mpg, value, group = var, color = var)) +
geom_line()
Edit - using the sample data in comments.
library(tidyverse)
df %>%
rownames_to_column("time") %>%
gather(var, value, -time) %>%
ggplot(aes(time, value, group = var, color = var)) +
geom_line()
Sample data:
df <- structure(list("39083" = c(96, 100, 100), "39090" = c(99, 100, 100), "39097" = c(99, 100, 100)), row.names = 3:5, class = "data.frame")
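In current tidyr, gather() is superseded by pivot_longer(). A sketch of the same reshaping with it, assuming tidyr 1.0 or later and using the sample data above (the time column is built from the row names here, which is an assumption about what they represent):

```r
library(tidyr)

df <- structure(list("39083" = c(96, 100, 100), "39090" = c(99, 100, 100),
                     "39097" = c(99, 100, 100)),
                row.names = 3:5, class = "data.frame")

# Keep the row names as a numeric time column, then go from wide to long
df$time <- as.integer(rownames(df))
long <- pivot_longer(df, cols = -time, names_to = "var", values_to = "value")
long
```

The result has one row per (time, var) pair and can be passed to ggplot(aes(time, value, group = var, color = var)) + geom_line() exactly as before.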
To strictly answer your question, you can simply store your ggplot in a variable and add the geom_line one by one:
df <- structure(list("39083" = c(96, 100, 100), "39090" = c(99, 100, 100), "39097" = c(99, 100, 100)), row.names = 3:5, class = "data.frame")
g <- ggplot(df, aes(x = 1:nrow(df)))
for (i in colnames(df))
{
g <- g + geom_line(y = df[,i])
}
g <- g + scale_y_continuous(limits = c(min(df), max(df)))
print(g)
However, this is not a very convenient solution. I would highly recommend reshaping your data frame into the long format that ggplot is designed for.
df.ultimate <- data.frame(time = numeric(), value = numeric(), group = character())
for (i in colnames(df))
{
df.ultimate <- rbind(df.ultimate, data.frame(time = 1:nrow(df), value = df[, i], group = i))
}
g <- ggplot(df.ultimate, aes(x = time, y = value, color = group))
g <- g + geom_line()
print(g)
A one-line solution:
ggplot(data.frame(time = rep(1:nrow(df), ncol(df)),
value = as.vector(as.matrix(df)),
group = rep(colnames(df), each = nrow(df))),
aes(x = time, y = value, color = group)) + geom_line()
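Base R's stack() performs the same wide-to-long step and may read more easily than the matrix conversion. A minimal sketch, assuming every column of df is a plain numeric vector (the renaming at the end only matches the column names used above):

```r
df <- structure(list("39083" = c(96, 100, 100), "39090" = c(99, 100, 100),
                     "39097" = c(99, 100, 100)),
                row.names = 3:5, class = "data.frame")

# stack() returns two columns: values (the data) and ind (the source column)
long <- stack(df)
long$time <- rep(seq_len(nrow(df)), times = ncol(df))
names(long) <- c("value", "group", "time")
long
```

Here group is a factor whose levels follow the original column order, which also fixes the legend order in ggplot.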

geom_density plots with nested vectors

I have a data frame with a nested vector in one column. Any ideas how to ggplot a geom_density using the values from the nested vector?
If I use pivot_longer on the entire data frame, I get 25 million rows, so I'd prefer to avoid that if possible.
library(ggplot2)
df = data.frame(a = rep(letters[1:5],length.out = 100), b = sample(LETTERS, 100, replace = T))
df[["c"]] = purrr::map(1:100, function(x) rnorm(100))
# works but too heavy for the actual implementation
ggplot(tidyr::unnest(df, c), aes(c, group = a)) + geom_density() + facet_wrap(vars(b))
# doesn't work
ggplot(df, aes(c, group = a)) + geom_density() + facet_wrap(vars(b))
Different solution: Prepare each plot separately and rearrange your plots afterwards using gridExtra package.
library(ggplot2)
df = data.frame(a = rep(letters[1:5],length.out = 100), b = sample(LETTERS, 100, replace = T))
df[["c"]] = purrr::map(1:100, function(x) rnorm(100))
lst_plot <- lapply(sort(unique(df$b)), function(x){
data <- df[df$b == x, ]
data <- purrr::map_dfr(seq(length(data$a)), ~ data.frame(a = data$a[.x], c = data$c[.x][[1]]))
gg <- ggplot(data) +
geom_density(aes(c, group = a)) +
ylab(NULL)
return(gg)
})
gridExtra::grid.arrange(grobs = lst_plot, ncol = 6, left = "density")
To be honest, I'm not sure how well this works with your massive dataset...
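Another direction that might scale better is to compute the densities yourself and plot the resulting curves, so that ggplot only ever sees a fixed number of points per group instead of every raw value. The sketch below is an assumption-laden illustration: it pools the nested vectors per (a, b) group, uses stats::density() with its default bandwidth, and evaluates each curve at 128 points (an arbitrary choice); dens and groups are names made up here.

```r
set.seed(42)  # assumption: fixed seed for reproducibility
df <- data.frame(a = rep(letters[1:5], length.out = 100),
                 b = sample(LETTERS[1:4], 100, replace = TRUE))
df$c <- lapply(1:100, function(i) rnorm(100))

# One data frame per (a, b) combination that actually occurs
groups <- split(df, list(df$a, df$b), drop = TRUE)

# Pool the nested vectors of each group and keep only the density curve
dens <- do.call(rbind, lapply(groups, function(g) {
  d <- density(unlist(g$c), n = 128)
  data.frame(a = g$a[1], b = g$b[1], x = d$x, y = d$y)
}))

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dens, aes(x, y, group = a)) +
    geom_line() +
    facet_wrap(vars(b))
  print(p)
}
```

The plotted data frame has 128 rows per group regardless of how long the nested vectors are, although the pooled values of one group still have to fit in memory while its density is computed.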

Sorting factors in multipanel plot in ggplot2 according to the first panel

Is it possible to sort factors in a multipanel plot in ggplot2 according to the first panel? The first panel decides the order and the remaining panels follow that order.
Here is an example:
require(ggplot2)
set.seed(36)
xx<-data.frame(YEAR=rep(c("X","Y"), each=20),
CLONE=rep(c("A","B","C","D","E"), each=4, 2),
TREAT=rep(c("T1","T2","T3","C"), 10),
VALUE=sample(c(1:10), 40, replace=T))
ggplot(xx, aes(x=CLONE, y=VALUE, fill=YEAR)) +
geom_bar(stat="identity", position="dodge") +
facet_wrap(~TREAT)
Which gives me this plot:
Now I would like to sort CLONE based on the VALUE in YEAR X in a descending order (highest to lowest) but only for the Control (C panel). This order should then be maintained for T1, T2, and T3. By looking at the plot above, I want panel C sorted as CLONE C, B or D (both are 5), A and E. This order of CLONE should then be replicated for the remaining panels.
There's no easy way to do this directly in ggplot, since you have to reorder CLONE by three conditions (TREAT, YEAR and VALUE); otherwise forcats::fct_reorder2 could have been an option.
Instead, extract the order of CLONE from the subset of data corresponding to YEAR = "X", TREAT = "C", and redefine the factor levels for the whole data set based on this subset.
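The idea can be seen on a tiny made-up data frame first (toy, ref and the values are purely illustrative, and YEAR is left out to keep it short):

```r
toy <- data.frame(CLONE = c("A", "B", "C", "A", "B", "C"),
                  TREAT = c("C", "C", "C", "T1", "T1", "T1"),
                  VALUE = c(2, 9, 5, 7, 1, 3))

# Order the clones by VALUE (descending) within the reference panel only
ref <- toy[toy$TREAT == "C", ]
clone_order <- ref$CLONE[order(-ref$VALUE)]

# Apply that order to the whole data set via the factor levels
toy$CLONE <- factor(toy$CLONE, levels = clone_order)
levels(toy$CLONE)
# "B" "C" "A"
```

Because the order lives in the factor levels, every facet drawn from toy uses it, even though panel T1 on its own would sort differently.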
library("ggplot2")
library("dplyr")
set.seed(36)
xx <- data.frame(YEAR = rep(c("X","Y"), each = 20),
CLONE = rep(c("A","B","C","D","E"), each = 4, 2),
TREAT = rep(c("T1","T2","T3","C"), 10),
VALUE = sample(c(1:10), 40, replace = TRUE), stringsAsFactors = FALSE)
clone_order <- xx %>% subset(TREAT == "C" & YEAR == "X") %>%
arrange(-VALUE) %>% select(CLONE) %>% unlist()
xx <- xx %>% mutate(CLONE = factor(CLONE, levels = clone_order))
ggplot(xx, aes(x = CLONE, y = VALUE, fill = YEAR)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~TREAT)
giving

How to make one bar-chart from different data-frames with same format?

I have three different data frames that have the same format, and I cannot combine them because each one represents a different data source. I would like to show the percentage of one variable for the different data frames in one bar chart.
I can get a bar chart for column 1 of one data frame by using:
ggplot(baseline, aes(x = c1)) +
geom_bar(aes(y = (..count..)/sum(..count..)),fill="blue",colour="blue") +
geom_text(aes(y = ((..count..)/sum(..count..)), label=scales::percent((..count..)/sum(..count..))), stat = "count")
I want output similar to this plot (except that I am showing the percentage of each category), where race would be the names of the different data frames and factor would be the values of column 1 of the data frames.
I do not use ggplot2 but here is an illustration of how to accomplish what you want. It will be easiest to add a column to your data.frames indicating the source of each data.frame. Then calculate whatever metric you want, by source, then plot. Alternatively, you could calculate the metrics first, then combine the data.frames.
library(RColorBrewer)
library(data.table)
set.seed(1234)
make_data <- function() {
n <- sample(5:10, 1)
data.frame(id = rep(c("A", "B", "C"), each = n),
vals = c(rnorm(n, 5, 1), rnorm(n, 10, 1), rnorm(n, 15, 1)))
}
df1 <- make_data()
df2 <- make_data()
df3 <- make_data()
df4 <- make_data()
df1$src <- "source1"
df2$src <- "source2"
df3$src <- "source3"
df4$src <- "source4"
dat <- do.call(rbind, list(df1, df2, df3, df4))
dat <- as.data.table(dat)
res <- dat[ , mean(vals), by = list(id, src)][order(id)]
barplot(height = res$V1, col = rep(brewer.pal(4, "Set1"), 3))
EDIT
Here is the ggplot2 code provided by Sumedh:
library(ggplot2)
ggplot(res, aes(x = id, y = V1, fill = src)) +
geom_bar(stat = "identity", position = "dodge")
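Since the question asks for percentages of a categorical column rather than means, the same "add a source column, summarise, plot" idea can be sketched with prop.table(). All data below is invented for illustration; baseline is the name from the question, while midline, endline, combined and shares are hypothetical:

```r
# Hypothetical data frames sharing a categorical column c1
baseline <- data.frame(c1 = c("x", "x", "y", "z"))
midline  <- data.frame(c1 = c("x", "y", "y", "y", "z"))
endline  <- data.frame(c1 = c("y", "z", "z"))

# Stack them, recording which data frame each row came from
sources <- list(baseline = baseline, midline = midline, endline = endline)
combined <- do.call(rbind, lapply(names(sources), function(s) {
  data.frame(src = s, c1 = sources[[s]]$c1)
}))

# Share of each c1 category within each source (shares sum to 1 per source)
shares <- as.data.frame(prop.table(table(src = combined$src,
                                         c1 = combined$c1),
                                   margin = 1),
                        responseName = "share")

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(shares, aes(x = c1, y = share, fill = src)) +
    geom_bar(stat = "identity", position = "dodge") +
    scale_y_continuous(labels = function(v) paste0(round(100 * v), "%"))
  print(p)
}
```

Using margin = 1 normalises within each source, so data frames of different sizes remain comparable.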
