I am working with the R programming language.
I simulated the following data:
set.seed(123)
myFun <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
name = myFun(400)
variable = rnorm(400, 50,10)
part1 = data.frame(name,variable)
name = myFun(10)
variable = rnorm(10, 130,10)
part2 = data.frame(name,variable)
final = rbind(part1, part2)
final$name = substr(final$name,1,3)
Then, I made a barplot of this data:
library(ggplot2)
p<-ggplot(data=final, aes(x=name, y=variable)) +
geom_bar(stat="identity") + ggtitle(" Title of Barplot")
Is there a way I can take the "names" with the ten largest values of "variable" and place them as labels on top of the corresponding bars?
I saw this link over here that shows how to do this for all bars (https://www.geeksforgeeks.org/how-to-add-labels-over-each-bar-in-barplot-in-r/) - but is there a way to do this only for the 10 largest bars?
Thanks!
Note: Is there some way to better "center" the labels and prevent them from overlapping?
Another option is geom_text_repel() from the ggrepel package, which offers many options for positioning labels and pushes overlapping labels apart. Here is a reproducible example:
set.seed(123)
myFun <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
name = myFun(400)
variable = rnorm(400, 50,10)
part1 = data.frame(name,variable)
name = myFun(10)
variable = rnorm(10, 130,10)
part2 = data.frame(name,variable)
final = rbind(part1, part2)
library(dplyr)
# Create subset dataframe
final_10 <- final %>%
arrange(desc(variable)) %>%
slice(1:10)
library(ggplot2)
library(ggrepel)
ggplot(data=final, aes(x=name, y=variable)) +
geom_bar(stat="identity") +
geom_text_repel(data = final_10, aes(x = name, y = variable, label = variable), size = 2, segment.color = "grey50") +
ggtitle(" Title of Barplot")
Created on 2022-08-24 with reprex v2.0.2
Here you go: just create a new variable that holds the name for the top 10 rows and NA otherwise.
# Setup
library(tidyverse)
# Create a label column
final <-
final %>%
arrange(desc(variable)) %>%
mutate(label = ifelse(row_number() <= 10, name, NA))
# Add geom_text() for label
p <-
ggplot(
data = final,
aes(
x = name,
y = variable,
label = label)) +
geom_bar(stat="identity") +
geom_text() +
ggtitle(" Title of Barplot")
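As a complement to the two answers above, here is a minimal sketch that labels only the ten tallest bars with their names and centres each label just above its bar. The data frame is a hypothetical stand-in for the question's data with unique names (an assumption, so that bars are not stacked), and vjust and check_overlap are one possible way to handle centring and overlap, not the only one.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the question's data: 400 ordinary rows, 10 large ones
set.seed(123)
final <- data.frame(
  name     = sprintf("N%03d", 1:410),
  variable = c(rnorm(400, 50, 10), rnorm(10, 130, 10))
)

# Keep only the 10 rows with the largest values
top10 <- final %>% slice_max(variable, n = 10, with_ties = FALSE)

p <- ggplot(final, aes(x = name, y = variable)) +
  geom_col() +                          # equivalent to geom_bar(stat = "identity")
  geom_text(data = top10, aes(label = name),
            vjust = -0.5, size = 3,     # horizontally centred, just above the bar
            check_overlap = TRUE) +     # drop labels that would overlap earlier ones
  ggtitle("Title of Barplot")
```

If labels still collide, swapping geom_text() for ggrepel::geom_text_repel() with the same data and mapping is a reasonable next step.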
I am trying to run a for loop over a factor level (treatment in this case) to plot graphs for each of the levels using a function. My goal is to obtain several graphs on my wd(), one for each treatment level.
Problem: The outcome is always one single messed up barplot with all the variables and errorbars included.
dataset looks something like this:
set.seed(108)
test <- data.frame(
n = 1:12,
treatment = factor(paste("trt", 1:2)),
rep= factor(paste("rep", 1:2)),
type = sample(LETTERS, 3),
mean= sample(1:100, 12),
sd= sample(1:50, 12),
var3 = sample(1:100, 12),
var4 = sample(1:100, 12))
I believe that I'm missing something on my for loop code:
df$treatment<- as.factor(df$treatment)
treatment_levels<- unique(levels(df$treatment))
for(i in 1:length(treatment_levels)){
df <- df[treatment_levels[i],]
x <- df$type
avg <- df$mean
sd <- df$sd
grp<- df$rep
title<- treatment_levels[i]
xtitle<- "type"
ytitle<- " "
fig_name <- paste(title,"_bp")
bpfunction(df, x, avg, sd, grp, title, xtitle, ytitle, fig_name)
}
my function to plot a barplot is:
bpfunction(df, x, avg, sd, grp, title, xtitle, ytitle, fig_name)
{
bp <- ggplot(df, aes(x = x, y = avg, fill = grp)) +
geom_bar(stat = 'identity', aes(fill = grp), size = 1) +
geom_errorbar(aes(ymin=avg-sd , ymax=avg +sd))+
labs(x = x, y = avg, title = title)
ggsave(paste(fig_name, "png", sep = "."), plot = bp)
}
Several issues with your implementation:
Currently, you index rows by the factor level's value rather than by a logical condition:
df <- df[treatment_levels[i],]
Therefore, adjust to proper filtering by column:
sub_df <- df[df$treatment == treatment_levels[i],]
Better yet, avoid bookkeeping of unique treatment factor levels and use by (object-oriented wrapper to tapply) or split + lapply.
Reassigning df in a for loop. After the first iteration, df can no longer be subsetted for other treatment values. Therefore, use a different object name. (Actually, avoid df altogether in favor of a more descriptive name.)
Using a numeric vector of many values as labels per your labs argument.
Avoid passing vectors pointing to data frame columns into aes. Instead, pass column names as strings and look them up dynamically with .data[[]] (ggplot2 v3.0.0+). The double curly brace {{}} is meant for unquoted column arguments, not strings, so it is not used here:
bpfunction <- function(treatment_df, x, avg, sd, grp, title, xtitle, ytitle, fig_name)
{
  # x, avg, sd and grp are column names given as strings
  bp <- ggplot(treatment_df, aes(x = .data[[x]], y = .data[[avg]], fill = .data[[grp]])) +
    geom_bar(stat = 'identity', aes(fill = .data[[grp]]), size = 1) +
    geom_errorbar(aes(ymin = .data[[avg]] - .data[[sd]], ymax = .data[[avg]] + .data[[sd]])) +
    labs(x = x, y = avg, title = title)
  ggsave(fig_name, plot = bp)
  return(bp)
}
# REFACTOR USING by
treatment_plots <- by(df, df$treatment, function(sub_df)
bpfunction(
sub_df,
x = "type",
avg = "mean",
sd = "sd",
grp = "rep",
title = as.character(sub_df$treatment[1]),
xtitle = "type",
ytitle = " ",
fig_name = paste0(sub_df$treatment[1], "_bp.png")
)
)
# REFACTOR USING split + lapply
treatment_plots <- split(df, df$treatment) |> lapply(
function(sub_df) bpfunction(...same as above...)
)
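For readers who prefer to keep the for loop, the fixes listed above can be sketched as follows. This is an illustration under stated assumptions, not the answerer's code: bp_sketch is a hypothetical helper, the data frame is a simplified version of the question's test data, the bars are dodged so each error bar sits on its own bar, and files are written to a temporary directory.

```r
library(ggplot2)

set.seed(108)
test <- data.frame(
  treatment = factor(rep(c("trt1", "trt2"), each = 6)),
  rep       = factor(rep(c("rep1", "rep2"), times = 6)),
  type      = rep(c("A", "B", "C"), each = 2, times = 2),
  mean      = sample(1:100, 12),
  sd        = sample(1:50, 12)
)

# Hypothetical helper: column names arrive as strings and are looked up via .data[[ ]]
bp_sketch <- function(plot_df, x, avg, sd, grp, title) {
  ggplot(plot_df, aes(x = .data[[x]], y = .data[[avg]], fill = .data[[grp]])) +
    geom_col(position = position_dodge(width = 0.9)) +
    geom_errorbar(aes(ymin = .data[[avg]] - .data[[sd]],
                      ymax = .data[[avg]] + .data[[sd]]),
                  position = position_dodge(width = 0.9), width = 0.3) +
    labs(x = x, y = avg, title = title)
}

out_dir <- tempdir()
saved <- character(0)
for (lvl in levels(test$treatment)) {
  # Logical filter into a new object; `test` itself is never overwritten
  sub_df <- test[test$treatment == lvl, ]
  p <- bp_sketch(sub_df, "type", "mean", "sd", "rep", title = lvl)
  path <- file.path(out_dir, paste0(lvl, "_bp.png"))
  ggsave(path, plot = p, width = 5, height = 4)
  saved <- c(saved, path)
}
```

Looping over the level names directly removes the need to index treatment_levels[i] and makes each file name follow from the level being plotted.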
Basically, I have a heatmap that contains some points. What I'm trying to do is automatically rescale the size of the points in a sensible way for different-sized heatmaps. For example, if I have a heatmap that looks like so:
library(reshape)
library(ggplot2)
library(ggnewscale)
# Create matrix
set.seed(1701)
a <- sample(1:10,100, replace=TRUE)
s <- matrix(a, nrow = 5, ncol=5)
s[upper.tri(s)] = t(s)[upper.tri(s)]
rownames(s) <- colnames(s) <- paste0("x", 1:5)
diag(s) <- 0
sDf <- melt(s)
# create diagonal values
diagDf <- data.frame(
var1 = c(paste0("x", 1:5)),
var2 = c(paste0("x", 1:5)),
val = c(2,5,3,1,5)
)
# make plot
ggplot(sDf, aes(X1,X2)) +
geom_tile(aes(fill = value)) +
scale_fill_gradientn(colors = rev(colorspace::sequential_hcl(palette = "Blues 3", n = 100))) +
new_scale_fill() +
geom_point(data = diagDf, aes(var1, var2, col = val), size = 20) +
theme(aspect.ratio = 1)
So in the image above, the diagonal contains geom_points and their size is manually set to size = 20.... This works for this example, but the issue is:
If the heatmap dimensions were changed to say 20x20, then having the size hardcoded to equal 20 won't work due to overlapping & the points being too big etc.
So what I'm trying to do is come up with a method that will automatically resize the points so that they effectively fill the square they are contained in, without overlapping or being too big or too small.
Any suggestions as to how I could do this?
I would do something like this:
library(reshape)
library(ggplot2)
library(ggnewscale)
n <- 5
# Create matrix
set.seed(1701)
a <- sample(1:10, n^2, replace = TRUE)  # draw n^2 values so the matrix is filled for any n
s <- matrix(a, nrow = n, ncol=n)
s[upper.tri(s)] = t(s)[upper.tri(s)]
rownames(s) <- colnames(s) <- paste0("x", 1:n)
diag(s) <- 0
sDf <- melt(s)
# create diagonal values
diagDf <- data.frame(
var1 = c(paste0("x", 1:n)),
var2 = c(paste0("x", 1:n)),
val = sample(1:5,n,replace = T)
)
# make plot
ggplot(sDf, aes(X1,X2)) +
geom_tile(aes(fill = value)) +
scale_fill_gradientn(colors = rev(colorspace::sequential_hcl(palette = "Blues 3", n = 100))) +
new_scale_fill() +
geom_point(data = diagDf, aes(var1, var2, col = val), size = 1/sqrt(nrow(sDf))*80) +
theme(aspect.ratio = 1)
Here the size of the points depends on the dimension of the matrix.
An example of the output with a 3x3, 5x5, and 10x10 matrix:
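Because sDf has n^2 rows for an n x n matrix, the size expression in the plot code above, 1/sqrt(nrow(sDf))*80, reduces to 80/n. A small sketch of that arithmetic follows; point_size is a hypothetical helper and the constant 80 is simply the value used in this answer. As far as I know, ggplot2 measures point size in absolute units (mm), so the constant would likely need retuning if the rendered plot is much larger or smaller.

```r
# Hypothetical helper: k is a tuning constant (80 is the value used in the answer)
point_size <- function(n, k = 80) k / n

# Sizes for 3x3, 5x5, 10x10 and 20x20 heatmaps
sizes <- sapply(c(3, 5, 10, 20), point_size)
print(round(sizes, 2))
```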
You can modify diagDf to contain the co-ordinates of the circles you want to plot using some basic trigonometry, then plot them as filled polygons. This ensures they will always scale exactly with your plot.
library(dplyr)
diagDf <- diagDf %>%
  mutate(var1 = as.numeric(as.factor(var1)),
         var2 = as.numeric(as.factor(var2))) %>%
  split.data.frame(diagDf$var1) %>%
  lapply(function(x) {
    deg <- seq(0, 2 * pi, length = 100)
    var1 <- cos(deg)/2.2
    var2 <- sin(deg)/2.2
    val <- rep(x$val, 100)
    # id keeps circles that happen to share the same val as separate polygons
    data.frame(var1 = var1 + x$var1, var2 = var2 + x$var2, val = val, id = x$var1)}) %>%
  {do.call(rbind, .)}
Now with slightly modified plot code, we get:
ggplot(sDf, aes(X1,X2)) +
  geom_tile(aes(fill = value)) +
  scale_fill_gradientn(colors=rev(colorspace::sequential_hcl(palette = "Blues 3", n=100))) +
  new_scale_fill() +
  geom_polygon(data = diagDf, aes(var1, var2, fill = val, group = id)) +
  theme(aspect.ratio = 1)
Created on 2021-09-27 by the reprex package (v2.0.0)
I would like to be able to plot each of "x1 by grpA", "x2 by grpA", "x3 by grpA", "x1 by grpB", "x2 by grpB", and "x3 by grpB" using ggplot2::ggplot() in conjunction with a for loop.
So far, I can get it to almost work, but the argument for the column of the grouping variable in the facet_grid() function does not resolve correctly when I try to use tidy_eval properties. It does work, however, when I type the column name explicitly, but of course, having to type the name explicitly would make it so I would not be able to dynamically change the grouping variable.
I provide the following data-set returned by the following code snippet to give context to my question:
set.seed(1)
dfr <- tibble(x1 = factor(sample(letters[1:7], 50, replace = T), levels=letters[1:7]),
x2 = factor(sample(letters[1:7], 50, replace = T), levels=letters[1:7]),
x3 = factor(sample(letters[1:7], 50, replace = T), levels=letters[1:7]),
grpA = factor(sample(c("grp1","grp2"),50, prob=c(0.3, 0.7) ,replace=T), levels = c("grp1", "grp2")),
grpB = factor(sample(c("grp1","grp2"),50, prob=c(0.6, 0.4) ,replace=T), levels = c("grp1", "grp2"))
)
head(df)
I also provide a function that creates the plotting data I need to make the grouped plots. It accepts strings as arguments for the parameters 'groupvar' and 'mainvar':
plot_data_prepr <- function(dat, groupvar, mainvar){
groupvar <- sym(groupvar)
mainvar <- sym(mainvar)
plot_data <- dat %>%
group_by(!!groupvar) %>%
count(!!mainvar, .drop = F) %>% drop_na() %>%
mutate(pct = n/sum(n),
pct2 = ifelse(n == 0, 0.005, n/sum(n)),
grp_tot = sum(n),
pct_lab = paste0(format(pct*100, digits = 1),'%'),
pct_pos = pct2 + .02)
return(plot_data)
}
Here is normal usage of the function:
plot_data_prepr(dat = dfr, groupvar = "grpA", mainvar = "x1")
Now I share my for loop that fails when I try to use tidy_eval in the facet_grid() function in the context of ggplot(); the returned error = "Error in !sgvar : invalid argument type"
"FAILING EXAMPLE:"
for (i in seq_along(names(dfr)[1:3])){
mvar <- names(dfr)[i]
print(mvar)
gvar <- names(dfr[4])
print(gvar)
smvar <- sym(mvar)
sgvar <- sym(gvar)
plot <- ggplot(data=plot_data_prepr(dfr, gvar, mvar),
mapping = aes(x=!!smvar, y = pct2, fill = !!smvar)) +
geom_bar(stat = 'identity') +
ylim(0,1) +
geom_text(aes(x=!!smvar, label=pct_lab, y = pct_pos + .02)) +
facet_grid(. ~ !!sgvar) +
ggtitle(paste0(mvar," by ",gvar))
print(plot)
}
When I run the loop by explicitly typing grpA in place of !!sgvar in the facet_grid() function, it works for some reason:
"FUNCTIONING BUT NOT WHAT I WANT EXAMPLE:"
for (i in seq_along(names(dfr)[1:3])){
mvar <- names(dfr)[i]
print(mvar)
gvar <- names(dfr[4])
print(gvar)
smvar <- sym(mvar)
sgvar <- sym(gvar)
plot <- ggplot(data=plot_data_prepr(dfr, gvar, mvar),
mapping = aes(x=!!smvar, y = pct2, fill = !!smvar)) +
geom_bar(stat = 'identity') +
ylim(0,1) +
geom_text(aes(x=!!smvar, label=pct_lab, y = pct_pos + .02)) +
facet_grid(. ~ grpA) +
ggtitle(paste0(mvar," by ",gvar))
print(plot)
}
Of course, if I wanted to loop through a set of grouping variables, then needing to explicitly type each one would not allow for looping. Could someone explain why my code with the 'bang bang' operator inside facet_grid() doesn't work properly in the 'FAILING EXAMPLE' and also suggest how to remedy this error?
Thank you.
It's difficult to piece together exactly what you're looking for, since your example code has errors, unassigned variable names and pieces of code missing. However, I think you're wanting the loop to print all of the pairs of grouping variables and main variables by cycling through the names of your data frame.
So that there is no dubiety, here is a full reprex:
Load packages and create reproducible data:
library(dplyr)
library(ggplot2)
set.seed(1)
df <- tibble(x1 = factor(sample(letters[1:7], 50, replace = TRUE)),
x2 = factor(sample(letters[1:7], 50, replace = TRUE)),
x3 = factor(sample(letters[1:7], 50, replace = TRUE)),
grpA = factor(sample(c("grp1", "grp2"), 50,
prob = c(0.3, 0.7), replace=TRUE)),
grpB = factor(sample(c("grp1", "grp2"), 50,
prob = c(0.6, 0.4), replace=TRUE)))
Define data preparation function
plot_data_prepr <- function(dat, groupvar, mainvar)
{
groupvar <- sym(groupvar)
mainvar <- sym(mainvar)
plot_data <- dat %>%
group_by(!!groupvar) %>%
count(!!mainvar, .drop = F) %>% tidyr::drop_na() %>%
mutate(pct = n/sum(n),
pct2 = ifelse(n == 0, 0.005, n/sum(n)),
grp_tot = sum(n),
pct_lab = paste0(format(pct*100, digits = 1),'%'),
pct_pos = pct2 + .02)
return(plot_data)
}
Loop to create all 6 plots
for(gvar in names(df)[4:5]){
for(mvar in names(df)[1:3])
{
print(ggplot(plot_data_prepr(df, gvar, mvar),
aes(x = !!sym(mvar), y = pct2, fill = !!sym(mvar))) +
geom_bar(stat = 'identity') +
ylim(0,1) +
geom_text(aes(label=pct_lab, y = pct_pos + .02)) +
facet_grid(as.formula(paste0(".~", gvar))) +
ggtitle(paste0(mvar, " by ", gvar))
)
}
}
Created on 2020-06-30 by the reprex package (v0.3.0)
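As a possible alternative to building a formula from a string, facet_grid() also accepts facet variables wrapped in vars(), and vars() supports !! unquoting. My understanding is that a formula is not a quasiquotation context, so in . ~ !!sgvar the ! ends up being evaluated as ordinary logical negation on a symbol, which would explain the "invalid argument type" error. The sketch below uses a small hypothetical data frame (toy) rather than the question's data.

```r
library(ggplot2)

set.seed(1)
toy <- data.frame(
  x1   = factor(sample(letters[1:3], 30, replace = TRUE)),
  grpA = factor(sample(c("grp1", "grp2"), 30, replace = TRUE))
)

gvar <- "grpA"  # grouping column supplied as a string

p <- ggplot(toy, aes(x = x1, fill = x1)) +
  geom_bar() +
  facet_grid(cols = vars(!!rlang::sym(gvar))) +  # unquote the symbol inside vars()
  ggtitle(paste0("x1 by ", gvar))

# One panel per level of the grouping column
panels <- ggplot_build(p)$layout$layout
```

Inside the nested loops above, the same call would replace the as.formula() line, with gvar taken from the loop variable.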
There is a good discussion about using ggplot in a loop, and other creative approaches, at "Looping over variables in ggplot". However, the discussion does not quite solve my problem.
I have a vertical dataset that I need to create plots from in a loop. There is no error in the code but my code only prints the last plot. Can't figure out why. Here is a reproducible example:
df <- cbind.data.frame(var = sample(c('a','b'), size = 100, replace = TRUE),
grp = sample(c('x','y'), size = 100, replace = TRUE), value = rnorm(100))
for (i in 2) {
plot.df <- df[which(df$var == c('a','b')[i]),]
print(ggplot(plot.df, aes(x = 1:nrow(plot.df), y = value, color = grp)) +
geom_line() + ggtitle(c('a','b')[i]))
}
As an alternative, you might also consider using lapply, as it makes the code a lot more readable.
If I am not mistaken you want to produce plots for each of the levels of the variable var.
You can first define your function and then apply it to all levels:
my_plot <- function(x){
# debug: x <- "a"
plot.df <- df[df$var %in% x,]
ggplot(plot.df, aes(x = 1:nrow(plot.df), y = value, color = grp)) +
geom_line() + ggtitle(x)
}
lapply(unique(df$var), my_plot)
The comment by @EJJ is correct: your loop isn't iterating over both values, because for (i in 2) runs the body only once, with i = 2. You need something like
for (i in seq_along(1:nlevels(factor(df$var))))
library(ggplot2)
library(dplyr)
df <- cbind.data.frame(var = sample(c('a','b'), size = 100, replace = TRUE),
grp = sample(c('x','y'), size = 100, replace = TRUE), value = rnorm(100))
for (i in seq_along(1:nlevels(factor(df$var)))) {
plot.df <- df[which(df$var == c('a','b')[i]),]
print(ggplot(plot.df, aes(x = 1:nrow(plot.df), y = value, color = grp)) +
geom_line() + ggtitle(c('a','b')[i]))
}
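To make the difference concrete, here is a small sketch that counts how often the original loop header runs and then loops over the values of var directly, which avoids the index bookkeeping altogether. The seed and the plots list are additions for illustration only.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(var   = sample(c("a", "b"), size = 100, replace = TRUE),
                 grp   = sample(c("x", "y"), size = 100, replace = TRUE),
                 value = rnorm(100))

# `for (i in 2)` iterates over the single value 2, so the body runs once
runs_bad <- 0
for (i in 2) runs_bad <- runs_bad + 1

# Looping over the values themselves produces one plot per value
plots <- list()
for (v in sort(unique(df$var))) {
  plot.df <- df[df$var == v, ]
  plots[[v]] <- ggplot(plot.df, aes(x = seq_along(value), y = value, color = grp)) +
    geom_line() + ggtitle(v)
  print(plots[[v]])  # an explicit print() is needed to draw inside a loop
}
```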
I'm trying to replicate this histogram in R.
Here is how to mock my dataset:
dft <- data.frame(
menutype = sample(c(1,2,4,5,6,8,12), 120, replace = T),
Belief = sample(c(0,1), 120, replace = T),
Choice = sample(c(0,1), 120, replace = T)
)
Here is my code :
library(ggplot2)
library(dplyr)
library(tidyr)
library(MASS)
df <- data.frame(
menutype = factor(df$menutype, labels = c("GUILT" , "SSB0", "SSB1", "FLEX0", "FLEX1", "STD", "FLEX01"),
levels = c(1,2,4,5,6,8,12)),
Belief = factor(df$belieflearn, levels = c(1), labels= c("Believe Learn")), #Interested only in this condition
Choice = factor(df$learned, levels = c(1), labels= c("Learn")) #Same here
)
df1 <- rbind(na.omit(df %>%
count(Belief, menutype) %>%
group_by(menutype) %>%
mutate(prop = n / sum(n))),
na.omit(df %>%
count(Choice, menutype) %>%
group_by(menutype) %>%
mutate(prop = n / sum(n))))
test <- paste(df1$Belief[1:6],paste(df1$Choice[7:13]))
test[1:6] <- paste(df1$Belief[1:6])
test[7:13] <- paste(df1$Choice[7:13])
df1$combine <- paste(test)
ggplot(data = df1, aes(menutype, prop, fill = combine)) +
labs(title = "Classification based on rank ordering\n", x = "", y = "Fraction of subjects", fill = "\n") +
geom_bar(stat = "identity", position = "dodge")+
theme_bw() +
theme(legend.position="bottom", plot.title = element_text(hjust = 0.5)) #Centering of the main title+
#geom_text(aes(label="ok"), vjust=-0.3, size=3.5)+
The problem is that it more or less works: I almost get the graph I want, but it is a workaround and there are still some errors. For example, I get the same value for both STD bars (0.10), whereas they should be 0 and 0.10 as in the original graph.
Ideally, I would like two separate data frames, one with menutype and Belief and the other with menutype and Choice. Then, as I did above, I would compute the proportion of a specific modality of each of those variables within each menutype, and finally plot the proportions as bars, much like the graph in the original study. Additionally, I'd like to show the proportions as fractions above each bar, but that is optional.
Could someone help me on this matter? I'm really struggling to get it working.
Thanks in advance!
EDIT: I think the issue is with the fill =. I would like to specify for each bar the variable I want (e.g, fill = df2$Belief & df2$Choice) but I don't know how to proceed.
library(tidyverse)
set.seed(10)
# example data frame
df <- data.frame(
menutype = sample(c(1,2,4,5,6,8,12), 120, replace = T),
Belief = sample(c(0,1), 120, replace = T),
Choice = sample(c(0,1), 120, replace = T)
)
# calculate all metrics based on all variables you want to plot in a tidy way
df_plot = df %>%
group_by(Choice) %>%
count(menutype, Belief) %>%
mutate(prop = n / sum(n),
prop_text = paste0(n, "/", sum(n))) %>%
ungroup()
# barplots using one variable and split plots using another variable
df_plot %>%
mutate(Belief = factor(Belief),
menutype = factor(menutype)) %>%
ggplot(aes(menutype, prop, fill = Belief))+
geom_col(position = "dodge")+
facet_wrap(~Choice, ncol=1)+
geom_text(aes(label=prop_text), position = position_dodge(0.9), vjust = -0.5)+ # 0.9 matches the default bar dodge width
ylim(0,0.2)
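If the goal is closer to the original request, with one bar per menutype for the share of subjects with Belief = 1 and one for the share with Choice = 1, a possible sketch is to reshape to long format and summarise per menutype and measure. This is an alternative under my own assumptions, not the approach above; the names measure, k, n and label are introduced here for illustration. A share of zero still produces a row, so a bar of height 0 is kept rather than dropped.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(10)
df <- data.frame(
  menutype = sample(c(1, 2, 4, 5, 6, 8, 12), 120, replace = TRUE),
  Belief   = sample(c(0, 1), 120, replace = TRUE),
  Choice   = sample(c(0, 1), 120, replace = TRUE)
)

# One row per (menutype, measure): the share of subjects coded 1
df_share <- df %>%
  pivot_longer(c(Belief, Choice), names_to = "measure", values_to = "value") %>%
  group_by(menutype, measure) %>%
  summarise(k = sum(value), n = n(), prop = k / n, .groups = "drop") %>%
  mutate(menutype = factor(menutype), label = paste0(k, "/", n))

p <- ggplot(df_share, aes(menutype, prop, fill = measure)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_text(aes(label = label),
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  labs(x = "", y = "Fraction of subjects", fill = "")
```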