I have the plot produced by the code below. I want to change the order so that the larger value comes first (so cyan would precede red), but I can't seem to do this. What am I doing wrong?
This is my current code:
ggplot(df, aes(x = Gene.Set.Size, y = OR, label =P.value, color = Method, group = Method)) +
geom_point(position=position_dodge(width=0.5)) +
ggrepel::geom_text_repel(size = 6, box.padding = 1, segment.angle = 20, position=position_dodge(width=0.5))+
geom_pointrange(aes(ymax = UpperCI, ymin = LowerCI),position=position_dodge(width=0.5)) +
theme_bw() +
theme(text=element_text(size=25),axis.text.x = element_text(angle = 45, hjust = 1)) +
ylab("Odds ratio") +
xlab("Gene set size") +
theme(plot.margin = unit(c(2,2,2,2), "cm"))
> dput(df)
structure(list(Method = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("MAGMA",
"Pairwise"), class = "factor"), P.value = c(8.74e-28, 1.33e-56,
5.57e-92, 1.63e-44, 4.23e-71, 2.78e-95), OR = c(1.39, 1.424668,
1.4, 1.513, 1.478208, 1.409563), UpperCI = c(1.481491, 1.487065,
1.446039, 1.601557, 1.417117, 1.455425), LowerCI = c(1.316829,
1.364601, 1.356358, 1.42, 1.541768, 1.365056), Gene.Set.Size = structure(c(1L,
2L, 3L, 1L, 2L, 3L), .Label = c("500", "1000", "2000"), class = "factor")), row.names = c(NA,
-6L), class = "data.frame")
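You must set the factor order. ggplot2 places and colors the dodged groups in the order of the factor levels, so make "Pairwise" the first level of Method.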
library(ggplot2)
df <- structure(list(Method = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("MAGMA",
"Pairwise"), class = "factor"), P.value = c(8.74e-28, 1.33e-56,
5.57e-92, 1.63e-44, 4.23e-71, 2.78e-95), OR = c(1.39, 1.424668,
1.4, 1.513, 1.478208, 1.409563), UpperCI = c(1.481491, 1.487065,
1.446039, 1.601557, 1.417117, 1.455425), LowerCI = c(1.316829,
1.364601, 1.356358, 1.42, 1.541768, 1.365056), Gene.Set.Size = structure(c(1L,
2L, 3L, 1L, 2L, 3L), .Label = c("500", "1000", "2000"), class = "factor")), row.names = c(NA,
-6L), class = "data.frame")
#reorder Factor
df$Method = factor(df$Method, levels=c("Pairwise", "MAGMA"))
ggplot(df, aes(x=Gene.Set.Size, y=OR, label=P.value,
group= Method, color=Method)) +
geom_point(position=position_dodge(width=0.5)) +
ggrepel::geom_text_repel(size = 6, box.padding = 1, segment.angle = 20, position=position_dodge(width=0.5))+
geom_pointrange(aes(ymax = UpperCI, ymin = LowerCI),position=position_dodge(width=0.5)) +
theme_bw() +
theme(text=element_text(size=25),axis.text.x = element_text(angle = 45, hjust = 1)) +
ylab("Odds ratio") +
xlab("Gene set size") +
theme(plot.margin = unit(c(2,2,2,2), "cm"))
df %>% mutate(Method = fct_relevel(Method, 'Pairwise')) %>% <your ggplot2 code>
should do the job, assuming you have loaded the pipe operator %>% and the forcats package, which you can do with library(tidyverse).
You can simply reverse the ordering of the Method factor with forcats::fct_rev.
df$Method <- forcats::fct_rev(df$Method)
Alternatively, you can specify the level order when you first convert that column to a factor.
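As a minimal base-R sketch of both ideas: the vector method below is a made-up stand-in for df$Method, not the original data. It sets the level order when the factor is created and then reverses it without forcats.

```r
# Hypothetical stand-in for df$Method
method <- c("MAGMA", "MAGMA", "Pairwise", "Pairwise")

# Set the level order explicitly when creating the factor
method_f <- factor(method, levels = c("Pairwise", "MAGMA"))
levels(method_f)

# Base-R equivalent of forcats::fct_rev(): reverse the existing levels
method_rev <- factor(method_f, levels = rev(levels(method_f)))
levels(method_rev)
```

Only the level order changes; the underlying values stay the same, so the rest of the plotting code does not need to be touched.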
mydat=structure(list(date = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L), .Label = c("01.01.2018", "02.01.2018"), class = "factor"),
x = structure(c(2L, 2L, 2L, 3L, 1L, 1L, 1L, 1L, 1L), .Label = c("e",
"q", "w"), class = "factor"), y = structure(c(2L, 2L, 2L,
3L, 1L, 1L, 1L, 1L, 1L), .Label = c("e", "q", "w"), class = "factor")), .Names = c("date",
"x", "y"), class = "data.frame", row.names = c(NA, -9L))
As we can see, x and y are grouping variables (the only group categories are q-q, w-w and e-e).
For 1 January:
q q = count 3
w w = count 1
Then for 2 January:
e e = count 5
How can I display the count of each category in a graph? The dataset is large, so the graph is needed for the month of January: the plot should display the number of categories sold per day.
I found your question not entirely clear, but maybe this could help:
library(lubridate) # manipulate date
library(tidyverse) # manipulate data and plot
# your data
mydat %>%
# add columns (here my doubts)
mutate(group = paste (x,y, sep ='-'), # here the category pasted
cnt = ifelse(paste (x,y, sep ='-') == 'q-q',3,
ifelse(paste (x,y, sep ='-') == 'w-w',1,5)), # ifelse with value
day = day(dmy(date))) %>% # day
group_by(group,day) %>% # grouping
summarise(cnt = sum(cnt)) %>% # add the count as sum
# now the plot, here other doubts on your request
ggplot(aes(x = as.factor(day), y = cnt, group = group, fill = group, label = group)) +
geom_bar(stat = 'identity', position = 'dodge') +
geom_label(position = position_dodge(width = 1)) +
theme(legend.position="none")
Your question isn't as clear as I would like, but I think you want to find how many of each group there are on each day, right?
You can use group_by from the dplyr package.
I created a new variable called group, which concatenates x and y.
mydata <- mydat %>%
mutate('group' = paste(x, y, sep = '-')) %>%
group_by(date, group) %>%
summarise('qtd' = length(group))
Result:
date group qtd
01.01.2018 q-q 3
01.01.2018 w-w 1
02.01.2018 e-e 5
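As a hedged alternative sketch, dplyr::count() can do the grouping and counting in one step. The sales data frame below is a plain-character rebuild of the x, y and date columns from the question, and the names sales and daily are only illustrative.

```r
library(dplyr)

# Plain-character rebuild of the question's data (illustrative name: sales)
sales <- data.frame(
  date = c(rep("01.01.2018", 4), rep("02.01.2018", 5)),
  x = c("q", "q", "q", "w", "e", "e", "e", "e", "e"),
  y = c("q", "q", "q", "w", "e", "e", "e", "e", "e"),
  stringsAsFactors = FALSE
)

# count() groups by date and group, then stores the row count in qtd
daily <- sales %>%
  mutate(group = paste(x, y, sep = "-")) %>%
  count(date, group, name = "qtd")

daily
```

This should give the same three rows as the group_by/summarise version, just with less code.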
You can then plot the result with the ggplot2 package as below, using facet_wrap to separate the plots by date:
ggplot(data = mydata, aes(x = group, y = qtd)) +
geom_bar(stat = 'identity') +
facet_wrap(~date)
Alternatively, you can map date to fill instead of faceting. This is sometimes better, especially if you have a lot of dates.
Code:
ggplot(data = mydata, aes(x = group, y = qtd, fill = date)) +
geom_bar(stat = 'identity')
Good luck!
This is my first question on Stack Overflow, so please correct me if I am not following the correct question protocols.
I am trying to create some graphs for data that has been collected over three time points (time 1, time 2, time 3) which equates to X1..., X2... and X3... at the beginning of column names. The graphs are also separated by the column $Group from the data frame.
I have no problem creating the graphs. I just have many variables (~170) and want to compare time 1 vs time 2, time 2 vs time 3, etc., so I am trying to find a shortcut for running this kind of code rather than typing out each plot individually.
As indicated above, I have created variable names like X1... X2... which indicate the time that the variable was recorded i.e. X1BCSTCAT = time 1; X2BCSTCAT = time 2; X3BCSTCAT = time 3. Here is a small sample of what my data looks like:
df <- structure(list(ID = structure(1:6, .Label = c("101","102","103","118","119","120"), class = "factor"),
Group = structure(c(1L,1L,1L,2L,2L,2L), .Label = c("C8","TC"), class = "factor"),
Wave = structure(c(1L, 2L, 3L, 4L, 1L, 2L), .Label = c("A","B","C","D"), class = "factor"),
Yr = structure(c(1L, 2L, 1L, 2L, 1L, 2L), .Label = c("3","5"), class = c("ordered", "factor")),
Age.Yr. = c(10.936,10.936, 9.311, 10.881, 10.683, 11.244),
Training..hr. = c(10.667,10.333, 10.667, 10.333, 10.333, 10.333),
X1BCSTCAT = c(-0.156,0.637,-1.133,0.637,2.189,1.229),
X1BCSTCR = c(0.484,0.192, -1.309, 0.912, 1.902, 0.484),
X1BCSTPR = c(-1.773,0.859, 0.859, 0.12, -1.111, 0.12),
X2BCSTCAT = c(1.006, -0.379,-1.902, 0.444, 2.074, 1.006),
X2BCSTCR = c(0.405, -0.457,-1.622, 1.368, 1.981, 0.168),
X2BCSTPR = c(-0.511, -0.036,2.189, -0.036, -0.894, 0.949),
X3BCSTCAT = c(1.18, -1.399,-1.399, 1.18, 1.18, 1.18),
X3BCSTCR = c(0.967, -1.622, -1.622,0.967, 0.967, 1.255),
X3BCSTPR = c(-1.282, -1.282, 1.539,1.539, 0.792, 0.792)),
row.names = c(1L, 2L, 3L, 4L, 5L,8L), class = "data.frame")
Here is some working code to create one graph using ggplot for time 1 vs time 2 data on one variable:
library(ggplot2)
p <- ggplot(df, aes(x=df$X1BCSTCAT, y=df$X2BCSTCAT, shape = df$Group, color = df$Group)) +
geom_point() + geom_smooth(method=lm, aes(fill=df$Group), fullrange = TRUE) +
labs(title="BCSTCAT", x="Time 1", y = "Time 2") +
scale_color_manual(name = "Group",labels = c("C8","TC"),values = c("blue", "red")) +
scale_shape_manual(name = "Group",labels = c("C8","TC"),values = c(16, 17)) +
scale_fill_manual(name = "Group",labels = c("C8", "TC"),values = c("light blue", "pink"))
So I am really trying to create some kind of a shortcut where R will cycle through and match up variable names X1... vs X2... and so on and create the graphs. I assume there must be some way to plot either based upon matching column numbers e.g. df[,7] vs df[,10] and iterating through this process or plotting by actually matching the names (where the only difference in variable names is the number which indicates time).
I have previously cycled through creating individual graphs using the lapply function, but have no idea where to even start with trying to do this one.
A solution using the tidyeval approach. We will need ggplot2 v3.0.0 or later (remember to restart your R session after installing).
install.packages("ggplot2", dependencies = TRUE)
First we build a function that takes column and group names as inputs. Note the use of rlang::sym, rlang::quo_name & !!.
Then create 2 name vectors for x- & y- values so that we can loop through them simultaneously using purrr::map2.
library(rlang)
library(tidyverse)
df <- structure(list(ID = structure(1:6, .Label = c("101","102","103","118","119","120"), class = "factor"),
Group = structure(c(1L,1L,1L,2L,2L,2L), .Label = c("C8","TC"), class = "factor"),
Wave = structure(c(1L, 2L, 3L, 4L, 1L, 2L), .Label = c("A","B","C","D"), class = "factor"),
Yr = structure(c(1L, 2L, 1L, 2L, 1L, 2L), .Label = c("3","5"), class = c("ordered", "factor")),
Age.Yr. = c(10.936,10.936, 9.311, 10.881, 10.683, 11.244),
Training..hr. = c(10.667,10.333, 10.667, 10.333, 10.333, 10.333),
X1BCSTCAT = c(-0.156,0.637,-1.133,0.637,2.189,1.229),
X1BCSTCR = c(0.484,0.192, -1.309, 0.912, 1.902, 0.484),
X1BCSTPR = c(-1.773,0.859, 0.859, 0.12, -1.111, 0.12),
X2BCSTCAT = c(1.006, -0.379,-1.902, 0.444, 2.074, 1.006),
X2BCSTCR = c(0.405, -0.457,-1.622, 1.368, 1.981, 0.168),
X2BCSTPR = c(-0.511, -0.036,2.189, -0.036, -0.894, 0.949),
X3BCSTCAT = c(1.18, -1.399,-1.399, 1.18, 1.18, 1.18),
X3BCSTCR = c(0.967, -1.622, -1.622,0.967, 0.967, 1.255),
X3BCSTPR = c(-1.282, -1.282, 1.539,1.539, 0.792, 0.792)),
row.names = c(1L, 2L, 3L, 4L, 5L,8L), class = "data.frame")
# define a function that accepts strings as input
pair_plot <- function(x_var, y_var, group_var) {
  # convert strings to symbols
  x_var <- rlang::sym(x_var)
  y_var <- rlang::sym(y_var)
  group_var <- rlang::sym(group_var)
  # unquote symbols using !!
  ggplot(df, aes(x = !! x_var, y = !! y_var, shape = !! group_var, color = !! group_var)) +
    geom_point() + geom_smooth(method = lm, aes(fill = !! group_var), fullrange = TRUE) +
    labs(title = "BCSTCAT", x = rlang::quo_name(x_var), y = rlang::quo_name(y_var)) +
    scale_color_manual(name = "Group", labels = c("C8", "TC"), values = c("blue", "red")) +
    scale_shape_manual(name = "Group", labels = c("C8", "TC"), values = c(16, 17)) +
    scale_fill_manual(name = "Group", labels = c("C8", "TC"), values = c("light blue", "pink")) +
    theme_bw()
}
# Test if the new function works
pair_plot("X1BCSTCAT", "X2BCSTCAT", "Group")
# Create 2 parallel lists
list_x <- colnames(df)[-c(1:6, (ncol(df)-2):(ncol(df)))]
list_x
#> [1] "X1BCSTCAT" "X1BCSTCR" "X1BCSTPR" "X2BCSTCAT" "X2BCSTCR" "X2BCSTPR"
list_y <- lead(colnames(df)[-(1:6)], 3)[1:length(list_x)]
list_y
#> [1] "X2BCSTCAT" "X2BCSTCR" "X2BCSTPR" "X3BCSTCAT" "X3BCSTCR" "X3BCSTPR"
# Loop through 2 lists simultaneously
# Supply inputs to pair_plot function using purrr::map2
map2(list_x, list_y, ~ pair_plot(.x, .y, "Group"))
Sample outputs: map2 returns a list of plots, printed as [[1]], [[2]], and so on (images omitted).
Created on 2018-05-24 by the reprex package (v0.2.0).
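If you would rather pair the columns by name than by position, as the question suggests, something like the following base-R sketch might work. It assumes every timed column starts with X followed by the time point. The names cols, timed, time_pt, stem and pairs are illustrative only and not part of the answer above.

```r
# Column names as in the sample data (only names are needed to build the pairs)
cols <- c("ID", "Group", "Wave", "Yr", "Age.Yr.", "Training..hr.",
          "X1BCSTCAT", "X1BCSTCR", "X1BCSTPR",
          "X2BCSTCAT", "X2BCSTCR", "X2BCSTPR",
          "X3BCSTCAT", "X3BCSTCR", "X3BCSTPR")

# Keep the timed columns and split each name into time point and stem
timed   <- grep("^X[0-9]+", cols, value = TRUE)
time_pt <- as.integer(sub("^X([0-9]+).*$", "\\1", timed))
stem    <- sub("^X[0-9]+", "", timed)

# Pair each column with the same stem at the next time point
pairs <- data.frame(x = timed,
                    y = paste0("X", time_pt + 1L, stem),
                    stringsAsFactors = FALSE)

# Drop pairs whose partner column does not exist (the last time point)
pairs <- pairs[pairs$y %in% cols, ]
pairs
```

The x and y columns of pairs could then be passed to map2 in place of list_x and list_y, which avoids depending on the columns being in a particular order.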
I have a dataframe df with many columns ...
I'd like to plot a subset of the columns, where c is a vector of the names of the columns I'd like to plot.
I'm currently doing the following:
df <-structure(list(Image.Name = structure(1:5, .Label = c("D1C1", "D2C2", "D4C1", "D5C3", "D6C2"), class = "factor"), Experiment = structure(1:5, .Label = c("020718 perfusion EPC_BC_HCT115_Day 5", "020718 perfusion EPC_BC_HCT115_Day 6", "020718 perfusion EPC_BC_HCT115_Day 7", "020718 perfusion EPC_BC_HCT115_Day 8", "020718 perfusion EPC_BC_HCT115_Day 9"), class = "factor"), Type = structure(c(2L, 1L, 1L, 2L, 1L), .Label = c("VMO", "VMT"), class = "factor"), Date = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "18-Apr-18", class = "factor"), Time = structure(1:5, .Label = c("12:42:02 PM", "12:42:29 PM", "12:42:53 PM", "12:43:44 PM", "12:44:23 PM"), class = "factor"), Low.Threshold = c(10L, 10L, 10L, 10L, 10L), High.Threshold = c(255L, 255L, 255L, 255L, 255L), Vessel.Thickness = c(7L, 7L, 7L, 7L, 7L), Small.Particles = c(0L, 0L, 0L, 0L, 0L), Fill.Holes = c(0L, 0L, 0L, 0L, 0L), Scaling.factor = c(0.001333333, 0.001333333, 0.001333333, 0.001333333, 0.001333333), X = c(NA, NA, NA, NA, NA), Explant.area = c(1.465629333, 1.093447111, 1.014612444, 1.166950222, 1.262710222), Vessels.area = c(0.255562667, 0.185208889, 0.195792, 0.153907556, 0.227996444), Vessels.percentage.area = c(17.43706003, 16.93807474, 19.29722044, 13.18887067, 18.05611774), Total.Number.of.Junctions = c(56L, 32L, 39L, 18L, 46L), Junctions.density = c(38.20884225, 29.26524719, 38.43832215, 15.42482246, 36.42957758), Total.Vessels.Length = c(12.19494843, 9.545333135, 10.2007416, 7.686755647, 11.94211976), Average.Vessels.Length = c(0.182014156, 0.153956986, 0.188902622, 0.08938088, 0.183724919), Total.Number.of.End.Points = c(187L, 153L, 145L, 188L, 167L), Average.Lacunarity = c(0.722820111, 0.919723402, 0.86403871, 1.115896082, 0.821753818)), .Names = c("Image.Name", "Experiment", "Type", "Date", "Time", "Low.Threshold", "High.Threshold", "Vessel.Thickness", "Small.Particles", "Fill.Holes", "Scaling.factor", "X", "Explant.area", "Vessels.area", "Vessels.percentage.area", "Total.Number.of.Junctions", 
"Junctions.density", "Total.Vessels.Length", "Average.Vessels.Length", "Total.Number.of.End.Points", "Average.Lacunarity"), row.names = c(NA, -5L), class = "data.frame")
doBarPlot <- function(x) {
p <- ggplot(x, aes_string(x="Type", y=colnames(x), fill="Type") ) +
stat_summary(fun.y = "mean", geom = "bar", na.rm = TRUE) +
stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", width=0.5, na.rm = TRUE) +
ggtitle("VMO vs. VMT") +
theme(plot.title = element_text(hjust = 0.5) )
print(p)
ggsave(sprintf("plots/%s_bars.pdf", colnames(x) ) )
return(p)
}
c = c('Total.Vessels.Length', 'Total.Number.of.Junctions', 'Total.Number.of.End.Points', 'Average.Lacunarity')
p[c] <- lapply(df[c], doBarPlot)
However this yields the following error :
Error: ggplot2 doesn't know how to deal with data of class numeric
Debugging shows that x inside of doBarPlot is of the type numeric rather than data.frame, so ggplot errors. However, test <- df2[c] yields a variable of the type data.frame.
Why is x a numeric?
What's the best way to apply doBarPlot without resorting to a loop?
As others have noted, the problem with your initial approach is that when you use lapply on a data frame, the elements that you are iterating over will be the column vectors, rather than 1-column data frames. However, even if you did iterate over 1-column data frames, your function would fail: the data frame supplied to the ggplot call wouldn't contain the Type column that you use in the plot.
Instead, you could modify the function to take two arguments: the full data frame, and the name of the column that you want to use on the y-axis.
doBarPlot <- function(data, y) {
  p <- ggplot(data, aes_string(x = "Type", y = y, fill = "Type")) +
    stat_summary(fun.y = "mean", geom = "bar", na.rm = TRUE) +
    stat_summary(
      fun.data = "mean_cl_normal",
      geom = "errorbar",
      width = 0.5,
      na.rm = TRUE
    ) +
    ggtitle("VMO vs. VMT") +
    theme(plot.title = element_text(hjust = 0.5))
  print(p)
  ggsave(sprintf("plots/%s_bars.pdf", y))
  return(p)
}
Then, you can use lapply to iterate over the character vector of columns you want to plot, while supplying the data frame via the ... as a fixed argument to your plotting function:
library(ggplot2)
cols <- c('Total.Vessels.Length', 'Total.Number.of.Junctions',
'Total.Number.of.End.Points', 'Average.Lacunarity')
p <- lapply(cols, doBarPlot, data = df)
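To see how the fixed argument is passed through ..., here is a small base-R sketch with no plotting involved. The function describe_col and the toy data frame are made up for illustration.

```r
# Hypothetical helper with the same (data, y) shape as doBarPlot
describe_col <- function(data, y) {
  sprintf("%s: mean = %.2f", y, mean(data[[y]]))
}

# Made-up toy data
toy <- data.frame(Type = c("VMO", "VMT"), len = c(1, 3), junc = c(10, 20))

# lapply iterates over the column names; data = toy is forwarded unchanged
# to every call, so each name fills the remaining argument y
out <- lapply(c("len", "junc"), describe_col, data = toy)
out
```

Because data is matched by name, each element of the character vector ends up as y, which is exactly how the doBarPlot call above receives its column name.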
Further, if you don't mind having all of the plots in one file, you could also use tidyr::gather to reshape your data into long form, and use facet_wrap in your plot (as suggested by @RichardTelford in his comment), avoiding the iteration and the need for a function altogether:
library(tidyverse)
df %>%
gather(variable, value, cols) %>%
ggplot(aes(x = Type, y = value, fill = Type)) +
facet_wrap(~ variable, scales = "free_y") +
stat_summary(fun.y = "mean", geom = "bar", na.rm = TRUE) +
stat_summary(
fun.data = "mean_cl_normal",
geom = "errorbar",
width = 0.5,
na.rm = TRUE
) +
ggtitle("VMO vs. VMT") +
theme(plot.title = element_text(hjust = 0.5))
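The apply family of functions iterates over the elements of the object passed. For a data frame, those elements are its columns, each handed to the function as a plain vector. A simple example to illustrate this: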
lapply(mtcars, function(x) print(x))
With your code, you are passing a vector of each column in your df to the function doBarPlot. The ggplot2 package works with dataframes, not lists or vectors and therefore you get the error.
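The difference can be checked without ggplot2. In this minimal sketch, df_demo is a made-up data frame used only to compare what lapply hands to the function with what single-bracket subsetting returns.

```r
# Made-up data frame for illustration
df_demo <- data.frame(a = 1:3, b = c(0.5, 1.5, 2.5))

# lapply over a data frame passes each column as a bare vector
col_classes <- lapply(df_demo, class)
col_classes

# Single-bracket subsetting by name keeps the data.frame class
sub_classes <- lapply(names(df_demo), function(nm) class(df_demo[nm]))
sub_classes
```

So iterating over the column names and subsetting inside the function is one way to keep working with data frames.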
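If you want your function to receive a data frame rather than a vector, apply it directly to the subsetted df (note that the plot also needs the Type column and a single y column to work):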
doBarPlot(df[ , c])
If you have a bunch of dataframes and you want to subset by the columns in c, check out this answer:
How to apply same function to every specified column in a data.table
or, alternatively, look into dplyr::select().
I am working on the dataset reported here below (pre.sss)
pre.sss <- structure(list(Pretest.num = c(63, 62, 61, 60, 59, 58, 57, 4,2, 1), stress = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L,1L), .Label = c("[0,6]", "(6,9]"), class = "factor"), time = c(1L,1L, 1L, 1L, 1L, 1L, 1L, 8L, 8L, 8L), after = structure(c(2L,2L, 2L, 2L, 2L, 2L, 1L, 1L, NA, 1L), .Label = c("no", "yes"), class = "factor"),id = c("call_fam", "call_fam", "call_fam", "call_fam", "call_fam","call_fam", "call_fam", "counselor", "counselor", "counselor")), .Names = c("Pretest.num", "stress", "time", "after","id"), reshapeLong = structure(list(varying = structure(list(after = c("after.call.fam", "after.speak", "after.send.email","after.send.card", "after.attend", "after.fam.mtg", "after.sup.grp","after.counselor")), .Names = "after", v.names = "after", times = 1:8),v.names = "after", idvar = "Pretest.num", timevar = "time"), .Names = c("varying","v.names", "idvar", "timevar")), row.names = c("63.1", "62.1","61.1", "60.1", "59.1", "58.1", "57.1", "4.8", "2.8", "1.8"), class = "data.frame")
and I need to plot the counts of several categorical variables according to the level of another categorical variable ('stress'), so a faceted bubble plot would do the job in my case.
So what I do is the following:
ylabels = c('call_fam' = "call fam.member for condolences",
'speak' = "speak to fam.member in person",
'send.email' = "send condolence email to fam.member",
'send.card' = "send condolence card/letter to fam.member",
'attend' = "attend funeral/wake",
'fam.mtg' = "provide fam.meeting",
'sup.grp' = "suggest attending support grp.",
'counselor' = "make referral to bereavement counselor" )
p = ggplot(pre.sss, aes(x = after, y = id)) +
geom_count(alpha = 0.5, col = 'darkblue') +
scale_size(range = c(1,30)) +
theme(legend.position = 'none') +
xlab("Response") +
ylab("What did you do after learning about death?") +
scale_y_discrete(labels = ylabels) +
facet_grid(.~ pre.sss$stress, labeller = as_labeller(stress.labels))
and I obtain the following image, exactly as I want.
Now I would like to label each bubble with the count with which the corresponding data appear in the dataset.
dat = data.frame(ggplot_build(p)$data[[1]][, c('x', 'y', 'PANEL', 'n')])
dat$PANEL = ifelse(dat$PANEL==1, "[0,6]", "(6-9]")
colnames(dat) = c('x', 'y', 'stress', 'n')
p + geom_text(aes(x, y, label = n, group = NULL), data = dat)
This gives me the following error I really can't understand.
> p + geom_text(aes(x, y, label=n, group=NULL), data=dat)
Error in `$<-.data.frame`(`*tmp*`, "PANEL", value = c(1L, 1L, 1L, 1L, :
replacement has 504 rows, data has 46
Can anybody help me with this?
Thanks!
EM
The object you pass to your labeller (stress.labels) is still missing from this example. geom_count uses stat_sum, which calculates a variable n, the number of observations at each point. Because you can use this computed variable directly, you don't actually have to assign the plot to a variable and pull out its data, as you did with ggplot_build.
This should do what you're looking for:
ggplot(pre.sss, aes(x = after, y = id)) +
geom_count(alpha = 0.5, col = 'darkblue') +
# note the following line
stat_sum(mapping = aes(label = ..n..), geom = "text") +
scale_size(range = c(1,30)) +
theme(legend.position = 'none') +
xlab("Response") +
ylab("What did you do after learning about death?") +
scale_y_discrete(labels = ylabels) +
facet_grid(.~ stress)
The line I added computes the same thing as what happens behind the scenes in geom_count, but draws it with a text geom instead, with the label mapped to the computed variable n. (In ggplot2 3.3.0 and later, after_stat(n) is the preferred spelling of ..n...)
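As a rough base-R sketch of what n represents, the counts per x, y and facet can be tabulated by hand. The obs data frame below is made up for illustration and is not the pre.sss data.

```r
# Made-up observations with the same column roles as in the plot
obs <- data.frame(
  after  = c("yes", "yes", "yes", "no", "no"),
  id     = c("call_fam", "call_fam", "call_fam", "call_fam", "counselor"),
  stress = c("[0,6]", "[0,6]", "(6,9]", "[0,6]", "(6,9]"),
  stringsAsFactors = FALSE
)

# Give every row a weight of 1, then sum the weights within each
# after/id/stress combination: this is the per-bubble count
counts <- aggregate(n ~ after + id + stress,
                    data = transform(obs, n = 1L),
                    FUN = sum)
counts
```

Each row of counts corresponds to one bubble in one facet, and its n is the number that the text label would show.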