summary and plot for multiple subgroups(more columns) - r

I am interested in two things 1) Summary for multiple subgroups in the same table and 2) dotplot for the subgroups based on the summary generated in step1.
For example ,
if this is my dataset
data("pbc")
I like to generate summary of cholesterol (chol), by sex, stage, ascites and spiders for two treatment levels 1, 2
table(pbc$trt)
1 2
158 154
I can do this separately like this.
library(Hmisc)
summary(chol ~ sex + stage + ascites + spiders, data = subset(pbc, trt=1))
summary(chol ~ sex + stage + ascites + spiders, data = subset(pbc, trt=2))
This creates two separate summaries.
Two different corresponding plots
plot(summary(chol ~ sex + stage + ascites + spiders, data = subset(pbc, trt=1)))
plot(summary(chol ~ sex + stage + ascites + spiders, data = subset(pbc, trt=2)))
I like the summaries to be in one table , two columns 1 column for trt=1 and 2nd column for trt=2
N
chol (trt=1)
chol (trt=2)
sex
m
..
..... .
.... ..
f
..
..... .
.... ..
And the plot side by side. 1st plot for trt=1 , second plot for trt=2
Kindly suggest suggest how to scale the Hmisc:::summary.formula , summary function to 1) show summaries by subgroups side-by-side 2) Plot the summaries side-by-side. Thanks.

Please note that your current summaries and plots are identical; despite using subset with the two levels of trt, your two posted plots are identical. You can use filter to definitively filter by the levels of trt.
First, I prefer gtsummary with my tables, since you can use tbl_continuous to make one singular table instead of trying to combine two tables. Second, you will likely encounter difficulty trying to combine your two plots since you're using base R plotting functions on Hmisc summary objects. Even trying to save each plot to an object will result in NULL. In the long run, it may be easier to recreate each plot using ggplot and combining with cowplot::plot_grid.
library(survival)
library(Hmisc)
# create combined summary
library(gtsummary)
library(tidyverse)
data(pbc)
df <- pbc %>%
select(id, trt, chol, sex, stage, ascites, spiders) %>%
mutate(across(c(sex, stage, ascites, spiders), as.factor)) %>%
mutate(trt = factor(trt)) %>%
mutate(chol = as.numeric(chol))
dftrt1 <- df %>% filter(trt == 1)
dftrt2 <- df %>% filter(trt == 2)
df %>%
select(trt, chol, sex, stage, ascites, spiders) %>%
tbl_continuous(variable = chol,
digits = everything() ~ 2,
statistic = everything() ~ "{mean}",
label = list(sex ~ "Sex",
stage ~ "Stage",
ascites ~ "Ascites",
spiders ~ "Spiders"),
by = trt)
# create combined plot
library(cowplot)
p1 <- dftrt1 %>%
select(-trt) %>% pivot_longer(cols = -c(id, chol)) %>% group_by(name, value) %>%
summarise(chol = mean(chol, na.rm = TRUE)) %>%
ggplot(aes(x = value, y = chol, fill = factor(value))) +
geom_point() + coord_flip() +
facet_wrap(~name, scales = "free_y", nrow = 4, strip.position = "top") +
theme(panel.spacing = unit(0, "lines"),
panel.border = element_rect(fill = NA),
strip.background = element_blank(),
axis.title.y = element_blank(),
legend.position = "none",
strip.placement = "outside") +
ggtitle("trt = 1") + theme(plot.title = element_text(hjust = 0.5))
p2 <- dftrt2 %>%
select(-trt) %>% pivot_longer(cols = -c(id, chol)) %>% group_by(name, value) %>%
summarise(chol = mean(chol, na.rm = TRUE)) %>%
ggplot(aes(x = value, y = chol, fill = factor(value))) +
geom_point() + coord_flip() +
facet_wrap(~name, scales = "free_y", nrow = 4, strip.position = "top") +
theme(panel.spacing = unit(0, "lines"),
panel.border = element_rect(fill = NA),
strip.background = element_blank(),
axis.title.y = element_blank(),
legend.position = "none",
strip.placement = "outside") +
ggtitle("trt = 2") + theme(plot.title = element_text(hjust = 0.5))
plot_grid(p1, p2, ncol = 2)

Related

plot time series and NA values in R

I have been searching for missing values visualization in R and even though there are many nice options, I haven't found a code to get exactly what I need.
My data frame (df) is
data.frame(
stringsAsFactors = FALSE,
Date = c("01/01/2000","02/01/2000",
"03/01/2000","04/01/2000","05/01/2000","06/01/2000",
"07/01/2000","08/01/2000","09/01/2000"),
Site.1 = c(NA,0.952101337,0.066766616,
0.77279551,0.715427011,NA,NA,NA,0.925705179),
Site.2 = c(0.85847963,0.663818831,NA,NA,
0.568488712,0.002833073,0.349365844,0.652482654,
0.334879886),
Site.3 = c(0.139854891,0.057024999,
0.297705256,0.914754178,NA,0.14108163,0.282896932,
0.823245136,0.153609705),
Site.4 = c(0.758317946,0.284147119,
0.756356853,NA,NA,0.313465424,NA,0.013689324,0.654615632)
) -> df
And I would like to get a plot similar to the following:
Taking into account that my actual data consists of 51 Sites and around 9,000 dates
You can try with something like this:
library(tidyr)
library(dplyr)
library(ggplot2)
df %>%
# from wide to long
pivot_longer(!Date, names_to = "sites", values_to = "value") %>%
# add a column of one an NAs following your data
mutate(fake = ifelse(is.na(value),NA, 1),
sites = as.factor(sites)) %>%
# plot it
ggplot(aes(x = Date, y = reorder(sites,desc(sites)), color = fake, group = sites)) +
# line size
geom_line( size = 2) +
# some aesthetics
ylab('sites') +
scale_color_continuous(high="black",na.value="white") +
theme(legend.position = 'none',
panel.background = element_rect(fill ='white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Despite, I prefere something simpler like this:
df %>%
pivot_longer(!Date, names_to = "sites", values_to = "value") %>%
ggplot(aes(x = Date, y = sites, fill =value)) +
geom_tile() +
theme_light() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
This answer is similar to the first one, but uses NA values to interrupt the lines.
library(ggplot2)
library(tidyr)
library(stringr)
df %>% pivot_longer(., cols = 2:5) %>%
mutate(present = !is.na(value)) %>%
mutate(height = as.numeric(str_remove(name, "Site.")) * present) %>% mutate(value2 = case_when(!is.na(value) ~ height)) %>%
ggplot(aes(Date, value2, group = name)) +
geom_line() +
theme(legend.position = "none") +
scale_x_discrete(guide = guide_axis(angle = 90))

How to do a bar graphic with multiple columns out of an excel archive?

How can I make a graphic bar using barplot() or ggplopt() of an excel archive that has 83 columns?
I need to plot every column that has a >0 value on ich raw. (ich column represents a gene function and I need to know how many functions there is on ich cluster).
Iwas trying this,but it didn't work:
ggplot(x, aes(x=Cluster, y=value, fill=variable)) +
geom_bar(stat="bin", position="dodge") +
theme_bw() +
ylab("Funções no cluster") +
xlab("Cluster") +
scale_fill_brewer(palette="Blues")
Link to the excel:
https://github.com/annabmarques/GenesCorazon/blob/master/AllclusPathwayEDIT.xlsx
What about a heatmap? A rough example:
library(dplyr)
library(tidyr)
library(ggplot2)
library(openxlsx)
data <- read.xlsx("AllclusPathwayEDIT.xlsx")
data <- data %>%
mutate(cluster_nr = row_number()) %>%
pivot_longer(cols = -c(Cluster, cluster_nr),
names_to = "observations",
values_to = "value") %>%
mutate(value = as.factor(value))
ggplot(data, aes(x = cluster_nr, y = observations, fill = value)) +
geom_tile() +
scale_fill_brewer(palette = "Blues")
Given the large number of observations consider breaking this up into multiple charts.
It's difficult to understand exactly what you're trying to do. Is this what you're trying to achieve?
#install.packages("readxl")
library(tidyverse)
library(readxl)
read_excel("AllclusPathwayEDIT.xlsx") %>%
pivot_longer(!Cluster, names_to = "gene_counts", values_to = "count") %>%
mutate(Cluster = as.factor(Cluster)) %>%
ggplot(aes(x = Cluster, y = count, fill = gene_counts)) +
geom_bar(position="stack", stat = "identity") +
theme(legend.position = "right",
legend.key.size = unit(0.4,"line"),
legend.text = element_text(size = 7),
legend.title = element_blank()) +
guides(fill = guide_legend(ncol = 1))
ggsave(filename = "example.pdf", height = 20, width = 35, units = "cm")

How to suppress legend in dot-whisker plot

For some reason I am getting two legends in my dot-whisker plot.
Plot produced by the below code:
The data are available here.
#first importing data
Q2a<-read.table("~/Q2a.txt", header=T)
# Optionally, read in data directly from figshare.
# Q2a <- read.table("https://ndownloader.figshare.com/files/13283882?private_link=ace5b44bc12394a7c46d", header=TRUE)
library(dplyr)
#splitting into female and male
F2female<-Q2a %>%
filter(sex=="F")
F2male<-Q2a %>%
filter(sex=="M")
library(lme4)
#Female models
ab_f2_f_LBS = lmer(LBS ~ ft + grid + (1|byear), data = subset(F2female))
ab_f2_f_surv = glmer.nb(age ~ ft + grid + (1|byear), data = subset(F2female), control=glmerControl(tol=1e-6,optimizer="bobyqa",optCtrl=list(maxfun=1e19)))
#Male models
ab_f2_m_LBS = lmer(LBS ~ ft + grid + (1|byear), data = subset(F2male))
ab_f2_m_surv = glmer.nb(age ~ ft + grid + (1|byear), data = subset(F2male), control=glmerControl(tol=1e-6,optimizer="bobyqa",optCtrl=list(maxfun=1e19)))
I only plot two of the variables (ft2 and gridSU) from each model.
ab_f2_f_LBS <- tidy(ab_f2_f_LBS) %>% filter(!grepl('sd_Observation.Residual', term)) %>% filter(!grepl('byear', group)) %>% mutate(model = "ab_f2_f_LBS")
ab_f2_m_LBS <- tidy(ab_f2_m_LBS) %>% filter(!grepl('sd_Observation.Residual', term)) %>% filter(!grepl('byear', group)) %>% mutate(model = "ab_f2_m_LBS")
ab_f2_f_surv <- tidy(ab_f2_f_surv)%>% filter(!grepl('sd_Observation.Residual', term)) %>% filter(!grepl('byear', group)) %>% mutate(model = "ab_f2_f_surv")
ab_f2_m_surv <- tidy(ab_f2_m_surv) %>% filter(!grepl('sd_Observation.Residual', term)) %>% filter(!grepl('byear', group)) %>% mutate(model = "ab_f2_m_surv")
tidy_mods <- bind_rows(ab_f2_f_LBS, ab_f2_m_LBS, ab_f2_f_surv, ab_f2_m_surv)
I am then ready to make a dot-whisker plot.
#required packages
library(dotwhisker)
library(broom)
dwplot(tidy_mods,
vline = geom_vline(xintercept = 0, colour = "black", linetype = 2),
conf.int=TRUE,
dodge_size=0.2, #space between the CI's
dot_args = list(aes(shape = model), size = 3), #changes shape of points and the size of the points
style="dotwhisker") %>% # plot line at zero _behind_ coefs
relabel_predictors(c(DamDisFate2= "Immigrant mothers",
gridSU = "Grid (SU)")) +
theme_classic() +
xlab("Coefficient estimate (+/- CI)") +
ylab("") +
scale_color_manual(values=c("#000000", "#666666", "#999999", "#CCCCCC"),
labels = c("Daughter LBS", "Son LBS", "Daughter longevity", "Son longevity"),
name = "First generation models, maternity known") +
theme(axis.title=element_text(size=15),
axis.text.x = element_text(size=15),
axis.text.y = element_text(size=15, angle=90, hjust=.5),
legend.position = c(0.7, 0.7),
legend.justification = c(0, 0),
legend.title=element_text(size=15),
legend.text=element_text(size=13),
legend.key = element_rect(size = 0),
legend.key.size = unit(0.5, "cm"))+
guides(colour = guide_legend(override.aes=list(shape=c(16,17,15,3)))) #changes shape of points in legend
I am encountering this problem:
As is obvious from the plot, I have two legends. One that is unmodified and one that is modified.
I can't find any short cut within the theme() function and the dwplot() package doesn't offer any solutions either.
How can I suppress the unmodified legend (bottom one) and only keep my modified legend (top one)?
Assuming this function uses ggplot, try adding shape="none" to your guides():
guides(colour = guide_legend(override.aes=list(shape=c(16,17,15,3))), shape="none")

How to create such a figure using ggplot2 in R?

I have a matrix with many zero elements. The column names are labeled on the horizontal axis. I'd like to show explictly the nonzero elements as the bias from the vertical line for each column.
So how should construct a figure such as the example using ggplot2?
An example data can be generated as follow:
set.seed(2018)
N <- 5
p <- 40
dat <- matrix(0.0, nrow=p, ncol=N)
dat[2:7, 1] <- 4*rnorm(6)
dat[4:12, 2] <- 2.6*rnorm(9)
dat[25:33, 3] <- 2.1*rnorm(9)
dat[19:26, 4] <- 3.3*rnorm(8)
dat[33:38, 5] <- 2.9*rnorm(6)
colnames(dat) <- letters[1:5]
print(dat)
Here is another option using facet_wrap and geom_col with theme_minimal.
library(tidyverse)
dat %>%
as.data.frame() %>%
rowid_to_column("row") %>%
gather(key, value, -row) %>%
ggplot(aes(x = row, y = value, fill = key)) +
geom_col() +
facet_wrap(~ key, ncol = ncol(dat)) +
coord_flip() +
theme_minimal()
To further increase the aesthetic similarity to the plot in your original post we can
move the facet strips to the bottom,
rotate strip labels,
add "zero lines" in matching colours,
remove the fill legend, and
get rid of the x & y axis ticks/labels/title.
library(tidyverse)
dat %>%
as.data.frame() %>%
rowid_to_column("row") %>%
gather(key, value, -row) %>%
ggplot(aes(x = row, y = value, fill = key)) +
geom_col() +
geom_hline(data = dat %>%
as.data.frame() %>%
gather(key, value) %>%
count(key) %>%
mutate(y = 0),
aes(yintercept = y, colour = key), show.legend = F) +
facet_wrap(~ key, ncol = ncol(dat), strip.position = "bottom") +
coord_flip() +
guides(fill = FALSE) +
theme_minimal() +
theme(
strip.text.x = element_text(angle = 45),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
It would be much easier if you can provide some sample data. Thus I needed to create them and there is no guarantee that this will work for your purpose.
set.seed(123)
# creating some random sample data
df <- data.frame(id = rep(1:100, each = 3),
x = rnorm(300),
group = rep(letters[1:3], each = 100),
bias = sample(0:1, 300, replace = T, prob = c(0.7, 0.3)))
# introducing bias
df$bias <- df$bias*rnorm(nrow(df))
# calculate lower/upper bias for errorbar
df$biaslow <- apply(data.frame(df$bias), 1, function(x){min(0, x)})
df$biasupp <- apply(data.frame(df$bias), 1, function(x){max(0, x)})
Then I used kind of hack to be able to print groups in sufficient distance to make them not overlapped. Based on group I shifted bias variable and also lower and upper bias.
# I want to print groups in sufficient distance
df$bias <- as.numeric(df$group)*5 + df$bias
df$biaslow <- as.numeric(df$group)*5 + df$biaslow
df$biasupp <- as.numeric(df$group)*5 + df$biasupp
And now it is possible to plot it:
library(ggplot2)
ggplot(df, aes(x = x, col = group)) +
geom_errorbar(aes(ymin = biaslow, ymax = biasupp), width = 0) +
coord_flip() +
geom_hline(aes(yintercept = 5, col = "a")) +
geom_hline(aes(yintercept = 10, col = "b")) +
geom_hline(aes(yintercept = 15, col = "c")) +
theme(legend.position = "none") +
scale_y_continuous(breaks = c(5, 10, 15), labels = letters[1:3])
EDIT:
To incorporate special design you can add
theme_bw() +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 1),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
to your plot.
EDIT2:
To incorporate several horizontal lines, you can create different dataset:
df2 <- data.frame(int = unique(as.numeric(df$group)*5),
gr = levels(df$group))
And use
geom_hline(data = df2, aes(yintercept = int, col = gr))
instead of copy/pasting geom_hline for each group level.

Overlaying barplot with line graphs using ggplot2

My question is similar to those posted here and here.
I am working on creating a graph in ggplot where I have one bar plot and then want to overlay multiple line graphs. For the purposes of this question, I have reproduced my code for two barplots (one that includes all years (2007-2015) and two from specific years (2007 and 2015), but ultimately I will be overlaying data from 10 different years. The data used can be found here.
library(dplyr)
library(tidyr)
library(gridExtra)
library(ggplot2)
overallpierc<-data[(data$item=="piercing"),]
overp<-overallpierc %>%
group_by(age) %>%
count(sex) %>%
ungroup %>%
mutate(age = factor(age)) %>%
complete(age, sex, fill = list(n = 0)) %>%
ggplot(aes(age, n)) + geom_col(aes(fill = sex), position = "dodge") +
theme_classic() +
scale_fill_manual(values=c("#000000", "#CCCCCC"), name = "Sex") +
labs(x = "Age", y = "Number of observations") +
theme(legend.position=c(0.4,0.8),
plot.title = element_text(size = 10),
legend.title=element_text(size=15),
axis.title=element_text(size=15),
legend.key.size = unit(1.13, "cm"),
legend.direction="vertical",
legend.text=element_text(size=15))
p07<-data[(data$yy=="2007") & (data$item=="piercing"),]
summary(p07)
subp07<-p07 %>%
group_by(age) %>%
count(sex) %>%
ungroup %>%
mutate(age = factor(age)) %>%
complete(age, sex, fill = list(n = 0)) %>%
ggplot(aes(age, n)) + geom_col(aes(fill = sex), position = "dodge") +
theme_classic() +
scale_fill_manual(values=c("#000000", "#CCCCCC"), name = "Sex") +
labs(x = "Age", y = "Number of observations") +
theme(legend.position=c(0.4,0.8),
plot.title = element_text(size = 10),
legend.title=element_text(size=15),
axis.title=element_text(size=15),
legend.key.size = unit(1.13, "cm"),
legend.direction="vertical",
legend.text=element_text(size=15))
p15<-data[(data$yy=="2015") & (data$item=="piercing"),]
subp15<-p15 %>%
group_by(age) %>%
count(sex) %>%
ungroup %>%
mutate(age = factor(age)) %>%
complete(age, sex, fill = list(n = 0)) %>%
ggplot(aes(age, n)) + geom_col(aes(fill = sex), position = "dodge") +
theme_classic() +
scale_fill_manual(values=c("#000000", "#CCCCCC"), name = "Sex") +
labs(x = "Age", y = "Number of observations") +
theme(legend.position=c(0.4,0.8),
plot.title = element_text(size = 10),
legend.title=element_text(size=15),
axis.title=element_text(size=15),
legend.key.size = unit(1.13, "cm"),
legend.direction="vertical",
legend.text=element_text(size=15))
grid.arrange(overp, subp07, subp15)
The code I have posted gives me the following figure.
What I am trying to do is plot the frequencies for females in 2007 and 2015 and males in 2007 and 2015 on top of the barplot for total frequencies (where this is also reflected in the legend). Is there a way to do that in R using ggplot2?
UPDATE: I tried using the geom_smooth and geom_line functions to add the lines to my ggplot as suggested in the comments and as other solutions to users questions, but I get the following error:
Error: Discrete value supplied to continuous scale
I created a new data frame for a subset that I would like to plot:
df<-data.frame(age=c(15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,40,50,60), val=c(0,5,13,77,70,106,62,51,46,27,46,16,22,16,14,48,21, 3,4))
And then added it to the ggplot code:
overallpierc %>%
filter(age != "15") %>%
group_by(age) %>%
count(sex) %>%
ungroup %>%
mutate(age = factor(age)) %>%
complete(age, sex, fill = list(n = 0)) %>%
ggplot(aes(age, n)) +
geom_line(data=df,aes(x=as.numeric(age),y=val),colour="blue") +
geom_col(aes(fill = sex), position = "dodge") +
theme_classic() +
scale_fill_manual(values=c("#000000", "#CCCCCC"), name = "Sex") +
labs(x = "Age", y = "Number of observations") +
theme(legend.position=c(0.4,0.8),
plot.title = element_text(size = 10),
legend.title=element_text(size=15),
axis.title=element_text(size=15),
legend.key.size = unit(1.13, "cm"),
legend.direction="vertical",
legend.text=element_text(size=15))
Others have encountered similar issues and used as.numeric to solve the problem. However, age needs to be treated as a factor for the purposes of plotting.
Based on our discussion in the comments, let's try stacked bars and facets. I think it works but you can decide for yourself.
The stacked bar has the advantage of showing both proportions and total count in the same bar. To compare years, a facet grid places years in rows, so the eye can scan downwards to compare the same age in different years. Note that I kept age as a continuous variable here, rather than a factor.
library(dplyr)
library(ggplot2)
data30g %>%
count(yy, sex, age) %>%
ggplot(aes(age, n)) +
geom_col(aes(fill = sex)) +
facet_grid(yy ~ .) +
theme_bw() +
scale_fill_manual(values = c("#000000", "#cccccc"))
Not bad - I can see straight away, for example, an increase in both total and female count at age 30 over time, but perhaps a little small and crowded.
We can use a facet wrap instead of a grid to make the bars clearer, but at the expense of quick visual comparison across years.
data30g %>%
count(yy, sex, age) %>%
ggplot(aes(age, n)) +
geom_col(aes(fill = sex)) +
facet_wrap(~yy, ncol = 2) +
theme_bw() +
scale_fill_manual(values = c("#000000", "#cccccc"))
One more example which does not address your question in terms of total counts or barplots - but I thought it might be of interest. This code generates a "heatmap" style of plot which is poor for quantitative comparison, but can sometimes give a quick visual impression of interesting features. I think it shows, for example, that females aged 20 in 2014 have the highest total count.
data30g %>%
count(yy, sex, age) %>%
ggplot(aes(factor(age), yy)) +
geom_tile(aes(fill = n)) +
facet_grid(sex ~ .) +
scale_fill_gradient2() +
scale_y_reverse(breaks = 2006:2015) +
labs(x = "age", y = "Year")
EDIT:
Based on further discussions in the comments, here is one way to plot age as a factor, using bars for sexes, overlaid with a line for the totals and split by year.
overallpierc %>%
count(yy, sex, age) %>%
ggplot() +
geom_col(aes(factor(age), n, fill = sex), position = "dodge") +
stat_summary(aes(factor(age), n), fun.y = "sum", geom = "line", group = 1) +
facet_grid(yy ~ .)

Resources