Plot a multivariate histogram in R - r

I would like to plot 6 different variables with their corresponding calculated statistical data. The following dataframe may serve as an example:
X aggr_a aggr_b count
<chr> <dbl> <dbl> <dbl>
1 A 470676 594423 58615
2 B 549142 657291 67912
3 C 256204 311723 26606
4 D 248256 276593 40201
5 E 1581770 1717788 250553
6 F 1932096 2436769 385556
I would like to plot each row as a category, with its statistics as the histogram bins within that category.
Can I use ggplot2 for this kind of graph?
All the available resources seem to cover only the univariate case.

library(tidyverse)
df = read.table(text = "
X aggr_a aggr_b count
A 470676 594423 58615
B 549142 657291 67912
C 256204 311723 26606
D 248256 276593 40201
E 1581770 1717788 250553
F 1932096 2436769 385556
", header=T)
df %>%
  gather(type, value, -X) %>% # reshape to long format: one row per (X, statistic) pair
  ggplot(aes(X, value, fill = type)) +
  geom_bar(position = "dodge", stat = "identity")
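With a recent tidyr, the same reshape can also be sketched with pivot_longer(), which the tidyr documentation recommends over gather() for new code. The names df_long and p below are illustrative and not part of the answer above; geom_col() is shorthand for geom_bar(stat = "identity").

```r
library(tidyr)
library(ggplot2)

# Same example data as in the question
df <- read.table(text = "
X aggr_a aggr_b count
A 470676 594423 58615
B 549142 657291 67912
C 256204 311723 26606
D 248256 276593 40201
E 1581770 1717788 250553
F 1932096 2436769 385556
", header = TRUE)

# Long format: one row per (category, statistic) pair
df_long <- pivot_longer(df, cols = -X, names_to = "type", values_to = "value")

# One group of dodged bars per category, one bar per statistic
p <- ggplot(df_long, aes(X, value, fill = type)) +
  geom_col(position = "dodge")
p
```

Because count is much smaller than the two aggregates, its bars will look short on a shared y axis; whether that matters depends on what the plot is meant to show.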

Related

line graph with multiple variables on y axis stepwise

I need some help. Here is my data, which I want to plot. I want to keep Path.ID on the y axis and the numeric values of all the other columns added stepwise. This is a subset of a very large dataset, so I want the Path.ID labels attached to each line, and also the values of the other columns at each point if possible.
head(table)
Path.ID sc st rc rt
<chr> <dbl> <dbl> <dbl> <dbl>
1 map00230 1 12 5 52
2 map00940 1 20 10 43
3 map01130 NA 15 8 34
4 map00983 NA 14 5 28
5 map00730 NA 5 3 26
6 map00982 NA 16 2 24
somewhat like this
Thank you
Here is the pseudo code.
library(tidyr)
library(dplyr)
library(ggplot2)
# convert your table into a long format - sorry I am more used to this type of data
table_long <- table %>% gather(x_axis, value, sc:rt)
# Plot with ggplot2
ggplot() +
# draw line
geom_line(data=table_long, aes(x=x_axis, y=value, group=Path.ID, color=Path.ID)) +
# draw label at the last x_axis in this case is **rt**
geom_label(data=table_long %>% filter(x_axis=="rt"),
aes(x=x_axis, y=value, label=Path.ID, fill=Path.ID),
color="#FFFFFF")
Note that with this code, if a Path.ID doesn't have an rt value, it will not get a label.
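One way around the missing-label case is to label each line at its last non-missing step instead of always at rt. The following is a minimal sketch under a few assumptions: tbl stands in for the asker's table, one rt value is blanked out purely for illustration, and label_df is a hypothetical helper name. It also turns x_axis into a factor so the steps appear in column order (sc, st, rc, rt) rather than alphabetically.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Rows copied from the question; `tbl` is a stand-in for the asker's `table`
tbl <- data.frame(
  Path.ID = c("map00230", "map00940", "map01130", "map00983", "map00730", "map00982"),
  sc = c(1, 1, NA, NA, NA, NA),
  st = c(12, 20, 15, 14, 5, 16),
  rc = c(5, 10, 8, 5, 3, 2),
  rt = c(52, 43, 34, 28, 26, 24)
)
# Assumption for illustration only: pretend one path has no rt value
tbl$rt[3] <- NA

steps <- c("sc", "st", "rc", "rt")
table_long <- tbl %>%
  pivot_longer(sc:rt, names_to = "x_axis", values_to = "value") %>%
  mutate(x_axis = factor(x_axis, levels = steps))  # keep column order on the x axis

# For each path, keep the row at the last step that actually has a value
label_df <- table_long %>%
  filter(!is.na(value)) %>%
  group_by(Path.ID) %>%
  filter(as.integer(x_axis) == max(as.integer(x_axis))) %>%
  ungroup()

p <- ggplot(table_long, aes(x = x_axis, y = value, group = Path.ID, color = Path.ID)) +
  geom_line() +
  geom_text(data = label_df, aes(label = Path.ID),
            hjust = 0, nudge_x = 0.05, size = 3, show.legend = FALSE)
p
```

ggplot2 will warn about the rows it drops because of NA values; the lines are still drawn over the steps that do have data.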
p<-ggplot() +
# draw line
geom_line(data=table_long, aes(x=x_axis, y=value, group=Path.ID, color=Path.ID)) +
geom_text(data=table_long %>% filter(x_axis=="rt"),
aes(x=x_axis, y=value, label=Path.ID),
color= "#050505", size = 3, check_overlap = TRUE)
p +labs(title= "title",x = "x-label", y="y-label")
I had to use geom_text because I had a large dataset, and it gave me a somewhat clearer graph.
Thank you #sinh, it helped a lot.

How to make a boxplot with 3D array with ggplot?

I have a technical question for you, please.
Here are my observed data:
observed <- structure(c(4.06530084555243e-05, 4.34037362577724e-05, 5.25472735118296e-05,
5.75250282219017e-05, 5.33322813829422e-05, 4.31323519093776e-05,
2.93059438168564e-05, 3.2907253754896e-05, 3.93244409813805e-05,
4.44607200813546e-05, 4.28121839343577e-05, 4.41339340180233e-05,
2.45819615043229e-05, 2.77652788697063e-05, 3.471280169582e-05,
4.0759303004447e-05, 4.1444945573338e-05, 3.91053759171617e-05
), .Dim = c(6L, 3L))
After a simulation I have this dataset :
simul <- structure(c(4.19400641566714e-05, 4.34037362577724e-05, 5.21778240776188e-05,
5.72766282640455e-05, 5.33322813829422e-05, 4.4984474595369e-05,
3.04758260711529e-05, 3.35466566427138e-05, 4.07527347018512e-05,
4.51672959887775e-05, 4.42496416020706e-05, 4.41339340180233e-05,
2.38725672336555e-05, 2.78960210968267e-05, 3.42390390339277e-05,
4.0759303004447e-05, 4.1444945573338e-05, 4.16181419135288e-05,
4.06530084555243e-05, 4.52163381730998e-05, 5.37744538705153e-05,
5.75250282219017e-05, 5.44384786782902e-05, 4.27640158845638e-05,
2.93059438168564e-05, 3.16988003284864e-05, 3.88757470111112e-05,
4.16839537839391e-05, 4.1923490779897e-05, 4.43697930071784e-05,
2.53312977844189e-05, 2.82780740113101e-05, 3.49483644305925e-05,
4.23308636691264e-05, 4.36574393087853e-05, 3.91053759171617e-05,
3.97856427517231e-05, 4.25485977213641e-05, 5.21380124071012e-05,
5.62879076217168e-05, 5.18161751345512e-05, 4.22404154190924e-05,
2.84842421189343e-05, 3.2907253754896e-05, 3.93244409813805e-05,
4.28921326811218e-05, 4.2391125283836e-05, 4.28233487269764e-05,
2.45819615043229e-05, 2.67311845213199e-05, 3.3715109777394e-05,
4.00991849427121e-05, 4.07259705233212e-05, 3.62825448554739e-05,
3.95854341194398e-05, 4.23930151174446e-05, 5.25472735118296e-05,
5.76202168197769e-05, 5.23957149070388e-05, 4.31323519093776e-05,
2.90350657890489e-05, 3.22693947104228e-05, 3.90988677457566e-05,
4.44607200813546e-05, 4.28121839343577e-05, 4.28542288317551e-05,
2.56149959419174e-05, 2.77652788697063e-05, 3.49302533009518e-05,
4.13777396322285e-05, 4.12908495437265e-05, 3.92084109551252e-05,
4.14887591359563e-05, 4.39273564362111e-05, 5.31197050290816e-05,
5.77484133948985e-05, 5.36319646972061e-05, 4.62472643466539e-05,
3.06756490605887e-05, 3.49917045844483e-05, 4.15936967740209e-05,
4.66221720234964e-05, 4.48785430220286e-05, 4.44766996381653e-05,
2.36916432633518e-05, 2.69248181080789e-05, 3.471280169582e-05,
3.94762090257435e-05, 4.17765202936009e-05, 3.8021359310749e-05
), .Dim = c(6L, 3L, 5L))
This is a 3D array. The rows correspond to the months followed, the columns to the study areas, and the third dimension to the simulated values.
My question: is it possible, with ggplot, to present a multipanel graph (grid), with 1 panel per study area, of boxplots of the simulations (values of the 3rd dimension) with months on the x axis (= 6 boxplots per panel)? I would also like to draw lines of the observed values through the boxplots of each panel. Thank you!
I hope I understood it right: for each type of study - make boxplots for each month, summarizing values obtained from all of 5 simulations.
First I gave dimension names to array:
attributes(simul)$dimnames <- list(
month = month.abb[1:6],
study = letters[1:3],
simval = 1:5
)
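After that I converted the named array to a tbl_cube, and then to a tibble, so I can plot the data using the usual tidyverse routine (as.tbl_cube was part of dplyr when this was written; since dplyr 1.0.0 it lives in the cubelyr package):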
library(tidyverse)
library(magrittr)
as.tbl_cube(simul) %>%
as_tibble() %>%
rename('value' = simul) %>%
mutate(
study = factor(paste('Study', study)),
month = factor(month, levels = month.abb[1:6])
) %T>%
print %>%
ggplot(aes(x = month, y = value)) +
geom_boxplot(outlier.colour = 'red') +
facet_wrap(~ study, nrow = 1, scales = 'free_y') +
ggthemes::theme_few()
# # A tibble: 90 x 4
# month study simval value
# <fct> <fct> <int> <dbl>
# 1 Jan Study a 1 0.0000419
# 2 Feb Study a 1 0.0000434
# 3 Mar Study a 1 0.0000522
# 4 Apr Study a 1 0.0000573
# 5 May Study a 1 0.0000533
# 6 Jun Study a 1 0.0000450
# 7 Jan Study b 1 0.0000305
# 8 Feb Study b 1 0.0000335
# 9 Mar Study b 1 0.0000408
# 10 Apr Study b 1 0.0000452
# # ... with 80 more rows
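The array-to-long-table step can also be sketched in base R with as.data.frame(as.table(...)), which avoids the tbl_cube conversion, and the observed values asked about in the question can be overlaid with geom_line(). This is only a sketch: the arrays below are random stand-ins with the same shapes as observed and simul, and sim_df, obs_df and p are illustrative names.

```r
library(ggplot2)

set.seed(1)
# Stand-in data with the question's shapes: observed is 6 x 3, simul is 6 x 3 x 5.
# The values are random placeholders, not the asker's numbers.
observed <- matrix(runif(18, 2e-5, 6e-5), nrow = 6, ncol = 3)
simul <- array(rep(observed, 5) + rnorm(90, sd = 2e-6), dim = c(6, 3, 5))

dimnames(observed) <- list(month = month.abb[1:6], study = letters[1:3])
dimnames(simul) <- list(month = month.abb[1:6], study = letters[1:3], simval = 1:5)

# One row per array cell; the dimnames become factor columns
sim_df <- as.data.frame(as.table(simul), responseName = "value")
obs_df <- as.data.frame(as.table(observed), responseName = "value")

p <- ggplot(sim_df, aes(x = month, y = value)) +
  geom_boxplot() +
  # observed values drawn as one line per study area, on top of the boxplots
  geom_line(data = obs_df, aes(group = study), colour = "blue") +
  facet_wrap(~ study, nrow = 1)
p
```

Each panel then shows 6 boxplots (one per month, each summarising the 5 simulated values) with the observed series running through them.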

adding rows to a tibble based on mostly replicating existing rows

I have data that only shows a variable if it is not 0. However, I would like to have gaps representing these 0s in the graph.
(I will be working from a large dataframe, but have created an example data based on how I will be manipulating it for this purpose.)
library(tidyverse)
library(ggplot2)
A <- tibble(
name = c("CTX_M", "CblA_1"),
rpkm = c(350, 4),
sample = "A"
)
B <- tibble(
name = c("CTX_M", "OXA_1", "ampC"),
rpkm = c(324, 357, 99),
sample = "B"
)
plot <- bind_rows(A, B)
ggplot()+ geom_col(data = plot, aes(x = sample, y = rpkm, fill = name),
position = "dodge")
Samples A and B both have CTX_M, but the other three names are each present in only one of the two samples. When I run the code, the resulting graph shows two bars for sample A and three bars for sample B:
Is there a way for me to add CblA_1 to sample B with rpkm=0, and OXA_1 and ampC to sample A with rpkm=0, while maintaining sample separation? The tibble would then look like this (order not important):
and the graph would therefore look like this:
You can use complete from tidyr.
plot <- plot %>% complete(name,sample,fill=list(rpkm=0))
# A tibble: 8 x 3
name sample rpkm
<chr> <chr> <dbl>
1 ampC A 0
2 ampC B 99
3 CblA_1 A 4
4 CblA_1 B 0
5 CTX_M A 350
6 CTX_M B 324
7 OXA_1 A 0
8 OXA_1 B 357
ggplot()+ geom_col(data = plot, aes(x = sample, y = rpkm, fill = name),
position = "dodge")

ggplot2 geom_bar position failure

I am using the ..count.. transformation in geom_bar and get the warning
position_stack requires non-overlapping x intervals when some of my categories have few counts.
This is best explained using some mock data (my data involves direction and windspeed and I retain names relating to that)
#make data
set.seed(12345)
FF=rweibull(100,1.7,1)*20 #mock speeds
FF[FF>60]=59
dir=sample.int(10,size=100,replace=TRUE) # mock directions
#group into speed classes
FFcut=cut(FF,breaks=seq(0,60,by=20),ordered_result=TRUE,right=FALSE,drop=FALSE)
# stuff into data frame & plot
df=data.frame(dir=dir,grp=FFcut)
ggplot(data=df,aes(x=dir,y=(..count..)/sum(..count..),fill=grp)) + geom_bar()
This works fine, and the resulting plot shows the frequency of directions grouped according to speed. It is of relevance that the velocity class with the fewest counts (here "[40,60)") will have 5 counts.
However more velocity classes leads to a warning. For instance, with
FFcut=cut(FF,breaks=seq(0,60,by=15),ordered_result=TRUE,right=FALSE,drop=FALSE)
the velocity class with the fewest counts (now "[45,60)") will have only 3 counts and ggplot2 will warn that
position_stack requires non-overlapping x intervals
and the plot will show data in this category spread out along the x axis.
It seems that 5 is the minimum size for a group to have for this to work correctly.
I would appreciate knowing if this is a feature or a bug in stat_bin (which geom_bar is using) or if I am simply abusing geom_bar.
Also, any suggestions how to get around this would be appreciated.
Sincerely
This occurs because df$dir is numeric, so the ggplot object assumes a continuous x-axis, and aesthetic parameter group is based on the only known discrete variable (fill = grp).
As a result, when there simply aren't that many dir values in grp = [45,60), ggplot gets confused over how wide each bar should be. This becomes more visually obvious if we split the plot into different facets:
ggplot(data=df,
aes(x=dir,y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar() +
facet_wrap(~ grp)
> for(l in levels(df$grp)) print(sort(unique(df$dir[df$grp == l])))
[1] 1 2 3 4 6 7 8 9 10
[1] 1 2 3 4 5 6 7 8 9 10
[1] 2 3 4 5 7 9 10
[1] 2 4 7
We can also check manually that the minimum difference between sorted df$dir values is 1 for the first three grp values, but 2 for the last one. The default bar width is thus wider.
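The manual check described above can be sketched in a few lines of base R. The helper min_gap is a hypothetical name, and the data are regenerated with the question's code; note that R 3.6.0 changed the default sampling algorithm, so the same seed may produce different dir values (and different gaps) from those printed in this answer.

```r
# Recreate the question's mock data with the finer speed classes
set.seed(12345)
FF <- rweibull(100, 1.7, 1) * 20
FF[FF > 60] <- 59
dir <- sample.int(10, size = 100, replace = TRUE)
grp <- cut(FF, breaks = seq(0, 60, by = 15), ordered_result = TRUE, right = FALSE)

# Smallest distance between distinct x values; NA if fewer than two distinct values
min_gap <- function(x) {
  u <- sort(unique(x))
  if (length(u) < 2) NA_real_ else min(diff(u))
}

# One value per speed class: a gap larger than 1 means wider default bars for that class
gaps <- tapply(dir, grp, min_gap)
print(gaps)
```

Any class whose gap differs from the others is a candidate for the overlapping-intervals warning.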
The following solutions should all achieve the same result:
1. Explicitly specify the same bar width for all groups in geom_bar():
ggplot(data=df,
aes(x=dir,y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar(width = 0.9)
2. Convert dir to a categorical variable before passing it to aes(x = ...):
ggplot(data=df,
aes(x=factor(dir), y=(..count..)/sum(..count..),
fill = grp)) +
geom_bar()
3. Specify that the group parameter should be based on both df$dir & df$grp:
ggplot(data=df,
aes(x=dir,
y=(..count..)/sum(..count..),
group = interaction(dir, grp),
fill = grp)) +
geom_bar()
This doesn't directly solve the issue, because I also don't get what's going on with the overlapping values, but it's a dplyr-powered workaround, and might turn out to be more flexible anyway.
Instead of relying on geom_bar to take the cut factor and give you shares via ..count../sum(..count..), you can easily enough just calculate those shares yourself up front, and then plot your bars. I personally like having this type of control over my data and exactly what I'm plotting.
First, I put dir and FF into a data frame/tbl_df, and cut FF. Then count lets me group the data by dir and grp and count the number of observations for each combination of those two variables, after which I calculate the share of each n over the sum of n. I'm using geom_col, which is like geom_bar but for when you already have a y value in your aes.
library(tidyverse)
set.seed(12345)
FF <- rweibull(100,1.7,1) * 20 #mock speeds
FF[FF > 60] <- 59
dir <- sample.int(10, size = 100, replace = TRUE) # mock directions
shares <- tibble(dir = dir, FF = FF) %>%
mutate(grp = cut(FF, breaks = seq(0, 60, by = 15), ordered_result = TRUE, right = FALSE)) %>%
count(dir, grp) %>%
mutate(share = n / sum(n))
shares
#> # A tibble: 29 x 4
#> dir grp n share
#> <int> <ord> <int> <dbl>
#> 1 1 [0,15) 3 0.03
#> 2 1 [15,30) 2 0.02
#> 3 2 [0,15) 4 0.04
#> 4 2 [15,30) 3 0.03
#> 5 2 [30,45) 1 0.01
#> 6 2 [45,60) 1 0.01
#> 7 3 [0,15) 6 0.06
#> 8 3 [15,30) 1 0.01
#> 9 3 [30,45) 2 0.02
#> 10 4 [0,15) 6 0.06
#> # ... with 19 more rows
ggplot(shares, aes(x = dir, y = share, fill = grp)) +
geom_col()

ggplot: Generate facet grid plot with multiple series

I have following data frame:
Quarter x y p q
1 2001 8.714392 8.714621 3.3648435 3.3140090
2 2002 8.671171 8.671064 0.9282508 0.9034387
3 2003 8.688478 8.697413 6.2295996 8.4379698
4 2004 8.685339 8.686349 3.7520135 3.5278024
My goal is to generate a facet plot where the x and y columns share one panel and p and q share another, instead of 4 separate facets.
If I do the following:
library(reshape2)
library(ggplot2)
x.df.melt <- melt(x.df[,c('Quarter','x','y','p','q')],id.vars=1)
ggplot(x.df.melt, aes(Quarter, value, col=variable, group=1)) + geom_line()+
facet_grid(variable~., scales='free_y') +
scale_color_discrete(breaks=c('x','y','p','q'))
I get all four series in 4 different facets, but how do I combine x and y into one facet and p and q into another? Preferably without legends.
One idea would be to create a new grouping variable:
x.df.melt$var <- ifelse(x.df.melt$variable == "x" | x.df.melt$variable == "y", "A", "B")
You can use it for facetting while using variable for grouping:
ggplot(x.df.melt, aes(Quarter, value, col=variable, group=variable)) + geom_line()+
facet_grid(var~., scales='free_y') +
scale_color_discrete(breaks=c('x','y','p','q'), guide = "none")
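If there are more than a couple of series, the ifelse can be swapped for a named lookup vector that maps each series to its panel. This is a minimal base R sketch, not part of the answer above; panel_of is a hypothetical name and the long data frame is rebuilt by hand from the question's values so the snippet runs on its own.

```r
# Long-format data as melt() would produce it, using the question's values
x.df.melt <- data.frame(
  Quarter  = rep(2001:2004, times = 4),
  variable = rep(c("x", "y", "p", "q"), each = 4),
  value    = c(8.714392, 8.671171, 8.688478, 8.685339,
               8.714621, 8.671064, 8.697413, 8.686349,
               3.3648435, 0.9282508, 6.2295996, 3.7520135,
               3.3140090, 0.9034387, 8.4379698, 3.5278024)
)

# Lookup: series name -> panel label (extend this vector as series are added)
panel_of <- c(x = "A", y = "A", p = "B", q = "B")

# as.character() guards against `variable` being a factor, as it is after melt()
x.df.melt$var <- unname(panel_of[as.character(x.df.melt$variable)])

table(x.df.melt$variable, x.df.melt$var)
```

The resulting var column can be used in facet_grid(var ~ .) exactly as in the plotting call above.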
I think beetroot's answer above is more elegant but I was working on the same problem and arrived at the same place a different way. I think it is interesting because I used a "double melt" (yum!) to line up the x,y/p,q pairs. Also, it demonstrates tidyr::gather instead of melt.
library(tidyr)
library(dplyr)   # group_by(), filter(), joins
library(ggplot2)
x.df<- data.frame(Year=2001:2004,
x=runif(4,8,9),y=runif(4,8,9),
p=runif(4,3,9),q=runif(4,3,9))
x.df.melt<-gather(x.df,"item","item_val",-Year,-p,-q) %>%
group_by(item,Year) %>%
gather("comparison","comp_val",-Year,-item,-item_val) %>%
filter((item=="x" & comparison=="p")|(item=="y" & comparison=="q"))
> x.df.melt
# A tibble: 8 x 5
# Groups: item, Year [8]
Year item item_val comparison comp_val
<int> <chr> <dbl> <chr> <dbl>
1 2001 x 8.400538 p 5.540549
2 2002 x 8.169680 p 5.750010
3 2003 x 8.065042 p 8.821890
4 2004 x 8.311194 p 7.714197
5 2001 y 8.449290 q 5.471225
6 2002 y 8.266304 q 7.014389
7 2003 y 8.146879 q 7.298253
8 2004 y 8.960238 q 5.342702
See below for the plotting statement.
One weakness of this approach (and of beetroot's use of ifelse) is that the filter statement quickly becomes unwieldy if you have a lot of pairs to compare. In my use case I was comparing mutual fund performance to a number of benchmark indices, and each fund has a different benchmark. I solved this with a table of metadata that pairs the fund tickers with their respective benchmarks, and then used left_join and right_join. In this case:
#create meta data
pair_data<-data.frame(item=c("x","y"),comparison=c("p","q"))
#create comparison name for each item name
x.df.melt2<-x.df %>% gather("item","item_val",-Year) %>%
left_join(pair_data)
#join comparison data alongside item data
x.df.melt2<-x.df.melt2 %>%
select(Year,item,item_val) %>%
rename(comparison=item,comp_val=item_val) %>%
right_join(x.df.melt2,by=c("Year","comparison")) %>%
na.omit() %>%
group_by(item,Year)
ggplot(x.df.melt2,aes(Year,item_val,color="item"))+geom_line()+
geom_line(aes(y=comp_val,color="comp"))+
guides(col = guide_legend(title = NULL))+
ylab("Value")+
facet_grid(~item)
Since there is no need for a new grouping variable, the names of the reference items are preserved as labels for the facet plot.
