Boxplot R select data with non unique values only - r

I have a data frame like this
head(data)
n OESST wsB
4 0.52924690 4
8 0.04488144 6
6 0.29909668 6
0 1.42228888 6
2 1.92228888 4
4 1.85659560 6
and I am doing a box plot of OESST as a function of wsB for the different n values
ggplot(na.omit(data), aes(x=factor(wsB), y=OESST, colour = factor(n))) + geom_boxplot(outlier.size=0,fill = "white",position="dodge",size=0.3,alpha=0.3) + stat_summary(fun.y=median, geom="line", aes(group=factor(n), colour = factor(n)),size=1)
What I would like to do is remove from the plot the unique n-wsB combinations (which are drawn only as a line and do not actually have a box).
Any help?
Thanks

I think the best approach is to filter your data first, using dplyr:
library(dplyr)
data %>%
group_by(n, wsB) %>%
mutate(n.wsB.count = n()) %>%
filter(n.wsB.count > 1) %>%
na.omit() %>%
ggplot(aes(x=factor(wsB), y=OESST, colour = factor(n))) +
geom_boxplot(outlier.size=0,fill = "white", position="dodge", size=0.3, alpha=0.3) +
stat_summary(fun=median, geom="line", aes(group=factor(n)), size=1)
Not tested because, as @MrFlick points out, the provided data isn't enough to reproduce the problem. I also took out the redundant colour aesthetic in the stat_summary.
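One caveat with the pipeline above: the group sizes are counted before na.omit() runs, so a group with two rows, one of which has a missing OESST, would pass the filter and still end up as a single point. Below is a minimal sketch of the filtering step on its own, assuming rows with a missing OESST should not count towards the group size. The toy data frame is made up for illustration, and dplyr::n() is namespaced only to keep it visually apart from the column named n.

```r
library(dplyr)

# Hypothetical toy data shaped like the question's data frame
toy <- data.frame(
  n     = c(4, 4, 8, 6, 6, 0, 2, 2),
  OESST = c(0.53, 1.86, 0.04, 0.30, 0.45, 1.42, 1.92, NA),
  wsB   = c(4, 4, 6, 6, 6, 6, 4, 4)
)

kept <- toy %>%
  filter(!is.na(OESST)) %>%   # drop missing values before counting
  group_by(n, wsB) %>%
  filter(dplyr::n() > 1) %>%  # keep n-wsB combinations seen more than once
  ungroup()
```

The resulting kept data frame can then be piped into the same ggplot() call as before.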

Related

How to plot multiple bar plots with same x values, different y values, and final product is similar to that of using facet_wrap function?

I have a dataset in which one column holds the x values and every other column holds a different set of y values. It looks like this (the data frame is called freq_dist):
Duration D0 D1 D2 D3 D4
1 130 101 53 30 10
2 23 36 13 9 0
I want to use Duration (column 1) as the x value for all the other columns and display a bar chart using ggplot2. I tried geom_bar, but instead of using the values already in the table as y values, it returns the count of occurrences of each value, so geom_col might be what I want instead. But how do I set column 1 as the fixed x value for all the other columns? I also tried the simple barplot() function, which gives essentially what I'm looking for, like this:
xx = freq_dist[,1]
yy = freq_dist[,2]
barplot(yy~xx)
Thanks all!
library(tidyverse)
freq_dist %>%
pivot_longer(-1) %>%
ggplot(aes(as.factor(Duration), value)) +
geom_col() +
theme_classic() +
facet_wrap(~name) +
labs(x = "Duration")
You need to pivot into long format and use geom_col
ggplot(pivot_longer(freq_dist, -1), aes(factor(Duration), value, fill = name)) +
geom_col(width = 0.5) +
scale_fill_brewer(palette = "Set1") +
theme_light(base_size = 16) +
labs(x = "Duration", fill = NULL) +
facet_wrap(~name) +
theme(legend.position = c(0.84, 0.25))
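To make the reshaping step concrete, here is a small sketch that rebuilds the two rows shown in the question and pivots them into long format. The names_to and values_to arguments are spelled out only for clarity; "name" and "value" are the defaults that the answers above rely on.

```r
library(tidyr)

# The two rows of freq_dist shown in the question
freq_dist <- data.frame(
  Duration = c(1, 2),
  D0 = c(130, 23),
  D1 = c(101, 36),
  D2 = c(53, 13),
  D3 = c(30, 9),
  D4 = c(10, 0)
)

# One row per Duration/column pair: the D* column names go to `name`,
# the cell contents go to `value`
freq_long <- pivot_longer(freq_dist, -Duration,
                          names_to = "name", values_to = "value")
```

Each of the five D columns contributes one row per Duration, which is the shape that geom_col() and facet_wrap(~name) expect.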

R ggplot: Modifying aesthetics of individual lines without recreating entire color palette

In ggplot, is there any simple way of overriding the line attributes of a single group (or a few groups) without having to specify the entire color/line palette via scale_*_manual()?
In the example below, I basically want to make all the boot_* lines gray and skinny, while all other lines keep the default colors/widths they would otherwise get.
I know there are plenty of brute-force ways of doing this with some combination of a) creating auxiliary variables in the data frame, based on the string pattern, that will serve as my color/size group, then b) generating the plot below, extracting all the color-layer info, and filling out an entire scale_color_manual() and scale_size_manual() map, and c) replacing the 'boot_*' values with "grey."
Are there any versatile shortcuts here?
library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(231)
df=tibble(time=c(1:5), actual=2*time+3, estimate = actual+rnorm(length(actual)))
for(i in 1:8){
df[paste('boot_', i, sep='')] = df$estimate + rnorm(nrow(df))
}
> head(df) %>% data.frame
# time actual estimate boot_1 boot_2 boot_3 boot_4 boot_5 boot_6 boot_7
# 1 1 5 4.466898 4.684295 4.240585 4.786520 5.904332 4.862498 2.092772 4.595850
# 2 2 7 4.688336 4.751258 6.074914 5.694181 3.445036 4.639329 4.548511 5.453597
# 3 3 9 8.045802 7.167972 6.858666 7.519752 7.721405 7.801243 10.156436 9.521482
# 4 4 11 11.262516 11.826206 10.682760 11.137814 11.252465 11.452442 11.925339 11.754248
# 5 5 13 12.526643 12.492315 13.927974 14.176896 11.924183 12.950479 11.257865 13.430229
# boot_8
# 1 3.987001
# 2 3.813539
# 3 7.549984
# 4 11.482360
# 5 11.645106
# Melt for ggplot compatibility
df_long = df %>%
pivot_longer(cols=(-time))
head(df_long) %>% data.frame
# time name value
# 1 1 actual 5.000000
# 2 1 estimate 4.466898
# 3 1 boot_1 4.684295
# 4 1 boot_2 4.240585
# 5 1 boot_3 4.786520
# 6 1 boot_4 5.904332
## The basic ggplot
df_long %>%
ggplot(aes(x=time, y=value, color=name)) + geom_line()
You could just use the first four characters of name for the colour aesthetic (using substr), and the full name as a group aesthetic. It's a bit hacky but it's short, effective, and all gets done in the plotting code without extra data wrangling, post-hoc changes or a long vector of colour mappings.
df_long %>%
ggplot(aes(x = time, y = value, color = substr(name, 1, 4), group = name)) +
geom_line() +
scale_color_manual(labels = c("actual", "boot", "estimate"),
values = c("orange", "gray", "blue3"), name = "name")
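If truncating names with substr feels too fragile (for example when two unrelated series share their first four letters), the colour group can be derived from a pattern match instead. This is only a sketch on made-up data with the same columns as df_long; toy_long and colour_group are hypothetical names that are not part of the original post.

```r
library(ggplot2)

# Hypothetical long-format data with the same columns as df_long
toy_long <- data.frame(
  time  = rep(1:3, times = 4),
  name  = rep(c("actual", "estimate", "boot_1", "boot_2"), each = 3),
  value = c(5, 7, 9, 4.5, 6.8, 9.3, 4.7, 6.1, 9.9, 4.2, 7.4, 8.6),
  stringsAsFactors = FALSE
)

# Collapse every boot_* series into one colour group; keep other names as they are
toy_long$colour_group <- ifelse(grepl("^boot_", toy_long$name),
                                "boot", toy_long$name)

# Colour by the collapsed group, but still draw one line per original series
p <- ggplot(toy_long, aes(x = time, y = value,
                          colour = colour_group, group = name)) +
  geom_line()
```

The legend then shows full names rather than four-letter stubs, so the labels argument of scale_color_manual() is no longer needed to restore them.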
An alternative is using filtering to have two sets of lines: one coloured, and one merely grouped. This has the benefit that you don't need to add any scale calls at all:
df_long %>%
filter(!grepl("boot", name)) %>%
ggplot(aes(x = time, y = value, color = name)) +
geom_line(data = filter(df_long, grepl("boot", name)),
aes(group = name), color = "gray", size = 0.3) +
geom_line()
EDIT
It's pretty difficult to only specify an aesthetic mapping for a single (multiple) group, while leaving the others at default values. However, it is possible using ggnewscale. Here we only have to specify the color of the boot group:
library(ggnewscale)
df_long %>%
filter(!grepl("boot", name)) %>%
ggplot(aes(x = time, y = value)) +
new_scale_color() +
geom_line(aes(color = name)) +
scale_color_discrete(name = "Variable") +
new_scale_color() +
geom_line(data = filter(df_long, grepl("boot", name)),
aes(group = name, color = "boot"), size = 0.3) +
scale_color_manual(values = "gray", name = "") +
theme(legend.margin = margin(-28, 10, 0, 0))

How to calculate SD by group in R, without losing columns still needed for plotting in ggplot2?

I have a dataset of 27 scenarios in which A, B and C are input values to a model and value is the outcome of a variable.
Now I want to make a grouped bar plot with ggplot (value on y, factor B on x, fill by A). I want the error bars to be based on the variation caused by factor C.
My (simplified) dataset is approximately in this format:
data <- data.frame(matrix(ncol=0, nrow=27))
data$value <- runif(27, min=10, max=60)
data$A <- factor((rep(1:9, each=3)))
data$B <- factor((rep(1:3, each=9)))
data$C <- factor(rep(rep(1:3),9))
Looks like:
value A B C
1 27.76710 1 1 1
2 34.71762 1 1 2
3 20.72895 1 1 3
4 34.83710 2 1 1
5 31.44144 2 1 2
6 13.11038 2 1 3
etc
The ggplot would be
ggplot(data, aes(fill=A, y=value, x=B)) +
geom_bar(stat="identity",position=position_dodge())+
geom_errorbar(aes(ymin=?????, ymax=????), width=.2,
position=position_dodge(.9))
So I am struggling with ymin and ymax. It could be value+sd or -sd, but I don't have a sd calculated yet.
My approach now is using summarize from dplyr by group A. This gives me:
data %>%
group_by(A) %>%
summarise(mean=mean(value), sd = sd(value))
A mean sd
<fct> <dbl> <dbl>
1 1 27.7 6.99
2 2 26.5 11.7
3 3 33.7 21.9
4 4 27.7 6.99
etc
This is fine, however, now I lost all my other columns (in this case I still need B for my ggplot). How can I still calculate a mean and sd and keep all my other columns?
Or are there other ways to get the effect I need?
(I could re-add the column B by hand but I'd like to know if there are other ways also for the future and for occasions B is not easily re-made)
You have three rows of data for each combination of A and B, so your current code is actually overplotting three bars at each x-axis position. You can see this by adding transparency to the bars.
ggplot(data, aes(fill=A, y=value, x=B)) +
geom_bar(stat="identity", position=position_dodge(), alpha=0.3)
It looks like you're actually trying to do the following (but let me know if I've misunderstood):
pd = position_dodge(0.92)
data %>%
group_by(A,B) %>%
summarise(mean=mean(value), sd=sd(value)) %>%
ggplot(aes(fill=A, x=B)) +
geom_col(aes(y=mean), position=pd)+
geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), position=pd, width=0.2)
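On the literal question of keeping the other columns, two options can be sketched with the simulated data from the question: grouping by every column that the plot needs, so summarise() carries them along, or using mutate() instead of summarise(), which attaches the group statistics to every original row. The names mean_value and sd_value are illustrative, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

set.seed(1)
data <- data.frame(
  value = runif(27, min = 10, max = 60),
  A = factor(rep(1:9, each = 3)),
  B = factor(rep(1:3, each = 9)),
  C = factor(rep(1:3, 9))
)

# Option 1: one row per A-B combination; A and B both survive as columns
summ <- data %>%
  group_by(A, B) %>%
  summarise(mean_value = mean(value), sd_value = sd(value), .groups = "drop")

# Option 2: keep all 27 rows and all columns, adding the group statistics
with_stats <- data %>%
  group_by(A, B) %>%
  mutate(mean_value = mean(value), sd_value = sd(value)) %>%
  ungroup()
```

Option 1 suits bars with error bars, since each bar needs exactly one row. Option 2 is handy when the raw values are plotted as well.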
Facetting is another option:
data %>%
group_by(A,B) %>%
summarise(mean=mean(value), sd=sd(value)) %>%
ggplot(aes(x=A)) +
geom_col(aes(y=mean), fill=hcl(240,100,65)) +
geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width=0.2) +
facet_grid(. ~ B, labeller=label_both, space="free_x", scales="free_x")
But do you really need bars?
data %>%
group_by(A,B) %>%
summarise(mean=mean(value), sd=sd(value)) %>%
ggplot(aes(x=A)) +
geom_pointrange(aes(y=mean, ymin=mean-sd, ymax=mean+sd), shape=21, fill="red",
fatten=6, stroke=0.3) +
facet_grid(. ~ B, labeller=label_both, space="free_x", scales="free_x")
We can also do this calculation within ggplot, using stat_summary (mean_sdl requires the Hmisc package to be installed):
data %>%
ggplot(aes(x=A, y=value)) +
stat_summary(fun.data=mean_sdl, fun.args=list(mult=1), geom="pointrange",
shape=21, fill="red", fatten=6, stroke=0.3) +
facet_grid(. ~ B, labeller=label_both, space="free_x", scales="free_x")
Either way, the plot looks like this:

geom_bar overlapping labels

For simplicity, let's suppose we have a data frame like
# A
1 1
2 2
3 2
4 2
5 3
We have a categorical variable "A" with 3 possible values (1, 2, 3), and I'm trying this code:
ggplot(df, aes(x="", y=df$A, fill=A))+
geom_bar(width = 1, stat = "identity")
The problem is that the labels are overlapping. Also, I want to change the labels for 1, 2, 3 to x, y, z.
Here is a picture of what is happening
And here is a link to the actual data that I'm using.
https://a.uguu.se/anKhhyEv5b7W_Data.csv
Your graph does not correspond to the sample of data you are showing, so it is hard to be sure that the structure of your real data is actually the same.
Using a random example, I get the following plot:
df <- data.frame(A = sample(1:3,20, replace = TRUE))
library(ggplot2)
ggplot(df, aes(x="A", y=A, fill=as.factor(A)))+
geom_bar(width = 1, stat = "identity") +
scale_fill_discrete(labels = c("x","y","z"))
EDIT: Using data provided by the OP
Here using your data, you should get the following plot:
ggplot(df, aes(x = "A",y = A, fill = as.factor(A)))+
geom_col()
Or if you want the count of each individual values of A, you can do:
library(dplyr)
library(ggplot2)
df %>% group_by(A) %>% count() %>%
ggplot(aes(x = "A", y = n, fill = as.factor(A)))+
geom_col()
Is this what you are looking for?
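For the relabelling part of the question, one possible sketch is to recode A as a factor with the desired labels before counting, so that the axis and the legend both pick up x, y and z without a separate scale call. It uses the five-row sample from the question; A_label and counts are illustrative names.

```r
library(ggplot2)

# The five-row sample from the question
df <- data.frame(A = c(1, 2, 2, 2, 3))

# Recode 1, 2, 3 as x, y, z
df$A_label <- factor(df$A, levels = 1:3, labels = c("x", "y", "z"))

# Count the occurrences of each label; table() yields columns A and Freq
counts <- as.data.frame(table(A = df$A_label))

# One bar per category, so the bars and their labels cannot overlap
p <- ggplot(counts, aes(x = A, y = Freq, fill = A)) +
  geom_col()
```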

Displaying a Cross-tabulation As a Plot on RStudio

I'm trying to visualize a cross-tabulation on RStudio using ggplot2. I've been able to create plots in the past, and have a cross-tabulation done as well, but can't crack this. Can anyone help?
Here's my code for an x-tab:
library(dplyr)
data_dan %>%
group_by(Sex, Segment) %>%
count(variant) %>%
mutate(prop = prop.table(n))
and here's what I've got for creating a plot:
#doing a plot
variance_art_new.plot = ggplot(data_dan, aes(Segment, fill=variant)) +
geom_bar(position="fill")+
theme_classic()+
scale_fill_manual(values = c("#fc8d59", "#ffffbf", "#99d594"))
variance_art_new.plot
Here's a sample of the data I'm operating with:
Word Segment variant Position Sex
1 LIKE K R End Female
2 LITE T S End Male
3 CRACK K R End Female
4 LIKE K R End Male
5 LIPE P G End Female
6 WALK K G End Female
My aim is to have the independent variables of 'Sex', 'Segment' plotted on a boxplot against the dependent variable 'variant'.
I included the first code to show that I can create a table to show this cross-tabulation and the second bit is what I normally do for running a box plot for just one independent variable.
I'm still not sure if this gets all the way to what you are asking, but if you are asking for counts (or proportions) within two separate variables, you can use facet_wrap to separate the two groups.
(Note, all of these are run with theme_set(theme_bw()) because I prefer it for this type of plot.)
Working with the builtin dataset mtcars you can get counts with:
mtcars %>%
ggplot(aes(x = factor(cyl), fill = factor(gear))) +
geom_bar() +
facet_wrap(~vs)
Or with the sorting reversed with:
mtcars %>%
ggplot(aes(x = factor(vs), fill = factor(gear))) +
geom_bar() +
facet_wrap(~cyl, labeller = label_both)
You can also plot the within-group distribution by using position = "fill"
mtcars %>%
ggplot(aes(x = factor(vs), fill = factor(gear))) +
geom_bar(position = "fill") +
facet_wrap(~cyl, labeller = label_both) +
scale_y_continuous(name = "Within group Percentage"
, labels = scales::percent)
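The same idea can be sketched with the columns from the question rather than mtcars. The six rows below are copied from the sample in the question, and the proportions are computed within each Sex-Segment group, which is an assumption about what the cross-tabulation is meant to show.

```r
library(dplyr)
library(ggplot2)

# The six sample rows from the question (Word and Position omitted)
data_dan <- data.frame(
  Segment = c("K", "T", "K", "K", "P", "K"),
  variant = c("R", "S", "R", "R", "G", "G"),
  Sex     = c("Female", "Male", "Female", "Male", "Female", "Female")
)

# Counts per Sex/Segment/variant, then proportions within each Sex-Segment group
xtab <- data_dan %>%
  count(Sex, Segment, variant) %>%
  group_by(Sex, Segment) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Stacked proportions per Segment, with one panel per Sex
p <- ggplot(xtab, aes(x = Segment, y = prop, fill = variant)) +
  geom_col() +
  facet_wrap(~Sex)
```

Plotting the precomputed prop column with geom_col() should give the same picture as geom_bar(position = "fill") on the raw rows, while keeping the table available for inspection.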
