box plots for two columns side by side using ggplot - r

I have a dataset in the following format
value1 value2 group
10 20 A
20 30 A
67 45 B
98 76 C
102 11 A
11 22 B
10 10 B
19 20 C
I am trying to make box plots for three groups (A, B and C) and the box plots for 1st and end column should be side by side. I can do two separate plots like following, but not able to figure out how to combine to put it side by side.
p1 <- ggplot(x, aes(x=group, y=value1)) + geom_boxplot()
p2 <- ggplot(x, aes(x=group, y=value)) + geom_boxplot()
I would appreciate any help. I am a newbie in R and ggplot.

Here's an option using pivot_longer from tidyr
x_new <- tidyr::pivot_longer(x, c(value1, value2))
ggplot(x_new, aes(x = group, y = value, col = name, fill = name)) + geom_boxplot(alpha = .5)

The gridExtra package can do this too. Assign your plots to variables then just use grid.arrange(plot1,plot2). Look up the documentation with ?grid.arrange for extra options.

Related

How to add legend on a line plot?

I have a data like this
year catch group
2011 22 1
2012 45 1
2013 34 1
2011 11 2
2012 22 2
2013 32 2
I would like to have the number of the group (1 and 2) to appear above the line in the plot.
Any suggestion?
My real data has 8 groups in total with 8 lines which makes it hard to see because the lines cross one another and the colors of the legend are similar.
I tried this:
library(ggplot2)
ggplot(aes(x=as.factor(year), y=catch, group=as.factor(group),
col=as.factor(group)), data=df) +
geom_line() +
geom_point() +
xlab("year") +
labs(color="group")
Firstly, distinguishing 8 different colours is very difficult. That's why your 8 groups seem to have similar colors.
What you want in this case is not a legend (which usually is an off-chart summary), but rather "annotation".
You can directly add the groups with
ggplot(...) +
geom_text(aes(x=as.factor(year), y=catch, label=group)) +
...
and then try to tweak the position of the text with nudge_x and nudge_y. But if you wanted only 1 label per group, you would have to prepare a data frame with it:
labels <- df %>% group_by(group) %>% top_n(1, -year)
ggplot(...) +
geom_text(data=labels, aes(x=as.factor(year), y=catch, label=group)) +
...

plot count histogram in R ggplot

I need help with this please.
I have searched here but not got the right output.
I am trying to plot this in R, so I can plot 3 files side-by-side in a single plot using GGplot.
The output I desire (plotted with excel) is this
What i am getting using GGplot is this
The R code i am using is this
A1 <- read.table("A1.txt", header = T, sep = "\t")
library(ggplot2)
ggplot(A1, aes(x = count)) + geom_bar()
The data is a tab-delimited file like this
length count
26 344776
27 289439
18 673395
28 338146
19 710702
20 928326
21 3491352
22 2724981
23 699007
24 726121
25 472509
The length, as it were will only be labels on the x axis for the counts plotted on the y-axis.
Is this what you wanted?
ggplot(A1, aes(x = as.character(length), y=count)) + geom_bar(stat="identity")

Merge data.frames for grouped boxplot r

I have two data frames z (1 million observations) and b (500k observations).
z= Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X
b= Tracer time treatment
2 0 S
35 0 S
10 0 X
04 0 X
20 15 S
11 15 S
12 15 X
25 15 X
I'd like to create grouped boxplots using time as a factor and treatment as colour. Essentially I need to bind them together and then differentiate between them but not sure how. One way I tried was using:
zz<-factor(rep("Z", nrow(z))
bb<-factor(rep("B",nrow(b))
dumB<-merge(z,zz) #this won't work because it says it's too big
dumB<-merge(b,zz)
total<-rbind(dumB,dumZ)
But z and zz merge won't work because it says it's 10G in size (which can't be right)
The end plot might be similar to this example: Boxplot with two levels and multiple data.frames
Any thoughts?
Cheers,
EDIT: Added boxplot
I would approach it as follows:
# create a list of your data.frames
l <- list(z,b)
# assign names to the dataframes in the list
names(l) <- c("z","b")
# bind the dataframes together with rbindlist from data.table
# the id parameter will create a variable with the names of the dataframes
# you could also use 'bind_rows(l, .id="id")' from 'dplyr' for this
library(data.table)
zb <- rbindlist(l, id="id")
# create the plot
ggplot(zb, aes(x=factor(time), y=Tracer, color=treatment)) +
geom_boxplot() +
facet_wrap(~id) +
theme_bw()
which gives:
Other alternatives for creating your plot:
# facet by 'time'
ggplot(zb, aes(x=id, y=Tracer, color=treatment)) +
geom_boxplot() +
facet_wrap(~time) +
theme_bw()
# facet by 'time' & color by 'id' instead of 'treatment'
ggplot(zb, aes(x=treatment, y=Tracer, color=id)) +
geom_boxplot() +
facet_wrap(~time) +
theme_bw()
In respons to your last comment: to get everything in one plot, you use interaction to distinguish between the different groupings as follows:
ggplot(zb, aes(x=treatment, y=Tracer, color=interaction(id, time))) +
geom_boxplot(width = 0.7, position = position_dodge(width = 0.7)) +
theme_bw()
which gives:
The key is you do not need to perform a merge, which is computationally expensive on large tables. Instead assign a new variable and value (source c(b,z) in my code below) to each dataframe and then rbind. Then it becomes straight forward, my solution is very similar to #Jaap's just with different faceting.
library(ggplot2)
#Create some mock data
t<-seq(1,55,by=2)
z<-data.frame(tracer=sample(t,size = 10,replace = T), time=c(0,15), treatment=c("S","X"))
b<-data.frame(tracer=sample(t,size = 10,replace = T), time=c(0,15), treatment=c("S","X"))
#Add a variable to each table to id itself
b$source<-"b"
z$source<-"z"
#concatenate the tables together
all<-rbind(b,z)
ggplot(all, aes(source, tracer, group=interaction(treatment,source), fill=treatment)) +
geom_boxplot() + facet_grid(~time)

Generating a histogram and density plot from binned data

I've binned some data and currently have a dataframe that consists of two columns, one that specifies a bin range and another that specifies the frequency like this:-
> head(data)
binRange Frequency
1 (0,0.025] 88
2 (0.025,0.05] 72
3 (0.05,0.075] 92
4 (0.075,0.1] 38
5 (0.1,0.125] 20
6 (0.125,0.15] 16
I want to plot a histogram and density plot using this but I can't seem to find a way of doing so without having to generate new bins etc. Using this solution here I tried to do the following:-
p <- ggplot(data, aes(x= binRange, y=Frequency)) + geom_histogram(stat="identity")
but it crashes. Anyone know of how to deal with this?
Thank you
the problem is that ggplot doesnt understand the data the way you input it, you need to reshape it like so (I am not a regex-master, so surely there are better ways to do is):
df <- read.table(header = TRUE, text = "
binRange Frequency
1 (0,0.025] 88
2 (0.025,0.05] 72
3 (0.05,0.075] 92
4 (0.075,0.1] 38
5 (0.1,0.125] 20
6 (0.125,0.15] 16")
library(stringr)
library(splitstackshape)
library(ggplot2)
# extract the numbers out,
df$binRange <- str_extract(df$binRange, "[0-9].*[0-9]+")
# split the data using the , into to columns:
# one for the start-point and one for the end-point
df <- cSplit(df, "binRange")
# plot it, you actually dont need the second column
ggplot(df, aes(x = binRange_1, y = Frequency, width = 0.025)) +
geom_bar(stat = "identity", breaks=seq(0,0.125, by=0.025))
or if you don't want the data to be interpreted numerically, you can just simply do the following:
df <- read.table(header = TRUE, text = "
binRange Frequency
1 (0,0.025] 88
2 (0.025,0.05] 72
3 (0.05,0.075] 92
4 (0.075,0.1] 38
5 (0.1,0.125] 20
6 (0.125,0.15] 16")
library(ggplot2)
ggplot(df, aes(x = binRange, y = Frequency)) + geom_bar(stat = "identity")
you won't be able to plot a density-plot with your data, given its not continous but rather categorical, thats why I actually prefer the second way of showing it,
You can try
library(ggplot2)
ggplot(df, aes(x = binRange, y = Frequency)) + geom_col()

Subset x and y variables separately in ggplot

I have a dataframe, df, that looks like so:
group ID y1 y2
A 1 21 14
A 2 11 21
A 3 21 17
...
B 1 71 12
B 2 41 14
B 3 31 15
...
And would like to use ggplot() to plot variables in one group against variables in another. For example, df$y1[df$group=="A"] against df$y2[df$group=="B"]. I naively thought the code for plotting may be something like this, but it's obviously not correct:
ggplot(df, aes(x = df$y1[df$group=="A"], y = df$y2[df$group=="B"])) + geom_point()
I know that if I wanted to subset the overall data, for example to plot only group A, I could do something like:
ggplot(subset(df, group=="A"), aes(x = y1, y = y2)) + geom_point()
I think I could solve this by reshaping my data so as to create variables y1.A, y1.B, y2.A, y2.B and so forth, but I have many variables and this seems like a long-winded approach.

Resources