I need to create some box plots showing the abundance of some bacterial taxa in different samples.
My data looks like:
my.data <- "Taxon 06.TO.VG 21.TO.V 02.TO.VG 41.TO.VG 30.TO.V 04.BA.V 34.TO.VG 01.BA.V 28.TO.VG 18.TO.O 44.TO.V 08.BA.O 07.BA.O 06.BA.V 11.TO.V 06.BA.VG 07.BA.VG 05.BA.VG 07.BA.V 05.BA.V 06.BA.O 02.BA.O 04.BA.O 01.BA.O 05.BA.O 03.BA.O 02.BA.VG 03.BA.V 02.BA.V 04.BA.VG 03.BA.VG 01.BA.VG 15.TO.O 31.TO.O 09.TO.O 27.TO.V 42.TO.VG 08.TO.VG 16.TO.O 07.TO.V 13.TO.O 32.TO.V 29.TO.VG 10.TO.V 25.TO.V 05.TO.VG 20.TO.O 19.TO.V 17.TO.O 35.TO.V 43.TO.O 24.TO.V 26.TO.VG 01.TO.VG 37.TO.O 04.TO.VG 33.TO.O 39.TO.VG 14.TO.O 12.TO.O 38.TO.VG 22.TO.O
Bacteroides 0.072745558 0.011789182 0.028956894 0.059031877 0.097387173 0.086673889 0.432662192 0.060246679 0.269535674 0.152713335 0.014511873 0.063421323 0.091253905 0.139856373 0.013677012 0.200847907 0.180712032 0.21332737 0.031756181 0.272166702 0.019861211 0.133804422 0.168692685 0.100862392 0.152431791 0.104702194 0.119352089 0.410334347 0.024104844 0.0493905 0.068065382 0.047854785 0.011860175 0.168986083 0.015748031 0.407974482 0.264409881 0.250364431 0.330547112 0.536443695 0.578045113 0.400459167 0.204446209 0.357879234 0.242751388 0.488863722 0.521495803 0.001852281 0.045638126 0.503566932 0.069072806 0.171181339 0.183629007 0.371751412 0.385231317 0.023690205 0.255697356 0.104054054 0.242741552 0.043973941 0.221033868 0.004587156
Prevotella 0.073080791 0.302011096 0.586048042 0.487603306 0.290973872 0.014897075 0 0.333254269 0.029445074 0 0.153034301 0.002399726 0.025658188 0.090664273 0.440294582 0.100688924 0 0 0 0 0 0.000227946 0.093623374 0 0.000197707 0.115987461 0.076442171 0 0.047507606 0.000210172 0.000243962 0.042079208 0.52184769 0 0.394750656 0 0 0.235787172 0 0.000936856 0.000300752 0 0.051607781 0 0 0 0.002289494 0.735586941 0.023828756 0 0.011200996 0 0.046374105 0 0.00044484 0.085421412 0.000455789 0.306756757 0 0.11970684 0.008912656 0.371559633"
I'm wondering about using ggplot2 to do the box plot, but I'm not sure how the data have to be formatted.
I tried this:
df <- read.csv("my.data", header=T)
ggplot(data = df, aes(x=variable, y=value)) + geom_boxplot(aes(fill=Taxon))
but it gave me an error saying that the variable was not found.
Can anyone help me?
Many thanks
Francesca
A quick example of how to format your data:
categs = sample(LETTERS[1:3], 120, TRUE)
y = c(rnorm(40), rnorm(40, 3, 2), rnorm(40, 5, 3))
# example dataset
dados = data.frame(categs, y)
require(ggplot2)
ggplot(dados) + geom_boxplot(aes(x = categs, y = y))
# categs y
#1 B 0.7392673
#2 B -0.1694076
#3 A -2.3804024
#4 B 0.5999949
#5 A 0.5816400
#6 A 2.1263669
See also http://ggplot2.org/
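Applied to the data in the question, the same idea might look like the sketch below. It assumes the text is parsed with read.table(text = ...) rather than read.csv("my.data"), which looks for a file literally named my.data, and it only uses the first four samples to keep things short. The column names sample and value are my own choice; ggplot can only find a variable such as value if the reshaped data frame really has a column with that name.

```r
library(ggplot2)

# Abbreviated copy of the question's data (first four samples only)
my.data <- "Taxon 06.TO.VG 21.TO.V 02.TO.VG 41.TO.VG
Bacteroides 0.072745558 0.011789182 0.028956894 0.059031877
Prevotella 0.073080791 0.302011096 0.586048042 0.487603306"

# read.table() splits on whitespace; check.names = FALSE keeps names like "06.TO.VG"
wide <- read.table(text = my.data, header = TRUE, check.names = FALSE)

# Reshape to long format in base R: one row per taxon/sample pair.
# unlist() walks the sample columns one after another, so Taxon repeats
# as a block while each sample name is repeated once per taxon.
long <- data.frame(
  Taxon  = rep(wide$Taxon, times = ncol(wide) - 1),
  sample = rep(names(wide)[-1], each = nrow(wide)),
  value  = unlist(wide[-1], use.names = FALSE)
)

# One box per taxon, summarising its abundance across samples
p <- ggplot(long, aes(x = Taxon, y = value, fill = Taxon)) +
  geom_boxplot()
```

Printing p draws the plot. With the full table the same code applies unchanged, since the reshape only depends on the first column being Taxon.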
I'm trying to plot the seasonality of nesting and hatching for turtles on one graph - with the count of how many nests were laid/hatched for every day of the season (01/05/2021-30/09/2021). Some of my data is as follows:
Date - Laid Green - Hatched Green
14/05/2021 - 0 - 0
15/05/2021 - 0 - 0
16/05/2021 - 0 - 0
17/05/2021 - 0 - 0
18/05/2021 - 0 - 0
19/05/2021 - 0 - 0
20/05/2021 - 1 - 0
21/05/2021 - 2 - 0
22/05/2021 - 0 - 0
23/05/2021 - 1 - 0
24/05/2021 - 2 - 0
25/05/2021 - 0 - 0
26/05/2021 - 1 - 0
27/05/2021 - 4 - 0
When I then try to plot it with ggplot using:
ggplot(seasonality,aes(x=Date,y=seasonality$Laid Green))+geom_bar(stat="identity",width=1)
I get this:
I want to pool my data so that this is visually more pleasing, perhaps into 5 days? but I'm unsure how to do this. I am also trying to plot the green hatched on the same graph with nesting and hatching in 2 different colours.
Any help is appreciated!
You can use the package lubridate to round dates to a week start. dplyr from tidyverse can help you to then sum the counts.
library(lubridate)
library(tidyverse)
# so our random dataframes look the same
set.seed(123)
# fake data
seasonality <- tibble(date = sample(seq(as.Date('2021-04-01'), as.Date('2021-06-01'), by = "day"),
                                    size = 100,
                                    replace = TRUE),
                      laid_green = sample(c(0:1),
                                          size = 100,
                                          replace = TRUE),
                      hatched_green = sample(c(0:1),
                                             size = 100,
                                             replace = TRUE)
                      ) %>%
  arrange(date)
# plot
seasonality %>%
  mutate(week = floor_date(date,
                           unit = 'week')
         ) %>%
  group_by(week) %>%
  summarise(laid_green = sum(laid_green),
            hatched_green = sum(hatched_green)) %>%
  pivot_longer(-week) %>%
  ggplot(aes(x = week, y = value, fill = name)) +
  geom_col(position = 'dodge')
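If you want 5-day bins as mentioned in the question rather than calendar weeks, one possible sketch is to count days from the first date and use integer division. This version uses base R for the pooling and the 14 rows shown in the question; the names season_start, bin_start, pooled, event and nests are my own, not from the original data.

```r
library(ggplot2)

# The 14 days listed in the question (no nests hatched in that period)
seasonality <- data.frame(
  date = seq(as.Date("2021-05-14"), as.Date("2021-05-27"), by = "day"),
  laid_green = c(0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 2, 0, 1, 4),
  hatched_green = 0
)

# Days elapsed since the first date, then the start date of each 5-day bin
season_start <- min(seasonality$date)
day_index <- as.integer(seasonality$date - season_start)
seasonality$bin_start <- season_start + 5 * (day_index %/% 5)

# Sum both counts within each bin
pooled <- aggregate(cbind(laid_green, hatched_green) ~ bin_start,
                    data = seasonality, FUN = sum)

# Long format so that laid and hatched get different fill colours
long <- data.frame(
  bin_start = rep(pooled$bin_start, times = 2),
  event = rep(c("laid", "hatched"), each = nrow(pooled)),
  nests = c(pooled$laid_green, pooled$hatched_green)
)

p <- ggplot(long, aes(x = bin_start, y = nests, fill = event)) +
  geom_col(position = "dodge")
```

Note that the last bin can cover fewer than 5 days if the season length is not a multiple of 5, as happens here.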
I have a dataframe as below
G1 G2 G3 G4 group
S_1 0 269.067 0.0817233 243.22 N
S_2 0 244.785 0.0451406 182.981 N
S_3 0 343.667 0.0311259 351.329 N
S_4 0 436.447 0.0514887 371.236 N
S_5 0 324.709 0 293.31 N
S_6 0 340.246 0.0951976 393.162 N
S_7 0 382.889 0.0440337 335.208 N
S_8 0 368.021 0.0192622 326.387 N
S_9 0 267.539 0.077784 225.289 T
S_10 0 245.879 0.368655 232.701 T
S_11 0 17.764 0 266.495 T
S_12 0 326.096 0.0455578 245.6 T
S_13 0 271.402 0.0368059 229.931 T
S_14 0 267.377 0 248.764 T
S_15 0 210.895 0.0616382 257.417 T
S_16 0.0401525 183.518 0.0931699 245.762 T
S_17 0 221.535 0.219924 203.275 T
Now I want to make a multi-boxplot with all 4 genes in columns. The first 8 rows are normal samples and the remaining 9 rows are tumor samples, so for each gene I should be able to make 2 box plots labelled by tissue. I am able to make individual boxplots, but how should I put all 4 genes in one plot, label the tissue for each boxplot, and add the stripchart points? Is there an easy way to do it? I can only make individual plots using the row and column names, but I cannot label the plot based on the group column or overlay the points with stripchart. Any help will be appreciated. Thanks
with facet_wrap:
head(df)
G1 G2 G3 G4 group
S_1 0 269.067 0.0817233 243.220 N
S_2 0 244.785 0.0451406 182.981 N
S_3 0 343.667 0.0311259 351.329 N
S_4 0 436.447 0.0514887 371.236 N
S_5 0 324.709 0.0000000 293.310 N
S_6 0 340.246 0.0951976 393.162 N
library(reshape2)
df <- melt(df)
library(ggplot2)
ggplot(df, aes(x = variable,y = value, group=group, col=group)) +
facet_wrap(~variable, scales = 'free') + geom_boxplot()
Not sure what you mean with stripchart points, I assumed you wanted to visualize the actual points overlaid on the boxplots. Would the following suffice?
library(ggplot2)
library(dplyr)
library(reshape2)
melt(df) %>%
ggplot(aes(x = variable, y = value, col = group)) +
geom_boxplot() +
geom_jitter()
Where df is the above data frame. Result:
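If the tissue labels should appear on the axis of each panel, one variation is to put group on the x-axis and facet by gene. This is only a sketch on a four-sample, two-gene excerpt of the question's data; outlier.shape = NA is used so that outliers are not drawn twice once the raw points are overlaid.

```r
library(ggplot2)
library(reshape2)

# Two genes and four samples copied from the question
df <- data.frame(
  G2 = c(269.067, 244.785, 267.539, 245.879),
  G4 = c(243.22, 182.981, 225.289, 232.701),
  group = c("N", "N", "T", "T")
)

# Long format: one row per sample/gene pair, keeping the tissue label
long <- melt(df, id.vars = "group")

p <- ggplot(long, aes(x = group, y = value, colour = group)) +
  geom_boxplot(outlier.shape = NA) +        # boxes without separate outlier points
  geom_jitter(width = 0.15, height = 0) +   # raw points, jittered horizontally only
  facet_wrap(~variable, scales = "free_y")  # one panel per gene
```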
Could anybody please help me use ggplot2 in R to draw a bar plot with the columns (first, second, third, fourth, fifth) on the x-axis and their values on the y-axis, without showing the column "uname"?
> head(golQ1Grades)
qname uname first second third fourth fifth
1 onlinelernen_quiz_1 xxx 100 0 0 0 0
2 onlinelernen_quiz_2 xxxx 100 0 0 0 0
3 onlinelernen_quiz_4 xxxx 42 71 0 0 0
4 onlinelernen_quiz_7 xxxx 85 100 0 0 0
5 onlinelernen_quiz_1 xxx 85 100 0 0 0
6 onlinelernen_quiz_3 xxxx 71 0 0 0 0
Thanks in advance for the help.
It is my guess that you would like to display the mean value on the Y-axis.
library(ggplot2)
Data
dat<-data.frame(c(100,100,42,85,85,71), c(0,0,71,100,100,0), c(0,0,0,0,0,0), c(0,0,0,0,0,0), c(0,0,0,0,0,0))
names(dat)<-NULL
Compute mean and get new data
v1<-apply(dat, 2, mean)
nv1<-c("first","second","third", "fourth","fifth")
ndat<-data.frame(nv1, v1)
Plot
p <- ggplot(ndat, aes(factor(nv1, levels = nv1), v1))
p + geom_bar(stat="identity")
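The same computation can also be sketched directly on a data frame shaped like golQ1Grades, which avoids retyping the values. The rows below are the six shown in the question; the names score_cols, attempt and mean_score are my own. Setting the factor levels explicitly keeps the bars in the order first to fifth instead of alphabetical order.

```r
library(ggplot2)

# The six rows printed by head(golQ1Grades) in the question
golQ1Grades <- data.frame(
  qname = paste0("onlinelernen_quiz_", c(1, 2, 4, 7, 1, 3)),
  uname = c("xxx", "xxxx", "xxxx", "xxxx", "xxx", "xxxx"),
  first = c(100, 100, 42, 85, 85, 71),
  second = c(0, 0, 71, 100, 100, 0),
  third = 0, fourth = 0, fifth = 0
)

# Only the score columns are used, so uname never reaches the plot
score_cols <- c("first", "second", "third", "fourth", "fifth")
means <- colMeans(golQ1Grades[score_cols])

ndat <- data.frame(
  attempt = factor(score_cols, levels = score_cols),  # keep first..fifth order
  mean_score = as.numeric(means)
)

p <- ggplot(ndat, aes(attempt, mean_score)) + geom_col()
```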
I think the better option is dplyr and tidyr. E.g. (I changed the data frame a little):
library(dplyr)
library(tidyr)
library(ggplot2)
df <- data.frame(qname = letters[1:10],
first = seq(1,10,1),
second = seq(10,100,10),
third = seq(2,20,2))
And then use gather():
df <- df %>%
gather(variable, value, -qname)
In your case it will be:
df <- golQ1Grades %>%
gather(variable,value, -qname, -uname)
Furthermore, instead of computing the average value, facet_grid can also be extremely helpful:
ggplot(df, aes(factor(qname),value))+
geom_bar(stat = "identity")+
facet_grid(.~variable)
Following this example:
http://wiki.stdout.org/rcookbook/Graphs/Multiple%20graphs%20on%20one%20page%20(ggplot2)/
See the graph titled "Fitted growth curve per diet", I want to do the same thing but with a set of data that is in a CSV file such as (values are in µs, except for column "N"):
$ head RandomArray25PercentDup.csv
N SystemSort QuickSort RandomizedQuickSort TopDownMergeSort BottomUpMergeSort SelectionSort InsertionSort BubbleSort
4 0 1 0 1 0 1 0 0
5 0 0 0 1 1 0 1 0
6 0 0 0 1 1 0 0 0
7 0 0 0 0 1 0 0 0
8 0 0 1 0 1 0 1 1
...
I've tried this so far:
library(ggplot2)
library(reshape2)
data <- read.table("RandomArray25PercentDup.csv",
sep="\t",
header=TRUE)
data.m <- melt(data, id.vars = 1)
ggplot(data.m, aes(data, value, colour=variable)) +
geom_point(alpha=.3) +
geom_smooth(alpha=.2, size=1) +
ggtitle("Random array with ~25% duplicate values")
My background in R is very limited, and I'm trying to learn using various resources.
I have about 800'000 rows worth of data, with 20 repetitions in the measurement of each N (the reason why I want to see the scatter in transparent with a fitting curve for each algorithm).
Replacing this
data.m <- melt(data, id.vars = 1)
with
data.m <- melt(data, id.vars = "N")
and then
ggplot(data.m, aes(data, value, colour=variable)) +
geom_point(alpha=.3) +
geom_smooth(alpha=.2, size=1) +
ggtitle("Random array with ~25% duplicate values")
with
ggplot(data.m, aes(N, value, colour=variable)) +
geom_point(alpha=.3) +
geom_smooth(alpha=.2, size=1) +
ggtitle("Random array with ~25% duplicate values")
should do the trick. The first replacement isn't strictly necessary, but it's always preferable to use variable names in case the order of the columns changes. The first argument in aes is mapped to the x-axis. data is the name of the whole data frame, not a column of data.m, so it can't be mapped.
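To see why N works in aes, it can help to look at what melt returns. A small sketch using the first three rows and two algorithm columns from the question (the name timings is my own):

```r
library(ggplot2)
library(reshape2)

# First three rows of the question's file, two algorithms only
timings <- data.frame(
  N = c(4, 5, 6),
  SystemSort = c(0, 0, 0),
  QuickSort = c(1, 0, 0)
)

# N stays as an identifier column; every other column is stacked
timings.m <- melt(timings, id.vars = "N")
names(timings.m)  # "N" "variable" "value"

# All three aesthetics now refer to columns that exist in timings.m
p <- ggplot(timings.m, aes(N, value, colour = variable)) +
  geom_point(alpha = 0.3)
```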
I'm trying to get a series of plots from the following dataset and for loop:
> head(all5new[c(6,70,22:23)])#This is a snapshot of my dataset. There is more species, see below.
setID fishery blackdog smoothdog
11 1 TRAWL-PAND.BOR. 0 0
12 1 TRAWL-PAND.BOR. 0 0
13 1 TRAWL-REDFISH 0 0
14 1 TRAWL-PAND.BOR. 0 0
21 10 TRAWL-PAND.BOR. 0 0
22 10 TRAWL-PAND.BOR. 0 0
> elasmo #This is the list of the species for which I would like to have individual barplots
[1] "blackdog" "smoothdog" "spinydog" "mako" "porbeagle"
[6] "blue" "greenland" "portuguese" "greatwhite" "mackerelNS"
[11] "dogfish" "basking" "thresher" "deepseacat" "atlsharp"
[16] "oceanicwt" "roughsagre" "dusky" "sharkNS" "sand"
[21] "sandbar" "smoothhammer" "tiger" "wintersk" "abyssalsk"
[26] "arcticsk" "barndoorsk" "roundsk" "jensensk" "littlesk"
[31] "richardsk" "smoothsk" "softsk" "spinysk" "thorny"
[36] "whitesk" "stingrays" "skateNS" "manta" "briersk"
[41] "pelsting" "roughsting" "raysNS" "skateraysNS" "allSHARK"
[46] "allSKATE" "PELAGIC"
This is my for loop. The code works fine when I run it for one species, however when I run it for all, I always get the same barplot. I know it must be just a quick fix adding for example [[i]] somewhere in the code, but I tried different things without any success.
for (i in elasmo) {
# CALCULATE THE CATCH PER UNIT OF EFFORT (KG/SET) FOR ALL SPECIES FOR EACH FISHERY
test<-ddply(all5new,.(fishery),summarize, sets=length(as.factor(setID)),LOGcpue=log((sum(i)/length(as.factor(setID)))))
#TAKE THE FIRST 10 FISHERY WITH THE HIGHEST LOGcpue
x<-test[order(-test$LOGcpue)[1:10],]
#REORDER THE FISHERY FACTOR ACCORDINGLY (FOR GGPLOT2, TO HAVE EACH LEVEL IN ORDER)
list<-x$fishery
x$fishery <- factor(x$fishery, levels =list)
#BAR PLOT
graph<-ggplot(x, aes(fishery,LOGcpue)) + geom_bar() + coord_flip() +
geom_text(aes(label=sets,hjust=0.5,vjust=-1),size=4,angle = 270)
#SAVE GRAPH IN NEW DIR
ggsave(graph,filename=paste("barplot",i,".png",sep=""))
}
Here's a subset of my dataset after melting: mydata.
> data.melt<-melt(all5new, id.vars=c("tripID","setID","fishery"), measure.vars = c(22:23))
> head(data.melt);dim(data.melt)
tripID setID fishery variable value
1 1 1 TRAWL-PAND.BOR. blackdog 0
2 1 1 TRAWL-PAND.BOR. blackdog 0
3 1 1 TRAWL-REDFISH blackdog 0
4 1 1 TRAWL-PAND.BOR. blackdog 0
5 1 10 TRAWL-PAND.BOR. blackdog 0
6 1 10 TRAWL-PAND.BOR. blackdog 0
[1] 350100 5
Here's a workflow I use for generating lots of graphs, adapted to your dataset (or, my interpretation of it). This is a nice illustration of the power of plyr, I think. For your application, I don't think calculation times really matter. What is more important for you is generating easy-to-read code, and I think plyr is good for this.
#Load packages
require(plyr)
require(reshape)
require(ggplot2)
#Recreate your data set, with only two species
setID <- rep(1:5, each=4, times=1)
fishery <- gl(10, 2)
blackdog <- sample(1:5, size=20, replace=TRUE)
smoothdog <- sample(1:5, size=20, replace=TRUE)
df <- data.frame(setID, fishery, blackdog, smoothdog)
#Melt the data frame
dfm <- melt(df, id.vars = c("setID", "fishery"))
#Calculate LOGcpue for each fish at each fishery
cpueDF <- ddply(dfm, c("fishery", "variable"), summarise, LOGcpue = log(sum(value)/length(value)))
#Plot all the data in one (potentially huge) faceted plot.
#(I often use huge plots like this for onscreen analysis
# - obviously it can't be printed in practice, but you can get a visual overview of the data)
#stat="identity" makes the bar height equal to LOGcpue instead of a row count
ggplot(cpueDF, aes(x=fishery, y=LOGcpue)) + geom_bar(stat="identity") + coord_flip() + facet_wrap(~variable)
ggsave("giant plot.pdf", height=30, width=30, units="in")
#Print each plot individually to screen, and save it, and put it in a list
printGraph <- function(df) {
  p <- ggplot(df, aes(x=fishery, y=LOGcpue)) +
    geom_bar(stat="identity") + coord_flip()
  print(p)
  #paste0() avoids the space that paste() would insert before ".png"
  fn <- paste0(df$variable[1], ".png")
  ggsave(fn, plot=p)
  #Return the plot so that dlply() collects it in the list
  p
}
plotList <- dlply(cpueDF, .(variable), printGraph)
#Now pick out the top n fisheries for each fish
cpueDFtopN <- ddply(cpueDF, .(variable), function(x) head(x[order(x$LOGcpue, decreasing=T),], n=5))
ggplot(cpueDFtopN, aes(x=fishery, y=LOGcpue)) + geom_bar(stat="identity") +
  coord_flip() + facet_wrap(~variable, scales="free")
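As for the loop in the question: inside the loop, i holds a species name as a character string, so sum(i) never looks at the corresponding column. If you prefer to keep a for loop, a minimal sketch is to select the column with [[ ]]. The data below are a made-up stand-in for all5new, and the names sp, cpue_by_species and plots are my own.

```r
library(ggplot2)

# Hypothetical stand-in for all5new with two fisheries and two species
all5new <- data.frame(
  fishery = c("A", "A", "B", "B"),
  blackdog = c(1, 3, 2, 2),
  smoothdog = c(4, 4, 1, 3)
)
elasmo <- c("blackdog", "smoothdog")

cpue_by_species <- list()
plots <- list()
for (sp in elasmo) {
  # all5new[[sp]] is the column whose name is stored in sp;
  # the mean per fishery equals sum(catch) / number of sets
  cpue <- tapply(all5new[[sp]], all5new$fishery, mean)
  cpue_by_species[[sp]] <- data.frame(
    fishery = names(cpue),
    LOGcpue = log(as.numeric(cpue))
  )
  # One plot per species, stored under the species name
  plots[[sp]] <- ggplot(cpue_by_species[[sp]], aes(fishery, LOGcpue)) +
    geom_bar(stat = "identity") + coord_flip()
}
```

Each element of plots can then be passed to ggsave(), for example with a file name built from paste0("barplot", sp, ".png").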