I'm trying to create a proportional stacked area graph like the one I get from my mock data (Figure 1). When I try this with my real 16S data, I get Figure 2 instead.
After converting to percentages, the column classes are identical in the mock and 16S data: Timepoint is integer, Taxa is character, n is integer, and percentage is numeric.
I want the x-axis treated categorically in one graph and numerically in another for the 16S data, as with the mock data, and I want to tidy up the overlapping lines so that the 16S plot looks like the mock plot.
dput(S1_RA1[1:40,])
structure(list(Timepoint = c(-10L, -10L, -10L, -10L, -10L, -10L,
-10L, -10L, -10L, -3L, -3L, -3L, -3L, -3L, -3L, -3L, -3L, -3L,
-3L, -3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), Taxa = c(" Anaerococcus", " Bacteroides",
" Bifidobacterium", " Bilophila", " Collinsella", " Lachnoclostridium",
" Streptococcus", " Veillonella", "Enterobacter", " Acinetobacter",
" Anaerococcus", " Bacteroides", " Bifidobacterium", " Escherichia-Shigella",
" Flavobacterium", " Lachnoclostridium", " Parabacteroides",
" Peptoniphilus", " Veillonella", "Enterobacter", " Acinetobacter",
" Bacteroides", " Bifidobacterium", " Bilophila", " Collinsella",
" Desemzia", " Escherichia-Shigella", " Lachnoclostridium", " Parabacteroides",
" Streptococcus", " Veillonella", " Bacteroides", " Bifidobacterium",
" Bilophila", " Desemzia", " Escherichia-Shigella", " Lachnoclostridium",
" Parabacteroides", " Streptococcus", " Veillonella"), n = c(40L,
2188L, 665L, 84L, 55L, 131L, 153L, 11325L, 185L, 127L, 62L, 1123L,
172L, 63L, 2L, 118L, 100L, 9L, 23123L, 109L, 253L, 2658L, 348L,
163L, 204L, 27L, 163L, 245L, 290L, 41L, 17497L, 2325L, 50L, 197L,
13L, 255L, 152L, 478L, 92L, 19692L), percentage = c(0.00269796303790638,
0.147578578173479, 0.0448536355051936, 0.0056657223796034, 0.00370969917712127,
0.0088358289491434, 0.0103197086199919, 0.763860785107244, 0.012478079050317,
0.00507837492002559, 0.00247920665387076, 0.0449056301983365,
0.00687779910428663, 0.00251919385796545, 7.99744081893794e-05,
0.00471849008317338, 0.00399872040946897, 0.000359884836852207,
0.92462412028151, 0.00435860524632118, 0.0115583169628581, 0.121430855680936,
0.0158983964548403, 0.0074466627072959, 0.00931974964594088,
0.00123349627666865, 0.0074466627072959, 0.0111928365845859,
0.0132486637123669, 0.00187308693864498, 0.799351272328567, 0.0978823727529154,
0.00210499726350356, 0.00829368921820402, 0.000547299288510925,
0.0107354860438681, 0.00639919168105081, 0.020123773839094, 0.00387319496484655,
0.829032122258241)), row.names = c(NA, -40L), groups = structure(list(
Timepoint = c(-10L, -3L, 0L, 1L), .rows = structure(list(
1:9, 10:20, 21:31, 32:40), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
I've tried the following:
Setting the scale_x_discrete to scale_x_continuous
Converting to aes(x = as.factor(Timepoint), ...)
Changing the limits/expand parameters in the scale_x_discrete code
Removing the negative timepoints
Changing the Number column in the S1_RA2 file to match the number system in Table 1
My code for the 16S is as follows and is almost identical to the mock except for the colors:
library(ggplot2)
library(dplyr)
library(RColorBrewer)  # provides brewer.pal()
RA1 <- read.csv("RA1.csv", header = TRUE)
# Transform relative abundance from RA1.csv to percentages within each timepoint
S1_RA1 <- RA1 %>%
  group_by(Timepoint, Taxa) %>%
  summarise(n = sum(Relative.Abundance)) %>%
  mutate(percentage = n / sum(n))
head(S1_RA1)
# Set1 has only 9 colors, so interpolate a 16-color palette
nb.cols <- 16
getPalette <- colorRampPalette(brewer.pal(9, 'Set1'))(nb.cols)
#Revised code - The code below works courtesy of Gregor's comment
library(tidyr)
S1_RA1 <- S1_RA1 %>% ungroup %>%
complete(Timepoint, Taxa, fill = list(n = 0, percentage = 0))
ggplot(S1_RA1, aes(x = factor(Timepoint), y = percentage, fill = Taxa, group = Taxa)) +
geom_area(position = "fill", colour = "black", size = .5, alpha = .7) +
scale_y_continuous(name="Relative Abundance", expand=c(0,0)) +
scale_x_discrete(name="Timepoint (d)", expand=c(0,0)) +
scale_fill_manual(values = getPalette) +
theme(legend.position='bottom')
I fixed three things:
You want the x-scale to be treated categorically, so we need factor(Timepoint). (The default scale is then fine, so we delete your manually specified limits.)
When we use a discrete x-axis scale, we have to tell ggplot explicitly which dots we want to connect. We do this by adding the group = Taxa aesthetic.
The weird lines cutting through the middle of other polygons appear because you don't have an observation for every taxon at every timepoint, so when the dots are connected they may cut through intermediate timepoints. Use tidyr::complete to fill in the missing observations with 0s.
library(tidyr)
S1_RA1 = S1_RA1 %>% ungroup %>%
complete(Timepoint, Taxa, fill = list(n = 0, percentage = 0))
ggplot(S1_RA1, aes(x = factor(Timepoint), y = percentage, fill = Taxa, group = Taxa)) +
geom_area(position = "fill", colour = "black", size = .5, alpha = .7) +
scale_y_continuous(name="Relative Abundance", expand=c(0,0)) +
scale_x_discrete(
name="Timepoint (d)", expand=c(0,0)
) +
scale_fill_manual(values = getPalette) +
theme(legend.position='bottom')
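To see what complete() does in isolation, here is a minimal sketch on a made-up table. The toy data frame and its values are hypothetical and not taken from the question. Taxon "B" has no row at timepoint 2, and complete() adds that row with a percentage of 0, so every taxon has a point at every timepoint:

```r
library(tidyr)

# Hypothetical toy data: taxon "B" is missing at timepoint 2
toy <- data.frame(
  Timepoint  = c(1L, 1L, 2L, 3L, 3L),
  Taxa       = c("A", "B", "A", "A", "B"),
  percentage = c(0.6, 0.4, 1.0, 0.3, 0.7)
)

# Expand to every Timepoint/Taxa combination; new rows get percentage = 0
toy_full <- complete(toy, Timepoint, Taxa, fill = list(percentage = 0))
toy_full
```

With the gap filled, the polygon for "B" drops to zero at timepoint 2 instead of being connected straight from timepoint 1 to timepoint 3.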
The code I used and the result can be seen in the image below. The main problem is that the title doesn't appear in the center and the x and y labels don't appear at all. How do I fix this?
[Image: the graph and the code]
You should upload your code as a snippet and your data so we can reproduce this on our own machines easily...
Take the example below. You can recreate the data set and then run the code immediately.
Use ggtitle, xlab, and ylab to add the text, and center the title with theme(plot.title = element_text(hjust = 0.5)).
If this does not help, the problem is probably in your print / render settings.
library(data.table)  # needed for the data.table() call below
balloon <- data.table(structure(list(Genera = c("Prevotella", "Treponema", "Fusobacterium","Selenomonas", "Veillonella", "Porphyromonas", "Streptococcus","Leptotrichia", "Aggregatibacter", "Succiniclasticum"), S1 = c(97L,28L, 11L, 40L, 5L, 13L, 10L, 24L, 0L, 16L), S3 = c(5370L, 3760L,5551L, 2087L, 533L, 873L, 1330L, 5877L, 1213L, 44L), S4 = c(7892L,8004L, 11017L, 19712L, 5115L, 2695L, 7451L, 13611L, 301L, 2557L), S5 = c(23L, 79L, 30L, 7L, 0L, 34L, 0L, 2L, 2L, 0L), S6 = c(8310L,3379L, 38058L, 1133L, 2506L, 17811L, 12103L, 403L, 668L, 3L),S2 = c(7379L, 14662L, 10085L, 148L, 1502L, 5222L, 1010L,2463L, 4790L, 28L), S7 = c(6238L, 18977L, 2674L, 2198L, 27L,2999L, 174L, 1197L, 5268L, 5L), S8 = c(20019L, 18674L, 15306L,1472L, 1898L, 9600L, 1683L, 2221L, 3435L, 1109L), S9 = c(153L,12L, 23L, 36L, 15L, 15L, 6L, 41L, 0L, 30L), S10 = c(20103L,29234L, 10857L, 2869L, 4923L, 14206L, 1415L, 4574L, 649L,2160L)), .Names = c("Genera", "S1", "S3", "S4", "S5", "S6","S2", "S7", "S8", "S9", "S10"), class = c("data.table", "data.frame"), row.names = c(NA, -10L)))
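library(ggplot2)
library(reshape2)
library(data.table)
# Alternatively, read the same data from a file instead of recreating it:
# balloon <- fread("Downloads/balloon.csv")
balloon
# Reshape to long format: one row per Genera/sample pair
balloon_melted <- melt(balloon, id.vars = "Genera")
head(balloon_melted)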
p <- ggplot(balloon_melted, aes(x =variable, y = Genera))
p+
geom_point( aes(size=value))+
theme(panel.background=element_blank(),
panel.border = element_rect(colour = "blue", fill=NA, size=1)) +
ggtitle("Pretty title") +
xlab("x lab label") +
ylab("y lab label") +
theme(plot.title = element_text(hjust = 0.5))
This is my data frame
head(sep)
I need to find the percentage of all SEP 11 in each row.
For instance, in the first row, the percentage of SEP 11 is
100 * ((63 + 124)/ (63 + 124 + 0 + 0))
I would like this stored in a newly created 8th column.
Thanks
dput
> dput(head(sep))
structure(list(Site = structure(1:6, .Label = c("31R001", "31R002",
"31R003", "31R004", "31R005", "31R006", "31R007", "31R008", "31R011",
"31R013", "31R014", "31R016", "31R018", "31R019", "31R020", "31R021",
"31R022", "31R023", "31R024", "31R025", "31R026", "31R027", "31R029",
"31R030", "31R031", "31R032", "31R034", "31R035", "31R036", "31R038",
"31R039", "31R040", "31R041", "31R042", "31R043", "31R044", "31R045",
"31R046", "31R048", "31R049", "31R050", "31R051", "31R052", "31R053",
"31R054", "31R055", "31R056", "31R057", "31R058", "31R059", "31R060",
"31R061", "31R069", "31R071", "31R072", "31R075", "31R435", "31R440",
"31R445", "31R450", "31R455", "31R460", "31R470", "31R600", "31R722",
"31R801", "31R825", "31R826", "31R829", "31R840", "31R843", "31R861",
"31R880"), class = "factor"), Latitude = c(33.808874, 33.877256,
33.820825, 33.852373, 33.829697, 33.810274), Longitude = c(-117.844048,
-117.700135, -117.811845, -117.795516, -117.787532, -117.830429
), Windows.SEP.11 = c(63L, 174L, 11L, 85L, 163L, 71L), Mac.SEP.11 = c(0L,
1L, 4L, 0L, 0L, 50L), Windows.SEP.12 = c(124L, 185L, 9L, 75L,
23L, 5L), Mac.SEP.12 = c(0L, 1L, 32L, 1L, 0L, 50L)), .Names = c("Site",
"Latitude", "Longitude", "Windows.SEP.11", "Mac.SEP.11", "Windows.SEP.12",
"Mac.SEP.12"), row.names = c(NA, 6L), class = "data.frame")
Assuming you want the row sums of the columns whose names contain 'Windows', subset the dataset (called "sep1" here) using grep. Then take rowSums(Sub1), divide by the rowSums of all the count columns (sep1[4:7]), multiply by 100, and assign the result to a new column ("newCol").
Sub1 <- sep1[grep("Windows", names(sep1))]
sep1$newCol <- 100*rowSums(Sub1)/rowSums(sep1[4:7])
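As a quick check, the sketch below rebuilds only the first two rows of the question's dput output (the data frame is named sep1 here to match the answer) and applies the same two lines. The first row should come out as the value worked by hand in the question, 100 * ((63 + 124) / (63 + 124 + 0 + 0)):

```r
# First two rows of the question's data, retyped from the dput output
sep1 <- data.frame(
  Site           = c("31R001", "31R002"),
  Latitude       = c(33.808874, 33.877256),
  Longitude      = c(-117.844048, -117.700135),
  Windows.SEP.11 = c(63L, 174L),
  Mac.SEP.11     = c(0L, 1L),
  Windows.SEP.12 = c(124L, 185L),
  Mac.SEP.12     = c(0L, 1L)
)

# Columns whose names contain "Windows"
Sub1 <- sep1[grep("Windows", names(sep1))]

# Windows share of all counts (columns 4 to 7), in percent
sep1$newCol <- 100 * rowSums(Sub1) / rowSums(sep1[4:7])
sep1
```

The new column is appended as the 8th column, as the question asked.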
I have the following data set
structure(list(Collimator = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("n", "y"), class = "factor"), angle = c(0L,
15L, 30L, 45L, 60L, 75L, 90L, 105L, 120L, 135L, 150L, 165L, 180L,
0L, 15L, 30L, 45L, 60L, 75L, 90L, 105L, 120L, 135L, 150L, 165L,
180L), X1 = c(2099L, 11070L, 17273L, 21374L, 23555L, 23952L,
23811L, 21908L, 19747L, 17561L, 12668L, 6008L, 362L, 53L, 21L,
36L, 1418L, 6506L, 10922L, 12239L, 8727L, 4424L, 314L, 38L, 21L,
50L), X2 = c(2126L, 10934L, 17361L, 21301L, 23101L, 23968L, 23923L,
21940L, 19777L, 17458L, 12881L, 6051L, 323L, 40L, 34L, 46L, 1352L,
6569L, 10880L, 12534L, 8956L, 4418L, 344L, 58L, 24L, 68L), X3 = c(2074L,
11109L, 17377L, 21399L, 23159L, 23861L, 23739L, 21910L, 20088L,
17445L, 12733L, 6046L, 317L, 45L, 26L, 46L, 1432L, 6495L, 10862L,
12300L, 8720L, 4343L, 343L, 38L, 34L, 60L), average = c(2099.6666666667,
11037.6666666667, 17337, 21358, 23271.6666666667, 23927, 23824.3333333333,
21919.3333333333, 19870.6666666667, 17488, 12760.6666666667,
6035, 334, 46, 27, 42.6666666667, 1400.6666666667, 6523.3333333333,
10888, 12357.6666666667, 8801, 4395, 333.6666666667, 44.6666666667,
26.3333333333, 59.3333333333)), .Names = c("Collimator", "angle",
"X1", "X2", "X3", "average"), row.names = c(NA, -26L), class = "data.frame")
I first scale the average counts for both collimator y and n to make the highest count 1:
library(plyr)
df <- ddply(df, .(Collimator), transform,
norm.average = average / max(average))
and plot the curves:
ggplot(df, aes(x=angle,y=norm.average,col=Collimator)) +
geom_point() + geom_line()
The geom_line result is not pleasing to the eye, and I would rather fit a curve to the data using stat_smooth. Each data set should be symmetric about its mean, so I think a Gaussian fit would be ideal. How can I fit a Gaussian to the data sets Collimator == "y" and Collimator == "n", in ggplot2 or in base R? I would also like to output the mean and standard deviation. Can this be done?
Strictly speaking, your data are not Gaussian-distributed; the curve just has a Gaussian-like shape. Here is an example that fits such a curve with nls and visualizes it:
fit <- dlply(df, .(Collimator), function(x) {
co <- coef(nls(norm.average ~ exp(-(angle - m)^2/(2 * s^2)), data = x, start = list(s = 50, m = 80)))
stat_function(fun = function(x) exp(-(x - co["m"])^2/(2 * co["s"]^2)), data = x)
})
ggplot(df, aes(x = angle, y = norm.average, col = Collimator)) + geom_point() + fit
Updated
To obtain the parameters:
fit <- dlply(df, .(Collimator), function(x) {
co <- coef(nls(norm.average ~ exp(-(angle - m)^2/(2 * s^2)), data = x, start = list(s = 50, m = 80)))
r <- stat_function(fun = function(x) exp(-(x - co["m"])^2/(2 * co["s"]^2)), data = x)
attr(r, ".coef") <- co
r
})
then,
> ldply(fit, attr, ".coef")
Collimator s m
1 n 52.99117 82.60820
2 y 21.99518 86.61268
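Since the question also asked about base R, here is a minimal sketch of the same fit without plyr or ggplot2, for the Collimator "y" rows only. The model formula and starting values are the ones used in the answer above; the object names (dat, fit_y, grid) are made up for this illustration. It should give values close to those reported above for "y", though that is not guaranteed across R versions:

```r
# Collimator "y" rows, retyped from the question's dput output
angle   <- seq(0, 180, by = 15)
average <- c(46, 27, 42.6666666667, 1400.6666666667, 6523.3333333333,
             10888, 12357.6666666667, 8801, 4395, 333.6666666667,
             44.6666666667, 26.3333333333, 59.3333333333)
dat <- data.frame(angle = angle, norm.average = average / max(average))

# Same Gaussian-shaped model and starting values as in the answer
fit_y <- nls(norm.average ~ exp(-(angle - m)^2 / (2 * s^2)),
             data = dat, start = list(s = 50, m = 80))

co <- coef(fit_y)
co[["m"]]        # fitted centre of the curve (the "mean")
abs(co[["s"]])   # fitted width (the "standard deviation"); s enters squared

# Fitted values on a fine grid, ready for lines() or geom_line()
grid <- data.frame(angle = seq(0, 180, by = 1))
grid$fitted <- predict(fit_y, newdata = grid)
```

summary(fit_y) additionally reports standard errors for both parameters.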
2 questions based on my data.frame
structure(list(Collimator = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("n", "y"), class = "factor"), angle = c(0L,
15L, 30L, 45L, 60L, 75L, 90L, 105L, 120L, 135L, 150L, 165L, 180L,
0L, 15L, 30L, 45L, 60L, 75L, 90L, 105L, 120L, 135L, 150L, 165L,
180L), X1 = c(2099L, 11070L, 17273L, 21374L, 23555L, 23952L,
23811L, 21908L, 19747L, 17561L, 12668L, 6008L, 362L, 53L, 21L,
36L, 1418L, 6506L, 10922L, 12239L, 8727L, 4424L, 314L, 38L, 21L,
50L), X2 = c(2126L, 10934L, 17361L, 21301L, 23101L, 23968L, 23923L,
21940L, 19777L, 17458L, 12881L, 6051L, 323L, 40L, 34L, 46L, 1352L,
6569L, 10880L, 12534L, 8956L, 4418L, 344L, 58L, 24L, 68L), X3 = c(2074L,
11109L, 17377L, 21399L, 23159L, 23861L, 23739L, 21910L, 20088L,
17445L, 12733L, 6046L, 317L, 45L, 26L, 46L, 1432L, 6495L, 10862L,
12300L, 8720L, 4343L, 343L, 38L, 34L, 60L), average = c(2099.6666666667,
11037.6666666667, 17337, 21358, 23271.6666666667, 23927, 23824.3333333333,
21919.3333333333, 19870.6666666667, 17488, 12760.6666666667,
6035, 334, 46, 27, 42.6666666667, 1400.6666666667, 6523.3333333333,
10888, 12357.6666666667, 8801, 4395, 333.6666666667, 44.6666666667,
26.3333333333, 59.3333333333)), .Names = c("Collimator", "angle",
"X1", "X2", "X3", "average"), row.names = c(NA, -26L), class = "data.frame")
I wish to plot detector counts versus angle with and without a collimator attached to the device. I guess geom_point is probably the best way to summarise the data
p <- ggplot(df, aes(x=angle,y=average,col=Collimator)) + geom_point() + geom_line()
Instead of plotting average count in the y-axis, I would prefer to rescale the data so that the angle with max counts has a value 1 for both collimator Y and N. The way I have done this seems quite cumbersome
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
coly = subset(df,Collimator=='y')
coly$norm_count = range01(coly$average)
coln = subset(df,Collimator=='n')
coln$norm_count = range01(coln$average)
df = rbind(coln,coly)
p <- ggplot(df, aes(x=angle,y=norm_count,col=Collimator)) + geom_point() + geom_line()
I'm sure this can be done in a more efficient manner, applying the function to the data.frame based on the variable 'Collimator'. How can I do this?
I also want to fit a function to the data rather than using geom_line. I think a Gaussian function may work in this case, but I have no idea how, or whether, I can implement this in stat_smooth. Can I also pull out the mean and standard deviation from such a fit?
ggplot2 goes hand in hand with the package plyr:
df <- ddply(df,.(Collimator),
transform,
norm_count1 = (average - min(average)) / (max(average) - min(average)) )
joran's answer scales the highest value to 1 and the lowest to 0; if you just want to scale to make the highest value 1 (and leaving 0 as 0), it is even simpler.
library("plyr")
df <- ddply(df, .(Collimator), transform,
norm.average = average / max(average))
Then the plot is
ggplot(df, aes(x=angle,y=norm.average,col=Collimator)) +
geom_point() + geom_line()
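If you would rather not depend on plyr, the same per-group scaling can be sketched in base R with ave(). This is a different technique from the ddply calls above, shown on a small made-up data frame (df_demo and its values are hypothetical):

```r
# Hypothetical miniature of the question's data: two groups, three rows each
df_demo <- data.frame(
  Collimator = c("n", "n", "n", "y", "y", "y"),
  average    = c(2099.67, 23927, 334, 46, 12357.67, 59.33)
)

# ave() applies FUN within each Collimator group and returns a vector
# in the original row order
df_demo$norm.average <- ave(df_demo$average, df_demo$Collimator,
                            FUN = function(v) v / max(v))
df_demo
```

Replacing the function body with (v - min(v)) / (max(v) - min(v)) gives the 0-to-1 rescaling from the first answer instead.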