overlay the histograms by using the fill parameter - r

I would like to create a graphic showing how often each type of event is responsible for reducing each species.
In total I have 9 species and 8 events. I would like the events to appear as different groups of bars (fill), with the species on the x-axis, as in the picture below.
I wrote the following script, but I get this error message:
Error: StatBin requires a continuous x variable the x variable is discrete. Perhaps you want stat="count"?
Would anyone have suggestions for a correct script?
Thank you very much in advance.
library(ggplot2)
event <- factor(Dataset, levels = c("A", "B", "C", "D", "E", "F", "G", "H"))
ggplot(Dataset) +
geom_histogram(aes(x=specie, fill=event),
colour="grey50", alpha=0.5, position="identity")
data
Dataset <- structure(list(specie = structure(1:9, .Label = c("Hipp_amph",
"Hipp_eq", "Phil_mont", "Pota_larv", "Red_aru", "Sylv_grim",
"Sync_caf", "Trag_oryx", "Trag_scri"), class = "factor"), A = c(2.97029703,
0, 13.86138614, 12.87128713, 0, 17.82178218, 2.97029703, 0, 0.99009901
), B = c(0, 7.920792079, 55.44554455, 51.48514851, 33.66336634,
27.72277228, 33.66336634, 15.84158416, 62.37623762), C = c(0,
5.940594059, 0.99009901, 8.910891089, 2.97029703, 0, 10.89108911,
4.95049505, 21.78217822), D = c(0, 0, 0, 0.99009901, 0, 0, 0,
0, 0), E = c(16.83168317, 28.71287129, 74.25742574, 100, 40.59405941,
32.67326733, 89.10891089, 27.72277228, 86.13861386), F = c(6.930693069,
0, 10.89108911, 42.57425743, 0, 0, 7.920792079, 0, 2.97029703
), G = c(0, 0, 0, 0.99009901, 0, 0, 0, 0, 0), H = c(0, 4.95049505,
1.98019802, 1.98019802, 15.84158416, 0, 19.8019802, 0, 1.98019802
)), .Names = c("specie", "A", "B", "C", "D", "E", "F", "G", "H"
), class = "data.frame", row.names = c(NA, -9L))

The error arises because geom_histogram() bins a continuous variable, but you are mapping a factor (specie) to the x axis.
One option is to plot the values themselves as a histogram and facet by specie, as below; the alternative is to give up filling the bars by event (A, B, etc.) and put specie in fill instead.
Either way, the data first need to be gathered into long format so that event and value can be passed to aes().
library(tidyverse)
Dataset <- Dataset %>% gather(event, value, 2:9)
ggplot(Dataset) +
geom_histogram(aes(x=value, fill=event), colour="grey50", alpha=0.5, position="identity") +
facet_wrap(~ specie)
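If the goal is the grouped bar chart described in the question (species on the x-axis, one bar per event), a dodged geom_col() may be closer to it than a histogram. The following is only a sketch under assumptions: it uses a three-species, three-event subset of the question's data to stay short, reshapes with base R instead of gather(), and the names `events`, `long` and `p` are illustrative.

```r
library(ggplot2)

# Subset of the question's data (3 species, 3 events) to keep the sketch short
Dataset <- data.frame(
  specie = c("Hipp_amph", "Hipp_eq", "Phil_mont"),
  A = c(2.97029703, 0, 13.86138614),
  B = c(0, 7.920792079, 55.44554455),
  E = c(16.83168317, 28.71287129, 74.25742574)
)

events <- setdiff(names(Dataset), "specie")

# Wide -> long: one row per specie/event pair
long <- data.frame(
  specie = rep(Dataset$specie, times = length(events)),
  event  = rep(events, each = nrow(Dataset)),
  value  = unlist(Dataset[events], use.names = FALSE)
)

# The values are already computed, so draw them directly with geom_col()
# and place the events side by side within each specie
p <- ggplot(long, aes(x = specie, y = value, fill = event)) +
  geom_col(position = "dodge", colour = "grey50")
print(p)
```

Because no binning or counting is involved, the StatBin error does not come up with this approach.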


Volcano plot for multiple clusters

I am trying to make a volcano plot for different clusters. I have 2 conditions, untreated vs. treated. I have a differential expression excel file that cellranger generated for me but within the file it has multiple clusters each which have a fold change and p value. How do I create a volcano plot that contains all the clusters rather than one? Would I have to do a volcano plot for each cluster and then combine them all somehow?
I used this code to generate the plot for just one of the clusters...
macrophage_list <- read.table("differential_expression_macrophage.csv", header = T, sep = ",")
EnhancedVolcano(macrophage_list, lab = as.character(macrophage_list$FeatureName), x = 'Cluster1.Log2.Fold.Change', y = 'Cluster1.Adjusted.P.Value', xlim = c(-8,8), title = 'Macrophage', pCutoff = 10e-5, FCcutoff = 1.5, pointSize = 3.0, labSize = 3.0)
How do I merge all the information in the excel file to create a volcano plot?
I uploaded each data cluster one by one and then merged them by using rbind, but is there a simpler/quicker way to do this?
output for dput(gene_list[1:20, 1:14])
structure(list(Feature.ID = structure(1:20, .Label = c("a", "b",
"c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o",
"p", "q", "r", "s", "t"), class = "factor"), Feature.Name = structure(1:20, .Label = c("A",
"B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N",
"O", "P", "Q", "R", "S", "T"), class = "factor"), Cluster.1_Mean.Counts = c(0.000960904,
0.000320301, 0.001281205, 0.000320301, 0.000320301, 0.016335362,
0.000960904, 0, 0.001601506, 0.000320301, 0.007046627, 0.026585,
0.017296265, 0.004804518, 0, 0.874742598, 0.017616566, 0.007366928,
0.008327831, 0.001921807), Cluster.1_Log2.fold.change = c(0.291978774,
1.954943787, -2.008530337, -2.482461526, 3.539906287, 0.407455991,
-0.214981215, 1.539906287, 0.802940693, 2.539906287, -1.333136538,
-1.879953595, -0.52422405, -0.877946228, 1.539906287, -0.629373147,
1.118442519, 0.170672478, 1.065975099, 1.099333696), Cluster.1_Adjusted.p.value = c(1,
0.910243711, 0.04672812, 0.080866038, 0.610296549, 0.80063597,
1, 1, 0.951841603, 0.797013021, 0.103401275, 0.000594428, 0.907754993,
0.532689631, 1, 0.480958806, 0.078345008, 1, 0.198557945, 0.668312142
), Cluster.2_Mean.Counts = c(0.000902278, 0.001804555, 0.006315943,
0.004511388, 0, 0.029775159, 0.001804555, 0, 0.002706833, 0,
0.023459216, 0.128123411, 0.030677437, 0.009022775, 0, 2.174488883,
0.018947828, 0.019850106, 0.010827331, 0.000902278), Cluster.2_Log2.fold.change = c(0.792589781,
4.769869705, 0.35201719, 0.839132367, 3.184907204, 1.32985554,
0.962514783, 3.184907204, 1.725475586, 2.599944703, 0.560416339,
0.580736324, 0.407299626, 0.184907204, 3.184907204, 0.816580902,
1.120776867, 1.742684876, 1.409613491, 0.599944703), Cluster.2_Adjusted.p.value = c(1,
0.153573448, 1, 0.737977734, 1, 0.14478935, 0.853816767, 1, 0.47952604,
1, 0.65316285, 0.507251471, 0.776636022, 1, 1, 0.346630571, 0.285006452,
0.060868933, 0.21546202, 1), Cluster.3_Mean.Counts = c(0.001813813,
0, 0.019045032, 0.00725525, 0, 0.022672657, 0.000906906, 0, 0,
0, 0.029927908, 0.043531502, 0.046252221, 0.029021001, 0, 3.146057931,
0.020858845, 0.013603594, 0.008162157, 0), Cluster.3_Log2.fold.change = c(1.455721575,
2.192687169, 2.008262598, 1.504631175, 3.192687169, 0.9044422,
0.334706174, 3.192687169, -0.451169021, 2.607724668, 0.931421856,
-1.032594057, 1.038258504, 1.970294748, 3.192687169, 1.412371018,
1.26985503, 1.14829305, 0.991053308, -0.451169021), Cluster.3_Adjusted.p.value = c(0.757752635,
1, 0.032609935, 0.33316083, 1, 0.441825712, 1, 1, 1, 1, 0.380305075,
0.605158722, 0.339946318, 0.016952505, 1, 0.056529024, 0.259458704,
0.339639234, 0.536765022, 1), Cluster.4_Mean.Counts = c(0.000641899,
0, 0.002567596, 0.004493293, 0, 0.010270384, 0.003209495, 0,
0.000641899, 0, 0.028243557, 0.160474756, 0.012196081, 0.005135192,
0, 1.199709274, 0.005135192, 0.004493293, 0.005777091, 0.001283798
), Cluster.4_Log2.fold.change = c(0.269229783, 1.661547206, -0.886889419,
0.778904157, 2.661547206, -0.289908942, 1.602653517, 2.661547206,
0.076584705, 2.076584705, 0.854192284, 0.961549693, -0.967809414,
-0.644261223, 2.661547206, -0.104384578, -0.790579612, -0.467735811,
0.459913345, 0.722947751), Cluster.4_Adjusted.p.value = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.584036686, 1, 1, 1, 1, 1, 1,
1, 1)), class = "data.frame", row.names = c(NA, 20L))
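Based on your dataset, you need to reshape the data into long format. So that the reshape can split each column name on a single separator, first rename the columns (assuming your data frame is called df). fixed = TRUE makes gsub treat the dot literally rather than as a regex wildcard:
colnames(df) <- gsub(".Mean", "_Mean", colnames(df), fixed = TRUE)
colnames(df) <- gsub(".Log2", "_Log2", colnames(df), fixed = TRUE)
colnames(df) <- gsub(".Adjus", "_Adjus", colnames(df), fixed = TRUE)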
Now, we can reshape it using the right pattern with pivot_longer function from tidyr package:
library(tidyr)
final_df <- df %>% pivot_longer(., -c(Feature.ID, Feature.Name), names_to = c("set",".value"), names_pattern = "(.+)_(.+)")
# A tibble: 80 x 6
Feature.ID Feature.Name set Mean.Counts Log2.fold.change Adjusted.p.value
<fct> <fct> <chr> <dbl> <dbl> <dbl>
1 a A Cluster.1 0.000961 0.292 1
2 a A Cluster.2 0.000902 0.793 1
3 a A Cluster.3 0.00181 1.46 0.758
4 a A Cluster.4 0.000642 0.269 1
5 b B Cluster.1 0.000320 1.95 0.910
6 b B Cluster.2 0.00180 4.77 0.154
7 b B Cluster.3 0 2.19 1
8 b B Cluster.4 0 1.66 1
9 c C Cluster.1 0.00128 -2.01 0.0467
10 c C Cluster.2 0.00632 0.352 1
# … with 70 more rows
Now we can create the volcano plot with ggplot2, using ggrepel to label points by Feature.Name (install ggrepel first if you don't have it):
library(ggplot2)
library(ggrepel)
ggplot(final_df, aes(x = Log2.fold.change,y = -log10(Adjusted.p.value), label = Feature.Name))+
geom_point()+
geom_text_repel(data = subset(final_df, Adjusted.p.value < 0.05),
aes(label = Feature.Name))
This gives a volcano plot with all clusters merged: all points share the same color, and only features with an adjusted p-value < 0.05 are labeled with their Feature.Name.
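Since the long format keeps the cluster in the set column, the clusters can also be told apart by color or panel. The sketch below is an illustration under assumptions: it rebuilds only the first ten rows of final_df from the printed tibble above (with the rounded values shown there), uses plain geom_text() instead of ggrepel to avoid an extra dependency, and the `significant` column is a helper introduced here.

```r
library(ggplot2)

# First ten rows of final_df, copied from the printed tibble (rounded values)
final_df <- data.frame(
  Feature.Name = rep(c("A", "B", "C"), times = c(4, 4, 2)),
  set = c("Cluster.1", "Cluster.2", "Cluster.3", "Cluster.4",
          "Cluster.1", "Cluster.2", "Cluster.3", "Cluster.4",
          "Cluster.1", "Cluster.2"),
  Log2.fold.change = c(0.292, 0.793, 1.46, 0.269, 1.95,
                       4.77, 2.19, 1.66, -2.01, 0.352),
  Adjusted.p.value = c(1, 1, 0.758, 1, 0.910, 0.154, 1, 1, 0.0467, 1)
)

# Helper column (hypothetical name): which points get a label
final_df$significant <- final_df$Adjusted.p.value < 0.05

# Color by cluster and give each cluster its own panel
p <- ggplot(final_df,
            aes(x = Log2.fold.change, y = -log10(Adjusted.p.value), colour = set)) +
  geom_point() +
  geom_text(data = subset(final_df, significant),
            aes(label = Feature.Name), vjust = -1, show.legend = FALSE) +
  facet_wrap(~ set)
print(p)
```

Dropping facet_wrap() keeps all clusters in one panel, distinguished by color only.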

Create faceted xy scatters using vectors of column names in R

I have two character vectors of equal length; where position one in vector.x matches position one in vector.y and so on. The elements refer to column names in a data frame (wide format). I would like to somehow loop through these vectors to produce xy scatter graphs for each pair in the vector, preferably in a faceted plot. Here is a (hopefully) reproducible example. To be clear, with this example, I would end up with 10 scatter graphs.
vector.x <- c("Aplanochytrium", "Aplanochytrium", "Aplanochytrium", "Aplanochytrium", "Aplanochytrium", "Bathycoccus", "Brockmanniella", "Brockmanniella", "Caecitellus_paraparvulus", "Caecitellus_paraparvulus")
vector.y <- c("Aliiroseovarius", "Neptuniibacter", "Pseudofulvibacter", "Thalassobius", "unclassified_Porticoccus", "Tenacibaculum", "Pseudomonas", "unclassified_GpIIa", "Marinobacter", "Thalassobius")
structure(list(Aliiroseovarius = c(0, 0, 0, 0.00487132352941176,
0.0108639420589757), Marinobacter = c(0, 0.00219023779724656,
0, 0.00137867647058824, 0.00310398344542162), Neptuniibacter = c(0.00945829750644884,
0.00959532749269921, 0.0171310629514964, 0.2796875, 0.345835488877393
), Pseudofulvibacter = c(0, 0, 0, 0.00284926470588235, 0.00362131401965856
), Pseudomonas = c(0.00466773123694878, 0.00782227784730914,
0.0282765737874097, 0.00707720588235294, 0.00400931195033627),
Tenacibaculum = c(0, 0, 0, 0.00505514705882353, 0.00362131401965856
), Thalassobius = c(0, 0.00166875260742595, 0, 0.0633272058823529,
0.147697878944646), unclassified_GpIIa = c(0, 0.000730079265748853,
0, 0.003125, 0.00103466114847387), unclassified_Porticoccus = c(0,
0, 0, 0.00119485294117647, 0.00569063631660631), Aplanochytrium = c(0,
0, 0, 0.000700770847932726, 0.0315839846865529), Bathycoccus = c(0.000388802488335925,
0, 0, 0.0227750525578136, 0.00526399744775881), Brockmanniella = c(0,
0.00383141762452107, 0, 0.000875963559915907, 0), Caecitellus_paraparvulus = c(0,
0, 0, 0.000875963559915907, 0.00797575370872547)), row.names = c("B11",
"B13", "B22", "DI5", "FF6"), class = "data.frame")
As Rui Barradas shows, it's possible to get a very nice plot from ggplot and gridExtra. If you wanted to stick to base R, here's how you'd do that (assuming your data set is called df1):
# set plot sizes
par(mfcol = c(floor(sqrt(length(vector.x))), ceiling(sqrt(length(vector.x)))))
# loop through plots
for (i in 1:length(vector.x)) {
plot(df1[[vector.x[i]]], df1[[vector.y[i]]], xlab = vector.x[i], ylab = vector.y[i])
}
# reset plot size
par(mfcol = c(1,1))
This is a bit long and convoluted but it works.
library(tidyverse)
library(gridExtra)
df_list <- apply(data.frame(vector.x, vector.y), 1, function(x){
DF <- df1[which(names(df1) %in% x)]
i <- which(names(DF) %in% vector.x)
if(i == 2) DF[2:1] else DF
})
gg_list <- lapply(df_list, function(DF){
ggplot(DF, aes(x = get(names(DF)[1]), y = get(names(DF)[2]))) +
geom_point() +
xlab(label = names(DF)[1]) +
ylab(label = names(DF)[2])
})
g <- do.call(grid.arrange, gg_list)
g
Not too elegant, but should get you going:
vector.x <- c("Aplanochytrium", "Aplanochytrium", "Aplanochytrium", "Aplanochytrium", "Aplanochytrium", "Bathycoccus", "Brockmanniella", "Brockmanniella", "Caecitellus_paraparvulus", "Caecitellus_paraparvulus")
vector.y <- c("Aliiroseovarius", "Neptuniibacter", "Pseudofulvibacter", "Thalassobius", "unclassified_Porticoccus", "Tenacibaculum", "Pseudomonas", "unclassified_GpIIa", "Marinobacter", "Thalassobius")
df1 = structure(
list(Aliiroseovarius = c(0, 0, 0, 0.00487132352941176, 0.0108639420589757),
Marinobacter = c(0, 0.00219023779724656, 0, 0.00137867647058824, 0.00310398344542162),
Neptuniibacter = c(0.00945829750644884, 0.00959532749269921, 0.0171310629514964, 0.2796875, 0.345835488877393),
Pseudofulvibacter = c(0, 0, 0, 0.00284926470588235, 0.00362131401965856),
Pseudomonas = c(0.00466773123694878, 0.00782227784730914, 0.0282765737874097, 0.00707720588235294, 0.00400931195033627),
Tenacibaculum = c(0, 0, 0, 0.00505514705882353, 0.00362131401965856),
Thalassobius = c(0, 0.00166875260742595, 0, 0.0633272058823529, 0.147697878944646),
unclassified_GpIIa = c(0, 0.000730079265748853, 0, 0.003125, 0.00103466114847387),
unclassified_Porticoccus = c(0, 0, 0, 0.00119485294117647, 0.00569063631660631),
Aplanochytrium = c(0, 0, 0, 0.000700770847932726, 0.0315839846865529),
Bathycoccus = c(0.000388802488335925, 0, 0, 0.0227750525578136, 0.00526399744775881),
Brockmanniella = c(0, 0.00383141762452107, 0, 0.000875963559915907, 0),
Caecitellus_paraparvulus = c(0, 0, 0, 0.000875963559915907, 0.00797575370872547)),
row.names = c("B11", "B13", "B22", "DI5", "FF6"),
class = "data.frame"
)
df2 = NULL
for(i in seq_along(vector.x)) {
df.tmp = data.frame(
plot = paste0(vector.x[i], ":", vector.y[i]),
x = df1[[vector.x[i]]],
y = df1[[vector.y[i]]]
)
if(is.null(df2)) df2=df.tmp else df2 = rbind(df2, df.tmp)
}
library(ggplot2)
ggplot(data=df2, aes(x, y)) +
geom_point() +
facet_grid(cols = vars(plot))
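With ten pairs, a single row of facet columns can get cramped, and the pairs have quite different value ranges. One possible variation, sketched here under assumptions, builds the long data frame with lapply() and do.call(rbind, ...) and lets facet_wrap() arrange the panels with free scales. To stay short it uses only three of the question's pairs and the five columns they need (values copied from df1); `pairs` and `p` are illustrative names.

```r
library(ggplot2)

# Three of the question's pairs and the columns they refer to
vector.x <- c("Aplanochytrium", "Aplanochytrium", "Bathycoccus")
vector.y <- c("Aliiroseovarius", "Thalassobius", "Tenacibaculum")
df1 <- data.frame(
  Aliiroseovarius = c(0, 0, 0, 0.00487132352941176, 0.0108639420589757),
  Tenacibaculum   = c(0, 0, 0, 0.00505514705882353, 0.00362131401965856),
  Thalassobius    = c(0, 0.00166875260742595, 0, 0.0633272058823529, 0.147697878944646),
  Aplanochytrium  = c(0, 0, 0, 0.000700770847932726, 0.0315839846865529),
  Bathycoccus     = c(0.000388802488335925, 0, 0, 0.0227750525578136, 0.00526399744775881),
  row.names = c("B11", "B13", "B22", "DI5", "FF6")
)

# One small data frame per x/y pair, stacked into a single long data frame
pairs <- lapply(seq_along(vector.x), function(i) {
  data.frame(
    plot = paste0(vector.x[i], ":", vector.y[i]),
    x = df1[[vector.x[i]]],
    y = df1[[vector.y[i]]]
  )
})
df2 <- do.call(rbind, pairs)

# facet_wrap() lays the panels out in a grid; free scales suit differing ranges
p <- ggplot(df2, aes(x, y)) +
  geom_point() +
  facet_wrap(~ plot, scales = "free")
print(p)
```

The same code should work unchanged with the full ten-element vectors and the complete df1.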

Dealing with ties in agricolae Kruskal test, R

I am running a kruskal.test on some non-normal data with the agricolae package. Some groups have exactly the same value as each other. The Kruskal test doesn't handle this well: I receive the error Error in if (s) { : missing value where TRUE/FALSE needed. At first, I thought this was because all the values were 0, but when I make them all the same large number (to test), the same error appears. The function then stops (I am running it in a loop) and doesn't evaluate anything beyond the first tied variable.
Obviously there is no point running stats on these groups as there will be no difference, but I am using the information generated by agricolae::kruskal to produce a summary table and I need these variables included. I would prefer to keep using this package as it gives me a lot of valuable information. Is there anything I can do to make it run through the tied variables?
dput(example)
structure(list(TREATMENT = c("A", "A", "A", "B", "B", "C", "C",
"C", "D", "D"), W = c(0, 1.6941524646937, 1.524431531984, 0.959282869723864,
1.45273122733115, 0, 1.57479386520925, 0.421759202661462, 1.34235435984449,
1.52131484305823), X = c(0, 0.663872820198758, 0.202935807030853,
0.836223346381214, 0.750767193777965, 1.18128574225979, 2.03622986392828,
3.56466682539425, 0.919751117364462, 0.917347336682722), Y = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), Z = c(2.1477548118197, 2.0111754022729,
3.14642815196242, 4.46967452127494, 1.53715421615569, 2.36274861406182,
2.33262528044302, 2.50970456594739, 2.96088598025103, 2.22841740590261
)), class = "data.frame", row.names = c(NA, 10L), .Names = c("TREATMENT",
"W", "X", "Y", "Z"))
library(agricolae)
example<-as.data.frame(example)
for(i in 2:(ncol(example))){
krusk <- kruskal(example[,i],TREATMENT,group=TRUE)
print(krusk)
}
# Only run the test on columns with non-zero variance;
# constant columns such as Y are skipped
for(i in 2:(ncol(example))){
if(var(example[,i]) > 0){
krusk <- kruskal(example[,i],example$TREATMENT,group=TRUE)
print(krusk)
}
}
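The variance check skips constant columns entirely, whereas the question asks for them to still appear in the summary table. One way to keep them is to record a placeholder row for every skipped variable. The sketch below is an assumption-laden illustration: to stay self-contained it uses stats::kruskal.test from base R as a stand-in for agricolae::kruskal, and the table layout (`variable`, `statistic`, `p.value`, `note`) is made up for the example.

```r
example <- data.frame(
  TREATMENT = c("A", "A", "A", "B", "B", "C", "C", "C", "D", "D"),
  W = c(0, 1.6941524646937, 1.524431531984, 0.959282869723864,
        1.45273122733115, 0, 1.57479386520925, 0.421759202661462,
        1.34235435984449, 1.52131484305823),
  X = c(0, 0.663872820198758, 0.202935807030853, 0.836223346381214,
        0.750767193777965, 1.18128574225979, 2.03622986392828,
        3.56466682539425, 0.919751117364462, 0.917347336682722),
  Y = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  Z = c(2.1477548118197, 2.0111754022729, 3.14642815196242,
        4.46967452127494, 1.53715421615569, 2.36274861406182,
        2.33262528044302, 2.50970456594739, 2.96088598025103,
        2.22841740590261)
)

vars <- setdiff(names(example), "TREATMENT")

rows <- lapply(vars, function(v) {
  x <- example[[v]]
  if (var(x) == 0) {
    # Constant column: no test is run, but the variable keeps its row
    data.frame(variable = v, statistic = NA_real_, p.value = NA_real_,
               note = "constant, test skipped")
  } else {
    # Stand-in for agricolae::kruskal (assumption made for this sketch)
    kt <- kruskal.test(x, factor(example$TREATMENT))
    data.frame(variable = v, statistic = unname(kt$statistic),
               p.value = kt$p.value, note = "")
  }
})
summary_tab <- do.call(rbind, rows)
print(summary_tab)
```

With agricolae, the else branch would call kruskal() and pull whichever fields the summary table needs from its result; the placeholder branch stays the same.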

How to use radarchart {fmsb} Drawing radar chart (a.k.a. spider plot)

I'm using radarchart from the library "fmsb". My problem is that I don't understand what form the input has to take (see the "df" argument, of type data frame, in the radarchart documentation).
My code:
dat2 <- data.frame(c(0.6, 0.4, 0.5), c(0.5, 0.3, 0.4), c(0.4, 0.2, 0.3), c(0.7, 0.5, 0.6), c(0.9, 0.7, 0.8))
colnames(dat2) <- c("A", "B", "C", "D", "E")
radarchart(dat2, axistype=1, seg=5, plty=1, vlabels=c("A", "B", "C", "D", "E"), vlcex=1, title="(PAKR)")
My goal is to get a line connecting the points A:0.5, B:0.4, C:0.3, D:0.6, and E:0.8. Something like specifying only a single vector, as in LaTeX with tkz-kiviat, where one vector is all you need to draw a spider chart.
Thank you.
One solution is to use this data:
dat2 <- data.frame(c(100, -10, 50), c(100, -10, 40), c(100, -10, 30), c(100, -10, 60), c(100, -10, 80))
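The reason this layout works is that radarchart(), with its default maxmin = TRUE, reads the first row of the data frame as the maximum of each axis, the second row as the minimum, and only the remaining rows as data to draw. In the question's dat2 the first two rows were therefore taken as axis limits rather than as series. A minimal sketch on the question's original 0 to 1 scale, where the limits 1 and 0 are an assumption chosen for illustration:

```r
library(fmsb)

# Row 1 = axis maximum, row 2 = axis minimum, row 3 = the values to draw
dat2 <- data.frame(
  A = c(1, 0, 0.5),
  B = c(1, 0, 0.4),
  C = c(1, 0, 0.3),
  D = c(1, 0, 0.6),
  E = c(1, 0, 0.8)
)

# Same arguments as in the question; vlabels is dropped because the
# column names already supply the labels
radarchart(dat2, axistype = 1, seg = 5, plty = 1, vlcex = 1, title = "(PAKR)")
```

Further series can be added as extra rows below the third one.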

ggplot, facets, and changing color in a series

I've got sensor data that looks like this:
tm <- seq(1,17)
chg <- c(13.6,13.7,13.8,13.9,14.1,14.2,14.4,14.6,14.7,14.9,14.9,15.0,15.0,13.7,13.7,13.6,13.7)
batt_A <- c( 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 0, 0)
batt_B <- c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
bus <- c(12.4,12.5,12.4,11.7,11.6,12.2,12.4,11.8,11.7,11.5,12.1,12.0,11.6,11.5,11.4,12.6,12.5)
pwr <- data.frame(tm,batt_A,batt_B,chg,bus)
I want to create two line graphs (chg and bus against tm) in separate facet panels. The twist is that I want also to have each line colored to represent which battery it's tracking. So if batt_A>0, it's charging and I want the charge line to be green; and if batt_A==0, it's on the bus, and I want the bus line to be green. Same for batt_B, except the lines would be blue (or whatever color).
I get the melt + facet combination, but how to add the coloring?
(ps: I'm using facets because there are 6 more sensors varying on the same timescale and I want to watch them all)
With Andrie's answer below, I got to this solution, but the recode is horrible:
mpwr <- melt(pwr, id.vars=1:3)
mpwr$batt <- ''
mpwr$batt <- ifelse(mpwr$batt_A>0 & mpwr$variable=="chg", "A", mpwr$batt)
mpwr$batt <- ifelse(mpwr$batt_B>0 & mpwr$variable=="chg", "B", mpwr$batt)
mpwr$batt <- ifelse(mpwr$batt_A==0 & mpwr$variable=="bus", "A", mpwr$batt)
mpwr$batt <- ifelse(mpwr$batt_B==0 & mpwr$variable=="bus", "B", mpwr$batt)
mpwr$batt <- as.factor(mpwr$batt)
ggplot(mpwr, aes(x=tm, group=1)) +
geom_line(aes(y=value, colour=batt)) +
facet_grid(~variable) +
scale_colour_discrete("Charging")
The data processing could be cleaned up, but I think I'm there!
Something like the following:
library(reshape2)
library(ggplot2)
mpwr <- melt(pwr, id.vars=1:3)
ggplot(mpwr, aes(x=tm, group=1)) +
geom_line(aes(y=value, colour=factor(batt_A!=0))) +
geom_line(aes(y=value, colour=factor(batt_B!=0))) +
facet_grid(~variable) +
scale_colour_discrete("Charging")
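The chained ifelse() recode in the question's update can be condensed by computing the battery for each series once and then stacking the two series. This is a sketch, not the original author's method: it builds the long data frame in base R instead of melt(), keeps the same precedence as the chained calls (a battery B match overrides A), and the names `chg_batt`, `bus_batt` and the legend title "Battery" are choices made here.

```r
library(ggplot2)

tm <- seq(1, 17)
chg <- c(13.6, 13.7, 13.8, 13.9, 14.1, 14.2, 14.4, 14.6, 14.7,
         14.9, 14.9, 15.0, 15.0, 13.7, 13.7, 13.6, 13.7)
batt_A <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 0, 0)
batt_B <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
bus <- c(12.4, 12.5, 12.4, 11.7, 11.6, 12.2, 12.4, 11.8, 11.7,
         11.5, 12.1, 12.0, 11.6, 11.5, 11.4, 12.6, 12.5)

# Battery tracked by each line at each time step; B takes precedence over A,
# as in the chained ifelse() calls of the question's update
chg_batt <- ifelse(batt_B > 0, "B", ifelse(batt_A > 0, "A", ""))
bus_batt <- ifelse(batt_B == 0, "B", ifelse(batt_A == 0, "A", ""))

# Long format: the chg series followed by the bus series
mpwr <- data.frame(
  tm = rep(tm, times = 2),
  variable = factor(rep(c("chg", "bus"), each = length(tm)),
                    levels = c("chg", "bus")),
  value = c(chg, bus),
  batt = factor(c(chg_batt, bus_batt))
)

p <- ggplot(mpwr, aes(x = tm, y = value, colour = batt, group = 1)) +
  geom_line() +
  facet_grid(~ variable) +
  scale_colour_discrete("Battery")
print(p)
```

Additional sensors would add more blocks to the long data frame, each with its own rule for the batt column.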
