Boxplot Integration three information levels - r

I have a question on how to plot my data using a boxplot and integrating 3 different information types. In particular, I have a data frame that looks like this:
Exp_number Condition Cell_Type Gene1 Gene2 Gene3
1 2 Cancer 0.33 0.2 1.2
1 2 Cancer 0.12 1.12 2.5
1 4 Fibro 3.4 2.2 0.8
2 4 Cancer 0.12 0.4 0.11
2 4 Normal 0.001 0.01 0.001
3 1 Cancer 0.22 1.2 3.2
2 1 Normal 0.001 0.00003 0.00045
for a total of 20,000 columns and 110 rows (rows are samples).
I would like to plot a boxplot in which data are grouped first by a condition. Then, in each condition, I would like to highlight, for example using different colors, the exp_number and finally, I don't know how but I would like to highlight the cell type. The aim is to highlight the differences between exp_number between conditions in terms of gene expression and also differences of cell types between Exp_numbers.
Is there a simple way to integrate all this information in a single plot?
Thank you in advance

What about this approach
dat <- data.frame(Exp_number=factor(sample(1:3,100,replace = T)),
condition=factor(sample(1:4,100,T)),
Cell_type=factor(sample(c("Normal", "Cancer", "Fibro"), 100, replace=T)),
Gene1=abs(rnorm(100, 5, 1)),
Gene2=abs(rnorm(100, 6, 0.5)),
Gene3=abs(rnorm(100, 4, 3)))
library(reshape2)
library(ggplot2)
dat2 <- melt(dat, id=c("Exp_number", "condition", "Cell_type"))
ggplot(dat2, aes(x=Exp_number, y=value, col=Cell_type)) +
geom_boxplot() +
facet_grid(~ condition) +
theme_bw() +
ylab("Expression")
That gives the following result

Similar to @storaged's answer, but leveraging the two dimensions of facet_grid to represent 2 of your variables:
ggplot(dat2, aes(x=Cell_type, y=Expression)) +
geom_boxplot() +
facet_grid(Exp_number ~ condition) +
theme_bw()
The data:
library(reshape2)
library(ggplot2)
dat <- data.frame(Exp_number=factor(sample(1:3,100,replace = T)),
condition=factor(sample(1:4,100,T)),
Cell_type=factor(sample(c("Normal", "Cancer", "Fibro"), 100, replace=T)),
Gene1=abs(rnorm(100, 5, 1)),
Gene2=abs(rnorm(100, 6, 0.5)),
Gene3=abs(rnorm(100, 4, 3)))
dat2 <- melt(dat, id=c("Exp_number", "condition", "Cell_type"), value.name = 'Expression')
dat2$Exp_number <- paste('Exp.', dat2$Exp_number)
dat2$condition <- paste('Condition', dat2$condition)
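If you want all three variables in one figure, here is a rough sketch (not tested on your real data) that puts condition on the x axis, uses the fill colour for Exp_number and uses facets for the cell type and the gene. The data are simulated as in the answers above; the column name Gene and the object name dat_long are my own choices, not something from the question.

```r
library(ggplot2)
library(reshape2)

set.seed(1)  # simulated data, seeded only so the sketch is reproducible
dat <- data.frame(
  Exp_number = factor(sample(1:3, 100, replace = TRUE)),
  condition  = factor(sample(1:4, 100, replace = TRUE)),
  Cell_type  = factor(sample(c("Normal", "Cancer", "Fibro"), 100, replace = TRUE)),
  Gene1 = abs(rnorm(100, 5, 1)),
  Gene2 = abs(rnorm(100, 6, 0.5)),
  Gene3 = abs(rnorm(100, 4, 3))
)

# Long format: one row per sample and gene
dat_long <- melt(dat, id.vars = c("Exp_number", "condition", "Cell_type"),
                 variable.name = "Gene", value.name = "Expression")

# condition on x, Exp_number as fill, genes as facet rows, cell types as facet columns
p <- ggplot(dat_long, aes(x = condition, y = Expression, fill = Exp_number)) +
  geom_boxplot() +
  facet_grid(Gene ~ Cell_type, scales = "free_y") +
  theme_bw()
```

Calling print(p) draws the figure. With 20,000 genes you would of course have to pick a handful of genes before faceting by them.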


How can I bind_rows of two random sets and then plot them as a histogram in dplyr?

I'm having a hard time binding rows of two random samples of 500 each to get one file with 1000 rows.
Then I'm trying to plot a histogram of this combined sample and geom_density().
For my bind_rows line, the error I get is
"Argument 1 must have names"
Does anyone have an idea what is wrong? Thank you,
x <- 1:500
rand1 <- rnorm(length(x), -1, 0.6)
rand2 <- rnorm(length(x), 1, 1.2)
combined <- bind_rows(rand1, rand2)
ggplot(combined, aes(x=y, y=..density..)) +
geom_histogram(fill = "red", alpha = 0.5, color="darkred") +
geom_density()
Change the offending line to:
combined <- data.frame(y= c(rand1, rand2))
Two issues prevented the original code from completing the task: a) the arguments passed to bind_rows had no names, and b) they were not packaged in a form that could be coerced to a data frame. combined could also have been a named list.
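As a small sketch of the same idea with bind_rows itself: if each vector is wrapped in a data frame with the same column name and the data frames are passed as a named list, the .id argument of bind_rows stores the list names in an extra column, so the origin of each row is kept. The column names y and sample are just choices made for this example.

```r
library(dplyr)

set.seed(42)  # seeded only so the sketch is reproducible
rand1 <- rnorm(500, -1, 0.6)
rand2 <- rnorm(500, 1, 1.2)

# Same column name in both data frames; the list names end up in the "sample" column
combined <- bind_rows(
  list(rand1 = data.frame(y = rand1), rand2 = data.frame(y = rand2)),
  .id = "sample"
)
```

The result has 1000 rows and two columns, sample and y, and can be passed straight to ggplot.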
To bind your two samples, you first need to convert them to data frames, and their column names need to match.
Something like this should work:
library(dplyr)
combined <- bind_rows(data.frame(x =rand1),
data.frame(x =rand2))
x
1 -1.1979747
2 -0.7819008
3 -2.0965976
4 -0.4637334
5 -1.4314750
6 -0.4356943
However, you can't differentiate rand1 and rand2 anymore.
So, an alternative solution is to start by binding your two random samples as columns and then pivot the dataframe into a longer format using pivot_longer from tidyr package:
df <- data.frame(rand1, rand2)
library(tidyverse)
df <- df %>% pivot_longer(everything(), names_to = "rand", values_to = "value")
rand value
<chr> <dbl>
1 rand1 -1.20
2 rand2 2.45
3 rand1 -0.782
4 rand2 1.35
5 rand1 -2.10
6 rand2 1.98
7 rand1 -0.464
8 rand2 0.733
9 rand1 -1.43
10 rand2 2.72
# … with 990 more rows
For plotting histogram and density, I used stat(ndensity) and ..scaled.. in order to set both random samples to be scaled up to 1:
library(ggplot2)
ggplot(df, aes(x = value, fill = rand))+
geom_density(aes(y = ..scaled..), alpha = 0.4)+
geom_histogram(aes(x = value, stat(ndensity)), color = "black", alpha =0.2)

Box plot with bands with standard deviation and range

I am new to R and ggplot2. Any help is much appreciated! I have here a data set, I am trying to graph
weight_band mean_1 mean_2 SD_1 SD_2 min_1 min_2 max_1 max_2
1 5 . 3 . 0.17 . 27 .
2 6 . 3.7 . 1.1 . 23 .
3 8 8 4.3 4.1 1 1.749 27 27
4 8 9 3.3 6 2.3 1.402 13 42
With this data set, I am trying to plot a bar graph of mean_1 and mean_2 side by side for each weight_band (1-4), with error bars running from the matching min (min_1 and min_2) to the matching max (max_1 and max_2). A "." denotes missing data.
I have browsed through Stack Overflow and other websites, but haven't found the solution I am looking for.
the code I have is as follows:
sk1 <- read.csv(file="analysis.csv")
library(reshape2)
sk2 <- melt(sk1,id.vars = "Weight_band")
c <- ggplot(sk2, aes(x = Weight_band, y = value, fill = variable))
c + geom_bar(stat = "identity", position="dodge")
However, this method does not limit the graph to plotting only the means. Is there a way to do that? Furthermore, is there a way to apply min and max as error bars to their respective means? I thank everyone in advance. This would help me greatly in advancing my understanding of R's ggplot2 functions.
This should get you close, we need to do a little more data cleaning and reshaping to make ggplot happy :)
library(reshape2)
library(ggplot2)
df <- read.table(text = "weight_band mean_1 mean_2 SD_1 SD_2 min_1 min_2 max_1 max_2
1 5 . 3 . 0.17 . 27 .
2 6 . 3.7 . 1.1 . 23 .
3 8 8 4.3 4.1 1 1.749 27 27
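4 8 9 3.3 6 2.3 1.402 13 42", header = TRUE, na.strings = ".")
# na.strings = "." reads the "." placeholders as NA, so the columns stay numeric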
sk2 <- melt(df,id.vars = "weight_band")
## Clean
sk2$group <- gsub(".*_(\\d)", "\\1", sk2$variable)
# new column used for color or fill aes and position dodging
sk2$variable <- gsub("_.*", "", sk2$variable)
# make these variables universal not group specific
## Reshape again
sk3 <- dcast(sk2, weight_band + group ~ variable)
# spread it back to kinda wide format
sk3 <- dplyr::mutate_if(sk3, is.character, as.numeric)
# convert every column to numeric if character now
# plot values seem a little wonky but the plot is coming together
ggplot(sk3, aes(x = as.factor(weight_band), y = mean, fill = as.factor(group))) +
geom_bar(position = "dodge", stat = "identity") +
geom_errorbar(aes(ymax = max, ymin = min), position = "dodge")
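As an alternative sketch of the reshaping step, tidyr's pivot_longer can split names like mean_1 into a statistic and a group in a single call, using the special ".value" entry in names_to. This swaps tidyr in for the melt/dcast round trip above; the object name long is made up for this example.

```r
library(tidyr)
library(ggplot2)

df <- read.table(text = "weight_band mean_1 mean_2 SD_1 SD_2 min_1 min_2 max_1 max_2
1 5 . 3 . 0.17 . 27 .
2 6 . 3.7 . 1.1 . 23 .
3 8 8 4.3 4.1 1 1.749 27 27
4 8 9 3.3 6 2.3 1.402 13 42", header = TRUE, na.strings = ".")

# "mean_1" -> column "mean" for group "1"; one row per weight_band and group
long <- pivot_longer(df, cols = -weight_band,
                     names_to = c(".value", "group"), names_sep = "_")

# Same dodge width for bars and error bars so they line up
p <- ggplot(long, aes(x = factor(weight_band), y = mean, fill = group)) +
  geom_col(position = position_dodge(0.9), na.rm = TRUE) +
  geom_errorbar(aes(ymin = min, ymax = max),
                position = position_dodge(0.9), width = 0.3, na.rm = TRUE)
```

The long table has the columns weight_band, group, mean, SD, min and max, with NA where the original had a ".".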

How to generate summary information and error bars in R

I have a set of data:
COL1 COL2
1 3.45
2 8.48
1 2.53
2 9.42
2 2.56
etc.
COL1 specifies a category, whereas COL2 is data. I'd like to, for each distinct value in COL1 generate mean, stddev, min & max values. So in the end have something like (not real numbers):
COL1VAL MEAN STDDEV
1 4.59 1.24
2 4.75 1.20
I'd also then like to generate a bar chart with error bars, with X axis being the COL1VAL and bar height being the mean.
Can one do this in R, and if so, how?
Here's how you could do those things using packages dplyr and ggplot2, assuming your data frame is called df.
library(dplyr)
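# summarise_each() and funs() are deprecated in current dplyr, so name each statistic explicitly
dfsummary <- df %>%
group_by(COL1) %>%
summarise(mean = mean(COL2), sd = sd(COL2), min = min(COL2), max = max(COL2))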
dfsummary
#Source: local data frame [2 x 5]
#
# COL1 mean sd min max
#1 1 2.99 0.6505382 2.53 3.45
#2 2 6.82 3.7190859 2.56 9.42
library(ggplot2)
ggplot(dfsummary, aes(x = factor(COL1), y = mean)) +
geom_bar(stat = "identity", fill = "lightblue") +
geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd))
If you prefer to stay in base R, you could use tapply and arrows:
head(chickwts, 15) # chicken growth depending on food#
means <- tapply(X=chickwts$weight, INDEX=chickwts$feed, FUN=mean)
sds <- tapply(X=chickwts$weight, INDEX=chickwts$feed, FUN=sd )
or <- order(means)
bp <- barplot(means[or], ylim=c(0, 390), las=2)
arrows(x0=bp, y0=(means+sds)[or], y1=(means-sds)[or],
code=3, angle=90, length=0.1)
Regards,
Berry
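Another option, sketched here under the assumption of a reasonably recent ggplot2 (the fun argument of stat_summary was added in version 3.3.0), is to let ggplot2 compute the summaries while plotting. The helper mean_sd is a hypothetical function written for this example, not part of any package; the data frame holds the five sample rows from the question.

```r
library(ggplot2)

df <- data.frame(COL1 = c(1, 2, 1, 2, 2),
                 COL2 = c(3.45, 8.48, 2.53, 9.42, 2.56))

# Hypothetical helper: returns the y / ymin / ymax columns that
# stat_summary(fun.data = ...) expects, here mean plus/minus one sd
mean_sd <- function(v) {
  data.frame(y = mean(v), ymin = mean(v) - sd(v), ymax = mean(v) + sd(v))
}

p <- ggplot(df, aes(x = factor(COL1), y = COL2)) +
  stat_summary(fun = mean, geom = "bar", fill = "lightblue") +
  stat_summary(fun.data = mean_sd, geom = "errorbar", width = 0.2)
```

This avoids the separate summary table, but you lose the table itself, so it mainly makes sense when only the plot is needed.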

R binning dataset and surface plot

I have a large data set that I am trying to discretise and create a 3d surface plot with:
rowColFoVCell wpbCount Feret
1 001001001001 1 0.58
2 001001001001 1 1.30
3 001001001001 1 0.58
4 001001001001 1 0.23
5 001001001001 2 0.23
6 001001001001 2 0.58
There are currently 695302 rows in this data set. I am trying to discretise the third 'Feret' column based on the second column, so for each 'wpbCount' bin the 'Feret' column.
I think the solution will involve using cut but I am not sure how to go about this. I would like to end up with a data frame something like this:
wpbCount Feret Count
1 1 [0.0,0.2] 3
2 1 [0.2,0.4] 5
3 1 [0.4,0.6] 6
4 1 [0.8,0.8] 9
5 2 [0.0,0.2] 6
6 2 [0.4,0.6] 23
This is to answer the first part:
Create Some data
DF <- data.frame(wpbCount = sample(1:1000, 1000),
Feret = sample(seq(0, 1, 0.001), 1000))
1) Discretize
Use cut with right = FALSE so the intervals are closed on the left and open on the right, i.e. [a, b).
I normally find this more useful than the default.
DF$cut_it <- cut(DF$Feret, right = FALSE,
breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1))
2) Aggregate
TABLE <- data.frame(table(DF$cut_it))
EDIT Another attempt
library(data.table)
DT <- data.table(DF)
DT <- DT[, list(wpbCount = length(wpbCount),
Feret = length(Feret)
), by=cut_it]
Perhaps you are just trying to discretize and not aggregate.
Try this:
DF2 <- data.frame(wpbCount = sample(1:3, 1000, replace=T),
Feret = sample(seq(0, 1, 0.001), 1000))
DF2$Feret2 <- cut(DF2$Feret, right = FALSE,
breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.1))
DF2 <- DF2[, c(1, 3)]
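To go from the discretized column to the count table asked for in the question, a rough base R sketch is to cross-tabulate wpbCount against the bins. The data are simulated as above; the names bin and counts are made up for this example. Note include.lowest = TRUE: together with right = FALSE it closes the last interval, so a Feret value of exactly 1 is not turned into NA.

```r
set.seed(1)  # simulated data, seeded only so the sketch is reproducible
DF <- data.frame(wpbCount = sample(1:3, 1000, replace = TRUE),
                 Feret = sample(seq(0, 1, 0.001), 1000))

# Left-closed bins [0,0.2), [0.2,0.4), ..., with the last one closed: [0.8,1]
DF$bin <- cut(DF$Feret, breaks = seq(0, 1, by = 0.2),
              right = FALSE, include.lowest = TRUE)

# One row per (wpbCount, bin) combination with its frequency
counts <- as.data.frame(table(wpbCount = DF$wpbCount, Feret = DF$bin),
                        responseName = "Count")
```

Unlike aggregate, table also keeps combinations that never occur, with a Count of 0.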
Thanks very much for your help I used the following functions in R:
x$bin <- cut(x$Feret, right = FALSE, breaks = seq(0,max(wpbFeatures$Feret), by=0.1))
y <-aggregate(x$bin, by = x[c('wpbCount', 'bin')], length)
From your suggestions I have been able to get the data frame that I require:
wpbCount | bin | x
1 [0.2,0.3) 72
2 [0.2,0.3) 142
3 [0.2,0.3) 224
4 [0.2,0.3) 299
5 [0.2,0.3) 421
6 [0.2,0.3) 479
Now I need to plot this in 3D and I am not sure how to do so with a non-numerical column i.e. the bin column which is factors.
Does anyone know how I can plot these three columns against each other?
Check out this link.
There are some 3d plots. However, 3d plots aren't the greatest tool to analyze data.
If you insist on the 3d approach, try stat_contour()
from the ggplot2 package.
However, a probably better approach is to do a few plots in 2d, or use facet_grid().
Also take a look at the current ggplot2 documentation.
Try this based on your last answer (not tested):
ggplot(DF, aes(wpbCount , x)) +
geom_point() +
facet_grid(. ~ bin)
The idea is to use the factor variable (in this case, bin) to facet the plot.
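Another 2d option for three columns where one is a factor is a heatmap: wpbCount on one axis, the bin on the other, and the count as fill colour. A minimal sketch with simulated data follows; the count column is called n here rather than x, and the data frame names simply mirror the ones used in the question.

```r
library(ggplot2)

set.seed(1)  # simulated stand-in for the real data
x <- data.frame(wpbCount = sample(1:6, 2000, replace = TRUE),
                Feret = runif(2000, 0, 1))
x$bin <- cut(x$Feret, breaks = seq(0, 1, by = 0.1),
             right = FALSE, include.lowest = TRUE)

# Number of rows per (wpbCount, bin) combination
y <- aggregate(list(n = x$Feret), by = x[c("wpbCount", "bin")], FUN = length)

# One tile per combination, coloured by the count
p <- ggplot(y, aes(x = factor(wpbCount), y = bin, fill = n)) +
  geom_tile() +
  labs(x = "wpbCount", y = "Feret bin", fill = "Count")
```

Because the bins are an ordered factor, they appear on the y axis in their natural order without any conversion to numbers.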

how to put percentage label in ggplot when geom_text is not suitable?

Here is my simplified data :
company <-c(rep(c(rep("company1",4),rep("company2",4),rep("company3",4)),3))
product<-c(rep(c(rep(c("product1","product2","product3","product4"),3)),3))
week<-c( c(rep("w1",12),rep("w2",12),rep("w3",12)))
mydata<-data.frame(company=company,product=product,week=week)
mydata$rank<-c(rep(c(1,3,2,3,2,1,3,2,3,2,1,1),3))
mydata=mydata[mydata$company=="company1",]
And, R code I used :
ggplot(mydata,aes(x = week,fill = as.factor(rank))) +
geom_bar(position = "fill")+
scale_y_continuous(labels = percent_format())
In the bar plot, I want to label the percentage by week, by rank.
The problem is the fact that the data doesn't have percentage of rank. And the structure of this data is not suitable to having one.
(of course, the original data has much more observations than the example)
Is there anyone who can teach me How I can label the percentage in this graph ?
I'm not sure I understand why geom_text is not suitable. Here is an answer using it, but if you specify why it is not suitable, perhaps someone might come up with the answer you are looking for.
library(ggplot2)
library(plyr)
library(reshape2) # provides melt()
mydata = mydata[,c(3,4)] #drop unnecessary variables
data.m = melt(table(mydata)) #get counts and melt it
#calculate percentage:
m1 = ddply(data.m, .(week), summarize, ratio=value/sum(value))
#order data frame (needed to comply with percentage column):
m2 = data.m[order(data.m$week),]
#combine them:
mydf = data.frame(m2,ratio=m1$ratio)
Which gives us the following data structure. The ratio column contains the relative frequency of given rank within specified week (so one can see that rank == 3 is twice as abundant as the other two).
> mydf
week rank value ratio
1 w1 1 1 0.25
4 w1 2 1 0.25
7 w1 3 2 0.50
2 w2 1 1 0.25
5 w2 2 1 0.25
8 w2 3 2 0.50
3 w3 1 1 0.25
6 w3 2 1 0.25
9 w3 3 2 0.50
Next, we have to calculate the position of the percentage labels and plot it.
#get positions of percentage labels:
mydf = ddply(mydf, .(week), transform, position = cumsum(value) - 0.5*value)
#make plot
p =
ggplot(mydf,aes(x = week, y = value, fill = as.factor(rank))) +
geom_bar(stat = "identity")
#add percentage labels using positions defined previously
p + geom_text(aes(label = sprintf("%1.2f%%", 100*ratio), y = position))
Is this what you wanted?
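In case the hand-computed label positions do not line up with the bar segments (the stacking order differs between ggplot2 versions), here is a sketch that lets ggplot2 place the labels itself with position_stack(vjust = 0.5). It swaps dplyr in for plyr to compute the ratios; the data are the company1 rows from the question, and the object name counts is made up for this example.

```r
library(dplyr)
library(ggplot2)

# Same toy data as in the question, restricted to company1
mydata <- data.frame(
  company = rep(rep(c("company1", "company2", "company3"), each = 4), 3),
  product = rep(paste0("product", 1:4), 9),
  week    = rep(c("w1", "w2", "w3"), each = 12),
  rank    = rep(c(1, 3, 2, 3, 2, 1, 3, 2, 3, 2, 1, 1), 3)
)
mydata <- mydata[mydata$company == "company1", ]

# Count each rank per week, then turn the counts into within-week ratios
counts <- mydata %>%
  count(week, rank) %>%
  group_by(week) %>%
  mutate(ratio = n / sum(n)) %>%
  ungroup()

# group = factor(rank) keeps bars and labels stacked in the same order
p <- ggplot(counts, aes(x = week, y = ratio,
                        fill = factor(rank), group = factor(rank))) +
  geom_col() +
  geom_text(aes(label = sprintf("%1.0f%%", 100 * ratio)),
            position = position_stack(vjust = 0.5)) +
  scale_y_continuous(labels = function(v) paste0(100 * v, "%"))
```

Since the labels use the same stacking as the bars, each one sits in the middle of its own segment, and no position column is needed.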
