Combining multiple functions into one plot (ggplot) - r

I have a 25x6 data frame containing the following observations:
Mkt.RF SMB HML RMW CMA WML
-3.86 1.37 1.14 1.47 -2.35 0.05
1.10 -0.95 -1.60 1.17 -0.33 -2.96
2.44 -1.79 0.39 1.14 -2.31 -1.55
9.10 2.48 0.01 -1.43 -0.12 -7.61
-2.37 2.90 -0.84 0.84 -1.22 1.81
0.54 0.09 0.48 0.30 0.32 0.03
0.72 -0.48 0.40 0.20 -0.12 0.87
-6.09 1.57 1.04 1.05 0.43 1.13
3.43 -1.63 -0.55 1.45 -0.63 3.35
-1.35 0.32 -0.59 1.57 -0.80 3.43
2.90 0.52 0.00 -0.26 0.39 1.56
1.35 -0.22 -1.42 -1.58 0.19 2.25
-5.10 0.77 -1.34 1.21 -0.35 1.06
6.26 -1.91 -2.70 1.89 -1.94 3.01
-2.21 4.04 3.00 -0.07 1.09 0.38
-1.93 2.50 1.88 0.53 1.13 1.26
-5.48 1.04 2.45 0.79 0.61 0.90
-0.11 -1.34 2.59 3.32 2.21 0.10
4.13 0.15 0.66 -1.51 1.13 -0.18
-3.72 0.76 0.92 0.87 0.42 2.96
-0.64 -2.35 -1.31 0.27 0.55 0.94
2.52 -2.70 -1.71 -0.16 0.86 -3.55
-1.41 -0.20 -0.96 0.47 -0.25 2.56
-3.08 -0.45 -0.35 0.23 -2.21 1.55
1.78 -0.19 -1.64 -0.10 -1.17 0.69
I wish to produce two plots: (1) a probability density function, and (2) a cumulative distribution function in ggplot. I would like to have a function for each column, hence there should be 6 pdfs and 6 cdfs. I have produced the following:
Loaddata <- setwd("~/Desktop")
library(ggplot2)
library(plyr)
library(reshape2)
D <- read.table(file = "MyData.csv", header = TRUE, sep =";", dec = ",")
attach(D)
factors <- cbind(D[,2:7])
ggplot(factors, aes(Mkt.RF)) + geom_density() + labs(x = "Return", y = "Distribution", title = "PDF") +
  xlim(-20, 20) + theme(plot.title = element_text(hjust = 0.5))
With this I can produce a plot with a single function (one column of data), but I am having trouble combining all six functions into one plot, so that I can replicate something similar to this:
[Image: PDF functions example]
Thank you in advance!

You can try
library(tidyverse)
# Stack the data twice; gr = "1" feeds the density panel, gr = "2" the ECDF panel
df %>%
  bind_rows(df, .id = "gr") %>%
  gather(key, value, -gr) %>%
  ggplot() +
  geom_density(data = . %>% filter(gr == 1), aes(value, color = key), size = 1.1) +
  stat_ecdf(data = . %>% filter(gr == 2), aes(value, color = key), size = 1.1) +
  # Relabel the two facets as the PDF and CDF panels
  facet_wrap(~gr, labeller = labeller(gr = c("1" = "PD", "2" = "CD")))
The single plots can be created using
df %>%
gather(key, value) %>%
ggplot(aes(value, color=key)) +
geom_density()
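For the CDF on its own, the same pattern should work with stat_ecdf() in place of geom_density(). Below is a minimal sketch of that idea. The two-column df is a hypothetical stand-in built from the first four rows of the question's data, and base R's stack() is used instead of gather() only to keep the example self-contained.

```r
library(ggplot2)

# Hypothetical stand-in for the factor data (first four rows, two columns).
df <- data.frame(
  Mkt.RF = c(-3.86, 1.10, 2.44, 9.10),
  SMB    = c(1.37, -0.95, -1.79, 2.48)
)

# stack() is base R: it returns a long data frame with the columns
# `values` (the numbers) and `ind` (the name of the original column).
long <- stack(df)

# One empirical CDF per original column, coloured by column name.
p_cdf <- ggplot(long, aes(values, color = ind)) +
  stat_ecdf() +
  labs(x = "Return", y = "Cumulative probability", color = "Factor")
p_cdf
```

With the full data frame, all six columns end up as six coloured step curves in one panel.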

Here is a reproducible example using mtcars, plotting all the distributions on top of each other:
library(tidyverse)
mtcars %>%
gather(Variable, Value) %>%
ggplot(aes(x=Value, color=Variable)) +
geom_density(alpha=0)

Related

Show different data in top and bottom of R circlize

I have 2 dataframes with different number of rows and columns, and I'd like to show both of them in a circos plot with circlize.
My data looks like this:
df1=data.frame(replicate(7,sample(-200:200,200,rep=TRUE))/100)
df2=data.frame(replicate(2,sample(-200:200,200,rep=TRUE))/100)
#head(df1)
X1 X2 X3 X4 X5 X6 X7
1 -0.03 0.63 -0.33 0.73 -1.37 -1.39 1.96
2 -1.81 -1.24 -1.63 1.58 0.13 1.39 -0.76
3 0.02 -2.00 -1.93 -1.35 1.06 -0.58 -0.77
4 -1.11 -1.38 -0.66 -0.40 1.69 -0.47 -1.55
5 0.98 0.06 0.00 -0.35 1.97 1.74 0.72
6 1.51 -1.68 -0.44 -1.74 0.15 0.26 0.36
#head(df2)
X1 X2
1 0.16 -0.81
2 -1.38 -0.16
3 -0.22 -0.74
4 0.73 -0.82
5 0.58 -1.87
6 -0.63 1.50
I want to build a single circos plot where the top is showing df1 and bottom is showing df2, but I can only show individual dfs. For instance, this is how I show df1:
col_fun1=colorRamp2(c(min(df1), 0, max(df1)), c("blue", "white", "red"))
circos.heatmap(df1, col = col_fun1, cluster = T, track.height = 0.2, rownames.side = "outside", rownames.cex = 0.6)
circos.clear()
How can I show df1 only in the top half, and df2 only in the bottom half?

putting multiple plots in one panel

I am trying to plot a scatter plot in R using ggscatter function from ggpubr package. I am showing you a subset of my data.frame
tracking_id gene_short_name B1 B2 C1 C2
ENSG00000000003.14 TSPAN6 1.2 1.16 1.22 1.26
ENSG00000000419.12 DPM1 1.87 1.87 1.68 1.83
ENSG00000000457.13 SCYL3 0.59 0.63 0.82 0.69
ENSG00000000460.16 C1orf112 0.87 0.99 0.97 0.83
ENSG00000001036.13 FUCA2 1.59 1.59 1.4 1.39
ENSG00000001084.10 GCLC 1.43 1.55 1.46 1.32
ENSG00000001167.14 NFYA 1.2 1.3 1.39 1.21
ENSG00000001460.17 STPG1 0.43 0.46 0.34 0.76
ENSG00000001461.16 NIPAL3 0.72 0.84 0.78 0.74
I want to make scatter plots of B1 vs B1, B1 vs B2, B1 vs C1, B2 vs C2.
I used the following command:
df <- read.table(file="transformation.txt",header= TRUE,sep = "\t")
lapply(3:6, function(X) ggscatter(df, x = "B1", y = colnames(df[X]), add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",add.params = list(color="blue")))
I get 4 individual plots. I want to have all 4 plots in 1 plot. How can I do this?
Thanks
Do you perhaps mean something like this?
library(GGally)
ggpairs(df[, -(1:2)])
GGally is a very nice R package offering a lot of customisation options for its plotting routines.
Sample data
df <- read.table(text =
"tracking_id gene_short_name B1 B2 C1 C2
ENSG00000000003.14 TSPAN6 1.2 1.16 1.22 1.26
ENSG00000000419.12 DPM1 1.87 1.87 1.68 1.83
ENSG00000000457.13 SCYL3 0.59 0.63 0.82 0.69
ENSG00000000460.16 C1orf112 0.87 0.99 0.97 0.83
ENSG00000001036.13 FUCA2 1.59 1.59 1.4 1.39
ENSG00000001084.10 GCLC 1.43 1.55 1.46 1.32
ENSG00000001167.14 NFYA 1.2 1.3 1.39 1.21
ENSG00000001460.17 STPG1 0.43 0.46 0.34 0.76
ENSG00000001461.16 NIPAL3 0.72 0.84 0.78 0.74", header = T)
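If you would rather keep the four ggscatter plots from the question than switch to a pairs plot, one option is to collect them in a list and hand that list to ggarrange() from ggpubr. This is only a sketch: the data frame below retypes the B1, B2, C1 and C2 columns of the sample data, and y_cols and the 2x2 layout are my own choices, not something from the question.

```r
library(ggpubr)

# The four numeric columns of the sample data, typed in directly.
df <- data.frame(
  B1 = c(1.20, 1.87, 0.59, 0.87, 1.59, 1.43, 1.20, 0.43, 0.72),
  B2 = c(1.16, 1.87, 0.63, 0.99, 1.59, 1.55, 1.30, 0.46, 0.84),
  C1 = c(1.22, 1.68, 0.82, 0.97, 1.40, 1.46, 1.39, 0.34, 0.78),
  C2 = c(1.26, 1.83, 0.69, 0.83, 1.39, 1.32, 1.21, 0.76, 0.74)
)

# Columns to plot against B1 (the same four the question's lapply() covers).
y_cols <- c("B1", "B2", "C1", "C2")

# Build one scatter plot per column, using the question's ggscatter() settings.
plots <- lapply(y_cols, function(y) {
  ggscatter(df, x = "B1", y = y, add = "reg.line", conf.int = TRUE,
            cor.coef = TRUE, cor.method = "pearson",
            add.params = list(color = "blue"))
})

# Arrange the four plots on a 2 x 2 grid in a single figure.
panel <- ggarrange(plotlist = plots, ncol = 2, nrow = 2)
panel
```

Changing y_cols (or the x argument) is enough to get a different set of pairs on the same grid.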

How to skip NA when applying geometric-mean function

I have the following data frame:
1 8.03 0.37 0.55 1.03 1.58 2.03 15.08 2.69 1.63 3.84 1.26 1.9692516
2 4.76 0.70 NA 0.12 1.62 3.30 3.24 2.92 0.35 0.49 0.42 NA
3 6.18 3.47 3.00 0.02 0.19 16.70 2.32 69.78 3.72 5.51 1.62 2.4812459
4 1.06 45.22 0.81 1.07 8.30 196.23 0.62 118.51 13.79 22.80 9.77 8.4296220
5 0.15 0.10 0.07 1.52 1.02 0.50 0.91 1.75 0.02 0.20 0.48 0.3094169
7 0.27 0.68 0.09 0.15 0.26 1.54 0.01 0.21 0.04 0.28 0.31 0.1819510
I want to calculate the geometric mean for each row. My code is:
dat <- read.csv("MXreport.csv")
if(any(dat$X18S > 25)){ print("Fail!") } else { print("Pass!")}
datpass <- subset(dat, dat$X18S <= 25)
gene <- datpass[, 42:52]
gm_mean <- function(x){ prod(x)^(1/length(x))}
gene$score <- apply(gene, 1, gm_mean)
head(gene)
I got this output after typing this code:
1 8.03 0.37 0.55 1.03 1.58 2.03 15.08 2.69 1.63 3.84 1.26 1.9692516
2 4.76 0.70 NA 0.12 1.62 3.30 3.24 2.92 0.35 0.49 0.42 NA
3 6.18 3.47 3.00 0.02 0.19 16.70 2.32 69.78 3.72 5.51 1.62 2.4812459
4 1.06 45.22 0.81 1.07 8.30 196.23 0.62 118.51 13.79 22.80 9.77 8.4296220
5 0.15 0.10 0.07 1.52 1.02 0.50 0.91 1.75 0.02 0.20 0.48 0.3094169
7 0.27 0.68 0.09 0.15 0.26 1.54 0.01 0.21 0.04 0.28 0.31 0.1819510
The problem is that I get NA for any row that contains an NA. How do I skip the NA values and still calculate the geometric mean for that row?
When I used gene <- na.exclude(datpass[, 42:52]), the whole row containing the NA was dropped and no geometric mean was calculated for it. That is not what I want: I also want the geometric mean of the remaining values in that row. How do I do this?
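One way to get this, as a sketch: drop the NA values inside the function before averaging, so the root is taken over the number of non-missing values rather than the full row length. The version below uses exp(mean(log(x))), which is the same geometric mean for positive values. The na.rm argument and the small gene data frame are assumptions made for illustration, not the real MXreport.csv columns.

```r
# Geometric mean that ignores missing values by default.
# Assumes the non-missing values are positive.
gm_mean <- function(x, na.rm = TRUE) {
  if (na.rm) {
    x <- x[!is.na(x)]  # keep only the non-missing values
  }
  exp(mean(log(x)))    # equals prod(x)^(1 / length(x)) for positive x
}

# Hypothetical three-column stand-in; the second row contains an NA.
gene <- data.frame(
  a = c(8.03, 4.76),
  b = c(0.37, 0.70),
  c = c(0.55, NA)
)

# Row-wise geometric mean; the second row is computed from its two non-missing values.
gene$score <- apply(gene, 1, gm_mean)
gene
```

Working on the log scale also avoids the overflow or underflow that prod() can hit on long rows.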

A better way to plot lots of lines (in ggplot perhaps)?

Using R 3.0.2, I have a dataframe that looks like
head()
0 5 10 15 30 60 120 180 240
YKL134C 0.08 -0.03 -0.74 -0.92 -0.80 -0.56 -0.54 -0.42 -0.48
YMR056C -0.33 -0.26 -0.56 -0.58 -0.97 -1.47 -1.31 -1.53 -1.55
YBR085W 0.55 3.33 4.11 3.47 2.16 2.19 2.01 2.09 1.55
YJR155W -0.44 -0.92 -0.27 0.75 0.28 0.45 0.45 0.38 0.51
YNL331C 0.42 0.01 -0.05 0.23 0.19 0.43 0.73 0.95 0.86
YOL165C -0.49 -0.46 -0.25 0.03 -0.26 -0.16 -0.12 -0.37 -0.34
Where row.names() are variable names, names() are measurement times, and the values are measurements. It's several thousand rows deep. Let's call it tmp.
I want to do a sanity check of plotting every variable as time versus value as a line-plot on one plot. What's a better way to do it than naively plotting each line with plot() and lines():
timez <- names(tmp)
plot(x=timez, y=tmp[1,], type="l", ylim=c(-5,5))
for (i in 2:length(tmp[,1])) {
lines(x=timez,y=tmp[i,])
}
The crude approach above is good enough, but I'm looking for a way to do this right. I had a concussion recently, so sorry if I'm missing something obvious. I've been doing that a lot.
Could it be something with transposing the data.frame so that each timepoint is observed across several thousand variables? Or melt()-ing the data.frame in some meaningful way? Is there some way of handling it in ggplot using aggregate()s of data.frames or something? This isn't the right way to do this, is it?
At a loss.
I personally prefer ggplot2 for all of my plotting needs. Assuming I've understood you correctly, you can put the data in long format with reshape2 and then use ggplot2 to plot all of your lines on the same plot:
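library(reshape2)
# Assumes df has a "var" column with the variable names and time columns named X0, X5, ...
df2 <- melt(df, id.var = "var")
names(df2) <- c("var", "time", "value")
# Drop the leading "X" from the old column names and convert the times to numbers
df2$time <- as.numeric(substring(df2$time, 2))
library(ggplot2)
ggplot(df2, aes(x = time, y = value, colour = var)) + geom_line()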
You can simply use matplot as follows
DF
## 0 5 10 15 30 60 120 180 240
## YKL134C 0.08 -0.03 -0.74 -0.92 -0.80 -0.56 -0.54 -0.42 -0.48
## YMR056C -0.33 -0.26 -0.56 -0.58 -0.97 -1.47 -1.31 -1.53 -1.55
## YBR085W 0.55 3.33 4.11 3.47 2.16 2.19 2.01 2.09 1.55
## YJR155W -0.44 -0.92 -0.27 0.75 0.28 0.45 0.45 0.38 0.51
## YNL331C 0.42 0.01 -0.05 0.23 0.19 0.43 0.73 0.95 0.86
## YOL165C -0.49 -0.46 -0.25 0.03 -0.26 -0.16 -0.12 -0.37 -0.34
matplot(t(DF), type = "l", xaxt = "n", ylab = "")
axis(side = 1, at = 1:length(names(DF)), labels = names(DF))
xaxt = "n" suppresses plotting of the x axis annotations. The axis function lets you specify the details of any axis; here it is used to set the labels of the x axis.
This produces a single plot with one line per row of DF.
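The call above spaces the time points evenly along the x axis. If the real spacing matters (0, 5, 10, ... 240), a variation is to pass the column names to matplot as numeric x values. This is a sketch under the assumption that the column names are plain numbers, as in the head() output; the three-row DF is typed in from the question.

```r
# First three rows of the question's data, with the times as column names.
vals <- matrix(
  c( 0.08, -0.03, -0.74, -0.92, -0.80, -0.56, -0.54, -0.42, -0.48,
    -0.33, -0.26, -0.56, -0.58, -0.97, -1.47, -1.31, -1.53, -1.55,
     0.55,  3.33,  4.11,  3.47,  2.16,  2.19,  2.01,  2.09,  1.55),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("YKL134C", "YMR056C", "YBR085W"),
                  c("0", "5", "10", "15", "30", "60", "120", "180", "240"))
)
# check.names = FALSE keeps "0", "5", ... instead of turning them into X0, X5, ...
DF <- data.frame(vals, check.names = FALSE)

# Measurement times as numbers, taken from the column names.
times <- as.numeric(names(DF))

# t(DF) has one row per time point and one column per variable,
# so matplot draws one line per variable against the true times.
matplot(x = times, y = t(DF), type = "l", lty = 1,
        xlab = "Time", ylab = "Value")
```

With several thousand rows a legend is not practical, so the sketch leaves it out.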

Boxplot of table using ggplot2

I'm trying to plot a boxplot graph with my data, using 'ggplot' in R, but I just can't do it. Can anyone help me out?
The data is like the table below:
Paratio ShapeIdx FracD NNDis Core
-3.00 1.22 0.14 2.71 7.49
-1.80 0.96 0.16 0.00 7.04
-3.00 1.10 0.13 2.71 6.85
-1.80 0.83 0.16 0.00 6.74
-0.18 0.41 0.27 0.00 6.24
-1.66 0.12 0.11 2.37 6.19
-1.07 0.06 0.14 0.00 6.11
-0.32 0.18 0.23 0.00 5.93
-1.16 0.32 0.15 0.00 5.59
-0.94 0.14 0.15 1.96 5.44
-1.13 0.31 0.16 0.00 5.42
-1.35 0.40 0.15 0.00 5.38
-0.53 0.25 0.20 2.08 5.32
-1.96 0.36 0.12 0.00 5.27
-1.09 0.07 0.13 0.00 5.22
-1.35 0.27 0.14 0.00 5.21
-1.25 0.21 0.14 0.00 5.19
-1.02 0.25 0.16 0.00 5.19
-1.28 0.22 0.14 0.00 5.11
-1.44 0.32 0.14 0.00 5.00
What I want is a separate boxplot for each column, with no relationship between the columns.
ggplot2 requires data in a specific format. Here, you need x= and y= where y will be the values and x will be the corresponding column ids. Use melt from reshape2 package to melt the data to get the data in this format and then plot.
library(reshape2)
library(ggplot2)
# melt() with no id variables stacks every column into variable/value pairs
ggplot(data = melt(dd), aes(x = variable, y = value)) + geom_boxplot(aes(fill = variable))
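Because the columns are on quite different scales (Core is around 5 to 7, FracD around 0.1 to 0.3), it can help to give each box its own panel and y axis. A sketch of that variation follows. The dd below holds only the first five rows of the table, and base R's stack() is used in place of melt() just to avoid the extra dependency; the renaming to value and variable mirrors what melt() would produce.

```r
library(ggplot2)

# First five rows of the question's table.
dd <- data.frame(
  Paratio  = c(-3.00, -1.80, -3.00, -1.80, -0.18),
  ShapeIdx = c(1.22, 0.96, 1.10, 0.83, 0.41),
  FracD    = c(0.14, 0.16, 0.13, 0.16, 0.27),
  NNDis    = c(2.71, 0.00, 2.71, 0.00, 0.00),
  Core     = c(7.49, 7.04, 6.85, 6.74, 6.24)
)

# Long format: one row per (column name, value) pair.
long <- stack(dd)
names(long) <- c("value", "variable")

# One panel per original column, each with its own y scale.
p_box <- ggplot(long, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable)) +
  facet_wrap(~variable, scales = "free")
p_box
```

Dropping the facet_wrap() line gives back the single shared-axis plot from the answer above.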
