I am trying to draw a scatter plot in R using the ggscatter function from the ggpubr package. Here is a subset of my data.frame:
tracking_id gene_short_name B1 B2 C1 C2
ENSG00000000003.14 TSPAN6 1.2 1.16 1.22 1.26
ENSG00000000419.12 DPM1 1.87 1.87 1.68 1.83
ENSG00000000457.13 SCYL3 0.59 0.63 0.82 0.69
ENSG00000000460.16 C1orf112 0.87 0.99 0.97 0.83
ENSG00000001036.13 FUCA2 1.59 1.59 1.4 1.39
ENSG00000001084.10 GCLC 1.43 1.55 1.46 1.32
ENSG00000001167.14 NFYA 1.2 1.3 1.39 1.21
ENSG00000001460.17 STPG1 0.43 0.46 0.34 0.76
ENSG00000001461.16 NIPAL3 0.72 0.84 0.78 0.74
I want to make scatter plots of B1 vs B1, B1 vs B2, B1 vs C1, B2 vs C2.
I used the following commands:
library(ggpubr)
df <- read.table(file = "transformation.txt", header = TRUE, sep = "\t")
lapply(3:6, function(X) ggscatter(df, x = "B1", y = colnames(df)[X], add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson", add.params = list(color = "blue")))
I get 4 individual plots. I want to combine all 4 plots into a single figure. How can I do this?
Thanks
Do you perhaps mean something like this?
library(GGally)
ggpairs(df[, -(1:2)])
GGally is a very nice R package offering a lot of customisation options for its plotting routines.
Sample data
df <- read.table(text =
"tracking_id gene_short_name B1 B2 C1 C2
ENSG00000000003.14 TSPAN6 1.2 1.16 1.22 1.26
ENSG00000000419.12 DPM1 1.87 1.87 1.68 1.83
ENSG00000000457.13 SCYL3 0.59 0.63 0.82 0.69
ENSG00000000460.16 C1orf112 0.87 0.99 0.97 0.83
ENSG00000001036.13 FUCA2 1.59 1.59 1.4 1.39
ENSG00000001084.10 GCLC 1.43 1.55 1.46 1.32
ENSG00000001167.14 NFYA 1.2 1.3 1.39 1.21
ENSG00000001460.17 STPG1 0.43 0.46 0.34 0.76
ENSG00000001461.16 NIPAL3 0.72 0.84 0.78 0.74", header = T)
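If the aim is to keep the four ggscatter panels themselves rather than a pairs matrix, one possible approach is to store the list returned by lapply and pass it to ggarrange(), which also ships with ggpubr. This is only a sketch on the sample data above; the object names plots and combined are illustrative.

```r
library(ggpubr)

# Same sample data as above, repeated so the example is self-contained
df <- read.table(header = TRUE, text = "tracking_id gene_short_name B1 B2 C1 C2
ENSG00000000003.14 TSPAN6 1.2 1.16 1.22 1.26
ENSG00000000419.12 DPM1 1.87 1.87 1.68 1.83
ENSG00000000457.13 SCYL3 0.59 0.63 0.82 0.69
ENSG00000000460.16 C1orf112 0.87 0.99 0.97 0.83
ENSG00000001036.13 FUCA2 1.59 1.59 1.4 1.39
ENSG00000001084.10 GCLC 1.43 1.55 1.46 1.32
ENSG00000001167.14 NFYA 1.2 1.3 1.39 1.21
ENSG00000001460.17 STPG1 0.43 0.46 0.34 0.76
ENSG00000001461.16 NIPAL3 0.72 0.84 0.78 0.74")

# One ggscatter per sample column, always with B1 on the x axis
plots <- lapply(names(df)[3:6], function(y) {
  ggscatter(df, x = "B1", y = y, add = "reg.line", conf.int = TRUE,
            cor.coef = TRUE, cor.method = "pearson",
            add.params = list(color = "blue"))
})

# Arrange the four panels on a 2 x 2 grid in a single figure
combined <- ggarrange(plotlist = plots, ncol = 2, nrow = 2)
combined
```

The grid dimensions (ncol, nrow) can be changed if a single row or column of panels is preferred.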
Related
I have 2 dataframes with different number of rows and columns, and I'd like to show both of them in a circos plot with circlize.
My data looks like this:
df1=data.frame(replicate(7,sample(-200:200,200,rep=TRUE))/100)
df2=data.frame(replicate(2,sample(-200:200,200,rep=TRUE))/100)
#head(df1)
X1 X2 X3 X4 X5 X6 X7
1 -0.03 0.63 -0.33 0.73 -1.37 -1.39 1.96
2 -1.81 -1.24 -1.63 1.58 0.13 1.39 -0.76
3 0.02 -2.00 -1.93 -1.35 1.06 -0.58 -0.77
4 -1.11 -1.38 -0.66 -0.40 1.69 -0.47 -1.55
5 0.98 0.06 0.00 -0.35 1.97 1.74 0.72
6 1.51 -1.68 -0.44 -1.74 0.15 0.26 0.36
#head(df2)
X1 X2
1 0.16 -0.81
2 -1.38 -0.16
3 -0.22 -0.74
4 0.73 -0.82
5 0.58 -1.87
6 -0.63 1.50
I want to build a single circos plot where the top is showing df1 and bottom is showing df2, but I can only show individual dfs. For instance, this is how I show df1:
library(circlize)
col_fun1=colorRamp2(c(min(df1), 0, max(df1)), c("blue", "white", "red"))
circos.heatmap(df1, col = col_fun1, cluster = T, track.height = 0.2, rownames.side = "outside", rownames.cex = 0.6)
circos.clear()
How can I show df1 only in the top half, and df2 only in the bottom half?
Let's say we have df1 with p values:
Symbol p1 p2 p3 p4 p5
AABT 0.01 0.12 0.23 0.02 0.32
ABC1 0.13 0.01 0.01 0.12 0.02
ACDC 0.15 0.01 0.34 0.24 0.01
BAM1 0.01 0.02 0.04 0.01 0.02
BCR 0.01 0.36 0.02 0.07 0.04
BDSM 0.02 0.43 0.01 0.03 0.41
BGL 0.27 0.77 0.01 0.04 0.02
and df2 with Fold Changes:
Symbol FC1 FC2 FC3 FC4 FC5
AABT 1.21 -0.32 0.23 -0.72 0.45
ABC1 0.13 0.93 -1.61 0.12 1.03
ACDC 0.23 1.31 0.42 -0.39 1.50
BAM1 -1.33 -1.27 -0.89 1.22 -1.03
BCR 1.43 -0.25 1.29 0.54 0.97
BDSM 1.20 0.23 -1.98 -1.09 -0.31
BGL 0.33 0.12 -1.33 -1.14 -1.23
I would like to do the following in df2:
1. Keep rows that, in df1, have values < 0.05 in at least 3 of the 5 columns.
2. Eliminate rows that show discordant FC signs. An FC should be taken into account only when the respective p value from df1 is below 0.05 (i.e. significant).
3. Sort the resulting data in an intuitive order that separates rows with positive FC from rows with negative FC and, if possible, separates rows whose significant FCs occur in consecutive columns (e.g. FC3 FC4 FC5) from rows where they do not (e.g. FC1 FC3 FC5).
For example, step 1 would result in:
Symbol FC1 FC2 FC3 FC4 FC5
ABC1 0.13 0.93 -1.61 0.12 1.03
BAM1 -1.33 -1.27 -0.89 1.22 -1.03
BCR 1.43 -0.25 1.29 0.54 0.97
BDSM 1.20 0.23 -1.98 -1.09 -0.31
BGL 0.33 0.12 -1.33 -1.14 -1.23
and step 2, in:
Symbol FC1 FC2 FC3 FC4 FC5
BCR 1.43 -0.25 1.29 0.54 0.97
BGL 0.33 0.12 -1.33 -1.14 -1.23
How can this be achieved? I imagine using a for loop and the count function would do the job for step 1, but steps 2 and 3 look somewhat complicated to me. Thank you in advance for your elegant solutions.
data
df1:
df1 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Symbol p1 p2 p3 p4 p5
AABT 0.01 0.12 0.23 0.02 0.32
ABC1 0.13 0.01 0.01 0.12 0.02
ACDC 0.15 0.01 0.34 0.24 0.01
BAM1 0.01 0.02 0.04 0.01 0.02
BCR 0.01 0.36 0.02 0.07 0.04
BDSM 0.02 0.43 0.01 0.03 0.41
BGL 0.27 0.77 0.01 0.04 0.02")
df2:
df2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Symbol FC1 FC2 FC3 FC4 FC5
AABT 1.21 -0.32 0.23 -0.72 0.45
ABC1 0.13 0.93 -1.61 0.12 1.03
ACDC 0.23 1.31 0.42 -0.39 1.50
BAM1 -1.33 -1.27 -0.89 1.22 -1.03
BCR 1.43 -0.25 1.29 0.54 0.97
BDSM 1.20 0.23 -1.98 -1.09 -0.31
BGL 0.33 0.12 -1.33 -1.14 -1.23")
I'm not sure how elegant this is, but you can get the result you requested using apply and sapply with subsetting, like this:
# Create logical matrix telling us whether p values are significant
sig <- apply(df1[-1], 2, function(x) x < 0.05)
# Create numeric matrix of the sign of each FC (-1, 0 or 1)
sign <- apply(df2[-1], 2, function(x) sign(x))
# Create a vector telling us whether there were 3 or more p < 0.05 in each row
ss1 <- apply(sig, 1, function(x) length(which(x)) > 2)
# Create a vector telling us whether all FC signs match, ignoring non-significant p values
ss2 <- sapply(seq(nrow(df1)), function(i) length(table(sign[i,][sig[i,]])) == 1)
# Subset the data frames accordingly:
df1[ss1, ]
#> Symbol p1 p2 p3 p4 p5
#> 2 ABC1 0.13 0.01 0.01 0.12 0.02
#> 4 BAM1 0.01 0.02 0.04 0.01 0.02
#> 5 BCR 0.01 0.36 0.02 0.07 0.04
#> 6 BDSM 0.02 0.43 0.01 0.03 0.41
#> 7 BGL 0.27 0.77 0.01 0.04 0.02
df2[ss1 & ss2, ]
#> Symbol FC1 FC2 FC3 FC4 FC5
#> 5 BCR 1.43 -0.25 1.29 0.54 0.97
#> 7 BGL 0.33 0.12 -1.33 -1.14 -1.23
Created on 2020-07-10 by the reprex package (v0.3.0)
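Step 3 (sorting) is not covered above. As a rough sketch, the same two filters can be written with rowSums, and the surviving rows can then be ordered by FC direction and by whether the significant columns are adjacent. The helper columns direction and consecutive, and the chosen sort order, are assumptions about what "intuitive" means here, not part of the original answer.

```r
df1 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Symbol p1 p2 p3 p4 p5
AABT 0.01 0.12 0.23 0.02 0.32
ABC1 0.13 0.01 0.01 0.12 0.02
ACDC 0.15 0.01 0.34 0.24 0.01
BAM1 0.01 0.02 0.04 0.01 0.02
BCR 0.01 0.36 0.02 0.07 0.04
BDSM 0.02 0.43 0.01 0.03 0.41
BGL 0.27 0.77 0.01 0.04 0.02")
df2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Symbol FC1 FC2 FC3 FC4 FC5
AABT 1.21 -0.32 0.23 -0.72 0.45
ABC1 0.13 0.93 -1.61 0.12 1.03
ACDC 0.23 1.31 0.42 -0.39 1.50
BAM1 -1.33 -1.27 -0.89 1.22 -1.03
BCR 1.43 -0.25 1.29 0.54 0.97
BDSM 1.20 0.23 -1.98 -1.09 -0.31
BGL 0.33 0.12 -1.33 -1.14 -1.23")

sig <- as.matrix(df1[-1]) < 0.05       # TRUE where the p value is significant
fc_sign <- sign(as.matrix(df2[-1]))    # -1, 0 or 1 for each FC
fc_sign[!sig] <- NA                    # ignore FCs whose p value is not significant

n_pos <- rowSums(fc_sign > 0, na.rm = TRUE)   # significant positive FCs per row
n_neg <- rowSums(fc_sign < 0, na.rm = TRUE)   # significant negative FCs per row

# Step 1: at least 3 significant columns; step 2: no mix of positive and negative
keep <- rowSums(sig) >= 3 & pmin(n_pos, n_neg) == 0

# Step 3 helpers: overall direction, and whether significant columns are adjacent
direction   <- ifelse(n_pos > 0, 1, -1)
consecutive <- apply(sig, 1, function(x) all(diff(which(x)) == 1))

res <- data.frame(df2, direction = direction, consecutive = consecutive)[keep, ]
# Positive rows first, and within each direction the consecutive runs first
res <- res[order(-res$direction, !res$consecutive), ]
res
```

On the sample data this keeps BCR (positive; significant in FC1, FC3, FC5) followed by BGL (negative; significant in FC3, FC4, FC5).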
Subject var1 var2 var3 var4 var5
1 0.2 0.78 7.21 0.5 0.47
1 0.52 1.8 11.77 -0.27 -0.22
1 0.22 0.84 7.32 0.35 0.36
2 0.38 1.38 10.05 -0.25 -0.2
2 0.56 1.99 13.76 -0.44 -0.38
3 0.35 1.19 7.23 -0.16 -0.06
4 0.09 0.36 4.01 0.55 0.51
4 0.29 1.08 9.48 -0.57 -0.54
4 0.27 1.03 9.42 -0.19 -0.21
4 0.25 0.9 7.06 0.12 0.12
5 0.18 0.65 5.22 0.41 0.42
5 0.15 0.57 5.72 0.01 0.01
6 0.26 0.94 7.38 -0.17 -0.13
6 0.14 0.54 5.13 0.16 0.17
6 0.22 0.84 6.97 -0.66 -0.58
6 0.18 0.66 5.79 0.23 0.25
# The above is the sample data matrix (dat11).
# The following lines calculate the p-value (p.z) for the
# variable pair var2 and var3 using lmer().
library(lme4)
fit1 <- lmer(var2 ~ var3 + (1|Subject), data = dat11)
summary(fit1)
coefs <- data.frame(coef(summary(fit1)))
# use normal distribution to approximate p-value
coefs$p.z <- 2 * (1 - pnorm(abs(coefs$t.value)))
round(coefs,6)
# the following is the result
Estimate Std.Error t.value p.z
(Intercept) -0.280424 0.110277 -2.542913 0.010993
var3 0.163764 0.013189 12.417034 0.000000
The real data contains 65 variables (var1, var2, ..., var65). I would like to use the above code to get this result for all possible pairs of the 65 variables, e.g. var1 ~ var2, var1 ~ var3, ..., var1 ~ var65; var2 ~ var3, var2 ~ var4, ..., var2 ~ var65; var3 ~ var4, and so on. There will be about 2000 pairs. Can somebody help me with the loop code and with writing the results to a .csv file? Thank you.
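One way this could be sketched is to enumerate the pairs with combn(), build each formula from the two variable names, and collect one result row per pair. The example below runs on the sample data shown above (5 variables, so 10 pairs); fit_pair, var_pairs and the output file name lmer_pairs.csv are illustrative names, and the lower-numbered variable is assumed to be the response, as in var2 ~ var3.

```r
library(lme4)

dat11 <- read.table(header = TRUE, text = "Subject var1 var2 var3 var4 var5
1 0.2 0.78 7.21 0.5 0.47
1 0.52 1.8 11.77 -0.27 -0.22
1 0.22 0.84 7.32 0.35 0.36
2 0.38 1.38 10.05 -0.25 -0.2
2 0.56 1.99 13.76 -0.44 -0.38
3 0.35 1.19 7.23 -0.16 -0.06
4 0.09 0.36 4.01 0.55 0.51
4 0.29 1.08 9.48 -0.57 -0.54
4 0.27 1.03 9.42 -0.19 -0.21
4 0.25 0.9 7.06 0.12 0.12
5 0.18 0.65 5.22 0.41 0.42
5 0.15 0.57 5.72 0.01 0.01
6 0.26 0.94 7.38 -0.17 -0.13
6 0.14 0.54 5.13 0.16 0.17
6 0.22 0.84 6.97 -0.66 -0.58
6 0.18 0.66 5.79 0.23 0.25")

# All unordered pairs of variables, one pair per column.
# With 65 variables this yields choose(65, 2) = 2080 pairs.
vars <- setdiff(names(dat11), "Subject")
var_pairs <- combn(vars, 2)

# Fit y ~ x + (1 | Subject) and return the slope row plus the normal-approximation p value
fit_pair <- function(y, x) {
  f   <- as.formula(paste(y, "~", x, "+ (1 | Subject)"))
  fit <- lmer(f, data = dat11)
  cf  <- coef(summary(fit))[x, ]
  data.frame(response = y, predictor = x,
             Estimate = cf[["Estimate"]],
             Std.Error = cf[["Std. Error"]],
             t.value = cf[["t value"]],
             p.z = 2 * pnorm(abs(cf[["t value"]]), lower.tail = FALSE))
}

results <- do.call(rbind, lapply(seq_len(ncol(var_pairs)), function(i) {
  fit_pair(var_pairs[1, i], var_pairs[2, i])
}))

# File name is a placeholder; adjust the path as needed
write.csv(results, "lmer_pairs.csv", row.names = FALSE)
```

Because y ~ x and x ~ y are different models, the reversed pairs would need a second pass if both directions are of interest. Some fits on small data may report singular-fit messages, which are worth checking before trusting the p values.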
I have a (25x6) matrix containing the following observations (class: dataframe):
Mkt.RF SMB HML RMW CMA WML
-3.86 1.37 1.14 1.47 -2.35 0.05
1.10 -0.95 -1.60 1.17 -0.33 -2.96
2.44 -1.79 0.39 1.14 -2.31 -1.55
9.10 2.48 0.01 -1.43 -0.12 -7.61
-2.37 2.90 -0.84 0.84 -1.22 1.81
0.54 0.09 0.48 0.30 0.32 0.03
0.72 -0.48 0.40 0.20 -0.12 0.87
-6.09 1.57 1.04 1.05 0.43 1.13
3.43 -1.63 -0.55 1.45 -0.63 3.35
-1.35 0.32 -0.59 1.57 -0.80 3.43
2.90 0.52 0.00 -0.26 0.39 1.56
1.35 -0.22 -1.42 -1.58 0.19 2.25
-5.10 0.77 -1.34 1.21 -0.35 1.06
6.26 -1.91 -2.70 1.89 -1.94 3.01
-2.21 4.04 3.00 -0.07 1.09 0.38
-1.93 2.50 1.88 0.53 1.13 1.26
-5.48 1.04 2.45 0.79 0.61 0.90
-0.11 -1.34 2.59 3.32 2.21 0.10
4.13 0.15 0.66 -1.51 1.13 -0.18
-3.72 0.76 0.92 0.87 0.42 2.96
-0.64 -2.35 -1.31 0.27 0.55 0.94
2.52 -2.70 -1.71 -0.16 0.86 -3.55
-1.41 -0.20 -0.96 0.47 -0.25 2.56
-3.08 -0.45 -0.35 0.23 -2.21 1.55
1.78 -0.19 -1.64 -0.10 -1.17 0.69
I wish to produce two plots: (1) a probability density function, and (2) a cumulative distribution function in ggplot. I would like to have a function for each column, hence there should be 6 pdfs and 6 cdfs. I have produced the following:
Loaddata <- setwd("~/Desktop")
library(ggplot2)
library(plyr)
library(reshape2)
D <- read.table(file = "MyData.csv", header = TRUE, sep =";", dec = ",")
attach(D)
factors <- D[, 2:7]
ggplot(factors, aes(Mkt.RF)) + geom_density() + labs(x = "Return", y = "Distribution", title = "PDF") +
xlim(-20, 20) + theme(plot.title = element_text(hjust = 0.5))
With this I can produce a plot with a single function (one column of data), but I am having trouble combining all six functions into one plot. I would like to replicate something similar to this:
[Image: PDF functions example]
Thank you in advance!
You can try
library(tidyverse)
df %>%
  bind_rows(df, .id="gr") %>%
  gather(key, value, -gr) %>%
  ggplot() +
  geom_density(data = . %>% filter(gr == 1), aes(value, color = key), size=1.1) +
  stat_ecdf(data = . %>% filter(gr == 2), aes(value, color = key), size=1.1) +
  facet_wrap(~gr, labeller = labeller(gr=c("1" = "PD", "2" = "CD")))
The single plots can be created using
df %>%
gather(key, value) %>%
ggplot(aes(value, color=key)) +
geom_density()
Here is a reproducible example using mtcars that plots all the distributions on top of each other:
library(tidyverse)
mtcars %>%
gather(Variable, Value) %>%
ggplot(aes(x=Value, color=Variable)) +
geom_density(alpha=0)
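For the two separate figures asked for in the question (one with all PDFs, one with all CDFs), a minimal sketch along the same lines could use base R's stack() to reshape the columns and stat_ecdf() for the cumulative curves. It uses three mtcars columns as stand-in data; long, p_pdf and p_cdf are illustrative names.

```r
library(ggplot2)

# Stand-in data: reshape a few numeric columns to long format.
# stack() returns the columns `values` (numbers) and `ind` (source column).
long <- stack(mtcars[c("mpg", "qsec", "disp")])

# Plot 1: one kernel density curve (PDF estimate) per column
p_pdf <- ggplot(long, aes(x = values, colour = ind)) +
  geom_density() +
  labs(x = "Value", y = "Density", title = "PDF", colour = "Variable")

# Plot 2: one empirical CDF per column
p_cdf <- ggplot(long, aes(x = values, colour = ind)) +
  stat_ecdf() +
  labs(x = "Value", y = "Cumulative probability", title = "CDF", colour = "Variable")

p_pdf
p_cdf
```

With the six factor columns from the question, stack(factors) would take the place of the mtcars subset.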
Also posted as an issue on GitHub.
After using group_by, I cannot get pander to output the table correctly with the digits= or round= parameters.
If I take the group_by out of the chain, pander displays the table just fine. With the group_by in, the floating point numbers are printed with far too many decimal places to display.
# test dataframe
dat <- data.frame(matrix(rnorm(10 * 10), 10))
group <- rbinom(10,20,.1)
df1 <- cbind(group, dat)
library(pander)
pander(df1, digits = 2, keep.line.breaks = TRUE, split.table = Inf,
caption = "Not Grouped, correct format")
library(dplyr)
df2 <- df1 %>%
group_by(group)
pander(df2, digits = 2, keep.line.breaks = TRUE, split.table = Inf,
caption = "Grouped, incorrect format")
Is there a way around this?
As a workaround, you can convert the object df2 (a grouped tbl_df) to a plain data.frame object:
pander(as.data.frame(df2), digits = 2, keep.line.breaks = TRUE, split.table = Inf)
The result:
-----------------------------------------------------------------------
group X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
------- ------ ----- ----- ----- ------ ----- ----- ------ ------ -----
0 -0.55 -0.13 -0.71 -1.3 -0.096 0.49 0.73 -0.53 0.17 -0.44
2 -1.5 1.4 -2.1 0.96 -0.2 -0.36 0.33 0.2 0.67 -0.27
1 -2.3 -0.98 -1.5 1.1 0.87 -0.54 1.2 -0.24 0.31 -0.76
1 0.24 0.086 -0.78 0.39 -0.17 -0.2 -1.5 -1.1 -1.3 -0.72
0 0.2 -1.2 0.27 2.1 0.73 1.8 -0.12 -0.45 0.07 -0.29
1 0.022 0.084 -0.41 0.32 -0.023 0.38 0.57 -0.16 0.0011 -0.76
2 0.99 0.7 -0.32 -0.25 -0.17 -0.68 -0.59 0.29 0.77 -0.12
3 -1.3 -1.6 -0.14 0.49 0.61 1.2 0.14 -0.087 -1.2 -0.95
0 -0.073 -0.86 2 -0.87 0.51 -1.3 -0.94 0.022 0.6 0.68
3 1.8 -0.81 -0.4 0.72 2.1 0.19 0.086 1.7 0.19 -0.49
-----------------------------------------------------------------------
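One way to confirm what the conversion changes is to compare the classes before and after. The sketch below only inspects the objects and makes no claim about why pander formats grouped tibbles differently; the seed is arbitrary and only makes the example repeatable.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, not part of the original example
dat <- data.frame(matrix(rnorm(10 * 10), 10))
group <- rbinom(10, 20, .1)
df2 <- cbind(group, dat) %>%
  group_by(group)

class(df2)                  # includes "grouped_df" and "tbl_df"
class(as.data.frame(df2))   # plain "data.frame"
```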