How to create a chord diagram in R?

I've never made a plot like this before, so apologies if this is a basic question. I am stuck on how to make a chord diagram in which the outer sections are my column headings (drug mechanisms) and the inner connections between the sections are the rows (genes). The genes don't need to be named in the plot, as there are so many.
My data is rows of genes that are marked as interacting with columns of drug mechanisms by zeros or ones.
For example a subset of my data looks like:
Gene Diuretic Beta_blocker ACE_inhibitor
Gene1 1 0 0
Gene2 0 0 1
Gene3 1 1 1
Gene4 0 1 1
My full data set is actually 700 genes by 15 columns of drug mechanisms, all zeros and ones. I am currently just creating a chord diagram with:
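library(data.table) # fread
library(magrittr)   # %>% and set_rownames
library(dplyr)
library(circlize)   # circos.par and chordDiagram
df <- fread('df.csv')
df[is.na(df)] <- 0
df <- df %>% data.frame %>% set_rownames(.$Gene) %>% dplyr::select(-Gene)
mt <- as.matrix(df)
circos.par(gap.degree = 0.9) # set this as I was otherwise getting an error with my total data
chordDiagram(mt, transparency = 0.5)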
With my total data this plot looks like:
I've been getting various errors with trying to get this plot to be 15 sections only (and even just trying to get the sections to have the column names).
Is there a way for me to plot a chord diagram with one section per column? Genes/rows that have an interaction (a 1 in the data) with that section and any other section should then be shown as links in the chord diagram. I don't need the gene names to be visible; I am just looking to visualize the amount of overlap between my columns/sections.
Example input data (for which the goal would be a plot with only 3 sections, one per column, showing their overlap):
df <- structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4"),
Diuretic = c(1L, 0L, 1L, 0L), Beta_blocker = c(0L, 0L, 1L,
1L), ACE_inhibitor = c(0L, 1L, 1L, 1L)), row.names = c(NA,
-4L), class = c("data.table", "data.frame"))

If you have 15 different drug mechanisms, it would be best to count the genes that various mechanisms have in common, and use these as weightings for the links between drug effects.
Your sample data is too limited to give a feel for how this would look, but the code would be something like this:
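library(dplyr)
library(circlize)
# For each gene (row), list every pair of mechanisms it acts on
new_df <- apply(df, 1, function(x) {
  x <- names(df)[which(x == 1)]                 # mechanisms flagged for this gene
  m <- 1 - diag(length(x))
  dimnames(m) <- list(x, x)
  inds <- which(lower.tri(m), arr.ind = TRUE)   # each unordered pair once
  data.frame(from = x[inds[,1]], to = x[inds[,2]])}) %>%
  bind_rows() %>%
  mutate(wt = 1) %>%
  group_by(from, to) %>%
  summarize(wt = sum(wt), .groups = 'drop')     # number of genes shared by each pair
new_df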
#> # A tibble: 3 x 3
#> from to wt
#> <chr> <chr> <dbl>
#> 1 ACE_inhibitor Beta_blocker 2
#> 2 ACE_inhibitor Diuretic 1
#> 3 Beta_blocker Diuretic 1
We can see that two genes (Gene3 and Gene4) act on both the ACE inhibitor and beta blocker mechanisms, which is what your table implies, and that a single gene (Gene3) links diuretic to each of beta blocker and ACE inhibitor.
This produces the following rather dull chord diagram:
chordDiagram(new_df)
However, if we make a sample data set that is of the same scale as your real data, we get a more satisfactory result:
set.seed(123)
big_dat <- as.data.frame(matrix(rbinom(15 * 700, 1, 0.5), 700),
row.names = paste0('Gene', 1:700)) %>%
setNames(c('ACE_inhibitor', 'Diuretic', 'Beta_Blocker',
'CCB', 'Nitrate', 'K_channel', 'Aldosterone_blocker',
'Vasodilator', 'PDEI', 'Central', 'Relaxant',
'ARB', 'Alpha_blocker', 'Dopaminergic', 'Unknown'))
big_df <- apply(big_dat, 1, function(x) {
x <- names(big_dat)[which(x == 1)]
m <- 1 - diag(length(x))
dimnames(m) <- list(x, x)
inds <- which(lower.tri(m), arr.ind = TRUE)
data.frame(from = x[inds[,1]], to = x[inds[,2]])}) %>%
bind_rows() %>%
mutate(wt = 1) %>%
subset(complete.cases(.)) %>%
group_by(from, to) %>%
summarize(wt = sum(wt), .groups = 'drop')
chordDiagram(big_df)
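The pair counting above can also be sketched as a matrix cross-product. This is only an illustration on the four-gene toy data from the question, not the answerer's method; the names `mt` and `co` are chosen here for the example.

```r
# Toy 0/1 matrix copied from the question: genes in rows, mechanisms in columns
mt <- matrix(c(1, 0, 0,
               0, 0, 1,
               1, 1, 1,
               0, 1, 1),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("Gene", 1:4),
                             c("Diuretic", "Beta_blocker", "ACE_inhibitor")))

# crossprod(mt) is t(mt) %*% mt: cell [i, j] counts the genes
# flagged for both mechanism i and mechanism j
co <- crossprod(mt)

# Keep each pair once and drop the diagonal (genes per single mechanism)
co[upper.tri(co, diag = TRUE)] <- 0
co
```

With circlize loaded, passing `co` to `chordDiagram()` should then draw one sector per mechanism, with link widths given by the shared-gene counts. The nonzero cells match the `wt` column of `new_df` above.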

Related

How to find the clusters that produce the maximum colMeans in R?

I have a data frame like
V1 V2 V3
1 1 1 2
2 0 1 0
3 3 0 3
....
and I have a vector of the same length as the number of rows in the data frame (it's the cluster from kmeans, if that matters)
[1] 2 2 1...
From those I can get the colMeans for each cluster, like
cm1 <- colMeans(df[fit$cluster==1,])
cm2 <- colMeans(df[fit$cluster==2,])
(I don't think I should do that part explicitly, but that's how I'm thinking about the problem.)
What I want is to get, for each column of the data frame, the value from the vector for which the colMeans is the maximum. Also I'd like to do (separately is fine) the second-highest, third, etc. So in the example I would want the output to be a vector with one element for each column of the data frame:
1 2 1...
because for the first column of the data frame, the column mean for the first cluster is 3, while the column mean for the second cluster is 0.5.
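If the cluster vector is of the same length as the number of rows of 'df', split the data by cluster into a list, take the column means of each piece, and find the position of the largest mean for each column. Note that which.max returns the position of the cluster among the sorted cluster labels, which equals the label itself when clusters are numbered 1..k, as kmeans does: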
lst1 <- lapply(split(df, fit$cluster), function(x) stack(colMeans(x)))
dat <- do.call(rbind, Map(cbind, cluster = names(lst1), lst1))
aggregate(values ~ ind, dat, FUN = which.max)
If we need more than one cluster per column (the highest, second-highest, and so on), add the 'cluster' column to the data, reshape to 'long' format (or use summarise/across), take the mean of 'value' grouped by 'cluster' and 'name', arrange by 'name' and by 'value' in descending order, then group by 'name' and keep the top n rows with slice_head:
library(dplyr)
library(tidyr)
df %>%
mutate(cluster = fit$cluster) %>%
pivot_longer(cols = -cluster) %>%
group_by(cluster, name) %>%
summarise(value = mean(value), .groups = 'drop') %>%
arrange(name, desc(value)) %>%
group_by(name) %>%
slice_head(n = 2)
data
df <- structure(list(V1 = c(1L, 0L, 3L), V2 = c(1L, 1L, 0L), V3 = c(2L,
0L, 3L)), class = "data.frame", row.names = c("1", "2", "3"))
fit <- structure(list(cluster = c(2, 2, 1)), class = "data.frame",
row.names = c(NA,
-3L))
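As a rough base R sketch of the same idea, assuming the toy data above with the cluster labels held in a plain vector (the names `cluster`, `cm`, `best` and `second` are made up for this example), the per-cluster column means can be collected into one matrix and ranked column by column:

```r
df <- data.frame(V1 = c(1, 0, 3), V2 = c(1, 1, 0), V3 = c(2, 0, 3))
cluster <- c(2, 2, 1)   # stands in for fit$cluster

# Rows = clusters, columns = variables; each cell is a cluster's column mean
cm <- sapply(df, function(col) tapply(col, cluster, mean))

# Cluster label with the highest mean for each column
best <- rownames(cm)[apply(cm, 2, which.max)]

# Cluster label with the second-highest mean for each column
second <- apply(cm, 2, function(m) names(sort(m, decreasing = TRUE))[2])

best
second
```

Because the result is looked up through `rownames(cm)`, this returns the cluster labels themselves rather than positions, so it does not rely on the clusters being numbered 1..k.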

How to filter a tibble and turn it into a matrix?

I have made a tibble with three columns. Cancer status (case/control), trait A (x1) and trait B (x2).
cancer <- tibble(case = factor(c(rep(1,3),rep(0,3)),
levels = c(0,1),
labels = c("control","case")),
x1 = c(1,2,3,4,2,3),
x2 = c(4,8,6,0,2,4))
Now I want to make a matrix with the means of x1 and x2, but just for the patients with cancer (case). So a 2x1 matrix. How can I do this? I tried to start with those two functions, but I can't get them to work...
matrix <- filter(cancer$case == "case")
mean(cancer$x1 & cancer$case == "case")
I know this is a very simple example and I could do it by hand, but I'm new to R and want to know how I could handle this with 6000 instead of 6 rows.
Like this:
library(dplyr)
cancer %>%
summarise(meanx1 = mean(x1[case == "case"]),
meanx2 = mean(x2[case == "case"]))
# A tibble: 1 x 2
meanx1 meanx2
<dbl> <dbl>
1 2 6
If you want it as a matrix:
cancer %>%
summarise(meanx1 = mean(x1[case == "case"]),
meanx2 = mean(x2[case == "case"])) %>%
as.matrix()
meanx1 meanx2
[1,] 2 6
And with across if you have updated to dplyr 1.0.0 (inside the lambda, .x refers to the current column):
cancer %>%
summarise(across(x1:x2, ~mean(.x[case == "case"]), .names = "mean_{col}")) %>%
as.matrix()
mean_x1 mean_x2
[1,] 2 6
With the new dplyr 1.0.0, you can do
library(tidyverse)
cancer <- tibble(case = factor(c(rep(1,3),rep(0,3)),
levels = c(0,1),
labels = c("control","case")),
x1 = c(1,2,3,4,2,3),
x2 = c(4,8,6,0,2,4))
cancer %>%
filter(case == "case") %>%
summarize(across(x1:x2, mean))
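The answers above return a 1x2 result. If a 2x1 matrix is really what is wanted, a base R sketch along these lines may be enough; it uses a plain data.frame instead of a tibble, and `is_case` and `case_means` are names invented for the example.

```r
cancer <- data.frame(
  case = factor(c(rep(1, 3), rep(0, 3)),
                levels = c(0, 1),
                labels = c("control", "case")),
  x1 = c(1, 2, 3, 4, 2, 3),
  x2 = c(4, 8, 6, 0, 2, 4)
)

# Logical filter for the cancer patients
is_case <- cancer$case == "case"

# colMeans gives a named vector; as.matrix turns it into a 2x1 column matrix
case_means <- as.matrix(colMeans(cancer[is_case, c("x1", "x2")]))
case_means
```

The same two lines work unchanged for 6000 rows, since the filter and the means are computed on whole columns.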

Best way to apply code to 24 similar datasets?

I have 24 datasets that each have one factor and one response. I have written code to subset the 93 entries into 3 categories, but I'm not sure what the most efficient way is to run this code for all 24 of my datasets. Any ideas would be much appreciated.
Here's the data I'm working with.
dput(head(data))
structure(list(run.size.percentage = structure(c(2L, 13L, 24L,
35L, 46L, 57L), .Label = c(",2000,", "1,0.375,0.013", "10,0.868,0.11",
"11,0.953,0.12", "12,1.047,0.12", "13,1.149,0.13", "14,1.261,0.14",
"15,1.385,0.14", "16,1.520,0.15", "17,1.668,0.15", "18,1.832,0.16",
"19,2.011,0.17", "2,0.412,0.023", "20,2.207,0.17", "21,2.423,0.18",
"22,2.660,0.19", "23,2.920,0.20", "24,3.205,0.21", "25,3.519,0.22",
"26,3.863,0.24", "27,4.240,0.25", "28,4.655,0.26", "29,5.110,0.28",
"3,0.452,0.034", "30,5.610,0.30", "31,6.158,0.31", "32,6.760,0.33",
"33,7.421,0.35", "34,8.147,0.37", "35,8.943,0.39", "36,9.817,0.42",
"37,10.78,0.45", "38,11.83,0.47", "39,12.99,0.50", "4,0.496,0.049",
"40,14.26,0.53", "41,15.65,0.56", "42,17.18,0.58", "43,18.86,0.59",
"44,20.70,0.59", "45,22.73,0.58", "46,24.95,0.55", "47,27.39,0.52",
"48,30.07,0.49", "49,33.01,0.46", "5,0.545,0.061", "50,36.24,0.45",
"51,39.78,0.45", "52,43.67,0.45", "53,47.94,0.44", "54,52.62,0.42",
"55,57.77,0.38", "56,63.41,0.35", "57,69.61,0.32", "58,76.42,0.31",
"59,83.89,0.33", "6,0.598,0.072", "60,92.09,0.36", "61,101.1,0.42",
"62,111.0,0.49", "63,121.8,0.59", "64,133.7,0.74", "65,146.8,0.94",
"66,161.2,1.19", "67,176.9,1.49", "68,194.2,1.82", "69,213.2,2.18",
"7,0.656,0.083", "70,234.1,2.55", "71,256.9,2.94", "72,282.1,3.34",
"73,309.6,3.78", "74,339.9,4.25", "75,373.1,4.73", "76,409.6,5.20",
"77,449.7,5.60", "78,493.6,5.87", "79,541.9,5.93", "8,0.721,0.093",
"80,594.9,5.77", "81,653.0,5.37", "82,716.8,4.77", "83,786.9,4.03",
"84,863.9,3.21", "85,948.3,2.36", "86,1041,1.55", "87,1143,0.81",
"88,1255,0.30", "89,1377,0.056", "9,0.791,0.10", "90,1512,0.0044",
"91,1660,0", "92,1822,0"), class = "factor")), row.names = c(NA,
6L), class = "data.frame")
Here's the code that worked for each dataset.
data2 <- tidyr::separate(names(data), unlist(strsplit(names(data), "\\.")), ",", data=data)
group1 <- data2 %>% filter(size <= 2)
group2 <- data2 %>% filter(size > 2 & size <= 50)
group3 <- data2 %>% filter(size > 50 & size <= 2000)
sum(as.numeric(group1$percentage), na.rm=TRUE)
sum(as.numeric(group2$percentage), na.rm=TRUE)
sum(as.numeric(group3$percentage), na.rm=TRUE)
Put your data frames in a list and use lapply. I used cut to create the needed size groups, and added the convert = TRUE argument to separate so that the split columns are converted to numeric. As in your own code, na.rm = TRUE skips rows whose percentage is missing:
library(dplyr)
library(tidyr)
df_list <- list(df, df) # creating a dummy list with the same df twice
lapply(df_list, function(x) {
  # x is the current data frame; split its single column on commas
  separate(names(x), unlist(strsplit(names(x), "\\.")), ",",
           data = x, convert = TRUE) %>%
    group_by(group = cut(size, breaks = c(0,2,50,2000,Inf))) %>%
    summarise(percentage = sum(percentage, na.rm = TRUE))
})
# every list element is your desired output df
[[1]]
# A tibble: 1 x 2
group percentage
<fct> <dbl>
1 (0,2] 0.252
[[2]]
# A tibble: 1 x 2
group percentage
<fct> <dbl>
1 (0,2] 0.252
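The list-plus-lapply pattern can also be sketched in base R with a named list, so that each result stays attached to its dataset. Everything here is illustrative: the helper `summarise_sizes`, the list names `sample_a` and `sample_b`, the group labels, and the four-row toy data are assumptions, not part of the original post.

```r
# Toy stand-in for one dataset: a single column of "run,size,percentage" strings
raw <- data.frame(run.size.percentage = c("1,0.375,0.013", "2,0.412,0.023",
                                          "40,14.26,0.53", "70,234.1,2.55"))

# Hypothetical helper: split the column, bin by size, sum percentage per bin
summarise_sizes <- function(d) {
  parts <- read.csv(text = as.character(d[[1]]), header = FALSE,
                    col.names = c("run", "size", "percentage"))
  grp <- cut(parts$size, breaks = c(0, 2, 50, 2000),
             labels = c("<=2", "2-50", "50-2000"))
  tapply(parts$percentage, grp, sum, na.rm = TRUE)
}

# In practice this list would hold all 24 data frames
datasets <- list(sample_a = raw, sample_b = raw)
results <- lapply(datasets, summarise_sizes)
results$sample_a
```

Naming the list elements means `results$sample_a` can be traced back to its input, which is handy once there are 24 of them.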

Count frequency of same value in several columns

I'm quite new to R and I'm facing a problem which I guess is quite easy to fix but I couldn't find the answer.
I have a dataframe called clg where basically I have 3 columns date, X1, X2.
X1 and X2 are name of country teams. X1 and X2 have the same list of countries.
I'm simply trying to count the frequency of each country in the two columns as a total.
So far, I've only been able to count the frequency of the X1 column but I didn't find a way to sum both columns.
clt <- as_tibble(na.omit(count(clg, clg$X1)))
I would like to get a data frame where in the first columns I have unique countries, and in the second column the sum of occurrences in X1 + X2.
You can use unlist() and table() to get the overall counts. Wrapping it in data.frame() will give you the desired two-column output.
clg <- data.frame(date=1:3,
X1=c("nor", "swe", "alg"),
X2=c("swe", "alg", "jpn"))
data.frame(table(unlist(clg[c("X1", "X2")])))
# Var1 Freq
# 1 alg 2
# 2 nor 1
# 3 swe 2
# 4 jpn 1
With tidyverse, we can gather into 'long' format and then do the count
library(tidyverse)
gather(clg, key, Var1, -date) %>%
count(Var1)
# A tibble: 4 x 2
# Var1 n
# <chr> <int>
#1 alg 2
#2 jpn 1
#3 nor 1
#4 swe 2
data
clg <- structure(list(date = 1:3, X1 = structure(c(2L, 3L, 1L), .Label = c("alg",
"nor", "swe"), class = "factor"), X2 = structure(c(3L, 1L, 2L
), .Label = c("alg", "jpn", "swe"), class = "factor")),
class = "data.frame", row.names = c(NA,
-3L))
You can reach your goal in two steps. In the first step, you count the occurrences of each country in each column separately. In the next step, you join the two summaries by country name and calculate the total. A full join keeps countries that appear in only one of the columns, and coalesce treats their missing count as 0.
library(dplyr)
X1_sum <- clg %>%
  dplyr::group_by(X1) %>%
  dplyr::summarize(n_x1 = dplyr::n())
X2_sum <- clg %>%
  dplyr::group_by(X2) %>%
  dplyr::summarize(n_x2 = dplyr::n())
final_summary <- X1_sum %>%
  # match country names in X1 against country names in X2
  dplyr::full_join(X2_sum, by = c("X1" = "X2")) %>%
  dplyr::mutate(n_sum = dplyr::coalesce(n_x1, 0L) + dplyr::coalesce(n_x2, 0L))
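For comparison, here is a small base R sketch of the pooled count that should behave the same whether X1 and X2 are stored as character or as factor columns. The names `countries` and `counts` and the final sort order are choices made for this example.

```r
clg <- data.frame(date = 1:3,
                  X1 = c("nor", "swe", "alg"),
                  X2 = c("swe", "alg", "jpn"))

# Convert each column to character first so factor levels cannot get mixed up,
# then pool both columns into one vector
countries <- unlist(lapply(clg[c("X1", "X2")], as.character), use.names = FALSE)

# One row per country with its total number of occurrences
counts <- as.data.frame(table(country = countries),
                        responseName = "n", stringsAsFactors = FALSE)

# Most frequent first, ties broken alphabetically
counts <- counts[order(-counts$n, counts$country), ]
counts
```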

How to create conditionally new groups when summarizing group means in R

I have data for which I want to summarize group means. I then would like to re-group some of the smaller groups (matching a certain n < x condition) into a group called "others". I found a way to do this. But it feels like there are more efficient solutions out there. I wonder how a data.table approach would solve the problem.
Here is an example using tibble and dplyr.
# preps
library(tibble)
library(dplyr)
set.seed(7)
# generate 4 groups with more observations
tbl_1 <- tibble(group = rep(sample(letters[1:4], 150, TRUE), each = 4),
score = sample(0:10, size = 600, replace = TRUE))
# generate 3 groups with less observations
tbl_2 <- tibble(group = rep(sample(letters[5:7], 50, TRUE), each = 3),
score = sample(0:10, size = 150, replace = TRUE))
# put them into one data frame
tbl <- rbind(tbl_1, tbl_2)
# aggregate the mean scores and count the observations for each group
tbl_agg1 <- tbl %>%
group_by(group) %>%
summarize(MeanScore = mean(score),
n = n())
So far so easy.
Next I want to only show groups with more than 100 observations. All other groups should be merged into one group called "others".
# First, calculate summary stats for groups with n < 100
tbl_agg2 <- tbl_agg1 %>%
filter(n<100) %>%
summarize(MeanScore = weighted.mean(MeanScore, n),
sumN = sum(n))
Note: There was a mistake in the calculation above which is now corrected (@Frank: thanks for spotting it!)
# Second, delete groups with n < 100 from the aggregate table and add a row containing the summary statistics calculated above instead
tbl_agg1 <- tbl_agg1 %>%
filter(n>100) %>%
add_row(group = "others", MeanScore = tbl_agg2[["MeanScore"]], n = tbl_agg2[["sumN"]])
tbl_agg1 basically shows what I want it to show, but I wonder if there is a smoother, more efficient way to do this. At the same time I wonder how a data.table approach would deal with the problem at hand.
I welcome any suggestions.
Your calculation for the "other" group is wrong, I guess... should be...
tbl_agg1 %>% {bind_rows(
filter(., n>100),
filter(., n<100) %>%
summarize(group = "other", MeanScore = weighted.mean(MeanScore, n), n = sum(n))
)}
However, you could keep things a lot simpler from the start by using a different grouping variable:
tbl %>%
group_by(group) %>%
group_by(g = replace(group, n() < 100, "other")) %>%
summarise(n = n(), m = mean(score))
# A tibble: 5 x 3
g n m
<chr> <int> <dbl>
1 a 136 4.79
2 b 188 4.49
3 c 160 5.32
4 d 116 4.78
5 other 150 5.42
Or with data.table
library(data.table)
DT = data.table(tbl)
DT[, n := .N, by=group]
DT[, .(.N, m = mean(score)), keyby=.(g = replace(group, n < 100, "other"))]
g N m
1: a 136 4.786765
2: b 188 4.489362
3: c 160 5.325000
4: d 116 4.784483
5: other 150 5.420000
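The same relabel-then-aggregate idea can be sketched in base R with ave(). The data below are a small deterministic stand-in, not the simulated data from the question, and the names `n_per_group`, `g` and `res` are made up for the example.

```r
# Two large groups (a, b) and two small ones (e, f)
tbl <- data.frame(group = c(rep("a", 120), rep("b", 110),
                            rep("e", 40), rep("f", 30)),
                  score = rep(0:10, length.out = 300),
                  stringsAsFactors = FALSE)

# Size of each row's own group, repeated for every row of that group
n_per_group <- ave(seq_along(tbl$group), tbl$group, FUN = length)

# Rows from groups with fewer than 100 observations are relabelled "other"
tbl$g <- ifelse(n_per_group < 100, "other", tbl$group)

# Count and mean per (possibly merged) group
n <- tapply(tbl$score, tbl$g, length)
m <- tapply(tbl$score, tbl$g, mean)
res <- data.frame(g = names(n), n = as.vector(n), m = as.vector(m))
res
```

Because the mean is taken over the raw rows after relabelling, no weighted mean is needed for the merged group.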
