R: Sparse? Transforming data for co-occurrence matrix

I'm a Bio major using R to generate some visualizations showing which human proteins (uniprots) are targeted by different bacterial strains.
# sample data
human.uniprots <- c("P15311", "P0CG48", "Q8WYH8", "P42224", "Q9NXR8",
                    "P40763", "P05067", "P60709", "Q9UDW1", "Q9H160",
                    "Q9UKL0", "P26038", "P61244", "O95817", "Q09472",
                    "P15311", "P05067", "P60709", "Q9UDW1", "Q9H160")
strains <- rep(c("A", "B", "C", "C"), each = 5)
final <- cbind(human.uniprots, strains)
I'm trying to generate a co-occurrence matrix/heat map...something like
h.map <- data.frame(matrix(nrow = length(unique(human.uniprots)),
                           ncol = length(unique(strains)) + 1))
h.map.cols <- c("human_uniprots", "A", "B", "C")
colnames(h.map) <- h.map.cols
...where the columns have the strains, the rows have the proteins, and the data frame cells are populated with the counts of times that a protein interacts with a strain. So if strain A, B, and C all interact with a uniprot, they should all have a value of 3 in their cells for that uniprot row.
I've tried making a list of tuples of unique strain and human_uniprots, then searching for that tuple that matches the strain and human uniprot pair from the matrix I want to populate, and adding a "1" if there's a match...but I'm not sure how to work with tuples in R. Then I saw this: Populating a co-occurrence matrix
Which is what I want, but I'm not understanding the usage or syntax...is sparse() even a function in R?
Additionally...it would be nice to rank all the proteins by ones which interact with all strains. So all the proteins that interact with all the strains should be at the top, followed by ones that interact with 2 strains, and then 1 strain...

sparse() is a MATLAB function from the looks of it. You're describing a bipartite network represented by an incidence matrix.
human.uniprots <- c("P15311", "P0CG48", "Q8WYH8", "P42224", "Q9NXR8",
                    "P40763", "P05067", "P60709", "Q9UDW1", "Q9H160",
                    "Q9UKL0", "P26038", "P61244", "O95817", "Q09472",
                    "P15311", "P05067", "P60709", "Q9UDW1", "Q9H160")
strains <- rep(c("A", "B", "C", "D"), each = 5)
final <- cbind(human.uniprots, strains)
final_df <- as.data.frame(final)
library(igraph) # install.packages("igraph")
g <- graph_from_data_frame(final_df, directed = FALSE)
V(g)$type <- ifelse(V(g)$name %in% strains, FALSE, TRUE)
as_incidence_matrix(g)
#> P15311 P0CG48 Q8WYH8 P42224 Q9NXR8 P40763 P05067 P60709 Q9UDW1 Q9H160
#> A 1 1 1 1 1 0 0 0 0 0
#> B 0 0 0 0 0 1 1 1 1 1
#> C 0 0 0 0 0 0 0 0 0 0
#> D 1 0 0 0 0 0 1 1 1 1
#> Q9UKL0 P26038 P61244 O95817 Q09472
#> A 0 0 0 0 0
#> B 0 0 0 0 0
#> C 1 1 1 1 1
#> D 0 0 0 0 0
or.....
V(g)$type <- ifelse(V(g)$name %in% strains, TRUE, FALSE)
# swap TRUE/FALSE
as_incidence_matrix(g)
#> A B C D
#> P15311 1 0 0 1
#> P0CG48 1 0 0 0
#> Q8WYH8 1 0 0 0
#> P42224 1 0 0 0
#> Q9NXR8 1 0 0 0
#> P40763 0 1 0 0
#> P05067 0 1 0 1
#> P60709 0 1 0 1
#> Q9UDW1 0 1 0 1
#> Q9H160 0 1 0 1
#> Q9UKL0 0 0 1 0
#> P26038 0 0 1 0
#> P61244 0 0 1 0
#> O95817 0 0 1 0
#> Q09472 0 0 1 0
Created on 2018-05-25 by the reprex package (v0.2.0).
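If you also want the ranking described at the end of the question (proteins hitting every strain first), a minimal sketch, assuming the second orientation above where proteins are the rows, is to order the rows by how many strains each protein touches:
inc <- as_incidence_matrix(g)              # proteins in rows, strains in columns
n_strains <- rowSums(inc > 0)              # number of distinct strains per protein
inc[order(n_strains, decreasing = TRUE), ] # proteins that hit the most strains come first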

Using dplyr and tidyr, you can group_by(), count(), and spread() to get a per-strain count. Then replace each per-strain count with the total count for that row, using rowSums():
library(dplyr)
library(tidyr)
as.data.frame(final) %>%
  group_by(human.uniprots, strains) %>%
  count() %>%
  spread(strains, n, fill = 0) %>%
  ungroup() %>%
  mutate(total_n = rowSums(.[2:ncol(.)])) %>%
  mutate_if(is.numeric, funs(ifelse(. == 0, 0, total_n))) %>%
  select(-total_n)
# A tibble: 15 x 5
human.uniprots A B C D
<fct> <dbl> <dbl> <dbl> <dbl>
1 O95817 0. 0. 1. 0.
2 P05067 0. 2. 0. 2.
3 P0CG48 1. 0. 0. 0.
4 P15311 2. 0. 0. 2.
5 P26038 0. 0. 1. 0.
6 P40763 0. 1. 0. 0.
7 P42224 1. 0. 0. 0.
8 P60709 0. 2. 0. 2.
9 P61244 0. 0. 1. 0.
10 Q09472 0. 0. 1. 0.
11 Q8WYH8 1. 0. 0. 0.
12 Q9H160 0. 2. 0. 2.
13 Q9NXR8 1. 0. 0. 0.
14 Q9UDW1 0. 2. 0. 2.
15 Q9UKL0 0. 0. 1. 0.
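For what it's worth, spread() and funs() have since been superseded; a sketch of the same idea with the newer verbs (count(), pivot_wider(), across(); this assumes tidyr >= 1.0 and dplyr >= 1.0) could look like:
library(dplyr)
library(tidyr)
as.data.frame(final) %>%
  count(human.uniprots, strains) %>%
  pivot_wider(names_from = strains, values_from = n, values_fill = 0) %>%
  mutate(total_n = rowSums(across(where(is.numeric)))) %>%
  mutate(across(-c(human.uniprots, total_n), ~ ifelse(.x == 0, 0, total_n))) %>%
  select(-total_n)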

You can do this using table, or if you want it sparse you can use xtabs.
So for your example, you can use either
tab <- table(final[, "human.uniprots"], final[, "strains"])
tab * rowSums(tab)
Or, for a sparse version:
tab <- xtabs(~ human.uniprots + strains, data = as.data.frame(final), sparse = TRUE)
tab <- tab * Matrix::rowSums(tab)
You can then plot it using:
Matrix::image(tab, scales = list(y = list(at = 1:nrow(tab), label = rownames(tab)),
                                 x = list(at = 1:ncol(tab), label = colnames(tab))),
              ylab = "uniprots",
              xlab = "strains")
You can also rank the rows by occurrence:
r <- order(-Matrix::rowSums(tab))
# and then reorder the rows of the matrix and their labels
Matrix::image(tab[r, ],
              scales = list(y = list(at = 1:nrow(tab), label = rownames(tab)[r]),
                            x = list(at = 1:ncol(tab), label = colnames(tab))),
              ylab = "uniprots",
              xlab = "strains")

Related

All possible combinations (sequential)

I am wondering what an efficient approach to the following question would be:
Suppose I have three characters in group 1 and two characters in group 2:
group_1 = c("X", "Y", "Z")
group_2 = c("A", "B")
Clearly, the "all" possible combinations for group_1 and group_2 are given by:
group_1_combs = data.frame(X = c(0,1,0,0,1,1,0,1),
                           Y = c(0,0,1,0,1,0,1,1),
                           Z = c(0,0,0,1,0,1,1,1))
group_2_combs = data.frame(A = c(0,1,0,1),
                           B = c(0,0,1,1))
My question is the following:
(1) How do I go from group_1 to group_1_combs efficiently (given that the character vector might be large)?
(2) How do I do an "all possible" combinations of each row of group_1_combs and group_2_combs? Specifically, I want a "final" data.frame where each row of group_1_combs is "permuted" with every row of group_2_combs. This means that the final data.frame would have 8 x 4 rows (since there are 8 rows in group_1_combs and 4 rows in group_2_combs) and 5 columns (X,Y,Z,A,B).
Thanks!
You want expand.grid and merge:
Question 1:
group_1_combs <- expand.grid(setNames(rep(list(c(0, 1)), length(group_1)), group_1))
group_2_combs <- expand.grid(setNames(rep(list(c(0, 1)), length(group_2)), group_2))
Question 2:
> merge(group_1_combs, group_2_combs)
X Y Z A B
1 0 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 1 1 0 0 0
5 0 0 1 0 0
6 1 0 1 0 0
7 0 1 1 0 0
...
Or you can go directly to the merged data.frame:
group_12 <- c(group_1, group_2)
expand.grid(setNames(rep(list(c(0, 1)), length(group_12)), group_12))
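A tidyverse-flavoured sketch of the same cross join (assuming tidyr is available): crossing() combines every row of one data frame with every row of the other, just like merge() with no shared columns.
library(tidyr)
crossing(group_1_combs, group_2_combs) # all 8 x 4 = 32 row combinations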

r: how to simultaneously change multiple column names based on the individual suffix of each column name

I have received a datasheet p, autogenerated from a registry and containing 1855 columns. The autogeneration automatically adds _vX to each column name, where X corresponds to the number of follow-ups. Unfortunately, this creates ridiculously long column names.
Eg
p$MRI_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10 and p$MRI_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
correspond to the 10th and 20th MRI scan on the same patient. I.e., each column that addresses clinical parameters related to the 10th follow-up ends with v1_v2_v3_v4_v5_v6_v7_v8_v9_v10.
I seek a solution, preferably in dplyr or a function, that changes the entire _v1_v2_..._vX suffix to fuX, corresponding to the Xth follow-up.
Let's say that p looks like:
a_v2 b_v2_v3 a_v2_v3_v4 b_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 a_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
1 0 1 1 1 0
2 1 1 0 1 0
Expected output:
> p
a_fu2 b_fu3 a_fu4 b_fu20 a_fu20
1 0 1 1 1 0
2 1 1 0 1 0
Data
p <- structure(list(dia_maxrd_v2 = c(0, 1),
                    hear_sev_v2_v3 = c(1, 1),
                    reop_ind_v2_v3_v4___1 = c(1, 0),
                    neuro_def_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(1, 1),
                    symp_pre_lokal_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(0, 0)),
               class = "data.frame", row.names = c(NA, -2L))
EDIT
To complicate things, some column names end with "___1", indicating a specific sub-parameter of that clinical parameter, and this part should be preserved, e.g. _v1_v2_v3_v4___1. Hence, this is still to be considered fu4 and the ___1 part should not be omitted.
a_v2 b_v2_v3 a_v2_v3_v4___1 b_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 a_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
1 0 1 1 1 0
2 1 1 0 1 0
Expected output:
> p
a_fu2 b_fu3 a_fu4___1 b_fu20 a_fu20
1 0 1 1 1 0
2 1 1 0 1 0
EDIT
My apologies, the solution must also consider the "basic" column name specifying what parameter the column contains, e.g. post-surgical complications. It is only the _v1_v2_v3..._vX part that should be substituted with the corresponding fuX. What comes before and after the _v1_v2_v3..._vX part must be preserved.
Consider
dia_maxrd_v2 hear_sev_v2_v3 reop_ind_v2_v3_v4___1 neuro_def_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 symp_pre_lokal_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
1 0 1 1 1 0
2 1 1 0 1 0
Expected output:
> p
dia_maxrd_fu2 hear_sev_fu3 reop_ind_fu4___1 neuro_def_fu20 symp_pre_lokal_fu20
1 0 1 1 1 0
2 1 1 0 1 0
You can use gsub with two capturing groups:
names(p) <- gsub("^(.).*?(\\d+)$", "\\1_fu\\2", names(p))
p
#> a_fu2 b_fu3 a_fu4 b_fu20 a_fu20
#> 1 0 1 1 1 0
#> 2 1 1 0 1 0
EDIT
With the new requirement stipulated by the OP (some column names have different endings not in the original question), using a pipe:
p %>% setNames(gsub("^(.).*?(\\d+_*\\d*)$", "\\1_fu\\2", names(.)))
#> a_fu2 b_fu3 a_fu4___1 b_fu20 a_fu20
#> 1 0 1 1 1 0
#> 2 1 1 0 1 0
EDIT
For arbitrary starting strings, it may be easiest to gsub twice:
p %>% setNames(gsub("(\\d{1,2}_v)+", "", names(.))) %>%
setNames(gsub("_v(\\d+)", "_fu\\1", names(.)))
#> dia_maxrd_fu2 hear_sev_fu3 reop_ind_fu4___1 neuro_def_fu20
#> 1 0 1 1 1
#> 2 1 1 0 1
#> symp_pre_lokal_fu20
#> 1 0
#> 2 0
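Since the question asks for something dplyr-friendly, the same two substitutions can also be wrapped in rename_with() (a sketch, assuming dplyr >= 1.0.0):
library(dplyr)
p %>%
  rename_with(~ gsub("(\\d{1,2}_v)+", "", .x)) %>%
  rename_with(~ gsub("_v(\\d+)", "_fu\\1", .x))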

Dummy code categorical / ordinal variables in the tidyverse

Let's say I have a tibble.
library(tidyverse)
tib <- as.tibble(list(record = c(1:10),
                      gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)),
                      like_product = as.factor(sample(1:5, 10, replace = TRUE))))
tib
# A tibble: 10 x 3
record gender like_product
<int> <fctr> <fctr>
1 1 F 2
2 2 M 1
3 3 M 2
4 4 F 3
5 5 F 4
6 6 M 2
7 7 F 4
8 8 M 4
9 9 F 4
10 10 M 5
I would like to dummy code my data with 1's and 0's so that the data looks more/less like this.
# A tibble: 10 x 8
record gender_M gender_F like_product_1 like_product_2 like_product_3 like_product_4 like_product_5
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 0 1 0 0
2 2 0 1 0 0 0 0 0
3 3 0 1 0 1 0 0 0
4 4 0 1 1 0 0 0 0
5 5 1 0 0 0 0 0 0
6 6 0 1 0 0 0 0 0
7 7 0 1 0 0 0 0 0
8 8 0 1 0 1 0 0 0
9 9 1 0 0 0 0 0 0
10 10 1 0 0 0 0 0 1
My workflow would require that I know a range of variables to dummy code (i.e. gender:like_product), but I don't want to identify EVERY variable by hand (there could be hundreds of variables). Likewise, I don't want to have to identify every level/unique value of every variable to dummy code. I'm ultimately looking for a tidyverse solution.
I know of several ways of doing this, but none of them fits perfectly within the tidyverse. I know I could use mutate...
tib %>%
mutate(gender_M = ifelse(gender == "M", 1, 0),
gender_F = ifelse(gender == "F", 1, 0),
like_product_1 = ifelse(like_product == 1, 1, 0),
like_product_2 = ifelse(like_product == 2, 1, 0),
like_product_3 = ifelse(like_product == 3, 1, 0),
like_product_4 = ifelse(like_product == 4, 1, 0),
like_product_5 = ifelse(like_product == 5, 1, 0)) %>%
select(-gender, -like_product)
But this would break my workflow rules of needing to specify every dummy coded output.
I've done this in the past with model.matrix, from the stats package.
model.matrix(~ gender + like_product, tib)
Easy and straightforward, but I want a solution in the tidyverse. EDIT: Reason being, I still have to specify every variable, and being able to use select helpers to specify something like gender:like_product would be much preferred.
I think the solution is in purrr
library(purrr)
dummy_code <- function(x) {
  lvls <- levels(x)
  sapply(lvls, function(y) as.integer(x == y)) %>% as.tibble
}
tib %>%
  map_at(c("gender", "like_product"), dummy_code)
$record
[1] 1 2 3 4 5 6 7 8 9 10
$gender
# A tibble: 10 x 2
F M
<int> <int>
1 1 0
2 0 1
3 0 1
4 1 0
5 1 0
6 0 1
7 1 0
8 0 1
9 1 0
10 0 1
$like_product
# A tibble: 10 x 5
`1` `2` `3` `4` `5`
<int> <int> <int> <int> <int>
1 0 1 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 0 1 0
6 0 1 0 0 0
7 0 0 0 1 0
8 0 0 0 1 0
9 0 0 0 1 0
10 0 0 0 0 1
This attempt produces a list of tibbles, with the exception of the excluded variable record, and I've been unsuccessful at combining them all back into a single tibble. Additionally, I still have to specify every column, and overall it seems clunky.
Any better ideas? Thanks!!
An alternative to model.matrix is the recipes package. It is still a work in progress and is not yet included in the tidyverse; at some point it may be.
I will leave it up to you to read up on recipes, but in the step_dummy() step you can use special selectors from the tidyselect package (installed with recipes), like the starts_with() selectors you already use in dplyr. I created a little example to show the steps.
Example code below.
Whether this is handier I will leave up to you, as model.matrix has already been pointed out in the comments. The function bake() uses model.matrix to create the dummies; the difference is mostly in the column names and, of course, in the internal checks done in the underlying code of the separate steps.
library(recipes)
library(tibble)
tib <- as.tibble(list(record = c(1:10),
                      gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)),
                      like_product = as.factor(sample(1:5, 10, replace = TRUE))))
dum <- tib %>%
  recipe(~ .) %>%
  step_dummy(gender, like_product) %>%
  prep(training = tib) %>%
  bake(newdata = tib)
dum
# A tibble: 10 x 6
record gender_M like_product_X2 like_product_X3 like_product_X4 like_product_X5
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1. 1. 0. 0. 0.
2 2 1. 1. 0. 0. 0.
3 3 1. 1. 0. 0. 0.
4 4 0. 0. 1. 0. 0.
5 5 0. 0. 0. 0. 0.
6 6 0. 1. 0. 0. 0.
7 7 0. 1. 0. 0. 0.
8 8 0. 0. 0. 1. 0.
9 9 0. 0. 0. 0. 1.
10 10 1. 0. 0. 0. 0.
In case you don't want to load any additional packages, you could also use pivot_wider statements like this:
tib %>%
  mutate(dummy = 1) %>%
  pivot_wider(names_from = gender, values_from = dummy, values_fill = 0,
              names_glue = "gender_{gender}") %>%
  mutate(dummy = 1) %>%
  pivot_wider(names_from = like_product, values_from = dummy, values_fill = 0,
              names_glue = "like_product_{like_product}")
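And if you want the range selection the question asks for (gender:like_product) without naming any columns or levels individually, a pivot_longer + pivot_wider sketch (my addition, assuming tidyr >= 1.0 and dplyr >= 1.0) handles any number of factor columns:
library(dplyr)
library(tidyr)
tib %>%
  mutate(across(gender:like_product, as.character)) %>% # so both factors can share one "level" column
  pivot_longer(gender:like_product, names_to = "variable", values_to = "level") %>%
  mutate(dummy = 1) %>%
  pivot_wider(names_from = c(variable, level), values_from = dummy, values_fill = 0)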

propagate changes down a column

I would like to use dplyr to go through a dataframe row by row, and if A == 0, then set B to the value of B in the previous row, otherwise leave it unchanged. However, I want "the value of B in the previous row" to refer to the previous row during the computation, not before the computation began, because the value may have changed -- in other words, I'd like changes to propagate downwards. For example, with the following data:
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
A B
1 0
0 1
0 1
0 1
1 1
I would like the result of the computation to be:
result <- data.frame(A=c(1,0,0,0,1),B=c(0,0,0,0,1))
A B
1 0
0 0
0 0
0 0
1 1
If I use something like result <- dat %>% mutate(B = ifelse(A == 0, lag(B), B)) then changes won't propagate downwards: result$B will be equal to c(0,0,1,1,1), not c(0,0,0,0,1).
More generally, how do you use dplyr::mutate to create a column that depends on itself (as it updates during the computation, not a copy of what it was before)?
Seems like you want a "last observation carried forward" approach. The most common R implementation is zoo::na.locf which fills in NA values with the last observation. All we need to do to use it in this case is to first set to NA all the B values that we want to fill in:
mutate(dat,
       B = ifelse(A == 0, NA, B),
       B = zoo::na.locf(B))
# A B
# 1 1 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 1 1
As to my comment, do note that the only thing mutate does is add the column to the data frame. We could do it just as well without mutate:
result = dat
result$B = with(result, ifelse(A == 0, NA, B))
result$B = zoo::na.locf(result$B)
Whether you use mutate or [ or $ or any other method to access/add the columns is tangential to the problem.
We could use fill from tidyr after changing to NA the 'B' values that correspond to 0 in 'A':
library(dplyr)
library(tidyr)
dat %>%
  mutate(B = NA^(!A) * B) %>% # NA^TRUE is NA and NA^FALSE is 1, so B becomes NA wherever A == 0
  fill(B)
# A B
#1 1 0
#2 0 0
#3 0 0
#4 0 0
#5 1 1
NOTE: By default, the .direction (argument in fill) is "down", but it can also take "up" i.e. fill(B, .direction="up")
Here's a solution using grouping and rleid (run-length encoding id) from data.table. I think it should be faster than the zoo solution, since zoo relies on doing multiple revs and a cumsum, whereas rleid is blazing fast.
Basically, we only want the last value of the previous group, so we create a grouping variable based on the diff vector of the rleid and add that to the rleid if A == 1. Then we group and take the first B value of the group for every case where A == 0:
library(dplyr)
library(data.table)
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
dat <- dat %>%
  mutate(grp = data.table::rleid(A),
         grp = ifelse(A == 1, grp + c(diff(grp), 0), grp)) %>%
  group_by(grp) %>%
  mutate(B = ifelse(A == 0, B[1], B)) # EDIT: Always carry forward B on A == 0
dat
Source: local data frame [5 x 3]
Groups: grp [2]
A B grp
<dbl> <dbl> <dbl>
1 1 0 2
2 0 0 2
3 0 0 2
4 0 0 2
5 1 1 3
EDIT: Here's an example with a longer dataset so we can really see the behavior. (Also, I switched the condition: B should be carried forward whenever A == 0, not only when no A equals 1.)
set.seed(30)
dat <- data.frame(A = sample(0:1, 15, replace = TRUE),
                  B = sample(0:1, 15, replace = TRUE))
> dat
A B
1 0 1
2 0 0
3 0 1
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 1 0
10 0 0
11 0 0
12 0 0
13 1 0
14 1 1
15 0 0
Result:
Source: local data frame [15 x 3]
Groups: grp [5]
A B grp
<int> <int> <dbl>
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 1 1
7 1 1 3
8 0 1 3
9 1 0 5
10 0 0 5
11 0 0 5
12 0 0 5
13 1 0 6
14 1 1 7
15 0 1 7
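For the general question of a mutate column that depends on its own already-updated values, purrr::accumulate2() makes the row-by-row recursion explicit. A sketch (my addition, not from the answers above):
library(purrr)
dat <- data.frame(A = c(1, 0, 0, 0, 1), B = c(0, 1, 1, 1, 1))
dat$B <- unlist(accumulate2(dat$A[-1], dat$B[-1],
                            function(prev, a, b) if (a == 0) prev else b,
                            .init = dat$B[1]))
dat$B
# [1] 0 0 0 0 1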

How to create a variable that indicates agreement from two dichotomous variables

I'd like to create a new variable that contains 1s and 0s. A 1 represents agreement between the raters (both raters 1 or both raters 0) and a 0 represents disagreement.
rater_A <- c(1,0,1,1,1,0,0,1,0,0)
rater_B <- c(1,1,0,0,1,1,0,1,0,0)
df <- cbind(rater_A, rater_B)
The new variable would be like the following vector I created manually:
df$agreement <- c(1,0,0,0,1,0,1,1,1,1)
Maybe there's a package or a function I don't know. Any help would be great.
You could create df as a data.frame (instead of using cbind) and use within and ifelse:
rater_A <- c(1,0,1,1,1,0,0,1,0,0)
rater_B <- c(1,1,0,0,1,1,0,1,0,0)
df <- data.frame(rater_A, rater_B)
##
df <- within(df,
             agreement <- ifelse(rater_A == rater_B, 1, 0))
##
> df
rater_A rater_B agreement
1 1 1 1
2 0 1 0
3 1 0 0
4 1 0 0
5 1 1 1
6 0 1 0
7 0 0 1
8 1 1 1
9 0 0 1
10 0 0 1
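A shorter equivalent (a sketch, not part of the answer above): build the data.frame first, then coerce the element-wise comparison straight to 0/1.
df <- data.frame(rater_A, rater_B)
df$agreement <- as.integer(df$rater_A == df$rater_B) # TRUE/FALSE becomes 1/0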
