How do I get all pairs of values in a variable based on shared values in a different variable - r

My problem is perhaps a little difficult to formulate, hence I haven't found any solutions yet, but I'll try:
I want to find all pairs of values in a variable based on whether they share any value in another variable. Maybe the following example can illustrate it more clearly.
In a 2 variable data frame like this:
data.frame(scaffold = c("A", "A", "B", "B", "B", "C", "C", "D"),
geneID = c("162", "276", "64", "276", "281", "64", "162", "162"))
#> scaffold geneID
#> A 162
#> A 276
#> B 64
#> B 276
#> B 281
#> C 64
#> C 162
#> D 162
... I want to find all pairs of "scaffolds" A, B, C, and D that share any of the "geneID"s 64, 162, 276, and 281, so that the above would become a data frame with all pairs of scaffolds in 2 new columns like this:
data.frame(V1 = c("A", "A", "A", "B", "C"), V2 =c("B", "C", "D", "C", "D"))
#> V1 V2
#> A B
#> A C
#> A D
#> B C
#> C D
Obviously A and B is the same pair as B and A, so these should be removed somehow, but that's probably easy. Afterwards, this data frame needs to be combined with a data frame containing x/y coordinates of the scaffolds for drawing a line between the pairs on top of a plot with the scaffolds.
I do have a working for-loop to do the job, but I need to replace that with a much faster alternative. I'll spare you the code, it's complicated and doesn't always do it right. Running it on just 20 scaffolds can take seconds, but I need to do it on thousands. I was hoping a series of dplyr or data.table functions could do the job as they probably are as fast as it gets, but I haven't been able to get my head around how.
I hope you can help me, or perhaps something similar is already in another thread I just wasn't able to find.
A performance comparison of the two solutions by @Florian and @Roman can be found at http://rpubs.com/kasperskytte/SO_question_48407650

Here is a possible solution. Note that I modified your example df so A and C share both 162 and 64, and we have to make sure that this group does not occur twice in the output.
df = data.frame(scaffold = c("A", "A", "B", "B", "B", "C", "C", "D","A"),
geneID = c("162", "276", "64", "276", "281", "64", "162", "162","64"),stringsAsFactors = FALSE)
y = split(df$scaffold,df$geneID)
unique(do.call(rbind,(lapply(y[which(sapply(y, length) > 1)],function(x){t(combn(sort(x),2))}))))
Output:
[,1] [,2]
[1,] "A" "C"
[2,] "A" "D"
[3,] "C" "D"
[4,] "A" "B"
[5,] "B" "C"
How it works: First we split the data into groups based on df$geneID, the result we call y. Then we lapply over every element of y that has more than 1 element in it a function that gives us all n possible combinations of 2 as a nx2 matrix. By calling sort() on x inside this function we make removing duplicates easier later on, because we then rbind this list into a large matrix, and call unique() on the result to remove duplicates.
Hope this helps!
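A possible alternative, sketched here in base R only, is a self-join on geneID, which returns the data frame with V1 and V2 columns that the question asks for. The names joined and pairs are illustrative, and the sketch assumes scaffold names compare sensibly with `<`.

```r
df <- data.frame(
  scaffold = c("A", "A", "B", "B", "B", "C", "C", "D"),
  geneID   = c("162", "276", "64", "276", "281", "64", "162", "162"),
  stringsAsFactors = FALSE
)

# Self-join on geneID: each row pairs two scaffolds that share a gene.
joined <- merge(df, df, by = "geneID", suffixes = c("1", "2"))

# Keeping scaffold1 < scaffold2 drops self-pairs and mirrored pairs (B-A vs A-B);
# unique() then drops pairs that share more than one gene.
pairs <- unique(joined[joined$scaffold1 < joined$scaffold2,
                       c("scaffold1", "scaffold2")])
pairs <- pairs[order(pairs$scaffold1, pairs$scaffold2), ]
names(pairs) <- c("V1", "V2")
rownames(pairs) <- NULL
pairs
```

The self-join builds every within-gene pairing before filtering, so on genes shared by very many scaffolds it may use more memory than the combn approach.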

See the comments in the code.
xy <- data.frame(scaffold = c("A", "A", "B", "B", "B", "C", "C", "D"),
geneID = c("162", "276", "64", "276", "281", "64", "162", "162"))
# split by gene
xy1 <- split(xy, f = xy$geneID)
# find all combinations
out <- sapply(xy1, FUN = function(x) {
x$scaffold <- as.character(x$scaffold)
# add NA so that we can remove any cases that have a single scaffold
tryCatch(t(combn(x$scaffold, 2)), error = function(e) NA)
}, simplify = FALSE)
# remove NAs and some fiddling to get the desired format
out <- out[!is.na(out)]
out <- do.call(rbind, out)
# sort the data
out <- t(apply(out, MARGIN = 1, FUN = function(x) sort(x)))
# remove duplicates (drop = FALSE keeps a matrix even if only one pair remains)
out <- out[!duplicated(out), , drop = FALSE]
out
[,1] [,2]
[1,] "A" "C"
[2,] "A" "D"
[3,] "C" "D"
[4,] "A" "B"
[5,] "B" "C"
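The follow-up step mentioned in the question, attaching x/y coordinates to each pair so that lines can be drawn, could look roughly like the sketch below. The coords data frame and its values are made up purely for illustration, and segments_df is a hypothetical name.

```r
# Pairs as produced above, stored as a data frame.
pairs <- data.frame(V1 = c("A", "A", "A", "B", "C"),
                    V2 = c("B", "C", "D", "C", "D"),
                    stringsAsFactors = FALSE)

# Hypothetical scaffold coordinates (invented for this example only).
coords <- data.frame(scaffold = c("A", "B", "C", "D"),
                     x = c(0, 1, 1, 0),
                     y = c(0, 0, 1, 1),
                     stringsAsFactors = FALSE)

# Look up the start and end point of every pair by scaffold name.
i1 <- match(pairs$V1, coords$scaffold)
i2 <- match(pairs$V2, coords$scaffold)
segments_df <- data.frame(pairs,
                          x = coords$x[i1], y = coords$y[i1],
                          xend = coords$x[i2], yend = coords$y[i2])
segments_df
```

Each row then holds one line segment, which is the shape that functions such as graphics::segments expect.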

Related

Save unique values of variable for each combination of two variables in a dataset

I have a (large) dataset with three variables. For each combination of sub1 and sub2, I would like to save all unique IVs in a separate vector or dataset, ignoring id, and name it using the variables "sub1.and.sub2.IV". As my dataset is quite large, I would like to avoid using which and automatically extract all combinations.
id sub1 sub2 IV
<chr> <chr> <chr> <chr>
1 3 a a p
2 3 a a f
3 6 a b z
4 6 a b e
5 7 a c b
6 7 a c b
In the end, I would have three vector or datasets:
> a.and.a.IV
[1] "p" "f"
> a.and.b.IV
[1] "z" "e"
> a.and.c.IV
[1] "b"
MRE example:
structure(list(id = c("3", "3", "6", "6", "7", "7"), sub1 = c("a",
"a", "a", "a", "a", "a"), sub2 = c("a", "a", "b", "b", "c", "c"
), IV = c("p", "f", "z", "e", "b", "b")), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
Maybe split
> split(df$IV, df[c("sub1","sub2")])
$a.a
[1] "p" "f"
$a.b
[1] "z" "e"
$a.c
[1] "b" "b"
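The split output above still contains the duplicated "b" in the a.c group. A minimal sketch of one way to get unique values and the requested names, assuming the data are rebuilt as a plain data frame and using the illustrative name iv_by_group:

```r
df <- data.frame(
  id   = c("3", "3", "6", "6", "7", "7"),
  sub1 = c("a", "a", "a", "a", "a", "a"),
  sub2 = c("a", "a", "b", "b", "c", "c"),
  IV   = c("p", "f", "z", "e", "b", "b"),
  stringsAsFactors = FALSE
)

# Split IV by every observed sub1/sub2 combination; sep controls the group names.
groups <- split(df$IV, df[c("sub1", "sub2")], drop = TRUE, sep = ".and.")

# Keep only the unique values within each group and add the ".IV" suffix.
iv_by_group <- lapply(groups, unique)
names(iv_by_group) <- paste0(names(iv_by_group), ".IV")
iv_by_group
```

Keeping the results in one named list tends to be easier to work with than separate objects in the workspace.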
One possibility could be:
a.and.a.IV<-unique(df[which(df$sub1 == "a" & df$sub2=="a"),]$IV)
a.and.b.IV<-unique(df[which(df$sub1 == "a" & df$sub2=="b"),]$IV)
a.and.c.IV<-unique(df[which(df$sub1 == "a" & df$sub2=="c"),]$IV)
> a.and.a.IV
[1] "p" "f"
> a.and.b.IV
[1] "z" "e"
> a.and.c.IV
[1] "b"
I used @ThomasIsCoding's comment to search for more solutions. I have found 3 solutions to split the dataframe into a list of tibbles and 1 solution using a loop to split a list into dataframes. The for loop stays the same for every solution:
Solution 1:
Using a custom-made function by @romainfrancois to split and name the data.frames with the corresponding combinations of sub1 and sub2.
library(dplyr, warn.conflicts = FALSE)
named_group_split <- function(.tbl, ...) {
grouped <- group_by(.tbl, ...)
names <- rlang::eval_bare(rlang::expr(paste(!!!group_keys(grouped), sep = " / ")))
grouped %>%
group_split() %>%
rlang::set_names(names)
}
df_split1 <- df %>%
named_group_split(sub1, sub2) %>%
unique()
for(i in 1:length(df_split1)) {
assign(paste0(names(df_split1[i])), as.data.frame(df_split1[[i]]))
}
Solution 2:
Using dplyr::group_split to split the dataset into a list with all the original variables and their respective names. Unfortunately, this solution is not able to name the data.frames. Solution found here.
df_split2 <- df %>%
group_split(sub1, sub2)
# group_split() returns an unnamed list, so the object names are built from the index
for(i in seq_along(df_split2)) {
assign(paste0("df_split2_", i), as.data.frame(df_split2[[i]]))
}
Solution 3:
Using base::split to split the dataset into a list that holds just the IVs of each group, followed by the same for loop.
df_split3 <- split(df$IV, df[c("sub1","sub2")])
for(i in 1:length(df_split3)) {
assign(paste0(names(df_split3[i])), as.data.frame(df_split3[[i]]))
}
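As an alternative to the for loop with assign, base R's list2env can create one object per list element in a single call. The sketch below writes into a fresh environment (target, an illustrative name) rather than the global one, and the list is typed in by hand to keep the example self-contained.

```r
# A named list like the one the split-based solutions produce.
iv_list <- list(a.and.a.IV = c("p", "f"),
                a.and.b.IV = c("z", "e"),
                a.and.c.IV = "b")

# Create one object per list element inside a dedicated environment.
target <- new.env()
list2env(iv_list, envir = target)

ls(target)
get("a.and.b.IV", envir = target)
```

Passing envir = globalenv() instead would place the objects in the workspace, as the loops above do.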

coercing data frame rows to matrix in R

I'm unsure of better terminology for my question, so forgive me for the long winded approach.
I'm trying to use two identifying variables, id and duration to fill up the rows of a matrix where the columns denote half hour periods (so there should be 6 for a 3 hour period) and the rows are a given person's activities in those time periods. If the activities do not fill up the matrix, a dummy variable should be used instead. I've written an example below which should help clarify.
Example:
data has 3 columns, id, activity, and duration. id and duration should serve as identifying variables and activity should serve as the variable in the matrix.
data <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3),
activity = c("a", "b", "c", "d", "e", "b", "b", "a"),
duration = c(60, 30, 90, 45, 30, 15, 60, 100))
For the example, I used a 3-hour duration, hence the 6 columns in the matrix. The matrix below is the wanted output. There are DUMMY instances where the total duration of a person's activities does not sum to the duration of the matrix. In this example, the total duration is 180 (3 hours * 60), so person 2, whose activity durations sum to 75 (45 + 30), will get the DUMMY variable after the activities for the first 75 minutes are done.
mat <- t(matrix(c("a", "a", "b", "c", "c", "c",
"d", "d", "e", "DUMMY", "DUMMY", "DUMMY",
"b", "b", "b", "a", "a", "a"),
nrow = 6, ncol = 3))
colnames(mat) <- c("0", "30", "60", "90", "120", "150")
I'm unsure how to fill the matrix mat above with the data above. Any help would be appreciated. Please let me know if the question needs to be made clearer.
EDIT: edited output
EDIT2: Added matrix column names
EDIT3: Added info on dummy variable
EDIT4: Would it be easier if I added start and end time instead of duration?
An approach would be to locate the activities for every 30-min interval by "id":
ints = seq(0, by = 30, length.out = 6)
data2 = do.call(rbind,
lapply(split(data, data$id),
function(d) {
dur = d$duration
i = findInterval(ints, c(cumsum(c(0, dur[-length(dur)])), sum(dur)))
data.frame(id = d$id[1], ints = ints, activity = d$activity[i])
}))
And on the new "data.frame":
tapply(as.character(data2$activity), data2[c("id", "ints")], identity)
# ints
#id 0 30 60 90 120 150
# 1 "a" "a" "b" "c" "c" "c"
# 2 "d" "d" "e" NA NA NA
# 3 "b" "b" "b" "a" "a" "a"
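The result above leaves NA where the question asks for "DUMMY". A minimal variation on the same findInterval idea that builds the character matrix directly and fills the gaps could look like this (rows is an illustrative name; activity is assumed to be character, not factor):

```r
data <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3),
                   activity = c("a", "b", "c", "d", "e", "b", "b", "a"),
                   duration = c(60, 30, 90, 45, 30, 15, 60, 100),
                   stringsAsFactors = FALSE)

ints <- seq(0, by = 30, length.out = 6)  # start of each half-hour slot

rows <- lapply(split(data, data$id), function(d) {
  # Activity k covers [start_k, end_k); slots past the last end index beyond
  # the activity vector and therefore come back as NA.
  i <- findInterval(ints, c(0, cumsum(d$duration)))
  act <- d$activity[i]
  act[is.na(act)] <- "DUMMY"
  act
})

mat <- do.call(rbind, rows)
colnames(mat) <- as.character(ints)
mat
```

Like the answer above, this assigns to each slot the activity running at the slot's start time.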

Getting the set of nodes connected till the main parent node in R

I have a data set which has 6 rows and 3 columns. The first column represents children, whereas second column onward immediate parents of the corresponding child is allocated.
Above, one can see that "a" and "b" don't have any parents, whereas "c" has only one parent, "a". "d" has parents "b" and "c", and so on.
What I need is: if given the input as the child, it should give me all the ancestors of that child including child.
e.g. "f" is the child I chose then desired output should be :
{"f", "d", "b"}, {"f", "d", "c", "a"}, {"f", "e", "b"}, {"f", "e", "c", "a"}.
Note: Order of the nodes does not matter.
Thank you so much in advance.
Create sample data. Note use of stringsAsFactors here, I'm assuming your data are characters and not factors:
> d <- data.frame(list("c" = c("a", "b", "c", "d", "e", "f"), "p1" = c(NA, NA, "a", "b", "b", "d"), "p2" = c(NA, NA, NA, "c", "c", "e")),stringsAsFactors=FALSE)
First tidy it up - make the data long, not wide, with each row being a child-parent pair:
> pairs = subset(reshape2::melt(d,id.vars="c",value.name="parent"), !is.na(parent))[,c("c","parent")]
> pairs
c parent
3 c a
4 d b
5 e b
6 f d
10 d c
11 e c
12 f e
Now we can make a graph of the parent-child relationships with the igraph package. This is a directed graph, so each child-parent pair is plotted as an arrow:
> library(igraph)
> g = graph.data.frame(pairs)
> plot(g)
Now I'm not sure exactly what you want, but igraph functions can do anything... So for example, here's a search of the graph starting at d from which we can get various bits of information:
> d_search = bfs(g,"d",neimode="out", unreachable=FALSE, order=TRUE, dist=TRUE)
First, which nodes are ancestors of d? It's the ones that can be reached from d via the exhaustive (here, breadth-first) search:
> d_search$order
+ 6/6 vertices, named:
[1] d c b a <NA> <NA>
Note it includes d as well. Trivial enough to drop from this list. That gives you the set of ancestors of d which is what you asked for.
What is the relationship of those nodes to d?
> d_search$dist
c d e f a b
1 0 NaN NaN 2 1
We see that e and f are unreachable, so are not ancestors of d. c and b are direct parents, and a is a grandparent. You can check this from the graph.
You can also get all the paths from any child upwards using functions like shortest_paths and so on.
Here is a recursive function that makes all possible family lines:
d <- data.frame(list("c" = c("a", "b", "c", "d", "e", "f"),
"p1" = c(NA, NA, "a", "b", "b", "d"),
"p2" = c(NA, NA, NA, "c", "c", "e")), stringsAsFactors = FALSE)
# Make data more convenient for the task.
library(reshape2)
dp <- melt(d, id = c("c"), value.name = "p")
# Recursive function builds ancestor vectors.
getAncestors <- function(data, x, ancestors = list(x)) {
parents <- subset(data, c %in% x & !is.na(p), select = c("c", "p"))
if(nrow(parents) == 0) {
return(ancestors)
}
x.c <- parents$c
p.c <- parents$p
ancestors <- lapply(ancestors, function(x) {
if (is.null(x)) return(NULL)
# Here we want to repeat ancestor chain for each new parent.
res <- list()
matches <- 0
for (i in 1:nrow(parents)) {
if (tail(x, 1) == parents[i, ]$c){
res[[i]] <- c(x, parents[i, ]$p)
matches <- matches + 1
}
}
if (matches == 0) { # There are no more parents.
res[[1]] <- x
}
return (res)
})
# remove one level of lists.
ancestors <- unlist(ancestors, recursive = F)
res <- getAncestors(data, p.c, ancestors)
return (res)
}
# Demo of results for the lowest level.
res <- getAncestors(dp, "f")
res
#[[1]]
#[1] "f" "d" "b"
#[[2]]
#[1] "f" "d" "c" "a"
#[[3]]
#[1] "f" "e" "b"
#[[4]]
#[1] "f" "e" "c" "a"
You will need to implement this in a similar way through recursion or with a while loop.
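For comparison, a more compact recursive sketch in base R is shown below. It assumes the parent relationships are stored as a named list (parents, an illustrative structure typed in by hand from the example data) and that the graph has no cycles.

```r
# Each element lists the immediate parents of the node it is named after.
parents <- list(a = character(0), b = character(0), c = "a",
                d = c("b", "c"), e = c("b", "c"), f = c("d", "e"))

ancestor_paths <- function(node, parents) {
  ps <- parents[[node]]
  # A node without parents ends the line.
  if (length(ps) == 0) return(list(node))
  out <- list()
  for (p in ps) {
    # Prepend the current node to every line that starts at this parent.
    for (line in ancestor_paths(p, parents)) {
      out[[length(out) + 1]] <- c(node, line)
    }
  }
  out
}

ancestor_paths("f", parents)
```

For "f" this yields the same four family lines as the output above.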

How to make a unique combination of vectors in R?

I have a couple of vectors consisting of three names. I want to get all unique pairwise combinations of these vectors. As an example, with two of those vectors, I can get the non-unique combinations with
sham1 <- c('a', 'b')
sham2 <- c('d', 'e')
shams <- list(sham1, sham2)
combinations <- apply(expand.grid(shams, shams),1, unname)
which gives the following combinations
> dput(combinations)
list(
list(c("a", "b"), c("a", "b")),
list(c("d", "e"), c("a", "b")),
list(c("a", "b"), c("d", "e")),
list(c("d", "e"), c("d", "e"))
)
I tried using unique(combinations), but this gives the same result. What I would like to get is
> dput(combinations)
list(
list(c("a", "b"), c("a", "b")),
list(c("d", "e"), c("a", "b")),
list(c("d", "e"), c("d", "e"))
)
Because there is already the combination list(c("d", "e"), c("a", "b")), I don't need the combination list(c("a", "b"), c("d", "e"))
How can I get only the unique combination of vectors?
s <- seq(length(shams))
# get unique pairs of shams indexes, including each index with itself.
uniq.pairs <- unique(as.data.frame(t(apply(expand.grid(s, s), 1, sort))))
# V1 V2
# 1 1 1
# 2 1 2
# 4 2 2
result <- apply(uniq.pairs, 1, function(x) shams[x])
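The same index-pair idea can be sketched with upper.tri, which picks each unordered pair of positions once, including each position with itself. The names idx and result are illustrative.

```r
sham1 <- c("a", "b")
sham2 <- c("d", "e")
shams <- list(sham1, sham2)
n <- length(shams)

# Positions on or above the diagonal of an n x n grid: (1,1), (1,2), (2,2).
idx <- which(upper.tri(matrix(NA, n, n), diag = TRUE), arr.ind = TRUE)

# Turn every index pair into a list holding the two corresponding vectors.
result <- lapply(seq_len(nrow(idx)), function(k) shams[idx[k, ]])
str(result)
```

The mixed pair comes out as (sham1, sham2) rather than (sham2, sham1) as in the question; since the two count as the same combination, only one of them is kept.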
I am also not exactly sure what you want but this function might help:
combn
Here is a simple example:
> combn(letters[1:4], 2)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "a" "a" "a" "b" "b" "c"
[2,] "b" "c" "d" "c" "d" "d"
I don't think this is what you want, but if you clarify perhaps I can edit to get you what you want:
> sham1<-c('a','b')
> sham2<-c('d','e')
> combn(c(sham1,sham2),2)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "a" "a" "a" "b" "b" "d"
[2,] "b" "d" "e" "d" "e" "e"
combn gets you the combinations (so, unique), but not the repeated ones. So combine that with something that gives you the repeated ones and you have it:
c(combn(shams, 2, simplify=FALSE),
lapply(shams, function(s) list(s,s)))
No idea what your examples are saying. If you want unique pairwise combinations:
strsplit(levels(interaction(sham1, sham2, sep="*")), "\\*")
I don't understand what you want. And it seems that you changed the desired output from your other question.
You want your two lists nested in a list inside another list?
Isn't it simpler to nest just once, like you already have with shams?
dput(shams)
list(c("Sham1.r1", "Sham1.r2", "Sham1.r3"), c("Sham2.r1", "Sham2.r2",
"Sham2.r3"))
To create such a nested list you could use that:
combinations <- list(shams, "")
dput(combinations)
list(list(c("Sham1.r1", "Sham1.r2", "Sham1.r3"), c("Sham2.r1", "Sham2.r2",
"Sham2.r3")), "")
Although it is not exactly what you said...

How to calculate how many times vector appears in a list? in R

I have a list of 10,000 vectors, and each vector might have different elements and different lengths. I would like to know how many unique vectors I have and how often each unique vector appears in the list.
I guess the way to go is the function "unique", but I don't know how I could use it to also get the number of times each vector is repeated.
So what I would like to get is something like this:
"a" "b" "c" "d" 301
"a" 277
"b" "c" 49
where the letters are the contents of each unique vector and the numbers are how often each one is repeated.
I would really appreciate any possible help on this.
thank you very much in advance.
Tina.
Maybe you should look at table:
Some sample data:
myList <- list(A = c("A", "B"),
B = c("A", "B"),
C = c("B", "A"),
D = c("A", "B", "B", "C"),
E = c("A", "B", "B", "C"),
F = c("A", "C", "B", "B"))
Paste your vectors together and tabulate them.
table(sapply(myList, paste, collapse = ","))
#
# A,B A,B,B,C A,C,B,B B,A
# 2 2 1 1
You don't specify whether order matters (that is, whether A, B is the same as B, A). If it doesn't, you can sort each vector first so that reorderings are counted together:
table(sapply(myList, function(x) paste(sort(x), collapse = ",")))
#
# A,B A,B,B,C
# 3 3
Wrap this in data.frame for a vertical output instead of horizontal, which might be easier to read.
Also, do be sure to read How to make a great R reproducible example? as already suggested to you.
As it is, I'm just guessing at what you're trying to do.
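The data.frame wrapping suggested above might look like the following sketch, which treats reordered vectors as identical. The column names vector and n are illustrative choices.

```r
myList <- list(A = c("A", "B"),
               B = c("A", "B"),
               C = c("B", "A"),
               D = c("A", "B", "B", "C"),
               E = c("A", "B", "B", "C"),
               F = c("A", "C", "B", "B"))

# One key string per vector; sorting makes the key independent of element order.
keys <- sapply(myList, function(x) paste(sort(x), collapse = ","))

# Tabulate the keys and reshape the counts into a two-column data frame.
counts <- as.data.frame(table(keys), stringsAsFactors = FALSE)
names(counts) <- c("vector", "n")
counts <- counts[order(-counts$n), ]
counts
```

Note that pasting with "," assumes no element itself contains a comma; otherwise a different separator would be needed.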
