I am trying to find the matching elements between all available combinations of vectors.
For example, I have 4 vectors:
a<-c(1,2,3,4)
b<-c(1,2,6,7)
c<-c(1,2,8,9)
d<-c(3,6,8,2)
The intended output should be able to tell me:
similarity between a & b: 1, 2
similarity between a & c: 1, 2
similarity between a & d: 2, 3
similarity between b & c: 1, 2
similarity between b & d: 2, 6
similarity between c & d: 2, 8
similarity between a & b & c: 1, 2
similarity between b & c & d: 2
similarity between a & c & d: 2
similarity between a & b & d: 2
similarity between a & b & c & d: 2
Does R have a function that does such comparison/ matching?
For simplicity, the number of vectors is set at 4 for now. I am in fact dealing with hundreds of vectors and would like to match/intersect/compare between all possible combinations of vectors. For example, with 4 vectors there are 4C2 + 4C3 + 4C4 = 11 possible combinations. With 100 vectors, there are 100C100 + 100C99 + 100C98 + ... + 100C2 possible combinations.
thanks in advance
intersect seems to do what you want. It only handles a pair of vectors at a time, though, e.g.:
intersect(a, b) # 1 2
intersect(b, intersect(c, d)) # 2
If you want a shorthand to intersect more than 2, try Reduce (?Reduce)
# intersection of a/b/c/d
Reduce(intersect, list(a, b, c, d), a)
# intersection of b/c/d
Reduce(intersect, list(b, c, d), b)
Reduce successively applies intersect to the result of the previous intersect call and the next element of the list. Here it starts with intersect(b, b), because I set the init argument to one of the vectors being intersected (the intersection of a set with itself is the set).
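To see what Reduce is doing step by step, here is a small sketch using accumulate = TRUE, which returns every intermediate result. The vectors are the ones from the question, collected in a list that I have called vecs (my name, not the asker's). The init argument is left out here, in which case Reduce simply starts from the first list element.

```r
# The four vectors from the question, gathered in a named list
# (the name `vecs` is an assumption made for this sketch).
vecs <- list(a = c(1, 2, 3, 4),
             b = c(1, 2, 6, 7),
             c = c(1, 2, 8, 9),
             d = c(3, 6, 8, 2))

# accumulate = TRUE keeps each intermediate intersection:
# a, then a & b, then a & b & c, then a & b & c & d
steps <- Reduce(intersect, vecs, accumulate = TRUE)

# Without accumulate, only the final intersection is returned
final <- Reduce(intersect, vecs)
print(steps)
print(final)
```

Working on a list rather than on separate variables also avoids the need for get/mget later on.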
If you wanted a way to go through all (pairs, triples, quadruples) of (a, b, c, d) and return the intersection, you could try:
1. Generate all combinations of (a, b, c, d) in lengths 2 (pairs), 3 (triples), 4 (quadruples)
combos = lapply(2:4, combn, x=c('a', 'b', 'c', 'd'), simplify=FALSE)
# [[1]]
# [[1]][[1]]
# [1] "a" "b"
# [[1]][[2]]
# [1] "a" "c"
# ...
# [[2]]
# [[2]][[1]]
# [1] "a" "b" "c"
# [[2]][[2]]
# [1] "a" "b" "d"
# ...
# [[3]]
# [[3]][[1]]
# [1] "a" "b" "c" "d"
2. Flatten it out to just a list of character vectors
combos = unlist(combos, recursive=FALSE)
# [[1]]
# [1] "a" "b"
# ...
# [[10]]
# [1] "b" "c" "d"
# [[11]]
# [1] "a" "b" "c" "d"
3. For each set, call Reduce as specified above. We can use (e.g.) get("a") to get the variable a, or mget(c("a", "b", "c")) to get the variables a, b, c in a list. If your variables are columns in a dataframe, then you can modify appropriately.
intersects = lapply(combos, function (varnames) {
  Reduce(intersect, mget(varnames, inherits=TRUE), get(varnames[1]))
})
# add some labels for clarity.
# You will probably actually want to /do/ something with the
# resulting intersections rather than this.
names(intersects) <- sapply(combos, paste, collapse=", ")
intersects
# $`a, b`
# [1] 1 2
# $`a, c`
# [1] 1 2
# ...
# $`a, b, c, d`
# [1] 2
You will need to modify to suit how your data is in R; e.g. columns of a dataframe vs named vectors in the workspace and so on.
You might also just prefer a for loop from step 3. onwards rather than all the *apply depending on what you want to do with the result. (Plus, if you have many vectors, holding all the intersections simultaneously in memory might not be a good idea anyway).
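As a rough sketch of the for-loop variant, and of how quickly the number of combinations grows, something like the following might do. It assumes the vectors are stored in a named list (vecs and non_empty are names I made up for illustration) and keeps only the non-empty intersections.

```r
# Vectors from the question in a named list (`vecs` is an assumed name)
vecs <- list(a = c(1, 2, 3, 4),
             b = c(1, 2, 6, 7),
             c = c(1, 2, 8, 9),
             d = c(3, 6, 8, 2))
n <- length(vecs)

# Number of combinations of size 2..n: 4C2 + 4C3 + 4C4
n_combos <- sum(choose(n, 2:n))

# Walk through the combinations one at a time and keep only
# the intersections that are not empty
non_empty <- list()
for (m in 2:n) {
  for (vars in combn(names(vecs), m, simplify = FALSE)) {
    common <- Reduce(intersect, vecs[vars])
    if (length(common) > 0) {
      non_empty[[paste(vars, collapse = " & ")]] <- common
    }
  }
}

# The same count for 100 vectors (roughly 2^100)
big_count <- sum(choose(100, 2:100))
print(n_combos)
print(big_count)
```

With 100 vectors the count is on the order of 1e30, so enumerating every combination is not feasible; in practice you would probably have to cap the combination size (the upper bound of m) or prune combinations whose intersection is already empty.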
I have a vector like:
A B C A B A B D D E
and I'd like to break it into as many vectors as the number of "A" I have, like:
A B C
A B
A B D D E
is there a way to accomplish this task?
You can use split and cumsum:
split(x, cumsum(x == "A"))
What you get in return is a list of vectors. A list seems most useful to me here since it allows vectors of different sizes in each element (unlike a data.frame for instance).
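To make the mechanism a bit more visible, here is a short sketch with the vector from the question. cumsum(x == "A") increases by one at every "A", so it acts as a group id that split can use (grp and parts are just names chosen for this example).

```r
x <- c("A", "B", "C", "A", "B", "A", "B", "D", "D", "E")

# TRUE counts as 1, so the running sum steps up at each "A"
grp <- cumsum(x == "A")

# One list element per group id
parts <- split(x, grp)
print(grp)
print(parts)
```

Note that if the vector does not start with "A", the leading elements end up in a group labelled 0.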
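Not as elegant as the split approach, but we can also use strsplit (this assumes that every element is a single character and that the input vector, here called vec, starts with "A"):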
strsplit(paste0("A", strsplit(paste0(vec, collapse = ""), "A")[[1]][-1]),"")
# [[1]]
# [1] "A" "B" "C"
# [[2]]
# [1] "A" "B"
# [[3]]
# [1] "A" "B" "D" "D" "E"
I want to include a list element c in a list L in R and name it C.
The example is as follows:
a=c(1,2,3)
b=c("a","b","c")
c=rnorm(3)
L<-list(A=a,
B=b,
C=c)
print(L)
## $A
## [1] 1 2 3
##
## $B
## [1] "a" "b" "c"
##
## $C
## [1] -2.2398424 0.9561929 -0.6172520
Now I want to introduce a condition on C, so it is only included in
the list if C.bool==T:
C.bool<-T
L<-list(A=a,
B=b,
if(C.bool) C=c)
print(L)
## $A
## [1] 1 2 3
##
## $B
## [1] "a" "b" "c"
##
## [[3]]
## [1] -2.2398424 0.9561929 -0.6172520
Now, however, the list element of c is not being named as specified in
the list statement. What's the trick here?
Edit: The intention is to only include the element in the list if the condition is met (no NULL should be included otherwise). Can this be done within the core definition of the list?
I don't know why you want to do it "without adding C outside the core definition of the list?" but if you're content with two lists in a single c then:
L <- c(list(A=a, B=b), if(C.bool) list(C=c))
If you really want one list but don't mind subsetting after creation then
L <- list(A=a, B=b, C=if(C.bool) c)[c(TRUE, TRUE, C.bool)]
(pace David Arenburg, isTRUE() omitted for brevity)
You can try this if you want to keep the names (note that when the condition is FALSE, C is still present in the list, with value NULL):
L2 <-list(A=a,
B=b,
C = if (TRUE) c)
You can of course replace TRUE with the statement containing C.bool
You could place the if statement outside the core definition of the list, like this:
L <- list(A = a, B= b)
if (isTRUE(C.bool)) L$C <- c
#> L
#$A
#[1] 1 2 3
#
#$B
#[1] "a" "b" "c"
#
#$C
#[1] -0.7631459 0.7353929 -0.2085646
(Edit with isTRUE() owing to the comment by @DavidArenburg)
As a combination of the previous answers by @MamounBenghezal, @user20637
and the comment made by @DavidArenburg, I would suggest this generalized
version that does not depend on the length of the list:
L <- Filter(Negate(is.null),
x = list(A = a, B = b, C = if (isTRUE(C.bool)) c, D = "foo"))
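A quick way to check the behaviour of the Filter approach for both values of the condition is to wrap it in a small function. This is only a sketch: build_list and its argument include_c are hypothetical names, and fixed numbers stand in for the rnorm(3) values from the question.

```r
a <- c(1, 2, 3)
b <- c("a", "b", "c")
c_vals <- c(0.1, 0.2, 0.3)   # stand-in for rnorm(3)

# Hypothetical helper: C is only kept when include_c is TRUE,
# because `if` without else returns NULL and Filter drops NULLs
build_list <- function(include_c) {
  Filter(Negate(is.null),
         list(A = a, B = b, C = if (isTRUE(include_c)) c_vals))
}

with_c <- build_list(TRUE)
without_c <- build_list(FALSE)
print(names(with_c))
print(names(without_c))
```

One caveat of this design is that any element that is legitimately NULL would be dropped as well.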
I'm trying to match a character value, "C", to multiple columns in a dataframe. Here's part of the frame:
X1 X2
1 F F
2 C C
3 D D
4 A# A#
Here's what happens when I try to match the value "C":
> "C" %in% frame[, 1]
[1] TRUE
> "C" %in% frame[, 1:2]
[1] FALSE
Considering that "C" is in both columns, I can't figure out why it's returning false. Is there a function or operator that can test to see if a value is present in multiple columns? My goal is to create a function that can sum the number of times a character value like "C" is found in specified columns.
Try:
apply(frame, 2, function(u) "C" %in% u)
You can also use is.element:
apply(frame, 2, function(u) is.element("C",u))
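You probably want to use grepl here, which returns a logical vector. Then you can count the number of occurrences with sum. (Note that grepl matches the pattern anywhere in the string, so 'C' would also match a value such as "C#"; use '^C$' or == "C" if you need exact matches.)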
> frame
X1 X2
1 F F
2 C C
3 D D
4 A# A#
> grepl('C', frame$X1)
[1] FALSE TRUE FALSE FALSE
> sum(grepl('C', frame$X1))
[1] 1
and to count the total number of Cs in every column you can use lapply
(note: apply is better suited for matrices, not data frames which are
lists.)
> sum(unlist(lapply(frame, function(col) grepl('C', col))))
[1] 2
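Since the stated goal was a function that counts a value across specified columns, here is one possible sketch using exact comparison rather than pattern matching. The function name count_value and its arguments are made up for illustration, and the data frame is rebuilt from the four rows shown in the question.

```r
# The part of the frame shown in the question
frame <- data.frame(X1 = c("F", "C", "D", "A#"),
                    X2 = c("F", "C", "D", "A#"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: count exact occurrences of `value`
# in the chosen columns (all columns by default)
count_value <- function(df, value, cols = names(df)) {
  per_col <- vapply(df[cols],
                    function(col) sum(col == value, na.rm = TRUE),
                    integer(1))
  sum(per_col)
}

total_c <- count_value(frame, "C")
x1_only <- count_value(frame, "C", cols = "X1")
print(total_c)
print(x1_only)
```

Because == compares whole values, "C" is not counted inside strings such as "C#".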
I have a vector with five items.
my_vec <- c("a","b","a","c","d")
If I want to re-arrange those values into a new vector (shuffle), I could use sample():
shuffled_vec <- sample(my_vec)
Easy - but the sample() function only gives me one possible shuffle. What if I want to know all possible shuffling combinations? The various "combn" functions don't seem to help, and expand.grid() gives me every possible combination with replacement, when I need it without replacement. What's the most efficient way to do this?
Note that in my vector, I have the value "a" twice, so every shuffled vector returned should also contain "a" twice.
I think permn from the combinat package does what you want
library(combinat)
permn(my_vec)
A smaller example
> x
[1] "a" "a" "b"
> permn(x)
[[1]]
[1] "a" "a" "b"
[[2]]
[1] "a" "b" "a"
[[3]]
[1] "b" "a" "a"
[[4]]
[1] "b" "a" "a"
[[5]]
[1] "a" "b" "a"
[[6]]
[1] "a" "a" "b"
If the duplicates are a problem, you could do something like this to get rid of them:
strsplit(unique(sapply(permn(my_vec), paste, collapse = ",")), ",")
Or, probably a better approach to removing duplicates:
dat <- do.call(rbind, permn(my_vec))
dat[!duplicated(dat), ]
Noting that your data is effectively 5 levels from 1-5, encoded as "a", "b", "a", "c", and "d", I went looking for ways to get the permutations of the numbers 1-5 and then remap those to the levels you use.
Let's start with the input data:
my_vec <- c("a","b","a","c","d") # the character
my_vec_ind <- seq(1,length(my_vec),1) # their identifier
To get the permutations, I applied the function given at Generating all distinct permutations of a list in R:
permutations <- function(n){
  if(n==1){
    return(matrix(1))
  } else {
    sp <- permutations(n-1)
    p <- nrow(sp)
    A <- matrix(nrow=n*p,ncol=n)
    for(i in 1:n){
      A[(i-1)*p+1:p,] <- cbind(i,sp+(sp>=i))
    }
    return(A)
  }
}
First, create a data.frame with the permutations:
tmp <- data.frame(permutations(length(my_vec)))
You now have a data frame tmp of 120 rows, where each row is a unique permutation of the numbers, 1-5:
>tmp
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 1 2 3 5 4
3 1 2 4 3 5
...
119 5 4 3 1 2
120 5 4 3 2 1
Now you need to remap them to the strings you had. You can remap them using a variation on the theme of gsub(), proposed here: R: replace characters using gsub, how to create a function?
gsub2 <- function(pattern, replacement, x, ...) {
  # apply each pattern/replacement pair in turn
  for(i in seq_along(pattern))
    x <- gsub(pattern[i], replacement[i], x, ...)
  x
}
A single call to gsub() won't work because pattern and replacement each hold more than one value here.
You also need a function you can call using lapply() to use the gsub2() function on every element of your tmp data.frame.
remap <- function(x, old, new){
  return(gsub2(pattern = old,
               replacement = new,
               fixed = TRUE,
               x = as.character(x)))
}
Almost there. We do the mapping like this:
shuffled_vec <- as.data.frame(lapply(tmp,
remap,
old = as.character(my_vec_ind),
new = my_vec))
which can be simplified to...
shuffled_vec <- as.data.frame(lapply(data.frame(permutations(length(my_vec))),
remap,
old = as.character(my_vec_ind),
new = my_vec))
.. should you feel the need.
That gives you your required answer:
> shuffled_vec
X1 X2 X3 X4 X5
1 a b a c d
2 a b a d c
3 a b c a d
...
119 d c a a b
120 d c a b a
Looking at a previous question (R: generate all permutations of vector without duplicated elements), I can see that the gtools package has a function for this. I couldn't however get this to work directly on your vector as such:
library(gtools)
permutations(n = 5, r = 5, v = my_vec)
#Error in permutations(n = 5, r = 5, v = my_vec) :
# too few different elements
However, you can adapt it by permuting the indices and using them to subset my_vec:
apply(permutations(n = 5, r = 5), 1, function(x) my_vec[x])
# [,1] [,2] [,3] [,4]
#[1,] "a" "a" "a" "a" ...
#[2,] "b" "b" "b" "b" ...
#[3,] "a" "a" "c" "c" ...
#[4,] "c" "d" "a" "d" ...
#[5,] "d" "c" "d" "a" ...
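If only the distinct shuffles are wanted (the vector contains "a" twice, so half of the 120 index permutations give the same result), a base R sketch along the following lines might help. perm_idx is a hypothetical helper written for this example using the same recursive idea as above; it is not part of combinat or gtools.

```r
# Hypothetical helper: all permutations of 1..n, one per row
perm_idx <- function(n) {
  if (n == 1) return(matrix(1L, 1, 1))
  prev <- perm_idx(n - 1)
  # put i first, then shift the remaining values up past i
  do.call(rbind, lapply(seq_len(n), function(i) {
    cbind(i, prev + (prev >= i))
  }))
}

my_vec <- c("a", "b", "a", "c", "d")
idx <- perm_idx(length(my_vec))

# Map the index permutations back onto the values of my_vec
all_perms <- matrix(my_vec[idx], nrow = nrow(idx))

# unique() on a matrix removes duplicated rows
distinct_perms <- unique(all_perms)
print(nrow(all_perms))
print(nrow(distinct_perms))
```

This still builds all n! rows before removing duplicates, so it is only reasonable for short vectors.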
I am using matching operators to grab values that appear in a matrix from a separate data frame. However, the resulting matrix has the values in the order they appear in the data frame, not in the original matrix. Is there any way to preserve the order of the original matrix using the matching operator?
Here is a quick example:
vec=c("b","a","c"); vec
df=data.frame(row.names=letters[1:5],values=1:5); df
df[rownames(df) %in% vec,1]
This produces [1] 1 2 3, which is the order in which "a", "b", "c" appear in the data frame. However, I would like to generate [1] 2 1 3, which is the order in which they appear in the original vector.
Thanks!
Use match.
df[match(vec, rownames(df)), ]
# [1] 2 1 3
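Be aware that match returns only the position of the first match for each element, and NA for any element of vec that is not found in rownames(df), so check for duplicates and missing values if your real data can contain them.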
Edit:
I just realized that row name indexing will solve your issue a bit more simply and elegantly:
df[vec, ]
# [1] 2 1 3
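The caveat about match can be illustrated with a small sketch. It reuses df from the question, but vec has an extra, made-up entry "z" that is not a row name, and the names idx, found and first_only are mine.

```r
df <- data.frame(row.names = letters[1:5], values = 1:5)
vec <- c("b", "z", "a", "c")   # "z" is not a row name of df

# Position of each element of vec among the row names; NA if absent
idx <- match(vec, rownames(df))

# Drop the NAs before indexing to keep only the values that were found,
# still in the order given by vec
found <- df$values[idx[!is.na(idx)]]

# match only ever reports the first matching position
first_only <- match("a", c("a", "b", "a"))
print(idx)
print(found)
print(first_only)
```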
Use match (and get rid of the NA values for elements of vec that have no match among the row names):
df[Filter(function(x) !is.na(x), match(vec, rownames(df))), 1]
Since indexing by name also works on vectors, we can take this one step further and define:
`%ino%` <- function(x, table) {
  xSeq <- seq_along(x)
  names(xSeq) <- x
  Out <- xSeq[as.character(table)]
  Out[!is.na(Out)]
}
We now have the desired result:
df[rownames(df) %ino% vec, 1]
[1] 2 1 3
Inside the function, assigning to names() automatically converts x to character, and table is converted with as.character(), so this also works correctly when the inputs to %ino% are numbers:
LETTERS[1:26 %in% 4:1]
[1] "A" "B" "C" "D"
LETTERS[1:26 %ino% 4:1]
[1] "D" "C" "B" "A"
As with %in%, values that have no match are dropped:
LETTERS[1:26 %in% 3:-5]
[1] "A" "B" "C"
LETTERS[1:26 %ino% 3:-5]
[1] "C" "B" "A"
With %in%, the logical vector is recycled along the dimension of the object being subsetted; this is not the case with %ino%:
data.frame(letters, LETTERS)[1:5 %in% 3:-5,]
letters LETTERS
1 a A
2 b B
3 c C
6 f F
7 g G
8 h H
11 k K
12 l L
13 m M
16 p P
17 q Q
18 r R
21 u U
22 v V
23 w W
26 z Z
data.frame(letters, LETTERS)[1:5 %ino% 3:-5,]
letters LETTERS
3 c C
2 b B
1 a A