find combination pairs of attribute variables - r

I looked around for a solution but could not find an exact one.
Given:
a<-c('a','b','c')
b<-c('d','e','f')
d<-c('g','h')
as a toy subset of a much larger set, I want to be able to find unique pairs between
attribute (vector) sets. If I use
combn(c(a,b,d),2)
It would return ALL pairwise combinations of all of the attribute elements,
e.g.
combn(c(a,b,d),2)
returns c(a,b) c(a,c) c(a,d) c(a,e)...
But I only want pairs of elements between attributes. So I would not see a,b or a,c, but
a,d a,e a,f b,d b,e b,f etc...
I could sort of do it with expand.grid(a,b,d)..
Var1 Var2 Var3
1 a d g
2 b d g
3 c d g
4 a e g
5 b e g
6 c e g
7 a f g
8 b f g
9 c f g
10 a d h
11 b d h
12 c d h
13 a e h
14 b e h
15 c e h
16 a f h
17 b f h
18 c f h
but now I have an n-column set of the combinations rather than pairs. Is there any way to limit
it to just pairs of elements across attributes, in the style of combn(x, 2)?
The main goal is to find a list of unique pairwise combinations of elements between all attribute pairs, but I do not want combinations of elements
within the same attribute column, as those are redundant in my application.

Taking combinations of pairs in each row in the grid, then filtering to get unique entries, we have this:
unique(do.call(c, apply(expand.grid(a,b,d), 1, combn, m=2, simplify=FALSE)))
A list of combinations is returned:
> L <- unique(do.call(c, apply(expand.grid(a,b,d), 1, combn, m=2, simplify=FALSE)))
> length(L) ## 21
> L[1:5]
## [[1]]
## Var1 Var2
## "a" "d"
##
## [[2]]
## Var1 Var3
## "a" "g"
##
## [[3]]
## Var2 Var3
## "d" "g"
##
## [[4]]
## Var1 Var2
## "b" "d"
##
## [[5]]
## Var1 Var3
## "b" "g"

First, create a list where each element is a pair of your original vectors, e.g. list(a, b):
L <- list(a, b, d)
L.pairs <- combn(seq_along(L), 2, simplify = FALSE, FUN = function(i)L[i])
Then run expand.grid for each of these pairs and put the pieces together:
do.call(rbind, lapply(L.pairs, expand.grid))
# Var1 Var2
# 1 a d
# 2 b d
# 3 c d
# [...]
# 19 d h
# 20 e h
# 21 f h
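The two steps above can also be folded into a single expression: combn() accepts a list, so it can draw the pairs of vectors directly, and expand.grid() crosses each pair. A base-R sketch:

```r
a <- c('a','b','c')
b <- c('d','e','f')
d <- c('g','h')

# For each pair of vectors, build the cross product, then stack the pieces
res <- do.call(rbind, combn(list(a, b, d), 2,
                            FUN = expand.grid, simplify = FALSE))
nrow(res)  # 21 = 3*3 + 3*2 + 3*2
```

This works because expand.grid() accepts a list of vectors as its input, and combn(..., simplify = FALSE) hands each pair of list elements to FUN as such a list.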

Related

avoiding nested sapply when collapsing variable in data.frame with multiple factors

I have a dataframe with multiple factors and multiple numeric vars. I would like to collapse one of the factors (say by mean).
In my attempts I could only think of nested sapply or for loops to isolate the numerical elements to be averaged.
var <- data.frame(A = c(rep('a',8), rep('b',8)),
                  B = rep(c(rep('c',2), rep('d',2)), 4),
                  C = rep(c('e','f'), 8),
                  D = rnorm(16), E = rnorm(16),
                  stringsAsFactors = TRUE)  # needed in R >= 4.0 so levels() below works
> var
A B C D E
1 a c e 1.1601720731 -0.57092435
2 a c f -0.0120178626 1.05003748
3 a d e 0.5311032778 1.67867806
4 a d f -0.3399901000 0.01459940
5 a c e -0.2887561691 -0.03847519
6 a c f 0.0004299922 -0.36695879
7 a d e 0.8124655890 0.05444033
8 a d f -0.3777058654 1.34074427
9 b c e 0.7380720821 0.37708543
10 b c f -0.3163496271 0.10921373
11 b d e -0.5543252191 0.35020193
12 b d f -0.5753686426 0.54642790
13 b c e -1.9973216646 0.63597405
14 b c f -0.3728926714 -3.07669300
15 b d e -0.6461596329 -0.61659041
16 b d f -1.7902722068 -1.06761729
sapply(4:ncol(var), function(i) {
  sapply(1:length(levels(var$A)), function(j) {
    sapply(1:length(levels(var$B)), function(t) {
      sapply(1:length(levels(var$C)), function(z) {
        mean(var[var$A == levels(var$A)[j] &
                 var$B == levels(var$B)[t] &
                 var$C == levels(var$C)[z], i])
      })
    })
  })
})
[,1] [,2]
[1,] 0.435707952 -0.3046998
[2,] -0.005793935 0.3415393
[3,] 0.671784433 0.8665592
[4,] -0.358847983 0.6776718
[5,] -0.629624791 0.5065297
[6,] -0.344621149 -1.4837396
[7,] -0.600242426 -0.1331942
[8,] -1.182820425 -0.2605947
Is there a way to do this without this many sapply? maybe with mapply or outer
Maybe just,
var <- data.frame(A = c(rep('a',8), rep('b',8)),
                  B = rep(c(rep('c',2), rep('d',2)), 4),
                  C = rep(c('e','f'), 8),
                  D = rnorm(16), E = rnorm(16))
library(dplyr)
var %>%
  group_by(A, B, C) %>%
  summarise_if(is.numeric, mean)
(Note that the output you show isn't what I get when I run your sapply code, but the above is identical to what I get when I run your sapply's.)
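(A note on API drift: in dplyr 1.0 and later, summarise_if() is superseded, and the same aggregation is usually written with across(). A sketch, assuming a recent dplyr is installed:)

```r
library(dplyr)

var <- data.frame(A = rep(c('a','b'), each = 8),
                  B = rep(c('c','c','d','d'), 4),
                  C = rep(c('e','f'), 8),
                  D = rnorm(16), E = rnorm(16))

# across(where(is.numeric), mean) replaces summarise_if(is.numeric, mean)
var %>%
  group_by(A, B, C) %>%
  summarise(across(where(is.numeric), mean), .groups = "drop")
```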
For inline aggregation (keeping same number of rows of data frame), consider ave:
var$D_mean <- with(var, ave(D, A, B, C, FUN=mean))
var$E_mean <- with(var, ave(E, A, B, C, FUN=mean))
For full aggregation (collapsed to factor groups), consider aggregate:
aggregate(. ~ A + B + C, var, mean)
I will complete the holy trinity with a data.table solution. Here .SD is a data.table of all the columns not listed in the by portion. This is a near-dupe of this question (only difference is >1 column being summarized), so click that if you want more solutions.
library(data.table)
setDT(var)
var[, lapply(.SD, mean), by = .(A, B, C)]
# A B C D E
# 1: a c e 0.07465822 0.032976115
# 2: a c f 0.40789460 -0.944631574
# 3: a d e 0.72054938 0.039781185
# 4: a d f -0.12463910 0.003363382
# 5: b c e -1.64343115 0.806838905
# 6: b c f -1.08122890 -0.707975411
# 7: b d e 0.03937829 0.048136471
# 8: b d f -0.43447899 0.028266455

How to filter in dplyr based upon an associated condition

I have a data frame. I want to filter out some issues only in the case they are associated with a specific group.
For a dummy example, suppose I have the following:
> mydf
Group Issue
1 A G
2 A H
3 A L
4 B V
5 B M
6 C G
7 C H
8 C L
9 C X
10 D G
11 D H
12 D I
I want to filter out rows with a "G" or "H" or "L" issue if there is also an "L" issue in that Group.
So in this case, I want to filter out rows 1, 2, 3, 6, 7, 8 but leave rows 4, 5, 9, 10, 11, and 12. Thus the result would be:
> mydf
Group Issue
4 B V
5 B M
9 C X
10 D G
11 D H
12 D I
I think I first need to group_by(Group) but then I'm wondering what's the best way to do this.
Thanks!
If the rule is
When a group contains L, drop L, G & H.
then
mydf %>%
  group_by(Group) %>%
  filter(if (any(Issue == "L")) !(Issue %in% c("G","H","L")) else TRUE)
# Group Issue
# 1 B V
# 2 B M
# 3 C X
# 4 D G
# 5 D H
# 6 D I
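The if/else can also be folded into a single condition: a row is dropped exactly when its group contains an "L" and the row's own Issue is one of G/H/L. A sketch of the equivalent filter:

```r
library(dplyr)

mydf <- data.frame(Group = c("A","A","A","B","B","C","C","C","C","D","D","D"),
                   Issue = c("G","H","L","V","M","G","H","L","X","G","H","I"))

# Drop a row only if its group has an "L" AND the row is G, H, or L
mydf %>%
  group_by(Group) %>%
  filter(!(any(Issue == "L") & Issue %in% c("G","H","L"))) %>%
  ungroup()
```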

Extract n rows after string in R

I would like to extract the next 'n' rows after I find a string in R.
For example, let's say I have the following data frame:
df<-as.data.frame(rep(c("a","b","c","d","e","f"),10))
I would like to extract every row that includes "b", as well as the next two rows (in this example, I would like to extract rows with "b", or "c", or "d")
BUT, please, I don't want to specify "c" and "d", I just want the next two rows after "b" as well (in my real data the next two rows are not consistent).
I've tried many things, but no success. Thanks in advance! Nick
You can find the indices of rows with b and then use those and the next two of each, something like this:
df <- data.frame(col1=rep(c("a","b","c","d","e","f"),3), col2=letters[1:18], stringsAsFactors = FALSE)
df
col1 col2
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 a g
8 b h
9 c i
10 d j
11 e k
12 f l
13 a m
14 b n
15 c o
16 d p
17 e q
18 f r
bs <- which(df$col1=="b")
df[sort(bs + rep(0:2, each=length(bs))), ] # 2 is the number of rows you want after your desired match (b)
col1 col2
2 b b
3 c c
4 d d
8 b h
9 c i
10 d j
14 b n
15 c o
16 d p
I added a second column to illustrate the dataframe better, otherwise a vector would be returned.
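A small variant of the same idea that also guards against indices running past the end of the data frame (relevant when a "b" sits in the last two rows):

```r
df <- data.frame(col1 = rep(c("a","b","c","d","e","f"), 3),
                 col2 = letters[1:18], stringsAsFactors = FALSE)

bs  <- which(df$col1 == "b")
idx <- unique(sort(rep(bs, each = 3) + 0:2))  # each match plus the next two rows
idx <- idx[idx <= nrow(df)]                   # drop indices past the last row
df[idx, ]
```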
My "SOfun" package has a function called getMyRows which does what you ask for, with the exception of returning a list instead of a data.frame.
I had left the result as a list to make it easier to handle some edge cases, like where the requests for rows would overlap. For example, in the following sample data, there are two consecutive "b" values. There's also a "b" value in the final row.
df <- data.frame(col1 = c("a", "b", "b",
rep(c("a", "b", "c", "d", "e", "f"), 3), "b"),
col2 = letters[1:22])
library(SOfun)
getMyRows(df, which(df$col1 == "b"), 0:2, TRUE)
# [[1]]
# col1 col2
# 2 b b
# 3 b c
# 4 a d
#
# [[2]]
# col1 col2
# 3 b c
# 4 a d
# 5 b e
#
# [[3]]
# col1 col2
# 5 b e
# 6 c f
# 7 d g
#
# [[4]]
# col1 col2
# 11 b k
# 12 c l
# 13 d m
#
# [[5]]
# col1 col2
# 17 b q
# 18 c r
# 19 d s
#
# [[6]]
# col1 col2
# 22 b v
The usage is essentially:
Specify the data.frame.
Specify the index positions to use as the base. Here, we want all rows where "col1" equals "b" to be our base index position.
Specify the range of rows interested in. -1:3, for example, would give you one row before to three rows after the base.
TRUE means that you are specifying the starting points by their numeric indices.
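If you'd rather avoid the package dependency, a minimal base-R helper with the same shape (a list of data.frames, clipped at the boundaries) might look like the sketch below; get_rows_after is a hypothetical name, not SOfun's actual implementation.

```r
# A minimal stand-in for getMyRows: for each base index, return the rows
# at index + range, dropping any indices that fall outside the data frame.
get_rows_after <- function(data, base, range) {
  lapply(base, function(i) {
    idx <- i + range
    data[idx[idx >= 1 & idx <= nrow(data)], , drop = FALSE]
  })
}

df <- data.frame(col1 = c("a", "b", "b",
                          rep(c("a", "b", "c", "d", "e", "f"), 3), "b"),
                 col2 = letters[1:22])
res <- get_rows_after(df, which(df$col1 == "b"), 0:2)
length(res)  # 6: one data.frame per "b"; the final one is clipped to a single row
```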

Filter a data.frame with another data.frame using index notation instead of subset

Given:
df <- data.frame(rep = letters[sample(4, 30, replace=TRUE)], loc = LETTERS[sample(5:8, 30, replace=TRUE)], y= rnorm(30))
lookup <- data.frame(rep=letters[1:4], loc=LETTERS[5:8])
This will give me the rows in df that have rep,loc combinations that occur in lookup:
mdply(lookup, function(rep, loc) {
  r <- rep
  l <- loc
  subset(df, rep == r & loc == l)
})
But I've read that using subset() inside a function is poor practice due to scoping issues. So how do I get the desired result using index notation?
In this particular case, merge seems to make the most sense to me:
merge(df, lookup)
# rep loc y
# 1 a E 1.6612394
# 2 a E 1.1050825
# 3 a E -0.7016759
# 4 b F 0.4364568
# 5 d H 1.3246636
# 6 d H -2.2573545
# 7 d H 0.5061980
# 8 d H 0.1397326
A simple alternative might be to paste together the "rep" and "loc" columns from df and from lookup and subset based on that:
df[do.call(paste, df[c("rep", "loc")]) %in% do.call(paste, lookup), ]
# rep loc y
# 4 d H 1.3246636
# 10 b F 0.4364568
# 14 a E -0.7016759
# 15 a E 1.6612394
# 19 d H 0.5061980
# 20 a E 1.1050825
# 22 d H -2.2573545
# 28 d H 0.1397326
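If dplyr is an option, a third route is semi_join(), which keeps the rows of df that have a match in lookup without adding any columns (a sketch; the sampled rows will vary from run to run):

```r
library(dplyr)

df <- data.frame(rep = letters[sample(4, 30, replace = TRUE)],
                 loc = LETTERS[sample(5:8, 30, replace = TRUE)],
                 y   = rnorm(30))
lookup <- data.frame(rep = letters[1:4], loc = LETTERS[5:8])

# Filtering join: rows of df with a (rep, loc) match in lookup
semi_join(df, lookup, by = c("rep", "loc"))
```

Unlike merge(), semi_join() never duplicates rows of df, and it preserves df's original column set and row order.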