Generate Unique Combinations When Duplicates Exist - r

My goal is to generate the list of unique combinations when some elements of the set being operated upon are duplicates of one another. In other words, I want all combinations without replacement of non-distinct items. The solution needs to be general, i.e. it should work for any set of N elements with M distinct values. For example, it should work with N = 4, M = 2 with (Var1 = Var2, Var3 = Var4) or (Var1 = Var2 = Var3, Var4), etc. As a simple example of what I am trying to do, take three variables: X, Y, Z.
Classic Combinations are:
X Y Z
Y Z
X Z
Z
X Y
Y
X
If we let X = Y, then we have:
X X Z
X Z
X Z
Z
X X
X
X
Thus, we have two combinations that are not "unique": (X) and (X Z).
So, the list that I would want is:
X X Z
X Z
Z
X X
X
Edit: Added the case N = 4, as recommended by @Sam Thomas.
If we expand this to N=4, we have: W,X,Y,Z
W X Y Z
X Y Z
W Y Z
Y Z
W X Z
X Z
W Z
Z
W X Y
X Y
W Y
Y
W X
X
W
Here, we can have M=2 distinct elements in forms of either: (W=X, Y=Z), (X=Z,W=Y), (X=Y,W=Z), (W = X = Y, Z), (W = Z = Y, X), (W = Z = X, Y), or (X = Y = Z, W).
In the case of (W=X, Y=Z), we have:
W W Y Y
W Y Y
W Y Y
Y Y
W W Y
W Y
W Y
Y
W W Y
W Y
W Y
Y
W W
W
W
The output should be:
W W Y Y
W Y Y
Y Y
W W Y
W Y
Y
W W
W
In the case of, (W = X = Y, Z) the matrix would initially look like:
W W W Z
W W Z
W W Z
W Z
W W Z
W Z
W Z
Z
W W W
W W
W W
W
W W
W
W
The desired output would be:
W W W Z
W W Z
W Z
Z
W W W
W W
W
End Edit
Using R, I already have a way to generate a list of all possible combinations in binary matrix form:
comb.mat = function(n){
c = rep(list(1:0), n)
expand.grid(c)
}
comb.mat(3)
This gives:
Var1 Var2 Var3
1 1 1 1
2 0 1 1
3 1 0 1
4 0 0 1
5 1 1 0
6 0 1 0
7 1 0 0
8 0 0 0
If we consider Var1 = Var2, this structure has redundancies: rows (2, 3) and rows (6, 7) represent the same object. Thus, the redundancy-free version would be:
Var1 Var2 Var3
1 1 1 1
2 0 1 1
4 0 0 1
5 1 1 0
6 0 1 0
8 0 0 0
To add "variable" values similar to the initial structure, I use:
nvars = ncol(m)
for(i in 1:nvars){
m[m[,i]==1,i] = LETTERS[22+i]
}
To modify it so that Var1 = Var2, I just use:
m[m[,i]=="Y",i] = "X"
Any suggestions on how I could move from the initial matrix to the latter matrix?
Especially if we have more variables that are paired?
E.g. comb.mat(4), with: (Var1 = Var2, Var3 = Var4) or (Var1=Var2=Var3, Var4)

This has all of the combinations, I believe.
m <- comb.mat(3)
res <- lapply(split(m, m$Var3), function(x, vars=c("Var1", "Var2")) {
x[Reduce(`==`, x[vars]) | cumsum(Reduce(xor, x[vars])) == 1, ]
})
do.call(rbind, res)
Var1 Var2 Var3
0.5 1 1 0
0.6 0 1 0
0.8 0 0 0
1.1 1 1 1
1.2 0 1 1
1.4 0 0 1
Edit: I think this works for multiple sets of equivalent variables. I couldn't figure out a method without a for loop, though I'm sure there's a way with Reduce somehow.
I think this gives the right combinations, but if not, let me know, as it's late in the day and I'm a bit tired.
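remove_dups <- function(m, vars) {
  # vars: list of character vectors, each naming a set of equivalent columns
  for (k in seq_along(vars)) {
    # split on every column outside the current set, then thin each block
    res <- lapply(split(m, m[, !names(m) %in% vars[[k]]]), function(x, vn = vars[[k]]) {
      x[Reduce(`==`, x[vn]) | cumsum(Reduce(xor, x[vn])) == 1, ]
    })
    m <- do.call(rbind, res)
  }
  m
}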
m <- comb.mat(4)
remove_dups(m, list(vars=c("Var1", "Var2"), vars=c("Var3", "Var4")))
Var1 Var2 Var3 Var4
0.0.0.0.16 0 0 0 0
0.0.1.0.12 0 0 1 0
0.0.1.1.4 0 0 1 1
0.1.0.0.14 0 1 0 0
0.1.1.0.10 0 1 1 0
0.1.1.1.2 0 1 1 1
1.1.0.0.13 1 1 0 0
1.1.1.0.9 1 1 1 0
1.1.1.1.1 1 1 1 1
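The same idea can be sketched more generally by noting that, when some columns are interchangeable, a combination is fully described by how many members of each equivalence group it selects. The helper below, `dedupe_combos`, is a hypothetical name and not part of the answer above; it assumes every column of the matrix is listed in exactly one group (columns with no twin go in as a group of one).

```r
# Binary combination matrix, as in the question
comb.mat <- function(n) expand.grid(rep(list(1:0), n))

# dedupe_combos() is a hypothetical helper, not part of the original answer.
# groups: list of character vectors; each vector names columns treated as equal.
# Assumption: every column of m appears in exactly one group (singletons included).
dedupe_combos <- function(m, groups) {
  # One key column per group: how many members of that group are selected
  key <- sapply(groups, function(g) rowSums(m[, g, drop = FALSE]))
  # Keep the first row seen for each distinct vector of counts
  m[!duplicated(key), , drop = FALSE]
}

res3  <- dedupe_combos(comb.mat(3), list(c("Var1", "Var2"), "Var3"))
res4a <- dedupe_combos(comb.mat(4), list(c("Var1", "Var2"), c("Var3", "Var4")))
res4b <- dedupe_combos(comb.mat(4), list(c("Var1", "Var2", "Var3"), "Var4"))
res3
```

For `comb.mat(3)` with Var1 = Var2 this keeps rows 1, 2, 4, 5, 6 and 8, the redundancy-free matrix shown in the question. Because the key is a count per group, the sketch should also cover groups of three or more equal variables.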

Related

extracting specific column with certain condition on one column only

I have data (in R) in the form below:
A B C D
x alpha sine 0
y gama cos 1
z beta tan 2
and I want to extract only columns A and B for the rows where D > 0.
I tried data %>% filter(D > 0), which gives me the last two rows, where D > 0, but it also gives me column C, which I don't want.
How can I get only columns A and B, with the condition applied to column D only?
Data in text:
A B C D
x alpha sine 0
y gama cos 1
z beta tan 2
data %>% filter(D > 0) %>% select(A, B, D)
A B D
1 y gama 1
2 z beta 2
or even:
data %>% filter(D > 0) %>% select(-C)
A B D
1 y gama 1
2 z beta 2
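Both variants above still return column D. Since the question asks for columns A and B only, a minimal sketch in base R follows; the data frame is rebuilt from the table in the question, and the dplyr line in the comment is the assumed equivalent rather than something taken from the answer.

```r
# Data rebuilt from the table in the question
data <- data.frame(
  A = c("x", "y", "z"),
  B = c("alpha", "gama", "beta"),
  C = c("sine", "cos", "tan"),
  D = c(0, 1, 2)
)

# Filter rows on D, then keep only A and B.
# Assumed dplyr equivalent: data %>% filter(D > 0) %>% select(A, B)
res <- subset(data, D > 0, select = c(A, B))
res
```

The row filter is applied before the column selection, so D can be used in the condition even though it is dropped from the result.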

What exactly does the logical parameter of the `subset` function do in R?

I am learning R with the book Learning R by Richard Cotton (Chapter 5: Lists and Data Frames), and I don't understand one of the examples given. I have this data frame and the following script:
(a_data_frame <- data.frame(
x = letters[1:5],
y = rnorm(5),
z = runif(5) > 0.5
))
x y z
1 a 0.6395739 FALSE
2 b -1.1645383 FALSE
3 c -1.3616093 FALSE
4 d 0.5658254 FALSE
5 e 0.4345538 FALSE
subset(a_data_frame, y > 0 | z, x) # what exactly does y > 0 | z mean?
The book says:
subset takes up to three arguments: a data frame to subset, a
logical vector of conditions for rows to include, and a vector of
column names to keep
It gives no more information about the second (logical) parameter.
It's a tricky example. In subset(a_data_frame, y > 0 | z, x), the second argument, y > 0 | z, keeps a row when y > 0 or when the value in column z is TRUE.
y > 0 is evaluated on the values produced by rnorm(5); your values differ from the book's because they are randomly generated. The "|" operator is an element-wise "or", so a row is also kept when its z value is TRUE. In your case all the z values are FALSE, so you can't see what's going on. As a didactic example, if we use z = rnorm(5) instead of runif(5) > 0.5, it is easier to see how this function works.
(a_data_frame <- data.frame(
x = letters[1:5],
y = rnorm(5),
z = rnorm(5)
))
x y z
1 a -0.91016367 2.04917552
2 b 0.01591093 0.03070526
3 c 0.19146220 -0.42056236
4 d 1.07171934 1.31511485
5 e 1.14760483 -0.09855757
So if we ask for y < 0 or z < 0, the output is column x for rows a, c and e:
> subset(a_data_frame, y < 0 | z < 0, x)
x
1 a
3 c
5 e
> subset(a_data_frame, y < 0 & z<0, x)
[1] x
<0 rows> (or 0-length row.names) # no row has both y < 0 and z < 0
> subset(a_data_frame, y < 0 & z, x) # numeric z is coerced to logical (non-zero is TRUE), so only row 1 (y < 0) matches
x
1 a
> subset(a_data_frame, y < 0 | z, x) # every z is non-zero, so all rows match
x
1 a
2 b
3 c
4 d
5 e
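To see the book's original condition at work with a logical z, here is a small sketch with hand-picked values instead of random ones. The numbers are assumptions chosen for illustration, not the book's data.

```r
# Fixed values chosen for illustration (not the book's random data)
a_data_frame <- data.frame(
  x = letters[1:5],
  y = c(-1, 0.5, -0.2, 2, -3),
  z = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Keep rows where y > 0 OR z is TRUE, and return only column x
res <- subset(a_data_frame, y > 0 | z, x)
res
```

Row a is kept because z is TRUE, row b because y > 0, and row d because both hold; rows c and e fail both tests and are dropped.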

Loop within a loop with column names in R

I have the following data:
id A B C
1 1 1 0
2 1 1 1
3 0 1 1
I would like to create a function that computes the following three counts for a pair of columns:
the number of individuals i) with A and B, ii) with A but not B, iii) with B but not A. Similarly, I would like a loop that computes these three numbers for A and C, and for B and C. Is there a smart way to do so, e.g. a loop within a loop? So far, I have tried the following:
for(ii in colnames(df)){
for(jj in (ii+1):df){
print(ii,jj)
}}
Perhaps something like this:
# function to return your metrics
foo = function(x, y) {
c(
"x and y" = sum(x & y),
"x not y" = sum(x & !y),
"y not x" = sum(!x & y)
)
}
# generate combinations of columns
col_combos = combn(names(df)[-1], 2)
result = apply(col_combos, 2, function(x) foo(df[[x[1]]], df[[x[2]]]))
colnames(result) = apply(col_combos, 2, toString)
result
# A, B A, C B, C
# x and y 2 1 2
# x not y 0 1 1
# y not x 1 1 0
Using this data:
df = read.table(text = 'id A B C
1 1 1 0
2 1 1 1
3 0 1 1 ', header = TRUE)
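For comparison, the "loop within a loop" that the question attempted can be sketched by looping over column positions rather than column names, since `ii + 1` is not defined for a name. The `pair_counts` list and its element names are illustrative choices, not part of the answer above.

```r
df <- read.table(text = "id A B C
1 1 1 0
2 1 1 1
3 0 1 1", header = TRUE)

cols <- names(df)[-1]  # drop the id column
pair_counts <- list()  # illustrative container for the results

for (i in seq_len(length(cols) - 1)) {
  for (j in (i + 1):length(cols)) {
    x <- df[[cols[i]]] == 1
    y <- df[[cols[j]]] == 1
    pair_counts[[paste(cols[i], cols[j], sep = ", ")]] <- c(
      both        = sum(x & y),
      first_only  = sum(x & !y),
      second_only = sum(!x & y)
    )
  }
}
pair_counts
```

The inner loop starts at `i + 1`, so each unordered pair of columns is visited once, which is the same set of pairs that `combn(names(df)[-1], 2)` produces.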

New column of factors based on shared group values [R]

Suppose I have the following data. I'm interested in making a new column of factors that captures whether Item_i, Item_j, and/or Item_k are coded "1" for each category A,B,C,D,etc.
dat <- data.frame(c("A","A","B","B","C","C","D","D"), c("x","y","y","z","x","z","y","z"), c(1,0,0,1,1,0,0,0), c(0,1,1,0,0,0,1,0), c(0,0,0,1,0,1,0,1))
names(dat) <- c("Categories","Aspects","Item_i", "Item_j", "Item_k")
If I didn't care about the categories and wanted to do this row-by-row, it would be simple enough to do using an ifelse() statement:
dat$FactorCol <- ifelse(dat$Item_i==1 & dat$Item_j==0 & dat$Item_k==0, "i", NA)
dat$FactorCol <- ifelse(dat$Item_i==0 & dat$Item_j==1 & dat$Item_k==0, "j", dat$FactorCol)
dat$FactorCol <- ifelse(dat$Item_i==0 & dat$Item_j==0 & dat$Item_k==1, "k", dat$FactorCol)
dat$FactorCol <- ifelse(dat$Item_i==1 & dat$Item_j==0 & dat$Item_k==1, "i and k", dat$FactorCol)
But what I actually want is for dat$FactorCol to reflect whether i, j, k, or some combination appears anywhere within each Category, and then to return a new column (with the same number of rows).
Output would be something like:
Categories Aspects Item_i Item_j Item_k FactorCol
1 A x 1 0 0 i and j
2 A y 0 1 0 i and j
3 B y 0 1 0 i and j and k
4 B z 1 0 1 i and j and k
5 C x 1 0 0 i and k
6 C z 0 0 1 i and k
7 D y 0 1 0 j and k
8 D z 0 0 1 j and k
It's also not the case in my data that categories restart neatly every two rows. I'm guessing dplyr can handle this easily, but I wasn't able to do it on my own. I'd appreciate any tips.
For each value of Categories, we can take the max of each 'Item_' column; for the columns whose max is 1, we assign the corresponding label i, j or k. To get the same number of rows back, we left_join the result with dat.
library(dplyr)
cols <- c('i', 'j', 'k')
dat %>%
group_by(Categories) %>%
summarise(across(starts_with('Item_'), max)) %>%
#In old dplyr
#summarise_at(vars(starts_with('Item_')), max)
mutate(FactorCol = purrr::pmap_chr(select(., starts_with('Item_')),
~toString(cols[c(...) == 1]))) %>%
select(Categories, FactorCol) %>%
left_join(dat, by = 'Categories')
# Categories FactorCol Aspects Item_i Item_j Item_k
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 A i, j x 1 0 0
#2 A i, j y 0 1 0
#3 B i, j, k y 0 1 0
#4 B i, j, k z 1 0 1
#5 C i, k x 1 0 0
#6 C i, k z 0 0 1
#7 D j, k y 0 1 0
#8 D j, k z 0 0 1
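If the labels should read "i and j" as in the desired output, and dplyr is not available, the same grouping idea can be sketched in base R. This is an alternative under those assumptions, not the method from the answer above; `item_cols`, `labels` and `present` are illustrative names.

```r
dat <- data.frame(
  Categories = c("A", "A", "B", "B", "C", "C", "D", "D"),
  Aspects    = c("x", "y", "y", "z", "x", "z", "y", "z"),
  Item_i     = c(1, 0, 0, 1, 1, 0, 0, 0),
  Item_j     = c(0, 1, 1, 0, 0, 0, 1, 0),
  Item_k     = c(0, 0, 0, 1, 0, 1, 0, 1)
)

item_cols <- c("Item_i", "Item_j", "Item_k")
labels    <- c("i", "j", "k")

# Per-category maximum: 1 if the item is coded 1 anywhere in the category
present <- aggregate(dat[item_cols],
                     by = list(Categories = dat$Categories), FUN = max)

# Collapse the labels of the items present into one string per category
present$FactorCol <- apply(present[item_cols] == 1, 1,
                           function(r) paste(labels[r], collapse = " and "))

# Map the per-category label back onto every original row
dat$FactorCol <- present$FactorCol[match(dat$Categories, present$Categories)]
dat
```

Using `match` instead of a join keeps the original row order and row count, which is what the question asks for.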

In R: Split a character vector to find specific characters and return a data frame

I want to extract specific characters from a character vector in a data frame and return a new data frame. The information I want to extract is the auditor's remarks on a specific company's income and balance sheet. My problem is that each entry is a single string containing several remarks. For instance:
vec = c("A C G H D E"). Since "A" %in% vec won't return TRUE, I have to use strsplit to break up each string in the data frame, hence "A" %in% unlist(strsplit(dat[i, 2], " ")). This returns TRUE.
Here is a MWE:
dat <- data.frame(orgnr = c(1, 2, 3, 4), rat = as.character(c("A B C")))
dat$rat <- as.character(dat$rat)
dat[2, 2] <- as.character(c("A F H L H"))
dat[3, 2] <- as.character(c("H X L O"))
dat[4, 2] <- as.character(c("X Y Z A B C"))
Now, to extract information about every single letter in the rat column, I've tried several approaches from similar problems, such as Roland's answer to a similar question (How to split a character vector into data frame?)
DF <- data.frame(do.call(rbind, strsplit(dat$rat, " ", fixed = TRUE)))
DF
X1 X2 X3 X4 X5 X6
1 A B C A B C
2 A F H L H A
3 H X L O H X
4 X Y Z A B C
This returns the following warning:
Warning message:
In (function (..., deparse.level = 1) :
number of columns of result is not a multiple of vector length (arg 2)
It would be a desirable approach since it's fast, but I can't use DF since it recycles.
Is there a way to insert NA instead of recycling when the vectors have different lengths?
So far I've found a solution to the problem by using for loops in combination with ifelse statements. However, with 3 million observations this approach takes years!
dat$A <- 0
for(i in seq(1, nrow(dat), 1)) {
print(i)
dat[i, 3] <- ifelse("A" %in% unlist(strsplit(dat[i, 2], " ")), 1, 0)
}
dat$B <- 0
for(i in seq(1, nrow(dat), 1)) {
print(i)
dat[i, 4] <- ifelse("B" %in% unlist(strsplit(dat[i, 2], " ")), 1, 0)
}
This gives the results I want:
dat
orgnr rat A B
1 1 A B C 1 1
2 2 A F H L H 1 0
3 3 H X L O 0 0
4 4 X Y Z A B C 1 1
I've searched through most of the relevant questions I could find here on StackOverflow. This one is really close to my problem: How to convert a list consisting of vector of different lengths to a usable data frame in R?, but I don't know how to implement strsplit with that approach.
We can use a for loop with grepl to achieve this task. The + 0 converts the column from TRUE/FALSE to 1/0.
for (col in c("A", "B")){
dat[[col]] <- grepl(col, dat$rat) + 0
}
dat
# orgnr rat A B
# 1 1 A B C 1 1
# 2 2 A F H L H 1 0
# 3 3 H X L O 0 0
# 4 4 X Y Z A B C 1 1
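One caveat: grepl matches substrings, so if the remark codes were ever longer than one character (a hypothetical case, e.g. a code "AB"), searching for "A" would also flag it. A minimal sketch that matches whole tokens instead, splitting each string only once, could look like this:

```r
dat <- data.frame(orgnr = 1:4,
                  rat = c("A B C", "A F H L H", "H X L O", "X Y Z A B C"),
                  stringsAsFactors = FALSE)

# Split every string once and reuse the tokens for each code of interest
tokens <- strsplit(dat$rat, " ", fixed = TRUE)

for (col in c("A", "B")) {
  # TRUE when the code appears as a whole token; + 0 turns it into 1/0
  dat[[col]] <- vapply(tokens, function(tk) col %in% tk, logical(1)) + 0
}
dat
```

For the single-letter codes in the example this gives the same A and B columns as the grepl version.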
If performance is an issue, try this data.table approach.
library(data.table)
# Convert to data.table
setDT(dat)
# Create a helper function
dummy_fun <- function(col, vec){
grepl(col, vec) + 0
}
# Apply the function to A and B
dat[, c("A", "B") := lapply(c("A", "B"), dummy_fun, vec = rat)]
dat
# orgnr rat A B
# 1: 1 A B C 1 1
# 2: 2 A F H L H 1 0
# 3: 3 H X L O 0 0
# 4: 4 X Y Z A B C 1 1
Using base R:
a=strsplit(dat$rat," ")
b=data.frame(x=rep(dat$orgnr,lengths(a)),y=unlist(a),z=1)
cbind(dat,as.data.frame.matrix(xtabs(z~x+y,b)))
orgnr rat A B C F H L O X Y Z
1 1 A B C 1 1 1 0 0 0 0 0 0 0
2 2 A F H L H 1 0 0 1 2 1 0 0 0 0
3 3 H X L O 0 0 0 0 1 1 1 1 0 0
4 4 X Y Z A B C 1 1 1 0 0 0 0 1 1 1
From here you can just select the columns that you want:
d=as.data.frame.matrix(xtabs(z~x+y,b))
cbind(dat,d[c("A","B")])
orgnr rat A B
1 1 A B C 1 1
2 2 A F H L H 1 0
3 3 H X L O 0 0
4 4 X Y Z A B C 1 1
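The question also asked whether the rbind approach can fill with NA instead of recycling. One common way, sketched here under the assumption that a rectangular character data frame is the goal, is to pad every split vector to the longest length before binding; `parts`, `n_max` and `DF` are illustrative names.

```r
dat <- data.frame(orgnr = 1:4,
                  rat = c("A B C", "A F H L H", "H X L O", "X Y Z A B C"),
                  stringsAsFactors = FALSE)

parts <- strsplit(dat$rat, " ", fixed = TRUE)
n_max <- max(lengths(parts))

# `length<-` extends each vector to n_max, padding with NA;
# sapply returns one column per string, so transpose to one row per string
padded <- t(sapply(parts, `length<-`, n_max))
DF <- as.data.frame(padded, stringsAsFactors = FALSE)
DF
```

Shorter rows now end in NA rather than repeating their first elements, so no recycling warning is raised.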

Resources