R: add all combinations of three values of a vector to a three-dimensional array

I have a data frame with two columns. The first one "V1" indicates the objects on which the different items of the second column "V2" are found, e.g.:
V1 <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C")
V2 <- c("a","b","c","d","a","c","d","a","b","d","e")
df <- data.frame(V1, V2)
"A", for example, contains "a", "b", "c", and "d". What I am looking for is a three-dimensional array in which each dimension has length length(unique(V2)), with the names "a" to "e" as dimnames.
For each unique value of V1 I want all possible combinations of three V2 items (e.g. for "A" these include c("a", "b", "c"), c("a", "b", "d"), and c("b", "c", "d")).
Each of these "three-item co-occurrences" should be treated as a coordinate in the three-dimensional array and add one to the frequency count stored in that cell. The outcome should be the following array:
ar <- array(data = c(0,0,0,0,0,0,0,1,2,1,0,1,0,2,0,0,2,2,0,1,0,1,0,1,0,
                     0,0,1,2,1,0,0,0,0,0,1,0,0,1,0,2,0,1,0,1,1,0,0,1,0,
                     0,1,0,2,0,1,0,0,1,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,
                     0,2,2,0,1,2,0,1,0,1,2,1,0,0,0,0,0,0,0,0,1,1,0,0,0,
                     0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0),
            dim = c(5, 5, 5),
            dimnames = list(c("a", "b", "c", "d", "e"),
                            c("a", "b", "c", "d", "e"),
                            c("a", "b", "c", "d", "e")))

I was wondering about the 3D symmetry of your result. It took me a while to understand that you want to have all permutations of all combinations.
library(gtools) #for the permutations
foo <- function(x) {
  #all combinations:
  combs <- combn(x, 3, simplify = FALSE)
  #all permutations for each of the combinations:
  combs <- do.call(rbind, lapply(combs, permutations, n = 3, r = 3))
  #tabulate:
  do.call(table, lapply(asplit(combs, 2), factor, levels = letters[1:5]))
}
#apply grouped by V1, then sum the results
res <- Reduce("+", tapply(df$V2, df$V1, foo))
#check
all((res - ar)^2 == 0)
#[1] TRUE
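For comparison, the same counts can probably be produced in base R without gtools. The following is only a sketch, assuming the V1 and V2 vectors from the question; the names lev, res2, trip and keep are made up for illustration. It enumerates every ordered triple of distinct items per group and increments the matching cells through character-matrix indexing.

```r
V1 <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C")
V2 <- c("a", "b", "c", "d", "a", "c", "d", "a", "b", "d", "e")

lev  <- sort(unique(V2))
res2 <- array(0, dim = rep(length(lev), 3), dimnames = list(lev, lev, lev))

for (items in split(V2, V1)) {
  items <- unique(items)
  # every ordered triple drawn from this group's items
  trip <- as.matrix(expand.grid(items, items, items, stringsAsFactors = FALSE))
  # keep only triples made of three different items
  keep <- trip[, 1] != trip[, 2] & trip[, 1] != trip[, 3] & trip[, 2] != trip[, 3]
  trip <- trip[keep, , drop = FALSE]
  # a character matrix with one column per dimension selects cells by dimnames
  res2[trip] <- res2[trip] + 1
}

res2["a", "b", "d"]  # groups "A" and "C" both contain a, b and d
```

Because each group contributes each ordered triple at most once, the indexed assignment never hits the same cell twice within one iteration.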

I used to use the cross join CJ() from data.table to get the pairwise count of all combinations of two different V2 items:
library(data.table)
res <- setDT(df)[, CJ(unique(V2), unique(V2)), V1][V1 != V2,
    .N, .(V1, V2)][order(V1, V2)]
This code creates a data frame res with three columns. V1 and V2 contain the respective items of V2 from the original data frame df, and N contains the count (how many times V1 and V2 appear together with the same value of V1 in the original data frame df).
Now, I found that I could perform this crossjoin with three 'dimensions' as well by just adding another unique(V2) and adapting the rest of the code accordingly.
The result is a data frame with four columns. V1, V2, and V3 indicate the original V2 items and N again shows the number of mutual appearances with the same original V1 objects.
res <- setDT(df)[,CJ(unique(V2), unique(V2), unique(V2)), V1][V1!=V2 & V1 != V3 & V2 != V3,
.N, .(V1,V2,V3)][order(V1,V2,V3)]
The advantage of this code is that empty combinations (those which do not appear at all) are not stored. It worked with 1,000,000 unique values in V1 and over 600 unique items in V2, which would otherwise have required an extremely large array of 600 x 600 x 600.
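The same sparse idea can be sketched in base R, assuming the V1 and V2 vectors from the question. Unlike the cross join, this version stores each unordered triple once (as a pasted key such as "a-b-d") instead of every ordering; the names triples and counts are illustrative only.

```r
V1 <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C")
V2 <- c("a", "b", "c", "d", "a", "c", "d", "a", "b", "d", "e")

triples <- unlist(lapply(split(V2, V1), function(items) {
  items <- sort(unique(items))
  # groups with fewer than three items contribute nothing
  if (length(items) < 3) return(character(0))
  # one key per unordered combination of three items
  combn(items, 3, FUN = paste, collapse = "-")
}))

# only triples that actually occur end up in the table
counts <- table(triples)
counts
```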

Related

How to combine multiple vectors such that elements of each vector are distributed as equally as possible?

Let's say I have two or more vectors with two or more elements (single factor) each, e.g.
v1 = c("a", "a", "a")
v2 = c("b", "b")
What I want to do is to merge all vectors and distribute the elements for each group as equally as possible.
For the simple example above there would be a single solution:
c("a", "b", "a", "b", "a")
If v1 = c("a", "a", "a", "a") any of these
c("a", "b", "a", "b", "a", "a")
c("a", "b", "a", "a", "b", "a")
c("a", "a", "b", "a", "b", "a")
would be the best solution. Is there a built-in function that can do this? Any ideas how to implement it?
This would work for two vectors.
v1 = c("a", "a", "a")
v2 = c("b", "b")
distribute_equally <- function(v1, v2) {
v3 <- c(v1, v2)
tab <- sort(table(v3))
c(rep(names(tab), min(tab)), rep(names(tab)[2], diff(range(tab))))
}
distribute_equally(v1, v2)
#[1] "b" "a" "b" "a" "a"
distribute_equally(c('a', 'a'), c('b', 'b'))
#[1] "a" "b" "a" "b"
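One possible way to extend this to any number of vectors is sketched below. It is a heuristic, not a proven optimum: each vector's elements are given evenly spaced positions between 0 and 1, and the merged vector is sorted by those positions. The function name interleave_evenly is hypothetical.

```r
interleave_evenly <- function(vectors) {
  # the k-th of n elements is placed at the midpoint of the k-th of n equal slots
  pos <- unlist(lapply(vectors, function(v) (seq_along(v) - 0.5) / length(v)))
  # sort all elements by position; ties keep their original order
  unlist(vectors)[order(pos)]
}

interleave_evenly(list(c("a", "a", "a"), c("b", "b")))
interleave_evenly(list(c("a", "a", "a", "a"), c("b", "b")))
```

For the two examples from the question this gives "a" "b" "a" "b" "a" and "a" "b" "a" "a" "b" "a", both of which are among the solutions listed there.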
Thinking of the problem in terms of experimental design optimization, we can get a general solution using the MaxProQQ function in the MaxPro package.
Each position in the merged vector can be thought of as coming from a discrete quantitative factor, and the factors from your v1, v2, etc. can be thought of as qualitative factors. Here's some example code (MaxProQQ takes integer factors instead of characters, but you can convert it afterward):
library(MaxPro)
set.seed(1)
v1 <- rep(1, sample.int(10, 1))
v2 <- rep(2, sample.int(10, 1))
v3 <- rep(3, sample.int(10, 1))
v4 <- rep(4, sample.int(10, 1))
vComb <- c(v1, v2, v3, v4)
vMerge1234 <- MaxProQQ(cbind(1:length(vComb), sample(vComb, length(vComb))), p_nom = 1)$Design
vMerge1234 <- vMerge1234[order(vMerge1234[,1]),][,2]
vMerge1234
# [1] 4 3 4 2 4 3 4 1 2 4 3 4 2 4 3 1 4 3 2 4 1 3 4
Generate 100 samples, say, without replacement from c(v1, v2) giving m which is 5x100 with one column per sample. Then find the column for which the sum of the variances of the frequencies over each group is minimized. If there are more than two vectors just concatenate them in the line marked ## and the rest of the code stays the same.
set.seed(123)
v1 = c("a", "a", "a")
v2 = c("b", "b")
v <- c(v1, v2) ##
m <- replicate(100, sample(v))
varsum <- apply(m, 2, function(x) {
f <- factor(x, levels = unique(v))
sum(tapply(f, v, function(x) var(table(x))))
})
m[, which.min(varsum)]
## [1] "a" "a" "b" "b" "a"

Finding values in column "a" that have different values in column "b" in two different data sets

The data contains multiple columns and 3000 rows.
The two data frames share the same OrderNo values, but the Ordertype can differ.
I want to get all the OrderNo values whose Ordertype differs between the two data frames.
I have isolated the two columns from each data frame and sorted them in ascending order. Then I tried to use cbind to combine the columns and find the values in one column that are missing from the other.
xxx <- data.frame( orderNo = c(1:10), Ordertype = c("a", "b", "c", "d", "a", "b", "c", "d", "e", "f"))
yyy <- data.frame( orderNo = c(1:10), Ordertype = c("a", "b", "c", "d", "a", "b", "e", "d", "e", "f"))
In the above example, OrderNo "7" corresponds to "c" in one data frame and "e" in the other. I want the set of all such numbers with a different value in the column Ordertype as my output.
It sounds like you want a data frame that contains differences between two data frames, matched by (and including) orderNo. Is that correct?
One possibility is:
res <- merge(xxx, yyy, by = "orderNo")
res[res[,2] != res[,3], ]
#   orderNo Ordertype.x Ordertype.y
# 7       7           c           e
Using dplyr and anti_join you can do the following to find differences:
library(dplyr)
inner_join(anti_join(xxx, yyy), anti_join(yyy, xxx), by='orderNo')
#   orderNo Ordertype.x Ordertype.y
# 1       7           c           e
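If only the order numbers are needed, a base R sketch with match() may be enough. It assumes orderNo is unique within each data frame; the names idx and differs are illustrative.

```r
xxx <- data.frame(orderNo = 1:10,
                  Ordertype = c("a", "b", "c", "d", "a", "b", "c", "d", "e", "f"),
                  stringsAsFactors = FALSE)
yyy <- data.frame(orderNo = 1:10,
                  Ordertype = c("a", "b", "c", "d", "a", "b", "e", "d", "e", "f"),
                  stringsAsFactors = FALSE)

# row of yyy that belongs to each orderNo in xxx (NA if absent)
idx <- match(xxx$orderNo, yyy$orderNo)
# TRUE where the order exists in both and the types disagree
differs <- !is.na(idx) & xxx$Ordertype != yyy$Ordertype[idx]
xxx$orderNo[differs]
```

Matching on orderNo first means the two data frames do not have to be sorted the same way.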

How to count the number of rows in a data frame whose cells match each other

I have two columns, one with predicted values (as strings) and one with real values (as strings), and I want to count the number of rows in which the real value matches the predicted value in the same row.
I was wondering whether it is possible to do something like that with R?
# create sample dataset
df <- data.frame(
col1 = c("a", "b", "c", "d", "e"),
col2 = c("a", "x", "y", "z", "e"),
stringsAsFactors = FALSE
)
# count the number of rows where two columns equal each other
sum( df$col1 == df$col2 )
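A slightly extended sketch of the same comparison, assuming the sample data above: keeping the logical vector makes it easy to report the share of matching rows as well, and na.rm = TRUE is an optional safeguard in case either column contains missing values. The name matches is illustrative.

```r
df <- data.frame(
  col1 = c("a", "b", "c", "d", "e"),
  col2 = c("a", "x", "y", "z", "e"),
  stringsAsFactors = FALSE
)

# TRUE for every row where the two strings are identical
matches <- df$col1 == df$col2

n_match     <- sum(matches, na.rm = TRUE)   # number of matching rows
share_match <- mean(matches, na.rm = TRUE)  # proportion of matching rows
```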

Subset a Data Frame Based on All Combinations and Sub-combinations of Factor Variables

I need to subset a data.frame based on all combinations and sub-combinations of multiple columns of factor variables. Additionally, the number of factor columns may change, so the method needs to be flexible in accepting different numbers of attributes. I can figure out how to create the combinations of variables in a simple example, but I don't have a good way to subset the data.frame efficiently. Any thoughts?
#setup an example data.frame
library(data.table)
a <- c("a", "b", "b", "b", "e")
b <- c("b", "c", "b", "b", "f")
c <- c("c", "d", "b", "b", "g")
df <- data.table(a = a, b = b, c = c)
#build a data.frame of unique combos to subset on
df_unique <- df[!duplicated(df), ]
df_combos <- data.table()
for(i in 1:ncol(df_unique)){
for(x in 1:ncol(df_unique)){
df_sub <- df_unique[, i:x, with = FALSE]
df_combos <- rbind(df_combos, df_sub, fill = TRUE)
}
}
df_combos <- df_combos[!duplicated(df_combos), ]
rm(df_unique)
#create a loop to build the subsets
combos_out <- data.table()
for(i in 1:nrow(df_combos)){
df_combos_sub <- df_combos[i, ]
df_combos_sub <- df_combos_sub[, which(unlist(lapply(df_combos_sub, function(x) !all(is.na(x))))), with = FALSE]
df_sub <- merge(df, df_combos_sub, by = colnames(df_combos_sub))
#interesting code here that performs analysis on the subsets
}
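One way this could be approached without the nested loops is sketched below in base R. Note the assumptions: it uses a plain data.frame rather than a data.table, and it considers every non-empty subset of columns (via combn), which is not identical to the contiguous column ranges built by the loop above. The names col_sets and subsets are illustrative.

```r
df <- data.frame(a = c("a", "b", "b", "b", "e"),
                 b = c("b", "c", "b", "b", "f"),
                 c = c("c", "d", "b", "b", "g"),
                 stringsAsFactors = FALSE)

cols <- names(df)
# every non-empty set of columns: 1 at a time, 2 at a time, ...
col_sets <- unlist(lapply(seq_along(cols), function(k) combn(cols, k, simplify = FALSE)),
                   recursive = FALSE)

# for each column set, split the rows by the observed value combinations
subsets <- lapply(col_sets, function(cs) split(df, df[cs], drop = TRUE))
names(subsets) <- vapply(col_sets, paste, character(1), collapse = "+")

# the per-subset analysis could then run over, for example, subsets[["a+b"]]
names(subsets)
```

Since split() only creates groups for value combinations that occur in the data (drop = TRUE), no merge against a table of candidate combinations is needed.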

Count number of observations with elements in the same order

I am trying to pre-process some data in order to build a Sunburst plot in R. In short, I need to count how many observations have their elements in the same order.
The elements of each observation are character strings. The order does matter.
mylist <- list(c("a", "b", "c"),
c("x", "y"),
c("b", "c", "a"),
c("a", "b", "c"))
Desired output would be something like:
"a-b-c" = 2
"x-y" = 1
"b-c-a" = 1
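A minimal sketch of one way to get such counts, assuming the list from the question: collapse each observation into a single key that preserves the element order, then tabulate the keys. The names keys and counts are illustrative.

```r
mylist <- list(c("a", "b", "c"),
               c("x", "y"),
               c("b", "c", "a"),
               c("a", "b", "c"))

# one string per observation, elements joined in their original order
keys <- vapply(mylist, paste, character(1), collapse = "-")

# frequency of each distinct ordered sequence
counts <- table(keys)
counts
```

Note that table() sorts the keys alphabetically, so the order of the result may differ from the order of first appearance in the list.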
