Generate all combinations of items with two values in Julia?

I have m items. Each item is a pair of two values. For example, for m=4, I have the matrix:
julia> valid_pairs = [0 1;
                      1 2;
                      1 2;
                      2 3];
I would like to generate all combinations of the four items where each item i can take only the values in valid_pairs[i, :]. Based on the previous example, I would like to have:
julia> all_combs
4×16 Array{Int64,2}:
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2
1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
I feel like this can be done easily using Combinatorics.jl.
I did use Combinatorics.jl, but what I came up with was the following:
using Combinatorics
m = 4
combs = combinations(1:m) |> collect
L = length(combs)
all_combs = zeros(Int, m, L+1)
for j in 1:L
    for i in 1:m
        if !in(i, combs[j])
            all_combs[i, j] = valid_pairs[i, 1]
        else
            all_combs[i, j] = valid_pairs[i, 2]
        end
    end
end
all_combs[:, end] = valid_pairs[:, 1]

Not the same order, but
julia> [collect(x) for x in Iterators.product(eachrow(valid_pairs)...)]
2×2×2×2 Array{Array{Int64,1},4}:
[:, :, 1, 1] =
[0, 1, 1, 2] [0, 2, 1, 2]
[1, 1, 1, 2] [1, 2, 1, 2]
[:, :, 2, 1] =
[0, 1, 2, 2] [0, 2, 2, 2]
[1, 1, 2, 2] [1, 2, 2, 2]
[:, :, 1, 2] =
[0, 1, 1, 3] [0, 2, 1, 3]
[1, 1, 1, 3] [1, 2, 1, 3]
[:, :, 2, 2] =
[0, 1, 2, 3] [0, 2, 2, 3]
[1, 1, 2, 3] [1, 2, 2, 3]
should do. If you really want a matrix (2D array), then you can hcat the previous answer, or directly do
julia> reduce(hcat, collect(x) for x in Iterators.product(eachrow(valid_pairs)...))
4×16 Array{Int64,2}:
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2
2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
EDIT: side note, I would define the pairs as tuples to clarify what's happening, so something like
valid_pairs = [(0,1), (1,2), (1,2), (2,3)]
and I would not create the 2D (or 4D, or m-D) array, but, instead, do
comb_pairs = Iterators.product(valid_pairs...)
which gives you a lazy version of all the pair combinations: you can iterate over it without materializing it first, which should be more efficient (and, I think, looks cleaner).
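To make the laziness concrete, here is a small sketch using only Base Julia; the `sum(c) >= 7` predicate is a hypothetical filter chosen just for illustration:

```julia
valid_pairs = [(0, 1), (1, 2), (1, 2), (2, 3)]

# Lazy iterator over all 2^4 = 16 combinations; nothing is materialized.
comb_pairs = Iterators.product(valid_pairs...)

length(comb_pairs)                   # 16, computed without collecting

# Hypothetical filter: count combinations whose values sum to at least 7,
# visiting one NTuple{4,Int} at a time instead of building a 4×16 matrix.
count(c -> sum(c) >= 7, comb_pairs)  # 5
```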

Related

Efficient way to copy a matrix except for one column

Consider a matrix where you don't need the third column:
X = zeros(Int64, (4, 3));
X[:, 1] = [0, 0, 1, 1];
X[:, 2] = [1, 2, 1, 2];
julia> X
4×3 Matrix{Int64}:
0 1 0
0 2 0
1 1 0
1 2 0
So you want to select (copy) everything except column 3:
4×2 Matrix{Int64}:
0 1
0 2
1 1
1 2
Is there a shorthand way to express this?
These work, but feel impractical when you have a large number of columns:
X[:, [1, 2]]
X[:, sort(collect(setdiff(Set([1, 2, 3]), Set([3]))))]
There are plenty of ways to do this. Below is a solution in which you express which ranges of column numbers to include:
X = zeros(Int64, (8, 3));
X[:, 1] = [0, 0, 0, 0, 1, 1, 1, 1];
X[:, 2] = [1, 1, 2, 2, 1, 1, 2, 2];
return X[:,1:2] #Columns 1 through 2 are being directly included.
Alternatively, you could express which columns you would like to exclude, which is perhaps the more widely useful version of the code:
return X[:, 1:end .!= 3] #column number 3 is being directly excluded.
Both of which would return:
8×2 Matrix{Int64}:
0 1
0 1
0 2
0 2
1 1
1 1
1 2
1 2
If it is some column in the middle, you can perhaps get the most elegant code by using InvertedIndices (which is also loaded by other packages, such as DataFrames):
julia> A = collect(reshape(1:16,4,4))
4×4 Matrix{Int64}:
1 5 9 13
2 6 10 14
3 7 11 15
4 8 12 16
julia> A[:, Not(3)]
4×3 Matrix{Int64}:
1 5 13
2 6 14
3 7 15
4 8 16
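Not also accepts ranges and vectors of indices, so the same idiom scales to excluding several rows or columns at once; a small sketch, assuming InvertedIndices is installed:

```julia
using InvertedIndices

A = collect(reshape(1:16, 4, 4))

A[:, Not(3)]       # drop column 3        -> 4×3 matrix
A[:, Not(2:3)]     # drop columns 2 and 3 -> 4×2 matrix
A[Not(1), Not(4)]  # drop row 1 and column 4 -> 3×3 matrix
```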

Using R dplyr::mutate() with a for loop and dynamic variables

Disclaimer: I suspect there is a much more efficient solution (perhaps an anonymous function with a list, or the *apply functions?), which is why I have come to you much more experienced people for help!
The data
Let's say I have a df with participant responses to 3 question As and 3 question Bs e.g.
qa1, qa2, qa3, qb1, qb2, qb3
1, 3, 1, 2, 4, 4
1, 3, 2, 2, 1, 4
2, 3, 1, 2, 1, 4
1, 3, 2, 1, 1, 3
EDIT df also contains other columns with other irrelevant data!
I have a vector with correct answers to each of qa1-3 and qb1-3 in sequence with the columns.
correct_answer <- c(1,3,2,2,1,4)
(i.e. for qa1,qa2,qa3,qb1,qb2,qb3)
Desired manipulation
I want to create a new column per question (e.g. qa1_correct), coding for whether the participant has responded correctly (1) or incorrectly (0) based on matching each response in df with corresponding answer in correct_answer. Ideally I would end up with:
qa1, qa2, qa3, qb1, qb2, qb3, qa1_correct, qa2_correct, qa3_correct ...
1, 3, 1, 2, 4, 4, 1, 1, 0, ...
1, 3, 2, 2, 1, 4, 1, 1, 1, ...
2, 3, 1, 2, 1, 4, 0, 1, 0, ...
1, 3, 2, 1, 1, 3, 1, 1, 1, ...
Failed Attempt
This is my attempt for the question As only (I would repeat it for the Bs), but it doesn't work (maybe paste0() is the wrong function?):
index <- c(1:3)
for (i in index) {
df <- df %>% mutate(paste0("qa",i,"_correct") =
case_when(paste0("qa"i) == correct_answer[i] ~ 1,
paste0("qa"i) != correct_answer[i] ~ 0))
}
Many thanks for any guidance!
You can combine mutate and across.
Code 1: Correct_answer as vector
df %>%
mutate(across(everything(),
~as.numeric(.x == correct_answer[names(df) == cur_column()]),
.names = "{.col}_correct"))
Code 2: Correct_answer as data.frame (df_correct)
correct_answer <- c(1,3,2,2,1,4)
df_correct <- data.frame(
matrix(correct_answer, ncol = length(correct_answer))
)
colnames(df_correct) <- names(df)
df %>%
mutate(across(everything(),
.fn = ~as.numeric(.x == df_correct[,cur_column()]),
.names = "{.col}_correct"))
Output
qa1 qa2 qa3 qb1 qb2 qb3 qa1_correct qa2_correct qa3_correct qb1_correct qb2_correct qb3_correct
1 1 3 1 2 4 4 1 1 0 1 0 1
2 1 3 2 2 1 4 1 1 1 1 1 1
3 2 3 1 2 1 4 0 1 0 1 1 1
4 1 3 2 1 1 3 1 1 1 0 1 0
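Note that the question's EDIT says df also contains unrelated columns, which everything() would then try to score as well. A sketch under that assumption: name the answer key after the question columns, so across() can be restricted to exactly those columns (the named vector is an illustrative convention, not part of the original question):

```r
library(dplyr)

# Name the answer key after the question columns (illustrative assumption).
correct_answer <- c(qa1 = 1, qa2 = 3, qa3 = 2, qb1 = 2, qb2 = 1, qb3 = 4)

df %>%
  mutate(across(all_of(names(correct_answer)),
                ~ as.numeric(.x == correct_answer[cur_column()]),
                .names = "{.col}_correct"))
```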
This may also be an alternative (in R 4.1.0 onwards, where apply gained a new argument simplify with default TRUE, and the \(x) lambda shorthand became available):
df <- read.table(header = T, text = 'qa1, qa2, qa3, qb1, qb2, qb3
1, 3, 1, 2, 4, 4
1, 3, 2, 2, 1, 4
2, 3, 1, 2, 1, 4
1, 3, 2, 1, 1, 3', sep = ',')
df
#> qa1 qa2 qa3 qb1 qb2 qb3
#> 1 1 3 1 2 4 4
#> 2 1 3 2 2 1 4
#> 3 2 3 1 2 1 4
#> 4 1 3 2 1 1 3
correct_answer <- c(1,3,2,2,1,4)
cbind(df,
setNames(as.data.frame(t(apply(df, 1,
\(x) +(x == correct_answer)))),
paste0(names(df), '_correct')))
#> qa1 qa2 qa3 qb1 qb2 qb3 qa1_correct qa2_correct qa3_correct qb1_correct
#> 1 1 3 1 2 4 4 1 1 0 1
#> 2 1 3 2 2 1 4 1 1 1 1
#> 3 2 3 1 2 1 4 0 1 0 1
#> 4 1 3 2 1 1 3 1 1 1 0
#> qb2_correct qb3_correct
#> 1 0 1
#> 2 1 1
#> 3 1 1
#> 4 1 0
Created on 2021-07-23 by the reprex package (v2.0.0)
You can also use the following solution in base R:
cbind(df,
do.call(cbind, mapply(function(x, y) as.data.frame({+(x == y)}),
df, correct_answer, SIMPLIFY = FALSE)) |>
setNames(paste0(names(df), "_corr")))
qa1 qa2 qa3 qb1 qb2 qb3 qa1_corr qa2_corr qa3_corr qb1_corr qb2_corr qb3_corr
1 1 3 1 2 4 4 1 1 0 1 0 1
2 1 3 2 2 1 4 1 1 1 1 1 1
3 2 3 1 2 1 4 0 1 0 1 1 1
4 1 3 2 1 1 3 1 1 1 0 1 0
Or a potential tidyverse solution could be:
library(tidyr)
library(purrr)
df %>%
mutate(output = pmap(df, ~ setNames(+(c(...) == correct_answer),
paste0(names(df), "_corr")))) %>%
unnest_wider(output)
qa1 qa2 qa3 qb1 qb2 qb3 qa1_corr qa2_corr qa3_corr qb1_corr qb2_corr qb3_corr
1 1 3 1 2 4 4 1 1 0 1 0 1
2 1 3 2 2 1 4 1 1 1 1 1 1
3 2 3 1 2 1 4 0 1 0 1 1 1
4 1 3 2 1 1 3 1 1 1 0 1 0
Try this:
df_new <- cbind(df, t(apply(df, 1, function(x) as.numeric(x == correct_answer))))
EDIT: it works with the addition of sym(). I found a related solution in Paste variable name in mutate (dplyr), though that version only pasted 0's:
for (i in index) {
  df <- df %>% mutate(!!paste0("qa", i, "_correct") :=
    case_when(!!sym(paste0("qa", i)) == correct_answer[i] ~ 1,
              !!sym(paste0("qa", i)) != correct_answer[i] ~ 0))
}

Assignment in matrices using double indexing

I can't figure out how to obtain this behavior:
From this matrix:
julia> a = [1 1 1; 1 1 1; 1 1 2]
3×3 Array{Int64,2}:
1 1 1
1 1 1
1 1 2
I want to change all the 1s to 5s but only in the last row.
What I did is a[3, :][a[3, :] .== 1] .= 5, but the value of a isn't changed.
I've noticed that with:
foo = a[3, :]
foo[foo .== 1] .= 5
a[3, :] = foo
it works, but I'm trying to reduce allocations, so this intermediate copy should be avoided.
Thanks in advance
You can use @view and replace!:
julia> a = [1 1 1
1 1 1
1 1 2]
3×3 Array{Int64,2}:
1 1 1
1 1 1
1 1 2
julia> replace!(@view(a[end, :]), 1 => 5)
3-element view(::Array{Int64,2}, 3, :) with eltype Int64:
5
5
2
julia> a
3×3 Array{Int64,2}:
1 1 1
1 1 1
5 5 2
The problem is that
a[3, :][a[3, :] .== 1] .= 5
is the same as
getindex(a, 3, :)[a[3, :] .== 1] .= 5
and getindex returns a copy of that part of a, so you are mutating the copy, not the original a. You want to use a view instead:
view(a, 3, :)[a[3, :] .== 1] .= 5
You can also do this with the @view or @views macro.
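For completeness, here is what the @views variant looks like: the macro turns every indexing operation in the expression into a view, so the original one-liner works unchanged.

```julia
a = [1 1 1; 1 1 1; 1 1 2]

# @views rewrites a[3, :] as view(a, 3, :), so the broadcast
# assignment mutates `a` itself instead of a copy.
@views a[3, :][a[3, :] .== 1] .= 5

a  # the last row is now 5 5 2
```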

Generate new column in dataframe, based on group-event in nested groups

I have a dataframe with three "main"-groups (x: 1, 2, 3), three groups within the main-groups (v: 2, 3 or 1) and some events within the main-groups (0 and 1 in y):
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)
v <- c(2, 3, 3, 2, 2, 1, 1, 2, 2)
y <- c(0, 0, 1, 0, 0, 0, 0, 0, 1)
df <- data.frame(x, v, y)
df
> df
x v y
1 1 2 0
2 1 3 0
3 1 3 1
4 2 2 0
5 2 2 0
6 3 1 0
7 3 1 0
8 3 2 0
9 3 2 1
For example: In group 1 (x = 1) there are two more groups (v = 2 and v = 3), event y = 1 happens in group x = 1 and v = 3.
Now I want to generate a new column z based on the events in y: if any y = 1 occurs within an (x, v) group, then all rows of that group should get z = 1; otherwise NA. How can z be generated this way? df should look like:
> df
x v y z
1 1 2 0 NA
2 1 3 0 1
3 1 3 1 1
4 2 2 0 NA
5 2 2 0 NA
6 3 1 0 NA
7 3 1 0 NA
8 3 2 0 1
9 3 2 1 1
I am grateful for any help.
df %>% group_by(x, v) %>% mutate(z = if(any(y == 1)) 1 else NA)
After grouping by x and v, the new column z is filled with 1's if there are any 1's in y within the group, and with NA's otherwise.
Try this:
library(dplyr)
df %>%
group_by(x, v) %>%
mutate(
z = ifelse(any(y == 1), 1, NA)
)
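If you prefer to stay in base R, the same grouping logic can be sketched with ave(), which applies a function within each (x, v) group and recycles the scalar result over that group's rows:

```r
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)
v <- c(2, 3, 3, 2, 2, 1, 1, 2, 2)
y <- c(0, 0, 1, 0, 0, 0, 0, 0, 1)
df <- data.frame(x, v, y)

# Within each (x, v) group: 1 if the group contains any y == 1, else NA.
df$z <- ave(df$y, df$x, df$v,
            FUN = function(g) if (any(g == 1)) 1 else NA)
df
```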

Removing rows/columns with only one element from a binary matrix

I'm trying to remove "singletons" from a binary matrix. Here, singletons refers to elements that are the only "1" value in the row AND the column in which they appear. For example, given the following matrix:
> matrix(c(0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,1), nrow=6)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0 1 0 0 0 0 0
[2,] 1 0 1 0 0 0 0
[3,] 0 0 0 1 0 0 0
[4,] 1 1 0 0 0 0 0
[5,] 0 0 0 0 1 1 1
[6,] 0 0 0 0 1 0 1
...I would like to remove all of row 3 (and, if possible, all of column 4), because the 1 in [3,4] is the only 1 in that row/column combination. [1,2] is fine, since there are other 1's in column [,2]; similarly, [2,3] is fine, since there are other 1's in row [2,]. Any help would be appreciated - thanks!
You first want to find which rows and columns are singletons, and then check if there are pairs of singleton rows and columns that share an index. Here is a short bit of code to accomplish this task:
foo <- matrix(c(0,1,0,...))
singRows <- which(rowSums(foo) == 1)
singCols <- which(colSums(foo) == 1)
singCombinations <- expand.grid(singRows, singCols)
singPairs <- singCombinations[apply(singCombinations, 1,
function(x) which(foo[x[1],] == 1) == x[2]),]
noSingFoo <- foo[-unique(singPairs[,1]), -unique(singPairs[,2])]
With many singleton rows or columns you might need to make this a bit more efficient, but it does the job.
UPDATE: Here is the more efficient version I knew could be done. This way you loop only over the rows (or columns if desired) and not all combinations. Thus it is much more efficient for matrices with many singleton rows/columns.
## starting with foo and singRows as before
singPairRows <- singRows[sapply(singRows, function(singRow)
sum(foo[,foo[singRow,] == 1]) == 1)]
singPairs <- sapply(singPairRows, function(singRow)
c(singRow, which(foo[singRow,] == 1)))
noSingFoo <- foo[-singPairs[1,], -singPairs[2,]]
UPDATE 2: I have compared the two methods (mine = non-sparse and @Chris's = sparse) using the rbenchmark package. I have used a range of matrix sizes (from 10 to 1000 rows/columns; square matrices only) and levels of sparsity (from 0.1 to 5 non-zero entries per row/column). The relative level of performance is shown in the heat map below. Equal performance (log2 ratio of run times) is designated by white; red means the sparse method is faster, and blue means the non-sparse method is faster. Note that I am not including the conversion to a sparse matrix in the performance calculation, so that will add some time to the sparse method. I just thought it was worth a little effort to see where this boundary lies.
cr1msonB1ade's way is a great answer. For more computationally intensive matrices (millions x millions), you can use this method:
Encode your matrix in sparse notation:
DT <- structure(list(i = c(1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6), j = c(2,
1, 3, 4, 1, 2, 5, 6, 7, 5, 7), val = c(1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1)), .Names = c("i", "j", "val"), row.names = c(NA, -11L
), class = "data.frame")
Gives (0s are implicit)
> DT
i j val
1 1 2 1
2 2 1 1
3 2 3 1
4 3 4 1
5 4 1 1
6 4 2 1
7 5 5 1
8 5 6 1
9 5 7 1
10 6 5 1
11 6 7 1
Then we can filter using:
library(data.table)
DT <- data.table(DT)
DT[, rowcount := .N, by = i]
DT[, colcount := .N, by = j]
Giving:
> DT[!(rowcount * colcount == 1)]
i j val rowcount colcount
1: 1 2 1 1 2
2: 2 1 1 2 2
3: 2 3 1 2 1
4: 4 1 1 2 2
5: 4 2 1 2 2
6: 5 5 1 3 2
7: 5 6 1 3 1
8: 5 7 1 3 2
9: 6 5 1 2 2
10: 6 7 1 2 2
(Note the (3,4) row is now missing)
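If you need a dense matrix back after the filter, one possible sketch (still using data.table; the dimensions come from the original triplets, so the emptied row 3 and column 4 reappear as all zeros and can then be dropped):

```r
library(data.table)

DT <- data.table(i = c(1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6),
                 j = c(2, 1, 3, 4, 1, 2, 5, 6, 7, 5, 7),
                 val = 1)
DT[, rowcount := .N, by = i]
DT[, colcount := .N, by = j]
keep <- DT[rowcount * colcount != 1]

# Rebuild the dense 0/1 matrix from the surviving triplets, then drop
# the rows/columns that became empty (row 3 and column 4 here).
m <- matrix(0L, nrow = max(DT$i), ncol = max(DT$j))
m[cbind(keep$i, keep$j)] <- 1L
m <- m[rowSums(m) > 0, colSums(m) > 0, drop = FALSE]
dim(m)  # 5 6
```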
