Is there any existing R functionality to check whether two columns have a one-to-one relationship (regardless of column type)?
Example of expected output:
A B C
0 'a' 'apple'
1 'b' 'banana'
2 'c' 'apple'
A & B are one-to-one? TRUE
A & C are one-to-one? FALSE
B & C are one-to-one? FALSE
If you match a vector to itself it will return an integer vector giving the first index each unique value occurs at. We can compare these integer vectors directly:
is_one_to_one = function(x, y) {
xu = match(x, x)
yu = match(y, y)
identical(xu, yu)
}
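To see why comparing the two match() results works, here is a small sketch on the question's example. The data frame df is an assumption reconstructed from the printed table, and the helper is repeated in compact form so the snippet runs on its own.

```r
# Example data reconstructed from the table in the question (assumed layout)
df <- data.frame(
  A = 0:2,
  B = c("a", "b", "c"),
  C = c("apple", "banana", "apple")
)

# match(x, x) gives, for each element, the index of its first occurrence
first_A <- match(df$A, df$A)   # 1 2 3
first_C <- match(df$C, df$C)   # 1 2 1 ("apple" repeats)

# Compact copy of the helper: two columns are one-to-one exactly when
# their first-occurrence patterns are the same
is_one_to_one <- function(x, y) identical(match(x, x), match(y, y))

ab <- is_one_to_one(df$A, df$B)   # TRUE
ac <- is_one_to_one(df$A, df$C)   # FALSE
```

Because match() works on any atomic vector, the column types do not need to agree.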
You could then apply this to each pair of columns.
Wrapping it up in a function:
cor_1to1 = function(df) {
mat = vapply(df, \(x) match(x, x), FUN.VALUE = integer(nrow(df)))
nm = combn(colnames(mat), m = 2, FUN = paste, collapse = " :: ")
val = combn(colnames(mat), m = 2, FUN = function(i) {
identical(mat[, i[1]], mat[, i[2]])
}, simplify = TRUE)
setNames(val, nm)
}
# A :: B A :: C B :: C
# TRUE FALSE FALSE
You can do:
one_to_one <- function(data){
data[] <- sapply(data, \(x) match(x, x))
pairs <- t(combn(seq_len(ncol(data)), 2))
cbind(t(matrix(colnames(data)[t(pairs)], nrow = 2)),
One2One = apply(pairs, 1, function(x) all(Reduce(`==`, data[, x])))) |>
as.data.frame()
}
test
one_to_one(df)
# V1 V2 One2One
#1 A B TRUE
#2 A C FALSE
#3 B C FALSE
Given a set of any # of vectors:
a<-c("giraffe", "dolphin", "pig")
b<-c("elephant" , "pig")
c<-c("zebra","cobra","spider","porcupine")
d<-c("porcupine")
e<-c("spider","cobra")
f<-c("elephant","pig","porcupine")
and a target vector:
target<- c("elephant" , "pig","cobra","spider","porcupine")
Is there a way to check if any combinations of the vectors can match the target vector (order doesn't matter)?
In this case, answers would be:
b,d,e
e,f
Clarifying:
I need to know which combinations exactly match the target vector with no duplicates. Any answer that would repeat a value (e.g. b,d,e,f) would not work.
The solutions shown in the question consist of non-overlapping vectors, so we assume that this is a requirement: we are looking to partition the target into disjoint vectors that cover it. If the vectors may overlap, then use >= instead of = or == in the constraints involving A below.
The assumed problem is known as a set partitioning problem and the problem with overlaps is known as a set covering problem.
Assuming the list of vectors L and the target shown in the Note at the end, form the objective (all ones), the incidence matrix A of animals by vectors, and the right-hand side rhs of the constraint equations derived from the target, and then run the linear program shown.
If a solution is found then we add a constraint that will eliminate it in the next iteration by insisting that at least one of its zeros be one. We iterate 5 times (i.e. up to 5 solutions) or until we can find no more solutions.
We show a solution using the lpSolveAPI package and then in the section after that repeat it using the CVXR package.
lpSolveAPI
library(lpSolveAPI)
animals <- sort(unique(unlist(L)))
A <- +outer(animals, L, Vectorize(`%in%`))
rownames(A) <- animals
nr <- nrow(A)
nc <- ncol(A)
rhs <- rownames(A) %in% target
lp <- make.lp(nr, nc)
set.objfn(lp, rep(1, nc))
for(i in 1:nr) add.constraint(lp, A[i, ], "=", rhs[i])
for(j in 1:nc) set.type(lp, j, type = "binary")
soln <- solns <- NULL
for(s in 1:5) {
if (!is.null(soln)) add.constraint(lp, 1-soln, ">=", 1)
if (solve(lp) != 0) break
soln <- get.variables(lp)
solns <- c(solns, list(names(L)[soln == 1]))
}
solns
## [[1]]
## [1] "e" "f"
##
## [[2]]
## [1] "b" "d" "e"
CVXR
An alternative to lpSolveAPI is CVXR. We use nc, A and rhs from above. Below we find up to 5 solutions.
library(CVXR)
x <- Variable(nc, boolean = TRUE)
objective <- Minimize(sum(x))
constraints <- list(A %*% x == matrix(rhs))
solns <- soln <- NULL
for(i in 1:5) {
if (!is.null(soln)) constraints <- c(constraints, sum((1 - soln) * x) >= 1)
prob <- Problem(objective, constraints)
result <- solve(prob)
if (result$status != "optimal") break
soln <- result$getValue(x)
solns <- c(solns, list(names(L)[soln == 1]))
}
solns
## [[1]]
## [1] "e" "f"
##
## [[2]]
## [1] "b" "d" "e"
Note
L <- within(list(), {
a <- c("giraffe", "dolphin", "pig")
b <- c("elephant" , "pig")
c <- c("zebra","cobra","spider","porcupine")
d <- c("porcupine")
e <- c("spider","cobra")
f <- c("elephant","pig","porcupine")
})
L <- L[order(names(L))]
target<- c("elephant" , "pig","cobra","spider","porcupine")
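With only six vectors, the constraint "A times x equals rhs" can also be cross-checked by brute force in base R, without a solver. This is only a sketch for small inputs (it enumerates all 2^n selections), and the names X, ok and partitions are introduced here for illustration.

```r
L <- list(
  a = c("giraffe", "dolphin", "pig"),
  b = c("elephant", "pig"),
  c = c("zebra", "cobra", "spider", "porcupine"),
  d = "porcupine",
  e = c("spider", "cobra"),
  f = c("elephant", "pig", "porcupine")
)
target <- c("elephant", "pig", "cobra", "spider", "porcupine")

animals <- sort(unique(unlist(L)))

# Incidence matrix: one row per animal, one column per vector
A <- sapply(L, function(v) as.integer(animals %in% v))
rownames(A) <- animals
rhs <- as.integer(animals %in% target)

# Every 0/1 selection of the vectors, one selection per row
X <- as.matrix(expand.grid(rep(list(0:1), ncol(A))))

# A selection is a partition of the target when each target animal is
# covered exactly once and every other animal is covered zero times
ok <- apply(X, 1, function(x) all(A %*% x == rhs))
partitions <- unname(lapply(which(ok), function(i) names(L)[X[i, ] == 1]))
partitions
```

For larger inputs the enumeration grows exponentially, which is where the integer-programming formulation becomes the more practical choice.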
First convert your vectors into a list: l <- list(a = a, b = b, c = c, d = d, e = e, f = f)
In base R you can use lapply:
unlist(lapply(l, FUN = function(x) all(x %in% target)))
a b c d e f
FALSE TRUE FALSE TRUE TRUE TRUE
You could accomplish this with the purrr library function imap_lgl:
library(purrr)
purrr::imap_lgl(l, ~ all( . %in% target))
a b c d e f
FALSE TRUE FALSE TRUE TRUE TRUE
If you pipe the result into names you can get a character vector of the names instead, if you prefer:
purrr::imap_lgl(l, ~ all( . %in% target)) %>%
names(.)[.]
[1] "b" "d" "e" "f"
Both of these solutions use all and the operator %in%. %in% tests, element by element, whether each value of the LHS vector is in the RHS vector:
a %in% target
[1] FALSE FALSE TRUE
all(a %in% target)
[1] FALSE
Since "giraffe" and "dolphin" are not in target the first two values return FALSE and the last value is TRUE since "pig" is in target. all tests if all values of a vector are TRUE. Since not all values of a are in target it returns FALSE.
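Note that all(x %in% target) is a one-directional check: it says that x is contained in target, not that x covers it. A minimal sketch of the difference, using vectors from the question (the result names are introduced here for illustration):

```r
target <- c("elephant", "pig", "cobra", "spider", "porcupine")
b <- c("elephant", "pig")
d <- "porcupine"
e <- c("spider", "cobra")

# b is contained in target ...
b_in_target <- all(b %in% target)        # TRUE

# ... but target is not contained in b, so b alone does not match it
target_in_b <- all(target %in% b)        # FALSE

# setequal() checks both directions at once, ignoring order
bde_matches <- setequal(c(b, d, e), target)   # TRUE
```

So this test identifies the vectors that could take part in a match; deciding which combinations actually reproduce the target is a separate step.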
Try this:
Build a list with your vectors
vec_list <- list(a, b, c, d, e, f)
names(vec_list) <- c("a", "b", "c", "d", "e", "f")
Write a function that identifies matches
match_elem <- function(i, the_list, target) {
if (all( the_list[[i]] %in% target)) {
return(names(the_list)[[i]])
}
}
Apply match_elem to each element of the list
unlist(lapply(1:6, match_elem, vec_list, target))
> "b" "d" "e" "f"
A base R option using combn
lst <- list(a, b, c, d, e, f)
nms <- c("a", "b", "c", "d", "e", "f")
names(
Filter(
isTRUE,
unlist(
lapply(
seq_along(lst),
function(k) {
setNames(
combn(lst, k, FUN = function(v) !(length(setdiff(unlist(v), target)) + length(setdiff(target, unlist(v))))),
combn(nms, k, toString)
)
}
)
)
)
)
or
subset(
unlist(
lapply(
seq_along(nms), function(k) combn(nms, k, toString)
)
),
unlist(
lapply(
seq_along(lst),
function(k) combn(lst, k, FUN = function(v) !(length(setdiff(unlist(v), target)) + length(setdiff(target, unlist(v)))))
)
)
)
gives
[1] "e, f" "b, d, e" "b, e, f" "d, e, f" "b, d, e, f"
Update
If you do need to find exclusive combinations, i.e., without overlap, we can additionally require that the combined vectors contain no duplicated values:
subset(
unlist(
lapply(
seq_along(nms), function(k) combn(nms, k, toString)
)
),
unlist(
lapply(
seq_along(lst),
function(k) combn(lst, k, FUN = function(v) !anyDuplicated(unlist(v)) && setequal(unlist(v), target))
)
)
)
or
names(
Filter(
isTRUE,
unlist(
lapply(
seq_along(lst),
function(k) {
setNames(
combn(lst, k, FUN = function(v) !anyDuplicated(unlist(v)) && setequal(unlist(v), target)),
combn(nms, k, toString)
)
}
)
)
)
)
which gives
[1] "e, f" "b, d, e"
I'm working on a project where I have to apply the same transformation to multiple variables. For example
a <- a + 1
b <- b + 1
d <- d + 1
e <- e + 1
I can obviously perform the operations in sequence using
for (i in c(a, b, d, e)) i <- i + 1
However, I can't actually assign the result to each variable this way, since i is a copy of each variable, not a reference.
Is there a way to do this? Obviously, it'd be easier if the variables were merged in a data.frame or something, but that's not possible.
Usually, if you find yourself doing the same thing to multiple objects, they should be stored and thought of as a single object with sub-components. You say that storing these as a data.frame is not possible, so you can use a list instead. This allows you to use lapply/sapply to apply a function to each element of the list in one step.
a <- c(1, 2, 3)
b <- c(1, 4)
c <- 5
d <- rnorm(10)
e <- runif(5)
lstt <- list(a = a, b = b, c = c, d = d, e = e)
lstt$a
# [1] 1 2 3
lstt <- lapply(lstt, '+', 1)
lstt$a
# [1] 2 3 4
The question states that the variables to increment cannot be in a larger structure, but the comments say that this is not so after all, so we will assume they are in a list L.
L <- list(a = 1, b = 2, d = 3, e = 4) # test data
for(nm in names(L)) L[[nm]] <- L[[nm]] + 1
# or
L <- lapply(L, `+`, 1)
# or
L <- lapply(L, function(x) x + 1)
Scalars
If they are all scalars then they can be put in an ordinary vector:
v <- c(a = 1, b = 2, d = 3, e = 4)
v <- v + 1
Vectors
If they are all vectors of the same length they can be put in a data frame, or, if they are also of the same type, in a matrix, in which case we can also add 1 to it.
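A minimal sketch of that case, using made-up equal-length numeric vectors (the values and the names m and DF are assumptions for illustration):

```r
# Hypothetical test data: four numeric vectors of the same length
a <- 1:3
b <- 4:6
d <- 7:9
e <- 10:12

# Same length and same type: bind into a matrix and add 1 to every cell
m <- cbind(a, b, d, e)
m <- m + 1

# Same length only: a data frame, incrementing each column in turn
DF <- data.frame(a, b, d, e)
DF[] <- lapply(DF, function(x) x + 1)

m[, "a"]   # 2 3 4
DF$e       # 11 12 13
```

The `DF[] <-` form keeps the object a data frame while replacing every column.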
Environment
If the variables do have to be free in an environment, and nms is a vector of the variable names, then we can iterate over the names and use them to subscript the environment env. If the names follow some pattern we may be able to use nms <- ls(pattern = "...", envir = env), or if they are the only variables in that environment we can use nms <- ls(env).
a <- b <- d <- e <- 1 # test data
env <- .GlobalEnv # can change this if not being done in global envir
nms <- c("a", "b", "d", "e")
for(nm in nms) env[[nm]] <- env[[nm]] + 1
a;b;d;e # check
## [1] 2
## [1] 2
## [1] 2
## [1] 2
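The pattern-based variant mentioned above might look like the following sketch. It uses a fresh environment rather than the global one, and the variable names (score_a, score_b, other) and the "^score_" pattern are hypothetical.

```r
# A separate environment holding a few variables (hypothetical names)
env <- new.env()
env$score_a <- 1
env$score_b <- 10
env$other <- 100   # does not match the pattern, so it is left alone

# Select the variables by name pattern, then update each one in place
nms <- ls(pattern = "^score_", envir = env)
for (nm in nms) env[[nm]] <- env[[nm]] + 1

# Inspect the result as a named list
mget(c("score_a", "score_b", "other"), envir = env)
```

Only the variables whose names match the pattern are incremented.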
Let's assume we have 4 vectors
a <- c(200,204,209,215)
b <- c(215,220,235,245)
c <- c(230,236,242,250)
d <- c(240,242,243,267)
I basically want to create a loop that computes the differences between each pair and then calculates the Z-scores for those differences, so something like scale(d-a). How do I create a loop that goes scale(b-a), then scale(c-a), scale(d-a), etc.? Many thanks.
Single named variables don't lend themselves too well to "looping".
Let's use a list() of vectors instead:
vecs <- list(
a = c(200,204,209,215),
b = c(215,220,235,245),
c = c(230,236,242,250),
d = c(240,242,243,267)
)
This allows us to apply a function to all pairs using combn
scale_diff <- function(subset) {
z <- scale(subset[[1]] - subset[[2]])
colnames(z) <- paste(names(subset), collapse = " - ")
z
}
z_scores <- combn(vecs, 2, scale_diff, simplify = FALSE)
Now z_scores is a list of 6 matrices (column vectors). The column names show you which vectors were subtracted before scaling.
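If a single matrix is more convenient than a list, the one-column results can be bound together. A sketch, repeating the definitions above so that it runs on its own (z_mat is a name introduced here):

```r
vecs <- list(
  a = c(200, 204, 209, 215),
  b = c(215, 220, 235, 245),
  c = c(230, 236, 242, 250),
  d = c(240, 242, 243, 267)
)

# Scale the difference of one pair and label the column with the pair's names
scale_diff <- function(subset) {
  z <- scale(subset[[1]] - subset[[2]])
  colnames(z) <- paste(names(subset), collapse = " - ")
  z
}

z_scores <- combn(vecs, 2, scale_diff, simplify = FALSE)

# Bind the six one-column matrices side by side; column names are kept
z_mat <- do.call(cbind, z_scores)
dim(z_mat)        # 4 rows, 6 columns
colnames(z_mat)
```

Binding drops the "scaled:center" and "scaled:scale" attributes that scale() attaches, so keep the list if those are needed later.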
We can place it in a list and use combn to get the combinations and then apply the difference
lst1 <- list(a = a, b = b, c = c, d = d)
out <- combn(lst1, 2, FUN = function(x) scale(Reduce(`-`, x))[,1])
colnames(out) <- combn(names(lst1), 2, FUN = paste, collapse='_')
out
# a_b a_c a_d b_c b_d c_d
#[1,] 0.9108601 1.2009612 0.1290994 -0.7643506 -0.753390 -0.2219686
#[2,] 0.7759179 0.2401922 0.3872983 -0.9441978 -0.360317 0.3699477
#[3,] -0.5735045 -0.2401922 0.9036961 0.6744270 1.474024 1.1098432
#[4,] -1.1132735 -1.2009612 -1.4200939 1.0341214 -0.360317 -1.2578222
As @AlexR mentioned in the comments, if the attributes are important, then remove [,1] and use simplify = FALSE to keep each result as a one-column matrix
out <- combn(lst1, 2, FUN = function(x) scale(Reduce(`-`, x)), simplify = FALSE)
I am trying to remove a named component from a list, using within and rm. This works for a single component, but not for two or more. I am completely befuddled.
For example - this works
aa = list(a = 1:3, b = 2:5, cc = 1:5)
within(aa, {rm(a)})
the output from within will have just the non-removed components.
However, this does not:
aa = list(a = 1:3, b = 2:5, cc = 1:5)
within(aa, {rm(a); rm(b)})
Neither does this:
within(aa, {rm(a, b)})
The output from within will have all the components, with the ones I am trying to remove, set to NULL. Why?
First, note the following behavior:
> aa = list(a = 1:3, b = 2:5, cc = 1:5)
>
> aa[c('a', 'b')] <- NULL
>
> aa
# $cc
# [1] 1 2 3 4 5
> aa = list(a = 1:3, b = 2:5, cc = 1:5)
>
> aa[c('a', 'b')] <- list(NULL, NULL)
>
> aa
# $a
# NULL
#
# $b
# NULL
#
# $cc
# [1] 1 2 3 4 5
Now let's look at the code for within.list:
within.list <- function (data, expr, ...)
{
parent <- parent.frame()
e <- evalq(environment(), data, parent)
eval(substitute(expr), e)
l <- as.list(e)
l <- l[!sapply(l, is.null)]
nD <- length(del <- setdiff(names(data), (nl <- names(l))))
data[nl] <- l
if (nD)
data[del] <- if (nD == 1) NULL else vector("list", nD)
data
}
Look in particular at the second-to-last line of the function. If the number of deleted items in the list is greater than one, the function is essentially calling aa[c('a', 'b')] <- list(NULL, NULL), because vector("list", 2) creates a two-item list where each item is NULL. We can create our own version of within where we remove the else branch from that line:
mywithin <- function (data, expr, ...)
{
parent <- parent.frame()
e <- evalq(environment(), data, parent)
eval(substitute(expr), e)
l <- as.list(e)
l <- l[!sapply(l, is.null)]
nD <- length(del <- setdiff(names(data), (nl <- names(l))))
data[nl] <- l
if (nD) data[del] <- NULL
data
}
Now let's test it:
> aa = list(a = 1:3, b = 2:5, cc = 1:5)
>
> mywithin(aa, rm(a, b))
# $cc
# [1] 1 2 3 4 5
Now it works as expected!
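The two assignment forms can also be compared directly, without going through within at all. The sketch below also shows a plain subsetting alternative for dropping several components; the names del, removed, nulled and kept are introduced here for illustration.

```r
aa <- list(a = 1:3, b = 2:5, cc = 1:5)
del <- c("a", "b")

# vector("list", n) is a list of n NULLs
same <- identical(vector("list", 2), list(NULL, NULL))   # TRUE

# Assigning a bare NULL removes the components ...
removed <- aa
removed[del] <- NULL
names(removed)   # "cc"

# ... while assigning a list of NULLs keeps them, each set to NULL
nulled <- aa
nulled[del] <- vector("list", length(del))
names(nulled)    # "a" "b" "cc"

# Alternative without within(): keep only the components not being dropped
kept <- aa[setdiff(names(aa), del)]
names(kept)      # "cc"
```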