Selecting logical vector elements based on certain conditions - r

I have a situation where I would like to detect conditions between two logical, named vectors based on the TRUE / FALSE combination at each position in the vector. For example:
x <- c(TRUE, FALSE, FALSE, TRUE)
names(x) <- c("a", "b", "c", "d")
y <- c(TRUE, TRUE, FALSE, FALSE)
names(y) <- names(x)
For each element in these two vectors I want to detect 3 conditions:
x[i] is TRUE and y[i] is TRUE;
x[i] is FALSE and y[i] is TRUE;
x[i] is TRUE and y[i] is FALSE.
The lengths of x and y are the same but could be longer than in this example. I want to retrieve the name of the element for each condition and assign it to a new variable. For this example:
v1 <- "a"
v2 <- "b"
v3 <- "d"
In a longer version of these two vectors I might end up with something like:
v1 <- c("a", "e")
v2 <- c("b", "f", "g")
v3 <- c("d", "i", "k", "l")
What is the best vectorized way to do this? I think it is simple, but I am unable to come up with the answer. Thanks in advance.

We can efficiently use split, but before that, we need a single grouping index. Here is a possibility:
g <- x + y + x
split(names(x), g)
To understand the above grouping index, consider this:
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)
x + y + x
#[1] 3 2 1 0
So you can see that 4 combinations of TRUE and FALSE are mapped to 4 integer values.
Ah, so "a" gets assigned to T-T, "b" to T-F, etc. But why the x + y + x? I don't follow adding x twice.
If you only do x + y, the result is only 0, 1 and 2. You won't be able to differentiate T-F and F-T as they are both 1.
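For instance, with the demonstration vectors above:
x + y      #[1] 2 1 1 0   (T-F and F-T both become 1)
x + y + x  #[1] 3 2 1 0   (all four combinations get distinct codes)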
@thelatemail offers a more readable way:
split(names(x), interaction(x, y, drop=TRUE))
Update
Ah... stupid me... Why did I bother creating g? I suddenly remembered that we can pass a list to the f argument of split:
split(names(x), list(x, y))
Note, internally in split.default:
if (is.list(f))
f <- interaction(f, drop = drop, sep = sep)
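For completeness, here is what the split gives on the original named example, and how the three variables from the question can be pulled out (my illustration; the group names and their order come from interaction()'s default levels and separator):
x <- c(a = TRUE, b = FALSE, c = FALSE, d = TRUE)
y <- c(a = TRUE, b = TRUE, c = FALSE, d = FALSE)
v <- split(names(x), list(x, y))
v
#$FALSE.FALSE
#[1] "c"
#
#$TRUE.FALSE
#[1] "d"
#
#$FALSE.TRUE
#[1] "b"
#
#$TRUE.TRUE
#[1] "a"
v1 <- v[["TRUE.TRUE"]]   # "a"  (x TRUE,  y TRUE)
v2 <- v[["FALSE.TRUE"]]  # "b"  (x FALSE, y TRUE)
v3 <- v[["TRUE.FALSE"]]  # "d"  (x TRUE,  y FALSE)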

Related

Test if a string is in the first n positions of the alphabet

As the title suggests, I am looking for an elegant* way to test whether a character is in the first n positions in the alphabet.
So, for a character vector as follows:
names <- c("Brian", "Cormac", "Zachariah")
And with n <- 6
It would return:
TRUE, TRUE, FALSE
*I am aware that I can use substr(names,1,1) %in% c("A", "B", "C", "D", "E", "F"), but I was hoping for a better solution.
EDIT: What I mean by position in the alphabet is whether the first letter is in the first n letters in alphabetical order. So, "A" is in the first n = 1+, "B" is in the first n =2+, "Y" in the first n=25 letters, etc.
PoGibas' comment seems to be as elegant as it gets. The next step would be wrapping it in a function:
cht6_pog <- function(string) {
  x <- toupper(substring(string, 1, 1)) %in% LETTERS[1:6]
  names(x) <- string
  x
}
cht6_pog(names)
Brian Cormac Zachariah
 TRUE   TRUE     FALSE
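If n should be an argument rather than hard-coded to 6, a small generalisation along the same lines could look like this (in_first_n is a hypothetical name, not from the comment):
in_first_n <- function(string, n) {
  # TRUE if the first letter falls within the first n letters of the alphabet
  out <- toupper(substring(string, 1, 1)) %in% LETTERS[seq_len(n)]
  names(out) <- string
  out
}
in_first_n(c("Brian", "Cormac", "Zachariah"), 6)
#Brian Cormac Zachariah
# TRUE   TRUE     FALSE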
Here is my answer for your question.
# fun:
check_char <- function(string, start_n, end_n, char_pattern)
{
  str_list <- strsplit(substr(string, start_n, end_n), "")
  out <- sapply(str_list, function(x) any(tolower(x) %in% tolower(char_pattern)))
  return(out)
}
# args:
str_vec <- c("Google", "Facebook", "Amazon")
str_n <- 1
end_n <- 4
char <- LETTERS[1:6]
# run:
out <- check_char(str_vec, str_n, end_n, char)
print(out)
# [1] FALSE TRUE TRUE

which column is a duplicate column a duplicate of?

I have a data frame with possibly several duplicated columns.
I would like to do 2 things (the following is a reproducible example; in practice I have dozens of columns):
First, extract the unique columns. This can efficiently be accomplished with the following command:
library(dplyr)
df = data_frame(x = rnorm(100), y = rnorm(100), z = y, w = x)
df[!duplicated(as.list(df))]
This outputs columns x and y.
Next, I would like to find specifically which is the duplicate of x (w) and which is the duplicate of y (z). How can I do this efficiently?
Very similar to the answer provided by @Sotos:
library(dplyr)
# example data
df = data_frame(x = rnorm(100), y = rnorm(100), z = y, w = x)
# function to compare vectors
f = function(x,y) { all(df[,x] == df[,y]) }
# vectorise function
f = Vectorize(f)
data.frame(t(combn(names(df),2)), stringsAsFactors = F) %>%
mutate(flag = f(X1, X2))
# X1 X2 flag
# 1 x y FALSE
# 2 x z FALSE
# 3 x w TRUE
# 4 y z TRUE
# 5 y w FALSE
# 6 z w FALSE
You can then focus on the pairs where the flag is TRUE.
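For example (my addition, using the same pipeline), the matching pairs can be pulled out directly with filter:
data.frame(t(combn(names(df), 2)), stringsAsFactors = FALSE) %>%
  mutate(flag = f(X1, X2)) %>%
  filter(flag)
#   X1 X2 flag
# 1  x  w TRUE
# 2  y  z TRUE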
You could use combn to get all pairwise combinations of your columns. Once you have that, there are various metrics to catch equality. Since you are dealing with floats, I would suggest using correlation. If it is 1, then the 2 columns are the same, i.e.
setNames(combn(df, 2, FUN = function(i) all(cor(i) == 1)),
combn(names(df), 2, FUN = toString))
# x, y x, z x, w y, z y, w z, w
#FALSE FALSE TRUE TRUE FALSE FALSE
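One caveat (my note, not part of this answer): a correlation of exactly 1 only indicates a perfect linear relationship, not equality, e.g.:
v <- rnorm(10)
cor(v, 2 * v + 1)    # 1 (up to floating point), yet the two vectors differ
all(v == 2 * v + 1)  # FALSE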
If correlation is not enough, then we can add the variance in there as well, i.e.
setNames(combn(df, 2, FUN = function(i) all(cor(i) == 1) & length(unique(as.vector(var(i)))) == 1),
combn(names(df), 2, toString))
# x, y x, z x, w y, z y, w z, w
#FALSE FALSE TRUE TRUE FALSE FALSE
You can structure the output any way you want.
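Another compact base R sketch (my addition, not one of the posted answers): match the list of columns against itself. This relies on match() accepting lists (elements are compared via their character representation, as in the duplicated(as.list(df)) call from the question), so every column is mapped to the first column it duplicates:
df <- data.frame(x = rnorm(100), y = rnorm(100))
df$z <- df$y
df$w <- df$x
setNames(names(df)[match(as.list(df), as.list(df))], names(df))
#   x   y   z   w
# "x" "y" "y" "x"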

R: Filter vectors by 'two-way' partial match

With two vectors
x <- c("abc", "12")
y <- c("bc", "123", "nomatch")
is there a way to filter both by 'two-way' partial matching (remove elements in one vector if they contain or are contained in any element of the other vector) so that the results are these two vectors:
x1 <- c()
y1 <- c("nomatch")
To explain: every element of x is either a substring or a superstring of some element of y, hence x1 is empty. Update: it is not sufficient for a substring to match the initial characters; a substring might be found anywhere in the string it matches. The example above has been updated to reflect this.
I originally thought ?pmatch might be handy, but your edit clarifies you don't just want to match the start of items. Here's a function that should work:
remover <- function(x, y) {
  pmx <- sapply(x, grep, x = y)
  pmy <- sapply(y, grep, x = x)
  hit <- unlist(c(pmx, pmy))
  list(
    x[!(seq_along(x) %in% hit)],
    y[!(seq_along(y) %in% hit)]
  )
}
remover(x,y)
#[[1]]
#character(0)
#
#[[2]]
#[1] "nomatch"
It correctly does nothing when no match is found (thanks @Frank for picking up the earlier error):
remover("yo","nomatch")
#[[1]]
#[1] "yo"
#
#[[2]]
#[1] "nomatch"
We can do the following:
# Return data.frame of matches of a in b
m <- function(a, b) {
data.frame(sapply(a, function(w) grepl(w, b), simplify = F));
}
# Match x and y and remove
x0 <- x[!apply(m(x, y), 2, any)]
y0 <- y[!apply(m(x, y), 1, any)]
# Match y and x and remove
x1 <- x0[!apply(m(y0, x0), 1, any)]
y1 <- y0[!apply(m(y0, x0), 2, any)]
x1;
#character(0)
y1;
#[1] "nomatch"
I build a matrix of all possible matches in both directions, combine the two with | (since a match in either direction counts as a match), and then use it to subset x and y:
x <- c("abc", "12")
y <- c("bc", "123", "nomatch")
bool_mat <- sapply(x,function(z) grepl(z,y)) | t(sapply(y,function(z) grepl(z,x)))
x1 <- x[!apply(bool_mat,2,any)] # character(0)
y1 <- y[!apply(bool_mat,1,any)] # [1] "nomatch"
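One caveat that applies to all of the grep/grepl based answers here (my note): each element is treated as a regular expression, so strings containing metacharacters such as . or ( could match unintentionally. Passing fixed = TRUE turns it into a literal substring test, e.g. for the last variant:
bool_mat <- sapply(x, function(z) grepl(z, y, fixed = TRUE)) |
            t(sapply(y, function(z) grepl(z, x, fixed = TRUE)))
x1 <- x[!apply(bool_mat, 2, any)]  # character(0)
y1 <- y[!apply(bool_mat, 1, any)]  # [1] "nomatch"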

Minimum distance between elements in two logical vectors

I have two logical vectors x and y and weighted values z corresponding to each index. For each x value that is TRUE, I'd like to find the nearest index at which y is also TRUE, then take the sum of z over the span between the two indices (inclusive). If there are two equally near TRUE y values, the smaller of the two sums of z is used.
x y z
1 FALSE TRUE 0.05647057
2 FALSE FALSE 0.09577802
3 TRUE FALSE 0.04150954
4 FALSE FALSE 0.07242995
5 FALSE TRUE 0.06220041
6 FALSE FALSE 0.01861535
7 FALSE FALSE 0.05056971
8 TRUE FALSE 0.07726933
9 FALSE TRUE 0.04669694
10 TRUE TRUE 0.02312497
There are 3 x values that are TRUE, so we'll call them {x_1, x_2, x_3}. Below I demonstrate summing z over the index range between each x_i and its nearest TRUE y neighbor. What is the most efficient base R way to accomplish this? I have a method at the end that uses two nested apply-type loops, which tells me it's probably not efficient. I don't have a math background, and usually there's some algebraic way to accomplish these sorts of tasks that is vectorized rather than relying on brute computational power.
## x_1
sum(z[3:5]) ## This one is smaller so use it
sum(z[1:3])
## x_2
sum(z[8:9])
## x_3
sum(z[10])
c(sum(z[3:5]), sum(z[8:9]), sum(z[10]))
[1] 0.17613990 0.12396627 0.02312497
MWE:
x <- y <- rep(FALSE, 10)
x[c(3, 8, 10)] <- TRUE
y[c(1, 5, 9, 10)] <- TRUE
set.seed(15)
z <- rnorm(10, .5, .25)/10
data.frame(x=x, y=y, z=z)
Here is an approach that is less than optimal:
dat <- data.frame(x=x, y=y, z=z)
sapply(which(dat[, "x"]), function(x) {
  ylocs <- which(dat[, "y"])
  dists <- abs(x - ylocs)
  min.ylocs <- ylocs[min(dists) == dists]
  min(sapply(min.ylocs, function(y, x2 = x) {
    sum(dat[, "z"][x2:y])
  }))
})
## [1] 0.17613990 0.12396627 0.02312497
I'd prefer to keep the solution within base R.
This uses no loops or apply functions. We use na.locf from zoo to carry the index of the last TRUE y forward, giving fwd, and the index of the next TRUE y backward, giving bck. Finally we take the smaller of the two corresponding sums. This depends on na.locf from the zoo package, but at the end we extract the core code from zoo to avoid the dependency:
library(zoo) # na.locf
x <- dat$x
y <- dat$y
z <- dat$z
yy <- ifelse(y, TRUE, NA) * seq_along(y)
fwd <- na.locf(yy, fromLast = FALSE)[x]
bck <- na.locf(yy, fromLast = TRUE)[x]
cs <- cumsum(z)
pmin(cs[x] - cs[fwd] + z[fwd], cs[bck] - cs[x] + z[x])
The last line gives:
[1] 0.17613990 0.12396627 0.02312497
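To see what the intermediate vectors look like for the MWE (my illustration of the values implied by the definitions above):
yy
# [1]  1 NA NA NA  5 NA NA NA  9 10
fwd  # last TRUE y at or before each TRUE x (x is TRUE at 3, 8, 10)
# [1]  1  5 10
bck  # next TRUE y at or after each TRUE x
# [1]  5  9 10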
Here is a mini version of na.locf. The library call above could be replaced with this.
# code extracted from zoo package; it fills each NA with the index of the
# previous (or, with fromLast = TRUE, next) non-NA entry, which here equals
# the value itself because yy stores positions
na.locf <- function(x, fromLast = FALSE) {
  L <- !is.na(x)
  if (fromLast) rev(c(NA, rev(which(L)))[cumsum(rev(L)) + 1])
  else c(NA, which(L))[cumsum(L) + 1L]
}
REVISED: some improvements.

Test if S4 objects are identical

I have two lists that contain S4 objects. Now I want to ask, for each element of list_1, whether it also occurs in list_2, like I do in the following example for lists of character vectors.
s<-list(a=LETTERS[1:3],b=LETTERS[4:6])
t<-list(n=LETTERS[1:3],v=LETTERS[1:4])
s %in% t
But does this actually test whether the objects are identical? If not, how can I select the elements of list_1 that exist in list_2 without using a loop?
If you want to compare S4 objects I believe you will have to use (as Ben Bolker suggested) a mixture of functions slotNames, slot, and sapply.
setClass("MyClass", representation(Slot1 = "vector", Slot2 = "vector"))
x <- new("MyClass")
x@Slot1 <- 1:4
x@Slot2 <- LETTERS[1:4]
y <- new("MyClass")
y@Slot1 <- 1:4
y@Slot2 <- LETTERS[4:6]
id <- function(a, b) {
  sapply(slotNames(a), function(x) identical(slot(a, x), slot(b, x)))
}
id(x,y)
Slot1 Slot2
TRUE FALSE
And now if you want to extend that to a list of S4 objects, use Metrics' solution on top of that:
X <- new("MyClass")
X@Slot1 <- 1:5
X@Slot2 <- LETTERS[1:4]
Y <- new("MyClass")
Y@Slot1 <- 1:4
Y@Slot2 <- letters[1:4]
a <- list(l1 = x, l2 = X)
b <- list(l1 = y, l2 = Y)
Map(id, a, b)
$l1
Slot1 Slot2
TRUE FALSE
$l2
Slot1 Slot2
FALSE FALSE
You can use Map for that:
Map(function (x,y) x %in% y, s, t)
$a
[1] TRUE TRUE TRUE
$b
[1] TRUE FALSE FALSE
Or, as suggested by @plannapus, just use:
Map(`%in%`,s,t)
Now I define an operator; did I do this the right way?
"%inS4%" <- function(a,b) sapply(a,function(x) any(unlist(Map(identical,list(x),b))))
setClass(Class = "MyClass",
         representation = representation(name = "character",
                                         type = "character"))
a <- list(new("MyClass", name = "abc", type = "abc"), new("MyClass", name = "abc", type = "123"))
b <- list(new("MyClass", name = "abc", type = "123"), new("MyClass", name = "abc", type = "123"))
a %inS4% b
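For reference, with the objects defined above the operator should give the following (my expected output; the first element of a has type "abc", so it has no identical counterpart in b):
a %inS4% b
#[1] FALSE  TRUE
# and the elements of a that also occur in b can be selected with
a[a %inS4% b]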
