Which rows match a given vector in R

I have a matrix A,
A = as.matrix(data.frame(col1 = c(1,1,2,3,1,2), col2 = c(-1,-1,-2,-3,-1,-2), col3 = c(2,6,1,3,2,4)))
And I have a vector v,
v = c(-1, 2)
How can I get a vector of TRUE/FALSE values that is TRUE where the last two columns of a row match the vector and FALSE where they don't?
For example, if I try,
A[,c(2:3)] == v
I obtain,
col2 col3
[1,] TRUE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
[5,] TRUE FALSE
[6,] FALSE FALSE
This is not what I want. I want both columns to be the same as the vector v, more like,
result = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
since the first and fifth rows match the vector v entirely.

Here's a simple alternative
> apply(A[, 2:3], 1, function(x) all(x==v))
[1] TRUE FALSE FALSE FALSE TRUE FALSE

Oops, by looking through the R mailing list I found an answer (https://stat.ethz.ch/pipermail/r-help/2010-September/254096.html):
check.equal <- function(x, y)
{
isTRUE(all.equal(y, x, check.attributes=FALSE))
}
result = apply(A[,c(2:3)], 1, check.equal, y=v)
I'm not sure I need to define a function and do all that; maybe there are easier ways to do it.
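One reason someone might still prefer the all.equal() version over a plain == comparison is floating-point tolerance. A minimal sketch, using made-up values (a and b are not from the question), of how the two can differ:

```r
# Illustrative values only: 0.1 + 0.2 is not exactly 0.3 in floating point.
a <- c(0.1 + 0.2, 1)
b <- c(0.3, 1)

exact    <- all(a == b)              # exact, element-wise comparison
tolerant <- isTRUE(all.equal(a, b))  # comparison within a small numeric tolerance

exact
# [1] FALSE
tolerant
# [1] TRUE
```

For the whole-number values in the matrix A above, both comparisons should give the same answer.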

Here's another straightforward option:
which(duplicated(rbind(A[, 2:3], v), fromLast=TRUE))
# [1] 1 5
results <- rep(FALSE, nrow(A))
results[which(duplicated(rbind(A[, 2:3], v), fromLast=TRUE))] <- TRUE
results
# [1] TRUE FALSE FALSE FALSE TRUE FALSE
Alternatively, as one line:
duplicated(rbind(A[, 2:3], v), fromLast=TRUE)[-(nrow(A)+1)]
# [1] TRUE FALSE FALSE FALSE TRUE FALSE
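One caveat with the duplicated() approach, as far as I can tell: it flags any row that has a later duplicate, so rows of A that duplicate each other are marked TRUE even when the vector matches nothing. The sketch below illustrates this with a hypothetical vector v_none that is not in A, and shows a column-wise comparison (the helper name match_rows is made up for this example) that avoids the problem.

```r
A <- as.matrix(data.frame(col1 = c(1, 1, 2, 3, 1, 2),
                          col2 = c(-1, -1, -2, -3, -1, -2),
                          col3 = c(2, 6, 1, 3, 2, 4)))
v_none <- c(9, 9)  # hypothetical vector that matches no row of A[, 2:3]

# Rows 1 and 5 of A[, 2:3] are identical, so row 1 is flagged as having a
# later duplicate even though it does not match v_none.
via_duplicated <- duplicated(rbind(A[, 2:3], v_none), fromLast = TRUE)[-(nrow(A) + 1)]
via_duplicated
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE

# Compare every column with the corresponding element of the vector instead:
# a row matches when all of its length(v) comparisons are TRUE.
match_rows <- function(m, v) rowSums(m == rep(v, each = nrow(m))) == length(v)

via_columns <- match_rows(A[, 2:3], v_none)
via_columns
# [1] FALSE FALSE FALSE FALSE FALSE FALSE
```

With the original v = c(-1, 2), match_rows(A[, 2:3], c(-1, 2)) flags rows 1 and 5, the same as the other answers.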

A dirty one:
result <- logical(nrow(A))
# compare columns 2 and 3 of each row with v
for (n in seq_len(nrow(A))) result[n] <- all(A[n, -1] == v)
> result
[1] TRUE FALSE FALSE FALSE TRUE FALSE

Related

Problem with a vectorized operation with 2 logical vectors: replace values in first vector to NA according to FALSE values in the other vector

I have a question about a vectorized operation with logical vectors. In my problem, there are two vectors: main and secondary. They're both of the same length. I want to replace some elements in the main vector with NA, based on information I gather from the secondary vector.
The main vector is comprised of TRUE and FALSE that can appear in any random order.
The secondary vector is either:
a sequence of TRUE then a sequence of FALSE with/without NA as the last element; or
all TRUE; or
all FALSE; or
all FALSE with last element as NA; or
all TRUE with last element as NA
I'll provide several examples below and explain the desired algorithm.
A - The most common case
replace x values with NA for positions that are FALSE in y
# here, `x` is the main vector
x <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, NA)
# `y` is the secondary vector
y <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
# `z` is the desired output
z <- c(FALSE, FALSE, TRUE, FALSE, FALSE, NA, NA) # in this case only index 6 in `x` actually changed (index 7 was already NA)
B - The secondary vector is all FALSE
x2 <- c(FALSE, NA)
y2 <- c(FALSE, FALSE)
# desired output
z2 <- c(NA, NA)
C - The secondary vector is all TRUE
x3 <- rep(FALSE, 4)
y3 <- rep(TRUE, 4)
# desired output
z3 <- rep(FALSE, 4)
My attempt
I've almost figured this out, but unfortunately it fails with scenario B.
my_func <- function(main, secondary) {
idx_last_true_in_secondary <- max(which(secondary))
if(idx_last_true_in_secondary == length(secondary)) {
return(main)
}
main[(idx_last_true_in_secondary + 1): length(main)] <- NA
main
}
# case A
my_func(x, y)
#> [1] FALSE FALSE TRUE FALSE FALSE NA NA
# case B
my_func(x2, y2)
#> Warning in max(which(secondary)): no non-missing arguments to max; returning
#> -Inf
#> Error in (idx_last_true_in_secondary + 1):length(main): result would be too long a vector
# case C
my_func(x3, y3)
#> [1] FALSE FALSE FALSE FALSE
My question is whether anyone sees a better way to approach the problem?
This seems to work as expected:
my_func <- function(main,secondary ) {
main[!secondary] <- NA
return(main)
}
my_func(x,y)
[1] FALSE FALSE TRUE FALSE FALSE NA NA
my_func(x2,y2)
[1] NA NA
my_func(x3,y3)
[1] FALSE FALSE FALSE FALSE
We could do it this way:
my_func <- function(x, y) {
replace(x, !y, NA)
}
How it works:
# A
replace(x, !y, NA)
[1] FALSE FALSE TRUE FALSE FALSE NA NA
# B
replace(x2, !y2, NA)
[1] NA NA
# C
replace(x3, !y3, NA)
[1] FALSE FALSE FALSE FALSE
my_func(x,y)
my_func(x2,y2)
my_func(x3,y3)
output:
> my_func(x,y)
[1] FALSE FALSE TRUE FALSE FALSE NA NA
> my_func(x2,y2)
[1] NA NA
> my_func(x3,y3)
[1] FALSE FALSE FALSE FALSE
We can try ifelse like below
> ifelse(y,x,NA)
[1] FALSE FALSE TRUE FALSE FALSE NA NA
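The question notes that the secondary vector may end with NA, a case none of the examples cover. A sketch with hypothetical vectors x4 and y4 (not from the question) suggests the answers above differ there: an NA in a logical index selects nothing in an assignment, so the index-based versions leave that element untouched, whereas ifelse() returns NA.

```r
# Hypothetical case: the secondary vector ends with NA.
x4 <- c(TRUE, FALSE, TRUE)
y4 <- c(TRUE, FALSE, NA)

# replace(x, !y, NA) is x[!y] <- NA; the NA in the index replaces nothing,
# so the last element of x4 is kept as it is.
by_index <- replace(x4, !y4, NA)
by_index
# [1] TRUE   NA TRUE

# ifelse() propagates the NA in the test.
by_ifelse <- ifelse(y4, x4, NA)
by_ifelse
# [1] TRUE   NA   NA
```

Which behaviour is wanted depends on how a trailing NA in the secondary vector should be treated.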

Identifying positions of the last TRUEs in a sequence of TRUEs and FALSEs

I have a vector of TRUEs and FALSEs:
x <- c(F,F,F,T,T,T,F,F,F,T,T,T,F,T,T)
I'd like to elegantly (and in base) identify the position of the last TRUE before it changes to FALSE.
The following works, though, it seems like it could be simplified:
c((x[-1] != x[-length(x)]),T) & x
# [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
Taking advantage of diff with an appended FALSE to catch the implied TRUE-to-FALSE at the end.
diff(c(x,FALSE)) == -1
# [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
#[13] FALSE FALSE TRUE
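To see why the diff() approach works, here is a small sketch on a shorter, made-up vector s (not the x from the question) that prints the intermediate step:

```r
# A short illustrative vector.
s <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# Logicals are treated as 0/1, so each step is -1 (TRUE to FALSE),
# 0 (no change) or 1 (FALSE to TRUE). The appended FALSE closes a trailing run.
steps <- diff(c(s, FALSE))
steps
# [1]  1  0 -1  1 -1

last_true <- steps == -1
last_true
# [1] FALSE FALSE  TRUE FALSE  TRUE
```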
We may look where x is greater than shifted x with 0 appended.
x>c(x[-1],0)
# [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
Check rle
rlex = rle(x)
end = cumsum(rlex$lengths)
x&(seq(length(x)) %in% end)
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
Another layout suggested by Frank
seq_along(x) %in% with(rle(x), cumsum(lengths)[values])
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
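The rle() answers are easier to follow with the intermediate values spelled out. A sketch using the x from the question (the names runs, run_ends and true_run_ends are only for illustration):

```r
x <- c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
       TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)

runs <- rle(x)
runs$lengths
# [1] 3 3 3 3 1 2
runs$values
# [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

# The cumulative sum of the run lengths gives the position where each run ends.
run_ends <- cumsum(runs$lengths)
run_ends
# [1]  3  6  9 12 13 15

# Keep only the ends of the TRUE runs.
true_run_ends <- run_ends[runs$values]
true_run_ends
# [1]  6 12 15

result <- seq_along(x) %in% true_run_ends
```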
Another version with rle
x[setdiff(seq_along(x), with(rle(x), cumsum(lengths) * values))] <- FALSE
x
#[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
An option with duplicated
library(data.table)
!duplicated(rleid(x), fromLast = TRUE) & x
#[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
Non-base solution for identifying the last TRUE before a FALSE.
library(dplyr)
y <- data.frame(x = c(FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,
FALSE,TRUE,TRUE,TRUE,FALSE,TRUE,TRUE))
y %>%
mutate(lasttrue = case_when(x == TRUE & lead(x) == FALSE ~ TRUE,
TRUE ~ FALSE))
Edit:
y %>%
mutate(lasttrue = case_when(x > lead(x) ~ T,
T ~ F))
benchmarks
Thanks for all the solutions. If anyone is interested in benchmarks:
library(dplyr)
library(data.table)
set.seed(1)
x <- sample(c(TRUE, FALSE), 1000000, replace = T)
y <- data.frame(x = x) # For M. Viking's solution
x_dt <- x # For Ronak Shah's solution
microbenchmark::microbenchmark(Khaynes = {Khaynes <- c((x[-1] != x[-length(x)]),T) & x},
jay.sf = {jay.sf <- x>c(x[-1],0)},
jay.sf_2 = {jay.sf_2 <- diff(c(x,0))<0},
thelatemail = {thelatemail <- diff(c(x,FALSE)) == -1},
WeNYoBen = {rlex = rle(x); end = cumsum(rlex$lengths); WeNYoBen <- x&(seq(length(x)) %in% end)},
M._Viking = {M._Viking <- y %>% mutate(lasttrue = case_when(x > lead(x) ~ T, T ~ F))},
akrun = {akrun <- !duplicated(rleid(x), fromLast = TRUE) & x},
frank = {frank <- seq_along(x) %in% with(rle(x), cumsum(lengths)[values])},
Ronak_Shah = {x_dt[setdiff(seq_along(x_dt), with(rle(x_dt), cumsum(lengths) * values))] <- FALSE},
times = 50)
# Output:
# Unit: milliseconds
# expr min lq mean median uq max neval
# Khaynes 23.0283 26.5010 31.76180 31.71290 37.1449 46.3824 50
# jay.sf 13.0630 13.5373 17.84056 13.77135 20.5462 73.5926 50
# jay.sf_2 26.1960 27.7653 35.25296 36.39615 39.3686 61.8858 50
# thelatemail 24.8204 26.7178 32.51675 33.50165 36.6328 41.9279 50
# WeNYoBen 83.9070 98.4700 107.79965 101.88475 107.1933 170.2940 50
# M._Viking 73.5963 83.4467 93.99603 86.58535 94.0915 151.7075 50
# akrun 42.5265 43.2879 48.42697 44.98085 51.1533 105.2836 50
# frank 81.9115 90.1559 95.40261 93.97015 98.2921 129.6162 50
# Ronak_Shah 109.0678 121.8230 133.10690 125.63930 133.7222 231.5350 50
all.equal(Khaynes, jay.sf)
all.equal(Khaynes, jay.sf_2)
all.equal(Khaynes, thelatemail)
all.equal(Khaynes, WeNYoBen)
all.equal(Khaynes, M._Viking$lasttrue) # Differs when the last element of x is TRUE: lead() is NA there, so lasttrue is FALSE.
all.equal(Khaynes, akrun)
all.equal(Khaynes, frank)
all.equal(Khaynes, x_dt) # Ronak Shah solution.

Extract rows from data frame

#Simulated data /model output
data1 <-data.frame(col1=c(10,20,30,40,50,60,70,80,90,100),
col2=c(2,4,6,8,10,12,14,16,20,22),
col3=c(3,9,12,15,18,21,24,27,30,33),
col4= c(4,8,12,16,20,24,28,32,36,40))
#Lower 95% CI from real data
lowdata <- as.data.frame(matrix(0,1,4))
lowdata[,1] <-5
lowdata[,2] <-34
lowdata[,3] <-25
lowdata[,4] <-30
# Higher 95% CI from real data
highdata <-as.data.frame(matrix(0,1,4))
highdata[,1] <- 59
highdata[,2] <- 60
highdata[,3] <- 50
highdata[,4] <- 49
I am new to R programming and not sure how to do the following.
I basically want to find out which rows of data1 have values (for each column) that are within the CIs defined by the two data frames (lowdata and highdata).
I have tried this, but it takes each value and not each row:
wmax <- which(data1[,1:4] < highdata[,1:4])
wmin <- which(data1[,1:4] > lowdata[,1:4])
w <- intersect(wmax, wmin)
How can I extract the rows in data1 that "fit" between the CIs?
You can check whether each element of each column is in the corresponding interval with
is.btwn <-
Map(function(x, low, high) x >= low & x <= high,
data1, lowdata, highdata)
is.btwn
# $col1
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
#
# $col2
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#
# $col3
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
#
# $col4
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
You can use Reduce to get rows where all columns are in the interval, but in this case there are no such rows
Reduce(`&`, is.btwn)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Or the rows with any column in the corresponding interval
Reduce(`|`, is.btwn)
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
Any of these results can be passed to which() to get the indices of the TRUE values.
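As a sketch of that last step, the following rebuilds the data from the question (lowdata and highdata are constructed with data.frame() here for brevity, which is my shortcut rather than the original code) and turns the logical vectors into row indices:

```r
data1 <- data.frame(col1 = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
                    col2 = c(2, 4, 6, 8, 10, 12, 14, 16, 20, 22),
                    col3 = c(3, 9, 12, 15, 18, 21, 24, 27, 30, 33),
                    col4 = c(4, 8, 12, 16, 20, 24, 28, 32, 36, 40))
lowdata  <- data.frame(V1 = 5,  V2 = 34, V3 = 25, V4 = 30)
highdata <- data.frame(V1 = 59, V2 = 60, V3 = 50, V4 = 49)

is.btwn <- Map(function(x, low, high) x >= low & x <= high,
               data1, lowdata, highdata)

# Row indices where at least one column falls inside its interval
any_rows <- which(Reduce(`|`, is.btwn))
any_rows
# [1]  1  2  3  4  5  8  9 10

# Row indices where every column falls inside its interval (none here)
all_rows <- which(Reduce(`&`, is.btwn))
all_rows
# integer(0)

# The indices can then be used to extract the rows themselves
data1[any_rows, ]
```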

Applying operators stored in a vector as text

I would like to apply operators stored in a vector operatorsUsed to series1 and series2 of the data frame df:
operatorsUsed = c('==', '>=', '<=')
series1 = 1:5
series2 = c(1, 3, 2, 4, 5)
df = data.frame(series1,
series2,
stringsAsFactors = FALSE)
I tried combining the parse() and eval() function:
nbrOperators = length(operatorsUsed)
for (j in 1:nbrOperators){
a = df[eval(parse(text = paste0(df$series1, operatorsUsed[j], df$series2))),]
tableCreated = paste0('b', j)
assign(tableCreated, a)
}
But this doesn't work. With parse, I obtain, e.g. for j = 1,
expression(1==1, 2==3, 3==2, 4==4, 5==5)
Which looks promising but then applying eval yields
[1] TRUE
rather than the expected
[1] TRUE FALSE FALSE TRUE TRUE
Is there a way I can apply operators stored in a vector as text?
We can use lapply with get
lapply(operatorsUsed, function(op) get(op)(df$series1, df$series2))
#[[1]]
#[1] TRUE FALSE FALSE TRUE TRUE
#[[2]]
#[1] TRUE FALSE TRUE TRUE TRUE
#[[3]]
#[1] TRUE TRUE FALSE TRUE TRUE
As @rawr mentioned in the comments, we can also use match.fun(op) instead of get(op) in the lapply call.
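A sketch of the match.fun() variant, with the results named by operator via setNames() (the naming and the variable names results, b2, exprs are my additions, not part of the answer). The last lines also illustrate why the eval(parse(...)) attempt returned a single TRUE: as far as I understand, eval() on an expression vector evaluates every element but returns only the last value.

```r
operatorsUsed <- c('==', '>=', '<=')
df <- data.frame(series1 = 1:5, series2 = c(1, 3, 2, 4, 5))

# Look up each operator as a function and apply it to the two columns.
results <- lapply(setNames(operatorsUsed, operatorsUsed),
                  function(op) match.fun(op)(df$series1, df$series2))
results[['>=']]
# [1]  TRUE FALSE  TRUE  TRUE  TRUE

# Subset the data frame with one of the results
b2 <- df[results[['>=']], ]

# eval() on an expression vector returns only the value of the last element
exprs <- parse(text = paste0(df$series1, '==', df$series2))
last_only <- eval(exprs)
last_only
# [1] TRUE

# Evaluating the expressions one at a time recovers the full vector
each_one <- vapply(exprs, eval, logical(1))
each_one
# [1]  TRUE FALSE FALSE  TRUE  TRUE
```

This also avoids assign(): the named list keeps one result per operator without creating b1, b2, b3 as separate objects.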

does vector exist in matrix?

How can I check whether a vector exists inside a matrix? The vector will be of size 2. I have an approach, but I would like something vectorized/faster.
dim(m)
[1] 30 2
x = c(1, -2)
for(j in 1:nrow(m)){
if (isTRUE(x[1] == m[j, 1]) && isTRUE(x[2] == m[j, 2])) {
print(TRUE)
}
}
Note that x = c(1, -2) is not the same as a row -2, 1 in the matrix.
If we are comparing the rows of the matrix ('m') with 'x', where 'x' has the same length as the number of columns of 'm', we can replicate 'x' (x[col(m)]) to make the lengths the same, compare (!=), and get the rowSums. If the sum is 0 for a particular row, all the values in the vector match that row of 'm'. Negate (!) to convert 0 to TRUE and all other values to FALSE.
indx1 <- !rowSums(m!=x[col(m)])
Or if we need a solution using apply, we can use identical
indx2 <- apply(m, 1, identical, y=x)
identical(indx1, indx2)
#[1] TRUE
If only a single TRUE/FALSE is needed, we can wrap 'indx1' or 'indx2' in any().
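The steps described above can be traced on a small, made-up matrix (m_small and the intermediate names are illustrative only). The last part is a caveat I believe applies to the identical() variant: identical() also compares attributes, so if the matrix has column names, apply() passes named rows and none of them is identical to an unnamed 'x'.

```r
# Small illustrative matrix; rows 1 and 3 equal x, row 2 is x reversed.
m_small <- matrix(c( 1, -2,
                    -2,  1,
                     1, -2,
                     3,  4), ncol = 2, byrow = TRUE)
x <- c(1, -2)

# x[col(m_small)] repeats x[1] down column 1 and x[2] down column 2.
mismatch <- m_small != x[col(m_small)]

# Number of mismatching positions per row; 0 means the whole row equals x.
n_mismatch <- rowSums(mismatch)
n_mismatch
# [1] 0 2 0 2

indx <- !n_mismatch
indx
# [1]  TRUE FALSE  TRUE FALSE

any(indx)
# [1] TRUE

# Caveat: with column names, the rows passed by apply() carry names,
# so identical() against the unnamed x is FALSE for every row.
m_named <- m_small
colnames(m_named) <- c("a", "b")
named_res <- apply(m_named, 1, identical, y = x)
named_res
# [1] FALSE FALSE FALSE FALSE
```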
data
x <- c(1, -2)
set.seed(24)
m <- matrix(sample(c(1,-2,3,4), 30*2, replace=TRUE), ncol=2)
Try
m<-matrix(rnorm(60),30)
x<-m[8,]
m[9,]<-c(x[2],x[1]) # to prove 1,-2 not same -2,1
apply(m,1,function(n,x) all(n==x),x=x)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[24] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
If you need just one TRUE/FALSE, use any():
any(apply(m,1,function(n,x) all(n==x),x=x))
[1] TRUE
If you run this code with akrun's data:
x <- c(1, -2)
set.seed(24)
m <- matrix(sample(c(1,-2,3,4), 30*2, replace=TRUE), ncol=2)
any(apply(m,1,function(n,x) all(n==x),x=x))
[1] TRUE
