How do I test if three variables are equal [R] - r

I'm trying to do if else statement which includes a condition if three variables in the data frame equal each other.
I was hoping to use the identical function but not sure whether this works for three variables.
I've also used the following but R doesn't seem to like this:
geno$VarMatch <- ifelse((geno[c(1)] != '' & geno[c(2)] != '' & geno[c(3)] != '')
& (geno[c(5)] == geno[c(4)] == geno[c(6)]), 'Not Important', 'Important')
Keeps telling me:
Error: unexpected '=='
Am I supposed to specify something as data.frame/vector etc... Coming from an SPSS stand point, I'm slightly confused.
Sorry for the simplistic query.

I see so complicated results, mine is simple:
all(sapply(list(a,b,c,d), function(x) x == d))
returns TRUE, if all equals d all equals each other.

Here's a recursive function which generalises to any number of inputs and runs identical on them. It returns FALSE if any member of the set of inputs is not identical to the others.
ident <- function(...){
args <- c(...)
if( length( args ) > 2L ){
# recursively call ident()
out <- c( identical( args[1] , args[2] ) , ident(args[-1]))
}else{
out <- identical( args[1] , args[2] )
}
return( all( out ) )
}
ident(1,1,1,1,1)
#[1] TRUE
ident(1,1,1,1,2)
#[1] FALSE

If it's about numeric values, you can put the numbers in an array, then check the array's max and min, as well:
if(max(list) == min(list))
# all numbers in list are equal
else
# at least one element has a different value

You need to use:
geno$VarMatch <- ifelse((gene[c(1)] != '' & gene[c(2)] != '' &
gene[c(3)] != '') &
((gene[c(5)] == gene[c(4)]) &
(gene[c(4)] == gene[c(6)]))),
'Not Important', 'Important')

The == is a binary operator which returns a single logical value. R doesn't expect further input past your first evaluation, unless you feed it a Boolean & for vectors. You may want to modify this, but here's one attempt at a functional programming approach:
testEqual <- function(x, y) ifelse(x == y, x, FALSE)
all(!!Reduce(testEqual, list(1:10, 1:10))) # True
all(!!Reduce(testEqual, rep(T, 3))) # True
all(!!Reduce(testEqual, list(1, 5, 10))) # False
all(!!Reduce(testEqual, list(T, T, F))) # False
The double negation is used to convert values to logical vectors, and the all command returns a single Boolean. This only works for numeric values or logical vectors.
I'm throwing this out here just for fun. I'm not sure if I would actually use this approach, but any critiques are welcomed.

This answer is based on #John's comment under the OP. This is by far the easiest way to go about this.
geno$VarMatch <- ifelse((geno[c(1)] != '' & geno[c(2)] != '' & geno[c(3)] != '')
& (geno[c(5)] == geno[c(4)] & geno[c(5)] == geno[c(6)]), 'Not Important', 'Important')
Simpler than the other answers, and can be used with basic subsetting/ assignment too, e.g.
geno$VarMatch[geno[c(5)] == geno[c(4)] & geno[c(5)] == geno[c(6)]] <– 'Important'

I think you can just come up with simple generic function comparing three elements and then using mutate and rowwise from dplyr apply those to each combination.
library("tidyverse")
set.seed(123)
dta_sample <- tibble(
colA = sample(letters, 10000, TRUE),
colB = sample(letters, 10000, TRUE),
colC = sample(letters, 10000, TRUE)
)
compare_strs <- function(one, two, three) {
if (one == two) {
if (two == three) {
return(TRUE)
} else {
return(FALSE)
}
} else {
return(FALSE)
}
}
dta_sample %>%
rowwise() %>%
mutate(all_cols_identical = compare_strs(colA, colB, colC)) %>%
# For results
filter(all_cols_identical)
Preview
# A tibble: 25 x 4
# Rowwise:
colA colB colC all_cols_identical
<chr> <chr> <chr> <lgl>
1 w w w TRUE
2 k k k TRUE
3 m m m TRUE
4 b b b TRUE
5 y y y TRUE
6 n n n TRUE
7 e e e TRUE
8 j j j TRUE
9 q q q TRUE
10 a a a TRUE
# … with 15 more rows

Related

Is there a better way to check if all elements in a list are named?

I want to check if all elements in a list are named. I've came up with this solution, but I wanted to know if there is a more elegant way to check this.
x <- list(a = 1, b = 2)
y <- list(1, b = 2)
z <- list (1, 2)
any(stringr::str_length(methods::allNames(x)) == 0L) # FALSE, all elements are
# named.
any(stringr::str_length(methods::allNames(y)) == 0L) # TRUE, at least one
# element is not named.
# Throw an error here.
any(stringr::str_length(methods::allNames(z)) == 0L) # TRUE, at least one
# element is not named.
# Throw an error here.
I am not sure if the following base R code works for your general cases, but it seems work for the ones in your post.
Define a function f to check the names
f <- function(lst) length(lst) == sum(names(lst) != "",na.rm = TRUE)
and you will see
> f(x)
[1] TRUE
> f(y)
[1] FALSE
> f(z)
[1] FALSE
We can create a function to check if the the names attribute is NULL or (|) there is blank ("") name, negate (!)
f1 <- function(lst1) is.list(lst1) && !(is.null(names(lst1))| '' %in% names(lst1))
-checking
f1(x)
#[1] TRUE
f1(y)
#[1] FALSE
f1(z)
#[1] FALSE
Or with allNames
f2 <- function(lst1) is.list(lst1) && !("" %in% allNames(lst1))
-checking
f2(x)
#[1] TRUE
f2(y)
#[1] FALSE
f2(z)
#[1] FALSE

lappy conditional on variable value

I want to lappy two functions on a data set conditional on the value of a specific variable.
first_function <- function(x) {return (x + 0)}
second_function <- function(x) {return (x + 1)}
df <- data.frame(Letters = c("A","B","B"), Numbers = 1:3)
Someting like:
df <- lapply(df, if(df$Letters=="A") first_function else second_function )
To produce:
df_desired <- data.frame(Letters = c("A","B","B"), Numbers = c(1,3,4))
You can do it with dplyr and purrr. Obviously this is a basic function, but you should be able to build on it for your needs:
library(dplyr)
library(purrr)
calc <- function(y, x){
first_function <- function(x) {return (x + 0)}
second_function <- function(x) {return (x + 1)}
if(y == "A")
return(first_function(x))
return(second_function(x))
}
df <- data.frame(Letters = c("A","B","B"), Numbers = 1:3)
df %>%
mutate(Numbers = map2_dbl(Letters, Numbers, ~calc(.x,.y)))
Letters Numbers
1 A 1
2 B 3
3 B 4
>(df_desired <- data.frame(Letters = c("A","B","B"), Numbers = c(1,3,4)))
Letters Numbers
1 A 1
2 B 3
3 B 4
BENCHMARKING
I am not a data.table expert (feel free to add), so did not incorporate here. But, #R Yoda is correct. Although it reads nicely and future you will find it easier to read and extend the function, the purrr solution is not that fast. I liked the ifelse approach, so added case_when which is easier to scale when dealing with multiple functions. Here are a couple solutions:
library(dplyr)
library(purrr)
library(microbenchmark)
first_function <- function(x) {return (x + 0)}
second_function <- function(x) {return (x + 1)}
calc <- function(y, x){
if(y == "A")
return(first_function(x))
return(second_function(x))
}
df <- data.frame(Letters = rep(c("A","B","B"),1000), Numbers = 1:3)
basic <- function(){
data.frame(df$Letters, apply(df, 1, function(row) {
num <- as.numeric(row['Numbers'])
if (row['Letters'] == 'A') first_function(num) else second_function(num)
}))
}
dplyr_purrr <- function(){
df %>%
mutate(Numbers = map2_dbl(Letters, Numbers, ~calc(.x,.y)))
}
dplyr_case_when <- function(){
df %>%
mutate(Numbers = case_when(
Letters == "A" ~ first_function(Numbers),
TRUE ~ second_function(Numbers)))
}
map_list <- function(){
data.frame(df$Letters, map2_dbl(df2$Letters, df2$Numbers, ~calc(.x, .y)))
}
within_mapply <- function(){
within(df, Numbers <- mapply(Letters, Numbers,
FUN = function(x, y){
switch(x,
"A" = first_function(y),
"B" = second_function(y))
}))
}
within_ifelse <- function(){
within(df, Numbers <- ifelse(Letters == "A",
first_function(Numbers),
second_function(Numbers)))
}
within_case_when <- function(){
within(df, Numbers <- case_when(
Letters == "A" ~ first_function(Numbers),
TRUE ~ second_function(Numbers)))
}
(mbm <- microbenchmark(
basic(),
dplyr_purrr(),
dplyr_case_when(),
map_list(),
within_mapply(),
within_ifelse(),
within_case_when(),
times = 1000
))
Unit: microseconds
expr min lq mean median uq max neval cld
basic() 12816.427 24028.3375 27719.8182 26741.7770 29417.267 277756.650 1000 f
dplyr_purrr() 9682.884 17817.0475 20072.2752 19736.8445 21767.001 48344.265 1000 e
dplyr_case_when() 1098.258 2096.2080 2426.7183 2325.7470 2625.439 9039.601 1000 b
map_list() 8764.319 16873.8670 18962.8540 18586.2790 20599.000 41524.564 1000 d
within_mapply() 6718.368 12397.1440 13806.1752 13671.8120 14942.583 24958.390 1000 c
within_ifelse() 279.796 586.6675 690.1919 653.3345 737.232 8131.292 1000 a
within_case_when() 470.155 955.8990 1170.4641 1070.5655 1219.284 46736.879 1000 a
The simple way to do this with *apply would be to put the whole logic (with the conditional and the two functions) into another function and use apply with MARGIN=1 to pass the data in row by row (lapply will pass in the data by column):
apply(df, 1, function(row) {
num <- as.numeric(row['Numbers'])
if (row['Letters'] == 'A') first_function(num) else second_function(num)
})
[1] 1 3 4
The problem with this approach, at #r2evans points out in the comment below, is that when you use apply with a heterogeneous data.frame (in this case, Letters is type factor while Numbers is type integer) each row passed into the applied function is passed as a vector which can only have a single type, so everything in the row is coerced to the same type (in this case character). This is why it's necessary to use as.numeric(row['Numbers']), to turn Numbers back into type numeric. Depending on your data, this could be a simple fix (as above) or it could make things much more complicated and bug-prone. Either way #akrun's solution is much better, since it preserves each variable's original data type.
lapply has difficulty in this case because it's column-based. However you can try transpose your data by t() and use lapply if you persist. Here I provide two ways which use mapply and ifelse :
df$Letters <- as.character(df$Letters)
# Method 1
within(df, Numbers <- mapply(Letters, Numbers, FUN = function(x, y){
switch(x, "A" = first_function(y),
"B" = second_function(y))
}))
# Method 2
within(df, Numbers <- ifelse(Letters == "A",
first_function(Numbers),
second_function(Numbers)))
Both above got the same outputs :
# Letters Numbers
# 1 A 1
# 2 B 3
# 3 B 4
Here a data.table variant for better performance in case of many data rows (but also showing an implicit conversion problem):
library(data.table)
setDT(df) # fast convertion from data.frame to data.table
df[ Letters == "A", Numbers := first_function(Numbers) ]
df[!(Letters == "A"), Numbers := second_function(Numbers)] # issues a warning, see below
df
# Letters Numbers
# 1: A 1
# 2: B 3
# 3: B 4
The issued warning is:
Warning message: In [.data.table(df, !(Letters == "A"),
:=(Numbers, second_function(Numbers))) : Coerced 'double' RHS to
'integer' to match the column's type; may have truncated precision.
Either change the target column ['Numbers'] to 'double' first (by
creating a new 'double' vector length 3 (nrows of entire table) and
assign that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g.
1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for
speed. Or, set the column type correctly up front when you create the
table and stick to it, please.
The reason is that the data.frame column Numbers is an integer
> str(df)
'data.frame': 3 obs. of 2 variables:
$ Letters: Factor w/ 2 levels "A","B": 1 2 2
$ Numbers: int 1 2 3
but the functions return a double (for whatever reason):
> typeof(first_function(df$Numbers))
[1] "double"

Identifying source of FALSE

My question is, does there exist a function that, given a logical statement, identifies the source of FALSE (if it is false)?
For example,
x=1; y=1; z=1;
x==1 & y==1 & z==2
Obviously it is the value of z that makes the statement false. In general though, is there a function that let's me identify the variable(s) in a logical statement who's value makes a logical statement false?
Instead of writing x==1 & y==1 & z==2 you could define
cn <- c(x == 1, y == 1, z == 2)
or
cn <- c(x, y, z) == c(1, 1, 2)
and use all(cn). Then
which(!cn)
# [1] 3
gives the source(s) of FALSE.
In general, no, there is no such function that you are looking for, but for different logical statements a similar approach should work, although it might be too lengthy to pursue.
Considering (!(x %in% c(1,2,3)) & y==3) | z %in% c(4,5), we get FALSE if z %in% c(4,5) is FALSE and (!(x %in% c(1,2,3)) & y==3) is FALSE simultaneously. So, if (!(x %in% c(1,2,3)) & y==3) | z %in% c(4,5) returns FALSE, we are sure about z and still need to check x and y, so that the list of problematic variables can be obtained as follows:
if(!((!(x %in% c(1,2,3)) & y==3) | z %in% c(4,5)))
c("x", "y", "z")[c(x %in% c(1,2,3), !y == 3, TRUE)]
# [1] "x" "y" "z"
or
a <- !(x %in% c(1,2,3))
b <- y == 3
c <- z %in% c(4,5)
if(!((a & b) | c))
c("x", "y", "z")[c(!a, !b, TRUE)]
# [1] "x" "y" "z"
I like #julius's answer but there is also the stopifnot function.
x <- 1; y <- 1; z <- 2
stopifnot(x == 1, y == 1, z == 1)
#Error: z == 1 is not TRUE
Not that the result is an error if there are any false statements and nothing if they're all true. It also stops at the first false statement so if you had something like
x <- T; y <- F; z <- F
stopifnot(x, y, z)
#Error: y is not TRUE
you would not be told that z is FALSE in this case.
So the result isn't a logical or an index but instead is either nothing or an error. This doesn't seem desirable but it is useful if the reason you're using it is for checking inputs to a function or something similar where you want to produce an error on invalid inputs and just keep on moving if everything is fine. I mention stopifnot because it seems like this might be the situation you're in. I'm not sure.
Here is a silly example where you might use it. In this case you apparently only want positive numbers as input and reject everything else:
doublePositiveNumber <- function(x){
stopifnot(is.numeric(x), x >= 0)
return(2*x)
}
which results in
> doublePositiveNumber("hey")
Error: is.numeric(x) is not TRUE
> doublePositiveNumber(-2)
Error: x >= 0 is not TRUE
> doublePositiveNumber(2)
[1] 4
So here you guarantee you get the inputs you want and produce and error message for the user that hopefully tells them what the issue is.

What are practical uses of the "&&" and "||" operators in R?

Main question
In what practical programming situations or R "idioms" would you only want to check the first element of each of two vectors for logical comparison? (I.e. disregarding the rest of each vector as in && and ||.)
I can see the use of & and | in R, where they do element-wise logical comparison of two vectors. But I cannot see a real life practical use of their sibling operators && and ||. Can anyone provide a clear example of their use?
The documentation ,help("&&"), says:
The longer form evaluates left to right examining only the first element of each vector.
Evaluation proceeds only until the result is determined.
The longer form is appropriate for
programming control-flow and typically preferred in if clauses.
The issue for me is the following: I interpret the documentation of && and || to say that for logical vectors x and y, the && and || operators only use x[1] and y[1] to provide a result.
> c(TRUE, FALSE, FALSE) && c(TRUE, FALSE)
[1] TRUE
> c(TRUE, FALSE, FALSE) && c(FALSE, FALSE)
[1] FALSE
> c(FALSE, FALSE, FALSE) && c(TRUE, FALSE)
[1] FALSE
> c(FALSE, FALSE, FALSE) && c(FALSE, FALSE)
[1] FALSE
I don't see any "programming control-flow" situations where I would have two logical vectors and I would disregard any values past the first element of each.
It seems that x && y acts like x[1] & y[1], and x || y acts like x[1] | y[1].
Benchmarks
Here's a test function that evaluates how often these formulations return the same result using randomly generated logical vectors of different lengths. This suggests that they are doing the same thing.
> test <- function( n, maxl=10 ) {
foo <- lapply( X=seq_len( n ), FUN=function(i) {
x <- runif( n=sample( size=1, maxl ) ) > 0.5
y <- runif( n=sample( size=1, maxl ) ) > 0.5
sameres <- all.equal( (x||y), (x[1]|y[1]) )
sameres
} )
table( unlist( foo ) )
}
test( 10000 )
Yields:
TRUE
10000
Here's a benchmarking test on which is faster. It start by creating a list of lists, where each of N items in dat is a list containing two randomly generated logical vectors. Then we apply each of the variants on the same data to see which is faster.
library(rbenchmark)
N <- 100
maxl <- 10
dat <- lapply( X=seq_len(N), FUN=function(i) {
list( runif( n=sample( size=1, maxl ) ) > 0.5,
runif( n=sample( size=1, maxl ) ) > 0.5) } )
benchmark(
columns=c("test","replications","relative"),
lapply(dat, function(L){ L[[1]] || L[[2]] } ),
lapply(dat, function(L){ L[[1]][1] | L[[2]][1] } )
)
Yields the following output (removed the \n characters and extra whitespace):
test replications relative
2 lapply(dat, function(L) { L[[1]][1] | L[[2]][1] }) 100 1.727
1 lapply(dat, function(L) { L[[1]] || L[[2]] }) 100 1.000
Clearly, the || formulation is faster than cherry picking the first element of each argument. But I'm still curious as to why one would need such an operator.
I guess that there are a couple of reasons, but probably the most important one is the short-circuit behavior. If a evaluates to FALSE in a && b, then b is not evaluated. Similarly, if a evaluates to TRUE in a || b, then b is not evaluated. This allows writing code like
v <- list(1, 2, 3, 4, 5)
idx <- 6
if (idx < length(v) && v[[idx]] == 5) {
foo
} else {
bar
}
Otherwise one needs to write this (maybe) as
if (idx < length(v)) {
if (v[idx] == 5) {
foo
} else {
bar
}
} else {
bar
}
which is 1) much less readable, and 2) repeats bar, which is bad if bar is a bigger piece of code.
You cannot use & in the if condition, because your index would be out of bounds, and this is not allowed for lists in R:
if (idx < length(v) & v[[idx]] == 5) {
foo
} else {
bar
}
# Error in v[[idx]] : subscript out of bounds
Here is a small illustration of the short-circuit behavior:
t <- function() { print("t called"); TRUE }
f <- function() { print("f called"); FALSE }
f() && t()
# [1] "f called"
# [1] FALSE
f() & t()
# [1] "f called"
# [1] "t called"
# [1] FALSE
t() || f()
# [1] "t called"
# [1] TRUE
t() | f()
# [1] "t called"
# [1] "f called"
# [1] TRUE

R: Different range checks for different elements of a vector

Is there an R idiom for performing a different (integer) range check for each element of a vector?
My function is passed a two-element (integer) vector of the form v = c(m, n) and must make the following range checks:
1 <= m <= M
1 <= n <= N
For my current task, I've implemented them by manually accessing each element, and running the associated range check against it.
# Check if this is a valid position on an M x N chess board.
validate = function (square) {
row = square[1]
col = square[2]
(row %in% 1:M) && (col %in% 1:N)
}
I wonder whether there's a compacter way of doing the range checks, especially if we were to generalize it to K-element vectors.
Since you're presumably setting up different criteria for each v[j], I'd recommend creating a list out of your range criteria. Like:
Rgames> set.seed(10)
Rgames> foo<-sample(1:5,5,rep=TRUE)
Rgames> foo
[1] 3 2 3 4 1
Rgames> bar<-list(one=1:5, two=3:5,three=1:3,four=c(2,4),five=c(1,4) )
Rgames> checkit<-NA
Rgames> for(j in 1:5) checkit[j]<-foo[j]%in%bar[[j]]
Rgames> checkit
[1] TRUE FALSE TRUE TRUE TRUE
If I understand your goal correctly, the inequality operators are vectorized in R, so you can make use of this fact.
limits <- c(M=3, N=4, 5)
v <- c(m=2, n=5, 8)
result <- 1 <= v & v <= limits
# m n
# TRUE FALSE FALSE
And if you want a single value that's FALSE if any of the limits are exceeded, then you can wrap the inequality expression with all.
all(1 <= v & v <= limits)
Maybe something like this:
`%between%` <- function(x,rng){
all(x <= max(rng,na.rm = TRUE)) && all(x >= min(rng,na.rm = TRUE))
}
> 3 %between% c(1,10)
[1] TRUE
> 3:5 %between% c(1,10)
[1] TRUE
> 9:12 %between% c(1,10)
[1] FALSE
With tweaks depending on how you want to handle NAs, and other edge cases.

Resources