How do I pre-determine mutually exclusive comparisons? - r

The human eye can see that no value x satisfies the condition
x<1 & x>2
but how can I make R see that? I want to use this in a function that is passed comparisons (say, as strings) and not necessarily data. Let's say I want to write a function that checks whether a combination of comparisons can ever be fulfilled, like this:
areTherePossibleValues <- function(someString){
someCode
}
areTherePossibleValues("x<1 & x>2")
[1] FALSE
I mean, one could do that by parsing the substrings that are comparison signs and so on, but I feel like there's got to be a better way. The R comparison functions ('<', '>', '==' and so on) themselves might actually be the answer to this, right?

Another option is to use the library validatetools (disclaimer, I'm its author).
library(validatetools)
rules <- validator( r1 = x < 1, r2 = x > 2)
is_infeasible(rules)
# [1] TRUE
make_feasible(rules)
# Dropping rule(s): "r1"
# Object of class 'validator' with 1 elements:
# r2: x > 2
# Rules are evaluated using locally defined options
# create a set of rules that all must hold:
rules <- validator( x > 1, x < 2, x < 2.5)
is_infeasible(rules)
# [1] FALSE
remove_redundancy(rules)
# Object of class 'validator' with 2 elements:
# V1: x > 1
# V2: x < 2
rules <- validator( x >= 1, x < 1)
is_infeasible(rules)
# [1] TRUE

To compare ranges: the minimum of the upper bounds must always be greater than the maximum of the lower bounds, as shown below:
library(dplyr)
library(stringr)
areTherePossibleValues <- function(s) {
str_split(s, pattern = " *& *", simplify = TRUE)[1, ] %>%
{lapply(c("max" = "<", "min" = ">"), function(x) str_subset(., pattern = x) %>% str_extract(., pattern = "[0-9]+"))} %>%
# convert to numeric before taking min/max, so that "10" is not compared as a string
{min(as.numeric(.$max)) > max(as.numeric(.$min))}
}
Update: add inclusion comparison
The only difference is that the minimum of the upper bounds may now be equal to the maximum of the lower bounds.
library(dplyr)
library(stringr)
areTherePossibleValues <- function(s) {
str_split(s, pattern = " *& *", simplify = TRUE)[1, ] %>%
{lapply(c("max" = "<", "min" = ">"), function(x) str_subset(., pattern = x) %>% str_remove(., pattern = paste0("^.*", x)))} %>%
{ifelse(sum(grepl(pattern = "=", unlist(.))),
min(as.numeric(str_remove(.$max, "="))) >= max(as.numeric(str_remove(.$min, "="))),
min(as.numeric(.$max)) > max(as.numeric(.$min)))}
}
areTherePossibleValues("x<1 & x>2")
areTherePossibleValues("x>1 & x<2")
areTherePossibleValues("x>=1 & x<1")

Here is my way of solving it. It may not be the best, but it should work even if you have many comparisons.
Let's call the numbers that appear in your comparisons 'cutoffs'. Then all we need to do is test one number between each pair of adjacent cutoffs, one number larger than the largest cutoff, and one number smaller than the smallest cutoff.
The intuition is illustrated with the plot:
Here is the code:
areTherePossibleValues <- function(s){
# first get the numbers that appear in your string, sort them, and call them the cutoffs
# (keep digits and the decimal point so that values such as 2.5 survive)
cutoffs = sort(as.numeric(gsub("[^0-9.]", "", strsplit(s, "&")[[1]])))
# get the midpoints between adjacent cutoffs, plus one value below the min and one above the max
testers = (c(min(cutoffs)-1, cutoffs) + c( cutoffs ,max(cutoffs) + 1))/2
# split out the individual comparisons
comparisons = strsplit(s, "&")[[1]]
# check if ANY tester satisfies all comparisons
any(sapply(testers, function(te){
# check if a tester satisfies ALL comparisons
all(sapply(comparisons, function(co){eval(parse(text =gsub(pattern = 'x',replacement =te, co)))}))
}))
}
areTherePossibleValues("x<1 & x>2")
#[1] FALSE
areTherePossibleValues("x>1 & x<2 & x < 2.5")
#[1] TRUE
areTherePossibleValues("x>= 1 & x < 1")
#[1] FALSE

We can see that x<1 & x>2 is impossible because we are taught a simple rule: if a number x is smaller than a number a, then it cannot be larger than any number that is larger than a. More fundamentally, we are using the transitivity property of any partially ordered set. There is no reason we cannot teach a computer (or R) to see that. If the logic string in your question consists only of statements of the form x # a, where # can be <, >, <=, or >=, and the operator is always &, then Yue Y's solution above perfectly answers your question. It can even be generalized to include the | operator. Beyond this, you'll have to be more specific about what the logic expression can be.
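The transitivity argument can also be applied directly, without trying test values. The following is a minimal base R sketch, assuming every statement has the form x # a with a plain decimal number a and that the statements are joined only by &. The function name is_satisfiable is hypothetical and does not come from any of the answers. It tracks the tightest lower and upper bound and whether each one is strict.

```r
# Hypothetical helper (not from the original answers): decide whether a
# conjunction of comparisons "x < a", "x <= a", "x > a", "x >= a" can hold.
is_satisfiable <- function(s) {
  parts <- trimws(strsplit(s, "&", fixed = TRUE)[[1]])
  # Split each comparison into operator and number, e.g. "x<=2.5" -> "<=", "2.5"
  m <- regmatches(parts, regexec("^x\\s*(<=|>=|<|>)\\s*(-?[0-9.]+)$", parts))
  if (any(lengths(m) != 3)) stop("unsupported comparison in: ", s)
  op  <- vapply(m, `[`, character(1), 2)
  val <- as.numeric(vapply(m, `[`, character(1), 3))

  lower <- -Inf; lower_strict <- FALSE
  upper <-  Inf; upper_strict <- FALSE
  for (i in seq_along(op)) {
    if (op[i] %in% c(">", ">=")) {
      strict <- op[i] == ">"
      # a larger lower bound is tighter; on ties the strict one is tighter
      if (val[i] > lower || (val[i] == lower && strict)) {
        lower <- val[i]; lower_strict <- strict
      }
    } else {
      strict <- op[i] == "<"
      # a smaller upper bound is tighter; on ties the strict one is tighter
      if (val[i] < upper || (val[i] == upper && strict)) {
        upper <- val[i]; upper_strict <- strict
      }
    }
  }
  # feasible if the interval is non-empty, or a single point with both ends inclusive
  lower < upper || (lower == upper && !lower_strict && !upper_strict)
}

is_satisfiable("x<1 & x>2")            # FALSE
is_satisfiable("x>1 & x<2 & x < 2.5")  # TRUE
is_satisfiable("x>=1 & x<1")           # FALSE
is_satisfiable("x>=1 & x<=1")          # TRUE
```

Supporting | would mean keeping a set of intervals instead of a single pair of bounds, which this sketch does not attempt.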

Related

How can I extract the coefficient of a specific variable in an equation in R?

For example, I have a string like "2 * a + 3 * b".
I already have something that checks if a certain variable exists in the expression. So for example I input b, how could I make it return the 3?
1) If we know that the formula is linear, as in the example in the question, then we can take the derivative.
# inputs
s <- "2 * a + 3 * b"
var <- "b"
D(parse(text = s), var)
## [1] 3
2) If in addition there is no constant term -- there is none in the example in the question -- then we can replace all variables with 0 except for the one whose coefficient we want and replace that one with 1. Then evaluate the expression. Assuming the same inputs we have
p <- parse(text = s)
L <- Map(function(x) 0, all.vars(p)) # list(a = 0, b = 0)
eval(p, replace(L, var, 1))
## [1] 3
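If the coefficients of all variables are wanted at once, the derivative approach can be mapped over all.vars(). This is only a sketch under the same assumption as above, namely that the formula is linear, so each derivative is a constant that eval() can evaluate without any variable values. The name coefs is illustrative.

```r
s <- "2 * a + 3 * b"
p <- parse(text = s)

# Differentiate with respect to each variable in turn; for a linear
# formula each derivative is a plain number.
coefs <- vapply(all.vars(p), function(v) eval(D(p[[1]], v)), numeric(1))
coefs
coefs[["b"]]  # 3
```

The result is a named numeric vector, so a single coefficient can be looked up by variable name.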
Not sure I understand you 100%. But if it is that you have a string and want to see the digit(s) that occur immediately prior to b, then you can use str_extract:
library(stringr)
str_extract(x, "\\d+(?=b)")
[1] "3"
This works by the look-ahead (?=b), which asserts that the digit(s) to be extracted must be followed by the character b. If you need the extracted substring in numeric format:
as.numeric(str_extract(x, "\\d+(?=b)"))
Data:
x <- "2a+3b"

Is there a function in R which checks if two XStrings have matching substrings of some size (n) in corresponding positions

I am trying to determine if two XStrings of equal length have the same substring of some given length in the corresponding positions.
Is there a built-in function in R for this problem?
Let's say I have strings
a <- "AACCTGCCCGGAACCT" # shared substring: GCCCGGAA
b <- "CCATCGCCCGGAACCT"
and given length is 8
I need a function fun(a,b,len=8) that would return TRUE or possibly even a position where such a substring first occurs.
Of course, real strings that I am using are much longer and the given length of substring may not be 8 all the time.
This could be done with for loops, but I would prefer not to use them.
You could do this by splitting the strings into individual characters, testing equality of the resulting vectors, and performing run-length encoding on the logical vector produced:
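f <- function(a, b, n) {
# run-length encode the position-wise equality of the two strings
rl <- rle(strsplit(a, "")[[1]] == strsplit(b, "")[[1]])
# index of the first run of matches that is at least n long
ind <- which(rl$values & rl$lengths >= n)[1]
if (is.na(ind)) return(NA_integer_)
# start position = total length of all preceding runs, plus one
sum(rl$lengths[seq_len(ind - 1)]) + 1
}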
This will tell you the first position in the strings where there are at least n parallel matching bases:
f(a, b, 8)
#> [1] 6
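The question also asked for a plain TRUE/FALSE answer. A minimal sketch of that variant, using the same run-length idea, might look as follows. The name has_common_run is hypothetical, and the strings are written without the asterisks, on the assumption that those were only emphasis in the question.

```r
# Hypothetical helper: TRUE if a and b share a run of at least `len`
# identical characters at the same positions.
has_common_run <- function(a, b, len = 8) {
  stopifnot(nchar(a) == nchar(b))
  same <- strsplit(a, "")[[1]] == strsplit(b, "")[[1]]
  runs <- rle(same)
  any(runs$values & runs$lengths >= len)
}

a <- "AACCTGCCCGGAACCT"
b <- "CCATCGCCCGGAACCT"
has_common_run(a, b, 8)   # TRUE: positions 6 to 16 agree (11 characters)
has_common_run(a, b, 12)  # FALSE: the longest common run is shorter than 12
```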
We can use rleid
library(data.table)
Map(function(u, v) {i1 <- u == v
grp <- rleid(i1); which(i1 & ave(seq_along(grp), grp,
FUN = length) >= 8)[1]},
strsplit(a, ""), strsplit(b, ""))[[1]]
#[1] 6

How to find a string in a vector in r?

I have created a function that essentially creates a vector of 1000 binary values. I have been able to count the longest streak of consecutive 1s by using rle.
I was wondering how to find a specific vector (say c(1,0,0,1)) in this larger vector. I would want it to return the number of occurrences of that vector. So c(1,0,0,1,1,0,0,1) should return 2, while c(1,0,0,0,1) should return 0.
Most solutions that I have found just find whether a sequence occurs at all and return TRUE or FALSE, or they give results for the individual values, not the specific vector that is specified.
Here's my code so far:
# creates a function where 1000 people choose either up or down.
updown <- function(){
n = 1000
X = rep(0,n)
Y = rbinom(n, 1, 1 / 2)
X[Y == 1] = "up"
X[Y == 0] = "down"
#calculate the length of the longest streak of ups:
Y1 <- rle(Y)
streaks <- Y1$lengths[Y1$values == c(1)]
max(streaks, na.rm=TRUE)
}
# repeat this process n times to find the average outcome.
longeststring <- replicate(1000, updown())
mean(longeststring)
This will also work:
library(stringr)
x <- c(1,0,0,1)
y <- c(1,0,0,1,1,0,0,1)
length(unlist(str_match_all(paste(y, collapse=''), '1001')))
[1] 2
y <- c(1,0,0,0,1)
length(unlist(str_match_all(paste(y, collapse=''), '1001')))
[1] 0
If you want to match overlapped patterns,
y <- c(1,0,0,1,0,0,1) # overlapped
length(unlist(gregexpr("(?=1001)",paste(y, collapse=''),perl=TRUE)))
[1] 2
Since Y is only 0s and 1s, we can paste it into a string and use regex, specifically gregexpr. Simplified a bit:
set.seed(47) # for reproducibility
Y <- rbinom(1000, 1, 1 / 2)
count_pattern <- function(pattern, x){
sum(gregexpr(paste(pattern, collapse = ''),
paste(x, collapse = ''))[[1]] > 0)
}
count_pattern(c(1, 0, 0, 1), Y)
## [1] 59
paste reduces the pattern and Y down to strings, e.g. "1001" for the pattern here, and a 1000-character string for Y. gregexpr searches for all occurrences of the pattern in Y and returns the indices of the matches (together with a little more information so they can be extracted, if one wanted). Because gregexpr will return -1 for no match, testing for numbers greater than 0 will let us simply sum the TRUE values to get the number of matches; in this case, 59.
The other sample cases mentioned:
count_pattern(c(1,0,0,1), c(1,0,0,1,1,0,0,1))
## [1] 2
count_pattern(c(1,0,0,1), c(1,0,0,0,1))
## [1] 0
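The string-based approaches rely on every element being a single character, and a plain gregexpr call counts only non-overlapping matches. As an alternative technique (not taken from the answers above), a sliding-window comparison works directly on the vectors and counts overlapping occurrences as well. The name count_windows is hypothetical.

```r
# Hypothetical helper: count the positions at which `pattern` occurs in `x`,
# comparing element by element, so overlapping occurrences are counted too.
count_windows <- function(pattern, x) {
  k <- length(pattern)
  n <- length(x)
  if (k > n) return(0L)
  starts <- seq_len(n - k + 1)
  sum(vapply(starts,
             function(i) all(x[i:(i + k - 1)] == pattern),
             logical(1)))
}

count_windows(c(1, 0, 0, 1), c(1, 0, 0, 1, 1, 0, 0, 1))  # 2
count_windows(c(1, 0, 0, 1), c(1, 0, 0, 0, 1))           # 0
count_windows(c(1, 0, 0, 1), c(1, 0, 0, 1, 0, 0, 1))     # 2 (overlapping)
```

This is slower than the regex route for long vectors, since it loops in R, but it makes no assumption about the values being single digits.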

Check whether vector in R is sequential?

How can I check whether an integer vector is "sequential", i.e. that the difference between subsequent elements is exactly one? I feel like I am missing something like "is.sequential".
Here's my own function:
is.sequential <- function(x){
all(diff(x) == rep(1,length(x)-1))
}
There's no need for rep since 1 will be recycled:
Edited to allow 5:2 as TRUE:
is.sequential <- function(x){
all(abs(diff(x)) == 1)
}
To allow for sequences with any constant step:
is.sequential <- function(x){
all(diff(x) == diff(x)[1])
}
So, @Iselzer has a fine answer. There are still some corner cases though: rounding errors and starting value. Here's a version that allows rounding errors but checks that the first value is (almost) an integer.
is.sequential <- function(x, eps=1e-8) {
if (length(x) && isTRUE(abs(x[1] - floor(x[1])) < eps)) {
all(abs(diff(x)-1) < eps)
} else {
FALSE
}
}
is.sequential(2:5) # TRUE
is.sequential(5:2) # FALSE
# Handle rounding errors?
x <- ((1:10)^0.5)^2
is.sequential(x) # TRUE
# Does the sequence need to start on an integer?
x <- c(1.5, 2.5, 3.5, 4.5)
is.sequential(x) # FALSE
# Is an empty vector a sequence?
is.sequential(numeric(0)) # FALSE
# What about NAs?
is.sequential(c(NA, 1)) # FALSE
This question is quite old by now, but in certain circumstances it is actually quite useful to know whether a vector is sequential.
Both of the OP answers are quite good, but as mentioned by Tommy, the accepted answer has some flaws. It seems natural that a 'sequence' is any series of equally spaced numbers. This would include negative sequences, sequences with a starting value other than 0 or 1, and so forth.
A flexible and safe implementation is given below, which accounts for:
negative values (-3 to 1) and negative directions (3 to 1)
sequences with non-integer steps (3.5, 3.6, 3.7, ...)
wrong input types such as infinite values, NA and NaN values, data.frames, etc.
is.sequence <- function(x, ...)
UseMethod("is.sequence", x)
is.sequence.default <- function(x, ...){
FALSE
}
is.sequence.numeric <- function(x, tol = sqrt(.Machine$double.eps), ...){
if(anyNA(x) || any(is.infinite(x)) || length(x) <= 1 || diff(x[1:2]) == 0)
return(FALSE)
diff(range(diff(x))) <= tol
}
is.sequence.integer <- function(x, ...){
is.sequence.numeric(x, ...)
}
n <- 1236
#Test:
is.sequence(seq(-3, 5, length.out = n))
# TRUE
is.sequence(seq(5, -3, length.out = n))
# TRUE
is.sequence(seq(3.5, 2.5 + n, length.out = n))
# TRUE
is.sequence(LETTERS[1:7])
# FALSE
Basically, the implementation checks whether the max and min of the differences are equal up to the tolerance tol.
While using the S3 class methods makes the implementation slightly more complicated it simplifies checks for wrong input types, and allows for implementations for other classes. For example this makes it simple to extend this method to say Date objects, which would require one to consider if a sequence of only weekdays (or work days) is also a sequence.
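As a rough illustration of such an extension, a Date method could simply compare the underlying day counts. This is a sketch under the simplest possible assumption, that "equally spaced in calendar days" is what counts as a sequence; it does not address the weekday question raised above. The method is.sequence.Date is not part of the original answer, and the generic and numeric method are restated so that the block runs on its own.

```r
# Generic and numeric method, restated from the answer above
is.sequence <- function(x, ...) UseMethod("is.sequence", x)
is.sequence.default <- function(x, ...) FALSE
is.sequence.numeric <- function(x, tol = sqrt(.Machine$double.eps), ...) {
  if (anyNA(x) || any(is.infinite(x)) || length(x) <= 1 || diff(x[1:2]) == 0)
    return(FALSE)
  diff(range(diff(x))) <= tol
}

# Assumed Date method (not in the original answer): compare the day counts
is.sequence.Date <- function(x, ...) is.sequence.numeric(as.numeric(x), ...)

d <- as.Date("2024-01-01")
is.sequence(d + 0:4)                              # TRUE: consecutive days
is.sequence(d + c(0, 1, 3))                       # FALSE: a gap breaks the spacing
is.sequence(seq(d, by = "week", length.out = 5))  # TRUE: always 7 days apart
```

Without the Date method, these calls would fall through to is.sequence.default and return FALSE, because dispatch happens on the "Date" class attribute.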
Speed comparison
This implementation is very safe, but using S3 classes adds some overhead. For short vectors the benefit is the flexibility of the implementation, while it is around 15% slower at worst. For larger vectors, however, it is slightly faster, as shown in the microbenchmark below.
Note that the median time is the better basis for comparison, as the garbage collector may add unpredictable time to the benchmark.
ss <- seq(1, 1e6)
microbenchmark::microbenchmark(is.sequential(ss),
is.sequence(ss), #Integer calls numeric, adding a bit of overhead
is.sequence.numeric(ss))
# Unit: milliseconds
# expr min lq mean median uq max neval
# is.sequential(ss) 19.47332 20.02534 21.58227 20.45541 21.23700 66.07200 100
# is.sequence(ss) 16.09662 16.65412 20.52511 17.05360 18.23958 61.23029 100
# is.sequence.numeric(ss) 16.00751 16.72907 19.08717 17.01962 17.66150 55.90792 100

Test for equality among all elements of a single numeric vector

I'm trying to test whether all elements of a vector are equal to one another. The solutions I have come up with seem somewhat roundabout, both involving checking length().
x <- c(1, 2, 3, 4, 5, 6, 1) # FALSE
y <- rep(2, times = 7) # TRUE
With unique():
length(unique(x)) == 1
length(unique(y)) == 1
With rle():
length(rle(x)$values) == 1
length(rle(y)$values) == 1
A solution that would let me include a tolerance value for assessing 'equality' among elements would be ideal to avoid FAQ 7.31 issues.
Is there a built-in function for this type of test that I have completely overlooked? identical() and all.equal() compare two R objects, so they won't work here.
Edit 1
Here are some benchmarking results. Using the code:
library(rbenchmark)
John <- function() all( abs(x - mean(x)) < .Machine$double.eps ^ 0.5 )
DWin <- function() {diff(range(x)) < .Machine$double.eps ^ 0.5}
zero_range <- function() {
if (length(x) == 1) return(TRUE)
x <- range(x) / mean(x)
isTRUE(all.equal(x[1], x[2], tolerance = .Machine$double.eps ^ 0.5))
}
x <- runif(500000);
benchmark(John(), DWin(), zero_range(),
columns=c("test", "replications", "elapsed", "relative"),
order="relative", replications = 10000)
With the results:
test replications elapsed relative
2 DWin() 10000 109.415 1.000000
3 zero_range() 10000 126.912 1.159914
1 John() 10000 208.463 1.905251
So it looks like diff(range(x)) < .Machine$double.eps ^ 0.5 is fastest.
Why not simply use the variance:
var(x) == 0
If all the elements of x are equal, you will get a variance of 0.
This works only for doubles and integers, though.
Edit based on the comments below:
A more generic option is to check the number of unique elements in the vector, which must be 1 in this case. This has the advantage that it works for all classes, not just the double and integer vectors for which a variance can be calculated.
length(unique(x)) == 1
If they're all numeric values then if tol is your tolerance then...
all( abs(y - mean(y)) < tol )
is the solution to your problem.
EDIT:
After looking at this and other answers, and benchmarking a few things, the following comes out over twice as fast as the DWin answer.
abs(max(x) - min(x)) < tol
This is, a bit surprisingly, faster than diff(range(x)), since diff shouldn't do much more than - and abs with two numbers, and requesting the range should optimize getting the minimum and maximum. Note, however, that while range is a primitive function, diff is an ordinary R-level generic that dispatches to diff.default, which may explain part of the overhead. In any case, the timing doesn't lie.
And, in addition, as @Waldi pointed out, abs is superfluous here.
I use this method, which compares the min and the max, after dividing by the mean:
# Determine if range of vector is FP 0.
zero_range <- function(x, tol = .Machine$double.eps ^ 0.5) {
if (length(x) == 1) return(TRUE)
x <- range(x) / mean(x)
isTRUE(all.equal(x[1], x[2], tolerance = tol))
}
If you were using this more seriously, you'd probably want to remove missing values before computing the range and mean.
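A minimal sketch of that suggestion, assuming missing values should simply be ignored: the name zero_range_na is hypothetical. Unlike zero_range above, it skips the division by the mean (which breaks down when the mean is zero) and leaves the scaling to all.equal(), which by default compares numbers on a relative scale.

```r
# Hypothetical NA-tolerant variant of zero_range (not from the original answer)
zero_range_na <- function(x, tol = .Machine$double.eps ^ 0.5) {
  x <- x[!is.na(x)]                 # drop NA and NaN before comparing
  if (length(x) <= 1) return(TRUE)  # nothing left to disagree
  r <- range(x)
  isTRUE(all.equal(r[1], r[2], tolerance = tol))
}

zero_range_na(c(2, NA, 2))         # TRUE
zero_range_na(c(1, 2, NA))         # FALSE
zero_range_na(c(0.1 + 0.2, 0.3))   # TRUE despite floating-point error
```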
You can just check all(v==v[1])
> isTRUE(all.equal( max(y) ,min(y)) )
[1] TRUE
> isTRUE(all.equal( max(x) ,min(x)) )
[1] FALSE
Another along the same lines:
> diff(range(x)) < .Machine$double.eps ^ 0.5
[1] FALSE
> diff(range(y)) < .Machine$double.eps ^ 0.5
[1] TRUE
You can use identical() and all.equal() by comparing the first element to all others, effectively sweeping the comparison across:
R> compare <- function(v) all(sapply( as.list(v[-1]),
+ FUN=function(z) {identical(z, v[1])}))
R> compare(x)
[1] FALSE
R> compare(y)
[1] TRUE
That way you can swap identical() for isTRUE(all.equal()) with a tolerance, as needed.
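Since identical() itself takes no tolerance argument, one way to get an epsilon into this pattern is to compare with all.equal() instead. The following is a sketch along the lines of compare() above; the name compare_tol and the default tolerance are assumptions made for illustration.

```r
# Hypothetical tolerance-aware variant of compare()
compare_tol <- function(v, tol = 1e-8) {
  all(vapply(v[-1],
             function(z) isTRUE(all.equal(z, v[1], tolerance = tol)),
             logical(1)))
}

compare_tol(c(0.3, 0.1 + 0.2))   # TRUE: equal within tolerance
compare_tol(rep(2, times = 7))   # TRUE
compare_tol(c(1, 2, 3))          # FALSE
```

Note that all.equal() returns a character description rather than FALSE when the values differ, which is why it is wrapped in isTRUE().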
Since I keep coming back to this question over and over, here's an Rcpp solution that will generally be much much faster than any of the R solutions if the answer is actually FALSE (because it will stop the moment it encounters a mismatch) and will have the same speed as the fastest R solution if the answer is TRUE. For example for the OP benchmark, system.time clocks in at exactly 0 using this function.
library(inline)
library(Rcpp)
fast_equal = cxxfunction(signature(x = 'numeric', y = 'numeric'), '
NumericVector var(x);
double precision = as<double>(y);
for (int i = 0, size = var.size(); i < size; ++i) {
if (var[i] - var[0] > precision || var[0] - var[i] > precision)
return Rcpp::wrap(false);
}
return Rcpp::wrap(true);
', plugin = 'Rcpp')
fast_equal(c(1,2,3), 0.1)
#[1] FALSE
fast_equal(c(1,2,3), 2)
#[1] TRUE
I wrote a function specifically for this, which can check not only the elements of a vector but also whether all elements of a list are identical. Of course, it also handles character vectors and all other vector types. It also has appropriate error handling.
all_identical <- function(x) {
if (length(x) == 1L) {
warning("'x' has a length of only 1")
return(TRUE)
} else if (length(x) == 0L) {
warning("'x' has a length of 0")
return(logical(0))
} else {
TF <- vapply(1:(length(x)-1),
function(n) identical(x[[n]], x[[n+1]]),
logical(1))
if (all(TF)) TRUE else FALSE
}
}
Now try some examples.
x <- c(1, 1, 1, NA, 1, 1, 1)
all_identical(x) ## Return FALSE
all_identical(x[-4]) ## Return TRUE
y <- list(fac1 = factor(c("A", "B")),
fac2 = factor(c("A", "B"), levels = c("B", "A"))
)
all_identical(y) ## Return FALSE as fac1 and fac2 have different level order
You do not actually need to use min, mean, or max.
Based on John's answer:
all(abs(x - x[[1]]) < tolerance)
Here is an alternative using the min/max trick, but for a data frame. In the example I am comparing columns, but the margin parameter of apply can be changed to 1 for rows.
valid = sum(!apply(your_dataframe, 2, function(x) diff(c(min(x), max(x)))) == 0)
If valid == 0, then all the elements within each column are the same.
Another solution which uses the data.table package, compatible with strings and NA is uniqueN(x) == 1
