How can I check whether an integer vector is "sequential", i.e. that the difference between successive elements is exactly one? I feel like I am missing something like "is.sequential".
Here's my own function:
is.sequential <- function(x){
  all(diff(x) == rep(1, length(x) - 1))
}
There's no need for rep since 1 will be recycled:
Edited to allow 5:2 as TRUE:
is.sequential <- function(x){
  all(abs(diff(x)) == 1)
}
To allow for different sequences:
is.sequential <- function(x){
  all(diff(x) == diff(x)[1])
}
So, @Iselzer has a fine answer. There are still some corner cases though: rounding errors and the starting value. Here's a version that allows rounding errors but checks that the first value is (almost) an integer.
is.sequential <- function(x, eps = 1e-8) {
  if (length(x) && isTRUE(abs(x[1] - floor(x[1])) < eps)) {
    all(abs(diff(x) - 1) < eps)
  } else {
    FALSE
  }
}
is.sequential(2:5) # TRUE
is.sequential(5:2) # FALSE
# Handle rounding errors?
x <- ((1:10)^0.5)^2
is.sequential(x) # TRUE
# Does the sequence need to start on an integer?
x <- c(1.5, 2.5, 3.5, 4.5)
is.sequential(x) # FALSE
# Is an empty vector a sequence?
is.sequential(numeric(0)) # FALSE
# What about NAs?
is.sequential(c(NA, 1)) # FALSE
This question is quite old by now, but in certain circumstances it is actually quite useful to know whether a vector is sequential.
Both of the answers above are quite good, but as mentioned by Tommy the accepted answer has some flaws. It seems natural that a 'sequence' is any 'sequence of numbers which are equally spaced'. This would include negative sequences, sequences with a starting value different from 0 or 1, and so forth.
A fairly general and safe implementation is given below, which accounts for:
- negative values (-3 to 1) and negative directions (3 to 1)
- sequences with non-integer steps (3.5, 3.6, 3.7, ...)
- wrong input types such as infinite values, NA and NaN values, data.frames, etc.
is.sequence <- function(x, ...)
  UseMethod("is.sequence", x)

is.sequence.default <- function(x, ...) {
  FALSE
}

is.sequence.numeric <- function(x, tol = sqrt(.Machine$double.eps), ...) {
  if (anyNA(x) || any(is.infinite(x)) || length(x) <= 1 || diff(x[1:2]) == 0)
    return(FALSE)
  diff(range(diff(x))) <= tol
}

is.sequence.integer <- function(x, ...) {
  is.sequence.numeric(x, ...)
}
n <- 1236
#Test:
is.sequence(seq(-3, 5, length.out = n))
# TRUE
is.sequence(seq(5, -3, length.out = n))
# TRUE
is.sequence(seq(3.5, 2.5 + n, length.out = n))
# TRUE
is.sequence(LETTERS[1:7])
# FALSE
Basically the implementation checks whether the max and min of the successive differences are equal within a small tolerance.
While using the S3 class methods makes the implementation slightly more complicated it simplifies checks for wrong input types, and allows for implementations for other classes. For example this makes it simple to extend this method to say Date objects, which would require one to consider if a sequence of only weekdays (or work days) is also a sequence.
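As a rough illustration of that extension (a hypothetical sketch, not part of the original answer), a Date method could simply defer to the numeric method, since Date objects are stored internally as day counts; any weekday-only logic would then live in this method:
# Hypothetical sketch: a Date vector is a sequence if its day counts are equally spaced.
# Sequences of only weekdays/work days would need extra handling here.
is.sequence.Date <- function(x, tol = sqrt(.Machine$double.eps), ...) {
  is.sequence.numeric(as.numeric(x), tol = tol, ...)
}
is.sequence(seq(as.Date("2020-01-01"), by = "day", length.out = 5))
# TRUE
is.sequence(as.Date(c("2020-01-01", "2020-01-03", "2020-01-04")))
# FALSE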
Speed comparison
This implementation is very safe, but the S3 dispatch adds some overhead. For short vectors the benefit is the generality of the implementation, at the cost of being around 15% slower at worst. For longer vectors it is, however, slightly faster, as shown in the microbenchmark below.
Note that the median time is the better basis for comparison, as the garbage collector may add unpredictable time to the benchmark.
ss <- seq(1, 1e6)
microbenchmark::microbenchmark(is.sequential(ss),
                               is.sequence(ss), # the integer method calls the numeric one, adding a bit of overhead
                               is.sequence.numeric(ss))
# Unit: milliseconds
# expr min lq mean median uq max neval
# is.sequential(ss) 19.47332 20.02534 21.58227 20.45541 21.23700 66.07200 100
# is.sequence(ss) 16.09662 16.65412 20.52511 17.05360 18.23958 61.23029 100
# is.sequence.numeric(ss) 16.00751 16.72907 19.08717 17.01962 17.66150 55.90792 100
Related
Suppose you have two variables that both have some missing data, but these missing data may not overlap perfectly. What is the easiest way of finding the number of common datapoints with no missing values? Is there some built-in function?
One way is to make a function like the following:
pairwise.miss = function(x, y) {
  # deal with input types
  x = as.vector(x)
  y = as.vector(y)
  # make combined object
  c = cbind(x, y)
  # remove rows with any NA
  c = c[complete.cases(c), ]
  # return the number of complete rows
  return(nrow(c))
}
Another idea is to use some function that returns the pairwise complete data. For instance, rcorr() from Hmisc does this, but may give errors for non-numeric data. So:
rcorr(x, y)$n[1,2]
Is there an easier way?
You can simply list the two variables in complete.cases() and sum() the output.
x <- c(1, 2, 3, NA, NA, NA, 5)
y <- c(1, NA, 3, NA, 3, 2, NA)
complete.cases(x, y)
#[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
sum(complete.cases(x, y))
#[1] 2
The sum of a logical vector is the number of TRUE elements since TRUE is coerced to 1 and FALSE to 0.
This works for any data type. However, note that empty strings, i.e. "", are not considered missing. An actual missing character value is denoted by NA_character_.
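For instance (a small illustration I added, not from the original answer), an empty string still counts as a complete value, while NA_character_ does not:
a <- c("x", "", NA_character_)
b <- c("u", "v", "w")
complete.cases(a, b)
#[1]  TRUE  TRUE FALSE
sum(complete.cases(a, b))
#[1] 2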
A possible solution is to use is.na and logical operators:
!(is.na(x) | is.na(y)) # logical vector
which(!(is.na(x) | is.na(y))) # integer vector of indices.
If you want only the total count, use:
sum(!(is.na(x) | is.na(y)))
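Using the example vectors x and y from above, this agrees with the complete.cases() count:
x <- c(1, 2, 3, NA, NA, NA, 5)
y <- c(1, NA, 3, NA, 3, 2, NA)
sum(!(is.na(x) | is.na(y)))
#[1] 2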
I benchmarked the solutions given above:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(microbenchmark)
#fetch some data
x = iris[1] #from iris
y = iris[1]
x[sample(1:150, 50), ] = NA #random subset
y[sample(1:150, 50), ] = NA
#benchmark
times = microbenchmark(pairwise.function = pairwise.miss(x, y),
                       sum.is.na = sum(!is.na(x) & !is.na(y)),
                       sum.is.na2 = sum(!(is.na(x) | is.na(y))),
                       sum.complete.cases = sum(complete.cases(x, y))); times
Results:
> times
Unit: microseconds
expr min lq mean median uq max neval
pairwise.function 202.205 217.2935 244.31481 233.3150 253.8460 450.763 100
sum.is.na 75.594 78.5500 89.26383 80.5730 94.1035 248.558 100
sum.is.na2 74.662 77.6170 89.23899 80.5725 94.8825 167.676 100
sum.complete.cases 14.311 16.1770 18.77197 17.1105 17.7330 155.233 100
So my original method was horribly slow compared to the sum.complete.cases one.
Perhaps there is rarely a need for speed in this computation, but one might as well use the most efficient method when it is equally easy to use.
I have an exercise where I have to see how many values of a vector are divisible by 2. I have this random sample:
set.seed(1)
y <- sample(c(0:99, NA), 400, replace=TRUE)
I created a new variable d to see which of the values are or aren't divisible by 2:
d <- y/2 ; d
What I want to do is to create a logical test, where all whole numbers give TRUE and the rest give FALSE (e.g. 22.0 -> TRUE and 24.5 -> FALSE).
I used this command, but I believe that the answer is wrong since it would only give me the numbers that are in the sample:
sum(d %in% y, na.rm=T)
I also tried this (I found it on the internet, but I don't really understand it):
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
sum(is.wholenumber(d),na.rm = T)
Are there other ways that I could use the operator "%%"?
You can sum over the mod operator like so: sum(1 - y %% 2) or sum(y %% 2 == 0). Note that x %% 2 is the remainder after dividing by two, which is why this solution works.
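Since the OP's sample y contains NA values, you may also want to drop them with na.rm = TRUE. A small sketch using the data from the question:
set.seed(1)
y <- sample(c(0:99, NA), 400, replace = TRUE)
sum(y %% 2 == 0, na.rm = TRUE)  # count of even values, NAs dropped
mean(y %% 2 == 0, na.rm = TRUE) # proportion of even values among the non-missing ones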
Here are three different ways:
length(y[y %% 2 == 0])
length(subset(y, y %% 2 == 0))
length(Filter(function(x) x %% 2 == 0, y))
Since we're talking about a division by 2, I would actually take it to the bit level and check if the last bit of the number is a 0 or a 1 (a 0 means it would be divisible by 2).
Going out on a limb here (I'm not sure how the division by 2 is handled internally), but I think this would likely be cheaper than an actual division, which is typically fairly expensive.
To do this at the bit level, you can do an AND operation between the number and 1; if the result is 1, the number is not divisible by 2:
bitwAnd(a, b)
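A small sketch of how this could be applied to the OP's vector (bitwAnd() is in base R; the NAs still need na.rm = TRUE when counting):
set.seed(1)
y <- sample(c(0:99, NA), 400, replace = TRUE)
# bitwAnd(y, 1) is 0 for even numbers and 1 for odd numbers
sum(bitwAnd(y, 1) == 0, na.rm = TRUE)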
I am very new to R and trying to get my head around different commands. I have this simple code:
setwd("C:/Research")
tempdata=read.csv("temperature_humidity.csv")
Thour=tempdata$t
RHhour=tempdata$RH
weather=data.frame(cbind(hour,Thour,RHhour))
head(weather)
if (Thour>25) {
y=0 else {
y=3
}
x=Thour+y*2
x
I simply want the code to read Thour (temperature) from the CSV file and, if it is higher than 25, use y = 0 in the formula; if it is lower than 25, use y = 3.
I tried ifelse but it didn't work either.
Thanks for your help.
I've said that too many times today already, but avoid ifelse statements as much as possible (very inefficient and unnecessary in most cases); try this instead:
c(3, 0)[(Thour >= 25) + 1]
The comparison Thour >= 25 returns a logical vector of TRUE/FALSE, which is coerced to 1/0 when 1 is added; the resulting 2/1 is then used as an index into c(3, 0).
Or even better solution (posted by #BondedDust in comments) would be:
3*(Thour <= 25)
Here the comparison returns a logical vector of TRUE/FALSE, which is coerced to 1/0 when multiplied by 3, giving 3 or 0.
Benchmark comparison:
Thour <- sample(1:100000)
require(microbenchmark)
microbenchmark(ifel = {ifelse(Thour < 25 , 0 , 3)}, Bool = {3*(Thour >= 25)})
Unit: microseconds
expr min lq median uq max neval
ifel 38633.499 41643.768 41786.978 55153.050 59169.69 100
Bool 901.135 1848.091 1972.434 2010.841 20754.74 100
This should work for you. Just replace what you're naming Thour with the appropriate code.
Thour <- sample(1:100, 1)
Thour
# [1] 8
y <- ifelse(Thour >= 25, 0, 3)
y
# [1] 3
And:
Thour <- sample(1:100, 1)
Thour
# [1] 37
y <- ifelse(Thour >= 25, 0, 3)
y
# [1] 0
You may need to change the logical operator (>=) to match your exact circumstance, since it's unclear which, if any, of the higher or lower range you want to be inclusive.
R has a very flexible syntax. So you can write this in many ways:
# ifelse() function
y <- ifelse(Thour > 25, 0, 3)
# more ifelse()
y <- 3 * ifelse(Thour > 25, 0, 1)
# The simpler way:
y <- 3 * (Thour <= 25)
By the way, use <- instead of = for assignment... it's the "preferred" style
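Putting it together with the OP's data, a minimal sketch (the column name t comes from the question; the hour column from the original snippet is left out since it isn't defined there):
tempdata <- read.csv("temperature_humidity.csv")
Thour <- tempdata$t
y <- 3 * (Thour <= 25)  # vectorized: 0 when Thour > 25, 3 otherwise, for every row at once
x <- Thour + y * 2
head(x)
Because the comparison is vectorized, this works on the whole column at once, whereas if() only looks at a single value.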
I'm trying to test whether all elements of a vector are equal to one another. The solutions I have come up with seem somewhat roundabout, both involving checking length().
x <- c(1, 2, 3, 4, 5, 6, 1) # FALSE
y <- rep(2, times = 7) # TRUE
With unique():
length(unique(x)) == 1
length(unique(y)) == 1
With rle():
length(rle(x)$values) == 1
length(rle(y)$values) == 1
A solution that would let me include a tolerance value for assessing 'equality' among elements would be ideal to avoid FAQ 7.31 issues.
Is there a built-in function for this type of test that I have completely overlooked? identical() and all.equal() compare two R objects, so they won't work here.
Edit 1
Here are some benchmarking results. Using the code:
library(rbenchmark)
John <- function() all(abs(x - mean(x)) < .Machine$double.eps ^ 0.5)
DWin <- function() {diff(range(x)) < .Machine$double.eps ^ 0.5}
zero_range <- function() {
  if (length(x) == 1) return(TRUE)
  x <- range(x) / mean(x)
  isTRUE(all.equal(x[1], x[2], tolerance = .Machine$double.eps ^ 0.5))
}
x <- runif(500000);
benchmark(John(), DWin(), zero_range(),
columns=c("test", "replications", "elapsed", "relative"),
order="relative", replications = 10000)
With the results:
test replications elapsed relative
2 DWin() 10000 109.415 1.000000
3 zero_range() 10000 126.912 1.159914
1 John() 10000 208.463 1.905251
So it looks like diff(range(x)) < .Machine$double.eps ^ 0.5 is fastest.
Why not simply use the variance:
var(x) == 0
If all the elements of x are equal, you will get a variance of 0.
This works only for doubles and integers though.
Edit based on the comments below:
A more generic option would be to check the number of unique elements in the vector, which must be 1 in this case. This has the advantage that it works for all classes, not just doubles and integers, for which a variance can be calculated.
length(unique(x)) == 1
If they're all numeric values, and tol is your tolerance, then...
all( abs(y - mean(y)) < tol )
is the solution to your problem.
EDIT:
After looking at this, and other answers, and benchmarking a few things the following comes out over twice as fast as the DWin answer.
abs(max(x) - min(x)) < tol
This is, somewhat surprisingly, faster than diff(range(x)), since diff shouldn't be much different from - and abs with two numbers, and requesting the range should optimize getting the minimum and maximum. Both diff and range are primitive functions. But the timing doesn't lie.
And, in addition, as #Waldi pointed out, abs is superfluous here.
I use this method, which compares the min and the max, after dividing by the mean:
# Determine if range of vector is FP 0.
zero_range <- function(x, tol = .Machine$double.eps ^ 0.5) {
if (length(x) == 1) return(TRUE)
x <- range(x) / mean(x)
isTRUE(all.equal(x[1], x[2], tolerance = tol))
}
If you were using this more seriously, you'd probably want to remove missing values before computing the range and mean.
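A minimal sketch of that adjustment (my addition, with the hypothetical name zero_range_na): drop the NAs before taking the range and mean.
zero_range_na <- function(x, tol = .Machine$double.eps ^ 0.5) {
  x <- x[!is.na(x)]                 # drop missing values first
  if (length(x) <= 1) return(TRUE)  # empty or single value: trivially constant
  x <- range(x) / mean(x)
  isTRUE(all.equal(x[1], x[2], tolerance = tol))
}
zero_range_na(c(2, 2, NA, 2))
# TRUE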
You can just check all(v==v[1])
> isTRUE(all.equal(max(y), min(y)))
[1] TRUE
> isTRUE(all.equal(max(x), min(x)))
[1] FALSE
Another along the same lines:
> diff(range(x)) < .Machine$double.eps ^ 0.5
[1] FALSE
> diff(range(y)) < .Machine$double.eps ^ 0.5
[1] TRUE
You can use identical() and all.equal() by comparing the first element to all others, effectively sweeping the comparison across:
R> compare <- function(v) all(sapply( as.list(v[-1]),
+ FUN=function(z) {identical(z, v[1])}))
R> compare(x)
[1] FALSE
R> compare(y)
[1] TRUE
R>
That way you can also swap the inner identical() for all.equal() with whatever epsilon (tolerance) you need.
Since I keep coming back to this question over and over, here's an Rcpp solution that will generally be much much faster than any of the R solutions if the answer is actually FALSE (because it will stop the moment it encounters a mismatch) and will have the same speed as the fastest R solution if the answer is TRUE. For example for the OP benchmark, system.time clocks in at exactly 0 using this function.
library(inline)
library(Rcpp)
fast_equal = cxxfunction(signature(x = 'numeric', y = 'numeric'), '
NumericVector var(x);
double precision = as<double>(y);
for (int i = 0, size = var.size(); i < size; ++i) {
if (var[i] - var[0] > precision || var[0] - var[i] > precision)
return Rcpp::wrap(false);
}
return Rcpp::wrap(true);
', plugin = 'Rcpp')
fast_equal(c(1,2,3), 0.1)
#[1] FALSE
fast_equal(c(1,2,3), 2)
#[1] TRUE
I wrote a function specifically for this, which can check not only the elements of a vector but also whether all elements of a list are identical. It handles character vectors and all other vector types as well, and it has appropriate error handling.
all_identical <- function(x) {
  if (length(x) == 1L) {
    warning("'x' has a length of only 1")
    return(TRUE)
  } else if (length(x) == 0L) {
    warning("'x' has a length of 0")
    return(logical(0))
  } else {
    TF <- vapply(1:(length(x) - 1),
                 function(n) identical(x[[n]], x[[n + 1]]),
                 logical(1))
    if (all(TF)) TRUE else FALSE
  }
}
Now try some examples.
x <- c(1, 1, 1, NA, 1, 1, 1)
all_identical(x) ## Return FALSE
all_identical(x[-4]) ## Return TRUE
y <- list(fac1 = factor(c("A", "B")),
fac2 = factor(c("A", "B"), levels = c("B", "A"))
)
all_identical(y) ## Return FALSE as fac1 and fac2 have different level order
You do not actually need to use min, mean, or max.
Based on John's answer:
all(abs(x - x[[1]]) < tolerance)
Here is an alternative using the min/max trick, but for a data frame. In the example I am comparing columns, but the MARGIN parameter of apply can be changed to 1 for rows.
valid = sum(!apply(your_dataframe, 2, function(x) diff(c(min(x), max(x)))) == 0)
If valid == 0 then all the elements are the same
Another solution, which uses the data.table package and is compatible with strings and NA, is uniqueN(x) == 1.
In R, what is the most efficient/idiomatic way to count the number of TRUE values in a logical vector? I can think of two ways:
z <- sample(c(TRUE, FALSE), 1000, rep = TRUE)
sum(z)
# [1] 498
table(z)["TRUE"]
# TRUE
# 498
Which do you prefer? Is there anything even better?
The safest way is to use sum with na.rm = TRUE:
sum(z, na.rm = TRUE) # best way to count TRUE values
which gives 1 for the example vector below.
There are some problems with the other solutions when the logical vector contains NA values. See for example:
z <- c(TRUE, FALSE, NA)
sum(z) # gives you NA
table(z)["TRUE"] # gives you 1
length(z[z == TRUE]) # f3lix answer, gives you 2 (because NA indexing returns values)
Additionally, the table solution is less efficient (look at the code of the table function).
Also, you should be careful with the "table" solution, in case there are no TRUE values in the logical vector. See for example:
z <- c(FALSE, FALSE)
table(z)["TRUE"] # gives you `NA`
or
z <- c(NA, FALSE)
table(z)["TRUE"] # gives you `NA`
Another option which hasn't been mentioned is to use which:
length(which(z))
Just to actually provide some context on the "which is faster" question, it's always easiest just to test yourself. I made the vector much larger for comparison:
z <- sample(c(TRUE,FALSE),1000000,rep=TRUE)
system.time(sum(z))
user system elapsed
0.03 0.00 0.03
system.time(length(z[z==TRUE]))
user system elapsed
0.75 0.07 0.83
system.time(length(which(z)))
user system elapsed
1.34 0.28 1.64
system.time(table(z)["TRUE"])
user system elapsed
10.62 0.52 11.19
So clearly using sum is the best approach in this case. You may also want to check for NA values as Marek suggested.
Just to add a note regarding NA values and the which function:
> which(c(T, F, NA, NULL, T, F))
[1] 1 4
> which(!c(T, F, NA, NULL, T, F))
[1] 2 5
Note that which() only returns the indices of elements that are TRUE, so it ignores the NA (and the NULL simply disappears when the vector is constructed).
Another way is
> length(z[z==TRUE])
[1] 498
While sum(z) is nice and short, for me length(z[z==TRUE]) is more self explaining. Though, I think with a simple task like this it does not really make a difference...
If it is a large vector, you probably should go with the fastest solution, which is sum(z). length(z[z==TRUE]) is about 10x slower and table(z)["TRUE"] is about 200x slower than sum(z).
Summing up, sum(z) is the fastest to type and to execute.
Another option is to use the summary function. It gives a summary of the TRUEs, FALSEs and NAs:
> summary(hival)
Mode FALSE TRUE NA's
logical 4367 53 2076
>
which() is a good alternative, especially when you operate on matrices (check ?which and notice the arr.ind argument). But I suggest you stick with sum, because of the na.rm argument that handles NA's in a logical vector.
For instance:
# create dummy variable
set.seed(100)
x <- round(runif(100, 0, 1))
x <- x == 1
# create NA's
x[seq(1, length(x), 7)] <- NA
If you type in sum(x) you'll get NA as a result, but if you pass na.rm = TRUE in sum function, you'll get the result that you want.
> sum(x)
[1] NA
> sum(x, na.rm=TRUE)
[1] 43
Is your question strictly theoretical, or do you have some practical problem concerning logical vectors?
There's also a package called bit that is specifically designed for fast boolean operations. It's especially useful if you have large vectors or need to do many boolean operations.
z <- sample(c(TRUE, FALSE), 1e8, rep = TRUE)
system.time({
sum(z) # 0.170s
})
system.time({
bit::sum.bit(z) # 0.021s, ~10x improvement in speed
})
I was doing something similar a few weeks ago. Here's a possible solution; it's written from scratch, so it's kind of a beta release or something like that. I'll try to improve it by removing the loops from the code...
The main idea is to write a function that takes 2 (or 3) arguments. The first is a data.frame which holds the data gathered from the questionnaire, and the second is a numeric vector with the correct answers (this is only applicable to a single-choice questionnaire). Alternatively, you can add a third argument that makes it return either a numeric vector with the final score, or a data.frame with the score embedded.
fscore <- function(x, sol, output = 'numeric') {
  if (ncol(x) != length(sol)) {
    stop('Number of items differs from length of correct answers!')
  } else {
    inc <- matrix(ncol = ncol(x), nrow = nrow(x))
    for (i in 1:ncol(x)) {
      inc[, i] <- x[, i] == sol[i]
    }
    if (output == 'numeric') {
      res <- rowSums(inc)
    } else if (output == 'data.frame') {
      res <- data.frame(x, result = rowSums(inc))
    } else {
      stop('Type not supported!')
    }
  }
  return(res)
}
I'll try to do this in a more elegant manner with some *ply function. Notice that I didn't include an na.rm argument... I will do that.
# create dummy data frame - values from 1 to 5
set.seed(100)
d <- as.data.frame(matrix(round(runif(200,1,5)), 10))
# create solution vector
sol <- round(runif(20, 1, 5))
Now apply a function:
> fscore(d, sol)
[1] 6 4 2 4 4 3 3 6 2 6
If you pass output = 'data.frame', it will return the modified data.frame.
I'll try to fix this one... Hope it helps!
I've just had a particular problem where I had to count the number of true statements from a logical vector and this worked best for me...
length(grep(TRUE, (gene.rep.matrix[i,1:6] > 1))) > 5
This takes a subset of the gene.rep.matrix object and applies a logical test, returning a logical vector. This vector is passed as an argument to grep, which returns the locations of any TRUE entries. length() then counts how many entries grep found, giving the number of TRUE entries.
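A small sketch of the same idea on a made-up matrix (the object and threshold here are hypothetical stand-ins for gene.rep.matrix); as discussed in the answers above, sum() on the logical vector gives the same count more directly:
set.seed(42)
m <- matrix(rpois(12, lambda = 2), nrow = 2)  # hypothetical stand-in for gene.rep.matrix
length(grep(TRUE, m[1, 1:6] > 1))  # number of entries in row 1 that are greater than 1
sum(m[1, 1:6] > 1)                 # same count via sum() on the logical vector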