unique() or duplicated() with all.equal() functionality?

I am searching for a (simple) function in R to remove duplicated elements, like unique() or duplicated(), that can account for "near equality" of numerical values the way all.equal() does:
unique( c(0, 0))
[1] 0
works fine, but
unique( c(0, cos(pi/2)) )
[1] 0.000000e+00 6.123234e-17
does not remove the second element, although a comparison with all.equal returns TRUE:
all.equal( 0, cos(pi/2) )
[1] TRUE
The same holds for duplicated():
duplicated( c(0, cos(pi/2)))
[1] FALSE FALSE
Any suggestions? Thanks!

You might also consider the zapsmall function:
x <- rep(c(1,2), each=5) + rnorm(10)/(10^rep(1:5,2))
unique(x)
# [1] 1.0571484 1.0022854 1.0014347 0.9998829 0.9999985 2.1095720 1.9888208 2.0002687 1.9999723 2.0000078
unique(zapsmall(x, digits=4))
# [1] 1.0571 1.0023 1.0014 0.9999 1.0000 2.1096 1.9888 2.0003 2.0000
unique(zapsmall(x, digits=2))
# [1] 1.06 1.00 2.11 1.99 2.00
unique(zapsmall(x, digits=0))
# [1] 1 2

If you'd like to take into account the absolute error, and not the relative error (as all.equal does), try:
x <- c(0, cos(pi/2), 1, 1+1e-16)
unique(x)
## [1] 0.000000e+00 6.123234e-17 1.000000e+00
(x <- x[!duplicated(round(x, 10))])
## [1] 0 1
Here we remove the elements that are the same w.r.t. a fixed (10 above) number of decimal digits.

You could try this code (disclaimer: from my package cgwtools)
approxeq <- function (x, y, tolerance = .Machine$double.eps^0.5, ...)
{
    if (length(x) != length(y))
        warning("x,y lengths differ. Will recycle.")
    checkit <- abs(x - y) < tolerance
    return(invisible(checkit))
}
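To drop near-duplicates from a single vector, rather than compare two vectors element by element, the same absolute-tolerance test can be applied against the values already kept. A minimal sketch, assuming a hypothetical helper unique_tol (not part of base R or cgwtools) and assuming it is acceptable to keep the first value of each near-equal group:

```r
# unique_tol is a hypothetical helper, not a base R or cgwtools function.
# It keeps x[i] only if it is not within `tolerance` (absolute difference)
# of a value that has already been kept.
unique_tol <- function(x, tolerance = .Machine$double.eps^0.5) {
  keep <- logical(length(x))
  for (i in seq_along(x)) {
    # x[keep] holds the values kept so far; any() of an empty test is FALSE
    keep[i] <- !any(abs(x[i] - x[keep]) < tolerance)
  }
  x[keep]
}

unique_tol(c(0, cos(pi/2), 1, 1 + 1e-16))
# [1] 0 1
```

The loop is quadratic in the worst case, so this is meant for short vectors; the rounding approach above is cheaper when a fixed number of decimals is good enough.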

Related

How to concisely deal with subsets when their lengths become zero?

To exclude elements from a vector x,
x <- c(1, 4, 3, 2)
we can subtract a vector of positions:
excl <- c(2, 3)
x[-excl]
# [1] 1 2
This also works dynamically,
(excl <- which(x[-which.max(x)] > quantile(x, .25)))
# [1] 2 3
x[-excl]
# [1] 1 2
until excl is of length zero:
excl.nolength <- which(x[-which.max(x)] > quantile(x, .95))
length(excl.nolength)
# [1] 0
x[-excl.nolength]
# numeric(0)
I could kind of reformulate that, but I have many objects to which excl is applied, say:
letters[1:4][-excl.nolength]
# character(0)
I know I could use setdiff, but that's rather long and hard to read:
x[setdiff(seq(x), excl.nolength)]
# [1] 1 4 3 2
letters[1:4][setdiff(seq(letters[1:4]), excl.nolength)]
# [1] "a" "b" "c" "d"
Now, I could exploit the fact that nothing is excluded if the element number is greater than the number of elements:
length(x)
# [1] 4
x[-5]
# [1] 1 4 3 2
To generalize that I should probably use .Machine$integer.max:
tmp <- which(x[-which.max(x)] > quantile(x, .95))
excl <- if (!length(tmp) == 0) tmp else .Machine$integer.max
x[-excl]
# [1] 1 4 3 2
Wrapped into a function,
e <- function(x) if (!length(x) == 0) x else .Machine$integer.max
that's quite handy and clear:
x[-e(excl)]
# [1] 1 2
x[-e(excl.nolength)]
# [1] 1 4 3 2
letters[1:4][-e(excl.nolength)]
# [1] "a" "b" "c" "d"
But it seems a little fishy to me...
Is there a better equally concise way to deal with a subset of length zero in base R?
Edit
excl is the dynamic result of an earlier function call (as shown with which above) and may or may not be of length zero. If length(excl) == 0, nothing should be excluded. Ideally, the following lines of code, e.g. x[-excl], should not have to change at all, or as little as possible.
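The empty result follows from how negative indices are built: negating a zero-length vector gives another zero-length vector, so x[-excl] is the same request as x[integer(0)], which selects nothing. A short illustration, reusing the variable names from the question (neg, res_neg and res_empty are names introduced here):

```r
x <- c(1, 4, 3, 2)
excl.nolength <- integer(0)

# Unary minus on an empty vector has nothing to negate
neg <- -excl.nolength
length(neg)
# [1] 0

# Both calls therefore ask for "no elements", not "all elements"
res_neg   <- x[neg]
res_empty <- x[integer(0)]
identical(res_neg, res_empty)
# [1] TRUE
```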
You can overwrite [ with your own function.
"[" <- function(x,y) {if(length(y)==0) x else .Primitive("[")(x,y)}
x <- c(1, 4, 3, 2)
excl <- c(2, 3)
x[-excl]
#[1] 1 2
excl <- integer()
x[-excl]
#[1] 1 4 3 2
rm("[") #Go back to normal mode
I would argue this is somewhat opinion based.
For example, I find:
x <- x[-if(length(excl <- which(x[-which.max(x)] > quantile(x, .95))) == 0) .Machine$integer.max else excl]
very unreadable, but some people like one-liners. Reading package code you'll often find this is instead split up into one of the many suggestions you gave
excl <- which(x[-which.max(x)] > quantile(x, .95))
if(length(excl) != 0)
  x <- x[-excl]
Alternatively, you could avoid which, and simply use the logical vector for subsetting, and this would likely be considered more clean by most
x <- x[!x[-which.max(x)] > quantile(x, .95)]
This would avoid zero-length index problem, at the cost of some loss of efficiency.
As a side note, the very example used above and in the question seems somewhat off. First which.max only returns the first index which is equal to the max value, and in addition the index will be offset for every value removed. More likely the expected example would be
x <- x[!(x > quantile(x, .95))[-which(x == max(x))]]
How about this?
a <- letters[1:3]
excl1 <- c(1,3)
excl2 <- c()
a[!(seq_along(a) %in% excl1)]
a[!(seq_along(a) %in% excl2)]
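The two calls above return "b" and the full vector respectively, because %in% against an empty set is FALSE for every position. Wrapped into a function, the idiom might look like the sketch below; drop_at is a hypothetical name chosen here, not a base R function.

```r
# drop_at is a hypothetical helper name; it removes the positions in `excl`
# from `x` and leaves `x` untouched when `excl` is empty or NULL.
drop_at <- function(x, excl) x[!(seq_along(x) %in% excl)]

x <- c(1, 4, 3, 2)
drop_at(x, c(2, 3))
# [1] 1 2
drop_at(x, integer(0))
# [1] 1 4 3 2
drop_at(letters[1:4], integer(0))
# [1] "a" "b" "c" "d"
```

This changes the call site from x[-excl] to drop_at(x, excl), but it avoids both the sentinel index and overriding `[`.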

Floor and ceiling with 2 or more significant digits

It is possible to round results into two significant digits using signif:
> signif(12500,2)
[1] 12000
> signif(12501,2)
[1] 13000
But are there equally handy functions, like the fictitious signif.floor and signif.ceiling below, that give two or more significant digits with flooring or ceiling?
> signif.ceiling(12500,2)
[1] 13000
> signif.floor(12501,2)
[1] 12000
EDIT:
The existing signif function works with negative numbers and decimal numbers.
Therefore, the possible solution would preferably work also with negative numbers:
> signif(-125,2)
[1] -120
> signif.floor(-125,2)
[1] -130
and decimal numbers:
> signif(1.23,2)
[1] 1.2
> signif.ceiling(1.23,2)
[1] 1.3
As a special case, also 0 should return 0:
> signif.floor(0,2)
[1] 0
I think this approach is proper for all types of numbers (i.e. integers, negative, decimal).
The floor function
signif.floor <- function(x, n){
  pow <- floor( log10( abs(x) ) ) + 1 - n
  y <- floor(x / 10 ^ pow) * 10^pow
  # handle the x = 0 case
  y[x==0] <- 0
  y
}
The ceiling function
signif.ceiling <- function(x, n){
  pow <- floor( log10( abs(x) ) ) + 1 - n
  y <- ceiling(x / 10 ^ pow) * 10^pow
  # handle the x = 0 case
  y[x==0] <- 0
  y
}
They both do the same thing. First count the number of digits, next use the standard floor/ceiling function. Check if it works for you.
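As a worked example of the arithmetic: for x = 12501 and n = 2, floor(log10(12501)) is 4, so pow = 4 + 1 - 2 = 3; then 12501 / 10^3 = 12.501, which floors to 12 and scales back to 12000. Since the two functions differ only in the rounding step, they could be folded into one. A sketch, where signif_dir and its fun argument are hypothetical names introduced here:

```r
# signif_dir is a hypothetical helper; `fun` is the rounding function
# (floor or ceiling) applied at the n-th significant digit.
signif_dir <- function(x, n, fun = floor) {
  pow <- floor(log10(abs(x))) + 1 - n   # exponent of the last digit kept
  y <- fun(x / 10^pow) * 10^pow
  y[x == 0] <- 0                        # log10(0) is -Inf, so patch zeros
  y
}

signif_dir(12501, 2, floor)
# [1] 12000
signif_dir(12500, 2, ceiling)
# [1] 13000
signif_dir(-125, 2, floor)
# [1] -130
signif_dir(0, 2)
# [1] 0
```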
Edit 1 Added the handler for the case of x = 0 as suggested in the comments by Heikki.
Edit 2 Again following Heikki I add some examples:
Testing different values of x
# for negative values
> values <- -0.12151 * 10^(0:4); values
# [1] -0.12151 -1.21510 -12.15100 -121.51000 -1215.10000
> sapply(values, function(x) signif.floor(x, 2))
# [1] -0.13 -1.30 -13.00 -130.00 -1300.00
> sapply(values, function(x) signif.ceiling(x, 2))
# [1] -0.12 -1.20 -12.00 -120.00 -1200.00
# for positive values
> sapply(-values, function(x) signif.floor(x, 2))
# [1] 0.12 1.20 12.00 120.00 1200.00
> sapply(-values, function(x) signif.ceiling(x, 2))
# [1] 0.13 1.30 13.00 130.00 1300.00
Testing different values of n
> sapply(1:5, function(n) signif.floor(-121.51,n))
# [1] -200.00 -130.00 -122.00 -121.60 -121.51
> sapply(1:5, function(n) signif.ceiling(-121.51,n))
# [1] -100.00 -120.00 -121.00 -121.50 -121.51
Edit: Nowhere near as nice as @storaged's answer, but I'd started so I might as well finish:
Basically runs through each case (positive, negative, decimal or not)
signif.floor=function(x,n){
if(x==0)(out=0)
if(x%%round(x)==0 & sign(x)==1){out=as.numeric(paste0(el(strsplit(as.character(x),''))[1:n],collapse=''))*10^(nchar(x)-n)}
if(x%%round(x) >0 & sign(x)==1){out=as.numeric(paste0(el(strsplit(as.character(x),''))[1:(n+1)],collapse=''))}
if(x%%round(x)==0 & sign(x)==-1){out=(as.numeric(paste0(el(strsplit(as.character(x),''))[1:(n+1)],collapse=''))-1)*10^(nchar(x)-n-1)}
if(x%%round(x) <0 & sign(x)==-1){out=as.numeric(paste0(el(strsplit(as.character(x),''))[1:(n+2)],collapse=''))-+10^(-n+1)}
return(out)
}
signif.ceiling=function(x,n){
if(x==0)(out=0)
if(x%%round(x)==0 & sign(x)==1){out=(as.numeric(paste0(el(strsplit(as.character(x),''))[1:n],collapse=''))+1)*10^(nchar(x)-n)}
if(x%%round(x) >0 & sign(x)==1){out=as.numeric(paste0(el(strsplit(as.character(x),''))[1:(n+1)],collapse=''))+10^(-n+1)}
if(x%%round(x)==0 & sign(x)==-1){out=(as.numeric(paste0(el(strsplit(as.character(x),''))[1:(n+1)],collapse='')))*10^(nchar(x)-n-1)}
if(x%%round(x) < 0 & sign(x)==-1){out=as.numeric(paste0(el(strsplit(as.character(x),''))[1:(n+2)],collapse=''))}
return(out)
}

Logical check of whether vector values have the same precision

I have a vector with variable elements in it, and I want to check whether its last two elements are of the same digit order.
For example, if the last two elements are 0.0194 and 0.0198, return TRUE, because their digit order after zero is the same (0.01, order 10^-2). As another example, the numbers could be 0.00014 and 0.00012; their precision is still about the same, so the function should also return TRUE.
How can we build a logical statement or function to check this?
x<- c(0.817104, 0.241665, 0.040581, 0.022903, 0.019478, 0.019846)
I may be over-thinking this, but you can test that the order of magnitude and first non-zero digit are identical for each.
x <- c(0.817104, 0.241665, 0.040581, 0.022903, 0.019478, 0.019846)
oom <- function(x, base = 10) as.integer(ifelse(x == 0, 0, floor(log(abs(x), base))))
oom(x)
# [1] -1 -1 -2 -2 -2 -2
(tr <- trunc(x / 10 ** oom(x, 10)))
# [1] 8 2 4 2 1 1
So for the last two, the order of magnitude for both is -2 and the first non-zero digit is 1 for both.
Put into a function:
f <- function(x) {
  oom <- function(x, base = 10) as.integer(ifelse(x == 0, 0, floor(log(abs(x), base))))
  x <- tail(x, 2)
  oo <- oom(x)
  tr <- trunc(x / 10 ** oo)
  (oo[1] == oo[2]) & (tr[1] == tr[2])
}
## more test cases
x1 <- c(0.019, 0.011)
x2 <- c(0.01, 0.001)
f(x) ## TRUE
f(x1) ## TRUE
f(x2) ## FALSE
Here is a more general function than the above for checking the last n instead of 2
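g <- function(x, n = 2) {
  oom <- function(x, base = 10) as.integer(ifelse(x == 0, 0, floor(log(abs(x), base))))
  x <- tail(x, n)
  oo <- oom(x)
  tr <- trunc(x / 10 ** oo)
  # all n values must share one order of magnitude and one leading digit;
  # Reduce(`==`, ...) would compare a logical with a number once n > 2
  length(unique(oo)) == 1 && length(unique(tr)) == 1
}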
g(c(.24, .15, .14), 2) ## TRUE
g(c(.24, .15, .14), 3) ## FALSE
@rawr worries about over-thinking. I guess I should as well. This is what I came up with; note that it handles the fact that print representations of floating point numbers are sometimes deceiving.
orddig <- function(x) which(sapply(0:16, function(n) {
  isTRUE(all.equal(x * 10^n, round(x * 10^n, 0)))
}))[1]
> sapply( c(0.00014 , 0.00012 ), orddig)
[1] 6 6
My original efforts were with the signif function but that's a different numerical thought trajectory, since 0.01 and 0.001 have the same number of significant digits. Also notice that:
> sapply( 10^5*c(0.00014 , 0.00012 ), trunc, 4)
[1] 13 12
Which was why we need the isTRUE(all.equal(... , ...))

Weight any set of numbers to sum to zero in R

I want to write a function in R that calculates weights to sum any set of numbers in R to zero. For example if
x <- c(-5, 6, 2, 4, -3)
I want a function that would return a new vector which has been weighted to force the vector sum to zero, by taking something off the positive numbers and adding something to the negative values...
EDIT: To clarify I do not want to shift values up or down a scale... I want to weight so that the rescaled negative numbers become slightly more/less negative and the rescaled positive numbers become slightly less/more positive.
I am not sure 1) how to go about calculating the right values for proportional weights and 2) if there is a function in R that can do it?
How about
x <- scale(x)
> x
[,1]
[1,] -1.2450825
[2,] 1.1162809
[3,] 0.2576033
[4,] 0.6869421
[5,] -0.8157437
attr(,"scaled:center")
[1] 0.8
attr(,"scaled:scale")
[1] 4.658326
> sum(scale(x))
[1] 5.551115e-17
Edit:
As suggested by @Josh O'brien, setting scale = FALSE gives
scale(x, scale = FALSE)
[,1]
[1,] -5.8
[2,] 5.2
[3,] 1.2
[4,] 3.2
[5,] -3.8
attr(,"scaled:center")
[1] 0.8
sum(scale(x, scale = FALSE))
[1] 6.661338e-16
1) offsets @jdharrison has already indicated that if you want a vector a such that sum(x-a) is zero, then setting a to be the mean of x will do it.
2) weight vector The wording of the question seems to ask for a weight vector w such that sum(w * x) is zero.
(i) If x is not constant (i.e. its elements are not all the same) then, in mathematical notation, P = I - xx'/(x'x) projects onto the orthogonal complement of x, and P1 = 1 - xx'1/(x'x) (with 1 the vector of ones) is a vector in the range of P, so switching to R code:
w <- 1 - x * sum(x) / sum(x*x)
is such a weight vector. We can verify this:
> sum(w*x)
[1] 2.220446e-16
(ii) If x is constant but not identically zero then choose any non-constant vector s <- seq_along(x), say. Then Ps = s - xx's/(x'x) is orthogonal to x so:
x <- c(1, 1, 1, 1)
s <- seq_along(x)
w <- s - x * sum(s*x) / sum(x*x)
sum(w * x)
giving:
> sum(w * x)
[1] 0
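Cases (i) and (ii) differ only in the starting vector that gets projected: all ones when x is non-constant, and 1, 2, ..., n when x is constant. A sketch that combines them, assuming x has at least two elements and is not all zero; zero_weights is a hypothetical name introduced here:

```r
# zero_weights is a hypothetical helper; it returns w with sum(w * x) ~ 0.
zero_weights <- function(x) {
  stopifnot(length(x) > 1, any(x != 0))
  # Start from all ones, or from 1, 2, ..., n when x is constant
  s <- if (length(unique(x)) > 1) rep(1, length(x)) else seq_along(x)
  # Remove the component of s that lies along x
  s - x * sum(s * x) / sum(x * x)
}

x1 <- c(-5, 6, 2, 4, -3)
x2 <- c(1, 1, 1, 1)
w1 <- zero_weights(x1)
w2 <- zero_weights(x2)

abs(sum(w1 * x1)) < 1e-12   # zero up to floating point error
# [1] TRUE
sum(w2 * x2)
# [1] 0
```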
Elaborating on @jdharrison's comment:
> x
[1] -5 6 2 4 -3
> sum(x)
[1] 4
> mean(x)
[1] 0.8
> x - mean(x)
[1] -5.8 5.2 1.2 3.2 -3.8
> sum(x - mean(x))
[1] 6.661338e-16 #floating point 0
So x - mean(x) will do the trick.
If you want to keep the sign after the rescaling...
x <- c(-5, -3, 0, 2, 4, 6, 50)
rescale_zero <- function(x){
  x1 <- x[x>0]
  x2 <- x[x<0]
  d <- (sum(x1) + sum(x2)) / 2
  w1 <- (sum(x1) - d) / sum(x1)
  w2 <- (sum(x2) - d) / sum(x2)
  y <- x
  y[x>0] <- x1*w1
  y[x<0] <- x2*w2
  y
}
rescale_zero(x)
# [1] -21.875000 -13.125000 0.000000 1.129032 2.258065 3.387097 28.225806
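The function scales the positive entries by one factor and the negative entries by another, so that both groups end up with the same total magnitude, namely (sum of positives - sum of negatives) / 2. A compact restatement of that idea, assuming x contains at least one positive and one negative value (pos, neg and target are names introduced here):

```r
x <- c(-5, -3, 0, 2, 4, 6, 50)

pos <- sum(x[x > 0])         # 62
neg <- sum(x[x < 0])         # -8
target <- (pos - neg) / 2    # 35: magnitude both groups are scaled to

y <- x
y[x > 0] <- x[x > 0] * target / pos        # positives now sum to  target
y[x < 0] <- x[x < 0] * target / abs(neg)   # negatives now sum to -target

all(sign(y) == sign(x))      # signs are preserved
# [1] TRUE
abs(sum(y)) < 1e-9           # sum is zero up to floating point error
# [1] TRUE
```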

Apply a function on subsequent pairs of elements

Suppose x is a vector, and myfunc is a function of two arguments. I wish to get a vector of the results of myfunc on subsequent pairs of elements from x. By definition, that vector should be of length 1 less than x's length.
For example, if x <- 1:4 and
myfunc <- function(a,b) {
  return(log(b/a))
}
Then I would expect
> apply_on_pairs(x, myfunc)
[1] 0.6931472 0.4054651 0.2876821
(which is equivalent to c(myfunc(1,2), myfunc(2,3), myfunc(3,4)))
mapply(myfunc,x[-length(x)],x[-1])
# [1] 0.6931472 0.4054651 0.2876821
mapply(...) "applies" the function in the first argument to the subsequent arguments; in this case we pass x[1:3] and x[2:4] as the second and third arguments to mapply(...).
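Wrapped up under the name used in the question, the mapply call might look like this. This is a sketch: apply_on_pairs is the asker's fictitious function, and using head() and tail() to build the two shifted vectors is a choice made here.

```r
# apply_on_pairs is the fictitious function name from the question.
apply_on_pairs <- function(x, f) {
  # head(x, -1) drops the last element, tail(x, -1) drops the first,
  # so position i pairs x[i] with x[i + 1]
  mapply(f, head(x, -1), tail(x, -1))
}

myfunc <- function(a, b) log(b / a)
res <- apply_on_pairs(1:4, myfunc)
res
# [1] 0.6931472 0.4054651 0.2876821
```

Because mapply calls f once per pair, this works for any two-argument function, at the cost of being slower than a fully vectorized expression.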
library(zoo)
x <- 1:4
rollapply(x, width=2, FUN=function(x) return(log(x[2]/x[1])))
## [1] 0.6931472 0.4054651 0.2876821
In this case you can diff() the log() of your vector.
x <- 1:4
diff(log(x))
Yields:
> diff(log(x))
[1] 0.6931472 0.4054651 0.2876821
Update: A more general solution uses head() and tail() to remove the last and first elements. You want to do your best to stick to vectorized solutions, which should be faster and more memory efficient.
myFun <- function(x) log(tail(x, -1)) - log(head(x, -1))
There's a slight speed edge to diff().
> x <- seq(1e8)
> system.time(A <- diff(log(x)))
user system elapsed
8.42 1.28 9.90
> myFun <- function(x) log(tail(x, -1)) - log(head(x, -1))
> system.time(B <- myFun(x))
user system elapsed
9.29 1.40 10.78
> all.equal(A, B)
[1] TRUE

Resources