bit64 - sum() on a vector of NA produces odd results - r

When using the bit64 package, summing a vector of NAs together with another vector of integers yields an inaccurate result. Depending on whether the NA vector is passed first or last, the result is either twice the correct answer or 0, respectively.
Notice that converting the NA vector away from integer64 will remove the issue.
However, when experimenting with other small values in place of y, the results were awfully strange.
For example:
40 + 35 = 75 but
35 + 40 = 80
Any thoughts as to what is going on?
EXAMPLE:
library(bit64)
x <- as.integer64(c(20, 20))
y <- as.integer64(c(NA, NA))
sum(y, x, na.rm=TRUE)
# integer64
# [1] 80 # <~~~ Twice the correct value
sum(x, y, na.rm=TRUE)
# integer64
# [1] 0 # <~~~~ Incorrect 0. Should be 40.
## Removing the NAs does not help.
y <- y[!is.na(y)]
## A vector of 0's gives the same issue
y <- as.integer64(c(0, 0))
## Same results
sum(y, x, na.rm=TRUE)
# integer64
# [1] 80
sum(x, y, na.rm=TRUE)
# integer64
# [1] 0
## Converting to numeric does away with the issue (but is not a viable workaround, for obvious reasons)
y <- as.numeric(y)
sum(y, x, na.rm=TRUE)
# [1] 1.97626e-322
sum.integer64(y, x, na.rm=TRUE)
# integer64
# [1] 40
sum(x, y, na.rm=TRUE)
# integer64
# [1] 40
Give y a single value, and the results are also very out of place
y <- as.integer64(c(35, NA, NA))
sum.integer64(x, if (!all(is.na(y))) removeNA(y), na.rm=TRUE)
sum.integer64(x, y[[1]], na.rm=TRUE)
sum.integer64(y[[1]], x, na.rm=TRUE)
## No NA's present
sum.integer64(as.integer64(35), x)
# integer64
# [1] 80
sum.integer64(x, as.integer64(35))
# integer64
# [1] 70

Not an answer, but an exploration. Hope it might help you.
From the sum.integer64 function of the bit64 package:
function (..., na.rm = FALSE)
{
l <- list(...)
ret <- double(1)
if (length(l) == 1) {
.Call("sum_integer64", l[[1]], na.rm, ret)
oldClass(ret) <- "integer64"
ret
}
else {
ret <- sapply(l, function(e) {
if (is.integer64(e)) {
.Call("sum_integer64", e, na.rm, ret)
ret
}
else {
as.integer64(sum(e, na.rm = na.rm))
}
})
oldClass(ret) <- "integer64"
sum(ret, na.rm = na.rm)
}
}
Here is your example:
library(bit64)
x <- as.integer64(c(20, 20))
y <- as.integer64(c(NA, NA))
na.rm <- TRUE
l <- list(y, x)
ret <- double(1)
ret
#[1] 0
# We use the sapply function as in the function:
ret <- sapply(l, function(e) { .Call("sum_integer64", e, na.rm, ret) })
oldClass(ret) <- "integer64"
ret
#integer64
#[1] 40 40 <-- twice the value "40"
sum(ret, na.rm = na.rm)
# integer64
#[1] 80 <-- twice the expected value, as you said
Here we decompose the calculation, for each vector:
ret <- double(1)
ret2 <- NULL
ret2[1] <- .Call("sum_integer64", y, na.rm, ret)
ret2[2] <- .Call("sum_integer64", x, na.rm, ret)
oldClass(ret2) <- "integer64"
ret2
#integer64
#[1] 0 40 <-- only once the value "40", and "0" because of NAs
sum(ret2, na.rm = na.rm)
#integer64
#[1] 40 <- expected value
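Going by the source quoted above, only the multi-argument branch goes through sapply, so one possible workaround is to concatenate the vectors first and let the length(l) == 1 branch do the work. This is a minimal sketch under that assumption, not a confirmed fix. The requireNamespace guard and the names combined and total are illustrative additions.

```r
# Sketch: combine the integer64 vectors before summing, so that
# sum() receives a single argument.
# The guard only makes the snippet safe to run where bit64 is absent.
if (requireNamespace("bit64", quietly = TRUE)) {
  library(bit64)

  x <- as.integer64(c(20, 20))
  y <- as.integer64(c(NA, NA))

  # c() on integer64 vectors keeps the integer64 class
  combined <- c(x, y)

  # Single-argument call: the sapply branch is never reached
  total <- sum(combined, na.rm = TRUE)
  print(total)
}
```

The order of x and y inside c() should not matter here, since the NAs are dropped within one pass over a single vector.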

Related

Subsetting with negative indices: best practices?

Say I have a function for subsetting (this is just a minimal example):
f <- function(x, ind = seq(length(x))) {
x[ind]
}
(Note: one could use only seq(x) instead of seq(length(x)), but I don't find it very clear.)
So, if
x <- 1:5
ind <- c(2, 4)
ind2 <- which(x > 5) # integer(0)
I have the following results:
f(x)
[1] 1 2 3 4 5
f(x, ind)
[1] 2 4
f(x, -ind)
[1] 1 3 5
f(x, ind2)
integer(0)
f(x, -ind2)
integer(0)
For the last result, we would have wanted to get all of x, but this is a common cause of error (as mentioned in the book Advanced R).
So, if I want to make a function for removing indices, I use:
f2 <- function(x, ind.rm) {
f(x, ind = `if`(length(ind.rm) > 0, -ind.rm, seq(length(x))))
}
Then I get what I wanted:
f2(x, ind)
[1] 1 3 5
f2(x, ind2)
[1] 1 2 3 4 5
My question is:
Can I do something cleaner and that doesn't need passing seq(length(x)) explicitly in f2 but using directly the default value of f's parameter ind when ind.rm is integer(0)?
If you anticipate having "empty" negative indices a lot, you can get a performance improvement for these cases if you can avoid the indexing used by x[seq(x)] as opposed to just x. In other words, if you are able to combine f and f2 into something like:
new_f <- function(x, ind.rm){
if(length(ind.rm)) x[-ind.rm] else x
}
There will be a huge speedup in the case of empty negative indices.
n <- 1000000L
x <- 1:n
ind <- seq(0L,n,2L)
ind2 <- which(x>n+1) # integer(0)
library(microbenchmark)
microbenchmark(
f2(x, ind),
new_f(x, ind),
f2(x, ind2),
new_f(x, ind2)
)
all.equal(f2(x, ind), new_f(x, ind)) # TRUE - same result at about same speed
all.equal(f2(x, ind2), new_f(x, ind2)) # TRUE - same result at much faster speed
Unit: nanoseconds
expr min lq mean median uq max neval
f2(x, ind) 6223596 7377396.5 11039152.47 9317005 10271521 50434514 100
new_f(x, ind) 6190239 7398993.0 11129271.17 9239386 10202882 59717093 100
f2(x, ind2) 6823589 7992571.5 11267034.52 9217149 10568524 63417978 100
new_f(x, ind2) 428 1283.5 5414.74 6843 7271 14969 100
What you have isn't bad, but if you want to avoid passing the default value of a default argument you could restructure like this:
f2 <- function(x, ind.rm) {
`if`(length(ind.rm) > 0, f(x,-ind.rm), f(x))
}
which is slightly shorter than what you have.
On Edit
Based on the comments, it seems you want to be able to pass a function nothing (rather than simply not pass at all), so that it uses the default value. You can do so by writing a function which is set up to receive nothing, also known as NULL. You can rewrite your f as:
f <- function(x, ind = NULL) {
if(is.null(ind)){ind <- seq(length(x))}
x[ind]
}
NULL functions as a flag which tells the receiving function to use a default value for the parameter, although that default value must be set in the body of the function.
Now f2 can be rewritten as
f2 <- function(x, ind.rm) {
f(x, ind = `if`(length(ind.rm) > 0, -ind.rm, NULL))
}
This is slightly more readable than what you have, but at the cost of making the original function slightly longer.
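Another way to look at the problem: -integer(0) is still integer(0), so x[-ind] selects nothing when ind is empty. A minimal sketch that avoids negative indices altogether keeps the positions that are not in ind.rm by using setdiff. The helper name drop_idx is hypothetical and not part of the original answers.

```r
# Sketch: remove positions by computing the positions to keep,
# which behaves sensibly for an empty ind.rm.
drop_idx <- function(x, ind.rm) {
  keep <- setdiff(seq_along(x), ind.rm)
  x[keep]
}

x <- 1:5
ind <- c(2, 4)
ind2 <- which(x > 5)  # integer(0)

identical(-ind2, integer(0))  # negating an empty index leaves it empty
drop_idx(x, ind)
drop_idx(x, ind2)
```

The trade-off is that setdiff builds the full index vector every time, so for the empty case this is slower than a version that returns x untouched.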
To implement "parameter1 = if(cond1) then value1 else default_value_of_param1", I used formals to get default parameters as a call:
f <- function(x, ind.row = seq_len(nrow(x)), ind.col = seq_len(ncol(x))) {
x[ind.row, ind.col]
}
f2 <- function(x, ind.row.rm = integer(0), ind.col.rm = integer(0)) {
f.args <- formals(f)
f(x,
ind.row = `if`(length(ind.row.rm) > 0, -ind.row.rm, eval(f.args$ind.row)),
ind.col = `if`(length(ind.col.rm) > 0, -ind.col.rm, eval(f.args$ind.col)))
}
Then:
> x <- matrix(1:6, 2)
> f2(x, 1:2)
[,1] [,2] [,3]
> f2(x, , 1:2)
[1] 5 6
> f2(x, 1, 2)
[1] 2 6
> f2(x, , 1)
[,1] [,2]
[1,] 3 5
[2,] 4 6
> f2(x, 1, )
[1] 2 4 6
> f2(x)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
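An alternative to evaluating formals by hand is to build the argument list conditionally and call through do.call, so that any argument left out simply falls back to the callee's own default. This is a sketch of that idea. The names subset_mat and drop_rows_cols are hypothetical stand-ins for f and f2.

```r
# Sketch: pass only the arguments that are actually needed;
# omitted ones use the defaults declared by subset_mat itself.
subset_mat <- function(x, ind.row = seq_len(nrow(x)), ind.col = seq_len(ncol(x))) {
  x[ind.row, ind.col]
}

drop_rows_cols <- function(x, ind.row.rm = integer(0), ind.col.rm = integer(0)) {
  args <- list(x = x)
  if (length(ind.row.rm) > 0) args$ind.row <- -ind.row.rm
  if (length(ind.col.rm) > 0) args$ind.col <- -ind.col.rm
  do.call(subset_mat, args)
}

m <- matrix(1:6, 2)
drop_rows_cols(m, 1, 2)  # drop row 1 and column 2
drop_rows_cols(m)        # nothing removed
```

Because the defaults are evaluated inside subset_mat, this does not depend on the wrapper happening to have a variable called x in scope.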

R: Extracting the index of the final / last minimum value in a vector

I have a vector
y <- c(10:1, c(0.1,0.1,0.1))
if I do which.min(y), it returns the index 11.
However, I want it to return the last / final index, where the minimum has been seen.
I have a clumsy way of doing this, which is:
rev(which(y == min(y)))[1]
But is there a better way to do this?
I would just create my own utility function
which.min2 <- function(x, last.index = FALSE, ...){
if(last.index) max(which(x == min(x, ...))) else which.min(x)
}
which.min2(y, TRUE)
## [1] 13
which.min2(y)
## [1] 11
This will also work if you have NAs in your vector
y <- c(10:1, c(0.1,0.1,NA))
which.min2(y, TRUE, na.rm = TRUE)
## [1] 12
which.min2(y, na.rm = TRUE)
## [1] 11
You can use the max.col function, which has a richer way of handling ties, but it requires a matrix and I did not find the corresponding min function:
> y <- c(10:1, c(0.1,0.1,0.1))
> max.col(t(as.matrix(-y)), ties.method='last')
[1] 13
>
So depending on how your real y vector is built, you can gain from this method.
tail(order(y, decreasing=TRUE),1)
#[1] 13
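Yet another option, sketched here as an illustration rather than taken from the answers above, is to run which.min on the reversed vector and map the position back. The function name which_min_last is hypothetical.

```r
# Sketch: the first minimum of rev(x) is the last minimum of x.
# which.min skips NA values, so no na.rm argument is needed.
which_min_last <- function(x) {
  length(x) + 1L - which.min(rev(x))
}

y <- c(10:1, c(0.1, 0.1, 0.1))
which_min_last(y)

y_na <- c(10:1, c(0.1, 0.1, NA))
which_min_last(y_na)
```

This avoids the exact-equality comparison x == min(x) and makes a single reversed copy of the vector.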

Reproducing the result from Map() with mapply()

Take the following data frame and vector,
df <- data.frame(x = 1:3, y = 4:6, z = 7:9)
v <- c(5, 10, 15)
Assume I want to multiply df columnwise by the elements of v, meaning df[1] * v[1], df[2] * v[2], and df[3] * v[3]
I can do this with Map
> Map(`*`, df, v)
$x
[1] 5 10 15
$y
[1] 40 50 60
$z
[1] 105 120 135
Now, since Map is defined as
> Map
function (f, ...)
{
f <- match.fun(f)
mapply(FUN = f, ..., SIMPLIFY = FALSE)
}
<bytecode: 0x3950e00>
<environment: namespace:base>
it seems logical that I should be able to reproduce the above exactly with the following call to mapply, but this is not the case.
> mapply(`*`, df, v, simplify = FALSE)
# Error in .Primitive("*")(dots[[1L]][[1L]], dots[[2L]][[1L]],
# simplify = dots[[3L]][[1L]]) : operator needs one or two arguments
The problem seems to be within the arguments of "*", and those arguments are
> args("*")
function (e1, e2)
NULL
So two more tries yield similar errors.
> mapply(`*`, e1 = df, e2 = v, simplify = FALSE)
# Error in .Primitive("*")(e1 = dots[[1L]][[1L]], e2 = dots[[2L]][[1L]], :
# operator needs one or two arguments
> mapply(`*`, ..1 = df, ..2 = v, simplify = FALSE)
# Error in .Primitive("*")(..1 = dots[[1L]][[1L]], ..2 = dots[[2L]][[1L]], :
# operator needs one or two arguments
What is the issue here? And how can I reproduce (exactly) the result from
Map(`*`, df, v)
with mapply?
Notice that Map calls
mapply(FUN = f, ..., SIMPLIFY = FALSE)
not
mapply(FUN = f, ..., simplify = FALSE)
and of course R is case sensitive. Try
mapply(`*`, df, v, SIMPLIFY = FALSE)
# $x
# [1] 5 10 15
#
# $y
# [1] 40 50 60
#
# $z
# [1] 105 120 135
instead. With simplify = FALSE, it's trying to call
`*`(df[[1]], v[1], simplify = FALSE)
which is what is giving that error.
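The effect can be checked directly. In this small sketch the helper show_extra is hypothetical and only reports the names of any extra arguments it receives, which makes it visible that a misspelled simplify is forwarded to FUN through the dots rather than being consumed by mapply.

```r
df <- data.frame(x = 1:3, y = 4:6, z = 7:9)
v <- c(5, 10, 15)

# With the correctly capitalised argument, mapply reproduces Map
same <- identical(Map(`*`, df, v), mapply(`*`, df, v, SIMPLIFY = FALSE))
same

# Hypothetical helper: returns the names of arguments captured by ...
show_extra <- function(a, b, ...) names(list(...))

# lower-case simplify is not an argument of mapply, so it reaches show_extra
leaked <- mapply(show_extra, 1:2, 3:4, simplify = FALSE)
leaked
```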

match.fun gives an error with functions defined inside functions

I get an error when I try to apply match.fun to functions defined within other functions.
x <- matrix(rnorm(10*100), nrow=100) # data sample
descStats <- function(x, stats = c("n", "min", "max", "srange", "mean", "median", "sd")) {
n <- function(x, ...) sum(!is.na(x), ...)
srange <- function(x, ...) max(x, ...) - min(x, ...)
fun <- function(x) {
result <- vapply(stats, function(z) match.fun(z)(x, na.rm=TRUE), FUN.VALUE=numeric(1))
}
if (is.vector(x)) {
result <- fun(x)
}
if (is.matrix(x) || is.data.frame(x)) {
result <- t(apply(x, 2, fun))
}
return(result)
}
descStats(x)
## Error in get(as.character(FUN), mode = "function", envir = envir) :
## object 'n' of mode 'function' was not found
If I define n and srange outside of descStats function it works fine.
n <- function(x, ...) sum(!is.na(x), ...)
srange <- function(x, ...) max(x, ...) - min(x, ...)
descStats2 <- function(x, stats = c("n", "min", "max", "srange", "mean", "median", "sd")) {
fun <- function(x) {
result <- vapply(stats, function(z) match.fun(z)(x, na.rm=TRUE), FUN.VALUE=numeric(1))
}
if (is.vector(x)) {
result <- fun(x)
}
if (is.matrix(x) || is.data.frame(x)) {
result <- t(apply(x, 2, fun))
}
return(result)
}
descStats2(x)
## n min max srange mean median sd
## [1,] 100 -2.303839 2.629366 4.933205 0.03711611 0.14566523 1.0367947
## [2,] 100 -1.968923 2.169382 4.138305 -0.03917503 0.02239458 0.9048509
## [3,] 100 -2.365891 2.424077 4.789968 -0.08012138 -0.23515910 1.0438133
## [4,] 100 -2.740045 2.127787 4.867832 0.03978241 0.15363449 0.9778891
## [5,] 100 -1.598295 2.603525 4.201820 0.23796616 0.16376239 1.0428915
## [6,] 100 -1.550385 1.684155 3.234540 -0.11114479 -0.09264598 0.8260126
## [7,] 100 -2.438641 3.268796 5.707438 0.02948100 -0.05594740 1.0481331
## [8,] 100 -1.716407 2.795340 4.511747 0.22463606 0.16296613 0.9555129
## [9,] 100 -2.359165 1.975993 4.335158 -0.33321888 -0.17580933 0.9784788
## [10,] 100 -2.139267 2.838986 4.978253 0.15540182 0.07803265 1.0149671
Another way is to use eval(call(FUN, args)). For instance:
descStats3 <- function(x, stats = c("n", "min", "max", "srange", "mean", "median", "sd")) {
n <- function(x, ...) sum(!is.na(x), ...)
srange <- function(x, ...) max(x, ...) - min(x, ...)
fun <- function(x) {
result <- vapply(stats, function(z) eval(call(z, x, na.rm=TRUE)), FUN.VALUE=numeric(1))
}
if (is.vector(x)) {
result <- fun(x)
}
if (is.matrix(x) || is.data.frame(x)) {
result <- t(apply(x, 2, fun))
}
return(result)
}
descStats3(x)
## n min max srange mean median sd
## [1,] 100 -2.303839 2.629366 4.933205 0.03711611 0.14566523 1.0367947
## [2,] 100 -1.968923 2.169382 4.138305 -0.03917503 0.02239458 0.9048509
## [3,] 100 -2.365891 2.424077 4.789968 -0.08012138 -0.23515910 1.0438133
## [4,] 100 -2.740045 2.127787 4.867832 0.03978241 0.15363449 0.9778891
## [5,] 100 -1.598295 2.603525 4.201820 0.23796616 0.16376239 1.0428915
## [6,] 100 -1.550385 1.684155 3.234540 -0.11114479 -0.09264598 0.8260126
## [7,] 100 -2.438641 3.268796 5.707438 0.02948100 -0.05594740 1.0481331
## [8,] 100 -1.716407 2.795340 4.511747 0.22463606 0.16296613 0.9555129
## [9,] 100 -2.359165 1.975993 4.335158 -0.33321888 -0.17580933 0.9784788
## [10,] 100 -2.139267 2.838986 4.978253 0.15540182 0.07803265 1.0149671
identical(descStats2(x), descStats3(x))
## [1] TRUE
Why does descStats not work?
It's relatively easy (and illustrative) to write your own version of match.fun. I've called my function fget to indicate that it's a version of get specifically designed for functions, and hence obeys the regular scoping rules for functions. (If you're not sure what they are, think about this code: c <- 10; c(c, 5))
#' Find a function with specified name.
#'
#' @param name length one character vector giving name
#' @param env environment to start search in.
#' @examples
#' c <- 10
#' fget("c")
fget <- function(name, env = parent.frame()) {
if (identical(env, emptyenv())) {
stop("Could not find function called ", name, call. = FALSE)
}
if (exists(name, env, inherits = FALSE) && is.function(env[[name]])) {
env[[name]]
} else {
fget(name, parent.env(env))
}
}
The implementation is as a straightforward recursive function: the base case is the emptyenv(), the eventual ancestor of every environment, and for each environment along the stack of parents, we check to see that both an object called name exists, and that it is a function.
It works in the simple test case provided by @nograpes because the environment defaults to the calling environment:
fun <- function(x) {
n <- sum
fget('n')(x)
}
fun(10)
# [1] 10
It is a scoping problem. Looking at the code of match.fun gives the answer.
match.fun looks the name up in envir <- parent.frame(2),
whereas get looks it up in envir = as.environment(-1), i.e. parent.frame(1).
I don't think we can pass envir as an argument.
One solution is to use get as presented by @nograpes (unsafe), or to hack match.fun and change
envir <- parent.frame(2) to envir <- parent.frame(1)
For reasons I don't completely understand yet, if you use get instead of match.fun, everything works fine.
x <- matrix(rnorm(10*100), nrow=100) # data sample
descStats <- function(x, stats = c("n", "min", "max", "srange", "mean", "median", "sd")) {
n <- function(x, ...) sum(!is.na(x), ...)
srange <- function(x, ...) max(x, ...) - min(x, ...)
fun <- function(x) {
# get added here.
result <- vapply(stats, function(z) get(z)(x, na.rm=TRUE), FUN.VALUE=numeric(1))
}
if (is.vector(x)) {
result <- fun(x)
}
if (is.matrix(x) || is.data.frame(x)) {
result <- t(apply(x, 2, fun))
}
return(result)
}
descStats(x)
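Building on the get approach above, one way to make the lookup less dependent on where the call happens is to capture the function's own environment and pass it to get explicitly, together with mode = "function". This is a sketch under that assumption. The names desc_stats, here and one_column are hypothetical, and the branching is simplified compared with the original.

```r
# Sketch: resolve each statistic by name, starting from the
# environment in which the local helpers n and srange are defined.
desc_stats <- function(x, stats = c("n", "min", "max", "srange", "mean", "median", "sd")) {
  n <- function(x, ...) sum(!is.na(x), ...)
  srange <- function(x, ...) max(x, ...) - min(x, ...)

  here <- environment()  # frame of this call, where n and srange live

  one_column <- function(col) {
    vapply(
      stats,
      function(z) get(z, envir = here, mode = "function")(col, na.rm = TRUE),
      FUN.VALUE = numeric(1)
    )
  }

  if (is.matrix(x) || is.data.frame(x)) {
    t(apply(x, 2, one_column))
  } else {
    one_column(x)
  }
}

res <- desc_stats(c(1, 2, NA, 4))
res
```

Names that are not defined locally, such as min or sd, are still found because get follows the chain of enclosing environments from here.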

Function argument as a part of the output name

Perhaps a silly question, but I can't find any answers to it anywhere (that I've looked :P ). I am trying to create a function with two arguments, both of which will be vectors (e.g. x = c(a, b, c) and y = c(50, 75, 100)). The function should calculate all the combinations of these and use the argument values as part of the output name. E.g.
function(x,y)
df$output_a_50 = a*2+50^2
df$output_a_75 = a*2+75^2
.....
Any suggestions will be appreciated :)
As @Spacedman and others discussed, your problem is that if you pass c(a, b, c) to your function, the names will be lost. The best alternative, in my opinion, is to pass a list:
foo <- function(x, y) {
df <- list()
for (xx in names(x)) {
for (yy in y) {
varname <- paste("output", xx, yy, sep = "_")
df[[varname]] <- x[[xx]]*2 + yy^2
}
}
df
}
foo(x = list(a = NA, b = 1, c = 2:3),
y = c(50, 75, 100))
# $output_a_50
# [1] NA
#
# $output_a_75
# [1] NA
#
# $output_a_100
# [1] NA
#
# $output_b_50
# [1] 2502
#
# $output_b_75
# [1] 5627
#
# $output_b_100
# [1] 10002
#
# $output_c_50
# [1] 2504 2506
#
# $output_c_75
# [1] 5629 5631
#
# $output_c_100
# [1] 10004 10006
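If every element of x is a single number, a named vector keeps the names as well, and the nested loops can be replaced by expand.grid. This is a sketch under that assumption. The function name combine_named is hypothetical.

```r
# Sketch: build all (name, y) combinations, compute the values in one
# vectorised step, and name each result after its inputs.
combine_named <- function(x, y) {
  grid <- expand.grid(y = y, x_name = names(x), stringsAsFactors = FALSE)
  values <- x[grid$x_name] * 2 + grid$y^2
  setNames(
    as.list(unname(values)),
    paste("output", grid$x_name, grid$y, sep = "_")
  )
}

res <- combine_named(x = c(a = 1, b = 2), y = c(50, 75))
names(res)
res$output_b_75
```

Unlike the list version, this cannot hold elements of different lengths such as c = 2:3, since a named vector stores one value per name.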

Resources