How to count NAs in a mean calculation - R

Very simple question, and I'm sure it's been answered; I'm probably just phrasing it incorrectly. I want to calculate the mean of a vector of numbers while still counting the NA values in the denominator. Here's an example:
dummy <- c(1, 2, NA, 3)
With this I can use mean() with na.rm=TRUE and get a mean of 2, but what I want is 6/4, keeping the NA value as a placeholder in the denominator, which would return 1.5.

How about just swapping the NA values for 0 temporarily?
mean(ifelse(is.na(dummy), 0, dummy))

Try using sum and length
> sum(dummy, na.rm=TRUE)/length(dummy)
[1] 1.5

Since there are a lot of ways to do this, here is another solution:
mean(replace(dummy, is.na(dummy), 0))
[1] 1.5
Just out of curiosity, the most efficient solution seems to be the sum/length approach by Jilber:
bigdummy <- rnorm(1000)
bigdummy[sample(1:length(bigdummy), 100)] <- NA
library(microbenchmark)
mean_length <- function(x) sum(x, na.rm=TRUE)/length(x)
mean_replace <- function(x) mean(replace(x, is.na(x), 0))
mean_ifelse <- function(x) mean(ifelse(is.na(x),0,x))
microbenchmark(mean_length(bigdummy),
               mean_replace(bigdummy),
               mean_ifelse(bigdummy),
               times=1000L)
Unit: microseconds
expr min lq median uq max neval
mean_length(bigdummy) 4.033 4.400 5.499 5.866 109.976 1000
mean_replace(bigdummy) 25.661 27.128 28.594 29.327 198.690 1000
mean_ifelse(bigdummy) 142.602 144.802 145.902 152.500 3405.209 1000
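For completeness (this comparison is my own addition, not part of the original answers): the placeholder approaches deliberately disagree with mean(x, na.rm=TRUE), and they also differ on the all-NA edge case.
x <- c(1, 2, NA, 3)
mean(x, na.rm=TRUE)              # 2   -- NAs dropped from the denominator
sum(x, na.rm=TRUE)/length(x)     # 1.5 -- NAs kept as placeholders
y <- c(NA_real_, NA_real_)
sum(y, na.rm=TRUE)/length(y)     # 0   -- placeholder approach
mean(y, na.rm=TRUE)              # NaN -- na.rm=TRUE drops every value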

Related

R missing date calculation [duplicate]

I have a huge vector which has a couple of NA values, and I'm trying to find the max value in that vector (the vector is all numbers), but I can't do this because of the NA values.
How can I remove the NA values so that I can compute the max?
If you look at ?max, you'll see that it actually has an na.rm argument, set by default to FALSE. (That's the common default for many other R functions, including sum(), mean(), etc.)
Setting na.rm=TRUE does just what you're asking for:
d <- c(1, 100, NA, 10)
max(d, na.rm=TRUE)
If you do want to remove all of the NAs, use this idiom instead:
d <- d[!is.na(d)]
A final note: Other functions (e.g. table(), lm(), and sort()) have NA-related arguments that use different names (and offer different options). So if NA's cause you problems in a function call, it's worth checking for a built-in solution among the function's arguments. I've found there's usually one already there.
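For instance (my own quick illustration of those differently named arguments, assuming base R defaults):
table(c(1, 2, NA), useNA = "ifany")   # table() uses useNA=, not na.rm=
sort(c(3, NA, 1), na.last = TRUE)     # sort() drops NAs by default; na.last= keeps them at the end
# lm() instead takes na.action= (e.g. na.omit or na.exclude)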
The na.omit function is what a lot of the regression routines use internally:
vec <- 1:1000
vec[runif(200, 1, 1000)] <- NA
max(vec)
#[1] NA
max( na.omit(vec) )
#[1] 1000
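A small aside (my addition): na.omit() also records which positions were dropped in an "na.action" attribute, which some modelling functions use later:
na.omit(c(1, NA, 3))
#[1] 1 3
#attr(,"na.action")
#[1] 2
#attr(,"class")
#[1] "omit"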
Use discard from purrr (works with lists and vectors).
discard(v, is.na)
The benefit is that it composes easily with pipes; alternatively, use the built-in subsetting function [:
v %>% discard(is.na)
v %>% `[`(!is.na(.))
Note that na.omit does not remove NA entries from lists:
> x <- list(a=1, b=2, c=NA)
> na.omit(x)
$a
[1] 1
$b
[1] 2
$c
[1] NA
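By contrast (a small sketch of my own, tying this back to the discard suggestion above), purrr::discard does drop the NA entry from that list:
> discard(x, is.na)
$a
[1] 1
$b
[1] 2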
?max shows you that there is an extra parameter na.rm that you can set to TRUE.
Apart from that, if you really want to remove the NAs, just use something like:
myvec[!is.na(myvec)]
Just in case someone new to R wants a simplified answer to the original question
How can I remove NA values from a vector?
Here it is:
Assume you have a vector foo as follows:
foo = c(1:10, NA, 20:30)
running length(foo) gives 22.
nona_foo = foo[!is.na(foo)]
length(nona_foo) is 21, because the NA values have been removed.
Remember that is.na(foo) returns a logical vector, so indexing foo with the negation of that vector gives you all the elements which are not NA.
You can call max(vector, na.rm = TRUE). More generally, you can use the na.omit() function.
I ran a quick benchmark comparing the two base approaches, and it turns out that x[!is.na(x)] is faster than na.omit. User qwr suggested I try purrr::discard as well; this turned out to be massively slower (though I'll happily take comments on my implementation and test!)
microbenchmark::microbenchmark(
  purrr::map(airquality, function(x) {x[!is.na(x)]}),
  purrr::map(airquality, na.omit),
  purrr::map(airquality, ~purrr::discard(.x, .p = is.na)),
  times = 1e6)
Unit: microseconds
expr min lq mean median uq max neval cld
purrr::map(airquality, function(x) { x[!is.na(x)] }) 66.8 75.9 130.5643 86.2 131.80 541125.5 1e+06 a
purrr::map(airquality, na.omit) 95.7 107.4 185.5108 129.3 190.50 534795.5 1e+06 b
purrr::map(airquality, ~purrr::discard(.x, .p = is.na)) 3391.7 3648.6 5615.8965 4079.7 6486.45 1121975.4 1e+06 c
For reference, here's the original test of x[!is.na(x)] vs na.omit:
microbenchmark::microbenchmark(
  purrr::map(airquality, function(x) {x[!is.na(x)]}),
  purrr::map(airquality, na.omit),
  times = 1000000)
Unit: microseconds
expr min lq mean median uq max neval cld
map(airquality, function(x) { x[!is.na(x)] }) 53.0 56.6 86.48231 58.1 64.8 414195.2 1e+06 a
map(airquality, na.omit) 85.3 90.4 134.49964 92.5 104.9 348352.8 1e+06 b
Another option is to use complete.cases, like this:
d <- c(1, 100, NA, 10)
result <- complete.cases(d)
output <- d[result]
output
#> [1] 1 100 10
max(output)
#> [1] 100
Created on 2022-08-26 with reprex v2.0.2
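complete.cases() is really aimed at data frames, where it flags the rows with no missing values in any column; a minimal sketch of my own:
df <- data.frame(x = c(1, NA, 3), y = c(4, 5, NA))
df[complete.cases(df), ]
#>   x y
#> 1 1 4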

R: scan vectors once instead of 4 times?

Suppose I have two equal length logical vectors.
Computing the confusion matrix the easy way:
c(sum(actual == 1 & predicted == 1),
  sum(actual == 0 & predicted == 1),
  sum(actual == 1 & predicted == 0),
  sum(actual == 0 & predicted == 0))
requires scanning the vectors 4 times.
Is it possible to do that in a single pass?
PS. I tried table(2*actual+predicted) and table(actual,predicted) but both are obviously much slower.
PPS. Speed is not my main consideration here, I am more interested in understanding the language.
You could try using data.table
library(data.table)
DT <- data.table(actual, predicted)
setkey(DT, actual, predicted)[,.N, .(actual, predicted)]$N
Data
set.seed(24)
actual <- sample(0:1, 10 , replace=TRUE)
predicted <- sample(0:1, 10, replace=TRUE)
Benchmarks
Using data.table_1.9.5 and dplyr_0.4.0
library(microbenchmark)
set.seed(245)
actual <- sample(0:1, 1e6 , replace=TRUE)
predicted <- sample(0:1, 1e6, replace=TRUE)
f1 <- function() {
  DT <- data.table(actual, predicted)
  setkey(DT, actual, predicted)[, .N, .(actual, predicted)]$N
}
f2 <- function() {
  table(actual, predicted)
}
f3 <- function() {
  data_frame(actual, predicted) %>%
    group_by(actual, predicted) %>%
    summarise(n())
}
microbenchmark(f1(), f2(), f3(), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval cld
#f1() 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 20 a
#f2() 20.818410 22.378995 22.321816 22.56931 22.140855 22.984667 20 b
#f3() 1.262047 1.248396 1.436559 1.21237 1.220109 2.504662 20 a
Including count from dplyr and tabulate in the benchmarks as well, on a slightly bigger dataset:
set.seed(498)
actual <- sample(0:1, 1e7 , replace=TRUE)
predicted <- sample(0:1, 1e7, replace=TRUE)
f4 <- function() {
  data_frame(actual, predicted) %>%
    count(actual, predicted)
}
f5 <- function(){tabulate(4-actual-2*predicted, 4)}
Update
Including another data.table solution (provided by @Arun) in the benchmarks as well:
f6 <- function() {setDT(list(actual, predicted))[,.N, keyby=.(V1,V2)]$N}
microbenchmark(f1(), f3(), f4(), f5(), f6(), unit='relative', times=20L)
#Unit: relative
#expr min lq mean median uq max neval cld
#f1() 2.003088 1.974501 2.020091 2.015193 2.080961 1.924808 20 c
#f3() 2.488526 2.486019 2.450749 2.464082 2.481432 2.141309 20 d
#f4() 2.388386 2.423604 2.430581 2.459973 2.531792 2.191576 20 d
#f5() 1.034442 1.125585 1.192534 1.217337 1.239453 1.294920 20 b
#f6() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20 a
Like this:
tabulate(4 - actual - 2*predicted, 4)
(tabulate here is much faster than table because it knows the output will be a vector of length 4).
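To see why the expression works (my own walk-through, not part of the original answer), the four actual/predicted combinations map to bins 1 to 4 in exactly the order the question builds them:
# actual=1, predicted=1  ->  4 - 1 - 2 = 1
# actual=0, predicted=1  ->  4 - 0 - 2 = 2
# actual=1, predicted=0  ->  4 - 1 - 0 = 3
# actual=0, predicted=0  ->  4 - 0 - 0 = 4
tabulate(4 - actual - 2*predicted, 4)   # counts (1,1), (0,1), (1,0), (0,0) in one pass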
There is table which computes a cross tabulation and should give similar results if actual and predicted contain only zeros and ones:
table(actual, predicted)
Internally, this works by pasting the vectors together, which is horribly inefficient. The coercion to character also seems to happen when tabulating a single vector, and this is probably the reason for the poor performance of table(actual*2 + predicted) as well.

identify and remove single valued columns from table in R

I have a reasonably large dataset (~250k rows and 400 cols, about 0.5 GB) where a number of columns are single-valued (i.e. they only have one value). To remove these columns from the dataset I use data[, apply(data, 2, function(x) length(unique(x)) != 1)], which works fine. I was wondering if there might be a more efficient way of doing this? On my PC this takes:
> system.time(apply(data, 2, function(x) length(unique(x))))
# user system elapsed
# 34.37 0.71 35.15
Which isn't so bad for one dataset, but I'd like to repeat this multiple times on different datasets.
You can use lapply instead:
data[, unlist(lapply(data, function(x) length(unique(x)) > 1L))]
Note that I added unlist to convert the resulting list to a vector of TRUE / FALSE values which will be used for the subsetting.
Edit: here's a little benchmark:
library(microbenchmark)
a <- runif(1e4)
b <- 99
c <- sample(LETTERS, 1e4, TRUE)
df <- data.frame(a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c,a,b,c)
microbenchmark(
  apply = {df[, apply(df, 2, function(x) length(unique(x)) != 1)]},
  lapply = {df[, unlist(lapply(df, function(x) length(unique(x)) > 1L))]},
  unit = "relative",
  times = 100)
#Unit: relative
# expr min lq median uq max neval
#apply 41.29383 40.06719 39.72256 39.16569 28.54078 100
#lapply 1.00000 1.00000 1.00000 1.00000 1.00000 100
Note that apply will first convert the data.frame to matrix and then perform the operation, which is less efficient. So in most cases where you're working with data.frames you can (and should) avoid using apply and use e.g. lapply instead.
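A quick way to see that coercion (my own example): with a mixed-type data.frame, apply() first builds a character matrix, so every column looks like character inside the function, while lapply()/sapply() keep the original column types.
mixed <- data.frame(num = 1:3, chr = c("a", "b", "c"))
apply(mixed, 2, class)    # "character" "character"
sapply(mixed, class)      # "integer"   "character"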
You may also try:
set.seed(40)
df <- as.data.frame(matrix(sample(letters[1:3], 3*10,replace=TRUE), ncol=10))
Filter(function(x) (length(unique(x))>1), df)
Or compare each row with the previous one; a column is single-valued exactly when every pair of consecutive entries is equal:
df[, colSums(df[-1, ] == df[-nrow(df), ]) != (nrow(df) - 1)]  # still better than `apply`
Including these in the speed comparison as well (using @beginneR's sample data):
microbenchmark(
  new = {Filter(function(x) (length(unique(x)) > 1), df)},
  new1 = {df[, colSums(df[-1, ] == df[-nrow(df), ]) != (nrow(df) - 1)]},
  apply = {df[, apply(df, 2, function(x) length(unique(x)) != 1)]},
  lapply = {df[, unlist(lapply(df, function(x) length(unique(x)) > 1L))]},
  unit = "relative",
  times = 100)
# Unit: relative
# expr min lq median uq max neval
# new 1.0000000 1.0000000 1.000000 1.0000000 1.000000 100
# new1 4.3741503 4.5144133 4.063634 3.9591345 1.713178 100
# apply 23.9635826 24.0895813 21.361140 20.7650416 5.757233 100
#lapply 0.9991514 0.9979483 1.002005 0.9958308 1.002603 100

Count number of distinct values in a vector

I have a vector of scalar values, and I'm trying to find out how many different values it contains.
For instance, in group <- c(1,2,3,1,2,3,4,6) the unique values are 1, 2, 3, 4, 6, so I want to get 5.
I came up with:
length(unique(group))
But I'm not sure it's the most efficient way to do it. Isn't there a better way to do this?
Note: My case is more complex than the example, consisting of around 1000 numbers with at most 25 different values.
Here are a few ideas, all pointing towards your solution already being very fast; length(unique(x)) is what I would have used as well:
x <- sample.int(25, 1000, TRUE)
library(microbenchmark)
microbenchmark(length(unique(x)),
               nlevels(factor(x)),
               length(table(x)),
               sum(!duplicated(x)))
# Unit: microseconds
# expr min lq median uq max neval
# length(unique(x)) 24.810 25.9005 27.1350 28.8605 48.854 100
# nlevels(factor(x)) 367.646 371.6185 380.2025 411.8625 1347.343 100
# length(table(x)) 505.035 511.3080 530.9490 575.0880 1685.454 100
# sum(!duplicated(x)) 24.030 25.7955 27.4275 30.0295 70.446 100
You can use rle from the base package:
x<-c(1,2,3,1,2,3,4,6)
length(rle(sort(x))$values)
rle produces a list with two components, lengths and values. The length of the values vector gives you the number of unique values.
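For the example vector, that looks like this (output added by me for illustration):
rle(sort(x))
#Run Length Encoding
#  lengths: int [1:5] 2 2 2 1 1
#  values : num [1:5] 1 2 3 4 6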
I have used this function
length(unique(array))
and it works fine, and doesn't require external libraries.
The uniqueN function from data.table is equivalent to length(unique(group)). It is also several times faster on larger datasets, though not by much on an example of your size:
library(data.table)
library(microbenchmark)
xSmall <- sample.int(25, 1000, TRUE)
xBig <- sample.int(2500, 100000, TRUE)
microbenchmark(length(unique(xSmall)), uniqueN(xSmall),
               length(unique(xBig)), uniqueN(xBig))
#Unit: microseconds
# expr min lq mean median uq max neval cld
#1 length(unique(xSmall)) 17.742 24.1200 34.15156 29.3520 41.1435 104.789 100 a
#2 uniqueN(xSmall) 12.359 16.1985 27.09922 19.5870 29.1455 97.103 100 a
#3 length(unique(xBig)) 1611.127 1790.3065 2024.14570 1873.7450 2096.5360 3702.082 100 c
#4 uniqueN(xBig) 790.576 854.2180 941.90352 896.1205 974.6425 1714.020 100 b
We can use n_distinct from dplyr
dplyr::n_distinct(group)
#[1] 5
If one wants to get the number of unique elements in a matrix, data frame, or list, the following code would do:
if (typeof(Y) == "list") {            # Y is a list or data frame
  # flatten it first, then count the unique non-NA values
  numUniqueElems <- length(na.exclude(unique(unlist(Y))))
} else if (is.null(dim(Y))) {         # Y is a vector
  numUniqueElems <- length(na.exclude(unique(Y)))
} else {                              # length(dim(Y)) == 2, Y is a matrix
  numUniqueElems <- length(na.exclude(unique(c(Y))))
}

How to access the last value in a vector?

Suppose I have a vector that is nested in a data frame with one or two levels. Is there a quick and dirty way to access the last value, without using the length() function? Something à la Perl's $# special variable?
So I would like something like:
dat$vec1$vec2[$#]
instead of:
dat$vec1$vec2[length(dat$vec1$vec2)]
I use the tail function:
tail(vector, n=1)
The nice thing with tail is that it works on dataframes too, unlike the x[length(x)] idiom.
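For example (a quick illustration of my own): on a data frame, length() counts columns, so the x[length(x)] idiom returns the last column, while tail() returns the last row.
tail(mtcars, n=1)       # last row of the data frame
mtcars[length(mtcars)]  # last *column* (a one-column data frame)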
To answer this not from an aesthetic but from a performance-oriented point of view, I've put all of the above suggestions through a benchmark. To be precise, I've considered the suggestions:
x[length(x)]
mylast(x), where mylast is a C++ function implemented through Rcpp,
tail(x, n=1)
dplyr::last(x)
x[end(x)[1]]
rev(x)[1]
and applied them to random vectors of various sizes (10^3, 10^4, 10^5, 10^6, and 10^7). Before we look at the numbers, I think it should be clear that anything that becomes noticeably slower with greater input size (i.e., anything that is not O(1)) is not an option. Here's the code that I used:
Rcpp::cppFunction('double mylast(NumericVector x) { int n = x.size(); return x[n-1]; }')
options(width=100)
for (n in c(1e3, 1e4, 1e5, 1e6, 1e7)) {
  x <- runif(n)
  print(microbenchmark::microbenchmark(x[length(x)],
                                       mylast(x),
                                       tail(x, n=1),
                                       dplyr::last(x),
                                       x[end(x)[1]],
                                       rev(x)[1]))
}
It gives me
Unit: nanoseconds
expr min lq mean median uq max neval
x[length(x)] 171 291.5 388.91 337.5 390.0 3233 100
mylast(x) 1291 1832.0 2329.11 2063.0 2276.0 19053 100
tail(x, n = 1) 7718 9589.5 11236.27 10683.0 12149.0 32711 100
dplyr::last(x) 16341 19049.5 22080.23 21673.0 23485.5 70047 100
x[end(x)[1]] 7688 10434.0 13288.05 11889.5 13166.5 78536 100
rev(x)[1] 7829 8951.5 10995.59 9883.0 10890.0 45763 100
Unit: nanoseconds
expr min lq mean median uq max neval
x[length(x)] 204 323.0 475.76 386.5 459.5 6029 100
mylast(x) 1469 2102.5 2708.50 2462.0 2995.0 9723 100
tail(x, n = 1) 7671 9504.5 12470.82 10986.5 12748.0 62320 100
dplyr::last(x) 15703 19933.5 26352.66 22469.5 25356.5 126314 100
x[end(x)[1]] 13766 18800.5 27137.17 21677.5 26207.5 95982 100
rev(x)[1] 52785 58624.0 78640.93 60213.0 72778.0 851113 100
Unit: nanoseconds
expr min lq mean median uq max neval
x[length(x)] 214 346.0 583.40 529.5 720.0 1512 100
mylast(x) 1393 2126.0 4872.60 4905.5 7338.0 9806 100
tail(x, n = 1) 8343 10384.0 19558.05 18121.0 25417.0 69608 100
dplyr::last(x) 16065 22960.0 36671.13 37212.0 48071.5 75946 100
x[end(x)[1]] 360176 404965.5 432528.84 424798.0 450996.0 710501 100
rev(x)[1] 1060547 1140149.0 1189297.38 1180997.5 1225849.0 1383479 100
Unit: nanoseconds
expr min lq mean median uq max neval
x[length(x)] 327 584.0 1150.75 996.5 1652.5 3974 100
mylast(x) 2060 3128.5 7541.51 8899.0 9958.0 16175 100
tail(x, n = 1) 10484 16936.0 30250.11 34030.0 39355.0 52689 100
dplyr::last(x) 19133 47444.5 55280.09 61205.5 66312.5 105851 100
x[end(x)[1]] 1110956 2298408.0 3670360.45 2334753.0 4475915.0 19235341 100
rev(x)[1] 6536063 7969103.0 11004418.46 9973664.5 12340089.5 28447454 100
Unit: nanoseconds
expr min lq mean median uq max neval
x[length(x)] 327 722.0 1644.16 1133.5 2055.5 13724 100
mylast(x) 1962 3727.5 9578.21 9951.5 12887.5 41773 100
tail(x, n = 1) 9829 21038.0 36623.67 43710.0 48883.0 66289 100
dplyr::last(x) 21832 35269.0 60523.40 63726.0 75539.5 200064 100
x[end(x)[1]] 21008128 23004594.5 37356132.43 30006737.0 47839917.0 105430564 100
rev(x)[1] 74317382 92985054.0 108618154.55 102328667.5 112443834.0 187925942 100
This immediately rules out anything involving rev or end since they're clearly not O(1) (and the resulting expressions are evaluated in a non-lazy fashion). tail and dplyr::last are not far from being O(1) but they're also considerably slower than mylast(x) and x[length(x)]. Since mylast(x) is slower than x[length(x)] and provides no benefits (rather, it's custom and does not handle an empty vector gracefully), I think the answer is clear: Please use x[length(x)].
If you're looking for something as nice as Python's x[-1] notation, I think you're out of luck. The standard idiom is
x[length(x)]
but it's easy enough to write a function to do this:
last <- function(x) { return( x[length(x)] ) }
This missing feature in R annoys me too!
Combining lindelof's and Gregg Lind's ideas:
last <- function(x) { tail(x, n = 1) }
Working at the prompt, I usually omit the n=, i.e. tail(x, 1).
Unlike last from the pastecs package, head and tail (from utils) work not only on vectors but also on data frames etc., and they can also return the data "without the first/last n elements", e.g.
but.last <- function(x) { head(x, n = -1) }
(Note that you have to use head for this, instead of tail.)
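The sign convention, in case it isn't obvious (my quick illustration): a negative n drops elements from the opposite end, so head() with -1 drops the last element while tail() with -1 drops the first.
head(1:5, n = -1)   # 1 2 3 4
tail(1:5, n = -1)   # 2 3 4 5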
The dplyr package includes a function last():
last(mtcars$mpg)
# [1] 21.4
I just benchmarked these two approaches on a data frame with 663,552 rows using the following code:
system.time(
  resultsByLevel$subject <- sapply(resultsByLevel$variable, function(x) {
    s <- strsplit(x, ".", fixed=TRUE)[[1]]
    s[length(s)]
  })
)
user system elapsed
3.722 0.000 3.594
and
system.time(
  resultsByLevel$subject <- sapply(resultsByLevel$variable, function(x) {
    s <- strsplit(x, ".", fixed=TRUE)[[1]]
    tail(s, n=1)
  })
)
user system elapsed
28.174 0.000 27.662
So, assuming you're working with vectors, accessing the length position is significantly faster.
Another way is to take the first element of the reversed vector:
rev(dat$vect1$vec2)[1]
I have another method for finding the last element in a vector.
Say the vector is a.
> a<-c(1:100,555)
> end(a) #Gives indices of last and first positions
[1] 101 1
> a[end(a)[1]] #Gives last element in a vector
[1] 555
There you go!
Package data.table includes last function
library(data.table)
last(c(1:10))
# [1] 10
What about:
> a <- c(1:100,555)
> a[NROW(a)]
[1] 555
The xts package provides a last function:
library(xts)
a <- 1:100
last(a)
[1] 100
As of purrr 1.0.0, pluck now accepts negative integers to index from the right:
library(purrr)
pluck(LETTERS, -1)
"Z"
