R-like way to add a % column to a data frame

I want to add a column that shows the percentage each row's value contributes to the sum of the column (sorry for my bad mathematical English here).
> trees['Heigth_%'] <- round((100 / sum(trees$Height) * trees$Height), digits=2)
> head(trees)
Girth Height Volume Heigth_%
1 8.3 70 10.3 2.97
2 8.6 65 10.3 2.76
3 8.8 63 10.2 2.67
4 10.5 72 16.4 3.06
5 10.7 81 18.8 3.44
6 10.8 83 19.7 3.52
This works.
But the question is whether this is a good and R-like way.
E.g. is sum() called for each row, or is R intelligent enough here?

To answer your question whether sum is called for every row or R is intelligent enough, you can use trace:
df = data.frame(a = 1:10, b = 21:30)
df['b_%'] = round((100 / sum(df$b) * df$b), digits=2)
trace('sum')
round((100 / sum(df$b) * df$b), digits=2)
untrace('sum')
Which shows only one call to the sum function. Afterwards, R recognizes that the lengths of trees$Height and sum(trees$Height) differ and replicates the shorter one until it has the same length as the longer one.
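A minimal illustration of the recycling described above (not from the original answer): sum() yields a single scalar, which is computed once and then recycled to the length of the longer vector during the division.
h <- trees$Height
s <- sum(h)            # evaluated exactly once
length(s)              # 1
round(100 * h / s, 2)  # the length-1 value s is recycled across all of h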

Converting first to data.table and then using prop.table is a bit faster than f1-f3 (IF you neglect the conversion to a data.table, as this is usually only done once for all subsequent commands).
# Get data
data(trees)
# Load package & convert to data.table
library(data.table)
trees <- as.data.table(trees)
# data.table way to create the new variable
f4 <- function(trees) {
  trees[, Heigth_percentage := round(prop.table(Height) * 100, 2)]
}
Here are the benchmark results:
# > microbenchmark(r1 <- f1(trees),
# + r2 <- f2(trees),
# + r3 <- f3(trees),
# + r4 <- f4(trees),
# + times = 10000)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# r1 <- f1(trees) 616.616 666.290 730.8883 683.122 708.164 8390.818 10000 b
# r2 <- f2(trees) 617.437 666.701 730.3211 683.533 709.191 8100.574 10000 b
# r3 <- f3(trees) 596.500 655.616 721.1057 672.243 697.080 55048.757 10000 b
# r4 <- f4(trees) 551.342 612.922 680.7581 633.037 665.059 54672.712 10000 a

To begin, Vandenman's answer is much more adequate and precise. What follows is not really worth an answer, but as usual - not readable as a comment.
I have added prop.table() and data.table() (see majom's answer) approaches to the timings. With 40k rows data.table() is a bit closer to the rest, but still slower (~3 ms to ~3.7 ms), with 400k rows it starts to be comparable, and with 4M rows it is finally faster than the rest:
library(microbenchmark)
library(data.table)
trees <- data.frame(Height = runif(4000000, 9, 11),
                    Heigth_PCT = numeric(4000000))
trees_dt <- as.data.table(trees)
f1 <- function(trees) {
  trees$Heigth_PCT <- round((100 / sum(trees$Height) * trees$Height), digits = 2)
  return(trees)
}
f2 <- function(trees) {
  sum_trees <- sum(trees$Height)
  trees$Heigth_PCT <- round((100 / sum_trees * trees$Height), digits = 2)
  return(trees)
}
f3 <- function(trees) {
  trees$Heigth_PCT <- round(prop.table(trees$Height) * 100, digits = 2)
  return(trees)
}
f4 <- function(trees_dt) {
  trees_dt[, Heigth_PCT := round(prop.table(Height) * 100, 2)]
}
# Time all four functions
microbenchmark(r1 <- f1(trees),
               r2 <- f2(trees),
               r3 <- f3(trees),
               r4 <- f4(trees_dt),
               times = 100)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# r1 <- f1(trees) 296.4452 309.3853 373.5945 318.7987 400.0373 639.8556 100 a
# r2 <- f2(trees) 296.3453 310.6638 381.4048 323.0655 474.9295 682.2172 100 a
# r3 <- f3(trees) 304.3185 317.0654 383.9600 328.5494 395.6238 783.2435 100 a
# r4 <- f4(trees_dt) 304.3327 315.4685 361.9526 325.8711 366.1153 722.7629 100 a
sapply(list(r2, r3, as.data.frame(r4)), identical, r1)
# [1] TRUE TRUE TRUE
Edit: prop.table() added.
Edit 2: data.table() added.

Related

What's the most efficient way to generate a vector of values from -2 to 5 with a difference of 0.1 in R?
Use the seq function,
seq(-2, 5, 0.1)
seq.int() is known to be much faster than seq().
seq.int(-2, 5, .1)
Alternatively, we could try a while loop,
(5 - -2) / .1 + 1        ## calculate number of steps
# [1] 71
i <- 1; r <- s <- -2     ## define starting values
while (i < 71L) {
  s <- s + .1
  r[i + 1] <- s
  i <- i + 1
}
print(r)
Or even try integers:
ii <- 1L; ri <- si <- -20L   ## define starting values
while (ii < 71L) {
  si <- si + 1L
  ri[ii + 1L] <- si
  ii <- ii + 1L
}
print(ri * .1)
Looking at the benchmark (sequence from -2 to 1e5 used), the while loops are faster than seq() (ignoring the time to define the start values!!), but seq.int() is still faster.
Unit: milliseconds
expr min lq mean median uq max neval cld
seq 8.857606 9.468344 11.579976 10.448676 12.225051 49.10238 1000 c
seq.int 1.351898 1.495189 2.187937 1.720612 2.243119 25.16653 1000 a
While 3.777003 3.946604 4.797641 4.164015 4.365438 476.62461 1000 b
While.int 3.775510 3.930066 4.709157 4.155621 4.348228 410.80900 1000 b
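The exact benchmark code is not shown above; here is a minimal sketch of how the seq()/seq.int() comparison on the -2 to 1e5 range could be reproduced (the while-loop variants would be wrapped in functions analogously):
library(microbenchmark)
microbenchmark(
  seq     = seq(-2, 1e5, 0.1),
  seq.int = seq.int(-2, 1e5, 0.1),
  times   = 1000
)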

R Compare one set of values with multiple sets

I have a vector of values (x).
I would like to determine the length of its overlap with each of the sets sitting in a list (y) - but without running a loop or lapply. Is it possible?
I am really interested in accelerating the execution.
Thank you very much!
Below is an example with an implementation using a loop:
x <- c(1:5)
y <- list(1:5, 2:6, 3:7, 4:8, 5:9, 6:10)
overlaps <- rep(0, length(y))
for (i in seq(length(y))) { # i=1
  # overlaps[i] <- length(intersect(x, y[[i]])) # it is slower than %in%
  overlaps[i] <- sum(x %in% y[[i]])
}
overlaps
And below is the comparison of some of the methods that were suggested in the responses below. As you can see, the loop is still the fastest - but I'd love to find something faster:
# Function with the loop:
myloop <- function(x, y) {
  overlaps <- rep(0, length(y))
  for (i in seq(length(y))) overlaps[i] <- sum(x %in% y[[i]])
  overlaps
}
# Function with sapply:
mysapply <- function(x, y) sapply(y, function(e) sum(e %in% x))
# Function with map_dbl:
library(purrr)
mymap <- function(x, y) {
  map_dbl(y, ~ sum(. %in% x))
}
library(microbenchmark)
microbenchmark(myloop(x, y), mysapply(x, y), mymap(x, y), times = 30000)
# Unit: microseconds
# expr min lq mean median uq max neval
# myloop(x, y) 17.2 19.4 26.64801 21.2 22.6 9348.6 30000
# mysapply(x, y) 27.1 29.5 39.19692 31.0 32.9 20176.2 30000
# mymap(x, y) 59.8 64.1 88.40618 66.0 70.5 114776.7 30000
Use sapply for code compactness.
Even if sapply doesn't bring much of a performance benefit compared to a for loop, at least the code is far more compact. This is the sapply equivalent of your code:
x <- c(1:5)
y <- list(1:5, 2:6, 3:7, 4:8, 5:9, 6:10)
res <- sapply(y, function(e) length(intersect(e, x)))
> res
[1] 5 4 3 2 1 0
Performance gains
As correctly stated by #StupidWolf, it's not sapply that is slowing down the execution, but rather length and intersect. Here's my test with 100,000 executions:
B <- 100000
system.time(replicate(B, sapply(y, function(e) length(intersect(e, x)))))
user system elapsed
9.79 0.01 9.79
system.time(replicate(B, sapply(y, function(e) sum(e %in% x))))
user system elapsed
2 0 2
# Using microbenchmark for more precise results:
library(microbenchmark)
microbenchmark(expr1 = sapply(y, function(e) length(intersect(e, x))), times = B)
expr min lq mean median uq max neval
expr1 81.4 84.9 91.87689 86.5 88.2 7368.7 1e+05
microbenchmark(expr2 = sapply(y, function(e) sum(e %in% x)), times = B)
expr min lq mean median uq max neval
expr2 15.4 16.1 17.68144 16.4 17 7567.9 1e+05
As we can see, the second approach is by far the performance winner.
Hope this helps.
You can use map from purrr; it goes through every element of the list y and applies a function. Below I use map_dbl, which returns a numeric vector.
library(purrr)
map_dbl(y, ~ sum(. %in% x))
[1] 5 4 3 2 1 0
To see the time:
f1 = function() {
  x <- c(1:5)
  y <- lapply(1:5, function(i) sample(1:10, 5, replace = TRUE))
  map_dbl(y, ~ sum(. %in% x))
}
f2 = function() {
  x <- c(1:5)
  y <- lapply(1:5, function(i) sample(1:10, 5, replace = TRUE))
  overlaps <- rep(0, length(y))
  for (i in seq(length(y))) { # i=1
    overlaps[i] <- length(intersect(x, y[[i]]))
  }
  overlaps
}
f3 = function() {
  x <- c(1:5)
  y <- lapply(1:5, function(i) sample(1:10, 5, replace = TRUE))
  sapply(y, function(i) sum(i %in% x))
}
Let's put it to the test:
system.time(replicate(10000,f1()))
user system elapsed
1.27 0.02 1.35
system.time(replicate(10000,f2()))
user system elapsed
1.72 0.00 1.72
system.time(replicate(10000,f3()))
user system elapsed
0.97 0.00 0.97
So if you want speed, do something like sapply + %in%; if you want something easily readable, use purrr.
Here is an option using data.table which should be fast if you have a long list of vectors in y.
library(data.table)
DT <- data.table(ID=rep(seq_along(y), lengths(y)), Y=unlist(y))
DT[.(Y=x), on=.(Y)][, .N, ID]
In addition, if you need to run this for multiple x, I would suggest creating a data.table that combines all of the x before running the code (see the sketch after the output below).
output:
ID N
1: 1 5
2: 2 4
3: 3 3
4: 4 2
5: 5 1
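Regarding the multiple-x suggestion above, here is a hedged sketch (xs and QID are made-up names for illustration, not part of the original answer) of combining several query vectors into one data.table so that a single join counts all overlaps at once:
library(data.table)
# Hypothetical: several query vectors collected in a named list
xs <- list(q1 = 1:5, q2 = 3:8)
DT <- data.table(ID = rep(seq_along(y), lengths(y)), Y = unlist(y))
QX <- data.table(QID = rep(names(xs), lengths(xs)), Y = unlist(xs))
# One join, then count matches per (query, set) pair; sets with no overlap
# simply do not appear in the result (as in the output above)
DT[QX, on = .(Y), nomatch = 0L, allow.cartesian = TRUE][, .N, by = .(QID, ID)]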

R Improve performance of function(s)

This question is related to my previous one. Here is some small sample data. I have used both data.table and data.frame to try to find a faster solution.
library(data.table)
test.dt <- data.table(strt = c(1, 1, 2, 3, 5, 2), end = c(2, 1, 5, 5, 5, 4), a1.2 = c(1, 2, 3, 4, 5, 6),
                      a2.3 = c(2, 4, 6, 8, 10, 12), a3.4 = c(3, 1, 2, 4, 5, 1), a4.5 = c(5, 1, 15, 10, 12, 10),
                      a5.6 = c(4, 8, 2, 1, 3, 9))
test.dt[, rown := as.numeric(row.names(test.dt))]
test.df <- data.frame(strt = c(1, 1, 2, 3, 5, 2), end = c(2, 1, 5, 5, 5, 4), a1.2 = c(1, 2, 3, 4, 5, 6),
                      a2.3 = c(2, 4, 6, 8, 10, 12), a3.4 = c(3, 1, 2, 4, 5, 1), a4.5 = c(5, 1, 15, 10, 12, 10),
                      a5.6 = c(4, 8, 2, 1, 3, 9))
test.df$rown <- as.numeric(row.names(test.df))
> test.df
strt end a1.2 a2.3 a3.4 a4.5 a5.6 rown
1 1 2 1 2 3 5 4 1
2 1 1 2 4 1 1 8 2
3 2 5 3 6 2 15 2 3
4 3 5 4 8 4 10 1 4
5 5 5 5 10 5 12 3 5
6 2 4 6 12 1 10 9 6
I want to use the start and end column values to determine the range of columns to subset (columns from a1.2 to a5.6) and obtain the mean. For example, in the first row, since strt=1 and end=2, I need to get the mean of a1.2 and a2.3; in the third row, I need to get the mean of a2.3, a3.4, a4.5, and a5.6
The output should be a vector like this
> k
1 2 3 4 5 6
1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
Here is what I tried:
Solution 1: This uses the data.table and applies a function over it.
func.dt <- function(rown, x, y) {
  tmp <- paste0("a", x, ".", x + 1)
  tmp1 <- paste0("a", y, ".", y + 1)
  rowMeans(test.dt[rown, get(tmp):get(tmp1), with = FALSE])
}
k <- test.dt[, func.dt(rown, strt, end), by=.(rown)]
Solution 2: This uses the data.frame and applies a function over it.
func.df <- function(rown, x, y) {
  rowMeans(test.df[rown, (x + 2):(y + 2), drop = FALSE])
}
k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
Solution 3: This uses the data.frame and loops through it.
test.ave <- rep(NA, length(test.df$strt))
for (i in 1:length(test.df$strt)) {
  test.ave[i] <- rowMeans(test.df[i, as.numeric(test.df[i, 1] + 2):as.numeric(test.df[i, 2] + 2), drop = FALSE])
}
Benchmarking shows that Solution 2 is the fastest.
test replications elapsed relative user.self sys.self user.child sys.child
1 sol1 100 0.67 4.786 0.67 0 NA NA
2 sol2 100 0.14 1.000 0.14 0 NA NA
3 sol3 100 0.15 1.071 0.16 0 NA NA
But, this is not good enough for me. Given the size of my data, these functions would need to run for a few days before I get the output. I am sure that I am not fully utilizing the power of data.table and I also know that my functions are crappy (they refer to the dataset in the global environment without passing it). Unfortunately, I am out of my depth and do not know how to fix these issues and make my functions fast. I would greatly appreciate any suggestions that help in improving my function(s) or point to alternate solutions.
I was curious how fast I could make this without resorting to writing custom C or C++ code. The best I could come up with is below. Note that using mean.default will provide greater precision, since it does a second pass over the data for error correction.
f_jmu <- compiler::cmpfun({function(m) {
  # remove start/end columns from 'm' matrix
  ma <- m[, -(1:2)]
  # column index for each row in 'ma' matrix
  cm <- col(ma)
  # logical index of whether we need the column for each row
  i <- cm >= m[, 1L] & cm <= m[, 2L]
  # multiply the input matrix by the index matrix and sum it,
  # then divide by the sum of the index matrix to get the mean
  rowSums(i * ma) / rowSums(i)
}})
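As a quick sanity check (assuming the 6-row test.df from the question is still in the workspace), the function reproduces the expected output:
# Columns must be ordered strt, end, then the a*.* columns
f_jmu(as.matrix(test.df[, c("strt", "end", "a1.2", "a2.3", "a3.4", "a4.5", "a5.6")]))
# [1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667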
The Rcpp function is still faster (not surprisingly), but the function above gets respectably close. Here's an example on 50 million observations on my laptop with an i7-4600U and 12GB of RAM.
set.seed(21)
N <- 5e7
test.df <- data.frame(strt = 1L,
                      end = sample(5, N, replace = TRUE),
                      a1.2 = sample(3, N, replace = TRUE),
                      a2.3 = sample(7, N, replace = TRUE),
                      a3.4 = sample(14, N, replace = TRUE),
                      a4.5 = sample(8, N, replace = TRUE),
                      a5.6 = sample(30, N, replace = TRUE))
test.df$strt <- pmax(1L, test.df$end - sample(3, N, replace = TRUE) + 1L)
test.m <- as.matrix(test.df)
Also note that I take care to ensure that test.m is an integer matrix. That helps reduce the memory footprint, which can help make things faster.
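An illustrative check (not from the original answer): all of the columns above were built from integer-valued sample() and pmax() calls, so the matrix is stored as integers, at 4 bytes per element instead of 8 for doubles.
storage.mode(test.m)  # "integer"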
R> system.time(st1 <- MYrcpp(test.m))
user system elapsed
0.900 0.216 1.112
R> system.time(st2 <- f_jmu(test.m))
user system elapsed
6.804 0.756 7.560
R> identical(st1, st2)
[1] TRUE
Unless you can think of a way to do this with a clever subsetting approach, I think you've reached R's speed barrier. You'll want to use a low-level language like C++ for this problem. Fortunately, the Rcpp package makes interfacing with C++ in R simple. Disclaimer: I've never written a single line of C++ code in my life. This code may be very inefficient.
library(Rcpp)
cppFunction('NumericVector MYrcpp(NumericMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  NumericVector out(nrow);
  for (int i = 0; i < nrow; i++) {
    double avg = 0;
    int start = x(i,0);
    int end = x(i,1);
    int N = end - start + 1;
    while (start <= end) {
      avg += x(i, start + 1);
      start = start + 1;
    }
    out[i] = avg/N;
  }
  return out;
}')
For this code I'm going to pass the data.frame as a matrix (i.e. testM <- as.matrix(test.df))
Let's see if it works...
MYrcpp(testM)
[1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
How fast is it?
Unit: microseconds
expr min lq mean median uq max neval
f2() 1543.099 1632.3025 2039.7350 1843.458 2246.951 4735.851 100
f3() 1859.832 1993.0265 2642.8874 2168.012 2493.788 19619.882 100
f4() 281.541 315.2680 364.2197 345.328 375.877 1089.994 100
MYrcpp(testM) 3.422 10.0205 16.7708 19.552 21.507 56.700 100
Where f2(), f3() and f4() are defined as
f2 <- function() {
  func.df <- function(rown, x, y) {
    rowMeans(test.df[rown, (x + 2):(y + 2), drop = FALSE])
  }
  k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
}
f3 <- function() {
  test.ave <- rep(NA, length(test.df$strt))
  for (i in 1:length(test.df$strt)) {
    test.ave[i] <- rowMeans(test.df[i, as.numeric(test.df[i, 1] + 2):as.numeric(test.df[i, 2] + 2), drop = FALSE])
  }
}
f4 <- function() {
  lapply(
    apply(test.df, 1, function(x) {
      x[(x[1] + 2):(x[2] + 2)]
    }),
    mean)
}
That's roughly a 20x speedup over the fastest.
Note, to implement the above code you'll need a C compiler which R can access. For Windows, look into Rtools. For more on Rcpp, read this.
Now let's see how it scales.
N = 5e3
test.df <- data.frame(strt = 1,
                      end = sample(5, N, replace = TRUE),
                      a1.2 = sample(3, N, replace = TRUE),
                      a2.3 = sample(7, N, replace = TRUE),
                      a3.4 = sample(14, N, replace = TRUE),
                      a4.5 = sample(8, N, replace = TRUE),
                      a5.6 = sample(30, N, replace = TRUE))
test.df$rown <- as.numeric(row.names(test.df))
test.dt <- as.data.table(test.df)
testM <- as.matrix(test.df)   # rebuild the matrix so MYrcpp sees the larger data
microbenchmark(f4(), MYrcpp(testM))
Unit: microseconds
expr min lq mean median uq max neval
f4() 88647.256 108314.549 125451.4045 120736.073 133487.5295 259502.49 100
MYrcpp(testM) 196.003 216.533 242.6732 235.107 261.0125 499.54 100
With 5e3 rows MYrcpp is now 550x faster. This is partially due to the fact that f4() is not going to scale well, as Richard discusses in the comments. f4() essentially invokes a nested loop by calling an apply within an lapply. Interestingly, the C++ code also uses a nested loop, with a while loop inside a for loop. The speed disparity is due in large part to the fact that the C++ code is already compiled and does not need to be interpreted into something the machine can understand at run time.
I'm not sure how big your data set is, but when I run MYrcpp on a data.frame with 1e7 rows, which is the largest data.frame I could allocate on my crummy laptop, it ran in 500 milliseconds.
Update: R equivalent of C++ code
MYr <- function(x) {
  nrow <- nrow(x)
  ncol <- ncol(x)
  out <- matrix(NA, nrow = 1, ncol = nrow)
  for (i in 1:nrow) {
    avg <- 0
    start <- x[i, 1]
    end <- x[i, 2]
    N <- end - start + 1
    while (start <= end) {
      avg <- avg + x[i, start + 2]
      start <- start + 1
    }
    out[i] <- avg/N
  }
  out
}
Both MYrcpp and MYr are similar in many ways. Let me discuss a couple of the differences:
The first line of MYrcpp is different from MYr. In words, the first line of MYrcpp, NumericVector MYrcpp(NumericMatrix x), means that we are defining a function whose name is MYrcpp, which returns an output of class NumericVector and takes an input x of class NumericMatrix.
In C++ you have to declare the class of a variable when you introduce it, i.e. int nrow = x.nrow() is a variable whose name is nrow, whose class is int (i.e. integer) and which is assigned to be x.nrow(), i.e. the number of rows of x. (Ignore this if you're overwhelmed: nrow() is a method for instances of class NumericMatrix. As in Python, you call a method by attaching it to the instance. The R equivalent is S3 and S4 methods.)
When you subset in C++ you use () instead of [] like in R. Also, indexing begins at zero (like in Python). For example, x(0,1) in C++ is equivalent to x[1,2] in R.
++ is an operator that means increment by 1, i.e. j++ is the same as j + 1. += is an operator that means add together and assign, i.e. a += b is the same as a = a + b.
My solution is the first one in the benchmark
library(microbenchmark)
microbenchmark(
  lapply(
    apply(test.df, 1, function(x) {
      x[(x[1] + 2):(x[2] + 2)]
    }),
    mean),
  test.dt[, func.dt(rown, strt, end), by = .(rown)]
)
min lq mean median uq max neval
138.654 175.7355 254.6245 201.074 244.810 3702.443 100
4243.641 4747.5195 5576.3399 5252.567 6247.201 8520.286 100
It seems to be 25 times faster, but this is a small dataset. I am sure there is a better way to do this than what I have done.

Vectorize comparison of a row vector with every row of a dataframe in R?

Suppose I have a data frame that comes from reading in the following file Foo.csv
A,B,C
1,2,3
2,2,4
1,7,3
I would like to count the number of matching elements between the first row and subsequent rows. For example, the first row matches with the second row in one position, and matches with the third row in two positions. Here is some code that will achieve the desired effect.
foo = read.csv("Foo.csv")
numDiffs = rep(0, dim(foo)[1])
for (i in 2:dim(foo)[1]) {
  numDiffs[i] = sum(foo[i, ] == foo[1, ])
}
print(numDiffs)
My question is, can this be vectorized to kill the loop and possibly reduce the running time? My first attempt is below, but it throws an error because == is not defined for this type of comparison.
colSums(foo == foo[1,])
> rowSums(sapply(foo, function(x) c(0,x[1] == x[2:nrow(foo)])))
[1] 0 1 2
Or using the automatic recycling of matrix comparisons:
bar <- as.matrix(foo)
c(0, rowSums(t(t(bar[-1, ]) == bar[1, ])))
# [1] 0 1 2
t() is there twice because the recycling is column- rather than row-wise.
As your dataset grows larger, you might get a bit more speed with something like this:
as.vector(c(0, rowSums(foo[rep(1, nrow(foo) - 1), ] == foo[-1, ])))
# [1] 0 1 2
The basic idea is to create a data.frame of the first row with the same dimensions as the overall dataset less one row, and use that to check for equivalence with the remaining rows.
Deleting my original update, here are some benchmarks instead. Change "N" to see the effect on different data.frame sizes. The solution from #nacnudus scales best.
set.seed(1)
N <- 10000000
mydf <- data.frame(matrix(sample(10, N, replace = TRUE), ncol = 10))
dim(mydf)
# [1] 1000000 10
fun1 <- function(data) rowSums(sapply(data, function(x) c(0, x[1] == x[2:nrow(data)])))
fun2 <- function(data) as.vector(c(0, rowSums(data[rep(1, nrow(data) - 1), ] == data[-1, ])))
fun3 <- function(data) {
  bar <- as.matrix(data)
  c(0, rowSums(t(t(bar[-1, ]) == bar[1, ])))
}
library(microbenchmark)
## On your original sample data
microbenchmark(fun1(foo), fun2(foo), fun3(foo))
# Unit: microseconds
# expr min lq median uq max neval
# fun1(foo) 109.903 119.0975 122.5185 127.0085 228.785 100
# fun2(foo) 333.984 354.5110 367.1260 375.0370 486.650 100
# fun3(foo) 233.490 250.8090 264.7070 269.8390 518.295 100
## On the sample data created above--I don't want to run this 100 times!
system.time(fun1(mydf))
# user system elapsed
# 15.53 0.06 15.60
system.time(fun2(mydf))
# user system elapsed
# 2.05 0.01 2.06
system.time(fun3(mydf))
# user system elapsed
# 0.32 0.00 0.33
HOWEVER, if Codoremifa were to change their code to vapply instead of sapply, that answer wins! From 15 seconds down to 0.24 seconds on 1 million rows.
fun4 <- function(data) {
  rowSums(vapply(data, function(x) c(0, x[1] == x[2:nrow(data)]),
                 vector("numeric", length = nrow(data))))
}
microbenchmark(fun3(mydf), fun4(mydf), times = 20)
# Unit: milliseconds
# expr min lq median uq max neval
# fun3(mydf) 369.5957 422.9507 438.8742 462.6958 486.3757 20
# fun4(mydf) 238.1093 316.9685 323.0659 328.0969 341.5154 20
eh, I don't see why you can't just do..
c(foo[1,]) == foo
# A B C
#[1,] TRUE TRUE TRUE
#[2,] FALSE TRUE FALSE
#[3,] TRUE FALSE TRUE
.. or even better foo[1,,drop=TRUE] == foo...
Thus the result becomes...
rowSums( c( foo[1,] ) == foo )
#[1] 3 1 2
Remember, foo[1,] is still a data.frame. Coerce it to a vector and == is defined for what you need. This seems to be a little quicker than the vapply answer suggested by #AnandaMahto on a big dataframe.
Benchmarking
Comparing this against fun3 and fun4 from #AnandaMahto's answer above, I see a small speed improvement when using the larger data.frame, mydf...
microbenchmark(fun3(mydf), fun4(mydf), fun6(mydf) , times = 20)
#Unit: milliseconds
# expr min lq median uq max neval
# fun3(mydf) 320.7485 344.9249 356.1657 365.7576 399.5334 20
# fun4(mydf) 299.6660 313.7105 319.1700 327.8196 555.4625 20
# fun6(mydf) 196.8244 241.4866 252.6311 258.8501 262.7968 20
fun6 is defined as...
fun6 <- function(data) rowSums( c( data[1,] ) == data )

Speeding up a function: checking NA count before computing mean

The function below calculates the mean of a vector. However, it first checks the proportion of NA's present in the vector
and, if it is above a given threshold, returns NA instead of the mean.
My issue is that my current implementation is rather inefficient. It takes more than 7x longer than simply running mean(vec, na.rm=TRUE).
I tried an alternate method using na.omit, but that is even slower.
Given the size of my data, executing the single lapply is taking over 40 minutes.
Any suggestions on how to accomplish the same task more quickly?
UPDATE - RE: #thelatemail's solution and #Arun's comment:
I am executing this function over several hundred groups, each of varying size. The sample data originally provided in this question was given as a neat data frame simply for ease of creating artificial data.
Alternate sample data to avoid the confusion
# Sample Data
# ------------
set.seed(1)
# slightly different sizes for each group
N1 <- 5e3
N2 <- N1 + as.integer(rnorm(1, 0, 100))
# One group has only a moderate amount of NA's
SAMP1 <- rnorm(N1)
SAMP1[sample(N1, .25 * N1, FALSE)] <- NA # add in NA's
# Another group has many NA's
SAMP2 <- rnorm(N2)
SAMP2[sample(N2, .95 * N2, FALSE)] <- NA # add in large number of NA's
# put them all in a list
SAMP.NEW <- list(SAMP1, SAMP2)
# keep it clean
rm(SAMP1, SAMP2)
# Execute
# -------
lapply(SAMP.NEW, meanIfThresh)
Original Sample Data, function etc
# Sample Data
# ------------
set.seed(1)
rows <- 20000 # actual data has more than 7M rows
cols <- 1000
SAMP <- replicate(cols, rnorm(rows))
SAMP[sample(length(SAMP), .25 * length(SAMP), FALSE)] <- NA # add in NA's
# Select 5 random rows, and have them be 90% NA
tooSparse <- sample(rows, 5)
for (r in tooSparse)
  SAMP[r, sample(cols, cols * .9, FALSE)] <- NA
# Function
# ------------
meanIfThresh <- function(vec, thresh=12/15) {
  # Calculates the mean of vec; however,
  # if the proportion of NA values in vec is above thresh, returns NA
  # thresh : represents how much data must be PRESENT.
  # ie, if thresh is 80%, then there must be at least
  len <- length(vec)
  if( (sum(is.na(vec)) / len) > thresh)
    return(NA_real_)
  # if the proportion of NA's is greater than the threshold, return NA
  # example: if I'm looking at 14 days, and I have 12 NA's,
  # my proportion is 85.7 % = (12 / 14)
  # default thresh is 80.0 % = (12 / 15)
  # Thus, 12 NAs in a group of 14 would be rejected
  # else, calculate the mean, removing NA's
  return(mean(vec, na.rm=TRUE))
}
# Execute
# -----------------
apply(SAMP, 1, meanIfThresh)
# Compare with `mean`
#----------------
plain <- apply(SAMP, 1, mean, na.rm=TRUE)
modified <- apply(SAMP, 1, meanIfThresh)
# obviously different
identical(plain, modified)
plain[tooSparse]
modified[tooSparse]
microbenchmark( "meanIfThresh" = apply(SAMP, 1, meanIfThresh)
, "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
, times = 15L)
# With the actual data, the penalty is sevenfold
# Unit: seconds
# expr min lq median uq max neval
# meanIfThresh 1.658600 1.677472 1.690460 1.751913 2.110871 15
# mean (regular) 1.422478 1.485320 1.503468 1.532175 1.547450 15
Couldn't you just replace the high-NA rows' mean values afterwards, like so?
# changed `result <- apply(SAMP,1,mean,na.rm=TRUE)`
result <- rowMeans(SAMP, na.rm=TRUE)
NArows <- rowSums(is.na(SAMP))/ncol(SAMP) > 0.8
result[NArows] <- NA
Some benchmarking:
Ricardo <- function(vec, thresh=12/15) {
  len <- length(vec)
  if( (sum(is.na(vec)) / len) > thresh)
    return(NA_real_)
  return(mean(vec, na.rm=TRUE))
}
DanielFischer <- function(vec, thresh=12/15) {
  len <- length(vec)
  nas <- is.na(vec)
  Nna <- sum(nas)
  if( (Nna / len) > thresh)
    return(NA_real_)
  return(sum(vec[!nas])/(len-Nna))
}
thelatemail <- function(mat) {
  result <- rowMeans(mat, na.rm=TRUE)
  NArows <- rowSums(is.na(mat))/ncol(mat) > 0.8
  result[NArows] <- NA
  result
}
require(microbenchmark)
microbenchmark(m1 <- apply(SAMP, 1, Ricardo),
               m2 <- apply(SAMP, 1, DanielFischer),
               m3 <- thelatemail(SAMP), times = 5L)
Unit: milliseconds
expr min lq median uq max neval
m1 <- apply(SAMP, 1, Ricardo) 2923.7260 2944.2599 3066.8204 3090.8127 3105.4283 5
m2 <- apply(SAMP, 1, DanielFischer) 2643.4883 2683.1034 2755.7032 2799.5155 3089.6015 5
m3 <- thelatemail(SAMP) 337.1862 340.6339 371.6148 376.5517 383.4436 5
all.equal(m1, m2) # TRUE
all.equal(m1, m3) # TRUE
Do you really have to go through your vector vec twice in your function? If you store the NA positions first, it might speed up your calculations a bit:
meanIfThresh2 <- function(vec, thresh=12/15) {
  len <- length(vec)
  nas <- is.na(vec)
  Nna <- sum(nas)
  if( (Nna / len) > thresh)
    return(NA_real_)
  return(sum(vec[!nas])/(len-Nna))
}
EDIT: I performed similar benchmarking to see the effect of this change:
> microbenchmark( "meanIfThresh" = apply(SAMP, 1, meanIfThresh)
+ , "meanIfThresh2" = apply(SAMP, 1, meanIfThresh2)
+ , "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
+ , times = 15L)
Unit: seconds
expr min lq median uq max neval
meanIfThresh 2.009858 2.156104 2.158372 2.166092 2.192493 15
meanIfThresh2 1.825470 1.828273 1.829424 1.834407 1.872028 15
mean (regular) 1.868568 1.882526 1.889852 1.893564 1.907495 15
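For completeness, here is a hedged sketch (not from any of the answers above; meanIfThreshList is a made-up name) of applying the same idea to the list-of-groups layout from the question's update, combining vapply with the single-pass arithmetic of meanIfThresh2:
meanIfThreshList <- function(lst, thresh = 12/15) {
  vapply(lst, function(vec) {
    nas <- is.na(vec)
    if (mean(nas) > thresh) return(NA_real_)  # too sparse: bail out early
    sum(vec[!nas]) / sum(!nas)                # single pass over the kept values
  }, numeric(1))
}
meanIfThreshList(SAMP.NEW)  # the second group is ~95% NA, so it comes back NA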
