Creating an iterator for a data frame in R

I'm trying to create a function in R using an iterator from the iterators package (https://cran.r-project.org/web/packages/iterators/iterators.pdf) to help iterate through each row of a data frame.
Given a table like this:
data <- matrix(c(1,0,0,NA,NA,1,1,NA,0), ncol=3, byrow=TRUE)
> data
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]   NA   NA    1
[3,]    1   NA    0
I want it to go through each row and return the first non-NA value from left to right, or NA if all values are NA. So with the data above, it should return 1, 1, 1.
The general idea I have right now is to use the iter() function from the package like so:
library(iterators)

vec <- vector()
iterRow <- iter(data[x, ])      # creates an iterator object for row x of data
i <- 1
while (i <= iterRow$length) {   # iterRow$length gives the number of columns, essentially
  temp <- nextElem(iterRow)     # set temp to the next element of the iterator
  if (!is.na(temp)) {           # if the value is not NA, add it to vec
    vec <- c(vec, temp)
  }
  i <- i + 1
}
vec <- c(vec, NA)               # otherwise set it to NA
return(vec)
The data I'm working with will be up to millions of rows long, so ideally I would like to vectorize the function. I'm stuck on how to apply that idea across the whole data frame.
Would it work to make the function like this:
iterateRows <- function(dataFrame) {
  ...
}
with the data frame I'm working with as the argument?
I also know C++, so if using C++ to write a similar function would be easier, I could do that. Any help will be appreciated!

Start with a simple approach. Here's a function to do what you want to each row:
first_not_na = function(x) if(all(is.na(x))) NA else x[!is.na(x)][1]
Here are a couple of simple ways to apply that to every row of data:
# apply
results = apply(data, 1, first_not_na)
# for loop
results = numeric(nrow(data))
for (i in 1:nrow(data)) results[i] = first_not_na(data[i, ])
Here's a benchmark comparing timings on pretty large data:
row = 1e6
col = 5
data = matrix(sample(c(1, 0, NA), size = row * col, replace = T), nrow = row)
microbenchmark::microbenchmark(
  apply = {results = apply(data, 1, first_not_na)},
  loop = {
    results = numeric(row)
    for (i in 1:row) results[i] = first_not_na(data[i, ])
  },
  times = 5
)
# Unit: seconds
#   expr      min       lq     mean   median       uq      max neval cld
#  apply 2.140379 2.249405 2.399239 2.480180 2.524667 2.601563     5   a
#   loop 1.970481 1.982853 2.160342 2.090484 2.264797 2.493095     5   a
A simple for loop takes about 2 seconds for 1M rows and 5 columns. If you want to speed it up more, you could certainly use foreach to parallelize, as sketched below. Only if that's still not fast enough should you look at more complex solutions like iterators or an implementation in C++.
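A minimal parallel sketch, assuming the foreach and doParallel packages are installed and that data and first_not_na are defined as above (worker and chunk counts are illustrative only):
library(foreach)
library(doParallel)

cl <- makeCluster(2)            # two workers; adjust to your machine
registerDoParallel(cl)

# split the row indices into a few contiguous chunks
chunks <- split(seq_len(nrow(data)), cut(seq_len(nrow(data)), 4, labels = FALSE))

results <- foreach(idx = chunks, .combine = c,
                   .export = c("data", "first_not_na")) %dopar% {
  apply(data[idx, , drop = FALSE], 1, first_not_na)
}

stopCluster(cl)
Keep in mind that each worker gets its own copy of the exported data, so for very large inputs the copying overhead can eat into the gains.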

Related

fastest way to count the number of rows in a data frame that have at least one NA

When you have a data set, usually you want to see what fraction of rows has at least one NA (or missing value).
In R, what I did is the following:
TR = apply(my_data,1,anyNA)
sum(TR)/length(TR)
But I found that if my data set has 1 million rows, it takes some time. I wonder if there is a faster way to achieve this goal in R?
Before I begin, note that none of the code here is mine. I was merely fascinated by the code in the comments and wondered which one really performed the best.
I suspected some of the time was being absorbed in transforming a data frame to a matrix for apply and rowSums, so I've also run most of the solutions on matrices to illustrate the penalty incurred by running them on a data frame.
# Make a data frame of 10,000 rows and set random values to NA
library(dplyr)
set.seed(13)
MT <- mtcars[sample(1:nrow(mtcars), size = 10000, replace = TRUE), ]
MT <- lapply(MT,
             function(x) { x[sample(1:length(x), size = 100)] <- NA; x }) %>%
  bind_cols()
MT_mat <- as.matrix(MT)
library(microbenchmark)
microbenchmark(
  apply(MT, 1, anyNA),
  apply(MT_mat, 1, anyNA),                      # apply on a matrix
  row_sum = rowSums(is.na(MT)) > 0,
  row_sum_mat = rowSums(is.na(MT_mat)),         # rowSums on a matrix
  reduce = Reduce('|', lapply(MT, is.na)),
  complete_case = !complete.cases(MT),
  complete_case_mat = !complete.cases(MT_mat)   # complete.cases on a matrix
)
Unit: microseconds
                    expr       min        lq       mean     median         uq       max neval  cld
     apply(MT, 1, anyNA) 12126.013 13422.747 14930.6022 13927.5695 14589.1320 60958.791   100    d
 apply(MT_mat, 1, anyNA) 11662.390 12546.674 14758.1266 13336.6785 14083.7225 66075.346   100    d
                 row_sum  1541.594  1581.768  2233.1150  1617.3985  1647.8955 49114.588   100  b c
             row_sum_mat   579.161   589.131   707.3710   618.7490   627.5465  3235.089   100  a c
                  reduce  2028.969  2051.696  2252.8679  2084.8320  2102.8670  4271.127   100    c
           complete_case   321.984   330.195   346.8692   342.5115   351.3090   436.057   100  a
       complete_case_mat   348.083   358.640   384.1671   379.0205   406.8790   503.503   100  ab
#* Verify that they all return the same result
MT$apply <- apply(MT, 1, anyNA)
MT$apply_mat <- apply(MT_mat, 1, anyNA)
MT$row_sum <- rowSums(is.na(MT)) > 0
MT$row_sum_mat <- rowSums(is.na(MT_mat)) > 0
MT$reduce <- Reduce('|', lapply(MT, is.na))
MT$complete_case <- !complete.cases(MT)
MT$complete_case_mat <- !complete.cases(MT_mat)
all(MT$apply == MT$apply_mat)
all(MT$apply == MT$row_sum)
all(MT$apply == MT$row_sum_mat)
all(MT$apply == MT$reduce)
all(MT$apply == MT$complete_case)
all(MT$apply == MT$complete_case_mat)
complete.cases seems to be the clear winner, and it works well for both data frames and matrices. As it turns out, complete.cases calls a C routine, which may account for much of its speed, while looking at rowSums, apply, and Reduce shows R code.
Why apply is slower than rowSums probably has to do with rowSums being optimized for a specific task: rowSums knows it will be returning a numeric, while apply has no such guarantee. I doubt that accounts for all of the difference, though; I'm mostly speculating.
I couldn't begin to tell you how Reduce is working.
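Tying this back to the question: with complete.cases as the winner, the fraction of rows containing at least one NA can be computed in one line (a sketch, assuming my_data is the questioner's data set):
# proportion of rows with at least one missing value
mean(!complete.cases(my_data))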

R Improve performance of function(s)

This question is related to my previous one. Here is some small sample data. I have used both data.table and data.frame to try to find a faster solution.
library(data.table)

test.dt <- data.table(strt=c(1,1,2,3,5,2), end=c(2,1,5,5,5,4), a1.2=c(1,2,3,4,5,6),
                      a2.3=c(2,4,6,8,10,12), a3.4=c(3,1,2,4,5,1), a4.5=c(5,1,15,10,12,10),
                      a5.6=c(4,8,2,1,3,9))
test.dt[, rown := as.numeric(row.names(test.dt))]
test.df <- data.frame(strt=c(1,1,2,3,5,2), end=c(2,1,5,5,5,4), a1.2=c(1,2,3,4,5,6),
                      a2.3=c(2,4,6,8,10,12), a3.4=c(3,1,2,4,5,1), a4.5=c(5,1,15,10,12,10),
                      a5.6=c(4,8,2,1,3,9))
test.df$rown <- as.numeric(row.names(test.df))
> test.df
  strt end a1.2 a2.3 a3.4 a4.5 a5.6 rown
1    1   2    1    2    3    5    4    1
2    1   1    2    4    1    1    8    2
3    2   5    3    6    2   15    2    3
4    3   5    4    8    4   10    1    4
5    5   5    5   10    5   12    3    5
6    2   4    6   12    1   10    9    6
I want to use the start and end column values to determine the range of columns to subset (columns from a1.2 to a5.6) and obtain the mean. For example, in the first row, since strt=1 and end=2, I need to get the mean of a1.2 and a2.3; in the third row, since strt=2 and end=5, I need to get the mean of a2.3, a3.4, a4.5, and a5.6.
The output should be a vector like this
> k
       1        2        3        4        5        6
1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
Here is what I tried:
Solution 1: This uses the data.table and applies a function over it.
func.dt <- function(rown, x, y) {
  tmp  <- paste0("a", x, ".", x + 1)
  tmp1 <- paste0("a", y, ".", y + 1)
  rowMeans(test.dt[rown, get(tmp):get(tmp1), with = FALSE])
}
k <- test.dt[, func.dt(rown, strt, end), by = .(rown)]
Solution 2: This uses the data.frame and applies a function over it.
func.df <- function(rown, x, y) {
  rowMeans(test.df[rown, (x+2):(y+2), drop = FALSE])
}
k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
Solution 3: This uses the data.frame and loops through it.
test.ave <- rep(NA, length(test.df$strt))
for (i in 1:length(test.df$strt)) {
  test.ave[i] <- rowMeans(test.df[i, as.numeric(test.df[i,1]+2):as.numeric(test.df[i,2]+2), drop = FALSE])
}
Benchmarking shows that Solution 2 is the fastest.
  test replications elapsed relative user.self sys.self user.child sys.child
1 sol1          100    0.67    4.786      0.67        0         NA        NA
2 sol2          100    0.14    1.000      0.14        0         NA        NA
3 sol3          100    0.15    1.071      0.16        0         NA        NA
But, this is not good enough for me. Given the size of my data, these functions would need to run for a few days before I get the output. I am sure that I am not fully utilizing the power of data.table and I also know that my functions are crappy (they refer to the dataset in the global environment without passing it). Unfortunately, I am out of my depth and do not know how to fix these issues and make my functions fast. I would greatly appreciate any suggestions that help in improving my function(s) or point to alternate solutions.
I was curious how fast I could make this without resorting to writing custom C or C++ code. The best I could come up with is below. Note that using mean.default will provide greater precision, since it does a second pass over the data for error correction.
f_jmu <- compiler::cmpfun(function(m) {
  # remove start/end columns from 'm' matrix
  ma <- m[, -(1:2)]
  # column index for each row in 'ma' matrix
  cm <- col(ma)
  # logical index of whether we need the column for each row
  i <- cm >= m[, 1L] & cm <= m[, 2L]
  # multiply the input matrix by the index matrix and sum it,
  # then divide by the sum of the index matrix to get the mean
  rowSums(i * ma) / rowSums(i)
})
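As a quick sanity check on the small six-row example from the question (assuming test.df still holds that sample; the rown column is dropped so the matrix is strt, end, a1.2 ... a5.6):
f_jmu(as.matrix(test.df[, 1:7]))
# [1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667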
The Rcpp function is still faster (not surprisingly), but the function above gets respectably close. Here's an example on 50 million observations on my laptop with an i7-4600U and 12GB of RAM.
set.seed(21)
N <- 5e7
test.df <- data.frame(strt = 1L,
                      end  = sample(5, N, replace = TRUE),
                      a1.2 = sample(3, N, replace = TRUE),
                      a2.3 = sample(7, N, replace = TRUE),
                      a3.4 = sample(14, N, replace = TRUE),
                      a4.5 = sample(8, N, replace = TRUE),
                      a5.6 = sample(30, N, replace = TRUE))
test.df$strt <- pmax(1L, test.df$end - sample(3, N, replace = TRUE) + 1L)
test.m <- as.matrix(test.df)
Also note that I take care to ensure that test.m is an integer matrix. That helps reduce the memory footprint, which can help make things faster.
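A quick way to confirm that (a sketch; the reported size depends on N): integers take 4 bytes per cell versus 8 for doubles, so the integer matrix should be roughly half the size of an equivalent double matrix.
storage.mode(test.m)                     # expect "integer"
print(object.size(test.m), units = "Mb")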
R> system.time(st1 <- MYrcpp(test.m))
user system elapsed
0.900 0.216 1.112
R> system.time(st2 <- f_jmu(test.m))
user system elapsed
6.804 0.756 7.560
R> identical(st1, st2)
[1] TRUE
Unless you can think of a way to do this with a clever subsetting approach, I think you've reached R's speed barrier. You'll want to use a low-level language like C++ for this problem. Fortunately, the Rcpp package makes interfacing with C++ in R simple. Disclaimer: I've never written a single line of C++ code in my life. This code may be very inefficient.
library(Rcpp)
cppFunction('NumericVector MYrcpp(NumericMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  NumericVector out(nrow);
  for (int i = 0; i < nrow; i++) {
    double avg = 0;
    int start = x(i, 0);
    int end = x(i, 1);
    int N = end - start + 1;
    while (start <= end) {
      avg += x(i, start + 1);
      start = start + 1;
    }
    out[i] = avg / N;
  }
  return out;
}')
For this code I'm going to pass the data.frame as a matrix (i.e. testM <- as.matrix(test.df))
Let's see if it works...
MYrcpp(testM)
[1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
How fast is it?
Unit: microseconds
          expr      min        lq      mean   median       uq       max neval
          f2() 1543.099 1632.3025 2039.7350 1843.458 2246.951  4735.851   100
          f3() 1859.832 1993.0265 2642.8874 2168.012 2493.788 19619.882   100
          f4()  281.541  315.2680  364.2197  345.328  375.877  1089.994   100
 MYrcpp(testM)    3.422   10.0205   16.7708   19.552   21.507    56.700   100
Where f2(), f3() and f4() are defined as
f2 <- function() {
  func.df <- function(rown, x, y) {
    rowMeans(test.df[rown, (x+2):(y+2), drop = FALSE])
  }
  k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
}
f3 <- function() {
  test.ave <- rep(NA, length(test.df$strt))
  for (i in 1:length(test.df$strt)) {
    test.ave[i] <- rowMeans(test.df[i, as.numeric(test.df[i,1]+2):as.numeric(test.df[i,2]+2), drop = FALSE])
  }
}
f4 <- function() {
  lapply(
    apply(test.df, 1, function(x) {
      x[(x[1]+2):(x[2]+2)]
    }),
    mean)
}
That's roughly a 20x speedup over the fastest pure-R version.
Note: to use the above code you'll need a C++ compiler that R can access. For Windows, look into Rtools. For more on Rcpp, read this.
Now let's see how it scales.
N = 5e3
test.df <- data.frame(strt = 1,
                      end  = sample(5, N, replace = TRUE),
                      a1.2 = sample(3, N, replace = TRUE),
                      a2.3 = sample(7, N, replace = TRUE),
                      a3.4 = sample(14, N, replace = TRUE),
                      a4.5 = sample(8, N, replace = TRUE),
                      a5.6 = sample(30, N, replace = TRUE))
test.df$rown <- as.numeric(row.names(test.df))
test.dt <- as.data.table(test.df)
testM <- as.matrix(test.df)
microbenchmark(f4(), MYrcpp(testM))
Unit: microseconds
          expr       min         lq        mean     median          uq       max neval
          f4() 88647.256 108314.549 125451.4045 120736.073 133487.5295 259502.49   100
 MYrcpp(testM)   196.003    216.533    242.6732    235.107    261.0125    499.54   100
With 5e3 rows MYrcpp is now 550x faster. This is partly due to the fact that f4() is not going to scale well, as Richard discusses in the comment. The f4() is essentially invoking a nested for loop by calling an apply within a lapply. Interestingly, the C++ code is also invoking a nested loop by utilizing a while loop inside a for loop. The speed disparity is due in large part to the fact that the C++ code is already compiled and does not need to be interpreted into something the machine can understand at run time.
I'm not sure how big your data set is, but when I run MYrcpp on a data.frame with 1e7 rows, which is the largest data.frame I could allocate on my crummy laptop, it ran in 500 milliseconds.
Update: R equivalent of C++ code
MYr <- function(x){
  nrow <- nrow(x)
  ncol <- ncol(x)
  out <- numeric(nrow)
  for (i in 1:nrow) {
    avg <- 0
    start <- x[i, 1]
    end <- x[i, 2]
    N <- end - start + 1
    while (start <= end) {
      avg <- avg + x[i, start + 2]
      start <- start + 1
    }
    out[i] <- avg / N
  }
  out
}
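A quick check that the two versions agree (a sketch; testM_small is just an illustrative name, built from whichever test.df with strt, end, and a1.2 ... a5.6 columns is currently in scope):
testM_small <- as.matrix(test.df[, c("strt", "end", "a1.2", "a2.3", "a3.4", "a4.5", "a5.6")])
all.equal(as.numeric(MYr(testM_small)), as.numeric(MYrcpp(testM_small)))  # expect TRUE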
MYrcpp and MYr are similar in many ways. Let me discuss a couple of the differences:
The first line of MYrcpp is different from that of MYr. In words, the first line of MYrcpp, NumericVector MYrcpp(NumericMatrix x), means that we are defining a function named MYrcpp which returns an output of class NumericVector and takes an input x of class NumericMatrix.
In C++ you have to declare the type of a variable when you introduce it, e.g. int nrow = x.nrow() creates a variable named nrow, of type int (i.e. integer), and assigns it x.nrow(), the number of rows of x. (Ignore this if you're overwhelmed: nrow() is a method for instances of class NumericMatrix. As in Python, you call a method by attaching it to the instance; the R equivalents are S3 and S4 methods.)
When you subset in C++ you use () instead of [] as in R. Also, indexing begins at zero (as in Python). For example, x(0,1) in C++ is equivalent to x[1,2] in R.
++ is an operator that means increment by 1, i.e. j++ is the same as j = j + 1. += is an operator that means add and assign, i.e. a += b is the same as a = a + b.
My solution is the first one in the benchmark
library(microbenchmark)
microbenchmark(
  lapply(
    apply(test.df, 1, function(x) {
      x[(x[1]+2):(x[2]+2)]
    }),
    mean),
  test.dt[, func.dt(rown, strt, end), by = .(rown)]
)
      min        lq      mean    median       uq      max neval
  138.654  175.7355  254.6245   201.074  244.810 3702.443   100
 4243.641 4747.5195 5576.3399  5252.567 6247.201 8520.286   100
It seems to be 25 times faster, but this is a small dataset. I am sure there is a better way to do this than what I have done.

Return indices of rows whose elements (columns) all match a reference vector

Using the following code;
c <- NULL
for (a in 1:4) {
  b <- seq(from = a, to = a + 5)
  c <- rbind(c, b)
}
c <- rbind(c, c); rm(a, b)
Results in this matrix,
> c
  [,1] [,2] [,3] [,4] [,5] [,6]
b    1    2    3    4    5    6
b    2    3    4    5    6    7
b    3    4    5    6    7    8
b    4    5    6    7    8    9
b    1    2    3    4    5    6
b    2    3    4    5    6    7
b    3    4    5    6    7    8
b    4    5    6    7    8    9
How can I return row indices for rows matching a specific input?
For example, with a search term of,
z <- c(3,4,5,6,7,8)
I need the following returned,
[1] 3 7
This will be used in a fairly large data frame of test data, related to a time step column, to reduce the data by accumulating time steps for matching rows.
Question answered well by others. Due to my dataset size (9.5M rows), I came up with an efficient approach that took a couple steps.
1) Sort the big data frame 'dc' containing time steps to accumulate in column 1.
dc <- dc[order(dc[,2],dc[,3],dc[,4],dc[,5],dc[,6],dc[,7],dc[,8]),]
2) Create a new data frame with unique entries (excluding column 1).
dcU <- unique(dc[,2:8])
3) Write an Rcpp (C++) function that loops through the unique data frame while iterating through the original data frame, accumulating time while rows are equal and moving to the next for-loop step when an unequal row is identified.
require(Rcpp)

getTsrc <- '
NumericVector getT(NumericMatrix dc, NumericMatrix dcU) {
  int k = 0;
  int n = dcU.nrow();
  int m = dc.nrow();
  NumericVector tU(n);
  for (int i = 0; i < n; i++) {
    // accumulate time (column 0 of dc) while the current dc row matches the unique row;
    // the k < m guard stops the loop from reading past the last row of dc
    while (k < m &&
           (dcU(i,0)==dc(k,1)) && (dcU(i,1)==dc(k,2)) && (dcU(i,2)==dc(k,3)) &&
           (dcU(i,3)==dc(k,4)) && (dcU(i,4)==dc(k,5)) && (dcU(i,5)==dc(k,6)) &&
           (dcU(i,6)==dc(k,7))) {
      tU[i] = tU[i] + dc(k,0);
      k++;
    }
  }
  return(tU);
}
'
cppFunction(getTsrc)
4) Convert function inputs to matrices.
dc1 <- as.matrix(dc)
dcU1 <- as.matrix(dcU)
5) Run the function and time it (returns time vector matching unique data frame)
pt <- proc.time()
t <- getT(dc1, dcU1)
print(proc.time() - pt)
user system elapsed
0.18 0.03 0.20
6) Self high-five and more coffee.
You can use apply.
Here we use apply on c, across rows (the 1), applying the function function(x) all(x == z) to each row.
The which then pulls out the integer positions of the rows.
which(apply(c, 1, function(x) all(x == z)))
b b
3 7
EDIT: If your real data is having problems with this, and is only 9 columns (not too much typing), you could try a fully vectorized solution:
which((c[,1]==z[1] & c[,2]==z[2] & c[,3]==z[3] & c[,4]==z[4]& c[,5]==z[5]& c[,6]==z[6]))
The answer by #jeremycg will definitely work, and is fast if you have many columns and few rows. However, you might be able to go a bit faster if you have lots of rows by avoiding using apply() on the row dimension.
Here's an alternative:
l <- unlist(apply(c, 2, list), recursive=F)
logic <- mapply(function(x,y)x==y, l, z)
which(.rowSums(logic, m=nrow(logic), n=ncol(logic)) == ncol(logic))
[1] 3 7
It works by first turning each column into a list. Then, it takes each column-list and searches it for the corresponding element in z. In the last step, you find out which rows had all columns with the corresponding match in z. Even though the last step is a row-wise operation, by using .rowSums (mind the . at the front there) we can specify the dimensions of the matrix, and get a speed-up.
Let's test the timings of the two approaches.
The functions
f1 <- function() {
  which(apply(c, 1, function(x) all(x == z)))
}
f2 <- function() {
  l <- unlist(apply(c, 2, list), recursive = FALSE)
  logic <- mapply(function(x, y) x == y, l, z)
  which(.rowSums(logic, m = nrow(logic), n = ncol(logic)) == ncol(logic))
}
With 8 rows (dim in example):
> time <- microbenchmark(f1(), f2())
> time
Unit: microseconds
 expr    min      lq     mean  median     uq     max neval cld
 f1() 21.147 21.8375 22.86096 22.6845 23.326  30.443   100  a
 f2() 42.310 43.1510 45.13735 43.7500 44.438 137.413   100   b
With 80 rows:
Unit: microseconds
 expr     min      lq     mean   median       uq     max neval cld
 f1() 101.046 103.859 108.7896 105.1695 108.3320 166.745   100   a
 f2()  93.631  96.204 104.6711  98.1245 104.7205 236.980   100   a
With 800 rows:
> time <- microbenchmark(f1(), f2())
> time
Unit: microseconds
 expr     min       lq      mean    median        uq       max neval cld
 f1() 920.146 1011.394 1372.3512 1042.1230 1066.7610 31290.593   100   b
 f2() 572.222  579.626  593.9211  584.5815  593.6455  1104.316   100  a
Note that my timing assessment only had 100 replicates each, and although these results are representative, there's a bit of variability in the number of rows required before the two methods are equal.
Regardless, I think my approach would probably be faster once you have 100+ rows.
Also, note that you can't simply transpose c to make f1() faster. First, the t() takes up time; second, because you're comparing to z, you would then just have to make a column-wise (after the transpose) comparison, so it's no different at that point.
Finally, I'm sure there's an even faster way to do this. My answer was just the first thing that came to mind, and didn't require any packages to install. This could be a lot faster if you wanted to use data.table. Also, if you had a lot of columns, you might even be able to parallelize this procedure (although, to be worthwhile the dataset would have to be immense).
If these timings aren't tolerable for your data, you might consider reporting back with the dimensions of your data set.
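For what it's worth, here is one more fully vectorized variant you could try (a sketch, not benchmarked here, and it assumes c contains no NAs): expand z into a matrix with one copy per row and compare element-wise, which avoids both apply() on rows and an explicit transpose.
z_mat <- matrix(z, nrow = nrow(c), ncol = length(z), byrow = TRUE)
which(rowSums(c == z_mat) == length(z))   # should give rows 3 and 7, as above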
In your code c is not a data frame. Try transforming it into one:
c <- data.frame(c)

How to efficiently compare each row of a matrix to each section of a list in R?

Here's an example of what I mean; this code outputs the right thing:
list1 = list(c(1,2,3,4), c(5,6,7), c(8,9), c(10, 11))
matrix1 = rbind(c(1,2), c(1,5), c(8, 10))
compare <- function(list.t, matrix.t) {
  pairs <- 0
  for (i in 1:nrow(matrix.t)) {
    for (j in 1:length(list.t)) {
      if (length(intersect(matrix.t[i,], list.t[[j]])) == 2) {
        pairs <- pairs + 1
      }
    }
  }
  return(pairs / nrow(matrix.t))
}
compare(list1, matrix1)
# = 0.33333
I hope that makes sense. I'm trying to take an nx2 matrix, and see if the two elements of each row of the matrix are also found in each section of the list. So, in the example above, the first row of the matrix is (1,2), and this pair is found in the first section of the list. The (1,5) or the (8,10) pairs are not found in any section of the list. So that's why I'm outputting 0.3333 (1/3).
I'm wondering if anyone knows a way that doesn't use two for-loops to compare each row to each section? I have larger matrices and lists, and so this is too slow.
Thank you for any help!
Wouldn't this work just the same? You could call sapply over the list and compare with all rows of the matrix simultaneously.
> list1 = list(c(1,2,3,4), c(5,6,7), c(8,9), c(10, 11))
> matrix1 = rbind(c(1,2), c(1,5), c(8, 10))
> s <- sapply(seq_along(list1), function(i) {
    length(intersect(list1[[i]], matrix1)) == 2
  })
> sum(s)/nrow(matrix1)
# [1] 0.3333333
If we call your function f1(), and this sapply version of the same function f2(), we get the following difference in speed.
> library(microbenchmark)
> microbenchmark(f1(), f2())
# Unit: microseconds
#  expr     min       lq  median       uq      max neval
#  f1() 245.017 261.2240 268.843 281.7350 1265.706   100
#  f2() 113.727 117.7045 125.478 135.6945  268.310   100
Hopefully that's the increase in efficiency you're looking for.
This is offered in the spirit of an R golf challenge for your problem: a compact but potentially inscrutable solution:
mean( apply(matrix1, 1,
function(x) any( {lapply(list1, function(z) {all(x %in% z) } )}) )
)
[1] 0.3333333
The inner lapply tests whether a particular element of list1 contains both of the items in the two-element vector passed as a row from matrix1. Then the any function tests whether any of the 4 list elements met that condition for a particular row. The intermediate logical vector c(TRUE, FALSE, FALSE) is converted into a fraction by mean. (It's still really two nested loops.)

How can I efficiently generate a dataframe of simulated values?

I'm trying to generate a data frame of simulated values based on existing distribution parameters. My main data frame contains the mean and standard deviation for each observation, like so:
example.data <- data.frame(country=c("a", "b", "c"),
                           score_mean=c(0.5, 0.4, 0.6),
                           score_sd=c(0.1, 0.1, 0.2))
#   country score_mean score_sd
# 1       a        0.5      0.1
# 2       b        0.4      0.1
# 3       c        0.6      0.2
I can use sapply() and a custom function to use the score_mean and score_sd parameters to randomly draw from a normal distribution:
score.simulate <- function(score.mean, score.sd) {
  return(mean(rnorm(100, mean=score.mean, sd=score.sd)))
}
simulated.scores <- sapply(example.data$score_mean,
                           FUN=score.simulate,
                           score.sd=example.data$score_sd)
# [1] 0.4936432 0.3753853 0.6267956
This will generate one round (or column) of simulated values. However, I'd like to generate a lot of columns (like 100 or 1,000). The only way I've found to do this is to wrap my sapply() function inside a generic function inside lapply() and then convert the resulting list into a data frame with ldply() in plyr:
results.list <- lapply(1:5, FUN=function(x)
  sapply(example.data$score_mean, FUN=score.simulate, score.sd=example.data$score_sd))
library(plyr)
simulated.scores <- as.data.frame(t(ldply(results.list)))
#           V1        V2        V3        V4        V5
# V1 0.5047807 0.4902808 0.4857900 0.5008957 0.4993375
# V2 0.3996402 0.4128029 0.3875678 0.4044486 0.3982045
# V3 0.6017469 0.6055446 0.6058766 0.5894703 0.5960403
This works, but (1) it seems really convoluted, especially with the as.data.frame(t(ldply(lapply(... FUN=function(x) sapply ...)))) approach, (2) it is really slow when using large numbers of iterations or bigger data—my actual dataset has 3,000 rows, and running 1,000 iterations takes 1–2 minutes.
Is there a more efficient way to create a data frame of simulated values like this?
The quickest way I can think of is to take advantage of the vectorisation built-in to rnorm. Both the mean and sd arguments are vectorised, however you can only supply a single integer for the number of draws. If you supply a vector to the mean and sd arguments, R will cycle through them until it has completed the required number of draws. Therefore, just make the argument n to rnorm a multiple of the length of your mean vector. The multiplier will be the number of replicates for each row of your data.frame. In the function below this is n.
I can't think of a faster way than using base::rnorm on its own.
Worked example
# example data
df <- data.frame(country=c("a", "b", "c"),
                 mean=c(1, 10, 100),
                 sd=c(1, 2, 10))
# function which returns a matrix, and takes column vectors as arguments for mean and sd
normv <- function(n, mean, sd) {
  out <- rnorm(n * length(mean), mean = mean, sd = sd)
  return(matrix(out, ncol = n, byrow = FALSE))
}
# reproducible result (note order of magnitude of rows and input sample data)
set.seed(1)
normv(5, df$mean, df$sd)
#            [,1]      [,2]       [,3]        [,4]        [,5]
# [1,]  0.3735462  2.595281   1.487429   0.6946116   0.3787594
# [2,] 10.3672866 10.659016  11.476649  13.0235623   5.5706002
# [3,] 91.6437139 91.795316 105.757814 103.8984324 111.2493092
This can be done very quickly if you remember that rnorm(1, mean, sd) is the same as rnorm(1)*sd + mean, so using your data frame df, you can generate sim simulations of your obs observations like this:
obs = nrow(df)
sim = 1000
mat = data.frame(matrix(rnorm(obs*sim), obs, sim) * df$sd + df$mean)
You can check that this has the desired means by using rowMeans(mat) and check the standard deviation for, say, row 1 as sd(mat[1,]).
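For example, a quick check might look like this (a sketch, assuming df, obs, sim, and mat as defined just above):
rowMeans(mat)        # should be close to df$mean
apply(mat, 1, sd)    # should be close to df$sd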
