Is there such "colsd" in R? - r

I am using colSums in my code, but I also need the standard deviation alongside the sum.
I searched the internet and found this page, which documents only:
colSums
colMeans
http://stat.ethz.ch/R-manual/R-devel/library/base/html/colSums.html
I tried this:
colSd
but I got this error:
Error: could not find function "colSd"
How can I do the same thing but for the standard deviation:
colSd
Here is the code:
results <- colSums(x, na.rm=TRUE) #### here I want colSd

I want to provide a fourth approach (very similar to @Thomas's) and some benchmarking:
library("microbenchmark")
library("matrixStats")
colSdApply <- function(x, ...) apply(X=x, MARGIN=2, FUN=sd, ...)
colSdMatrixStats <- colSds
colSdColMeans <- function(x, na.rm=TRUE) {
  if (na.rm) {
    n <- colSums(!is.na(x)) # thanks @flodel
  } else {
    n <- nrow(x)
  }
  colVar <- colMeans(x*x, na.rm=na.rm) - (colMeans(x, na.rm=na.rm))^2
  return(sqrt(colVar * n/(n-1)))
}
colSdThomas <- function(x) sqrt(rowMeans((t(x)-colMeans(x))^2)*((dim(x)[1])/(dim(x)[1]-1)))
m <- matrix(runif(1e7), nrow=1e3)
microbenchmark(colSdApply(m), colSdMatrixStats(m), colSdColMeans(m), colSdThomas(m))
# Unit: milliseconds
#                expr      min       lq   median       uq      max neval
#       colSdApply(m) 435.7346 448.8673 456.6176 476.8373 512.9783   100
# colSdMatrixStats(m) 344.6416 357.5439 383.8736 389.0258 465.5715   100
#    colSdColMeans(m) 124.2028 128.9016 132.9446 137.6254 172.6407   100
#      colSdThomas(m) 231.5567 240.3824 245.4072 274.6611 307.3806   100
all.equal(colSdApply(m), colSdMatrixStats(m))
# [1] TRUE
all.equal(colSdApply(m), colSdColMeans(m))
# [1] TRUE
all.equal(colSdApply(m), colSdThomas(m))
# [1] TRUE

colSds and rowSds are two of the many similar functions in the matrixStats package.
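A minimal usage sketch (the matrix m is just example data):
library(matrixStats)
m <- matrix(rnorm(30), nrow = 10)
colSds(m)               # column standard deviations
rowSds(m)               # row standard deviations
colSds(m, na.rm = TRUE) # NA-aware variant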

Use the following:
colSd <- function (x, na.rm=FALSE) apply(X=x, MARGIN=2, FUN=sd, na.rm=na.rm)

This is a short and simple way to calculate the standard deviation of the columns:
sqrt(diag(cov(data_matrix)))
Since the diagonal of a covariance matrix consists of the variances of each variable, we do the following:
Calculate the covariance matrix using cov
Extract the diagonal of the matrix using diag
Take the square root of the diagonal values using sqrt in order to get the standard deviations
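A quick check against the apply() route (a small sketch; data_matrix here is just simulated example data):
set.seed(1)
data_matrix <- matrix(rnorm(50), nrow = 10, ncol = 5)
all.equal(sqrt(diag(cov(data_matrix))), apply(data_matrix, 2, sd))
# should be TRUE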
I hope that helps :)

I don't know if these are particularly fast, but why not just use the formulae for SD:
x <- data.frame(y = rnorm(1000,0,1), z = rnorm(1000,2,3))
# If you have a population:
colsdpop <- function(x, ...)
  sqrt(rowMeans((t(x)-colMeans(x, ...))^2, ...))
colsdpop(x)
sd(x$y); sd(x$z) # won't match `sd`
# If you have a sample:
colsdsamp <- function(x)
  sqrt( (rowMeans((t(x)-colMeans(x))^2)*((dim(x)[1])/(dim(x)[1]-1))) )
colsdsamp(x)
sd(x$y); sd(x$z) # will match `sd`
Note: the sample solution won't handle NAs well. One could incorporate something like apply(x,2,function(z) sum(!is.na(z))) into the right-most part of the formula to get an appropriate denominator, but it would get really murky quite quickly.
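For what it's worth, a hedged sketch of such an NA-aware sample version (the helper name colsdsamp_na is mine, and this only covers the basic case):
colsdsamp_na <- function(x) {
  n  <- apply(x, 2, function(z) sum(!is.na(z)))  # per-column sample sizes
  mu <- colMeans(x, na.rm = TRUE)
  sqrt(rowMeans((t(x) - mu)^2, na.rm = TRUE) * n / (n - 1))
}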

I believe I have found a more elegant solution in diag(sqrt(var(data)))
This worked for me to get the standard deviation of each of my columns. However, it does compute a bunch of extra unnecessary covariances (and their square roots) along the way, so it isn't necessarily the most efficient approach. But if your data is small, it works excellently.
EDIT: I just realized that sqrt(diag(var(data))) is probably a bit more efficient, since it drops the unnecessary covariance terms earlier.
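A quick sanity check that the two orderings agree (sqrt() is applied element-wise, so swapping it with diag changes efficiency, not the diagonal values); data here is just simulated example data:
set.seed(42)
data <- matrix(rnorm(40), nrow = 10, ncol = 4)
# Note: sqrt(var(data)) may warn about NaNs for negative off-diagonal covariances,
# but the diagonal (the variances) is unaffected.
all.equal(diag(sqrt(var(data))), sqrt(diag(var(data))))
# should be TRUE: both give the column standard deviations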

I usually do column sd's with apply:
x <- data.frame(y = rnorm(20,0,1), z = rnorm(20,2,3))
> apply(x, 2, sd)
        y         z
0.8022729 3.4700314
Verify:
> sd(x$y)
[1] 0.8022729
> sd(x$z)
[1] 3.470031
You can also do it with dplyr easily:
library(dplyr)
library(magrittr) # for pipes
> x %>% summarize_all(., sd)
          y        z
1 0.8022729 3.470031
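If you're on a recent dplyr (1.0 or later), summarize_all() is superseded; a sketch of the equivalent call with across():
x %>% summarize(across(everything(), sd))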

You can just use the apply function:
all.sd <- apply(data, 2,sd)

Related

Looping a function through a list of dataframes is very slow

I have a list that contains 4438 dataframes of different sizes. I am not sure how to make a reproducible example, but I obtained the list by using the expand.grid function to get a dataframe with all possible combinations of elements:
citation <- citation %>%
  map_depth(., 1, expand.grid)
[screenshots: the list before and after applying expand.grid]
What I want to achieve is, for each dataframe, to count the number of unique values per row and then find the minimum number of unique values in that dataframe.
First, I wrote the function below:
fun1 <- function(res){
  min(apply(res, 1, function(x) length(unique(x))))
}
Then, apply the function to each dataframe:
library(furrr)
plan(multisession, workers = 4)
min_set <- c()
min_set <- citation %>% future_map_dbl(fun1)
However, the calculation is super slow, taking almost 30 minutes to complete. I would like to find a way to speed it up. Looking forward to hearing your solutions. Thank you in advance.
To speed up the current approach of enumerating the combinations, use rowTabulate from the Rfast package (or rowTabulates from the matrixStats package).
However, it will be much faster to get the desired results with the setcover function in the adagio package, which solves the set cover problem directly (i.e., without the use of expand.grid) via integer linear programming with lp from the lpSolve package.
library(Rfast) # for the rowTabulate function
library(adagio) # for the setcover function
# reproducible example data
set.seed(1141593349)
citation1 <- list(
  lapply(c(5,2,8,12,6,38), function(size) sample(50, size)),
  lapply(c(5,2,8,12,7), function(size) sample(50, size))
)
# get all combinations of the indices of the unique values for each list in citation1
citation2 <- lapply(citation1, function(x) expand.grid(lapply(x, match, table = unique(unlist(x)))))
# original solution
fun1 <- function(res) min(apply(res, 1, function(x) length(unique(x))))
# faster version of the original solution
fun2 <- function(res) min(rowsums(rowTabulate(as.matrix(res)) > 0L))
# linear programming solution (uses citation1 rather than citation2)
fun3 <- function(res) {
  v <- unlist(res)
  m <- matrix(0L, max(v), length(res))
  m[cbind(v, rep.int(seq_along(res), lengths(res)))] <- 1L
  setcover(m)$objective
}
microbenchmark::microbenchmark(fun1 = sapply(citation2, fun1),
                               fun2 = as.integer(sapply(citation2, fun2)),
                               fun3 = as.integer(sapply(citation1, fun3)),
                               times = 10,
                               check = "identical")
#> Unit: milliseconds
#>  expr       min          lq        mean      median          uq         max
#>  fun1 1110.4976 1162.003601 1217.049501 1204.608151 1281.121601 1331.057001
#>  fun2  101.5173  113.123501  142.265371  145.964502  165.788700  187.196301
#>  fun3    1.4038    1.461101    1.734781    1.850701    1.870801    1.888702

Function that computes the average of 20 rolls several thousand times and estimates the expectation and the variance of Y

I need to write a function according to the info in the title. I'm trying to do that with the following code:
my.function <- function(x=1:6, Nsample=20, prob1=NULL) {
  rolling.die <- sample(x, size=Nsample, replace=TRUE, prob=prob1)
  for (die in 1:10000) {
    die.sum <- sum(rolling.die)
    average <- die.sum/Nsample
  }
  return(var(average))
}
my.function()
But I always get NA as a result. Could you please help me understand what I am doing wrong?
You need replicate() -
set.seed(2)
test <- replicate(1000, mean(sample(1:6, 20, replace = T)))
# for expectation
mean(test)
[1] 3.50025
# for variance
var(test)
[1] 0.147535
average is a single number, and it does not make sense to calculate the variance of a single number. What is the variance of 5? Variance applies to a collection of numbers, so your average must be a vector.
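A minimal sketch of that fix, accumulating the per-iteration averages in a vector before calling var() (the variable names are mine):
averages <- numeric(10000)
for (die in 1:10000) {
  rolling.die <- sample(1:6, size = 20, replace = TRUE)
  averages[die] <- mean(rolling.die)
}
mean(averages) # estimated expectation of Y
var(averages)  # estimated variance of Y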
A more efficient approach is to generate all your data ahead of time. As long as you have the memory, this would be a very fast approach:
# sim parameters
n_rolls <- 20L # the L suffix makes these integers
n_sim <- 10000L
n_sides <- 6L
#generate data
set.seed(2)
sims <- sample(n_sides, n_rolls * n_sim, replace = T)
#make into matrix of n_sim x n_rolls
mat <- matrix(sims, ncol = n_rolls)
#mean of each simulation
rowMeans(mat)
#var of everything
var(rowMeans(mat))
This is around 14x faster than using replicate as this approach calls sample() once.
Unit: milliseconds
          expr      min       lq      mean    median       uq      max neval
  shree_replic 137.7283 138.9809 145.78485 142.34755 147.2499 172.4633    10
 cole_samp_mat  11.3998  11.4477  11.57025  11.52105  11.7628  11.8218    10
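The benchmark code itself isn't shown above; a sketch of how the two approaches were presumably wrapped (the wrapper names follow the expr column, but the exact wrapping is my assumption):
library(microbenchmark)
shree_replic  <- function() replicate(n_sim, mean(sample(n_sides, n_rolls, replace = TRUE)))
cole_samp_mat <- function() {
  sims <- sample(n_sides, n_rolls * n_sim, replace = TRUE)
  rowMeans(matrix(sims, ncol = n_rolls))
}
microbenchmark(shree_replic(), cole_samp_mat(), times = 10)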
As far as your current function goes, it doesn't make sense: the loop doesn't do anything. It just does the same calculation 10,000 times and, as @user31264 points out, tries to calculate the var of a scalar after the loop. I think you mean to do something like:
my.function2 <- function(x=1:6, Nsample=20, prob1=NULL) {
  rolling.die <- sample(x, size=Nsample, replace=TRUE, prob=prob1)
  return(mean(rolling.die))
}
means <- vector(mode = 'double', length = n_sim)
for (i in 1:n_sim){
  means[i] <- my.function2()
}
# which is equivalent to
means <- sapply(1:n_sim, function(i) my.function2())
# which is also equivalent to
means <- replicate(n_sim, my.function2())
var(means)
And @shree has a much more succinct version of your function.

Apply function on two matrices

For matrices Beta and x1, I am applying re.fn to find the maximum of (colSums(Beta*b))^2, where b is a column of the matrix x1. I wonder how I can make this algorithm faster.
set.seed(1)
D=10000
M=1000; N=1000
Beta=matrix(rnorm(N*D),ncol=D)
x1=matrix(rnorm(N*M),N)
re.fn <- function(b) {
  sum1 <- colSums(Beta*b)
  T_nc1 <- sum1^2
  T_nc <- max(T_nc1)
  return(T_nc)
}
T_nc=apply(x1,2,re.fn)
Using crossprod should be substantially faster:
T_nc2 <- apply(crossprod(Beta,x1)^2,2,max)
all.equal(T_nc,T_nc2)
# [1] TRUE
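Why this works: crossprod(Beta, x1) computes t(Beta) %*% x1, whose column j equals colSums(Beta * x1[, j]), so the whole inner computation is done in one optimized matrix multiplication instead of M separate element-wise products. A rough way to time the two on the sizes above (just a sketch; the apply() version can take a while):
system.time(T_nc  <- apply(x1, 2, re.fn))
system.time(T_nc2 <- apply(crossprod(Beta, x1)^2, 2, max))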

Is it possible to use apply with different vectors for each column?

Apologies for the poor question title; I'm not too sure how to describe the problem here.
First, I have the code below.
# Data
set.seed(100)
x = matrix(runif(10000,0,1),100,100)
grpA = round(runif(100,1,5),0) # Group 1, 2, 3, 4, 5
# function
funA <- function(y, A){
  X = lm(y~A)
  return(X$residuals)
}
# Calculation
A = apply(x,1,function(y) funA(y,grpA))
Now, instead of grpA, I have grpB below, where the groups are different for every column. Besides looping over each column, can I still use apply to calculate this? If so, how?
My actual funA calculation is a lot more complex, and I need to call funA many times, so I am trying to avoid using a for loop. Thanks.
grpB = matrix(round(runif(10000,1,5),0),100,100)
First off, if your function funA does a lot of work, then using a for loop versus apply won't affect performance that much. This is because the only difference is in the overhead of looping, and most of the work is going to take place inside of funA in either case.
In fact, even if funA is simple, for and apply won't be that different performance-wise. Either way, there needs to be a loop inside of R with multiple R function calls. The real performance improvements by avoiding for loops come in situations where there is a builtin R function that performs the computation you need by looping in the underlying C code without the overhead of multiple function calls in R. Here is an illustrative example.
x<-matrix(runif(10000,0,1),100,100)
require(microbenchmark)
f1 <- function(z){
  ret <- rep(0, ncol(z))
  for(i in 1:ncol(z)){
    ret[i] <- sum(z[,i])
  }
  ret
}
f2 <- function(z){
  apply(z, 2, sum)
}
identical(f1(x),f2(x))
# [1] TRUE
identical(f1(x),colSums(x))
# [1] TRUE
microbenchmark(f1(x),f2(x),colSums(x))
# Unit: microseconds
#        expr     min       lq   median       uq      max neval
#       f1(x) 559.934 581.4775 596.4645 622.1425  773.519   100
#       f2(x) 484.265 512.1570 526.5700 546.5010 1100.540   100
#  colSums(x)  23.844  25.7915  27.0675  28.7575   59.485   100
So, in your situation, I wouldn't worry about using a for loop. There are ways to avoid a loop, for example, something like
sapply(1:ncol(x),function(i) fun(x[,i],y[,i]))
But it won't be much faster than a for loop.
Just as an answer to
can I still use the apply to calculate this? If so, how?
The answer is yes. You can combine x and grpB into an array and then use apply on the resulting array.
# Data
set.seed(100)
x = matrix(runif(10000,0,1),100,100)
grpA = round(runif(100,1,5),0) # Group 1, 2, 3, 4, 5
# function
funA <- function(y, A){
  X = lm(y~A)
  return(X$residuals)
}
# Original calculation
A <- apply(x, 1, funA, grpA)
# the array in this case
arr <- array(c(x, matrix(rep(grpA, 100), nrow=100, byrow=TRUE)), dim=c(nrow(x), ncol(x), 2))
# the new calculation
res <- apply(arr, 1, function(y) funA(y[, 1], y[, 2]))
# comparing results
all.equal(A, res)
## TRUE
#
# and for the new groupB
grpB = matrix(round(runif(10000,1,5),0),100,100)
# the array
arr <- array(c(x, grpB), dim=c(nrow(x), ncol(x), 2))
# the calculation (same as above)
res <- apply(arr, 1, function(y) funA(y[, 1], y[, 2]))
See @mrip's answer for the reasons this may not be a good idea.
You could easily use a sequence of the number of columns as an "indicator" or "extracting" variable, and use vapply instead of apply, like this:
vapply(sequence(ncol(x)),
       function(z) funA(x[, z], grpB[, z]),
       numeric(nrow(x)))

Efficiently building a large (200 MM line) dataframe

I am attempting to build a large (~200 MM line) dataframe in R. Each entry in the dataframe will consist of approximately 10 digits (e.g. 1234.12345). The code is designed to walk through a list, subtracting the item in position [i] from every item after [i], but not from the items before [i] (if I were putting the output into a matrix, it would be a triangular matrix). The code is simple and works fine on smaller lists, but I am wondering if there is a faster or more efficient way to do this. I assume the first part of the answer is going to be "don't use a nested for loop," but I am not sure what the alternatives are.
The idea is that this will be an "edge list" for a social network analysis graph. Once I have outlist I will reduce the number of edges based on some criteria (<, >, ==, etc.) so the final list (and graph) won't be quite so ponderous.
#Fake data of same approximate dimensions as real data
dlist<-sample(1:20,20, replace=FALSE)
#purge the output list before running the loop
rm(outlist)
outlist<-data.frame()
for(i in 1:(length(dlist)-1)){
  for(j in (i+1):length(dlist)){
    outlist <- rbind(outlist, c(dlist[i], dlist[j], dlist[j]-dlist[i]))
  }
}
IIUC your final dataset will be ~200 million rows by 3 columns, all of type numeric, which takes a total space of:
200e6 (rows) * 3 (cols) * 8 (bytes) / (1024 ^ 3)
# ~ 4.5GB
That's quite big data, where it's essential to avoid copies wherever possible.
Here's a method that uses the data.table package's unexported (internal) vecseq function (written in C, so it's fast and memory efficient) and makes use of its assignment-by-reference operator := to avoid copies.
fn1 <- function(x) {
  require(data.table) ## 1.9.2
  lx = length(x)
  vx = as.integer(lx * (lx-1)/2)
  # R v3.1.0 doesn't copy on doing list(.) - so should be even faster there
  ans = setDT(list(v1 = rep.int(head(x,-1L), (lx-1L):1L),
                   v2 = x[data.table:::vecseq(2:lx, (lx-1L):1, vx)]))
  ans[, v3 := v2-v1]
}
Benchmarking:
I'll benchmark with functions from other answers on your data dimensions. Note that my benchmark is on R v3.0.2, but fn1() should give better performance (both speed and memory) on R v3.1.0 because list(.) doesn't result in copy anymore.
fn2 <- function(x) {
  diffmat <- outer(x, x, "-")
  ss <- which(upper.tri(diffmat), arr.ind = TRUE)
  data.frame(v1 = x[ss[,1]], v2 = x[ss[,2]], v3 = diffmat[ss])
}
fn3 <- function(x) {
  idx <- combn(seq_along(x), 2)
  out2 <- data.frame(v1 = x[idx[1, ]], v2 = x[idx[2, ]])
  out2$v3 <- out2$v2 - out2$v1
  out2
}
set.seed(45L)
x = runif(20e3L)
system.time(ans1 <- fn1(x)) ## 18 seconds + ~8GB (peak) memory usage
system.time(ans2 <- fn2(x)) ## 158 seconds + ~19GB (peak) memory usage
system.time(ans3 <- fn3(x)) ## 809 seconds + ~12GB (peak) memory usage
Note that fn2() due to use of outer requires quite a lot of memory (peak memory usage was >=19GB) and is slower than fn1(). fn3() is just very very slow (due to combn, and unnecessary copy).
Another way to create that data is
#Sample Data
N <- 20
set.seed(15) #for reproducibility
dlist <- sample(1:N,N, replace=FALSE)
we could do
idx <- combn(1:N,2)
out2 <- data.frame(i=dlist[idx[1, ]], j=dlist[idx[2, ]])
out2$dist <- out2$j-out2$i
This uses combn to create all pairs of indices in the data set rather than doing loops. This allows us to build the data.frame all at once rather than adding a row at a time.
We compare that to
out1 <- data.frame()
for(i in 1:(length(dlist)-1)){
  for(j in (i+1):length(dlist)){
    out1 <- rbind(out1, c(dlist[i], dlist[j], dlist[j]-dlist[i]))
  }
}
we see that
all(out1==out2)
# [1] TRUE
Plus, if we compare with microbenchmark we see that
microbenchmark(loops(), combdata())
# Unit: microseconds
#        expr       min        lq     median        uq       max neval
#     loops() 30888.403 32230.107 33764.7170 34821.2850 82891.166   100
#  combdata()   684.316   800.384   873.5015   940.9215  4285.627   100
The method that doesn't use loops is much faster.
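The wrappers used in that benchmark aren't shown; presumably something like the following (the names match the expr column, but the exact wrapping is my assumption):
library(microbenchmark)
loops <- function() {
  out1 <- data.frame()
  for (i in 1:(length(dlist)-1)) {
    for (j in (i+1):length(dlist)) {
      out1 <- rbind(out1, c(dlist[i], dlist[j], dlist[j]-dlist[i]))
    }
  }
  out1
}
combdata <- function() {
  idx <- combn(1:N, 2)
  out2 <- data.frame(i = dlist[idx[1, ]], j = dlist[idx[2, ]])
  out2$dist <- out2$j - out2$i
  out2
}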
You can always start with a triangular matrix and then make your dataframe directly from that:
vec <- 1:10
diffmat <- outer(vec,vec,"-")
ss <- which(upper.tri(diffmat),arr.ind = TRUE)
data.frame(one = vec[ss[,1]],
           two = vec[ss[,2]],
           diff = diffmat[ss])
You need to preallocate outlist; this will significantly increase the speed of your code. By preallocating I mean creating an output structure that already has the desired size, but filled with, for example, NAs.
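A minimal sketch of what preallocation could look like here, filling a pre-sized matrix by index and converting to a data.frame once at the end (the variable names are mine):
n <- length(dlist)
outmat <- matrix(NA_real_, nrow = n * (n - 1) / 2, ncol = 3)
k <- 1
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    outmat[k, ] <- c(dlist[i], dlist[j], dlist[j] - dlist[i])
    k <- k + 1
  }
}
outlist <- as.data.frame(outmat)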
