I have a large number of matrices for which I need to perform a QR factorization and store the resulting Q matrices (normalized such that the R matrix has positive numbers on its diagonal). Is there a method other than using the qr() function?
Here is the working example:
system.time({
# Parameters for the matrix to be generated
row_number <- 1e6/4
col_number <- 4
# Generate large matrix of random numbers normally distributed.
# Basically it's a matrix of stacked 4x4 matrices, for each of which I will perform a QR factorization:
RND <- matrix(data = rnorm(n=1e6 , mean = 0,sd = 1),nrow = row_number, ncol = col_number)
# Allocate a 0 matrix where the result will be entered
QSTACK <- matrix(0, nrow = row_number, ncol = col_number)
number_of_blocks <- row_number/4 # The number of 4x4 matrices in RND => 62,500
for (k in c(1:number_of_blocks)) {
  l1 <- 1 + col_number * (k-1)
  l2 <- col_number * k
  QR <- qr(RND[l1:l2,]) # Perform QR factorization
  R <- qr.R(QR)
  QSTACK[l1:l2,] <- qr.Q(QR) %*% diag(sign(diag(R))) # Normalize such that R diagonal elements are positive
}
})
# user system elapsed
# 3.04 0.03 3.07
So that took 3.07 seconds to compute 62,500 QR factorizations. I'm wondering if there is something faster?
If you want:
the R factor to have positive diagonal elements
to explicitly form the Q factor (rather than its sequential Householder vectors format)
you can cheat by using Cholesky factorization. If X = QR with orthonormal Q and an R whose diagonal is positive, then crossprod(X) = X'X = R'R, so chol(crossprod(X)) recovers exactly that R (its diagonal is positive by construction), and Q = X R^(-1) is obtained with a single triangular solve:
cheatQR <- function (X) {
  XtX <- crossprod(X)
  R <- chol(XtX)
  Q <- t(forwardsolve(R, t(X), upper.tri = TRUE, transpose = TRUE))
  list(Q = Q, R = R)
}
The raw QR:
rawQR <- function (X) {
  QR <- qr(X)
  Q <- qr.Q(QR)
  R <- qr.R(QR)
  sgn <- sign(diag(R))
  R <- sgn * R
  Q <- Q * rep(sgn, each = nrow(Q))
  list(Q = Q, R = R)
}
Benchmark:
X <- matrix(rnorm(10000 * 4), nrow = 10000, ncol = 4)
ans1 <- rawQR(X)
ans2 <- cheatQR(X)
all.equal(ans1, ans2)
#[1] TRUE
library(microbenchmark)
microbenchmark(rawQR(X), cheatQR(X))
#Unit: microseconds
# expr min lq mean median uq max neval
# rawQR(X) 3083.537 3109.222 3796.191 3123.2230 4782.583 13895.81 100
# cheatQR(X) 828.025 837.491 1421.211 865.9085 1434.657 32577.01 100
For further speedup, it is often advised to link R to an optimized BLAS library such as OpenBLAS. But more relevant to your context, where you are computing a large number of QR factorizations of small matrices, it is more worthwhile to parallelize your for loop.
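As an illustration, here is a minimal sketch of how the block-wise loop from the question could be parallelized with the base parallel package (RND, number_of_blocks and col_number are the objects defined in the question; mc.cores = 4 is an arbitrary choice, and mclapply relies on forking, so on Windows you would use parLapply with a cluster instead):
library(parallel)
# Factorize the k-th 4x4 block and return its sign-normalized Q factor
qr_block <- function(k, M, col_number = 4) {
  l1 <- 1 + col_number * (k - 1)
  l2 <- col_number * k
  QR <- qr(M[l1:l2, ])
  R <- qr.R(QR)
  qr.Q(QR) %*% diag(sign(diag(R)))  # normalize so that diag(R) > 0
}
Q_list <- mclapply(seq_len(number_of_blocks), qr_block, M = RND, mc.cores = 4)
QSTACK <- do.call(rbind, Q_list)  # stack the 4x4 Q blocks back into one matrix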
A similar question is asked in How to write a double for loop in r with choosing maximal element in one loop?.
The same setup:
I want to first sample theta[j] for j = 1, 2, ..., 71, and then draw replicated values (say 1000 of them) yrep[k] from Bin(n[j], theta[j]), where n[j] is known.
For theta[1] we then have yrep[1,1], yrep[1,2], ..., yrep[1,1000]. Doing this for every theta[j] gives a 71 x 1000 matrix of replicated data yrep[j,i], j = 1, ..., 71, i = 1, ..., 1000. Computing the mean, max or min of each column (i.e., across the 71 groups for a fixed replication) then gives 1000 means, maxima or minima.
How to write this for loop?
I first tried to write a loop to sample theta[j] and yrep. I do not know how to add code to compute the maximum, mean, and minimum inside this loop, and I am not sure if this code is right:
theta<-NULL
yrep<-NULL
test<-NULL
k=1
for(i in 1:1000){
  for(j in 1:71){
    theta[j] <- rbeta(1,samp_A+y[j], samp_B+n[j]-y[j])
    yrep[k] <- rbinom(1, n[j], theta[j])
    k = k+1
  }
  t <- c(test, max(yrep))
}
The data are given in How to write a double for loop in r with choosing maximal element in one loop?:
#Data
y <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,
2,1,5,2,5,3,2,7,7,3,3,2,9,10,4,4,4,4,4,4,4,10,4,4,4,5,11,12,
5,5,6,5,6,6,6,6,16,15,15,9,4)
n <-
c(20,20,20,20,20,20,20,19,19,19,19,18,18,17,20,20,20,20,19,19,18,18,25,24,
23,20,20,20,20,20,20,10,49,19,46,27,17,49,47,20,20,13,48,50,20,20,20,20,
20,20,20,48,19,19,19,22,46,49,20,20,23,19,22,20,20,20,52,46,47,24,14)
#Evaluate densities in grid
x <- seq(0.0001, 0.9999, length.out = 1000)
#Compute the marginal posterior of alpha and beta in hierarchical model Use grid
A <- seq(0.5, 15, length.out = 100)
B <- seq(0.3, 45, length.out = 100)
#Make vectors that contain all pairwise combinations of A and B
cA <- rep(A, each = length(B))
cB <- rep(B, length(A))
#Use logarithms for numerical accuracy!
lpfun <- function(a, b, y, n) log(a+b)*(-5/2) +
sum(lgamma(a+b)-lgamma(a)-lgamma(b)+lgamma(a+y)+lgamma(b+n-y)-
lgamma(a+b+n))
lp <- mapply(lpfun, cA, cB, MoreArgs = list(y, n))
#Subtract maximum value to avoid over/underflow in exponentiation
df_marg <- data.frame(x = cA, y = cB, p = exp(lp - max(lp)))
#Sample from the grid (with replacement)
nsamp <- 100
samp_indices <- sample(length(df_marg$p), size = nsamp,
replace = T, prob = df_marg$p/sum(df_marg$p))
samp_A <- cA[samp_indices[1:nsamp]]
samp_B <- cB[samp_indices[1:nsamp]]
library(dplyr)   # provides %>%
library(tidyr)   # provides gather()
df_psamp <- mapply(function(a, b, x) dbeta(x, a, b),
                   samp_A, samp_B, MoreArgs = list(x = x)) %>%
  as.data.frame() %>% cbind(x) %>% gather(ind, p, -x)
This is not very well tested.
There is no need for loops to sample from the distributions included in base R; those functions are vectorized over their arguments. Code along the lines below should be able to do what the question asks for.
Ni <- 1000   # number of replications
Nj <- 71     # number of groups (y and n have 71 elements)
# samp_A[1]/samp_B[1]: the question's rbeta(1, samp_A+y[j], ...) effectively uses
# only the first posterior draw, so the same is done here
theta <- rbeta(Ni*Nj, rep(samp_A[1] + y, each = Ni), rep(samp_B[1] + n - y, each = Ni))
yrep <- rbinom(Ni*Nj, rep(n, each = Ni), theta)  # align n[j] with the theta draws for group j
test1 <- matrix(yrep, nrow = Ni)                 # Ni x Nj: one row per replication, one column per group
mins1 <- matrixStats::rowMins(test1)             # 1000 minima, one per replication
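Since the question also asks for the mean and the max of each replication, a possible extension under the same setup (a sketch, not from the original answer; rowMaxs also comes from the matrixStats package):
maxs1 <- matrixStats::rowMaxs(test1)  # 1000 maxima, one per replication
means1 <- rowMeans(test1)             # 1000 means, using base R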
I have a dataset with the following structure:
require(data.table)
train <- data.table(sample(1:10, 10), runif(10, 0, 10))
However, the dataset is ~7.5 GB in memory and has ~630 million rows. Attempting summary(train) yields an error: Error: cannot allocate vector of size 2.3 Gb. I can extract some information by manually calling train[, mean(V2)], train[, min(V2)] and train[, max(V2)], but median and quantiles will result in OOM. Is there a possibility to perform these operations on a 16 GB RAM machine?
An idea would be to split the dataset, but that would be a bit cumbersome with respect to the median and quantiles.
So I came up with the function summaryI, to which we supply the name of the column of interest:
summaryI <- function(i2) {
  setorderv(train, i2)
  N <- train[, .N]
  # count NAs:
  # nas <- is.na(train[[i2]])
  # nNA <- sum(nas)
  # OR
  i <- 1L
  nNA <- 0L
  while (is.na(train[[i2]][i])) {
    nNA <- nNA + 1L
    i <- i + 1L
  }
  nNA
  # will be slow if many NAs, but more memory efficient
  # (will not create additional vector)
  n <- N - nNA
  probs <- seq(0, 1, 0.25)
  # quantiles, only type = 7
  index <- 1 + (n - 1) * probs
  lo <- floor(index)
  hi <- ceiling(index)
  qs <- train[[i2]][lo + nNA]
  i <- which(index > lo)
  h <- (index - lo)[i]
  qs[i] <- (1 - h) * qs[i] + h * train[[i2]][hi[i] + nNA]
  qs # quantile results
  rmean <- sum(train[[i2]], na.rm = T) / n
  qq <- c(qs[1L:3L], rmean, qs[4L:5L])
  digits <- max(3L, getOption("digits") - 3L)
  qq <- signif(qq, digits)
  names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
  if (nNA > 0L) { # to match summary output
    c(qq, `NA's` = nNA)
  } else {
    qq
  }
}
The basic idea is to order the column of interest in place (with setorderv from data.table) and then do all the calculations without copying the data.
As mentioned in the comments, if your data contain a lot of NAs then this will be slow.
But hopefully you will be able to run it on all of the data. Also, the NA handling is hard-coded inside the function.
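For reference (this step is implicit in the code rather than spelled out in the original answer): the function reproduces the type-7 quantile rule used by quantile(). For a probability p and n non-missing sorted values, the target position is index = 1 + (n - 1) * p, and the quantile is (1 - h) * x[floor(index)] + h * x[ceiling(index)] with h = index - floor(index), which is exactly what the index/lo/hi/h lines compute on the ordered column (offset by nNA because the NAs sort to the front).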
Example:
summaryI('V2')
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 2.398e-08 2.501e-01 5.000e-01 5.000e-01 7.500e-01 1.000e+00 1.000e+02
or run over multiple columns, like:
sapply(colnames(train), summaryI)
The source code of summary and quantile, which I used as a basis, can be found here:
github quantile.R
github summary.R
I would like to iterate through vectors of values and calculate something for every value from within a function environment in R. For example:
# I have costs for 3 companies
c <- c(10, 20, 30)
# I have the same revenue across all 3
r <- 100
# I want to obtain the profits for all 3 within one variable
result <- list()
# I could do this in a for loop
for(i in 1:3){
  result[i] <- r - c[i]
}
Now let's assume I have a model that is very long, and that I define everything as a function which is then solved with various random draws for the costs.
# Random draws
n <- 1000
r <- rnorm(n, mean = 100, sd = 10)
c1 <- rnorm(n, mean = 10, sd = 1)
c2 <- rnorm(n, mean = 20, sd = 2)
c3 <- rnorm(n, mean = 30, sd = 3)
X <- data.frame(r, c1, c2, c3)
fun <- function(x){
  r <- x[1]
  c <- c(x[2], x[3], x[4])
  for(i in 1:3){
    result[i] <- r - c[i]
  }
  return(result)
}
I could then evaluate the result for all draws by iterating through the rows of randomly sampled input data.
for(j in 1:n){
  x <- X[j,]
  y <- fun(x)
}
In this example, the output variable y would contain the nested result variable, which comprises the results for all 3 companies. However, my line of thinking results in an error, and I think it has to do with the fact that I try to return a nested variable? Hence my question: how would you approach something like this?
I would suggest rethinking your coding approach. This is a very un-R-like way of doing things.
For example, the first for loop can be written much more succinctly as
x <- c(10, 20, 30)
r <- 100
result <- lapply(-x, `+`, r)
Then fun becomes something like
fun <- function(x) lapply(-x[-1], `+`, x[1])
To then operate over the rows of a data.frame (which is what you seem to do in the last step), you can use something like
apply(X, 1, fun)
where the MARGIN = 1 argument in apply ensures that you are applying a function per row (as opposed to per column).
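This is not part of the original answer, but in the same vectorized spirit: since the draws are already columns of X, the profits for all rows and all three companies can be computed in one step, with no per-row function application at all (column names r, c1, c2, c3 as defined in the question):
costs <- as.matrix(X[, c("c1", "c2", "c3")])
profits <- X$r - costs  # r is recycled down each column: profits[i, j] = r[i] - costs[i, j]
head(profits)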
Here's an approach using your function and a for loop:
# Random draws
n <- 1000
r <- rnorm(n, mean = 100, sd = 10)
c1 <- rnorm(n, mean = 10, sd = 1)
c2 <- rnorm(n, mean = 20, sd = 2)
c3 <- rnorm(n, mean = 30, sd = 3)
X <- data.frame(r, c1, c2, c3)
result <- list()
fun <- function(x){
  r <- x[[1]]
  c <- c(x[[2]], x[[3]], x[[4]])
  for(i in 1:3){
    result[i] <- r - c[i]
  }
  return(result)
}
# Create a list to store results
profits <- rep(list(1:3), nrow(X))
# Loop through each row of dataframe and store in profits.
for(i in 1:nrow(X)){
  profits_temp <- fun(list(X[i,"r"], X[i,"c1"], X[i,"c2"], X[i,"c3"]))
  for(j in 1:3)
    profits[[i]][[j]] <- profits_temp[[j]]
}
# Eye results
profits[[1]]
#> [1] 93.23594 81.25731 70.27699
profits[[2]]
#> [1] 80.50516 69.27517 63.36439
In the following code, "Weight" is a large matrix of weight sets. This matrix consists of, let's say, 1000 rows and 4 columns. Each row is a set of weights (the sum of the elements in each row is equal to one).
In addition, there are four objects, and I want to select one of them based on each weight set. In other words, this random selection should be repeated for all of the weight sets.
Right now I have solved the problem with a for loop. But is there a more efficient way to code this in R?
y <- c("a", "b", "c", "d")
for(i in 1:nrow(Weight)){
  selection[i] <- sample(y, 1, prob=Weight[i,]) # selection is a vector with the same number of rows as Weight
}
A more efficient way would be to first compute the row-wise cumulative sums of your weights then draw a number between 0 and 1 and see where it lands within that cumulative sum. This way, you only need to do one call to runif to get your random data, versus 1000 calls using other methods.
Weight <- matrix(sample(1:100, 1000 * 4, TRUE), 1000, 4)
x <- runif(nrow(Weight))
cumul.w <- Weight %*% upper.tri(diag(ncol(Weight)), diag = TRUE) / rowSums(Weight)
i <- rowSums(x > cumul.w) + 1L
selection <- y[i]
Also note how I computed the cumulative sums by multiplying by a triangular matrix instead of using the slower apply(Weight, 1, cumsum). Everything is vectorized so it should be way faster than using an apply or for loop.
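A quick sanity check of that trick (the toy matrix W below is made up for illustration and is not part of the original answer):
W <- matrix(as.numeric(1:8), nrow = 2)  # toy 2 x 4 matrix
all.equal(W %*% upper.tri(diag(4), diag = TRUE),
          t(apply(W, 1, cumsum)))
#[1] TRUE  -- each row of the product is that row's cumulative sum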
Benchmark comparison with apply and for:
f_runif <- function(Weight, y) {
  x <- runif(nrow(Weight))
  cumul.w <- Weight %*% upper.tri(diag(ncol(Weight)), diag = TRUE) /
    rowSums(Weight)
  i <- rowSums(x > cumul.w) + 1L
  y[i]
}
f_for <- function(Weight, y) {
  selection <- rep(NA, nrow(Weight))
  for(i in 1:nrow(Weight)){
    selection[i] <- sample(y, 1, prob=Weight[i,])
  }
  selection # return the result so the function is comparable to the others
}
f_apply <- function(Weight, y) {
  apply(Weight, 1, function(w) sample(y, 1, prob = w))
}
y <- c("a", "b", "c", "d")
Weight <- matrix(sample(1:100, 1000 * 4, TRUE), 1000, 4)
library(microbenchmark)
microbenchmark(f_runif(Weight, y),
f_for (Weight, y),
f_apply(Weight, y))
# Unit: microseconds
# expr min lq median uq max neval
# f_runif(Weight, y) 223.635 231.111 274.531 281.2165 1443.208 100
# f_for(Weight, y) 10220.674 11238.660 11574.039 11917.1610 14583.028 100
# f_apply(Weight, y) 9006.974 10016.747 10509.150 10879.9245 27060.189 100
Wrap your sample into a function that lets you pass only one argument, a row from Weight:
myfun <- function(w) {
  sample(y, 1, prob=w)
}
Then you can use one of the apply family:
apply(Weight, 1, myfun)
However, so long as you have pre-allocated selection, your method is not terribly inefficient.
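For completeness, a minimal sketch of that pre-allocated loop (names taken from the question; this is not part of either answer):
selection <- character(nrow(Weight))  # allocate the result vector once, up front
for (i in seq_len(nrow(Weight))) {
  selection[i] <- sample(y, 1, prob = Weight[i, ])
}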