I have a dataset with the following structure:
require(data.table)
train <- data.table(sample(1:10, 10), runif(10, 0, 10))
However, the dataset is ~7.5 GB in memory and has ~630 million rows. Attempting summary(train) yields an error: Error: cannot allocate vector of size 2.3 Gb. I can extract some information by manually calling train[, mean(V2)], train[, min(V2)] and train[, max(V2)], but median and quantiles run out of memory. Is there a way to perform these operations on a 16 GB RAM machine?
An idea would be to split the dataset, but that would be a bit cumbersome with respect to the median and quantiles.
So I came up with the function summaryI, to which we supply the name of the column of interest:
summaryI <- function(i2) {
  # order the column of interest in place; data.table places NAs first by default
  setorderv(train, i2)
  N <- train[, .N]

  # count NAs:
  # nas <- is.na(train[[i2]])
  # nNA <- sum(nas)
  # OR walk over the leading NAs after sorting -- slower if there are many NAs,
  # but more memory efficient (does not create an additional vector)
  i <- 1L
  nNA <- 0L
  while (is.na(train[[i2]][i])) {
    nNA <- nNA + 1L
    i <- i + 1L
  }

  n <- N - nNA
  probs <- seq(0, 1, 0.25)

  # quantiles, type = 7 only: interpolate between the two order statistics
  # around each target index (offset by nNA because the NAs sort first)
  index <- 1 + (n - 1) * probs
  lo <- floor(index)
  hi <- ceiling(index)
  qs <- train[[i2]][lo + nNA]
  i <- which(index > lo)
  h <- (index - lo)[i]
  qs[i] <- (1 - h) * qs[i] + h * train[[i2]][hi[i] + nNA]
  # qs now holds Min., 1st Qu., Median, 3rd Qu., Max.

  rmean <- sum(train[[i2]], na.rm = TRUE) / n
  qq <- c(qs[1L:3L], rmean, qs[4L:5L])
  digits <- max(3L, getOption("digits") - 3L)
  qq <- signif(qq, digits)
  names(qq) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
  if (nNA > 0L) { # to match summary() output
    c(qq, `NA's` = nNA)
  } else {
    qq
  }
}
The basic idea is that we order the column of interest in place (with setorderv from data.table) and then try to do all the calculations without copying the data.
As mentioned in the comments, if your data have a lot of NAs this will be slow.
But hopefully you will be able to run it on all of the data. Also, NA handling is hard-coded inside the function.
Example:
summaryI('V2')
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 2.398e-08 2.501e-01 5.000e-01 5.000e-01 7.500e-01 1.000e+00 1.000e+02
Or run it over multiple columns, like:
sapply(colnames(train), summaryI)
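As a quick sanity check (a sketch on the small toy table from the top of the question, not the 630 million row data), the result should match base summary() up to rounding:
set.seed(1)
train <- data.table(V1 = sample(1:10, 10), V2 = runif(10, 0, 10))
summaryI('V2')
summary(train[['V2']]) # same values, computed the memory-hungry way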
The source code of summary and quantile, which I used as a basis, can be found here:
github quantile.R
github summary.R
I have to calculate the values of a matrix in which one column depends on the others. This code is part of a simulation in which I want to see how the values of m1[, 1] change as a function of the number of columns in the matrix and of other parameters.
My problem is that this simulation takes around 10 days (or more) to complete, and I was wondering if it is possible to do these calculations in a more efficient way.
Here is the code (this is only an example of the operations for timing purposes; the results themselves have no significance):
library(microbenchmark)

number_of_columns <- 6  # in my simulation I'm using 10,000 columns

microbenchmark({
  m1 <- matrix(1.12345678912356789, nrow = 6, ncol = number_of_columns)
  m2 <- matrix(1.12345678912356789, nrow = 6, ncol = number_of_columns)
  v1 <- rnorm(6)
  v2 <- rnorm(6)
  v3 <- rnorm(6)
  for (j in 1:10000) {  # I need to loop 1e7 times
    m1[, 1] <- m2[, 2] + m2[, 1] * v1
    m2[, 1] <- m2[, 1] + m1[, 1]
    for (i in 2:(ncol(m1) - 1)) {
      m1[, i] <- (m2[, i + 1] - m2[, i]) * v2
      m2[, i] <- m2[, i] + (m1[, i] - m1[, i - 1]) * v3
    }
    m2[, 6] <- m2[, 5]
  }
})
In each simulation I want to change the values of v1, v2 and v3. Also, the number of columns of m1 and m2 will be 10,000.
The microbenchmark timing on my computer is:
     min       lq     mean   median       uq      max neval
 133.823 144.2911 154.8575 151.2269 157.7208 232.0194   100
I'm using an i5-1135G7. I also have an NVIDIA GeForce MX350 in my laptop. I tried to use the gpuR package to run my simulation on my GPU, but I did not understand how to install OpenCL.
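One observation, offered as a hedged sketch rather than a tested solution: within one pass of the outer loop, each m1[, i] only reads m2 columns that have not yet been updated in that pass, and each m2[, i] update only reads m1 columns that are already current, so the inner loop over i can be replaced by whole-matrix operations (v2 and v3 have length nrow(m1), so they recycle down the columns exactly as in the element-wise loop):
nc <- ncol(m1)
cols <- 2:(nc - 1)
for (j in 1:10000) {
  m1[, 1] <- m2[, 2] + m2[, 1] * v1
  m2[, 1] <- m2[, 1] + m1[, 1]
  m1[, cols] <- (m2[, cols + 1] - m2[, cols]) * v2               # replaces the inner loop
  m2[, cols] <- m2[, cols] + (m1[, cols] - m1[, cols - 1]) * v3
  m2[, nc] <- m2[, nc - 1]  # generalises the hard-coded m2[, 6] <- m2[, 5]
}
It is worth confirming with all.equal against the original loop on a small case; if it matches, the R-level work per outer iteration no longer scales with the number of columns (the column arithmetic moves into vectorised C code), which should matter a lot at 10,000 columns.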
I have three dataframes in R; let's call them A, B, and C.
Dataframe C contains two columns: the first contains various row names from dataframe A and the second contains row names from dataframe B:
C <- data.frame(col1 = c("a12", "a9"), col2 = c("b6","b54"))
I want to calculate the correlation coefficient and p-value for each row of table C, using the corresponding rows of tables A and B (i.e. correlating the values from row a12 of table A with the values from row b6 of table B, row a9 of table A with row b54 of table B, and so on), and put the resulting values in additional columns of table C. This is my current naive and highly inefficient code:
for (i in 1:nrow(C)) {
  correlation <- cor.test(unlist(A[C[i, 1], ]), unlist(B[C[i, 2], ]), method = "spearman")
  C[i, 3] <- correlation$estimate
  C[i, 4] <- correlation$p.value
}
The main problem is that with my current large datasets this analysis can literally take months, so I'm looking for a more efficient way to accomplish this task. I also tried the following code using the Hmisc package, but the server I'm working on can't handle the large vectors:
A <- t(A)
B <- t(B)
ind.A <- match(C[, 1], colnames(A))
A <- A[, ind.A]
ind.B <- match(C[, 2], colnames(B))
B <- B[, ind.B]
C[, 3] <- diag(rcorr(as.matrix(A), as.matrix(B), type = "spearman")$r[c(1:ncol(A)), c(1:ncol(A))])
C[, 4] <- diag(rcorr(as.matrix(A), as.matrix(B), type = "spearman")$P[c(1:ncol(A)), c(1:ncol(A))])
Based on the comment by #HYENA, I tried parallelizing the processing. This approach accelerated the process approximately 4-fold (with 8 cores). The code:
library(foreach)
library(doParallel)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

cor.res <- foreach(i = 1:nrow(C)) %dopar% {
  a <- C[i, 1]
  b <- C[i, 2]
  correlation <- cor.test(unlist(A[a, ]), unlist(B[b, ]), method = "spearman")
  c(correlation$estimate, correlation$p.value)
}
stopCluster(cl)  # release the workers when done

cor.res <- data.frame(Reduce("rbind", cor.res))
C[, c(3, 4)] <- cor.res
Extract just the part you need from cor.test, giving cor_test1, and use that instead; or, in addition, create a lookup table for the p-values, giving cor_test2, which is slightly faster than cor_test1.
Based on the median column of the benchmark below (with vectors of length 10), these run about 3x faster than cor.test. Although cor_test2 is only slightly faster than cor_test1 here, we have included it since the speed advantage could depend on the size of your input, which we don't have; you can try it out yourself with whatever sizes you have.
# given a correlation and degrees of freedom, return the p value
r2pval <- function(r, dof) {
  tval <- sqrt(dof) * r / sqrt(1 - r^2)
  min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
}

# faster version of cor.test
cor_test1 <- function(x, y) {
  r <- cor(x, y)
  dof <- length(x) - 2
  tval <- sqrt(dof) * r / sqrt(1 - r^2)
  pval <- min(pt(tval, dof), pt(tval, dof, lower.tail = FALSE))
  c(r, pval)
}

# even faster version of cor.test:
# given x, y and the pvals lookup table, return a 2-vector of r and p value
cor_test2 <- function(x, y, pvals) {
  r <- cor(x, y)
  c(r, pvals[100 * round(r, 2) + 101])
}
# test
set.seed(123)
n <- 10
x <- rnorm(n); y <- rnorm(n)
dof <- n - 2
# pvals is the 201 p values for r = -1, -0.99, -0.98, ..., 1
pvals <- sapply(seq(-1, 1, 0.01), r2pval, dof = dof)
library(microbenchmark)
microbenchmark(cor.test(x, y), cor_test1(x, y), cor_test2(x, y, pvals))
giving:
Unit: microseconds
expr min lq mean median uq max neval cld
cor.test(x, y) 253.7 256.7 346.278 266.05 501.45 650.6 100 a
cor_test1(x, y) 84.8 87.2 346.777 89.10 107.40 22974.4 100 a
cor_test2(x, y, pvals) 72.4 75.0 272.030 79.45 91.25 17935.8 100 a
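A hedged sketch of how cor_test1 might be dropped into the loop from the question (illustrative, not tested on the real data): cor_test1 computes a Pearson correlation with a t-based p-value, so for a Spearman-style analogue you can pass ranks; note the resulting p-values use a t approximation and will not match cor.test's small-sample Spearman p-values exactly. The speed-up comes from cor_test1 itself, not from vapply, which is just a tidier loop:
res <- t(vapply(seq_len(nrow(C)), function(i) {
  x <- rank(unlist(A[C[i, 1], ]))  # rank-transform for a Spearman-style correlation
  y <- rank(unlist(B[C[i, 2], ]))
  cor_test1(x, y)
}, numeric(2)))
C[, 3] <- res[, 1]  # correlation estimates
C[, 4] <- res[, 2]  # approximate p values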
I have a dataframe which looks a bit like the one produced by the following code (but much larger):
set.seed(10)
mat <- matrix(rbinom(200, size=1, prob = .5), ncol = 10)
The columns are issues, and a 1 indicates that an observation is interested in that specific issue. I want to generate a network comparing all observations and obtain, for each dyad, a count of the issues they are jointly interested in.
I have produced the following code, which seems to be working fine:
mat2 <- matrix(NA, 20, 20)
for (i in 1:nrow(mat)) {
  for (j in 1:nrow(mat)) {
    mat2[i, j] <- sum(as.numeric(mat[i, ] == 1) + as.numeric(mat[j, ] == 1) == 2)
  }
}
So I compare every observation with every other observation, and only where both have a 1 (i.e., both are interested) does the sum equal 2 and get counted as joint interest in a topic.
My problem is that my dataset is very large, and the loop now runs for hours already.
Does anyone have an idea how to do this while avoiding the loop?
This should be faster: each column x of tmat is one observation's issue profile, tmat & x flags the issues it shares with every other observation, and colSums counts them per observation:
tmat <- t(mat==1)
mat4 <- apply(tmat, 2, function(x) colSums(tmat & x))
Going ahead and promoting #jogo's comment, as it is by far the fastest (thanks for the hint; I will use that in production as well).
set.seed(10)
mat <- matrix(rbinom(200, size = 1, prob = .5), ncol = 10)
mat2 <- matrix(NA, 20, 20)
binary_mat <- mat == 1
tmat <- t(mat == 1)

microbenchmark::microbenchmark(
  "loop" = for (i in 1:nrow(mat)) {
    for (j in 1:nrow(mat)) {
      mat2[i, j] <- sum(as.numeric(mat[i, ] == 1) + as.numeric(mat[j, ] == 1) == 2)
    }
  },
  "apply" = mat4 <- apply(tmat, 2, function(x) colSums(tmat & x)),
  "matrix multiplication" = mat5 <- mat %*% t(mat),
  "tcrossprod" = tcrossprod(mat),
  "tcrossprod binary" = tcrossprod(binary_mat)
)
On my machine this benchmark results in
Unit: microseconds
expr min lq mean median uq max neval cld
loop 16699.634 16972.271 17931.82535 17180.397 17546.1545 31502.706 100 b
apply 322.942 330.046 395.69045 357.886 368.8300 4299.228 100 a
matrix multiplication 21.889 28.801 36.76869 39.360 43.9685 50.689 100 a
tcrossprod 7.297 8.449 11.20218 9.984 14.4005 18.433 100 a
tcrossprod binary 7.680 8.833 11.08316 9.601 12.0970 35.713 100 a
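As a quick sanity check (a sketch, not part of the original benchmark), all of the approaches above should produce the same joint-interest counts on the toy data:
# assuming mat, mat2, mat4 and binary_mat from the benchmark above exist
all.equal(mat2, unname(mat %*% t(mat)))             # loop vs. matrix multiplication
all.equal(mat4, unname(tcrossprod(mat)))            # apply vs. tcrossprod
all.equal(tcrossprod(mat), tcrossprod(binary_mat))  # 0/1 matrix vs. logical matrix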
I'm having trouble finding a way to calculate the median and mean of a large vector in R more quickly. How would I implement a faster approach?
I'm using the code below, but it's too slow.
I'm thinking about parallel processing, but I have no idea how to make this work. Thanks.
vector <- 1:10000000000
m <- mean(vector)
md <- median(vector)
Assuming we're dealing with a sequential integer vector 1:n, this may help you:
## Given
V <- 1:10e8
n <- length(V)

## To get the median
median <- ifelse(n %% 2 == 0, mean(V[(n/2):((n/2) + 1)]), V[(n + 1)/2])
median
OUTPUT: 5e+08

## To get the mean
sum_series <- n * (n + 1) / 2 # mathematical fact: sum of 1:n
mean <- sum_series / n
mean
OUTPUT: 5e+08
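A quick way to verify both closed forms against base R on a smaller sequential vector (a sketch; at this size base R is still fast enough to serve as a reference):
V_small <- 1:1e6
n_small <- length(V_small)
c(base = mean(V_small), formula = (n_small + 1) / 2)      # both 500000.5
c(base = median(V_small),
  formula = mean(V_small[(n_small/2):(n_small/2 + 1)]))   # both 500000.5 (n is even)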
For a large random vector, the median trick above still works as long as the vector is sorted. The mean you can estimate by repeated sampling if it does not have a closed formula:
### Estimation via Repeated Sampling ###
est_mean <- function(V, k, size) {
  # k:    number of sample means to use in the estimate
  # size: sample size for each of the k samples
  est <- rep(NA, k)
  samp <- matrix(NA, nrow = size, ncol = k)
  for (j in 1:k) samp[, j] <- sample(V, size, replace = TRUE)
  for (j in 1:k) est[j] <- mean(samp[, j])
  est <- sort(est)
  return(est[ceiling(length(est)/2)])  # median of the k sample means
}

### Time Complexity of the Estimation ###
# sampling + means = k*size + k
# if size, k ~ 30 --> enough for the sample means to be roughly normal
# iterations * (create sample vector + store) = k*(size + size) --> 2*k*size
# total = k + 3*k*size --> effectively constant in length(V)

### Time Complexity of Base R mean() ###
# assuming it is essentially mean(V) = sum(V)/length(V):
# sum N items + find length + 1 division + 1 return = N + 3
### Example ###
set.seed(0)
V <- sort(sample(0:10e8, 10e7, replace = TRUE))
start1 <- Sys.time()
est_mu <- est_mean(V, 1000, 30)
end1 <- Sys.time()
diff1 <- end1 - start1
start2 <- Sys.time()
r_mu <- mean(V)
end2 <- Sys.time()
diff2 <- end2 - start2
diff1
OUTPUT: Time difference of 0.08370018 secs
diff2
OUTPUT: Time difference of 0.5321879 secs
print(paste("% Difference = ", abs(r_mu - est_mu)/r_mu))
OUTPUT: "% Difference = 0.00678363793285072"
I have an issue finding the most efficient way to calculate a rolling linear regression over an xts object with multiple columns. I have searched and read several previous questions here on Stack Overflow.
This question and answer come close, but not close enough in my opinion, as I want to calculate multiple regressions with the dependent variable unchanged in all of them. I have tried to reproduce an example with random data:
require(xts)
require(RcppArmadillo) # Load libraries
data <- matrix(sample(1:10000, 1500), 1500, 5, byrow = TRUE) # Random data
data[1000:1500, 2] <- NA # insert NAs to make it more similar to true data
data <- xts(data, order.by = as.Date(1:1500, origin = "2000-01-01"))
NR <- nrow(data) # number of observations
NC <- ncol(data) # number of factors
obs <- 30 # required number of observations for rolling regression analysis
info.names <- c("res", "coef")
info <- array(NA, dim = c(NR, length(info.names), NC))
colnames(info) <- info.names
The array is created in order to store multiple variables (residuals, coefficients etc.) over time and per factor.
loop.begin.time <- Sys.time()
for (j in 2:NC) {
  cat(paste("Processing residuals for factor:", j), "\n")
  for (i in obs:NR) {
    regression.temp <- fastLm(data[i:(i-(obs-1)), j] ~ data[i:(i-(obs-1)), 1])
    residuals.temp <- regression.temp$residuals
    info[i, "res", j] <- round(residuals.temp[1] / sd(residuals.temp), 4)
    info[i, "coef", j] <- regression.temp$coefficients[2]
  }
}
loop.end.time <- Sys.time()
print(loop.end.time - loop.begin.time) # prints the loop runtime
As the loop shows, the idea is to run a 30-observation rolling regression with data[, 1] as the dependent variable (factor), each time against one of the other factors. I have to store the 30 residuals in a temporary object in order to standardize them, as fastLm does not calculate standardized residuals.
The loop is extremely slow and becomes cumbersome if the number of columns (factors) in the xts object increases to around 100; 1,000 columns would take an eternity. I hope someone has more efficient code for running rolling regressions over a large data set.
It should be pretty quick if you go down to the level of the math of the linear regression. If X is the independent variable and Y is the dependent variable, the coefficients are given by
Beta = inv(t(X) %*% X) %*% (t(X) %*% Y)
I'm a little confused about which variable you want as the dependent and which as the independent, but hopefully solving a similar problem below will help you as well.
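For illustration, a minimal sketch of a single OLS fit written straight from that formula (using solve(crossprod(X), ...) rather than forming the inverse explicitly, which is the numerically safer way to solve the same normal equations):
set.seed(1)
X <- cbind(1, rnorm(30))                      # intercept column plus one regressor
Y <- 2 + 3 * X[, 2] + rnorm(30)
beta <- solve(crossprod(X), crossprod(X, Y))  # equals inv(t(X) %*% X) %*% t(X) %*% Y
drop(beta)                                    # approximately c(2, 3)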
In the example below I take 1,000 variables instead of the original 5 and do not introduce any NAs.
require(xts)
data <- matrix(sample(1:10000, 1500000, replace = TRUE), 1500, 1000, byrow = TRUE) # random data
data <- xts(data, order.by = as.Date(1:1500, origin = "2000-01-01"))
NR <- nrow(data) # number of observations
NC <- ncol(data) # number of factors
obs <- 30 # required number of observations for rolling regression analysis
Now we can calculate the coefficients using Joshua's TTR package.
library(TTR)
loop.begin.time <- Sys.time()
in.dep.var <- data[, 1]
xx <- TTR::runSum(in.dep.var * in.dep.var, obs)
coeffs <- do.call(cbind, lapply(data, function(z) {
  xy <- TTR::runSum(z * in.dep.var, obs)
  xy / xx
}))
loop.end.time <- Sys.time()
print(loop.end.time - loop.begin.time) # prints the loop runtime
Time difference of 3.934461 secs
res.array <- array(NA, dim = c(NC, NR, obs))
for (z in seq(obs)) {
  res.array[, , z] <- coredata(data - lag.xts(coeffs, z - 1) * as.numeric(in.dep.var))
}
res.sd <- apply(res.array, c(1, 2), function(z) z / sd(z))
If I haven't made any errors in the indexing, res.sd should give you the standardized residuals. Please feel free to fix this solution to correct any bugs.
Here is a much faster way to do it with the rollRegres package:
library(xts)
library(RcppArmadillo)

#####
# simulate data
set.seed(50554709)
data <- matrix(sample(1:10000, 1500), 1500, 5, byrow = TRUE) # random data
# data[1000:1500, 2] <- NA # only focus on the parts that are computed
data <- xts(data, order.by = as.Date(1:1500, origin = "2000-01-01"))

#####
# setup for solution in OP
NR <- nrow(data)
NC <- ncol(data)
obs <- 30L
info.names <- c("res", "coef")
info <- array(NA, dim = c(NR, length(info.names), NC))
colnames(info) <- info.names

#####
# solve with rollRegres
library(rollRegres)
loop.begin.time <- Sys.time()
X <- cbind(1, drop(data[, 1]))
out <- lapply(2:NC, function(j){
  fit <- roll_regres.fit(
    y = data[, j], x = X, width = obs, do_compute = c("sigmas"))

  # are you sure you want the residual of the first and not the last
  # observation in each window?
  idx <- 1:(nrow(data) - obs + 1L)
  idx_tail <- idx + obs - 1L
  resids <- c(rep(NA_real_, obs - 1L),
              data[idx, j] - rowSums(fit$coefs[idx_tail, ] * X[idx, ]))

  # the package uses the unbiased estimator, so we have to multiply by this
  # factor to get the same standard deviations
  sds <- fit$sigmas * sqrt((obs - 2L) / (obs - 1L))

  unclass(cbind(coef = fit$coefs[, 2L], res = drop(round(resids / sds, 4))))
})
loop.end.time <- Sys.time()
print(loop.end.time - loop.begin.time)
#R Time difference of 0.03123808 secs
#####
# solve with original method
loop.begin.time <- Sys.time()
for (j in 2:NC) {
  cat(paste("Processing residuals for factor:", j), "\n")
  for (i in obs:NR) {
    regression.temp <- fastLm(data[i:(i-(obs-1)), j] ~ data[i:(i-(obs-1)), 1])
    residuals.temp <- regression.temp$residuals
    info[i, "res", j] <- round(residuals.temp[1] / sd(residuals.temp), 4)
    info[i, "coef", j] <- regression.temp$coefficients[2]
  }
}
#R Processing residuals for factor: 2
#R Processing residuals for factor: 3
#R Processing residuals for factor: 4
#R Processing residuals for factor: 5
loop.end.time <- Sys.time()
print(loop.end.time - loop.begin.time) # prints the loop runtime
#R Time difference of 7.554767 secs
#####
# check that results are the same
all.equal(info[, "coef", 2L], out[[1]][, "coef"])
#R [1] TRUE
all.equal(info[, "res" , 2L], out[[1]][, "res"])
#R [1] TRUE
all.equal(info[, "coef", 3L], out[[2]][, "coef"])
#R [1] TRUE
all.equal(info[, "res" , 3L], out[[2]][, "res"])
#R [1] TRUE
all.equal(info[, "coef", 4L], out[[3]][, "coef"])
#R [1] TRUE
all.equal(info[, "res" , 4L], out[[3]][, "res"])
#R [1] TRUE
all.equal(info[, "coef", 5L], out[[4]][, "coef"])
#R [1] TRUE
all.equal(info[, "res" , 5L], out[[4]][, "res"])
#R [1] TRUE
Do notice this comment inside the above solution:
# are you sure you want the residual of the first and not the last
# observation in each window?
Here is a comparison to Sameer's answer:
library(rollRegres)
require(xts)

data <- matrix(sample(1:10000, 1500000, replace = TRUE), 1500, 1000, byrow = TRUE) # random data
data <- xts(data, order.by = as.Date(1:1500, origin = "2000-01-01"))

NR <- nrow(data) # number of observations
NC <- ncol(data) # number of factors
obs <- 30 # required number of observations for rolling regression analysis

loop.begin.time <- Sys.time()
X <- cbind(1, drop(data[, 1]))
out <- lapply(2:NC, function(j){
  fit <- roll_regres.fit(
    y = data[, j], x = X, width = obs, do_compute = c("sigmas"))

  # are you sure you want the residual of the first and not the last
  # observation in each window?
  idx <- 1:(nrow(data) - obs + 1L)
  idx_tail <- idx + obs - 1L
  resids <- c(rep(NA_real_, obs - 1L),
              data[idx, j] - rowSums(fit$coefs[idx_tail, ] * X[idx, ]))

  # the package uses the unbiased estimator, so we have to multiply by this
  # factor to get the same standard deviations
  sds <- fit$sigmas * sqrt((obs - 2L) / (obs - 1L))

  unclass(cbind(coef = fit$coefs[, 2L], res = drop(round(resids / sds, 4))))
})
loop.end.time <- Sys.time()
print(loop.end.time - loop.begin.time)
#R Time difference of 0.9019711 secs
The time includes the time used to compute the standardized residuals.
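If you want the results in the same layout as the info array from the question, here is a small sketch for packing the list out back into that shape (assuming out, NR and NC from the code above):
info2 <- array(NA, dim = c(NR, 2L, NC))
colnames(info2) <- c("res", "coef")
for (j in 2:NC) {
  info2[, "coef", j] <- out[[j - 1L]][, "coef"]
  info2[, "res",  j] <- out[[j - 1L]][, "res"]
}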