I have two matrices on which I want to compute several statistics, comparing every row of dataframe1 with every row of dataframe2. These are large (300,000 and 40,000 rows respectively), so there is a lot to compare.
I wrote a few functions to apply the statistics. What I was wondering is whether it is possible to split dataframe1 into chunks and run those chunks in parallel on multiple cores.
library(lawstat)
library(reshape2)
df1 <- matrix(ncol = 100, nrow = 100)
for (i in 1:100){
  df1[, i] <- floor(runif(100, min = 0, max = 3))
}
df2 <- matrix(ncol = 100, nrow = 1000)
for (i in 1:100){
  df2[, i] <- runif(1000, min = 0, max = 1000)
}
testFunc <- function(df1, df2){
  x <- apply(df1, 1, function(x) apply(df2, 1, function(y) levene.test(y, x)$p.value))
  x <- melt(x)
  return(x)
}
system.time(res <- testFunc(df1, df2))
Some of the statistics (e.g. the Levene tests) take a fairly long time to compute, so any way I can speed this up would be great.
There is room for optimisation in your function but here is an example of an improvement using the parallel package:
library(parallel)
# I have a quad-core processor, so I am using 3 cores here.
cl <- parallel::makeCluster(3)
testFunc2 <- function(df1, df2){
  x <- parallel::parApply(cl = cl, X = df1, MARGIN = 1, function(x, df2)
    apply(df2, 1, function(y) lawstat::levene.test(y, x)$p.value), df2)
  x <- melt(x)
  return(x)
}
system.time(res <- testFunc2(df1, df2))
parallel::stopCluster(cl)  # release the workers when finished
On my machine this at least halves the running time if I have a cluster size of 3.
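If you specifically want the chunk-splitting idea from the question, a roughly equivalent pattern is to split the row indices of df1 into one block per worker and process each block with parLapply. This is only a sketch under the same setup as above (3 workers, melt from reshape2); the chunking scheme and the chunks/res_chunked names are my own choices, not part of the answer.
cl <- parallel::makeCluster(3)
# split df1's row indices into 3 roughly equal chunks, one per worker
chunks <- split(seq_len(nrow(df1)), cut(seq_len(nrow(df1)), 3, labels = FALSE))
res_list <- parallel::parLapply(cl, chunks, function(rows, df1, df2) {
  apply(df1[rows, , drop = FALSE], 1, function(x)
    apply(df2, 1, function(y) lawstat::levene.test(y, x)$p.value))
}, df1, df2)
parallel::stopCluster(cl)
# stitch the per-chunk matrices back together and reshape as before
res_chunked <- melt(do.call(cbind, res_list))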
edit: I felt bad for dissing your code, so below is a stripped-down levene.test function that improves performance more than going parallel does on most home/work machines.
lev_lite <- function(y, group){
  group <- as.factor(group)
  N <- length(y)
  k <- nlevels(group)
  # sort by group so that rep() below expands the group-wise values correctly
  reorder <- order(group)
  group <- group[reorder]
  y <- y[reorder]
  n <- tapply(y, group, FUN = length)
  yi_bar <- tapply(y, group, FUN = median)
  zij <- abs(y - rep(yi_bar, n))
  zidot <- tapply(zij, group, FUN = mean)
  zdotdot <- mean(zij)
  # test statistic, see the Wikipedia article on Levene's test
  W <- ((N - k)/(k - 1)) * (
    sum(n * (zidot - zdotdot)^2) /
      sum((zij - rep(zidot, n))^2))
  # p-value returned
  1 - pf(W, k - 1, N - k)
}
testFunc_lite <- function(df1, df2){
  x <- apply(df1, 1, function(x) apply(df2, 1, lev_lite, group = x))
  x <- melt(x)
  return(x)
}
> system.time(res <- testFunc(df1[1:50, ], df2[1:50, ]))
   user  system elapsed
   5.53    0.00    5.56
> system.time(res2 <- testFunc_lite(df1[1:50, ], df2[1:50, ]))
   user  system elapsed
   1.13    0.00    1.14
> max(res2 - res)
[1] 2.220446e-15
This is a ~5x improvement without parallelisation.
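The two speed-ups also stack: lev_lite is a plain function, so it can be shipped to the workers and used inside parApply. A sketch under the same setup as above (the cluster size, the clusterExport call and the testFunc_lite_par name are my additions):
cl <- parallel::makeCluster(3)
parallel::clusterExport(cl, "lev_lite")  # make lev_lite visible on the workers
testFunc_lite_par <- function(df1, df2){
  x <- parallel::parApply(cl = cl, X = df1, MARGIN = 1, function(x, df2)
    apply(df2, 1, lev_lite, group = x), df2)
  melt(x)
}
system.time(res3 <- testFunc_lite_par(df1, df2))
parallel::stopCluster(cl)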
I have the code below that calculates a frequency for each column element (relative to its own column) and adds all five frequencies together in a new column. The code works but is very slow, and the majority of the processing time is spent on this step. Any ideas on how to accomplish the same goal more efficiently?
Create_Freq <- function(Word_List) {
  library(dplyr)
  Word_List$AvgFreq <- (Word_List %>% add_count(FirstLet))[, "n"] +
    (Word_List %>% add_count(SecLet))[, "n"] +
    (Word_List %>% add_count(ThirdLet))[, "n"] +
    (Word_List %>% add_count(FourLet))[, "n"] +
    (Word_List %>% add_count(FifthLet))[, "n"]
  return(Word_List)
}
Edit:
To provide a word list for the example:
Word_List <- data.frame(Word = c("final", "first", "lover", "thing"))
Word_List$FirstLet <- substr(Word_List$Word, 1, 1)
Word_List$SecLet <- substr(Word_List$Word, 2, 2)
Word_List$ThirdLet <- substr(Word_List$Word, 3, 3)
Word_List$FourLet <- substr(Word_List$Word, 4, 4)
Word_List$FifthLet <- substr(Word_List$Word, 5, 5)
For context, I have another function that then chooses the word with the highest "average" frequency. (It used to be an actual average, but dividing by 5 was pointless since it doesn't change which word has the maximum.)
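(The question doesn't show that follow-up function, but presumably it boils down to something like the line below; the best_word name is just for illustration.)
best_word <- Word_List$Word[which.max(Word_List$AvgFreq)]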
Here is one possible approach, defining a small auxiliary function f to access a list of counts. When tested, it is roughly 15 times faster on my machine.
f <- function(x, tbl){
  res <- integer(5)
  for (i in seq_along(tbl)){
    res[i] <- tbl[[i]][x[i]]   # count of letter x[i] at position i
  }
  sum(res)
}
Word_List <- data.frame(Word = c("final", "first", "lover", "thing"))
w <- unlist(Word_List, use.names = FALSE)
# one column per word, one row per letter position
m <- matrix(unlist(strsplit(w, ""), use.names = FALSE), ncol = length(w))
lookup <- apply(m, 1, table)   # list of per-position letter counts
Word_List$AvgFreq <- apply(m, 2, f, lookup)
   Word AvgFreq
1 final       7
2 first       7
3 lover       5
4 thing       5
Further optimizations are possible, especially using a vectorized approach.
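For instance, here is a sketch of such a vectorized version, assuming the letter-column layout from the question (FirstLet through FifthLet); each per-position count table is indexed with the whole column at once, so there is no per-word loop:
letter_cols <- c("FirstLet", "SecLet", "ThirdLet", "FourLet", "FifthLet")
letters_mat <- as.matrix(Word_List[letter_cols])
# one count table per letter position
counts <- lapply(seq_along(letter_cols), function(j) table(letters_mat[, j]))
# look up every word's letter in each position at once, then add the five columns
Word_List$AvgFreq <- Reduce(`+`, lapply(seq_along(letter_cols), function(j)
  as.integer(counts[[j]][letters_mat[, j]])))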
In response to Donald: using your approach ended up being much slower, but I had to make a couple of changes to get it to work with a large word list, so let me know if I messed up your methodology:
f <- function(x, tbl){
  res <- integer(5)
  for (i in seq_along(tbl)){
    res[i] <- tbl[[i]][x[i]]
  }
  sum(res)
}
Word_List <- data.frame(read.delim("Word List.txt"))
Word_List <- Turn_Vector_List(Word_List)
Word_List2 <- data.frame(read.delim("Word List.txt"))
Word_List_Vector <- Turn_Vector_List(Word_List2)
# Start the clock!
ptm <- proc.time()
m <- data.matrix(Word_List[2:6])
m
lookup <- apply(m, 2, table, simplify = FALSE)
lookup
Word_List$AvgFreq <- apply(m, 1, f, lookup)
# Stop the clock
ptm2 <- proc.time() - ptm
Word_List2 <- data.frame(read.delim("Word List.txt"))
Word_List_Vector <- Turn_Vector_List(Word_List2)
Word_List2 <- Create_Freq(Word_List_Vector)
ptm3 <- proc.time() - ptm - ptm2
ptm2
# user system elapsed
# 0.89 0.78 1.69
ptm3
# user system elapsed
# 0.06 0.00 0.06
I've tried to optimize a function that I wrote a few weeks ago.
It got better, but it is still slow. So I used Rprof() and found that split() takes the most time, which for some reason makes me think this function can be a lot better.
Can it be done?!
normDist_V2 <- function(size = 1e5, precision = 1, ...)
{
  data <- rnorm(size)
  roundedData <- round(data, precision)
  framedData <- data.frame(cbind(data, roundedData))
  factoredData <- split(framedData$data, framedData$roundedData)
  actualsize <- size / 10^precision
  X <- names(factoredData)
  Probability <- sapply(factoredData, length) / actualsize
  plot(X, Probability, ...)
}
Current speed:
system.time(normDist_V2(size=1e7, precision = 2)) #11.14 sec
Using data.table you can avoid split() entirely and count the rounded values with a keyed .N aggregation:
normDist_V2 <- function(size = 1e5, precision = 1, ...) {
  require(data.table)
  data <- rnorm(size)
  roundedData <- round(data, precision)
  framedData <- data.table(data, roundedData)
  actualsize <- size / 10^precision
  dt <- framedData[, .N, keyby = roundedData]
  X <- dt$roundedData
  Probability <- dt$N / actualsize
  plot(X, Probability, ...)
}
system.time(normDist_V2(size=1e7, precision = 2)) # 1.26 sec
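If you prefer to stay in base R, a sketch that also avoids split() is to count the rounded values with table(); I haven't benchmarked it against the data.table version (and the normDist_V3 name is mine):
normDist_V3 <- function(size = 1e5, precision = 1, ...) {
  data <- rnorm(size)
  roundedData <- round(data, precision)
  actualsize <- size / 10^precision
  counts <- table(roundedData)          # one tabulation instead of split() + sapply()
  X <- as.numeric(names(counts))
  Probability <- as.vector(counts) / actualsize
  plot(X, Probability, ...)
}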
Please consider this
library(data.table)
mydt <- data.table(id = 1:100,
                   p1 = sample(seq(0, 1, length.out = 1000), 100))
mydt$p2 <- 1 - mydt$p1
I want to apply a function whose argument is a vector built from columns p1 and p2.
myFun <- function(x) {
sample(c(1,2), 1, prob = x)
}
This works,
mydt$outcome <- apply(mydt[,2:3], 1, myFun)
but I have 25M rows, so I hit the memory limit.
I tried this, but it doesn't work:
mydt[,mydt := mapply(myFun, p1, p2)]
The prob argument of sample() requires a vector. To apply myFun to each row, you can use by = 1:nrow(mydt) or by = 1:mydt[, .N]:
mydt[, chosen := myFun(c(p1, p2)), by=1:nrow(mydt)]
Hat tip to @Roland for his use of rbinom. His vectorized version of this Bernoulli trial is much faster.
> system.time(mydt[, chosen := myFun(c(p1, p2)), by=1:nrow(mydt)])
user system elapsed
4.82 0.00 4.86
> system.time(mydt[, outcome2 := rbinom(.N, 1, p2) + 1])
user system elapsed
0.05 0.02 0.06
data used in timings:
library(data.table)
set.seed(0L)
m <- 1e6
mydt <- data.table(id = 1:m, p1 = runif(m))[, p2 := 1 - p1]
myFun <- function(x) sample(c(1,2), 1, prob = x)
accuracy check:
n <- 0L
while (n < 1e3) {
  set.seed(n)
  mydt[, chosen := myFun(c(p1, p2)), by = 1:nrow(mydt)]
  set.seed(n)
  mydt[, outcome2 := rbinom(.N, 1, p2) + 1]
  if (!isTRUE(all.equal(mydt$chosen, mydt$outcome2))) stop("mismatch")
  n <- n + 1L
}
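For this two-outcome case the rbinom() call is already the natural vectorized form; an equivalent way to write the same Bernoulli draw with runif() (my own variant, shown only for illustration) is:
# TRUE with probability p2, so outcome 2 is chosen with probability p2
mydt[, outcome3 := 1L + (runif(.N) < p2)]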
Does anyone have an idea or suggestion on how to increase the efficiency of the following code, which eats up all my RAM, using a "kind-of" double rolling window?
First, I go through a simple example defining the problem, with a full MWE (implementation) at the bottom of this post.
Consider the following "random" test vector (usually of length > 25,000):
A <- c(1.23,5.44,6.3,8.45,NaN,3.663,2.63,1.32,6.623,234.6,252.36)
A is sectioned into a "kind-of" train and test set, both with rolling windows. In this MWE, a train set of length 4 and a test set of length 2 are considered (usually both are of length > 200). So initially, the following values are part of the train and test set:
train_1 <- A[1:4]
test_1 <- A[5:6]
Next, I want to subtract test_1 from train_1 at each possible consecutive location of train_1 (hence the first rolling window), generating the run_1_sub matrix.
run_1_sub <- matrix(NaN,3,2)
run_1_sub[1,] <- train_1[1:2] - test_1
run_1_sub[2,] <- train_1[2:3] - test_1
run_1_sub[3,] <- train_1[3:4] - test_1
Afterwards, for each row in run_1_sub, I want the sum of that row divided by the number of entries in that row that are not NaN.
run_1_sum <- sapply(1:3, function(x) {
  sum(run_1_sub[x, ], na.rm = TRUE) / sum(!is.na(run_1_sub[x, ]))
})
In the next step, the "kind-of" train and test sets are updated by shifting their positions in A by one (hence the second rolling window):
train_2 <- A[2:5]
test_2 <- A[6:7]
As previously, test_2 is subtracted at each possible location in train_2, and run_2_sub and run_2_sum are computed. This procedure continues until the test set covers the last two values of A, and finally (in this MWE) I end up with 6 run_sum vectors. My implementation, however, is very slow, and I was wondering whether anyone could help me increase its efficiency?
Here's my implementation:
# Initialization
library(zoo)
#rm(list = ls())
A <- c(1.23, 5.44, 6.3, 8.45, NaN, 3.663, 2.63, 1.32, 6.623, 234.6, 252.36) # test vector
train.length <- 4
test.length <- 2
run.length <- length(A) - train.length - test.length + 1
# Form test sets
test.sets <- sapply(1:run.length, function(x) {
  A[(train.length + x):(train.length + test.length + x - 1)]
})
# Generate run_sub_matrices
run_matrix <- lapply(1:run.length, function(x) {
  rollapply(A[x:(train.length + x - 1)], width = test.length, by = 1,
            FUN = function(y) {
              y - test.sets[, x]
            })
})
# Generate run_sum_matrices
run_sum <- sapply(1:length(run_matrix), function(x) {
  rowSums(run_matrix[[x]], na.rm = TRUE) /
    apply(run_matrix[[x]], 1, function(y) { sum(!is.na(y)) })
})
Naturally, the following initialization set-up slows the generation of run_sub and run_sum down significantly:
A <- runif(25000)*400
train.length <- 400
test.length <- 200
Here, the elapsed time is 120.04 s for generating run_sub and 28.69 s for run_sum.
Any suggestions on how to improve the speed of this code?
Usually the first two steps of code optimization in R are:
Do less;
Use vectorization.
We will go through both of these steps. Let's agree to denote the input vector by x (A in your example).
The key functional unit in your problem can be formulated as follows: given train_start (the start index of a subset of the train set; we will use the word 'train' for this subset), test_start (the start index of the test set) and test_length (the length of the test set), compute:
train_inds <- train_start + 0:(test_length-1)
test_inds <- test_start + 0:(test_length-1)
run_diff <- x[train_inds] - x[test_inds]
sum(run_diff, na.rm = TRUE) / sum(!is.na(run_diff))
This unit is invoked many times, and so are the computation of the sums and of !is.na. We will do less: instead of repeatedly computing differences and their sums, we precompute the cumulative sums once and reuse them. See 'Preparatory computations' in run_mean_diff.
res now contains the needed sums of differences of x_mod (a copy of x with 0 in place of NAs and NaNs). We now have to subtract the contribution of the 'extra' elements, i.e. those that shouldn't enter the sums because the corresponding element in the other set is NA or NaN. While computing this correction we also obtain the denominator. See 'Info about extra elements' in run_mean_diff.
The beauty of this code is that train_start, test_start and test_length can now be vectors: the ith element of each vector is treated as one instance of our task. This is vectorization. Our job now is to construct these vectors to suit our task. See the function generate_run_data.
The presented code uses much less RAM, doesn't need the extra zoo dependency, and is considerably faster than the original for small train_length and test_length. For big *_lengths it is also faster, but not by much.
One of the next steps might be writing this code using Rcpp.
The code:
run_mean_diff <- function(x, train_start, test_start, test_length) {
  # Preparatory computations
  x_isna <- is.na(x)
  x_mod <- ifelse(x_isna, 0, x)
  x_cumsum <- c(0, cumsum(x_mod))
  res <- x_cumsum[train_start + test_length] - x_cumsum[train_start] -
    (x_cumsum[test_start + test_length] - x_cumsum[test_start])
  # Info about extra elements
  extra <- mapply(
    function(cur_train_start, cur_test_start, cur_test_length) {
      train_inds <- cur_train_start + 0:(cur_test_length - 1)
      test_inds <- cur_test_start + 0:(cur_test_length - 1)
      train_isna <- x_isna[train_inds]
      test_isna <- x_isna[test_inds]
      c(
        # Correction for extra elements
        sum(x_mod[train_inds][test_isna]) -
          sum(x_mod[test_inds][train_isna]),
        # Number of extra elements
        sum(train_isna | test_isna)
      )
    },
    train_start, test_start, test_length, SIMPLIFY = TRUE
  )
  (res - extra[1, ]) / (test_length - extra[2, ])
}
generate_run_data <- function(n, train_length, test_length) {
  run_length <- n - train_length - test_length + 1
  num_per_run <- train_length - test_length + 1
  train_start <- rep(1:num_per_run, run_length) +
    rep(0:(run_length - 1), each = num_per_run)
  test_start <- rep((train_length + 1):(n - test_length + 1),
                    each = num_per_run)
  data.frame(train_start = train_start,
             test_start = test_start,
             test_length = rep(test_length, length(train_start)))
}
A <- c(1.23, 5.44, 6.3, 8.45, NaN, 3.663,
       2.63, 1.32, 6.623, 234.6, 252.36)
train_length <- 4
test_length <- 2
run_data <- generate_run_data(length(A), train_length, test_length)
run_sum_new <- matrix(
  run_mean_diff(A, run_data$train_start, run_data$test_start,
                run_data$test_length),
  nrow = train_length - test_length + 1
)
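If the run_sum matrix from the original rollapply code is still in the workspace, a quick sanity check is the comparison below (all.equal rather than identical, since the sums are accumulated differently here and may differ by floating-point error):
all.equal(run_sum, run_sum_new)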
Your code uses so much RAM because you keep a lot of intermediate objects, mainly all the elements in run_matrix. And profiling via Rprof shows that most of the time is spent in rollapply.
The easiest and simplest way to avoid all the intermediate objects is to use a for loop. It also makes the code clear. Then you just need to replace the call to rollapply with something faster.
The function you want to apply to each rolling subset is simple: subtract the test set. You can use the stats::embed function to create the matrix of lags, and then take advantage of R's recycling rules to subtract the test vector from each column. The function I created is:
calc_run_sum <- function(A, train_length, test_length) {
  run_length <- length(A) - train_length - test_length + 1L
  window_size <- train_length - test_length + 1L
  # Essentially what embed() does, but with column order reversed
  # (part of my adaptation of echasnovski's correction)
  train_lags <- 1L:test_length +
    rep.int(1L:window_size, rep.int(test_length, window_size)) - 1L
  dims <- c(test_length, window_size)  # lag matrix dims are always the same
  # pre-allocate result matrix
  run_sum <- matrix(NA, window_size, run_length)
  # loop over each run length
  for (i in seq_len(run_length)) {
    # test set indices and vector
    test_beg <- (train_length + i)
    test_end <- (train_length + test_length + i - 1)
    # echasnovski's correction
    #test_set <- rep(test_set, each = train_length - test_length + 1)
    #lag_matrix <- embed(A[i:(test_beg - 1)], test_length)
    #run_sum[, i] <- rowMeans(lag_matrix - test_set, na.rm = TRUE)
    # My adaptation of echasnovski's correction
    # (requires the train_lags object created outside the loop)
    test_set <- A[test_beg:test_end]
    train_set <- A[i:(test_beg - 1L)]
    lag_matrix <- train_set[train_lags]
    dim(lag_matrix) <- dims
    run_sum[, i] <- colMeans(lag_matrix - test_set, na.rm = TRUE)
  }
  run_sum
}
Now, for some benchmarks. I used the following input data:
library(zoo)
set.seed(21)
A <- runif(10000)*200
train.length <- 200
test.length <- 100
Here are the timings for your original approach:
system.time({
  run.length <- length(A) - train.length - test.length + 1
  # Form test sets
  test.sets <- sapply(1:run.length, function(x) {
    A[(train.length + x):(train.length + test.length + x - 1)]
  })
  # Generate run_sub_matrices
  run_matrix <- lapply(1:run.length, function(x) {
    rm <- rollapply(A[x:(train.length + x - 1)], width = test.length, by = 1,
                    FUN = function(y) { y - test.sets[, x] })
  })
  # Generate run_sum_matrices
  run_sum <- sapply(run_matrix, function(x) {
    rowSums(x, na.rm = TRUE) / apply(x, 1, function(y) { sum(!is.na(y)) })
  })
})
# user system elapsed
# 19.868 0.104 19.974
And here are the timings for echasnovski's approach:
system.time({
  run_data <- generate_run_data(length(A), train.length, test.length)
  run_sum_new <- matrix(
    run_mean_diff(A, run_data$train_start, run_data$test_start,
                  run_data$test_length),
    nrow = train.length - test.length + 1
  )
})
# user system elapsed
# 10.552 0.048 10.602
And the timings from my approach:
system.time(run_sum_jmu <- calc_run_sum(A, train.length, test.length))
# user system elapsed
# 1.544 0.000 1.548
The outputs from all 3 approaches are identical.
identical(run_sum, run_sum_new)
# [1] TRUE
identical(run_sum, run_sum_jmu)
# [1] TRUE
I have two matrices that I want to apply a function to, by rows:
matrixA
GSM83009 GSM83037 GSM83002 GSM83029 GSM83041
100001_at 5.873321 5.416164 3.512227 6.064150 3.713696
100005_at 5.807870 6.810829 6.105804 6.644000 6.142413
100006_at 2.757023 4.144046 1.622930 1.831877 3.694880
matrixB
GSM82939 GSM82940 GSM82974 GSM82975
100001_at 3.673556 2.372952 3.228049 3.555816
100005_at 6.916954 6.909533 6.928252 7.003377
100006_at 4.277985 4.856986 3.670161 4.075533
I've found several similar questions, but not a whole lot of answers: mapply for matrices, Multi matrix row-wise mapply?. The code I have now splits the matrices by row into lists, but having to split them makes it rather slow and not much faster than a for loop, considering I have almost 9000 rows in each matrix:
scores <- mapply(t.test.stat, split(matrixA, row(matrixA)), split(matrixB, row(matrixB)))
The function itself is very simple, just finding the t-value:
t.test.stat <- function(x, y)
{
  return( (mean(x) - mean(y)) / sqrt(var(x)/length(x) + var(y)/length(y)) )
}
Splitting the matrices isn't the biggest contributor to evaluation time.
set.seed(21)
matrixA <- matrix(rnorm(5 * 9000), nrow = 9000)
matrixB <- matrix(rnorm(4 * 9000), nrow = 9000)
system.time( scores <- mapply(t.test.stat,
split(matrixA, row(matrixA)), split(matrixB, row(matrixB))) )
# user system elapsed
# 1.57 0.00 1.58
smA <- split(matrixA, row(matrixA))
smB <- split(matrixB, row(matrixB))
system.time( scores <- mapply(t.test.stat, smA, smB) )
# user system elapsed
# 1.14 0.00 1.14
Look at the output from Rprof to see that most of the time is, not surprisingly, spent evaluating t.test.stat (mean, var, etc.). Basically, there's quite a bit of overhead from the function calls.
Rprof()
scores <- mapply(t.test.stat, smA, smB)
Rprof(NULL)
summaryRprof()
You may be able to find faster generalized solutions, but none will approach the speed of the vectorized solution below.
Since your function is simple, you can take advantage of the vectorized rowMeans function to do this almost instantaneously (though it's a bit messy):
system.time({
ncA <- NCOL(matrixA)
ncB <- NCOL(matrixB)
ans <- (rowMeans(matrixA)-rowMeans(matrixB)) /
sqrt( rowMeans((matrixA-rowMeans(matrixA))^2)*(ncA/(ncA-1))/ncA +
rowMeans((matrixB-rowMeans(matrixB))^2)*(ncB/(ncB-1))/ncB )
})
# user system elapsed
# 0 0 0
head(ans)
# [1] 0.8272511 -1.0965269 0.9862844 -0.6026452 -0.2477661 1.1896181
UPDATE
Here's a "cleaner" version using a rowVars function:
rowVars <- function(x, na.rm=FALSE, dims=1L) {
rowMeans((x-rowMeans(x, na.rm, dims))^2, na.rm, dims)*(NCOL(x)/(NCOL(x)-1))
}
ans <- (rowMeans(matrixA)-rowMeans(matrixB)) /
sqrt( rowVars(matrixA)/NCOL(matrixA) + rowVars(matrixB)/NCOL(matrixB) )
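As a quick sanity check against the mapply() result from above (scores picks up names from split(), so drop them before comparing):
all.equal(unname(scores), ans)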
This solution avoids splitting and lists, so maybe it will be faster than your version:
## original data:
tmp1 <- matrix(sample(1:100, 20), nrow = 5)
tmp2 <- matrix(sample(1:100, 20), nrow = 5)
## combine them together
tmp3 <- cbind(tmp1, tmp2)
## calculate t.stats:
t.stats <- apply(tmp3, 1, function(x) t.test(x[1:ncol(tmp1)],
x[(1 + ncol(tmp1)):ncol(tmp3)])$statistic)
Edit: Just tested it on two matrices of 9000 rows and 5 columns each, and it completed in less than 6 seconds:
tmp1 <- matrix(rnorm(5 * 9000), nrow = 9000)
tmp2 <- matrix(rnorm(5 * 9000), nrow = 9000)
tmp3 <- cbind(tmp1, tmp2)
system.time(t.st <- apply(tmp3, 1, function(x) t.test(x[1:5], x[6:10])$statistic))
#  user  system elapsed
# 5.640   0.012   5.705