I have code below that calculates a frequency for each column element (relative to its own column) and adds the five frequencies together in a new column. The code works, but it is very slow and this step accounts for the majority of the processing time. Any ideas on how to accomplish the same goal more efficiently?
Create_Freq <- function(Word_List) {
  library(dplyr)
  Word_List$AvgFreq <- (Word_List %>% add_count(FirstLet))[, "n"] +
    (Word_List %>% add_count(SecLet))[, "n"] +
    (Word_List %>% add_count(ThirdtLet))[, "n"] +
    (Word_List %>% add_count(FourLet))[, "n"] +
    (Word_List %>% add_count(FifthLet))[, "n"]
  return(Word_List)
}
Edit:
To provide an example word list:
Word_List <- data.frame(Word = c("final", "first", "lover", "thing"))
Word_List$FirstLet <- substr(Word_List$Word, 1, 1)
Word_List$SecLet <- substr(Word_List$Word, 2, 2)
Word_List$ThirdtLet <- substr(Word_List$Word, 3, 3)
Word_List$FourLet <- substr(Word_List$Word, 4, 4)
Word_List$FifthLet <- substr(Word_List$Word, 5, 5)
For context, I have another function that will then choose the word with the highest "Average" frequency. (It used to be an average, but dividing by 5 was pointless since it didn't affect the max.)
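That follow-up function isn't shown here; a minimal sketch of the selection step might look like this (assuming the AvgFreq column created above):

Word_List$Word[which.max(Word_List$AvgFreq)]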
Here is one possible approach, defining a small auxiliary function f to access a list of counts. When tested, it is roughly 15 times faster on my machine.
f <- function(x, tbl) {
  res <- integer(5)
  for (i in seq_along(tbl)) {
    res[i] <- tbl[[i]][x[i]]
  }
  sum(res)
}
Word_List <- data.frame(Word = c("final", "first", "lover", "thing"))
w <- unlist(Word_List, use.names = F)
# one column per word, one row per letter position
m <- matrix(unlist(strsplit(w, ""), use.names = F), ncol = 4)
# list of per-position letter counts
lookup <- apply(m, 1, table)
Word_List$AvgFreq <- apply(m, 2, f, lookup)
   Word AvgFreq
1 final       7
2 first       7
3 lover       5
4 thing       5
Further optimizations are possible, especially using a vectorized approach.
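For example, here is a fully vectorized sketch along those lines (my own addition, assuming the FirstLet..FifthLet columns from the question): count each letter once per column with table, then look the counts up by name and sum across the five columns.

letter_cols <- c("FirstLet", "SecLet", "ThirdtLet", "FourLet", "FifthLet")
Word_List$AvgFreq <- Reduce(`+`, lapply(Word_List[letter_cols], function(col) {
  tab <- table(col)        # frequency of each letter in this position
  as.integer(tab[col])     # look up the count for every row by name
}))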
In response to Donald: using your approach ended up being much slower, but I had to make a couple of changes to get it to work with a large word list, so let me know if I messed up your methodology:
f <- function(x, tbl) {
  res <- integer(5)
  for (i in seq_along(tbl)) {
    res[i] <- tbl[[i]][x[i]]
  }
  sum(res)
}
Word_List <- data.frame(read.delim("Word List.txt"))
Word_List <- Turn_Vector_List(Word_List)
Word_List2 <- data.frame(read.delim("Word List.txt"))
Word_List_Vector <- Turn_Vector_List(Word_List2)
# Start the clock!
ptm <- proc.time()
m <- data.matrix(Word_List[2:6])
m
lookup <- apply(m, 2, table, simplify = FALSE)
lookup
Word_List$AvgFreq <- apply(m, 1, f, lookup)
# Stop the clock
ptm2 <- proc.time() - ptm
Word_List2 <- data.frame(read.delim("Word List.txt"))
Word_List_Vector <- Turn_Vector_List(Word_List2)
Word_List2 <- Create_Freq(Word_List_Vector)
ptm3 <- proc.time() - ptm - ptm2
ptm2
# user system elapsed
# 0.89 0.78 1.69
ptm3
# user system elapsed
# 0.06 0.00 0.06
Related
I am currently running some functions on large data sets for which each operation takes a long time to execute.
To see the progress of my calculations, it would be handy to print the iterations/percentage of completed calculations. With loops, this can be easily done.
However, is it possible to have something similar working for vectorized functions or pre-defined functions without actually making changes to the source code of those functions?
Example data:
generate_string taken from here: Generating Random Strings
generate_string <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
x <- generate_string(10000)
y <- generate_string(10000)
Example function to be monitored (i.e. printing the percentage completed):
library(stringdist)
# amatch will find for each element in x the index of the most similar element in y
ind <- amatch(x,y, method = "jw", maxDist = 1)
The pbapply package is an option, but it is slower than the direct call:
system.time({ind <- amatch(x,y, method = "jw", maxDist = 1)})
user system elapsed
27.79 0.05 9.72
library(pbapply)
ind <- pbsapply(x, function(xi) amatch(xi,y, method = "jw", maxDist = 1))
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 30s
Also, the option you mention (splitting the data into chunks) is less elegant but faster, and it is easily parallelizable (see the sketch after the timings below).
library(progress)
system.time({
  nloops <- 20
  pp <- floor(nloops * (0:(length(x) - 1)) / length(x)) + 1
  ind <- c()
  pb <- progress_bar$new(total = nloops)
  for (i in 1:nloops) {
    pb$tick()
    ind <- c(ind, amatch(x[pp == i], y, method = "jw", maxDist = 1))
  }
  pb$terminate()
})
[===================================================================================] 100%
user system elapsed
25.96 0.06 9.21
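As a further sketch (my own, not benchmarked here), the chunked approach combines naturally with pbapply's cluster support: pblapply accepts a cl argument, so you keep a per-chunk progress bar while parallelizing. The cluster size of 4 is arbitrary.

library(parallel)
library(pbapply)
cl <- makeCluster(4)                   # arbitrary number of workers
clusterEvalQ(cl, library(stringdist))  # workers need stringdist for amatch
chunks <- split(x, pp)                 # reuse the chunk index pp from above
res <- pblapply(chunks, function(xi, y) amatch(xi, y, method = "jw", maxDist = 1),
                y = y, cl = cl)
ind <- unlist(res, use.names = FALSE)
stopCluster(cl)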
I have two matrices on which I want to compute several statistics, comparing every row of dataframe1 with every row of dataframe2. These are large data frames (300,000 rows and 40,000 rows), so there is a lot to compare.
I made a few functions to apply the statistics. What I was wondering is whether it is possible to split dataframe1 into chunks and run these chunks in parallel on multiple cores.
library(lawstat)
library(reshape2)

df1 = matrix(ncol = 100, nrow = 100)
for (i in 1:100) {
  df1[, i] = floor(runif(100, min = 0, max = 3))
}

df2 = matrix(ncol = 100, nrow = 1000)
for (i in 1:100) {
  df2[, i] = runif(1000, min = 0, max = 1000)
}

testFunc <- function(df1, df2) {
  x = apply(df1, 1, function(x) apply(df2, 1, function(y) levene.test(y, x)$p.value))
  x = melt(x)
  return(x)
}

system.time(res <- testFunc(df1, df2))
Some of the statistics (e.g. Levene tests) take a fairly long time to compute, so any way to speed this up would be great.
There is room for optimisation in your function, but here is an example of an improvement using the parallel package:
library(parallel)
library(snow)

# I have a quad-core processor so I am using 3 cores here.
cl <- snow::makeCluster(3)

testFunc2 <- function(df1, df2) {
  x <- parallel::parApply(cl = cl, X = df1, 1, function(x, df2) {
    apply(df2, 1, function(y) lawstat::levene.test(y, x)$p.value)
  }, df2)
  x <- melt(x)
  return(x)
}

system.time(res <- testFunc2(df1, df2))
On my machine this at least halves the running time if I have a cluster size of 3.
Edit: I felt bad for dissing your code, so below is a stripped-down levene.test function that increases performance more than going parallel does on most home/work machines.
lev_lite <- function(y, group) {
  N <- 100 # or length(y)
  k <- 3   # or length(levels(group)) after setting to as.factor below

  reorder <- order(group)
  group <- group[reorder]
  y <- y[reorder]
  group <- as.factor(group)

  n <- tapply(y, group, FUN = length)
  yi_bar <- tapply(y, group, FUN = median)
  zij <- abs(y - rep(yi_bar, n))
  zidot <- tapply(zij, group, FUN = mean)
  zdotdot <- mean(zij)

  # test stat, see wiki
  W <- ((N - k) / (k - 1)) * (
    sum(n * (zidot - zdotdot)^2) /
      sum((zij - rep(zidot, n))^2))

  # p value returned
  1 - pf(W, k - 1, N - k)
}
testFunc2 <- function(df1, df2) {
  x <- apply(df1, 1, function(x) apply(df2, 1, lev_lite, group = x))
  x <- melt(x)
  return(x)
}
> system.time(res <- testFunc(df1[1:50, ],df2[1:50,] ))
user system elapsed
5.53 0.00 5.56
> system.time(res2 <- testFunc2(df1[1:50, ],df2[1:50, ] ))
user system elapsed
1.13 0.00 1.14
> max(res2 - res)
[1] 2.220446e-15
This is a ~5x improvement without parallelisation.
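If you want both, here is a hedged sketch (my own, not benchmarked) combining lev_lite with the parApply call from earlier; it assumes the cluster cl is still running and reshape2 is loaded, and testFunc3 is just a hypothetical name.

# export the helper so the workers can see it inside parApply
parallel::clusterExport(cl, "lev_lite")

testFunc3 <- function(df1, df2) {
  x <- parallel::parApply(cl = cl, X = df1, 1, function(x, df2) {
    apply(df2, 1, lev_lite, group = x)
  }, df2)
  melt(x)
}

system.time(res3 <- testFunc3(df1, df2))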
My question is about how to improve the performance of a function that downsamples from the columns of a matrix without replacement (a.k.a. "rarefaction" of a matrix... I know this has been mentioned here before, but I could not find a clear answer that a) does what I need and b) does it quickly).
Here is my function:
downsampled <- function(data, samplerate = 0.8) {
  data.test <- apply(data, 2, function(q) {
    names(q) <- rownames(data)
    samplepool <- character()
    for (i in names(q)) {
      samplepool <- append(samplepool, rep(i, times = q[i]))
    }
    sampled <- sample(samplepool, size = samplerate * length(samplepool), replace = F)
    tab <- table(sampled)
    mat <- match(names(tab), names(q))
    toret <- numeric(length = length(q))
    names(toret) <- names(q)
    toret[mat] <- tab
    return(toret)
  })
  return(data.test)
}
I need to downsample matrices with millions of entries. I find this quite slow (here I'm using a 1000x1000 matrix, which is about 20-100x smaller than my typical data size):
mat <- matrix(sample(0:40,1000*1000,replace=T),ncol=1000,nrow=1000)
colnames(mat) <- paste0("C",1:1000)
rownames(mat) <- paste0("R",1:1000)
system.time(matd <- downsampled(mat,0.8))
## user system elapsed
## 69.322 21.791 92.512
Is there a faster/easier way to perform this operation that I haven't thought of?
I think you can make this dramatically faster. If I understand what you are trying to do correctly, you want to down-sample each cell of the matrix, such that if samplerate = 0.5 and the cell of the matrix is mat[i,j] = 5, then you want to sample up to 5 things where each thing has a 0.5 chance of being sampled.
To speed things up, rather than doing all these operations on columns of the matrix, you can just loop through each cell of the matrix, draw n things from that cell by using runif (e.g., if mat[i,j] = 5, you can generate 5 random numbers between 0 and 1, and then add up the number of values that are < samplerate), and finally add the number of things to a new matrix. I think this effectively achieves the same down-sampling scheme, but much more efficiently (both in terms of running time and lines of code).
# Sample matrix
set.seed(23)
n <- 1000
mat <- matrix(sample(0:10, n * n, replace = T), ncol = n, nrow = n)
colnames(mat) <- paste0("C", 1:n)
rownames(mat) <- paste0("R", 1:n)

# Old function
downsampled <- function(data, samplerate = 0.8) {
  data.test <- apply(data, 2, function(q) {
    names(q) <- rownames(data)
    samplepool <- character()
    for (i in names(q)) {
      samplepool = append(samplepool, rep(i, times = q[i]))
    }
    sampled = sample(samplepool, size = samplerate * length(samplepool), replace = F)
    tab = table(sampled)
    mat = match(names(tab), names(q))
    toret = numeric(length = length(q))
    names(toret) <- names(q)
    toret[mat] <- tab
    return(toret)
  })
  return(data.test)
}
# New function
downsampled2 <- function(mat, samplerate = 0.8) {
  new <- matrix(0, nrow(mat), ncol(mat))
  colnames(new) <- colnames(mat)
  rownames(new) <- rownames(mat)
  for (i in 1:nrow(mat)) {
    for (j in 1:ncol(mat)) {
      new[i, j] <- sum(runif(mat[i, j], 0, 1) < samplerate)
    }
  }
  return(new)
}
# Compare times
system.time(downsampled(mat,0.8))
## user system elapsed
## 26.840 3.249 29.902
system.time(downsampled2(mat,0.8))
## user system elapsed
## 4.704 0.247 4.918
Using an example 1000 X 1000 matrix, the new function I provided runs about 6 times faster.
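As a further refinement (my own sketch, not part of the answer above): counting how many of n uniform draws fall below samplerate is just a Binomial(n, samplerate) draw, so the double loop can be replaced by a single vectorized rbinom call. downsampled3 is a hypothetical name.

downsampled3 <- function(mat, samplerate = 0.8) {
  # rbinom recycles 'size' over the cells of mat, filling the result column-major
  matrix(rbinom(length(mat), size = mat, prob = samplerate),
         nrow(mat), ncol(mat), dimnames = dimnames(mat))
}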
One source of savings would be to replace the for loop that builds samplepool with a single rep call. Here is a reproducible example:
myRows <- 1:5
names(myRows) <- letters[1:5]
# get the repeated values for sampling
samplepool <- rep(names(myRows), myRows)
Within your function, this would be
samplepool <- rep(names(q), q)
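A quick check, using the reproducible example above, that the single rep call builds the same pool as the loop:

loop_pool <- character()
for (i in names(myRows)) {
  loop_pool <- append(loop_pool, rep(i, times = myRows[i]))
}
identical(loop_pool, rep(names(myRows), myRows))  # should be TRUE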
I am trying to build a function that creates a vector in which no item is the sum of any combination of other items in the vector (without duplication).
This function does the job but is quite slow... any bright thoughts on how to improve it?
sum_fun <- function(k) {
  out_list <- c(2, 3, 4)
  new_num <- 4
  while (length(out_list) < k) {
    new_num <- new_num + 1
    # Check if new_num can be written as a sum of the terms in out_list
    new_valid <- T
    for (i in 2:(length(out_list) - 1)) {
      if (new_num %in% (apply(combn(out_list, i), FUN = sum, MAR = 2))) {
        new_valid <- F
        break
      }
    }
    if (new_valid) {
      out_list <- c(out_list, new_num)
    }
  }
  return(out_list)
}
This was a good question. I made some changes to your original function and got mine to run a bit quicker than your function. On a side note, how many are you trying to find?
The main idea is that we shouldn't calculate more things more often than we absolutely have to. I think the for loop was probably slowing things down a bit, plus, how many of the column sums were repeated? If we can "de-dup" the list, maybe we can search through it more quickly [reduce, reuse, recycle :) ].
sum_fun2 <- function(k) {
  out_list <- c(2, 3, 4) # dummy list
  new_num <- 4           # dummy number
  calc_big_sum <- T      # calculate big sum on the first go
  while (length(out_list) < k) {
    new_num <- new_num + 1 # dummy number to add
    # calculate big sum, and then find unique values
    if (calc_big_sum) {
      big_sum <- unique(unlist(lapply(lapply(2:(length(out_list) - 1),
                                             FUN = function(x) combn(out_list, m = x)),
                                      FUN = function(y) apply(y, 2, sum))))
    }
    if (new_num %in% big_sum) {
      calc_big_sum = F # don't make it calculate the sum again
    } else {
      out_list <- c(out_list, new_num) # add number to list
      calc_big_sum = T                 # make it calculate a new sum
    }
  }
  return(out_list)
}
> system.time(sum_fun2(10))
user system elapsed
0.03 0.00 0.03
> system.time(sum_fun(10))
user system elapsed
1.30 0.00 1.27
> system.time(sum_fun2(14))
user system elapsed
3.35 0.07 3.47
> system.time(sum_fun(14))
## I ended it
Timing stopped at: 39.86 0 40.02
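A further hedged sketch of my own (not part of the answer above): instead of recomputing combn() sums, keep a running set of attainable subset sums and update it incrementally each time a number is accepted. Note that this version forbids sums of all subset sizes >= 2, while the loops above stop at size length(out_list) - 1, so the two can disagree on a number that equals the sum of the entire current list.

sum_fun3 <- function(k) {
  out_list <- c(2, 3, 4)
  sums <- c(5, 6, 7, 9)  # all sums of subsets of size >= 2 of c(2, 3, 4)
  new_num <- 4
  while (length(out_list) < k) {
    new_num <- new_num + 1
    if (!(new_num %in% sums)) {
      # new subset sums created by new_num: pair it with each existing element,
      # or extend each existing subset sum
      sums <- unique(c(sums, out_list + new_num, sums + new_num))
      out_list <- c(out_list, new_num)
    }
  }
  out_list
}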
I have two matrices that I want to apply a function to, by rows:
matrixA
          GSM83009 GSM83037 GSM83002 GSM83029 GSM83041
100001_at 5.873321 5.416164 3.512227 6.064150 3.713696
100005_at 5.807870 6.810829 6.105804 6.644000 6.142413
100006_at 2.757023 4.144046 1.622930 1.831877 3.694880
matrixB
          GSM82939 GSM82940 GSM82974 GSM82975
100001_at 3.673556 2.372952 3.228049 3.555816
100005_at 6.916954 6.909533 6.928252 7.003377
100006_at 4.277985 4.856986 3.670161 4.075533
I've found several similar questions, but not a whole lot of answers: mapply for matrices, Multi matrix row-wise mapply?. The code I have now splits the matrices by row into lists, but having to split them makes it rather slow and not much faster than a for loop, considering I have almost 9000 rows in each matrix:
scores <- mapply(t.test.stat, split(matrixA, row(matrixA)), split(matrixB, row(matrixB)))
The function itself is very simple, just finding the t-value:
t.test.stat <- function(x, y) {
  return((mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y)))
}
Splitting the matrices isn't the biggest contributor to evaluation time.
set.seed(21)
matrixA <- matrix(rnorm(5 * 9000), nrow = 9000)
matrixB <- matrix(rnorm(4 * 9000), nrow = 9000)
system.time( scores <- mapply(t.test.stat,
split(matrixA, row(matrixA)), split(matrixB, row(matrixB))) )
# user system elapsed
# 1.57 0.00 1.58
smA <- split(matrixA, row(matrixA))
smB <- split(matrixB, row(matrixB))
system.time( scores <- mapply(t.test.stat, smA, smB) )
# user system elapsed
# 1.14 0.00 1.14
Look at the output from Rprof to see that most of the time is, not surprisingly, spent evaluating t.test.stat (mean, var, etc.). Basically, there's quite a bit of overhead from function calls.
Rprof()
scores <- mapply(t.test.stat, smA, smB)
Rprof(NULL)
summaryRprof()
You may be able to find faster generalized solutions, but none will approach the speed of the vectorized solution below.
Since your function is simple, you can take advantage of the vectorized rowMeans function to do this almost instantaneously (though it's a bit messy):
system.time({
  ncA <- NCOL(matrixA)
  ncB <- NCOL(matrixB)
  ans <- (rowMeans(matrixA) - rowMeans(matrixB)) /
    sqrt(rowMeans((matrixA - rowMeans(matrixA))^2) * (ncA / (ncA - 1)) / ncA +
         rowMeans((matrixB - rowMeans(matrixB))^2) * (ncB / (ncB - 1)) / ncB)
})
# user system elapsed
# 0 0 0
head(ans)
# [1] 0.8272511 -1.0965269 0.9862844 -0.6026452 -0.2477661 1.1896181
UPDATE
Here's a "cleaner" version using a rowVars function:
rowVars <- function(x, na.rm = FALSE, dims = 1L) {
  rowMeans((x - rowMeans(x, na.rm, dims))^2, na.rm, dims) * (NCOL(x) / (NCOL(x) - 1))
}
ans <- (rowMeans(matrixA) - rowMeans(matrixB)) /
  sqrt(rowVars(matrixA) / NCOL(matrixA) + rowVars(matrixB) / NCOL(matrixB))
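As a quick sanity check (assuming the scores vector from the mapply call above is still in the workspace), the vectorized result should match it up to names and floating-point tolerance:

all.equal(unname(scores), ans)  # should be TRUE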
This solution avoids splitting and lists, so maybe it will be faster than your version:
## original data:
tmp1 <- matrix(sample(1:100, 20), nrow = 5)
tmp2 <- matrix(sample(1:100, 20), nrow = 5)
## combine them together
tmp3 <- cbind(tmp1, tmp2)
## calculate t.stats:
t.stats <- apply(tmp3, 1, function(x) t.test(x[1:ncol(tmp1)],
x[(1 + ncol(tmp1)):ncol(tmp3)])$statistic)
Edit: Just tested it on two matrices of 9000 rows and 5 columns each, and it completed in less than 6 seconds:
tmp1 <- matrix(rnorm(5 * 9000), nrow = 9000)
tmp2 <- matrix(rnorm(5 * 9000), nrow = 9000)
tmp3 <- cbind(tmp1, tmp2)
system.time(t.st <- apply(tmp3, 1, function(x) t.test(x[1:5], x[6:10])$statistic))
   user  system elapsed
  5.640   0.012   5.705