The following code runs a loop, but the problem is the speed: it takes several hours to finish, and I am looking for an alternative so that I don't have to wait so long.
Basically, the code does the following calculations:
1. It calculates the mean of the values of the 60 days.
2. It gets the standard deviation of the values of the 60 days.
3. It gets the Max of the values of the 60 days.
4. It gets the Min of the values of the 60 days.
5. Then, with the previous calculations, the code "smooths" the peaks up and down.
6. Then the code simply gets the means over the last 60, 30, 15 and 7 days.
So the purpose of this code is to remove the peaks in the data using the method described above.
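In other words, for a single row the smoothing in step 5 amounts to clamping the values to the band [max(mean - sd, 0), mean + sd]. A minimal sketch of that rule on a made-up vector (the values here are purely hypothetical):
x <- c(5, 200, 30, NA, 950, 40)          # hypothetical daily demand values
m <- mean(x, na.rm = TRUE)               # step 1: mean
s <- sd(x, na.rm = TRUE)                 # step 2: standard deviation
upper <- m + s                           # step 3: Max
lower <- max(m - s, 0)                   # step 4: Min, never below zero
x_smooth <- pmin(pmax(x, lower), upper)  # step 5: clamp the peaks; NAs stay NA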
Here is the code:
options(stringsAsFactors=F)
DAT <- data.frame(ITEM = "x", CLIENT = as.numeric(1:100000), matrix(sample(1:1000, 60, replace=T), ncol=60, nrow=100000, dimnames=list(NULL,paste0('DAY_',1:60))))
DATT <- DAT
nRow <- nrow(DAT)
TMP <- NULL
for(iROW in 1:nRow){ # iROW <- 1
  print(c(iROW, nRow))
  Demand <- NULL
  for(iCOL in 3:ncol(DAT)){ # iCOL <- 1
    Demand <- c(Demand, DAT[iROW, iCOL])
  }
  ww <- which(!is.na(Demand))
  if(length(ww) > 0){
    Average <- round(mean(Demand[ww]), digits = 4)
    DesvEst <- round(sd(Demand, na.rm = T), digits = 4)
    Max <- round(Average + (1 * DesvEst), digits = 4)
    Min <- round(max(Average - (1 * DesvEst), 0), digits = 4)
    Demand <- round(ifelse(is.na(Demand), Demand, ifelse(Demand > Max, Max, ifelse(Demand < Min, Min, Demand))))
    Prom60 <- round(mean(Demand[ww]), digits = 4)
    Prom30 <- round(mean(Demand[intersect(ww, (length(Demand) - 29):length(Demand))]), digits = 4)
    Prom15 <- round(mean(Demand[intersect(ww, (length(Demand) - 14):length(Demand))]), digits = 4)
    Prom07 <- round(mean(Demand[intersect(ww, (length(Demand) - 6):length(Demand))]), digits = 4)
  } else {
    Average <- DesvEst <- Max <- Min <- Prom60 <- Prom30 <- Prom15 <- Prom07 <- NA
  }
  DAT[iROW, 3:ncol(DAT)] <- Demand
  TMP <- rbind(TMP, cbind(DAT[iROW, ], Average, DesvEst, Max, Min, Prom60, Prom30, Prom15, Prom07))
}
DAT <- TMP
If one runs your code (with a smaller number of rows) through a profiler, one sees that the main issue is the rbind at the end, followed by the c() mentioned by @Riverarodrigoa:
We can focus on these two by creating numeric matrices of suitable size and working with those. Only at the end is the final data.frame created:
options(stringsAsFactors=F)
N <- 1000
set.seed(42)
DAT <- data.frame(ITEM = "x",
CLIENT = as.numeric(1:N),
matrix(sample(1:1000, 60, replace=T), ncol=60, nrow=N, dimnames=list(NULL,paste0('DAY_',1:60))))
nRow <- nrow(DAT)
TMP <- matrix(0, ncol = 8, nrow = N,
dimnames = list(NULL, c("Average", "DesvEst", "Max", "Min", "Prom60", "Prom30", "Prom15", "Prom07")))
DemandMat <- as.matrix(DAT[,3:ncol(DAT)])
for(iROW in 1:nRow){
  Demand <- DemandMat[iROW, ]
  ww <- which(!is.na(Demand))
  if(length(ww) > 0){
    Average <- round(mean(Demand[ww]), digits = 4)
    DesvEst <- round(sd(Demand, na.rm = T), digits = 4)
    Max <- round(Average + (1 * DesvEst), digits = 4)
    Min <- round(max(Average - (1 * DesvEst), 0), digits = 4)
    Demand <- round(ifelse(is.na(Demand), Demand, ifelse(Demand > Max, Max, ifelse(Demand < Min, Min, Demand))))
    Prom60 <- round(mean(Demand[ww]), digits = 4)
    Prom30 <- round(mean(Demand[intersect(ww, (length(Demand) - 29):length(Demand))]), digits = 4)
    Prom15 <- round(mean(Demand[intersect(ww, (length(Demand) - 14):length(Demand))]), digits = 4)
    Prom07 <- round(mean(Demand[intersect(ww, (length(Demand) - 6):length(Demand))]), digits = 4)
  } else {
    Average <- DesvEst <- Max <- Min <- Prom60 <- Prom30 <- Prom15 <- Prom07 <- NA
  }
  DemandMat[iROW, ] <- Demand
  TMP[iROW, ] <- c(Average, DesvEst, Max, Min, Prom60, Prom30, Prom15, Prom07)
}
DAT <- cbind(DAT[,1:2], DemandMat, TMP)
For 1,000 rows this takes about 0.2 s instead of over 4 s. For 10,000 rows I get 2 s instead of 120 s.
Obviously, this is not particularly pretty code. One could do this much more nicely using the tidyverse or data.table. I just find it worth noting that for loops are not necessarily slow in R; dynamically growing data structures, however, is.
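For completeness, here is a sketch of how the whole loop could be replaced by row-wise vectorised operations (rowMeans, pmin/pmax), starting from the freshly generated DAT before the loop modifies it. This is my own reformulation of the logic above rather than code from the answer, and it is not benchmarked here:
DemandMat <- as.matrix(DAT[, 3:ncol(DAT)])
Average <- round(rowMeans(DemandMat, na.rm = TRUE), 4)
DesvEst <- round(apply(DemandMat, 1, sd, na.rm = TRUE), 4)
Max     <- round(Average + DesvEst, 4)
Min     <- round(pmax(Average - DesvEst, 0), 4)
# clamp every value into [Min, Max], row by row; NAs are preserved
Smoothed <- round(pmin(pmax(DemandMat, Min), Max))
Prom60 <- round(rowMeans(Smoothed, na.rm = TRUE), 4)
Prom30 <- round(rowMeans(Smoothed[, 31:60], na.rm = TRUE), 4)
Prom15 <- round(rowMeans(Smoothed[, 46:60], na.rm = TRUE), 4)
Prom07 <- round(rowMeans(Smoothed[, 54:60], na.rm = TRUE), 4)
# rows that are entirely NA would still need the special-casing from the loop above
DAT2 <- cbind(DAT[, 1:2], Smoothed, Average, DesvEst, Max, Min, Prom60, Prom30, Prom15, Prom07)
If matrixStats is available, matrixStats::rowSds would remove the remaining apply() call as well.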
Related
Hello, I am trying to speed up a block of code that currently works but is quite slow for the amount of data I have. I need to identify the top n% highest values in each row and then use this to compute an average by subsetting a data frame and averaging the values of the subset. Any help or suggestions would be appreciated. This is my current approach:
corrMat <- matrix(runif(944*9843), nrow=944, ncol = 9843)
GeneExpression <- matrix(runif(11674*9843, min=0, max=100), nrow = 11674, ncol=9843)
cutOff <- apply(corrMat, MARGIN = 1, FUN = quantile, 0.99)
topCells <- corrMat > cutOff
data <- matrix(, nrow = nrow(topCells), ncol = nrow(GeneExpression))
colnames(data) <- rownames(GeneExpression)
for(i in colnames(data)){
  for(j in 1:nrow(topCells)){
    data[j,i] <- mean(t(GeneExpression[i, topCells[j,]]))
  }
}
data
Here's a smaller version of your example along with my base R solution. Chances are there's also a neat tidyverse way of doing this but I wouldn't know.
corrMat <- matrix(runif(24*18), nrow=24)
GeneExpression <- matrix(runif(36*18, min=0, max=100), nrow = 36)
cutOff <- apply(corrMat, MARGIN = 1, FUN = quantile, 0.99)
topCells <- corrMat > cutOff
data <- data2 <- matrix(, nrow = nrow(topCells), ncol = nrow(GeneExpression))
colnames(data) <- rownames(GeneExpression) # rownames are NULL so this is not needed
start <- Sys.time() # benchmarking
for(i in 1:ncol(data)){ # iterate by column rather than colname
  for(j in 1:nrow(topCells)){
    data[j,i] <- mean(t(GeneExpression[i, topCells[j,]]))
  }
}
eric <- Sys.time() - start
start <- Sys.time()
# apply over rows of topCells to take row means of GeneExpression
# per row of topCells
# then just transpose
data2 <- t(apply(topCells, 1, function(x) rowMeans(GeneExpression[, x, drop = F])))
milan <- Sys.time() - start
all(data == data2)
[1] TRUE
eric
Time difference of 0.08776498 secs
milan
Time difference of 0.02593184 secs
Using your original example data, my solution takes 6.43s to run.
Hope this helps.
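As a further thought (my own addition, not part of the answer above, and not benchmarked here): because each cell of the result is just a mean over a selected set of columns, the whole computation can also be written as one matrix product divided by the per-row counts of selected cells, using the same objects as above:
sums   <- GeneExpression %*% t(topCells)   # per gene, sum over the cells each row of topCells selects
counts <- rowSums(topCells)                # number of selected cells per row of topCells
data3  <- t(sums) / counts                 # divide each row by its count to get means
all.equal(data2, data3)                    # should agree with the apply() version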
I am trying to identify the most probable group that an observation belongs to, for several thousand large datasets. It is possible that some of the data is incorrectly classified, and I am trying to work out the most likely "true" value. I have tried to use knn3 from the caret package, but the predictions take too long to compute. In researching alternatives I have come across the nn2 function from the RANN package, which performs a nearest-neighbour search that is significantly faster than k-nearest neighbours.
library(RANN)
library(tidyverse)
iris.scaled <- iris %>%
mutate_if(is.numeric, scale)
iris.nn2 <- nn2(iris.scaled[1:4])
The result of the nn2 function is two lists, one of indices and one of distances. I want to use the indices table to work out the most likely grouping of each observation; however, it returns the row number of the observation and not its group. I need to replace this with the group it belongs to (in this case, the Species column).
distance.index <- iris.nn2$nn.idx[,-1]
target = iris.scaled$Species
I have removed the first column as the first nearest neighbour is always the observation itself.
matrix(target[distance.index[,]], nrow = nrow(distance.index), ncol = ncol(distance.index))
This code gives me the output I want, but is there a tidier way of creating this table and then calculating the most common response for each row? Speed of calculation is the key concern.
Your scaling can be a real bottleneck when you have more columns (tested on a 200 x 22216 gene expression matrix). My version might not seem that impressive on the iris dataset, but on the larger dataset I get 1.3 s vs. 32.8 s execution time.
Using tabulate instead of table gives an additional improvement, which is dwarfed, however, by the matrix scaling.
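(A toy illustration of the tabulate vs. table point, my own example rather than part of the benchmark below:)
x <- c(1L, 2L, 2L, 3L, 3L, 3L)
table(x)                # named contingency table: counts 1, 2, 3 under the names "1", "2", "3"
tabulate(x, nbins = 3)  # plain integer counts by position: 1 2 3, with much less overhead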
I used a custom scale function here, but using base::scale on a matrix would already be a major improvement.
I also addressed the issue raised by M. Papenberg of "self" not being considered the nearest neighbor by setting those to NA.
invisible(lapply(c("tidyverse", "matrixStats", "RANN", "microbenchmark", "compiler"),
require, character.only=TRUE))
enableJIT(3)
# faster column scaling (modified from https://www.r-bloggers.com/author/strictlystat/)
colScale <- function(x, center = TRUE, scale = TRUE, rows = NULL, cols = NULL) {
  if (!is.null(rows) && !is.null(cols)) {
    x <- x[rows, cols, drop = FALSE]
  } else if (!is.null(rows)) {
    x <- x[rows, , drop = FALSE]
  } else if (!is.null(cols)) {
    x <- x[, cols, drop = FALSE]
  }
  cm <- colMeans(x, na.rm = TRUE)
  if (scale) {
    csd <- matrixStats::colSds(x, center = cm, na.rm = TRUE)
  } else {
    csd <- rep(1, length = length(cm))
  }
  if (!center) cm <- rep(0, length = length(cm))
  x <- t((t(x) - cm) / csd)
  return(x)
}
# your posted version (mostly):
oldv <- function(){
  iris.scaled <- iris %>%
    mutate_if(is.numeric, scale)
  iris.nn2 <- nn2(iris.scaled[1:4])
  distance.index <- iris.nn2$nn.idx[,-1]
  target <- iris.scaled$Species
  category_neighbours <- matrix(target[distance.index[,]],
                                nrow = nrow(distance.index), ncol = ncol(distance.index))
  class <- apply(category_neighbours, 1, function(x) {
    x1 <- table(x)
    names(x1)[which.max(x1)]
  })
  cbind(iris, class)
}
## my version:
myv <- function(){
  iris.scaled <- colScale(data.matrix(iris[, 1:(dim(iris)[2] - 1)]))
  iris.nn2 <- nn2(iris.scaled)
  # set self-neighbours to NA
  iris.nn2$nn.idx[iris.nn2$nn.idx - seq_len(dim(iris.nn2$nn.idx)[1]) == 0] <- NA
  # match up categories
  category_neighbours <- matrix(iris$Species[iris.nn2$nn.idx[,]],
                                nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
  # turn category_neighbours into numeric for tabulate
  cn <- matrix(as.numeric(factor(category_neighbours, exclude = NULL)),
               nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
  cnl <- levels(factor(category_neighbours, exclude = NULL))
  # tabulate frequencies and match up with factor levels
  class <- apply(cn, 1, function(x) {
    cnl[which.max(tabulate(x, nbins = length(cnl))[!is.na(cnl)])]
  })
  cbind(iris, class)
}
microbenchmark(oldv(), myv(), times=100L)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> oldv() 11.015986 11.679337 12.806252 12.064935 12.745082 33.89201 100 b
#> myv() 2.430544 2.551342 3.020262 2.612714 2.691179 22.41435 100 a
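If you want to avoid the per-row apply() for the vote as well, the majority vote itself can be vectorised with rowSums and max.col. This is a sketch of my own (using the iris.nn2 object computed in the question, dropping the self-neighbour column, ties going to the first level), not part of the benchmarked answer:
idx    <- iris.nn2$nn.idx[, -1]                       # drop the self-neighbour column
neigh  <- matrix(as.integer(iris$Species)[idx],       # class id of every neighbour
                 nrow = nrow(idx), ncol = ncol(idx))
counts <- sapply(seq_along(levels(iris$Species)),     # votes per class, one column per level
                 function(k) rowSums(neigh == k))
class  <- levels(iris$Species)[max.col(counts, ties.method = "first")]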
I am looking for an efficient way in R to derive possible combinations.
I have a data frame with 3 columns, and on the basis of the first column's contents I am calculating all the possible combinations.
df <- data.frame("H" = c("H1","H2","H3","H4"), "W1" = c(95, 0, 85 ,0) , "W2" = c(50, 85, 0,0))
df$H <- as.character.factor(df$H)
nH <- nrow(df)
nW <- 2
library(plyr)
library(gtools)
if(nW<=5){
  # Find all possible combinations
  mat1 <- matrix(nrow = 0, ncol = nH)
  for(i in 1:nH){
    # mat1 <- rbind.fill.matrix(mat1, combinations(nH, nH-(i-1), df$H))
    mat1 <- rbind.fill.matrix(mat1, t(combn(df$H, nH-(i-1))))
  }
  df_comb <- data.frame(mat1)
}
View(df_comb)
df_comb gives the correct output. The code above works well for small data sets, but when the H column has more than 15 values, R runs out of memory.
I am looking for ways to calculate the combinations in the above scenario efficiently in R, up to H1, H2, ..., H49, H50.
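For scale (my own note, not part of the original question): the total number of non-empty combinations grows as 2^nH - 1, which is why full enumeration breaks down well before 50 items:
sum(choose(15, 1:15))  # 32767 combinations for 15 items -- still manageable
sum(choose(50, 1:50))  # about 1.13e15 for 50 items -- impossible to enumerate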
EDIT:
I tried a different approach. After a certain number of possible combinations (32767 in the case below), I apply random sampling to generate the combinations, allocating them to each group size in proportion to its share of the total (a ratio method).
nH <- 26
nW <- 2
if(nW <= 5){
  # Find all possible combinations ~~~~~ Random Sampling
  ncomb <- 0
  for(i in 1:nH){
    ncomb <- ncomb + choose(nH, nH-(i-1))
  }
  nmax <- 10000 # Total number of combinations cannot exceed 10000
  mat1 <- matrix(nrow = 0, ncol = nH)
  for(i in 1:nH){ # For each group: 26C26, 26C25, ..., 26C2, 26C1
    ncombi <- choose(nH, nH-(i-1)) # For i = 1, this is 26C26
    ncombComputed <- ceiling(nmax/ncomb*choose(nH, nH-(i-1)))
    if(ncomb <= 32767){ # This condition is independent of nmax; 32767 = 2^15 - 1, the total for 15 items
      # enumerate the combinations exhaustively
      print("first")
      final <- mat1
      print(paste(nH, " ", i))
      abc <- combinations(nH, nH-(i-1), df$H)
      mat1 <- rbind.fill.matrix(mat1, combinations(nH, nH-(i-1), df$H))
    } else {
      print(i)
      print("second")
      combi <- matrix(nrow = 0, ncol = nH-(i-1))
      # random sampling
      while(nrow(combi) < ncombComputed){
        combi <- rbind(combi, sort(sample(df$H, nH-(i-1))))
        combi <- unique(combi)
      }
      mat1 <- rbind.fill.matrix(mat1, combi)
    }
  }
  df_comb_New <- data.frame(mat1)
}
The above code gives the result, but for 26 entries it takes 36 seconds to produce 10,000 combinations. Is there a way to optimise the while loop so that execution becomes faster, or any other way to achieve the same result more efficiently?
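One direction that might help (a sketch of my own, not tested against your full data): the unique() inside the while loop rescans the accumulated matrix after every single draw. Drawing candidate combinations in batches and de-duplicating once per batch keeps the same logic but does far less repeated work. The helper name below is hypothetical:
# sample n_needed distinct k-item combinations from `items`
# (caller must keep n_needed below choose(length(items), k), or this loops forever)
sample_combs <- function(items, k, n_needed, batch = 2L * n_needed) {
  out <- matrix(items[0], nrow = 0, ncol = k)
  while (nrow(out) < n_needed) {
    cand <- matrix(replicate(batch, sort(sample(items, k))), ncol = k, byrow = TRUE)
    out  <- unique(rbind(out, cand))   # de-duplicate once per batch, not once per draw
  }
  out[seq_len(n_needed), , drop = FALSE]
}
# e.g. inside the else branch: combi <- sample_combs(df$H, nH - (i - 1), ncombComputed)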
I have a dataset where a subset of the measurements for each entry is randomly missing:
dat <- matrix(runif(100), nrow=10)
rownames(dat) <- letters[1:10]
colnames(dat) <- paste("time", 1:10)
dat[sample(100, 25)] <- NA
I am interested in calculating correlations between each row in this dataset (i.e., a-a, a-b, a-c, a-d, ...). However, I would like to exclude correlations where there are fewer than 5 pairwise non-NA observations by setting their value to NA in the resulting correlation matrix.
Currently I am doing this as follows:
cor <- cor(t(dat), use = 'pairwise.complete.obs')
names <- rownames(dat)
filter <- sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5))
cor[filter] <- NA
However, this operation is very slow as the actual dataset contains >1,000 entries.
Is there way to filter cells based on the number of non-NA pairwise observations in a vectorized manner, instead of within nested loops?
You can count the number of non-NA pairwise observations using a matrix approach.
Let's use this data-generation code. I made the data larger and added more NAs.
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow=nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
Then your filter code takes 85 seconds:
tic = proc.time()
names = rownames(dat)
filter = sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5));
toc = proc.time();
show(toc-tic);
# 85.50 seconds
My version creates a matrix with values 1 for non-NAs in the original data. Then, using matrix multiplication, I calculate the number of pairwise non-NAs. It runs in a fraction of a second.
tic = proc.time()
NAmat = matrix(0, nrow = nr, ncol = nc)
NAmat[ !is.na(dat) ] = 1;
filter2 = (tcrossprod(NAmat) < 5)
toc = proc.time();
show(toc-tic);
# 0.09 seconds
Simple check shows the results are the same:
all(filter == filter2)
# TRUE
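To complete the workflow from the question, the filter is then applied to the correlation matrix as before (using a name other than cor for the result avoids masking the base function):
corMat <- cor(t(dat), use = "pairwise.complete.obs")
corMat[filter2] <- NA   # mask correlations with fewer than 5 pairwise observations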
I am trying to merge two data frames. The original data frame is much larger than the data frame it is being merged with; however, there is only one possible match for each row. The match is found on the type (a factor) and the level. The level is an integer that falls into one of several buckets (the example only has two).
My current method works but uses sapply and is slow for large numbers of rows. How can I vectorise this operation?
set.seed(123)
sample <- 100
data <- data.frame(type= sample(LETTERS[1:4], sample, replace=TRUE), level =round(runif(sample, 1,sample)), value = round(runif(sample, 200,1000)))
data2 <- data.frame(type= rep(LETTERS[1:4],2), lower= c(rep(1,4), rep(51,4)), upper = c(rep(50,4), rep(sample,4)), cost1 = runif(8, 0,1), cost2 = runif(8, 0,1),cost3 = runif(8, 0,1))
data2[,4:6] <- data2[,4:6]/rowSums(data2[,4:6]) #turns the variables in to percentages, not necessary on real data
x <- unlist(sapply(1:sample, function(n) which(ll <-data$type[n]==data2$type & data$level[n] >= data2$lower & data$level[n] <= data2$upper)))
data3 <- cbind(data, percentage= data2[x, -c(1:3)])
If I understand the matching problem you've set up, the following code seems to speed things up a bit by dividing data by type and then using cut to find the proper bucket. I think it will accommodate larger numbers of pairs of lower and upper values but haven't checked carefully.
library(plyr)
percents <- function(value, cost) {
  cost <- cost[cost[,1] == value[1,1], ]
  cost <- cost[order(cost[,2]), ]
  ints <- cut(value[,2], breaks = c(t(cost[,2:3])), labels = FALSE, include.lowest = TRUE)
  cbind(value, percentage = cost[ceiling(ints/2), -(1:3)])
}
data4 <- rbind.fill(mapply(percents, value=split(data, data$type), cost=list(data2), SIMPLIFY=FALSE) )
Setting
sample <- 10000
gives the following execution time comparisons
microbenchmark({x <- unlist(sapply(1:sample, function(n) which(ll <-data$type[n]==data2$type & data$level[n] >= data2$lower & data$level[n] <= data2$upper)));
data3 <- cbind(data, percentage= data2[x, -c(1:3)])} ,
data4 <- rbind.fill(mapply(percents, value=split(data, data$type), cost=list(data2), SIMPLIFY=FALSE) ),
times=10)
Unit: milliseconds
expr
{ x <- unlist(sapply(1:sample, function(n) which(ll <- data$type[n] == data2$type & data$level[n] >= data2$lower & data$level[n] <= data2$upper))) data3 <- cbind(data, percentage = data2[x, -c(1:3)]) }
data4 <- rbind.fill(mapply(percents, value = split(data, data$type), cost = list(data2), SIMPLIFY = FALSE))
min lq mean median uq max neval
1198.18269 1214.10560 1225.85117 1226.79838 1234.2671 1258.63122 10
20.81022 20.93255 21.50001 21.24237 22.1305 22.65291 10
where the first numbers are for the code shown in your question and the second times are for the code in my post. For this case, the new code seems almost 60 times faster.
Edit
To use rbind_all and avoid mapply, use the following:
microbenchmark({x <- unlist(sapply(1:sample, function(n) which(ll <-data$type[n]==data2$type & data$level[n] >= data2$lower & data$level[n] <= data2$upper)));
data3 <- cbind(data, percentage= data2[x, -c(1:3)])} ,
data4 <- rbind_all(lapply(split(data, data$type), percents, cost=data2 )),
times=10)
which gives slightly improved execution times
min lq mean median uq max neval
1271.57023 1289.17614 1297.68572 1301.84540 1308.31476 1313.56822 10
18.33819 18.57373 23.28578 19.53742 19.95132 58.96143 10
Edit 2
A modification that uses the data2$lower values (plus the overall maximum of data2$upper) for setting the intervals:
percents <- function(value, cost) {
  cost <- cost[cost[,"type"] == value[1,"type"], ]
  cost <- cost[order(cost[,"lower"]), ]
  # with right = FALSE, include.lowest = TRUE closes the last interval at max(upper)
  ints <- cut(value[,"level"], breaks = c(cost[,"lower"], max(cost[,"upper"])),
              labels = FALSE, right = FALSE, include.lowest = TRUE)
  cbind(value, percentage = cost[ints, -(1:3)])
}
to use with
data4 <- rbind_all(lapply(split(data, data$type), percents, cost=data2 ))
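Another option worth mentioning (a sketch of my own, not benchmarked against the answer above) is a data.table non-equi join, which expresses the "type matches and level falls in [lower, upper]" condition directly. Note that in the result the join columns lower and upper carry the matched level values, so some renaming may be needed afterwards:
library(data.table)
dt  <- as.data.table(data)   # type, level, value
dt2 <- as.data.table(data2)  # type, lower, upper, cost1, cost2, cost3
# one row per match: equal type, lower <= level and upper >= level
data5 <- dt2[dt, on = .(type, lower <= level, upper >= level)]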