Can this plot_normal_distribution function be optimized? - r

I've tried to optimize a function that I wrote a few weeks ago.
It got faster, but it is still slow. So I profiled it with Rprof() and found that split() takes most of the time, which makes me think this function can be written much better. Can it be done?!
normDist_V2 <- function(size = 1e5, precision = 1, ...)
{
  data <- rnorm(size)
  roundedData <- round(data, precision)
  framedData <- data.frame(cbind(data, roundedData))
  factoredData <- split(framedData$data, framedData$roundedData)
  actualsize <- size/10^precision
  X <- names(factoredData)
  Probability <- sapply(factoredData, length) / actualsize
  plot(X, Probability, ...)
}
Current speed:
system.time(normDist_V2(size=1e7, precision = 2)) #11.14 sec

normDist_V2 <- function(size = 1e5, precision = 1, ...) {
  require(data.table)
  data <- rnorm(size)
  roundedData <- round(data, precision)
  framedData <- data.table(data, roundedData)
  actualsize <- size/10^precision
  dt <- framedData[, .N, keyby = roundedData]
  X <- dt$roundedData
  Probability <- dt$N/actualsize
  plot(X, Probability, ...)
}
system.time(normDist_V2(size=1e7, precision = 2)) # 1.26 sec
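For comparison, here is a hedged base-R sketch that avoids split() altogether by tabulating the rounded values with table(); normDist_V3 is an illustrative name, not from the original post, and whether it beats the data.table version will depend on size and your machine:
normDist_V3 <- function(size = 1e5, precision = 1, ...) {
  roundedData <- round(rnorm(size), precision)
  counts <- table(roundedData)                      # counts per rounded value, one pass
  X <- as.numeric(names(counts))
  Probability <- as.vector(counts) / (size / 10^precision)
  plot(X, Probability, ...)
}
system.time(normDist_V3(size = 1e7, precision = 2))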

Related

Speeding up linear model fitting on complete pairwise observations in large sparse matrix in R

I have a numeric data.frame df with 134946 rows x 1938 columns.
99.82% of the data are NA.
For each pair of (distinct) columns "P1" and "P2", I need to find which rows have non-NA values for both and then do some operations on those rows (linear model).
I wrote a script that does this, but it seems quite slow.
This post seems to discuss a related task, but I can't immediately see if or how it can be adapted to my case.
Borrowing the example from that post:
set.seed(54321)
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow=nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
df <- as.data.frame(dat)
df_ps <- names(df)
N_ps <- length(df_ps)
My script is:
tic = proc.time()
out <- do.call(rbind, sapply(1:(N_ps-1), function(i) {
  if (i/10 == floor(i/10)) {       # progress report every 10 outer iterations
    cat("\ni = ", i, "\n")
    toc = proc.time();
    show(toc-tic);
  }
  do.call(rbind, sapply((i+1):N_ps, function(j) {
    w <- which(complete.cases(df[,i], df[,j]))
    N <- length(w)
    if (N >= 5) {                  # only fit pairs with at least 5 complete observations
      xw <- df[w,i]
      yw <- df[w,j]
      if ((diff(range(xw)) != 0) & (diff(range(yw)) != 0)) {
        s <- summary(lm(yw~xw))
        o <- c(i, j, N, s$adj.r.squared,
               s$coefficients[2], s$coefficients[4], s$coefficients[8],
               s$coefficients[1], s$coefficients[3], s$coefficients[7])
      } else {
        o <- c(i, j, N, rep(NA,7))
      }
    } else {o <- NULL}
    return(o)
  }, simplify = F))
}, simplify = F))
toc = proc.time();
show(toc-tic);
This takes about 10 minutes on my machine.
You can imagine what happens when I need to handle a much larger (although more sparse) data matrix. I never managed to finish the calculation.
Question: do you think this could be done more efficiently?
The thing is, I don't know which operations take the most time: subsetting df (in which case I would avoid the repeated subsetting), appending matrix data (in which case I would fill a flat vector and convert it to a matrix at the end), or something else.
Thanks!
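One way to see where the time goes (a sketch, not from the original post; pairwise_profile.out is just an illustrative file name) is to profile a reduced run of the same loop with Rprof():
Rprof("pairwise_profile.out")
tmp <- lapply(1:49, function(i) {
  sapply((i + 1):50, function(j) {
    w <- which(complete.cases(df[, i], df[, j]))
    if (length(w) >= 5) summary(lm(df[w, j] ~ df[w, i]))$adj.r.squared else NA
  })
})
Rprof(NULL)
head(summaryRprof("pairwise_profile.out")$by.self)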
EDIT following up from minem's post
As shown by minem, the speed of this calculation strongly depended on the way linear regression parameters were calculated. Therefore changing that part was the single most important thing to do.
My own further trials showed two things. 1) It was essential to store the results with sapply combined with do.call(rbind, ...) rather than with any flat vector (I am still not sure why; I might make a separate post about this). 2) On the original matrix I am working on, which is much sparser and has a much larger nrow/ncol ratio than the one in this example, using the information on the x vector, available at the start of each i iteration, to reduce the y vector at the start of each j iteration increased the speed by several orders of magnitude, even compared with minem's original script, which was already much better than mine above.
I suppose the advantage comes from filtering out many rows a priori, thus avoiding costly xna & yna operations on very long vectors.
The modified script is the following:
set.seed(54321)
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow = nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.90)] = NA
df <- as.data.frame(dat)
df_ps <- names(df)
N_ps <- length(df_ps)
tic = proc.time()
naIds <- lapply(df, function(x) !is.na(x))
dl <- as.list(df)
rl <- sapply(1:(N_ps - 1), function(i) {
  if ((i-1)/10 == floor((i-1)/10)) {
    cat("\ni = ", i, "\n")
    toc = proc.time();
    show(toc-tic);
  }
  x <- dl[[i]]
  xna <- which(naIds[[i]])
  rl2 <- sapply((i + 1):N_ps, function(j) {
    y <- dl[[j]][xna]              # keep only the elements of y where x is non-NA
    yna <- which(naIds[[j]][xna])
    w <- xna[yna]
    N <- length(w)
    if (N >= 5) {
      xw <- x[w]
      yw <- y[yna]
      if ((min(xw) != max(xw)) && (min(yw) != max(yw))) {
        # extracts from lm/lm.fit/summary.lm functions
        X <- cbind(1L, xw)
        m <- .lm.fit(X, yw)
        # calculate adj.r.squared
        fitted <- yw - m$residuals
        rss <- sum(m$residuals^2)
        mss <- sum((fitted - mean(fitted))^2)
        n <- length(m$residuals)
        rdf <- n - m$rank
        # rdf <- df.residual
        r.squared <- mss/(mss + rss)
        adj.r.squared <- 1 - (1 - r.squared) * ((n - 1L)/rdf)
        # calculate se & pvals
        p1 <- 1L:m$rank
        Qr <- m$qr
        R <- chol2inv(Qr[p1, p1, drop = FALSE])
        resvar <- rss/rdf
        se <- sqrt(diag(R) * resvar)
        est <- m$coefficients[m$pivot[p1]]
        tval <- est/se
        pvals <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)
        res <- c(m$coefficients[2], se[2], pvals[2],
                 m$coefficients[1], se[1], pvals[1])
        o <- c(i, j, N, adj.r.squared, res)
      } else {
        o <- c(i, j, N, rep(NA, 7))
      }
    } else {o <- NULL}
    return(o)
  }, simplify = F)
  do.call(rbind, rl2)
}, simplify = F)
out2 <- do.call(rbind, rl)
toc = proc.time();
show(toc - tic)
E.g. try with nr=100000; nc=100.
I should probably mention that I tried using indices, i.e.:
naIds <- lapply(df, function(x) which(!is.na(x)))
and then obviously generating w by intersection:
w <- intersect(xna,yna)
N <- length(w)
This however is slower than the above.
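A rough micro-benchmark sketch (not from the original post; sizes and NA fractions are only illustrative) of why combining logical masks tends to beat intersecting index vectors:
x <- rnorm(1e6); x[sample(1e6, 9e5)] <- NA
y <- rnorm(1e6); y[sample(1e6, 9e5)] <- NA
xl <- !is.na(x); yl <- !is.na(y)   # logical masks
xi <- which(xl); yi <- which(yl)   # index vectors
system.time(for (k in 1:200) w1 <- which(xl & yl))
system.time(for (k in 1:200) w2 <- intersect(xi, yi))
identical(w1, w2)  # TRUE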
The largest bottleneck is the lm function, because it performs a lot of checks and additional calculations that you do not necessarily need, so I extracted only the parts that are needed.
I got this to run in +/- 18 seconds.
set.seed(54321)
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow = nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
df <- as.data.frame(dat)
df_ps <- names(df)
N_ps <- length(df_ps)
tic = proc.time()
naIds <- lapply(df, function(x) !is.na(x)) # outside loop
dl <- as.list(df) # sub-setting list elements is faster than sub-setting columns
rl <- sapply(1:(N_ps - 1), function(i) {
  x <- dl[[i]]
  xna <- naIds[[i]] # logical vector marking the non-NA elements of x
  rl2 <- sapply((i + 1):N_ps, function(j) {
    y <- dl[[j]]
    yna <- naIds[[j]]
    w <- xna & yna
    N <- sum(w)
    if (N >= 5) {
      xw <- x[w]
      yw <- y[w]
      if ((min(xw) != max(xw)) && (min(yw) != max(yw))) { # faster than diff(range())
        # extracts from lm/lm.fit/summary.lm functions
        X <- cbind(1L, xw)
        m <- .lm.fit(X, yw)
        # calculate adj.r.squared
        fitted <- yw - m$residuals
        rss <- sum(m$residuals^2)
        mss <- sum((fitted - mean(fitted))^2)
        n <- length(m$residuals)
        rdf <- n - m$rank
        # rdf <- df.residual
        r.squared <- mss/(mss + rss)
        adj.r.squared <- 1 - (1 - r.squared) * ((n - 1L)/rdf)
        # calculate se & pvals
        p1 <- 1L:m$rank
        Qr <- m$qr
        R <- chol2inv(Qr[p1, p1, drop = FALSE])
        resvar <- rss/rdf
        se <- sqrt(diag(R) * resvar)
        est <- m$coefficients[m$pivot[p1]]
        tval <- est/se
        pvals <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)
        res <- c(m$coefficients[2], se[2], pvals[2],
                 m$coefficients[1], se[1], pvals[1])
        o <- c(i, j, N, adj.r.squared, res)
      } else {
        o <- c(i, j, N, rep(NA, 7)) # pad to the same row length as the full result
      }
    } else {o <- NULL}
    return(o)
  }, simplify = F)
  do.call(rbind, rl2)
}, simplify = F)
out2 <- do.call(rbind, rl)
out2 <- do.call(rbind, rl)
toc = proc.time();
show(toc - tic);
# user system elapsed
# 17.94 0.11 18.44
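As a hedged sanity check (not part of the original answer; the toy data below is made up), the quantities extracted from .lm.fit can be compared against summary(lm()) for a single pair:
set.seed(1)
xw <- runif(50)
yw <- 2 * xw + rnorm(50)
m <- .lm.fit(cbind(1L, xw), yw)
s <- summary(lm(yw ~ xw))
all.equal(unname(coef(s)[, "Estimate"]), unname(m$coefficients))  # TRUE
rss <- sum(m$residuals^2)
r2 <- 1 - rss / sum((yw - mean(yw))^2)
all.equal(1 - (1 - r2) * (50 - 1) / (50 - 2), s$adj.r.squared)    # TRUE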

How to unscale rolled scaled data

I am working with time-series data, and the model seems to improve by around 10% when the data are scaled over a rolling window. To compute the rolling scaling, I wrote the following function:
library(data.table)
d <- data.table(index = 1:100, x = rnorm(100))
roll_scale <- function(x, n = 3, ...) {
  xout <- frollapply(x, n, FUN = function(z) c(scale(z)[n, 1]), ...)
  return(xout)
}
d[, sData := roll_scale(x)]
(This is based on a comment on the question "how can I use rollapply with scale".)
After modelling the data, I would like to predict the most recent time points and bring the outcome back to the original scale. However, this seems trickier than I expected. Here is my attempt using DMwR::unscale:
roll_UNscale <- function(scaled, Ref_values, n = 3, ...) {
  xout <- frollapply(scaled, n, FUN = function(z, Ref_values) {
    c(DMwR::unscale(scaled, scale(Ref_values))[n, ])
  }, ..., Ref_values = Ref_values)
  return(xout)
}
d[, roll_UNscale(sData, x)]
I hope you can give me a hand
After a day of thinking, I have come up with the solution. In a nutshell, what I need to roll is only the reference vector. Here it is:
roll_UNscale <- function(scaled, Ref_values, n = 3, align = "right", ...) {
  means <- frollmean(Ref_values, n = n)                 # rolling means of the reference
  stds <- frollapply(x = Ref_values, n = n, FUN = sd)   # rolling standard deviations
  unscaled <- scaled * stds + means                     # invert (x - mean) / sd
  return(unscaled)
}
I hope this can help someone.
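A quick hedged check (not in the original post; xBack is just an illustrative column name) that the rolling unscale inverts the rolling scale wherever the window is complete:
d[, xBack := roll_UNscale(sData, x)]
d[!is.na(xBack), all.equal(xBack, x)]  # TRUE: the original series is recovered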

speed problems with odesolver in R

I have a differential equation model in R that uses the ode solver from the deSolve package. However, at the moment the model runs very slowly. I think this might have something to do with the function that I feed to the solver being poorly written, but I can't figure out what exactly is slowing it down or how I might speed it up. Does anyone have any ideas?
I've made an example that works in a similar way to mine:
library(data.table)
library(deSolve)
matrix_1 <- matrix(runif(100),10,10)
matrix_1[which(matrix_1 > 0.5)] <- 1
matrix_1[which(matrix_1 < 0.5)] <- 0
matrix_2 <- matrix(runif(100),10,10)
matrix_2[which(matrix_2 > 0.5)] <- 1
matrix_2[which(matrix_2 < 0.5)] <- 0
group_ID <- rep(c(1,2), 5)
N <- runif(10, 0, 100000)
Nchange <- function(t, N, parameters) {
  with(as.list(c(N, parameters)), {
    N_per_1 <- matrix_1 * N_per_connection
    N_per_1[is.na(N_per_1)] <- 0
    total_N_2 <- as.vector(N_per_1)
    if (nrow(as.matrix(N_per_1)) > 1) {
      total_N_2 <- colSums(N_per_1[drop = FALSE])
    }
    N_per_1_cost <- N_per_1
    for (i in possible_competition) {
      column <- as.vector(N_per_1[, i])
      if (sum(column) > 0) {
        active_groups <- unique(group_ID[column > 0])
        if (length(active_groups) > 1) {
          group_ID_dets <- data.table("group_ID" = group_ID, "column" = column, "n_IDS" = 1:length(group_ID))
          group_ID_dets$portions <- ave(group_ID_dets$column, group_ID_dets$group_ID, FUN = function(x) x / sum(x))
          group_ID_dets[is.na(group_ID_dets)] <- 0
          totals <- as.vector(unlist(tapply(group_ID_dets$column, group_ID_dets$group_ID, function(x) sum(x))))
          totals[is.na(totals)] <- 0
          totals <- totals*2 - sum(totals)
          totals[totals < 0] <- 0
          group_ID_totals <- data.table("group_ID" = unique(group_ID), "totals" = as.vector(totals))
          group_ID_dets$totals <- group_ID_totals$totals[match(group_ID_dets$group_ID, group_ID_totals$group_ID)]
          N_per_1[, i] <- group_ID_dets$totals * group_ID_dets$portions
        }
      }
    }
    res_per_1 <- N_per_1 * 0.1
    N_per_2 <- matrix_2 * N_per_connection
    N_per_2[is.na(N_per_2)] <- 0
    res_per_2 <- N_per_2 * 0.1
    dN <- rowSums(res_per_1) - rowSums(N_per_1_cost * 0.00003) + rowSums(res_per_2) -
      rowSums(N_per_2 * 0.00003) - N*0.03
    list(c(dN))
  })
} # function describing the differential equations
N_per_connection <- N/(rowSums(matrix_1) + rowSums(matrix_2))
possible_competition <- which(colSums(matrix_1 != 0)>1)
times <- seq(0, 100, by = 1)
out <- ode(y = N, times = times, func = Nchange, parms = NULL)
A good way to identify the bottleneck is with a profiler, and the profvis package provides a good way of drilling down into the results. Wrapping your code in p <- profvis({YourCodeInHere}) and then viewing the results with print(p) gives the following insights:
The lines that are taking the most time are (in descending order of time taken):
group_ID_totals <- data.table("group_ID" = unique(group_ID), "totals" = as.vector(totals))
group_ID_dets$portions <- ave(group_ID_dets$column, group_ID_dets$group_ID, FUN = function(x) x / sum(x))
group_ID_dets <- data.table("group_ID" = group_ID, "column"= column, "n_IDS" = 1:length(group_ID))
totals <- as.vector(unlist(tapply(group_ID_dets$column, group_ID_dets$group_ID, function(x) sum(x))))
group_ID_dets$totals <- group_ID_totals$totals[match(group_ID_dets$group_ID, group_ID_totals$group_ID)]
I'm not familiar with the details of your ODE, but you should focus on optimising these tasks. I think the larger issue is that you're running these commands in a loop. Often, you'll hear that loops are slow in R, but a more nuanced discussion of this issue is found in the answers here. Some tips there might help you restructure your code/loop. Good luck!
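A minimal sketch of that profiling step, assuming the setup code and the Nchange() function from the question are already defined:
library(profvis)
p <- profvis({
  out <- ode(y = N, times = times, func = Nchange, parms = NULL)
})
print(p)  # interactive view showing which lines of Nchange() dominate the run time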

R split DF and run tests in parallel

I have two matrices on which I want to run several statistical tests, comparing every row of dataframe1 with every row of dataframe2. These are large data frames (300,000 rows and 40,000 rows), so there is a lot to compare.
I wrote a few functions to apply the statistics. What I was wondering is whether it is possible to split dataframe1 into chunks and run these chunks in parallel on multiple cores.
library(lawstat)
library(reshape2)
df1 = matrix(ncol = 100, nrow = 100)
for (i in 1:100) {
  df1[, i] = floor(runif(100, min = 0, max = 3))
}
df2 = matrix(ncol = 100, nrow = 1000)
for (i in 1:100) {
  df2[, i] = runif(1000, min = 0, max = 1000)
}
testFunc <- function(df1, df2) {
  x = apply(df1, 1, function(x) apply(df2, 1, function(y) levene.test(y, x)$p.value))
  x = melt(x)
  return(x)
}
system.time(res <- testFunc(df1,df2 ))
Some of the statistics (e.g. levene tests) take a fairly long time to compute so any ways I can speed this up would be great.
There is room for optimisation in your function but here is an example of an improvement using the parallel package:
library(parallel)
library(snow)
# I have a quad core processor so I am using 3 cores here.
cl <- snow::makeCluster(3)
testFunc2 <- function(df1, df2) {
  x <- parallel::parApply(cl = cl, X = df1, 1, function(x, df2) apply(df2, 1,
         function(y) lawstat::levene.test(y, x)$p.value), df2)
  x <- melt(x)
  return(x)
}
system.time(res <- testFunc2(df1,df2 ))
On my machine this at least halves the running time if I have a cluster size of 3.
Edit: I felt bad for dissing your code, so below is a stripped-down levene.test function that increases performance more than going parallel does on most home/work machines.
lev_lite <- function(y, group){
  N <- 100 # or length(y)
  k <- 3   # or length(levels(group)) after setting to as.factor below
  reorder <- order(group)
  group <- group[reorder]
  y <- y[reorder]
  group <- as.factor(group)
  n <- tapply(y, group, FUN = length)
  yi_bar <- tapply(y, group, FUN = median)
  zij <- abs(y - rep(yi_bar, n))
  zidot <- tapply(zij, group, FUN = mean)
  zdotdot <- mean(zij)
  # test stat, see wiki
  W <- ((N - k)/(k - 1)) * (
    sum(n*(zidot - zdotdot)^2)/
      sum((zij - rep(zidot, n))^2))
  # p val returned
  1 - pf(W, k-1, N-k)
}
testFunc2 <- function(df1, df2){
  x <- apply(df1, 1, function(x) apply(df2, 1, lev_lite, group = x))
  x <- melt(x)
  return(x)
}
> system.time(res <- testFunc(df1[1:50, ],df2[1:50,] ))
user system elapsed
5.53 0.00 5.56
> system.time(res2 <- testFunc2(df1[1:50, ],df2[1:50, ] ))
user system elapsed
1.13 0.00 1.14
> max(res2 - res)
[1] 2.220446e-15
This is a ~5x improvement without parallelisation.
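The two ideas can also be combined; here is a hedged sketch, assuming cl, lev_lite, df1 and df2 from above are still available (testFunc3 is only an illustrative name):
snow::clusterExport(cl, "lev_lite")  # make the helper available on the workers
testFunc3 <- function(df1, df2) {
  x <- parallel::parApply(cl = cl, X = df1, MARGIN = 1,
                          function(g, df2) apply(df2, 1, lev_lite, group = g), df2)
  melt(x)
}
system.time(res3 <- testFunc3(df1, df2))
snow::stopCluster(cl)  # release the workers when finished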

How to avoid for loop when computing xts weighted sum?

How can I avoid the for loop when calculating an xts weighted sum, as I am trying to do below?
library(xts)
thetaSum <- function(theta, w = c(1, 1, 1)) {
  sum(coredata(theta) * rev(w))
}
n <- 10
tmpVec <- rep(1, n)
tmpDates <- seq(as.Date("2000-01-01"), length = n, by = "day")
theta <- xts(tmpVec, order.by = tmpDates)
N <- 3
thetaSummed <- xts(rep(NA, n), order.by = tmpDates)
for (i in N:n) {
  thetaTemp <- theta[(i - N + 1):i, ]
  thetaSummed[i] <- thetaSum(thetaTemp, w = rep(1, N))
}
thetaSummed
N is a look-back period smaller than n.
What are some fast alternatives to the for loop?
You can use rollapply.
rollapplyr(theta, width=3, FUN=thetaSum, fill=NA)
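A quick hedged check (not part of the original answer) that the rollapplyr result matches the loop; with unit weights this is simply a rolling sum:
rolled <- rollapplyr(theta, width = N, FUN = thetaSum, fill = NA)
all.equal(as.numeric(rolled), as.numeric(thetaSummed))  # TRUE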
