R: Efficient Rolling Calculations by Group - r

I have some asset data in the middle of a dplyr pipeline similar to this:
fcast <- data.frame(group = rep(c('a','b'), each = 12),
                    yr = rep(2018:2019, each = 6, times = 2),
                    mo = rep(c(7:12, 1:6), times = 2),
                    book_value = c(10000, rep(0, times = 11), 15000, rep(0, times = 11)),
                    accum_depr = c(200, rep(0, times = 11), 700, rep(0, times = 11)),
                    depr_rate = .02,
                    depr_expense = c(10, rep(0, times = 11), 15, rep(0, times = 11)),
                    book_addn = c(0,0,0,0,80,0,0,40,0,0,0,0,0,0,100,70,0,0,0,0,0,0,0,0),
                    book_growth = 1.01)
I need to apply some (ideally, tidy) rolling function to each group like the one below, which does not work at the moment.
roll_depr <- function(.data) {
  r_d <- .data$depr_rate[1]
  r_g <- .data$book_growth[1]
  for (i in 2:length(.data$depreciation_rate)) {
    .data$book_value[i] <- .data$book_value[i-1]*r_g + .data$book_addn[i]
    .data$depr_expense[i] <- (.data$book_value[i] - .data$accum_depr[i-1])*r_d
    .data$accum_depr[i] <- .data$accum_depr[i-1] + .data$depr_expense[i]
  }
  return(.data)
}
To further complicate things, this calculation will be performed in a shiny dashboard repeatedly as users input new values for book_addn. The actual dataset is much larger, and for loops don't cut it.
I know a better solution must exist with data.table or apply, but I haven't been able to figure it out. Bonus points if this can be accomplished from within the pipeline!
EDIT: I'm expecting the code to output the following table. Basically, the book_value grows at 1% of the previous value, plus any additions in the period. The depr_expense takes the book_value net of the previous accum_depr, and multiplies by the depr_rate. Finally, accum_depr updates to account for the newly-calculated depr_expense.
group yr mo book_value accum_depr depr_rate depr_expense book_addn book_growth
a 2018 7 10000.00 200.00 0.02 10.00 0 1.01
a 2018 8 10100.00 398.00 0.02 198.00 0 1.01
a 2018 9 10201.00 594.06 0.02 196.06 0 1.01
a 2018 10 10303.01 788.24 0.02 194.18 0 1.01
a 2018 11 10486.04 982.20 0.02 193.96 80 1.01
a 2018 12 10590.90 1174.37 0.02 192.17 0 1.01
a 2019 1 10696.81 1364.82 0.02 190.45 0 1.01
a 2019 2 10843.78 1554.40 0.02 189.58 40 1.01
a 2019 3 10952.22 1742.35 0.02 187.96 0 1.01
a 2019 4 11061.74 1928.74 0.02 186.39 0 1.01
a 2019 5 11172.35 2113.61 0.02 184.87 0 1.01
a 2019 6 11284.08 2297.02 0.02 183.41 0 1.01
b 2018 7 15000.00 700.00 0.02 15.00 0 1.01
b 2018 8 15150.00 989.00 0.02 289.00 0 1.01
b 2018 9 15401.50 1277.25 0.02 288.25 100 1.01
b 2018 10 15625.52 1564.22 0.02 286.97 70 1.01
b 2018 11 15781.77 1848.57 0.02 284.35 0 1.01
b 2018 12 15939.59 2130.39 0.02 281.82 0 1.01
b 2019 1 16098.98 2409.76 0.02 279.37 0 1.01
b 2019 2 16259.97 2686.76 0.02 277.00 0 1.01
b 2019 3 16422.57 2961.48 0.02 274.72 0 1.01
b 2019 4 16586.80 3233.99 0.02 272.51 0 1.01
b 2019 5 16752.67 3504.36 0.02 270.37 0 1.01
b 2019 6 16920.19 3772.68 0.02 268.32 0 1.01

This can actually be done at decent speed with two simple functions that implement for loops, used inside mutate.
The key is to recognize that book_value can be calculated independently in its own loop. Once that has been done, accum_depr[i] is only a function of accum_depr[i-1] and book_value[i]. The depr_expense can be extracted as the difference between accum_depr and its lag, but I don't need it for my purposes.
expn[i] = (book[i] - accum_depr[i-1])*depr_rate
accum_depr[i] = accum_depr[i-1] + expn[i]
Which implies
accum_depr[i] = accum_depr[i-1]*(1-depr_rate) + book_value[i]*depr_rate
The code:
roll_book <- function(book_val, addn, g_rate) {
  z <- rep(0, length(book_val))
  z[1] <- book_val[1]
  for (i in 2:length(book_val)) {
    z[i] <- z[i-1]*g_rate[1] + addn[i]
  }
  return(z)
}
roll_depr <- function(accum_depr, book_val, depr_rate) {
  r_d <- depr_rate[1]
  z <- rep(0, length(accum_depr))
  z[1] <- accum_depr[1]
  for (i in 2:length(accum_depr)) {
    z[i] <- book_val[i]*r_d + z[i-1]*(1-r_d)
  }
  return(z)
}
fcast <- fcast %>%
  group_by(group) %>%
  mutate(book_value = roll_book(book_value, book_addn, book_growth),
         accum_depr = roll_depr(accum_depr, book_value, depr_rate))
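As noted above, depr_expense can be recovered afterwards as the difference between accum_depr and its lag, keeping the given first-period value. A minimal sketch of that extra step (I did not need it myself):
# optional: recover depr_expense from the updated accum_depr
fcast <- fcast %>%
  group_by(group) %>%
  mutate(depr_expense = coalesce(accum_depr - lag(accum_depr), depr_expense))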
On my dataset with ~110,000 rows and ~450 groups:
Unit: milliseconds
min lq mean median uq max neval
65.01492 67.14825 70.80178 69.85741 72.53611 98.75224 100
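For what it's worth, the same helper functions drop straight into a data.table grouped update, which may be worth benchmarking against the dplyr version on the real data. A sketch, not timed:
library(data.table)
setDT(fcast)
fcast[, book_value := roll_book(book_value, book_addn, book_growth), by = group]
fcast[, accum_depr := roll_depr(accum_depr, book_value, depr_rate), by = group]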

Related

How to obtain percentile of variable and the rate of dummy for all

How can I obtain the percentiles of each variable and the rate of the dummy flag for all columns of the object x automatically, using base R or SAS? Here is an example using R:
# sample data
set.seed(123)
x <- data.frame(var1=runif(100), var2=runif(100), flag=rbinom(100, size=1, prob=0.7))
x
# calculate percentile of each column
r <- apply(x, 2, function(x) quantile(x, probs=seq(0,1,0.05)))
res <- data.frame(item_id=rownames(r), r, row.names = NULL)
# assign group for each percentile
res$group <- seq_along(res$item_id)
res
# find the bin of the variable (var1, var2, ...) given percentile bin (interval);
x$bin_var1 <- findInterval(x$var1, res$var1)
x
# calculate the occurrence and rate of the dummy flag per bucket (no = count of flag==0; yes = count of flag==1; total = total obs per bucket; rate_var = rate for var1)
op <- data.frame(with(x, aggregate(flag, list(bin_var1), FUN=function(x) c(sum(x==0),sum(x==1), length(x), sum(x==1)/length(x)))))
op1 <- data.frame(do.call(data.frame, op))
colnames(op1) <- c("group","no","yes","total","rate_var1")
op1
# merge
final <- merge(res, op1, by="group")
final
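For the base-R side of the question, the per-variable steps above could be wrapped in a loop over the value columns. A rough sketch, reusing x and res from above and assuming the columns of interest are var1 and var2:
var_cols <- c("var1", "var2")   # columns to process (assumed names)
rate_by_bin <- lapply(var_cols, function(v) {
  bin <- findInterval(x[[v]], res[[v]])   # percentile bin for each observation
  agg <- aggregate(x$flag, list(group = bin),
                   FUN = function(f) c(no = sum(f == 0), yes = sum(f == 1),
                                       total = length(f), rate = mean(f == 1)))
  data.frame(variable = v, group = agg$group, agg$x)
})
do.call(rbind, rate_by_bin)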
In the SAS solution I'm missing how to include the rate, i.e. the ratio of flag==1 to all flag values. In R I'm using the findInterval function to assign the bin and then calculate the rate, sum(flag==1)...; this is the part I'm not sure how to do in SAS.
Example:
data x;
length groups $12;
input groups Var1 Var2 Flag;
datalines;
constrict 3.50 1.09 1
constrict 0.75 1.50 0
constrict 0.70 3.50 1
no_constrict 1.10 1.70 1
no_constrict 0.90 0.45 1
no_constrict 0.55 2.75 1
no_constrict 1.40 2.33 0
constrict 2.30 1.64 1
constrict 0.85 1.415 0
no_constrict 1.80 1.80 1
no_constrict 0.95 1.36 1
no_constrict 1.50 1.36 0
constrict 0.60 1.50 0
constrict 0.95 1.90 0
constrict 1.60 0.40 1
constrict 2.35 0.03 1
no_constrict 1.10 2.20 0
constrict 0.80 3.33 0
no_constrict 0.75 1.90 0
;
proc univariate data=x noprint;
class groups;
var var1
var2
flag
;
output out=res pctlpts=0 to 100 by 10 pctlpre=var1_
var2_
flag_
;
proc sql;
create table op1
as select a.*,
b.*
from x
as a
left join res
as b
on a.groups=b.groups;
quit;

How can I get row-wise max based on condition of specific column in R dataframe?

I'm trying to get the maximum value BY ROW across several columns (climatic water deficit -- def_59_z_#) depending on how much time has passed (time since fire -- YEAR.DIFF). Here are the conditions:
If 1 year has passed, select the deficit value for the first year (def59_z_1).
If 2 years: max deficit of first 2 years.
If 3 years: max of deficit of first 3 years.
If 4 years: max of deficit of first 4 years.
If 5 or more years: max of first 5 years.
However, I am unable to extract a row-wise max when I include a condition. There are several existing posts that address row-wise min and max (examples 1 and 2) and sd (example 3) -- but these don't use conditions. I've tried using apply but I haven't been able to find a solution when I have multiple columns involved as well as a conditional requirement.
The following code simply returns 3.5 in the new column def59_z_max15, which is the maximum value that occurs in the dataframe -- except when YEAR.DIFF is 1, in which case def59_z_1 is directly returned. But for all the other conditions, I want 0.98, 0.67, 0.7, 1.55, 1.28 -- values that reflect the row maximum of the specified columns. Link to sample data here. How can I achieve this?
I appreciate any/all suggestions!
data <- data %>%
  mutate(def59_z_max15 = ifelse(YEAR.DIFF == 1, def59_z_1,
                         ifelse(YEAR.DIFF == 2, max(def59_z_1, def59_z_2),
                         ifelse(YEAR.DIFF == 3, max(def59_z_1, def59_z_2, def59_z_3),
                         ifelse(YEAR.DIFF == 4, max(def59_z_1, def59_z_2, def59_z_3, def59_z_4),
                                max(def59_z_1, def59_z_2, def59_z_3, def59_z_4, def59_z_5))))))
Throw this function into an apply-family function:
func <- function(x) {
  first.val <- x[1]
  if (first.val < 5) {
    return(max(x[2:(first.val + 1)]))
  } else {
    return(max(x[2:6]))
  }
}
Your desired output should be obtained by:
apply(data, 1, function(x) func(x)) #do it by row by setting arg2 = 1
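Note that apply converts the data frame to a matrix, so this assumes the columns are numeric and in the expected order (YEAR.DIFF first, then def59_z_1 through def59_z_5). A small sketch that makes the column selection explicit, assuming those column names:
cols <- c("YEAR.DIFF", paste0("def59_z_", 1:5))
data$def59_z_max15 <- apply(data[, cols], 1, func)   # row-wise over the numeric columns only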
An option would be to get the pmax (row-wise max, vectorized) for each set of conditions separately in a loop (map): if the value of 'YEAR.DIFF' is 1, select only 'def59_z_1'; for 2, take the max of 'def59_z_1' and 'def59_z_2'; ...; for 5, the max of 'def59_z_1' to 'def59_z_5'. Then coalesce the columns together and replace the remaining NAs with the pmax of all the 'def59_z' columns.
library(tidyverse)
out <- map_dfc(1:5, ~
df1 %>%
select(seq_len(.x) + 1) %>%
transmute(val = na_if((df1[["YEAR.DIFF"]] == .x)*
pmax(!!! rlang::syms(names(.))), 0))) %>%
transmute(def59_z_max15 = coalesce(!!! rlang::syms(names(.)))) %>%
bind_cols(df1, .)%>%
mutate(def59_z_max15 = case_when(is.na(def59_z_max15) ~
pmax(!!! rlang::syms(names(.)[2:6])), TRUE ~ def59_z_max15))
head(out, 10)
# YEAR.DIFF def59_z_1 def59_z_2 def59_z_3 def59_z_4 def59_z_5 def59_z_max15
#1 5 0.25 -2.11 0.98 -0.07 0.31 0.98
#2 9 0.67 0.65 -0.27 0.52 0.26 0.67
#3 10 0.56 0.33 0.03 0.70 -0.09 0.70
#4 2 -0.34 1.55 -1.11 -0.40 0.94 1.55
#5 4 0.98 0.71 0.41 1.28 -0.14 1.28
#6 3 0.71 -0.17 1.70 -0.57 0.43 1.70
#7 4 -1.39 -1.71 -0.89 0.78 1.22 0.78
#8 4 -1.14 -1.46 -0.72 0.74 1.32 0.74
#9 2 0.71 1.39 1.07 0.65 0.29 1.39
#10 1 0.28 0.82 -0.64 0.45 0.64 0.28
data
df1 <- read.csv("https://raw.githubusercontent.com/CaitLittlef/random/master/data.csv")

Simulating tournament results with R

In R, how does one run a tournament simulation?
I have the probabilities of each team's chance of winning against each of the other teams, for example:
prob_res <- matrix(round(runif(64),2), 8, 8)
prob_res[lower.tri(prob_res, diag = TRUE)] <- 0
prob_res <- as.data.frame(prob_res)
colnames(prob_res) <- 1:8
rownames(prob_res) <- 1:8
Which would mean something like this:
1 2 3 4 5 6 7 8
1 0 0.76 0.35 0.81 0.95 0.08 0.47 0.26
2 0 0.00 0.24 0.34 0.54 0.48 0.53 0.54
3 0 0.00 0.00 0.47 0.51 0.68 0.50 0.80
4 0 0.00 0.00 0.00 0.52 0.59 0.38 0.91
5 0 0.00 0.00 0.00 0.00 0.05 0.88 0.64
6 0 0.00 0.00 0.00 0.00 0.00 0.23 0.65
7 0 0.00 0.00 0.00 0.00 0.00 0.00 0.77
8 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The next step would be to run a set of simulations, say n = 100000
First the quarter-finals (best out of 3):
1 vs 8
2 vs 7
3 vs 6
4 vs 5
And then the winners of each pair face off in the semi-finals:
1-8 winner VS 4-5 winner
2-7 winner VS 3-6 winner
The winners move on to the final. All rounds are best of 3.
What approach/package could I use to run bracket simulations? I did find a package called mRchmadness but it's too specific to handle this simulation.
I have created some dummy code that can help you figure out how to do it. The code is not optimized at all, but it is written step by step so it is easy to follow.
prob_res <- matrix(round(runif(64),2), 8, 8)
prob_res[lower.tri(prob_res, diag = TRUE)] <- 0
prob_res <- as.data.frame(prob_res)
colnames(prob_res) <- 1:8
rownames(prob_res) <- 1:8
prob_res
## Total number of combinations
posscombi<-t(combn(1:8, 2))
## This function gives you the winners of n repetition matches for every possible pair of teams.
## It simulates the whole league, assuming the winning probabilities are static.
League <- function(repetitionMatches, posscomb, prob_res) {
  TotalVect <- integer(0)
  for (i in 1:nrow(posscomb)) {
    pair <- posscomb[i,]
    Vect <- sample(pair,
                   size = repetitionMatches,
                   prob = c(prob_res[pair[1], pair[2]], 1 - prob_res[pair[1], pair[2]]),
                   replace = TRUE)
    TotalVect <- c(TotalVect, Vect)
  }
  return(table(TotalVect))
}
Result<-League(100,posscomb = posscombi, prob_res= prob_res)
Myorder<-order(Result)
### Quarters
pair1<- c(names(Result)[Myorder[c(1,8)]])
pair2<- c(names(Result)[Myorder[c(2,7)]])
pair3<- c(names(Result)[Myorder[c(3,6)]])
pair4<- c(names(Result)[Myorder[c(4,5)]])
## This function gives you the result of n matches (3 in the example)
PlayMatch <- function(pairs, numMatches) {
  Res <- sample(pairs, size = numMatches,
                prob = c(prob_res[pairs[1], pairs[2]], 1 - prob_res[pairs[1], pairs[2]]),
                replace = TRUE)
  return(table(Res))
}
# Results of the matches
winner1<-PlayMatch(pairs = pair1, 3)
winner2<-PlayMatch(pairs = pair2, 3)
winner3<-PlayMatch(pairs = pair3, 3)
winner4<-PlayMatch(pairs = pair4, 3)
## Semis
#Choosing the winning teams
pair1<- c(names(winner1)[which.max(winner1)],names(winner2)[which.max(winner2)])
pair2<- c(names(winner3)[which.max(winner3)],names(winner4)[which.max(winner4)])
winner1<-PlayMatch(pairs = pair1, 3)
winner2<-PlayMatch(pairs = pair2, 3)
## Final
# Same as before
pair1<- c(names(winner1)[which.max(winner1)],names(winner2)[which.max(winner2)])
winner1<-PlayMatch(pairs = pair1, 3)
paste0( "team ",names(winner1)[which.max(winner1)], " is the winner!")

Mean of quartile for multiple columns and multiple dates

I'm trying to find the mean forward return (column fwd_rtn) of each quartile for each column (i.e. quartiles of PB, PE, PS) for each date group (1/1/2016...1/4/2016).
head(df)
Date Stock Price PB PE PS fwd_rtn
1 1/1/2016 A 11.90 0.4 0.10 0.57 -0.015
2 1/1/2016 B 3.56 0.8 0.09 0.26 -0.036
3 1/1/2016 C 1.29 1.2 0.18 1.60 0.10
......
4 1/4/2016 A 12.80 0.39 0.13 0.53 -0.01
5 1/4/2016 B 4.03 0.76 0.08 0.23 0.02
6 1/4/2016 C 1.83 0.87 0.14 1.16 0.03
So far I have been able to find the mean return for one column for one date using this code:
df$qPB <- cut(df$PB, breaks = quantile(df$PB, c(0,.25,.5,.75,1)),include.lowest = TRUE)
aggregate(df$fwd_rtn,list(qPB = df$qPB),FUN=mean)
which gave me the right answers. But I'm struggling to do it for multiple columns. I think I'm supposed to use dplyr and the gather() function, but I don't know how.
To get quartiles of a single variable by date you can use the ave function:
df$qPB <- ave(df$PB, df$Date, FUN = function(i)
  cut(i, breaks = quantile(i, c(0, .25, .5, .75, 1)), include.lowest = TRUE))
# a minor addition to aggregate
aggregate(df$fwd_rtn, list("qPB"=df$qPB, "date"=df$Date), FUN=mean)
You should take a look at using lapply or sapply to move through multiple columns.
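A rough sketch of that idea in base R, assuming the value columns are named PB, PE and PS and reusing the ave/aggregate pattern above:
ratio_cols <- c("PB", "PE", "PS")   # columns to bucket (assumed names)
# within-date quartile bucket (1-4) for one column
bucket_by_date <- function(values, dates) {
  ave(values, dates, FUN = function(i)
    cut(i, breaks = quantile(i, c(0, .25, .5, .75, 1)),
        include.lowest = TRUE, labels = FALSE))
}
results <- lapply(ratio_cols, function(col) {
  b <- bucket_by_date(df[[col]], df$Date)
  out <- aggregate(df$fwd_rtn, list(quartile = b, date = df$Date), FUN = mean)
  out$variable <- col
  out
})
do.call(rbind, results)   # mean fwd_rtn per quartile, per date, per ratio column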

How to efficiently use Rprof in R?

I would like to know if it is possible to get a profile of R code in a way that is similar to MATLAB's Profiler, that is, to find out which line numbers are the especially slow ones.
What I have achieved so far is somehow not satisfactory. I used Rprof to make a profile file. Using summaryRprof I get something like the following:
$by.self
self.time self.pct total.time total.pct
[.data.frame 0.72 10.1 1.84 25.8
inherits 0.50 7.0 1.10 15.4
data.frame 0.48 6.7 4.86 68.3
unique.default 0.44 6.2 0.48 6.7
deparse 0.36 5.1 1.18 16.6
rbind 0.30 4.2 2.22 31.2
match 0.28 3.9 1.38 19.4
[<-.factor 0.28 3.9 0.56 7.9
levels 0.26 3.7 0.34 4.8
NextMethod 0.22 3.1 0.82 11.5
...
and
$by.total
total.time total.pct self.time self.pct
data.frame 4.86 68.3 0.48 6.7
rbind 2.22 31.2 0.30 4.2
do.call 2.22 31.2 0.00 0.0
[ 1.98 27.8 0.16 2.2
[.data.frame 1.84 25.8 0.72 10.1
match 1.38 19.4 0.28 3.9
%in% 1.26 17.7 0.14 2.0
is.factor 1.20 16.9 0.10 1.4
deparse 1.18 16.6 0.36 5.1
...
To be honest, from this output I don't get where my bottlenecks are because (a) I use data.frame pretty often and (b) I never use e.g., deparse. Furthermore, what is [?
So I tried Hadley Wickham's profr, but it was not any more useful considering the following graph:
Is there a more convenient way to see which line numbers and particular function calls are slow?
Or, is there some literature that I should consult?
Any hints appreciated.
EDIT 1:
Based on Hadley's comment I will paste the code of my script below, along with the base-graphics version of the plot. But note that my question is not related to this specific script. It is just a random script that I recently wrote. I am looking for a general way to find bottlenecks and speed up R code.
The data (x) looks like this:
type word response N Classification classN
Abstract ANGER bitter 1 3a 3a
Abstract ANGER control 1 1a 1a
Abstract ANGER father 1 3a 3a
Abstract ANGER flushed 1 3a 3a
Abstract ANGER fury 1 1c 1c
Abstract ANGER hat 1 3a 3a
Abstract ANGER help 1 3a 3a
Abstract ANGER mad 13 3a 3a
Abstract ANGER management 2 1a 1a
... until row 1700
The script (with short explanations) is this:
Rprof("profile1.out")
# A new dataset is produced with each line of x contained x$N times
y <- vector('list',length(x[,1]))
for (i in 1:length(x[,1])) {
y[[i]] <- data.frame(rep(x[i,1],x[i,"N"]),rep(x[i,2],x[i,"N"]),rep(x[i,3],x[i,"N"]),rep(x[i,4],x[i,"N"]),rep(x[i,5],x[i,"N"]),rep(x[i,6],x[i,"N"]))
}
all <- do.call('rbind',y)
colnames(all) <- colnames(x)
# create a dataframe out of a word x class table
table_all <- table(all$word,all$classN)
dataf.all <- as.data.frame(table_all[,1:length(table_all[1,])])
dataf.all$words <- as.factor(rownames(dataf.all))
dataf.all$type <- "no"
# get type of the word.
words <- levels(dataf.all$words)
for (i in 1:length(words)) {
dataf.all$type[i] <- as.character(all[pmatch(words[i],all$word),"type"])
}
dataf.all$type <- as.factor(dataf.all$type)
dataf.all$typeN <- as.numeric(dataf.all$type)
# aggregate response categories
dataf.all$c1 <- apply(dataf.all[,c("1a","1b","1c","1d","1e","1f")],1,sum)
dataf.all$c2 <- apply(dataf.all[,c("2a","2b","2c")],1,sum)
dataf.all$c3 <- apply(dataf.all[,c("3a","3b")],1,sum)
Rprof(NULL)
library(profr)
ggplot.profr(parse_rprof("profile1.out"))
Final data looks like this:
1a 1b 1c 1d 1e 1f 2a 2b 2c 3a 3b pa words type typeN c1 c2 c3 pa
3 0 8 0 0 0 0 0 0 24 0 0 ANGER Abstract 1 11 0 24 0
6 0 4 0 1 0 0 11 0 13 0 0 ANXIETY Abstract 1 11 11 13 0
2 11 1 0 0 0 0 4 0 17 0 0 ATTITUDE Abstract 1 14 4 17 0
9 18 0 0 0 0 0 0 0 0 8 0 BARREL Concrete 2 27 0 8 0
0 1 18 0 0 0 0 4 0 12 0 0 BELIEF Abstract 1 19 4 12 0
The base graph plot:
Running the script today also changed the ggplot2 graph a little (basically only the labels), see here.
Alert readers of yesterday's breaking news (R 3.0.0 is finally out) may have noticed something interesting that is directly relevant to this question:
Profiling via Rprof() now optionally records information at the statement level, not just the function level.
And indeed, this new feature answers my question and I will show how.
Let's say we want to compare whether vectorizing and pre-allocating are really better than good old for loops and incrementally building up data when calculating a summary statistic such as the mean. The (admittedly stupid) code is the following:
# create big data frame:
n <- 1000
x <- data.frame(group = sample(letters[1:4], n, replace=TRUE), condition = sample(LETTERS[1:10], n, replace = TRUE), data = rnorm(n))
# reasonable operations:
marginal.means.1 <- aggregate(data ~ group + condition, data = x, FUN=mean)
# unreasonable operations:
marginal.means.2 <- marginal.means.1[NULL,]
row.counter <- 1
for (condition in levels(x$condition)) {
  for (group in levels(x$group)) {
    tmp.value <- 0
    tmp.length <- 0
    for (c in 1:nrow(x)) {
      if ((x[c,"group"] == group) & (x[c,"condition"] == condition)) {
        tmp.value <- tmp.value + x[c,"data"]
        tmp.length <- tmp.length + 1
      }
    }
    marginal.means.2[row.counter,"group"] <- group
    marginal.means.2[row.counter,"condition"] <- condition
    marginal.means.2[row.counter,"data"] <- tmp.value / tmp.length
    row.counter <- row.counter + 1
  }
}
# does it produce the same results?
all.equal(marginal.means.1, marginal.means.2)
To use this code with Rprof, we need to parse it. That is, it needs to be saved in a file and then called from there. Hence, I uploaded it to pastebin, but it works exactly the same with local files.
Now, we
simply create a profile file and indicate that we want to save the line number,
source the code with the incredible combination eval(parse(..., keep.source = TRUE)) (seemingly the infamous fortune(106) does not apply here, as I haven't found another way)
stop the profiling and indicate that we want the output based on the line numbers.
The code is:
Rprof("profile1.out", line.profiling=TRUE)
eval(parse(file = "http://pastebin.com/download.php?i=KjdkSVZq", keep.source=TRUE))
Rprof(NULL)
summaryRprof("profile1.out", lines = "show")
Which gives:
$by.self
self.time self.pct total.time total.pct
download.php?i=KjdkSVZq#17 8.04 64.11 8.04 64.11
<no location> 4.38 34.93 4.38 34.93
download.php?i=KjdkSVZq#16 0.06 0.48 0.06 0.48
download.php?i=KjdkSVZq#18 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#23 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#6 0.02 0.16 0.02 0.16
$by.total
total.time total.pct self.time self.pct
download.php?i=KjdkSVZq#17 8.04 64.11 8.04 64.11
<no location> 4.38 34.93 4.38 34.93
download.php?i=KjdkSVZq#16 0.06 0.48 0.06 0.48
download.php?i=KjdkSVZq#18 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#23 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#6 0.02 0.16 0.02 0.16
$by.line
self.time self.pct total.time total.pct
<no location> 4.38 34.93 4.38 34.93
download.php?i=KjdkSVZq#6 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#16 0.06 0.48 0.06 0.48
download.php?i=KjdkSVZq#17 8.04 64.11 8.04 64.11
download.php?i=KjdkSVZq#18 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#23 0.02 0.16 0.02 0.16
$sample.interval
[1] 0.02
$sampling.time
[1] 12.54
Checking the source code tells us that the problematic line (#17) is indeed the stupid if-statement in the for loop, compared with basically no time for calculating the same thing using vectorized code (line #6).
I haven't tried it with any graphical output, but I am already very impressed by what I got so far.
Update: This function has been re-written to deal with line numbers. It's on github here.
I wrote this function to parse the file from Rprof and output a table of somewhat clearer results than summaryRprof. It displays the full stack of functions (and line numbers if line.profiling=TRUE), and their relative contribution to run time:
proftable <- function(file, lines=10) {
require(plyr)  # uses aaply, llply and ldply from plyr
interval <- as.numeric(strsplit(readLines(file, 1), "=")[[1L]][2L])/1e+06
profdata <- read.table(file, header=FALSE, sep=" ", comment.char = "",
colClasses="character", skip=1, fill=TRUE,
na.strings="")
filelines <- grep("#File", profdata[,1])
files <- aaply(as.matrix(profdata[filelines,]), 1, function(x) {
paste(na.omit(x), collapse = " ") })
profdata <- profdata[-filelines,]
total.time <- interval*nrow(profdata)
profdata <- as.matrix(profdata[,ncol(profdata):1])
profdata <- aaply(profdata, 1, function(x) {
c(x[(sum(is.na(x))+1):length(x)],
x[seq(from=1,by=1,length=sum(is.na(x)))])
})
stringtable <- table(apply(profdata, 1, paste, collapse=" "))
uniquerows <- strsplit(names(stringtable), " ")
uniquerows <- llply(uniquerows, function(x) replace(x, which(x=="NA"), NA))
dimnames(stringtable) <- NULL
stacktable <- ldply(uniquerows, function(x) x)
stringtable <- stringtable/sum(stringtable)*100
stacktable <- data.frame(PctTime=stringtable[], stacktable)
stacktable <- stacktable[order(stringtable, decreasing=TRUE),]
rownames(stacktable) <- NULL
stacktable <- head(stacktable, lines)
na.cols <- which(sapply(stacktable, function(x) all(is.na(x))))
stacktable <- stacktable[-na.cols]
parent.cols <- which(sapply(stacktable, function(x) length(unique(x)))==1)
parent.call <- paste0(paste(stacktable[1,parent.cols], collapse = " > ")," >")
stacktable <- stacktable[,-parent.cols]
calls <- aaply(as.matrix(stacktable[2:ncol(stacktable)]), 1, function(x) {
paste(na.omit(x), collapse= " > ")
})
stacktable <- data.frame(PctTime=stacktable$PctTime, Call=calls)
frac <- sum(stacktable$PctTime)
attr(stacktable, "total.time") <- total.time
attr(stacktable, "parent.call") <- parent.call
attr(stacktable, "files") <- files
attr(stacktable, "total.pct.time") <- frac
cat("\n")
print(stacktable, row.names=FALSE, right=FALSE, digits=3)
cat("\n")
cat(paste(files, collapse="\n"))
cat("\n")
cat(paste("\nParent Call:", parent.call))
cat(paste("\n\nTotal Time:", total.time, "seconds\n"))
cat(paste0("Percent of run time represented: ", format(frac, digits=3)), "%")
invisible(stacktable)
}
Running this on Henrik's example file, I get this:
> Rprof("profile1.out", line.profiling=TRUE)
> source("http://pastebin.com/download.php?i=KjdkSVZq")
> Rprof(NULL)
> proftable("profile1.out", lines=10)
PctTime Call
20.47 1#17 > [ > 1#17 > [.data.frame
9.73 1#17 > [ > 1#17 > [.data.frame > [ > [.factor
8.72 1#17 > [ > 1#17 > [.data.frame > [ > [.factor > NextMethod
8.39 == > Ops.factor
5.37 ==
5.03 == > Ops.factor > noNA.levels > levels
4.70 == > Ops.factor > NextMethod
4.03 1#17 > [ > 1#17 > [.data.frame > [ > [.factor > levels
4.03 1#17 > [ > 1#17 > [.data.frame > dim
3.36 1#17 > [ > 1#17 > [.data.frame > length
#File 1: http://pastebin.com/download.php?i=KjdkSVZq
Parent Call: source > withVisible > eval > eval >
Total Time: 5.96 seconds
Percent of run time represented: 73.8 %
Note that the "Parent Call" applies to all the stacks represented on the table. This makes is useful when your IDE or whatever calls your code wraps it in a bunch of functions.
I currently have R uninstalled here, but in SPlus you can interrupt the execution with the Escape key, and then do traceback(), which will show you the call stack. That should enable you to use this handy method.
Here are some reasons why tools built on the same concepts as gprof are not very good at locating performance problems.
A different solution comes from a different question: how to effectively use library(profr) in R:
For example:
install.packages("profr")
devtools::install_github("alexwhitworth/imputation")
x <- matrix(rnorm(1000), 100)
x[x>1] <- NA
library(imputation)
library(profr)
a <- profr(kNN_impute(x, k=5, q=2), interval= 0.005)
It doesn't seem (to me at least) like the plots are at all helpful here (e.g. plot(a)). But the data structure itself does seem to suggest a solution:
R> head(a, 10)
level g_id t_id f start end n leaf time source
9 1 1 1 kNN_impute 0.005 0.190 1 FALSE 0.185 imputation
10 2 1 1 var_tests 0.005 0.010 1 FALSE 0.005 <NA>
11 2 2 1 apply 0.010 0.190 1 FALSE 0.180 base
12 3 1 1 var.test 0.005 0.010 1 FALSE 0.005 stats
13 3 2 1 FUN 0.010 0.110 1 FALSE 0.100 <NA>
14 3 2 2 FUN 0.115 0.190 1 FALSE 0.075 <NA>
15 4 1 1 var.test.default 0.005 0.010 1 FALSE 0.005 <NA>
16 4 2 1 sapply 0.010 0.040 1 FALSE 0.030 base
17 4 3 1 dist_q.matrix 0.040 0.045 1 FALSE 0.005 imputation
18 4 4 1 sapply 0.045 0.075 1 FALSE 0.030 base
Single iteration solution:
That is, the data structure suggests using tapply to summarize the data. This can be done quite simply for a single run of profr::profr:
t <- tapply(a$time, paste(a$source, a$f, sep= "::"), sum)
t[order(t)] # time / function
R> round(t[order(t)] / sum(t), 4) # percentage of total time / function
base::! base::%in% base::| base::anyDuplicated
0.0015 0.0015 0.0015 0.0015
base::c base::deparse base::get base::match
0.0015 0.0015 0.0015 0.0015
base::mget base::min base::t methods::el
0.0015 0.0015 0.0015 0.0015
methods::getGeneric NA::.findMethodInTable NA::.getGeneric NA::.getGenericFromCache
0.0015 0.0015 0.0015 0.0015
NA::.getGenericFromCacheTable NA::.identC NA::.newSignature NA::.quickCoerceSelect
0.0015 0.0015 0.0015 0.0015
NA::.sigLabel NA::var.test.default NA::var_tests stats::var.test
0.0015 0.0015 0.0015 0.0015
base::paste methods::as<- NA::.findInheritedMethods NA::.getClassFromCache
0.0030 0.0030 0.0030 0.0030
NA::doTryCatch NA::tryCatchList NA::tryCatchOne base::crossprod
0.0030 0.0030 0.0030 0.0045
base::try base::tryCatch methods::getClassDef methods::possibleExtends
0.0045 0.0045 0.0045 0.0045
methods::loadMethod methods::is imputation::dist_q.matrix methods::validObject
0.0075 0.0090 0.0120 0.0136
NA::.findNextFromTable methods::addNextMethod NA::.nextMethod base::lapply
0.0166 0.0346 0.0361 0.0392
base::sapply imputation::impute_fn_knn methods::new imputation::kNN_impute
0.0392 0.0392 0.0437 0.0557
methods::callNextMethod kernlab::as.kernelMatrix base::apply kernlab::kernelMatrix
0.0572 0.0633 0.0663 0.0753
methods::initialize NA::FUN base::standardGeneric
0.0798 0.0994 0.1325
From this, I can see that the biggest time users are kernlab::kernelMatrix and the overhead from R for S4 classes and generics.
Preferred:
I note that, given the stochastic nature of the sampling process, I prefer to use averages to get a more robust picture of the time profile:
prof_list <- replicate(100, profr(kNN_impute(x, k=5, q=2),
interval= 0.005), simplify = FALSE)
fun_timing <- vector("list", length= 100)
for (i in 1:100) {
fun_timing[[i]] <- tapply(prof_list[[i]]$time, paste(prof_list[[i]]$source, prof_list[[i]]$f, sep= "::"), sum)
}
# Here is where the stochastic nature of the profiler complicates things.
# Because of randomness, each replication may have slightly different
# functions called during profiling
sapply(fun_timing, function(x) {length(names(x))})
# we can also see some clearly odd replications (at least in my attempt)
> sapply(fun_timing, sum)
[1] 2.820 5.605 2.325 2.895 3.195 2.695 2.495 2.315 2.005 2.475 4.110 2.705 2.180 2.760
[15] 3130.240 3.435 7.675 7.155 5.205 3.760 7.335 7.545 8.155 8.175 6.965 5.820 8.760 7.345
[29] 9.815 7.965 6.370 4.900 5.720 4.530 6.220 3.345 4.055 3.170 3.725 7.780 7.090 7.670
[43] 5.400 7.635 7.125 6.905 6.545 6.855 7.185 7.610 2.965 3.865 3.875 3.480 7.770 7.055
[57] 8.870 8.940 10.130 9.730 5.205 5.645 3.045 2.535 2.675 2.695 2.730 2.555 2.675 2.270
[71] 9.515 4.700 7.270 2.950 6.630 8.370 9.070 7.950 3.250 4.405 3.475 6.420 2948.265 3.470
[85] 3.320 3.640 2.855 3.315 2.560 2.355 2.300 2.685 2.855 2.540 2.480 2.570 3.345 2.145
[99] 2.620 3.650
Removing the unusual replications and converting to data.frames:
fun_timing <- fun_timing[-c(15,83)]
fun_timing2 <- lapply(fun_timing, function(x) {
  ret <- data.frame(fun = names(x), time = x)
  dimnames(ret)[[1]] <- 1:nrow(ret)
  return(ret)
})
Merge replications (almost certainly could be faster) and examine results:
# function for merging DF's in a list
merge_recursive <- function(list, ...) {
  n <- length(list)
  df <- data.frame(list[[1]])
  for (i in 2:n) {
    df <- merge(df, list[[i]], ... = ...)
  }
  return(df)
}
# merge
fun_time <- merge_recursive(fun_timing2, by= "fun", all= FALSE)
# do some munging
fun_time2 <- data.frame(fun=fun_time[,1], avg_time=apply(fun_time[,-1], 1, mean, na.rm=T))
fun_time2$avg_pct <- fun_time2$avg_time / sum(fun_time2$avg_time)
fun_time2 <- fun_time2[order(fun_time2$avg_time, decreasing=TRUE),]
# examine results
R> head(fun_time2, 15)
fun avg_time avg_pct
4 base::standardGeneric 0.6760714 0.14745123
20 NA::FUN 0.4666327 0.10177262
12 methods::initialize 0.4488776 0.09790023
9 kernlab::kernelMatrix 0.3522449 0.07682464
8 kernlab::as.kernelMatrix 0.3215816 0.07013698
11 methods::callNextMethod 0.2986224 0.06512958
1 base::apply 0.2893367 0.06310437
7 imputation::kNN_impute 0.2433163 0.05306731
14 methods::new 0.2309184 0.05036331
10 methods::addNextMethod 0.2012245 0.04388708
3 base::sapply 0.1875000 0.04089377
2 base::lapply 0.1865306 0.04068234
6 imputation::impute_fn_knn 0.1827551 0.03985890
19 NA::.nextMethod 0.1790816 0.03905772
18 NA::.findNextFromTable 0.1003571 0.02188790
Results
From the results, a similar but more robust picture emerges as in the single-run case. Namely, there is a lot of overhead from R, and library(kernlab) is slowing me down. Of note, since kernlab is implemented in S4, the overhead in R is related: S4 classes are substantially slower than S3 classes.
I'd also note that, in my personal opinion, a cleaned-up version of this might be a useful pull request as a summary method for profr, although I'd be interested to see others' suggestions!
