Going from a for loop to a function in R

I'm curious how I could convert a for loop that I've written into a function in R. I have no experience with writing my own functions in R. I looked here and here, but these did not seem to offer much help. I am aware that for loops are not strictly necessary; overall, I'm trying to do something similar to this blog post.
The for loop with reproducible data is here:
library(caTools)  # provides combs()
P <- c(1:50)
y <- length(P)
D <- as.data.frame(combs(P, 2))
Z <- choose(y, 2)
Num <- NULL
Denom <- NULL
Diff <- NULL
for (n in 1:Z) {
  Num[n] <- abs(D$V1[n] - D$V2[n])
  Denom[n] <- max(D$V1[n], D$V2[n])
  Diff[n] <- Num[n] / Denom[n]
}
PV <- mean(Diff)
PV
But, I'm interested in calculating PV based on levels such as in this data:
DATA <- c(1:500)
NAME <- c("a", "b", "c", "d", "e")
mydf <- as.data.frame(cbind(DATA, NAME))
Therefore, the final code I would like to use would be:
ANSWER <- tapply(mydf$DATA, mydf$NAME, MY.FUNCTION)
So, if I could turn the above for loop into a working function I could run the tapply function to get PV based on levels.
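For reference, a literal wrapping of the loop above into a function could look like the sketch below (this is just the question's loop given a function signature so it can be passed to tapply; the answer that follows shows a cleaner, vectorized version):
library(caTools)   # for combs()
MY.FUNCTION <- function(P) {
  D <- as.data.frame(combs(P, 2))
  Z <- nrow(D)                 # same as choose(length(P), 2)
  Diff <- numeric(Z)
  for (n in 1:Z) {
    Num <- abs(D$V1[n] - D$V2[n])
    Denom <- max(D$V1[n], D$V2[n])
    Diff[n] <- Num / Denom
  }
  mean(Diff)                   # PV
}
MY.FUNCTION(1:50)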
Any help would be appreciated, as would any suggestions other than the approach I've offered.
Thanks!

Once you have your library loaded:
library(caTools)
Here's a function you can run on your data:
mymeandiff <- function(values){
  df <- as.data.frame(combs(values, 2))
  diff <- abs(df$V1 - df$V2) / pmax(df$V1, df$V2)
  mean(diff)
}
mymeandiff(1:50)
Then we can use dplyr to run it on each group (after correcting the data, since cbind() coerced DATA to a character/factor column):
mydf$DATA <- as.numeric(as.character(mydf$DATA))
library(dplyr)
mydf %>%
  group_by(NAME) %>%
  summarise(mymeandiff(DATA))
For apply, rather than dplyr:
tapply(mydf$DATA, mydf$NAME, FUN = mymeandiff)
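An equivalent base-R alternative (my sketch, not part of the original answer) splits the vector by group and applies the function to each piece:
sapply(split(mydf$DATA, mydf$NAME), mymeandiff)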
Let's time it:
microbenchmark::microbenchmark(tapply = tapply(mydf$DATA, mydf$NAME, FUN=mymeandiff),
dplyr = mydf %>% group_by(NAME) %>%
summarise(mymeandiff(DATA)))
Unit: milliseconds
expr min lq mean median uq max neval
tapply 60.36543 61.08658 63.81995 62.61182 66.13671 80.37819 100
dplyr 61.84766 62.53751 67.33161 63.61270 67.58688 287.78364 100
tapply is slightly faster.

Related

How do I subtract 2 columns in a .csv file in R?

How do I subtract 2 columns in a .csv file uploaded in R?
I have tried naming the new column using reading <- $started_time - $ended_time
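(As an aside, the $ operator needs the data frame's name in front of it. Assuming a hypothetical data frame read in as mydata with columns started_time and ended_time, the corrected line would look like this:)
# mydata, started_time and ended_time are assumed names, not from the question
mydata <- read.csv("mydata.csv")
mydata$reading <- mydata$started_time - mydata$ended_time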
Since you did not post any example data, I post an example based on the iris built-in dataset. You can simply use - to subtract vectors of the same length (if the lengths differ, the shorter vector will be recycled).
You can select a column from your dataset with the $ operator or with the [] operator:
data(iris)
#assigning the result to a new column
iris$subtraction <- iris$Sepal.Length-iris$Sepal.Width
iris$subtraction <- iris[,1]-iris[,2]
#assigning the result to a new variable
subtraction <- iris[,1]-iris[,2]
subtraction <- iris$Sepal.Length-iris$Sepal.Width
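The same subtraction can also be written with dplyr or data.table; these are the forms benchmarked in the EDIT below (a sketch using the same iris columns):
library(dplyr)
library(data.table)
# dplyr: mutate() returns a modified copy with the new column
iris_dplyr <- iris %>% mutate(subtraction = Sepal.Length - Sepal.Width)
# data.table: := adds the new column by reference
iris_dt <- as.data.table(iris)
iris_dt[, subtraction := Sepal.Length - Sepal.Width]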
EDIT
A microbenchmark of 3 equivalent solutions:
library(microbenchmark)
library(data.table)
library(dplyr)
library(ggplot2)
#prepare simulation ------------------------------------------------------------
#number of rows to be tested
nr <- seq(100000,10000000,100000)
#initialize a list to store results
time <- as.list(rep(NA,100))
#benchmark
for (i in 1:length(nr)) {
  set.seed(5)
  #create data
  df <- data.frame(x = rnorm(nr[i]), y = rnorm(nr[i]))
  dt <- data.table(x = rnorm(nr[i]), y = rnorm(nr[i]))
  #benchmark
  x <- print(microbenchmark(
    base = df$new.col <- df$x - df$y,
    DT = dt <- dt[, new.col := x - y],
    dplyr = df %>% mutate(new.col = x - y),
    times = 10
  ))
  #store results
  time[[i]] <- x[, c(1, 4)]
}
#discard the first 4 elements because they run in microseconds
bench <- do.call(rbind,time[5:100])
#add the number of rows as column
bench$nrow <- rep(nr[5:100],each=3)
ggplot(bench,aes(x=nrow,y=mean,group=expr,col=expr))+
geom_smooth(se=F)+
theme_minimal()+
xlab("# rows")+
ylab("time (milliseconds)")
As you can see, for this simple task the base and data.table solutions are equivalent, while the mutate solution is a bit slower. However, the entire simulation runs in about a minute, and the single operations take a few milliseconds.
(My PC has 16 GB of RAM and 12 cores.)
EDIT
After the OP asked about the date case, here is a small example with datetimes stored as the POSIXct class:
library(chron)
# last7days: a sequence of the last 7 days (assumed; needed by the merge below)
last7days <- Sys.Date() - 0:6
hm <- merge(0:23, seq(0, 45, by = 15))
datetime <- merge(last7days, chron(time = paste(hm$x, hm$y, 0, sep = ":")))
colnames(datetime) <- c('date', 'time')
# create datetime
dt <- as.POSIXct(paste(datetime$date, datetime$time))
df <- data.frame(x = sample(dt, 200000, replace = T), y = sample(dt, 200000, replace = T))
microbenchmark(df$x-df$y)
The operation runs in a few milliseconds, as expected:
Unit: milliseconds
expr min lq mean median uq max neval
df$x - df$y 1.459801 1.544301 2.755227 1.624501 1.845401 62.7416 100
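A side note (my addition, not in the original answer): subtracting two POSIXct columns returns a difftime, whose unit is chosen automatically; units<- lets you fix the unit explicitly if you need consistent numbers:
d <- df$x - df$y
class(d)            # "difftime"
units(d) <- "hours" # force a specific unit
head(as.numeric(d)) # plain numeric values, now in hours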

Why is method 2 faster than method 1?

My question: I was debugging some code at work, running it block by block, when I realized a small block was taking an unusually long time. I killed it and made a minor (but logically equivalent) tweak, and it ran almost instantly. I would like to understand why. The following code is in R; however, I imagine the answer may not be specific to R and may apply to most programming languages with a similar paradigm or method of compiling.
The code & information:
Using R version 3.6.1
Libraries loaded: dplyr, knitr, DataExplorer, glue, zoo
old_df is a data frame of 5,653,380 obs. of 91 variables.
field1 is a column of policy numbers with class "character". The values are not unique; each occurs many times.
date_col1 and date_col2 are columns with class "Date".
Method 1:
new_df <- old_df %>%
  group_by(field1) %>%
  mutate(checkfield = date_col1 - date_col2) %>%
  filter(checkfield < 0) %>%
  filter(row_number() == 1)
old_df$filter <- ifelse(old_df$field1 %in% new_df$field1, 1, 0)
Method 2:
new_df <- old_df %>%
  group_by(field1) %>%
  filter(date_col1 < date_col2) %>%
  filter(row_number() == 1)
old_df$filter <- ifelse(old_df$field1 %in% new_df$field1, 1, 0)
As you can probably see, the intended output of both methods is to add a flag, "1", in the column "filter" for policy numbers where date_col1 < date_col2. I did not write method 1, and my goal in writing method 2 was to change it as little as possible while also making it faster, so please avoid spending too much time talking about problems with method 1 that are not related to why it is unbearably slower than method 2. Feel free to mention such things, but I would like the crux to be why method 1 was taking 20, 30 minutes etc. For example, I believe in method 1, the first filter call could be above the group_by call. This might increase speed by an unnoticeable amount. I am not too concerned about this.
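(For concreteness, that reordering would look something like the sketch below; it is not part of either original method:)
# sketch only: mutate/filter before group_by, so the grouped step sees fewer rows
new_df <- old_df %>%
  mutate(checkfield = date_col1 - date_col2) %>%
  filter(checkfield < 0) %>%
  group_by(field1) %>%
  filter(row_number() == 1)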
My thoughts: Clearly method 2 might be a little faster because it avoids making the column "checkfield", but I don't think this is the issue, as I ran method 1 line by line and it appears to be the line filter(checkfield < 0) where things went awry. For testing, I defined two dates x and y and checked class(x - y), which returned "difftime". So in this filter call we are comparing a "difftime" to a "numeric". Perhaps this requires some type juggling to make the comparison, whereas method 2 compares a date object to a date object?
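(A quick illustration of that check, added here for clarity; the comparison does work, R just coerces the difftime before comparing:)
x <- as.Date("2020-01-05")
y <- as.Date("2020-01-10")
class(x - y)       # "difftime"
(x - y) < 0        # TRUE: the difftime is coerced for the comparison
as.numeric(x - y)  # -5 (days)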
Let me know what you think! I am very curious about this.
In my explorations so far, with a simplified example and a slightly smaller data set (only a million rows and a minimal subset of columns), the individual tests (test_cf for filtering on the checkfield variable, test_lt for filtering on the date comparison) take about the same time, and both take about the same time as creating the checkfield column. Doing both at once (comb, creating and comparing) takes about 2.5x longer; I'm not sure why.
Perhaps you can use this as a starting point for bisecting/benchmarking to find the culprit.
test elapsed relative
2 comb 5.557 2.860
1 make_cf 1.943 1.000
4 test_cf 2.122 1.092
3 test_lt 2.109 1.085
I used rbenchmark::benchmark() because I prefer the output format: microbenchmark::microbenchmark() might be more accurate (but I would be surprised if it made a big difference).
code
library(dplyr)
n <- 1e6 ## 5653380 in orig; reduce size for laziness
set.seed(101)
## sample random dates, following
## https://stackoverflow.com/questions/21502332/generating-random-dates
f <- function(n)
sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"),
replace=TRUE,
size=n)
dd <- tibble(
date_col1=f(n),
date_col2=f(n)
## set up checkfield so we can use it without creating it
) %>% mutate(cf=date_col1-date_col2)
Benchmark:
library(rbenchmark)
benchmark(
make_cf=dd %>% mutate(checkfield=date_col1-date_col2),
comb=dd %>% mutate(checkfield=date_col1-date_col2) %>% filter(checkfield<0),
test_lt=dd %>% filter(date_col1<date_col2),
test_cf=dd %>% filter(cf<0),
columns=c("test","elapsed","relative")
)
I believe most of the increased time is due to creating the new column. As you can see, M1 and M3 have similar times. Of course, that difference of ~2 milliseconds between M1 and M3 will multiply based on the data size.
library(tidyverse)
library(microbenchmark)
set.seed(42)
n = 1e5
d = seq.Date(Sys.Date() - 10000, Sys.Date(), 1)
x = sample(d, n, TRUE)
y = sample(d, n, TRUE)
df1 = data.frame(x, y, id = sample(LETTERS, n, TRUE))
microbenchmark(M1 = {
  df1 %>%
    group_by(id) %>%
    mutate(chk = x - y) %>%
    filter(chk < 0) %>%
    filter(row_number() == 1)
},
M2 = {
  df1 %>%
    group_by(id) %>%
    filter(x < y) %>%
    filter(row_number() == 1)
},
M3 = {
  df1 %>%
    group_by(id) %>%
    mutate(chk = x - y) %>%
    filter(x < y) %>%
    filter(row_number() == 1)
})
#Unit: milliseconds
# expr min lq mean median uq max neval
# M1 13.130673 13.405151 15.088266 14.096772 15.56080 22.636533 100
# M2 5.931192 6.208457 6.564363 6.402879 6.71973 9.354252 100
# M3 11.360640 11.607993 12.449220 12.001383 12.57732 18.260131 100
As for the point about comparing difftime to numeric, there doesn't seem to be much of a difference:
library(microbenchmark)
set.seed(42)
n = 1e7
x = sample(d, n, TRUE)
y = sample(d, n, TRUE)
df1 = data.frame(x, y)
df1$difference = df1$x - df1$y
class(df1$difference)
#[1] "difftime"
microbenchmark(date_vs_date = {
  df1 %>% filter(x < y)
},
date_vs_numeric = {
  df1 %>% filter(difference < 0)
})
#Unit: milliseconds
# expr min lq mean median uq max neval
# date_vs_date 177.1789 222.4112 243.6617 233.7221 244.2765 430.4273 100
# date_vs_numeric 181.6222 217.1121 251.6127 232.7213 249.8218 455.8285 100

Why is dplyr so slow?

Like most people, I'm impressed by Hadley Wickham and what he's done for R, so I figured I'd move some functions toward his tidyverse. Having done so, I'm left wondering what the point of it all is.
My new dplyr functions are much slower than their base equivalents; I hope I'm doing something wrong. I'd particularly like some payoff for the effort required to understand non-standard evaluation.
So, what am I doing wrong? Why is dplyr so slow?
An example:
require(microbenchmark)
require(dplyr)
df <- tibble(
  a = 1:10,
  b = c(1:5, 4:0),
  c = 10:1)
addSpread_base <- function() {
  df[['spread']] <- df[['a']] - df[['b']]
  df
}
addSpread_dplyr <- function() df %>% mutate(spread := a - b)
all.equal(addSpread_base(), addSpread_dplyr())
microbenchmark(addSpread_base(), addSpread_dplyr(), times = 1e4)
Timing results:
Unit: microseconds
expr min lq mean median uq max neval
addSpread_base() 12.058 15.769 22.07805 24.58 26.435 2003.481 10000
addSpread_dplyr() 607.537 624.697 666.08964 631.19 636.291 41143.691 10000
So using dplyr functions to transform the data takes about 30x longer -- surely this isn't the intention?
I figured that perhaps this was too easy a case, and that dplyr would really shine with a more realistic case where we add a column and subset the data, but this was worse. As you can see from the timings below, it is ~70x slower than the base approach.
# mutate and substitute
addSpreadSub_base <- function(df, col1, col2) {
  df[['spread']] <- df[['a']] - df[['b']]
  df[, c(col1, col2, 'spread')]
}
addSpreadSub_dplyr <- function(df, col1, col2) {
  var1 <- as.name(col1)
  var2 <- as.name(col2)
  qq <- quo(!!var1 - !!var2)
  df %>%
    mutate(spread := !!qq) %>%
    select(!!var1, !!var2, spread)
}
all.equal(addSpreadSub_base(df, col1 = 'a', col2 = 'b'),
addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'))
microbenchmark(addSpreadSub_base(df, col1 = 'a', col2 = 'b'),
addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'),
times = 1e4)
Results:
Unit: microseconds
expr min lq mean median uq max neval
addSpreadSub_base(df, col1 = "a", col2 = "b") 22.725 30.610 44.3874 45.450 53.798 2024.35 10000
addSpreadSub_dplyr(df, col1 = "a", col2 = "b") 2748.757 2837.337 3011.1982 2859.598 2904.583 44207.81 10000
These are microseconds, and your dataset has 10 rows. Unless you plan on looping over millions of datasets of 10 rows, your benchmark is pretty much irrelevant (and in that case I can't imagine a situation where it wouldn't be wise to bind them together as a first step).
Let's do it with a bigger dataset, like 1 million times bigger:
df <- tibble(
  a = 1:10,
  b = c(1:5, 4:0),
  c = 10:1)
df2 <- bind_rows(replicate(1000000, df, simplify = FALSE))
addSpread_base <- function(df) {
  df[['spread']] <- df[['a']] - df[['b']]
  df
}
addSpread_dplyr <- function(df) df %>% mutate(spread = a - b)
microbenchmark::microbenchmark(
addSpread_base(df2),
addSpread_dplyr(df2),
times = 100)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# addSpread_base(df2) 25.85584 26.93562 37.77010 32.33633 35.67604 170.6507 100 a
# addSpread_dplyr(df2) 26.91690 27.57090 38.98758 33.39769 39.79501 182.2847 100 a
Still quite fast and not much difference.
As for the "why" of the result you got: you're calling a much more complex function, so it has overhead.
Commenters have pointed out that dplyr doesn't try too hard to be fast, and maybe that's true when you compare it to data.table, and that the interface is the first concern, but the authors have been working hard on speed as well. Hybrid evaluation, for example, allows (if I understand it right) C code to be executed directly on grouped data when aggregating with common functions, which can be much faster than base code, but simple code will always run faster with simple functions.

plyr outperforms dplyr and data.table - What's wrong?

I have to apply a function to every row of a large table (~2M rows). I used to use plyr for that, but the table is growing continuously and the current solution is starting to approach unacceptable runtimes. I thought I could just switch to data.table or dplyr and all would be fine, but that's not the case.
Here's an example:
library(data.table)
library(plyr)
library(dplyr)
dt = data.table("ID_1" = c(1:1000),                    # unique ID
                "ID_2" = ceiling(runif(1000, 0, 100)), # other ID, duplicates possible
                "group" = sample(LETTERS[1:10], 1000, replace = T),
                "value" = runif(1000),
                "ballast1" = "X",                      # keeps unchanged in derive
                "ballast2" = "Y",                      # keeps unchanged in derive
                "ballast3" = "Z",                      # keeps unchanged in derive
                "value_derived" = 0)
setkey(dt, ID_1)
extra_arg = c("A", "F", "G", "H")
ID_1 is guaranteed to contain no duplicates. Now I define a function to apply to every row/ID_1:
derive = function(tmprow, extra_arg){
  if(tmprow$group %in% extra_arg){return(NULL)} # exclude entries occurring in extra_arg
  group_index = which(LETTERS == tmprow$group)
  group_index = ((group_index + sample(1:26, 1)) %% 25) + 1
  new_group = LETTERS[group_index]
  if(new_group %in% unique(dt$group)){return(NULL)}
  new_value = runif(1)
  row_derived = tmprow
  row_derived$group = new_group
  row_derived$value = runif(1)
  row_derived$value_derived = 1
  return(row_derived)
}
This one doesn't do anything useful (the actual one does). The point is that the function takes one row and computes a new row of the same format.
Now the comparison:
set.seed(42)
system.time(result_dt <- dt[, derive(.SD, extra_arg), by = ID_1])
set.seed(42)
system.time(result_dplyr <- dt %>% group_by(ID_1) %>% do(derive(., extra_arg)))
set.seed(42)
system.time(results_plyr <- x <- ddply(dt, .variable = "ID_1", .fun = derive, extra_arg))
plyr is about 8x faster than both data.table and dplyr. Obviously I'm doing something wrong here, but what?
EDIT
Thanks to eddi's answer I could reduce the runtimes for data.table and dplyr to ~0.6 and ~0.8 of the plyr version, respectively. I initialized row_derived as a data.frame: row_derived = as.data.frame(tmprow). That's cool, but I still expected a higher performance increase from these packages... any further suggestions?
The issue is that the assignment you use has a very high overhead in data.table, while plyr converts the row to a data.frame before passing it to your derive function and thus avoids that overhead:
library(microbenchmark)
df = as.data.frame(dt)
microbenchmark({dt$group = dt$group}, {df$group = df$group})
#Unit: microseconds
# expr min lq mean median uq max neval
# { dt$group = dt$group } 1895.865 2667.499 3092.38903 3080.3620 3389.049 4984.406 100
# { df$group = df$group } 26.045 45.244 64.13909 61.6045 79.635 157.266 100
I can't suggest a good fix, since you say your example is not the real problem, so there is no point in solving it better. Some basic suggestions to look at are vectorizing the code, and using := or set instead (depending on what exactly you end up doing).
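To make those last suggestions concrete, here is a minimal sketch (mine, not eddi's) of what modifying by reference looks like; the column names mirror the question's example:
library(data.table)
dt <- data.table(group = sample(LETTERS[1:10], 1000, replace = TRUE),
                 value = runif(1000),
                 value_derived = 0)
extra_arg <- c("A", "F", "G", "H")
# := updates a whole column by reference; no copy of dt is made
dt[, value_derived := as.integer(group %in% extra_arg)]
# set() does the same element-wise, which is useful inside loops
set(dt, i = 1L, j = "value", value = runif(1))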

R data.table sliding window

What is the best (fastest) way to implement a sliding window function with the data.table package?
I'm trying to calculate a rolling median but have multiple rows per date (due to 2 additional factors), which I think means that the zoo rollapply function wouldn't work. Here is an example using a naive for loop:
library(data.table)
df <- data.frame(
  id = 30000,
  date = rep(as.IDate(as.IDate("2012-01-01") + 0:29, origin = "1970-01-01"), each = 1000),
  factor1 = rep(1:5, each = 200),
  factor2 = 1:5,
  value = rnorm(30, 100, 10)
)
dt = data.table(df)
setkeyv(dt, c("date", "factor1", "factor2"))
get_window <- function(date, factor1, factor2) {
  criteria <- data.table(
    date = as.IDate((date - 7):(date - 1), origin = "1970-01-01"),
    factor1 = as.integer(factor1),
    factor2 = as.integer(factor2)
  )
  return(dt[criteria][, value])
}
output <- data.table(unique(dt[, list(date, factor1, factor2)]))[, window_median := as.numeric(NA)]
for(i in nrow(output):1) {
  print(i)
  output[i, window_median := median(get_window(date, factor1, factor2))]
}
data.table doesn't have any special features for rolling windows, currently. There is further detail in my answer to another similar question:
Is there a fast way to run a rolling regression inside data.table?
Rolling median is interesting. It would need a specialized function to do it efficiently (same link as in the earlier comment):
Rolling median algorithm in C
The data.table solutions in the question and answers here are all very inefficient, relative to a proper specialized rollingmedian function (which isn't available for R afaik).
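As an aside (not from the original answer): for a single numeric vector, base R does provide a fast running median in stats::runmed(); the difficulty in this question is the grouping and the multiple rows per date rather than the median itself. A minimal sketch:
set.seed(1)
x <- rnorm(1e5)
rm7 <- runmed(x, k = 7)   # centered running median over a window of 7
head(rm7)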
I managed to get the example down to 1.4s by creating a lagged dataset and doing a huge join.
df <- data.frame(
  id = 30000,
  date = rep(as.IDate(as.IDate("2012-01-01") + 0:29, origin = "1970-01-01"), each = 1000),
  factor1 = rep(1:5, each = 200),
  factor2 = 1:5,
  value = rnorm(30, 100, 10)
)
dt <- data.table(df)
setkeyv(dt, c("date", "factor1", "factor2"))
unique_set <- data.table(unique(dt[, list(original_date = date, factor1, factor2)]))
output2 <- data.table()
for(i in 1:7) {
  output2 <- rbind(output2, unique_set[, date := original_date - i])
}
setkeyv(output2, c("date", "factor1", "factor2"))
output2 <- output2[dt]
output2 <- output2[, median(value), by = c("original_date", "factor1", "factor2")]
That works pretty well on this test dataset, but on my real one it fails with 8 GB of RAM. I'm going to try moving up to one of the High Memory EC2 instances (with 17, 34 or 68 GB of RAM) to get it working. Any ideas on how to do this in a less memory-intensive way would be appreciated.
This solution works but it takes a while.
df <- data.frame(
  id = 30000,
  date = rep(seq.Date(from = as.Date("2012-01-01"), to = as.Date("2012-01-30"), by = "d"), each = 1000),
  factor1 = rep(1:5, each = 200),
  factor2 = 1:5,
  value = rnorm(30, 100, 10)
)
myFun <- function(dff, df){
  median(df$value[df$date > as.Date(dff[2]) - 8 & df$date < as.Date(dff[2]) - 1 &
                    df$factor1 == dff[3] & df$factor2 == dff[4]])
}
week_Med <- apply(df, 1, myFun, df = df)
week_Med_df <- cbind(df, week_Med)
I address this in a related thread: https://stackoverflow.com/a/62399700/7115566
I suggest looking into the frollapply function. For instance, see below:
library(data.table)
set.seed(17)
dt <- data.table(i = 1:100,
                 x = sample(1:10, 100, replace = T),
                 y = sample(1:10, 100, replace = T))
dt$index <- dt$x == dt$y
dt[, MA := frollapply(index, 10, mean)]
head(dt, 12)
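Since the original question asks for a rolling median, the same pattern works with median in place of mean (a sketch on the toy dt above):
# rolling median of x over the previous 10 rows; the first 9 entries are NA
dt[, roll_med := frollapply(x, 10, median)]
head(dt, 12)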
