The proper way to vectorise an if-else tower in R

I came across the following post: Vectorized IF statement in R?, which deals with vectorising a single if-else construct in R. However, I do not want to build nested ifelse() calls; is there another way to do this?
As an example I have this:
library(data.table)
precalc_nrcolumns <- function(TERM_DATE, CHANGE_DATE){
  if(CHANGE_DATE < TERM_DATE){
    if(CHANGE_DATE == 0) {
      if(TERM_DATE > 100) {
        nrcolumns <- 100
      } else {
        nrcolumns <- TERM_DATE
      }
    } else {
      nrcolumns <- CHANGE_DATE
    }
  } else {
    nrcolumns <- 100
  }
  return(nrcolumns)
}
test.data <- data.table(TERM_DATE = sample(1:500, 100, replace = TRUE),
                        CHANGE_DATE = sample(1:500, 100, replace = TRUE))
test.data[, value := mapply(precalc_nrcolumns, TERM_DATE, CHANGE_DATE)]
I am fully aware of using mapply, which actually works, but I was wondering what other ways there are to deal with this.

A possible approach:
test.data[, val := pmin(CHANGE_DATE, TERM_DATE)][
  CHANGE_DATE == 0L, val := pmin(100L, TERM_DATE)][
  CHANGE_DATE >= TERM_DATE, val := 100L]
data:
set.seed(0L)
library(data.table)
nr <- 300
test.data <- data.table(TERM_DATE = sample(0:150, nr, replace = TRUE),
                        CHANGE_DATE = sample(0:150, nr, replace = TRUE))
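If your data.table is recent enough for fcase (>= 1.12.8), the same logic can also be written as a single ordered case statement; a sketch, with conditions checked in order so the CHANGE_DATE >= TERM_DATE branch takes priority:
test.data[, val2 := fcase(
  CHANGE_DATE >= TERM_DATE, 100L,
  CHANGE_DATE == 0L,        pmin(100L, TERM_DATE),
  rep(TRUE, .N),            CHANGE_DATE   # catch-all default branch
)]
And a quick sanity check against the question's mapply version (expected TRUE):
test.data[, value := mapply(precalc_nrcolumns, TERM_DATE, CHANGE_DATE)]
test.data[, all(value == val) && all(value == val2)]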

Related

R non overlapping sample - faster function

I have a table with more than one contract per client. I want to take a sample, but without allowing more than one contract per client within a 6-month window. I created one function (that uses another) that does the job, but it is too slow.
The callable function is:
non_overlapping_sample <- function (tbla, date_field, id_field, window_days) {
  base_evaluar = data.table(tbla)
  base_evaluar[, (date_field) := ymd(base_evaluar[[date_field]])]
  setkeyv(base_evaluar, date_field)
  setkeyv(base_evaluar, id_field)
  id_primero = sample(1:nrow(tbla), 1)
  base_muestra = data.frame(base_evaluar[id_primero, ])
  base_evaluar = remove_rows(base_evaluar, id_primero, date_field, id_field, window_days)
  while (nrow(base_evaluar) > 0) {
    id_a_sacar = sample(1:nrow(base_evaluar), 1)
    base_muestra = rbind(base_muestra, data.frame(base_evaluar[id_a_sacar, ]))
    base_evaluar = remove_rows(base_evaluar, id_a_sacar, date_field, id_field, window_days)
  }
  base_muestra = base_muestra[order(base_muestra[, id_field], base_muestra[, date_field]), ]
  return(base_muestra)
}
And the internal function is:
remove_rows <- function(tabla, indice_fila, date_field, id_field, window_days) {
  fecha = tabla[indice_fila, get(date_field)]
  element = tabla[indice_fila, get(id_field)]
  lim_sup = fecha + window_days
  lim_inf = fecha - window_days
  queda = tabla[tabla[[id_field]] != element | tabla[[date_field]] > lim_sup | tabla[[date_field]] < lim_inf]
  return(queda)
}
An example of how to use it:
set.seed(1)
library(lubridate)
sem = sample(seq.Date(ymd(20150101), ymd(20180101), 1), 3000, replace = TRUE)
base = data.frame(fc_fin_semana = sem, cd_cliente = round(runif(3000)*10, 0))
base = base[!duplicated(base), ]
non_overlapping_sample(base, date_field = 'fc_fin_semana', 'cd_cliente', 182)
Any ideas to make it work faster?
Thanks!
rbind is slow in loops. Try something like this:
non_overlapping_sample2 <- function(tbla, date_field, id_field, window_days) {
  dt <- data.table(tbla)
  dt[, (date_field) := ymd(dt[[date_field]])]
  setkeyv(dt, c(id_field, date_field))
  # create vectors for the while loop:
  rowIDS <- 1:nrow(dt)
  selected_rows <- NULL
  use <- rep(TRUE, nrow(dt))
  dates <- dt[[date_field]]
  ids <- dt[[id_field]]
  rowIDS2 <- rowIDS
  while (length(rowIDS2) > 0) {
    sid <- sample.int(length(rowIDS2), 1) # rowIDS2 can have length 1, so sample.int is safer than sample
    row_selected <- rowIDS2[sid] # selected row
    selected_rows <- c(selected_rows, row_selected)
    sel_date <- dates[row_selected] # selected date
    sel_ID <- ids[row_selected] # selected ID
    date_max <- sel_date + window_days
    date_min <- sel_date - window_days
    use[ids == sel_ID & dates <= date_max & dates >= date_min] <- FALSE
    rowIDS2 <- rowIDS[use] # subset for next sample
  }
  result <- dt[selected_rows, ] # subset of dt
  setorderv(result, c(id_field, date_field))
  return(result)
}
Inside the loop we do not need data.table/data.frame subsets; we operate only on vectors. The subsetting can be done once at the end.
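A rough timing sketch, reusing base from the question (both functions draw random samples, so compare run times rather than the selected rows):
system.time(non_overlapping_sample(base, date_field = 'fc_fin_semana', 'cd_cliente', 182))
system.time(non_overlapping_sample2(base, date_field = 'fc_fin_semana', 'cd_cliente', 182))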

data.table: parallel execution of row-wise function

I want to apply a function to some columns in every row of a data.table. I do this using something like this:
require(data.table)
## create some random data
n = 1000
p = 1000
set.seed(1)
data.raw <- matrix(rnorm(n * p), nrow = n, ncol = p)
rownames(data.raw) <- sapply(1:n, function(x) paste(sample(c(letters, LETTERS), 10, replace = TRUE), collapse = ""))
colnames(data.raw) <- samples <- paste0("X", 1:p)  # one name per column
data.t <- data.table(data.raw)
data.t[, id := rownames(data.raw)]
setkey(data.t, id)
# apply the function for each row (each id is a one-row group)
f <- function(x){ return(data.frame(result1 = "abc", result2 = "def")) }
data.t[, c("result1", "result2") := f(.SD), .SDcols = samples, by = id]
Is there any (easy) way to parallelize the execution of f for every id in the data.table?
I know that there are some questions here about parallelization of data.table, but I couldn't find a good answer in any of them.
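A hedged sketch of one possible approach (not from the thread; it assumes f is side-effect free and a Unix-alike, since parallel::mclapply forks, and the chunk count of 4 is illustrative). The per-group results are computed per chunk and joined back afterwards, because := inside a forked child cannot modify the parent's data.table:
library(parallel)
uid <- unique(data.t$id)
chunks <- split(uid, cut(seq_along(uid), 4, labels = FALSE))  # 4 id chunks
res <- rbindlist(mclapply(chunks, function(ids)
  data.t[.(ids), f(.SD), .SDcols = samples, by = id],
  mc.cores = 4))
data.t[res, c("result1", "result2") := .(i.result1, i.result2), on = "id"]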

Trying to optimize this code. Speed problems

This code gives me exactly what I want, but it gets really slow with larger datasets. Would greatly appreciate some insights on how I can do the same thing with more speed.
df = data.frame(v1 = runif(15000), v2 = runif(15000))
rolling.monthlies = lapply(df, function(x){
  p = sapply(1:length(x), function(i){
    m = rev(x[1:i])
    m = m[seq(1, length(m), 21)]
    m = rev(m)
  })
  return(p)
})
We can eliminate the two rev calls by indexing with seq directly. We can also use lapply in place of sapply, since no simplification is possible here anyway, which saves the attempt:
set.seed(123) # for reproducibility
df = data.frame(v1 = runif(15000), v2 = runif(15000)) # input
rolling.monthlies2 = lapply(df, function(x)
  # seq(i %% 21, i, 21) selects the same positions as the rev/subset/rev dance;
  # when i %% 21 == 0 the leading 0 index is silently dropped by x[...]
  lapply(seq_along(x), function(i) x[seq(i %% 21, i, 21)])
)
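A quick equivalence check on a small vector (a sketch; expected TRUE):
x <- runif(100)
a <- lapply(seq_along(x), function(i) rev(rev(x[1:i])[seq(1, i, 21)]))
b <- lapply(seq_along(x), function(i) x[seq(i %% 21, i, 21)])
identical(a, b)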

Using an R function on a column

I wish to use a function on a number of columns in a dataframe:
library(data.table)
id <- 1:1000
region <- rep(c("A","B","C","D","E"), each = 200)
treatment.1 <- sample(0:1, 1000, replace = TRUE)
treatment.2 <- sample(0:1, 1000, replace = TRUE)
d <- data.frame(id, region, treatment.1, treatment.2)
I wish to create a function which allows me to calculate the proportion of 1s by region (in different treatment groups). So far I have been using the following code:
setDT(d)[, .(.N, prop = sum(treatment.1 == 1)/.N), by = region]
However, when I try to turn the code into a function, I am having some problems (the answer does not match what I previously got without the function):
treatment.pc <- function (x) {
  setDT(d)[, .(.N, prop = sum(x == 1)/.N), by = region]
}
treatment.pc (d$treatment.1)
treatment.pc (d$treatment.2)
What do I need to do to the code to make it work?
The problem is that the x you pass in is the whole column, not a per-group subset, so sum(x == 1) counts matches across all rows while .N is the group size. Pass column names and let .SD supply the per-group data instead:
setDT(d)
fun <- function(x) mean(x == 1L)
d[, c(lapply(.SD, fun), N = .N), by = region, .SDcols = c("treatment.1", "treatment.2")]
It's unclear to me if you need to wrap the last line into a function ...
fun2 <- function(DT, fun, cols) {
  setDT(DT)
  DT[, c(lapply(.SD, fun), N = .N), by = region, .SDcols = cols]
}
fun2(d, fun, c("treatment.1", "treatment.2"))
This might be a simpler solution for your problem using dplyr.
library(dplyr)
id <- seq(1:1000)
region <- rep(c("A","B","C","D","E"),c(200,200,200,200,200))
treatment.1 <- sample(0:1, 1000, replace=T)
treatment.2 <- sample(0:1, 1000, replace=T)
d <- data.frame(id,region,treatment.1,treatment.2)
by_col <- d %>%
  group_by(region) %>%
  summarise(across(c(treatment.1, treatment.2), mean))  # across() needs dplyr >= 1.0
With only one line of code you get the result I think you want, and you don't have to write a function.

R data.table sliding window

What is the best (fastest) way to implement a sliding window function with the data.table package?
I'm trying to calculate a rolling median but have multiple rows per date (due to 2 additional factors), which I think means that the zoo rollapply function wouldn't work. Here is an example using a naive for loop:
library(data.table)
df <- data.frame(
  id = 30000,
  date = rep(as.IDate(as.IDate("2012-01-01") + 0:29, origin = "1970-01-01"), each = 1000),
  factor1 = rep(1:5, each = 200),
  factor2 = 1:5,
  value = rnorm(30, 100, 10)
)
dt = data.table(df)
setkeyv(dt, c("date", "factor1", "factor2"))
get_window <- function(date, factor1, factor2) {
  criteria <- data.table(
    date = as.IDate((date - 7):(date - 1), origin = "1970-01-01"),
    factor1 = as.integer(factor1),
    factor2 = as.integer(factor2)
  )
  return(dt[criteria][, value])
}
output <- data.table(unique(dt[, list(date, factor1, factor2)]))[, window_median := as.numeric(NA)]
for (i in nrow(output):1) {
  print(i)
  output[i, window_median := median(get_window(date, factor1, factor2))]
}
data.table doesn't have any special features for rolling windows, currently. Further detail is in my answer to another similar question:
Is there a fast way to run a rolling regression inside data.table?
Rolling median is interesting. It would need a specialized function to do efficiently (same link as in the earlier comment):
Rolling median algorithm in C
The data.table solutions in the question and answers here are all very inefficient, relative to a proper specialized rolling-median function (which isn't available for R, as far as I know).
I managed to get the example down to 1.4s by creating a lagged dataset and doing a huge join.
df <- data.frame(
  id = 30000,
  date = rep(as.IDate(as.IDate("2012-01-01") + 0:29, origin = "1970-01-01"), each = 1000),
  factor1 = rep(1:5, each = 200),
  factor2 = 1:5,
  value = rnorm(30, 100, 10)
)
dt <- data.table(df)  # named dt (not dt2), since all the code below refers to dt
setkeyv(dt, c("date", "factor1", "factor2"))
unique_set <- data.table(unique(dt[, list(original_date = date, factor1, factor2)]))
output2 <- data.table()
for (i in 1:7) {
  output2 <- rbind(output2, unique_set[, date := original_date - i])
}
setkeyv(output2, c("date", "factor1", "factor2"))
output2 <- output2[dt]
output2 <- output2[, median(value), by = c("original_date", "factor1", "factor2")]
That works pretty well on this test dataset, but on my real one it fails with 8GB of RAM. I'm going to try moving up to one of the high-memory EC2 instances (with 17, 34 or 68GB of RAM) to get it working. Any ideas on how to do this in a less memory-intensive way would be appreciated.
This solution works but it takes a while.
df <- data.frame(
  id = 30000,
  date = rep(seq.Date(from = as.Date("2012-01-01"), to = as.Date("2012-01-30"), by = "d"), each = 1000),
  factor1 = rep(1:5, each = 200),
  factor2 = 1:5,
  value = rnorm(30, 100, 10)
)
myFun <- function(dff, df){
  # median over the previous 7 days (date - 7 to date - 1) for the same factor combination;
  # apply() passes the row as character, hence the as.Date() conversions
  median(df$value[df$date >= as.Date(dff[2]) - 7 & df$date <= as.Date(dff[2]) - 1 &
                  df$factor1 == dff[3] & df$factor2 == dff[4]])
}
week_Med <- apply(df, 1, myFun, df = df)
week_Med_df <- cbind(df, week_Med)
I address this in a related thread: https://stackoverflow.com/a/62399700/7115566
I suggest looking into the frollapply function. For instance, see below
library(data.table)
set.seed(17)
dt <- data.table(i = 1:100,
                 x = sample(1:10, 100, replace = TRUE),
                 y = sample(1:10, 100, replace = TRUE))
dt[, index := x == y]
dt[, MA := frollapply(index, 10, mean)]  # 10-row rolling share of rows where x == y (NA until the window fills)
head(dt, 12)
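Since the rolling statistic here is just a mean, frollmean should give the same result faster (a sketch; the as.numeric is a precaution in case the logical column isn't accepted directly):
dt[, MA2 := frollmean(as.numeric(index), 10)]
all.equal(dt$MA, dt$MA2)  # expected TRUE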
